Abstract:
Generic Concurrency Restriction (GCR) may divide a set of threads waiting to acquire a lock into two sets: an active set currently able to contend for the lock, and a passive set waiting for an opportunity to join the active set and contend for the lock. The number of threads in the active set may be limited to a predefined maximum or even a single thread. Generic Concurrency Restriction may be implemented as a wrapper around an existing lock implementation. Generic Concurrency Restriction may, in some embodiments, be unfair (e.g., to some threads) over the short term, but may improve the overall throughput of the underlying multithreaded application via passivation of a portion of the waiting threads.
Abstract:
An HTM-assisted Combining Framework (HCF) may enable multiple (combiner and non-combiner) threads to access a shared data structure concurrently using hardware transactional memory (HTM). As long as a combiner executes in a hardware transaction and ensures that the lock associated with the data structure is available, it may execute concurrently with other threads operating on the data structure. HCF may include attempting to apply operations to a concurrent data structure utilizing HTM and if the HTM attempt fails, utilizing flat combining within HTM transactions. Publication lists may be used to announce operations to be applied to a concurrent data structure. A combiner thread may select a subset of the operations in the publication list and attempt to apply the selected operations using HTM. If the thread fails in these HTM attempts, it may acquire a lock associated with the data structure and apply the selected operations without HTM.
Abstract:
Particular techniques for improving the scalability of concurrent programs (e.g., lock-based applications) may be effective in some environments and for some workloads, but not others. The systems described herein may automatically choose appropriate ones of these techniques to apply when executing lock-based applications at runtime, based on observations of the application in the current environment and with the current workload. In one example, two techniques for improving lock scalability (e.g., transactional lock elision using hardware transactional memory, and optimistic software techniques) may be integrated together. A lightweight runtime library built for this purpose may adapt its approach to managing concurrency by dynamically selecting one or more of these techniques (at different times) during execution of a given application. In this Adaptive Lock Elision approach, the techniques may be selected (based on pluggable policies) at runtime to achieve good performance on different platforms and for different workloads.
Abstract:
Particular techniques for improving the scalability of concurrent programs (e.g., lock-based applications) may be effective in some environments and for some workloads, but not others. The systems described herein may automatically choose appropriate ones of these techniques to apply when executing lock-based applications at runtime, based on observations of the application in the current environment and with the current workload. In one example, two techniques for improving lock scalability (e.g., transactional lock elision using hardware transactional memory, and optimistic software techniques) may be integrated together. A lightweight runtime library built for this purpose may adapt its approach to managing concurrency by dynamically selecting one or more of these techniques (at different times) during execution of a given application. In this Adaptive Lock Elision approach, the techniques may be selected (based on pluggable policies) at runtime to achieve good performance on different platforms and for different workloads.
Abstract:
An HTM-assisted Combining Framework (HCF) may enable multiple (combiner and non-combiner) threads to access a shared data structure concurrently using hardware transactional memory (HTM). As long as a combiner executes in a hardware transaction and ensures that the lock associated with the data structure is available, it may execute concurrently with other threads operating on the data structure. HCF may include attempting to apply operations to a concurrent data structure utilizing HTM and if the HTM attempt fails, utilizing flat combining within HTM transactions. Publication lists may be used to announce operations to be applied to a concurrent data structure. A combiner thread may select a subset of the operations in the publication list and attempt to apply the selected operations using HTM. If the thread fails in these HTM attempts, it may acquire a lock associated with the data structure and apply the selected operations without HTM.
Abstract:
A computer including multiple processors and memory implements a managed runtime providing a synchronization application programming interface (API) for threads that perform synchronized accesses to shared objects. A standardized header of objects includes a memory word storing an object identifier. To lock the object for synchronized access, the memory word may be converted to store the tail of a linked list of a first-in-first-out synchronization structures for threads waiting to acquire the lock, with the object identifier relocated to the list structure. The list structure may further include a stack of threads waiting on events related to the object, with the synchronization API additionally providing wait, notify and related synchronization operations. Upon determining that no threads hold or desire to hold the lock for the object and that no threads are waiting on events related to the object, the memory word may be restored to contain the object identifier.
Abstract:
Systems, computer instructions encoded on non-transitory computer-accessible storage media and computer-implemented methods are disclosed for improving the inference performance of machine learning systems using a divide-and-conquer technique. An application configured to perform inferences using a trained machine learning model may be evaluated to identify opportunities to execute the portions of the application in parallel. The application may then be divided into multiple independently executable tasks according to the identified opportunities. Weighting values for individual ones of the tasks may be assigned according to expected computational intensity values of the respective tasks. Then, computational resources may be distributed among the tasks according to the respective weighting values and the application executed using the distributed computational resources.
Abstract:
Transactional Lock Elision allows hardware transactions to execute unmodified critical sections protected by the same lock concurrently, by subscribing to the lock and verifying that it is available before committing the transaction. A “lazy subscription” optimization, which delays lock subscription, can potentially cause behavior that cannot occur when the critical sections are executed under the lock. Hardware extensions may provide mechanisms to ensure that lazy subscriptions are safe (e.g., that they result in correct behavior). Prior to executing a critical section transactionally, its lock and subscription code may be identified (e.g., by writing their locations to special registers). Prior to committing the transaction, the thread executing the critical section may verify that the correct lock was correctly subscribed to. If not, or if locations identified by the special registers have been modified, the transaction may be aborted. Nested critical sections associated with different lock types may invoke different subscription code.
Abstract:
Systems and methods are disclosed to improve disaster recovery by implementing a scalable low-loss disaster recovery for a data store. The disaster recovery system enables disaster recovery for a linearizable (e.g., externally consistent) distributed data store. The disaster recovery system also provides for a small lag on the backup site relative to the primary site, thereby reducing the data loss by providing a smaller data loss window compared to traditional disaster recovery techniques. The disaster recovery system implements a timestamp for log records based on a globally synchronized clock. The disaster recovery system also implements a watermark service that updates a global watermark timestamp that a backup node uses to apply log records.
Abstract:
A computer comprising one or more processors and memory may implement multiple threads performing mutually exclusive lock acquisition operations on disjoint ranges of a shared resource each using atomic compare and swap (CAS) operations. A linked list of currently locked ranges is maintained and, upon entry to a lock acquisition operation, a thread waits for all locked ranges overlapping the desired range to be released then inserts a descriptor for the desired range into the linked list using a single CAS operation. To release a locked range, a thread executes a single fetch and add (FAA) operation. The operation may be extended to support simultaneous exclusive and non-exclusive access by allowing overlapping ranges to be locked for non-exclusive access and by performing an additional validation after locking to provide conflict resolution should a conflict be detected.