Architectural support to exploit commutativity in shared-memory systems

Thesis: S.M., Massachusetts Institute of Technology, Department of Electrical Engineering and Computer Science, 2016. === Cataloged from PDF version of thesis. === Includes bibliographical references (pages 57-64). === Parallel systems are limited by the high costs of communication and synchronizati...

Full description

Bibliographic Details
Main Author: Zhang, Guowei, Ph. D. Massachusetts Institute of Technology
Other Authors: Daniel Sanchez.
Format: Others
Language:English
Published: Massachusetts Institute of Technology 2016
Subjects:
Online Access:http://hdl.handle.net/1721.1/106073
Description
Summary:Thesis: S.M., Massachusetts Institute of Technology, Department of Electrical Engineering and Computer Science, 2016. === Cataloged from PDF version of thesis. === Includes bibliographical references (pages 57-64). === Parallel systems are limited by the high costs of communication and synchronization. Exploiting commutativity has historically been a fruitful avenue to reduce traffic and serialization. This is because commutative operations produce the same final result regardless of the order they are performed in, and therefore can be processed concurrently and without communication. Unfortunately, software techniques that exploit commutativity, such as privatization and semantic locking, incur high runtime overheads. These overheads offset the benefit and thereby limit the applicability of software techniques. To avoid high overheads, it would be ideal to exploit commutativity in hardware. In fact, hardware already provides much of the functionality that is required to support commutativity For instance, private caches can buffer and coalesce multiple updates. However, current memory hierarchies can understand only reads and writes, which prevents hardware from recognizing and accelerating commutative operations. The key insight this thesis develops is that, with minor hardware modifications and minimal extra complexity, cache coherence protocols, the key component of communication and synchronization in shared-memory systems, can be extended to allow local and concurrent commutative operations. This thesis presents two techniques that leverage this insight to exploit commutativity in hardware. First, Coup provides architectural support for a limited number of single-instruction commutative updates, such as addition and bitwise logical operations. CouP allows multiple private caches to simultaneously hold update-only permission to the same cache line. Caches with update-only permission can locally buffer and coalesce updates to the line, but cannot satisfy read requests. Upon a read request, Coup reduces the partial updates buffered in private caches to produce the final value. Second, CoMMTM is a commutativity-aware hardware transactional memory (HTM) that supports an even broader range of multi-instruction, semantically commutative operations, such as set insertions and ordered puts. COMMTM extends the coherence protocol with a reducible state tagged with a user-defined label. Multiple caches can hold a given line in the reducible state with the same label, and transactions can implement arbitrary user-defined commutative operations through labeled loads and stores. These commutative operations proceed concurrently, without triggering conflicts or incurring any communication. A non-commutative operation (e.g., a conventional load or store) triggers a user-defined reduction that merges the different cache lines and may abort transactions with outstanding reducible updates. CouP and CoMMTM reduce communication and synchronization in many challenging parallel workloads. At 128 cores, CouP accelerates state-of-the-art implementations of update-heavy algorithms by up to 2.4x, and COMMTM outperforms a conventional eager-lazy HTM by up to 3.4x and reduces or eliminates wasted work due to transactional aborts. === by Guowei Zhang. === S.M.