Cimple: instruction and memory level parallelism: a DSL for uncovering ILP and MLP

Modern out-of-order processors have increased capacity to exploit instruction level parallelism (ILP) and memory level parallelism (MLP), e.g., by using wide superscalar pipelines and vector execution units, as well as deep buffers for inflight memory requests. These resources, however, often exhibit poor utilization rates on workloads with large working sets, e.g., in-memory databases, key-value stores, and graph analytics, as compilers and hardware struggle to expose ILP and MLP from the instruction stream automatically. In this paper, we introduce the IMLP (Instruction and Memory Level Parallelism) task programming model. IMLP tasks execute as coroutines that yield execution at annotated long-latency operations, e.g., memory accesses, divisions, or unpredictable branches. IMLP tasks are interleaved on a single thread, and integrate well with thread parallelism and vectorization. Our DSL embedded in C++, Cimple, allows exploration of task scheduling and transformations, such as buffering, vectorization, pipelining, and prefetching. We demonstrate state-of-the-art performance on core algorithms used in in-memory databases that operate on arrays, hash tables, trees, and skip lists. Cimple applications reach 2.5× throughput gains over hardware multithreading on a multi-core, and 6.4× single thread speedup.
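To make the interleaving described above concrete, below is a minimal hand-written C++ sketch of the pattern: a small group of probe "tasks" each issue a software prefetch for their bucket, yield, and finish on a later pass once the cache line has arrived. This is not Cimple source code or Cimple-generated code; the probe_batch name, the GROUP size of 8, the toy low-bits hash, and the GCC/Clang __builtin_prefetch intrinsic are all assumptions chosen purely for illustration.

#include <cstddef>
#include <cstdint>
#include <cstdio>
#include <vector>

struct ProbeTask {
    size_t idx = 0;       // index of the key in the input batch
    size_t slot = 0;      // bucket this task will read
    uint64_t key = 0;     // key being probed
    bool pending = false; // true once the prefetch has been issued
};

static void probe_batch(const std::vector<uint64_t>& table,
                        const std::vector<uint64_t>& keys,
                        std::vector<int>& found) {
    constexpr size_t GROUP = 8;             // probes kept in flight at once
    const size_t mask = table.size() - 1;   // assumes power-of-two table size
    ProbeTask tasks[GROUP];
    size_t next = 0;   // next key to start probing
    size_t done = 0;   // keys whose result has been written

    // Round-robin over the task slots: each visit either starts a probe
    // (prefetch, then "yield") or finishes one whose line should now be cached.
    while (done < keys.size()) {
        for (size_t t = 0; t < GROUP && done < keys.size(); ++t) {
            ProbeTask& task = tasks[t];
            if (!task.pending) {
                if (next >= keys.size()) continue;  // no more work to start
                task.idx = next;
                task.key = keys[next++];
                task.slot = static_cast<size_t>(task.key) & mask;  // toy hash: low bits
                __builtin_prefetch(&table[task.slot]);  // issue the likely miss
                task.pending = true;                    // ...and switch to another task
            } else {
                // Other tasks ran while this line was fetched; the load
                // below should now hit in cache.
                found[task.idx] = (table[task.slot] == task.key) ? 1 : 0;
                task.pending = false;
                ++done;
            }
        }
    }
}

int main() {
    std::vector<uint64_t> table(1 << 16);
    for (size_t i = 0; i < table.size(); ++i) table[i] = i;  // identity "hash table"
    std::vector<uint64_t> keys = {3, 70000, 42, 9};
    std::vector<int> found(keys.size());
    probe_batch(table, keys, found);
    for (size_t i = 0; i < keys.size(); ++i)
        std::printf("key %llu found=%d\n",
                    static_cast<unsigned long long>(keys[i]), found[i]);
    return 0;
}

Keeping several probes in flight lets their cache misses overlap, which is the memory level parallelism that the paper's coroutine-based transformations aim to expose automatically rather than by hand as above.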


Bibliographic Details
Main Authors: Kiriansky, Vladimir (Author), Xu, Haoran (Author), Rinard, Martin (Author), Amarasinghe, Saman (Author)
Other Authors: Massachusetts Institute of Technology. Computer Science and Artificial Intelligence Laboratory (Contributor)
Format: Article
Language: English
Published: Association for Computing Machinery, 2020-05-06T20:05:53Z.
Subjects:
Online Access: Get fulltext
LEADER 02202 am a22002173u 4500
001 125080
042 |a dc 
100 1 0 |a Kiriansky, Vladimir  |e author 
100 1 0 |a Massachusetts Institute of Technology. Computer Science and Artificial Intelligence Laboratory  |e contributor 
700 1 0 |a Xu, Haoran  |e author 
700 1 0 |a Rinard, Martin  |e author 
700 1 0 |a Amarasinghe, Saman  |e author 
245 0 0 |a Cimple: instruction and memory level parallelism: a DSL for uncovering ILP and MLP 
260 |b Association for Computing Machinery,   |c 2020-05-06T20:05:53Z. 
856 |z Get fulltext  |u https://hdl.handle.net/1721.1/125080 
520 |a Modern out-of-order processors have increased capacity to exploit instruction level parallelism (ILP) and memory level parallelism (MLP), e.g., by using wide superscalar pipelines and vector execution units, as well as deep buffers for inflight memory requests. These resources, however, often exhibit poor utilization rates on workloads with large working sets, e.g., in-memory databases, key-value stores, and graph analytics, as compilers and hardware struggle to expose ILP and MLP from the instruction stream automatically. In this paper, we introduce the IMLP (Instruction and Memory Level Parallelism) task programming model. IMLP tasks execute as coroutines that yield execution at annotated long-latency operations, e.g., memory accesses, divisions, or unpredictable branches. IMLP tasks are interleaved on a single thread, and integrate well with thread parallelism and vectorization. Our DSL embedded in C++, Cimple, allows exploration of task scheduling and transformations, such as buffering, vectorization, pipelining, and prefetching. We demonstrate state-of-the-art performance on core algorithms used in in-memory databases that operate on arrays, hash tables, trees, and skip lists. Cimple applications reach 2.5× throughput gains over hardware multithreading on a multi-core, and 6.4× single thread speedup. 
520 |a DOE (Grant DE-SC0014204) 
520 |a Toyota Research Institute (Grant LP-C000765-SR) 
546 |a en 
655 7 |a Article 
773 |t Proceedings of the 27th International Conference on Parallel Architectures and Compilation Techniques