Scaling Distributed Cache Hierarchies through Computation and Data Co-Scheduling

Cache hierarchies are increasingly non-uniform, so for systems to scale efficiently, data must be close to the threads that use it. Moreover, cache capacity is limited and contended among threads, introducing complex capacity/latency tradeoffs. Prior NUCA schemes have focused on managing data to reduce access latency, but have ignored thread placement; and applying prior NUMA thread placement schemes to NUCA is inefficient, as capacity, not bandwidth, is the main constraint. We present CDCS, a technique to jointly place threads and data in multicores with distributed shared caches. We develop novel monitoring hardware that enables fine-grained space allocation on large caches, and data movement support to allow frequent full-chip reconfigurations. On a 64-core system, CDCS outperforms an S-NUCA LLC by 46% on average (up to 76%) in weighted speedup and saves 36% of system energy. CDCS also outperforms state-of-the-art NUCA schemes under different thread scheduling policies.


Bibliographic Details
Main Authors: Beckmann, Nathan Zachary (Contributor), Tsai, Po-An (Contributor), Sanchez, Daniel (Contributor)
Other Authors: Massachusetts Institute of Technology. Department of Electrical Engineering and Computer Science (Contributor)
Format: Article
Language: English
Published: Institute of Electrical and Electronics Engineers (IEEE), 2015-02-26T13:37:58Z.
Subjects:
Online Access: Get fulltext
LEADER 02187 am a22002533u 4500
001 95648
042 |a dc 
100 1 0 |a Beckmann, Nathan Zachary  |e author 
100 1 0 |a Massachusetts Institute of Technology. Department of Electrical Engineering and Computer Science  |e contributor 
100 1 0 |a Beckmann, Nathan Zachary  |e contributor 
100 1 0 |a Tsai, Po-An  |e contributor 
100 1 0 |a Sanchez, Daniel  |e contributor 
700 1 0 |a Tsai, Po-An  |e author 
700 1 0 |a Sanchez, Daniel  |e author 
245 0 0 |a Scaling Distributed Cache Hierarchies through Computation and Data Co-Scheduling 
260 |b Institute of Electrical and Electronics Engineers (IEEE),   |c 2015-02-26T13:37:58Z. 
856 |z Get fulltext  |u http://hdl.handle.net/1721.1/95648 
520 |a Cache hierarchies are increasingly non-uniform, so for systems to scale efficiently, data must be close to the threads that use it. Moreover, cache capacity is limited and contended among threads, introducing complex capacity/latency tradeoffs. Prior NUCA schemes have focused on managing data to reduce access latency, but have ignored thread placement; and applying prior NUMA thread placement schemes to NUCA is inefficient, as capacity, not bandwidth, is the main constraint. We present CDCS, a technique to jointly place threads and data in multicores with distributed shared caches. We develop novel monitoring hardware that enables fine-grained space allocation on large caches, and data movement support to allow frequent full-chip reconfigurations. On a 64-core system, CDCS outperforms an S-NUCA LLC by 46% on average (up to 76%) in weighted speedup and saves 36% of system energy. CDCS also outperforms state-of-the-art NUCA schemes under different thread scheduling policies. 
520 |a National Science Foundation (U.S.) (Grant CCF-1318384) 
520 |a Massachusetts Institute of Technology. Department of Electrical Engineering and Computer Science (Jacobs Presidential Fellowship) 
520 |a United States. Defense Advanced Research Projects Agency (PERFECT Contract HR0011-13-2-0005) 
546 |a en_US 
655 7 |a Article 
773 |t Proceedings of the 21st IEEE Symposium on High Performance Computer Architecture