Cooperative hardware/software caching for next-generation memory systems

The memory system remains a major performance bottleneck in modern and future architectures. In this dissertation, we propose a hardware/software cooperative approach and demonstrate its effectiveness. This approach combines the global yet imperfect view of the compiler with the timely yet narrow-sc...

Full description

Bibliographic Details
Main Author: Wang, Zhenlin
Language:ENG
Published: ScholarWorks@UMass Amherst 2004
Subjects:
Online Access:https://scholarworks.umass.edu/dissertations/AAI3118338
Description
Summary:The memory system remains a major performance bottleneck in modern and future architectures. In this dissertation, we propose a hardware/software cooperative approach and demonstrate its effectiveness. This approach combines the global yet imperfect view of the compiler with the timely yet narrow-scope context of the hardware. It relies on a light-weight extension to the instruction set architecture to convey compile-time knowledge (hints) to the hardware. The hardware then uses these hints to make better decisions. Our work shows that a cooperative hardware/software approach to (1) cache replacement, (2) prefetching, and (3) their combination eliminates or tolerates much of the memory performance bottleneck. (1) Our work enhances cache replacement decisions using compiler hints. The compiler detects which data will or will not be reused and annotates loads accordingly. The compiler sets one bit (the evict-me bit) to denote a preferred eviction candidate. On a miss, the cache replacement algorithm preferentially replaces a cache line with its evict-me bit set. Otherwise, it follows the LRU policy. The evict-me replacement scheme improves cache replacement decisions and is effective in both L1 and L2 caches. (2) We also use compiler hints to direct aggressive hardware region prefetching and content-aware pointer prefetching. The original SRP (scheduled region prefetching) engine queues prefetching requests on every outstanding L2 miss and tolerates latencies at the cost of dramatically increasing the memory traffic. GRP (guided region prefetching) enhances SRP by restricting prefetching to compiler-marked loads. Our compiler algorithms effectively mark spatial reuses across the SPEC CPU2000 benchmarks, and thus GRP achieves the performance of SRP with only one eighth of the additional traffic. (3) The evict-me cache replacement scheme helps alleviate the side effects of cache pollution introduced by useless region prefetches. The combination of evict-me caching and region prefetching further improves cache performance. These results demonstrate significant promise for overcoming the memory bottleneck with cooperative hardware/software techniques.