Energy-efficient mechanisms for managing on-chip storage in throughput processors

Bibliographic Details
Main Author: Gebhart, Mark Alan
Format: Others
Language: English
Published: 2012
Subjects:
Online Access: http://hdl.handle.net/2152/ETD-UT-2012-05-5141
Description
Summary: Modern computer systems are power or energy limited. While the number of transistors per chip continues to increase, classic Dennard voltage scaling has come to an end. Therefore, architects must improve a design's energy efficiency to continue to increase performance at historical rates, while staying within a system's power limit. Throughput processors, which use a large number of threads to tolerate memory latency, have emerged as an energy-efficient platform for achieving high performance on diverse workloads and are found in systems ranging from cell phones to supercomputers. This work focuses on graphics processing units (GPUs), which contain thousands of threads per chip. In this dissertation, I redesign the on-chip storage system of a modern GPU to improve energy efficiency. Modern GPUs contain very large register files that consume between 15% and 20% of the processor's dynamic energy. Most values written into the register file are read only once, often within a few instructions of being produced. To optimize for these patterns, we explore various designs for register file hierarchies. We study both a hardware-managed register file cache and a software-managed operand register file. We evaluate the energy tradeoffs in varying the number of levels and the capacity of each level in the hierarchy. Our most efficient design reduces register file energy by 54%. Beyond the register file, GPUs also contain on-chip scratchpad memories and caches. Traditional systems have a fixed partitioning between these three structures. Applications have diverse requirements, and often a single resource is most critical to performance. We propose to unify the register file, primary data cache, and scratchpad memory into a single structure that is dynamically partitioned on a per-kernel basis to match the application's needs.
The techniques proposed in this dissertation improve the utilization of on-chip memory, a scarce resource for systems with a large number of hardware threads. Making more efficient use of on-chip memory both improves performance and reduces energy. Future efficient systems will be achieved by the combination of several such techniques that improve energy efficiency.