Summary: Multimedia applications pose new challenges to computer architecture. Their tremendous communication demands severely burden the interconnect between functional units. This dissertation addresses how to efficiently transport operands among computational and storage components, providing architectural enhancements that enable high-bandwidth, low-latency communication.
This research analyzes standard multimedia benchmarks to characterize the operand communication patterns that occur during their execution. The empirical analysis indicates that most operands exhibit strong locality, enabling several optimizations of the transport mechanisms. In particular, an eight-entry local buffer guided by approximate operand-lifetime information is sufficient to suppress 81% of operand writes, and chaining selected pairs of functional units (FUs) based on producer-consumer information allows 50% of operand reads to be served through the shortest path.
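To make the buffer characterization concrete, the following Python sketch estimates how many register-file writes a small local buffer could absorb from an operand trace. The trace format, the 16-cycle lifetime bound, and the eviction policy are illustrative assumptions, not the dissertation's exact model.

```python
from collections import namedtuple

# Hypothetical trace record: each produced operand carries an estimated
# lifetime (cycles until its last consumer reads it) and a consumer count.
Operand = namedtuple("Operand", "produce_cycle lifetime consumers")

def simulate_local_buffer(trace, entries=8, lifetime_bound=16):
    """Count the fraction of operand writes a small local buffer could absorb.

    An operand bypasses the register file when (a) its estimated lifetime is
    short enough that it dies before eviction and (b) a buffer entry is free
    at its production cycle. The lifetime estimate stands in for the
    "approximate operand lifetime information" described in the text.
    """
    buffer = []            # cycles at which resident operands die
    suppressed = 0
    for op in trace:
        # Retire buffer entries whose operands are now dead.
        buffer = [death for death in buffer if death > op.produce_cycle]
        if op.lifetime <= lifetime_bound and len(buffer) < entries:
            buffer.append(op.produce_cycle + op.lifetime)
            suppressed += 1     # register-file write suppressed
    return suppressed / len(trace) if trace else 0.0

# Example: a toy trace of short-lived operands, one per cycle.
trace = [Operand(c, lifetime=(c % 8) + 1, consumers=1) for c in range(1000)]
print(f"writes suppressed: {simulate_local_buffer(trace):.0%}")
```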
These results guide the design of two efficient operand transport mechanisms: a traffic-driven bypass network and dynamic instruction clustering. The traffic-driven bypass network is designed through a novel, systematic customization process for wide-issue architectures, driven by a technology-model-based evaluation methodology, and yields a low-cost, high-performance bypass network for multimedia applications. The technique places microarchitectural components to exploit the observed communication patterns, reorganizes bypass paths according to traffic rates, and maps inter-instruction communication onto local paths. The reduction in transport latency, combined with a faster clock cycle, achieves an instruction throughput gain of 2.9x over a broadcast bypass network at 45 nm, and of 1.3x over a typical clustered architecture.
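A minimal sketch of the traffic-driven idea, under assumed inputs: given a profiled producer-to-consumer traffic matrix, dedicate the limited local bypass links to the heaviest pairs and let the rest fall back to the shared network. The greedy policy and the link budget are assumptions for illustration only.

```python
def allocate_bypass_links(traffic, num_local_links):
    """Greedy sketch of traffic-driven bypass customization.

    `traffic` maps (producer_fu, consumer_fu) pairs to observed operand
    transfer counts (e.g., from a benchmark profile). The heaviest pairs
    receive dedicated local bypass paths; the remainder use the shared
    (broadcast) network. Returns the chosen links and the fraction of
    traffic they cover.
    """
    ranked = sorted(traffic.items(), key=lambda kv: kv[1], reverse=True)
    local = dict(ranked[:num_local_links])
    covered = sum(local.values())
    total = sum(traffic.values())
    return local, covered / total if total else 0.0

# Example profile: ALU0 feeds ALU1 and the store unit far more often than
# any other pair, so those two links are made local.
profile = {("ALU0", "ALU1"): 540, ("ALU0", "ST"): 310,
           ("ALU1", "ALU0"): 120, ("LD", "ALU0"): 95, ("ALU1", "ST"): 40}
links, coverage = allocate_bypass_links(profile, num_local_links=2)
print(links, f"{coverage:.0%} of traffic on local paths")
```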
Dynamic instruction clustering groups dependent instructions into clusters at run time, analyzes their operand transport patterns, and maps the clustered instructions onto a cluster execution unit. Two execution unit implementations are explored: network ALUs and a dynamically scheduled SIMD processing-element (PE) array. In the network ALUs, intermediate values propagate directly among the ALUs rather than being distributed over global bypass buses; the resulting reduction in operand transport latency yields a 35% IPC speedup over a conventional ILP processor. The dynamically scheduled SIMD PE array supports data-level parallel (DLP) processing of the innermost loops in image processing applications; data-parallel operations combined with localized operand communication produce an IPC speedup of 2.59x over a 16-way, four-cluster microarchitecture.
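The clustering step can be illustrated with a small dependence-chain grouper: each instruction joins the cluster of one of its producers while room remains, so values consumed inside a cluster would stay on the cluster's local network instead of the global bypass buses. The instruction encoding, cluster size, and greedy policy below are assumptions, not the dissertation's exact algorithm.

```python
def cluster_dependent_instructions(instrs, max_cluster_size=4):
    """Greedy dependence-chain clustering sketch.

    `instrs` is an in-order list of (dest_reg, src_regs) tuples. An
    instruction joins the cluster of the producer of one of its sources
    if that cluster still has room; otherwise it starts a new cluster.
    """
    producer_cluster = {}      # register -> cluster id of its producer
    clusters = []              # cluster id -> list of instruction indices
    for idx, (dest, srcs) in enumerate(instrs):
        target = next((producer_cluster[s] for s in srcs
                       if s in producer_cluster
                       and len(clusters[producer_cluster[s]]) < max_cluster_size),
                      None)
        if target is None:
            target = len(clusters)
            clusters.append([])
        clusters[target].append(idx)
        producer_cluster[dest] = target
    return clusters

# Example: a short dependence chain maps to a single cluster.
code = [("r1", []), ("r2", ["r1"]), ("r3", ["r2"]), ("r4", ["r3"])]
print(cluster_dependent_instructions(code))   # -> [[0, 1, 2, 3]]
```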