Architectural enhancements for efficient operand transport in multimedia systems

Multimedia applications pose new challenges to computer architecture. Their tremendous communication demands severely burden the interconnect between functional units. This dissertation addresses how to transport operands efficiently among computational and storage components. It provides architectura...

Bibliographic Details
Main Author: Kim, Hongkyu
Published: Georgia Institute of Technology 2007
Subjects: Operand; Transport; Multimedia; Architecture
Online Access: http://hdl.handle.net/1853/14595
id ndltd-GATECH-oai-smartech.gatech.edu-1853-14595
record_format oai_dc
spelling ndltd-GATECH-oai-smartech.gatech.edu-1853-14595
2013-01-07T20:16:51Z
Architectural enhancements for efficient operand transport in multimedia systems
Kim, Hongkyu
Operand
Transport
Multimedia
Architecture
Georgia Institute of Technology
2007-05-25T17:36:56Z
2007-05-25T17:36:56Z
2007-01-08
Dissertation
http://hdl.handle.net/1853/14595
collection NDLTD
sources NDLTD
topic Operand
Transport
Multimedia
Architecture
description Multimedia applications pose new challenges to computer architecture. Their tremendous communication demands severely burden the interconnect between functional units. This dissertation addresses how to transport operands efficiently among computational and storage components. It provides architectural enhancements that enable high-bandwidth, low-latency communication. This research analyzes multimedia workloads to characterize the communication patterns that occur in the execution of standard multimedia benchmarks. The analysis indicates that most operands exhibit strong locality, enabling several optimizations of transport mechanisms. It shows that an eight-entry local buffer with approximate information on operand lifetime is sufficient to suppress 81% of operand writes. In addition, chaining selected pairs of functional units (FUs) based on producer-consumer information allows 50% of reads to be accessed through the shortest path. These results guide the design of two efficient operand transport mechanisms: a traffic-driven bypass network and dynamic instruction clustering. The traffic-driven bypass network is designed using a novel, systematic design customization process for wide-issue architectures. It is driven by a technology model-based evaluation methodology, resulting in a low-cost, high-performance bypass network for multimedia applications. This technique places microarchitectural components to exploit the communication patterns, reorganizes bypass paths based on the traffic rate, and maps inter-instruction communication onto the local paths. The reduction in transport latency, combined with a faster clock cycle, achieves an instruction throughput gain of 2.9x over the broadcast bypass network at 45 nm. In addition, the throughput gain over a typical clustered architecture is 1.3x.
Dynamic instruction clustering groups dependent instructions into clusters during instruction execution, performs operand transport pattern analysis, and maps the clustered instructions to a cluster execution unit. Two execution unit implementations are explored: network ALUs and a dynamically scheduled SIMD PE array. In the network ALUs, intermediate values are propagated among ALUs without distribution through global bypass buses. The reduction in operand transport latency results in a 35% IPC speedup over a conventional ILP processor. The dynamically scheduled SIMD PE array supports data-level parallel (DLP) processing of the innermost loops in image processing applications. Data-parallel operations combined with localized operand communication produce an IPC speedup of 2.59x over a 16-way, four-clustered microarchitecture.
author Kim, Hongkyu
title Architectural enhancements for efficient operand transport in multimedia systems
title_sort architectural enhancements for efficient operand transport in multimedia systems
publisher Georgia Institute of Technology
publishDate 2007
url http://hdl.handle.net/1853/14595
work_keys_str_mv AT kimhongkyu architecturalenhancementsforefficientoperandtransportinmultimediasystems
_version_ 1716474643900006400
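
Illustrative note on the abstract's local-buffer result: the record does not describe how the dissertation measures write suppression, so the following is only a minimal sketch under stated assumptions. It assumes a FIFO buffer, one produced value per instruction, and perfect knowledge of operand lifetime (the abstract mentions only approximate information). The function name `suppressed_write_fraction` and the trace format are hypothetical. A value counts as suppressible when its register is later overwritten and its last read occurs before the value would leave the buffer.

```python
from typing import Dict, List, Set, Tuple

# Hypothetical trace format: (destination register, source registers).
Instr = Tuple[str, Tuple[str, ...]]


def suppressed_write_fraction(trace: List[Instr], capacity: int = 8) -> float:
    """Fraction of produced values whose register-file write could be skipped.

    Assumptions (not from the source): FIFO buffer, one value per
    instruction, and exact lifetime information.
    """
    last_def: Dict[str, int] = {}   # register -> index of its latest producer
    last_use: Dict[int, int] = {}   # producer index -> index of its last consumer
    redefined: Set[int] = set()     # producers whose register was overwritten later

    for i, (dest, srcs) in enumerate(trace):
        for src in srcs:
            if src in last_def:
                last_use[last_def[src]] = i
        if dest in last_def:
            redefined.add(last_def[dest])
        last_def[dest] = i

    suppressed = 0
    for p in range(len(trace)):
        # Value p leaves the FIFO once `capacity` newer values have been produced.
        evicted_at = p + capacity
        if p in redefined and last_use.get(p, p) < evicted_at:
            suppressed += 1
    return suppressed / len(trace) if trace else 0.0


if __name__ == "__main__":
    demo = [
        ("r1", ()),
        ("r2", ("r1",)),
        ("r1", ("r2",)),
        ("r3", ("r1",)),
        ("r2", ("r3",)),
    ]
    print(suppressed_write_fraction(demo, capacity=8))
```

On a real workload the fraction depends on the trace and the buffer policy; the 81% figure in the abstract is the dissertation's own measurement and is not reproduced by this sketch.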