Architectural enhancements for efficient operand transport in multimedia systems

Bibliographic Details
Main Author:	Kim, Hongkyu
Published:	Georgia Institute of Technology 2007
Subjects:	Operand; Transport; Multimedia; Architecture
Online Access:	http://hdl.handle.net/1853/14595

id	ndltd-GATECH-oai-smartech.gatech.edu-1853-14595
record_format	oai_dc
spelling	ndltd-GATECH-oai-smartech.gatech.edu-1853-14595; 2013-01-07T20:16:51Z; Architectural enhancements for efficient operand transport in multimedia systems; Kim, Hongkyu; Operand; Transport; Multimedia; Architecture; Georgia Institute of Technology; 2007-05-25T17:36:56Z; 2007-05-25T17:36:56Z; 2007-01-08; Dissertation; http://hdl.handle.net/1853/14595
collection	NDLTD
sources	NDLTD
topic	Operand; Transport; Multimedia; Architecture
description	Multimedia applications pose new challenges to computer architecture: their heavy communication demands place a severe burden on the interconnect between functional units (FUs). This dissertation addresses how to transport operands efficiently among computational and storage components, and it proposes architectural enhancements that enable high-bandwidth, low-latency communication. The research first analyzes multimedia workloads to characterize the communication patterns that occur during execution of standard multimedia benchmarks. This empirical analysis indicates that most operands exhibit strong locality, which enables several optimizations of the transport mechanisms. For example, an eight-entry local buffer with approximate information on operand lifetime is sufficient to suppress 81% of operand writes, and chaining selected pairs of FUs based on producer-consumer information allows 50% of reads to be accessed through the shortest path. These results guide the design of two efficient operand transport mechanisms: a traffic-driven bypass network and dynamic instruction clustering. The traffic-driven bypass network is designed with a novel, systematic customization process for wide-issue architectures. The process is driven by a technology-model-based evaluation methodology and yields a low-cost, high-performance bypass network for multimedia applications. It places microarchitectural components to exploit the observed communication patterns, reorganizes bypass paths according to their traffic rate, and maps inter-instruction communication onto the local paths. The reduced transport latency, combined with a faster clock cycle, achieves a 2.9x instruction-throughput gain over a broadcast bypass network at 45 nm, and a 1.3x gain over a typical clustered architecture.
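The write-suppression result can be illustrated with a small trace-driven model. The sketch below is a hypothetical illustration, not the dissertation's simulator: it assumes a trace of (destination, sources) register tuples, uses an exact per-value read count taken from the trace as a stand-in for the "approximate information on operand lifetime", and assumes FIFO replacement in the local buffer. A value whose last reader is served while the value is still buffered never needs to be written back; values that are evicted, never read, or still buffered at the end of the trace are conservatively counted as written.

```python
from collections import OrderedDict
from typing import Dict, Optional, Sequence, Tuple

# Hypothetical trace format: (destination register or None, source registers)
Instr = Tuple[Optional[str], Tuple[str, ...]]


def read_counts(trace: Sequence[Instr]) -> Dict[int, int]:
    """For each producing instruction, count reads of its value before the register is redefined."""
    counts: Dict[int, int] = {}
    producer: Dict[str, int] = {}  # register -> index of its current producer
    for i, (dest, srcs) in enumerate(trace):
        for reg in srcs:  # sources are read before the destination is written
            if reg in producer:
                counts[producer[reg]] += 1
        if dest is not None:
            producer[dest] = i
            counts[i] = 0
    return counts


def suppressed_write_fraction(trace: Sequence[Instr], capacity: int = 8) -> float:
    """Fraction of register writes whose value dies inside a FIFO local buffer of `capacity` entries."""
    remaining = read_counts(trace)  # assumed stand-in for an approximate lifetime hint
    buffer = OrderedDict()  # producer index -> reads still expected, oldest entry first
    producer: Dict[str, int] = {}
    writes = suppressed = 0
    for i, (dest, srcs) in enumerate(trace):
        for reg in srcs:
            p = producer.get(reg)
            if p is not None and p in buffer:
                buffer[p] -= 1
                if buffer[p] == 0:  # last reader served locally: the write is never needed
                    del buffer[p]
                    suppressed += 1
        if dest is not None:
            writes += 1
            producer[dest] = i
            if remaining[i] > 0:  # unread values are conservatively written back
                buffer[i] = remaining[i]
                if len(buffer) > capacity:
                    buffer.popitem(last=False)  # evicted value must be written back
    return suppressed / writes if writes else 0.0


# Tiny example: r1 = ...; r2 = f(r1); r3 = r1 + r2; store r3
EXAMPLE: Tuple[Instr, ...] = (
    ("r1", ()),
    ("r2", ("r1",)),
    ("r3", ("r1", "r2")),
    (None, ("r3",)),
)
print(suppressed_write_fraction(EXAMPLE))              # 1.0
print(suppressed_write_fraction(EXAMPLE, capacity=1))  # 0.6666666666666666 (2 of 3 writes)
```

Shrinking `capacity` forces early evictions, so fewer values die in the buffer; sweeping it over a real benchmark trace is one way to see why a handful of entries can already capture most short-lived operands.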
Dynamic instruction clustering groups dependent instructions into clusters during execution, analyzes their operand transport patterns, and maps the clustered instructions onto a cluster execution unit. Two execution-unit implementations are explored: network ALUs and a dynamically scheduled SIMD processing-element (PE) array. In the network ALUs, intermediate values are passed directly between ALUs instead of being distributed over global bypass buses; the resulting reduction in operand transport latency yields a 35% IPC speedup over a conventional instruction-level-parallel (ILP) processor. The dynamically scheduled SIMD PE array exploits data-level parallelism (DLP) in the innermost loops of image processing applications. Data-parallel operations combined with localized operand communication produce a 2.59x IPC speedup over a 16-way, four-cluster microarchitecture.
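The grouping step of dynamic instruction clustering can likewise be sketched in a few lines. The rule below, which steers each instruction to the cluster of one of its producers while that cluster still has room, is an assumed greedy dependence-based heuristic chosen for illustration; the function names, the `max_size` cap, and the (destination, sources) trace format are hypothetical and do not reproduce the dissertation's hardware. The second function measures what fraction of in-window operands stay inside a cluster, the kind of locality that lets network ALUs avoid global bypass buses.

```python
from typing import Dict, List, Optional, Sequence, Tuple

# Hypothetical trace format: (destination register or None, source registers)
Instr = Tuple[Optional[str], Tuple[str, ...]]


def cluster_instructions(window: Sequence[Instr], max_size: int = 4) -> List[List[int]]:
    """Greedily place each instruction in the cluster of one of its producers, if it has room."""
    producer: Dict[str, int] = {}    # register -> index of latest producer in the window
    cluster_of: Dict[int, int] = {}  # instruction index -> cluster id
    clusters: List[List[int]] = []
    for i, (dest, srcs) in enumerate(window):
        target: Optional[int] = None
        for reg in srcs:
            p = producer.get(reg)
            if p is not None and len(clusters[cluster_of[p]]) < max_size:
                target = cluster_of[p]  # follow the first producer whose cluster has room
                break
        if target is None:  # no in-window producer with room: open a new cluster
            target = len(clusters)
            clusters.append([])
        clusters[target].append(i)
        cluster_of[i] = target
        if dest is not None:
            producer[dest] = i
    return clusters


def local_operand_fraction(window: Sequence[Instr], clusters: List[List[int]]) -> float:
    """Fraction of in-window operands whose producer and consumer share a cluster."""
    cluster_of = {i: c for c, members in enumerate(clusters) for i in members}
    producer: Dict[str, int] = {}
    local = total = 0
    for i, (dest, srcs) in enumerate(window):
        for reg in srcs:
            p = producer.get(reg)
            if p is not None:  # operands produced outside the window are ignored
                total += 1
                if cluster_of[p] == cluster_of[i]:
                    local += 1
        if dest is not None:
            producer[dest] = i
    return local / total if total else 0.0


EXAMPLE: Tuple[Instr, ...] = (
    ("r1", ()),
    ("r2", ("r1",)),
    ("r3", ()),
    ("r4", ("r2", "r3")),
    ("r5", ("r3",)),
)
groups = cluster_instructions(EXAMPLE)
print(groups)                                   # [[0, 1, 3], [2, 4]]
print(local_operand_fraction(EXAMPLE, groups))  # 0.75
```

In this toy window only the `r3` operand of instruction 3 crosses clusters; a smaller `max_size` splits dependence chains and pushes more operands onto the global paths.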
author	Kim, Hongkyu
title	Architectural enhancements for efficient operand transport in multimedia systems
publisher	Georgia Institute of Technology
publishDate	2007
url	http://hdl.handle.net/1853/14595