Early Load: Hiding Load-to-Use Latency in Deep Pipeline Processors

Bibliographic Details
Main Authors: Shun-Chieh Chang, 張順傑
Other Authors: Chung-Ping Chung
Format: Others
Language: en_US
Published: 2008
Online Access: http://ndltd.ncl.edu.tw/handle/87648186708460464578
id ndltd-TW-097NCTU5394028
record_format oai_dc
spelling ndltd-TW-097NCTU53940282015-10-13T13:11:49Z http://ndltd.ncl.edu.tw/handle/87648186708460464578
Early Load: Hiding Load-to-Use Latency in Deep Pipeline Processors 提早載入:在深管線處理器設計下隱藏載入使用延遲
Shun-Chieh Chang 張順傑
Master's thesis, National Chiao Tung University, Institute of Computer Science and Engineering, academic year 97.
In order to achieve high instruction throughput, high-performance processors tend to use more and deeper pipelines. As the pipeline gets deeper and wider, instruction execution latency becomes longer, which induces more pipeline stall cycles in an in-order processor. A conventional solution is out-of-order instruction issue and execution, but it is too expensive for some applications, such as embedded processors. An economical solution for low-cost designs is to execute only some critical instructions out of order. We focus on load instructions, due to their frequent occurrence and long execution latency in a deep pipeline. If a subsequent instruction depends on a load instruction, it may need to stall in the pipeline to wait for the load outcome. The maximum possible number of stall cycles is called the load-to-use latency. In this thesis, we propose a hardware method, called early load, to hide load-to-use latency by executing load instructions early. Early load requires that load instructions be identified and issued for execution early. In addition, an error-detection method is proposed to stop or invalidate incorrect early loads, ensuring correctness without inducing extra performance degradation. Early load can both hide load-to-use latency and reduce load/store unit contention, at only a small hardware cost. Our experiments show that for a 12-stage in-order dual-issue design, early load gives an 11.64% performance gain on the Dhrystone benchmark, and an 18.60% maximum and 5.15% average gain on the MiBench benchmark suite. Meanwhile, early load induces 24.08% additional memory accesses. The incurred hardware cost is about ten thousand transistors and the corresponding control circuits.
Chung-Ping Chung 鍾崇斌. 2008. Degree thesis; 46. en_US
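The load-to-use behavior described in the abstract can be illustrated with a toy timing model. This is a hedged sketch, not the thesis's simulator: the latency values, the trace format, and the assumption that early issue simply shortens the effective load latency are all hypothetical simplifications of the mechanism the abstract describes.

```python
# Toy in-order timing model: a dependent instruction stalls until the load's
# result is ready; issuing the load some stages early hides part of that wait.
# LOAD_TO_USE and EARLY_STAGES are illustrative values, not from the thesis.

LOAD_TO_USE = 4      # worst-case stall cycles a dependent instruction may see
EARLY_STAGES = 3     # how many pipeline stages earlier the load is issued

def stall_cycles(trace, early_load=False):
    """Count stall cycles for a trace of (op, dest, src) tuples.

    An instruction whose source register is produced by an in-flight
    load stalls until the load's ready cycle.
    """
    ready_at = {}    # register -> cycle its value becomes available
    cycle = 0
    stalls = 0
    for op, dest, src in trace:
        if src in ready_at and ready_at[src] > cycle:
            stalls += ready_at[src] - cycle
            cycle = ready_at[src]
        if op == "load":
            # Early load starts the memory access EARLY_STAGES cycles
            # sooner, so the result is ready that much earlier.
            latency = LOAD_TO_USE - (EARLY_STAGES if early_load else 0)
            ready_at[dest] = cycle + max(latency, 0)
        cycle += 1
    return stalls

# A dependent use immediately after a load: the worst case for load-to-use.
trace = [("load", "r1", "r9"), ("add", "r2", "r1")]
print(stall_cycles(trace))                    # → 3 stall cycles (baseline)
print(stall_cycles(trace, early_load=True))   # → 0 stall cycles
```

In this simplified model, issuing the load three stages earlier removes all three baseline stall cycles for the back-to-back load-use pair; the abstract's error-detection mechanism (squashing incorrect early loads) is not modeled here.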
collection NDLTD
language en_US
format Others
sources NDLTD
author2 Chung-Ping Chung
author_facet Chung-Ping Chung
Shun-Chieh Chang
張順傑
author Shun-Chieh Chang
張順傑
spellingShingle Shun-Chieh Chang
張順傑
Early Load: Hiding Load-to-Use Latency in Deep Pipeline Processors
author_sort Shun-Chieh Chang
title Early Load: Hiding Load-to-Use Latency in Deep Pipeline Processors
title_short Early Load: Hiding Load-to-Use Latency in Deep Pipeline Processors
title_full Early Load: Hiding Load-to-Use Latency in Deep Pipeline Processors
title_fullStr Early Load: Hiding Load-to-Use Latency in Deep Pipeline Processors
title_full_unstemmed Early Load: Hiding Load-to-Use Latency in Deep Pipeline Processors
title_sort early load:hiding load-to-use latency in deep pipeline processors
publishDate 2008
url http://ndltd.ncl.edu.tw/handle/87648186708460464578
work_keys_str_mv AT shunchiehchang earlyloadhidingloadtouselatencyindeeppipelineprocessors
AT zhāngshùnjié earlyloadhidingloadtouselatencyindeeppipelineprocessors
AT shunchiehchang tízǎozàirùzàishēnguǎnxiànchùlǐqìshèjìxiàyǐncángzàirùshǐyòngyánchí
AT zhāngshùnjié tízǎozàirùzàishēnguǎnxiànchùlǐqìshèjìxiàyǐncángzàirùshǐyòngyánchí
_version_ 1717734140703408128