Early Load: Hiding Load-to-Use Latency in Deep Pipeline Processors
Master's === National Chiao Tung University === Institute of Computer Science and Engineering === 97
Main Authors: | Shun-Chieh Chang 張順傑 |
---|---|
Other Authors: | Chung-Ping Chung 鍾崇斌 |
Format: | Others |
Language: | en_US |
Published: | 2008 |
Online Access: | http://ndltd.ncl.edu.tw/handle/87648186708460464578 |
id |
ndltd-TW-097NCTU5394028 |
---|---|
record_format |
oai_dc |
spelling |
ndltd-TW-097NCTU53940282015-10-13T13:11:49Z http://ndltd.ncl.edu.tw/handle/87648186708460464578 Early Load: Hiding Load-to-Use Latency in Deep Pipeline Processors 提早載入:在深管線處理器設計下隱藏載入使用延遲 Shun-Chieh Chang 張順傑 Master's National Chiao Tung University Institute of Computer Science and Engineering 97 In order to achieve high instruction throughput, high-performance processors tend to use deeper and wider pipelines. As pipelines get deeper and wider, instruction execution latency becomes longer, and this longer latency induces more pipeline stall cycles in an in-order processor. A conventional solution is out-of-order instruction issue and execution, but it is too expensive for some applications, such as embedded processors. An economical solution for low-cost designs is to execute only a few critical instructions out of order. We focus on load instructions because of their frequent occurrence and long execution latency in a deep pipeline. If a subsequent instruction depends on a load, it may need to stall in the pipeline to wait for the load outcome. The maximum possible number of stall cycles is called the load-to-use latency. In this thesis, we propose a hardware method, called early load, that hides load-to-use latency by executing load instructions early. Early load requires that load instructions be identified and issued for execution early. In addition, an error detection method is proposed to stop or invalidate incorrect early loads, ensuring correctness without inducing extra performance degradation. Early load can both hide load-to-use latency and reduce load/store unit contention, at little hardware cost. Our experiments show that for a 12-stage in-order dual-issue design, early load yields an 11.64% performance gain on the Dhrystone benchmark, and a maximal 18.60% and average 5.15% gain on the MiBench benchmark suite. Meanwhile, early load induces 24.08% additional memory accesses. The incurred hardware cost is about ten thousand transistors plus the corresponding control circuits.
Chung-Ping Chung 鍾崇斌 2008 degree thesis 46 en_US |
collection |
NDLTD |
language |
en_US |
format |
Others |
sources |
NDLTD |
description |
Master's === National Chiao Tung University === Institute of Computer Science and Engineering === 97 === In order to achieve high instruction throughput, high-performance processors tend to use deeper and wider pipelines. As pipelines get deeper and wider, instruction execution latency becomes longer, and this longer latency induces more pipeline stall cycles in an in-order processor. A conventional solution is out-of-order instruction issue and execution, but it is too expensive for some applications, such as embedded processors. An economical solution for low-cost designs is to execute only a few critical instructions out of order. We focus on load instructions because of their frequent occurrence and long execution latency in a deep pipeline. If a subsequent instruction depends on a load, it may need to stall in the pipeline to wait for the load outcome. The maximum possible number of stall cycles is called the load-to-use latency.
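The load-to-use latency defined above can be illustrated with a small calculation over hypothetical pipeline stage positions (a sketch for intuition only; the stage numbers are assumptions, not taken from the thesis):

```python
# Illustrative sketch: load-to-use latency is the maximum number of
# cycles a dependent instruction must stall waiting for a load's
# result in an in-order pipeline. Stage numbers are hypothetical.

def load_to_use_latency(load_result_stage: int, use_operand_stage: int) -> int:
    """Max stall cycles for a dependent instruction issued back-to-back.

    The dependent instruction trails the load by one cycle, so by the
    time it reaches its operand-read stage, the load has advanced one
    stage past that point; the dependent must then wait until the load
    reaches the stage where its result becomes available.
    """
    return max(0, load_result_stage - use_operand_stage - 1)

# e.g. if a load delivers its result at stage 8 and dependent
# instructions read operands at stage 4, a back-to-back use
# stalls for 3 cycles.
print(load_to_use_latency(8, 4))  # 3
```

A shallower pipeline (result at stage 5, operand read at stage 4) gives zero stalls, which is why load-to-use latency matters mainly in deep pipelines.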
In this thesis, we propose a hardware method, called early load, that hides load-to-use latency by executing load instructions early. Early load requires that load instructions be identified and issued for execution early. In addition, an error detection method is proposed to stop or invalidate incorrect early loads, ensuring correctness without inducing extra performance degradation. Early load can both hide load-to-use latency and reduce load/store unit contention, at little hardware cost.
Our experiments show that for a 12-stage in-order dual-issue design, early load yields an 11.64% performance gain on the Dhrystone benchmark, and a maximal 18.60% and average 5.15% gain on the MiBench benchmark suite. Meanwhile, early load induces 24.08% additional memory accesses. The incurred hardware cost is about ten thousand transistors plus the corresponding control circuits.
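The trade-off described in the abstract can be sketched with a toy stall-count model. This is an illustrative assumption, not the thesis's implementation: the abstract does not detail its error detection method, so the rule used here (an early load that conflicts with an intervening store is invalidated and pays the full latency) is a commonly assumed safeguard:

```python
# Toy model (an assumption, not the thesis's mechanism): issuing a
# load `early_by` cycles ahead hides that many cycles of its
# load-to-use latency, unless the early load is invalidated by a
# conflicting store, in which case the full latency is paid.

def stalls(trace, load_to_use=3, early_by=2):
    """Count stall cycles for a trace of instruction records."""
    total = 0
    for op in trace:
        if op["kind"] == "load":
            if op.get("early") and not op.get("conflict"):
                # Early load succeeded: latency partially hidden.
                total += max(0, load_to_use - early_by)
            else:
                # Not issued early, or invalidated by error detection.
                total += load_to_use
    return total

trace = [
    {"kind": "load", "early": True},                    # hidden: 1 stall
    {"kind": "store"},                                  # no stall
    {"kind": "load", "early": True, "conflict": True},  # invalidated: 3
    {"kind": "load"},                                   # not early: 3
]
print(stalls(trace))  # 7
```

The model also reflects the abstract's cost observation: every early load that is later invalidated has still performed a memory access, which is where the reported extra memory traffic would come from.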
|
author2 |
Chung-Ping Chung |
author_facet |
Chung-Ping Chung Shun-Chieh Chang 張順傑 |
author |
Shun-Chieh Chang 張順傑 |
spellingShingle |
Shun-Chieh Chang 張順傑 Early Load: Hiding Load-to-Use Latency in Deep Pipeline Processors |
author_sort |
Shun-Chieh Chang |
title |
Early Load: Hiding Load-to-Use Latency in Deep Pipeline Processors |
title_short |
Early Load: Hiding Load-to-Use Latency in Deep Pipeline Processors |
title_full |
Early Load: Hiding Load-to-Use Latency in Deep Pipeline Processors |
title_fullStr |
Early Load: Hiding Load-to-Use Latency in Deep Pipeline Processors |
title_full_unstemmed |
Early Load: Hiding Load-to-Use Latency in Deep Pipeline Processors |
title_sort |
early load: hiding load-to-use latency in deep pipeline processors |
publishDate |
2008 |
url |
http://ndltd.ncl.edu.tw/handle/87648186708460464578 |
work_keys_str_mv |
AT shunchiehchang earlyloadhidingloadtouselatencyindeeppipelineprocessors AT zhāngshùnjié earlyloadhidingloadtouselatencyindeeppipelineprocessors AT shunchiehchang tízǎozàirùzàishēnguǎnxiànchùlǐqìshèjìxiàyǐncángzàirùshǐyòngyánchí AT zhāngshùnjié tízǎozàirùzàishēnguǎnxiànchùlǐqìshèjìxiàyǐncángzàirùshǐyòngyánchí |
_version_ |
1717734140703408128 |