Early Load: Hiding Load-to-Use Latency in Deep Pipeline Processors

Bibliographic Details
Main Authors: Shun-Chieh Chang, 張順傑
Other Authors: Chung-Ping Chung
Format: Others
Language: en_US
Published: 2008
Online Access: http://ndltd.ncl.edu.tw/handle/87648186708460464578
id ndltd-TW-097NCTU5394028
record_format oai_dc
spelling ndltd-TW-097NCTU53940282015-10-13T13:11:49Z http://ndltd.ncl.edu.tw/handle/87648186708460464578
Early Load: Hiding Load-to-Use Latency in Deep Pipeline Processors 提早載入:在深管線處理器設計下隱藏載入使用延遲
Shun-Chieh Chang 張順傑
Master's thesis, National Chiao Tung University, Institute of Computer Science and Engineering, academic year 97.
In order to achieve high instruction throughput, high-performance processors tend to use more and deeper pipelines. As the pipeline gets deeper and wider, instruction execution latency becomes longer, which induces more pipeline stall cycles in an in-order processor. A conventional solution is out-of-order instruction issue and execution, but it is too expensive for some applications, such as embedded processors. An economical solution for low-cost designs is to execute only some critical instructions out of order. We focus on load instructions, due to their frequent occurrence and long execution latency in a deep pipeline. If a subsequent instruction depends on a load instruction, it may need to stall in the pipeline to wait for the load outcome. The maximum possible number of stall cycles is called the load-to-use latency. In this thesis, we propose a hardware method, called early load, to hide load-to-use latency by executing load instructions early. Early load requires that load instructions be identified and issued for execution early. In addition, an error-detection method is proposed to stop or invalidate incorrect early loads, ensuring correctness without inducing extra performance degradation. Early load can both hide load-to-use latency and reduce load/store unit contention, at only a small hardware cost. Our experiments show that for a 12-stage in-order dual-issue design, early load gives an 11.64% performance gain on the Dhrystone benchmark, and an 18.60% maximum and 5.15% average gain on the MiBench benchmark suite. Meanwhile, early load induces 24.08% additional memory accesses. The incurred hardware cost is about ten thousand transistors and the corresponding control circuits.
Chung-Ping Chung 鍾崇斌. 2008. Degree thesis; 46. en_US
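The load-to-use behavior described in the abstract can be illustrated with a toy timing model. This is a hedged sketch, not the thesis's simulator: the latency values, the trace format, and the assumption that early issue simply shortens the effective load latency are all hypothetical simplifications of the mechanism the abstract describes.

```python
# Toy in-order timing model: a dependent instruction stalls until the load's
# result is ready; issuing the load some stages early hides part of that wait.
# LOAD_TO_USE and EARLY_STAGES are illustrative values, not from the thesis.

LOAD_TO_USE = 4      # worst-case stall cycles a dependent instruction may see
EARLY_STAGES = 3     # how many pipeline stages earlier the load is issued

def stall_cycles(trace, early_load=False):
    """Count stall cycles for a trace of (op, dest, src) tuples.

    An instruction whose source register is produced by an in-flight
    load stalls until the load's ready cycle.
    """
    ready_at = {}    # register -> cycle its value becomes available
    cycle = 0
    stalls = 0
    for op, dest, src in trace:
        if src in ready_at and ready_at[src] > cycle:
            stalls += ready_at[src] - cycle
            cycle = ready_at[src]
        if op == "load":
            # Early load starts the memory access EARLY_STAGES cycles
            # sooner, so the result is ready that much earlier.
            latency = LOAD_TO_USE - (EARLY_STAGES if early_load else 0)
            ready_at[dest] = cycle + max(latency, 0)
        cycle += 1
    return stalls

# A dependent use immediately after a load: the worst case for load-to-use.
trace = [("load", "r1", "r9"), ("add", "r2", "r1")]
print(stall_cycles(trace))                    # → 3 stall cycles (baseline)
print(stall_cycles(trace, early_load=True))   # → 0 stall cycles
```

In this simplified model, issuing the load three stages earlier removes all three baseline stall cycles for the back-to-back load-use pair; the abstract's error-detection mechanism (squashing incorrect early loads) is not modeled here.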
collection NDLTD
language en_US
format Others
sources NDLTD
author2 Chung-Ping Chung
author_facet Chung-Ping Chung
Shun-Chieh Chang
張順傑
author Shun-Chieh Chang
張順傑
spellingShingle Shun-Chieh Chang
張順傑
Early Load: Hiding Load-to-Use Latency in Deep Pipeline Processors
author_sort Shun-Chieh Chang
title Early Load: Hiding Load-to-Use Latency in Deep Pipeline Processors
title_short Early Load: Hiding Load-to-Use Latency in Deep Pipeline Processors
title_full Early Load: Hiding Load-to-Use Latency in Deep Pipeline Processors
title_fullStr Early Load: Hiding Load-to-Use Latency in Deep Pipeline Processors
title_full_unstemmed Early Load: Hiding Load-to-Use Latency in Deep Pipeline Processors
title_sort early load:hiding load-to-use latency in deep pipeline processors
publishDate 2008
url http://ndltd.ncl.edu.tw/handle/87648186708460464578
work_keys_str_mv AT shunchiehchang earlyloadhidingloadtouselatencyindeeppipelineprocessors
AT zhāngshùnjié earlyloadhidingloadtouselatencyindeeppipelineprocessors
AT shunchiehchang tízǎozàirùzàishēnguǎnxiànchùlǐqìshèjìxiàyǐncángzàirùshǐyòngyánchí
AT zhāngshùnjié tízǎozàirùzàishēnguǎnxiànchùlǐqìshèjìxiàyǐncángzàirùshǐyòngyánchí
_version_ 1717734140703408128