Fault Tolerant Network-on-Chip Router Architectures for Multi-Core Architectures

As the feature size scales down to deep nanometer regimes, it has enabled the designers to fabricate chips with billions of transistors. The availability of such abundant computational resources on a single chip has made it possible to design chips with multiple computational cores, resulting in the...

Full description

Bibliographic Details
Main Author: Poluri, Pavan Kamal Sudheendra
Other Authors: Louri, Ahmed
Language:en_US
Published: The University of Arizona. 2014
Subjects:
Online Access:http://hdl.handle.net/10150/338752
id ndltd-arizona.edu-oai-arizona.openrepository.com-10150-338752
record_format oai_dc
collection NDLTD
language en_US
sources NDLTD
topic Network-on-Chip
Performance
Reliability
Modeling
Electrical & Computer Engineering
spellingShingle Network-on-Chip
Performance
Reliability
Modeling
Electrical & Computer Engineering
Poluri, Pavan Kamal Sudheendra
Fault Tolerant Network-on-Chip Router Architectures for Multi-Core Architectures
description As the feature size scales down to deep nanometer regimes, it has enabled the designers to fabricate chips with billions of transistors. The availability of such abundant computational resources on a single chip has made it possible to design chips with multiple computational cores, resulting in the inception of Chip Multiprocessors (CMPs). The widespread use of CMPs has resulted in a paradigm shift from computation-centric architectures to communication-centric architectures. With the continuous increase in the number of cores that can be fabricated on a single chip, communication between the cores has become a crucial factor in its overall performance. Network-on-Chip (NoC) paradigm has evolved into a standard on-chip interconnection network that can efficiently handle the strict communication requirements between the cores on a chip. The components of an NoC include routers, that facilitate routing of data between multiple cores and links that provide raw bandwidth for data traversal. While diminishing feature size has made it possible to integrate billions of transistors on a chip, the advantage of multiple cores has been marred with the waning reliability of transistors. Components of an NoC are not immune to the increasing number of hard faults and soft errors emanating due to extreme miniaturization of transistor sizes. Faults in an NoC result in significant ramifications such as isolation of healthy cores, deadlock, data corruption, packet loss and increased packet latency, all of which have a severe impact on the performance of a chip. This has stimulated the need to design resilient and fault tolerant NoCs. This thesis handles the issue of fault tolerance in NoC routers. Within the NoC router, the focus is specifically on the router pipeline that is responsible for the smooth flow of packets. In this thesis we propose two different fault tolerant architectures that can continue to operate in the presence of faults. In addition to these two architectures, we also propose a new reliability metric for evaluating soft error tolerant techniques targeted towards the control logic of the NoC router pipeline. First, we present Shield, a fault tolerant NoC router architecture that is capable of handling both hard faults and soft errors in its pipeline. Shield uses techniques such as spatial redundancy, exploitation of idle resources and bypassing a faulty resource to achieve hard fault tolerance. The use of these techniques reveals that Shield is six times more reliable than baseline-unprotected router. To handle soft errors, Shield uses selective hardening technique that includes hardening specific gates of the router pipeline to increase its soft error tolerance. To quantify soft error tolerance improvement, we propose a new metric called Soft Error Improvement Factor (SEIF) and use it to show that Shield’s soft error tolerance is three times better than that of the baseline-unprotected router. Then, we present Soft Error Tolerant NoC Router (STNR), a low overhead fault tolerating NoC router architecture that can tolerate soft errors in the control logic of its pipeline. STNR achieves soft error tolerance based on the idea of dual execution, comparison and rollback. It exploits idle cycles in the router pipeline to perform redundant computation and comparison necessary for soft error detection. Upon the detection of a soft error, the pipeline is rolled back to the stage that got affected by the soft error. Salient features of STNR include high level of soft error detection, fault containment and minimum impact on latency. Simulations show that STNR has been able to detect all injected single soft errors in the router pipeline. To perform a quantitative comparison between STNR and other existing similar architectures, we propose a new reliability metric called Metric for Soft error Tolerance (MST) in this thesis. MST is unique in the aspect that it encompasses four crucial factors namely, soft error tolerance, area overhead, power overhead and pipeline latency overhead into a single metric. Analysis using MST shows that STNR provides better reliability while incurring low overhead compared to existing architectures.
author2 Louri, Ahmed
author_facet Louri, Ahmed
Poluri, Pavan Kamal Sudheendra
author Poluri, Pavan Kamal Sudheendra
author_sort Poluri, Pavan Kamal Sudheendra
title Fault Tolerant Network-on-Chip Router Architectures for Multi-Core Architectures
title_short Fault Tolerant Network-on-Chip Router Architectures for Multi-Core Architectures
title_full Fault Tolerant Network-on-Chip Router Architectures for Multi-Core Architectures
title_fullStr Fault Tolerant Network-on-Chip Router Architectures for Multi-Core Architectures
title_full_unstemmed Fault Tolerant Network-on-Chip Router Architectures for Multi-Core Architectures
title_sort fault tolerant network-on-chip router architectures for multi-core architectures
publisher The University of Arizona.
publishDate 2014
url http://hdl.handle.net/10150/338752
work_keys_str_mv AT poluripavankamalsudheendra faulttolerantnetworkonchiprouterarchitecturesformulticorearchitectures
_version_ 1718107760335257600
spelling ndltd-arizona.edu-oai-arizona.openrepository.com-10150-3387522015-10-23T05:35:47Z Fault Tolerant Network-on-Chip Router Architectures for Multi-Core Architectures Poluri, Pavan Kamal Sudheendra Louri, Ahmed Louri, Ahmed Hariri, Salim Lysecky, Roman Network-on-Chip Performance Reliability Modeling Electrical & Computer Engineering As the feature size scales down to deep nanometer regimes, it has enabled the designers to fabricate chips with billions of transistors. The availability of such abundant computational resources on a single chip has made it possible to design chips with multiple computational cores, resulting in the inception of Chip Multiprocessors (CMPs). The widespread use of CMPs has resulted in a paradigm shift from computation-centric architectures to communication-centric architectures. With the continuous increase in the number of cores that can be fabricated on a single chip, communication between the cores has become a crucial factor in its overall performance. Network-on-Chip (NoC) paradigm has evolved into a standard on-chip interconnection network that can efficiently handle the strict communication requirements between the cores on a chip. The components of an NoC include routers, that facilitate routing of data between multiple cores and links that provide raw bandwidth for data traversal. While diminishing feature size has made it possible to integrate billions of transistors on a chip, the advantage of multiple cores has been marred with the waning reliability of transistors. Components of an NoC are not immune to the increasing number of hard faults and soft errors emanating due to extreme miniaturization of transistor sizes. Faults in an NoC result in significant ramifications such as isolation of healthy cores, deadlock, data corruption, packet loss and increased packet latency, all of which have a severe impact on the performance of a chip. This has stimulated the need to design resilient and fault tolerant NoCs. This thesis handles the issue of fault tolerance in NoC routers. Within the NoC router, the focus is specifically on the router pipeline that is responsible for the smooth flow of packets. In this thesis we propose two different fault tolerant architectures that can continue to operate in the presence of faults. In addition to these two architectures, we also propose a new reliability metric for evaluating soft error tolerant techniques targeted towards the control logic of the NoC router pipeline. First, we present Shield, a fault tolerant NoC router architecture that is capable of handling both hard faults and soft errors in its pipeline. Shield uses techniques such as spatial redundancy, exploitation of idle resources and bypassing a faulty resource to achieve hard fault tolerance. The use of these techniques reveals that Shield is six times more reliable than baseline-unprotected router. To handle soft errors, Shield uses selective hardening technique that includes hardening specific gates of the router pipeline to increase its soft error tolerance. To quantify soft error tolerance improvement, we propose a new metric called Soft Error Improvement Factor (SEIF) and use it to show that Shield’s soft error tolerance is three times better than that of the baseline-unprotected router. Then, we present Soft Error Tolerant NoC Router (STNR), a low overhead fault tolerating NoC router architecture that can tolerate soft errors in the control logic of its pipeline. STNR achieves soft error tolerance based on the idea of dual execution, comparison and rollback. It exploits idle cycles in the router pipeline to perform redundant computation and comparison necessary for soft error detection. Upon the detection of a soft error, the pipeline is rolled back to the stage that got affected by the soft error. Salient features of STNR include high level of soft error detection, fault containment and minimum impact on latency. Simulations show that STNR has been able to detect all injected single soft errors in the router pipeline. To perform a quantitative comparison between STNR and other existing similar architectures, we propose a new reliability metric called Metric for Soft error Tolerance (MST) in this thesis. MST is unique in the aspect that it encompasses four crucial factors namely, soft error tolerance, area overhead, power overhead and pipeline latency overhead into a single metric. Analysis using MST shows that STNR provides better reliability while incurring low overhead compared to existing architectures. 2014 text Electronic Dissertation http://hdl.handle.net/10150/338752 en_US Copyright © is held by the author. Digital access to this material is made possible by the University Libraries, University of Arizona. Further transmission, reproduction or presentation (such as public display or performance) of protected items is prohibited except with permission of the author. The University of Arizona.