Modeling and optimization of high-performance many-core systems for energy-efficient and reliable computing

Bibliographic Details
Main Author: Meng, Jie
Language: en_US
Published: Boston University 2015
Online Access: https://hdl.handle.net/2144/11145
id ndltd-bu.edu-oai-open.bu.edu-2144-11145
record_format oai_dc
spelling ndltd-bu.edu-oai-open.bu.edu-2144-11145 2019-01-08T15:34:26Z Modeling and optimization of high-performance many-core systems for energy-efficient and reliable computing Meng, Jie Thesis (Ph.D.)--Boston University 2015-04-27T16:56:23Z 2015-04-27T16:56:23Z 2013 2013 Thesis/Dissertation https://hdl.handle.net/2144/11145 en_US Boston University
collection NDLTD
language en_US
sources NDLTD
description Thesis (Ph.D.)--Boston University. Many-core systems, ranging from small-scale many-core processors to large-scale high-performance computing (HPC) data centers, have become the main trend in computing system design owing to their potential to deliver higher throughput per watt. However, power densities and temperatures rise as performance capacity grows, bringing major challenges in energy efficiency, cooling costs, and reliability. These challenges require a joint assessment of performance, power, and temperature tradeoffs, as well as the design of runtime optimization techniques that monitor and manage the interplay among them. This thesis proposes novel modeling and runtime management techniques that evaluate and optimize the performance, energy, and reliability of many-core systems. We first address the energy and thermal challenges in 3D-stacked many-core processors. 3D processors with stacked DRAM have the potential to dramatically improve performance owing to lower memory access latency and higher bandwidth. However, the performance increase may cause 3D systems to exceed their power budgets or create thermal hot spots. To provide an accurate analysis and enable the design of efficient management policies, this thesis introduces a simulation framework that jointly analyzes performance, power, and temperature for 3D systems. We then propose a runtime optimization policy that maximizes system performance by characterizing application behavior and predicting the operating points that satisfy the power and thermal constraints. Our policy reduces the energy-delay product (EDP) by up to 61.9% compared to existing strategies. Performance, cooling energy, and reliability are also critical aspects of HPC data centers. In addition to degrading reliability, high temperatures increase the required cooling energy. Communication cost, on the other hand, has a significant impact on system performance in HPC data centers. This thesis proposes a topology-aware technique that maximizes system reliability by selecting between workload clustering and balancing. Our policy improves system reliability by up to 123.3% compared to existing temperature balancing approaches. We also introduce a job allocation methodology that simultaneously optimizes the communication cost and the cooling energy in a data center. Our policy reduces the cooling cost by 40% compared to cooling-aware and performance-aware policies, while achieving performance comparable to that of the performance-aware policy.
author Meng, Jie
spellingShingle Meng, Jie
Modeling and optimization of high-performance many-core systems for energy-efficient and reliable computing
author_facet Meng, Jie
author_sort Meng, Jie
title Modeling and optimization of high-performance many-core systems for energy-efficient and reliable computing
title_short Modeling and optimization of high-performance many-core systems for energy-efficient and reliable computing
title_full Modeling and optimization of high-performance many-core systems for energy-efficient and reliable computing
title_fullStr Modeling and optimization of high-performance many-core systems for energy-efficient and reliable computing
title_full_unstemmed Modeling and optimization of high-performance many-core systems for energy-efficient and reliable computing
title_sort modeling and optimization of high-performance many-core systems for energy-efficient and reliable computing
publisher Boston University
publishDate 2015
url https://hdl.handle.net/2144/11145
work_keys_str_mv AT mengjie modelingandoptimizationofhighperformancemanycoresystemsforenergyefficientandreliablecomputing
_version_ 1718810371660185600