Designing Scalable and Efficient I/O Middleware for Fault-Resilient High-Performance Computing Clusters

Bibliographic Details
Main Author: Raja Chandrasekar, Raghunath
Language: English
Published: The Ohio State University / OhioLINK 2014
Subjects:
HPC
MPI
Online Access: http://rave.ohiolink.edu/etdc/view?acc_num=osu1417733721
id ndltd-OhioLink-oai-etd.ohiolink.edu-osu1417733721
record_format oai_dc
collection NDLTD
language English
sources NDLTD
topic Computer Engineering
Computer Science
fault-tolerance
resilience
checkpointing
process-migration
Input-Output
HPC
supercomputing
MPI
MVAPICH
accelerators
energy-efficiency
spelling ndltd-OhioLink-oai-etd.ohiolink.edu-osu1417733721 2021-08-03T06:28:23Z

In high-performance computing (HPC), tightly coupled parallel applications run in lock-step over thousands to millions of processor cores. These applications simulate a wide range of scientific phenomena, such as hurricanes and earthquakes, or the functioning of a human heart. The results of these simulations are important and time-critical, e.g., we want to know the path of the hurricane before it makes landfall. Thus, these applications are run on the fastest supercomputers in the world at the largest scales possible. However, due to the increased component count, large-scale executions are more prone to experience faults, with Mean Times Between Failures (MTBF) on the order of hours or days due to hardware breakdowns and soft errors.

A vast majority of current-generation HPC systems and application codes work around system failures using rollback-recovery schemes, also known as Checkpoint-Restart (CR), wherein the parallel processes of an application frequently save a mutually agreed-upon state of their execution as checkpoints in a globally shared storage medium. In the face of failures, applications roll back their execution to a fault-free state using these periodically saved snapshots. Over the years, checkpointing mechanisms have gained notoriety for their colossal I/O demands. While state-of-the-art parallel file systems are optimized for concurrent accesses from millions of processes, checkpointing overheads continue to dominate application run times, with a single checkpoint taking on the order of tens of minutes to hours to write.
On future systems, checkpointing activities are predicted to dominate compute time and overwhelm file system resources.

On supercomputing systems geared for Exascale, parallel applications will have a wider range of storage media to choose from: on-chip/off-chip caches, node-level RAM, Non-Volatile Memory (NVM), distributed RAM, flash storage (SSDs), HDDs, parallel file systems, and archival storage. Current-generation checkpointing middleware and frameworks are oblivious to this storage hierarchy, in which each medium has unique performance and data-persistence characteristics.

This thesis proposes a cross-layer framework that leverages this hierarchy in storage media to design scalable and low-overhead fault-tolerance mechanisms that are inherently I/O bound. The key components of the framework include: CRUISE, a highly scalable in-memory checkpointing system that leverages both volatile and Non-Volatile Memory technologies; Stage-FS, a light-weight data-staging system that leverages burst-buffers and SSDs to asynchronously move application snapshots to a remote file system; Stage-QoS, a file-system-agnostic Quality-of-Service mechanism for data-staging systems that minimizes network contention; MIC-Check, a distributed checkpoint-restart system for coprocessor-based supercomputing systems; Power-Check, an energy-efficient checkpointing framework for transparent and application-aware HPC checkpointing systems; and FTB-IPMI, an out-of-band fault-prediction mechanism that proactively monitors for failures. The components of this framework have been evaluated up to a scale of three million compute processes, have reduced the checkpointing overhead on scientific applications by a factor of 30, and have reduced the amount of energy consumed by checkpointing systems by up to 48%.
Rights: unrestricted. This thesis or dissertation is protected by copyright: all rights reserved. It may not be copied or redistributed beyond the terms of applicable copyright laws.