Extending the domain of transparent checkpoint-restart for large-scale HPC

While large-scale HPC systems are critical for expediting progress in many scientific fields, exascale computing will face severe resilience challenges. Checkpoint-restart is an important technology that large-scale applications continue to rely on to make forward progress in the presence of failure...


Bibliographic Details
Online Access:	http://hdl.handle.net/2047/D20316244

id	ndltd-NEU--neu-m044c040f
record_format	oai_dc
spelling	ndltd-NEU--neu-m044c040f 2021-05-27T05:12:01Z Extending the domain of transparent checkpoint-restart for large-scale HPC http://hdl.handle.net/2047/D20316244
collection	NDLTD
sources	NDLTD
description	While large-scale HPC systems are critical for expediting progress in many scientific fields, exascale computing will face severe resilience challenges. Checkpoint-restart is an important technology that large-scale applications continue to rely on to make forward progress in the presence of failures. Previous work in this domain has failed to address two key challenges: (a) supporting transparent checkpointing on modern hardware accelerator-based HPC systems, such as those using GPUs and newer RDMA networks; and (b) reducing the large I/O overhead, which lowers productivity and causes contention. To address the first challenge, this dissertation presents a new transparent checkpointing framework based on "split processes". The framework uses the hardware virtual memory subsystem of the host CPU to decouple computation state from the external subsystem context. This isolation between the application process and the external subsystem context enables transparent checkpointing for two diverse, well-known problems: checkpointing of modern CUDA-based programs, and transparent checkpointing of MPI applications running over a variety of RDMA networks and using different MPI implementations with a single code base. To address the second challenge, this dissertation demonstrates that system reliability and application resilience characteristics can be used to improve system (and individual application) throughput and reduce the checkpointing I/O overhead.
title	Extending the domain of transparent checkpoint-restart for large-scale HPC
url	http://hdl.handle.net/2047/D20316244
_version_	1719407445942468608


Similar Items