id |
ndltd-NEU--neu-m044c040f
|
record_format |
oai_dc
|
spelling |
ndltd-NEU--neu-m044c040f 2021-05-27T05:12:01Z Extending the domain of transparent checkpoint-restart for large-scale HPC While large-scale HPC systems are critical for expediting progress in many scientific fields, exascale computing will face severe resilience challenges. Checkpoint-restart is an important technology that large-scale applications continue to rely on to make forward progress in the presence of failures. Previous work in this domain has failed to address two key challenges: (a) support for transparent checkpointing of modern hardware accelerator-based HPC systems, such as those using GPUs and newer RDMA networks; and (b) reducing the large I/O overhead that leads to reduced productivity and contention. To address the first challenge, this dissertation presents a new transparent checkpointing framework, based on "split processes". The framework uses the hardware virtual memory subsystem of the host CPU to decouple computation state from the external subsystem context. This isolation between the application process and the external subsystem context enables transparent checkpointing for two diverse, well-known problems: checkpointing of modern CUDA-based programs; and transparent checkpointing of MPI applications, running over a variety of RDMA networks and using different MPI implementations with a single code base. To address the second challenge, this dissertation demonstrates that system reliability and application resilience characteristics can be used to improve system (and individual application) throughput and reduce the checkpointing I/O overhead. http://hdl.handle.net/2047/D20316244
|
collection |
NDLTD
|
sources |
NDLTD
|
description |
While large-scale HPC systems are critical for expediting progress in many scientific fields, exascale computing will face severe resilience challenges. Checkpoint-restart is an important technology that large-scale applications continue to rely on to make forward progress in the presence of failures. Previous work in this domain has failed to address two key challenges: (a) support for transparent checkpointing of modern hardware accelerator-based HPC systems, such as those
using GPUs and newer RDMA networks; and (b) reducing the large I/O overhead that leads to reduced productivity and contention. To address the first challenge, this dissertation presents a new transparent checkpointing framework, based on "split processes". The framework uses the hardware virtual memory subsystem of the host CPU to decouple computation state from the external subsystem context. This isolation between the application process and the external subsystem context enables
transparent checkpointing for two diverse, well-known problems: checkpointing of modern CUDA-based programs; and transparent checkpointing of MPI applications, running over a variety of RDMA networks and using different MPI implementations with a single code base. To address the second challenge, this dissertation demonstrates that system reliability and application resilience characteristics can be used to improve system (and individual application) throughput and reduce the
checkpointing I/O overhead.
|
title |
Extending the domain of transparent checkpoint-restart for large-scale HPC
|
spellingShingle |
Extending the domain of transparent checkpoint-restart for large-scale HPC
|
title_short |
Extending the domain of transparent checkpoint-restart for large-scale HPC
|
title_full |
Extending the domain of transparent checkpoint-restart for large-scale HPC
|
title_fullStr |
Extending the domain of transparent checkpoint-restart for large-scale HPC
|
title_full_unstemmed |
Extending the domain of transparent checkpoint-restart for large-scale HPC
|
title_sort |
extending the domain of transparent checkpoint-restart for large-scale hpc
|
publishDate |
|
url |
http://hdl.handle.net/2047/D20316244
|
_version_ |
1719407445942468608
|