Transparent checkpointing over RDMA-based networks

Fault tolerance for large-scale applications has long been an area of active research, as the size of the computation keeps growing. One of the components of a fault-tolerance strategy is checkpointing. However, no explicit checkpoint-restart solution has been available for applications running over...

Bibliographic Details
Published:
Online Access: http://hdl.handle.net/2047/D20290419
id ndltd-NEU--neu-cj82rk49b
record_format oai_dc
spelling ndltd-NEU--neu-cj82rk49b 2021-04-13T05:14:15Z Transparent checkpointing over RDMA-based networks Fault tolerance for large-scale applications has long been an area of active research, as the size of the computation keeps growing. One of the components of a fault-tolerance strategy is checkpointing. However, no explicit checkpoint-restart solution has been available for applications running over RDMA-based networks. RDMA-based networks are the primary networks used in high-performance computing, and many researchers believe that RDMA networks will be widely deployed in the Cloud as costs decrease. Existing approaches often rely on a solution that is specific to the particular MPI implementation or other parallel model in order to disconnect the network at checkpoint time and to reconnect it at restart time. Such schemes are difficult to adapt to new parallel programming models, and they also incur higher checkpoint overhead. http://hdl.handle.net/2047/D20290419
collection NDLTD
sources NDLTD
description Fault tolerance for large-scale applications has long been an area of active research, as the size of the computation keeps growing. One of the components of a fault-tolerance strategy is checkpointing. However, no explicit checkpoint-restart solution has been available for applications running over RDMA-based networks. RDMA-based networks are the primary networks used in high-performance computing, and many researchers believe that RDMA networks will be widely deployed in the Cloud as costs decrease. Existing approaches often rely on a solution that is specific to the particular MPI implementation or other parallel model in order to disconnect the network at checkpoint time and to reconnect it at restart time. Such schemes are difficult to adapt to new parallel programming models, and they also incur higher checkpoint overhead.
title Transparent checkpointing over RDMA-based networks
publishDate
url http://hdl.handle.net/2047/D20290419
_version_ 1719395779211165696