Transparent checkpointing over RDMA-based networks

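Fault tolerance for large-scale applications has long been an area of active research, as the size of the computation keeps growing. One of the components of a fault-tolerance strategy is checkpointing. However, no explicit checkpoint-restart solution has been available for applications running over RDMA-based networks. RDMA-based networks are the primary network used in high-performance computing, and many researchers believe that RDMA networks will be widely deployed in the Cloud as the costs decrease. Existing approaches often rely on a solution that is specific to the particular MPI implementation or other parallel model in order to disconnect the network at checkpoint time, and to reconnect the network at restart time. Such schemes are difficult to incorporate for new parallel programming models, and also imply higher checkpoint overhead.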

Bibliographic Details
Published:
Online Access:	http://hdl.handle.net/2047/D20290419

id	ndltd-NEU--neu-cj82rk49b
record_format	oai_dc
spelling	ndltd-NEU--neu-cj82rk49b 2021-04-13T05:14:15Z Transparent checkpointing over RDMA-based networks http://hdl.handle.net/2047/D20290419
collection	NDLTD
sources	NDLTD
description	Fault tolerance for large-scale applications has long been an area of active research, as the size of the computation keeps growing. One of the components of a fault-tolerance strategy is checkpointing. However, no explicit checkpoint-restart solution has been available for applications running over RDMA-based networks. RDMA-based networks are the primary network used in high-performance computing, and many researchers believe that RDMA networks will be widely deployed in the Cloud as the costs decrease. Existing approaches often rely on a solution that is specific to the particular MPI implementation or other parallel model in order to disconnect the network at checkpoint time, and to reconnect the network at restart time. Such schemes are difficult to incorporate for new parallel programming models, and also imply higher checkpoint overhead.
title	Transparent checkpointing over RDMA-based networks
publishDate
url	http://hdl.handle.net/2047/D20290419
_version_	1719395779211165696

Similar Items