Distributed system fault tolerance using message logging and checkpointing

Fault tolerance can allow processes executing in a computer system to survive failures within the system. This thesis addresses the theory and practice of transparent fault-tolerance methods using message logging and checkpointing in distributed systems. A general model for reasoning about the behav...

Bibliographic Details
Main Author: Johnson, David Bruce
Other Authors: Zwaenepoel, Willy
Format: Others
Language: English
Published: 2009
Subjects: Computer Science
Online Access: http://hdl.handle.net/1911/16354
id ndltd-RICE-oai-scholarship.rice.edu-1911-16354
record_format oai_dc
spelling ndltd-RICE-oai-scholarship.rice.edu-1911-16354; 2013-10-23T04:08:49Z; Distributed system fault tolerance using message logging and checkpointing; Johnson, David Bruce; Computer Science; Zwaenepoel, Willy; 2009-06-03T23:56:33Z; 2009-06-03T23:56:33Z; 1990; Thesis; Text; 133 p.; application/pdf; http://hdl.handle.net/1911/16354; eng
collection NDLTD
language English
format Others
sources NDLTD
topic Computer Science
description Fault tolerance can allow processes executing in a computer system to survive failures within the system. This thesis addresses the theory and practice of transparent fault-tolerance methods using message logging and checkpointing in distributed systems. A general model for reasoning about the behavior and correctness of these methods is developed, and the design, implementation, and performance of two new low-overhead methods based on this model are presented. No specialized hardware is required with these new methods. The model is independent of the protocols used in the system. Each process state is represented by a dependency vector, and each system state is represented by a dependency matrix showing a collection of process states. The set of system states that have occurred during any single execution of a system forms a lattice, with the sets of consistent and recoverable system states as sublattices. There is thus always a unique maximum recoverable system state. The first method presented uses a new pessimistic message logging protocol called sender-based message logging. Each message is logged in the local volatile memory of the machine from which it was sent, and the order in which the message was received is returned to the sender as a receive sequence number. Message logging overlaps execution of the receiver, until the receiver attempts to send a new message. In an implementation in the V-System, the maximum measured failure-free overhead on distributed application programs was under 16 percent, and average overhead measured 2 percent or less, depending on problem size and communication intensity. Optimistic message logging can outperform pessimistic logging, since message logging occurs asynchronously. A new optimistic message logging system is presented that is guaranteed to find the maximum possible recoverable system state, which is not ensured by previous optimistic methods. All logged messages and checkpoints are utilized, and thus some messages received by a process before it was checkpointed may not need to be logged. Although failure recovery using optimistic message logging is more difficult, failure-free application overhead using this method ranged from a maximum of under 4 percent to much less than 1 percent.
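The sender-based message logging idea in the description can be illustrated with a small in-memory sketch. This is not the thesis's V-System implementation; it only mirrors the three points the abstract states: the sender keeps each message in its own volatile log, the receiver returns a receive sequence number (RSN), and the receiver may keep executing but may not send until logging is complete. The names `Process`, `LogEntry`, `exchange`, `replay_for`, the `ssn` field, and the explicit `confirm_logged` step are all hypothetical, and the exact blocking rule is an assumption read from the abstract.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class LogEntry:
    """A sent message as kept in the sender's volatile memory."""
    dest: str
    ssn: int                    # send sequence number (hypothetical field)
    data: str
    rsn: Optional[int] = None   # receive sequence number, once known


@dataclass
class Process:
    name: str
    next_ssn: int = 0
    next_rsn: int = 0
    send_log: list = field(default_factory=list)   # sender-side message log
    pending: set = field(default_factory=set)      # received, RSN not yet logged
    delivered: list = field(default_factory=list)

    def send(self, dest: "Process", data: str) -> LogEntry:
        # Assumed blocking rule: no new sends while a received message
        # still lacks a logged RSN at its sender.
        if self.pending:
            raise RuntimeError(f"{self.name}: logging incomplete, send must wait")
        entry = LogEntry(dest.name, self.next_ssn, data)
        self.next_ssn += 1
        self.send_log.append(entry)   # log before the message leaves
        return entry

    def receive(self, sender: "Process", entry: LogEntry) -> int:
        # Deliver the message and assign the next RSN; execution continues
        # while the RSN travels back to the sender.
        rsn = self.next_rsn
        self.next_rsn += 1
        self.delivered.append(entry.data)
        self.pending.add((sender.name, entry.ssn))
        return rsn

    def confirm_logged(self, sender: "Process", entry: LogEntry) -> None:
        # The sender has stored the RSN, so the message is fully logged.
        self.pending.discard((sender.name, entry.ssn))


def exchange(sender: Process, dest: Process, data: str) -> LogEntry:
    """Run the whole send / receive / RSN-return cycle for one message."""
    entry = sender.send(dest, data)
    entry.rsn = dest.receive(sender, entry)   # RSN returned to the sender
    dest.confirm_logged(sender, entry)
    return entry


def replay_for(failed_name: str, processes: list) -> list:
    """Collect fully logged messages for a failed process, in RSN order."""
    entries = [e for p in processes for e in p.send_log
               if e.dest == failed_name and e.rsn is not None]
    return [e.data for e in sorted(entries, key=lambda e: e.rsn)]


def demo() -> list:
    a, b, c = Process("a"), Process("b"), Process("c")
    exchange(a, c, "m1")
    exchange(b, c, "m2")
    exchange(a, c, "m3")
    # If c failed now, its senders' logs give back its input in receive order.
    return replay_for("c", [a, b, c])


if __name__ == "__main__":
    print(demo())
```

In this sketch, recovery gathers the failed process's messages from every sender's log and orders them by RSN, which is why the receiver must not send anything that depends on a message whose RSN has not yet been recorded at the sender.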
author2 Zwaenepoel, Willy
author Johnson, David Bruce
title Distributed system fault tolerance using message logging and checkpointing
publishDate 2009
url http://hdl.handle.net/1911/16354