Distributed system fault tolerance using message logging and checkpointing

Fault tolerance can allow processes executing in a computer system to survive failures within the system. This thesis addresses the theory and practice of transparent fault-tolerance methods using message logging and checkpointing in distributed systems. A general model for reasoning about the behav...

Full description

Bibliographic Details
Main Author:	Johnson, David Bruce
Other Authors:	Zwaenepoel, Willy
Format:	Others
Language:	English
Published:	2009
Subjects:	Computer Science
Online Access:	http://hdl.handle.net/1911/16354

id	ndltd-RICE-oai-scholarship.rice.edu-1911-16354
record_format	oai_dc
spelling	ndltd-RICE-oai-scholarship.rice.edu-1911-163542013-10-23T04:08:49ZDistributed system fault tolerance using message logging and checkpointingJohnson, David BruceComputer ScienceFault tolerance can allow processes executing in a computer system to survive failures within the system. This thesis addresses the theory and practice of transparent fault-tolerance methods using message logging and checkpointing in distributed systems. A general model for reasoning about the behavior and correctness of these methods is developed, and the design, implementation, and performance of two new low-overhead methods based on this model are presented. No specialized hardware is required with these new methods. The model is independent of the protocols used in the system. Each process state is represented by a dependency vector, and each system state is represented by a dependency matrix showing a collection of process states. The set of system states that have occurred during any single execution of a system forms a lattice, with the sets of consistent and recoverable system states as sublattices. There is thus always a unique maximum recoverable system state. The first method presented uses a new pessimistic message logging protocol called sender-based message logging. Each message is logged in the local volatile memory of the machine from which it was sent, and the order in which the message was received is returned to the sender as a receive sequence number. Message logging overlaps execution of the receiver, until the receiver attempts to send a new message. Implemented in the V-System, the maximum measured failure-free overhead on distributed application programs was under 16 percent, and average overhead measured 2 percent or less, depending on problem size and communication intensity. Optimistic message logging can outperform pessimistic logging, since message logging occurs asynchronously. A new optimistic message logging system is presented that guarantees to find the maximum possible recoverable system state, which is not ensured by previous optimistic methods. All logged messages and checkpoints are utilized, and thus some messages received by a process before it was checkpointed may not need to be logged. Although failure recovery using optimistic message logging is more difficult, failure-free application overhead using this method ranged form only a maximum of under 4 percent to much less than 1 percent.Zwaenepoel, Willy2009-06-03T23:56:33Z2009-06-03T23:56:33Z1990ThesisText133 p.application/pdfhttp://hdl.handle.net/1911/16354eng
collection	NDLTD
language	English
format	Others
sources	NDLTD
topic	Computer Science
spellingShingle	Computer Science Johnson, David Bruce Distributed system fault tolerance using message logging and checkpointing
description	Fault tolerance can allow processes executing in a computer system to survive failures within the system. This thesis addresses the theory and practice of transparent fault-tolerance methods using message logging and checkpointing in distributed systems. A general model for reasoning about the behavior and correctness of these methods is developed, and the design, implementation, and performance of two new low-overhead methods based on this model are presented. No specialized hardware is required with these new methods. The model is independent of the protocols used in the system. Each process state is represented by a dependency vector, and each system state is represented by a dependency matrix showing a collection of process states. The set of system states that have occurred during any single execution of a system forms a lattice, with the sets of consistent and recoverable system states as sublattices. There is thus always a unique maximum recoverable system state. The first method presented uses a new pessimistic message logging protocol called sender-based message logging. Each message is logged in the local volatile memory of the machine from which it was sent, and the order in which the message was received is returned to the sender as a receive sequence number. Message logging overlaps execution of the receiver, until the receiver attempts to send a new message. Implemented in the V-System, the maximum measured failure-free overhead on distributed application programs was under 16 percent, and average overhead measured 2 percent or less, depending on problem size and communication intensity. Optimistic message logging can outperform pessimistic logging, since message logging occurs asynchronously. A new optimistic message logging system is presented that guarantees to find the maximum possible recoverable system state, which is not ensured by previous optimistic methods. All logged messages and checkpoints are utilized, and thus some messages received by a process before it was checkpointed may not need to be logged. Although failure recovery using optimistic message logging is more difficult, failure-free application overhead using this method ranged form only a maximum of under 4 percent to much less than 1 percent.
author2	Zwaenepoel, Willy
author_facet	Zwaenepoel, Willy Johnson, David Bruce
author	Johnson, David Bruce
author_sort	Johnson, David Bruce
title	Distributed system fault tolerance using message logging and checkpointing
title_short	Distributed system fault tolerance using message logging and checkpointing
title_full	Distributed system fault tolerance using message logging and checkpointing
title_fullStr	Distributed system fault tolerance using message logging and checkpointing
title_full_unstemmed	Distributed system fault tolerance using message logging and checkpointing
title_sort	distributed system fault tolerance using message logging and checkpointing
publishDate	2009
url	http://hdl.handle.net/1911/16354
work_keys_str_mv	AT johnsondavidbruce distributedsystemfaulttoleranceusingmessageloggingandcheckpointing
_version_	1716610028600819712

Distributed system fault tolerance using message logging and checkpointing

Similar Items