Developing a Fault-Tolerance Self-Recovery System

Bibliographic Details
Main Authors: Cheng, Weian (鄭惟安)
Other Authors: Yeh, Tsozen
Format: Others
Language: zh-TW
Published: 2011
Online Access: http://ndltd.ncl.edu.tw/handle/47269631014676791088
Description
Summary: Master's thesis, Fu Jen Catholic University (輔仁大學), Department of Computer Science and Information Engineering (資訊工程學系), academic year 99 (ROC calendar). Software failures can cause the loss of important data, so handling them is an important issue, and many studies have tried to address it. Checkpointing is a technique for improving the fault tolerance of software. While a program runs, selected program states and information are saved in a checkpoint at appropriate times. If a software failure occurs, the program rolls back to the checkpoint and re-executes from there. Because an earlier state is restored, execution resumes from the saved point rather than from the beginning, which saves time compared with restarting the program.

Many studies use checkpoint-based recovery mechanisms, but most of them, when a failure occurs, restore the program state from the most recent checkpoint and re-execute it in a new process. Mechanisms of this kind handle unexpected runtime errors such as transient errors. If the crash is caused by incorrect data entered by the user, however, the incorrect data is still present after rolling back to the most recent checkpoint. The program crashes, rolls back, crashes, rolls back, and so on: the program and the recovery mechanism fall into an infinite loop.

This thesis proposes a multi-checkpoint recovery system that handles incorrect user input and does not require the program to be recompiled. The system takes a checkpoint of the program each time the user enters data. Because a program may accept input more than once, a single program may accumulate many checkpoints. If an input causes the program to crash, the recovery system displays the data the user entered earlier. With this information, the user can judge which input caused the crash, choose the checkpoint associated with that input, and roll back to enter new data, so that the program avoids repeating the same error.
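The recovery scheme is described only in prose, so here is a minimal Python sketch of the idea under loudly stated assumptions. Program state is a small in-memory dict copied with `copy.deepcopy`, whereas the thesis's system checkpoints a running program without recompiling it, which this sketch does not attempt. The names `InputCheckpointRecovery`, `naive_recover`, `roll_back_and_replace`, and the toy divide-by-input "program step" are all hypothetical. The sketch contrasts restoring only the latest checkpoint, which replays the same bad input and crashes again, with keeping one checkpoint per input so the user can pick which input to replace.

```python
import copy
from dataclasses import dataclass, field


@dataclass
class Checkpoint:
    """State saved just before one user input is applied (hypothetical layout)."""
    index: int        # position of the input within the session
    user_input: str   # the raw data the user entered
    state: dict       # deep copy of the program state before that input


@dataclass
class InputCheckpointRecovery:
    """Toy model of input-triggered, multi-checkpoint recovery.

    Assumption: program state is a small in-memory dict snapshotted with
    copy.deepcopy; a real system would checkpoint the whole process.
    """
    state: dict = field(default_factory=lambda: {"total": 0.0, "count": 0})
    checkpoints: list = field(default_factory=list)

    def apply(self, user_input: str) -> None:
        # Hypothetical program step: add 100 / value to a running total.
        # Non-numeric input raises ValueError and "0" raises ZeroDivisionError;
        # either exception stands in for a crash caused by bad user data.
        value = float(user_input)
        self.state["total"] += 100.0 / value
        self.state["count"] += 1

    def feed(self, user_input: str) -> None:
        # One checkpoint per input, taken before the input is applied.
        self.checkpoints.append(
            Checkpoint(len(self.checkpoints), user_input, copy.deepcopy(self.state))
        )
        self.apply(user_input)

    def input_history(self) -> list:
        # What is shown to the user after a crash: every past input by index.
        return [(cp.index, cp.user_input) for cp in self.checkpoints]

    def roll_back_and_replace(self, index: int, new_input: str, later_inputs=()) -> None:
        # Restore the state saved before input `index`, drop that checkpoint and
        # all newer ones, apply the corrected input, then re-enter later inputs.
        self.state = copy.deepcopy(self.checkpoints[index].state)
        del self.checkpoints[index:]
        self.feed(new_input)
        for text in later_inputs:
            self.feed(text)


def naive_recover(session: InputCheckpointRecovery, max_attempts: int = 3) -> bool:
    """Conventional recovery: restore the latest checkpoint and replay its input.

    The replayed input is the same bad data, so every attempt crashes again;
    only the max_attempts cap stops the crash/roll-back loop.
    """
    latest = session.checkpoints[-1]
    for _ in range(max_attempts):
        session.state = copy.deepcopy(latest.state)
        try:
            session.apply(latest.user_input)
            return True
        except (ValueError, ZeroDivisionError):
            continue
    return False


if __name__ == "__main__":
    session = InputCheckpointRecovery()
    for text in ["4", "5", "0"]:  # the third input crashes the program
        try:
            session.feed(text)
        except (ValueError, ZeroDivisionError):
            break
    print(naive_recover(session))    # False: same input, same crash
    print(session.input_history())   # [(0, '4'), (1, '5'), (2, '0')]
    session.roll_back_and_replace(2, "10")
    print(session.state)             # {'total': 55.0, 'count': 3}
```

In this sketch, one checkpoint per input is what breaks the loop: the user, not the recovery code, decides which input to replace, so the bad data is no longer replayed. Rolling back to an earlier checkpoint discards every later input; letting the user re-enter them through `later_inputs` is a design choice of this sketch, not something the abstract specifies.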