Summary: | The report describes a research investigation into fault-tolerant strategies within a real-time control system. Methods for increasing system reliability other than fault tolerance are also reviewed. The study, which concentrated on a Recovery Block structure, is divided into two parts: a single-processor system and a distributed processing system. The single-processor study involved modelling a subset of the control system; error recovery strategies are presented as additions to the basic Recovery Block structure. Fault injection logic was specially designed and built so that the recovery strategies could be tested under extreme operating conditions. The distributed processing study extends the single-processor research. Three types of recovery are investigated to increase system availability: local recovery, global recovery and task swapping. The philosophy adopted in the distributed processing study was always to attempt recovery on a local basis, that is, to prevent the propagation of faults to other microprocessors within the system. Global recovery is established as a method of maintaining continued safe operation when local recovery or communication between processors fails. The use of a standby processor system for dynamic task swapping is shown to give continued system operation under conditions that would normally cause a catastrophic crash in non-redundant systems. The overall conclusion of the research is that fault recovery must be localised to prevent fault propagation from one process to the next, regardless of whether the communicating processes are in the same or different microprocessor subsystems, and that this can be achieved successfully in a real-time environment by the use of a Recovery Block structure.
|