Otherworld - Giving Applications a Chance to Survive OS Kernel Crashes

The default behavior of all commodity operating systems today is to restart the system when a critical error is encountered in the kernel. This terminates all running applications with an attendant loss of "work in progress" that is non-persistent. Our thesis is that an operating system ke...

Bibliographic Details
Main Author: Depoutovitch, Alexandre
Other Authors: Stumm, Michael
Language: en_ca
Published: 2011
Subjects: Operating Systems; Reliability; Fault Tolerance; Microreboot
Online Access: http://hdl.handle.net/1807/31733
id ndltd-TORONTO-oai-tspace.library.utoronto.ca-1807-31733
record_format oai_dc
spelling ndltd-TORONTO-oai-tspace.library.utoronto.ca-1807-31733
2013-04-19T19:56:41Z
Otherworld - Giving Applications a Chance to Survive OS Kernel Crashes
Depoutovitch, Alexandre
Operating Systems
Reliability
Fault Tolerance
Microreboot
0984
Stumm, Michael
2011-11
2012-01-06T16:08:46Z
NO_RESTRICTION
2012-01-06T16:08:46Z
2012-01-06
Thesis
http://hdl.handle.net/1807/31733
en_ca
collection NDLTD
language en_ca
sources NDLTD
topic Operating Systems
Reliability
Fault Tolerance
Microreboot
0984
description The default behavior of all commodity operating systems today is to restart the system when a critical error is encountered in the kernel. This terminates all running applications, with an attendant loss of non-persistent "work in progress". Our thesis is that an operating system kernel is simply one component of a larger software system, logically well isolated from other components such as applications, and that it should therefore be possible to reboot the kernel without terminating everything else running on the same system. To prove this thesis, we designed and implemented a new mechanism, called Otherworld, that microreboots the operating system kernel when a critical error is encountered in the kernel, and does so without clobbering the state of the running applications. After the kernel microreboot, Otherworld attempts to resurrect the applications that were running at the time of failure. It does so by restoring the application memory spaces, open files, and other resources. In the default case, it then continues executing the processes from the point at which they were interrupted by the failure. Optionally, applications can have user-level recovery procedures registered with the kernel, in which case Otherworld passes control to these procedures after having restored their process state. Recovery procedures might check the integrity of application data and restore resources that Otherworld was not able to restore. We implemented Otherworld in Linux, but we believe that the technique can be applied to all commodity operating systems. In an extensive set of experiments on real-world applications (MySQL, Apache/PHP, Joe, vi), we show that Otherworld is capable of successfully microrebooting the kernel and restoring the applications in over 97% of the cases. In the default case, Otherworld adds negligible overhead to normal execution. In an enhanced mode, Otherworld can provide extra application memory protection with overhead of between 4% and 12%.
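The resurrection step described above (restore process state, then either resume the process or hand control to a registered recovery procedure) can be pictured with a small user-space model. This is only an illustrative sketch, not Otherworld's actual kernel interface: the `Process` class, the `resurrect` function, the `check_counter` procedure, and the outcome labels are all hypothetical names introduced here, and "restoring" is modelled as a plain copy of in-memory data.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Optional


@dataclass
class Process:
    """Toy stand-in for an application process (hypothetical model)."""
    name: str
    memory: Dict[str, int] = field(default_factory=dict)
    open_files: List[str] = field(default_factory=list)
    # Optional user-level recovery procedure; returns True if the
    # application judges its own data to be intact.
    recovery: Optional[Callable[["Process"], bool]] = None


def resurrect(crashed: List[Process]) -> Dict[str, str]:
    """Model the post-microreboot step for every process that was running."""
    outcome: Dict[str, str] = {}
    for old in crashed:
        # "Restore" memory and open files by copying them into a fresh
        # object, standing in for the new kernel rebuilding process state.
        new = Process(old.name, dict(old.memory), list(old.open_files), old.recovery)
        if new.recovery is None:
            # Default case: continue from the point of interruption.
            outcome[new.name] = "resumed"
        elif new.recovery(new):
            # The registered procedure accepted the restored state.
            outcome[new.name] = "recovered"
        else:
            # The registered procedure found the state unusable.
            outcome[new.name] = "failed"
    return outcome


def check_counter(proc: Process) -> bool:
    """Example integrity check: the counter must exist and be non-negative."""
    return proc.memory.get("counter", -1) >= 0


if __name__ == "__main__":
    procs = [
        Process("editor", {"cursor": 10}, ["notes.txt"]),
        Process("db", {"counter": 42}, ["data.db"], recovery=check_counter),
        Process("cache", {"counter": -5}, [], recovery=check_counter),
    ]
    print(resurrect(procs))
```

In this model a process without a recovery procedure is simply resumed, while one with a procedure gets the chance to validate its restored data first, which mirrors the optional user-level recovery path mentioned in the description.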
author2 Stumm, Michael
author_facet Stumm, Michael
Depoutovitch, Alexandre
author Depoutovitch, Alexandre
author_sort Depoutovitch, Alexandre
title Otherworld - Giving Applications a Chance to Survive OS Kernel Crashes
publishDate 2011
url http://hdl.handle.net/1807/31733