Symmetric active/active high availability for high-performance computing system services

In order to address anticipated high failure rates, reliability, availability and serviceability have become an urgent priority for next-generation high-performance computing (HPC) systems. This thesis aims to pave the way for highly available HPC systems by focusing on their most critical component...

Bibliographic Details
Main Author: Engelmann, Christian
Published: University of Reading 2008
Subjects: 621.39
Online Access: http://ethos.bl.uk/OrderDetails.do?uin=uk.bl.ethos.559245
id ndltd-bl.uk-oai-ethos.bl.uk-559245
record_format oai_dc
spelling ndltd-bl.uk-oai-ethos.bl.uk-559245; 2015-03-20T05:18:18Z; Symmetric active/active high availability for high-performance computing system services; Engelmann, Christian; 2008; 621.39; University of Reading; http://ethos.bl.uk/OrderDetails.do?uin=uk.bl.ethos.559245; Electronic Thesis or Dissertation
collection NDLTD
sources NDLTD
topic 621.39
description In order to address anticipated high failure rates, reliability, availability and serviceability have become an urgent priority for next-generation high-performance computing (HPC) systems. This thesis aims to pave the way for highly available HPC systems by focusing on their most critical components and by reinforcing them with appropriate high availability solutions. Service components, such as head and service nodes, are the "Achilles heel" of a HPC system. A failure typically results in a complete system-wide outage. This thesis targets efficient software state replication mechanisms for service component redundancy to achieve high availability as well as high performance. Its methodology relies on defining a modern theoretical foundation for providing service-level high availability, identifying availability deficiencies of HPC systems, and comparing various service-level high availability methods. This thesis showcases several developed proof-of-concept prototypes providing high availability for services running on HPC head and service nodes using the symmetric active/active replication method, i.e., state-machine replication, to complement prior work in this area using active/standby and asymmetric active/active configurations. Presented contributions include a generic taxonomy for service high availability, an insight into availability deficiencies of HPC systems, and a unified definition of service-level high availability methods.
Further contributions encompass a fully functional symmetric active/active high availability prototype for a HPC job and resource management service that does not require modification of service, a fully functional symmetric active/active high availability prototype for a HPC parallel file system metadata service that offers high performance, and two preliminary prototypes for a transparent symmetric active/active replication software framework for client-service and dependent service scenarios that hide the replication infrastructure from clients and services. Assuming a mean-time to failure of 5,000 hours for a head or service node, all presented prototypes improve service availability from 99.285% to 99.995% in a two-node system, and to 99.99996% with three nodes.
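The availability figures quoted at the end of the description can be reproduced with the standard steady-state formulas, sketched below. This is an illustration and not taken from the thesis: the description gives only the mean time to failure (5,000 hours), so the 36-hour mean time to recover is an assumption, chosen because it yields the quoted single-node value of 99.285%. The function and constant names are hypothetical, and the redundancy formula assumes independent node failures with the service available as long as at least one node is up.

```python
def node_availability(mttf_hours: float, mttr_hours: float) -> float:
    """Steady-state availability of a single node: MTTF / (MTTF + MTTR)."""
    return mttf_hours / (mttf_hours + mttr_hours)


def service_availability(node_avail: float, nodes: int) -> float:
    """Availability of a service replicated on `nodes` independent nodes.

    The service is down only when every node is down at the same time,
    so the unavailability is (1 - node_avail) ** nodes.
    """
    return 1.0 - (1.0 - node_avail) ** nodes


MTTF_HOURS = 5000.0          # stated in the description
ASSUMED_MTTR_HOURS = 36.0    # assumption: not stated in the description

if __name__ == "__main__":
    single = node_availability(MTTF_HOURS, ASSUMED_MTTR_HOURS)
    for n in (1, 2, 3):
        percent = service_availability(single, n) * 100
        print(f"{n} node(s): {percent:.5f}%")
```

Under these assumptions the results round to the 99.285%, 99.995% and 99.99996% given in the description for one, two and three nodes.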
author Engelmann, Christian
title Symmetric active/active high availability for high-performance computing system services
publisher University of Reading
publishDate 2008
url http://ethos.bl.uk/OrderDetails.do?uin=uk.bl.ethos.559245