Risk-based proactive availability management - attaining high performance and resilience with dynamic self-management in Enterprise Distributed Systems

Complex distributed systems, such as distributed information flow systems, which continuously acquire, manipulate, and disseminate information across an enterprise's distributed sites and machines, and distributed server applications co-deployed in one or multiple shared data centers, with eac...


Bibliographic Details
Main Author:	Cai, Zhongtang
Published:	Georgia Institute of Technology 2008
Subjects:	High performance; Routing and scheduling; Overlay; Autonomic computing; High availability; Distributed systems; Enterprise application integration (Computer systems); Electronic data processing > Distributed processing; Expert systems (Computer science); Decision making > Automation; Fault-tolerant computing
Online Access:	http://hdl.handle.net/1853/22581

id	ndltd-GATECH-oai-smartech.gatech.edu-1853-22581
record_format	oai_dc
collection	NDLTD
sources	NDLTD
topic	High performance; Routing and scheduling; Overlay; Autonomic computing; High availability; Distributed systems; Enterprise application integration (Computer systems); Electronic data processing--Distributed processing; Expert systems (Computer science); Decision making--Automation; Fault-tolerant computing
description	Complex distributed systems play an increasingly critical role in industry and society. Examples include distributed information flow systems, which continuously acquire, manipulate, and disseminate information across an enterprise's distributed sites and machines, and distributed server applications co-deployed in one or more shared data centers, each with performance and availability requirements that vary over time and each competing with the others for shared resources. Consequently, it is increasingly important for enterprise-scale IT infrastructure to provide timely, sustained, and reliable delivery and processing of service requests. Despite more than 30 years of progress in distributed computer connectivity, availability, and reliability, this has not become easier and has arguably become more difficult \cite{ReliableDistributedSys}. The reasons include the increasing complexity of enterprise-scale computing infrastructure; the distributed nature of these systems, which makes them prone to failures, for example because of inevitable Heisenbugs; the need to consider diverse and complex business objectives and policies, including risk preferences and attitudes, in enterprise computing; conflicts between performance and availability; the varying importance of the sub-systems in an enterprise's distributed infrastructure, which compete for resources in today's typical shared environments; and the best-effort nature of resources such as network resources, which makes resource availability itself an issue.
This thesis proposes a novel business-policy-driven, risk-based approach to automated availability management. It uses an automated decision engine to make availability decisions that meet business policies while optimizing overall system utility, uses utility theory to capture users' risk attitudes, and addresses the potentially conflicting business goals and resource demands in enterprise-scale distributed systems. For critical and complex enterprise applications, a key contributor to application utility is the time taken to recover from failures. We therefore develop a novel proactive fault tolerance approach that uses online failure prediction to dynamically determine the acceptable amount of additional processing and communication resources (i.e., costs) to use in order to attain given levels of utility and acceptable delays in failure recovery. Because resource availability itself is often not guaranteed in typical shared enterprise IT environments, this thesis provides IQ-Paths with probabilistic service guarantees to address dynamic network behavior in realistic enterprise computing environments. The risk-based formulation serves as an effective way to link the operational guarantees expressed by utility and enforced by the PGOS algorithm with the higher-level business objectives sought by end users. Together, these contributions form a novel availability management framework and set of methods for large-scale enterprise applications and systems, with the goal of providing different levels of performance and availability guarantees for multiple applications and sub-systems in a complex, shared, distributed computing infrastructure. More specifically, this thesis addresses the following problems.
For data center environments, the problems are (1) how to provide availability management for applications and systems that vary both in resource requirements and in importance to the enterprise, based on both operational-level quantities and business-level objectives; (2) how to deal with managerial policies such as risk attitude; and (3) how to deal with the tradeoff between performance and availability given the limited resources of a typical data center. Since realistic business settings extend beyond single data centers, a second set of problems addressed in this thesis concerns predictable and reliable operation in wide-area settings. For such systems, we explore (4) how to provide high availability in widely distributed operational systems with low-cost fault tolerance mechanisms, and (5) how to provide probabilistic service guarantees given best-effort network resources.
author	Cai, Zhongtang
title	Risk-based proactive availability management - attaining high performance and resilience with dynamic self-management in Enterprise Distributed Systems
publisher	Georgia Institute of Technology
publishDate	2008
url	http://hdl.handle.net/1853/22581
work_keys_str_mv	AT caizhongtang riskbasedproactiveavailabilitymanagementattaininghighperformanceandresiliencewithdynamicselfmanagementinenterprisedistributedsystems
_version_	1716474820581916672
spelling	ndltd-GATECH-oai-smartech.gatech.edu-1853-22581 2013-01-07T20:25:48Z Risk-based proactive availability management - attaining high performance and resilience with dynamic self-management in Enterprise Distributed Systems Cai, Zhongtang Georgia Institute of Technology 2008-06-10T20:38:31Z 2008-06-10T20:38:31Z 2008-01-10 Dissertation http://hdl.handle.net/1853/22581


Similar Items