Risk-based proactive availability management - attaining high performance and resilience with dynamic self-management in Enterprise Distributed Systems

Bibliographic Details
Main Author: Cai, Zhongtang
Published: Georgia Institute of Technology 2008
Subjects:
Online Access: http://hdl.handle.net/1853/22581
Description
Summary: Complex distributed systems play an increasingly important role in industry and society. Examples include distributed information flow systems, which continuously acquire, manipulate, and disseminate information across an enterprise's distributed sites and machines, and distributed server applications co-deployed in one or more shared data centers, each with performance and availability requirements that vary over time and each competing with the others for shared resources. Consequently, it is increasingly important for enterprise-scale IT infrastructure to provide timely, sustained, and reliable delivery and processing of service requests. Despite more than 30 years of progress in distributed computer connectivity, availability, and reliability, this has not become easier and may have become more difficult \cite{ReliableDistributedSys}. The reasons include: the increasing complexity of enterprise-scale computing infrastructure; the distributed nature of these systems, which makes them prone to failures, e.g., because of inevitable Heisenbugs; the need to consider diverse and complex business objectives and policies, including risk preferences and attitudes; conflicts between performance and availability; the varying importance of the sub-systems in an enterprise's distributed infrastructure, which compete for resources in today's typical shared environments; and the best-effort nature of resources such as network resources, which makes resource availability itself an issue.
This thesis proposes a novel business-policy-driven, risk-based approach to automated availability management. It uses an automated decision engine to make availability decisions that meet business policies while optimizing overall system utility, uses utility theory to capture users' risk attitudes, and addresses the potentially conflicting business goals and resource demands in enterprise-scale distributed systems. For critical and complex enterprise applications, a key contributor to application utility is the time taken to recover from failures. We therefore develop a novel proactive fault tolerance approach that uses online failure prediction to dynamically determine how much additional processing and communication resource (i.e., cost) is acceptable to attain given levels of utility and acceptable delays in failure recovery. Since resource availability itself is often not guaranteed in typical shared enterprise IT environments, this thesis provides IQ-Paths with probabilistic service guarantees to address dynamic network behavior in realistic enterprise computing environments. The risk-based formulation is an effective way to link the operational guarantees expressed by utility and enforced by the PGOS algorithm with the higher-level business objectives sought by end users. Together, these contributions form a novel availability management framework and set of methods for large-scale enterprise applications and systems, with the goal of providing different levels of performance and availability guarantees for multiple applications and sub-systems in a complex, shared distributed computing infrastructure. More specifically, this thesis addresses the following problems.
For data center environments, the problems are (1) how to provide availability management for applications and systems that vary both in their resource requirements and in their importance to the enterprise, based on both operational-level quantities and business-level objectives; (2) how to account for managerial policies such as risk attitude; and (3) how to manage the tradeoff between performance and availability, given the limited resources of a typical data center. Since realistic business settings extend beyond single data centers, a second set of problems addressed in this thesis concerns predictable and reliable operation in wide-area settings. For such systems, we explore (4) how to provide high availability in widely distributed operational systems with low-cost fault tolerance mechanisms, and (5) how to provide probabilistic service guarantees given best-effort network resources.