Summary: | Complex distributed systems play an increasingly important role
in industry and society. Examples include distributed information flow systems,
which continuously acquire, manipulate, and disseminate
information across an enterprise's distributed sites and machines,
and distributed server applications co-deployed in one or more shared data centers,
each with its own performance/availability requirements
that vary over time, and each competing with the others for the shared resources.
Consequently, it is increasingly important for enterprise-scale IT infrastructure to
provide timely, sustained, and reliable delivery and processing of service requests.
This has not become easier, and has arguably become more difficult, despite more than
30 years of progress in distributed computer connectivity, availability, and
reliability~\cite{ReliableDistributedSys}.
The reasons include: the increasing complexity
of enterprise-scale computing infrastructure; the distributed
nature of these systems, which makes them prone to failures,
e.g., because of inevitable Heisenbugs;
the need to consider diverse and complex business objectives and policies,
including risk preferences and attitudes, in enterprise computing;
conflicts between performance and availability, and the varying importance of
the sub-systems in an enterprise's distributed infrastructure that compete for
resources in today's typical shared environments; and
the best-effort nature of resources such as network resources, which makes
resource availability itself an issue.
This thesis proposes a novel business policy-driven, risk-based approach to automated
availability management. It uses an automated decision engine to make availability
decisions that meet business policies while optimizing overall system utility,
uses utility theory to capture users' risk attitudes,
and addresses the potentially conflicting business goals and resource demands in
enterprise-scale distributed systems.
For critical and complex enterprise applications,
a key contributor to application utility is the time taken to
recover from failures. We therefore develop a novel proactive fault tolerance approach
that uses online failure prediction to dynamically determine the amount of
additional processing and communication resources (i.e., costs) to use
in order to attain certain levels of utility and acceptable delays in failure
recovery.
Since resource availability itself is often not guaranteed in typical shared enterprise
IT environments, this thesis provides IQ-Paths with probabilistic
service guarantees to address the dynamic network
behavior of realistic enterprise computing environments.
The risk-based formulation effectively links
the operational guarantees, expressed as utility and
enforced by the PGOS algorithm, with the higher-level business objectives sought
by end users.
Taken together, this thesis proposes a novel availability management framework and
methods for large-scale enterprise applications and systems, with the goal of providing
different levels of performance/availability guarantees for multiple applications and
sub-systems in a complex shared distributed computing infrastructure. More specifically,
this thesis addresses the following problems. For data center environments,
(1) how to provide availability management for applications and systems that
vary in both resource requirements and in their importance to the enterprise,
based on both operational-level quantities and business-level objectives;
(2) how to deal with managerial policies such as risk attitude; and
(3) how to deal with the tradeoff between performance and availability,
given limited resources in a typical data center.
Since realistic business settings extend beyond single data centers, a second
set of problems addressed in this thesis concerns predictable and reliable
operation in wide-area settings. For such systems, we explore (4) how to
provide high availability in widely distributed operational systems with
low-cost fault tolerance mechanisms, and (5) how to provide probabilistic
service guarantees given best-effort network resources.