A Middleware for Self-Managing Large-Scale Systems

This thesis investigates designs that enable individual components of a distributed system to work together and coordinate their actions towards a common goal. While the basic motivation for our research is to develop engineering principles for large-scale autonomous systems, we address the problem...

Full description

Bibliographic Details
Main Author: Adam, Constantin
Format: Doctoral Thesis
Language:English
Published: KTH, Skolan för elektro- och systemteknik (EES) 2006
Subjects:
Online Access:http://urn.kb.se/resolve?urn=urn:nbn:se:kth:diva-4178
http://nbn-resolving.de/urn:isbn:978-91-7178-512-1
Description
Summary:This thesis investigates designs that enable individual components of a distributed system to work together and coordinate their actions towards a common goal. While the basic motivation for our research is to develop engineering principles for large-scale autonomous systems, we address the problem in the context of resource management in server clusters that provide web services. To this end, we have developed, implemented and evaluated a decentralized design for resource management that follows four principles. First, in order to facilitate scalability, each node has only partial knowledge of the system. Second, each node can adapt and change its role at runtime. Third, each node runs a number of local control mechanisms independently and asynchronously from its peers. Fourth, each node dynamically adapts its local configuration in order to optimize a global utility function. The design includes three fundamental building blocks: overlay construction, request routing and application placement. Overlay construction organizes the cluster nodes into a single dynamic overlay. Request routing directs service requests towards nodes with available resources. Application placement partitions the cluster resources between applications, and dynamically adjusts the allocation in response to changes in external load, node failures, etc. We have evaluated the design using complexity analysis, simulation and prototype implementation. Using complexity analysis and simulation, we have shown that the system is scalable, operates efficiently in steady state, quickly adapts to external events and allows for effective service differentiation by a system administrator. A prototype has been built using accepted technologies (Java, Tomcat) and evaluated using standard benchmarks (TPC-W and RUBiS). The evaluation results show that the behavior of the prototype matches closely that of the simulated design for key metrics related to adaptability and robustness, therefore validating our design and proving its feasibility. === QC 20100629