Holistic Runtime Scheduling for the Distributed Computing Landscape
Summary: | Internet services have become an indispensable part of our lives, with billions of users on a daily basis. Example use cases include services for real-time communication and collaborative editing of documents. Furthermore, there are many hidden, yet omnipresent, use cases such as cashier systems and sensors in industrial facilities. Users expect to use Internet services at any time, at low cost, and with the desired service quality, despite potential load spikes a service might face. A straightforward strategy to provide services with high availability is to allocate dedicated resources for each service. However, this strategy is likely to lead to over-provisioning and increased operating expenses, which contradicts offering services at a low price. A solution to this problem is to leverage resource scheduling to share the underlying resources among many different workloads and services. Sharing the underlying resources is a key enabler for offering highly scalable services while keeping the operating expenses of each service low.
A wide range of resource scheduling systems for the distributed computing landscape has been proposed in the past, covering the application and infrastructure levels. Application-level scheduling focuses on problems such as configuring an application, given a set of resources, to reach high throughput and good service quality. Many application-level resource scheduling systems lack support for runtime scheduling, often due to slow or unsuitable algorithms. Without runtime scheduling, resource scheduling must run in advance for many scenarios and, at best, repeats periodically to update scheduling decisions. This is likely to result in inefficient resource usage. In contrast to application-level scheduling, infrastructure-level scheduling is about orchestrating resources and serving resource requests of various applications, aiming at high resource utilization. Infrastructure-level scheduling leverages generic resource abstractions, e.g., containers and virtual machines, to fulfill these properties. These abstractions make assumptions (e.g., homogeneity, linear resource consumption) to simplify management, but ignore the fact that current distributed computing systems have been evolving in the post-Moore’s law era and that many of these assumptions need to be revised. In particular, the recent trend of new programmable networking devices, ushering in a new area of in-network computing (INC), overtaxes the generic abstractions of compute containers running on servers. The ever-growing demand for Internet services in general, and increasingly heterogeneous resources combined with highly varying demand in particular, require runtime solutions for holistic resource scheduling that cover both the application and infrastructure levels.
This dissertation presents four novel solutions to holistic runtime scheduling for the distributed computing landscape. Two solutions cover the application level and two the infrastructure level. We start with an analysis of the field of resource scheduling for the distributed computing landscape and classify the involved systems, resources, and abstractions. Based on this, we present a classification of INC, which helps to understand the design space of INC resource scheduling. Next, we discuss two scenarios at the application level and demonstrate how runtime scheduling improves resource efficiency. As a first scenario, we consider big data aggregation systems and present ROME, a middleware system to reduce the total aggregation time. ROME automatically analyzes the data stream of the involved aggregation function at runtime and optimizes each node’s responsibilities in the aggregation plan. ROME reduces total aggregation time even compared with manually fine-tuned systems. The second scenario discusses resource scheduling of distributed service function chains. We present STEAM, the first distributed runtime scheduler for this problem, which operates at packet-level granularity without requiring a priori traffic estimates or a global view of the system. Compared with non-runtime solutions to this problem, STEAM achieves better service quality when using the same resources and reduces the amount of resources required to serve the same load.
For the data center infrastructure level, we present two mutually exclusive solutions. Our first solution is IncSched, a system that retrofits existing data center resource schedulers for INC. Based on the proposed classification of INC, IncSched presents a new resource model, translates resource requests to be compliant with the plugged-in retrofitted scheduler, and holds the logic for managing INC resources. IncSched makes existing resource schedulers compatible with INC for the first time, contributing to a broad acceptance of INC. For a holistic integration of INC in data center resource scheduling, we propose HIRE, a full-fledged resource scheduling solution for INC. HIRE extends the resource model of IncSched with automatic augmentation of resource alternatives and incorporates the non-linearity of INC resource usage. HIRE is the first scheduler that combines all server and INC resources in the same scheduling problem to account for interdependencies at the data center level. These novelties make HIRE more successful in satisfying resource requests with INC, finding better placements concerning locality, and reducing tail latencies. We evaluate all solutions using extensive simulations, and some of them also using system prototypes and integrated benchmarks.
In summary, this dissertation proposes four novel solutions for holistic runtime resource scheduling. The contributions underline the importance of runtime resource scheduling for more efficient resource usage. Our contributions to holistic resource scheduling make shared INC available at the data center level for the first time. |