Summary: | Spotify is a music streaming service that offers access to a vast music catalogue; it counts more than 24 million active users in 28 different countries. Spotify's backend is made up by a constellation of independent loosely-coupled services; each service consists of a set of replicas, running on a set of servers in multiple data centers: each request to a service needs to be routed to an appropriate replica. Balancing the load across replicas is crucial to exploit available resources in the best possible way, and to provide optimal performances to clients. The main aim of this project is exploring the possibility of developing a load balancing algorithm that exploits request-reply latencies as its only load index. There are two reasons why latency is an appealing load index: in the first place it has a significant impact on the experience of Spotify users; in the second place, identifying a good load index in a distributed system presents significant challenges due to phenomena that might arise from the interaction of the different system components such as multi-bottlenecks. The use of latency as load index is even more attractive under this light, because it allows for a simple black box model where it is not necessary to model resource usage patterns and bottlenecks of every single service individually: modeling each system would be an impractical task, due both to the number of services and to the speed at which these services evolve. In this work, we justify the choice of request-reply latency as a load indicator, by presenting empirical evidence that it correlates well with known reliable load index obtained through a white box approach. In order to assess the correlation between latency and a known load index obtained through a white box approach, we present measurements from the production environment and from an ad-hoc test environment. We present the design of a novel load balancing algorithm based on a modified ' accrual failure detector that exploits request-reply latency as an indirect measure of the load on individual backends; we analyze the algorithm in detail, providing an overview of potential pitfalls and caveats; we also provide an empirical evaluation of our algorithm, compare its performances to a pure round-robin scheduling discipline and discuss which parameters can be tuned and how they affect the overall behavior of the load balancer.
|