Solving Practical Problems in Datacenter Networks

<p>The soaring demands for always-on and fast-response online services have driven modern datacenter networks to undergo tremendous growth. These networks often rely on scale-out designs with large numbers of commodity switches to reach immense capacity while keeping capital expenses under che...

Full description

Bibliographic Details
Main Author: Wu, Xin
Other Authors: Yang, Xiaowei
Published: 2013
Subjects:
Online Access:http://hdl.handle.net/10161/8201
id ndltd-DUKE-oai-dukespace.lib.duke.edu-10161-8201
record_format oai_dc
spelling ndltd-DUKE-oai-dukespace.lib.duke.edu-10161-82012013-12-18T03:31:30ZSolving Practical Problems in Datacenter NetworksWu, XinComputer scienceAdaptive RoutingDatacenter NetworkFailure MitigationMultipath Transport Protocol<p>The soaring demands for always-on and fast-response online services have driven modern datacenter networks to undergo tremendous growth. These networks often rely on scale-out designs with large numbers of commodity switches to reach immense capacity while keeping capital expenses under check. Today, datacenter network operators spend tremendous time and efforts on two key challenges: 1) how to efficiently utilize the bandwidth connecting host pairs and 2) how to promptly handle network failures with minimal disruptions to the hosted services.</p><p>To resolve the first challenge, we propose solutions in both network layer and transport layer. In the network layer solution, We advocate to design practical datacenter architectures for easy operation, i.e., an architecture should be reliable, capable of improving bisection bandwidth, scalable and debugging-friendly. By strictly following these four guidelines, We propose DARD, a Distributed Adaptive Routing architecture for Datacenter networks. DARD allows each end host to reallocate traffic from overloaded paths to underloaded paths without central coordination. We use congestion game theory to show that DARD converges to a Nash equilibrium in finite steps and its gap to the optimal flow allocation is bounded in the order of 1/logL, with L being the number of links. We use a testbed implementation and simulations to show that DARD can achieve a close-to-optimal flow allocation with small control overhead in practice.</p><p>In the transport layer solution, We propose Explicit Multipath Congestion Control Protocol (MPXCP), which achieves four desirable properties: fast convergence, efficiency, being fair to flows with different RTTs and negligible queue size. Intensive ns-2 simulation shows that MPXCP can quickly converge to efficiency and fairness without building up queues despite different delay-bandwidth products.</p><p>To resolve the second challenge, recent research efforts have focused on automatic failure localization. Yet, resolving failures still requires significant human interventions, resulting in prolonged failure recovery time. Unlike previous work, we propose NetPilot, a system aims to quickly mitigate rather than resolve failures. NetPilot mitigates failures in much the same way operators do -- by deactivating or restarting suspected offending components. NetPilot circumvents the need for knowing the exact root cause of a failure by taking an intelligent trial-and-error approach. The core of NetPilot is comprised of an Impact Estimator that helps guard against overly disruptive mitigation actions and a failure-specific mitigation planner that minimizes the number of trials. We demonstrate that NetPilot can effectively mitigate several types of critical failures commonly encountered in production datacenter networks.</p>DissertationYang, Xiaowei2013Dissertationhttp://hdl.handle.net/10161/8201
collection NDLTD
sources NDLTD
topic Computer science
Adaptive Routing
Datacenter Network
Failure Mitigation
Multipath Transport Protocol
spellingShingle Computer science
Adaptive Routing
Datacenter Network
Failure Mitigation
Multipath Transport Protocol
Wu, Xin
Solving Practical Problems in Datacenter Networks
description <p>The soaring demands for always-on and fast-response online services have driven modern datacenter networks to undergo tremendous growth. These networks often rely on scale-out designs with large numbers of commodity switches to reach immense capacity while keeping capital expenses under check. Today, datacenter network operators spend tremendous time and efforts on two key challenges: 1) how to efficiently utilize the bandwidth connecting host pairs and 2) how to promptly handle network failures with minimal disruptions to the hosted services.</p><p>To resolve the first challenge, we propose solutions in both network layer and transport layer. In the network layer solution, We advocate to design practical datacenter architectures for easy operation, i.e., an architecture should be reliable, capable of improving bisection bandwidth, scalable and debugging-friendly. By strictly following these four guidelines, We propose DARD, a Distributed Adaptive Routing architecture for Datacenter networks. DARD allows each end host to reallocate traffic from overloaded paths to underloaded paths without central coordination. We use congestion game theory to show that DARD converges to a Nash equilibrium in finite steps and its gap to the optimal flow allocation is bounded in the order of 1/logL, with L being the number of links. We use a testbed implementation and simulations to show that DARD can achieve a close-to-optimal flow allocation with small control overhead in practice.</p><p>In the transport layer solution, We propose Explicit Multipath Congestion Control Protocol (MPXCP), which achieves four desirable properties: fast convergence, efficiency, being fair to flows with different RTTs and negligible queue size. Intensive ns-2 simulation shows that MPXCP can quickly converge to efficiency and fairness without building up queues despite different delay-bandwidth products.</p><p>To resolve the second challenge, recent research efforts have focused on automatic failure localization. Yet, resolving failures still requires significant human interventions, resulting in prolonged failure recovery time. Unlike previous work, we propose NetPilot, a system aims to quickly mitigate rather than resolve failures. NetPilot mitigates failures in much the same way operators do -- by deactivating or restarting suspected offending components. NetPilot circumvents the need for knowing the exact root cause of a failure by taking an intelligent trial-and-error approach. The core of NetPilot is comprised of an Impact Estimator that helps guard against overly disruptive mitigation actions and a failure-specific mitigation planner that minimizes the number of trials. We demonstrate that NetPilot can effectively mitigate several types of critical failures commonly encountered in production datacenter networks.</p> === Dissertation
author2 Yang, Xiaowei
author_facet Yang, Xiaowei
Wu, Xin
author Wu, Xin
author_sort Wu, Xin
title Solving Practical Problems in Datacenter Networks
title_short Solving Practical Problems in Datacenter Networks
title_full Solving Practical Problems in Datacenter Networks
title_fullStr Solving Practical Problems in Datacenter Networks
title_full_unstemmed Solving Practical Problems in Datacenter Networks
title_sort solving practical problems in datacenter networks
publishDate 2013
url http://hdl.handle.net/10161/8201
work_keys_str_mv AT wuxin solvingpracticalproblemsindatacenternetworks
_version_ 1716619875902816256