Solving Practical Problems in Datacenter Networks

The soaring demands for always-on and fast-response online services have driven modern datacenter networks to undergo tremendous growth. These networks often rely on scale-out designs with large numbers of commodity switches to reach immense capacity while keeping capital expenses under che...

Bibliographic Details
Main Author:	Wu, Xin
Other Authors:	Yang, Xiaowei
Published:	2013
Subjects:	Computer science Adaptive Routing Datacenter Network Failure Mitigation Multipath Transport Protocol
Online Access:	http://hdl.handle.net/10161/8201

id	ndltd-DUKE-oai-dukespace.lib.duke.edu-10161-8201
record_format	oai_dc
spelling	ndltd-DUKE-oai-dukespace.lib.duke.edu-10161-8201 2013-12-18T03:31:30Z Solving Practical Problems in Datacenter Networks Wu, Xin Computer science Adaptive Routing Datacenter Network Failure Mitigation Multipath Transport Protocol Dissertation Yang, Xiaowei 2013 Dissertation http://hdl.handle.net/10161/8201
collection	NDLTD
sources	NDLTD
topic	Computer science Adaptive Routing Datacenter Network Failure Mitigation Multipath Transport Protocol
spellingShingle	Computer science Adaptive Routing Datacenter Network Failure Mitigation Multipath Transport Protocol Wu, Xin Solving Practical Problems in Datacenter Networks
description	<p>The soaring demand for always-on, fast-response online services has driven modern datacenter networks to undergo tremendous growth. These networks often rely on scale-out designs with large numbers of commodity switches to reach immense capacity while keeping capital expenses in check. Today, datacenter network operators spend tremendous time and effort on two key challenges: 1) how to efficiently utilize the bandwidth connecting host pairs and 2) how to promptly handle network failures with minimal disruption to the hosted services.</p><p>To resolve the first challenge, we propose solutions at both the network layer and the transport layer. For the network layer, we advocate designing practical datacenter architectures for easy operation: an architecture should be reliable, capable of improving bisection bandwidth, scalable, and debugging-friendly. Following these four guidelines, we propose DARD, a Distributed Adaptive Routing architecture for Datacenter networks. DARD allows each end host to reallocate traffic from overloaded paths to underloaded paths without central coordination. We use congestion game theory to show that DARD converges to a Nash equilibrium in a finite number of steps and that its gap to the optimal flow allocation is bounded on the order of 1/log L, where L is the number of links. We use a testbed implementation and simulations to show that DARD can achieve a close-to-optimal flow allocation with small control overhead in practice.</p><p>For the transport layer, we propose the Explicit Multipath Congestion Control Protocol (MPXCP), which achieves four desirable properties: fast convergence, efficiency, fairness to flows with different RTTs, and negligible queue size. Extensive ns-2 simulations show that MPXCP can quickly converge to efficiency and fairness without building up queues, despite different delay-bandwidth products.</p><p>To resolve the second challenge, recent research efforts have focused on automatic failure localization. Yet resolving failures still requires significant human intervention, resulting in prolonged failure recovery time. Unlike previous work, we propose NetPilot, a system that aims to quickly mitigate rather than resolve failures. NetPilot mitigates failures in much the same way operators do: by deactivating or restarting suspected offending components. NetPilot circumvents the need to know the exact root cause of a failure by taking an intelligent trial-and-error approach. The core of NetPilot consists of an Impact Estimator that helps guard against overly disruptive mitigation actions and a failure-specific mitigation planner that minimizes the number of trials. We demonstrate that NetPilot can effectively mitigate several types of critical failures commonly encountered in production datacenter networks.</p> === Dissertation
author2	Yang, Xiaowei
author_facet	Yang, Xiaowei Wu, Xin
author	Wu, Xin
author_sort	Wu, Xin
title	Solving Practical Problems in Datacenter Networks
title_short	Solving Practical Problems in Datacenter Networks
title_full	Solving Practical Problems in Datacenter Networks
title_fullStr	Solving Practical Problems in Datacenter Networks
title_full_unstemmed	Solving Practical Problems in Datacenter Networks
title_sort	solving practical problems in datacenter networks
publishDate	2013
url	http://hdl.handle.net/10161/8201
work_keys_str_mv	AT wuxin solvingpracticalproblemsindatacenternetworks
_version_	1716619875902816256

Similar Items