Summary: | As the Internet carries more and more mission-critical services such as voice-over-IP (VoIP), multiplayer games, and video conferencing, its reliability becomes crucial to maintaining the performance of those services. For these applications, networks must guarantee very low packet loss and low delay at the data plane, while at the control plane, routing protocols must achieve high network availability and react quickly after a failure occurs. Empirical studies have shown that end-to-end Internet path failures are widespread. They are caused either by links, nodes, or interfaces becoming unavailable or by administrative changes, such as link weight changes. Furthermore, studies have shown that routing convergence times range from a few hundred milliseconds to as long as 30 minutes in the worst case. During routing convergence, routers may be in transient states that degrade data plane performance: a packet might traverse a longer path or even a loop. Worse still, packets may be dropped even though their destinations are reachable. However, the transient behavior of routing protocols has so far received little attention in the research community. Understanding this transient behavior is critical for improving network performance and for the wide deployment of interactive services in the Internet. In this dissertation, we study the impact of transient routing behavior on end-to-end path performance. First, we use an analytical approach to model transient routing behavior and to understand how routing events, routing policies, iBGP configuration, and network topology affect end-to-end path performance. Second, we use active probing and controlled routing updates to measure the impact of routing dynamics on end-to-end path performance in the Internet. 
We find that routing instability contributes significantly to end-to-end failures and is responsible for most long-lasting path failures. Third, based on our analysis and measurement, we present an alternate approach that exploits the existence of loop-free alternate paths to overcome traffic-disrupting failures caused by transient routing behavior. Finally, to minimize the impact of such failures (e.g., by helping network operators detect routing failures more quickly), we present a real-time diagnosis tool for detecting routing outages, and we demonstrate the feasibility of detecting and diagnosing routing problems in real time.
|