Understanding Network Failures in Data Centers: Measurement, Analysis, and Implications

Phillipa Gill, University of Toronto, [email protected]
Navendu Jain, Microsoft Research, [email protected]
Nachiappan Nagappan, Microsoft Research, [email protected]

ABSTRACT

We present the first large-scale analysis of failures in a data center network. Through our analysis, we seek to answer several fundamental questions: which devices/links are most unreliable, what causes failures, how do failures impact network traffic and how effective is network redundancy? We answer these questions using multiple data sources commonly collected by network operators. The key findings of our study are that (1) data center networks show high reliability, (2) commodity switches such as ToRs and AggS are highly reliable, (3) load balancers dominate in terms of failure occurrences with many short-lived software related faults, (4) failures have potential to cause loss of many small packets such as keep alive messages and ACKs, and (5) network redundancy is only 40% effective in reducing the median impact of failure.

Categories and Subject Descriptors: C.2.3 [Computer-Communication Network]: Network Operations

General Terms: Network Management, Performance, Reliability

Keywords: Data Centers, Network Reliability

1. INTRODUCTION

Demand for dynamic scaling and benefits from economies of scale are driving the creation of mega data centers to host a broad range of services such as Web search, e-commerce, storage backup, video streaming, high-performance computing, and data analytics. To host these applications, data center networks need to be scalable, efficient, fault tolerant, and easy-to-manage. Recognizing this need, the research community has proposed several architectures to improve scalability and performance of data center networks [2, 3, 12–14, 17, 21]. However, the issue of reliability has remained unaddressed, mainly due to a dearth of available empirical data on failures in these networks.

In this paper, we study data center network reliability by analyzing network error logs collected for over a year from thousands of network devices across tens of geographically distributed data centers. Our goals for this analysis are two-fold. First, we seek to characterize network failure patterns in data centers and understand overall reliability of the network. Second, we want to leverage

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. To copy otherwise, to republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee....