understanding the causes and impact of network failures


California Fault Lines: Understanding the Causes and Impact of Network Failures

Daniel Turner, Kirill Levchenko, Alex C. Snoeren, and Stefan Savage
{djturner, klevchen, snoeren, savage}@cs.ucsd.edu
Department of Computer Science and Engineering
University of California, San Diego

ABSTRACT

Of the major factors affecting end-to-end service availability, network component failure is perhaps the least well understood. How often do failures occur, how long do they last, what are their causes, and how do they impact customers? Traditionally, answering questions such as these has required dedicated (and often expensive) instrumentation broadly deployed across a network.

We propose an alternative approach: opportunistically mining low-quality data sources that are already available in modern network environments. We describe a methodology for recreating a succinct history of failure events in an IP network using a combination of structured data (router configurations and syslogs) and semi-structured data (email logs). Using this technique we analyze over five years of failure events in a large regional network consisting of over 200 routers; to our knowledge, this is the largest study of its kind.

Categories and Subject Descriptors

C.2.3 [Computer-Communication Networks]: Network Operations

General Terms

Measurement, Reliability

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. To copy otherwise, to republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. SIGCOMM'10, August 30–September 3, 2010, New Delhi, India. Copyright 2010 ACM 978-1-4503-0201-2/10/08 ...$10.00.

1. INTRODUCTION

Today's network-centric enterprises are built on the promise of uninterrupted service availability. However, delivering on this promise is a challenging task because availability is not an intrinsic design property of a system; instead, a system must accommodate the underlying failure properties of its components. Thus, providing availability first requires understanding failure: how long are failures, what causes them, and how well are they masked? This is particularly true for networks, which have been increasingly identified as the leading cause of end-to-end service disruption [2, 9, 15, 24, 30], as they exhibit complex failure modes.

Unfortunately, such analysis is rarely performed in practice as common means of measuring network failures at fine grain presuppose measurement mechanisms (e.g., IGP logging [23], pervasive high-frequency SNMP polling, passive link monitoring [8], and pair-wise active probing [26]) that are not universally available outside focused research-motivated efforts and which can incur significant capital and operational expense. Indeed, even in the research...
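To make the abstract's idea of recreating a failure history from logs concrete, the following is a minimal sketch, not the authors' implementation. It assumes a simplified, hypothetical log format of one message per line, `<seconds> <router> <interface> <DOWN|UP>`, and pairs each DOWN message with the next UP message on the same link. The names `Failure`, `parse_line`, and `reconstruct_failures`, as well as the sample log, are invented for illustration; real syslog messages and the router-configuration and email sources mentioned in the abstract are not modeled here.

```python
from dataclasses import dataclass
from typing import Dict, Iterable, List, Tuple


@dataclass(frozen=True)
class Failure:
    link: Tuple[str, str]  # (router, interface); hypothetical link identifier
    start: int             # time of the DOWN message, in seconds
    end: int               # time of the matching UP message, in seconds

    @property
    def duration(self) -> int:
        return self.end - self.start


def parse_line(line: str) -> Tuple[int, str, str, str]:
    """Parse one line of the assumed format '<seconds> <router> <interface> <DOWN|UP>'."""
    ts, router, interface, state = line.split()
    return int(ts), router, interface, state


def reconstruct_failures(lines: Iterable[str]) -> List[Failure]:
    """Pair each DOWN message with the next UP message on the same link."""
    down_since: Dict[Tuple[str, str], int] = {}
    failures: List[Failure] = []
    # Sort by timestamp first, since log lines may arrive out of order.
    for ts, router, interface, state in sorted(parse_line(l) for l in lines):
        link = (router, interface)
        if state == "DOWN":
            # Keep the earliest DOWN if the message repeats while the link is down.
            down_since.setdefault(link, ts)
        elif state == "UP" and link in down_since:
            failures.append(Failure(link, down_since.pop(link), ts))
        # An UP with no preceding DOWN is skipped: the DOWN message may have been lost.
    return failures


# Hypothetical sample log, including a repeated DOWN and an unmatched UP.
LOG = [
    "100 r1 ge-0/0 DOWN",
    "130 r1 ge-0/0 UP",
    "200 r2 xe-1/1 DOWN",
    "205 r2 xe-1/1 DOWN",
    "260 r2 xe-1/1 UP",
    "300 r3 ge-0/1 UP",
]

failures = reconstruct_failures(LOG)
for f in failures:
    print(f.link, f.start, f.end, f.duration)
```

The sketch deliberately tolerates repeated and unmatched messages, since the abstract characterizes these data sources as low-quality; how the paper itself resolves such ambiguities is not shown in this excerpt.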

This note was uploaded on 12/01/2011 for the course EE 5373 taught by Professor Chao during the Spring '11 term at NYU Poly.
