f33-book-depend-pres-pt7

Dec. 2009 Part VII – Failures: Computational Breaches Slide 1
Slide 2: About This Presentation

This presentation is intended to support the use of the textbook Dependable Computing: A Multilevel Approach (traditional print or on-line open publication, TBD). It is updated regularly by the author as part of his teaching of the graduate course ECE 257A, Fault-Tolerant Computing, at Univ. of California, Santa Barbara. Instructors can use these slides freely in classroom teaching or for other educational purposes. Unauthorized uses, including distribution for profit, are strictly prohibited. © Behrooz Parhami

Edition    Released     Revised      Revised
First      Sep. 2006    Oct. 2007    Dec. 2009
Slide 3: 25 Failure Confinement
Slide 5:
    Robust Parallel Processing
    Resilient Algorithms
Slide 6: 25.1 From Failure to Disaster

Computers are components in larger technical or societal systems

Failure detection and manual back-up system can prevent disaster

Used routinely in safety-critical systems:
    Manual control/override in jetliners
    Ground-based control for spacecraft
    Manual bypass in nuclear reactors

Manual back-up and bypass systems provide a buffer between the failed state and potential disaster

Not just for safety-critical systems: Amtrak lost ticketing capability on Friday, Nov. 30, 1996 (Thanksgiving weekend), due to a communication system failure and had no up-to-date fare information in train stations to issue tickets manually

Manual system infeasible for e-commerce sites
Slide 7: 25.2 Failure Awareness

Importance of collecting experimental failure data:
    Indicate where effort is most needed
    Help with verification of analytic models

System outage stats (%)*

Source                 Hardware   Software   Operations   Environment
Bellcore [Ali86]          26         30          44           --
Tandem [Gray87]           22         49          15           14
Northern Telecom          19         19          33           28
Japanese Commercial       36         40          11           13
Mainframe users           47         21          16           16
Overall average           30         32          24           14

*Excluding scheduled maintenance

Tandem unscheduled outages
    Power                  53%
    Communication lines    22%
    Application software   10%
    File system            10%
    Hardware                5%

Tandem outages due to hardware
    Disk storage           49%
    Communications         24%
    Processors             18%
    Wiring                  9%
    Spare units             1%
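As a rough check on the outage table, the sketch below recomputes the column averages from the five source rows. The names `OUTAGE_SHARES`, `column_averages`, and `dominant_cause` are made up for this illustration. Counting the missing Bellcore environment entry as zero is an assumption: it happens to reproduce the "Overall average" row after rounding, but the slide does not say how that row was computed.

```python
# Outage shares (%) by cause, transcribed from the table above.
# None marks the entry shown as "--" on the slide.
CAUSES = ("Hardware", "Software", "Operations", "Environment")
OUTAGE_SHARES = {
    "Bellcore [Ali86]": (26, 30, 44, None),
    "Tandem [Gray87]": (22, 49, 15, 14),
    "Northern Telecom": (19, 19, 33, 28),
    "Japanese Commercial": (36, 40, 11, 13),
    "Mainframe users": (47, 21, 16, 16),
}


def column_averages(rows, missing_as_zero=True):
    """Average each cause column over all sources.

    missing_as_zero=True counts a missing entry as 0 (an assumption);
    False averages only over the sources that report a value.
    """
    averages = []
    for col in range(len(CAUSES)):
        values = [row[col] for row in rows.values()]
        if missing_as_zero:
            values = [0 if v is None else v for v in values]
        else:
            values = [v for v in values if v is not None]
        averages.append(sum(values) / len(values))
    return averages


def dominant_cause(row):
    """Return the cause with the largest reported share for one source."""
    reported = [(share, cause) for share, cause in zip(row, CAUSES)
                if share is not None]
    return max(reported)[1]


if __name__ == "__main__":
    for cause, avg in zip(CAUSES, column_averages(OUTAGE_SHARES)):
        print(f"{cause:12s} {avg:5.1f}")
```

Averaging only over the sources that report an environment figure (`missing_as_zero=False`) gives a noticeably higher environment share, which is one reason to be careful when comparing rows with missing categories.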
Slide 8: System Failure Data Repositories

LANL data, collected 1996-2005: SMPs, Clusters, NUMAs
http://institutes.lanl.gov/data/fdata/
From the site’s FAQs: “A failure record contains the time when the failure started (start time), the time when it was resolved (end time), the system and node affected, the type of workload running on the node and the root cause.”

Usenix Computer Failure Data Repository
http://cfdr.usenix.org/
“The computer failure data repository (CFDR) aims at accelerating research on system reliability by filling the nearly empty collection of
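The failure-record fields listed in the LANL FAQ quote above can be sketched as a small data structure. This is only an illustration: the class name, field names, and helper functions are hypothetical, the actual file layout of the LANL data is not described here, and the sample records are invented.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass(frozen=True)
class FailureRecord:
    """One failure, with the fields named in the FAQ quote (names assumed)."""
    start_time: datetime   # when the failure started
    end_time: datetime     # when it was resolved
    system: str            # system affected
    node: str              # node affected
    workload: str          # type of workload running on the node
    root_cause: str        # root-cause category

    @property
    def downtime(self) -> timedelta:
        return self.end_time - self.start_time


def mean_time_to_repair(records):
    """Average downtime over a non-empty list of records."""
    total = sum((r.downtime for r in records), timedelta())
    return total / len(records)


def downtime_by_root_cause(records):
    """Total downtime attributed to each root-cause category."""
    totals = {}
    for r in records:
        totals[r.root_cause] = totals.get(r.root_cause, timedelta()) + r.downtime
    return totals


# Invented records for illustration only; not taken from the LANL data.
EXAMPLE_RECORDS = [
    FailureRecord(datetime(2005, 1, 1, 0, 0), datetime(2005, 1, 1, 2, 0),
                  "sysA", "node1", "compute", "hardware"),
    FailureRecord(datetime(2005, 1, 2, 10, 0), datetime(2005, 1, 2, 11, 0),
                  "sysA", "node2", "compute", "software"),
    FailureRecord(datetime(2005, 1, 3, 8, 0), datetime(2005, 1, 3, 11, 0),
                  "sysB", "node7", "graphics", "hardware"),
]

if __name__ == "__main__":
    print(mean_time_to_repair(EXAMPLE_RECORDS))
    print(downtime_by_root_cause(EXAMPLE_RECORDS))
```

Grouping downtime by root cause in this way is one route to the kind of breakdown shown in outage statistics tables, though the categories used by any real repository would have to be taken from its own documentation.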

This note was uploaded on 12/29/2011 for the course ECE 257A taught by Professor B. Parhami during the Fall '08 term at UCSB.
