BGSU2010 - A Hierarchical Power Law Process Model for...

A Hierarchical Power Law Process Model for Supercomputer Reliability

Kenneth J. Ryan*
kjryan@bgsu.edu

* This collaborative research with Mike Hamada (Los Alamos National Lab) and Shane Reese (Brigham Young University) is under peer review.

Outline

- Accelerated Strategic Computing Initiative and the Blue Mountain Supercomputer
- Notation and non-homogeneous Poisson processes
- Blue Mountain supercomputer hardware failure data
- A single system model: definition, application to Blue Mountain, and simulation study
- A hierarchical model for repairable systems: definition, application to Blue Mountain, and simulation study
- Discussion

Accelerated Strategic Computing Initiative (ASCI)

- Fast supercomputers are a standard tool used to help solve important, difficult problems
  - Human genome project
  - Aging properties of nuclear weapons
  - Global ocean modeling
- ASCI is an on-going collaborative effort between national laboratories and the Department of Energy to build the needed supercomputing facilities

The Blue Mountain Supercomputer

- An older generation ASCI supercomputer
- 48 SGI Origin 2000 shared memory processors (SMPs)
- Complicated interconnect (475 miles of wires)
- 3.096 teraOps in November 1998, the world's fastest computer
- 24-7-365 monitoring due to frequent hardware failures*
- Decommissioned July 2004

* In 2000, analysis of hardware failure data was reported in the end of year users report which was used in contract negotiations for next generation ASCI supercomputers.

Some (Fairly) Standard Notation for Repairable Systems Data

- $T_{ij}$: failure time $j$ for the $i$th SMP, with $0 < T_{i1} < T_{i2} < \cdots < T_{in}$
- $N_i(a, b)$: number of failures for the $i$th SMP in $(a, b]$
- $N_i(t)$: number of failures for the $i$th SMP in $(0, t]$

Three Common Data Collection Methods

- Failure truncated design: Observe the first $n$ exact failure times.
- Time truncated design: Observe the first $N = n$ exact failure times before censoring time $c$.
- Failure count design: Defined by $b$ bins with endpoints $e = (e_1, e_2, \ldots, e_b)$. One observes failure counts $x_i = (x_{i1}, x_{i2}, \ldots, x_{ib})$, where $x_{ij} = N_i(e_{j-1}, e_j)$ and $e_0 = 0$.

Available Blue Mountain Hardware Failure Data

A monthly failure count design through month 6, i.e., $e = (1, 2, 3, 4, 5, 6)$.

[Figure: $N_i(t)$ versus $t$ (in months)]

The outlying graphical SMP had additional hardware and more usage compared to the other 47 non-graphical SMPs....
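The counting notation $N_i(a, b)$ and the three data collection methods can be illustrated with a minimal Python sketch for a single SMP. The helper names (`count_in`, `failure_truncated`, `time_truncated`, `failure_counts`) and the failure times are hypothetical, chosen only for illustration; they are not the Blue Mountain data and not code from the talk.

```python
from bisect import bisect_right


def count_in(times, a, b):
    """N(a, b): number of failures with a < T <= b (times must be sorted)."""
    return bisect_right(times, b) - bisect_right(times, a)


def failure_truncated(times, n):
    """Failure truncated design: keep the first n exact failure times."""
    return times[:n]


def time_truncated(times, c):
    """Time truncated design: keep the exact failure times up to censoring time c."""
    return times[:bisect_right(times, c)]


def failure_counts(times, endpoints):
    """Failure count design: x_j = N(e_{j-1}, e_j) with e_0 = 0."""
    edges = [0.0] + list(endpoints)
    return [count_in(times, lo, hi) for lo, hi in zip(edges, edges[1:])]


# Hypothetical sorted failure times (in months) for one SMP.
times = [0.4, 0.9, 1.7, 2.2, 2.3, 4.8, 5.5]

# Monthly endpoints through month 6, matching e = (1, 2, 3, 4, 5, 6).
e = [1, 2, 3, 4, 5, 6]

counts = failure_counts(times, e)
print(counts)  # [2, 1, 2, 0, 1, 1]
```

Under a failure count design only `counts` would be observed; the exact failure times within each month are lost, which is the situation described for the Blue Mountain data.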