A Hierarchical Power Law Process Model for Supercomputer Reliability

Kenneth J. Ryan*
kjryan@bgsu.edu

* This collaborative research with Mike Hamada (Los Alamos National Lab) and Shane Reese (Brigham Young University) is under peer review.

Outline

- Accelerated Strategic Computing Initiative and the Blue Mountain Supercomputer
- Notation and nonhomogeneous Poisson processes
- Blue Mountain supercomputer hardware failure data
- A single system model: definition, application to Blue Mountain, and simulation study
- A hierarchical model for repairable systems: definition, application to Blue Mountain, and simulation study
- Discussion

Accelerated Strategic Computing Initiative (ASCI)

- Fast supercomputers are a standard tool used to help solve important, difficult problems
  - Human genome project
  - Aging properties of nuclear weapons
  - Global ocean modeling
- ASCI is an ongoing collaborative effort between national laboratories and the Department of Energy to build the needed supercomputing facilities

The Blue Mountain Supercomputer

- An older generation ASCI supercomputer
- 48 SGI Origin 2000 shared memory processors (SMPs)
- Complicated interconnect (475 miles of wires)
- 3.096 teraOps in November 1998, the world's fastest computer
- 24/7/365 monitoring due to frequent hardware failures*
- Decommissioned July 2004

* In 2000, analysis of hardware failure data was reported in the end-of-year users report, which was used in contract negotiations for next generation ASCI supercomputers.

Some (Fairly) Standard Notation for Repairable Systems Data

- $T_{ij}$: failure time $j$ for the $i$th SMP, with $0 < T_{i1} < T_{i2} < \cdots < T_{in}$
- $N_i(a,b)$: number of failures for the $i$th SMP in $(a,b]$
- $N_i(t)$: number of failures for the $i$th SMP in $(0,t]$

Three Common Data Collection Methods

- Failure truncated design: Observe the first $n$ exact failure times.
- Time truncated design: Observe the first $N = n$ exact failure times before censoring time $c$.
- Failure count design: Defined by $b$ bins with endpoints $e = (e_1, e_2, \ldots, e_b)$. One observes failure counts $x_i = (x_{i1}, x_{i2}, \ldots, x_{ib})$, where $x_{ij} = N_i(e_{j-1}, e_j)$ and $e_0 = 0$.

Available Blue Mountain Hardware Failure Data: a monthly failure count design through month 6, i.e., $e = (1, 2, 3, 4, 5, 6)$.

[Figure: $N_i(t)$ versus $t$ (in months)]

The outlying graphical SMP had additional hardware and more usage compared to the other 47 nongraphical SMPs....
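
The failure count design defined above reduces exact failure times to per-bin counts over the intervals between consecutive endpoints, starting from 0. The following is a minimal sketch of that reduction for a single SMP, not code from the talk. The function name `failure_counts` and the example failure times are hypothetical and chosen only for illustration; they are not Blue Mountain data.

```python
from bisect import bisect_right


def failure_counts(failure_times, endpoints):
    """Bin failure times into counts over (e_{j-1}, e_j], j = 1..b, with e_0 = 0.

    failure_times: positive failure times for one system (any order).
    endpoints: increasing bin endpoints e_1 < e_2 < ... < e_b.
    """
    times = sorted(failure_times)
    counts = []
    # Number of times at or before e_0 = 0 (expected to be zero for valid data).
    prev_cum = bisect_right(times, 0.0)
    for e in endpoints:
        # Cumulative count N(e): number of failure times <= e.
        cum = bisect_right(times, e)
        counts.append(cum - prev_cum)
        prev_cum = cum
    return counts


# Hypothetical failure times (in months) for one SMP; illustrative only.
example_times = [0.4, 0.9, 1.0, 2.5, 3.2, 3.3, 5.7, 6.4]
# Monthly design through month 6, matching e = (1, 2, 3, 4, 5, 6).
monthly_endpoints = [1, 2, 3, 4, 5, 6]

example_counts = failure_counts(example_times, monthly_endpoints)
print(example_counts)  # → [3, 0, 1, 2, 0, 1]
```

The failure at 6.4 months falls after the last endpoint and is therefore not counted, and the failure at exactly 1.0 lands in the first bin because each bin is closed on the right. Summing the first k counts recovers the cumulative count at the k-th endpoint.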