BGSU2010 - A Hierarchical Power Law Process Model for...


A Hierarchical Power Law Process Model for Supercomputer Reliability

Kenneth J. Ryan*
[email protected]

* This collaborative research with Mike Hamada (Los Alamos National Lab) and Shane Reese (Brigham Young University) is under peer review.

Outline

• Accelerated Strategic Computing Initiative and the Blue Mountain Supercomputer
• Notation and non-homogeneous Poisson processes
• Blue Mountain supercomputer hardware failure data
• A single system model: definition, application to Blue Mountain, and simulation study
• A hierarchical model for repairable systems: definition, application to Blue Mountain, and simulation study
• Discussion

Accelerated Strategic Computing Initiative (ASCI)

• Fast supercomputers are a standard tool used to help solve important, difficult problems
  – Human genome project
  – Aging properties of nuclear weapons
  – Global ocean modeling
• ASCI is an on-going collaborative effort between national laboratories and the Department of Energy to build the needed supercomputing facilities

The Blue Mountain Supercomputer

• An “older generation” ASCI supercomputer
• 48 SGI Origin 2000 shared memory processors (SMPs)
• Complicated interconnect (475 miles of wires)
• 3.096 teraOps in November 1998, the world’s fastest computer
• 24-7-365 monitoring due to frequent hardware failures*
• Decommissioned July 2004

* In 2000, analysis of hardware failure data was reported in the end-of-year users report, which was used in contract negotiations for next generation ASCI supercomputers.

Some (Fairly) Standard Notation for Repairable Systems Data

• $T_{ij}$: failure time $j$ for the $i$th SMP, with $0 < T_{i1} < T_{i2} < \cdots < T_{in}$
• $N_i(a,b)$: number of failures for the $i$th SMP in $(a,b]$
• $N_i(t)$: number of failures for the $i$th SMP in $(0,t]$

Three Common Data Collection Methods

• Failure truncated design: Observe the first $n$ exact failure times.
• Time truncated design: Observe the first $N = n$ exact failure times before censoring time $c$.
• Failure count design: Defined by $b$ bins with endpoints $e = (e_1, e_2, \ldots, e_b)$. One observes failure counts $x_i = (x_{i1}, x_{i2}, \ldots, x_{ib})$, where $x_{ij} = N_i(e_{j-1}, e_j)$ and $e_0 = 0$.

Available Blue Mountain Hardware Failure Data: a monthly failure count design through month 6, i.e., $e = (1, 2, 3, 4, 5, 6)$.

[Figure: $N_i(t)$ versus $t$ (in months)]

The outlying “graphical” SMP had additional hardware and more usage compared to the other 47 “non-graphical” SMPs....
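
To make the failure count design concrete, the following minimal sketch converts exact failure times for one SMP into bin counts $x_{ij} = N_i(e_{j-1}, e_j)$ with $e_0 = 0$. The function name `failure_counts` and the example failure times are hypothetical and chosen only for illustration; they are not the Blue Mountain data.

```python
from bisect import bisect_right


def failure_counts(failure_times, endpoints):
    """Return [x_1, ..., x_b], where x_j counts failures in (e_{j-1}, e_j].

    failure_times: exact failure times for one system (assumed > 0).
    endpoints:     increasing bin endpoints e_1 < ... < e_b; e_0 = 0 is implied.
    """
    times = sorted(failure_times)
    edges = [0.0] + list(endpoints)
    # N(t) = number of failures in (0, t]; bisect_right counts times <= t.
    cumulative = [bisect_right(times, e) for e in edges]
    # x_j = N(e_j) - N(e_{j-1}); failures after e_b are not observed.
    return [cumulative[j] - cumulative[j - 1] for j in range(1, len(edges))]


# Hypothetical failure times (in months) for a single SMP.
example_times = [0.4, 1.0, 1.7, 2.2, 5.9, 6.5]
monthly_endpoints = [1, 2, 3, 4, 5, 6]
print(failure_counts(example_times, monthly_endpoints))  # [2, 1, 1, 0, 0, 1]
```

A failure exactly at an endpoint (here, 1.0) falls in the bin that closes at that endpoint, matching the half-open intervals $(a,b]$ in the notation, and the failure at 6.5 lies beyond the last endpoint and so is not counted.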

This note was uploaded on 02/29/2012 for the course ECON 4020 taught by Professor Ellen during the Spring '12 term at Bowling Green.
