
Mean time to meaningless: MTTDL, Markov models, and storage system reliability

Kevin M. Greenan, ParaScale, Inc.
James S. Plank, University of Tennessee
Jay J. Wylie, HP Labs

Abstract

Mean Time To Data Loss (MTTDL) has been the standard reliability metric in storage systems for more than 20 years. MTTDL represents a simple formula that can be used to compare the reliability of small disk arrays and to perform comparative trending analyses. The MTTDL metric is often misused, with egregious examples relying on the MTTDL to generate reliability estimates that span centuries or millennia. Moving forward, the storage community needs to replace MTTDL with a metric that can be used to accurately compare the reliability of systems in a way that reflects the impact of data loss in the real world.

1 Introduction

"Essentially, all models are wrong, but some are useful" – George E.P. Box

Since Gibson's original work on RAID [3], the standard metric of storage system reliability has been the Mean Time To Data Loss (MTTDL). MTTDL is an estimate of the expected time that it would take a given storage system to exhibit enough failures such that at least one block of data cannot be retrieved or reconstructed.

One of the reasons that MTTDL is so appealing as a metric is that it is easy to construct a Markov model that yields an analytic closed-form equation for MTTDL. Such formulae have been ubiquitous in research and practice due to the ease of estimating reliability by plugging a few numbers into an expression. Given simplistic assumptions about the physical system, such as independent exponential probability distributions for failure and repair, a Markov model can be easily constructed resulting in a nice, closed-form expression.

There are three major problems with using the MTTDL as a measure of storage system reliability. First, the models on which the calculation depends rely on an extremely simplistic view of the storage system. Second, the metric does not reflect the real world, but is often interpreted as a real world estimate. For example, the Pergamum archival storage system estimates a MTTDL of 1400 years [13]. These estimates are based on the assumptions of the underlying Markov models and are typically well beyond the life of any storage system. Finally, MTTDL values tend to be incomparable because each is a function of system scale and omits the (expected) magnitude of data loss.

In this position paper, we argue that MTTDL is a bad reliability metric and that Markov models, the traditional means of determining MTTDL, do a poor job of modeling modern storage system reliability. We then outline properties we believe a good storage system reliability metric should have, and propose a new metric with these properties: NOrmalized Magnitude of Data Loss (NOMDL).
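To make concrete the kind of closed-form expression the introduction refers to, the following is a minimal sketch of the textbook three-state Markov model for an array that tolerates a single disk failure. It is not taken from the text above. The symbols N (number of disks), λ = 1/MTTF and μ = 1/MTTR, the function names, and the example inputs are all illustrative assumptions. The sketch computes the MTTDL twice, once from the closed form and once by solving the chain's expected-absorption-time equations, so that the two can be checked against each other.

```python
import math


def mttdl_closed_form(n_disks, mttf_hours, mttr_hours):
    """Closed-form MTTDL (hours) for a single-fault-tolerant array.

    Assumes independent, exponentially distributed failure and repair
    times, i.e. exactly the simplistic assumptions the paper criticizes.
    """
    lam = 1.0 / mttf_hours  # per-disk failure rate (assumed symbol)
    mu = 1.0 / mttr_hours   # repair rate (assumed symbol)
    n = n_disks
    return (mu + (2 * n - 1) * lam) / (n * (n - 1) * lam ** 2)


def mttdl_from_chain(n_disks, mttf_hours, mttr_hours):
    """Same quantity, obtained by solving the Markov chain directly.

    States: 0 = all disks healthy, 1 = one disk failed (rebuilding),
    2 = data loss (absorbing). The expected times to absorption satisfy
        n*lam * t0 - n*lam * t1              = 1
        -mu   * t0 + (mu + (n-1)*lam) * t1   = 1
    which is solved here for t0 with Cramer's rule.
    """
    lam = 1.0 / mttf_hours
    mu = 1.0 / mttr_hours
    n = n_disks
    a, b = n * lam, -n * lam
    c, d = -mu, mu + (n - 1) * lam
    det = a * d - b * c
    return (d - b) / det  # t0, with right-hand side (1, 1)


if __name__ == "__main__":
    # Hypothetical inputs: 8 disks, 1,000,000-hour MTTF, 24-hour rebuild.
    hours = mttdl_closed_form(8, 1_000_000, 24)
    assert math.isclose(hours, mttdl_from_chain(8, 1_000_000, 24), rel_tol=1e-6)
    print(f"MTTDL: {hours:.3e} hours, {hours / 8760:.0f} years")
```

With these assumed inputs the result is on the order of tens of thousands of years, far beyond the service life of any real array, which is the kind of estimate the paper argues should not be read as a real-world prediction.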

