# Greenan - Mean time to meaningless MTTDL Markov models and...


Mean time to meaningless: MTTDL, Markov models, and storage system reliability

Kevin M. Greenan, ParaScale, Inc.
James S. Plank, University of Tennessee
Jay J. Wylie, HP Labs

## Abstract

Mean Time To Data Loss (MTTDL) has been the standard reliability metric in storage systems for more than 20 years. MTTDL represents a simple formula that can be used to compare the reliability of small disk arrays and to perform comparative trending analyses. The MTTDL metric is often misused, with egregious examples relying on the MTTDL to generate reliability estimates that span centuries or millennia. Moving forward, the storage community needs to replace MTTDL with a metric that can be used to accurately compare the reliability of systems in a way that reflects the impact of data loss in the real world.

## 1 Introduction

“Essentially, all models are wrong, but some are useful” – George E.P. Box

Since Gibson’s original work on RAID [3], the standard metric of storage system reliability has been the Mean Time To Data Loss (MTTDL). MTTDL is an estimate of the expected time that it would take a given storage system to exhibit enough failures such that at least one block of data cannot be retrieved or reconstructed.

One of the reasons that MTTDL is so appealing as a metric is that it is easy to construct a Markov model that yields an analytic closed-form equation for MTTDL. Such formulae have been ubiquitous in research and practice due to the ease of estimating reliability by plugging a few numbers into an expression. Given simplistic assumptions about the physical system, such as independent exponential probability distributions for failure and repair, a Markov model can be easily constructed resulting in a nice, closed-form expression.

There are three major problems with using the MTTDL as a measure of storage system reliability. First, the models on which the calculation depends rely on an extremely simplistic view of the storage system. Second, the metric does not reflect the real world, but is often interpreted as a real world estimate. For example, the Pergamum archival storage system estimates a MTTDL of 1400 years [13]. These estimates are based on the assumptions of the underlying Markov models and are typically well beyond the life of any storage system. Finally, MTTDL values tend to be incomparable because each is a function of system scale and omits the (expected) magnitude of data loss.

In this position paper, we argue that MTTDL is a bad reliability metric and that Markov models, the traditional means of determining MTTDL, do a poor job of modeling modern storage system reliability. We then outline properties we believe a good storage system reliability metric should have, and propose a new metric with these properties: NOrmalized Magnitude of Data Loss (NOMDL).
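To make concrete what "plugging a few numbers into an expression" looks like, the following is a minimal sketch of the textbook three-state Markov model for a single-fault-tolerant array. It is not taken from the paper: the function names, the array size, and the failure and repair rates are all illustrative assumptions. The model assumes `n` disks, each failing independently at exponential rate `lam`, a single repair process with exponential rate `mu`, and data loss when a second disk fails before the first is repaired. The sketch computes the expected time to absorption both from the closed form and by solving the two linear equations for the transient states, and checks that they agree.

```python
import math


def mttdl_closed_form(n: int, lam: float, mu: float) -> float:
    """Closed-form MTTDL for the three-state model (assumed, not from the paper).

    States: 0 = all disks healthy, 1 = one disk failed, F = data loss.
    """
    return (mu + (2 * n - 1) * lam) / (n * (n - 1) * lam ** 2)


def mttdl_markov(n: int, lam: float, mu: float) -> float:
    """Expected time to reach state F, starting from state 0.

    Solves the two hitting-time equations for the transient states:
        n*lam * t0 - n*lam * t1               = 1
        -mu   * t0 + (mu + (n-1)*lam) * t1    = 1
    using Cramer's rule.
    """
    a, b = n * lam, -n * lam
    c, d = -mu, mu + (n - 1) * lam
    det = a * d - b * c
    # t0 is the expected time to data loss from the all-healthy state.
    return (d - b) / det


if __name__ == "__main__":
    # Hypothetical inputs: 8 disks, rates expressed per hour.
    n, lam, mu = 8, 1e-5, 0.1
    hours = mttdl_closed_form(n, lam, mu)
    assert math.isclose(hours, mttdl_markov(n, lam, mu), rel_tol=1e-6)
    print(f"MTTDL = {hours:.3e} hours = {hours / 8760:.0f} years")
```

With inputs of this kind the result runs to thousands of years, far beyond the service life of any real array, which is the sort of estimate the introduction cautions against reading as a real-world prediction.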
