Mean time to meaningless:
MTTDL, Markov models, and storage system reliability
Kevin M. Greenan
ParaScale, Inc.
James S. Plank
University of Tennessee
Jay J. Wylie
HP Labs
Abstract
Mean Time To Data Loss (MTTDL) has been the standard reliability metric in storage systems for more than 20 years. MTTDL represents a simple formula that can be used to compare the reliability of small disk arrays and to perform comparative trending analyses. The MTTDL metric is often misused, with egregious examples relying on the MTTDL to generate reliability estimates that span centuries or millennia. Moving forward, the storage community needs to replace MTTDL with a metric that can be used to accurately compare the reliability of systems in a way that reflects the impact of data loss in the real world.
1 Introduction
“Essentially, all models are wrong, but some are useful”
– George E.P. Box
Since Gibson’s original work on RAID [3], the standard metric of storage system reliability has been the Mean Time To Data Loss (MTTDL). MTTDL is an estimate of the expected time that it would take a given storage system to exhibit enough failures such that at least one block of data cannot be retrieved or reconstructed.
One of the reasons that MTTDL is so appealing as a metric is that it is easy to construct a Markov model that yields an analytic closed-form equation for MTTDL. Such formulae have been ubiquitous in research and practice due to the ease of estimating reliability by plugging a few numbers into an expression. Given simplistic assumptions about the physical system, such as independent exponential probability distributions for failure and repair, a Markov model can be easily constructed, resulting in a nice, closed-form expression.
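
To make the "plug in a few numbers" step concrete, the following is a minimal sketch for a two-disk mirror, a textbook case chosen here for illustration and not taken from this paper. It assumes a three-state chain (both disks healthy, one disk failed, data loss) with a per-disk failure rate `lam` and a repair rate `mu`, both exponential and independent. The function names and the numeric inputs (a 1,000,000-hour disk MTTF and a 24-hour mean repair time) are hypothetical.

```python
def mttdl_mirror_closed_form(lam, mu):
    """Closed-form MTTDL of the two-disk mirror chain: (3*lam + mu) / (2*lam**2)."""
    return (3.0 * lam + mu) / (2.0 * lam ** 2)


def mttdl_mirror_markov(lam, mu):
    """Expected time to absorption, derived from the chain's balance equations.

    States: 0 = both disks healthy, 1 = one disk failed, 2 = data loss (absorbing).
        T0 = 1/(2*lam) + T1
        T1 = 1/(lam + mu) + (mu / (lam + mu)) * T0
    Substituting T1 into the first equation gives
        T0 * (lam / (lam + mu)) = 1/(2*lam) + 1/(lam + mu)
    """
    return (1.0 / (2.0 * lam) + 1.0 / (lam + mu)) * (lam + mu) / lam


HOURS_PER_YEAR = 24 * 365

# Hypothetical inputs (assumptions, not values from the paper).
lam = 1.0 / 1_000_000  # per-disk failure rate, per hour
mu = 1.0 / 24          # repair rate, per hour

mttdl_hours = mttdl_mirror_markov(lam, mu)
mttdl_years = mttdl_hours / HOURS_PER_YEAR
print(f"MTTDL of one mirrored pair: {mttdl_years:.3e} years")
```

Under these assumed inputs the result is on the order of millions of years, which illustrates how readily such expressions produce figures far beyond the lifetime of any real system.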
There are three major problems with using the MTTDL as a measure of storage system reliability. First, the models on which the calculation depends rely on an extremely simplistic view of the storage system. Second, the metric does not reflect the real world, but is often interpreted as a real-world estimate. For example, the Pergamum archival storage system estimates an MTTDL of 1400 years [13]. These estimates are based on the assumptions of the underlying Markov models and are typically well beyond the life of any storage system. Finally, MTTDL values tend to be incomparable because each is a function of system scale and omits the (expected) magnitude of data loss.
In this position paper, we argue that MTTDL is a bad reliability metric and that Markov models, the traditional means of determining MTTDL, do a poor job of modeling modern storage system reliability. We then outline properties we believe a good storage system reliability metric should have, and propose a new metric with these properties: NOrmalized Magnitude of Data Loss (NOMDL).