2.4 Other Reliability Evaluation Techniques
FIGURE 2.15 The Markov model for the duplex system with an inactive spare.
FIGURE 2.16 The Markov model for a duplex system with repair.
This expression can also be derived based on combinatorial arguments.
CHAPTER 2 Hardware Fault Tolerance
2. It was at some other state j at time t (j ≠ i) and moved from j to i during the interval Δt. This event has a probability of P_j(t) λ_{ji} Δt, plus terms of order (Δt)². The probability of more than one transition during the interval Δt is of order (Δt)² and can be neglected.
Similar derivations can be made for M-of-N systems in which failing processors are identified and replaced from an infinite pool of spares. This is left for the reader as an exercise. The extension to the case wi…
a processor fails (due to a permanent fault), it is diagnosed and replaced instantaneously. Due to the constant failure rate λ, the time between two consecutive failures of the same processor is exponentially distributed
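The constant-failure-rate assumption can be checked numerically: inter-failure times drawn from an exponential distribution with rate λ have mean 1/λ. A minimal sketch (the function name and parameters are illustrative, not from the text):

```python
import random

def mean_time_between_failures(lam, n=200_000, seed=1):
    # With a constant failure rate lam, inter-failure times are exponential
    # with mean 1/lam; the memoryless property means each replacement
    # statistically "restarts" the process.
    rng = random.Random(seed)
    total = sum(rng.expovariate(lam) for _ in range(n))
    return total / n

# With lam = 0.5 failures per unit time, the sample mean approaches 2.0.
```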
interval of length t (k = 0, 1, 2, …). Based on (1)–(3), we have

P_k(t + Δt) ≈ P_{k−1}(t) λΔt + P_k(t)(1 − λΔt)    (for k = 1, 2, …)

and

P_0(t + Δt) ≈ P_0(t)(1 − λΔt)

These approximations become more accurate as Δt shrinks; letting Δt → 0 yields the differential equations
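Solving the resulting differential equations gives the Poisson distribution, P_k(t) = e^{−λt}(λt)^k / k!. A small sketch that marches the approximate recurrences above forward in small steps and compares against that closed form (function names are illustrative):

```python
import math

def step_poisson(lam, t_end, dt, kmax=20):
    # March the recurrences P_k(t+dt) = P_{k-1}(t)*lam*dt + P_k(t)*(1-lam*dt)
    # forward from P_0(0) = 1; smaller dt gives better accuracy.
    p = [1.0] + [0.0] * kmax
    steps = int(round(t_end / dt))
    for _ in range(steps):
        new = [p[0] * (1 - lam * dt)]
        for k in range(1, kmax + 1):
            new.append(p[k - 1] * lam * dt + p[k] * (1 - lam * dt))
        p = new
    return p

def poisson_pmf(lam, t, k):
    # Closed-form solution of the differential equations.
    return math.exp(-lam * t) * (lam * t) ** k / math.factorial(k)
```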
FIGURE 2.14 A pair-and-spare structure consisting of two duplexes. [Figure: two module pairs, each checked by a comparator, feeding a switch/comparator that selects the output of a fault-free pair.]
the system. The triplex–duplex arrangement allows for the error masking of voting
2.3 Canonical and Resilient Structures
Hardware Testing

The second method of identifying the failed processor is to subject both processors to some hardware/logic test routines. Such diagnostic tests are regularly used to verify that the processor circuits…
Assuming that the two processors are identical, each with a reliability R(t), the reliability of the duplex system is

R_duplex(t) = R_comp(t) [ R²(t) + 2cR(t)(1 − R(t)) ]    (2.31)

where R_comp is the reliability of the comparator
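Equation (2.31) is straightforward to evaluate; a minimal sketch, assuming constant failure rates for the modules and the comparator (the function name and defaults are illustrative):

```python
import math

def duplex_reliability(t, lam, c, lam_comp=0.0):
    # R(t) = exp(-lam*t) for each identical module; c is the coverage,
    # i.e., the probability that a single module failure is detected and
    # the surviving module keeps the system running.
    r = math.exp(-lam * t)
    r_comp = math.exp(-lam_comp * t)
    return r_comp * (r * r + 2 * c * r * (1 - r))

# With perfect coverage (c = 1) and a perfect comparator the duplex beats
# a single module; with c = 0 both modules must survive, which is worse.
```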
FIGURE 2.12 Sift-out structure.
FIGURE 2.13 Duplex system. [Figure: two modules whose outputs are checked by a comparator.]
module whose output disagrees with the outputs of the other modules is switched out and no longer contributes to the system output.
FIGURE 2.11 Hybrid redundancy.
unit) to the output of the voter to identify a faulty primary (if any). The Compare unit then generates the corresponding disagreement signal, which will cause the Reconfiguration unit to
FIGURE 2.10 Dynamic redundancy.
system consists of one active module, N spare modules, and a Fault Detection and Reconfiguration unit that is assumed to be capable of detecting any erroneous output produced by the active module.
FIGURE 2.8 Subsystem-level TMR. [Figure: units 1–4 are each triplicated, with voters (V) between successive units.]
FIGURE 2.9 Triplicated processors and memories. [Figure: three processors and three memories interconnected through voters (V).]
have a faulty adder and another module a faulty multiplier. If the adder and multiplier circuits are disjoint, the two faulty modules are unlikely to generate wrong
outputs simultaneously. If all compensating and
FIGURE 2.7 Comparing NMR reliability (for N = 3 and 5) to that of a single module (voter failure rate is cons… [Plot: system reliability versus module reliability R, from 0 to 1.0, for a simplex, a triplex, and 5MR.]
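The crossover visible in the plot can be reproduced directly: TMR beats a single module only when the module reliability exceeds 0.5, and below that point redundancy actually hurts. A sketch assuming a perfect voter (function name is illustrative):

```python
def r_tmr(r):
    # Reliability of TMR with a perfect voter: the system works if
    # at least two of the three modules work.
    return 3 * r**2 * (1 - r) + r**3

# r_tmr(0.5) equals 0.5 exactly: the curves cross at R = 0.5.
```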
FIGURE 2.6 A Triple Modular Redundant (TMR) structure. [Figure: three modules feeding a voter.]
time t. The system reliability is therefore given by

R_M-of-N(t) = Σ_{i=M}^{N} C(N, i) R^i(t) [1 − R(t)]^{N−i}    (2.25)

where C(N, i) = N!/((N − i)! i!) is the binomial coefficient. The assumption
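Equation (2.25) translates directly into code; a sketch assuming identical modules and a perfect voter (function and parameter names are illustrative):

```python
from math import comb

def r_m_of_n(m, n, r):
    # Equation (2.25): the system works if at least m of the n identical
    # modules (each with reliability r) work.
    return sum(comb(n, i) * r**i * (1 - r)**(n - i) for i in range(m, n + 1))

# TMR is the 2-of-3 special case; a simplex is the 1-of-1 case.
```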
FIGURE 2.5 Comparing the exact reliability of the non-series/parallel system in Figure 2.3 to its upper and lower bounds. [Plot: system reliability versus module reliability R, with the exact value lying between the upper-bound and lower-bound curves.]
Expanding the diagram in Figure 2.4b about E yields

Prob{System works | C is fault-free} = R_E R_F [1 − (1 − R_A)(1 − R_B)] + (1 − R_E) R_A R_D R_F

Substituting this last expression in (2.18) results in

R_system = R_C R_E R_F (R_A + R_B
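The expansion about a chosen module is an instance of the law of total probability; pushed to its extreme, one can condition on every module at once and enumerate all up/down states, which gives the exact reliability of any structure. A brute-force sketch (the `works` structure-function interface is an illustrative convention, not the book's notation):

```python
from itertools import product

def reliability(works, rel):
    # Exact reliability by total probability: enumerate every up/down
    # combination of the modules and sum the probability of each state
    # in which the structure function `works` reports the system as up.
    names = list(rel)
    total = 0.0
    for states in product((True, False), repeat=len(names)):
        state = dict(zip(names, states))
        if works(state):
            p = 1.0
            for name in names:
                p *= rel[name] if state[name] else 1 - rel[name]
            total += p
    return total

# Illustrative example: a two-module parallel system with reliabilities
# 0.9 and 0.8 has reliability 1 - 0.1*0.2 = 0.98.
```

This runs in O(2^N) time, so it only suits small diagrams; conditioning on a single well-chosen module, as done in the text, is the practical shortcut.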
FIGURE 2.4 Expanding the diagram in Figure 2.3 about module C: (a) C not working; (b) C working. [Figure: the network of modules A, B, D, E, F with C removed in (a) and replaced by a perfect connection in (b).]
In the following analysis, the dependence of the reliability on the time t is omitted for simplicity of notation.
FIGURE 2.3 A non-series/parallel system. [Figure: an interconnection of modules A, B, C, D, E, F that cannot be reduced to series and parallel combinations.]
following expression for the reliability of a parallel system, denoted by R_p(t):

R_p(t) = 1 − ∏_{i=1}^{N} [1 − R_i(t)]    (2.15)

If module i has a constant failure rate λ_i,
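The series and parallel formulas, including (2.15), in code form; a minimal sketch with constant failure rates handled as a convenience wrapper (function names are illustrative):

```python
import math

def r_series(rs):
    # A series system works only if every module works.
    out = 1.0
    for r in rs:
        out *= r
    return out

def r_parallel(rs):
    # Equation (2.15): a parallel system fails only if every module fails.
    out = 1.0
    for r in rs:
        out *= (1 - r)
    return 1 - out

def r_parallel_const_rates(lams, t):
    # Module i with constant failure rate lam_i has R_i(t) = exp(-lam_i*t).
    return r_parallel([math.exp(-lam * t) for lam in lams])
```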
FIGURE 2.2 Series and parallel systems: (a) series system; (b) parallel system.
2.3.1 Series and Parallel Systems
The most basic structures are the series and parallel systems depicted in Figure 2.2. A series system is
assumption is inappropriate, especially during the "infant mortality" and "wear-out" phases of a component's life (Figure 2.1). In such cases, the Weibull distribution is often used. This distribution has two parameters
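A sketch of one common Weibull parameterization (the exact form and the scale/shape names are an assumption here, since several parameterizations are in use):

```python
import math

def weibull_hazard(t, lam, alpha):
    # Hazard (failure) rate for a Weibull with scale lam and shape alpha:
    # h(t) = alpha * lam * (lam*t)**(alpha - 1).
    return alpha * lam * (lam * t) ** (alpha - 1)

def weibull_reliability(t, lam, alpha):
    # R(t) = exp(-(lam*t)**alpha); alpha = 1 reduces to the exponential case.
    return math.exp(-((lam * t) ** alpha))

# alpha < 1 models infant mortality (falling hazard), alpha > 1 models
# wear-out (rising hazard), and alpha = 1 gives a constant hazard.
```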
we know that the component survived at least until time t. This conditional probability is represented by the failure rate (also called the hazard rate) of a component at time t, denoted by λ(t), which can be calculated
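The relation λ(t) = −R′(t)/R(t) can be checked numerically; a sketch using a central difference, which should recover a constant hazard for the exponential case (names are illustrative):

```python
import math

def hazard_from_reliability(R, t, dt=1e-6):
    # lambda(t) = -R'(t)/R(t): the conditional failure density given
    # survival to time t, estimated here with a central difference.
    dR = (R(t + dt) - R(t - dt)) / (2 * dt)
    return -dR / R(t)

def exp_reliability(t, lam=0.2):
    # Exponential survival function R(t) = exp(-lam*t).
    return math.exp(-lam * t)

# For the exponential case the estimated hazard is the constant lam.
```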
2.2 Failure Rate, Reliability, and Mean Time to Failure
C1, C2: Complexity factors; functions of the number of gates on the chip and the number of pins in the package.

Further details can be found in MIL-HDBK-217E, which is a handbook produced by the U.S. Department of Defense.
FIGURE 2.1 Bathtub curve.
shocks that it suffers, the ambient temperature, and the technology. The dependence on age is usually captured by what is known as the bathtub curve (see Figure 2.1). When components are very
CHAPTER 2 Hardware Fault Tolerance

Hardware fault tolerance is the most mature area in the general field of fault-tolerant computing. Many hardware fault-tolerance techniques have been developed and used in practice in critical applications ranging from te
CHAPTER 1 Preliminaries
1.5 Further Reading
An important part of the design and evaluation process of a fault-tolerant system is to demonstrate that the system does indeed function at the advertised level of reliability. Often the designed system is too complex to develop analytical
information redundancy are discussed, including storage redundancy (RAID systems), data replication in distributed systems, and, finally, the algorithm-based fault-tolerance technique that tolerates data errors in array computations.
1.4 Outline of This Book
FIGURE 1.1 Inadequacy of classical connectivity.
defined as the minimum number of nodes and lines, respectively, that have to fail before the network becomes disconnected. This gives a rough indication of how vulnerable a network
Such a system has an MTBF of just 1 hour and, consequently, a low reliability; however, its availability is high: A = 3599/3600 = 0.99972. These definitions assume, of course, that we have a state in which the system can be said
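The availability figure quoted here follows from the steady-state formula A = MTTF/(MTTF + MTTR), i.e., the long-run fraction of time the system is up; a minimal sketch (the function name is illustrative), using roughly 3599 seconds of uptime and 1 second of repair per hour:

```python
def availability(mttf, mttr):
    # Steady-state availability: the long-run fraction of time the
    # system is operational, given mean time to failure (mttf) and
    # mean time to repair (mttr) in the same units.
    return mttf / (mttf + mttr)

# 3599 s up, 1 s down per cycle gives A = 3599/3600, matching the text.
```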