lec15_seu - ELEC 7770 Advanced VLSI Design Spring 2010 Soft...



ELEC 7770 Advanced VLSI Design, Spring 2010
Soft Errors and Fault-Tolerant Design
Vishwani D. Agrawal, James J. Danaher Professor
ECE Department, Auburn University, Auburn, AL 36849
vagrawal@eng.auburn.edu
http://www.eng.auburn.edu/~vagrawal/COURSE/E7770_Spr10
Spring 2010, Apr 14 . . . ELEC 7770: Advanced VLSI Design (Agrawal)

Soft Errors
- Soft errors are errors caused by the operating environment.
- They are not due to a permanent hardware fault.
- Soft errors are intermittent or random, which makes testing for them unreliable.
- One way to deal with soft errors is to make hardware robust:
  - Capable of detecting soft errors
  - Capable of correcting soft errors
- Both measures are probabilistic.

Some Early References
- J. von Neumann, “Probabilistic Logics and the Synthesis of Reliable Organisms from Unreliable Components,” pp. 329-378, 1959, in A. H. Taub, editor, John von Neumann: Collected Works, Volume V: Design of Computers, Theory of Automata and Numerical Analysis, Oxford University Press, 1963.
- M. A. Breuer, “Testing for Intermittent Faults in Digital Circuits,” IEEE Trans. Computers, vol. C-22, no. 3, pp. 241-246, March 1973.
- T. C. May and M. H. Woods, “Alpha-Particle-Induced Soft Errors in Dynamic Memories,” IEEE Trans. Electron Devices, vol. ED-26, no. 1, pp. 2-9, 1979.

Causes of Soft Errors
- Interconnect coupling (crosstalk).
- Power supply noise: IR-drop, power droop, ground bounce.
- Ignition noise.
- Electromagnetic pulse (EMP).
- Effects generally attributed to alpha-particles:
  - Charged particles: electrons, protons, ions.
  - Radiation (photons): X-rays, gamma-rays, ultra-violet light.

Sources of Alpha-Particles
- Radioactive contamination in VLSI packaging material.
- Ionosphere, magnetosphere and solar radiation.
- Other electromagnetic radiation.

Alpha-Particle
- Helium nucleus: two protons and two neutrons, mass = $6.65 \times 10^{-27}$ kg, charge = +2e ($e = 1.6 \times 10^{-19}$ C).
- Energy = 3.73 GeV

Soft Error Rate (SER)
- Failures in time (FIT): One FIT is 1 error per billion hours of operation.
- Alternative unit is mean time between failures (MTBF) or mean time to failure (MTTF).
- 1 year MTBF = $10^{9}/(365 \times 24)$ = 114,155 FIT

Particle Strike
[Figure: an ion or charged particle strikes an n region over a p substrate, with + and - charges marked along its path.]

Induced Current
[Figure: induced current versus time.]
$I(t) = I_0 \left( e^{-t/a} - e^{-t/b} \right)$, $a \gg b$

Voltage Induced at a Node
$V = Q/C$, where $Q = \int I(t)\,dt$ and $C$ = node capacitance.
- Smaller node capacitance will result in larger voltage swing.

Effect on Digital Circuit
[Figure: charged particles striking a clocked circuit: IN, combinational logic, OUT, clock CK.]

An SRAM Cell
[Figure: SRAM cell with word line WL, supply VDD and the two bit lines (BL), storing a 0 bit and a 1 bit.]

SRAM Cell Struck by Alpha-Particle
- Single-Event Upset (SEU)
[Figure: charged particles strike the SRAM cell; the stored bits flip, 0→1 and 1→0.]

A Resistor Hardened SRAM Cell
[Figure: SRAM cell with word line WL, supply VDD and the two bit lines (BL), storing a 0 bit and a 1 bit, with hardening resistors added.]
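
As a rough numerical check of the FIT conversion and of the induced-charge relations given earlier (I(t), Q = ∫ I(t) dt, V = Q/C), the following sketch may help. The function names and the values chosen for I0, a, b and C are illustrative assumptions, not data from the lecture. The closed form Q = I0(a - b) follows from integrating the double exponential from 0 to infinity.

```python
import math

def mtbf_hours_to_fit(mtbf_hours):
    """Convert an MTBF in hours to FIT (failures per 10^9 hours)."""
    return 1e9 / mtbf_hours

def induced_current(t, i0, a, b):
    """Double-exponential pulse I(t) = I0 * (exp(-t/a) - exp(-t/b)), with a >> b."""
    return i0 * (math.exp(-t / a) - math.exp(-t / b))

def collected_charge(i0, a, b):
    """Closed form of the integral of I(t) from 0 to infinity: Q = I0 * (a - b)."""
    return i0 * (a - b)

def numeric_charge(i0, a, b, steps=200_000):
    """Midpoint-rule integral of I(t) over [0, 40a], as a check on the closed form."""
    dt = 40 * a / steps
    return sum(induced_current((k + 0.5) * dt, i0, a, b) for k in range(steps)) * dt

def voltage_swing(q, c):
    """V = Q / C for a node of capacitance C."""
    return q / c

# One-year MTBF expressed in FIT (same year length as the slide: 365 x 24 hours).
print(round(mtbf_hours_to_fit(365 * 24)))

# Hypothetical pulse and node parameters, assumed for illustration only.
i0, a, b = 100e-6, 200e-12, 50e-12      # amperes, seconds, seconds
q = collected_charge(i0, a, b)
for c in (20e-15, 10e-15):              # farads
    print(c, voltage_swing(q, c))       # halving C doubles the swing
```

The loop at the end illustrates the slide's point that a smaller node capacitance gives a larger voltage swing for the same collected charge.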

D-Latch
[Figure: D-latch with CK = 0 holding its state: Q = 1, complementary output = 0.]

SEU in D-Latch
[Figure: charged particles strike the latch while CK = 0; Q flips 1→0 and the complementary output flips 0→1.]

Single Event Transients in Combinational Logic
[Figure: charged particles induce a transient on a signal inside combinational logic placed between flip-flops clocked by CK.]

Effects of Transients
- Error correcting effects
  - Transient pulse is filtered by gate inertia
  - Transient is blocked by an unsensitized path
  - Transient is blocked by an inactive clock
- Error enhancing effects
  - Large number of gates can produce multiple pulses
  - Fanouts can multiply error pulses

Typical Soft Error Distribution
- S. Mitra, N. Seifert, M. Zhang, Q. Shi, and K. S. Kim, “Robust System Design with Built-In Soft-Error Resilience,” Computer, vol. 38, no. 2, pp. 43-52, February 2005.

Soft Error Simulation
- F. Wang and V. D. Agrawal, “Soft Error Rate with Inertial and Logical Masking,” Proc. 22nd International Conference on VLSI Design, January 2009, pp. 459-464.
- F. Wang and V. D. Agrawal, “Soft Error Rate Determination for Nanoscale Sequential Logic,” Proc. 11th International Symposium on Quality Electronic Design (ISQED), March 2010.

SEUs in FPGA
- Parts that can be affected
  - Look-up table (LUT)
  - Configuration memory cell
  - Flip-flop
  - Block RAM
- F. L. Kastensmidt, L. Carro and R. Reis, Fault-Tolerant Techniques for SRAM-Based FPGAs, Springer, 2006.

[Figure: a LUT whose memory cells are selected by inputs F1 to F4; LUT out = 1.]

SEU in LUT
[Figure: the same LUT, with inputs F1 to F4, after a charged particle strike; one memory cell is changed from 1 to 0.]

Four Types of SEU in FPGA
[Figure: FPGA logic with configuration memory cells (M), a LUT with inputs F1 to F4, a flip-flop (FF) and block RAM.]
- Type 1: LUT
- Type 2: Configuration memory cell
- Type 3: FF
- Type 4: Block RAM

SEU Detection Methods
- Hardware redundancy
- Time redundancy
- Error detection codes (EDC)
- Self-checker techniques

SEU Mitigation Techniques
- Triple modular redundancy (TMR)
- Multiple redundancy with voting
- Error detection and correction codes (EDAC)
- Hardened memory cells
- FPGA-specific methods
  - Reconfiguration
  - Partial configuration
  - Rerouting design

Hardware Redundancy for Detection
[Figure: the inputs drive the combinational logic and a duplicated copy; the two outputs are compared, and logic 1 indicates error.]
- Hardware overhead is high, ~100%.
- Performance penalty is negligible.

Time Redundancy for Detection
[Figure: the combinational logic output is captured by two flip-flops, one clocked by CK and one by CK + d; the two values are compared, and logic 1 indicates error.]
- Hardware overhead is low.
- Performance penalty (~ d) = maximum detectable pulse width.

Repeat on Error Detection
[Figure: as in time redundancy, with the two flip-flops (CK and CK + d) also feeding a C-element (C) that drives the output; logic 1 indicates error.]
- Operation: If error is detected, then output retains its previous value. Repeating the computation can produce correct result.

Muller C-Element
[Figure: C-element symbol with inputs A and B, and an implementation built around an SR latch.]
  A  B  output
  0  0  0
  0  1  Old output
  1  0  Old output
  1  1  1
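
The Muller C-element truth table above can be expressed as a small behavioural model. This is a minimal sketch, assuming a non-inverting element and a sampled (discrete-time) view of the signals; the class and function names are hypothetical, and the one-sample glitch on the second copy stands in for a soft error.

```python
class CElement:
    """Behavioural model of a non-inverting Muller C-element."""

    def __init__(self, initial=0):
        self.output = initial

    def update(self, a, b):
        # The output follows the inputs only when they agree;
        # when they disagree it keeps its old value.
        if a == b:
            self.output = a
        return self.output


def filter_stream(samples, initial=0):
    """Feed (a, b) pairs through one C-element and return the output trace."""
    element = CElement(initial)
    return [element.update(a, b) for a, b in samples]


# Two copies of the same signal; copy_b has a one-sample glitch (assumed upset).
copy_a = [0, 1, 1, 1, 0, 0]
copy_b = [0, 1, 0, 1, 0, 0]
print(filter_stream(zip(copy_a, copy_b)))  # [0, 1, 1, 1, 0, 0]
```

Because the output only changes when both inputs agree, the glitch on one copy never reaches the output, which is the property the repeat-on-error scheme relies on.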

Dynamic CMOS C-Element
[Figure: C-element symbol with inputs A and B, and its dynamic CMOS circuit.]
  A  B  output
  0  0  1
  0  1  Old output
  1  0  Old output
  1  1  0

Pseudostatic CMOS C-Element
[Figure: the dynamic CMOS C-element with a weak keeper on the output.]
  A  B  output
  0  0  1
  0  1  Old output
  1  0  Old output
  1  1  0

Built-In Soft Error Resilience (BISER)
[Figure: data from combinational logic feeds a flip-flop (A) and a duplicate flip-flop (B) on the same clock; their outputs drive a C-element with a weak keeper on the output.]
  A  B  output
  0  0  1
  0  1  Old output
  1  0  Old output
  1  1  0

BISER
- Assumptions:
  - Most soft errors in combinational logic are eliminated by inertial and logic masking.
  - Soft error pulse generated in flip-flop is much shorter than clock period.
  - Probability of either a master or slave latch being struck by soft error exactly at clock edge is small.
- Flip-flop is duplicated and outputs fed to C-element.
- Twenty times reduction of soft error observed.
- Ref.: S. Mitra, N. Seifert, M. Zhang, Q. Shi, and K. S. Kim, “Robust System Design with Built-In Soft-Error Resilience,” Computer, vol. 38, no. 2, pp. 43-52, February 2005.

Triple Modular Redundancy (TMR)
[Figure: the inputs drive three copies of the combinational logic (copy 1, copy 2, copy 3); a majority voter produces the output.]

TMR Error Reduction
- Voter input error probability = E, assumed independent for each input.
- Output error probability, e = Prob(two errors or three errors):
  $e = \binom{3}{2} E^2 (1 - E) + \binom{3}{3} E^3 = 3E^2 - 3E^3 + E^3 = 3E^2 - 2E^3$
- For very small E, $E^3 \ll E^2$, so $e \approx 3E^2 \ll E$.
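
The closed form e = 3E^2 - 2E^3 can be cross-checked by enumerating all eight error patterns at the voter inputs and summing the probability of those in which the majority is wrong. The sketch below assumes independent input errors, as the slide does; the function names are illustrative.

```python
from itertools import product

def majority(a, b, c):
    """Two-out-of-three majority of three bits."""
    return (a & b) | (b & c) | (a & c)

def tmr_output_error(E):
    """Closed form from the slide: e = 3E^2 - 2E^3."""
    return 3 * E**2 - 2 * E**3

def tmr_output_error_enumerated(E):
    """Sum the probability of every input-error pattern that fools the voter."""
    total = 0.0
    for pattern in product((0, 1), repeat=3):   # 1 marks an erroneous input
        p = 1.0
        for bit in pattern:
            p *= E if bit else (1 - E)
        if majority(*pattern):                  # two or three inputs in error
            total += p
    return total

for E in (0.0, 0.001, 0.01, 0.1, 0.2, 0.3, 0.4, 0.5, 0.6):
    print(E, tmr_output_error(E), tmr_output_error_enumerated(E))
```

Note that for E = 0.5 the output error probability equals the input error probability, and above 0.5 the voter makes things worse.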

TMR Error Probability
  Input error probability, E    Output error probability, e
  0.0                           0.0
  0.001                         0.000002998
  0.01                          0.000298
  0.1                           0.028
  0.2                           0.104
  0.3                           0.216
  0.4                           0.352
  0.5                           0.5
  0.6                           0.648

Majority Voter Circuit
[Figure: majority voter block with inputs A, B, C, and a gate-level circuit producing the output.]
  A  B  C  output
  0  0  0  0
  0  0  1  0
  0  1  0  0
  0  1  1  1
  1  0  0  0
  1  0  1  1
  1  1  0  1
  1  1  1  1

Alternative Implementations of Voter
[Figure: the voter as a LUT holding the output column 0 0 0 1 0 1 1 1, addressed by A, B, C, and as a circuit powered from VDD.]

Triple Modular Redundancy (TMR)
[Figure: a single combinational logic block whose output is captured by flip-flops clocked at CK, CK + d and CK + 2d; a majority voter feeds an output flip-flop clocked at CK + 3d.]

TMR for Memory Cells
[Figure: the combinational logic output is stored in three flip-flops clocked by CK; a majority voter produces the output.]
Problems:
1. Accumulation of errors in flip-flops.
2. Voter is not protected.

FF Refresh and TMR for Memory Cells
[Figure: three flip-flops r1, r2 and r3 clocked by CK, each refreshed through its own majority voter, with a further majority voter producing the output.]

Reliability Analysis
- Determine how long a system will work without failure.
- Find:
  - Mean time to failure (MTTF)
  - Mean time to repair (MTTR)
  - FIT rate

Reliability Function
- Reliability function of a system, R(t) = Probability of survival at time t
- Determined from failure rates of components, λ(t) = Number of failures per unit time
- Generally varies with time.

Failure Rate, λ(t)
[Figure: failures per second, λ(t), on a logarithmic axis from $10^{0}$ down to $10^{-12}$, versus time, t.]
[Figure regions, left to right: infant mortality; constant failure rate (useful life), λ(t) = λ; wearout or aging.]

Deriving R(t)
- R(t) is the probability of no error in interval [0, t].
- Divide the interval into a large number (n) of subintervals of duration t/n.
- Let x be the probability of error in one subinterval.
- Assume that duration t/n is so small that at most one error can occur in a subinterval.
- Then, average errors in a subinterval = 0·(1 - x) + 1·x = x = λt/n.
- Probability of no error in interval [0, t] is
  $R(t) = (1 - x)^n = (1 - \lambda t / n)^n \to e^{-\lambda t}$ as $n \to \infty$,
  using the standard limit $\lim_{n \to \infty} (1 - a/n)^n = e^{-a}$.

R(t) and MTBF
$R(t) = e^{-\lambda t}$
$\text{MTBF} = \int_0^\infty R(t)\,dt = \int_0^\infty e^{-\lambda t}\,dt = 1/\lambda$

Reliability and MTBF
[Figure: reliability R(t), from 0.0 to 1.0, versus time t marked at 1, 2 and 3 MTBF; at t = 1 MTBF, R(t) = 1/e = 0.368.]

Example: First Generation Computer
- 10,000 electron tubes.
- Average burn out rate: 5 tubes per 100,000 hours.

Reliability of TMR
R(TMR) = Prob(all three modules correct) + Prob(any two modules correct)
$R(\text{TMR}) = R^3 + 3R^2 (1 - R) = 3R^2 - 2R^3 = 3e^{-2\lambda t} - 2e^{-3\lambda t}$

MTBF of TMR
$R(\text{TMR}) = 3e^{-2\lambda t} - 2e^{-3\lambda t}$
$\text{MTBF} = \int_0^\infty R(\text{TMR})\,dt = \frac{3}{2\lambda} - \frac{2}{3\lambda} = \frac{5}{6\lambda}$

MTBF of TMR
[Figure: reliability R(t), from 0.0 to 1.0, versus time t for TMR and for a single module, with the mission duration marked.]

Error Detection Code
- Errors: Bits can flip due to noise in circuits and in communication.
- Extra bits used for error detection.
- Example: a parity bit in ASCII code, used as an error detection code.
  Even parity code for A (even number of 1s): 01000001
  Odd parity code for A (odd number of 1s):   11000001
  The leading bit is the parity bit; the remaining seven bits are the 7-bit ASCII code.
- Single-bit error in 7-bit code of “A”, e.g., 1000101, will change symbol to “E”, or 1000000 to “@”. But the error will be detected in the 8-bit code because the error changes the specified parity.

Richard W. Hamming (1915-1998)
- Error-correcting codes (ECC).
- Also known for Hamming distance: HD = Number of bits two binary vectors differ in.
  Example: HD(1101, 1010) = 3
- Hamming Medal, 1988

The Idea of Hamming Code
- Code space contains $2^N$ possible N-bit code words.
[Figure: code word 1010 ”A” and its neighbors at HD = 1: 0010 ”2”, 1110 ”E”, 1000 ”8” and 1011 ”B”; a 1-bit error in “A” lands on one of them.]
- Error not correctable. Reason: No redundancy.
- Hamming’s idea: Increase HD between valid code words.

Hamming’s Distance ≥ 3 Code
[Figure: the received word 0010010 ”?”, a 1-bit error in “A”, is at HD = 1 from 1010010 ”A” and at HD ≥ 2 from the other valid code words shown (0010101 ”2”, 0011110 ”3”, 1000111 ”8”, 1011001 ”B”, 1110100 ”E”); shortest distance decoding eliminates the error.]

Minimum Distance-3 Hamming Code
  Symbol  Original code  Odd-parity code  ECC, HD ≥ 3
  0       0000           10000            0000000
  1       0001           00001            0001011
  2       0010           00010            0010101
  3       0011           10011            0011110
  4       0100           00100            0100110
  5       0101           10101            0101101
  6       0110           10110            0110011
  7       0111           00111            0111000
  8       1000           01000            1000111
  9       1001           11001            1001100
  A       1010           11010            1010010
  B       1011           01011            1011001
  C       1100           11100            1100001
  D       1101           01101            1101010
  E       1110           01110            1110100
  F       1111           11111            1111111

Original code: Symbol “0” with a single-bit error will be interpreted as “1”, “2”, “4” or “8”. Reason: Hamming distance between codes is 1.
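
The minimum distances of the three codes in the table, and the shortest-distance decoding rule, can be checked with a short sketch. The ECC column below is copied from the table; the helper names (`hamming_distance`, `min_distance`, `odd_parity`, `decode_nearest`) are assumptions made for illustration.

```python
SYMBOLS = "0123456789ABCDEF"
ECC_WORDS = [
    "0000000", "0001011", "0010101", "0011110",
    "0100110", "0101101", "0110011", "0111000",
    "1000111", "1001100", "1010010", "1011001",
    "1100001", "1101010", "1110100", "1111111",
]
ECC = dict(zip(SYMBOLS, ECC_WORDS))  # symbol -> 7-bit code word from the table

def hamming_distance(u, v):
    """Number of bit positions in which two equal-length words differ."""
    return sum(x != y for x, y in zip(u, v))

def min_distance(codewords):
    """Smallest Hamming distance between any two distinct code words."""
    words = list(codewords)
    return min(hamming_distance(u, v)
               for i, u in enumerate(words) for v in words[i + 1:])

def odd_parity(bits):
    """Prefix a parity bit so that the total number of 1s is odd."""
    return ("0" if bits.count("1") % 2 else "1") + bits

def decode_nearest(word, code=ECC):
    """Shortest-distance decoding: the symbol whose code word is closest."""
    return min(code, key=lambda symbol: hamming_distance(word, code[symbol]))

original = [format(i, "04b") for i in range(16)]
print(min_distance(original))                          # 1
print(min_distance(odd_parity(w) for w in original))   # 2
print(min_distance(ECC.values()))                      # 3
print(decode_nearest("0010010"))                       # A
```

The last line decodes the corrupted word used in the distance-3 figure back to “A”, the only valid code word at HD = 1 from it.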
- With HD = 1, a code word with any bit error maps onto another valid code word. Remedy: Design codes with HD ≥ 2.
- Example: Parity code. Single bit error detected but not correctable. Remedy: Design codes with HD ≥ 3.
- For single bit error correction, decode as the valid code at HD = 1.
- For more error bit detection or correction, design code with HD ≥ 4.

A Book on Coding Theory
- R. W. Hamming, Coding and Information Theory, Englewood Cliffs, New Jersey: Prentice-Hall, 1980.

Byzantine Empire, 527-565
- Emperor Justinian and General Belisarius

Byzantine General’s Problem
- In a war a general needs to communicate an attack (a) or retreat (r) order to subordinates in the field.
- For success a perfect agreement is necessary.
- Byzantine Fault:
  - Subordinates can be unreliable or malicious.
  - Communication (messengers) can be unreliable or malicious.

Example 1: Single Fault
General: D; Subordinates: A, B and C
[Figure: D sends the order r to A, B and C; on one link the order is corrupted, r→a.]

Example 1: Majority Agreement
General: D; Subordinates: A, B and C
[Figure: the subordinates exchange the orders they received; by majority, A, B and C all decide Retreat.]

Example 2: Two Faults
General: D; Subordinates: A, B and C
[Figure: D sends the order a to A, B and C.]

Example 2: Byzantine Failure
General: D; Subordinates: A, B and C
[Figure: after the exchange, with faulty r messages in circulation, A and B decide Attack while C decides Retreat.]

Byzantine Resilient System
- A system that can correctly function in presence of Byzantine faults.
- Byzantine protocol for n node system:
  - Any node can initiate a message broadcast.
  - Each node rebroadcasts the received message to all nodes it has not heard from.
  - After communications end, nodes take majority decision.
- Ref.: L. Lamport, R. Shostak and M. Pease, “The Byzantine Generals Problem,” ACM Trans. Prog. Lang. Syst., vol. 4, no. 3, pp. 382-401, July 1982.

Byzantine Resilience Conditions
In order to tolerate t failures:
- The system must have at least 3t + 1 nodes.
- There must be at least 2t + 1 disjoint communication paths between nodes.
- A node must exchange messages at least t + 1 times.

Four-Core Processor System
[Figure: four processors A, B, C and D.]

Example 1: C Initiates Message m, Sends n to A and m to B and D
  Processor        A    B    D
  First round      n    m    m
  Second round     mm   mn   mn
  Decoded message  m    m    m

Example 2: C Initiates Message m, B Sends p to A and D
  Processor        A    B    D
  First round      m    m    m
  Second round     mp   mm   mp
  Decoded message  m    m    m

Example 3: C Initiates Message m, A and B generate faulty message q and ...
  Processor        A    B    D
  First round      m    m    m
  Second round     mq   mq   qq
  Decoded message  m    m    q

References
- L. Lamport, R. Shostak and M. Pease, “The Byzantine Generals Problem,” ACM Trans. Prog. Lang. Syst., vol. 4, no. 3, pp. 382-401, July 1982.
- D. K. Pradhan, Fault-Tolerant Computer System Design, Upper Saddle River, New Jersey: Prentice Hall PTR, 1996.
- P. K. Lala, Self-Checking and Fault-Tolerant Digital Design, San Francisco: Morgan Kaufmann, 2001.
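
The three four-processor examples (C initiates message m) can be replayed with a small simulation. This is only a sketch of the two-round exchange as tabulated in those examples, not an implementation of the full algorithm of Lamport, Shostak and Pease; the function names and the `relay_override` parameter are hypothetical.

```python
from collections import Counter

def decide(values):
    """Majority of the received values (ties fall to the value seen first)."""
    return Counter(values).most_common(1)[0][0]

def two_round_exchange(first_round, relay_override=None):
    """Replay one broadcast from the initiator to the other nodes.

    first_round:    receiver -> value it got from the initiator.
    relay_override: faulty receiver -> value it relays in the second round
                    instead of the value it actually received.
    Returns receiver -> decoded message.
    """
    relay_override = relay_override or {}
    decoded = {}
    for node, own_value in first_round.items():
        relayed = [relay_override.get(other, first_round[other])
                   for other in first_round if other != node]
        decoded[node] = decide([own_value] + relayed)
    return decoded

# Example 1: faulty initiator C sends n to A and m to B and D.
print(two_round_exchange({"A": "n", "B": "m", "D": "m"}))
# Example 2: B relays p instead of m.
print(two_round_exchange({"A": "m", "B": "m", "D": "m"}, {"B": "p"}))
# Third example: A and B both relay q; D is outvoted and decodes q.
print(two_round_exchange({"A": "m", "B": "m", "D": "m"}, {"A": "q", "B": "q"}))
```

With four nodes (3t + 1 for t = 1) a single fault is masked in the first two runs, while the third run has two faulty nodes and node D decodes the wrong message, in line with the resilience conditions.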

This note was uploaded on 09/16/2011 for the course ELEC 7770 taught by Professor Agrawal,v during the Spring '08 term at Auburn University.
