Lec18 - Recovery Techniques thus far allow failure handling...

CS677: Distributed OS (Computer Science), Lecture 18

Recovery

- Techniques thus far allow failure handling
- Recovery: operations that must be performed after a failure to recover to a correct state
- Techniques:
  - Checkpointing:
    - Periodically checkpoint state
    - Upon a crash, roll back to a previous checkpoint with a consistent state

Independent Checkpointing

- Each process periodically checkpoints independently of other processes
- Upon a failure, work backwards to locate a consistent cut
- Problem: if the most recent checkpoints form an inconsistent cut, processes will need to keep rolling back until a consistent cut is found
- Cascading rollbacks can lead to a domino effect
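The backward search for a consistent cut under independent checkpointing can be sketched roughly as follows. This is only an illustration under stated assumptions, not the method from the lecture: the `Checkpoint` record, the `find_recovery_line` function, and the idea that every checkpoint stores the globally unique ids of messages sent and received so far are all hypothetical. A cut is treated as inconsistent when some process has recorded receiving a message that no process in the cut has recorded sending.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Checkpoint:
    # Hypothetical record: ids of messages sent / received before the checkpoint.
    sent: frozenset = frozenset()
    received: frozenset = frozenset()


def find_recovery_line(checkpoints):
    """checkpoints maps a process name to its checkpoints, oldest first.
    Index 0 is assumed to be the initial state (nothing sent or received).
    Returns, per process, the index of the checkpoint to roll back to."""
    cut = {p: len(cps) - 1 for p, cps in checkpoints.items()}
    changed = True
    while changed:
        changed = False
        # Everything the processes in the current cut remember having sent.
        all_sent = set()
        for p, i in cut.items():
            all_sent |= checkpoints[p][i].sent
        for p, i in cut.items():
            # A message received but never sent (within the cut) is an orphan.
            orphans = checkpoints[p][i].received - all_sent
            if orphans and i > 0:
                cut[p] = i - 1  # roll this process back one checkpoint
                changed = True
    return cut


# P sends m1, Q checkpoints, Q sends m2, P checkpoints, P sends m3, Q checkpoints.
history = {
    "P": [Checkpoint(),
          Checkpoint(sent=frozenset({"m1"}), received=frozenset({"m2"}))],
    "Q": [Checkpoint(),
          Checkpoint(received=frozenset({"m1"})),
          Checkpoint(sent=frozenset({"m2"}), received=frozenset({"m1", "m3"}))],
}
print(find_recovery_line(history))  # {'P': 0, 'Q': 0}
```

In this example each rollback exposes a new orphan message on the other process, so both processes fall all the way back to their initial state, which is the domino effect in miniature.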

Coordinated Checkpointing

- Take a distributed snapshot [discussed in Lec 11]
- Upon a failure, roll back to the latest snapshot
- All processes restart from the latest snapshot

Message Logging

- Checkpointing is expensive
  - All processes restart from the previous consistent cut
  - Taking a snapshot is expensive
  - Infrequent snapshots => all computations after the previous snapshot will need to be redone [wasteful]
- Combine checkpointing (expensive) with message logging (cheap)
  - Take infrequent checkpoints
  - Log all messages between checkpoints to local stable storage
  - To recover: simply replay messages from the previous checkpoint
  - Avoids recomputations from the previous checkpoint
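A minimal sketch of combining infrequent checkpoints with message logging might look like this. The names `StableStorage` and `Process` are hypothetical, the "stable storage" is just an in-memory object standing in for a local disk, and the sketch assumes message handling is deterministic, so that replaying the same messages in the same order reproduces the same state.

```python
import copy


class StableStorage:
    """In-memory stand-in for local stable storage (assumption: a real
    system would write the checkpoint and the log to disk)."""

    def __init__(self):
        self.checkpoint = {}  # last saved copy of the process state
        self.log = []         # messages received since that checkpoint


class Process:
    def __init__(self, storage):
        self.storage = storage
        self.state = {}

    def handle(self, msg):
        # Deterministic state transition: msg is a (key, delta) pair.
        key, delta = msg
        self.state[key] = self.state.get(key, 0) + delta

    def receive(self, msg):
        self.storage.log.append(msg)  # log the message before applying it
        self.handle(msg)

    def take_checkpoint(self):
        self.storage.checkpoint = copy.deepcopy(self.state)
        self.storage.log.clear()  # earlier messages are covered by the checkpoint

    def recover(self):
        # Restore the last checkpoint, then replay the logged messages in order.
        self.state = copy.deepcopy(self.storage.checkpoint)
        for msg in self.storage.log:
            self.handle(msg)


storage = StableStorage()
p = Process(storage)
p.receive(("x", 5))
p.take_checkpoint()
p.receive(("x", 2))
p.receive(("y", 1))

p.state = {}  # simulate a crash that wipes volatile state
p.recover()
print(p.state)  # {'x': 7, 'y': 1}
```

The checkpoint can stay infrequent because the log, which is cheap to append to, carries the process from the checkpoint forward to its pre-crash state.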
Recovery Oriented Computing

- Cheaper to optimize for recovery than to design the system to prevent faults
- Need to restart the system upon failure
  - Naïve case: reboot
  - Reboot part of the system: modular system, where components can be restarted independently
    - Unix /etc/rc service
  - Stateful recovery
    - Database recovery
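The idea of rebooting only part of a modular system can be illustrated with a toy sketch. The `Component` class and the `restart_failed` function are hypothetical names chosen for this example. They do not correspond to any real init system, and a real component would stop and relaunch a service rather than flip a flag.

```python
class Component:
    """Toy model of an independently restartable component."""

    def __init__(self, name):
        self.name = name
        self.healthy = True
        self.restarts = 0

    def restart(self):
        # Stand-in for stopping and relaunching the real service.
        self.healthy = True
        self.restarts += 1


def restart_failed(components):
    """Restart only the failed components and return their names."""
    restarted = []
    for c in components:
        if not c.healthy:
            c.restart()
            restarted.append(c.name)
    return restarted


system = [Component("web"), Component("cache"), Component("db")]
system[1].healthy = False      # simulate a failure in one component
print(restart_failed(system))  # ['cache']
```

Only the failed component is restarted; the others keep running, which is what separates a partial reboot from the naïve full reboot.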

This note was uploaded on 11/22/2011 for the course COMPSCI 677 taught by Professor Shenoy during the Spring '08 term at UMass (Amherst).
