1fault tolerance - Scheduling in Distributed Systems There...

Scheduling in Distributed Systems

There is not really a lot to say about scheduling in a distributed system. Each processor does its own local scheduling (assuming that it has multiple processes running on it), without regard to what the other processors are doing. However, when a group of related, heavily interacting processes are all running on different processors, independent scheduling is not always the most efficient approach. See the example below:

              Processor
Time slot     0     1
    0         A     C
    1         B     D
    2         A     C
    3         B     D
    4         A     C
    5         B     D

Scheduling matrix with processors 0 to 7 as columns and time slots 0 to 5 as rows, where X marks an allocated slot. Allocated slots per time slot:

Time slot 0: X X
Time slot 1: X X
Time slot 2: X X X
Time slot 3: X X
Time slot 4: X X X
Time slot 5: X X
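The cost of independent scheduling can be checked with a minimal Python sketch of the two-processor table above. It assumes, purely for illustration, that A and D are the heavily interacting pair (the table itself does not say which processes communicate); the names SCHEDULE and slots_running_together are hypothetical.

```python
# Two-processor schedule from the example table:
# SCHEDULE[processor][time_slot] -> process running in that slot.
SCHEDULE = {
    0: ["A", "B", "A", "B", "A", "B"],
    1: ["C", "D", "C", "D", "C", "D"],
}


def slots_of(schedule, process):
    """Return the set of time slots in which `process` is running."""
    return {
        slot
        for row in schedule.values()
        for slot, name in enumerate(row)
        if name == process
    }


def slots_running_together(schedule, p, q):
    """Return the sorted time slots in which p and q run simultaneously."""
    return sorted(slots_of(schedule, p) & slots_of(schedule, q))


# Assumption for illustration: A and D exchange many messages.
print(slots_running_together(SCHEDULE, "A", "D"))  # []
print(slots_running_together(SCHEDULE, "A", "C"))  # [0, 2, 4]
```

Under this assumption A and D are never scheduled in the same time slot, so every message one of them sends has to wait until the other's next slot before it can be processed.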

Scheduling in Distributed Systems (continued)

Although it is difficult to determine inter-process communication patterns dynamically, in many cases a group of related processes will be started together. We can assume that processes are created in groups and that intra-group communication is much more prevalent than inter-group communication. We can further assume that enough processors are available to handle the largest group, and that each processor is multiprogrammed with N process slots.

Ousterhout’s co-scheduling: The idea is to have each processor use a round-robin scheduling algorithm in which all processors first run the process in slot 0 for a fixed period, then all switch to slot 1 and run it for a fixed period, and so on. A broadcast message could be used to tell each processor when to switch processes, keeping the time slices synchronized. By putting all the members of a process group in the same slot number, but on different processors, one gets the advantage of N-fold parallelism, with a guarantee that all the processes in the group will run at the same time, which maximizes communication throughput.
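The placement rule described above can be sketched in Python as follows. This is a minimal illustration, not Ousterhout's actual algorithm: the first-fit search over slots is an assumption chosen for simplicity, and the names make_matrix, place_group, and run_round_robin are hypothetical.

```python
def make_matrix(num_slots, num_processors):
    """Create an empty scheduling matrix: matrix[slot][processor] -> process."""
    return [[None] * num_processors for _ in range(num_slots)]


def place_group(matrix, group):
    """Place every member of `group` in the same slot, on different processors.

    Assumption: first-fit, i.e. use the first slot with enough free processors.
    Returns the chosen slot number.
    """
    for slot, row in enumerate(matrix):
        free = [proc for proc, occupant in enumerate(row) if occupant is None]
        if len(free) >= len(group):
            for proc, member in zip(free, group):
                row[proc] = member
            return slot
    raise RuntimeError("no slot has enough free processors for this group")


def run_round_robin(matrix, rounds=1):
    """Yield (slot, row) in the order in which all processors step through slots."""
    for _ in range(rounds):
        for slot, row in enumerate(matrix):
            yield slot, list(row)


matrix = make_matrix(num_slots=2, num_processors=4)
place_group(matrix, ["A1", "A2", "A3"])  # fits in slot 0
place_group(matrix, ["B1", "B2"])        # slot 0 has one free processor, so slot 1
for slot, row in run_round_robin(matrix):
    print(slot, row)
```

Because every member of a group shares one slot number, the synchronized switch from slot to slot runs the whole group at once. The broadcast message that keeps the time slices aligned is not modeled here.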
Fault Tolerance

A system is said to fail when it does not meet its specification. Examples of systems that can fail: a supermarket’s distributed ordering system and a distributed air traffic control system (a safety-critical system).

Component Faults: Computer systems can fail due to a fault in some component, such as a processor, memory, I/O device, cable, or software. A fault is a malfunction, possibly caused by a design error, a manufacturing error, a programming error, physical damage, deterioration over time, harsh environmental conditions, unexpected inputs, operator error, or many other causes.

Faults are generally classified as:

Transient fault: occurs once and then disappears. If the operation is repeated, the fault does not recur on the second try.

Intermittent fault: occurs, then vanishes, then reappears, and so on, like a loose contact on a connector. Intermittent faults cause a great deal of aggravation because they are difficult to diagnose.

Permanent fault: one that continues to exist until the faulty component is repaired or

This note was uploaded on 09/17/2009 for the course IT it771 taught by Professor Jenisha during the Fall '09 term at University of Advancing Technology.
