Chapter 5: Terminology and Basic Algorithms
Ajay Kshemkalyani and Mukesh Singhal
Distributed Computing: Principles, Algorithms, and Systems, Cambridge University Press, 2008.

Topology Abstraction and Overlays
System: undirected (weighted) graph (N, L), where n = |N| and l = |L|.

Physical topology:
- Nodes: network nodes, routers, and all end hosts (whether participating or not)
- Edges: all LAN and WAN links, and direct edges between end hosts
- E.g., the Fig. 5.1(a) topology plus all routers and links in the WANs

Logical topology (application context):
- Nodes: end hosts where the application executes
- Edges: logical channels among these nodes
- All-to-all fully connected (e.g., Fig. 5.1(b)), or any subgraph thereof, e.g., a neighborhood view (Fig. 5.1(a)): a partial system view that needs multi-hop paths but is easy to maintain

Superimposed topology (a.k.a. topology overlay):
- Superimposed on the logical topology
- Goal: efficient information gathering, distribution, or search (as in P2P overlays)
- E.g., ring, tree, mesh, hypercube

[Figure 5.1: Example topology views at different levels of abstraction]

Classifications and Basic Concepts (1)

Application execution vs. control algorithm execution, each with its own events.
Control algorithm:
- For monitoring and auxiliary functions, e.g., creating a spanning tree (ST), MIS, CDS, reaching consensus, global state detection (deadlock, termination, etc.), checkpointing
- Superimposed on the application execution, but does not interfere with it
- Its send, receive, and internal events are transparent to the application execution
- A.k.a. protocol

Centralized and distributed algorithms:
- Centralized: asymmetric roles; client-server configuration; processing and bandwidth bottleneck; single point of failure
- Distributed: more balanced roles of nodes; difficult to design perfectly distributed algorithms (e.g., snapshot algorithms, tree-based algorithms)

Symmetric and asymmetric algorithms

Classifications and Basic Concepts (2)

Anonymous algorithm: process ids or processor ids are not used to make any execution (run-time) decisions.
- Structurally elegant, but hard or even impossible to design; e.g., anonymous leader election is impossible

Uniform algorithm: cannot use n, the number of processes, as a parameter in the code.
- Allows scalability; process leave/join is easy, and only neighbors need to be aware of logical topology changes

Adaptive algorithm: let k (≤ n) be the number of processes participating in the context of a problem X when X is being executed. Complexity should be expressible as a function of k, not n.
- E.g., mutual exclusion: critical-section contention overhead expressible in terms of the number of processes contending at the time (k)

Classifications and Basic Concepts (3)

Deterministic vs. non-deterministic executions
- Non-deterministic execution: contains at least one non-deterministic receive; a deterministic execution has no non-deterministic receive
- Non-deterministic receive: can receive a message from any source. Deterministic receive: the source is specified. Non-determinism is difficult to reason with.
- Asynchronous system: re-execution of a deterministic program will produce the same partial order on events (used in debugging, unstable predicate detection, etc.)
- Asynchronous system: re-execution of a non-deterministic program may produce a different partial order (unbounded delivery times, unpredictable congestion, variable local CPU scheduling delays)

Classification and Basic Concepts (4)

Execution inhibition (a.k.a. freezing)
- Protocols that require suspension of normal execution until some stipulated operations occur are inhibitory
- This concept is different from blocking vs. non-blocking primitives
- Analyze the inhibitory impact of the control algorithm on the underlying execution

Classification 1:
- Non-inhibitory protocol: no event is disabled in any execution
- Locally inhibitory protocol: in any execution, any delayed event is a locally delayed event, i.e., inhibition is under local control, not dependent on any receive event
- Globally inhibitory: in some execution, some delayed event is not locally delayed

Classification 2: send inhibitory / receive inhibitory / internal event inhibitory

Classifications and Basic Concepts (5)

Synchronous vs. asynchronous systems
Synchronous:
- known upper bound on message delay
- known bounded drift rate of the clock w.r.t. real time
- known upper bound on the time for a process to execute a logical step

Asynchronous: the above criteria are not satisfied; there is a spectrum of models in which some combination of the criteria holds. The algorithm to solve a problem depends greatly on this model. Distributed systems are inherently asynchronous.

Online vs. offline (control) algorithms:
- Online: executes as the data is being generated; clear advantages for debugging, scheduling, etc.
- Offline: requires all (trace) data before execution begins

Classification and Basic Concepts (6)

Wait-free algorithms (for synchronization operations)
- Resilient to n − 1 process failures, i.e., the operations of any process must complete in a bounded number of steps, irrespective of the other processes
- Very robust, but expensive
- Possible to design for mutual exclusion
- May not always be possible to design, e.g., for the producer-consumer problem

Communication channels:
- Point-to-point: FIFO or non-FIFO
- At the application layer, FIFO is usually provided by the network stack

Classifications and Basic Concepts (7)
Process failures (sync + async systems) in order of increasing severity
- Fail-stop: a properly functioning process stops execution. Other processes learn about the failed process (through some mechanism).
- Crash: a properly functioning process stops execution. Other processes do not learn about the failed process.
- Receive omission: a properly functioning process fails by receiving only some of the messages that have been sent to it, or by crashing.
- Send omission: a properly functioning process fails by sending only some of the messages it is supposed to send, or by crashing. Incomparable with the receive omission model.
- General omission: send omission + receive omission.
- Byzantine (or malicious) failure, with authentication: a process may (mis)behave in any manner, including sending fake messages. Authentication facility: if a faulty process claims to have received a message from a correct process, that claim is verifiable.
- Byzantine (or malicious) failure, without authentication.

The non-malicious failure models are "benign".

Classifications and Basic Concepts (8)

Process failures (contd.). Timing failures (sync systems):
- General omission failures, or clocks violating their specified drift rates, or a process violating the bounds on the time to execute a step
- More severe than general omission failures

Failure models influence the design of algorithms.

Link failures:
- Crash failure: a properly functioning link stops carrying messages
- Omission failure: the link carries only some of the messages sent on it, not others
- Byzantine failure: the link exhibits arbitrary behavior, including creating fake messages and altering messages sent on it
- Timing failures (sync systems): messages delivered faster/slower than the specified behavior

Complexity Measures and Metrics
Each metric can be specified using a lower bound (Ω), an upper bound (O), and an exact bound (Θ).
- Space complexity per node
- System-wide space complexity (not simply n × the per-node space complexity; e.g., the worst case may never occur at all nodes simultaneously!)
- Time complexity per node
- System-wide time complexity (do the nodes execute fully concurrently?)

Message complexity:
- Number of messages (affects the space complexity of the message overhead)
- Size of messages (affects the space complexity of the message overhead, plus the time component via increased transmission time)
- Message time complexity: depends on the number of messages, the size of messages, and the concurrency in sending and receiving messages

Other metrics: number of send and receive events; number of multicasts, and how they are implemented. For shared memory systems: size of shared memory; number of synchronization operations.
Program Structure
Communicating Sequential Processes (CSP)-like:

*[ G1 → CL1 [] G2 → CL2 [] ... [] Gk → CLk ]

- The repetitive command "*" denotes an infinite loop. Inside it, the alternative command "[]" is over the guarded commands; it specifies execution of exactly one of its constituent guarded commands.
- Guarded command syntax "G → CL": the guard G is a boolean expression, and CL is a list of commands to be executed if G is true. A guard may check for message arrival from another process.
- The alternative command fails if all the guards fail; if more than one guard is true, one is non-deterministically chosen for execution.
- For Gm → CLm: Gm and CLm are executed atomically.

Basic Distributed Graph Algorithms: Listing

- Sync 1-initiator ST (flooding)
- Async 1-initiator ST (flooding)
- Async concurrent-initiator ST (flooding)
- Async DFS ST
- Broadcast and convergecast on a tree
- Sync 1-source shortest path
- Distance vector routing
- Async 1-source shortest path
- All-sources shortest paths: Floyd-Warshall
- Sync and async constrained flooding
- MST, sync; MST, async
- Synchronizers: simple, α, β, γ
- MIS, async, randomized
- CDS
- Compact routing tables
- Leader election: LCR algorithm
- Dynamic object replication

Sync 1-initiator ST (flooding)
(local variables)
int visited, depth ← 0
int parent ← ⊥
set of int Neighbors ← set of neighbors
(message types)
QUERY

(1) if i = root then
(2)   visited ← 1; depth ← 0; send QUERY to Neighbors;
(3) for round = 1 to diameter do
(4)   if visited = 0 then
(5)     if any QUERY messages arrive then
(6)       parent ← randomly select a node from which QUERY was received;
(7)       visited ← 1; depth ← round;
(8)       send QUERY to Neighbors \ {senders of QUERYs received in this round};
(9)   delete any QUERY messages that arrived in this round.

Synchronous 1-init Spanning Tree: Example
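The round-based flooding above can be simulated centrally; the sketch below replaces real message passing with a per-round inbox (the graph, root, and diameter are hypothetical inputs):

```python
import random

def sync_spanning_tree(adj, root, diameter):
    """Simulate the synchronous flooding spanning-tree algorithm.

    adj: dict mapping node -> set of neighbors (undirected graph).
    In each round, every unvisited node that received a QUERY picks one
    sender as its parent and forwards QUERY to its other neighbors.
    Returns parent[] and depth[] for every reached node.
    """
    parent = {root: None}
    depth = {root: 0}
    # inbox[v] = set of nodes whose QUERY reaches v in the current round
    inbox = {v: set() for v in adj}
    for u in adj[root]:
        inbox[u].add(root)                   # root floods QUERY initially
    for rnd in range(1, diameter + 1):
        nxt = {v: set() for v in adj}
        for v in adj:
            if v not in parent and inbox[v]:     # unvisited, got QUERYs
                parent[v] = random.choice(sorted(inbox[v]))
                depth[v] = rnd
                for u in adj[v] - inbox[v]:      # forward to non-senders
                    nxt[u].add(v)
        inbox = nxt                          # leftover QUERYs are deleted
    return parent, depth

# Hypothetical example: a 4-cycle rooted at node 0; diameter is 2.
adj = {0: {1, 3}, 1: {0, 2}, 2: {1, 3}, 3: {0, 2}}
parent, depth = sync_spanning_tree(adj, 0, 2)
```

In the real distributed algorithm each node runs only its own branch of this loop; the centralized simulation just makes the round structure visible.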
[Figure 5.2: Tree in boldface; round numbers of the QUERY messages are labeled]

- Designated root: node A in the example
- Each node identifies its parent
- How to identify child nodes?

Synchronous 1-init Spanning Tree: Complexity

- Termination: after diameter rounds. How can a process terminate after setting its parent?
- Local space: O(degree)
- Global space: O(Σ local space)
- Local time: O(degree + diameter)
- Message time complexity: d rounds or message hops
- Message complexity: between 1 and 2 messages per edge, hence in the range [l, 2l]
- The spanning tree is analogous to breadth-first search

Asynchronous 1-init Spanning Tree: Code
(local variables)
int parent ← ⊥
set of int Children, Unrelated ← ∅
set of int Neighbors ← set of neighbors
(message types)
QUERY, ACCEPT, REJECT

(1) When the predesignated root node wants to initiate the algorithm:
(1a) if (i = root and parent = ⊥) then
(1b)   send QUERY to all neighbors;
(1c)   parent ← i.
(2) When QUERY arrives from j:
(2a) if parent = ⊥ then
(2b)   parent ← j;
(2c)   send ACCEPT to j;
(2d)   send QUERY to all neighbors except j;
(2e)   if (Children ∪ Unrelated) = (Neighbors \ {parent}) then
(2f)     terminate.
(2g) else send REJECT to j.
(3) When ACCEPT arrives from j:
(3a) Children ← Children ∪ {j};
(3b) if (Children ∪ Unrelated) = (Neighbors \ {parent}) then
(3c)   terminate.
(4) When REJECT arrives from j:
(4a) Unrelated ← Unrelated ∪ {j};
(4b) if (Children ∪ Unrelated) = (Neighbors \ {parent}) then
(4c)   terminate.

Async 1-init Spanning Tree: Operation

- The root initiates flooding of QUERY messages to identify the tree edges
- parent: the first node from which a QUERY is received
- ACCEPT (positive response) is sent in reply; QUERY is sent to the other neighbors
- Termination: when ACCEPT or REJECT (negative response) has been received from all non-parent neighbors. Why?
- A QUERY from a non-parent is replied to with REJECT
- Is it necessary to track the neighbors, to determine the children and when to terminate?
- Why is the REJECT message type required? Can the use of REJECT messages be eliminated? How? With what impact?

Asynchronous 1-init Spanning Tree: Complexity

- Local termination: after receiving ACCEPT or REJECT from all non-parent neighbors
- Local space: O(degree)
- Global space: O(Σ local space)
- Local time: O(degree)
- Message complexity: between 2 and 4 messages per edge, hence in the range [2l, 4l]
- Message time complexity: d + 1 message hops
- Spanning tree: no claim can be made about its shape; worst-case height n − 1

Asynchronous 1-init Spanning Tree: Example
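The asynchronous QUERY/ACCEPT/REJECT protocol can be simulated with a single FIFO event queue standing in for the network (a real execution would interleave messages differently; the triangle graph below is a hypothetical example):

```python
from collections import deque

def async_spanning_tree(adj, root):
    """Event-driven simulation of the async 1-initiator spanning tree.

    adj: dict node -> set of neighbors. A node adopts the first QUERY
    sender as its parent (replying ACCEPT) and answers every later
    QUERY with REJECT. Returns the parent and children maps.
    """
    parent = {root: root}
    children = {v: set() for v in adj}
    q = deque(('QUERY', root, u) for u in adj[root])
    while q:
        kind, frm, to = q.popleft()          # one in-flight message
        if kind == 'QUERY':
            if to not in parent:             # first QUERY: adopt parent
                parent[to] = frm
                q.append(('ACCEPT', to, frm))
                q.extend(('QUERY', to, u) for u in adj[to] if u != frm)
            else:                            # later QUERY: reject
                q.append(('REJECT', to, frm))
        elif kind == 'ACCEPT':
            children[to].add(frm)
        # a REJECT only marks the edge as non-tree; nothing to record
    return parent, children

adj = {0: {1, 2}, 1: {0, 2}, 2: {0, 1}}
parent, children = async_spanning_tree(adj, 0)
```

With this particular FIFO schedule both non-root nodes attach directly to the root; other delivery orders can yield a different (possibly taller) tree, which is why no shape claim can be made.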
[Figure 5.3: Tree in boldface; the numbers indicate the approximate order in which the QUERY messages get sent]

- Designated root: node A in the example
- Tree edges: QUERY + ACCEPT messages
- Cross-edges and back-edges: 2 × (QUERY + REJECT) messages

Asynchronous Spanning Tree: Concurrent Initiators
[Figure 5.4: Concurrent initiators A, G, J]

Algorithm: a node may spontaneously initiate the algorithm and become a "root". Each "root" initiates a variant of the 1-init algorithm; lower priorities are suppressed at intermediate nodes.

- Termination: only the root detects termination; it needs to send extra messages to inform the others
- Time complexity: O(l)
- Message complexity: O(nl)

No predesignated root:
- Option 1: merge partial STs. Difficult based on local knowledge; can lead to cycles
- Option 2: allow one ST computation instance to proceed and suppress the others. Used by the algorithm, which selects the root with the higher process id to continue
- Three cases: newroot <, =, or > myroot
Asynchronous Spanning Tree: Code (1/2)
(local variables)
int parent, myroot ← ⊥
set of int Children, Unrelated ← ∅
set of int Neighbors ← set of neighbors
(message types)
QUERY, ACCEPT, REJECT

(1) When the node wants to initiate the algorithm as a root:
(1a) if (parent = ⊥) then
(1b)   send QUERY(i) to all neighbors;
(1c)   parent, myroot ← i.
(2) When QUERY(newroot) arrives from j:
(2a) if myroot < newroot then   // discard earlier partial execution due to its lower priority
(2b)   parent ← j; myroot ← newroot; Children, Unrelated ← ∅;
(2c)   send QUERY(newroot) to all neighbors except j;
(2d)   if Neighbors = {j} then
(2e)     send ACCEPT(myroot) to j; terminate.   // leaf node
(2f) else send REJECT(newroot) to j.
// if newroot = myroot then parent is already identified.
// if newroot < myroot, ignore the QUERY; j will update its root when it receives QUERY(myroot).

Asynchronous Spanning Tree: Code (2/2)

(3) When ACCEPT(newroot) arrives from j:
(3a) if newroot = myroot then
(3b)   Children ← Children ∪ {j};
(3c)   if (Children ∪ Unrelated) = (Neighbors \ {parent}) then
(3d)     if i = myroot then
(3e)       terminate.
(3f)     else send ACCEPT(myroot) to parent.
// if newroot < myroot then ignore the message; newroot > myroot will never occur.
(4) When REJECT(newroot) arrives from j:
(4a) if newroot = myroot then
(4b)   Unrelated ← Unrelated ∪ {j};
(4c)   if (Children ∪ Unrelated) = (Neighbors \ {parent}) then
(4d)     if i = myroot then
(4e)       terminate.
(4f)     else send ACCEPT(myroot) to parent.
// if newroot < myroot then ignore the message; newroot > myroot will never occur.
Asynchronous DFS Spanning Tree

- Handle concurrent initiators just as for the non-DFS algorithm just examined
- When QUERY, ACCEPT, or REJECT arrives: the actions depend on whether myroot <, =, or > newroot
- Termination: only the successful root detects termination; it informs the others using the ST edges
- Time complexity: O(l)
- Message complexity: O(nl)

Asynchronous DFS Spanning Tree: Code
(local variables)
int parent, myroot ← ⊥
set of int Children ← ∅
set of int Neighbors, Unknown ← set of neighbors
(message types)
QUERY, ACCEPT, REJECT

(1) When the node wants to initiate the algorithm as a root:
(1a) if (parent = ⊥) then
(1b)   send QUERY(i) to i (itself).
(2) When QUERY(newroot) arrives from j:
(2a) if myroot < newroot then
(2b)   parent ← j; myroot ← newroot; Unknown ← set of neighbors;
(2c)   Unknown ← Unknown \ {j};
(2d)   if Unknown ≠ ∅ then
(2e)     delete some x from Unknown;
(2f)     send QUERY(myroot) to x;
(2g)   else send ACCEPT(myroot) to j;
(2h) else if myroot = newroot then
(2i)   send REJECT to j.
// if newroot < myroot, ignore the query; j will update its root to a higher root identifier when it receives its QUERY.
(3) When ACCEPT(newroot) or REJECT(newroot) arrives from j:
(3a) if newroot = myroot then
(3b)   if ACCEPT message arrived then
(3c)     Children ← Children ∪ {j};
(3d)   if Unknown = ∅ then
(3e)     if parent ≠ i then
(3f)       send ACCEPT(myroot) to parent;
(3g)     else set i as the root; terminate.
(3h)   else
(3i)     delete some x from Unknown;
(3j)     send QUERY(myroot) to x.
// if newroot < myroot, ignore the message: since sending QUERY to j, i has updated its myroot.
// j will update its myroot to a higher root identifier when it receives a QUERY initiated by it. newroot > myroot will never occur.

Broadcast and Convergecast on a Tree (1)
[Figure 5.5: Tree structure for broadcast and convergecast: the broadcast is initiated by the root, the convergecast by the leaves; tree edges, cross-edges, and back-edges are shown]

Question: how to perform broadcast and convergecast on a ring? On a mesh? At what cost?

Broadcast and Convergecast on a Tree (2)
Broadcast: distribute information.
- BC1. The root sends the information to be broadcast to all its children. Terminate.
- BC2. When a (non-root) node receives the information from its parent, it copies it and forwards it to its children. Terminate.

Convergecast: collect information at the root, to compute a global function.
- CVC1. A leaf node sends its report to its parent. Terminate.
- CVC2. At a non-leaf node that is not the root: when a report has been received from all the child nodes, the collective report is sent to the parent. Terminate.
- CVC3. At the root: when a report has been received from all the child nodes, the global function is evaluated using the reports. Terminate.

Uses: computing min or max, leader election, computing global state functions.
Time complexity: O(h); message complexity: n − 1 messages each for BC and CC.

Single Source Shortest Path: Sync Bellman-Ford

- Weighted graph, no cycles with negative weight
- No node has a global view; each knows only its local topology
- Assumption: each node knows n; this is needed for termination
- After k rounds: length at any node is the length of the shortest path having at most k hops
- After k rounds: length has stabilized at all nodes whose final shortest path has at most k hops
- Termination: n − 1 rounds
- Time complexity: n − 1 rounds
- Message complexity: (n − 1) · l messages
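The synchronous rounds described above (and shown next in pseudocode) can be simulated centrally; the round snapshot plays the role of the UPDATE messages exchanged in that round (the weighted graph is a hypothetical example):

```python
INF = float('inf')

def sync_bellman_ford(weight, n, source):
    """Synchronous distributed Bellman-Ford, simulated centrally.

    weight: dict {(j, i): w} with both directions of each undirected
    link present. After n - 1 rounds, length[i] is the shortest
    distance from source and parent[i] the next hop toward it.
    """
    length = {i: (0 if i == source else INF) for i in range(n)}
    parent = {i: None for i in range(n)}
    for _ in range(n - 1):                   # n - 1 synchronous rounds
        snapshot = dict(length)              # UPDATEs sent this round
        for (j, i), w in weight.items():
            if snapshot[j] + w < length[i]:  # relax using j's UPDATE
                length[i] = snapshot[j] + w
                parent[i] = j
    return length, parent

# Hypothetical 4-node example; edges listed once, then mirrored.
edges = {(0, 1): 1, (1, 2): 2, (0, 2): 5, (2, 3): 1}
weight = {**edges, **{(j, i): w for (i, j), w in edges.items()}}
length, parent = sync_bellman_ford(weight, 4, 0)
```

Note how node 2 first learns the direct length 5 in round 1 and only stabilizes to 3 (via node 1) in round 2, matching the "at most k hops after k rounds" invariant.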
Sync Distributed Bellman-Ford: Code

(local variables)
int length ← ∞
int parent ← ⊥
set of int Neighbors ← set of neighbors
set of int {weight_i,j, weight_j,i | j ∈ Neighbors} ← the known weights of the incident links
(message types)
UPDATE

(1) if i = i0 then length ← 0;
(2) for round = 1 to n − 1 do
(3)   send UPDATE(i, length) to all neighbors;
(4)   await UPDATE(j, length_j) from each j ∈ Neighbors;
(5)   for each j ∈ Neighbors do
(6)     if (length > (length_j + weight_j,i)) then
(7)       length ← length_j + weight_j,i; parent ← j.

Distance Vector Routing

- Used in Internet routing (popular up to the mid-1980s), with a dynamically changing graph, where link weights model delay/load
- Variant of sync Bellman-Ford; the outer for loop is infinite
- Tracks the shortest path to every destination
- length is replaced by LENGTH[1..n] and parent by PARENT[1..n]; the kth component LENGTH[k] denotes the best-known length to destination k
- In each iteration:
  apply the triangle inequality for each destination independently
- Triangle inequality test: if LENGTH[k] > (LENGTH_j[k] + weight_j,i) then update LENGTH[k]
- Node i estimates weight_i,j using the RTT or the queuing delay to neighbor j

Single Source Shortest Path: Async Bellman-Ford

- Weighted graph, no cycles with negative weight
- No node has a global view; each knows only its local topology
- Exponential (c^n) number of messages and exponential (c^n · d) time complexity in the worst case, where c is some constant
- If all links have equal weight, the algorithm computes the minimum-hop paths; the minimum-hop routing tables to all destinations are computed using O(n^2 · l) messages

Async Distributed Bellman-Ford: Code

(local variables)
int length ← ∞
int parent ← ⊥
set of int Neighbors ← set of neighbors
set of int {weight_i,j, weight_j,i | j ∈ Neighbors} ← the known weights of the incident links
(message types)
UPDATE

(1) if i = i0 then
(1a)   length ← 0;
(1b)   send UPDATE(i0, 0) to all neighbors; terminate.
(2) When UPDATE(i0, length_j) arrives from j:
(2a)   if (length > (length_j + weight_j,i)) then
(2b)     length ← length_j + weight_j,i; parent ← j;
(2c)     send UPDATE(i0, length) to all neighbors.

All-All Shortest Paths: Floyd-Warshall
[Figure 5.6: (a) The triangle inequality used by the Floyd-Warshall algorithm: LENGTH[s, t] is compared with LENGTH[s, pivot] + LENGTH[pivot, t], where the component paths pass only through nodes in {1, 2, ..., pivot − 1}. (b) VIA relationships along a branch of the sink tree for a given (s, t) pair]

All-All Shortest Paths: Floyd-Warshall
After pivot iterations of the outer loop, the following invariant holds:
"LENGTH[i, j] is the shortest path going through intermediate nodes from the set {i, . . . , pivot}. VIA[i, j] is the corresponding first hop."
(1) for pivot = 1 to n do
(2)   for s = 1 to n do
(3)     for t = 1 to n do
(4)       if LENGTH[s, pivot] + LENGTH[pivot, t] < LENGTH[s, t] then
(5)         LENGTH[s, t] ← LENGTH[s, pivot] + LENGTH[pivot, t];
(6)         VIA[s, t] ← VIA[s, pivot].

Complexity (centralized): O(n^3)

Distributed Floyd-Warshall (1)

- Row i of LENGTH[1..n, 1..n] and of VIA[1..n, 1..n] is stored at node i, which is responsible for updating it. (So i acts as a source.)
- Corresponding to line (4) of the centralized algorithm:
How does node i access remote datum LENGTH[pivot, t] in each iteration pivot?
- Distributed (dynamic) sink tree: in any iteration pivot, all nodes s such that LENGTH[s, pivot] ≠ ∞ lie on a sink tree, with the sink at pivot
- How to synchronize the execution of the outer loop iteration at different nodes? (Otherwise the algorithm goes wrong.)
Simulate "synchronizer": e.g., use r eceive to get data LENGTH[pivot, ] from parent on sink tree A. Kshemkalyani and M. Singhal (Distributed Computing) Terminology and Basic Algorithms CUP 2008 36 / 79 Distributed Computing: Principles, Algorithms, and Systems Distributed FloydWarshall: Data structures // LEN[j] is the length of the shortest known path from i to node j. // LEN[j] = weightij for neighbor j, 0 for j = i, otherwise array of int PARENT [1..n] // PARENT [j] is the parent of node i (myself) on the sink tree rooted at j. // PARENT [j] = j for neighbor j, otherwise set of int Neighbours  set of neighbors int pivot, nbh  0 (message types) IN TREE(pivot), NOT IN TREE(pivot), PIV LEN(pivot, PIVOT ROW [1..n]) // PIVOT ROW [k] is LEN[k] of node pivot, which is LEN[pivot, k] in the central algorithm // the PIV LEN message is used to convey PIVOT ROW . (local variables) array of int LEN[1..n] A. Kshemkalyani and M. Singhal (Distributed Computing) Terminology and Basic Algorithms CUP 2008 37 / 79 Distributed Computing: Principles, Algorithms, and Systems Distributed FloydWarshall: Code (1) for pivot = 1 to n do (2) for each neighbour nbh Neighbours do (3) if PARENT [pivot] = nbh then (4) send IN TREE(pivot) to nbh; (5) else send NOT IN TREE(pivot) to nbh; (6) await IN TREE or NOT IN TREE message from each neighour; (7) if LEN[pivot] = then (8) if pivot = i then (9) receive PIV LEN(pivot, PIVOT ROW [1..n]) from PARENT [pivot]; (10) for each neighbour nbh Neighbours do (11) if IN TREE message was received from nbh then (12) if pivot = i then (13) send PIV LEN(pivot, LEN[1..n]) to nbh; (14) else send PIV LEN(pivot, PIVOT ROW [1..n]) to nbh; (15) for t = 1 to n do (16) if LEN[pivot] + PIVOT ROW [t] < LEN[t] then (17) LEN[t]  LEN[pivot] + PIVOT ROW [t]; (18) PARENT [t]  PARENT [pivot]. A. Kshemkalyani and M. 
Distributed Floyd-Warshall: Dynamic Sink Tree
- Rename LENGTH[i, j] and VIA[i, j] as LEN[j] and PARENT[j] in the distributed algorithm ⇒ LENGTH[i, pivot] is LEN[pivot]
- At any node i, in iteration pivot: if LEN[pivot] ≠ ∞ at node i, then pivot distributes its LEN vector to all nodes (including i) in the sink tree of pivot
- Parent-child edges in the sink tree need to be identified. How?
1. A node sends IN_TREE to PARENT[pivot] and NOT_IN_TREE to its other neighbors. Receiving IN_TREE from k ⇒ k is a child in the sink tree of pivot. Each node awaits IN_TREE or NOT_IN_TREE from each neighbor: this send-receive is synchronization!
2. pivot broadcasts its LEN vector down its sink tree: this send-receive is synchronization! Now all nodes execute the triangle inequality in pseudo-lockstep.

- Time complexity: O(n^2) execution per node, plus the time for n broadcasts
- Message complexity, per each of the n iterations: 2 IN_TREE or NOT_IN_TREE messages of size O(1) per edge, i.e., O(l) messages; and n − 1 PIV_LEN messages of size O(n), i.e., O(n) messages
- Total: O(n(l + n)) messages; total O(nl + n^3) message space
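Step 1 above (identifying the tree edges for a pivot) amounts to the following computation over each node's PARENT entry; the sketch replaces the IN_TREE messages with direct set updates (node ids and the parent table are hypothetical):

```python
def sink_tree_children(parent_of, pivot):
    """Identify each node's children in the sink tree rooted at pivot.

    parent_of: dict node -> that node's PARENT[pivot] entry
    (None if the node is not on pivot's sink tree).
    Node k 'sends IN_TREE(pivot)' to parent_of[k]; receiving it
    tells the parent that k is a child. pivot labels the tree only.
    """
    children = {v: set() for v in parent_of}
    for k, p in parent_of.items():
        if p is not None and p != k:     # k sends IN_TREE(pivot) to p
            children[p].add(k)           # p learns that k is its child
    return children

# Hypothetical PARENT[pivot] entries for pivot = 0 on a 4-node graph.
parent_of = {0: 0, 1: 0, 2: 1, 3: 1}
children = sink_tree_children(parent_of, 0)
```

Every neighbor not in `children` would have sent NOT_IN_TREE; waiting for one message of either kind from each neighbor is what keeps the nodes in lockstep.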
Distributed Floyd-Warshall: Sink Tree
[Figure 5.7: Identifying parent-child nodes in the sink tree via IN_TREE(pivot) and NOT_IN_TREE(pivot) messages]

Constrained Flooding (no ST)

- FIFO channels; duplicates detected using sequence numbers

Asynchronous flooding:
- Used by link state routing in IPv4
- Message complexity: 2l messages in the worst case; time: d sequential hops

Synchronous flooding (to learn one datum from each processor):
- STATEVEC[k] is the estimate of k's datum
- Message complexity: 2ld messages, each of size n
- Time complexity: d rounds

Async Constrained Flooding (no ST)

(local variables)
array of int SEQNO[1..n] ← 0
set of int Neighbors ← set of neighbors
(message types)
UPDATE

(1) To send a message M:
(1a) if i = root then
(1b)   SEQNO[i] ← SEQNO[i] + 1;
(1c)   send UPDATE(M, i, SEQNO[i]) to each j ∈ Neighbors.
(2) When UPDATE(M, j, seqno_j) arrives from k:
(2a) if SEQNO[j] < seqno_j then
(2b)   process the message M;
(2c)   SEQNO[j] ← seqno_j;
(2d)   send UPDATE(M, j, seqno_j) to Neighbors \ {k}
(2e) else discard the message.

Sync Constrained Flooding (no ST)

The algorithm learns all the nodes' identifiers (local values).
(local variables)
array of int STATEVEC[1..n] ← 0
set of int Neighbors ← set of neighbors
(message types)
UPDATE

(1) STATEVEC[i] ← local value;
(2) for round = 1 to diameter d do
(3)   send UPDATE(STATEVEC[1..n]) to each j ∈ Neighbors;
(4)   for count = 1 to |Neighbors| do
(5)     await UPDATE(SV[1..n]) from some j ∈ Neighbors;
(6)     STATEVEC[1..n] ← max(STATEVEC[1..n], SV[1..n]).

Minimum Spanning Tree (MST): Overview
- Assume an undirected weighted graph. If the weights are not unique, assume some tie-breaker, such as node IDs, is used to impose a total order on edge weights
- Review the definitions: forest, spanning forest, spanning tree, MST

Kruskal's MST:
Assume forest of graph components maintain sorted list of edges In each of n  1 iterations, identify minimum weight edge that connects two different components Include the edge in MST O(l log l) Prim's MST:
Begin with a single node component In each of n  1 iterations, select the minimum weight edge incident on the component. Component expands using this selected edge. O(n2 ) (or O(n log n) using Fibonacci heaps in dense graphs) A. Kshemkalyani and M. Singhal (Distributed Computing) Terminology and Basic Algorithms CUP 2008 44 / 79 Distributed Computing: Principles, Algorithms, and Systems GHS Synchronous MST Algorithm: Overview
The Gallagher-Humblet-Spira distributed MST uses Kruskal's strategy. Begin with a forest of graph components.
MWOE (minimum-weight outgoing edge): "outgoing" is logical, i.e., it indicates the direction of expansion of the component.
Spanning trees of connected components combine with the MWOEs so as to retain the spanning tree property in the combined component.
Concurrently combine MWOEs:
after k iterations, at most n/2^k components, hence at most log n iterations.
Each component has a leader node in an iteration.
Each iteration within a component has 5 steps, triggered by the leader:
broadcast-convergecast phase: leader identifies the MWOE
broadcast phase: (potential) leader for the next iteration is identified
broadcast phase: among merging components, one leader is selected; it identifies itself to all nodes in the new component

Minimum Weight Outgoing Edge: Example
Figure 5.8: Merging of MWOE components. (a) A cycle of length 2 is possible. (b) A cycle of length > 2 is not possible.

Observation 5.1
For any spanning forest {(N_i, L_i) | i = 1 . . . k} of graph G, consider any component (N_j, L_j). Denote by λ_j the edge having the smallest weight among those that are incident on only one node in N_j. Then an MST for G that includes all the edges in each L_i in the spanning forest must also include edge λ_j.

MST Example
Figure 5.9: Phases within an iteration in a component (legend: tree edge, cross edge, out-edge, root of component). (a) Root broadcasts SEARCH_MWOE; (b) Convergecast REPLY_MWOE occurs. (c) Root broadcasts ADD_MWOE; (d) If the MWOE is also chosen as the MWOE by the component at the other end of the MWOE, the incident process with the higher ID is the leader for the next iteration, and broadcasts NEW_LEADER.

Sync GHS: Message Types
(message types:)
SEARCH_MWOE(leader)                 // broadcast by current leader on tree edges
EXAMINE(leader)                     // sent on non-tree edges after receiving SEARCH_MWOE
REPLY_MWOES(local_ID, remote_ID)    // details of potential MWOEs are convergecast to leader
ADD_MWOE(local_ID, remote_ID)       // sent by leader to add MWOE and identify new leader
NEW_LEADER(leader)                  // broadcast by new leader after merging components

Sync GHS: Code
leader ← i;
for round = 1 to log n do   // each merger in each iteration involves at least two components
(1) if leader = i then broadcast SEARCH_MWOE(leader) along marked edges of tree.
(2) On receiving a SEARCH_MWOE(leader) message that was broadcast on marked edges: each process i (including leader) sends an EXAMINE message along unmarked (i.e., non-tree) edges to determine if the other end of the edge is in the same component (i.e., whether its leader is the same). From among all incident edges at i for which the other end belongs to a different component, process i picks its incident MWOE(localID, remoteID).
(3) The leaf nodes in the MST within the component initiate the convergecast using REPLY_MWOEs, informing their parent of their MWOE(localID, remoteID). All the nodes participate in this convergecast.
(4) if leader = i then await convergecast replies along marked edges; select the minimum MWOE(localID, remoteID) from all the replies; broadcast ADD_MWOE(localID, remoteID) along marked edges of tree. // to ask process localID to mark the (localID, remoteID) edge, i.e., include it in the MST of the component
(5) if an MWOE edge gets marked by both the components on which it is incident then: define new_leader as the process with the larger ID on which that MWOE is incident (i.e., the process whose ID is max(localID, remoteID)); new_leader identifies itself as the leader for the next round and broadcasts NEW_LEADER in the newly formed component along the marked edges.

GHS: Complexity
log n rounds (synchronous); Time complexity: O(n log n).
Message complexity:
In each iteration, O(n) msgs along tree edges (steps 1, 3, 4, 5).
In each iteration, l EXAMINE msgs to determine MWOEs.
Hence, O((n + l) log n) messages.

Correctness requires synchronous operation
In step (2), EXAMINE is used to determine whether an unmarked neighbor belongs to the same component. If the two ends of an unmarked edge are at different levels, there is a problem: consider an EXAMINE sent on edge (j, k) whose ends belong to the same component, but k may not yet have learnt that it belongs to the new component (and the new leader ID), and so replies positively. This can lead to cycles.

MST (asynchronous)
Synchronous GHS simulated using extra msgs/steps.
The new leader does BC/CC on the marked edges of the new component.
In step (2), the recipient of an EXAMINE can delay its response if it is still in an old round: n log n extra messages overall.
On involvement in a new round, inform each neighbor; send EXAMINE only when all neighbors along unmarked edges are in the same round: l log n extra messages overall.
Engineering the asynchronous GHS: msg O(n log n + l); time O(n log n (l + d)).
Challenges:
determining the levels of adjacent nodes
repeated combining with singleton components, so that log n degrades to n
if components are at different levels, coordinating the search for MWOEs and the merging

Synchronizers
Definition
A class of transformation algorithms that allow a synchronous program (designed for a synchronous system) to run on asynchronous systems.
Assumption: failure-free system.
Designing a tailor-made async algorithm from scratch may be more efficient than using a synchronizer.

Process safety
Process i is safe in round r if all messages sent by i in round r have been received.
Implementation key: signal to each process when it is safe to go to the next round, i.e., when all msgs to be received have arrived.

Synchronizers: Notation
M_a = M_s + (M_init + rounds × M_round)   (1)
T_a = T_s + T_init + rounds × T_round     (2)
M_s: # messages in the synchronous algorithm; rounds: # rounds in the synchronous algorithm.
T_s: time for the synchronous algorithm. Assuming one unit (message hop) per round, this equals rounds.
M_round: # messages needed to simulate a round; T_round: # sequential message hops to simulate a round.
M_init, T_init: # messages, # sequential message hops to initialize the async system.

Synchronizers: Complexity
            M_init            T_init             M_round               T_round
Simple      0                 d                  2L                    1
α           0                 0                  O(L)                  O(1)
β           O(n log n + L)    O(n)               O(n)                  O(n)
γ           O(kn^2)           n log n / log k    O(L_c) (≤ O(kn))      O(h_c) (≤ O(log n / log k))
h_c is the greatest height of a tree among all the clusters. L_c is the number of tree edges and designated edges in the clustering scheme for the γ synchronizer. d is the graph diameter.

Simple Synchronizer
A process sends each neighbor 1 message per round; combine messages, or send a dummy message.
On receiving a msg from each neighbor, go to the next round.
Neighbors P_i, P_j may be only one round apart: P_i in round_i can receive a msg only from round_i or round_i + 1 of a neighbor.
Initialization:
Any process may start round x. Within d time units, all processes will be in round x. T_init = d, M_init = 0.
Complexity: M_round = 2L, T_round = 1.

α Synchronizer
P_i in round r moves to r + 1 once all its neighbors are safe for round r.
When a neighbor P_j has received an ack for each message it sent, it informs P_i (and its other neighbors) that it is safe.

Figure 5.10: Example. (a) Execution msgs (1) and acks (2). (b) "I am safe" msgs (3).

α Synchronizer: Complexity
Complexity:
l msgs, l acks; transport layer acks are free!
2L messages/round to inform neighbors of safety. M_round = O(L), T_round = O(1).
Initialization: none; any process may spontaneously wake up.

β Synchronizer
Initialization: rooted spanning tree, O(n log n + L) messages, O(n) time.
Operation: safe nodes initiate a convergecast (CvgC); intermediate nodes propagate the CvgC when their subtree is safe. When the root becomes safe and receives the CvgC from all children, it initiates a tree broadcast to inform all to move to the next round.
Complexity: l acks for free, due to the transport layer. M_round = 2(n - 1); T_round = 2 log n average, 2n worst case.

γ Synchronizer: Clusters
Set of clusters; each cluster has a spanning tree.
Intra-cluster: β synchronizer over tree edges.
Inter-cluster: α synchronizer over designated inter-cluster edges. (For 2 neighboring clusters, 1 inter-cluster edge is designated.)

Figure 5.11: Cluster organization. Only tree edges and inter-cluster designated edges are shown.

γ Synchronizer: Operation and Complexity
Within a cluster, the β synchronizer is executed.
Once the cluster is stabilized, the α synchronizer is executed over the inter-cluster edges.
To convey stabilization of the inter-cluster α synchronizer within a cluster, CvgC and BC phases over the tree.
This CvgC is initiated by the leaf nodes once the neighboring clusters are stabilized.
M_round = O(L_c), T_round = O(h_c).

γ Synchronizer: Code
(message types)
Subtree_safe                // β synchronizer phase's convergecast within cluster
This_cluster_safe           // β synchronizer phase's broadcast within cluster
My_cluster_safe             // embedded inter-cluster α synchronizer's messages across cluster boundaries
Neighboring_cluster_safe    // convergecast following inter-cluster α synchronizer phase
Next_round                  // broadcast following inter-cluster α synchronizer phase

for each round do
(1) (β synchronizer phase:) This phase aims to detect when all the nodes within a cluster are safe, and to inform all the nodes in that cluster.
  (1.1) Using the spanning tree, leaves initiate the convergecast of the Subtree_safe message towards the root of the cluster.
  (1.2) After the convergecast completes, the root initiates a broadcast of This_cluster_safe on the spanning tree within the cluster.
  (1.3) (Embedded α synchronizer:) During this broadcast in the tree, as the nodes get engaged, the nodes also send My_cluster_safe messages on any incident designated inter-cluster edges. Each node also awaits My_cluster_safe messages along any such incident designated edges.
(2) (Convergecast and broadcast phase:) This phase aims to detect when all neighboring clusters are safe, and to inform every node within this cluster.
  (2.1) (Convergecast:) After the broadcast of the earlier phase (1.2) completes, the leaves initiate a convergecast using Neighboring_cluster_safe messages once they receive any expected My_cluster_safe messages (step (1.3)) on all the designated incident edges. An intermediate node propagates the convergecast once it receives the Neighboring_cluster_safe message from all its children, and also any expected My_cluster_safe message (as per step (1.3)) along designated edges incident on it.
  (2.2) (Broadcast:) Once the convergecast completes at the root of the cluster, a Next_round message is broadcast in the cluster's tree to inform all the tree nodes to move to the next round.

Maximal Independent Set: Definition
For a graph (N, L), an independent set of nodes N', where N' ⊆ N, is such that for each i and j in N', (i, j) ∉ L.
An independent set N' is a maximal independent set if no strict superset of N' is an independent set.
A graph may have multiple MIS, perhaps of varying sizes.
The largest-sized independent set is the maximum independent set; finding it is NP-complete.
Application: wireless broadcast, allocation of frequency bands (mutex).

Luby's Randomized Algorithm, Async System
Iteratively: nodes pick random numbers and exchange them with their neighbors.
The lowest number in the neighborhood wins (is selected into the MIS).
If a neighbor is selected, I am eliminated (hence safety).
Only neighbors of selected nodes are eliminated (hence correctness).
Complexity: in each iteration, at least 1 node is selected and at least 1 is eliminated, so there are at most n/2 iterations. The expected # iterations is O(log n), due to the randomized nature.

Luby's Maximal Independent Set: Code
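Before the slide's message-passing pseudocode, a round-based simulation of the same rule may help; the global, set-based bookkeeping here is a simplification for illustration, not the distributed protocol itself:

```python
import random

def luby_mis(adj, seed=0):
    """One round: every candidate draws a random number; a candidate whose
    number is strictly smallest in its candidate neighbourhood is selected;
    neighbours of selected nodes are eliminated. Repeat until no candidates."""
    rng = random.Random(seed)
    candidates, mis = set(adj), set()
    while candidates:
        r = {i: rng.random() for i in candidates}
        winners = {i for i in candidates
                   if all(r[i] < r[j] for j in adj[i] & candidates)}
        mis |= winners
        candidates -= winners
        for i in winners:
            candidates -= adj[i]            # eliminate winners' neighbours
    return mis
```

Independence holds because two adjacent candidates cannot both be local minima; maximality holds because a node leaves the candidate set only when it is selected or is adjacent to a selected node.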
(variables)
set of integer Neighbours   // set of neighbours
real random_i               // random number from a sufficiently large range
boolean selected_i          // becomes true when P_i is included in the MIS
boolean eliminated_i        // becomes true when P_i is eliminated from the candidate set
(message types)
RANDOM(real random)                         // a random number is sent
SELECTED(integer pid, boolean indicator)    // whether sender was selected in MIS
ELIMINATED(integer pid, boolean indicator)  // whether sender was removed from candidates

(1a) repeat
(1b) if Neighbours = ∅ then
(1c)   selected_i ← true; exit();
(1d) random_i ← a random number;
(1e) send RANDOM(random_i) to each neighbour;
(1f) await RANDOM(random_j) from each neighbour j ∈ Neighbours;
(1g) if random_i < random_j (∀j ∈ Neighbours) then
(1h)   send SELECTED(i, true) to each j ∈ Neighbours;
(1i)   selected_i ← true; exit();   // in MIS
(1j) else
(1k)   send SELECTED(i, false) to each j ∈ Neighbours;
(1l)   await SELECTED(j, ·) from each j ∈ Neighbours;
(1m)   if SELECTED(j, true) arrived from some j ∈ Neighbours then
(1n)     for each j ∈ Neighbours from which SELECTED(·, false) arrived do
(1o)       send ELIMINATED(i, true) to j;
(1p)     eliminated_i ← true; exit();   // not in MIS
(1q)   else
(1r)     send ELIMINATED(i, false) to each j ∈ Neighbours;
(1s)     await ELIMINATED(j, ·) from each j ∈ Neighbours;
(1t)     for all j ∈ Neighbours do
(1u)       if ELIMINATED(j, true) arrived then
(1v)         Neighbours ← Neighbours \ {j};
(1w) forever.

Maximal Independent Set: Example
Figure 5.12: (a) Winners and losers in round 1. (b) Winners up to round 1, losers in round 2. Third round: I is the winner. MIS = {C, E, G, I, K}. Note: {A, C, G, J} is a smaller MIS.

Connected Dominating Set (CDS)
A dominating set of graph (N, L) is a set N' ⊆ N such that each node in N \ N' has an edge to some node in N'.
A connected dominating set (CDS) of (N, L) is a dominating set N' such that the subgraph induced by the nodes in N' is connected.
NP-complete:
Finding the minimum connected dominating set (MCDS).
Determining if there exists a dominating set of size k < |N|.
Poly-time heuristics (measured using an approximation factor or stretch factor):
Create an ST; delete edges to leaves.
Create an MIS; add edges to create a CDS.
Application: backbone for broadcasts.

Compact Routing Tables (1)
Avoid tables of size n: large size, more processing time.
Hierarchical routing: hierarchical clustered network, e.g., IPv4.
Tree-labeling schemes: logical tree topology for routing; node labels are assigned so that the dests reachable via a link are labeled by contiguous addresses [x, y]. Small tables, but traffic imbalance.
Figure 5.13: Tree-label based routing tables. Tree edge labels in rectangles; non-tree edges in dashed lines.

Compact Routing Tables (2)
Interval routing:
Node labeling: B is a 1:1 mapping on N.
Edge labeling: I labels each edge in L by some subset of node labels B(N) such that for any node x
all destinations are covered ((∪_{y ∈ Neighbours} I(x, y)) ∪ {B(x)} = B(N)) and there is no duplication of coverage (I(x, w) ∩ I(x, y) = ∅ for w ≠ y ∈ Neighbours).
For any s, t, there exists a path s = x_0, x_1, . . ., x_{k-1}, x_k = t where B(t) ∈ I(x_{i-1}, x_i) for each i ∈ [1, k].
Interval labeling is possible for every graph!
No guarantee on path lengths; not robust to topology changes.
Prefix routing: node and channel labels are drawn from the same domain and viewed as strings. To route: use the channel whose label is the longest prefix of the destination's label.

Compact Routing Tables (3)
Stretch factor of a routing scheme r
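The stretch factor defined next can be computed directly on small examples. A sketch, assuming unweighted graphs and a hypothetical `route_len` callback supplied by the routing scheme under test:

```python
from collections import deque
from itertools import permutations

def stretch(adj, route_len):
    """Worst-case ratio of the routing scheme's path length, route_len(i, j),
    to the true shortest-path distance, over all ordered pairs of nodes."""
    def bfs(s):                      # unweighted shortest-path distances from s
        dist, q = {s: 0}, deque([s])
        while q:
            u = q.popleft()
            for v in adj[u]:
                if v not in dist:
                    dist[v] = dist[u] + 1
                    q.append(v)
        return dist
    d = {i: bfs(i) for i in adj}
    return max(route_len(i, j) / d[i][j] for i, j in permutations(adj, 2))
```

On a 4-node ring, a scheme that always routes clockwise pays 3 hops for a pair whose optimal distance is 1, giving stretch 3.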
stretch(r) = max_{i,j ∈ N} { distance_r(i, j) / distance_opt(i, j) }.
Designing compact routing schemes is rich in graph-algorithmic problems: identify and prove bounds on the efficiency of routes. Different specialized topologies (e.g., grid, ring, tree) offer scope for easier results.

Leader Election
Defn: all processes agree on a common distinguished process (the leader).
Distributed algorithms are not completely symmetrical; they need an initiator or finisher process, e.g., an MST for BC and CvgC to compute a global function.

LeLann Chang Roberts (LCR) algorithm
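A compact simulation of the highest-ID-wins probe rule on a unidirectional ring (a sketch; the synchronous round-by-round channel model is an assumption of this illustration, the real algorithm is asynchronous):

```python
def lcr(ring_ids):
    """LCR on a unidirectional ring: ring_ids[k] sends rightward to
    ring_ids[(k+1) % n]; a probe is forwarded iff it beats the local id,
    discarded if it is smaller, and declares a leader when it returns home.
    Returns (leader_id, number_of_PROBE_messages)."""
    n = len(ring_ids)
    probes = list(ring_ids)               # each process sends PROBE(own id)
    msgs, leader = n, None
    while leader is None:
        nxt = [None] * n
        for k in range(n):
            probe, at = probes[k], (k + 1) % n
            if probe is None:
                continue
            if probe == ring_ids[at]:
                leader = probe            # own probe came back: leader found
            elif probe > ring_ids[at]:
                nxt[at] = probe           # forward the larger id
                msgs += 1                 # smaller probes are just discarded
        probes = nxt
    return leader, msgs
```

Only PROBE messages are counted here; the final SELECTED circulation adds another n.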
Asynchronous unidirectional ring; all processes have unique IDs.
Processes circulate their IDs; the highest ID wins.
Despite obvious optimizations, msg complexity is n(n - 1)/2 in the worst case, i.e., O(n^2); time complexity is O(n).
No deterministic leader election algorithm can exist for anonymous rings.
Algorithms may be uniform (i.e., not use knowledge of n).

Leader Election - LCR algorithm: Code
(variables)
boolean participate ← false   // becomes true when P_i participates in leader election
(message types)
PROBE integer      // contains a node identifier
SELECTED integer   // announcing the result

(1) When a process wakes up to participate in leader election:
(1a) send PROBE(i) to right neighbor;
(1b) participate ← true.
(2) When a PROBE(k) message arrives from the left neighbor P_j:
(2a) if participate = false then execute step (1) first.
(2b) if i > k then
(2c)   discard the probe;
(2d) else if i < k then
(2e)   forward PROBE(k) to right neighbor;
(2f) else if i = k then
(2g)   declare i is the leader;
(2h)   circulate SELECTED(i) to right neighbor.
(3) When a SELECTED(x) message arrives from the left neighbor:
(3a) if x ≠ i then
(3b)   note x as the leader and forward the message to the right neighbor;
(3c) else do not forward the SELECTED message.

Leader Election: Hirschberg-Sinclair Algorithm
Binary search in both directions on the ring; token-based.
In each round k, each active process does:
Token circulated to the 2^k nghbrs on both sides.
P_i is a leader after round k iff i is the highest ID among the 2^k nghbrs in both directions.
After round k, any pair of leaders are at least 2^k apart; the # leaders diminishes as n/2^k.
Only a winner (leader) after a round proceeds to the next round.
In each round, at most n msgs are sent in each direction, using suppression as in LCR; log n rounds.
Message complexity: O(n log n) (formulate the exact expression!); Time complexity: O(n).

Object Replication Problems
Weighted graph (N, L), k users at N_k ⊆ N nodes, r replicas of an object at N_r ⊆ N.
What is the optimal placement of the replicas if k > r and accesses are read-only?
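For the read-only question just posed, a brute-force sketch that evaluates all choices of N_r (exponential in r, so small graphs only; unweighted BFS distances are an assumption of this illustration):

```python
from collections import deque
from itertools import combinations

def best_placement(adj, readers, r):
    """Try every size-r replica set N_r and return one minimising the sum,
    over the readers, of the distance to the nearest replica."""
    def bfs(s):
        dist, q = {s: 0}, deque([s])
        while q:
            u = q.popleft()
            for v in adj[u]:
                if v not in dist:
                    dist[v] = dist[u] + 1
                    q.append(v)
        return dist
    d = {i: bfs(i) for i in readers}      # distances from each reader
    return min(combinations(adj, r),
               key=lambda Nr: sum(min(d[i][x] for x in Nr) for i in readers))
```

Per-user weights or edge bandwidths would only change the key function being minimised.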
Evaluate all choices for N_r to identify min(Σ_{i ∈ N_k} dist(i, r_i)), where r_i is the replica nearest to i and dist(i, r_i) is the cost from node i to r_i.
If the read accesses from each user in N_k have a certain frequency (or weight), the minimization function changes. Also address the BW of each edge.
Now assume each user access is a Read with prob. x and an Update with prob. 1 - x; an Update requires all replicas to be updated.
What is the optimal placement of the replicas if k > r?

Adaptive Data Replication: Problem Formulation
Network (V, E). Assume a single replicated object.
Replication scheme: a subset R of V; each node in R has a replica.
r_i, w_i: rates of reads and writes issued by i.
c_r(i), c_w(i): cost of a read and of a write issued by i.
ℛ: the set of all possible replication schemes.
Goal: minimize the cost of the replication scheme:

min_{R ∈ ℛ} [ Σ_{i ∈ V} r_i · c_r(i) + Σ_{i ∈ V} w_i · c_w(i) ]

For an arbitrary graph, minimizing this cost is NP-complete; hence, assume a tree overlay.
Assume one-copy serializability, implemented by the Read-One-Write-All (ROWA) policy.

Adaptive Data Replication over Tree Overlay
All communication and the set R are on the tree T overlay.
R is an amoeba-like subgraph that moves to the center of gravity of activity.
R expands when the read cost is higher, and shrinks when the write cost is higher.
The equilibrium-state R is optimal; it converges within d + 1 steps once the read-write pattern stabilizes.
Dynamic activity: the algorithm is re-executed in epochs.
Read: from the closest replica, along T; use parent pointers.
Write: to the closest replica, along T; then propagate within R, using R-neighbour(i), the set of i's neighbours in R.
Implementation at each node: (i) am I in R? (ii) the R-neighbour set, (iii) the parent pointer.

Adaptive Data Replication: Convergence (1)
R-neighbour: i ∈ R, and i has at least one neighbour j ∉ R.
R-fringe: i ∈ R, and i has only one neighbour j ∈ R. Thus, i is a leaf in the subgraph of T induced by R and j is the parent of i.
singleton: |R| = 1 and i ∈ R.
Figure 5.14: Nodes in the ellipse belong to R. C is R-fringe; A, E are R-fringe and R-neighbour; D is R-neighbour.

Adaptive Data Replication: Tests
Tests at the end of each epoch.
Expansion test: R-neighbour node i includes neighbor j ∉ R into R if r > w.
Contraction test: R-fringe node i excludes itself from R if w > r. Before exiting, i seeks permission from j to avoid R = ∅.
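The expansion test above reduces to one comparison per neighbour; this sketch mirrors step (2) of the code on the later slides (the counter names are illustrative, and restricting to neighbours j ∉ R is left to the caller):

```python
def expansion_candidates(read_recv, write_recv, local_writes):
    """At an R-neighbour node: neighbour j should receive a replica iff the
    reads arriving from j exceed all the writes that would then have to be
    propagated to j (local writes plus writes arriving from everyone else)."""
    total_w = local_writes + sum(write_recv.values())
    return [j for j, reads in read_recv.items()
            if reads > total_w - write_recv[j]]
```

For example, a neighbour forwarding 10 reads against 4 writes from elsewhere passes the test, while one forwarding 1 read against 3 writes does not.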
Figure 5.15: (a) Expansion test. (b) Contraction test. (c) Switch test.
Switch test: singleton node i transfers its replica to j if the r + w forwarded by j is greater than the r + w that node i receives from all other nodes.
An R-neighbour may also be R-fringe or singleton. In either case, the expansion test is executed first; if it fails, the contraction test or the switch test is executed.
(variables)
array of integer Neighbours[1 . . . b_i];      // b_i neighbours in tree T topology
array of integer Read_Received[1 . . . b_i];   // jth element gives # reads from Neighbours[j]
array of integer Write_Received[1 . . . b_i];  // jth element gives # writes from Neighbours[j]
integer write_i, read_i;                       // # writes and # reads issued locally
boolean success;

(1) P_i determines which tests to execute at the end of each epoch:
(1a) if i is R-neighbour and R-fringe then
(1b)   if expansion test fails then
(1c)     contraction test
(1d) else if i is R-neighbour and singleton then
(1e)   if expansion test fails then
(1f)     switch test
(1g) else if i is R-neighbour and not R-fringe and not singleton then
(1h)   expansion test
(1i) else if i is not R-neighbour and is R-fringe then
(1j)   contraction test.

(2) P_i executes expansion test:
(2a) for j from 1 to b_i do
(2b)   if Neighbours[j] not in R then
(2c)     if Read_Received[j] > (write_i + Σ_{k=1..b_i, k≠j} Write_Received[k]) then
(2d)       send a copy of the object to Neighbours[j]; success ← 1;
(2e) return(success).

Adaptive Data Replication: Code (2)
(3) P_i executes contraction test:
(3a) let Neighbours[j] be the only neighbour in R;
(3b) if Write_Received[j] > (read_i + Σ_{k=1..b_i, k≠j} Read_Received[k]) then
(3c)   seek permission from Neighbours[j] to exit from R;
(3d)   if permission received then
(3e)     success ← 1; inform all neighbours;
(3f) return(success).

(4) P_i executes switch test:
(4a) for j from 1 to b_i do
(4b)   if (Read_Received[j] + Write_Received[j]) > (Σ_{k=1..b_i, k≠j} (Read_Received[k] + Write_Received[k]) + read_i + write_i) then
(4c)     transfer the object copy to Neighbours[j]; success ← 1; inform all neighbours;
(4d) return(success).
This note was uploaded on 02/23/2012 for the course COMP 553 taught by Professor Ajay during the Spring '12 term at Ill. Chicago.