Partitioning in Large Data Systems (If It Is Not Dynamic, What's the Point?)

NYU Advanced Databases Class, Invited Lecture, Oct 2009
alberto.lerner@gmail.com

Motivation

Data partitioning as a way to scale
If only one could "stretch" a single server indefinitely as one grew... Where's that key?

Which server holds a given key?
The system can maintain metadata on participant nodes and on the partitioning. A client (library) can look up that metadata.
[Slide figure: a partition table mapping P1 -> node B and P2 -> node A, plus a node table mapping each node to its address.]

A partition's size changed? Repartition as it goes
From the metadata perspective, the system can adjust the partitioning boundaries, e.g. "split" partition P2 into P2 and a new P3.

If need be, the system can accept new nodes
Again (and quite simplistically), adding a node means adjusting the system metadata: node C is added to the node table.

Rebalances as it goes as well
The system may reassign load from one node to another, e.g. partition P3 moves from node A to node C.

Recap: dynamic partitioning primitives
- Locate a given key
- Split a partition
- Add a node
- Migrate a partition
- Merge partitions
- Subtract a node
What's hard about implementing these primitives in practice?

Metadata
1. Partition the system data across the nodes of the system. Changes should be made in the responsible node; no node has global information.
2. Replicate (not necessarily fully) the system data on all system nodes. Whenever a change is made, propagate it across the system; all nodes have a view of the global information, even if sometimes not a current one.

Agenda

For one, metadata must not be a single point of failure! Let's look at alternative strategies to implement dynamic partitioning from the metadata perspective. We'll assume a simple key-value data model. (In practice, models can be more sophisticated.) And we'll try to make the fewest possible assumptions about the underlying storage layout. (Of course, partitioning the data has its challenges too.)

What the talk is not about
For our purposes here, let's assume that there is a completely orthogonal data replication scheme for fault tolerance. (Orthogonal?! Yeah, right.) We won't go into crash recovery nor into fault detection. But, please, let's fill a whiteboard on any of these offline!

Partition

Idea: the partition table is... a table!
Only the system can update it; table reads are public, though. A client asks the system data (1) "where's key 'd'?" and then asks the owning node (2) "what is 'd's value?"

The partition table can be... partitioned
The lookup becomes a chain: (1) which partition would know about user table 'T' and key 'd'? (2) where's key 'd'? (3) what is 'd's value? Lookup's complexity is logarithmic.

Is split a local operation?
A node reports "P1 got too big, I split it at 'g'." From the metadata perspective, probably not.

Load Balancing
A load balancer pulls load information from all the nodes ("what is your load?"). Migrate partitions if necessary. Do it again periodically.
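The range-partitioned scheme above boils down to a small ordered metadata table plus the locate/split/migrate primitives from the recap. Below is a minimal, illustrative sketch, not from the lecture: the `PartitionTable` class, its `bounds` representation, and the sample keys are hypothetical choices, and a real system would partition and replicate this table itself rather than keep it in one process.

```python
import bisect

class PartitionTable:
    """Toy range-partition metadata: sorted upper-bound keys -> (partition id, node)."""

    def __init__(self, bounds, partitions):
        # bounds[i] is the exclusive upper key of partitions[i]; the last bound is None (+infinity)
        self.bounds = bounds            # e.g. ['k', None]
        self.partitions = partitions    # e.g. [('P1', 'B'), ('P2', 'A')]

    def locate(self, key):
        """Return the (partition id, node) responsible for `key` (binary search on the bounds)."""
        i = bisect.bisect_right([b for b in self.bounds if b is not None], key)
        return self.partitions[i]

    def split(self, part_id, split_key, new_part_id):
        """Split a partition at `split_key`; the new upper half stays on the same node."""
        i = next(i for i, (p, _) in enumerate(self.partitions) if p == part_id)
        node = self.partitions[i][1]
        self.bounds.insert(i, split_key)
        self.partitions.insert(i + 1, (new_part_id, node))

    def migrate(self, part_id, new_node):
        """Reassign a partition to another node (load rebalancing)."""
        i = next(i for i, (p, _) in enumerate(self.partitions) if p == part_id)
        self.partitions[i] = (part_id, new_node)

# Hypothetical layout loosely following the slides: P1 on node B, P2 on node A.
table = PartitionTable(bounds=['k', None], partitions=[('P1', 'B'), ('P2', 'A')])
print(table.locate('d'))        # ('P1', 'B')
table.split('P1', 'g', 'P3')    # "P1 got too big, I split it at 'g'"
table.migrate('P3', 'C')        # rebalance: hand the new partition to node C
print(table.locate('h'))        # ('P3', 'C')
```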
A missing piece?
Where does a client start looking for system data? How does the load balancer know about the nodes in the system? How does a node join the system?
[Slide figure: a small registry holding "(*A) B C"; a client asks it "Where's the start of system data?", the load balancer asks it to "List all servers", and a new node announces "Hello, I'm here to serve".]

Was that a central point of failure?
No. The information is kept in several places that agree on values using a distributed consensus algorithm.

Paxos
Solves the distributed consensus problem in scenarios involving crashes, omissions, and restarts. Phase 1 (prepare/promise) establishes the leader and the proposal number and learns about previously accepted values; Phase 2 (propose/accept) establishes a value and gathers a majority around it, after which learners see it as decided. Depicted here is a very well-behaved instance; under faults, each phase may involve several rounds of messages.

Liveness
Any node can be a proposer. In fact, this guarantees liveness, in case a previous leader crashes or gets disconnected.

Correctness
Once a majority of nodes locks in a value, that value is propagated through later proposals.
[Slide figure: proposal #1 locks a majority on value va and eventually times out; a competing proposal #3 learns va from the promises and decides (3, va).]

Replicated State Machines
Use a Paxos instance to decide on the next operation (or on the next instance's leader). Having a leader across instances speeds things up.

Distributed Lock Manager
Wraps a Paxos layer with a file-system interface, complete with locks, sequencers, and watchers. Now, can you solve the group membership problem?

And back to the big picture
Sync point: we've seen key location, partition split, partition migration, and node addition under the first partitioning scheme. What could be a motivation for another scheme?

Replicate

Consistent Hashing
Nodes' tokens and user data's keys are hashed and placed on a ring representing the hash space. A key belongs to the first clockwise node. That is, partitioning is implicit.

Finger tables
Each node knows the addresses (and hashes) of increasingly distant nodes in the ring, in particular its successor. In the figure: (1) where is key 'k'? (2) B, E, and J's addresses are in A's "finger table", so A knows that H(k) falls between E and J.

Routing in DHTs
(3) F, I, and C's addresses are in E's "finger table", so E can conclude that H(k) belongs to F. Efficient schemes are known that route requests in O(log n). Replicating a "complete" finger table would bring that to O(1), but...

How does a client go about finding a key?
The logic of key location can be pushed to clients ("give me key 'd'"). Or the client can ask any node to redirect the request on its behalf ("fetch key 'd'").
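To make the "first clockwise node" rule concrete, here is a small consistent-hashing sketch, not from the lecture: the `HashRing` class, the choice of MD5 as the ring hash, and the token names are assumptions made for illustration.

```python
import hashlib
from bisect import bisect_left

RING_SIZE = 2 ** 32

def ring_hash(value: str) -> int:
    """Map a node token or a user key onto the [0, 2^32) ring (MD5 is an arbitrary choice)."""
    return int(hashlib.md5(value.encode()).hexdigest(), 16) % RING_SIZE

class HashRing:
    """A key belongs to the first node found clockwise from H(key)."""

    def __init__(self):
        self._tokens = []   # sorted token hashes
        self._owners = []   # node owning each token, aligned with _tokens

    def add_node(self, node: str, token: str) -> None:
        h = ring_hash(token)
        i = bisect_left(self._tokens, h)
        self._tokens.insert(i, h)
        self._owners.insert(i, node)

    def locate(self, key: str) -> str:
        # first token with hash >= H(key); wrap around past position 0 if there is none
        i = bisect_left(self._tokens, ring_hash(key)) % len(self._tokens)
        return self._owners[i]

ring = HashRing()
ring.add_node("A", token="tokenA")
ring.add_node("B", token="tokenB")
print(ring.locate("k"), ring.locate("g"))   # owners depend on where the hashes land
```

Finger tables are deliberately omitted here: locate assumes the whole ring is known locally, which is exactly the O(1)-versus-O(log n) trade-off the routing slide points at.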
Adding a new node
The impact is only on the "following" node, which has to transfer some of its keys to the new arrival.

Adding a node
The node announces itself to its successor ("hello, I'm 'token C' and here's my address"); the latter offers to transfer the relevant keys ("transfer the 'brown' keys").

Group membership through gossip
Each node contacts a random node periodically and they reconcile their system data. Eventually all nodes learn about incoming and outgoing nodes.

Even partitioning through virtual nodes
One may assign T tokens per node to assure even key distribution. Now, what if we wanted to keep the same number of partitions and just throw in one more server?

Equal-sized partitions
Use Q partitions regardless of the number of servers. Now, can Q vary?

Load balancing, questions
Does a uniform key distribution guarantee uniform load distribution? How about the ability to issue sequential scans: is it lost for the sake of load balancing? How likely is it that replicating the system data would keep scaling? Adopters have reported positive answers to all of the above.
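The effect of virtual nodes is easy to see numerically. The sketch below is illustrative rather than from the lecture: `build_ring`, the "-vnode-" token naming, and MD5 are assumptions. It hashes T tokens per physical node and counts how many of 10,000 sample keys land on each node; with T = 1 the split tends to be lopsided, while with T = 32 it evens out.

```python
import hashlib
from collections import Counter

def ring_hash(value: str) -> int:
    """Same ring hash as before; MD5 is just an illustrative choice."""
    return int(hashlib.md5(value.encode()).hexdigest(), 16) % (2 ** 32)

def build_ring(nodes, tokens_per_node):
    """Place T virtual tokens per physical node on the ring, sorted by hash."""
    return sorted((ring_hash(f"{node}-vnode-{t}"), node)
                  for node in nodes for t in range(tokens_per_node))

def owner(ring, key):
    """First node clockwise from H(key); a linear scan is enough for a sketch."""
    h = ring_hash(key)
    for token_hash, node in ring:
        if token_hash >= h:
            return node
    return ring[0][1]   # wrapped past the top of the ring

# Compare how evenly 10,000 keys spread with 1 token per node vs. 32 tokens per node.
nodes = ["A", "B", "C"]
keys = [f"key-{i}" for i in range(10_000)]
for t in (1, 32):
    ring = build_ring(nodes, t)
    print(f"T={t}:", Counter(owner(ring, k) for k in keys))
```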
Wrap-up

Are they really that different?

  Partition (scheme 1)                | Replicate (scheme 2)
  ----------------------------------- | -----------------------------------
  Partitions the metadata information | Replicates the metadata information
  Uses a central authority            | Does away with a central authority
  Partitioning is explicit (ranges)   | Partitioning is implicit (hashing)
  Maps partitions to nodes            | Partitions "fall" into nodes

Ultimately, yes. And both are successfully deployed.

References
- Bigtable: A Distributed Storage System for Structured Data. Chang et al., OSDI '06.
- The Chubby Lock Service for Loosely-Coupled Distributed Systems. Mike Burrows, OSDI '06.
- Paxos Made Simple. Leslie Lamport.
- Dynamo: Amazon's Highly Available Key-value Store. DeCandia et al., SOSP '07.
- Consistent Hashing and Random Trees: Distributed Caching Protocols for Relieving Hot Spots on the World Wide Web. Karger et al., STOC '97.
- Chord: A Scalable Peer-to-Peer Lookup Service for Internet Applications. Stoica et al., SIGCOMM '01.