slides.0426.2011.a - Atul Adya – Google John Dunagan –...


Slide 1
Atul Adya – Google
John Dunagan – Microsoft
Alec Wolman – Microsoft Research

Slide 2
Diagram:
  Incoming Request (from Device D1): "store my current IP = A"
  Incoming Request (from Device D2): "Tell me D1's current IP addr"
  Front-end Web server, Front-end Web server, …, Front-end Web server
  Each front-end: Locate the App Server that stores the contact info for D1, then Store D1's IP addr / Read D1's IP addr
  Application Server (In-Memory), Application Server (In-Memory), …, Application Server (In-Memory)
Problems:
- How to assign responsibility for items to app servers? (partitioning)
- How to deal with addition, removal, & crashes of app servers?
- How to avoid requests for the same item winding up at different servers? (use leases)
- How to adapt to load changes?

Slide 3
Targets class of services with these characteristics:
- Interactive (needs low latency)
  ▪ App servers operate on in-memory state
- Application tier operates on cached data: the truth is hosted on clients or back-end storage
- Services use many small objects
- Even the most popular object can be handled by one server
  ▪ Replication not needed to handle load

Slide 4
- Prior systems implement leasing and partitioning separately
- We show that integrating leasing and partitioning allows scaling to massive numbers of objects
- This integration requires us to rethink the mechanisms and API for leasing
  ▪ Manager-directed leasing
  ▪ Non-traditional API where clients cannot request leases

Slide 5
- Centrifuge design
- Centrifuge internals
- Results from live deployment

Slide 6
Diagram:
  Lookups: Front-End Web Servers (each: Front-end + Lookup Library), …
  Centrifuge Manager Service
  Owners: Middle Tier Application Servers (each: Owner Library + In-Memory Server), …

Slide 7
- Need to issue leases for very large # of objects
- Lease per object will lead to prohibitive overhead
- Centrifuge manager hands out leases on ranges
- Use consistent hashing to partition a flat namespace
- Assign leases on contiguous ranges of the hashed namespace
- One lease (one range) per virtual node (64 per server)
- Single mechanism: manager-directed leasing handles both leasing and partitioning
Diagram:
  Centrifuge Manager Service -> Owner Library / In-Memory Server: "Lease: 0-50, 100-200"

Slide 8
Lookup API
  URL Lookup(Key key)
  void LossNotificationUpcall(KeyRange lost)
Owner API
  bool CheckLeaseNow(Key key, out LeaseNum leaseNum)
  bool CheckLeaseContinuous(Key key, LeaseNum leaseNum)
Diagram:
  Incoming Request: Find Device "D"
  Front-end / Lookup Library: Lookup("D") -> "http://m6/"
  Owner Library, Server "m1"; Owner Library, Server "m2"; …; Owner Library, Server "m6"
  At the owner:
    1. CheckLeaseNow("D") -> handle
    2. Perform application operation: find D's current IP addr
    3. CheckLeaseContinuous("D", handle)

Slide 9
- Servers in datacenter environment are stable
- Benefits
  ▪ Much cheaper to avoid holding multiple copies in RAM
  ▪ Avoids complexity/performance issues of quorum protocols
  ▪ Doesn't add extra complexity: need a mechanism to tolerate correlated failures anyway (e.g. security vulnerabilities, patch installation)
- Cost
  ▪ When an application server crashes, items are not available until clients republish

Slide 10
- When application server crashes, Lookups receive Loss Notifications
  ▪ Indicates which ranges are lost
  ▪ Allows the application to determine which clients should republish their state
- Live Mesh services use this model
  ▪ Rely on clients to recover state

Slide 11
- Partitioning
  ▪ Manager spreads namespace across Owners by assigning leases
- Consistency
  ▪ Leases ensure single-copy guarantee: at any time t, for any key at most one Owner node
- Recovery
  ▪ Loss notifications enable app developer to detect and recover from Owner crashes
- Membership
  ▪ Owners indicate liveness by requesting leases
- Load Balancing
  ▪ Manager rebalances namespace based on reported load

Slide 12
- Centrifuge design
- Centrifuge internals
- Results from live deployment

Slide 13
Diagram:
  Lookup: Cached Lease Table, Current LSN: 2 -> Manager: "I am at LSN 2."
  Manager: Lease Table, Current LSN: 4, [0-1: Owner=A] [1-2: Owner=B] [2-9: Owner=C], Change Log -> Lookup: "Here are changes LSN 2->4"
- Incremental protocol to synchronize Lookup and Manager lease tables
- Lookups are fast: no need to contact Manager and incur delay
- Manager load not dependent on incoming request load to Lookups

Slide 14
Diagram:
  Owner -> Manager: "Request Leases"
  Manager -> Owner: "Leases granted/recalled"
Robustness: Owners have multiple opportunities to retain their leases:
- Leases requested every 15 seconds
- Leases last 60 seconds
- Takes 3 consecutive lost/delayed requests to lose the lease
Safety: owner never thinks it has the lease when the manager disagrees
- Similar to previous lease servers, rely on clock rate synchronization

Slide 15
Diagram:
  Manager Service: Leader and Standbys, backed by a Paxos Group
  Standby: "Can I have the leader lease?" -> "No."
  Leader: "Renew leader lease and commit state update." -> "Yes."
  Lookups and Owners talk to the Leader

Slide 16
- Centrifuge designed to run in a single datacenter
- Scalability target: ~1000 machines in 1 cluster
- Beyond there, scale by deploying multiple clusters

Slide 17
- Centrifuge design
- Centrifuge internals
- Results from live deployment

Slide 18
- First deployed in April 2008
- Results cover 2.5 months: Dec '08 – Mar '09
- 1000 Lookups, 130 Owners
- Manager = 8 servers

Slide 19
- Is the Centrifuge manager a scalability bottleneck in steady-state?
- How well does Centrifuge handle high-churn events?
- How stable are production servers?

Slides 20-22
(Figures; no text survived extraction.)

Slide 23
From 1/15/09 through 3/2/09, no patch installations. How stable were the owners during this period?
- Servers are very stable: only 10 lease-loss events
  ▪ 7 cases, servers recovered < 10 minutes
  ▪ 3 cases, servers recovered < 1 hour

Slide 24
- Centrifuge simplifies building scalable application tiers with in-memory state
- Combining leasing and partitioning leads to a simple and powerful protocol
- Deployed within Live Mesh since April 2008, in use by 5 different Live Mesh Services
- Data center server stability enables the single copy in RAM w/loss notifications

...
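
The range leases and the owner-side lease checks described above can be sketched roughly as follows. This is a minimal Python illustration under stated assumptions, not Centrifuge's implementation. Only the 60-second lease duration and the CheckLeaseNow / CheckLeaseContinuous calls come from the slides. The class OwnerLeases, the helpers hash_key and handle_request, the 256-slot namespace, the SHA-256 hash, and the injected clock are all hypothetical, and the clock-rate synchronization that the slides say the real protocol relies on is ignored here.

```python
import hashlib
import time
from dataclasses import dataclass

NAMESPACE_SIZE = 256   # assumption: tiny hashed namespace, for illustration only
LEASE_SECONDS = 60.0   # from the slides: leases last 60 seconds


def hash_key(key: str) -> int:
    """Map a key to a point in the flat hashed namespace (hypothetical hash choice)."""
    return hashlib.sha256(key.encode("utf-8")).digest()[0] % NAMESPACE_SIZE


@dataclass
class RangeLease:
    low: int           # inclusive lower bound of the hashed range
    high: int          # inclusive upper bound of the hashed range
    lease_num: int     # changes whenever the range is granted afresh
    expires_at: float  # clock reading at which the lease lapses


class OwnerLeases:
    """Owner-side view of the range leases handed out by the manager (sketch)."""

    def __init__(self, clock=time.monotonic):
        self._clock = clock
        self._leases = []
        self._next_num = 1

    def grant(self, low: int, high: int) -> RangeLease:
        """Record a range chosen by the manager; the owner never picks ranges itself."""
        # Drop any older lease that overlaps the newly granted range.
        self._leases = [l for l in self._leases if l.high < low or l.low > high]
        lease = RangeLease(low, high, self._next_num, self._clock() + LEASE_SECONDS)
        self._next_num += 1
        self._leases.append(lease)
        return lease

    def renew(self) -> None:
        """Extend every unexpired lease; renewal keeps the same lease number."""
        now = self._clock()
        for lease in self._leases:
            if now < lease.expires_at:
                lease.expires_at = now + LEASE_SECONDS

    def _find(self, key: str):
        point, now = hash_key(key), self._clock()
        for lease in self._leases:
            if lease.low <= point <= lease.high and now < lease.expires_at:
                return lease
        return None

    def check_lease_now(self, key: str):
        """Analogue of CheckLeaseNow: (held?, lease number to use as a handle)."""
        lease = self._find(key)
        return (lease is not None, lease.lease_num if lease else None)

    def check_lease_continuous(self, key: str, lease_num: int) -> bool:
        """Analogue of CheckLeaseContinuous: True only if the same lease is still held."""
        lease = self._find(key)
        return lease is not None and lease.lease_num == lease_num


def handle_request(owner: OwnerLeases, key: str, store: dict):
    """Three-step pattern from the slides: check, operate, re-check."""
    held, handle = owner.check_lease_now(key)
    if not held:
        return None  # not the owner for this key
    value = store.get(key)  # the application operation on in-memory state
    if not owner.check_lease_continuous(key, handle):
        return None  # lease changed mid-operation, so discard the result
    return value
```

Passing the clock in as a parameter is a design choice made only so the sketch can be exercised without waiting 60 seconds. A caller would use handle_request(owner, "D", store) and treat a None result as a signal to redo the lookup.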