Availability in Globally Distributed Storage Systems (Google)

Daniel Ford, François Labelle, Florentina I. Popovici, Murray Stokely, Van-Anh Truong*, Luiz Barroso, Carrie Grimes, and Sean Quinlan
{ford,flab,florentina,mstokely}@google.com, vatruong@ieor.columbia.edu, {luiz,cgrimes,sean}@google.com
Google, Inc.

* Now at Dept. of Industrial Engineering and Operations Research, Columbia University

Abstract

Highly available cloud storage is often implemented with complex, multi-tiered distributed systems built on top of clusters of commodity servers and disk drives. Sophisticated management, load balancing and recovery techniques are needed to achieve high performance and availability amidst an abundance of failure sources that include software, hardware, network connectivity, and power issues. While there is a relative wealth of failure studies of individual components of storage systems, such as disk drives, relatively little has been reported so far on the overall availability behavior of large cloud-based storage services.

We characterize the availability properties of cloud storage systems based on an extensive one-year study of Google's main storage infrastructure and present statistical models that enable further insight into the impact of multiple design choices, such as data placement and replication strategies. With these models we compare data availability under a variety of system parameters given the real patterns of failures observed in our fleet.

1 Introduction

Cloud storage is often implemented by complex multi-tiered distributed systems on clusters of thousands of commodity servers. For example, in Google we run Bigtable [9], on GFS [16], on local Linux file systems that ultimately write to local hard drives. Failures in any of these layers can cause data unavailability.

Correctly designing and optimizing these multi-layered systems for user goals such as data availability relies on accurate models of system behavior and performance. In the case of distributed storage systems, this includes quantifying the impact of failures and prioritizing hardware and software subsystem improvements in the datacenter environment. We present models we derived from studying a year of live operation at Google and describe how our analysis influenced the design of our next generation distributed storage system [22].

Our work is presented in two parts. First, we measured and analyzed the component availability, e.g. machines, racks, multi-racks, in tens of Google storage clusters. In this part we:

- Compare mean time to failure for system components at different granularities, including disks, machines and racks of machines. (Section 3)
- Classify the failure causes for storage nodes, their characteristics and contribution to overall unavailability. (Section 3)
- Apply a clustering heuristic for grouping failures that occur almost simultaneously and show that a large fraction of failures happen in bursts. (Section 4)...
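
The idea of grouping near-simultaneous failures into bursts can be illustrated with a minimal sketch. The text above does not spell out the heuristic, so this assumes a simple gap-based rule: a failure joins the current burst if it follows the previous failure by no more than a fixed window. The function name, the 120-second window, and the timestamps are all hypothetical and chosen only for illustration.

```python
from typing import List


def group_into_bursts(failure_times: List[float], window: float) -> List[List[float]]:
    """Group failure timestamps (in seconds) into bursts.

    Assumed rule (not taken from the text): a failure belongs to the
    current burst if it occurs within `window` seconds of the previous
    failure; otherwise it starts a new burst.
    """
    bursts: List[List[float]] = []
    for t in sorted(failure_times):
        if bursts and t - bursts[-1][-1] <= window:
            bursts[-1].append(t)  # close enough to the previous failure
        else:
            bursts.append([t])    # gap too large: start a new burst
    return bursts


# Hypothetical failure timestamps, in seconds since the start of a trace.
times = [0, 30, 100, 1000, 1050, 5000]
bursts = group_into_bursts(times, window=120)

# Fraction of failures that fall in a burst of more than one failure.
in_bursts = sum(len(b) for b in bursts if len(b) > 1)
fraction_in_bursts = in_bursts / len(times)
```

With a rule of this kind, the fraction of failures that land in multi-failure bursts depends on the chosen window, so any such figure should be read alongside the window used to produce it.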