Nectar: Automatic Management of Data and Computation in Datacenters

Pradeep Kumar Gunda, Lenin Ravindranath*, Chandramohan A. Thekkath, Yuan Yu, Li Zhuang
Microsoft Research Silicon Valley

* L. Ravindranath is affiliated with the Massachusetts Institute of Technology and was a summer intern on the Nectar project.

Abstract

Managing data and computation is at the heart of datacenter computing. Manual management of data can lead to data loss, wasteful consumption of storage, and laborious bookkeeping. Lack of proper management of computation can result in lost opportunities to share common computations across multiple jobs or to compute results incrementally.

Nectar is a system designed to address the aforementioned problems. It automates and unifies the management of data and computation within a datacenter. In Nectar, data and computation are treated interchangeably by associating data with its computation. Derived datasets, which are the results of computations, are uniquely identified by the programs that produce them, and together with their programs, are automatically managed by a datacenter wide caching service. Any derived dataset can be transparently regenerated by re-executing its program, and any computation can be transparently avoided by using previously cached results. This enables us to greatly improve datacenter management and resource utilization: obsolete or infrequently used derived datasets are automatically garbage collected, and shared common computations are computed only once and reused by others.

This paper describes the design and implementation of Nectar, and reports on our evaluation of the system using analytic studies of logs from several production clusters and an actual deployment on a 240-node cluster.

1 Introduction

Recent advances in distributed execution engines (MapReduce, Dryad, and Hadoop) and high-level language support (Sawzall, Pig, BOOM, HIVE, SCOPE, DryadLINQ) have greatly simplified the development of large-scale, data-intensive, distributed applications. However, major challenges still remain in realizing the full potential of data-intensive distributed computing within datacenters. In current practice, a large fraction of the computations in a datacenter is redundant and many datasets are obsolete or seldom used, wasting vast amounts of resources in a datacenter.

As one example, we quantified the wasted storage in our 240-node experimental Dryad/DryadLINQ cluster. We crawled this cluster and noted the last access time for each data file. We discovered that around 50% of the files had not been accessed in the last 250 days.

As another example, we examined the execution statistics of 25 production clusters running data-parallel applications. We estimated that, on one such cluster, over 7000 hours of redundant computation can be eliminated per day by caching intermediate results. (This is approximately equivalent to shutting off 300 machines daily.)
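The idea stated in the abstract, that a derived dataset is identified by the program that produces it so that a repeated computation can be served from a cache and an evicted result can be regenerated, can be illustrated with a toy sketch. This is not Nectar's implementation: the class `DerivedDataCache`, its methods, the in-memory dictionary, and the use of SHA-256 over the program text and input are all assumptions made only for illustration.

```python
import hashlib
from typing import Callable, Dict, List


class DerivedDataCache:
    """Toy in-memory cache keyed by a fingerprint of (program text, input).

    Hypothetical illustration only; not the Nectar cache service.
    """

    def __init__(self) -> None:
        self._store: Dict[str, List[int]] = {}
        self.hits = 0
        self.misses = 0

    @staticmethod
    def fingerprint(program_text: str, input_data: List[int]) -> str:
        # Assumption: hashing the program text together with its input is
        # enough to identify the derived dataset in this toy setting.
        h = hashlib.sha256()
        h.update(program_text.encode("utf-8"))
        h.update(b"\0")
        h.update(repr(input_data).encode("utf-8"))
        return h.hexdigest()

    def run(
        self,
        program_text: str,
        program: Callable[[List[int]], List[int]],
        input_data: List[int],
    ) -> List[int]:
        key = self.fingerprint(program_text, input_data)
        if key in self._store:
            # The same program on the same input was already computed: reuse it.
            self.hits += 1
            return self._store[key]
        # Cache miss: execute the program and remember the derived dataset.
        self.misses += 1
        result = program(input_data)
        self._store[key] = result
        return result

    def evict(self, program_text: str, input_data: List[int]) -> None:
        # Stand-in for garbage collection: drop the stored result. It can be
        # regenerated later by re-running the program.
        self._store.pop(self.fingerprint(program_text, input_data), None)


cache = DerivedDataCache()
square_src = "map(x -> x * x)"  # hypothetical program text used as identity


def square(xs: List[int]) -> List[int]:
    return [x * x for x in xs]


first = cache.run(square_src, square, [1, 2, 3])   # computed
second = cache.run(square_src, square, [1, 2, 3])  # served from the cache
cache.evict(square_src, [1, 2, 3])                 # result discarded
third = cache.run(square_src, square, [1, 2, 3])   # regenerated on demand
```

In this sketch the second call avoids recomputation, and the third call shows that discarding a derived result is safe as long as its program and input remain available.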
This note was uploaded on 11/12/2011 for the course CE 726 taught by Professor Staf during the Spring '11 term at SUNY Buffalo.