Reining in the Outliers in Map-Reduce Clusters using Mantri

Ganesh Ananthanarayanan†⋄, Srikanth Kandula†, Albert Greenberg†, Ion Stoica⋄, Yi Lu†, Bikas Saha‡, Edward Harris‡
†Microsoft Research  ⋄UC Berkeley  ‡Microsoft Bing

Abstract– Experience from an operational Map-Reduce cluster reveals that outliers significantly prolong job completion. The causes for outliers include run-time contention for processor, memory and other resources, disk failures, varying bandwidth and congestion along network paths, and imbalance in task workload. We present Mantri, a system that monitors tasks and culls outliers using cause- and resource-aware techniques. Mantri's strategies include restarting outliers, network-aware placement of tasks and protecting outputs of valuable tasks. Using real-time progress reports, Mantri detects and acts on outliers early in their lifetime. Early action frees up resources that can be used by subsequent tasks and expedites the job overall. Acting based on the causes and the resource and opportunity cost of actions lets Mantri improve over prior work that only duplicates the laggards. Deployment in Bing's production clusters and trace-driven simulations show that Mantri improves job completion times by 32%.

1 Introduction

In a very short time, Map-Reduce has become the dominant paradigm for large data processing on compute clusters. Software frameworks based on Map-Reduce [1, 11, 13] have been deployed on tens of thousands of machines to implement a variety of applications, such as building search indices, optimizing advertisements, and mining social networks.

While highly successful, Map-Reduce clusters come with their own set of challenges. One such challenge is the often unpredictable performance of the Map-Reduce jobs. A job consists of a set of tasks which are organized in phases. Tasks in a phase depend on the results computed by the tasks in the previous phase and can run in parallel. When a task takes longer to finish than other similar tasks, tasks in the subsequent phase are delayed. At key points in the job, a few such outlier tasks can prevent the rest of the job from making progress. As the size of the cluster and the size of the jobs grow, the impact of outliers increases dramatically.

Addressing the outlier problem is critical to speed up job completion and improve cluster efficiency. Even a few percent of improvement in the efficiency of a cluster consisting of tens of thousands of nodes can save millions of dollars a year. In addition, finishing production jobs quickly is a competitive advantage. Doing so predictably allows SLAs to be met. In iterative modify/debug/analyze development cycles, the ability to iterate faster improves programmer productivity. ...
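
The way a single outlier holds back later phases can be illustrated with a minimal Python sketch. It is not Mantri's model: it assumes a strict barrier between phases and enough slots for every task in a phase to run in parallel, and the function name `job_completion_time` and the task durations are hypothetical values chosen only for illustration.

```python
from typing import List


def job_completion_time(phases: List[List[float]]) -> float:
    """Return the job completion time under two simplifying assumptions:
    all tasks of a phase run in parallel, and the next phase starts only
    after the slowest task of the current phase has finished."""
    # Each phase lasts as long as its slowest task; empty phases add nothing.
    return sum(max(tasks) for tasks in phases if tasks)


# Hypothetical durations in seconds: two phases of four tasks each.
balanced = [[10.0, 10.0, 10.0, 10.0], [10.0, 10.0, 10.0, 10.0]]
with_outlier = [[10.0, 10.0, 10.0, 50.0], [10.0, 10.0, 10.0, 10.0]]

print(job_completion_time(balanced))      # 20.0
print(job_completion_time(with_outlier))  # 60.0
```

Under these assumptions one slow task out of eight triples the job's completion time, even though the other seven tasks are unchanged.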
This note was uploaded on 12/08/2011 for the course CS 525 taught by Professor Gupta during the Spring '08 term at University of Illinois, Urbana Champaign.