
Reining in the Outliers in Map-Reduce Clusters using Mantri

Ganesh Ananthanarayanan†⋄, Srikanth Kandula, Albert Greenberg, Ion Stoica, Yi Lu, Bikas Saha, Edward Harris
Microsoft Research, UC Berkeley, Microsoft Bing

Abstract– Experience from an operational Map-Reduce cluster reveals that outliers significantly prolong job completion. The causes for outliers include run-time contention for processor, memory and other resources, disk failures, varying bandwidth and congestion along network paths, and imbalance in task workload. We present Mantri, a system that monitors tasks and culls outliers using cause- and resource-aware techniques. Mantri's strategies include restarting outliers, network-aware placement of tasks and protecting outputs of valuable tasks. Using real-time progress reports, Mantri detects and acts on outliers early in their lifetime. Early action frees up resources that can be used by subsequent tasks and expedites the job overall. Acting based on the causes and the resource and opportunity cost of actions lets Mantri improve over prior work that only duplicates the laggards. Deployment in Bing's production clusters and trace-driven simulations show that Mantri improves job completion times by 32%.

1 Introduction

In a very short time, Map-Reduce has become the dominant paradigm for large data processing on compute clusters. Software frameworks based on Map-Reduce [1, 11, 13] have been deployed on tens of thousands of machines to implement a variety of applications, such as building search indices, optimizing advertisements, and mining social networks.

While highly successful, Map-Reduce clusters come with their own set of challenges. One such challenge is the often unpredictable performance of the Map-Reduce jobs. A job consists of a set of tasks which are organized in phases. Tasks in a phase depend on the results computed by the tasks in the previous phase and can run in parallel. When a task takes longer to finish than other similar tasks, tasks in the subsequent phase are delayed. At key points in the job, a few such outlier tasks can prevent the rest of the job from making progress. As the size of the cluster and the size of the jobs grow, the impact of outliers increases dramatically.

Addressing the outlier problem is critical to speed up job completion and improve cluster efficiency. Even a few percent of improvement in the efficiency of a cluster consisting of tens of thousands of nodes can save millions of dollars a year. In addition, finishing production jobs quickly is a competitive advantage. Doing so predictably allows SLAs to be met. In iterative modify/debug/analyze development cycles, the ability to iterate faster improves programmer productivity.
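To make the effect of a single outlier concrete, the phase structure described above can be sketched as a toy model. This is only an illustration, not Mantri's method: it assumes a strict barrier between phases and enough slots to run every task of a phase at once, and the task durations, the function names, and the 1.5x-median cutoff are all hypothetical choices made for the example.

```python
from statistics import median


def job_completion_time(phases):
    """Sum of per-phase makespans.

    Assumes (for illustration) that a phase starts only after every task
    of the previous phase has finished, and that all tasks of a phase
    run in parallel, so each phase lasts as long as its slowest task.
    """
    return sum(max(durations) for durations in phases)


def outliers(durations, factor=1.5):
    """Tasks whose duration exceeds `factor` times the phase median.

    The factor is an assumed, illustrative threshold.
    """
    cutoff = factor * median(durations)
    return [d for d in durations if d > cutoff]


# Hypothetical task durations in seconds for a two-phase job.
balanced = [[10, 11, 9, 10], [20, 21, 19]]
with_outlier = [[10, 11, 9, 40], [20, 21, 19]]

print(job_completion_time(balanced))      # 11 + 21 = 32
print(job_completion_time(with_outlier))  # 40 + 21 = 61
print(outliers(with_outlier[0]))          # [40]
```

Under these assumptions, one slow task in the first phase nearly doubles the job's completion time even though every other task is unchanged, because the second phase cannot start until the slow task finishes.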