Improving MapReduce Performance in Heterogeneous Environments

Matei Zaharia, Andy Konwinski, Anthony D. Joseph, Randy Katz, Ion Stoica
University of California, Berkeley
{matei,andyk,adj,randy,stoica}@cs.berkeley.edu

Abstract

MapReduce is emerging as an important programming model for large-scale data-parallel applications such as web indexing, data mining, and scientific simulation. Hadoop is an open-source implementation of MapReduce enjoying wide adoption and is often used for short jobs where low response time is critical. Hadoop's performance is closely tied to its task scheduler, which implicitly assumes that cluster nodes are homogeneous and tasks make progress linearly, and uses these assumptions to decide when to speculatively re-execute tasks that appear to be stragglers. In practice, the homogeneity assumptions do not always hold. An especially compelling setting where this occurs is a virtualized data center, such as Amazon's Elastic Compute Cloud (EC2). We show that Hadoop's scheduler can cause severe performance degradation in heterogeneous environments. We design a new scheduling algorithm, Longest Approximate Time to End (LATE), that is highly robust to heterogeneity. LATE can improve Hadoop response times by a factor of 2 in clusters of 200 virtual machines on EC2.

1 Introduction

Today's most popular computer applications are Internet services with millions of users. The sheer volume of data that these services work with has led to interest in parallel processing on commodity clusters. The leading example is Google, which uses its MapReduce framework to process 20 petabytes of data per day [1]. Other Internet services, such as e-commerce websites and social networks, also cope with enormous volumes of data. These services generate clickstream data from millions of users every day, which is a potential gold mine for understanding access patterns and increasing ad revenue. Furthermore, for each user action, a web application generates one or two orders of magnitude more data in system logs, which are the main resource that developers and operators have for diagnosing problems in production.

The MapReduce model popularized by Google is very attractive for ad-hoc parallel processing of arbitrary data. MapReduce breaks a computation into small tasks that run in parallel on multiple machines, and scales easily to very large clusters of inexpensive commodity computers. Its popular open-source implementation, Hadoop [2], was developed primarily by Yahoo, where it runs jobs that produce hundreds of terabytes of data on at least 10,000 cores [4]. Hadoop is also used at Facebook, Amazon, and Last.fm [5]. In addition, researchers at Cornell, Carnegie Mellon, University of Maryland and PARC are starting to use Hadoop for seismic simulation, natural language processing, and mining web data [5, 6]. ...
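The split-into-parallel-tasks structure described in the paragraph above can be made concrete with a minimal sketch. The word count below is not Hadoop code: it is a toy MapReduce-style pipeline in plain Python, where the input splits, the three-process pool, and the function names are illustrative choices. It shows the essential shape of the model: independent map tasks run in parallel over input splits, and a reduce step merges their intermediate outputs.

```python
from collections import Counter
from multiprocessing import Pool

def map_task(chunk: str) -> Counter:
    # Map: count every word in one input split.
    return Counter(chunk.split())

def reduce_counts(partials) -> Counter:
    # Reduce: merge the intermediate counts from all map tasks.
    total = Counter()
    for partial in partials:
        total += partial
    return total

if __name__ == "__main__":
    # Three input splits; on a real cluster each would live on a different node.
    splits = ["the quick brown fox", "the lazy dog", "the fox jumps"]
    with Pool(processes=3) as pool:   # one worker process per split
        partials = pool.map(map_task, splits)
    print(reduce_counts(partials))    # e.g. Counter({'the': 3, 'fox': 2, ...})
```

Because each map task depends only on its own split, a failed or slow task can be re-executed elsewhere without restarting the job, which is exactly the property the scheduler exploits for speculation.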
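The abstract names the LATE policy but this excerpt does not spell out its heuristic. The sketch below is one plausible reading of "Longest Approximate Time to End": estimate each running task's remaining time from its observed progress rate and speculatively re-execute the task expected to finish last. The Task fields, the progress-rate formula, and pick_speculative_candidate are assumptions for illustration, not Hadoop's scheduler API.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class Task:
    task_id: str
    progress: float   # fraction complete, in [0, 1]
    elapsed: float    # seconds since the task started

def time_to_end(task: Task) -> float:
    # Estimate remaining time from the observed progress rate:
    # rate = progress / elapsed, time_left = (1 - progress) / rate.
    if task.elapsed <= 0 or task.progress <= 0:
        return 0.0  # no progress data yet; don't treat it as a straggler
    rate = task.progress / task.elapsed
    return (1.0 - task.progress) / rate

def pick_speculative_candidate(running: list) -> Optional[Task]:
    # Speculatively re-execute the task with the longest estimated time to end.
    candidates = [t for t in running if t.progress < 1.0]
    return max(candidates, key=time_to_end, default=None)

if __name__ == "__main__":
    tasks = [Task("t1", 0.9, 60), Task("t2", 0.2, 60), Task("t3", 0.5, 30)]
    straggler = pick_speculative_candidate(tasks)
    print(straggler.task_id)  # t2: slowest progress rate, longest time to end
```

Ranking by estimated time to end, rather than by raw progress, is what lets the policy tolerate heterogeneity: a task that is behind but running on a fast node may still finish before a further-along task on a slow node.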