MapReduce Online

Tyson Condie, Neil Conway, Peter Alvaro, Joseph M. Hellerstein
UC Berkeley

Khaled Elmeleegy, Russell Sears
Yahoo! Research

Abstract

MapReduce is a popular framework for data-intensive distributed computing of batch jobs. To simplify fault tolerance, many implementations of MapReduce materialize the entire output of each map and reduce task before it can be consumed. In this paper, we propose a modified MapReduce architecture that allows data to be pipelined between operators. This extends the MapReduce programming model beyond batch processing, and can reduce completion times and improve system utilization for batch jobs as well. We present a modified version of the Hadoop MapReduce framework that supports online aggregation, which allows users to see early returns from a job as it is being computed. Our Hadoop Online Prototype (HOP) also supports continuous queries, which enable MapReduce programs to be written for applications such as event monitoring and stream processing. HOP retains the fault tolerance properties of Hadoop and can run unmodified user-defined MapReduce programs.

1 Introduction

MapReduce has emerged as a popular way to harness the power of large clusters of computers. MapReduce allows programmers to think in a data-centric fashion: they focus on applying transformations to sets of data records, and allow the details of distributed execution, network communication and fault tolerance to be handled by the MapReduce framework.

MapReduce is typically applied to large batch-oriented computations that are concerned primarily with time to job completion. The Google MapReduce framework and open-source Hadoop system reinforce this usage model through a batch-processing implementation strategy: the entire output of each map and reduce task is materialized to a local file before it can be consumed by the next stage. Materialization allows for a simple and elegant checkpoint/restart fault tolerance mechanism that is critical in large deployments, which have a high probability of slowdowns or failures at worker nodes.

We propose a modified MapReduce architecture in which intermediate data is pipelined between operators, while preserving the programming interfaces and fault tolerance models of previous MapReduce frameworks. To validate this design, we developed the Hadoop Online Prototype (HOP), a pipelining version of Hadoop.

Pipelining provides several important advantages to a MapReduce framework, but also raises new design challenges. We highlight the potential benefits first:

- Since reducers begin processing data as soon as it is produced by mappers, they can generate and refine an approximation of their final answer during the course of execution. This technique, known as online aggregation, can provide initial estimates of results several orders of magnitude faster than the final results. We describe how we adapted online aggregation to our pipelined MapReduce architecture in Section 4.
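The idea of a reducer refining its answer while map output is still arriving can be illustrated with a small single-process sketch. This is not HOP's implementation and uses no Hadoop API: the function names (map_words, reduce_online, run_example) and the snapshot_every parameter are hypothetical, and Python generators stand in for the network pipeline between map and reduce tasks. The mapper yields pairs one at a time rather than materializing its full output, and the reducer emits a snapshot of its running word counts at fixed intervals.

```python
from collections import Counter
from typing import Dict, Iterable, Iterator, List, Tuple


def map_words(lines: Iterable[str]) -> Iterator[Tuple[str, int]]:
    """Mapper: lazily emit (word, 1) pairs instead of materializing them all."""
    for line in lines:
        for word in line.split():
            yield word, 1


def reduce_online(
    pairs: Iterable[Tuple[str, int]], snapshot_every: int
) -> Iterator[Tuple[int, Dict[str, int]]]:
    """Reducer: consume pairs as they arrive and yield (pairs_seen, running_counts).

    A snapshot is produced after every `snapshot_every` pairs (an assumed,
    illustrative policy), plus one final snapshot once the input is exhausted.
    """
    counts: Counter = Counter()
    seen = 0
    for key, value in pairs:
        counts[key] += value
        seen += 1
        if seen % snapshot_every == 0:
            yield seen, dict(counts)
    # Emit the final answer unless the last periodic snapshot already covered it.
    if seen == 0 or seen % snapshot_every != 0:
        yield seen, dict(counts)


def run_example() -> List[Tuple[int, Dict[str, int]]]:
    """Run the pipelined word count on a tiny hypothetical input."""
    lines = ["a b a", "b a c", "a"]
    return list(reduce_online(map_words(lines), snapshot_every=3))


if __name__ == "__main__":
    for seen, estimate in run_example():
        print(f"after {seen} pairs: {estimate}")
```

Each early snapshot is an approximation based only on the input seen so far; the last snapshot equals the result a batch job would produce. A real system would also need to decide how to scale or qualify early estimates, which this sketch does not attempt.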
This note was uploaded on 12/08/2011 for the course CS 525 taught by Professor Gupta during the Spring '08 term at University of Illinois, Urbana Champaign.