yu-sosp09 - Distributed Aggregation for Data-Parallel...

Distributed Aggregation for Data-Parallel Computing: Interfaces and Implementations

Yuan Yu
Microsoft Research
1065 La Avenida Ave.
Mountain View, CA 94043

Pradeep Kumar Gunda
Microsoft Research
1065 La Avenida Ave.
Mountain View, CA 94043

Michael Isard
Microsoft Research
1065 La Avenida Ave.
Mountain View, CA 94043

ABSTRACT

Data-intensive applications are increasingly designed to execute on large computing clusters. Grouped aggregation is a core primitive of many distributed programming models, and it is often the most efficient available mechanism for computations such as matrix multiplication and graph traversal. Such algorithms typically require non-standard aggregations that are more sophisticated than traditional built-in database functions such as Sum and Max. As a result, the ease of programming user-defined aggregations, and the efficiency of their implementation, is of great current interest.

This paper evaluates the interfaces and implementations for user-defined aggregation in several state of the art distributed computing systems: Hadoop, databases such as Oracle Parallel Server, and DryadLINQ. We show that: the degree of language integration between user-defined functions and the high-level query language has an impact on code legibility and simplicity; the choice of programming interface has a material effect on the performance of computations; some execution plans perform better than others on average; and that in order to get good performance on a variety of workloads a system must be able to select between execution plans depending on the computation. The interface and execution plan described in the MapReduce paper, and implemented by Hadoop, are found to be among the worst-performing choices.
Categories and Subject Descriptors
D.1.3 [Programming Techniques]: Concurrent Programming - Distributed programming

General Terms
Design, Languages, Performance

Keywords
Distributed programming, cloud computing, concurrency

1. INTRODUCTION

Many data-mining computations have as a fundamental subroutine a “GroupBy-Aggregate” operation. This takes a dataset, partitions its records into groups according to some key, then performs an aggregation over each resulting group. GroupBy-Aggregate is useful for summarization, e.g. finding average household income by zip code from a census dataset, but it is also at the heart of the distributed implementation of algorithms such as matrix multiplication [22, 27]. The ability to perform GroupBy-Aggregate at scale is therefore increasingly important, both for traditional data-mining tasks and also for emerging applications such as web-scale machine learning and graph analysis....
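
As an illustration only, the GroupBy-Aggregate pattern described above can be sketched on a single machine in Python. The sketch uses the household-income example from the text. The helper names (group_by_aggregate, mean) and the sample records are hypothetical and do not come from the paper, and this is not the interface of Hadoop, DryadLINQ, or any other system the paper evaluates.

```python
from collections import defaultdict


def group_by_aggregate(records, key_fn, value_fn, aggregate_fn):
    """Partition records into groups by key, then aggregate each group.

    Hypothetical single-machine helper; no distribution is involved.
    """
    groups = defaultdict(list)
    for record in records:
        # GroupBy step: collect each record's value under its key.
        groups[key_fn(record)].append(value_fn(record))
    # Aggregate step: reduce every group's values to one result.
    return {key: aggregate_fn(values) for key, values in groups.items()}


def mean(values):
    """Arithmetic mean of a non-empty list of numbers."""
    return sum(values) / len(values)


# Made-up census records of the form (zip_code, household_income).
census = [
    ("94043", 100000),
    ("94043", 120000),
    ("10001", 80000),
    ("10001", 60000),
    ("10001", 70000),
]

average_income = group_by_aggregate(
    census,
    key_fn=lambda record: record[0],
    value_fn=lambda record: record[1],
    aggregate_fn=mean,
)
print(average_income)
```

Here the key function selects the zip code and the aggregation is a plain average. A distributed system would need to perform the grouping across machines, which this sketch does not attempt.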