Distributed Aggregation for Data-Parallel Computing: Interfaces and Implementations

Yuan Yu
Microsoft Research
1065 La Avenida Ave.
Mountain View, CA 94043
[email protected]

Pradeep Kumar Gunda
Microsoft Research
1065 La Avenida Ave.
Mountain View, CA 94043
[email protected]

Michael Isard
Microsoft Research
1065 La Avenida Ave.
Mountain View, CA 94043
[email protected]

ABSTRACT
Data-intensive applications are increasingly designed to execute on large computing clusters. Grouped aggregation is a core primitive of many distributed programming models, and it is often the most efficient available mechanism for computations such as matrix multiplication and graph traversal. Such algorithms typically require non-standard aggregations that are more sophisticated than traditional built-in database functions such as Sum and Max. As a result, the ease of programming user-defined aggregations, and the efficiency of their implementation, is of great current interest.

This paper evaluates the interfaces and implementations for user-defined aggregation in several state of the art distributed computing systems: Hadoop, databases such as Oracle Parallel Server, and DryadLINQ. We show that: the degree of language integration between user-defined functions and the high-level query language has an impact on code legibility and simplicity; the choice of programming interface has a material effect on the performance of computations; some execution plans perform better than others on average; and that in order to get good performance on a variety of workloads a system must be able to select between execution plans depending on the computation. The interface and execution plan described in the MapReduce paper, and implemented by Hadoop, are found to be among the worst-performing choices.
Categories and Subject Descriptors
D.1.3 [Programming Techniques]: Concurrent Programming - Distributed programming

General Terms
Design, Languages, Performance

Keywords
Distributed programming, cloud computing, concurrency

1. INTRODUCTION
Many data-mining computations have as a fundamental subroutine a “GroupBy-Aggregate” operation. This takes a dataset, partitions its records into groups according to some key, then performs an aggregation over each resulting group. GroupBy-Aggregate is useful for summarization, e.g. finding average household income by zip code from a census dataset, but it is also at the heart of the distributed implementation of algorithms such as matrix multiplication [22, 27]. The ability to perform GroupBy-Aggregate at scale is therefore increasingly important, both for traditional data-mining tasks and also for emerging applications such as web-scale machine learning and graph analysis....
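
The GroupBy-Aggregate operation described above can be illustrated with a minimal single-machine sketch in Python. It is not taken from the paper and does not reflect any of the systems it evaluates: the function names `group_by_aggregate` and `mean_income`, the record layout `(zip_code, income)`, and the sample data are all assumptions made for illustration. The sketch shows only the two logical steps, partitioning records by key and then aggregating each group, using the average-household-income-by-zip-code example.

```python
from collections import defaultdict
from typing import Callable, Dict, Hashable, Iterable, List, Tuple, TypeVar

R = TypeVar("R")  # record type
A = TypeVar("A")  # aggregate result type


def group_by_aggregate(
    records: Iterable[R],
    key_fn: Callable[[R], Hashable],
    aggregate_fn: Callable[[List[R]], A],
) -> Dict[Hashable, A]:
    """Partition records by key, then aggregate each resulting group.

    Hypothetical helper for illustration; runs in memory on one machine.
    """
    groups: Dict[Hashable, List[R]] = defaultdict(list)
    for record in records:
        # Step 1 (GroupBy): place each record in the group for its key.
        groups[key_fn(record)].append(record)
    # Step 2 (Aggregate): reduce every group to a single value.
    return {key: aggregate_fn(group) for key, group in groups.items()}


def mean_income(group: List[Tuple[str, float]]) -> float:
    """User-defined aggregation: average income over one group."""
    return sum(income for _, income in group) / len(group)


# Made-up sample records of the form (zip_code, household_income).
households = [
    ("94043", 80000.0),
    ("94043", 120000.0),
    ("10001", 60000.0),
]

average_by_zip = group_by_aggregate(households, lambda r: r[0], mean_income)
print(average_by_zip)  # {'94043': 100000.0, '10001': 60000.0}
```

In a distributed setting the records of one group may be spread across many machines, so the grouping step implies moving data between them; this sketch deliberately ignores that and keeps every group in local memory.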