dryadlinq - DryadLINQ: A System for General-Purpose...

DryadLINQ: A System for General-Purpose Distributed Data-Parallel Computing Using a High-Level Language

Yuan Yu, Michael Isard, Dennis Fetterly, Mihai Budiu, Úlfar Erlingsson¹, Pradeep Kumar Gunda, Jon Currey
Microsoft Research Silicon Valley
¹ joint affiliation, Reykjavík University, Iceland

Abstract

DryadLINQ is a system and a set of language extensions that enable a new programming model for large scale distributed computing. It generalizes previous execution environments such as SQL, MapReduce, and Dryad in two ways: by adopting an expressive data model of strongly typed .NET objects; and by supporting general-purpose imperative and declarative operations on datasets within a traditional high-level programming language.

A DryadLINQ program is a sequential program composed of LINQ expressions performing arbitrary side-effect-free transformations on datasets, and can be written and debugged using standard .NET development tools. The DryadLINQ system automatically and transparently translates the data-parallel portions of the program into a distributed execution plan which is passed to the Dryad execution platform. Dryad, which has been in continuous operation for several years on production clusters made up of thousands of computers, ensures efficient, reliable execution of this plan.

We describe the implementation of the DryadLINQ compiler and runtime. We evaluate DryadLINQ on a varied set of programs drawn from domains such as web-graph analysis, large-scale log mining, and machine learning. We show that excellent absolute performance can be attained (a general-purpose sort of 10^12 bytes of data executes in 319 seconds on a 240-computer, 960-disk cluster), as well as demonstrating near-linear scaling of execution time on representative applications as we vary the number of computers used for a job.

1 Introduction

The DryadLINQ system is designed to make it easy for a wide variety of developers to compute effectively on large amounts of data. DryadLINQ programs are written as imperative or declarative operations on datasets within a traditional high-level programming language, using an expressive data model of strongly typed .NET objects. The main contribution of this paper is a set of language extensions and a corresponding system that can automatically and transparently compile imperative programs in a general-purpose language into distributed computations that execute efficiently on large computing clusters.

Our goal is to give the programmer the illusion of writing for a single computer and to have the system deal with the complexities that arise from scheduling, distribution, and fault-tolerance. Achieving this goal requires a wide variety of components to interact, including cluster-management software, distributed-execution middleware, language constructs, and development tools. Traditional parallel databases (which we survey in Section 6.1) as well as more recent data-processing systems such as MapReduce [15] and Dryad [26] demonstrate that it is possible to implement

This note was uploaded on 12/08/2011 for the course CS 525 taught by Professor Gupta during the Spring '08 term at University of Illinois, Urbana Champaign.
