slides.021511 - CLOUD PROGRAMMING Andrew Harris & Long...

CLOUD PROGRAMMING

MOTIVATION

Research problem: How to write distributed data-parallel programs for a compute cluster?

Drawbacks of parallel databases (SQL):
- Too limited for many applications.
- Very restrictive type system.
- The declarative query style is unnatural.

Drawback of Map-Reduce:
- Too low-level and rigid, which leads to a great deal of custom user code that is hard to maintain and reuse.
LAYERS

[Figure: software stack, top to bottom]
- Applications: Image Processing, Machine Learning, Graph Analysis, Data Mining, Other Applications
- Languages: Pig Latin / DryadLINQ, Other Languages
- Execution engines: Hadoop Map-Reduce / Dryad
- Cluster Services
- Server, Server, Server

PIG LATIN: A Not-So-Foreign Language for Data Processing
DATAFLOW LANGUAGE

- The user specifies a sequence of steps, where each step specifies only a single, high-level data transformation.
- This style is procedural and similar to relational algebra, which makes it desirable for programmers.
- With SQL, by contrast, the user specifies a set of declarative constraints. That style is non-procedural and desirable for non-programmers.

A SAMPLE OF PIG LATIN CODE

SQL:

    SELECT category, AVG(pagerank)
    FROM urls
    WHERE pagerank > 0.2
    GROUP BY category
    HAVING COUNT(*) > 10^6

Pig Latin:

    good_urls = FILTER urls BY pagerank > 0.2;
    groups = GROUP good_urls BY category;
    big_groups = FILTER groups BY COUNT(good_urls)>10^6;
    output = FOREACH big_groups GENERATE category, AVG(good_urls.pagerank);

A Pig Latin program is a sequence of steps, each of which carries out a single data transformation.
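To make the step-by-step reading concrete, the four Pig Latin steps can be mirrored in plain Python. This is only an illustrative sketch, not how Pig executes the program: the function name `avg_pagerank_by_category`, its parameters, and the sample records are assumptions made for the example, and everything runs in memory on one machine.

```python
from collections import defaultdict
from statistics import mean

def avg_pagerank_by_category(urls, min_pagerank=0.2, min_group_size=10**6):
    """Hypothetical single-machine analogue of the four Pig Latin steps."""
    # Step 1 (FILTER): keep only urls whose pagerank exceeds the threshold.
    good_urls = [u for u in urls if u["pagerank"] > min_pagerank]

    # Step 2 (GROUP): collect the surviving records into one bag per category.
    groups = defaultdict(list)
    for u in good_urls:
        groups[u["category"]].append(u)

    # Step 3 (FILTER): keep only groups with more than min_group_size members.
    big_groups = {c: bag for c, bag in groups.items() if len(bag) > min_group_size}

    # Step 4 (FOREACH ... GENERATE): emit the average pagerank per category.
    return {c: mean(u["pagerank"] for u in bag) for c, bag in big_groups.items()}

# Made-up sample data; the group-size threshold is lowered so the toy input passes.
urls = [
    {"url": "a.example", "category": "news", "pagerank": 0.75},
    {"url": "b.example", "category": "news", "pagerank": 0.25},
    {"url": "c.example", "category": "news", "pagerank": 0.125},
    {"url": "d.example", "category": "sports", "pagerank": 0.5},
]
result = avg_pagerank_by_category(urls, min_group_size=1)
print(result)
```

Each intermediate name (`good_urls`, `groups`, `big_groups`) holds the result of exactly one transformation, which is the property the slide attributes to Pig Latin.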
DATA MODEL

- Atom: Contains a simple atomic value such as a string or a number, e.g., ‘Joe’.
- Tuple: Sequence of fields, each of which might be any data type, e.g., (‘Joe’, ‘lakers’).
- Bag: A collection of tuples with possible duplicates. Schema of a bag is flexible.
- Map: A collection of data items, where each item has an associated key through which it can be looked up. Keys must be data atoms.
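One rough way to picture the four types is to map them onto built-in Python values. The mapping below is an assumption chosen for illustration (Pig has its own representations), and the sample values beyond ‘Joe’ and ‘lakers’ are invented.

```python
# Atom: a simple atomic value such as a string or a number.
atom = "Joe"

# Tuple: a sequence of fields, each of which may be any data type.
fan = ("Joe", "lakers")

# Bag: a collection of tuples that may contain duplicates. A list is used
# here because it keeps duplicates; the flexible schema shows up as tuples
# with different numbers of fields sitting in the same bag.
bag = [
    ("Joe", "lakers"),
    ("Joe", "lakers"),   # duplicate tuple is allowed
    ("Joe", "iPod", 3),  # different arity is allowed
]

# Map: data items looked up by key; keys are atoms, values may be any type.
info = {
    "name": atom,
    "fan of": bag,
    "age": 20,
}

print(len(bag), bag.count(fan), info["age"])
```

Note that a Python list is ordered, whereas a bag has no inherent order, so the list here only models the "duplicates allowed" aspect.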

COMPARISON WITH RELATIONAL ALGEBRA

Pig Latin:
- Everything is a bag.
- Dataflow language.
- FILTER is the same as the Select operator.

Relational algebra:
- Everything is a table.
- Dataflow language.
- The Select operator is the same as the FILTER command.

This note was uploaded on 12/08/2011 for the course CS 525 taught by Professor Gupta during the Spring '08 term at University of Illinois, Urbana Champaign.
