Lecture-database-inspired language called pig

Pig Latin: A Not-So-Foreign Language For Data Processing
Chris Olston, Benjamin Reed, Utkarsh Srivastava, Ravi Kumar, Andrew Tomkins
Research

Data Processing Renaissance
• Internet companies swimming in data
  • E.g. TBs/day at Yahoo!
• Data analysis is “inner loop” of product innovation
• Data analysts are skilled programmers

Data Warehousing …?
Scale: Often not scalable enough
$ $ $ $: Prohibitively expensive at web scale
  • Up to $200K/TB
SQL:
  • Little control over execution method
  • Query optimization is hard
    • Parallel environment
    • Little or no statistics
    • Lots of UDFs

New Systems For Data Analysis
• Map-Reduce
• Apache Hadoop
• Dryad
• . . .

Map-Reduce
[Figure: input records (k1, v1), (k2, v2), (k1, v3), (k2, v4), (k1, v5) pass through map tasks, are grouped by key into k1: v1, v3, v5 and k2: v2, v4, and then pass through reduce tasks to produce the output records.]
Just a group-by-aggregate?

The Map-Reduce Appeal
Scale: Scalable due to simpler design
  • Only parallelizable operations
  • No transactions
$: Runs on cheap commodity hardware
SQL: Procedural control, a processing “pipe”

Disadvantages
1. Extremely rigid data flow
   • Other flows constantly hacked in: join, union, split, chains
   [Figure: map (M) and reduce (R) stages arranged as join, union, split, and chain flows.]
2. Common operations must be coded by hand
   • Join, filter, projection, aggregates, sorting, distinct
3. Semantics hidden inside map-reduce functions
   • Difficult to maintain, extend, and optimize

Pros And Cons
Need a high-level, general data flow language

Enter Pig Latin
Need a high-level, general data flow language

Outline
• Map-Reduce and the need for Pig Latin
• Pig Latin example
• Salient features
• Implementation

Example Data Analysis Task

User    Url           Time
Amy     cnn.com       8:00
Amy     bbc.com       10:00
Amy     flickr.com    10:05
Fred    cnn.com       12:00

Find the top 10 most visited pages in each category...
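
The "Just a group-by-aggregate?" question on the Map-Reduce slide, and the point that common operations must be coded by hand, can be made concrete with a small sketch. The code below is an illustration only, not how Hadoop or Pig is implemented: `map_reduce`, `emit_url`, and `count` are hypothetical names, everything runs in memory on one machine, and the visit counting is a simpler stand-in for the slide's task (it ignores categories and the top-10 step).

```python
from collections import defaultdict


def map_reduce(records, map_fn, reduce_fn):
    """Single in-memory pass: map each record, group values by key, reduce each group."""
    groups = defaultdict(list)
    for record in records:
        for key, value in map_fn(record):
            groups[key].append(value)
    return {key: reduce_fn(key, values) for key, values in groups.items()}


# Records from the Map-Reduce slide: an identity map and a pass-through
# reduce expose the grouping that happens between the two phases.
figure_records = [("k1", "v1"), ("k2", "v2"), ("k1", "v3"), ("k2", "v4"), ("k1", "v5")]
grouped = map_reduce(figure_records, lambda kv: [kv], lambda key, values: values)

# Rows of the example visits table (User, Url, Time).
visits = [
    ("Amy", "cnn.com", "8:00"),
    ("Amy", "bbc.com", "10:00"),
    ("Amy", "flickr.com", "10:05"),
    ("Fred", "cnn.com", "12:00"),
]


def emit_url(visit):
    """Map step: emit (url, 1) for each visit."""
    _user, url, _time = visit
    yield url, 1


def count(_url, ones):
    """Reduce step: sum the 1s emitted for a single url."""
    return sum(ones)


visit_counts = map_reduce(visits, emit_url, count)
print(grouped)
print(visit_counts)
```

Even this per-URL count needs a hand-written map function and reduce function, and the grouping logic is invisible to anyone who reads only `emit_url` and `count`, which is the maintenance concern listed under Disadvantages.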