pig - Pig, a high level data processing system on Hadoop Is...

Pig, a high level data processing system on Hadoop

2 Is MapReduce not Good Enough?
- Restricted programming model
  - Only two phases
  - Job chain for long data flow
- Too many lines of code even for simple logic
  - How many lines do you have for word count?
- Programmers are responsible for this

3 Pig to the Rescue
- High level dataflow language (Pig Latin)
  - Much simpler than Java
- Simplifies the data processing
  - Puts the operations at the appropriate phases
  - Chains multiple MR jobs
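To make the word-count comparison concrete, here is a minimal Pig Latin sketch. The input and output paths are hypothetical, and the use of TextLoader (so that each line is read as a single field) is an assumption of this sketch, not something taken from the slides.

```pig
-- Word count sketch. 'input/text.txt' and 'output/wordcount' are hypothetical paths.
-- TextLoader reads each line as one chararray field.
lines   = LOAD 'input/text.txt' USING TextLoader() AS (line:chararray);
-- TOKENIZE splits a line into a bag of words; FLATTEN turns the bag into one row per word.
words   = FOREACH lines GENERATE FLATTEN(TOKENIZE(line)) AS word;
-- Group identical words together, then count the rows in each group.
grouped = GROUP words BY word;
counts  = FOREACH grouped GENERATE group AS word, COUNT(words) AS total;
STORE counts INTO 'output/wordcount';
```

Five statements cover what usually takes a mapper class, a reducer class, and a driver in Java MapReduce.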

4 How Pig is used in the Industry
- At Yahoo, 70% of MapReduce jobs are written in Pig
- Used to
  - Process web logs
  - Build user behavior models
  - Process images
  - Data mining
- Also used by Twitter, LinkedIn, eBay, AOL, ...

5 Motivation by Example
- Suppose we have user data in one file and website data in another file.
- We need to find the top 5 most visited pages by users aged 18-25.

6 In MapReduce

7 In Pig Latin
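The code on this slide is not part of the preview text. As a stand-in, the following sketch shows how the top-5 pages task from the motivation slide might be written in Pig Latin. File names, relation names, and field names are all assumptions made for illustration.

```pig
-- Sketch of "top 5 most visited pages by users aged 18-25".
-- 'users', 'pages', 'top5sites' and all field names are hypothetical.
Users = LOAD 'users' AS (name:chararray, age:int);
Fltrd = FILTER Users BY age >= 18 AND age <= 25;
Pages = LOAD 'pages' AS (user:chararray, url:chararray);
-- Keep only page visits made by users in the age range.
Jnd   = JOIN Fltrd BY name, Pages BY user;
-- Count visits per URL.
Grpd  = GROUP Jnd BY url;
Smmd  = FOREACH Grpd GENERATE group AS url, COUNT(Jnd) AS clicks;
-- Sort by visit count and keep the five largest.
Srtd  = ORDER Smmd BY clicks DESC;
Top5  = LIMIT Srtd 5;
STORE Top5 INTO 'top5sites';
```

Each statement names one stage of the data flow; Pig decides how the filter, join, grouping, and sort are placed into map and reduce phases and how many MapReduce jobs are chained.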

8 Pig runs over Hadoop

9 Wait a minute
- How to map the data to records
  - By default, one line → one record
  - User can customize the loading process
- How to identify attributes and map them to the schema
  - Delimiter to separate different attributes
  - By default, delimiter is tab. Customizable.
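The loading defaults above can be sketched with the built-in PigStorage and TextLoader load functions. The file paths and field names below are hypothetical.

```pig
-- Default loader: one line per record, fields separated by tabs.
users_tab = LOAD 'input/users.tsv' AS (name:chararray, age:int);

-- Same loader with a custom delimiter: fields separated by commas.
users_csv = LOAD 'input/users.csv' USING PigStorage(',') AS (name:chararray, age:int);

-- A different load function: each line becomes a single chararray field.
raw_lines = LOAD 'input/users.csv' USING TextLoader() AS (line:chararray);
```

For formats that these built-in functions cannot handle, the loading process can be customized further by writing a user-defined load function.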

10 MapReduce Vs. Pig cont.
- Join in MapReduce
  - Various algorithms. None of them are easy to implement in MapReduce
  - Multi-way join is more complicated
  - Hard to integrate into an SPJA (select-project-join-aggregate) workflow

11 MapReduce Vs. Pig cont.
- Join in Pig
  - Various algorithms are already available
  - Some of them are generic enough to support multi-way join
  - No need to consider integration into the SPJA workflow. Pig does that for you!

    A = LOAD 'input/join/A';
    B = LOAD 'input/join/B';
    C = JOIN A BY $0, B BY $1;
    DUMP C;
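As a sketch of the points about multi-way joins and ready-made join algorithms, the example below extends the two-input join with a third input and with a replicated join. The third input path 'input/join/D' and the relation names ABD and AB_rep are hypothetical.

```pig
A = LOAD 'input/join/A';
B = LOAD 'input/join/B';
D = LOAD 'input/join/D';   -- hypothetical third input

-- Multi-way join: three relations joined in a single statement.
ABD = JOIN A BY $0, B BY $1, D BY $0;

-- Replicated join: relations listed after the first are loaded into memory,
-- so this only suits cases where B is small.
AB_rep = JOIN A BY $0, B BY $1 USING 'replicated';

DUMP ABD;
```

The USING clause selects the join algorithm; without it Pig uses its default reduce-side join. Pig also accepts 'skewed' and 'merge' as join strategies.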

12 Pig Latin
- Data flow language
  - Users specify a sequence of operations to process data
  - More control over the process than with a declarative language
- Various data types are supported
- Schema is supported
- User-defined functions are supported

13 Statement
- A statement represents an operation, or a stage in the data flow
- Usually a variable is used to represent the result of the statement
- Statements are not limited to data processing operations; they also include filesystem operations
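A short script can illustrate both kinds of statement. This is a sketch under assumptions: the paths and field names are hypothetical, and it relies on Pig's shell commands (rmf, fs) being usable inside a script.

```pig
-- Filesystem operation: remove the output directory if it already exists.
rmf output/adults

-- Data processing statements: each result is bound to a variable (an alias).
users  = LOAD 'input/users.txt' AS (name:chararray, age:int);
adults = FILTER users BY age >= 18;
STORE adults INTO 'output/adults';

-- Filesystem operation: list the files that were written.
fs -ls output/adults
```

The aliases users and adults each stand for the result of one stage, so later statements can build on earlier ones by name.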

14 Schema
- User can optionally define the schema of the input data
- Once the schema of the source data is given, the schema of each intermediate relation is inferred by Pig

15 Schema cont.
- Why schema?
  - Scripts are more readable (by alias)
  - Helps the system validate the input
- Similar to a database?
  - Yes. But schema here is optional
  - Schema is not fixed for a particular dataset, but changeable
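The schema points above can be sketched as follows. The file path, relation names, and field names are hypothetical; the same file is loaded three times to show that a schema is optional and is not tied to the dataset.

```pig
-- With a schema: fields have names and types, and can be referenced by alias.
users  = LOAD 'input/users.txt' AS (name:chararray, age:int);
adults = FILTER users BY age >= 18;
DESCRIBE adults;    -- the schema of adults is inferred from users

-- Without a schema: fields are referenced by position ($0, $1, ...)
-- and cast explicitly where a type is needed.
raw     = LOAD 'input/users.txt';
adults2 = FILTER raw BY (int)$1 >= 18;
DESCRIBE raw;       -- Pig reports that the schema is unknown

-- Same dataset, different schema: here age is read as text instead of a number.
users_text = LOAD 'input/users.txt' AS (name:chararray, age:chararray);
```

The first version reads most clearly, which is the readability argument for declaring a schema even though Pig does not require one.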

16 Schema cont. Schema 1

This document was uploaded on 01/17/2012.

