gates - Pig Optimization and Execution Alan F. Gates...

Pig Optimization and Execution
Alan F. Gates, @alanfgates
© Hortonworks Inc. 2011

Who Am I?
• Pig committer and PMC Member
• HCatalog committer and mentor
• Member of ASF and Incubator PMC
• Co-founder of Hortonworks
• Author of Programming Pig from O’Reilly
Photo credit: Steven Guarnaccia, The Three Little Pigs

Who Are You?

What Should We Optimize?
• Minimize scans – Hadoop is still often I/O bound
• Minimize total number of MR jobs
• Minimize shuffle size and number of shuffles
• Avoid spills to disk
• Reduce or remove skew
• For small jobs, minimize start-up time

Pig Deployment
[Diagram: User machine, Hadoop Cluster]
• Pig resides on user machine or gateway
• Job executes on cluster
• No server; all optimization and planning done on the launching machine

Pig Guts (i.e. Pig Architecture), p. 1

Pig Latin:

    A = LOAD 'myfile' AS (x, y, z);
    B = GROUP A BY x;
    C = FILTER B BY group > 0;
    D = FOREACH C GENERATE group, COUNT(A);
    STORE D INTO 'output';

[Diagram labels: Pig Latin; AST; Logical Plan with operators Load, Group, Filter, Foreach, Store]
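
To make the Logical Plan on this slide concrete, here is a minimal Python sketch of the five-statement script as a chain of operators, each pointing at its input. It is a toy model for illustration only: the `Operator` class and the `build_plan` and `linearize` helpers are hypothetical names, not Pig's actual classes or API, and the plan is built by hand rather than parsed from Pig Latin.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Operator:
    """One node of a toy logical plan (hypothetical; not Pig's real classes)."""
    kind: str                      # operator name from the slide, e.g. "Load"
    alias: Optional[str]           # Pig Latin alias the statement defines, if any
    detail: str = ""               # the rest of the statement, kept as plain text
    inputs: List["Operator"] = field(default_factory=list)


def build_plan() -> Operator:
    """Hand-build the plan for the script on the slide and return its sink."""
    load = Operator("Load", "A", "'myfile' AS (x, y, z)")
    group = Operator("Group", "B", "A BY x", [load])
    filt = Operator("Filter", "C", "B BY group > 0", [group])
    foreach = Operator("Foreach", "D", "C GENERATE group, COUNT(A)", [filt])
    return Operator("Store", None, "D INTO 'output'", [foreach])


def linearize(sink: Operator) -> List[str]:
    """Return operator kinds in execution order (inputs before consumers)."""
    order: List[str] = []

    def visit(op: Operator) -> None:
        for parent in op.inputs:   # emit everything this operator reads first
            visit(parent)
        order.append(op.kind)

    visit(sink)
    return order


if __name__ == "__main__":
    print(" -> ".join(linearize(build_plan())))
    # prints: Load -> Group -> Filter -> Foreach -> Store
```

Storing inputs as a list rather than a single parent is a deliberate choice in this sketch: it leaves room for operators that read more than one relation, even though this script is a straight line.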