6-PigAdvanced - Joins

Joins - join implementations

- Pig's join starts up Pig's default implementation of join.
- Different implementations of join are available, exploiting knowledge about the data.
- In RDBMS systems, the SQL optimiser chooses the "right" join implementation automatically.
- In Pig, the user is expected to make the choice (the user knows best what the data looks like).
- The keyword using picks the implementation of choice.

Pig Operations - Joining

- Merge Join: sets are pre-sorted by the join key
- Replicated Join: one set is very large, while the other sets are small enough to fit into memory
- Skewed Join: when a large number of records for some values of the join key is expected
- Regular Join

Join implementations - small to large

- Scenario: lookup in a smaller input, e.g. translate postcode to place name
  - Germany has >80M people, but only ~30,000 postcodes
- Small data set usually fits into memory
- Reduce phase is unnecessary
- More efficient to send the smaller data set (e.g. zipcode-town file) to every data node, load it into memory and join by streaming the large data set through the Mapper
- Called fragment-replicate join in Pig, keyword replicated
- replicated can be used with more than two tables. The first is used as input to the Mapper, the rest is in memory.

    grunt> jnd = join X1 by (y1,z1), X2 by (y2,z2) using 'replicated';

Join implementations - skewed data

- Skew in the number of records per key (e.g. words in a text, links on the Web).
- Remember Zipf's law!
- Default join implementation is sensitive to skew: all records with the same key are sent to the same reducer
- Pig's solution: skew join
  1. Input for the join is sampled
  2. Keys are identified with too many records attached
  3. Join happens in a second Hadoop job
     1. Standard join for all "normal" keys (a single key ends up in a reducer)
     2. Skewed keys distributed over reducers (split so that each part fits in memory)

    users(name,city)
    city_info(city,population)
    jnd = join city_info by city, users by city using 'skewed';

- Data set contains:
  - 20 users from Delft
  - 100,000 users from New York
  - 350 users from Amsterdam
- A reducer can deal with 75,000 records in memory.
- User records with key 'New York' are split across 2 reducers. Records from city_info with key 'New York' are duplicated and sent to both reducers.

- In general:
  - Pig samples from the second input to the join and splits records with the same key if necessary
  - The first input has records with those values replicated across reducers
  - Implementation optimised for one of the inputs being skewed
- Skew join can be done on inner and outer joins.
- Skew join can only take two inputs; multiway joins need to be broken up into a series of two-way joins.
- Same caveat as order: it breaks the MapReduce convention of one key = one reducer. Consequence: the same key is distributed across different part-r-* files.

Join implementations - sorted data

- Sort-merge join (database join strategy): first sort both inputs and then walk through both inputs together
- Not faster in MapReduce than the default join, as a sort requires one MapReduce job (as does the join)
- Pig's merge join can be used if both inputs are already sorted on the join key
- No reduce phase required
- Keyword is merge

    grunt> jnd = join sorted_X1 by y1, sorted_X2 by y2 using 'merge';

- But: files are split into blocks, distributed in the cluster
  1. First MapReduce job samples from sorted_X2: the job builds an index of each input split and the value of its first (sorted) record
  2. Second MapReduce job reads over sorted_X1: the first record of the input split is looked up in the index built in step (1) and sorted_X2 is opened at the right block
     - No further lookups in the index; both "record pointers" are advanced alternately ...
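
The two-step merge join described above can be sketched in plain Python, run in memory rather than on Hadoop. This is only an illustration under stated assumptions, not Pig's actual implementation: the names build_index and merge_join, the (key, value) tuple layout, the requirement that every split is non-empty, and the choice to start one split before the index hit (so a run of equal keys that straddles a split boundary is not missed) are all hypothetical.

```python
from bisect import bisect_left


def build_index(right_splits):
    """Step 1: record the first (smallest) join key of every split.

    Assumes each split is a non-empty list of (key, value) tuples and that
    the splits, read in order, are sorted on the key.
    """
    return [split[0][0] for split in right_splits]


def merge_join(left_split, right_splits, index):
    """Step 2: join one sorted left split against the sorted right input."""
    if not left_split or not right_splits:
        return []

    # The only index lookup: find where the first left key would fall, then
    # step back one split in case equal keys begin at the end of that split.
    start = max(bisect_left(index, left_split[0][0]) - 1, 0)

    # A real job would stream these records; the sketch just flattens them.
    right = [rec for split in right_splits[start:] for rec in split]

    joined, i, j = [], 0, 0
    while i < len(left_split) and j < len(right):
        lkey, rkey = left_split[i][0], right[j][0]
        if lkey < rkey:
            i += 1  # advance the left record pointer
        elif lkey > rkey:
            j += 1  # advance the right record pointer
        else:
            # Find the run of right records sharing this key ...
            j_end = j
            while j_end < len(right) and right[j_end][0] == lkey:
                j_end += 1
            # ... and pair it with every left record sharing the key.
            while i < len(left_split) and left_split[i][0] == lkey:
                for _, rval in right[j:j_end]:
                    joined.append((lkey, left_split[i][1], rval))
                i += 1
            j = j_end
    return joined


# Hypothetical data: sorted_X2 stored as three sorted splits.
right_splits = [
    [(1, "a"), (3, "b")],
    [(3, "c"), (5, "d")],
    [(7, "e"), (9, "f")],
]
index = build_index(right_splits)
result = merge_join([(3, "x"), (5, "y"), (8, "z")], right_splits, index)
print(result)
```

Here key 3 spans the first two splits of the right input, which is why the sketch opens the right input one split earlier than the index hit; after that single lookup, only the two record pointers move.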
  • Spring '16