Yes but with certain restriction on the relation size Ideas Sorting Sort

Yes but with certain restriction on the relation size

This preview shows page 356 - 373 out of 676 pages.

– Yes, but with certain restriction on the relation size
Background image
Ideas • Sorting – Sort relation(s) into runs – Perform the needed operation while merging the runs • Hashing – Hash relation(s) into buckets – Only need to examine a bucket or a pair of bucket at a time 34
Background image
Duplicate Elimination δ(R) Based on Sorting
Background image
36 Grouping: γcity, sum(price)(R) Based on Sorting
Background image
37 Binary operations: R ∩ S, R U S, R – S Based on Sorting Can you figure out all the cases (intersection, …)?
Background image
Problem with join A large number of tuples with the same value on the join attribute(s) But buffer can not hold all joining tuples (with the same value on join attribute) for at least one relation 38
Background image
Problem with join Sorted lists a a a a a a a a a a a Disk Main memory buffers Join (a,a) (a,a) (a,a) (a,a) Output buffer 39 R.A S.A Many tuples may have the same value on the join attribute Remember the tuple may have other attributes than a.
Background image
Sort-Merge Join 40
Background image
Example Suppose M = 101 blocks (i.e., pages), B(R) = 1,000 blocks, B(S) = 5,000 blocks – Suppose we use 100 blocks in sorting Cost of R S using sort-merge join algorithm What if B(S) = 50,000 blocks? 41
Background image
42 Simple Sort-based Join Start by completely sorting both R and S on the join attribute (assuming this can be done in 2 passes): Cost: 4B(R)+4B(S) (because we need to write result to disk) Read both relations in sorted order, match tuples Cost: B(R)+B(S) Can use as many buffers as possible to load join tuples from one relation (with the same join value), say R Only one buffer is needed for the other relation, say S If we still can not fit all join tuples from R Need to use nested loop algorithm, higher cost
Background image
Simple Sort-based Join Total cost: 5B(R)+5B(S) Assumption: B(R)<= M2, B(S)<= M2, and at least one set of the tuples with a common value for the join attributes fit in M (or M-2 to be exact) – Note that we only need one page buffer for the other relation 43
Background image
Example Suppose M = 101 blocks (i.e., pages), B(R) = 1,000 blocks, B(S) = 5,000 blocks – Assume that we use 100 blocks in sorting Cost of R S using simple sort-based join algorithm What if B(S) = 50,000 blocks? 44
Background image
Two-Pass Algorithms Based on Hashing 45
Background image
Hashing-Based Algorithms Hash all the tuples of input relations using an appropriate hash key such that: – All the tuples that need to be considered together to perform an operation go to the same bucket Reduce the size of input relations by a factor of M Perform the operation by working on a bucket (or a pair of buckets for binary operations) at a time – Apply a one-pass algorithm for the operation 46
Background image
Sorting vs. Hashing 47 Sorting . . . B/M runs M Hashing . . . M buckets B/M "Partitioning" picture Size of run Size of bucket
Background image
Hashing-Based Algorithm for δ • Recall: δ(R) = duplicate elimination Step 1. Partition R into (M-1) buckets Step 2. Apply δto each bucket (must read it into main memory) Cost: 3B(R) Assumption: B(R) <= M2 – To be more exact: B(R)/(M-1) <= M-2 48
Background image
Two-Pass Duplicate Elimination Based on Hashing
Background image
Image of page 373

You've reached the end of your free preview.

Want to read all 676 pages?

  • Fall '14

  • Left Quote Icon

    Student Picture

  • Left Quote Icon

    Student Picture

  • Left Quote Icon

    Student Picture

Stuck? We have tutors online 24/7 who can help you get unstuck.
A+ icon
Ask Expert Tutors You can ask You can ask You can ask (will expire )
Answers in as fast as 15 minutes