– Yes, but with certain restriction on the relation size
Ideas • Sorting – Sort relation(s) into runs – Perform the needed operation while merging the runs • Hashing – Hash relation(s) into buckets – Only need to examine a bucket or a pair of bucket at a time 34
Duplicate Elimination δ(R) Based on Sorting
36 Grouping: γcity, sum(price)(R) Based on Sorting
37 Binary operations: R ∩ S, R U S, R – S Based on Sorting Can you figure out all the cases (intersection, …)?
Problem with join •A large number of tuples with the same value on the join attribute(s) •But buffer can not hold all joining tuples (with the same value on join attribute) for at least one relation 38
Problem with join Sorted lists a a a a a a a a a a a Disk Main memory buffers Join (a,a) (a,a) (a,a) (a,a) … Output buffer 39 R.A S.A Many tuples may have the same value on the join attribute Remember the tuple may have other attributes than a.
Sort-Merge Join 40
Example •Suppose M = 101 blocks (i.e., pages), B(R) = 1,000 blocks, B(S) = 5,000 blocks – Suppose we use 100 blocks in sorting •Cost of R S using sort-merge join algorithm •What if B(S) = 50,000 blocks? 41 ⋈
42 Simple Sort-based Join •Start by completely sorting both R and S on the join attribute (assuming this can be done in 2 passes): –Cost: 4B(R)+4B(S) (because we need to write result to disk) •Read both relations in sorted order, match tuples –Cost: B(R)+B(S) •Can use as many buffers as possible to load join tuples from one relation (with the same join value), say R –Only one buffer is needed for the other relation, say S •If we still can not fit all join tuples from R –Need to use nested loop algorithm, higher cost
Simple Sort-based Join •Total cost: 5B(R)+5B(S) •Assumption: B(R)<= M2, B(S)<= M2, and at least one set of the tuples with a common value for the join attributes fit in M (or M-2 to be exact) – Note that we only need one page buffer for the other relation 43
Example •Suppose M = 101 blocks (i.e., pages), B(R) = 1,000 blocks, B(S) = 5,000 blocks – Assume that we use 100 blocks in sorting •Cost of R S using simple sort-based join algorithm •What if B(S) = 50,000 blocks? 44 ⋈
Two-Pass Algorithms Based on Hashing 45
Hashing-Based Algorithms •Hash all the tuples of input relations using an appropriate hash key such that: – All the tuples that need to be considered together to perform an operation go to the same bucket •Reduce the size of input relations by a factor of M •Perform the operation by working on a bucket (or a pair of buckets for binary operations) at a time – Apply a one-pass algorithm for the operation 46
Sorting vs. Hashing 47 Sorting . . . B/M runs M Hashing . . . M buckets B/M "Partitioning" picture Size of run Size of bucket
Hashing-Based Algorithm for δ • Recall: δ(R) = duplicate elimination •Step 1. Partition R into (M-1) buckets •Step 2. Apply δto each bucket (must read it into main memory) •Cost: 3B(R) •Assumption: B(R) <= M2 – To be more exact: B(R)/(M-1) <= M-2 48
Two-Pass Duplicate Elimination Based on Hashing •
You've reached the end of your free preview.
Want to read all 676 pages?
- Fall '14