that satisfy these candidate pair conditions:
1. Both i and j are frequent items.
2. Using the first hash function, the pair hashes to a bucket whose bit in the first bitvector is 1.
3. Using the second hash function, the pair hashes to a bucket whose bit in the second bitvector is 1.

1/5/2011 Jure Leskovec, Stanford C246: Mining Massive Datasets

1. The two hash functions have to be independent.
2. We need to check both hashes on the third pass:
   If not, we would wind up counting pairs of frequent items that hashed first to an infrequent bucket but happened to hash second to a frequent bucket.

Key idea: Use several independent hash tables on the first pass.
Risk: Halving the number of buckets doubles the average count.
We have to be sure most buckets will still not reach count s.
If so, we can get a benefit like multistage,
but in only 2 passes.

[Figure: Pass 1 and Pass 2. Labels: Item counts, Freq. items, Bitmap 1, First hash table, Second hash table, Bitmap 2, Counts of candidate pairs.]

APriori, PCY, etc., take k passes to find
frequent itemsets of size k.
Can we use fewer passes?
Use 2 or fewer passes for all sizes,
but may miss some frequent itemset...
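
Returning to the two hash-based schemes described earlier, the following is a minimal Python sketch, not the course's reference implementation. It assumes baskets are lists of integer items, that a pair is frequent when its count reaches s, and that the caller supplies two hash functions h1 and h2 (hypothetical; any two independent functions on pairs would do). All function names are illustrative. Both schemes share the same final test (both items frequent, both bitmap bits set) and differ only in how the second bitmap is built: on a separate second pass over surviving pairs (multistage), or alongside the first on pass 1 with half the buckets each (multihash).

```python
from collections import Counter
from itertools import combinations


def frequent_items(baskets, s):
    """Items that appear in at least s baskets."""
    counts = Counter(item for basket in baskets for item in set(basket))
    return {item for item, c in counts.items() if c >= s}


def pairs_of(basket):
    """All pairs (i, j) with i < j from one basket."""
    return combinations(sorted(set(basket)), 2)


def bitmap(baskets, h, n_buckets, s, keep=lambda pair: True):
    """Hash each pair accepted by `keep`; a bit is True where the bucket count >= s."""
    counts = [0] * n_buckets
    for basket in baskets:
        for pair in pairs_of(basket):
            if keep(pair):
                counts[h(pair) % n_buckets] += 1
    return [c >= s for c in counts]


def count_candidates(baskets, s, freq, h1, bits1, h2, bits2):
    """Final pass: count only pairs that satisfy all three conditions."""
    def is_candidate(pair):
        i, j = pair
        return (i in freq and j in freq
                and bits1[h1(pair) % len(bits1)]
                and bits2[h2(pair) % len(bits2)])

    counts = Counter(p for b in baskets for p in pairs_of(b) if is_candidate(p))
    return {p for p, c in counts.items() if c >= s}


def multistage(baskets, s, h1, h2, n_buckets):
    # Pass 1: item counts and first hash table (separate loops here for clarity).
    freq = frequent_items(baskets, s)
    bits1 = bitmap(baskets, h1, n_buckets, s)

    # Pass 2: rehash only pairs of frequent items whose first bucket is frequent.
    def survives(pair):
        return (pair[0] in freq and pair[1] in freq
                and bits1[h1(pair) % n_buckets])

    bits2 = bitmap(baskets, h2, n_buckets, s, keep=survives)
    # Pass 3: check BOTH hashes again before counting a pair.
    return count_candidates(baskets, s, freq, h1, bits1, h2, bits2)


def multihash(baskets, s, h1, h2, n_buckets):
    # Pass 1: two hash tables share the memory of one, so each gets half the buckets.
    half = max(1, n_buckets // 2)
    freq = frequent_items(baskets, s)
    bits1 = bitmap(baskets, h1, half, s)
    bits2 = bitmap(baskets, h2, half, s)
    # Pass 2: same three-condition test as multistage.
    return count_candidates(baskets, s, freq, h1, bits1, h2, bits2)


if __name__ == "__main__":
    baskets = [[1, 2, 3], [1, 2, 4], [1, 2], [2, 3], [3, 4], [1, 3, 5]]
    h1 = lambda p: p[0] * 31 + p[1]           # illustrative hash functions only
    h2 = lambda p: p[0] * 17 + p[1] * 13 + 7
    print(multistage(baskets, 3, h1, h2, 8))  # {(1, 2)}
    print(multihash(baskets, 3, h1, h2, 8))   # {(1, 2)}
```

Because the final pass counts every candidate exactly, and a truly frequent pair always lands in frequent buckets, both functions return the same frequent pairs whatever hash functions are used. The hash tables only affect how many candidates must be counted, which is where halving the bucket count in multihash can hurt.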
This document was uploaded on 02/26/2014 for the course CS 246 at Stanford.