02-assoc




1/5/2011 Jure Leskovec, Stanford CS246: Mining Massive Datasets

... that satisfy these candidate pair conditions:
1. Both i and j are frequent items.
2. Using the first hash function, the pair hashes to a bucket whose bit in the first bit-vector is 1.
3. Using the second hash function, the pair hashes to a bucket whose bit in the second bit-vector is 1.

1. The two hash functions have to be independent.
2. We need to check both hashes on the third pass. If not, we would wind up counting pairs of frequent items that hashed first to an infrequent bucket but happened to hash second to a frequent bucket.

Key idea: Use several independent hash tables on the first pass.
Risk: Halving the number of buckets doubles the average count. We have to be sure most buckets will still not reach count s.
If so, we can get a benefit like multistage, but in only 2 passes.

[Figure: main-memory layout. Pass 1: item counts, first hash table, second hash table. Pass 2: frequent items, Bitmap 1, Bitmap 2, counts of candidate pairs.]

A-Priori, PCY, etc., take k passes to find frequent itemsets of size k.
Can we use fewer passes?
Use 2 or fewer passes for all sizes, but may miss some frequent itemset...
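The two-pass scheme with several hash tables described above can be illustrated with a minimal Python sketch. It is not taken from the slides: the function name multihash_frequent_pairs, the parameters s, n_buckets and hash_fns, and the two toy hash functions h1 and h2 are all assumptions made for illustration. In particular, h1 and h2 are simple arithmetic hashes and are not claimed to be truly independent. Pass 2 counts a pair only if both items are frequent and the pair's bucket is frequent in every hash table, which mirrors the candidate pair conditions listed earlier.

```python
from collections import Counter
from itertools import combinations


def multihash_frequent_pairs(baskets, s, n_buckets, hash_fns):
    """Return {pair: count} for pairs with count >= s (illustrative sketch).

    baskets must be re-iterable (e.g. a list), because it is scanned twice.
    hash_fns is a list of functions mapping a pair (i, j) to an integer.
    """
    # Pass 1: count single items and, for each pair in each basket,
    # increment one bucket in every hash table.
    item_counts = Counter()
    tables = [[0] * n_buckets for _ in hash_fns]
    for basket in baskets:
        items = sorted(set(basket))
        item_counts.update(items)
        for pair in combinations(items, 2):
            for table, h in zip(tables, hash_fns):
                table[h(pair) % n_buckets] += 1

    # Between passes: keep the frequent items and one bitmap per table.
    frequent_items = {i for i, c in item_counts.items() if c >= s}
    bitmaps = [[c >= s for c in table] for table in tables]

    # Pass 2: count only candidate pairs, i.e. pairs of frequent items
    # whose bucket bit is set in every bitmap.
    pair_counts = Counter()
    for basket in baskets:
        items = sorted(set(basket) & frequent_items)
        for pair in combinations(items, 2):
            if all(bm[h(pair) % n_buckets] for bm, h in zip(bitmaps, hash_fns)):
                pair_counts[pair] += 1

    return {p: c for p, c in pair_counts.items() if c >= s}


# Hypothetical hash functions for integer item IDs (illustration only).
def h1(pair):
    return 31 * pair[0] + pair[1]


def h2(pair):
    return 17 * pair[0] + 13 * pair[1] + 7


baskets = [[1, 2, 3], [1, 2], [1, 2, 4], [3, 4], [1, 3]]
result = multihash_frequent_pairs(baskets, s=3, n_buckets=5, hash_fns=[h1, h2])
print(result)  # {(1, 2): 3}
```

The final filter on count >= s means the answer is exact whatever hash functions are used. The hash tables only reduce how many pairs need a counter in the second pass.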

This document was uploaded on 02/26/2014 for the course CS 246 at Stanford.
