02-assoc

The bit vector requires 1/32 of memory. Also decide...


... pair:
1. Both i and j are frequent items.
2. The pair {i, j} hashes to a bucket whose bit in the bit vector is 1 (i.e., frequent bucket).

Both conditions are necessary for the pair to have a chance of being frequent.

Jure Leskovec, Stanford C246: Mining Massive Datasets, 1/5/2011

[Figure: main-memory layout for PCY. Pass 1: item counts, hash table. Pass 2: frequent items, bitmap, counts of candidate pairs.]

Buckets require a few bytes each. Note: we don't have to count past s.
#buckets is O(main-memory size).
On second pass, a table of (item, item, count) triples is essential (why?).
Hash table must eliminate approx. 2/3 of the candidate pairs for PCY to beat a-priori.

Limit the number of candidates to be counted. Remember: memory is the bottleneck.
Still need to generate all the itemsets, but we only want to count/keep track of the ones that are frequent.
Key idea: After Pass 1 of PCY, rehash only those pairs that qualify for Pass 2 of PCY.
On middle pass, fewer pairs contribute to buckets, so fewer false positives (frequent buckets with no frequent pair).
Requires 3 passes over the data.

[Figure: main-memory layout for the 3-pass variant. Pass 1: item counts, first hash table. Pass 2: freq. items, bitmap 1, second hash table. Pass 3: freq. items, bitmap 1, bitmap 2, counts of candidate pairs.]

Count only those pairs {i, j}...
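The passes described above can be sketched in a few lines of Python. This is a minimal in-memory illustration, not the course's reference implementation. The names `bucket`, `bump`, and `frequent_pairs`, the salted use of Python's built-in `hash`, and the convention that "frequent" means a count of at least s are all assumptions made for the example. With `multistage=True` it adds the middle pass, which rehashes only the pairs that would qualify for Pass 2 of PCY.

```python
from collections import Counter
from itertools import combinations


def bucket(pair, n_buckets, salt):
    # Hypothetical hash function; a different salt gives a different table.
    return hash((salt, pair)) % n_buckets


def bump(table, b, s):
    # Buckets never need to count past the support threshold s.
    if table[b] < s:
        table[b] += 1


def frequent_pairs(baskets, s, n_buckets, multistage=False):
    # Pass 1: count items and hash every pair into the first table.
    item_counts = Counter()
    table1 = [0] * n_buckets
    for basket in baskets:
        items = sorted(set(basket))
        item_counts.update(items)
        for pair in combinations(items, 2):
            bump(table1, bucket(pair, n_buckets, salt=1), s)

    # Between passes: keep frequent items, compress buckets to one bit each.
    frequent = {i for i, c in item_counts.items() if c >= s}
    bitmaps = [(1, [c >= s for c in table1])]

    def candidate_pairs(basket):
        # Condition 1: both items are frequent.
        items = sorted(set(basket) & frequent)
        for pair in combinations(items, 2):
            # Condition 2: the pair's bit is 1 in every bitmap built so far.
            if all(bm[bucket(pair, n_buckets, salt)] for salt, bm in bitmaps):
                yield pair

    if multistage:
        # Middle pass: rehash only pairs that qualify for Pass 2 of PCY.
        table2 = [0] * n_buckets
        for basket in baskets:
            for pair in candidate_pairs(basket):
                bump(table2, bucket(pair, n_buckets, salt=2), s)
        bitmaps.append((2, [c >= s for c in table2]))

    # Final pass: count only the candidate pairs, then apply the threshold.
    counts = Counter(p for basket in baskets for p in candidate_pairs(basket))
    return {p: c for p, c in counts.items() if c >= s}


baskets = [[1, 2, 3], [1, 2], [1, 2, 4], [3, 4], [2, 3]]
print(frequent_pairs(baskets, s=3, n_buckets=7))  # {(1, 2): 3}
```

The bitmaps can only produce false positives, never false negatives: a truly frequent pair contributes at least s to its own bucket, so its bit is always 1. The final threshold check removes any infrequent pair that slipped through a frequent bucket.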

This document was uploaded on 02/26/2014 for the course CS 246 at Stanford.
