02-assoc

[Figure labels: Items, bitmap 1, first hash table, second hash table]


...s
Random sampling
SON (Savasere, Omiecinski, and Navathe)
Toivonen (see textbook)

1/5/2011 Jure Leskovec, Stanford CS246: Mining Massive Datasets

Take a random sample of the market baskets.
Run A-Priori or one of its improvements in main memory, so we don't pay for disk I/O each time we increase the size of itemsets.
Reduce the support threshold proportionally to match the sample size.
[Figure: main memory holds a copy of the sample baskets plus space for counts.]

Optionally, verify that the candidate pairs are truly frequent in the entire data set by a second pass (this avoids false positives).
But you don't catch sets that are frequent in the whole data set but not in the sample.
A smaller threshold, e.g., s/125, helps catch more truly frequent itemsets, but requires more space.

Repeatedly read small subsets of the baskets into main memory and run an in-memory algorithm to find all frequent itemsets.
Note: we are not sampling, but processing the entire file in memory-sized chunks.
An itemset becomes a c...
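
The random-sampling approach described above (sample the baskets, scale the support threshold down, then optionally verify candidates with a second pass over the full data) can be sketched roughly as follows. This is a minimal illustration restricted to pairs, not the course's reference implementation; the function names `frequent_pairs` and `sample_then_verify` and the parameters `fraction`, `slack`, and `seed` are hypothetical and chosen only for this example.

```python
import random
from collections import Counter
from itertools import combinations


def frequent_pairs(baskets, threshold):
    """Count every item pair in memory; keep pairs with count >= threshold."""
    counts = Counter()
    for basket in baskets:
        # Sort so each pair has one canonical (a, b) form with a < b.
        for pair in combinations(sorted(set(basket)), 2):
            counts[pair] += 1
    return {pair for pair, c in counts.items() if c >= threshold}


def sample_then_verify(baskets, s, fraction, slack=1.0, seed=0):
    """Hypothetical helper: find candidate pairs on a sample, then verify.

    s        -- support threshold for the full data set
    fraction -- probability of keeping each basket in the sample
    slack    -- assumed extra factor (<= 1) that lowers the sample
                threshold further to miss fewer truly frequent pairs
    """
    rng = random.Random(seed)
    sample = [b for b in baskets if rng.random() < fraction]

    # Scale the threshold proportionally to the sample size.
    sample_threshold = s * fraction * slack
    candidates = frequent_pairs(sample, sample_threshold)

    # Second pass over the full data: count only the candidate pairs,
    # which removes false positives but cannot recover missed pairs.
    counts = Counter()
    for basket in baskets:
        items = set(basket)
        for a, b in candidates:
            if a in items and b in items:
                counts[(a, b)] += 1
    verified = {p for p in candidates if counts[p] >= s}
    return candidates, verified


if __name__ == "__main__":
    data = [["milk", "bread"], ["milk", "bread", "beer"],
            ["beer", "diapers"], ["milk", "bread"]] * 50
    cands, ok = sample_then_verify(data, s=150, fraction=0.5, slack=0.8)
    print(len(cands), "candidates;", len(ok), "verified")
```

Setting `slack` below 1 plays the role of the smaller threshold mentioned above: more pairs survive as candidates, so more counts must be kept in memory, but fewer truly frequent pairs are missed. The second pass only ever shrinks the candidate set.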
This document was uploaded on 02/26/2014 for the course CS 246 at Stanford.