Random sampling
SON (Savasere, Omiecinski, and Navathe)
Toivonen (see textbook)

1/5/2011 Jure Leskovec, Stanford C246: Mining Massive Datasets

Take a random sample of the market baskets
Run A-Priori or one of its improvements in main memory
  So we don’t pay for disk I/O each time we increase the size of itemsets
Reduce the support threshold proportionally to match the sample size
[Figure: main memory holding a copy of the sample]
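The sampling steps above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names `frequent_pairs` and `sample_frequent_pairs`, the `fraction` and `seed` parameters, and the restriction to pairs are all hypothetical, and a naive in-memory pair counter stands in for A-Priori.

```python
import random
from collections import Counter
from itertools import combinations


def frequent_pairs(baskets, threshold):
    """Count every pair in memory and keep those with count >= threshold.

    A naive stand-in for A-Priori, used here only to keep the sketch short.
    """
    counts = Counter()
    for basket in baskets:
        # Sort and deduplicate so each pair has one canonical form.
        for pair in combinations(sorted(set(basket)), 2):
            counts[pair] += 1
    return {pair for pair, count in counts.items() if count >= threshold}


def sample_frequent_pairs(baskets, support, fraction, seed=0):
    """Run the in-memory pass on a random sample of the baskets.

    Each basket is kept with probability `fraction` (an assumed sampling
    scheme), and the support threshold is reduced by the same factor.
    """
    rng = random.Random(seed)
    sample = [basket for basket in baskets if rng.random() < fraction]
    scaled_threshold = support * fraction
    return frequent_pairs(sample, scaled_threshold)
```

For example, with `fraction=0.01` and a support threshold `s`, the sample is searched with threshold `s/100`. Because the sample is random, the result can differ from the truly frequent pairs of the full data set.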
Optionally, verify that the candidate pairs are truly frequent in the entire data set by a second pass (avoid false positives)
But you don’t catch sets frequent in the whole but not in the sample
  Smaller threshold, e.g., s/125, helps catch more truly frequent itemsets
  But requires more space

Repeatedly read small subsets of the baskets into main memory and run an in-memory algorithm to find all frequent itemsets
  Note: we are not sampling, but processing the entire file in memory-sized chunks
An itemset becomes a c...
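The chunked processing described above can be sketched as follows. This is a minimal illustration, not the lecture's implementation: the names `read_chunks` and `chunk_frequent_pairs` are hypothetical, a naive pair counter stands in for the in-memory algorithm, and two steps are assumptions that the visible text does not state: scaling the threshold to each chunk's share of the data, and collecting the union of the per-chunk results.

```python
from collections import Counter
from itertools import combinations, islice


def frequent_pairs(baskets, threshold):
    """Naive in-memory pass: keep pairs whose count is >= threshold."""
    counts = Counter()
    for basket in baskets:
        for pair in combinations(sorted(set(basket)), 2):
            counts[pair] += 1
    return {pair for pair, count in counts.items() if count >= threshold}


def read_chunks(baskets, chunk_size):
    """Yield consecutive lists of at most `chunk_size` baskets."""
    iterator = iter(baskets)
    while True:
        chunk = list(islice(iterator, chunk_size))
        if not chunk:
            return
        yield chunk


def chunk_frequent_pairs(baskets, support, total, chunk_size):
    """Process the whole input chunk by chunk, with no sampling.

    Assumption: each chunk uses a threshold scaled to its share of the
    `total` baskets, and the per-chunk results are collected as a union.
    """
    found = set()
    for chunk in read_chunks(baskets, chunk_size):
        scaled_threshold = support * len(chunk) / total
        found |= frequent_pairs(chunk, scaled_threshold)
    return found
```

Unlike the sampling approach, every basket is read exactly once here, so only one chunk has to fit in main memory at a time.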
This document was uploaded on 02/26/2014 for the course CS 246 at Stanford.