02-assoc

# Frequent items so you know what must be counted



1/5/2011 Jure Leskovec, Stanford C246: Mining Massive Datasets

[Figure: panels "Pass 1" and "Pass 2"; surviving label fragment: "tems (candidate pairs)"]

You can use the triangular matrix method with n = number of frequent items.

- May save space compared with storing triples.
- Trick: re-number frequent items 1, 2, … and keep a table relating new numbers to original item numbers.

For each k, we construct two sets of k-tuples (sets of size k):

- $C_k$ = candidate k-tuples = those that might be frequent sets (support > s), based on information from the pass for k−1
- $L_k$ = the set of truly frequent k-tuples

[Figure: $C_1$ (all items) → Filter (count the items) → $L_1$ → Construct → $C_2$ (all pairs of items from $L_1$) → Filter (count the pairs) → $L_2$ → Construct (to be explained) → $C_3$]

- $C_1$ = all items
- $L_1$ = those counted on the first pass to be frequent
- $C_2$ = pairs, both elements of which are frequent (appear in $L_1$)
- $L_2$ = those in $C_2$ that are frequent (supp ≥ s)

In general:

- $C_k$ = k-tuples, each (k−1)-subset of which is in $L_{k-1}$
- $L_k$ = members of $C_k$ with support ≥ s

Cost of the approach:

- One pass for each k (itemset size).
- Needs room in main memory to count each candidate k-tuple.
- For typical market-basket data and reasonable support (e.g., 1%), k = 2 requires the most memory.

Observation: In pass 1 of a-priori, most memory is idle. We store only individua...
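The construct/filter loop over $C_k$ and $L_k$ described above can be sketched in a few lines of Python. This is a minimal illustration, not the course's implementation: the names `construct_candidates`, `apriori`, and `example_baskets` are hypothetical, the baskets are made-up toy data, items are assumed to be sortable (strings here), and everything is assumed to fit in memory. It also counts candidates with a plain dictionary rather than the triangular matrix mentioned above.

```python
from collections import Counter
from itertools import combinations


def construct_candidates(prev_frequent, k):
    """Build C_k: sorted k-tuples whose every (k-1)-subset is in L_{k-1}."""
    items = sorted({item for itemset in prev_frequent for item in itemset})
    return {
        cand
        for cand in combinations(items, k)
        if all(sub in prev_frequent for sub in combinations(cand, k - 1))
    }


def apriori(baskets, s):
    """Return {k: L_k}; each L_k maps a frequent k-tuple to its support."""
    # Pass 1: C_1 is all items. Count them, then filter by support to get L_1.
    counts = Counter((item,) for basket in baskets for item in set(basket))
    frequent = {t: c for t, c in counts.items() if c >= s}

    levels = {}
    k = 1
    while frequent:
        levels[k] = frequent
        k += 1
        # Construct C_k from L_{k-1}.
        candidates = construct_candidates(set(frequent), k)
        # Pass k: one scan of the baskets, counting only the candidates.
        counts = Counter()
        for basket in baskets:
            basket_items = set(basket)
            for cand in candidates:
                if basket_items.issuperset(cand):
                    counts[cand] += 1
        # Filter: L_k = members of C_k with support >= s.
        frequent = {t: c for t, c in counts.items() if c >= s}
    return levels


# Toy data (hypothetical), with support threshold s = 3.
example_baskets = [
    {"milk", "bread", "beer"},
    {"milk", "bread"},
    {"milk", "beer"},
    {"bread", "beer"},
    {"milk", "bread", "beer", "eggs"},
]
levels = apriori(example_baskets, s=3)
for size, itemsets in sorted(levels.items()):
    print(size, sorted(itemsets.items()))
```

On this toy input, "eggs" is dropped after pass 1, all three pairs of the remaining items survive pass 2, and the single candidate triple is counted in pass 3 but falls below the threshold, so the loop stops with $L_1$ and $L_2$ only.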

## This document was uploaded on 02/26/2014 for the course CS 246 at Stanford.
