16-streams

# 3/2/2011 Jure Leskovec, Stanford C246: Mining Massive Datasets



t; 0 = not present

Use DGIM to estimate counts of 1’s for all items.

3/2/2011 Jure Leskovec, Stanford C246: Mining Massive Datasets

In principle, you could count frequent pairs or even larger sets the same way: one stream per itemset. Drawbacks: only approximate; number of itemsets is way too big.

Exponentially decaying windows: a heuristic for selecting likely frequent itemsets. What are “currently” most popular movies? Instead of computing the raw count in the last N elements, compute a smooth aggregation over the whole stream.

If the stream is $a_1, a_2, \ldots$ and we are taking the sum of the stream, take the answer at time $t$ to be $\sum_{i=1}^{t} a_i e^{-c(t-i)}$ (or $\sum_{i=1}^{t} a_i (1-c)^{t-i}$). Here $c$ is a constant, presumably tiny, like $10^{-6}$ or $10^{-9}$.

If each $a_i$ is an “item”, we can compute the characteristic function of each possible item $x$ as an E.D.W. That is: $\sum_{i=1}^{t} \delta_i e^{-c(t-i)}$, where $\delta_i = 1$ if $a_i = x$, and 0 otherwise. Call this sum the “weight” of item $x$.

[Figure omitted; surviving label: 1/c]

Suppose we want to find those items of weight at least ½. Important property: the sum over all weights is $1/(1 - e^{-c})$, or very close to $1/[1 - (1 - c)] = 1/c$. Thus at most $2/c$ items have weight at least ½, since more than that would push the total above $1/c$.

Count (some) itemsets in an E.D.W. When a basket B comes in:

1. Multiply all counts by (1-c).
2. For uncounted items in B, create new count.
3. Add 1 to count of any item in B and to any counted itemset contained in B.
4. Drop counts < ½.
5. Initiate new counts (next slide).
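The item weights defined above do not require storing the stream: each arrival can decay every existing weight and then add 1 for the new item. The following is a minimal sketch of that idea, assuming the $(1-c)^{t-i}$ form of the decay; the function names `update_weights` and `direct_weight` and the sample stream are illustrative and not part of the original slides.

```python
def update_weights(weights, item, c):
    """Process one stream element: decay all weights by (1 - c), then add 1 for `item`."""
    for key in weights:
        weights[key] *= 1.0 - c
    weights[item] = weights.get(item, 0.0) + 1.0
    return weights


def direct_weight(stream, x, c):
    """Weight of x straight from the definition: sum of (1 - c)^(t - i) over arrivals of x."""
    t = len(stream)
    return sum((1.0 - c) ** (t - i) for i, a in enumerate(stream, start=1) if a == x)


# Hypothetical example stream and decay constant (c is far larger than in practice).
stream = ["a", "b", "a", "c", "a"]
c = 0.1
weights = {}
for element in stream:
    update_weights(weights, element, c)
```

The incremental weights agree with the direct sum, and their total stays below $1/c$. Note that this sketch touches every tracked item on each arrival, so it is only practical when low weights are pruned.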
* Informal proposal of Art Owen

Start a count for an itemset S ⊆ B if every proper subset of S had a count prior to arrival of basket B.

Example: Start counting {i, j} iff both i and j were counted prior to seeing B.

Example: Start counting {i, j, k} iff {i, j}, {i, k}, and {j, k} were all counted prior to seeing B.

Counts for single items < (2/c) times the average number of items in a basket. Counts for larger itemsets = ??. But we are conservative about starting counts of large sets: if we counted every set we saw, one basket of 20 items would initiate 1M counts.
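The per-basket procedure (decay, create, increment, drop, initiate) can be sketched as follows. This is one possible reading, not the slides' definitive implementation, and it rests on assumptions the source does not settle: a newly initiated itemset count starts at 1, "every proper subset" is checked through the immediate subsets of size |S| - 1 as in the two examples, and an itemset dropped while processing a basket is not restarted by that same basket. The name `process_basket` and its parameters are hypothetical.

```python
from itertools import combinations


def process_basket(counts, basket, c, threshold=0.5):
    """Update `counts` (dict: frozenset -> decayed count) in place for one basket."""
    basket = frozenset(basket)
    prior = set(counts)  # itemsets that had a count before this basket arrived

    # 1. Multiply all counts by (1 - c).
    for s in counts:
        counts[s] *= 1.0 - c

    # 2. Create a count for each item of the basket that has none.
    for item in basket:
        counts.setdefault(frozenset([item]), 0.0)

    # 3. Add 1 to every counted item or itemset contained in the basket.
    for s in counts:
        if s <= basket:
            counts[s] += 1.0

    # 4. Drop counts below the threshold.
    for s in [s for s, value in counts.items() if value < threshold]:
        del counts[s]

    # 5. Initiate counts for larger itemsets, one size at a time.
    k = 2
    while True:
        smaller = [s for s in prior if len(s) == k - 1 and s <= basket]
        if len(smaller) < k:  # a k-set needs k counted subsets of size k - 1
            break
        items = list(set().union(*smaller))
        for combo in combinations(items, k):
            s = frozenset(combo)
            # Assumption: immediate subsets must all be in `prior`; start at 1.
            if s not in prior and all(s - {x} in prior for x in s):
                counts[s] = 1.0
        k += 1
    return counts
```

For instance, feeding the basket {a, b} twice starts a count for {a, b} only on the second arrival, because neither item was counted before the first one. Checking subsets against the `prior` snapshot, rather than the live dictionary, is what keeps a single large basket from initiating counts for all of its subsets at once.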

## This document was uploaded on 02/26/2014 for the course CS 246 at Stanford.
