02-assoc

# Counting the number of occurrences of pairs of items


…g counts in/out is a disaster (why?)

1/5/2011 Jure Leskovec, Stanford C246: Mining Massive Datasets

The hardest problem often turns out to be finding the frequent pairs of items {i1, i2}. Often frequent pairs are common and frequent triples are rare. We'll concentrate on pairs, then extend to larger sets.

The game: we always need to generate all the itemsets, but we would like to count (keep track of) only those that turn out to be frequent in the end.

Read the file once, counting in main memory the occurrences of each pair. From each basket of n items, generate its n(n-1)/2 pairs by two nested loops. This fails if (#items)^2 exceeds main memory. Remember: #items can be 100K (Wal-Mart) or 10B (Web pages).

Approach 1: Store triples [i, j, c], where count(i, j) = c. If integers and item ids are 4 bytes, we need approximately 12 bytes for each pair with count > 0, plus some additional overhead for the hashtable. What if most pairs occur, even if infrequently?

Approach 2: Count all pairs. Number the items 1, 2, 3, …, n. Count {i, j} only if...
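The one-pass, in-memory pair counting and the triples layout of Approach 1 can be sketched roughly as follows. This is a minimal illustration, not the course's implementation. The function names, the `support` parameter (here assumed to mean "a pair is frequent if it occurs in at least `support` baskets"), and the choice of a Python `Counter` as the hash table are all assumptions made for the example.

```python
from collections import Counter


def num_pairs(n):
    """Number of unordered pairs among n items: n(n-1)/2."""
    return n * (n - 1) // 2


def count_pairs(baskets):
    """Read the baskets once and count each unordered pair in memory.

    The Counter plays the role of the table of [i, j, c] triples:
    only pairs with count > 0 ever get an entry.
    """
    counts = Counter()
    for basket in baskets:
        # Deduplicate and sort so every pair is stored as (i, j) with i < j.
        items = sorted(set(basket))
        # Two nested loops generate the n(n-1)/2 pairs of this basket.
        for a in range(len(items)):
            for b in range(a + 1, len(items)):
                counts[(items[a], items[b])] += 1
    return counts


def frequent_pairs(baskets, support):
    """Keep only pairs whose count reaches the (assumed) support threshold."""
    return {pair: c for pair, c in count_pairs(baskets).items() if c >= support}


def triples_bytes(pairs_with_count, bytes_per_int=4):
    """Rough size of the triples table: three integers per stored pair.

    Hash table overhead is deliberately ignored here.
    """
    return 3 * bytes_per_int * pairs_with_count


if __name__ == "__main__":
    baskets = [[1, 2, 3], [1, 2], [2, 3], [1, 3, 2]]
    print(count_pairs(baskets))
    print(frequent_pairs(baskets, support=3))
    print(num_pairs(100_000), "possible pairs for 100K items")
```

The triples table only pays for pairs that actually occur, which is why its cost depends on how many pairs have count > 0 rather than on `num_pairs(n)`.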