02-assoc

1/5/2011 Jure Leskovec, Stanford C246: Mining Massive Datasets


i<j.
Keep pair counts in lexicographic order: {1,2}, {1,3},…, {1,n}, {2,3}, {2,4},…, {2,n}, {3,4},…
Pair {i, j} is at position (i - 1)(n - i/2) + j - i
Only requires 4 bytes per pair
But: the total number of pairs is n(n - 1)/2

Approach 1 uses 12p bytes, where p is the number of pairs that actually occur
Beats the triangular matrix if less than 1/3 of the possible pairs actually occur

[Figure: Triangular Matrix, 4 bytes per pair; Triples, 12 bytes per occurring pair]

A two-pass approach called a-priori limits the need for main memory
Key idea: monotonicity
If a set of items I appears at least s times, so does every subset J of I.
Contrapositive for pairs: if item i does not appear in s baskets, then no pair including i can appear in s baskets

Pass 1: Read the baskets and count in main memory the occurrences of each individual item
Requires only memory proportional to #items
Items that appear at least s times are the frequent items

Pass 2: Read the baskets again and count in main memory only those pairs where both elements are frequent (from Pass 1)
Requires memory proportional to the square of the number of frequent items only (for counts)
Plus a list of the frequent items (so you know what must be counted)

[Figure: Item counts; Frequent items; Counts of pairs of frequent i...]
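
The triangular-matrix position formula and the two a-priori passes can be sketched together in Python. This is a minimal illustration, not code from the slides. The names `triangular_index` and `apriori_pairs` are hypothetical. The sketch assumes the baskets fit in a list that can be read twice and that items can be sorted. Renumbering the frequent items 1..m before Pass 2 is also an assumption, made so that the pair counts fit in a triangular array of m(m - 1)/2 entries.

```python
from collections import Counter
from itertools import combinations


def triangular_index(i, j, n):
    """1-based position of pair {i, j}, 1 <= i < j <= n, in lexicographic order."""
    if not 1 <= i < j <= n:
        raise ValueError("need 1 <= i < j <= n")
    # (i - 1)(n - i/2) + j - i in integer arithmetic:
    # (i - 1)(2n - i) is always even, so the division is exact.
    return (i - 1) * (2 * n - i) // 2 + j - i


def apriori_pairs(baskets, s):
    """Return {(item_a, item_b): count} for pairs appearing in at least s baskets."""
    # Pass 1: count occurrences of each individual item.
    item_counts = Counter()
    for basket in baskets:
        item_counts.update(set(basket))
    frequent = sorted(item for item, c in item_counts.items() if c >= s)

    # Assumption: renumber frequent items 1..m so pairs index a triangular array.
    new_id = {item: k for k, item in enumerate(frequent, start=1)}
    m = len(frequent)
    pair_counts = [0] * (m * (m - 1) // 2)

    # Pass 2: count only pairs whose two elements are both frequent.
    for basket in baskets:
        ids = sorted(new_id[x] for x in set(basket) if x in new_id)
        for i, j in combinations(ids, 2):
            pair_counts[triangular_index(i, j, m) - 1] += 1

    # Keep the pairs that reach the support threshold s.
    result = {}
    for i, j in combinations(range(1, m + 1), 2):
        c = pair_counts[triangular_index(i, j, m) - 1]
        if c >= s:
            result[(frequent[i - 1], frequent[j - 1])] = c
    return result


if __name__ == "__main__":
    demo = [["milk", "bread"], ["milk", "bread", "beer"], ["milk", "beer"],
            ["bread"], ["milk", "bread", "eggs"]]
    print(apriori_pairs(demo, s=3))
```

On the memory comparison: the triples layout costs 12p bytes and the triangular matrix costs 4 · n(n - 1)/2 bytes, so triples are smaller exactly when p < (1/3) · n(n - 1)/2, which is the 1/3 threshold stated above.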

This document was uploaded on 02/26/2014 for the course CS 246 at Stanford.
