3/2/2011 Jure Leskovec, Stanford CS246: Mining Massive Datasets

New topic: Counting Distinct Elements

Problem: A data stream consists of elements chosen from a universe of size N.
Maintain a count of the number of distinct elements seen so far.
Obvious approach: Maintain the set of elements seen so far.

How many different words are found among the Web pages being crawled at a site?
Unusually low or high numbers could indicate artificial pages (spam?).
How many different Web pages does each customer request in a week?

Real problem: What if we do not have space
to maintain the set of elements seen so far?
Estimate the count in an unbiased way.
Accept that the count may have a little error, but limit the probability that the error is large.

Pick a hash function h that maps each of the N elements to at least $\log_2 N$ bits.
For each stream element a, let r(a) be the number of trailing 0s in h(a).
  r(a) = position of the first 1, counting from the right (positions start at 0).
  E.g., say h(a) = 12; then 12 is 1100 in binary, so r(a) = 2.
Record R = the maximum r(a) seen.
  $R = \max_a r(a)$, over all the items a seen so far.
Estimated number of distinct elements = $2^R$.

The probability that a given h(a) ends in at
least r 0s is $2^{-r}$.
Probability of NOT seeing a tail of length r among m distinct elements: $(1 - 2^{-r})^m$
  $1 - 2^{-r}$: probability that a given h(a) ends in fewer than r 0s.
  $(1 - 2^{-r})^m$: probability that all m hash values end in fewer than r 0s.

Prob. of NOT finding a tail of length r is:
If $m \ll 2^r$, then the probability tends to 1:
$(1 - 2^{-r})^m = \left((1 - 2^{-r})^{2^r}\right)^{m 2^{-r}} \approx e^{-m 2^{-r}} \to 1$ as $m/2^r \to 0$.
So, the probability of finding a tail of length r tends to 0.
If $m \gg 2^r$, then the probability tends to 0:
$(1 - 2^{-r})^m \approx e^{-m 2^{-r}} \to 0$ as $m/2^r \to \infty$.
So, the probability of finding a tail of length r tends to 1.
Thus, $2^R$ will almost always be around m.
N...
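
The estimator described above (hash each element, track the maximum number of trailing 0s R, report $2^R$) can be sketched in a few lines of Python. This is a minimal illustration, not the course's reference implementation: the helper names (`trailing_zeros`, `hash32`, `fm_estimate`, `prob_no_tail`), the MD5-based 32-bit hash, and the `seed` parameter are all assumptions introduced here. The last function evaluates $(1 - 2^{-r})^m$ so the two limiting cases can be checked numerically.

```python
import hashlib


def trailing_zeros(x: int, width: int = 32) -> int:
    """Number of trailing 0 bits in x; an all-zero value counts as `width`."""
    if x == 0:
        return width
    # x & -x isolates the lowest set bit; its bit_length minus 1 is its position.
    return (x & -x).bit_length() - 1


def hash32(item: str, seed: int = 0) -> int:
    """Illustrative 32-bit hash taken from MD5 (an assumption, not from the slides)."""
    digest = hashlib.md5(f"{seed}:{item}".encode("utf-8")).digest()
    return int.from_bytes(digest[:4], "big")


def fm_estimate(stream, seed: int = 0) -> int:
    """Return 2**R, where R is the maximum r(a) over all items in the stream."""
    max_r = 0
    for item in stream:
        max_r = max(max_r, trailing_zeros(hash32(item, seed)))
    return 2 ** max_r


def prob_no_tail(m: int, r: int) -> float:
    """(1 - 2**-r)**m: chance that none of m distinct hashes ends in >= r zeros."""
    return (1.0 - 2.0 ** -r) ** m


if __name__ == "__main__":
    stream = [f"word{i % 1000}" for i in range(10_000)]  # 1000 distinct items
    print("exact count :", len(set(stream)))  # the "obvious approach"
    print("2^R estimate:", fm_estimate(stream))
    print("P(no tail), m=1000, r=5 :", prob_no_tail(1000, 5))   # m >> 2^r
    print("P(no tail), m=1000, r=20:", prob_no_tail(1000, 20))  # m << 2^r
```

Only the single integer R is kept in memory, whereas the exact count needs the whole set. A single hash function yields an estimate that is always a power of 2, so individual runs can be off by a sizeable factor.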
This document was uploaded on 02/26/2014 for the course CS 246 at Stanford.
 Winter '09
 Algorithms
