Note: $(1 - 2^{-r})^m = \left((1 - 2^{-r})^{2^r}\right)^{m 2^{-r}} \approx e^{-m 2^{-r}}$

3/2/2011 Jure Leskovec, Stanford CS246: Mining Massive Datasets

One can also think of Flajolet-Martin the following way (roughly):
h(a) hashes item a with equal probability to any of N values
Then h(a) is a sequence of $\log_2 N$ bits, where a $2^{-r}$ fraction of all a's have a tail of r zeros
  About 50% of hashes end in ***0
  About 25% of hashes end in **00
So, if the longest tail we have seen is r = 2 (i.e., an item hash ending in *100), then we have probably seen about 4 distinct items so far
So, in expectation it takes $2^r$ items before we see one with a zero suffix of length r
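
The intuition above can be sketched in a few lines of Python. This is a minimal illustration, not the lecture's implementation: `hash32` is a hypothetical stand-in for h(a) built from SHA-256, and the 32-bit width, the `seed` parameter, and all function names are assumptions made for the example.

```python
import hashlib

def trailing_zeros(x: int, width: int = 32) -> int:
    """Number of trailing zero bits of x (defined as `width` when x == 0)."""
    if x == 0:
        return width
    # x & -x isolates the lowest set bit; its position is the tail length.
    return (x & -x).bit_length() - 1

def hash32(item, seed: int = 0) -> int:
    """Hypothetical stand-in for h(a): a 32-bit value taken from SHA-256."""
    digest = hashlib.sha256(f"{seed}:{item}".encode()).digest()
    return int.from_bytes(digest[:4], "big")

def fm_estimate(stream, seed: int = 0) -> int:
    """Return 2**R, where R is the longest zero tail seen among the hashes."""
    longest = 0
    for item in stream:
        longest = max(longest, trailing_zeros(hash32(item, seed)))
    return 2 ** longest
```

Repeated items hash to the same value, so they never change the estimate. With a single hash function the result is always a power of 2 and can be far from the true count.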
$E[2^R]$ is actually infinite
  Probability halves when R → R+1, but value doubles
Workaround involves using many hash functions and getting many samples
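
A rough way to see the divergence, assuming m distinct items and hash values with unboundedly many bits (both assumptions for this sketch; the approximations hold only when $2^r \gg m$):

```latex
\begin{align*}
\Pr[R \ge r] &= 1 - (1 - 2^{-r})^m \approx m\,2^{-r} \\
\Pr[R = r] &= \Pr[R \ge r] - \Pr[R \ge r+1] \approx m\,2^{-r} - m\,2^{-(r+1)} = m\,2^{-(r+1)} \\
E[2^R] &= \sum_{r \ge 0} 2^r \Pr[R = r],
\qquad 2^r \Pr[R = r] \approx \frac{m}{2} \ \text{for large } r
\end{align*}
```

The terms of the sum do not shrink as r grows, so the sum has no finite value under these assumptions.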

How are samples combined?
  Average? What if one very large value?
  Median? All estimates are a power of 2
Solution:
  Partition your samples into small groups
  Take the average of each group
  Then take the median of the averages

Suppose a stream has elements chosen from a set of N values
Let $m_a$ be the number of times value a occurs
The kth moment is $\sum_a (m_a)^k$

0th moment = number of distinct elements
  The problem just considered
1st moment = count of the number of elements = length of the stream
  Easy to compute
2nd moment = surprise number = a measure of how uneven the distribution is

Stream of length 100; 11 distinct values
Item counts: 10, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9
  Surprise # = 910
Item counts: 90, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1
  Surprise # = 8,110

[Alon, Matias, and Szegedy]
Works for all moments
Gives an unbiased estimate
We will just concentrate on the 2nd moment
Based on calculation of many random variables X:
  For each random variable X we store X.el and X.val
  Note this requires a count in main memory, so the number of Xs is limited

How to set X.val and X.el?
Assume stream has...
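
The moment definitions and the two surprise-number examples above can be checked with an exact computation. The sketch below keeps one counter per distinct value, so it only works when all counts fit in memory, which is the cost an estimator tries to avoid. The names `kth_moment` and `stream_from_counts` are hypothetical helpers introduced for this example.

```python
from collections import Counter

def kth_moment(stream, k: int) -> int:
    """Exact kth moment: the sum over distinct values a of (m_a)**k."""
    counts = Counter(stream)
    return sum(m ** k for m in counts.values())

def stream_from_counts(counts):
    """Build a stream in which value i occurs counts[i] times."""
    return [i for i, m in enumerate(counts) for _ in range(m)]

# The two streams of length 100 with 11 distinct values from the example.
even = stream_from_counts([10] + [9] * 10)
skewed = stream_from_counts([90] + [1] * 10)

print(kth_moment(even, 0), kth_moment(even, 1))      # distinct values, stream length
print(kth_moment(even, 2), kth_moment(skewed, 2))    # surprise numbers
```

Both streams share the same 0th and 1st moments; only the 2nd moment separates the even distribution from the skewed one.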