16-streams



3/2/2011, Jure Leskovec, Stanford CS246: Mining Massive Datasets

Note: $(1 - 2^{-r})^m = \left((1 - 2^{-r})^{2^r}\right)^{m 2^{-r}} \approx e^{-m 2^{-r}}$

One can also think of Flajolet-Martin the following way (roughly):
  h(a) hashes item a with equal probability to any of N values
  Then h(a) is a sequence of $\log_2 N$ bits, where a $2^{-r}$ fraction of all a's have a tail of r zeros
    About 50% of hashes end with ***0, about 25% of hashes end with **00
  So, if the longest tail we saw is r = 2 (i.e., an item hash ending in *100), then we have probably seen about 4 distinct items so far
  So, in expectation it takes $2^r$ items before we see one with a zero-suffix of length r

$E[2^R]$ is actually infinite
  Probability halves when R → R+1, but value doubles
Workaround involves using many hash functions and getting many samples
How are samples combined?
  Average? What if one very large value?
  Median? All estimates are a power of 2
  Solution:
    Partition your samples into small groups
    Take the average of each group
    Then take the median of the averages

Suppose a stream has elements chosen from a set of N values
Let $m_a$ be the number of times value a occurs
The kth moment is $\sum_a (m_a)^k$

0th moment = number of distinct elements
  The problem just considered
1st moment = count of the number of elements = length of the stream
  Easy to compute
2nd moment = surprise number = a measure of how uneven the distribution is

Stream of length 100; 11 distinct values
  Item counts: 10, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9
    Surprise # = 910
  Item counts: 90, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1
    Surprise # = 8,110

[Alon, Matias, and Szegedy]
  Works for all moments
  Gives an unbiased estimate
  We will just concentrate on the 2nd moment
  Based on calculation of many random variables X:
    For each random variable X we store X.el and X.val
    Note this requires a count in main memory, so the number of Xs is limited

How to set X.val and X.el?
  Assume stream has...
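
The moment definitions and the Flajolet-Martin combining rule from the slides above can be tried out with a short Python sketch. It is only an illustration under stated assumptions, not the course's reference code: the function names (`kth_moment`, `tail_length`, `median_of_averages`, `fm_estimate`) are hypothetical, a salted SHA-256 digest stands in for the "many hash functions", and the choices of 12 hash functions and groups of 4 are arbitrary. The moments are computed exactly by counting, which is feasible only for small streams; the two example streams reproduce the item counts from the surprise-number slide.

```python
import hashlib
import statistics
from collections import Counter


def kth_moment(stream, k):
    """Exact k-th moment: sum over distinct values a of (m_a)**k."""
    counts = Counter(stream)
    return sum(m ** k for m in counts.values())


def tail_length(x, width=64):
    """Number of trailing zero bits of x (width if x is 0)."""
    if x == 0:
        return width
    return (x & -x).bit_length() - 1


def median_of_averages(estimates, group_size):
    """Partition samples into small groups, average each, take the median."""
    groups = [estimates[i:i + group_size]
              for i in range(0, len(estimates), group_size)]
    return statistics.median(statistics.mean(g) for g in groups)


def fm_estimate(stream, num_hashes=12, group_size=4):
    """Rough distinct-count estimate in the Flajolet-Martin style.

    Assumption: hashing "j:item" with SHA-256 plays the role of the j-th
    hash function; only the low 64 bits of the digest are used.
    """
    longest = [0] * num_hashes  # longest zero tail R seen per hash function
    for item in stream:
        for j in range(num_hashes):
            digest = hashlib.sha256(f"{j}:{item}".encode()).digest()
            h = int.from_bytes(digest[-8:], "big")
            longest[j] = max(longest[j], tail_length(h))
    estimates = [2 ** r for r in longest]  # each estimate is a power of 2
    return median_of_averages(estimates, group_size)


# The two streams from the surprise-number slide: length 100, 11 distinct values.
even = [v for v, c in enumerate([10] + [9] * 10) for _ in range(c)]
skewed = [v for v, c in enumerate([90] + [1] * 10) for _ in range(c)]

print(kth_moment(even, 0), kth_moment(even, 1))      # 11 100
print(kth_moment(even, 2), kth_moment(skewed, 2))    # 910 8110
print(fm_estimate(even))  # rough estimate of the 0th moment; can be far off
```

Because the longest zero tail depends only on which values occur, not on how often, `fm_estimate` returns the same number for both streams. With so few hash functions the estimate is coarse, which is the motivation the slides give for combining many samples.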