16-streams



3/2/2011 Jure Leskovec, Stanford C246: Mining Massive Datasets

New topic: Counting Distinct Elements

Problem: A data stream consists of elements chosen from a universe of size N. Maintain a count of the number of distinct elements seen so far.

Obvious approach: Maintain the set of elements seen so far.

How many different words are found among the Web pages being crawled at a site? Unusually low or high numbers could indicate artificial pages (spam?).

How many different Web pages does each customer request in a week?

Real problem: What if we do not have space to maintain the set of elements seen so far? Estimate the count in an unbiased way. Accept that the count may have a little error, but limit the probability that the error is large.

Pick a hash function h that maps each of the N elements to at least $\log_2 N$ bits.

For each stream element a, let r(a) be the number of trailing 0s in h(a). Equivalently, r(a) is the position of the first 1 counting from the right, with the rightmost bit at position 0.

E.g., say h(a) = 12; then 12 is 1100 in binary, so r(a) = 2.

Record R = the maximum r(a) seen: $R = \max_a r(a)$, over all the items a seen so far.

Estimated number of distinct elements = $2^R$.

The probability that a given h(a) ends in at least r 0s is $2^{-r}$, so the probability that a given h(a) ends in fewer than r 0s is $1 - 2^{-r}$. The probability that all of m distinct elements end in fewer than r 0s, i.e. the probability of NOT seeing a tail of length r among m elements, is $(1 - 2^{-r})^m$.

The probability of NOT finding a tail of length r can be approximated as $(1 - 2^{-r})^m = \left((1 - 2^{-r})^{2^r}\right)^{m 2^{-r}} \approx e^{-m 2^{-r}}$, since $(1 - \epsilon)^{1/\epsilon} \approx 1/e$ for small $\epsilon$.

If $m \ll 2^r$, then the probability tends to 1: $(1 - 2^{-r})^m \approx e^{-m 2^{-r}} \to 1$ as $m/2^r \to 0$. So, the probability of finding a tail of length r tends to 0.

If $m \gg 2^r$, then the probability tends to 0: $(1 - 2^{-r})^m \approx e^{-m 2^{-r}} \to 0$ as $m/2^r \to \infty$. So, the probability of finding a tail of length r tends to 1.

Thus, $2^R$ will almost always be around m.

N...
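The estimator described in the slides (hash each element, track the longest run of trailing 0s, report 2 to that power) can be sketched in a few lines of Python. This is a minimal illustration, not the course's reference code: the names `trailing_zeros`, `hash32`, and `fm_estimate` are hypothetical, and the choice of SHA-256 truncated to 32 bits as the hash function h is an assumption, since the slides only require a hash with at least log2 N bits.

```python
import hashlib


def trailing_zeros(x: int, width: int = 32) -> int:
    """Return r = number of trailing 0 bits in x.

    By convention (an assumption here), an all-zero hash value
    counts as `width` trailing zeros.
    """
    if x == 0:
        return width
    # x & -x isolates the lowest set bit; its bit position is r.
    return (x & -x).bit_length() - 1


def hash32(item: str) -> int:
    """Hypothetical hash h: map a string to a 32-bit integer."""
    digest = hashlib.sha256(item.encode("utf-8")).digest()
    return int.from_bytes(digest[:4], "big")


def fm_estimate(stream) -> int:
    """Estimate the number of distinct elements as 2^R,
    where R is the maximum r(a) over all items a seen so far."""
    R = 0
    for a in stream:
        R = max(R, trailing_zeros(hash32(a)))
    return 2 ** R


# The worked example from the slides: h(a) = 12 = 1100 in binary.
print(trailing_zeros(12))  # 2

# Duplicates hash to the same value, so they cannot change R.
words = ["to", "be", "or", "not", "to", "be"]
print(fm_estimate(words) == fm_estimate(set(words)))  # True
```

Only the single integer R is stored, regardless of stream length, which is the point of the approach. With one hash function the estimate is always a power of 2 and can be far off for any particular stream.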

This document was uploaded on 02/26/2014 for the course CS 246 at Stanford.
