cs345-streams2-3

More Stream-Mining
- Counting Distinct Elements
- Computing “Moments”
- Frequent Itemsets
- Elephants and Troops
- Exponentially Decaying Windows

Counting Distinct Elements
- Problem: a data stream consists of elements chosen from a set of size $n$. Maintain a count of the number of distinct elements seen so far.
- Obvious approach: maintain the set of elements seen.

Applications
- How many different words are found among the Web pages being crawled at a site?
  - Unusually low or high numbers could indicate artificial pages (spam?).
- How many different Web pages does each customer request in a week?

Using Small Storage
- Real problem: what if we do not have space to store the complete set?
- Estimate the count in an unbiased way.
- Accept that the count may be in error, but limit the probability that the error is large.

Flajolet-Martin* Approach
- Pick a hash function $h$ that maps each of the $n$ elements to at least $\log_2 n$ bits.
- For each stream element $a$, let $r(a)$ be the number of trailing 0’s in $h(a)$.
- Record $R$ = the maximum $r(a)$ seen.
- Estimate = $2^R$.
* Really based on a variant due to AMS (Alon, Matias, and Szegedy).
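The Flajolet-Martin steps can be sketched in a few lines of Python. This is a minimal illustration, not the course's implementation: the helper names `hash32`, `trailing_zeros`, and `fm_estimate`, the 32-bit hash width, and the use of SHA-256 as the hash function are all assumptions made for the example.

```python
import hashlib


def trailing_zeros(x: int, width: int = 32) -> int:
    """Count trailing 0 bits of x; an all-zero value counts as `width` zeros."""
    if x == 0:
        return width
    # x & -x isolates the lowest set bit; its bit_length minus 1 is its position.
    return (x & -x).bit_length() - 1


def hash32(element: str, seed: int = 0) -> int:
    """Hypothetical hash h: map an element to 32 bits (assumed choice: SHA-256 prefix)."""
    data = f"{seed}:{element}".encode("utf-8")
    return int.from_bytes(hashlib.sha256(data).digest()[:4], "big")


def fm_estimate(stream, seed: int = 0) -> int:
    """Return 2^R, where R is the maximum number of trailing 0's seen in h(a)."""
    R = 0
    for a in stream:
        R = max(R, trailing_zeros(hash32(a, seed)))
    return 2 ** R


if __name__ == "__main__":
    # Duplicates do not change R, so only distinct elements matter.
    stream = [f"item{i % 500}" for i in range(5000)]
    print(fm_estimate(stream))
```

Only the single integer `R` is stored per hash function, which is the point of the approach: the state does not grow with the number of distinct elements.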

Why It Works
- The probability that a given $h(a)$ ends in at least $r$ 0’s is $2^{-r}$.
- If there are $m$ different elements, the probability that $R \ge r$ is $1 - (1 - 2^{-r})^m$.
  - $1 - 2^{-r}$: probability that a given $h(a)$ ends in fewer than $r$ 0’s.
  - $(1 - 2^{-r})^m$: probability that all $h(a)$’s end in fewer than $r$ 0’s.

Why It Works (2)
- Since $2^{-r}$ is small, $1 - (1 - 2^{-r})^m \approx 1 - e^{-m 2^{-r}}$.
- If $2^r \gg m$, $1 - (1 - 2^{-r})^m \approx 1 - (1 - m 2^{-r}) = m/2^r \approx 0$, using the first 2 terms of the Taylor expansion of $e^x$.
- If $2^r \ll m$, $1 - (1 - 2^{-r})^m \approx 1 - e^{-m 2^{-r}} \approx 1$.
- Thus, $2^R$ will almost always be around $m$.
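The two regimes can be checked numerically. The sketch below evaluates the exact probability that $R \ge r$ next to its exponential approximation; the function names and the choice $m = 1000$ are illustrative assumptions, not values from the slides.

```python
import math


def prob_R_at_least(r: int, m: int) -> float:
    """Exact probability that R >= r with m distinct elements: 1 - (1 - 2^-r)^m."""
    return 1.0 - (1.0 - 2.0 ** -r) ** m


def prob_approx(r: int, m: int) -> float:
    """Approximation 1 - e^(-m * 2^-r), valid when 2^-r is small."""
    return 1.0 - math.exp(-m * 2.0 ** -r)


if __name__ == "__main__":
    m = 1000  # assumed number of distinct elements, for illustration only
    for r in (2, 5, 10, 15, 20):
        print(r, prob_R_at_least(r, m), prob_approx(r, m))
```

For $r$ with $2^r$ far below $m$ the probability is close to 1, and for $2^r$ far above $m$ it is close to 0, so $R$ tends to land near $\log_2 m$.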

Why It Doesn’t Work
- $E(2^R)$ is actually infinite.
  - Probability halves when $R \to R+1$, but value doubles.
- Workaround involves using many hash functions and getting many samples.
- How are samples combined?
  - Average? What if one very large value?
  - Median? All values are a power of 2.

Solution
- Partition your samples into small groups.
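The two objections to combining samples (an average is dominated by one large value, a median is always a power of 2) can be seen on a small example. The sample values below are made up for illustration, one per hypothetical hash function; they are not measurements from any real stream.

```python
import statistics

# Made-up estimates of the form 2^R, one per hypothetical hash function.
# The last value stands in for a single unusually large sample.
samples = [16, 16, 32, 32, 32, 64, 2 ** 20]


def summarize(values):
    """Return (average, median) of the sample estimates."""
    return statistics.mean(values), statistics.median(values)


if __name__ == "__main__":
    mean_est, median_est = summarize(samples)
    print("average:", mean_est)   # pulled far upward by the one large sample
    print("median:", median_est)  # one of the samples, hence a power of 2
```

With an odd number of samples the median is itself one of the samples, so it can never fall between two powers of 2, while the average here is several orders of magnitude above the typical sample.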