# cs345-streams2-3 - More Stream-Mining Counting Distinct...

## More Stream-Mining

- Counting Distinct Elements
- Computing “Moments”
- Frequent Itemsets
- Elephants and Troops
- Exponentially Decaying Windows

## Counting Distinct Elements

- Problem: a data stream consists of elements chosen from a set of size $n$. Maintain a count of the number of distinct elements seen so far.
- Obvious approach: maintain the set of elements seen.

## Applications

- How many different words are found among the Web pages being crawled at a site?
  - Unusually low or high numbers could indicate artificial pages (spam?).
- How many different Web pages does each customer request in a week?
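For reference, the obvious set-based approach applied to the first application (distinct words) might look like the sketch below. The function name `count_distinct`, the sample pages, and whitespace tokenization are assumptions made for illustration. Memory grows with the number of distinct elements, which is the cost the small-storage methods try to avoid.

```python
def count_distinct(stream):
    """Exact distinct count: stores every element seen, so memory is O(m)."""
    seen = set()
    for element in stream:
        seen.add(element)
    return len(seen)


# Hypothetical crawled pages, tokenized by whitespace for illustration only.
pages = ["the quick brown fox", "the lazy dog", "quick quick fox"]
words = (word for page in pages for word in page.split())
print(count_distinct(words))
```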

## Using Small Storage

- Real problem: what if we do not have space to store the complete set?
- Estimate the count in an unbiased way.
- Accept that the count may be in error, but limit the probability that the error is large.

## Flajolet-Martin* Approach

- Pick a hash function $h$ that maps each of the $n$ elements to at least $\log_2 n$ bits.
- For each stream element $a$, let $r(a)$ be the number of trailing 0's in $h(a)$.
- Record $R$ = the maximum $r(a)$ seen.
- Estimate = $2^R$.

\* Really based on a variant due to AMS (Alon, Matias, and Szegedy).
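The four steps above can be sketched in Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the hash (SHA-256 salted with a seed and truncated to 32 bits), the names `h`, `trailing_zeros`, and `fm_estimate`, and the convention that an all-zero hash counts as 32 trailing zeros are all choices made here.

```python
import hashlib

HASH_BITS = 32  # assumed width; should be at least log2(n) bits


def h(element, seed=0):
    """Hypothetical hash: first 4 bytes of SHA-256 over the seed and element."""
    data = f"{seed}:{element}".encode("utf-8")
    return int.from_bytes(hashlib.sha256(data).digest()[:4], "big")


def trailing_zeros(x, bits=HASH_BITS):
    """Number of trailing 0 bits in x; an all-zero value counts as `bits`."""
    if x == 0:
        return bits
    # x & -x isolates the lowest set bit; its position is the zero count.
    return (x & -x).bit_length() - 1


def fm_estimate(stream, seed=0):
    """Return 2**R, where R is the largest trailing-zero count seen."""
    R = 0
    for a in stream:
        R = max(R, trailing_zeros(h(a, seed)))
    return 2 ** R


# 10,000 stream elements, but only 1,000 distinct values.
stream = [i % 1000 for i in range(10_000)]
print(fm_estimate(stream))  # always a power of 2
```

Only the single integer `R` is stored, and repeated elements hash to the same value, so duplicates never change the estimate. With this sketch an empty stream yields 1, since `R` starts at 0.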

## Why It Works

- The probability that a given $h(a)$ ends in at least $r$ 0's is $2^{-r}$.
- If there are $m$ different elements, the probability that $R \ge r$ is $1 - (1 - 2^{-r})^m$.
  - $1 - 2^{-r}$ is the probability that a given $h(a)$ ends in fewer than $r$ 0's.
  - $(1 - 2^{-r})^m$ is the probability that all $h(a)$'s end in fewer than $r$ 0's.

## Why It Works – (2)

- Since $2^{-r}$ is small, $1 - (1 - 2^{-r})^m \approx 1 - e^{-m2^{-r}}$.
- If $2^r \gg m$, $1 - (1 - 2^{-r})^m \approx 1 - (1 - m2^{-r}) = m/2^r \approx 0$. Here $e^{-m2^{-r}} \approx 1 - m2^{-r}$, the first 2 terms of the Taylor expansion of $e^x$.
- If $2^r \ll m$, $1 - (1 - 2^{-r})^m \approx 1 - e^{-m2^{-r}} \approx 1$.
- Thus, $2^R$ will almost always be around $m$.
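As a quick numeric check of this argument, the sketch below evaluates the exact probability $1 - (1 - 2^{-r})^m$ next to the approximation $1 - e^{-m2^{-r}}$. The function names and the choice $m = 1000$ are illustrative assumptions.

```python
import math


def prob_R_at_least(r, m):
    """Exact P(R >= r): at least one of m hashes ends in >= r zeros."""
    return 1.0 - (1.0 - 2.0 ** -r) ** m


def prob_approx(r, m):
    """Approximation 1 - exp(-m * 2**-r), reasonable when 2**-r is small."""
    return 1.0 - math.exp(-m * 2.0 ** -r)


m = 1000  # assumed number of distinct elements
for r in (4, 8, 10, 12, 16, 20):
    exact = prob_R_at_least(r, m)
    approx = prob_approx(r, m)
    print(f"r={r:2d}  exact={exact:.4f}  approx={approx:.4f}")
```

Both columns stay near 1 while $2^r \ll m$ and fall toward 0 once $2^r \gg m$, with the transition around $r \approx \log_2 m \approx 10$.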

## Why It Doesn’t Work

- $E(2^R)$ is actually infinite.
  - Probability halves when $R \to R+1$, but value doubles.
- Workaround involves using many hash functions and getting many samples.
- How are samples combined?
  - Average? What if one very large value?
  - Median? All values are a power of 2.

## Solution

- Partition your samples into small groups.
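How the groups are then combined is not stated above. One common choice, assumed here rather than taken from the text, is to average the estimates within each group and then take the median of the group averages. Averaging produces values that are no longer restricted to powers of 2, and the median limits the influence of a group inflated by one very large sample. The function name `combine_estimates`, the group size, and the sample values are hypothetical.

```python
from statistics import mean, median


def combine_estimates(samples, group_size):
    """Median of per-group averages (an assumed combination rule)."""
    if group_size <= 0:
        raise ValueError("group_size must be positive")
    groups = [samples[i:i + group_size]
              for i in range(0, len(samples), group_size)]
    if not groups:
        raise ValueError("need at least one sample")
    return median(mean(g) for g in groups)


# Hypothetical values of 2**R from 9 hash functions; one is an outlier.
samples = [512, 1024, 1024, 2048, 1024, 512, 1024, 65536, 2048]
print(mean(samples))                  # plain average, pulled up by 65536
print(median(samples))                # plain median, always a power of 2
print(combine_estimates(samples, 3))  # median of three group averages
```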
