cs345-streams2-3


1 More Stream-Mining
- Counting Distinct Elements
- Computing Moments
- Frequent Itemsets
- Elephants and Troops
- Exponentially Decaying Windows

2 Counting Distinct Elements
- Problem: a data stream consists of elements chosen from a set of size $n$. Maintain a count of the number of distinct elements seen so far.
- Obvious approach: maintain the set of elements seen.

3 Applications
- How many different words are found among the Web pages being crawled at a site?
  - Unusually low or high numbers could indicate artificial pages (spam?).
- How many different Web pages does each customer request in a week?

4 Using Small Storage
- Real problem: what if we do not have space to store the complete set?
- Estimate the count in an unbiased way.
- Accept that the count may be in error, but limit the probability that the error is large.

5 Flajolet-Martin* Approach
- Pick a hash function $h$ that maps each of the $n$ elements to at least $\log_2 n$ bits.
- For each stream element $a$, let $r(a)$ be the number of trailing 0s in $h(a)$.
- Record $R$ = the maximum $r(a)$ seen.
- Estimate = $2^R$.
* Really based on a variant due to AMS (Alon, Matias, and Szegedy).

6 Why It Works
- The probability that a given $h(a)$ ends in at least $r$ 0s is $2^{-r}$.
- If there are $m$ different elements, the probability that $R \ge r$ is $1 - (1 - 2^{-r})^m$.
  - $1 - 2^{-r}$ is the probability that a given $h(a)$ ends in fewer than $r$ 0s.
  - $(1 - 2^{-r})^m$ is the probability that all $h(a)$s end in fewer than $r$ 0s.

7 Why It Works (2)
- Since $2^{-r}$ is small, $1 - (1 - 2^{-r})^m \approx 1 - e^{-m 2^{-r}}$.
- If $2^r \gg m$, then $1 - (1 - 2^{-r})^m \approx 1 - (1 - m 2^{-r}) = m/2^r \approx 0$. (This uses the first 2 terms of the Taylor expansion of $e^x$.)
- If $2^r \ll m$, then $1 - (1 - 2^{-r})^m \approx 1 - e^{-m 2^{-r}} \approx 1$.
- Thus, $2^R$ will almost always be around $m$.

8 Why It Doesn't Work
- $E(2^R)$ is actually infinite.
  - The probability halves when $R \to R + 1$, but the value doubles.
- Workaround involves using many hash functions and getting many samples.
- How are samples combined?
  - Average? What if there is one very large value?
  - Median? All values are a power of 2.
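The single-hash estimator from slide 5 can be sketched in a few lines of Python. This is only an illustrative sketch under stated assumptions, not the lecture's own code: the helper names `trailing_zeros`, `make_hash`, and `fm_estimate`, the 32-bit hash width, and the use of salted BLAKE2b as the hash function are all choices made for this example.

```python
import hashlib


def trailing_zeros(x: int, width: int = 32) -> int:
    """Number of trailing 0 bits in x; an all-zero value counts as `width`."""
    if x == 0:
        return width
    # x & -x isolates the lowest set bit; its bit position is the answer.
    return (x & -x).bit_length() - 1


def make_hash(seed: int, width: int = 32):
    """Hypothetical helper: a seeded hash onto `width` bits (assumed: BLAKE2b)."""
    salt = seed.to_bytes(8, "little")
    mask = (1 << width) - 1

    def h(element) -> int:
        data = repr(element).encode("utf-8")
        digest = hashlib.blake2b(data, digest_size=8, salt=salt).digest()
        return int.from_bytes(digest, "little") & mask

    return h


def fm_estimate(stream, h, width: int = 32) -> int:
    """Return 2**R, where R is the largest trailing-zero count r(a) seen."""
    R = 0
    for a in stream:
        R = max(R, trailing_zeros(h(a), width))
    return 2 ** R


if __name__ == "__main__":
    # 1000 distinct values, each repeated 5 times; repeats cannot change R.
    stream = [i % 1000 for i in range(5000)]
    for seed in range(5):
        print(seed, fm_estimate(stream, make_hash(seed)))
```

Only the single integer `R` is stored per hash function, which is the point of the small-storage approach. Running the loop with several seeds shows how widely individual estimates can scatter, which is the difficulty slide 8 raises.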

9 Solution
- Partition your samples into small groups....