16-streams


CS246: Mining Massive Datasets Jure Leskovec, Stanford University http://cs246.stanford.edu
More algorithms for streams:
(1) Filtering a data stream: Bloom filters
    Select elements with property x from the stream
(2) Counting distinct elements: Flajolet-Martin
    Number of distinct elements in the last k elements of the stream
(3) Estimating moments: AMS method
    Estimate std. dev. of the last k elements
(4) Counting frequent items
3/2/2011, Jure Leskovec, Stanford C246: Mining Massive Datasets
Each element of the data stream is a tuple.
Given a list of keys S, determine which elements of the stream have keys in S.
Obvious solution: a hash table.
    But suppose we do not have enough memory to store all of S in a hash table.
    E.g., we might be processing millions of filters on the same stream.
Examples:
Email spam filtering:
    We know 1 billion "good" email addresses.
    If an email comes from one of these, it is NOT spam.
Publish-subscribe systems:
    People express interest in certain sets of keywords.
    Determine whether each message matches a user's interest.
Create a bit array B of n bits, initially all 0s.
Choose a hash function h with range [0,n).
Hash each member s ∈ S to one of the n buckets, and set that bit to 1, i.e., B[h(s)] = 1.
Hash each element a of the stream and output only those that hash to a bit that was set to 1:
    Output a if B[h(a)] == 1
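The steps above can be sketched in Python. This is a minimal illustration, not the lecture's own code: the class name OneHashFilter, the helper filter_stream, and the choice of SHA-256 as the hash function h are all assumptions made for the example.

```python
import hashlib


class OneHashFilter:
    """Bit array B of n bits with a single hash function h (illustrative names)."""

    def __init__(self, n_bits, keys):
        self.n = n_bits
        self.bits = bytearray((n_bits + 7) // 8)  # all bits start at 0
        for s in keys:
            self._set(self._h(s))  # B[h(s)] = 1 for every s in S

    def _h(self, key):
        # Assumption: SHA-256 stands in for h; any well-mixing hash would do.
        digest = hashlib.sha256(key.encode("utf-8")).digest()
        return int.from_bytes(digest[:8], "big") % self.n

    def _set(self, i):
        self.bits[i >> 3] |= 1 << (i & 7)

    def might_contain(self, key):
        # True means "possibly in S"; False means "surely not in S".
        i = self._h(key)
        return bool(self.bits[i >> 3] & (1 << (i & 7)))


def filter_stream(stream, flt):
    """Yield only the stream elements a with B[h(a)] == 1."""
    for a in stream:
        if flt.might_contain(a):
            yield a


good = ["alice@example.com", "bob@example.org"]
flt = OneHashFilter(1024, good)
passed = list(filter_stream(["alice@example.com", "spam@example.net", "bob@example.org"], flt))
```

Every key in S is guaranteed to appear in passed. Whether spam@example.net also appears depends on whether it happens to collide with a set bit, so the sketch makes no claim about it.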
Creates false positives but no false negatives:
    If the item is in S we surely output it; if not, we may still output it.
[Figure: an item is passed through hash function h into bit array B (e.g., 0010001011000).
    If it hashes to a bucket set to 1, output the item, since it may be in S: at least one of the items in S hashed to that bucket.
    If it hashes to a bucket set to 0, drop the item: it is surely not in S.]
|S| = 1 billion email addresses
|B| = 1GB = 8 billion bits
If the email address is in S, then it surely hashes to a bucket that has the bit set to 1, so it always gets through (no false negatives).
Approximately 1/8 of the bits are set to 1, so about 1/8th of the addresses not in S get through to the output (false positives).
    Actually, less than 1/8th, because more than one address might hash to the same bit.
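To see how far below 1/8 the true fraction falls, the numbers above can be checked directly. This is a sketch under the assumption that each address hashes independently and uniformly over the bits; the function name fraction_of_bits_set is made up for the illustration.

```python
import math


def fraction_of_bits_set(m, n):
    """Expected fraction of n bits set to 1 after m independent uniform hashes."""
    # 1 - (1 - 1/n)^m, written with log1p/expm1 so the tiny 1/n is not lost to rounding.
    return -math.expm1(m * math.log1p(-1.0 / n))


m = 1_000_000_000   # |S|: addresses hashed into the array
n = 8_000_000_000   # |B|: bits in the array

naive = m / n                         # upper bound: every address sets a distinct bit
actual = fraction_of_bits_set(m, n)   # accounts for addresses sharing a bit

print(f"naive estimate: {naive:.4f}")
print(f"with collisions: {actual:.4f}")
```

Under these assumptions the collision-aware figure comes out near 0.1175, slightly below the naive 0.125.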
More accurate analysis for the number of false positives.
Consider: If we throw m darts into n equally likely targets, what is the probability that a target gets at least one dart?
In our case:
    Targets = bits/buckets
    Darts = hash values of items
We have m darts, n targets.
What is the probability that a target gets at least one dart?
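The standard answer to this question can be sketched as follows. It is offered as a reconstruction of the usual argument rather than the slide's own wording, and it assumes each dart lands independently and uniformly at random on one of the n targets.

```latex
\begin{align*}
\Pr[\text{a given dart misses the target}] &= 1 - \frac{1}{n} \\
\Pr[\text{all } m \text{ darts miss the target}] &= \left(1 - \frac{1}{n}\right)^{m}
  = \left(\left(1 - \frac{1}{n}\right)^{n}\right)^{m/n} \approx e^{-m/n} \\
\Pr[\text{the target gets at least one dart}] &= 1 - \left(1 - \frac{1}{n}\right)^{m}
  \approx 1 - e^{-m/n}
\end{align*}
```

The approximation uses (1 - 1/n)^n → 1/e as n grows, so it is accurate when n is large, as it is for a bit array of billions of bits.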