16-streams

# For the number of false positives consider if we



…equally likely targets, what is the probability that a target gets at least one dart? In our case:

- Targets = bits/buckets
- Darts = hash values of items

Jure Leskovec, Stanford CS246: Mining Massive Datasets, 3/2/2011

We have $m$ darts and $n$ targets. What is the probability that a target gets at least one dart?

- The probability that a given target is not hit by one dart is $1 - 1/n$.
- The probability that it is not hit by any of the $m$ darts is $(1 - 1/n)^m = \left((1 - 1/n)^{n}\right)^{m/n}$, where $(1 - 1/n)^n \to 1/e$ as $n \to \infty$.
- So the probability that at least one dart hits the target is

$$1 - \left(1 - \frac{1}{n}\right)^{n(m/n)} \approx 1 - e^{-m/n}.$$

Fraction of 1s in the array $B$ = probability of a false positive = $1 - e^{-m/n}$.

Example: $10^9$ darts, $8 \cdot 10^9$ targets.

- Fraction of 1s in $B$: $1 - e^{-1/8} = 0.1175$
- Compare with our earlier estimate: $1/8 = 0.125$

Consider $|S| = m$ and $|B| = n$, and use $k$ independent hash functions $h_1, \dots, h_k$.

Initialization:

- Set $B$ to all 0s.
- Hash each element $s \in S$ using each hash function $h_i$, and set $B[h_i(s)] = 1$ (for each $i = 1, \dots, k$).

Run-time:

- When a stream element with key $x$ arrives: if $B[h_i(x)] = 1$ for all $i = 1, \dots, k$, declare that $x$ is in $S$ (i.e., $x$ hashes to a bucket set to 1 for every hash function $h_i$).
- Otherwise, discard the element $x$.

What fraction of the bit vector $B$ are 1s?

- We are throwing $k \cdot m$ darts at $n$ targets, so the fraction of 1s is $1 - e^{-km/n}$.
- But we have $k$ independent hash functions, and a false positive requires all $k$ positions to be 1.
- So, false positive probability = $\left(1 - e^{-km/n}\right)^k$.

Example: $m$ = 1 billion, $n$ = 8 billion.

- $k = 1$: $1 - e^{-1/8} = 0.1175$
- $k = 2$: $\left(1 - e^{-1/4}\right)^2 \approx 0.0489$

What happens as we keep increasing $k$?

[Figure: false positive probability (y-axis, 0 to 0.2) versus number of hash functions $k$ (x-axis, 2 to 20)]

"Optimal" value of $k$: $(n/m) \ln 2$. E.g.: $8 \ln 2 \approx 5.54$.

Bloom filters guarantee no false negatives, and use limited memory.

- Great for pre-processing before more expensive checks
  - E.g., Google's BigTable, Squid web proxy
- Suitable for hardware implementation
  - Hash function computations can be parallelized...
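The initialization and run-time steps above, together with the false-positive estimate, can be sketched in a few lines of Python. This is only an illustrative sketch, not code from the slides: the names `BloomFilter`, `false_positive_prob`, and `optimal_k` are hypothetical, and simulating the $k$ independent hash functions by salting SHA-256 with the index $i$ is an assumption made for convenience.

```python
import hashlib
import math


def false_positive_prob(m: int, n: int, k: int) -> float:
    """Approximate false positive probability (1 - e^{-km/n})^k
    for m inserted items, n bits, and k hash functions."""
    return (1.0 - math.exp(-k * m / n)) ** k


def optimal_k(m: int, n: int) -> float:
    """Real-valued k that minimizes the estimate above: (n/m) * ln 2."""
    return (n / m) * math.log(2)


class BloomFilter:
    """Toy Bloom filter over string keys (illustrative, not space-efficient)."""

    def __init__(self, n_bits: int, k: int) -> None:
        self.n_bits = n_bits
        self.k = k
        # One byte per "bit" keeps the sketch simple; a real filter would pack bits.
        self.bits = bytearray(n_bits)

    def _positions(self, key: str):
        # Assumption: emulate k independent hash functions h_1..h_k by
        # prefixing the key with the index i before hashing with SHA-256.
        for i in range(self.k):
            digest = hashlib.sha256(f"{i}:{key}".encode("utf-8")).digest()
            yield int.from_bytes(digest[:8], "big") % self.n_bits

    def add(self, key: str) -> None:
        # Initialization step: set B[h_i(s)] = 1 for each i.
        for pos in self._positions(key):
            self.bits[pos] = 1

    def might_contain(self, key: str) -> bool:
        # Run-time step: declare "in S" only if every B[h_i(x)] is 1.
        return all(self.bits[pos] for pos in self._positions(key))


if __name__ == "__main__":
    m, n = 10**9, 8 * 10**9
    for k in (1, 2):
        print(f"k={k}: false positive prob ~ {false_positive_prob(m, n, k):.4f}")
    print(f"optimal k ~ {optimal_k(m, n):.2f}")

    bf = BloomFilter(n_bits=1000, k=3)
    for name in ("alice", "bob", "carol"):
        bf.add(name)
    print(all(bf.might_contain(name) for name in ("alice", "bob", "carol")))
```

Every key that was added is always reported as present, which reflects the no-false-negatives guarantee; a key that was never added may still be reported as present with roughly the probability given by `false_positive_prob`.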

## This document was uploaded on 02/26/2014 for the course CS 246 at Stanford.
