**Improvements to A-Priori**

- Bloom Filters
- Park-Chen-Yu Algorithm
- Multistage Algorithm
- Approximate Algorithms
- Compacting Results

**Aside: Hash-Based Filtering**

Simple problem: I have a set S of one billion strings of length 10. I want to scan a larger file F of strings and output those that are in S. I have 1 GB of main memory, so I can't afford to store S in memory.

**Solution (1)**

Create a bit array of 8 billion bits, initially all 0s. Choose a hash function h with range $[0, 8 \times 10^9)$, and hash each member of S to one of the bits, which is then set to 1. Filter the file F by hashing each string and outputting only those that hash to a 1.

**Solution (2)**

[Figure: each string of file F is hashed by h into the bit array (e.g., 0010001011000). Strings that hash to a 1 go to the output and may be in S; strings that hash to a 0 are dropped and are surely not in S.]

**Solution (3)**

As at most 1/8 of the bit array is 1, only 1/8 of the strings not in S get through to the output. If a string is in S, it surely hashes to a 1, so it always gets through. We can repeat with another hash function and bit array to reduce the false positives by another factor of 8.

**Solution Summary**

Each filter step costs one pass through the remaining file F and reduces the fraction of false positives by a factor of 8 (actually $1/(1 - e^{-1/8})$, about 8.5). Repeat passes until few false positives remain. Either accept some errors, or check the remaining strings, e.g., divide the surviving part of F into chunks that fit in memory and make a pass through S for each.

**Aside: Throwing Darts**

A number of times we are going to need to deal with the problem: if we throw k darts into n equally likely targets, what is the probability that a target gets at least one dart? Example: targets = bits, darts = hash values of elements.

**Throwing Darts (2)**

- $1 - 1/n$ is the probability that a given target is not hit by one dart.
- $(1 - 1/n)^k$ is the probability that it is not hit by any of the k darts.
- $1 - (1 - 1/n)^k$ is the probability that at least one dart hits the target.
- Equivalently, $(1 - 1/n)^k = \left((1 - 1/n)^n\right)^{k/n}$, and $(1 - 1/n)^n$ equals $1/e$ as $n \to \infty$, so the probability is approximately $1 - e^{-k/n}$.

**Throwing Darts (3)**

If $k \ll n$, then $e^{-k/n}$ can be approximated by the first two terms of its Taylor expansion: $1 - k/n$. Example: $10^9$ darts, $8 \times 10^9$ targets. True value: $1 - e^{-1/8} = 0.1175$. Approximation: $1 - (1 - 1/8) = 0.125$.

**Improvement: Superimposed Codes (Bloom Filters)**

We could use two hash functions, and hash each member of S to two bits of the bit array....
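The single-hash filter and the dart-throwing estimate from the slides above can be sketched in Python as follows. This is a minimal, scaled-down illustration rather than the lecture's own code: the function names (`bit_index`, `build_filter`, `may_contain`, `hit_probability`) are hypothetical, SHA-256 is an assumed stand-in for the unspecified hash function h, and the sizes (1,000 members, 8,000 bits) only preserve the 1:8 ratio of the billion-string example.

```python
import hashlib
import math


def bit_index(s: str, n_bits: int, seed: int = 0) -> int:
    """Hash a string to a bit position in [0, n_bits).

    SHA-256 is an assumption; the slides only require some hash function h.
    A different seed gives a different hash function for a second pass.
    """
    digest = hashlib.sha256(f"{seed}:{s}".encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % n_bits


def build_filter(members, n_bits: int, seed: int = 0) -> bytearray:
    """Start with all bits 0 and set the bit each member of S hashes to."""
    bits = bytearray((n_bits + 7) // 8)
    for s in members:
        i = bit_index(s, n_bits, seed)
        bits[i // 8] |= 1 << (i % 8)
    return bits


def may_contain(bits: bytearray, n_bits: int, s: str, seed: int = 0) -> bool:
    """True if s hashes to a 1 (may be in S); False means surely not in S."""
    i = bit_index(s, n_bits, seed)
    return bool(bits[i // 8] & (1 << (i % 8)))


def hit_probability(k: int, n: int):
    """Probability that a given target is hit by at least one of k darts.

    Returns the exact value, the 1 - e^(-k/n) form, and the k/n
    approximation that is reasonable when k is much smaller than n.
    """
    exact = 1 - (1 - 1 / n) ** k
    limit = 1 - math.exp(-k / n)
    linear = k / n
    return exact, limit, linear


# Scaled-down version of the example: 1,000 members of S, 8,000 bits.
n_bits = 8_000
members = [f"member-{i}" for i in range(1_000)]
bits = build_filter(members, n_bits)

# File F: a few real members plus many strings that are not in S.
candidates = members[:5] + [f"other-{i}" for i in range(10_000)]
survivors = [s for s in candidates if may_contain(bits, n_bits, s)]

print("survivors:", len(survivors), "of", len(candidates))
print("dart estimates:", hit_probability(10**9, 8 * 10**9))
```

Every member of S survives the filter, while the share of non-members that survive should sit near the fraction of bits set to 1, which is what the dart-throwing formula estimates. Building a second filter with a different `seed` and passing only the survivors through it corresponds to the repeated pass described in the solution summary.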
