1 Improvements to A-Priori
Bloom Filters
Park-Chen-Yu Algorithm
Multistage Algorithm
Approximate Algorithms
Compacting Results

2 Aside: Hash-Based Filtering
◆ Simple problem: I have a set S of one billion strings of length 10.
◆ I want to scan a larger file F of strings and output those that are in S.
◆ I have 1GB of main memory.
  ◗ So I can't afford to store S in memory.

3 Solution – (1)
◆ Create a bit array of 8 billion bits, initially all 0's.
◆ Choose a hash function h with range [0, 8*10^9), and hash each member of S to one of the bits, which is then set to 1.
◆ Filter the file F by hashing each string and outputting only those that hash to a 1.

4 Solution – (2)
[Figure: each string of file F is hashed by h into the bit array (e.g., 0010001011000). Strings that hash to a 1 go to the output; they may be in S. Strings that hash to a 0 are dropped; they are surely not in S.]

5 Solution – (3)
◆ As at most 1/8 of the bit array is 1, only 1/8th of the strings not in S get through to the output.
◆ If a string is in S, it surely hashes to a 1, so it always gets through.
◆ Can repeat with another hash function and bit array to reduce the false positives by another factor of 8.

6 Solution – Summary
◆ Each filter step costs one pass through the remaining file F and reduces the fraction of false positives by a factor of 8.
  ◗ Actually 1/(1 - e^(-1/8)).
◆ Repeat passes until few false positives.
◆ Either accept some errors, or check the remaining strings.
  ◗ e.g., divide surviving F into chunks that fit in memory and make a pass through S for each.

7 Aside: Throwing Darts
◆ A number of times we are going to need to deal with the problem: If we throw k darts into n equally likely targets, what is the probability that a target gets at least one dart?
◆ Example: targets = bits, darts = hash values of elements.
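
The filtering scheme from the "Solution" slides can be sketched in Python at a much smaller scale. This is only an illustration under stated assumptions: the slides do not name a hash function, so salted BLAKE2b from the standard library stands in for h, and the helper names (bit_index, build_filter, may_contain, filter_stream) and the sample strings are hypothetical. The ratio of 8 bits per member of S follows the slides.

```python
import hashlib


def bit_index(s: str, n_bits: int, seed: int = 0) -> int:
    """Hash a string to a bit position in [0, n_bits).

    Assumption: salted BLAKE2b plays the role of h; each seed gives a
    different hash function for a repeated filtering pass.
    """
    digest = hashlib.blake2b(
        s.encode("utf-8"), digest_size=8, salt=seed.to_bytes(8, "little")
    ).digest()
    return int.from_bytes(digest, "little") % n_bits


def build_filter(members, n_bits: int, seed: int = 0) -> bytearray:
    """Start from all 0's and set the bit that each member of S hashes to."""
    bits = bytearray((n_bits + 7) // 8)
    for s in members:
        i = bit_index(s, n_bits, seed)
        bits[i >> 3] |= 1 << (i & 7)
    return bits


def may_contain(bits: bytearray, n_bits: int, s: str, seed: int = 0) -> bool:
    """True if s hashes to a 1 (may be in S); False means surely not in S."""
    i = bit_index(s, n_bits, seed)
    return bool(bits[i >> 3] & (1 << (i & 7)))


def filter_stream(strings, bits: bytearray, n_bits: int, seed: int = 0) -> list:
    """One pass over the file: keep only strings that hash to a 1."""
    return [s for s in strings if may_contain(bits, n_bits, s, seed)]


# Scaled-down demo: 1,000 members instead of one billion, 8 bits per member.
S = [f"member{i:04d}" for i in range(1000)]
non_members = [f"other{i:05d}" for i in range(10000)]
n_bits = 8 * len(S)

pass1 = build_filter(S, n_bits, seed=0)
pass2 = build_filter(S, n_bits, seed=1)  # second hash function and bit array

survivors1 = filter_stream(S + non_members, pass1, n_bits, seed=0)
survivors2 = filter_stream(survivors1, pass2, n_bits, seed=1)

fp1 = len(survivors1) - len(S)  # non-members that passed the first filter
fp2 = len(survivors2) - len(S)  # non-members that passed both filters
print("false-positive rate after pass 1:", fp1 / len(non_members))
print("false-positive rate after pass 2:", fp2 / len(non_members))
```

Every member of S survives both passes, while each pass should let through roughly one in eight of the remaining non-members.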

8 Throwing Darts – (2)
◆ Probability that a target is not hit by one dart: 1 - 1/n.
◆ Probability that at least one of k darts hits the target: 1 - (1 - 1/n)^k.
◆ Equivalently: 1 - (1 - 1/n)^(n(k/n)).
◆ Since (1 - 1/n)^n equals 1/e as n → ∞, this is approximately 1 - e^(-k/n).

9 Throwing Darts – (3)
◆ If k << n, then e^(-k/n) can be approximated by the first two terms of its Taylor expansion: 1 - k/n.
◆ Example: 10^9 darts, 8*10^9 targets.
  ◗ True value: 1 - e^(-1/8) = .1175.
  ◗ Approximation: 1 - (1 - 1/8) = .125.

10 Improvement: Superimposed Codes (Bloom Filters)
◆ We could use two hash functions, and hash each member of S to two bits of the bit array....
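
As a numeric check of the dart-throwing estimate on slides 8 and 9, the short Python sketch below compares the exact probability with the two approximations for the slides' example of 10^9 darts and 8*10^9 targets. The function names are hypothetical; log1p and expm1 are used only to keep the exact formula accurate in floating point when n is large.

```python
import math


def p_hit_exact(k: int, n: int) -> float:
    """1 - (1 - 1/n)^k, evaluated stably for large n."""
    return -math.expm1(k * math.log1p(-1.0 / n))


def p_hit_exp(k: int, n: int) -> float:
    """Limit form from slide 8: 1 - e^(-k/n)."""
    return 1.0 - math.exp(-k / n)


def p_hit_linear(k: int, n: int) -> float:
    """Two-term Taylor approximation from slide 9: 1 - (1 - k/n) = k/n."""
    return k / n


k, n = 10**9, 8 * 10**9
print(round(p_hit_exact(k, n), 4))   # 0.1175
print(round(p_hit_exp(k, n), 4))     # 0.1175
print(p_hit_linear(k, n))            # 0.125
```

The linear approximation overstates the fraction of bits set to 1, which is why the per-pass reduction in false positives is somewhat better than a factor of 8.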
 Fall '09
 jenisha
 Algorithms, Data Mining, hash function, main memory, Bloom filter, Apriori, PCY, negative border
