assoc-rules2-4

assoc-rules2-4 - 1 Improvements to A-Priori Bloom Filters...

Improvements to A-Priori

Bloom Filters
Park-Chen-Yu Algorithm
Multistage Algorithm
Approximate Algorithms
Compacting Results

Aside: Hash-Based Filtering

Simple problem: I have a set S of one billion strings of length 10. I want to scan a larger file F of strings and output those that are in S. I have 1GB of main memory, so I can't afford to store S in memory (at one byte per character, S alone would need about 10GB).

Solution (1)

Create a bit array of 8 billion bits (1GB), initially all 0s. Choose a hash function h with range $[0, 8 \times 10^9)$, and hash each member of S to one of the bits, which is then set to 1. Filter the file F by hashing each string and outputting only those that hash to a 1.

Solution (2)

[Figure: each string of file F is hashed by h into the bit array (e.g. 0010001011000). Strings that hash to a 1 go to the output; they may be in S. Strings that hash to a 0 are dropped; they are surely not in S.]

Solution (3)

As at most 1/8 of the bit array is 1, only 1/8th of the strings not in S get through to the output. If a string is in S, it surely hashes to a 1, so it always gets through. Can repeat with another hash function and bit array to reduce the false positives by another factor of 8.

Solution Summary

Each filter step costs one pass through the remaining file F and reduces the fraction of false positives by a factor of 8. Actually $1/(1 - e^{-1/8})$, which is about 8.5.

Repeat passes until few false positives remain. Either accept some errors, or check the remaining strings, e.g., divide surviving F into chunks that fit in memory and make a pass through S for each.

Aside: Throwing Darts

A number of times we are going to need to deal with the problem: if we throw k darts into n equally likely targets, what is the probability that a target gets at least one dart?

Example: targets = bits, darts = hash values of elements.

Throwing Darts (2)

Probability that the target is not hit by one dart: $1 - 1/n$.

Probability that the target is not hit by any of the k darts: $(1 - 1/n)^k = (1 - 1/n)^{n(k/n)}$. Since $(1 - 1/n)^n$ equals $1/e$ as $n \to \infty$, this is approximately $e^{-k/n}$.

Probability that at least one dart hits the target: $1 - (1 - 1/n)^k \approx 1 - e^{-k/n}$.

Throwing Darts (3)

If $k \ll n$, then $e^{-k/n}$ can be approximated by the first two terms of its Taylor expansion: $1 - k/n$.

Example: $10^9$ darts, $8 \times 10^9$ targets.

True value: $1 - e^{-1/8} = .1175$.
Approximation: $1 - (1 - 1/8) = .125$.

Improvement: Superimposed Codes (Bloom Filters)

We could use two hash functions, and hash each member of S to two bits of the bit array....
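
The single-hash filter from Solution (1) to (3) and the dart-throwing estimate can be tried out at small scale. The following is only a sketch under stated assumptions: SHA-256 stands in for the unspecified hash function h, the set is shrunk to 1,000 strings and 8,000 bits (keeping the 1:8 ratio of the slides), and all function and variable names (`bit_index`, `build_filter`, `may_contain`, `hit_probability`) are hypothetical, not taken from the slides.

```python
import hashlib
import math


def bit_index(s: str, n_bits: int) -> int:
    """Map a string to a bit position in [0, n_bits).

    SHA-256 is an assumed stand-in for the hash function h of the slides.
    """
    digest = hashlib.sha256(s.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % n_bits


def build_filter(members, n_bits: int) -> bytearray:
    """Start with all bits 0 and set the bit that each member of S hashes to."""
    bits = bytearray((n_bits + 7) // 8)
    for s in members:
        i = bit_index(s, n_bits)
        bits[i // 8] |= 1 << (i % 8)
    return bits


def may_contain(bits: bytearray, n_bits: int, s: str) -> bool:
    """True if s hashes to a 1 (may be in S); False means surely not in S."""
    i = bit_index(s, n_bits)
    return bool(bits[i // 8] & (1 << (i % 8)))


def hit_probability(k: int, n: int) -> float:
    """Probability that a given target gets at least one of k darts.

    Computes 1 - (1 - 1/n)^k via log1p/expm1 to limit rounding error.
    """
    return -math.expm1(k * math.log1p(-1.0 / n))


# Scaled-down S: 1,000 strings of length 10, with 8 bits per member.
S = [f"member{i:04d}" for i in range(1000)]
N_BITS = 8 * len(S)
bits = build_filter(S, N_BITS)

# Every member of S hashes to a 1, so none is dropped.
no_false_negatives = all(may_contain(bits, N_BITS, s) for s in S)

# Strings not in S: the fraction that gets through is the false-positive rate.
others = [f"other{i:05d}" for i in range(20000)]
false_positive_rate = sum(may_contain(bits, N_BITS, s) for s in others) / len(others)

# Dart-throwing numbers for the full-size example of the slides.
exact = hit_probability(10**9, 8 * 10**9)   # 1 - (1 - 1/n)^k
approx_exp = 1.0 - math.exp(-1 / 8)         # 1 - e^{-k/n}
approx_taylor = 1 / 8                       # k/n

print(no_false_negatives, false_positive_rate, exact, approx_exp, approx_taylor)
```

In this sketch the measured false-positive rate should land near the dart-throwing estimate rather than at exactly 1/8, because some members of S collide on the same bit and so fewer than 1/8 of the bits end up set.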

This document was uploaded on 03/04/2012.

