1 Locality-Sensitive Hashing
Basic Technique
Hamming-LSH
Applications

2 Finding Similar Pairs
Suppose we have, in main memory, data representing a large number of objects.
The data may be the objects themselves (e.g., summaries of faces).
Or it may be signatures, as in minhashing.
We want to compare each to each, finding those pairs that are sufficiently similar.

3 Candidate Generation From Minhash Signatures
Pick a similarity threshold s, a fraction < 1.
A pair of columns c and d is a candidate pair if their signatures agree in at least fraction s of the rows.
I.e., M(i, c) = M(i, d) for at least fraction s of the values of i.

4 Candidate Generation (2)
For images, a pair of vectors is a candidate if they differ by at most a small threshold t in at least s% of the components.
For entity records, a pair is a candidate if the sum of similarity scores of corresponding components exceeds a threshold.

5 The Problem with Checking for Candidates
While the signatures of all columns may fit in main memory, comparing the signatures of all pairs of columns is quadratic in the number of columns.
Example: 10^6 columns implies 5*10^11 comparisons.
At 1 microsecond/comparison: 6 days.

6 Solutions
1. Divide-Compute-Merge (DCM) uses external sorting, merging.
2. Locality-Sensitive Hashing (LSH) can be carried out in main memory, but admits some false negatives.
3. Hamming-LSH: a variant LSH method.

7 Divide-Compute-Merge
Designed for shingles and docs.
At each stage, divide data into batches that fit in main memory.
Operate on individual batches and write out partial results to disk.
Merge partial results from disk.

8 DCM Steps
Input: doc1: s11, s12, ..., s1k   doc2: s21, s22, ..., s2k
Invert: s11,doc1   s12,doc1   s1k,doc1   s21,doc2
Sort on shingleId: t1,doc11   t1,doc12   t2,doc21   t2,doc22
Invert and pair: doc11,doc12,1   doc11,doc13,1   doc21,doc22,1
Sort on <docId1, docId2>: doc11,doc12,1   doc11,doc12,1   doc11,doc13,1
Merge: doc11,doc12,2   doc11,doc13,10

9 ...
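The candidate test from slide 3 and the cost estimate from slide 5 can be sketched in a few lines of Python. This is only an illustrative sketch, not code from the slides: the function names (agreement_fraction, candidate_pairs, num_pairs) and the toy signatures are assumptions made for the example, and each signature is assumed to be a list of minhash values, one per row of M.

```python
from itertools import combinations


def agreement_fraction(sig_c, sig_d):
    """Fraction of rows i for which M(i, c) == M(i, d)."""
    if len(sig_c) != len(sig_d) or not sig_c:
        raise ValueError("signatures must be non-empty and of equal length")
    matches = sum(1 for a, b in zip(sig_c, sig_d) if a == b)
    return matches / len(sig_c)


def candidate_pairs(signatures, s):
    """Brute-force candidate generation: test every pair of columns.

    `signatures` maps a column name to its signature (hypothetical layout).
    This is the quadratic all-pairs check that slide 5 warns about.
    """
    return [
        (c, d)
        for c, d in combinations(sorted(signatures), 2)
        if agreement_fraction(signatures[c], signatures[d]) >= s
    ]


def num_pairs(n):
    """Number of unordered pairs among n columns: n(n-1)/2."""
    return n * (n - 1) // 2


# Toy signatures, made up for illustration only.
signatures = {
    "c1": [3, 1, 4, 1, 5],
    "c2": [3, 1, 4, 2, 5],  # agrees with c1 in 4 of 5 rows
    "c3": [2, 7, 1, 8, 2],  # agrees with neither c1 nor c2
}
print(candidate_pairs(signatures, s=0.8))

# Cost estimate from slide 5: 10^6 columns at 1 microsecond per comparison.
pairs = num_pairs(10**6)
days = pairs * 1e-6 / 86_400  # seconds per day
print(pairs, round(days, 1))
```

With 10^6 columns the pair count is just under 5*10^11, and at 1 microsecond per comparison that works out to a little under 6 days, which is why the brute-force loop above is only practical for small inputs.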
 Fall '09
