cs345-lsh

1 Finding Similar Pairs
- Divide-Compute-Merge
- Locality-Sensitive Hashing
- Applications

2 Finding Similar Pairs
- Suppose we have in main memory data representing a large number of objects.
  - May be the objects themselves (e.g., summaries of faces).
  - May be signatures, as in minhashing.
- We want to compare each to each, finding those pairs that are sufficiently similar.

3 Candidate Generation From Minhash Signatures
- Pick a similarity threshold s, a fraction < 1.
- A pair of columns c and d is a candidate pair if their signatures agree in at least fraction s of the rows.
  - I.e., M(i, c) = M(i, d) for at least fraction s of the values of i.

4 Other Notions of "Sufficiently Similar"
- For images, a pair of vectors is a candidate if they differ by at most a small amount t in at least s% of the components.
- For entity records, a pair is a candidate if the sum of similarity scores of corresponding components exceeds a threshold.

5 Checking All Pairs is Hard
- While the signatures of all columns may fit in main memory, comparing the signatures of all pairs of columns is quadratic in the number of columns.
- Example: 10^6 columns implies about 5*10^11 comparisons.
- At 1 microsecond/comparison: about 6 days.

6 Solutions
1. Divide-Compute-Merge (DCM) uses external sorting and merging.
2. Locality-Sensitive Hashing (LSH) can be carried out in main memory, but admits some false negatives.

7 Divide-Compute-Merge
- Designed for "shingles" and docs.
  - Or other problems where data is presented by column.
- At each stage, divide data into batches that fit in main memory.
- Operate on individual batches and write out partial results to disk.
- Merge partial results from disk.
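The candidate-pair test under "Candidate Generation From Minhash Signatures" and the pair count under "Checking All Pairs is Hard" can be sketched in a few lines of Python. This is only an illustration, not code from the slides: the names `candidate_pairs` and `pair_count`, the dictionary layout of the signatures, and the toy signature values are all assumptions.

```python
from itertools import combinations

def candidate_pairs(signatures, s):
    """Return column pairs whose signatures agree in at least fraction s of rows.

    signatures: dict mapping a column name to its minhash signature
    (a list of equal length for every column). Hypothetical layout.
    """
    pairs = []
    # Brute force: every pair of columns is compared, which is the
    # quadratic cost the slides warn about.
    for c, d in combinations(sorted(signatures), 2):
        sig_c, sig_d = signatures[c], signatures[d]
        agree = sum(1 for x, y in zip(sig_c, sig_d) if x == y)
        if agree >= s * len(sig_c):
            pairs.append((c, d))
    return pairs

def pair_count(n):
    """Number of unordered pairs among n columns."""
    return n * (n - 1) // 2

# Toy signatures (made-up values, 4 rows each).
signatures = {
    "c1": [1, 3, 0, 2],
    "c2": [1, 3, 0, 5],
    "c3": [4, 3, 7, 2],
}
result = candidate_pairs(signatures, 0.75)

# Cost estimate for 10^6 columns at 1 microsecond per comparison.
comparisons = pair_count(10**6)
days = comparisons * 1e-6 / 86400
```

With these toy values only c1 and c2 agree in 3 of 4 rows, so they form the single candidate pair. For 10^6 columns the pair count is just under 5*10^11, which at 1 microsecond per comparison works out to a little under 6 days.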
8 DCM Steps
- Input, one list of shingles per document:
    doc1: s11,s12,…,s1k
    doc2: s21,s22,…,s2k
    …
- Invert:
    s11,doc1  s12,doc1  …  s1k,doc1  s21,doc2  …
- Sort on shingleId:
    t1,doc11  t1,doc12  …  t2,doc21  t2,doc22  …
- Invert and pair:
    doc11,doc12,1  doc11,doc13,1  …  doc21,doc22,1  …
- Sort on <docId1, docId2>:
    doc11,doc12,1  doc11,doc12,1  …  doc11,doc13,1  …
- Merge:
    doc11,doc12,2  doc11,doc13,10  …

9 DCM Summary
1. Start with the pairs <shingleId, docId>....
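The stages named under "DCM Steps" (invert, sort on shingleId, invert and pair, sort on <docId1, docId2>, merge) can be mimicked on a toy input as follows. This is a rough in-memory sketch, assuming everything fits in RAM; the real method divides the data into batches and relies on external sorting and merging on disk, which is not shown here. The function name `dcm_shared_shingles` and the sample documents are hypothetical.

```python
from itertools import combinations, groupby

def dcm_shared_shingles(docs):
    """Count shared shingles per document pair, following the DCM stages.

    docs: dict mapping a doc ID to a list of shingle IDs (hypothetical layout).
    In-memory stand-in for what DCM does with batches on disk.
    """
    # Invert: one <shingleId, docId> record per distinct shingle in each doc.
    inverted = [(sh, doc) for doc, shingles in docs.items() for sh in set(shingles)]

    # Sort on shingleId so records for the same shingle are adjacent.
    inverted.sort()

    # Invert and pair: for each shingle, emit <docId1, docId2, 1>
    # for every pair of docs that contain it.
    pair_records = []
    for _, group in groupby(inverted, key=lambda rec: rec[0]):
        doc_ids = [doc for _, doc in group]
        for d1, d2 in combinations(doc_ids, 2):
            pair_records.append((d1, d2, 1))

    # Sort on <docId1, docId2> so equal pairs are adjacent.
    pair_records.sort()

    # Merge: add up the counts of adjacent equal pairs.
    counts = {}
    for (d1, d2), group in groupby(pair_records, key=lambda rec: (rec[0], rec[1])):
        counts[(d1, d2)] = sum(n for _, _, n in group)
    return counts

# Made-up documents and shingle IDs.
docs = {
    "doc1": ["a", "b", "c"],
    "doc2": ["b", "c", "d"],
    "doc3": ["c", "e"],
}
shared = dcm_shared_shingles(docs)
```

Here doc1 and doc2 share shingles b and c, so their merged record carries a count of 2, while every other pair shares only c.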

This note was uploaded on 01/31/2011 for the course PHI 101 taught by Professor Gilmore during the Fall '09 term at UC Davis.
