cs345-lsh-2 - Locality-Sensitive Hashing Basic Technique...

Locality-Sensitive Hashing

  • Basic Technique
  • Hamming-LSH
  • Applications
Finding Similar Pairs

  • Suppose we have in main memory data representing a large number of objects.
  • The data may be the objects themselves (e.g., summaries of faces).
  • The data may be signatures, as in minhashing.
  • We want to compare each object to each other object, finding those pairs that are sufficiently similar.
Candidate Generation From Minhash Signatures

  • Pick a similarity threshold s, a fraction < 1.
  • A pair of columns c and d is a candidate pair if their signatures agree in at least fraction s of the rows.
  • I.e., M(i, c) = M(i, d) for at least fraction s of the values of i.
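The candidate-pair test above can be sketched in a few lines of Python. This is a minimal illustration, not code from the course: the function names `agreement_fraction` and `is_candidate_pair` are hypothetical, and each signature column is assumed to be a plain list of minhash values of equal length.

```python
def agreement_fraction(sig_c, sig_d):
    """Fraction of rows i for which M(i, c) == M(i, d)."""
    if len(sig_c) != len(sig_d) or not sig_c:
        raise ValueError("signatures must be non-empty and of equal length")
    matches = sum(1 for a, b in zip(sig_c, sig_d) if a == b)
    return matches / len(sig_c)


def is_candidate_pair(sig_c, sig_d, s):
    """True if the two signatures agree in at least fraction s of the rows."""
    return agreement_fraction(sig_c, sig_d) >= s


# Two made-up signature columns with 5 rows each; they agree in rows 0, 1, 3.
sig_c = [1, 5, 3, 8, 2]
sig_d = [1, 5, 4, 8, 7]

print(agreement_fraction(sig_c, sig_d))      # 0.6
print(is_candidate_pair(sig_c, sig_d, 0.5))  # True
print(is_candidate_pair(sig_c, sig_d, 0.8))  # False
```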
Candidate Generation (2)

  • For images, a pair of vectors is a candidate if they differ by at most a small threshold t in at least a fixed percentage of the components.
  • For entity records, a pair is a candidate if the sum of similarity scores of corresponding components exceeds a threshold.
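Both tests can be sketched as follows. Everything named here is an assumption made for illustration: `min_fraction` stands in for the percentage on the slide (its value is not given in this copy), and `exact_match` is a hypothetical per-field similarity score; real entity-resolution scores would usually be softer than exact equality.

```python
def image_candidate(u, v, t, min_fraction):
    """Candidate if |u[i] - v[i]| <= t in at least min_fraction of components."""
    if len(u) != len(v) or not u:
        raise ValueError("vectors must be non-empty and of equal length")
    close = sum(1 for a, b in zip(u, v) if abs(a - b) <= t)
    return close / len(u) >= min_fraction


def exact_match(a, b):
    """Hypothetical field score: 1 if the fields are identical, else 0."""
    return 1 if a == b else 0


def record_candidate(rec_a, rec_b, score_fns, threshold):
    """Candidate if the summed per-field similarity scores exceed threshold."""
    total = sum(f(a, b) for f, a, b in zip(score_fns, rec_a, rec_b))
    return total > threshold


# Image vectors: component differences are 1, 5, 0, 1, so 3 of 4 are within t = 2.
u = [10, 20, 30, 40]
v = [11, 25, 30, 41]
print(image_candidate(u, v, t=2, min_fraction=0.7))  # True

# Entity records: name and phone match exactly, address does not (score 2).
rec_a = ("Ann Smith", "12 Oak St", "555-1234")
rec_b = ("Ann Smith", "12 Oak Street", "555-1234")
print(record_candidate(rec_a, rec_b, [exact_match] * 3, threshold=1))  # True
```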
The Problem with Checking for Candidates

  • While the signatures of all columns may fit in main memory, comparing the signatures of all pairs of columns is quadratic in the number of columns.
  • Example: 10^6 columns implies 5*10^11 comparisons.
  • At 1 microsecond/comparison: 6 days.
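The arithmetic behind the example can be checked directly. The number of unordered pairs among n columns is n(n-1)/2, and the only assumption is the slide's own cost of 1 microsecond per comparison.

```python
n = 10**6                         # number of columns
pairs = n * (n - 1) // 2          # unordered pairs: n choose 2
seconds = pairs * 1e-6            # 1 microsecond per comparison
days = seconds / 86_400           # 86,400 seconds per day

print(pairs)           # 499999500000, i.e. about 5*10^11
print(round(days, 2))  # 5.79, i.e. about 6 days
```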
Solutions

  1. Divide-Compute-Merge (DCM) uses external sorting, merging.
  2. Locality-Sensitive Hashing (LSH) can be carried out in main memory, but admits some false negatives.
  3. Hamming LSH: a variant LSH method.
Divide-Compute-Merge

  • Designed for "shingles" and docs.
  • At each stage, divide data into batches that fit in main memory.
  • Operate on individual batches and write out partial results to disk.
  • Merge partial results from disk.
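The batch-then-merge pattern can be sketched with Python's standard `heapq.merge`, which performs a k-way merge of already-sorted inputs. This is a toy illustration under stated assumptions: in-memory lists stand in for the partial results that would be written to disk, `batch_size` stands in for the amount that fits in main memory, and the helper names are hypothetical.

```python
import heapq


def sorted_runs(records, batch_size):
    """Divide records into batches and sort each batch independently."""
    runs = []
    for start in range(0, len(records), batch_size):
        batch = records[start:start + batch_size]
        runs.append(sorted(batch))  # stand-in for "write partial result to disk"
    return runs


def merge_runs(runs):
    """Merge the sorted partial results into one sorted sequence."""
    return list(heapq.merge(*runs))


# Made-up <shingleId, docId> pairs in arbitrary order.
records = [("s3", "doc1"), ("s1", "doc2"), ("s2", "doc1"),
           ("s1", "doc1"), ("s3", "doc2")]

runs = sorted_runs(records, batch_size=2)
merged = merge_runs(runs)
print(merged == sorted(records))  # True
```

Because `heapq.merge` consumes its inputs lazily, the same pattern works when each run is an iterator over a file rather than a list.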
DCM Steps

  • Input documents: doc1: s11,s12,…,s1k and doc2: s21,s22,…,s2k
  • Invert: s11,doc1   s12,doc1   s1k,doc1   s21,doc2
  • Sort on shingleId: t1,doc11   t1,doc12   t2,doc21   t2,doc22
  • Invert and pair: doc11,doc12,1   doc11,doc13,1   doc21,doc22,1
  • Sort on <docId1, docId2>: doc11,doc12,1   doc11,doc12,1   doc11,doc13,1
  • Merge: doc11,doc12,2   doc11,doc13,10
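The pipeline above can be sketched end to end in memory. This is an illustrative reading of the diagram, not the course's implementation: the function name `dcm_shared_shingle_counts` and the sample documents are made up, and real DCM would run each sort as an external sort over batches rather than a single in-memory `sort()`.

```python
from itertools import combinations, groupby


def dcm_shared_shingle_counts(docs):
    """Count shared shingles per document pair via invert/sort/pair/sort/merge."""
    # Invert: emit one <shingleId, docId> pair per distinct shingle in each doc.
    pairs = [(sh, doc_id) for doc_id, shingles in docs.items()
             for sh in set(shingles)]

    # Sort on shingleId (ties broken by docId).
    pairs.sort()

    # Invert and pair: for each shingle, emit <docId1, docId2, 1> for every
    # pair of documents that contain it.
    triples = []
    for _, group in groupby(pairs, key=lambda p: p[0]):
        doc_ids = [doc_id for _, doc_id in group]
        for d1, d2 in combinations(doc_ids, 2):
            triples.append((d1, d2, 1))

    # Sort on <docId1, docId2>.
    triples.sort()

    # Merge: add up the counts of identical document pairs.
    counts = {}
    for key, group in groupby(triples, key=lambda t: (t[0], t[1])):
        counts[key] = sum(c for _, _, c in group)
    return counts


# Made-up documents, each given as a list of shingle ids.
docs = {
    "doc1": ["a", "b", "c"],
    "doc2": ["b", "c", "d"],
    "doc3": ["c", "e"],
}
counts = dcm_shared_shingle_counts(docs)
print(counts)
# {('doc1', 'doc2'): 2, ('doc1', 'doc3'): 1, ('doc2', 'doc3'): 1}
```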
DCM Summary

  1. Start with the pairs <shingleId, docId>.