Near-Neighbor Search
Applications
Matrix Formulation
Minhashing

Example Application: Face Recognition
- We have a database of (say) 1 million face images.
- We want to find the most similar images in the database.
- Represent faces by (relatively) invariant values, e.g., ratio of nose width to eye width.

Face Recognition – (2)
- Each image is represented by a large number (say 1000) of numerical features.
- Problem: given a face, find those in the DB that are close in at least ¾ (say) of the features.

Face Recognition – (3)
- Many-one problem: given a new face, see if it is close to any of the 1 million old faces.
- Many-many problem: which pairs of the 1 million faces are similar.

Simple Solution
- Represent each face by a vector of 1000 values and score the comparisons.
- Sort of OK for the many-one problem.
- Out of the question for the many-many problem (10^6 * 10^6 * 1000/2 numerical comparisons).
- We can do better!

Multidimensional Indexes Don't Work
[Figure: an index on dimension 1 with buckets 0-4, 5-9, 10-14, . . . ; new face: [6,14,…]]
- Surely we'd better look here.
- Maybe look here too, in case of a slight error.
- But the first dimension could be one of those that is not close.
- So we'd better look everywhere!

Another Problem: Entity Resolution
- Two sets of 1 million name-address-phone records.
- Some pairs, one from each set, represent the same person.
- Errors of many kinds:
  - Typos, missing middle initial, area-code changes, St./Street, Bob/Robert, etc., etc.

Entity Resolution – (2)
- Choose a scoring system for how close names are.
  - Deduct so much for edit distance > 0; so much for missing middle initial, etc.
- Similarly score differences in addresses, phone numbers.
- Sufficiently high total score -> records represent the same entity.

Simple Solution
- Compare each pair of records, one from each set.
- Score the pair.
- Call them the same if the score is sufficiently high.
- Unfeasible for 1 million records.
- We can do better!
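
The simple solution for entity resolution can be sketched in a few lines, which also makes its cost visible. This is only an illustration under stated assumptions: the starting score of 100, the per-field weights, the threshold of 80, and the names `edit_distance`, `score`, and `match_all_pairs` are hypothetical and do not come from the slides, which only say to "deduct so much" per difference.

```python
from itertools import product


def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance via the standard dynamic-programming recurrence."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


def score(rec_a: dict, rec_b: dict) -> int:
    """Start at 100 and deduct per field (weights are assumed, not from the slides)."""
    total = 100
    for field, weight in (("name", 5), ("address", 3), ("phone", 10)):
        total -= weight * edit_distance(rec_a[field].lower(), rec_b[field].lower())
    return total


def match_all_pairs(set_a: list, set_b: list, threshold: int = 80) -> list:
    """Naive approach: score every cross pair, i.e. len(set_a) * len(set_b) scores."""
    return [(a, b) for a, b in product(set_a, set_b) if score(a, b) >= threshold]


set_a = [{"name": "Bob Smith", "address": "12 Main St", "phone": "555-1234"}]
set_b = [
    {"name": "Bob Smith", "address": "12 Main St.", "phone": "555-1234"},
    {"name": "Alice Jones", "address": "12 Main St", "phone": "555-9999"},
]
matches = match_all_pairs(set_a, set_b)
```

With two sets of 1 million records each, `match_all_pairs` would score 10^6 * 10^6 = 10^12 pairs, which is why the all-pairs approach does not scale.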
Example: Similar Customers
- Common pattern: looking for sets with a relatively large intersection.
- Represent a customer, e.g., of Netflix, by the set of movies they rented.
- Similar customers have a relatively large fraction of their choices in common.

Example: Similar Products
- Dual view of product-customer relationship....
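
The "relatively large fraction of choices in common" test for similar customers can be sketched with ordinary set operations. The visible slides do not name a measure, so this sketch assumes Jaccard similarity (intersection size divided by union size) as one common choice. The customer names, movie IDs, the 0.5 threshold, and the function names are hypothetical.

```python
from itertools import combinations


def jaccard(a: set, b: set) -> float:
    """|a & b| / |a | b|; defined here as 0.0 when both sets are empty."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def similar_customers(rentals: dict, threshold: float = 0.5) -> list:
    """Return customer pairs whose rented-movie sets meet the threshold.

    Checks every pair, so the work grows quadratically with the
    number of customers.
    """
    return [
        (c1, c2)
        for c1, c2 in combinations(sorted(rentals), 2)
        if jaccard(rentals[c1], rentals[c2]) >= threshold
    ]


# Each customer is represented by the set of movies they rented.
rentals = {
    "ann": {"m1", "m2", "m3"},
    "bob": {"m2", "m3", "m4"},
    "cat": {"m9"},
}
pairs = similar_customers(rentals)
```

Here `ann` and `bob` share 2 of the 4 distinct movies between them, so their similarity is 0.5 and they are reported as a pair; `cat` shares nothing with either.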
 Fall '09
 hash function, main memory, Cryptographic hash function, rows, similar customers
