# similarity2 - Applications of LSH: Entity Resolution

Applications of LSH:

- Entity Resolution
- Fingerprints
- Similar News Articles

## Desiderata

- Whatever form we use for LSH, we want:
  1. The time spent performing the LSH should be linear in the number of objects.
  2. The number of candidate pairs should be proportional to the number of truly similar pairs.
- Bucketizing guarantees (1).

## Entity Resolution

- The entity-resolution problem is to examine a collection of records and determine which refer to the same entity.
  - Entities could be people, events, etc.
- Typically, we want to merge records if their values in corresponding fields are similar.

## Matching Customer Records

- I once took a consulting job solving the following problem:
  - Company A agreed to solicit customers for Company B, for a fee.
  - They then argued over how many customers.
  - Neither recorded exactly which customers were involved.

## Customer Records – (2)

- Company B had about 1 million records of all its customers.
- Company A had about 1 million records describing customers, some of whom it had signed up for B.
- Records had name, address, and phone, but for various reasons, they could be different for the same person.

## Customer Records – (3)

- Step 1: Design a measure (“score”) of how similar records are:
  - E.g., deduct points for small misspellings (“Jeffrey” vs. “Jeffery”) or same phone with different area code.
- Step 2: Score all pairs of records; report high scores as matches.

## Customer Records – (4)

- Problem: (1 million)² is too many pairs of records to score....
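Steps 1 and 2 of the customer-records example can be sketched in Python as follows. This is only an illustration, not the scoring scheme actually used in the consulting job: the 100-point scale, the weights, the record layout (dicts with `name`, `address`, `phone`), and the function names `similarity`, `digits`, `score`, `match_all_pairs`, and `pair_count` are all assumptions. String similarity is taken from the standard library's `difflib.SequenceMatcher`, which stands in for whatever misspelling measure was really applied.

```python
from difflib import SequenceMatcher
from itertools import product

# All weights and the 100-point scale are hypothetical; the source only says
# that points are deducted for small differences between records.
MAX_SCORE = 100.0
NAME_WEIGHT = 40.0
ADDRESS_WEIGHT = 30.0
AREA_CODE_PENALTY = 10.0
PHONE_MISMATCH_PENALTY = 30.0


def similarity(a: str, b: str) -> float:
    """Case-insensitive string similarity in [0, 1] (1.0 means identical)."""
    return SequenceMatcher(None, a.strip().lower(), b.strip().lower()).ratio()


def digits(phone: str) -> str:
    """Keep only the digits of a phone number."""
    return "".join(ch for ch in phone if ch.isdigit())


def score(r1: dict, r2: dict) -> float:
    """Step 1: start from MAX_SCORE and deduct points for each difference."""
    s = MAX_SCORE
    s -= NAME_WEIGHT * (1.0 - similarity(r1["name"], r2["name"]))
    s -= ADDRESS_WEIGHT * (1.0 - similarity(r1["address"], r2["address"]))
    p1, p2 = digits(r1["phone"]), digits(r2["phone"])
    if p1 != p2:
        # Same last seven digits: same local number, different area code.
        same_local = len(p1) >= 7 and p1[-7:] == p2[-7:]
        s -= AREA_CODE_PENALTY if same_local else PHONE_MISMATCH_PENALTY
    return s


def match_all_pairs(records_a, records_b, threshold):
    """Step 2, brute force: score every (A, B) pair and keep the high scores."""
    return [(a, b) for a, b in product(records_a, records_b)
            if score(a, b) >= threshold]


def pair_count(n_a: int, n_b: int) -> int:
    """Number of pairs the brute-force approach has to score."""
    return n_a * n_b
```

With about 1 million records on each side, `pair_count(10**6, 10**6)` is 10^12. At a hypothetical rate of one million scored pairs per second, that is 10^6 seconds, or roughly 11.6 days, which is why scoring all pairs is impractical here.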
