similarity2

# similarity2 - 1 Applications of LSH Entity Resolution...


## Applications of LSH

- Entity Resolution
- Fingerprints
- Similar News Articles

## Desiderata

Whatever form we use for LSH, we want:

1. The time spent performing the LSH should be linear in the number of objects.
2. The number of candidate pairs should be proportional to the number of truly similar pairs.

Bucketizing guarantees (1).

## Entity Resolution

- The entity-resolution problem is to examine a collection of records and determine which refer to the same entity.
  - Entities could be people, events, etc.
- Typically, we want to merge records if their values in corresponding fields are similar.

## Matching Customer Records

- I once took a consulting job solving the following problem:
  - Company A agreed to solicit customers for Company B, for a fee.
  - They then argued over how many customers.
  - Neither recorded exactly which customers were involved.

## Customer Records – (2)

- Company B had about 1 million records of all its customers.
- Company A had about 1 million records describing customers, some of whom it had signed up for B.
- Records had name, address, and phone, but for various reasons, they could be different for the same person.

## Customer Records – (3)

- Step 1: Design a measure ("score") of how similar records are:
  - E.g., deduct points for small misspellings ("Jeffrey" vs. "Jeffery") or same phone with different area code.
- Step 2: Score all pairs of records; report high scores as matches.

## Customer Records – (4)

- Problem: (1 million)² is too many pairs of records to score....
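To make Steps 1 and 2 concrete, here is a minimal sketch of a record score and the brute-force all-pairs comparison. It is not the scoring scheme used in the consulting job; the 100-points-per-field scale, the 0.8 credit for a phone number that differs only in area code, the 250-point threshold, and the names `field_similarity`, `phone_score`, `score`, and `match_all_pairs` are all assumptions chosen for illustration. String similarity comes from Python's standard `difflib.SequenceMatcher`.

```python
from difflib import SequenceMatcher


def field_similarity(a: str, b: str) -> float:
    """Similarity ratio in [0, 1] between two strings, ignoring case."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def phone_score(p: str, q: str) -> float:
    """1.0 for identical digits; 0.8 (an assumed partial credit) when two
    10-digit numbers differ only in the area code; otherwise 0.0."""
    dp = "".join(ch for ch in p if ch.isdigit())
    dq = "".join(ch for ch in q if ch.isdigit())
    if dp == dq:
        return 1.0
    if len(dp) == 10 and len(dq) == 10 and dp[3:] == dq[3:]:
        return 0.8
    return 0.0


def score(r1: dict, r2: dict) -> float:
    """Step 1: score a pair of records out of 300 (up to 100 per field).
    Small misspellings lower the name and address terms only slightly."""
    return 100 * (
        field_similarity(r1["name"], r2["name"])
        + field_similarity(r1["address"], r2["address"])
        + phone_score(r1["phone"], r2["phone"])
    )


def match_all_pairs(records_a, records_b, threshold=250.0):
    """Step 2: score every (a, b) pair and keep the high scores.
    This does len(records_a) * len(records_b) score computations."""
    matches = []
    for a in records_a:
        for b in records_b:
            s = score(a, b)
            if s >= threshold:
                matches.append((a, b, s))
    return matches


# Made-up sample records, for illustration only.
company_a = [
    {"name": "Jeffrey Smith", "address": "1 Main St", "phone": "650-555-1234"},
]
company_b = [
    {"name": "Jeffery Smith", "address": "1 Main St", "phone": "415-555-1234"},
    {"name": "Alice Jones", "address": "99 Oak Ave", "phone": "212-555-0000"},
]
matches = match_all_pairs(company_a, company_b)
```

With about 1 million records on each side, the nested loop would call `score` roughly 10^12 times, which is the problem the last slide points to.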