Real-world Data is Dirty: Data Cleansing and the Merge/Purge Problem


B  C version of the equational theory

    /*
     * rule_program( number of tuples, first tuple, window size )
     * Compare all tuples inside a window. If a match is found, call merge_tuples().
     */
    void
    rule_program(int ntuples, int start, int wsize)
    {
        register int i, j;
        register WindowEntry *person1, *person2;
        boolean similar_ssns, similar_names, similar_addrs;
        boolean similar_city, similar_state, similar_zip;
        boolean very_similar_addrs, very_close_aptm, very_close_stnum, not_close;

        /* For all tuples under consideration */
        for (j = start; j < ntuples; j++) {
            /* person2 points to the j-th tuple */
            person2 = &tuples[j];

            /* For all other tuples inside the window
               (wsize-1 tuples before the j-th tuple). */
            for (i = j - 1; i > j - wsize && i >= 0; i--) {
                /* person1 points to the i-th tuple */
                person1 = &tuples[i];

                /* Compare person1 with person2 */

                /* RULE: find-similar-ssns */
                similar_ssns = same_ssn_p(person1->ssn, person2->ssn, 3);

                /* RULE: compare-names */
                similar_names = compare_names(person1->name, person2->name,
                                              person1->fname, person1->minit, person1->lname,
                                              person2->fname, person2->minit, person2->lname,
                                              person1->fname_init, person2->fname_init);

                /* RULE: same-ssn-and-name */
                if (similar_ssns && similar_names) {
                    merge_tuples(person1, person2);
                    continue;
                }

                /* RULE: compare-addresses */
                similar_addrs = compare_addresses(person1->stname, person2->stname);

                /* Compare other fields of the address */
                similar_city = same_city(person1->city, person2->city);
                similar_zip = same_zipcode(person1->zipcode, person2->zipcode);
                similar_state = (strcmp(person1->state, person2->state) == 0);

                /* RULEs: closer-addresses-use-zips and closer-address-use-states */
                very_similar_addrs = (similar_addrs && similar_city &&
                                      (similar_state || similar_zip));

                /* RULEs: same-ssn-and-address and same-name-and-address */
                if ((similar_ssns || similar_names) && very_similar_addrs) {
                    merge_tuples(person1, person2);
                    continue;
                }

                not_close = close_but_not_much(person1->stname, person2->stname);

                if (person1->stnum && person2->stnum)
                    very_close_stnum = very_close_num(person1->stnum, person2->stnum);
                else
                    very_close_stnum = FALSE;

                if (person1->aptm && person2->aptm)
                    very_close_aptm = very_close_str(person1->aptm, person2->aptm);
                else
                    very_close_aptm = FALSE;

                /* RULEs: compare-addresses-use-numbers-state,
                   compare-addresses-use-numbers-zipcode, and
                   same-address-except-city */
                if ((very_close_stnum && not_close && very_close_aptm &&
                     similar_city && (similar_state || similar_zip) && !similar_addrs) ||
                    (similar_addrs && very_close_stnum && very_close_aptm && similar_zip)) {
                    very_similar_addrs = TRUE;

                    /* RULEs: same-ssn-and-address and same-name-and-address (again) */
                    if (similar_ssns || similar_names) {
                        merge_tuples(person1, person2);
                        continue;
                    }
                }

                /* RULE: very-close-ssn-close-address */
                if (similar_addrs && similar_ssns && !similar_names)
                    if (same_ssn_p(person1->ssn, person2->ssn, 2)) {
                        merge_tuples(person1, person2);
                        continue;
                    }

                /* RULE: hard-case-1 */
                if (similar_ssns && very_similar_addrs && similar_zip &&
                    same_name_or_initial(person1->fname, person2->fname)) {
                    merge_tuples(person1, person2);
                    continue;
                }
            }
        }
    }
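
The window scan at the heart of rule_program() can be tried in isolation. The following is a minimal sketch, not the authors' code: ToyEntry, toy_similar(), window_scan(), and the tolerance parameter tol are all hypothetical names introduced here, and a single integer key stands in for the full name/SSN/address rule set. It keeps only the loop structure of the listing, in which tuple j is compared with at most wsize-1 tuples immediately before it.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical toy record: one integer key stands in for a full tuple. */
typedef struct {
    int key;
} ToyEntry;

/* Hypothetical match predicate (an assumption, not a rule from the paper):
   two entries are "similar" when their keys differ by at most tol. */
static int toy_similar(const ToyEntry *a, const ToyEntry *b, int tol)
{
    int d = a->key - b->key;
    if (d < 0)
        d = -d;
    return d <= tol;
}

/*
 * Scan tuples[start .. ntuples-1]. Each tuple j is compared with at most
 * wsize-1 tuples immediately before it, mirroring the two loops of
 * rule_program(). Returns the number of pairs judged similar.
 */
static int window_scan(const ToyEntry *tuples, int ntuples, int start,
                       int wsize, int tol)
{
    int i, j, matches = 0;

    assert(tuples != NULL && wsize >= 1);

    for (j = start; j < ntuples; j++) {
        for (i = j - 1; i > j - wsize && i >= 0; i--) {
            if (toy_similar(&tuples[i], &tuples[j], tol))
                matches++;   /* the real program would call merge_tuples() here */
        }
    }
    return matches;
}
```

With n tuples the scan performs at most n * (wsize - 1) comparisons, so the window size trades recall against running time: a window of 1 compares nothing, and widening the window only helps if the preceding sort has placed likely duplicates near each other.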

This document was uploaded on 02/15/2014.
