Real-world Data is Dirty: Data Cleansing and the Merge/Purge Problem



...ring vs. Basic Sorted Neighborhood Method

4 Results on Real-World Data

Even though the results we have achieved on a wide variety of statistically controlled generated data indicate that the multi-pass approach is quite good, some may not regard this as a definitive validation of the efficacy of the techniques. The State of Washington Department of Social and Health Services maintains large databases of transactions made over the years with state residents. In March of 1995 the Office of Children Administrative Research (OCAR) of the Department of Social and Health Services posted a request on the KDD-nuggets [8] asking for assistance analyzing one of their databases. We answered their request, and this section details our results.

OCAR analyzes the database of payments by the State to families and businesses that provide services to needy children. OCAR's goal is to answer questions such as: "How many children are in foster care?", "How long do children stay in foster care?", and "How many different homes do children typically stay in?" To answer such questions accurately, the many computer records for payments and services must be identified for each child. (Obviously, without matching records with the appropriate individual client, the frequency distributions for such services will be grossly in error.) Because no unique identifier for an individual child exists, one must be generated and assigned by an algorithm that compares multiple service records in the database.

The fields used in the records to help identify a child include name, birth date, case number, and social security number, each of which is unreliable, containing misspellings, typographical errors, and incomplete information. This is not a unique situation in real-world databases. It was the need to develop computer processes that more accurately identified all records for a given child that spurred OCAR to seek assistance.
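The idea of generating an identifier by comparing records can be sketched as follows. This is a minimal illustration under stated assumptions, not the procedure used in this study: the matching rule `same_child`, the field names, and the sample records are all hypothetical, and every pair of records is compared, which would be far too slow for millions of records. Records connected by a chain of matches receive the same generated identifier.

```python
from itertools import combinations

def same_child(a, b):
    """Hypothetical matching rule: a shared non-empty SSN, or the same
    last name together with the same non-empty birth date."""
    if a["ssn"] and a["ssn"] == b["ssn"]:
        return True
    return bool(a["birth"]) and a["birth"] == b["birth"] \
        and a["last"].lower() == b["last"].lower()

def assign_ids(records):
    """Return one generated id per record; records linked by a chain of
    pairwise matches share an id (union-find over record indices)."""
    parent = list(range(len(records)))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i

    # All-pairs comparison, for illustration only.
    for i, j in combinations(range(len(records)), 2):
        if same_child(records[i], records[j]):
            parent[find(j)] = find(i)

    return [find(i) for i in range(len(records))]

# Invented records; none of these values come from the OCAR data.
records = [
    {"first": "Jon",  "last": "Smith", "birth": "1988-03-01", "ssn": "123456789"},
    {"first": "John", "last": "Smith", "birth": "1988-03-01", "ssn": ""},
    {"first": "John", "last": "Smyth", "birth": "",           "ssn": "123456789"},
    {"first": "Ann",  "last": "Lee",   "birth": "1990-07-12", "ssn": ""},
]
ids = assign_ids(records)
```

In this toy input the second and third records do not match each other directly, yet both match the first, so all three end up with the same identifier while the fourth stays separate.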
4.1 Database Description

Most of OCAR's data is stored in one relation that contains all payments to service providers since 1984. There are currently approximately 6,000,000 total records in the relation, and this number grows by approximately 50,000 a month. The relation has 19 attributes, of which the most relevant (those carrying information that can be used to infer the identity of individual entities) are: First Name, Last Name, Birthday, Social Security Number, Case Number, Service ID, Service dates (beginning and ending dates), Gender and Race, Provider ID, Amount of Payment, Date of Payment, and Worker ID. Each record is 105 bytes long.

The typical problems with the OCAR data are as follows:

1. Names are frequently misspelled. Sometimes nicknames or "similar sounding" names are used instead of the real name. Also, the parent or guardian's name is sometimes used instead of the child's name.

2. Social security numbers or birthdays are missing or clearly wrong (e.g., some records have the social security number "999999999"). Likewise, the parent or guardian's information is sometimes used instead of the child's proper information.

3. The case number, which should uniquely identify a family, often changes when a child's family moves to another part of the state, or when the child is referred for service a second time more than a couple of years after the first referral.

4. There are records which cannot be assigned to any person because the name entered in the record was not the child's name, but that of the service provider. Also, names like "Anonymous Male" and "Anonymous Female" were used. (We call this last type of record a ghost record.)

[Figure 6 plot omitted. x-axis: Number of Records per Cluster (0 to 50); y-axis: Distribution (0 to 2000).]

Figure 6: Number of records per child as computed by OCAR. (The graph is drawn only to a cluster size of 50. In actuality, it continues to 500!)
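Two of the problems listed above, placeholder social security numbers and the anonymous-name ghost records, lend themselves to a simple pre-check before any matching is attempted. The following is only a sketch: the function names and the nine-digit validity rule are assumptions made for illustration, and the only concrete values are the ones quoted in the list. Ghost records that carry a service provider's name would need a provider lookup that is not modeled here.

```python
# Values quoted in the problem list; the sets could be extended as needed.
PLACEHOLDER_SSNS = {"999999999"}
GHOST_NAMES = {"anonymous male", "anonymous female"}

def clean_ssn(ssn):
    """Return the nine-digit SSN, or None if it is missing, malformed,
    or a known placeholder (assumed validity rule: exactly nine digits)."""
    digits = "".join(ch for ch in ssn if ch.isdigit())
    if len(digits) != 9 or digits in PLACEHOLDER_SSNS:
        return None
    return digits

def is_ghost(first, last):
    """True when the name is one of the anonymous placeholders."""
    return f"{first} {last}".strip().lower() in GHOST_NAMES
```

Treating an unusable SSN as missing (`None`) rather than as a value keeps records with "999999999" from being matched to one another on that field alone.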
Because of the private nature of the data recorded in the database, we cannot produce sample records to illustrate each of the mentioned cases. Even so, any database administrator responsible for large corporate or agency databases will immediately see the parallels here to their data. (After all, real-world data is very dirty!)

OCAR provided a sample of their database to conduct this study. The sample, which contains the data from only one service office, has 128,438 records (13.6 Mbytes). They also provided us with their current individual identification number for each record in the sample (the number that should uniquely identify each child in the database) according to their own analysis. These OCAR-assigned identifiers serve as our basis for comparing the accuracy over varying window sizes.

Figure 6 shows the distribution of the number of records per individual detected by OCAR. Most individuals in the database are represented by 1 to 10 records (approximately 2,000 individuals are represented by a single record). Note that individuals may be represented by as many as 30-40 records and, although not shown in figure 6, there are some individuals with more than 100 records, and o...

This document was uploaded on 02/15/2014.
