4 Results on Real-World Data
Even though the results we have achieved on a wide variety of statistically controlled generated data indicate that the multi-pass approach is quite good, some may not regard this as
a definitive validation of the efficacy of the techniques.
The State of Washington Department of Social and Health Services maintains large
databases of transactions made over the years with state residents. In March of 1995 the
Office of Children Administrative Research (OCAR) of the Department of Social and Health
Services posted a request on the KDD-nuggets [8] asking for assistance analyzing one of their
databases. We answered their request and this section details our results.
OCAR analyzes the database of payments by the State to families and businesses that
provide services to needy children. OCAR's goal is to answer questions such as: "How
many children are in foster care?", "How long do children stay in foster care?", "How many
different homes do children typically stay in?" To accurately answer such questions, the many
computer records for payments and services must be identified for each child. (Obviously,
without matching records with the appropriate individual client, the frequency distributions
for such services will be grossly in error.) Because no unique identifier for an individual
child exists, it must be generated and assigned by an algorithm that compares multiple
service records in the database. The fields used in the records to help identify a child include
name, birth date, case number, and social security number, each of which is unreliable, containing
misspellings, typographical errors, and incomplete information. This is not a unique situation
in real-world databases. It was the need to develop computer processes that more accurately
identified all records for a given child that spurred OCAR to seek assistance.

4.1 Database Description
Most of OCAR's data is stored in one relation that contains all payments to service providers
since 1984. There are currently approximately 6,000,000 total records in the relation and
this number grows by approximately 50,000 a month. The relation has 19 attributes, of
which the most relevant (those carrying information that can be used to infer the identity
of individual entities) are: First Name, Last Name, Birthday, Social Security Number, Case
Number, Service ID, Service dates (beginning and ending dates), Gender and Race, Provider
ID, Amount of Payment, Date of Payment, and Worker ID. Each record is 105 bytes long.
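As a rough sense of scale, the figures above can be multiplied out. This is only a back-of-the-envelope sketch: it assumes every record occupies exactly the stated 105 bytes with no separators, indexes, or other storage overhead, and the constant and function names are illustrative.

```python
# Figures quoted in the text; the names are illustrative, not from the paper.
RECORD_BYTES = 105          # stated length of one record
TOTAL_RECORDS = 6_000_000   # approximate current size of the relation
MONTHLY_GROWTH = 50_000     # approximate number of new records per month


def raw_size_bytes(records: int, record_bytes: int = RECORD_BYTES) -> int:
    """Raw size of `records` fixed-length records, ignoring any storage overhead."""
    return records * record_bytes


total_bytes = raw_size_bytes(TOTAL_RECORDS)      # 630,000,000 bytes
monthly_bytes = raw_size_bytes(MONTHLY_GROWTH)   # 5,250,000 bytes
```

Under these assumptions the relation holds roughly 630 million bytes of raw record data and grows by a little over 5 million bytes a month.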
The typical problems with the OCAR data are as follows:
1. Names are frequently misspelled. Sometimes nicknames or "similar sounding" names
are used instead of the real name. Also the parent or guardian's name is sometimes
used instead of the child's name.
2. Social security numbers or birthdays are missing or clearly wrong (e.g., some records
have the social security number "999999999"). Likewise, the parent or guardian's
information is sometimes used instead of the child's proper information.
3. The case number, which should uniquely identify a family, often changes when a child's
family moves to another part of the state, or is referred for service a second time after
more than a couple of years since the first referral.
4. There are records which cannot be assigned to any person because the name entered
in the record was not the child's name, but that of the service provider. Also, names
like "Anonymous Male" and "Anonymous Female" were used. (We call this last type
of record ghost records.)
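Some of the problems listed above can be flagged by inspecting one record at a time, before any record matching is attempted. The following is a minimal sketch of such a screening step, not the procedure used in this study. The field names (`first_name`, `last_name`, `ssn`, `birth_date`) and the flag labels are hypothetical, and the repeated-digit rule simply generalizes the "999999999" example. Misspellings, swapped parent or guardian information, changing case numbers, and provider names entered as the child's name cannot be detected this way, since they require comparing records or consulting other data.

```python
import re

# Placeholder names mentioned in the text (problem 4).
GHOST_NAMES = {"anonymous male", "anonymous female"}

# Nine identical digits, e.g. "999999999" (an assumed generalization of problem 2).
REPEATED_DIGIT_SSN = re.compile(r"^(\d)\1{8}$")


def quality_flags(record: dict) -> set:
    """Return the set of single-record data-quality flags raised by `record`."""
    flags = set()

    # Normalize the full name: lower-case and collapse whitespace.
    first = record.get("first_name") or ""
    last = record.get("last_name") or ""
    name = " ".join(f"{first} {last}".lower().split())
    if name in GHOST_NAMES:
        flags.add("ghost_name")

    # Keep only the digits of the social security number.
    ssn = re.sub(r"\D", "", record.get("ssn") or "")
    if not ssn:
        flags.add("ssn_missing")
    elif len(ssn) != 9 or REPEATED_DIGIT_SSN.match(ssn):
        flags.add("ssn_suspect")

    if not record.get("birth_date"):
        flags.add("birth_date_missing")

    return flags
```

A flagged record is not necessarily discarded; the flags only indicate which fields should not be trusted when records are later compared.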
Figure 6: Number of records per child as computed by OCAR; the horizontal axis is the number of records per cluster. (The graph is drawn only to
a cluster size of 50. In actuality, it continues to 500!)
Because of the private nature of the data recorded in the database, we cannot produce
sample records to illustrate each of the mentioned cases. Even so, any database administrator
responsible for large corporate or agency databases will immediately see the parallels here
to their data. (After all, real-world data is very dirty!)
OCAR provided a sample of their database to conduct this study. The sample, which
contains the data from only one service office, has 128,438 records (13.6 Mbytes). They
also provided us with their current individual identification number for each record in the
sample (the number that should uniquely identify each child in the database) according to
their own analysis. These OCAR-assigned identifiers serve as our basis for comparing the
accuracy over varying window sizes.
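To make the role of the window size concrete, here is a minimal sketch of a single sorted-neighborhood pass scored against reference identifiers. Records are sorted on a key and each record is compared only with the records that precede it inside a fixed-size window. The key function, the matching rule, and the pairwise precision/recall scoring are all illustrative assumptions supplied by the caller; they are not the matching rules or the accuracy measure used in this study, and the step that merges matched pairs into clusters is omitted.

```python
from collections import defaultdict
from itertools import combinations


def sorted_neighborhood_pairs(records, key, is_match, window):
    """One pass: sort by `key`, compare each record with the previous window-1 records."""
    order = sorted(range(len(records)), key=lambda i: key(records[i]))
    pairs = set()
    for pos, i in enumerate(order):
        # Neighbors are the (at most) window-1 records just before position `pos`.
        for j in order[max(0, pos - window + 1):pos]:
            if is_match(records[i], records[j]):
                pairs.add((min(i, j), max(i, j)))
    return pairs


def pairwise_scores(found_pairs, reference_ids):
    """Precision and recall of `found_pairs` against pairs sharing a reference id."""
    groups = defaultdict(list)
    for idx, ref in enumerate(reference_ids):
        groups[ref].append(idx)  # indices are appended in increasing order
    true_pairs = {p for members in groups.values() for p in combinations(members, 2)}
    hits = len(found_pairs & true_pairs)
    precision = hits / len(found_pairs) if found_pairs else 1.0
    recall = hits / len(true_pairs) if true_pairs else 1.0
    return precision, recall
```

Calling `sorted_neighborhood_pairs` with increasing `window` values and scoring each result with `pairwise_scores` shows the trade-off: a larger window costs more comparisons but can find matching records that sort further apart.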
Figure 6 shows the distribution of the number of records per individual detected by
OCAR. Most individuals in the database are represented by 1 to 10 records
(approximately 2,000 individuals are represented by a single record).
Note that individuals may be represented by as many as 30-40 records and, although not
shown in Figure 6, there are some individuals with more than 100 records, and o...
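A distribution like the one in Figure 6 can be tabulated directly from the per-record identifiers. The sketch below assumes the identifiers are available as a simple sequence with one entry per record; the function name is illustrative.

```python
from collections import Counter


def cluster_size_histogram(individual_ids):
    """Map each cluster size to the number of individuals having that many records."""
    records_per_individual = Counter(individual_ids)   # identifier -> record count
    return Counter(records_per_individual.values())    # record count -> individuals
```

For example, `cluster_size_histogram(ids)[1]` gives the number of individuals represented by a single record.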