For example, many European data sets are constrained in
their use by privacy regulations that are far stricter than those in the United States.
b. Data Description. Describe the contents of each file or database table. Some of the properties
documented in a Data Description Report are:
• Number of fields/columns
• Number/percentage of records with missing values
• Field names
For each field:
• Data type
• Source of field
• Unit of measure
• Number of unique values
• List of values
• Range of values
• Number/percentage of missing values
• Collection information (e.g., how, where, conditions)
• Timeframe (e.g., daily, weekly, monthly)
• Specific time data (e.g., every Monday or every Tuesday)
• Primary key/foreign key relationships
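Many of the per-field properties above can be computed mechanically. The sketch below profiles a small hypothetical table with pandas; the column names and sample values are illustrative, not from the original text.

```python
# A minimal sketch of a data-description report, assuming a hypothetical
# customer table with one numeric and one nominal field.
import pandas as pd

df = pd.DataFrame({
    "customer_id": [1, 2, 3, 4],
    "income": [52000.0, None, 48000.0, 61000.0],
    "region": ["east", "west", "east", None],
})

print("fields:", len(df.columns))                  # number of fields/columns
print("records with missing values:",
      int(df.isna().any(axis=1).sum()), "of", len(df))

for col in df.columns:
    s = df[col]
    print(f"\n{col}: dtype={s.dtype}")
    print("  unique values:", s.nunique(dropna=True))
    print(f"  missing: {s.isna().sum()} ({s.isna().mean():.0%})")
    if pd.api.types.is_numeric_dtype(s):
        print("  range:", s.min(), "to", s.max())   # range of values
    else:
        print("  values:", sorted(s.dropna().unique()))  # list of values
```

Properties such as the source of each field, collection conditions, and key relationships must still be documented by hand; only the statistical properties can be derived from the data itself.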
© 1999 Two Crows Corporation

c. Selection. The next step in preparing the data mining database is to select the subset of data
to mine. This is not the same as sampling the database or choosing predictor variables.
Rather, it is a gross elimination of irrelevant or unneeded data. Other criteria for excluding
data may include resource constraints, cost, restrictions on data use, or quality problems.
d. Data quality assessment and data cleansing. GIGO (Garbage In, Garbage Out) is quite
applicable to data mining, so if you want good models you need to have good data. A data
quality assessment identifies characteristics of the data that will affect the model quality.
Essentially, you are trying to ensure not only the correctness and consistency of values but
also that all the data you have is measuring the same thing in the same way.
There are a number of types of data quality problems. Single fields may have an incorrect
value. For example, recently a man’s nine-digit Social Security identification number was
accidentally entered as income when the government computed his taxes! Even when
individual fields have what appear to be correct values, there may be incorrect combinations,
such as pregnant males. Sometimes the value for a field is missing. Inconsistencies must be
identified and removed when consolidating data from multiple sources.
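Checks for both kinds of error, implausible single-field values and impossible combinations of fields, can be expressed as simple validation rules. The sketch below uses hypothetical field names and thresholds to illustrate the idea.

```python
# A minimal sketch of rule-based data quality checks. The fields,
# the plausible income range, and the sample records are hypothetical.
import pandas as pd

records = pd.DataFrame({
    "gender":   ["M", "F", "M"],
    "pregnant": [True, True, False],
    "income":   [52000, 48000, 987654321],  # last value looks like a mis-keyed SSN
})

# Impossible combination of otherwise-valid fields (e.g. pregnant males)
bad_combo = (records["gender"] == "M") & records["pregnant"]

# Single field outside a plausible range
bad_range = ~records["income"].between(0, 10_000_000)

# Flagged records go to a review queue rather than straight into the model
print(records[bad_combo | bad_range])
```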
Missing data can be a particularly pernicious problem. If you have to throw out every record
with a field missing, you may wind up with a very small database or an inaccurate picture of
the whole database. The fact that a value is missing may be significant in itself. Perhaps only
wealthy customers regularly leave the “income” field blank, for instance. It can be
worthwhile to create a new variable to identify missing values, build a model using it, and
compare the results with those achieved by substituting for the missing value to see which
leads to better predictions.
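The indicator-variable approach above can be sketched in a few lines; the field name "income" and the sample values are hypothetical.

```python
# A minimal sketch: flag which records had "income" missing, so the
# fact of missingness itself is available to the model, then fill the
# blanks (here with the median) for comparison.
import pandas as pd

df = pd.DataFrame({"income": [52000.0, None, 61000.0, None]})

df["income_missing"] = df["income"].isna().astype(int)        # 1 where blank
df["income_filled"] = df["income"].fillna(df["income"].median())

print(df)
```

A model can then be built once with the indicator and once with only the substituted values, and the two compared on held-out data.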
Another approach is to calculate a substitute value. Some common strategies for calculating
missing values include using the modal value (for nominal variables), the median (for ordinal
variables), or the mean (for continuous variables). A less common strategy is to assign a
missing value based on the distribution of values for that variable. For example, if a database
consisted of 40% females and 60% males, then you might assign a missing gender entry the
value of “female” 40% of the time and “male” 60% of the time. Sometimes people build
predictive models using data mining techniques to predict missing values. This usually gives
a better result than a simple calculation, but is much more time-consuming.
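The substitution strategies above, mode for nominal fields, mean for continuous ones, and the distribution-based random assignment, can be sketched as follows. The field names and sample data are hypothetical.

```python
# A minimal sketch of missing-value substitution strategies.
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
df = pd.DataFrame({
    "gender": ["F", "M", None, "M", None, "F", "M", "M", "F", "M"],
    "income": [50.0, None, 70.0, 60.0, None, 55.0, 65.0, 80.0, None, 75.0],
})

# Mode for a nominal variable, mean for a continuous one
df["gender_mode"] = df["gender"].fillna(df["gender"].mode()[0])
df["income_mean"] = df["income"].fillna(df["income"].mean())

# Distribution-based: draw each missing gender in proportion to the
# observed values (here roughly 40% "F" / 60% "M")
probs = df["gender"].value_counts(normalize=True)
n_missing = int(df["gender"].isna().sum())
draws = rng.choice(probs.index.to_numpy(), size=n_missing, p=probs.values)
df.loc[df["gender"].isna(), "gender_sampled"] = draws
```

Predicting missing values with a model trained on the other fields is the more expensive alternative the text mentions; scikit-learn's iterative imputers are one common way to do it.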
Recognize that you will not be able to fix all the problems, so you will need to work around
them as best as possible. It is far preferable and more cost-effective to put in place procedures
and checks to avoid the data quality problems — “an ounce of prevention.” Usually, however,
you must build the models you need with the data you now have, and avoidance is something
you’ll work toward for the future.
e. Integration and consolidation. The data you need may reside in a single database or in
multiple databases. The source databases may be transaction databases used by the
operational systems of your company. Other data may be in data warehouses or data marts
built for specific purposes. Still other data may reside in a proprietary database belonging to
another company such as a credit bureau.
Data integration and consolidation combines data from different sources into a single mining
database and requires reconciling differences in data values from the various sources.
Improperly reconciled data is a major source of quality problems. There are often large
differences in the way data are defined and used in different databases. Some inconsistencies
may be easy to uncover, such as different addresses for the same customer. Others are harder
to resolve because they are subtle. For example, the same
customer may have different names or — worse — multiple customer identification numbers.
The same name may be used for different entities (homonyms), or different names may be
used for the same entity (synonyms).
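Reconciling records like these usually means normalizing names and addresses into match keys before comparing sources. The sketch below is a deliberately crude illustration with hypothetical records and normalization rules; production systems use far more robust matching.

```python
# A minimal sketch of matching the same customer across two sources
# despite different IDs and formatting. Records and rules are hypothetical.
import pandas as pd

src_a = pd.DataFrame({"cust_id": [101],  "name": ["J. Smith"],
                      "address": ["12 Oak St."]})
src_b = pd.DataFrame({"cust_id": [9001], "name": ["SMITH, J"],
                      "address": ["12 oak street"]})

def normalize(df):
    out = df.copy()
    # Name key: letters only, order-insensitive (handles "Last, First")
    out["name_key"] = (out["name"].str.upper()
                       .str.replace(r"[^A-Z]", "", regex=True)
                       .apply(lambda s: "".join(sorted(s))))
    # Address key: canonical street suffix, alphanumerics only
    out["addr_key"] = (out["address"].str.lower()
                       .str.replace(r"\bst\.?$|\bstreet$", "st", regex=True)
                       .str.replace(r"[^a-z0-9]", "", regex=True))
    return out

matches = normalize(src_a).merge(normalize(src_b),
                                 on=["name_key", "addr_key"],
                                 suffixes=("_a", "_b"))
print(matches[["cust_id_a", "cust_id_b"]])  # the two IDs are one customer
```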