Different names may be
used for the same entity (synonyms). There are often unit incompatibilities, especially when
data sources are consolidated from different countries; for example, U.S. dollars and
Canadian dollars cannot be added without conversion.
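The currency example above can be sketched as a small normalization step. The exchange rate, function name, and currency codes below are hypothetical illustrations, not part of the original text:

```python
# Normalize monetary fields to a single currency before aggregating.
# The exchange rate is a fixed placeholder for illustration, not a live rate.
CAD_PER_USD = 1.35  # assumed rate

def to_usd(amount, currency):
    """Convert an amount in a supported currency to U.S. dollars."""
    if currency == "USD":
        return amount
    if currency == "CAD":
        return amount / CAD_PER_USD
    raise ValueError(f"unknown currency: {currency}")

# Adding the raw numbers (100 + 135) would be wrong; converting first is not.
total_usd = to_usd(100.0, "USD") + to_usd(135.0, "CAD")  # ≈ 200.0
```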
f. Metadata construction. The information in the Dataset Description and Data Description
reports is the basis for the metadata infrastructure. In essence this is a database about the
database itself. It provides information that will be used in the creation of the physical
database as well as information that will be used by analysts in understanding the data and
building the models.
g. Load the data mining database. In most cases the data should be stored in its own database.
For large amounts or complex data, this will usually be a DBMS as opposed to a flat file.
Having collected, integrated and cleaned the data, it is now necessary to actually load the
database itself. Depending on the DBMS and hardware being used, the amount of data, and
the complexity of the database design, this may turn out to be a serious undertaking that
requires the expertise of information systems professionals.
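As a minimal illustration of the load step, the sketch below uses SQLite from Python's standard library; a production mining database would more likely use a server-class DBMS, and the table layout and rows here are hypothetical:

```python
import sqlite3

# Load cleaned, integrated records into a DBMS (SQLite for illustration).
# The schema and rows are hypothetical examples.
rows = [("A001", 52000.0, "NE"), ("A002", 71000.0, "W")]

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE customers (cust_id TEXT PRIMARY KEY, income REAL, region TEXT)"
)
conn.executemany("INSERT INTO customers VALUES (?, ?, ?)", rows)
conn.commit()

count = conn.execute("SELECT COUNT(*) FROM customers").fetchone()[0]
```

For large volumes, this is where bulk-load utilities, indexing decisions, and partitioning come in, which is why the step may call for information systems expertise.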
h. Maintain the data mining database. Once created, a database needs to be cared for. It needs
to be backed up periodically; its performance should be monitored; and it may need
occasional reorganization to reclaim disk storage or to improve performance. For a large,
complex database stored in a DBMS, the maintenance may also require the services of
information systems professionals.
3. Explore the data. See the DATA DESCRIPTION FOR DATA MINING section above for a detailed
discussion of visualization, link analysis, and other means of exploring the data. The goal is to
identify the most important fields in predicting an outcome, and determine which derived values
may be useful.
In a data set with hundreds or even thousands of columns, exploring the data can be as
time-consuming and labor-intensive as it is illuminating. A good interface and fast computer
response are very important in this phase, because the very nature of your exploration changes
when you have to wait even 20 minutes for some graphs, let alone a day.
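One simple, scriptable form of this exploration is ranking candidate fields by how strongly they track the outcome. The toy data and the Pearson-correlation heuristic below are illustrative assumptions, not a procedure prescribed by the text:

```python
# Rank candidate fields by absolute correlation with the outcome.
# Toy, hypothetical data: 'debt' tracks the outcome; 'tenure' barely does.
outcome = [0, 0, 1, 1, 1, 0]
fields = {
    "debt":   [1.0, 2.0, 8.0, 9.0, 7.5, 1.5],
    "tenure": [5.0, 4.0, 5.5, 4.5, 5.0, 4.8],
}

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / (vx * vy) ** 0.5

ranked = sorted(fields, key=lambda f: abs(pearson(fields[f], outcome)),
                reverse=True)
# 'debt' should rank first on this toy data.
```

Correlation only catches linear, single-field relationships; it is a first screen, not a substitute for the visualization and link analysis discussed above.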
4. Prepare data for modeling. This is the final data preparation step before building models. There
are four main parts to this step:
a. Select variables
b. Select rows
c. Construct new variables
d. Transform variables
a. Select variables. Ideally, you would take all the variables you have, feed them to the data
mining tool and let it find those which are the best predictors. In practice, this doesn’t work
very well. One reason is that the time it takes to build a model increases with the number of
variables. Another reason is that blindly including extraneous columns can lead to incorrect
models. A very common error, for example, is to use as a predictor variable data that can only
be known if you know the value of the response variable. People have actually used date of
birth to “predict” age without realizing it.
While in principle some data mining algorithms will automatically ignore irrelevant variables
and properly account for related (covariant) columns, in practice it is wise to avoid depending
solely on the tool. Often your knowledge of the problem domain can let you make many of
these selections correctly. For example, including ID number or Social Security number as
predictor variables will at best have no benefit and at worst may reduce the weight of other
variables.
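One way to encode the advice above is to strip identifier-like columns before modeling. The heuristic below (drop any column whose values are all distinct, as an ID's would be) is a rough sketch with hypothetical column names; it would need care on small samples or continuous fields:

```python
# Drop obviously non-predictive identifier columns before modeling.
# Column names and rows are hypothetical. Heuristic: a column whose
# values are all unique (like an ID or SSN) carries no generalizable signal.
records = [
    {"cust_id": "A001", "ssn": "111-11-1111", "income": 52000, "churned": 0},
    {"cust_id": "A002", "ssn": "222-22-2222", "income": 31000, "churned": 1},
    {"cust_id": "A003", "ssn": "333-33-3333", "income": 52000, "churned": 0},
]

def drop_identifier_columns(rows):
    """Remove columns in which every row has a distinct value."""
    n = len(rows)
    keep = [col for col in rows[0]
            if len({row[col] for row in rows}) < n]
    return [{col: row[col] for col in keep} for row in rows]

cleaned = drop_identifier_columns(records)
```

Domain knowledge remains the better filter; a heuristic like this only catches the mechanically obvious cases.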
b. Select rows. As in the case of selecting variables, you would like to use all the rows you have
to build models. If you have a lot of data, however, this may take too long or require buying a
bigger computer than you would like.
Consequently it is often a good idea to sample the data when the database is large. This yields
no loss of information for most business problems, although sample selection must be done
carefully to ensure the sample is truly random. Given a choice of either investigating a few
models built on all the data or investigating more models built on a sample, the latter
approach will usually help you develop a more accurate and robust model.
You may also want to throw out data that are clearly outliers. While in some cases outliers
may contain information important to your model building, often they can be ignored based
on your understanding of the problem. For example, they may be the result of incorrectly
entered data, or of a one-time occurrence such as a labor strike.
Sometimes you may need to add new records (e.g., for customers who made no purchases).
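The sampling and outlier advice above can be sketched as follows; the population, valid-range thresholds, and sample size are all hypothetical:

```python
import random

# Simple random sample of a large table, then drop gross outliers.
# Data, thresholds, and sample size are hypothetical.
random.seed(0)  # reproducible sample for this illustration
population = [{"purchase": random.gauss(100, 20)} for _ in range(100_000)]
population.append({"purchase": 1_000_000})  # a clearly erroneous entry

sample = random.sample(population, k=10_000)  # uniform random sample
clean = [r for r in sample if 0 <= r["purchase"] <= 10_000]
```

A seeded pseudo-random sample keeps the selection unbiased while staying reproducible; the range filter encodes the judgment, discussed above, that some extreme values are data-entry errors rather than signal.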
c. Construct new variables. It is often necessary to construct new predictors derived from the
raw data. For example, forecasting credit risk using a debt-to-income ratio rather than just
debt and income as predictor variables may yield more accurate results that are also easier to
understand. Certain variables that have little effect alone may need to be combined with
others, using various arithmetic or algebraic operations (e.g., a...
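The debt-to-income example can be made concrete with a small derived-variable function; the names and the zero-income guard are assumptions added for illustration, not from the original text:

```python
# Derived predictor: debt-to-income ratio, often more informative and
# easier to interpret than debt and income taken separately.
def debt_to_income(debt, income):
    """Return debt/income, or None when income is missing or non-positive."""
    if income is None or income <= 0:
        return None  # guard against divide-by-zero and bad data
    return debt / income

ratio = debt_to_income(15_000, 60_000)  # → 0.25
```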
This note was uploaded on 01/19/2014 for the course STATS 315B taught by Professor Friedman during the Winter '08 term at Stanford.