…addition, ratios). Some variables
that extend over a wide range may be modified to construct a better predictor, such as using
the log of income instead of income.
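The log-of-income idea can be sketched as follows; the income values here are invented for illustration:

```python
import math

incomes = [20_000, 45_000, 80_000, 250_000, 1_200_000]  # hypothetical values

# The raw values span roughly two orders of magnitude; taking the log
# compresses that range so extreme incomes no longer dominate the predictor.
log_incomes = [math.log10(x) for x in incomes]

print(log_incomes)
```

The transformed values fall in a narrow band (about 4.3 to 6.1) while preserving the ordering of the original incomes.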
d. Transform variables. The tool you choose may dictate how you represent your data, for
instance, the categorical explosion required by neural nets. Variables may also be scaled to
fall within a limited range, such as 0 to 1. Many decision trees used for classification require
continuous data such as income to be grouped in ranges (bins) such as High, Medium, and
Low. The encoding you select can influence the result of your model. For example, the cutoff points for the bins may change the outcome of a model.
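A minimal sketch of the binning described above, showing how moving a cutoff changes the encoding a model sees. The cutoff values are invented for the example:

```python
def bin_income(income, low_cut=30_000, high_cut=80_000):
    """Encode a continuous income as Low / Medium / High.

    The cutoffs are hypothetical; shifting them changes which bin a
    record falls into, and therefore what the model is trained on.
    """
    if income < low_cut:
        return "Low"
    if income < high_cut:
        return "Medium"
    return "High"

income = 75_000
print(bin_income(income))                   # Medium under the default cutoffs
print(bin_income(income, high_cut=70_000))  # High: same record, different cutoff
```

The same record lands in a different bin when the upper cutoff moves, which is exactly why the choice of cutoff points can change a model's outcome.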
5. Data mining model building. The most important thing to remember about model building is
that it is an iterative process. You will need to explore alternative models to find the one that is
most useful in solving your business problem. What you learn in searching for a good model may
lead you to go back and make some changes to the data you are using or even modify your
problem statement.

© 1999 Two Crows Corporation

Once you have decided on the type of prediction you want to make (e.g., classification or
regression), you must choose a model type for making the prediction. This could be a decision
tree, a neural net, a proprietary method, or that old standby, logistic regression. Your choice of
model type will influence what data preparation you must do and how you go about it. For
example, a neural net tool may require you to explode your categorical variables. Or the tool may
require that the data be in a particular file format, thus requiring you to extract the data into that
format. Once the data is ready, you can proceed with training your model.
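"Exploding" a categorical variable, as mentioned above, is commonly done with one-hot (dummy) encoding. A minimal sketch, with invented category values:

```python
def one_hot(value, categories):
    """Explode one categorical value into a 0/1 indicator per category."""
    return [1 if value == c else 0 for c in categories]

# Hypothetical categorical variable with four levels.
regions = ["North", "South", "East", "West"]

print(one_hot("South", regions))  # [0, 1, 0, 0]
```

A single categorical column with k levels becomes k numeric columns, which is the form many neural net tools require.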
The process of building predictive models requires a well-defined training and validation protocol
to ensure the most accurate and robust predictions. This kind of protocol is sometimes
called supervised learning. The essence of supervised learning is to train (estimate) your model on
a portion of the data, then test and validate it on the remainder of the data. A model is built when
the cycle of training and testing is completed. Sometimes a third data set, called the validation
data set, is needed because the test data may be influencing features of the model, and the
validation set acts as an independent measure of the model’s accuracy.
Training and testing the data mining model requires the data to be split into at least two groups:
one for model training (i.e., estimation of the model parameters) and one for model testing. If you
don’t use different training and test data, the accuracy of the model will be overestimated. After
the model is generated using the training database, it is used to predict the test database, and the
resulting accuracy rate is a good estimate of how the model will perform on future databases that
are similar to the training and test databases. It does not guarantee that the model is correct. It
simply says that if the same technique were used on a succession of databases with similar data to
the training and test data, the average accuracy would be close to the one obtained this way.
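A random train/test split along the lines described above might look like this; the row count, split fraction, and seed are arbitrary choices for the sketch:

```python
import random

def split(rows, test_fraction=0.25, seed=42):
    """Randomly partition rows into a training set and a test set.

    The division must be random so that both sets reflect the data
    being modeled; the fraction and seed here are arbitrary.
    """
    rows = rows[:]  # copy so the caller's list is not mutated
    random.Random(seed).shuffle(rows)
    n_test = int(len(rows) * test_fraction)
    return rows[n_test:], rows[:n_test]   # (train, test)

train, test = split(list(range(1000)))
print(len(train), len(test))  # 750 250
```

The model's parameters would be estimated on `train` only; `test` is held back entirely until it is time to measure accuracy.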
Simple validation. The most basic testing method is called simple validation. To carry this out,
you set aside a percentage of the database as a test database, and do not use it in any way in the
model building and estimation. This percentage is typically between 5% and 33%. For all the
future calculations to be correct, the division of the data into two groups must be random, so that
the training and test data sets both reflect the data being modeled.
After building the model on the main body of the data, the model is used to predict the classes or
values of the test database. Dividing the number of incorrect classifications by the total number of
instances gives an error rate. Dividing the number of correct classifications by the total number of
instances gives an accuracy rate (i.e., accuracy = 1 – error). For a regression model, the goodness
of fit or “r-squared” is usually used as an estimate of the accuracy.
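The error-rate and accuracy-rate calculations defined above, computed from predicted versus actual classes (the labels are made up for the example):

```python
actual    = ["yes", "no", "yes", "yes", "no", "no",  "yes", "no"]
predicted = ["yes", "no", "no",  "yes", "no", "yes", "yes", "no"]

# Number of incorrect classifications over the total number of instances.
n_wrong = sum(a != p for a, p in zip(actual, predicted))
error_rate = n_wrong / len(actual)

# Accuracy = 1 - error, as in the text.
accuracy = 1 - error_rate

print(error_rate, accuracy)  # 0.25 0.75
```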
In building a single model, even this simple validation may need to be performed dozens of times.
For example, when using a neural net, sometimes each training pass through the net is tested
against a test database. Training then stops when the accuracy rates on the test database no longer
improve with additional iterations.
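The stop-when-test-accuracy-plateaus loop can be sketched as below. The training and evaluation callbacks are placeholders, not a real neural net API:

```python
def train_with_early_stopping(train_one_pass, test_accuracy,
                              patience=3, max_passes=100):
    """Keep training while accuracy on the test database still improves.

    `train_one_pass` and `test_accuracy` are hypothetical callbacks.
    Training stops after `patience` passes with no improvement.
    """
    best, stale = 0.0, 0
    for _ in range(max_passes):
        train_one_pass()
        acc = test_accuracy()
        if acc > best:
            best, stale = acc, 0
        else:
            stale += 1
            if stale >= patience:
                break
    return best

# Toy stand-ins: accuracy improves for five passes, then plateaus.
history = iter([0.6, 0.7, 0.8, 0.85, 0.9] + [0.9] * 95)
best = train_with_early_stopping(lambda: None, lambda: next(history))
print(best)  # 0.9
```

The loop halts well before `max_passes` because the test accuracy stops improving, which is the stopping rule the text describes.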
Cross validation. If you have only a modest amount of data (a few thousand rows) for building
the model, you can’t afford to set aside a percentage of it for simple validation. Cross validation is
a method that lets you use all your data. The data is randomly divided into two equal sets in order
to estimate the predictive accuracy of the model. First, a model is built on the first set and used to
predict the outcomes in the second set and calculate an error rate. Then a model is built on the
second set and used to predict the outcomes in the first set and again calculate an error rate.
Finally, a model is built using all the data. There are now two independent error estimates, which
can be averaged to give a better estimate of how the model built on all the data will perform.
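The two-fold procedure just described can be sketched with a toy threshold "model" standing in for a real data mining model; the data and the model are illustrative only:

```python
import random

def error_rate(model, data):
    """Fraction of (x, y) pairs the model misclassifies."""
    return sum(model(x) != y for x, y in data) / len(data)

def fit_threshold(data):
    """Toy 'model building': classify x by whether it exceeds the fold's mean x."""
    cut = sum(x for x, _ in data) / len(data)
    return lambda x: x > cut

# Illustrative data: the true label is whether x exceeds 50.
rng = random.Random(0)
data = [(x, x > 50) for x in range(100)]
rng.shuffle(data)
half = len(data) // 2
first, second = data[:half], data[half:]

# Build on each half, predict the other half, and average the error rates.
e1 = error_rate(fit_threshold(first), second)
e2 = error_rate(fit_threshold(second), first)
cv_error = (e1 + e2) / 2

# Finally, a model is built on ALL the data; cv_error estimates its accuracy.
final_model = fit_threshold(data)
print(cv_error)
```

Note that `cv_error` is reported for `final_model`, even though it was computed from the two half-data models; this is exactly the cross-validation logic in the text.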