Unformatted text preview: ied as
likely to respond to a direct mail solicitation, vulnerable to switching over to a competing longdistance phone service, or a good candidate for a surgical procedure.
Data mining creates classification models by examining already classified data (cases) and
inductively finding a predictive pattern. These existing cases may come from an historical database,
such as people who have already undergone a particular medical treatment or moved to a new longdistance service. They may come from an experiment in which a sample of the entire database is
tested in the real world and the results used to create a classifier. For example, a sample of a mailing
list would be sent an offer, and the results of the mailing used to develop a classification model to be
applied to the entire database. Sometimes an expert classifies a sample of the database, and this
classification is then used to create the model which will be applied to the entire database.
Regression uses existing values to forecast what other values will be. In the simplest case, regression
uses standard statistical techniques such as linear regression. Unfortunately, many real-world
problems are not simply linear projections of previous values. For instance, sales volumes, stock
prices, and product failure rates are all very difficult to predict because they may depend on complex
interactions of multiple predictor variables. Therefore, more complex techniques (e.g., logistic
regression, decision trees, or neural nets) may be necessary to forecast future values.
The same model types can often be used for both regression and classification. For example, the
CART (Classification And Regression Trees) decision tree algorithm can be used to build both
classification trees (to classify categorical response variables) and regression trees (to forecast
continuous response variables). Neural nets too can create both classification and regression models.
Time series forecasting predicts unknownfuture values based on a time-varying series of predictors.
Like regression, it uses known results to guide its predictions. Models must take into account the
distinctive properties of time, especially the hierarchy of periods (including such varied definitions as
the five- or seven-day work week, the thirteen-“month” year, etc.), seasonality, calendar effects such
as holidays, date arithmetic, and special considerations such as how much of the past is relevant. 10 © 1999 Two Crows Corporation DATA MINING MODELS AND ALGORITHMS Now let’s examine some of the types of models and algorithms used to mine data. Most products use
variations of algorithms that have been published in computer science or statistics journals, with their
specific implementations customized to meet the individual vendor’s goal. For example, many
vendors sell versions of the CART or CHAID decision trees with enhancements to work on parallel
computers. Some vendors have proprietary algorithms which, while not extensions or enhancements
of any published approach, may work quite well.
Most of the models and algorithms discussed in this section can be thought of as generalizations of
the standard workhorse of modeling, the linear regression model. Much effort has been expended in
the statistics, computer science, artificial intelligence and engineering communities to overcome the
limitations of this basic model. The common characteristic of many of the newer technologies we will
consider is that the pattern-finding mechanism is data-driven rather than user-driven. That is, the
relationships are found inductively by the software itself based on the existing data rather than
requiring the modeler to specify the functional form and interactions.
Perhaps the most important thing to remember is that no one model or algorithm can or should be
used exclusively. For any given problem, the nature of the data itself will affect the choice of models
and algorithms you choose. There is no “best” model or algorithm. Consequently, you will need a
variety of tools and technologies in order to find the best possible model.
Neural networks are of particular interest because they offer a means of efficiently modeling large and
complex problems in which there may be hundreds of predictor variables that have many
interactions.(Actual biological neural networks are incomparably more complex.) Neural nets may be
used in classification problems (where the output is a categorical variable) or for regressions (where
the output variable is continuous).
A neural network (Figure 4) starts with an input layer, where each node corresponds to a predictor
variable. These input nodes are connected to a number of nodes in a hidden layer. Each input node is
connected to every node in the hidden layer. The nodes in the hidden layer may be connected to nodes
in another hidden layer, or to an output layer. The output layer consists of one or more response
Hidden Layer Figure 4. A neural network with one hidden layer. © 1999 Two Crows Corpora...
View Full Document
This note was uploaded on 01/19/2014 for the course STATS 315B taught by Professor Friedman during the Winter '08 term at Stanford.
- Winter '08