In store planning,
for example, putting associated items physically close together may reduce the total value of market
baskets — customers may buy less overall because they no longer pick up unplanned items while
walking through the store in search of the desired items. Insight, analysis and experimentation are
usually required to achieve any benefit from association rules.
Graphical methods may also be very useful in seeing the structure of links. In Figure 3 each of the
circles represents a value or an event. The lines connecting them show a link. The thicker lines
represent stronger or more frequent linkages, thus emphasizing potentially more important
relationships such as associations. For instance, looking at an insurance database to detect potential
fraud might reveal that a particular doctor and lawyer work together on an unusually large number of
cases.

Figure 3. Linkage diagram

© 1999 Two Crows Corporation

PREDICTIVE DATA MINING

A hierarchy of choices
The goal of data mining is to produce new knowledge that the user can act upon. It does this by
building a model of the real world based on data collected from a variety of sources which may
include corporate transactions, customer histories and demographic information, process control data,
and relevant external databases such as credit bureau information or weather data. The result of the
model building is a description of patterns and relationships in the data that can be confidently used
for prediction.
To avoid confusing the different aspects of data mining, it helps to envision a hierarchy of the choices
and decisions you need to make before you start:
• Business goal
• Type of prediction
• Model type
At the highest level is the business goal: what is the ultimate purpose of mining this data? For
example, seeking patterns in your data to help you retain good customers, you might build one model
to predict customer profitability and a second model to identify customers likely to leave (attrition).
Your knowledge of your organization’s needs and objectives will guide you in formulating the goal of
your models.
The next step is deciding on the type of prediction that’s most appropriate: (1) classification:
predicting into what category or class a case falls, or (2) regression: predicting what numeric value a
variable will have (when the variable changes over time, this is called time series prediction). In the
example above, you might use regression to forecast the amount of profitability, and classification to
predict which customers might leave. These are discussed in more detail below.
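The difference between the two prediction types can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the customer records, the field names, and the two very simple models (a least-squares line for profit, a single cut-off rule for attrition) are all hypothetical and are not taken from the text above. The point is the shape of the output: regression returns a number, classification returns a class.

```python
# Hypothetical customer records (invented for illustration):
# (monthly_spend, support_calls, profit, left)
customers = [
    (20.0, 5, 35.0, True),
    (40.0, 4, 85.0, True),
    (60.0, 1, 125.0, False),
    (80.0, 0, 165.0, False),
    (100.0, 2, 210.0, False),
]

def fit_line(xs, ys):
    """Ordinary least squares for y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x

def fit_threshold(xs, labels):
    """Pick the cut-off t for which the rule 'x >= t' matches the most labels."""
    best_t, best_correct = None, -1
    for t in sorted(set(xs)):
        correct = sum((x >= t) == label for x, label in zip(xs, labels))
        if correct > best_correct:
            best_t, best_correct = t, correct
    return best_t

spend = [c[0] for c in customers]
calls = [c[1] for c in customers]
profit = [c[2] for c in customers]
left = [c[3] for c in customers]

# Regression: the prediction is a number (amount of profit).
slope, intercept = fit_line(spend, profit)

def predict_profit(monthly_spend):
    return slope * monthly_spend + intercept

# Classification: the prediction is a class (leaves or stays).
call_cutoff = fit_threshold(calls, left)

def predict_leaves(support_calls):
    return support_calls >= call_cutoff
```

Calling `predict_profit(70.0)` yields a profit estimate as a float, while `predict_leaves(6)` yields a boolean class label.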
Now you can choose the model type: a neural net to perform the regression, perhaps, and a decision
tree for the classification. There are also traditional statistical models to choose from such as logistic
regression, discriminant analysis, or general linear models. The most important model types for data
mining are described in the next section, on DATA MINING MODELS AND ALGORITHMS.
Many algorithms are available to build your models. You might build the neural net using
backpropagation or radial basis functions. For the decision tree, you might choose among CART,
C5.0, QUEST, or CHAID. Some of these algorithms are also discussed in DATA MINING MODELS AND
ALGORITHMS.
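One way to keep the levels of the hierarchy apart, together with the algorithm chosen for each model, is to record them explicitly for every model you plan to build. The sketch below is a hypothetical bookkeeping structure, not part of any data mining product; the class name `ModelingChoice` and its fields are assumptions made for illustration, and the two entries simply restate the customer-retention example from the text.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class ModelingChoice:
    """One path through the hierarchy of choices (hypothetical structure)."""
    business_goal: str
    prediction_type: str  # "classification" or "regression"
    model_type: str
    algorithm: str

    def __post_init__(self):
        # The text names exactly two types of prediction.
        if self.prediction_type not in ("classification", "regression"):
            raise ValueError(f"unknown prediction type: {self.prediction_type}")

# The customer-retention example: one goal, two models.
plan = [
    ModelingChoice("retain good customers", "regression",
                   "neural net", "backpropagation"),
    ModelingChoice("retain good customers", "classification",
                   "decision tree", "CART"),
]

def describe(choice):
    """Render a choice from the most general level to the most specific."""
    return " > ".join([choice.business_goal, choice.prediction_type,
                       choice.model_type, choice.algorithm])
```

Printing `describe(c)` for each entry in `plan` lists the decisions from the business goal down to the algorithm, which makes it easy to see when two models serve the same goal through different prediction types.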
When selecting a data mining product, be aware that products generally implement a particular
algorithm differently, even when they identify it by the same name. These implementation
differences can affect operational characteristics such as memory usage and data storage, as well as
performance characteristics such as speed and accuracy. Other key considerations to keep in mind are
covered later in the section on SELECTING DATA MINING PRODUCTS.
Many business goals are best met by building multiple model types using a variety of algorithms. You
may not be able to determine which model type is best until you’ve tried several approaches.
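Trying several approaches usually means fitting each candidate on the same data and comparing them on cases that were not used for fitting. The sketch below assumes a simple holdout comparison, which the text does not prescribe, and uses two deliberately tiny stand-in models (a majority-class baseline and a one-split rule) in place of real neural nets or decision trees. The data and all names are hypothetical.

```python
# Hypothetical labelled cases (invented for illustration): (support_calls, left)
train = [(0, False), (1, False), (2, False), (5, True),
         (6, True), (1, False), (4, True), (3, False)]
holdout = [(0, False), (2, False), (5, True), (7, True), (3, True)]

def fit_majority(cases):
    """Baseline: always predict the most common class in the training data."""
    n_true = sum(label for _, label in cases)
    majority = n_true * 2 > len(cases)
    return lambda x: majority

def fit_stump(cases):
    """One-split rule: predict True when x >= the best training cut-off."""
    best_t, best_correct = None, -1
    for t in sorted({x for x, _ in cases}):
        correct = sum((x >= t) == label for x, label in cases)
        if correct > best_correct:
            best_t, best_correct = t, correct
    return lambda x: x >= best_t

def accuracy(model, cases):
    """Fraction of cases for which the model's prediction matches the label."""
    return sum(model(x) == label for x, label in cases) / len(cases)

# Fit every candidate on the training cases, score it on the holdout cases.
candidates = {"majority baseline": fit_majority, "one-split stump": fit_stump}
scores = {name: accuracy(fit(train), holdout) for name, fit in candidates.items()}
best_name = max(scores, key=scores.get)
```

The same loop works for any number of candidates: each entry in `candidates` only needs to be a function that takes training cases and returns a predictor.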
Some terminology
In predictive models, the values or classes we are predicting are called the response, dependent or
target variables. The values used to make the prediction are called the predictor or independent
variables.
Predictive models are built, or trained, using data for which the value of the response variable is
already known. This kind of training is sometimes referred to as supervised learning, because
calculated or estimated values are compared with the known results. (By contrast, descriptive
techniques such as clustering, described in the previous section, are sometimes referred to as
unsupervised learning because there is no already-known result to guide the algorithms.)
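In code, this terminology maps onto two parallel collections: the predictor values for each case and the known response for the same case. The sketch below trains a nearest-centroid classifier, chosen here only because it is short (it is an assumption, not a method named in the text), and then compares the estimated values with the known results. The data is hypothetical.

```python
# Hypothetical training data (invented for illustration).
# Predictor (independent) variables: (age, monthly_spend) for each case.
predictors = [(25, 20.0), (30, 25.0), (45, 80.0), (50, 90.0)]
# Response (dependent, target) variable: did the customer leave?
response = [True, True, False, False]

def train_nearest_centroid(rows, targets):
    """Average the predictor rows of each class; return class -> centroid."""
    sums, counts = {}, {}
    for row, target in zip(rows, targets):
        acc = sums.setdefault(target, [0.0] * len(row))
        for i, value in enumerate(row):
            acc[i] += value
        counts[target] = counts.get(target, 0) + 1
    return {t: [s / counts[t] for s in acc] for t, acc in sums.items()}

def predict(centroids, row):
    """Return the class whose centroid is closest (squared Euclidean distance)."""
    def dist(centroid):
        return sum((a - b) ** 2 for a, b in zip(row, centroid))
    return min(centroids, key=lambda t: dist(centroids[t]))

# Supervised step: the known response values guide the model that is built.
centroids = train_nearest_centroid(predictors, response)

# Compare estimated values with the known results.
estimated = [predict(centroids, row) for row in predictors]
training_errors = sum(e != known for e, known in zip(estimated, response))
```

An unsupervised technique such as clustering would receive only `predictors`; there would be no `response` list to compare against.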
Classification problems aim to identify the characteristics that indicate the group to which each case
belongs. This pattern can be used both to understand the existing data and to predict how new
instances will behave. For example, you may want to predict whether individuals can be classif...
This note was uploaded on 01/19/2014 for the course STATS 315B taught by Professor Friedman during the Winter '08 term at Stanford.