Introduction to Data Mining and Knowledge Discovery
INTRODUCTION
Data mining: In brief
Databases today can range in size into the terabytes — more than 1,000,000,000,000 bytes of data.
Within these masses of data lies hidden information of strategic importance. But when there are so
many trees, how do you draw meaningful conclusions about the forest?
The newest answer is data mining, which is being used both to increase revenues and to reduce costs.
The potential returns are enormous. Innovative organizations worldwide are already using data
mining to locate and appeal to higher-value customers, to reconfigure their product offerings to
increase sales, and to minimize losses due to error or fraud.
Data mining is a process that uses a variety of data analysis tools to discover patterns and
relationships in data that may be used to make valid predictions.
The first and simplest analytical step in data mining is to describe the data — summarize its statistical
attributes (such as means and standard deviations), visually review it using charts and graphs, and
look for potentially meaningful links among variables (such as values that often occur together). As
emphasized in the section on THE DATA MINING PROCESS, collecting, exploring and selecting the right
data are critically important.
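As a rough illustration of this descriptive step, the sketch below summarizes two columns and checks whether they move together. The column names and values are hypothetical, invented for the example, and the Pearson correlation is only one of many ways to look for links among variables.

```python
import statistics

# Hypothetical customer columns (illustrative values, not from any real database).
ages = [23, 35, 41, 29, 52, 47, 38, 31]
spend = [120.0, 310.0, 400.0, 180.0, 560.0, 495.0, 350.0, 240.0]

def describe(values):
    """Return the mean and sample standard deviation of a numeric column."""
    return statistics.mean(values), statistics.stdev(values)

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length columns."""
    mx, my = statistics.mean(xs), statistics.mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5

age_mean, age_sd = describe(ages)
spend_mean, spend_sd = describe(spend)
r = pearson(ages, spend)

print(f"age:   mean={age_mean:.1f}, sd={age_sd:.1f}")
print(f"spend: mean={spend_mean:.1f}, sd={spend_sd:.1f}")
print(f"correlation(age, spend) = {r:.2f}")
```

A correlation near 1 or -1 suggests the two columns carry overlapping information. It says nothing about why they move together.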
But data description alone cannot provide an action plan. You must build a predictive model based
on patterns determined from known results, then test that model on results outside the original
sample. A good model should never be confused with reality (you know a road map isn’t a perfect
representation of the actual road), but it can be a useful guide to understanding your business.
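The idea of building a model from known results and testing it outside the original sample can be sketched as follows. Everything here is an assumption made for illustration: the data is synthetic, and the "model" is a single income cutoff chosen to fit the training rows, far simpler than what a data mining tool would build.

```python
import random

# Synthetic records: (income in $1000s, responded to the offer?).
# The response rule below is invented purely to generate example data.
random.seed(0)
records = []
for _ in range(200):
    income = random.uniform(20, 100)
    responded = random.random() < (0.8 if income > 60 else 0.2)
    records.append((income, responded))

# Hold out rows the model never sees while it is being built.
random.shuffle(records)
train, holdout = records[:150], records[150:]

def accuracy(threshold, rows):
    """Share of rows where 'income > threshold' matches the actual response."""
    hits = sum((income > threshold) == responded for income, responded in rows)
    return hits / len(rows)

def fit_threshold(rows):
    """Pick the income cutoff that best separates responders in these rows."""
    candidates = sorted(income for income, _ in rows)
    return max(candidates, key=lambda t: accuracy(t, rows))

threshold = fit_threshold(train)
train_acc = accuracy(threshold, train)
holdout_acc = accuracy(threshold, holdout)

print(f"cutoff: {threshold:.1f}")
print(f"accuracy on training rows: {train_acc:.2f}")
print(f"accuracy on held-out rows: {holdout_acc:.2f}")
```

The held-out accuracy is the number to trust. The training accuracy tends to flatter the model, because the cutoff was chosen to fit those very rows.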
The final step is to empirically verify the model. For example, from a database of customers who
have already responded to a particular offer, you’ve built a model predicting which prospects are
likeliest to respond to the same offer. Can you rely on this prediction? Send a mailing to a portion of
the new list and see what results you get.
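One way to organize such a test mailing is sketched below. The function names, the 10% fraction, and the outcome counts are all hypothetical; real outcomes would come from the mailing itself.

```python
import random

def pick_test_mailing(prospect_ids, fraction, seed=42):
    """Randomly choose a portion of the new list for the test mailing."""
    rng = random.Random(seed)
    k = max(1, round(len(prospect_ids) * fraction))
    return rng.sample(prospect_ids, k)

def response_rates(results):
    """results: list of (model_predicted_likely, actually_responded) pairs.

    Returns the observed response rate for each predicted group."""
    rates = {}
    for flag in (True, False):
        group = [responded for predicted, responded in results if predicted == flag]
        rates[flag] = sum(group) / len(group) if group else None
    return rates

prospects = list(range(1000))          # hypothetical prospect IDs
test_ids = pick_test_mailing(prospects, 0.10)

# Made-up outcomes for the 100 test prospects, for illustration only.
observed = (
    [(True, True)] * 12 + [(True, False)] * 28 +
    [(False, True)] * 3 + [(False, False)] * 57
)
rates = response_rates(observed)

print(f"mailed {len(test_ids)} prospects")
print(f"response rate, predicted likely:   {rates[True]:.2f}")
print(f"response rate, predicted unlikely: {rates[False]:.2f}")
```

If the prospects the model flagged respond at a clearly higher rate than the rest, the prediction is holding up in the real world. If the two rates are close, the model needs another look before you mail the full list.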
Data mining: What it can’t do
Data mining is a tool, not a magic wand. It won’t sit in your database watching what happens and
send you e-mail to get your attention when it sees an interesting pattern. It doesn’t eliminate the need
to know your business, to understand your data, or to understand analytical methods. Data mining
assists business analysts with finding patterns and relationships in the data — it does not tell you the
value of the patterns to the organization. Furthermore, the patterns uncovered by data mining must be
verified in the real world.
Remember that the predictive relationships found via data mining are not necessarily causes of an
action or behavior. For example, data mining might determine that males with incomes between
$50,000 and $65,000 who subscribe to certain magazines are likely purchasers of a product you want
to sell. While you can take advantage of this pattern, say by aiming your marketing at people who fit
the pattern, you should not assume that any of these factors cause them to buy your product.
© 1999 Two Crows Corporation
To ensure meaningful results, it’s vital that you understand your data. The quality of your output will
often be sensitive to outliers (data values that are very different from the typical values in your
database), irrelevant columns or columns that vary together (such as age and date of birth), the way
you encode your data, and the data you leave in and the data you exclude. Algorithms vary in their
sensitivity to such data issues, but it is unwise to depend on a data mining product to make all the
right decisions on its own.
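Two of these data issues, outliers and columns that vary together, can be checked with a few lines of code. The sketch below is a minimal example on invented values. The outlier rule (distance from the median measured in median absolute deviations, with a cutoff of 5) is an assumption chosen for simplicity, not a recommendation from this document.

```python
import statistics

def flag_outliers(values, k=5.0):
    """Return values lying more than k median absolute deviations from the median."""
    med = statistics.median(values)
    mad = statistics.median([abs(v - med) for v in values])
    if mad == 0:
        return []  # no spread to measure against
    return [v for v in values if abs(v - med) > k * mad]

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length columns."""
    mx, my = statistics.mean(xs), statistics.mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5

# Hypothetical incomes in $1000s; the last value is very different from the rest.
incomes = [42, 55, 48, 61, 50, 47, 53, 950]
outliers = flag_outliers(incomes)

# Age and year of birth carry the same information (ages computed as of 1999).
birth_years = [1960, 1972, 1955, 1981, 1968, 1975, 1949, 1963]
ages = [1999 - year for year in birth_years]
r = pearson(ages, birth_years)

print("possible outliers:", outliers)
print(f"correlation(age, birth year) = {r:.2f}")
```

A flagged value is a prompt to investigate, not an instruction to delete it: it may be a data entry error or your best customer. A correlation at or near -1 or 1 suggests one of the two columns is redundant.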
Data mining will not automatically discover solutions without guidance. Rather than setting the vague
goal, “Help improve the response to my direct mail solicitation,” you might use data mining to find
the characteristics of people who (1) respond to your solicitation, or (2) respond AND make a large
purchase. The patterns data mining finds for those two goals may be very different.
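In practice, choosing between these goals means choosing what the model is asked to predict. The sketch below labels the same hypothetical mailing outcomes in both ways. The field names and the $200 cutoff for a "large" purchase are assumptions made for the example.

```python
# Hypothetical outcomes of a past mailing (illustrative values only).
customers = [
    {"id": 1, "responded": True,  "purchase": 250.0},
    {"id": 2, "responded": True,  "purchase": 40.0},
    {"id": 3, "responded": False, "purchase": 0.0},
    {"id": 4, "responded": True,  "purchase": 500.0},
    {"id": 5, "responded": False, "purchase": 0.0},
    {"id": 6, "responded": True,  "purchase": 75.0},
]

LARGE_PURCHASE = 200.0  # assumed cutoff; set this from your own business needs

def target_responded(customer):
    """Goal 1: did the customer respond to the solicitation?"""
    return customer["responded"]

def target_large_purchase(customer):
    """Goal 2: did the customer respond AND make a large purchase?"""
    return customer["responded"] and customer["purchase"] >= LARGE_PURCHASE

responders = [c["id"] for c in customers if target_responded(c)]
big_buyers = [c["id"] for c in customers if target_large_purchase(c)]

print("goal 1 positives:", responders)
print("goal 2 positives:", big_buyers)
```

The two goals mark different customers as positive examples, so a model trained on each will look for different characteristics.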
Although a good data mining tool shelters you from the intricacies of statistical techniques, it requires
you to understand th...