ct the
effectiveness of surgical procedures, medical tests or medications. Companies active in the financial
markets use data mining to determine market and industry characteristics as well as to predict
individual company and stock performance. Retailers are making more use of data mining to decide
which products to stock in particular stores (and even how to place them within a store), as well as to
assess the effectiveness of promotions and coupons. Pharmaceutical firms are mining large databases
of chemical compounds and of genetic material to discover substances that might be candidates for
development as agents for the treatment of disease.
Successful data mining
There are two keys to success in data mining. The first is coming up with a precise formulation of the
problem you are trying to solve. A focused statement usually results in the best payoff. The second
key is using the right data. After choosing from the data available to you, or perhaps buying external
data, you may need to transform and combine it in significant ways.
The more the model builder can “play” with the data, build models, evaluate results, and work with
the data some more (in a given unit of time), the better the resulting model will be. Consequently, the
degree to which a data mining tool supports this interactive data exploration is more important than
the algorithms it uses.
Ideally, the data exploration tools (graphics/visualization, query/OLAP) are well-integrated with the
analytics or algorithms that build the models.
© 1999 Two Crows Corporation
DATA DESCRIPTION FOR DATA MINING
Summaries and visualization
Before you can build good predictive models, you must understand your data. Start by gathering a
variety of numerical summaries (including descriptive statistics such as averages, standard deviations
and so forth) and looking at the distribution of the data. You may want to produce cross tabulations
(pivot tables) for multi-dimensional data.
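As a minimal sketch of this first step, the Python below computes a few descriptive statistics and a cross tabulation using only the standard library. The sales records and the field names (region, product, quantity) are invented for illustration and are not part of the original text.

```python
from collections import Counter
from statistics import mean, median, stdev

# Hypothetical records: (region, product, quantity sold). Illustrative data only.
sales = [
    ("East", "widget", 12),
    ("East", "gadget", 7),
    ("West", "widget", 9),
    ("West", "widget", 15),
    ("West", "gadget", 4),
    ("East", "widget", 10),
]

quantities = [qty for _, _, qty in sales]

# Numerical summaries of a continuous variable.
summary = {
    "count": len(quantities),
    "mean": mean(quantities),
    "median": median(quantities),
    "stdev": stdev(quantities),  # sample standard deviation
    "min": min(quantities),
    "max": max(quantities),
}

# Cross tabulation: number of records for each (region, product) pair.
crosstab = Counter((region, product) for region, product, _ in sales)


def print_crosstab(table):
    """Print the counts as a small region-by-product pivot table."""
    regions = sorted({r for r, _ in table})
    products = sorted({p for _, p in table})
    print("region".ljust(8) + "".join(p.rjust(8) for p in products))
    for r in regions:
        print(r.ljust(8) + "".join(str(table[(r, p)]).rjust(8) for p in products))


if __name__ == "__main__":
    for name, value in summary.items():
        print(f"{name}: {value:.2f}")
    print_crosstab(crosstab)
```

A dedicated data-analysis tool would normally do this for you; the point here is only that the summaries and the pivot table are simple counts and averages over the records.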
Data can be continuous, having any numerical value (e.g., quantity sold) or categorical, fitting into
discrete classes (e.g., red, blue, green). Categorical data can be further defined as either ordinal,
having a meaningful order (e.g., high/medium/low), or nominal, that is, unordered (e.g., postal codes).
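The distinction matters in practice because each kind of variable is usually prepared differently before modeling. The sketch below shows one common convention, assumed here rather than taken from the text: continuous values are used as they are, ordinal values are mapped to ranks, and nominal values become 0/1 indicator columns. The record fields and category lists are hypothetical.

```python
# Hypothetical records mixing the three kinds of variables described above.
records = [
    {"quantity_sold": 12.0, "priority": "high", "color": "red"},
    {"quantity_sold": 3.5, "priority": "low", "color": "blue"},
    {"quantity_sold": 7.25, "priority": "medium", "color": "green"},
]

# Ordinal: the order is meaningful, so map each level to a rank.
PRIORITY_RANK = {"low": 0, "medium": 1, "high": 2}

# Nominal: no order, so use one indicator (0/1) column per category.
COLORS = ["red", "blue", "green"]


def encode(record):
    """Return a numeric feature list for one record."""
    features = [record["quantity_sold"]]                # continuous: use as is
    features.append(PRIORITY_RANK[record["priority"]])  # ordinal: rank
    features.extend(1 if record["color"] == c else 0 for c in COLORS)  # nominal
    return features


encoded = [encode(r) for r in records]

if __name__ == "__main__":
    for row in encoded:
        print(row)
```

Mapping a nominal variable such as a postal code to a single number would impose an order that does not exist, which is why the indicator columns are used instead.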
Graphing and visualization tools are a vital aid in data preparation and their importance to effective
data analysis cannot be overemphasized. Data visualization most often provides the “Aha!” leading to
new insights and success. Some of the common and very useful graphical displays of data are
histograms or box plots that display distributions of values. You may also want to look at scatter plots
in two or three dimensions of different pairs of variables. The ability to add a third, overlay variable
greatly increases the usefulness of some types of graphs.
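To make the two distribution displays concrete, here is a rough sketch that builds a text histogram and the five-number summary that a box plot draws. It uses only the Python standard library; the ages are invented, and the bin width of 10 is an arbitrary choice for illustration.

```python
from collections import Counter
from statistics import quantiles

# Hypothetical customer ages; illustrative data only.
ages = [23, 25, 31, 35, 36, 38, 41, 42, 44, 47, 52, 58, 61, 79]


def text_histogram(values, bin_width):
    """Count integer values per fixed-width bin and return lines of '#' bars."""
    bins = Counter((v // bin_width) * bin_width for v in values)
    lines = []
    for start in range(min(bins), max(bins) + bin_width, bin_width):
        label = f"{start}-{start + bin_width - 1}"
        lines.append(f"{label:>7} | {'#' * bins[start]}")
    return lines


def five_number_summary(values):
    """Return (min, Q1, median, Q3, max), the numbers a box plot draws."""
    q1, q2, q3 = quantiles(values, n=4)
    return min(values), q1, q2, q3, max(values)


if __name__ == "__main__":
    print("\n".join(text_histogram(ages, bin_width=10)))
    print(five_number_summary(ages))
```

A real graphics package would draw these properly; the sketch only shows what is being counted and summarized underneath.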
Visualization works because it exploits the broader information bandwidth of graphics as opposed to
text or numbers. It allows people to see the forest and zoom in on the trees. Patterns, relationships,
exceptional values and missing values are often easier to perceive when shown graphically, rather
than as lists of numbers and text.
The problem in using visualization stems from the fact that models have many dimensions or
variables, but we are restricted to showing these dimensions on a two-dimensional computer screen or
paper. For example, we may wish to view the relationship between credit risk and age, sex, marital
status, own-or-rent, years in job, etc. Consequently, visualization tools must use clever representations
to collapse n dimensions into two. Increasingly powerful and sophisticated data visualization tools are
being developed, but they often require people to train their eyes through practice in order to
understand the information being conveyed. Users who are color-blind or who are not spatially
oriented may also have problems with visualization tools.
Clustering
Clustering divides a database into different groups. The goal of clustering is to find groups that are
very different from each other, and whose members are very similar to each other. Unlike
classification (see Predictive Data Mining, below), you don’t know what the clusters will be when
you start, or by which attributes the data will be clustered. Consequently, someone who is
knowledgeable in the business must interpret the clusters. Often you must modify the clustering by
excluding variables that were used to group instances, because on examination they turn out to be
irrelevant or not meaningful. After you have found clusters
that reasonably segment your database, these clusters may then be used to classify new data. Some of
the common algorithms used to perform clustering include Kohonen feature maps and K-means.
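For readers who want to see the mechanics, the following is a minimal sketch of K-means in plain Python. It is not the implementation of any particular tool. The squared Euclidean distance, the random choice of starting centroids, the fixed seed, and the sample points are all assumptions made for illustration. Kohonen feature maps are not shown.

```python
import random


def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length tuples."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def nearest_cluster(point, centroids):
    """Index of the centroid closest to the point (used to classify new data)."""
    return min(range(len(centroids)), key=lambda c: sq_dist(point, centroids[c]))


def kmeans(points, k, iterations=20, seed=0):
    """Basic K-means: alternate assignment and centroid update until stable."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)  # start from k of the data points
    assignments = None
    for _ in range(iterations):
        # Assignment step: attach each point to its nearest centroid.
        new_assignments = [nearest_cluster(p, centroids) for p in points]
        if new_assignments == assignments:
            break  # no point changed cluster, so the result is stable
        assignments = new_assignments
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, a in zip(points, assignments) if a == c]
            if members:  # keep the old centroid if a cluster is empty
                centroids[c] = tuple(sum(dim) / len(members) for dim in zip(*members))
    return centroids, assignments


# Hypothetical two-dimensional data with two visibly separate groups.
points = [(1.0, 1.0), (1.5, 2.0), (1.0, 1.5), (8.0, 8.0), (9.0, 8.5), (8.5, 9.0)]

if __name__ == "__main__":
    centroids, labels = kmeans(points, k=2)
    print(centroids, labels)
    print(nearest_cluster((0.5, 0.5), centroids))  # cluster for a new point
```

Note that the number of clusters k is supplied by the user and the cluster numbers themselves carry no meaning, which is why interpreting the groups still falls to someone who knows the business.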
Don’t confuse clustering with segmentation. Segmentation refers to the general problem of
identifying groups that have common characteristics. Clustering is a way to segment data into groups
that are not previously defined, whereas classi...
This note was uploaded on 01/19/2014 for the course STATS 315B taught by Professor Friedman during the Winter '08 term at Stanford.