new patterns. We further subdivide the
discovery goal into prediction, where the system finds patterns for predicting the future
behavior of some entities, and description,
where the system finds patterns for presentation to a user in a human-understandable
form. In this article, we are primarily concerned with discovery-oriented data mining.
Data mining involves fitting models to, or
determining patterns from, observed data.
The fitted models play the role of inferred
knowledge: whether the models reflect useful
or interesting knowledge is part of the overall, interactive KDD process where subjective
human judgment is typically required. Two
primary mathematical formalisms are used in
model fitting: (1) statistical and (2) logical.
The statistical approach allows for nondeterministic effects in the model, whereas a logical model is purely deterministic. We focus
primarily on the statistical approach to data
mining, which tends to be the most widely
used basis for practical data-mining applications given the typical presence of uncertainty in real-world data-generating processes.
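The contrast between the two formalisms can be made concrete with a minimal sketch. It assumes a single income variable and a made-up cutoff; the function names, the threshold of 50.0, and the logistic form are illustrative assumptions, not taken from the article.

```python
import math


def logical_model(income, threshold=50.0):
    """Logical (deterministic) model: a hard rule.

    The same income always maps to the same class, with no notion
    of uncertainty. The threshold is a hypothetical value.
    """
    return "good" if income >= threshold else "bad"


def statistical_model(income, threshold=50.0, scale=10.0):
    """Statistical model: returns P(good | income) instead of a label.

    A logistic curve is used here purely as an example of a model that
    admits nondeterministic effects; threshold and scale are hypothetical.
    """
    return 1.0 / (1.0 + math.exp(-(income - threshold) / scale))


if __name__ == "__main__":
    for income in (30.0, 50.0, 70.0):
        print(income, logical_model(income), round(statistical_model(income), 3))
```

Near the cutoff the logical model still commits to one class, while the statistical model reports a probability close to one half, which is one way to express the uncertainty the text refers to.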
Most data-mining methods are based on
tried and tested techniques from machine
learning, pattern recognition, and statistics:
classification, clustering, regression, and so
on. The array of different algorithms under
each of these headings can often be bewildering to both the novice and the experienced
data analyst. It should be emphasized that of
the many data-mining methods advertised in
the literature, there are really only a few fundamental techniques. The actual underlying
model representation being used by a particular method typically comes from a composition of a small number of well-known options: polynomials, splines, kernel and basis
functions, threshold-Boolean functions, and
so on. Thus, algorithms tend to differ primarily in the goodness-of-fit criterion used to
evaluate model fit or in the search method
used to find a good fit.

[Figure 2. A Simple Data Set with Two Classes Used for Illustrative Purposes. Horizontal axis: Income; vertical axis: Debt; cases plotted as x's and o's.]
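As a rough illustration of how methods can share a model representation and a search method yet differ in the goodness-of-fit criterion, the sketch below fits the simplest possible model, a single constant, by exhaustive search over a grid. The data values, the grid, and all function names are hypothetical choices made for this example.

```python
def squared_error(data, c):
    """Goodness-of-fit criterion 1: sum of squared residuals."""
    return sum((x - c) ** 2 for x in data)


def absolute_error(data, c):
    """Goodness-of-fit criterion 2: sum of absolute residuals."""
    return sum(abs(x - c) for x in data)


def fit_constant(data, criterion, candidates):
    """Search: try every candidate constant and keep the best-scoring one.

    The model representation (a constant) and the search (exhaustive
    grid) are fixed; only the criterion changes between calls.
    """
    return min(candidates, key=lambda c: criterion(data, c))


if __name__ == "__main__":
    data = [1, 2, 3, 4, 100]   # hypothetical values with one outlier
    grid = range(0, 101)       # hypothetical integer search grid
    print(fit_constant(data, squared_error, grid))
    print(fit_constant(data, absolute_error, grid))
```

With squared error the best constant is the mean of the data, which the outlier pulls upward; with absolute error it is the median, which the outlier barely affects. The two fits differ only because the evaluation criterion differs.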
In our brief overview of data-mining methods, we try in particular to convey the notion
that most (if not all) methods can be viewed
as extensions or hybrids of a few basic techniques and principles. We first discuss the primary methods of data mining and then show
that the data-mining methods can be viewed
as consisting of three primary algorithmic
components: (1) model representation, (2)
model evaluation, and (3) search. In the discussion of KDD and data-mining methods,
we use a simple example to make some of the
notions more concrete. Figure 2 shows a simple two-dimensional artificial data set consisting of 23 cases. Each point on the graph represents a person who has been given a loan
by a particular bank at some time in the past.
The horizontal axis represents the income of
the person; the vertical axis represents the total personal debt of the person (mortgage, car
payments, and so on). The data have been
classified into two classes: (1) the x’s rep...
This document was uploaded on 02/15/2014.
 Spring '14
