new patterns. We further subdivide the
discovery goal into prediction, where the system finds patterns for predicting the future
behavior of some entities, and description,
where the system finds patterns for presentation to a user in a human-understandable
form. In this article, we are primarily concerned with discovery-oriented data mining.
Data mining involves fitting models to, or
determining patterns from, observed data.
The fitted models play the role of inferred
knowledge: Whether the models reflect useful
or interesting knowledge is part of the overall, interactive KDD process where subjective
human judgment is typically required. Two
primary mathematical formalisms are used in
model fitting: (1) statistical and (2) logical.
The statistical approach allows for nondeterministic effects in the model, whereas a logical model is purely deterministic. We focus
primarily on the statistical approach to data
mining, which tends to be the most widely
used basis for practical data-mining applications given the typical presence of uncertainty in real-world data-generating processes.
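
The contrast between the two formalisms can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration and does not come from the article: the single input variable, the threshold of 50, the scale of 10, and the function names. The logical model returns a hard true/false answer, whereas the statistical model (here a logistic function) returns a probability and so leaves room for nondeterministic effects.

```python
import math

def logical_model(income, threshold=50.0):
    """Deterministic rule (hypothetical): a hard yes/no answer."""
    return income >= threshold

def statistical_model(income, threshold=50.0, scale=10.0):
    """Logistic model (hypothetical): probability of the positive class."""
    return 1.0 / (1.0 + math.exp(-(income - threshold) / scale))

if __name__ == "__main__":
    # Compare the two models on a few illustrative input values.
    for income in (30.0, 50.0, 70.0):
        print(income, logical_model(income), round(statistical_model(income), 3))
```

Near the threshold the logical model flips abruptly from one class to the other, while the statistical model moves gradually through intermediate probabilities.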
Most data-mining methods are based on
tried and tested techniques from machine
learning, pattern recognition, and statistics:
classification, clustering, regression, and so
on. The array of different algorithms under
each of these headings can often be bewildering to both the novice and the experienced
data analyst. It should be emphasized that of
the many data-mining methods advertised in
the literature, there are really only a few fundamental techniques. The actual underlying
model representation being used by a particular method typically comes from a composition of a small number of well-known options: polynomials, splines, kernel and basis
functions, threshold-Boolean functions, and
so on. Thus, algorithms tend to differ primarily in the goodness-of-fit criterion used to
evaluate model fit or in the search method
used to find a good fit.

[Figure 2. A Simple Data Set with Two Classes Used for Illustrative Purposes. Horizontal axis: Income; vertical axis: Debt; points are marked x or o.]
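
The observation that methods often share a model representation and differ mainly in the goodness-of-fit criterion or the search method can be sketched in code. The sketch below rests on stated assumptions and is not the article's own procedure: the toy data, the one-parameter logistic representation, the two criteria (misclassification count and negative log-likelihood), and the grid search over candidate thresholds are all hypothetical choices made for illustration.

```python
import math

# Hypothetical toy data: (feature value, class label). Not from the article.
DATA = [(1.0, 0), (2.0, 0), (3.0, 1), (4.0, 0), (5.0, 1), (6.0, 1)]

def predict_prob(x, t, scale=1.0):
    """Model representation: logistic curve with one parameter, the threshold t."""
    return 1.0 / (1.0 + math.exp(-(x - t) / scale))

def misclassification(t, data):
    """Fit criterion A: number of points on the wrong side of the threshold."""
    return sum((predict_prob(x, t) >= 0.5) != bool(y) for x, y in data)

def neg_log_likelihood(t, data):
    """Fit criterion B: negative log-likelihood of the labels under the model."""
    total = 0.0
    for x, y in data:
        p = predict_prob(x, t)
        total -= math.log(p if y == 1 else 1.0 - p)
    return total

def grid_search(criterion, data, candidates):
    """Search method: evaluate every candidate and keep the best-scoring one."""
    return min(candidates, key=lambda t: criterion(t, data))

CANDIDATES = [0.5 * k for k in range(15)]  # thresholds 0.0, 0.5, ..., 7.0

best_by_error = grid_search(misclassification, DATA, CANDIDATES)
best_by_nll = grid_search(neg_log_likelihood, DATA, CANDIDATES)

if __name__ == "__main__":
    print("threshold chosen by misclassification count:", best_by_error)
    print("threshold chosen by negative log-likelihood:", best_by_nll)
```

Swapping `misclassification` for `neg_log_likelihood`, or replacing `grid_search` with another optimizer, changes the "algorithm" while the underlying representation stays the same.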
In our brief overview of data-mining methods, we try in particular to convey the notion
that most (if not all) methods can be viewed
as extensions or hybrids of a few basic techniques and principles. We first discuss the primary methods of data mining and then show
that the data-mining methods can be viewed
as consisting of three primary algorithmic
components: (1) model representation, (2)
model evaluation, and (3) search. In the discussion of KDD and data-mining methods,
we use a simple example to make some of the
notions more concrete. Figure 2 shows a simple two-dimensional artificial data set consisting of 23 cases. Each point on the graph represents a person who has been given a loan
by a particular bank at some time in the past.
The horizontal axis represents the income of
the person; the vertical axis represents the total personal debt of the person (mortgage, car
payments, and so on). The data have been
classified into two classes: (1) the x’s rep...
This document was uploaded on 02/15/2014.