Example-Based Methods: Use representative examples from the database to approximate a model; that is, predictions on new examples are derived from the properties of similar examples in the model whose prediction is known. Techniques include nearest-neighbor classification and regression algorithms (Dasarathy 1991) and case-based reasoning systems (Kolodner 1993). Figure 8 illustrates the use of a nearest-neighbor classifier for the loan data set: the class at any new point in the two-dimensional space is the same as the class of the closest point in the original training data set.
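The nearest-neighbor rule described above can be sketched in a few lines. This is a minimal illustration, not the article's implementation: the (income, debt) points and class labels are made up, and a plain Euclidean distance is assumed.

```python
import math

# Hypothetical training set in the spirit of the loan example:
# each point is (income, debt) with class "loan" or "no loan".
train = [
    ((65.0, 10.0), "loan"),
    ((80.0, 15.0), "loan"),
    ((40.0, 40.0), "no loan"),
    ((30.0, 25.0), "no loan"),
]

def nearest_neighbor(query):
    """Predict the class of the single closest training point (1-NN)."""
    def dist(item):
        (income, debt), _label = item
        return math.hypot(income - query[0], debt - query[1])
    return min(train, key=dist)[1]

# A point near the high-income, low-debt cluster is labeled "loan".
print(nearest_neighbor((70.0, 12.0)))
```

Note that the prediction is read directly off the stored examples; there is no explicit decision boundary, which is exactly why such models are powerful approximators but hard to interpret.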
A potential disadvantage of example-based methods (compared with tree-based methods) is that a well-defined distance metric for evaluating the distance between data points is required. For the loan data in figure 8, this would not be a problem because income and debt are measured in the same units. However, if one wished to include variables such as the duration of the loan, sex, and profession, then it would require more effort to define a sensible metric between the variables. Model evaluation is typically based on cross-validation estimates (Weiss and Kulikowski 1991) of a prediction error: parameters of the model to be estimated can include the number of neighbors to use for prediction and the distance metric itself. Like nonlinear regression methods, example-based methods are often asymptotically powerful in terms of approximation properties but, conversely, can be
difficult to interpret because the model is implicit in the data and not explicitly formulated. Related techniques include kernel-density estimation (Silverman 1986) and mixture modeling (Titterington, Smith, and Makov 1985).

[Figure 7. An Example of Classification Boundaries Learned by a Nonlinear Classifier (Such as a Neural Network) for the Loan Data Set.]

[Figure 8. Classification Boundaries for a Nearest-Neighbor Classifier for the Loan Data Set.]

FALL 1996 47 Articles

Probabilistic Graphic Dependency Models

Understanding data mining and model induction at this component level clarifies the behavior of any data-mining algorithm and makes it easier for the user to understand its overall contribution and applicability to the KDD process.

Graphic models specify probabilistic dependencies using a graph structure (Whittaker
1990; Pearl 1988). In its simplest form, the model specifies which variables are directly dependent on each other. Typically, these models are used with categorical or discrete-valued variables, but extensions to special cases, such as Gaussian densities, for real-valued variables are also possible. Within the AI and statistical communities, these models were initially developed within the framework of probabilistic expert systems; the structure of the model and the parameters (the conditional probabilities attached to the links of the graph) were elicited from experts. Recently, there has been significant work in both the AI and statistical communities on methods whereby...
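A graphical model of this kind can be sketched directly as a set of conditional probability tables keyed by each node's parents. The variables, structure, and numbers below are invented for illustration (a toy Income → Loan ← Debt network); they are not taken from the article.

```python
# A minimal sketch of a probabilistic graphic model over categorical
# variables. The graph says Loan depends directly on Income and Debt,
# which are marginally independent of each other. All numbers are made up.

p_income = {"high": 0.4, "low": 0.6}   # P(Income)
p_debt = {"high": 0.3, "low": 0.7}     # P(Debt)

# P(Loan = yes | Income, Debt): the parameters attached to the links.
p_loan_yes = {
    ("high", "low"): 0.9,
    ("high", "high"): 0.5,
    ("low", "low"): 0.4,
    ("low", "high"): 0.1,
}

def joint(income, debt, loan):
    """Joint probability via the factorization implied by the graph:
    P(I, D, L) = P(I) * P(D) * P(L | I, D)."""
    p_l = p_loan_yes[(income, debt)]
    if loan == "no":
        p_l = 1.0 - p_l
    return p_income[income] * p_debt[debt] * p_l

print(joint("high", "low", "yes"))  # 0.4 * 0.7 * 0.9 = 0.252
```

The point of the graph structure is that the full joint distribution factors into these small local tables, which is what makes both expert elicitation and learning from data tractable.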