The fit is poor because only a weak correlation exists between
the two variables.
Clustering is a common descriptive task where one seeks to identify a finite set of categories or clusters to describe the data (Jain
and Dubes 1988; Titterington, Smith, and
Makov 1985). The categories can be mutually
exclusive and exhaustive or consist of a richer
representation, such as hierarchical or overlapping categories. Examples of clustering applications in a knowledge discovery context
include discovering homogeneous subpopulations for consumers in marketing databases
and identifying subcategories of spectra from
infrared sky measurements (Cheeseman and
Stutz 1996). Figure 5 shows a possible clustering of the loan data set into three clusters;
note that the clusters overlap, allowing data
points to belong to more than one cluster.
The original class labels (denoted by x’s and
o’s in the previous figures) have been replaced
by a + to indicate that the class membership
is no longer assumed known. Closely related
to clustering is the task of probability density
estimation, which consists of techniques for
estimating from data the joint multivariate
probability density function of all the variables or fields in the database (Silverman
1986).
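The clustering task described above can be illustrated with a minimal sketch. It uses plain k-means, which is a swapped-in technique chosen for brevity and not the method behind Figure 5: it produces mutually exclusive clusters, whereas the clusters in the figure overlap. The (income, debt) values, the function names, and the farthest-first choice of starting centers are all assumptions made for illustration.

```python
def sq_dist(p, q):
    """Squared Euclidean distance between two points."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def assign(points, centers):
    """Index of the nearest center for every point."""
    return [min(range(len(centers)), key=lambda i: sq_dist(p, centers[i]))
            for p in points]


def kmeans(points, k, iterations=20):
    """Plain k-means with hard (mutually exclusive) assignments."""
    # Farthest-first initialization: deterministic, spreads the centers out.
    centers = [points[0]]
    while len(centers) < k:
        centers.append(max(points,
                           key=lambda p: min(sq_dist(p, c) for c in centers)))
    for _ in range(iterations):
        labels = assign(points, centers)
        for i in range(k):
            members = [p for p, lab in zip(points, labels) if lab == i]
            if members:  # leave an empty cluster's center where it is
                centers[i] = tuple(sum(c) / len(members) for c in zip(*members))
    return centers, assign(points, centers)


# Hypothetical (income, debt) pairs; not taken from the loan data set.
points = [(20, 5), (22, 6), (21, 4),
          (50, 30), (52, 28), (51, 31),
          (80, 10), (82, 12), (81, 9)]
centers, labels = kmeans(points, 3)
for point, label in zip(points, labels):
    print(point, "-> cluster", label)
```

Because the class labels are never consulted, the grouping is driven by the data alone, which mirrors the replacement of the x and o labels by a + in the figure. A method with soft memberships, such as a mixture model, would be needed to let a point belong to more than one cluster.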
Summarization involves methods for finding a compact description for a subset of data. A simple example would be tabulating the
means and standard deviations for all fields.
More sophisticated methods involve the
derivation of summary rules (Agrawal et al.
1996), multivariate visualization techniques,
and the discovery of functional relationships
between variables (Zembowicz and Zytkow
1996). Summarization techniques are often
applied to interactive exploratory data analysis and automated report generation.
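The simple tabulation example mentioned above might look like the following sketch. The record layout, the field names, and the use of the sample (rather than population) standard deviation are assumptions; the text does not specify them.

```python
import statistics


def summarize(records):
    """Return {field: (mean, sample standard deviation)} for every field."""
    summary = {}
    for field in records[0]:
        values = [record[field] for record in records]
        summary[field] = (statistics.mean(values), statistics.stdev(values))
    return summary


# Hypothetical records; every field is assumed to be numeric.
loans = [
    {"income": 20.0, "debt": 5.0},
    {"income": 50.0, "debt": 30.0},
    {"income": 80.0, "debt": 10.0},
]
summary = summarize(loans)
for field, (mean, sd) in summary.items():
    print(f"{field}: mean={mean:.2f}, sd={sd:.2f}")
```

A table of this kind compresses each field to two numbers, which is the sense in which summarization yields a compact description of a subset of the data.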
Dependency modeling consists of finding a
model that describes significant dependencies
between variables. Dependency models exist
at two levels: (1) the structural level of the
model specifies (often in graphic form) which
variables are locally dependent on each other
and (2) the quantitative level of the model
specifies the strengths of the dependencies
using some numeric scale. For example, probabilistic dependency networks use conditional independence to specify the structural aspect of the model and probabilities or
correlations to specify the strengths of the dependencies (Glymour et al. 1987; Heckerman
1996). Probabilistic dependency networks are
increasingly finding applications in areas as
diverse as the development of probabilistic
medical expert systems from databases, information retrieval, and modeling of the human
genome.
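The two levels of a dependency model can be made concrete with a deliberately simplified sketch. It uses pairwise Pearson correlation as the numeric strength and keeps an edge whenever the absolute correlation reaches a threshold. This is an assumption made for illustration: it is not a probabilistic dependency network, which would rely on conditional independence for its structure. The variable names, data values, and the 0.5 threshold are hypothetical.

```python
import math
from itertools import combinations


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


def dependency_model(columns, threshold=0.5):
    """Return (edges, strengths) for a dict of named numeric columns."""
    # Quantitative level: a numeric strength for every pair of variables.
    strengths = {(a, b): pearson(columns[a], columns[b])
                 for a, b in combinations(columns, 2)}
    # Structural level: which pairs are treated as dependent at all.
    edges = [pair for pair, r in strengths.items() if abs(r) >= threshold]
    return edges, strengths


# Hypothetical columns, invented for this example.
columns = {
    "income": [20, 35, 50, 65, 80],
    "spending": [12, 20, 31, 39, 50],
    "age": [45, 23, 60, 31, 38],
}
edges, strengths = dependency_model(columns)
print("edges:", edges)
for pair, r in strengths.items():
    print(pair, round(r, 3))
```

Here the list of edges plays the role of the structural level and the correlation values play the role of the quantitative level. Pairwise correlation cannot distinguish direct from indirect dependence, which is one reason conditional independence is used in the networks cited above.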
[Figure 5. A Simple Clustering of the Loan Data Set into Three Clusters. Note that original labels are replaced by a +. Axes: Debt versus Income; regions labeled Cluster 1, Cluster 2, and Cluster 3.]
Change and deviation detection focuses on discovering the most significant changes in
the data from previously measured or normative values (Berndt and Clifford 1996; Guyon,
Matic, and Vapnik 1996; Kloesgen 1996;
Matheus, Piatetsky-Shapiro, and McNeill
1996; Bass...
This document was uploaded on 02/15/2014.