From Data Mining to Knowledge Discovery in Databases

The categories can be mutually exclusive and

Info iconThis preview shows page 1. Sign up to view the full content.

View Full Document Right Arrow Icon
This is the end of the preview. Sign up to access the rest of the document.

Unformatted text preview: : The fit is poor because only a weak correlation exists between the two variables. Clustering is a common descriptive task Articles where one seeks to identify a finite set of categories or clusters to describe the data (Jain and Dubes 1988; Titterington, Smith, and Makov 1985). The categories can be mutually exclusive and exhaustive or consist of a richer representation, such as hierarchical or overlapping categories. Examples of clustering applications in a knowledge discovery context include discovering homogeneous subpopulations for consumers in marketing databases and identifying subcategories of spectra from infrared sky measurements (Cheeseman and Stutz 1996). Figure 5 shows a possible clustering of the loan data set into three clusters; note that the clusters overlap, allowing data points to belong to more than one cluster. The original class labels (denoted by x’s and o’s in the previous figures) have been replaced by a + to indicate that the class membership is no longer assumed known. Closely related to clustering is the task of probability density estimation, which consists of techniques for estimating from data the joint multivariate probability density function of all the variables or fields in the database (Silverman 1986). Summarization involves methods for finding a compact description for a subset of data. A simple example would be tabulating the mean and standard deviations for all fields. More sophisticated methods involve the derivation of summary rules (Agrawal et al. 1996), multivariate visualization techniques, and the discovery of functional relationships between variables (Zembowicz and Zytkow 1996). Summarization techniques are often applied to interactive exploratory data analysis and automated report generation. Dependency modeling consists of finding a model that describes significant dependencies between variables. Dependency models exist at two levels: (1) the s tructural level of the model specifies (often in graphic form) which variables are locally dependent on each other and (2) the q uantitative level o f the model specifies the strengths of the dependencies using some numeric scale. For example, probabilistic dependency networks use conditional independence to specify the structural aspect of the model and probabilities or correlations to specify the strengths of the dependencies (Glymour et al. 1987; Heckerman 1996). Probabilistic dependency networks are increasingly finding applications in areas as diverse as the development of probabilistic medical expert systems from databases, information retrieval, and modeling of the human genome. Change and deviation detection focuses on Cluster 2 + Debt Cluster 1 + + + + + + + + + + + + + + + + + + + + + + Cluster 3 Income Figure 5. A Simple Clustering of the Loan Data Set into Three Clusters. Note that original labels are replaced by a +. discovering the most significant changes in the data from previously measured or normative values (Berndt and Clifford 1996; Guyon, Matic, and Vapnik 1996; Kloesgen 1996; Matheus, Piatetsky-Shapiro, and McNeill 1996; Bass...
View Full Document

This document was uploaded on 02/15/2014.

Ask a homework question - tutors are online