Contents
7 Cluster Analysis
  7.1 What Is Cluster Analysis?
  7.2 Types of Data in Cluster Analysis
      7.2.1 Interval-Scaled Variables
      7.2.2 Binary Variables
      7.2.3 Categorical, Ordinal, and Ratio-Scaled Variables
      7.2.4 Variables of Mixed Types
      7.2.5 Vector Objects
  7.3 A Categorization of Major Clustering Methods
  7.4 Partitioning Methods
      7.4.1 Classical Partitioning Methods: k-Means and k-Medoids
      7.4.2 Partitioning Methods in Large Databases: From k-Medoids to CLARANS
  7.5 Hierarchical Methods
      7.5.1 Agglomerative and Divisive Hierarchical Clustering
      7.5.2 BIRCH: Balanced Iterative Reducing and Clustering Using Hierarchies
      7.5.3 ROCK: A Hierarchical Clustering Algorithm for Categorical Attributes
      7.5.4 Chameleon: A Hierarchical Clustering Algorithm Using Dynamic Modeling
  7.6 Density-Based Methods
      7.6.1 DBSCAN: A Density-Based Clustering Method Based on Connected Regions with Sufficiently High Density
      7.6.2 OPTICS: Ordering Points To Identify the Clustering Structure
      7.6.3 DENCLUE: Clustering Based on Density Distribution Functions
  7.7 Grid-Based Methods
      7.7.1 STING: STatistical INformation Grid
      7.7.2 WaveCluster: Clustering Using Wavelet Transformation
  7.8 Model-Based Clustering Methods
      7.8.1 Expectation-Maximization
      7.8.2 Conceptual Clustering
      7.8.3 Neural Network Approach
  7.9 Clustering High-Dimensional Data
      7.9.1 CLIQUE: A Dimension-Growth Subspace Clustering Method
      7.9.2 PROCLUS: A Dimension-Reduction Subspace Clustering Method
      7.9.3 Frequent Pattern-Based Clustering Methods
  7.10 Constraint-Based Cluster Analysis
      7.10.1 Clustering with Obstacle Objects
      7.10.2 User-Constrained Cluster Analysis
      7.10.3 Semi-Supervised Cluster Analysis
  7.11 Outlier Analysis
      7.11.1 Statistical Distribution-Based Outlier Detection
      7.11.2 Distance-Based Outlier Detection
      7.11.3 Density-Based Local Outlier Detection
      7.11.4 Deviation-Based Outlier Detection
  7.12 Summary
  7.13 Exercises
  7.14 Bibliographic Notes
List of Figures
7.1 Euclidean and Manhattan distances between two objects.
7.2 The k-means partitioning algorithm.
7.3 Clustering of a set of objects based on the k-means method. (The mean of each cluster is marked by a "+".)
7.4 Four cases of the cost function for k-medoids clustering.
7.5 PAM, a k-medoids partitioning algorithm.
7.6 Agglomerative and divisive hierarchical clustering on data objects {a, b, c, d, e}.
7.7 Dendrogram representation for hierarchical clustering of data objects {a, b, c, d, e}.
7.8 A CF tree structure.
7.9 Chameleon: Hierarchical clustering based on k-nearest neighbors and dynamic modeling. Based on [KHK99].
7.10 Density reachability and density connectivity in density-based clustering. Based on [EKSX96].
7.11 OPTICS terminology. Based on [ABKS99].
7.12 Cluster ordering in OPTICS. Based on [ABKS99].
7.13 Possible density functions for a 2D data set. From [HK98].
7.14 Examples of center-defined clusters (top row) and arbitrary-shape clusters (bottom row). From [HK98].
7.15 A hierarchical structure for STING clustering.
7.16 A sample of two-dimensional feature space. From [SCZ98].
7.17 Multiresolution of the feature space in Figure 7.16 at (a) scale 1 (high resolution); (b) scale 2 (medium resolution); (c) scale 3 (low resolution). From [SCZ98].
7.18 Each cluster can be represented by a probability distribution, centered at a mean, with a standard deviation. Here, we have two clusters, corresponding to the Gaussian distributions g(m1, σ1) and g(m2, σ2), respectively, where the circles represent the first standard deviation of the distributions.
7.19 A classification tree. Based on [Fis87].
7.20 The result of SOM clustering of 12,088 Web articles on comp.ai.neural-nets (left), and of drilling down on the keyword "mining" (right). Based on http://websom.hut.fi/websom/comp.ai.neural-nets-new.
7.21 Dense units found with respect to age for the dimensions salary and vacation are intersected in order to provide a candidate search space for dense units of higher dimensionality.
7.22 Raw data from a fragment of microarray data containing only 3 objects and 10 attributes.
7.23 Objects in Figure 7.22 form (a) a shift pattern in subspace {b, c, h, j, e}, and (b) a scaling pattern in subspace {f, d, a, g, i}.
7.24 Clustering with obstacle objects (o1 and o2): (a) a visibility graph, and (b) triangulation of regions with microclusters. From [THH01].
7.25 Clustering results obtained without and with consideration of obstacles (where rivers and inaccessible highways or city blocks are represented by polygons): (a) clustering without considering obstacles, and (b) clustering with obstacles.
7.26 Clustering through decision tree construction: (a) the set of data points to be clustered, viewed as a set of "Y" points; (b) the addition of a set of uniformly distributed "N" points; and (c) the clustering result with "Y" points only.
7.27 The necessity of density-based local outlier analysis. From [BKNS00].
List of Tables
7.1 A contingency table for binary variables.
7.2 A relational table where patients are described by binary attributes.
7.3 A sample data table containing variables of mixed type.
Chapter 7
Cluster Analysis
Imagine that you are given a set of data objects for analysis where, unlike in classification, the class label of each object is not known. This is quite common in large databases because assigning class labels to a large number of objects can be a very costly process. Clustering is the process of grouping the data into classes or clusters so that objects within a cluster have high similarity in comparison to one another, but are very dissimilar to objects in other clusters. Dissimilarities are assessed based on the attribute values describing the objects. Often, distance measures are used. Clustering has its roots in many areas, including data mining, statistics, biology, and machine learning.

In this chapter, we study the requirements of clustering methods for large amounts of data. We explain how to compute dissimilarities between objects represented by various attribute or variable types. We examine several clustering techniques, organized into the following categories: partitioning methods, hierarchical methods, density-based methods, grid-based methods, model-based methods, methods for high-dimensional data (such as frequent pattern-based methods), and constraint-based clustering. Clustering can also be used for outlier detection, which forms the final topic of this chapter.
7.1 What Is Cluster Analysis?
The process of grouping a set of physical or abstract objects into classes of similar objects is called clustering. A cluster is a collection of data objects that are similar to one another within the same cluster and are dissimilar to the objects in other clusters. A cluster of data objects can be treated collectively as one group and so may be considered a form of data compression. Although classification is an effective means for distinguishing groups or classes of objects, it requires the often costly collection and labeling of a large set of training tuples or patterns, which the classifier uses to model each group. It is often more desirable to proceed in the reverse direction: first partition the set of data into groups based on data similarity (e.g., using clustering), and then assign labels to the relatively small number of groups. An additional advantage of such a clustering-based process is that it is adaptable to changes and helps single out useful features that distinguish different groups.

Cluster analysis is an important human activity. Early in childhood, one learns how to distinguish between cats and dogs, or between animals and plants, by continuously improving subconscious clustering schemes. By automated clustering, we can identify dense and sparse regions in object space and, therefore, discover overall distribution patterns and interesting correlations among data attributes. Cluster analysis has been widely used in numerous applications, including market research, pattern recognition, data analysis, and image processing. In business, clustering can help marketers discover distinct groups in their customer bases and characterize customer groups based on purchasing patterns. In biology, it can be used to derive plant and animal taxonomies, categorize genes with similar functionality, and gain insight into structures inherent in populations.
Clustering may also help in the identification of areas of similar land use in an earth observation database; in the identification of groups of houses in a city according to house type, value, and geographical location; and in the identification of groups of automobile insurance policy holders with a high average claim cost. It can also be used to help classify documents on the Web for information discovery.

Clustering is also called data segmentation in some applications because clustering partitions large data sets into groups according to their similarity. Clustering can also be used for outlier detection, where outliers (values that are "far away" from any cluster) may be more interesting than common cases. Applications of outlier detection include the detection of credit card fraud and the monitoring of criminal activities in electronic commerce. For example, exceptional cases in credit card transactions, such as very expensive and frequent purchases, may be of interest as possible fraudulent activity.

As a data mining function, cluster analysis can be used as a standalone tool to gain insight into the distribution of data, to observe the characteristics of each cluster, and to focus on a particular set of clusters for further analysis. Alternatively, it may serve as a preprocessing step for other algorithms, such as characterization, attribute subset selection, and classification, which would then operate on the detected clusters and the selected attributes or features.

Data clustering is under vigorous development. Contributing areas of research include data mining, statistics, machine learning, spatial database technology, biology, and marketing. Owing to the huge amounts of data collected in databases, cluster analysis has recently become a highly active topic in data mining research. As a branch of statistics, cluster analysis has been extensively studied for many years, focusing mainly on distance-based cluster analysis. Cluster analysis tools based on k-means, k-medoids, and several other methods have also been built into many statistical analysis software packages or systems, such as S-Plus, SPSS, and SAS. In machine learning, clustering is an example of unsupervised learning.
Unlike classification, clustering and unsupervised learning do not rely on predefined classes and class-labeled training examples. For this reason, clustering is a form of learning by observation, rather than learning by examples. In data mining, efforts have focused on finding methods for efficient and effective cluster analysis in large databases. Active themes of research focus on the scalability of clustering methods, the effectiveness of methods for clustering complex shapes and types of data, high-dimensional clustering techniques, and methods for clustering mixed numerical and categorical data in large databases.

Clustering is a challenging field of research whose potential applications pose their own special requirements. The following are typical requirements of clustering in data mining:

Scalability: Many clustering algorithms work well on small data sets containing fewer than several hundred data objects; however, a large database may contain millions of objects. Clustering on a sample of a given large data set may lead to biased results. Highly scalable clustering algorithms are needed.

Ability to deal with different types of attributes: Many algorithms are designed to cluster interval-based (numerical) data. However, applications may require clustering other types of data, such as binary, categorical (nominal), and ordinal data, or mixtures of these data types.

Discovery of clusters with arbitrary shape: Many clustering algorithms determine clusters based on Euclidean or Manhattan distance measures. Algorithms based on such distance measures tend to find spherical clusters with similar size and density. However, a cluster could be of any shape. It is important to develop algorithms that can detect clusters of arbitrary shape.

Minimal requirements for domain knowledge to determine input parameters: Many clustering algorithms require users to input certain parameters in cluster analysis (such as the number of desired clusters). The clustering results can be quite sensitive to input parameters. Parameters are often hard to determine, especially for data sets containing high-dimensional objects. This not only burdens users, but also makes the quality of clustering difficult to control.

Ability to deal with noisy data: Most real-world databases contain outliers or missing, unknown, or erroneous data. Some clustering algorithms are sensitive to such data and may lead to clusters of poor quality.

Incremental clustering and insensitivity to the order of input records: Some clustering algorithms cannot incorporate newly inserted data (i.e., database updates) into existing clustering structures and instead,
must determine a new clustering from scratch. Some clustering algorithms are sensitive to the order of input data. That is, given a set of data objects, such an algorithm may return dramatically different clusterings depending on the order of presentation of the input objects. It is important to develop incremental clustering algorithms and algorithms that are insensitive to the order of input.

High dimensionality: A database or a data warehouse can contain several dimensions or attributes. Many clustering algorithms are good at handling low-dimensional data, involving only two to three dimensions. Human eyes are good at judging the quality of clustering for up to three dimensions. Finding clusters of data objects in high-dimensional space is challenging, especially considering that such data can be very sparse and highly skewed.

Constraint-based clustering: Real-world applications may need to perform clustering under various kinds of constraints. Suppose that your job is to choose the locations for a given number of new automatic banking machines (i.e., ATMs) in a city. To decide upon this, you may cluster households while considering constraints such as the city's rivers and highway networks, and the type and number of customers per cluster. A challenging task is to find groups of data with good clustering behavior that satisfy specified constraints.

Interpretability and usability: Users expect clustering results to be interpretable, comprehensible, and usable. That is, clustering may need to be tied in with specific semantic interpretations and applications. It is important to study how an application goal may influence the selection of clustering features and clustering methods.

With these requirements in mind, our study of cluster analysis proceeds as follows. First, we study different types of data and how they can influence clustering methods. Second, we present a general categorization of clustering methods.
We then study each clustering method in detail, including partitioning methods, hierarchical methods, density-based methods, grid-based methods, and model-based methods. We also examine clustering in high-dimensional space, constraint-based clustering, and outlier analysis.
7.2 Types of Data in Cluster Analysis
In this section, we study the types of data that often occur in cluster analysis and how to preprocess them for such an analysis. Suppose that a data set to be clustered contains n objects, which may represent persons, houses, documents, countries, and so on. Main memory-based clustering algorithms typically operate on either of the following two data structures.

Data matrix (or object-by-variable structure): This represents n objects, such as persons, with p variables (also called measurements or attributes), such as age, height, weight, gender, and so on. The structure is in the form of a relational table, or n-by-p matrix (n objects × p variables):

    | x11  ...  x1f  ...  x1p |
    | ...       ...       ... |
    | xi1  ...  xif  ...  xip |       (7.1)
    | ...       ...       ... |
    | xn1  ...  xnf  ...  xnp |

Dissimilarity matrix (or object-by-object structure): This stores a collection of proximities that are available for all pairs of n objects. It is often represented by an n-by-n table:

    | 0                               |
    | d(2,1)  0                       |
    | d(3,1)  d(3,2)  0               |       (7.2)
    | ...     ...     ...             |
    | d(n,1)  d(n,2)  ...   ...   0   |
where d(i, j) is the measured difference or dissimilarity between objects i and j. In general, d(i, j) is a nonnegative number that is close to 0 when objects i and j are highly similar or "near" each other, and becomes larger the more they differ. Since d(i, j) = d(j, i) and d(i, i) = 0, we have the matrix in (7.2). Measures of dissimilarity are discussed throughout this section.
The rows and columns of the data matrix represent different entities, while those of the dissimilarity matrix represent the same entity. Thus, the data matrix is often called a two-mode matrix, whereas the dissimilarity matrix is called a one-mode matrix. Many clustering algorithms operate on a dissimilarity matrix. If the data are presented in the form of a data matrix, they can first be transformed into a dissimilarity matrix before applying such clustering algorithms.

In this section, we discuss how object dissimilarity can be computed for objects described by interval-scaled variables; by binary variables; by categorical, ordinal, and ratio-scaled variables; or by combinations of these variable types. Nonmetric similarity between complex objects (such as documents) is also described. The dissimilarity data can later be used to compute clusters of objects.
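As a minimal sketch of this transformation (not from the text; the function name and the choice of Euclidean distance as the measure d(i, j) are illustrative assumptions), a data matrix can be converted to a dissimilarity matrix as follows:

```python
import math

def dissimilarity_matrix(data):
    """Convert an n-by-p data matrix into an n-by-n dissimilarity matrix.

    Euclidean distance is used here as the dissimilarity d(i, j).
    Since d(i, j) = d(j, i) and d(i, i) = 0, only the lower triangle
    is actually computed; the upper triangle mirrors it.
    """
    n = len(data)
    d = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i):
            dist = math.sqrt(sum((a - b) ** 2 for a, b in zip(data[i], data[j])))
            d[i][j] = d[j][i] = dist
    return d

# Three objects described by two interval-scaled variables.
X = [[1.0, 2.0], [3.0, 5.0], [1.0, 2.0]]
D = dissimilarity_matrix(X)
print(round(D[0][1], 2))  # 3.61: objects 0 and 1 differ
print(D[0][2])            # 0.0: objects 0 and 2 are identical
```

Storing only one triangle, as the matrix in (7.2) suggests, would halve the memory cost; the full symmetric matrix is kept here only for simplicity.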
7.2.1 Interval-Scaled Variables
This section discusses interval-scaled variables and their standardization. It then describes distance measures that are commonly used for computing the dissimilarity of objects described by such variables. These measures include the Euclidean, Manhattan, and Minkowski distances.

"What are interval-scaled variables?" Interval-scaled variables are continuous measurements of a roughly linear scale. Typical examples include weight and height, latitude and longitude coordinates (e.g., when clustering houses), and weather temperature.

The measurement unit used can affect the clustering analysis. For example, changing measurement units from meters to inches for height, or from kilograms to pounds for weight, may lead to a very different clustering structure. In general, expressing a variable in smaller units will lead to a larger range for that variable, and thus a larger effect on the resulting clustering structure. To help avoid dependence on the choice of measurement units, the data should be standardized. Standardizing measurements attempts to give all variables an equal weight. This is particularly useful when given no prior knowledge of the data. However, in some applications, users may intentionally want to give more weight to a certain set of variables than to others. For example, when clustering basketball player candidates, we may prefer to give more weight to the variable height.

"How can the data for a variable be standardized?" To standardize measurements, one choice is to convert the original measurements to unitless variables. Given measurements for a variable f, this can be performed as follows.

1. Calculate the mean absolute deviation, sf:

    sf = (1/n)(|x1f − mf| + |x2f − mf| + ... + |xnf − mf|),       (7.3)

where x1f, ..., xnf are n measurements of f, and mf is the mean value of f, that is, mf = (1/n)(x1f + x2f + ... + xnf).

2. Calculate the standardized measurement, or z-score:

    zif = (xif − mf) / sf.       (7.4)
The mean absolute deviation, sf, is more robust to outliers than the standard deviation, σf. When computing the mean absolute deviation, the deviations from the mean (i.e., |xif − mf|) are not squared; hence, the effect of outliers is somewhat reduced. There are more robust measures of dispersion, such as the median absolute deviation. However, the advantage of using the mean absolute deviation is that the z-scores of outliers do not become too small; hence, the outliers remain detectable.
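The two standardization steps above can be sketched in a few lines. This is an illustrative implementation of Equations (7.3) and (7.4); the function name and sample data are our own:

```python
def standardize(values):
    """Standardize one variable's measurements to z-scores using the
    mean absolute deviation s_f, as in Equations (7.3) and (7.4)."""
    n = len(values)
    m_f = sum(values) / n                          # mean value of f
    s_f = sum(abs(x - m_f) for x in values) / n    # mean absolute deviation
    return [(x - m_f) / s_f for x in values]       # z-scores

heights_cm = [170.0, 180.0, 160.0, 190.0]
z = standardize(heights_cm)
print([round(v, 2) for v in z])  # [-0.5, 0.5, -1.5, 1.5]
```

Note that the z-scores are unitless: rescaling every measurement by a constant factor (e.g., converting centimeters to inches) scales mf and sf by the same factor and leaves the z-scores unchanged.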
Standardization may or may not be useful in a particular application. Thus the choice of whether and how to perform standardization should be left to the user. Methods of standardization are also discussed in Chapter 2 under normalization techniques for data preprocessing.

After standardization, or without standardization in certain applications, the dissimilarity (or similarity) between objects described by interval-scaled variables is typically computed based on the distance between each pair of objects. The most popular distance measure is Euclidean distance, which is defined as

    d(i, j) = √((xi1 − xj1)^2 + (xi2 − xj2)^2 + ... + (xin − xjn)^2),       (7.5)

where i = (xi1, xi2, ..., xin) and j = (xj1, xj2, ..., xjn) are two n-dimensional data objects.

Another well-known metric is Manhattan (or city block) distance, defined as

    d(i, j) = |xi1 − xj1| + |xi2 − xj2| + ... + |xin − xjn|.       (7.6)
Both the Euclidean distance and the Manhattan distance satisfy the following mathematical requirements of a distance function:

1. d(i, j) ≥ 0: Distance is a nonnegative number.
2. d(i, i) = 0: The distance of an object to itself is 0.
3. d(i, j) = d(j, i): Distance is a symmetric function.
4. d(i, j) ≤ d(i, h) + d(h, j): Going directly from object i to object j in space is no more than making a detour over any other object h (triangular inequality).
Figure 7.1: Euclidean and Manhattan distances between two objects. [Figure: points x1 = (1, 2) and x2 = (3, 5) plotted in the plane, with Euclidean distance = (2^2 + 3^2)^(1/2) = 3.61 and Manhattan distance = 2 + 3 = 5.]
Example 7.1 Euclidean distance and Manhattan distance. Let x1 = (1, 2) and x2 = (3, 5) represent two objects as in Figure 7.1. The Euclidean distance between the two is √(2^2 + 3^2) = 3.61. The Manhattan distance between the two is 2 + 3 = 5.

Minkowski distance is a generalization of both Euclidean distance and Manhattan distance. It is defined as

    d(i, j) = (|xi1 − xj1|^p + |xi2 − xj2|^p + ... + |xin − xjn|^p)^(1/p),       (7.7)

where p is a positive integer. Such a distance is also called the Lp norm in some literature. It represents the Manhattan distance when p = 1 (i.e., the L1 norm), and the Euclidean distance when p = 2 (i.e., the L2 norm).
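A minimal sketch of Equation (7.7) (the function is our own, not from the text) shows how p = 1 and p = 2 reproduce the Manhattan and Euclidean distances of Example 7.1:

```python
def minkowski(i, j, p):
    """Minkowski distance of order p, Equation (7.7): p = 1 gives
    Manhattan distance, p = 2 gives Euclidean distance."""
    return sum(abs(a - b) ** p for a, b in zip(i, j)) ** (1.0 / p)

x1, x2 = (1, 2), (3, 5)
print(minkowski(x1, x2, 1))            # 5.0  (Manhattan, as in Example 7.1)
print(round(minkowski(x1, x2, 2), 2))  # 3.61 (Euclidean, as in Example 7.1)
```

As p grows, the largest per-variable difference increasingly dominates the sum, so large p emphasizes the single most differing attribute.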
If each variable is assigned a weight according to its perceived importance, the weighted Euclidean distance can be computed as

    d(i, j) = √(w1|xi1 − xj1|^2 + w2|xi2 − xj2|^2 + ... + wn|xin − xjn|^2).       (7.8)
Weighting can also be applied to the Manhattan and Minkowski distances.
7.2.2 Binary Variables
Let us see how to compute the dissimilarity between objects described by either symmetric or asymmetric binary variables.

A binary variable has only two states: 0 or 1, where 0 means that the variable is absent, and 1 means that it is present. Given the variable smoker describing a patient, for instance, 1 indicates that the patient smokes, while 0 indicates that the patient does not. Treating binary variables as if they are interval-scaled can lead to misleading clustering results. Therefore, methods specific to binary data are necessary for computing dissimilarities.

"So, how can we compute the dissimilarity between two binary variables?" One approach involves computing a dissimilarity matrix from the given binary data. If all binary variables are thought of as having the same weight, we have the 2-by-2 contingency table of Table 7.1, where q is the number of variables that equal 1 for both objects i and j, r is the number of variables that equal 1 for object i but 0 for object j, s is the number of variables that equal 0 for object i but 1 for object j, and t is the number of variables that equal 0 for both objects i and j. The total number of variables is p, where p = q + r + s + t.

                        object j
                        1        0        sum
    object i   1        q        r        q + r
               0        s        t        s + t
               sum      q + s    r + t    p

Table 7.1: A contingency table for binary variables.
"What is the difference between symmetric and asymmetric binary variables?" A binary variable is symmetric if both of its states are equally valuable and carry the same weight; that is, there is no preference on which outcome should be coded as 0 or 1. One such example could be the attribute gender having the states male and female. Dissimilarity that is based on symmetric binary variables is called symmetric binary dissimilarity. Its dissimilarity (or distance) measure, defined in Equation (7.9), can be used to assess the dissimilarity between objects i and j. d(i, j) = r+s . q+r+s+t (7.9)
A binary variable is asymmetric if the outcomes of the states are not equally important, such as the positive and negative outcomes of a disease test. By convention, we shall code the most important outcome, which is usually the rarest one, by 1 (e.g., HIV positive), and the other by 0 (e.g., HIV negative). Given two asymmetric binary variables, the agreement of two 1s (a positive match) is then considered more significant than that of two 0s (a negative match). Therefore, such binary variables are often considered "monary" (as if having one state). The dissimilarity based on such variables is called asymmetric binary dissimilarity, where the number of negative matches, t, is considered unimportant and thus is ignored in the computation, as shown in Equation (7.10). d(i, j) = r+s . q+r+s (7.10)
Complementarily, one can measure the distance between two binary variables based on the notion of similarity instead of dissimilarity. For example, the asymmetric binary similarity between the objects i and j, or sim(i, j), can be computed as

    sim(i, j) = q / (q + r + s) = 1 − d(i, j).       (7.11)
The coefficient sim(i, j) is called the Jaccard coefficient, which is popularly referenced in the literature.

When both symmetric and asymmetric binary variables occur in the same data set, the mixed variables approach described in Section 7.2.4 can be applied.

Example 7.2 Dissimilarity between binary variables. Suppose that a patient record table (Table 7.2) contains the attributes name, gender, fever, cough, test1, test2, test3, and test4, where name is an object identifier, gender is a symmetric attribute, and the remaining attributes are asymmetric binary.

    name    gender   fever   cough   test1   test2   test3   test4
    Jack    M        Y       N       P       N       N       N
    Mary    F        Y       N       P       N       P       N
    Jim     M        Y       Y       N       N       N       N
    ...

Table 7.2: A relational table where patients are described by binary attributes.
For asymmetric attribute values, let the values Y (yes) and P (positive) be set to 1, and the value N (no or negative) be set to 0. Suppose that the distance between objects (patients) is computed based only on the asymmetric variables. According to Equation (7.10), the distance between each pair of the three patients, Jack, Mary, and Jim, is

    d(Jack, Mary) = (0 + 1) / (2 + 0 + 1) = 0.33
    d(Jack, Jim)  = (1 + 1) / (1 + 1 + 1) = 0.67
    d(Mary, Jim)  = (1 + 2) / (1 + 1 + 2) = 0.75
These measurements suggest that Mary and Jim are unlikely to have a similar disease since they have the highest dissimilarity value among the three pairs. Of the three patients, Jack and Mary are the most likely to have a similar disease.
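These pairwise distances are easy to reproduce in code. The Python sketch below (the helper name `asym_binary_dissim` is ours, not from the text) counts q, r, and s over the asymmetric attributes of Table 7.2 and applies Equation (7.10):

```python
def asym_binary_dissim(a, b):
    """Asymmetric binary dissimilarity, Equation (7.10): (r + s) / (q + r + s).

    a, b: sequences of 0/1 values for the asymmetric binary variables.
    Negative matches (both 0) are ignored entirely.
    """
    q = sum(1 for x, y in zip(a, b) if x == 1 and y == 1)  # positive matches
    r = sum(1 for x, y in zip(a, b) if x == 1 and y == 0)
    s = sum(1 for x, y in zip(a, b) if x == 0 and y == 1)
    return (r + s) / (q + r + s)

# Asymmetric attributes of Table 7.2 (Y/P -> 1, N -> 0):
#        fever cough test1 test2 test3 test4
jack = [1, 0, 1, 0, 0, 0]
mary = [1, 0, 1, 0, 1, 0]
jim  = [1, 1, 0, 0, 0, 0]

print(round(asym_binary_dissim(jack, mary), 2))  # 0.33
print(round(asym_binary_dissim(jack, jim), 2))   # 0.67
print(round(asym_binary_dissim(mary, jim), 2))   # 0.75
```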
7.2.3
Categorical, Ordinal, and Ratio-Scaled Variables
"How can we compute the dissimilarity between objects described by categorical, ordinal, and ratio-scaled variables?"

Categorical Variables

A categorical variable is a generalization of the binary variable in that it can take on more than two states. For example, map color is a categorical variable that may have, say, five states: red, yellow, green, pink, and blue. Let the number of states of a categorical variable be M. The states can be denoted by letters, symbols, or a set of integers, such as 1, 2, . . . , M. Notice that such integers are used just for data handling and do not represent any specific ordering.

"How is dissimilarity computed between objects described by categorical variables?" The dissimilarity between two objects i and j can be computed based on the ratio of mismatches:

    d(i, j) = (p - m) / p,    (7.12)
CHAPTER 7. CLUSTER ANALYSIS
where m is the number of matches (i.e., the number of variables for which i and j are in the same state), and p is the total number of variables. Weights can be assigned to increase the effect of m or to assign greater weight to the matches in variables having a larger number of states.

    object identifier   test1 (categorical)   test2 (ordinal)   test3 (ratio-scaled)
    1                   codeA                 excellent         445
    2                   codeB                 fair              22
    3                   codeC                 good              164
    4                   codeA                 excellent         1,210
Table 7.3: A sample data table containing variables of mixed type.
Example 7.3 Dissimilarity between categorical variables. Suppose that we have the sample data of Table 7.3, except that only the object-identifier and the variable (or attribute) test1 are available, where test1 is categorical. (We will use test2 and test3 in later examples.) Let's compute the dissimilarity matrix (7.2), that is,

    0
    d(2,1)   0
    d(3,1)   d(3,2)   0
    d(4,1)   d(4,2)   d(4,3)   0

Since here we have one categorical variable, test1, we set p = 1 in Equation (7.12), so that d(i, j) evaluates to 0 if objects i and j match, and 1 if the objects differ. Thus, we get

    0
    1   0
    1   1   0
    0   1   1   0
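This mismatch-ratio computation can be sketched in a few lines of Python (the function name `categorical_dissim` is our own), and it reproduces the matrix above:

```python
def categorical_dissim(obj_i, obj_j):
    """Equation (7.12): ratio of mismatches, (p - m) / p."""
    m = sum(1 for a, b in zip(obj_i, obj_j) if a == b)   # number of matching variables
    p = len(obj_i)                                       # total number of variables
    return (p - m) / p

# The four objects of Table 7.3, restricted to the single categorical variable test1:
objects = [["codeA"], ["codeB"], ["codeC"], ["codeA"]]
matrix = [[categorical_dissim(objects[i], objects[j]) for j in range(i + 1)]
          for i in range(len(objects))]
print(matrix)  # [[0.0], [1.0, 0.0], [1.0, 1.0, 0.0], [0.0, 1.0, 1.0, 0.0]]
```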
Categorical variables can be encoded by asymmetric binary variables by creating a new binary variable for each of the M states. For an object with a given state value, the binary variable representing that state is set to 1, while the remaining binary variables are set to 0. For example, to encode the categorical variable map color, a binary variable can be created for each of the five colors listed above. For an object having the color yellow, the yellow variable is set to 1, while the remaining four variables are set to 0. The dissimilarity coefficient for this form of encoding can be calculated using the methods discussed in Section 7.2.2.

Ordinal Variables

A discrete ordinal variable resembles a categorical variable, except that the M states of the ordinal value are ordered in a meaningful sequence. Ordinal variables are very useful for registering subjective assessments of qualities that cannot be measured objectively. For example, professional ranks are often enumerated in a sequential order, such as assistant, associate, and full for professors. A continuous ordinal variable looks like a set of continuous data of an unknown scale; that is, the relative ordering of the values is essential but their actual magnitude is not. For example, the relative ranking in a particular sport (e.g., gold, silver, bronze) is often more essential than the actual values of a particular measure. Ordinal variables may also be obtained from the discretization of interval-scaled quantities by splitting the value range into a finite number of classes. The values of an ordinal variable can be mapped to ranks. For example, suppose that an ordinal variable f has Mf states. These ordered states define the ranking 1, . . . , Mf.
"How are ordinal variables handled?" The treatment of ordinal variables is quite similar to that of interval-scaled variables when computing the dissimilarity between objects. Suppose that f is a variable from a set of ordinal variables describing n objects. The dissimilarity computation with respect to f involves the following steps:

1. The value of f for the ith object is xif, and f has Mf ordered states, representing the ranking 1, . . . , Mf. Replace each xif by its corresponding rank, rif ∈ {1, . . . , Mf}.

2. Since each ordinal variable can have a different number of states, it is often necessary to map the range of each variable onto [0.0, 1.0] so that each variable has equal weight. This can be achieved by replacing the rank rif of the ith object in the f th variable by

    zif = (rif - 1) / (Mf - 1).    (7.13)
3. Dissimilarity can then be computed using any of the distance measures described in Section 7.2.1 for interval-scaled variables, using zif to represent the f value for the ith object.

Example 7.4 Dissimilarity between ordinal variables. Suppose that we have the sample data of Table 7.3, except that this time only the object-identifier and the continuous ordinal variable, test2, are available. There are three states for test2, namely fair, good, and excellent, that is, Mf = 3. For step 1, if we replace each value for test2 by its rank, the four objects are assigned the ranks 3, 1, 2, and 3, respectively. Step 2 normalizes the ranking by mapping rank 1 to 0.0, rank 2 to 0.5, and rank 3 to 1.0. For step 3, we can use, say, the Euclidean distance (Equation 7.5), which results in the following dissimilarity matrix:

    0
    1.0   0
    0.5   0.5   0
    0     1.0   0.5   0
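The three steps can be sketched in Python as follows (variable names are ours); applied to test2, the sketch reproduces the matrix above:

```python
# Step 1: map each ordinal value of test2 to its rank; step 2: normalize via
# Equation (7.13), z = (r - 1) / (Mf - 1); step 3: take distances on the z values.
ranks = {"fair": 1, "good": 2, "excellent": 3}
Mf = 3
test2 = ["excellent", "fair", "good", "excellent"]
z = [(ranks[v] - 1) / (Mf - 1) for v in test2]
print(z)  # [1.0, 0.0, 0.5, 1.0]

# Lower triangle of the dissimilarity matrix (a single variable, so the
# Euclidean distance reduces to an absolute difference):
d = [[abs(z[i] - z[j]) for j in range(i)] for i in range(len(z))]
print(d)  # [[], [1.0], [0.5, 0.5], [0.0, 1.0, 0.5]]
```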
Ratio-Scaled Variables

A ratio-scaled variable makes a positive measurement on a nonlinear scale, such as an exponential scale, approximately following the formula

    Ae^(Bt) or Ae^(-Bt),    (7.14)

where A and B are positive constants, and t typically represents time. Common examples include the growth of a bacteria population and the decay of a radioactive element.

"How can I compute the dissimilarity between objects described by ratio-scaled variables?" There are three methods to handle ratio-scaled variables for computing the dissimilarity between objects.

- Treat ratio-scaled variables like interval-scaled variables. This, however, is not usually a good choice since it is likely that the scale may be distorted.

- Apply logarithmic transformation to a ratio-scaled variable f having value xif for object i by using the formula yif = log(xif). The yif values can be treated as interval-valued, as described in Section 7.2.1. Notice that for some ratio-scaled variables, log-log or other transformations may be applied, depending on the variable's definition and the application.

- Treat xif as continuous ordinal data and treat their ranks as interval-valued.
The latter two methods are the most effective, although the choice of method used may be dependent on the given application.

Example 7.5 Dissimilarity between ratio-scaled variables. This time, we have the sample data of Table 7.3, except that only the object-identifier and the ratio-scaled variable, test3, are available. Let's try a logarithmic transformation. Taking the log of test3 results in the values 2.65, 1.34, 2.21, and 3.08 for the objects 1 to 4, respectively. Using the Euclidean distance (Equation 7.5) on the transformed values, we obtain the following dissimilarity matrix:

    0
    1.31   0
    0.44   0.87   0
    0.43   1.74   0.87   0
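A quick check of this example in Python (using base-10 logarithms, which is what the values above correspond to):

```python
import math

# test3 values from Table 7.3 and their base-10 logarithms:
test3 = [445, 22, 164, 1210]
y = [round(math.log10(v), 2) for v in test3]
print(y)  # [2.65, 1.34, 2.21, 3.08]

# Lower triangle of the dissimilarity matrix on the transformed values:
d = [[round(abs(y[i] - y[j]), 2) for j in range(i)] for i in range(len(y))]
print(d)  # [[], [1.31], [0.44, 0.87], [0.43, 1.74, 0.87]]
```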
7.2.4
Variables of Mixed Types
Sections 7.2.1 to 7.2.3 discussed how to compute the dissimilarity between objects described by variables of the same type, where these types may be either interval-scaled, symmetric binary, asymmetric binary, categorical, ordinal, or ratio-scaled. However, in many real databases, objects are described by a mixture of variable types. In general, a database can contain all of the six variable types listed above. "So, how can we compute the dissimilarity between objects of mixed variable types?"

One approach is to group each type of variable together, performing a separate cluster analysis for each variable type. This is feasible if these analyses derive compatible results. However, in real applications, it is unlikely that a separate cluster analysis per variable type will generate compatible results.

A preferable approach is to process all variable types together, performing a single cluster analysis. One such technique combines the different variables into a single dissimilarity matrix, bringing all of the meaningful variables onto a common scale of the interval [0.0, 1.0]. Suppose that the data set contains p variables of mixed type. The dissimilarity d(i, j) between objects i and j is defined as

    d(i, j) = ( Σ_{f=1}^{p} δij^(f) dij^(f) ) / ( Σ_{f=1}^{p} δij^(f) ),    (7.15)

where the indicator δij^(f) = 0 if either (1) xif or xjf is missing (i.e., there is no measurement of variable f for object i or object j), or (2) xif = xjf = 0 and variable f is asymmetric binary; otherwise, δij^(f) = 1. The contribution of variable f to the dissimilarity between i and j, that is, dij^(f), is computed dependent on its type:

- If f is interval-based: dij^(f) = |xif - xjf| / (maxh xhf - minh xhf), where h runs over all nonmissing objects for variable f.

- If f is binary or categorical: dij^(f) = 0 if xif = xjf; otherwise dij^(f) = 1.

- If f is ordinal: compute the ranks rif and zif = (rif - 1) / (Mf - 1), and treat zif as interval-scaled.

- If f is ratio-scaled: either perform logarithmic transformation and treat the transformed data as interval-scaled; or treat f as continuous ordinal data, compute rif and zif, and then treat zif as interval-scaled.

The above steps are identical to what we have already seen for each of the individual variable types. The only difference is for interval-based variables, where here we normalize so that the values map to the interval [0.0, 1.0]. Thus, the dissimilarity between objects can be computed even when the variables describing the objects are of different types.
Example 7.6 Dissimilarity between variables of mixed type. Let's compute a dissimilarity matrix for the objects of Table 7.3. Now we will consider all of the variables, which are of different types. In Examples 7.3 to 7.5, we worked out the dissimilarity matrices for each of the individual variables. The procedures that we followed for test1 (which is categorical) and test2 (which is ordinal) are the same as outlined above for processing variables of mixed types. Therefore, we can use the dissimilarity matrices obtained for test1 and test2 later when we compute Equation (7.15). First, however, we need to complete some work for test3 (which is ratio-scaled). We have already applied a logarithmic transformation to its values. Based on the transformed values of 2.65, 1.34, 2.21, and 3.08 obtained for the objects 1 to 4, respectively, we let maxh xh = 3.08 and minh xh = 1.34. We then normalize the values in the dissimilarity matrix obtained in Example 7.5 by dividing each one by (3.08 - 1.34) = 1.74. This results in the following dissimilarity matrix for test3:

    0
    0.75   0
    0.25   0.50   0
    0.25   1.00   0.50   0

We can now use the dissimilarity matrices for the three variables in our computation of Equation (7.15). For example, we get d(2, 1) = (1(1) + 1(1) + 1(0.75)) / 3 = 0.92. The resulting dissimilarity matrix obtained for the data described by the three variables of mixed types is:

    0
    0.92   0
    0.58   0.67   0
    0.08   1.00   0.67   0
If we go back and look at Table 7.3, we can intuitively guess that objects 1 and 4 are the most similar, based on their values for test1 and test2. This is confirmed by the dissimilarity matrix, where d(4, 1) is the lowest value for any pair of different objects. Similarly, the matrix indicates that objects 2 and 4 are the least similar.
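The combination step of Equation (7.15) can be sketched in Python using the three per-variable matrices worked out in Examples 7.3 to 7.5 (all indicator weights δ are 1 here, since no values are missing and no variable is asymmetric binary). Note that d(4, 1) works out to (0 + 0 + 0.25) / 3 ≈ 0.08:

```python
# Per-variable dissimilarities from Examples 7.3 (test1), 7.4 (test2), and the
# normalized Example 7.5 matrix (test3), keyed by object pair (i, j):
pairs = [(2, 1), (3, 1), (3, 2), (4, 1), (4, 2), (4, 3)]
d_test1 = {(2, 1): 1, (3, 1): 1, (3, 2): 1, (4, 1): 0, (4, 2): 1, (4, 3): 1}
d_test2 = {(2, 1): 1.0, (3, 1): 0.5, (3, 2): 0.5, (4, 1): 0.0, (4, 2): 1.0, (4, 3): 0.5}
d_test3 = {(2, 1): 0.75, (3, 1): 0.25, (3, 2): 0.50, (4, 1): 0.25, (4, 2): 1.00, (4, 3): 0.50}

# Equation (7.15) with all indicator weights equal to 1 and p = 3 variables:
d_mixed = {pair: (d_test1[pair] + d_test2[pair] + d_test3[pair]) / 3 for pair in pairs}
for pair in pairs:
    print(pair, round(d_mixed[pair], 2))
# (2, 1) -> 0.92; (4, 1) -> 0.08, the most similar pair; (4, 2) -> 1.0
```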
7.2.5
Vector objects
In some applications, such as information retrieval, text document clustering, and biological taxonomy, we need to compare and cluster complex objects (such as documents) containing a large number of symbolic entities (such as keywords and phrases). To measure the distance between complex objects, it is often desirable to abandon traditional metric distance computation and introduce a nonmetric similarity function. There are several ways to define such a similarity function, s(x, y), to compare two vectors x and y. One popular way is to define the similarity function as a cosine measure as follows:

    s(x, y) = x^t y / (||x|| ||y||),    (7.16)
where x^t is the transpose of vector x, ||x|| is the Euclidean norm of vector x,¹ ||y|| is the Euclidean norm of vector y, and s is essentially the cosine of the angle between vectors x and y. This value is invariant to rotation and dilation, but it is not invariant to translation and general linear transformation.

When the variables are binary-valued (0 or 1), the above similarity function can be interpreted in terms of shared features and attributes. Suppose an object x possesses the ith attribute if xi = 1. Then x^t y is the number of attributes possessed by both x and y, and ||x|| ||y|| is the geometric mean of the number of attributes possessed by x and the number possessed by y. Thus s(x, y) is a measure of relative possession of common attributes.
¹ The Euclidean norm of vector x = (x1, x2, . . . , xp) is defined as sqrt(x1² + x2² + . . . + xp²). Conceptually, it is the length of the vector.
Example 7.7 Nonmetric similarity between two objects using cosine. Suppose we are given two vectors, x = (1, 1, 0, 0) and y = (0, 1, 1, 0). By Equation (7.16), the similarity between x and y is

    s(x, y) = (0 + 1 + 0 + 0) / (√2 × √2) = 0.5.

A simple variation of the above measure is

    s(x, y) = x^t y / (x^t x + y^t y - x^t y),    (7.17)
which is the ratio of the number of attributes shared by x and y to the number of attributes possessed by x or y. This function, known as the Tanimoto coefficient or Tanimoto distance, is frequently used in information retrieval and biological taxonomy. Notice that there are many ways to select a particular similarity (or distance) function or to normalize the data for cluster analysis. There is no universal standard to guide such selection. The appropriate selection of such measures will depend heavily on the given application. One should bear this in mind and refine the selection of measures to ensure that the clusters generated are meaningful and useful for the application at hand.
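Both measures are straightforward to compute. A small Python sketch (the function names are ours) verifies Example 7.7 and evaluates the Tanimoto coefficient on the same vectors:

```python
import math

def cosine_sim(x, y):
    """Equation (7.16): s(x, y) = x.y / (|x| |y|)."""
    dot = sum(a * b for a, b in zip(x, y))
    return dot / (math.sqrt(sum(a * a for a in x)) * math.sqrt(sum(b * b for b in y)))

def tanimoto(x, y):
    """Equation (7.17): x.y / (x.x + y.y - x.y)."""
    dot = sum(a * b for a, b in zip(x, y))
    return dot / (sum(a * a for a in x) + sum(b * b for b in y) - dot)

x, y = (1, 1, 0, 0), (0, 1, 1, 0)
print(round(cosine_sim(x, y), 2))  # 0.5, as in Example 7.7
print(round(tanimoto(x, y), 3))    # 1 / (2 + 2 - 1) = 0.333
```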
7.3
A Categorization of Major Clustering Methods
A large number of clustering algorithms exist in the literature. It is difficult to provide a crisp categorization of clustering methods because these categories may overlap, so that a method may have features from several categories. Nevertheless, it is useful to present a relatively organized picture of the different clustering methods. In general, the major clustering methods can be classified into the following categories.

Partitioning methods: Given a database of n objects or data tuples, a partitioning method constructs k partitions of the data, where each partition represents a cluster and k ≤ n. That is, it classifies the data into k groups, which together satisfy the following requirements: (1) each group must contain at least one object, and (2) each object must belong to exactly one group. Notice that the second requirement can be relaxed in some fuzzy partitioning techniques. References to such techniques are given in the bibliographic notes.

Given k, the number of partitions to construct, a partitioning method creates an initial partitioning. It then uses an iterative relocation technique that attempts to improve the partitioning by moving objects from one group to another. The general criterion of a good partitioning is that objects in the same cluster are "close" or related to each other, whereas objects of different clusters are "far apart" or very different. There are various other kinds of criteria for judging the quality of partitions. Achieving global optimality in partitioning-based clustering would require the exhaustive enumeration of all of the possible partitions. Instead, most applications adopt one of a few popular heuristic methods, such as (1) the k-means algorithm, where each cluster is represented by the mean value of the objects in the cluster, and (2) the k-medoids algorithm, where each cluster is represented by one of the objects located near the center of the cluster.
These heuristic clustering methods work well for finding spherical-shaped clusters in small to medium-sized databases. To find clusters with complex shapes and to cluster very large data sets, partitioning-based methods need to be extended. Partitioning-based clustering methods are studied in depth in Section 7.4.

Hierarchical methods: A hierarchical method creates a hierarchical decomposition of the given set of data objects. A hierarchical method can be classified as being either agglomerative or divisive, based on how the hierarchical decomposition is formed. The agglomerative approach, also called the bottom-up approach, starts with each object forming a separate group. It successively merges the objects or groups that are close to one another, until all of the groups are merged into one (the topmost level of the hierarchy), or until a termination condition holds. The divisive approach, also called the top-down approach, starts with all the objects in the same cluster. In each successive iteration, a cluster is split into smaller clusters, until eventually each object is in a cluster of its own, or until a termination condition holds.
Hierarchical methods suffer from the fact that once a step (merge or split) is done, it can never be undone. This rigidity is useful in that it leads to smaller computation costs by not having to worry about a combinatorial number of different choices. However, such techniques cannot correct erroneous decisions. There are two approaches to improving the quality of hierarchical clustering: (1) perform careful analysis of object "linkages" at each hierarchical partitioning, such as in Chameleon, or (2) integrate hierarchical agglomeration and other approaches by first using a hierarchical agglomerative algorithm to group objects into microclusters, and then performing macroclustering on the microclusters using another clustering method such as iterative relocation, as in BIRCH. Hierarchical clustering methods are studied in Section 7.5.

Density-based methods: Most partitioning methods cluster objects based on the distance between objects. Such methods can find only spherical-shaped clusters and encounter difficulty at discovering clusters of arbitrary shapes. Other clustering methods have been developed based on the notion of density. Their general idea is to continue growing the given cluster as long as the density (number of objects or data points) in the "neighborhood" exceeds some threshold; that is, for each data point within a given cluster, the neighborhood of a given radius has to contain at least a minimum number of points. Such a method can be used to filter out noise (outliers) and discover clusters of arbitrary shape. DBSCAN and its extension, OPTICS, are typical density-based methods that grow clusters according to a density-based connectivity analysis. DENCLUE is a method that clusters objects based on the analysis of the value distributions of density functions. Density-based clustering methods are studied in Section 7.6.

Grid-based methods: Grid-based methods quantize the object space into a finite number of cells that form a grid structure.
All of the clustering operations are performed on the grid structure (i.e., on the quantized space). The main advantage of this approach is its fast processing time, which is typically independent of the number of data objects and dependent only on the number of cells in each dimension of the quantized space. STING is a typical example of a grid-based method. WaveCluster applies wavelet transformation for clustering analysis and is both grid-based and density-based. Grid-based clustering methods are studied in Section 7.7.

Model-based methods: Model-based methods hypothesize a model for each of the clusters and find the best fit of the data to the given model. A model-based algorithm may locate clusters by constructing a density function that reflects the spatial distribution of the data points. This also leads to a way of automatically determining the number of clusters based on standard statistics, taking "noise" or outliers into account and thus yielding robust clustering methods. EM is an algorithm that performs expectation-maximization analysis based on statistical modeling. COBWEB is a conceptual learning algorithm that performs probability analysis and takes concepts as a model for clusters. SOM (or self-organizing feature map) is a neural network-based algorithm that clusters by mapping high-dimensional data into a 2-D or 3-D feature map, which is also useful for data visualization. Model-based clustering methods are studied in Section 7.8.

The choice of clustering algorithm depends both on the type of data available and on the particular purpose of the application. If cluster analysis is used as a descriptive or exploratory tool, it is possible to try several algorithms on the same data to see what the data may disclose. Some clustering algorithms integrate the ideas of several clustering methods, so that it is sometimes difficult to classify a given algorithm as uniquely belonging to only one clustering method category.
Furthermore, some applications may have clustering criteria that require the integration of several clustering techniques.

Aside from the above categories of clustering methods, there are two classes of clustering tasks that require special attention. One is clustering high-dimensional data, and the other is constraint-based clustering.

Clustering high-dimensional data is a particularly important task in cluster analysis because many applications require the analysis of objects containing a large number of features or "dimensions". For example, text documents may contain thousands of terms or keywords as features, and DNA microarray data may provide information on the expression levels of thousands of genes under hundreds of conditions. Clustering high-dimensional data is challenging due to the curse of dimensionality: many dimensions may not be relevant.
As the number of dimensions increases, the data become increasingly sparse, so that the distance measurement between pairs of points becomes meaningless and the average density of points anywhere in the data is likely to be low. Therefore, a different clustering methodology needs to be developed for high-dimensional data. CLIQUE and PROCLUS are two influential subspace clustering methods, which search for clusters in subspaces (or subsets of dimensions) of the data, rather than over the entire data space. Frequent pattern-based clustering is another clustering methodology, which extracts distinct patterns that occur frequently among subsets of dimensions. It uses such patterns to group objects and generate meaningful clusters. pCluster is an example of frequent pattern-based clustering that groups objects based on their pattern similarity. High-dimensional data clustering methods are studied in Section 7.9.

Constraint-based clustering is a clustering approach that performs clustering by incorporating user-specified or application-oriented constraints. A constraint expresses a user's expectation or describes "properties" of the desired clustering results, and provides an effective means for communicating with the clustering process. Various kinds of constraints can be specified, either by a user or as per application requirements. Our focus of discussion will be on spatial clustering with the existence of obstacles and clustering under user-specified constraints. In addition, semi-supervised clustering is described, which employs, for example, pairwise constraints (such as pairs of instances labeled as belonging to the same or different clusters) in order to improve the quality of the resulting clustering. Constraint-based clustering methods are studied in Section 7.10.

In the following sections, we examine each of the above clustering methods in detail. We also introduce algorithms that integrate the ideas of several clustering methods.
Outlier analysis, which typically involves clustering, is described in Section 7.11. In general, the notation used in the following sections is as follows. Let D be a data set of n objects to be clustered. An object is described by d variables (attributes or dimensions) and therefore may also be referred to as a point in d-dimensional object space. Objects are represented in bold italic font, e.g., p.
7.4
Partitioning Methods
Given D, a data set of n objects, and k, the number of clusters to form, a partitioning algorithm organizes the objects into k partitions (k ≤ n), where each partition represents a cluster. The clusters are formed to optimize an objective partitioning criterion, such as a dissimilarity function based on distance, so that the objects within a cluster are "similar," whereas the objects of different clusters are "dissimilar" in terms of the data set attributes.
7.4.1
Classical Partitioning Methods: k-Means and k-Medoids
The most well-known and commonly used partitioning methods are k-means, k-medoids, and their variations.

Centroid-Based Technique: The k-Means Method

The k-means algorithm takes the input parameter, k, and partitions a set of n objects into k clusters so that the resulting intracluster similarity is high but the intercluster similarity is low. Cluster similarity is measured with respect to the mean value of the objects in a cluster, which can be viewed as the cluster's centroid or center of gravity.

"How does the k-means algorithm work?" The k-means algorithm proceeds as follows. First, it randomly selects k of the objects, each of which initially represents a cluster mean or center. Each of the remaining objects is assigned to the cluster to which it is the most similar, based on the distance between the object and the cluster mean. It then computes the new mean for each cluster. This process iterates until the criterion function converges. Typically, the square-error criterion is used, defined as
    E = Σ_{i=1}^{k} Σ_{p∈Ci} ||p - mi||²,    (7.18)
where E is the sum of the square error for all objects in the data set; p is the point in space representing a given object; and mi is the mean of cluster Ci (both p and mi are multidimensional). In other words, for each object in each cluster, the distance from the object to its cluster center is squared, and the distances are summed. This criterion tries to make the resulting k clusters as compact and as separate as possible. The k-means procedure is summarized in Figure 7.2.

Algorithm: k-means. The k-means algorithm for partitioning, where each cluster's center is represented by the mean value of the objects in the cluster.

Input:
    k: the number of clusters,
    D: a data set containing n objects.

Output: A set of k clusters.

Method:
    (1) arbitrarily choose k objects from D as the initial cluster centers;
    (2) repeat
    (3)     (re)assign each object to the cluster to which the object is the most similar, based on the mean value of the objects in the cluster;
    (4)     update the cluster means, i.e., calculate the mean value of the objects for each cluster;
    (5) until no change;

Figure 7.2: The k-means partitioning algorithm.

Example 7.8 Clustering by k-means partitioning. Suppose that there is a set of objects located in space as depicted in the rectangle shown in Figure 7.3(a). Let k = 3; that is, the user would like the objects to be partitioned into three clusters.

According to the algorithm in Figure 7.2, we arbitrarily choose three objects as the three initial cluster centers, where cluster centers are marked by a "+". Each object is assigned to a cluster based on the cluster center to which it is the nearest. Such a distribution forms silhouettes encircled by dotted curves, as shown in Figure 7.3(a).

Next, the cluster centers are updated. That is, the mean value of each cluster is recalculated based on the current objects in the cluster. Using the new cluster centers, the objects are redistributed to the clusters based on which cluster center is the nearest. Such a redistribution forms new silhouettes encircled by dashed curves, as shown in Figure 7.3(b). This process iterates, leading to Figure 7.3(c). The process of iteratively reassigning objects to clusters to improve the partitioning is referred to as iterative relocation. Eventually, no redistribution of the objects in any cluster occurs, and so the process terminates. The resulting clusters are returned by the clustering process.

The algorithm attempts to determine k partitions that minimize the square-error function. It works well when the clusters are compact clouds that are rather well separated from one another.
The method is relatively scalable and efficient in processing large data sets because the computational complexity of the algorithm is O(nkt), where n is the total number of objects, k is the number of clusters, and t is the number of iterations. Normally, k ≪ n and t ≪ n. The method often terminates at a local optimum.

The k-means method, however, can be applied only when the mean of a cluster is defined. This may not be the case in some applications, such as when data with categorical attributes are involved. The necessity for users to specify k, the number of clusters, in advance can be seen as a disadvantage. The k-means method is not suitable for discovering clusters with nonconvex shapes or clusters of very different size. Moreover, it is sensitive to noise and outlier data points, since a small number of such data can substantially influence the mean value.
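The procedure of Figure 7.2 can be sketched compactly in Python. This is a plain illustration rather than an optimized implementation; the initialization and the sample data are our own choices:

```python
import random

def kmeans(points, k, max_iter=100, seed=0):
    """Plain k-means per Figure 7.2: arbitrary initial centers, then iterate
    (re)assignment to the nearest mean and mean recomputation until no change."""
    rng = random.Random(seed)
    centers = [list(p) for p in rng.sample(points, k)]   # step (1)
    assign = None
    for _ in range(max_iter):
        # step (3): assign each object to the cluster with the nearest mean
        new_assign = [min(range(k),
                          key=lambda c: sum((a - b) ** 2 for a, b in zip(p, centers[c])))
                      for p in points]
        if new_assign == assign:                         # step (5): no change
            break
        assign = new_assign
        # step (4): update each cluster mean
        for c in range(k):
            members = [p for p, a in zip(points, assign) if a == c]
            if members:
                centers[c] = [sum(col) / len(members) for col in zip(*members)]
    return centers, assign

pts = [(1.0, 1.0), (1.5, 2.0), (3.0, 4.0), (5.0, 7.0), (3.5, 5.0), (4.5, 5.0), (3.5, 4.5)]
centers, labels = kmeans(pts, k=2)
# The two well-separated groups, {(1,1), (1.5,2)} and the rest, land in different clusters.
```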
Figure 7.3: Clustering of a set of objects based on the k-means method, shown at three successive iterations (a), (b), and (c). (The mean of each cluster is marked by a "+".)
There are quite a few variants of the k-means method. These can differ in the selection of the initial k means, the calculation of dissimilarity, and the strategies for calculating cluster means. An interesting strategy that often yields good results is to first apply a hierarchical agglomeration algorithm, which determines the number of clusters and finds an initial clustering, and then use iterative relocation to improve the clustering.

Another variant of k-means is the k-modes method, which extends the k-means paradigm to cluster categorical data by replacing the means of clusters with modes, using new dissimilarity measures to deal with categorical objects and a frequency-based method to update the modes of clusters. The k-means and the k-modes methods can be integrated to cluster data with mixed numeric and categorical values.

The EM (Expectation-Maximization) algorithm (which will be further discussed in Section 7.8.1) extends the k-means paradigm in a different way. Whereas the k-means algorithm assigns each object to a cluster, in EM each object is assigned to each cluster according to a weight representing its probability of membership. In other words, there are no strict boundaries between clusters. Therefore, new means are computed based on weighted measures.

"How can we make the k-means algorithm more scalable?" A recent approach to scaling the k-means algorithm is based on the idea of identifying three kinds of regions in data: regions that are compressible, regions that must be maintained in main memory, and regions that are discardable. An object is discardable if its membership in a cluster is ascertained. An object is compressible if it is not discardable but belongs to a tight subcluster. A data structure known as a clustering feature is used to summarize objects that have been discarded or compressed. If an object is neither discardable nor compressible, then it should be retained in main memory.
To achieve scalability, the iterative clustering algorithm only includes the clustering features of the compressible objects and the objects that must be retained in main memory, thereby turning a secondary-memory-based algorithm into a main-memory-based algorithm. An alternative approach to scaling the k-means algorithm explores the microclustering idea, which first groups nearby objects into "microclusters" and then performs k-means clustering on the microclusters. Microclustering is further discussed in Section 7.5.
Representative Object-Based Technique: The k-Medoids Method

The k-means algorithm is sensitive to outliers because an object with an extremely large value may substantially distort the distribution of data. This effect is particularly exacerbated by the use of the square-error function (Equation 7.18).

"How might the algorithm be modified to diminish such sensitivity?" Instead of taking the mean value of the objects in a cluster as a reference point, we can pick actual objects to represent the clusters, using one representative object per cluster. Each remaining object is clustered with the representative object to which it is the most similar. The partitioning method is then performed based on the principle of minimizing the sum of the dissimilarities between each object and its corresponding reference point. That is, an absolute-error criterion is used, defined
as

E = \sum_{j=1}^{k} \sum_{p \in C_j} |p - o_j|,   (7.19)
where E is the sum of the absolute error for all objects in the data set; p is the point in space representing a given object in cluster C_j; and o_j is the representative object of C_j. In general, the algorithm iterates until, eventually, each representative object is actually the medoid, or most centrally located object, of its cluster. This is the basis of the k-medoids method for grouping n objects into k clusters.

Let's have a closer look at k-medoids clustering. The initial representative objects (or seeds) are chosen arbitrarily. The iterative process of replacing representative objects by nonrepresentative objects continues as long as the quality of the resulting clustering is improved. This quality is estimated using a cost function that measures the average dissimilarity between an object and the representative object of its cluster. To determine whether a nonrepresentative object, o_random, is a good replacement for a current representative object, o_j, the following four cases are examined for each of the nonrepresentative objects, p, as illustrated in Figure 7.4.

Case 1: p currently belongs to representative object o_j. If o_j is replaced by o_random as a representative object and p is closest to one of the other representative objects, o_i, i ≠ j, then p is reassigned to o_i.

Case 2: p currently belongs to representative object o_j. If o_j is replaced by o_random as a representative object and p is closest to o_random, then p is reassigned to o_random.

Case 3: p currently belongs to representative object o_i, i ≠ j. If o_j is replaced by o_random as a representative object and p is still closest to o_i, then the assignment does not change.

Case 4: p currently belongs to representative object o_i, i ≠ j. If o_j is replaced by o_random as a representative object and p is closest to o_random, then p is reassigned to o_random.
Figure 7.4: Four cases of the cost function for k-medoids clustering.

Each time a reassignment occurs, a difference in absolute error, E, is contributed to the cost function. Therefore, the cost function calculates the difference in absolute-error value if a current representative object is replaced by a nonrepresentative object. The total cost of swapping is the sum of the costs incurred by all nonrepresentative objects. If the total cost is negative, then o_j is replaced or swapped with o_random, since the actual absolute error E would be reduced. If the total cost is positive, the current representative object, o_j, is considered acceptable, and nothing is changed in the iteration.

PAM (Partitioning Around Medoids) was one of the first k-medoids algorithms introduced (Figure 7.5). It attempts to determine k partitions for n objects. After an initial random selection of k representative objects, the algorithm repeatedly tries to make a better choice of cluster representatives. All of the possible pairs of objects are analyzed, where one object in each pair is considered a representative object and the other is not. The quality of
Algorithm: k-medoids. PAM, a k-medoids algorithm for partitioning based on medoid or central objects.

Input:
  k: the number of clusters,
  D: a data set containing n objects.

Output: A set of k clusters.

Method:
  (1) arbitrarily choose k objects in D as the initial representative objects or seeds;
  (2) repeat
  (3)   assign each remaining object to the cluster with the nearest representative object;
  (4)   randomly select a nonrepresentative object, o_random;
  (5)   compute the total cost, S, of swapping representative object o_j with o_random;
  (6)   if S < 0 then swap o_j with o_random to form the new set of k representative objects;
  (7) until no change;

Figure 7.5: PAM, a k-medoids partitioning algorithm.

the resulting clustering is calculated for each such combination. An object, o_j, is replaced with the object causing the greatest reduction in error. The set of best objects for each cluster in one iteration forms the representative objects for the next iteration. The final set of representative objects are the respective medoids of the clusters. The complexity of each iteration is O(k(n - k)^2). For large values of n and k, such computation becomes very costly.

"Which method is more robust: k-means or k-medoids?" The k-medoids method is more robust than k-means in the presence of noise and outliers because a medoid is less influenced by outliers or other extreme values than a mean. However, its processing is more costly than the k-means method. Both methods require the user to specify k, the number of clusters.

Aside from using the mean or the medoid as a measure of cluster center, other alternative measures are also commonly used in partitioning clustering methods. The median can be used, resulting in the k-median method, where the median or "middle value" is taken for each ordered attribute. Alternatively, in the k-modes method, the most frequent value for each attribute is used.
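The swapping logic of Figure 7.5 can be illustrated with a small greedy sketch in Python. This is a simplification of PAM (it accepts the first improving swap and restarts, rather than evaluating all pairs and taking the greatest reduction per iteration), and the function name and seeding are our own.

```python
import itertools

def pam(points, k, dist):
    """Greedy swap-based k-medoids in the spirit of PAM (illustrative)."""
    medoids = list(points[:k])                 # arbitrary initial seeds

    def cost(meds):
        # Absolute-error criterion E of Equation (7.19): each object
        # contributes its distance to the nearest representative object.
        return sum(min(dist(p, m) for m in meds) for p in points)

    best = cost(medoids)
    improved = True
    while improved:                            # stop when no swap helps
        improved = False
        for i, o_random in itertools.product(range(k), points):
            if o_random in medoids:
                continue
            trial = medoids[:i] + [o_random] + medoids[i + 1:]
            c = cost(trial)
            if c < best:                       # total swap cost S < 0: accept
                medoids, best, improved = trial, c, True
    return medoids, best
```

On the 1-D data set {1, 2, 3, 10, 11, 12} with k = 2 and absolute difference as the dissimilarity, the procedure settles on the medoids 2 and 11 with absolute error E = 4.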
7.4.2
Partitioning Methods in Large Databases: From k-Medoids to CLARANS
"How efficient is the k-medoids algorithm on large data sets?" A typical k-medoids partitioning algorithm like PAM works effectively for small data sets, but does not scale well for large data sets. To deal with larger data sets, a sampling-based method, called CLARA (Clustering LARge Applications), can be used.

The idea behind CLARA is as follows: Instead of taking the whole set of data into consideration, a small portion of the actual data is chosen as a representative of the data. Medoids are then chosen from this sample using PAM. If the sample is selected in a fairly random manner, it should closely represent the original data set. The representative objects (medoids) chosen will likely be similar to those that would have been chosen from the whole data set. CLARA draws multiple samples of the data set, applies PAM on each sample, and returns its best clustering as the output. As expected, CLARA can deal with larger data sets than PAM. The complexity of each iteration now becomes O(ks^2 + k(n - k)), where s is the size of the sample, k is the number of clusters, and n is the total number of objects.

The effectiveness of CLARA depends on the sample size. Notice that PAM searches for the best k medoids among a given data set, whereas CLARA searches for the best k medoids among the selected sample of the data set. CLARA cannot find the best clustering if any of the best k medoids is not among the sample.
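The sampling idea behind CLARA can be sketched as follows: cluster each sample with a simple PAM, but judge the resulting medoid sets against the whole data set. The parameter defaults (five samples of roughly 40 + 2k objects) follow commonly cited CLARA settings, and the helper names are our own.

```python
import random

def clara(points, k, dist, num_samples=5, sample_size=None, seed=0):
    """CLARA sketch: run a simple PAM on several random samples and keep
    the medoid set that is best over the WHOLE data set."""
    rng = random.Random(seed)
    sample_size = sample_size or min(len(points), 40 + 2 * k)

    def total_cost(meds):                      # evaluated on all n objects
        return sum(min(dist(p, m) for m in meds) for p in points)

    def pam(sample):                           # minimal greedy PAM on one sample
        def cost(meds):
            return sum(min(dist(p, m) for m in meds) for p in sample)
        meds, improved = list(sample[:k]), True
        while improved:
            improved = False
            for i in range(k):
                for o in sample:
                    if o in meds:
                        continue
                    trial = meds[:i] + [o] + meds[i + 1:]
                    if cost(trial) < cost(meds):
                        meds, improved = trial, True
        return meds

    best = None
    for _ in range(num_samples):
        meds = pam(rng.sample(points, sample_size))
        if best is None or total_cost(meds) < total_cost(best):
            best = meds
    return best
```

Note that each candidate medoid set is compared using the cost over the entire data set, even though the search itself is restricted to a sample.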
That is, if an object o_i is one of the best k medoids but is not selected during sampling, CLARA will never find the best clustering. This is, therefore, a trade-off for efficiency. A good clustering based on sampling will not necessarily represent a good clustering of the whole data set if the sample is biased.

"How might we improve the quality and scalability of CLARA?" A k-medoids type algorithm called CLARANS (Clustering Large Applications based upon RANdomized Search) was proposed, which combines the sampling technique with PAM. However, unlike CLARA, CLARANS does not confine itself to any given sample. While CLARA has a fixed sample at each stage of the search, CLARANS draws a sample with some randomness in each step of the search.

Conceptually, the clustering process can be viewed as a search through a graph, where each node is a potential solution (a set of k medoids). Two nodes are neighbors (that is, connected by an arc in the graph) if their sets differ by only one object. Each node can be assigned a cost that is defined by the total dissimilarity between every object and the medoid of its cluster. At each step, PAM examines all of the neighbors of the current node in its search for a minimum-cost solution. The current node is then replaced by the neighbor with the largest descent in cost. Because CLARA works on a sample of the entire data set, it examines fewer neighbors and restricts the search to subgraphs that are smaller than the original graph. While CLARA draws a sample of nodes at the beginning of a search, CLARANS dynamically draws a random sample of neighbors in each step of a search. The number of neighbors to be randomly sampled is restricted by a user-specified parameter. In this way, CLARANS does not confine the search to a localized area. If a better neighbor is found (i.e., having a lower error), CLARANS moves to the neighbor's node and the process starts again; otherwise, the current clustering produces a local minimum.
If a local minimum is found, CLARANS starts with new randomly selected nodes in search of a new local minimum. Once a user-specified number of local minima has been found, the algorithm outputs, as a solution, the best local minimum, that is, the local minimum having the lowest cost. CLARANS has been experimentally shown to be more effective than both PAM and CLARA. It can be used to find the most "natural" number of clusters using a silhouette coefficient, a property of an object that specifies how much the object truly belongs to the cluster. CLARANS also enables the detection of outliers. However, the computational complexity of CLARANS is about O(n^2), where n is the number of objects. Furthermore, its clustering quality is dependent on the sampling method used. The ability of CLARANS to deal with data objects that reside on disk can be further improved by focusing techniques that explore spatial data structures, such as R*-trees.
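The randomized neighbor search described above can be sketched as follows. The two parameters, numlocal (number of local minima to collect) and maxneighbor (neighbors sampled before declaring a local minimum), follow the CLARANS terminology; the concrete defaults and function names are illustrative.

```python
import random

def clarans(points, k, dist, numlocal=2, maxneighbor=40, seed=0):
    """CLARANS sketch: randomized search over the graph whose nodes are
    sets of k medoids; neighboring nodes differ in exactly one medoid."""
    rng = random.Random(seed)

    def cost(meds):
        return sum(min(dist(p, m) for m in meds) for p in points)

    best, best_cost = None, float("inf")
    for _ in range(numlocal):                  # restart: collect local minima
        current = rng.sample(points, k)
        current_cost = cost(current)
        tries = 0
        while tries < maxneighbor:             # sample neighbors, don't scan all
            i = rng.randrange(k)
            o = rng.choice(points)
            if o in current:
                continue
            neighbor = current[:i] + [o] + current[i + 1:]
            c = cost(neighbor)
            if c < current_cost:               # better neighbor: move and reset
                current, current_cost, tries = neighbor, c, 0
            else:
                tries += 1
        if current_cost < best_cost:           # keep the cheapest local minimum
            best, best_cost = current, current_cost
    return best, best_cost
```

Because only a bounded random sample of neighbors is examined at each node, the search may stop slightly short of the true optimum; the restarts compensate for this in practice.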
7.5
Hierarchical Methods
A hierarchical clustering method works by grouping data objects into a tree of clusters. Hierarchical clustering methods can be further classified as either agglomerative or divisive, depending on whether the hierarchical decomposition is formed in a bottom-up (merging) or top-down (splitting) fashion. The quality of a pure hierarchical clustering method suffers from its inability to perform adjustment once a merge or split decision has been executed. That is, if a particular merge or split decision later turns out to have been a poor choice, the method cannot backtrack and correct it. Recent studies have emphasized the integration of hierarchical agglomeration with iterative relocation methods.
7.5.1
Agglomerative and Divisive Hierarchical Clustering
In general, there are two types of hierarchical clustering methods:

Agglomerative hierarchical clustering: This bottom-up strategy starts by placing each object in its own cluster and then merges these atomic clusters into larger and larger clusters, until all of the objects are in a single cluster or until certain termination conditions are satisfied. Most hierarchical clustering methods belong to this category. They differ only in their definition of intercluster similarity.

Divisive hierarchical clustering: This top-down strategy does the reverse of agglomerative hierarchical clustering by starting with all objects in one cluster. It subdivides the cluster into smaller and smaller pieces,
until each object forms a cluster on its own or until it satisfies certain termination conditions, such as a desired number of clusters is obtained or the diameter of each cluster is within a certain threshold.
Figure 7.6: Agglomerative and divisive hierarchical clustering on data objects {a, b, c, d, e}.
Example 7.9 Agglomerative versus divisive hierarchical clustering. Figure 7.6 shows the application of AGNES (AGglomerative NESting), an agglomerative hierarchical clustering method, and DIANA (DIvisive ANAlysis), a divisive hierarchical clustering method, to a data set of five objects, {a, b, c, d, e}. Initially, AGNES places each object into a cluster of its own. The clusters are then merged step-by-step according to some criterion. For example, clusters C1 and C2 may be merged if an object in C1 and an object in C2 form the minimum Euclidean distance between any two objects from different clusters. This is a single-linkage approach in that each cluster is represented by all of the objects in the cluster, and the similarity between two clusters is measured by the similarity of the closest pair of data points belonging to different clusters. The cluster-merging process repeats until all of the objects are eventually merged to form one cluster.

In DIANA, all of the objects are used to form one initial cluster. The cluster is split according to some principle, such as the maximum Euclidean distance between the closest neighboring objects in the cluster. The cluster-splitting process repeats until, eventually, each new cluster contains only a single object.
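The agglomerative, single-linkage process that AGNES follows can be sketched in a few lines: start from singleton clusters and repeatedly merge the pair whose closest members are nearest. This is an illustrative O(n^3)-style sketch, not AGNES itself; the function name and the k_stop termination parameter are our own.

```python
def agnes_single_link(points, dist, k_stop=1):
    """Agglomerative clustering sketch with the single-linkage criterion:
    repeatedly merge the two clusters whose closest members are nearest."""
    clusters = [[p] for p in points]           # every object starts alone
    while len(clusters) > k_stop:
        best = None                            # (distance, i, j) of closest pair
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                d = min(dist(p, q) for p in clusters[i] for q in clusters[j])
                if best is None or d < best[0]:
                    best = (d, i, j)
        _, i, j = best
        clusters[i] += clusters[j]             # merge the closest pair
        del clusters[j]
    return clusters
```

Stopping when two clusters remain on the 1-D data {0, 1, 2, 10, 11, 12} yields the two natural groups.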
In either agglomerative or divisive hierarchical clustering, the user can specify the desired number of clusters as a termination condition.

A tree structure called a dendrogram is commonly used to represent the process of hierarchical clustering. It shows how objects are grouped together step by step. Figure 7.7 shows a dendrogram for the five objects presented in Figure 7.6, where l = 0 shows the five objects as singleton clusters at level 0. At l = 1, objects a and b are grouped together to form the first cluster, and they stay together at all subsequent levels. We can also use a vertical axis to show the similarity scale between clusters. For example, when the similarity of two groups of objects, {a, b} and {c, d, e}, is roughly 0.16, they are merged together to form a single cluster.

Four widely used measures for distance between clusters are as follows, where |p - p'| is the distance between two objects or points, p and p'; m_i is the mean for cluster C_i; and n_i is the number of objects in C_i.
Figure 7.7: Dendrogram representation for hierarchical clustering of data objects {a, b, c, d, e}.
Minimum distance:   d_{min}(C_i, C_j) = \min_{p \in C_i, p' \in C_j} |p - p'|   (7.20)

Maximum distance:   d_{max}(C_i, C_j) = \max_{p \in C_i, p' \in C_j} |p - p'|   (7.21)

Mean distance:      d_{mean}(C_i, C_j) = |m_i - m_j|   (7.22)

Average distance:   d_{avg}(C_i, C_j) = \frac{1}{n_i n_j} \sum_{p \in C_i} \sum_{p' \in C_j} |p - p'|   (7.23)
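For numeric points, the four inter-cluster distance measures of Equations (7.20) through (7.23) translate directly into code. The following sketch assumes clusters are lists of coordinate tuples and uses the Euclidean metric for |p - p'|; the function names are our own.

```python
from math import dist as euclid                # Euclidean |p - p'|

def d_min(Ci, Cj):                             # minimum distance, Eq. (7.20)
    return min(euclid(p, q) for p in Ci for q in Cj)

def d_max(Ci, Cj):                             # maximum distance, Eq. (7.21)
    return max(euclid(p, q) for p in Ci for q in Cj)

def d_mean(Ci, Cj):                            # distance between means, Eq. (7.22)
    mi = tuple(sum(x) / len(Ci) for x in zip(*Ci))
    mj = tuple(sum(x) / len(Cj) for x in zip(*Cj))
    return euclid(mi, mj)

def d_avg(Ci, Cj):                             # average pairwise distance, Eq. (7.23)
    return sum(euclid(p, q) for p in Ci for q in Cj) / (len(Ci) * len(Cj))
```

For example, with C_i = {(0, 0), (0, 2)} and C_j = {(3, 0), (5, 0)}, the minimum distance is 3 while the maximum distance is sqrt(29), illustrating how differently the two extremes can rate the same pair of clusters.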
When an algorithm uses the minimum distance, d_{min}(C_i, C_j), to measure the distance between clusters, it is sometimes called a nearest-neighbor clustering algorithm. Moreover, if the clustering process is terminated when the distance between nearest clusters exceeds an arbitrary threshold, it is called a single-linkage algorithm. If we view the data points as nodes of a graph, with edges forming a path between the nodes in a cluster, then the merging of two clusters, C_i and C_j, corresponds to adding an edge between the nearest pair of nodes in C_i and C_j. Because edges linking clusters always go between distinct clusters, the resulting graph will generate a tree. Thus, an agglomerative hierarchical clustering algorithm that uses the minimum distance measure is also called a minimal spanning tree algorithm.

When an algorithm uses the maximum distance, d_{max}(C_i, C_j), to measure the distance between clusters, it is sometimes called a farthest-neighbor clustering algorithm. If the clustering process is terminated when the maximum distance between nearest clusters exceeds an arbitrary threshold, it is called a complete-linkage algorithm. By viewing data points as nodes of a graph, with edges linking nodes, we can think of each cluster as a complete subgraph, that is, with edges connecting all of the nodes in the clusters. The distance between two clusters is determined by the most distant nodes in the two clusters. Farthest-neighbor algorithms tend to minimize the increase in diameter of the clusters at each iteration. If the true clusters are rather compact and approximately equal in size, the method will produce high-quality clusters. Otherwise, the clusters produced can be meaningless.

The above minimum and maximum measures represent two extremes in measuring the distance between clusters. They tend to be overly sensitive to outliers or noisy data.
The use of mean or average distance is a compromise between the minimum and maximum distances and overcomes the outlier sensitivity problem. Whereas the mean distance is the simplest to compute, the average distance is advantageous in that it can handle categoric as well
as numeric data.² The computation of the mean vector for categoric data can be difficult or impossible to define.

"What are some of the difficulties with hierarchical clustering?" The hierarchical clustering method, though simple, often encounters difficulties regarding the selection of merge or split points. Such a decision is critical because once a group of objects is merged or split, the process at the next step will operate on the newly generated clusters. It will neither undo what was done previously nor perform object swapping between clusters. Thus, merge or split decisions, if not well chosen at some step, may lead to low-quality clusters. Moreover, the method does not scale well, since each decision to merge or split requires the examination and evaluation of a good number of objects or clusters.

One promising direction for improving the clustering quality of hierarchical methods is to integrate hierarchical clustering with other clustering techniques, resulting in multiple-phase clustering. Three such methods are introduced in the following subsections. The first, called BIRCH, begins by partitioning objects hierarchically using tree structures, where the leaf or low-level nonleaf nodes can be viewed as "microclusters" depending on the scale of resolution. It then applies other clustering algorithms to perform macroclustering on the microclusters. The second method, called ROCK, merges clusters based on their interconnectivity. The third method, called Chameleon, explores dynamic modeling in hierarchical clustering.
7.5.2
BIRCH: Balanced Iterative Reducing and Clustering Using Hierarchies
BIRCH is designed for clustering a large amount of numerical data by integrating hierarchical clustering (at the initial microclustering stage) and other clustering methods such as iterative partitioning (at the later macroclustering stage). It overcomes two difficulties of agglomerative clustering methods: (1) scalability and (2) the inability to undo what was done in the previous step.

BIRCH introduces two concepts, clustering feature and clustering feature tree (CF tree), which are used to summarize cluster representations. These structures help the clustering method achieve good speed and scalability in large databases and also make it effective for incremental and dynamic clustering of incoming objects.

Let's have a closer look at the above-mentioned structures. Given n d-dimensional data objects or points in a cluster, we can define the centroid x_0, radius R, and diameter D of the cluster as follows:
x_0 = \frac{\sum_{i=1}^{n} x_i}{n}   (7.24)

R = \sqrt{\frac{\sum_{i=1}^{n} (x_i - x_0)^2}{n}}   (7.25)

D = \sqrt{\frac{\sum_{i=1}^{n} \sum_{j=1}^{n} (x_i - x_j)^2}{n(n-1)}}   (7.26)

where R is the average distance from member objects to the centroid, and D is the average pairwise distance within a cluster. Both R and D reflect the tightness of the cluster around the centroid.

A clustering feature (CF) is a three-dimensional vector summarizing information about clusters of objects. Given n d-dimensional objects or points in a cluster, {x_i}, the CF of the cluster is defined as

CF = \langle n, LS, SS \rangle,   (7.27)

where n is the number of points in the cluster, LS is the linear sum of the n points (i.e., \sum_{i=1}^{n} x_i), and SS is the square sum of the data points (i.e., \sum_{i=1}^{n} x_i^2).
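The centroid, radius, and diameter of Equations (7.24) through (7.26) can be computed directly from a list of coordinate tuples, as in this illustrative sketch (the function names are our own):

```python
from math import sqrt

def centroid(points):                          # Equation (7.24)
    n = len(points)
    return tuple(sum(p[i] for p in points) / n for i in range(len(points[0])))

def radius(points):                            # Equation (7.25)
    n, x0 = len(points), centroid(points)
    sq = sum(sum((p[i] - x0[i]) ** 2 for i in range(len(x0))) for p in points)
    return sqrt(sq / n)

def diameter(points):                          # Equation (7.26)
    n = len(points)
    sq = sum(sum((p[i] - q[i]) ** 2 for i in range(len(p)))
             for p in points for q in points)
    return sqrt(sq / (n * (n - 1)))
```

For the three points (2, 5), (3, 2), and (4, 3) used in Example 7.10 below, the centroid is (3, 10/3), the radius is sqrt(20/9), and the diameter is sqrt(20/3).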
² To handle categoric data, dissimilarity measures such as those described in Sections 7.2.2 and 7.2.3 can be used to replace |p - p'| by d(p, p') in Equation (7.23).
A clustering feature is essentially a summary of the statistics for the given cluster: the zeroth, first, and second moments of the cluster from a statistical point of view. Clustering features are additive. For example, suppose that we have two disjoint clusters, C_1 and C_2, having the clustering features CF_1 and CF_2, respectively. The clustering feature for the cluster formed by merging C_1 and C_2 is simply CF_1 + CF_2.

Clustering features are sufficient for calculating all of the measurements needed for making clustering decisions in BIRCH. BIRCH thus utilizes storage efficiently by employing the clustering features to summarize information about the clusters of objects, thereby bypassing the need to store all objects.

Example 7.10 Clustering feature. Suppose that there are three points, (2, 5), (3, 2), and (4, 3), in a cluster, C_1. The clustering feature of C_1 is

CF_1 = \langle 3, (2+3+4, 5+2+3), (2^2+3^2+4^2, 5^2+2^2+3^2) \rangle = \langle 3, (9, 10), (29, 38) \rangle.

Suppose that C_1 is disjoint from a second cluster, C_2, where CF_2 = \langle 3, (35, 36), (417, 440) \rangle. The clustering feature of a new cluster, C_3, formed by merging C_1 and C_2, is derived by adding CF_1 and CF_2. That is,

CF_3 = \langle 3+3, (9+35, 10+36), (29+417, 38+440) \rangle = \langle 6, (44, 46), (446, 478) \rangle.
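The computation and additivity of clustering features can be sketched directly from the definition, keeping LS and SS per dimension as in Example 7.10 (the function names are our own):

```python
def cf(points):
    """Clustering feature <n, LS, SS> of a cluster, with LS and SS kept
    per dimension as in Example 7.10."""
    d = len(points[0])
    return (len(points),
            tuple(sum(p[i] for p in points) for i in range(d)),        # LS
            tuple(sum(p[i] ** 2 for p in points) for i in range(d)))   # SS

def cf_merge(cf1, cf2):
    """Additivity: the CF of the union of two disjoint clusters is the
    componentwise sum of their CFs."""
    (n1, ls1, ss1), (n2, ls2, ss2) = cf1, cf2
    return (n1 + n2,
            tuple(a + b for a, b in zip(ls1, ls2)),
            tuple(a + b for a, b in zip(ss1, ss2)))
```

Running this on the clusters of Example 7.10 reproduces CF_1 = <3, (9, 10), (29, 38)> and the merged CF_3 = <6, (44, 46), (446, 478)>.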
A CF tree is a height-balanced tree that stores the clustering features for a hierarchical clustering. An example is shown in Figure 7.8. By definition, a nonleaf node in a tree has descendants or "children." The nonleaf nodes store sums of the CFs of their children and thus summarize clustering information about their children. A CF tree has two parameters: branching factor B and threshold T. The branching factor specifies the maximum number of children per nonleaf node. The threshold parameter specifies the maximum diameter of subclusters stored at the leaf nodes of the tree. These two parameters influence the size of the resulting tree.
Figure 7.8: A CF tree structure.

BIRCH tries to produce the best clusters with the available resources. Given a limited amount of main memory, an important consideration is to minimize the time required for I/O. BIRCH applies a multiphase clustering technique: a single scan of the data set yields a basic good clustering, and one or more additional scans can (optionally) be used to further improve the quality. The primary phases are:

Phase 1: BIRCH scans the database to build an initial in-memory CF tree, which can be viewed as a multilevel compression of the data that tries to preserve the inherent clustering structure of the data.

Phase 2: BIRCH applies a (selected) clustering algorithm to cluster the leaf nodes of the CF tree, which removes sparse clusters as outliers and groups dense clusters into larger ones.

For Phase 1, the CF tree is built dynamically as objects are inserted. Thus, the method is incremental. An object is inserted into the closest leaf entry (subcluster). If the diameter of the subcluster stored in the leaf node after insertion is larger than the threshold value, then the leaf node and possibly other nodes are split. After the insertion of the new object, information about it is passed toward the root of the tree. The size of the CF tree can be changed by modifying the threshold. If the size of the memory needed for storing the CF tree is larger than the size of the main memory, then a smaller threshold value can be specified and the CF tree is rebuilt. The rebuild process is performed by building a new tree from the leaf nodes of the old tree. Thus, the process of
30
CHAPTER 7. CLUSTER ANALYSIS
rebuilding the tree is done without the necessity of rereading all of the objects or points. This is similar to the insertion and node split in the construction of B+-trees. Therefore, for building the tree, the data has to be read just once. Some heuristics and methods have been introduced to deal with outliers and improve the quality of CF trees by additional scans of the data. Once the CF tree is built, any clustering algorithm, such as a typical partitioning algorithm, can be used with the CF tree in Phase 2.

"How effective is BIRCH?" The computational complexity of the algorithm is O(n), where n is the number of objects to be clustered. Experiments have shown the linear scalability of the algorithm with respect to the number of objects, and good quality of clustering of the data. However, since each node in a CF tree can hold only a limited number of entries due to its size, a CF tree node does not always correspond to what a user may consider a natural cluster. Moreover, if the clusters are not spherical in shape, BIRCH does not perform well, because it uses the notion of radius or diameter to control the boundary of a cluster.
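The core of the Phase 1 insertion step, inserting a point into the closest leaf entry unless the merged subcluster's diameter exceeds the threshold T, can be sketched using only clustering features. This is a simplified, single-node sketch, not BIRCH itself: it ignores the branching factor, node splits, and the upward propagation of CFs, and the function names are our own. It exploits the identity sum_i sum_j (x_i - x_j)^2 = 2n*SS - 2*LS^2 (summed over dimensions) to obtain the diameter of Equation (7.26) from a CF alone.

```python
from math import sqrt

def cf_add(a, b):
    """Componentwise sum of two clustering features <n, LS, SS>."""
    return (a[0] + b[0],
            tuple(x + y for x, y in zip(a[1], b[1])),
            tuple(x + y for x, y in zip(a[2], b[2])))

def cf_diameter(e):
    """Diameter D of Equation (7.26), computed from the CF alone via
    sum_i sum_j (x_i - x_j)^2 = 2n*SS - 2*LS^2 per dimension."""
    n, ls, ss = e
    if n < 2:
        return 0.0
    sq = 2 * n * sum(ss) - 2 * sum(x * x for x in ls)
    return sqrt(max(sq, 0.0) / (n * (n - 1)))

def leaf_insert(entries, point, threshold):
    """Absorb `point` into the closest leaf entry if the merged subcluster's
    diameter stays within T; otherwise open a new entry (a real CF tree
    would then split the node if the branching factor were exceeded)."""
    p_cf = (1, tuple(point), tuple(x * x for x in point))
    if entries:
        j = min(range(len(entries)),
                key=lambda i: sum((entries[i][1][d] / entries[i][0] - point[d]) ** 2
                                  for d in range(len(point))))
        merged = cf_add(entries[j], p_cf)
        if cf_diameter(merged) <= threshold:
            entries[j] = merged
            return entries
    entries.append(p_cf)
    return entries
```

Note that no raw points are retained: the threshold test is answered entirely from the summarized statistics, which is exactly why BIRCH can stay within a memory budget.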
7.5.3
ROCK: A Hierarchical Clustering Algorithm for Categorical Attributes
ROCK (RObust Clustering using linKs) is a hierarchical clustering algorithm that explores the concept of links (the number of common neighbors between two objects) for data with categorical attributes. Traditional clustering algorithms for clustering data with Boolean and categorical attributes use distance functions (such as those introduced for binary variables in Section 7.2.2). However, experiments show that such distance measures cannot lead to high-quality clusters when clustering categorical data. Furthermore, most clustering algorithms assess only the similarity between points when clustering; that is, at each step, points that are the most similar are merged into a single cluster. This "localized" approach is prone to errors. For example, two distinct clusters may have a few points or outliers that are close; therefore, relying on the similarity between points to make clustering decisions could cause the two clusters to be merged.

ROCK takes a more global approach to clustering by considering the neighborhoods of individual pairs of points. If two similar points also have similar neighborhoods, then the two points likely belong to the same cluster and so can be merged. More formally, two points, p_i and p_j, are neighbors if sim(p_i, p_j) ≥ θ, where sim is a similarity function and θ is a user-specified threshold. We can choose sim to be a distance metric or even a nonmetric (provided by a domain expert or as in Section 7.2.5) that is normalized so that its values fall between 0 and 1, with larger values indicating that the points are more similar. The number of links between p_i and p_j is defined as the number of common neighbors between p_i and p_j. If the number of links between two points is large, then it is more likely that they belong to the same cluster. By considering neighboring data points in the relationship between individual pairs of points, ROCK is more robust than standard clustering methods that focus only on point similarity.
A good example of data containing categorical attributes is market basket data (Chapter 5). Such data consists of a database of transactions, where each transaction is a set of items. Transactions are considered records with Boolean attributes, each corresponding to an individual item, such as bread or cheese. In the record for a transaction, the attribute corresponding to an item is true if the transaction contains the item; otherwise, it is false. Other data sets with categorical attributes can be handled in a similar manner. ROCK's concepts of neighbors and links are illustrated in the following example, where the similarity between two "points" or transactions, T_i and T_j, is defined with the Jaccard coefficient as
sim(T_i, T_j) = \frac{|T_i \cap T_j|}{|T_i \cup T_j|}.   (7.28)
Example 7.11 Using neighborhood link information together with point similarity. Suppose that a market basket database contains transactions regarding the items a, b, ..., g. Consider two clusters of transactions, C_1 and C_2. C_1, which references the items a, b, c, d, e, contains the transactions {a, b, c}, {a, b, d}, {a, b, e}, {a, c, d}, {a, c, e}, {a, d, e}, {b, c, d}, {b, c, e}, {b, d, e}, {c, d, e}. C_2 references the items a, b, f, g. It contains the transactions {a, b, f}, {a, b, g}, {a, f, g}, {b, f, g}.

Suppose, first, that we consider only the similarity between points while ignoring neighborhood information. The Jaccard coefficient between the transactions {a, b, c} and {b, d, e} of C_1 is 1/5 = 0.2. In fact, the Jaccard coefficient between any pair of transactions in C_1 ranges from 0.2
to 0.5 (e.g., {a, b, c} and {a, b, d}). The Jaccard coefficient between transactions belonging to different clusters may also reach 0.5 (e.g., {a, b, c} of C_1 with {a, b, f} or {a, b, g} of C_2). Clearly, by using the Jaccard coefficient on its own, we cannot obtain the desired clusters.

On the other hand, the link-based approach of ROCK can successfully separate the transactions into the appropriate clusters. As it turns out, for each transaction, the transaction with which it has the most links is always another transaction from the same cluster. For example, let θ = 0.5. Transaction {a, b, f} of C_2 has five links with transaction {a, b, g} of the same cluster (due to the common neighbors {a, b, c}, {a, b, d}, {a, b, e}, {a, f, g}, and {b, f, g}). However, transaction {a, b, f} of C_2 has only three links with {a, b, c} of C_1 (due to {a, b, d}, {a, b, e}, and {a, b, g}). Similarly, transaction {a, f, g} of C_2 has two links with every other transaction in C_2, and zero links with each transaction in C_1. Thus, the link-based approach, which considers neighborhood information in addition to object similarity, can correctly distinguish the two clusters of transactions.

Based on these ideas, ROCK first constructs a sparse graph from a given data similarity matrix using a similarity threshold and the concept of shared neighbors. It then performs agglomerative hierarchical clustering on the sparse graph. A goodness measure is used to evaluate the clustering. Random sampling is used for scaling up to large data sets. The worst-case time complexity of ROCK is O(n^2 + n m_m m_a + n^2 log n), where m_m and m_a are the maximum and average number of neighbors, respectively, and n is the number of objects. On several real-life data sets, such as the congressional voting data set and the mushroom data set from the UC Irvine Machine Learning Repository, ROCK has demonstrated its power at deriving much more meaningful clusters than traditional hierarchical clustering algorithms.
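The neighbor and link computations of Example 7.11 can be reproduced with a few lines of Python. The sketch represents each transaction as a frozenset of items and, following the example's counting, excludes the pair of transactions themselves when counting their common neighbors; the function names are our own.

```python
def jaccard(t1, t2):
    """Similarity between two transactions, Equation (7.28)."""
    return len(t1 & t2) / len(t1 | t2)

def neighbors(T, transactions, theta):
    """Transactions (other than T itself) with sim(T, U) >= theta."""
    return {U for U in transactions if U != T and jaccard(T, U) >= theta}

def links(t1, t2, transactions, theta):
    """Number of common neighbors of t1 and t2, the pair itself excluded."""
    common = neighbors(t1, transactions, theta) & neighbors(t2, transactions, theta)
    return len(common - {t1, t2})
```

With θ = 0.5 on the fourteen transactions of Example 7.11, this yields five links between {a, b, f} and {a, b, g} but only three between {a, b, f} and {a, b, c}, matching the example.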
7.5.4
Chameleon: A Hierarchical Clustering Algorithm Using Dynamic Modeling
Chameleon is a hierarchical clustering algorithm that uses dynamic modeling to determine the similarity between pairs of clusters. It was derived based on the observed weaknesses of two hierarchical clustering algorithms: ROCK and CURE. ROCK and related schemes emphasize cluster interconnectivity while ignoring information regarding cluster proximity. CURE and related schemes consider cluster proximity yet ignore cluster interconnectivity. In Chameleon, cluster similarity is assessed based on how well connected objects are within a cluster and on the proximity of clusters. That is, two clusters are merged if their interconnectivity is high and they are close together. Thus, Chameleon does not depend on a static, user-supplied model and can automatically adapt to the internal characteristics of the clusters being merged. The merge process facilitates the discovery of natural and homogeneous clusters and applies to all types of data as long as a similarity function can be specified.
[Figure: data set → construct a sparse k-nearest-neighbor graph → partition the graph → merge partitions → final clusters]
Figure 7.9: Chameleon: hierarchical clustering based on k-nearest neighbors and dynamic modeling. Based on [KHK99].
"How does Chameleon work?" The main approach of Chameleon is illustrated in Figure 7.9. Chameleon uses a k-nearest-neighbor graph approach to construct a sparse graph, where each vertex of the graph represents a data object, and there exists an edge between two vertices (objects) if one object is among the k most similar objects of the other. The edges are weighted to reflect the similarity between objects. Chameleon uses a graph-partitioning algorithm to partition the k-nearest-neighbor graph into a large number of relatively small subclusters. It then uses an agglomerative hierarchical clustering algorithm that repeatedly merges subclusters based on their similarity. To determine the pairs of most similar subclusters, it takes into account both the interconnectivity and the closeness of the clusters. We give a mathematical definition of these criteria shortly.
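The first step, building the k-nearest-neighbor graph, can be sketched as follows. This is a simplified illustration, not Chameleon's implementation; the inverse-distance similarity is an assumed choice:

```python
import math

def knn_graph(points, k):
    """Return {vertex: {neighbor: weight}} where an edge (u, v) exists if
    v is among the k most similar points to u (or vice versa), weighted by
    similarity."""
    def sim(p, q):
        # similarity decreasing with Euclidean distance (assumed form)
        return 1.0 / (1.0 + math.dist(p, q))

    n = len(points)
    graph = {i: {} for i in range(n)}
    for i in range(n):
        ranked = sorted((j for j in range(n) if j != i),
                        key=lambda j: sim(points[i], points[j]),
                        reverse=True)
        for j in ranked[:k]:
            w = sim(points[i], points[j])
            graph[i][j] = w      # edge added in both directions so the
            graph[j][i] = w      # graph stays undirected
    return graph

pts = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
g = knn_graph(pts, k=2)
# Each point in the left group connects only to the left group, and likewise
# for the right group, so the graph falls apart into two natural components.
```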
CHAPTER 7. CLUSTER ANALYSIS
Note that the k-nearest-neighbor graph captures the concept of neighborhood dynamically: the neighborhood radius of an object is determined by the density of the region in which the object resides. In a dense region, the neighborhood is defined narrowly; in a sparse region, it is defined more widely. This tends to result in more natural clusters, in comparison with density-based methods like DBSCAN (described in Section 7.6) that instead use a global neighborhood. Moreover, the density of a region is recorded as the weight of its edges: the edges of a dense region tend to weigh more than those of a sparse region.
The graph-partitioning algorithm partitions the k-nearest-neighbor graph so as to minimize the edge cut. That is, a cluster C is partitioned into subclusters Ci and Cj so as to minimize the weight of the edges that would be cut should C be bisected into Ci and Cj. The edge cut is denoted EC(Ci, Cj) and assesses the absolute interconnectivity between clusters Ci and Cj.
Chameleon determines the similarity between each pair of clusters Ci and Cj according to their relative interconnectivity, RI(Ci, Cj), and their relative closeness, RC(Ci, Cj):
The relative interconnectivity, RI(Ci, Cj), between two clusters, Ci and Cj, is defined as the absolute interconnectivity between Ci and Cj, normalized with respect to the internal interconnectivity of the two clusters. That is,

    RI(Ci, Cj) = |EC_{Ci,Cj}| / ( (|EC_Ci| + |EC_Cj|) / 2 ),    (7.29)
where EC_{Ci,Cj} is the edge cut, as defined above, for a cluster containing both Ci and Cj. Similarly, EC_Ci (or EC_Cj) is the minimum sum of the cut edges that partition Ci (or Cj) into two roughly equal parts.
The relative closeness, RC(Ci, Cj), between a pair of clusters, Ci and Cj, is the absolute closeness between Ci and Cj, normalized with respect to the internal closeness of the two clusters. It is defined as

    RC(Ci, Cj) = S̄_{EC_{Ci,Cj}} / ( (|Ci|/(|Ci|+|Cj|))·S̄_{EC_Ci} + (|Cj|/(|Ci|+|Cj|))·S̄_{EC_Cj} ),    (7.30)
where S̄_{EC_{Ci,Cj}} is the average weight of the edges that connect vertices in Ci to vertices in Cj, and S̄_{EC_Ci} (or S̄_{EC_Cj}) is the average weight of the edges that belong to the min-cut bisector of cluster Ci (or Cj).
Chameleon has been shown to have greater power at discovering arbitrarily shaped, high-quality clusters than several well-known algorithms such as BIRCH and density-based DBSCAN (Section 7.6.1). However, the processing cost for high-dimensional data may require O(n^2) time for n objects in the worst case.
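Equations (7.29) and (7.30) are simple ratios once the edge-cut statistics are known. A worked sketch with made-up values (all numbers illustrative, not taken from any real partitioning):

```python
def relative_interconnectivity(ec_ij, ec_i, ec_j):
    """RI = |EC_{Ci,Cj}| / ((|EC_Ci| + |EC_Cj|) / 2), Equation (7.29)."""
    return ec_ij / ((ec_i + ec_j) / 2.0)

def relative_closeness(s_ij, s_i, s_j, n_i, n_j):
    """RC: average connecting-edge weight, normalized by the size-weighted
    average of the internal bisector edge weights, Equation (7.30)."""
    total = n_i + n_j
    return s_ij / ((n_i / total) * s_i + (n_j / total) * s_j)

# Illustrative values: the cut between Ci and Cj weighs 12; bisecting Ci
# internally cuts weight 10, and bisecting Cj cuts weight 14.
ri = relative_interconnectivity(ec_ij=12.0, ec_i=10.0, ec_j=14.0)

# Average connecting-edge weight 0.9 vs. internal bisector averages 1.0
# and 0.8, with cluster sizes 40 and 60.
rc = relative_closeness(s_ij=0.9, s_i=1.0, s_j=0.8, n_i=40, n_j=60)
print(ri)  # 1.0
print(rc)
```

Chameleon merges the pair of subclusters maximizing a combination of these two values, so both strong interconnectivity and closeness are required for a merge.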
7.6
Density-Based Methods
To discover clusters with arbitrary shape, density-based clustering methods have been developed. These typically regard clusters as dense regions of objects in the data space that are separated by regions of low density (representing noise).
7.6.1
DBSCAN: A Density-Based Clustering Method Based on Connected Regions with Sufficiently High Density
DBSCAN (Density-Based Spatial Clustering of Applications with Noise) is a density-based clustering algorithm. The algorithm grows regions with sufficiently high density into clusters and discovers clusters of arbitrary shape in spatial databases with noise. It defines a cluster as a maximal set of density-connected points. The basic ideas of density-based clustering involve a number of new definitions. We intuitively present these definitions and then follow up with an example.
The neighborhood within a radius ε of a given object is called the ε-neighborhood of the object.
If the ε-neighborhood of an object contains at least a minimum number, MinPts, of objects, then the object is called a core object.
Given a set of objects, D, we say that an object p is directly density-reachable from object q if p is within the ε-neighborhood of q, and q is a core object.
An object p is density-reachable from object q with respect to ε and MinPts in a set of objects, D, if there is a chain of objects p1, ..., pn, with p1 = q and pn = p, such that pi+1 is directly density-reachable from pi with respect to ε and MinPts, for 1 ≤ i < n, where pi ∈ D.
An object p is density-connected to object q with respect to ε and MinPts in a set of objects, D, if there is an object o ∈ D such that both p and q are density-reachable from o with respect to ε and MinPts.
Density reachability is the transitive closure of direct density reachability, and this relationship is asymmetric: only core objects are mutually density-reachable. Density connectivity, however, is a symmetric relation.
Example 7.12 Density-reachability and density-connectivity. Consider Figure 7.10 for a given ε, represented by the radius of the circles, and, say, let MinPts = 3. Based on the above definitions:
Figure 7.10: Density reachability and density connectivity in density-based clustering. Based on [EKSX96].
Of the labeled points, m, p, o, and r are core objects because each is in an ε-neighborhood containing at least three points.
q is directly density-reachable from m. m is directly density-reachable from p, and vice versa.
q is (indirectly) density-reachable from p, since q is directly density-reachable from m and m is directly density-reachable from p. However, p is not density-reachable from q, since q is not a core object. Similarly, r and s are density-reachable from o, and o is density-reachable from r. Thus, o, r, and s are all density-connected.
A density-based cluster is a set of density-connected objects that is maximal with respect to density-reachability. Every object not contained in any cluster is considered to be noise.
"How does DBSCAN find clusters?" DBSCAN searches for clusters by checking the ε-neighborhood of each point in the database. If the ε-neighborhood of a point p contains at least MinPts points, a new cluster with p as a core object is created. DBSCAN then iteratively collects directly density-reachable objects from these core objects, which may involve the merging of a few density-reachable clusters. The process terminates when no new point can be added to any cluster. If a spatial index is used, the computational complexity of DBSCAN is O(n log n), where n is the number of database objects. Otherwise, it is O(n^2). With appropriate settings of the user-defined parameters, ε and MinPts, the algorithm is effective at finding arbitrarily shaped clusters.
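The procedure just described can be sketched compactly. This is an illustrative O(n^2) version without a spatial index; the points, ε, and MinPts below are arbitrary:

```python
import math

def dbscan(points, eps, min_pts):
    """Return a label per point: 0, 1, ... for clusters, -1 for noise."""
    def region(i):
        # the eps-neighborhood of point i (includes i itself)
        return [j for j, q in enumerate(points)
                if math.dist(points[i], q) <= eps]

    labels = [None] * len(points)   # None = unvisited
    cluster = -1
    for i in range(len(points)):
        if labels[i] is not None:
            continue
        seeds = region(i)
        if len(seeds) < min_pts:
            labels[i] = -1          # noise (may later join a cluster as a border point)
            continue
        cluster += 1                # i is a core object: start a new cluster
        labels[i] = cluster
        queue = [j for j in seeds if j != i]
        while queue:
            j = queue.pop()
            if labels[j] == -1:
                labels[j] = cluster           # former "noise" becomes a border point
            if labels[j] is not None:
                continue
            labels[j] = cluster
            neigh = region(j)
            if len(neigh) >= min_pts:         # j is also core: keep expanding
                queue.extend(neigh)
    return labels

pts = [(0, 0), (0.5, 0), (0, 0.5), (5, 5), (5.5, 5), (5, 5.5), (20, 20)]
labels = dbscan(pts, eps=1.0, min_pts=3)
print(labels)  # [0, 0, 0, 1, 1, 1, -1]: two clusters plus one noise point
```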
7.6.2
OPTICS: Ordering Points To Identify the Clustering Structure
Although DBSCAN can cluster objects given input parameters such as ε and MinPts, it still leaves the user with the responsibility of selecting parameter values that will lead to the discovery of acceptable clusters. Actually, this is a problem associated with many other clustering algorithms. Such parameters are usually set empirically and are difficult to determine, especially for real-world, high-dimensional data sets. Most algorithms are very sensitive to such parameter values: slightly different settings may lead to very different clusterings of the data. Moreover, high-dimensional real data sets often have very skewed distributions, so that their intrinsic clustering structure may not be characterized by global density parameters.
To help overcome this difficulty, a cluster analysis method called OPTICS was proposed. Rather than producing a clustering of a data set explicitly, OPTICS computes an augmented cluster ordering for automatic and interactive cluster analysis. This ordering represents the density-based clustering structure of the data. It contains information that is equivalent to density-based clusterings obtained from a wide range of parameter settings. The cluster ordering can be used to extract basic clustering information (such as cluster centers, or arbitrarily shaped clusters), as well as provide the intrinsic clustering structure.
By examining DBSCAN, we can easily see that for a constant MinPts value, density-based clusters with respect to a higher density (i.e., a lower value for ε) are completely contained in density-connected sets obtained with respect to a lower density. Recall that the parameter ε is a distance: it is the neighborhood radius. Therefore, in order to produce a set or ordering of density-based clusters, we can extend the DBSCAN algorithm to process a set of distance parameter values at the same time. To construct the different clusterings simultaneously, the objects should be processed in a specific order.
This order selects an object that is density-reachable with respect to the lowest ε value, so that clusters with higher density (lower ε) will be finished first. Based on this idea, two values need to be stored for each object: the core-distance and the reachability-distance.
The core-distance of an object p is the smallest ε′ value that makes p a core object. If p is not a core object, the core-distance of p is undefined.
The reachability-distance of an object q with respect to another object p is the greater of the core-distance of p and the Euclidean distance between p and q. If p is not a core object, the reachability-distance between p and q is undefined.
Figure 7.11: OPTICS terminology, illustrating the core-distance of p and the reachability-distances of q1 and q2 with respect to p. Based on [ABKS99].
Example 7.13 Core-distance and reachability-distance. Figure 7.11 illustrates the concepts of core-distance and reachability-distance. Suppose that ε = 6 mm and MinPts = 5. The core-distance of p is the distance, ε′, between p and the fourth-closest data object (here, ε′ = 3 mm). The reachability-distance of q1 with respect to p is the core-distance of p (i.e., ε′ = 3 mm), since this is greater than the Euclidean distance from p to q1. The reachability-distance of q2 with respect to p is the Euclidean distance from p to q2, since this is greater than the core-distance of p.
"How are these values used?" The OPTICS algorithm creates an ordering of the objects in a database, additionally storing the core-distance and a suitable reachability-distance for each object. An algorithm was proposed to extract clusters based on the ordering information produced by OPTICS. Such information is sufficient for the extraction of all density-based clusterings with respect to any distance ε′ that is smaller than the distance ε used in generating the order.
The cluster ordering of a data set can be represented graphically, which helps in its understanding. For example, Figure 7.12 is the reachability plot for a simple two-dimensional data set, which presents a general overview of how the data are structured and clustered. The data objects are plotted in cluster order (horizontal axis) together with their respective reachability-distances (vertical axis). The three Gaussian "bumps" in the plot reflect three clusters in the data set. Methods have also been developed for viewing clustering structures of high-dimensional data at various levels of detail.
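Both quantities can be computed directly from pairwise distances. A sketch for points on a line (illustrative data; the core-distance counts the object itself, matching the "fourth-closest object for MinPts = 5" convention of the example above):

```python
import math

def core_distance(i, points, eps, min_pts):
    """Smallest radius eps' <= eps that makes point i a core object, i.e.,
    the distance to its min_pts-th nearest point (counting i itself);
    None (undefined) if i is not a core object within eps."""
    dists = sorted(math.dist(points[i], q) for q in points)
    d = dists[min_pts - 1]          # dists[0] == 0 is the point itself
    return d if d <= eps else None

def reachability_distance(q, p, points, eps, min_pts):
    """max(core-distance(p), d(p, q)); None if p is not a core object."""
    cd = core_distance(p, points, eps, min_pts)
    if cd is None:
        return None
    return max(cd, math.dist(points[p], points[q]))

pts = [(0,), (1,), (2,), (3,), (10,)]
cd1 = core_distance(1, pts, eps=5.0, min_pts=3)        # 3rd nearest incl. itself
rd = reachability_distance(4, 1, pts, eps=5.0, min_pts=3)
print(cd1, rd)  # 1.0 9.0
```

OPTICS then visits objects in order of smallest reachability-distance, writing out (object, reachability-distance) pairs; the reachability plot of Figure 7.12 is exactly this output.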
Figure 7.12: Cluster ordering in OPTICS, plotting reachability-distance (vertical axis, with "undefined" at the top) against the cluster order of the objects (horizontal axis). Figure is based on [ABKS99].
Because the OPTICS algorithm is structurally equivalent to DBSCAN, it has the same runtime complexity as DBSCAN, that is, O(n log n) if a spatial index is used, where n is the number of objects.
7.6.3
DENCLUE: Clustering Based on Density Distribution Functions
DENCLUE (DENsity-based CLUstEring) is a clustering method based on a set of density distribution functions. The method is built on the following ideas: (1) the influence of each data point can be formally modeled using a mathematical function, called an influence function, which describes the impact of a data point within its neighborhood; (2) the overall density of the data space can be modeled analytically as the sum of the influence functions of all data points; and (3) clusters can then be determined mathematically by identifying density attractors, which are local maxima of the overall density function.
Let x and y be objects or points in F^d, a d-dimensional input space. The influence function of data object y on x is a function, f_B^y : F^d → R_0^+, which is defined in terms of a basic influence function f_B:

    f_B^y(x) = f_B(x, y).    (7.31)
This reflects the impact of y on x. In principle, the influence function can be an arbitrary function determined by the distance between two objects in a neighborhood. The distance function, d(x, y), should be reflexive and symmetric, such as the Euclidean distance function (Section 7.2.1). It can be used to compute a square wave influence function,

    f_Square(x, y) = 0 if d(x, y) > σ; 1 otherwise,    (7.32)
or a Gaussian influence function,

    f_Gauss(x, y) = e^( −d(x, y)² / (2σ²) ).    (7.33)
To help understand the concept of an influence function, the following example offers some additional insight.
Example 7.14 Influence function. Consider the square wave influence function of Equation (7.32). If objects x and y are far apart in the d-dimensional space, the distance d(x, y) will exceed the threshold, σ. In this case, the influence function returns 0, representing the lack of influence between distant points. On the other hand, if x and y are "close" (where closeness is determined by the parameter σ), a value of 1 is returned, representing the notion that one influences the other.
Figure 7.13: Possible density functions for a 2D data set: (a) the data set; (b) the square wave density function; (c) the Gaussian density function. From [HK98].
The density function at an object or point x ∈ F^d is defined as the sum of the influence functions of all data points, that is, the total influence on x of all of the data points. Given n data objects, D = {x1, ..., xn} ⊂ F^d, the density function at x is defined as
    f_B^D(x) = Σ_{i=1..n} f_B^{x_i}(x) = f_B^{x_1}(x) + f_B^{x_2}(x) + ... + f_B^{x_n}(x).    (7.34)
For example, the density function that results from the Gaussian influence function (7.33) is
    f_Gauss^D(x) = Σ_{i=1..n} e^( −d(x, x_i)² / (2σ²) ).    (7.35)
Figure 7.13 shows a 2D data set, together with the corresponding overall density functions for a square wave and a Gaussian influence function.
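Equations (7.32) through (7.35) translate almost verbatim into code. A sketch with illustrative σ and sample points:

```python
import math

def f_square(x, y, sigma):
    """Square wave influence (Eq. 7.32): 1 if the points are within sigma."""
    return 0.0 if math.dist(x, y) > sigma else 1.0

def f_gauss(x, y, sigma):
    """Gaussian influence (Eq. 7.33)."""
    return math.exp(-math.dist(x, y) ** 2 / (2 * sigma ** 2))

def density(x, data, influence, sigma):
    """Overall density at x: the sum of the influences of all data points
    (Eq. 7.34)."""
    return sum(influence(x, xi, sigma) for xi in data)

data = [(0, 0), (1, 0), (0, 1), (5, 5)]
d_sq = density((0, 0), data, f_square, sigma=1.5)
d_g = density((0, 0), data, f_gauss, sigma=1.0)
print(d_sq)  # 3.0: the three nearby points influence (0, 0); (5, 5) does not
```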
From the density function, we can define the gradient of the function and the density attractors, the local maxima of the overall density function. A point x is said to be density-attracted to a density attractor x* if there exists a set of points x0, x1, ..., xk such that x0 = x, xk = x*, and the gradient of xi−1 is in the direction of xi for 0 < i < k. Intuitively, a density attractor influences many other points. For a continuous and differentiable influence function, a hill-climbing algorithm guided by the gradient can be used to determine the density attractor of a set of data points.
In general, points that are density-attracted to x* may form a cluster. Based on the above notions, both center-defined clusters and arbitrary-shape clusters can be formally defined. A center-defined cluster for a density attractor, x*, is a subset of points, C ⊆ D, that are density-attracted by x*, and where the density function at x* is no less than a threshold, ξ. Points that are density-attracted by x*, but for which the density function value is less than ξ, are considered outliers. That is, intuitively, points in a cluster are influenced by many points, but outliers are not. An arbitrary-shape cluster for a set of density attractors is a set of Cs, each density-attracted to its respective density attractor, where (1) the density function value at each density attractor is no less than a threshold, ξ, and (2) there exists a path, P, from each density attractor to another, along which the density function value for each point is no less than ξ. Examples of center-defined and arbitrary-shape clusters are shown in Figure 7.14.
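The hill-climbing step can be sketched with plain gradient ascent on the Gaussian density of Equation (7.35), here in one dimension for readability. The step size, σ, and data are illustrative choices, and a fixed iteration count stands in for a proper convergence test:

```python
import math

def gauss_density(x, data, sigma):
    """Gaussian density of Eq. (7.35) in one dimension."""
    return sum(math.exp(-(x - xi) ** 2 / (2 * sigma ** 2)) for xi in data)

def gauss_density_grad(x, data, sigma):
    """Analytic derivative of the Gaussian density with respect to x."""
    return sum((xi - x) / sigma ** 2 *
               math.exp(-(x - xi) ** 2 / (2 * sigma ** 2)) for xi in data)

def climb_to_attractor(x, data, sigma, step=0.1, iters=500):
    """Follow the density gradient uphill from x to a local maximum."""
    for _ in range(iters):
        x += step * gauss_density_grad(x, data, sigma)
    return x

data = [0.0, 0.2, -0.2, 10.0, 10.3, 9.7]   # two well-separated groups
a1 = climb_to_attractor(-1.0, data, sigma=1.0)
a2 = climb_to_attractor(11.0, data, sigma=1.0)
# a1 ends near 0 and a2 near 10: each starting point is density-attracted
# to the local maximum (density attractor) of its own group.
```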
Figure 7.14: Examples of center-defined clusters (top row) and arbitrary-shape clusters (bottom row). From [HK98].
"What major advantages does DENCLUE have in comparison with other clustering algorithms?" There are several: (1) it has a solid mathematical foundation and generalizes various clustering methods, including partitioning, hierarchical, and density-based methods; (2) it has good clustering properties for data sets with large amounts of noise; (3) it allows a compact mathematical description of arbitrarily shaped clusters in high-dimensional data sets; and (4) it uses grid cells, yet only keeps information about grid cells that actually contain data points. It manages these cells in a tree-based access structure, and thus is significantly faster than some influential algorithms, such as DBSCAN. However, the method requires careful selection of the density parameter σ and noise threshold ξ, as these selections may significantly influence the quality of the clustering results.
7.7
Grid-Based Methods
The grid-based clustering approach uses a multiresolution grid data structure. It quantizes the object space into a finite number of cells that form a grid structure on which all of the operations for clustering are performed. The main advantage of the approach is its fast processing time, which is typically independent of the number of data objects and depends only on the number of cells in each dimension of the quantized space.
Some typical examples of the grid-based approach include STING, which explores statistical information stored in the grid cells; WaveCluster, which clusters objects using a wavelet transform method; and CLIQUE, which represents a grid- and density-based approach for clustering in high-dimensional data space, to be introduced in Section 7.9.
7.7.1
STING: STatistical INformation Grid
STING is a grid-based multiresolution clustering technique in which the spatial area is divided into rectangular cells. There are usually several levels of such rectangular cells corresponding to different levels of resolution, and these cells form a hierarchical structure: each cell at a high level is partitioned to form a number of cells at the next lower level. Statistical information regarding the attributes in each grid cell (such as the mean, maximum, and minimum values) is precomputed and stored. These statistical parameters are useful for query processing, as described below.
Figure 7.15: A hierarchical structure for STING clustering, from the 1st layer at the top down through the (i−1)-st and i-th layers.
Figure 7.15 shows a hierarchical structure for STING clustering. Statistical parameters of higher-level cells can easily be computed from the parameters of the lower-level cells. These parameters include the following: the attribute-independent parameter, count; and the attribute-dependent parameters mean, stdev (standard deviation), min (minimum), max (maximum), and the type of distribution that the attribute values in the cell follow, such as normal, uniform, exponential, or none (if the distribution is unknown). When the data are loaded into the database, the parameters count, mean, stdev, min, and max of the bottom-level cells are calculated directly from the data. The value of distribution may either be assigned by the user, if the distribution type is known beforehand, or obtained by hypothesis tests such as the χ² test. The type of distribution of a higher-level cell can be computed based on the majority of the distribution types of its corresponding lower-level cells, in conjunction with a threshold filtering process. If the distributions of the lower-level cells disagree with each other and fail the threshold test, the distribution type of the higher-level cell is set to none.
"How is this statistical information useful for query answering?" The statistical parameters can be used in a top-down, grid-based method as follows. First, a layer within the hierarchical structure is determined from which the query-answering process is to start. This layer typically contains a small number of cells. For each cell in the current layer, we compute the confidence interval (or estimated range of probability) reflecting the cell's relevancy to the given query. The irrelevant cells are removed from further consideration. Processing of the next lower level examines only the remaining relevant cells. This process is repeated until the bottom layer is reached. At this time, if the query specification is met, the regions of relevant cells that satisfy the query are returned. Otherwise,
the data that fall into the relevant cells are retrieved and further processed until they meet the requirements of the query.
"What advantages does STING offer over other clustering methods?" STING offers several advantages: (1) the grid-based computation is query-independent, since the statistical information stored in each cell summarizes the data in that cell independently of any query; (2) the grid structure facilitates parallel processing and incremental updating; and (3) the method is efficient: STING goes through the database once to compute the statistical parameters of the cells, so the time complexity of generating clusters is O(n), where n is the total number of objects. After the hierarchical structure has been generated, the query-processing time is O(g), where g is the total number of grid cells at the lowest level, which is usually much smaller than n.
Because STING uses a multiresolution approach to cluster analysis, the quality of STING clustering depends on the granularity of the lowest level of the grid structure. If the granularity is very fine, the cost of processing will increase substantially; if the bottom level of the grid structure is too coarse, it may reduce the quality of the cluster analysis. Moreover, STING does not consider the spatial relationship between the children and their neighboring cells when constructing a parent cell. As a result, the shapes of the resulting clusters are isothetic, that is, all of the cluster boundaries are either horizontal or vertical, and no diagonal boundary is detected. This may lower the quality and accuracy of the clusters despite the technique's fast processing time.
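The bottom-up computation of parent-cell parameters from child cells uses standard aggregation identities. A sketch (cell contents illustrative; the distribution-type majority vote is omitted):

```python
import math

def merge_cells(cells):
    """Each cell is a dict with count, mean, stdev, min, max.
    Returns the same statistics for the parent cell."""
    n = sum(c["count"] for c in cells)
    mean = sum(c["count"] * c["mean"] for c in cells) / n
    # combine variances via E[X^2]: var = E[X^2] - mean^2
    ex2 = sum(c["count"] * (c["stdev"] ** 2 + c["mean"] ** 2) for c in cells) / n
    return {
        "count": n,
        "mean": mean,
        "stdev": math.sqrt(ex2 - mean ** 2),
        "min": min(c["min"] for c in cells),
        "max": max(c["max"] for c in cells),
    }

child1 = {"count": 10, "mean": 2.0, "stdev": 1.0, "min": 0.0, "max": 4.0}
child2 = {"count": 30, "mean": 6.0, "stdev": 2.0, "min": 3.0, "max": 9.0}
parent = merge_cells([child1, child2])
print(parent["count"], parent["mean"])  # 40 5.0
```

Because each parent is computed only from its children, the whole hierarchy can be filled in a single bottom-up pass, which is what makes the O(n) construction cost possible.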
7.7.2
WaveCluster: Clustering Using Wavelet Transformation
WaveCluster is a multiresolution clustering algorithm that first summarizes the data by imposing a multidimensional grid structure onto the data space. It then uses a wavelet transformation to transform the original feature space, finding dense regions in the transformed space. In this approach, each grid cell summarizes the information of a group of points that map into the cell. This summary information typically fits into main memory for use by the multiresolution wavelet transform and the subsequent cluster analysis.
A wavelet transform is a signal processing technique that decomposes a signal into different frequency subbands. The wavelet model can be applied to d-dimensional signals by applying a one-dimensional wavelet transform d times. In applying a wavelet transform, data are transformed so as to preserve the relative distances between objects at different levels of resolution. This allows the natural clusters in the data to become more distinguishable. Clusters can then be identified by searching for dense regions in the new domain. Wavelet transforms are also discussed in Chapter 2, where they are used for data reduction by compression. Additional references to the technique are given in the bibliographic notes.
"Why is wavelet transformation useful for clustering?" It offers the following advantages:
It provides unsupervised clustering. It uses hat-shaped filters that emphasize regions where the points cluster while suppressing weaker information outside of the cluster boundaries. Thus, dense regions in the original feature space act as attractors for nearby points and as inhibitors for points that are farther away. This means that the clusters in the data automatically stand out and "clear" the regions around them. A related advantage is that wavelet transformation can automatically result in the removal of outliers.
The multiresolution property of wavelet transformations can help detect clusters at varying levels of accuracy. For example, Figure 7.16 shows a sample two-dimensional feature space, where each point in the image represents the attribute or feature values of one object in the spatial data set. Figure 7.17 shows the resulting wavelet transformation at different resolutions, from a fine scale (scale 1) to a coarse scale (scale 3). At each level, the four subbands into which the original data are decomposed are shown. The subband in the upper-left quadrant emphasizes the average neighborhood around each data point. The subband in the upper-right quadrant emphasizes the horizontal edges of the data. The subband in the lower-left quadrant emphasizes the vertical edges, while the subband in the lower-right quadrant emphasizes the corners.
Figure 7.16: A sample of twodimensional feature space. From [SCZ98].
Figure 7.17: Multiresolution of the feature space in Figure 7.16 at (a) scale 1 (high resolution); (b) scale 2 (medium resolution); (c) scale 3 (low resolution). From [SCZ98].
Wavelet-based clustering is very fast, with a computational complexity of O(n), where n is the number of objects in the database. The algorithm implementation can be made parallel. WaveCluster is a grid-based and density-based algorithm. It conforms with many of the requirements of a good clustering algorithm: it handles large data sets efficiently, discovers clusters of arbitrary shape, successfully handles outliers, is insensitive to the order of input, and does not require the specification of input parameters such as the number of clusters or a neighborhood radius. In experimental studies, WaveCluster was found to outperform BIRCH, CLARANS, and DBSCAN in terms of both efficiency and clustering quality. The same study found WaveCluster capable of handling data with up to 20 dimensions.
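The one-dimensional building block of such a transform can be sketched with the Haar wavelet, the simplest case: pairwise averages give the smoothed (low-frequency) subband, and pairwise differences give the detail (high-frequency) subband. Applying this step once along each dimension yields the four quadrants shown in Figure 7.17:

```python
import math

def haar_step(signal):
    """One level of the Haar wavelet transform on a signal of even length:
    returns (averages, details), each half the original length."""
    avg = [(signal[i] + signal[i + 1]) / math.sqrt(2)
           for i in range(0, len(signal), 2)]
    det = [(signal[i] - signal[i + 1]) / math.sqrt(2)
           for i in range(0, len(signal), 2)]
    return avg, det

sig = [4.0, 4.0, 8.0, 8.0]
avg, det = haar_step(sig)
print(avg)  # the smoothed, coarser signal
print(det)  # [0.0, 0.0]: no detail inside the flat regions
```

Repeating the step on the average subband gives the coarser scales (scale 2, scale 3, ...); in WaveCluster the dense regions that survive smoothing at a chosen scale are taken as clusters.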
7.8
Model-Based Clustering Methods
Model-based clustering methods attempt to optimize the fit between the given data and some mathematical model. Such methods are often based on the assumption that the data are generated by a mixture of underlying probability distributions. In this section, we describe three examples of model-based clustering. Section 7.8.1 presents an extension of the k-means partitioning algorithm, called Expectation-Maximization. Conceptual clustering is discussed in Section 7.8.2. A neural network approach to clustering is given in Section 7.8.3.
7.8.1
Expectation-Maximization
In practice, each cluster can be represented mathematically by a parametric probability distribution. The entire data set is a mixture of these distributions, where each individual distribution is typically referred to as a component distribution. We can therefore cluster the data using a finite mixture density model of k probability distributions, where each distribution represents a cluster. The problem is to estimate the parameters of the probability
distributions so as to best fit the data. Figure 7.18 is an example of a simple finite mixture density model. There are two clusters. Each follows a normal or Gaussian distribution with its own mean and standard deviation.
Figure 7.18: Each cluster can be represented by a probability distribution, centered at a mean and with a standard deviation. Here, we have two clusters, corresponding to the Gaussian distributions g(m1, σ1) and g(m2, σ2), respectively, where the circles represent the first standard deviation of the distributions.
The EM (Expectation-Maximization) algorithm is a popular iterative refinement algorithm that can be used to find the parameter estimates. It can be viewed as an extension of the k-means paradigm, which assigns an object to the cluster with which it is most similar, based on the cluster mean (Section 7.4.1). Instead of assigning each object to a dedicated cluster, EM assigns each object to a cluster according to a weight representing its probability of membership. In other words, there are no strict boundaries between clusters. Therefore, new means are computed based on weighted measures.
EM starts with an initial estimate or "guess" of the parameters of the mixture model (collectively referred to as the parameter vector). It iteratively rescores the objects against the mixture density produced by the parameter vector. The rescored objects are then used to update the parameter estimates. Each object is assigned a probability that it would possess a certain set of attribute values given that it was a member of a given cluster. The algorithm is described as follows.
1. Make an initial guess of the parameter vector: this involves randomly selecting k objects to represent the cluster means or centers (as in k-means partitioning), as well as making guesses for the additional parameters.
2. Iteratively refine the parameters (or clusters) based on the following two steps:
(a) Expectation step: assign each object xi to cluster Ck with the probability

    P(xi ∈ Ck) = p(Ck | xi) = p(Ck) p(xi | Ck) / p(xi),    (7.36)

where p(xi | Ck) = N(mk, Ek(xi)) follows the normal (i.e., Gaussian) distribution around mean mk, with expectation Ek. In other words, this step calculates the probability of cluster membership of object xi, for each of the clusters. These probabilities are the "expected" cluster memberships for object xi.
(b) Maximization step: use the probability estimates from above to re-estimate (or refine) the model parameters. For example,
mk = (1/n) Σi=1..n [ xi P(xi ∈ Ck) / Σj P(xi ∈ Cj) ].   (7.37)
This step is the "maximization" of the likelihood of the distributions given the data. The EM algorithm is simple and easy to implement. In practice, it converges quickly, but it may not reach the global optimum. Convergence is guaranteed for certain forms of optimization functions. The computational complexity is linear in d (the number of input features), n (the number of objects), and t (the number of iterations).

Bayesian clustering methods focus on the computation of class-conditional probability density. They are commonly used in the statistics community. In industry, AutoClass is a popular Bayesian clustering method that uses a variant of the EM algorithm. The best clustering is the one that maximizes the ability to predict the attributes of an object given the correct cluster of the object. AutoClass can also estimate the number of clusters. It has been applied to several domains and was able to discover a new class of stars based on infrared astronomy data. Further references are provided in the bibliographic notes.
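To make the expectation and maximization steps above concrete, here is a minimal one-dimensional sketch of EM for a mixture of k Gaussians. All names are our own; the text initializes means by picking k random objects, but for a deterministic sketch we spread the initial means evenly over the range of the data. A production implementation would use a library such as scikit-learn's GaussianMixture.

```python
import math

def gaussian_pdf(x, mean, std):
    """Density of a one-dimensional normal distribution at x."""
    return math.exp(-((x - mean) ** 2) / (2.0 * std ** 2)) / (std * math.sqrt(2.0 * math.pi))

def em_1d(data, k=2, iterations=50):
    """Minimal EM for a mixture of k one-dimensional Gaussians.

    Returns (means, stds, priors) after `iterations` refinement rounds.
    """
    lo, hi = min(data), max(data)
    # Deterministic initial guess: means spread evenly over the data range.
    means = [lo + (i + 0.5) * (hi - lo) / k for i in range(k)]
    stds = [1.0] * k
    priors = [1.0 / k] * k
    for _ in range(iterations):
        # Expectation step: membership weights P(C_c | x_i), as in Eq. (7.36).
        resp = []
        for x in data:
            joint = [priors[c] * gaussian_pdf(x, means[c], stds[c]) for c in range(k)]
            total = sum(joint)
            resp.append([j / total for j in joint])
        # Maximization step: re-estimate each cluster's parameters
        # from the weighted objects, in the spirit of Eq. (7.37).
        for c in range(k):
            w = sum(r[c] for r in resp)
            means[c] = sum(r[c] * x for r, x in zip(resp, data)) / w
            var = sum(r[c] * (x - means[c]) ** 2 for r, x in zip(resp, data)) / w
            stds[c] = max(math.sqrt(var), 1e-6)  # floor to avoid degenerate spread
            priors[c] = w / len(data)
    return means, stds, priors
```

Running this on two well-separated groups of values drives the two means toward the group centers within a few iterations.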
7.8.2 Conceptual Clustering
Conceptual clustering is a form of clustering in machine learning that, given a set of unlabeled objects, produces a classification scheme over the objects. Unlike conventional clustering, which primarily identifies groups of like objects, conceptual clustering goes one step further by also finding characteristic descriptions for each group, where each group represents a concept or class. Hence, conceptual clustering is a two-step process: first, clustering is performed, followed by characterization. Here, clustering quality is not solely a function of the individual objects. Rather, it incorporates factors such as the generality and simplicity of the derived concept descriptions.

Most methods of conceptual clustering adopt a statistical approach that uses probability measurements in determining the concepts or clusters. Probabilistic descriptions are typically used to represent each derived concept. COBWEB is a popular and simple method of incremental conceptual clustering. Its input objects are described by categorical attribute-value pairs. COBWEB creates a hierarchical clustering in the form of a classification tree.
[Figure 7.19 shows a classification tree whose root is Animal (P(C0) = 1.0, P(scales|C0) = 0.25), with children Fish (P(C1) = 0.25, P(scales|C1) = 1.0), Amphibian (P(C2) = 0.25, P(moist|C2) = 1.0), and Mammal/bird (P(C3) = 0.5, P(hair|C3) = 0.5); the children of Mammal/bird are Mammal (P(C4) = 0.5, P(hair|C4) = 1.0) and Bird (P(C5) = 0.5, P(feathers|C5) = 1.0).]

Figure 7.19: A classification tree. Figure is based on [Fis87].

"But what is a classification tree? Is it the same as a decision tree?" Figure 7.19 shows a classification tree for a set of animal data. A classification tree differs from a decision tree. Each node in a classification tree refers to
a concept and contains a probabilistic description of that concept, which summarizes the objects classified under the node. The probabilistic description includes the probability of the concept and conditional probabilities of the form P(Ai = vij | Ck), where Ai = vij is an attribute-value pair (that is, the ith attribute takes its jth possible value) and Ck is the concept class. (Counts are accumulated and stored at each node for computation of the probabilities.) This is unlike decision trees, which label branches rather than nodes and use logical rather than probabilistic descriptors.3 The sibling nodes at a given level of a classification tree are said to form a partition. To classify an object using a classification tree, a partial matching function is employed to descend the tree along a path of "best" matching nodes. COBWEB uses a heuristic evaluation measure called category utility to guide construction of the tree. Category utility (CU) is defined as
CU = [ Σk=1..n P(Ck) ( Σi Σj P(Ai = vij | Ck)² − Σi Σj P(Ai = vij)² ) ] / n,   (7.38)
where n is the number of nodes, concepts, or "categories" forming a partition, {C1, C2, . . . , Cn}, at the given level of the tree. In other words, category utility is the increase in the expected number of attribute values that can be correctly guessed given a partition (where this expected number corresponds to the term P(Ck) Σi Σj P(Ai = vij | Ck)²) over the expected number of correct guesses with no such knowledge (corresponding to the term Σi Σj P(Ai = vij)²). Although we do not have room to show the derivation, category utility rewards intraclass similarity and interclass dissimilarity, where:

Intraclass similarity is the probability P(Ai = vij | Ck). The larger this value is, the greater the proportion of class members that share this attribute-value pair, and the more predictable the pair is of class members.

Interclass dissimilarity is the probability P(Ck | Ai = vij). The larger this value is, the fewer the objects in contrasting classes that share this attribute-value pair, and the more predictive the pair is of the class.

Let's have a look at how COBWEB works. COBWEB incrementally incorporates objects into a classification tree. "Given a new object, how does COBWEB decide where to incorporate it into the classification tree?" COBWEB descends the tree along an appropriate path, updating counts along the way, in search of the "best host" or node at which to classify the object. This decision is based on temporarily placing the object in each node and computing the category utility of the resulting partition. The placement that results in the highest category utility should be a good host for the object. "What if the object does not really belong to any of the concepts represented in the tree so far? What if it is better to create a new node for the given object?" That is a good point. In fact, COBWEB also computes the category utility of the partition that would result if a new node were to be created for the object.
This is compared to the above computation based on the existing nodes. The object is then placed in an existing class, or a new class is created for it, based on the partition with the highest category utility value. Notice that COBWEB has the ability to automatically adjust the number of classes in a partition. It does not need to rely on the user to provide such an input parameter.

The two operators mentioned above are highly sensitive to the input order of the objects. COBWEB has two additional operators that help make it less sensitive to input order. These are merging and splitting. When an object is incorporated, the two best hosts are considered for merging into a single class. Furthermore, COBWEB considers splitting the children of the best host among the existing categories. These decisions are based on category utility. The merging and splitting operators allow COBWEB to perform a bidirectional search; for example, a merge can undo a previous split.

COBWEB has a number of limitations. First, it is based on the assumption that probability distributions on separate attributes are statistically independent of one another. This assumption is, however, not always true, since correlation between attributes often exists. Moreover, the probability distribution representation of clusters
3 Decision trees are described in Chapter 6.
makes it quite expensive to update and store the clusters. This is especially so when the attributes have a large number of values, since the time and space complexities depend not only on the number of attributes, but also on the number of values for each attribute. Furthermore, the classification tree is not height-balanced for skewed input data, which may cause the time and space complexity to degrade dramatically.

CLASSIT is an extension of COBWEB for incremental clustering of continuous (or real-valued) data. It stores a continuous normal distribution (i.e., mean and standard deviation) for each individual attribute in each node and uses a modified category utility measure that is an integral over continuous attributes instead of a sum over discrete attributes as in COBWEB. However, it suffers from problems similar to those of COBWEB and thus is not suitable for clustering large databases. Conceptual clustering is popular in the machine learning community. However, the method does not scale well for large data sets.
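As a concrete illustration of the category utility measure in Equation (7.38), the following is a small sketch that computes CU directly from attribute-value counts for a given partition. It is a toy implementation with names of our own, not COBWEB itself.

```python
from collections import Counter

def category_utility(partition):
    """Category utility of a partition (Eq. 7.38 style).

    `partition` is a list of clusters; each cluster is a list of objects,
    and each object is a dict mapping attribute name -> categorical value.
    """
    all_objects = [obj for cluster in partition for obj in cluster]
    n_total = len(all_objects)
    n = len(partition)  # number of concepts in the partition

    def sum_sq(objs):
        # Sum over attributes i and values j of P(A_i = v_ij)^2 within `objs`.
        counts = Counter((a, v) for obj in objs for a, v in obj.items())
        return sum((c / len(objs)) ** 2 for c in counts.values())

    baseline = sum_sq(all_objects)  # expected correct guesses with no partition
    cu = 0.0
    for cluster in partition:
        p_ck = len(cluster) / n_total
        cu += p_ck * (sum_sq(cluster) - baseline)
    return cu / n
```

A partition whose clusters are pure on an attribute (all members share its value) scores higher than one whose clusters mix the values, which is exactly the intraclass-similarity reward described above.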
7.8.3 Neural Network Approach
The neural network approach is motivated by biological neural networks.4 Roughly speaking, a neural network is a set of connected input/output units, where each connection has a weight associated with it. Neural networks have several properties that make them popular for clustering. First, neural networks are inherently parallel and distributed processing architectures. Second, neural networks learn by adjusting their interconnection weights so as to best fit the data. This allows them to "normalize" or "prototype" the patterns and act as feature (or attribute) extractors for the various clusters. Third, neural networks process numerical vectors and require object patterns to be represented by quantitative features only. Many clustering tasks handle only numerical data or can transform their data into quantitative features if needed.

The neural network approach to clustering tends to represent each cluster as an exemplar. An exemplar acts as a "prototype" of the cluster and does not necessarily have to correspond to a particular data example or object. New objects can be distributed to the cluster whose exemplar is the most similar, based on some distance measure. The attributes of an object assigned to a cluster can be predicted from the attributes of the cluster's exemplar.

Self-organizing feature maps (SOMs) are one of the most popular neural network methods for cluster analysis. They are sometimes referred to as Kohonen self-organizing feature maps, after their creator, Teuvo Kohonen, or as topologically ordered maps. The goal of SOMs is to represent all points in a high-dimensional source space by points in a low-dimensional (usually 2-D or 3-D) target space, such that the distance and proximity relationships (and hence the topology) are preserved as much as possible. The method is particularly useful when a nonlinear mapping is inherent in the problem itself.
SOMs can also be viewed as a constrained version of k-means clustering, in which the cluster centers tend to lie in a low-dimensional manifold in the feature or attribute space. With self-organizing feature maps, clustering is performed by having several units compete for the current object. The unit whose weight vector is closest to the current object becomes the winning or active unit. So as to move even closer to the input object, the weights of the winning unit are adjusted, as well as those of its nearest neighbors. SOMs assume that there is some topology or ordering among the input objects, and that the units will eventually take on this structure in space. The organization of units is said to form a feature map. SOMs are believed to resemble processing that can occur in the brain and are useful for visualizing high-dimensional data in 2-D or 3-D space.

The SOM approach has been used successfully for Web document clustering. The left graph of Figure 7.20 shows the result of clustering 12,088 Web articles from comp.ai.neuralnets using the SOM approach, while the right graph of the figure shows the result of drilling down on the keyword "mining". The neural network approach to clustering has strong theoretical links with actual brain processing. Further research is required to make it more effective and scalable for large databases, due to long processing times and the intricacies of complex data.
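The competitive update just described (winner plus grid neighbors moving toward each input) can be sketched for a toy one-dimensional map. The unit layout, learning rate, neighborhood radius, and function name are all arbitrary illustrative choices, not a faithful SOM implementation (a real SOM decays both the learning rate and the radius over time).

```python
import random

def train_som(data, n_units=4, epochs=30, lr=0.5, radius=1, seed=1):
    """Toy 1-D SOM: units sit on a line (grid indices 0..n_units-1);
    the winning unit and its grid neighbors move toward each input object."""
    rng = random.Random(seed)
    lo, hi = min(data), max(data)
    weights = [rng.uniform(lo, hi) for _ in range(n_units)]
    for _ in range(epochs):
        for x in data:
            # Competition: the unit whose weight is closest to x wins.
            winner = min(range(n_units), key=lambda u: abs(weights[u] - x))
            # Cooperation: winner and grid neighbors within `radius` move toward x.
            for u in range(n_units):
                if abs(u - winner) <= radius:
                    weights[u] += lr * (x - weights[u])
    return weights
```

Because every update is a convex step toward an input value, the unit weights stay within the range of the data while drifting toward its dense regions.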
4 Neural networks were also introduced in Chapter 6 on classification and prediction.
Figure 7.20: The result of SOM clustering of 12,088 Web articles on comp.ai.neuralnets (left), and of drilling down on the keyword: "mining" (right). Based on http://websom.hut.fi/websom/comp.ai.neuralnetsnew.
7.9 Clustering High-Dimensional Data
A large majority of clustering methods are designed for clustering low-dimensional data and encounter challenges when the dimensionality of the data grows really high (say, over 10 dimensions, or even over thousands of dimensions for some tasks). This is because when the dimensionality increases, usually only a small number of dimensions are relevant to certain clusters, but data in the irrelevant dimensions may produce much noise and mask the real clusters to be discovered. Moreover, when dimensionality increases, data usually become increasingly sparse because the data points are likely located in different dimensional subspaces. When the data become really sparse, the data points can be considered almost equidistant from one another, and the distance measure, which is essential for cluster analysis, becomes meaningless.

To overcome this difficulty, we may consider using feature (or attribute) transformation and feature (or attribute) selection techniques. Feature transformation methods, such as principal component analysis5 and singular value decomposition6, transform the data onto a smaller space while generally preserving the original relative distances between objects. They summarize data by creating linear combinations of the attributes, and may discover hidden structures in the data. However, such techniques do not actually remove any of the original attributes from analysis. This is problematic when there are a large number of irrelevant attributes. The irrelevant information may mask the real clusters, even after transformation. Moreover, the transformed features (attributes) are often difficult to interpret, making the clustering results less useful. Thus, feature transformation is only suited to data sets where most of
5 Principal component analysis was introduced in Chapter 2 as a method of data compression (Section 2.5.3).
6 Singular value decomposition is discussed in detail in Chapter 8.
the dimensions are relevant to the clustering task. Unfortunately, real-world data sets tend to have many highly correlated, or redundant, dimensions.

Another way of tackling the curse of dimensionality is to try to remove some of the dimensions. Attribute subset selection (or feature subset selection7) is commonly used for data reduction by removing irrelevant or redundant dimensions (or attributes). Given a set of attributes, attribute subset selection finds the subset of attributes that are most relevant to the data mining task. Attribute subset selection involves searching through various attribute subsets and evaluating these subsets using certain criteria. It is most commonly performed by supervised learning: the most relevant set of attributes is found with respect to the given class labels. It can also be performed by an unsupervised process, such as entropy analysis, which is based on the property that entropy tends to be low for data that contain tight clusters. Other evaluation functions, such as category utility, may also be used.

Subspace clustering is an extension to attribute subset selection that has shown its strength in high-dimensional clustering. It is based on the observation that different subspaces may contain different, meaningful clusters. Subspace clustering searches for groups of clusters within different subspaces of the same data set. The problem becomes how to find such subspace clusters effectively and efficiently. In this section, we introduce three approaches for effective clustering of high-dimensional data: dimension-growth subspace clustering, represented by CLIQUE; dimension-reduction projected clustering, represented by PROCLUS; and frequent pattern-based clustering, represented by pCluster.
7.9.1 CLIQUE: A Dimension-Growth Subspace Clustering Method
CLIQUE (CLustering In QUEst) was the first algorithm proposed for dimension-growth subspace clustering in high-dimensional space. In dimension-growth subspace clustering, the clustering process starts at single-dimensional subspaces and grows upward to higher-dimensional ones. Because CLIQUE partitions each dimension like a grid structure and determines whether a cell is dense based on the number of points it contains, it can also be viewed as an integration of density-based and grid-based clustering methods. However, its overall approach is typical of subspace clustering for high-dimensional space, and so it is introduced in this section.

The ideas of the CLIQUE clustering algorithm are outlined as follows. Given a large set of multidimensional data points, the data space is usually not uniformly occupied by the data points. CLIQUE's clustering identifies the sparse and the "crowded" areas in space (or units), thereby discovering the overall distribution patterns of the data set. A unit is dense if the fraction of total data points contained in it exceeds an input model parameter. In CLIQUE, a cluster is defined as a maximal set of connected dense units.

"How does CLIQUE work?" CLIQUE performs multidimensional clustering in two steps. In the first step, CLIQUE partitions the d-dimensional data space into nonoverlapping rectangular units, identifying the dense units among them. This is done (in 1-D) for each dimension. For example, Figure 7.21 shows dense rectangular units found with respect to age for the dimensions salary and (number of weeks of) vacation. The subspaces representing these dense units are intersected to form a candidate search space in which dense units of higher dimensionality may exist. "Why does CLIQUE confine its search for dense units of higher dimensionality to the intersection of the dense units in the subspaces?"
The identification of the candidate search space is based on the Apriori property used in association rule mining.8 In general, the property employs prior knowledge of items in the search space so that portions of the space can be pruned. The property, adapted for CLIQUE, states the following: If a kdimensional
7 Attribute subset selection is known in the machine learning literature as feature subset selection. It was discussed in Chapter 2 as a form of data reduction (Section 2.5.2).
8 Association rule mining is described in detail in Chapter 5. In particular, the Apriori property is described in Section 5.2.1. The Apriori property can also be used for cube computation, as described in Chapter 4.
Figure 7.21: Dense units found with respect to age for the dimensions salary and vacation are intersected in order to provide a candidate search space for dense units of higher dimensionality.
unit is dense, then so are its projections in (k − 1)-dimensional space. That is, given a k-dimensional candidate dense unit, if we check its (k − 1)-dimensional projection units and find any that are not dense, then we know that the k-dimensional unit cannot be dense either. Therefore, we can generate potential or candidate dense units in k-dimensional space from the dense units found in (k − 1)-dimensional space. In general, the resulting space searched is much smaller than the original space. The dense units are then examined in order to determine the clusters.

In the second step, CLIQUE generates a minimal description for each cluster as follows. For each cluster, it determines the maximal region that covers the cluster of connected dense units. It then determines a minimal cover (a logic description) for each cluster.

"How effective is CLIQUE?" CLIQUE automatically finds subspaces of the highest dimensionality such that high-density clusters exist in those subspaces. It is insensitive to the order of input objects and does not presume any canonical data distribution. It scales linearly with the size of input and has good scalability as the number of dimensions in the data is increased. However, obtaining meaningful clustering results is dependent on proper tuning of the grid size (which is a stable structure here) and the density threshold. This is particularly difficult because the grid size and density threshold are used across all combinations of dimensions in the data set. Thus, the accuracy of the clustering results may be degraded at the expense of the simplicity of the method. Moreover, for a given dense region, all projections of the region onto lower-dimensionality subspaces will also be dense. This can result in a large overlap among the reported dense regions. Furthermore, it is difficult to find clusters of rather different density within different dimensional subspaces.
There are several extensions to this approach that follow a similar philosophy. For example, let's think of a grid as a set of fixed bins. Instead of using fixed bins for each of the dimensions, we can use an adaptive, data-driven strategy to dynamically determine the bins for each dimension based on data distribution statistics. Alternatively, instead of using a density threshold, we could use entropy (Chapter 6) as a measure of the quality of subspace clusters.
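The dense-unit identification and Apriori-style pruning described in this subsection can be sketched for the 2-D case as follows. The bin width and density threshold are illustrative parameters, and the function name is our own; real CLIQUE generalizes this grid search to d dimensions and then connects dense units into clusters.

```python
from collections import Counter

def dense_units(points, bin_width=1.0, threshold=2):
    """Sketch of CLIQUE's first step for 2-D points.

    Finds dense 1-D units per dimension, then keeps only those 2-D units
    that are dense AND whose 1-D projections are both dense -- the
    Apriori-style pruning: a dense 2-D unit must have dense projections.
    """
    # Dense units in each single dimension.
    dense_1d = []
    for dim in (0, 1):
        counts = Counter(int(p[dim] // bin_width) for p in points)
        dense_1d.append({cell for cell, c in counts.items() if c >= threshold})
    # Candidate 2-D units: only cells whose projections survived.
    counts_2d = Counter((int(p[0] // bin_width), int(p[1] // bin_width)) for p in points)
    dense_2d = {cell for cell, c in counts_2d.items()
                if c >= threshold and cell[0] in dense_1d[0] and cell[1] in dense_1d[1]}
    return dense_1d, dense_2d
```

With two tight groups of points, only the two grid cells containing them survive as dense 2-D units; all other candidates are pruned before they are even counted as clusters.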
7.9.2 PROCLUS: A Dimension-Reduction Subspace Clustering Method
PROCLUS (PROjected CLUStering) is a typical dimension-reduction subspace clustering method. That is, instead of starting from single-dimensional spaces, it starts by finding an initial approximation of the clusters in the high-dimensional attribute space. Each dimension is then assigned a weight for each cluster, and the updated weights are used in the next iteration to regenerate the clusters. This leads to the exploration of dense regions in all subspaces of some desired dimensionality and avoids the generation of a large number of overlapped clusters in projected dimensions of lower dimensionality.

PROCLUS finds the best set of medoids by a hill-climbing process similar to that used in CLARANS, but generalized to deal with projected clustering. It adopts a distance measure called Manhattan segmental distance, which is the Manhattan distance on a set of relevant dimensions. The PROCLUS algorithm consists of three phases: initialization, iteration, and cluster refinement. In the initialization phase, it uses a greedy algorithm to select a set of initial medoids that are far apart from each other, so as to ensure that each cluster is represented by at least one object in the selected set. More concretely, it first chooses a random sample of data points proportional to the number of clusters we wish to generate, and then applies the greedy algorithm to obtain an even smaller final subset for the next phase. The iteration phase selects a random set of k medoids from this reduced set (of medoids), and replaces "bad" medoids with randomly chosen new medoids if the clustering is improved. For each medoid, a set of dimensions is chosen whose average distances are small compared to statistical expectation. The total number of dimensions associated with the medoids must be k × l, where l is an input parameter that selects the average dimensionality of cluster subspaces.
The refinement phase computes new dimensions for each medoid based on the clusters found, reassigns points to medoids, and removes outliers. Experiments on PROCLUS show that the method is efficient and scalable at finding high-dimensional clusters. Unlike CLIQUE, which outputs many overlapped clusters, PROCLUS finds nonoverlapped partitions of points. The discovered clusters may help better understand the high-dimensional data and facilitate other subsequent analyses.
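The Manhattan segmental distance used by PROCLUS can be written down directly from its definition above: the Manhattan distance restricted to the relevant dimensions, normalized by the number of those dimensions (the function name is our own).

```python
def manhattan_segmental_distance(x, y, dims):
    """Manhattan distance between points x and y over the relevant
    dimensions `dims`, averaged over the number of those dimensions."""
    return sum(abs(x[d] - y[d]) for d in dims) / len(dims)
```

Normalizing by the number of relevant dimensions lets distances computed in subspaces of different dimensionality be compared on an equal footing, which is what the iteration phase needs when different medoids have different relevant-dimension sets.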
7.9.3 Frequent Pattern-Based Clustering Methods
This section looks at how methods of frequent pattern mining can be applied to clustering, resulting in frequent pattern-based cluster analysis. Frequent pattern mining, as the name implies, searches for patterns (such as sets of items or objects) that occur frequently in large data sets. Frequent pattern mining can lead to the discovery of interesting associations and correlations among data objects. Methods for frequent pattern mining were introduced in Chapter 5. The idea behind frequent pattern-based cluster analysis is that the frequent patterns discovered may also indicate clusters. Frequent pattern-based cluster analysis is well suited to high-dimensional data. It can be viewed as an extension of the dimension-growth subspace clustering approach. However, the boundaries of different dimensions are not obvious, since here they are represented by sets of frequent itemsets. That is, rather than growing the clusters dimension by dimension, we grow sets of frequent itemsets, which eventually lead to cluster descriptions. Typical examples of frequent pattern-based cluster analysis include the clustering of text documents that contain thousands of distinct keywords, and the analysis of microarray data that contain tens of thousands of measured values or "features". In this section, we examine two forms of frequent pattern-based cluster analysis: frequent term-based text clustering and clustering by pattern similarity in microarray data analysis.

In frequent term-based text clustering, text documents are clustered based on the frequent terms they contain. Using the vocabulary of text document analysis, a term is any sequence of characters separated from other terms by a delimiter. A term can be made up of a single word or several words. In general, we first
remove nontext information (such as HTML tags and punctuation) and stop words. Terms are then extracted. A stemming algorithm is then applied to reduce each term to its basic stem. In this way, each document can be represented as a set of terms. Each set is typically large. Collectively, a large set of documents will contain a very large set of distinct terms. If we treat each term as a dimension, the dimension space will be of very high dimensionality! This poses great challenges for document cluster analysis. The dimension space can be referred to as term vector space, where each document is represented by a term vector.

This difficulty can be overcome by frequent term-based analysis. That is, by using an efficient frequent itemset mining algorithm introduced in Section 5.2, we can mine a set of frequent terms from the set of text documents. Then, instead of clustering on the high-dimensional term vector space, we need only consider the low-dimensional frequent term sets as "cluster candidates". Notice that a frequent term set is not a cluster but rather the description of a cluster. The corresponding cluster consists of the set of documents containing all of the terms of the frequent term set. A well-selected subset of the set of all frequent term sets can be considered as a clustering.

"How, then, can we select a good subset of the set of all frequent term sets?" This step is critical because such a selection will determine the quality of the resulting clustering. Let Fi be a set of frequent term sets and cov(Fi) be the set of documents covered by Fi. That is, cov(Fi) refers to the documents that contain all of the terms in Fi. The general principle for finding a well-selected subset, F1, . . . , Fk, of the set of all frequent term sets is to ensure that (1) ∪i=1..k cov(Fi) = D, that is, the selected subset should cover all of the documents, D, to be clustered; and (2) the overlap between any two partitions, Fi and Fj (for i ≠ j), should be minimized.
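As a toy illustration of turning frequent term sets into cluster candidates, the sketch below computes cov(Fi) and greedily picks term sets until every document is covered. Greedy coverage is a crude stand-in for a principled overlap criterion, and all names are our own.

```python
def frequent_term_clusters(docs, term_sets):
    """Greedy selection of cluster candidates from frequent term sets.

    `docs` is a list of documents, each a set of (stemmed) terms;
    `term_sets` is a list of candidate frequent term sets.
    cov(F) is the set of documents containing all terms of F.
    """
    def cov(terms):
        return {i for i, d in enumerate(docs) if terms <= d}

    remaining = set(range(len(docs)))
    chosen = []
    candidates = [set(t) for t in term_sets]
    while remaining and candidates:
        # Pick the term set covering the most still-uncovered documents.
        best = max(candidates, key=lambda t: len(cov(t) & remaining))
        if not cov(best) & remaining:
            break  # no candidate helps; give up on full coverage
        chosen.append(best)
        remaining -= cov(best)
        candidates.remove(best)
    return chosen
```

Note that each selected term set is simultaneously a cluster description and a cluster definition via its cover, which is the advantage over traditional clustering discussed in the text.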
An overlap measure based on entropy9 is used to assess cluster overlap by measuring the distribution of the documents supporting some cluster over the remaining cluster candidates. An advantage of frequent term-based text clustering is that it automatically generates a description for the generated clusters in terms of their frequent term sets. Traditional clustering methods produce only the clusters; a description for the generated clusters requires an additional postprocessing step.

Another interesting approach for clustering high-dimensional data is based on pattern similarity among the objects on a subset of dimensions. Here we introduce the pCluster method, which performs clustering by pattern similarity in microarray data analysis. In DNA microarray analysis, the expression levels of two genes may rise and fall synchronously in response to a set of environmental stimuli or conditions. Under the pCluster model, two objects are similar if they exhibit a coherent pattern on a subset of dimensions. Although the magnitude of their expression levels may not be close, the patterns they exhibit can be very much alike. This is illustrated in Example 7.15 below. Discovery of such clusters of genes is essential in revealing significant connections in gene regulatory networks.
Figure 7.22: Raw data from a fragment of microarray data containing only 3 objects and 10 attributes.
9 Entropy is a measure from information theory. It was introduced in Chapter 2 regarding data discretization and is also described in Chapter 6 regarding decision tree construction.
Example 7.15 Clustering by pattern similarity in DNA microarray analysis. Figure 7.22 shows a fragment of microarray data containing only three genes (taken as "objects" here) and ten attributes (columns a to j). No patterns among the three objects are immediately visible. However, if two subsets of attributes, {b, c, h, j, e} and {f, d, a, g, i}, are selected and plotted as in Figure 7.23 (a) and (b), respectively, it is easy to see that they form some interesting patterns: Figure 7.23 (a) forms a shift pattern, where the three curves are similar to each other with respect to a shift operation along the y-axis, while Figure 7.23 (b) forms a scaling pattern, where the three curves are similar to each other with respect to a scaling operation along the y-axis.
Figure 7.23: Objects in Figure 7.22 form (a) a shift pattern in subspace {b, c, h, j, e}, and (b) a scaling pattern in subspace {f, d, a, g, i}.

Let us first examine how to discover shift patterns. In DNA microarray data, each row corresponds to a gene and each column or attribute represents a condition under which the gene is developed. The usual Euclidean distance measure cannot capture pattern similarity, since the y values of different curves can be quite far apart. Alternatively, we could first transform the data to derive new attributes, such as Aij = vi − vj (where vi and vj are object values for attributes Ai and Aj, respectively), and then cluster on the derived attributes. However, this would introduce d(d − 1)/2 dimensions for a d-dimensional data set, which is undesirable for a nontrivial d value. A biclustering method was proposed in an attempt to overcome these difficulties. It introduces a new measure, the mean squared residue score, which measures the coherence of the genes and conditions in a submatrix of a DNA array. Let I ⊆ X and J ⊆ Y be subsets of genes, X, and conditions, Y, respectively. The pair, (I, J), specifies a submatrix, AIJ, with the mean squared residue score defined as
H(I, J) = (1 / (|I||J|)) Σi∈I, j∈J (dij − diJ − dIj + dIJ)²,   (7.39)
where dij is the measured value of gene i for condition j, and

diJ = (1/|J|) Σj∈J dij,   dIj = (1/|I|) Σi∈I dij,   dIJ = (1/(|I||J|)) Σi∈I, j∈J dij,   (7.40)
where d_{iJ} and d_{Ij} are the row and column means, respectively, and d_{IJ} is the mean of the subcluster matrix, A_{IJ}. A submatrix, A_{IJ}, is called a δ-bicluster if H(I, J) ≤ δ for some δ > 0. A randomized algorithm is designed to find such clusters in a DNA array. There are two major limitations of this method. First, a submatrix of a δ-bicluster is not necessarily a δ-bicluster, which makes it difficult to design an efficient pattern growth-based
algorithm. Second, because of the averaging effect, a δ-bicluster may contain some undesirable outliers yet still satisfy a rather small threshold, δ.

To overcome the problems of the biclustering method, a pCluster model was introduced as follows. Given objects x, y ∈ O and attributes a, b ∈ T, pScore is defined by a 2 × 2 matrix as
pScore\!\left(\begin{pmatrix} d_{xa} & d_{xb} \\ d_{ya} & d_{yb} \end{pmatrix}\right) = |(d_{xa} - d_{xb}) - (d_{ya} - d_{yb})|, \qquad (7.41)
where d_{xa} is the value of object (or gene) x for attribute (or condition) a, and so on. A pair, (O, T), forms a pCluster if, for any 2 × 2 submatrix, X, in (O, T), we have pScore(X) ≤ δ for some δ > 0. Intuitively, this means that the change of values on the two attributes between the two objects is confined by δ for every pair of objects in O and every pair of attributes in T.

It is easy to see that the pCluster model has the downward closure property, that is, if (O, T) forms a pCluster, then any of its submatrices is also a pCluster. Moreover, since a pCluster requires that every two objects and every two attributes conform with the inequality, the clusters modeled by the pCluster method are more homogeneous than those modeled by the bicluster method. In frequent itemset mining, itemsets are considered frequent if they satisfy a minimum support threshold, which reflects their frequency of occurrence. Based on the definition of pCluster, the problem of mining pClusters becomes one of mining frequent patterns in which each pair of objects and their corresponding features must satisfy the specified δ threshold. A frequent pattern-growth method can easily be extended to mine such patterns efficiently.

Now, let's look into how to discover scaling patterns. Notice that the original pScore definition, though defined for shift patterns in Equation (7.41), can easily be extended for scaling by introducing a new inequality,

\frac{d_{xa} / d_{ya}}{d_{xb} / d_{yb}} \leq \delta'. \qquad (7.42)
This can be computed efficiently because Equation (7.41) is a logarithmic form of Equation (7.42). That is, the same pCluster model can be applied to the data set after converting the data to the logarithmic form. Thus, the efficient derivation of pClusters for shift patterns can naturally be extended for the derivation of pClusters for scaling patterns. The pCluster model, though developed in the study of microarray data cluster analysis, can be applied to many other applications that require finding similar or coherent patterns involving a subset of numerical dimensions in large, highdimensional data sets.
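To make the two coherence measures concrete, here is a small Python sketch (with illustrative function names, not code from the text) that computes the mean squared residue score of Equation (7.39) and checks the pCluster condition built from the pScore of Equation (7.41):

```python
import itertools

def mean_squared_residue(sub):
    # H(I, J) from Equation (7.39): coherence of an |I| x |J| submatrix,
    # given as a list of equal-length rows of numbers.
    n_rows, n_cols = len(sub), len(sub[0])
    row_means = [sum(row) / n_cols for row in sub]
    col_means = [sum(sub[i][j] for i in range(n_rows)) / n_rows
                 for j in range(n_cols)]
    total_mean = sum(map(sum, sub)) / (n_rows * n_cols)
    return sum(
        (sub[i][j] - row_means[i] - col_means[j] + total_mean) ** 2
        for i in range(n_rows) for j in range(n_cols)
    ) / (n_rows * n_cols)

def p_score(dxa, dxb, dya, dyb):
    # pScore of a 2 x 2 matrix, Equation (7.41).
    return abs((dxa - dxb) - (dya - dyb))

def is_p_cluster(sub, delta):
    # (O, T) is a pCluster iff every 2 x 2 submatrix has pScore <= delta.
    n_rows, n_cols = len(sub), len(sub[0])
    return all(
        p_score(sub[x][a], sub[x][b], sub[y][a], sub[y][b]) <= delta
        for x, y in itertools.combinations(range(n_rows), 2)
        for a, b in itertools.combinations(range(n_cols), 2)
    )
```

A perfect shift pattern (each row a constant offset of another row) yields a residue score of 0 and satisfies the pCluster condition for any δ; applying the same functions to the element-wise logarithm of the data checks scaling patterns, per Equation (7.42).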
7.10 Constraint-Based Cluster Analysis
In the discussion above, we assumed that cluster analysis is an automated, algorithmic computational process based on the evaluation of similarity or distance functions among a set of objects to be clustered, with little user guidance or interaction. However, users often have a clear view of the application requirements, which they would ideally like to use to guide the clustering process and influence the clustering results. Thus, in many applications, it is desirable to have the clustering process take user preferences and constraints into consideration. Examples of such information include the expected number of clusters, the minimal or maximal cluster size, weights for different objects or dimensions, and other desirable characteristics of the resulting clusters. Moreover, when a clustering task involves a rather high-dimensional space, it is very difficult to generate meaningful clusters by relying solely on the clustering parameters. User input regarding important dimensions or the desired results will serve as crucial hints or meaningful constraints for effective clustering.

In general, we contend that knowledge discovery would be most effective if one could develop an environment for human-centered, exploratory mining of data, that is, where
the human user is allowed to play a key role in the process. Foremost, a user should be allowed to specify a focus, directing the mining algorithm toward the kind of "knowledge" that the user is interested in finding. Clearly, user-guided mining will lead to more desirable results and capture the application semantics.

Constraint-based clustering finds clusters that satisfy user-specified preferences or constraints. Depending on the nature of the constraints, constraint-based clustering may adopt rather different approaches. Here are a few categories of constraints.

1. Constraints on individual objects: We can specify constraints on the objects to be clustered. In a real estate application, for example, one may like to spatially cluster only those luxury mansions worth over a million dollars. This constraint confines the set of objects to be clustered. It can easily be handled by preprocessing (e.g., performing selection using an SQL query), after which the problem reduces to an instance of unconstrained clustering.

2. Constraints on the selection of clustering parameters: A user may like to set a desired range for each clustering parameter. Clustering parameters are usually quite specific to the given clustering algorithm. Examples of parameters include k, the desired number of clusters in a k-means algorithm, or ε (the radius) and MinPts (the minimum number of points) in the DBSCAN algorithm. Although such user-specified parameters may strongly influence the clustering results, they are usually confined to the algorithm itself. Thus, their fine-tuning and processing are usually not considered a form of constraint-based clustering.

3. Constraints on distance or similarity functions: We can specify different distance or similarity functions for specific attributes of the objects to be clustered, or different distance measures for specific pairs of objects.
When clustering sportsmen, for example, we may use different weighting schemes for height, body weight, age, and skill level. Although this will likely change the mining results, it may not alter the clustering process per se. However, in some cases, such changes may make the evaluation of the distance function nontrivial, especially when it is tightly intertwined with the clustering process. This can be seen in the following example.

Example 7.16 Clustering with obstacle objects. A city may have rivers, bridges, highways, lakes, and mountains. We do not want to swim across a river to reach an automated banking machine. Such obstacle objects and their effects can be captured by redefining the distance functions among objects. Clustering with obstacle objects using a partitioning approach requires that the distance between each object and its corresponding cluster center be reevaluated whenever the cluster center changes. However, such reevaluation is quite expensive in the presence of obstacles. In this case, efficient new methods should be developed for clustering with obstacle objects in large data sets.

4. User-specified constraints on the properties of individual clusters: A user may like to specify desired characteristics of the resulting clusters, which may strongly influence the clustering process. Such constraint-based clustering arises naturally in practice, as in Example 7.17.

Example 7.17 User-constrained cluster analysis. Suppose a package delivery company would like to determine the locations for k service stations in a city. The company has a database of customers that registers each customer's name, location, length of time since the customer began using the company's services, and average monthly charge. We may formulate this location selection problem as an instance of unconstrained clustering using a distance function computed based on customer location.
However, a smarter approach is to partition the customers into two classes: high-value customers (who need frequent, regular service) and ordinary customers (who require occasional service). In order to save costs and provide good service, the manager adds the following constraints: (1) each station should serve at least 100 high-value customers; and (2) each station should serve at least 5,000 ordinary customers. Constraint-based clustering will take such constraints into consideration during the clustering process.

5. Semi-supervised clustering based on "partial" supervision: The quality of unsupervised clustering can be significantly improved using some weak form of supervision. This may be in the form of pairwise constraints, i.e., pairs of objects labeled as belonging to the same or different clusters. Such a constrained clustering process is called semi-supervised clustering.
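As a minimal sketch of clustering under pairwise constraints, the following Python fragment modifies the k-means assignment step so that each point takes the nearest center that violates no must-link or cannot-link constraint, in the style of constrained k-means (the function names and the simple greedy assignment order are illustrative assumptions, not a specific published implementation):

```python
import math
import random

def violates(i, c, labels, must, cannot):
    # True if putting object i into cluster c breaks a pairwise constraint,
    # given the partial assignment built so far.
    for a, b in must:
        other = b if a == i else (a if b == i else None)
        if other is not None and other in labels and labels[other] != c:
            return True
    for a, b in cannot:
        other = b if a == i else (a if b == i else None)
        if other is not None and labels.get(other) == c:
            return True
    return False

def constrained_kmeans(points, k, must=(), cannot=(), iters=10, seed=0):
    # k-means whose assignment step respects must-link / cannot-link pairs.
    rng = random.Random(seed)
    centers = rng.sample(points, k)
    labels = {}
    for _ in range(iters):
        labels = {}
        for i, p in enumerate(points):
            # try clusters from nearest to farthest, skipping violations
            for c in sorted(range(k), key=lambda c: math.dist(p, centers[c])):
                if not violates(i, c, labels, must, cannot):
                    labels[i] = c
                    break
            else:
                raise ValueError("no constraint-respecting assignment found")
        for c in range(k):
            members = [points[i] for i, lab in labels.items() if lab == c]
            if members:
                centers[c] = tuple(sum(d) / len(members) for d in zip(*members))
    return [labels[i] for i in range(len(points))]
```

Whenever the function returns, every must-link pair shares a label and every cannot-link pair does not; with contradictory constraints the greedy assignment may fail, which is the analogue of the deadlock situations discussed below.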
In this section, we examine how efficient constraint-based clustering methods can be developed for large data sets. Because cases 1 and 2 above are trivial, we focus on cases 3 to 5 as typical forms of constraint-based cluster analysis.
7.10.1 Clustering with Obstacle Objects
Example 7.16 above introduced the problem of clustering with obstacle objects regarding the placement of automated banking machines. The machines should be easily accessible to the bank's customers. During clustering, we must therefore take obstacle objects into consideration, such as rivers, highways, and mountains. Obstacles introduce constraints on the distance function: the straight-line distance between two points is meaningless if there is an obstacle in the way. As pointed out in Example 7.16, we do not want to have to swim across a river to get to a banking machine!

"How can we approach the problem of clustering with obstacles?" A partitioning clustering method is preferable because it minimizes the distance between objects and their cluster centers. If we choose the k-means method, a cluster center may not be accessible given the presence of obstacles. For example, the cluster mean could turn out to be in the middle of a lake. On the other hand, the k-medoids method chooses an object within the cluster as a center and thus guarantees that such a problem cannot occur. Recall that every time a new medoid is selected, the distance between each object and its newly selected cluster center has to be recomputed. Because there could be obstacles between two objects, the distance between two objects may have to be derived by geometric computations (e.g., involving triangulation). The computational cost can get very high if a large number of objects and obstacles are involved.

The clustering-with-obstacles problem can be represented using a graphical notation. First, a point, p, is visible from another point, q, in the region, R, if the straight line joining p and q does not intersect any obstacles. A visibility graph is the graph, VG = (V, E), such that each vertex of the obstacles has a corresponding node in V, and two nodes, v1 and v2, in V are joined by an edge in E if and only if the corresponding vertices they represent are visible to each other.
Let VG′ = (V′, E′) be a visibility graph created from VG by adding two additional points, p and q, to V′. E′ contains an edge joining two points in V′ if the two points are mutually visible. The shortest path between the two points, p and q, will be a subpath of VG′, as shown in Figure 7.24 (a). We see that it begins with an edge from p to either v1, v2, or v3, goes through some path in VG, and then ends with an edge from either v4 or v5 to q.
Figure 7.24: Clustering with obstacle objects (o1 and o2): (a) a visibility graph, and (b) triangulation of regions with microclusters. From [THH01].
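The visibility-graph shortest-path computation described above can be sketched in Python as follows. This is an illustrative toy, not a production geometry routine: it uses a strict segment-crossing test, treats edges that share an endpoint with the query segment as non-blocking, and is adequate only for simple obstacles with no interior diagonals (e.g., triangles); the function names are assumptions.

```python
import heapq
import math

def _orient(p, q, r):
    # sign of the cross product (q - p) x (r - p)
    v = (q[0] - p[0]) * (r[1] - p[1]) - (q[1] - p[1]) * (r[0] - p[0])
    return (v > 0) - (v < 0)

def _crosses(a, b, c, d):
    # True if segment ab strictly crosses segment cd
    return (_orient(a, b, c) != _orient(a, b, d)
            and _orient(c, d, a) != _orient(c, d, b))

def _visible(u, v, obstacles):
    # u sees v if the straight segment uv crosses no obstacle edge
    for poly in obstacles:
        for i in range(len(poly)):
            e1, e2 = poly[i], poly[(i + 1) % len(poly)]
            if {u, v} & {e1, e2}:      # shared endpoint: not a blocking crossing
                continue
            if _crosses(u, v, e1, e2):
                return False
    return True

def obstacle_distance(p, q, obstacles):
    # Dijkstra over the augmented visibility graph VG': nodes are p, q, and
    # all obstacle vertices; edges join mutually visible nodes, weighted by
    # Euclidean length.
    nodes = [p, q] + [v for poly in obstacles for v in poly]
    dist = {n: math.inf for n in nodes}
    dist[p] = 0.0
    heap = [(0.0, p)]
    while heap:
        d, u = heapq.heappop(heap)
        if u == q:
            return d
        if d > dist[u]:
            continue
        for v in nodes:
            if v != u and _visible(u, v, obstacles):
                nd = d + math.dist(u, v)
                if nd < dist[v]:
                    dist[v] = nd
                    heapq.heappush(heap, (nd, v))
    return math.inf
```

For a triangular obstacle straddling the straight line between p and q, the returned distance is the length of the detour around a triangle vertex, strictly longer than the blocked straight-line distance.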
To reduce the cost of distance computation between any two pairs of objects or points, several preprocessing and optimization techniques can be used. One method groups points that are close together into microclusters. This can be done by first triangulating the region R into triangles, and then grouping nearby points in the same triangle into microclusters, using a method similar to BIRCH or DBSCAN, as shown in Figure 7.24 (b). By processing microclusters rather than individual points, the overall computation is reduced. After that, precomputation can be performed to build two kinds of join indices based on the computation of the shortest paths: (1) VV indices, for any pair of obstacle vertices, and (2) MV indices, for any pair of microcluster and obstacle vertex. Use of the indices helps further optimize the overall performance. With such precomputation and optimization, the distance between any two points (at the granularity level of microclusters) can be computed efficiently. Thus, the clustering process can be performed in a manner similar to a typical efficient k-medoids algorithm, such as CLARANS, and achieve good clustering quality for large data sets.

Given a large set of points, Figure 7.25(a) shows the result of clustering the points without considering obstacles, whereas Figure 7.25(b) shows the result with consideration of obstacles. The latter represents rather different but more desirable clusters. For example, if we carefully compare the upper left-hand corner of the two graphs, we see that Figure 7.25(a) has a cluster center on an obstacle (making the center inaccessible), while all cluster centers in Figure 7.25(b) are accessible. A similar situation occurs in the bottom right-hand corner of the graphs.
Figure 7.25: Clustering results obtained without and with consideration of obstacles (where rivers and inaccessible highways or city blocks are represented by polygons): (a) clustering without considering obstacles, and (b) clustering with obstacles.
7.10.2 User-Constrained Cluster Analysis
Let's examine the problem of relocating package delivery centers, as illustrated in Example 7.17. Specifically, a package delivery company with n customers would like to determine locations for k service stations so as to minimize the traveling distance between customers and service stations. The company's customers are regarded as either high-value customers (requiring frequent, regular service) or ordinary customers (requiring occasional service). The manager has stipulated two constraints: each station should serve (1) at least 100 high-value customers and (2) at least 5,000 ordinary customers.

This can be considered as a constrained optimization problem. We could consider using a mathematical programming approach to handle it. However, such a solution is difficult to scale to large data sets. To cluster n customers into k clusters, a mathematical programming approach would involve at least k · n variables. As n can be as large as a few million, we could end up having to solve a few million simultaneous equations, a very expensive feat. A more efficient approach is proposed that explores the idea of microclustering, as illustrated below.
The general idea of clustering a large data set into k clusters satisfying user-specified constraints goes as follows. First, we find an initial "solution" by partitioning the data set into k groups that satisfy the user-specified constraints, such as the two constraints in our example. We then iteratively refine the solution by moving objects from one cluster to another, trying to satisfy the constraints. For example, we can move a set of m customers from cluster Ci to Cj if Ci has at least m surplus customers (under the specified constraints), or if the result of moving customers into Ci from some other clusters (including from Cj) would result in such a surplus. The movement is desirable if the total sum of the distances of the objects to their corresponding cluster centers is reduced. Such movement can be directed by selecting promising points to be moved, such as objects that are currently assigned to some cluster, Ci, but that are actually closer to a representative (e.g., centroid) of some other cluster, Cj. We need to watch out for and handle deadlock situations (where a constraint is impossible to satisfy), in which case a deadlock resolution strategy can be employed.

To increase the clustering efficiency, the data can first be preprocessed using the microclustering idea to form microclusters (groups of points that are close together), thereby avoiding the processing of all of the points individually. Object movement, deadlock detection, and constraint satisfaction can then be tested at the microcluster level, which reduces the number of points to be computed. Occasionally, such microclusters may need to be broken up in order to resolve deadlocks under the constraints. This methodology ensures that effective clustering can be performed on large data sets under user-specified constraints with good efficiency and scalability.
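The assign-then-repair idea above can be sketched in a few lines of Python (illustrative names; a real implementation would operate on microclusters and handle multiple constraint classes): assign each point to its nearest center, then repair any cluster below its minimum size by pulling in the points whose reassignment adds the least distance, taking them only from clusters with surplus members so that no new deficit is created.

```python
import math

def constrained_assign(points, centers, min_size):
    # Initial "solution": nearest-center assignment (may violate constraints).
    assign = [min(range(len(centers)), key=lambda c: math.dist(p, centers[c]))
              for p in points]
    size = [0] * len(centers)
    for a in assign:
        size[a] += 1
    # Refinement: fill each deficient cluster with the cheapest moves,
    # drawing only from clusters that have surplus members.
    for c in range(len(centers)):
        while size[c] < min_size:
            cands = [i for i, a in enumerate(assign)
                     if a != c and size[a] > min_size]
            if not cands:
                raise ValueError("deadlock: constraints cannot be satisfied")
            i = min(cands,
                    key=lambda i: math.dist(points[i], centers[c])
                                  - math.dist(points[i], centers[assign[i]]))
            size[assign[i]] -= 1
            assign[i] = c
            size[c] += 1
    return assign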
7.10.3 Semi-Supervised Cluster Analysis
In comparison with supervised learning, clustering lacks guidance from users or classifiers (such as class label information), and thus may not generate highly desirable clusters. The quality of unsupervised clustering can be significantly improved using some weak form of supervision, for example, in the form of pairwise constraints, i.e., pairs of objects labeled as belonging to the same or different clusters. Such a clustering process based on user feedback or guidance constraints is called semi-supervised clustering.

Methods for semi-supervised clustering can be categorized into two classes: constraint-based semi-supervised clustering and distance-based semi-supervised clustering. Constraint-based semi-supervised clustering relies on user-provided labels or constraints to guide the algorithm toward a more appropriate data partitioning. This includes modifying the objective function based on constraints, or initializing and constraining the clustering process based on the labeled objects. Distance-based semi-supervised clustering employs an adaptive distance measure that is trained to satisfy the labels or constraints in the supervised data. Several different adaptive distance measures have been used, such as string-edit distance trained using Expectation-Maximization (EM), and Euclidean distance modified by a shortest distance algorithm.

An interesting clustering method, called CLTree (CLustering based on decision Trees), integrates unsupervised clustering with the idea of supervised classification. It is an example of constraint-based semi-supervised clustering. It transforms a clustering task into a classification task by viewing the set of points to be clustered as belonging to one class, labeled as "Y", and adding a set of relatively uniformly distributed "nonexistence points" with a different class label, "N". The problem of partitioning the data space into data (dense) regions and empty (sparse) regions can then be transformed into a classification problem.
For example, Figure 7.26(a) contains a set of data points to be clustered. These points can be viewed as a set of "Y" points. Figure 7.26(b) shows the addition of a set of uniformly distributed "N" points, represented by the "◦" points. The original clustering problem is thus transformed into a classification problem, which seeks a scheme that distinguishes "Y" and "N" points. A decision tree induction method can be applied^10 to partition the two-dimensional space, as shown in Figure 7.26(c). Two clusters are identified, which are from the "Y" points only.

Adding a large number of "N" points to the original data may introduce unnecessary overhead in computation. Furthermore, it is unlikely that any points added would truly be uniformly distributed in a very high-dimensional space, as this would require an exponential number of points. To deal with this problem, we do not physically add any of the "N" points, but only assume their existence. This works because the decision-tree method does not actually require the points. Instead, it only needs the number of "N" points at each decision tree node. This
^10 Decision tree induction was described in Chapter 6 on classification.
Figure 7.26: Clustering through decision tree construction: (a) the set of data points to be clustered, viewed as a set of "Y" points, (b) the addition of a set of uniformly distributed "N" points, represented by "◦", and (c) the clustering result, with "Y" points only.
number can be computed when needed, without having to add points to the original data. Thus, CLTree can achieve the results in Figure 7.26(c) without actually adding any "N" points to the original data. Again, two clusters are identified.

The question then is how many (virtual) "N" points should be added in order to achieve good clustering results. The answer follows this simple rule: At the root node, the number of inherited "N" points is 0. At any current node, E, if the number of "N" points inherited from the parent node of E is less than the number of "Y" points in E, then the number of "N" points for E is increased to the number of "Y" points in E. (That is, we set the number of "N" points to be as big as the number of "Y" points.) Otherwise, the number of inherited "N" points is used in E. The basic idea is to use a number of "N" points equal to the number of "Y" points.

Decision tree classification methods use a measure, typically based on information gain, to select the attribute test for a decision node (Section 6.3.2). The data are then split or partitioned according to the test or "cut". Unfortunately, with clustering, this can lead to the fragmentation of some clusters into scattered regions. To address this problem, methods were developed that use information gain but add the ability to look ahead. That is, CLTree first finds initial cuts and then looks ahead to find better partitions that cut less into cluster regions. It finds those cuts that form regions with a very low relative density. The idea is that we want to split at the cut point that may result in a big empty ("N") region, which is more likely to separate clusters. With such tuning, CLTree can perform high-quality clustering in high-dimensional space. It can also find subspace clusters, as the decision tree method normally selects only a subset of the attributes.
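The virtual-"N" bookkeeping just described can be stated in a few lines of Python. The helper names are hypothetical, and splitting the virtual "N" count in proportion to subregion width is our own simplifying assumption, consistent with the "N" points being assumed uniform:

```python
def virtual_n(inherited_n, y_count):
    # CLTree rule from the text: if the inherited "N" count is below the
    # node's "Y" count, inflate it to the "Y" count; otherwise keep it.
    return max(inherited_n, y_count)

def split_counts(y_points, dim, cut, node_n, lo, hi):
    # Split the real "Y" points at the cut on dimension `dim`; apportion the
    # node's virtual "N" points in proportion to subregion width [lo, hi].
    y_left = sum(1 for p in y_points if p[dim] <= cut)
    y_right = len(y_points) - y_left
    n_left = node_n * (cut - lo) / (hi - lo)
    return (y_left, n_left), (y_right, node_n - n_left)
```

A candidate cut can then be scored (e.g., by information gain over the Y/N counts on each side) without ever materializing the "N" points.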
An interesting by-product of this method is the empty (sparse) regions, which may also be useful in certain applications. In marketing, for example, clusters may represent different segments of existing customers of a company, while empty regions reflect the profiles of noncustomers. Knowing the profiles of noncustomers allows the company to tailor its services or marketing to target these potential customers.
7.11 Outlier Analysis
"What is an outlier?" Very often, there exist data objects that do not comply with the general behavior or model of the data. Such data objects, which are grossly different from or inconsistent with the remaining set of data, are called outliers. Outliers can be caused by measurement or execution error. For example, the display of a person's age as 999 could be caused by a program default setting of an unrecorded age. Alternatively, outliers may be the result of inherent data variability. The salary of the chief executive officer of a company, for instance, could naturally stand out as an outlier among the salaries of the other employees in the firm. Many data mining algorithms try to minimize the influence of outliers or eliminate them all together. This,
however, could result in the loss of important hidden information, since one person's noise could be another person's signal. In other words, the outliers themselves may be of particular interest, as in the case of fraud detection, where outliers may indicate fraudulent activity. Thus, outlier detection and analysis is an interesting data mining task, referred to as outlier mining.

Outlier mining has wide applications. As mentioned above, it can be used in fraud detection, for example, by detecting unusual usage of credit cards or telecommunication services. In addition, it is useful in customized marketing for identifying the spending behavior of customers with extremely low or extremely high incomes, or in medical analysis for finding unusual responses to various medical treatments.

Outlier mining can be described as follows: Given a set of n data points or objects and k, the expected number of outliers, find the top k objects that are considerably dissimilar, exceptional, or inconsistent with respect to the remaining data. The outlier mining problem can be viewed as two subproblems: (1) define what data can be considered inconsistent in a given data set, and (2) find an efficient method to mine the outliers so defined.

The problem of defining outliers is nontrivial. If a regression model is used for data modeling, analysis of the residuals can give a good estimation of data "extremeness." The task becomes tricky, however, when finding outliers in time-series data, as they may be hidden in trend, seasonal, or other cyclic changes. When multidimensional data are analyzed, not any particular one, but rather a combination, of dimension values may be extreme. For nonnumeric (i.e., categorical) data, the definition of outliers requires special consideration.

"What about using data visualization methods for outlier detection?" This may seem like an obvious choice, since human eyes are very fast and effective at noticing data inconsistencies.
However, this does not apply to data containing cyclic plots, where values that appear to be outliers could be perfectly valid values in reality. Data visualization methods are also weak in detecting outliers in data with many categorical attributes or in data of high dimensionality, since human eyes are good at visualizing numeric data of only two to three dimensions.

In this section, we instead examine computer-based methods for outlier detection. These can be categorized into four approaches: the statistical approach, the distance-based approach, the density-based local outlier approach, and the deviation-based approach, each of which is studied here. Notice that while clustering algorithms discard outliers as noise, they can be modified to include outlier detection as a by-product of their execution. In general, users must check that each outlier discovered by these approaches is indeed a "real" outlier.
7.11.1 Statistical Distribution-Based Outlier Detection
The statistical distribution-based approach to outlier detection assumes a distribution or probability model for the given data set (e.g., a normal or Poisson distribution) and then identifies outliers with respect to the model using a discordancy test. Application of the test requires knowledge of the data set parameters (such as the assumed data distribution), knowledge of distribution parameters (such as the mean and variance), and the expected number of outliers.

"How does discordancy testing work?" A statistical discordancy test examines two hypotheses: a working hypothesis and an alternative hypothesis. A working hypothesis, H, is a statement that the entire data set of n objects comes from an initial distribution model, F, that is,

H : o_i \in F, \quad \text{where } i = 1, 2, \ldots, n. \qquad (7.43)
The hypothesis is retained if there is no statistically significant evidence supporting its rejection. A discordancy test verifies whether an object, o_i, is significantly large (or small) in relation to the distribution F. Different test statistics have been proposed for use as a discordancy test, depending on the available knowledge of the data. Assuming that some statistic, T, has been chosen for discordancy testing, and the value of the statistic for object o_i is v_i, then the distribution of T is constructed. The significance probability, SP(v_i) = Prob(T > v_i), is evaluated. If SP(v_i) is sufficiently small, then o_i is discordant and the working hypothesis is rejected. An alternative hypothesis, H̄, which states that o_i comes from another distribution model, G, is adopted. The result is very much dependent on which model F is chosen, because o_i may be an outlier under one model and a perfectly valid value under another.
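As a minimal illustration, suppose the working distribution F is assumed normal and the statistic T is a simple two-sided z-score; the significance probability can then be computed with the complementary error function. This is a sketch under those assumptions, with an illustrative function name; the statistic used in practice depends on what is known about the data:

```python
import math

def discordancy_sp(data, candidate):
    # Working hypothesis: data ~ N(mu, sigma^2), with mu and sigma estimated
    # from the sample. Statistic: T = |x - mu| / sigma (two-sided z-score).
    n = len(data)
    mu = sum(data) / n
    sigma = math.sqrt(sum((x - mu) ** 2 for x in data) / (n - 1))
    t = abs(candidate - mu) / sigma
    # SP(v) = Prob(T > v) for a standard normal, two-sided:
    sp = math.erfc(t / math.sqrt(2))
    return t, sp
```

An object near the sample mean yields SP close to 1 (retain H); an object far in the tail yields a tiny SP, so H is rejected in favor of the alternative hypothesis for that object.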
The alternative distribution is very important in determining the power of the test, that is, the probability that the working hypothesis is rejected when o_i is really an outlier. There are different kinds of alternative distributions.

Inherent alternative distribution: In this case, the working hypothesis that all of the objects come from distribution F is rejected in favor of the alternative hypothesis that all of the objects arise from another distribution, G:

\bar{H} : o_i \in G, \quad \text{where } i = 1, 2, \ldots, n. \qquad (7.44)
F and G may be different distributions, or they may differ only in the parameters of the same distribution. There are constraints on the form of the G distribution in that it must have the potential to produce outliers. For example, it may have a different mean or dispersion, or a longer tail.

Mixture alternative distribution: The mixture alternative states that discordant values are not outliers in the F population, but rather contaminants from some other population, G. In this case, the alternative hypothesis is

\bar{H} : o_i \in (1 - \lambda)F + \lambda G, \quad \text{where } i = 1, 2, \ldots, n. \qquad (7.45)
Slippage alternative distribution: This alternative states that all of the objects (apart from some prescribed small number) arise independently from the initial model, F, with its given parameters, while the remaining objects are independent observations from a modified version of F in which the parameters have been shifted.

There are two basic types of procedures for detecting outliers:

Block procedures: In this case, either all of the suspect objects are treated as outliers, or all of them are accepted as consistent.

Consecutive (or sequential) procedures: An example of such a procedure is the inside-out procedure. Its main idea is that the object that is least "likely" to be an outlier is tested first. If it is found to be an outlier, then all of the more extreme values are also considered outliers; otherwise, the next most extreme object is tested, and so on. This...