Contents

7 Cluster Analysis
    7.1 What Is Cluster Analysis?
    7.2 Types of Data in Cluster Analysis
        7.2.1 Interval-Scaled Variables
        7.2.2 Binary Variables
        7.2.3 Categorical, Ordinal, and Ratio-Scaled Variables
        7.2.4 Variables of Mixed Types
        7.2.5 Vector Objects
    7.3 A Categorization of Major Clustering Methods
    7.4 Partitioning Methods
        7.4.1 Classical Partitioning Methods: k-Means and k-Medoids
        7.4.2 Partitioning Methods in Large Databases: From k-Medoids to CLARANS
    7.5 Hierarchical Methods
        7.5.1 Agglomerative and Divisive Hierarchical Clustering
        7.5.2 BIRCH: Balanced Iterative Reducing and Clustering Using Hierarchies
        7.5.3 ROCK: A Hierarchical Clustering Algorithm for Categorical Attributes
        7.5.4 Chameleon: A Hierarchical Clustering Algorithm Using Dynamic Modeling
    7.6 Density-Based Methods
        7.6.1 DBSCAN: A Density-Based Clustering Method Based on Connected Regions with Sufficiently High Density
        7.6.2 OPTICS: Ordering Points To Identify the Clustering Structure
        7.6.3 DENCLUE: Clustering Based on Density Distribution Functions
    7.7 Grid-Based Methods
        7.7.1 STING: STatistical INformation Grid
        7.7.2 WaveCluster: Clustering Using Wavelet Transformation
    7.8 Model-Based Clustering Methods
        7.8.1 Expectation-Maximization
        7.8.2 Conceptual Clustering
        7.8.3 Neural Network Approach
    7.9 Clustering High-Dimensional Data
        7.9.1 CLIQUE: A Dimension-Growth Subspace Clustering Method
        7.9.2 PROCLUS: A Dimension-Reduction Subspace Clustering Method
        7.9.3 Frequent Pattern-Based Clustering Methods
    7.10 Constraint-Based Cluster Analysis
        7.10.1 Clustering with Obstacle Objects
        7.10.2 User-Constrained Cluster Analysis
        7.10.3 Semi-Supervised Cluster Analysis
    7.11 Outlier Analysis
        7.11.1 Statistical Distribution-Based Outlier Detection
        7.11.2 Distance-Based Outlier Detection
        7.11.3 Density-Based Local Outlier Detection
        7.11.4 Deviation-Based Outlier Detection
    7.12 Summary
    7.13 Exercises
    7.14 Bibliographic Notes

List of Figures

7.1 Euclidean and Manhattan distances between two objects.
7.2 The k-means partitioning algorithm.
7.3 Clustering of a set of objects based on the k-means method. (The mean of each cluster is marked by a "+".)
7.4 Four cases of the cost function for k-medoids clustering.
7.5 PAM, a k-medoids partitioning algorithm.
7.6 Agglomerative and divisive hierarchical clustering on data objects {a, b, c, d, e}.
7.7 Dendrogram representation for hierarchical clustering of data objects {a, b, c, d, e}.
7.8 A CF tree structure.
7.9 Chameleon: Hierarchical clustering based on k-nearest neighbors and dynamic modeling. Based on [KHK99].
7.10 Density reachability and density connectivity in density-based clustering. Based on [EKSX96].
7.11 OPTICS terminology. Based on [ABKS99].
7.12 Cluster ordering in OPTICS. Figure is based on [ABKS99].
7.13 Possible density functions for a 2-D data set. From [HK98].
7.14 Examples of center-defined clusters (top row) and arbitrary-shape clusters (bottom row). From [HK98].
7.15 A hierarchical structure for STING clustering.
7.16 A sample of two-dimensional feature space. From [SCZ98].
7.17 Multiresolution of the feature space in Figure 7.16 at (a) scale 1 (high resolution); (b) scale 2 (medium resolution); (c) scale 3 (low resolution). From [SCZ98].
7.18 Each cluster can be represented by a probability distribution, centered at a mean, and with a standard deviation. Here, we have two clusters, corresponding to the Gaussian distributions $g(m_1, \sigma_1)$ and $g(m_2, \sigma_2)$, respectively, where the circles represent the first standard deviation of the distributions.
7.19 A classification tree. Figure is based on [Fis87].
7.20 The result of SOM clustering of 12,088 Web articles on comp.ai.neural-nets (left), and of drilling down on the keyword: "mining" (right). Based on http://websom.hut.fi/websom/comp.ai.neuralnets-new.
7.21 Dense units found with respect to age for the dimensions salary and vacation are intersected in order to provide a candidate search space for dense units of higher dimensionality.
7.22 Raw data from a fragment of microarray data containing only 3 objects and 10 attributes.
7.23 Objects in Figure 7.22 form (a) a shift pattern in subspace {b, c, h, j, e}, and (b) a scaling pattern in subspace {f, d, a, g, i}.
7.24 Clustering with obstacle objects ($o_1$ and $o_2$): (a) a visibility graph, and (b) triangulation of regions with microclusters. From [THH01].
7.25 Clustering results obtained without and with consideration of obstacles (where rivers and inaccessible highways or city blocks are represented by polygons): (a) clustering without considering obstacles, and (b) clustering with obstacles.
7.26 Clustering through decision tree construction: (a) the set of data points to be clustered, viewed as a set of "Y" points, (b) the addition of a set of uniformly distributed "N" points, represented by "", and (c) the clustering result with "Y" points only.
7.27 The necessity of density-based local outlier analysis. From [BKNS00].

List of Tables

7.1 A contingency table for binary variables.
7.2 A relational table where patients are described by binary attributes.
7.3 A sample data table containing variables of mixed type.

Chapter 7
Cluster Analysis

Imagine that you are given a set of data objects for analysis where, unlike in classification, the class label of each object is not known. This is quite common in large databases because assigning class labels to a large number of objects can be a very costly process. Clustering is the process of grouping the data into classes or clusters so that objects within a cluster have high similarity in comparison to one another, but are very dissimilar to objects in other clusters. Dissimilarities are assessed based on the attribute values describing the objects. Often, distance measures are used.
Clustering has its roots in many areas, including data mining, statistics, biology, and machine learning.

In this chapter, we study the requirements of clustering methods for large amounts of data. We explain how to compute dissimilarities between objects represented by various attribute or variable types. We examine several clustering techniques, organized into the following categories: partitioning methods, hierarchical methods, density-based methods, grid-based methods, model-based methods, methods for high-dimensional data (such as frequent pattern-based methods), and constraint-based clustering. Clustering can also be used for outlier detection, which forms the final topic of this chapter.

7.1 What Is Cluster Analysis?

The process of grouping a set of physical or abstract objects into classes of similar objects is called clustering. A cluster is a collection of data objects that are similar to one another within the same cluster and are dissimilar to the objects in other clusters. A cluster of data objects can be treated collectively as one group and so may be considered as a form of data compression. Although classification is an effective means for distinguishing groups or classes of objects, it requires the often costly collection and labeling of a large set of training tuples or patterns, which the classifier uses to model each group. It is often more desirable to proceed in the reverse direction: first partition the set of data into groups based on data similarity (e.g., using clustering), and then assign labels to the relatively small number of groups. An additional advantage of such a clustering-based process is that it is adaptable to changes and helps single out useful features that distinguish different groups.

Cluster analysis is an important human activity. Early in childhood, one learns how to distinguish between cats and dogs, or between animals and plants, by continuously improving subconscious clustering schemes.
By automated clustering, we can identify dense and sparse regions in object space and, therefore, discover overall distribution patterns and interesting correlations among data attributes. Cluster analysis has been widely used in numerous applications, including market research, pattern recognition, data analysis, and image processing. In business, clustering can help marketers discover distinct groups in their customer bases and characterize customer groups based on purchasing patterns. In biology, it can be used to derive plant and animal taxonomies, categorize genes with similar functionality, and gain insight into structures inherent in populations. Clustering may also help in the identification of areas of similar land use in an earth observation database, and in the identification of groups of houses in a city according to house type, value, and geographical location, as well as the identification of groups of automobile insurance policy holders with a high average claim cost. It can also be used to help classify documents on the Web for information discovery.

Clustering is also called data segmentation in some applications because clustering partitions large data sets into groups according to their similarity. Clustering can also be used for outlier detection, where outliers (values that are "far away" from any cluster) may be more interesting than common cases. Applications of outlier detection include the detection of credit card fraud and the monitoring of criminal activities in electronic commerce. For example, exceptional cases in credit card transactions, such as very expensive and frequent purchases, may be of interest as possible fraudulent activity.

As a data mining function, cluster analysis can be used as a standalone tool to gain insight into the distribution of data, to observe the characteristics of each cluster, and to focus on a particular set of clusters for further analysis.
Alternatively, it may serve as a preprocessing step for other algorithms, such as characterization, attribute subset selection, and classification, which would then operate on the detected clusters and the selected attributes or features.

Data clustering is under vigorous development. Contributing areas of research include data mining, statistics, machine learning, spatial database technology, biology, and marketing. Owing to the huge amounts of data collected in databases, cluster analysis has recently become a highly active topic in data mining research.

As a branch of statistics, cluster analysis has been extensively studied for many years, focusing mainly on distance-based cluster analysis. Cluster analysis tools based on k-means, k-medoids, and several other methods have also been built into many statistical analysis software packages or systems, such as S-Plus, SPSS, and SAS. In machine learning, clustering is an example of unsupervised learning. Unlike classification, clustering and unsupervised learning do not rely on predefined classes and class-labeled training examples. For this reason, clustering is a form of learning by observation, rather than learning by examples. In data mining, efforts have focused on finding methods for efficient and effective cluster analysis in large databases. Active themes of research focus on the scalability of clustering methods, the effectiveness of methods for clustering complex shapes and types of data, high-dimensional clustering techniques, and methods for clustering mixed numerical and categorical data in large databases.

Clustering is a challenging field of research in which potential applications pose their own special requirements. The following are typical requirements of clustering in data mining:

Scalability: Many clustering algorithms work well on small data sets containing fewer than several hundred data objects; however, a large database may contain millions of objects. Clustering on a sample of a given large data set may lead to biased results. Highly scalable clustering algorithms are needed.

Ability to deal with different types of attributes: Many algorithms are designed to cluster interval-based (numerical) data. However, applications may require clustering other types of data, such as binary, categorical (nominal), and ordinal data, or mixtures of these data types.

Discovery of clusters with arbitrary shape: Many clustering algorithms determine clusters based on Euclidean or Manhattan distance measures. Algorithms based on such distance measures tend to find spherical clusters with similar size and density. However, a cluster could be of any shape. It is important to develop algorithms that can detect clusters of arbitrary shape.

Minimal requirements for domain knowledge to determine input parameters: Many clustering algorithms require users to input certain parameters in cluster analysis (such as the number of desired clusters). The clustering results can be quite sensitive to input parameters. Parameters are often hard to determine, especially for data sets containing high-dimensional objects. This not only burdens users, but also makes the quality of clustering difficult to control.

Ability to deal with noisy data: Most real-world databases contain outliers or missing, unknown, or erroneous data. Some clustering algorithms are sensitive to such data and may lead to clusters of poor quality.

Incremental clustering and insensitivity to the order of input records: Some clustering algorithms cannot incorporate newly inserted data (i.e., database updates) into existing clustering structures and, instead, must determine a new clustering from scratch. Some clustering algorithms are sensitive to the order of input data. That is, given a set of data objects, such an algorithm may return dramatically different clusterings depending on the order of presentation of the input objects. It is important to develop incremental clustering algorithms and algorithms that are insensitive to the order of input.

High dimensionality: A database or a data warehouse can contain several dimensions or attributes. Many clustering algorithms are good at handling low-dimensional data, involving only two to three dimensions. Human eyes are good at judging the quality of clustering for up to three dimensions. Finding clusters of data objects in high-dimensional space is challenging, especially considering that such data can be very sparse and highly skewed.

Constraint-based clustering: Real-world applications may need to perform clustering under various kinds of constraints. Suppose that your job is to choose the locations for a given number of new automatic banking machines (i.e., ATMs) in a city. To decide upon this, you may cluster households while considering constraints such as the city's rivers and highway networks, and the type and number of customers per cluster. A challenging task is to find groups of data with good clustering behavior that satisfy specified constraints.

Interpretability and usability: Users expect clustering results to be interpretable, comprehensible, and usable. That is, clustering may need to be tied in with specific semantic interpretations and applications. It is important to study how an application goal may influence the selection of clustering features and clustering methods.

With these requirements in mind, our study of cluster analysis proceeds as follows. First, we study different types of data and how they can influence clustering methods. Second, we present a general categorization of clustering methods. We then study each clustering method in detail, including partitioning methods, hierarchical methods, density-based methods, grid-based methods, and model-based methods. We also examine clustering in high-dimensional space, constraint-based clustering, and outlier analysis.
7.2 Types of Data in Cluster Analysis

In this section, we study the types of data that often occur in cluster analysis and how to preprocess them for such an analysis. Suppose that a data set to be clustered contains $n$ objects, which may represent persons, houses, documents, countries, and so on. Main memory-based clustering algorithms typically operate on either of the following two data structures.

Data matrix (or object-by-variable structure): This represents $n$ objects, such as persons, with $p$ variables (also called measurements or attributes), such as age, height, weight, gender, and so on. The structure is in the form of a relational table, or $n$-by-$p$ matrix ($n$ objects $\times$ $p$ variables):

$$
\begin{bmatrix}
x_{11} & \cdots & x_{1f} & \cdots & x_{1p} \\
\vdots &        & \vdots &        & \vdots \\
x_{i1} & \cdots & x_{if} & \cdots & x_{ip} \\
\vdots &        & \vdots &        & \vdots \\
x_{n1} & \cdots & x_{nf} & \cdots & x_{np}
\end{bmatrix}
\tag{7.1}
$$

Dissimilarity matrix (or object-by-object structure): This stores a collection of proximities that are available for all pairs of $n$ objects. It is often represented by an $n$-by-$n$ table:

$$
\begin{bmatrix}
0      &        &        &        &   \\
d(2,1) & 0      &        &        &   \\
d(3,1) & d(3,2) & 0      &        &   \\
\vdots & \vdots & \vdots & \ddots &   \\
d(n,1) & d(n,2) & \cdots & \cdots & 0
\end{bmatrix}
\tag{7.2}
$$

where $d(i, j)$ is the measured difference or dissimilarity between objects $i$ and $j$. In general, $d(i, j)$ is a nonnegative number that is close to 0 when objects $i$ and $j$ are highly similar or "near" each other, and becomes larger the more they differ. Since $d(i, j) = d(j, i)$ and $d(i, i) = 0$, we have the matrix in (7.2). Measures of dissimilarity are discussed throughout this section.

The rows and columns of the data matrix represent different entities, while those of the dissimilarity matrix represent the same entity. Thus, the data matrix is often called a two-mode matrix, whereas the dissimilarity matrix is called a one-mode matrix. Many clustering algorithms operate on a dissimilarity matrix. If the data are presented in the form of a data matrix, they can first be transformed into a dissimilarity matrix before applying such clustering algorithms.
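The transformation from a data matrix to a dissimilarity matrix can be sketched as follows. This is a minimal illustration, not a definitive implementation: the names `euclidean` and `dissimilarity_matrix` and the three sample rows are assumptions made for the example, and Euclidean distance is only one possible choice of $d(i, j)$. Indices are 0-based here, whereas the text counts objects from 1.

```python
import math
from typing import Callable, List, Sequence

Vector = Sequence[float]


def euclidean(a: Vector, b: Vector) -> float:
    """Straight-line distance between two equal-length numeric vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def dissimilarity_matrix(
    data: Sequence[Vector],
    dist: Callable[[Vector, Vector], float] = euclidean,
) -> List[List[float]]:
    """Build the lower triangle of the n-by-n dissimilarity matrix.

    Row i holds d(i, 0), ..., d(i, i); the upper triangle is omitted
    because d(i, j) = d(j, i).
    """
    return [[dist(data[i], data[j]) for j in range(i + 1)] for i in range(len(data))]


# Hypothetical data matrix: n = 3 objects described by p = 2 variables.
data_matrix = [(1.0, 2.0), (3.0, 5.0), (1.0, 2.0)]
D = dissimilarity_matrix(data_matrix)

for row in D:
    print(["%.2f" % d for d in row])
```

Storing only the lower triangle mirrors the layout of (7.2) and halves the memory needed; any other dissimilarity function with the same two-argument signature can be passed as `dist`.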
In this section, we discuss how object dissimilarity can be computed for objects described by interval-scaled variables; by binary variables; by categorical, ordinal, and ratio-scaled variables; or combinations of these variable types. Nonmetric similarity between complex objects (such as documents) is also described. The dissimilarity data can later be used to compute clusters of objects.

7.2.1 Interval-Scaled Variables

This section discusses interval-scaled variables and their standardization. It then describes distance measures that are commonly used for computing the dissimilarity of objects described by such variables. These measures include the Euclidean, Manhattan, and Minkowski distances.

"What are interval-scaled variables?" Interval-scaled variables are continuous measurements of a roughly linear scale. Typical examples include weight and height, latitude and longitude coordinates (e.g., when clustering houses), and weather temperature.

The measurement unit used can affect the clustering analysis. For example, changing measurement units from meters to inches for height, or from kilograms to pounds for weight, may lead to a very different clustering structure. In general, expressing a variable in smaller units will lead to a larger range for that variable, and thus a larger effect on the resulting clustering structure. To help avoid dependence on the choice of measurement units, the data should be standardized. Standardizing measurements attempts to give all variables an equal weight. This is particularly useful when given no prior knowledge of the data. However, in some applications, users may intentionally want to give more weight to a certain set of variables than to others. For example, when clustering basketball player candidates, we may prefer to give more weight to the variable height.

"How can the data for a variable be standardized?" To standardize measurements, one choice is to convert the original measurements to unitless variables.
Given measurements for a variable $f$, this can be performed as follows.

1. Calculate the mean absolute deviation, $s_f$:

$$
s_f = \frac{1}{n}\left(|x_{1f} - m_f| + |x_{2f} - m_f| + \cdots + |x_{nf} - m_f|\right),
\tag{7.3}
$$

where $x_{1f}, \ldots, x_{nf}$ are $n$ measurements of $f$, and $m_f$ is the mean value of $f$, that is, $m_f = \frac{1}{n}(x_{1f} + x_{2f} + \cdots + x_{nf})$.

2. Calculate the standardized measurement, or z-score:

$$
z_{if} = \frac{x_{if} - m_f}{s_f}.
\tag{7.4}
$$

The mean absolute deviation, $s_f$, is more robust to outliers than the standard deviation, $\sigma_f$. When computing the mean absolute deviation, the deviations from the mean (i.e., $|x_{if} - m_f|$) are not squared; hence, the effect of outliers is somewhat reduced. There are more robust measures of dispersion, such as the median absolute deviation. However, the advantage of using the mean absolute deviation is that the z-scores of outliers do not become too small; hence, the outliers remain detectable.

Standardization may or may not be useful in a particular application. Thus the choice of whether and how to perform standardization should be left to the user. Methods of standardization are also discussed in Chapter 2 under normalization techniques for data preprocessing.

After standardization, or without standardization in certain applications, the dissimilarity (or similarity) between the objects described by interval-scaled variables is typically computed based on the distance between each pair of objects. The most popular distance measure is Euclidean distance, which is defined as

$$
d(i, j) = \sqrt{(x_{i1} - x_{j1})^2 + (x_{i2} - x_{j2})^2 + \cdots + (x_{in} - x_{jn})^2},
\tag{7.5}
$$

where $i = (x_{i1}, x_{i2}, \ldots, x_{in})$ and $j = (x_{j1}, x_{j2}, \ldots, x_{jn})$ are two $n$-dimensional data objects. Another well-known metric is Manhattan (or city block) distance, defined as

$$
d(i, j) = |x_{i1} - x_{j1}| + |x_{i2} - x_{j2}| + \cdots + |x_{in} - x_{jn}|.
\tag{7.6}
$$

Both the Euclidean distance and Manhattan distance satisfy the following mathematical requirements of a distance function:

1. $d(i, j) \geq 0$: Distance is a nonnegative number.
2. $d(i, i) = 0$: The distance of an object to itself is 0.
3. $d(i, j) = d(j, i)$: Distance is a symmetric function.
4. $d(i, j) \leq d(i, h) + d(h, j)$: Going directly from object $i$ to object $j$ in space is no more than making a detour over any other object $h$ (triangular inequality).

Figure 7.1: Euclidean and Manhattan distances between two objects. (The plot shows $x_1 = (1, 2)$ and $x_2 = (3, 5)$, with Euclidean distance $= (2^2 + 3^2)^{1/2} = 3.61$ and Manhattan distance $= 2 + 3 = 5$.)

Example 7.1 Euclidean distance and Manhattan distance. Let $x_1 = (1, 2)$ and $x_2 = (3, 5)$ represent two objects as in Figure 7.1. The Euclidean distance between the two is $\sqrt{2^2 + 3^2} = 3.61$. The Manhattan distance between the two is $2 + 3 = 5$.

Minkowski distance is a generalization of both Euclidean distance and Manhattan distance. It is defined as

$$
d(i, j) = \left(|x_{i1} - x_{j1}|^p + |x_{i2} - x_{j2}|^p + \cdots + |x_{in} - x_{jn}|^p\right)^{1/p},
\tag{7.7}
$$

where $p$ is a positive integer. Such a distance is also called the $L_p$ norm in some literature. It represents the Manhattan distance when $p = 1$ (i.e., $L_1$ norm), and Euclidean distance when $p = 2$ (i.e., $L_2$ norm).

If each variable is assigned a weight according to its perceived importance, the weighted Euclidean distance can be computed as

$$
d(i, j) = \sqrt{w_1 |x_{i1} - x_{j1}|^2 + w_2 |x_{i2} - x_{j2}|^2 + \cdots + w_n |x_{in} - x_{jn}|^2}.
\tag{7.8}
$$

Weighting can also be applied to the Manhattan and Minkowski distances.

7.2.2 Binary Variables

Let us see how to compute the dissimilarity between objects described by either symmetric or asymmetric binary variables.

A binary variable has only two states: 0 or 1, where 0 means that the variable is absent, and 1 means that it is present. Given the variable smoker describing a patient, for instance, 1 indicates that the patient smokes, while 0 indicates that the patient does not. Treating binary variables as if they are interval-scaled can lead to misleading clustering results. Therefore, methods specific to binary data are necessary for computing dissimilarities.
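Before turning to the binary measures, the interval-scaled computations of Section 7.2.1 can be sketched in a few lines. This is a hedged illustration under stated assumptions: the function names `standardize` and `minkowski` and the sample values are invented for the example, and the weighted form used for general $p$ is one natural reading of "weighting can also be applied", since the text spells out only the weighted Euclidean case (7.8).

```python
from typing import List, Optional, Sequence


def standardize(values: Sequence[float]) -> List[float]:
    """z-scores based on the mean absolute deviation (Equations 7.3 and 7.4)."""
    n = len(values)
    m_f = sum(values) / n                        # mean value of the variable
    s_f = sum(abs(x - m_f) for x in values) / n  # mean absolute deviation
    if s_f == 0:
        raise ValueError("all measurements are equal; z-scores are undefined")
    return [(x - m_f) / s_f for x in values]


def minkowski(
    a: Sequence[float],
    b: Sequence[float],
    p: int = 2,
    weights: Optional[Sequence[float]] = None,
) -> float:
    """Minkowski distance (7.7): p = 1 is Manhattan, p = 2 is Euclidean.

    Optional per-variable weights give (7.8) when p = 2; applying the
    same weighting for other p is an assumption of this sketch.
    """
    if p < 1:
        raise ValueError("p must be a positive integer")
    if weights is None:
        weights = [1.0] * len(a)
    total = sum(w * abs(x - y) ** p for w, x, y in zip(weights, a, b))
    return total ** (1.0 / p)


# The two objects of Example 7.1.
x1, x2 = (1.0, 2.0), (3.0, 5.0)
print(round(minkowski(x1, x2, p=2), 2))  # Euclidean, 3.61 as in Example 7.1
print(round(minkowski(x1, x2, p=1), 2))  # Manhattan, 5.0 as in Example 7.1

# Hypothetical measurements of one variable, with 10.0 as an outlier.
z = standardize([1.0, 2.0, 3.0, 4.0, 10.0])
print([round(v, 2) for v in z])
```

In the last call the outlier keeps the largest z-score in absolute value, which is the sense in which outliers "remain detectable" after this kind of standardization.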
"So, how can we compute the dissimilarity between two binary variables?" One approach involves computing a dissimilarity matrix from the given binary data. If all binary variables are thought of as having the same weight, we have the 2-by-2 contingency table of Table 7.1, where q is the number of variables that equal 1 for both objects i and j, r is the number of variables that equal 1 for object i but that are 0 for object j, s is the number of variables that equal 0 for object i but equal 1 for object j, and t is the number of variables that equal 0 for both objects i and j. The total number of variables is p, where p = q + r + s + t. object j 1 0 q r s t q+s r+t sum q+r s+t p object i 1 0 sum Table 7.1: A contingency table for binary variables. "What is the difference between symmetric and asymmetric binary variables?" A binary variable is symmetric if both of its states are equally valuable and carry the same weight; that is, there is no preference on which outcome should be coded as 0 or 1. One such example could be the attribute gender having the states male and female. Dissimilarity that is based on symmetric binary variables is called symmetric binary dissimilarity. Its dissimilarity (or distance) measure, defined in Equation (7.9), can be used to assess the dissimilarity between objects i and j. d(i, j) = r+s . q+r+s+t (7.9) A binary variable is asymmetric if the outcomes of the states are not equally important, such as the positive and negative outcomes of a disease test. By convention, we shall code the most important outcome, which is usually the rarest one, by 1 (e.g., HIV positive), and the other by 0 (e.g., HIV negative). Given two asymmetric binary variables, the agreement of two 1s (a positive match) is then considered more significant than that of two 0s (a negative match). Therefore, such binary variables are often considered "monary" (as if having one state). 
The dissimilarity based on such variables is called asymmetric binary dissimilarity, where the number of negative matches, $t$, is considered unimportant and thus is ignored in the computation, as shown in Equation (7.10).

$$
d(i, j) = \frac{r + s}{q + r + s}.
\tag{7.10}
$$

Complementarily, one can measure the distance between two binary variables based on the notion of similarity instead of dissimilarity. For example, the asymmetric binary similarity between the objects $i$ and $j$, or $sim(i, j)$, can be computed as

$$
sim(i, j) = \frac{q}{q + r + s} = 1 - d(i, j).
\tag{7.11}
$$

The coefficient $sim(i, j)$ is called the Jaccard coefficient, which is popularly referenced in the literature.

When both symmetric and asymmetric binary variables occur in the same data set, the mixed variables approach described in Section 7.2.4 can be applied.

Example 7.2 Dissimilarity between binary variables. Suppose that a patient record table (Table 7.2) contains the attributes name, gender, fever, cough, test-1, test-2, test-3, and test-4, where name is an object identifier, gender is a symmetric attribute, and the remaining attributes are asymmetric binary.

    name   gender   fever   cough   test-1   test-2   test-3   test-4
    Jack   M        Y       N       P        N        N        N
    Mary   F        Y       N       P        N        P        N
    Jim    M        Y       Y       N        N        N        N
    ...    ...      ...     ...     ...      ...      ...      ...

Table 7.2: A relational table where patients are described by binary attributes.

For asymmetric attribute values, let the values Y (yes) and P (positive) be set to 1, and the value N (no or negative) be set to 0. Suppose that the distance between objects (patients) is computed based only on the asymmetric variables. According to Equation (7.10), the distance between each pair of the three patients, Jack, Mary, and Jim, is

$$
\begin{aligned}
d(\textit{jack}, \textit{mary}) &= \frac{0 + 1}{2 + 0 + 1} = 0.33 \\
d(\textit{jack}, \textit{jim})  &= \frac{1 + 1}{1 + 1 + 1} = 0.67 \\
d(\textit{mary}, \textit{jim})  &= \frac{1 + 2}{1 + 1 + 2} = 0.75
\end{aligned}
$$

These measurements suggest that Mary and Jim are unlikely to have a similar disease since they have the highest dissimilarity value among the three pairs.
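The three distances of Example 7.2 can be checked with a short sketch that counts $q$, $r$, $s$, and $t$ and then applies Equations (7.9) to (7.11). The function names are illustrative assumptions, as is the convention of returning 0.0 when two objects have no 1s at all, a case the text does not address.

```python
from typing import Sequence, Tuple


def contingency(i: Sequence[int], j: Sequence[int]) -> Tuple[int, int, int, int]:
    """Return (q, r, s, t) as defined for the 2-by-2 contingency table."""
    q = sum(1 for a, b in zip(i, j) if a == 1 and b == 1)
    r = sum(1 for a, b in zip(i, j) if a == 1 and b == 0)
    s = sum(1 for a, b in zip(i, j) if a == 0 and b == 1)
    t = sum(1 for a, b in zip(i, j) if a == 0 and b == 0)
    return q, r, s, t


def symmetric_dissimilarity(i: Sequence[int], j: Sequence[int]) -> float:
    """Equation (7.9): negative matches count as agreement."""
    q, r, s, t = contingency(i, j)
    return (r + s) / (q + r + s + t)


def asymmetric_dissimilarity(i: Sequence[int], j: Sequence[int]) -> float:
    """Equation (7.10): negative matches (t) are ignored."""
    q, r, s, _ = contingency(i, j)
    if q + r + s == 0:
        return 0.0  # assumed convention for two all-zero objects, not from the text
    return (r + s) / (q + r + s)


def jaccard(i: Sequence[int], j: Sequence[int]) -> float:
    """Equation (7.11): asymmetric binary similarity."""
    return 1.0 - asymmetric_dissimilarity(i, j)


# Asymmetric attributes of Table 7.2 in the order
# fever, cough, test-1, test-2, test-3, test-4 (Y/P -> 1, N -> 0).
jack = (1, 0, 1, 0, 0, 0)
mary = (1, 0, 1, 0, 1, 0)
jim = (1, 1, 0, 0, 0, 0)

print(round(asymmetric_dissimilarity(jack, mary), 2))  # 0.33
print(round(asymmetric_dissimilarity(jack, jim), 2))   # 0.67
print(round(asymmetric_dissimilarity(mary, jim), 2))   # 0.75
```

Gender is left out of the tuples on purpose: it is the only symmetric attribute in the table, and the example computes distances on the asymmetric variables only.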
Of the three patients, Jack and Mary are the most likely to have a similar disease.

7.2.3 Categorical, Ordinal, and Ratio-Scaled Variables

"How can we compute the dissimilarity between objects described by categorical, ordinal, and ratio-scaled variables?"

Categorical Variables

A categorical variable is a generalization of the binary variable in that it can take on more than two states. For example, map color is a categorical variable that may have, say, five states: red, yellow, green, pink, and blue.

Let the number of states of a categorical variable be M. The states can be denoted by letters, symbols, or a set of integers, such as 1, 2, ..., M. Notice that such integers are used just for data handling and do not represent any specific ordering.

"How is dissimilarity computed between objects described by categorical variables?" The dissimilarity between two objects i and j can be computed based on the ratio of mismatches:

\[ d(i, j) = \frac{p - m}{p}, \tag{7.12} \]

where m is the number of matches (i.e., the number of variables for which i and j are in the same state), and p is the total number of variables. Weights can be assigned to increase the effect of m or to assign greater weight to the matches in variables having a larger number of states.

object identifier    test-1 (categorical)    test-2 (ordinal)    test-3 (ratio-scaled)
1                    code-A                  excellent           445
2                    code-B                  fair                22
3                    code-C                  good                164
4                    code-A                  excellent           1,210

Table 7.3: A sample data table containing variables of mixed type.

Example 7.3 Dissimilarity between categorical variables. Suppose that we have the sample data of Table 7.3, except that only the object-identifier and the variable (or attribute) test-1 are available, where test-1 is categorical. (We will use test-2 and test-3 in later examples.)
Let's compute the dissimilarity matrix (7.2), that is,

\[
\begin{bmatrix}
0 & & & \\
d(2, 1) & 0 & & \\
d(3, 1) & d(3, 2) & 0 & \\
d(4, 1) & d(4, 2) & d(4, 3) & 0
\end{bmatrix}
\]

Since here we have one categorical variable, test-1, we set p = 1 in Equation (7.12) so that d(i, j) evaluates to 0 if objects i and j match, and 1 if the objects differ. Thus, we get

\[
\begin{bmatrix}
0 & & & \\
1 & 0 & & \\
1 & 1 & 0 & \\
0 & 1 & 1 & 0
\end{bmatrix}
\]

Categorical variables can be encoded by asymmetric binary variables by creating a new binary variable for each of the M states. For an object with a given state value, the binary variable representing that state is set to 1, while the remaining binary variables are set to 0. For example, to encode the categorical variable map color, a binary variable can be created for each of the five colors listed above. For an object having the color yellow, the yellow variable is set to 1, while the remaining four variables are set to 0. The dissimilarity coefficient for this form of encoding can be calculated using the methods discussed in Section 7.2.2.

Ordinal Variables

A discrete ordinal variable resembles a categorical variable, except that the M states of the ordinal value are ordered in a meaningful sequence. Ordinal variables are very useful for registering subjective assessments of qualities that cannot be measured objectively. For example, professional ranks are often enumerated in a sequential order, such as assistant, associate, and full for professors. A continuous ordinal variable looks like a set of continuous data of an unknown scale; that is, the relative ordering of the values is essential but their actual magnitude is not. For example, the relative ranking in a particular sport (e.g., gold, silver, bronze) is often more essential than the actual values of a particular measure. Ordinal variables may also be obtained from the discretization of interval-scaled quantities by splitting the value range into a finite number of classes. The values of an ordinal variable can be mapped to ranks. For example, suppose that an ordinal variable f has $M_f$ states.
These ordered states define the ranking $1, \ldots, M_f$.

"How are ordinal variables handled?" The treatment of ordinal variables is quite similar to that of interval-scaled variables when computing the dissimilarity between objects. Suppose that f is a variable from a set of ordinal variables describing n objects. The dissimilarity computation with respect to f involves the following steps:

1. The value of f for the ith object is $x_{if}$, and f has $M_f$ ordered states, representing the ranking $1, \ldots, M_f$. Replace each $x_{if}$ by its corresponding rank, $r_{if} \in \{1, \ldots, M_f\}$.

2. Since each ordinal variable can have a different number of states, it is often necessary to map the range of each variable onto [0.0,1.0] so that each variable has equal weight. This can be achieved by replacing the rank $r_{if}$ of the ith object in the fth variable by

\[ z_{if} = \frac{r_{if} - 1}{M_f - 1}. \tag{7.13} \]

3. Dissimilarity can then be computed using any of the distance measures described in Section 7.2.1 for interval-scaled variables, using $z_{if}$ to represent the f value for the ith object.

Example 7.4 Dissimilarity between ordinal variables. Suppose that we have the sample data of Table 7.3, except that this time only the object-identifier and the continuous ordinal variable, test-2, are available. There are three states for test-2, namely fair, good, and excellent, that is $M_f = 3$. For step 1, if we replace each value for test-2 by its rank, the four objects are assigned the ranks 3, 1, 2, and 3, respectively. Step 2 normalizes the ranking by mapping rank 1 to 0.0, rank 2 to 0.5, and rank 3 to 1.0.
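Steps 1 and 2 can be sketched as follows for the test-2 column of Table 7.3. The helper name `normalize_ranks` and its argument layout are hypothetical, and the sketch assumes the variable has at least two states so that the denominator of Equation (7.13) is nonzero.

```python
def normalize_ranks(values, ordered_states):
    """Map ordinal values to ranks 1..M_f, then onto [0.0, 1.0] (Equation 7.13)."""
    # Step 1: rank of each state, following the order given by the caller.
    rank = {state: position + 1 for position, state in enumerate(ordered_states)}
    m_f = len(ordered_states)  # assumption: m_f >= 2
    # Step 2: z = (r - 1) / (M_f - 1)
    return [(rank[value] - 1) / (m_f - 1) for value in values]


test_2 = ["excellent", "fair", "good", "excellent"]  # objects 1 to 4
z = normalize_ranks(test_2, ["fair", "good", "excellent"])
print(z)
```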
For step 3, we can use, say, the Euclidean distance (Equation 7.5), which results in the following dissimilarity matrix:

\[
\begin{bmatrix}
0 & & & \\
1 & 0 & & \\
0.5 & 0.5 & 0 & \\
0 & 1.0 & 0.5 & 0
\end{bmatrix}
\]

Ratio-Scaled Variables

A ratio-scaled variable makes a positive measurement on a nonlinear scale, such as an exponential scale, approximately following the formula

\[ A e^{Bt} \quad \text{or} \quad A e^{-Bt}, \tag{7.14} \]

where A and B are positive constants, and t typically represents time. Common examples include the growth of a bacteria population, or the decay of a radioactive element.

"How can I compute the dissimilarity between objects described by ratio-scaled variables?" There are three methods to handle ratio-scaled variables for computing the dissimilarity between objects.

1. Treat ratio-scaled variables like interval-scaled variables. This, however, is not usually a good choice since it is likely that the scale may be distorted.

2. Apply logarithmic transformation to a ratio-scaled variable f having value $x_{if}$ for object i by using the formula $y_{if} = \log(x_{if})$. The $y_{if}$ values can be treated as interval-valued, as described in Section 7.2.1. Notice that for some ratio-scaled variables, log-log or other transformations may be applied, depending on the variable's definition and the application.

3. Treat $x_{if}$ as continuous ordinal data and treat their ranks as interval-valued.

The latter two methods are the most effective, although the choice of method used may be dependent on the given application.

Example 7.5 Dissimilarity between ratio-scaled variables. This time, we have the sample data of Table 7.3, except that only the object-identifier and the ratio-scaled variable, test-3, are available. Let's try a logarithmic transformation. Taking the log of test-3 results in the values 2.65, 1.34, 2.21, and 3.08 for the objects 1 to 4, respectively.
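The transformed values can be reproduced with a brief sketch. The text writes only "log"; base 10 is an assumption here, chosen because it matches the four values quoted in the example.

```python
import math

test_3 = [445, 22, 164, 1210]  # objects 1 to 4 of Table 7.3

# Assumption: base-10 logarithm, inferred from the values quoted in the example.
y = [round(math.log10(x), 2) for x in test_3]
print(y)
```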
Using the Euclidean distance (Equation 7.5) on the transformed values, we obtain the following dissimilarity matrix:

\[
\begin{bmatrix}
0 & & & \\
1.31 & 0 & & \\
0.44 & 0.87 & 0 & \\
0.43 & 1.74 & 0.87 & 0
\end{bmatrix}
\]

7.2.4 Variables of Mixed Types

Sections 7.2.1 to 7.2.3 discussed how to compute the dissimilarity between objects described by variables of the same type, where these types may be either interval-scaled, symmetric binary, asymmetric binary, categorical, ordinal, or ratio-scaled. However, in many real databases, objects are described by a mixture of variable types. In general, a database can contain all of the six variable types listed above.

"So, how can we compute the dissimilarity between objects of mixed variable types?" One approach is to group each kind of variable together, performing a separate cluster analysis for each variable type. This is feasible if these analyses derive compatible results. However, in real applications, it is unlikely that a separate cluster analysis per variable type will generate compatible results.

A preferable approach is to process all variable types together, performing a single cluster analysis. One such technique combines the different variables into a single dissimilarity matrix, bringing all of the meaningful variables onto a common scale of the interval [0.0,1.0].

Suppose that the data set contains p variables of mixed type. The dissimilarity d(i, j) between objects i and j is defined as

\[ d(i, j) = \frac{\sum_{f=1}^{p} \delta_{ij}^{(f)} d_{ij}^{(f)}}{\sum_{f=1}^{p} \delta_{ij}^{(f)}}, \tag{7.15} \]

where the indicator $\delta_{ij}^{(f)} = 0$ if either (1) $x_{if}$ or $x_{jf}$ is missing (i.e., there is no measurement of variable f for object i or object j), or (2) $x_{if} = x_{jf} = 0$ and variable f is asymmetric binary; otherwise, $\delta_{ij}^{(f)} = 1$. The contribution of variable f to the dissimilarity between i and j, that is, $d_{ij}^{(f)}$, is computed dependent on its type:

If f is interval-based: $d_{ij}^{(f)} = \frac{|x_{if} - x_{jf}|}{\max_h x_{hf} - \min_h x_{hf}}$, where h runs over all nonmissing objects for variable f.
If f is binary or categorical: $d_{ij}^{(f)} = 0$ if $x_{if} = x_{jf}$; otherwise $d_{ij}^{(f)} = 1$.

If f is ordinal: compute the ranks $r_{if}$ and $z_{if} = \frac{r_{if} - 1}{M_f - 1}$, and treat $z_{if}$ as interval-scaled.

If f is ratio-scaled: either perform logarithmic transformation and treat the transformed data as interval-scaled; or treat f as continuous ordinal data, compute $r_{if}$ and $z_{if}$, and then treat $z_{if}$ as interval-scaled.

The above steps are identical to what we have already seen for each of the individual variable types. The only difference is for interval-based variables, where here we normalize so that the values map to the interval [0.0,1.0]. Thus, the dissimilarity between objects can be computed even when the variables describing the objects are of different types.

Example 7.6 Dissimilarity between variables of mixed type. Let's compute a dissimilarity matrix for the objects of Table 7.3. Now we will consider all of the variables, which are of different types. In Examples 7.3 to 7.5, we worked out the dissimilarity matrices for each of the individual variables. The procedures that we followed for test-1 (which is categorical) and test-2 (which is ordinal) are the same as outlined above for processing variables of mixed types. Therefore, we can use the dissimilarity matrices obtained for test-1 and test-2 later when we compute Equation (7.15). First, however, we need to complete some work for test-3 (which is ratio-scaled). We have already applied a logarithmic transformation to its values. Based on the transformed values of 2.65, 1.34, 2.21, and 3.08 obtained for the objects 1 to 4, respectively, we let $\max_h x_h = 3.08$ and $\min_h x_h = 1.34$. We then normalize the values in the dissimilarity matrix obtained in Example 7.5 by dividing each one by (3.08 - 1.34) = 1.74.
This results in the following dissimilarity matrix for test-3:

\[
\begin{bmatrix}
0 & & & \\
0.75 & 0 & & \\
0.25 & 0.50 & 0 & \\
0.25 & 1.00 & 0.50 & 0
\end{bmatrix}
\]

We can now use the dissimilarity matrices for the three variables in our computation of Equation (7.15). For example, we get $d(2, 1) = \frac{1(1) + 1(1) + 1(0.75)}{3} = 0.92$. In the same way, $d(4, 1) = \frac{1(0) + 1(0) + 1(0.25)}{3} = 0.08$. The resulting dissimilarity matrix obtained for the data described by the three variables of mixed types is:

\[
\begin{bmatrix}
0 & & & \\
0.92 & 0 & & \\
0.58 & 0.67 & 0 & \\
0.08 & 1.00 & 0.67 & 0
\end{bmatrix}
\]

If we go back and look at Table 7.3, we can intuitively guess that objects 1 and 4 are the most similar, based on their values for test-1 and test-2. This is confirmed by the dissimilarity matrix, where d(4, 1) is the lowest value for any pair of different objects. Similarly, the matrix indicates that objects 2 and 4 are the least similar.

7.2.5 Vector objects

In some applications, such as information retrieval, text document clustering, and biological taxonomy, we need to compare and cluster complex objects (such as documents) containing a large number of symbolic entities (such as keywords and phrases). To measure the distance between complex objects, it is often desirable to abandon traditional metric distance computation and introduce a nonmetric similarity function.

There are several ways to define such a similarity function, s(x, y), to compare two vectors x and y. One popular way is to define the similarity function as a cosine measure as follows:

\[ s(x, y) = \frac{x^t \cdot y}{\|x\| \, \|y\|}, \tag{7.16} \]

where $x^t$ is a transposition of vector x, $\|x\|$ is the Euclidean norm of vector x (see footnote 1), $\|y\|$ is the Euclidean norm of vector y, and s is essentially the cosine of the angle between vectors x and y. This value is invariant to rotation and dilation, but it is not invariant to translation and general linear transformation.

When variables are binary-valued (0 or 1), the above similarity function can be interpreted in terms of shared features and attributes.
Suppose an object x possesses the ith attribute if $x_i = 1$. Then $x^t \cdot y$ is the number of attributes possessed by both x and y, and $\|x\| \, \|y\|$ is the geometric mean of the number of attributes possessed by x and the number possessed by y. Thus s(x, y) is a measure of relative possession of common attributes.

Footnote 1: The Euclidean norm of vector $x = (x_1, x_2, \ldots, x_p)$ is defined as $\sqrt{x_1^2 + x_2^2 + \ldots + x_p^2}$. Conceptually, it is the length of the vector.

Example 7.7 Nonmetric similarity between two objects using cosine. Suppose we are given two vectors, x = (1, 1, 0, 0) and y = (0, 1, 1, 0). By Equation (7.16), the similarity between x and y is $s(x, y) = \frac{0 + 1 + 0 + 0}{\sqrt{2} \sqrt{2}} = 0.5$.

A simple variation of the above measure is

\[ s(x, y) = \frac{x^t \cdot y}{x^t \cdot x + y^t \cdot y - x^t \cdot y}, \tag{7.17} \]

which is the ratio of the number of attributes shared by x and y to the number of attributes possessed by x or y. This function, known as the Tanimoto coefficient or Tanimoto distance, is frequently used in information retrieval and biological taxonomy.

Notice that there are many ways to select a particular similarity (or distance) function or normalize the data for cluster analysis. There is no universal standard to guide such selection. The appropriate selection of such measures will be heavily dependent on the given application. One should bear this in mind and refine the selection of such measures to ensure that the clusters generated are meaningful and useful for the application at hand.

7.3 A Categorization of Major Clustering Methods

A large number of clustering algorithms exist in the literature. It is difficult to provide a crisp categorization of clustering methods since these categories may overlap so that a method may have features from several categories. Nevertheless, it is useful to present a relatively organized picture of the different clustering methods. In general, the major clustering methods can be classified into the following categories.
Partitioning methods: Given a database of n objects or data tuples, a partitioning method constructs k partitions of the data, where each partition represents a cluster and $k \le n$. That is, it classifies the data into k groups, which together satisfy the following requirements: (1) each group must contain at least one object, and (2) each object must belong to exactly one group. Notice that the second requirement can be relaxed in some fuzzy partitioning techniques. References to such techniques are given in the bibliographic notes.

Given k, the number of partitions to construct, a partitioning method creates an initial partitioning. It then uses an iterative relocation technique that attempts to improve the partitioning by moving objects from one group to another. The general criterion of a good partitioning is that objects in the same cluster are "close" or related to each other, whereas objects of different clusters are "far apart" or very different. There are various kinds of other criteria for judging the quality of partitions.

To achieve global optimality in partitioning-based clustering would require the exhaustive enumeration of all of the possible partitions. Instead, most applications adopt one of a few popular heuristic methods, such as (1) the k-means algorithm, where each cluster is represented by the mean value of the objects in the cluster, and (2) the k-medoids algorithm, where each cluster is represented by one of the objects located near the center of the cluster. These heuristic clustering methods work well for finding spherical-shaped clusters in small to medium-sized databases. To find clusters with complex shapes and for clustering very large data sets, partitioning-based methods need to be extended. Partitioning-based clustering methods are studied in depth in Section 7.4.

Hierarchical methods: A hierarchical method creates a hierarchical decomposition of the given set of data objects.
A hierarchical method can be classified as being either agglomerative or divisive, based on how the hierarchical decomposition is formed. The agglomerative approach, also called the bottom-up approach, starts with each object forming a separate group. It successively merges the objects or groups close to one another, until all of the groups are merged into one (the topmost level of the hierarchy), or until a termination condition holds. The divisive approach, also called the top-down approach, starts with all the objects in the same cluster. In each successive iteration, a cluster is split up into smaller clusters, until eventually each object forms its own cluster, or until a termination condition holds.

Hierarchical methods suffer from the fact that once a step (merge or split) is done, it can never be undone. This rigidity is useful in that it leads to smaller computation costs by not having to worry about a combinatorial number of different choices. However, such techniques cannot correct erroneous decisions. There are two approaches to improving the quality of hierarchical clustering: (1) perform careful analysis of object "linkages" at each hierarchical partitioning, such as in Chameleon, or (2) integrate hierarchical agglomeration and other approaches by first using a hierarchical agglomerative algorithm to group objects into microclusters, and then performing macroclustering on the microclusters using another clustering method such as iterative relocation, as in BIRCH. Hierarchical clustering methods are studied in Section 7.5.

Density-based methods: Most partitioning methods cluster objects based on the distance between objects. Such methods can find only spherical-shaped clusters and encounter difficulty in discovering clusters of arbitrary shapes. Other clustering methods have been developed based on the notion of density.
Their general idea is to continue growing the given cluster as long as the density (number of objects or data points) in the "neighborhood" exceeds some threshold; that is, for each data point within a given cluster, the neighborhood of a given radius has to contain at least a minimum number of points. Such a method can be used to filter out noise (outliers) and discover clusters of arbitrary shape. DBSCAN and its extension, OPTICS, are typical density-based methods that grow clusters according to a density-based connectivity analysis. DENCLUE is a method that clusters objects based on the analysis of the value distributions of density functions. Density-based clustering methods are studied in Section 7.6. Grid-based methods: Grid-based methods quantize the object space into a finite number of cells that form a grid structure. All of the clustering operations are performed on the grid structure (i.e., on the quantized space). The main advantage of this approach is its fast processing time, which is typically independent of the number of data objects and dependent only on the number of cells in each dimension in the quantized space. STING is a typical example of a grid-based method. WaveCluster applies wavelet transformation for clustering analysis and is both grid-based and density-based. Grid-based clustering methods are studied in Section 7.7. Model-based methods: Model-based methods hypothesize a model for each of the clusters and find the best fit of the data to the given model. A model-based algorithm may locate clusters by constructing a density function that reflects the spatial distribution of the data points. It also leads to a way of automatically determining the number of clusters based on standard statistics, taking "noise" or outliers into account and thus yielding robust clustering methods. EM is an algorithm that performs expectation-maximization analysis based on statistical modeling. 
COBWEB is a conceptual learning algorithm that performs probability analysis and takes concepts as a model for clusters. SOM (or self-organizing feature map) is a neural network-based algorithm that clusters by mapping high-dimensional data into a 2-D or 3-D feature map, which is also useful for data visualization. Model-based clustering methods are studied in Section 7.8.

The choice of clustering algorithm depends both on the type of data available and on the particular purpose of the application. If cluster analysis is used as a descriptive or exploratory tool, it is possible to try several algorithms on the same data to see what the data may disclose.

Some clustering algorithms integrate the ideas of several clustering methods, so that it is sometimes difficult to classify a given algorithm as uniquely belonging to only one clustering method category. Furthermore, some applications may have clustering criteria that require the integration of several clustering techniques.

Aside from the above categories of clustering methods, there are two classes of clustering tasks that require special attention. One is clustering high-dimensional data, and the other is constraint-based clustering.

Clustering high-dimensional data is a particularly important task in cluster analysis because there are many applications that require the analysis of objects containing a large number of features or "dimensions". For example, text documents may contain thousands of terms or keywords as features, and DNA microarray data may provide information on the expression levels of thousands of genes under hundreds of conditions. Clustering high-dimensional data is challenging due to the curse of dimensionality. Many dimensions may not be relevant. As the number of dimensions increases, the data become increasingly sparse, so that the distance measurement between pairs of points becomes meaningless and the average density of points anywhere in the data is likely to be low. Therefore, a different clustering methodology needs to be developed for high-dimensional data. CLIQUE and PROCLUS are two influential subspace clustering methods, which search for clusters in subspaces (or subsets of dimensions) of the data, rather than over the entire data space. Frequent pattern-based clustering is another clustering methodology, which extracts distinct frequent patterns among subsets of dimensions that occur frequently. It uses such patterns to group objects and generate meaningful clusters. pCluster is an example of frequent pattern-based clustering that groups objects based on their pattern similarity. High-dimensional data clustering methods are studied in Section 7.9.

Constraint-based clustering is a clustering approach that performs clustering by incorporation of user-specified or application-oriented constraints. A constraint expresses a user's expectation or describes "properties" of the desired clustering results, and provides an effective means for communicating with the clustering process. Various kinds of constraints can be specified, either by a user or as per application requirements. Our focus of discussion will be on spatial clustering with the existence of obstacles and clustering under user-specified constraints. In addition, semi-supervised clustering is described, which employs, for example, pairwise constraints (such as pairs of instances labeled as belonging to the same or different clusters) in order to improve the quality of the resulting clustering. Constraint-based clustering methods are studied in Section 7.10.

In the following sections, we examine each of the above clustering methods in detail. We also introduce algorithms that integrate the ideas of several clustering methods.
Outlier analysis, which typically involves clustering, is described in Section 7.11. In general, the notation used in the following sections is as follows. Let D be a data set of n objects to be clustered. An object is described by d variables (attributes or dimensions) and therefore may also be referred to as a point in d-dimensional object space. Objects are represented in bold italic font, e.g., p.

7.4 Partitioning Methods

Given D, a data set of n objects, and k, the number of clusters to form, a partitioning algorithm organizes the objects into k partitions ($k \le n$), where each partition represents a cluster. The clusters are formed to optimize an objective partitioning criterion, such as a dissimilarity function based on distance, so that the objects within a cluster are "similar," whereas the objects of different clusters are "dissimilar" in terms of the data set attributes.

7.4.1 Classical Partitioning Methods: k-Means and k-Medoids

The most well-known and commonly used partitioning methods are k-means, k-medoids, and their variations.

Centroid-Based Technique: The k-Means Method

The k-means algorithm takes the input parameter, k, and partitions a set of n objects into k clusters so that the resulting intracluster similarity is high but the intercluster similarity is low. Cluster similarity is measured in regard to the mean value of the objects in a cluster, which can be viewed as the cluster's centroid or center of gravity.

"How does the k-means algorithm work?" The k-means algorithm proceeds as follows. First, it randomly selects k of the objects, each of which initially represents a cluster mean or center. For each of the remaining objects, an object is assigned to the cluster to which it is the most similar, based on the distance between the object and the cluster mean. It then computes the new mean for each cluster. This process iterates until the criterion function converges.
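The procedure just described can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not a reference implementation: objects are tuples of numbers, similarity is squared Euclidean distance, iteration stops when no object changes cluster (used here as the convergence test), and a cluster that becomes empty keeps its previous center. The names `k_means`, `seed`, and `max_iter` are hypothetical.

```python
import random


def squared_distance(p, q):
    """Squared Euclidean distance between two equal-length tuples."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def mean(points):
    """Component-wise mean of a non-empty list of tuples."""
    return tuple(sum(column) / len(points) for column in zip(*points))


def k_means(data, k, seed=0, max_iter=100):
    """Return (centers, assignment) for a list of numeric tuples."""
    rng = random.Random(seed)
    centers = rng.sample(data, k)  # k distinct objects as the initial centers
    assignment = None
    for _ in range(max_iter):
        # Assign every object to the nearest current center.
        new_assignment = [
            min(range(k), key=lambda c: squared_distance(p, centers[c]))
            for p in data
        ]
        if new_assignment == assignment:  # no change: stop
            break
        assignment = new_assignment
        # Recompute each center as the mean of its members.
        for c in range(k):
            members = [p for p, a in zip(data, assignment) if a == c]
            if members:  # assumption: an empty cluster keeps its old center
                centers[c] = mean(members)
    return centers, assignment


points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
centers, labels = k_means(points, k=2)
print(labels)
```

Because the initial centers are chosen at random, different seeds may give different cluster numbering and, on less well-separated data, different final clusters.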
Typically, the square-error criterion is used, defined as

\[ E = \sum_{i=1}^{k} \sum_{p \in C_i} |p - m_i|^2, \tag{7.18} \]

where E is the sum of the square-error for all objects in the data set; p is the point in space representing a given object; and $m_i$ is the mean of cluster $C_i$ (both p and $m_i$ are multidimensional). In other words, for each object in each cluster, the distance from the object to its cluster center is squared, and the distances are summed. This criterion tries to make the resulting k clusters as compact and as separate as possible. The k-means procedure is summarized in Figure 7.2.

Algorithm: k-means. The k-means algorithm for partitioning, where each cluster's center is represented by the mean value of the objects in the cluster.
Input:
    k: the number of clusters,
    D: a data set containing n objects.
Output: A set of k clusters.
Method:
    (1) arbitrarily choose k objects from D as the initial cluster centers;
    (2) repeat
    (3)     (re)assign each object to the cluster to which the object is the most similar, based on the mean value of the objects in the cluster;
    (4)     update the cluster means, i.e., calculate the mean value of the objects for each cluster;
    (5) until no change;

Figure 7.2: The k-means partitioning algorithm.

Example 7.8 Clustering by k-means partitioning. Suppose that there is a set of objects located in space as depicted in the rectangle shown in Figure 7.3(a). Let k = 3; that is, the user would like the objects to be partitioned into three clusters.

According to the algorithm in Figure 7.2, we arbitrarily choose three objects as the three initial cluster centers, where cluster centers are marked by a "+". Each object is distributed to a cluster based on the cluster center to which it is the nearest. Such a distribution forms silhouettes encircled by dotted curves, as shown in Figure 7.3(a).

Next, the cluster centers are updated. That is, the mean value of each cluster is recalculated based on the current objects in the cluster.
Using the new cluster centers, the objects are redistributed to the clusters based on which cluster center is the nearest. Such a redistribution forms new silhouettes encircled by dashed curves, as shown in Figure 7.3(b).

This process iterates, leading to Figure 7.3(c). The process of iteratively reassigning objects to clusters to improve the partitioning is referred to as iterative relocation. Eventually, no redistribution of the objects in any cluster occurs and so the process terminates. The resulting clusters are returned by the clustering process.

The algorithm attempts to determine k partitions that minimize the square-error function. It works well when the clusters are compact clouds that are rather well separated from one another. The method is relatively scalable and efficient in processing large data sets because the computational complexity of the algorithm is O(nkt), where n is the total number of objects, k is the number of clusters, and t is the number of iterations. Normally, $k \ll n$ and $t \ll n$. The method often terminates at a local optimum.

The k-means method, however, can be applied only when the mean of a cluster is defined. This may not be the case in some applications, such as when data with categorical attributes are involved. The necessity for users to specify k, the number of clusters, in advance can be seen as a disadvantage. The k-means method is not suitable for discovering clusters with nonconvex shapes or clusters of very different size. Moreover, it is sensitive to noise and outlier data points since a small number of such data can substantially influence the mean value.

[Figure 7.3, panels (a), (b), and (c): Clustering of a set of objects based on the k-means method. (The mean of each cluster is marked by a "+".)]

There are quite a few variants of the k-means method. These can differ in the selection of the initial k means, the calculation of dissimilarity, and the strategies for calculating cluster means.
An interesting strategy that often yields good results is to first apply a hierarchical agglomeration algorithm, which determines the number of clusters and finds an initial clustering, and then use iterative relocation to improve the clustering. Another variant to k-means is the k-modes method, which extends the k-means paradigm to cluster categorical data by replacing the means of clusters with modes, using new dissimilarity measures to deal with categorical objects and a frequency-based method to update modes of clusters. The k-means and the k-modes methods can be integrated to cluster data with mixed numeric and categorical values. The EM (Expectation-Maximization) algorithm (which will be further discussed in Section 7.8.1) extends the k-means paradigm in a different way. Whereas the k-means algorithm assigns each object to a cluster, in EM, each object is assigned to each cluster according to a weight representing its probability of membership. In other words, there are no strict boundaries between clusters. Therefore, new means are computed based on weighted measures. "How can we make the k-means algorithm more scalable?" A recent approach to scaling the k-means algorithm is based on the idea of identifying three kinds of regions in data: regions that are compressible, regions that must be maintained in main memory, and regions that are discardable. An object is discardable if its membership in a cluster is ascertained. An object is compressible if it is not discardable but belongs to a tight subcluster. A data structure known as a clustering feature is used to summarize objects that have been discarded or compressed. If an object is neither discardable nor compressible, then it should be retained in main memory. 
To achieve scalability, the iterative clustering algorithm only includes the clustering features of the compressible objects and the objects that must be retained in main memory, thereby turning a secondary-memory-based algorithm into a main-memory-based algorithm. An alternative approach to scaling the k-means algorithm explores the microclustering idea, which first groups nearby objects into "microclusters" and then performs k-means clustering on the microclusters. Microclustering is further discussed in Section 7.5.

Representative Object-Based Technique: The k-Medoids Method

The k-means algorithm is sensitive to outliers, since an object with an extremely large value may substantially distort the distribution of data. This effect is exacerbated by the use of the square-error function (Equation 7.18).

"How might the algorithm be modified to diminish such sensitivity?" Instead of taking the mean value of the objects in a cluster as a reference point, we can pick actual objects to represent the clusters, using one representative object per cluster. Each remaining object is clustered with the representative object to which it is the most similar. The partitioning method is then performed based on the principle of minimizing the sum of the dissimilarities between each object and its corresponding reference point. That is, an absolute-error criterion is used, defined as

$$E = \sum_{j=1}^{k} \sum_{p \in C_j} |p - o_j|, \qquad (7.19)$$

where E is the sum of the absolute error for all objects in the data set; p is the point in space representing a given object in cluster $C_j$; and $o_j$ is the representative object of $C_j$. In general, the algorithm iterates until, eventually, each representative object is actually the medoid, or most centrally located object, of its cluster. This is the basis of the k-medoids method for grouping n objects into k clusters.

Let's have a closer look at k-medoids clustering.
The initial representative objects (or seeds) are chosen arbitrarily. The iterative process of replacing representative objects by nonrepresentative objects continues as long as the quality of the resulting clustering is improved. This quality is estimated using a cost function that measures the average dissimilarity between an object and the representative object of its cluster. To determine whether a nonrepresentative object, $o_{random}$, is a good replacement for a current representative object, $o_j$, the following four cases are examined for each of the nonrepresentative objects, p, as illustrated in Figure 7.4.

Case 1: p currently belongs to representative object, $o_j$. If $o_j$ is replaced by $o_{random}$ as a representative object and p is closest to one of the other representative objects, $o_i$, $i \neq j$, then p is reassigned to $o_i$.

Case 2: p currently belongs to representative object, $o_j$. If $o_j$ is replaced by $o_{random}$ as a representative object and p is closest to $o_{random}$, then p is reassigned to $o_{random}$.

Case 3: p currently belongs to representative object, $o_i$, $i \neq j$. If $o_j$ is replaced by $o_{random}$ as a representative object and p is still closest to $o_i$, then the assignment does not change.

Case 4: p currently belongs to representative object, $o_i$, $i \neq j$. If $o_j$ is replaced by $o_{random}$ as a representative object and p is closest to $o_{random}$, then p is reassigned to $o_{random}$.

Figure 7.4: Four cases of the cost function for k-medoids clustering. (Panels: 1. Reassigned to $o_i$; 2. Reassigned to $o_{random}$; 3. No change; 4. Reassigned to $o_{random}$. Legend: data object, cluster center, before swapping, after swapping.)

Each time a reassignment occurs, a difference in absolute error, E, is contributed to the cost function. Therefore, the cost function calculates the difference in absolute-error value if a current representative object is replaced by a nonrepresentative object.
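The contribution of a single object p in each of the four cases can be written out explicitly. The notation here is introduced only for illustration and does not appear in the original text: $d(\cdot,\cdot)$ stands for the dissimilarity between two objects, and $\Delta_p$ for the change in absolute error caused by p when $o_j$ is replaced by $o_{random}$.

```latex
\Delta_p =
\begin{cases}
d(p, o_i) - d(p, o_j) \;\ge 0, & \text{Case 1: } p \text{ moves from } o_j \text{ to } o_i,\ i \neq j,\\[2pt]
d(p, o_{random}) - d(p, o_j), & \text{Case 2: } p \text{ moves from } o_j \text{ to } o_{random},\\[2pt]
0, & \text{Case 3: } p \text{ stays with } o_i,\ i \neq j,\\[2pt]
d(p, o_{random}) - d(p, o_i) \;< 0, & \text{Case 4: } p \text{ moves from } o_i \text{ to } o_{random}.
\end{cases}
```

Only Case 2 can have either sign; Case 1 never lowers the error and Case 4 always lowers it.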
The total cost of swapping is the sum of costs incurred by all nonrepresentative objects. If the total cost is negative, then $o_j$ is replaced or swapped with $o_{random}$, since the actual absolute error E would be reduced. If the total cost is positive, the current representative object, $o_j$, is considered acceptable, and nothing is changed in the iteration.

PAM (Partitioning Around Medoids) was one of the first k-medoids algorithms introduced (Figure 7.5). It attempts to determine k partitions for n objects. After an initial random selection of k representative objects, the algorithm repeatedly tries to make a better choice of cluster representatives. All of the possible pairs of objects are analyzed, where one object in each pair is considered a representative object and the other is not. The quality of the resulting clustering is calculated for each such combination. An object, $o_j$, is replaced with the object causing the greatest reduction in error. The set of best objects for each cluster in one iteration forms the representative objects for the next iteration. The final representative objects are the respective medoids of the clusters. The complexity of each iteration is $O(k(n - k)^2)$.

Algorithm: k-medoids. PAM, a k-medoids algorithm for partitioning based on medoid or central objects.
Input:
    k: the number of clusters,
    D: a data set containing n objects.
Output: A set of k clusters.
Method:
(1) arbitrarily choose k objects in D as the initial representative objects or seeds;
(2) repeat
(3)     assign each remaining object to the cluster with the nearest representative object;
(4)     randomly select a nonrepresentative object, o_random;
(5)     compute the total cost, S, of swapping representative object, o_j, with o_random;
(6)     if S < 0 then swap o_j with o_random to form the new set of k representative objects;
(7) until no change;

Figure 7.5: PAM, a k-medoids partitioning algorithm.
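The swap-cost evaluation can be sketched as follows. This is one possible reading, under stated assumptions: it follows the exhaustive variant described in the text (every representative/nonrepresentative pair is examined and the best swap is applied) rather than the random selection in step (4) of Figure 7.5, it uses Euclidean distance as the dissimilarity, and it takes the first k objects as seeds. The names `total_error` and `pam` and the toy data are illustrative.

```python
import math

def total_error(points, medoids):
    """Absolute error E: distance of each object to its nearest medoid, summed."""
    return sum(min(math.dist(p, points[m]) for m in medoids) for p in points)

def pam(points, k):
    """Greedy k-medoids: repeatedly apply the swap with the most negative cost S."""
    medoids = list(range(k))                     # arbitrary seeds (assumed: first k objects)
    current = total_error(points, medoids)
    while True:
        best_s, best_swap = 0.0, None
        for pos in range(k):                     # each current representative o_j
            for cand in range(len(points)):      # each nonrepresentative o_random
                if cand in medoids:
                    continue
                trial = medoids[:pos] + [cand] + medoids[pos + 1:]
                s = total_error(points, trial) - current   # total swap cost S
                if s < best_s:
                    best_s, best_swap = s, trial
        if best_swap is None:                    # no swap has S < 0, so stop
            return medoids, current
        medoids = best_swap
        current = total_error(points, medoids)

# Hypothetical toy data: two compact groups of three objects each.
data = [(1.0, 1.0), (1.5, 2.0), (1.0, 1.5), (8.0, 8.0), (9.0, 8.5), (8.5, 9.0)]
medoids, err = pam(data, k=2)
print(medoids, err)
```

Recomputing E from scratch for every trial keeps the sketch short but is slower than accumulating the per-object differences of the four cases.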
For large values of n and k, such computation becomes very costly.

"Which method is more robust: k-means or k-medoids?" The k-medoids method is more robust than k-means in the presence of noise and outliers because a medoid is less influenced by outliers or other extreme values than a mean. However, its processing is more costly than the k-means method. Both methods require the user to specify k, the number of clusters.

Aside from the mean or the medoid, alternative measures of cluster center are also commonly used in partitioning clustering methods. The median can be used, resulting in the k-median method, where the median or "middle value" is taken for each ordered attribute. Alternatively, in the k-modes method, the most frequent value for each attribute is used.

7.4.2 Partitioning Methods in Large Databases: From k-Medoids to CLARANS

"How efficient is the k-medoids algorithm on large data sets?" A typical k-medoids partitioning algorithm like PAM works effectively for small data sets, but does not scale well for large data sets. To deal with larger data sets, a sampling-based method, called CLARA (Clustering LARge Applications), can be used.

The idea behind CLARA is as follows: Instead of taking the whole set of data into consideration, a small portion of the actual data is chosen as a representative of the data. Medoids are then chosen from this sample using PAM. If the sample is selected in a fairly random manner, it should closely represent the original data set. The representative objects (medoids) chosen will likely be similar to those that would have been chosen from the whole data set. CLARA draws multiple samples of the data set, applies PAM on each sample, and returns its best clustering as the output. As expected, CLARA can deal with larger data sets than PAM. The complexity of each iteration now becomes $O(ks^2 + k(n - k))$, where s is the size of the sample, k is the number of clusters, and n is the total number of objects.
The effectiveness of CLARA depends on the sample size. Notice that PAM searches for the best k medoids among a given data set, whereas CLARA searches for the best k medoids among the selected sample of the data set. CLARA cannot find the best clustering if any of the best sampled medoids is not among the best k medoids. That is, if an object $o_i$ is one of the best k medoids but is not selected during sampling, CLARA will never find the best clustering. This is, therefore, a trade-off for efficiency. A good clustering based on sampling will not necessarily represent a good clustering of the whole data set if the sample is biased.

"How might we improve the quality and scalability of CLARA?" A k-medoids type algorithm called CLARANS (Clustering Large Applications based upon RANdomized Search) was proposed, which combines the sampling technique with PAM. However, unlike CLARA, CLARANS does not confine itself to any sample at any given time. While CLARA has a fixed sample at each stage of the search, CLARANS draws a sample with some randomness in each step of the search.

Conceptually, the clustering process can be viewed as a search through a graph, where each node is a potential solution (a set of k medoids). Two nodes are neighbors (that is, connected by an arc in the graph) if their sets differ by only one object. Each node can be assigned a cost that is defined by the total dissimilarity between every object and the medoid of its cluster. At each step, PAM examines all of the neighbors of the current node in its search for a minimum-cost solution. The current node is then replaced by the neighbor with the largest descent in costs. Because CLARA works on a sample of the entire data set, it examines fewer neighbors and restricts the search to subgraphs that are smaller than the original graph. While CLARA draws a sample of nodes at the beginning of a search, CLARANS dynamically draws a random sample of neighbors in each step of a search.
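The graph view of the search lends itself to a compact sketch of randomized neighbor sampling in the style of CLARANS. It is an illustration under assumptions, not the published algorithm verbatim: the parameter names `max_neighbors` (how many random neighbors may be tried without improvement) and `num_local` (how many restarts are made) are chosen here for readability, Euclidean distance serves as the dissimilarity, and the toy data are hypothetical.

```python
import math
import random

def cost(points, medoids):
    """Node cost: total distance from every object to its nearest medoid."""
    return sum(min(math.dist(p, points[m]) for m in medoids) for p in points)

def clarans(points, k, num_local=3, max_neighbors=20, seed=0):
    """Randomized search over nodes (sets of k medoid indices)."""
    rng = random.Random(seed)
    n = len(points)
    best_node, best_cost = None, math.inf
    for _ in range(num_local):
        node = rng.sample(range(n), k)           # random starting node
        node_cost = cost(points, node)
        tried = 0
        while tried < max_neighbors:
            # A neighbor differs from the current node in exactly one medoid.
            neighbor = list(node)
            outside = [i for i in range(n) if i not in node]
            neighbor[rng.randrange(k)] = rng.choice(outside)
            neighbor_cost = cost(points, neighbor)
            if neighbor_cost < node_cost:        # better neighbor: move and restart count
                node, node_cost, tried = neighbor, neighbor_cost, 0
            else:
                tried += 1
        # The current node is treated as a local minimum; keep the best one seen.
        if node_cost < best_cost:
            best_node, best_cost = node, node_cost
    return best_node, best_cost

data = [(1.0, 1.0), (1.5, 2.0), (1.0, 1.5), (8.0, 8.0), (9.0, 8.5), (8.5, 9.0)]
best, best_cost = clarans(data, k=2)
print(best, best_cost)
```

Because neighbors are sampled rather than enumerated, a node is only declared a local minimum with respect to the neighbors that happened to be tried.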
The number of neighbors to be randomly sampled is restricted by a user-specified parameter. In this way, CLARANS does not confine the search to a localized area. If a better neighbor is found (i.e., having a lower error), CLARANS moves to the neighbor's node and the process starts again; otherwise the current clustering produces a local minimum. If a local minimum is found, CLARANS starts with new randomly selected nodes in search for a new local minimum. Once a user-specified number of local minima has been found, the algorithm outputs, as a solution, the best local minimum, that is, the local minimum having the lowest cost.

CLARANS has been experimentally shown to be more effective than both PAM and CLARA. It can be used to find the most "natural" number of clusters using a silhouette coefficient, a property of an object that specifies how much the object truly belongs to the cluster. CLARANS also enables the detection of outliers. However, the computational complexity of CLARANS is about $O(n^2)$, where n is the number of objects. Furthermore, its clustering quality is dependent on the sampling method used. The ability of CLARANS to deal with data objects that reside on disk can be further improved by focusing techniques that explore spatial data structures, such as R*-trees.

7.5 Hierarchical Methods

A hierarchical clustering method works by grouping data objects into a tree of clusters. Hierarchical clustering methods can be further classified as either agglomerative or divisive, depending on whether the hierarchical decomposition is formed in a bottom-up (merging) or top-down (splitting) fashion. The quality of a pure hierarchical clustering method suffers from its inability to perform adjustment once a merge or split decision has been executed. That is, if a particular merge or split decision later turns out to have been a poor choice, the method cannot backtrack and correct it.
Recent studies have emphasized the integration of hierarchical agglomeration with iterative relocation methods.

7.5.1 Agglomerative and Divisive Hierarchical Clustering

In general, there are two types of hierarchical clustering methods:

Agglomerative hierarchical clustering: This bottom-up strategy starts by placing each object in its own cluster and then merges these atomic clusters into larger and larger clusters, until all of the objects are in a single cluster or until certain termination conditions are satisfied. Most hierarchical clustering methods belong to this category. They differ only in their definition of intercluster similarity.

Divisive hierarchical clustering: This top-down strategy does the reverse of agglomerative hierarchical clustering by starting with all objects in one cluster. It subdivides the cluster into smaller and smaller pieces, until each object forms a cluster on its own or until certain termination conditions are satisfied, such as when a desired number of clusters is obtained or when the diameter of each cluster is within a certain threshold.

Figure 7.6: Agglomerative and divisive hierarchical clustering on data objects {a, b, c, d, e}. (AGNES proceeds from step 0 to step 4, merging a and b into ab, d and e into de, c and de into cde, and finally ab and cde into abcde; DIANA proceeds through the same steps in reverse.)

Example 7.9 Agglomerative versus divisive hierarchical clustering. Figure 7.6 shows the application of AGNES (AGglomerative NESting), an agglomerative hierarchical clustering method, and DIANA (DIvisive ANAlysis), a divisive hierarchical clustering method, to a data set of five objects, {a, b, c, d, e}. Initially, AGNES places each object into a cluster of its own. The clusters are then merged step-by-step according to some criterion.
For example, clusters $C_1$ and $C_2$ may be merged if an object in $C_1$ and an object in $C_2$ form the minimum Euclidean distance between any two objects from different clusters. This is a single-linkage approach in that each cluster is represented by all of the objects in the cluster, and the similarity between two clusters is measured by the similarity of the closest pair of data points belonging to different clusters. The cluster merging process repeats until all of the objects are eventually merged to form one cluster.

In DIANA, all of the objects are used to form one initial cluster. The cluster is split according to some principle, such as the maximum Euclidean distance between the closest neighboring objects in the cluster. The cluster splitting process repeats until, eventually, each new cluster contains only a single object.

In either agglomerative or divisive hierarchical clustering, the user can specify the desired number of clusters as a termination condition.

A tree structure called a dendrogram is commonly used to represent the process of hierarchical clustering. It shows how objects are grouped together step by step. Figure 7.7 shows a dendrogram for the five objects presented in Figure 7.6, where l = 0 shows the five objects as singleton clusters at level 0. At l = 1, objects a and b are grouped together to form the first cluster, and they stay together at all subsequent levels. We can also use a vertical axis to show the similarity scale between clusters. For example, when the similarity of two groups of objects, {a, b} and {c, d, e}, is roughly 0.16, they are merged together to form a single cluster.

Four widely used measures for distance between clusters are as follows, where $|p - p'|$ is the distance between two objects or points, $p$ and $p'$; $m_i$ is the mean for cluster $C_i$; and $n_i$ is the number of objects in $C_i$.
Minimum distance: $$d_{min}(C_i, C_j) = \min_{p \in C_i,\, p' \in C_j} |p - p'| \qquad (7.20)$$

Maximum distance: $$d_{max}(C_i, C_j) = \max_{p \in C_i,\, p' \in C_j} |p - p'| \qquad (7.21)$$

Mean distance: $$d_{mean}(C_i, C_j) = |m_i - m_j| \qquad (7.22)$$

Average distance: $$d_{avg}(C_i, C_j) = \frac{1}{n_i n_j} \sum_{p \in C_i} \sum_{p' \in C_j} |p - p'| \qquad (7.23)$$

Figure 7.7: Dendrogram representation for hierarchical clustering of data objects {a, b, c, d, e}. (Horizontal lines mark levels l = 0 through l = 4; the vertical axis is a similarity scale running from 1.0 down to 0.0.)

When an algorithm uses the minimum distance, $d_{min}(C_i, C_j)$, to measure the distance between clusters, it is sometimes called a nearest neighbor clustering algorithm. Moreover, if the clustering process is terminated when the distance between nearest clusters exceeds an arbitrary threshold, it is called a single-linkage algorithm. If we view the data points as nodes of a graph, with edges forming a path between the nodes in a cluster, then the merging of two clusters, $C_i$ and $C_j$, corresponds to adding an edge between the nearest pair of nodes in $C_i$ and $C_j$. Since edges linking clusters always go between distinct clusters, the resulting graph will generate a tree. Thus, an agglomerative hierarchical clustering algorithm that uses the minimum distance measure is also called a minimal spanning tree algorithm.

When an algorithm uses the maximum distance, $d_{max}(C_i, C_j)$, to measure the distance between clusters, it is sometimes called a farthest neighbor clustering algorithm. If the clustering process is terminated when the maximum distance between nearest clusters exceeds an arbitrary threshold, it is called a complete-linkage algorithm. By viewing data points as nodes of a graph, with edges linking nodes, we can think of each cluster as a complete subgraph, that is, with edges connecting all of the nodes in the clusters.
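The four inter-cluster distances of Equations (7.20) to (7.23) translate directly into code, and any of them can drive a simple agglomerative merge loop. The sketch below assumes Euclidean distance between points and stops when a requested number of clusters remains; the function names and the sample points are illustrative only.

```python
import math
from itertools import product

def d_min(ci, cj):
    """Minimum distance (7.20): closest pair across the two clusters."""
    return min(math.dist(p, q) for p, q in product(ci, cj))

def d_max(ci, cj):
    """Maximum distance (7.21): farthest pair across the two clusters."""
    return max(math.dist(p, q) for p, q in product(ci, cj))

def mean_point(c):
    return tuple(sum(x) / len(c) for x in zip(*c))

def d_mean(ci, cj):
    """Mean distance (7.22): distance between the cluster means."""
    return math.dist(mean_point(ci), mean_point(cj))

def d_avg(ci, cj):
    """Average distance (7.23): mean of all cross-cluster pair distances."""
    return sum(math.dist(p, q) for p, q in product(ci, cj)) / (len(ci) * len(cj))

def agglomerate(points, k, linkage=d_min):
    """Merge the two closest clusters (under `linkage`) until k clusters remain."""
    clusters = [[p] for p in points]
    while len(clusters) > k:
        pairs = ((a, b) for a in range(len(clusters)) for b in range(a + 1, len(clusters)))
        i, j = min(pairs, key=lambda ab: linkage(clusters[ab[0]], clusters[ab[1]]))
        merged = clusters.pop(j)                 # j > i, so index i is unaffected
        clusters[i].extend(merged)
    return clusters

ci = [(0.0, 0.0), (0.0, 1.0)]
cj = [(3.0, 0.0), (4.0, 0.0)]
print(d_min(ci, cj), d_max(ci, cj), d_mean(ci, cj), d_avg(ci, cj))
result = agglomerate(ci + cj + [(10.0, 10.0)], k=2)
print(result)
```

Passing `linkage=d_max` instead gives farthest neighbor behavior with the same loop.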
The distance between two clusters is determined by the most distant nodes in the two clusters. Farthest neighbor algorithms tend to keep the increase in cluster diameter at each iteration as small as possible. If the true clusters are rather compact and approximately equal in size, the method will produce high-quality clusters. Otherwise, the clusters produced can be meaningless.

The above minimum and maximum measures represent two extremes in measuring the distance between clusters. They tend to be overly sensitive to outliers or noisy data. The use of mean or average distance is a compromise between the minimum and maximum distances and overcomes the outlier sensitivity problem. Whereas the mean distance is the simplest to compute, the average distance is advantageous in that it can handle categoric as well as numeric data (see footnote 2). The mean vector for categoric data can be difficult or impossible to define.

"What are some of the difficulties with hierarchical clustering?" The hierarchical clustering method, though simple, often encounters difficulties regarding the selection of merge or split points. Such a decision is critical because once a group of objects is merged or split, the process at the next step will operate on the newly generated clusters. It will neither undo what was done previously, nor perform object swapping between clusters. Thus merge or split decisions, if not well chosen at some step, may lead to low-quality clusters. Moreover, the method does not scale well, since each decision of merge or split needs to examine and evaluate a good number of objects or clusters.

One promising direction for improving the clustering quality of hierarchical methods is to integrate hierarchical clustering with other clustering techniques, resulting in multiple-phase clustering. Three such methods are introduced in the following subsections.
The first, called BIRCH, begins by partitioning objects hierarchically using tree structures, where the leaf or low-level nonleaf nodes can be viewed as "microclusters" depending on the scale of resolution. It then applies other clustering algorithms to perform macroclustering on the microclusters. The second method, called ROCK, merges clusters based on their interconnectivity. The third method, called Chameleon, explores dynamic modeling in hierarchical clustering.

7.5.2 BIRCH: Balanced Iterative Reducing and Clustering Using Hierarchies

BIRCH is designed for clustering a large amount of numerical data by integration of hierarchical clustering (at the initial microclustering stage) and other clustering methods such as iterative partitioning (at the later macroclustering stage). It overcomes the two difficulties of agglomerative clustering methods: (1) scalability, and (2) the inability to undo what was done in the previous step.

BIRCH introduces two concepts, clustering feature and clustering feature tree (CF tree), which are used to summarize cluster representations. These structures help the clustering method achieve good speed and scalability in large databases, and also make it effective for incremental and dynamic clustering of incoming objects.

Let's have a closer look at the above-mentioned structures. Given n d-dimensional data objects or points in a cluster, we can define the centroid $x_0$, radius $R$, and diameter $D$ of the cluster as follows,

$$x_0 = \frac{\sum_{i=1}^{n} x_i}{n} \qquad (7.24)$$

$$R = \sqrt{\frac{\sum_{i=1}^{n} (x_i - x_0)^2}{n}} \qquad (7.25)$$

$$D = \sqrt{\frac{\sum_{i=1}^{n} \sum_{j=1}^{n} (x_i - x_j)^2}{n(n-1)}} \qquad (7.26)$$

where R is the average distance from member objects to the centroid, and D is the average pairwise distance within a cluster. Both R and D reflect the tightness of the cluster around the centroid.

A clustering feature (CF) is a 3-dimensional vector summarizing information about clusters of objects.
Given n d-dimensional objects or points in a cluster, $\{x_i\}$, the CF of the cluster is defined as

$$CF = \langle n, LS, SS \rangle, \qquad (7.27)$$

where n is the number of points in the cluster, LS is the linear sum of the n points (i.e., $\sum_{i=1}^{n} x_i$), and SS is the square sum of the data points (i.e., $\sum_{i=1}^{n} x_i^2$).

Footnote 2: To handle categoric data, dissimilarity measures such as those described in Sections 7.2.2 and 7.2.3 can be used to replace $|p - p'|$ by $d(p, p')$ in Equation (7.23).

A clustering feature is essentially a summary of the statistics for the given cluster: the zeroth, first, and second moments of the cluster from a statistical point of view. Clustering features are additive. For example, suppose that we have two disjoint clusters, $C_1$ and $C_2$, having the clustering features $CF_1$ and $CF_2$, respectively. The clustering feature for the cluster that is formed by merging $C_1$ and $C_2$ is simply $CF_1 + CF_2$.

Clustering features are sufficient for calculating all of the measurements that are needed for making clustering decisions in BIRCH. BIRCH thus utilizes storage efficiently by employing the clustering features to summarize information about the clusters of objects, thereby bypassing the need to store all objects.

Example 7.10 Clustering feature. Suppose that there are three points, (2, 5), (3, 2), and (4, 3), in a cluster, $C_1$. The clustering feature of $C_1$ is

$$CF_1 = \langle 3, (2 + 3 + 4, 5 + 2 + 3), (2^2 + 3^2 + 4^2, 5^2 + 2^2 + 3^2) \rangle = \langle 3, (9, 10), (29, 38) \rangle.$$

Suppose that $C_1$ is disjoint to a second cluster, $C_2$, where $CF_2 = \langle 3, (35, 36), (417, 440) \rangle$. The clustering feature of a new cluster, $C_3$, that is formed by merging $C_1$ and $C_2$, is derived by adding $CF_1$ and $CF_2$. That is,

$$CF_3 = \langle 3 + 3, (9 + 35, 10 + 36), (29 + 417, 38 + 440) \rangle = \langle 6, (44, 46), (446, 478) \rangle.$$

A CF tree is a height-balanced tree that stores the clustering features for a hierarchical clustering. An example is shown in Figure 7.8. By definition, a nonleaf node in a tree has descendants or "children."
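The additivity of clustering features, and the claim that they suffice to recover the centroid, radius, and diameter, can be checked with a short sketch. The class below is illustrative and not BIRCH's actual data structure: the name `CF`, its method names, and the choice to store SS per dimension (as Example 7.10 does) are assumptions. The radius and diameter use the identities $R^2 = SS/n - |LS/n|^2$ and $D^2 = (2n \cdot SS - 2|LS|^2) / (n(n-1))$, where SS here means the per-dimension square sums added together; both follow from expanding Equations (7.25) and (7.26).

```python
import math
from dataclasses import dataclass

@dataclass(frozen=True)
class CF:
    n: int        # number of points
    ls: tuple     # per-dimension linear sum
    ss: tuple     # per-dimension square sum, as in Example 7.10

    @classmethod
    def from_points(cls, points):
        cols = list(zip(*points))
        return cls(len(points),
                   tuple(sum(c) for c in cols),
                   tuple(sum(x * x for x in c) for c in cols))

    def __add__(self, other):
        # Additivity: merging two disjoint clusters adds their CFs componentwise.
        return CF(self.n + other.n,
                  tuple(a + b for a, b in zip(self.ls, other.ls)),
                  tuple(a + b for a, b in zip(self.ss, other.ss)))

    def centroid(self):
        return tuple(x / self.n for x in self.ls)

    def radius(self):
        r2 = sum(self.ss) / self.n - sum(c * c for c in self.centroid())
        return math.sqrt(max(r2, 0.0))           # clamp tiny negative rounding error

    def diameter(self):
        if self.n < 2:
            return 0.0
        d2 = (2 * self.n * sum(self.ss) - 2 * sum(x * x for x in self.ls)) \
             / (self.n * (self.n - 1))
        return math.sqrt(max(d2, 0.0))

cf1 = CF.from_points([(2, 5), (3, 2), (4, 3)])
cf2 = CF(3, (35, 36), (417, 440))
cf3 = cf1 + cf2
print(cf1, cf3, cf1.radius(), cf1.diameter())
```

No individual point is needed after the summary is built, which is what lets BIRCH keep only CFs in memory.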
The nonleaf nodes store sums of the CFs of their children, and thus summarize clustering information about their children. A CF tree has two parameters: branching factor, B, and threshold, T. The branching factor specifies the maximum number of children per nonleaf node. The threshold parameter specifies the maximum diameter of subclusters stored at the leaf nodes of the tree. These two parameters influence the size of the resulting tree.

Figure 7.8: A CF tree structure. (The root level holds entries $CF_1, CF_2, \ldots, CF_k$; the first level holds entries $CF_{11}, CF_{12}, \ldots, CF_{1k}$.)

BIRCH tries to produce the best clusters with the available resources. Given a limited amount of main memory, an important consideration is to minimize the time required for I/O. BIRCH applies a multiphase clustering technique: a single scan of the data set yields a basic good clustering, and one or more additional scans can (optionally) be used to further improve the quality. The primary phases are:

Phase 1: BIRCH scans the database to build an initial in-memory CF tree, which can be viewed as a multilevel compression of the data that tries to preserve the inherent clustering structure of the data.

Phase 2: BIRCH applies a (selected) clustering algorithm to cluster the leaf nodes of the CF tree, which removes sparse clusters as outliers and groups dense clusters into larger ones.

For Phase 1, the CF tree is built dynamically as objects are inserted. Thus, the method is incremental. An object is inserted into the closest leaf entry (subcluster). If the diameter of the subcluster stored in the leaf node after insertion is larger than the threshold value, then the leaf node and possibly other nodes are split. After the insertion of the new object, information about it is passed toward the root of the tree. The size of the CF tree can be changed by modifying the threshold.
If the size of the memory that is needed for storing the CF tree is larger than the size of the main memory, then a larger threshold value can be specified and the CF tree is rebuilt; with a larger threshold, each leaf subcluster can absorb more objects, so the tree has fewer entries. The rebuild process is performed by building a new tree from the leaf nodes of the old tree. Thus, the process of rebuilding the tree is done without the necessity of rereading all of the objects or points. This is similar to the insertion and node split in the construction of B+-trees. Therefore, for building the tree, data has to be read just once. Some heuristics and methods have been introduced to deal with outliers and improve the quality of CF trees by additional scans of the data. Once the CF tree is built, any clustering algorithm, such as a typical partitioning algorithm, can be used with the CF tree in Phase 2.

"How effective is BIRCH?" The computational complexity of the algorithm is O(n), where n is the number of objects to be clustered. Experiments have shown the linear scalability of the algorithm with respect to the number of objects, and good quality of clustering of the data. However, since each node in a CF tree can hold only a limited number of entries due to its size, a CF tree node does not always correspond to what a user may consider a natural cluster. Moreover, if the clusters are not spherical in shape, BIRCH does not perform well because it uses the notion of radius or diameter to control the boundary of a cluster.

7.5.3 ROCK: A Hierarchical Clustering Algorithm for Categorical Attributes

ROCK (RObust Clustering using linKs) is a hierarchical clustering algorithm that explores the concept of links (the number of common neighbors between two objects) for data with categorical attributes. Traditional clustering algorithms for clustering data with Boolean and categorical attributes use distance functions (such as those introduced for binary variables in Section 7.2.2).
However, experiments show that such distance measures cannot lead to high-quality clusters when clustering categorical data. Furthermore, most clustering algorithms assess only the similarity between points when clustering; that is, at each step, points that are the most similar are merged into a single cluster. This "localized" approach is prone to errors. For example, two distinct clusters may have a few points or outliers that are close; therefore, relying on the similarity between points to make clustering decisions could cause the two clusters to be merged. ROCK takes a more global approach to clustering by considering the neighborhoods of individual pairs of points. If two similar points also have similar neighborhoods, then the two points likely belong to the same cluster and so can be merged.

More formally, two points, $p_i$ and $p_j$, are neighbors if $sim(p_i, p_j) \geq \theta$, where $sim$ is a similarity function and $\theta$ is a user-specified threshold. We can choose $sim$ to be a distance metric or even a nonmetric (provided by a domain expert or as in Section 7.2.5) that is normalized so that its values fall between 0 and 1, with larger values indicating that the points are more similar. The number of links between $p_i$ and $p_j$ is defined as the number of common neighbors between $p_i$ and $p_j$. If the number of links between two points is large, then it is more likely that they belong to the same cluster. By considering neighboring data points in the relationship between individual pairs of points, ROCK is more robust than standard clustering methods that focus only on point similarity.

A good example of data containing categorical attributes is market basket data (Chapter 5). Such data consists of a database of transactions, where each transaction is a set of items. Transactions are considered records with Boolean attributes, each corresponding to an individual item, such as bread or cheese.
In the record for a transaction, the attribute corresponding to an item is true if the transaction contains the item; otherwise, it is false. Other data sets with categorical attributes can be handled in a similar manner.

ROCK's concepts of neighbors and links are illustrated in the following example, where the similarity between two "points" or transactions, $T_i$ and $T_j$, is defined with the Jaccard coefficient as

$$sim(T_i, T_j) = \frac{|T_i \cap T_j|}{|T_i \cup T_j|}. \qquad (7.28)$$

Example 7.11 Using neighborhood link information together with point similarity. Suppose that a market basket database contains transactions regarding the items a, b, ..., g. Consider two clusters of transactions, $C_1$ and $C_2$. $C_1$, which references the items a, b, c, d, e, contains the transactions {a, b, c}, {a, b, d}, {a, b, e}, {a, c, d}, {a, c, e}, {a, d, e}, {b, c, d}, {b, c, e}, {b, d, e}, {c, d, e}. $C_2$ references the items a, b, f, g. It contains the transactions {a, b, f}, {a, b, g}, {a, f, g}, {b, f, g}.

Suppose, first, that we consider only the similarity between points while ignoring neighborhood information. The Jaccard coefficient between the transactions {a, b, c} and {b, d, e} of $C_1$ is 1/5 = 0.2. In fact, the Jaccard coefficient between any pair of transactions in $C_1$ ranges from 0.2 to 0.5 (e.g., {a, b, c} and {a, b, d}). The Jaccard coefficient between transactions belonging to different clusters may also reach 0.5 (e.g., {a, b, c} of $C_1$ with {a, b, f} or {a, b, g} of $C_2$). Clearly, by using the Jaccard coefficient on its own, we cannot obtain the desired clusters.

On the other hand, the link-based approach of ROCK can successfully separate the transactions into the appropriate clusters. As it turns out, for each transaction, the transaction with which it has the most links is always another transaction from the same cluster. For example, let $\theta = 0.5$.
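The neighbor and link definitions can be tried out on the transactions of Example 7.11 with $\theta = 0.5$. The sketch below is a direct transcription of those definitions, not ROCK itself; the function names are illustrative, and it assumes that a transaction is not counted as its own neighbor.

```python
from itertools import combinations

def jaccard(t1, t2):
    """Jaccard coefficient (Equation 7.28) between two item sets."""
    return len(t1 & t2) / len(t1 | t2)

def neighbors(t, transactions, theta):
    """Transactions other than t whose similarity to t is at least theta."""
    return {u for u in transactions if u != t and jaccard(t, u) >= theta}

def links(t1, t2, transactions, theta):
    """Number of common neighbors of t1 and t2."""
    return len(neighbors(t1, transactions, theta) & neighbors(t2, transactions, theta))

# C1: all 3-item subsets of {a,b,c,d,e}; C2: all 3-item subsets of {a,b,f,g}.
c1 = [frozenset(c) for c in combinations("abcde", 3)]
c2 = [frozenset(c) for c in combinations("abfg", 3)]
db = c1 + c2
T = frozenset

print(jaccard(T("abc"), T("bde")))            # 0.2
print(links(T("abf"), T("abg"), db, 0.5))     # 5 (same cluster)
print(links(T("abf"), T("abc"), db, 0.5))     # 3 (different clusters)
```

Recomputing neighbor sets on every call is wasteful for large databases; a real implementation would compute them once and reuse them.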
Transaction {a, b, f} of $C_2$ has five links with transaction {a, b, g} of the same cluster (due to common neighbors {a, b, c}, {a, b, d}, {a, b, e}, {a, f, g}, and {b, f, g}). However, transaction {a, b, f} of $C_2$ has only three links with {a, b, c} of $C_1$ (due to {a, b, d}, {a, b, e}, and {a, b, g}). Similarly, transaction {a, f, g} of $C_2$ has two links with every other transaction in $C_2$, and zero links with most transactions in $C_1$ (those that do not contain both a and b). Thus, the link-based approach, which considers neighborhood information in addition to object similarity, can correctly distinguish the two clusters of transactions.

Based on these ideas, ROCK first constructs a sparse graph from a given data similarity matrix using a similarity threshold and the concept of shared neighbors. It then performs agglomerative hierarchical clustering on the sparse graph. A goodness measure is used to evaluate the clustering. Random sampling is used for scaling up to large data sets. The worst-case time complexity of ROCK is $O(n^2 + n m_m m_a + n^2 \log n)$, where $m_m$ and $m_a$ are the maximum and average number of neighbors, respectively, and n is the number of objects.

In several real-life data sets, such as the congressional voting data set and the mushroom data set at the UC-Irvine Machine Learning Repository, ROCK has demonstrated its power at deriving much more meaningful clusters than the traditional hierarchical clustering algorithms.

7.5.4 Chameleon: A Hierarchical Clustering Algorithm Using Dynamic Modeling

Chameleon is a hierarchical clustering algorithm that uses dynamic modeling to determine the similarity between pairs of clusters. It was derived based on the observed weaknesses of two hierarchical clustering algorithms: ROCK and CURE. ROCK and related schemes emphasize cluster interconnectivity while ignoring information regarding cluster proximity. CURE and related schemes consider cluster proximity yet ignore cluster interconnectivity.
In Chameleon, cluster similarity is assessed based on how well connected objects are within a cluster and on the proximity of clusters. That is, two clusters are merged if their interconnectivity is high and they are close together. Thus, Chameleon does not depend on a static, user-supplied model and can automatically adapt to the internal characteristics of the clusters being merged. The merge process facilitates the discovery of natural and homogeneous clusters and applies to all types of data as long as a similarity function can be specified.

Figure 7.9: Chameleon: Hierarchical clustering based on k-nearest neighbors and dynamic modeling. Based on [KHK99]. (Stages shown: data set; construct a sparse k-nearest neighbor graph; partition the graph; merge partitions; final clusters.)

"How does Chameleon work?" The main approach of Chameleon is illustrated in Figure 7.9. Chameleon uses a k-nearest neighbor graph approach to construct a sparse graph, where each vertex of the graph represents a data object, and there exists an edge between two vertices (objects) if one object is among the k most similar objects of the other. The edges are weighted to reflect the similarity between objects. Chameleon uses a graph partitioning algorithm to partition the k-nearest neighbor graph into a large number of relatively small subclusters. It then uses an agglomerative hierarchical clustering algorithm that repeatedly merges subclusters based on their similarity. To determine the pairs of most similar subclusters, it takes into account both the interconnectivity and the closeness of the clusters. We will give a mathematical definition for these criteria shortly.

Note that the k-nearest neighbor graph captures the concept of neighborhood dynamically: the neighborhood radius of an object is determined by the density of the region in which the object resides. In a dense region, the neighborhood is defined narrowly; in a sparse region, it is defined more widely.
This tends to result in more natural clusters, in comparison with density-based methods like DBSCAN (described in Section 7.6) that instead use a global neighborhood. Moreover, the density of the region is recorded as the weight of the edges. That is, the edges of a dense region tend to weigh more than those of a sparse region.

The graph partitioning algorithm partitions the k-nearest neighbor graph such that it minimizes the edge cut. That is, a cluster C is partitioned into subclusters $C_i$ and $C_j$ so as to minimize the weight of the edges that would be cut should C be bisected into $C_i$ and $C_j$. Edge cut is denoted $EC(C_i, C_j)$ and assesses the absolute interconnectivity between clusters $C_i$ and $C_j$.

Chameleon determines the similarity between each pair of clusters $C_i$ and $C_j$ according to their relative interconnectivity, $RI(C_i, C_j)$, and their relative closeness, $RC(C_i, C_j)$:

The relative interconnectivity, $RI(C_i, C_j)$, between two clusters, $C_i$ and $C_j$, is defined as the absolute interconnectivity between $C_i$ and $C_j$, normalized with respect to the internal interconnectivity of the two clusters, $C_i$ and $C_j$. That is,

$$RI(C_i, C_j) = \frac{|EC_{\{C_i, C_j\}}|}{\frac{1}{2}\left(|EC_{C_i}| + |EC_{C_j}|\right)}, \qquad (7.29)$$

where $EC_{\{C_i, C_j\}}$ is the edge cut, as defined above, for a cluster containing both $C_i$ and $C_j$. Similarly, $EC_{C_i}$ (or $EC_{C_j}$) is the minimum sum of the cut edges that partition $C_i$ (or $C_j$) into two roughly equal parts.

The relative closeness, $RC(C_i, C_j)$, between a pair of clusters, $C_i$ and $C_j$, is the absolute closeness between $C_i$ and $C_j$, normalized with respect to the internal closeness of the two clusters, $C_i$ and $C_j$. It is defined as

$$RC(C_i, C_j) = \frac{\bar{S}_{EC_{\{C_i, C_j\}}}}{\frac{|C_i|}{|C_i| + |C_j|}\,\bar{S}_{EC_{C_i}} + \frac{|C_j|}{|C_i| + |C_j|}\,\bar{S}_{EC_{C_j}}}, \qquad (7.30)$$

where $\bar{S}_{EC_{\{C_i, C_j\}}}$ is the average weight of the edges that connect vertices in $C_i$ to vertices in $C_j$, and $\bar{S}_{EC_{C_i}}$ (or $\bar{S}_{EC_{C_j}}$) is the average weight of the edges that belong to the min-cut bisector of cluster $C_i$ (or $C_j$).
Chameleon has been shown to have greater power at discovering arbitrarily shaped clusters of high quality than several well-known algorithms such as BIRCH and density-based DBSCAN (Section 7.6.1). However, the processing cost for high-dimensional data may require $O(n^2)$ time for n objects in the worst case.

7.6 Density-Based Methods

To discover clusters with arbitrary shape, density-based clustering methods have been developed. These typically regard clusters as dense regions of objects in the data space that are separated by regions of low density (representing noise).

7.6.1 DBSCAN: A Density-Based Clustering Method Based on Connected Regions with Sufficiently High Density

DBSCAN (Density-Based Spatial Clustering of Applications with Noise) is a density-based clustering algorithm. The algorithm grows regions with sufficiently high density into clusters and discovers clusters of arbitrary shape in spatial databases with noise. It defines a cluster as a maximal set of density-connected points.

The basic ideas of density-based clustering involve a number of new definitions. We intuitively present these definitions, and then follow up with an example.

The neighborhood within a radius $\epsilon$ of a given object is called the $\epsilon$-neighborhood of the object.

If the $\epsilon$-neighborhood of an object contains at least a minimum number, MinPts, of objects, then the object is called a core object.

Given a set of objects, D, we say that an object p is directly density-reachable from object q if p is within the $\epsilon$-neighborhood of q, and q is a core object.

An object p is density-reachable from object q with respect to $\epsilon$ and MinPts in a set of objects, D, if there is a chain of objects $p_1, \ldots, p_n$, with $p_1 = q$ and $p_n = p$, such that $p_{i+1}$ is directly density-reachable from $p_i$ with respect to $\epsilon$ and MinPts, for $1 \le i < n$, where every $p_i \in D$.
An object p is density-connected to object q with respect to $\epsilon$ and MinPts in a set of objects, D, if there is an object $o \in D$ such that both p and q are density-reachable from o with respect to $\epsilon$ and MinPts.

Density reachability is the transitive closure of direct density reachability, and this relationship is asymmetric. Only core objects are mutually density reachable. Density connectivity, however, is a symmetric relation.

Example 7.12 Density-reachability and density connectivity. Consider Figure 7.10 for a given $\epsilon$ represented by the radius of the circles, and, say, let MinPts = 3. Based on the above definitions:

Of the labeled points, m, p, o, and r are core objects, since the $\epsilon$-neighborhood of each contains at least three points.

q is directly density-reachable from m. m is directly density-reachable from p and vice versa.

q is (indirectly) density-reachable from p since q is directly density-reachable from m and m is directly density-reachable from p. However, p is not density-reachable from q since q is not a core object. Similarly, r and s are density-reachable from o, and o is density-reachable from r.

o, r, and s are all density-connected.

Figure 7.10: Density reachability and density connectivity in density-based clustering. Based on [EKSX96]. (Labeled points: m, p, q, o, r, s.)

A density-based cluster is a set of density-connected objects that is maximal with respect to density-reachability. Every object not contained in any cluster is considered to be noise.

"How does DBSCAN find clusters?" DBSCAN searches for clusters by checking the $\epsilon$-neighborhood of each point in the database. If the $\epsilon$-neighborhood of a point p contains at least MinPts objects, a new cluster with p as a core object is created. DBSCAN then iteratively collects directly density-reachable objects from these core objects, which may involve the merge of a few density-reachable clusters.
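The search procedure just described can be sketched as follows. This is a minimal illustration rather than the published implementation: it assumes Euclidean distance, counts a point as a member of its own $\epsilon$-neighborhood, and uses a brute-force neighborhood query (so it runs in quadratic time, with no spatial index). The names `region_query`, `dbscan`, and `NOISE` are chosen here for illustration.

```python
import math

NOISE = -1  # label for objects that belong to no cluster


def region_query(points, i, eps):
    """Indices of all points within distance eps of points[i] (including i)."""
    return [j for j, q in enumerate(points) if math.dist(points[i], q) <= eps]


def dbscan(points, eps, min_pts):
    """Return one label per point: a cluster id (0, 1, ...) or NOISE."""
    labels = [None] * len(points)  # None means "not yet visited"
    cluster_id = -1
    for i in range(len(points)):
        if labels[i] is not None:
            continue
        seeds = region_query(points, i, eps)
        if len(seeds) < min_pts:
            labels[i] = NOISE  # may be relabeled later as a border point
            continue
        cluster_id += 1  # points[i] is a core object: start a new cluster
        labels[i] = cluster_id
        queue = [j for j in seeds if j != i]
        while queue:
            j = queue.pop()
            if labels[j] == NOISE:
                labels[j] = cluster_id  # border point reached from a core object
            if labels[j] is not None:
                continue
            labels[j] = cluster_id
            nbrs = region_query(points, j, eps)
            if len(nbrs) >= min_pts:  # j is itself a core object: keep expanding
                queue.extend(nbrs)
    return labels


# Two dense squares and one isolated point.
points = [(0, 0), (0, 1), (1, 0), (1, 1),
          (10, 10), (10, 11), (11, 10), (11, 11),
          (5, 5)]
labels = dbscan(points, eps=1.5, min_pts=3)
print(labels)
```

In this toy data set, each square forms one cluster and the isolated point at (5, 5) is labeled as noise.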
The process terminates when no new point can be added to any cluster.

If a spatial index is used, the computational complexity of DBSCAN is $O(n \log n)$, where n is the number of database objects. Otherwise, it is $O(n^2)$. With appropriate settings of the user-defined parameters, $\epsilon$ and MinPts, the algorithm is effective at finding arbitrarily shaped clusters.

7.6.2 OPTICS: Ordering Points To Identify the Clustering Structure

Although DBSCAN can cluster objects given input parameters such as $\epsilon$ and MinPts, it still leaves the user with the responsibility of selecting parameter values that will lead to the discovery of acceptable clusters. Actually, this is a problem associated with many other clustering algorithms. Such parameter settings are usually empirically set and difficult to determine, especially for real-world, high-dimensional data sets. Most algorithms are very sensitive to such parameter values: slightly different settings may lead to very different clusterings of the data. Moreover, high-dimensional real data sets often have very skewed distributions such that their intrinsic clustering structure may not be characterized by global density parameters.

To help overcome this difficulty, a cluster analysis method called OPTICS was proposed. Rather than produce a data set clustering explicitly, OPTICS computes an augmented cluster ordering for automatic and interactive cluster analysis. This ordering represents the density-based clustering structure of the data. It contains information that is equivalent to density-based clustering obtained from a wide range of parameter settings. The cluster ordering can be used to extract basic clustering information (such as cluster centers, or arbitrary-shaped clusters), as well as provide the intrinsic clustering structure.
By examining DBSCAN, we can easily see that for a constant MinPts value, density-based clusters with respect to a higher density (i.e., a lower value for $\epsilon$) are completely contained in density-connected sets obtained with respect to a lower density. Recall that the parameter $\epsilon$ is a distance: it is the neighborhood radius. Therefore, in order to produce a set or ordering of density-based clusters, we can extend the DBSCAN algorithm to process a set of distance parameter values at the same time. To construct the different clusterings simultaneously, the objects should be processed in a specific order. This order selects an object that is density-reachable with respect to the lowest $\epsilon$ value so that clusters with higher density (lower $\epsilon$) will be finished first. Based on this idea, two values need to be stored for each object, core-distance and reachability-distance:

The core-distance of an object p is the smallest $\epsilon'$ value that makes p a core object. If p is not a core object, the core-distance of p is undefined.

The reachability-distance of an object q with respect to another object p is the greater value of the core-distance of p and the Euclidean distance between p and q. If p is not a core object, the reachability-distance between p and q is undefined.

Figure 7.11: OPTICS terminology. Based on [ABKS99]. (Left panel: core-distance of p, with $\epsilon$ = 6 mm and $\epsilon'$ = 3 mm. Right panel: reachability-distance$(p, q_1)$ = $\epsilon'$ = 3 mm; reachability-distance$(p, q_2)$ = $d(p, q_2)$.)

Example 7.13 Core-distance and reachability-distance. Figure 7.11 illustrates the concepts of core-distance and reachability-distance. Suppose that $\epsilon$ = 6 mm and MinPts = 5. The core-distance of p is the distance, $\epsilon'$, between p and the fourth closest data object.
The reachability-distance of $q_1$ with respect to p is the core-distance of p (i.e., $\epsilon'$ = 3 mm) since this is greater than the Euclidean distance from p to $q_1$. The reachability-distance of $q_2$ with respect to p is the Euclidean distance from p to $q_2$ since this is greater than the core-distance of p.

"How are these values used?" The OPTICS algorithm creates an ordering of the objects in a database, additionally storing the core-distance and a suitable reachability-distance for each object. An algorithm was proposed to extract clusters based on the ordering information produced by OPTICS. Such information is sufficient for the extraction of all density-based clusterings with respect to any distance $\epsilon'$ that is smaller than the distance $\epsilon$ used in generating the order.

The cluster ordering of a data set can be represented graphically, which helps in its understanding. For example, Figure 7.12 is the reachability plot for a simple two-dimensional data set, which presents a general overview of how the data are structured and clustered. The data objects are plotted in cluster order (horizontal axis) together with their respective reachability-distance (vertical axis). The three Gaussian "bumps" in the plot reflect three clusters in the data set. Methods have also been developed for viewing clustering structures of high-dimensional data at various levels of detail.

Figure 7.12: Cluster ordering in OPTICS. Based on [ABKS99]. (Horizontal axis: cluster order of the objects. Vertical axis: reachability-distance, with an "undefined" level marked.)
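The two quantities stored by OPTICS can be sketched directly from their definitions. The snippet below is an illustrative sketch, not the OPTICS algorithm itself: it assumes Euclidean distance, assumes that p is one of the objects in `points`, and follows the convention of Example 7.13 in which p counts toward MinPts (so with MinPts = 5 the core-distance is the distance to the fourth closest other object). The coordinates are made up for the demonstration, and `None` stands for "undefined".

```python
import math


def core_distance(p, points, eps, min_pts):
    """Smallest radius that makes p a core object, or None if p is not a
    core object with respect to eps. Assumes p itself is in `points`."""
    dists = sorted(math.dist(p, q) for q in points)  # dists[0] is d(p, p) = 0
    if len(dists) < min_pts or dists[min_pts - 1] > eps:
        return None
    return dists[min_pts - 1]


def reachability_distance(q, p, points, eps, min_pts):
    """Reachability-distance of q with respect to p, or None if p is not core."""
    cd = core_distance(p, points, eps, min_pts)
    if cd is None:
        return None
    return max(cd, math.dist(p, q))


# Hypothetical layout (units arbitrary): the fourth closest object to p is at distance 3.
p = (0.0, 0.0)
q1 = (0.0, 2.0)   # closer to p than the core-distance
q2 = (4.0, 0.0)   # farther from p than the core-distance
points = [p, (1.0, 0.0), q1, (-2.5, 0.0), (0.0, -3.0), q2, (7.0, 0.0)]

cd = core_distance(p, points, eps=6.0, min_pts=5)
r1 = reachability_distance(q1, p, points, eps=6.0, min_pts=5)
r2 = reachability_distance(q2, p, points, eps=6.0, min_pts=5)
print(cd, r1, r2)
```

As in the example, the reachability-distance of the nearby object equals the core-distance of p, while that of the more distant object equals its Euclidean distance from p.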
Because of the structural equivalence of the OPTICS algorithm to DBSCAN, the OPTICS algorithm has the same run-time complexity as that of DBSCAN, that is, $O(n \log n)$ if a spatial index is used, where n is the number of objects.

7.6.3 DENCLUE: Clustering Based on Density Distribution Functions

DENCLUE (DENsity-based CLUstEring) is a clustering method based on a set of density distribution functions. The method is built on the following ideas: (1) the influence of each data point can be formally modeled using a mathematical function, called an influence function, which describes the impact of a data point within its neighborhood; (2) the overall density of the data space can be modeled analytically as the sum of the influence function applied to all data points; and (3) clusters can then be determined mathematically by identifying density attractors, where density attractors are local maxima of the overall density function.

Let x and y be objects or points in $F^d$, a d-dimensional input space. The influence function of data object y on x is a function, $f_B^y : F^d \to R_0^+$, which is defined in terms of a basic influence function $f_B$:

$$f_B^y(x) = f_B(x, y). \qquad (7.31)$$

This reflects the impact of y on x. In principle, the influence function can be an arbitrary function that can be determined by the distance between two objects in a neighborhood. The distance function, $d(x, y)$, should be reflexive and symmetric, such as the Euclidean distance function (Section 7.2.1). It can be used to compute a square wave influence function,

$$f_{Square}(x, y) = \begin{cases} 0 & \text{if } d(x, y) > \sigma \\ 1 & \text{otherwise} \end{cases} \qquad (7.32)$$

or a Gaussian influence function,

$$f_{Gauss}(x, y) = e^{-\frac{d(x, y)^2}{2\sigma^2}}. \qquad (7.33)$$

To help understand the concept of influence function, the following example offers some additional insight.

Example 7.14 Influence function. Consider the square wave influence function of Equation (7.32).
If objects x and y are far apart from one another in the d-dimensional space, then the distance, $d(x, y)$, will be above some threshold, $\sigma$. In this case, the influence function returns a 0, representing the lack of influence between distant points. On the other hand, if x and y are "close" (where closeness is determined by the parameter $\sigma$), a value of 1 is returned, representing the notion that one influences the other.

Figure 7.13: Possible density functions for a 2-D data set: (a) data set; (b) square wave; (c) Gaussian. From [HK98].

The density function at an object or point $x \in F^d$ is defined as the sum of influence functions of all data points. That is, it is the total influence on x of all of the data points. Given n data objects, $D = \{x_1, \ldots, x_n\} \subset F^d$, the density function at x is defined as

$$f_B^D(x) = \sum_{i=1}^{n} f_B^{x_i}(x) = f_B^{x_1}(x) + f_B^{x_2}(x) + \cdots + f_B^{x_n}(x). \qquad (7.34)$$

For example, the density function that results from the Gaussian influence function (7.33) is

$$f_{Gauss}^D(x) = \sum_{i=1}^{n} e^{-\frac{d(x, x_i)^2}{2\sigma^2}}. \qquad (7.35)$$

Figure 7.13 shows a 2-D data set, together with the corresponding overall density functions for a square wave and a Gaussian influence function.

From the density function, we can define the gradient of the function and the density attractor, the local maxima of the overall density function. A point x is said to be density attracted to a density attractor $x^*$ if there exists a set of points $x_0, x_1, \ldots, x_k$ such that $x_0 = x$, $x_k = x^*$, and the gradient of $x_{i-1}$ is in the direction of $x_i$ for $0 < i < k$. Intuitively, a density attractor influences many other points. For a continuous and differentiable influence function, a hill-climbing algorithm guided by the gradient can be used to determine the density attractor of a set of data points.

In general, points that are density attracted to $x^*$ may form a cluster. Based on the above notions, both center-defined cluster and arbitrary-shape cluster can be formally defined.
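The Gaussian density function of Equation (7.35) and the gradient-guided hill climbing mentioned above can be sketched as follows. This is a simplified illustration, not the DENCLUE implementation: it assumes Euclidean distance, uses a fixed step length along the normalized gradient, and stops as soon as a step no longer increases the density. The function names, the step size, and the toy data are assumptions made for the example.

```python
import math


def gauss_influence(x, y, sigma):
    """Gaussian influence of data point y on location x, as in Equation (7.33)."""
    return math.exp(-math.dist(x, y) ** 2 / (2 * sigma ** 2))


def density(x, data, sigma):
    """Overall density at x: the sum of the influences of all data points."""
    return sum(gauss_influence(x, xi, sigma) for xi in data)


def gradient(x, data, sigma):
    """Gradient of the Gaussian density function at x."""
    g = [0.0] * len(x)
    for xi in data:
        w = gauss_influence(x, xi, sigma) / sigma ** 2
        for d in range(len(x)):
            g[d] += (xi[d] - x[d]) * w
    return g


def hill_climb(x, data, sigma, step=0.1, max_iter=1000):
    """Follow the gradient from x until the density stops increasing."""
    x = list(x)
    for _ in range(max_iter):
        g = gradient(x, data, sigma)
        norm = math.sqrt(sum(c * c for c in g))
        if norm == 0.0:
            break
        nxt = [xc + step * gc / norm for xc, gc in zip(x, g)]
        if density(nxt, data, sigma) <= density(x, data, sigma):
            break  # no further improvement: treat x as the approximate attractor
        x = nxt
    return x


# Two small groups of points, centered near (0, 0) and (5, 5).
group_a = [(-0.5, 0.0), (0.5, 0.0), (0.0, 0.5), (0.0, -0.5)]
group_b = [(a + 5.0, b + 5.0) for a, b in group_a]
data = group_a + group_b

attractor = hill_climb((0.8, 0.6), data, sigma=1.0)
print(attractor, density(attractor, data, sigma=1.0))
```

Starting points whose climbs end at (approximately) the same attractor would then be grouped together, subject to the density threshold introduced next.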
A center-defined cluster for a density attractor, $x^*$, is a subset of points, $C \subseteq D$, that are density-attracted by $x^*$, and where the density function at $x^*$ is no less than a threshold, $\xi$. Points that are density-attracted by $x^*$, but for which the density function value is less than $\xi$, are considered outliers. That is, intuitively, points in a cluster are influenced by many points, but outliers are not. An arbitrary-shape cluster for a set of density attractors is a set of C's, each being density-attracted to its respective density attractor, where (1) the density function value at each density attractor is no less than a threshold, $\xi$, and (2) there exists a path, P, from each density attractor to another, where the density function value for each point along the path is no less than $\xi$. Examples of center-defined and arbitrary-shape clusters are shown in Figure 7.14.

Figure 7.14: Examples of center-defined clusters (top row) and arbitrary-shape clusters (bottom row). From [HK98].

"What major advantages does DENCLUE have in comparison with other clustering algorithms?" There are several: (1) it has a solid mathematical foundation and generalizes various clustering methods, including partitioning, hierarchical, and density-based methods; (2) it has good clustering properties for data sets with large amounts of noise; (3) it allows a compact mathematical description of arbitrarily shaped clusters in high-dimensional data sets; and (4) it uses grid cells yet only keeps information about grid cells that actually contain data points. It manages these cells in a tree-based access structure, and thus is significantly faster than some influential algorithms, such as DBSCAN.
However, the method requires careful selection of the density parameter $\sigma$ and noise threshold $\xi$, as the selection of such parameters may significantly influence the quality of the clustering results.

7.7 Grid-Based Methods

The grid-based clustering approach uses a multiresolution grid data structure. It quantizes the object space into a finite number of cells that form a grid structure on which all of the operations for clustering are performed. The main advantage of the approach is its fast processing time, which is typically independent of the number of data objects, yet dependent on only the number of cells in each dimension in the quantized space.

Some typical examples of the grid-based approach include STING, which explores statistical information stored in the grid cells; WaveCluster, which clusters objects using a wavelet transform method; and CLIQUE, which represents a grid- and density-based approach for clustering in high-dimensional data space that will be introduced in Section 7.9.

7.7.1 STING: STatistical INformation Grid

STING is a grid-based multiresolution clustering technique in which the spatial area is divided into rectangular cells. There are usually several levels of such rectangular cells corresponding to different levels of resolution, and these cells form a hierarchical structure: each cell at a high level is partitioned to form a number of cells at the next lower level. Statistical information regarding the attributes in each grid cell (such as the mean, maximum, and minimum values) is precomputed and stored. These statistical parameters are useful for query processing, as described below.

Figure 7.15: A hierarchical structure for STING clustering. (Layers shown: 1st layer, ..., (i-1)-st layer, ith layer.)

Figure 7.15 shows a hierarchical structure for STING clustering. Statistical parameters of higher-level cells can easily be computed from the parameters of the lower-level cells.
These parameters include the following: the attribute-independent parameter, count; and the attribute-dependent parameters, mean, stdev (standard deviation), min (minimum), max (maximum), and the type of distribution that the attribute value in the cell follows, such as normal, uniform, exponential, or none (if the distribution is unknown). When the data are loaded into the database, the parameters count, mean, stdev, min, and max of the bottom-level cells are calculated directly from the data. The value of distribution may either be assigned by the user if the distribution type is known beforehand or obtained by hypothesis tests such as the $\chi^2$ test. The type of distribution of a higher-level cell can be computed based on the majority of distribution types of its corresponding lower-level cells in conjunction with a threshold filtering process. If the distributions of the lower-level cells disagree with each other and fail the threshold test, the distribution type of the high-level cell is set to none.

"How is this statistical information useful for query-answering?" The statistical parameters can be used in a top-down, grid-based method as follows. First, a layer within the hierarchical structure is determined from which the query-answering process is to start. This layer typically contains a small number of cells. For each cell in the current layer, we compute the confidence interval (or estimated range of probability) reflecting the cell's relevancy to the given query. The irrelevant cells are removed from further consideration. Processing of the next lower level examines only the remaining relevant cells. This process is repeated until the bottom layer is reached. At this time, if the query specification is met, the regions of relevant cells that satisfy the query are returned. Otherwise, the data that fall into the relevant cells are retrieved and further processed until they meet the requirements of the query.
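The statement that the parameters of a higher-level cell can be computed from those of its lower-level cells can be illustrated with a short sketch. It assumes that stdev is the population standard deviation and that at least one child cell is non-empty; the `CellStats` class and the function names are hypothetical and are not taken from the STING paper. The distribution type is omitted, since it requires the separate majority and threshold test described above.

```python
import math
from dataclasses import dataclass


@dataclass
class CellStats:
    count: int
    mean: float
    stdev: float     # population standard deviation (assumption of this sketch)
    minimum: float
    maximum: float


def cell_from_values(values):
    """Bottom-level cell: parameters computed directly from the (non-empty) data."""
    n = len(values)
    m = sum(values) / n
    s = math.sqrt(sum((v - m) ** 2 for v in values) / n)
    return CellStats(n, m, s, min(values), max(values))


def merge_cells(children):
    """Parent cell: parameters derived only from the children's parameters."""
    children = [c for c in children if c.count > 0]
    n = sum(c.count for c in children)
    mean = sum(c.count * c.mean for c in children) / n
    # E[X^2] of the parent, rebuilt from each child's variance and mean.
    second_moment = sum(c.count * (c.stdev ** 2 + c.mean ** 2) for c in children) / n
    stdev = math.sqrt(max(second_moment - mean ** 2, 0.0))
    return CellStats(n, mean, stdev,
                     min(c.minimum for c in children),
                     max(c.maximum for c in children))


# Three child cells holding one attribute's values.
child_values = [[1.0, 2.0, 3.0], [10.0], [4.0, 6.0]]
parent = merge_cells([cell_from_values(v) for v in child_values])

# Reference: the same parameters computed from the raw data in one pass.
direct = cell_from_values([v for vals in child_values for v in vals])
print(parent)
print(direct)
```

The parent's parameters agree (up to floating-point rounding) with those computed directly from the raw values, without touching the raw values again.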
"What advantages does STING offer over other clustering methods?" STING offers several advantages: (1) the grid-based computation is query-independent since the statistical information stored in each cell represents the summary information of the data in the grid cell, independent of the query; (2) the grid structure facilitates parallel processing and incremental updating; and (3) the method's efficiency is a major advantage: STING goes through the database once to compute the statistical parameters of the cells, and hence the time complexity of generating clusters is $O(n)$, where n is the total number of objects. After generating the hierarchical structure, the query processing time is $O(g)$, where g is the total number of grid cells at the lowest level, which is usually much smaller than n.

Since STING uses a multiresolution approach to cluster analysis, the quality of STING clustering depends on the granularity of the lowest level of the grid structure. If the granularity is very fine, the cost of processing will increase substantially; however, if the bottom level of the grid structure is too coarse, it may reduce the quality of cluster analysis. Moreover, STING does not consider the spatial relationship between the children and their neighboring cells for construction of a parent cell. As a result, the shapes of the resulting clusters are isothetic, that is, all of the cluster boundaries are either horizontal or vertical, and no diagonal boundary is detected. This may lower the quality and accuracy of the clusters despite the fast processing time of the technique.

7.7.2 WaveCluster: Clustering Using Wavelet Transformation

WaveCluster is a multiresolution clustering algorithm that first summarizes the data by imposing a multidimensional grid structure onto the data space. It then uses a wavelet transformation to transform the original feature space, finding dense regions in the transformed space.
In this approach, each grid cell summarizes the information of a group of points that map into the cell. This summary information typically fits into main memory for use by the multiresolution wavelet transform and the subsequent cluster analysis.

A wavelet transform is a signal processing technique that decomposes a signal into different frequency subbands. The wavelet model can be applied to d-dimensional signals by applying a one-dimensional wavelet transform d times. In applying a wavelet transform, data are transformed so as to preserve the relative distance between objects at different levels of resolution. This allows the natural clusters in the data to become more distinguishable. Clusters can then be identified by searching for dense regions in the new domain. Wavelet transforms are also discussed in Chapter 2, where they are used for data reduction by compression. Additional references to the technique are given in the bibliographic notes.

"Why is wavelet transformation useful for clustering?" It offers the following advantages:

It provides unsupervised clustering. It uses hat-shape filters that emphasize regions where the points cluster, while at the same time suppressing weaker information outside of the cluster boundaries. Thus, dense regions in the original feature space act as attractors for nearby points and as inhibitors for points that are further away. This means that the clusters in the data automatically stand out and "clear" the regions around them. Thus, another advantage is that wavelet transformation can automatically result in the removal of outliers.

The multiresolution property of wavelet transformations can help in the detection of clusters at varying levels of accuracy. For example, Figure 7.16 shows a sample of two-dimensional feature space, where each point in the image represents the attribute or feature values of one object in the spatial data set.
Figure 7.17 shows the resulting wavelet transformation at different resolutions, from a fine scale (scale 1) to a coarse scale (scale 3). At each level, the four subbands into which the original data are decomposed are shown. The subband shown in the upper-left quadrant emphasizes the average neighborhood around each data point. The subband in the upper-right quadrant emphasizes the horizontal edges of the data. The subband in the lower-left quadrant emphasizes the vertical edges, while the subband in the lower-right quadrant emphasizes the corners.

Figure 7.16: A sample of two-dimensional feature space. From [SCZ98].

Figure 7.17: Multiresolution of the feature space in Figure 7.16 at (a) scale 1 (high resolution); (b) scale 2 (medium resolution); (c) scale 3 (low resolution). From [SCZ98].

Wavelet-based clustering is very fast, with a computational complexity of $O(n)$, where n is the number of objects in the database. The algorithm implementation can be made parallel.

WaveCluster is a grid-based and density-based algorithm. It conforms with many of the requirements of a good clustering algorithm: it handles large data sets efficiently, discovers clusters with arbitrary shape, successfully handles outliers, is insensitive to the order of input, and does not require the specification of input parameters such as the number of clusters or a neighborhood radius. In experimental studies, WaveCluster was found to outperform BIRCH, CLARANS, and DBSCAN in terms of both efficiency and clustering quality. The study also found WaveCluster capable of handling data with up to 20 dimensions.

7.8 Model-Based Clustering Methods

Model-based clustering methods attempt to optimize the fit between the given data and some mathematical model. Such methods are often based on the assumption that the data are generated by a mixture of underlying probability distributions. In this section, we describe three examples of model-based clustering.
Section 7.8.1 presents an extension of the k-means partitioning algorithm, called Expectation-Maximization. Conceptual clustering is discussed in Section 7.8.2. A neural network approach to clustering is given in Section 7.8.3.

7.8.1 Expectation-Maximization

In practice, each cluster can be represented mathematically by a parametric probability distribution. The entire data set is a mixture of these distributions, where each individual distribution is typically referred to as a component distribution. We can therefore cluster the data using a finite mixture density model of k probability distributions, where each distribution represents a cluster. The problem is to estimate the parameters of the probability distributions so as to best fit the data. Figure 7.18 is an example of a simple finite mixture density model. There are two clusters. Each follows a normal or Gaussian distribution with its own mean and standard deviation.

Figure 7.18: Each cluster can be represented by a probability distribution, centered at a mean, and with a standard deviation. Here, we have two clusters, corresponding to the Gaussian distributions $g(m_1, \sigma_1)$ and $g(m_2, \sigma_2)$, respectively, where the circles represent the first standard deviation of the distributions.

The EM (Expectation-Maximization) algorithm is a popular iterative refinement algorithm that can be used for finding the parameter estimates. It can be viewed as an extension of the k-means paradigm, which assigns an object to the cluster with which it is most similar, based on the cluster mean (Section 7.4.1).
Instead of assigning each object to a dedicated cluster, EM assigns each object to a cluster according to a weight representing the probability of membership. In other words, there are no strict boundaries between clusters. Therefore, new means are computed based on weighted measures.

EM starts with an initial estimate or "guess" of the parameters of the mixture model (collectively referred to as the parameter vector). It iteratively re-scores the objects against the mixture density produced by the parameter vector. The re-scored objects are then used to update the parameter estimates. Each object is assigned a probability that it would possess a certain set of attribute values given that it was a member of a given cluster. The algorithm is described as follows.

1. Make an initial guess of the parameter vector: This involves randomly selecting k objects to represent the cluster means or centers (as in k-means partitioning), as well as making guesses for the additional parameters.

2. Iteratively refine the parameters (or clusters) based on the following two steps:

(a) Expectation Step: Assign each object $x_i$ to cluster $C_k$ with the probability

$$P(x_i \in C_k) = p(C_k \mid x_i) = \frac{p(C_k)\,p(x_i \mid C_k)}{p(x_i)}, \qquad (7.36)$$

where $p(x_i \mid C_k) = N(m_k, E_k(x_i))$ follows the normal (i.e., Gaussian) distribution around mean, $m_k$, with expectation, $E_k$. In other words, this step calculates the probability of cluster membership of object $x_i$, for each of the clusters. These probabilities are the "expected" cluster memberships for object $x_i$.

(b) Maximization Step: Use the probability estimates from above to re-estimate (or refine) the model parameters. For example,

$$m_k = \frac{1}{n} \sum_{i=1}^{n} \frac{x_i\,P(x_i \in C_k)}{\sum_j P(x_i \in C_j)}. \qquad (7.37)$$

This step is the "maximization" of the likelihood of the distributions given the data.

The EM algorithm is simple and easy to implement. In practice, it converges fast, but may not reach the global optimum.
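The two steps can be sketched for a one-dimensional mixture of Gaussians as follows. This is a minimal illustration under several assumptions: the initial parameters are supplied by the caller rather than chosen at random, a fixed number of iterations replaces a convergence test, and the maximization step uses the standard membership-weighted updates (each sum is normalized by the total membership weight of the cluster) rather than a literal transcription of Equation (7.37). The function names and the toy data are hypothetical.

```python
import math


def normal_pdf(x, mean, std):
    """Density of the normal distribution N(mean, std^2) at x."""
    return math.exp(-0.5 * ((x - mean) / std) ** 2) / (std * math.sqrt(2 * math.pi))


def em_gmm_1d(data, means, stds, priors, n_iter=50):
    """EM for a 1-D Gaussian mixture; returns (means, stds, priors, memberships)."""
    k = len(means)
    means, stds, priors = list(means), list(stds), list(priors)
    resp = []
    for _ in range(n_iter):
        # Expectation step: membership probability of each object in each cluster.
        resp = []
        for x in data:
            joint = [priors[c] * normal_pdf(x, means[c], stds[c]) for c in range(k)]
            total = sum(joint)  # plays the role of p(x_i)
            resp.append([j / total for j in joint])
        # Maximization step: re-estimate parameters from the weighted objects.
        for c in range(k):
            weight = sum(r[c] for r in resp)
            means[c] = sum(r[c] * x for r, x in zip(resp, data)) / weight
            var = sum(r[c] * (x - means[c]) ** 2 for r, x in zip(resp, data)) / weight
            stds[c] = max(math.sqrt(var), 1e-6)  # guard against a collapsing cluster
            priors[c] = weight / len(data)
    return means, stds, priors, resp


# Toy data drawn by hand around 1.0 and 5.0.
data = [1.0, 1.2, 0.8, 1.1, 0.9, 5.0, 5.2, 4.8, 5.1, 4.9]
means, stds, priors, resp = em_gmm_1d(
    data, means=[0.0, 6.0], stds=[1.0, 1.0], priors=[0.5, 0.5])
print(means, stds, priors)
```

On well-separated data such as this, the memberships become almost crisp and the result resembles a k-means partition; with overlapping clusters, the memberships stay fractional, which is the "no strict boundaries" behavior described above.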
Convergence is guaranteed for certain forms of optimization functions. The computational complexity is linear in d (the number of input features), n (the number of objects), and t (the number of iterations).

Bayesian clustering methods focus on the computation of class-conditional probability density. They are commonly used in the statistics community. In industry, AutoClass is a popular Bayesian clustering method that uses a variant of the EM algorithm. The best clustering is the one that maximizes the ability to predict the attributes of an object given the correct cluster of the object. AutoClass can also estimate the number of clusters. It has been applied to several domains and was able to discover a new class of stars based on infrared astronomy data. Further references are provided in the bibliographic notes.

7.8.2 Conceptual Clustering

Conceptual clustering is a form of clustering in machine learning that, given a set of unlabeled objects, produces a classification scheme over the objects. Unlike conventional clustering, which primarily identifies groups of like objects, conceptual clustering goes one step further by also finding characteristic descriptions for each group, where each group represents a concept or class. Hence, conceptual clustering is a two-step process: first, clustering is performed, followed by characterization. Here, clustering quality is not solely a function of the individual objects. Rather, it incorporates factors such as the generality and simplicity of the derived concept descriptions.

Most methods of conceptual clustering adopt a statistical approach that uses probability measurements in determining the concepts or clusters. Probabilistic descriptions are typically used to represent each derived concept.

COBWEB is a popular and simple method of incremental conceptual clustering. Its input objects are described by categorical attribute-value pairs. COBWEB creates a hierarchical clustering in the form of a classification tree.
Animal: P(C0) = 1.0, P(scales | C0) = 0.25, ...
    Fish: P(C1) = 0.25, P(scales | C1) = 1.0, ...
    Amphibian: P(C2) = 0.25, P(moist | C2) = 1.0, ...
    Mammal/bird: P(C3) = 0.5, P(hair | C3) = 0.5, ...
        Mammal: P(C4) = 0.5, P(hair | C4) = 1.0, ...
        Bird: P(C5) = 0.5, P(feathers | C5) = 1.0, ...

Figure 7.19: A classification tree. Figure is based on [Fis87].

"But, what is a classification tree? Is it the same as a decision tree?" Figure 7.19 shows a classification tree for a set of animal data. A classification tree differs from a decision tree. Each node in a classification tree refers to a concept and contains a probabilistic description of that concept, which summarizes the objects classified under the node. The probabilistic description includes the probability of the concept and conditional probabilities of the form $P(A_i = v_{ij} \mid C_k)$, where $A_i = v_{ij}$ is an attribute-value pair (that is, the ith attribute takes its jth possible value) and $C_k$ is the concept class. (Counts are accumulated and stored at each node for computation of the probabilities.) This is unlike decision trees, which label branches rather than nodes and use logical rather than probabilistic descriptors.[3] The sibling nodes at a given level of a classification tree are said to form a partition. To classify an object using a classification tree, a partial matching function is employed to descend the tree along a path of "best" matching nodes.

COBWEB uses a heuristic evaluation measure called category utility to guide construction of the tree. Category utility (CU) is defined as

$$\frac{\sum_{k=1}^{n} P(C_k) \left[ \sum_i \sum_j P(A_i = v_{ij} \mid C_k)^2 - \sum_i \sum_j P(A_i = v_{ij})^2 \right]}{n}, \qquad (7.38)$$

where n is the number of nodes, concepts, or "categories" forming a partition, $\{C_1, C_2, \ldots, C_n\}$, at the given level of the tree.
In other words, category utility is the increase in the expected number of attribute values that can be correctly guessed given a partition (where this expected number corresponds to the term $P(C_k) \sum_i \sum_j P(A_i = v_{ij} \mid C_k)^2$) over the expected number of correct guesses with no such knowledge (corresponding to the term $\sum_i \sum_j P(A_i = v_{ij})^2$). Although we do not have room to show the derivation, category utility rewards intraclass similarity and interclass dissimilarity, where:

Intraclass similarity is the probability $P(A_i = v_{ij} \mid C_k)$. The larger this value is, the greater the proportion of class members that share this attribute-value pair, and the more predictable the pair is of class members.

Interclass dissimilarity is the probability $P(C_k \mid A_i = v_{ij})$. The larger this value is, the fewer the objects in contrasting classes that share this attribute-value pair, and the more predictive the pair is of the class.

Let's have a look at how COBWEB works. COBWEB incrementally incorporates objects into a classification tree.

"Given a new object, how does COBWEB decide where to incorporate it into the classification tree?" COBWEB descends the tree along an appropriate path, updating counts along the way, in search of the "best host" or node at which to classify the object. This decision is based on temporarily placing the object in each node and computing the category utility of the resulting partition. The placement that results in the highest category utility should be a good host for the object.

"What if the object does not really belong to any of the concepts represented in the tree so far? What if it is better to create a new node for the given object?" That is a good point. In fact, COBWEB also computes the category utility of the partition that would result if a new node were to be created for the object. This is compared to the above computation based on the existing nodes.
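The placement decision described above can be sketched for a single level of the tree. This is a simplified illustration, not COBWEB itself: the function names and the representation of objects as attribute-value dictionaries are assumptions, probabilities are estimated directly from counts, and the descent through the tree, merging, and splitting are omitted.

```python
from collections import Counter


def category_utility(partition):
    """Category utility (Eq. 7.38) of a partition.

    `partition` is a list of clusters; each cluster is a non-empty list of
    objects; each object is a dict mapping attribute name -> categorical value.
    """
    all_objects = [obj for cluster in partition for obj in cluster]
    total = len(all_objects)

    def sum_sq_probs(objects):
        # sum_i sum_j P(A_i = v_ij)^2, with probabilities estimated from counts.
        counts = Counter((attr, val) for obj in objects for attr, val in obj.items())
        return sum((c / len(objects)) ** 2 for c in counts.values())

    baseline = sum_sq_probs(all_objects)
    gain = sum((len(c) / total) * (sum_sq_probs(c) - baseline) for c in partition)
    return gain / len(partition)


def best_placement(partition, obj):
    """Index of the best host for obj; len(partition) means 'create a new node'."""
    candidates = []
    for idx in range(len(partition)):
        # Temporarily place obj in cluster idx and score the resulting partition.
        trial = [c + [obj] if i == idx else c for i, c in enumerate(partition)]
        candidates.append((category_utility(trial), idx))
    # Also score the partition in which obj forms a new cluster of its own.
    candidates.append((category_utility(partition + [[obj]]), len(partition)))
    return max(candidates)[1]
```

For a partition with one cluster of scaled, legless animals and one of hairy, four-legged animals, another scaled, legless object is hosted by the first cluster, whereas a feathered, two-legged object scores highest as a new node.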
The object is then placed in an existing class, or a new class is created for it, based on the partition with the highest category utility value. Notice that COBWEB has the ability to automatically adjust the number of classes in a partition. It does not need to rely on the user to provide such an input parameter.

The two operators mentioned above are highly sensitive to the input order of the objects. COBWEB has two additional operators that help make it less sensitive to input order. These are merging and splitting. When an object is incorporated, the two best hosts are considered for merging into a single class. Furthermore, COBWEB considers splitting the children of the best host among the existing categories. These decisions are based on category utility. The merging and splitting operators allow COBWEB to perform a bidirectional search. For example, a merge can undo a previous split.

COBWEB has a number of limitations. First, it is based on the assumption that probability distributions on separate attributes are statistically independent of one another. This assumption is, however, not always true, since correlation between attributes often exists. Moreover, the probability distribution representation of clusters makes it quite expensive to update and store the clusters. This is especially so when the attributes have a large number of values, since the time and space complexities depend not only on the number of attributes, but also on the number of values for each attribute. Furthermore, the classification tree is not height-balanced for skewed input data, which may cause the time and space complexity to degrade dramatically.

[3] Decision trees are described in Chapter 6.

CLASSIT is an extension of COBWEB for incremental clustering of continuous (or real-valued) data.
It stores a continuous normal distribution (i.e., mean and standard deviation) for each individual attribute in each node and uses a modified category utility measure that is an integral over continuous attributes instead of a sum over discrete attributes as in COBWEB. However, it suffers from problems similar to those of COBWEB and thus is not suitable for clustering large database data.

Conceptual clustering is popular in the machine learning community. However, the method does not scale well for large data sets.

7.8.3 Neural Network Approach

The neural network approach is motivated by biological neural networks.[4] Roughly speaking, a neural network is a set of connected input/output units, where each connection has a weight associated with it. Neural networks have several properties that make them popular for clustering. First, neural networks are inherently parallel and distributed processing architectures. Second, neural networks learn by adjusting their interconnection weights so as to best fit the data. This allows them to "normalize" or "prototype" the patterns and act as feature (or attribute) extractors for the various clusters. Third, neural networks process numerical vectors and require object patterns to be represented by quantitative features only. Many clustering tasks handle only numerical data or can transform their data into quantitative features if needed.

The neural network approach to clustering tends to represent each cluster as an exemplar. An exemplar acts as a "prototype" of the cluster and does not necessarily have to correspond to a particular data example or object. New objects can be distributed to the cluster whose exemplar is the most similar, based on some distance measure. The attributes of an object assigned to a cluster can be predicted from the attributes of the cluster's exemplar.

Self-organizing feature maps (SOMs) are one of the most popular neural network methods for cluster analysis.
They are sometimes referred to as Kohonen self-organizing feature maps, after their creator, Teuvo Kohonen, or as topologically ordered maps. The goal of SOMs is to represent all points in a high-dimensional source space by points in a low-dimensional (usually 2-D or 3-D) target space, such that the distance and proximity relationships (hence the topology) are preserved as much as possible. The method is particularly useful when a nonlinear mapping is inherent in the problem itself. SOMs can also be viewed as a constrained version of k-means clustering, in which the cluster centers tend to lie in a low-dimensional manifold in the feature or attribute space.

With SOMs, clustering is performed by having several units compete for the current object. The unit whose weight vector is closest to the current object becomes the winning or active unit. So as to move even closer to the input object, the weights of the winning unit are adjusted, as well as those of its nearest neighbors. SOMs assume that there is some topology or ordering among the input objects, and that the units will eventually take on this structure in space. The organization of units is said to form a feature map. SOMs are believed to resemble processing that can occur in the brain and are useful for visualizing high-dimensional data in 2-D or 3-D space.

The SOM approach has been used successfully for Web document clustering. The left graph of Figure 7.20 shows the result of clustering 12,088 Web articles from comp.ai.neural-nets using the SOM approach, while the right graph of the figure shows the result of drilling down on the keyword "mining".

The neural network approach to clustering has strong theoretical links with actual brain processing. Further research is required to make it more effective and scalable in large databases, because of long processing times and the intricacies of complex data.
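The competitive update described for SOMs can be sketched as follows. The text specifies only that the winning unit and its nearest neighbors move toward the input object; the Gaussian neighborhood function, the linearly shrinking learning rate and radius, the 2-D grid layout, and all function names below are assumptions made for illustration.

```python
import math
import random


def best_matching_unit(weights, x):
    """Grid position of the unit whose weight vector is closest to x."""
    return min(weights, key=lambda u: sum((w - v) ** 2 for w, v in zip(weights[u], x)))


def som_step(weights, x, learning_rate, radius):
    """Move the winning unit and its grid neighbors toward x; return the winner."""
    winner = best_matching_unit(weights, x)
    for unit, w in weights.items():
        grid_dist_sq = (unit[0] - winner[0]) ** 2 + (unit[1] - winner[1]) ** 2
        # Gaussian neighborhood: 1 for the winner, decaying with grid distance.
        influence = math.exp(-grid_dist_sq / (2.0 * radius * radius))
        for d in range(len(x)):
            w[d] += learning_rate * influence * (x[d] - w[d])
    return winner


def train_som(data, rows, cols, epochs=20, seed=0):
    """Train a rows x cols map; learning rate and radius shrink linearly."""
    rng = random.Random(seed)
    dim = len(data[0])
    weights = {(r, c): [rng.random() for _ in range(dim)]
               for r in range(rows) for c in range(cols)}
    total_steps = epochs * len(data)
    step = 0
    for _ in range(epochs):
        for x in rng.sample(data, len(data)):   # visit objects in random order
            remaining = 1.0 - step / total_steps
            som_step(weights, x, 0.5 * remaining, max(rows, cols) / 2.0 * remaining)
            step += 1
    return weights
```

After training, each object can be assigned to the cluster of its best matching unit, so the units play the role of exemplars arranged on a low-dimensional grid.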
[4] Neural networks were also introduced in Chapter 6 on classification and prediction.

Figure 7.20: The result of SOM clustering of 12,088 Web articles on comp.ai.neural-nets (left), and of drilling down on the keyword "mining" (right). Based on http://websom.hut.fi/websom/comp.ai.neural-nets-new.

7.9 Clustering High-Dimensional Data

A large majority of clustering methods are designed for clustering low-dimensional data and encounter challenges when the dimensionality of the data grows really high (say, over 10 dimensions, or even over thousands of dimensions for some tasks). This is because when the dimensionality increases, usually only a small number of dimensions are relevant to certain clusters, but data in the irrelevant dimensions may produce much noise and mask the real clusters to be discovered. Moreover, when dimensionality increases, data usually become increasingly sparse because the data points are likely located in different dimensional subspaces. When the data become really sparse, data points located in different dimensions can be considered as all equally distant from one another, and the distance measure, which is essential for cluster analysis, becomes meaningless.

To overcome this difficulty, we may consider using feature (or attribute) transformation and feature (or attribute) selection techniques.

Feature transformation methods, such as principal component analysis[5] and singular value decomposition[6], transform the data onto a smaller space while generally preserving the original relative distance between objects. They summarize data by creating linear combinations of the attributes, and may discover hidden structures in the data. However, such techniques do not actually remove any of the original attributes from analysis. This is problematic when there are a large number of irrelevant attributes. The irrelevant information may mask the real clusters, even after transformation.
Moreover, the transformed features (attributes) are often difficult to interpret, making the clustering results less useful. Thus, feature transformation is only suited to data sets where most of the dimensions are relevant to the clustering task. Unfortunately, real-world data sets tend to have many highly correlated, or redundant, dimensions.

[5] Principal component analysis was introduced in Chapter 2 as a method of data compression (Section 2.5.3).
[6] Singular value decomposition is discussed in detail in Chapter 8.

Another way of tackling the curse of dimensionality is to try to remove some of the dimensions. Attribute subset selection (or feature subset selection[7]) is commonly used for data reduction by removing irrelevant or redundant dimensions (or attributes). Given a set of attributes, attribute subset selection finds the subset of attributes that are most relevant to the data mining task. Attribute subset selection involves searching through various attribute subsets and evaluating these subsets using certain criteria. It is most commonly performed by supervised learning: the most relevant set of attributes is found with respect to the given class labels. It can also be performed by an unsupervised process, such as entropy analysis, which is based on the property that entropy tends to be low for data that contain tight clusters. Other evaluation functions, such as category utility, may also be used.

Subspace clustering is an extension to attribute subset selection that has shown its strength at high-dimensional clustering. It is based on the observation that different subspaces may contain different, meaningful clusters. Subspace clustering searches for groups of clusters within different subspaces of the same data set. The problem becomes how to find such subspace clusters effectively and efficiently.
In this section, we introduce three approaches for effective clustering of high-dimensional data: dimension-growth subspace clustering, represented by CLIQUE; dimension-reduction projected clustering, represented by PROCLUS; and frequent pattern-based clustering, represented by pCluster.

7.9.1 CLIQUE: A Dimension-Growth Subspace Clustering Method

CLIQUE (CLustering In QUEst) was the first algorithm proposed for dimension-growth subspace clustering in high-dimensional space. In dimension-growth subspace clustering, the clustering process starts at single-dimensional subspaces and grows upward to higher-dimensional ones. Since CLIQUE partitions each dimension like a grid structure and determines whether a cell is dense based on the number of points it contains, it can also be viewed as an integration of density-based and grid-based clustering methods. However, its overall approach is typical of subspace clustering for high-dimensional space, and so it is introduced in this section.

The ideas of the CLIQUE clustering algorithm are outlined as follows. Given a large set of multidimensional data points, the data space is usually not uniformly occupied by the data points. CLIQUE's clustering identifies the sparse and the "crowded" areas in space (or units), thereby discovering the overall distribution patterns of the data set. A unit is dense if the fraction of total data points contained in it exceeds an input model parameter. In CLIQUE, a cluster is defined as a maximal set of connected dense units.

"How does CLIQUE work?" CLIQUE performs multidimensional clustering in two steps.

In the first step, CLIQUE partitions the d-dimensional data space into nonoverlapping rectangular units, identifying the dense units among these. This is done (in 1-D) for each dimension. For example, Figure 7.21 shows dense rectangular units found with respect to age for the dimensions salary and (number of weeks of) vacation.
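The 1-D pass of this first step can be sketched as follows. This is a minimal illustration: the function name, the equal-width intervals, and the caller-supplied bounds `low` and `high` are assumptions, and `density_threshold` stands for the input model parameter mentioned above. Values are assumed to lie within `[low, high]`.

```python
from collections import Counter


def dense_units_1d(points, dim, low, high, n_intervals, density_threshold):
    """Indices of the intervals along `dim` whose share of points exceeds the threshold."""
    width = (high - low) / n_intervals
    counts = Counter()
    for p in points:
        # Clamp so that a value equal to `high` falls in the last interval.
        idx = min(int((p[dim] - low) / width), n_intervals - 1)
        counts[idx] += 1
    return sorted(i for i, c in counts.items() if c / len(points) > density_threshold)
```

Running this once per dimension yields the 1-D dense units from which higher-dimensional candidates are then formed.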
The subspaces representing these dense units are intersected to form a candidate search space in which dense units of higher dimensionality may exist.

"Why does CLIQUE confine its search for dense units of higher dimensionality to the intersection of the dense units in the subspaces?" The identification of the candidate search space is based on the Apriori property used in association rule mining.[8] In general, the property employs prior knowledge of items in the search space so that portions of the space can be pruned. The property, adapted for CLIQUE, states the following: If a k-dimensional unit is dense, then so are its projections in (k - 1)-dimensional space. That is, given a k-dimensional candidate dense unit, if we check its (k - 1)-dimensional projection units and find any that are not dense, then we know that the k-dimensional unit cannot be dense either. Therefore, we can generate potential or candidate dense units in k-dimensional space from the dense units found in (k - 1)-dimensional space. In general, the resulting space searched is much smaller than the original space. The dense units are then examined in order to determine the clusters.

Figure 7.21: Dense units found with respect to age for the dimensions salary and vacation are intersected in order to provide a candidate search space for dense units of higher dimensionality. [Panels: salary (10,000) versus age; vacation (week) versus age; and the intersection over age, vacation, and salary.]

[7] Attribute subset selection is known in the machine learning literature as feature subset selection. It was discussed in Chapter 2 as a form of data reduction (Section 2.5.2).
[8] Association rule mining is described in detail in Chapter 5. In particular, the Apriori property is described in Section 5.2.1. The Apriori property can also be used for cube computation, as described in Chapter 4.

In the second step, CLIQUE generates a minimal description for each cluster as follows.
For each cluster, it determines the maximal region that covers the cluster of connected dense units. It then determines a minimal cover (logic description) for each cluster.

"How effective is CLIQUE?" CLIQUE automatically finds subspaces of the highest dimensionality such that high-density clusters exist in those subspaces. It is insensitive to the order of input objects and does not presume any canonical data distribution. It scales linearly with the size of input and has good scalability as the number of dimensions in the data is increased. However, obtaining meaningful clustering results is dependent on proper tuning of the grid size (which is a stable structure here) and the density threshold. This is particularly difficult because the grid size and density threshold are used across all combinations of dimensions in the data set. Thus, the simplicity of the method may come at the expense of the accuracy of the clustering results. Moreover, for a given dense region, all projections of the region onto lower-dimensionality subspaces will also be dense. This can result in a large overlap among the reported dense regions. Furthermore, it is difficult to find clusters of rather different density within different dimensional subspaces.

There are several extensions to this approach that follow a similar philosophy. For example, let's think of a grid as a set of fixed bins. Instead of using fixed bins for each of the dimensions, we can use an adaptive, data-driven strategy to dynamically determine the bins for each dimension based on data distribution statistics. Alternatively, instead of using a density threshold, we could use entropy (Chapter 6) as a measure of the quality of subspace clusters.

7.9.2 PROCLUS: A Dimension-Reduction Subspace Clustering Method

PROCLUS (PROjected CLUStering) is a typical dimension-reduction subspace clustering method.
That is, instead of starting from single-dimensional spaces, it starts by finding an initial approximation of the clusters in the high-dimensional attribute space. Each dimension is then assigned a weight for each cluster, and the updated weights are used in the next iteration to regenerate the clusters. This leads to the exploration of dense regions in all subspaces of some desired dimensionality and avoids the generation of a large number of overlapped clusters in projected dimensions of lower dimensionality.

PROCLUS finds the best set of medoids by a hill-climbing process similar to that used in CLARANS, but generalized to deal with projected clustering. It adopts a distance measure called Manhattan segmental distance, which is the Manhattan distance on a set of relevant dimensions. The PROCLUS algorithm consists of three phases: initialization, iteration, and cluster refinement. In the initialization phase, it uses a greedy algorithm to select a set of initial medoids that are far apart from each other so as to ensure that each cluster is represented by at least one object in the selected set. More concretely, it first chooses a random sample of data points proportional to the number of clusters we wish to generate, and then applies the greedy algorithm to obtain an even smaller final subset for the next phase. The iteration phase selects a random set of k medoids from this reduced set (of medoids), and replaces "bad" medoids with randomly chosen new medoids if the clustering is improved. For each medoid, a set of dimensions is chosen whose average distances are small compared to statistical expectation. The total number of dimensions associated with medoids must be $k \times l$, where l is an input parameter that selects the average dimensionality of cluster subspaces. The refinement phase computes new dimensions for each medoid based on the clusters found, reassigns points to medoids, and removes outliers.
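The distance measure and the reassignment of points to medoids can be sketched as follows. The function names are hypothetical, and dividing by the number of relevant dimensions is an assumption taken from the usual formulation of the Manhattan segmental distance; it makes distances comparable across medoids whose subspaces differ in size. The selection of medoids and of their dimensions is not shown.

```python
def manhattan_segmental_distance(p, q, dims):
    """Average absolute difference between p and q over the dimensions in `dims`."""
    if not dims:
        raise ValueError("at least one relevant dimension is required")
    return sum(abs(p[d] - q[d]) for d in dims) / len(dims)


def assign_to_medoids(points, medoids, medoid_dims):
    """Label each point with the index of its closest medoid.

    medoid_dims[i] lists the relevant dimensions of medoids[i], so each
    distance is measured only in that medoid's own subspace.
    """
    labels = []
    for p in points:
        dists = [manhattan_segmental_distance(p, m, dims)
                 for m, dims in zip(medoids, medoid_dims)]
        labels.append(dists.index(min(dists)))
    return labels
```

Because every point receives exactly one label, the result is a partition of the points rather than a set of overlapping regions.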
Experiments on PROCLUS show that the method is efficient and scalable at finding high-dimensional clusters. Unlike CLIQUE, which outputs many overlapped clusters, PROCLUS finds non-overlapped partitions of points. The discovered clusters may help better understand the high-dimensional data and facilitate other subsequent analyses.

7.9.3 Frequent Pattern-Based Clustering Methods

This section looks at how methods of frequent pattern mining can be applied to clustering, resulting in frequent pattern-based cluster analysis. Frequent pattern mining, as the name implies, searches for patterns (such as sets of items or objects) that occur frequently in large data sets. Frequent pattern mining can lead to the discovery of interesting associations and correlations among data objects. Methods for frequent pattern mining were introduced in Chapter 5. The idea behind frequent pattern-based cluster analysis is that the frequent patterns discovered may also indicate clusters. Frequent pattern-based cluster analysis is well suited to high-dimensional data. It can be viewed as an extension of the dimension-growth subspace clustering approach. However, the boundaries of different dimensions are not obvious, since here they are represented by sets of frequent itemsets. That is, rather than growing the clusters dimension by dimension, we grow sets of frequent itemsets, which eventually lead to cluster descriptions. Typical examples of frequent pattern-based cluster analysis include the clustering of text documents that contain thousands of distinct keywords, and the analysis of microarray data that contain tens of thousands of measured values or "features". In this section, we examine two forms of frequent pattern-based cluster analysis: frequent term-based text clustering and clustering by pattern similarity in microarray data analysis.

In frequent term-based text clustering, text documents are clustered based on the frequent terms they contain.
Using the vocabulary of text document analysis, a term is any sequence of characters separated from other terms by a delimiter. A term can be made up of a single word or several words. In general, we first remove non-text information (such as HTML tags and punctuation) and stop words. Terms are then extracted. A stemming algorithm is then applied to reduce each term to its basic stem. In this way, each document can be represented as a set of terms. Each set is typically large. Collectively, a large set of documents will contain a very large set of distinct terms. If we treat each term as a dimension, the dimension space will be of very high dimensionality! This poses great challenges for document cluster analysis. The dimension space can be referred to as term vector space, where each document is represented by a term vector.

This difficulty can be overcome by frequent term-based analysis. That is, by using an efficient frequent itemset mining algorithm introduced in Section 5.2, we can mine a set of frequent terms from the set of text documents. Then, instead of clustering on high-dimensional term vector space, we need only consider the low-dimensional frequent term sets as "cluster candidates". Notice that a frequent term set is not a cluster but rather the description of a cluster. The corresponding cluster consists of the set of documents containing all of the terms of the frequent term set. A well-selected subset of the set of all frequent term sets can be considered as a clustering.

"How, then, can we select a good subset of the set of all frequent term sets?" This step is critical, since such a selection will determine the quality of the resulting clustering. Let $F_i$ be a frequent term set and $cov(F_i)$ be the set of documents covered by $F_i$. That is, $cov(F_i)$ refers to the documents that contain all of the terms in $F_i$. The general principle for finding a well-selected subset, $F_1, \ldots, F_k$, of the set of all frequent term sets is to ensure that (1) $\bigcup_{i=1}^{k} cov(F_i) = D$, i.e., the selected subset should cover all of the documents to be clustered; and (2) the overlap between any two partitions, $F_i$ and $F_j$ (for $i \neq j$), should be minimized. An overlap measure based on entropy[9] is used to assess cluster overlap by measuring the distribution of the documents supporting some cluster over the remaining cluster candidates.

An advantage of frequent term-based text clustering is that it automatically generates a description for the generated clusters in terms of their frequent term sets. Traditional clustering methods produce only clusters; a description for the generated clusters requires an additional processing step.

Another interesting approach for clustering high-dimensional data is based on pattern similarity among the objects on a subset of dimensions. Here we introduce the pCluster method, which performs clustering by pattern similarity in microarray data analysis. In DNA microarray analysis, the expression levels of two genes may rise and fall synchronously in response to a set of environmental stimuli or conditions. Under the pCluster model, two objects are similar if they exhibit a coherent pattern on a subset of dimensions. Although the magnitude of their expression levels may not be close, the patterns they exhibit can be very much alike. This is illustrated in Example 7.15 below. Discovery of such clusters of genes is essential in revealing significant connections in gene regulatory networks.

Figure 7.22: Raw data from a fragment of microarray data containing only 3 objects and 10 attributes.

[9] Entropy is a measure from information theory. It was introduced in Chapter 2 regarding data discretization and is also described in Chapter 6 regarding decision tree construction.

Example 7.15 Clustering by pattern similarity in DNA microarray analysis. Figure 7.22 shows a fragment of microarray data containing only three genes (taken as "objects" here) and ten attributes (columns a to j). No patterns among the three objects are visibly explicit. However, if two subsets of attributes, {b, c, h, j, e} and {f, d, a, g, i}, are selected and plotted as in Figure 7.23 (a) and (b) respectively, it is easy to see that they form some interesting patterns: Figure 7.23 (a) forms a shift pattern, where the three curves are similar to each other with respect to a shift operation along the y-axis; while Figure 7.23 (b) forms a scaling pattern, where the three curves are similar to each other with respect to a scaling operation along the y-axis.

Figure 7.23: Objects in Figure 7.22 form (a) a shift pattern in subspace {b, c, h, j, e}, and (b) a scaling pattern in subspace {f, d, a, g, i}.

Let us first examine how to discover shift patterns. In DNA microarray data, each row corresponds to a gene and each column or attribute represents a condition under which the gene is developed. The usual Euclidean distance measure cannot capture pattern similarity, since the y values of different curves can be quite far apart. Alternatively, we could first transform the data to derive new attributes, such as $A_{ij} = v_i - v_j$ (where $v_i$ and $v_j$ are object values for attributes $A_i$ and $A_j$, respectively), and then cluster on the derived attributes. However, this would introduce d(d - 1)/2 dimensions for a d-dimensional data set, which is undesirable for a nontrivial d value.

A biclustering method was proposed in an attempt to overcome these difficulties. It introduces a new measure, the mean squared residue score, which measures the coherence of the genes and conditions in a submatrix of a DNA array.
Let $I \subset X$ and $J \subset Y$ be subsets of genes, X, and conditions, Y, respectively. The pair, (I, J), specifies a submatrix, $A_{IJ}$, with the mean squared residue score defined as

$$H(I, J) = \frac{1}{|I||J|} \sum_{i \in I,\, j \in J} (d_{ij} - d_{iJ} - d_{Ij} + d_{IJ})^2, \qquad (7.39)$$

where $d_{ij}$ is the measured value of gene i for condition j, and

$$d_{iJ} = \frac{1}{|J|} \sum_{j \in J} d_{ij}, \qquad d_{Ij} = \frac{1}{|I|} \sum_{i \in I} d_{ij}, \qquad d_{IJ} = \frac{1}{|I||J|} \sum_{i \in I,\, j \in J} d_{ij}, \qquad (7.40)$$

where $d_{iJ}$ and $d_{Ij}$ are the row and column means, respectively, and $d_{IJ}$ is the mean of the subcluster matrix, $A_{IJ}$. A submatrix, $A_{IJ}$, is called a $\delta$-bicluster if $H(I, J) \leq \delta$ for some $\delta > 0$. A randomized algorithm is designed to find such clusters in a DNA array. There are two major limitations of this method. First, a submatrix of a $\delta$-bicluster is not necessarily a $\delta$-bicluster, which makes it difficult to design an efficient pattern growth-based algorithm. Second, because of the averaging effect, a $\delta$-bicluster may contain some undesirable outliers yet still satisfy a rather small $\delta$ threshold.

To overcome the problems of the biclustering method, a pCluster model was introduced as follows. Given objects $x, y \in O$ and attributes $a, b \in T$, pScore is defined by a 2 × 2 matrix as

$$pScore\left( \begin{bmatrix} d_{xa} & d_{xb} \\ d_{ya} & d_{yb} \end{bmatrix} \right) = |(d_{xa} - d_{xb}) - (d_{ya} - d_{yb})|, \qquad (7.41)$$

where $d_{xa}$ is the value of object (or gene) x for attribute (or condition) a, and so on. A pair, (O, T), forms a $\delta$-pCluster if, for any 2 × 2 matrix, X, in (O, T), we have $pScore(X) \leq \delta$ for some $\delta > 0$. Intuitively, this means that the change of values on the two attributes between the two objects is confined by $\delta$ for every pair of objects in O and every pair of attributes in T.

It is easy to see that $\delta$-pCluster has the downward closure property; that is, if (O, T) forms a $\delta$-pCluster, then any of its submatrices is also a $\delta$-pCluster. Moreover, since a pCluster requires that every two objects and every two attributes conform with the inequality, the clusters modeled by the pCluster method are more homogeneous than those modeled by the bicluster method.
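Both measures are short enough to compute directly. The sketch below is illustrative only: the function names and the representation of the data as a list of rows indexed by position are assumptions, and the exhaustive check in `is_delta_pcluster` simply tests the definition rather than mining pClusters efficiently.

```python
from itertools import combinations


def mean_squared_residue(matrix, rows, cols):
    """H(I, J) of Eq. (7.39) for the submatrix given by `rows` and `cols`."""
    sub = [[matrix[i][j] for j in cols] for i in rows]
    n_r, n_c = len(rows), len(cols)
    row_means = [sum(r) / n_c for r in sub]                                     # d_iJ
    col_means = [sum(sub[i][j] for i in range(n_r)) / n_r for j in range(n_c)]  # d_Ij
    overall = sum(row_means) / n_r                                              # d_IJ
    return sum((sub[i][j] - row_means[i] - col_means[j] + overall) ** 2
               for i in range(n_r) for j in range(n_c)) / (n_r * n_c)


def p_score(d_xa, d_xb, d_ya, d_yb):
    """pScore of a 2 x 2 submatrix, Eq. (7.41)."""
    return abs((d_xa - d_xb) - (d_ya - d_yb))


def is_delta_pcluster(matrix, rows, cols, delta):
    """True if every pair of objects and every pair of attributes has pScore <= delta."""
    return all(p_score(matrix[x][a], matrix[x][b], matrix[y][a], matrix[y][b]) <= delta
               for x, y in combinations(rows, 2)
               for a, b in combinations(cols, 2))
```

For two rows that differ by a constant shift, such as (10, 20, 15) and (30, 40, 35), the residue is 0 and every pScore is 0, even though the rows are far apart in Euclidean distance.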
In frequent itemset mining, itemsets are considered frequent if they satisfy a minimum support threshold, which reflects their frequency of occurrence. Based on the definition of pCluster, the problem of mining pClusters becomes one of mining frequent patterns in which each pair of objects and their corresponding features must satisfy the specified $\delta$ threshold. A frequent pattern-growth method can easily be extended to mine such patterns efficiently.

Now, let's look into how to discover scaling patterns. Notice that the original pScore definition, though defined for shift patterns in Equation (7.41), can easily be extended for scaling by introducing a new inequality,

$$\frac{d_{xa} / d_{ya}}{d_{xb} / d_{yb}} \le \delta'. \tag{7.42}$$

This can be computed efficiently because Equation (7.41) is a logarithmic form of Equation (7.42): for positive values, the logarithm of the ratio on the left-hand side is $(\log d_{xa} - \log d_{xb}) - (\log d_{ya} - \log d_{yb})$, which has exactly the form of the difference inside Equation (7.41). That is, the same pCluster model can be applied to the data set after converting the data to the logarithmic form. Thus, the efficient derivation of $\delta$-pClusters for shift patterns can naturally be extended for the derivation of $\delta$-pClusters for scaling patterns.

The pCluster model, though developed in the study of microarray data cluster analysis, can be applied to many other applications that require finding similar or coherent patterns involving a subset of numerical dimensions in large, high-dimensional data sets.

7.10 Constraint-Based Cluster Analysis

In the above discussion, we assume that cluster analysis is an automated, algorithmic computational process, based on the evaluation of similarity or distance functions among a set of objects to be clustered, with little user guidance or interaction. However, users often have a clear view of the application requirements, which they would ideally like to use to guide the clustering process and influence the clustering results. Thus, in many applications, it is desirable to have the clustering process take user preferences and constraints into consideration.
Examples of such information include the expected number of clusters, the minimal or maximal cluster size, weights for different objects or dimensions, and other desirable characteristics of the resulting clusters. Moreover, when a clustering task involves a rather high-dimensional space, it is very difficult to generate meaningful clusters by relying solely on the clustering parameters. User input regarding important dimensions or the desired results will serve as crucial hints or meaningful constraints for effective clustering. In general, we contend that knowledge discovery would be most effective if one can develop an environment for human-centered, exploratory mining of data, that is, where the human user is allowed to play a key role in the process. Foremost, a user should be allowed to specify a focus, directing the mining algorithm towards the kind of "knowledge" that the user is interested in finding. Clearly, user-guided mining will lead to more desirable results and capture the application semantics.

Constraint-based clustering finds clusters that satisfy user-specified preferences or constraints. Depending on the nature of the constraints, constraint-based clustering may adopt rather different approaches. Here are a few categories of constraints.

1. Constraints on individual objects: We can specify constraints on the objects to be clustered. In a real estate application, for example, one may like to spatially cluster only those luxury mansions worth over a million dollars. This constraint confines the set of objects to be clustered. It can easily be handled by preprocessing (e.g., performing selection using an SQL query), after which the problem reduces to an instance of unconstrained clustering.

2. Constraints on the selection of clustering parameters: A user may like to set a desired range for each clustering parameter. Clustering parameters are usually quite specific to the given clustering algorithm.
Examples of parameters include $k$, the desired number of clusters in a k-means algorithm; or $\epsilon$ (the radius) and MinPts (the minimum number of points) in the DBSCAN algorithm. Although such user-specified parameters may strongly influence the clustering results, they are usually confined to the algorithm itself. Thus, their fine tuning and processing are usually not considered a form of constraint-based clustering.

3. Constraints on distance or similarity functions: We can specify different distance or similarity functions for specific attributes of the objects to be clustered, or different distance measures for specific pairs of objects. When clustering sportsmen, for example, we may use different weighting schemes for height, body weight, age, and skill level. Although this will likely change the mining results, it may not alter the clustering process per se. However, in some cases, such changes may make the evaluation of the distance function nontrivial, especially when it is tightly intertwined with the clustering process. This can be seen in the following example.

Example 7.16 Clustering with obstacle objects. A city may have rivers, bridges, highways, lakes, and mountains. We do not want to swim across a river to reach an automated banking machine. Such obstacle objects and their effects can be captured by redefining the distance functions among objects. Clustering with obstacle objects using a partitioning approach requires that the distance between each object and its corresponding cluster center be re-evaluated at each iteration whenever the cluster center is changed. However, such re-evaluation is quite expensive with the existence of obstacles. In this case, efficient new methods should be developed for clustering with obstacle objects in large data sets.

4. User-specified constraints on the properties of individual clusters: A user may like to specify desired characteristics of the resulting clusters, which may strongly influence the clustering process.
Such constraint-based clustering arises naturally in practice, as in Example 7.17.

Example 7.17 User-constrained cluster analysis. Suppose a package delivery company would like to determine the locations for k service stations in a city. The company has a database of customers that registers the customers' name, location, length of time since the customer began using the company's services, and average monthly charge. We may formulate this location selection problem as an instance of unconstrained clustering using a distance function computed based on customer location. However, a smarter approach is to partition the customers into two classes: high-value customers (who need frequent, regular service), and ordinary customers (who require occasional service). In order to save costs and provide good service, the manager adds the following constraints: (1) each station should serve at least 100 high-value customers; and (2) each station should serve at least 5,000 ordinary customers. Constraint-based clustering will take such constraints into consideration during the clustering process.

5. Semi-supervised clustering based on "partial" supervision: The quality of unsupervised clustering can be significantly improved using some weak form of supervision. This may be in the form of pairwise constraints, i.e., pairs of objects labeled as belonging to the same or different cluster. Such a constrained clustering process is called semi-supervised clustering.

In this section, we examine how efficient constraint-based clustering methods can be developed for large data sets. Since cases 1 and 2 above are trivial, we focus on cases 3 to 5 as typical forms of constraint-based cluster analysis.

7.10.1 Clustering with Obstacle Objects

Example 7.16 above introduced the problem of clustering with obstacle objects regarding the placement of automated banking machines. The machines should be easily accessible to the bank's customers.
This means that during clustering, we must take obstacle objects into consideration, such as rivers, highways, and mountains. Obstacles introduce constraints on the distance function. The straight-line distance between two points is meaningless if there is an obstacle in the way. As pointed out in Example 7.16, we do not want to have to swim across a river to get to a banking machine!

"How can we approach the problem of clustering with obstacles?" A partitioning clustering method is preferable since it minimizes the distance between objects and their cluster centers. If we choose the k-means method, a cluster center may not be accessible given the presence of obstacles. For example, the cluster mean could turn out to be in the middle of a lake. On the other hand, the k-medoids method chooses an object within the cluster as a center and thus guarantees that such a problem cannot occur. Recall that every time a new medoid is selected, the distance between each object and its newly selected cluster center has to be recomputed. Since there could be obstacles between two objects, the distance between two objects may have to be derived by geometric computations (e.g., involving triangulation). The computational cost can get very high if a large number of objects and obstacles are involved.

The clustering with obstacles problem can be represented using a graphical notation. First, a point, $p$, is visible from another point, $q$, in the region, $R$, if the straight line joining $p$ and $q$ does not intersect any obstacles. A visibility graph is the graph, $VG = (V, E)$, such that each vertex of the obstacles has a corresponding node in $V$ and two nodes, $v_1$ and $v_2$, in $V$ are joined by an edge in $E$ if and only if the corresponding vertices they represent are visible to each other. Let $VG' = (V', E')$ be a visibility graph created from $VG$ by adding two additional points, $p$ and $q$, in $V'$. $E'$ contains an edge joining two points in $V'$ if the two points are mutually visible.
The shortest path between two points, $p$ and $q$, will be a subpath of $VG'$ as shown in Figure 7.24(a). We see that it begins with an edge from $p$ to either $v_1$, $v_2$, or $v_3$, goes through some path in $VG$, and then ends with an edge from either $v_4$ or $v_5$ to $q$.

Figure 7.24: Clustering with obstacle objects ($o_1$ and $o_2$): (a) a visibility graph, and (b) triangulation of regions with microclusters. From [THH01].

To reduce the cost of distance computation between any two pairs of objects or points, several preprocessing and optimization techniques can be used. One method groups points that are close together into microclusters. This can be done by first triangulating the region $R$ into triangles, and then grouping nearby points in the same triangle into microclusters, using a method similar to BIRCH or DBSCAN, as shown in Figure 7.24(b). By processing microclusters rather than individual points, the overall computation is reduced. After that, precomputation can be performed to build two kinds of join indices based on the computation of the shortest paths: (1) VV indices, for any pair of obstacle vertices, and (2) MV indices, for any pair of microcluster and obstacle vertex. Use of the indices helps further optimize the overall performance.

With such precomputation and optimization, the distance between any two points (at the granularity level of microcluster) can be computed efficiently. Thus, the clustering process can be performed in a manner similar to a typical efficient k-medoids algorithm, such as CLARANS, and achieve good clustering quality for large data sets.
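The visibility-graph idea above can be sketched in a few lines of Python. This is only an illustrative sketch under stated assumptions, not the method of [THH01]: obstacles are assumed to be convex polygons given as lists of (x, y) tuples, points are assumed to be in general position (no segment passes exactly through an obstacle vertex other than at its own endpoints), and visibility is recomputed on the fly rather than looked up in precomputed VV and MV indices. The function names are hypothetical.

```python
import heapq
from math import dist

def _orient(a, b, c):
    """Sign of the cross product (b - a) x (c - a): 1, 0, or -1."""
    v = (b[0] - a[0]) * (c[1] - a[1]) - (b[1] - a[1]) * (c[0] - a[0])
    return (v > 0) - (v < 0)

def _crosses(p, q, a, b):
    """True if segments pq and ab cross at an interior point of both."""
    return (_orient(p, q, a) * _orient(p, q, b) < 0
            and _orient(a, b, p) * _orient(a, b, q) < 0)

def _strictly_inside(m, poly):
    """True if point m lies strictly inside the convex polygon poly."""
    n = len(poly)
    signs = {_orient(poly[i], poly[(i + 1) % n], m) for i in range(n)}
    return signs == {1} or signs == {-1}

def visible(p, q, obstacles):
    """True if the straight segment pq does not cut through any obstacle."""
    mid = ((p[0] + q[0]) / 2, (p[1] + q[1]) / 2)
    for poly in obstacles:
        n = len(poly)
        if any(_crosses(p, q, poly[i], poly[(i + 1) % n]) for i in range(n)):
            return False
        # Catches diagonals between two vertices of the same obstacle.
        if _strictly_inside(mid, poly):
            return False
    return True

def obstructed_distance(p, q, obstacles):
    """Shortest obstacle-avoiding path length from p to q (Dijkstra on VG')."""
    nodes = [p, q] + [v for poly in obstacles for v in poly]
    best = {p: 0.0}
    heap = [(0.0, p)]
    while heap:
        d, u = heapq.heappop(heap)
        if u == q:
            return d
        if d > best.get(u, float("inf")):
            continue  # stale queue entry
        for v in nodes:
            if v != u and visible(u, v, obstacles):
                nd = d + dist(u, v)
                if nd < best.get(v, float("inf")):
                    best[v] = nd
                    heapq.heappush(heap, (nd, v))
    return float("inf")

# Hypothetical example: a square "lake" between two points.
lake = [(1, -1), (1, 1), (3, 1), (3, -1)]
print(obstructed_distance((0, 0), (4, 0), []))      # straight-line distance
print(obstructed_distance((0, 0), (4, 0), [lake]))  # path around the lake
```

Because every call re-tests visibility against all obstacle edges, this sketch is far too slow for large data sets; that cost is what the microclustering and join-index precomputation described above are meant to avoid.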
Given a large set of points, Figure 7.25(a) shows the result of clustering them without considering obstacles, whereas Figure 7.25(b) shows the result with consideration of obstacles. The latter represents rather different but more desirable clusters. For example, if we carefully compare the upper left-hand corner of the two graphs, we see that Figure 7.25(a) has a cluster center on an obstacle (making the center inaccessible), while all cluster centers in Figure 7.25(b) are accessible. A similar situation occurs in the bottom right-hand corner of the graphs.

Figure 7.25: Clustering results obtained without and with consideration of obstacles (where rivers and inaccessible highways or city blocks are represented by polygons): (a) clustering without considering obstacles, and (b) clustering with obstacles.

7.10.2 User-Constrained Cluster Analysis

Let's examine the problem of relocating package delivery centers, as illustrated in Example 7.17. Specifically, a package delivery company with $n$ customers would like to determine locations for $k$ service stations so as to minimize the traveling distance between customers and service stations. The company's customers are regarded as either high-value customers (requiring frequent, regular services), or ordinary customers (requiring occasional services). The manager has stipulated two constraints: each station should serve (1) at least 100 high-value customers, and (2) at least 5,000 ordinary customers.

This can be considered as a constrained optimization problem. We could consider using a mathematical programming approach to handle it. However, such a solution is difficult to scale to large data sets. To cluster $n$ customers into $k$ clusters, a mathematical programming approach will involve at least $k \times n$ variables. As $n$ can be as large as a few million, we could end up having to solve a few million simultaneous equations, a very expensive feat.
A more efficient approach is proposed that explores the idea of microclustering, as illustrated below.

The general idea of clustering a large data set into $k$ clusters satisfying user-specified constraints goes as follows. First, we can find an initial "solution" by partitioning the data set into $k$ groups, satisfying the user-specified constraints, such as the two constraints in our example. We then iteratively refine the solution by moving objects from one cluster to another, trying to satisfy the constraints. For example, we can move a set of $m$ customers from cluster $C_i$ to $C_j$ if $C_i$ has at least $m$ surplus customers (under the specified constraints), or if the result of moving customers into $C_i$ from some other clusters (including from $C_j$) would result in such a surplus. The movement is desirable if the total sum of the distances of the objects to their corresponding cluster centers is reduced. Such movement can be directed by selecting promising points to be moved, such as objects that are currently assigned to some cluster, $C_i$, but that are actually closer to a representative (e.g., centroid) of some other cluster, $C_j$. We need to watch out for and handle deadlock situations (where a constraint is impossible to satisfy), in which case, a deadlock resolution strategy can be employed.

To increase the clustering efficiency, data can first be preprocessed using the microclustering idea to form microclusters (groups of points that are close together), thereby avoiding the processing of all of the points individually. Object movement, deadlock detection, and constraint satisfaction can be tested at the microcluster level, which reduces the number of points to be computed. Occasionally, such microclusters may need to be broken up in order to resolve deadlocks under the constraints.
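The refinement step described above can be illustrated with a deliberately simplified Python sketch. It makes several assumptions that are not in the text: objects are moved one at a time rather than in sets of $m$, individual points are used instead of microclusters, squared Euclidean distance to the centroid is used as the cost, every cluster has a minimum of at least one for some class (so no cluster becomes empty), and no deadlock resolution is attempted. The names `refine` and `min_count` are hypothetical.

```python
def centroid(points):
    """Mean of a non-empty list of 2-D points."""
    n = len(points)
    return (sum(p[0] for p in points) / n, sum(p[1] for p in points) / n)

def sq_dist(p, c):
    """Squared Euclidean distance between two 2-D points."""
    return (p[0] - c[0]) ** 2 + (p[1] - c[1]) ** 2

def refine(points, classes, assign, k, min_count, max_rounds=100):
    """Move single points to a strictly closer centroid, but only when the
    source cluster has a surplus of that point's class under min_count."""
    assign = list(assign)
    for _ in range(max_rounds):
        centers = [centroid([p for p, a in zip(points, assign) if a == j])
                   for j in range(k)]
        counts = [{c: 0 for c in min_count} for _ in range(k)]
        for cls, a in zip(classes, assign):
            counts[a][cls] += 1
        moved = False
        for i, p in enumerate(points):
            src, cls = assign[i], classes[i]
            dst = min(range(k), key=lambda j: sq_dist(p, centers[j]))
            has_surplus = counts[src][cls] > min_count[cls]
            closer = sq_dist(p, centers[dst]) < sq_dist(p, centers[src])
            if dst != src and has_surplus and closer:
                counts[src][cls] -= 1
                counts[dst][cls] += 1
                assign[i] = dst
                moved = True
        if not moved:
            break
    return assign

# Hypothetical customers: two spatial groups, each with one high-value customer.
points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
classes = ["high", "ord", "ord", "high", "ord", "ord"]
start = [0, 1, 0, 1, 0, 1]  # feasible but spatially poor initial solution

print(refine(points, classes, start, 2, {"high": 1, "ord": 1}))
print(refine(points, classes, start, 2, {"high": 1, "ord": 2}))
```

With a minimum of one ordinary customer per cluster, both misplaced customers can move and the two spatial groups are recovered. With a minimum of two, neither cluster has a surplus, so no single move is allowed even though exchanging the two customers would keep the constraints satisfied and lower the cost. This is a small instance of the kind of stalled situation that a deadlock resolution strategy, such as moving objects in both directions at once, would have to handle.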
This microcluster-based methodology allows effective clustering to be performed in large data sets under the user-specified constraints with good efficiency and scalability.

7.10.3 Semi-Supervised Cluster Analysis

In comparison with supervised learning, clustering lacks guidance from users or classifiers (such as class label information), and thus may not generate highly desirable clusters. The quality of unsupervised clustering can be significantly improved using some weak form of supervision, for example, in the form of pairwise constraints, i.e., pairs of objects labeled as belonging to the same or different clusters. Such a clustering process based on user feedback or guidance constraints is called semi-supervised clustering.

Methods for semi-supervised clustering can be categorized into two classes: constraint-based semi-supervised clustering and distance-based semi-supervised clustering. Constraint-based semi-supervised clustering relies on user-provided labels or constraints to guide the algorithm towards a more appropriate data partitioning. This includes modifying the objective function based on constraints, or initializing and constraining the clustering process based on the labeled objects. Distance-based semi-supervised clustering employs an adaptive distance measure that is trained to satisfy the labels or constraints in the supervised data. Several different adaptive distance measures have been used, such as string-edit distance trained using Expectation-Maximization (EM), and Euclidean distance modified by a shortest distance algorithm.

An interesting clustering method, called CLTree (CLustering based on decision Trees), integrates unsupervised clustering with the idea of supervised classification. It is an example of constraint-based semi-supervised clustering.
It transforms a clustering task into a classification task by viewing the set of points to be clustered as belonging to one class, labeled as "Y", and adds a set of relatively uniformly distributed "nonexistence points" with a different class label, "N". The problem of partitioning the data space into data (dense) regions and empty (sparse) regions can then be transformed into a classification problem. For example, Figure 7.26(a) contains a set of data points to be clustered. These points can be viewed as a set of "Y" points. Figure 7.26(b) shows the addition of a set of uniformly distributed "N" points, drawn with a different marker. The original clustering problem is thus transformed into a classification problem, which works out a scheme that distinguishes "Y" and "N" points. A decision tree induction method (described in Chapter 6 on classification) can be applied to partition the two-dimensional space as shown in Figure 7.26(c). Two clusters are identified, which are from the "Y" points only.

Figure 7.26: Clustering through decision tree construction: (a) the set of data points to be clustered, viewed as a set of "Y" points, (b) the addition of a set of uniformly distributed "N" points, and (c) the clustering result with "Y" points only.

Adding a large number of "N" points to the original data may introduce unnecessary overhead in computation. Furthermore, it is unlikely that any points added would truly be uniformly distributed in a very high-dimensional space as this would require an exponential number of points. To deal with this problem, we do not physically add any of the "N" points, but only assume their existence. This works because the decision tree method does not actually require the points. Instead, it only needs the number of "N" points at each decision tree node. This number can be computed when needed, without having to add points to the original data. Thus, CLTree can achieve the results in Figure 7.26(c) without actually adding any "N" points to the original data. Again, two clusters are identified.

The question then is how many (virtual) "N" points should be added in order to achieve good clustering results. The answer follows this simple rule: At the root node, the number of inherited "N" points is 0. At any current node, E, if the number of "N" points inherited from the parent node of E is less than the number of "Y" points in E, then the number of "N" points for E is increased to the number of "Y" points in E. (That is, we set the number of "N" points to be as big as the number of "Y" points.) Otherwise, the number of inherited "N" points is used in E. The basic idea is to use at least as many "N" points as there are "Y" points.

Decision tree classification methods use a measure, typically based on information gain, to select the attribute test for a decision node (Section 6.3.2). The data are then split or partitioned according to the test or "cut". Unfortunately, with clustering, this can lead to the fragmentation of some clusters into scattered regions. To address this problem, methods were developed that use information gain, but allow the ability to look ahead. That is, CLTree first finds initial cuts and then looks ahead to find better partitions that cut less into cluster regions. It finds those cuts that form regions with a very low relative density. The idea is that we want to split at the cut point that may result in a big empty ("N") region, which is more likely to separate clusters. With such tuning, CLTree can perform high-quality clustering in high-dimensional space. It can also find subspace clusters as the decision tree method normally selects only a subset of the attributes.
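The rule for the number of virtual "N" points amounts to taking the larger of the inherited count and the node's "Y" count. The short Python sketch below illustrates it. One detail is an assumption rather than something stated above: because the "N" points are taken to be uniformly distributed, a child region is assumed to inherit a share of its parent's "N" points proportional to the fraction of the parent's volume it covers. The function names and the sample counts are hypothetical.

```python
def n_points_for_node(inherited_n, y_count):
    """Apply the rule: raise the 'N' count to the 'Y' count if it is smaller."""
    return max(inherited_n, y_count)

def n_points_after_cut(parent_n, y_left, y_right, left_fraction):
    """'N' counts for the two children of a cut.

    left_fraction is the share of the parent's volume on the left side
    (assumption: uniform 'N' points are inherited in proportion to volume).
    """
    left = n_points_for_node(parent_n * left_fraction, y_left)
    right = n_points_for_node(parent_n * (1 - left_fraction), y_right)
    return left, right

# Root node: nothing is inherited, so 1000 'Y' points imply 1000 'N' points.
root_n = n_points_for_node(0, 1000)

# A cut through the middle leaves 25 'Y' points on the left, 975 on the right.
left_n, right_n = n_points_after_cut(root_n, 25, 975, 0.5)
print(root_n, left_n, right_n)
```

In this example the sparse left child keeps its inherited 500 "N" points, so "N" dominates there and the region looks empty, while the dense right child has its "N" count raised from 500 to 975 so that further cuts inside it remain possible.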
An interesting by-product of CLTree is the empty (sparse) regions, which may also be useful in certain applications. In marketing, for example, clusters may represent different segments of existing customers of a company, while empty regions reflect the profiles of non-customers. Knowing the profiles of non-customers allows the company to tailor its services or marketing to target these potential customers.

7.11 Outlier Analysis

"What is an outlier?" Very often, there exist data objects that do not comply with the general behavior or model of the data. Such data objects, which are grossly different from or inconsistent with the remaining set of data, are called outliers.

Outliers can be caused by measurement or execution error. For example, the display of a person's age as -999 could be caused by a program default setting of an unrecorded age. Alternatively, outliers may be the result of inherent data variability. The salary of the chief executive officer of a company, for instance, could naturally stand out as an outlier among the salaries of the other employees in the firm.

Many data mining algorithms try to minimize the influence of outliers or eliminate them altogether. This, however, could result in the loss of important hidden information since one person's noise could be another person's signal. In other words, the outliers themselves may be of particular interest, such as in the case of fraud detection, where outliers may indicate fraudulent activity. Thus, outlier detection and analysis is an interesting data mining task, referred to as outlier mining.

Outlier mining has wide applications. As mentioned above, it can be used in fraud detection, for example, by detecting unusual usage of credit cards or telecommunication services.
In addition, it is useful in customized marketing for identifying the spending behavior of customers with extremely low or extremely high incomes, or in medical analysis for finding unusual responses to various medical treatments.

Outlier mining can be described as follows: Given a set of $n$ data points or objects, and $k$, the expected number of outliers, find the top $k$ objects that are considerably dissimilar, exceptional, or inconsistent with respect to the remaining data. The outlier mining problem can be viewed as two subproblems: (1) define what data can be considered as inconsistent in a given data set, and (2) find an efficient method to mine the outliers so defined.

The problem of defining outliers is nontrivial. If a regression model is used for data modeling, analysis of the residuals can give a good estimation for data "extremeness." The task becomes tricky, however, when finding outliers in time-series data as they may be hidden in trend, seasonal, or other cyclic changes. When multidimensional data are analyzed, it may be a combination of dimension values, rather than the value in any single dimension, that is extreme. For nonnumeric (i.e., categorical) data, the definition of outliers requires special consideration.

"What about using data visualization methods for outlier detection?" This may seem like an obvious choice, since human eyes are very fast and effective at noticing data inconsistencies. However, this does not apply to data containing cyclic plots, where values that appear to be outliers could be perfectly valid values in reality. Data visualization methods are weak in detecting outliers in data with many categorical attributes or in data of high dimensionality, since human eyes are good at visualizing numeric data of only two to three dimensions.

In this section, we instead examine computer-based methods for outlier detection.
These can be categorized into four approaches: the statistical approach, the distance-based approach, the density-based local outlier approach, and the deviation-based approach, each of which is studied here. Notice that while clustering algorithms discard outliers as noise, they can be modified to include outlier detection as a by-product of their execution. In general, users must check that each outlier discovered by these approaches is indeed a "real" outlier.

7.11.1 Statistical Distribution-Based Outlier Detection

The statistical distribution-based approach to outlier detection assumes a distribution or probability model for the given data set (e.g., a normal or Poisson distribution) and then identifies outliers with respect to the model using a discordancy test. Application of the test requires knowledge of the data set parameters (such as the assumed data distribution), knowledge of distribution parameters (such as the mean and variance), and the expected number of outliers.

"How does the discordancy testing work?" A statistical discordancy test examines two hypotheses: a working hypothesis and an alternative hypothesis. A working hypothesis, $H$, is a statement that the entire data set of $n$ objects comes from an initial distribution model, $F$, that is,

$$H : o_i \in F, \quad \text{where } i = 1, 2, \ldots, n. \tag{7.43}$$

The hypothesis is retained if there is no statistically significant evidence supporting its rejection. A discordancy test verifies whether an object, $o_i$, is significantly large (or small) in relation to the distribution $F$. Different test statistics have been proposed for use as a discordancy test, depending on the available knowledge of the data. Assuming that some statistic, $T$, has been chosen for discordancy testing, and the value of the statistic for object $o_i$ is $v_i$, then the distribution of $T$ is constructed. Significance probability, $SP(v_i) = Prob(T > v_i)$, is evaluated.
If $SP(v_i)$ is sufficiently small, then $o_i$ is discordant and the working hypothesis is rejected. An alternative hypothesis, $\overline{H}$, which states that $o_i$ comes from another distribution model, $G$, is adopted. The result is very much dependent on which model $F$ is chosen since $o_i$ may be an outlier under one model and a perfectly valid value under another.

The alternative distribution is very important in determining the power of the test, that is, the probability that the working hypothesis is rejected when $o_i$ is really an outlier. There are different kinds of alternative distributions.

Inherent alternative distribution: In this case, the working hypothesis that all of the objects come from distribution, $F$, is rejected in favor of the alternative hypothesis that all of the objects arise from another distribution, $G$:

$$\overline{H} : o_i \in G, \quad \text{where } i = 1, 2, \ldots, n. \tag{7.44}$$

$F$ and $G$ may be different distributions or differ only in parameters of the same distribution. There are constraints on the form of the $G$ distribution in that it must have potential to produce outliers. For example, it may have a different mean or dispersion, or a longer tail.

Mixture alternative distribution: The mixture alternative states that discordant values are not outliers in the $F$ population, but contaminants from some other population, $G$. In this case, the alternative hypothesis is

$$\overline{H} : o_i \in (1 - \lambda)F + \lambda G, \quad \text{where } i = 1, 2, \ldots, n. \tag{7.45}$$

Slippage alternative distribution: This alternative states that all of the objects (apart from some prescribed small number) arise independently from the initial model, $F$, with its given parameters, while the remaining objects are independent observations from a modified version of $F$ in which the parameters have been shifted.

There are two basic types of procedures for detecting outliers:

Block procedures: In this case, either all of the suspect objects are treated as outliers, or all of them are accepted as consistent.
Consecutive (or sequential) procedures: An example of such a procedure is the inside-out procedure. Its main idea is that the object that is least "likely" to be an outlier is tested first. If it is found to be an outlier, then all of the more extreme values are also considered outliers; otherwise, the next most extreme object is tested, and so on. This...