K-Center and Dendrogram Clustering
Jia Li
Department of Statistics, The Pennsylvania State University
Email: jiali@stat.psu.edu
http://www.stat.psu.edu/jiali

K-center Clustering

Let A be a set of n objects. Partition A into K sets $C_1, C_2, ..., C_K$.

Cluster size of $C_k$: the least value D for which all points in $C_k$ are
1. within distance D of each other, or
2. within distance D/2 of some point called the cluster center.

Let the cluster size of $C_k$ be $D_k$. The cluster size of the partition S is
$$D = \max_{k=1,...,K} D_k .$$

Goal: given K, find $\min_S D(S)$.
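To make the two senses of cluster size concrete, here is a minimal Python sketch (not from the original slides; it assumes Euclidean distance, a given partition, and a given center):

```python
import numpy as np

def pairwise_size(points):
    """Cluster size in the pairwise sense: the largest distance
    between any two points in the cluster (its diameter)."""
    n = len(points)
    return max(
        (np.linalg.norm(points[i] - points[j]) for i in range(n) for j in range(i + 1, n)),
        default=0.0,
    )

def centralized_size(points, center):
    """Cluster size in the centralized sense: twice the largest
    distance from the chosen center, so all points lie within D/2 of it."""
    return 2 * max(np.linalg.norm(p - center) for p in points)

# Illustrative data, not from the slides. The cluster size of the
# partition is determined by its worst cluster.
clusters = [np.array([[0.0, 0.0], [1.0, 0.0]]), np.array([[5.0, 5.0], [5.0, 8.0]])]
D = max(pairwise_size(c) for c in clusters)  # = max(1.0, 3.0) = 3.0
print(D)
```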
Comparison with k-means

Assume the distance between vectors is the squared Euclidean distance.

K-means:
$$\min_S \sum_{k=1}^{K} \sum_{i: x_i \in C_k} (x_i - \mu_k)^T (x_i - \mu_k),$$
where $\mu_k$ is the centroid of cluster $C_k$; in particular,
$$\mu_k = \frac{1}{N_k} \sum_{i: x_i \in C_k} x_i .$$

K-center:
$$\min_S \max_{k=1,...,K} \ \max_{i: x_i \in C_k} (x_i - \mu_k)^T (x_i - \mu_k),$$
where $\mu_k$ is called the "centroid", but may not be the mean vector.
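The behavioral difference shows up with outliers. Here is a small illustrative sketch (not from the slides; it evaluates both objectives for a single cluster with the mean as center):

```python
import numpy as np

def kmeans_objective(points, center):
    """Sum of squared distances to the center: k-means averages errors."""
    return sum(np.sum((p - center) ** 2) for p in points)

def kcenter_objective(points, center):
    """Largest squared distance to the center: k-center tracks the worst point."""
    return max(np.sum((p - center) ** 2) for p in points)

# A tight cluster plus one outlier: the outlier contributes one term to the
# k-means sum but completely determines the k-center value.
cluster = np.array([[0.0, 0.0], [0.1, 0.0], [0.0, 0.1], [10.0, 0.0]])
center = cluster.mean(axis=0)
print(kmeans_objective(cluster, center))   # total error, averaged behavior
print(kcenter_objective(cluster, center))  # driven entirely by the outlier
```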
Another formulation of k-center:
$$\min_S \max_{k=1,...,K} \ \max_{i,j: x_i, x_j \in C_k} L(x_i, x_j),$$
where $L(x_i, x_j)$ denotes any distance between a pair of objects.

[Figure: the original unclustered data.]
[Figure: clustering by k-means. K-means focuses on average distance.]

[Figure: clustering by k-center. K-center focuses on the worst scenario.]

Greedy Algorithm

Choose a subset H of S consisting of K points that are farthest apart from each other. Each point $h_k \in H$ represents one cluster $C_k$. Point $x_i$ is assigned to cluster $C_k$ if
$$L(x_i, h_k) = \min_{k'=1,...,K} L(x_i, h_{k'}) .$$

Only the pairwise distances $L(x_i, x_j)$ for $x_i, x_j \in S$ are needed. Hence $x_i$ can be a non-vector representation of the objects.

The greedy algorithm achieves an approximation factor of 2 as long as the distance measure L satisfies the triangle inequality. That is, if
$$D^* = \min_S \max_{k=1,...,K} \ \max_{i,j: x_i, x_j \in C_k} L(x_i, x_j),$$
then the greedy algorithm guarantees that $D \le 2D^*$. The relation also holds if the cluster size is defined in the sense of centralized clustering.

Pseudo Code

H denotes the set of cluster representative objects $\{h_1, ..., h_K\} \subseteq S$. Let cluster$(x_i)$ be the identity of the cluster that $x_i \in S$ belongs to. Let dist$(x_i)$ be the distance between $x_i$ and its closest cluster representative object:
$$\mathrm{dist}(x_i) = \min_{h_j \in H} L(x_i, h_j) .$$
Pseudo code:

1. Randomly select an object $x_j$ from S; let $h_1 = x_j$ and $H = \{h_1\}$.
2. For j = 1 to n: set dist$(x_j) = L(x_j, h_1)$ and cluster$(x_j) = 1$.
3. For i = 2 to K:
   - $D = \max_{x_j \in S \setminus H} \mathrm{dist}(x_j)$;
   - choose $h_i \in S \setminus H$ such that dist$(h_i) = D$;
   - $H = H \cup \{h_i\}$;
   - for j = 1 to n: if $L(x_j, h_i) \le \mathrm{dist}(x_j)$, set dist$(x_j) = L(x_j, h_i)$ and cluster$(x_j) = i$.

Algorithm Property

The running time of the algorithm is O(Kn). Let the partition obtained by the greedy algorithm be $\tilde{S}$ and the optimal partition be $S^*$. Let the cluster size of $\tilde{S}$ be $\tilde{D}$ and that of $S^*$ be $D^*$, where cluster size is defined in the pairwise distance sense. It can be proved that $\tilde{D} \le 2D^*$. The approximation factor of 2 also holds if the cluster size of a partition is defined in the sense of centralized clustering.

Key Ideas for Proof

Let $D_j$ be the cluster size of the partition generated by $\{h_1, ..., h_j\}$. Then $D_1 \ge D_2 \ge D_3 \ge \cdots$. For every $i < j$, $L(h_i, h_j) \ge D_{j-1}$; moreover, for each $j$ there exists $i < j$ such that $L(h_i, h_j) = D_{j-1}$. Consider the optimal partition $S^*$ with K clusters and minimum size $D^*$. Suppose the greedy algorithm generates centroids $\{h_1, ..., h_K, h_{K+1}\}$. By the pigeonhole principle, at least two of these centroids fall into one cluster of the partition $S^*$; let them be $h_i$ and $h_j$ with $1 \le i < j \le K+1$. Then $L(h_i, h_j) \le 2D^*$, by the triangle inequality and the fact that they lie in the same cluster. Also $L(h_i, h_j) \ge D_{j-1} \ge D_K$. Thus $D_K \le 2D^*$.
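As a concrete companion to the pseudo code, here is a minimal Python sketch of the greedy (farthest-point) algorithm. The function name and array interface are my own, and Euclidean distance is assumed:

```python
import numpy as np

def greedy_kcenter(X, K, rng=None):
    """Greedy (farthest-point) k-center.

    X: (n, d) array of points; K: number of clusters.
    Returns the indices of the K representatives and the cluster labels.
    Uses O(Kn) distance evaluations, matching the running time above.
    """
    rng = np.random.default_rng(rng)
    n = X.shape[0]
    # Step 1: pick the first representative at random.
    h = [rng.integers(n)]
    # Step 2: initialize dist() and cluster() against h_1.
    dist = np.linalg.norm(X - X[h[0]], axis=1)
    cluster = np.zeros(n, dtype=int)
    # Step 3: repeatedly add the point farthest from all current centers.
    for i in range(1, K):
        h_i = int(np.argmax(dist))          # dist(h_i) = max_j dist(x_j)
        h.append(h_i)
        d_new = np.linalg.norm(X - X[h_i], axis=1)
        closer = d_new <= dist              # points now closer to the new center
        dist[closer] = d_new[closer]
        cluster[closer] = i
    return h, cluster

# Usage with illustrative data: cluster 200 random points into 4 groups.
X = np.random.default_rng(0).normal(size=(200, 2))
centers, labels = greedy_kcenter(X, K=4, rng=0)
```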
Proof

Let
$$\tilde{\delta} = \max_{x_j \in S \setminus H} \ \min_{h_k \in H} L(x_j, h_k) .$$
Let $h_{K+1}$ be the object in $S \setminus H$ such that
$$\min_{h_k \in H} L(h_{K+1}, h_k) = \tilde{\delta} .$$
By definition, $L(h_{K+1}, h_k) \ge \tilde{\delta}$ for all $k = 1, ..., K$. Let $H_k = \{h_1, ..., h_k\}$ for $k = 1, 2, ..., K$.
Consider the distance between any $h_i$ and $h_j$, with $i < j \le K$ without loss of generality. According to the greedy algorithm,
$$\min_{h_k \in H_{j-1}} L(h_j, h_k) \ \ge\ \min_{h_k \in H_{j-1}} L(x_l, h_k) \quad \text{for any } x_l \in S \setminus H_j .$$
Since $h_{K+1} \in S \setminus H$ and $S \setminus H \subseteq S \setminus H_j$,
$$L(h_j, h_i) \ \ge\ \min_{h_k \in H_{j-1}} L(h_j, h_k) \ \ge\ \min_{h_k \in H_{j-1}} L(h_{K+1}, h_k) \ \ge\ \min_{h_k \in H} L(h_{K+1}, h_k) \ =\ \tilde{\delta} .$$

We have shown that for any $i < j \le K+1$,
$$L(h_i, h_j) \ \ge\ \tilde{\delta} .$$

Consider the partition $C_1^*, C_2^*, ..., C_K^*$ formed by $S^*$. At least 2 of the K+1 objects $h_1, ..., h_{K+1}$ will be covered by one cluster. Without loss of generality, assume $h_i$ and $h_j$ belong to the same cluster in $S^*$. Then $L(h_i, h_j) \le D^*$. Since $L(h_i, h_j) \ge \tilde{\delta}$, we have $\tilde{\delta} \le D^*$.

Consider any two objects $x'$ and $x''$ in any cluster represented by $h_k$. By the definition of $\tilde{\delta}$, $L(x', h_k) \le \tilde{\delta}$ and $L(x'', h_k) \le \tilde{\delta}$. Hence by the triangle inequality,
$$L(x', x'') \ \le\ L(x', h_k) + L(x'', h_k) \ \le\ 2\tilde{\delta} .$$
Hence
$$\tilde{D} \ \le\ 2\tilde{\delta} \ \le\ 2D^* .$$

For centralized clustering: let $\tilde{D} = \max_{k=1,...,K} \max_{x_j \in C_k} L(x_j, h_k)$, and define $D^*$ similarly. The pigeonhole step in the proof modifies to $L(h_i, h_j) \le 2D^*$ by the triangle inequality, and
$$\tilde{D} = \tilde{\delta} \ \le\ L(h_i, h_j) \ \le\ 2D^* .$$

A step-by-step illustration of k-center clustering is provided next.

[Figure: step-by-step illustration, 2 clusters.]
[Figure: step-by-step illustration, 3 clusters.]
[Figure: step-by-step illustration, 4 clusters.]
Applications to Image Segmentation

[Figure: original image; segmentation using k-center; segmentation using k-means with LBG initialization; segmentation by k-means using k-center for initialization.]

[Figure: scatter plots of LUV color components with k-center clustering.]

[Figure: scatter plots of LUV color components, k-means with LBG initialization.]

[Figure: scatter plots of LUV color components, k-means with k-center initialization.]

[Figure: comparison of segmentation results. Left: original images. Middle: k-means with k-center initialization. Right: k-means with LBG initialization using the same number of clusters as in the k-center case.]
Agglomerative Clustering

Generate clusters in a hierarchical way. Let the data set be $A = \{x_1, ..., x_n\}$.
- Start with n clusters, each containing one data point.
- Merge the two clusters with minimum pairwise distance.
- Update the between-cluster distances.
- Iterate the merging procedure.
The clustering procedure can be visualized by a tree structure called a dendrogram.

How should the between-cluster distance be defined?
- For clusters containing only one data point, the between-cluster distance is the between-object distance.
- For clusters containing multiple data points, the between-cluster distance is an agglomerative version of the between-object distances, for example the minimum or maximum between-object distance over objects in the two clusters.
- The agglomerative between-cluster distance can often be computed recursively.

Example Distances

Suppose clusters r and s are merged into a new cluster t, and let k be any other cluster. Denote the between-cluster distance by $D(\cdot, \cdot)$. How do we get $D(t, k)$ from $D(r, k)$ and $D(s, k)$?

Single-link clustering:
$$D(t, k) = \min(D(r, k), D(s, k))$$
$D(t, k)$ is the minimum distance between two objects in clusters t and k respectively.

Complete-link clustering:
$$D(t, k) = \max(D(r, k), D(s, k))$$
$D(t, k)$ is the maximum distance between two objects in clusters t and k respectively.

Average linkage clustering:
Unweighted case:
$$D(t, k) = \frac{n_r}{n_r + n_s} D(r, k) + \frac{n_s}{n_r + n_s} D(s, k)$$
Weighted case:
$$D(t, k) = \frac{1}{2} D(r, k) + \frac{1}{2} D(s, k)$$
$D(t, k)$ is the average distance between two objects in clusters t and k respectively. In the unweighted case the number of elements in each cluster is taken into consideration, while in the weighted case each cluster is weighted equally, so objects in smaller clusters are weighted more heavily than those in larger clusters.
Centroid clustering:
Unweighted case:
$$D(t, k) = \frac{n_r}{n_r + n_s} D(r, k) + \frac{n_s}{n_r + n_s} D(s, k) - \frac{n_r n_s}{(n_r + n_s)^2} D(r, s)$$
Weighted case:
$$D(t, k) = \frac{1}{2} D(r, k) + \frac{1}{2} D(s, k) - \frac{1}{4} D(r, s)$$
A centroid is computed for each cluster, and the distance between clusters is given by the distance between their respective centroids.

Ward's clustering:
$$D(t, k) = \frac{n_r + n_k}{n_r + n_s + n_k} D(r, k) + \frac{n_s + n_k}{n_r + n_s + n_k} D(s, k) - \frac{n_k}{n_r + n_s + n_k} D(r, s)$$
Merge the two clusters for which the change in the variance of the clustering is minimized. The variance of a cluster is defined as the sum of squared errors between each object in the cluster and the centroid of the cluster.

The dendrogram generated by single-link clustering tends to look like a chain. Clusters generated by complete-link may not be well separated. Other methods are intermediate between the two.
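Each of these update rules depends only on $D(r,k)$, $D(s,k)$, $D(r,s)$, and the cluster sizes, which is why the recursion works. A minimal Python sketch collecting the formulas above; the dispatch-by-name interface is illustrative, not from the slides:

```python
def update_distance(method, d_rk, d_sk, d_rs, n_r, n_s, n_k):
    """Recursively update the between-cluster distance D(t, k) after
    merging clusters r and s into t, following the formulas above."""
    n_t = n_r + n_s
    if method == "single":
        return min(d_rk, d_sk)
    if method == "complete":
        return max(d_rk, d_sk)
    if method == "average":           # unweighted average linkage
        return (n_r * d_rk + n_s * d_sk) / n_t
    if method == "weighted_average":  # weighted average linkage
        return 0.5 * d_rk + 0.5 * d_sk
    if method == "centroid":          # unweighted centroid clustering
        return (n_r * d_rk + n_s * d_sk) / n_t - (n_r * n_s * d_rs) / n_t**2
    if method == "weighted_centroid":
        return 0.5 * d_rk + 0.5 * d_sk - 0.25 * d_rs
    if method == "ward":
        return ((n_r + n_k) * d_rk + (n_s + n_k) * d_sk - n_k * d_rs) / (n_t + n_k)
    raise ValueError(f"unknown method: {method}")
```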
Pseudo Code

1. Begin with n clusters, each containing one object. Number the clusters 1 through n.
2. Compute the between-cluster distance $D(r, s)$ as the between-object distance of the two objects in r and s respectively, for $r, s = 1, 2, ..., n$. Let the square matrix $D = (D(r, s))$.
3. Find the most similar pair of clusters r, s, that is, the pair whose $D(r, s)$ is minimum among all the pairwise distances.
4. Merge r and s into a new cluster t. Compute the between-cluster distance $D(t, k)$ for all $k \ne r, s$. Delete the rows and columns corresponding to r and s in D, and add a new row and column corresponding to cluster t.
5. Repeat Steps 3 and 4 a total of $n - 1$ times, until there is only one cluster left.
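This matrix-update procedure translates into a short program. Below is a minimal Python sketch that reuses the hypothetical update_distance function from the earlier example; dictionary bookkeeping stands in for the row and column deletion of Step 4:

```python
def agglomerative(D0, method="single"):
    """Agglomerative clustering from an initial n x n distance matrix D0.
    Returns the merge history as (cluster_r, cluster_s, distance) triples,
    which is exactly the information a dendrogram plots."""
    n = len(D0)
    # Steps 1-2: every object is its own cluster; store pairwise distances.
    D = {(r, s): D0[r][s] for r in range(n) for s in range(r + 1, n)}
    sizes = {i: 1 for i in range(n)}
    next_id, merges = n, []
    while len(sizes) > 1:
        # Step 3: find the closest pair of active clusters.
        (r, s), d_rs = min(D.items(), key=lambda kv: kv[1])
        merges.append((r, s, d_rs))
        # Step 4: create cluster t and update its distance to every other k.
        t = next_id
        next_id += 1
        for k in sizes:
            if k in (r, s):
                continue
            d_rk = D.pop((min(r, k), max(r, k)))
            d_sk = D.pop((min(s, k), max(s, k)))
            D[(k, t)] = update_distance(method, d_rk, d_sk, d_rs,
                                        sizes[r], sizes[s], sizes[k])
        del D[(r, s)]
        sizes[t] = sizes.pop(r) + sizes.pop(s)
    # Step 5: the loop runs n - 1 times, leaving a single cluster.
    return merges
```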
[Figure: agglomerative clustering of a data set (100 points) into 9 clusters. Left: single-link. Right: complete-link.]

[Figure: agglomerative clustering of a data set (100 points) into 9 clusters. Left: average linkage. Right: Ward's clustering.]
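Comparisons like these can be reproduced with standard tooling. A brief sketch using scipy's hierarchical clustering routines, assuming scipy is installed (the random data stands in for the 100-point set in the figures):

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

# 100 random 2-D points, as in the figures above.
X = np.random.default_rng(0).normal(size=(100, 2))

# Build the merge tree under each linkage and cut it into 9 clusters.
for method in ("single", "complete", "average", "ward"):
    Z = linkage(X, method=method)            # (n-1) x 4 merge history
    labels = fcluster(Z, t=9, criterion="maxclust")
    print(method, np.bincount(labels)[1:])   # cluster sizes per method
```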