Group-average agglomerative clustering uses the average similarity across all pairs of distinct vectors in the merged cluster:

$$\mathrm{sim}(c_i, c_j) = \frac{1}{|c_i \cup c_j|\left(|c_i \cup c_j| - 1\right)} \sum_{\vec{x} \in c_i \cup c_j} \; \sum_{\substack{\vec{y} \in c_i \cup c_j \\ \vec{y} \neq \vec{x}}} \mathrm{sim}(\vec{x}, \vec{y})$$
# Computing Group Average Similarity

- Assume cosine similarity and vectors normalized to unit length.
- Always maintain the sum of the vectors in each cluster:

$$\vec{s}(c_j) = \sum_{\vec{x} \in c_j} \vec{x}$$

- Then the similarity of two clusters can be computed in constant time:

$$\mathrm{sim}(c_i, c_j) = \frac{\left(\vec{s}(c_i) + \vec{s}(c_j)\right) \cdot \left(\vec{s}(c_i) + \vec{s}(c_j)\right) - \left(|c_i| + |c_j|\right)}{\left(|c_i| + |c_j|\right)\left(|c_i| + |c_j| - 1\right)}$$
# Non-Hierarchical Clustering

- Typically must provide the number of desired clusters, k.
- Randomly choose k instances as seeds, one per cluster.
- Form initial clusters based on these seeds.
- Iterate, repeatedly reallocating instances to different clusters to improve the overall clustering.
- Stop when the clustering converges or after a fixed number of iterations.
# K-Means

- Assumes instances are real-valued vectors.
- Clusters are based on centroids (the center of gravity, or mean, of the points in a cluster c):

$$\vec{\mu}(c) = \frac{1}{|c|} \sum_{\vec{x} \in c} \vec{x}$$

  where |c| is the number of data points in cluster c.
- Reassignment of instances to clusters is based on distance to the current cluster centroids.
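The centroid formula translates into a one-function Python sketch (the function name is my own):

```python
def centroid(cluster):
    """Mean of the vectors in a non-empty cluster: mu(c) = (1/|c|) * sum_{x in c} x."""
    n = len(cluster)  # |c|, the number of data points in the cluster
    return [sum(comp) / n for comp in zip(*cluster)]
```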
# Distance Metrics

- Euclidean distance (L2 norm):

$$L_2(\vec{x}, \vec{y}) = \sqrt{\sum_{i=1}^{m} (x_i - y_i)^2}$$

- L1 norm:

$$L_1(\vec{x}, \vec{y}) = \sum_{i=1}^{m} |x_i - y_i|$$

- Cosine similarity (transformed into a distance by subtracting from 1):

$$1 - \frac{\vec{x} \cdot \vec{y}}{|\vec{x}|\,|\vec{y}|}$$
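All three metrics are a few lines each in Python; a minimal sketch (function names are my own):

```python
import math

def l2(x, y):
    """Euclidean (L2) distance: sqrt of the sum of squared differences."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def l1(x, y):
    """Manhattan (L1) distance: sum of absolute differences."""
    return sum(abs(a - b) for a, b in zip(x, y))

def cosine_distance(x, y):
    """1 minus cosine similarity; assumes neither vector is all zeros."""
    num = sum(a * b for a, b in zip(x, y))
    den = math.sqrt(sum(a * a for a in x)) * math.sqrt(sum(b * b for b in y))
    return 1 - num / den
```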
# K-Means Algorithm

- Let d be the distance measure between instances.
- Select k random instances $\{s_1, s_2, \ldots, s_k\}$ as seeds.
- Until clustering converges, or another stopping criterion holds:
    - For each instance $x_i$: assign $x_i$ to the cluster $c_j$ such that $d(x_i, s_j)$ is minimal.
    - Update the seeds to the centroid of each cluster: for each cluster $c_j$, set $s_j = \vec{\mu}(c_j)$ (recalculate centroids).
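The loop above can be sketched in Python. This is a minimal illustration under assumptions of my own choosing (Euclidean distance, random seeding, keeping the old centroid if a cluster empties), not production code:

```python
import math
import random

def kmeans(points, k, max_iters=100, seed=0):
    """Minimal k-means: random seeds, Euclidean distance, stops when the
    partition no longer changes or after max_iters iterations."""
    rng = random.Random(seed)
    centroids = [list(p) for p in rng.sample(points, k)]
    assignment = [None] * len(points)
    for _ in range(max_iters):
        # Assignment step: each point goes to its nearest centroid.
        new_assignment = [
            min(range(k), key=lambda j: math.dist(p, centroids[j]))
            for p in points
        ]
        if new_assignment == assignment:  # converged: partition unchanged
            break
        assignment = new_assignment
        # Update step: recompute each centroid as the mean of its points.
        for j in range(k):
            members = [p for p, a in zip(points, assignment) if a == j]
            if members:  # keep the old centroid if the cluster emptied
                n = len(members)
                centroids[j] = [sum(c) / n for c in zip(*members)]
    return assignment, centroids
```

Note that the convergence test here is exactly the "partition unchanged" termination condition: once no point switches clusters, the centroids cannot move either.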
# K-Means Example (K=2)

1. Pick seeds.
2. Reassign clusters.
3. Compute centroids.
4. Reassign clusters.
5. Compute centroids.
6. Reassign clusters.
7. Converged!

(In the original slide each step is shown on a scatter plot, with × marking the centroids.)
# Termination Conditions

Several possibilities, e.g.:

- A fixed number of iterations.
- The partition is unchanged.
- Centroid positions don't change.

(Sec. 16.4)
# Convergence

Why should the K-means algorithm ever reach a fixed point, i.e., a state in which the clusters don't change?

- K-means is a special case of a general procedure known as the Expectation Maximization (EM) algorithm.
- EM is known to converge.
- The number of iterations could be large, but in practice it usually isn't.

(Sec. 16.4)
# Time Complexity

- Assume computing the distance between two instances is O(m), where m is the dimensionality of the vectors.
- Reassigning clusters: O(kn) distance computations, i.e., O(knm).
- Computing centroids: each instance vector is added once to some centroid, so O(nm).
- Assume these two steps are each done once in each of I iterations: O(Iknm).
- This is linear in all relevant factors; assuming a fixed number of iterations, it is more efficient than O(n²) HAC.
# A Simple Example of the K-Means Algorithm (K=2)

Step 1 (initialization): Randomly choose two centroids (k = 2), one per cluster. In this case the two centroids are m1 = (1.0, 1.0) and m2 = (5.0, 7.0).

Step 2: Assigning each point to the nearer centroid, we obtain two clusters containing {1, 2, 3} and {4, 5, 6, 7}.

• Fall '14
• Seon Kim