Computing Group Average Similarity

Use the average similarity across all pairs within the merged cluster to measure the similarity of two clusters:

sim(c_i, c_j) = \frac{1}{|c_i \cup c_j|\,(|c_i \cup c_j| - 1)} \sum_{\vec{x} \in (c_i \cup c_j)} \sum_{\vec{y} \in (c_i \cup c_j):\, \vec{y} \neq \vec{x}} sim(\vec{x}, \vec{y})
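A naive Python sketch of this definition (not from the slides; the function names are illustrative, and instances are assumed to be NumPy vectors). Note that it needs O(n^2) similarity computations, which motivates the constant-time version on the next slide:

import numpy as np

def cos_sim(x, y):
    # Cosine similarity between two vectors.
    return x @ y / (np.linalg.norm(x) * np.linalg.norm(y))

def group_average_sim(ci, cj):
    # Average pairwise similarity over all distinct pairs in the merged
    # cluster, directly following the formula above. ci, cj: lists of vectors.
    merged = ci + cj
    n = len(merged)
    total = sum(cos_sim(x, y)
                for a, x in enumerate(merged)
                for b, y in enumerate(merged) if a != b)
    return total / (n * (n - 1))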
Computing Group Average Similarity

Assume cosine similarity and normalized vectors with unit length. Always maintain the sum of vectors in each cluster:

s(c_j) = \sum_{\vec{x} \in c_j} \vec{x}

Compute the similarity of clusters in constant time:

sim(c_i, c_j) = \frac{(s(c_i) + s(c_j)) \cdot (s(c_i) + s(c_j)) - (|c_i| + |c_j|)}{(|c_i| + |c_j|)\,(|c_i| + |c_j| - 1)}
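Under the slide's assumptions (unit-length vectors, cosine similarity), a minimal Python sketch of the constant-time version; the function name and the idea of passing the precomputed sums and sizes explicitly are illustrative choices, not from the slides:

def group_average_sim_fast(s_i, n_i, s_j, n_j):
    # s_i, s_j: NumPy vector sums s(c_i), s(c_j) of unit-length member vectors;
    # n_i, n_j: cluster sizes |c_i|, |c_j|.
    s = s_i + s_j     # s(c_i union c_j), maintained incrementally on merges
    n = n_i + n_j
    # s @ s counts every pairwise dot product including the n self-pairs,
    # each of which equals 1 for unit vectors, hence the "- n" correction.
    return (s @ s - n) / (n * (n - 1))

When two clusters merge, the new cluster's sum is simply s_i + s_j, so each similarity evaluation costs one dot product, O(m), instead of O(n^2) pairwise similarities.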
Non-Hierarchical Clustering

Typically must provide the number of desired clusters, k.
Randomly choose k instances as seeds, one per cluster.
Form initial clusters based on these seeds.
Iterate, repeatedly reallocating instances to different clusters to improve the overall clustering.
Stop when clustering converges or after a fixed number of iterations.
K-Means

Assumes instances are real-valued vectors.
Clusters are based on centroids, the center of gravity or mean of the points in a cluster c:

\mu(c) = \frac{1}{|c|} \sum_{\vec{x} \in c} \vec{x}

– |c| is the number of data points in cluster c.
Reassignment of instances to clusters is based on distance to the current cluster centroids.
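The centroid formula above is just the column-wise mean. As a quick sketch (hypothetical three-point cluster, assuming NumPy):

import numpy as np

c = np.array([[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]])  # hypothetical cluster of 3 points
mu = c.sum(axis=0) / len(c)   # equivalent to c.mean(axis=0) -> [3.0, 4.0]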
Distance Metrics

Euclidean distance (L_2 norm):

L_2(\vec{x}, \vec{y}) = \sqrt{\sum_{i=1}^{m} (x_i - y_i)^2}

L_1 norm:

L_1(\vec{x}, \vec{y}) = \sum_{i=1}^{m} |x_i - y_i|

Cosine similarity (transform to a distance by subtracting from 1):

1 - \frac{\vec{x} \cdot \vec{y}}{|\vec{x}|\,|\vec{y}|}
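As a sketch, the three metrics in Python (assuming NumPy vectors; the function names are illustrative):

import numpy as np

def l2(x, y):
    # Euclidean (L2) distance.
    return np.sqrt(np.sum((x - y) ** 2))

def l1(x, y):
    # Manhattan (L1) distance.
    return np.sum(np.abs(x - y))

def cosine_distance(x, y):
    # 1 minus cosine similarity, so identical directions give distance 0.
    return 1.0 - (x @ y) / (np.linalg.norm(x) * np.linalg.norm(y))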
K-Means Algorithm

Let d be the distance measure between instances.
Select k random instances {s_1, s_2, …, s_k} as seeds.
Until clustering converges or another stopping criterion is met:
    For each instance x_i:
        Assign x_i to the cluster c_j such that d(x_i, s_j) is minimal.
    (Update the seeds to the centroid of each cluster.)
    For each cluster c_j:
        s_j = \mu(c_j)   // recalculate centroids
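A compact Python sketch of this loop, assuming NumPy arrays, Euclidean distance, and convergence checked as "partition unchanged"; the function name kmeans and its parameters are illustrative, not a reference implementation:

import numpy as np

def kmeans(X, k, max_iter=100, seed=None):
    # X: (n, m) array of instances. Returns (centroids, assignments).
    X = np.asarray(X, dtype=float)
    rng = np.random.default_rng(seed)
    centroids = X[rng.choice(len(X), size=k, replace=False)]  # k random instances as seeds
    assign = None
    for _ in range(max_iter):
        # Assignment step: each instance goes to the nearest centroid.
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)  # (n, k)
        new_assign = dists.argmin(axis=1)
        if assign is not None and np.array_equal(new_assign, assign):
            break                      # partition unchanged: converged
        assign = new_assign
        # Update step: recalculate each centroid as the mean of its members.
        for j in range(k):
            members = X[assign == j]
            if len(members) > 0:       # keep the old centroid if a cluster empties
                centroids[j] = members.mean(axis=0)
    return centroids, assign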
K-Means Example (K = 2)

Pick seeds.
Reassign clusters.
Compute centroids.
Reassign clusters.
Compute centroids.
Reassign clusters.
Converged!
Termination Conditions

Several possibilities, e.g.:
– A fixed number of iterations.
– Partition unchanged.
– Centroid positions don’t change.
(Sec. 16.4)
Convergence

Why should the K-means algorithm ever reach a fixed point?
– A state in which clusters don’t change.
K-means is a special case of a general procedure known as the Expectation Maximization (EM) algorithm.
– EM is known to converge.
– The number of iterations could be large.
– But in practice it usually isn’t.
(Sec. 16.4)
Time Complexity

Assume computing the distance between two instances is O(m), where m is the dimensionality of the vectors.
Reassigning clusters: O(kn) distance computations, i.e., O(knm).
Computing centroids: each instance vector gets added once to some centroid: O(nm).
Assume these two steps are each done once for I iterations: O(Iknm).
Linear in all relevant factors, assuming a fixed number of iterations; more efficient than O(n^2) HAC.
A simple example showing the operation of the k-means algorithm (K = 2).
Step 1: Initialization. We randomly choose the following two centroids (k = 2) for the two clusters. In this case the two centroids are m1 = (1.0, 1.0) and m2 = (5.0, 7.0).
Step 2: We thus obtain two clusters containing instances {1, 2, 3} and {4, 5, 6, 7}.
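For concreteness, this assignment step can be reproduced in a few lines of NumPy. The seven data points below are hypothetical stand-ins (the slide's actual data table is in the omitted figures), chosen only so that the seeds m1 = (1.0, 1.0) and m2 = (5.0, 7.0) split them into {1, 2, 3} and {4, 5, 6, 7}:

import numpy as np

# Hypothetical instances 1..7 (the original slide's data table is not in the preview).
X = np.array([[1.0, 1.0], [1.5, 2.0], [3.0, 4.0],
              [5.0, 7.0], [3.5, 5.0], [4.5, 5.0], [3.5, 4.5]])
m1, m2 = np.array([1.0, 1.0]), np.array([5.0, 7.0])

# One assignment pass: each point goes to the nearer of the two centroids.
d1 = np.linalg.norm(X - m1, axis=1)
d2 = np.linalg.norm(X - m2, axis=1)
print(np.where(d1 <= d2)[0] + 1)   # instances in cluster 1 -> [1 2 3]
print(np.where(d1 > d2)[0] + 1)    # instances in cluster 2 -> [4 5 6 7]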
