Last time ...

Clustering vs. Classification
Clustering: unsupervised learning. Classification: supervised learning.
Classification: classes are human-defined and are input to the learning algorithm.
Clustering: clusters are inferred from the data without human input.
However, there are many ways of influencing the outcome of clustering: number of clusters, similarity measure, representation of documents, ...

Issues for clustering
Representation for clustering
  Document representation: vector space? Normalization?
  Need a notion of similarity/distance
How many clusters?
  Fixed a priori?
  Completely data driven?
  Avoid “trivial” clusters that are too large or too small
    In an application, if a cluster is too large, then for navigation purposes you have wasted an extra user click without whittling down the set of documents much.

Flat clustering: K-means
Objective/partitioning criterion: minimize the average squared difference from the centroid.
Assumes documents are real-valued vectors.
Clusters are based on centroids (aka the center of gravity or mean) of the points in a cluster $\omega$:
  $\mu(\omega) = \frac{1}{|\omega|} \sum_{x \in \omega} x$
We try to find the minimum average squared difference by iterating two steps:
  reassignment: assign each vector to its closest centroid
  recomputation: recompute each centroid as the average of the vectors that were assigned to it in reassignment
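The two steps above can be sketched in a few lines of plain Python. This is only an illustrative sketch, not a reference implementation: the function names (`squared_distance`, `centroid`, `kmeans`), the random choice of seeds, the `max_iters` cap, and the rule of keeping the old centroid when a cluster becomes empty are all assumptions made for the example.

```python
import random


def squared_distance(x, y):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((a - b) ** 2 for a, b in zip(x, y))


def centroid(vectors):
    """Component-wise mean of a non-empty list of vectors."""
    n = len(vectors)
    return [sum(component) / n for component in zip(*vectors)]


def kmeans(vectors, k, max_iters=100, seed=0):
    """Return (centroids, assignment) after iterating the two K-means steps."""
    rng = random.Random(seed)
    # Pick seeds: k distinct input vectors chosen at random (an assumption).
    centroids = [list(v) for v in rng.sample(vectors, k)]
    assignment = None
    for _ in range(max_iters):
        # Reassignment: each vector goes to its closest centroid.
        new_assignment = [
            min(range(k), key=lambda j: squared_distance(v, centroids[j]))
            for v in vectors
        ]
        if new_assignment == assignment:
            break  # fixed point: the clusters did not change
        assignment = new_assignment
        # Recomputation: each centroid becomes the mean of its members.
        for j in range(k):
            members = [v for v, a in zip(vectors, assignment) if a == j]
            if members:  # assumption: an empty cluster keeps its old centroid
                centroids[j] = centroid(members)
    return centroids, assignment


points = [[0.0, 0.0], [0.0, 1.0], [10.0, 10.0], [10.0, 11.0]]
centroids, assignment = kmeans(points, k=2)
```

On this toy input the two points near the origin end up in one cluster and the two points near (10, 10) in the other, whichever seeds are drawn.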
K-means
[Figure sequence: pick seeds; reassign clusters; compute centroids (marked x); reassign clusters; compute centroids; reassign clusters; converged.]

Convergence
Why should the K-means algorithm ever reach a fixed point, i.e., a state in which clusters don’t change?
K-means is a special case of a general procedure known as the Expectation Maximization (EM) algorithm.
EM is known to converge.
The number of iterations could be large, but in practice it usually isn’t.

Convergence of K-means
Define the goodness measure of cluster $k$ as the sum of squared distances from the cluster centroid:
  $G_k = \sum_i (d_i - c_k)^2$  (sum over all $d_i$ in cluster $k$)
  $G = \sum_k G_k$
Reassignment monotonically decreases $G$, since each vector is assigned to the closest centroid.
Recomputation monotonically decreases each $G_k$, since ($m_k$ is the number of members in cluster $k$):
  $\sum_i (d_i - a)^2$ reaches its minimum for:
  $\sum_i -2(d_i - a) = 0$
  $\sum_i d_i = \sum_i a$
  $m_k a = \sum_i d_i$
  $a = \frac{1}{m_k} \sum_i d_i = c_k$
K-means typically converges quickly.
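One way to see the convergence argument at work is to record the total sum of squared distances to the assigned centroids after every reassignment and every recomputation, and check that it never goes up. The sketch below does this on a toy data set; the helper names (`objective`, `reassign`, `recompute`, `objective_trace`) and the deliberately poor starting centroids are assumptions chosen for illustration.

```python
def squared_distance(x, y):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((a - b) ** 2 for a, b in zip(x, y))


def objective(vectors, centroids, assignment):
    """Total sum of squared distances from each vector to its assigned centroid."""
    return sum(squared_distance(v, centroids[a]) for v, a in zip(vectors, assignment))


def reassign(vectors, centroids):
    """Assign each vector to the index of its closest centroid."""
    return [
        min(range(len(centroids)), key=lambda j: squared_distance(v, centroids[j]))
        for v in vectors
    ]


def recompute(vectors, centroids, assignment):
    """Replace each centroid by the mean of its members (empty clusters unchanged)."""
    updated = []
    for j, old in enumerate(centroids):
        members = [v for v, a in zip(vectors, assignment) if a == j]
        if members:
            updated.append([sum(c) / len(members) for c in zip(*members)])
        else:
            updated.append(list(old))
    return updated


def objective_trace(vectors, centroids, iters=10):
    """Record the objective after every reassignment and every recomputation."""
    centroids = [list(c) for c in centroids]
    trace = []
    for _ in range(iters):
        assignment = reassign(vectors, centroids)
        trace.append(objective(vectors, centroids, assignment))
        centroids = recompute(vectors, centroids, assignment)
        trace.append(objective(vectors, centroids, assignment))
    return trace


points = [[0.0, 0.0], [0.0, 1.0], [10.0, 10.0], [10.0, 11.0]]
# Both starting centroids sit in the same natural cluster on purpose.
trace = objective_trace(points, [[0.0, 0.0], [0.0, 1.0]])
non_increasing = all(later <= earlier + 1e-9 for earlier, later in zip(trace, trace[1:]))
```

Here `non_increasing` should come out true: neither step can raise the objective, which is the core of the argument above.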
Time Complexity
Computing the distance between two docs is O(M), where M is the dimensionality of the vectors.
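The O(M) cost is visible in a direct implementation: one pass over the M components, with a constant amount of work per component. The function name `squared_distance` and the choice of squared Euclidean distance are assumptions for this sketch; other distance or similarity measures over dense vectors would be computed with a similar single pass.

```python
def squared_distance(x, y):
    """Squared Euclidean distance between two M-dimensional vectors in O(M) time."""
    if len(x) != len(y):
        raise ValueError("vectors must have the same dimensionality")
    total = 0.0
    for a, b in zip(x, y):  # exactly M iterations, constant work in each
        total += (a - b) ** 2
    return total


d = squared_distance([1.0, 2.0, 3.0], [1.0, 0.0, 7.0])
```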

