Section Notes 11
CS51 Spring 2009
Week of April 27, 2009

1 Outline

1. Clustering
2. Impossible Programs
3. List of Topics

By the end of this section, you should be familiar with clustering and recognizing impossible programs.

2 Clustering

Clustering is the problem of dividing unlabeled data into groups. Because the data usually comes from a source that doesn't give us its true classification, this problem is somewhat ill-specified: there isn't an obvious metric for what the right answer is. However, it is still a problem that we often face in the real world, so we have to solve it anyway. There are many approaches; the two that we're covering in CS51 are the K-means and K-centers algorithms.

2.1 The K-means algorithm

The K-means algorithm is based on the model that the error of our clustering is the sum of the squared distances from each point to its cluster center. For our purposes here, we'll assume that we know k, the number of clusters we want [1]. Then the algorithm is extremely simple:

1. Pick k points to be initial cluster centers.
2. Assign each point to the cluster whose center it's closest to.
3. Choose new cluster centers to be the centroid (average) of all the points in the cluster.
4. Repeat steps 2 and 3 until no points change their cluster assignment.

Some properties of K-means:

- Each repetition improves the clusters: the new ones have smaller error. (Why this is true is not obvious, but it is true.)
- Will always terminate.
- Tends to be very fast.
- Answer depends on starting points.
- Returned clustering may not be very good: the error function is not convex, so it may find a local minimum.

[1] There do exist ways of picking k if it's not known, but we won't cover them here.

2.2 The K-centers algorithm

We talked about the fact that it doesn't always make sense to use the same error metrics when fitting lines to data. A similar story arises with clustering. So, let's consider the case where we don't want to use the sum of squared errors from the cluster centers as our error metric. Instead, we want to say that the quality of a cluster is the distance of the furthest point from the cluster center. This leads to the K-centers algorithm:

1. Pick a point to be the center of the first cluster.
...
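Returning to K-means from Section 2.1, the four steps can be sketched in Python roughly as follows. This is a minimal illustration, not the course's reference implementation. The function names (k_means, assign, centroid, squared_error) are made up for this example, points are assumed to be tuples of numbers compared by Euclidean distance, and the initial centers are passed in by the caller because the notes do not say how to pick them. Leaving the old center in place when a cluster ends up empty is also an assumption of this sketch.

```python
from math import dist  # Euclidean distance, available in Python 3.8+


def assign(points, centers):
    """Step 2: for each point, the index of the closest center."""
    return [min(range(len(centers)), key=lambda j: dist(p, centers[j]))
            for p in points]


def centroid(cluster):
    """Coordinate-wise average of a non-empty list of points."""
    n = len(cluster)
    return tuple(sum(coords) / n for coords in zip(*cluster))


def squared_error(points, centers, labels):
    """The K-means error: sum of squared distances to assigned centers."""
    return sum(dist(p, centers[j]) ** 2 for p, j in zip(points, labels))


def k_means(points, initial_centers):
    """Run steps 2-4, starting from caller-chosen centers (step 1)."""
    centers = list(initial_centers)
    labels = None
    while True:
        new_labels = assign(points, centers)
        if new_labels == labels:
            # Step 4: no point changed its cluster, so stop.
            return centers, labels
        labels = new_labels
        # Step 3: move each center to the centroid of its cluster.
        for j in range(len(centers)):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:  # assumption: an empty cluster keeps its old center
                centers[j] = centroid(members)
```

Calling k_means(points, initial_centers) returns the final centers together with one cluster index per point. Running it from different initial centers and comparing squared_error on the results is one way to observe the dependence on starting points noted in the properties list.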
This note was uploaded on 07/26/2009 for the course COMPUTERSC CS51 taught by Professor Greg Morrisett during the Spring '09 term at Harvard.
