Section Notes 11
CS51 Spring 2009
Week of April 27, 2009

1 Outline

1. Clustering
2. Impossible Programs
3. List of Topics

By the end of this section, you should be familiar with clustering and with recognizing impossible programs.

2 Clustering

Clustering is the problem of dividing unlabeled data into groups. Because the data usually comes from a source that doesn't give us its true classification, this problem is somewhat ill-specified: there isn't an obvious metric for what the right answer is. However, it is still a problem that we often face in the real world, so we have to solve it anyway. There are many approaches; two that we're covering in CS51 are the K-means and K-centers algorithms.

2.1 The K-means algorithm

The K-means algorithm is based on the model that the error of our clustering is the sum of the squared distances from the points to their cluster centers. For our purposes here, we'll assume that we know k, the number of clusters we want [1]. Then the algorithm is extremely simple (a code sketch appears at the end of these notes):

1. Pick k points to be the initial cluster centers.
2. Assign each point to the cluster whose center it's closest to.
3. Choose new cluster centers to be the centroid (average) of all the points in each cluster.
4. Repeat from step 2 until no points change their cluster assignment.

Some properties of K-means:

- Each repetition improves the clusters: the new ones have smaller error. (Why this is true is not obvious, but it is: reassigning a point in step 2 moves it to a nearer center, and a cluster's centroid in step 3 is the point that minimizes the sum of squared distances to the cluster's points, so neither step can increase the error.)
- Will always terminate.
- Tends to be very fast.
- Answer depends on the starting points.
- Returned clustering may not be very good: the error function is not convex, so it may find only a local minimum.

[1] There do exist ways of picking k if it's not known, but we won't cover them here.

2.2 The K-centers algorithm

We talked about the fact that it doesn't always make sense to use the same error metric when fitting lines to data. A similar story arises with clustering. So, let's consider the case where we don't want to use the sum of squared errors from the cluster centers as our error metric. Instead, we want to say that the quality of a cluster is the distance of the furthest point from the cluster center. This leads to the K-centers algorithm:

1. Pick a point to be the center of the first cluster. ...
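To make the four K-means steps concrete, here is a minimal sketch of the loop from section 2.1. Python is used purely for illustration (it is not the course's language), and the helper names squared_dist and centroid are our own:

```python
import random

def kmeans(points, k, max_iters=100):
    """Cluster points (tuples of numbers) into k groups,
    following the four steps in section 2.1."""
    # Step 1: pick k of the input points as initial centers.
    centers = random.sample(points, k)
    assignment = None
    for _ in range(max_iters):
        # Step 2: assign each point to the nearest center.
        new_assignment = [
            min(range(k), key=lambda j: squared_dist(p, centers[j]))
            for p in points
        ]
        # Step 4: stop when no point changes its cluster.
        if new_assignment == assignment:
            break
        assignment = new_assignment
        # Step 3: move each center to the centroid of its cluster.
        for j in range(k):
            cluster = [p for p, a in zip(points, assignment) if a == j]
            if cluster:  # keep the old center if a cluster goes empty
                centers[j] = centroid(cluster)
    return centers, assignment

def squared_dist(p, q):
    return sum((pi - qi) ** 2 for pi, qi in zip(p, q))

def centroid(cluster):
    n = len(cluster)
    return tuple(sum(coords) / n for coords in zip(*cluster))
```

For example, kmeans([(0, 0), (0, 1), (10, 10), (10, 11)], 2) should return centers near (0, 0.5) and (10, 10.5), though, as noted in the properties above, the answer depends on the randomly chosen starting centers.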
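The step list for K-centers is cut off above after step 1. A standard way to complete it, sketched below purely as our assumption about the remaining steps, is the greedy "farthest-first" heuristic, which directly targets the max-distance error metric described in section 2.2. The sketch reuses squared_dist from the K-means sketch:

```python
def kcenters(points, k):
    """Greedy farthest-first heuristic: each new center is the point
    farthest from all centers chosen so far, which keeps the maximum
    point-to-center distance (the K-centers error metric) small."""
    # Step 1 (from the notes): pick a point to be the first center.
    centers = [points[0]]
    while len(centers) < k:
        # Add the point whose nearest existing center is farthest away.
        # (Squared distance preserves the ordering of distances.)
        farthest = max(
            points,
            key=lambda p: min(squared_dist(p, c) for c in centers),
        )
        centers.append(farthest)
    # Finally, assign every point to its nearest center.
    assignment = [
        min(range(k), key=lambda j: squared_dist(p, centers[j]))
        for p in points
    ]
    return centers, assignment
```

Unlike K-means, this heuristic makes a single greedy pass rather than iterating to convergence, and for metric distances it is known to produce a maximum point-to-center distance within a factor of two of the best possible.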