Section Notes 11
CS51, Spring 2009
Week of April 27, 2009

1 Outline

1. Clustering
2. Impossible Programs
3. List of Topics

By the end of this section, you should be familiar with clustering and recognizing impossible programs.

2 Clustering

Clustering is the problem of dividing unlabeled data into groups. Because the data usually comes from a source that doesn’t give us its “true” classification, this problem is somewhat ill-specified: there isn’t an obvious metric for what the “right answer” is. However, it is still a problem that we often face in the real world, so we have to solve it anyway. There are many approaches; the two that we’re covering in CS51 are the K-means and K-centers algorithms.

2.1 The K-means algorithm

The K-means algorithm is based on the model that the error of our clustering is the sum of the squared distances from each point to its cluster center. For our purposes here, we’ll assume that we know k, the number of clusters we want [1]. Then the algorithm is extremely simple:

1. Pick k points to be initial cluster centers.
2. Assign each point to the cluster whose center it’s closest to.
3. Choose new cluster centers to be the centroid (average) of all the points in the cluster.
4. Repeat steps 2 and 3 until no points change their cluster assignment.

Some properties of K-means:

• Each repetition improves the clusters: the new ones have smaller (or at worst equal) error. (Why this is true is not obvious, but it is true.)
• Will always terminate.
• Tends to be very fast.
• Answer depends on starting points.
• Returned clustering may not be very good: the error function is not convex, so it may find a local minimum.

[1] There do exist ways of picking k if it’s not known, but we won’t cover them here.

2.2 The K-centers algorithm

We talked about the fact that it doesn’t always make sense to use the same error metrics when fitting lines to data. A similar story arises with clustering.

So, let’s consider the case where we don’t want to use the sum of squared errors from the cluster centers as our error metric. Instead, we want to say that the quality of a cluster is the distance of the furthest point from the cluster center. This leads to the K-centers algorithm:

1. Pick a point to be the center of the first cluster.

...
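
To make the four K-means steps from Section 2.1 concrete, here is a minimal Python sketch, not the course's reference implementation. All function names are hypothetical. The notes only say to "pick k points" as initial centers, so drawing them at random from the data is our assumption, as is leaving an empty cluster's center where it was. The last two functions compute the two error metrics discussed in these notes: the sum of squared distances (K-means) and the distance of the furthest point from its center (Section 2.2). Taking the maximum over all clusters is our reading of that second metric.

```python
import math
import random


def squared_distance(p, q):
    """Squared Euclidean distance between two points given as tuples."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def centroid(points):
    """Coordinate-wise average of a non-empty list of points."""
    n = len(points)
    return tuple(sum(coords) / n for coords in zip(*points))


def assign(points, centers):
    """Step 2: for every point, the index of its closest center."""
    return [
        min(range(len(centers)), key=lambda j: squared_distance(p, centers[j]))
        for p in points
    ]


def kmeans(points, k, seed=0):
    """Run steps 1-4 and return (centers, labels)."""
    rng = random.Random(seed)
    # Step 1 (assumption: initial centers are k distinct random data points).
    centers = rng.sample(points, k)
    labels = assign(points, centers)
    while True:
        # Step 3: move each center to the centroid of its cluster.
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:  # assumption: an empty cluster keeps its old center
                centers[j] = centroid(members)
        # Step 2 again, then step 4: stop once no assignment changes.
        new_labels = assign(points, centers)
        if new_labels == labels:
            return centers, labels
        labels = new_labels


def kmeans_error(points, centers, labels):
    """Sum of squared distances from each point to its cluster center."""
    return sum(squared_distance(p, centers[lab]) for p, lab in zip(points, labels))


def furthest_point_error(points, centers, labels):
    """Largest distance from any point to its own cluster center."""
    return max(
        math.sqrt(squared_distance(p, centers[lab])) for p, lab in zip(points, labels)
    )
```

Calling `kmeans(points, 2)` on a list of coordinate tuples returns the final centers and one cluster index per point. Running it with different `seed` values is an easy way to see the "answer depends on starting points" property: on poorly separated data, different seeds can settle into different local minima with different `kmeans_error` values.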