6.896 Sublinear Time Algorithms                                    April 26, 2007

Lecture 22

Lecturer: Ronitt Rubinfeld                                 Scribe: Brendan Juba

1 Overview

We will continue examining sublinear time algorithms for clustering. Last time, we considered a set of n points and gave an algorithm that decides, in a constant number of queries, whether they are (k, b)-radius clusterable. Today we'll give an algorithm for a completely different notion of clustering, which examines the average distance of the points from their centers rather than the maximum distance. Our algorithm will make O(log n) queries, which is worse, but it outputs an approximation of how well the data can be clustered rather than simply testing whether or not a clustering exists. In fact, we'll even see how to find a concise representation of an approximate clustering in sublinear time. It's worth noting that the exact version of this problem is also NP-complete.

2 Notation and preliminaries

Let X be a set of n points such that for any two points x, y ∈ X, the distance between x and y, dist(x, y), is at most M. Given k centers c_1, ..., c_k ∈ X, we define

    f_{c_1,...,c_k}(x) = min_i dist(x, c_i).

We remark that, in clustering, one should always check whether the cluster centers are allowed to be arbitrary points, or whether they are restricted to come from the input data. In this case, notice that they must lie in X, the set we wish to cluster. We define the cost of a clustering to be the average over all points of the distance to the closest center, i.e., the cost of the clustering f_{c_1,...,c_k} is

    E_X[f_{c_1,...,c_k}(x)] = (1/|X|) * sum_{x in X} f_{c_1,...,c_k}(x).

Our goal is to choose centers c_1, ..., c_k minimizing this cost.
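As an illustration of the cost definition above (this is a sketch I am adding, not part of the lecture; the point representation and distance function are assumptions), the cost of a candidate set of centers can be computed directly by averaging, over all points, the distance to the nearest center:

```python
# Sketch: the cost of a clustering f_{c_1,...,c_k}, i.e. the average over
# all points x in X of min_i dist(x, c_i). The 1-D points and the
# absolute-value distance below are illustrative assumptions.

def clustering_cost(points, centers, dist):
    """Average distance from each point to its closest center."""
    return sum(min(dist(x, c) for c in centers) for x in points) / len(points)

# Example: four 1-D points; note the centers are drawn from the input
# set X, as required in this lecture's formulation.
points = [0.0, 1.0, 9.0, 10.0]
centers = [0.0, 10.0]
cost = clustering_cost(points, centers, lambda a, b: abs(a - b))
# nearest-center distances are 0, 1, 1, 0, so the cost is 2/4 = 0.5
```

The approximation algorithm discussed in the lecture estimates this quantity with far fewer than |X| distance evaluations; the exhaustive average here is just the definition being approximated.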