Chapter 8
Approximation algorithms III

By Sariel Har-Peled, September 19, 2007. Version: 0.2

This work is licensed under the Creative Commons Attribution-Noncommercial 3.0 License. To view a copy of this license, visit http://creativecommons.org/licenses/by-nc/3.0/ or send a letter to Creative Commons, 171 Second Street, Suite 300, San Francisco, California, 94105, USA.

8.1 Clustering

Consider the problem of unsupervised learning. We are given a set of examples, and we would like to partition them into classes of similar examples. For example, given a webpage X about The Reality Dysfunction, one would like to find all webpages on this topic (or closely related topics). Similarly, a webpage about All Quiet on the Western Front should be in the same group as a webpage about Storm of Steel (since both are about soldier experiences in World War I). The hope is that all such webpages of interest would be in the same cluster as X, if the clustering is good.

More formally, the input is a set of examples, usually interpreted as points in high dimensions. For example, given a webpage W, we represent it as a point in high dimensions by setting the $i$th coordinate to 1 if the word $w_i$ appears somewhere in the document and to 0 otherwise, where we have a prespecified list of 10,000 words that we care about. Thus, the webpage W can be interpreted as a point of the $\{0, 1\}^{10,000}$ hypercube; namely, a point in 10,000 dimensions. Let X be the resulting set of n points in d dimensions.

To be able to partition points into similar clusters, we need to define a notion of similarity. Such a similarity measure can be any distance function between points. For example, consider the regular Euclidean distance between points, where

$d(p, q) = \sqrt{\sum_{i=1}^{d} (p_i - q_i)^2}$,

where $p = (p_1, \ldots, p_d)$ and $q = (q_1, \ldots, q_d)$.

As another motivating example, consider the facility location problem. We are given a set X of n cities and distances between them, and we would like to build k hospitals, so that the maximum distance of a city from its closest hospital is minimized. (So that the maximum time it would take a patient to get to their closest hospital is bounded.)

Intuitively, what we are interested in is selecting good representatives for the input point set X. Namely, we would like to find k points in X such that they represent X well.

Formally, consider a subset S of k points of X, and a point p of X. The distance of p from the set S is

$d(p, S) = \min_{q \in S} d(p, q)$;

namely, $d(p, S)$ is the minimum distance of a point of S to p. If we interpret S as a set of centers, then $d(p, S)$ is the distance of p to its closest center.
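The definitions above can be made concrete with a short Python sketch. It is only an illustration under stated assumptions: the function names (to_point, euclidean, dist_to_set, max_dist_to_centers) and the toy vocabulary and points are hypothetical and do not come from the notes. The last function evaluates the quantity the facility location example wants to minimize, for one fixed choice of centers; it does not search for a good set of centers.

```python
import math


def to_point(page_words, vocabulary):
    """Represent a page as a 0/1 point: coordinate i is 1 iff vocabulary[i] appears."""
    return tuple(1 if w in page_words else 0 for w in vocabulary)


def euclidean(p, q):
    """Euclidean distance d(p, q) between two points of the same dimension."""
    return math.sqrt(sum((pi - qi) ** 2 for pi, qi in zip(p, q)))


def dist_to_set(p, S):
    """d(p, S): distance from p to its closest point (center) in S."""
    return min(euclidean(p, q) for q in S)


def max_dist_to_centers(X, S):
    """Largest distance of any point of X from its closest center in S."""
    return max(dist_to_set(p, S) for p in X)


# Toy data (hypothetical): three "cities" in the plane, two of them used as centers.
cities = [(0, 0), (3, 4), (6, 8)]
centers = [(0, 0), (6, 8)]

print(to_point({"war", "soldier"}, ["war", "soldier", "space"]))  # (1, 1, 0)
print(dist_to_set((3, 4), centers))            # 5.0
print(max_dist_to_centers(cities, centers))    # 5.0
```

Here the city (3, 4) is at distance 5 from both centers, and the other two cities are centers themselves, so the maximum distance of a city from its closest center is 5.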
