07-clustering

CS246: Mining Massive Datasets Jure Leskovec, Stanford University http://cs246.stanford.edu

Given a set of points, with a notion of distance between points, group the points into some number of clusters, so that:
  Members of a cluster are close/similar to each other
  Members of different clusters are dissimilar
Usually:
  Points are in a high-dimensional space
  Similarity is defined using a distance measure: Euclidean, Cosine, Jaccard, edit distance, …
11/26/2010 Jure Leskovec, Stanford CS246: Mining Massive Datasets
[Figure: scatter plot of data points, each marked x]

A catalog of 2 billion “sky objects” represents objects by their radiation in 7 dimensions (frequency bands).
Problem: Cluster into similar objects, e.g., galaxies, nearby stars, quasars, etc.
The Sloan Sky Survey is a newer, better version.
Intuitively: Movies divide into categories, and customers prefer a few categories.
  But what are categories really?
Represent a movie by the customers who bought/rated it.
Similar movies have similar sets of customers, and vice versa.
Space of all movies:
  Think of a space with one dimension for each customer.
  Values in a dimension may be 0 or 1 only.
  A movie’s point in this space is $(x_1, x_2, \ldots, x_k)$, where $x_i = 1$ iff the $i$-th customer bought the movie.
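As a minimal sketch of this representation, the snippet below builds one 0/1 vector per movie with one dimension per customer. The purchase records, customer names, and the helper `movie_vector` are hypothetical and invented only for illustration.

```python
# Hypothetical (customer, movie) purchase records, invented for illustration.
purchases = [
    ("alice", "Alien"),
    ("bob", "Alien"),
    ("bob", "Brazil"),
    ("carol", "Brazil"),
    ("carol", "Casablanca"),
]

# Fix an ordering of customers: one dimension per customer.
customers = sorted({customer for customer, _ in purchases})


def movie_vector(movie):
    """Return (x_1, ..., x_k) with x_i = 1 iff the i-th customer bought the movie."""
    buyers = {customer for customer, m in purchases if m == movie}
    return tuple(1 if customer in buyers else 0 for customer in customers)


movies = sorted({movie for _, movie in purchases})
vectors = {movie: movie_vector(movie) for movie in movies}

for movie, vec in vectors.items():
    print(movie, vec)
```

With real data the number of dimensions equals the number of customers, so in practice one would likely store only the set of buyers per movie rather than the full, mostly zero vector.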

We have a choice:
1. Sets as vectors: measure similarity by the cosine distance.
2. Sets as sets: measure similarity by the Jaccard distance.
3. Sets as points: measure similarity by Euclidean distance.
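The three choices can be compared on the same pair of customer sets. The sketch below assumes each movie is stored as the set of customers who bought it, so the 0/1 vectors never need to be materialized. Cosine distance is taken here to be the angle between the vectors, which is an assumption; some libraries instead use one minus the cosine. The function names and sample sets are hypothetical.

```python
import math


def jaccard_distance(a, b):
    """1 - |a ∩ b| / |a ∪ b|, treating each movie as a set of customers."""
    union = a | b
    if not union:
        return 0.0  # two empty sets are treated as identical
    return 1.0 - len(a & b) / len(union)


def cosine_distance(a, b):
    """Angle in radians between the 0/1 vectors that represent sets a and b."""
    if not a or not b:
        raise ValueError("cosine distance is undefined for an empty set")
    # For 0/1 vectors the dot product is |a ∩ b| and each norm is sqrt(|set|).
    cos = len(a & b) / math.sqrt(len(a) * len(b))
    return math.acos(min(1.0, cos))  # clamp guards against rounding above 1


def euclidean_distance(a, b):
    """Euclidean distance between the 0/1 vectors: sqrt of the number of differing coordinates."""
    return math.sqrt(len(a ^ b))


alien = {"alice", "bob"}
brazil = {"bob", "carol"}
print(jaccard_distance(alien, brazil))    # 1 - 1/3
print(cosine_distance(alien, brazil))     # angle whose cosine is 1/2
print(euclidean_distance(alien, brazil))  # sqrt(2)
```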
Hierarchical:
  Agglomerative (bottom up):
    Initially, each point is a cluster
    Repeatedly combine the two “nearest” clusters into one
  Divisive (top down):
    Start with one cluster and recursively split it
Point Assignment:
  Maintain a set of clusters
  Points belong to the “nearest” cluster
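The point-assignment idea can be sketched in a few lines. The example assumes Euclidean distance and assumes each cluster is summarized by a single center point; neither choice is fixed by the slide. The function `assign_points` and the sample coordinates are hypothetical.

```python
import math


def assign_points(points, centers):
    """Map each center index to the list of points whose nearest center it is."""
    clusters = {i: [] for i in range(len(centers))}
    for p in points:
        # "Nearest" is Euclidean here (an assumption); ties go to the lowest index.
        nearest = min(range(len(centers)), key=lambda i: math.dist(p, centers[i]))
        clusters[nearest].append(p)
    return clusters


points = [(0, 0), (1, 2), (5, 0), (5, 3)]
centers = [(1, 1), (5, 1)]
print(assign_points(points, centers))
```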

Key operation: Repeatedly combine the two nearest clusters
Three important questions:
1. How do you represent a cluster of more than one point?
2. How do you determine the “nearness” of clusters?
3. When to stop combining clusters?
Key problem: as you build clusters, how do you represent the location of each cluster, to tell which pair of clusters is closest?
Euclidean case: each cluster has a centroid = average of its points
  Measure cluster distances by distances of centroids
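For the Euclidean case, the centroid and the centroid-to-centroid distance might be computed as follows. This is a small sketch; the names `centroid` and `cluster_distance` are hypothetical, and each cluster is assumed to be a non-empty list of coordinate tuples.

```python
import math


def centroid(points):
    """Coordinate-wise average of a non-empty list of points."""
    n = len(points)
    return tuple(sum(coords) / n for coords in zip(*points))


def cluster_distance(cluster_a, cluster_b):
    """Distance between two clusters, measured between their centroids."""
    return math.dist(centroid(cluster_a), centroid(cluster_b))


left = [(1, 2), (2, 1)]
right = [(4, 1), (5, 0)]
print(centroid(left))                 # (1.5, 1.5)
print(centroid(right))                # (4.5, 0.5)
print(cluster_distance(left, right))  # sqrt(10)
```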

[Figure: data points and dendrogram]
Data points (o): (0,0), (1,2), (2,1), (4,1), (5,0), (5,3)
Centroids (x): (1.5,1.5), (4.5,0.5), (1,1), (4.7,1.3)
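The centroids in the figure can be reproduced by running centroid-based agglomerative clustering on the six data points. The sketch below is a naive version that rescans all cluster pairs on every round; it assumes Euclidean distance, breaks ties by taking the first closest pair found, and keeps merging until a single cluster remains, since the figure does not state a stopping rule. The function names are hypothetical.

```python
import math
from itertools import combinations


def centroid(points):
    """Coordinate-wise average of a non-empty list of points."""
    n = len(points)
    return tuple(sum(coords) / n for coords in zip(*points))


def agglomerate(points):
    """Repeatedly merge the two clusters with the closest centroids.

    Returns the centroid created by each merge, in merge order.
    """
    clusters = [[p] for p in points]  # initially, each point is a cluster
    merge_centroids = []
    while len(clusters) > 1:
        # Pick the pair of cluster indices (i < j) with the closest centroids.
        i, j = min(
            combinations(range(len(clusters)), 2),
            key=lambda ij: math.dist(
                centroid(clusters[ij[0]]), centroid(clusters[ij[1]])
            ),
        )
        merged = clusters[i] + clusters[j]
        del clusters[j]  # delete the higher index first so i stays valid
        del clusters[i]
        clusters.append(merged)
        merge_centroids.append(centroid(merged))
    return merge_centroids


data = [(5, 3), (1, 2), (2, 1), (4, 1), (0, 0), (5, 0)]
history = agglomerate(data)
for c in history:
    print(tuple(round(v, 1) for v in c))
```

The first four merges give (1.5, 1.5), (4.5, 0.5), (1.0, 1.0), and (4.7, 1.3) after rounding, matching the x marks in the figure; the fifth merge joins everything into one cluster.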
The only “locations” we can talk about are the points themselves.