07-clustering

CS246: Mining Massive Datasets Jure Leskovec, Stanford University http://cs246.stanford.edu
Given a set of points, with a notion of distance between points, group the points into some number of clusters, so that:
  Members of a cluster are close/similar to each other
  Members of different clusters are dissimilar
Usually:
  Points are in a high-dimensional space
  Similarity is defined using a distance measure: Euclidean, Cosine, Jaccard, edit distance, …
11/26/2010 Jure Leskovec, Stanford CS246: Mining Massive Datasets
[Figure: scatter plot of example points, each marked with an x]
A catalog of 2 billion “sky objects” represents objects by their radiation in 7 dimensions (frequency bands).
Problem: Cluster into similar objects, e.g., galaxies, nearby stars, quasars, etc.
Sloan Sky Survey is a newer, better version.
Intuitively: Movies divide into categories, and customers prefer a few categories.
  But what are categories really?
Represent a movie by the customers who bought/rated it.
Similar movies have similar sets of customers, and vice versa.
Space of all movies:
  Think of a space with one dimension for each customer.
  Values in a dimension may be 0 or 1 only.
  A movie's point in this space is (x_1, x_2, …, x_k), where x_i = 1 iff the i-th customer bought the movie.
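The movie representation above can be sketched in a few lines of Python. The purchase data, the movie names, and the helper `movie_vector` are hypothetical, made up only for illustration; customers are numbered from 0 here rather than from 1.

```python
# Hypothetical purchase data: movie title -> set of customer ids who bought it.
purchases = {
    "movie_a": {0, 2, 3},
    "movie_b": {0, 3},
    "movie_c": {1, 4},
}
num_customers = 5  # k: one dimension per customer


def movie_vector(buyers, k):
    """Return the 0/1 point for a movie: entry i is 1 iff customer i bought it."""
    return [1 if i in buyers else 0 for i in range(k)]


# One point per movie in the k-dimensional customer space.
vectors = {
    title: movie_vector(buyers, num_customers)
    for title, buyers in purchases.items()
}
```

In this toy data, `movie_a` and `movie_b` share two customers and so sit closer together than either does to `movie_c`, which shares none.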
We have a choice:
1. Sets as vectors: measure similarity by the cosine distance.
2. Sets as sets: measure similarity by the Jaccard distance.
3. Sets as points: measure similarity by Euclidean distance.
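To make the three choices concrete, here is a minimal Python sketch that computes each distance for two 0/1 vectors. The function names and the example vectors are assumptions for illustration. Cosine distance is taken here as the angle between the vectors; 1 minus the cosine is another common convention.

```python
import math


def cosine_distance(x, y):
    """Angle in radians between x and y (assumes neither is all zeros)."""
    dot = sum(a * b for a, b in zip(x, y))
    norm_x = math.sqrt(sum(a * a for a in x))
    norm_y = math.sqrt(sum(b * b for b in y))
    # Clamp to [-1, 1] so rounding error cannot push acos out of its domain.
    cos = max(-1.0, min(1.0, dot / (norm_x * norm_y)))
    return math.acos(cos)


def jaccard_distance(x, y):
    """1 - |intersection| / |union|, treating each 0/1 vector as a set."""
    set_x = {i for i, a in enumerate(x) if a}
    set_y = {i for i, b in enumerate(y) if b}
    union = set_x | set_y
    if not union:
        return 0.0  # assumption: two empty sets are treated as identical
    return 1.0 - len(set_x & set_y) / len(union)


def euclidean_distance(x, y):
    """Straight-line distance between x and y as points."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))


# Two hypothetical movies over five customers.
a = [1, 0, 1, 1, 0]
b = [1, 0, 0, 1, 0]
```

For these two vectors the sets share 2 of 3 customers, so the Jaccard distance is 1/3, while the Euclidean distance is 1 because the vectors differ in exactly one coordinate.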
Hierarchical:
  Agglomerative (bottom up):
    Initially, each point is a cluster
    Repeatedly combine the two “nearest” clusters into one
  Divisive (top down):
    Start with one cluster and recursively split it
Point Assignment:
  Maintain a set of clusters
  Points belong to “nearest” cluster
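As a rough illustration of the point-assignment idea, the sketch below places each point in the cluster whose representative is nearest. It assumes each cluster is summarized by a single location and that "nearest" means Euclidean distance; the slide itself does not fix either choice. The function names and the example data are hypothetical.

```python
import math


def nearest_cluster(point, centers):
    """Index of the center closest to `point` under Euclidean distance."""
    return min(range(len(centers)), key=lambda i: math.dist(point, centers[i]))


def assign_points(points, centers):
    """Group points by their nearest center; returns one list per center."""
    clusters = [[] for _ in centers]
    for p in points:
        clusters[nearest_cluster(p, centers)].append(p)
    return clusters


# Hypothetical data: six points and two cluster locations.
points = [(0, 0), (1, 2), (2, 1), (4, 1), (5, 0), (5, 3)]
centers = [(1.0, 1.0), (4.7, 1.3)]
clusters = assign_points(points, centers)
```

With these inputs the three points on the left end up in the first cluster and the three on the right in the second.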
Key operation: Repeatedly combine two nearest clusters
Three important questions:
1. How do you represent a cluster of more than one point?
2. How do you determine the “nearness” of clusters?
3. When to stop combining clusters?
Key problem: As you build clusters, how do you represent the location of each cluster, to tell which pair of clusters is closest?
Euclidean case: each cluster has a centroid = average of its points
  Measure cluster distances by distances of centroids
[Figure, panel "Data": data points (o) at (0,0), (1,2), (2,1), (4,1), (5,0), (5,3); centroids (x) at (1.5,1.5), (4.5,0.5), (1,1), (4.7,1.3). Panel "Dendrogram".]
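The centroids in this example can be reproduced with a naive agglomerative procedure: start with one cluster per point and repeatedly merge the two clusters whose centroids are closest. The Python sketch below is one possible implementation under that assumption, not necessarily the one used in the lecture. The function names are hypothetical, and when two pairs are equally close it merges whichever pair it finds first.

```python
import math
from itertools import combinations


def centroid(points):
    """Coordinate-wise average of a list of points."""
    n = len(points)
    return tuple(sum(coords) / n for coords in zip(*points))


def agglomerate(points):
    """Merge the two clusters with the closest centroids until one remains.

    Returns the centroid of each newly merged cluster, in merge order.
    """
    clusters = [[p] for p in points]
    merge_centroids = []
    while len(clusters) > 1:
        # Find the pair of clusters whose centroids are nearest (first on ties).
        i, j = min(
            combinations(range(len(clusters)), 2),
            key=lambda pair: math.dist(
                centroid(clusters[pair[0]]), centroid(clusters[pair[1]])
            ),
        )
        merged = clusters[i] + clusters[j]
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)]
        clusters.append(merged)
        merge_centroids.append(centroid(merged))
    return merge_centroids


# The six data points from the example.
data = [(0, 0), (1, 2), (2, 1), (4, 1), (5, 0), (5, 3)]
merges = agglomerate(data)
```

Rounded to one decimal place, the first four merges give centroids (1.5, 1.5), (4.5, 0.5), (1.0, 1.0), and (4.7, 1.3), matching the x marks in the figure; the fifth merge joins everything into one cluster. This version recomputes every pairwise centroid distance at each step, which is fine for six points but too slow for large datasets.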
The only “locations” we can talk about are the points themselves.