or 1 only. A movie's point in this space is (x1, x2, …, xk), where xi = 1 iff the i-th customer bought the movie.

11/26/2010 Jure Leskovec, Stanford C246: Mining Massive Datasets 5

We have a choice:
1. Sets as vectors: measure similarity by the cosine distance.
2. Sets as sets: measure similarity by the Jaccard distance.
3. Sets as points: measure similarity by Euclidean distance.

Hierarchical:
  Agglomerative (bottom up): Initially, each point is a cluster. Repeatedly combine the two “nearest” clusters into one.
  Divisive (top down): Start with one cluster and recursively split it.
Point Assignment:
  Maintain a set of clusters.
  Points belong to the “nearest” cluster.

Key operation: Repeatedly combine the two nearest clusters.
Three important questions:
1. How do you represent a cluster of more than one point?
2. How do you determine the “nearness” of clusters?
3. When to stop combining clusters?

Key problem: as you build clusters, how do you represent the location of each cluster, to tell which pair of clusters is closest?
Euclidean case: each cluster has a centroid = av...
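
The agglomerative procedure described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the lecture's own code: it assumes the truncated definition refers to the usual centroid (the coordinate-wise average of a cluster's points), it measures "nearness" as the Euclidean distance between centroids, and it stops when a target number of clusters k is reached. The names centroid, agglomerate, and k are hypothetical.

```python
import math
from itertools import combinations


def centroid(points):
    """Coordinate-wise mean of a list of equal-length tuples (assumed definition)."""
    n = len(points)
    return tuple(sum(coords) / n for coords in zip(*points))


def agglomerate(points, k):
    """Merge the two clusters with the nearest centroids until k clusters remain.

    Stopping at a fixed k is an assumption made for illustration; the slides
    only ask when to stop, they do not fix a rule here.
    """
    clusters = [[p] for p in points]  # initially, each point is its own cluster
    while len(clusters) > k:
        # Find the pair of clusters whose centroids are closest (Euclidean).
        i, j = min(
            combinations(range(len(clusters)), 2),
            key=lambda ij: math.dist(
                centroid(clusters[ij[0]]), centroid(clusters[ij[1]])
            ),
        )
        merged = clusters[i] + clusters[j]
        # Drop the two merged clusters and append their union.
        clusters = [c for idx, c in enumerate(clusters) if idx not in (i, j)]
        clusters.append(merged)
    return clusters


points = [(0.0, 0.0), (0.0, 1.0), (5.0, 5.0), (5.0, 6.0), (10.0, 0.0)]
result = agglomerate(points, 2)
```

Recomputing every centroid on every pass keeps the sketch short but is wasteful; a practical version would cache centroids and update only the merged cluster.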
This document was uploaded on 02/26/2014 for the course CS 246 at Stanford.
 Winter '09
