07-clustering

Values in a dimension may be 0 or 1 only.


A movie's point in this space is (x1, x2, …, xk), where xi = 1 iff the i-th customer bought the movie.

11/26/2010 Jure Leskovec, Stanford C246: Mining Massive Datasets

We have a choice:
1. Sets as vectors: measure similarity by the cosine distance.
2. Sets as sets: measure similarity by the Jaccard distance.
3. Sets as points: measure similarity by Euclidean distance.

Hierarchical:
  Agglomerative (bottom up): Initially, each point is a cluster. Repeatedly combine the two “nearest” clusters into one.
  Divisive (top down): Start with one cluster and recursively split it.
Point assignment:
  Maintain a set of clusters.
  Points belong to the “nearest” cluster.

Key operation: Repeatedly combine the two nearest clusters.
Three important questions:
1. How do you represent a cluster of more than one point?
2. How do you determine the “nearness” of clusters?
3. When do you stop combining clusters?

Key problem: As you build clusters, how do you represent the location of each cluster, to tell which pair of clusters is closest?
Euclidean case: each cluster has a centroid = av...
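The three distance choices for 0/1 vectors and the agglomerative "combine the two nearest clusters" step can be sketched in Python as below. This is only an illustration under stated assumptions, not the course's implementation: the function names are hypothetical, cosine distance is taken to be the angle between the vectors (1 minus cosine similarity is another common convention), each cluster is represented by the mean of its points, nearness is the Euclidean distance between those means, and merging stops at a fixed target number of clusters.

```python
import math
from itertools import combinations


def jaccard_distance(a, b):
    """1 - |A ∩ B| / |A ∪ B|, reading each 0/1 vector as a set of indices."""
    inter = sum(1 for x, y in zip(a, b) if x and y)
    union = sum(1 for x, y in zip(a, b) if x or y)
    # Assumption: two empty sets are treated as identical (distance 0).
    return 1.0 - inter / union if union else 0.0


def cosine_distance(a, b):
    """Angle in radians between a and b (assumed convention, see lead-in)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("cosine distance is undefined for an all-zero vector")
    # Clamp to [-1, 1] to guard against floating-point rounding.
    return math.acos(max(-1.0, min(1.0, dot / (norm_a * norm_b))))


def euclidean_distance(a, b):
    """Straight-line distance between two points."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def mean_point(points):
    """Coordinate-wise mean of a non-empty list of equal-length points."""
    return [sum(coords) / len(points) for coords in zip(*points)]


def agglomerate(points, num_clusters):
    """Bottom-up clustering: start with one cluster per point, then merge
    the two clusters with the closest means until num_clusters remain.
    The fixed-count stopping rule is an assumption made for this sketch."""
    clusters = [[p] for p in points]
    while len(clusters) > num_clusters:
        i, j = min(
            combinations(range(len(clusters)), 2),
            key=lambda ij: euclidean_distance(
                mean_point(clusters[ij[0]]), mean_point(clusters[ij[1]])
            ),
        )
        # i < j always holds, so deleting index j leaves index i valid.
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
    return clusters
```

For instance, two movies bought by customers {1, 2} and {1, 2, 3} out of four customers are the points [1, 1, 0, 0] and [1, 1, 1, 0]; their Jaccard distance is 1/3 and their Euclidean distance is 1. Recomputing every mean on every pass keeps the sketch short but is wasteful; a practical version would cache the cluster representatives.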

This document was uploaded on 02/26/2014 for the course CS 246 at Stanford.
