07-clustering

11/26/2010 Jure Leskovec, Stanford C246: Mining Massive Datasets


Approach 1: Use the diameter of the merged cluster, i.e., the maximum distance between points in the cluster.
Approach 2: Use the average distance between points in the cluster.
Approach 3: Use a density-based approach: take the diameter or average distance, for example, and divide by the number of points in the cluster. Perhaps raise the number of points to a power first, e.g., the square root.

Naïve implementation: at each step, compute pairwise distances between all pairs of clusters. O(N^3) in total.
A careful implementation using a priority queue can reduce the time to O(N^2 log N).
This is still too expensive for really big datasets that do not fit in memory.

Assumes a Euclidean space/distance.
Start by picking k, the number of clusters.
Initialize clusters by picking one point per cluster.
Example: pick one point at random, then k-1 other points, each as far away as possible from the previous points.

1. For each point, place it in the cluster whose current centroid it is nearest.
2. After all points are assigned, update the centroids of the k clusters.
3. Optional: reassign all poin...
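
The initialization example and steps 1 and 2 above can be sketched in Python as follows. This is a minimal illustration, not the course's implementation. The function names (pick_initial_points, assign, update_centroids) are hypothetical. "As far away as possible from the previous points" is read here as maximizing the minimum distance to the points already chosen, which is one possible interpretation. Keeping the old centroid for an empty cluster is also an assumption.

```python
import random
from math import dist  # Euclidean distance, Python 3.8+

def pick_initial_points(points, k, rng=random):
    """Pick one point at random, then k-1 more, each maximizing its minimum
    distance to the points already chosen (an assumed reading of the slide)."""
    chosen = [rng.choice(points)]
    while len(chosen) < k:
        farthest = max(points, key=lambda p: min(dist(p, c) for c in chosen))
        chosen.append(farthest)
    return chosen

def assign(points, centroids):
    """Step 1: place each point in the cluster whose centroid is nearest."""
    clusters = [[] for _ in centroids]
    for p in points:
        nearest = min(range(len(centroids)), key=lambda i: dist(p, centroids[i]))
        clusters[nearest].append(p)
    return clusters

def update_centroids(clusters, old_centroids):
    """Step 2: recompute each centroid as the coordinate-wise mean."""
    new_centroids = []
    for members, old in zip(clusters, old_centroids):
        if not members:
            new_centroids.append(old)  # assumption: empty cluster keeps its centroid
        else:
            new_centroids.append(
                tuple(sum(coords) / len(members) for coords in zip(*members))
            )
    return new_centroids

# Small made-up dataset: two well-separated groups of three points each.
points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
initial = pick_initial_points(points, 2, random.Random(0))
clusters = assign(points, initial)
centroids = update_centroids(clusters, initial)
```

On this toy data the two initial points always fall in different groups, so one round of assignment already separates them. On real data, steps 1 and 2 would typically be repeated.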
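
The three cohesion approaches listed earlier (diameter, average distance, density-based) can be compared on a small example. The sketch below is illustrative only: the function names, the power and spread parameters, and the use of Euclidean distance are all assumptions that do not come from the slides.

```python
from itertools import combinations
from math import dist  # Euclidean distance, Python 3.8+

def diameter(points):
    """Approach 1: maximum distance between any two points in the cluster."""
    return max((dist(p, q) for p, q in combinations(points, 2)), default=0.0)

def average_distance(points):
    """Approach 2: mean distance over all unordered pairs of points."""
    pairs = list(combinations(points, 2))
    if not pairs:
        return 0.0
    return sum(dist(p, q) for p, q in pairs) / len(pairs)

def density_score(points, power=1.0, spread=diameter):
    """Approach 3: spread divided by (number of points ** power).
    Assumes a non-empty cluster; power=0.5 gives the square-root variant."""
    return spread(points) / (len(points) ** power)

# Made-up cluster: a 3-4-5 right triangle.
cluster = [(0.0, 0.0), (3.0, 4.0), (0.0, 4.0)]
d = diameter(cluster)                                   # longest side
a = average_distance(cluster)                           # mean of the three sides
s = density_score(cluster, power=0.5, spread=average_distance)
```

Passing the spread function as a parameter lets the density-based score use either the diameter or the average distance, matching the "diameter or average distance" wording of Approach 3.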