Approach 1: Use the diameter of the merged
cluster = maximum distance between points
in the cluster.
Approach 2: Use the average distance
between points in the cluster.
Approach 3: Use a density-based approach:
take the diameter or average distance, for example, and
divide it by the number of points in the cluster. Optionally raise the number of points to a power
first, e.g., take its square root.
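
The three approaches above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: points are coordinate tuples, distance is Euclidean, and the function names (diameter, average_distance, density_score) and the power parameter are hypothetical names chosen here, not taken from the slides.

```python
from itertools import combinations
from math import dist  # Euclidean distance between two points (Python 3.8+)


def diameter(points):
    """Approach 1: maximum distance between any two points in the cluster."""
    return max((dist(p, q) for p, q in combinations(points, 2)), default=0.0)


def average_distance(points):
    """Approach 2: mean distance over all pairs of points in the cluster."""
    pairs = list(combinations(points, 2))
    if not pairs:
        return 0.0
    return sum(dist(p, q) for p, q in pairs) / len(pairs)


def density_score(points, power=1.0, spread=diameter):
    """Approach 3: a spread measure divided by (number of points ** power).

    `power` and `spread` are illustrative parameters; power=0.5 corresponds
    to dividing by the square root of the number of points.
    """
    if not points:
        raise ValueError("cluster must contain at least one point")
    return spread(points) / (len(points) ** power)


# Hypothetical 3-point cluster with pairwise distances 3, 4 and 5.
cluster = [(0.0, 0.0), (3.0, 4.0), (0.0, 4.0)]
print(diameter(cluster))                  # largest pairwise distance
print(average_distance(cluster))          # mean pairwise distance
print(density_score(cluster, power=0.5))  # diameter / sqrt(number of points)
```

A lower value of any of these measures suggests a tighter cluster, so the measures can be compared before and after a candidate merge.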
11/26/2010 Jure Leskovec, Stanford C246: Mining Massive Datasets

Naïve implementation: At each step, compute pairwise distances between
all pairs of clusters: O(N^3).
Careful implementation using a priority queue
can reduce time to O(N^2 log N).
Still too expensive for really big datasets that do
not fit in memory.

Assumes Euclidean space/distance.
Start by picking k, the number of clusters.
Initialize clusters by picking one point per
cluster. Example: pick one point at random, then k-1
other points, each as far away as possible from the
previous points.

1. For each point, place it in the cluster whose
current centroid is nearest to it.
2. After all points are assigned, fix (recompute) the centroids
of the k clusters.
3. Optional: reassign all poin...
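
The initialization and steps 1 and 2 above can be sketched as follows. This is a minimal illustration, not the lecture's implementation: it assumes points are coordinate tuples, it reads "as far away as possible from the previous points" as maximizing the minimum distance to the points already chosen, and all function names are hypothetical. Step 3 is truncated in the text above, so it is not sketched.

```python
import random
from math import dist  # Euclidean distance between two points (Python 3.8+)


def init_centroids(points, k, seed=0):
    """Pick one point at random, then k-1 more, each as far as possible
    from the points already chosen (assumed: largest minimum distance)."""
    rng = random.Random(seed)
    chosen = [rng.choice(points)]
    while len(chosen) < k:
        farthest = max(points, key=lambda p: min(dist(p, c) for c in chosen))
        chosen.append(farthest)
    return chosen


def assign(points, centroids):
    """Step 1: place each point in the cluster with the nearest centroid."""
    clusters = [[] for _ in centroids]
    for p in points:
        nearest = min(range(len(centroids)), key=lambda i: dist(p, centroids[i]))
        clusters[nearest].append(p)
    return clusters


def recompute_centroids(clusters, old_centroids):
    """Step 2: set each centroid to the coordinate-wise mean of its cluster.
    An empty cluster keeps its old centroid (a choice made for this sketch)."""
    new_centroids = []
    for members, old in zip(clusters, old_centroids):
        if members:
            mean = tuple(sum(coord) / len(members) for coord in zip(*members))
            new_centroids.append(mean)
        else:
            new_centroids.append(old)
    return new_centroids


# Hypothetical data: two well-separated groups of two points each.
points = [(0.0, 0.0), (0.0, 1.0), (10.0, 10.0), (10.0, 11.0)]
centroids = init_centroids(points, k=2)
clusters = assign(points, centroids)
centroids = recompute_centroids(clusters, centroids)
print(sorted(centroids))
```

With this data, every possible random starting point leads to the same two groups, which is why the farthest-point initialization is attractive when clusters are well separated.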