07-clustering

Accept a point for a cluster if its md is some


... sets

Compute the variance of the combined subcluster. N, SUM, and SUMSQ allow us to make that calculation quickly. Combine if the variance is below some threshold.

Many alternatives: treat dimensions differently, consider density.

11/26/2010 Jure Leskovec, Stanford C246: Mining Massive Datasets

Problem with BFR/k-means: Assumes clusters are normally distributed in each dimension. And axes are fixed; ellipses at an angle are not OK.

CURE, by contrast: Assumes a Euclidean distance. Allows clusters to assume any shape.

[Figure: scatter plot of salary against age, with points labeled h and e]

1. Pick a random sample of points that fit in main memory.
2. Cluster these points hierarchically: group nearest points/clusters.
3. For each cluster, pick a sample of points, as dispersed as possible.
4. From the sample, pick representatives by moving them (say) 20% toward the centroid of the cluster.

[Figure: the same scatter plot of salary against age]

[Figure: the same scatter plot of salary against age] Pick (say) 4 remote points for each cluster.

[Figure: the same scatter plot of salary against age] Move points (say) 20% toward the centroid.

Now, visit each point p in the data set. Place it in the “closest cluster.” Normal definition of “closest”: the cluster that contains the sample point closest to p, among the sample points of all the clusters.
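The CURE steps above can be sketched in a few lines of Python. This is only an illustrative sketch, not the implementation from the slides: it assumes the in-memory sample has already been drawn, uses a naive single-linkage merge for the hierarchical step, and uses a farthest-point heuristic to find "dispersed" points. All function names (`cluster_hierarchically`, `pick_dispersed`, `shrink`, `cure_representatives`, `assign`) are hypothetical.

```python
import math


def centroid(points):
    """Coordinate-wise mean of a non-empty list of points (tuples)."""
    dim = len(points[0])
    return tuple(sum(p[i] for p in points) / len(points) for i in range(dim))


def cluster_hierarchically(sample, k):
    """Step 2 (assumed variant): repeatedly merge the two clusters whose
    closest pair of points is nearest (single linkage) until k remain."""
    clusters = [[p] for p in sample]
    while len(clusters) > k:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                d = min(math.dist(a, b)
                        for a in clusters[i] for b in clusters[j])
                if best is None or d < best[0]:
                    best = (d, i, j)
        _, i, j = best
        clusters[i].extend(clusters[j])
        del clusters[j]
    return clusters


def pick_dispersed(points, m):
    """Step 3 (assumed heuristic): start with the point farthest from the
    centroid, then keep adding the point farthest from those chosen so far."""
    c = centroid(points)
    chosen = [max(points, key=lambda p: math.dist(p, c))]
    while len(chosen) < min(m, len(points)):
        chosen.append(max(
            points, key=lambda p: min(math.dist(p, q) for q in chosen)))
    return chosen


def shrink(reps, c, fraction=0.2):
    """Step 4: move each representative `fraction` of the way toward c."""
    return [tuple(x + fraction * (cx - x) for x, cx in zip(p, c))
            for p in reps]


def cure_representatives(sample, k, m=4, fraction=0.2):
    """Steps 2-4: one list of shrunken representatives per cluster."""
    result = []
    for cluster in cluster_hierarchically(sample, k):
        c = centroid(cluster)
        result.append(shrink(pick_dispersed(cluster, m), c, fraction))
    return result


def assign(p, reps):
    """Final pass: index of the cluster owning the representative
    closest to p."""
    return min(range(len(reps)),
               key=lambda i: min(math.dist(p, r) for r in reps[i]))
```

In use, `cure_representatives(sample, k)` would be called once on the sample, and `assign(p, reps)` once per point during the pass over the full data set. The cubic-time merge loop is only reasonable for a small sample; a real implementation would use a more efficient hierarchical method.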