This preview shows page 1. Sign up to view the full content.
Unformatted text preview: ponent is the sum
of the coordinates of the points in the ith
3. The vector SUMSQ: ith component = sum of
squares of coordinates in ith dimension. 11/26/2010 Jure Leskovec, Stanford C246: Mining Massive Datasets 28 2d + 1 values represent any size cluster. d = number of dimensions. Averages in each dimension (centroid) can be
calculated as SUMi /N. SUMi = i th component of SUM. Variance of a cluster’s discard set in
dimension i is: (SUMSQi /N ) – (SUMi /N )2 And standard deviation is the square root of that. Q: Why use this representation of clusters? 11/26/2010 Jure Leskovec, Stanford C246: Mining Massive Datasets 29 1. Find those points that are “sufficiently close”
to a cluster centroid; add those points to
that cluster and the DS. 2. Use any main-memory clustering algorithm
to cluster the remaining points and the old
RS. 11/26/2010 Clusters go to the CS; outlying points to the RS. Jure Leskovec, Stanford C246: Mining Massive Datasets 30 Adjust statistics of the clusters to account for
the new points. 3. Add N’s, SUM’s, SUMSQ’s. 4. Consider merging compressed sets in the CS. 5. If this is the last round, merge all
compressed sets in the CS and all RS points
into their nearest cluster. 11/26/2010 Jure Leskovec, S...
View Full Document
- Winter '09