07-clustering

The centroid: Jure Leskovec, Stanford C246, 11/26/2010


... component is the sum of the coordinates of the points in the i-th dimension.
3. The vector SUMSQ: i-th component = sum of squares of coordinates in the i-th dimension.

11/26/2010 Jure Leskovec, Stanford C246: Mining Massive Datasets

2d + 1 values represent a cluster of any size, where d = number of dimensions.
The average in each dimension (the centroid) can be calculated as SUM_i / N, where SUM_i = i-th component of SUM.
The variance of a cluster's discard set in dimension i is (SUMSQ_i / N) - (SUM_i / N)^2, and the standard deviation is the square root of that.
Q: Why use this representation of clusters?

1. Find those points that are "sufficiently close" to a cluster centroid; add those points to that cluster and to the DS (discard set).
2. Use any main-memory clustering algorithm to cluster the remaining points and the old RS (retained set). Clusters go to the CS (compressed sets); outlying points go to the RS.
3. Adjust the statistics of the clusters to account for the new points: add N's, SUM's, SUMSQ's.
4. Consider merging compressed sets in the CS.
5. If this is the last round, merge all compressed sets in the CS and all RS points into their nearest cluster.
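One possible answer to the question about this representation is that N, SUM, and SUMSQ are additive: absorbing a point or merging two clusters only requires adding the corresponding values, and the centroid and variance can be recovered afterwards without revisiting any point. The following is a minimal Python sketch of that idea, not code from the slides. The class name ClusterSummary and its method names are hypothetical, chosen only for illustration.

```python
from math import sqrt


class ClusterSummary:
    """Hypothetical holder for the 2d + 1 values: N, SUM, and SUMSQ."""

    def __init__(self, d):
        self.d = d                # number of dimensions
        self.n = 0                # N: number of points in the cluster
        self.sum = [0.0] * d      # SUM: per-dimension sum of coordinates
        self.sumsq = [0.0] * d    # SUMSQ: per-dimension sum of squared coordinates

    def add_point(self, point):
        """Account for one new point by updating N, SUM, and SUMSQ."""
        self.n += 1
        for i, x in enumerate(point):
            self.sum[i] += x
            self.sumsq[i] += x * x

    def merge(self, other):
        """Merge another summary into this one by adding N's, SUM's, SUMSQ's."""
        self.n += other.n
        for i in range(self.d):
            self.sum[i] += other.sum[i]
            self.sumsq[i] += other.sumsq[i]

    def centroid(self):
        """Average in each dimension: SUM_i / N."""
        return [s / self.n for s in self.sum]

    def variance(self):
        """Variance in each dimension: (SUMSQ_i / N) - (SUM_i / N)^2."""
        return [sq / self.n - (s / self.n) ** 2
                for s, sq in zip(self.sum, self.sumsq)]

    def std(self):
        """Standard deviation: square root of the variance.

        max(v, 0.0) guards against tiny negative values caused by
        floating-point rounding.
        """
        return [sqrt(max(v, 0.0)) for v in self.variance()]


a = ClusterSummary(2)
a.add_point((1.0, 2.0))
a.add_point((3.0, 4.0))
print(a.centroid())   # [2.0, 3.0]
print(a.variance())   # [1.0, 1.0]

b = ClusterSummary(2)
b.add_point((5.0, 6.0))
a.merge(b)            # a now summarizes all three points
print(a.n, a.centroid())
```

Because merging touches only 2d + 1 numbers per cluster, its cost is independent of how many points each cluster already holds.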