Points are read one main-memory-full at a
time. Most points from previous memory loads are
summarized by simple statistics. To begin, from the initial load we select the
initial k centroids by some sensible approach.

11/26/2010 Jure Leskovec, Stanford C246: Mining Massive Datasets

Possibilities include:
1. Take a small random sample and cluster
optimally
2. Take a sample; pick a random point, and then
k–1 more points, each as far from the previously
selected points as possible.

1. The discard set (DS): points close enough to
a centroid to be summarized.
2. The compression set (CS): groups of points
that are close together but not close to any
centroid. They are summarized, but not
assigned to a cluster.
3. The retained set (RS): isolated points.

[Figure: a cluster with its centroid, whose points are in the DS; compressed sets, whose points are in the CS; and isolated points in the RS.]

For each cluster, the discard set is
summarized by:
1. The number of points, N.
2. The vector SUM, whose ith com...
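
The second initialization option and the per-cluster summary described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the slides' own code. The function names (farthest_first_centroids, summarize, centroid) are hypothetical. "As far as possible from the previously selected points" is read here as maximizing the distance to the nearest already-selected point. Because the text is cut off, the sketch also assumes that the ith component of SUM is the sum of the points' ith coordinates.

```python
import math
import random


def farthest_first_centroids(sample, k, seed=0):
    """Pick k initial centroids from a sample: one random point, then k-1
    more, each as far as possible from the points already chosen."""
    rng = random.Random(seed)
    chosen = [sample[rng.randrange(len(sample))]]
    while len(chosen) < k:
        # Assumption: "far from the selected points" means the largest
        # distance to the nearest already-chosen point.
        best = max(sample, key=lambda p: min(math.dist(p, c) for c in chosen))
        chosen.append(best)
    return chosen


def summarize(points):
    """Summarize a group of points by N and the vector SUM.
    Assumption: SUM[i] is the sum of the i-th coordinates."""
    n = len(points)
    dims = len(points[0])
    total = [sum(p[i] for p in points) for i in range(dims)]
    return n, total


def centroid(n, total):
    """Recover the centroid from the summary alone: SUM[i] / N."""
    return [s / n for s in total]


if __name__ == "__main__":
    sample = [(0.0, 0.0), (0.1, 0.0), (10.0, 0.0), (10.0, 0.2), (0.0, 9.0)]
    print(farthest_first_centroids(sample, k=3))
    n, total = summarize([(1.0, 2.0), (3.0, 4.0)])
    print(n, total, centroid(n, total))
```

Under these assumptions, a discarded point never has to be kept in memory: absorbing it into a cluster only increments N and adds its coordinates to SUM, and the centroid can still be recomputed from the two.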