07-clustering

It assumes that clusters are normally distributed.


Jure Leskovec, Stanford C246: Mining Massive Datasets, 11/26/2010

Points are read one main-memory-full at a time. Most points from previous memory loads are summarized by simple statistics. To begin, from the initial load we select the initial k centroids by some sensible approach.

Possibilities include:
1. Take a small random sample and cluster optimally.
2. Take a sample; pick a random point, and then k–1 more points, each as far from the previously selected points as possible.

1. The discard set (DS): points close enough to a centroid to be summarized.
2. The compression set (CS): groups of points that are close together but not close to any centroid. They are summarized, but not assigned to a cluster.
3. The retained set (RS): isolated points.

[Figure: a cluster and its centroid, with the cluster's points in the DS; compressed sets, whose points are in the CS; isolated points in the RS.]

For each cluster, the discard set is summarized by:
1. The number of points, N.
2. The vector SUM, whose ith com...
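The second initialization option listed above (a random first point, then k–1 more points, each as far as possible from those already chosen) can be sketched as follows. This is only an illustrative reading, not the slides' own code: it assumes that "as far as possible" means maximizing the distance to the nearest already-selected point, and that distance is Euclidean. The names `squared_distance` and `pick_initial_centroids` and the `seed` parameter are hypothetical.

```python
import random


def squared_distance(p, q):
    """Squared Euclidean distance between two equal-length points."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def pick_initial_centroids(sample, k, seed=0):
    """Pick k starting centroids from a sample (hypothetical helper).

    Assumption: "as far from the previously selected points as possible"
    is read as maximizing the distance to the closest point chosen so far.
    """
    if not 1 <= k <= len(sample):
        raise ValueError("k must be between 1 and the sample size")
    rng = random.Random(seed)

    # Start from one point picked uniformly at random.
    centroids = [sample[rng.randrange(len(sample))]]

    # nearest[i] holds the squared distance from sample[i] to the
    # closest centroid chosen so far.
    nearest = [squared_distance(p, centroids[0]) for p in sample]

    while len(centroids) < k:
        # Take the sample point whose closest centroid is farthest away.
        idx = max(range(len(sample)), key=nearest.__getitem__)
        chosen = sample[idx]
        centroids.append(chosen)
        # Refresh each point's distance to its closest centroid.
        nearest = [
            min(d, squared_distance(p, chosen))
            for d, p in zip(nearest, sample)
        ]
    return centroids


if __name__ == "__main__":
    # Three visually separate groups; each should contribute one centroid.
    sample = [(0.0, 0.0), (0.1, 0.0), (10.0, 10.0), (10.0, 10.1), (-10.0, 10.0)]
    print(pick_initial_centroids(sample, k=3))
```

Only the first pick is random, so different seeds can return different points, but on well-separated data each pick tends to land in a group that has no centroid yet. Tracking each point's distance to its nearest chosen centroid keeps the cost at one pass over the sample per selected centroid.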