! Exercise 7.3.3: Give an example of a dataset and a selection of k initial centroids such that when the
points are reassigned to their nearest centroid at the end, at least one of the initial k points is reassigned
to a different cluster.
Exercise 7.3.4: F
combined another pair of points. In general, it is possible that this rule will result in an entirely different
clustering from that obtained using the distance-of-centroids rule.
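The disagreement between merge rules can be illustrated with a small sketch. It assumes that the rule being contrasted with the distance-of-centroids rule is the minimum distance between any two points, one from each cluster, and that distances are Euclidean. The data and the helper names (centroid_distance, min_distance, closest_pair) are made up for illustration.

```python
from itertools import combinations
from math import dist


def centroid(cluster):
    """Coordinate-wise average of the points in a cluster."""
    n = len(cluster)
    return tuple(sum(c) / n for c in zip(*cluster))


def centroid_distance(a, b):
    """Distance-of-centroids rule."""
    return dist(centroid(a), centroid(b))


def min_distance(a, b):
    """Assumed alternative rule: closest pair of points, one from each cluster."""
    return min(dist(p, q) for p in a for q in b)


def closest_pair(clusters, rule):
    """Indices (i, j) of the two clusters that the given rule would merge next."""
    return min(combinations(range(len(clusters)), 2),
               key=lambda ij: rule(clusters[ij[0]], clusters[ij[1]]))


# Hypothetical data: two long, thin clusters end to end, plus one point above the first.
clusters = [
    [(0.0, 0.0), (10.0, 0.0)],   # cluster 0
    [(11.0, 0.0), (21.0, 0.0)],  # cluster 1
    [(5.0, 4.0)],                # cluster 2
]

by_centroid = closest_pair(clusters, centroid_distance)
by_minimum = closest_pair(clusters, min_distance)
print(by_centroid, by_minimum)
```

On this data the centroid rule merges clusters 0 and 2, while the minimum-distance rule merges clusters 0 and 1, so the two hierarchies diverge from that step on.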
2. Take the distance between two clusters to be the average distance of all p
An optional step at the end is to fix the centroids of the clusters and to reassign each point, including the
k initial points, to the k clusters. Usually, a point p will be assigned to the same cluster in which it was
placed on the first pass. However, there
clusters than there really are, the measure will rise precipitously. The idea is expressed by the diagram of
[Plot: horizontal axis "Number of Clusters", with the "Correct value of k" marked]
Figure 7.9: Average diameter or another measure of di
2. The Compressed Set: These are summaries, similar to the cluster summaries, but for sets of points
that have been found close to one another, but not close to any cluster. The points represented by the
compressed set are also discarded, in the sense tha
clustroid. We can select the clustroid in various ways, each designed to, in some sense, minimize the
distances between the clustroid and the other points in the cluster. Common choices include selecting as
the clustroid the point that minimizes:
1. The s
Figure 7.4: Clustering after two additional steps
However, there is a somewhat more efficient implementation of which we should be aware.
1. We start, as we must, by computing the distances between all pairs of points, and this step is O(n^2).
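As a rough illustration of this first step only, the sketch below computes the distance for every unordered pair of points. It assumes Euclidean distance and uses a hypothetical helper name, all_pair_distances; n points give n(n-1)/2 pairs, which is where the quadratic cost comes from.

```python
from itertools import combinations
from math import dist


def all_pair_distances(points):
    """Map each index pair (i, j) with i < j to the distance between the two points."""
    return {(i, j): dist(points[i], points[j])
            for i, j in combinations(range(len(points)), 2)}


# Hypothetical data: four points in the plane.
points = [(0.0, 0.0), (3.0, 4.0), (6.0, 8.0), (0.0, 1.0)]
distances = all_pair_distances(points)
print(len(distances))  # number of unordered pairs, n(n-1)/2
```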
d^2(p,q) = d^2(p,c) + d^2(c,q)
If we sum over all q other than c, and then add d^2(p,c) to ROWSUM(p) to account for the fact that the
clustroid is one of the points in the cluster, we derive ROWSUM(p) = ROWSUM(c) + N d^2(p,c). Now, we
must see if the new point
Exercise 7.2.1: Perform a hierarchical clustering of the one-dimensional set of points 1, 4, 9, 16, 25, 36,
49, 64, 81, assuming clusters are represented by
their centroid (average), and at each step the clusters with the closest
circle will be assigned to the ring if it is outside the ring. If the outlier is between the ring and the circle,
it will be assigned to one or the other, somewhat favoring the ring because its representative points have
been moved toward the circle.
the greatest authorities, since they are linked to by the two biggest hubs, A and D. For Web-sized graphs,
the only way of computing the solution to the hubs-and-authorities equations is iteratively. However, for
this tiny example, we can compute the solut
(a) The square of the radius.
(b) The diameter (not squared).
What are the densities, according to (a) and (b), of the clusters that result from the merger of any two of
these three clusters? Does the difference in densities suggest the clusters should or s
Since the last two steps are executed at most n times, and the first two steps are executed only once, the
overall running time of this algorithm is O(n^2 log n). That is better than O(n^3), but it still puts a strong
limit on how large n can be before it beco
1, we get b = 0.3583 and d = 0.7165. Along with c = e = 0, these values give us the limiting value of h. The
value of a can be computed from h by multiplying by L^T and scaling.
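For larger graphs, the iterative computation can be sketched as follows. This is a minimal illustration, not the text's example: the three-page link matrix is hypothetical, row i of the matrix marks the pages that page i links to, and each vector is assumed to be scaled so that its largest component is 1. The function names are chosen here for illustration.

```python
def mat_vec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(x * y for x, y in zip(row, vector)) for row in matrix]


def transpose(matrix):
    return [list(col) for col in zip(*matrix)]


def scale_to_max_one(vector):
    """Scale so the largest component is 1 (assumes some component is positive)."""
    biggest = max(vector)
    return [x / biggest for x in vector]


def hubs_and_authorities(link, iterations=50):
    """Alternate a = L^T h and h = L a, rescaling after each multiplication."""
    link_t = transpose(link)
    h = [1.0] * len(link)
    for _ in range(iterations):
        a = scale_to_max_one(mat_vec(link_t, h))
        h = scale_to_max_one(mat_vec(link, a))
    # Recover a from the final h by multiplying by L^T and scaling.
    a = scale_to_max_one(mat_vec(link_t, h))
    return h, a


# Hypothetical graph: page 0 links to 1 and 2, page 1 links to 2, page 2 links to 0.
link = [
    [0, 1, 1],
    [0, 0, 1],
    [1, 0, 0],
]
h, a = hubs_and_authorities(link)
print(h, a)
```

A fixed iteration count keeps the sketch short; a practical version would more likely stop when successive estimates of h differ by less than some tolerance.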
5.5.3 Exercises for Section 5.5
Exercise 5.5.1: Compute the hubbiness and au
strange bends, S-shapes, or even rings. Instead of representing clusters by their centroid, it uses a
collection of representative points, as the name implies.
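One way to picture representative points is the sketch below. It rests on assumptions that go beyond the sentence above: representatives are chosen to be as far from one another as possible (a farthest-first heuristic), and each is then moved a fixed fraction of the way toward the cluster centroid. The fraction of 0.2, the Euclidean distance, the data, and the function names are illustrative choices.

```python
from math import dist


def centroid(points):
    """Coordinate-wise average of the points."""
    n = len(points)
    return tuple(sum(c) / n for c in zip(*points))


def pick_representatives(points, count):
    """Farthest-first selection: start with the point farthest from the centroid,
    then repeatedly add the point farthest from all representatives chosen so far."""
    center = centroid(points)
    reps = [max(points, key=lambda p: dist(p, center))]
    while len(reps) < count:
        candidates = [p for p in points if p not in reps]
        if not candidates:
            break  # fewer distinct points than requested representatives
        reps.append(max(candidates,
                        key=lambda p: min(dist(p, r) for r in reps)))
    return reps


def shrink_toward_centroid(reps, center, fraction=0.2):
    """Move each representative the given fraction of the way toward the centroid."""
    return [tuple(r_i + fraction * (c_i - r_i) for r_i, c_i in zip(r, center))
            for r in reps]


# Hypothetical cluster: the corners of a square plus its middle point.
cluster = [(0.0, 0.0), (4.0, 0.0), (4.0, 4.0), (0.0, 4.0), (2.0, 2.0)]
reps = pick_representatives(cluster, 4)
moved = shrink_toward_centroid(reps, centroid(cluster))
print(reps)
print(moved)
```

Here the four corners are selected and the middle point is not, so the representatives trace the outline of the cluster rather than its center.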
Figure 7.12: Two clusters, one surrounding the other