CS229 Lecture notes
Andrew Ng
The k-means clustering algorithm
In the clustering problem, we are given a training set
$\{x^{(1)}, \ldots, x^{(m)}\}$, and want to group the data into a few
cohesive “clusters.” Here, $x^{(i)} \in \mathbb{R}^n$ as usual; but no labels
$y^{(i)}$ are given. So, this is an unsupervised learning problem.
The k-means clustering algorithm is as follows:

1. Initialize cluster centroids $\mu_1, \mu_2, \ldots, \mu_k \in \mathbb{R}^n$ randomly.
2. Repeat until convergence: {
     For every $i$, set
       $$c^{(i)} := \arg\min_j \|x^{(i)} - \mu_j\|^2.$$
     For each $j$, set
       $$\mu_j := \frac{\sum_{i=1}^{m} 1\{c^{(i)} = j\}\, x^{(i)}}{\sum_{i=1}^{m} 1\{c^{(i)} = j\}}.$$
   }
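As a rough illustration, the two updates in step 2 might be sketched in plain Python as follows. The function names, the `max_iters` cap, and the stopping rule (stop once the assignments $c^{(i)}$ no longer change) are assumptions made for this sketch; the notes only say "repeat until convergence." The notes also do not say what to do when no example is assigned to a cluster, in which case the denominator of the centroid update is zero; this sketch simply leaves that centroid where it was.

```python
def squared_distance(a, b):
    """Squared Euclidean distance ||a - b||^2 between two points."""
    return sum((ai - bi) ** 2 for ai, bi in zip(a, b))


def assign_clusters(points, centroids):
    """Assignment step: c^(i) := arg min_j ||x^(i) - mu_j||^2."""
    return [
        min(range(len(centroids)), key=lambda j: squared_distance(x, centroids[j]))
        for x in points
    ]


def update_centroids(points, assignments, centroids):
    """Update step: mu_j := mean of the points currently assigned to cluster j."""
    new_centroids = []
    for j, old in enumerate(centroids):
        members = [x for x, c in zip(points, assignments) if c == j]
        if not members:
            # Assumption (not from the notes): an empty cluster keeps its centroid.
            new_centroids.append(list(old))
            continue
        n = len(members)
        new_centroids.append([sum(coords) / n for coords in zip(*members)])
    return new_centroids


def k_means(points, initial_centroids, max_iters=100):
    """Alternate the two steps until the assignments stop changing.

    max_iters is a hypothetical safety cap, not part of the notes.
    """
    centroids = [list(mu) for mu in initial_centroids]
    assignments = None
    for _ in range(max_iters):
        new_assignments = assign_clusters(points, centroids)
        if new_assignments == assignments:
            break  # no assignment changed, so the centroids would not move either
        assignments = new_assignments
        centroids = update_centroids(points, assignments, centroids)
    return centroids, assignments
```

For example, calling `k_means([[0, 0], [0, 1], [10, 10], [10, 11]], [[0, 0], [10, 10]])` groups the first two points into cluster 0 and the last two into cluster 1, with each centroid at the mean of its two points.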
In the algorithm above, $k$ (a parameter of the algorithm) is the number
of clusters we want to find; and the cluster centroids $\mu_j$ represent our
current guesses for the positions of the centers of the clusters. To initialize
the cluster centroids (in step 1 of the algorithm above), we could choose $k$
training examples randomly, and set the cluster centroids to be equal to the
values of these $k$ examples. (Other initialization methods are also possible.)
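The initialization described here, picking $k$ training examples at random and copying them as the starting centroids, could look like the minimal sketch below. The function name `init_centroids` and the optional `seed` parameter are hypothetical, and sampling without replacement (so that $k$ distinct examples are chosen) is an assumption; the notes do not specify it.

```python
import random


def init_centroids(points, k, seed=None):
    """Pick k training examples at random and copy them as initial centroids.

    Assumption: examples are drawn without replacement, so k must not
    exceed the number of training examples.
    """
    rng = random.Random(seed)
    chosen = rng.sample(points, k)
    # Copy each example so later centroid updates do not modify the training set.
    return [list(x) for x in chosen]
```

Passing a fixed `seed` makes the choice reproducible, which can be convenient when comparing runs that start from different initializations.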