$$sim(c_i, c_j) = \frac{1}{|c_i \cup c_j|\,(|c_i \cup c_j| - 1)} \sum_{\vec{x} \in c_i \cup c_j} \;\sum_{\substack{\vec{y} \in c_i \cup c_j \\ \vec{y} \neq \vec{x}}} sim(\vec{x}, \vec{y})$$
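For concreteness, the group-average similarity above can be sketched as a direct double loop (a minimal sketch, assuming NumPy, cosine similarity, and unit-length vectors; the function name is illustrative). This direct form costs O(n²m) for n vectors of dimension m:

```python
import numpy as np

def group_average_naive(vectors):
    """Average similarity over all distinct ordered pairs in c_i ∪ c_j.

    vectors: list of unit-length NumPy vectors in the merged cluster.
    For unit vectors, the dot product is the cosine similarity.
    """
    n = len(vectors)
    total = 0.0
    for a in range(n):
        for b in range(n):
            if a != b:  # skip self-pairs, per the y != x condition
                total += float(vectors[a] @ vectors[b])
    # n(n-1) ordered pairs of distinct vectors
    return total / (n * (n - 1))
```

The next slide shows how maintaining per-cluster vector sums removes this quadratic cost.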
Computing Group Average Similarity
• Assume cosine similarity and normalized vectors with unit length.
• Always maintain sum of vectors in each cluster.
• Compute similarity of clusters in constant time:
$$\vec{s}(c_j) = \sum_{\vec{x} \in c_j} \vec{x}$$

$$sim(c_i, c_j) = \frac{(\vec{s}(c_i) + \vec{s}(c_j)) \cdot (\vec{s}(c_i) + \vec{s}(c_j)) - (|c_i| + |c_j|)}{(|c_i| + |c_j|)\,(|c_i| + |c_j| - 1)}$$
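The constant-time form above can be sketched as follows (assuming NumPy and unit-length vectors; names are illustrative). The key observation: for unit vectors, the dot product of the summed vector with itself counts every pairwise similarity plus one self-similarity of 1 per vector, which the numerator subtracts back out:

```python
import numpy as np

def group_average_fast(sum_i, n_i, sum_j, n_j):
    """Group-average similarity of merging clusters c_i and c_j in O(m).

    sum_i, sum_j: maintained sums s(c_i), s(c_j) of each cluster's unit vectors.
    n_i, n_j: cluster sizes |c_i| and |c_j|.
    """
    s = sum_i + sum_j   # s(c_i) + s(c_j)
    n = n_i + n_j       # |c_i| + |c_j|
    # s @ s includes n self-similarities (each exactly 1 for unit vectors),
    # so subtract n before averaging over the n(n-1) ordered pairs.
    return (float(s @ s) - n) / (n * (n - 1))
```

Because only the cluster sums and sizes are needed, a merge updates them in O(m) rather than revisiting every member vector.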
Non-Hierarchical Clustering
• Typically must provide the number of desired clusters, k.
• Randomly choose k instances as seeds, one per cluster.
• Form initial clusters based on these seeds.
• Iterate, repeatedly reallocating instances to different clusters to improve the overall clustering.
• Stop when clustering converges or after a fixed number of iterations.
K-Means
• Assumes instances are real-valued vectors.
• Clusters based on centroids (the center of gravity, or mean, of the points in a cluster c):

$$\vec{\mu}(c) = \frac{1}{|c|} \sum_{\vec{x} \in c} \vec{x}$$

– |c| is the number of data points in cluster c.
• Reassignment of instances to clusters is based on distance to the current cluster centroids.
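The centroid formula is a component-wise mean, which might be sketched as (assuming NumPy; X is a hypothetical (|c|, m) array holding the cluster's points as rows):

```python
import numpy as np

def centroid(X):
    """mu(c): (1/|c|) times the sum of the |c| vectors in cluster c."""
    return X.sum(axis=0) / len(X)  # component-wise mean of the rows
```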
Distance Metrics
• Euclidean distance (L2 norm):

$$L_2(\vec{x}, \vec{y}) = \sqrt{\sum_{i=1}^{m} (x_i - y_i)^2}$$

• L1 norm:

$$L_1(\vec{x}, \vec{y}) = \sum_{i=1}^{m} |x_i - y_i|$$

• Cosine similarity (transform to a distance by subtracting from 1):

$$1 - \frac{\vec{x} \cdot \vec{y}}{|\vec{x}|\,|\vec{y}|}$$
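The three metrics might be sketched as follows (assuming NumPy; the function names are illustrative):

```python
import numpy as np

def l2_distance(x, y):
    """Euclidean (L2) distance: sqrt of summed squared differences."""
    return np.sqrt(np.sum((x - y) ** 2))

def l1_distance(x, y):
    """L1 (Manhattan) distance: summed absolute differences."""
    return np.sum(np.abs(x - y))

def cosine_distance(x, y):
    """1 - cosine similarity, so identical directions give distance 0."""
    return 1.0 - (x @ y) / (np.linalg.norm(x) * np.linalg.norm(y))
```

Note that cosine distance ignores vector length, so vectors pointing the same way are at distance 0 regardless of magnitude.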
K-Means Algorithm

Let d be the distance measure between instances.
Select k random instances {s1, s2, … sk} as seeds.
Until clustering converges or other stopping criterion:
    For each instance xi:
        Assign xi to the cluster cj such that d(xi, sj) is minimal.
    (Update the seeds to the centroid of each cluster)
    For each cluster cj:
        sj = μ(cj)   // recalculate centroids
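The pseudocode above can be turned into a short runnable sketch (assuming NumPy and Euclidean distance; parameter names are illustrative, and the empty-cluster handling is one simple choice among several):

```python
import numpy as np

def kmeans(X, k, iters=100, seed=0):
    """Minimal K-Means: X is an (n, m) array of instances, k the cluster count."""
    rng = np.random.default_rng(seed)
    # Select k random instances as seeds.
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(iters):
        # Assign each x_i to the cluster with the nearest centroid (Euclidean d).
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Recalculate each centroid as the mean of its cluster's points;
        # keep the old centroid if a cluster ends up empty.
        new = np.array([X[labels == j].mean(axis=0) if np.any(labels == j)
                        else centroids[j] for j in range(k)])
        if np.allclose(new, centroids):  # converged: centroid positions unchanged
            break
        centroids = new
    return labels, centroids
```

The stopping test here is the "centroid positions don't change" condition from the termination slide below; a fixed iteration cap (`iters`) is the fallback.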
K-Means Example (K=2)

[Figure: pick seeds → reassign clusters → compute centroids → reassign clusters → compute centroids → reassign clusters → converged!]
Termination conditions

• Several possibilities, e.g.,
– A fixed number of iterations.
– Partition unchanged.
– Centroid positions don't change.
Sec. 16.4
Convergence

• Why should the K-means algorithm ever reach a fixed point?
– A state in which clusters don't change.
• K-means is a special case of a general procedure known as the Expectation-Maximization (EM) algorithm.
– EM is known to converge.
– The number of iterations could be large.
– But in practice it usually isn't.
Time Complexity
• Assume computing the distance between two instances is O(m), where m is the dimensionality of the vectors.
• Reassigning clusters: O(kn) distance computations, or O(knm).
• Computing centroids: each instance vector gets added once to some centroid: O(nm).
• Assume these two steps are each done once for I iterations: O(Iknm).
• Linear in all relevant factors, assuming a fixed number of iterations; more efficient than the O(n²) HAC.
A Simple Example Showing the Implementation of the K-Means Algorithm (K=2)

Step 1: Initialization: we randomly choose the following two centroids (k=2) for the two clusters. In this case the two centroids are m1=(1.0, 1.0) and m2=(5.0, 7.0).

Step 2:
• Thus, we obtain two clusters containing: {1, 2, 3} and {4, 5, 6, 7}.
 Fall '14
 Seon Kim