Clustering Large Datasets in Arbitrary Metric Spaces
Venkatesh Ganti
Raghu Ramakrishnan
Johannes Gehrke
Computer Sciences Department, University of Wisconsin-Madison
Allison Powell
James French
Department of Computer Science, University of Virginia, Charlottesville
Abstract
Clustering partitions a collection of objects into groups called
clusters, such that similar objects fall into the same group.
Similarity between objects is defined by a distance function
satisfying the triangle inequality; this distance function along
with the collection of objects describes a distance space. In a
distance space, the only operation possible on data objects is the
computation of distance between them. All scalable algorithms in the
literature assume a special type of distance space, namely a
$k$-dimensional vector space, which allows vector operations on
objects.
We present two scalable algorithms designed for clustering very
large datasets in distance spaces. Our first algorithm BUBBLE is, to
our knowledge, the first scalable clustering algorithm for data in a
distance space. Our second algorithm BUBBLE-FM improves upon BUBBLE
by reducing the number of calls to the distance function, which may
be computationally very expensive. Both algorithms make only a
single scan over the database while producing high clustering
quality. In a detailed experimental evaluation, we study both
algorithms in terms of scalability and quality of clustering. We
also show results of applying the algorithms to a real-life dataset.
1. Introduction
Data clustering is an important data mining problem
[1, 8, 9, 10, 12, 17, 21, 26]. The goal of clustering is to
partition a collection of objects into groups, called clusters,
such that “similar” objects fall into the same group. Similarity
between objects is captured by a distance function.
In this paper, we consider the problem of clustering large datasets
in a distance space in which the only operation possible on data
objects is the computation of a distance function that satisfies the
triangle inequality. In contrast, objects in a coordinate space can
be represented as vectors. The vector representation allows various
vector operations, e.g., addition and subtraction of vectors, to
form condensed representations of clusters and to reduce the time
and space requirements of the clustering problem [4, 26]. These
operations are not possible in a distance space, thus making the
problem much harder.
The first three authors were supported by Grant 2053 from the IBM
corporation.
Supported by an IBM Corporate Fellowship
Supported by NASA GSRP NGT5-50062.
This work supported in part by DARPA contract N66001-97-C-8542.
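To illustrate why vector operations matter, the following is a minimal sketch of a condensed cluster representation in a coordinate space. It is not taken from the paper or from the algorithms cited in [4, 26]; the class name VectorClusterSummary, its fields, and the helper summarize are assumptions chosen for illustration. The point is that the summary is built and merged using only vector addition, so the member objects never need to be stored.

```python
from dataclasses import dataclass
from functools import reduce
from typing import List, Sequence


@dataclass
class VectorClusterSummary:
    """Hypothetical condensed summary of a cluster of k-dimensional vectors."""

    count: int                # number of member vectors
    linear_sum: List[float]   # component-wise sum of the member vectors
    squared_sum: float        # sum of squared Euclidean norms of the members

    @classmethod
    def from_point(cls, point: Sequence[float]) -> "VectorClusterSummary":
        # A single vector is a cluster of size one.
        return cls(1, list(point), sum(x * x for x in point))

    def merge(self, other: "VectorClusterSummary") -> "VectorClusterSummary":
        # Merging two clusters needs only additions on the summaries.
        return VectorClusterSummary(
            self.count + other.count,
            [a + b for a, b in zip(self.linear_sum, other.linear_sum)],
            self.squared_sum + other.squared_sum,
        )

    def centroid(self) -> List[float]:
        # Mean vector, recovered from the summary alone.
        return [s / self.count for s in self.linear_sum]

    def radius(self) -> float:
        # Root-mean-square distance of the members from the centroid.
        c = self.centroid()
        mean_sq = self.squared_sum / self.count - sum(x * x for x in c)
        return max(mean_sq, 0.0) ** 0.5


def summarize(points: Sequence[Sequence[float]]) -> VectorClusterSummary:
    """Fold a non-empty collection of vectors into one summary."""
    return reduce(
        VectorClusterSummary.merge,
        (VectorClusterSummary.from_point(p) for p in points),
    )


if __name__ == "__main__":
    summary = summarize([(0.0, 0.0), (2.0, 0.0), (4.0, 0.0)])
    print(summary.centroid(), summary.radius())
```

None of this carries over to a distance space: objects such as strings cannot be added or averaged, so a centroid cannot be computed this way, and any cluster summary has to be derived from pairwise distances alone.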
The distance function associated with a distance space can be
computationally very expensive [5], and may dominate the overall
resource requirements. For example, consider the domain of strings
where the distance between two strings is the edit distance.
Computing the edit distance between two strings of lengths $m$ and
$n$ requires $O(mn)$
comparisons between characters. In contrast, computing the