ICDE99 - Clustering Large Datasets in Arbitrary Metric...

Clustering Large Datasets in Arbitrary Metric Spaces

Venkatesh Ganti, Raghu Ramakrishnan, Johannes Gehrke
Computer Sciences Department, University of Wisconsin-Madison

Allison Powell, James French
Department of Computer Science, University of Virginia, Charlottesville

The first three authors were supported by Grant 2053 from the IBM Corporation. Supported by an IBM Corporate Fellowship. Supported by NASA GSRP NGT5-50062. This work was supported in part by DARPA contract N66001-97-C-8542.

Abstract

Clustering partitions a collection of objects into groups called clusters, such that similar objects fall into the same group. Similarity between objects is defined by a distance function satisfying the triangle inequality; this distance function along with the collection of objects describes a distance space. In a distance space, the only operation possible on data objects is the computation of distance between them. All scalable algorithms in the literature assume a special type of distance space, namely a k-dimensional vector space, which allows vector operations on objects.

We present two scalable algorithms designed for clustering very large datasets in distance spaces. Our first algorithm BUBBLE is, to our knowledge, the first scalable clustering algorithm for data in a distance space. Our second algorithm BUBBLE-FM improves upon BUBBLE by reducing the number of calls to the distance function, which may be computationally very expensive. Both algorithms make only a single scan over the database while producing high clustering quality. In a detailed experimental evaluation, we study both algorithms in terms of scalability and quality of clustering. We also show results of applying the algorithms to a real-life dataset.

1. Introduction

Data clustering is an important data mining problem [1, 8, 9, 10, 12, 17, 21, 26]. The goal of clustering is to partition a collection of objects into groups, called clusters, such that "similar" objects fall into the same group. Similarity between objects is captured by a distance function.

In this paper, we consider the problem of clustering large datasets in a distance space in which the only operation possible on data objects is the computation of a distance function that satisfies the triangle inequality. In contrast, objects in a coordinate space can be represented as vectors. The vector representation allows various vector operations, e.g., addition and subtraction of vectors, to form condensed representations of clusters and to reduce the time and space requirements of the clustering problem [4, 26]. These operations are not possible in a distance space, thus making the problem much harder.

The distance function associated with a distance space can be computationally very expensive [5], and may dominate the overall resource requirements. For example, consider the domain of strings where the distance between two strings is the edit distance. Computing the edit distance between two strings of lengths m and n requires O(mn) comparisons between characters. In contrast, computing the
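To make the cost of the string example concrete, here is a minimal sketch of the standard dynamic-programming (Levenshtein) edit distance. It is not taken from the paper: the function name `edit_distance`, the unit costs for insertion, deletion and substitution, and the comparison counter are all assumptions made for illustration. For strings of lengths m and n it fills an m-by-n table and performs one character comparison per cell.

```python
from typing import Tuple


def edit_distance(s: str, t: str) -> Tuple[int, int]:
    """Return (edit distance, number of character comparisons).

    Assumes unit cost for insertion, deletion and substitution
    (an illustrative choice, not specified by the source text).
    """
    m, n = len(s), len(t)
    # prev[j] holds the distance between the first i-1 characters of s
    # and the first j characters of t; row 0 is all insertions.
    prev = list(range(n + 1))
    comparisons = 0
    for i in range(1, m + 1):
        # Column 0 of row i: delete all i characters of s.
        curr = [i] + [0] * n
        for j in range(1, n + 1):
            comparisons += 1  # exactly one character comparison per cell
            cost = 0 if s[i - 1] == t[j - 1] else 1
            curr[j] = min(
                prev[j] + 1,         # delete s[i-1]
                curr[j - 1] + 1,     # insert t[j-1]
                prev[j - 1] + cost,  # match or substitute
            )
        prev = curr
    return prev[n], comparisons


# "kitten" (m = 6) versus "sitting" (n = 7): distance 3, 6 * 7 = 42 comparisons
distance, comparisons = edit_distance("kitten", "sitting")
```

Because every call costs on the order of m times n character comparisons, an algorithm that issues fewer distance calls saves work in direct proportion, which is the motivation the abstract gives for BUBBLE-FM.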

