ICDE99 - Clustering Large Datasets in Arbitrary Metric...

Clustering Large Datasets in Arbitrary Metric Spaces

Venkatesh Ganti, Raghu Ramakrishnan, Johannes Gehrke
Computer Sciences Department, University of Wisconsin-Madison

Allison Powell, James French
Department of Computer Science, University of Virginia, Charlottesville

Abstract

Clustering partitions a collection of objects into groups called clusters, such that similar objects fall into the same group. Similarity between objects is defined by a distance function satisfying the triangle inequality; this distance function along with the collection of objects describes a distance space. In a distance space, the only operation possible on data objects is the computation of distance between them. All scalable algorithms in the literature assume a special type of distance space, namely a k-dimensional vector space, which allows vector operations on objects.

We present two scalable algorithms designed for clustering very large datasets in distance spaces. Our first algorithm, BUBBLE, is, to our knowledge, the first scalable clustering algorithm for data in a distance space. Our second algorithm, BUBBLE-FM, improves upon BUBBLE by reducing the number of calls to the distance function, which may be computationally very expensive. Both algorithms make only a single scan over the database while producing high clustering quality. In a detailed experimental evaluation, we study both algorithms in terms of scalability and quality of clustering. We also show results of applying the algorithms to a real-life dataset.

1. Introduction

Data clustering is an important data mining problem [1, 8, 9, 10, 12, 17, 21, 26]. The goal of clustering is to partition a collection of objects into groups, called clusters, such that "similar" objects fall into the same group. Similarity between objects is captured by a distance function.
In this paper, we consider the problem of clustering large datasets in a distance space in which the only operation possible on data objects is the computation of a distance function that satisfies the triangle inequality. In contrast, objects in a coordinate space can be represented as vectors. The vector representation allows various vector operations, e.g., addition and subtraction of vectors, to form condensed representations of clusters and to reduce the time and space requirements of the clustering problem [4, 26]. These operations are not possible in a distance space, thus making the problem much harder.

The distance function associated with a distance space can be computationally very expensive [5], and may dominate the overall resource requirements. For example, consider the domain of strings where the distance between two strings is the edit distance. Computing the edit distance between two strings of lengths m and n requires O(mn) comparisons between characters. In contrast, computing the

The first three authors were supported by Grant 2053 from the IBM Corporation.
Supported by an IBM Corporate Fellowship.
Supported by NASA GSRP NGT5-50062.
This work supported in part by DARPA contract N66001-97-C-8542.
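
To make the cost of the string example concrete, the following is a minimal sketch of an edit distance computed by dynamic programming. It assumes the unit-cost (Levenshtein) variant with insertions, deletions, and substitutions, which the text above does not specify; the function name `edit_distance` is hypothetical and is not taken from the paper.

```python
def edit_distance(a: str, b: str) -> int:
    """Unit-cost (Levenshtein) edit distance between strings a and b.

    Assumption: insertions, deletions, and substitutions each cost 1.
    The double loop performs len(a) * len(b) character comparisons.
    """
    m, n = len(a), len(b)
    # prev[j] holds the distance between the first i-1 characters of a
    # and the first j characters of b.
    prev = list(range(n + 1))
    for i in range(1, m + 1):
        curr = [i] + [0] * n
        for j in range(1, n + 1):
            cost = 0 if a[i - 1] == b[j - 1] else 1  # one character comparison
            curr[j] = min(
                prev[j] + 1,         # delete a[i-1]
                curr[j - 1] + 1,     # insert b[j-1]
                prev[j - 1] + cost,  # substitute or match
            )
        prev = curr
    return prev[n]


if __name__ == "__main__":
    print(edit_distance("kitten", "sitting"))
```

Each call fills an (m + 1) by (n + 1) table one row at a time, so a clustering algorithm that invokes such a function for many object pairs can spend most of its time inside the distance computation.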