steinbach00kddTM

A Comparison of Document Clustering Techniques

Michael Steinbach, George Karypis, Vipin Kumar
Department of Computer Science / Army HPC Research Center, University of Minnesota
4-192 EE/CSci Building, 200 Union Street SE, Minneapolis, Minnesota 55455
steinbac@cs.umn.edu, karypis@cs.umn.edu, kumar@cs.umn.edu

ABSTRACT

This paper presents the results of an experimental study of some common document clustering techniques: agglomerative hierarchical clustering and K-means. (We used both a “standard” K-means algorithm and a “bisecting” K-means algorithm.) Our results indicate that the bisecting K-means technique is better than the standard K-means approach and (somewhat surprisingly) as good or better than the hierarchical approaches that we tested.

Keywords

K-means, hierarchical clustering, document clustering.

1. INTRODUCTION

Hierarchical clustering is often portrayed as the better quality clustering approach, but is limited because of its quadratic time complexity. In contrast, K-means and its variants have a time complexity that is linear in the number of documents, but are thought to produce inferior clusters. Sometimes K-means and agglomerative hierarchical approaches are combined so as to “get the best of both worlds.” For example, in the document domain, Scatter/Gather [1], a document browsing system based on clustering, uses a hybrid approach involving both K-means and agglomerative hierarchical clustering. K-means is used because of its run-time efficiency and agglomerative hierarchical clustering is used because of its quality.

However, during the course of our experiments we discovered that a simple and efficient variant of K-means, “bisecting” K-means, can produce clusters of documents that are better than those produced by “regular” K-means and as good or better than those produced by agglomerative hierarchical clustering techniques. We have also been able to find what we think is a reasonable explanation for this behavior.
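This excerpt names bisecting K-means without defining it. As a rough illustration only, the sketch below follows one common formulation: start with all points in a single cluster, then repeatedly split one cluster in two with plain K-means (K = 2) until K clusters exist. Several details are assumptions made for this sketch and are not taken from the text above: squared Euclidean distance on dense vectors (document clustering more typically uses cosine similarity on term vectors), splitting the largest cluster at each step, keeping the best of a few trial splits by sum of squared errors, and all function names and parameters.

```python
import random


def mean(points):
    """Component-wise centroid of a non-empty list of equal-length vectors."""
    n = len(points)
    return [sum(col) / n for col in zip(*points)]


def sq_dist(a, b):
    """Squared Euclidean distance (an assumption; not the paper's measure)."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def sse(points):
    """Sum of squared distances from each point to the cluster centroid."""
    center = mean(points)
    return sum(sq_dist(p, center) for p in points)


def two_means(points, rng, iters=20):
    """Split points in two with plain K-means (K=2); None if the split is degenerate."""
    centers = rng.sample(points, 2)
    best = None
    for _ in range(iters):
        left, right = [], []
        for p in points:
            closer_to_first = sq_dist(p, centers[0]) <= sq_dist(p, centers[1])
            (left if closer_to_first else right).append(p)
        if not left or not right:
            break  # e.g. both centers coincide; keep the last valid split
        best = (left, right)
        centers = [mean(left), mean(right)]
    return best


def bisecting_kmeans(points, k, trials=5, seed=0):
    """Hypothetical helper: bisect the largest cluster until k clusters exist."""
    rng = random.Random(seed)
    clusters = [list(points)]
    while len(clusters) < k:
        # Assumption: always split the largest cluster.
        i = max(range(len(clusters)), key=lambda j: len(clusters[j]))
        target = clusters[i]
        if len(target) < 2:
            break  # nothing left to split
        splits = [two_means(target, rng) for _ in range(trials)]
        splits = [s for s in splits if s is not None]
        if not splits:
            break  # every trial was degenerate (e.g. all points identical)
        # Keep the trial split with the lowest total within-cluster error.
        left, right = min(splits, key=lambda s: sse(s[0]) + sse(s[1]))
        clusters[i:i + 1] = [left, right]
    return clusters


if __name__ == "__main__":
    data = [[0, 0], [0, 1], [1, 0], [10, 10], [10, 11], [11, 10]]
    for cluster in bisecting_kmeans(data, k=2):
        print(cluster)
```

Each bisection only runs K-means with two centers on a single cluster, which is part of why the variant stays cheap compared with agglomerative methods that compare all pairs of clusters.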
We refer the reader to [2] for a review of cluster analysis and to [4] for a review of information retrieval. For a more complete version of this paper, please see [6]. The data sets that we used are ones that are described

This note was uploaded on 02/10/2012 for the course CSE 5800 taught by Professor Staff during the Fall '09 term at FIT.
