
Can you help me write a SWOT analysis of this research paper?

Fast and Effective Text Mining Using Linear-time Document Clustering
Bjornar Larsen and Chinatsu Aone
SRA International, Inc., 4300 Fair Lakes Court, Fairfax, VA 22033
{bjornar-larsen, aonec}

ABSTRACT
Clustering is a powerful technique for large-scale topic discovery from text. It involves two phases: first, feature extraction maps each document or record to a point in high-dimensional space; then, clustering algorithms automatically group the points into a hierarchy of clusters. We describe an unsupervised, near-linear time text clustering system that offers a number of algorithm choices for each phase. We introduce a methodology for measuring the quality of a cluster hierarchy in terms of F-Measure, and present the results of experiments comparing different algorithms. The evaluation considers some feature selection parameters (tf-idf and feature vector length) but focuses on the clustering algorithms, namely techniques from Scatter/Gather (buckshot, fractionation, and split/join) and k-means. Our experiments suggest that continuous center adjustment contributes more to cluster quality than seed selection does. It follows that using a simpler seed selection algorithm gives a better time/quality tradeoff. We describe a refinement to center adjustment, "vector average damping," that further improves cluster quality. We also compare the near-linear time algorithms to a group average greedy agglomerative clustering algorithm to demonstrate the time/quality tradeoff quantitatively.

Keywords
Clustering, text mining, multi-document summarization

1. INTRODUCTION
The information age is surrounding us with increasingly overwhelming quantities of electronic data. Users need software tools that can help them rapidly explore the most frequent form of data: collections of text. Hand-built directories of web content such as Yahoo! offer one solution to the problem, but unfortunately creating and maintaining such directories requires enormous amounts of human effort.
Routing/categorization systems can help automate the assignment of documents to a topic hierarchy, but they require training and prior knowledge of the topics in a corpus. For many situations, a more practical solution is to discover and approximate these topic hierarchies using unsupervised clustering methods.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. To copy otherwise, to republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. KDD-99 San Diego, CA, USA. Copyright ACM 1999 1-58113-143-7/99/08.

Document clustering helps tackle the information overload problem in several ways. One is exploration; the top level of a cluster hierarchy summarizes at a glance the contents of a document collection, enabling users to selectively drill deeper to explore specific topics of interest without reading every document (e.g., [7], [9], [13]). Used in retrieval (e.g., [6], [10], [14]), clustering organizes search results by topic similarity and potentially helps users find relevant documents more quickly. Some clustering algorithms excel at quickly and accurately grouping duplicate or near-duplicate documents. Though each of these uses is important, the focus of our paper is the first use, i.e., multi-document summarization through the discovery of topic hierarchies.

In this paper, we first describe our fast, scalable document clustering system. This text mining tool is designed to discover topic hierarchies in gigabytes of documents per day^1, and to present the results in an intuitive GUI (cf. Figure 1). We describe the algorithms we use, for both feature extraction and clustering of extracted features.
Then, we evaluate the impact of different algorithms on the quality of the generated cluster hierarchies and on the processing time required to build those hierarchies.

[Figure 1 - User Interface: screenshot of the GUI showing the top-level clusters, each labeled with its highest-weighted centroid terms (e.g., "quake, earthquake, richter scale, damage, ...").]

^1 Current throughput using the default near-linear algorithms, including all pre-processing (e.g., for building a term frequency baseline) and feature extraction, is 1.5 hours for 1 gigabyte of text (~200,000 documents) on a Sun ULTRA 1 with 256MB RAM.

Figure 1 shows the top-level clusters for a small corpus (972 documents), a large portion of which are news stories. The first cluster (239 documents) is about "quake, earthquake, richter scale," etc., so we can conclude that it contains documents about earthquakes. Each cluster is labeled with related terms (from the cluster centroid) that convey a coherent topic. The seventh cluster (84 documents) appears to be on medical issues ("heart, patient, euthanasia, doctor, drug, hospital, medical," etc.). We have expanded this cluster to show the nine sub-clusters (which represent medical subtopics); two of these are documents (i.e., from a cluster of size one), shaded in gray.

2. SYSTEM DESCRIPTION
The primary steps to generating a hierarchy are the extraction of features from documents, followed by clustering or grouping based on those features. Over forty configurable parameters control the algorithms in this process; we will describe the most significant variations.

2.1 Feature Extraction
Feature extraction maps each document into a concise representation of its topic. These extracted features are used to compare documents and to label their content for users. We use the vector space model [11] to represent documents as points in a high-dimensional topic space, where each dimension corresponds to a unique word or concept from the corpus. The mapping process extracts a list of unique terms from each document, assigns each a weight, and represents the document using the n highest-weighted terms. We refer to the parameter n as the feature vector length, and our default is 25. Several parameters control term selection and weighting. By default, we remove stop words. The weight assigned to each remaining term is a function of either the tf (term frequency) or tf-idf (term frequency - inverse document frequency).
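As a minimal illustration of this mapping, the sketch below extracts the n highest-weighted terms from a document using plain tf weighting. The tokenizer and stop-word list are simplified stand-ins for the system's configurable parameters, not the authors' implementation.

```python
from collections import Counter

# Illustrative stop-word list; the real system's list is not given in the paper.
STOP_WORDS = {"the", "a", "an", "and", "or", "of", "to", "in", "is", "that"}

def extract_features(text, n=25):
    """Map a document to its n highest-weighted terms (tf weighting).

    n is the "feature vector length" parameter; 25 matches the
    system's stated default.
    """
    tokens = [t for t in text.lower().split()
              if t.isalpha() and t not in STOP_WORDS]
    tf = Counter(tokens)
    # Keep only the n highest-weighted terms as the feature vector.
    return dict(tf.most_common(n))
```

A document is thus reduced to a small sparse vector, regardless of its original length.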
To use tf-idf, the system makes an initial pass over the text collection to create a "baseline": the baseline stores the total number of documents that each unique term occurs in (i.e., the document frequency). Feature extraction consults this baseline to set term weights as a function of both the term's frequency within the document and the number of documents the term occurs in. An important word is one that appears frequently within the current document, but infrequently across the other documents in the collection. For example, the word 'excavation' is uncommon in most corpora, so if it appeared frequently in a particular document, it would be given a high topic weight. Conversely, words like 'say' that might appear in almost every document are assigned a low weight (even if they occur frequently in a given document).

Because feature extraction produces generic points in space as input to the clustering algorithms, the system is not limited to text; by implementing new feature extraction modules to map data to high-dimensional points, we could cluster any type of data.

2.2 Clustering
The goal of clustering is to group the points in a feature space optimally based on proximity, in order to form a hierarchy of clusters. We unified near-linear time complexity techniques from k-means ([8], [4]) and Scatter/Gather [1]. The techniques are all partitional, meaning that they simply separate a flat collection of items into a single set of "bins." A hierarchy is built by recursively applying a partitional algorithm. The partitional algorithms each run in O(N) with respect to the number of documents, N, so the overall hierarchy is generated in O(N log N) time (assuming a balanced hierarchy). Because documents and clusters are represented as points in space, we can compare them using vector cosine. Clusters include a "center" or "centroid" vector that is the weighted average of the documents or clusters they contain.
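The baseline pass and the resulting tf-idf weighting can be sketched as follows. The paper does not specify its exact idf formula, so a common logarithmic variant is assumed here for illustration.

```python
import math
from collections import Counter

def build_baseline(docs):
    """First pass over the collection: for each unique term, count the
    number of documents it occurs in (the document frequency)."""
    df = Counter()
    for doc in docs:
        df.update(set(doc.lower().split()))
    return df

def tfidf_weight(term, doc, df, num_docs):
    """tf-idf weight for a term in one document: frequent in this
    document but rare across the collection means a high weight.
    The log idf form is an assumption, not the paper's formula."""
    tf = doc.lower().split().count(term)
    idf = math.log(num_docs / (1 + df[term]))
    return tf * idf
```

A corpus-specific word like 'excavation' would score high under this scheme, while a near-ubiquitous word like 'say' would score near zero.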
To prevent longer documents from dominating centroid calculations, we normalize all document vectors to unit length. To compare a document to a cluster, we simply calculate the cosine between the document vector and the cluster's centroid vector.

The partitional algorithms have three stages: seed selection, center adjustment, and cluster refinement. Seed selection is the process of choosing k candidate points in the feature space to serve as centers for the partitions^2. During center adjustment, documents are repeatedly assigned to the nearest center, and the center is recalculated based on the average location of all documents assigned to it, thereby moving it through the feature space. This process may be repeated multiple times. Afterwards, all documents are removed from the centers, and reassigned to the new closest center. Thus, it is important that the centers be distributed effectively enough that they each attract sufficient nearby, topically related documents. Cluster refinement is an optional final step for improving the new partitions.

2.2.1 Seed Selection
Seed selection picks centers to which the system can assign each point in the input set to form a partition. We implemented three seed selection algorithms: random, buckshot, and fractionation. Random is the simplest; it picks k points randomly from the input set as the initial centers. The second method is buckshot, described by [1]. Buckshot picks sqrt(k*n) points randomly from the input set of n items, and clusters them using a high-quality O(N^2) clustering algorithm. The k centroids resulting from this clustering become the initial centers. For the O(N^2) algorithm we use the group average variation of greedy agglomerative clustering, as did [1]. We will also refer to this as the "cluster subroutine." The third method, fractionation, is also described by [1].
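The unit-length normalization and cosine comparison used throughout these stages can be sketched on sparse term-weight vectors:

```python
import math

def normalize(vec):
    """Scale a sparse term-weight vector to unit length so that longer
    documents do not dominate centroid calculations."""
    norm = math.sqrt(sum(w * w for w in vec.values()))
    return {t: w / norm for t, w in vec.items()} if norm else vec

def cosine(a, b):
    """Sparse dot product; for unit-length vectors this is exactly the
    cosine of the angle between them."""
    return sum(w * b.get(t, 0.0) for t, w in a.items())
```

Comparing a document to a cluster is then a single call: cosine(doc_vector, centroid_vector).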
It uses the same cluster subroutine to build a bottom-up hierarchy from the initial input set, clustering fixed-size groups of points at each step to maintain a linear time complexity. The top-level clusters of this hierarchy become the initial seeds.

2.2.2 Center Adjustment
Once k seeds are selected as centers, the system can iteratively assign each point in the input set to the closest center and adjust that center accordingly. If a point's similarity to every center is below the assignment similarity threshold, t, it is not assigned to any center. By default, we use a small non-zero fixed value for t, though we are investigating techniques for setting t dynamically. Continuous k-means [4] consists of following random seed selection with some number of iterations of center adjustment. In

^2 k defaults to 9 in our system.
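A batch sketch of center adjustment under the stated threshold rule follows. The threshold value and iteration count are illustrative, and the continuous variant in the paper updates a center immediately after each assignment rather than in batch, as simplified here.

```python
import math

def normalize(vec):
    norm = math.sqrt(sum(w * w for w in vec.values()))
    return {t: w / norm for t, w in vec.items()} if norm else vec

def cosine(a, b):
    return sum(w * b.get(t, 0.0) for t, w in a.items())

def adjust_centers(points, centers, t=0.05, iterations=2):
    """Assign each point to its closest center, skipping points whose best
    similarity falls below the threshold t, then recompute each center as
    the normalized average of its assigned points. Repeating this moves
    the centers through the feature space."""
    for _ in range(iterations):
        assigned = [[] for _ in centers]
        for p in points:
            sims = [cosine(p, c) for c in centers]
            best = max(range(len(centers)), key=sims.__getitem__)
            if sims[best] >= t:  # below threshold: leave unassigned
                assigned[best].append(p)
        for i, members in enumerate(assigned):
            if members:
                terms = {term for m in members for term in m}
                avg = {term: sum(m.get(term, 0.0) for m in members) / len(members)
                       for term in terms}
                centers[i] = normalize(avg)
    return centers
```

Pairing random seed selection with a few such iterations corresponds to the continuous k-means setup cited above.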
