bhs - Bursty and Hierarchical Structure in Streams * Jon...

Info iconThis preview shows pages 1–3. Sign up to view the full content.

View Full Document Right Arrow Icon

Info iconThis preview has intentionally blurred sections. Sign up to view the full version.

View Full DocumentRight Arrow Icon
This is the end of the preview. Sign up to access the rest of the document.

Unformatted text preview: Bursty and Hierarchical Structure in Streams * Jon Kleinberg Abstract A fundamental problem in text data mining is to extract meaningful structure from document streams that arrive continuously over time. E-mail and news articles are two natural examples of such streams, each characterized by topics that appear, grow in intensity for a period of time, and then fade away. The published literature in a particular research field can be seen to exhibit similar phenomena over a much longer time scale. Underlying much of the text mining work in this area is the following intuitive premise that the appearance of a topic in a document stream is signaled by a burst of activity, with certain features rising sharply in frequency as the topic emerges. The goal of the present work is to develop a formal approach for modeling such bursts, in such a way that they can be robustly and efficiently identified, and can provide an organizational framework for analyzing the underlying content. The ap- proach is based on modeling the stream using an infinite-state automaton, in which bursts appear naturally as state transitions; it can be viewed as drawing an analogy with models from queueing theory for bursty network traffic. The resulting algorithms are highly efficient, and yield a nested representation of the set of bursts that imposes a hierarchical structure on the overall stream. Experiments with e-mail and research paper archives suggest that the resulting structures have a natural meaning in terms of the content that gave rise to them. * This work appears in the Proceedings of the 8th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 2002. Department of Computer Science, Cornell University, Ithaca NY 14853. Email: Supported in part by a David and Lucile Packard Foundation Fellowship, an ONR Young Investigator Award, NSF ITR/IM Grant IIS-0081334, and NSF Faculty Early Career Development Award CCR-9701399. 1 1 Introduction Documents can be naturally organized by topic, but in many settings we also experience their arrival over time. E-mail and news articles provide two clear examples of such docu- ment streams : in both cases, the strong temporal ordering of the content is necessary for making sense of it, as particular topics appear, grow in intensity, and then fade away again. Over a much longer time scale, the published literature in a particular research field can be meaningfully understood in this way as well, with particular research themes growing and diminishing in visibility across a period of years. Work in the areas of topic detection and tracking [2, 3, 6, 67, 68], text mining [39, 62, 63, 64], and visualization [29, 47, 66] has explored techniques for identifying topics in document streams comprised of news stories, using a combination of content analysis and time-series modeling....
View Full Document

This note was uploaded on 12/27/2011 for the course CMPSC 290a taught by Professor Vandam during the Fall '09 term at UCSB.

Page1 / 25

bhs - Bursty and Hierarchical Structure in Streams * Jon...

This preview shows document pages 1 - 3. Sign up to view the full document.

View Full Document Right Arrow Icon
Ask a homework question - tutors are online