Bursty and Hierarchical Structure in Streams *

Jon Kleinberg †

Abstract

A fundamental problem in text data mining is to extract meaningful structure from document streams that arrive continuously over time. E-mail and news articles are two natural examples of such streams, each characterized by topics that appear, grow in intensity for a period of time, and then fade away. The published literature in a particular research field can be seen to exhibit similar phenomena over a much longer time scale. Underlying much of the text mining work in this area is the following intuitive premise: that the appearance of a topic in a document stream is signaled by a “burst of activity,” with certain features rising sharply in frequency as the topic emerges.

The goal of the present work is to develop a formal approach for modeling such “bursts,” in such a way that they can be robustly and efficiently identified, and can provide an organizational framework for analyzing the underlying content. The approach is based on modeling the stream using an infinite-state automaton, in which bursts appear naturally as state transitions; it can be viewed as drawing an analogy with models from queueing theory for bursty network traffic. The resulting algorithms are highly efficient, and yield a nested representation of the set of bursts that imposes a hierarchical structure on the overall stream. Experiments with e-mail and research paper archives suggest that the resulting structures have a natural meaning in terms of the content that gave rise to them.

* This work appears in the Proceedings of the 8th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 2002.

† Department of Computer Science, Cornell University, Ithaca NY 14853. Email: [email protected] Supported in part by a David and Lucile Packard Foundation Fellowship, an ONR Young Investigator Award, NSF ITR/IM Grant IIS-0081334, and NSF Faculty Early Career Development Award CCR-9701399.

1 Introduction

Documents can be naturally organized by topic, but in many settings we also experience their arrival over time. E-mail and news articles provide two clear examples of such document streams: in both cases, the strong temporal ordering of the content is necessary for making sense of it, as particular topics appear, grow in intensity, and then fade away again. Over a much longer time scale, the published literature in a particular research field can be meaningfully understood in this way as well, with particular research themes growing and diminishing in visibility across a period of years.

Work in the areas of topic detection and tracking [2, 3, 6, 67, 68], text mining [39, 62, 63, 64], and visualization [29, 47, 66] has explored techniques for identifying topics in document streams comprised of news stories, using a combination of content analysis and time-series modeling....