p181-fung - Parameter Free Bursty Events Detection in Text...


Parameter Free Bursty Events Detection in Text Streams

Gabriel Pui Cheong Fung, Jeffrey Xu Yu, Philip S. Yu, Hongjun Lu

The Chinese University of Hong Kong, Hong Kong, China, {pcfung,yu}@se.cuhk.edu.hk
T. J. Watson Research Center, IBM, USA, psyu@us.ibm.com
The Hong Kong University of Science and Technology, Hong Kong, China, luhj@cs.ust.hk

Abstract

Text classification is a major data mining task. An advanced text classification technique, known as partially supervised text classification, can build a text classifier using only a small set of positive examples. This raises the question of whether it is possible to find a set of features that describes the positive examples. If so, users would not even need to specify a set of positive examples. As a first step, in this paper we formalize this as a new problem, called hot bursty events detection: detecting bursty events from a text stream, which is a sequence of chronologically ordered documents. Here, a bursty event is a set of bursty features, and is considered a potential category for building a text classifier. Note that the hot bursty events detection problem studied in this paper differs from TDT (topic detection and tracking), which attempts to cluster documents as events using clustering techniques. In other words, our focus is on detecting a set of bursty features for a bursty event. We propose a novel parameter free probabilistic approach, called feature-pivot clustering. Our main technique is to fully utilize the time information to determine a set of bursty features which may occur in different time windows. We detect bursty events based on the feature distributions. There is no need to tune or estimate any parameters. We conduct experiments using real-life data from a major English newspaper in Hong Kong, and show that the parameter free feature-pivot clustering approach can detect the bursty events with a high success rate.

Permission to copy without fee all or part of this material is granted provided that the copies are not made or distributed for direct commercial advantage, the VLDB copyright notice and the title of the publication and its date appear, and notice is given that copying is by permission of the Very Large Data Base Endowment. To copy otherwise, or to republish, requires a fee and/or special permission from the Endowment.
Proceedings of the 31st VLDB Conference, Trondheim, Norway, 2005

1 Introduction

In this paper, we study a new problem, called hot bursty events detection in a text stream, where a text stream is a sequence of chronologically ordered documents, and a hot bursty event is a minimal set of bursty features that occur together in certain time windows with strong support of documents in the text stream. For example, SARS (Severe Acute Respiratory Syndrome) is a bursty event that consists of a set of bursty features such as sars, outbreak,...
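To make the definitions above concrete, here is a minimal Python sketch of the data model only. It is not the paper's feature-pivot clustering method, which this preview does not describe. The toy documents, the one-day window size, whitespace tokenization, and the helper names `window_document_frequency` and `cooccurrence_windows` are all assumptions made for illustration.

```python
from collections import Counter
from datetime import date

# Hypothetical toy text stream: (publication date, text) pairs in chronological order.
stream = [
    (date(2003, 3, 1), "market rally continues"),
    (date(2003, 3, 2), "sars outbreak reported in hong kong"),
    (date(2003, 3, 2), "hospital confirms sars cases"),
    (date(2003, 3, 3), "sars outbreak spreads"),
]


def window_document_frequency(stream, feature):
    """Count, per one-day window, the documents that contain `feature`."""
    counts = Counter()
    for day, text in stream:
        if feature in text.split():  # assumption: whitespace tokenization
            counts[day] += 1
    return dict(counts)


def cooccurrence_windows(stream, features):
    """Return the sorted windows in which some document contains every feature."""
    wanted = set(features)
    return sorted({day for day, text in stream if wanted <= set(text.split())})


sars_counts = window_document_frequency(stream, "sars")
together = cooccurrence_windows(stream, ["sars", "outbreak"])
```

In this toy stream, `sars_counts` records how many documents mention sars on each day, and `together` lists the days on which sars and outbreak appear in the same document. A real detector would still need a principled way to decide which counts are bursty, which is the part the paper addresses.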