Topic Cube: Topic Modeling for OLAP on Multidimensional Text Databases

Duo Zhang, Chengxiang Zhai, Jiawei Han
[email protected], [email protected], [email protected]
Department of Computer Science, University of Illinois at Urbana Champaign

Abstract

As the amount of textual information grows explosively in various kinds of business systems, it becomes more and more desirable to analyze both structured data records and unstructured text data simultaneously. While online analytical processing (OLAP) techniques have been proven very useful for analyzing and mining structured data, they face challenges in handling text data. On the other hand, probabilistic topic models are among the most effective approaches to latent topic analysis and mining on text data. In this paper, we propose a new data model called topic cube to combine OLAP with probabilistic topic modeling and enable OLAP on the dimension of text data in a multidimensional text database. Topic cube extends the traditional data cube to cope with a topic hierarchy and store probabilistic content measures of text documents learned through a probabilistic topic model. To materialize topic cubes efficiently, we propose a heuristic method to speed up the iterative EM algorithm for estimating topic models by leveraging the models learned on component data cells to choose a good starting point for iteration. Experimental results show that this heuristic method is much faster than the baseline method of computing each topic cube from scratch. We also discuss potential uses of topic cube and show sample experimental results.

1 Introduction

Data warehouses are widely used in today's business market for organizing and analyzing large amounts of data. An important technology to exploit data warehouses is the Online Analytical Processing (OLAP) technology [4, 10, 16], which enables flexible interactive analysis of multidimensional data in different granularities.
It has been widely applied to many different domains [15, 22, 31]. OLAP on data warehouses is mainly supported through data cubes [11, 12].

As unstructured text data grows quickly, it is more and more important to go beyond the traditional OLAP on structured data to also tap into the huge amounts of text data available to us for data analysis and knowledge discovery. These text data often exist either in the character fields of data records or in a separate place with links to the data records through joinable common attributes. Thus conceptually we have both structured data and unstructured text data in a database. For convenience, we will refer to such a database as a multidimensional text database, to distinguish it from both the traditional relational databases and the text databases which consist primarily of text documents....
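To make the notion of a multidimensional text database concrete, the following is a minimal sketch, not the paper's implementation. It assumes hypothetical records with two structured dimensions (`year`, `region`) and one free-text field, and uses plain word counts as a stand-in for the probabilistic content measure that a topic cube would store in each cell. The names `RECORDS` and `roll_up` are illustrative only.

```python
from collections import Counter, defaultdict

# Hypothetical records: structured dimensions plus a free-text field.
RECORDS = [
    {"year": 2008, "region": "east", "text": "engine failure during climb"},
    {"year": 2008, "region": "west", "text": "landing gear warning light"},
    {"year": 2009, "region": "east", "text": "engine smoke during taxi"},
]


def roll_up(records, dims):
    """Group records into cells keyed by the chosen dimensions and
    aggregate the text of each cell into a bag-of-words count.

    Word counts are an assumption made for illustration; a topic cube
    would store topic-model measures in each cell instead.
    """
    cells = defaultdict(Counter)
    for rec in records:
        key = tuple(rec[d] for d in dims)
        cells[key].update(rec["text"].split())
    return dict(cells)


# Fine-grained cells: one per (year, region) combination.
fine = roll_up(RECORDS, ["year", "region"])

# Coarser cells: roll up the year dimension, keeping only region.
coarse = roll_up(RECORDS, ["region"])
```

In this sketch, the text measure of a coarse cell is simply the sum of the measures of its component cells, which is the kind of parent/child relationship between cells that a cube over text would need to exploit.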