MDL Summarization with Holes

Shaofeng Bu, Univ. of British Columbia, [email protected]
Laks V.S. Lakshmanan, Univ. of British Columbia, [email protected]
Raymond T. Ng, Univ. of British Columbia, [email protected]

Abstract

Summarization of query results is an important problem for many OLAP applications. The Minimum Description Length principle has been applied in various studies to provide summaries. In this paper, we consider a new approach to applying the MDL principle. We study the problem of finding summaries of the form S ⊖ H for k-d cubes with tree hierarchies. The S part generalizes the query results, while the H part describes all the exceptions to the generalizations. The optimization problem is to minimize the combined cardinalities of S and H. We first characterize the problem by showing that the 1-d problem can be solved in time linear in the size of the hierarchy, but that the 2-d problem is NP-hard. We then develop three different heuristics, based on a greedy approach, a dynamic programming approach and a quadratic programming approach. We conduct a comprehensive experimental evaluation. The dynamic programming algorithm and the greedy algorithm each suit different circumstances. Both produce summaries that are significantly shorter than those generated by state-of-the-art alternatives.

1 Introduction and Motivation

It is well known that complex aggregate queries involving millions of records are one of the hallmarks of OLAP-style data analysis applications running on data warehouses.
Phenomenal strides have been made over the past decade in the development of efficient algorithms for computing the data cubes [4, 17], in materialization of the cubes, in approximation, in decomposition, and in compression. An equally important topic is the summarization of the query results. Instead of returning individual tuples satisfying a specified set of querying conditions, the tuples are summarized into “rollup regions” using non-leaf nodes in the hierarchies associated with the dimensions [14, 8]. The Minimum Description Length Principle (MDL) is often used to do so.

Figure 1 shows a 2-dimensional data cube over the dimensions location and clothes. The base table contains the volume of sales of every type of clothing item in every location. The figure also shows the hierarchies. Suppose a user asks the query “which locations grossed a sales volume of over 100,000 for any clothing item?” This is an aggregate selection query.

Permission to copy without fee all or part of this material is granted provided that the copies are not made or distributed for direct commercial advantage, the VLDB copyright notice and the title of the publication and its date appear, and notice is given that copying is by permission of the Very Large Data Base Endowment. To copy otherwise, or to republish, requires a fee and/or special permission from the Endowment.
Proceedings of the 31st VLDB Conference, Trondheim, Norway, 2005
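
To make the objective from the abstract concrete (a generalization S with a set of holes H removed, minimizing the combined size of S and H), the following is a minimal sketch of a simplified 1-d variant. It is not the paper's algorithm. It assumes that S may contain any node of a single tree hierarchy and that every hole in H is an individual leaf. The `Node` class, the `summarize` function and the location names are hypothetical and chosen only for illustration. At each internal node the sketch compares two options: describe the children separately, or name the node once and list every unselected leaf beneath it as a hole.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    name: str
    children: list = field(default_factory=list)
    selected: bool = False  # only meaningful for leaves (query result membership)


def summarize(node):
    """Return (S, H, unselected) for the subtree rooted at `node`.

    S and H are lists of node names whose combined length is minimal under the
    simplifying assumption that holes are individual leaves. `unselected` lists
    every leaf of the subtree that is not in the query result.
    """
    if not node.children:
        if node.selected:
            return [node.name], [], []
        return [], [], [node.name]

    s, h, unselected = [], [], []
    for child in node.children:
        child_s, child_h, child_unselected = summarize(child)
        s += child_s
        h += child_h
        unselected += child_unselected

    # Option: generalize to this node and list every unselected leaf as a hole.
    # Ties are resolved in favour of the more specific description.
    if 1 + len(unselected) < len(s) + len(h):
        return [node.name], list(unselected), unselected
    return s, h, unselected


# Hypothetical location hierarchy; the selected leaves stand in for the
# locations returned by an aggregate selection query.
root = Node("all", [
    Node("west", [
        Node("Vancouver", selected=True),
        Node("Seattle", selected=True),
        Node("Portland", selected=True),
        Node("SanFrancisco"),
    ]),
    Node("east", [
        Node("Boston"),
        Node("NewYork", selected=True),
    ]),
])

s, h, _ = summarize(root)
print("S =", s)  # S = ['west', 'NewYork']
print("H =", h)  # H = ['SanFrancisco']
```

Here the four selected locations are described with three symbols instead of four: "west" covers three of them at the cost of one hole. The sketch carries name lists for readability, so its running time also depends on the depth of the tree. A version that tracks only counts would visit each hierarchy node once.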