Structure and Content Scoring for XML

Sihem Amer-Yahia, AT&T Labs–Research
Nick Koudas, University of Toronto
Amélie Marian, Columbia University
Divesh Srivastava, AT&T Labs–Research
David Toman, University of Waterloo

Permission to copy without fee all or part of this material is granted provided that the copies are not made or distributed for direct commercial advantage, the VLDB copyright notice and the title of the publication and its date appear, and notice is given that copying is by permission of the Very Large Data Base Endowment. To copy otherwise, or to republish, requires a fee and/or special permission from the Endowment.
Proceedings of the 31st VLDB Conference, Trondheim, Norway, 2005

Abstract

XML repositories are usually queried both on structure and content. Due to structural heterogeneity of XML, queries are often interpreted approximately and their answers are returned ranked by scores. Computing answer scores in XML is an active area of research that oscillates between pure content scoring such as the well-known tf*idf and taking structure into account. However, none of the existing proposals fully accounts for structure and combines it with content to score query answers. We propose novel XML scoring methods that are inspired by tf*idf and that account for both structure and content while considering query relaxations. Twig scoring accounts for the most structure and content and is thus used as our reference method. Path scoring is an approximation that loosens correlations between query nodes, hence reducing the amount of time required to manipulate scores during top-k query processing. We propose efficient data structures in order to speed up ranked query processing. We run extensive experiments that validate our scoring methods and that show that path scoring provides very high precision while improving score computation time.

1 Introduction

XML data is now available in different forms ranging from persistent repositories such as the INEX and the US Library of Congress collections to streaming data such as stock quotes and news. Such data is often queried on both structure and content [3, 6, 14, 18, 19]. Due to the structural heterogeneity of XML data, queries are usually interpreted approximately [1, 4, 5, 11, 15] and top-k answers are returned ranked by their relevance to the query. The term frequency (tf) and inverse document frequency (idf) measures, proposed in Information Retrieval (IR), are widely used to score keyword queries, i.e., queries on content. However, although some recent proposals [3, 6, 11, 15] attempted to propose scoring methods that account for structure for ranking answers to XML queries, none of them fully captures both structure and content and uses query relaxation in computing answer scores. ...
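To make the content-only baseline concrete, the following is a minimal sketch of classic keyword tf*idf scoring over flat text documents. It is not the twig or path scoring proposed in this paper, and it ignores XML structure entirely. The function name `tf_idf_scores`, the whitespace tokenization, the `log(N / df)` form of idf, and the toy documents are all assumptions made for illustration.

```python
import math
from collections import Counter


def tf_idf_scores(docs, query_terms):
    """Score each document against the query terms with a basic tf*idf sum.

    Hypothetical helper for illustration: tf is the raw term count in a
    document, and idf is log(N / df), where df is the number of documents
    that contain the term.
    """
    n_docs = len(docs)
    tokenized = [doc.lower().split() for doc in docs]

    # Document frequency: in how many documents does each term occur?
    df = Counter()
    for tokens in tokenized:
        df.update(set(tokens))

    scores = []
    for tokens in tokenized:
        tf = Counter(tokens)
        score = 0.0
        for term in query_terms:
            if df[term] == 0:
                continue  # term appears nowhere, so it contributes nothing
            idf = math.log(n_docs / df[term])
            score += tf[term] * idf
        scores.append(score)
    return scores


# Toy collection (made up for this sketch).
docs = [
    "xml query scoring",
    "xml xml structure",
    "stock quotes and news",
]
scores = tf_idf_scores(docs, ["xml", "scoring"])

# Rank document indices by decreasing score.
ranking = sorted(range(len(docs)), key=lambda i: scores[i], reverse=True)
```

In this toy collection the rarer term "scoring" carries more weight than "xml", so the first document outranks the second even though the second repeats "xml". A content-only score of this kind has no way to express where in an XML tree the keywords occur, which is the limitation the introduction points to.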