p301-augsten - Approximate Matching of Hierarchical Data...


Approximate Matching of Hierarchical Data Using pq-Grams

Nikolaus Augsten, Michael Böhlen, Johann Gamper
Free University of Bozen-Bolzano
Dominikanerplatz 3, Bozen, Italy
{augsten,boehlen,gamper}@inf.unibz.it

Proceedings of the 31st VLDB Conference, Trondheim, Norway, 2005

Permission to copy without fee all or part of this material is granted provided that the copies are not made or distributed for direct commercial advantage, the VLDB copyright notice and the title of the publication and its date appear, and notice is given that copying is by permission of the Very Large Data Base Endowment. To copy otherwise, or to republish, requires a fee and/or special permission from the Endowment.

Abstract

When integrating data from autonomous sources, exact matches of data items that represent the same real world object often fail due to a lack of common keys. Yet in many cases structural information is available and can be used to match such data. As a running example we use residential address information. Addresses are hierarchical structures and are present in many databases. Often they are the best, if not only, relationship between autonomous data sources. Typically the matching has to be approximate since the representations in the sources differ.

We propose pq-grams to approximately match hierarchical information from autonomous sources. We define the pq-gram distance between ordered labeled trees as an effective and efficient approximation of the well-known tree edit distance. We analyze the properties of the pq-gram distance and compare it with the edit distance and alternative approximations. Experiments with synthetic and real world data confirm the analytic results and the scalability of our approach.

1 Introduction

When integrating data from autonomous sources, exact matches of data items representing the same real world object often fail due to missing global keys and different data representations. Approximate matching techniques must be applied instead. We focus on hierarchical data, where, in addition to data values, the data structure must also be considered.

As a running example we use an application from our local municipality. The GIS Office wants to relate data about apartments stored in different databases and display this information on a map. This requires a join on the address attributes. An equality join gives extremely poor results, mainly due to the different street names in various databases. Street names vary because different conventions are used to represent them. They may even be stored in different languages, which prevents the use of standard string comparison techniques. To overcome this problem we exploit the hierarchical organization of addresses. Instead of comparing street names we look for similarities in the hierarchical structure imposed by the addresses of a street....
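The excerpt names the pq-gram distance but stops before defining it. As a rough illustration only, the Python sketch below assumes the commonly described construction: the tree is padded with dummy `*` nodes, each pq-gram is an anchor node together with its p-1 nearest ancestors and q consecutive children, and the distance is one minus twice the bag intersection of the two profiles divided by their combined size. The tuple-based tree encoding, the function names `pq_profile` and `pq_distance`, the defaults p=2 and q=3, and the sample address labels are all assumptions of this sketch, not taken from the text above.

```python
from collections import Counter

STAR = "*"  # dummy label used for padding (assumption of this sketch)


def pq_profile(tree, p=2, q=3):
    """Return the bag of pq-grams of an ordered labeled tree.

    A tree is encoded as (label, [child_tree, ...]); this encoding is
    hypothetical and chosen only to keep the example short.
    """
    profile = Counter()

    def visit(node, ancestors):
        label, children = node
        # Shift the current label into the register of p ancestor labels.
        anc = ancestors[1:] + (label,)
        # Register of q sibling labels, initially all dummies.
        sib = (STAR,) * q
        if not children:
            # A leaf contributes one pq-gram with only dummy children.
            profile[anc + sib] += 1
            return
        for child in children:
            sib = sib[1:] + (child[0],)
            profile[anc + sib] += 1
            visit(child, anc)
        # Slide the window past the last child with q-1 trailing dummies.
        for _ in range(q - 1):
            sib = sib[1:] + (STAR,)
            profile[anc + sib] += 1

    visit(tree, (STAR,) * p)
    return profile


def pq_distance(t1, t2, p=2, q=3):
    """Normalized pq-gram distance in [0, 1] (assumed form, see lead-in)."""
    a, b = pq_profile(t1, p, q), pq_profile(t2, p, q)
    shared = sum((a & b).values())           # bag intersection
    total = sum(a.values()) + sum(b.values())  # size of the bag union
    return 1 - 2 * shared / total


# Two hypothetical street trees: the street names differ, the house
# numbers and apartments below them do not.
street_a = ("Main Street", [("1", [("A", []), ("B", [])]), ("2", [])])
street_b = ("Hauptstrasse", [("1", [("A", []), ("B", [])]), ("2", [])])
print(pq_distance(street_a, street_b))
```

In this sketch an equality comparison of the two street names would fail outright, while the profiles still share the pq-grams anchored at the apartment nodes, so the distance stays below 1. Smaller values of p make the measure less sensitive to differing labels near the root.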

This note was uploaded on 03/01/2010 for the course ICT ... taught by Professor ... during the Three '10 term at University of Sydney.
