3 - Information Integration Adapted from slides for Liu,...

Information Integration
Adapted from slides for Liu, Web Data Mining: Exploring Hyperlinks, Contents, and Usage Data, 2nd ed., Springer, 2009.

Introduction
- At the end of the last topic, we identified the problem of integrating extracted data:
  - column match and instance value match.
- Unfortunately, limited research has been done in this specific context. Much of the Web information integration research has focused on the integration of Web query interfaces.
- In this part, we introduce
  - some basic integration techniques, and
  - Web query interface integration.

Database integration (Rahm and Bernstein 2001)
- Information integration started with database integration, which has been studied in the database community since the early 1980s.
- Fundamental problem: schema matching, which takes two (or more) database schemas and produces a mapping between the elements (or attributes) of those schemas that correspond semantically to each other.
- Objective: merge the schemas into a single global schema.

Integrating two schemas
- Consider two schemas, S1 and S2, representing two customer relations, Cust and Customer.

  S1            S2
  Cust          Customer
  ----------    ----------
  CNo           CustID
  CompName      Company
  FirstName     Contact
  LastName      Phone

Integrating two schemas (contd)
- Represent the mapping with a similarity relation, ≅, over the power sets of S1 and S2, where each pair in ≅ represents one element of the mapping. E.g.,
  Cust.CNo ≅ Customer.CustID
  Cust.CompName ≅ Customer.Company
  {Cust.FirstName, Cust.LastName} ≅ Customer.Contact
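
One way to make the "relation over power sets" idea concrete is to store each mapping element as a pair of attribute sets. The Python sketch below is only an illustration under that assumption; the names `S1`, `S2`, `MAPPING`, `is_valid`, and `unmatched` are hypothetical and do not come from the slides.

```python
# Schemas from the example, as sets of qualified attribute names.
S1 = frozenset({"Cust.CNo", "Cust.CompName", "Cust.FirstName", "Cust.LastName"})
S2 = frozenset({"Customer.CustID", "Customer.Company",
                "Customer.Contact", "Customer.Phone"})

# Each pair (A, B) has A a subset of S1 and B a subset of S2, so the
# relation is defined over the power sets of the two schemas.
MAPPING = {
    (frozenset({"Cust.CNo"}), frozenset({"Customer.CustID"})),
    (frozenset({"Cust.CompName"}), frozenset({"Customer.Company"})),
    (frozenset({"Cust.FirstName", "Cust.LastName"}),
     frozenset({"Customer.Contact"})),
}


def is_valid(mapping, s1, s2):
    """Check that every pair links a non-empty subset of s1 to one of s2."""
    return all(bool(a) and bool(b) and a <= s1 and b <= s2
               for a, b in mapping)


def unmatched(mapping, s1, s2):
    """Return the attributes of each schema that no mapping element covers."""
    used1 = frozenset().union(*(a for a, _ in mapping))
    used2 = frozenset().union(*(b for _, b in mapping))
    return s1 - used1, s2 - used2
```

Using sets on both sides lets a single element express the many-to-one case, where FirstName and LastName together correspond to Contact. Calling `unmatched(MAPPING, S1, S2)` shows which attributes are left without a counterpart.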

Different types of matching
- Schema-level only matching: only schema information is considered.
- Domain and instance-level only matching: some instance data (data records) and possibly the domain of each attribute are used. This case is quite common on the Web.
- Integrated matching of schema, domain and instance data: both schema and instance data (and possibly domain information) are available.

Pre-processing for integration (He and Chang SIGMOD-03, Madhavan et al. VLDB-01, Wu et al. SIGMOD-04)
- Tokenization: break an item into atomic words using a dictionary, e.g.,
  - Break fromCity into from and city
  - Break first-name into first and name
- Expansion: expand abbreviations and acronyms to their full words, e.g.,
  - From dept to departure
- Stopword removal and stemming
- Standardization of words: irregular words are standardized to a single form, e.g.,
  - From colour to color
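
The steps above can be sketched as a small pipeline. This is a minimal Python illustration, not the method of the cited papers: it splits on case changes and punctuation instead of using a dictionary, it leaves out stemming, and the lookup tables `ABBREVIATIONS`, `STOPWORDS`, and `STANDARD_FORMS` are tiny hypothetical stand-ins for real resources.

```python
import re

# Toy resources (assumptions); a real system would use far larger ones.
ABBREVIATIONS = {"dept": "departure"}
STOPWORDS = {"of", "the", "a", "an"}
STANDARD_FORMS = {"colour": "color"}


def tokenize(label):
    """Split a label at lower-to-upper case changes and at non-letters."""
    spaced = re.sub(r"(?<=[a-z])(?=[A-Z])", " ", label)
    return [t.lower() for t in re.split(r"[^A-Za-z]+", spaced) if t]


def preprocess(label):
    """Tokenize, expand abbreviations, drop stopwords, standardize spelling."""
    tokens = tokenize(label)
    tokens = [ABBREVIATIONS.get(t, t) for t in tokens]
    tokens = [t for t in tokens if t not in STOPWORDS]
    return [STANDARD_FORMS.get(t, t) for t in tokens]
```

For example, `tokenize("fromCity")` and `tokenize("first-name")` reproduce the two splits listed under Tokenization, and `preprocess("dept")` applies the expansion step.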

Schema-level matching (Rahm and Bernstein 2001)
- Schema-level matching relies on information such as name, description, data type, relationship type (e.g., part-of, is-a), constraints, etc.
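
As a rough illustration of combining two of these signals, the Python sketch below scores a pair of schema elements by name similarity and data-type agreement. It is an assumption-laden toy rather than a published matcher: the weighting, the threshold, the string measure (`difflib.SequenceMatcher` from the standard library), and the data types attached to each attribute are all hypothetical.

```python
from difflib import SequenceMatcher


def name_similarity(a, b):
    """Case-insensitive string similarity in [0, 1]."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def element_similarity(e1, e2, name_weight=0.7):
    """Score two (name, data_type) elements; weights are illustrative."""
    type_score = 1.0 if e1[1] == e2[1] else 0.0
    return (name_weight * name_similarity(e1[0], e2[0])
            + (1 - name_weight) * type_score)


def best_matches(schema1, schema2, threshold=0.5):
    """Map each element of schema1 to its best-scoring element of schema2,
    keeping only matches whose score reaches the threshold."""
    result = {}
    for e1 in schema1:
        best = max(schema2, key=lambda e2: element_similarity(e1, e2))
        if element_similarity(e1, best) >= threshold:
            result[e1[0]] = best[0]
    return result
```

A call such as `best_matches([("CompName", "str")], [("CustID", "int"), ("Company", "str"), ("Phone", "str")])` pairs CompName with Company. Names that share few characters need the other kinds of schema information listed above, or instance data, to be matched.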