CS411 - Information Extraction 2 - scribe



... set of labeled pages (i.e., we know which part is #country, which part is #code, etc.), can we automate the process of getting the extraction program?

Manual Wrapper Construction (4 of 4)

A disadvantage of manual wrapper construction is that it is laborious: you have to write an extraction program for each type of web page. But what if we are given several web pages that follow a similar template and we do not want to write an extraction program by hand? Then we can generate the extraction program automatically with learning-based wrapper construction.

Learning-based wrapper construction: given a target schema, automatically generate an extraction program. It uses HLRT (Head-Left-Right-Tail) wrappers.

LEARNING-BASED WRAPPER CONSTRUCTION

Learning-based Wrapper Construction (0 of 4)

Here is the page that we want to extract information from. Observe that the information we want is in the middle of the page.

Looking for Patterns
• Use delimiters to extract tuples from one page
  - E.g., extract (#country, #code)

    <HTML>
    <TITLE>Countries in Australia (Continent)</TITLE>
    <BODY>
    <B>Countries in Australia (Continent)</B><P>
    <B>Australia</B> <I>61</I><BR>
    <B>East Timor</B> <I>670</I><BR>
    <B>Papua New Guinea</B> <I>675</I><BR>
    <HR>
    <B>Copyright easycalls.com</B>
    </BODY>
    </HTML>

Learning-based Wrapper Construction (1 of 4)

HLRT Wrapper

A page consists of a head, a data region, and a tail. We only care about the data region. But how do we find it? Find the head and the tail; the data region lies between them.

• Head-Left-Right-Tail wrapper

Learning-based Wrapper Construction (2 of 4)

An HLRT wrapper is a tuple of (2n+2) strings: 2n left and right delimiters (the l's and r's), plus a head and a tail.

HLRT Wrapper Construction
• Goal: for n fields, want to construct an HLRT wrapper as a tuple of (2n+2) strings (h, t, l1, r1, …, ln, rn)
• Approach: a learning module to construct a wrapper for multiple pages

To find the head, first compare all the web pages that we will be parsing. For each web page p_i, there is a string x_i that precedes the first attribute value on the page.
  - E.g., consider finding all possible values for h from k pages

ExampleLearningModule(p1, …, pk, a1, …, an)
• Let x_i be the string from the beginning of page p_i to the first occurrence of the very first attribute a'1
• Return the common substring of {x1, x2, …, xk}...
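As a rough illustration of how an HLRT wrapper (h, t, l1, r1, …, ln, rn) could be applied to the example page, here is a minimal Python sketch. The function name exec_hlrt, the PAGE constant, and the hand-picked delimiters (h = "<P>", t = "<HR>") are assumptions made for illustration and are not taken from the lecture; a learned wrapper might settle on different strings.

```python
# Minimal sketch of applying an HLRT wrapper to one page.
# All names and the chosen delimiters are hypothetical, for illustration only.

PAGE = (
    "<HTML> <TITLE>Countries in Australia (Continent)</TITLE> <BODY> "
    "<B>Countries in Australia (Continent)</B><P> "
    "<B>Australia</B> <I>61</I><BR> "
    "<B>East Timor</B> <I>670</I><BR> "
    "<B>Papua New Guinea</B> <I>675</I><BR> "
    "<HR> <B>Copyright easycalls.com</B> </BODY> </HTML>"
)


def exec_hlrt(page, h, t, delims):
    """Extract tuples from `page` using head h, tail t, and (l, r) pairs."""
    pos = page.find(h)
    if pos == -1:
        return []              # head not found: no data region
    pos += len(h)              # skip past the head
    first_left = delims[0][0]
    tuples = []
    while True:
        next_l1 = page.find(first_left, pos)
        next_t = page.find(t, pos)
        # Stop when no further l1 exists or the tail comes first.
        if next_l1 == -1 or (next_t != -1 and next_t < next_l1):
            break
        row = []
        for left, right in delims:
            start = page.find(left, pos)
            if start == -1:
                return tuples  # malformed row: give up on the rest
            start += len(left)
            end = page.find(right, start)
            if end == -1:
                return tuples
            row.append(page[start:end])
            pos = end + len(right)
        tuples.append(tuple(row))
    return tuples


# Hand-picked wrapper for (#country, #code); an assumption, not a learned result.
WRAPPER_DELIMS = [("<B>", "</B>"), ("<I>", "</I>")]
ROWS = exec_hlrt(PAGE, "<P>", "<HR>", WRAPPER_DELIMS)
```

The head matters here because the page title is also wrapped in <B>…</B>: without skipping past h first, the title would be picked up as a country. The tail check plays the same role for the copyright line at the bottom of the page.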

This note was uploaded on 01/28/2014 for the course CS 411 taught by Professor Staff during the Fall '08 term at University of Illinois, Urbana Champaign.
