CS411 - Information Extraction 2 - scribe

Now we generate a new regular expression from both

Learning-based Wrapper Construction (3 of 4)

Now compare all the x_i's; their common substring is the head. The tail can be found similarly. The HLRT wrapper is easier to understand and implement than the Road Runner wrapper.

Summary for HLRT Wrapper
• Easy to understand and implement
• Has limited applicability
  – Assumes a flat tuple schema (a parallel list of <l_i, r_i>'s)
  – Assumes all attributes can be extracted with delimiters

To use the HLRT wrapper, we must assume that the target schema is a flat tuple (a parallel list of l_i and r_i) and that the information can be extracted using delimiters.

• In practice, things are more complicated
  – May have a nested schema
    • #Book = (#title, #authors, #price), where #authors = a list of (#first_name, #last_name) tuples
  – May not be able to extract with delimiters
    • Extract zip code from “40 Colfax, Phoenix, AZ 85258”

Learning-based Wrapper Construction (4 of 4)

Most web pages are not this well behaved, so we must try a different method: automatic wrapper construction. In automatic wrapper construction, both the target schema and the extraction program are generated automatically (Road Runner).

AUTOMATIC WRAPPER CONSTRUCTION

Automatic Wrapper Construction (0 of 12)

Wrapper Learning without Schema

The basics: you have a set of web pages. Compare them. Automatically generate the target schema by checking which fields align across all web pages. Automatically generate the extraction program by building a generic regular expression that can be run across all pages.

• Also called the automatic approach to wrapper learning
  – Input a set of Web pages of source S
  – Examine similarities/dissimilarities across pages
  – Automatically infer
    • Schema of pages: which fields are aligned (as attributes)
    • Extraction program: how to extract the fields (as attributes)

Automatic Wrapper Construction (1 of 12)

Example

We start with the first page and generalize the page into a regular expression.
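The idea of inferring a schema by checking which fields align across pages can be illustrated with a minimal sketch. It is not the lecture's algorithm: it assumes both pages tokenize to the same number of tokens with identical tags, so only differing text tokens are turned into fields. The two pages, the `#FIELD` marker, and the function names are hypothetical and chosen only for illustration.

```python
import re

# Split HTML into tag tokens and text tokens.
# This is a simplification, not a real HTML parser.
TOKEN = re.compile(r"<[^>]+>|[^<]+")


def tokenize(html):
    """Return the non-empty tag and text tokens of an HTML string."""
    return [t.strip() for t in TOKEN.findall(html) if t.strip()]


def infer_template(page_a, page_b):
    """Align two pages token by token.

    Tokens that are equal on both pages are kept as fixed template text;
    text tokens that differ are replaced by the placeholder '#FIELD'.
    Assumption: the pages have the same tags in the same order.
    """
    a, b = tokenize(page_a), tokenize(page_b)
    if len(a) != len(b):
        raise ValueError("pages differ in structure; this sketch cannot align them")
    template = []
    for x, y in zip(a, b):
        if x == y:
            template.append(x)
        elif x.startswith("<") or y.startswith("<"):
            raise ValueError("tag mismatch; repeated or optional blocks are not handled")
        else:
            template.append("#FIELD")
    return template


# Hypothetical pages, not taken from the lecture.
page1 = "<HTML><B>Book A</B><I>Price:</I>$9.95</HTML>"
page2 = "<HTML><B>Book B</B><I>Price:</I>$12.50</HTML>"
template = infer_template(page1, page2)
print(template)
```

In this sketch the title and the price differ between the two pages, so they become fields, while the label "Price:" is identical on both and stays part of the template. Pages with repeated or optional blocks would need more than this position-by-position comparison.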
Note: The “+” means that the block of text may be repeated: several consecutive blocks of text may follow the same pattern.

<HTML>
<B>The Elements of Style</B><P>
<U>William Strunk Jr.</U><BR>
<U>E. B. White</U><BR>
<I>Price:</I>$9.95<BR>
&l...
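To make the role of "+" concrete, here is a small sketch that applies a hand-written regular expression to the visible part of the page above. The pattern is an assumption made for illustration, not the expression derived in the lecture, and the whitespace between tags has been removed to keep it short. The group around the `<U>...</U><BR>` block carries the "+", so it matches one or more authors.

```python
import re

# Visible part of the example page, with inter-tag whitespace removed.
fragment = (
    "<B>The Elements of Style</B><P>"
    "<U>William Strunk Jr.</U><BR>"
    "<U>E. B. White</U><BR>"
    "<I>Price:</I>$9.95<BR>"
)

# Hypothetical generalization of the page: the author block may repeat ("+").
pattern = re.compile(
    r"<B>(?P<title>[^<]+)</B><P>"
    r"(?P<authors>(?:<U>[^<]+</U><BR>)+)"
    r"<I>Price:</I>(?P<price>[^<]+)<BR>"
)

m = pattern.fullmatch(fragment)
title = m.group("title")
# A repeated group only remembers its last match, so pull the authors
# out of the whole repeated span with a second pass.
authors = re.findall(r"<U>([^<]+)</U>", m.group("authors"))
price = m.group("price")

print(title, authors, price)
```

The same pattern would also match a page with one author or with three, which is what lets a single expression cover pages whose lists differ in length.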
This note was uploaded on 01/28/2014 for the course CS 411 taught by Professor Staff during the Fall '08 term at University of Illinois, Urbana Champaign.

