CS411 - Information Extraction 2 - scribe



...<I>Publisher:</I>Longman<BR> </HTML>

<HTML><B>#PCDATA</B><P>(<U>#PCDATA</U><BR>)+ <I>Price:</I>#PCDATA<BR><I>Publisher:</I>#PCDATA<BR></HTML>

Automatic Wrapper Construction (2 of 12)

Example

Now we compare the second page with the regular expression that we got from the first page, and generate a new regular expression that matches both pages.

Page 1:
<HTML> <B>The Elements of Style</B><P> <U>William Strunk Jr.</U><BR> <U>E. B. White</U><BR> <I>Price:</I>$9.95<BR> <I>Publisher:</I>Longman<BR> </HTML>

Page 2:
<HTML> <B>The Snow of Kilimanjaro</B><P> <U>Ernest Hemingway</U><BR> <I>Publisher:</I>Scribner<BR> </HTML>

Generalized regular expression:
<HTML><B>#PCDATA</B><P>(<U>#PCDATA</U><BR>)+ (<I>Price:</I>#PCDATA<BR>)?<I>Publisher:</I>#PCDATA<BR></HTML>

Note: The “?” means that the block of text is optional. Not every web page contains the same attributes and fields as the other pages; here the second page has no Price field.

Automatic Wrapper Construction (3 of 12)

RoadRunner: Inferring Schema and Program

We repeat the previous steps from the example for all web pages, so that in the end we have a REGEX that matches every page.

• Given a set of Web pages P = {p1,…,pk}
  – Examine P to infer extraction program (as a REGEX)
  – Then infer the schema from the extraction results
• To infer extraction program, iterate
  – Initialize a REGEX to page p1
  – Generalize it to match p2, and so on
  – Return a REGEX that has been generalized (minimally) to match all the pages in P

Automatic Wrapper Construction (4 of 12)

When comparing web pages and generalizing the regular expression, we change it based on the kind of difference found:
- String mismatch (easy): “Vincent” vs “Kevin”
- Tag mismatch (hard): 2 tags vs 1 tag then a string

The Gene...
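To make the generalized wrapper from the example concrete, here is a minimal sketch that hand-translates it into a Python regular expression and runs it against the two book pages. This is only an illustration of applying a finished wrapper, not RoadRunner's own inference procedure. The translation rules are assumptions: each #PCDATA becomes "any text without a `<`", whitespace between tags is ignored, and the names `WRAPPER`, `AUTHOR`, `extract`, `PAGE_1`, and `PAGE_2` are hypothetical.

```python
import re

# Hand translation (an assumption, not part of the notes) of the wrapper
#   <HTML><B>#PCDATA</B><P>(<U>#PCDATA</U><BR>)+
#   (<I>Price:</I>#PCDATA<BR>)?<I>Publisher:</I>#PCDATA<BR></HTML>
# Each #PCDATA is written as [^<]+ (text with no tag inside it).
WRAPPER = re.compile(
    r"<HTML>\s*<B>(?P<title>[^<]+)</B><P>\s*"
    r"(?P<authors>(?:<U>[^<]+</U><BR>\s*)+)"           # ( ... )+  one or more authors
    r"(?:<I>Price:</I>(?P<price>[^<]+)<BR>\s*)?"       # ( ... )?  optional price
    r"<I>Publisher:</I>(?P<publisher>[^<]+)<BR>\s*"
    r"</HTML>"
)

# Pulls the individual author strings out of the repeated <U>...</U> block.
AUTHOR = re.compile(r"<U>([^<]+)</U>")


def extract(page):
    """Return the extracted fields, or None if the wrapper does not match."""
    match = WRAPPER.fullmatch(page.strip())
    if match is None:
        return None
    return {
        "title": match.group("title"),
        "authors": AUTHOR.findall(match.group("authors")),
        "price": match.group("price"),  # None when the optional block is absent
        "publisher": match.group("publisher"),
    }


PAGE_1 = (
    "<HTML> <B>The Elements of Style</B><P> "
    "<U>William Strunk Jr.</U><BR> <U>E. B. White</U><BR> "
    "<I>Price:</I>$9.95<BR> <I>Publisher:</I>Longman<BR> </HTML>"
)
PAGE_2 = (
    "<HTML> <B>The Snow of Kilimanjaro</B><P> "
    "<U>Ernest Hemingway</U><BR> "
    "<I>Publisher:</I>Scribner<BR> </HTML>"
)

if __name__ == "__main__":
    for page in (PAGE_1, PAGE_2):
        print(extract(page))
```

The same expression matches both pages: the first yields two authors and a price, while the second yields one author and a price of `None`, which is what the “+” and “?” in the wrapper are meant to allow. A page with a different tag structure would make `extract` return `None`, signalling that the wrapper would need further generalization.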

This note was uploaded on 01/28/2014 for the course CS 411 taught by Professor Staff during the Fall '08 term at University of Illinois, Urbana Champaign.
