CS411 - Information Extraction 2 - scribe

These are often displayed in a particular order (the format).

Many Web Data Are Generated from Hidden Databases
• Web pages from the same data source S
  – Powered by a database
  – E.g., S = Amazon book database
• An HTML page is rendered by a schema and a format
  – Schema = <#title, #author, #price, ..>
  – Format = #title followed by #author

Why Do We Learn This? (1 of 4)

What’s an ASIN? - http://en.wikipedia.org/wiki/Amazon_Standard_Identification_Number

Wrapper: A target schema and an extraction program.

If we need data from such a source …
• A wrapper to extract attributes from pages of S
  – Formally, a wrapper is a tuple of <target schema, extraction program>
• A target schema
  – This need not be the same as the source schema, because we may only want some of the attributes
• Plus an extraction program that uses the format
  – It parses a page of S and extracts the target schema attributes
  – Typically written as some script

Target Schema: This is not necessarily the same as the source schema, but it contains the names of the attributes that we care about.

Extraction Program: The program that parses a web page to retrieve the values of the attributes in the target schema.

Why Do We Learn This? (2 of 4)

Example of a manual wrapper solution that we want to build.

Example Wrapper
• Consider a wrapper that extracts all the attributes from pages of countries.com
  – Target schema equals the source schema: (#country, #capital, #population, #continent)
  – Extraction program may be a Perl script specifying that, given a page from source S:
    • Return the first fully capitalized string as #country
    • Return the string immediately following “Capital:” as #capital
    • Etc.

Given target schema: (#country, #capital, #population, #continent)
Given extraction program: A Perl script that parses the web page.

Why Do We Learn This? (3 of 4)

Manual solution: manual target schema, manual extraction program (regex)
Learning-based: manual target schema, automatic extraction program (HLRT)
Automatic (RoadRunner): automatic target schema, automatic extraction program

Different Settings
• Manual wrapper construction
  – Given a target schema, manually construct the extraction program
• Learning-based wrapper construction
  – Given a target schema, automatically learn the extraction program from examples
• Automatic wrapper construction
  – Automatical...
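
The manual countries.com wrapper described in the Example Wrapper notes can be sketched in code. The sketch below is in Python rather than the Perl script the notes mention, and it makes several assumptions: the page text in SAMPLE_PAGE is made up (the notes do not show the real countries.com layout), the names Wrapper, extract_country_page, and _first_group are hypothetical, and the "Population:" and "Continent:" rules are guesses at what the "Etc." in the notes might cover. Only the #country and #capital rules come from the notes.

```python
import re
from dataclasses import dataclass
from typing import Callable, Dict, Optional, Tuple

# Hypothetical page text, invented for illustration only.
SAMPLE_PAGE = """
Facts about FRANCE
Capital: Paris
Population: 68 million
Continent: Europe
"""


@dataclass(frozen=True)
class Wrapper:
    """A wrapper as the tuple <target schema, extraction program>."""
    target_schema: Tuple[str, ...]
    extraction_program: Callable[[str], Dict[str, Optional[str]]]

    def extract(self, page: str) -> Dict[str, Optional[str]]:
        # Run the extraction program, then keep only the target attributes.
        record = self.extraction_program(page)
        return {attr: record.get(attr) for attr in self.target_schema}


def _first_group(pattern: str, page: str) -> Optional[str]:
    """Return the first capture group of the first match, or None."""
    match = re.search(pattern, page)
    return match.group(1).strip() if match else None


def extract_country_page(page: str) -> Dict[str, Optional[str]]:
    """Hand-written extraction rules that rely on the page format."""
    return {
        # Rule from the notes: first fully capitalized string is #country.
        "country": _first_group(r"\b([A-Z]{2,}(?: [A-Z]{2,})*)\b", page),
        # Rule from the notes: string right after "Capital:" is #capital.
        "capital": _first_group(r"Capital:\s*([^\n]+)", page),
        # Assumed rules standing in for the "Etc." in the notes.
        "population": _first_group(r"Population:\s*([^\n]+)", page),
        "continent": _first_group(r"Continent:\s*([^\n]+)", page),
    }


# Target schema equal to the source schema.
full_wrapper = Wrapper(
    ("country", "capital", "population", "continent"), extract_country_page
)
# Target schema with only some of the source attributes.
partial_wrapper = Wrapper(("country", "capital"), extract_country_page)

if __name__ == "__main__":
    print(full_wrapper.extract(SAMPLE_PAGE))
    print(partial_wrapper.extract(SAMPLE_PAGE))
```

The two wrappers share one extraction program and differ only in target schema, which mirrors the point that the target schema need not equal the source schema. Rules like these break as soon as the site changes its format, which is one motivation for the learning-based and automatic settings.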
