CS411 - Information Extraction 2 - scribe




These are often displayed in a particular order (the format).

Many Web Data Are Generated from the Hidden Databases
• Web pages from the same data source S
  – Powered by a database
  – E.g., S = Amazon book database
• An HTML page is rendered by a schema and a format
  – Schema = <#title, #author, #price, ..>
  – Format = #title followed by #author

Why Do We Learn This? (1 of 4)

What's an ASIN? See http://en.wikipedia.org/wiki/Amazon_Standard_Identification_Number

Wrapper
A wrapper is a target schema plus an extraction program.
If we need data from such a source …
• A wrapper to extract attributes from pages of S
  – Formally, a wrapper is a tuple of <target schema, extraction program>
• A target schema
  – This need not be the same as the source schema, because we may only want some of the attributes
• Plus an extraction program that uses the format
  – It parses a page of S and extracts the target schema attributes
  – Typically written as some script

Target schema: not necessarily the same as the source schema; it contains the names of the attributes that we care about.
Extraction program: the program that parses a web page to retrieve the values of the target schema attributes.

Example Wrapper
This is an example of the manual-solution wrapper we want to make.
Given target schema: (#country, #capital, #population, #continent)
Given extraction program: a Perl script that parses the web page.
• Consider a wrapper that extracts all the attributes from pages of countries.com
  – Target schema equals the source schema: (#country, #capital, #population, #continent)
  – Extraction program may be a Perl script specifying that, given a page from source S:
    • Return the first fully capitalized string as #country
    • Return the string immediately following “Capital:” as #capital
    • Etc.
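The two extraction rules spelled out for countries.com can be sketched as a small program. The notes mention a Perl script; the sketch below uses Python purely for illustration. The names Wrapper, extract_country_page, countries_wrapper and sample_page are hypothetical, the page text is made up, and the regular expressions are one possible reading of "fully capitalized" and "immediately following". The target schema here keeps only #country and #capital, since those are the only rules the notes give.

```python
import re
from typing import Callable, Dict, NamedTuple, Optional, Tuple


class Wrapper(NamedTuple):
    """A wrapper as the pair <target schema, extraction program>."""
    target_schema: Tuple[str, ...]
    extraction_program: Callable[[str], Dict[str, Optional[str]]]


def extract_country_page(page: str) -> Dict[str, Optional[str]]:
    """Hand-written rules for a hypothetical countries.com page."""
    # Rule 1: the first fully capitalized string is #country.
    # Assumption: "fully capitalized" means one or more space-separated
    # words made of two or more uppercase letters each.
    country = re.search(r"\b[A-Z]{2,}(?: [A-Z]{2,})*\b", page)
    # Rule 2: the string immediately following "Capital:" is #capital.
    # Assumption: the value ends at the next tag or line break.
    capital = re.search(r"Capital:\s*([^<\n]+)", page)
    return {
        "#country": country.group(0) if country else None,
        "#capital": capital.group(1).strip() if capital else None,
    }


# The target schema is a subset of the source schema: we only ask for
# the attributes we care about.
countries_wrapper = Wrapper(
    target_schema=("#country", "#capital"),
    extraction_program=extract_country_page,
)

# Made-up page text, for illustration only.
sample_page = "<h1>FRANCE</h1>\n<p>Capital: Paris</p>"
record = countries_wrapper.extraction_program(sample_page)
print(record)
```

Rules like these depend entirely on the page format, so a layout change at the source (for example, an uppercase banner placed above the country name) would break them. That fragility is the cost of a manually written extraction program.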
Different Settings
Manual solution: manual target schema, manual extraction program (regex).
Learning-based: manual target schema, automatic extraction program (HLRT).
Automatic (RoadRunner): automatic target schema, automatic extraction program.
• Manual wrapper construction
  – Given a target schema, manually construct the extraction program
• Learning-based wrapper construction
  – Given a target schema, automatically learn the extraction program from examples
• Automatic wrapper construction
  – Automatical...