ly induce both target schema and extraction program

Why Do We Learn This? (4 of 4) 9

Manual wrapper construction is easy when we know exactly how the web page is formatted. We just need to know what information we want (the target schema) and write a program that goes through the page line by line and extracts that information (the extraction program).

MANUAL WRAPPER CONSTRUCTION

Manual Wrapper Construction (0 of 4) 10

Manual Wrapper Construction
• Developer examines a set of Web pages
  – Manually creates the target schema and the extraction program
  – Often writes the extraction program as a script (e.g., in Perl)

Manual Wrapper Construction (1 of 4) 11

Example

    <HTML>
    <TITLE>Countries in Australia (Continent)</TITLE>
    <BODY>
    <B>Countries in Australia (Continent)</B><P>
    <B>Australia</B> <I>61</I><BR>
    <B>East Timor</B> <I>670</I><BR>
    <B>Papua New Guinea</B> <I>675</I><BR>
    <HR>
    <B>Copyright easycalls.com</B>
    </BODY>
    </HTML>

    #!/usr/bin/perl
    # Read the HTML file named on the command line, one line at a time.
    open(INFILE, $ARGV[0]) or die "cannot open file\n";
    while ($line = <INFILE>) {
        # $1 = country name inside <B>...</B>, $2 = numeric code inside <I>...</I>
        if ($line =~ m/<B>(.+?)<\/B>\s+?<I>(\d+?)<\/I><BR>/) {
            print "\#country = $1, \#code = $2\n";
        }
    }
    close(INFILE);

Manual Wrapper Construction (2 of 4) 12

On the top right is the HTML of the web page that we are parsing; on the bottom right is the Perl script that parses the HTML page and prints out each country and its country code. The Perl script goes through the page line by line and, for every line that matches the pattern, prints the country name (between <B> and </B>) and the code (between <I> and </I>). Lines without both parts, such as the page heading and the copyright notice, do not match and are skipped.

Want to learn more about Perl's regular expressions? http://perldoc.perl.org/perlre.html#Regular-Expressions

Alternatively, we can use XPath to go through the DOM tree and get the information that we want.
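A minimal sketch of the DOM-tree view, written in Python rather than Perl so that it needs only the standard library. The names DomBuilder, parse_page and extract_countries are hypothetical helpers made up for this illustration; they are not from the slides. Python's ElementTree supports only a subset of XPath, so the path `.//i/..` is used to find the element that contains the codes, and the pairing of each <B> with the <I> that follows it is done in plain Python. With a full XPath engine (an assumption: e.g. a third-party library such as lxml) the pairing could probably be expressed in the path itself.

```python
from html.parser import HTMLParser
import xml.etree.ElementTree as ET

# The sample page from the slide.
PAGE = """<HTML>
<TITLE>Countries in Australia (Continent)</TITLE>
<BODY>
<B>Countries in Australia (Continent)</B><P>
<B>Australia</B> <I>61</I><BR>
<B>East Timor</B> <I>670</I><BR>
<B>Papua New Guinea</B> <I>675</I><BR>
<HR>
<B>Copyright easycalls.com</B>
</BODY>
</HTML>"""

VOID_TAGS = {"br", "hr"}  # elements that never take a closing tag


class DomBuilder(HTMLParser):
    """Hypothetical helper: turns lenient HTML into an ElementTree DOM."""

    def __init__(self):
        super().__init__()
        self.builder = ET.TreeBuilder()
        self.open_tags = []  # stack of currently open element names

    def handle_starttag(self, tag, attrs):
        self.builder.start(tag, dict(attrs))
        if tag in VOID_TAGS:
            self.builder.end(tag)
        else:
            self.open_tags.append(tag)

    def handle_endtag(self, tag):
        if tag not in self.open_tags:
            return  # stray closing tag, ignore it
        # Close elements that were left open (here: <P>) on the way up.
        while self.open_tags:
            top = self.open_tags.pop()
            self.builder.end(top)
            if top == tag:
                break

    def handle_data(self, data):
        if self.open_tags:  # skip text outside the root element
            self.builder.data(data)

    def finish(self):
        self.close()
        while self.open_tags:
            self.builder.end(self.open_tags.pop())
        return self.builder.close()


def parse_page(text):
    """Parse an HTML string and return the root element of its DOM tree."""
    parser = DomBuilder()
    parser.feed(text)
    return parser.finish()


def extract_countries(root):
    """Return (country, code) pairs: each <b> directly followed by an <i>."""
    rows = []
    # XPath (ElementTree subset): every element that has an <i> child.
    for parent in root.findall(".//i/.."):
        children = list(parent)
        for name, code in zip(children, children[1:]):
            if name.tag == "b" and code.tag == "i":
                rows.append(((name.text or "").strip(), (code.text or "").strip()))
    return rows


if __name__ == "__main__":
    for country, code in extract_countries(parse_page(PAGE)):
        print(f"#country = {country}, #code = {code}")
```

Run on the sample page, this prints the same three country/code lines as the Perl script. Unlike the line-by-line regular expression, it does not depend on each country sitting on its own line; it depends only on a <B> element being immediately followed by an <I> element in the tree.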
(Instead of going line by line with a Perl script.)

Different Ways to View A Page
• For example:
  – As a string → can write the wrapper as a Perl program
  – As a DOM tree → can write the wrapper using XPath
  – As a visual page, consisting of blocks

Manual Wrapper Construction (3 of 4) 13

This slide is for transitioning to the next topic, learning-based wrapper construction.

Summary for Manual Wrapper Construction
• Regardless of page model (string, DOM tree, visual, etc.), manually writing a wrapper extraction program can be very laborious
• Suppose we are given a...
