124.11.lec13 - CS 124/LINGUIST 180: From Languages to...


CS 124/LINGUIST 180: From Languages to Information
Dan Jurafsky
Lecture 13: Intro to XML
Slides from Chris Manning, thanks to Dan Suciu, Daniela Florescu, Donald Kossmann, and Robin Burke for borrowed slides

Document structure
- For IR:
  - We defined a notion of a “document” for retrieval
  - But in many cases there are nested document sizes
  - E.g., a Slate or New Yorker article is often split over 4-6 web pages
  - Books have chapters which have sections. Which is the correct unit for retrieval?
    - It probably varies by task, but clearly just returning a whole book and saying that some part of it is useful isn’t very friendly to the user

Document structure
- For IR:
  - We represented documents as just a list of word tokens
  - But in almost all cases documents have some structure
  - Books have section titles, block quotations etc.
  - Email messages have author, subject, and date information
- This week: How can we represent and exploit hierarchical units and structure in documents?

The Answer
- This is the study of “semi-structured data”
- There’s a general concept of semi-structured data, but in practice these days, everyone uses XML
- So we’ll study XML
  - Today we’ll do the basics
  - Next lecture, we’ll look at tools and methods that exist for exploiting and extracting information from semi-structured data:
    - I.e., XPath and XSLT

XML
<?xml version="1.0"?>
<catalog>
  <book id="bk103">
    <author>Corets, Eva</author>
    <title>Maeve Ascendant</title>
    <genre>Fantasy</genre>
    <price>5.95</price>
    <publish_date>2000-11-17</publish_date>
    <description>After the collapse of a nanotechnology society in England, the young survivors lay the foundation for a new society.</description>
  </book>
  <book id="bk104">
    <author>Corets, Eva</author>
    <title>Oberon's Legacy</title>
    <genre>Fantasy</genre>
    <price>5.95</price>
    <publish_date>2001-03-10</publish_date>
    <description>In post-apocalypse England, the mysterious agent known only as Oberon helps to create a new life for the inhabitants of London. Sequel to Maeve Ascendant.</description>
  </book>
</catalog>

Two flavors of XML
- There are two flavors of XML
  - Well, really there’s a continuum….
- Document-centric XML
  - We have text documents and mark some structure in them.
  - Web pages with a bit more semantic structure.
  - Our focus here.
- Data-centric XML
  - We have a database record and wrap it in XML as a self-describing text format
  - Fodder for a database course. CS145.

XML BASICS

XML
- A W3C standard to complement HTML
- eXtensible Markup Language
- Origins: structured text SGML
  - From work at IBM in the 1960s
- Motivation:
  - HTML describes presentation
  - XML describes content
- http://www.w3.org/TR/REC-xml/
  - The spec. Don’t try to read it.
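
To make the "XML describes content" point concrete, here is a minimal sketch of reading the catalog example above as a tree of elements. It assumes Python and its standard-library xml.etree.ElementTree parser, neither of which the lecture prescribes; the description elements are left out for brevity, and the helper name list_books is hypothetical.

```python
import xml.etree.ElementTree as ET

# Abbreviated copy of the lecture's catalog example (descriptions omitted).
CATALOG = """<?xml version="1.0"?>
<catalog>
  <book id="bk103">
    <author>Corets, Eva</author>
    <title>Maeve Ascendant</title>
    <genre>Fantasy</genre>
    <price>5.95</price>
    <publish_date>2000-11-17</publish_date>
  </book>
  <book id="bk104">
    <author>Corets, Eva</author>
    <title>Oberon's Legacy</title>
    <genre>Fantasy</genre>
    <price>5.95</price>
    <publish_date>2001-03-10</publish_date>
  </book>
</catalog>"""


def list_books(xml_text):
    """Return (id attribute, title text, price as float) for every book element."""
    root = ET.fromstring(xml_text)  # root is the single <catalog> element
    return [
        (book.get("id"), book.findtext("title"), float(book.findtext("price")))
        for book in root.iter("book")
    ]


for book_id, title, price in list_books(CATALOG):
    print(book_id, title, price)
```

The id comes from an attribute while the title and price come from child elements, which is the same element/attribute split the slides discuss later.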

From HTML to XML
HTML describes the presentation

HTML
<h1> Bibliography </h1>
<p> <i> Foundations of Databases </i> Abiteboul, Hull, Vianu
<br> Addison Wesley, 1995
<p> <i> Data on the Web </i> Abiteboul, Buneman, Suciu
<br> Morgan Kaufmann, 1999

HTML in 2010
- Okay, that example was from 1995. Today it’s more likely to look like this. HTML has become a low-level presentation language for the output of design software … the Postscript of the web.
<td><h1 class="Books">Politics of experience by Ronald Laing, published in 1967</h1></td>
<td align="right" nowrap> Item number: 320070381076</td>
<td align="right" valign="top"><img src="http://pics.booksstatic.com/aw/pics/globalAssets/rtCurve.gif" width="8" height="8"></td></tr>
<tr><td colspan="6" valign="middle" bgcolor="#5F66EE"><img src="http://pics.booksstatic.com/aw/pics/s.gif" width="1" height="4"></td></tr></table>
<table width="100%" border="0" cellpadding="0" cellspacing="0"><tr>
<td bgcolor="#CCCCFF"><img src="http://pics.booksstatic.com/aw/pics/s.gif" width="1" height="1"></td>
<td bgcolor="#EEEEFF"><div id="FastVIPBIBO"><table border="0" cellpadding="0" cellspacing="0" width="100%">

XML
XML describes the content
<bibliography>
  <book>
    <title> Foundations… </title>
    <author> Abiteboul </author>
    <author> Hull </author>
    <author> Vianu </author>
    <publisher> Addison Wesley </publisher>
    <year> 1995 </year>
  </book>
  …
</bibliography>
Another layer (e.g., XSLT) defines presentation

XML
- HTML defines a specific set of markup elements with specific meanings:
  - h1, blockquote, b, …
- XML provides a syntax for defining sets of semantic markup elements, where a particular set is used for different applications (that’s the eXtensible part)

XML Terminology
- tags: <book>, </title>, <author/>, …
  - start tag: <book>, end tag: </book>
- elements: <book>…</book>, <author>…</author>
  - A book element and an author element
- Elements can be sequenced and nested
- Empty element: <red></red>, abbreviated
<red/>
  - These two forms are to be treated the same by an XML processor
- an XML document: single root element

Element Names in XML
- Rules for naming
  - Start with either a letter or underscore (_)
  - Rest: letters, digits, underscore (_), dot (.), hyphen (-)
  - Spaces are not allowed
  - Names cannot start with the string xml
  - Names are case sensitive in XML
- Convention
  - HTML elements in XML are in uppercase
  - XML elements are in lowercase

Empty Element
- Empty element: elements that have no content
  - Used for the value of their attributes
- Example
  - <email href="mailto…"></email>
- Shorthand notation
  - <email href="mailto…" />

XML Terminology
- Well-formed XML document: if it is syntactically correct
  - Basically, if it has matching tags
- Ill-formed:
  - <a> foo <b> bar </a> </b>
    - Elements aren’t nested
  - <body> Hello! <p> How are you? <p> Bye! </body>
    - Missing end element tags
- Valid XML document: its structure is in accordance with a schema/DTD
  - We discuss this later today
  - This is a more structured, semantic notion; mostly XML is just well-formed.

More XML: Attributes
<book price="55" currency="USD">
  <title> Foundations of Databases </title>
  <author> Abiteboul </author>
  …
  <year> 1995 </year>
</book>
- Attributes are an alternative way to represent data
- Attribute values must be enclosed in double or single quotation marks

Elements vs. Attributes
- Elements can be nested
  <bib> <book> Wilde Wutz </book> </bib>
- Subelements can implement multisets
  <bib> <book> ... </book> <book> ... </book> </bib>
- Order is important and often semantic!
- Elements can have only attributes
  <person name="Wutz" age="33"/>
- Attribute names must be unique! (No multisets)
  <person name="Wilde" name="Wutz"/> is illegal!
- Attributes can’t have nested structure.
- Order is usually not semantic
- What is the difference between a nested element and an attribute? Are attributes useful?
- In many cases you can use either, but having attributes fits with the text-centric idea of sometimes having extra document data that isn’t part of the text of the document

Basic Structure
- An XML document is an ordered, labeled tree
- Character data leaf nodes contain the actual data (text strings)
- Element nodes are each labeled with
  - a name (often called the element type), and
  - a set of attributes, each consisting of a name and a value,
  - and can have child nodes

XML Example
[Figure: example XML tree; image not preserved]

XML: Design Goals
- Separate syntax from semantics to provide a common framework for structuring information
- Allow tailor-made markup for any imaginable application domain
- Support internationalization (Unicode) and platform independence
- Be the future of (semi)structured information (do some of the work now done by databases)

More XML: CDATA sections
- Sometimes we would like to preserve the original characters, and not interpret them as markup
- CDATA sections
  - Not parsed as XML
- <message>
    <greeting>Hello,world!</greeting>
  </message>
- <message>
    <![CDATA[<greeting>Hello, world!</greeting>]]>
  </message>

More XML: id and idref(s)
<person id="o555"> <name> Jane </name> </person>
<person id="o456"> <name> Mary </name> <children idrefs="o123 o555"/> </person>
<person id="o123"> <name> John </name> <mother idref="o456"/> </person>
- id and idref look just like regular XML syntax, but they are distinguished names for IDs/pointers, and allow the representation of arbitrary graphs in XML. An id value must be unique.
- Not much used in text applications

More XML: Entity References
- Syntax: &entityname;
- Example: <element> this is less than &lt; </element>
- Some entities:
  - &lt;    <
  - &gt;    >
  - &amp;   &
  - &apos;  '
  - &quot;  "
  - &#38;   Unicode character reference (by code point)

More XML: Processing Instructions
- Syntax: <?target argument?>
- Stuff for the processor that is not data
- Example:
  <product>
    <name> Alarm Clock </name>
    <?ringBell 20?>
    <price> 19.99 </price>
  </product>
- XML documents should begin with an xml processing instruction (the XML declaration), e.g., as below; Unicode/UTF-8 is the default
  - <?xml version="1.0"?> OR
  - <?xml version="1.0" encoding="UTF-8" standalone="yes" ?>
- Usually the only one you ever see….

More XML: Comments
- Syntax: <!-- .... Comment text... -->
- Yes, they are part of the data model!!!
- A comment cannot contain two consecutive dashes
  - Really!

XML Namespaces
- http://www.w3.org/TR/REC-xml-names (1/99)
- name ::= [prefix:]localpart
- Provide modularity … just like in programming languages
<book xmlns:isbn="www.isbn-org.org/def">
  <title> … </title>
  <number> 15 </number>
  <isbn:number> …. </isbn:number>
</book>

XLink
- Generalizes HTML’s href
- Many types: simple, extended, locator, ...
- Discuss only simple links
<person xmlns:xlink="http://www.w3.org/1999/xlink"
        xlink:type="simple"
        xlink:href="http://a.b.c/myhomepage.html"
        xlink:title="The Homepage"
        xlink:show="replace"
        xlink:actuate="onRequest">
  .....
</person>
(Slide callouts label some of these as required attributes and others as optional attributes.)

XPointer
- An extension of XPath (coming up…)
- Usage:
  - href="www.a.b.c/document.xml#xpointerExpr"
- An xpointer expression points to:
  - A point
  - A range

Whitespace declaration
- Whitespace = continuous sequence of Space, Tab and Return characters
- Special attribute xml:space to control use
- Human-readable XML (with whitespace)
  <book xml:space="preserve">
    <title>The politics of experience</title>
    <author>Ronald Laing</author>
  </book>
- (Efficient) machine-readable XML (no whitespace)
  <book xml:space="default"><title>The politics of experience</title><author>Ronald Laing</author></book>

Language declaration
- <p xml:lang="en">The quick brown fox jumps over the lazy dog.</p>
- <p xml:lang="en-GB">What colour is it?</p>
- <p xml:lang="en-US">What color is it?</p>
- Note: uses reserved “xml” namespace!

Examples of XML data
- XHTML (browser/presentation)
- RSS (blogs)
- UBL (Universal Business Language)
- Health Level 7 (HL7, medical data)
- XBRL (financial data)
- Digital photography metadata (XMP)
- XMI (metadata)
- XQueryX (programs)
- XForms (forms)
- SOAP (message envelopes)
- Microsoft Office 2007/2008 -- Documents in XML

RSS, blogs
<?xml version="1.0"?>
<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
         xmlns="http://purl.org/rss/1.0/">
  <channel rdf:about="http://www.xml.com/xml/news.rss">
    <title>XML.com</title>
    <link>http://xml.com/pub</link>
    <description> XML.com features a rich mix of information and services for the XML community.
    </description>
    <image rdf:resource="http://xml.com/universal/images/xml_tiny.gif" />
    <items>
      <rdf:Seq>
        <rdf:li resource="http://xml.com/pub/2000/08/09/xslt/xslt.html" />
        <rdf:li resource="http://xml.com/pub/2000/08/09/rdfdb/index.html" />
      </rdf:Seq>
    </items>
    <textinput rdf:resource="http://search.xml.com" />
  </channel>
  <image rdf:about="http://xml.com/universal/images/xml_tiny.gif">
    <title>XML.com</title>
    <link>http://www.xml.com</link>
    <url>http://xml.com/universal/images/xml_tiny.gif</url>
  </image>

Forms on the Web in XML
- XML Forms (XForms)
- http://www.w3.org/TR/xforms/
<xforms:model>
  <xforms:instance>
    <ecommerce xmlns="">
      <method/>
      <number/>
      <expiry/>
    </ecommerce>
  </xforms:instance>
  <xforms:submission action="http://example.com/submit" method="post" id="submit"/>
</xforms:model>

SOAP and Web Services
- Web Services is the favorite way of exchanging information between applications
- XML exchange over HTTP, with a specific protocol (SOAP)
<?xml version='1.0' ?>
<env:Envelope xmlns:env="http://www.w3.org/2003/05/soap-envelope">
  <env:Header>
    <m:reservation xmlns:m="http://travelcompany.example.org/reservation"
                   env:role="http://www.w3.org/2003/05/soap-envelope/role/next"
                   env:mustUnderstand="true">
      <m:reference>uuid:093a2da1q345-739r-ba5d-pqff98fe8j7d</m:reference>
      <m:dateAndTime>2001-11-29T13:20:00.000-05:00</m:dateAndTime>
    </m:reservation>
    <n:passenger xmlns:n="http://mycompany.example.com/employees"
                 env:role="http://www.w3.org/2003/05/soap-envelope/role/next"
                 env:mustUnderstand="true">
      <n:name>Åke Jógvan Øyvind</n:name>
    </n:passenger>
  </env:Header>
  <env:Body/>
</env:Envelope>
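
The RSS and SOAP examples above lean heavily on namespaces, so here is a small sketch of how a namespace-aware parser sees such names. It assumes Python's standard-library xml.etree.ElementTree, which the lecture does not mention; the trimmed RSS fragment, the "rss" prefix, and the helper name channel_info are illustrative choices, not part of the slides.

```python
import xml.etree.ElementTree as ET

# Trimmed version of the RSS 1.0 example, closed off so it is well-formed.
RSS = """<?xml version="1.0"?>
<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
         xmlns="http://purl.org/rss/1.0/">
  <channel rdf:about="http://www.xml.com/xml/news.rss">
    <title>XML.com</title>
    <link>http://xml.com/pub</link>
  </channel>
</rdf:RDF>"""

# Prefixes are local to this script; only the namespace URIs have to match.
NS = {
    "rdf": "http://www.w3.org/1999/02/22-rdf-syntax-ns#",
    "rss": "http://purl.org/rss/1.0/",
}


def channel_info(xml_text):
    """Return the root tag, the channel title, and the channel's rdf:about value."""
    root = ET.fromstring(xml_text)
    channel = root.find("rss:channel", NS)
    title = channel.findtext("rss:title", namespaces=NS)
    # Namespaced attributes are stored under the expanded {uri}local name.
    about = channel.get("{%s}about" % NS["rdf"])
    return root.tag, title, about


print(channel_info(RSS))
```

The parser expands every prefixed or default-namespaced name to a URI plus local part, which is why the unprefixed channel element still has to be looked up through a namespace.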

Texts: A dictionary
<ENTRY><HW>jajirdi</HW><POS>N</POS>
  <DOMAIN><DMI>fauna</DMI>: <DMI>kuyu</DMI></DOMAIN>
  <LAT>Dasyurus geoffroii</LAT>
  <GL><GLI>Western Quoll</GLI>, <GLI>Western Native Cat</GLI></GL>
  <EXAMPLES>
    <EXAMPLE><WE TYPE="DEFN">Jajirdi, kuyu wita - purturluju jiirlpari-jiirlpari - wirliyaju rdaka-piya - langaju jungunypa-piya - ngirnti pirlirripirlirri. </WE>
      <ET>The native cat is a small animal with a spotted back, hand-like paws, ears like a rat's and a flat broad tail.</ET></EXAMPLE>
    <EXAMPLE><WE>Kalinja kapala nyina jajirdi-jarra - jina-mardarni kapala-jana kurdu-kurdu jaji-nyanurlu, ngati-nyanurlu. </WE>
      <ET>Two native cats live together as a couple. Both the mother and the father look after their young.</ET></EXAMPLE></EXAMPLES>
  <SYN><SYNI>kuninyka</SYNI>, <SYNI>kurninka</SYNI>, <SYNI>ngirnti-wiirnpiri<DIALECTS><DLI>Wi</DLI></DIALECTS></SYNI>, <SYNI>parrjarda</SYNI>, <SYNI>parrjardi</SYNI></SYN>
  <CSL>YSL#519</CSL>
</ENTRY>

Texts: ThML (Theological Markup Language)
<h3 class="s05" id="One.2.p0.2">Having a Humble Opinion of Self</h3>
<p class="First" id="One.2.p0.3">EVERY man naturally desires knowledge
  <note place="foot" id="One.2.p0.4">
    <p class="Footnote" id="One.2.p0.5"><added id="One.2.p0.6"> <name id="One.2.p0.7">Aristotle</name>, Metaphysics, i. 1. </added></p>
  </note>; but what good is knowledge without fear of God? Indeed a humble rustic who serves God is better than a proud intellectual who neglects his soul to study the course of the stars.
  <added id="One.2.p0.8"><note place="foot" id="One.2.p0.9">
    <p class="Footnote" id="One.2.p0.10"> Augustine, Confessions V. 4.
    </p>
  </note></added>
</p>

SCHEMAS FOR XML

Schemas for XML
- The goal is to put enforced syntactic/semantic structure on your documents
- E.g., all dictionary entries should have a headword and a part of speech
- If there are examples, they should follow the definition
- Links to related words should be at the end of the entry

Very Simple DTD
<!DOCTYPE company [
  <!ELEMENT company ((person|product)*)>
  <!ELEMENT person (ssn, name, office, phone?)>
  <!ELEMENT ssn (#PCDATA)>
  <!ELEMENT name (#PCDATA)>
  <!ELEMENT office (#PCDATA)>
  <!ELEMENT phone (#PCDATA)>
  <!ELEMENT product (pid, name, description?)>
  <!ELEMENT pid (#PCDATA)>
  <!ELEMENT description (#PCDATA)>
]>

Very Simple DTD
Example of valid XML document using this DTD:
<company>
  <person>
    <ssn> 123456789 </ssn>
    <name> John </name>
    <office> B432 </office>
    <phone> 1234 </phone>
  </person>
  <person>
    <ssn> 987654321 </ssn>
    <name> Jim </name>
    <office> B123 </office>
  </person>
  <product> ... </product>
  ...
</company>

DTDs as Grammars
<!DOCTYPE paper [
  <!ELEMENT paper (section*)>
  <!ELEMENT section ((title,section*) | text)>
  <!ELEMENT title (#PCDATA)>
  <!ELEMENT text (#PCDATA)>
]>
<paper>
  <section> <text> </text> </section>
  <section>
    <title> </title>
    <section> … </section>
    <section> … </section>
  </section>
</paper>

DTDs as Grammars
- A DTD = a grammar
- A valid XML document = the sequence of tokens has a parse tree according to that grammar
- The syntax should remind you of regular expressions
  - But here we have a context-free language

EBNF (Extended Backus-Naur Form)
- EBNF: a compact version of BNF
- It uses regular expressions to simplify grammar expression
  - A → aB
  - A → aBA
  - turns into
  - A → aB(A)?
- Only one production per non-terminal allowed

DTDs
- Use EBNF to specify structure of XML documents
- Plus
  - attributes
  - entities
- Syntax
  - holdover from SGML
  - Ugly

DTD Syntax
- <!ELEMENT element-name content_model>
- Content model contains the RHS of the production rule
- Example
  <!ELEMENT name (firstName, lastName)>

DTD Syntax cont'd
- Not XML
  - <! begins a declaration
  - No "content"
  - Empty elements not indicated with />

Simple content models
- Content can be any text
  - #PCDATA
- Content can be anything at all (useful for debugging)
  - ANY
- Element has no content
  - EMPTY

Declaring the structure of elements
- Grammar that describes the structure of the element
- Subelements, identified by Name or #PCDATA
- Combinators:
  - “+” for at least 1
  - “*” for 0 or more
  - “?” for 0 or 1
  - “,” for concatenation
  - “|” for alternation
  - <!ELEMENT a ( (b | c)* , d? , e )>
- PCDATA: only textual content allowed
  - <!ELEMENT a (#PCDATA)>
- EMPTY: the element must be empty
  - <!ELEMENT a EMPTY>
- ANY: allows any content
  - <!ELEMENT a ANY>

Wellformed? Valid?
<!ELEMENT person (name, profession*)>
<!ELEMENT name (first_name, last_name)>
<!ELEMENT first_name (#PCDATA)>
<!ELEMENT last_name (#PCDATA)>
<!ELEMENT profession (#PCDATA)>
Is the following wellformed and valid?
<person>
  <name>
    <first_name>Alan</first_name>
    <last_name>Turing</last_name>
  </name>
  <profession>computer scientist</profession>
  <profession>mathematician</profession>
  <profession>cryptographer</profession>
</person>

Wellformed? Valid?
<!ELEMENT person (name, profession*)>
<!ELEMENT name (first_name, last_name)>
<!ELEMENT first_name (#PCDATA)>
<!ELEMENT last_name (#PCDATA)>
<!ELEMENT profession (#PCDATA)>
Is the following wellformed and valid?
<person>
  <name>
    <first_name>Alan</first_name>
    <last_name>Turing</last_name>
  </name>
</person>

Wellformed? Valid?
<!ELEMENT person (name, profession*)>
<!ELEMENT name (first_name, last_name)>
<!ELEMENT first_name (#PCDATA)>
<!ELEMENT last_name (#PCDATA)>
<!ELEMENT profession (#PCDATA)>
Is the following wellformed? valid?
<person>
  <name>
    <first_name>Alan</first_name>
    <last_name>Turing</last_name>
  </name>

Valid or No?
<!ELEMENT person (name, profession*)>
<!ELEMENT name (first_name, last_name)>
<!ELEMENT first_name (#PCDATA)>
<!ELEMENT last_name (#PCDATA)>
<!ELEMENT profession (#PCDATA)>
Is the following valid?
<person>
  <profession>computer scientist</profession>
  <profession>mathematician</profession>
  <profession>cryptographer</profession>
</person>

Valid or No?
<!ELEMENT person (name, profession*)>
<!ELEMENT name (first_name, last_name)>
<!ELEMENT first_name (#PCDATA)>
<!ELEMENT last_name (#PCDATA)>
<!ELEMENT profession (#PCDATA)>
Is the following valid?
<person>
  <profession>computer scientist</profession>
  <name>
    <first_name>Alan</first_name>
    <last_name>Turing</last_name>
  </name>
  <profession>mathematician</profession>
  <profession>cryptographer</profession>
</person>

Attributes in DTDs
<!ELEMENT person (ssn, name, office, phone?)>
<!ATTLIST person
  age     CDATA  #REQUIRED
  id      ID     #REQUIRED
  manager IDREF  #REQUIRED
  manages IDREFS #REQUIRED
>
<person age="25" id="p29432" manager="p48293" manages="p34982 p423234">
  <name> ....</name>
  ...
</person>

Attribute lists
- Declared separately from elements
  - can be anywhere in the DTD
- Specification includes
  - name of the element
  - name of the attribute
  - attribute type
  - default

Defining the attribute lists
- Structure: <!ATTLIST ElementName definition>
- <!ATTLIST ingredient
    name   CDATA #REQUIRED
    amount CDATA #IMPLIED
    unit   CDATA #FIXED "cup"
  >
- CDATA means normal content (character data)
  - Not the same as CDATA in XML!
- #REQUIRED or #IMPLIED refer to the fact that the attribute is required or optional
- Default value possible or required (#FIXED)

Mixed content in DTDs
- Mixing PCDATA declarations with other subelements means that the content can be “mixed”
  <!ELEMENT p (#PCDATA|a|ul|b|i|em)*>
  <p>some text <em>some emphasized text</em> blah <b>some bold text</b> </p>

Mixed content
- Legal to have a content model with text and element data
<story category="national" byline="Karen Wheatley">
  <headline>President Meets with Congress</headline>
  The President met with Congressional leaders today in an effort to jump-start faltering budget negotiations. Sources described the mood of the meeting as "cordial".
  <full_text ref="news801" />
  <image src="img2071.jpg" />
  <image src="img2072.jpg" />
  <image src="img2073.jpg" />
</story>

Mixed content, cont'd
- <!ELEMENT story (#PCDATA | headline | full_text | image)*>
  - In a DTD, mixed content must take this (#PCDATA | ...)* form, so the order and number of the subelements cannot be constrained
- Mixed content makes handling XML complex
  - but it is necessary for many applications

Declarations of DTDs
- No DTD (well-formed documents)
- DTD inside the document:
  <!DOCTYPE name [definition]>
- DTD external, specified by URI:
  <!DOCTYPE name SYSTEM "demo.dtd">
- DTD external, name and optional URI:
  <!DOCTYPE name PUBLIC "Demo">
  <!DOCTYPE name PUBLIC "Demo" "demo.dtd">

Limitations of DTDs
- DTDs describe only the “grammar” of the XML file, not the detailed structure and/or types
- This grammatical description has some obvious shortcomings:
  - We cannot express that a “length” element must contain a non-negative number (constraints on the type of the value of an element or attribute)
  - The “unit” element should only be allowed when “amount” is present (co-occurrence constraints)
  - etc.
- There are other schema systems, notably XML Schema, with schemas written in XML
  - But we won’t consider them

PROCESSING XML DOCUMENTS

Processing XML documents
- You want to use an XML parser
- The complexity of the structure available makes trying to use regexps messy at best, and impossible at worst
- The parser reads the syntactic structure and lets you deal with the semantic elements

Processing XML documents
- What then?
- There are two traditional ways to do things in code:
  - Processing as a stream of events: SAX
  - Processing an XML tree: DOM
- But in the 21st Century, you’re almost always better off using XPath, XSLT and/or XQuery … little languages for XML
- Next lecture we do XPath and XSLT
- XQuery is left for the database course

A glimpse of XML IR
- You want to be able to answer structure sensitive queries:
  [Figure: two example query trees, built from the labels Book, Title, Trump and Book, Title, Author, Gates, Bill]
- And to return an appropriate size unit when the query mentions no structure

General positional indexes
- View the XML document as a text document
- Build a positional index for each element
- Mark the beginning and end for each element, e.g.,
    Play           Doc:1(27)
    /Play          Doc:1(1122)
    Verse          Doc:1(431)   Doc:4(33)
    /Verse         Doc:1(867)   Doc:4(92)
    Term:droppeth  Doc:1(2033)  Doc:1(5790)  Doc:1(720)

Positional containment
[Figure: positions in Doc:1 on a line, with Play running from 27 to 1122, Verse from 431 to 867, and Term:droppeth at 720, 2033, and 5790]
- Containment can be viewed as merging postings: droppeth under Verse under Play.

A glimpse of XML IR
- But what do we index and return as a document?
  - Any unit (paragraph, section, chapter) which is a plausible unit for retrieval
- Our ranking algorithm prefers:
  - Documents with a better cosine similarity score
  - Shorter documents within a larger document that are almost as good in cosine similarity ...
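
The positional containment idea above can be sketched in a few lines. This is a brute-force illustration in Python under stated assumptions, not the merge-based algorithm a real index would use: it pairs each element's begin and end positions into a (start, end) span, uses only the Doc:1 positions from the slide, and the function names spans_in and contained_in are hypothetical.

```python
# Doc:1 positions copied from the slide's example index.
PLAY = [(27, 1122)]            # Play begins at 27, /Play at 1122
VERSE = [(431, 867)]           # Verse begins at 431, /Verse at 867
DROPPETH = [2033, 5790, 720]   # positions of Term:droppeth


def spans_in(inner, outer):
    """Keep the inner (start, end) spans that lie strictly inside some outer span."""
    return [
        (s, e) for s, e in inner
        if any(o_start < s and e < o_end for o_start, o_end in outer)
    ]


def contained_in(positions, spans):
    """Keep the term positions that fall strictly inside some (start, end) span."""
    return sorted(p for p in positions if any(s < p < e for s, e in spans))


# "droppeth under Verse under Play": first restrict verses to those inside a play,
# then restrict term occurrences to those inside one of the remaining verses.
verses_in_play = spans_in(VERSE, PLAY)
hits = contained_in(DROPPETH, verses_in_play)
print(hits)
```

Only position 720 survives both containment checks; 2033 and 5790 lie beyond the end of the Play span. With sorted postings the same test can be done in a single merge pass rather than the nested loops used here.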

This document was uploaded on 06/01/2011.
