124.11.lec13 - CS 124/LINGUIST 180

1/10/09
Dan Jurafsky
Lecture 13: Intro to XML
Slides from Chris Manning, thanks to Dan Suciu, Daniela Florescu, Donald Kossmann, and Robin Burke for borrowed slides

Slide from Chris Manning's 276 class

Document structure

For IR: We defined a notion of a “document” for retrieval
But in many cases there are nested document sizes
  E.g., a Slate or New Yorker article is often split over 4-6 web pages
  Books have chapters which have sections.
Which is the correct unit for retrieval?
  o It probably varies by task, but clearly just returning a whole book and saying that some part of it is useful isn’t very friendly to the user

Document structure

For IR: We represented documents as just a list of word tokens
But in almost all cases documents have some structure
  Books have section titles, block quotations etc.
  Email messages have author, subject, and date information
This week: How can we represent and exploit hierarchical units and structure in documents?

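As a rough illustration of the contrast on this slide, the sketch below holds the same email message two ways: as a flat list of word tokens, and as a small tree with author, subject, and date fields. The message and the element names (`email`, `author`, `subject`, `date`, `body`) are made up for illustration and do not come from the lecture; Python's standard `xml.etree.ElementTree` module is used only as a convenient parser.

```python
import xml.etree.ElementTree as ET

# Hypothetical email message; the element names are illustrative assumptions.
EMAIL_XML = """<email>
  <author>Pat Example</author>
  <subject>Lunch on Friday</subject>
  <date>2009-01-10</date>
  <body>Are you free for lunch on Friday?</body>
</email>"""

root = ET.fromstring(EMAIL_XML)

# Flat view: every piece of text as one list of tokens.
# The boundaries between author, subject, date, and body are lost.
flat_tokens = " ".join(root.itertext()).split()

# Structured view: each field can be addressed by its name.
fields = {child.tag: child.text for child in root}

print(flat_tokens)
print(fields["subject"])
```

With the flat view, a query for "Friday" cannot tell a subject match from a body match; with the structured view it can.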
The Answer

This is the study of “semi-structured data”
There’s a general concept of semi-structured data, but in practice these days, everyone uses XML
So we’ll study XML
  Today we’ll do the basics
  Next lecture, we’ll look at tools and methods that exist for exploiting and extracting information from semi-structured data: i.e., XPath and XSLT

XML

<?xml version="1.0"?>
<catalog>
  <book id="bk103">
    <author>Corets, Eva</author>
    <title>Maeve Ascendant</title>
    <genre>Fantasy</genre>
    <price>5.95</price>
    <publish_date>2000-11-17</publish_date>
    <description>After the collapse of a nanotechnology society in England, the young survivors lay the foundation for a new society.</description>
  </book>
  <book id="bk104">
    <author>Corets, Eva</author>
    <title>Oberon's Legacy</title>
    <genre>Fantasy</genre>
    <price>5.95</price>
    <publish_date>2001-03-10</publish_date>
    <description>In post-apocalypse England, the mysterious

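A catalog like this can be read with any XML parser. The sketch below is one possible way to do it with Python's standard `xml.etree.ElementTree`; it is not part of the lecture. Because the slide is cut off partway through the second book, only the first `<book>` is used, its `<description>` is left out for brevity, and a closing `</catalog>` tag is added so the fragment is well-formed.

```python
import xml.etree.ElementTree as ET

# First <book> from the slide. The closing </catalog> is added here
# (an assumption) so that the fragment parses on its own.
CATALOG_XML = """<?xml version="1.0"?>
<catalog>
  <book id="bk103">
    <author>Corets, Eva</author>
    <title>Maeve Ascendant</title>
    <genre>Fantasy</genre>
    <price>5.95</price>
    <publish_date>2000-11-17</publish_date>
  </book>
</catalog>"""

catalog = ET.fromstring(CATALOG_XML)

books = []
for book in catalog.findall("book"):
    books.append({
        "id": book.get("id"),                      # attribute on <book>
        "title": book.findtext("title"),           # text of child element
        "price": float(book.findtext("price")),    # element text is always a string
    })

print(books)
```

Note that XML itself carries no types: the price arrives as the string `"5.95"` and the conversion to a number is the reader's decision.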
Two flavors of XML

There are two flavors of XML
  Well, really there’s a continuum….
Document-centric XML
  We have text documents and mark some structure in them.
  Web pages with a bit more semantic structure.
  Our focus here.
Data-centric XML
  We have a database record and wrap it in XML as a self-describing text format
  Fodder for a database course. CS145.

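The two flavors look different to a program. The following sketch, again using Python's standard `xml.etree.ElementTree`, contrasts a document-centric fragment (running text with inline markup) with a data-centric one (a record wrapped in fields). Both fragments and the element names `p`, `place`, and `book` are invented for illustration.

```python
import xml.etree.ElementTree as ET

# Document-centric: running text with some structure marked inside it.
DOC_XML = (
    "<p>The novel <title>Maeve Ascendant</title> "
    "is set in <place>England</place>.</p>"
)

# Data-centric: a database-style record wrapped in self-describing fields.
DATA_XML = "<book><title>Maeve Ascendant</title><price>5.95</price></book>"

doc = ET.fromstring(DOC_XML)
record = ET.fromstring(DATA_XML)

# Document-centric: the readable text is interleaved with the markup,
# so it has to be collected from the elements and the text between them.
full_text = "".join(doc.itertext())

# Data-centric: each child element is one named field of the record.
fields = {child.tag: child.text for child in record}

print(full_text)
print(fields)
```

In the document-centric case the order of the pieces matters and text sits between elements; in the data-centric case the record is essentially a set of name/value pairs.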
XML BASICS

XML

A W3C standard to complement HTML
eXtensible Markup Language
Origins: structured text SGML
  From work at IBM in the 1960s
Motivation:
  HTML describes presentation
  XML describes content
http://www.w3.org/TR/REC-xml/
  The spec. Don’t try to read it.

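To make the "presentation versus content" motivation concrete, the sketch below marks up the same book once with HTML-style presentational tags and once with XML tags that name the content. Both fragments are invented for illustration (the HTML fragment is written so that it is also well-formed XML and can go through the same parser), and Python's standard `xml.etree.ElementTree` is an assumed choice of tool.

```python
import xml.etree.ElementTree as ET

# HTML-style markup: the tags say how the text should look.
HTML_FRAGMENT = "<p><b>Maeve Ascendant</b> by <i>Corets, Eva</i></p>"

# XML markup: the tags say what the text is.
XML_FRAGMENT = (
    "<book><title>Maeve Ascendant</title>"
    "<author>Corets, Eva</author></book>"
)

html_tags = [el.tag for el in ET.fromstring(HTML_FRAGMENT).iter()]
xml_tags = [el.tag for el in ET.fromstring(XML_FRAGMENT).iter()]

# With content-describing tags, the author can be asked for by name.
author = ET.fromstring(XML_FRAGMENT).findtext("author")

print(html_tags)
print(xml_tags)
print(author)
```

In the HTML version a program would have to guess that italic text means "author"; in the XML version the element name says so directly.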
From HTML to XML

HTML describes the presentation

This document was uploaded on 06/01/2011.
