RelationExtraction-2

CS345 Data Mining: Mining the Web for Structured Data

Our view of the web so far…
- Web pages as atomic units
- Great for some applications, e.g., conventional web search
- But not always the right model
Going beyond web pages
- Question answering
  - What is the height of Mt Everest?
  - Who killed Abraham Lincoln?
- Relation extraction
  - Find all <company, CEO> pairs
- Virtual databases
  - Answer database-like queries over web data
  - E.g., find all software engineering jobs in Fortune 500 companies

Question Answering
- E.g., Who killed Abraham Lincoln?
- Naïve algorithm
  - Find all web pages containing the terms “killed” and “Abraham Lincoln” in close proximity
  - Extract k-grams from a small window around the terms
  - Find the most commonly occurring k-grams
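The naïve algorithm can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the toy `PAGES` list stands in for retrieved web pages, and the helper names (`tokenize`, `most_common_kgrams`) and the parameters `k` and `window` are hypothetical choices, not part of the slides.

```python
import re
from collections import Counter

# Toy stand-ins for retrieved web pages (made up for this sketch).
PAGES = [
    "John Wilkes Booth killed Abraham Lincoln at Ford's Theatre.",
    "Abraham Lincoln was killed by John Wilkes Booth in 1865.",
    "The actor John Wilkes Booth shot and killed Abraham Lincoln.",
    "Abraham Lincoln was born in Kentucky.",  # lacks "killed", so it is skipped
]

def tokenize(text):
    """Lowercase the text and keep alphabetic words only."""
    return re.findall(r"[a-z]+", text.lower())

def most_common_kgrams(pages, query_terms, k=3, window=6, top=1):
    """Count k-grams that occur near the query terms, once per page."""
    query_words = {w for term in query_terms for w in tokenize(term)}
    counts = Counter()
    for page in pages:
        words = tokenize(page)
        hits = [i for i, w in enumerate(words) if w in query_words]
        found = {words[i] for i in hits}
        # Keep the page only if all query words appear in close proximity.
        if found != query_words or hits[-1] - hits[0] > window:
            continue
        # Small window of words around the query terms.
        lo, hi = max(0, hits[0] - window), hits[-1] + window + 1
        nearby = words[lo:hi]
        # Candidate answers: k-grams that contain no query word.
        grams = {
            " ".join(nearby[i:i + k])
            for i in range(len(nearby) - k + 1)
            if not query_words & set(nearby[i:i + k])
        }
        counts.update(grams)
    return counts.most_common(top)

print(most_common_kgrams(PAGES, ["killed", "Abraham Lincoln"]))
# prints [('john wilkes booth', 3)]
```

Counting each k-gram once per page is a design choice of this sketch: it keeps a single repetitive page from dominating the vote.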
Question Answering
- Naïve algorithm works fairly well!
- Some improvements
  - Use sentence structure, e.g., restrict to noun phrases only
  - Rewrite questions before matching
    - “What is the height of Mt Everest” becomes “The height of Mt Everest is <blank>”
- The number of pages analyzed is more important than the sophistication of the NLP
  - For simple questions
- Reference: Dumais et al.
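The question-rewriting improvement can be illustrated with a small sketch. It assumes only the single “What is X” form shown on the slide; the function names `rewrite_question` and `fill_blank` and the sample sentence are hypothetical, and a real system would need many more rewrite rules.

```python
import re

def rewrite_question(question):
    """Turn 'What is X?' into the declarative template 'X is <blank>'."""
    match = re.fullmatch(r"\s*what\s+is\s+(.+?)\s*\??\s*", question,
                         flags=re.IGNORECASE)
    if match is None:
        return None  # question form not covered by this sketch
    subject = match.group(1)
    return subject[0].upper() + subject[1:] + " is <blank>"

def fill_blank(template, text):
    """Find the template's fixed part in text and return what fills the blank."""
    prefix = template.replace("<blank>", "").strip()
    # Capture everything after the prefix up to the end of the clause.
    found = re.search(re.escape(prefix) + r"\s+([^.;]+)", text,
                      flags=re.IGNORECASE)
    return found.group(1).strip() if found else None

template = rewrite_question("What is the height of Mt Everest?")
print(template)  # prints The height of Mt Everest is <blank>

# Made-up sentence standing in for text found on a web page.
snippet = "Surveys report that the height of Mt Everest is 8849 metres."
print(fill_blank(template, snippet))  # prints 8849 metres
```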

Relation Extraction
- Find pairs (title, author)
  - Where title is the name of a book
  - E.g., (Foundation, Isaac Asimov)
- Find pairs (company, hq)
  - E.g., (Microsoft, Redmond)
- Find pairs (abbreviation, expansion)
  - E.g., (ADA, American Dental Association)
- Can also have tuples with >2 components
Relation Extraction
- Assumptions:
  - No single source contains all the tuples
  - Each tuple appears on many web pages
  - Components of a tuple appear “close” together
    - Foundation, by Isaac Asimov
    - Isaac Asimov’s masterpiece, the <em>Foundation</em> trilogy
  - There are repeated patterns in the way tuples are represented on web pages

Naïve approach
- Study a few websites and come up with a set of patterns, e.g., regular expressions
  - letter = [A-Za-z. ]
  - title = letter{5,40}
  - author = letter{10,30}
  - <b>(title)</b> by (author)
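The hand-written pattern above translates directly into a Python regular expression. The sketch below assumes Python's `re` syntax; the `SAMPLE` markup and the function name `extract_pairs` are made up for illustration.

```python
import re

# Building blocks copied from the slide.
LETTER = r"[A-Za-z. ]"
TITLE = LETTER + r"{5,40}"
AUTHOR = LETTER + r"{10,30}"

# <b>(title)</b> by (author)
PATTERN = re.compile(r"<b>(" + TITLE + r")</b> by (" + AUTHOR + r")")

# Hypothetical page fragment that happens to follow the pattern.
SAMPLE = (
    "<li><b>Foundation</b> by Isaac Asimov</li>\n"
    "<li><b>The Caves of Steel</b> by Isaac Asimov</li>"
)

def extract_pairs(html):
    """Return every (title, author) pair matched by the pattern."""
    return PATTERN.findall(html)

print(extract_pairs(SAMPLE))
# prints [('Foundation', 'Isaac Asimov'), ('The Caves of Steel', 'Isaac Asimov')]
```

Note that the length bounds matter: a title shorter than five characters, or an author shorter than ten, would be silently missed by this pattern.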
Problems with naïve approach
- A pattern that works on one web page might produce nonsense when applied to another
  - So patterns need to be page-specific, or at least site-specific
- Impossible for a human to exhaustively enumerate patterns for every relevant website
  - Will result in low coverage
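Both problems are easy to reproduce with a pattern of the form `<b>(title)</b> by (author)`. In the sketch below the two page fragments (`BLOG_PAGE`, `TABLE_PAGE`) are invented for illustration: the first makes the pattern fire on text that is not a book listing, and the second lists a real book in a layout the pattern does not cover.

```python
import re

# A hand-written (title, author) pattern: <b>(title)</b> by (author)
LETTER = r"[A-Za-z. ]"
PATTERN = re.compile(
    r"<b>(" + LETTER + r"{5,40})</b> by (" + LETTER + r"{10,30})"
)

# Hypothetical blog page: same markup shape, but not a book listing.
BLOG_PAGE = "<p><b>Posted</b> by admin on Monday</p>"

# Hypothetical table layout: a genuine book that the pattern cannot see.
TABLE_PAGE = "<tr><td>Isaac Asimov</td><td><i>Foundation</i></td></tr>"

print(PATTERN.findall(BLOG_PAGE))   # prints [('Posted', 'admin on Monday')]
print(PATTERN.findall(TABLE_PAGE))  # prints []
```

The first result is a nonsense tuple; the second is a missed tuple, which is the low-coverage problem.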

Better approach (Brin)
- Exploit duality between patterns and tuples
  - Find tuples that match a set of patterns