RelationExtraction-1

RelationExtraction-1 - 1 CS345 Data Mining Mining the Web...


CS345 Data Mining
Mining the Web for Structured Data

Our view of the web so far
  - Web pages as atomic units
  - Great for some applications, e.g., conventional web search
  - But not always the right model

Going beyond web pages
  - Question answering
      - What is the height of Mt Everest?
      - Who killed Abraham Lincoln?
  - Relation extraction
      - Find all <company, CEO> pairs
  - Virtual databases
      - Answer database-like queries over web data
      - E.g., find all software engineering jobs in Fortune 500 companies

Question Answering
  - E.g., Who killed Abraham Lincoln?
  - Naive algorithm
      - Find all web pages containing the terms "killed" and "Abraham Lincoln" in close proximity
      - Extract k-grams from a small window around the terms
      - Find the most commonly occurring k-grams

Question Answering
  - Naive algorithm works fairly well!
  - Some improvements
      - Use sentence structure, e.g., restrict to noun phrases only
      - Rewrite questions before matching
          - "What is the height of Mt Everest" becomes "The height of Mt Everest is <blank>"
  - The number of pages analyzed is more important than the sophistication of the NLP
      - For simple questions
  - Reference: Dumais et al.

Relation Extraction
  - Find pairs (title, author)
      - Where title is the name of a book
      - E.g., (Foundation, Isaac Asimov)
  - Find pairs (company, hq)
      - E.g., (Microsoft, Redmond)
  - Find pairs (abbreviation, expansion)
      - E.g., (ADA, American Dental Association)
  - Can also have tuples with >2 components

Relation Extraction
  - Assumptions:
      - No single source contains all the tuples
      - Each tuple appears on many web pages
      - Components of a tuple appear close together
          - Foundation, by Isaac Asimov
          - Isaac Asimov's masterpiece, the <em>Foundation</em> trilogy
      - There are repeated patterns in the way tuples are represented on web pages

Naive approach
  - Study a few websites and come up with a set of patterns, e.g., regular expressions
      letter = [A-Za-z. ]
      title = letter{5,40}
      author = letter{10,30}
      <b>(title)</b> by (author)

Problems with naive approach
  - A pattern that works on one web page might produce nonsense when applied to another
      - So patterns need to be page-specific, or at least site-specific
  - Impossible for a human to exhaustively enumerate patterns for every relevant website
      - Will result in low coverage

Better approach (Brin)
  - Exploit duality between patterns and tuples
      - Find tuples that match a set of patterns...
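As a minimal sketch of the naive approach, the slide's hand-written pattern (letter, title, author) can be transcribed into a Python regular expression. The function name extract_pairs and the sample page strings are illustrative assumptions, not part of the lecture; the second sample is made up to show how a pattern written for one page can produce nonsense on another.

```python
import re

# Character class and length bounds copied from the slide's pattern.
LETTER = r"[A-Za-z. ]"
TITLE = LETTER + r"{5,40}"
AUTHOR = LETTER + r"{10,30}"

# Slide pattern: <b>(title)</b> by (author)
PATTERN = re.compile(r"<b>(" + TITLE + r")</b> by (" + AUTHOR + r")")


def extract_pairs(html):
    """Return every (title, author) pair the pattern matches in the text.

    Hypothetical helper for illustration only.
    """
    return [(m.group(1).strip(), m.group(2).strip())
            for m in PATTERN.finditer(html)]


# A page that happens to use the layout the pattern was written for.
book_page = "<li><b>Foundation</b> by Isaac Asimov</li>"
print(extract_pairs(book_page))

# A made-up page from a different site: same markup, unrelated meaning.
forum_page = "<b>Posted</b> by admin on Tuesday night"
print(extract_pairs(forum_page))

# The same tuple written without the <b> markup is missed entirely.
plain_page = "Foundation, by Isaac Asimov"
print(extract_pairs(plain_page))
```

The first call recovers the intended pair, the second returns a spurious pair, and the third returns nothing, which corresponds to the nonsense-match and low-coverage problems listed for the naive approach.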