RelationExtraction-1 - 1 CS345 Data Mining Mining the Web...

CS345 Data Mining
Mining the Web for Structured Data

Our view of the web so far…
• Web pages as atomic units
• Great for some applications
  ◦ e.g., conventional web search
• But not always the right model

Going beyond web pages
• Question answering
  ◦ What is the height of Mt Everest?
  ◦ Who killed Abraham Lincoln?
• Relation extraction
  ◦ Find all <company, CEO> pairs
• Virtual databases
  ◦ Answer database-like queries over web data
  ◦ E.g., find all software engineering jobs in Fortune 500 companies

Question Answering
• E.g., Who killed Abraham Lincoln?
• Naïve algorithm
  ◦ Find all web pages containing the terms “killed” and “Abraham Lincoln” in close proximity
  ◦ Extract k-grams from a small window around the terms
  ◦ Find the most commonly occurring k-grams

Question Answering
• Naïve algorithm works fairly well!
• Some improvements
  ◦ Use sentence structure, e.g., restrict to noun phrases only
  ◦ Rewrite questions before matching
    ▪ “What is the height of Mt Everest” becomes “The height of Mt Everest is <blank>”
• For simple questions, the number of pages analyzed is more important than the sophistication of the NLP
Reference: Dumais et al

Relation Extraction
• Find pairs (title, author)
  ◦ Where title is the name of a book
  ◦ E.g., (Foundation, Isaac Asimov)
• Find pairs (company, hq)
  ◦ E.g., (Microsoft, Redmond)
• Find pairs (abbreviation, expansion)
  ◦ E.g., (ADA, American Dental Association)
• Can also have tuples with >2 components

Relation Extraction
• Assumptions:
  ◦ No single source contains all the tuples
  ◦ Each tuple appears on many web pages
  ◦ Components of a tuple appear “close” together
    ▪ Foundation, by Isaac Asimov
    ▪ Isaac Asimov’s masterpiece, the <em>Foundation</em> trilogy
  ◦ There are repeated patterns in the way tuples are represented on web pages

Naïve approach
• Study a few websites and come up with a set of patterns, e.g., regular expressions
    letter = [A-Za-z. ]
    title = letter{5,40}
    author = letter{10,30}
    <b>(title)</b> by (author)

Problems with naïve approach
• A pattern that works on one web page might produce nonsense when applied to another
  ◦ So patterns need to be page-specific, or at least site-specific
• Impossible for a human to exhaustively enumerate patterns for every relevant website
  ◦ Will result in low coverage

Better approach (Brin)
• Exploit duality between patterns and tuples
  ◦ Find tuples that match a set of patterns...
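The naïve question-answering algorithm from the slides can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the `pages` list stands in for fetched web pages, the function names `tokenize` and `top_kgrams` are hypothetical, the first query term is used as the anchor for the proximity window, and k-grams that contain a query term are skipped so that the question itself does not dominate the counts.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def top_kgrams(pages, query_terms, k=3, window=6, top=1):
    """Count k-grams that occur near the query terms and return the most common.

    pages       -- list of page texts (a stand-in for crawled web pages)
    query_terms -- lowercase tokens; the first one anchors the window
    window      -- number of tokens kept on each side of the anchor
    """
    query = set(query_terms)
    anchor = query_terms[0]
    counts = Counter()
    for page in pages:
        tokens = tokenize(page)
        for i, tok in enumerate(tokens):
            if tok != anchor:
                continue
            span = tokens[max(0, i - window): i + window + 1]
            # Proximity check: every query term must fall inside the window.
            if not query.issubset(span):
                continue
            for j in range(len(span) - k + 1):
                gram = span[j:j + k]
                # Skip k-grams that merely repeat part of the question.
                if query.isdisjoint(gram):
                    counts[" ".join(gram)] += 1
    return counts.most_common(top)


# Made-up example sentences, not real crawl data.
pages = [
    "John Wilkes Booth killed Abraham Lincoln in 1865.",
    "Abraham Lincoln was killed by John Wilkes Booth.",
    "In 1865 John Wilkes Booth killed President Abraham Lincoln.",
]
answer = top_kgrams(pages, ["killed", "abraham", "lincoln"])
print(answer)
```

With more pages, the correct answer tends to be the k-gram that keeps recurring across differently worded sentences, which is consistent with the slide's point that page volume matters more than NLP sophistication for simple questions.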
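The hand-written pattern from the naïve relation-extraction approach translates directly into a Python regular expression. The sketch below assumes the pattern is applied to raw HTML strings; the function name `extract_pairs` and both sample snippets are made up for illustration. The second snippet shows the problem the slides describe: the same pattern yields a nonsense tuple on a page with a different layout.

```python
import re

# Character class and length bounds taken from the slide's pattern.
LETTER = r"[A-Za-z. ]"
TITLE = LETTER + r"{5,40}"
AUTHOR = LETTER + r"{10,30}"

# <b>(title)</b> by (author)
PATTERN = re.compile(r"<b>(" + TITLE + r")</b> by (" + AUTHOR + r")")


def extract_pairs(html):
    """Return every (title, author) pair the pattern matches in the HTML."""
    return [(t.strip(), a.strip()) for t, a in PATTERN.findall(html)]


# A page where the pattern behaves as intended.
book_page = "<li><b>Foundation</b> by Isaac Asimov</li>"
# A page from a different site: same markup shape, unrelated meaning.
shop_page = "<p><b>Shipping</b> by mail order only</p>"

good = extract_pairs(book_page)
bad = extract_pairs(shop_page)
print(good)
print(bad)
```

Because the pattern encodes only surface markup, it cannot tell a book listing from a shipping notice, which is why such patterns end up being page-specific or site-specific.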

This document was uploaded on 03/04/2012.

