RelationExtraction-2

CS345 Data Mining Mining the Web for Structured Data
Our view of the web so far…
  Web pages as atomic units
  Great for some applications
    e.g., conventional web search
  But not always the right model
Going beyond web pages
  Question answering
    What is the height of Mt Everest?
    Who killed Abraham Lincoln?
  Relation extraction
    Find all <company, CEO> pairs
  Virtual databases
    Answer database-like queries over web data
    E.g., find all software engineering jobs in Fortune 500 companies
Question Answering
  E.g., Who killed Abraham Lincoln?
  Naïve algorithm
    Find all web pages containing the terms “killed” and “Abraham Lincoln” in close proximity
    Extract k-grams from a small window around the terms
    Find the most commonly occurring k-grams
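The three steps of the naïve algorithm can be sketched roughly as follows. This is only an illustration under stated assumptions: the function name `top_kgrams`, the parameters `window` and `max_gap`, the token-based notion of "close proximity", and the sample pages are all hypothetical and do not come from the slides. Query phrases are treated as bags of lowercase tokens, and k-grams that contain a query token are skipped so the question's own words do not dominate the counts.

```python
import re
from collections import Counter

def top_kgrams(pages, terms, k=3, window=5, max_gap=10):
    """Rank k-grams found near the query terms across all pages."""
    query = {tok.lower() for term in terms for tok in term.split()}
    counts = Counter()
    for text in pages:
        tokens = [t.lower() for t in re.findall(r"[A-Za-z']+", text)]
        hits = [i for i, tok in enumerate(tokens) if tok in query]
        # Keep the page only if every query token occurs, within max_gap tokens.
        if not hits or not query <= set(tokens) or hits[-1] - hits[0] > max_gap:
            continue
        # Small window of tokens around the query terms.
        lo = max(0, hits[0] - window)
        hi = min(len(tokens), hits[-1] + window + 1)
        for i in range(lo, hi - k + 1):
            gram = tokens[i:i + k]
            # Skip k-grams that merely repeat the question's own words.
            if not query.intersection(gram):
                counts[" ".join(gram)] += 1
    return counts.most_common()

# Hypothetical sample pages, standing in for real search results.
sample_pages = [
    "John Wilkes Booth killed Abraham Lincoln in 1865.",
    "Abraham Lincoln was killed by John Wilkes Booth at Ford's Theatre.",
    "The actor John Wilkes Booth killed President Abraham Lincoln.",
]
ranked = top_kgrams(sample_pages, ["killed", "Abraham Lincoln"], k=3)
print(ranked[0])
```

The sketch relies on redundancy alone: a k-gram rises to the top only because many pages repeat it near the query terms, not because of any linguistic analysis.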
Question Answering
  Naïve algorithm works fairly well!
  Some improvements
    Use sentence structure, e.g., restrict to noun phrases only
    Rewrite questions before matching
      “What is the height of Mt Everest” becomes “The height of Mt Everest is <blank>”
  The number of pages analyzed is more important than the sophistication of the NLP
    For simple questions
  Reference: Dumais et al.
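The question-rewriting improvement might look like the sketch below. It handles only the "What is X" shape from the slide; the function names `rewrite_question` and `fill_blank`, the sentence-boundary heuristic, and the sample pages are assumptions made for illustration, not part of the referenced system.

```python
import re
from collections import Counter

def rewrite_question(question):
    """Turn 'What is X?' into the declarative prefix 'X is'; None if unsupported."""
    m = re.match(r"\s*what\s+is\s+(.+?)\s*\??\s*$", question, flags=re.IGNORECASE)
    if m is None:
        return None
    subject = m.group(1)
    return subject[0].upper() + subject[1:] + " is"

def fill_blank(pages, question):
    """Count the text that follows the rewritten prefix, up to the sentence end."""
    prefix = rewrite_question(question)
    if prefix is None:
        return []
    # Capture lazily until sentence-ending punctuation or the end of the text.
    pattern = re.compile(
        re.escape(prefix) + r"\s+(.+?)(?:[.;!?](?:\s|$)|$)", re.IGNORECASE
    )
    counts = Counter()
    for text in pages:
        for m in pattern.finditer(text):
            counts[m.group(1).strip()] += 1
    return counts.most_common()

# Hypothetical sample pages.
everest_pages = [
    "The height of Mt Everest is 8,849 metres. It lies in the Himalayas.",
    "Many sources say the height of Mt Everest is 8,849 metres",
    "Everest is tall.",
]
answers = fill_blank(everest_pages, "What is the height of Mt Everest?")
print(answers)
```

Questions that do not fit the "What is X" template, such as "Who killed Abraham Lincoln?", return no rewrite here; a fuller system would need one rewrite rule per question shape.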
Relation Extraction
  Find pairs (title, author)
    Where title is the name of a book
    E.g., (Foundation, Isaac Asimov)
  Find pairs (company, hq)
    E.g., (Microsoft, Redmond)
  Find pairs (abbreviation, expansion)
    E.g., (ADA, American Dental Association)
  Can also have tuples with >2 components
Relation Extraction
  Assumptions:
    No single source contains all the tuples
    Each tuple appears on many web pages
    Components of a tuple appear “close” together
      Foundation, by Isaac Asimov
      Isaac Asimov’s masterpiece, the <em>Foundation</em> trilogy
    There are repeated patterns in the way tuples are represented on web pages
Naïve approach
  Study a few websites and come up with a set of patterns, e.g., regular expressions
    letter = [A-Za-z. ]
    title = letter{5,40}
    author = letter{10,30}
    <b>(title)</b> by (author)
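The hand-written pattern above translates almost directly into a regular expression. The sketch below uses the character class and length bounds from the slide; the helper name `extract_books` and the sample HTML are hypothetical.

```python
import re

# Building blocks taken from the slide.
LETTER = r"[A-Za-z. ]"
TITLE = LETTER + r"{5,40}"
AUTHOR = LETTER + r"{10,30}"

# <b>(title)</b> by (author)
BOOK_PATTERN = re.compile(r"<b>(" + TITLE + r")</b> by (" + AUTHOR + r")")

def extract_books(html):
    """Return (title, author) pairs matched by the hand-written pattern."""
    return [(t.strip(), a.strip()) for t, a in BOOK_PATTERN.findall(html)]

# Hypothetical page fragment.
listing = "<ul><li><b>Foundation</b> by Isaac Asimov</li></ul>"
books = extract_books(listing)
print(books)
```

Because the pattern only checks character classes and lengths, it accepts any bold text followed by "by" and a long enough run of letters, so a fragment such as `<b>Price</b> by weight and volume` also yields a pair.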
Problems with naïve approach
  A pattern that works on one web page might produce nonsense when applied to another
    So patterns need to be page-specific, or at least site-specific
  Impossible for a human to exhaustively enumerate patterns for every relevant website
    Will result in low coverage
Better approach (Brin)
  Exploit duality between patterns and tuples
    Find tuples that match a set of patterns
    Find patterns that match a lot of tuples
  DIPRE (Dual Iterative Pattern Relation Extraction)
  [Diagram: a cycle between Patterns and Tuples, with arrows labeled Match and Generate]
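The pattern/tuple duality can be illustrated with a heavily simplified bootstrapping loop. This is a sketch in the spirit of DIPRE, not Brin's actual algorithm: here a pattern is just a (prefix, middle, suffix) triple of literal strings, only the title-then-author order is handled, and the names `find_occurrences`, `generate_patterns`, `match_patterns`, `dipre`, the parameters `max_middle` and `min_support`, and the sample pages are all assumptions.

```python
import os
import re
from collections import defaultdict

def find_occurrences(tuples, pages, max_middle=20):
    """Yield (before, middle, after) wherever a title is closely followed by its author."""
    for title, author in tuples:
        for page in pages:
            t = page.find(title)
            if t < 0:
                continue
            t_end = t + len(title)
            a = page.find(author, t_end)
            if a < 0 or a - t_end > max_middle:
                continue
            yield page[:t], page[t_end:a], page[a + len(author):]

def common_suffix(strings):
    """Longest string that every input ends with."""
    return os.path.commonprefix([s[::-1] for s in strings])[::-1]

def generate_patterns(occurrences, min_support=2):
    """Group occurrences by middle text; keep contexts shared by enough occurrences."""
    groups = defaultdict(list)
    for before, middle, after in occurrences:
        groups[middle].append((before, after))
    patterns = set()
    for middle, contexts in groups.items():
        if len(contexts) < min_support:
            continue
        prefix = common_suffix([b for b, _ in contexts])
        suffix = os.path.commonprefix([a for _, a in contexts])
        # Reject patterns with no anchoring context; they would match too much.
        if prefix and middle and suffix:
            patterns.add((prefix, middle, suffix))
    return patterns

def match_patterns(patterns, pages):
    """Apply every pattern to every page and collect the (title, author) pairs."""
    found = set()
    for prefix, middle, suffix in patterns:
        regex = re.compile(
            re.escape(prefix) + "(.+?)" + re.escape(middle) + "(.+?)" + re.escape(suffix)
        )
        for page in pages:
            found.update(regex.findall(page))
    return found

def dipre(seeds, pages, max_rounds=5):
    """Alternate between generating patterns and matching tuples until nothing new appears."""
    tuples = set(seeds)
    for _ in range(max_rounds):
        patterns = generate_patterns(find_occurrences(tuples, pages))
        new = match_patterns(patterns, pages) - tuples
        if not new:
            break
        tuples |= new
    return tuples

# Hypothetical pages in two different layouts.
book_pages = [
    "<li><b>Foundation</b> by Isaac Asimov</li>",
    "<li><b>Dune</b> by Frank Herbert</li>",
    "<li><b>Neuromancer</b> by William Gibson</li>",
    "<li><b>Hyperion</b> by Dan Simmons</li>",
    "Review: Neuromancer, written by William Gibson, is a classic.",
    "Review: Hyperion, written by Dan Simmons, won many awards.",
    "Review: Snow Crash, written by Neal Stephenson, is fast and funny.",
]
seeds = {("Foundation", "Isaac Asimov"), ("Dune", "Frank Herbert")}
result = dipre(seeds, book_pages)
print(sorted(result))
```

In this toy run the two seeds yield a pattern for the list layout, which finds two more books; those in turn appear in the review layout, which yields a second pattern and one further tuple. A realistic version would also need a check on pattern specificity so that one overly general pattern does not flood the tuple set with noise.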