lec6 - CS 6093 Lecture 6 Survey of Information Extraction...

CS 6093 Lecture 6
Survey of Information Extraction Systems
Cong Yu

Reminders
  Next week is spring break, no class
  Office hours will be by appointment only
  Midterm report due at 5 p.m. ET on March 21st
    An hour before the class
    See last week's lecture notes for the late submission policy
  Any other logistical questions?

Outline
  Information extraction systems
    (Snowball)
    Cimple
    KnowItAll/TextRunner
  Quiz + Break (30 min)
  Recent research
    Prioritization of information extraction
    Enhancing Wikipedia
  Conclusion

Recap of Last Lecture
  Main tasks of information extraction
  Basic techniques
    Named entity recognition
    Wrapper induction
  System: Snowball
  Segmentation, Classification, Association, Integration

Review of Snowball

Landscape of Extraction Systems
  [Figure: systems placed along two axes, scope (Open Web vs. Domain-centric) and manual effort (Modest vs. Low). Systems shown: Cimple, TextRunner, Snowball. Tools and platforms: IBM UIMA, Open Calais.]

Cimple
  Building structured web community portals: A top-down, compositional, and incremental approach. Pedro DeRose, Warren Shen, Fei Chen, AnHai Doan, Raghu Ramakrishnan. VLDB 2007.

Landscape of Extraction Systems
  [Figure repeated: Open Web vs. Domain-centric, Modest vs. Low Manual Effort. Systems shown: Cimple, TextRunner, Snowball.]

Goal of Cimple
  Community Information Management: building comprehensive structured community portals with a modest amount of human effort
  Community: a group of users who share similar interests and a common ontology
    I.e., domain-centric
  Characteristics
    Top-down
    Compositional
    Incremental

Top-down
  Start by focusing on a few well-known Web sources within the community
    Database research community: DBWorld, DBLP, etc.
    Entertainment community: IMDb, Rotten Tomatoes, etc.
  This initial set is small enough that building an extraction plan manually is relatively easy

Compositional
  Divide the complex extraction task into individual small tasks
  Each small task is simple enough to be carried out by easy-to-implement operators
  The results are consolidated at the end
  Enables declarative information extraction

Incremental
  Simple assumption: important data sources will sooner or later show up in the set of sources currently being monitored
  Therefore, there is no need to actively go out and crawl the Web for additional sources

Overview
  [Figure: data sources (researcher homepages, conference pages, group pages, DBWorld mailing list, DBLP) supply Web pages and documents; NER and an ER graph (example: Jim Gray, give-talk, SIGMOD-04); services: keyword search, SQL querying, question answering, browse, mining, alert/monitor, news summary; user feedback.]

Top-down: Initial Source Ranking
  80/20 rule
    20% of the sources cover 80% of the community knowledge
  How to identify the 20%?...
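
To make the compositional idea from the slides more concrete, here is a minimal sketch in Python. It is not Cimple's actual operator set or plan language: the operator names, the regular expression, the tiny name dictionary, and the two sample pages are all assumptions made up for illustration. It only shows the shape of the approach: small, easy-to-implement operators run independently, and their results are consolidated at the end.

```python
import re
from typing import Callable, Dict, List

Record = Dict[str, str]
Operator = Callable[[str], List[Record]]


def extract_conferences(page: str) -> List[Record]:
    """Hypothetical operator: find mentions shaped like 'SIGMOD-04'."""
    names = re.findall(r"\b[A-Z]{3,}-\d{2}\b", page)
    return [{"type": "conference", "name": n} for n in names]


def extract_people(page: str) -> List[Record]:
    """Hypothetical operator: look up names in a small hand-made dictionary."""
    known_people = ["Jim Gray"]  # illustrative only
    return [{"type": "person", "name": n} for n in known_people if n in page]


def run_plan(pages: List[str], operators: List[Operator]) -> List[Record]:
    """Apply every operator to every page, then consolidate duplicates."""
    seen = set()
    merged: List[Record] = []
    for page in pages:
        for op in operators:
            for rec in op(page):
                key = (rec["type"], rec["name"])
                if key not in seen:  # keep the first mention of each entity
                    seen.add(key)
                    merged.append(rec)
    return merged


# Made-up sample pages standing in for monitored community sources.
pages = [
    "Jim Gray will give a talk at SIGMOD-04.",
    "SIGMOD-04 call for papers.",
]
result = run_plan(pages, [extract_people, extract_conferences])
print(result)
```

Because each operator takes a page and returns a list of records, new operators can be added to the list passed to run_plan without touching the existing ones.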