cis6930fa11_OpenIE

OPEN INFORMATION EXTRACTION FROM THE WEB
Michele Banko, Michael J. Cafarella, Stephen Soderland, Matt Broadhead and Oren Etzioni

Call for a Shake-Up in Search!
- Question answering rather than indexed keyword search
- Keyword search struggles with massive, heterogeneous data and with knowledge assertion
- Call for general-purpose question-answering systems (Watson, Siri)

Motivation
Traditional Information Extraction (IE):
- requires hand-crafted extraction rules or hand-tagged training examples
- requires the relations of interest to be re-specified for each new task
- is usually domain-specific
- does not scale well to large, heterogeneous corpora

Overview
- Preliminaries
- Key components and design of the Open IE system
- Evaluation
- Related work
- Demo

About This Paper
- High-level description of the system components and framework design
- Largely descriptive rather than technically rigorous
- Builds on work on maximum-entropy methods (part-of-speech labeling, identifying noun phrases, ...)
- Builds on the KnowItAll paper

Terminology
- Tuple: t = (e_i, r_ij, e_j), where r_ij is a relation
- Relation: a general rule connecting entities, e.g. "City such as": New York, Tokyo, London, Beijing, ...
- Relation arguments: in the tuple (e_i, r_ij, e_j), e_i and e_j are the arguments of the relation r_ij

Design Goals
- Automation
- Corpus heterogeneity
- Efficiency

TextRunner: an Open IE System
Key components:
- Self-supervised learner
- Single-pass extractor
- Redundancy-based assessor
- Query processing

Self-supervised Learner
- Step 1: label training data as positive or negative (a parser is used here, to train the extractor)
- Step 2: extract features from the labeled data and train a Naive Bayes classifier
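The learner's two steps can be sketched end-to-end. This is a simplified illustration, not the paper's implementation: the parser-based constraints are reduced to two assumed signals (dependency-chain length, sentence-boundary crossing), the threshold is invented, and the Naive Bayes classifier is a toy over one coarse feature.

```python
from collections import defaultdict

MAX_CHAIN = 4  # assumed threshold on the dependency-chain length

def heuristic_label(chain_len, crosses_boundary):
    """Step 1: label a candidate tuple positive iff the dependency chain
    connecting its arguments is short and stays within one sentence."""
    return chain_len <= MAX_CHAIN and not crosses_boundary

def featurize(relation_words):
    """Step 2 (features): parser-free properties of the relation string,
    here just its token count, bucketed; the real system uses POS-based
    features as well."""
    return ("short" if len(relation_words) <= 3 else "long",)

def train_naive_bayes(examples):
    """examples: list of (features, label) pairs. Returns a toy argmax
    predictor built from per-label feature counts and label priors."""
    counts = defaultdict(lambda: defaultdict(int))
    priors = defaultdict(int)
    for feats, label in examples:
        priors[label] += 1
        for f in feats:
            counts[label][f] += 1
    def predict(feats):
        def score(label):
            p = priors[label]
            for f in feats:
                p *= (counts[label][f] + 1) / (priors[label] + 2)  # Laplace smoothing
            return p
        return max(priors, key=score)
    return predict

# Labeled candidates: (relation words, chain length, crosses boundary)
candidates = [
    (["invented"], 2, False),
    (["was", "born", "in"], 3, False),
    (["and", "then", "later", "also"], 9, True),
]
examples = [(featurize(w), heuristic_label(c, x)) for w, c, x in candidates]
predict = train_naive_bayes(examples)
print(predict(featurize(["developed"])))  # True: short relations look trustworthy
```

Once trained, the classifier needs no parser, which is what makes the single extraction pass cheap.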
Step 1 in detail:
- 1.1 The trainer parses the text. For each sentence it finds all base noun phrases e_i; for each pair (e_i, e_j) it identifies a potential relation r_ij (a sequence of words), yielding the candidate tuple t = (e_i, r_ij, e_j).
- 1.2 Constraints are used to label t as positive or negative, e.g. the length of the dependency chain connecting e_i and e_j, and whether the path from e_i to e_j crosses a sentence boundary.

Step 2 in detail:
- 2.1 Each tuple is mapped to a feature vector, e.g. the number of tokens in r_ij, the presence of a POS tag sequence in r_ij, or the POS tag to the left of e_i.
- 2.2 The labeled feature vectors are input to a Naive Bayes classifier. The classifier is language-specific.

Single-pass Extractor
- Makes a single pass over the corpus
- Tags each word in a sentence with a POS label
- Uses the tags and a noun-phrase chunker to identify entities
- Extracts relations by analyzing the text between noun phrases
- The classifier labels the candidate tuples; TextRunner stores the trustworthy ones
- Relation normalization: non-essential phrases are eliminated to keep relation text succinct (e.g. "definitely developed" is reduced to "developed")
- Entity normalization: the chunker assigns a probability to each entity; tuples containing low-confidence entities are dropped
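The two normalization passes can be sketched as follows. The modifier list and the confidence threshold are illustrative assumptions, not values from the paper:

```python
# Assumed list of droppable modifiers; the paper does not enumerate them.
NON_ESSENTIAL = {"definitely", "certainly", "really", "just", "also"}

def normalize_relation(rel: str) -> str:
    """Relation normalization: drop non-essential modifiers so that
    relation strings collapse to a succinct core."""
    return " ".join(w for w in rel.split() if w.lower() not in NON_ESSENTIAL)

def keep_entities(entity_probs: dict, threshold: float = 0.5) -> bool:
    """Entity normalization: drop the tuple if the chunker's confidence
    in any argument falls below the (assumed) threshold."""
    return all(p >= threshold for p in entity_probs.values())

print(normalize_relation("definitely developed"))               # "developed"
print(keep_entities({"Edison": 0.97, "the light bulb": 0.21}))  # False
```

Collapsing relation strings this way also helps later, when identical tuples are merged and counted.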
Redundancy-based Assessor
- Merges identical tuples and counts the distinct sentences each was extracted from
- The count is used to assign a probability to each tuple (using the KnowItAll model)
- Intuition: a tuple t = (e_i, r_ij, e_j) is more likely a correct instance of the relation r_ij if it is extracted from many different sentences

Query Processing
- An inverted index distributed over a pool of machines
- Each relation is assigned to one machine, which then stores a reference to every tuple that is an instance of any relation assigned to it -- much like a distributed hash table
- The relation-centric index supports advanced, natural-language-like search and answering
- The distributed pool of machines supports interactive search speeds

Experimental Results
- Comparison with traditional IE
- Global statistics on facts learned

Comparison with Traditional IE
- TextRunner vs. KnowItAll: Open IE vs. closed IE, with 10 pre-selected relations
- Speed: TextRunner takes 85 CPU hours for all relations in the corpus at once; KnowItAll takes 6.3 hours per relation

Global Statistics on Facts Learned
Evaluation goals:
- How many of the extracted tuples represent actual relationships with plausible arguments?
- What subset of these tuples is correct?
- How many of these tuples are distinct?

Data set:
- 9 million Web pages
- 133 million sentences
- 60.5 million tuples extracted (2.2 tuples per sentence)

Filtering criteria:
- Tuple probability > 0.8
- The tuple's relation is supported by at least 10 distinct sentences
- The relation is not overly general (top 0.1% of relations), e.g. (NP1, has, NP2)

Result: 11.3 million tuples containing 278,085 distinct relation strings.
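The assessor and the evaluation filter can be sketched together. The probability formula below (1 - 0.5^n) is only an illustrative stand-in for KnowItAll's probabilistic model; it merely encodes "more distinct supporting sentences, more trust":

```python
def assess(extractions):
    """extractions: iterable of (tuple_key, sentence_id) pairs.
    Merges identical tuples and counts distinct supporting sentences,
    then maps the count to a crude confidence."""
    support = {}
    for key, sent_id in extractions:
        support.setdefault(key, set()).add(sent_id)
    return {key: (len(s), 1 - 0.5 ** len(s)) for key, s in support.items()}

def keep(count, prob, min_support=10, min_prob=0.8):
    """The paper's filter: probability above 0.8 and support from at least
    10 distinct sentences (the generality check on relations is omitted)."""
    return prob > min_prob and count >= min_support

ex = [(("Edison", "invented", "the light bulb"), f"s{i}") for i in range(12)]
ex.append((("Edison", "ate", "the light bulb"), "s99"))
scored = assess(ex)
count, prob = scored[("Edison", "invented", "the light bulb")]
print(count, keep(count, prob))  # 12 True: well supported, so kept
```

A singly-attested tuple like (Edison, ate, the light bulb) scores low on both criteria and is filtered out.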
Estimating the Correctness of Facts; Estimating the Number of Distinct Facts
- Only relation synonymy is addressed
- Relations are merged using linguistic/syntactic cues (punctuation, auxiliary verbs, leading stopwords, active vs. passive voice)
- Merging reduces the number of distinct relations to 91% of the pre-merge count
- Difficulty: it is rare to find two distinct relations that are truly synonymous in all senses of each phrase, e.g. a person develops a disease vs. a scientist develops a technology
- Approach: synonymy clusters, with human assessment at the tuple level

From the paper: it is rare to find two distinct relations that are truly synonymous in all senses of each phrase unless domain-specific type checking is performed on one or both arguments. If the first argument is the name of a scientist, then "developed" is synonymous with "invented" and "created", and is closely related to "patented"; without such argument type checking, these relations will pick out overlapping, but quite distinct, sets of tuples. It is, however, easier for a human to assess similarity at the tuple level, where context in the form of the entities grounding the relationship is available. In order to estimate the number of similar facts extracted by TextRunner, we began with our filtered set of 11.3 million tuples. For each tuple, we found clusters of concrete tuples of the form (e1, r, e2), (e1, q, e2) where r != q, that is, tuples where the entities match but the relation strings are distinct. We found that only one third of the tuples belonged to such "synonymy clusters". Next, we randomly sampled 100 synonymy clusters and asked one author of this paper to determine how many distinct facts existed within each cluster.
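The cluster-finding step just described (entities match, relation strings differ) can be sketched as a simple grouping; the data below is an invented toy example:

```python
from collections import defaultdict

def synonymy_clusters(tuples):
    """Group tuples by their argument pair (e1, e2); a synonymy cluster
    exists where at least two distinct relation strings share the pair."""
    by_args = defaultdict(set)
    for e1, r, e2 in tuples:
        by_args[(e1, e2)].add(r)
    return {args: rels for args, rels in by_args.items() if len(rels) > 1}

data = [
    ("Einstein", "developed", "relativity"),
    ("Einstein", "created", "relativity"),
    ("Edison", "invented", "the light bulb"),
]
clusters = synonymy_clusters(data)
print(sorted(clusters[("Einstein", "relativity")]))  # ['created', 'developed']
```

Deciding whether the relation strings inside a cluster denote the same fact is exactly the part left to a human judge in the paper.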
For example, the cluster of 4 tuples below describes 2 distinct relations, R1 and R2, between Bletchley Park and Station X:

R1: (Bletchley Park, was location of, Station X)
R2: (Bletchley Park, being called, Station X)
R2: (Bletchley Park, known as, Station X)
R2: (Bletchley Park, codenamed, Station X)

From the related-work comparison: Shinyama and Sekine's approach achieves greater specificity, but does not scale to the Web. Given a collection of documents, their system first forms a clustering of the entire set of articles, partitioning the corpus into sets of articles believed to discuss similar topics. Within each cluster, named-entity recognition, co-reference resolution and deep linguistic parse structures are computed and then used to automatically identify relations between sets of entities. This use of "heavy" linguistic machinery would be problematic if applied to the Web. Their system, which uses pairwise vector-space clustering, initially requires an O(D^2) effort, where D is the number of documents. Each document assigned to a cluster is then subject to linguistic processing, potentially resulting in another pass through the set of input documents. This is far more expensive for large document collections than TextRunner's O(D + T log T) runtime presented earlier. From a collection of 28,000 newswire articles, Shinyama and Sekine were able to discover 101 relations. While it is difficult to measure the exact number of relations found by TextRunner on its 9,000,000 Web page corpus, it is at least two or three orders of magnitude greater than 101.

5 Conclusions

This paper introduces Open IE from the Web, an unsupervised extraction paradigm that eschews relation-specific extraction in favor of a single extraction pass over the corpus, during which relations of interest are automatically discovered and efficiently stored. Unlike traditional IE systems that repeatedly incur the cost of corpus analysis with the naming of each new relation, Open IE's one-time relation discovery procedure allows a user to name and explore relationships at interactive speeds.
The paper also introduces TextRunner, a fully implemented Open IE system, and demonstrates its ability to extract massive amounts of high-quality information from a nine million Web page corpus. We have shown that TextRunner is able to match the recall of the KnowItAll state-of-the-art Web IE system, while achieving higher precision. In the future, we plan to integrate scalable methods for detecting synonyms and resolving multiple mentions of entities in TextRunner. The system would also benefit from the ability to learn the types of entities commonly taken by relations. This would enable the system to make a distinction between different senses of a relation, as well as better locate entity boundaries. Finally, we plan to unify tuples output by TextRunner into a graph-based structure, enabling complex relational queries.

Returning to the estimate of distinct facts: overall, we found that roughly one quarter of the tuples in our sample were reformulations of other tuples contained somewhere in the filtered set of 11.3 million tuples. Given our previous measurement that two thirds of the concrete fact tuples do not belong to synonymy clusters, we can compute that 2/3 + (1/3 × 3/4) = 11/12, or roughly 92%, of the tuples found by TextRunner express distinct assertions. As pointed out earlier, this is an overestimate of the number of unique facts because we have not been able to factor in the impact of multiple entity names, which is a topic for future work.

4 Related Work

Traditional "closed" IE work was discussed in Section 1. Recent efforts [Pasca et al., 2006] seeking to undertake large-scale extraction indicate a growing interest in the problem. This year, Sekine [Sekine, 2006] proposed a paradigm for "on-demand information extraction," which aims to eliminate the customization involved with adapting IE systems to new topics.
Using unsupervised learning methods, the system automatically creates patterns and performs extraction based on a topic specified by the user.

Estimating the Number of Distinct Facts
- Challenge: find methods for detecting synonyms and resolving multiple mentions of entities

Related Work
- KnowItAll Project (the umbrella project)
- IBM Watson

TextRunner Demo

Questions?

This note was uploaded on 11/09/2011 for the course CIS 6930 taught by Professor Staff during the Fall '08 term at University of Florida.
