

A Practical Part-of-Speech Tagger

Doug Cutting and Julian Kupiec and Jan Pedersen and Penelope Sibun
Xerox Palo Alto Research Center
3333 Coyote Hill Road, Palo Alto, CA 94304, USA

Abstract

We present an implementation of a part-of-speech tagger based on a hidden Markov model. The methodology enables robust and accurate tagging with few resource requirements. Only a lexicon and some unlabeled training text are required. Accuracy exceeds 96%. We describe implementation strategies and optimizations which result in high-speed operation. Three applications for tagging are described: phrase recognition; word sense disambiguation; and grammatical function assignment.

1 Desiderata

Many words are ambiguous in their part of speech. For example, "tag" can be a noun or a verb. However, when a word appears in the context of other words, the ambiguity is often reduced: in "a tag is a part-of-speech label," the word "tag" can only be a noun. A part-of-speech tagger is a system that uses context to assign parts of speech to words.

Automatic text tagging is an important first step in discovering the linguistic structure of large text corpora. Part-of-speech information facilitates higher-level analysis, such as recognizing noun phrases and other patterns in text.

For a tagger to function as a practical component in a language processing system, we believe that a tagger must be:

Robust: Text corpora contain ungrammatical constructions, isolated phrases (such as titles), and nonlinguistic data (such as tables). Corpora are also likely to contain words that are unknown to the tagger. It is desirable that a tagger deal gracefully with these situations.

Efficient: If a tagger is to be used to analyze arbitrarily large corpora, it must be efficient, performing in time linear in the number of words tagged. Any training required should also be fast, enabling rapid turnaround with new corpora and new text genres.

Accurate: A tagger should attempt to assign the correct part-of-speech tag to every word encountered.

Tunable: A tagger should be able to take advantage of linguistic insights. One should be able to correct systematic errors by supplying appropriate a priori "hints." It should be possible to give different hints for different corpora.

Reusable: The effort required to retarget a tagger to new corpora, new tagsets, and new languages should be minimal.

2 Methodology

2.1 Background

Several different approaches have been used for building text taggers. Greene and Rubin used a rule-based approach in the TAGGIT program [Greene and Rubin, 1971], which was an aid in tagging the Brown corpus [Francis and Kucera, 1982]. TAGGIT disambiguated 77% of the corpus; the rest was done manually over a period of several years. More recently, Koskenniemi also used a rule-based approach implemented with finite-state machines [Koskenniemi, 1990].

Statistical methods have also been used (e.g., [DeRose, 1988], [Garside et al., 1987]). These provide the capability of resolving ambiguity on the basis of most likely interpretation. A form of Markov model has been widely used that assumes that a word depends probabilistically on just its part-of-speech category, which in turn depends solely on the categories of the preceding two words.

Two types of training (i.e., parameter estimation) have been used with this model. The first makes use of a tagged training corpus. Derouault and Merialdo use a bootstrap method for training [Derouault and Merialdo, 1986]. At first, a relatively small amount of text is manually tagged and used to train a partially accurate model. The model is then used to tag more text, and the tags are manually corrected and then used to retrain the model. Church uses the tagged Brown corpus for training [Church, 1988]. These models involve probabilities for each word in the lexicon, so large tagged corpora are required for reliable estimation.

The second method of training does not require a tagged training corpus. In this situation the Baum-Welch algorithm (also known as the forward-backward algorithm) can be used [Baum, 1972]. Under this regime the model is called a hidden Markov model (HMM), as state transitions (i.e., part-of-speech categories) are assumed to be unobservable. Jelinek has used this method for training a text tagger [Jelinek, 1985]. Parameter smoothing can be conveniently achieved using the method of deleted interpolation, in which weighted estimates are taken from second- and first-order models and a uniform probability distribution [Jelinek and Mercer, 1980].

Kupiec used word equivalence classes (referred to here as ambiguity classes) based on parts of speech, to pool data from individual words [Kupiec, 1989b]. The most common words are still represented individually, as sufficient data exist for robust estimation. However, all other words are represented according to the set of possible categories they can assume. In this manner, the vocabulary of 50,000 words in the Brown corpus can be reduced to approximately 400 distinct ambiguity classes [Kupiec, 1992]. To further reduce the number of parameters, a first-order model can be employed (this assumes that a word's category depends only on the immediately preceding word's category). In [Kupiec, 1989a], networks are used to selectively augment the context in a basic first-order model, rather than using uniformly second-order dependencies.

2.2 Our approach

We next describe how our choice of techniques satisfies the criteria listed in section 1. The use of an HMM permits complete flexibility in the choice of training corpora. Text from any desired domain can be used, and a tagger can be tailored for use with a particular text database by training on a portion of that database. Lexicons containing alternative tag sets can be easily accommodated without any need for re-labeling the training corpus, affording further flexibility in the use of specialized tags.
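To make concrete why swapping the lexicon does not require re-labeling the training text, the following minimal sketch maps the same untagged words to the ambiguity classes (sets of permitted tags) that the model would observe under two different tagsets. The toy lexicons, the tag names, and the function names `ambiguity_class` and `observe` are assumptions made for illustration; they are not taken from the tagger itself.

```python
def ambiguity_class(word, lexicon, default_tags):
    """Return the set of tags permitted for `word` (its ambiguity class).

    Words missing from the lexicon fall back to `default_tags`.
    """
    return frozenset(lexicon.get(word.lower(), default_tags))


def observe(words, lexicon, default_tags):
    """Map running text to the ambiguity-class sequence an HMM would see."""
    return [ambiguity_class(w, lexicon, default_tags) for w in words]


# Hypothetical toy lexicons over two different tagsets.
COARSE = {"a": {"det"}, "tag": {"noun", "verb"}, "is": {"verb"}}
FINE = {"a": {"article"}, "tag": {"sing-noun", "base-verb"}, "is": {"verb-3sg"}}

# The same unlabeled text is reused unchanged with either lexicon.
words = "a tag is a label".split()
coarse_obs = observe(words, COARSE, default_tags={"noun", "verb"})
fine_obs = observe(words, FINE, default_tags={"sing-noun", "base-verb"})

for word, c, f in zip(words, coarse_obs, fine_obs):
    print(word, sorted(c), sorted(f))
```

Only the lexicon changes between the two runs; the text itself carries no tags, which is the property the paragraph above relies on.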
As the resources required are simply a lexicon and a suitably large sample of ordinary text, taggers can be built with minimal effort, even for other languages, such as French (e.g., [Kupiec, 1992]). The use of ambiguity classes and a first-order model reduces the number of parameters to be estimated without significant reduction in accuracy (discussed in section 5). This also enables a tagger to be reliably trained using only moderate amounts of text. We have produced reasonable results training on as few as 3,000 sentences. Fewer parameters also reduce the time required for training. Relatively few ambiguity classes are sufficient for wide coverage, so it is unlikely that adding new words to the lexicon requires retraining, as their ambiguity classes are already accommodated.

Vocabulary independence is achieved by predicting categories for words not in the lexicon, using both context and suffix information. Probabilities corresponding to category sequences that never occurred in the training data are assigned small, non-zero values, ensuring that the model will accept any sequence of tokens, while still providing the most likely tagging. By using the fact that words are typically associated with only a few part-of-speech categories, and carefully ordering the computation, the algorithms have linear complexity (section 3.3).

3 Hidden Markov Modeling

The hidden Markov modeling component of our tagger is implemented as an independent module following the specification given in [Levinson et al., 1983], with special attention to space and time efficiency issues. Only first-order modeling is addressed and will be presumed for the remainder of this discussion.

3.1 Formalism

In brief, an HMM is a doubly stochastic process that generates a sequence of symbols

$$S = \{S_1, S_2, \ldots, S_T\}, \quad S_i \in W, \quad 1 \le i \le T,$$

where $W$ is some finite set of possible symbols, by composing an underlying Markov process with a state-dependent symbol generator (i.e., a Markov process with noise).^1 The Markov process captures the notion of sequence dependency and is described by a set of $N$ states, a matrix of transition probabilities $A = \{a_{ij}\}$, $1 \le i, j \le N$, where $a_{ij}$ is the probability of moving from state $i$ to state $j$, and a vector of initial probabilities $\Pi = \{\pi_i\}$, $1 \le i \le N$, where $\pi_i$ is the probability of starting in state $i$. The symbol generator is a state-dependent measure on $W$ described by a matrix of symbol probabilities $B = \{b_{jk}\}$, $1 \le j \le N$ and $1 \le k \le M$, where $M = |W|$ and $b_{jk}$ is the probability of generating symbol $s_k$ given that the Markov process is in state $j$.^2

^1 For an introduction to hidden Markov modeling see [Rabiner and Juang, 1986].
^2 In the following we will write $b_j(S_t)$ for $b_{jk}$ if $S_t = s_k$.

In part-of-speech tagging, we will model word order dependency through an underlying Markov process that operates in terms of lexical tags, yet we will only be able to observe the sets of tags, or ambiguity classes, that are possible for individual words. The ambiguity class of each word is the set of its permitted parts of speech, only one of which is correct in context. Given the parameters $A$, $B$ and $\Pi$, hidden Markov modeling allows us to compute the most probable sequence of state transitions, and hence the most likely sequence of lexical tags, corresponding to a sequence of ambiguity classes. In the following, $N$ can be identified with the number of possible tags, and $W$ with the set of all ambiguity classes.

Applying an HMM consists of two tasks: estimating the model parameters $A$, $B$ and $\Pi$ from a training set; and computing the most likely sequence of underlying state transitions given new observations. Maximum likelihood estimates (that is, estimates that maximize the probability of the training set) can be found through application of alternating expectation in a procedure known as the Baum-Welch, or forward-backward, algorithm [Baum, 1972]. It proceeds by recursively defining two sets of probabilities: the forward probabilities,

$$\alpha_{t+1}(j) = \left[ \sum_{i=1}^{N} \alpha_t(i)\, a_{ij} \right] b_j(S_{t+1}), \quad 1 \le t \le T-1, \qquad (1)$$

where $\alpha_1(i) = \pi_i b_i(S_1)$ for all $i$; and the backward probabilities,

$$\beta_t(i) = \sum_{j=1}^{N} a_{ij}\, b_j(S_{t+1})\, \beta_{t+1}(j), \quad T-1 \ge t \ge 1, \qquad (2)$$

where $\beta_T(j) = 1$ for all $j$. The forward probability $\alpha_t(i)$ is the joint probability of the sequence up to time $t$, $\{S_1, S_2, \ldots, S_t\}$, and the event that the Markov process is in state $i$ at time $t$. Similarly, the backward probability $\beta_t(i)$ is the probability of seeing the sequence $\{S_{t+1}, S_{t+2}, \ldots, S_T\}$ given that the Markov process is in state $i$ at time $t$. It follows that the probability of the entire sequence is

$$P = \sum_{i=1}^{N} \sum_{j=1}^{N} \alpha_t(i)\, a_{ij}\, b_j(S_{t+1})\, \beta_{t+1}(j)$$

for any $t$ in the range $1 \le t \le T-1$.^3

^3 This is most conveniently evaluated at $t = T-1$, in which case $P = \sum_{i=1}^{N} \alpha_T(i)$.

Given an initial choice for the parameters $A$, $B$, and $\Pi$, the expected number of transitions, $\gamma_{ij}$, from state $i$ to state $j$ conditioned on the observation sequence $S$ may be computed as follows:

$$\gamma_{ij} = \frac{1}{P} \sum_{t=1}^{T-1} \alpha_t(i)\, a_{ij}\, b_j(S_{t+1})\, \beta_{t+1}(j).$$

Hence we can estimate $a_{ij}$ by:

$$\hat{a}_{ij} = \frac{\gamma_{ij}}{\sum_{j=1}^{N} \gamma_{ij}} = \frac{\sum_{t=1}^{T-1} \alpha_t(i)\, a_{ij}\, b_j(S_{t+1})\, \beta_{t+1}(j)}{\sum_{t=1}^{T-1} \alpha_t(i)\, \beta_t(i)}. \qquad (3)$$

Similarly, $b_{jk}$ and $\pi_i$ can be estimated as follows:

$$\hat{b}_{jk} = \frac{\sum_{t \,:\, S_t = s_k} \alpha_t(j)\, \beta_t(j)}{\sum_{t=1}^{T} \alpha_t(j)\, \beta_t(j)} \qquad (4)$$

and

$$\hat{\pi}_i = \frac{1}{P}\, \alpha_1(i)\, \beta_1(i). \qquad (5)$$

In summary, to find maximum likelihood estimates for $A$, $B$, and $\Pi$ via the Baum-Welch algorithm, one chooses some starting values, applies equations 3–5 to compute new values, and then iterates until convergence. It can be shown that this algorithm will converge, although possibly to a non-global maximum [Baum, 1972].

Once a model has been estimated, selecting the most likely underlying sequence of state transitions corresponding to an observation $S$ can be thought of as a maximization over all sequences that might generate $S$. An efficient dynamic programming procedure, known as the Viterbi algorithm [Viterbi, 1967], arranges for this computation to proceed in time proportional to $T$. Suppose $V = \{v(t)\}$, $1 \le t \le T$, is a state sequence that generates $S$; then the probability that $V$ generates $S$ is

$$P(v) = \pi_{v(1)}\, b_{v(1)}(S_1) \prod_{t=2}^{T} a_{v(t-1)v(t)}\, b_{v(t)}(S_t).$$

To find the most probable such sequence we start by defining $\phi_1(i) = \pi_i b_i(S_1)$ for $1 \le i \le N$ and then perform the recursion

$$\phi_t(j) = \max_{1 \le i \le N} \left[ \phi_{t-1}(i)\, a_{ij} \right] b_j(S_t) \qquad (6)$$

and

$$\psi_t(j) = \arg\max_{1 \le i \le N} \left[ \phi_{t-1}(i)\, a_{ij} \right]$$

for $2 \le t \le T$ and $1 \le j \le N$. The crucial observation is that for each time $t$ and each state $i$ one need only consider the most probable sequence arriving at state $i$ at time $t$. The probability of the most probable sequence is $\max_{1 \le i \le N} [\phi_T(i)]$, while the sequence itself can be reconstructed by defining $v(T) = \arg\max_{1 \le i \le N} \phi_T(i)$ and $v(t-1) = \psi_t(v(t))$ for $T \ge t \ge 2$.

3.2 Numerical Stability

The Baum-Welch algorithm (equations 1–5) and the Viterbi algorithm (equation 6) involve operations on products of numbers constrained to be between 0 and 1. Since these products can easily underflow, measures must be taken to rescale. One approach premultiplies the $\alpha$ and $\beta$ probabilities with an accumulating product depending on $t$ [Levinson et al., 1983]. Let $\tilde{\alpha}_1(i) = \alpha_1(i)$ and define

$$c_t = \left[ \sum_{i=1}^{N} \tilde{\alpha}_t(i) \right]^{-1}, \quad 1 \le t \le T.$$

Now define $\hat{\alpha}_t(i) = c_t\, \tilde{\alpha}_t(i)$ and use $\hat{\alpha}$ in place of $\alpha$ in equation 1 to define $\tilde{\alpha}$ for the next iteration:

$$\tilde{\alpha}_{t+1}(j) = \left[ \sum_{i=1}^{N} \hat{\alpha}_t(i)\, a_{ij} \right] b_j(S_{t+1}), \quad 1 \le t \le T-1.$$

Note that $\sum_{i=1}^{N} \hat{\alpha}_t(i) = 1$ for $1 \le t \le T$. Similarly, let $\hat{\beta}_T(i) = \beta_T(i)$ and define $\tilde{\beta}_t(i) = c_t\, \hat{\beta}_t(i)$ for $T \ge t \ge 1$, where

$$\hat{\beta}_t(i) = \sum_{j=1}^{N} a_{ij}\, b_j(S_{t+1})\, \tilde{\beta}_{t+1}(j), \quad T-1 \ge t \ge 1.$$

The scaled forward and backward probabilities, $\hat{\alpha}$ and $\hat{\beta}$, can be exchanged for the unscaled probabilities in equations 3–5 without affecting the value of the ratios. To see this, note that $\hat{\alpha}_t(i) = C_1^t\, \alpha_t(i)$ and $\hat{\beta}_t(i) = \beta_t(i)\, C_{t+1}^T$, where

$$C_i^j = \prod_{t=i}^{j} c_t.$$

Now, in terms of the scaled probabilities, equation 5, for example, can be seen to be unchanged:

$$\frac{\hat{\alpha}_1(i)\, \hat{\beta}_1(i)}{\sum_{i=1}^{N} \hat{\alpha}_T(i)} = \frac{C_1^1\, \alpha_1(i)\, \beta_1(i)\, C_2^T}{\sum_{i=1}^{N} C_1^T\, \alpha_T(i)} = \hat{\pi}_i.$$

A slight difficulty occurs in equation 3 that can be cured by the addition of a new term, $c_{t+1}$, in each product of the upper sum:

$$\frac{\sum_{t=1}^{T-1} \hat{\alpha}_t(i)\, a_{ij}\, b_j(S_{t+1})\, \hat{\beta}_{t+1}(j)\, c_{t+1}}{\sum_{t=1}^{T-1} \hat{\alpha}_t(i)\, \hat{\beta}_t(i)} = \hat{a}_{ij}.$$

Numerical instability in the Viterbi algorithm can be ameliorated by operating on a logarithmic scale [Levinson et al., 1983]. That is, one maximizes the log probability of each sequence of state transitions,

$$\log P(v) = \log \pi_{v(1)} + \log b_{v(1)}(S_1) + \sum_{t=2}^{T} \left[ \log a_{v(t-1)v(t)} + \log b_{v(t)}(S_t) \right].$$

Hence, equation 6 is replaced by

$$\phi_t(j) = \max_{1 \le i \le N} \left[ \phi_{t-1}(i) + \log a_{ij} \right] + \log b_j(S_t).$$

Care must be taken with zero probabilities. However, this can be elegantly handled through the use of IEEE negative infinity [P754, 1981].

3.3 Reducing Time Complexity

As can be seen from equations 1–5, the time cost of training is $O(TN^2)$. Similarly, as given in equation 6, the Viterbi algorithm is also $O(TN^2)$. However, in part-of-speech tagging, the problem structure dictates that the matrix of symbol probabilities $B$ is sparsely populated. That is, $b_{ij} \ne 0$ iff the ambiguity class corresponding to symbol $j$ includes the part-of-speech tag associated with state $i$. In practice, the degree of overlap between ambiguity classes is relatively low; some tokens are assigned unique tags, and hence have only one non-zero symbol probability.

The sparseness of $B$ leads one to consider restructuring equations 1–6 so a check for zero symbol probability can obviate the need for further computation. Equation 1 is already conveniently factored so that the dependence on $b_j(S_{t+1})$ is outside the inner sum. Hence, if $k$ is the average number of non-zero entries in each row of $B$, the cost of computing equation 1 can be reduced to $O(kTN)$.
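As a hedged illustration of the two ideas above (working on a logarithmic scale with IEEE negative infinity standing in for zero probabilities, and skipping states whose symbol probability is zero), here is a minimal Viterbi sketch. The function names, the dense list-of-lists layout, and the toy two-state model are assumptions made for this example; it is not the Common Lisp implementation described in the paper.

```python
import math

NEG_INF = float("-inf")


def log(p):
    """Logarithm that maps zero probability to IEEE negative infinity."""
    return math.log(p) if p > 0.0 else NEG_INF


def viterbi_log(pi, A, B, obs):
    """Most likely state sequence for `obs`, computed in log space.

    pi[i]   : initial probability of state i
    A[i][j] : transition probability from state i to state j
    B[j][k] : probability that state j emits symbol k
    Returns (path, log probability of that path).
    """
    n = len(pi)
    phi = [log(pi[i]) + log(B[i][obs[0]]) for i in range(n)]
    back = []
    for s in obs[1:]:
        new_phi = [NEG_INF] * n
        ptr = [0] * n
        for j in range(n):
            if B[j][s] == 0.0:
                continue  # zero symbol probability: skip the inner maximization
            best_i = max(range(n), key=lambda i: phi[i] + log(A[i][j]))
            new_phi[j] = phi[best_i] + log(A[best_i][j]) + log(B[j][s])
            ptr[j] = best_i
        phi = new_phi
        back.append(ptr)
    # Reconstruct the path by following back-pointers from the best end state.
    state = max(range(n), key=lambda i: phi[i])
    best_logp = phi[state]
    path = [state]
    for ptr in reversed(back):
        state = ptr[state]
        path.append(state)
    path.reverse()
    return path, best_logp


# Toy model (hypothetical numbers): states 0 = noun, 1 = verb;
# symbols 0 = {noun}, 1 = {verb}, 2 = {noun, verb}.
pi = [0.8, 0.2]
A = [[0.3, 0.7], [0.8, 0.2]]
B = [[0.5, 0.0, 0.5], [0.0, 0.5, 0.5]]
path, logp = viterbi_log(pi, A, B, [2, 2, 0])
print(path, math.exp(logp))
```

In this toy run the final observation is the unambiguous class {noun}, so the verb state is skipped entirely at that step. A production version would index the non-zero states per symbol ahead of time rather than test each entry of `B`.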
Equations 2–4 can be similarly reduced by switching the order of iteration. For example, in equation 2, rather than for a given $t$ computing $\beta_t(i)$ for each $i$ one at a time, one can accumulate terms for all $i$ in parallel. The net effect of this rewriting is to place a $b_j(S_{t+1}) = 0$ check outside the innermost iteration. Equations 3 and 4 submit to a similar approach. Equation 5 is already only $O(N)$. Hence, the overall cost of training can be reduced to $O(kTN)$, which, in our experience, amounts to an order of magnitude speedup.^4 The time complexity of the Viterbi algorithm can also be reduced to $O(kTN)$ by noting that $b_j(S_t)$ can be factored out of the maximization of equation 6.

^4 An equivalent approach maintains a mapping from states $i$ to non-zero symbol probabilities and simply avoids, in the inner iteration, computing products which must be zero [Kupiec, 1992].

3.4 Controlling Space Complexity

Adding up the sizes of the probability matrices $A$, $B$, and $\Pi$, it is easy to see that the storage cost for directly representing one model is proportional to $N(N + M + 1)$. Running the Baum-Welch algorithm requires storage for the sequence of observations, the $\alpha$ and $\beta$ probabilities, the vector $\{c_i\}$, and copies of the $A$ and $B$ matrices (since the originals cannot be overwritten until the end of each iteration). Hence, the grand total of space required for training is proportional to $T + 2N(T + N + M + 1)$.

Since $N$ and $M$ are fixed by the model, the only parameter that can be varied to reduce storage costs is $T$. Now, adequate training requires processing from tens of thousands to hundreds of thousands of tokens [Kupiec, 1989a]. The training set can be considered one long sequence, in which case $T$ is very large indeed, or it can be broken up into a number of smaller sequences at convenient boundaries. In first-order hidden Markov modeling, the stochastic process effectively restarts at unambiguous tokens, such as sentence and paragraph markers; hence these tokens are convenient points at which to break the training set.

If the Baum-Welch algorithm is run separately (from the same starting point) on each piece, the resulting trained models must be recombined in some way. One obvious approach is simply to average. However, this fails if any two states are indistinguishable (in the sense that they had the same transition probabilities and the same symbol probabilities at start), because states are then not matched across trained models. It is therefore important that each state have a distinguished role, which is relatively easy to achieve in part-of-speech tagging.

Our implementation of the Baum-Welch algorithm breaks up the input into fixed-sized pieces of training text. The Baum-Welch algorithm is then run separately on each piece and the results are averaged together.

Running the Viterbi algorithm requires storage for the sequence of observations, a vector of current maxes, a scratch array of the same size, and a matrix of indices, for a total proportional to $T + N(2 + T)$ and a grand total (including the model) of $T + N(N + M + T + 3)$. Again, $N$ and $M$ are fixed. However, $T$ need not be longer than a single sentence, since, as was observed above, the HMM, and hence the Viterbi algorithm, restarts at sentence boundaries.

3.5 Model Tuning

An HMM for part-of-speech tagging can be tuned in a variety of ways. First, the choice of tagset and lexicon determines the initial model. Second, empirical and a priori information can influence the choice of starting values for the Baum-Welch algorithm. For example, counting instances of ambiguity classes in running text allows one to assign non-uniform starting probabilities in $B$ for a particular tag's realization as a particular ambiguity class. Alternatively, one can state a priori that a particular ambiguity class is most likely to be the reflection of some subset of its component tags.
For example, if an ambiguity class consisting of the open class tags is used for unknown words, one may encode the fact that most unknown words are nouns or proper nouns by biasing the initial probabilities in $B$.

Another biasing of starting values can arise from noting that some tags are unlikely to be followed by others. For example, the lexical item "to" maps to an ambiguity class containing two tags, infinitive-marker and to-as-preposition, neither of which occurs in any other ambiguity class. If nothing more were stated, the HMM would have two states which were indistinguishable. This can be remedied by setting the initial transition probabilities from infinitive-marker to strongly favor transitions to such states as verb-uninflected and adverb.

Our implementation allows for two sorts of biasing of starting values: ambiguity classes can be annotated with favored tags, and states can be annotated with favored transitions. These biases may be specified either as sets or as set complements. Biases are implemented by replacing the disfavored probabilities with a small constant (machine epsilon) and redistributing mass to the other possibilities. This has the effect of disfavoring the indicated outcomes without disallowing them; sufficient converse data can rehabilitate these values.

4 Architecture

In support of this and other work, we have developed a system architecture for text access [Cutting et al., 1991]. This architecture defines five components for such systems: corpus, which provides text in a generic manner; analysis, which extracts terms from the text; index, which stores term occurrence statistics; and search, which utilizes these statistics to resolve queries. The part-of-speech tagger described here is implemented as an analysis module.

[Figure 1: Tagger Modules in System Context. Modules shown: Corpus, Analysis, Index, Search; within Analysis: Tokenizer, Lexicon, Training, Tagging. Data passed between modules: character, token, ambiguity class, <stem,tag>*, trained HMM, and stem, tag (to further analysis).]
Figure 1 illustrates the overall architecture, showing the tagger analysis implementation in detail. The tagger itself has a modular architecture, isolating behind standard protocols those elements which may vary, enabling easy substitution of alternate implementations. Also illustrated here are the data types which flow between tagger components. As an analysis implementation, the tagger must generate terms from text. In this context, a term is a word stem annotated with part of speech.

Text enters the analysis sub-system, where the first processing module it encounters is the tokenizer, whose duty is to convert text (a sequence of characters) into a sequence of tokens. Sentence boundaries are also identified by the tokenizer and are passed as reserved tokens.

The tokenizer subsequently passes tokens to the lexicon. Here tokens are converted into a set of stems, each annotated with a part-of-speech tag. The set of tags identifies an ambiguity class. The identification of these classes is also the responsibility of the lexicon. Thus the lexicon delivers a set of stems paired with tags, and an ambiguity class.

The training module takes long sequences of ambiguity classes as input. It uses the Baum-Welch algorithm to produce a trained HMM, an input to the tagging module. Training is typically performed on a sample of the corpus at hand, with the trained HMM being saved for subsequent use on the corpus at large.

The tagging module buffers sequences of ambiguity classes between sentence boundaries. These sequences are disambiguated by computing the maximal path through the HMM with the Viterbi algorithm. Operating at sentence granularity provides fast throughput without loss of accuracy, as sentence boundaries are unambiguous. The resulting sequence of tags is used to select the appropriate stems. Pairs of stems and tags are subsequently emitted.

The tagger may function as a complete analysis component, providing tagged text to search and indexing components, or as a sub-system of a more elaborate analysis, such as phrase recognition.
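The data flow just described (characters to tokens, tokens to stem and tag pairs plus an ambiguity class, and ambiguity classes buffered per sentence) can be sketched roughly as follows. Everything named here is an assumption for illustration: the regular expression, the reserved token `<S>`, the toy lexicon, and the open-class fallback are stand-ins rather than the modules of the actual system.

```python
import re

# Assumed token classes: a sentence boundary is a period followed by
# whitespace and an uppercase letter (or end of text). Lookahead keeps the
# uppercase letter from being consumed, so it starts the next token.
TOKEN_RE = re.compile(r"(?P<boundary>\.(?=\s+[A-Z]|\s*$))|(?P<word>[A-Za-z]+)")
SENTENCE_BOUNDARY = "<S>"  # hypothetical reserved token

# Hypothetical toy lexicon: token -> list of (stem, tag) pairs.
TOY_LEXICON = {
    "does": [("do", "verb"), ("doe", "plural-noun")],
    "the": [("the", "determiner")],
    "run": [("run", "verb"), ("run", "noun")],
}
OPEN_CLASS_TAGS = ("noun", "verb", "adjective", "adverb")


def tokenize(text):
    """Convert a character sequence into word and boundary tokens."""
    for m in TOKEN_RE.finditer(text):
        if m.lastgroup == "boundary":
            yield SENTENCE_BOUNDARY
        else:
            yield m.group("word").lower()


def lookup(token):
    """Return ((stem, tag) pairs, ambiguity class) for a token."""
    pairs = TOY_LEXICON.get(token)
    if pairs is None:
        # Unknown word: fall back to a default open-class ambiguity class.
        pairs = [(token, tag) for tag in OPEN_CLASS_TAGS]
    return pairs, frozenset(tag for _, tag in pairs)


def sentences(text):
    """Yield one list of (token, ambiguity class) items per sentence."""
    buffer = []
    for token in tokenize(text):
        if token == SENTENCE_BOUNDARY:
            if buffer:
                yield buffer
            buffer = []
        else:
            buffer.append((token, lookup(token)[1]))
    if buffer:
        yield buffer


sents = list(sentences("The doe does run. The tagger does tag"))
for sent in sents:
    print([(tok, sorted(cls)) for tok, cls in sent])
```

Each yielded list is what a tagging step would hand to the Viterbi algorithm; the chosen tags would then select among the `(stem, tag)` pairs returned by `lookup`.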
4.1 Tokenizer Implementation

The problem of tokenization has been well addressed by much work in compilation of programming languages. The accepted approach is to specify token classes with regular expressions. These may be compiled into a single deterministic finite state automaton which partitions character streams into labeled tokens [Aho et al., 1986, Lesk, 1975].

In the context of tagging, we require at least two token classes: sentence boundary and word. Other classes may include numbers, paragraph boundaries and various sorts of punctuation (e.g., braces of various types, commas). However, for simplicity, we will henceforth assume only words and sentence boundaries are extracted.

Just as with programming languages, with text it is not always possible to unambiguously specify the required token classes with regular expressions. However, the addition of a simple lookahead mechanism which allows specification of right context ameliorates this [Aho et al., 1986, Lesk, 1975]. For example, a sentence boundary in English text might be identified by a period, followed by whitespace, followed by an uppercase letter. However, the uppercase letter must not be consumed, as it is the first component of the next token. A lookahead mechanism allows us to specify in the sentence-boundary regular expression that the final character matched should not be considered a part of the token.

This method meets our stated goals for the overall system. It is efficient, requiring that each character be examined only once (modulo lookahead). It is easily parameterizable, providing the expressive power to concisely define accurate and robust token classes.

4.2 Lexicon Implementation

The lexicon module is responsible for enumerating parts of speech and their associated stems for each word it is given. For the English word "does," the lexicon might return "do, verb" and "doe, plural-noun." It is also responsible for identifying ambiguity classes based upon sets of tags.

We have employed a three-stage implementation:

First, we consult a manually-constructed lexicon to find stems and parts of speech. Exhaustive lexicons of this sort are expensive, if not impossible, to produce. Fortunately, a small set of words accounts for the vast majority of word occurrences. Thus high coverage can be obtained without prohibitive effort.

Words not found in the manually constructed lexicon are generally both open class and regularly inflected. As a second stage, a language-specific method can be employed to guess ambiguity classes for unknown words. For many languages (e.g., English and French), word suffixes provide strong cues to words' possible categories. Probabilistic predictions of a word's category can be made by analyzing suffixes in untagged text [Kupiec, 1992, Meteer et al., 1991].

As a final stage, if a word is not in the manually constructed lexicon, and its suffix is not recognized, a default ambiguity class is used. This class typically contains all the open class categories in the language.

Dictionaries and suffix tables are both efficiently implementable as letter trees, or tries [Knuth, 1973], which require that each character of a word be examined only once during a lookup.

5 Performance

In this section, we detail how our tagger meets the desiderata that we outlined in section 1.

5.1 Efficient

The system is implemented in Common Lisp [Steele, 1990]. All timings reported are for a Sun SPARCStation2. The English lexicon used contains 38 tags (N = 38) and 174 ambiguity classes (M = 174).

Training was performed on 25,000 words in articles selected randomly from Grolier's Encyclopedia. Five iterations of training were performed in a total time of 115 CPU seconds. Following is a time breakdown by component:

Training: average microseconds per token

    tokenizer    lexicon    1 iteration    5 iterations    total
    640          400        680            3400            4600

Tagging was performed on 115,822 words in a collection of articles by the journalist Dave Barry. This required a total of 143 CPU seconds. The time breakdown for this was as follows:

Tagging: average microseconds per token

    tokenizer    lexicon    Viterbi    total
    604          388        233        1235

It can be seen from these figures that training on a new corpus may be accomplished in a matter of minutes, and that tens of megabytes of text may then be tagged per hour.

5.2 Accurate and Robust

When using a lexicon and tagset built from the tagged text of the Brown corpus [Francis and Kucera, 1982], training on one half of the corpus (about 500,000 words) and tagging the other, 96% of word instances were assigned the correct tag. Eight iterations of training were used. This level of accuracy is comparable to the best achieved by other taggers [Church, 1988, Merialdo, 1991].

The Brown Corpus contains fragments and ungrammaticalities, thus providing a good demonstration of robustness.

5.3 Tunable and Reusable

A tagger should be tunable, so that systematic tagging errors and anomalies can be addressed. Similarly, it is important that it be fast and easy to target the tagger to new genres and languages, and to experiment with different tagsets reflecting different insights into the linguistic phenomena found in text. In section 3.5, we describe how the HMM implementation itself supports tuning. In addition, our implementation supports a number of explicit parameters to facilitate tuning and reuse, including specification of lexicon and training corpus. There is also support for a flexible tagset. For example, if we want to collapse distinctions in the lexicon, such as those between positive, comparative, and superlative adjectives, we only have to make a small change in the mapping from lexicon to tagset. Similarly, if we wish to make finer grain distinctions than those available in the lexicon, such as case marking on pronouns, there is a simple way to note such exceptions.

6 Applications

We have used the tagger in a number of applications. We describe three applications here: phrase recognition; word sense disambiguation; and grammatical function assignment. These projects are part of a research effort to use shallow analysis techniques to extract content from unrestricted text.

6.1 Phrase Recognition

We have constructed a system that recognizes simple phrases when given as input the sequence of tags for a sentence. There are recognizers for noun phrases, verb groups, adverbial phrases, and prepositional phrases. Each of these phrases comprises a contiguous sequence of tags that satisfies a simple grammar. For example, a noun phrase can be a unary sequence containing a pronoun tag or an arbitrarily long sequence of noun and adjective tags, possibly preceded by a determiner tag and possibly with an embedded possessive marker. The longest possible sequence is found (e.g., "the program committee" but not "the program"). Conjunctions are not recognized as part of any phrase; for example, in the fragment "the cats and dogs," "the cats" and "dogs" will be recognized as two noun phrases. Prepositional phrase attachment is not performed at this stage of processing. This approach to phrase recognition in some cases captures only parts of some phrases; however, our approach minimizes false positives, so that we can rely on the recognizers' results.

6.2 Word Sense Disambiguation

Part-of-speech tagging in and of itself is a useful tool in lexical disambiguation; for example, knowing that "dig" is being used as a noun rather than as a verb indicates the word's appropriate meaning. But many words have multiple meanings even while occupying the same part of speech. To this end, the tagger has been used in the implementation of an experimental noun homograph disambiguation algorithm [Hearst, 1991]. The algorithm (known as CatchWord) performs supervised training over a large text corpus, gathering lexical, orthographic, and simple syntactic evidence for each sense of the ambiguous noun. After a period of training, CatchWord classifies new instances of the noun by checking its context against that of previously observed instances and choosing the sense for which the most evidence is found. Because the sense distinctions made are coarse, the disambiguation can be accomplished without the expense of knowledge bases or inference mechanisms. Initial tests resulted in accuracies of around 90% for nouns with strongly distinct senses.

This algorithm uses the tagger in two ways: (i) to determine the part of speech of the target word (filtering out the non-noun usages) and (ii) as a step in the phrase recognition analysis of the context surrounding the noun.

6.3 Grammatical Function Assignment

The phrase recognizers also provide input to a system, Sopa [Sibun, 1991], which recognizes nominal arguments of verbs, specifically, Subject, Object, and Predicative Arguments. Sopa does not rely on information (such as arity or voice) specific to the particular verbs involved. The first step in assigning grammatical functions is to partition the tag sequence of each sentence into phrases. The phrase types include those mentioned in section 6.1, additional types to account for conjunctions, complementizers, and indicators of sentence boundaries, and an "unknown" type.

After a sentence has been partitioned, each simple noun phrase is examined in the context of the phrase to its left and the phrase to its right. On the basis of this local context and a set of rules, the noun phrase is marked as a syntactic Subject, Object, Predicative, or is not marked at all. A label of Predicative is assigned only if it can be determined that the governing verb group is a form of a predicating verb (e.g., a form of "be"). Because this cannot always be determined, some Predicatives are labeled Objects. If a noun phrase is labeled, it is also annotated as to whether the governing verb is the closest verb group to the right or to the left. The algorithm has an accuracy of approximately 80% in assigning grammatical functions.

Acknowledgments

We would like to thank Marti Hearst for her contributions to this paper, Lauri Karttunen and Annie Zaenen for their work on lexicons, and Kris Halvorsen for supporting this project.

References

[Aho et al., 1986] A. V. Aho, R. Sethi, and J. D. Ullman. Compilers: Principles, Techniques and Tools. Addison-Wesley, 1986.

[Baum, 1972] L. E. Baum. An inequality and associated maximization technique in statistical estimation for probabilistic functions of a Markov process. Inequalities, 3:1–8, 1972.

[Church, 1988] K. W. Church. A stochastic parts program and noun phrase parser for unrestricted text. In Proceedings of the Second Conference on Applied Natural Language Processing (ACL), pages 136–143, 1988.

[Cutting et al., 1991] D. R. Cutting, J. Pedersen, and P.-K. Halvorsen. An object-oriented architecture for text retrieval. In Conference Proceedings of RIAO'91, Intelligent Text and Image Handling, Barcelona, Spain, pages 285–298, April 1991. Also available as Xerox PARC technical report SSL-90-83.

[DeRose, 1988] S. DeRose. Grammatical category disambiguation by statistical optimization. Computational Linguistics, 14:31–39, 1988.

[Derouault and Merialdo, 1986] A. M. Derouault and B. Merialdo. Natural language modeling for phoneme-to-text transcription. IEEE Transactions on Pattern Analysis and Machine Intelligence, PAMI-8:742–749, 1986.

[Francis and Kucera, 1982] W. N. Francis and F. Kucera. Frequency Analysis of English Usage. Houghton Mifflin, 1982.

[Garside et al., 1987] R. Garside, G. Leech, and G. Sampson. The Computational Analysis of English. Longman, 1987.

[Greene and Rubin, 1971] B. B. Greene and G. M. Rubin. Automatic grammatical tagging of English. Technical report, Department of Linguistics, Brown University, Providence, Rhode Island, 1971.

[Hearst, 1991] M. A. Hearst. Noun homograph disambiguation using local context in large text corpora. In The Proceedings of the 7th New OED Conference on Using Corpora, pages 1–22, Oxford, 1991.

[Jelinek and Mercer, 1980] F. Jelinek and R. L. Mercer. Interpolated estimation of Markov source parameters from sparse data. In Proceedings of the Workshop Pattern Recognition in Practice, pages 381–397, Amsterdam, 1980. North-Holland.

[Jelinek, 1985] F. Jelinek. Markov source modeling of text generation. In J. K. Skwirzinski, editor, Impact of Processing Techniques on Communication. Nijhoff, Dordrecht, 1985.

[Knuth, 1973] D. Knuth. The Art of Computer Programming, volume 3: Sorting and Searching. Addison-Wesley, 1973.

[Koskenniemi, 1990] K. Koskenniemi. Finite-state parsing and disambiguation. In H. Karlgren, editor, COLING-90, pages 229–232, Helsinki University, 1990.

[Kupiec, 1989a] J. M. Kupiec. Augmenting a hidden Markov model for phrase-dependent word tagging. In ... Kaufmann.

[Kupiec, 1989b] J. M. Kupiec. Probabilistic models of short and long distance word dependencies in running text. In Proceedings of the 1989 DARPA Speech and Natural Language Workshop, pages 290–295, Philadelphia, 1989. Morgan Kaufmann.

[Kupiec, 1992] J. M. Kupiec. Robust part-of-speech tagging using a hidden Markov model. Submitted to Computer Speech and Language, 1992.

[Lesk, 1975] M. E. Lesk. LEX - a lexical analyzer generator. Computing Science Technical Report 39, AT&T Bell Laboratories, Murray Hill, New Jersey, 1975.

[Levinson et al., 1983] S. E. Levinson, L. R. Rabiner, and M. M. Sondhi. An introduction to the application of the theory of probabilistic functions of a Markov process to automatic speech recognition. Bell System Technical Journal, 62:1035–1074, 1983.

[Merialdo, 1991] B. Merialdo. Tagging text with a probabilistic model.
In Proceedings of ICASSP-91, pages 809{ 812, Toronto, Canada, 1991. Meteer et al., 1991] M. W. Meteer, R. Schwartz, and R. Weischedel. POST: Using probabilities in language processing. In Proceedings of the 12th International Joint Conference on Arti cial Intelligence, pages 960{ 965, 1991. P754, 1981] IEEE Task P754. A proposed standard for binary oating-point arithmetic. Computer, 14(3):51{ 62, March 1981. Rabiner and Juang, 1986] L. R. Rabiner and B. H. Juang. An introduction to hidden markov models. IEEE ASSP Magazine, January 1986. Sibun, 1991] P. Sibun. Grammatical function assignment in unrestricted text. internal report, Xerox Palo Alto Research Center, 1991. Steele, 1990] G. L. Steele, Jr. Common Lisp, The Language. Digital Press, second edition, 1990. Viterbi, 1967] A. J. Viterbi. Error bounds for convolutional codes and an asymptotically optimal decoding algorithm. In IEEE Transactions on Information Theory, pages 260{269, April 1967. Proceedings of the DARPA Speech and Natural Language Workshop, pages 92{98, Cape Cod, MA, 1989. Morgan ...
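
As an illustration of the noun phrase grammar described in section 6.1, the following is a minimal sketch in Python of a longest-match recognizer over a tag sequence. It is not the authors' implementation. The tag names (DET, ADJ, NOUN, PRON, POSS, CONJ, VERB) and the function names are hypothetical, chosen only for this example, and the sketch assumes that an embedded possessive marker must be followed by at least one further noun or adjective tag.

```python
from typing import List, Tuple

# Hypothetical coarse tags for illustration; not the tagset used in the paper.
HEAD_TAGS = {"NOUN", "ADJ"}


def match_noun_phrase(tags: List[str], start: int) -> int:
    """Return the exclusive end index of the longest noun phrase beginning at
    `start`, or `start` itself if no noun phrase begins there."""
    # A single pronoun tag is a complete (unary) noun phrase.
    if tags[start] == "PRON":
        return start + 1

    i = start
    # Optional determiner.
    if tags[i] == "DET":
        i += 1

    # One or more noun/adjective tags, consumed greedily (longest match).
    body_start = i
    while i < len(tags) and tags[i] in HEAD_TAGS:
        i += 1
    if i == body_start:
        return start  # a bare determiner (or nothing) is not a phrase

    # Optional embedded possessive marker; assumed here to require at least
    # one further noun/adjective tag after it.
    if i < len(tags) and tags[i] == "POSS":
        j = i + 1
        while j < len(tags) and tags[j] in HEAD_TAGS:
            j += 1
        if j > i + 1:
            i = j
    return i


def find_noun_phrases(tags: List[str]) -> List[Tuple[int, int]]:
    """Scan left to right and collect non-overlapping (start, end) spans."""
    spans = []
    i = 0
    while i < len(tags):
        end = match_noun_phrase(tags, i)
        if end > i:
            spans.append((i, end))
            i = end
        else:
            i += 1  # tag belongs to no noun phrase (e.g. a conjunction)
    return spans
```

With these assumed tags, "the cats and dogs" tagged as DET NOUN CONJ NOUN yields two spans, one for "the cats" and one for "dogs", because the conjunction tag is not part of any phrase. A greedy scan of this kind also runs in time linear in the number of tags.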