Class-Based n-gram Models of Natural Language
Peter F. Brown* Peter V. deSouza* Robert L. Mercer* Vincent J. Della Pietra* Jenifer C. Lai*
IBM T. J. Watson Research Center
We address the problem of predicting a word from previous words in a sample of text. In particular, we discuss n-gram models based on classes of words. We also discuss several statistical algorithms for assigning words to classes based on the frequency of their co-occurrence with other words. We find that we are able to extract classes that have the flavor of either syntactically based groupings or semantically based groupings, depending on the nature of the underlying statistics.
1. Introduction
In a number of natural language processing tasks, we face the problem of recovering a string of English words after it has been garbled by passage through a noisy channel. To tackle this problem successfully, we must be able to estimate the probability with which any particular string of English words will be presented as input to the noisy channel. In this paper, we discuss a method for making such estimates. We also discuss the related topic of assigning words to classes according to statistical behavior in a large body of text. In the next section, we review the concept of a language model and give a definition of n-gram models. In Section 3, we look at the subset of n-gram models in which the words are divided into classes. We show that for n = 2 the maximum likelihood assignment of words to classes is equivalent to the assignment for which the average mutual information of adjacent classes is greatest. Finding an optimal assignment of words to classes is computationally hard, but we describe two algorithms for finding a suboptimal assignment. In Section 4, we apply mutual information to two other forms of word clustering. First, we use it to find pairs of words that function together as a single lexical entity. Then, by examining the probability that two words will appear within a reasonable distance of one another, we use it to find classes that have some loose semantic coherence. In describing our work, we draw freely on terminology and notation from the mathematical theory of communication. The reader who is unfamiliar with this field or who has allowed his or her facility with some of its concepts to fall into disrepair may profit from a brief perusal of Feller (1950) and Gallagher (1968). In the first of these, the reader should focus on conditional probabilities and on Markov chains; in the second, on entropy and mutual information.
* IBM T. J. Watson Research Center, Yorktown Heights, New York 10598.
© 1992 Association for Computational Linguistics
Computational Linguistics
Volume 18, Number 4
Figure 1 Source-channel setup. [Diagram: a source language model emits W with probability Pr(W); a channel model maps W to Y with probability Pr(Y | W); Pr(W) × Pr(Y | W) = Pr(W, Y).]
2. Language Models
Figure 1 shows a model that has long been used in automatic speech recognition (Bahl, Jelinek, and Mercer 1983) and has recently been proposed for machine translation (Brown et al. 1990) and for automatic spelling correction (Mays, Damerau, and Mercer 1990). In automatic speech recognition, y is an acoustic signal; in machine translation, y is a sequence of words in another language; and in spelling correction, y is a sequence of characters produced by a possibly imperfect typist. In all three applications, given a signal y, we seek to determine the string of English words, w, which gave rise to it. In general, many different word strings can give rise to the same signal and so we cannot hope to recover w successfully in all cases. We can, however, minimize our probability of error by choosing as our estimate of w that string $\hat{w}$ for which the a posteriori probability of $\hat{w}$ given y is greatest. For a fixed choice of y, this probability is proportional to the joint probability of $\hat{w}$ and y which, as shown in Figure 1, is the product of two terms: the a priori probability of $\hat{w}$ and the probability that y will appear at the output of the channel when $\hat{w}$ is placed at the input. The a priori probability of $\hat{w}$, $\Pr(\hat{w})$, is the probability that the string $\hat{w}$ will arise in English. We do not attempt a formal definition of English or of the concept of arising in English. Rather, we blithely assume that the production of English text can be characterized by a set of conditional probabilities, $\Pr(w_k \mid w_1^{k-1})$, in terms of which the probability of a string of words, $w_1^k$, can be expressed as a product:
$$\Pr(w_1^k) = \Pr(w_1)\,\Pr(w_2 \mid w_1) \cdots \Pr(w_k \mid w_1^{k-1}). \tag{1}$$
Here, $w_1^{k-1}$ represents the string $w_1 w_2 \cdots w_{k-1}$. In the conditional probability $\Pr(w_k \mid w_1^{k-1})$, we call $w_1^{k-1}$ the history and $w_k$ the prediction. We refer to a computational mechanism for obtaining these conditional probabilities as a language model.
Often we must choose which of two different language models is the better one. The performance of a language model in a complete system depends on a delicate interplay between the language model and other components of the system. One language model may surpass another as part of a speech recognition system but perform less well in a translation system. However, because it is expensive to evaluate a language model in the context of a complete system, we are led to seek an intrinsic measure of the quality of a language model. We might, for example, use each language model to compute the joint probability of some collection of strings and judge as better the language model that yields the greater probability. The perplexity of a language model with respect to a sample of text, S, is the reciprocal of the geometric average of the probabilities of the predictions in S. If S has $|S|$ words, then the perplexity is $\Pr(S)^{-1/|S|}$. Thus, the language model with the smaller perplexity will be the one that assigns the larger probability to S. Because the perplexity depends not only on the language model but also on the text with respect to which it is measured, it is important that the text be representative of that for which the language model is intended. Because perplexity is subject to sampling error, making fine distinctions between language models may require that the perplexity be measured with respect to a large sample.
In an n-gram language model, we treat two histories as equivalent if they end in the same n − 1 words, i.e., we assume that for k > n, $\Pr(w_k \mid w_1^{k-1})$ is equal to $\Pr(w_k \mid w_{k-n+1}^{k-1})$. For a vocabulary of size V, a 1-gram model has V − 1 independent parameters, one for each word minus one for the constraint that all of the probabilities add up to 1. A 2-gram model has V(V − 1) independent parameters of the form $\Pr(w_2 \mid w_1)$ and V − 1 of the form $\Pr(w)$ for a total of $V^2 - 1$ independent parameters. In general, an n-gram model has $V^n - 1$ independent parameters: $V^{n-1}(V - 1)$ of the form $\Pr(w_n \mid w_1^{n-1})$, which we call the order-n parameters, plus the $V^{n-1} - 1$ parameters of an (n − 1)-gram model.
We estimate the parameters of an n-gram model by examining a sample of text, $t_1^T$, which we call the training text, in a process called training. If C(w) is the number of times that the string w occurs in the string $t_1^T$, then for a 1-gram language model the maximum likelihood estimate for the parameter $\Pr(w)$ is C(w)/T.
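The definitions above (the 1-gram maximum likelihood estimate C(w)/T, the perplexity $\Pr(S)^{-1/|S|}$, and the parameter count $V^n - 1$) can be illustrated with a minimal Python sketch. The function names and the toy training text are assumptions made for this illustration; they do not come from the paper.

```python
import math
from collections import Counter


def unigram_mle(tokens):
    """Maximum likelihood 1-gram estimates: Pr(w) = C(w) / T."""
    total = len(tokens)
    return {w: c / total for w, c in Counter(tokens).items()}


def perplexity(prob, sample):
    """Return Pr(S) ** (-1 / |S|), accumulated in log space.

    A single zero-probability word makes the perplexity infinite.
    """
    log_prob = 0.0
    for w in sample:
        p = prob.get(w, 0.0)
        if p == 0.0:
            return math.inf
        log_prob += math.log(p)
    return math.exp(-log_prob / len(sample))


def num_parameters(vocab_size, n):
    """Independent parameters of a general n-gram model: V**n - 1."""
    return vocab_size ** n - 1


# Hypothetical toy training text, used only to exercise the functions.
training = "the cat sat on the mat the cat ran".split()
model = unigram_mle(training)
ppl = perplexity(model, "the cat sat".split())
```

Here the sample probability is (3/9)(2/9)(1/9) = 2/243, so the perplexity is the cube root of 243/2. An unseen word drives the perplexity of an unsmoothed model to infinity, which is one way to see why the estimation problems discussed next matter.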
To estimate the parameters of an n-gram model, we estimate the parameters of the (n − 1)-gram model that it contains and then choose the order-n parameters so as to maximize $\Pr(t_n^T \mid t_1^{n-1})$. Thus, the order-n parameters are
$$\Pr(w_n \mid w_1^{n-1}) = \frac{C(w_1^{n-1} w_n)}{\sum_w C(w_1^{n-1} w)}. \tag{2}$$
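Equation (2) can be sketched in a few lines of Python. This is an illustrative sketch only: the helper names and the five-token text are hypothetical. Note that the denominator sums the counts of n-grams that begin with the history, so it can be smaller than the raw count of the history when the history also ends the text.

```python
from collections import Counter


def ngram_counts(tokens, n):
    """C(w_1 ... w_n) for every n-gram occurring in the token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def order_n_mle(tokens, n):
    """Order-n parameters: C(history + w) / sum over w' of C(history + w')."""
    counts = ngram_counts(tokens, n)
    history_totals = Counter()
    for gram, c in counts.items():
        history_totals[gram[:-1]] += c
    return {gram: c / history_totals[gram[:-1]] for gram, c in counts.items()}


# Hypothetical toy text: "a" occurs three times but starts only two 2-grams.
tokens = "a b a c a".split()
bigram = order_n_mle(tokens, 2)
```

In this toy text the estimate of Pr(b | a) is 1/2 rather than 1/3, because the final "a" has no successor and so does not enter the denominator.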
We call this method of parameter estimation sequential maximum likelihood estimation. We can think of the order-n parameters of an n-gram model as constituting the transition matrix of a Markov model the states of which are sequences of n − 1 words. Thus, the probability of a transition between the state $w_1 w_2 \cdots w_{n-1}$ and the state $w_2 w_3 \cdots w_n$ is $\Pr(w_n \mid w_1 w_2 \cdots w_{n-1})$. The steady-state distribution for this transition matrix assigns a probability to each (n − 1)-gram, which we denote $S(w_1^{n-1})$. We say that an n-gram language model is consistent if, for each string $w_1^{n-1}$, the probability that the model assigns to $w_1^{n-1}$ is $S(w_1^{n-1})$. Sequential maximum likelihood estimation does not, in general, lead to a consistent model, although for large values of T, the model will be very nearly consistent. Maximum likelihood estimation of the parameters of a consistent n-gram language model is an interesting topic, but is beyond the scope of this paper.
The vocabulary of English is very large and so, even for small values of n, the number of parameters in an n-gram model is enormous. The IBM Tangora speech recognition system has a vocabulary of about 20,000 words and employs a 3-gram language model with over eight trillion parameters (Averbuch et al. 1987). We can illustrate the problems attendant to parameter estimation for a 3-gram language model with the data in Table 1. Here, we show the number of 1-, 2-, and 3-grams appearing with various frequencies in a sample of 365,893,263 words of English text from a variety of sources. The vocabulary consists of the 260,740 different words plus a special
Table 1
Number of n-grams with various frequencies in 365,893,263 words of running text.

Count    1-grams    2-grams            3-grams
1         36,789     8,045,024         53,737,350
2         20,269     2,065,469          9,229,958
3         13,123       970,434          3,653,791
> 3      135,335     3,413,290          8,728,789
> 0      205,516    14,494,217         75,349,888
≥ 0      260,741    6.799 × 10^10      1.773 × 10^16
unknown word into which all other words are mapped. Of the $6.799 \times 10^{10}$ 2-grams that might have occurred in the data, only 14,494,217 actually did occur and of these, 8,045,024 occurred only once each. Similarly, of the $1.773 \times 10^{16}$ 3-grams that might have occurred, only 75,349,888 actually did occur and of these, 53,737,350 occurred only once each. From these data and Turing's formula (Good 1953), we can expect that maximum likelihood estimates will be 0 for 14.7 percent of the 3-grams and for 2.2 percent of the 2-grams in a new sample of English text. We can be confident that any 3-gram that does not appear in our sample is, in fact, rare, but there are so many of them that their aggregate probability is substantial.
As n increases, the accuracy of an n-gram model increases, but the reliability of our parameter estimates, drawn as they must be from a limited training text, decreases. Jelinek and Mercer (1980) describe a technique called interpolated estimation that combines the estimates of several language models so as to use the estimates of the more accurate models where they are reliable and, where they are unreliable, to fall back on the more reliable estimates of less accurate models. If $\Pr^{(j)}(w_i \mid w_1^{i-1})$ is the conditional probability as determined by the jth language model, then the interpolated estimate, $\hat{\Pr}(w_i \mid w_1^{i-1})$, is given by
$$\hat{\Pr}(w_i \mid w_1^{i-1}) = \sum_j \lambda_j(w_1^{i-1})\,\Pr^{(j)}(w_i \mid w_1^{i-1}). \tag{3}$$
Given values for $\Pr^{(j)}(\cdot)$, the $\lambda_j(w_1^{i-1})$ are chosen, with the help of the EM algorithm, so as to maximize the probability of some additional sample of text called the held-out data (Baum 1972; Dempster, Laird, and Rubin 1977; Jelinek and Mercer 1980). When we use interpolated estimation to combine the estimates from 1-, 2-, and 3-gram models, we choose the λs to depend on the history, $w_1^{i-1}$, only through the count of the 2-gram, $w_{i-2} w_{i-1}$. We expect that where the count of the 2-gram is high, the 3-gram estimates will be reliable, and, where the count is low, the estimates will be unreliable. We have constructed an interpolated 3-gram model in which we have divided the λs into 1,782 different sets according to the 2-gram counts. We estimated these λs from a held-out sample of 4,630,934 words. We measure the performance of our model on the Brown corpus, which contains a variety of English text and is not included in either our training or held-out data (Kučera and Francis 1967). The Brown corpus contains 1,014,312 words and has a perplexity of 244 with respect to our interpolated model.
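The role of the EM algorithm in choosing the interpolation weights of equation (3) can be sketched as follows. This is a simplified illustration, not the authors' implementation: it assumes a single weight vector shared by all histories (the paper uses 1,782 count-dependent sets), it assumes every held-out word receives non-zero probability from at least one component, and the held-out probabilities are invented for the example.

```python
import math


def heldout_log_likelihood(lambdas, component_probs):
    """Log probability of the held-out data under the interpolated model."""
    return sum(math.log(sum(l * p for l, p in zip(lambdas, probs)))
               for probs in component_probs)


def em_weights(component_probs, iterations=50):
    """Estimate mixture weights by EM.

    component_probs[i][j] is the probability that component model j assigns
    to the i-th held-out word given its history.
    """
    m = len(component_probs[0])
    lambdas = [1.0 / m] * m
    for _ in range(iterations):
        expected = [0.0] * m
        for probs in component_probs:
            mix = sum(l * p for l, p in zip(lambdas, probs))
            for j in range(m):
                # Posterior probability that component j produced this word.
                expected[j] += lambdas[j] * probs[j] / mix
        lambdas = [e / len(component_probs) for e in expected]
    return lambdas


# Hypothetical held-out probabilities from a 1-gram and a 2-gram model.
heldout = [(0.10, 0.50), (0.20, 0.00), (0.05, 0.40), (0.10, 0.60)]
uniform = [0.5, 0.5]
lambdas = em_weights(heldout)
```

Each EM iteration cannot decrease the held-out likelihood, so the fitted weights are at least as good as the uniform starting point on this data.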
3. Word Classes
Clearly, some words are similar to other words in their meaning and syntactic function. We would not be surprised to learn that the probability distribution of words in the vicinity of Thursday is very much like that for words in the vicinity of Friday. Of
course, they will not be identical: we rarely hear someone say Thank God it's Thursday! or worry about Thursday the 13th. If we can successfully assign words to classes, it may be possible to make more reasonable predictions for histories that we have not previously seen by assuming that they are similar to other histories that we have seen.
Suppose that we partition a vocabulary of V words into C classes using a function, $\pi$, which maps a word, $w_i$, into its class, $c_i$. We say that a language model is an n-gram class model if it is an n-gram language model and if, in addition, for $1 \le k \le n$, $\Pr(w_k \mid w_1^{k-1}) = \Pr(w_k \mid c_k)\,\Pr(c_k \mid c_1^{k-1})$. An n-gram class model has $C^n - 1 + V - C$ independent parameters: V − C of the form $\Pr(w_i \mid c_i)$, plus the $C^n - 1$ independent parameters of an n-gram language model for a vocabulary of size C. Thus, except in the trivial cases in which C = V or n = 1, an n-gram class language model always has fewer independent parameters than a general n-gram language model.
Given training text, $t_1^T$, the maximum likelihood estimates of the parameters of a 1-gram class model are
$$\Pr(w \mid c) = \frac{C(w)}{C(c)} \tag{4}$$
and
$$\Pr(c) = \frac{C(c)}{T}, \tag{5}$$
where by C(c) we mean the number of words in $t_1^T$ for which the class is c. From these equations, we see that, since $c = \pi(w)$, $\Pr(w) = \Pr(w \mid c)\,\Pr(c) = C(w)/T$. For a 1-gram class model, the choice of the mapping $\pi$ has no effect. For a 2-gram class model, the sequential maximum likelihood estimates of the order-2 parameters maximize $\Pr(t_2^T \mid t_1)$ or, equivalently, $\log \Pr(t_2^T \mid t_1)$ and are given by
$$\Pr(c_2 \mid c_1) = \frac{C(c_1 c_2)}{\sum_c C(c_1 c)}. \tag{6}$$
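A minimal Python sketch of a 2-gram class model built from equations (4) and (6) is given below. The function name, the toy text, and the word-to-class map are all hypothetical; the sketch assumes every class of a history word is followed by at least one word in the training text.

```python
from collections import Counter


def class_bigram_model(tokens, word_class):
    """Return prob(w2, w1) = Pr(w2 | c2) * Pr(c2 | c1) under ML estimates."""
    classes = [word_class[w] for w in tokens]
    word_count = Counter(tokens)
    class_count = Counter(classes)
    pair_count = Counter(zip(classes, classes[1:]))
    left_total = Counter()
    for (c1, _), n in pair_count.items():
        left_total[c1] += n

    def prob(w2, w1):
        c1, c2 = word_class[w1], word_class[w2]
        emit = word_count[w2] / class_count[c2]        # equation (4)
        trans = pair_count[(c1, c2)] / left_total[c1]  # equation (6)
        return emit * trans

    return prob


# Hypothetical toy text and classes: the two day names share the class DAY.
tokens = "on Monday we met by Friday we left on Monday".split()
word_class = {w: w for w in tokens}
word_class.update({"Monday": "DAY", "Friday": "DAY"})
prob = class_bigram_model(tokens, word_class)
p_unseen = prob("Friday", "on")
```

The word pair "on Friday" never occurs in the toy text, so a word 2-gram model would assign it probability 0, whereas the class model assigns it 1/3 because "on" is always followed by a word of class DAY.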
By definition, $\Pr(c_1 c_2) = \Pr(c_1)\,\Pr(c_2 \mid c_1)$, and so, for sequential maximum likelihood estimation, we have
$$\Pr(c_1 c_2) = \frac{C(c_1 c_2)}{T} \times \frac{C(c_1)}{\sum_c C(c_1 c)}. \tag{7}$$
Since $C(c_1)$ and $\sum_c C(c_1 c)$ are the numbers of words for which the class is $c_1$ in the strings $t_1^T$ and $t_1^{T-1}$ respectively, the final term in this equation tends to 1 as T tends to infinity. Thus, $\Pr(c_1 c_2)$ tends to the relative frequency of $c_1 c_2$ as consecutive classes in the training text. Let $L(\pi) = (T - 1)^{-1} \log \Pr(t_2^T \mid t_1)$. Then
$$\begin{aligned}
L(\pi) &= \sum_{w_1 w_2} \frac{C(w_1 w_2)}{T - 1} \log \Pr(c_2 \mid c_1)\,\Pr(w_2 \mid c_2) \\
&= \sum_{c_1 c_2} \frac{C(c_1 c_2)}{T - 1} \log \frac{\Pr(c_2 \mid c_1)}{\Pr(c_2)} + \sum_{w_2} \frac{\sum_w C(w w_2)}{T - 1} \log \Pr(w_2 \mid c_2)\,\Pr(c_2).
\end{aligned} \tag{8}$$
Therefore, since $\sum_w C(w w_2)/(T - 1)$ tends to the relative frequency of $w_2$ in the training text, and hence to $\Pr(w_2)$, we must have, in the limit,
$$\begin{aligned}
L(\pi) &= \sum_w \Pr(w) \log \Pr(w) + \sum_{c_1 c_2} \Pr(c_1 c_2) \log \frac{\Pr(c_2 \mid c_1)}{\Pr(c_2)} \\
&= -H(w) + I(c_1, c_2),
\end{aligned} \tag{9}$$
where H(w) is the entropy of the 1-gram word distribution and $I(c_1, c_2)$ is the average mutual information of adjacent classes. Because $L(\pi)$ depends on $\pi$ only through this average mutual information, the partition that maximizes $L(\pi)$ is, in the limit, also the partition that maximizes the average mutual information of adjacent classes.
We know of no practical method for finding one of the partitions that maximize the average mutual information. Indeed, given such a partition, we know of no practical method for demonstrating that it does, in fact, maximize the average mutual information. We have, however, obtained interesting results using a greedy algorithm. Initially, we assign each word to a distinct class and compute the average mutual information between adjacent classes. We then merge that pair of classes for which the loss in average mutual information is least. After V − C of these merges, C classes remain. Often, we find that for classes obtained in this way the average mutual information can be made larger by moving some words from one class to another. Therefore, after having derived a set of classes from successive merges, we cycle through the vocabulary moving each word to the class for which the resulting partition has the greatest average mutual information. Eventually no potential reassignment of a word leads to a partition with greater average mutual information. At this point, we stop. It may be possible to find a partition with higher average mutual information by simultaneously reassigning two or more words, but we regard such a search as too costly to be feasible.
To make even this suboptimal algorithm practical one must exercise a certain care in implementation. There are approximately $(V - i)^2/2$ merges that we must investigate to carry out the ith step. The average mutual information remaining after any one of them is the sum of $(V - i)^2$ terms, each of which involves a logarithm.
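The straightforward version of the greedy merging step can be sketched in Python as follows. This is a deliberately naive illustration that recomputes the average mutual information from scratch for every candidate merge; it omits the word-reassignment pass, and the function names and toy text are assumptions made for the example.

```python
import math
from collections import Counter
from itertools import combinations


def average_mutual_information(pair_counts, assignment):
    """I(c1, c2) in bits for adjacent classes, from word 2-gram counts."""
    joint, left, right = Counter(), Counter(), Counter()
    total = sum(pair_counts.values())
    for (w1, w2), n in pair_counts.items():
        c1, c2 = assignment[w1], assignment[w2]
        joint[(c1, c2)] += n
        left[c1] += n
        right[c2] += n
    return sum((n / total) * math.log2(n * total / (left[c1] * right[c2]))
               for (c1, c2), n in joint.items())


def greedy_merge(tokens, num_classes):
    """Merge classes pairwise, each time keeping the merge that loses least."""
    pair_counts = Counter(zip(tokens, tokens[1:]))
    assignment = {w: w for w in set(tokens)}  # one class per word initially
    while len(set(assignment.values())) > num_classes:
        best = None
        for a, b in combinations(sorted(set(assignment.values())), 2):
            trial = {w: (a if c == b else c) for w, c in assignment.items()}
            score = average_mutual_information(pair_counts, trial)
            # Greatest remaining information is the same as least loss.
            if best is None or score > best[0]:
                best = (score, trial)
        assignment = best[1]
    return assignment


# Hypothetical toy text in which two pairs of words share their contexts.
tokens = ("on Monday we met on Friday we met "
          "on Monday we left on Friday we left").split()
classes = greedy_merge(tokens, 4)
```

In the toy text, Monday and Friday have identical neighbors, as do met and left, so merging each pair loses no average mutual information and the four resulting classes group them together. The cost of this naive organization grows very quickly with the vocabulary size.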
Since altogether we must make V − C merges, this straightforward approach to the computation is of order $V^5$. We cannot seriously contemplate such a calculation except for very small values of V. A more frugal organization of the computation must take advantage of the redundancy in this straightforward calculation. As we shall see, we can make the computation of the average mutual information remaining after a merge in constant time, independent of V.
Suppose that we have already made V − k merges, resulting in classes $C_k(1), C_k(2), \ldots, C_k(k)$ and that we now wish to investigate the merge of $C_k(i)$ with $C_k(j)$ for $1 \le i < j \le k$. Let $p_k(l, m) = \Pr(C_k(l), C_k(m))$, i.e., the probability that a word in class $C_k(m)$ follows a word in class $C_k(l)$. Let
$$\mathit{pl}_k(l) = \sum_m p_k(l, m), \tag{10}$$
let
$$\mathit{pr}_k(m) = \sum_l p_k(l, m), \tag{11}$$
and let
$$q_k(l, m) = p_k(l, m) \log \frac{p_k(l, m)}{\mathit{pl}_k(l)\,\mathit{pr}_k(m)}. \tag{12}$$
The average mutual information remaining after V − k merges is
$$I_k = \sum_{l, m} q_k(l, m). \tag{13}$$
We use the notation i + j to represent the cluster obtained by merging $C_k(i)$ and $C_k(j)$.
Thus, for example, $p_k(i + j, m) = p_k(i, m) + p_k(j, m)$ and
$$q_k(i + j, m) = p_k(i + j, m) \log \frac{p_k(i + j, m)}{\mathit{pl}_k(i + j)\,\mathit{pr}_k(m)}. \tag{14}$$
The average mutual information remaining after we merge $C_k(i)$ and $C_k(j)$ is then
$$\begin{aligned}
I_k(i, j) = {} & I_k - s_k(i) - s_k(j) + q_k(i, j) + q_k(j, i) + q_k(i + j, i + j) \\
& + \sum_{l \ne i, j} q_k(l, i + j) + \sum_{m \ne i, j} q_k(i + j, m),
\end{aligned} \tag{15}$$
where
$$s_k(i) = \sum_l q_k(l, i) + \sum_m q_k(i, m) - q_k(i, i). \tag{16}$$
If we know $I_k$, $s_k(i)$, and $s_k(j)$, then the majority of the time involved in computing $I_k(i, j)$ is devoted to computing the sums on the second line of equation (15). Each of these sums has approximately V − k terms and so we have reduced the problem of evaluating $I_k(i, j)$ from one of order $V^2$ to one of order V. We can improve this further by keeping track of those pairs l, m for which $p_k(l, m)$ is different from 0. We recall from Table 1, for example, that of the $6.799 \times 10^{10}$ 2-grams that might have occurred in the training data, only 14,494,217 actually did occur. Thus, in this case, the sums required in equation (15) have, on average, only about 56 non-zero terms instead of 260,741, as we might expect from the size of the vocabulary.
By examining all pairs, we can find that pair, i < j, for which the loss in average mutual information, $L_k(i, j) \equiv I_k - I_k(i, j)$, is least. We complete the step by merging $C_k(i)$ and $C_k(j)$ to form a new cluster $C_{k-1}(i)$. If $j \ne k$, we rename $C_k(k)$ as $C_{k-1}(j)$ and for $l \ne i, j$, we set $C_{k-1}(l)$ to $C_k(l)$. Obviously, $I_{k-1} = I_k(i, j)$. The values of $p_{k-1}$, $\mathit{pl}_{k-1}$, $\mathit{pr}_{k-1}$, and $q_{k-1}$ can be obtained easily from $p_k$, $\mathit{pl}_k$, $\mathit{pr}_k$, and $q_k$. If l and m both denote indices neither of which is equal to either i or j, then it is easy to establish that
$$\begin{aligned}
s_{k-1}(l) &= s_k(l) - q_k(l, i) - q_k(i, l) - q_k(l, j) - q_k(j, l) + q_{k-1}(l, i) + q_{k-1}(i, l) \\
s_{k-1}(j) &= s_k(k) - q_k(k, i) - q_k(i, k) - q_k(k, j) - q_k(j, k) + q_{k-1}(j, i) + q_{k-1}(i, j) \\
L_{k-1}(l, m) &= L_k(l, m) - q_k(l + m, i) - q_k(i, l + m) - q_k(l + m, j) - q_k(j, l + m) \\
&\quad + q_{k-1}(l + m, i) + q_{k-1}(i, l + m) \\
L_{k-1}(l, j) &= L_k(l, k) - q_k(l + k, i) - q_k(i, l + k) - q_k(l + k, j) - q_k(j, l + k) \\
&\quad + q_{k-1}(l + j, i) + q_{k-1}(i, l + j) \\
L_{k-1}(j, l) &= L_{k-1}(l, j)
\end{aligned} \tag{17}$$
Finally, we must evaluate $s_{k-1}(i)$ and $L_{k-1}(l, i)$ from equations 15 and 16. Thus, the entire update process requires something on the order of $V^2$ computations in the course of which we will determine the next pair of clusters to merge. The algorithm, then, is of order $V^3$.
Although we have described this algorithm as one for finding clusters, we actually determine much more. If we continue the algorithm for V − 1 merges, then we will have a single cluster which, of course, will be the entire vocabulary. The order in which clusters are merged, however, determines a binary tree the root of which corresponds
Figure 2 Sample subtrees from a 1,000-word mutual information tree. [Tree diagram; the leaves of the sample subtrees are: plan, letter, request, memo, case, question, charge, statement, draft; day, year, week, month, quarter, half; evaluation, assessment, analysis, understanding, opinion, conversation, discussion; reps, representatives, representative, rep; accounts, people, customers, individuals, employees, students.]
to this single cluster and the leaves of which correspond to the words in the vocabulary. Intermediate nodes of the tree correspond to groupings of words intermediate between single words and the entire vocabulary. Words that are statistically similar with respect to their immediate neighbors in running text will be close together in the tree. We have applied this tree-building algorithm to vocabularies of up to 5,000 words. Figure 2 shows some of the substructures in a tree constructed in this manner for the 1,000 most frequent words in a collection of office correspondence.
Beyond 5,000 words this algorithm also fails of practicality. To obtain clusters for larger vocabularies, we proceed as follows. We arrange the words in the vocabulary in order of frequency with the most frequent words first and assign each of the first C words to its own, distinct class. At the first step of the algorithm, we assign the (C + 1)st most probable word to a new class and merge that pair among the resulting C + 1 classes for which the loss in average mutual information is least. At the kth step of the algorithm, we assign the (C + k)th most probable word to a new class. This restores the number of classes to C + 1, and we again merge that pair for which the loss in average mutual information is least. After V − C steps, each of the words in the vocabulary will have been assigned to one of C classes.
We have used this algorithm to divide the 260,741-word vocabulary of Table 1 into 1,000 classes. Table 2 contains examples of classes that we find particularly interesting. Table 3 contains examples that were selected at random. Each of the lines in the tables contains members of a different class. The average class has 260 words and so to make the table manageable, we include only words that occur at least ten times and
Table 2
Classes from a 260,741-word vocabulary.

Friday Monday Thursday Wednesday Tuesday Saturday Sunday weekends Sundays Saturdays
June March July April January December October November September August
people guys folks fellows CEOs chaps doubters commies unfortunates blokes
down backwards ashore sideways southward northward overboard aloft downwards adrift
water gas coal liquid acid sand carbon steam shale iron
great big vast sudden mere sheer gigantic lifelong scant colossal
man woman boy girl lawyer doctor guy farmer teacher citizen
American Indian European Japanese German African Catholic Israeli Italian Arab
pressure temperature permeability density porosity stress velocity viscosity gravity tension
mother wife father son husband brother daughter sister boss uncle
machine device controller processor CPU printer spindle subsystem compiler plotter
John George James Bob Robert Paul William Jim David Mike
anyone someone anybody somebody
feet miles pounds degrees inches barrels tons acres meters bytes
director chief professor commissioner commander treasurer founder superintendent dean custodian
liberal conservative parliamentary royal progressive Tory provisional separatist federalist PQ
had hadn't hath would've could've should've must've might've
asking telling wondering instructing informing kidding reminding bothering thanking deposing
that tha theat
head body hands eyes voice arm seat eye hair mouth
we include no more than the ten most frequent words of any class (the other two months would appear with the class of months if we extended this limit to twelve). The degree to which the classes capture both syntactic and semantic aspects of English is quite surprising given that they were constructed from nothing more than counts of bigrams. The class {that tha theat} is interesting because although tha and theat are not English words, the computer has discovered that in our data each of them is most often a mistyped that. Table 4 shows the number of class 1-, 2-, and 3-grams occurring in the text with various frequencies. We can expect from these data that maximum likelihood estimates will assign a probability of 0 to about 3.8 percent of the class 3-grams and to about .02 percent of the class 2-grams in a new sample of English text. This is a substantial improvement over the corresponding numbers for a 3-gram language model, which are 14.7 percent for word 3-grams and 2.2 percent for word 2-grams, but we have achieved this at the expense of precision in the model. With a class model, we distinguish between two different words of the same class only according to their relative frequencies in the text as a whole. Looking at the classes in Tables 2 and 3, we feel that
Table 3
Randomly selected word classes.

little prima moment's trifle tad Litle minute's tinker's hornet's teammate's
6
ask remind instruct urge interrupt invite congratulate commend warn applaud
object apologize apologise avow whish
cost expense risk profitability deferral earmarks capstone cardinality mintage reseller
B dept. AA Whitey CL pi Namerow PA Mgr. LaRose
# Rel rel. #S Shree
S Gens nai Matsuzawa ow Kageyama Nishida Sumit Zollner Mallik
research training education science advertising arts medicine machinery Art AIDS
rise focus depend rely concentrate dwell capitalize embark intrude typewriting
Minister mover Sydneys Minster Miniter
3
running moving playing setting holding carrying passing cutting driving fighting
court judge jury slam Edelstein magistrate marshal Abella Scalia larceny
annual regular monthly daily weekly quarterly periodic Good yearly convertible
aware unaware unsure cognizant apprised mindful partakers
force ethic stoppage force's conditioner stoppages conditioners waybill forwarder Atonabee
systems magnetics loggers products' coupler Econ databanks Centre inscriber correctors
industry producers makers fishery Arabia growers addiction medalist inhalation addict
brought moved opened picked caught tied gathered cleared hung lifted
Table 4
Number of class n-grams with various frequencies in 365,893,263 words of running text.

Count    1-grams    2-grams      3-grams
1              0       81,171    13,873,192
2              0       57,056     4,109,998
3              0       43,752     2,012,394
> 3        1,000      658,564     6,917,746
> 0        1,000      840,543    26,913,330
≥ 0        1,000    1,000,000    1.000 × 10^9
this is reasonable for pairs like John and George or liberal and conservative but perhaps less so for pairs like little and prima or Minister and mover.
We used these classes to construct an interpolated 3-gram class model using the same training text and held-out data as we used for the word-based language model we discussed above. We measured the perplexity of the Brown corpus with respect to this model and found it to be 271. We then interpolated the class-based estimators with the word-based estimators and found the perplexity of the test data to be 236, which is a small improvement over the perplexity of 244 we obtained with the word-based model.
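The zero-probability percentages quoted for word and class n-grams can be checked against the count-1 rows of Tables 1 and 4. The sketch below assumes that the relevant form of Turing's formula is the usual estimate of unseen probability mass, the number of events seen exactly once divided by the sample size; the paper does not spell out the calculation, so this reading is an assumption, and the function name is illustrative.

```python
def unseen_mass(singletons, sample_size):
    """Turing's estimate of the total probability of unseen events: n_1 / N."""
    return singletons / sample_size


T = 365_893_263  # words of running text behind Tables 1 and 4

# Count-1 entries taken from Table 1 (words) and Table 4 (classes).
word_3gram = 100 * unseen_mass(53_737_350, T)
word_2gram = 100 * unseen_mass(8_045_024, T)
class_3gram = 100 * unseen_mass(13_873_192, T)
class_2gram = 100 * unseen_mass(81_171, T)
```

Under this assumption the four values round to 14.7, 2.2, 3.8, and 0.02 percent, which agrees with the figures given in the text.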
4. Sticky Pairs and Semantic Classes
In the previous section, we discussed some methods for grouping words together according to the statistical similarity of their surroundings. Here, we discuss two additional types of relations between words that can be discovered by examining various co-occurrence statistics. The mutual information of the pair $w_1$ and $w_2$ as adjacent words is
$$\log \frac{\Pr(w_1 w_2)}{\Pr(w_1)\,\Pr(w_2)}. \tag{18}$$
If $w_2$ follows $w_1$ less often than we would expect on the basis of their independent frequencies, then the mutual information is negative. If $w_2$ follows $w_1$ more often than we would expect, then the mutual information is positive. We say that the pair $w_1 w_2$ is sticky if the mutual information for the pair is substantially greater than 0. In Table 5, we list the 25 stickiest pairs of words found in a 59,537,595-word sample of text from the Canadian parliament. The mutual information for each pair is given in bits, which corresponds to using 2 as the base of the logarithm in equation 18. Most of the pairs are proper names such as Pontius Pilate or foreign phrases that have been adopted into English such as mutatis mutandis and avant garde. The mutual information for Humpty Dumpty, 22.5 bits, means that the pair occurs roughly 6,000,000 times more than one would expect from the individual frequencies of Humpty and Dumpty. Notice that the property of being a sticky pair is not symmetric and so, while Humpty Dumpty forms a sticky pair, Dumpty Humpty does not.
Table 5
Sticky word pairs.

Word pair              Mutual Information
Humpty Dumpty          22.5
Klux Klan              22.2
Ku Klux                22.2
Chah Nulth             22.2
Lao Bao                22.2
Nuu Chah               22.1
Tse Tung               22.1
avant garde            22.1
Carena Bancorp         22.0
gizzard shad           22.0
Bobby Orr              22.0
Warnock Hersey         22.0
mutatis mutandis       21.9
Taj Mahal              21.8
Pontius Pilate         21.7
ammonium nitrate       21.7
jiggery pokery         21.6
Pitney Bowes           21.6
Lubor Zink             21.5
anciens combattants    21.5
Abu Dhabi              21.4
Aldo Moro              21.4
fuddle duddle          21.4
helter skelter         21.4
mumbo jumbo            21.4
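The mutual information of equation 18 and its reading as an "excess factor" can be sketched in Python. The function names are illustrative, and the counts used below are hypothetical: the paper reports only the resulting mutual information values, not the underlying counts for any pair.

```python
import math


def pair_mutual_information(count_pair, count_w1, count_w2, total):
    """log2 of Pr(w1 w2) / (Pr(w1) Pr(w2)), in bits, from raw counts."""
    return math.log2(count_pair * total / (count_w1 * count_w2))


def excess_factor(bits):
    """How many times more often the pair occurs than independence predicts."""
    return 2.0 ** bits


T = 59_537_595  # size of the Canadian parliament sample

# Hypothetical counts: two words that each occur 10 times, always together.
sticky = pair_mutual_information(10, 10, 10, T)
# Hypothetical counts: a pair seen less often than independence predicts.
repelled = pair_mutual_information(1, 1000, 1000, 100_000)
factor = excess_factor(22.5)
```

With these assumed counts the first pair scores about 22.5 bits, and 2 raised to the power 22.5 is a little under six million, consistent with the "roughly 6,000,000 times" reading given for Humpty Dumpty. The second pair has negative mutual information.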
Table 6
Semantic clusters.

we our us ourselves ours
question questions asking answer answers answering
performance performed perform performs performing
tie jacket suit
write writes writing written wrote pen
morning noon evening night nights midnight bed
attorney counsel trial court judge
problems problem solution solve analyzed solved solving
letter addressed enclosed letters correspondence
large size small larger smaller
operations operations operating operate operated
school classroom teaching grade math
street block avenue corner blocks
table tables dining chairs plate
published publication author publish writer titled
wall ceiling walls enclosure roof
sell buy selling buying sold
Instead of seeking pairs of words that occur next to one another more than we would expect, we can seek pairs of words that simply occur near one another more than we would expect. We avoid finding sticky pairs again by not considering pairs of words that occur too close to one another. To be precise, let $\Pr_{near}(w_1 w_2)$ be the probability that a word chosen at random from the text is $w_1$ and that a second word, chosen at random from a window of 1,001 words centered on $w_1$ but excluding the words in a window of 5 centered on $w_1$, is $w_2$. We say that $w_1$ and $w_2$ are semantically sticky if $\Pr_{near}(w_1 w_2)$ is much larger than $\Pr(w_1)\,\Pr(w_2)$. Unlike stickiness, semantic stickiness is symmetric so that if $w_1$ sticks semantically to $w_2$, then $w_2$ sticks semantically to $w_1$.
In Table 6, we show some interesting classes that we constructed, using $\Pr_{near}(w_1 w_2)$, in a manner similar to that described in the preceding section. Some classes group together words having the same morphological stem, such as performance, performed, perform, performs, and performing. Other classes contain words that are semantically related but have different stems, such as attorney, counsel, trial, court, and judge.
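The windowed counting behind $\Pr_{near}$ can be sketched in Python as follows. This is an approximation made for illustration: it normalizes by the total number of near pairs, which ignores the shorter windows at the ends of the text, and the function and parameter names are assumptions. The defaults correspond to a window of 1,001 words (500 on each side) excluding a window of 5 (2 on each side).

```python
from collections import Counter


def near_pair_counts(tokens, outer_half=500, inner_half=2):
    """Count ordered pairs (w1, w2) with inner_half < distance <= outer_half."""
    counts = Counter()
    for i, w1 in enumerate(tokens):
        lo = max(0, i - outer_half)
        hi = min(len(tokens), i + outer_half + 1)
        for j in range(lo, hi):
            if abs(i - j) > inner_half:
                counts[(w1, tokens[j])] += 1
    return counts


def semantic_stickiness(tokens, w1, w2, outer_half=500, inner_half=2):
    """Ratio of the near-pair probability to Pr(w1) * Pr(w2)."""
    counts = near_pair_counts(tokens, outer_half, inner_half)
    unigram = Counter(tokens)
    p_near = counts[(w1, w2)] / sum(counts.values())
    p_indep = (unigram[w1] / len(tokens)) * (unigram[w2] / len(tokens))
    return p_near / p_indep


# Hypothetical toy text with small windows so the counts are easy to follow.
tokens = "a b c d e f g".split()
counts = near_pair_counts(tokens, outer_half=3, inner_half=1)
```

Because every qualifying pair of positions is counted once in each order, the counts are symmetric, which mirrors the symmetry of semantic stickiness noted above.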
5. Discussion
We have described several methods here that we feel clearly demonstrate the value of simple statistical techniques as allies in the struggle to tease from words their linguistic secrets. However, we have not as yet demonstrated the full value of the secrets thus gleaned. At the expense of a slightly greater perplexity, the 3-gram model with word classes requires only about one-third as much storage as the 3-gram language model in which each word is treated as a unique individual (see Tables 1 and 4). Even when we combine the two models, we are not able to achieve much improvement in the perplexity. Nonetheless, we are confident that we will eventually be able to make significant improvements to 3-gram language models with the help of classes of the kind that we have described here.
Acknowledgment
The authors would like to thank John Lafferty for his assistance in constructing word classes described in this paper.
References
Averbuch, A.; Bahl, L.; Bakis, R.; Brown, P.; Cole, A.; Daggett, G.; Das, S.; Davies, K.; Gennaro, S. De.; de Souza, P.; Epstein, E.; Fraleigh, D.; Jelinek, F.; Moorhead, J.; Lewis, B.; Mercer, R.; Nadas, A.; Nahamoo, D.; Picheny, M.; Shichman, G.; Spinelli, P.; Van Compernolle, D.; and Wilkens, H. (1987). "Experiments with the Tangora 20,000 word speech recognizer." In Proceedings, IEEE International Conference on Acoustics, Speech and Signal Processing. Dallas, Texas, 701-704.
Bahl, L. R.; Jelinek, F.; and Mercer, R. L. (1983). "A maximum likelihood approach to continuous speech recognition." IEEE Transactions on Pattern Analysis and Machine Intelligence, PAMI-5(2), 179-190.
Baum, L. (1972). "An inequality and associated maximization technique in statistical estimation of probabilistic functions of a Markov process." Inequalities, 3, 1-8.
Brown, P. F.; Cocke, J.; DellaPietra, S. A.; DellaPietra, V. J.; Jelinek, F.; Lafferty, J. D.; Mercer, R. L.; and Roossin, P. S. (1990). "A statistical approach to machine translation." Computational Linguistics, 16(2), 79-85.
Dempster, A.; Laird, N.; and Rubin, D. (1977). "Maximum likelihood from incomplete data via the EM algorithm." Journal of the Royal Statistical Society, 39(B), 1-38.
Feller, W. (1950). An Introduction to Probability Theory and its Applications, Volume I. John Wiley & Sons, Inc.
Gallagher, R. G. (1968). Information Theory and Reliable Communication. John Wiley & Sons, Inc.
Good, I. (1953). "The population frequencies of species and the estimation of population parameters." Biometrika, 40(3-4), 237-264.
Jelinek, F., and Mercer, R. L. (1980). "Interpolated estimation of Markov source parameters from sparse data." In Proceedings, Workshop on Pattern Recognition in Practice, Amsterdam, The Netherlands, 381-397.
Kučera, H., and Francis, W. (1967). Computational Analysis of Present Day American English. Brown University Press.
Mays, E.; Damerau, F. J.; and Mercer, R. L. (1990). "Context-based spelling correction." In Proceedings, IBM Natural Language ITL. Paris, France, 517-522.