JurafskyNGrams

4.3.2 Unknown Words: Open Versus Closed Vocabulary Tasks

Sometimes we have a language task in which we know all the words that can occur, and hence we know the vocabulary size V in advance. The closed vocabulary assumption is the assumption that we have such a lexicon and that the test set can only contain words from this lexicon. The closed vocabulary task thus assumes there are no unknown words. But of course this is a simplification; as we suggested earlier, the number of unseen words grows constantly, so we can't possibly know in advance exactly how many there are, and we'd like our model to do something reasonable with them. We call these unseen events unknown words, or out of vocabulary (OOV) words. The percentage of OOV words that appear in the test set is called the OOV rate.

An open vocabulary system is one in which we model these potential unknown words in the test set by adding a pseudo-word called <UNK>. We can train the probabilities of the unknown word model <UNK> as follows:

1. Choose a vocabulary (word list) that is fixed in advance.
2. Convert in the training set any word that is not in this set (any OOV word) to the unknown word token <UNK> in a text normalization step.
3. Estimate the probabilities for <UNK> from its counts just like any other regular word in the training set.

An alternative that doesn't require choosing a vocabulary is to replace the first occurrence of every word type in the training data by <UNK>.
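To make the fixed-vocabulary procedure concrete, here is a minimal Python sketch of steps 1-3. The toy corpus and vocabulary are invented for illustration; a real system would choose the vocabulary and normalization pipeline to suit its task.

    from collections import Counter

    UNK = "<UNK>"

    def replace_oov(tokenized_sentences, vocabulary):
        """Step 2: map any word outside the fixed vocabulary to <UNK>."""
        return [[w if w in vocabulary else UNK for w in sent]
                for sent in tokenized_sentences]

    # Step 1: a toy training set and a vocabulary fixed in advance (both invented).
    train = [["i", "want", "chinese", "food"],
             ["i", "want", "to", "eat", "sushi"]]
    vocab = {"i", "want", "to", "eat", "chinese", "food"}   # "sushi" is OOV

    normalized = replace_oov(train, vocab)
    # Step 3: estimate <UNK> probabilities from its counts like any other word.
    unigram_counts = Counter(w for sent in normalized for w in sent)
    print(unigram_counts[UNK] / sum(unigram_counts.values()))   # MLE unigram P(<UNK>)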
4.4 Evaluating N-Grams: Perplexity

The best way to evaluate the performance of a language model is to embed it in an application and measure the total performance of the application. Such end-to-end evaluation is called extrinsic evaluation, and is also sometimes called in vivo evaluation (Sparck Jones and Galliers, 1996). Extrinsic evaluation is the only way to know whether a particular improvement in a component is really going to help the task at hand. Thus, for speech recognition, we can compare the performance of two language models by running the speech recognizer twice, once with each language model, and seeing which gives the more accurate transcription.

Unfortunately, end-to-end evaluation is often very expensive; evaluating a large speech recognition test set, for example, takes hours or even days. Thus, we would like a metric that can be used to quickly evaluate potential improvements in a language model. An intrinsic evaluation metric is one that measures the quality of a model independent of any application. Perplexity is the most common intrinsic evaluation metric for N-gram language models. While an (intrinsic) improvement in perplexity does not guarantee an (extrinsic) improvement in speech recognition performance (or any other end-to-end metric), it often correlates with such improvements. Thus, it is commonly used as a quick check on an algorithm, and an improvement in perplexity can then be confirmed by an end-to-end evaluation.

The intuition of perplexity is that given two probabilistic models, the better model is the one that has a tighter fit to the test data, that is, the one that better predicts the details of the test data. We can measure better prediction by looking at the probability the model assigns to the test data; the better model will assign a higher probability to the test data.

More formally, the perplexity (sometimes abbreviated PP) of a language model on a test set is a function of the probability that the language model assigns to that test set. For a test set W = w_1 w_2 ... w_N, the perplexity is the probability of the test set, normalized by the number of words:

    PP(W) = P(w_1 w_2 \ldots w_N)^{-1/N} = \sqrt[N]{\frac{1}{P(w_1 w_2 \ldots w_N)}}    (4.16)

We can use the chain rule to expand the probability of W:

    PP(W) = \sqrt[N]{\prod_{i=1}^{N} \frac{1}{P(w_i \mid w_1 \ldots w_{i-1})}}    (4.17)

Thus, if we are computing the perplexity of W with a bigram language model, we get:

    PP(W) = \sqrt[N]{\prod_{i=1}^{N} \frac{1}{P(w_i \mid w_{i-1})}}    (4.18)

Note that because of the inverse in Eq. 4.17, the higher the conditional probability of the word sequence, the lower the perplexity. Thus, minimizing perplexity is equivalent to maximizing the test set probability according to the language model. What we generally use for the word sequence in Eq. 4.17 or Eq. 4.18 is the entire sequence of words in some test set. Since this sequence will cross many sentence boundaries, we need to include the begin- and end-sentence markers <s> and </s> in the probability computation. We also need to include the end-of-sentence marker </s> (but not the beginning-of-sentence marker <s>) in the total count of word tokens N.

There is another way to think about perplexity: as the weighted average branching factor of a language. The branching factor of a language is the number of possible next words that can follow any word. Consider the task of recognizing the digits in English (zero, one, two, ..., nine), given that each of the 10 digits occurs with equal probability P = 1/10. The perplexity of this mini-language is in fact 10. To see that, imagine a string of digits of length N. By Eq. 4.17, the perplexity will be

    PP(W) = P(w_1 w_2 \ldots w_N)^{-1/N} = \left( \left(\frac{1}{10}\right)^{N} \right)^{-1/N} = \left(\frac{1}{10}\right)^{-1} = 10    (4.19)

But suppose that the number zero is really frequent and occurs 10 times more often than the other numbers. Now we should expect the perplexity to be lower, since most of the time the next number will be zero. Thus, although the branching factor is still 10, the perplexity or weighted branching factor is smaller. We leave this calculation as an exercise for the reader.

We see in Section 4.10 that perplexity is also closely related to the information-theoretic notion of entropy.
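As an illustration of Eqs. 4.17-4.18, the following sketch computes the perplexity of a test set from a supplied conditional-probability function, counting </s> but not <s> in N as described above. The digit mini-language at the end mirrors the toy example in the text; the probability function is an assumption that simply assigns 1/10 to every next symbol.

    import math

    def perplexity(test_sentences, prob):
        """Perplexity per Eq. 4.17/4.18.  prob(prev, w) returns P(w | prev).
        <s> and </s> enter the probability computation, but only </s>
        (not <s>) is counted in the token total N, as described above."""
        log_prob, N = 0.0, 0
        for sent in test_sentences:
            words = ["<s>"] + sent + ["</s>"]
            for prev, w in zip(words, words[1:]):
                log_prob += math.log(prob(prev, w))  # assumes non-zero (smoothed) probs
            N += len(sent) + 1                       # the sentence words plus </s>
        return math.exp(-log_prob / N)

    # Toy digit mini-language: every next symbol (here including </s>) gets
    # probability 1/10, so the perplexity comes out to 10, as in the text.
    print(perplexity([["3", "1", "4", "1", "5", "9"]], lambda prev, w: 0.1))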
Finally, let's look at an example of how perplexity can be used to compare different N-gram models. We trained unigram, bigram, and trigram grammars on 38 million words (including start-of-sentence tokens) from the Wall Street Journal, using a 19,979 word vocabulary.[6] We then computed the perplexity of each of these models on a test set of 1.5 million words with Eq. 4.18. The table below shows the perplexity of a 1.5 million word WSJ test set according to each of these grammars.

                  Unigram   Bigram   Trigram
    Perplexity    962       170      109

As we see above, the more information the N-gram gives us about the word sequence, the lower the perplexity (since, as Eq. 4.17 showed, perplexity is related inversely to the likelihood of the test sequence according to the model).

Note that in computing perplexities, the N-gram model P must be constructed without any knowledge of the test set. Any kind of knowledge of the test set can cause the perplexity to be artificially low. For example, we defined above the closed vocabulary task, in which the vocabulary for the test set is specified in advance. This can greatly reduce the perplexity. As long as this knowledge is provided equally to each of the models we are comparing, the closed vocabulary perplexity can still be useful for comparing models, but care must be taken in interpreting the results. In general, the perplexity of two language models is only comparable if they use the same vocabulary.

[6] Katz backoff grammars with Good-Turing discounting trained on 38 million words from the WSJ0 corpus (LDC, 1993), open-vocabulary, using the <UNK> token; see later sections for definitions.

4.5 Smoothing

    Never do I ever want / to hear another word!
    There isn't one, / I haven't heard!
        Eliza Doolittle in Alan Jay Lerner's My Fair Lady

There is a major problem with the maximum likelihood estimation process we have seen for training the parameters of an N-gram model. This is the problem of sparse data caused by the fact that our maximum likelihood estimate was based on a particular set of training data. For any N-gram that occurred a sufficient number of times, we might have a good estimate of its probability. But because any corpus is limited, some perfectly acceptable English word sequences are bound to be missing from it. This missing data means that the N-gram matrix for any given training corpus is bound to have a very large number of cases of putative "zero probability N-grams" that should really have some non-zero probability.

We need a method which can help get better estimates for these zero or low-frequency counts. Zero counts turn out to cause another huge problem. The perplexity metric defined above requires that we compute the probability of each test sentence. But if a test sentence has an N-gram that never appeared in the training set, the maximum likelihood estimate of the probability for this N-gram, and hence for the whole test sentence, will be zero! This means that in order to evaluate our language models, we need to modify the MLE method to assign some non-zero probability to any N-gram, even one that was never observed in training.
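The zero-count problem is easy to reproduce. In the sketch below, the two-sentence training corpus is invented; any bigram absent from it receives an MLE probability of exactly zero, which in turn zeroes out any test sentence containing it.

    from collections import Counter

    # Toy training corpus (invented for illustration).
    train = [["i", "want", "chinese", "food"], ["i", "want", "to", "eat"]]

    bigram_counts, unigram_counts = Counter(), Counter()
    for sent in train:
        words = ["<s>"] + sent + ["</s>"]
        unigram_counts.update(words[:-1])          # contexts: everything but </s>
        bigram_counts.update(zip(words, words[1:]))

    def mle(prev, w):
        return bigram_counts[(prev, w)] / unigram_counts[prev]

    print(mle("want", "to"))       # 0.5: this bigram was seen in training
    print(mle("want", "british"))  # 0.0: an unseen bigram zeroes out the whole
                                   # test sentence, so its perplexity is undefined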
For these reasons, we'll want to modify the maximum likelihood estimates for computing N-gram probabilities, focusing on the N-gram events that we incorrectly assumed had zero probability. We use the term smoothing for such modifications that address the poor estimates that are due to variability in small data sets. The name comes from the fact that (looking ahead a bit) we will be shaving a little bit of probability mass from the higher counts and piling it instead on the zero counts, making the distribution a little less jagged.

In the next few sections we introduce some smoothing algorithms and show how they modify the Berkeley Restaurant bigram probabilities in Fig. 4.2.

4.5.1 Laplace Smoothing

One simple way to do smoothing is to take our matrix of bigram counts, before we normalize them into probabilities, and add 1 to all the counts. This algorithm is called Laplace smoothing, or Laplace's Law (Lidstone, 1920; Johnson, 1932; Jeffreys, 1948). Laplace smoothing does not perform well enough to be used in modern N-gram models, but it usefully introduces many of the concepts that we see in other smoothing algorithms and gives a useful baseline.

Let's start with Laplace smoothing applied to unigram probabilities. The unsmoothed maximum likelihood estimate of the unigram probability of the word w_i is its count c_i normalized by the total number of word tokens N. Laplace smoothing (also called add-one smoothing) merely adds one to each count; since there are V words in the vocabulary and each one was incremented, we also need to adjust the denominator to take into account the extra V observations:

    P_{Laplace}(w_i) = \frac{c_i + 1}{N + V}    (4.20)

Instead of changing both the numerator and denominator, it is convenient to describe how a smoothing algorithm affects the numerator, by defining an adjusted count c*. This adjusted count is easier to compare directly with the MLE counts and can be turned into a probability like an MLE count by normalizing by N. To define this count, since we are only changing the numerator, in addition to adding 1 we'll also need to multiply by a normalization factor N/(N+V):

    c_i^* = (c_i + 1) \frac{N}{N + V}

We can now turn c_i^* into a probability P_i^* by normalizing by N.

A related way to view smoothing is as discounting (lowering) some non-zero counts in order to get the probability mass that will be assigned to the zero counts. Thus, instead of referring to the discounted counts c*, we might describe a smoothing algorithm in terms of a relative discount d_c, the ratio of the discounted counts to the original counts:

    d_c = \frac{c^*}{c}    (4.21)

Now that we have the intuition for the unigram case, let's smooth our Berkeley Restaurant Project bigrams. Figure 4.5 shows the add-one smoothed counts for the bigrams in Fig. 4.1.

              i    want   to    eat   chinese   food   lunch   spend
    i         6    828    1     10    1         1      1       3
    want      3    1      609   2     7         7      6       2
    to        3    1      5     687   3         1      7       212
    eat       1    1      3     1     17        3      43      1
    chinese   2    1      1     1     1         83     2       1
    food      16   1      16    1     2         5      1       1
    lunch     3    1      1     1     1         2      1       1
    spend     2    1      2     1     1         1      1       1

    Figure 4.5  Add-one smoothed bigram counts for eight of the words (out of V = 1446) in the Berkeley Restaurant Project corpus of 9332 sentences. Previously-zero counts are in gray.
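The counts in Fig. 4.5 can be produced mechanically from the raw bigram counts; a minimal sketch follows. Only three raw counts are given (read off Fig. 4.5 by subtracting one); the rest of the vocabulary is omitted for brevity.

    from collections import Counter

    def add_one_counts(bigram_counts, words):
        """Add 1 to every bigram count over the full V x V table (Fig. 4.5 style)."""
        return {(w1, w2): bigram_counts.get((w1, w2), 0) + 1
                for w1 in words for w2 in words}

    # A few raw BeRP counts implied by Fig. 4.5 (e.g., C(want, to) = 608).
    raw = Counter({("i", "want"): 827, ("want", "to"): 608, ("to", "eat"): 686})
    words = ["i", "want", "to", "eat"]
    smoothed = add_one_counts(raw, words)
    print(smoothed[("want", "to")])   # 609, matching Fig. 4.5
    print(smoothed[("eat", "i")])     # 1: a previously-zero bigram now has count 1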
Figure 4.6 shows the add-one smoothed probabilities for the bigrams in Fig. 4.2. Recall that normal bigram probabilities are computed by normalizing each row of counts by the unigram count:

    P(w_n \mid w_{n-1}) = \frac{C(w_{n-1} w_n)}{C(w_{n-1})}    (4.22)

For add-one smoothed bigram counts, we need to augment the unigram count by the number of total word types in the vocabulary V:

    P^*_{Laplace}(w_n \mid w_{n-1}) = \frac{C(w_{n-1} w_n) + 1}{C(w_{n-1}) + V}    (4.23)

Thus, each of the unigram counts given in the previous section will need to be augmented by V = 1446. The result is the smoothed bigram probabilities in Fig. 4.6.

              i        want     to       eat      chinese   food     lunch    spend
    i         0.0015   0.21     0.00025  0.0025   0.00025   0.00025  0.00025  0.00075
    want      0.0013   0.00042  0.26     0.00084  0.0029    0.0029   0.0025   0.00084
    to        0.00078  0.00026  0.0013   0.18     0.00078   0.00026  0.0018   0.055
    eat       0.00046  0.00046  0.0014   0.00046  0.0078    0.0014   0.02     0.00046
    chinese   0.0012   0.00062  0.00062  0.00062  0.00062   0.052    0.0012   0.00062
    food      0.0063   0.00039  0.0063   0.00039  0.00079   0.002    0.00039  0.00039
    lunch     0.0017   0.00056  0.00056  0.00056  0.00056   0.0011   0.00056  0.00056
    spend     0.0012   0.00058  0.0012   0.00058  0.00058   0.00058  0.00058  0.00058

    Figure 4.6  Add-one smoothed bigram probabilities for eight of the words (out of V = 1446) in the BeRP corpus of 9332 sentences. Previously-zero probabilities are in gray.

It is often convenient to reconstruct the count matrix so we can see how much a smoothing algorithm has changed the original counts. These adjusted counts can be computed by Eq. 4.24:

    c^*(w_{n-1} w_n) = \frac{[C(w_{n-1} w_n) + 1] \times C(w_{n-1})}{C(w_{n-1}) + V}    (4.24)

Figure 4.7 shows the reconstructed counts.

              i      want    to     eat    chinese   food   lunch   spend
    i         3.8    527     0.64   6.4    0.64      0.64   0.64    1.9
    want      1.2    0.39    238    0.78   2.7       2.7    2.3     0.78
    to        1.9    0.63    3.1    430    1.9       0.63   4.4     133
    eat       0.34   0.34    1      0.34   5.8       1      15      0.34
    chinese   0.2    0.098   0.098  0.098  0.098     8.2    0.2     0.098
    food      6.9    0.43    6.9    0.43   0.86      2.2    0.43    0.43
    lunch     0.57   0.19    0.19   0.19   0.19      0.38   0.19    0.19
    spend     0.32   0.16    0.32   0.16   0.16      0.16   0.16    0.16

    Figure 4.7  Add-one reconstituted counts for eight words (of V = 1446) in the BeRP corpus of 9332 sentences. Previously-zero counts are in gray.

Note that add-one smoothing has made a very big change to the counts. C(want to) changed from 608 to 238! We can see this in probability space as well: P(to|want) decreases from .66 in the unsmoothed case to .26 in the smoothed case. Looking at the discount d (the ratio between new and old counts) shows us how strikingly the counts for each prefix word have been reduced; the discount for the bigram want to is .39, while the discount for chinese food is .10, a factor of 10!
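A small sketch of Eqs. 4.23 and 4.24 that reproduces the want-to numbers quoted above. The unigram count C(want) = 927 is not given in this excerpt and is assumed here; it is consistent with the .66 and .26 probabilities in the text.

    def laplace_probability(c_bigram, c_prev, V):
        """Eq. 4.23: add-one smoothed bigram probability."""
        return (c_bigram + 1) / (c_prev + V)

    def reconstituted_count(c_bigram, c_prev, V):
        """Eq. 4.24: map the smoothed probability back into count space."""
        return (c_bigram + 1) * c_prev / (c_prev + V)

    V = 1446          # BeRP vocabulary size, from the text
    c_want = 927      # unigram count C(want); assumed, consistent with P(to|want) = .66
    c_want_to = 608   # C(want, to), from the text

    print(round(laplace_probability(c_want_to, c_want, V), 2))   # about .26
    c_star = reconstituted_count(c_want_to, c_want, V)
    print(round(c_star))                                         # about 238 (Fig. 4.7)
    print(round(c_star / c_want_to, 2))                          # discount d of about .39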
The sharp change in counts and probabilities occurs because too much probability mass is moved to all the zeros. We could move a bit less mass by adding a fractional count rather than 1 (add-δ smoothing; Lidstone, 1920; Johnson, 1932; Jeffreys, 1948), but this method requires a method for choosing δ dynamically, results in an inappropriate discount for many counts, and turns out to give counts with poor variances. For these and other reasons (Gale and Church, 1994), we'll need better smoothing methods for N-grams like the ones we show in the next section.

4.5.2 Good-Turing Discounting

There are a number of much better discounting algorithms that are only slightly more complex than add-one smoothing. In this section we introduce one of them, known as Good-Turing smoothing. The Good-Turing algorithm was first described by Good (1953), who credits Turing with the original idea.

The intuition of a number of discounting algorithms (Good-Turing, Witten-Bell discounting, and Kneser-Ney smoothing) is to use the count of things you've seen once to help estimate the count of things you've never seen. A word or N-gram (or any event) that occurs once is called a singleton, or a hapax legomenon. The Good-Turing intuition is to use the frequency of singletons as a re-estimate of the frequency of zero-count bigrams.

Let's formalize the algorithm. The Good-Turing algorithm is based on computing N_c, the number of N-grams that occur c times. We refer to the number of N-grams that occur c times as the frequency of frequency c. So applying the idea to smoothing the joint probability of bigrams, N_0 is the number of bigrams with count 0, N_1 the number of bigrams with count 1 (singletons), and so on. We can think of each of the N_c as a bin that stores the number of different N-grams that occur in the training set with that frequency c. More formally:

    N_c = \sum_{x : \mathrm{count}(x) = c} 1    (4.25)

The MLE count for N_c is c. The Good-Turing intuition is to estimate the probability of things that occur c times in the training corpus by the MLE probability of things that occur c + 1 times in the corpus. So the Good-Turing estimate replaces the MLE count c for N_c with a smoothed count c* that is a function of N_{c+1}:

    c^* = (c + 1) \frac{N_{c+1}}{N_c}    (4.26)

We can use Eq. 4.26 to replace the MLE counts for all the bins N_1, N_2, and so on. Instead of using this equation directly to re-estimate the smoothed count c* for N_0, we use the following equation for the probability P^*_{GT} for things that had zero count N_0, or what we might call the missing mass:

    P^*_{GT}(\text{things with frequency zero in training}) = \frac{N_1}{N}    (4.27)

Here N_1 is the count of items in bin 1, that is, that were seen once in training, and N is the total number of items we have seen in training. Equation 4.27 thus gives the probability that the N+1st bigram we see will be one that we never saw in training. Showing that Eq. 4.27 follows from Eq. 4.26 is left as Exercise 4.8 for the reader.
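A minimal sketch of Good-Turing re-estimation: it builds the frequency-of-frequencies bins, applies Eq. 4.26 where it is defined, and returns the Eq. 4.27 missing mass. The species tallies anticipate the worked fishing example that follows.

    from collections import Counter

    def good_turing(counts):
        """Good-Turing re-estimated counts c* = (c+1) * N_{c+1} / N_c (Eq. 4.26)
        and the missing mass N_1 / N for unseen events (Eq. 4.27)."""
        N_c = Counter(counts.values())          # frequency of frequencies
        N = sum(counts.values())
        c_star = {c: (c + 1) * N_c[c + 1] / N_c[c]
                  for c in N_c if N_c[c + 1] > 0}   # Eq. 4.26 undefined when N_{c+1} = 0
        p_unseen = N_c[1] / N                   # probability mass for zero-count events
        return c_star, p_unseen

    # Species tallies used in the fishing example below: 10 carp, 3 perch,
    # 2 whitefish, 1 trout, 1 salmon, 1 eel (18 fish in all).
    counts = {"carp": 10, "perch": 3, "whitefish": 2, "trout": 1, "salmon": 1, "eel": 1}
    c_star, p_unseen = good_turing(counts)
    print(c_star[1])    # about 0.67: the revised count for a singleton such as trout
    print(p_unseen)     # 3/18: probability that the next fish is a new species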
Let's illustrate Good-Turing with an example. Imagine we are fishing in a lake with 8 species (bass, carp, catfish, eel, perch, salmon, trout, whitefish) and we have so far caught 10 carp, 3 perch, 2 whitefish, 1 trout, 1 salmon, and 1 eel (18 fish in all). How likely is it that the next fish we catch will be a member of a previously unseen species, in this case either a catfish or a bass? The MLE count c of a hitherto-unseen species (bass or catfish) is 0. But Eq. 4.27 tells us that the probability of a new fish being one of these unseen species is 3/18, since N_1 is 3 and N is 18:

    P^*_{GT}(\text{new species}) = \frac{N_1}{N} = \frac{3}{18}    (4.28)

What about a species we have seen once, such as trout? The MLE probability would be 1/18, but the Good-Turing estimate must be lower, since we just stole 3/18 of our probability mass to use on unseen events. We'll need to discount the MLE probabilities for trout, perch, carp, etc. In summary, the revised counts c* and Good-Turing smoothed probabilities P^*_{GT} for species with count 0 (like bass or catfish) or count 1 (like trout, salmon, or eel) are as follows:

                  Unseen (bass or catfish)                    trout
    c             0                                           1
    MLE p         p = 0/18 = 0                                1/18
    c*            --                                          c*(trout) = 2 x N_2/N_1 = 2 x 1/3 = .67
    GT P^*_{GT}   P^*_{GT}(unseen) = N_1/N = 3/18 = .17       P^*_{GT}(trout) = .67/18 = 1/27 = .037

Note that the revised count c* for trout was discounted from c = 1.0 to c* = .67 (thus leaving some probability mass P^*_{GT}(unseen) = 3/18 = .17 for the catfish and bass). And since we know there were 2 unknown species, the probability of the next fish being specifically a catfish is P^*_{GT}(catfish) = 1/2 x 3/18 = .085.

4.5.3 Some Advanced Issues in Good-Turing Estimation

Good-Turing estimation assumes that the distribution of each bigram is binomial and assumes we know N_0, the number of bigrams we haven't seen. We know this because given a vocabulary size of V, the total number of possible bigrams is V^2, so N_0 is V^2 minus all the bigrams we have seen.

Figure 4.8 gives two examples of Good-Turing re-estimation: bigram "frequencies of frequencies" from the Berkeley Restaurant corpus and from 22 million bigrams of Associated Press (AP) newswire (Church and Gale, 1991). For each, the first column shows the count c, that is, the number of observed instances of a bigram; the next column shows N_c, the number of bigrams that had that count; and the third shows the Good-Turing re-estimated count c*.

    AP Newswire                              Berkeley Restaurant
    c (MLE)   N_c              c* (GT)       c (MLE)   N_c         c* (GT)
    0         74,671,100,000   0.0000270     0         2,081,496   0.002553
    1         2,018,046        0.446         1         5315        0.533960
    2         449,721          1.26          2         1419        1.357294
    3         188,933          2.24          3         642         2.373832
    4         105,668          3.24          4         381         4.081365
    5         68,379           4.22          5         311         3.781350
    6         48,190           5.19          6         196         4.500000

    Figure 4.8  Bigram "frequencies of frequencies" and Good-Turing re-estimations for the 22 million AP bigrams from Church and Gale (1991) and from the Berkeley Restaurant corpus of 9332 sentences.

There are a number of additional complexities in the use of Good-Turing. For example, we don't just use the raw N_c values in Eq. 4.26. This is because the re-estimate c* for N_c depends on N_{c+1}; hence, Eq. 4.26 is undefined when N_{c+1} = 0. Such zeros occur quite often. In our sample problem above, for example, since N_4 = 0, how can we compute c*_3? One solution to this is called Simple Good-Turing (Gale and Sampson, 1995). In Simple Good-Turing, after we compute the bins N_c but before we compute Eq. 4.26 from them, we smooth the N_c counts to replace any zeros in the sequence. The simplest thing is just to replace the value N_c with a value computed from a linear regression that is fit to map N_c to c in log space (see Gale and Sampson (1995) for details):

    \log(N_c) = a + b \log(c)    (4.29)

In addition, in practice, the discounted estimate c* is not used for all counts c. Large counts (where c > k for some threshold k) are assumed to be reliable. Katz (1987) suggests setting k at 5. Thus, we define

    c^* = c \quad \text{for } c > k    (4.30)

The correct equation for c* when some k is introduced (from Katz (1987)) is

    c^* = \frac{(c+1)\frac{N_{c+1}}{N_c} - c\,\frac{(k+1)N_{k+1}}{N_1}}{1 - \frac{(k+1)N_{k+1}}{N_1}}, \quad \text{for } 1 \le c \le k.    (4.31)

Finally, with Good-Turing and other discounting, it is usual to treat N-grams with low raw counts (especially counts of 1) as if the count were 0, that is, to apply Good-Turing discounting to these as if they were unseen.

Good-Turing discounting is not used by itself in discounting N-grams; it is only used in combination with the backoff and interpolation algorithms described in the next sections.
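A sketch of the Simple Good-Turing idea in Eq. 4.29: fit log(N_c) = a + b log(c) by ordinary least squares so that a smoothed value can stand in for an empty bin such as N_4 in the fishing example. The tiny bin table is taken from that example; a real implementation would follow Gale and Sampson (1995) more closely.

    import math

    def fit_log_linear(N_c):
        """Fit log(N_c) = a + b*log(c) by least squares (Eq. 4.29) and return a
        function giving a smoothed N_c value for any count c."""
        cs = sorted(N_c)
        xs = [math.log(c) for c in cs]
        ys = [math.log(N_c[c]) for c in cs]
        n = len(xs)
        x_bar, y_bar = sum(xs) / n, sum(ys) / n
        b = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys)) / \
            sum((x - x_bar) ** 2 for x in xs)
        a = y_bar - b * x_bar
        return lambda c: math.exp(a + b * math.log(c))

    # Frequency-of-frequencies bins from the fishing example; N_4 = 0 is the
    # gap that Simple Good-Turing needs to fill.
    N_c = {1: 3, 2: 1, 3: 1, 10: 1}
    smoothed_N = fit_log_linear(N_c)
    print(smoothed_N(4))   # a smoothed stand-in for the empty N_4 bin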
4.6 Interpolation

The discounting we have been discussing so far can help solve the problem of zero-frequency N-grams. But there is an additional source of knowledge we can draw on. If we are trying to compute P(w_n | w_{n-2} w_{n-1}) but we have no examples of a particular trigram w_{n-2} w_{n-1} w_n, we can instead estimate its probability by using the bigram probability P(w_n | w_{n-1}). Similarly, if we don't have counts to compute P(w_n | w_{n-1}), we can look to the unigram P(w_n).

There are two ways to use this N-gram "hierarchy": backoff and interpolation. In backoff, if we have non-zero trigram counts, we rely solely on the trigram counts. We only "back off" to a lower-order N-gram if we have zero evidence for a higher-order N-gram. By contrast, in interpolation, we always mix the probability estimates from all the N-gram estimators, that is, we do a weighted interpolation of trigram, bigram, and unigram counts.

In simple linear interpolation, we combine different order N-grams by linearly interpolating all the models. Thus, we estimate the trigram probability P(w_n | w_{n-2} w_{n-1}) by mixing together the unigram, bigram, and trigram probabilities, each weighted by a \lambda:

    \hat{P}(w_n \mid w_{n-2} w_{n-1}) = \lambda_1 P(w_n \mid w_{n-2} w_{n-1}) + \lambda_2 P(w_n \mid w_{n-1}) + \lambda_3 P(w_n)    (4.32)

such that the \lambdas sum to 1:

    \sum_i \lambda_i = 1    (4.33)

In a slightly more sophisticated version of linear interpolation, each \lambda weight is computed by conditioning on the context. This way, if we have particularly accurate counts for a particular bigram, we assume that the counts of the trigrams based on this bigram will be more trustworthy, so we can make the \lambdas for those trigrams higher and thus give that trigram more weight in the interpolation. Equation 4.34 shows the equation for interpolation with context-conditioned weights:

    \hat{P}(w_n \mid w_{n-2} w_{n-1}) = \lambda_1(w_{n-2}^{n-1}) P(w_n \mid w_{n-2} w_{n-1}) + \lambda_2(w_{n-2}^{n-1}) P(w_n \mid w_{n-1}) + \lambda_3(w_{n-2}^{n-1}) P(w_n)    (4.34)

How are these \lambda values set? Both the simple interpolation and conditional interpolation \lambdas are learned from a held-out corpus. Recall from Section 4.3 that a held-out corpus is an additional training corpus that we use, not to set the N-gram counts, but to set other parameters. In this case, we can use such data to set the \lambda values. We can do this by choosing the \lambda values that maximize the likelihood of the held-out corpus. That is, we fix the N-gram probabilities and then search for the \lambda values that, when plugged into Eq. 4.32, give us the highest probability of the held-out set. There are various ways to find this optimal set of \lambdas. One way is to use the EM algorithm defined in Chapter 6, which is an iterative learning algorithm that converges on locally optimal \lambdas (Baum, 1972; Dempster et al., 1977; Jelinek and Mercer, 1980).
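A sketch of simple linear interpolation (Eq. 4.32). The probability tables and \lambda weights are illustrative assumptions (P(to|want) = .66 is the unsmoothed value quoted earlier); in practice the \lambdas would be tuned on held-out data, for example with EM as just described.

    def interpolated_trigram_prob(w, bigram_context, p_tri, p_bi, p_uni, lambdas):
        """Simple linear interpolation (Eq. 4.32): mix trigram, bigram, and
        unigram estimates with weights that sum to 1."""
        l1, l2, l3 = lambdas
        assert abs(l1 + l2 + l3 - 1.0) < 1e-9
        w1, w2 = bigram_context
        return (l1 * p_tri.get((w1, w2, w), 0.0)
                + l2 * p_bi.get((w2, w), 0.0)
                + l3 * p_uni.get(w, 0.0))

    # Illustrative probability tables and weights.
    p_tri = {("i", "want", "to"): 0.55}
    p_bi = {("want", "to"): 0.66}
    p_uni = {"to": 0.08}
    print(interpolated_trigram_prob("to", ("i", "want"), p_tri, p_bi, p_uni,
                                    (0.6, 0.3, 0.1)))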
4.7 Backoff

While simple interpolation is indeed simple to understand and implement, it turns out that there are a number of better algorithms. One of these is backoff N-gram modeling. The version of backoff we describe uses Good-Turing discounting as well. It was introduced by Katz (1987); hence, this kind of backoff with discounting is also called Katz backoff. In a Katz backoff N-gram model, if the N-gram we need has zero counts, we approximate it by backing off to the (N-1)-gram. We continue backing off until we reach a history that has some counts:

    P_{katz}(w_n \mid w_{n-N+1}^{n-1}) =
        \begin{cases}
        P^*(w_n \mid w_{n-N+1}^{n-1}),                                 & \text{if } C(w_{n-N+1}^{n}) > 0 \\
        \alpha(w_{n-N+1}^{n-1}) \, P_{katz}(w_n \mid w_{n-N+2}^{n-1}), & \text{otherwise.}
        \end{cases}    (4.35)

Equation 4.35 shows that the Katz backoff probability for an N-gram just relies on the (discounted) probability P* if we've seen this N-gram before (i.e., if we have non-zero counts). Otherwise, we recursively back off to the Katz probability for the shorter-history (N-1)-gram. We'll define the discounted probability P*, the normalizing factor \alpha, and other details about dealing with zero counts in Section 4.7.1. Based on these details, the trigram version of backoff might be represented as follows (where for pedagogical clarity, since it's easy to confuse the indices w_i, w_{i-1} and so on, we refer to the three words in a sequence as x, y, z in that order):

    P_{katz}(z \mid x, y) =
        \begin{cases}
        P^*(z \mid x, y),                   & \text{if } C(x, y, z) > 0 \\
        \alpha(x, y) \, P_{katz}(z \mid y), & \text{else if } C(x, y) > 0 \\
        P^*(z),                             & \text{otherwise.}
        \end{cases}    (4.36)

    P_{katz}(z \mid y) =
        \begin{cases}
        P^*(z \mid y),       & \text{if } C(y, z) > 0 \\
        \alpha(y) \, P^*(z), & \text{otherwise.}
        \end{cases}    (4.37)

Katz backoff incorporates discounting as an integral part of the algorithm. Our previous discussions of discounting showed how a method like Good-Turing could be used to assign probability mass to unseen events. For simplicity, we assumed that these unseen events were all equally probable, and so the probability mass was distributed evenly among all unseen events. Katz backoff gives us a better way to distribute the probability mass among unseen trigram events by relying on information from unigrams and bigrams. We use discounting to tell us how much total probability mass to set aside for all the events we haven't seen, and the \alpha weights to tell us how to distribute this probability. Discounting is implemented by using the discounted probabilities P*(.) rather than the MLE probabilities P(.) in Eq. 4.35 and Eq. 4.37.

Why do we need discounts and \alpha values in Eq. 4.35 and Eq. 4.37? Why couldn't we just use the MLE probabilities without discounts or weights? Because without them the result would not be a true probability. The MLE estimates are true probabilities: if we sum the probability of all w_i over a given N-gram context, we get 1. But if we use MLE probabilities and simply back off to a lower-order model whenever the MLE probability is zero, we are adding extra probability mass into the equation, and the total probability of a word would be greater than 1.

Thus, any backoff language model must also be discounted. The P* is used to discount the higher-order N-grams so as to save some probability mass for the lower-order N-grams; the \alpha is used to ensure that the probability mass given to all the lower-order N-grams sums up to exactly the amount saved by discounting the higher-order N-grams. We define P* as the discounted (c*) estimate of the conditional probability of an N-gram (and save P for MLE probabilities):

    P^*(w_n \mid w_{n-N+1}^{n-1}) = \frac{c^*(w_{n-N+1}^{n})}{C(w_{n-N+1}^{n-1})}    (4.39)

Because on average the discounted c* will be less than c, this probability P* will be slightly less than the MLE estimate, which leaves some probability mass for the lower-order N-grams; that mass is then distributed by the \alpha weights, whose computation is given in Section 4.7.1. Figure 4.9 shows the resulting Katz backoff bigram probabilities for our eight sample words, computed from the BeRP corpus.

    Figure 4.9  Good-Turing smoothed bigram probabilities for eight words (of V = 1446) in the BeRP corpus of 9332 sentences.
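The control flow of Eqs. 4.36-4.37 can be sketched directly, assuming the discounted probabilities P* and backoff weights \alpha have already been computed (their computation is the subject of Section 4.7.1). All table entries below are hypothetical.

    def p_katz_trigram(x, y, z, p_star, alpha, counts):
        """Katz backoff for trigrams (Eqs. 4.36-4.37), given precomputed discounted
        probabilities P*, backoff weights alpha, and raw N-gram counts."""
        if counts.get((x, y, z), 0) > 0:
            return p_star[(x, y, z)]
        if counts.get((x, y), 0) > 0:          # seen bigram context: back off with its alpha
            return alpha[(x, y)] * p_katz_bigram(y, z, p_star, alpha, counts)
        return p_star[(z,)]                    # back all the way off to the unigram

    def p_katz_bigram(y, z, p_star, alpha, counts):
        if counts.get((y, z), 0) > 0:
            return p_star[(y, z)]
        return alpha[(y,)] * p_star[(z,)]

    # Hypothetical tables for a tiny model; real P* and alpha values come from the
    # discounting and normalization detailed in Section 4.7.1.
    counts = {("i", "want"): 827, ("want", "to"): 608, ("i", "want", "to"): 300}
    p_star = {("i", "want", "to"): 0.55, ("want", "to"): 0.60,
              ("to",): 0.07, ("chinese",): 0.01}
    alpha = {("i", "want"): 0.2, ("want",): 0.3}
    print(p_katz_trigram("i", "want", "to", p_star, alpha, counts))       # seen trigram: P*
    print(p_katz_trigram("i", "want", "chinese", p_star, alpha, counts))  # backs off twice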
4.7.1 Advanced: Details of Computing Katz Backoff α and P*

In this section we give the remaining details of the computation of the discounted probability P* and the backoff weights \alpha(w). We begin with \alpha, which passes the leftover probability mass to the lower-order N-grams. Let's represent the total amount of leftover probability mass by the function \beta, a function of the (N-1)-gram context. For a given (N-1)-gram context, the total leftover probability mass can be computed by subtracting from 1 the total discounted probability mass for all N-grams starting with that context:

    \beta(w_{n-N+1}^{n-1}) = 1 - \sum_{w_n : C(w_{n-N+1}^{n}) > 0} P^*(w_n \mid w_{n-N+1}^{n-1})    (4.40)

This gives us the total probability mass that we are ready to distribute to all (N-1)-grams (e.g., bigrams if our original model was a trigram). Each individual (N-1)-gram (bigram) will only get a fraction of this mass, so we need to normalize \beta by the total probability of all the (N-1)-grams (bigrams) that begin some N-gram (trigram) that has zero count. The final equation for computing how much probability mass to distribute from an N-gram to an (N-1)-gram is represented by the function \alpha:

    \alpha(w_{n-N+1}^{n-1}) = \frac{\beta(w_{n-N+1}^{n-1})}{\sum_{w_n : C(w_{n-N+1}^{n}) = 0} P_{katz}(w_n \mid w_{n-N+2}^{n-1})}
                            = \frac{1 - \sum_{w_n : C(w_{n-N+1}^{n}) > 0} P^*(w_n \mid w_{n-N+1}^{n-1})}{1 - \sum_{w_n : C(w_{n-N+1}^{n}) > 0} P^*(w_n \mid w_{n-N+2}^{n-1})}    (4.41)

Note that \alpha is a function of the preceding word string, that is, of w_{n-N+1}^{n-1}; thus the amount by which we discount each trigram (d) and the mass that gets reassigned to lower-order N-grams (\alpha) are recomputed for every (N-1)-gram that occurs in any N-gram.

We need to specify what to do when the counts of an (N-1)-gram context are 0 (i.e., when C(w_{n-N+1}^{n-1}) = 0), and our definition is complete:

    P_{katz}(w_n \mid w_{n-N+1}^{n-1}) = P_{katz}(w_n \mid w_{n-N+2}^{n-1}) \quad \text{if } C(w_{n-N+1}^{n-1}) = 0    (4.42)

and

    P^*(w_n \mid w_{n-N+1}^{n-1}) = 0 \quad \text{if } C(w_{n-N+1}^{n-1}) = 0    (4.43)

and

    \beta(w_{n-N+1}^{n-1}) = 1 \quad \text{if } C(w_{n-N+1}^{n-1}) = 0    (4.44)

4.8 Practical Issues: Toolkits and Data Formats

We represent and compute language model probabilities in log format, both to avoid numerical underflow and to speed up computation. Since probabilities are by definition no greater than 1, the more probabilities we multiply together, the smaller the product becomes; multiplying enough N-grams together would result in numerical underflow. Adding in log space is equivalent to multiplying in linear space, so we combine log probabilities by adding them, which is also faster than multiplication. The result is that when we need to report a true probability at the end, we take the exp of the sum:

    p_1 \times p_2 \times p_3 \times p_4 = \exp(\log p_1 + \log p_2 + \log p_3 + \log p_4)    (4.45)

Backoff N-gram language models are generally stored in ARPA format. An N-gram model in ARPA format is an ASCII file with a small header followed by a list of all the non-zero N-gram probabilities: all the unigrams, followed by the bigrams, followed by the trigrams. Each N-gram entry is stored with its discounted log probability (in log10 format) and its backoff weight \alpha. Backoff weights are only necessary for N-grams that form a prefix of longer N-grams, so no \alpha is computed for the highest-order N-grams or for N-grams ending in the end-of-sequence token. Thus, for a trigram grammar, the format of each kind of N-gram is:

    unigram:  log P*(w_i)                      w_i                   log \alpha(w_i)
    bigram:   log P*(w_i | w_{i-1})            w_{i-1} w_i           log \alpha(w_{i-1} w_i)
    trigram:  log P*(w_i | w_{i-2}, w_{i-1})   w_{i-2} w_{i-1} w_i

Figure 4.10 shows an ARPA formatted LM file with selected N-grams from the BeRP corpus. Given one of these trigrams, the probability P(z|x,y) for the word sequence x, y, z can be computed as follows (repeated from (4.37)):

    P_{katz}(z \mid x, y) =
        \begin{cases}
        P^*(z \mid x, y),                   & \text{if } C(x, y, z) > 0 \\
        \alpha(x, y) \, P_{katz}(z \mid y), & \text{else if } C(x, y) > 0 \\
        P^*(z),                             & \text{otherwise.}
        \end{cases}    (4.46)

    P_{katz}(z \mid y) =
        \begin{cases}
        P^*(z \mid y),       & \text{if } C(y, z) > 0 \\
        \alpha(y) \, P^*(z), & \text{otherwise.}
        \end{cases}    (4.47)
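Since ARPA files store discounted log10 probabilities and log10 backoff weights, the lookup of Eq. 4.46 is usually done by addition in log space. A sketch under that assumption follows; apart from the want unigram entry, which mirrors Figure 4.10 below, the numbers are invented.

    def log10_p_katz(x, y, z, logp, backoff):
        """Trigram log10 probability looked up ARPA-style (cf. Eq. 4.46): `logp`
        maps N-gram tuples to discounted log10 probabilities, `backoff` maps
        contexts to log10 alpha weights.  A missing backoff weight is treated
        as 0.0, i.e., alpha = 1."""
        if (x, y, z) in logp:
            return logp[(x, y, z)]
        return backoff.get((x, y), 0.0) + log10_p_katz_bigram(y, z, logp, backoff)

    def log10_p_katz_bigram(y, z, logp, backoff):
        if (y, z) in logp:
            return logp[(y, z)]
        return backoff.get((y,), 0.0) + logp[(z,)]

    # The "want" unigram entry mirrors Figure 4.10 below; the rest is invented.
    logp = {("want",): -1.776296, ("chinese",): -2.3, ("i", "want"): -0.49}
    backoff = {("want",): -0.04292, ("i", "want"): -0.4}
    print(10 ** log10_p_katz("i", "want", "chinese", logp, backoff))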
In trai I _ I ' mar th words separated by W . .I‘ ounting (000‘.- irious thresholds. The out?“ decoding mode, type of disc (4.45) An ability the ' » - -__‘._‘-_v- - dels are the SRIL‘ -' Rosenfcllli m’ng 111003”... ' Kneser-Ney Section 4.9. Advanced Issues in Language Modeling 109 \data\ ngram 1=1447 ngram 2=9420 ngram 3=5201 \1—grm:\ -0 .8679678 </a-> -99 <s>- .068532 -4.743076 chow-fun .1943932 -4 .266155 fries .. 5432162 —3 . 175167 thursday . 7510199 -1.776296 want: .04292 \2-grams:\ —0.6077-676 ' -0.6257131 —0.4861297 ' 0.0425899 -2.832415 -0.06423882 -0.546952‘5 -0.008193135 —0 . 09403705 \3-grams:\ .579416 i prefer . 148009 about fifteen . 4 120701 90 to .3735807 in list: .260361 jupiter </s> . 260361 malaysian restaurant ARPA format for N—grams, showing some sample N—grams. Each is represented by a logprob, the word sequence, W1...W", followed by the log backoff weight a. Note that no a is computed for the highest-order N—gram or for N—grams ending in <s>. take a language model in ARPA format and a sentence or corpus, and produce the probability and perplexity of the sentence or corpus. Both also implement many ad— vanced features discussed later in this chapter and in following chapters, including skip N-grams, word lattices, confusion networks, and N-gram pruning. 4.9 Advanced Issues in Language Modeling Because language models play such a broad role throughout speech and language pro- cessing, they have been extended and augmented in many ways. In this section we briefly introduce a few of these, including Kneser-Ney Smoothing, class—based lan- guage modeling, language model adaptation, topic-based language models, cache lan— guage models, variable-length N -grams. 4.9.1 Advanced Smoothing Methods: Kneser-Ney Smoothing One of the most commonly used modern N—gram smoothing methods is the interpo- lated Kneser-Ney algorithm. Kneser-Ney has its roots in a discounting method called absolute discounting. Absolute discounting is a much better method of computing a revised count c* than the Good—Turing discount formula we saw in Eq. 4.26, based on frequencies of fre- quencies. To get the intuition, let’s revisit the Good-Turing estimates of the bigram c* l 10 Chapter 4. N—Grams Absolute discounting extended from Fig. 4.8 and reformatted below. c(MLE) 0 1 2 3 4 5 6 7 8 9 c* (GT) 0.0000270 0.446 1.26 2.24 3.24 4.22 5.19 6.21 7.24 8.25 and 1, all the other re-estimated counts c* could be estimated tracting 0.75 from the MLE count c! Absolute discounting by subtracting a fixed (absolute) discount (1 from each count. The intuition is that we C(wi_1w,~!—D a lfc '— z > 0 Pabsolute (WiiWi—l) = C wi—l (W! 1w ) (4.48) a(w,-)P(w,-), otherwise .InterPomzd Kneser—Ney Class-based N—gram Cluster N-gram IBM clustering later disc tatic a ba« whe: reall 4.9. The infor deali want in tht If we traini as IB of ha mates ity of and t1 If likelil". the Cl: Cll ' 8 9 3.21 7.24 8.25 stimated counts for 0 'etty well by just sub- rmalizes this intuition he intuition is that we iscount d won’t affect h we don’t necessarily .ng applied to bigrams .iing sum to 1) is 1W‘) > 0 (4.48) .e. 1311168 D for the 0 and 1 nts absolute discounting .ion. Consider the 10b of melting off to a unigram than the word Francisco. 1 will prefer it to glasses. sea is frequent, it 15 only sco. The word glasses has , t (the number of times the nt backoff distribution. War I )f times we might expect tt: . tion is to base our estimaf- .. Words that have appeared ‘ text as well. We can express ’, as follows: M— (4.49.i. . 
4.9 Advanced Issues in Language Modeling

Because language models play such a broad role throughout speech and language processing, they have been extended and augmented in many ways. In this section we briefly introduce a few of these, including Kneser-Ney smoothing, class-based language modeling, language model adaptation, topic-based language models, cache language models, and variable-length N-grams.

4.9.1 Advanced Smoothing Methods: Kneser-Ney Smoothing

One of the most commonly used modern N-gram smoothing methods is the interpolated Kneser-Ney algorithm. Kneser-Ney has its roots in a discounting method called absolute discounting.

Absolute discounting is a much better method of computing a revised count c* than the Good-Turing discount formula we saw in Eq. 4.26, based on frequencies of frequencies. To get the intuition, let's revisit the Good-Turing estimates of the bigram c*, extended from Fig. 4.8 and reformatted below:

    c (MLE)    0           1      2     3     4     5     6     7     8     9
    c* (GT)    0.0000270   0.446  1.26  2.24  3.24  4.22  5.19  6.21  7.24  8.25

It turns out that except for the re-estimated counts for 0 and 1, all the other re-estimated counts c* could be estimated pretty well by just subtracting 0.75 from the MLE count c! Absolute discounting formalizes this intuition by subtracting a fixed (absolute) discount D from each count. The intuition is that we already have good estimates for the high counts, and a small discount D won't affect them much; it will mainly affect the smaller counts, for which we don't necessarily trust the estimates anyway. The equation for absolute discounting applied to bigrams (assuming a proper coefficient \alpha on the backoff to make everything sum to 1) is:

    P_{absolute}(w_i \mid w_{i-1}) =
        \begin{cases}
        \frac{C(w_{i-1} w_i) - D}{C(w_{i-1})}, & \text{if } C(w_{i-1} w_i) > 0 \\
        \alpha(w_i) \, P(w_i),                 & \text{otherwise}
        \end{cases}    (4.48)

Kneser-Ney augments absolute discounting with a more sophisticated way to handle the backoff distribution. Consider the job of predicting the next word in this sentence, assuming we are backing off to a unigram model: I can't see without my reading ___. The word glasses seems much more likely to follow here than the word Francisco, yet a raw unigram model will prefer Francisco to glasses, since San Francisco is a very frequent phrase. The problem is that although Francisco is frequent, it is frequent essentially only after the word San; the word glasses has a much wider distribution. Thus, instead of the unigram MLE count (the number of times the word has been seen), the Kneser-Ney intuition is to base the backoff distribution on the number of different contexts the word has appeared in: words that have appeared in more contexts are more likely to appear in some new context as well. We can express this new backoff probability, the "continuation probability", as follows:

    P_{continuation}(w_i) = \frac{|\{ w_{i-1} : C(w_{i-1} w_i) > 0 \}|}{\sum_{w_i} |\{ w_{i-1} : C(w_{i-1} w_i) > 0 \}|}    (4.49)

The Kneser-Ney backoff intuition can then be formalized as follows (again assuming a proper coefficient \alpha on the backoff to make everything sum to 1):

    P_{KN}(w_i \mid w_{i-1}) =
        \begin{cases}
        \frac{C(w_{i-1} w_i) - D}{C(w_{i-1})}, & \text{if } C(w_{i-1} w_i) > 0 \\
        \alpha(w_i) \, P_{continuation}(w_i),  & \text{otherwise.}
        \end{cases}    (4.50)

Finally, it turns out to be better to use an interpolated rather than a backoff form of Kneser-Ney. While simple linear interpolation is generally not as successful as Katz backoff, it turns out that more powerful interpolated models, such as interpolated Kneser-Ney, work better than their backoff version. Interpolated Kneser-Ney discounting can be computed with an equation like the following (omitting the computation of \beta):

    P_{KN}(w_i \mid w_{i-1}) = \frac{C(w_{i-1} w_i) - D}{C(w_{i-1})} + \beta(w_i) \, |\{ w_{i-1} : C(w_{i-1} w_i) > 0 \}|    (4.51)

A final practical note: it turns out that any interpolation model can be represented as a backoff model, and hence stored in ARPA backoff format. We simply do the interpolation when we build the model, so the "bigram" probability stored in the backoff format is really "bigram already interpolated with unigram".

4.9.2 Class-Based N-Grams

The class-based N-gram, or cluster N-gram, is a variant of the N-gram that uses information about word classes or clusters. Class-based N-grams can be useful for dealing with sparsity in the training data. Suppose for a flight reservation system we want to compute the probability of the bigram to Shanghai, but this bigram never occurs in the training set. Instead, our training data has to London, to Beijing, and to Denver. If we knew that these were all cities, and assuming that Shanghai does appear in the training set in other contexts, we could predict the likelihood of a city following to.

There are many variants of cluster N-grams. The simplest one is sometimes known as IBM clustering, after its originators (Brown et al., 1992). IBM clustering is a kind of hard clustering, in which each word can belong to only one class. The model estimates the conditional probability of a word w_i by multiplying two factors: the probability of the word's class c_i given the preceding classes (based on an N-gram of classes), and the probability of w_i given c_i. Here is the IBM model in bigram form:

    P(w_i \mid w_{i-1}) \approx P(c_i \mid c_{i-1}) \times P(w_i \mid c_i)

If we had a training corpus in which we knew the class for each word, the maximum likelihood estimates of the probability of the word given the class and the probability of the class given the previous class could be computed as follows:

    P(w \mid c) = \frac{C(w)}{C(c)}

    P(c_i \mid c_{i-1}) = \frac{C(c_{i-1} c_i)}{\sum_{c} C(c_{i-1} c)}

Cluster N-grams are generally used in two ways. In dialog systems we often hand-design domain-specific word classes. Thus for an airline information system, we might ...