4.3.2 Unknown Words: Open Versus Closed Vocabulary Tasks

Sometimes we have a language task in which we know all the words that can occur,
and hence we know the vocabulary size V in advance. The closed vocabulary assumption is the assumption that we have such a lexicon and that the test set can only contain words from this lexicon. The closed vocabulary task thus assumes there are no
unknown words. But of course this is a simplification; as we suggested earlier, the number of unseen words grows constantly, so we can't possibly know in advance exactly how many there are, and we'd like our model to do something reasonable with them. We call these unseen events unknown words, or out-of-vocabulary (OOV) words. The percentage of OOV words that appear in the test set is called the OOV rate.

An open vocabulary system is one in which we model these potential unknown
words in the test set by adding a pseudoword called <UNK>. We can train the probabilities of the unknown word model <UNK> as follows:

1. Choose a vocabulary (word list) that is fixed in advance.
2. Convert in the training set any word that is not in this set (any OOV word) to the unknown word token <UNK> in a text normalization step.
3. Estimate the probabilities for <UNK> from its counts just like any other regular word in the training set.

An alternative that doesn't require choosing a vocabulary is to replace the first occurrence of every word type in the training data by <UNK>.
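As a concrete illustration, here is a minimal sketch (in Python, with hypothetical helper names) of the text-normalization step that maps OOV words to <UNK> before counts are estimated; the vocabulary-building shortcut is an assumption, not part of the text above:

```python
import collections

def build_vocab_and_unk(train_tokens, min_count=1, vocab=None):
    """Map OOV tokens to <UNK> so an unknown-word probability can be trained.

    If `vocab` is None, it is built from the training tokens themselves
    (a stand-in for the fixed word list the text assumes is chosen in advance).
    """
    if vocab is None:
        counts = collections.Counter(train_tokens)
        vocab = {w for w, c in counts.items() if c >= min_count}
    normalized = [w if w in vocab else "<UNK>" for w in train_tokens]
    return vocab, normalized

# Any test word outside the vocabulary is likewise scored as <UNK>.
vocab, train = build_vocab_and_unk("i want to eat chinese food".split())
test = [w if w in vocab else "<UNK>" for w in "i want to eat thai food".split()]
```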
4.4 Evaluating N-Grams: Perplexity

The best way to evaluate the performance of a language model is to embed it in an
application and measure the total performance of the application. Such end-to-end evaluation is called extrinsic evaluation, and also sometimes called in vivo evaluation
(Sparck Jones and Galliers, 1996). Extrinsic evaluation is the only way to know if a
particular improvement in a component is really going to help the task at hand. Thus,
for speech recognition, we can compare the performance of two language models by
running the speech recognizer twice, once with each language model, and seeing which
gives the more accurate transcription. Unfortunately, end-to-end evaluation is often very expensive; evaluating a large
speech recognition test set, for example, takes hours or even days. Thus, we would
like a metric that can be used to quickly evaluate potential improvements in a language
model. An intrinsic evaluation metric is one that measures the quality of a model
independent of any application. Perplexity is the most common intrinsic evaluation
metric for N-gram language models. While an (intrinsic) improvement in perplexity does not guarantee an (extrinsic) improvement in speech recognition performance (or any other end-to-end metric), it often correlates with such improvements. Thus, it is
commonly used as a quick check on an algorithm, and an improvement in perplexity
can then be confirmed by an end-to-end evaluation.

The intuition of perplexity is that given two probabilistic models, the better model
is the one that has a tighter ﬁt to the test data or that better predicts the details of the test
data. We can measure better prediction by looking at the probability the model assigns
to the test data; the better model will assign a higher probability to the test data. More
formally, the perplexity (sometimes called PP for short) of a language model on a test
set is a function of the probability that the language model assigns to that test set. For
a test set W = w_1 w_2 \ldots w_N, the perplexity is the probability of the test set, normalized by the number of words:

PP(W) = P(w_1 w_2 \ldots w_N)^{-\frac{1}{N}} = \sqrt[N]{\frac{1}{P(w_1 w_2 \ldots w_N)}}    (4.16)

We can use the chain rule to expand the probability of W:

PP(W) = \sqrt[N]{\prod_{i=1}^{N} \frac{1}{P(w_i \mid w_1 \ldots w_{i-1})}}    (4.17)

Thus, if we are computing the perplexity of W with a bigram language model, we get:

PP(W) = \sqrt[N]{\prod_{i=1}^{N} \frac{1}{P(w_i \mid w_{i-1})}}    (4.18)

Note that because of the inverse in Eq. 4.17, the higher the conditional probability
of the word sequence, the lower the perplexity. Thus, minimizing perplexity is equiva
lent to maximizing the test set probability according to the language model. What we
generally use for word sequence in Eq. 4.17 or Eq. 4.18 is the entire sequence of words
in some test set. Since of course this sequence will cross many sentence boundaries, we
need to include the begin- and end-sentence markers <s> and </s> in the probability computation. We also need to include the end-of-sentence marker </s> (but not the beginning-of-sentence marker <s>) in the total count of word tokens N.

There is another way to think about perplexity: as the weighted average branching
factor of a language. The branching factor of a language is the number of possible next
words that can follow any word. Consider the task of recognizing the digits in English
(zero, one, two,..., nine), given that each of the 10 digits occurs with equal probability
P = 1/10. The perplexity of this mini-language is in fact 10. To see that, imagine a string of digits of length N. By Eq. 4.17, the perplexity will be

PP(W) = P(w_1 w_2 \ldots w_N)^{-\frac{1}{N}} = \left(\left(\tfrac{1}{10}\right)^{N}\right)^{-\frac{1}{N}} = \left(\tfrac{1}{10}\right)^{-1} = 10    (4.19)
But suppose that the number zero is really frequent and occurs 10 times more often than other numbers. Now we should expect the perplexity to be lower since most of the time the next number will be zero. Thus, although the branching factor is still 10, the perplexity or weighted branching factor is smaller. We leave this calculation as an exercise to the reader. We see in Section 4.10 that perplexity is also closely related to the information-theoretic notion of entropy.

Finally, let's look at an example of how perplexity can be used to compare different
N-gram models. We trained unigram, bigram, and trigram grammars on 38 million words (including start-of-sentence tokens) from the Wall Street Journal, using a 19,979-word vocabulary.[6] We then computed the perplexity of each of these models on a test set of 1.5 million words with Eq. 4.18. The table below shows the perplexity of a 1.5 million word WSJ test set according to each of these grammars.

                 Unigram   Bigram   Trigram
    Perplexity     962       170      109
As we see above, the more information the N-gram gives us about the word sequence, the lower the perplexity (since, as Eq. 4.17 showed, perplexity is related inversely to the likelihood of the test sequence according to the model).

Note that in computing perplexities, the N-gram model P must be constructed without any knowledge of the test set. Any kind of knowledge of the test set can cause the perplexity to be artificially low. For example, we defined above the closed vocabulary task, in which the vocabulary for the test set is specified in advance. This can greatly reduce the perplexity. As long as this knowledge is provided equally to each of the models we are comparing, the closed vocabulary perplexity can still be useful for comparing models, but care must be taken in interpreting the results. In general, the perplexity of two language models is only comparable if they use the same vocabulary.
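To make the computation concrete, here is a minimal sketch of Eq. 4.18 in Python; the function bigram_prob is a hypothetical stand-in for whatever smoothed bigram estimate is being evaluated, and the sum is done in log space to avoid numerical underflow (a practical point the toolkits discussed later also follow):

```python
import math

def bigram_perplexity(test_sentences, bigram_prob):
    """Perplexity of a test set under a bigram model (Eq. 4.18).

    `bigram_prob(w_prev, w)` is assumed to return P(w | w_prev); it must be
    smoothed so it never returns zero. </s> counts toward N, <s> does not.
    """
    log_prob, n_tokens = 0.0, 0
    for sent in test_sentences:
        words = ["<s>"] + sent + ["</s>"]
        for prev, w in zip(words, words[1:]):
            log_prob += math.log(bigram_prob(prev, w))
            n_tokens += 1          # counts </s> but not <s>
    return math.exp(-log_prob / n_tokens)
```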
4.5 Smoothing

    Never do I ever want to hear another word!
    There isn't one I haven't heard.
        Eliza Doolittle in Alan Jay Lerner's My Fair Lady

There is a major problem with the maximum likelihood estimation process we have
seen for training the parameters of an N-gram model. This is the problem of sparse data caused by the fact that our maximum likelihood estimate was based on a particular

[6] Katz backoff grammars with Good-Turing discounting trained on 38 million words from the WSJ0 corpus (LDC, 1993), open vocabulary, using the <UNK> token; see later sections for definitions.
set of training data. For any N-gram that occurred a sufficient number of times, we might have a good estimate of its probability. But because any corpus is limited, some perfectly acceptable English word sequences are bound to be missing from it. This missing data means that the N-gram matrix for any given training corpus is bound to have a very large number of cases of putative "zero probability N-grams" that should really have some non-zero probability.

We need a method which can help get better estimates for these zero- or low-
frequency counts. Zero counts turn out to cause another huge problem. The perplexity
metric deﬁned above requires that we compute the probability of each test sentence.
But if a test sentence has an N-gram that never appeared in the training set, the maximum likelihood estimate of the probability for this N-gram, and hence for the whole test sentence, will be zero! This means that in order to evaluate our language models, we need to modify the MLE method to assign some non-zero probability to any N-gram, even one that was never observed in training.

For these reasons, we'll want to modify the maximum likelihood estimates for
computing Ngram probabilities, focusing on the N—gram events that we incorrectly
assumed had zero probability. We use the term smoothing for such modiﬁcations that
address the poor estimates that are due to variability in small data sets. The name
comes from the fact that (looking ahead a bit) we will be shaving a little bit of proba
bility mass from the higher counts and piling it instead on the zero counts, making the
distribution a little less jagged. In the next few sections we introduce some smoothing algorithms and show how
they modify the Berkeley Restaurant bigram probabilities in Fig. 4.2.

4.5.1 Laplace Smoothing

One simple way to do smoothing is to take our matrix of bigram counts, before we normalize them into probabilities, and add 1 to all the counts. This algorithm is called Laplace smoothing, or Laplace's Law (Lidstone, 1920; Johnson, 1932; Jeffreys, 1948). Laplace smoothing does not perform well enough to be used in modern N-gram models, but it usefully introduces many of the concepts seen in other smoothing algorithms and gives a useful baseline.
Applied to unigrams, Laplace smoothing adds one to each count c_i; since there are V words in the vocabulary, the denominator N is augmented by V so that the probabilities still sum to 1:

P_{Laplace}(w_i) = \frac{c_i + 1}{N + V}    (4.20)

Instead of changing both the numerator and denominator, it is convenient to describe how a smoothing algorithm affects the numerator, by defining an adjusted count c*. This adjusted count is easier to compare directly with the MLE counts and can be turned into a probability like an MLE count by normalizing by N. To define this count, since we are only changing the numerator, in addition to adding 1 we'll also need to multiply by a normalization factor N/(N+V):

c_i^* = (c_i + 1)\frac{N}{N + V}

We can now turn c_i^* into a probability p_i^* by normalizing by N.
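A minimal sketch of this unigram case in Python (the token list and the small example are hypothetical, not from the text):

```python
from collections import Counter

def laplace_unigram(tokens):
    """Add-one (Laplace) smoothed unigram estimates (Eq. 4.20) together with
    the adjusted counts c* = (c_i + 1) * N / (N + V)."""
    counts = Counter(tokens)
    N, V = sum(counts.values()), len(counts)
    probs = {w: (c + 1) / (N + V) for w, c in counts.items()}
    adjusted = {w: (c + 1) * N / (N + V) for w, c in counts.items()}
    return probs, adjusted

probs, adjusted = laplace_unigram("to eat to eat chinese food".split())
# 'to' occurs 2 times out of N=6 tokens with V=4 types:
# P_Laplace('to') = 3/10 = 0.3, adjusted count c* = 3 * 6/10 = 1.8
print(probs["to"], adjusted["to"])
```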
A related way to view smoothing is as discounting (lowering) some nonzero
counts in order to get the probability mass that will be assigned to the zero counts.
Thus, instead of referring to the discounted counts c*, we might describe a smoothing algorithm in terms of a relative discount d_c, the ratio of the discounted counts to the original counts:

d_c = \frac{c^*}{c}    (4.21)

Now that we have the intuition for the unigram case, let's smooth our Berkeley Restaurant Project bigrams. Figure 4.5 shows the add-one smoothed counts for the
bigrams in Fig. 4.1.

              i   want   to    eat  chinese  food  lunch  spend
    i         6    828    1     10     1       1     1      3
    want      3     1    609     2     7       7     6      2
    to        3     1     5    687     3       1     7    212
    eat       1     1     3      1    17       3    43      1
    chinese   2     1     1      1     1      83     2      1
    food     16     1    16      1     2       5     1      1
    lunch     3     1     1      1     1       2     1      1
    spend     2     1     2      1     1       1     1      1

Figure 4.5  Add-one smoothed bigram counts for eight of the words (out of V = 1446) in the Berkeley Restaurant Project corpus of 9332 sentences. Previously zero counts are in gray.

Figure 4.6 shows the add-one smoothed probabilities for the bigrams in Fig. 4.2. Recall that normal bigram probabilities are computed by normalizing each row of
counts by the unigram count:

P(w_n \mid w_{n-1}) = \frac{C(w_{n-1}w_n)}{C(w_{n-1})}    (4.22)

For add-one smoothed bigram counts, we need to augment the unigram count by the number of total word types in the vocabulary V:

P^*_{Laplace}(w_n \mid w_{n-1}) = \frac{C(w_{n-1}w_n) + 1}{C(w_{n-1}) + V}    (4.23)

Thus, each of the unigram counts given in the previous section will need to be
augmented by V = 1446. The result is the smoothed bigram probabilities in Fig. 4.6.

              i        want     to       eat      chinese  food     lunch    spend
    i        0.0015   0.21     0.00025  0.0025   0.00025  0.00025  0.00025  0.00075
    want     0.0013   0.00042  0.26     0.00084  0.0029   0.0029   0.0025   0.00084
    to       0.00078  0.00026  0.0013   0.18     0.00078  0.00026  0.0018   0.055
    eat      0.00046  0.00046  0.0014   0.00046  0.0078   0.0014   0.02     0.00046
    chinese  0.0012   0.00062  0.00062  0.00062  0.00062  0.052    0.0012   0.00062
    food     0.0063   0.00039  0.0063   0.00039  0.00079  0.002    0.00039  0.00039
    lunch    0.0017   0.00056  0.00056  0.00056  0.00056  0.0011   0.00056  0.00056
    spend    0.0012   0.00058  0.0012   0.00058  0.00058  0.00058  0.00058  0.00058

Figure 4.6  Add-one smoothed bigram probabilities for eight of the words (out of V = 1446) in the BeRP corpus of 9332 sentences. Previously zero probabilities are in gray.

It is often convenient to reconstruct the count matrix so we can see how much a
smoothing algorithm has changed the original counts. These adjusted counts can be computed by Eq. 4.24:

c^*(w_{n-1}w_n) = \frac{\left[C(w_{n-1}w_n) + 1\right] \times C(w_{n-1})}{C(w_{n-1}) + V}    (4.24)

Figure 4.7 shows the reconstructed counts.

              i      want    to     eat    chinese  food  lunch  spend
    i        3.8     527    0.64    6.4     0.64    0.64   0.64   1.9
    want     1.2    0.39    238    0.78     2.7     2.7    2.3    0.78
    to       1.9    0.63    3.1    430      1.9     0.63   4.4    133
    eat      0.34   0.34    1      0.34     5.8     1      15     0.34
    chinese  0.2    0.098   0.098  0.098    0.098   8.2    0.2    0.098
    food     6.9    0.43    6.9    0.43     0.86    2.2    0.43   0.43
    lunch    0.57   0.19    0.19   0.19     0.19    0.38   0.19   0.19
    spend    0.32   0.16    0.32   0.16     0.16    0.16   0.16   0.16

Figure 4.7  Add-one reconstituted counts for eight words (of V = 1446) in the BeRP corpus of 9332 sentences. Previously zero counts are in gray.

Note that add-one smoothing has made a very big change to the counts. C(want to)
changed from 608 to 238! We can see this in probability space as well: P(to|want)
decreases from .66 in the unsmoothed case to .26 in the smoothed case. Looking at the
discount d (the ratio between new and old counts) shows us how strikingly the counts
for each preﬁx word have been reduced; the discount for the bigram want to is .39,
while the discount for Chinese food is .10, a factor of 10! The sharp change in counts and probabilities occurs because too much probability
mass is moved to all the zeros. We could move a bit less mass by adding a frac
tional count rather than 1 (add-δ smoothing; (Lidstone, 1920; Johnson, 1932; Jeffreys,
1948)), but this method requires a method for choosing δ dynamically, results in an inappropriate discount for many counts, and turns out to give counts with poor variances.
For these and other reasons (Gale and Church, 1994), we’ll need better smoothing
methods for N-grams like the ones we show in the next section.

4.5.2 Good-Turing Discounting

There are a number of much better discounting algorithms that are only slightly more
complex than addone smoothing. In this section we introduce one of them, known
as Good-Turing smoothing. The Good-Turing algorithm was first described by Good
(1953), who credits Turing with the original idea.

The intuition of a number of discounting algorithms (Good-Turing, Witten-Bell
discounting, and KneserNey smoothing) is to use the count of things you’ve seen
once to help estimate the count of things you've never seen. A word or N-gram (or any event) that occurs once is called a singleton, or a hapax legomenon. The Good-Turing intuition is to use the frequency of singletons as a re-estimate of the frequency of zero-count bigrams.

Let's formalize the algorithm. The Good-Turing algorithm is based on computing N_c, the number of N-grams that occur c times. We refer to the number of N-grams that occur c times as the frequency of frequency c. So applying the idea to smoothing the joint probability of bigrams, N_0 is the number of bigrams with count 0, N_1 the number of bigrams with count 1 (singletons), and so on. We can think of each of the N_c as a bin that stores the number of different N-grams that occur in the training set with that frequency c. More formally:

N_c = \sum_{x : \mathrm{count}(x) = c} 1    (4.25)

The MLE count for N_c is c. The Good-Turing intuition is to estimate the probability
of things that occur c times in the training corpus by the MLE probability of things that occur c + 1 times in the corpus. So the Good-Turing estimate replaces the MLE count c for N_c with a smoothed count c* that is a function of N_{c+1}:

c^* = (c+1)\frac{N_{c+1}}{N_c}    (4.26)

We can use Eq. 4.26 to replace the MLE counts for all the bins N_1, N_2, and so on. Instead of using this equation directly to re-estimate the smoothed count c* for N_0, we use the following equation for the probability P^*_{GT} for things that had zero count N_0, or what we might call the missing mass:

P^*_{GT}(\text{things with frequency zero in training}) = \frac{N_1}{N}    (4.27)

Here N_1 is the count of items in bin 1, that is, that were seen once in training, and N is the total number of items we have seen in training. Equation 4.27 thus gives the probability that the (N+1)st bigram we see will be one that we never saw in training. Showing that Eq. 4.27 follows from Eq. 4.26 is left as Exercise 4.8 for the reader.

To make this concrete, consider an example (created by Joshua Goodman, drawn by analogy from work on estimating populations of animal species): suppose we are fishing in a lake with 8 species (bass, carp, catfish, eel, perch, salmon, trout, whitefish), and so far we have seen 18 fish: 10 carp, 3 perch, 2 whitefish, 1 trout, 1 salmon, and 1 eel. How likely is it that the next fish we catch will be a species we have not yet seen, in this case, either a catfish or a bass? The MLE count c of a hitherto-unseen species (bass or catfish) is 0. But Eq. 4.27 tells us that the probability of a new fish being one of these unseen species is 3/18, since N_1 is 3 and N is 18:

P^*_{GT}(\text{unseen}) = \frac{N_1}{N} = \frac{3}{18} = .17    (4.28)

The MLE probability of, say, trout would have been 1/18; our revised estimate must be lower, since we just stole 3/18 of our probability mass to use on unseen events. We'll need
to discount the MLE probabilities for trout, perch, carp, etc. In summary, the revised counts c* and Good-Turing smoothed probabilities p^*_{GT} for species with count 0 (like bass or catfish) or count 1 (like trout, salmon, or eel) are as follows:

                    Unseen (bass or catfish)                     trout
    c               0                                            1
    MLE p           p = 0/18 = 0                                 1/18
    c*                                                           c*(trout) = 2 × N_2/N_1 = 2 × 1/3 = .67
    GT p*_GT        p*_GT(unseen) = N_1/N = 3/18 = .17           p*_GT(trout) = .67/18 = 1/27 = .037

Note that the revised count c* for trout was discounted from c = 1.0 to c* = .67 (thus leaving some probability mass p*_GT(unseen) = 3/18 = .17 for the catfish and bass). And since we know there were 2 unknown species, the probability of the next fish being specifically a catfish is p*_GT(catfish) = 1/2 × 3/18 = .085.

4.5.3 Some Advanced Issues in Good-Turing Estimation

Good-Turing estimation assumes that the distribution of each bigram is binomial (see Church et al., 1991) and assumes we know N_0, the number of bigrams we haven't seen.
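A minimal Python sketch of Eqs. 4.26-4.27 applied to the fish example above; it assumes every needed N_{c+1} is non-zero, which real implementations fix with Simple Good-Turing smoothing of the N_c values (discussed below):

```python
from collections import Counter

def good_turing_counts(ngram_counts):
    """Good-Turing re-estimates (Eq. 4.26) and the missing mass (Eq. 4.27)."""
    N = sum(ngram_counts.values())                    # total observations
    Nc = Counter(ngram_counts.values())               # frequency of frequency c
    c_star = {c: (c + 1) * Nc[c + 1] / Nc[c] for c in Nc if Nc.get(c + 1)}
    p_unseen = Nc[1] / N                              # total mass for zero counts
    return c_star, p_unseen

# 10 carp, 3 perch, 2 whitefish, 1 trout, 1 salmon, 1 eel (18 fish total).
counts = {"carp": 10, "perch": 3, "whitefish": 2, "trout": 1, "salmon": 1, "eel": 1}
c_star, p0 = good_turing_counts(counts)
print(round(c_star[1], 2), round(p0, 2))   # 0.67 and 0.17, matching the table
```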
GoodTuring ____________________—————————————
AP Newswire Berkeley Restaurant c (MLE) NC 0* (GT) 0 (MLE) NC 0* (GT) 0 74,671,100,000 0.0000270 0 2,081,496 0.002553
1 2,018,046 0.446 1 5315 0.533960
2 449,721 1.26 2 1419 1.357294
3 188,933 2.24 3 642 2.373832
4 105,668 3.24 4 381 4.081365
5 68,379 4.22 5 311 3.781350
6 48,190 5.19 6 196 4.500000

Figure 4.8  Bigram "frequencies of frequencies" and Good-Turing re-estimations for the
22 million AP bigrams from Church and Gale (1991) and from the Berkeley Restaurant corpus of 9332 sentences.

There are a number of additional complexities in the use of Good-Turing. For
example, we don't just use the raw N_c values in Eq. 4.26. This is because the re-estimate c* for N_c depends on N_{c+1}; hence, Eq. 4.26 is undefined when N_{c+1} = 0. Such
zeros occur quite often. In our sample problem above, for example, since N4 = 0, how
can we compute N3? One solution to this is called Simple GoodTuring (Gale and
Sampson, 1995). In Simple Good-Turing, after we compute the bins N_c but before we compute Eq. 4.26 from them, we smooth the N_c counts to replace any zeros in the sequence. The simplest thing is just to replace the value N_c with a value computed from a linear regression that is fit to map N_c to c in log space (see Gale and Sampson (1995) for details):

\log(N_c) = a + b \log(c)    (4.29)

In addition, in practice, the discounted estimate c* is not used for all counts c. Large counts (where c > k for some threshold k) are assumed to be reliable. Katz (1987) suggests setting k at 5. Thus, we define

c^* = c \quad \text{for } c > k    (4.30)

The correct equation for c* when some k is introduced (from Katz (1987)) is

c^* = \frac{(c+1)\frac{N_{c+1}}{N_c} - c\frac{(k+1)N_{k+1}}{N_1}}{1 - \frac{(k+1)N_{k+1}}{N_1}}, \quad \text{for } 1 \le c \le k.    (4.31)
Finally, with Good-Turing and other discounting, it is usual to treat N-grams with low raw counts (especially counts of 1) as if the count were 0, that is, to apply Good-Turing discounting to these as if they were unseen, and then to apply smoothing. Good-Turing discounting is not used by itself in discounting N-grams; it is only used in combination with the backoff and interpolation algorithms described in the next sections.

4.6 Interpolation

The discounting we have been discussing so far can help solve the problem of zero-frequency N-grams. But there is an additional source of knowledge we can draw on.
If we are trying to compute P(w_n | w_{n-2}w_{n-1}) but we have no examples of a particular trigram w_{n-2}w_{n-1}w_n, we can instead estimate its probability by using the bigram probability P(w_n | w_{n-1}). Similarly, if we don't have counts to compute P(w_n | w_{n-1}), we can look to the unigram P(w_n).

There are two ways to use this N-gram "hierarchy": backoff and interpolation. In backoff, if we have non-zero trigram counts, we rely solely on the trigram counts. We only "back off" to a lower-order N-gram if we have zero evidence for a higher-order N-gram. By contrast, in interpolation, we always mix the probability estimates from all the N-gram estimators, that is, we do a weighted interpolation of trigram, bigram, and unigram counts.

In simple linear interpolation, we combine different order N-grams by linearly interpolating all the models. Thus, we estimate the trigram probability P(w_n | w_{n-2}w_{n-1}) by mixing together the unigram, bigram, and trigram probabilities, each weighted by a λ:
\hat{P}(w_n \mid w_{n-2}w_{n-1}) = \lambda_1 P(w_n \mid w_{n-2}w_{n-1}) + \lambda_2 P(w_n \mid w_{n-1}) + \lambda_3 P(w_n)    (4.32)

such that the λs sum to 1:

\sum_i \lambda_i = 1    (4.33)
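As a quick sketch of Eq. 4.32 in Python (the probability functions and the λ values are placeholders; in practice the λs are tuned on a held-out corpus, as described below):

```python
def interpolated_trigram_prob(w, prev2, prev1, p_tri, p_bi, p_uni,
                              lambdas=(0.6, 0.3, 0.1)):
    """Simple linear interpolation (Eq. 4.32):
    lambda1*P(w|prev2,prev1) + lambda2*P(w|prev1) + lambda3*P(w).
    The lambdas must sum to 1."""
    l1, l2, l3 = lambdas
    return l1 * p_tri(w, prev2, prev1) + l2 * p_bi(w, prev1) + l3 * p_uni(w)
```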
In a slightly more sophisticated version of linear interpolation, each λ weight is computed by conditioning on the context. This way, if we have particularly accurate counts for a particular bigram, we assume that the counts of the trigrams based on this bigram will be more trustworthy, so we can make the λs for those trigrams higher and thus give that trigram more weight in the interpolation. Equation 4.34 shows the equation for interpolation with context-conditioned weights:

\hat{P}(w_n \mid w_{n-2}w_{n-1}) = \lambda_1(w_{n-2}^{n-1}) P(w_n \mid w_{n-2}w_{n-1}) + \lambda_2(w_{n-2}^{n-1}) P(w_n \mid w_{n-1}) + \lambda_3(w_{n-2}^{n-1}) P(w_n)    (4.34)
preVio
How are these λ values set? Both the simple interpolation and conditional interpolation λs are learned from a held-out corpus. Recall from Section 4.3 that a held-out corpus is an additional training corpus that we use, not to set the N-gram counts but to set other parameters. In this case, we can use such data to set the λ values. We can do this by choosing the λ values that maximize the likelihood of the held-out corpus. That
d of the heldout corpus Section 4.7. Backoff 105 is, we ﬁx the N—gram probabilities and then search for the A values that when plugged
into Eq. 4.32 give us the highest probability of the heldout set. There are various ways
to ﬁnd this optimal set of As. One way is to use the EM algorithm deﬁned in Chapter 6,
which is an iterative learning algorithm that converges on locally optimal As (Baum,
1972; Dempster et al., 1977; Jelinek and Mercer, 1980).

4.7 Backoff

While simple interpolation is indeed simple to understand and implement, it turns out
that there are a number of better algorithms. One of these is backoff Ngram modeling.
The version of backoff that we describe uses Good’lhring discounting as well. It was
introduced by Katz (1987); hence, this kind of backoff with discounting is also called
Katz backoff. In a Katz backoff N  gram model, if the N  gram we need has zero counts, we approximate it by backing off to the (N1)gram. We continue backing off until we
reach a history that has some counts:

P_{katz}(w_n \mid w^{n-1}_{n-N+1}) =
  \begin{cases}
    P^*(w_n \mid w^{n-1}_{n-N+1}), & \text{if } C(w^{n}_{n-N+1}) > 0 \\
    \alpha(w^{n-1}_{n-N+1})\, P_{katz}(w_n \mid w^{n-1}_{n-N+2}), & \text{otherwise.}
  \end{cases}    (4.35)

Equation 4.35 shows that the Katz backoff probability for an N-gram just relies
on the (discounted) probability P" if we’ve seen this Ngram before (i.e., if we have
nonzero counts). Otherwise, we recursively back off to the Katz probability for the
shorterhistory (N —l) gram. We’ll deﬁne the discounted probability P‘, the normalizing
factor a, and other details about dealing with zero counts in Section 4.7.1. Based on
these details, the trigram version of backoff might be represented as follows (where for pedagogical clarity, since it’s easy to confuse the indices wi,w,_1 and so on, we refer
to the three words in a sequence as x, y, z in that order):

P_{katz}(z \mid x,y) =
  \begin{cases}
    P^*(z \mid x,y), & \text{if } C(x,y,z) > 0 \\
    \alpha(x,y)\, P_{katz}(z \mid y), & \text{else if } C(x,y) > 0 \\
    P^*(z), & \text{otherwise.}
  \end{cases}    (4.36)

P_{katz}(z \mid y) =
  \begin{cases}
    P^*(z \mid y), & \text{if } C(y,z) > 0 \\
    \alpha(y)\, P^*(z), & \text{otherwise.}
  \end{cases}    (4.37)

Katz backoff incorporates discounting as an integral part of the algorithm. Our
previous discussions of discounting showed how a method like GoodTuring could be
used to assign probability mass to unseen events. For simplicity, we assumed that these
unseen events were all equally probable, and so the probability mass was distributed
evenly among all unseen events. Katz backoff gives us a better way to distribute the
probability mass among unseen trigram events by relying on information from uni
grams and bigrams. We use discounting to tell us how much total probability mass to set aside for the events we haven't seen, and the α values to tell us how to distribute this probability. Discounting is implemented by using discounted probabilities P*() rather than MLE probabilities P() in Eq. 4.35 and Eq. 4.37.

Why do we need discounts and α values in Eq. 4.35 and Eq. 4.37? Why couldn't we just use the MLE probabilities without them? Because without discounts and α weights, backing off would add extra probability mass to the distribution, and the total probability of a word would be greater than 1; the discounted P* sets aside mass for the lower-order N-grams, and α distributes it. We thus use P* to mean the discounted conditional probability
of an N-gram (and save P for MLE probabilities):

P^*(w_n \mid w^{n-1}_{n-N+1}) = \frac{c^*(w^{n}_{n-N+1})}{C(w^{n-1}_{n-N+1})}    (4.39)

Because the discounted count c* is less than the MLE count c, this probability P* will be slightly less than the MLE probability, which leaves some probability mass for the lower-order N-grams, as detailed in Section 4.7.1. Figure 4.9 shows the resulting Good-Turing smoothed bigram probabilities for sample words computed from the BeRP corpus of 9332 sentences:
to 0 000512 0 00152 0 00165 0 284 0.000512 0.0017 0 00175 0 0873
eat 0.00101 0 00152 0 00166 0 00189 0.0214 0 00166 0 0563 0 000585
Chinese 0.00283 0 00152 0 00248 0 00189 0.000205 0 519 0 00283 0 000585
feed 0.0137 0 00152 0.0137 0 00189 0.000409 0.00366 0.00073 0.000585
lunch 0.00363 0.00152 0.00248 0.00189 0.000205 0.00131 0.00073 0.0005 85
spend 0.00161 0.00152 0.00161 0.00189 0.000205 0.0017 0.00073 0.000585

Figure 4.9  Good-Turing smoothed bigram probabilities for eight words (of V = 1446) in the BeRP corpus of 9332 sentences.
.9 0.00283
)366 0.00073
3131 0.00073 , 0073 . _. 017 0 0 oﬁ 933 446) in the BeRP COYP‘ls II . _.‘.'_.._. _ #24. Section 4.7. Backoff 107 4.7.1 Advanced: Details of Computing Katz Backoff a and P* In this section we give the remaining details of the computation of the discounted prob
ability P" and the backoff weights a(w). We begin with a, which passes the leftover probability mass to the lowerorder
N  grams. Let’s represent the total amount of leftover probability mass by the function
[3, a function of the (N1)gram context. For a given (N1)gram context, the total
leftover probability mass can be computed by subtracting from 1 the total discounted
probability mass for all N grams starting with that context: ﬁ(M/'n2}q+1)= 1— Z P‘(w»w::},+1) (4.40) w,,:(:()4"'"__N+l )>0 This gives us the total probability mass that we are ready to distribute to all (N1)
gram (e.g., bigrams if our original model was a trigram). Each individual (N1)gram
(bigram) will only get a fraction of this mass, so we need to normalize B by the total
probability of all the (N1)grams (bigrams) that begin some Ngram (trigram) that has
zero count. The ﬁnal equation for computing how much probability mass to distribute
from an N-gram to an (N-1)-gram is represented by the function α:

\alpha(w^{n-1}_{n-N+1}) = \frac{\beta(w^{n-1}_{n-N+1})}{\sum_{w_n : C(w^{n}_{n-N+1}) = 0} P_{katz}(w_n \mid w^{n-1}_{n-N+2})}
                        = \frac{1 - \sum_{w_n : C(w^{n}_{n-N+1}) > 0} P^*(w_n \mid w^{n-1}_{n-N+1})}{1 - \sum_{w_n : C(w^{n}_{n-N+1}) > 0} P^*(w_n \mid w^{n-1}_{n-N+2})}    (4.41)

Note that α is a function of the preceding word string, that is, of w^{n-1}_{n-N+1}; thus
the amount by which we discount each trigram (d) and the mass that gets reassigned
to lowerorder Ngrams (a) are recomputed for every (N1)gram that occurs in any
Ngram. We need to specify what to do when the counts of an (N1)gram context are 0,
(i.e., when C(w^{n-1}_{n-N+1}) = 0), and our definition is complete:

P_{katz}(w_n \mid w^{n-1}_{n-N+1}) = P_{katz}(w_n \mid w^{n-1}_{n-N+2}) \quad \text{if } C(w^{n-1}_{n-N+1}) = 0    (4.42)

and

P^*(w_n \mid w^{n-1}_{n-N+1}) = 0 \quad \text{if } C(w^{n-1}_{n-N+1}) = 0    (4.43)

and

\beta(w^{n-1}_{n-N+1}) = 1 \quad \text{if } C(w^{n-1}_{n-N+1}) = 0    (4.44)
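To tie the pieces together, here is a minimal sketch of the bigram case of the Katz recursion (Eq. 4.37); the three callables are placeholders for the discounted probabilities and α weights precomputed from training counts as in Eqs. 4.39-4.41:

```python
def katz_bigram_prob(z, y, p_star, alpha, unigram_p_star):
    """Katz backoff for bigrams (Eq. 4.37), as a rough sketch.

    p_star(y, z) is assumed to return the discounted probability P*(z|y), or
    None when the bigram was unseen; alpha(y) is the normalizing backoff
    weight of Eq. 4.41; unigram_p_star(z) is the discounted unigram estimate.
    """
    discounted = p_star(y, z)
    if discounted is not None:           # bigram was seen: use P*(z|y)
        return discounted
    return alpha(y) * unigram_p_star(z)  # otherwise back off to the unigram
```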
bigram: logP*(w,]w,_1) wi_1w,
tﬁgram 103P*(WilWi—2,Wi—1) Wi—2Wi—1Wi Figure 4.10 shows an ARPA formatted LM ﬁle with selected Ngrams from the BeRP corpus. Given one of these trigrams, the probability P(z[x, y) for the word se
quence x, y, z can be computed as follows (repeated from (4.37)): log a (w)
108 01(Wt—1Wt) P*(z‘xvy)v ifC(x,y,z) > O
[Iratz (le09 = a(x,y)Pkatz (zly), else if C(x, y) > 0 (4.46)
P“ (Z), otherwise.
P“ (21y), if C(y, z) > o
Pkatz (Z‘y) — { (10013»: (2), otherwise. Toolkits: Two commonly used toolkits for building language models are the SRILM
toolkit (Stolcke, 2002) and the CambridgeCMU toolkit (Clarkson and Rosenfeld, 1997). Both are publicly available and have similar functionality. In training mode, Turing or KneserNey, discussed in Section 4.9.1), and various thresholds. The output
is a language model in ARPA format. In perplexity or decoding mode, the toolkits KneserNey Bec
cess
brie
£1132
guaé 4.9. One
lated Absc
the C
quent '— d. We represent and
nderﬂow and also to
ban 1, the more prob—
Multiplying enough
g log probabilities in
. Adding in log space
~robabilities by adding
te than multiplication.
that when we need to +logp4) ARPA format.
followed by a list of all
)y bigrams, followed by
;counted log prob
; are only necessary for
omputed for the highest in the endof—sequence Jgram is WOW)
g a(Wi—1Wi) :lected N 'grams from the
P(z\x,)’) for the Word 37)): Lxd>0
if C(x,y) > 0
:rwise.  0 uage mo 1 (Clarkson and ‘tionality. In trai I _ I
' mar
th words separated by W . .I‘
ounting (000‘. irious thresholds. The out?“
decoding mode, type of disc (4.45)
An ability the ' »  __‘._‘_v  dels are the SRIL‘ '
Rosenfcllli
m’ng 111003”... ' KneserNey Section 4.9. Advanced Issues in Language Modeling 109 \data\ ngram 1=1447
ngram 2=9420
ngram 3=5201 \1—grm:\ 0 .8679678 </a> 99 <s> .068532
4.743076 chowfun .1943932
4 .266155 fries .. 5432162
—3 . 175167 thursday . 7510199
1.776296 want: .04292 \2grams:\ —0.6077676 ' 0.6257131
—0.4861297 ' 0.0425899
2.832415 0.06423882
0.546952‘5 0.008193135
—0 . 09403705 \3grams:\
.579416 i prefer
. 148009 about fifteen
. 4 120701 90 to
.3735807 in list:
.260361 jupiter </s>
. 260361 malaysian restaurant ARPA format for N—grams, showing some sample N—grams. Each is represented
by a logprob, the word sequence, W1...W", followed by the log backoff weight a. Note that no a
is computed for the highestorder N—gram or for N—grams ending in <s>. take a language model in ARPA format and a sentence or corpus, and produce the
probability and perplexity of the sentence or corpus. Both also implement many ad—
vanced features discussed later in this chapter and in following chapters, including skip
Ngrams, word lattices, confusion networks, and Ngram pruning. 4.9 Advanced Issues in Language Modeling Because language models play such a broad role throughout speech and language pro
cessing, they have been extended and augmented in many ways. In this section we
briefly introduce a few of these, including Kneser-Ney smoothing, class-based language modeling, language model adaptation, topic-based language models, cache language models, and variable-length N-grams.

4.9.1 Advanced Smoothing Methods: Kneser-Ney Smoothing

One of the most commonly used modern N-gram smoothing methods is the interpolated Kneser-Ney algorithm. Kneser-Ney has its roots in a discounting method called absolute discounting.
Absolute discounting is a much better method of computing a revised count c* than
the Good—Turing discount formula we saw in Eq. 4.26, based on frequencies of fre
quencies. To get the intuition, let's revisit the Good-Turing estimates of the bigram c* extended from Fig. 4.8 and reformatted below:

    c (MLE)   0          1      2     3     4     5     6     7     8     9
    c* (GT)   0.0000270  0.446  1.26  2.24  3.24  4.22  5.19  6.21  7.24  8.25

Except for the re-estimated counts for 0 and 1, all the other re-estimated counts c* could be estimated pretty well by just subtracting 0.75 from the MLE count c! Absolute discounting formalizes this intuition by subtracting a fixed (absolute) discount d from each count. The intuition is that we already have good estimates for the high counts, and a small discount d won't affect them much. Applied to bigrams (with a backoff weight α chosen so the whole distribution sums to 1), absolute discounting is

P_{absolute}(w_i \mid w_{i-1}) =
  \begin{cases}
    \dfrac{C(w_{i-1}w_i) - d}{C(w_{i-1})}, & \text{if } C(w_{i-1}w_i) > 0 \\
    \alpha(w_i)\, P(w_i), & \text{otherwise.}
  \end{cases}    (4.48)
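A rough Python sketch of Eq. 4.48 under simplifying assumptions (the count tables and unigram_prob are placeholders; a full implementation would renormalize the freed-up mass over unseen words only):

```python
def absolute_discount_bigram(w, w_prev, bigram_counts, unigram_counts,
                             unigram_prob, d=0.75):
    """Absolute discounting for bigrams (Eq. 4.48), as a rough sketch.
    Seen bigrams lose a fixed discount d; unseen ones fall back to the
    unigram distribution weighted by the mass freed up in this context."""
    c = bigram_counts.get((w_prev, w), 0)
    if c > 0:
        return (c - d) / unigram_counts[w_prev]
    seen_types = sum(1 for (prev, _) in bigram_counts if prev == w_prev)
    alpha = d * seen_types / unigram_counts[w_prev]   # leftover probability mass
    return alpha * unigram_prob(w)
```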
Kneser—Ney Classbased
N—gram Cluster Ngram IBM clustering later
disc
tatic a ba«
whe:
reall 4.9. The infor
deali
want
in tht
If we
traini as IB
of ha
mates
ity of
and t1 If
likelil".
the Cl: Cll ' 8 9 3.21 7.24 8.25
stimated counts for 0
'etty well by just sub
rmalizes this intuition
he intuition is that we
iscount d won’t affect
h we don’t necessarily
.ng applied to bigrams
.iing sum to 1) is 1W‘) > 0 (4.48)
.e.
1311168 D for the 0 and 1 nts absolute discounting
.ion. Consider the 10b of
melting off to a unigram than the word Francisco.
1 will prefer it to glasses.
sea is frequent, it 15 only
sco. The word glasses has , t (the number of times the nt backoff distribution. War I )f times we might expect tt: .
tion is to base our estimaf .. Words that have appeared ‘ text as well. We can express
’, as follows: M— (4.49.i. .
—1Wi) > 0} I
as follows (again assumingm
.m to 1): ifc<wi'1wi)> 0 (4.50) otherwise. lnterpolated
Kneser—Ney Classbased
Ngram Cluster Ngram IBM clustering Section 4.9. Advanced Issues in Language Modeling 111 ———————————————————__——— Finally, it turns out to be better to use an interpolated rather than a backoff form
of KneserNey. While simple linear interpolation is generally not as successful as
Katz backoff, it turns out that more powerful interpolated models, such as interpo
lated KneserNey, work better than their backoff version. Interpolated KneserNey
discounting can be computed with an equation like the following (omitting the computation of β, which serves as the normalizing weight on the continuation counts):

P_{KN}(w_i \mid w_{i-1}) = \frac{C(w_{i-1}w_i) - d}{C(w_{i-1})} + \beta(w_{i-1})\, \left|\{w_{i-1} : C(w_{i-1}w_i) > 0\}\right|    (4.51)
a backoff model, hence stored in ARPA backoff format. We simply do the interpolation
when we build the model, so the ‘bigram’ probability stored in the backoff format is
really ‘bigram already interpolated with unigram’. 4.9.2 ClassBased NGrams The classbased Ngram, or cluster Ngram, is a variant of the Ngram that uses
information about word classes or clusters. Class—based Ngrams can be useful for
dealing with sparsity in the training data. Suppose for a ﬂight reservation system we
want to compute the probability of the bigram to Shanghai, but this bigram never occurs
in the training set. Instead, our training data has to London, to Beijing, and to Denver.
If we knew that these were all cities and assuming that Shanghai does appear in the
training set in other contexts, we could predict the likelihood of a city following from. There are many variants of cluster Ngrams. The simplest one is sometimes known
as IBM clustering, after its originators (Brown et al., 1992). IBM clustering is a kind
of hard clustering, in which each word can belong to only one class. The model esti
mates the conditional probability of a word w, by multiplying two factors: the probabil
ity of the word’s class c; given the preceding classes (based on an Ngram of classes),
and the probability of w_i given c_i. Here is the IBM model in bigram form:

P(w_i \mid w_{i-1}) \approx P(c_i \mid c_{i-1}) \times P(w_i \mid c_i)

If we had a training corpus in which we knew the class for each word, the maximum
likelihood estimate of the probability of the word given the class and the probability of
the class given the previous class could be computed as follows:

P(w \mid c) = \frac{C(w)}{C(c)}

P(c_i \mid c_{i-1}) = \frac{C(c_{i-1}c_i)}{\sum_{c} C(c_{i-1}c)}

Cluster N-grams are generally used in two ways. In dialog systems we often hand-
design domainspeciﬁc word classes. Thus for an airline information system, we might ...
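As a small illustration of the IBM-style class-based estimate, here is a Python sketch; the word-to-class map and both probability tables are hypothetical toy values, standing in for MLE estimates computed from a class-labeled corpus:

```python
def class_bigram_prob(w, w_prev, word2class, p_class_bigram, p_word_given_class):
    """IBM-style class-based bigram estimate: P(w | w_prev) is approximated by
    P(class(w) | class(w_prev)) * P(w | class(w))."""
    c, c_prev = word2class[w], word2class[w_prev]
    return p_class_bigram[(c_prev, c)] * p_word_given_class[(w, c)]

# Toy example: "to Shanghai" estimated through the CITY class even if the
# bigram itself never occurred in training.
word2class = {"to": "FUNC", "Shanghai": "CITY", "London": "CITY"}
p_class_bigram = {("FUNC", "CITY"): 0.2}
p_word_given_class = {("Shanghai", "CITY"): 0.1, ("London", "CITY"): 0.3}
print(class_bigram_prob("Shanghai", "to", word2class,
                        p_class_bigram, p_word_given_class))
```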