Search and Decoding in Speech Recognition: N-Grams

N-Grams: the problem of word prediction.
Example: “I’d like to make a collect …” Very likely words: “call”, “international call”, or “phone call”, and NOT “the”.
The idea of word prediction is formalized with probabilistic models called N-grams, which predict the next word from the previous N−1 words. Statistical models of word sequences are also called language models or LMs.
Computing probability of the next word will turn out to be closely related to computing the probability of a sequence of words.
Example: “… all of a sudden I notice three guys standing on the sidewalk …” vs. “… on guys all I of notice sidewalk three a sudden standing the …”

February 16, 2012 Veton Këpuska

Estimators like N-grams that assign a conditional probability to possible next words can be used to assign a joint probability to an entire sentence. N-gram models are one of the most important tools in speech and language processing. N-grams are essential in any task in which words must be identified from ambiguous and noisy inputs.

Speech Recognition – the input speech sounds are very confusable and many words sound extremely similar.

Handwriting Recognition – probabilities of word sequences help in recognition. Woody Allen in his movie “Take the Money and Run” tries to rob a bank with a sloppily written holdup note that the teller incorrectly reads as “I have a gub”. Any speech and language processing system could avoid making this mistake by using the knowledge that the sequence “I have a gun” is far more probable than the non-word “I have a gub” or even “I have a gull”.

Statistical Machine Translation – example: translating a Chinese source sentence by choosing from a set of potential rough English translations:
he briefed to reporters on the chief contents of the statement
he briefed reporters on the chief contents of the statement
he briefed to reporters on the main contents of the statement
he briefed reporters on the main contents of the statement

An N-gram grammar might tell us that “briefed reporters” is more likely than “briefed to reporters”, and “main contents” is more likely than “chief contents”.

Spelling Correction – we need to find and correct spelling errors like the following, which accidentally result in real English words:
They are leaving in about fifteen minuets to go to her house.
The design an construction of the system will take more than a year.
Problem – these are real words, so a dictionary search will not help. Note: “in about fifteen minuets” is a much less probable sequence than “in about fifteen minutes”. A spellchecker can use a probability estimator both to detect these errors and to suggest higher-probability corrections.

Augmentative Communication – helping people who are unable to use speech or sign language to communicate (e.g., Stephen Hawking). Using simple body movements, the user selects words from a menu, and the words are spoken by the system. Word prediction can be used to suggest likely words for the menu.

Other areas:
Part-of-speech tagging
Natural Language Generation
Word Similarity
Authorship identification
Sentiment Extraction
Predictive Text Input (cell phones)
Corpora & Counting Words

Probabilities are based on counting things, so we must decide what to count. Counting things in natural language is based on a corpus (plural corpora) – an online collection of text or speech. Two popular corpora are “Brown” and “Switchboard”. The Brown corpus is a 1-million-word collection of samples from 500 written texts from different genres (newspaper, novels, nonfiction, academic, etc.) assembled at Brown University in 1963-1964.

Example sentence from the Brown corpus: He stepped out into the hall, was delighted to encounter a water brother. This is 13 words if we don’t count punctuation marks as words, 15 if we do. The treatment of “,” and “.” depends on the task. Punctuation marks are critical for identifying boundaries of things (, . ;) and for identifying some aspects of meaning (? ! ”). For some tasks (part-of-speech tagging, parsing, or sometimes speech synthesis) punctuation marks are treated as separate words.

The Switchboard corpus is a collection of 2430 telephone conversations averaging 6 minutes each – a total of 240 hours of speech and about 3 million words. Corpora of this kind have no punctuation, which complicates defining words.

Example: I do uh main- mainly business data processing. This contains two kinds of disfluencies. The broken-off word main- is called a fragment. Words like uh and um are called fillers or filled pauses. Whether disfluencies are counted as words depends on the application:
An automatic dictation system based on automatic speech recognition will remove disfluencies.
A speaker identification application can use disfluencies to help identify a person.
Parsing and word prediction can use disfluencies – Stolcke and Shriberg (1996) found that treating uh as a word improves next-word prediction, and thus most speech recognition systems treat uh and um as words.

Are capitalized tokens like “They” and uncapitalized tokens like “they” the same word? In speech recognition they are treated the same.
In part-of-speech tagging, capitalization is retained as a separate feature. In this chapter, models are not case sensitive.

A lemma is a set of lexical forms having the same stem, the same major part-of-speech, and the same word-sense. A wordform is the full inflected or derived form of the word. Inflected forms: cats versus cat – these two words have the same lemma “cat” but are different wordforms.

In this chapter N-grams are based on wordforms. N-gram models, and counting words in general, require the kind of tokenization or text normalization introduced in the previous chapter:
Separating out punctuation
Dealing with abbreviations (m.p.h.)
Normalizing spelling, etc.

How many words are there in English? We must first distinguish types – the number of distinct words in a corpus, or vocabulary size V – from tokens – the total number N of running words.
Example: They picnicked by the pool, then lay back on the grass and looked at the stars. 16 tokens, 14 types.

The Switchboard corpus has ~20,000 wordform types and ~3 million wordform tokens.
Shakespeare’s complete works have 29,066 wordform types and 884,647 wordform tokens. The Brown corpus has 61,805 wordform types, 37,851 lemma types, and 1 million wordform tokens. A very large corpus (Brown 1992a) was found to include 293,181 different wordform types in 583 million wordform tokens. The American Heritage third edition dictionary lists 200,000 boldface forms.
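The type/token distinction above can be sketched in a couple of lines (a toy sketch using the example sentence, not tied to any real corpus):

```python
# Toy sketch: counting tokens (running words) vs. types (distinct wordforms).
sentence = "They picnicked by the pool then lay back on the grass and looked at the stars"
tokens = sentence.split()   # every running word counts as a token
types = set(tokens)         # each distinct wordform counts once as a type

print(len(tokens))  # 16 tokens
print(len(types))   # 14 types ("the" occurs three times)
```

Real corpora additionally require tokenization decisions (punctuation, case, abbreviations) before counting.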
It seems that the larger the corpus, the more word types are found. It has been suggested that vocabulary size (the number of types) grows at least as the square root of the number of tokens: V > O(√N).

Brief Introduction to Probability
Discrete Probability Distributions

Definition: a set called the sample space S contains all possible outcomes: S = {x1, x2, x3, …, xN}. For each element x of the set S (x ∈ S), a probability value is assigned as a function of x, P(x), with the following properties:
1. P(x) ∈ [0,1], ∀ x ∈ S
2. ∑_{x∈S} P(x) = 1

An event is defined as any subset E of the sample space S. The probability of the event E is defined as:
P(E) = ∑_{x∈E} P(x)

The probability of the entire space S is 1, as indicated by property 2 above. The probability of the empty or null event is 0. The function P(x) mapping a point in the sample space to a probability value is called a probability mass function (pmf).

Properties of the Probability Function

If A and B are mutually exclusive events in S, then P(A∪B) = P(A) + P(B). Mutually exclusive events are those for which A∩B = ∅. In general, for n mutually exclusive events:
P(A1 ∪ A2 ∪ A3 ∪ … ∪ An) = P(A1) + P(A2) + P(A3) + … + P(An)
Elementary Theorems of Probability

If A is any event in S, then P(A′) = 1 − P(A), where A′ is the set of all outcomes not in A.
Proof: P(A∪A′) = P(A) + P(A′); considering that P(A∪A′) = P(S) = 1, we get P(A) + P(A′) = 1.

If A and B are any events in S, then P(A∪B) = P(A) + P(B) − P(A∩B).
Proof (decomposing A∪B into the disjoint Venn-diagram regions A∩B′, A∩B, and A′∩B):
P(A∪B) = P(A∩B′) + P(A∩B) + P(A′∩B)
P(A∪B) = [P(A∩B′) + P(A∩B)] + [P(A′∩B) + P(A∩B)] − P(A∩B)
P(A∪B) = P(A) + P(B) − P(A∩B)

Conditional Probability

If A and B are any events in S, and P(B) ≠ 0, the conditional probability of A relative to B is given by:
P(A | B) = P(A∩B) / P(B)

If A and B are any events in S, then:
P(A∩B) = P(A | B) P(B) if P(B) ≠ 0
P(A∩B) = P(B | A) P(A) if P(A) ≠ 0

Independent Events

If A and B are independent events, then:
P(A∩B) = P(A | B) P(B) = P(A) P(B)
P(A∩B) = P(B | A) P(A) = P(B) P(A)

Bayes Rule

If B1, B2, B3, …, Bn are mutually exclusive events of which one must occur, that is, ∑_{i=1}^{n} P(Bi) = 1, then for i = 1, 2, 3, …, n:

P(Bi | A) = P(A | Bi) P(Bi) / ∑_{k=1}^{n} P(A | Bk) P(Bk)

End of Brief Introduction to Probability
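Before moving on, Bayes' rule can be sanity-checked numerically. In this sketch the two-hypothesis setup and all probability values are made-up illustration numbers, not from the text:

```python
# Two mutually exclusive hypotheses B1, B2 with P(B1) + P(B2) = 1,
# and an observed event A. All numbers are illustrative assumptions.
p_b = {"B1": 0.5, "B2": 0.5}
p_a_given_b = {"B1": 0.8, "B2": 0.3}

# Bayes rule: P(Bi | A) = P(A | Bi) P(Bi) / sum_k P(A | Bk) P(Bk)
evidence = sum(p_a_given_b[b] * p_b[b] for b in p_b)
posterior = {b: p_a_given_b[b] * p_b[b] / evidence for b in p_b}

print(posterior)               # B1 ≈ 0.727, B2 ≈ 0.273
print(sum(posterior.values())) # ~1.0: the posterior is a proper distribution
```

Note how the denominator (the total probability of A) renormalizes the joint terms so the posteriors sum to one.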
Simple (Unsmoothed) N-Grams

Our goal is to compute the probability of a word w given some history h: P(w | h).
Example: h ⇒ “its water is so transparent that”, w ⇒ “the”: P(the | its water is so transparent that).
How can we compute this probability? One way is to estimate it from relative frequency counts: from a very large corpus, count the number of times we see “its water is so transparent that”, and count the number of times this is followed by “the”. Out of the times we saw the history h, how many times was it followed by the word w?

P(the | its water is so transparent that) = C(its water is so transparent that the) / C(its water is so transparent that)

Estimating Probabilities

Estimating probabilities from counts works fine in many cases, but it turns out that even the web is not big enough to give us good estimates in most cases. Language is creative:
1. New sentences are created all the time.
2. It is not possible to count entire sentences.

Joint Probabilities – the probability of an entire sequence of words like “its water is so transparent”: out of all possible sequences of 5 words, how many of them are “its water is so transparent”? We must count all occurrences of “its water is so transparent” and divide by the sum of counts of all possible 5-word sequences. That seems like a lot of work for a simple estimate.

We must figure out cleverer ways of estimating the probability of a word w given some history h, or of an entire word sequence W.

Introduction of formal notation:
Random variable – Xi
Probability of Xi taking on the value “the” – P(Xi = “the”), abbreviated P(the)
Sequence of n words: w1 w2 … wn, written w1^n
Joint probability of each word in a sequence having a particular value:
P(w1, w2, w3, …, wn) = P(X1 = w1, X2 = w2, X3 = w3, …, Xn = wn)

Chain Rule

Chain rule of probability (w1^(k−1) denotes the prefix w1 … w(k−1)):
P(X1, …, Xn) = P(X1) P(X2 | X1) P(X3 | X1^2) … P(Xn | X1^(n−1)) = ∏_{k=1}^{n} P(Xk | X1^(k−1))

Applying the chain rule to words we get:
P(w1^n) = P(w1) P(w2 | w1) P(w3 | w1^2) … P(wn | w1^(n−1)) = ∏_{k=1}^{n} P(wk | w1^(k−1))

The chain rule provides the link between computing the joint probability of a sequence and computing the conditional probability of a word given previous words: it expresses the joint probability estimate of an entire sequence as a product of conditional probabilities. However, we still do not know any way of computing the exact probability of a word given a long sequence of
preceding words, P(wn | w1^(n−1)).

N-grams

Approximation: the idea of the N-gram model is to approximate the history by just the last few words instead of computing the probability of a word given its entire history.

Bigram: the bigram model approximates the probability of a word given all the previous words, P(wn | w1^(n−1)), by the conditional probability given only the preceding word, P(wn | wn−1).
Example: instead of computing the probability
P(the | Walden Pond's water is so transparent that)
it is approximated with the probability
P(the | that)

The following approximation is used when the bigram probability is applied:
P(wn | w1^(n−1)) ≈ P(wn | wn−1)

The assumption that the conditional probability of a word depends only on the previous word is called a Markov assumption. Markov models are the class of probabilistic models that assume we can predict the probability of some future unit without looking too far into the past.

Generalization: a trigram looks two words into the past, and an N-gram looks N−1 words into the past. The general equation for the N-gram approximation to the conditional probability of the next word in a sequence is:
P(wn | w1^(n−1)) ≈ P(wn | w(n−N+1)^(n−1))

The simplest and most intuitive way to estimate probabilities is the method called Maximum Likelihood Estimation, or MLE for short.

Maximum Likelihood Estimation

The MLE estimate for the parameters of an N-gram model is obtained by taking counts from a corpus and normalizing them so they lie between 0 and 1. Bigram: to compute a particular bigram probability of a word wn given a previous word wn−1, we take the count C(wn−1 wn) and normalize it by the sum of all bigram counts that share the same first word:

P(wn | wn−1) = C(wn−1 wn) / ∑_w C(wn−1 w)

Maximum Likelihood Estimate

The previous equation can be further simplified by noting:
C(wn−1) = ∑_w C(wn−1 w)  ⇒  P(wn | wn−1) = C(wn−1 wn) / C(wn−1)

Example

A mini-corpus containing three sentences, marked with the beginning-of-sentence marker <s> and the end-of-sentence marker </s>:
<s> I am Sam </s>
<s> Sam I am </s>
<s> I do not like green eggs and ham </s>
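The bigram MLE computation on this mini-corpus can be sketched as follows (a minimal illustration; the helper name `p` is my own):

```python
from collections import defaultdict

corpus = [
    "<s> I am Sam </s>",
    "<s> Sam I am </s>",
    "<s> I do not like green eggs and ham </s>",
]

bigram = defaultdict(int)
prefix = defaultdict(int)
for sent in corpus:
    tokens = sent.split()
    for w1, w2 in zip(tokens, tokens[1:]):
        bigram[(w1, w2)] += 1
        prefix[w1] += 1     # C(w1): count of w1 as a bigram prefix

def p(w, prev):
    """MLE bigram probability P(w | prev) = C(prev w) / C(prev)."""
    return bigram[(prev, w)] / prefix[prev]

print(p("I", "<s>"))   # 2/3
print(p("Sam", "am"))  # 1/2
print(p("do", "I"))    # 1/3
```

Dividing by the prefix count implements the simplification C(wn−1) = ∑_w C(wn−1 w) shown above.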
(From the Dr. Seuss book “Green Eggs and Ham”.)

Some of the bigram calculations from this corpus:
P(I | <s>) = 2/3 ≈ 0.67
P(Sam | <s>) = 1/3 ≈ 0.33
P(am | I) = 2/3 ≈ 0.67
P(Sam | am) = 1/2 = 0.5
P(</s> | Sam) = 1/2 = 0.5
P(</s> | am) = 1/2 = 0.5
P(do | I) = 1/3 ≈ 0.33

N-gram Parameter Estimation

In the general case, the MLE for an N-gram model is calculated using:

P(wn | w(n−N+1)^(n−1)) = C(w(n−N+1)^(n−1) wn) / C(w(n−N+1)^(n−1))

This equation estimates the N-gram probability by dividing the observed frequency of a particular sequence by the observed frequency of its prefix. This ratio is called a relative frequency. Relative frequency is one way to estimate probabilities in Maximum Likelihood Estimation.
Conventional MLE is not always the best way to compute probability estimates (it is biased toward the training corpus – e.g., Brown). MLE can be modified to better address these considerations.

Example 2

Data from the Berkeley Restaurant Project corpus, consisting of 9332 sentences (available from the WWW):
can you tell me about any good cantonese restaurants close by
mid priced thai food is what i’m looking for
tell me about chez panisse
can you give me a listing of the kinds of food that are available
i’m looking for a good place to eat breakfast
when is caffe venezia open during the day

Bigram counts for eight of the words (out of V = 1446) in the Berkeley Restaurant Project corpus of 9332 sentences:
         i    want  to   eat  chinese  food  lunch  spend
i        5    827   0    9    0        0     0      2
want     2    0     608  1    6        6     5      1
to       2    0     4    686  2        0     6      211
eat      0    0     2    0    16       2     42     0
chinese  1    0     0    0    0        82    1      0
food     15   0     15   0    1        4     0      0
lunch    2    0     0    0    0        1     0      0
spend    1    0     1    0    0        0     0      0

Bigram Probabilities After Normalization
Unigram counts:
i     want  to    eat  chinese  food  lunch  spend
2533  927   2417  746  158      1093  341    278

Some other useful probabilities:
P(i | <s>) = 0.25
P(english | want) = 0.0011
P(food | english) = 0.5
P(</s> | food) = 0.68

Clearly we can now compute the probability of a sentence like “I want English food” or “I want Chinese food” by multiplying the appropriate bigram probabilities together, as shown below.

Bigram probabilities for eight words (out of V = 1446) in the Berkeley Restaurant Project corpus of 9332 sentences:
         i        want  to      eat     chinese  food    lunch   spend
i        0.002    0.33  0       0.0036  0        0       0       0.00079
want     0.0022   0     0.66    0.0011  0.0065   0.0065  0.0054  0.0011
to       0.00083  0     0.0017  0.28    0.00083  0       0.0025  0.087
eat      0        0     0.0027  0       0.021    0.0027  0.056   0
chinese  0.0063   0     0       0       0        0.52    0.0063  0
food     0.014    0     0.014   0       0.00092  0.0037  0       0
lunch    0.0059   0     0       0       0        0.0029  0       0
spend    0.0036   0     0.0036  0       0        0       0       0

Bigram Probability

P(<s> i want english food </s>)
= P(i | <s>) P(want | i) P(english | want) P(food | english) P(</s> | food)
= 0.25 × 0.33 × 0.0011 × 0.5 × 0.68 = 0.000031

Exercise: compute the probability of “I want chinese food”.

Some of the bigram probabilities encode facts that we think of as strictly syntactic in nature:
What comes after eat is usually a noun or an adjective.
What comes after to is usually a verb.

Trigram Modeling

Although we will generally show bigram models in this chapter for pedagogical purposes, note that when there is sufficient training data we are more likely to use trigram models, which condition on the previous two words rather than the previous word. To compute trigram probabilities at the very beginning of a sentence, we can use two pseudo-words for the first trigram (i.e., P(I | <s><s>)).

Training and Test Sets

N-gram models are obtained from the corpus they are trained on, and are then used on some new data in some task (e.g., speech recognition). The new data or task will not be exactly the same as the training data. Formally: the data used to build the N-gram (or any) model is called the Training Set or Training Corpus; the data used to test the model comprises the Test Set or Test Corpus.

Model Evaluation

The training-and-testing paradigm can also be used to evaluate different N-gram architectures:
Comparing N-grams of different order N, or
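The bigram sentence-probability computation shown above can be sketched in code (the probabilities are hard-coded from the slides; `sentence_prob` is an illustrative helper name):

```python
# Bigram probabilities taken from the Berkeley Restaurant Project example above.
probs = {
    ("<s>", "i"): 0.25,
    ("i", "want"): 0.33,
    ("want", "english"): 0.0011,
    ("english", "food"): 0.5,
    ("food", "</s>"): 0.68,
}

def sentence_prob(words):
    """P(sentence) under a bigram model, including <s> and </s> markers."""
    tokens = ["<s>"] + words + ["</s>"]
    p = 1.0
    for w1, w2 in zip(tokens, tokens[1:]):
        p *= probs[(w1, w2)]
    return p

print(sentence_prob(["i", "want", "english", "food"]))  # ≈ 3.1e-05
```

For long sentences one would sum log-probabilities instead of multiplying, to avoid floating-point underflow.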
Using different smoothing algorithms (to be introduced later).
Train the various models on the training corpus, then evaluate each model on the test corpus. How do we measure the performance of each model on the test corpus? Perplexity (introduced later in the chapter) – computing the probability of each sentence in the test set: the model that assigns a higher probability to the test set (hence more accurately predicts the test set) is assumed to be the better model. Because the evaluation metric is based on test set probability, it is important not to let the test sentences into the training set: avoid training on the test set data.

Other Divisions of Data

An extra source of data to augment the training set is needed. This data is called a held-out set. The N-gram model is based only on the training set; the held-out set is used to set additional (other) parameters of the model, e.g. the interpolation parameters of an N-gram model.
Multiple test sets: a test set that is used often in measuring the performance of the model is typically called the development (test) set. Due to its high usage, the models may become tuned to it. Thus a completely unseen (or seldom-used) data set should be used for the final evaluation. This set is called the evaluation (test) set.

Picking Training, Development Test and Evaluation Test Data

For training we need as much data as possible. For testing, however, we need enough data for the resulting measurements to be statistically significant. In practice the data is often divided into 80% training, 10% development, and 10% evaluation.

N-gram Sensitivity to the Training Corpus

1. N-gram modeling, like many statistical models, is very dependent on the training corpus. Often the model encodes very specific facts about a given training corpus.
2. N-grams do a better and better job of modeling the training corpus as we increase the value of N. This is another aspect of the model being tuned specifically to the training data at the expense of generality.

Visualization of N-gram Modeling

Shannon (1951) and Miller & Selfridge (1950). The simplest way to visualize how this works is the unigram case: all words of the English language cover the probability space between 0 and 1, each word covering an interval of size equal to its (relative) frequency.
We choose a random number between 0 and 1 and print out the word whose interval includes the value we have chosen. We continue choosing random numbers and generating words until we randomly generate the sentence-final token </s>.
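The interval-sampling scheme just described can be sketched as follows (the unigram distribution here is made up for illustration, not estimated from any corpus):

```python
import random

def sample_word(dist):
    """Pick a word by choosing a random point in [0, 1) and finding
    which word's probability interval contains it."""
    r = random.random()
    cum = 0.0
    for word, p in dist.items():
        cum += p
        if r < cum:
            return word
    return word  # guard against floating-point round-off at 1.0

def generate(dist):
    """Keep sampling words until the sentence-final token </s> is produced."""
    out = []
    while True:
        w = sample_word(dist)
        out.append(w)
        if w == "</s>":
            return out

unigram = {"the": 0.4, "cat": 0.3, "sat": 0.2, "</s>": 0.1}
random.seed(7)
print(" ".join(generate(unigram)))
```

Each word's share of the [0, 1) interval equals its relative frequency, so frequent words are generated more often.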
The same technique can be used to generate bigrams: first generate a random bigram that starts with <s> (according to its bigram probability), then choose a random bigram to follow it (again according to its conditional probability), and so on.

Visualization of N-gram Modeling: Unigram

To provide an intuition of the increasing power of higher-order N-grams, the examples below show random sentences generated from unigram, bigram, trigram, and quadrigram models trained on Shakespeare’s works.

To him swallowed confess hear both. Which. Of save on trail for are ay device an rote life have
Every enter noe severally so, let
Hill he late speaks; or! A more to leg less first you enter
Are where exeunt and sighs have rise excellency took of. Sleep knave we. Near; vile like

Visualization of N-gram Modeling: Bigram

What means, sir. I confess she? then all sorts, he is trim, captain.
Why dost stand forth thy canopy, forsooth; he is this palpable hit the King Henry. Live king. Follow.
What we, hath got so she that I rest and sent to scold and nature bankrupt, nor the first gentleman?
Enter Menenius, if it so many good direction found’st thou art a strong upon command of fear not a liberal largess given away, Falstaff! Exeunt

Visualization of N-gram Modeling: Trigram

Sweet prince, Falstaff shall die. Harry of Monmouth’s grave.
This shall forbid it should be branded, if renown made it empty.
Indeed the duke; and had a very good friend.
Fly, and will rid me these news of price. Therefore the sadness of parting, as they say, ’tis done.

Visualization of N-gram Modeling: Quadrigram

King Henry. What! I will go seek the traitor Gloucester. Exeunt some of the watch. A great banquet serv’d in;
Will you not tell me who I am? It cannot be but so. Indeed the short and the long. Marry, ’tis a noble Lepidus.

Size of N in N-gram Models

The longer the context on which we train the model, the more coherent the sentences. In the unigram sentences, there is no coherent relation between words, nor sentence-final punctuation. The bigram sentences have some very local word-to-word coherence (especially if we consider that punctuation counts as a word). The trigram and quadrigram sentences are beginning to look a lot like Shakespeare. Indeed, a careful investigation of the quadrigram sentences shows that they look a little too much like Shakespeare: the words It cannot be but so are directly from King John.

Specificity vs. Generality

The variability of word phrases in Shakespeare is not very large in the context of training corpora used for language modeling: N = 884,647 and V = 29,066. N-gram probability matrices are therefore very sparse: there are V^2 ≈ 844 million possible bigrams alone, and V^4 ≈ 7×10^17 possible quadrigrams. Once the generator has chosen the first quadrigram, there are often only five possible continuations (that, I, he, thou, and so); in fact, for many quadrigrams there is only one continuation.

Dependence of the Grammar on its Training Set

Example: the Wall Street Journal (WSJ) corpus, based on the newspaper. Shakespeare’s works and the WSJ are both in English, so one might expect some overlap between our N-grams for the two genres. To check whether this is true, the next slides provide sentences generated by unigram, bigram, and trigram grammars trained on 40 million words from the WSJ.
WSJ Example
Unigram: Months the my and issue of year foreign new exchange’s september were recession exchange new endorsed a acquire to six executives

Bigram: Last December through the way to preserve the Hudson corporation N. B. E. C. Taylor would seem to complete the major central planners one point five percent of U.S. E. has already old M. X. corporation of living on information such as more frequently fishing to keep her

Trigram: They also point to ninety nine point six billion dollars from two hundred four oh six three percent of the rates of interest stores as Mexico and Brazil on market conditions

Sentences randomly generated from three orders of N-gram computed from 40 million words of the Wall Street Journal. All characters were mapped to lowercase and punctuation marks were treated as words. Output was hand-corrected for capitalization to improve readability.
Comparison of Shakespeare and WSJ Examples

While superficially they both seem to model “English-like sentences”, there is obviously no overlap whatsoever in possible sentences, and little if any overlap even in small phrases. This stark difference tells us that statistical models are likely to be pretty useless as predictors if the training sets and the test sets are as different as Shakespeare and the WSJ. How should we deal with this problem when we build N-gram models?

In general we need to be sure to use a training corpus that looks like our test corpus. We especially wouldn’t choose training and test sets from different genres of text like newspaper text, early English fiction, telephone conversations, and web pages. Sometimes finding appropriate training text for a specific new task can be difficult; to build N-grams for text prediction in SMS (Short Message Service), we need a training corpus of SMS data.
To build Ngrams on business meetings, we would need to have corpora of transcribed business meetings.
For general research where we know we want written English but don’t have a domain in mind, we can use a balanced training corpus that includes cross-sections from different genres, such as the 1-million-word Brown corpus of English (Francis and Kučera, 1982) or the 100-million-word British National Corpus (Leech et al., 1994). Recent research has also studied ways to dynamically adapt language models to different genres.

Unknown Words: Open vs. Closed Vocabulary Tasks

Sometimes we have a language task in which we know all the words that can occur, and hence we know the vocabulary size V in advance. The closed vocabulary assumption is the assumption that we have such a lexicon and that the test set can contain only words from this lexicon. The closed vocabulary task thus assumes there are no unknown words.

As we suggested earlier, the number of unseen words grows constantly, so we can’t possibly know in advance exactly how many there are, and we’d like our model to do something reasonable with them. We call these unseen events unknown words, or out-of-vocabulary (OOV) words. The percentage of OOV words that appear in the test set is called the OOV rate. An open vocabulary system is one where we model these potential unknown words in the test set by adding a pseudo-word called <UNK>.

Training Probabilities of the Unknown Word Model

We can train the probabilities of the unknown word model <UNK> as follows:
1. Choose a vocabulary (word list) which is fixed in advance.
2. In the training set, convert any word that is not in this vocabulary (any OOV word) to the unknown word token <UNK> in a text normalization step.
3. Estimate the probabilities for <UNK> from its counts, just like for any other regular word in the training set.

Evaluating N-Grams
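The three training steps above can be sketched as follows (the tiny vocabulary and training sentence are illustrative assumptions):

```python
from collections import Counter

def replace_oov(tokens, vocab):
    """Step 2: map any out-of-vocabulary token to <UNK> during normalization."""
    return [t if t in vocab else "<UNK>" for t in tokens]

# Step 1: a hypothetical fixed vocabulary.
vocab = {"<s>", "</s>", "i", "want", "food"}
train = "<s> i want thai food </s>".split()

normalized = replace_oov(train, vocab)
print(normalized)  # ['<s>', 'i', 'want', '<UNK>', 'food', '</s>']

# Step 3: <UNK> is now counted like any other word.
counts = Counter(normalized)
print(counts["<UNK>"])  # 1
```

After this normalization, N-gram estimation proceeds exactly as before, with <UNK> acting as a regular vocabulary item.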
Perplexity

The correct way to evaluate the performance of a language model is to embed it in an application and measure the total performance of the application. Such end-to-end evaluation, also called in vivo evaluation, is the only way to know if a particular improvement in a component is really going to help the task at hand. Thus for speech recognition, we can compare the performance of two language models by running the speech recognizer twice, once with each language model, and seeing which gives the more accurate transcription.

End-to-end evaluation is often very expensive; evaluating a large speech recognition test set, for example, takes hours or even days. Thus we would like a metric that can be used to quickly evaluate potential improvements in a language model. Perplexity is the most common evaluation metric for N-gram language models. While an improvement in perplexity does not guarantee an improvement in speech recognition performance (or any other end-to-end metric), it often correlates with such improvements. Thus it is commonly used as a quick check on an algorithm; an improvement in perplexity can then be confirmed by an end-to-end evaluation.

Given two probabilistic models, the better model is the one that has a tighter fit to the test data, or predicts the details of the test data better. We can measure better prediction by looking at the probability the model assigns to the test data; the better model will assign a higher probability to the test data.

Definition of Perplexity

The perplexity (sometimes called PP for short) of a language model on a test set is a function of the probability that the language model assigns to that test set. For a test set W = w1 w2 … wN, the perplexity is the probability of the test set, normalized by the number of words:

PP(W) = P(w1 w2 … wN)^(−1/N) = (1 / P(w1 w2 … wN))^(1/N)

We can use the chain rule to expand the probability of W:
PP(W ) = N ∏
i =1 P ( wi  w1 w2 wi −1 ) For bigram language model the perplexity of W is computed as:
N 1
PP(W ) = N ∏
i =1 P ( wi  wi −1 )
February 16, 2012 Veton Këpuska 68 Interpretation of Perplexity
Interpretation of Perplexity

1. Minimizing perplexity is equivalent to maximizing the test set probability according to the language model.

What we generally use for the word sequence in the general equation above is the entire sequence of words in some test set. Since this sequence will cross many sentence boundaries, we need to include the begin- and end-sentence markers <s> and </s> in the probability computation. We also need to include the end-of-sentence marker </s> (but not the beginning-of-sentence marker <s>) in the total count of word tokens N.

Perplexity can also be interpreted as the weighted average branching factor of a language. The branching factor of a language is the number of possible next words that can follow any word. Consider the task of recognizing the digits in English (zero, one, two, …, nine), given that each of the 10 digits occurs with equal probability P = 1/10. The perplexity of this language is in fact 10. To see that, imagine a string of digits of length N. By the equation presented above, the perplexity will be:

PP(W) = P(w1 w2 … wN)^(−1/N) = ((1/10)^N)^(−1/N) = (1/10)^(−1) = 10
WSJ corpora with 19,979 word vocabulary. Perplexity is computed on a test set of 1.5 million words via equation presented in the slide: Definition of Perplexity and the results are summarized in the Table below: Ngram Order
Perplexity February 16, 2012 Unigram Bigram Trigram 962 170 109 Veton Këpuska 73 Example of Perplexity Use As we see in previous slide, the more information the Ngram gives us about the word sequence, the lower the perplexity: the perplexity is related inversely to the likelihood of the test sequence according to the model. Note that in computing perplexities the Ngram model P must be
constructed without any knowledge of the test set t. Any kind of
knowledge of the test set can cause the perplexity to be artificially
low. For example, we defined above the closed vocabulary task, in which the vocabulary for the test set is specified in advance. This can greatly reduce the perplexity. As long as this knowledge is provided equally to each of the models we are comparing, the closed vocabulary perplexity can still be useful for comparing models, but care must be taken in interpreting the results. In general, the perplexity of two language models is only comparable if they use the same vocabulary. February 16, 2012 Veton Këpuska 74 Smoothing Smoothing There is a major problem with the maximum likelihood estimation process we have seen for training the parameters of an Ngram model. This is the problem of sparse data caused by the fact that our maximum likelihood estimate was based on a particular set of training data. For any Ngram that occurred a sufficient number of times, we might have a good estimate of its probability. But because any corpus is limited, some perfectly acceptable English word sequences are bound to be missing from it. 1. This missing data means that the Ngram matrix for any given training corpus is bound to have a very large number of cases of putative “zero probability Ngrams” that should really have some nonzero probability. 2. Furthermore, the MLE method also produces poor estimates when the counts are nonzero but still small. February 16, 2012 Veton Këpuska 76 Smoothing We need a method which can help get better estimates for these zero or low frequency counts. Zero counts turn out to cause another huge problem. The perplexity metric defined above requires that we compute the probability of each test sentence. But if a test sentence has an Ngram that never appeared in the training set, the Maximum Likelihood estimate of the probability for this Ngram, and hence for the whole test sentence, will be zero! 
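The zero-count problem is easy to demonstrate. A minimal sketch (the toy training corpus is invented for illustration): an unsmoothed MLE bigram model assigns probability zero to any sentence containing an unseen bigram, which makes the sentence probability zero and the perplexity undefined (infinite).

```python
from collections import Counter

corpus = "i want to eat chinese food </s> i want to eat lunch </s>".split()
bigrams = Counter(zip(corpus, corpus[1:]))
unigrams = Counter(corpus)

def p_mle(w, prev):
    # Unsmoothed maximum likelihood estimate P(w | prev).
    return bigrams[(prev, w)] / unigrams[prev] if unigrams[prev] else 0.0

print(p_mle("to", "want"))     # 1.0: every "want" was followed by "to"
print(p_mle("food", "lunch"))  # 0.0: "lunch food" never occurred in training
```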
This means that in order to evaluate our language models, we need to modify the MLE method to assign some nonzero probability to any N-gram, even one that was never observed in training.

Smoothing

The term smoothing is used for such modifications that address the poor estimates due to variability in small data sets. The name comes from the fact that (looking ahead a bit) we will be shaving a little bit of probability mass from the higher counts, and piling it instead on the zero counts, making the distribution a little less discontinuous. In the next few sections some smoothing algorithms are introduced. The original Berkeley Restaurant example introduced previously will be used to show how smoothing algorithms modify the bigram probabilities.

Laplace Smoothing

One simple way to do smoothing is to take our matrix of bigram counts, before we normalize them into probabilities, and add one to all the counts. This algorithm is called Laplace smoothing, or Laplace's Law. Laplace smoothing does not perform well enough to be used in modern N-gram models, but we begin with it because it introduces many of the concepts that we will see in other smoothing algorithms, and it also gives us a useful baseline.

Laplace Smoothing to Unigram Probabilities

Recall that the unsmoothed maximum likelihood estimate of the unigram probability of the word w_i is its count c_i normalized by the total number of word tokens N:

P(w_i) = \frac{c_i}{N}

Laplace smoothing adds one to each count. Considering that there are V words in the vocabulary and each count was incremented, we also need to adjust the denominator to take into account the extra V observations in order to have legitimate probabilities:

P_{Laplace}(w_i) = \frac{c_i + 1}{N + V}

It is convenient to describe a smoothing algorithm as a corrective constant that affects the numerator, by defining an adjusted count c* as follows:

P_{Laplace}(w_i) = \frac{c_i + 1}{N + V} = \frac{(c_i + 1)\frac{N}{N + V}}{N} = \frac{c_i^{*}}{N}, \qquad c_i^{*} = (c_i + 1)\frac{N}{N + V}

Discounting

A related way to view smoothing is as discounting (lowering) some nonzero counts in order to get the correct probability mass that will be assigned to the zero counts. Thus instead of referring to the discounted counts c*, we might describe a smoothing algorithm in terms of a relative discount d_c, the ratio of the discounted counts to the original counts:

d_c = \frac{c^{*}}{c}

Berkeley Restaurant Project Smoothed Bigram Counts (V=1446)

          i   want   to   eat  chinese  food  lunch  spend
i         6    828    1    10       1      1      1      3
want      3      1  609     2       7      7      6      2
to        3      1    5   687       3      1      7    212
eat       1      1    3     1      17      3     43      1
chinese   2      1    1     1       1     83      2      1
food     16      1   16     1       2      5      1      1
lunch     3      1    1     1       1      2      1      1
spend     2      1    2     1       1      1      1      1

Smoothed Bigram Probabilities

Recall that normal bigram probabilities are computed by normalizing each row of counts by the unigram count:

P(w_n \mid w_{n-1}) = \frac{C(w_{n-1}w_n)}{C(w_{n-1})}

For add-one smoothed bigram counts we need to augment the unigram count by the number of total word types in the vocabulary V:

P^{*}_{Laplace}(w_n \mid w_{n-1}) = \frac{C(w_{n-1}w_n) + 1}{C(w_{n-1}) + V}

The result is the smoothed bigram probabilities presented in the table in the next slide.
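The add-one formulas above can be sketched as follows (the tiny counts and 4-word vocabulary are invented for illustration; V is the vocabulary size as in the equations):

```python
from collections import Counter

V = 4  # hypothetical vocabulary size
unigram_counts = Counter({"i": 3, "want": 2, "to": 2, "eat": 1})
bigram_counts = Counter({("i", "want"): 2, ("want", "to"): 2, ("to", "eat"): 1})

def p_laplace(w, prev):
    # P*_Laplace(w | prev) = (C(prev w) + 1) / (C(prev) + V)
    return (bigram_counts[(prev, w)] + 1) / (unigram_counts[prev] + V)

def adjusted_count(w, prev):
    # c* = (C(prev w) + 1) * C(prev) / (C(prev) + V)
    return (bigram_counts[(prev, w)] + 1) * unigram_counts[prev] / (unigram_counts[prev] + V)

vocab = ["i", "want", "to", "eat"]
# The smoothed probabilities for a context sum to 1 over the vocabulary:
print(sum(p_laplace(w, "want") for w in vocab))  # ~1.0
print(p_laplace("to", "want"))                   # (2+1)/(2+4) = 0.5
print(adjusted_count("to", "want"))              # 3 * 2/6 = 1.0
```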
February 16, 2012 Veton Këpuska 84 Bigram Smoothed Probabilities for eight words (out of V=1446) in Berkeley Restaurant Project corpus of 9332 sentences
          i        want     to       eat      chinese  food     lunch    spend
i        0.0015   0.21     0.00025  0.0025   0.00025  0.00025  0.00025  0.00075
want     0.0013   0.00042  0.26     0.00084  0.0029   0.0029   0.0025   0.0084
to       0.00078  0.00026  0.0013   0.18     0.00078  0.00026  0.0018   0.055
eat      0.00046  0.00046  0.0014   0.00046  0.0078   0.0014   0.02     0.00046
chinese  0.0012   0.00062  0.00062  0.00062  0.00062  0.052    0.0012   0.00062
food     0.0063   0.00039  0.0063   0.00039  0.00079  0.002    0.00039  0.00039
lunch    0.0017   0.00056  0.00056  0.00056  0.00056  0.0011   0.00056  0.00056
spend    0.0012   0.00058  0.0012   0.00058  0.00058  0.00058  0.00058  0.00058

Adjusted Counts Table

It is often convenient to reconstruct the count matrix so we can see how much a smoothing algorithm has changed the original counts. These adjusted counts can be computed by the equation below; the table in the next slide shows the reconstructed counts.

c^{*}(w_{n-1}w_n) = \frac{[C(w_{n-1}w_n) + 1] \times C(w_{n-1})}{C(w_{n-1}) + V}

Adjusted Counts Table
          i      want    to     eat    chinese  food   lunch  spend
i        3.8    527     0.64   6.4    0.64     0.64   0.64   1.9
want     1.2    0.39    238    0.78   2.7      2.7    2.3    0.78
to       1.9    0.63    3.1    430    1.9      0.63   4.4    133
eat      0.34   0.34    1      0.34   5.8      1      15     0.34
chinese  0.2    0.098   0.098  0.098  0.098    8.2    0.2    0.098
food     6.9    0.43    6.9    0.43   0.86     2.2    0.43   0.43
lunch    0.57   0.19    0.19   0.19   0.19     0.38   0.19   0.19
spend    0.32   0.16    0.32   0.16   0.16     0.16   0.16   0.16

Observation

Note that add-one smoothing has made a very big change to the counts. C(want to) changed from 608 to 238! We can see this in probability space as well: P(to|want) decreases from .66 in the unsmoothed case to .26 in the smoothed case. Looking at the discount d (the ratio between new and old counts) shows us how strikingly the counts for each prefix
word have been reduced; the discount for the bigram want to is .39, while the discount for Chinese food is .10, a factor of 10!

Problems with Add-One (Laplace) Smoothing

The sharp change in counts and probabilities occurs because too much probability mass is moved to all the zeros. We could move a bit less mass by adding a fractional count rather than 1 (add-δ smoothing; Lidstone, 1920; Jeffreys, 1948), but this method requires a way of choosing δ dynamically, results in an inappropriate discount for many counts, and turns out to give counts with poor variances. For these and other reasons (Gale and Church, 1994), we will need to use better smoothing methods for N-grams, like the ones presented in the next section.

Good-Turing Discounting

A number of much better algorithms have been developed that are only slightly more complex than add-one smoothing; Good-Turing is one of them. The idea behind a number of these algorithms is to use the count of things you have seen once to help estimate the count of things you have never seen. Good described the algorithm in 1953, crediting Turing for the original idea. The basic idea in this algorithm is to re-estimate the amount of probability mass to assign to N-grams with zero counts by looking at the number of N-grams that occurred exactly once. A word or N-gram that occurs once is called a singleton. The Good-Turing algorithm uses the frequency of singletons as a re-estimate of the frequency of zero-count bigrams.

Good-Turing Discounting Algorithm

Definition: N_c is the number of N-grams that occur c times (the frequency of frequency c). N_0 is the number of bigrams with count 0, N_1 the number of bigrams with count 1 (singletons), etc.:

N_c = \sum_{x : count(x) = c} 1

The MLE count for N_c is c. The Good-Turing estimate replaces this with a smoothed count c*, as a function of N_{c+1}:

c^{*} = (c + 1)\frac{N_{c+1}}{N_c}

The equation above can be used to replace the MLE counts for all the bins N_1, N_2, and so on. Instead of using it directly to re-estimate the smoothed count c* for N_0, the following equation is used, which defines the probability of the missing mass:
P^{*}_{GT}(\text{things with frequency zero in training}) = \frac{N_1}{N}

where N_1 is the count of items in bin 1 (items that were seen once in training), and N is the total number of items we have seen in training.
February 16, 2012 Veton Këpuska 92 GoodTuring Discounting Example: A lake with 8 species of fish (bass, carp, catfish, eel, perch, salmon, trout, whitefish)
When fishing we have caught fish of 6 species with the following counts: 10 carp, 3 perch, 2 whitefish, 1 trout, 1 salmon, and 1 eel (no catfish and no bass). What is the probability that the next fish we catch will be a new species, i.e., one that had a zero frequency in our training set (catfish or bass)? The MLE count c of an unseen species (bass or catfish) is 0. From the equation in the previous slide, the probability of a new fish being one of these unseen species is 3/18, since N_1 is 3 and N is 18:

P^{*}_{GT}(\text{things with frequency zero in training}) = \frac{N_1}{N} = \frac{3}{18}

Good-Turing Discounting: Example

Let us now estimate the probability that the next fish will be another trout. The MLE count for trout is 1, so the MLE estimated probability is 1/18. However, the Good-Turing estimate must be lower, since we took 3/18 of our probability mass to use on unseen events! We must discount the MLE probabilities for the observed counts (perch, whitefish, trout, salmon, and eel). The revised count c* and Good-Turing smoothed probabilities P*_GT for species with count 0 (like bass or catfish) or count 1 (like trout, salmon, or eel) are as follows:

           unseen (bass or catfish)                 trout
c          0                                        1
MLE p      p = 0/18 = 0                             p = 1/18
c* (GT)    --                                       c*(trout) = 2 x N_2/N_1 = 2 x 1/3 = 0.67
p*_GT      p*_GT(unseen) = N_1/N = 3/18 = 0.17      p*_GT(trout) = c*/N = 0.67/18 = 0.037

Note that the revised count c* for eel is likewise discounted from c = 1.0 to c* = 0.67, in order to set aside some probability mass for the unseen species: p*_GT(unseen) = 3/18 = 0.17 for catfish and bass. Since we know that there are 2 unseen species, the probability of the next fish being specifically a catfish is p*_GT(catfish) = (1/2) x (3/18) = 0.085.
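The fish example can be reproduced numerically; a minimal sketch of Good-Turing re-estimation (species names and counts as in the slides):

```python
from collections import Counter

catch = ["carp"] * 10 + ["perch"] * 3 + ["whitefish"] * 2 + ["trout", "salmon", "eel"]
N = len(catch)                                   # 18 fish observed
freq_of_freq = Counter(Counter(catch).values())  # N_c: how many species occur c times

def c_star(c):
    # Good-Turing smoothed count: c* = (c+1) * N_{c+1} / N_c
    return (c + 1) * freq_of_freq[c + 1] / freq_of_freq[c]

p_unseen = freq_of_freq[1] / N   # N_1 / N = 3/18, mass reserved for new species
p_trout = c_star(1) / N          # discounted singleton probability
print(round(p_unseen, 2), round(c_star(1), 2), round(p_trout, 3))  # 0.17 0.67 0.037
```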
Bigram Examples

Good-Turing bigram frequencies of frequencies and re-estimated counts, for an Associated Press (AP) newswire corpus and for the Berkeley Restaurant corpus of 9332 sentences:

           AP Newswire                        Berkeley Restaurant
c(MLE)     N_c               c*(GT)           N_c          c*(GT)
0          74,671,100,000    0.0000270        2,081,496    0.002553
1          2,018,046         0.446            5315         0.533960
2          449,721           1.26             1419         1.357294
3          188,933           2.24             642          2.373832
4          105,668           3.24             381          4.081365
5          68,379            4.22             311          3.781350
6          48,190            5.19             196          4.500000

Advanced Issues in Good-Turing Estimation

Assumptions of Good-Turing estimation:
- The distribution of each bigram is binomial.
- The number N_0 of bigrams that have not been seen is known. This number is known if the size V of the vocabulary is known; the total number of possible bigrams is then V^2.

The raw numbers N_c cannot be used directly, since the re-estimate c* for count c depends on N_{c+1}, and the re-estimation expression is undefined when N_{c+1} = 0. In the fish species example N_4 = 0, so how can one compute the smoothed count for c = 3? ⇒ Simple Good-Turing algorithm.

Simple Good-Turing
1. Compute N_c for all c's.
2. Smooth the N_c counts to replace any zeros in the sequence.
3. Compute adjusted counts c*.

Smoothing approach: linear regression, fitting a map from N_c to c in log space:

\log(N_c) = a + b \log(c)

In addition, the discounted c* is not used for all counts c. Large counts, where c > k for some threshold k (e.g., k = 5 in Katz, 1987), are assumed to be reliable:

c^{*} = c \quad \text{for } c > k

Simple Good-Turing

The correct equation for c* when such a threshold k is introduced is:
N c +1
( c + 1)
−c
Nc
N1
c* =
, for 1 ≤ c ≤ k
( k + 1) N k +1
1−
N1 With GoodTuring discounting (as well as other algorithms), it is usual to treat Ngrams with low count (especially counts of 1) as if the count were 0. Finally GoodTuring discounting (or any other algorithm) is not used directly on Ngrams; it is used in combination with the backoff and interpolation algorithms that are described next. February 16, 2012 Veton Këpuska 99 Interpolation Interpolation Discounting algorithms can help solve the problem of zero frequency Ngrams. Additional knowledge that is not used: If trying to compute P(wnwn1wn2) but we have no examples of a particular trigram wn2wn1wn Estimate trigram probability based on the bigram probability P(wnwn1). If there are no counts for computation of bigram probability P(wnwn1), use unigram probability P(wn). There are two ways to rely on this Ngram “hiearchy”: Backoff, and Interpolation
Backoff vs. Interpolation

Backoff: relies solely on the higher-order (e.g., trigram) counts; only when there is zero evidence for a trigram do we back off to a lower-order N-gram. Interpolation: probability estimates are always mixed from all N-gram estimators: a weighted interpolation of trigram, bigram, and unigram counts. The simplest version is linear interpolation.

Linear Interpolation
\hat{P}(w_n \mid w_{n-1}w_{n-2}) = \lambda_1 P(w_n \mid w_{n-1}w_{n-2}) + \lambda_2 P(w_n \mid w_{n-1}) + \lambda_3 P(w_n), \qquad \sum_i \lambda_i = 1

A slightly more sophisticated version of linear interpolation uses context-dependent weights:

\hat{P}(w_n \mid w_{n-1}w_{n-2}) = \lambda_1(w_{n-2}^{n-1}) P(w_n \mid w_{n-1}w_{n-2}) + \lambda_2(w_{n-2}^{n-1}) P(w_n \mid w_{n-1}) + \lambda_3(w_{n-2}^{n-1}) P(w_n), \qquad \sum_i \lambda_i(w_{n-2}^{n-1}) = 1

Computing Interpolation Weights λ

The weights are set from a held-out corpus. A held-out corpus is an additional training corpus that is NOT used to set the N-gram counts, but to set other parameters, like the λ values in this case. The λ values are chosen to maximize the interpolated probability of the held-out data, for example with the EM algorithm (an iterative algorithm discussed in later chapters).

Backoff

Interpolation is simple to understand and implement, but there are better algorithms, like backoff N-gram modeling. It uses Good-Turing discounting, is based on Katz, and is also known as Katz backoff.
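Simple linear interpolation is easy to sketch (the λ values and the component probabilities below are invented for illustration; in practice the λs would be set on held-out data, e.g. with EM):

```python
def interpolate(p_tri, p_bi, p_uni, lambdas=(0.6, 0.3, 0.1)):
    """Mix trigram, bigram and unigram estimates with fixed weights.
    The weights must sum to 1 so the result is a true probability."""
    l1, l2, l3 = lambdas
    assert abs(l1 + l2 + l3 - 1.0) < 1e-12
    return l1 * p_tri + l2 * p_bi + l3 * p_uni

# Even if the trigram was never seen (p_tri = 0), the estimate is nonzero:
print(interpolate(p_tri=0.0, p_bi=0.2, p_uni=0.05))  # 0.6*0 + 0.3*0.2 + 0.1*0.05 = 0.065
```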
P_{katz}(w_n \mid w_{n-N+1}^{n-1}) =
  \begin{cases}
    P^{*}(w_n \mid w_{n-N+1}^{n-1}) & \text{if } C(w_{n-N+1}^{n}) > 0 \\
    \alpha(w_{n-N+1}^{n-1})\, P_{katz}(w_n \mid w_{n-N+2}^{n-1}) & \text{otherwise}
  \end{cases}

The equation above describes a recursive procedure. The computation of P*, the normalizing factor α, and other details are discussed in the next section.
Trigram Discounting with Interpolation

For clarity, the words w_i, w_{i-1}, w_{i-2} are referred to as a sequence x, y, z. The Katz method incorporates discounting as an integral part of the algorithm:

P_{katz}(z \mid x, y) =
  \begin{cases}
    P^{*}(z \mid x, y) & \text{if } C(x, y, z) > 0 \\
    \alpha(x, y)\, P_{katz}(z \mid y) & \text{else if } C(x, y) > 0 \\
    P^{*}(z) & \text{otherwise}
  \end{cases}

P_{katz}(z \mid y) =
  \begin{cases}
    P^{*}(z \mid y) & \text{if } C(y, z) > 0 \\
    \alpha(y)\, P_{katz}(z) & \text{otherwise}
  \end{cases}

Katz Backoff

The Good-Turing method assigned the probability of unseen events based on the assumption that they are all equally probable. Katz backoff gives us a better way to distribute the probability mass among unseen trigram events, by relying on information from unigrams and bigrams. We use discounting to tell us how much total probability mass to set aside for all the events we haven't seen, and backoff to tell us how to distribute this probability. The discounted probability P*(.) is needed, rather than the MLE P(.), in order to account for the missing probability mass. The α weights are necessary to ensure that when backoff occurs the resulting probabilities are true probabilities that sum to 1.

Discounted Probability Computation

P* is defined as the discounted (c*) estimate of the conditional probability of an N-gram:
P^{*}(w_n \mid w_{n-N+1}^{n-1}) = \frac{c^{*}(w_{n-N+1}^{n})}{c(w_{n-N+1}^{n-1})}

Because on average the discounted c* will be less than c, this probability P* will be slightly less than the MLE estimate:

\frac{c^{*}(w_{n-N+1}^{n})}{c(w_{n-N+1}^{n-1})} < \frac{c(w_{n-N+1}^{n})}{c(w_{n-N+1}^{n-1})}

Discounted Probability Computation

The computation in the previous slide leaves some probability mass for the lower-order N-grams, which is then distributed by the α weights (described in the next section). The table in the next slide shows the Katz backoff bigram probabilities for the previous 8 sample words, computed from the BeRP corpus using the SRILM toolkit.

Smoothed Bigram Probabilities computed with the SRILM toolkit
          i         want     to       eat      chinese   food     lunch    spend
i        0.0014    0.326    0.00248  0.00355  0.000205  0.0017   0.00073  0.000489
want     0.00134   0.00152  0.656    0.000483 0.00455   0.00455  0.00073  0.000483
to       0.000512  0.00152  0.00165  0.284    0.000512  0.0017   0.00175  0.00873
eat      0.00101   0.00152  0.00166  0.00189  0.0214    0.00166  0.0563   0.000585
chinese  0.0283    0.00152  0.00248  0.00189  0.000205  0.519    0.00283  0.000585
food     0.0137    0.00152  0.0137   0.00189  0.000409  0.00366  0.00073  0.000585
lunch    0.00363   0.00152  0.00248  0.00189  0.000205  0.00131  0.00073  0.000585
spend    0.00161   0.00152  0.00161  0.00189  0.000205  0.0017   0.00073  0.000585

Advanced Details of Computing Katz backoff α and P*

The remaining details of the computation of α and P* are presented in this section. β is the total amount of leftover probability mass, a function of the (N−1)-gram context. For a given (N−1)-gram context, the total leftover probability mass can be computed by subtracting from 1 the total discounted probability mass for all N-grams starting with that context:

\beta(w_{n-N+1}^{n-1}) = 1 - \sum_{w_n : c(w_{n-N+1}^{n}) > 0} P^{*}(w_n \mid w_{n-N+1}^{n-1})

This gives us the total probability mass that we are ready to distribute to all (N−1)-grams (e.g., bigrams, if our original model was a trigram).

Advanced Details of Computing Katz backoff α and P* (cont.)

Each individual (N−1)-gram (bigram) will only get a fraction of this mass, so we need to normalize β by the total probability of all the (N−1)-grams (bigrams) that begin some N-gram (trigram) that has zero count. The final equation for computing how much probability mass to distribute from an N-gram to an (N−1)-gram is represented by the function α:
\alpha(w_{n-N+1}^{n-1}) = \frac{\beta(w_{n-N+1}^{n-1})}{\sum_{w_n : c(w_{n-N+1}^{n}) = 0} P_{katz}(w_n \mid w_{n-N+2}^{n-1})}
                        = \frac{1 - \sum_{w_n : c(w_{n-N+1}^{n}) > 0} P^{*}(w_n \mid w_{n-N+1}^{n-1})}{1 - \sum_{w_n : c(w_{n-N+1}^{n}) > 0} P^{*}(w_n \mid w_{n-N+2}^{n-1})}

Advanced Details of Computing Katz backoff α and P* (cont.)

Note that α is a function of the preceding word string, that is, of w_{n-N+1}^{n-1}; thus the amount by which we discount each trigram (d) and the mass that gets reassigned to lower-order N-grams (α) are recomputed for every (N−1)-gram that occurs in any N-gram. We only need to specify what to do when the counts of an (N−1)-gram context are 0 (i.e., when c(w_{n-N+1}^{n-1}) = 0), and our definition is complete:

P_{katz}(w_n \mid w_{n-N+1}^{n-1}) = P_{katz}(w_n \mid w_{n-N+2}^{n-1}) \quad \text{if } c(w_{n-N+1}^{n-1}) = 0
P^{*}(w_n \mid w_{n-N+1}^{n-1}) = 0 \quad \text{if } c(w_{n-N+1}^{n-1}) = 0
\beta(w_{n-N+1}^{n-1}) = 1 \quad \text{if } c(w_{n-N+1}^{n-1}) = 0

Practical Issues
Toolkits and Data Formats

How are N-gram language models represented? Language model probabilities are represented and computed in log format, to avoid underflow and to speed up computation. Probabilities are by definition less than 1, so multiplying enough N-grams together would result in numerical underflow. Using log probabilities instead of raw probabilities, the numbers do not get as small. Adding in log space is equivalent to multiplying in linear space, so log probabilities are combined by addition; and in general addition is faster than multiplication on most general-purpose computers. Reporting true probabilities, if necessary, requires an exponentiation operation:
p_1 \times p_2 \times p_3 \times p_4 = \exp[\log(p_1) + \log(p_2) + \log(p_3) + \log(p_4)]
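A minimal sketch of the identity above (the probability values are invented for illustration):

```python
import math

probs = [0.1, 0.02, 0.3, 0.05]

# Multiplying many small probabilities drives the product toward underflow...
direct = 1.0
for p in probs:
    direct *= p

# ...so we sum log probabilities instead, exponentiating only at the end.
log_sum = sum(math.log(p) for p in probs)
print(direct, math.exp(log_sum))  # both ~3e-05
```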
Practical Issues: Toolkits and Data Formats

Backoff N-gram language models are generally stored in ARPA format. An N-gram model in ARPA format is an ASCII file with a small header followed by a list of all the nonzero N-gram probabilities: all the unigrams, followed by the bigrams, followed by the trigrams, and so on. Each N-gram entry is stored with its discounted log probability (in log10 format) and its backoff weight α. Backoff weights are only necessary for N-grams which form a prefix of a longer N-gram, so no α's are computed for the highest-order N-grams (e.g., trigrams) or for N-grams ending in the end-of-sequence token </s>.
Practical Issues: Toolkits and Data Formats

The format of each N-gram entry in a trigram grammar is:

unigram:  log p*(w_i)                       w_i                   log α(w_i)
bigram:   log p*(w_i | w_{i-1})             w_{i-1} w_i           log α(w_{i-1} w_i)
trigram:  log p*(w_i | w_{i-2}, w_{i-1})    w_{i-2} w_{i-1} w_i

Example of ARPA formatted LM file from BeRP corpus (Juras

Probability Computation

Given a sequence x, y, z, the trigram probability
P(z | x, y) is computed from the model as follows:

P_{katz}(z \mid x, y) =
  \begin{cases}
    P^{*}(z \mid x, y) & \text{if } C(x, y, z) > 0 \\
    \alpha(x, y)\, P_{katz}(z \mid y) & \text{else if } C(x, y) > 0 \\
    P^{*}(z) & \text{otherwise}
  \end{cases}

P_{katz}(z \mid y) =
  \begin{cases}
    P^{*}(z \mid y) & \text{if } C(y, z) > 0 \\
    \alpha(y)\, P^{*}(z) & \text{otherwise}
  \end{cases}

Toolkits (Publicly Available)

SRILM (Stolcke, 2002)
http://citeseer.ist.psu.edu/621361.html
http://www.speech.sri.com/projects/srilm/

Cambridge-CMU toolkit (Clarkson & Rosenfeld, 1997)
http://www.cs.cmu.edu/~archan/sphinxInfo.html#cmulmtk
http://www.cs.cmu.edu/~archan/s_info/CMULMTK/toolkit_doc
http://www.speech.cs.cmu.edu/SLM_info.html

ADVANCED ISSUES IN LANGUAGE MODELING
Advanced Smoothing Methods: Kneser-Ney Smoothing

A brief introduction to the most commonly used modern N-gram smoothing method, the Interpolated Kneser-Ney algorithm. The algorithm is based on the absolute discounting method, a more elaborate way of computing the revised count c* than the Good-Turing discount formula. Revisiting the Good-Turing estimates of the bigram counts extended from the slide "Bigram Examples":

c(MLE)     0          1      2     3     4     5     6     7     8     9
c*(GT)     0.0000270  0.446  1.26  2.24  3.24  4.22  5.19  6.21  7.24  8.25
Δ = c−c*   --         0.554  0.74  0.76  0.76  0.78  0.81  0.79  0.76  0.75

Advanced Smoothing Methods: Kneser-Ney Smoothing

The re-estimated counts c* for counts greater than 1 can be estimated pretty well by just subtracting 0.75 from the MLE count c. The absolute discounting method formalizes this intuition by subtracting a fixed (absolute) discount d from each count. The rationale is that we already have good estimates for the high counts, and a small discount d won't affect them much; it affects only the smaller counts, for which we do not necessarily trust the estimate anyhow.
The equation for absolute discounting applied to bigrams (assuming a proper coefficient α on the backoff to make everything sum to one) is:

P_{absolute}(w_i \mid w_{i-1}) =
  \begin{cases}
    \dfrac{c(w_{i-1}w_i) - D}{c(w_{i-1})} & \text{if } c(w_{i-1}w_i) > 0 \\
    \alpha(w_i)\, P_{absolute}(w_i) & \text{otherwise}
  \end{cases}

Advanced Smoothing Methods: Kneser-Ney Smoothing
In practice, distinct discount values D for the 0 and 1 counts are computed. Kneser-Ney discounting augments absolute discounting with a more sophisticated way to handle the backoff distribution. Consider the job of predicting the next word in the sentence, assuming we are backing off to a unigram model: "I can't see without my reading XXXXXX." The word "glasses" seems much more likely to follow than the word "Francisco". But "Francisco" is in fact more common, and thus a unigram model will prefer it to "glasses".
1. Thus we would like to capture that although “Francisco” is frequent, it is only frequent after the word “San”.
2. The word "glasses" has a much wider distribution.

Advanced Smoothing Methods: Kneser-Ney Smoothing

Thus the idea is: instead of backing off to the unigram MLE count (the number of times the word w has been seen), we want to use a completely different backoff distribution. We want a heuristic that more accurately estimates the number of times we might expect to see word w in a new, unseen context. The Kneser-Ney intuition is to base our estimate on the number of different contexts word w has appeared in. Words that have appeared in more contexts are more likely to appear in some new context as well. The new backoff probability can be expressed as the "continuation probability":

P_{continuation}(w_i) = \frac{|\{w_{i-1} : c(w_{i-1}w_i) > 0\}|}{\sum_{w_i} |\{w_{i-1} : c(w_{i-1}w_i) > 0\}|}

Kneser-Ney backoff is formalized as follows (assuming a proper coefficient α on the backoff to make everything sum to one):

P_{KN}(w_i \mid w_{i-1}) =
  \begin{cases}
    \dfrac{c(w_{i-1}w_i) - D}{c(w_{i-1})} & \text{if } c(w_{i-1}w_i) > 0 \\
    \alpha(w_i)\, \dfrac{|\{w_{i-1} : c(w_{i-1}w_i) > 0\}|}{\sum_{w_i} |\{w_{i-1} : c(w_{i-1}w_i) > 0\}|} & \text{otherwise}
  \end{cases}

Interpolated vs Backoff form of Kneser-Ney

The Kneser-Ney backoff algorithm was shown to be inferior to its interpolated version. Interpolated Kneser-Ney discounting can be computed with an equation like the following (omitting the computation of β):

P_{KN}(w_i \mid w_{i-1}) = \frac{c(w_{i-1}w_i) - D}{c(w_{i-1})} + \beta(w_i)\, \frac{|\{w_{i-1} : c(w_{i-1}w_i) > 0\}|}{\sum_{w_i} |\{w_{i-1} : c(w_{i-1}w_i) > 0\}|}

Practical note: it turns out that any interpolated model can be represented as a backoff model, and hence stored in ARPA backoff format. The interpolation is done when the model is built, so the 'bigram' probability stored in the backoff format is really 'bigram already interpolated with unigram'.
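The continuation probability can be computed directly from bigram counts. A minimal sketch (the toy corpus is invented to echo the glasses/Francisco example):

```python
from collections import Counter

corpus = ("san francisco is foggy . reading glasses help . "
          "my glasses broke . san francisco is far .").split()
bigrams = Counter(zip(corpus, corpus[1:]))

def continuation_p(w):
    # |{w' : c(w' w) > 0}| / sum over all words of the same quantity.
    # The denominator equals the number of distinct bigram types.
    contexts = sum(1 for (a, b) in bigrams if b == w)
    return contexts / len(bigrams)

# "francisco" only ever follows "san", while "glasses" appears after two
# different words, so "glasses" gets the higher continuation probability:
print(continuation_p("francisco") < continuation_p("glasses"))  # True
```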
Class-based N-grams

Class-based N-grams, or cluster N-grams, are a variant of the N-gram that uses information about word classes or clusters. They are useful for dealing with scarcity in the training data. Example: suppose for a flight reservation system we want to compute the probability of the bigram to Shanghai, but this bigram never occurs in the training set. Assume that our training data has to London, to Beijing, and to Denver. If we knew that these were all cities, and assuming Shanghai does appear in the training set in other contexts, we could predict the likelihood of a city following to.

Class-based N-grams

There are many variants of cluster N-grams. IBM clustering is a hard clustering: each word can belong to only one class. The model estimates the conditional probability of a word w_i by multiplying two factors: the probability of the word's class c_i given the preceding class (based on an N-gram over classes), and the probability of w_i given c_i:

P(w_i \mid w_{i-1}) \approx P(c_i \mid c_{i-1}) \times P(w_i \mid c_i)
Class-based N-grams

Assuming that there is a training corpus in which we have a class label for each word, the MLE of the probability of the word given the class, and of the class given the previous class, can be computed as follows:

P(w \mid c) = \frac{C(w)}{C(c)}

P(c_i \mid c_{i-1}) = \frac{C(c_{i-1}c_i)}{\sum_{c} C(c_{i-1}c)}
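The factorization above can be sketched with hand-assigned classes (all words, classes, and counts below are invented for illustration, following the to-Shanghai example):

```python
from collections import Counter

word_class = {"to": "PREP", "london": "CITY", "beijing": "CITY",
              "shanghai": "CITY", "denver": "CITY"}
# Training bigrams were "to london", "to beijing", "to denver" (no "to shanghai"):
class_bigrams = Counter({("PREP", "CITY"): 3})
class_counts = Counter({"PREP": 3, "CITY": 4})  # class token counts
word_counts = Counter({"to": 3, "london": 1, "beijing": 1,
                       "denver": 1, "shanghai": 1})

def p_class_bigram(w, prev):
    # P(w | prev) ~= P(class(w) | class(prev)) * P(w | class(w))
    cw, cp = word_class[w], word_class[prev]
    p_cc = class_bigrams[(cp, cw)] / class_counts[cp]
    p_wc = word_counts[w] / class_counts[cw]
    return p_cc * p_wc

# "to shanghai" never occurred, yet it gets a nonzero probability because
# CITY words follow "to" and "shanghai" is a CITY:
print(p_class_bigram("shanghai", "to"))  # (3/3) * (1/4) = 0.25
```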
Cluster N-grams are generally used in two ways:

1. With hand-designed, domain-specific word classes. In an airline information system we might use classes like CITYNAME, AIRLINE, DAY-OF-WEEK, MONTH, etc.
2. With automatically induced classes, obtained by clustering words in a corpus. Syntactic categories like part-of-speech tags don't seem to work well as classes.

Whether automatically induced or hand-designed, cluster N-grams are generally mixed with regular word-based N-grams.

Language Model Adaptation and Using the WWW

One of the most recent developments in language modeling is language model adaptation. It is relevant when one has only a small amount of in-domain training data, but a large amount of data from some other domain: train on the larger out-of-domain dataset, and adapt the models to the small in-domain set. An obvious large data source for this type of adaptation is the WWW. The simplest way to use the web to improve, say, trigram language models is to use search engines to get counts for w1w2
and w1w2w3, and then compute

\hat{p}_{web} = \frac{c_{web}(w_1 w_2 w_3)}{c_{web}(w_1 w_2)}

One can mix \hat{p}_{web} with a conventional N-gram. Also, more sophisticated methods can be used, combining methods that make use of topic or class dependence to find domain-relevant data on the web.

Problems: in practice it is impossible to download every page from the web in order to compute N-grams, so only page counts are used from the data returned by search engines. Page counts are only approximations to actual counts, for many reasons:

- A page may contain an N-gram multiple times.
- Most search engines round off their counts.
- Punctuation is deleted, and
Counts may be adjusted due to link and other information.
The result is not hugely affected in spite of the "noise" due to inaccuracies of the collected information. It is possible to perform specific adjustments, such as fitting a regression to predict actual word counts from page counts.

Using Longer Distance Information: A Brief Summary

There are methods for incorporating longer-distance context into N-gram modeling. While we have limited our discussion mainly to bigrams and trigrams, state-of-the-art speech recognition systems, for example, are based on longer-distance N-grams, especially 4-grams but also 5-grams. Goodman (2006) showed that with 284 million words of training data, 5-grams do improve perplexity scores over 4-grams, but not by much. Goodman checked contexts up to 20-grams and found that after 6-grams, longer contexts were not useful, at least not with 284 million words of training data.

More Sophisticated Models

People tend to repeat words they have used before: if a word is used once in a text, it will probably be used again. We can capture this fact with a cache language model (Kuhn and De Mori, 1990). To use a unigram cache model to predict word i of a test corpus, we create a unigram grammar from the preceding part of the test corpus (words 1 to i−1) and mix it with our conventional N-gram. We might use only a shorter window of previous words rather than the entire set. Cache language models are very powerful in applications where we have perfect knowledge of the words. They work less well in domains where the previous words are not known exactly; in speech applications, for example, unless there is some way for users to correct errors, cache models tend to "lock in" to errors they made on earlier words.

Repetition of words in a text is a symptom of a more general fact about words: texts tend to be about things. Documents about particular topics tend to use similar words, which suggests that we could train a separate language model for each topic. Topic-based language models take advantage of the fact that different topics will have different kinds of words: train a different language model for each topic t, and then mix them, weighted by how likely each topic is given the history h:

p(w \mid h) = \sum_t P(w \mid t)\, P(t \mid h)

Latent Semantic Indexing: based on the intuition that upcoming words are semantically similar to preceding words in the text. Word similarity is computed from a measure of semantic word association such as latent semantic indexing, or computed from dictionaries or thesauri, and then mixed with a conventional N-gram.

Trigger-word-based N-grams: the predictor word, called a trigger, is not adjacent but is very related, i.e., it has high mutual information with the word we are trying to predict.

Skip N-grams: the preceding context "skips over" some intermediate words, as in P(w_i \mid w_{i-1}, w_{i-3}).
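The topic-mixture formula above can be sketched in a few lines of code. This is a minimal illustration only: the topic set, the per-topic word probabilities P(w|t), and the topic posterior P(t|h) below are all invented toy values, not real model output.

```python
# Minimal sketch of a topic-mixture language model:
#   p(w | h) = sum_t P(w | t) * P(t | h)
# All probability tables below are invented toy values for illustration.

def mixture_prob(word, topic_word_probs, topic_given_history):
    """Mix per-topic word probabilities, weighted by the topic posterior."""
    return sum(topic_given_history[t] * topic_word_probs[t].get(word, 0.0)
               for t in topic_given_history)

# Hypothetical per-topic unigram models P(w | t)
topic_word_probs = {
    "sports":  {"game": 0.05,  "market": 0.001},
    "finance": {"game": 0.001, "market": 0.06},
}
# Hypothetical topic posterior P(t | h) given the preceding history
topic_given_history = {"sports": 0.3, "finance": 0.7}

p = mixture_prob("market", topic_word_probs, topic_given_history)
# 0.3 * 0.001 + 0.7 * 0.06 = 0.0423
```

Note how a history that favors the "finance" topic pulls up the probability of topic-typical words like "market"; a real system would estimate P(t|h) with a topic classifier over the discourse so far.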
Variable-length N-grams: the preceding context is extended where a longer phrase is particularly frequent.
Using very large and rich contexts can result in very large language models. These models are often pruned, removing low-probability events.
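Pruning can be as simple as dropping N-grams that fall below a count threshold. The sketch below uses that simple criterion with invented toy counts; production toolkits typically use more principled criteria such as entropy-based pruning.

```python
# Minimal sketch of pruning an N-gram table by a raw count cutoff.
# The trigram counts below are invented toy values.

trigram_counts = {
    ("i", "want", "to"):        412,
    ("want", "to", "eat"):       86,
    ("to", "eat", "kumquats"):    1,   # rare event, candidate for removal
}

def prune(counts, min_count):
    """Keep only N-grams observed at least min_count times."""
    return {ngram: c for ngram, c in counts.items() if c >= min_count}

pruned = prune(trigram_counts, min_count=2)
# the singleton trigram is removed; 2 trigrams remain
```

Events removed this way are not lost entirely at query time: their probability is reassigned through the model's backoff to lower-order N-grams.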
There is a large body of research on integrating sophisticated linguistic structures into language modeling, as described in the following chapters of the textbook.

ADVANCED TOPIC: INFORMATION THEORY BACKGROUND

Information Theory Background

In the previous section, perplexity was introduced as a way to evaluate N-gram models on a test set: a better N-gram model is one that assigns a higher probability to the test data, and perplexity is a normalized version of the probability of the test set. Another way to think about perplexity is based on the information-theoretic concept of cross-entropy. This section introduces fundamentals of information theory, including the concept of cross-entropy.
Reference: "Elements of Information Theory", Cover and Thomas, Wiley-Interscience, 1991.

Entropy

Entropy is a measure of information content. Computing entropy requires establishing a random variable X that takes values from whatever is being predicted (words, letters, parts of speech, ...), drawn from a set χ with probability function p(x). The entropy of the random variable X is then defined as:

H(X) = -\sum_{x \in \chi} p(x) \log_2 p(x)

The log can in principle be computed in any base; however, if base 2 is used, the resulting value is measured in bits.

The most intuitive way to define entropy for a computer scientist is to think of it as a lower bound on the number of bits it would take to encode a certain decision or piece of information in the optimal coding scheme. Cover and Thomas provide the following example: imagine that we want to place a bet on a horse race but it is too far to go all the way to Yonkers Racetrack, and we'd like to send a short message to the bookie to tell him which horse to bet on. Suppose there are eight horses in this particular race. One way to encode this message is just to use the binary representation of the horse's number as the code; thus horse 1 would be 001, horse 2 010, horse 3 011, and so on, with horse 8 coded as 000. If we spend the whole day betting, and each horse is coded with 3 bits, on average we would be sending 3 bits per race. Can we do better?

Suppose that the spread is the actual distribution of the bets placed, and that we represent it as the prior probability of each horse as follows:
Horse   Prior
1       1/2   = 32/64
2       1/4   = 16/64
3       1/8   =  8/64
4       1/16  =  4/64
5       1/64  =  1/64
6       1/64  =  1/64
7       1/64  =  1/64
8       1/64  =  1/64

The entropy of the random variable X that ranges over horses gives us a lower bound on the number of bits, and it is:
H(X) = -\sum_{i=1}^{8} p(i) \log_2 p(i)
     = -\left( \frac{1}{2}\log_2\frac{1}{2} + \frac{1}{4}\log_2\frac{1}{4} + \frac{1}{8}\log_2\frac{1}{8} + \frac{1}{16}\log_2\frac{1}{16} + 4 \cdot \frac{1}{64}\log_2\frac{1}{64} \right)
     = 2 \text{ bits}

Variable-length encoding:
0 for the most likely horse,
10 for the next most likely horse,
110 and 1110 for the next two, and
111100, 111101, 111110, and 111111 for the last four equally likely horses.
The average number of bits sent with this code matches the entropy: 1/2·1 + 1/4·2 + 1/8·3 + 1/16·4 + 4·(1/64·6) = 2 bits.

The entropy for the equal-length binary code, applicable when the horses are equally likely, is:

H(X) = -\sum_{i=1}^{8} p(i) \log_2 p(i) = -\sum_{i=1}^{8} \frac{1}{8}\log_2\frac{1}{8} = -\log_2\frac{1}{8} = 3 \text{ bits}

Practical application in language processing involves sequences; for a grammar, one will be computing the entropy of some sequence of words W = {w0, w1, w2, ..., wn}. One way to compute the entropy of a sequence is to assign a random variable that ranges over all finite sequences of words of length n in some language L, as follows:

H(w_1, w_2, \ldots, w_n) = -\sum_{W_1^n \in L} p(W_1^n) \log_2 p(W_1^n)

The entropy rate can be defined as the per-word entropy of a sequence, i.e., the entropy divided by the number of words:
\frac{1}{n} H(W_1^n) = -\frac{1}{n} \sum_{W_1^n \in L} p(W_1^n) \log_2 p(W_1^n)

To compute the true entropy of a language, one needs to consider sequences of infinite length. Assuming language L is a stochastic process that produces a sequence of words, its entropy rate H(L) is defined as:
H(L) = \lim_{n \to \infty} \frac{1}{n} H(w_1, w_2, \ldots, w_n)
     = -\lim_{n \to \infty} \frac{1}{n} \sum_{W \in L} p(w_1, w_2, \ldots, w_n) \log_2 p(w_1, w_2, \ldots, w_n)

Based on the Shannon-McMillan-Breiman theorem, for a language that is regular in certain ways (specifically, if it is both stationary and ergodic), the following expression can be used:

H(L) = -\lim_{n \to \infty} \frac{1}{n} \log_2 p(w_1, w_2, \ldots, w_n)

That is, we can take a single sequence that is long enough instead of summing over all possible sequences. The rationale of the Shannon-McMillan-Breiman theorem is that a long enough sequence of words will contain within it many other shorter sequences, and each of these shorter sequences will reoccur in the longer sequence according to its probability.

A stochastic process is said to be stationary if the probabilities it assigns to a sequence are invariant with respect to shifts in the time index: the probability distribution for words at time t is the same as the probability distribution at time t+1. Markov models, and hence N-grams, are stationary. In a bigram, P_i depends only on P_{i-1}, so if we shift the time index by x, P_{i+x} is still dependent on P_{i+x-1}. However, natural language is not stationary, since, as we will see later (Ch. 12 of the book), the probability of upcoming words can depend on events that were arbitrarily distant in time. Consequently, statistical models give only an approximation to the correct distributions and entropies of natural language.

To summarize: by making some incorrect but convenient simplifying assumptions, we can compute the entropy of some stochastic process by taking a very long sample of the output and computing its average log probability. The next section discusses the why and the how: why we would want to do this (i.e., for what kinds of problems the entropy tells us something useful), and how to compute the probability of a very long sequence.

Cross Entropy & Comparing Models

Cross-entropy is useful when we do not know the actual probability distribution p that generated the data. It uses a model m of the distribution p, as follows:
H(p, m) = -\lim_{n \to \infty} \frac{1}{n} \sum_{W \in L} p(w_1, w_2, \ldots, w_n) \log_2 m(w_1, w_2, \ldots, w_n)

For a stationary ergodic process this expression becomes:

H(p, m) = -\lim_{n \to \infty} \frac{1}{n} \log_2 m(w_1, w_2, \ldots, w_n)

The cross-entropy H(p, m) is useful because it gives an upper bound on the entropy H(p). For any model m:

H(p) ≤ H(p, m)

This means that we can use some simplified model m to help estimate the true entropy of a sequence of symbols drawn according to probability p. The more accurate m is, the closer the cross-entropy H(p, m) will be to the true entropy H(p); thus the difference between H(p, m) and H(p) is a measure of how accurate a model is. Between two models m1 and m2, the more accurate model will be the one with the lower cross-entropy. (The cross-entropy can never be lower than the true entropy, so a model cannot err by underestimating the true entropy.)

Perplexity and Cross-Entropy

Cross-entropy is defined in the limit, as the length of the observed word sequence goes to infinity. We therefore need an approximation to cross-entropy, relying on a (sufficiently long) sequence of fixed length. This approximation to the cross-entropy of a model M = P(w_i \mid w_{i-N+1} \ldots w_{i-1}) on a sequence of words W is:

H(W) = -\frac{1}{N} \log_2 P(w_1, w_2, \ldots, w_N)

The perplexity of a model P on a sequence of words W is defined as the exponent of this cross-entropy:
PP(W) = 2^{H(W)} = 2^{-\frac{1}{N} \log_2 P(w_1 w_2 \ldots w_N)} = P(w_1 w_2 \ldots w_N)^{-\frac{1}{N}} = \sqrt[N]{\frac{1}{P(w_1 w_2 \ldots w_N)}}
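The perplexity computation above reduces to a few lines of code once a model can supply per-word conditional probabilities. This is a minimal sketch; the probabilities below are invented toy values standing in for a real model's P(w_i | history).

```python
import math

# Sketch: approximate cross-entropy H(W) = -(1/N) log2 P(w1..wN) and
# perplexity PP(W) = 2^H(W) from a model's per-word conditional
# probabilities. The probability values below are invented toy data.

def cross_entropy(word_probs):
    """Per-word cross-entropy in bits, from conditional word probabilities."""
    n = len(word_probs)
    return -sum(math.log2(p) for p in word_probs) / n

def perplexity(word_probs):
    """Perplexity as the exponent (base 2) of the cross-entropy."""
    return 2 ** cross_entropy(word_probs)

probs = [0.25, 0.5, 0.125, 0.25]   # hypothetical P(wi | history) values
H = cross_entropy(probs)           # (2 + 1 + 3 + 2) / 4 = 2.0 bits
PP = perplexity(probs)             # 2^2 = 4.0
```

The result illustrates the "weighted average branching factor" reading of perplexity: a PP of 4.0 means the model is, on average, as uncertain as if it had to choose uniformly among 4 words at each step.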
Advanced Topic: The Entropy of English and Entropy Rate Constancy

As we suggested in the previous section, the cross-entropy of some model m can be used as an upper bound on the true entropy of some process. We can use this method to get an estimate of the true entropy of English. Why should we care about the entropy of English?
1. Knowing the entropy of English would give us a solid lower bound for all of our future experiments on probabilistic grammars.
2. We can use the entropy of English to help understand what parts of a language provide the most information: is the predictability of English mainly based on word order, semantics, morphology, constituency, or on pragmatic cues? Answering this question can help us immensely in knowing where to focus our language-modeling efforts.

There are two common methods for computing the entropy of English:
1. Use human subjects in a psychological experiment that requires them to guess strings of letters. By looking at how many guesses it takes them to guess letters correctly, we can estimate the probability of the letters and hence the entropy of the sequence. This method was used by Shannon in 1951 in his groundbreaking work defining the field of information theory.
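Shannon's guessing method can be sketched numerically: record how many guesses each letter took, estimate the distribution over the number of guesses, and compute its entropy. The guess sequence below is invented toy data, not results from a real subject.

```python
import math
from collections import Counter

# Sketch of the guessing-game estimate: the entropy of the
# number-of-guesses sequence approximates the per-letter entropy of the
# text. The guess counts below are invented toy data for illustration.

guesses = [1, 1, 2, 1, 3, 1, 1, 2, 1, 4, 1, 1, 2, 1, 1, 1]

def guess_entropy(guesses):
    """Entropy in bits of the empirical number-of-guesses distribution."""
    n = len(guesses)
    counts = Counter(guesses)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())

H = guess_entropy(guesses)  # low H means the letters were easy to predict
```

A subject who always guesses right on the first try yields an entropy of 0 bits; the more often later guesses are needed, the higher the estimate climbs.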
2. Take a very good stochastic model, trained on a very large corpus, and use it to assign a log probability to a very long sequence of English, applying the Shannon-McMillan-Breiman theorem introduced earlier and repeated here for clarity:

H(English) = -\lim_{n \to \infty} \frac{1}{n} \log_2 m(w_1, w_2, \ldots, w_n)

Shannon Experiment
The actual experiment is designed as follows: we present a subject with some English text and ask the subject to guess the next letter. The subjects will use their knowledge of the language to guess the most probable letter first, the next most probable second, and so on. We record the number of guesses it takes for the subject to guess correctly. Shannon's insight was that the entropy of the number-of-guesses sequence is the same as the entropy of English. (The intuition is that given the number-of-guesses sequence, we could reconstruct the original text by choosing the "nth most probable" letter whenever the subject took n guesses.) This methodology requires the use of letter guesses rather than word guesses (since the subject sometimes has to do an exhaustive search of all the possible letters!), and so Shannon computed the per-letter entropy of English rather than the per-word entropy. He reported an entropy of 1.3 bits per letter (for 27 characters: 26 letters plus space). Shannon's estimate is likely too low, since it is based on a single text (Jefferson the Virginian by Dumas Malone); Shannon notes that his subjects had worse guesses (hence higher entropies) on other texts (newspaper writing, scientific work, and poetry). More recent variations on the Shannon experiments include the use of a gambling paradigm where the subjects get to bet on the next letter (Cover and King, 1978; Cover and Thomas, 1991).

Computer-Based Computation of the Entropy of English

The second method for computing the entropy of English helps avoid the single-text problem that confounds Shannon's results.
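This model-based method reduces to scoring a long character sequence with a model and normalizing by its length. The sketch below uses an invented independent-character distribution over a tiny alphabet purely for illustration; a real estimate would use a strong model such as a word trigram.

```python
import math

# Sketch of the model-based entropy estimate: score a text with a model
# and divide the total log probability by the number of characters,
# following the Shannon-McMillan-Breiman single-sequence approximation.
# The character distribution below is invented toy data.

def char_entropy_estimate(text, char_probs):
    """-(1/n) * log2 P(text) under an independent-character model, in bits."""
    logp = sum(math.log2(char_probs[c]) for c in text)
    return -logp / len(text)

# Hypothetical unigram character distribution over a tiny alphabet
char_probs = {"a": 0.5, "b": 0.25, "c": 0.125, " ": 0.125}

H = char_entropy_estimate("ab ca", char_probs)  # bits per character
```

The better the model fits the language, the lower (and closer to the true entropy) this per-character cross-entropy becomes.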
For example, Brown et al. (1992a) trained a trigram language model on 583 million words of English (293,181 different types) and used it to compute the probability of the entire Brown corpus (1,014,312 tokens). The training data included newspapers, encyclopedias, novels, office correspondence, proceedings of the Canadian parliament, and other miscellaneous sources. They then computed the character entropy of the Brown corpus by using their word-trigram grammar to assign probabilities to the corpus, considered as a sequence of individual letters. They obtained an entropy of 1.75 bits per character (where the set of characters included all 95 printable ASCII characters).

The average length of English written words (including space) has been reported at 5.5 letters (Nádas, 1984). If this is correct, it means that the Shannon estimate of 1.3 bits per letter corresponds to a per-word perplexity of 142 for general English. The numbers we reported earlier for the WSJ experiments are significantly lower than this, since the training and test sets came from the same subsample of English. That is, those experiments underestimate the complexity of English (since the Wall Street Journal looks very little like Shakespeare, for example).

Constant Information/Entropy Rate of Speech

A number of scholars have independently made the intriguing suggestion that entropy rate plays a role in human communication in general (Lindblom, 1990; Van Son et al., 1998; Aylett, 1999; Genzel and Charniak, 2002; Van Son and Pols, 2003). The idea is that people speak so as to keep the rate of information transmitted per second roughly constant, i.e., transmitting a constant number of bits per second, or maintaining a constant entropy rate. Since the most efficient way of transmitting information through a channel is at a constant rate, language may even have evolved for such communicative efficiency (Plotkin and Nowak, 2000). There is a wide variety of evidence for the constant entropy rate hypothesis. One class of evidence, for speech, shows that speakers shorten predictable words (i.e., they take less time to say predictable words) and lengthen unpredictable words (Aylett, 1999; Jurafsky et al., 2001; Aylett and Turk, 2004).

In another line of research, Genzel and Charniak (2002, 2003) show that entropy rate constancy makes predictions about the entropy of individual sentences in a text. In particular, it predicts that local measures of sentence entropy that ignore previous discourse context (for example, the N-gram probability of the sentence) should increase with the sentence number, and they document this increase in corpora. Keller (2004) provides evidence that entropy rate plays a role for the addressee as well, showing a correlation between the entropy of a sentence and the processing effort it causes in comprehension, as measured by reading times in eye-tracking data.

END