Ch3-N-Grams

Ch3-N-Grams - Search and Decoding in Speech Recognition...


Search and Decoding in Speech Recognition
N-Grams
Veton Këpuska, February 16, 2012

N-Grams

- Problem of word prediction. Example: "I'd like to make a collect …" Very likely next words: "call", "international call", or "phone call", and NOT "the".
- The idea of word prediction is formalized with probabilistic models called N-grams. An N-gram model predicts the next word from the previous N-1 words.
- Statistical models of word sequences are also called language models, or LMs.
- Computing the probability of the next word turns out to be closely related to computing the probability of a sequence of words. Example: "… all of a sudden I notice three guys standing on the sidewalk …" vs. "… on guys all I of notice sidewalk three a sudden standing the …"
- Estimators like N-grams that assign a conditional probability to possible next words can be used to assign a joint probability to an entire sentence.
- N-gram models are one of the most important tools in speech and language processing.
- N-grams are essential in any task in which words must be identified from ambiguous and noisy input.

Speech Recognition: the input speech sounds are very confusable and many words sound extremely similar.

Handwriting Recognition: probabilities of word sequences help in recognition. In the movie "Take the Money and Run", Woody Allen tries to rob a bank with a sloppily written hold-up note that the teller incorrectly reads as "I have a gub". A speech and language processing system could avoid this mistake by using the knowledge that the sequence "I have a gun" is far more probable than the non-word sequence "I have a gub" or even "I have a gull".
Statistical Machine Translation: for a Chinese source sentence, consider choosing from a set of potential rough English translations:

  he briefed to reporters on the chief contents of the statement
  he briefed reporters on the chief contents of the statement
  he briefed to reporters on the main contents of the statement
  he briefed reporters on the main contents of the statement

An N-gram grammar might tell us that "briefed reporters" is more likely than "briefed to reporters", and that "main contents" is more likely than "chief contents".

Spelling Correction: we need to find and correct spelling errors like the following, which accidentally result in real English words:

  They are leaving in about fifteen minuets to go to her house.
  The design an construction of the system will take more than a year.

Problem: the errors are real words, so a dictionary search will not help. Note that "in about fifteen minuets" is a much less probable sequence than "in about fifteen minutes". A spell-checker can use a probability estimator both to detect these errors and to suggest higher-probability corrections.

Augmentative Communication: helping people who are unable to use speech or sign language to communicate (e.g., Stephen Hawking). The user makes simple body movements to select words from a menu, and the system speaks them. Word prediction can be used to suggest likely words for the menu.

Other areas:
- Part-of-speech tagging
- Natural language generation
- Word similarity
- Authorship identification
- Sentiment extraction
- Predictive text input (cell phones)

Corpora & Counting Words

Probabilities are based on counting things, so we must decide what to count. Counting things in natural language is based on a corpus (plural corpora), an on-line collection of text or speech. Popular corpora include "Brown" and "Switchboard".
The Brown corpus is a 1-million-word collection of samples from 500 written texts from different genres (newspaper, novels, non-fiction, academic, etc.), assembled at Brown University in 1963-1964.

Example sentence from the Brown corpus:

  He stepped out into the hall, was delighted to encounter a water brother.

This sentence has 13 words if we do not count punctuation marks as words, and 15 if we do. The treatment of "," and "." depends on the task. Punctuation marks are critical for identifying boundaries (, . ;) and for identifying some aspects of meaning (? ! "). For some tasks (part-of-speech tagging, parsing, or sometimes speech synthesis) punctuation marks are treated as separate words.

The Switchboard corpus is a collection of 2430 telephone conversations averaging 6 minutes each, for a total of 240 hours of speech and about 3 million words. Corpora of this kind have no punctuation, and defining words becomes more complicated. Example:

  I do uh main- mainly business data processing.

This utterance contains two kinds of disfluencies. The broken-off word "main-" is called a fragment. Words like "uh" and "um" are called fillers or filled pauses.

Whether disfluencies count as words depends on the application:
- An automatic dictation system based on automatic speech recognition will remove disfluencies.
- A speaker identification application can use disfluencies to identify a person.
- Parsing and word prediction can use disfluencies. Stolcke and Shriberg (1996) found that treating "uh" as a word improves next-word prediction, and thus most speech recognition systems treat "uh" and "um" as words.

Are capitalized tokens like "They" and uncapitalized tokens like "they" the same word? In speech recognition they are treated the same. In part-of-speech tagging, capitalization is retained as a separate feature. In this chapter the models are not case sensitive.
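The effect of these counting decisions can be checked on the Brown sentence above. The following is a minimal sketch, not a production tokenizer: the function name `tokenize` and its parameters are assumptions made for illustration, and the regular expression would split contractions such as "don't" into pieces.

```python
import re

def tokenize(text, keep_punctuation=True, lowercase=False):
    """Split text into tokens; optionally keep punctuation marks as tokens.

    Hypothetical helper for illustration only.
    """
    if lowercase:
        text = text.lower()
    # \w+ matches a run of word characters; [^\w\s] matches one punctuation mark.
    pattern = r"\w+|[^\w\s]" if keep_punctuation else r"\w+"
    return re.findall(pattern, text)

sentence = "He stepped out into the hall, was delighted to encounter a water brother."
print(len(tokenize(sentence, keep_punctuation=False)))  # 13
print(len(tokenize(sentence, keep_punctuation=True)))   # 15

# Case folding makes "They" and "they" the same token.
print(tokenize("They said they would", lowercase=True))
```

Whether to keep punctuation or fold case is a task decision, as the text notes; the flags simply make that decision explicit.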
Inflected forms: "cats" versus "cat". These two words have the same lemma "cat" but are different wordforms.
- A lemma is a set of lexical forms having the same stem, major part-of-speech, and word-sense.
- A wordform is the full inflected or derived form of the word.

In this chapter N-grams are based on wordforms. N-gram models, and counting words in general, require the kind of tokenization or text normalization that was introduced in the previous chapter:
- Separating out punctuation
- Dealing with abbreviations (m.p.h.)
- Normalizing spelling, etc.

How many words are there in English? We must first distinguish types, the number of distinct words in a corpus (the vocabulary size V), from tokens, the total number N of running words. Example:

  They picnicked by the pool, then lay back on the grass and looked at the stars.

This sentence has 16 tokens and 14 types.

- The Switchboard corpus has ~20,000 wordform types and ~3 million wordform tokens.
- Shakespeare's complete works have 29,066 wordform types and 884,647 wordform tokens.
- The Brown corpus has 61,805 wordform types, 37,851 lemma types, and 1 million wordform tokens.
- A study of a very large corpus (Brown 1992a) found 293,181 different wordform types among 583 million wordform tokens.
- The American Heritage Dictionary, third edition, lists 200,000 boldface forms.

It seems that the larger the corpus, the more word types are found. It is suggested that vocabulary size (the number of types) grows at least as fast as the square root of the number of tokens:

\[ V > O(\sqrt{N}) \]

Brief Introduction to Probability

Discrete Probability Distributions

Definition: the sample space S is the set of all possible outcomes:

\[ S = \{x_1, x_2, x_3, \ldots, x_N\} \]

Each element $x \in S$ is assigned a probability value $P(x)$ with the following properties:

1. $P(x) \in [0, 1]$ for all $x \in S$
2. $\sum_{x \in S} P(x) = 1$

An event is defined as any subset E of the sample space S. The probability of the event E is defined as:

\[ P(E) = \sum_{x \in E} P(x) \]

The probability of the entire space S is 1, as stated by property 2 above. The probability of the empty (null) event is 0. The function P(x) mapping a point in the sample space to its probability value is called a probability mass function (pmf).

Properties of the Probability Function

If A and B are mutually exclusive events in S, then:

\[ P(A \cup B) = P(A) + P(B) \]

Mutually exclusive events are those for which $A \cap B = \emptyset$. In general, for n mutually exclusive events:

\[ P(A_1 \cup A_2 \cup A_3 \cup \cdots \cup A_n) = P(A_1) + P(A_2) + P(A_3) + \cdots + P(A_n) \]

[Figure: Venn diagram of two mutually exclusive events A and B]

Elementary Theorems of Probability

If A is any event in S, then $P(A') = 1 - P(A)$, where A' is the complement of A, that is, the set of all outcomes in S that are not in A.

Proof: $P(A \cup A') = P(A) + P(A')$, and since $P(A \cup A') = P(S) = 1$, it follows that $P(A) + P(A') = 1$.

If A and B are any events in S, then $P(A \cup B) = P(A) + P(B) - P(A \cap B)$.

Proof:

\begin{align*}
P(A \cup B) &= P(A \cap B') + P(A \cap B) + P(A' \cap B) \\
            &= [P(A \cap B') + P(A \cap B)] + [P(A' \cap B) + P(A \cap B)] - P(A \cap B) \\
            &= P(A) + P(B) - P(A \cap B)
\end{align*}

[Figure: Venn diagram of S showing A ∪ B split into the regions A ∩ B', A ∩ B, and A' ∩ B]

Conditional Probability

If A and B are any events in S, and $P(B) \neq 0$, the conditional probability of A relative to B is given by:

\[ P(A \mid B) = \frac{P(A \cap B)}{P(B)} \]

If A and B are any events in S, then:

\begin{align*}
P(A \cap B) &= P(A \mid B)\,P(B) \quad \text{if } P(B) \neq 0 \\
P(A \cap B) &= P(B \mid A)\,P(A) \quad \text{if } P(A) \neq 0
\end{align*}

Independent Events

If A and B are independent events, then:

\begin{align*}
P(A \cap B) &= P(A \mid B)\,P(B) = P(A)\,P(B) \\
P(A \cap B) &= P(B \mid A)\,P(A) = P(B)\,P(A)
\end{align*}

Bayes Rule

If $B_1, B_2, B_3, \ldots, B_n$ are mutually exclusive events of which one must occur, that is, $\sum_{i=1}^{n} P(B_i) = 1$, then:

\[ P(B_i \mid A) = \frac{P(A \mid B_i)\,P(B_i)}{\sum_{j=1}^{n} P(A \mid B_j)\,P(B_j)} \quad \text{for } i = 1, 2, 3, \ldots, n \]

End of Brief Introduction to Probability

Simple (Unsmoothed) N-Grams

Our goal is to compute the probability of a word w given some history h: P(w|h). Example:

  h ⇒ "its water is so transparent that"
  w ⇒ "the"
  P(the | its water is so transparent that)

How can we compute this probability? One way is to estimate it from relative frequency counts. From a very large corpus, count the number of times we see "its water is so transparent that" and count the number of times this is followed by "the". In other words: out of the times we saw the history h, how many times was it followed by the word w?

\[ P(\text{the} \mid \text{its water is so transparent that}) = \frac{C(\text{its water is so transparent that the})}{C(\text{its water is so transparent that})} \]

Estimating Probabilities

While estimating probabilities from counts works fine in many cases, it turns out that even the web is not big enough to give us good estimates in most cases. Language is creative:
1. New sentences are created all the time.
2. It is not possible to count entire sentences.

Joint probabilities: the probability of an entire sequence of words like "its water is so transparent". Out of all possible sequences of 5 words, how many of them are "its water is so transparent"? We would have to count all occurrences of "its water is so transparent" and divide by the sum of the counts of all possible 5-word sequences. That is a lot of work for a simple estimate.

We must figure out cleverer ways of estimating the probability of:
- A word w given some history h, or
- An entire word sequence W.
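The relative-frequency estimate P(w|h) = C(h w) / C(h) described above amounts to two counts and a division. Here is a minimal sketch on a made-up toy token list; the helper names `count_sequence` and `relative_frequency` and the toy text are assumptions for illustration. On a real corpus, long histories would mostly have a count of zero, which is exactly the difficulty raised above.

```python
def count_sequence(tokens, seq):
    """Count how many times the token sequence `seq` occurs in `tokens`."""
    n = len(seq)
    return sum(1 for i in range(len(tokens) - n + 1) if tokens[i:i + n] == seq)

def relative_frequency(tokens, history, word):
    """Estimate P(word | history) as C(history word) / C(history)."""
    history_count = count_sequence(tokens, history)
    if history_count == 0:
        raise ValueError("history never observed; the estimate is undefined")
    return count_sequence(tokens, history + [word]) / history_count

# Toy "corpus" (made up): the history occurs twice, once followed by "the".
toy_tokens = ("its water is so transparent that the fish are visible and "
              "its water is so transparent that you can see").split()
history = "its water is so transparent that".split()

print(relative_frequency(toy_tokens, history, "the"))  # 0.5
```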
Introduction of formal notation:
- Random variable: $X_i$
- Probability of $X_i$ taking on the value "the": $P(X_i = \text{"the"}) = P(\text{the})$
- Sequence of n words: $w_1, w_2, \ldots, w_n$ or $w_1^n$
- Joint probability of each word in a sequence having a particular value:

\[ P(w_1, w_2, w_3, \ldots, w_n) = P(X = w_1, Y = w_2, Z = w_3, \ldots) \]

Chain Rule

Chain rule of probability:

\begin{align*}
P(X_1, \ldots, X_n) &= P(X_1)\,P(X_2 \mid X_1)\,P(X_3 \mid X_1^2) \cdots P(X_n \mid X_1^{n-1}) \\
                    &= \prod_{k=1}^{n} P(X_k \mid X_1^{k-1})
\end{align*}

Applying the chain rule to words we get:

\begin{align*}
P(w_1^n) &= P(w_1)\,P(w_2 \mid w_1)\,P(w_3 \mid w_1^2) \cdots P(w_n \mid w_1^{n-1}) \\
         &= \prod_{k=1}^{n} P(w_k \mid w_1^{k-1})
\end{align*}

The chain rule provides the link between computing the joint probability of a sequence and computing the conditional probability of a word given the previous words. The equation above gives a way of computing the joint probability of an entire sequence by multiplying together a number of conditional probabilities. However, we still do not know any way of computing the exact probability of a word given a long sequence of preceding words, $P(w_n \mid w_1^{n-1})$.

N-grams

Approximation: the idea of the N-gram model is to approximate the history by just the last few words, instead of computing the probability of a word given its entire history.

Bigram: the bigram model approximates the probability of a word given all the previous words, $P(w_n \mid w_1^{n-1})$, by the conditional probability given only the preceding word, $P(w_n \mid w_{n-1})$.

Example: instead of computing the probability

  P(the | Walden Pond's water is so transparent that)

we approximate it with the probability

  P(the | that)

Bigram

The following approximation is used when the bigram probability is applied:

\[ P(w_n \mid w_1^{n-1}) \approx P(w_n \mid w_{n-1}) \]

The assumption that the conditional probability of a word depends only on the previous word is called a Markov assumption.
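As a worked illustration, the chain rule and the bigram (Markov) approximation can be written out for a four-word prefix of the running example. The choice of this particular phrase is only for illustration; the expansion itself follows directly from the equations above.

```latex
\begin{align*}
P(\text{its water is so})
  &= P(\text{its})\,P(\text{water} \mid \text{its})\,
     P(\text{is} \mid \text{its water})\,P(\text{so} \mid \text{its water is})
     && \text{(chain rule, exact)} \\
  &\approx P(\text{its})\,P(\text{water} \mid \text{its})\,
     P(\text{is} \mid \text{water})\,P(\text{so} \mid \text{is})
     && \text{(bigram approximation)}
\end{align*}
```

Each factor in the second line conditions on one preceding word only, so it can be estimated from counts of word pairs rather than counts of long histories.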
Markov models are the class of probabilistic models that assume that we can predict the probability of some future unit without looking too far into the past.

Bigram Generalization

- Trigram: looks two words into the past.
- N-gram: looks N-1 words into the past.

The general equation for the N-gram approximation to the conditional probability of the next word in a sequence is:

\[ P(w_n \mid w_1^{n-1}) \approx P(w_n \mid w_{n-N+1}^{n-1}) \]

The simplest and most intuitive way to estimate probabilities is the method called Maximum Likelihood Estimation, or MLE for short.

Maximum Likelihood Estimation

The MLE estimate for the parameters of an N-gram model is obtained by taking counts from a corpus and normalizing them so that they lie between 0 and 1.

Bigram: to compute a particular bigram probability of a word y given a previous word x, the count C(xy) is computed and normalized by the sum of the counts of all bigrams that share the same first word x:

\[ P(w_n \mid w_{n-1}) = \frac{C(w_{n-1} w_n)}{\sum_{w} C(w_{n-1} w)} \]

This equation can be simplified by noting that the sum of all bigram counts that start with a given word equals the unigram count of that word:

\[ \sum_{w} C(w_{n-1} w) = C(w_{n-1}) \quad \Rightarrow \quad P(w_n \mid w_{n-1}) = \frac{C(w_{n-1} w_n)}{C(w_{n-1})} \]

Example

A mini-corpus of three sentences, each marked with a beginning-of-sentence marker <s> and an end-of-sentence marker </s>:

  <s> I am Sam </s>
  <s> Sam I am </s>
  <s> I do not like green eggs and ham </s>

(From the Dr. Seuss book "Green Eggs and Ham".)

Some of the bigram calculations from this corpus:

  P(I|<s>)    = 2/3 = 0.67
  P(Sam|<s>)  = 1/3 = 0.33
  P(am|I)     = 2/3 = 0.67
  P(Sam|am)   = 1/2 = 0.5
  P(</s>|Sam) = 1/2 = 0.5
  P(</s>|am)  = 1/2 = 0.5
  P(do|I)     = 1/3 = 0.33

N-gram Parameter Estimation

In the general case, the MLE for an N-gram model is calculated as:

\[ P(w_n \mid w_{n-N+1}^{n-1}) = \frac{C(w_{n-N+1}^{n-1}\, w_n)}{C(w_{n-N+1}^{n-1})} \]

This equation estimates the N-gram probability by dividing the observed frequency of a particular sequence by the observed frequency of its prefix. This ratio is called a relative frequency. Using relative frequencies is one way to estimate probabilities in Maximum Likelihood Estimation.

Conventional MLE is not always the best way to compute probability estimates, because the estimates are biased toward the training corpus (e.g., Brown). MLE can be modified to better address these considerations.

Example 2

Data from the Berkeley Restaurant Project corpus, consisting of 9332 sentences (available from the WWW). Sample sentences:

  can you tell me about any good cantonese restaurants close by
  mid priced thai food is what i'm looking for
  tell me about chez panisse
  can you give me a listing of the kinds of food that are available
  i'm looking for a good place to eat breakfast
  when is caffe venezia open during the day

Bigram counts for eight of the words (out of V = 1446) in the Berkeley Restaurant Project corpus of 9332 sentences. Rows give the first word of the bigram, columns the second word:

            i    want   to    eat   chinese  food  lunch  spend
  i         5    827    0     9     0        0     0      2
  want      2    0      608   1     6        6     5      1
  to        2    0      4     686   2        0     6      211
  eat       0    0      2     0     16       2     42     0
  chinese   1    0      0     0     0        82    1      0
  food      15   0      15    0     1        4     0      0
  lunch     2    0      0     0     0        1     0      0
  spend     1    0      1     0     0        0     0      0

Bigram Probabilities After Normalization

Each bigram count is divided by the unigram count of its first word. Unigram counts:

  i      want   to     eat    chinese  food   lunch  spend
  2533   927    2417   746    158      1093   341    278

Some other useful probabilities:

  P(i|<s>)        = 0.25
  P(english|want) = 0.0011
  P(food|english) = 0.5
  P(</s>|food)    = 0.68
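The MLE recipe (count the bigrams, then divide each count by the count of its first word) can be sketched on the three-sentence mini-corpus. This is an illustrative sketch, not code from the course: the function name `train_bigram_mle` and the whitespace tokenization are assumptions. The same normalization applied to the restaurant counts gives, for example, 827/2533 ≈ 0.33 for P(want|i).

```python
from collections import Counter

def train_bigram_mle(sentences):
    """Return a dict mapping (previous_word, word) to the MLE bigram probability."""
    first_word_counts = Counter()
    bigram_counts = Counter()
    for sentence in sentences:
        tokens = ["<s>"] + sentence.split() + ["</s>"]
        # Every token except the last one starts exactly one bigram.
        first_word_counts.update(tokens[:-1])
        bigram_counts.update(zip(tokens, tokens[1:]))
    return {
        (prev, word): count / first_word_counts[prev]
        for (prev, word), count in bigram_counts.items()
    }

mini_corpus = ["I am Sam", "Sam I am", "I do not like green eggs and ham"]
probs = train_bigram_mle(mini_corpus)

print(probs[("<s>", "I")])     # 2/3
print(probs[("am", "Sam")])    # 0.5
print(probs[("Sam", "</s>")])  # 0.5
```

Unseen bigrams are simply absent from the dictionary, which corresponds to an MLE probability of zero.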
Bigram probabilities for eight words (out of V = 1446) in the Berkeley Restaurant Project corpus of 9332 sentences. Rows give the conditioning (previous) word, columns the next word:

            i        want  to      eat     chinese  food    lunch   spend
  i         0.002    0.33  0       0.0036  0        0       0       0.00079
  want      0.0022   0     0.66    0.0011  0.0065   0.0065  0.0054  0.0011
  to        0.00083  0     0.0017  0.28    0.00083  0       0.0025  0.087
  eat       0        0     0.0027  0       0.021    0.0027  0.056   0
  chinese   0.0063   0     0       0       0        0.52    0.0063  0
  food      0.014    0     0.014   0       0.00092  0.0037  0       0
  lunch     0.0059   0     0       0       0        0.0029  0       0
  spend     0.0036   0     0.0036  0       0        0       0       0

Bigram Probability

We can now compute the probability of a sentence like "I want English food" or "I want Chinese food" by multiplying the appropriate bigram probabilities together:

  P(<s> i want english food </s>)
    = P(i|<s>) P(want|i) P(english|want) P(food|english) P(</s>|food)
    = 0.25 x 0.33 x 0.0011 x 0.5 x 0.68
    = 0.000031

Exercise: compute the probability of "I want chinese food".

Some of the bigram probabilities encode facts that we think of as strictly syntactic in nature:
- What comes after "eat" is usually a noun or an adjective.
- What comes after "to" is usually a verb.

Trigram Modeling

Although we generally show bigram models in this chapter for pedagogical purposes, note that when there is sufficient training data we are more likely to use trigram models, which condition on the previous two words rather than the previous word. To compute trigram probabilities at the very beginning of a sentence, we can use two pseudo-words for the first trigram, i.e., P(I|<s><s>).

Training and Test Sets

The probabilities of an N-gram model come from the corpus it is trained on. The model is then used on some new data in some task (e.g., speech recognition). The new data or task will not be exactly the same as the data that was used for training.
Formally:
- The data used to build the N-gram model (or any model) is called the training set or training corpus.
- The data used to test the model is called the test set or test corpus.

Model Evaluation

The training-and-testing paradigm can also be used to evaluate different N-gram architectures, for example:
- Comparing N-grams of different order N, or
- Using different smoothing algorithms (to be introduced later).

The procedure is to train the various models on the training corpus and then evaluate each model on the test corpus.

How do we measure the performance of each model on the test corpus? With perplexity (introduced later in the chapter), which is based on computing the probability of each sentence in the test set: the model that assigns a higher probability to the test set (and hence more accurately predicts the test set) is taken to be the better model.

Because the evaluation metric is based on test set probability, it is important not to let the test sentences into the training set. In other words, avoid training on the test set data.

Other Divisions of Data

Sometimes an extra source of data is needed to augment the training set. This data is called a held-out set.
- The N-gram counts are based only on the training set.
- The held-out set is used to set other parameters of the model, for example the interpolation parameters of an N-gram model.

Multiple test sets:
- A test set that is used often in measuring the performance of the model is typically called the development (test) set. Because it is used so often, the models may become tuned to it.
- Thus a new, completely unseen (or seldom used) data set should be used for the final evaluation. This set is called the evaluation (test) set.

Picking Train, Development Test, and Evaluation Test Data

For training we need as much data as possible. However, for testing we need enough data for the resulting measurements to be statistically significant. In practice the data is often divided into 80% training, 10% development, and 10% evaluation.
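The 80/10/10 division can be sketched as follows. This is a minimal illustration under stated assumptions: the function name `split_corpus`, the fixed random seed, and the choice to shuffle at the sentence level are not from the slides. Shuffling first keeps the three parts disjoint and avoids an ordering bias in the corpus file.

```python
import random

def split_corpus(sentences, train_fraction=0.8, dev_fraction=0.1, seed=0):
    """Shuffle sentences and split them into training, development, and evaluation sets."""
    items = list(sentences)
    random.Random(seed).shuffle(items)  # fixed seed makes the split reproducible
    n_train = int(len(items) * train_fraction)
    n_dev = int(len(items) * dev_fraction)
    train = items[:n_train]
    dev = items[n_train:n_train + n_dev]
    evaluation = items[n_train + n_dev:]  # whatever remains
    return train, dev, evaluation

# Toy usage with 100 placeholder "sentences".
corpus = [f"sentence {i}" for i in range(100)]
train, dev, evaluation = split_corpus(corpus)
print(len(train), len(dev), len(evaluation))  # 80 10 10
```

The evaluation part should then be set aside and used only for the final measurement, as described above.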
N-gram Sensitivity to the Training Corpus

1. N-gram modeling, like many statistical models, is very dependent on the training corpus. Often the model encodes very specific facts about a given training corpus.
2. N-grams do a better and better job of modeling the training corpus as we increase the value of N. This is another aspect of the model being tuned specifically to the training data at the expense of generality.

Visualization of N-gram Modeling

The technique is due to Shannon (1951) and Miller & Selfridge (1950). The simplest way to visualize how it works is the unigram case:
- Imagine all the words of English covering the probability space between 0 and 1, each word covering an interval whose size equals its (relative) frequency.
- Choose a random number between 0 and 1, and print out the word whose interval includes the value chosen.
- Continue choosing random numbers and generating words until the sentence-final token </s> is generated.

The same technique can be used to generate bigrams: first generate a random bigram that starts with <s> (according to its bigram probability), then choose a random bigram to follow it (again according to its conditional probability), and so on.

To provide an intuition of the increasing power of higher-order N-grams, the examples below show random sentences generated from unigram, bigram, trigram, and quadrigram models trained on Shakespeare's works.

Visualization of N-gram Modeling: Unigram

  To him swallowed confess hear both. Which. Of save on trail for are ay device an rote life have
  Every enter noe severally so, let
  Hill he late speaks; or! A more to leg less first you enter
  Are where exeunt and sighs have rise excellency took of. Sleep knave we. Near; vile like

Visualization of N-gram Modeling: Bigram

  What means, sir. I confess she? then all sorts, he is trim, captain.
  Why dost stand forth thy canopy, forsooth; he is this palpable hit the King Henry. Live king. Follow.
  What we, hath got so she that I rest and sent to scold and nature bankrupt, nor the first gentleman?
  Enter Menenius, if it so many good direction found'st thou art a strong upon command of fear not a liberal largess given away, Falstaff! Exeunt

Visualization of N-gram Modeling: Trigram

  Sweet prince, Falstaff shall die. Harry of Monmouth's grave.
  This shall forbid it should be branded, if renown made it empty.
  Indeed the duke; and had a very good friend.
  Fly, and will rid me these news of price. Therefore the sadness of parting, as they say, 'tis done.

Visualization of N-gram Modeling: Quadrigram

  King Henry. What! I will go seek the traitor Gloucester. Exeunt some of the watch. A great banquet serv'd in;
  Will you not tell me who I am?
  It cannot be but so.
  Indeed the short and the long. Marry, 'tis a noble Lepidus.

Size of N in N-gram Models

The longer the context on which we train the model, the more coherent the sentences.
- In the unigram sentences, there is no coherent relation between words, nor any sentence-final punctuation.
- The bigram sentences have some very local word-to-word coherence (especially if we consider that punctuation counts as a word).
- The trigram and quadrigram sentences are beginning to look a lot like Shakespeare.

Indeed, a careful investigation of the quadrigram sentences shows that they look a little too much like Shakespeare. The words "It cannot be but so" are directly from King John.
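The interval-based sampling procedure described above can be sketched for the bigram case. This is a hedged illustration: the function names, the `max_len` safety cap, and the tiny hand-written model `toy_model` are all assumptions; a real model would be estimated from a corpus such as Shakespeare's works.

```python
import random

def sample_word(distribution, rng):
    """Pick the word whose cumulative-probability interval contains a random number."""
    r = rng.random()  # uniform in [0, 1)
    cumulative = 0.0
    word = None
    for word, probability in distribution.items():
        cumulative += probability
        if r < cumulative:
            return word
    return word  # guard against floating-point rounding in the last interval

def generate_sentence(bigram_model, rng, max_len=20):
    """Generate words starting from <s> until </s> is drawn or max_len is reached."""
    words = []
    previous = "<s>"
    while len(words) < max_len:
        current = sample_word(bigram_model[previous], rng)
        if current == "</s>":
            break
        words.append(current)
        previous = current
    return words

# Hypothetical toy model: maps each word to the distribution over its next word.
toy_model = {
    "<s>": {"I": 1.0},
    "I": {"am": 0.5, "like": 0.5},
    "am": {"Sam": 1.0},
    "like": {"ham": 1.0},
    "Sam": {"</s>": 1.0},
    "ham": {"</s>": 1.0},
}

print(" ".join(generate_sentence(toy_model, random.Random(0))))
```

With this toy model the output is either "I am Sam" or "I like ham", depending on the random draw.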
Specificity vs. Generality

The variability of words and phrases in Shakespeare is not very large compared with the training corpora usually used for language modeling: N = 884,647 and V = 29,066.

N-gram probability matrices are very sparse:
- $V^2 \approx 844{,}000{,}000$ possible bigrams alone
- $V^4 \approx 7 \times 10^{17}$ possible quadrigrams

Once the generator has chosen the first quadrigram ("It cannot be but"), there are only five possible continuations (that, I, he, thou, and so). In fact, for many quadrigrams there is only one continuation.

Dependence of the Grammar on its Training Set

Consider a second example, the Wall Street Journal (WSJ) corpus, which is based on the newspaper. Shakespeare's works and the WSJ are both in English, so one might expect some overlap between our N-grams for the two genres. To check whether this is true, the sentences below were generated by unigram, bigram, and trigram grammars trained on 40 million words from the WSJ.

WSJ Example

Unigram:
  Months the my and issue of year foreign new exchange's september were recession exchange new endorsed a acquire to six executives

Bigram:
  Last December through the way to preserve the Hudson corporation N. B. E. C. Taylor would seem to complete the major central planners one point five percent of U.S. E. has already old M. X. corporation of living on information such as more frequently fishing to keep her

Trigram:
  They also point to ninety nine point six billion dollars from two hundred four oh six three percent of the rates of interest stores as Mexico and Brazil on market conditions

Sentences randomly generated from three orders of N-gram computed from 40 million words of the Wall Street Journal. All characters were mapped to lower case and punctuation marks were treated as words. The output is hand-corrected for capitalization to improve readability.
Comparison of Shakespeare and WSJ Examples

While superficially both seem to model "English-like sentences", there is obviously no overlap whatsoever in possible sentences, and little if any overlap even in small phrases. This stark difference tells us that statistical models are likely to be pretty useless as predictors if the training sets and the test sets are as different as Shakespeare and the WSJ.

How should we deal with this problem when we build N-gram models?
- In general we need to be sure to use a training corpus that looks like our test corpus. We especially would not choose training and test sets from different genres of text like newspaper text, early English fiction, telephone conversations, and web pages.
- Sometimes finding appropriate training text for a specific new task can be difficult. To build N-grams for text prediction in SMS (Short Message Service), we need a training corpus of SMS data. To build N-grams on business meetings, we would need corpora of transcribed business meetings.
- For general research where we know we want written English but do not have a domain in mind, we can use a balanced training corpus that includes cross-sections from different genres, such as the 1-million-word Brown corpus of English (Francis and Kučera, 1982) or the 100-million-word British National Corpus (Leech et al., 1994).
- Recent research has also studied ways to dynamically adapt language models to different genres.

Unknown Words: Open vs. Closed Vocabulary Tasks

Sometimes we have a language task in which we know all the words that can occur, and hence we know the vocabulary size V in advance. The closed vocabulary assumption is the assumption that we have such a lexicon, and that the test set can only contain words from this lexicon. The closed vocabulary task thus assumes there are no unknown words.
As we suggested earlier, the number of unseen words grows constantly, so we cannot possibly know in advance exactly how many there are, and we would like our model to do something reasonable with them.
- We call these unseen events unknown words, or out-of-vocabulary (OOV) words.
- The percentage of OOV words that appear in the test set is called the OOV rate.
- An open vocabulary system is one where we model these potential unknown words in the test set by adding a pseudo-word called <UNK>.

Training Probabilities of the Unknown Word Model

We can train the probabilities of the unknown word model <UNK> as follows:
1. Choose a vocabulary (word list) which is fixed in advance.
2. In a text normalization step, convert any word in the training set that is not in this vocabulary (any OOV word) to the unknown word token <UNK>.
3. Estimate the probabilities for <UNK> from its counts just like any other regular word in the training set.

Evaluating N-Grams: Perplexity

The correct way to evaluate the performance of a language model is to embed it in an application and measure the total performance of the application. Such end-to-end evaluation, also called in vivo evaluation, is the only way to know whether a particular improvement in a component is really going to help the task at hand. Thus for speech recognition, we can compare the performance of two language models by running the speech recognizer twice, once with each language model, and seeing which gives the more accurate transcription.

End-to-end evaluation is often very expensive; evaluating a large speech recognition test set, for example, takes hours or even days. Thus we would like a metric that can be used to quickly evaluate potential improvements in a language model.
Perplexity is the most common evaluation metric for N­gram language models. While an improvement in perplexity does not guarantee an improvement in speech recognition performance (or any other end­to­end metric), it often correlates with such improvements. Thus it is commonly used as a quick check on an algorithm; an improvement in perplexity can then be confirmed by an end­to­end evaluation. February 16, 2012 Veton Këpuska 65 Perplexity Given two probabilistic models, the better model is the one that has a tighter fit to the test data, or predicts the details of the test data better. We can measure better prediction by looking at the probability the model assigns to the test data; the better model will assign a higher probability to the test data. February 16, 2012 Veton Këpuska 66 Definition of Perplexity The perplexity (sometimes called PP for short) of a language model on a test set is a function of the probability that the language model assigns to that test set. For a test set W = w1w2 . . .wN, the perplexity is the probability of the test set, normalized by the number of words: PP(W ) = P( w1w2 wN ) =N February 16, 2012 − 1 N 1 P( w1w2 wN ) Veton Këpuska 67 Definition of Perplexity We can use the chain rule to expand the probability of W: N 1 PP(W ) = N ∏ i =1 P ( wi | w1 w2 wi −1 ) For bigram language model the perplexity of W is computed as: N 1 PP(W ) = N ∏ i =1 P ( wi | wi −1 ) February 16, 2012 Veton Këpuska 68 Interpretation of Perplexity 1. Minimizing perplexity is equivalent to maximizing the test set probability according to the language model. What we generally use for word sequence in the general Equation presented in previous slide is the entire sequence of words in some test set. Since this sequence will cross many sentence boundaries, we need to include the begin­and end­sentence markers <s> and </s> in the probability computation. 
We also need to include the end-of-sentence marker </s> (but not the beginning-of-sentence marker <s>) in the total count of word tokens N.

Perplexity can also be interpreted as the weighted average branching factor of a language. The branching factor of a language is the number of possible next words that can follow any word. Consider the task of recognizing the digits in English (zero, one, two, ..., nine), given that each of the 10 digits occurs with equal probability P = 1/10. The perplexity of this language is in fact 10. To see that, imagine a string of digits of length N. By the definition of perplexity, we have:

\[
PP(W) = P(w_1 w_2 \ldots w_N)^{-\frac{1}{N}} = \left( \left(\frac{1}{10}\right)^{N} \right)^{-\frac{1}{N}} = \left(\frac{1}{10}\right)^{-1} = 10
\]

Exercise: Suppose that the number zero is really frequent and occurs 10 times more often than the other numbers. Show that the perplexity is lower, as expected, since most of the time the next number will be zero. The branching factor, however, is still the same for the digit recognition task (i.e., 10).

Perplexity is also related to the information-theoretic notion of entropy, as will be shown later in this chapter.

Example of Perplexity Use

Perplexity is used in the following example to compare three N-gram models. Unigram, bigram, and trigram grammars are trained on 38 million words (including start-of-sentence tokens) from the WSJ corpus with a 19,979-word vocabulary.
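The digit calculation and the exercise above can be checked numerically. The following is a minimal sketch, not part of the original slides: the function name `perplexity` and the two test strings are illustrative assumptions. For the exercise it assumes "10 times more often" means P(zero) = 10/19 and P(any other digit) = 1/19, and it uses a 19-digit test string in which the digits occur in exactly those proportions.

```python
import math

def perplexity(word_probs):
    """Perplexity of a test sequence, given the probability the model
    assigned to each word: PP = P(w_1..w_N) ** (-1/N), computed in log space."""
    n = len(word_probs)
    log_prob = sum(math.log(p) for p in word_probs)
    return math.exp(-log_prob / n)

# Uniform digit language: every digit has probability 1/10.
uniform_pp = perplexity([1 / 10] * 50)

# Skewed digit language (assumed reading of the exercise):
# ten zeros and one each of the nine other digits.
skewed_pp = perplexity([10 / 19] * 10 + [1 / 19] * 9)

print(f"uniform: {uniform_pp:.2f}, skewed: {skewed_pp:.2f}")
```

The uniform case reproduces the perplexity of 10 regardless of the string length, while the skewed case comes out lower even though the branching factor is still 10.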
Perplexity is computed on a test set of 1.5 million words using the equation from the Definition of Perplexity slide, and the results are summarized in the table below:

N-gram order   Unigram   Bigram   Trigram
Perplexity     962       170      109

As the table shows, the more information the N-gram gives us about the word sequence, the lower the perplexity: the perplexity is related inversely to the likelihood of the test sequence according to the model.

Note that in computing perplexities the N-gram model P must be constructed without any knowledge of the test set. Any kind of knowledge of the test set can cause the perplexity to be artificially low. For example, we defined above the closed vocabulary task, in which the vocabulary for the test set is specified in advance. This can greatly reduce the perplexity. As long as this knowledge is provided equally to each of the models we are comparing, the closed vocabulary perplexity can still be useful for comparing models, but care must be taken in interpreting the results. In general, the perplexities of two language models are only comparable if they use the same vocabulary.

Smoothing

There is a major problem with the maximum likelihood estimation process we have seen for training the parameters of an N-gram model. This is the problem of sparse data, caused by the fact that our maximum likelihood estimate was based on a particular set of training data. For any N-gram that occurred a sufficient number of times, we might have a good estimate of its probability. But because any corpus is limited, some perfectly acceptable English word sequences are bound to be missing from it.

This missing data means that the N-gram matrix for any given training corpus is bound to have a very large number of cases of putative "zero probability N-grams" that should really have some non-zero probability.
Furthermore, the MLE method also produces poor estimates when the counts are non-zero but still small.

We need a method which can help get better estimates for these zero or low-frequency counts. Zero counts turn out to cause another huge problem. The perplexity metric defined above requires that we compute the probability of each test sentence. But if a test sentence has an N-gram that never appeared in the training set, the maximum likelihood estimate of the probability for this N-gram, and hence for the whole test sentence, will be zero! This means that in order to evaluate our language models, we need to modify the MLE method to assign some non-zero probability to any N-gram, even one that was never observed in training.

The term smoothing is used for such modifications that address the poor estimates due to variability in small data sets. The name comes from the fact that (looking ahead a bit) we will be shaving a little bit of probability mass from the higher counts, and piling it instead on the zero counts, making the distribution a little less discontinuous. In the next few sections some smoothing algorithms are introduced. The original Berkeley Restaurant example introduced previously will be used to show how smoothing algorithms modify the bigram probabilities.

Laplace Smoothing

One simple way to do smoothing is to take our matrix of bigram counts, before we normalize them into probabilities, and add one to all the counts. This algorithm is called Laplace smoothing, or Laplace's Law. Laplace smoothing does not perform well enough to be used in modern N-gram models, but we begin with it because it introduces many of the concepts that we will see in other smoothing algorithms, and also gives us a useful baseline.
Laplace Smoothing Applied to Unigram Probabilities

Recall that the unsmoothed maximum likelihood estimate of the unigram probability of the word w_i is its count c_i normalized by the total number of word tokens N:

\[
P(w_i) = \frac{c_i}{N}
\]

Laplace smoothing adds one to each count. Since there are V words in the vocabulary and each one is incremented, we also need to adjust the denominator to take into account the extra V observations in order to have legitimate probabilities:

\[
P_{Laplace}(w_i) = \frac{c_i + 1}{N + V}
\]

It is convenient to describe a smoothing algorithm as a corrective constant that affects the numerator, by defining an adjusted count c* as follows:

\[
P_{Laplace}(w_i) = \frac{c_i + 1}{N + V} = \frac{(c_i + 1)\,\frac{N}{N + V}}{N} = \frac{c_i^{*}}{N},
\qquad
c_i^{*} = (c_i + 1)\,\frac{N}{N + V}
\]

Discounting

A related way to view smoothing is as discounting (lowering) some non-zero counts in order to get the probability mass that will be assigned to the zero counts.
Thus instead of referring to the discounted counts c*, we might describe a smoothing algorithm in terms of a relative discount d_c, the ratio of the discounted counts to the original counts:

\[
d_c = \frac{c^{*}}{c}
\]

Berkeley Restaurant Project Smoothed Bigram Counts (V=1446)

          i      want   to     eat    chinese  food   lunch  spend
i         6      828    1      10     1        1      1      3
want      3      1      609    2      7        7      6      2
to        3      1      5      687    3        1      7      212
eat       1      1      3      1      17       3      43     1
chinese   2      1      1      1      1        83     2      1
food      16     1      16     1      2        5      1      1
lunch     3      1      1      1      1        2      1      1
spend     2      1      2      1      1        1      1      1

Smoothed Bigram Probabilities

Recall that normal bigram probabilities are computed by normalizing each row of counts by the unigram count:

\[
P(w_n \mid w_{n-1}) = \frac{C(w_{n-1} w_n)}{C(w_{n-1})}
\]

For add-one smoothed bigram counts we need to augment the unigram count by the total number of word types in the vocabulary V:

\[
P^{*}_{Laplace}(w_n \mid w_{n-1}) = \frac{C(w_{n-1} w_n) + 1}{C(w_{n-1}) + V}
\]

The result is the smoothed bigram probabilities presented in the following table.

Bigram Smoothed Probabilities for eight words (out of V=1446) in the Berkeley Restaurant Project corpus of 9332 sentences

          i        want     to       eat      chinese  food     lunch    spend
i         0.0015   0.21     0.00025  0.0025   0.00025  0.00025  0.00025  0.00075
want      0.0013   0.00042  0.26     0.00084  0.0029   0.0029   0.0025   0.00084
to        0.00078  0.00026  0.0013   0.18     0.00078  0.00026  0.0018   0.055
eat       0.00046  0.00046  0.0014   0.00046  0.0078   0.0014   0.02     0.00046
chinese   0.0012   0.00062  0.00062  0.00062  0.00062  0.052    0.0012   0.00062
food      0.0063   0.00039  0.0063   0.00039  0.00079  0.002    0.00039  0.00039
lunch     0.0017   0.00056  0.00056  0.00056  0.00056  0.0011   0.00056  0.00056
spend     0.0012   0.00058  0.0012   0.00058  0.00058  0.00058  0.00058  0.00058

Adjusted Counts Table

It is often convenient to reconstruct the count matrix so we can see how much a smoothing algorithm has changed the original counts.
These adjusted counts can be computed by the equation below, and the table that follows shows the reconstructed counts.

\[
c^{*}(w_{n-1} w_n) = \frac{[C(w_{n-1} w_n) + 1] \times C(w_{n-1})}{C(w_{n-1}) + V}
\]

          i      want   to     eat    chinese  food   lunch  spend
i         3.8    527    0.64   6.4    0.64     0.64   0.64   1.9
want      1.2    0.39   238    0.78   2.7      2.7    2.3    0.78
to        1.9    0.63   3.1    430    1.9      0.63   4.4    133
eat       0.34   0.34   1      0.34   5.8      1      15     0.34
chinese   0.2    0.098  0.098  0.098  0.098    8.2    0.2    0.098
food      6.9    0.43   6.9    0.43   0.86     2.2    0.43   0.43
lunch     0.57   0.19   0.19   0.19   0.19     0.38   0.19   0.19
spend     0.32   0.16   0.32   0.16   0.16     0.16   0.16   0.16

Observation

Note that add-one smoothing has made a very big change to the counts. C(want to) changed from 608 to 238! We can see this in probability space as well: P(to|want) decreases from .66 in the unsmoothed case to .26 in the smoothed case. Looking at the discount d (the ratio between new and old counts) shows us how strikingly the counts for each prefix word have been reduced; the discount for the bigram want to is .39, while the discount for Chinese food is .10, a factor of 10!

Problems with Add-One (Laplace) Smoothing

The sharp change in counts and probabilities occurs because too much probability mass is moved to all the zeros. We could move a bit less mass by adding a fractional count rather than 1 (add-δ smoothing; Lidstone, 1920; Jeffreys, 1948), but this approach requires a method for choosing δ dynamically, results in an inappropriate discount for many counts, and turns out to give counts with poor variances. For these and other reasons (Gale and Church, 1994), we need better smoothing methods for N-grams, like the ones presented in the next section.
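The "want to" numbers quoted in the observation above can be reproduced with a few lines of arithmetic. This is a minimal sketch; the function names are illustrative, and the unigram count C(want) = 927 is an assumption taken from the textbook's BeRP unigram counts, since it is not stated in this section.

```python
def laplace_bigram_prob(bigram_count, context_count, vocab_size):
    """Add-one smoothed P(w_n | w_{n-1}) = (C(w_{n-1} w_n) + 1) / (C(w_{n-1}) + V)."""
    return (bigram_count + 1) / (context_count + vocab_size)

def laplace_adjusted_count(bigram_count, context_count, vocab_size):
    """Reconstructed count c* = (C(w_{n-1} w_n) + 1) * C(w_{n-1}) / (C(w_{n-1}) + V)."""
    return (bigram_count + 1) * context_count / (context_count + vocab_size)

V = 1446          # vocabulary size from the slides
C_WANT = 927      # assumed unigram count of "want" (not given in this section)
C_WANT_TO = 608   # raw bigram count of "want to" from the slides

mle_prob = C_WANT_TO / C_WANT
smoothed_prob = laplace_bigram_prob(C_WANT_TO, C_WANT, V)
adjusted = laplace_adjusted_count(C_WANT_TO, C_WANT, V)
discount = adjusted / C_WANT_TO   # relative discount d_c = c* / c

print(f"MLE {mle_prob:.2f} -> smoothed {smoothed_prob:.2f}, "
      f"c* = {adjusted:.0f}, d = {discount:.2f}")
```

Under that assumed unigram count, the sketch gives the same .66 to .26 drop, the adjusted count of 238, and the discount of .39 reported in the slides.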
Good-Turing Discounting

A number of much better algorithms have been developed that are only slightly more complex than add-one smoothing; Good-Turing is one of them. The idea behind a number of those algorithms is to use the count of things you've seen once to help estimate the count of things you have never seen. Good described the algorithm in 1953, crediting Turing for the original idea.

The basic idea in this algorithm is to re-estimate the amount of probability mass to assign to N-grams with zero counts by looking at the number of N-grams that occurred only one time. A word or N-gram that occurs once is called a singleton. The Good-Turing algorithm uses the frequency of singletons as a re-estimate of the frequency of zero-count bigrams.

Algorithm definition: N_c is the number of N-grams that occur c times, i.e., the frequency of frequency c. N_0 is the number of bigrams with count 0, N_1 is the number of bigrams with count 1 (singletons), and so on:

\[
N_c = \sum_{x \,:\, \mathrm{count}(x) = c} 1
\]

The MLE count for an N-gram in bin N_c is c. The Good-Turing estimate replaces this with a smoothed count c*, as a function of N_{c+1}:

\[
c^{*} = (c + 1)\,\frac{N_{c+1}}{N_c}
\]

The equation above can be used to replace the MLE counts for all the bins N_1, N_2, and so on. Instead of using this equation directly to re-estimate the smoothed count c* for N_0, the following equation is used, which defines the probability of the missing mass:

\[
P^{*}_{GT}(\text{things with frequency zero in training}) = \frac{N_1}{N}
\]

Here N_1 is the count of items in bin 1 (those that were seen once in training), and N is the total number of items we have seen in training.
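The two Good-Turing formulas above can be sketched directly from a table of observed counts. The function name `good_turing` is illustrative, and the data are the fish-catch counts from the example on the next slide. The sketch deliberately returns `None` wherever N_{c+1} = 0, because the raw formula is undefined there; it does not implement the Simple Good-Turing smoothing of N_c.

```python
from collections import Counter

def good_turing(counts):
    """Return (p_unseen, c_star) for a dict mapping item -> observed count.

    p_unseen is N_1 / N, the mass reserved for zero-count items.
    c_star maps each observed count c to (c + 1) * N_{c+1} / N_c,
    or None when N_{c+1} is zero and the raw formula is undefined.
    """
    total = sum(counts.values())              # N
    freq_of_freq = Counter(counts.values())   # N_c for every observed c
    p_unseen = freq_of_freq[1] / total
    c_star = {}
    for c, n_c in freq_of_freq.items():
        n_next = freq_of_freq.get(c + 1, 0)
        c_star[c] = (c + 1) * n_next / n_c if n_next > 0 else None
    return p_unseen, c_star

catch = {"carp": 10, "perch": 3, "whitefish": 2, "trout": 1, "salmon": 1, "eel": 1}
p_unseen, c_star = good_turing(catch)
p_trout = c_star[1] / sum(catch.values())

print(f"P(unseen) = {p_unseen:.2f}, c*(1) = {c_star[1]:.2f}, P(trout) = {p_trout:.3f}")
```

The `None` entries (here for c = 3 and c = 10) are exactly the cases that motivate smoothing the N_c values before computing c*.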
Good-Turing Discounting: Example

A lake has 8 species of fish (bass, carp, catfish, eel, perch, salmon, trout, whitefish). When fishing we have caught 6 species with the following counts: 10 carp, 3 perch, 2 whitefish, 1 trout, 1 salmon, and 1 eel (no catfish and no bass).

What is the probability that the next fish we catch will be a new species, i.e., one that had a zero frequency in our training set (catfish or bass)? The MLE count c of an unseen species (bass or catfish) is 0. From the missing-mass equation, the probability of a new fish being one of these unseen species is 3/18, since N_1 is 3 and N is 18:

\[
P^{*}_{GT}(\text{things with frequency zero in training}) = \frac{N_1}{N} = \frac{3}{18}
\]

Let us now estimate the probability that the next fish will be another trout. The MLE count for trout is 1, so the MLE estimated probability is 1/18. However, the Good-Turing estimate must be lower, since we took 3/18 of our probability mass to use on unseen events! We must therefore discount the MLE probabilities for the observed species (perch, whitefish, trout, salmon, and eel).

The revised counts c* and Good-Turing smoothed probabilities p*_GT for species with count 0 (like bass or catfish) or count 1 (like trout, salmon, or eel) are as follows:

           unseen (bass or catfish)                  trout
c          0                                         1
MLE p      p = 0/18 = 0                              p = 1/18
c* (GT)                                              c*(trout) = 2 × N_2/N_1 = 2 × 1/3 = 0.67
p*_GT      p*_GT(unseen) = N_1/N = 3/18 = 0.17       p*_GT(trout) = c*/N = 0.67/18 = 0.037

Note that the revised count c* for eel is likewise discounted from c = 1.0 to c* = 0.67, in order to free up the probability mass p*_GT(unseen) = 3/18 = 0.17 for catfish and bass.
Since we know that there are 2 unknown species, the probability of the next fish being specifically a catfish is p*_GT(catfish) = (1/2) × (3/18) ≈ 0.083.

Bigram Examples

The table below shows Good-Turing frequencies of frequencies and re-estimated counts for bigrams from the Associated Press (AP) newswire corpus and from the Berkeley Restaurant corpus of 9332 sentences.

           AP Newswire                     Berkeley Restaurant
c (MLE)    N_c              c* (GT)        N_c          c* (GT)
0          74,671,100,000   0.0000270      2,081,496    0.002553
1          2,018,046        0.446          5315         0.533960
2          449,721          1.26           1419         1.357294
3          188,933          2.24           642          2.373832
4          105,668          3.24           381          4.081365
5          68,379           4.22           311          3.781350
6          48,190           5.19           196          4.500000

Advanced Issues in Good-Turing Estimation

Assumptions of Good-Turing estimation:
1. The distribution of each bigram is binomial.
2. The number N_0 of bigrams that have not been seen is known. This number is known if the size V of the vocabulary is known: the total number of possible bigrams is then V^2, so N_0 is V^2 minus the number of bigrams seen.

The raw N_c values cannot be used directly, since the re-estimate c* depends on N_{c+1}, and the re-estimation expression yields zero whenever N_{c+1} = 0. In the fish example N_4 = 0, so how can one compute c* for c = 3? The answer is the Simple Good-Turing algorithm.

Simple Good-Turing

1. Compute N_c for all c.
2. Smooth the N_c counts to replace any zeros in the sequence.
3. Compute the adjusted counts c* from the smoothed N_c.

Smoothing approach: linear regression, fitting a map from c to N_c in log space:

\[
\log(N_c) = a + b \log(c)
\]

In addition, the discounted c* is not used for all counts c. Large counts, where c > k for some threshold k (e.g., k = 5 in Katz 1987), are assumed to be reliable:

\[
c^{*} = c \quad \text{for } c > k
\]

The correct equation for c* when some k is introduced is:

\[
c^{*} = \frac{(c + 1)\,\dfrac{N_{c+1}}{N_c} - c\,\dfrac{(k + 1) N_{k+1}}{N_1}}{1 - \dfrac{(k + 1) N_{k+1}}{N_1}}, \quad \text{for } 1 \le c \le k
\]

With Good-Turing discounting (as well as other algorithms), it is usual to treat N-grams with low counts (especially counts of 1) as if the count were 0.
Finally, Good-Turing discounting (or any other discounting algorithm) is not used directly on N-grams; it is used in combination with the backoff and interpolation algorithms that are described next.

Interpolation

Discounting algorithms can help solve the problem of zero-frequency N-grams. However, there is additional knowledge that they do not use:
1. If we are trying to compute P(w_n | w_{n-1} w_{n-2}) but have no examples of a particular trigram w_{n-2} w_{n-1} w_n, we can estimate the trigram probability from the bigram probability P(w_n | w_{n-1}).
2. If there are no counts for computing the bigram probability P(w_n | w_{n-1}), we can use the unigram probability P(w_n).

There are two ways to rely on this N-gram "hierarchy": backoff and interpolation.

Backoff vs. Interpolation

Backoff: if there is non-zero evidence for a trigram, rely solely on the trigram counts. Only when there is zero-count evidence for the trigram do we back off to a lower-order N-gram.

Interpolation: probability estimates are always mixed from all N-gram estimators, i.e., a weighted interpolation of trigram, bigram, and unigram counts.

Linear Interpolation

In simple linear interpolation the weights are constants:

\[
\hat{P}(w_n \mid w_{n-1} w_{n-2}) = \lambda_1 P(w_n \mid w_{n-1} w_{n-2}) + \lambda_2 P(w_n \mid w_{n-1}) + \lambda_3 P(w_n),
\qquad \sum_{i} \lambda_i = 1
\]

A slightly more sophisticated version of linear interpolation uses context-dependent weights:

\[
\hat{P}(w_n \mid w_{n-1} w_{n-2}) = \lambda_1(w_{n-2}^{n-1})\, P(w_n \mid w_{n-1} w_{n-2}) + \lambda_2(w_{n-2}^{n-1})\, P(w_n \mid w_{n-1}) + \lambda_3(w_{n-2}^{n-1})\, P(w_n),
\qquad \sum_{i} \lambda_i(w_{n-2}^{n-1}) = 1
\]

Computing Interpolation Weights λ

The weights are set from a held-out corpus. A held-out corpus is an additional training corpus that is NOT used to set the N-gram counts, but is used to set other parameters, such as the λ values in this case. The λ values are chosen to maximize the interpolated probability of the held-out corpus, for example with the EM algorithm (an iterative algorithm discussed in later chapters).
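Simple linear interpolation with constant weights amounts to one weighted sum. The sketch below is illustrative only: the function name, the component probabilities, and the λ values are made-up numbers, not estimates from any corpus, and the weights are fixed by hand rather than tuned on held-out data.

```python
def interpolate(p_trigram, p_bigram, p_unigram, lambdas):
    """Linearly interpolated estimate:
    lambda1 * P(w|trigram context) + lambda2 * P(w|bigram context) + lambda3 * P(w)."""
    l1, l2, l3 = lambdas
    if abs(l1 + l2 + l3 - 1.0) > 1e-9:
        raise ValueError("interpolation weights must sum to 1")
    return l1 * p_trigram + l2 * p_bigram + l3 * p_unigram

# Hypothetical case: the trigram was never seen (MLE probability 0),
# but the bigram and unigram estimates are non-zero.
p_hat = interpolate(p_trigram=0.0, p_bigram=0.02, p_unigram=0.001,
                    lambdas=(0.6, 0.3, 0.1))
print(p_hat)
```

Even with a zero trigram estimate the interpolated probability stays positive, which is the point of mixing in the lower-order models.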
Backoff

Interpolation is simple to understand and implement, but there are better algorithms, such as backoff N-gram modeling. The version described here uses Good-Turing discounting, was introduced by Katz, and is known as Katz backoff:

\[
P_{katz}(w_n \mid w_{n-N+1}^{n-1}) =
\begin{cases}
P^{*}(w_n \mid w_{n-N+1}^{n-1}), & \text{if } C(w_{n-N+1}^{n}) > 0 \\
\alpha(w_{n-N+1}^{n-1})\, P_{katz}(w_n \mid w_{n-N+2}^{n-1}), & \text{otherwise}
\end{cases}
\]

The equation above describes a recursive procedure. The computation of P*, the normalizing factor α, and other details are discussed in the next section.

Trigram Discounting with Interpolation

For clarity, the words w_{i-2}, w_{i-1}, w_i are referred to as the sequence x, y, z. The Katz method incorporates discounting as an integral part of the algorithm:

\[
P_{katz}(z \mid x, y) =
\begin{cases}
P^{*}(z \mid x, y), & \text{if } C(x, y, z) > 0 \\
\alpha(x, y)\, P_{katz}(z \mid y), & \text{else if } C(x, y) > 0 \\
P^{*}(z), & \text{otherwise}
\end{cases}
\]

\[
P_{katz}(z \mid y) =
\begin{cases}
P^{*}(z \mid y), & \text{if } C(y, z) > 0 \\
\alpha(y)\, P_{katz}(z), & \text{otherwise}
\end{cases}
\]

Katz Backoff

The Good-Turing method assigns probability to unseen events based on the assumption that they are all equally probable. Katz backoff gives us a better way to distribute the probability mass among unseen trigram events, by relying on information from unigrams and bigrams. We use discounting to tell us how much total probability mass to set aside for all the events we haven't seen, and backoff to tell us how to distribute this probability. The discounted probability P*(.) is needed rather than the MLE P(.) in order to account for the missing probability mass. The α weights are necessary to ensure that when backoff occurs the resulting probabilities are true probabilities that sum to 1.

Discounted Probability Computation

P* is defined as the discounted (c*) estimate of the conditional probability of an N-gram:
\[
P^{*}(w_n \mid w_{n-N+1}^{n-1}) = \frac{c^{*}(w_{n-N+1}^{n})}{c(w_{n-N+1}^{n-1})}
\]

Because on average the discounted c* will be less than c, this probability P* will be slightly less than the MLE estimate:

\[
\frac{c(w_{n-N+1}^{n})}{c(w_{n-N+1}^{n-1})}
\]

This computation leaves some probability mass for the lower-order N-grams, which is then distributed by the α weights (described in the next section). The table below shows the resulting Katz backoff bigram probabilities for the previous 8 sample words, computed from the BeRP corpus using the SRILM toolkit.

Smoothed Bigram Probabilities computed with the SRILM toolkit

          i         want      to        eat       chinese   food      lunch     spend
i         0.0014    0.326     0.00248   0.00355   0.000205  0.0017    0.00073   0.000489
want      0.00134   0.00152   0.656     0.000483  0.00455   0.00455   0.00073   0.000483
to        0.000512  0.00152   0.00165   0.284     0.000512  0.0017    0.00175   0.00873
eat       0.00101   0.00152   0.00166   0.00189   0.0214    0.00166   0.0563    0.000585
chinese   0.0283    0.00152   0.00248   0.00189   0.000205  0.519     0.00283   0.000585
food      0.0137    0.00152   0.0137    0.00189   0.000409  0.00366   0.00073   0.000585
lunch     0.00363   0.00152   0.00248   0.00189   0.000205  0.00131   0.00073   0.000585
spend     0.00161   0.00152   0.00161   0.00189   0.000205  0.0017    0.00073   0.000585

Advanced Details of Computing Katz Backoff α and P*

The remaining details of the computation of α and P* are presented in this section. β is the total amount of left-over probability mass, as a function of the (N-1)-gram context. For a given (N-1)-gram context, the total left-over probability mass can be computed by subtracting from 1 the total discounted probability mass for all N-grams starting with that context:
\[
\beta(w_{n-N+1}^{n-1}) = 1 - \sum_{w_n \,:\, c(w_{n-N+1}^{n}) > 0} P^{*}(w_n \mid w_{n-N+1}^{n-1})
\]

This gives us the total probability mass that we are ready to distribute to all (N-1)-grams (e.g., bigrams if our original model was a trigram).

Each individual (N-1)-gram (bigram) will only get a fraction of this mass, so we need to normalize β by the total probability of all the (N-1)-grams (bigrams) that begin some N-gram (trigram) that has zero count. The final equation for computing how much probability mass to distribute from an N-gram to an (N-1)-gram is represented by the function α:

\[
\alpha(w_{n-N+1}^{n-1})
= \frac{\beta(w_{n-N+1}^{n-1})}{\displaystyle\sum_{w_n \,:\, c(w_{n-N+1}^{n}) = 0} P_{katz}(w_n \mid w_{n-N+2}^{n-1})}
= \frac{1 - \displaystyle\sum_{w_n \,:\, c(w_{n-N+1}^{n}) > 0} P^{*}(w_n \mid w_{n-N+1}^{n-1})}{1 - \displaystyle\sum_{w_n \,:\, c(w_{n-N+1}^{n}) > 0} P^{*}(w_n \mid w_{n-N+2}^{n-1})}
\]

Note that α is a function of the preceding word string, that is, of w_{n-N+1}^{n-1}; thus the amount by which we discount each trigram (d) and the mass that gets reassigned to lower-order N-grams (α) are recomputed for every (N-1)-gram that occurs in any N-gram.

We only need to specify what to do when the count of an (N-1)-gram context is 0, i.e., when c(w_{n-N+1}^{n-1}) = 0, and our definition is complete:

\[
\begin{aligned}
P_{katz}(w_n \mid w_{n-N+1}^{n-1}) &= P_{katz}(w_n \mid w_{n-N+2}^{n-1}) && \text{if } c(w_{n-N+1}^{n-1}) = 0 \\
P^{*}(w_n \mid w_{n-N+1}^{n-1}) &= 0 && \text{if } c(w_{n-N+1}^{n-1}) = 0 \\
\beta(w_{n-N+1}^{n-1}) &= 1 && \text{if } c(w_{n-N+1}^{n-1}) = 0
\end{aligned}
\]

Practical Issues: Toolkits and Data Formats

How are N-gram language models represented? Language model probabilities are represented and computed in log format to avoid underflow and to speed up computation. Probabilities are by definition less than or equal to 1.
Multiplying enough N-gram probabilities together would result in numerical underflow. When log probabilities are used instead of raw probabilities, the numbers do not get as small. Adding in log space is equivalent to multiplying in linear space, so log probabilities are combined by addition. Addition is also generally faster than multiplication on most general-purpose computers. Reporting true probabilities, if necessary, requires an exponentiation:

\[
p_1 \times p_2 \times p_3 \times p_4 = \exp\left[\log(p_1) + \log(p_2) + \log(p_3) + \log(p_4)\right]
\]

Backoff N-gram language models are generally stored in ARPA format. An N-gram model in ARPA format is an ASCII file with a small header followed by a list of all the non-zero N-gram probabilities: all unigrams, followed by bigrams, followed by trigrams, and so on. Each N-gram entry is stored with its discounted log probability (in log10 format) and its backoff weight α. Backoff weights are only necessary for N-grams which form a prefix of a longer N-gram, thus no α is computed for the highest-order N-gram (e.g., trigram) or for N-grams ending in the end-of-sequence token </s>.

The format of each N-gram entry in a trigram grammar is:

unigram:   log p*(w_i)                       w_i                    log α(w_i)
bigram:    log p*(w_i | w_{i-1})             w_{i-1} w_i            log α(w_{i-1} w_i)
trigram:   log p*(w_i | w_{i-2}, w_{i-1})    w_{i-2} w_{i-1} w_i

[Figure: Example of ARPA formatted LM file from BeRP corpus (Juras…]

Probability Computation

Given a sequence x, y, z, the trigram probability P(z|x,y) is computed from the model as follows:

\[
P_{katz}(z \mid x, y) =
\begin{cases}
P^{*}(z \mid x, y), & \text{if } C(x, y, z) > 0 \\
\alpha(x, y)\, P_{katz}(z \mid y), & \text{else if } C(x, y) > 0 \\
P^{*}(z), & \text{otherwise}
\end{cases}
\]

\[
P_{katz}(z \mid y) =
\begin{cases}
P^{*}(z \mid y), & \text{if } C(y, z) > 0 \\
\alpha(y)\, P^{*}(z), & \text{otherwise}
\end{cases}
\]
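A backoff lookup over an ARPA-style table can be sketched as a short recursion in log10 space, where backoff weights are added rather than multiplied. Everything in this sketch is an illustrative assumption: the table `NGRAMS`, its words, and all of its numbers are made up, and the function `log10_katz` is not the API of SRILM or any other toolkit. When the context itself is unseen, the sketch falls through to the lower-order estimate with weight 1 (log weight 0), following the general rule P_katz(w_n | w_{n-N+1}^{n-1}) = P_katz(w_n | w_{n-N+2}^{n-1}) given earlier for zero-count contexts.

```python
# Toy ARPA-style table: n-gram tuple -> (log10 P*, log10 backoff weight alpha).
# All values are invented for illustration.
NGRAMS = {
    ("i",): (-1.0, -0.30),
    ("want",): (-1.5, -0.25),
    ("food",): (-2.0, -0.20),
    ("i", "want"): (-0.5, -0.10),
}

def log10_katz(ngram):
    """log10 P_katz(last word | preceding words) by recursive backoff."""
    ngram = tuple(ngram)
    if ngram in NGRAMS:
        # Seen N-gram: use its stored discounted log probability.
        return NGRAMS[ngram][0]
    if len(ngram) == 1:
        raise KeyError(f"unknown word: {ngram[0]!r}")
    context = ngram[:-1]
    # Unseen context: backoff weight 1, i.e. log10 weight 0.
    log_alpha = NGRAMS[context][1] if context in NGRAMS else 0.0
    # Drop the earliest word and retry with the shorter context.
    return log_alpha + log10_katz(ngram[1:])

log_p = log10_katz(("i", "want", "food"))
print(log_p, 10 ** log_p)
```

In this toy table the trigram and the bigram "want food" are both missing, so the result is the sum of the two backoff weights and the unigram log probability of "food"; the true probability is recovered only at the end by exponentiation.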
Toolkits (Publicly Available)

SRILM (Stolcke, 2002)
http://citeseer.ist.psu.edu/621361.html
http://www.speech.sri.com/projects/srilm/

Cambridge-CMU toolkit (Clarkson & Rosenfeld, 1997)
http://www.cs.cmu.edu/~archan/sphinxInfo.html#cmulmtk
http://www.cs.cmu.edu/~archan/s_info/CMULMTK/toolkit_doc
http://www.speech.cs.cmu.edu/SLM_info.html

ADVANCED ISSUES IN LANGUAGE MODELING

Advanced Smoothing Methods: Kneser-Ney Smoothing

This is a brief introduction to the most commonly used modern N-gram smoothing method, the interpolated Kneser-Ney algorithm. The algorithm is based on the absolute discounting method, which is a more elaborate way of computing the revised count c* than the Good-Turing discount formula.

Revisiting the Good-Turing estimates of the bigram counts, extended from the Bigram Examples slide:

c (MLE)      0            1      2     3     4     5     6     7     8     9
c* (GT)      0.0000270    0.446  1.26  2.24  3.24  4.22  5.19  6.21  7.24  8.25
Δ = c - c*   -0.0000270   0.554  0.74  0.76  0.76  0.78  0.81  0.79  0.76  0.75

The re-estimated counts c* for counts greater than 1 could be estimated pretty well by just subtracting 0.75 from the MLE count c. The absolute discounting method formalizes this intuition by subtracting a fixed (absolute) discount D from each count. The rationale is that we already have good estimates for the high counts, and a small discount D won't affect them much. Only the smaller counts are affected much, and we do not necessarily trust those estimates anyhow.
The equation for absolute discounting applied to bigrams (assuming a proper coefficient α on the backoff to make everything sum to one) is:

\[
P_{absolute}(w_i \mid w_{i-1}) =
\begin{cases}
\dfrac{c(w_{i-1} w_i) - D}{c(w_{i-1})}, & \text{if } c(w_{i-1} w_i) > 0 \\
\alpha(w_i)\, P_{absolute}(w_i), & \text{otherwise}
\end{cases}
\]

In practice, distinct discount values D are computed for the 0 and 1 counts.

Kneser-Ney discounting augments absolute discounting with a more sophisticated way to handle the backoff distribution. Consider the job of predicting the next word in the following sentence, assuming we are backing off to a unigram model:

I can't see without my reading ______.

The word "glasses" seems much more likely to follow than the word "Francisco". But "Francisco" is in fact more common, and thus a unigram model will prefer it to "glasses".
1. Thus we would like to capture that although "Francisco" is frequent, it is only frequent after the word "San".
2. The word "glasses" has a much wider distribution.

Thus the idea is that instead of backing off to the unigram MLE count (the number of times the word w has been seen), we want to use a completely different backoff distribution! We want a heuristic that more accurately estimates the number of times we might expect to see word w in a new unseen context. The Kneser-Ney intuition is to base our estimate on the number of different contexts word w has appeared in. Words that have appeared in more contexts are more likely to appear in some new context as well.
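The intuition can be made concrete by counting, for each word, how many distinct words precede it, and normalizing by the total number of distinct bigram types. The sketch below is illustrative: the function name and the toy bigram counts are assumptions invented to mirror the "Francisco" versus "glasses" contrast, not data from any corpus.

```python
from collections import defaultdict

def continuation_probs(bigram_counts):
    """Map each word to (number of distinct left contexts) divided by
    (total number of distinct bigram types with non-zero count)."""
    left_contexts = defaultdict(set)
    for (prev, word), count in bigram_counts.items():
        if count > 0:
            left_contexts[word].add(prev)
    total_types = sum(len(ctx) for ctx in left_contexts.values())
    return {word: len(ctx) / total_types for word, ctx in left_contexts.items()}

# Toy counts: "francisco" is frequent but follows only "san";
# "glasses" is rarer overall but follows four different words.
toy_bigrams = {
    ("san", "francisco"): 50,
    ("reading", "glasses"): 3,
    ("my", "glasses"): 4,
    ("new", "glasses"): 2,
    ("dark", "glasses"): 1,
}
p_cont = continuation_probs(toy_bigrams)
print(p_cont)
```

Although "francisco" has the larger raw count in this toy data, "glasses" receives the larger share because it has been seen in more distinct contexts.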
The new backoff probability can be expressed as the "continuation probability" given by the following expression:

\[
P_{continuation}(w_i) = \frac{\left|\{ w_{i-1} : c(w_{i-1} w_i) > 0 \}\right|}{\sum_{w_i} \left|\{ w_{i-1} : c(w_{i-1} w_i) > 0 \}\right|}
\]

Kneser-Ney backoff is formalized as follows, assuming a proper coefficient α on the backoff to make everything sum to one:

\[
P_{KN}(w_i \mid w_{i-1}) =
\begin{cases}
\dfrac{c(w_{i-1} w_i) - D}{c(w_{i-1})}, & \text{if } c(w_{i-1} w_i) > 0 \\[2ex]
\alpha(w_i)\, \dfrac{\left|\{ w_{i-1} : c(w_{i-1} w_i) > 0 \}\right|}{\sum_{w_i} \left|\{ w_{i-1} : c(w_{i-1} w_i) > 0 \}\right|}, & \text{otherwise}
\end{cases}
\]

Interpolated vs. Backoff Form of Kneser-Ney

The Kneser-Ney backoff algorithm was shown to be inferior to its interpolated version. Interpolated Kneser-Ney discounting can be computed with an equation like the following (omitting the computation of β):

\[
P_{KN}(w_i \mid w_{i-1}) = \frac{c(w_{i-1} w_i) - D}{c(w_{i-1})} + \beta(w_i)\, \frac{\left|\{ w_{i-1} : c(w_{i-1} w_i) > 0 \}\right|}{\sum_{w_i} \left|\{ w_{i-1} : c(w_{i-1} w_i) > 0 \}\right|}
\]

Practical note: it turns out that any interpolation model can be represented as a backoff model, and hence stored in ARPA backoff format. The interpolation is done when the model is built, thus the 'bigram' probability stored in the backoff format is really 'bigram already interpolated with unigram'.

Class-based N-grams

Class-based N-grams, or cluster N-grams, are a variant of the N-gram that uses information about word classes or clusters. They are useful for dealing with sparsity in the training data.

Example: Suppose for a flight reservation system we want to compute the probability of the bigram to Shanghai, but this bigram never occurs in the training set. Assume that our training data has to London, to Beijing, and to Denver. If we knew that these were all cities, and assuming Shanghai does appear in the training set in other contexts, we could predict the likelihood of a city following to.
There are many variants of cluster N-grams. IBM clustering is a hard clustering: each word can belong to only one class. The model estimates the conditional probability of a word w_i by multiplying two factors: the probability of the word's class c_i given the preceding classes (based on an N-gram of classes), and the probability of w_i given c_i:

\[
P(w_i \mid w_{i-1}) \approx P(c_i \mid c_{i-1}) \times P(w_i \mid c_i)
\]

Assuming that there is a training corpus in which we have a class label for each word, the MLE of the probability of the word given the class and of the probability of the class given the previous class can be computed as follows:

\[
P(w \mid c) = \frac{C(w)}{C(c)}
\qquad
P(c_i \mid c_{i-1}) = \frac{C(c_{i-1} c_i)}{\sum_{c} C(c_{i-1} c)}
\]

Cluster N-grams are generally used in two ways:
1. Hand-designed, domain-specific word classes. In an airline information system we might use classes like CITYNAME, AIRLINE, DAYOFWEEK, MONTH, etc.
2. Automatically induced classes, obtained by clustering words in a corpus. Syntactic categories like part-of-speech tags don't seem to work well as classes.

Whether automatically induced or hand-designed, cluster N-grams are generally mixed with regular word-based N-grams.

Language Model Adaptation and Using the WWW

One of the most recent developments in language modeling is language model adaptation. It is relevant when one has only a small amount of in-domain training data, but a large amount of data from some other domain: train on the larger out-of-domain dataset, and adapt the models to the small in-domain set. An obvious large data source for this type of adaptation is the WWW.
The simplest way to apply the web to improve, say, trigram language models is to use search engines to get counts for w_1 w_2 and w_1 w_2 w_3, and then compute:

\[
\hat{p}_{web} = \frac{c_{web}(w_1 w_2 w_3)}{c_{web}(w_1 w_2)}
\]

One can mix \hat{p}_{web} with a conventional N-gram. More sophisticated methods can also be used, combining methods that make use of topic or class dependence to find domain-relevant data on the web.

Problems: In practice it is impossible to download every page from the web in order to compute N-grams, so only the page counts returned by search engines are used. Page counts are only approximations to actual counts, for many reasons:
1. A page may contain an N-gram multiple times.
2. Most search engines round off their counts.
3. Punctuation is deleted.
4. Counts may be adjusted due to link and other information.

The result is not hugely affected in spite of the "noise" due to inaccuracies in the information collected. It is also possible to perform specific adjustments, such as fitting a regression to predict actual word counts from page counts.

Using Longer Distance Information: A Brief Summary

There are methods for incorporating longer-distance context into N-gram modeling. While we have limited our discussion mainly to bigrams and trigrams, state-of-the-art speech recognition systems, for example, are based on longer-distance N-grams, especially 4-grams, but also 5-grams. Goodman (2006) showed that with 284 million words of training data, 5-grams do improve perplexity scores over 4-grams, but not by much. Goodman checked contexts up to 20-grams, and found that after 6-grams, longer contexts weren't useful, at least not with 284 million words of training data.

More Sophisticated Models

People tend to repeat words they have used before.
Thus if a word is used once in a text, it will probably be used again. We can capture this fact with a cache language model (Kuhn and DeMori, 1990). To use a unigram cache model to predict word i of a test corpus, we create a unigram grammar from the preceding part of the test corpus (words 1 to i−1), and mix this with our conventional N-gram. We might use only a shorter window of the previous words, rather than the entire set. Cache language models are very powerful in any application where we have perfect knowledge of the words. They work less well in domains where the previous words are not known exactly. In speech applications, for example, unless there is some way for users to correct errors, cache models tend to ‘lock in’ to errors they made on earlier words.

Repetition of words in a text is a symptom of a more general fact about words: texts tend to be about things. Documents about particular topics tend to use similar words, which suggests that we could train separate language models for different topics. Topic-based language models take advantage of the fact that different topics will have different kinds of words. Train a different language model for each topic t, and then mix them, weighted by how likely each topic is given the history h:

$$p(w \mid h) = \sum_{t} P(w \mid t)\, P(t \mid h)$$

Latent Semantic Indexing: based on the intuition that upcoming words are semantically similar to preceding words in the text. Word similarity is computed from a measure of semantic word association such as latent semantic indexing, or from dictionaries or thesauri, and is then mixed with a conventional N-gram.

Trigger-word-based N-grams: the predictor word, called a trigger, is not adjacent to the word we are trying to predict but is closely related to it: it has high mutual information with that word.
Skip N-grams: the preceding context ‘skips over’ some intermediate words, as in $P(w_i \mid w_{i-1}, w_{i-3})$.

Variable-length N-grams: the preceding context is extended where a longer phrase is particularly frequent.

Using very large and rich contexts can result in very large language models. These models are often pruned, removing low-probability events. There is a large body of research on integrating sophisticated linguistic structures into language modeling, as described in the following chapters of the textbook.

Advanced Topic: Information Theory Background

In the previous section, perplexity was introduced as a way to evaluate N-gram models on a test set. A better N-gram model is one which assigns a higher probability to the test data, and perplexity is a normalized version of the probability of the test set. Another way to think about perplexity is based on the information-theoretic concept of cross-entropy. This section introduces fundamentals of information theory, including the concept of cross-entropy. Reference: “Elements of Information Theory”, Cover and Thomas, Wiley-Interscience, 1991.

Entropy

Entropy is a measure of information content. Computing entropy requires establishing a random variable X that ranges over whatever is being predicted (words, letters, parts of speech, ...), taking values from a set χ, together with a probability function p(x). The entropy of the random variable X is then defined as:

$$H(X) = -\sum_{x \in \chi} p(x) \log_2 p(x)$$

The log can in principle be computed in any base. If base 2 is used, the resulting value is measured in bits.

The most intuitive way to define entropy for computer scientists is to think of it as a lower bound on the number of bits it would take to encode a certain decision or piece of information in the optimal coding scheme.
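As a minimal sketch of the entropy definition above, the following Python estimates p(x) by relative frequency from a toy token list and evaluates H(X) in bits. The function name `entropy_bits` and the toy sample are illustrative assumptions, not part of the original slides.

```python
import math
from collections import Counter

def entropy_bits(tokens):
    """Entropy H(X) in bits, with p(x) estimated by relative frequency."""
    counts = Counter(tokens)
    total = sum(counts.values())
    # Sum -p(x) * log2 p(x) over every observed value x.
    return sum(-(c / total) * math.log2(c / total) for c in counts.values())

# Hypothetical toy sample: "a" has p = 1/2, "b" and "c" have p = 1/4 each.
sample = ["a", "a", "b", "c"]
h = entropy_bits(sample)
```

For this toy sample the three terms are 0.5, 0.5, and 0.5 bits, so `h` is 1.5 bits. A sample containing only one distinct value has entropy 0, since nothing remains to be predicted.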
In Cover and Thomas the following example is provided: Imagine that we want to place a bet on a horse race, but it is too far to go all the way to Yonkers Racetrack, so we'd like to send a short message to the bookie to tell him which horse to bet on. Suppose there are eight horses in this particular race. One way to encode this message is just to use the binary representation of the horse's number as the code; thus horse 1 would be 001, horse 2 would be 010, horse 3 would be 011, and so on, with horse 8 coded as 000. If we spend the whole day betting, and each horse is coded with 3 bits, on average we would be sending 3 bits per race. Can we do better?

Suppose that the spread is the actual distribution of the bets placed, and that we represent it as the prior probability of each horse as follows:

Horse      1      2      3     4     5     6     7     8
Prior      1/2    1/4    1/8   1/16  1/64  1/64  1/64  1/64
In 64ths   32/64  16/64  8/64  4/64  1/64  1/64  1/64  1/64

The entropy of the random variable X that ranges over horses gives us a lower bound on the number of bits, and it is:

$$H(X) = -\sum_{i=1}^{8} p(i) \log_2 p(i) = -\frac{1}{2}\log_2\frac{1}{2} - \frac{1}{4}\log_2\frac{1}{4} - \frac{1}{8}\log_2\frac{1}{8} - \frac{1}{16}\log_2\frac{1}{16} - 4\left(\frac{1}{64}\log_2\frac{1}{64}\right) = 2 \text{ bits}$$

A variable-length encoding achieves this bound: 0 for the most likely horse, 10 for the next most likely horse, then 110 and 1110, and for the last four equally likely horses 111100, 111101, 111110, and 111111. With this code the average message length is 2 bits per race, which matches the entropy.

The entropy when the horses are equally likely, which is the case where the equal-length binary code is appropriate, is:

$$H(X) = -\sum_{i=1}^{8} \frac{1}{8}\log_2\frac{1}{8} = -\log_2\frac{1}{8} = 3 \text{ bits}$$

Practical applications in language processing involve sequences; for a grammar, one will be computing the entropy of some sequence of words $W = \{w_0, w_1, w_2, \ldots, w_n\}$.
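Before turning to sequences, the horse-race numbers above can be checked with a short sketch. It recomputes the entropy under the betting priors and under a uniform distribution, and the average length of the variable-length code. The variable and function names are illustrative assumptions; the probabilities and codewords are the ones given in the example.

```python
import math

# Prior probability of each of the eight horses, as given in the example.
priors = [1/2, 1/4, 1/8, 1/16, 1/64, 1/64, 1/64, 1/64]

# Variable-length code from the example, most likely horse first.
codes = ["0", "10", "110", "1110", "111100", "111101", "111110", "111111"]

def entropy_bits(probs):
    """H(X) = -sum p log2 p, skipping zero-probability outcomes."""
    return sum(-p * math.log2(p) for p in probs if p > 0)

skewed_entropy = entropy_bits(priors)        # entropy under the betting priors
uniform_entropy = entropy_bits([1/8] * 8)    # entropy when all horses are equally likely

# Average number of bits sent per race with the variable-length code.
expected_length = sum(p * len(c) for p, c in zip(priors, codes))

# No codeword is a prefix of another, so a message decodes unambiguously.
prefix_free = not any(a != b and b.startswith(a) for a in codes for b in codes)
```

Under these priors both `skewed_entropy` and `expected_length` come out at 2 bits, while `uniform_entropy` is 3 bits, so the variable-length code saves one bit per race on average compared with the fixed 3-bit code.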
One way to compute entropy for a sequence is to use a random variable that ranges over all finite sequences of words of length n in some language L, as follows:

$$H(w_1, w_2, \ldots, w_n) = -\sum_{W_1^n \in L} p(W_1^n) \log_2 p(W_1^n)$$

The entropy rate can be defined as the per-word entropy, that is, the entropy of a sequence divided by the number of words:

$$\frac{1}{n} H(W_1^n) = -\frac{1}{n} \sum_{W_1^n \in L} p(W_1^n) \log_2 p(W_1^n)$$

To compute the true entropy of a language, one needs to consider sequences of infinite length. Treating a language L as a stochastic process that produces a sequence of words, its entropy rate H(L) is defined as:

$$H(L) = \lim_{n \to \infty} \frac{1}{n} H(w_1, w_2, \ldots, w_n) = -\lim_{n \to \infty} \frac{1}{n} \sum_{W \in L} p(w_1, w_2, \ldots, w_n) \log_2 p(w_1, w_2, \ldots, w_n)$$

By the Shannon-McMillan-Breiman theorem, if a language is regular in certain ways (specifically, if it is both stationary and ergodic), then the following expression can be used:

$$H(L) = -\lim_{n \to \infty} \frac{1}{n} \log_2 p(w_1, w_2, \ldots, w_n)$$

That is, we can take a single sequence that is long enough instead of summing over all possible sequences. The rationale of the Shannon-McMillan-Breiman theorem is that a long enough sequence of words will contain in it many other shorter sequences, and that each of these shorter sequences will reoccur in the longer sequence according to its probability.

A stochastic process is said to be stationary if the probabilities it assigns to a sequence are invariant with respect to shifts in the time index: the probability distribution for words at time t is the same as the probability distribution at time t+1. Markov models, and hence N-grams, are stationary. In a bigram, $P_i$ depends only on $P_{i-1}$, so if we shift our time index by x, $P_{i+x}$ still depends only on $P_{i+x-1}$.
However, natural language is not stationary, since, as we will see later (Ch. 12 of the book), the probability of upcoming words can depend on events that were arbitrarily distant in time, and is thus time dependent. Consequently, statistical models only give an approximation to the correct distributions and entropies of natural language.

To summarize: by making some incorrect but convenient simplifying assumptions, we can compute the entropy of some stochastic process by taking a very long sample of the output and computing its average log probability. In the next section we talk about the why and the how: why we would want to do this (i.e., for what kinds of problems the entropy would tell us something useful), and how to compute the probability of a very long sequence.

Cross Entropy & Comparing Models

Cross entropy is useful when we do not know the actual probability distribution p that generated some data. It uses a model m of the distribution p as follows:

$$H(p, m) = -\lim_{n \to \infty} \frac{1}{n} \sum_{W \in L} p(w_1, w_2, \ldots, w_n) \log_2 m(w_1, w_2, \ldots, w_n)$$

For a stationary ergodic process this expression becomes:

$$H(p, m) = -\lim_{n \to \infty} \frac{1}{n} \log_2 m(w_1, w_2, \ldots, w_n)$$

The cross entropy H(p,m) is useful because it gives us an upper bound on the entropy H(p). For any model m:

$$H(p) \le H(p, m)$$

This means that we can use some simplified model m to help estimate the true entropy of a sequence of symbols drawn according to probability p. The more accurate m is, the closer the cross entropy H(p,m) will be to the true entropy H(p). Thus the difference between H(p,m) and H(p) is a measure of how accurate a model is. Between two models m1 and m2, the more accurate model will be the one with the lower cross-entropy. (The cross-entropy can never be lower than the true entropy, so a model cannot err by underestimating the true entropy.)
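The bound H(p) ≤ H(p,m) and the single-long-sample estimate can be illustrated with a small sketch. It assumes, purely for illustration, that the source emits symbols independently (a unigram process), so the per-symbol cross-entropy has a closed form. The distribution `p`, the two models, and all function names are hypothetical.

```python
import math
import random

def cross_entropy_bits(p, m):
    """Exact per-symbol cross-entropy H(p, m) = -sum_x p(x) log2 m(x)."""
    return sum(-p[x] * math.log2(m[x]) for x in p if p[x] > 0)

def sample_cross_entropy_bits(sample, m):
    """Estimate H(p, m) from one long sample: -(1/n) * sum_i log2 m(w_i)."""
    return -sum(math.log2(m[w]) for w in sample) / len(sample)

# Hypothetical "true" distribution p and two candidate models of it.
p = {"a": 0.5, "b": 0.25, "c": 0.25}
m_close = {"a": 0.4, "b": 0.3, "c": 0.3}
m_uniform = {"a": 1/3, "b": 1/3, "c": 1/3}

true_entropy = cross_entropy_bits(p, p)   # H(p) is the special case H(p, p)
h_close = cross_entropy_bits(p, m_close)
h_uniform = cross_entropy_bits(p, m_uniform)

# Draw a long i.i.d. sample from p (fixed seed so the run is repeatable).
rng = random.Random(0)
sample = rng.choices(list(p), weights=list(p.values()), k=20000)
h_close_estimate = sample_cross_entropy_bits(sample, m_close)
```

In this setup `true_entropy` is 1.5 bits, and both models score above it, with `m_close` lower than `m_uniform`, so the model nearer to p is the one with the lower cross-entropy. The sample-based estimate needs only the model m and the observed sequence, which is what makes it usable when p is unknown.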
Perplexity and Cross-Entropy

Cross-entropy is defined in the limit, as the length of the observed word sequence goes to infinity. In practice we need an approximation to cross-entropy that relies on a (sufficiently long) sequence of fixed length. This approximation to the cross-entropy of a model $M = P(w_i \mid w_{i-N+1} \ldots w_{i-1})$ on a sequence of words W is:

$$H(W) = -\frac{1}{N} \log_2 P(w_1, w_2, \ldots, w_N)$$

The perplexity of a model P on a sequence of words W is defined as 2 raised to the power of this cross-entropy:

$$PP(W) = 2^{H(W)} = 2^{-\frac{1}{N}\log_2 P(w_1 w_2 \ldots w_N)} = P(w_1 w_2 \ldots w_N)^{-\frac{1}{N}} = \sqrt[N]{\frac{1}{P(w_1 w_2 \ldots w_N)}}$$

Advanced Topic: The Entropy of English and Entropy Rate Constancy

As we suggested in the previous section, the cross-entropy of some model m can be used as an upper bound on the true entropy of some process. We can use this method to get an estimate of the true entropy of English. Why should we care about the entropy of English?
1. Knowing the entropy of English would give us a solid lower bound for all of our future experiments on probabilistic grammars.
2. We can use the entropy of English to help understand what parts of a language provide the most information: is the predictability of English mainly based on word order, semantics, morphology, constituency, or pragmatic cues? Answering this question can help us immensely in knowing where to focus our language-modeling efforts.

There are two common methods for computing the entropy of English:
1. Use human subjects in a psychological experiment that requires them to guess strings of letters. By looking at how many guesses it takes them to guess letters correctly, we can estimate the probability of the letters and hence the entropy of the sequence. This method was used by Shannon in 1951 in his groundbreaking work defining the field of information theory.
2. Take a very good stochastic model, trained on a very large corpus, and use it to assign a log-probability to a very long sequence of English, applying the Shannon-McMillan-Breiman theorem introduced earlier. Because the result is the cross-entropy of the model m, it is an upper bound on the entropy of English:

$$H(\text{English}) \le -\lim_{n \to \infty} \frac{1}{n} \log_2 m(w_1, w_2, \ldots, w_n)$$

Shannon Experiment

The actual experiment is designed as follows: we present a subject with some English text and ask the subject to guess the next letter. The subjects will use their knowledge of the language to guess the most probable letter first, the next most probable next, and so on. We record the number of guesses it takes for the subject to guess correctly. Shannon's insight was that the entropy of the number-of-guesses sequence is the same as the entropy of English. (The intuition is that given the number-of-guesses sequence, we could reconstruct the original text by choosing the “nth most probable” letter whenever the subject took n guesses.) This methodology requires the use of letter guesses rather than word guesses (since the subject sometimes has to do an exhaustive search of all the possible letters!), and so Shannon computed the per-letter entropy of English rather than the per-word entropy. He reported an entropy of 1.3 bits per letter (for 27 characters: 26 letters plus space). Shannon's estimate is likely to be too low, since it is based on a single text (Jefferson the Virginian by Dumas Malone). Shannon notes that his subjects had worse guesses (hence higher entropies) on other texts (newspaper writing, scientific work, and poetry).
More recent variations on the Shannon experiments include the use of a gambling paradigm where the subjects get to bet on the next letter (Cover and King, 1978; Cover and Thomas, 1991).

Computer Based Computation of Entropy of English

The second method for computing the entropy of English helps avoid the single-text problem that confounds Shannon's results. For example, Brown et al. (1992a) trained a trigram language model on 583 million words of English (293,181 different types) and used it to compute the probability of the entire Brown corpus (1,014,312 tokens). The training data include newspapers, encyclopedias, novels, office correspondence, proceedings of the Canadian parliament, and other miscellaneous sources. They then computed the character entropy of the Brown corpus by using their word-trigram grammar to assign probabilities to the Brown corpus, considered as a sequence of individual letters. They obtained an entropy of 1.75 bits per character (where the set of characters included all 95 printable ASCII characters).

The average length of English written words (including space) has been reported at 5.5 letters (Nádas, 1984). If this is correct, it means that the Shannon estimate of 1.3 bits per letter corresponds to a per-word perplexity of 142 for general English. The numbers we reported earlier for the WSJ experiments are significantly lower than this, since the training and test sets came from the same sub-sample of English. That is, those experiments underestimate the complexity of English (since the Wall Street Journal looks very little like Shakespeare, for example).
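The figure of 142 follows from treating a word as 5.5 letters on average, each carrying 1.3 bits, and taking 2 to the power of the resulting bits per word. A minimal sketch of that arithmetic is below; the function name is an illustrative assumption.

```python
def per_word_perplexity(bits_per_letter, letters_per_word):
    """Convert a per-letter entropy into a per-word perplexity, 2 ** (bits per word)."""
    bits_per_word = bits_per_letter * letters_per_word
    return 2 ** bits_per_word

# Shannon's 1.3 bits per letter with an average word length of 5.5 letters.
shannon_perplexity = per_word_perplexity(1.3, 5.5)
```

Here 1.3 × 5.5 = 7.15 bits per word, and 2 raised to 7.15 is approximately 142.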
Constant Information/Entropy Rate of Speech

A number of scholars have independently made the intriguing suggestion that entropy rate plays a role in human communication in general (Lindblom, 1990; Van Son et al., 1998; Aylett, 1999; Genzel and Charniak, 2002; Van Son and Pols, 2003). The idea is that people speak so as to keep the rate of information being transmitted per second roughly constant, i.e. transmitting a constant number of bits per second, or maintaining a constant entropy rate. Since the most efficient way of transmitting information through a channel is at a constant rate, language may even have evolved for such communicative efficiency (Plotkin and Nowak, 2000). There is a wide variety of evidence for the constant entropy rate hypothesis. One class of evidence, for speech, shows that speakers shorten predictable words (i.e. they take less time to say predictable words) and lengthen unpredictable words (Aylett, 1999; Jurafsky et al., 2001; Aylett and Turk, 2004).

In another line of research, Genzel and Charniak (2002, 2003) show that entropy rate constancy makes predictions about the entropy of individual sentences from a text. In particular, they show that it predicts that local measures of sentence entropy which ignore previous discourse context (for example, the N-gram probability of a sentence) should increase with the sentence number, and they document this increase in corpora. Keller (2004) provides evidence that entropy rate plays a role for the addressee as well, showing a correlation between the entropy of a sentence and the processing effort it causes in comprehension, as measured by reading times in eye-tracking data.