CS 124/LINGUIST 180: From Language to Information
Dan Jurafsky
Lecture 3: Intro to Probability, Language Modeling
IP notice: some slides for today from Jim Martin, Sandiway Fong, Dan Klein

Outline
- Probability
  - Basic probability
  - Conditional probability
- Language Modeling (N-grams)
  - N-gram intro
  - The Chain Rule
  - The Shannon Visualization Method
  - Evaluation: perplexity
  - Smoothing: Laplace (add-1), add-prior

1. Introduction to Probability
- Experiment (trial): a repeatable procedure with well-defined possible outcomes
- Sample space (S): the set of all possible outcomes; finite or infinite
- Example: coin toss experiment; possible outcomes: S = {heads, tails}
- Example: die toss experiment; possible outcomes: S = {1,2,3,4,5,6}
(Slides from Sandiway Fong)

Introduction to Probability
- The definition of the sample space depends on what we are asking
- Sample space (S): the set of all possible outcomes
- Example: die toss experiment for whether the number is even or odd
  - possible outcomes: {even, odd}, not {1,2,3,4,5,6}

More definitions
- Events: an event is any subset of outcomes from the sample space
- Example: die toss experiment
  - let A represent the event that the outcome of the die toss is divisible by 3
  - A = {3,6}, a subset of the sample space S = {1,2,3,4,5,6}
- Example: draw a card from a deck
  - suppose the sample space is S = {heart, spade, club, diamond} (four suits)
  - let A represent the event of drawing a heart: A = {heart}
  - let B represent the event of drawing a red card: B = {heart, diamond}

Introduction to Probability
- Counting: if operation o_i can be performed in n_i ways, then a sequence of k operations o_1 o_2 ... o_k can be performed in n_1 × n_2 × ... × n_k ways
- Example: die toss experiment, 6 possible outcomes
  - two dice are thrown at the same time
  - number of sample points in the sample space = 6 × 6 = 36

Definition of Probability
- The probability law assigns to an event a nonnegative number, called P(A), also called the probability of A
- P(A) encodes our knowledge or belief about the collective likelihood of all the elements of A
- The probability law must satisfy certain properties

Probability Axioms
- Nonnegativity: P(A) >= 0 for every event A
- Additivity: if A and B are two disjoint events, then the probability of their union satisfies P(A ∪ B) = P(A) + P(B)
- Normalization: the probability of the entire sample space S is equal to 1, i.e. P(S) = 1

An example
- An experiment involving a single coin toss
- There are two possible outcomes, H and T; the sample space S is {H,T}
- If the coin is fair, we should assign equal probabilities to the 2 outcomes, since they have to sum to 1:
  - P({H}) = 0.5
  - P({T}) = 0.5
  - P({H,T}) = P({H}) + P({T}) = 1.0

Another example
- An experiment involving 3 coin tosses
- Each outcome is a 3-long string of H or T
- S = {HHH, HHT, HTH, HTT, THH, THT, TTH, TTT}
- Assume each outcome is equiprobable ("uniform distribution")
- What is the probability of the event that exactly 2 heads occur?
  - A = {HHT, HTH, THH}
  - P(A) = P({HHT}) + P({HTH}) + P({THH}) = 1/8 + 1/8 + 1/8 = 3/8

Probability definitions
- In summary, for equiprobable outcomes: the probability of an event is the number of outcomes in the event divided by the number of outcomes in the sample space
- Probability of drawing a spade from 52 well-shuffled playing cards: 13/52 = 1/4 = .25

Probabilities of two events
- If two events A and B are independent, then P(A and B) = P(A) × P(B)
- If we flip a fair coin twice, what is the probability that both tosses are heads?
- If we draw a card from a deck, put it back, and draw a card from the deck again, what is the probability that both drawn cards are hearts?
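
The two questions above can be checked by direct enumeration and the multiplication rule. A minimal Python sketch (not part of the original slides):

```python
from itertools import product
from fractions import Fraction

# Enumerate the sample space of 3 fair coin tosses (uniform distribution).
sample_space = list(product("HT", repeat=3))        # 8 equiprobable outcomes

# Event A: exactly two heads occur.
A = [s for s in sample_space if s.count("H") == 2]
print(Fraction(len(A), len(sample_space)))          # 3/8, as on the slide

# Independent events: multiply the individual probabilities.
print(Fraction(1, 2) * Fraction(1, 2))              # both fair-coin tosses heads: 1/4
print(Fraction(13, 52) * Fraction(13, 52))          # both draws hearts (with replacement): 1/16
```
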
How about non-uniform probabilities? An example
- A biased coin, twice as likely to come up tails as heads, is tossed twice
- What is the probability that at least one head occurs?
- Sample space = {hh, ht, th, tt} (h = heads, t = tails)
- Sample points and their probabilities:
  - hh: 1/3 × 1/3 = 1/9
  - ht: 1/3 × 2/3 = 2/9
  - th: 2/3 × 1/3 = 2/9
  - tt: 2/3 × 2/3 = 4/9
- Answer: 1/9 + 2/9 + 2/9 = 5/9 ≈ 0.56 (the sum of the weights for the outcomes containing a head)

Moving toward language
- What's the probability of drawing a 2 from a deck of 52 cards with four 2s?
  P(drawing a two) = 4/52 = 1/13 ≈ .077
- What's the probability of a random word (from a random dictionary page) being a verb?
  P(drawing a verb) = (# of ways to get a verb) / (all words)

Probability and part-of-speech tags
- What's the probability of a random word (from a random dictionary page) being a verb?
  P(drawing a verb) = (# of ways to get a verb) / (all words)
- How to compute each of these:
  - all words: just count all the words in the dictionary
  - # of ways to get a verb: the number of words which are verbs
- If a dictionary has 50,000 entries, and 10,000 are verbs, then P(V) is 10000/50000 = 1/5 = .20

Conditional Probability
- A way to reason about the outcome of an experiment based on partial information
  - In a word guessing game, the first letter of the word is a "t". What is the likelihood that the second letter is an "h"?
  - How likely is it that a person has a disease given that a medical test was negative?
  - A spot shows up on a radar screen. How likely is it that it corresponds to an aircraft?

More precisely
- Given an experiment, a corresponding sample space S, and a probability law
- Suppose we know that the outcome is within some given event B
- We want to quantify the likelihood that the outcome also belongs to some other given event A
- We need a new probability law that gives us the conditional probability of A given B: P(A|B)

An intuition
- A is "it's raining now"; P(A) in dry California is .01
- B is "it was raining ten minutes ago"
- P(A|B) means "what is the probability of it raining now if it was raining 10 minutes ago"
- P(A|B) is probably way higher than P(A); perhaps P(A|B) is .10
- Intuition: the knowledge about B should change our estimate of the probability of A

Conditional probability
- One of the following 30 items is chosen at random
- What is P(X), the probability that it is an X?
- What is P(X|red), the probability that it is an X given that it is red?

Conditional Probability
- Let A and B be events
- p(B|A) = the probability of event B occurring given that event A occurs
- Definition: p(B|A) = p(A ∩ B) / p(A)

Conditional probability
- P(A|B) = P(A ∩ B) / P(B), or equivalently P(A|B) = P(A,B) / P(B)
- Note: P(A,B) = P(A|B) · P(B)
- Also: P(A,B) = P(B,A)
(Venn diagram on the slide: events A and B overlapping in a region A,B inside the sample space S)

Independence
- What is P(A,B) if A and B are independent?
- P(A,B) = P(A) × P(B) iff A, B independent
  - P(heads, tails) = P(heads) × P(tails) = .5 × .5 = .25
- Note: P(A|B) = P(A) iff A, B independent
- Also: P(B|A) = P(B) iff A, B independent

Summary
- Probability
- Conditional probability
- Independence
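
As a small check (not from the slides), the biased-coin example and the definition P(A|B) = P(A,B)/P(B) can be verified by enumerating the weighted sample space:

```python
from itertools import product
from fractions import Fraction

# Biased coin: P(h) = 1/3, P(t) = 2/3, tossed twice; the tosses are independent.
p = {"h": Fraction(1, 3), "t": Fraction(2, 3)}
outcomes = {a + b: p[a] * p[b] for a, b in product("ht", repeat=2)}

# Event A: at least one head.  Event B: the first toss is a head.
p_A  = sum(w for o, w in outcomes.items() if "h" in o)
p_B  = sum(w for o, w in outcomes.items() if o[0] == "h")
p_AB = sum(w for o, w in outcomes.items() if "h" in o and o[0] == "h")

print(p_A)          # 5/9, matching the slide
print(p_AB / p_B)   # P(A|B) = P(A,B)/P(B) = 1: given a first-toss head, A is certain
```
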
Language Modeling
- We want to compute P(w1,w2,w3,w4,w5...wn) = P(W) = the probability of a sequence
- Alternatively, we want to compute P(w5|w1,w2,w3,w4) = the probability of a word given some previous words
- The model that computes P(W) or P(wn|w1,w2...wn-1) is called the language model
- A better term for this would be "the grammar", but "language model" or LM is standard

Computing P(W)
- How to compute this joint probability:
  P("the", "other", "day", "I", "was", "walking", "along", "and", "saw", "a", "lizard")
- Intuition: let's rely on the Chain Rule of Probability

The Chain Rule
- Recall the definition of conditional probabilities: P(A|B) = P(A ∧ B) / P(B)
- Rewriting: P(A ∧ B) = P(A|B) P(B)
- More generally: P(A,B,C,D) = P(A) P(B|A) P(C|A,B) P(D|A,B,C)
- In general: P(x1,x2,x3,...,xn) = P(x1) P(x2|x1) P(x3|x1,x2) ... P(xn|x1...xn-1)

The Chain Rule applied to the joint probability of words in a sentence
- P("the big red dog was") = P(the) × P(big|the) × P(red|the big) × P(dog|the big red) × P(was|the big red dog)

Very easy estimate
- How to estimate P(the | its water is so transparent that)?
  P(the | its water is so transparent that) = C(its water is so transparent that the) / C(its water is so transparent that)

Unfortunately
- There are a lot of possible sentences
- We'll never be able to get enough data to compute the statistics for those long prefixes
  - P(lizard | the,other,day,I,was,walking,along,and,saw,a)
  - or P(the | its water is so transparent that)

Markov Assumption
- Make the simplifying assumption
  P(lizard | the,other,day,I,was,walking,along,and,saw,a) = P(lizard | a)
- Or maybe
  P(lizard | the,other,day,I,was,walking,along,and,saw,a) = P(lizard | saw,a)

Markov Assumption
- So for each component in the product, replace it with the approximation (assuming a prefix of N):
  P(wn | w1...wn-1) ≈ P(wn | wn-N+1...wn-1)
- Bigram version:
  P(wn | w1...wn-1) ≈ P(wn | wn-1)

Estimating bigram probabilities
- The Maximum Likelihood Estimate:
  P(wi | wi-1) = count(wi-1, wi) / count(wi-1) = c(wi-1, wi) / c(wi-1)

An example
- <s> I am Sam </s>
- <s> Sam I am </s>
- <s> I do not like green eggs and ham </s>
- This is the Maximum Likelihood Estimate, because it is the one which maximizes P(Training set | Model)

Maximum Likelihood Estimates
- The maximum likelihood estimate of some parameter of a model M from a training set T is the estimate that maximizes the likelihood of the training set T given the model M
- Suppose the word "Chinese" occurs 400 times in a corpus of a million words (the Brown corpus)
- What is the probability that a random word from some other text will be "Chinese"?
- The MLE estimate is 400/1,000,000 = .0004
- This may be a bad estimate for some other corpus
- But it is the estimate that makes it most likely that "Chinese" will occur 400 times in a million-word corpus

More examples: Berkeley Restaurant Project sentences
- can you tell me about any good cantonese restaurants close by
- mid priced thai food is what i'm looking for
- tell me about chez panisse
- can you give me a listing of the kinds of food that are available
- i'm looking for a good place to eat breakfast
- when is caffe venezia open during the day

Raw bigram counts
- Out of 9222 sentences

Raw bigram probabilities
- Normalize by unigrams
- Result: a table of bigram probabilities (shown on the slide)

Bigram estimates of sentence probabilities
- P(<s> I want english food </s>) = P(i|<s>) × P(want|I) × P(english|want) × P(food|english) × P(</s>|food) = .000031
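
A short illustrative sketch (not part of the lecture) of the maximum likelihood bigram estimate, using the three-sentence Sam corpus shown earlier:

```python
from collections import Counter

corpus = [
    "<s> I am Sam </s>",
    "<s> Sam I am </s>",
    "<s> I do not like green eggs and ham </s>",
]

unigrams, bigrams = Counter(), Counter()
for sent in corpus:
    tokens = sent.split()
    unigrams.update(tokens)
    bigrams.update(zip(tokens, tokens[1:]))

def p_mle(w, prev):
    """P(w | prev) = c(prev, w) / c(prev)."""
    return bigrams[(prev, w)] / unigrams[prev]

print(p_mle("I", "<s>"))     # 2/3: "I" follows <s> in two of the three sentences
print(p_mle("Sam", "am"))    # 1/2: "am" occurs twice, once followed by "Sam"
print(p_mle("</s>", "Sam"))  # 1/2
```
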
What kinds of knowledge?
- P(english|want) = .0011
- P(chinese|want) = .0065
- P(to|want) = .66
- P(eat|to) = .28
- P(food|to) = 0
- P(want|spend) = 0
- P(i|<s>) = .25

The Shannon Visualization Method
- Generate random sentences:
  - Choose a random bigram (<s>, w) according to its probability
  - Now choose a random bigram (w, x) according to its probability
  - And so on, until we choose </s>
  - Then string the words together
- Example: <s> I, I want, want to, to eat, eat Chinese, Chinese food, food </s>
  - yielding "I want to eat Chinese food"

Approximating Shakespeare
- Shakespeare as corpus: N = 884,647 tokens, V = 29,066
- Shakespeare produced 300,000 bigram types out of V² = 844 million possible bigrams: so 99.96% of the possible bigrams were never seen (have zero entries in the table)
- Quadrigrams are worse: what's coming out looks like Shakespeare because it is Shakespeare

The Wall Street Journal is not Shakespeare (no offense)

Lesson 1: the perils of overfitting
- N-grams only work well for word prediction if the test corpus looks like the training corpus
- In real life, it often doesn't
- We need to train robust models, adapt to the test set, etc.

Lesson 2: zeros or not?
- Zipf's Law:
  - A small number of events occur with high frequency
  - A large number of events occur with low frequency
  - You can quickly collect statistics on the high-frequency events
  - You might have to wait an arbitrarily long time to get valid statistics on low-frequency events
- Result: our estimates are sparse! No counts at all for the vast bulk of things we want to estimate!
- Some of the zeros in the table are really zeros, but others are simply low-frequency events you haven't seen yet. After all, ANYTHING CAN HAPPEN!
- How to address this? Answer: estimate the likelihood of unseen N-grams!
(Slide adapted from Bonnie Dorr and Julia Hirschberg)

Smoothing is like Robin Hood: steal from the rich and give to the poor (in probability mass)
(Slide from Dan Klein)

Laplace smoothing
- Also called add-one smoothing
- Just add one to all the counts!
- Very simple
- MLE estimate: P_MLE(wi | wi-1) = c(wi-1, wi) / c(wi-1)
- Laplace estimate: P_Laplace(wi | wi-1) = (c(wi-1, wi) + 1) / (c(wi-1) + V)
- Reconstructed counts: c*(wi-1, wi) = (c(wi-1, wi) + 1) × c(wi-1) / (c(wi-1) + V)

Laplace-smoothed bigram counts, Laplace-smoothed bigrams, and reconstituted counts
(Berkeley Restaurant Project tables shown on the slides)

Note the big change to the counts
- C(want to) went from 608 to 238!
- P(to|want) went from .66 to .26!
- Discount d = c*/c
  - d for "chinese food" = .10! A 10x reduction
- So in general, Laplace is a blunt instrument
- Laplace smoothing is not used for N-grams, as we have much better methods
- Despite its flaws, Laplace (add-k) is still used to smooth other probabilistic models in NLP, especially
  - for pilot studies
  - in domains where the number of zeros isn't so huge

Add-k
- Add a small fraction k instead of 1

Even better: Bayesian unigram prior smoothing for bigrams
- Maximum Likelihood Estimation:
  P(w2 | w1) = C(w1,w2) / C(w1)
- Laplace smoothing:
  P_Laplace(w2 | w1) = (C(w1,w2) + 1) / (C(w1) + vocab)
- Bayesian prior smoothing:
  P_Prior(w2 | w1) = (C(w1,w2) + P(w2)) / (C(w1) + 1)

Practical Issues
- We do everything in log space
  - avoids underflow
  - (also, adding is faster than multiplying)

Language Modeling Toolkits
- SRILM: http://www.speech.sri.com/projects/srilm/

Google N-Gram Release
- serve as the incoming 92
- serve as the incubator 99
- serve as the independent 794
- serve as the index 223
- serve as the indication 72
- serve as the indicator 120
- serve as the indicators 45
- serve as the indispensable 111
- serve as the indispensible 40
- serve as the individual 234
- http://googleresearch.blogspot.com/2006/08/all-our-n-gram-are-belong-to-you.html
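
As an illustration (not from the slides), the add-one and add-k smoothing described above amount to a one-line change to the MLE estimate; this sketch reuses the toy Sam corpus and counts from the earlier example:

```python
from collections import Counter

corpus = [
    "<s> I am Sam </s>",
    "<s> Sam I am </s>",
    "<s> I do not like green eggs and ham </s>",
]
unigrams, bigrams = Counter(), Counter()
for sent in corpus:
    toks = sent.split()
    unigrams.update(toks)
    bigrams.update(zip(toks, toks[1:]))
V = len(unigrams)   # vocabulary size

def p_add_k(w, prev, k=1.0):
    # (c(prev, w) + k) / (c(prev) + k*V); k = 1 is Laplace (add-one), smaller k is add-k.
    return (bigrams[(prev, w)] + k) / (unigrams[prev] + k * V)

print(p_add_k("Sam", "am"))           # seen bigram, now discounted below its MLE of 1/2
print(p_add_k("green", "am"))         # unseen bigram, no longer zero
print(p_add_k("green", "am", k=0.01)) # add-k gives unseen events less mass than add-one
```
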
Evaluation
- We train the parameters of our model on a training set
- How do we evaluate how well our model works?
- We look at the model's performance on some new data
- This is what happens in the real world; we want to know how our model performs on data we haven't seen
- So we use a test set, a dataset that is different from our training set
- Then we need an evaluation metric to tell us how well our model is doing on the test set
- One such metric is perplexity (introduced below)

Evaluating N-gram models
- The best evaluation for an N-gram model:
  - Put model A in a task (language identification, a speech recognizer, a machine translation system)
  - Run the task and get an accuracy for A (how many languages identified correctly, or Word Error Rate, etc.)
  - Put model B in the task and get an accuracy for B
  - Compare the accuracies of A and B
- This is extrinsic evaluation

Difficulty of extrinsic (in-vivo) evaluation of N-gram models
- Extrinsic evaluation is really time-consuming; it can take days to run an experiment
- So, as a temporary solution, in order to run experiments, we often evaluate N-grams with an intrinsic evaluation, an approximation called perplexity
- But perplexity is a poor approximation unless the test data looks just like the training data
- So it is generally only useful in pilot experiments (and generally not sufficient to publish)
- But it is helpful to think about

Perplexity
- Perplexity is the probability of the test set (assigned by the language model), normalized by the number of words:
  PP(W) = P(w1 w2 ... wN)^(-1/N)
- By the chain rule: PP(W) = (product over i of 1 / P(wi | w1...wi-1))^(1/N)
- For bigrams: PP(W) = (product over i of 1 / P(wi | wi-1))^(1/N)
- Minimizing perplexity is the same as maximizing probability
- The best language model is one that best predicts an unseen test set

A totally different perplexity intuition
- How hard is the task of recognizing the digits '0,1,2,3,4,5,6,7,8,9,oh'? Easy: perplexity 11 (or, if we ignore 'oh', perplexity 10)
- How hard is recognizing 30,000 names at Microsoft? Hard: perplexity = 30,000
- If a system has to recognize
  - Operator (1 in 4)
  - Sales (1 in 4)
  - Technical Support (1 in 4)
  - 30,000 names (1 in 120,000 each)
  then the perplexity is 54
- Perplexity is the weighted equivalent branching factor
(Slide from Josh Goodman)

Perplexity as branching factor

Lower perplexity = better model
- Training: 38 million words; test: 1.5 million words; WSJ
(perplexity table comparing N-gram orders shown on the slide)
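
A minimal sketch (not from the slides) of computing perplexity in log space; the uniform-digit model reproduces the branching-factor intuition above:

```python
import math

# PP(W) = exp(-(1/N) * sum_i log P(w_i | w_{i-1})), computed in log space to avoid underflow.
def perplexity(tokens, prob_fn):
    logp = sum(math.log(prob_fn(w, prev)) for prev, w in zip(tokens, tokens[1:]))
    n = len(tokens) - 1                      # number of predicted words
    return math.exp(-logp / n)

# Sanity check of the branching-factor intuition: a model that assigns each of
# 10 digits probability 1/10 regardless of context has perplexity 10.
uniform_digit_model = lambda w, prev: 1.0 / 10
digits = "<s> three five one nine".split()
print(perplexity(digits, uniform_digit_model))   # 10.0
```
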
Unknown words: open versus closed vocabulary tasks
- If we know all the words in advance
  - the vocabulary V is fixed
  - this is a closed vocabulary task
- Often we don't know this
  - Out Of Vocabulary = OOV words
  - this is an open vocabulary task
- Instead: create an unknown word token <UNK>
  - Training of <UNK> probabilities:
    - Create a fixed lexicon L of size V
    - At the text normalization phase, change any training word not in L to <UNK>
    - Now we train its probabilities like a normal word
  - At decoding time:
    - If text input: use the <UNK> probabilities for any word not in the training data

Advanced LM stuff
- Current best smoothing algorithm: Kneser-Ney smoothing
- Other stuff:
  - Interpolation
  - Backoff
  - Variable-length n-grams
  - Class-based n-grams
    - Clustering
    - Hand-built classes
  - Cache LMs
  - Topic-based LMs
  - Sentence mixture models
  - Skipping LMs
  - Parser-based LMs

Backoff and Interpolation
- Another really useful source of knowledge
- If we are estimating the trigram p(z|x,y) but c(xyz) is zero
- Use info from the bigram p(z|y), or even the unigram p(z)
- How do we combine the trigram/bigram/unigram info?

Backoff versus interpolation
- Backoff: use the trigram if you have it, otherwise the bigram, otherwise the unigram
- Interpolation: mix all three

Interpolation
- Simple interpolation:
  P_interp(wn | wn-2, wn-1) = λ1 P(wn | wn-2, wn-1) + λ2 P(wn | wn-1) + λ3 P(wn), with the lambdas summing to 1
- The lambdas can also be made conditional on context
- (a short code sketch of simple interpolation appears after the summary below)

How to set the lambdas?
- Use a held-out corpus: Training Data | Held-Out Data | Test Data
- Choose the lambdas which maximize the probability of this held-out data
  - i.e., fix the N-gram probabilities
  - then search for the lambda values that, when plugged into the equation above, give the largest probability for the held-out set
- Can use EM to do this search

Intuition of backoff + discounting
- How much probability mass should we assign to all the zero trigrams?
  - Use Good-Turing or another discounting algorithm to tell us
- How do we divide that probability mass among different contexts?
  - Use the (N-1)-gram estimates to tell us
- What do we do for unigram words not seen in training?
  - Out Of Vocabulary = OOV words

ARPA format

Summary
- Probability
  - Basic probability
  - Conditional probability
- Language Modeling (N-grams)
  - N-gram intro
  - The Chain Rule
  - The Shannon Visualization Method
  - Evaluation: perplexity
  - Smoothing: Laplace (add-1), add-k, add-prior
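
Finally, the interpolation sketch referenced above; the p_tri/p_bi/p_uni functions and the lambda values are toy stand-ins, not estimates or settings from the lecture:

```python
# Simple linear interpolation of trigram, bigram, and unigram estimates,
# mirroring the formula on the interpolation slide above.
def p_interp(w, u, v, p_tri, p_bi, p_uni, lambdas=(0.6, 0.3, 0.1)):
    """P(w | u, v) = l1*P(w|u,v) + l2*P(w|v) + l3*P(w); the lambdas sum to 1
    and would be tuned on held-out data (these values are arbitrary stand-ins)."""
    l1, l2, l3 = lambdas
    return l1 * p_tri(w, u, v) + l2 * p_bi(w, v) + l3 * p_uni(w)

# Toy stand-in models: even when the trigram estimate is zero, the interpolated
# probability stays non-zero because of the lower-order terms.
p_tri = lambda w, u, v: 0.0
p_bi  = lambda w, v: 0.05
p_uni = lambda w: 0.01
print(p_interp("lizard", "saw", "a", p_tri, p_bi, p_uni))   # 0.6*0 + 0.3*0.05 + 0.1*0.01 = 0.016
```
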