# 224s.09.lec11 - CS 124/LINGUIST 180: From Language to Information (Dan Jurafsky)


CS 124/LINGUIST 180: From Language to Information

Dan Jurafsky

Lecture 3: Intro to Probability, Language Modeling

IP notice: some slides for today from Jim Martin, Sandiway Fong, Dan Klein

## Outline

- Probability
  - Basic probability
  - Conditional probability
- Language Modeling (N-grams)
  - N-gram intro
  - The Chain Rule
  - The Shannon Visualization Method
  - Evaluation: perplexity
  - Smoothing: Laplace (Add-1), add-prior
## Language Modeling

We want to compute P(w_1, w_2, w_3, w_4, w_5, ..., w_n) = P(W): the probability of a sequence.

Alternatively, we want to compute P(w_5 | w_1, w_2, w_3, w_4): the probability of a word given some previous words.

The model that computes P(W) or P(w_n | w_1, w_2, ..., w_{n-1}) is called the language model. A better term for it would be "the grammar", but "language model" (or LM) is standard.

## Computing P(W)

How do we compute this joint probability?

P("the", "other", "day", "I", "was", "walking", "along", "and", "saw", "a", "lizard")

Intuition: let's rely on the Chain Rule of Probability.
## The Chain Rule

Recall the definition of conditional probability:

P(A | B) = P(A ∧ B) / P(B)

Rewriting:

P(A ∧ B) = P(A | B) P(B)

More generally:

P(A, B, C, D) = P(A) P(B | A) P(C | A, B) P(D | A, B, C)

In general:

P(x_1, x_2, x_3, ..., x_n) = P(x_1) P(x_2 | x_1) P(x_3 | x_1, x_2) ... P(x_n | x_1, ..., x_{n-1})
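As a quick numerical sanity check, the chain rule can be verified on a tiny hand-built joint distribution. The events and probabilities below are invented for illustration; the point is only that P(A ∧ B) equals P(A) times P(B | A):

```python
# Toy joint distribution P(A = a, B = b); the numbers are made up.
joint = {
    ("rain", "umbrella"): 0.30,
    ("rain", "no_umbrella"): 0.10,
    ("sun", "umbrella"): 0.05,
    ("sun", "no_umbrella"): 0.55,
}

def p_a(a):
    # Marginal P(A = a): sum the joint over all values of B.
    return sum(p for (x, _), p in joint.items() if x == a)

def p_b_given_a(b, a):
    # Conditional P(B = b | A = a) = P(A = a, B = b) / P(A = a).
    return joint[(a, b)] / p_a(a)

# Chain rule: P(rain, umbrella) = P(rain) * P(umbrella | rain)
lhs = joint[("rain", "umbrella")]
rhs = p_a("rain") * p_b_given_a("umbrella", "rain")
assert abs(lhs - rhs) < 1e-9
```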

## The Chain Rule applied to the joint probability of words in a sentence

P("the big red dog was") = P(the) * P(big | the) * P(red | the big) * P(dog | the big red) * P(was | the big red dog)
## How to estimate?

P(the | its water is so transparent that)

A very easy estimate:

P(the | its water is so transparent that) = C(its water is so transparent that the) / C(its water is so transparent that)
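This count-based estimate is easy to sketch in code. The two-sentence corpus below is invented for illustration (the slide does not supply one), so the resulting number only demonstrates the ratio of counts:

```python
# Relative-frequency estimate P(w | prefix) = C(prefix + w) / C(prefix),
# on a tiny made-up corpus (hypothetical text, for illustration only).
corpus = ("its water is so transparent that the fish hid . "
          "its water is so transparent that you could count pebbles .").split()

def count_ngram(tokens, ngram):
    # Number of positions where `ngram` occurs as a contiguous subsequence.
    n = len(ngram)
    return sum(1 for i in range(len(tokens) - n + 1)
               if tokens[i:i + n] == ngram)

prefix = "its water is so transparent that".split()
numerator = count_ngram(corpus, prefix + ["the"])
denominator = count_ngram(corpus, prefix)
p_the = numerator / denominator  # estimate of P(the | its water is so transparent that)
```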

## Unfortunately

There are a lot of possible sentences, and we'll never be able to get enough data to compute the statistics for such long prefixes:

P(lizard | the, other, day, I, was, walking, along, and, saw, a)

or

P(the | its water is so transparent that)
## Markov Assumption

Make the simplifying assumption

P(lizard | the, other, day, I, was, walking, along, and, saw, a) = P(lizard | a)

or maybe

P(lizard | the, other, day, I, was, walking, along, and, saw, a) = P(lizard | saw, a)

## Markov Assumption (continued)

So, for each component in the product, we replace it with the approximation (assuming a prefix of N):

P(w_n | w_1^{n-1}) ≈ P(w_n | w_{n-N+1}^{n-1})

Bigram version:

P(w_n | w_1^{n-1}) ≈ P(w_n | w_{n-1})
## Estimating bigram probabilities

The Maximum Likelihood Estimate:

P(w_i | w_{i-1}) = count(w_{i-1}, w_i) / count(w_{i-1}) = c(w_{i-1}, w_i) / c(w_{i-1})

## An example

<s> I am Sam </s>
<s> Sam I am </s>
<s> I do not like green eggs and ham </s>

This is the Maximum Likelihood Estimate, because it is the one which maximizes P(Training set | Model).
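The bigram MLE can be run directly on these three training sentences. This is a minimal sketch (the helper names are ours, not from the slides); it reproduces the classic values such as P(I | <s>) = 2/3 and P(Sam | <s>) = 1/3:

```python
from collections import Counter

# The three training sentences from the slide.
sentences = [
    "<s> I am Sam </s>",
    "<s> Sam I am </s>",
    "<s> I do not like green eggs and ham </s>",
]

bigram_counts = Counter()
history_counts = Counter()
for s in sentences:
    toks = s.split()
    # Every token except the last serves as a bigram history.
    history_counts.update(toks[:-1])
    bigram_counts.update(zip(toks, toks[1:]))

def p_mle(w, prev):
    # Maximum Likelihood Estimate: P(w | prev) = c(prev, w) / c(prev).
    return bigram_counts[(prev, w)] / history_counts[prev]

# e.g. p_mle("I", "<s>") == 2/3 and p_mle("Sam", "<s>") == 1/3
```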