Language Modeling



When we have sparse statistics:
•  3 allegations
•  2 reports
•  1 claims
•  1 request
•  7 total

Dan Jurafsky

Add-one estimation
•  Also called Laplace smoothing
•  Pretend we saw each word one more time than we did
•  Just add one to all the counts!
•  MLE estimate:
   $P_{MLE}(w_i \mid w_{i-1}) = \frac{c(w_{i-1}, w_i)}{c(w_{i-1})}$
•  Add-1 estimate:
   $P_{Add-1}(w_i \mid w_{i-1}) = \frac{c(w_{i-1}, w_i) + 1}{c(w_{i-1}) + V}$

Maximum Likelihood Estimates
•  The maximum likelihood estimate
   •  of some parameter of a model M from a training set T
   •  maximizes the likelihood of the training set T given the model M
•  Suppose the word “bagel” occurs 400 times in a corpus of a million words
•  What is the probability that a random word from some other text will be “bagel”?
•  MLE estimate is 400/1,000,000 = .0004
•  This may be a bad estimate for some other corpus
   •  But it is the estimate that makes it most likely that “bagel” will occur 400 times in a million-word corpus.

Berkeley Restaurant Corpus: Laplace-smoothed bigram counts

Laplace-smoothed bigrams

Reconstituted counts

Compare with raw bigram counts

Add-1 estimation is a blunt instrument
•  So add-1 isn’t used for N-grams:
   •  We’ll see better methods
•  But add-1 is used to smooth other NLP models
   •  For text classification
   •  In domains where the number of zeros isn’t so huge.

Language Modeling
Smoothing: Add-one (Laplace) smoothing

Language Modeling
Interpolation, Backoff, and Web-Scale LMs

Backoff and Interpolation
•  Sometimes it helps to use less context
   •  Condition on less context for contexts you haven’t learned much about
•  Backoff:
   •  use trigram if you have goo...
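
The MLE and add-1 bigram estimates from the add-one estimation slide can be sketched in a few lines of Python. This is a minimal illustration, not code from the lecture: the toy corpus and the function names (train_bigram_counts, p_mle, p_add1) are hypothetical, V is taken to be the number of distinct word types in the training data, and c(w_{i-1}) is counted as the number of bigrams that start with w_{i-1}, so that each conditional distribution sums to one.

```python
from collections import Counter


def train_bigram_counts(sentences):
    """Count bigrams and their left contexts over tokenized sentences."""
    context_counts = Counter()  # c(w_{i-1}): times a word starts a bigram
    bigram_counts = Counter()   # c(w_{i-1}, w_i)
    vocab = set()
    for tokens in sentences:
        vocab.update(tokens)
        for prev, word in zip(tokens, tokens[1:]):
            context_counts[prev] += 1
            bigram_counts[(prev, word)] += 1
    return context_counts, bigram_counts, vocab


def p_mle(prev, word, context_counts, bigram_counts):
    """Unsmoothed estimate: c(prev, word) / c(prev)."""
    if context_counts[prev] == 0:
        return 0.0  # assumption: unseen contexts get probability 0 here
    return bigram_counts[(prev, word)] / context_counts[prev]


def p_add1(prev, word, context_counts, bigram_counts, vocab_size):
    """Add-1 (Laplace) estimate: (c(prev, word) + 1) / (c(prev) + V)."""
    return (bigram_counts[(prev, word)] + 1) / (context_counts[prev] + vocab_size)


# Hypothetical toy corpus, for illustration only.
sentences = [
    ["i", "want", "chinese", "food"],
    ["i", "want", "to", "eat"],
    ["i", "eat", "food"],
]
context_counts, bigram_counts, vocab = train_bigram_counts(sentences)
V = len(vocab)

seen_mle = p_mle("i", "want", context_counts, bigram_counts)
seen_add1 = p_add1("i", "want", context_counts, bigram_counts, V)
unseen_mle = p_mle("i", "food", context_counts, bigram_counts)
unseen_add1 = p_add1("i", "food", context_counts, bigram_counts, V)

# After smoothing, the probabilities for a fixed context still sum to 1.
total_add1 = sum(p_add1("i", w, context_counts, bigram_counts, V) for w in vocab)

print(seen_mle, seen_add1, unseen_mle, unseen_add1, total_add1)
```

In this sketch the unseen bigram ("i", "food") moves from zero probability to a small positive one, while the seen bigram ("i", "want") loses probability mass. That shift of mass away from observed events is one way to see why add-1 is described as a blunt instrument.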