together Chinese food! food I want to eat Chinese food! </s>!

Dan Jurafsky

Approximating Shakespeare

Shakespeare as corpus
•  N = 884,647 tokens, V = 29,066
•  Shakespeare produced 300,000 bigram types out of V^2 = 844 million possible bigrams.
•  So 99.96% of the possible bigrams were never seen (have zero entries in the table)
•  Quadrigrams are worse: what's coming out looks like Shakespeare because it is Shakespeare

The Wall Street Journal is not Shakespeare (no offense)

The perils of overfitting
•  N-grams only work well for word prediction if the test corpus looks like the training corpus
•  In real life, it often doesn't
•  We need to train robust models that generalize!
•  One kind of generalization: Zeros!
•  Things that don't ever occur in the training set
•  But occur in the test set

Zeros
•  Training set:
   … denied the allegations
   … denied the reports
   … denied the claims
   … denied the request
•  Test set:
   … denied the offer
   … denied the loan
P("offer" | denied the) = 0

Zero probability bigrams
•  Bigrams with zero probability
•  mean that we will assign 0 probability to the test set!
•  And hence we cannot compute perplexity (can't divide by 0)!

Language Modeling: Generalization and zeros

Language Modeling: Smoothing: Add-one (Laplace) smoothing

The intuition of smoothing (from Dan Klein)
•  Steal probability mass to generalize better
P(w | denied the)
   2.5 allegations
   1.5 reports
   0.5 claims
   0.5 request
   2 other
   7 total
[Figure: histograms of P(w | denied the) over allegations, reports, claims, request, attack, man, outcome, …]
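
The zero-probability problem and the discounted counts for P(w | denied the) above can be illustrated with a small Python sketch. This is only an illustration under stated assumptions, not the method used in the slides: the raw counts (3, 2, 1, 1) are not given in the text and are assumed here by adding 0.5 back to each discounted count so that the total stays 7, and the helper names `mle_prob` and `perplexity` are hypothetical.

```python
import math

# ASSUMPTION: raw training counts for words following "denied the".
# The text only lists the discounted counts; these raw values are inferred
# by adding 0.5 back to each one so that the total is still 7.
raw_counts = {"allegations": 3, "reports": 2, "claims": 1, "request": 1}

# Discounted counts as listed under "The intuition of smoothing".
discounted_counts = {"allegations": 2.5, "reports": 1.5, "claims": 0.5, "request": 0.5}
other_count = 2.0  # count mass set aside for words never seen after "denied the"


def mle_prob(word, counts):
    """Maximum-likelihood estimate: count(word) / total count; 0 if unseen."""
    total = sum(counts.values())
    return counts.get(word, 0) / total


def perplexity(probs):
    """exp of the negative mean log probability; infinite if any probability is 0."""
    if any(p == 0 for p in probs):
        return math.inf
    return math.exp(-sum(math.log(p) for p in probs) / len(probs))


# Test-set continuations of "denied the" from the Zeros slide.
test_words = ["offer", "loan"]

# Without smoothing, every unseen test word gets probability 0,
# so the perplexity of the test set is unbounded.
mle_test_probs = [mle_prob(w, raw_counts) for w in test_words]
mle_perplexity = perplexity(mle_test_probs)

# With the discounted counts, part of the total is reserved for unseen words.
total = sum(discounted_counts.values()) + other_count
unseen_mass = other_count / total
smoothed_allegations = discounted_counts["allegations"] / total

print("MLE P(offer | denied the):", mle_test_probs[0])
print("MLE perplexity on test words:", mle_perplexity)
print("Probability mass reserved for unseen words:", unseen_mass)
print("Smoothed P(allegations | denied the):", smoothed_allegations)
```

The sketch stops at the reserved mass on purpose: how the "other" mass is divided among individual unseen words such as "offer" and "loan" depends on the smoothing method, which the text has not yet specified at this point.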
