t of things we've seen c times
•  Sam I am I am Sam I do not eat
   •  I 3, sam 2, am 2, do 1, not 1, eat 1
   •  $N_1 = 3$, $N_2 = 2$, $N_3 = 1$

Dan Jurafsky
Good-Turing smoothing intuition
•  You are fishing (a scenario from Josh Goodman), and caught:
   •  10 carp, 3 perch, 2 whitefish, 1 trout, 1 salmon, 1 eel = 18 fish
•  How likely is it that the next species is trout?
   •  1/18
•  How likely is it that the next species is new (i.e., catfish or bass)?
   •  Let's use our estimate of things-we-saw-once to estimate the new things.
   •  3/18 (because $N_1 = 3$)
•  Assuming so, how likely is it that the next species is trout?
   •  Must be less than 1/18
   •  How to estimate?

Good-Turing calculations
•  $P^*_{GT}(\text{things with zero frequency}) = \frac{N_1}{N}$
•  $c^* = \frac{(c+1)\,N_{c+1}}{N_c}$
•  Unseen (bass or catfish)
   •  $c = 0$
   •  MLE $p = 0/18 = 0$
   •  $P^*_{GT}(\text{unseen}) = N_1/N = 3/18$
•  Seen once (trout)
   •  $c = 1$
   •  MLE $p = 1/18$
   •  $c^*(\text{trout}) = 2 \cdot N_2/N_1 = 2 \cdot 1/3 = 2/3$
   •  $P^*_{GT}(\text{trout}) = \frac{2/3}{18} = 1/27$

Ney et al.'s Good-Turing intuition
H. Ney, U. Essen, and R. Kneser, 1995. On the estimation of 'small' probabilities by leaving-one-out. IEEE Trans. PAMI. 17:12, 1202-1212
[Figure: held-out words]
Intuition from leave-one-out validation
•  Take each of the c training words out in turn
   •  c training sets of size c–1, held-out of size 1
•  What fraction of held-out words are unseen in training?
   •  $N_1/c$
•  What fraction of held-out words are seen k times in training?
   •  $(k+1)N_{k+1}/c$
•  So...
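
The leave-one-out fractions can be checked by brute force on the "Sam I am" corpus from the earlier slide. This is a minimal sketch, not taken from the slides: the function name `leave_one_out_fractions` is hypothetical, and the corpus is lowercased so that "Sam" and "sam" count as one word, matching the counts shown above.

```python
from collections import Counter
from fractions import Fraction


def leave_one_out_fractions(tokens):
    """Map k -> fraction of held-out tokens seen k times in the remaining training data."""
    total = len(tokens)
    seen_k = Counter()
    for i, held_out in enumerate(tokens):
        # Training set of size c-1: every token except the held-out one.
        training = tokens[:i] + tokens[i + 1:]
        seen_k[training.count(held_out)] += 1
    return {k: Fraction(n, total) for k, n in seen_k.items()}


tokens = "sam i am i am sam i do not eat".split()
loo = leave_one_out_fractions(tokens)

# Frequency-of-frequency counts N_k from the full corpus.
n = Counter(Counter(tokens).values())
c = len(tokens)

for k in sorted(loo):
    predicted = Fraction((k + 1) * n[k + 1], c)
    print(k, loo[k], predicted)
```

For each k, the simulated fraction and the predicted value (k+1)N_{k+1}/c agree, with k = 0 giving N_1/c for the unseen case.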

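Returning to the fishing example, the Good-Turing calculations can be reproduced in a few lines. This is a sketch under stated assumptions: the helper name `good_turing` and the variable names are illustrative, and it applies the raw formula c* = (c+1)N_{c+1}/N_c with no further smoothing of the N_c values.

```python
from collections import Counter
from fractions import Fraction


def good_turing(counts):
    """Return (N, P*_GT(unseen), adjusted_count) for a dict of item -> count."""
    total = sum(counts.values())              # N: total observations
    freq_of_freq = Counter(counts.values())   # N_c: number of items seen exactly c times
    p_unseen = Fraction(freq_of_freq[1], total)

    def adjusted_count(c):
        # c* = (c + 1) * N_{c+1} / N_c; only defined when N_c > 0.
        return Fraction((c + 1) * freq_of_freq[c + 1], freq_of_freq[c])

    return total, p_unseen, adjusted_count


catch = {"carp": 10, "perch": 3, "whitefish": 2, "trout": 1, "salmon": 1, "eel": 1}
total, p_unseen, adjusted_count = good_turing(catch)

c_star_trout = adjusted_count(catch["trout"])
p_trout = c_star_trout / total

print(p_unseen, c_star_trout, p_trout)  # 1/6 2/3 1/27
```

Here 1/6 is the reduced form of 3/18. Note that the raw formula gives c* = 0 whenever N_{c+1} = 0 (for example, for perch with c = 3 in this catch), so it is only reliable for small counts.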