Lecture13-goodturing

Lecture13-goodturing - 9/28/2011 This work is licensed...

9/28/2011

CS 479, section 1: Natural Language Processing
Lecture #13: Good Turing Smoothing

Thanks to Dan Klein of UC Berkeley for some of the slides and to Robbie Haertel of BYU for many of the slides used in this lecture.

This work is licensed under a Creative Commons Attribution-Share Alike 3.0 Unported License.

Announcements
- Project #1, Part 1
  - Early: today
  - Due: Friday
  - Questions?
- Recommended reading: Gale on “Simple Good Turing”
  - No reading report
  - Very helpful for Project #1, Part 2

Objectives
- Go beyond Held out Estimation!
- Develop Good Turing smoothing
- Prepare to use these techniques in Project #1, Part 2

Questions
- How many species are in the ocean?
- How many words did Shakespeare know?
- How many unseen word types are there?

Context

Held Out Estimation
From real bigram counts in 22M words of AP newswire data [Church and Gale 91]:

  r (MLE)              r* Held-out (next 22M words)
  1                    0.448
  2                    1.25
  3                    2.24
  4                    3.23
  5                    4.21
  Mass on new: 0%      9.2%
  Ratio r*2/r*1: 2     2.8

- Why do we care about r?
- Why is r* lower on average?
- Why do we care about ratio of r2/r1? (Zipf’s law)
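As a rough illustration of how the held-out column above can be obtained, the Python sketch below computes r* as the average held-out count of the types that occurred r times in training, along with the share of held-out tokens whose type never appeared in training. The function names and the toy token lists are hypothetical (they are not from the lecture or from Church and Gale), and the sketch assumes the training and held-out sets are the same size, as with the two 22M-word halves, so that r and r* are directly comparable.

```python
from collections import Counter


def held_out_adjusted_counts(train_tokens, heldout_tokens):
    """Return {r: r*}, the mean held-out count of types seen r times in training."""
    train_counts = Counter(train_tokens)
    heldout_counts = Counter(heldout_tokens)

    n_r = Counter()  # N_r: number of types whose training count is r
    t_r = Counter()  # T_r: total held-out count of those same types
    for word, r in train_counts.items():
        n_r[r] += 1
        t_r[r] += heldout_counts[word]  # a Counter yields 0 for unseen keys

    return {r: t_r[r] / n_r[r] for r in n_r}


def mass_on_new(train_tokens, heldout_tokens):
    """Fraction of held-out tokens whose type never occurred in training."""
    seen = set(train_tokens)
    new_tokens = sum(1 for tok in heldout_tokens if tok not in seen)
    return new_tokens / len(heldout_tokens)


# Hypothetical toy data: two equal-sized "halves" of a corpus.
train = "a a a b b c c d e f".split()
heldout = "a a b c c c d g g h".split()

adjusted = held_out_adjusted_counts(train, heldout)
new_mass = mass_on_new(train, heldout)

for r in sorted(adjusted):
    print(r, round(adjusted[r], 3))
print("mass on new:", new_mass)
```

In this toy data the three types seen once in training occur only once in total in the held-out half, so r* for r = 1 falls well below 1, the same direction of effect as in the table.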
Held Out Estimation
From real bigram counts in 22M words of AP newswire data [Church and Gale 91]:

  r (MLE)             r* Held-out (next 22M words)   r* Add-one   r* Add-delta (delta=0.000027)
  1                   0.448                          0.000274     ~1
  2                   1.25                           0.000411     ~2
  3                   2.24                           0.000548     ~3
  4                   3.23                           0.000685     ~4
  5                   4.21                           0.000822     ~5
  Mass on new: 0%     9.2%                           46.5%        9.2%
  Ratio r*2/r*1: 2    2.8                            1.5          ~2

Big things to notice:
- Add one vastly overestimates the fraction of new bigrams
- Add 0.000027 still underestimates the ratio r*2/r*1

Can we get the properties of held out estimation with less data?

Activity
Given a large data set,
1. Compute the frequencies (done)
2. Compute the frequencies of frequencies

Activity

  k      N_k
  1
  2
  3
  4
  5
  6
  8
  11
  15
  22
  39
  91
  47695

  Word type     Freq
  the           47695
  creditors     91
  status        39
  bridges       22
  constantly    15
  repaid        11
  capability    8
  titled        6
  unraveled     5
  oaks          4
  riots         3
  malaysian     3
  gibbons       2
  aspin         2
  chat          2
  allusions     2
  holtzman      1
  ahlerich      1
  hayasaka      1
  breger        1
  tallies       1
  tryon         1
  claptrap      1
  fanned        1
  confession    1
  testifies     1

Answers

  k      N_k
  1      10
  2      4
  3      2
  4      1
  5      1
  6      1
  8      1
  11     1
  15     1
  22     1
  39     1
  91     1
  47695  1

Activity (cont.)
3. Remove the occurrences of “oaks” one by one. How many does this leave in training each time?
4. Remove the occurrences of “malaysian” and “riots” one by one. How many (total) “malaysians” and “riots” are left each time?
5. “gibbons”, “aspin”, “chats”, “allusions”?
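The activity's steps can be sketched in Python. The block below recomputes the frequencies of frequencies (N_k) from the word list above and then mimics steps 3 to 5 by holding out one token at a time. The helper names are hypothetical, and the leave-one-out helpers reflect one reading of what the activity asks; they are a sketch, not the lecture's own solution.

```python
from collections import Counter

# Word frequencies copied from the activity's word list.
word_freq = {
    "the": 47695, "creditors": 91, "status": 39, "bridges": 22,
    "constantly": 15, "repaid": 11, "capability": 8, "titled": 6,
    "unraveled": 5, "oaks": 4, "riots": 3, "malaysian": 3,
    "gibbons": 2, "aspin": 2, "chat": 2, "allusions": 2,
    "holtzman": 1, "ahlerich": 1, "hayasaka": 1, "breger": 1,
    "tallies": 1, "tryon": 1, "claptrap": 1, "fanned": 1,
    "confession": 1, "testifies": 1,
}

# Step 2: N_k is the number of word types that occur exactly k times.
freq_of_freq = Counter(word_freq.values())


def left_in_training(word):
    """Count of `word` that remains in training while one of its tokens is held out."""
    return word_freq[word] - 1


def trials_with_remaining_count(k):
    """Number of leave-one-out trials in which the held-out word has k copies left.

    Such trials come from types whose full count is k + 1, and each of
    those types contributes k + 1 trials (one per token removed).
    """
    return (k + 1) * freq_of_freq[k + 1]


for k in sorted(freq_of_freq):
    print(k, freq_of_freq[k])

print("oaks left in training:", left_in_training("oaks"))
print("trials leaving 2 copies:", trials_with_remaining_count(2))
```

Under this reading, each removal of an “oaks” token leaves 3 copies in training, and the two types with count 3 (“malaysian” and “riots”) together account for 6 trials in which 2 copies of the held-out word remain.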

This note was uploaded on 10/18/2011 for the course CS 479 taught by Professor Eric Ringger during the Fall '11 term at BYU.
