jurafsky&martin_3rdEd_17 (1).pdf

del[x, y]: count(xy typed as x)
ins[x, y]: count(x typed as xy)
sub[x, y]: count(x typed as y)
trans[x, y]: count(xy typed as yx)

Note that we've conditioned the insertion and deletion probabilities on the previous character; we could instead have chosen to condition on the following character.

Where do we get these confusion matrices? One way is to extract them from lists of misspellings like the following:

additional: addional, additonal
environments: enviornments, enviorments, enviroments
preceded: preceeded
...

There are lists available on Wikipedia and from Roger Mitton (http://www.dcs.bbk.ac.uk/~ROGER/corpora.html) and Peter Norvig (http://norvig.com/ngrams/). Norvig also gives the counts for each single-character edit that can be used to directly create the error model probabilities.

An alternative approach used by Kernighan et al. (1990) is to compute the matrices by iteratively using this very spelling error correction algorithm itself. The iterative algorithm first initializes the matrices with equal values; thus, any character is equally likely to be deleted, equally likely to be substituted for any other character, etc. Next, the spelling error correction algorithm is run on a set of spelling errors. Given the set of typos paired with their predicted corrections, the confusion matrices can now be recomputed, the spelling algorithm run again, and so on. This iterative algorithm is an instance of the important EM algorithm (Dempster et al., 1977), which we discuss in Chapter 9.

Once we have the confusion matrices, we can estimate P(x|w) as follows (where
w_i is the i-th character of the correct word w and x_i is the i-th character of the typo x):

$$
P(x|w) =
\begin{cases}
\dfrac{\mathrm{del}[x_{i-1}, w_i]}{\mathrm{count}[x_{i-1} w_i]}, & \text{if deletion} \\[2ex]
\dfrac{\mathrm{ins}[x_{i-1}, w_i]}{\mathrm{count}[w_{i-1}]}, & \text{if insertion} \\[2ex]
\dfrac{\mathrm{sub}[x_i, w_i]}{\mathrm{count}[w_i]}, & \text{if substitution} \\[2ex]
\dfrac{\mathrm{trans}[w_i, w_{i+1}]}{\mathrm{count}[w_i w_{i+1}]}, & \text{if transposition}
\end{cases}
\tag{5.6}
$$

Using the counts from Kernighan et al. (1990) results in the error model probabilities for acress shown in Fig. 5.4.

Candidate    Correct  Error
Correction   Letter   Letter  x|w    P(x|w)
actress      t        -       c|ct   .000117
cress        -        a       a|#    .00000144
caress       ca       ac      ac|ca  .00000164
access       c        r       r|c    .000000209
across       o        e       e|o    .0000093
acres        -        s       es|e   .0000321
acres        -        s       ss|s   .0000342

Figure 5.4 Channel model for acress; the probabilities are taken from the del[], ins[], sub[], and trans[] confusion matrices as shown in Kernighan et al. (1990).

Figure 5.5 shows the final probabilities for each of the potential corrections; the unigram prior is multiplied by the likelihood (computed with Eq. 5.6 and the confusion matrices). The final column shows the product, multiplied by 10^9 just for readability.

Candidate    Correct  Error
Correction   Letter   Letter  x|w    P(x|w)      P(w)        10^9 * P(x|w)P(w)
actress      t        -       c|ct   .000117     .0000231    2.7
cress        -        a       a|#    .00000144   .000000544  0.00078
caress       ca       ac      ac|ca  .00000164   .00000170   0.0028
access       c        r       r|c    .000000209  .0000916    0.019
across       o        e       e|o    .0000093    .000299     2.8
acres        -        s       es|e   .0000321    .0000318    1.0
acres        -        s       ss|s   .0000342    .0000318    1.0

Figure 5.5 Computation of the ranking for each candidate correction, using the language model shown earlier and the error model from Fig. 5.4. The final score is multiplied by 10^9 for readability.
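
The ranking in Fig. 5.5 can be reproduced in a few lines of Python. The following is a minimal sketch rather than code from the text: the names Candidate, channel_prob, and rank are assumptions made for illustration. The likelihoods and priors are copied from Fig. 5.5. channel_prob shows the shape shared by all four cases of Eq. 5.6 (a confusion-matrix cell divided by the count of its context); the zero-count guard is an added assumption.

```python
from typing import NamedTuple


class Candidate(NamedTuple):
    word: str          # candidate correction w
    edit: str          # the edit in "x|w" notation, as in Fig. 5.5
    likelihood: float  # channel model P(x|w)
    prior: float       # unigram language model P(w)


# Values copied from Fig. 5.5 for the typo "acress".
CANDIDATES = [
    Candidate("actress", "c|ct", 0.000117, 0.0000231),
    Candidate("cress", "a|#", 0.00000144, 0.000000544),
    Candidate("caress", "ac|ca", 0.00000164, 0.00000170),
    Candidate("access", "r|c", 0.000000209, 0.0000916),
    Candidate("across", "e|o", 0.0000093, 0.000299),
    Candidate("acres", "es|e", 0.0000321, 0.0000318),
    Candidate("acres", "ss|s", 0.0000342, 0.0000318),
]


def channel_prob(confusion_count: int, context_count: int) -> float:
    """One case of Eq. 5.6: confusion-matrix cell / count of its context.

    Returning 0.0 for an unseen context is an assumption of this sketch.
    """
    if context_count == 0:
        return 0.0
    return confusion_count / context_count


def rank(candidates):
    """Sort candidates by the noisy channel score P(x|w) * P(w), best first."""
    return sorted(candidates, key=lambda c: c.likelihood * c.prior, reverse=True)


ranked = rank(CANDIDATES)
for c in ranked:
    # Scale by 10^9 for readability, as in the last column of Fig. 5.5.
    print(f"{c.word:8s} {c.edit:6s} {1e9 * c.likelihood * c.prior:.5f}")
```

With these numbers, across ranks just ahead of actress, matching the 2.8 and 2.7 entries in the last column of Fig. 5.5. In practice the likelihood field would be filled by calling channel_prob on cells of the del[], ins[], sub[], and trans[] matrices instead of being typed in by hand.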