jurafsky&martin_3rdEd_17 (1).pdf

# In other cases we have to deal with words we haven’t seen before


In other cases we have to deal with words we haven’t seen before, which we’ll call unknown words, or out of vocabulary (OOV) words. The percentage of OOV words that appear in the test set is called the OOV rate. An open vocabulary system is one in which we model these potential unknown words in the test set by adding a pseudo-word called <UNK>.

There are two common ways to train the probabilities of the unknown word model <UNK>. The first one is to turn the problem back into a closed vocabulary one by choosing a fixed vocabulary in advance:

1. Choose a vocabulary (word list) that is fixed in advance.
2. Convert in the training set any word that is not in this set (any OOV word) to the unknown word token <UNK> in a text normalization step.
3. Estimate the probabilities for <UNK> from its counts just like any other regular word in the training set.

The second alternative, in situations where we don’t have a prior vocabulary in advance, is to create such a vocabulary implicitly, replacing words in the training data by <UNK> based on their frequency. For example we can replace by <UNK> all words that occur fewer than n times in the training set, where n is some small number, or equivalently select a vocabulary size V in advance (say 50,000) and choose the top V words by frequency and replace the rest by <UNK>. In either case we then proceed to train the language model as before, treating <UNK> like a regular word.

The exact choice of <UNK> model does have an effect on metrics like perplexity. A language model can achieve low perplexity by choosing a small vocabulary and assigning the unknown word a high probability. For this reason, perplexities should only be compared across language models with the same vocabularies (Buck et al., 2014).
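The frequency-based way of creating a vocabulary described above can be sketched in a few lines of Python. This is a minimal illustration, not code from the text: the function names `build_vocab`, `replace_oov`, and `oov_rate` are hypothetical, the toy sentences are made up, and the OOV rate is computed here over test tokens (an assumption; it could also be measured over word types).

```python
from collections import Counter

UNK = "<UNK>"

def build_vocab(train_tokens, min_count=None, max_size=None):
    """Build a vocabulary implicitly from training-set frequencies.

    Hypothetical helper: keep words seen at least `min_count` times,
    or, if `min_count` is None, the `max_size` most frequent words.
    """
    counts = Counter(train_tokens)
    if min_count is not None:
        return {w for w, c in counts.items() if c >= min_count}
    # most_common(None) returns every word, so max_size=None keeps all.
    return {w for w, _ in counts.most_common(max_size)}

def replace_oov(tokens, vocab):
    """Text normalization step: map every out-of-vocabulary word to <UNK>."""
    return [w if w in vocab else UNK for w in tokens]

def oov_rate(test_tokens, vocab):
    """Fraction of test tokens not in the vocabulary (token-level, by assumption)."""
    if not test_tokens:
        return 0.0
    return sum(w not in vocab for w in test_tokens) / len(test_tokens)

# Toy data, invented for illustration.
train = "the cat sat on the mat the cat ran".split()
test = "the dog sat".split()

vocab = build_vocab(train, min_count=2)      # words seen at least twice
normalized_train = replace_oov(train, vocab)
unk_count = Counter(normalized_train)[UNK]   # <UNK> is now counted like any word
```

After normalization, `<UNK>` has its own count in the training data, so its probabilities can be estimated exactly as for any regular word.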
## 4.4 Smoothing

What do we do with words that are in our vocabulary (they are not unknown words) but appear in a test set in an unseen context (for example they appear after a word they never appeared after in training)? To keep a language model from assigning zero probability to these unseen events, we’ll have to shave off a bit of probability mass from some more frequent events and give it to the events we’ve never seen. This modification is called smoothing or discounting. In this section and the following ones we’ll introduce a variety of ways to do smoothing: add-1 smoothing, add-k smoothing, Stupid backoff, and Kneser-Ney smoothing.

### 4.4.1 Laplace Smoothing

The simplest way to do smoothing is to add one to all the bigram counts, before we normalize them into probabilities. All the counts that used to be zero will now have a count of 1, the counts of 1 will be 2, and so on. This algorithm is called Laplace smoothing. Laplace smoothing does not perform well enough to be used in modern N-gram models, but it usefully introduces many of the concepts that we see in other smoothing algorithms, gives a useful baseline, and is also a practical smoothing algorithm for other tasks like text classification (Chapter 6).
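As a rough sketch of the add-one idea for bigrams: if every one of the V possible bigram counts for a given context word is incremented by one, the normalizing denominator must grow by V, giving (C(prev, word) + 1) / (C(prev) + V). The Python below illustrates this under stated assumptions: the helper names `count_ngrams` and `laplace_bigram_prob` are hypothetical, the toy corpus is invented, and V is taken to be the number of distinct tokens including the sentence-boundary markers.

```python
from collections import Counter

def count_ngrams(sentences):
    """Count unigrams and bigrams, never pairing tokens across sentences."""
    unigrams, bigrams = Counter(), Counter()
    for sent in sentences:
        unigrams.update(sent)
        bigrams.update(zip(sent, sent[1:]))
    return unigrams, bigrams

def laplace_bigram_prob(prev, word, bigrams, unigrams, vocab_size):
    """Add-one estimate: (C(prev, word) + 1) / (C(prev) + V).

    A Counter returns 0 for unseen keys, so unseen bigrams get a
    small nonzero probability instead of zero.
    """
    return (bigrams[(prev, word)] + 1) / (unigrams[prev] + vocab_size)

# Toy corpus, invented for illustration.
sentences = [
    ["<s>", "i", "am", "sam", "</s>"],
    ["<s>", "sam", "i", "am", "</s>"],
]
unigrams, bigrams = count_ngrams(sentences)
V = len(unigrams)  # assumption: V counts every distinct token, markers included

p_seen = laplace_bigram_prob("i", "am", bigrams, unigrams, V)     # (2 + 1) / (2 + 5)
p_unseen = laplace_bigram_prob("i", "sam", bigrams, unigrams, V)  # (0 + 1) / (2 + 5)
total = sum(laplace_bigram_prob("i", w, bigrams, unigrams, V) for w in unigrams)
```

The unseen bigram "i sam" receives probability 1/7 rather than zero, and the probabilities over all words following "i" still sum to one; the mass given to unseen bigrams is taken from the seen ones, which is the "shaving off" that smoothing refers to.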