
# 8.5 Neural Language Models


In the following examples we’ll use a 4-gram example, so we’ll show a net to estimate the probability P(w_t = i | w_{t−1}, w_{t−2}, w_{t−3}).

## 8.5.1 Embeddings

The insight of neural language models lies in how they represent the prior context. Each word is represented as a vector of real numbers of dimension d; d tends to lie between 50 and 500, depending on the system. These vectors are called **embeddings**, because we represent each word as being embedded in a vector space. By contrast, in many traditional NLP applications, a word is represented as a string of letters, or as an index in a vocabulary list.

Why represent a word as a vector of 50 numbers? Vectors turn out to be a powerful representation for words, because a distributed representation allows words that have similar meanings, or similar grammatical properties, to have similar vectors. As we’ll see in Chapter 15, the embeddings learned for words like “cat” and “dog”, words with similar meanings and parts of speech, will be similar vectors. That will allow us to generalize our language models in ways that weren’t possible with traditional N-gram models. For example, suppose we’ve seen this sentence in training:

I have to make sure when I get home to feed the cat.

and then in our test set we are trying to predict what comes after the prefix “I forgot when I got home to feed the”. A traditional N-gram model will predict “cat”. But suppose we’ve never seen the word “dog” after the words “feed the”; a traditional LM won’t expect “dog”. By representing words as vectors, however, and assuming the vector for “cat” is similar to the vector for “dog”, a neural LM, even if it has never seen “feed the dog”, will assign a reasonably high probability to “dog” as well as “cat”, merely because they have similar vectors. Representing words as embedding vectors is central to modern natural language processing, and is generally referred to as the vector space model of meaning.
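As a toy illustration of why similar vectors generalize, we can measure how close two embeddings are with cosine similarity. The 4-dimensional vectors below are invented for illustration, not learned embeddings (real systems use d between 50 and 500):

```python
import numpy as np

# Hypothetical toy embeddings (made-up values, not learned from data).
emb = {
    "cat": np.array([0.9, 0.8, 0.1, 0.2]),
    "dog": np.array([0.8, 0.9, 0.2, 0.1]),
    "the": np.array([0.1, 0.0, 0.9, 0.8]),
}

def cosine(u, v):
    """Cosine similarity: near 1.0 for vectors pointing the same way."""
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

sim_cat_dog = cosine(emb["cat"], emb["dog"])
sim_cat_the = cosine(emb["cat"], emb["the"])
# "cat" and "dog" point in similar directions, so sim_cat_dog is much
# larger than sim_cat_the; a neural LM exploits exactly this closeness
# to transfer probability mass from "feed the cat" to "feed the dog".
```

Any similarity measure would do for the sketch; cosine is the standard choice because it ignores vector length and compares only direction.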
We will go into lots of detail on the different kinds of embeddings in Chapter 15. Let’s set aside, just for a few pages, the question of how these embeddings are learned. Imagine that we had an embedding dictionary E that gives us, for each word in our vocabulary V, the vector for that word. Fig. 8.12 shows a sketch of this simplified feedforward neural network language model with N = 3; we have a moving window at time t with a one-hot vector representing each of the 3 previous words


(words w_{t−1}, w_{t−2}, and w_{t−3}). These 3 vectors are concatenated together to produce x, the input layer of a neural network whose output is a softmax with a probability distribution over words. Thus y_42, the value of output node 42, is the probability of the next word w_t being V_42, the vocabulary word with index 42.