

**Figure 8.13** Learning all the way back to embeddings. Note that the embedding matrix E is shared among the 3 context words. [Figure not shown: three one-hot input vectors of dimension |V| (e.g., word indices 35, 9925, 45180), each multiplied by the shared |V| × d embedding matrix E, feeding the projection layer, a hidden layer of size d_h, and a |V|-dimensional softmax output estimating P(w_t = V_42 | w_{t-1}, w_{t-2}, w_{t-3}).]

Fig. 8.13 shows the additional layers needed to learn the embeddings during LM training. Here the N = 3 context words are represented as 3 one-hot vectors, fully connected to the embedding layer via 3 instantiations of the embedding matrix E. Note that we don't want to learn separate weight matrices for mapping each of the 3 previous words to the projection layer; we want one single embedding dictionary E that is shared among these three positions. That's because over time many different words will appear as w_{t-2} or w_{t-1}, and we'd like to represent each word with a single vector, whichever context position it appears in. The embedding matrix E thus has a row for each word, each a vector of d dimensions, and hence has dimensionality |V| × d.

Let's walk through the forward pass of Fig. 8.13.

1. **Select three embeddings from E:** Given the three previous words, we look up their indices, create 3 one-hot vectors, and then multiply each by the embedding matrix E. Consider w_{t-3}. The one-hot vector for 'the' (index 35) is multiplied by the embedding matrix E to give the first part of the first hidden layer, called the **projection layer**. Since each row of the matrix E is just an embedding for a word, and the input is a one-hot column vector x_i for word V_i, the projection layer for input w will be Ex_i = e_i, the embedding for word i. We now concatenate the three embeddings for the three context words.
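Step 1 can be sketched in NumPy. This is a toy illustration, not the book's code: the vocabulary size, the embedding dimension d, and the context indices are made-up values. It shows why the one-hot multiplication is just a row lookup (E is stored row-per-word here, so the one-hot product uses E's transpose):

```python
import numpy as np

rng = np.random.default_rng(0)
V, d = 50_000, 3                  # toy vocabulary size and embedding dimension
E = rng.normal(size=(V, d))       # embedding matrix: one d-dimensional row per word

def embed(word_index: int) -> np.ndarray:
    """Multiply a one-hot column vector by E -- equivalent to selecting a row of E."""
    x = np.zeros(V)
    x[word_index] = 1.0
    return E.T @ x                # identical to E[word_index], just more expensive

# The projection layer concatenates the embeddings of the 3 context words.
context = [35, 9925, 45180]       # made-up indices for w_{t-3}, w_{t-2}, w_{t-1}
e = np.concatenate([embed(i) for i in context])
```

Because the product only ever selects a row, real implementations skip the one-hot vectors entirely and index into E directly.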


2. **Multiply by W:** We now multiply by W (and add b) and pass the result through the rectified linear (or other) activation function to get the hidden layer h.

3. **Multiply by U:** h is now multiplied by U.

4. **Apply softmax:** After the softmax, each node i in the output layer estimates the probability P(w_t = i | w_{t-1}, w_{t-2}, w_{t-3}).

In summary, if we use e to represent the projection layer, formed by concatenating the 3 embeddings for the three context words, the equations for a neural language model become:

e = (Ex_1, Ex_2, Ex_3)   (8.24)
h = σ(We + b)            (8.25)
z = Uh                   (8.26)
y = softmax(z)           (8.27)

### 8.5.2 Training the neural language model

To train the model, i.e., to set all the parameters θ = E, W, U, b, we use the SGD algorithm of Fig. 8.11, with error backpropagation to compute the gradients. Training thus not only sets the weights W and U of the network; because we're predicting upcoming words, we're also learning the embedding in E for each word that best predicts upcoming words.
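Equations 8.24–8.27 can be run end-to-end as a minimal NumPy sketch. The sizes below are made-up toy values, the parameters are random stand-ins for what SGD would learn, and ReLU is used for the activation σ:

```python
import numpy as np

rng = np.random.default_rng(1)
V, d, dh = 1_000, 8, 16           # toy vocabulary, embedding, and hidden sizes
N = 3                             # number of context words

# Parameters theta = E, W, U, b (random here; training would set them by SGD)
E = rng.normal(0, 0.1, size=(V, d))
W = rng.normal(0, 0.1, size=(dh, N * d))
b = np.zeros(dh)
U = rng.normal(0, 0.1, size=(V, dh))

def forward(context: list) -> np.ndarray:
    e = np.concatenate([E[i] for i in context])  # (8.24) projection layer
    h = np.maximum(0.0, W @ e + b)               # (8.25) ReLU hidden layer
    z = U @ h                                    # (8.26) output scores
    z = z - z.max()                              # shift for numerical stability
    return np.exp(z) / np.exp(z).sum()           # (8.27) softmax

y = forward([35, 42, 7])   # P(w_t = i | w_{t-3}, w_{t-2}, w_{t-1}) for every word i
```

The output y is a distribution over the whole vocabulary: |V| non-negative entries summing to 1, of which the predicted next word is the argmax.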