jurafsky&martin_3rdEd_17 (1).pdf


[Figure 8.13: Learning all the way back to embeddings. The input layer holds one-hot vectors of length $|V|$ for the context words $w_{t-3}$, $w_{t-2}$, $w_{t-1}$ (word indices 35, 9925, and 45180 in the example); each is mapped through the embedding matrix E to the projection layer ($1 \times 3d$), then to the hidden layer ($1 \times d_h$), then to the output layer ($1 \times |V|$), where node 42 estimates $P(w_t = V_{42} \mid w_{t-3}, w_{t-2}, w_{t-1})$. Notice that the embedding matrix E is shared among the 3 context words.]

Fig. 8.13 shows the additional layers needed to learn the embeddings during LM training. Here the N = 3 context words are represented as 3 one-hot vectors, fully connected to the embedding layer via 3 instantiations of the embedding matrix E. Note that we don’t want to learn separate weight matrices for mapping each of the 3 previous words to the projection layer; we want one single embedding dictionary E that’s shared among these three. That’s because over time, many different words will appear as $w_{t-2}$ or $w_{t-1}$, and we’d like to represent each word with just one vector, whichever context position it appears in. The embedding weight matrix E thus has a row for each word, each a vector of d dimensions, and hence has dimensionality $|V| \times d$.

Let’s walk through the forward pass of Fig. 8.13.

1. Select three embeddings from E: Given the three previous words, we look up their indices, create 3 one-hot vectors, and then multiply each by the embedding matrix E. Consider $w_{t-3}$. The one-hot vector for ‘the’ (index 35) is multiplied by the embedding matrix E to give the first part of the first hidden layer, called the projection layer. Since each row of the input matrix E is just an embedding for a word, and the input is a one-hot column vector $x_i$ for word $V_i$, the projection layer for input w will be $Ex_i = e_i$, the embedding for word i. We now concatenate the three embeddings for the context words.

2. Multiply by W: We now multiply by W (and add b) and pass the result through the rectified linear (or other) activation function to get the hidden layer h.

3. Multiply by U: h is now multiplied by U.

4. Apply softmax: After the softmax, each node i in the output layer estimates the probability $P(w_t = i \mid w_{t-1}, w_{t-2}, w_{t-3})$.

In summary, if we use e to represent the projection layer, formed by concatenating the 3 embeddings for the three context words, the equations for a neural language model become:

$$e = (Ex_1, Ex_2, \ldots, Ex_N) \tag{8.24}$$
$$h = \sigma(We + b) \tag{8.25}$$
$$z = Uh \tag{8.26}$$
$$y = \mathrm{softmax}(z) \tag{8.27}$$

8.5.2 Training the neural language model

To train the model, i.e., to set all the parameters $\theta = E, W, U, b$, we use the SGD algorithm of Fig. 8.11, with error backpropagation to compute the gradient. Training thus not only sets the weights W and U of the network; because we’re predicting upcoming words, it also learns the embedding in E for each word that best predicts upcoming words.
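The forward pass of Eqs. 8.24 to 8.27 and a single SGD update can be sketched in plain Python as follows. This is a minimal illustration under stated assumptions, not the book's implementation: the toy sizes, the random initialization, the learning rate, the cross-entropy loss, and the choice of tanh as the "other" activation are all assumptions, and the helper names (`rand_matrix`, `embed_via_one_hot`, `forward`, `sgd_step`) are hypothetical. Because E is stored here with one row per word, the one-hot product is computed as $x_i^\top E$, which simply selects row i.

```python
import math
import random

random.seed(0)

# Toy sizes (assumed): vocabulary, embedding dim, hidden dim, context length.
V, d, d_h, N = 6, 2, 3, 3

def rand_matrix(rows, cols):
    return [[random.uniform(-0.5, 0.5) for _ in range(cols)] for _ in range(rows)]

E = rand_matrix(V, d)        # one embedding row per word, shared by all positions
W = rand_matrix(d_h, N * d)  # projection layer -> hidden layer
b = [0.0] * d_h
U = rand_matrix(V, d_h)      # hidden layer -> output scores

def matvec(M, v):
    return [sum(m * x for m, x in zip(row, v)) for row in M]

def softmax(z):
    m = max(z)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in z]
    total = sum(exps)
    return [x / total for x in exps]

def embed_via_one_hot(i):
    # Multiplying a one-hot vector by E picks out row i of E.
    x = [1.0 if k == i else 0.0 for k in range(V)]
    return [sum(x[k] * E[k][c] for k in range(V)) for c in range(d)]

def forward(context):
    e = [v for i in context for v in E[i]]        # (8.24) concatenated lookups
    a = [p + q for p, q in zip(matvec(W, e), b)]  # pre-activation We + b
    h = [math.tanh(v) for v in a]                 # (8.25), tanh assumed
    y = softmax(matvec(U, h))                     # (8.26) and (8.27)
    return e, a, h, y

def sgd_step(context, target, lr=0.1):
    e, a, h, y = forward(context)
    loss = -math.log(y[target])  # cross-entropy loss (assumed)
    # Gradient of the loss with respect to z is y - one_hot(target).
    dz = [p - (1.0 if k == target else 0.0) for k, p in enumerate(y)]
    # Backpropagate through U, the tanh, and W before updating any weights.
    dh = [sum(U[k][j] * dz[k] for k in range(V)) for j in range(d_h)]
    da = [dh[j] * (1.0 - h[j] ** 2) for j in range(d_h)]
    de = [sum(W[j][c] * da[j] for j in range(d_h)) for c in range(N * d)]
    for k in range(V):
        for j in range(d_h):
            U[k][j] -= lr * dz[k] * h[j]
    for j in range(d_h):
        b[j] -= lr * da[j]
        for c in range(N * d):
            W[j][c] -= lr * da[j] * e[c]
    # Shared E: each context position updates the row of its own word.
    for pos, i in enumerate(context):
        for c in range(d):
            E[i][c] -= lr * de[pos * d + c]
    return loss

context, target = [1, 4, 2], 3  # hypothetical word indices
losses = [sgd_step(context, target) for _ in range(50)]
```

Repeating `sgd_step` on the same (context, target) pair should drive the loss down, and the updates reach the rows of E for the context words as well as W, U, and b, which is the sense in which training learns the embeddings along with the network weights.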