jurafsky&martin_3rdEd_17 (1).pdf


It’s worth taking a moment to envision how the network computes the same probability as the dot product version we described above. In the network of Fig. 16.5, we begin with an input vector $x$, which is a one-hot vector for the current word $w_j$. A one-hot vector is just a vector that has one element equal to 1 and all other elements set to zero. Thus in a one-hot representation for the word $w_j$, $x_j = 1$ and $x_i = 0 \;\; \forall i \neq j$, as shown in Fig. 16.6.
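As a small illustration of the definition above, the following sketch builds a one-hot vector with NumPy. The helper name `one_hot` and the toy vocabulary are assumptions made for this example and do not come from the text.

```python
import numpy as np

def one_hot(j: int, vocab_size: int) -> np.ndarray:
    """Return a length-|V| vector with a 1 at index j and 0 everywhere else."""
    x = np.zeros(vocab_size)
    x[j] = 1.0
    return x

# Hypothetical toy vocabulary; the words are illustrative only.
vocab = ["apricot", "jam", "pinch", "tablespoon", "the"]
j = vocab.index("jam")
x = one_hot(j, len(vocab))
```

Here `x` has exactly one nonzero entry, at the position of the current word, which is what lets a matrix product with `x` act as a lookup.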

Chapter 16: Semantics with Dense Vectors

Figure 16.6 A one-hot vector, with the dimension corresponding to word $w_j$ set to 1.

We then predict the probability of each of the $2L$ output words (in Fig. 16.5 that means the one output word $w_{t+1}$) in 3 steps:

1. Select the embedding from $W$: $x$ is multiplied by $W$, the input matrix, to give the hidden or projection layer. Since each row of the input matrix $W$ is just an embedding for a word $w_t$, and the input is a one-hot column vector for $w_j$, the projection layer for input $x$ will be $h = W w_j = v_j$, the input embedding for $w_j$.
2. Compute the dot product $c_k \cdot v_j$: For each of the $2L$ context words we now multiply the projection vector $h$ by the context matrix $C$. The result for each context word, $o = Ch$, is a $1 \times |V|$ dimensional output vector giving a score for each of the $|V|$ vocabulary words. In doing so, the element $o_k$ was computed by multiplying $h$ by the output embedding for word $w_k$: $o_k = c_k \cdot h = c_k \cdot v_j$.
3. Normalize the dot products into probabilities: For each context word we normalize this vector of dot product scores, turning each score element $o_k$ into a probability by using the soft-max function:

$$p(w_k \mid w_j) = y_k = \frac{\exp(c_k \cdot v_j)}{\sum_{i \in |V|} \exp(c_i \cdot v_j)}$$ (16.3)

16.2.2 Relationship between different kinds of embeddings

There is an interesting relationship between skip-grams, SVD/LSA, and PPMI. If we multiply the two embedding matrices, $WC$, we produce a $|V| \times |V|$ matrix $X$, each entry $x_{ij}$ corresponding to some association between input word $i$ and context word $j$.
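To make the three prediction steps and the product matrix $X$ concrete, here is a minimal NumPy sketch. It is not the authors' implementation. The sizes `V` and `d`, the random initial values, and the function name `skipgram_probs` are assumptions for illustration. The sketch also fixes one storage convention: both `W` and `C` hold one embedding per row, so the selection in step 1 is written `x @ W` and the product the text writes as $WC$ becomes `W @ C.T`.

```python
import numpy as np

rng = np.random.default_rng(0)
V, d = 6, 3                    # assumed toy vocabulary size and embedding dimension
W = rng.normal(size=(V, d))    # assumed layout: row i is the input embedding v_i
C = rng.normal(size=(V, d))    # assumed layout: row k is the context embedding c_k

def skipgram_probs(j: int) -> np.ndarray:
    """Return p(w_k | w_j) for every vocabulary word k, following Eq. 16.3."""
    x = np.zeros(V)
    x[j] = 1.0                 # one-hot input for word w_j
    h = x @ W                  # step 1: the one-hot product selects row j, so h = v_j
    o = C @ h                  # step 2: o_k = c_k . v_j for every k
    e = np.exp(o - o.max())    # step 3: soft-max (max subtracted for numerical stability)
    return e / e.sum()

j = 2
y = skipgram_probs(j)

# Product of the two embedding matrices: X[i, k] = v_i . c_k
X = W @ C.T
```

Under these assumptions, row `j` of `X` is exactly the score vector `o` computed for input word `j`, so applying the soft-max to `X[j]` reproduces `y`.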
Levy and Goldberg (2014b) prove that skip-gram’s optimal value occurs when this learned matrix is actually a version of the PMI matrix, with the values shifted by $\log k$ (where $k$ is the number of negative samples in the skip-gram with negative sampling algorithm):

$$WC = X^{\mathrm{PMI}} - \log k$$ (16.4)

In other words, skip-gram is implicitly factorizing a (shifted version of the) PMI matrix into the two embedding matrices $W$ and $C$, just as SVD did, albeit with a different kind of factorization. See Levy and Goldberg (2014b) for more details.

Once the embeddings are learned, we’ll have two embeddings for each word $w_i$: $v_i$ and $c_i$. We can choose to throw away the $C$ matrix and just keep $W$, as we did with SVD, in which case each word $i$ will be represented by the vector $v_i$.
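The right-hand side of Eq. 16.4 can be computed directly from co-occurrence counts. The sketch below does so for a tiny matrix; the counts, the value of `k`, and the use of the natural logarithm are all assumptions chosen for illustration, and every count is nonzero so that no logarithm of zero arises.

```python
import numpy as np

# Hypothetical word-by-context co-occurrence counts (made up for illustration).
counts = np.array([[4.0, 1.0, 1.0],
                   [1.0, 6.0, 2.0],
                   [1.0, 2.0, 8.0]])
k = 5  # assumed number of negative samples

p_wc = counts / counts.sum()              # joint probabilities p(w, c)
p_w = p_wc.sum(axis=1, keepdims=True)     # word marginals p(w), shape (3, 1)
p_c = p_wc.sum(axis=0, keepdims=True)     # context marginals p(c), shape (1, 3)

pmi = np.log(p_wc / (p_w * p_c))          # PMI matrix (natural log assumed)
shifted_pmi = pmi - np.log(k)             # the matrix on the right of Eq. 16.4
```

In this sketch `shifted_pmi` plays the role of the target that the product of the learned embedding matrices is said to match at the optimum; the sketch does not train any embeddings itself.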