jurafsky&martin_3rdEd_17 (1).pdf

[Figure 16.4: the target embeddings $W$ and the context embeddings $C$; Similarity($j$, $k$) is computed from the target embedding for word $j$ and the context embedding for word $k$.]

Of course, the dot product $c_k \cdot v_j$ is not a probability; it is just a number ranging from $-\infty$ to $\infty$. We can use the softmax function from Chapter 7 to normalize the dot product into probabilities. Computing the denominator of this softmax requires computing the dot product of every other word $w$ in the vocabulary with the target word $w_j$:

$$p(w_k \mid w_j) = \frac{\exp(c_k \cdot v_j)}{\sum_{i \in |V|} \exp(c_i \cdot v_j)} \tag{16.1}$$

In summary, the skip-gram computes the probability $p(w_k \mid w_j)$ by taking the dot product between the word vector for $j$ ($v_j$) and the context vector for $k$ ($c_k$), and
turning this dot product $v_j \cdot c_k$ into a probability by passing it through a softmax function.

This version of the algorithm, however, has a problem: the time it takes to compute the denominator. For each word $w_t$, the denominator requires computing the dot product with all other words. As we’ll see in the next section, we generally solve this by using an approximation of the denominator.

CBOW The CBOW (continuous bag of words) model is roughly the mirror image of the skip-gram model. Like skip-grams, it is based on a predictive model, but this time predicting the current word $w_t$ from the context window of $2L$ words around it, e.g. for $L = 2$ the context is $[w_{t-2}, w_{t-1}, w_{t+1}, w_{t+2}]$.

While CBOW and skip-gram are similar algorithms and produce similar embeddings, they do have slightly different behavior, and often one of them will turn out to be the better choice for any particular task.

16.2.1 Learning the word and context embeddings

We already mentioned the intuition for learning the word embedding matrix $W$ and the context embedding matrix $C$: iteratively make the embeddings for a word more like the embeddings of its neighbors and less like the embeddings of other words.

In the version of the prediction algorithm suggested in the previous section, the probability of a word is computed by normalizing the dot-product between a word and each context word by the dot products for all words. This probability is optimized when a word’s vector is closest to the words that occur near it (the numerator), and further from every other word (the denominator). Such a version of the algorithm is very expensive; we need to compute a whole lot of dot products to make the denominator.

Instead, the most commonly used version of skip-gram, skip-gram with negative sampling, approximates this full denominator. This section offers a brief sketch of how this works.
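
As a rough illustration of Equation 16.1 and of why its denominator is costly, the sketch below computes $p(w_k \mid w_j)$ with a full softmax in plain Python. The function names `dot` and `skipgram_prob`, the one-row-per-word list layout of `W` and `C`, and the toy embedding values are all assumptions made for this example, not part of the text.

```python
import math


def dot(u, v):
    """Dot product of two equal-length vectors given as lists."""
    return sum(a * b for a, b in zip(u, v))


def skipgram_prob(k, j, W, C):
    """p(w_k | w_j) as in Eq. 16.1: a softmax over the scores c_i . v_j.

    W and C are assumed to be lists with one d-dimensional row per word.
    """
    v_j = W[j]
    # The denominator needs one dot product per vocabulary word.
    scores = [dot(c_i, v_j) for c_i in C]
    # Subtracting the max score leaves the ratio unchanged but avoids overflow.
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    return exps[k] / sum(exps)


# Toy, made-up embeddings: |V| = 3 words, d = 2 dimensions.
W = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
C = [[0.5, 0.5], [1.0, -1.0], [-1.0, 1.0]]

# Distribution over context words k for target word j = 0.
probs = [skipgram_prob(k, 0, W, C) for k in range(len(C))]
print(probs)
```

Each call touches every row of `C`, so the cost of a single probability grows linearly with the vocabulary size. That is the expense the negative-sampling approximation is meant to avoid.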
In the training phase, the algorithm walks through the corpus, at each target word choosing the surrounding context words as positive examples, and for each positive example also choosing $k$ noise samples or negative samples: non-neighbor words. The goal will be to move the embeddings toward the neighbor words and away from the noise words.
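
The example selection described above might be sketched as follows. This is a minimal illustration under stated assumptions, not a definitive implementation: the function name `training_examples`, the toy sentence, and the uniform choice of noise words from the words outside the window are all assumptions, since the passage does not say how noise words are drawn.

```python
import random


def training_examples(tokens, L=2, k=2, seed=0):
    """Return (target, word, label) triples: label 1 for a neighbor, 0 for noise.

    L is the window half-width and k the number of noise words drawn per
    positive example. Uniform noise sampling is an assumption of this sketch.
    """
    rng = random.Random(seed)
    vocab = sorted(set(tokens))
    examples = []
    for t, target in enumerate(tokens):
        lo, hi = max(0, t - L), min(len(tokens), t + L + 1)
        neighbors = [tokens[i] for i in range(lo, hi) if i != t]
        # Noise candidates: words that are neither the target nor in its window.
        candidates = [w for w in vocab if w != target and w not in neighbors]
        for ctx in neighbors:
            examples.append((target, ctx, 1))
            if candidates:
                for _ in range(k):
                    examples.append((target, rng.choice(candidates), 0))
    return examples


# Toy sentence chosen only for illustration.
tokens = "a tablespoon of apricot preserves or jam".split()
examples = training_examples(tokens, L=2, k=2)
for ex in examples[:6]:
    print(ex)
```

A training step would then adjust the embeddings so that the pairs labeled 1 score higher and the pairs labeled 0 score lower; that update is not shown here.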