jurafsky&martin_3rdEd_17 (1).pdf

15.2.1 Alternatives to PPMI for measuring association

While PPMI is quite popular, it is by no means the only measure of association between two words (or between a word and some other feature). Other common measures of association come from information retrieval (tf-idf, Dice) or from hypothesis testing (the t-test, the likelihood-ratio test). In this section we briefly summarize one of each of these types of measures.

Let’s first consider the standard weighting scheme for term-document matrices in information retrieval, called tf-idf. Tf-idf (this is a hyphen, not a minus sign) is the product of two factors. The first is the term frequency (Luhn, 1957): simply the frequency of the word in the document, although we may also use functions of this frequency like the log frequency.

The second factor is used to give a higher weight to words that occur only in a few documents. Terms that are limited to a few documents are useful for discriminating those documents from the rest of the collection; terms that occur frequently across the entire collection aren’t as helpful. The inverse document frequency or IDF term weight (Sparck Jones, 1972) is one way of assigning higher weights to these more discriminative words. IDF is defined using the fraction $N / df_i$, where $N$ is the total number of documents in the collection, and $df_i$ is the number of documents in which term $i$ occurs. The fewer documents in which a term occurs, the higher this weight. The lowest value of this fraction, 1, is assigned to terms that occur in all the documents. Because of the large number of documents in many collections, this measure is usually squashed with a log function. The resulting definition for inverse document frequency (IDF) is thus

$$\mathrm{idf}_i = \log\left(\frac{N}{df_i}\right) \tag{15.10}$$

Combining term frequency with IDF results in a scheme known as tf-idf weighting of the value for word $i$ in document $j$, $w_{ij}$:

$$w_{ij} = \mathrm{tf}_{ij}\,\mathrm{idf}_i \tag{15.11}$$

Tf-idf thus prefers words that are frequent in the current document $j$ but rare overall in the collection.
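The two definitions above (15.10 and 15.11) can be illustrated with a minimal Python sketch. The function name `tf_idf`, the toy documents, the use of raw counts for term frequency, and the choice of the natural log are all assumptions made for illustration; the text does not fix a log base or a particular tf variant.

```python
import math
from collections import Counter


def tf_idf(docs):
    """Return one {term: weight} dict per document.

    Assumptions for this sketch: tf is the raw count of the term in the
    document, and idf_i = log(N / df_i) uses the natural log.
    """
    n_docs = len(docs)

    # df_i: number of documents in which term i occurs at least once.
    df = Counter()
    for doc in docs:
        df.update(set(doc))

    idf = {term: math.log(n_docs / count) for term, count in df.items()}

    # w_ij = tf_ij * idf_i for every term i that appears in document j.
    weights = []
    for doc in docs:
        tf = Counter(doc)
        weights.append({term: tf[term] * idf[term] for term in tf})
    return weights


# Toy collection (made up for illustration), already tokenized.
docs = [
    "the cat sat on the mat".split(),
    "the dog sat".split(),
    "the cat chased the dog".split(),
]
w = tf_idf(docs)
```

In this toy collection "the" occurs in every document, so its fraction N / df_i is 1, its idf is log 1 = 0, and its tf-idf weight is 0 no matter how often it appears. A word such as "mat", which occurs in only one document, gets the largest idf.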
The tf-idf weighting is by far the dominant way of weighting co-occurrence matrices in information retrieval, but also plays a role in many other aspects of natural language processing including summarization.

Tf-idf, however, is not generally used as a component in measures of word similarity; for that PPMI and significance-testing metrics like t-test and likelihood-ratio are more common. The t-test statistic, like PMI, can be used to measure how much more frequent the association is than chance. The t-test statistic computes the difference between observed and expected means, normalized by the standard error (the square root of the variance divided by the sample size). The higher the value of t, the greater the likelihood that we can reject the null hypothesis that the observed and expected means are the same.
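The excerpt does not give a formula for the t-test statistic, so the following is only a sketch under stated assumptions. It uses the standard one-sample t statistic, (x̄ − μ) / sqrt(s² / N). To apply it to word association, it treats each of the N positions in a corpus as a Bernoulli trial: the observed mean is the joint probability of the two words, the expected mean is the product of their individual probabilities (what independence would predict), and the variance is estimated as p(1 − p). The function names and the counts are hypothetical.

```python
import math


def t_statistic(sample_mean, expected_mean, variance, n):
    """One-sample t statistic: (x_bar - mu) / sqrt(s^2 / n)."""
    return (sample_mean - expected_mean) / math.sqrt(variance / n)


def t_association(count_ab, count_a, count_b, n):
    """t statistic for the association between words a and b.

    Assumptions for this sketch: the null hypothesis is independence,
    so the expected mean is P(a) * P(b); the variance is the Bernoulli
    estimate p * (1 - p) with p = P(a, b). Requires count_ab > 0,
    otherwise the variance estimate is zero.
    """
    p_ab = count_ab / n
    expected = (count_a / n) * (count_b / n)
    variance = p_ab * (1 - p_ab)
    return t_statistic(p_ab, expected, variance, n)


# Hypothetical counts, made up for illustration: a corpus of 10,000
# positions where a occurs 100 times, b 200 times, and the pair 20 times.
t = t_association(count_ab=20, count_a=100, count_b=200, n=10_000)
```

With these counts the pair occurs ten times more often than independence predicts, so t is well above zero. If the pair count matched the independence prediction (2 occurrences here), the statistic would be approximately zero.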