” is similar to the other variations)
o books becomes book

Stop-words should be removed
o A stop-word is a very common word in English (or whatever language is being parsed)
o Words such as the, and, of, and on are removed

Term Frequency
Use the word count (frequency) in the document instead of just a zero or one
Differentiates between how many times a word is used
How will these documents be represented with term frequency?

Normalized Term Frequency (TF)
Documents of various lengths
“hello” appears once in Doc-101
This document contains 200 words in total
“hello” appears ten times in Doc-102
This document contains 1,800,000 words in total
The term frequencies are normalized by dividing each by the total number of words in the document
Normalized term frequency of “hello” in Doc-101 is?
Normalized term frequency of “hello” in Doc-102 is?

TF-IDF
Inverse Document Frequency (IDF) of a term:
IDF(t) = log(total number of documents / number of documents containing t)
Note: Log base 10
The IDF of a term shows how significant that term is in the entire collection of documents
TF-IDF(t, d) = TF(t, d) × IDF(t)
Note: TF is the normalized term frequency

Table 1:
      Apple IBM Humour Hello
D101  1 0 1 1 0.005
D102  0 1 1 10 0.000005
D3    0 0 0
D4    20 0 0

Table 2:
      Apple IBM Humour Hello
D101  0.1 0.40 0.1 0.15 1 1 0.005
D102  0 1 1 10 0.000005
D3    0 0 0
D4    20 0 0

IDF(Apple) = 4 (2)
IDF(IBM) = 1.5 (1000 / 2 = 500)
D1:
1. TF(Apple) = 0.1
2. TF(IBM) = 0.1
e.g. If the new document contains IBM, the document has no distinguishing power.

TFIDF(t,d) reflects how important a word t is to a document d in a corpus.

TF-IDF - Example
Document D101 contains word “apple” 12 times
D101 contains 100 words in total
TF(“apple”, D101) = 12/100 = 0.12
We have 10,000,000 documents in total
300,000 documents contain word “apple”
IDF(“apple”) = log(10,000,000/300,000) = 1.52
TF-IDF(“apple”, D101) = 0.12×1.52 = 0.182

IDF Example: Jazz Musicians
15 prominent jazz musicians and excerpts of their biographies from Wikipedia
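The “hello” and “apple” numbers from the slides above can be checked with a few lines of Python. This is a minimal sketch, not part of the original slides: the function names `normalized_tf`, `idf`, and `tf_idf` are illustrative, and the log is base 10 as the slides note.

```python
import math


def normalized_tf(term_count, total_words):
    """Term count divided by the total number of words in the document."""
    return term_count / total_words


def idf(total_docs, docs_with_term):
    """Log base 10 of (total documents / documents containing the term)."""
    return math.log10(total_docs / docs_with_term)


def tf_idf(term_count, total_words, total_docs, docs_with_term):
    """Normalized TF multiplied by IDF."""
    return normalized_tf(term_count, total_words) * idf(total_docs, docs_with_term)


# "hello" in Doc-101 (1 of 200 words) and Doc-102 (10 of 1,800,000 words)
tf_hello_101 = normalized_tf(1, 200)
tf_hello_102 = normalized_tf(10, 1_800_000)

# "apple" in D101: 12 of 100 words; 300,000 of 10,000,000 documents contain it
tf_apple = normalized_tf(12, 100)
idf_apple = idf(10_000_000, 300_000)
tfidf_apple = tf_idf(12, 100, 10_000_000, 300_000)

print(tf_hello_101, tf_hello_102)
print(tf_apple, round(idf_apple, 2), round(tfidf_apple, 3))
```

Note that the slide rounds the IDF to 1.52 before multiplying, which gives 0.182; carrying the unrounded IDF through gives a value closer to 0.183. The difference is only a rounding artifact.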
Nearly 2,000 features after stemming and stop-word removal!
Consider the sample phrase:
Famous jazz saxophonist born in Kansas who played bebop and Latin
Our goal is to build a vector for this sentence (composed of the TF-IDF of its terms), and then see which of the 15 biographies are the most similar to this phrase

Basic stemming is applied
Stemming methods are not perfect, and can produce terms like “kansa” and “famou” from “Kansas” and “famous”
Stemming perfection usually isn’t important as long as it’s consistent among all the documents
Next, stop-words (in, and, who) are removed, and the words are normalized with respect to document length
These values can be used as the Term Frequency (TF) feature values
The full TF-IDF representation is obtained by multiplying each term’s TF value by its IDF value
This boosts words that are rare
“Jazz” and “play” are very frequent in this corpus of jazz musician biographies, so they get no boost from IDF!
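The preprocessing steps for the sample phrase (stemming, stop-word removal, length normalization) can be sketched as follows. This is an illustration under stated assumptions, not the method used to produce the slides: `crude_stem` is a hypothetical, deliberately naive stemmer that only strips a trailing “ed” or “s”, the stop-word list contains only the three words the slides name, and document length is counted after stop-word removal.

```python
from collections import Counter

# Assumption: only the stop-words named in the slides
STOP_WORDS = {"in", "and", "who"}


def crude_stem(word):
    """Hypothetical naive stemmer: strip a trailing 'ed' or 's' if at least 3 letters remain."""
    for suffix in ("ed", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def normalized_tf_vector(text):
    """Stem, drop stop-words, then divide each term count by the number of remaining terms."""
    stemmed = [crude_stem(w) for w in text.lower().split()]
    terms = [t for t in stemmed if t not in STOP_WORDS]
    counts = Counter(terms)
    total = len(terms)
    return {term: count / total for term, count in counts.items()}


query = "Famous jazz saxophonist born in Kansas who played bebop and Latin"
tf_vector = normalized_tf_vector(query)
for term, value in tf_vector.items():
    print(term, value)
```

With this crude stemmer, “Famous” and “Kansas” come out as “famou” and “kansa”, the same imperfect stems the slides mention. Multiplying each value by the term's IDF in the biography corpus would then give the TF-IDF vector; the corpus IDF values are not reproduced here.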
The terms with the highest TF-IDF values (“latin”, “famous”, “kansas”) are the rarest in this corpus, so they end up with the highest weights among the terms in the query
Similarity of each musician’s text to the following query:
Famous jazz saxophonist born in Kansas who played bebop and Latin

Beyond “Bag of Words”
N-gram Sequences
Topic Models
NLP
Example: Bush started the Iraq war and didn’t visit China.
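One reading of this example is that a bag of words discards word order, so it cannot tell which verb goes with which object. N-gram sequences keep some local order. The sketch below is illustrative only: the helper name `ngrams` and the simple whitespace tokenization are assumptions, not something the slides specify.

```python
def ngrams(tokens, n):
    """Return all runs of n adjacent tokens, in order."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


sentence = "Bush started the Iraq war and didn't visit China."
# Assumption: lowercase, drop the final period, split on whitespace
tokens = sentence.lower().rstrip(".").split()

bigrams = ngrams(tokens, 2)
for pair in bigrams:
    print(pair)
```

The bigram (“didn't”, “visit”) keeps the negation next to its verb, which single-word features lose.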
- Fall '19
- Data Mining, Data Opportunities