” is similar to the other variations)
o books becomes book
Stop-words should be removed
o A stop-word is a very common word in English (or whatever language is being parsed)
o Words such as the, and, of, and on are removed
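The two preprocessing steps above can be sketched in a few lines. This is only an illustration: the stop-word list is a tiny hypothetical one built from the words named above, and naive_stem is a made-up stand-in for a real stemmer that merely strips a trailing "s".

```python
# Hypothetical stop-word list; real lists are much longer.
STOP_WORDS = {"the", "and", "of", "on"}

def naive_stem(word):
    """Strip a trailing 's' as a crude stand-in for a real stemmer (assumption)."""
    if len(word) > 3 and word.endswith("s") and not word.endswith("ss"):
        return word[:-1]
    return word

def preprocess(text):
    """Lowercase, split on whitespace, drop stop-words, then stem what is left."""
    tokens = text.lower().split()
    return [naive_stem(t) for t in tokens if t not in STOP_WORDS]

result = preprocess("The books on the shelf and the history of jazz")
print(result)  # ['book', 'shelf', 'history', 'jazz']
```

Stop-words are dropped before stemming here so that the stop-word list can be written with ordinary, unstemmed words.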
Term Frequency
Use the word count (frequency) in the document instead of just a zero or one
Differentiates between how many times a word is used
How will these documents be represented with term frequency?
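A minimal sketch of the difference between a zero/one representation and term frequency. The vocabulary and the sample document are hypothetical and are not the documents shown on the slide.

```python
from collections import Counter

def term_frequency(tokens, vocabulary):
    """Return the raw count of each vocabulary term in the token list."""
    counts = Counter(tokens)
    return [counts[term] for term in vocabulary]

# Hypothetical vocabulary and document, for illustration only.
vocabulary = ["apple", "ibm", "hello"]
doc = "hello apple hello ibm hello".split()

tf = term_frequency(doc, vocabulary)
binary = [1 if count > 0 else 0 for count in tf]

print(binary)  # [1, 1, 1]
print(tf)      # [1, 1, 3]
```

The binary vector cannot tell that "hello" is used three times; the term-frequency vector can.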
Normalized Term Frequency (TF)
Documents of various lengths
“hello” appears once in Doc-101
This document contains 200 words in total
“hello” appears ten times in Doc-102
This document contains 1,800,000 words in total

The term frequencies are normalized by dividing each by the total number of words in the document
Normalized term frequency of “hello” in Doc-101 is?
Normalized term frequency of “hello” in Doc-102 is?
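The two questions above can be worked through with a short sketch. The function name normalized_tf is an assumption made for illustration; the counts and document lengths are the ones given for Doc-101 and Doc-102.

```python
def normalized_tf(term_count, total_words):
    """Divide a term's count by the total number of words in the document."""
    return term_count / total_words

tf_101 = normalized_tf(1, 200)          # "hello" once in 200 words
tf_102 = normalized_tf(10, 1_800_000)   # "hello" ten times in 1,800,000 words

print(tf_101)
print(f"{tf_102:.8f}")
```

Although "hello" appears ten times more often in Doc-102, its normalized term frequency is far smaller there because the document is so much longer.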
TF-IDF
Inverse Document Frequency (IDF) of a term
IDF(t) = log(total number of documents / number of documents containing t)
Note: Log base 10
The IDF of a term shows how significant that term is in the entire collection of documents
TF-IDF(t, d) = TF(t, d) × IDF(t)
Note: TF is the normalized term frequency
Table 1:
        Apple   IBM   Humour   Hello   Hello (normalized TF)
D101    1       0     1        1       0.005
D102    0       1     1        10      0.000005
D3      0       0     0
D4      20      0     0
Table 2:
Columns: Apple, IBM, Humour, Hello
D101: 0.1, 0.40, 0.1, 0.15, 1, 1, 0.005
D102: 0, 1, 1, 10, 0.000005
D3: 0, 0, 0
D4: 20, 0, 0
IDF(Apple) = 4 (2)
IDF(IBM) = 1.5 (1000 / 2 = 500)
D1:
1. TF(Apple) = 0.1
2. TF(IBM) = 0.1
e.g. If the new document contains IBM, the document has no distinguishing power.
TFIDF(t,d) reflects how important a word t is to a document d in a corpus.
TF-IDF - Example
Document D101 contains the word “apple” 12 times
D101 contains 100 words in total
TF(“apple”, D101) = 12/100 = 0.12
We have 10,000,000 documents in total
300,000 documents contain the word “apple”
IDF(“apple”) = log(10,000,000/300,000) = 1.52
TF-IDF(“apple”, D101) = 0.12×1.52 = 0.182
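The arithmetic in this example can be checked with a short sketch. The function names idf and tf_idf are assumptions made for illustration; the numbers are the ones from the example, and the logarithm is base 10 as noted earlier.

```python
import math

def idf(total_docs, docs_with_term):
    """Inverse document frequency using a base-10 logarithm."""
    return math.log10(total_docs / docs_with_term)

def tf_idf(term_count, doc_length, total_docs, docs_with_term):
    """Normalized term frequency multiplied by IDF."""
    return (term_count / doc_length) * idf(total_docs, docs_with_term)

idf_apple = idf(10_000_000, 300_000)
score = tf_idf(12, 100, 10_000_000, 300_000)

print(round(idf_apple, 2))  # 1.52
print(round(score, 3))      # 0.183
```

The sketch prints 0.183 rather than 0.182 because it multiplies by the unrounded IDF (about 1.5229); the example rounds the IDF to 1.52 first, giving 0.12 × 1.52 = 0.1824.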
IDF
Example: Jazz Musicians
15 prominent jazz musicians and excerpts of their biographies from Wikipedia

Nearly 2,000 features after stemming and stop-word removal!
Consider the sample phrase:
Famous jazz saxophonist born in Kansas who played bebop and Latin
Our goal is to build a vector for this sentence (composed of the TF-IDF of its terms), and then see which of the 15 biographies are the most similar to this phrase
Basic stemming is applied
Stemming methods are not perfect, and can produce terms like kansa and famou from “Kansas” and “famous”
Stemming perfection usually isn’t important as long as it’s consistent among all the documents

Next, stop-words (in, and, who) are removed, and the words are normalized with respect to document length
These values can be used as the Term Frequency (TF) feature values
The full TF-IDF representation is obtained by multiplying each term’s TF value by its IDF value
This boosts words that are rare
Jazz and play are very frequent in this corpus of jazz musician biographies so they get no boost from IDF!

The terms with the highest TF-IDF values (“latin”, “famous”, “kansas”) are the rarest in this corpus so they end up with the highest weights among the terms in the query
Similarity of each musician’s text to the following query:
Famous jazz saxophonist born in Kansas who played bebop and Latin
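A minimal sketch of how such a similarity ranking might be computed. The slides do not say which similarity measure is used; cosine similarity is assumed here because it is a common choice for TF-IDF vectors. The three short "biographies" are hypothetical stand-ins for the 15 Wikipedia excerpts, and the tokens are assumed to be already stemmed and stripped of stop-words.

```python
import math
from collections import Counter

def tf_vector(tokens):
    """Normalized term frequency: count divided by document length."""
    counts = Counter(tokens)
    total = len(tokens)
    return {term: count / total for term, count in counts.items()}

def idf_table(docs):
    """Base-10 IDF for every term that appears in the corpus."""
    n = len(docs)
    doc_freq = Counter(term for doc in docs for term in set(doc))
    return {term: math.log10(n / df) for term, df in doc_freq.items()}

def tfidf_vector(tokens, idf):
    # Terms absent from the corpus get weight 0 (an assumption of this sketch).
    return {term: tf * idf.get(term, 0.0) for term, tf in tf_vector(tokens).items()}

def cosine(a, b):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(weight * b.get(term, 0.0) for term, weight in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

# Hypothetical, already-preprocessed documents.
docs = {
    "musician_a": "jazz saxophonist kansas bebop play".split(),
    "musician_b": "jazz trumpet play new orleans".split(),
    "musician_c": "jazz piano play latin".split(),
}
idf = idf_table(list(docs.values()))

query = "famous jazz saxophonist born kansas play bebop latin".split()
q_vec = tfidf_vector(query, idf)

scores = {name: cosine(q_vec, tfidf_vector(tokens, idf)) for name, tokens in docs.items()}
ranking = sorted(scores, key=scores.get, reverse=True)
print(ranking)  # ['musician_a', 'musician_c', 'musician_b']
```

In this toy corpus "jazz" and "play" occur in every document, so their IDF is 0 and they contribute nothing to the similarity, which mirrors the point made above about very frequent terms.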
Beyond “Bag of Words”
N-gram Sequences
Topic Models
NLP
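As one illustration of going beyond single words, the sketch below extracts N-gram sequences from a token list. The function name ngrams and the sample tokens are assumptions made for illustration.

```python
def ngrams(tokens, n):
    """Return every run of n consecutive tokens as a tuple."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

tokens = "famous jazz saxophonist born in kansas".split()
bigrams = ngrams(tokens, 2)
print(bigrams[0])    # ('famous', 'jazz')
print(len(bigrams))  # 5
```

Unlike a bag of words, these sequences keep some word order, so "jazz saxophonist" becomes a feature in its own right.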

Example:
Bush started the Iraq war and didn’t visit China.

