1 Latent Semantics
We recall that we were able to express the huge matrix that relates words (terms) to
documents by introducing a matrix that relates words to concepts, and doing a matrix
multiplication.
[Figure 1: the Topics x Terms matrix multiplied by the Terms x Documents matrix gives the
Topics x Documents matrix.]
By the way, let’s recall the horrible formula for matrix multiplication: if we call the pink
matrix M, the term-document matrix D, and the topic-document (or concept-document)
matrix C, then the relation is:
\[ C = MD \quad \text{(matrix multiplication)} \]
\[ C_{kj} = \sum_i M_{ki} D_{ij} \]
In this equation:
The number $M_{ki}$ (in the kth row and the ith column of M) tells how related the
specific word or term “i” is to the concept labeled “k”.
The number $D_{ij}$ (in the ith row and the jth column of D) tells us something about how
often the ith word occurs in the document labeled j.
And the particular entry $C_{kj}$ in the matrix product tells us how important concept k
is in the document labeled j.
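To make the sum over i concrete, here is a minimal sketch in plain Python. The sizes (2 topics, 3 terms, 2 documents) and every number in M and D are made up for illustration; they are not the matrices from the figures.

```python
# Toy, made-up data: 2 topics, 3 terms, 2 documents.
M = [  # topics x terms: M[k][i] = how related term i is to topic k
    [1.0, 0.5, 0.0],
    [0.0, 0.5, 1.0],
]
D = [  # terms x documents: D[i][j] = how often term i occurs in document j
    [2, 0],
    [1, 1],
    [0, 3],
]


def matmul(M, D):
    """Return C with C[k][j] = sum over i of M[k][i] * D[i][j]."""
    n_terms = len(D)
    if any(len(row) != n_terms for row in M):
        raise ValueError("each row of M needs one entry per row of D")
    n_docs = len(D[0])
    return [
        [sum(M[k][i] * D[i][j] for i in range(n_terms)) for j in range(n_docs)]
        for k in range(len(M))
    ]


C = matmul(M, D)  # topics x documents
print(C)  # [[2.5, 0.5], [0.5, 3.5]]
```

With these numbers, topic 0 dominates document 0 and topic 1 dominates document 1, which is what the entries of C are meant to express.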
Recall that what this means is that we can get, for each particular document, a new
vector, which is much shorter, and still represents the document. But it represents it in
terms of the topics that are in it, rather than the terms.
Recall that in the yellow matrix, each document is represented by a column. (Shown here
in blue.) That same document is represented (in terms of concepts or topics) by the
shorter vector, shown in yellow, in the topics by documents matrix. (See Figure 2).
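The same idea can be sketched for a single document: multiplying the topics-by-terms matrix by one document's column of term counts gives that document's shorter topic vector. The helper name `topic_vector` and all the numbers are assumptions chosen for illustration.

```python
def topic_vector(M, doc_column):
    """Multiply a topics-by-terms matrix M by one document's term column."""
    if any(len(row) != len(doc_column) for row in M):
        raise ValueError("each row of M needs one entry per term")
    return [sum(m_ki * d_i for m_ki, d_i in zip(row, doc_column)) for row in M]


# Made-up example: 2 topics, 4 terms.
M = [
    [1.0, 0.5, 0.0, 0.0],
    [0.0, 0.5, 1.0, 1.0],
]
doc = [2, 1, 0, 3]  # one column of the terms-by-documents matrix

topics = topic_vector(M, doc)
print(len(doc), len(topics))  # 4 2
print(topics)                 # [2.5, 3.5]
```

With a realistic vocabulary the term column would have many thousands of entries, while the topic vector stays as short as the number of topics.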
Figure 2
In this case we had to make up the “pink matrix” ourselves. We did it for a few words.
But if we tried to do it by hand for all the words in the language, and all the reasonable
topics, it would certainly keep all the librarians in the world busy for centuries. So people
looked for a different way to get at it.
They looked to the idea that “where a word appears” tells you something about its
meaning. There are two senses to the notion of “where a word appears”. One is the local
context. The other is distribution across documents or web pages.
For example, the context of the word “something” in the preceding paragraph is:
tells you something about its
I could have taken a larger or smaller context, but I am just looking two words to the
left and to the right. This context tells us that whatever the word “something” means, it
makes sense to put it into the local context “tells you ___ about its”. That actually tells
us a little about what the word “something” means.
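A local context of this kind can be pulled out mechanically. The sketch below uses a lightly simplified copy of the sentence (quotation marks dropped), and the function name `local_context` and the default window of two words are assumptions for illustration.

```python
def local_context(words, index, width=2):
    """Return the `width` words to the left and to the right of words[index]."""
    left = words[max(0, index - width):index]
    right = words[index + 1:index + 1 + width]
    return left, right


sentence = "where a word appears tells you something about its meaning"
words = sentence.split()

left, right = local_context(words, words.index("something"))
print(left, right)  # ['tells', 'you'] ['about', 'its']
```

Changing `width` gives the larger or smaller contexts mentioned in the text.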
On the other hand, distribution across web pages tells us something about a word also.
Depending on what a word means, it will tend to appear in some pages and not others.
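Distribution across pages can be sketched just as simply: count how often a word occurs in each page. The three "pages" below are invented toy strings, and `distribution` is a hypothetical helper name.

```python
from collections import Counter

# Invented toy pages; real pages would be much longer.
pages = {
    "page_a": "the matrix relates words to documents",
    "page_b": "the librarians sort documents by topic",
    "page_c": "matrix multiplication of a matrix",
}

# One word-count table per page.
counts = {name: Counter(text.split()) for name, text in pages.items()}


def distribution(word):
    """Return how often `word` occurs in each page (0 if it never appears)."""
    return {name: page_counts[word] for name, page_counts in counts.items()}


print(distribution("matrix"))     # {'page_a': 1, 'page_b': 0, 'page_c': 2}
print(distribution("documents"))  # {'page_a': 1, 'page_b': 1, 'page_c': 0}
```

Each such result is one row of a terms-by-documents matrix like D: the word "matrix" shows up in some pages and not in others.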