The Mathematics of Information Retrieval
11/21/2005
Presented by Jeremy Chapman, Grant Gelven and Ben Lakin
Acknowledgments
• This presentation is based on the following paper: “Matrices, Vector Spaces, and Information Retrieval” by Michael W. Berry, Zlatko Drmač, and Elizabeth R. Jessup.
Indexing of Scientific Works
• Indexing is primarily done by using the title, author list, abstract, key word list, and subject classification
• These are created in large part to allow the works to be found in a search of scientific documents
• The use of automated information retrieval (IR) has improved consistency and speed
Vector Space Model for IR
• The basic mechanism for this model is the encoding of a document as a vector
• All documents' vectors are stored in a single matrix
• Latent Semantic Indexing (LSI) uses rank reduction to replace the original matrix with a matrix of smaller rank that retains similar information
Creating the Database Matrix
• Each document is defined in a column of the matrix (d is the number of documents)
• Each term is defined as a row (t is the number of terms)
• This gives us a t x d matrix
• The document vectors span the content
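As a rough illustration of the row and column layout just described, the sketch below builds a 0/1 term-by-document matrix from the six terms and five titles of the baking example used in this presentation. The stem list, the prefix-matching rule, and the names `TERM_STEMS`, `TITLES` and `term_document_matrix` are assumptions made for this sketch, not the indexing procedure of the presenters or of the paper. NumPy is assumed to be available.

```python
import re
import numpy as np

# Assumed stems standing in for the six index terms
# (bak(e, ing), recipes, bread, cake, pastr(y, ies), pie).
TERM_STEMS = ["bak", "recipe", "bread", "cake", "pastr", "pie"]

TITLES = [
    "How to Bake Bread Without Recipes",
    "The Classical Art of Viennese Pastry",
    "Numerical Recipes: The Art of Scientific Computing",
    "Breads, Pastries, Pies, and Cakes: Quantity Baking Recipes",
    "Pastry: A Book of Best French Recipes",
]


def term_document_matrix(stems, titles):
    """Return a t x d matrix with entry (i, j) = 1 if term i occurs in title j."""
    matrix = np.zeros((len(stems), len(titles)))
    for j, title in enumerate(titles):
        words = re.findall(r"[a-z]+", title.lower())
        for i, stem in enumerate(stems):
            # Hypothetical matching rule: a word counts if it starts with the stem.
            if any(word.startswith(stem) for word in words):
                matrix[i, j] = 1.0
    return matrix


A = term_document_matrix(TERM_STEMS, TITLES)
print(A.shape)  # t rows (terms) by d columns (documents)
print(A)
```

Each column of `A` is one document vector and each row is one term, matching the t x d layout above.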
Simple Example
• Let the six terms be as follows:
T1: bak(e, ing)
T2: recipes
T3: bread
T4: cake
T5: pastr(y, ies)
T6: pie
• The following are the d=5 documents:
D1: How to Bake Bread Without Recipes
D2: The Classical Art of Viennese Pastry
D3: Numerical Recipes: The Art of Scientific Computing
D4: Breads, Pastries, Pies, and Cakes: Quantity Baking Recipes
D5: Pastry: A Book of Best French Recipes
• Thus the document matrix becomes:
\[
A = \begin{pmatrix}
1 & 0 & 0 & 1 & 0 \\
1 & 0 & 1 & 1 & 1 \\
1 & 0 & 0 & 1 & 0 \\
0 & 0 & 0 & 1 & 0 \\
0 & 1 & 0 & 1 & 1 \\
0 & 0 & 0 & 1 & 0
\end{pmatrix}
\]
The matrix A after Normalization
• Thus after the normalization of the columns of A we get the following:
\[
A = \begin{pmatrix}
.5774 & 0 & 0 & .4082 & 0 \\
.5774 & 0 & 1 & .4082 & .7071 \\
.5774 & 0 & 0 & .4082 & 0 \\
0 & 0 & 0 & .4082 & 0 \\
0 & 1 & 0 & .4082 & .7071 \\
0 & 0 & 0 & .4082 & 0
\end{pmatrix}
\]
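The normalization step can be sketched in a few lines: every column of the 0/1 matrix is divided by its Euclidean norm, so that each document vector has length 1. The helper name `normalize_columns` is hypothetical, and the sketch assumes that no document column is entirely zero (otherwise the division would fail).

```python
import numpy as np

# Raw 0/1 term-by-document matrix of the baking example (6 terms x 5 documents).
A = np.array([
    [1, 0, 0, 1, 0],
    [1, 0, 1, 1, 1],
    [1, 0, 0, 1, 0],
    [0, 0, 0, 1, 0],
    [0, 1, 0, 1, 1],
    [0, 0, 0, 1, 0],
], dtype=float)


def normalize_columns(matrix):
    """Scale every column to unit Euclidean norm (assumes no all-zero column)."""
    column_norms = np.linalg.norm(matrix, axis=0)
    return matrix / column_norms


A_hat = normalize_columns(A)
print(np.round(A_hat, 4))
```

A column with three nonzero entries ends up with entries 1/sqrt(3), a column with six nonzero entries with 1/sqrt(6), and so on, which is where the values .5774, .4082 and .7071 come from.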
Making a Query
• Next we will use the document matrix to ease our search for related documents
• Referring to our example, we will make the following query: Baking Bread
• We will now format the query using the term definitions given before:
\[
q = (1 \; 0 \; 1 \; 0 \; 0 \; 0)^T
\]
Matching the Document to the Query
• Matching the documents to a given query is typically done by using the cosine of the angle between the query and document vectors
• The cosine is given as follows:
\[
\cos(\theta_j) = \frac{a_j^T q}{\|a_j\|_2 \, \|q\|_2}
\]
where a_j is the j-th column of A
A Query
• By using the cosine formula we would get: cos(θ1) = .8165, cos(θ2) = 0, cos(θ3) = 0, cos(θ4) = .5774, and cos(θ5) = 0
• We will set our lower limit on our cosine at .5
• Thus by conducting a query “baking bread” we get the following two articles:
D1: How to Bake Bread Without Recipes
D4: Breads, Pastries, Pies, and Cakes: Quantity Baking Recipes
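The query "baking bread" can be replayed numerically. The sketch below encodes the query as a vector over the six terms, evaluates the cosine against every normalized document column, and keeps the documents at or above the .5 cutoff. The function name `query_cosines` and the variable names are assumptions for illustration only.

```python
import numpy as np

# Raw term-by-document matrix of the baking example, then column-normalized.
A = np.array([
    [1, 0, 0, 1, 0],
    [1, 0, 1, 1, 1],
    [1, 0, 0, 1, 0],
    [0, 0, 0, 1, 0],
    [0, 1, 0, 1, 1],
    [0, 0, 0, 1, 0],
], dtype=float)
A_hat = A / np.linalg.norm(A, axis=0)

# Query "baking bread": terms T1 (bake) and T3 (bread) are present.
q = np.array([1, 0, 1, 0, 0, 0], dtype=float)


def query_cosines(doc_matrix, query):
    """Cosine of the angle between the query and every document column."""
    numerators = doc_matrix.T @ query
    denominators = np.linalg.norm(doc_matrix, axis=0) * np.linalg.norm(query)
    return numerators / denominators


cosines = query_cosines(A_hat, q)
cutoff = 0.5
hits = [f"D{j + 1}" for j in np.flatnonzero(cosines >= cutoff)]
print(np.round(cosines, 4))
print(hits)
```

Lowering `cutoff` returns more documents at the price of weaker matches; the value .5 is the one chosen in the slides.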
Singular Value Decomposition
• The Singular Value Decomposition (SVD) is used to reduce the rank of the matrix, while also giving a good approximation of the information stored in it
• The decomposition is written in the following manner:
\[
A = U \Sigma V^T
\]
where the first r columns of U span the column space of A (r is the rank of A), Σ is the matrix with the singular values of A along its main diagonal, and the first r columns of V span the row space of A. U and V are also orthogonal.
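A short numerical check of the factorization, assuming NumPy's `numpy.linalg.svd` as the SVD routine (the slides do not say which software was used). The sketch factors the normalized baking matrix, rebuilds the rectangular Σ, and verifies that U and V are orthogonal and that the product reproduces the matrix.

```python
import numpy as np

A = np.array([
    [1, 0, 0, 1, 0],
    [1, 0, 1, 1, 1],
    [1, 0, 0, 1, 0],
    [0, 0, 0, 1, 0],
    [0, 1, 0, 1, 1],
    [0, 0, 0, 1, 0],
], dtype=float)
A_hat = A / np.linalg.norm(A, axis=0)

# Full SVD: U is 6 x 6, sigma holds the 5 singular values, Vt is V^T (5 x 5).
U, sigma, Vt = np.linalg.svd(A_hat, full_matrices=True)

# Rebuild the 6 x 5 matrix Sigma with the singular values on its diagonal.
Sigma = np.zeros(A_hat.shape)
Sigma[:len(sigma), :len(sigma)] = np.diag(sigma)

print(np.round(sigma, 4))                      # singular values, largest first
print(np.allclose(U.T @ U, np.eye(6)))         # U is orthogonal
print(np.allclose(Vt @ Vt.T, np.eye(5)))       # V is orthogonal
print(np.allclose(U @ Sigma @ Vt, A_hat))      # the factors reproduce the matrix
```

The signs of individual singular vectors returned by a library may differ from those printed elsewhere; only the product U Σ V^T is unique.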
SVD continued
• Unlike the QR factorization, the SVD provides us with a lower-rank representation of both the column and row spaces
• We know A_k is the best rank-k approximation to A by the theorem of Eckart and Young, which states:
\[
\|A - A_k\|_F = \min_{\operatorname{rank}(X) \le k} \|A - X\|_F
\]
• Thus the rank-k approximation of A is given as follows:
\[
A_k = U_k \Sigma_k V_k^T
\]
• Where U_k = the first k columns of U
Σ_k = a k x k matrix whose diagonal is a set of decreasing values, call them σ_1, ..., σ_k
V_k^T = the k x d matrix whose rows are the first k rows of V^T
SVD Factorization
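\[
U = \begin{pmatrix}
.2670 & -.2567 & .5308 & -.2847 & -.7071 & 0 \\
.7479 & -.3981 & -.5249 & .0816 & 0 & 0 \\
.2670 & -.2567 & .5308 & -.2847 & .7071 & 0 \\
.1182 & -.0127 & .2774 & .6394 & 0 & -.7071 \\
.5198 & .8423 & .0838 & -.1158 & 0 & 0 \\
.1182 & -.0127 & .2774 & .6394 & 0 & .7071
\end{pmatrix}
\]
\[
\Sigma = \begin{pmatrix}
1.6950 & 0 & 0 & 0 & 0 \\
0 & 1.1158 & 0 & 0 & 0 \\
0 & 0 & 0.8403 & 0 & 0 \\
0 & 0 & 0 & 0.4195 & 0 \\
0 & 0 & 0 & 0 & 0 \\
0 & 0 & 0 & 0 & 0
\end{pmatrix}
\]
\[
V = \begin{pmatrix}
.4366 & -.4717 & .3688 & -.6715 & 0 \\
.3067 & .7549 & .0998 & -.2760 & -.5000 \\
.4412 & -.3568 & -.6247 & .1945 & -.5000 \\
.4909 & -.0346 & .5711 & .6571 & 0 \\
.5288 & .2815 & -.3712 & -.0577 & .7071
\end{pmatrix}
\]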
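Given the factors, a rank-k approximation A_k = U_k Σ_k V_k^T is obtained by truncation. The sketch below forms A_3 for the normalized baking matrix and illustrates the optimality statement of Eckart and Young: the Frobenius error of A_3 equals the discarded singular value, and a batch of randomly generated rank-3 matrices never does better. The helper name `rank_k_approximation`, the random seed and the number of trials are assumptions for illustration; random trials only illustrate the theorem, they do not prove it.

```python
import numpy as np

A = np.array([
    [1, 0, 0, 1, 0],
    [1, 0, 1, 1, 1],
    [1, 0, 0, 1, 0],
    [0, 0, 0, 1, 0],
    [0, 1, 0, 1, 1],
    [0, 0, 0, 1, 0],
], dtype=float)
A_hat = A / np.linalg.norm(A, axis=0)


def rank_k_approximation(matrix, k):
    """Truncated SVD: keep the k largest singular values and their vectors."""
    U, sigma, Vt = np.linalg.svd(matrix, full_matrices=False)
    return U[:, :k] @ np.diag(sigma[:k]) @ Vt[:k, :]


sigma = np.linalg.svd(A_hat, compute_uv=False)
A3 = rank_k_approximation(A_hat, 3)
error_A3 = np.linalg.norm(A_hat - A3, "fro")

# Compare against arbitrary rank-3 matrices X = B C (B is 6 x 3, C is 3 x 5).
rng = np.random.default_rng(0)
random_errors = [
    np.linalg.norm(
        A_hat - rng.standard_normal((6, 3)) @ rng.standard_normal((3, 5)), "fro"
    )
    for _ in range(100)
]

print(np.linalg.matrix_rank(A3))          # rank of the truncated matrix
print(error_A3, sigma[3])                 # error equals the dropped singular value
print(error_A3 <= min(random_errors))     # no random rank-3 matrix beats A3
```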
Interpretation
• From the matrix Σ given on the slide before we notice that A is a rank-4 matrix: it has only four nonzero singular values
• Also, the two zero rows of Σ tell us that the first four columns of U give us a basis for the column space of A
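Both observations can be checked numerically. The sketch below counts the singular values above a small tolerance to obtain the rank, and confirms that projecting the columns of the matrix onto the span of the first four columns of U leaves them unchanged, which is what it means for those columns to be a basis of the column space. The tolerance `1e-10` is an assumed threshold for treating a computed singular value as zero.

```python
import numpy as np

A = np.array([
    [1, 0, 0, 1, 0],
    [1, 0, 1, 1, 1],
    [1, 0, 0, 1, 0],
    [0, 0, 0, 1, 0],
    [0, 1, 0, 1, 1],
    [0, 0, 0, 1, 0],
], dtype=float)
A_hat = A / np.linalg.norm(A, axis=0)

U, sigma, Vt = np.linalg.svd(A_hat)

tolerance = 1e-10                      # assumed threshold for "numerically zero"
rank = int(np.sum(sigma > tolerance))  # number of nonzero singular values

# Orthogonal projection of every column of A_hat onto span(U[:, :rank]).
U_r = U[:, :rank]
projected = U_r @ (U_r.T @ A_hat)

print(rank)
print(np.allclose(projected, A_hat))   # columns already lie in that span
```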
Analysis of the Rank-k Approximations
• Using the following formula we can calculate the relative error from the original matrix to its rank-k approximation:
\[
\|A - A_k\|_F = \sqrt{\sigma_{k+1}^2 + \dots + \sigma_r^2}
\]
where r is the rank of A; dividing by \|A\|_F = \sqrt{\sigma_1^2 + \dots + \sigma_r^2} gives the relative error
• Thus only a 19% relative change is needed to reduce the rank of A from 4 to 3, whereas a 42% relative change is necessary to reduce it from 4 to 2
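The two percentages can be reproduced from the singular values alone. The sketch below is a minimal check, with `relative_error` a hypothetical helper name; it also confirms the value for k = 3 against a directly computed truncated matrix.

```python
import numpy as np

A = np.array([
    [1, 0, 0, 1, 0],
    [1, 0, 1, 1, 1],
    [1, 0, 0, 1, 0],
    [0, 0, 0, 1, 0],
    [0, 1, 0, 1, 1],
    [0, 0, 0, 1, 0],
], dtype=float)
A_hat = A / np.linalg.norm(A, axis=0)

U, sigma, Vt = np.linalg.svd(A_hat, full_matrices=False)


def relative_error(singular_values, k):
    """||A - A_k||_F / ||A||_F expressed through the singular values."""
    discarded = np.sqrt(np.sum(singular_values[k:] ** 2))
    total = np.sqrt(np.sum(singular_values ** 2))
    return discarded / total


for k in (3, 2):
    print(k, round(relative_error(sigma, k), 2))

# Cross-check k = 3 against the matrices themselves.
A3 = U[:, :3] @ np.diag(sigma[:3]) @ Vt[:3, :]
direct = np.linalg.norm(A_hat - A3, "fro") / np.linalg.norm(A_hat, "fro")
print(np.isclose(direct, relative_error(sigma, 3)))
```

Because every column of the normalized matrix has unit length, the Frobenius norm of the full matrix is sqrt(5), so the relative errors are simply the discarded singular values scaled by that constant.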
• As expected, these values are less than the corresponding ones for the rank-k approximations from the QR factorization
Using the SVD for Query Matching
• Using the following formula we can calculate the cosine of the angles between the query and the columns of our rank-k approximation of A:
\[
\cos(\theta_j) = \frac{s_j^T (U_k^T q)}{\|s_j\|_2 \, \|U_k^T q\|_2}
\]
where s_j = Σ_k V_k^T e_j and e_j is the j-th column of the d x d identity matrix
• Using the rank-3 approximation we return the first and fourth books again using the cutoff of .5
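A sketch of query matching against the rank-3 approximation, following the cosine formula with s_j and the projected query U_k^T q. The function name `rank_k_query_cosines` is an assumption for illustration, and NumPy's SVD stands in for whatever routine the presenters used. Note that the full matrix A_k is never formed: only the k x d matrix of the s_j and the k-vector U_k^T q are needed.

```python
import numpy as np

A = np.array([
    [1, 0, 0, 1, 0],
    [1, 0, 1, 1, 1],
    [1, 0, 0, 1, 0],
    [0, 0, 0, 1, 0],
    [0, 1, 0, 1, 1],
    [0, 0, 0, 1, 0],
], dtype=float)
A_hat = A / np.linalg.norm(A, axis=0)
q = np.array([1, 0, 1, 0, 0, 0], dtype=float)   # query "baking bread"


def rank_k_query_cosines(matrix, query, k):
    """Cosines between the projected query and the columns of the rank-k model."""
    U, sigma, Vt = np.linalg.svd(matrix, full_matrices=False)
    U_k = U[:, :k]
    S = np.diag(sigma[:k]) @ Vt[:k, :]        # column j of S is s_j
    projected_query = U_k.T @ query           # U_k^T q
    numerators = S.T @ projected_query
    denominators = np.linalg.norm(S, axis=0) * np.linalg.norm(projected_query)
    return numerators / denominators


cosines = rank_k_query_cosines(A_hat, q, 3)
hits = [f"D{j + 1}" for j in np.flatnonzero(cosines >= 0.5)]
print(np.round(cosines, 4))
print(hits)
```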
Term-Term Comparison
• It is possible to modify the vector space model for comparing queries with documents in order to compare terms with terms
• When this is added to a search engine it can act as a tool to refine the result
• First we run our search as before and retrieve a certain number of documents; in the following example we will have five documents retrieved
• We will then create another term-by-document matrix from the retrieved documents, call it G
Another Example
Terms:
T1: Run(ning)
T2: Bike
T3: Endurance
T4: Training
T5: Band
T6: Music
T7: Fishes
Documents:
D1: Complete Triathlon Endurance Training Manual: Swim, Bike, Run
D2: Lake, River, and Sea-Run Fishes of Canada
D3: Middle Distance Running, Training and Competition
D4: Music Law: How to Run your Band's Business
D5: Running: Learning, Training Competing
\[
G = \begin{pmatrix}
.5000 & .7071 & .7071 & .5774 & .7071 \\
.5000 & 0 & 0 & 0 & 0 \\
.5000 & 0 & 0 & 0 & 0 \\
.5000 & 0 & .7071 & 0 & .7071 \\
0 & 0 & 0 & .5774 & 0 \\
0 & 0 & 0 & .5774 & 0 \\
0 & .7071 & 0 & 0 & 0
\end{pmatrix}
\]
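Analysis of the Term-Term Comparison
• For this we use the following formula:
\[
\cos(\theta_{ij}) = \frac{(e_i^T G)(G^T e_j)}{\|G^T e_i\|_2 \, \|G^T e_j\|_2}
\]
where e_i is the i-th canonical unit vector, so that G^T e_i is the i-th row of G, the vector of term i
Clustering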
• Clustering is the process by which terms are grouped if they are related, such as bike, endurance and training
• First the terms are split into groups which are related
• The terms in each group are placed such that their vectors are almost parallel
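One way to inspect which term vectors are close to parallel is to compute the cosine between every pair of rows of G. The sketch below does only that, for the running example; it does not perform the clustering itself, and the names `TERMS`, `G_raw` and `term_term_cosines` are assumptions for illustration. The matrix is rebuilt from 0/1 entries and column-normalized to avoid rounding in the printed values.

```python
import numpy as np

TERMS = ["run", "bike", "endurance", "training", "band", "music", "fishes"]

# 0/1 occurrences of the seven terms in the five retrieved titles.
G_raw = np.array([
    [1, 1, 1, 1, 1],   # run(ning)
    [1, 0, 0, 0, 0],   # bike
    [1, 0, 0, 0, 0],   # endurance
    [1, 0, 1, 0, 1],   # training
    [0, 0, 0, 1, 0],   # band
    [0, 0, 0, 1, 0],   # music
    [0, 1, 0, 0, 0],   # fishes
], dtype=float)
G = G_raw / np.linalg.norm(G_raw, axis=0)   # unit-length document columns


def term_term_cosines(matrix):
    """Cosine between every pair of rows (term vectors) of the matrix."""
    row_norms = np.linalg.norm(matrix, axis=1)
    return (matrix @ matrix.T) / np.outer(row_norms, row_norms)


C = term_term_cosines(G)
for i in range(len(TERMS)):
    for j in range(i + 1, len(TERMS)):
        if C[i, j] > 0:
            print(f"{TERMS[i]:>9} ~ {TERMS[j]:<9} {C[i, j]:.4f}")
```

Pairs with cosine 1 (for example terms that occur in exactly the same documents) have parallel vectors; pairs with cosine 0 share no document.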
Clusters
• In this example the first cluster is running
• The second cluster is bike, endurance and training
• The third is band and music
• And the fourth is fishes
Analyzing the Term-Term Comparison
• We will again use the SVD rank-k approximation
• Thus the cosine of the angles becomes:
\[
\cos(\theta_{ij}) = \frac{(e_i^T U_k \Sigma_k)(\Sigma_k U_k^T e_j)}{\|\Sigma_k U_k^T e_i\|_2 \, \|\Sigma_k U_k^T e_j\|_2}
\]
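The rank-k version of the term-term cosine can be sketched the same way, here for the running example, with U_k and Σ_k taken from an SVD of G. The helper name `rank_k_term_cosines` and the choice k = 2 in the usage lines are assumptions for illustration; the slides do not state which k was used.

```python
import numpy as np

# 0/1 occurrences of the seven terms (run, bike, endurance, training,
# band, music, fishes) in the five titles, column-normalized to give G.
G_raw = np.array([
    [1, 1, 1, 1, 1],
    [1, 0, 0, 0, 0],
    [1, 0, 0, 0, 0],
    [1, 0, 1, 0, 1],
    [0, 0, 0, 1, 0],
    [0, 0, 0, 1, 0],
    [0, 1, 0, 0, 0],
], dtype=float)
G = G_raw / np.linalg.norm(G_raw, axis=0)


def rank_k_term_cosines(matrix, k):
    """Term-term cosines computed from Sigma_k U_k^T e_i instead of rows of G."""
    U, sigma, Vt = np.linalg.svd(matrix, full_matrices=False)
    W = np.diag(sigma[:k]) @ U[:, :k].T      # column i is Sigma_k U_k^T e_i
    norms = np.linalg.norm(W, axis=0)
    return (W.T @ W) / np.outer(norms, norms)


row_norms = np.linalg.norm(G, axis=1)
full_cosines = (G @ G.T) / np.outer(row_norms, row_norms)

C2 = rank_k_term_cosines(G, 2)               # assumed choice of k
C4 = rank_k_term_cosines(G, 4)               # G has rank 4, so nothing is lost
print(np.round(C2, 2))
print(np.allclose(C4, full_cosines))
```

With k equal to the rank of G the result coincides with the unreduced cosines; smaller k merges directions and generally raises the cosines between related terms.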
Conclusion
• Through the use of this model many libraries and smaller collections can index their documents
• However, as the next presentation will show, a different approach is used in large collections such as the internet