# svd - The Mathematics of Information Retrieval


11/21/2005

Presented by Jeremy Chapman, Grant Gelven and Ben Lakin

## Acknowledgments

This presentation is based on the following paper: "Matrices, Vector Spaces, and Information Retrieval" by Michael W. Berry, Zlatko Drmač, and Elizabeth R. Jessup.

## Indexing of Scientific Works

- Indexing is primarily done using the title, author list, abstract, keyword list, and subject classification.
- These are created in large part so that a work can be found in a search of scientific documents.
- The use of automated information retrieval (IR) has improved consistency and speed.

## Vector Space Model for IR

- The basic mechanism of this model is the encoding of a document as a vector.
- All document vectors are stored in a single matrix.
- Latent Semantic Indexing (LSI) replaces the original matrix with a matrix of smaller rank that retains similar information, by means of rank reduction.

## Creating the Database Matrix

- Each document is a column of the matrix (d is the number of documents).
- Each term is a row (t is the number of terms).
- This gives a t × d matrix.
- The document vectors span the content.

## Simple Example

Let the six terms be as follows:

- T1: bak(e, ing)
- T2: recipes
- T3: bread
- T4: cake
- T5: pastr(y, ies)
- T6: pie

The following are the d = 5 documents:

- D1: How to Bake Bread Without Recipes
- D2: The Classical Art of Viennese Pastry
- D3: Numerical Recipes: The Art of Scientific Computing
- D4: Breads, Pastries, Pies, and Cakes: Quantity Baking Recipes
- D5: Pastry: A Book of Best French Recipes

Thus the document matrix becomes:

$$
A = \begin{pmatrix}
1 & 0 & 0 & 1 & 0 \\
1 & 0 & 1 & 1 & 1 \\
1 & 0 & 0 & 1 & 0 \\
0 & 0 & 0 & 1 & 0 \\
0 & 1 & 0 & 1 & 1 \\
0 & 0 & 0 & 1 & 0
\end{pmatrix}
$$

## The Matrix A after Normalization

Normalizing each column of A to unit Euclidean length gives the following:
$$
A = \begin{pmatrix}
.5774 & 0 & 0 & .4082 & 0 \\
.5774 & 0 & 1 & .4082 & .7071 \\
.5774 & 0 & 0 & .4082 & 0 \\
0 & 0 & 0 & .4082 & 0 \\
0 & 1 & 0 & .4082 & .7071 \\
0 & 0 & 0 & .4082 & 0
\end{pmatrix}
$$

## Making a Query

- Next we use the document matrix to ease our search for related documents.
- Referring to our example, we make the following query: Baking Bread.
- We format the query using the term definitions given before:

$$
q = (1 \; 0 \; 1 \; 0 \; 0 \; 0)^T
$$

## Matching the Document to the Query

Matching the documents to a given query is typically done using the cosine of the angle between the query vector and each document vector. For the j-th column $a_j$ of A, the cosine is:

$$
\cos\theta_j = \frac{a_j^T q}{\|a_j\|_2 \, \|q\|_2}
$$

## A Query

Using the cosine formula we get:

$$
\cos\theta_1 = .8165, \quad \cos\theta_2 = 0, \quad \cos\theta_3 = 0, \quad \cos\theta_4 = .5774, \quad \cos\theta_5 = 0
$$

We set the lower limit on the cosine at .5. Thus the query "baking bread" returns the following two documents:

- D1: How to Bake Bread Without Recipes
- D4: Breads, Pastries, Pies, and Cakes: Quantity Baking Recipes

## Singular Value Decomposition

The Singular Value Decomposition (SVD) is used to reduce the rank of the matrix while still giving a good approximation of the information stored in it. The decomposition is written as:

$$
A = U \Sigma V^T
$$

Here U and V are orthogonal matrices and Σ is the matrix with the singular values of A along its main diagonal. The columns of U that correspond to nonzero singular values span the column space of A, and the corresponding columns of V span the row space of A.
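The query-matching steps above can be sketched in a few lines of plain Python. This is a minimal illustration, not the presenters' code: the names `STEMS`, `term_vector`, and `cosine` are hypothetical, and matching a term by word prefix is an assumption made only to turn the titles into vectors. Because the cosine formula divides by both norms, it gives the same result whether or not the columns of A are normalized first.

```python
import math
import re

# Term stems and document titles from the example. Matching a term by word
# prefix is a simplifying assumption made for this sketch.
STEMS = ["bak", "recipe", "bread", "cake", "pastr", "pie"]
TITLES = [
    "How to Bake Bread Without Recipes",
    "The Classical Art of Viennese Pastry",
    "Numerical Recipes: The Art of Scientific Computing",
    "Breads, Pastries, Pies, and Cakes: Quantity Baking Recipes",
    "Pastry: A Book of Best French Recipes",
]


def term_vector(text):
    """Return the 0/1 vector marking which stems occur in the text."""
    words = re.findall(r"[a-z]+", text.lower())
    return [1.0 if any(w.startswith(s) for w in words) else 0.0 for s in STEMS]


def cosine(u, v):
    """Cosine of the angle between u and v (0.0 if either vector is zero)."""
    norm_u = math.sqrt(sum(x * x for x in u))
    norm_v = math.sqrt(sum(x * x for x in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return sum(a * b for a, b in zip(u, v)) / (norm_u * norm_v)


# Each document vector is one column of the t x d matrix A.
columns = [term_vector(title) for title in TITLES]
q = term_vector("baking bread")  # (1 0 1 0 0 0)^T
cosines = [cosine(a_j, q) for a_j in columns]

# Keep the documents whose cosine reaches the lower limit of .5.
hits = [j + 1 for j, c in enumerate(cosines) if c >= 0.5]

print([round(c, 4) for c in cosines])  # [0.8165, 0.0, 0.0, 0.5774, 0.0]
print(hits)                            # [1, 4]
```

The two hits are D1 and D4, matching the result stated on the slide.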
## SVD continued

- Unlike the QR factorization, the SVD provides a lower-rank representation of both the column space and the row space.
- We know $A_k$ is the best rank-k approximation to A by the Eckart-Young theorem, which states:

$$
\|A - A_k\|_F = \min_{\operatorname{rank}(X) \le k} \|A - X\|_F
$$

- Thus the rank-k approximation of A is given as follows:

$$
A_k = U_k \Sigma_k V_k^T
$$

where

- $U_k$ is the first k columns of U,
- $\Sigma_k$ is the k × k diagonal matrix whose diagonal holds the decreasing singular values $\sigma_1, \ldots, \sigma_k$,
- $V_k^T$ is the k × d matrix whose rows are the first k rows of $V^T$.

## SVD Factorization

$$
U = \begin{pmatrix}
.2670 & -.2567 & .5308 & -.2847 & -.7071 & 0 \\
.7479 & -.3981 & -.5249 & .0816 & 0 & 0 \\
.2670 & -.2567 & .5308 & -.2847 & .7071 & 0 \\
.1182 & -.0127 & .2774 & .6394 & 0 & -.7071 \\
.5198 & .8423 & .0838 & -.1158 & 0 & 0 \\
.1182 & -.0127 & .2774 & .6394 & 0 & .7071
\end{pmatrix}
$$

$$
\Sigma = \begin{pmatrix}
1.6950 & 0 & 0 & 0 & 0 \\
0 & 1.1158 & 0 & 0 & 0 \\
0 & 0 & 0.8403 & 0 & 0 \\
0 & 0 & 0 & 0.4195 & 0 \\
0 & 0 & 0 & 0 & 0 \\
0 & 0 & 0 & 0 & 0
\end{pmatrix}
$$

$$
V = \begin{pmatrix}
.4366 & -.4717 & .3688 & -.6715 & 0 \\
.3067 & .7549 & .0998 & -.2760 & -.5000 \\
.4412 & -.3568 & -.6247 & .1945 & -.5000 \\
.4909 & -.0346 & .5711 & .6571 & 0 \\
.5288 & .2815 & -.3712 & -.0577 & .7071
\end{pmatrix}
$$

## Interpretation

- From Σ on the previous slide we see that A has only four nonzero singular values, so A has rank 4.
- The two zero rows of Σ tell us that the first four columns of U form a basis for the column space of A.

## Analysis of the Rank-k Approximations

The relative error of the rank-k approximation is $\|A - A_k\|_F / \|A\|_F$, where the error in the Frobenius norm follows from the discarded singular values ($r_A$ is the rank of A):

$$
\|A - A_k\|_F = \sqrt{\sigma_{k+1}^2 + \cdots + \sigma_{r_A}^2}
$$

- Thus only a 19% relative error is incurred in moving from the rank-4 matrix A to its rank-3 approximation, whereas a 42% relative error is incurred in moving to a rank-2 approximation.
- As expected, these errors are smaller than those of the corresponding rank-k approximations from the QR factorization.

## Using the SVD for Query Matching

- Using the following formula we can calculate the cosines of the angles between the query and the columns of our rank-k approximation of A, where $s_j = \Sigma_k V_k^T e_j$ and $e_j$ is the j-th column of the d × d identity matrix:

$$
\cos\theta_j = \frac{s_j^T (U_k^T q)}{\|s_j\|_2 \, \|U_k^T q\|_2}
$$

- Using the rank-3 approximation, we again return the first and fourth books with the cutoff of .5.

## Term-Term Comparison

- The vector space model for comparing queries with documents can be modified to compare terms with terms.
- When this is added to a search engine, it can act as a tool to refine the results.
- First we run our search as before and retrieve a certain number of documents; in the following example, five documents are retrieved.
- We then create another document matrix from the retrieved documents; call it G.
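Returning to the rank-k error analysis and the SVD query matching above, the stated percentages and the rank-3 result can be checked with a short plain-Python sketch. The singular values and the first three columns of U and V are copied from the SVD Factorization slide at four decimals; the names `SIGMA`, `U3`, `V3`, `relative_error`, and `rank_k_cosines` are hypothetical, and the definition $s_j = \Sigma_k V_k^T e_j$ is assumed.

```python
import math

# Values copied from the "SVD Factorization" slide (four decimals).
SIGMA = [1.6950, 1.1158, 0.8403, 0.4195]   # nonzero singular values of A
U3 = [                                     # first three columns of U (6 x 3)
    [0.2670, -0.2567, 0.5308],
    [0.7479, -0.3981, -0.5249],
    [0.2670, -0.2567, 0.5308],
    [0.1182, -0.0127, 0.2774],
    [0.5198, 0.8423, 0.0838],
    [0.1182, -0.0127, 0.2774],
]
V3 = [                                     # first three columns of V (5 x 3)
    [0.4366, -0.4717, 0.3688],
    [0.3067, 0.7549, 0.0998],
    [0.4412, -0.3568, -0.6247],
    [0.4909, -0.0346, 0.5711],
    [0.5288, 0.2815, -0.3712],
]


def relative_error(sigma, k):
    """||A - A_k||_F / ||A||_F computed from the singular values alone."""
    discarded = sum(s * s for s in sigma[k:])
    total = sum(s * s for s in sigma)
    return math.sqrt(discarded / total)


def rank_k_cosines(q, k):
    """cos(theta_j) = s_j^T (U_k^T q) / (||s_j||_2 ||U_k^T q||_2), for k <= 3."""
    # Project the query onto the first k left singular vectors.
    uq = [sum(U3[i][c] * q[i] for i in range(len(q))) for c in range(k)]
    uq_norm = math.sqrt(sum(x * x for x in uq))
    result = []
    for v_row in V3:
        # s_j = Sigma_k V_k^T e_j scales row j of V_k by the singular values.
        s_j = [SIGMA[c] * v_row[c] for c in range(k)]
        s_norm = math.sqrt(sum(x * x for x in s_j))
        dot = sum(a * b for a, b in zip(s_j, uq))
        result.append(dot / (s_norm * uq_norm))
    return result


errors = {k: relative_error(SIGMA, k) for k in (3, 2)}
q = [1, 0, 1, 0, 0, 0]                     # query "baking bread"
rank3_cosines = rank_k_cosines(q, 3)
retrieved = [j + 1 for j, c in enumerate(rank3_cosines) if c > 0.5]

print({k: round(100 * e) for k, e in errors.items()})  # {3: 19, 2: 42}
print(retrieved)                                       # [1, 4]
```

Only the singular values are needed for the error figures, so no matrix has to be formed. For real collections a numerical library's SVD routine would replace the hard-coded factors.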
## Another Example

Terms:

- T1: Run(ning)
- T2: Bike
- T3: Endurance
- T4: Training
- T5: Band
- T6: Music
- T7: Fishes

Documents:

- D1: Complete Triathlon Endurance Training Manual: Swim, Bike, Run
- D2: Lake, River, and Sea-Run Fishes of Canada
- D3: Middle Distance Running, Training and Competition
- D4: Music Law: How to Run your Band's Business
- D5: Running: Learning, Training, Competing

$$
G = \begin{pmatrix}
.5000 & .7071 & .7071 & .5774 & .7071 \\
.5000 & 0 & 0 & 0 & 0 \\
.5000 & 0 & 0 & 0 & 0 \\
.5000 & 0 & .7071 & 0 & .7071 \\
0 & 0 & 0 & .5774 & 0 \\
0 & 0 & 0 & .5774 & 0 \\
0 & .7071 & 0 & 0 & 0
\end{pmatrix}
$$

## Analysis of the Term-Term Comparison

For this we use the following formula, where $e_i$ is the i-th column of the t × t identity matrix, so that $G^T e_i$ is the i-th row of G:

$$
\cos\theta_{ij} = \frac{(e_i^T G)(G^T e_j)}{\|G^T e_i\|_2 \, \|G^T e_j\|_2}
$$

## Clustering

- Clustering is the process by which related terms are grouped, such as bike, endurance, and training.
- First the terms are split into groups of related terms.
- The terms in each group are placed such that their vectors are almost parallel.

## Clusters

In this example:

- The first cluster is running.
- The second cluster is bike, endurance, and training.
- The third is band and music.
- The fourth is fishes.

## Analyzing the Term-Term Comparison

We again use the SVD rank-k approximation. Thus the cosines of the angles become:

$$
\cos\theta_{ij} = \frac{(e_i^T U_k \Sigma_k)(\Sigma_k U_k^T e_j)}{\|\Sigma_k U_k^T e_i\|_2 \, \|\Sigma_k U_k^T e_j\|_2}
$$

## Conclusion

- Through the use of this model, many libraries and smaller collections can index their documents.
- However, as the next presentation will show, a different approach is used for large collections such as the internet.
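The term-term cosine formula from the second example can be tried out with a small plain-Python sketch. It assumes the 0/1 term incidence read off the five titles and normalizes each column to unit length to obtain G; the names `INCIDENCE`, `normalize_columns`, and `term_cosine` are hypothetical. It computes only the raw cosines between rows of G and does not attempt to reproduce the clustering itself.

```python
import math

TERMS = ["run", "bike", "endurance", "training", "band", "music", "fishes"]

# 0/1 incidence of each term (row) in documents D1..D5 (columns),
# read off the titles of the second example.
INCIDENCE = [
    [1, 1, 1, 1, 1],  # run(ning)
    [1, 0, 0, 0, 0],  # bike
    [1, 0, 0, 0, 0],  # endurance
    [1, 0, 1, 0, 1],  # training
    [0, 0, 0, 1, 0],  # band
    [0, 0, 0, 1, 0],  # music
    [0, 1, 0, 0, 0],  # fishes
]


def normalize_columns(matrix):
    """Scale every column of the matrix to unit Euclidean length."""
    n_cols = len(matrix[0])
    norms = [math.sqrt(sum(row[j] ** 2 for row in matrix)) for j in range(n_cols)]
    return [[row[j] / norms[j] for j in range(n_cols)] for row in matrix]


def term_cosine(G, i, j):
    """cos(theta_ij) = (e_i^T G)(G^T e_j) / (||G^T e_i||_2 ||G^T e_j||_2)."""
    row_i, row_j = G[i], G[j]  # G^T e_i is simply row i of G
    dot = sum(a * b for a, b in zip(row_i, row_j))
    norm_i = math.sqrt(sum(a * a for a in row_i))
    norm_j = math.sqrt(sum(b * b for b in row_j))
    return dot / (norm_i * norm_j)


G = normalize_columns(INCIDENCE)
n = len(TERMS)
cos = [[term_cosine(G, i, j) for j in range(n)] for i in range(n)]

print(round(cos[1][2], 4))  # bike vs endurance: 1.0
print(round(cos[4][5], 4))  # band vs music: 1.0
print(round(cos[0][3], 4))  # run vs training: 0.7746
print(cos[4][6])            # band vs fishes: 0.0
```

Terms that occur in exactly the same documents, such as bike and endurance or band and music, have parallel row vectors and a cosine of 1, which is the sense in which clustered terms are "almost parallel".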