3/12/12

So many ways things can go wrong…

Reasons that ideal effectiveness is hard to achieve:
1. Document representation loses information.
2. Users are unable to describe their queries precisely.
3. The similarity function used may not be good enough.
4. The importance/weight of a term in representing a document or query may be inaccurate.
5. The same term may have multiple meanings, and different terms may have similar meanings.

Remedies: query expansion, relevance feedback, LSI, cooccurrence analysis.

Improving Vector Space Ranking

• We will consider three techniques:
– Relevance feedback, which tries to improve the query quality (we will do this later, with text classification).
– Correlation analysis, which looks at correlations between keywords (and thus effectively computes a thesaurus based on word occurrence in the documents) to do query elaboration.
– Principal Components Analysis (also called Latent Semantic Indexing), which subsumes correlation analysis and does dimensionality reduction.

Correlation/Cooccurrence analysis

• Terms that are related to terms in the original query may be added to the query.
• Two terms are related if they have high cooccurrence in documents.

Let n be the number of documents; let n1 and n2 be the number of documents containing terms t1 and t2, and m the number of documents containing both t1 and t2. The deviation from independence measures the degree of correlation:

If t1 and t2 are independent:           m/n ≈ (n1/n) × (n2/n)
If t1 and t2 are positively correlated: m/n >> (n1/n) × (n2/n)
If t1 and t2 are inversely correlated:  m/n << (n1/n) × (n2/n)

Correlation

• There are 1000 documents in my corpus.
• The keyword k1 is in 25% of them.
• The keyword k2 is in 32% of them.
• Are k1 and k2 correlated positively, correlated negatively, or not correlated?

Let the fraction of documents that have k1 and k2 together be x%. Now can you answer? (If independent, we expect x ≈ 25% × 32% = 8%.)
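The independence test above can be sketched in a few lines of Python. The counts are the ones from the slide's example; the 25% relative tolerance used to decide what counts as ">>" is an arbitrary choice for illustration:

```python
def correlation_direction(n, n1, n2, m, tol=0.25):
    """Classify the cooccurrence of two terms t1 and t2.

    n  -- total number of documents
    n1 -- number of documents containing t1
    n2 -- number of documents containing t2
    m  -- number of documents containing both t1 and t2

    Under independence we expect m/n ~ (n1/n) * (n2/n); a large
    deviation in either direction signals +ve or -ve correlation.
    `tol` is an arbitrary relative threshold for "large".
    """
    expected = (n1 / n) * (n2 / n)   # expected joint fraction if independent
    observed = m / n
    if observed > expected * (1 + tol):
        return "positively correlated"
    if observed < expected * (1 - tol):
        return "negatively correlated"
    return "independent"

# The slide's example: 1000 documents, k1 in 25%, k2 in 32%.
# Independence predicts 0.25 * 0.32 = 8% joint occurrence.
print(correlation_direction(1000, 250, 320, 200))  # x = 20% >> 8%
print(correlation_direction(1000, 250, 320, 80))   # x =  8%, independent
print(correlation_direction(1000, 250, 320, 20))   # x =  2% << 8%
```

In practice the threshold would be replaced by a proper statistical test (e.g. chi-squared), but the comparison against the expected 8% is exactly the one the slide asks you to make.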
If x >> 8%: +ve correlation. If x << 8%: -ve correlation. If x ≈ 8%: independent.

Terms and Docs as mutually dependent vectors

A term-document matrix (documents a–i as columns, terms as rows; dots mark zeros):

            a  b  c  d  e  f  g  h  i
Interface   1  .  1  .  .  .  .  .  .
User        .  1  1  .  1  .  .  .  .
System      .  1  1  2  .  .  .  .  .
Human       1  .  .  1  .  .  .  .  .
Computer    1  1  .  .  .  .  .  .  .
Response    .  1  .  .  1  .  .  .  .
Time        .  1  .  .  1  .  .  .  .
EPS         .  .  1  1  .  .  .  .  .
Survey      .  1  .  .  .  .  .  .  1
Trees       .  .  .  .  .  1  1  1  .
Graph       .  .  .  .  .  .  1  1  1
Minors      .  .  .  .  .  .  .  1  1

Each row is a term vector (over documents); each column is a document vector (over terms). So in addition to doc-doc similarity, we can compute term-term distance:

• If terms are independent, the term-term (TT) similarity matrix should be diagonal.
• If it is not diagonal, we can use the correlations to add related terms to the query.
• But we can also ask the question: “Are there independent dimensions which define the space where terms & docs are vectors?”
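The term-term similarity matrix described above can be computed directly from the term-document matrix with NumPy. This is a minimal sketch using the matrix from the table (the classic Deerwester et al. LSI example); the final SVD call previews the "independent dimensions" question:

```python
import numpy as np

# Term-document matrix from the table: rows are terms, columns are
# documents a..i.
terms = ["interface", "user", "system", "human", "computer",
         "response", "time", "eps", "survey", "trees", "graph", "minors"]
A = np.array([
    [1, 0, 1, 0, 0, 0, 0, 0, 0],  # interface
    [0, 1, 1, 0, 1, 0, 0, 0, 0],  # user
    [0, 1, 1, 2, 0, 0, 0, 0, 0],  # system
    [1, 0, 0, 1, 0, 0, 0, 0, 0],  # human
    [1, 1, 0, 0, 0, 0, 0, 0, 0],  # computer
    [0, 1, 0, 0, 1, 0, 0, 0, 0],  # response
    [0, 1, 0, 0, 1, 0, 0, 0, 0],  # time
    [0, 0, 1, 1, 0, 0, 0, 0, 0],  # eps
    [0, 1, 0, 0, 0, 0, 0, 0, 1],  # survey
    [0, 0, 0, 0, 0, 1, 1, 1, 0],  # trees
    [0, 0, 0, 0, 0, 0, 1, 1, 1],  # graph
    [0, 0, 0, 0, 0, 0, 0, 1, 1],  # minors
])

# Term-term similarity: entry (i, j) counts the (weight-scaled)
# cooccurrences of term i and term j across documents.
TT = A @ A.T

# If terms were truly independent, TT would be (near) diagonal.
# Off-diagonal entries reveal correlated terms that could be added
# to a query mentioning either one.
i, j = terms.index("trees"), terms.index("graph")
print(TT[i, j])  # "trees" and "graph" cooccur in documents g and h -> 2

# The "independent dimensions" question is answered by the SVD,
# which is the heart of LSI: keeping only the top-k singular values
# gives a k-dimensional space in which both terms and documents
# live as vectors.
U, s, Vt = np.linalg.svd(A.astype(float), full_matrices=False)
```

Note that `TT` is symmetric by construction, so only the upper triangle needs to be inspected when harvesting candidate expansion terms.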
Spring '08, RAO: Singular value decomposition, LSI, Dimensionality Reduction
