Outline

1. Recap
2. Why rank?
3. Implementation
4. The complete search system

Recap: Term frequency weighting

The log frequency weight of term t in d is defined as follows:

\[
w_{t,d} =
\begin{cases}
1 + \log_{10} \mathrm{tf}_{t,d} & \text{if } \mathrm{tf}_{t,d} > 0 \\
0 & \text{otherwise}
\end{cases}
\]

Score for a document-query pair: sum over terms t in both q and d:

\[
\text{matching-score} = \sum_{t \in q \cap d} \left(1 + \log \mathrm{tf}_{t,d}\right)
\]

Recap: idf weight

df_t is the document frequency, the number of documents that t occurs in.
df is an inverse measure of the informativeness of the term.
We define the idf weight of term t as follows, where N is the number of documents in the collection:

\[
\mathrm{idf}_t = \log_{10} \frac{N}{\mathrm{df}_t}
\]

idf is a measure of the informativeness of the term.

Recap: tf-idf weighting

The tf-idf weight of a term is the product of its tf weight and its idf weight:

\[
w_{t,d} = \left(1 + \log \mathrm{tf}_{t,d}\right) \cdot \log \frac{N}{\mathrm{df}_t}
\]

This is the best known weighting scheme in information retrieval.

Recap: Cosine similarity between query and document

\[
\cos(\vec{q}, \vec{d}) = \mathrm{sim}(\vec{q}, \vec{d})
= \frac{\vec{q} \cdot \vec{d}}{|\vec{q}|\,|\vec{d}|}
= \frac{\sum_{i=1}^{|V|} q_i d_i}{\sqrt{\sum_{i=1}^{|V|} q_i^2}\,\sqrt{\sum_{i=1}^{|V|} d_i^2}}
\]

q_i is the tf-idf weight of term i in the query.
d_i is the tf-idf weight of term i in the document.
|\vec{q}| and |\vec{d}| are the lengths of \vec{q} and \vec{d}.

Recap: Cosine similarity illustrated

[Figure: the vectors v(q), v(d1), v(d2) and v(d3) drawn in a two-dimensional space with axes "gossip" and "jealous", each axis running to 1, with an angle θ marked.]

Recap: tf-idf example: ltn.lnc

Query: "best car insurance". Document: "car insurance auto insurance".

word      | query                              | document                       | product
          | tfraw  tfwght  df     idf  weight  | tfraw  tfwght  weight  n'lized |
auto      | 0      0       5000   2.3  0       | 1      1       1       0.52    | 0
best      | 1      1       50000  1.3  1.3     | 0      0       0       0       | 0
car       | 1      1       10000  2.0  2.0     | 1      1       1       0.52    | 1.04
insurance | 1      1       1000   3.0  3.0     | 2      1.3     1.3     0.68    | 2.04

Key to columns:
tfraw: raw (unweighted) term frequency
tfwght: logarithmically weighted term frequency
df: document frequency
idf: inverse document frequency
weight: the final weight of the term in the query or document
n'lized: document weights after cosine normalization
product: the product of final query weight and final document weight

Length of the document vector and the normalized document weights:

\[
\sqrt{1^2 + 0^2 + 1^2 + 1.3^2} \approx 1.92, \qquad
1 / 1.92 \approx 0.52, \qquad
1.3 / 1.92 \approx 0.68
\]

Final similarity score between query and document:

\[
\sum_i w_{q_i} \cdot w_{d_i} = 0 + 0 + 1.04 + 2.04 = 3.08
\]
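The ltn.lnc example above can be checked with a short script. The following is a minimal Python sketch, not the course's implementation: the function names (`log_tf`, `idf`, `cosine_normalize`, `ltn_lnc_score`) are hypothetical, and the collection size N = 1,000,000 is an assumption inferred from the df and idf values in the table (for example, log10(N / 1000) = 3.0), since the slides do not state N.

```python
import math


def log_tf(tf):
    """Log-frequency weight: 1 + log10(tf) if tf > 0, otherwise 0."""
    return 1.0 + math.log10(tf) if tf > 0 else 0.0


def idf(n_docs, df):
    """Inverse document frequency: log10(N / df)."""
    return math.log10(n_docs / df)


def cosine_normalize(weights):
    """Divide every weight by the Euclidean length of the vector."""
    length = math.sqrt(sum(w * w for w in weights.values()))
    if length == 0.0:
        return {term: 0.0 for term in weights}
    return {term: w / length for term, w in weights.items()}


def ltn_lnc_score(query_tf, doc_tf, df, n_docs):
    """Score a query (ltn) against a document (lnc).

    Query side:    log tf, idf, no normalization.
    Document side: log tf, no idf, cosine normalization.
    """
    q = {t: log_tf(tf) * idf(n_docs, df[t]) for t, tf in query_tf.items()}
    d = cosine_normalize({t: log_tf(tf) for t, tf in doc_tf.items()})
    # Terms missing from the document contribute 0 to the dot product.
    return sum(w * d.get(t, 0.0) for t, w in q.items())


# Assumption: N is not given in the slides; 1,000,000 reproduces the idf column.
N_DOCS = 1_000_000
DF = {"auto": 5000, "best": 50000, "car": 10000, "insurance": 1000}

QUERY_TF = {"best": 1, "car": 1, "insurance": 1}   # "best car insurance"
DOC_TF = {"car": 1, "insurance": 2, "auto": 1}     # "car insurance auto insurance"

score = ltn_lnc_score(QUERY_TF, DOC_TF, DF, N_DOCS)
print(f"{score:.2f}")
```

Without intermediate rounding the script prints 3.07. The slide's 3.08 comes from rounding the normalized document weights to 0.52 and 0.68 before multiplying.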
This note was uploaded on 01/21/2011 for the course CSCP 689 taught by Professor James during the Spring '10 term at Texas A&M.