class08-end-vector


Outline

1 Recap
2 Why rank?
3 Implementation
4 The complete search system

Term frequency weighting

The log frequency weight of term $t$ in document $d$ is defined as follows:

$$w_{t,d} = \begin{cases} 1 + \log_{10} \mathrm{tf}_{t,d} & \text{if } \mathrm{tf}_{t,d} > 0 \\ 0 & \text{otherwise} \end{cases}$$

Score for a document-query pair: sum over the terms $t$ that occur in both $q$ and $d$:

$$\text{matching-score} = \sum_{t \in q \cap d} (1 + \log \mathrm{tf}_{t,d})$$

idf weight

$\mathrm{df}_t$ is the document frequency, the number of documents that $t$ occurs in.
df is an inverse measure of the informativeness of the term: the more documents a term occurs in, the less informative it is.
We define the idf weight of term $t$ as follows, where $N$ is the number of documents in the collection:

$$\mathrm{idf}_t = \log_{10} \frac{N}{\mathrm{df}_t}$$

idf is a measure of the informativeness of the term.

tf-idf weighting

The tf-idf weight of a term is the product of its tf weight and its idf weight:

$$w_{t,d} = (1 + \log \mathrm{tf}_{t,d}) \cdot \log \frac{N}{\mathrm{df}_t}$$

This is the best known weighting scheme in information retrieval.

Cosine similarity between query and document

$$\cos(\vec{q}, \vec{d}) = \mathrm{sim}(\vec{q}, \vec{d}) = \frac{\vec{q} \cdot \vec{d}}{|\vec{q}|\,|\vec{d}|} = \frac{\sum_{i=1}^{|V|} q_i d_i}{\sqrt{\sum_{i=1}^{|V|} q_i^2}\,\sqrt{\sum_{i=1}^{|V|} d_i^2}}$$

$q_i$ is the tf-idf weight of term $i$ in the query.
$d_i$ is the tf-idf weight of term $i$ in the document.
$|\vec{q}|$ and $|\vec{d}|$ are the lengths of $\vec{q}$ and $\vec{d}$.

Cosine similarity illustrated

[Figure: the query vector $\vec{v}(q)$ and the document vectors $\vec{v}(d_1)$, $\vec{v}(d_2)$, $\vec{v}(d_3)$ drawn in a two-dimensional space whose axes are the terms "jealous" and "gossip".]

tf-idf example: ltn.lnc

Query: best car insurance.
Document: car insurance auto insurance.

word       | query                                  | document                         | product
           | tf-raw  tf-wght  df      idf   weight  | tf-raw  tf-wght  weight  nlized  |
auto       | 0       0        5000    2.3   0       | 1       1        1       0.52    | 0
best       | 1       1        50000   1.3   1.3     | 0       0        0       0       | 0
car        | 1       1        10000   2.0   2.0     | 1       1        1       0.52    | 1.04
insurance  | 1       1        1000    3.0   3.0     | 2       1.3      1.3     0.68    | 2.04

Key to columns:
tf-raw: raw (unweighted) term frequency
tf-wght: logarithmically weighted term frequency
df: document frequency
idf: inverse document frequency
weight: the final weight of the term in the query or document
nlized: document weights after cosine normalization
product: the product of final query weight and final document weight

Length of the document vector: $\sqrt{1^2 + 0^2 + 1^2 + 1.3^2} \approx 1.92$, so the normalized document weights are $1/1.92 \approx 0.52$ and $1.3/1.92 \approx 0.68$.

Final similarity score between query and document:

$$\sum_i w_{qi} \cdot w_{di} = 0 + 0 + 1.04 + 2.04 = 3.08$$
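The ltn.lnc example above can be reproduced with a short Python sketch. It is an illustration, not code from the lecture: the function names `log_tf` and `ltn_lnc_score` are hypothetical, and the collection size `N_DOCS = 1_000_000` is an assumption inferred from the idf column (for instance, a df of 1000 giving an idf of 3.0), since the slides shown here do not state N.

```python
import math
from collections import Counter


def log_tf(tf):
    """Log frequency weight: 1 + log10(tf) if tf > 0, else 0."""
    return 1 + math.log10(tf) if tf > 0 else 0.0


def ltn_lnc_score(query_terms, doc_terms, df, n_docs):
    """Score a query (ltn) against a document (lnc).

    Query weights: log tf times idf, no normalization.
    Document weights: log tf, no idf, cosine normalization.
    """
    q_tf = Counter(query_terms)
    d_tf = Counter(doc_terms)

    # Query side: (1 + log10 tf) * log10(N / df)
    q_weights = {t: log_tf(c) * math.log10(n_docs / df[t]) for t, c in q_tf.items()}

    # Document side: (1 + log10 tf), then divide by the vector length
    d_raw = {t: log_tf(c) for t, c in d_tf.items()}
    d_length = math.sqrt(sum(w * w for w in d_raw.values()))
    d_weights = {t: w / d_length for t, w in d_raw.items()}

    # Dot product over the terms that occur in both query and document
    return sum(w * d_weights.get(t, 0.0) for t, w in q_weights.items())


# Document frequencies from the example table; N_DOCS is an assumed value.
DF = {"auto": 5000, "best": 50000, "car": 10000, "insurance": 1000}
N_DOCS = 1_000_000

score = ltn_lnc_score(
    "best car insurance".split(),
    "car insurance auto insurance".split(),
    DF,
    N_DOCS,
)
print(round(score, 2))
```

Without intermediate rounding the score comes out at about 3.07. The slide's 3.08 arises because the normalized document weights are rounded to 0.52 and 0.68 before the products are taken.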