lecture6-tfidf-handout-6-per

... measures how well document and query match.

Introduction to Information Retrieval, Ch. 6

Query-document matching scores
- We need a way of assigning a score to a query/document pair.
- Let's start with a one-term query.
- If the query term does not occur in the document, the score should be 0.
- The more frequent the query term in the document, the higher the score (should be).
- We will look at a number of alternatives for this.

Take 1: Jaccard coefficient
- Recall from Lecture 3: a commonly used measure of overlap of two sets A and B
- jaccard(A,B) = |A ∩ B| / |A ∪ B|
- jaccard(A,A) = 1
- jaccard(A,B) = 0 if A ∩ B = ∅
- A and B don't have to be the same size.
- Always assigns a number between 0 and 1.

Jaccard coefficient: Scoring example
- What is the query-document match score that the Jaccard coefficient computes for each of the two documents below?
- Query: ides of march
- Document 1: caesar died in march
- Document 2: the long march

Issues with Jaccard for scoring
- It doesn't consider term frequency (how many times a term occurs in a document).
- Rare terms in a collection are more informative than frequent terms. Jaccard doesn't consider this information.
- We need a more sophisticated way of normalizing for length.
- Later in this lecture, we'll use |A ∩ B| / √|A ∪ B| instead of |A ∩ B| / |A ∪ B| (Jaccard) for length normalization.

Recall (Lecture 1): Binary term-document incidence matrix (Sec. 6.2)

Term       Antony and Cleopatra  Julius Caesar  The Tempest  Hamlet  Othello  Macbeth
Antony     1                     1              0            0       0        1
Brutus     1                     1              0            1       0        0
Caesar     1                     1              0            1       1        1
Calpurnia  0                     1              0            0       0        0
Cleopatra  1                     0              0            0       0        0
mercy      1                     0              1            1       1        1
worser     1                     0              1            1       1        0

Term-document count matrices (Sec. 6.2)
- Consider the number...
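Returning to the Jaccard scoring example (query "ides of march" against the two short documents), the computation can be sketched in a few lines of Python. This is only an illustrative sketch, not code from the lecture: the helper names `tokenize` and `jaccard` are made up here, tokens are assumed to be lowercase whitespace-separated words, and the score for two empty sets is assumed to be 0.

```python
def tokenize(text):
    """Split text into a set of lowercase whitespace-separated terms (an assumed, simplistic tokenizer)."""
    return set(text.lower().split())


def jaccard(a, b):
    """Jaccard coefficient |A ∩ B| / |A ∪ B| of two sets."""
    union = a | b
    if not union:
        # Assumption: define the score as 0 when both sets are empty.
        return 0.0
    return len(a & b) / len(union)


query = tokenize("ides of march")
documents = {
    "Document 1": tokenize("caesar died in march"),
    "Document 2": tokenize("the long march"),
}

# Score each document against the query.
scores = {name: jaccard(query, terms) for name, terms in documents.items()}

for name, score in scores.items():
    print(f"{name}: {score:.3f}")
```

Both documents share exactly one term ("march") with the query, so the scores are 1/6 for Document 1 (union of 6 terms) and 1/5 for Document 2 (union of 5 terms). The shorter document scores higher purely because of its length, and how often "march" occurs plays no role, which is the behavior the "Issues with Jaccard for scoring" slide points at.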
This document was uploaded on 02/26/2014.