measures how well document and query match.

Introduction to Information Retrieval, Ch. 6

Query-document matching scores
• We need a way of assigning a score to a query/document pair.
• Let's start with a one-term query.
• If the query term does not occur in the document, the score should be 0.
• The more frequent the query term in the document, the higher the score (should be).
• We will look at a number of alternatives for this.

Take 1: Jaccard coefficient
• Recall from Lecture 3: a commonly used measure of the overlap of two sets A and B
• jaccard(A,B) = |A ∩ B| / |A ∪ B|
• jaccard(A,A) = 1
• jaccard(A,B) = 0 if A ∩ B = ∅
• A and B don't have to be the same size.
• Always assigns a number between 0 and 1.

Jaccard coefficient: Scoring example
• What is the query-document match score that the Jaccard coefficient computes for each of the two documents below?
• Query: ides of march
• Document 1: caesar died in march
• Document 2: the long march

Issues with Jaccard for scoring
• It doesn't consider term frequency (how many times a term occurs in a document).
• Rare terms in a collection are more informative than frequent terms. Jaccard doesn't consider this information.
• We need a more sophisticated way of normalizing for length.
• Later in this lecture, we'll use |A ∩ B| / √|A ∪ B| instead of |A ∩ B| / |A ∪ B| (Jaccard) for length normalization.

Recall (Lecture 1): Binary term-document incidence matrix (Sec. 6.2)

Term       Antony and Cleopatra  Julius Caesar  The Tempest  Hamlet  Othello  Macbeth
Antony     1                     1              0            0       0        1
Brutus     1                     1              0            1       0        0
Caesar     1                     1              0            1       1        1
Calpurnia  0                     1              0            0       0        0
Cleopatra  1                     0              0            0       0        0
mercy      1                     0              1            1       1        1
worser     1                     0              1            1       1        0

Term-document count matrices (Sec. 6.2)
• Consider the number...
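
The Jaccard scoring example above can be worked through with a short sketch. It assumes the simplest possible tokenization (lowercase, split on whitespace) and treats the query and each document as a set of terms. The helper names `tokenize` and `jaccard` are illustrative, not from the slides, and returning 1.0 for two empty sets is a convention chosen here.

```python
def tokenize(text):
    """Lowercase and split on whitespace (an assumed, deliberately naive tokenizer)."""
    return text.lower().split()


def jaccard(a, b):
    """Jaccard coefficient |A ∩ B| / |A ∪ B| of two term collections treated as sets."""
    a, b = set(a), set(b)
    if not a and not b:
        # Assumption: two empty sets count as identical, avoiding division by zero.
        return 1.0
    return len(a & b) / len(a | b)


query = tokenize("ides of march")
doc1 = tokenize("caesar died in march")
doc2 = tokenize("the long march")

score1 = jaccard(query, doc1)  # 1 shared term, 6 distinct terms overall
score2 = jaccard(query, doc2)  # 1 shared term, 5 distinct terms overall

print(f"Document 1: {score1:.3f}")  # Document 1: 0.167
print(f"Document 2: {score2:.3f}")  # Document 2: 0.200

# Term frequency is invisible to Jaccard: repeating "march" changes nothing.
repeated = jaccard(query, tokenize("march march march"))
single = jaccard(query, tokenize("march"))
print(repeated == single)  # True
```

Both documents share only the term "march" with the query, so Document 2 scores higher purely because it is shorter (1/5 versus 1/6). The last comparison shows the first issue listed above: a document that repeats the query term gets the same score as one that mentions it once.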
