IR-part2

# 6.2 Term-document count matrices: consider the number

Introduction to Information Retrieval

- …terms in a collection are more informative than frequent terms
- Jaccard doesn’t consider this information
- We need a more sophisticated way of normalizing for length
- Later in this lecture, we’ll use |A ∩ B| / √|A ∪ B| instead of |A ∩ B| / |A ∪ B| (Jaccard) for length normalization.

Scoring with the Jaccard coefficient

Term frequency weighting

Sec. 6.2

Recall: Binary term-document incidence matrix

| | Antony and Cleopatra | Julius Caesar | The Tempest | Hamlet | Othello | Macbeth |
|---|---|---|---|---|---|---|
| Antony | 1 | 1 | 0 | 0 | 0 | 1 |
| Brutus | 1 | 1 | 0 | 1 | 0 | 0 |
| Caesar | 1 | 1 | 0 | 1 | 1 | 1 |
| Calpurnia | 0 | 1 | 0 | 0 | 0 | 0 |
| Cleopatra | 1 | 0 | 0 | 0 | 0 | 0 |
| mercy | 1 | 0 | 1 | 1 | 1 | 1 |
| worser | 1 | 0 | 1 | 1 | 1 | 0 |

Each document is represented by a binary vector ∈ {0,1...
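To make the Jaccard scoring concrete, here is a minimal Python sketch that treats each play in the incidence matrix above as the set of terms marked 1 and scores a query against it. The helper names (`doc_terms`, `jaccard`, `sqrt_normalized_overlap`, `rank`) are illustrative and do not come from the slides. The second scoring function assumes the alternative length normalization is |A ∩ B| / √|A ∪ B|.

```python
from math import sqrt

# Binary term-document incidence matrix, copied from the slide.
PLAYS = ["Antony and Cleopatra", "Julius Caesar", "The Tempest",
         "Hamlet", "Othello", "Macbeth"]
INCIDENCE = {
    "Antony":    [1, 1, 0, 0, 0, 1],
    "Brutus":    [1, 1, 0, 1, 0, 0],
    "Caesar":    [1, 1, 0, 1, 1, 1],
    "Calpurnia": [0, 1, 0, 0, 0, 0],
    "Cleopatra": [1, 0, 0, 0, 0, 0],
    "mercy":     [1, 0, 1, 1, 1, 1],
    "worser":    [1, 0, 1, 1, 1, 0],
}


def doc_terms(play):
    """Return the set of terms whose entry is 1 in the column for `play`."""
    col = PLAYS.index(play)
    return {term for term, row in INCIDENCE.items() if row[col] == 1}


def jaccard(a, b):
    """Jaccard coefficient |A ∩ B| / |A ∪ B|; defined as 0.0 for two empty sets."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def sqrt_normalized_overlap(a, b):
    """Assumed alternative normalization: |A ∩ B| / sqrt(|A ∪ B|)."""
    union = a | b
    return len(a & b) / sqrt(len(union)) if union else 0.0


def rank(query_terms, score=jaccard):
    """Score every play against the query; highest score first.

    sorted() is stable, so tied plays keep their column order.
    """
    query = set(query_terms)
    scored = [(score(query, doc_terms(play)), play) for play in PLAYS]
    return sorted(scored, key=lambda pair: -pair[0])


if __name__ == "__main__":
    for s, play in rank({"Brutus", "Caesar"}):
        print(f"{s:.3f}  {play}")
```

Both scores see only which terms are present. They ignore how often a term occurs in a document and how rare it is in the collection, which is the limitation the bullets above point out.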
