# Lecture 6: tf-idf (handout)


Introduction to Information Retrieval

## Count vectors

Consider the number of occurrences of a term in a document: each document is a count vector in ℕ^|V|, a column in the matrix below. (Contrast this with the binary incidence matrix, where each document is represented by a binary vector ∈ {0,1}^|V|.)

|           | Antony and Cleopatra | Julius Caesar | The Tempest | Hamlet | Othello | Macbeth |
|-----------|---------------------:|--------------:|------------:|-------:|--------:|--------:|
| Antony    | 157 | 73  | 0 | 0 | 0 | 0 |
| Brutus    | 4   | 157 | 0 | 1 | 0 | 0 |
| Caesar    | 232 | 227 | 0 | 2 | 1 | 1 |
| Calpurnia | 0   | 10  | 0 | 0 | 0 | 0 |
| Cleopatra | 57  | 0   | 0 | 0 | 0 | 0 |
| mercy     | 2   | 0   | 3 | 5 | 5 | 1 |
| worser    | 2   | 0   | 1 | 1 | 1 | 0 |

## Bag of words model

- The vector representation doesn't consider the ordering of words in a document.
- "John is quicker than Mary" and "Mary is quicker than John" have the same vectors.
- This is called the bag of words model.
- In a sense, this is a step back: the positional index was able to distinguish these two documents.
- We will look at recovering positional information later in this course.
- For now: bag of words model.

## Term frequency

- The term frequency tf_{t,d} of term t in document d is defined as the number of times that t occurs in d.
- We want to use tf when computing query-document match scores. But how?
- Raw term frequency is not what we want: a document with 10 occurrences of the term is more relevant than a document with 1 occurrence, but not 10 times more relevant.

## Log-frequency weighting (Sec. 6.2)

- The log-frequency weight of term t in d is

  w_{t,d} = 1 + log10(tf_{t,d})  if tf_{t,d} > 0,  and 0 otherwise.

- tf → weight: 0 → 0, 1 → 1, 2 → 1.3, 10 → 2, 1000 → 4, etc.
- Score for a document-query pair: sum over terms t in both q and d:

  score(q, d) = Σ_{t ∈ q ∩ d} (1 + log10 tf_{t,d})
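The bag-of-words point can be seen in a short sketch: under a count-vector representation, two documents with the same words in a different order are indistinguishable. This uses the slide's own example sentences, with simple whitespace tokenization as an illustrative assumption.

```python
from collections import Counter

# Two documents containing the same words in different order
# (the example from the slides).
d1 = "John is quicker than Mary"
d2 = "Mary is quicker than John"

# Bag of words: each document becomes a multiset (count vector) of terms,
# so word order is discarded.
v1 = Counter(d1.lower().split())
v2 = Counter(d2.lower().split())

print(v1 == v2)  # True: identical count vectors
```

A positional index would still distinguish the two documents; the count vectors cannot.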
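A minimal sketch of the log-frequency weight and the overlap score defined above (Python; the function names and the whitespace tokenization are assumptions for illustration, not part of the lecture):

```python
import math
from collections import Counter

def log_tf_weight(tf: int) -> float:
    """w_{t,d} = 1 + log10(tf_{t,d}) if tf_{t,d} > 0, else 0."""
    return 1 + math.log10(tf) if tf > 0 else 0.0

def score(query: str, doc: str) -> float:
    """Sum of log-frequency weights over terms occurring in both q and d."""
    tf = Counter(doc.lower().split())          # raw term frequencies in d
    q_terms = set(query.lower().split())       # distinct query terms
    return sum(log_tf_weight(tf[t]) for t in q_terms if tf[t] > 0)

# The slide's tf -> weight examples: 2 -> 1.3, 10 -> 2, 1000 -> 4
print(round(log_tf_weight(2), 1))  # 1.3
print(log_tf_weight(10))           # 2.0
print(log_tf_weight(1000))         # 4.0
```

Note that terms absent from the document contribute nothing to the score, matching the sum over t ∈ q ∩ d.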