lecture5-compression-handout-6-per

# Assume a search engine indexes a total of 20,000,000,000 …



## Vocabulary vs. collection size (Sec. 5.1)

- How big is the term vocabulary? That is, how many distinct words are there?
- Can we assume an upper bound?
- Not really: there are at least $70^{20} \approx 10^{37}$ different words of length 20.
- In practice, the vocabulary will keep growing with the collection size, especially with Unicode.

## Lossy compression

- Lossy compression: discard some information.
- Several of the preprocessing steps can be viewed as lossy compression: case folding, stop word removal, stemming, number elimination (see the sketch at the end of this section).
- Chap./Lecture 7: prune postings entries that are unlikely to turn up in the top $k$ list for any query.
- Almost no loss of quality for the top $k$ list.

## Heaps' law (Sec. 5.1)

- Heaps' law: $M = kT^b$, where $M$ is the size of the vocabulary and $T$ is the number of tokens in the collection.
- Typical values: $30 \le k \le 100$ and $b \approx 0.5$.
- In a log-log plot of vocabulary size $M$ vs. $T$, Heaps' law predicts a line with slope about 1/2.
- It is the simplest possible relationship between the two in log-log space.
- An empirical finding ("empirical law").

## Heaps' law for Reuters RCV1 (Sec. 5.1)

[Fig. 5.1, p. 81: vocabulary size $M$ vs. collection size $T$ for RCV1 on log-log axes.]

- For RCV1, the dashed line $\log_{10} M = 0.49 \log_{10} T + 1.64$ is the best least-squares fit.
- Thus $M = 10^{1.64}\, T^{0.49}$, so $k = 10^{1.64} \approx 44$ and $b = 0.49$.
- Good empirical fit for Reuters RCV1!
- For the first 1,000,020 tokens, the law predicts 38,323 terms; 38,365 terms are actually observed (checked numerically below).

## Exercises / Zipf's law (Sec. 5.1)

- What is the effect of including spelling errors, vs. automatically cor…
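To make the "preprocessing as lossy compression" point concrete, here is a minimal Python sketch. The tiny stop list and the suffix-stripping "stemmer" are toy stand-ins invented for illustration (the book uses a full stop list and the Porter stemmer); the point is only that each step collapses distinct surface forms and shrinks the vocabulary.

```python
import re

# Toy stop list; a stand-in for the real one used in the book.
STOP_WORDS = {"the", "a", "an", "and", "of", "to", "in"}

def preprocess(tokens):
    out = []
    for t in tokens:
        t = t.lower()                      # case folding
        if t in STOP_WORDS:                # stop word removal
            continue
        if re.fullmatch(r"\d+", t):        # number elimination
            continue
        for suffix in ("ing", "ed", "s"):  # crude stemming stand-in
            if t.endswith(suffix) and len(t) > len(suffix) + 2:
                t = t[: -len(suffix)]
                break
        out.append(t)
    return out

tokens = "The 2 Cars car 20 cars and the CAR raced racing".split()
print(len(set(tokens)), "distinct raw tokens")            # 11
print(len(set(preprocess(tokens))), "after lossy steps")  # 2: {'car', 'rac'}
```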
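And a minimal numeric check of the RCV1 fit quoted above, using only the slide's own constants ($k = 44$, $b = 0.49$):

```python
# Heaps' law check for the RCV1 numbers on the slide: M = k * T**b,
# with the fitted k = 44 (i.e. 10**1.64, rounded) and b = 0.49.
k, b = 44, 0.49

T = 1_000_020          # tokens in the first part of RCV1
M_predicted = k * T**b
M_actual = 38_365      # distinct terms actually observed

print(f"predicted vocabulary: {M_predicted:,.0f}")  # ~38,323
print(f"actual vocabulary:    {M_actual:,}")
print(f"relative error:       {abs(M_actual - M_predicted) / M_actual:.2%}")
```

The prediction is off by only about 0.1%, which is what "good empirical fit" means here.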
