lecture5-compression-handout-6-per


you keep more in memory
We will devise various IR-specific compression schemes
Introduction to Information Retrieval

Sec. 5.1
Recall Reuters RCV1

symbol  statistic                                        value
N       documents                                        800,000
L       avg. # tokens per doc                            200
M       terms (= word types)                             ~400,000
        avg. # bytes per token (incl. spaces/punct.)     6
        avg. # bytes per token (without spaces/punct.)   4.5
        avg. # bytes per term                            7.5
        non-positional postings                          100,000,000

Sec. 5.1
Index parameters vs. what we index (details IIR Table 5.1, p.80)

size of         word types (terms)       non-positional postings   positional postings
                dictionary               non-positional index      positional index
                Size (K)  ∆%   cumul %   Size (K)  ∆%   cumul %    Size (K)  ∆%   cumul %
Unfiltered      484                      109,971                   197,879
No numbers      474       -2   -2        100,680   -8   -8         179,158   -9   -9
Case folding    392       -17  -19       96,969    -3   -12        179,158   0    -9
30 stopwords    391       -0   -19       83,390    -14  -24        121,858   -31  -38
150 stopwords   391       -0   -19       67,002    -30  -39        94,517    -47  -52
stemming        322       -17  -33       63,812    -4   -42        94,517    0    -52

Exercise: give intuitions for all the ‘0’ entries. Why do some zero entries correspond to big deltas in other columns?
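The ∆% and cumul % columns can be recomputed from the Size (K) columns. The following is a minimal sketch, not the authors' procedure. It assumes that ∆% compares each row with the row before it, except that "150 stopwords" is compared with "Case folding" (the same baseline as "30 stopwords"), which is the only reading consistent with the printed -30 for non-positional postings. The names ROWS, BASELINE, pct_change and reductions are made up for this illustration. The table's rounding convention is not stated, so the sketch prints unrounded values, which can differ from the printed integers by up to one point.

```python
# Sizes in thousands (K), copied from the table:
# (row label, dictionary, non-positional postings, positional postings)
ROWS = [
    ("Unfiltered",    484, 109_971, 197_879),
    ("No numbers",    474, 100_680, 179_158),
    ("Case folding",  392,  96_969, 179_158),
    ("30 stopwords",  391,  83_390, 121_858),
    ("150 stopwords", 391,  67_002,  94_517),
    ("stemming",      322,  63_812,  94_517),
]

# Assumption: rows not listed here are compared with the previous row.
BASELINE = {"150 stopwords": "Case folding"}


def pct_change(new, old):
    """Signed percentage change from old to new (negative = reduction)."""
    return 100.0 * (new - old) / old


def reductions(rows, baseline=BASELINE):
    """Return {label: (delta_pcts, cumulative_pcts)} for every row but the first."""
    by_name = {name: sizes for name, *sizes in rows}
    unfiltered = rows[0][1:]
    out = {}
    prev_name = rows[0][0]
    for name, *sizes in rows[1:]:
        ref = by_name[baseline.get(name, prev_name)]
        delta = [pct_change(s, r) for s, r in zip(sizes, ref)]
        cumul = [pct_change(s, u) for s, u in zip(sizes, unfiltered)]
        out[name] = (delta, cumul)
        prev_name = name
    return out


if __name__ == "__main__":
    for name, (delta, cumul) in reductions(ROWS).items():
        cells = "  ".join(f"{d:6.1f} {c:6.1f}" for d, c in zip(delta, cumul))
        print(f"{name:14s} {cells}")
```

Running it makes the exercise concrete: case folding and stemming leave the positional column unchanged (they merge terms but remove no token occurrences), while stopword removal barely changes the dictionary yet cuts the postings columns sharply.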
Sec. 5.1
Lossless vs. lossy compression
  Lossless compression: All information is preserved.
  What we mostly do in IR.

Vocabulary vs. collection size
  How big is the term vocabulary?
  That is, how m...