lecture5-compression-handout-6-per



Lossy compression  (slide 9)

- Lossy compression: Discard some information.
- Several of the preprocessing steps can be viewed as lossy compression:
  case folding, stop words, stemming, number elimination (sketched in the
  example below).
- Chap/Lecture 7: Prune postings entries that are unlikely to turn up in
  the top k list for any query.
  - Almost no loss of quality for the top k list.
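To make the lossy-compression point concrete, here is a minimal Python
sketch (not from the handout) in which case folding, stop-word removal,
and number elimination each discard information by collapsing or dropping
distinct tokens. The tokenizer and the tiny stop list are illustrative
assumptions; a real pipeline would also apply a stemmer (e.g. Porter) to
merge forms such as "cat"/"cats".

    import re

    STOP_WORDS = {"the", "a", "an", "of", "to", "in"}  # toy stop list (assumed)

    def lossy_normalize(text):
        """Apply the lossy preprocessing steps named on the slide."""
        tokens = re.findall(r"\w+", text)                    # crude tokenizer
        tokens = [t.lower() for t in tokens]                 # case folding
        tokens = [t for t in tokens if t not in STOP_WORDS]  # stop-word removal
        tokens = [t for t in tokens if not t.isdigit()]      # number elimination
        return tokens

    print(lossy_normalize("The 2 CATS saw the cat"))  # -> ['cats', 'saw', 'cat']

Each step is irreversible: after case folding, "CATS" and "cats" can no
longer be distinguished, which is exactly what makes the compression lossy.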
Sec. 5.1  Vocabulary vs. collection size  (slide 10)

- How many distinct words are there?
- Can we assume an upper bound?
  - Not really: at least 70^20 ≈ 10^37 different words of length 20
    (taking an alphabet of roughly 70 characters).
- In practice, the vocabulary will keep growing with the collection size.
  - Especially with Unicode.
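A quick check of the slide's bound (the 70-character alphabet is an
assumption the slide leaves implicit):

    import math

    n = 70 ** 20              # strings of length 20 over a 70-character alphabet
    print(n)                  # about 8.0e36
    print(math.log10(n))      # about 36.9, i.e. on the order of 10^37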
Sec. 5.1  Heaps' Law  (slide 11)

- Heaps' law: M = k * T^b
  - M is the size of the vocabulary, T is the number of tokens in the
    collection.
  - Typical values: 30 ≤ k ≤ 100 and b ≈ 0.5.
- In a log-log plot of vocabulary size M vs. T, Heaps' law predicts a
  line with slope about ½ (see the worked example below).
- It is the simplest possible relationship between the two in log-log
  space.
- An empirical finding ("empirical law").
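As a worked example, Heaps' law transcribed directly into Python; the
default parameters below are the RCV1 values quoted on the next slide
(k ≈ 44, b = 0.49), not universal constants:

    def heaps_vocabulary_size(T, k=44.0, b=0.49):
        """Predicted vocabulary size M = k * T^b for a collection of T tokens."""
        return k * T ** b

    # With b near 0.5, vocabulary grows roughly like the square root of
    # collection size: quadrupling T only about doubles M.
    for T in (10**6, 4 * 10**6):
        print(f"T = {T:>9,} tokens -> M ~ {heaps_vocabulary_size(T):,.0f} terms")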
Sec. 5.1  Heaps' Law, Fig 5.1 p81  (slide 12)

[Fig 5.1: log-log plot of vocabulary size M vs. number of tokens T for
Reuters RCV1, with the best fit shown as a dashed line.]

- For RCV1, the dashed line

      log10 M = 0.49 log10 T + 1.64

  is the best least-squares fit.
- Thus M = 10^1.64 * T^0.49, so k = 10^1.64 ≈ 44 and b = 0.49.
- Good empirical fit for Reuters RCV1!
- For the first 1,000,020 tokens, the law predicts 38,323 terms;
  actually, 38,365 terms.
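Reproducing the slide's arithmetic (the rounding of k matters a little):

    k = 10 ** 1.64                  # ≈ 43.65, quoted as ≈ 44 on the slide
    T = 1_000_020
    print(round(44 * T ** 0.49))    # 38323 -- the slide's predicted term count
    print(round(k * T ** 0.49))     # ≈ 38020 with the unrounded intercept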
Sec. 5.1  Exercises

- What is the effect of including spelling errors, vs. automatically cor...

Zipf's law