lecture5-compression-handout-6-per

…correcting spelling errors on Heaps' law?
- Compute the vocabulary size M for this scenario:
  - Looking at a collection of web pages, you find that there are 3000 different terms in the first 10,000 tokens and 30,000 different terms in the first 1,000,000 tokens.
  - Assume a search engine indexes a total of 20,000,000,000 (2 × 10^10) pages, containing 200 tokens on average.
  - What is the size of the vocabulary of the indexed collection as predicted by Heaps' law?
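
A worked sketch of this exercise, assuming the standard form of Heaps' law, M = k·T^b (M the vocabulary size, T the number of tokens): the two measurements pin down b and k, and the size of the full indexed collection then gives the predicted vocabulary.

```python
import math

# Heaps' law: M = k * T^b  (M = vocabulary size, T = number of tokens)

# Two measurements from the web-page collection:
T1, M1 = 10_000, 3_000
T2, M2 = 1_000_000, 30_000

# Dividing M2 = k*T2^b by M1 = k*T1^b eliminates k:
#   M2/M1 = (T2/T1)^b  =>  b = log(M2/M1) / log(T2/T1)
b = math.log(M2 / M1) / math.log(T2 / T1)  # log 10 / log 100 = 0.5
k = M1 / T1 ** b                           # 3000 / 10000^0.5 = 30

# Full indexed collection: 2e10 pages * 200 tokens/page = 4e12 tokens
T = 2 * 10**10 * 200
M = k * T ** b                             # 30 * sqrt(4e12) = 6e7
print(f"b = {b:.2f}, k = {k:.0f}, predicted M = {M:,.0f}")  # 60,000,000
```

So Heaps' law predicts a vocabulary of about 6 × 10^7 = 60 million terms for the indexed collection.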
- Heaps' law gives the vocabulary size in collections.
- We also study the relative frequencies of terms.
- In natural language, there are a few very frequent terms and very many very rare terms.
- Zipf's law: the ith most frequent term has frequency proportional to 1/i.
- cf_i ∝ 1/i = K/i, where K is a normalizing constant.
- cf_i is collection frequency: the number of occurrences of the term t_i in the collection.
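
A minimal illustration of what "normalizing constant" means here (the values of M and T below are assumed for illustration, not from the slides): if the collection contains T tokens in total, K is fixed by requiring the collection frequencies cf_i = K/i to sum to T.

```python
M = 30_000     # vocabulary size (hypothetical)
T = 1_000_000  # total tokens in the collection (hypothetical)

# The sum of cf_i over all M terms must equal T:
#   T = K * sum(1/i for i = 1..M) = K * H_M  (the M-th harmonic number)
H_M = sum(1 / i for i in range(1, M + 1))
K = T / H_M

cf = [K / i for i in range(1, M + 1)]
print(f"cf_1 = {cf[0]:.0f}, cf_2 = {cf[1]:.0f}, cf_3 = {cf[2]:.0f}")
# cf_1/cf_2 = 2 and cf_1/cf_3 = 3, exactly as Zipf's law predicts
```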

Zipf consequences
- If the most frequent term (the) occurs cf_1 times,
- then the second most frequent term (of) occurs cf_1/2 times,
- the third most frequent term (and) occurs cf_1/3 times …
- Equivalent: cf_i = K/i where K is a normalizing factor, so log cf_i = log K − log i
- Linear relationship between log cf_i and log i
- Another power law relationship
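
A quick numerical check of that linear relationship (K below is an arbitrary assumed value): generate ideal Zipfian frequencies cf_i = K/i and fit the slope of log cf_i against log i by least squares; for ideal data the slope is exactly −1.

```python
import math

K = 100_000  # normalizing factor (arbitrary, for illustration)
pts = [(math.log(i), math.log(K / i)) for i in range(1, 1001)]

# Least-squares slope of log cf_i against log i:
n = len(pts)
mx = sum(x for x, _ in pts) / n
my = sum(y for _, y in pts) / n
slope = sum((x - mx) * (y - my) for x, y in pts) / \
        sum((x - mx) ** 2 for x, _ in pts)
print(f"slope = {slope:.3f}")  # -1.000 for ideal Zipf data
```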

Zipf's law for Reuters RCV1

[log-log plot of collection frequency cf_i against rank i for RCV1]

Compression
- Now, we will consider compressing...