lecture5-compression-handout-6-per

Gamma code can be used for any distribution

of
k)
8automata8automate9automatic10automation → 8automat*a1◊e2◊ic3◊ion
  8automat* encodes the shared prefix automat.
  The numbers 1, 2, 3 give the extra length beyond automat.
Begins to resemble general string compression.
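
The front-coded string in the example can be reproduced with a short sketch. The function names `front_code_block` and `front_decode_block` are hypothetical, and the sketch assumes that the block holds sorted terms containing no digits, `*`, or `◊`, so that the encoded string can be parsed back unambiguously.

```python
import os
import re


def front_code_block(terms):
    """Front-code one block of sorted terms (hypothetical helper).

    Layout follows the slide example: <len(first term)><prefix>*<rest of
    first term>, then <extra length>◊<suffix> for every later term.
    """
    prefix = os.path.commonprefix(terms)  # longest prefix shared by the block
    first = terms[0]
    out = f"{len(first)}{prefix}*{first[len(prefix):]}"
    for term in terms[1:]:
        suffix = term[len(prefix):]
        out += f"{len(suffix)}◊{suffix}"
    return out


def front_decode_block(encoded):
    """Invert front_code_block, assuming terms contain no digits, '*' or '◊'."""
    head = re.match(r"(\d+)([^*]*)\*", encoded)
    first_len, prefix = int(head.group(1)), head.group(2)
    rest = encoded[head.end():]
    first_suffix_len = first_len - len(prefix)
    terms = [prefix + rest[:first_suffix_len]]
    pos = first_suffix_len
    while pos < len(rest):
        count = re.match(r"(\d+)◊", rest[pos:])
        n = int(count.group(1))          # extra length beyond the prefix
        pos += count.end()
        terms.append(prefix + rest[pos:pos + n])
        pos += n
    return terms


block = ["automata", "automate", "automatic", "automation"]
encoded = front_code_block(block)
print(encoded)
print(front_decode_block(encoded) == block)
```

The shared prefix is stored once per block, so the saving grows with the length of the prefix and the number of terms that share it.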
Introduction to Information Retrieval (Sec. 5.2, Sec. 5.3)

RCV1 dictionary compression summary

Technique                                           Size in MB
Fixed width                                         11.2
Dictionary-as-String with pointers to every term    7.6
Also, blocking k = 4                                7.1
Also, blocking + front coding                       5.9

POSTINGS COMPRESSION

Sec. 5.3
Postings compression

- The postings file is much larger than the dictionary, by a factor of at least 10.
- Key desideratum: store each posting compactly.
- A posting for our purposes is a docID.
- For Reuters (800,000 documents), we would use 32 bits per docID when using 4-byte integers.
- Alternatively, we can use log2 800,000 ≈ 20 bits per docID.
- Our goal: use far fewer than 20 bits per docID.

Postings: two conflicting forces

- A term like "arachnocentric" occurs in maybe one doc out of a million; we would like to store this posting using log2 1M ≈ 20 bits.
- A term like "the" occurs in virtually every doc, so 20 bits/posting is too expensive.
  - Prefer a 0/1 bitmap vector in this case.
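
The arithmetic behind these bullets can be checked with a small sketch. All function names here are hypothetical. The sketch assumes fixed-width docIDs for the list representation and one bit per document in the collection for the bitmap. It only compares raw sizes and does not model any actual postings codec.

```python
def bits_per_docid(num_docs):
    """Smallest fixed width (in bits) that can distinguish num_docs docIDs.

    Equals ceil(log2(num_docs)); integer arithmetic avoids rounding issues.
    """
    return (num_docs - 1).bit_length()


def docid_list_bits(doc_freq, num_docs):
    """Size of a postings list stored as fixed-width docIDs."""
    return doc_freq * bits_per_docid(num_docs)


def bitmap_bits(num_docs):
    """Size of a 0/1 bitmap vector: one bit per document in the collection."""
    return num_docs


def cheaper_representation(doc_freq, num_docs):
    """Pick the smaller of the two representations for a single term."""
    if bitmap_bits(num_docs) < docid_list_bits(doc_freq, num_docs):
        return "bitmap"
    return "docID list"


N = 800_000  # Reuters collection size quoted above
print(bits_per_docid(N))                 # width of one fixed-size docID
print(cheaper_representation(1, N))      # rare term, one posting
print(cheaper_representation(N, N))      # term occurring in every document
```

Under these assumptions the bitmap becomes smaller once a term occurs in more than N/20 of the documents (40,000 for N = 800,000), which is why a rare term favours a docID list and a very frequent term favours the bitmap.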
