lecture5-compression-handout-6-per



  4 bytes per term for Freq.
  4 bytes per term for pointer to Postings.
  3 bytes per term pointer.
  Avg. 8 bytes per term in term string.
  Now avg. 11 bytes/term for the term itself (3-byte pointer + 8 bytes of string), not 20.
  400K terms x 19 bytes ⇒ 7.6 MB (against 11.2 MB for fixed width).
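The per-term arithmetic above can be checked with a short sketch. The byte costs are the ones quoted on the slide; the constant and function names are illustrative assumptions, and MB is taken to mean 10^6 bytes.

```python
# Per-term byte costs quoted on the slide; all names here are illustrative.
NUM_TERMS = 400_000
FREQ_BYTES = 4              # term frequency
POSTINGS_PTR_BYTES = 4      # pointer to the postings list
TERM_PTR_BYTES = 3          # pointer into the long term string
AVG_TERM_STRING_BYTES = 8   # average term length inside the string
FIXED_WIDTH_TERM_BYTES = 20 # fixed-width term field it replaces


def dictionary_bytes(num_terms, per_term_bytes):
    """Total dictionary size in bytes for a given per-term cost."""
    return num_terms * per_term_bytes


# Dictionary-as-a-string entry: 4 + 4 + 3 + 8 = 19 bytes per term.
string_entry = (FREQ_BYTES + POSTINGS_PTR_BYTES
                + TERM_PTR_BYTES + AVG_TERM_STRING_BYTES)
# Fixed-width entry: 4 + 4 + 20 = 28 bytes per term.
fixed_entry = FREQ_BYTES + POSTINGS_PTR_BYTES + FIXED_WIDTH_TERM_BYTES

as_string = dictionary_bytes(NUM_TERMS, string_entry)
fixed_width = dictionary_bytes(NUM_TERMS, fixed_entry)

print(as_string / 1e6, "MB as a string")
print(fixed_width / 1e6, "MB fixed width")
```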
Blocking
  Store pointers to every kth term string.
    Example below: k=4.
  Need to store term lengths (1 extra byte).
  ….7systile9syzygetic8syzygial6syzygy11szaibelyite8szczecin9szomo….
  Save 9 bytes on 3 pointers.
  Lose 4 bytes on term lengths.
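A minimal sketch of the blocked layout, assuming each term is preceded by a single raw length byte (the slide prints the lengths as decimal digits for readability) and that block pointers are kept as plain Python integers rather than 3-byte fields. The function names `build_blocked` and `read_block` are hypothetical, and only the six complete terms from the example string are used.

```python
def build_blocked(terms, k):
    """Concatenate length-prefixed terms; keep one pointer per block of k terms."""
    buf = bytearray()
    block_ptrs = []
    for i, term in enumerate(terms):
        if i % k == 0:
            block_ptrs.append(len(buf))  # offset of the block's first term
        data = term.encode("ascii")
        buf.append(len(data))            # the 1 extra byte: term length
        buf += data
    return bytes(buf), block_ptrs


def read_block(buf, start, k):
    """Decode up to k length-prefixed terms starting at byte offset start."""
    terms = []
    pos = start
    while len(terms) < k and pos < len(buf):
        length = buf[pos]
        terms.append(buf[pos + 1:pos + 1 + length].decode("ascii"))
        pos += 1 + length
    return terms


terms = ["systile", "syzygetic", "syzygial", "syzygy",
         "szaibelyite", "szczecin"]
buf, block_ptrs = build_blocked(terms, k=4)
print(block_ptrs)
print(read_block(buf, block_ptrs[0], 4))
print(read_block(buf, block_ptrs[1], 4))
```

With k = 4, only two pointers are stored for these six terms, at the cost of one length byte per term.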
Introduction to Information Retrieval, Sec. 5.2

Net
  Example for block size k = 4:
    Where we used 3 bytes/pointer without blocking, 3 x 4 = 12 bytes, now we use 3 + 4 = 7 bytes.
  Shaved another ~0.5 MB. This reduces the size of the dictionary from 7.6 MB to 7.1 MB.
  We can save more with larger k.
  Why not go with larger k?

Exercise
  Estimate the space usage (and savings compared to 7.6 MB) with blocking, for block sizes of k = 4, 8 and 16.
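One way to sketch the space arithmetic for blocking, assuming the same 400K-term dictionary and per-term costs as before: every term pays 1 length byte, and each block of k terms pays one 3-byte term pointer. The function name and parameters are illustrative, and the formula is only meaningful for blocked layouts (k > 1), since an unblocked dictionary needs no length bytes.

```python
def blocked_dictionary_bytes(num_terms, k, freq=4, postings_ptr=4,
                             term_ptr=3, avg_term=8):
    """Estimated dictionary size in bytes with one term pointer per k terms."""
    num_blocks = -(-num_terms // k)                 # ceiling division
    per_term = freq + postings_ptr + avg_term + 1   # +1 for the length byte
    return num_terms * per_term + num_blocks * term_ptr


UNBLOCKED_BYTES = 400_000 * 19  # 7.6 MB, dictionary as a string, no blocking

for k in (4, 8, 16):
    total = blocked_dictionary_bytes(400_000, k)
    saved = UNBLOCKED_BYTES - total
    print(f"k={k}: {total / 1e6} MB, saves {saved / 1e6} MB")
```

The savings grow with k but with diminishing returns, because the per-term length byte stays fixed while only the pointer cost shrinks.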
Dictionary search without blocking
  Assuming each dictionary term equally likely in query (not really so in practice!), average number of comparisons = (1 + 2∙2 + 4∙3 + 4)/8 ≈ 2.6
  Exercise: what if the frequencies of query terms were non-uniform but known, how would you structure the dictionary search tree?

Dictionary search with blocking
  Binary search down to 4-term block;
    Then linear search through terms in block.
  Blocks of 4 (binary tree), avg. = (1 + 2∙2 + 2∙3 + 2∙4 + 5)/8 = 3 compares
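The two averages can be reproduced with a small comparison counter. This is a sketch under an assumed cost model: each probe of a block's first term counts as one three-way comparison, the binary search over block leaders picks the upper middle, and the linear scan inside a block costs one comparison per term. The eight single-letter terms are hypothetical placeholders.

```python
def count_comparisons(terms, k, target):
    """Comparisons needed to find target: binary search over block leaders,
    then a linear scan through the rest of the chosen block."""
    leaders = terms[::k]          # first term of each block
    lo, hi = 0, len(leaders) - 1
    comparisons = 0
    block = None                  # last block whose leader is < target
    while lo <= hi:
        mid = (lo + hi + 1) // 2  # upper middle
        comparisons += 1
        if target == leaders[mid]:
            return comparisons
        if target < leaders[mid]:
            hi = mid - 1
        else:
            block = mid
            lo = mid + 1
    if block is None:
        return None               # target sorts before every term
    for term in terms[block * k + 1:(block + 1) * k]:
        comparisons += 1
        if term == target:
            return comparisons
    return None                   # target is not in the dictionary


terms = list("abcdefgh")          # 8 sorted placeholder terms
for k in (1, 4):
    total = sum(count_comparisons(terms, k, t) for t in terms)
    print(f"k={k}: {total}/{len(terms)} = {total / len(terms)} compares")
```

Under these assumptions k = 1 gives 21/8 and k = 4 gives 24/8, matching the two slides.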
Exercise
  Estimate the impact on search performance (and slowdown compared to k=1) with blocking, for block sizes of k = 4, 8 and 16.

Front coding
  Front-coding:
    Sorted words commonly have long common prefix – store differences only
    (for last k-1 in a block...
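A minimal sketch of the idea of storing only differences between sorted terms. The exact encoding used on the slide is cut off here, so this assumes one simple variant: the first term of a block is stored in full and each later term is stored as (length of prefix shared with the previous term, remaining suffix). The function names and the four example words are illustrative assumptions.

```python
import os


def front_encode(block):
    """Encode a sorted block as (shared_prefix_length, suffix) pairs."""
    encoded = [(0, block[0])]  # first term stored in full
    for prev, cur in zip(block, block[1:]):
        shared = len(os.path.commonprefix([prev, cur]))
        encoded.append((shared, cur[shared:]))
    return encoded


def front_decode(encoded):
    """Rebuild the terms by reusing the prefix of the previously decoded term."""
    terms = []
    prev = ""
    for shared, suffix in encoded:
        term = prev[:shared] + suffix
        terms.append(term)
        prev = term
    return terms


block = ["automata", "automate", "automatic", "automation"]
encoded = front_encode(block)
print(encoded)
print(front_decode(encoded) == block)
```

Decoding a term requires walking the block from its first entry, which fits the blocked search pattern of binary search to a block followed by a linear scan.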

This document was uploaded on 02/26/2014.
