lecture5-compression-handout-6-per

# 38 for arachnocentric we will use 20 bits gap entry


Introduction to Information Retrieval, Sec. 5.2

Space per term:

- 4 bytes per term for Freq.
- 4 bytes per term for pointer to Postings.
- 3 bytes per term pointer.
- Avg. 8 bytes per term in term string.
- Now avg. 11 bytes/term for the term itself (3-byte pointer + 8 characters), not 20.
- 400K terms x 19 ⇒ 7.6 MB (against 11.2 MB for fixed width).

Blocking

- Store pointers to every kth term string.
- Example below: k = 4.
- Need to store term lengths (1 extra byte).

….7systile9syzygetic8syzygial6syzygy11szaibelyite8szczecin9szomo….

- Save 9 bytes on 3 pointers.
- Lose 4 bytes on term lengths.

Net

- Example for block size k = 4.
- Where we used 3 bytes/pointer without blocking (3 x 4 = 12 bytes per 4 terms), now we use 3 + 4 = 7 bytes.
- Shaved another ~0.5 MB. This reduces the size of the dictionary from 7.6 MB to 7.1 MB.
- We can save more with larger k.
- Why not go with larger k?

Exercise

- Estimate the space usage (and savings compared to 7.6 MB) with blocking, for block sizes of k = 4, 8 and 16.

Dictionary search without blocking

- Assuming each dictionary term is equally likely in a query (not really so in practice!), average number of comparisons = (1 + 2·2 + 4·3 + 4)/8 ≈ 2.6.
- Exercise: what if the frequencies of query terms were non-uniform but known? How would you structure the dictionary search tree?

Dictionary search with blocking

- Binary search down to 4-term block;
- Then linear search through terms in block.
- Blocks of 4 (binary tree), avg. = (1 + 2·2 + 2·3 + 2·4 + 5)/8 = 3 compares.

Exercise

- Estimate the impact on search performance (and slowdown compared to k = 1) with blocking, for block sizes of k = 4, 8 and 16.

Front coding

- Front-coding: sorted words commonly have a long common prefix; store differences only
- (for last k‐1 in a block...
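
The blocking space arithmetic from the slides above can be checked with a short sketch. It is not part of the original handout. The function name `dictionary_bytes` and its parameter names are hypothetical, and the model assumes what the slides state: 400K terms, 4 bytes each for Freq. and the postings pointer, an average of 8 characters per term, one 3-byte term pointer per block of k terms, and 1 extra length byte per term once blocking is used (k > 1). MB is taken as 10^6 bytes, as in the 400K x 19 figure.

```python
import math


def dictionary_bytes(num_terms=400_000, k=1, freq_bytes=4, postings_ptr_bytes=4,
                     term_ptr_bytes=3, avg_term_chars=8, length_bytes=1):
    """Estimate the size in bytes of a dictionary-as-a-string with block size k.

    Assumption (from the slides): with k = 1 every term has its own term
    pointer and needs no length byte; with k > 1 only one term pointer is
    kept per block, and every term carries a 1-byte length prefix.
    """
    per_term = freq_bytes + postings_ptr_bytes + avg_term_chars
    if k > 1:
        per_term += length_bytes  # term lengths are only needed with blocking
    num_blocks = math.ceil(num_terms / k)
    return num_terms * per_term + num_blocks * term_ptr_bytes


baseline = dictionary_bytes(k=1)  # 400K x 19 bytes
for k in (1, 4, 8, 16):
    size = dictionary_bytes(k=k)
    saved = baseline - size
    print(f"k={k:2d}: {size / 1e6:.3f} MB (saves {saved / 1e6:.3f} MB)")
```

Under these assumptions, k = 1 reproduces the 7.6 MB figure and k = 4 reproduces 7.1 MB. Larger k keeps shrinking the pointer overhead, but each doubling saves less, while the linear scan inside a block gets longer.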

## This document was uploaded on 02/26/2014.
