Introduction to Information Retrieval
CS276: Information Retrieval and Web Search
Pandu Nayak and Prabhakar Raghavan
Lecture 5: Index Compression

Course work
- Problem set 1 due Thursday
- Programming exercise 1 will be handed out today

Last lecture – index construction
- Sort-based indexing
  - Naïve in-memory inversion
  - Blocked Sort-Based Indexing
  - Merge sort is effective for disk-based sorting (avoid seeks!)
- Single-Pass In-Memory Indexing
  - No global dictionary: generate a separate dictionary for each block
  - Don't sort postings: accumulate postings in postings lists as they occur
- Distributed indexing using MapReduce
- Dynamic indexing: multiple indices, logarithmic merge

Today (Ch. 5)
- Collection statistics in more detail (with RCV1)
  - How big will the dictionary and postings be?
- Dictionary compression
- Postings compression

Why compression (in general)?
- Use less disk space: saves a little money
- Keep more stuff in memory: increases speed
- Increase speed of data transfer from disk to memory
  - [read compressed data | decompress] is faster than [read uncompressed data]
  - Premise: decompression algorithms are fast
  - True of the decompression algorithms we use

Why compression for inverted indexes?
- Dictionary
  - Make it small enough to keep in main memory
  - Make it so small that you can keep some postings lists in main memory too
- Postings file(s)
  - Reduce disk space needed
  - Decrease time needed to read postings lists from disk
- Large search engines keep a significant part of the postings in memory;
  compression lets you keep more in memory
- We will devise various IR-specific compression schemes
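The transfer-speed premise above can be sanity-checked with a back-of-the-envelope sketch. All throughput and compression figures below are illustrative assumptions, not measurements:

```python
def read_seconds(size_bytes, mb_per_s):
    """Time to stream size_bytes at a sustained throughput of mb_per_s MB/s."""
    return size_bytes / (mb_per_s * 1024 * 1024)

# Assumed figures (hypothetical, for illustration only):
postings_bytes = 400_000_000   # ~100M non-positional postings * 4 bytes each
ratio = 0.25                   # assume postings compress to 25% of original size
disk_mb_per_s = 150            # assumed sequential disk bandwidth
decomp_mb_per_s = 1000         # assumed decompression throughput (a fast codec)

t_plain = read_seconds(postings_bytes, disk_mb_per_s)
t_comp = (read_seconds(postings_bytes * ratio, disk_mb_per_s)   # read less data
          + read_seconds(postings_bytes, decomp_mb_per_s))      # then decompress

print(f"read uncompressed:        {t_plain:.2f} s")
print(f"read + decompress (0.25): {t_comp:.2f} s")
```

Under these assumptions the compressed path wins comfortably; the premise fails only if decompression is slower than the disk-bandwidth savings it buys.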
Recall Reuters RCV1 (Sec. 5.1)

symbol  statistic                     value
N       documents                     800,000
L       avg. # tokens per document    200
M       terms (= word types)          ~400,000
        avg. # bytes per token        6 (incl. spaces/punct.)
        avg. # bytes per token        4.5 (without spaces/punct.)
        avg. # bytes per term         7.5
        non-positional postings       100,000,000

Index parameters vs. what we index (details: IIR Table 5.1, p. 80)

                dictionary (terms)       non-positional postings    positional postings
                Size (K)  ∆%  cumul %    Size (K)  ∆%  cumul %      Size (K)  ∆%  cumul %
Unfiltered      484                      109,971                    197,879
No numbers      474       -2   -2        100,680   -8   -8          179,158   -9   -9
Case folding    392      -17  -19         96,969   -3  -12          179,158    0   -9
30 stopwords    391        0  -19         83,390  -14  -24          121,858  -31  -38
150 stopwords   391        0  -19         67,002  -30  -39           94,517  -47  -52
Stemming        322      -17  -33         63,812   -4  -42           94,517    0  -52

(The ∆% entries for the two stop-word rows are relative to the case-folding
line: the 30- and 150-word stop lists are alternative filters, not successive
steps.)

Exercise: give intuitions for all the '0' entries. Why do some zero entries
correspond to big deltas in other columns?
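The cumulative-% column can be recomputed directly from the raw sizes. A minimal sketch, using the non-positional postings sizes (in thousands) from Table 5.1:

```python
# Non-positional postings sizes (in thousands of postings), per Table 5.1.
sizes = {
    "unfiltered": 109_971,
    "no numbers": 100_680,
    "case folding": 96_969,
    "30 stopwords": 83_390,
    "150 stopwords": 67_002,
    "stemming": 63_812,
}

base = sizes["unfiltered"]
for step, size in sizes.items():
    # Cumulative % change is always measured against the unfiltered index.
    cumul = round(100 * (size - base) / base)
    print(f"{step:14s} {size:7d}K  cumul {cumul:4d}%")
```

Running this reproduces the cumulative column (0, -8, -12, -24, -39, -42), which confirms that each cumulative entry compounds all preceding filters rather than summing the per-step deltas.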
