lecture5-compression-handout-6-per

Introduction to Information Retrieval
CS276: Information Retrieval and Web Search
Pandu Nayak and Prabhakar Raghavan
Lecture 5: Index Compression

Course work
  • Problem set 1 due Thursday
  • Programming exercise 1 will be handed out today

Last lecture – index construction
  • Sort-based indexing
      • Naïve in-memory inversion
      • Blocked Sort-Based Indexing
          • Merge sort is effective for disk-based sorting (avoid seeks!)
  • Single-Pass In-Memory Indexing
      • No global dictionary
          • Generate separate dictionary for each block
      • Don't sort postings
          • Accumulate postings in postings lists as they occur
  • Distributed indexing using MapReduce
  • Dynamic indexing: Multiple indices, logarithmic merge

Today (Ch. 5)
  • Collection statistics in more detail (with RCV1)
      • How big will the dictionary and postings be?
  • Dictionary compression
  • Postings compression

Why compression (in general)? (Ch. 5)
  • Use less disk space
      • Saves a little money
  • Keep more stuff in memory
      • Increases speed
  • Increase speed of data transfer from disk to memory
      • [read compressed data | decompress] is faster than [read uncompressed data]
      • Premise: Decompression algorithms are fast
          • True of the decompression algorithms we use

Why compression for inverted indexes? (Ch. 5)
  • Dictionary
      • Make it small enough to keep in main memory
      • Make it so small that you can keep some postings lists in main memory too
  • Postings file(s)
      • Reduce disk space needed
      • Decrease time needed to read postings lists from disk
      • Large search engines keep a significant part of the postings in memory.
          • Compression lets you keep more in memory
  • We will devise various IR-specific compression schemes
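
The premise behind "[read compressed data | decompress] is faster than [read uncompressed data]" can be checked with a back-of-envelope model. The sketch below is only an illustration: the function names, the throughput figures, and the compression ratio are all hypothetical assumptions, not numbers from the lecture, and it assumes the transfer and the decompression run one after the other with no overlap.

```python
def read_time_s(size_mb, disk_mb_per_s):
    """Seconds to read size_mb of uncompressed data from disk."""
    return size_mb / disk_mb_per_s


def compressed_read_time_s(size_mb, ratio, disk_mb_per_s, decompress_mb_per_s):
    """Seconds to read the compressed form and then decompress it.

    ratio is compressed size / original size (0 < ratio < 1).
    decompress_mb_per_s is measured in MB of *output* produced per second.
    """
    transfer = (size_mb * ratio) / disk_mb_per_s
    decode = size_mb / decompress_mb_per_s
    return transfer + decode


def break_even_decompress_speed(ratio, disk_mb_per_s):
    """Decompression speed above which the compressed path wins.

    From size*ratio/disk + size/dec < size/disk  =>  dec > disk / (1 - ratio).
    """
    return disk_mb_per_s / (1.0 - ratio)


# Hypothetical figures: 1000 MB of postings, 100 MB/s disk,
# data compressed to 25% of its size, 400 MB/s decompression.
plain = read_time_s(1000, 100)
packed = compressed_read_time_s(1000, 0.25, 100, 400)
print(plain, packed, break_even_decompress_speed(0.25, 100))
```

Under these assumed figures the compressed path takes half the time. The break-even function makes the slide's premise explicit: compression helps transfer time only when decompression is fast relative to the disk.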

Recall Reuters RCV1 (Sec. 5.1)

  symbol  statistic                                        value
  N       documents                                        800,000
  L       avg. # tokens per doc                            200
  M       terms (= word types)                             ~400,000
          avg. # bytes per token (incl. spaces/punct.)     6
          avg. # bytes per token (without spaces/punct.)   4.5
          avg. # bytes per term                            7.5
          non-positional postings                          100,000,000

Index parameters vs. what we index (details IIR Table 5.1, p.80)

  size of       word types (terms)        non-positional postings   positional postings
                dictionary                non-positional index      positional index
                Size (K)    Δ%  cumul %   Size (K)    Δ%  cumul %   Size (K)    Δ%  cumul %
  Unfiltered         484                   109,971                   197,879
  No numbers         474    -2       -2    100,680    -8       -8    179,158    -9       -9
  Case folding       392   -17      -19     96,969    -3      -12    179,158     0       -9
  30 stopwords       391     0      -19     83,390   -14      -24    121,858   -31      -38
  150 stopwords      391     0      -19     67,002   -30      -39     94,517   -47      -52
  stemming           322   -17      -33     63,812    -4      -42     94,517     0      -52

Exercise: give intuitions for all the '0' entries. Why do some zero entries correspond to big deltas in other columns?
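
The percentage columns on the Table 5.1 slide can be recomputed from the Size (K) figures. The sketch below does this for the dictionary (word types) column only. Two things are assumptions rather than statements from the slide: the baseline each step is compared against (both stopword rows measured against case folding, stemming against 150 stopwords), and truncating percentages toward zero. With those assumptions the dictionary column's figures are reproduced; the names `DICTIONARY_SIZE_K`, `BASELINE`, `percent_change`, and `reductions` are hypothetical.

```python
# Dictionary sizes in thousands of terms, copied from the Table 5.1 slide.
DICTIONARY_SIZE_K = {
    "Unfiltered": 484,
    "No numbers": 474,
    "Case folding": 392,
    "30 stopwords": 391,
    "150 stopwords": 391,
    "stemming": 322,
}

# Assumed baseline for each step's per-step change (not stated on the slide).
BASELINE = {
    "No numbers": "Unfiltered",
    "Case folding": "No numbers",
    "30 stopwords": "Case folding",
    "150 stopwords": "Case folding",
    "stemming": "150 stopwords",
}


def percent_change(new, old):
    """Percent change from old to new; int() truncates toward zero (an assumption)."""
    return int(100 * (new - old) / old)


def reductions(sizes, baseline):
    """Map each step to (per-step change %, cumulative change % vs. Unfiltered)."""
    return {
        step: (
            percent_change(sizes[step], sizes[base]),
            percent_change(sizes[step], sizes["Unfiltered"]),
        )
        for step, base in baseline.items()
    }


for step, (delta, cumul) in reductions(DICTIONARY_SIZE_K, BASELINE).items():
    print(f"{step:14s} {delta:4d} {cumul:4d}")
```

The two stopword rows come out as 0 because removing 30 or 150 word types barely changes a dictionary of roughly 392,000 terms, which is one of the intuitions the exercise asks for.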