class04-indexes

Last time: recall the basic indexing pipeline

Documents to be indexed: Friends, Romans, countrymen.
Tokenizer → token stream: Friends Romans Countrymen
Linguistic modules → modified tokens: friend roman countryman
Indexer → inverted index:
    friend → 2, 4
    roman → 1, 2
    countryman → 13, 16
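The pipeline above can be sketched in a few lines of Python. This is a toy illustration, not the course's implementation: the `IRREGULAR` lookup table and the strip-a-trailing-"s" rule are hypothetical stand-ins for real linguistic modules (a stemmer or lemmatizer), and the second document is invented for the example.

```python
import re
from collections import defaultdict

# Hypothetical toy table for irregular plurals; a real system would use
# a proper stemmer or lemmatizer here.
IRREGULAR = {"countrymen": "countryman"}


def tokenize(text):
    """Tokenizer: split raw text into a stream of alphabetic tokens."""
    return re.findall(r"[A-Za-z]+", text)


def normalize(token):
    """Linguistic module (toy): lowercase, then crude plural removal."""
    token = token.lower()
    if token in IRREGULAR:
        return IRREGULAR[token]
    if token.endswith("s") and len(token) > 3:
        return token[:-1]
    return token


def build_index(docs):
    """Indexer: map each normalized term to a sorted list of doc IDs."""
    index = defaultdict(list)
    for doc_id in sorted(docs):
        terms = {normalize(t) for t in tokenize(docs[doc_id])}
        for term in terms:
            index[term].append(doc_id)
    return dict(index)


docs = {1: "Friends, Romans, countrymen.", 2: "A friend in Rome"}
index = build_index(docs)
print(index["friend"])      # [1, 2]
print(index["countryman"])  # [1]
```

Because documents are visited in increasing doc ID order, each postings list comes out sorted, which is what the merge algorithms below rely on.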
Today

How can we improve on our basic index?
    Skip pointers: faster postings merges
    Positional index: phrase queries and proximity queries
    Permuterm index: wildcard queries
    k-gram index: wildcard queries and spell correction
How can we compress our indexes?
Need a better index than simple <term: docs>

Faster postings merges: Skip pointers
Recall basic merge

Walk through the two postings lists simultaneously, in time linear in the total number of postings entries.
    Brutus: 2 4 8 16 32 64 128
    Caesar: 1 2 3 5 8 17 21 31
    Result: 2 8
If the list lengths are m and n, the merge takes O(m+n) operations.
Can we do better? Yes, if the index isn’t changing too fast.

Augment postings with skip pointers (at indexing time)

Why? To skip postings that will not figure in the search results.
How? Where do we place skip pointers?
    Upper list: 2 4 8 16 32 64 128 (skip pointers to 16 and 128)
    Lower list: 1 2 3 5 8 17 21 31 (skip pointers to 8 and 31)
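As a reminder of what the basic merge does, here is a minimal sketch in Python. It assumes each postings list is a sorted Python list of doc IDs; the function name `intersect` is chosen for illustration.

```python
def intersect(p1, p2):
    """Intersect two sorted postings lists in O(m + n) comparisons."""
    answer = []
    i = j = 0
    while i < len(p1) and j < len(p2):
        if p1[i] == p2[j]:
            # Doc ID appears in both lists: keep it and advance both.
            answer.append(p1[i])
            i += 1
            j += 1
        elif p1[i] < p2[j]:
            i += 1  # advance the list with the smaller doc ID
        else:
            j += 1
    return answer


brutus = [2, 4, 8, 16, 32, 64, 128]
caesar = [1, 2, 3, 5, 8, 17, 21, 31]
print(intersect(brutus, caesar))  # [2, 8]
```

Each loop iteration advances at least one of the two pointers, which is where the O(m+n) bound comes from.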
Query processing with skip pointers

    Upper list: 2 4 8 16 32 64 128 (skip pointers to 16 and 128)
    Lower list: 1 2 3 5 8 17 21 31 (skip pointers to 8 and 31)
Suppose we’ve stepped through the lists until we process 8 on each list.
When we get to 16 on the top list, we see that its successor is 32.
But the skip successor of 8 on the lower list is 31, so we can skip ahead past the intervening postings.

Where do we place skips?

Tradeoff:
    More skips → shorter skip spans ⇒ more likely to skip. But lots of comparisons to skip pointers.
    Fewer skips → few pointer comparisons, but then long skip spans ⇒ few successful skips.
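One way to make the idea concrete is the sketch below. It rests on several assumptions that are not on the slides: postings are sorted Python lists, a skip pointer is stored as a dictionary entry from one list position to a later position, skips are spaced roughly the square root of the list length apart (one common spacing choice), and a skip is followed only when its target does not overshoot the current doc ID on the other list. The two example lists are illustrative values picked so that a skip actually fires.

```python
import math


def place_skips(postings):
    """Return {position: skip target position}, spaced about sqrt(L) apart."""
    n = len(postings)
    step = math.isqrt(n)
    if step < 2:  # list too short for a skip to save any work
        return {}
    return {i: i + step for i in range(0, n - step, step)}


def intersect_with_skips(p1, p2):
    """Intersect sorted lists, following skips that do not overshoot.

    Returns (matching doc IDs, number of skip pointers followed).
    """
    skips1, skips2 = place_skips(p1), place_skips(p2)
    answer, skips_taken = [], 0
    i = j = 0
    while i < len(p1) and j < len(p2):
        if p1[i] == p2[j]:
            answer.append(p1[i])
            i += 1
            j += 1
        elif p1[i] < p2[j]:
            # Everything before the skip target is smaller than the target,
            # so jumping is safe whenever the target is still <= p2[j].
            if i in skips1 and p1[skips1[i]] <= p2[j]:
                i = skips1[i]
                skips_taken += 1
            else:
                i += 1
        else:
            if j in skips2 and p2[skips2[j]] <= p1[i]:
                j = skips2[j]
                skips_taken += 1
            else:
                j += 1
    return answer, skips_taken


upper = [2, 4, 8, 41, 48, 64, 128]    # illustrative values
lower = [1, 2, 3, 8, 11, 17, 21, 31]  # illustrative values
print(intersect_with_skips(upper, lower))  # ([2, 8], 1)
```

Here the single skip taken is on the lower list, from 11 to 21, while the upper list sits at 41. Every extra skip pointer adds a comparison even when the jump is not taken, which is the tradeoff described above.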
Placing skips

Simple heuristic: for postings of length L, use √L evenly-spaced skip pointers.
This ignores the distribution of query terms.
Easy if the index is relatively static; harder if L keeps changing because of updates.
This definitely used to help; with modern hardware it may not (Bahle et al. 2002).
    The cost of loading a bigger postings list outweighs the gain from quicker in-memory merging.

Positional Indexes
Phrase queries

Want to answer queries such as “stanford university” as a phrase.
Thus the sentence “I went to university at Stanford” is not a match.
The concept of phrase queries has proven easily understood by users; about 10% of web queries are phrase queries.

A first attempt: Biword indexes

Index every consecutive pair of terms in the text as a phrase.
For example, the text “Friends, Romans, Countrymen” would generate the biwords
    friends romans
    romans countrymen
Each of these biwords is now a dictionary term.
Two-word phrase query-processing is now immediate.
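A small sketch of biword generation and indexing, assuming the text has already been tokenized and lowercased; the function names and the second document are made up for the example.

```python
from collections import defaultdict


def biwords(tokens):
    """Return every pair of consecutive tokens as a single biword term."""
    return [f"{a} {b}" for a, b in zip(tokens, tokens[1:])]


def build_biword_index(docs):
    """Map each biword to the sorted list of doc IDs that contain it."""
    index = defaultdict(list)
    for doc_id in sorted(docs):
        for bw in set(biwords(docs[doc_id])):
            index[bw].append(doc_id)
    return dict(index)


print(biwords(["friends", "romans", "countrymen"]))
# ['friends romans', 'romans countrymen']

docs = {
    1: ["friends", "romans", "countrymen"],
    2: ["romans", "countrymen", "lend", "me", "your", "ears"],
}
index = build_biword_index(docs)
print(index["romans countrymen"])  # [1, 2]
```

A two-word phrase query then becomes a single dictionary lookup, such as `index.get("friends romans", [])`.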
Longer phrase queries

Longer phrases are processed as we did with wild-cards:
    stanford university palo alto
can be broken into the Boolean query on biwords:
    stanford university AND university palo AND palo alto
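The decomposition above can be sketched as follows. The three documents and the function names are hypothetical, and the tokenization is a plain lowercase split. The sketch also shows a known limitation of the approach: a document can contain every biword of the query without containing the full phrase, so the Boolean query only yields candidates.

```python
from collections import defaultdict


def biwords(tokens):
    """Return every pair of consecutive tokens as a single biword term."""
    return [f"{a} {b}" for a, b in zip(tokens, tokens[1:])]


def build_biword_index(docs):
    """Map each biword to the set of doc IDs whose text contains it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for bw in biwords(text.lower().split()):
            index[bw].add(doc_id)
    return index


def phrase_candidates(index, phrase):
    """AND together the postings of all biwords in the phrase."""
    result = None
    for bw in biwords(phrase.lower().split()):
        postings = index.get(bw, set())
        result = set(postings) if result is None else result & postings
    return sorted(result) if result else []


docs = {  # hypothetical documents
    1: "stanford university palo alto campus",
    2: "palo alto university palo verde near stanford university",
    3: "stanford university campus",
}
index = build_biword_index(docs)
print(biwords("stanford university palo alto".split()))
# ['stanford university', 'university palo', 'palo alto']
print(phrase_candidates(index, "stanford university palo alto"))  # [1, 2]
```

Document 2 is returned even though it never contains the four-word phrase, so confirming a true match would require checking the candidate documents themselves.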