class04-indexes


Last time: recall the basic indexing pipeline

Documents to be indexed: Friends, Romans, countrymen.
Tokenizer produces the token stream: Friends Romans Countrymen
Linguistic modules produce the modified tokens: friend roman countryman
Indexer produces the inverted index:
    friend      2, 4
    roman       1, 2
    countryman  13, 16
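The three pipeline stages can be sketched in a few lines of Python. This is only an illustration: the `LEMMAS` table is a hypothetical stand-in for the linguistic modules (a real system would use a stemmer or lemmatizer), and the document IDs and the second document are invented for the example.

```python
import re
from collections import defaultdict

# Hypothetical stand-in for the "linguistic modules" stage: a tiny
# hand-written lemma table covering only the words in this example.
LEMMAS = {"friends": "friend", "romans": "roman", "countrymen": "countryman"}

def tokenize(text):
    """Tokenizer: split raw text into a stream of alphabetic tokens."""
    return re.findall(r"[A-Za-z]+", text)

def normalize(token):
    """Linguistic modules: lowercase, then map to a base form if known."""
    lowered = token.lower()
    return LEMMAS.get(lowered, lowered)

def build_inverted_index(docs):
    """Indexer: map each modified token to a sorted list of doc IDs."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for token in tokenize(text):
            index[normalize(token)].add(doc_id)
    return {term: sorted(ids) for term, ids in index.items()}

# Invented documents and IDs, for illustration only.
docs = {1: "Friends, Romans, countrymen.", 2: "A friend of a Roman"}
index = build_inverted_index(docs)
print(index["friend"])      # [1, 2]
print(index["countryman"])  # [1]
```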

Today

How can we improve on our basic index?
    Skip pointers: faster postings merges
    Positional index: phrase queries and proximity queries
    Permuterm index: wildcard queries
    k-gram index: wildcard queries and spell correction
How can we compress our indexes?

Need a better index than simple <term: docs>

Faster postings merges: Skip pointers
Recall basic merge

Walk through the two postings lists simultaneously, in time linear in the total number of postings entries.
    Brutus:  2 4 8 16 32 64 128
    Caesar:  1 2 3 5 8 17 21 31
    Result:  2 8
If the list lengths are m and n, the merge takes O(m+n) operations.
Can we do better? Yes, if the index isn’t changing too fast.

Augment postings with skip pointers (at indexing time)

    Top list:    2 4 8 16 32 64 128    (skip pointers to 16 and 128)
    Lower list:  1 2 3 5 8 17 21 31    (skip pointers to 8 and 31)
Why? To skip postings that will not figure in the search results.
How? Where do we place skip pointers?
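For reference, the basic linear merge recalled above can be sketched as follows. The function name `intersect` is chosen here for illustration; the two lists are the Brutus and Caesar postings from the slide.

```python
def intersect(p1, p2):
    """Intersect two sorted postings lists in O(m + n) comparisons."""
    answer = []
    i = j = 0
    while i < len(p1) and j < len(p2):
        if p1[i] == p2[j]:
            # Same doc ID on both lists: it belongs in the result.
            answer.append(p1[i])
            i += 1
            j += 1
        elif p1[i] < p2[j]:
            # Advance whichever list currently holds the smaller doc ID.
            i += 1
        else:
            j += 1
    return answer

brutus = [2, 4, 8, 16, 32, 64, 128]
caesar = [1, 2, 3, 5, 8, 17, 21, 31]
common = intersect(brutus, caesar)
print(common)  # [2, 8]
```

Each loop iteration advances at least one of the two positions, which is where the O(m+n) bound comes from.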

Query processing with skip pointers

    Top list:    2 4 8 16 32 64 128    (skip pointers to 16 and 128)
    Lower list:  1 2 3 5 8 17 21 31    (skip pointers to 8 and 31)
Suppose we’ve stepped through the lists until we process 8 on each list.
When we get to 16 on the top list, we see that its successor is 32.
But the skip successor of 8 on the lower list is 31, so we can skip ahead past the intervening postings.

Where do we place skips?

Tradeoff:
    More skips → shorter skip spans ⇒ more likely to skip. But lots of comparisons to skip pointers.
    Fewer skips → few pointer comparisons, but then long skip spans ⇒ few successful skips.
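A minimal sketch of a merge that uses skip pointers is given below. Several details are assumptions made for illustration: skip pointers are stored as a dictionary from a position in the list to the position it skips to, the skip sources (2 to 16 to 128 on the top list, 1 to 8 to 31 on the lower list) are one plausible reading of the figure, and the helper `evenly_spaced_skips` with its `step` parameter is hypothetical. A skip is followed only when its target does not overshoot the current posting on the other list.

```python
def evenly_spaced_skips(length, step):
    """Return {position: target_position} with one skip every `step` postings."""
    if step < 2:
        return {}  # a skip to the immediate successor would gain nothing
    return {i: i + step for i in range(0, length - step, step)}

def intersect_with_skips(p1, skips1, p2, skips2):
    """Intersect sorted postings lists, following skips when they are safe."""
    answer = []
    i = j = 0
    while i < len(p1) and j < len(p2):
        if p1[i] == p2[j]:
            answer.append(p1[i])
            i += 1
            j += 1
        elif p1[i] < p2[j]:
            # Follow the skip only if its target is still <= the other posting.
            if i in skips1 and p1[skips1[i]] <= p2[j]:
                i = skips1[i]
            else:
                i += 1
        else:
            if j in skips2 and p2[skips2[j]] <= p1[i]:
                j = skips2[j]
            else:
                j += 1
    return answer

brutus = [2, 4, 8, 16, 32, 64, 128]
caesar = [1, 2, 3, 5, 8, 17, 21, 31]
brutus_skips = evenly_spaced_skips(len(brutus), 3)  # {0: 3, 3: 6}: 2 -> 16 -> 128
caesar_skips = {0: 4, 4: 7}                         # assumed: 1 -> 8 -> 31

print(intersect_with_skips(brutus, brutus_skips, caesar, caesar_skips))  # [2, 8]
print(intersect_with_skips(brutus, brutus_skips, [128, 200], {}))        # [128]
```

In the second call the top list jumps from 2 to 16 to 128 without visiting 4, 8, 32, or 64. Changing `step` is a simple way to experiment with the tradeoff: a smaller step gives more pointers to compare against, a larger step gives longer spans that are followed less often.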
Placing skips

Simple heuristic: for postings of length L, use √L evenly-spaced skip pointers.
This ignores the distribution of query terms.
Easy if the index is relatively static; harder if L keeps changing because of updates.
This definitely used to help; with modern hardware it may not (Bahle et al. 2002).
The cost of loading a bigger postings list outweighs the gain from quicker in-memory merging.

Positional Indexes

Phrase queries

Want to answer queries such as “stanford university” as a phrase.
Thus the sentence “I went to university at Stanford” is not a match.
The concept of phrase queries has proven easily understood by users; about 10% of web queries are phrase queries.

A first attempt: Biword indexes

Index every consecutive pair of terms in the text as a phrase.
For example, the text “Friends, Romans, Countrymen” would generate the biwords
    friends romans
    romans countrymen
Each of these biwords is now a dictionary term.
Two-word phrase query processing is now immediate.
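A biword index can be sketched as below. The tokenizer (lowercase, alphanumeric runs only), the function names, and documents 2 and 3 with their IDs are assumptions made for this example; only the "Friends, Romans, Countrymen" text and the two Stanford phrases come from the slides.

```python
import re
from collections import defaultdict

def tokenize(text):
    """Lowercase the text and keep runs of letters and digits as terms."""
    return re.findall(r"[a-z0-9]+", text.lower())

def biwords(text):
    """Return every consecutive pair of terms as a single biword string."""
    tokens = tokenize(text)
    return [f"{a} {b}" for a, b in zip(tokens, tokens[1:])]

def build_biword_index(docs):
    """Map each biword (now a dictionary term) to a sorted list of doc IDs."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for biword in biwords(text):
            index[biword].add(doc_id)
    return {biword: sorted(ids) for biword, ids in index.items()}

# Invented documents, for illustration only.
docs = {
    1: "Friends, Romans, Countrymen",
    2: "She studied at Stanford University",
    3: "I went to university at Stanford",
}
index = build_biword_index(docs)

print(biwords(docs[1]))                      # ['friends romans', 'romans countrymen']
print(index.get("stanford university", []))  # [2]
```

A two-word phrase query becomes a single dictionary lookup. Document 3 contains both words but is not returned, because "stanford" and "university" never appear there as a consecutive pair in that order.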
Longer phrase queries

Longer phrases are processed as we did with wild-cards:
stanford university palo alto can be broken into the Boolean query on biwords:
    stanford university AND university palo AND palo alto
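The decomposition above can be sketched as a split into overlapping biwords followed by an AND over their postings. The function names and the postings in `biword_index` are hypothetical, invented purely to make the example runnable.

```python
def phrase_to_biwords(phrase):
    """Split a phrase into its overlapping biwords, in order."""
    terms = phrase.lower().split()
    return [f"{a} {b}" for a, b in zip(terms, terms[1:])]

def boolean_and(biword_index, biword_terms):
    """Return the doc IDs that appear in the postings of every biword."""
    if not biword_terms:
        return []
    result = set(biword_index.get(biword_terms[0], []))
    for term in biword_terms[1:]:
        result &= set(biword_index.get(term, []))
    return sorted(result)

# Hypothetical biword postings, invented for illustration only.
biword_index = {
    "stanford university": [1, 4, 7],
    "university palo": [4, 7, 9],
    "palo alto": [2, 4, 7, 9],
}

query = phrase_to_biwords("stanford university palo alto")
print(query)                             # ['stanford university', 'university palo', 'palo alto']
print(boolean_and(biword_index, query))  # [4, 7]
```

Set intersection is used here for brevity; a postings-list merge over the sorted lists would give the same result.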

This note was uploaded on 01/21/2011 for the course CSCE 670 taught by Professor James during the Spring '10 term at Texas A&M.
