
Introduction to Information Retrieval
CS276: Information Retrieval and Web Search
Pandu Nayak and Prabhakar Raghavan
Lecture 2: The term vocabulary and postings lists

Recap of the previous lecture (Ch. 1)
  Basic inverted indexes:
    Structure: Dictionary and Postings
    Key step in construction: Sorting
  Boolean query processing
    Intersection by linear time “merging”
    Simple optimizations
  Overview of course topics

Plan for this lecture
  Elaborate basic indexing
  Preprocessing to form the term vocabulary
    Documents
    Tokenization
    What terms do we put in the index?
  Postings
    Faster merges: skip lists
    Positional postings and phrase queries

Recall the basic indexing pipeline
  Documents to be indexed: Friends, Romans, countrymen.
  Tokenizer → token stream: Friends Romans Countrymen
  Linguistic modules → modified tokens: friend roman countryman
  Indexer → inverted index:
    friend      2, 4
    roman       1, 2
    countryman  13, 16

Parsing a document (Sec. 2.1)
  What format is it in? pdf/word/excel/html?
  What language is it in?
  What character set is in use?
  Each of these is a classification problem, which we will study later in the course.
  But these tasks are often done heuristically …

Complications: Format/language (Sec. 2.1)
  Documents being indexed can include docs from many different languages
    A single index may have to contain terms of several languages.
  Sometimes a document or its components can contain multiple languages/formats
    French email with a German pdf attachment.
  What is a unit document?
    A file?
    An email? (Perhaps one of many in an mbox.)
    An email with 5 attachments?
    A group of files (PPT or LaTeX as HTML pages)

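The basic indexing pipeline recalled in the slides above (tokenizer, linguistic modules, indexer) can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the three sample documents and their docIDs are made up, the `NORMALIZE` lookup table is a hypothetical stand-in for real linguistic modules such as a stemmer, and the function names are not from the lecture.

```python
import re
from collections import defaultdict

# Hypothetical stand-in for the "linguistic modules" stage: a tiny lookup
# table instead of a real stemmer or lemmatizer.
NORMALIZE = {"friends": "friend", "romans": "roman", "countrymen": "countryman"}


def tokenize(text):
    """Return the maximal runs of ASCII letters in text, in order."""
    return re.findall(r"[A-Za-z]+", text)


def normalize(token):
    """Lowercase the token, then map it through the toy lookup table."""
    lowered = token.lower()
    return NORMALIZE.get(lowered, lowered)


def build_index(docs):
    """Map each term to the sorted list of docIDs that contain it."""
    postings = defaultdict(set)
    for doc_id, text in docs.items():
        for token in tokenize(text):
            postings[normalize(token)].add(doc_id)
    # Sort the dictionary terms and each postings list.
    return {term: sorted(ids) for term, ids in sorted(postings.items())}


# Made-up documents, chosen only so that the output is easy to check.
docs = {
    1: "Romans build roads.",
    2: "Friends, Romans, countrymen.",
    4: "Friends meet.",
}
index = build_index(docs)
print(index["friend"])  # [2, 4]
print(index["roman"])   # [1, 2]
```

Keeping each postings list sorted by docID is what makes the linear time “merging” intersection from the recap possible.
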
TOKENS AND TERMS

Tokenization (Sec. 2.2.1)
  Input: “Friends, Romans, Countrymen”
  Output: Tokens
    Friends
    Romans
    Countrymen
  A token is a sequence of characters in a document
  Each such token is now a candidate for an index entry, after further processing
    Described below
  But what are valid tokens to emit?

Tokenization (Sec. 2.2.1)
  Issues in tokenization:
    Finland’s capital → Finland? Finlands? Finland’s?
    Hewlett-Packard → Hewlett and Packard as two tokens?
      state-of-the-art: break up hyphenated sequence.
      co-education
      lowercase, lower-case, lower case?
      It can be effective to get the user to put in possible hyphens
    San Francisco: one token or two?
      How do you decide it is one token?

Numbers
  3/12/91    Mar. 12, 1991    12/3/91
  55 B.C.
