Introduction to Information Retrieval
CS276: Information Retrieval and Web Search
Pandu Nayak and Prabhakar Raghavan
Lecture 1: Boolean retrieval

Information Retrieval
Information Retrieval (IR) is finding material (usually documents) of an unstructured nature (usually text) that satisfies an information need from within large collections (usually stored on computers).

Unstructured (text) vs. structured (database) data in 1996

Unstructured (text) vs. structured (database) data in 2009

Unstructured data in 1680 (Sec. 1.1)
Which plays of Shakespeare contain the words Brutus AND Caesar but NOT Calpurnia?
One could grep all of Shakespeare’s plays for Brutus and Caesar, then strip out lines containing Calpurnia. Why is that not the answer?
- Slow (for large corpora)
- NOT Calpurnia is non-trivial
- Other operations (e.g., find the word Romans near countrymen) not feasible
- Ranked retrieval (best documents to return): later lectures

Term-document incidence
1 if play contains word, 0 otherwise
Query: Brutus AND Caesar BUT NOT Calpurnia
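To make the term-document incidence idea concrete, here is a minimal sketch that builds a 0/1 incidence table and answers an AND / NOT query against it. The four tiny documents, their names, and the helper functions `build_incidence` and `boolean_query` are all hypothetical illustrations and not part of the lecture; tokenization is a plain whitespace split.

```python
from typing import Dict, List

# Hypothetical toy collection (made up for illustration, not Shakespeare's text).
DOCS: Dict[str, str] = {
    "doc1": "Brutus killed Caesar",
    "doc2": "Caesar and Calpurnia",
    "doc3": "Brutus and Caesar and Calpurnia",
    "doc4": "Antony mourned Caesar and Brutus",
}


def build_incidence(docs: Dict[str, str]) -> Dict[str, List[int]]:
    """Map each term to a 0/1 list with one entry per document, in dict order."""
    names = list(docs)
    tokens = {name: set(docs[name].lower().split()) for name in names}
    vocab = sorted(set().union(*tokens.values()))
    return {term: [1 if term in tokens[n] else 0 for n in names] for term in vocab}


def boolean_query(
    incidence: Dict[str, List[int]],
    names: List[str],
    must: List[str],
    must_not: List[str],
) -> List[str]:
    """Return documents containing every term in `must` and no term in `must_not`."""
    empty = [0] * len(names)  # row used for terms that never occur
    hits = []
    for i, name in enumerate(names):
        has_all = all(incidence.get(t, empty)[i] for t in must)
        has_none = not any(incidence.get(t, empty)[i] for t in must_not)
        if has_all and has_none:
            hits.append(name)
    return hits


incidence = build_incidence(DOCS)
hits = boolean_query(incidence, list(DOCS), ["brutus", "caesar"], ["calpurnia"])
print(incidence["brutus"], incidence["caesar"], incidence["calpurnia"])
print(hits)
```

Unlike the grep approach, the table is built once and each query then only inspects precomputed 0/1 entries, which is also why NOT becomes a simple check on a row rather than a second pass over the text.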

Incidence vectors (Sec. 1.1)
So we have a 0/1 vector for each term. To answer the query, take the vectors for Brutus, Caesar and Calpurnia (complemented) and bitwise AND them:
110100 AND 110111 AND 101111 = 100100

Answers to query
Antony and Cleopatra, Act III, Scene ii
Agrippa [Aside to DOMITIUS ENOBARBUS]:
Why, Enobarbus,
When Antony found Julius Caesar dead,
He cried almost to roaring; and he wept
When at Philippi he found Brutus slain.

Hamlet, Act III, Scene ii
Lord Polonius:
I did enact Julius Caesar
I was killed i' the Capitol; Brutus killed me.

Basic assumptions of Information Retrieval
Collection: Fixed set of documents
Goal: Retrieve documents with information that is relevant to the user’s information need and helps the user complete a task

The classic search model
[Figure: Task → Info need (e.g., info about removing mice without killing them) → Verbal form → Query (e.g., "mouse trap") → Search engine (over the corpus) → Results, with query refinement feeding back into the query. Possible failure points along the chain: Misconception? Mistranslation? Misformulation?]

How good are the retrieved docs?
Precision: Fraction of retrieved docs that are relevant to the user’s information need
Recall: Fraction of relevant docs in the collection that are retrieved
More precise definitions and measurements to follow in later lectures.

Bigger collections
Consider N = 1 million documents, each with about 1000 words.
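The incidence-vector query from earlier in this lecture (110100 AND 110111 AND 101111) can be checked with ordinary integer bit operations. The sketch below is only an illustration: the Calpurnia vector 010000 is derived here as the complement of the 101111 given on the slide, and the play order used to label the six bit positions is an assumption taken from the standard textbook example, since the incidence table itself is not reproduced in this handout.

```python
# Assumed play order for the six bit positions (leftmost bit first).
PLAYS = [
    "Antony and Cleopatra",
    "Julius Caesar",
    "The Tempest",
    "Hamlet",
    "Othello",
    "Macbeth",
]

# Incidence vectors as integers, parsed from the bit strings on the slide.
brutus = int("110100", 2)
caesar = int("110111", 2)
calpurnia = int("010000", 2)  # derived: complement of the slide's 101111

# Python integers are unbounded, so mask the complement down to six bits.
mask = (1 << len(PLAYS)) - 1
not_calpurnia = ~calpurnia & mask

# Brutus AND Caesar AND NOT Calpurnia
result = brutus & caesar & not_calpurnia
bits = format(result, "06b")

# Read off the plays whose bit is set in the result.
answers = [play for play, bit in zip(PLAYS, bits) if bit == "1"]
print(bits, answers)
```

Under the assumed play order, the two set bits correspond to the two plays quoted on the "Answers to query" slide.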
