CS 124/LINGUIST 180: From Languages to Information
Dan Jurafsky
Lecture 8: Information Retrieval (Intro and Boolean Retrieval)
Thanks to Chris Manning for these slides from his CS 276 Information Retrieval and Web Search class!

Outline
- Boolean Retrieval
- The Classic Boolean Search Model
- Normalization, Stemming, Stop Words
- Term-document incidence
- The Inverted Index
- Dictionary and postings file
- Query Optimization
- Phrase Queries

Unstructured (text) vs. structured (database) data in 1996
[Figure]

Unstructured (text) vs. structured (database) data in 2009
[Figure]

Unstructured data in 1680
- Which plays of Shakespeare contain the words Brutus AND Caesar but NOT Calpurnia?
- One could grep all of Shakespeare's plays for Brutus and Caesar, then strip out lines containing Calpurnia. Why is that not the answer?
  - Slow (for large corpora)
  - NOT Calpurnia is non-trivial
  - Other operations (e.g., find the word Romans near countrymen) are not feasible
  - Ranked retrieval (best documents to return): later lectures

Term-document incidence
- Query: Brutus AND Caesar but NOT Calpurnia
- [Matrix figure: one row per term, one column per play; an entry is 1 if the play contains the word, 0 otherwise]

Incidence vectors
- So we have a 0/1 vector for each term.
- To answer the query: take the vectors for Brutus, Caesar, and Calpurnia (complemented) and bitwise AND them.
- 110100 AND 110111 AND 101111 = 100100

Answers to query
- Antony and Cleopatra, Act III, Scene ii
  Agrippa [Aside to DOMITIUS ENOBARBUS]: Why, Enobarbus,
  When Antony found Julius Caesar dead,
  He cried almost to roaring; and he wept
  When at Philippi he found Brutus slain.
- Hamlet, Act III, Scene ii
  Lord Polonius: I did enact Julius Caesar: I was killed i' the Capitol; Brutus killed me.
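As a small illustration of the term-document incidence approach just described, here is a minimal Python sketch, assuming the toy Shakespeare matrix from the slides with the usual six-play column order; the helper names are my own.

    # Minimal sketch of Boolean retrieval over a term-document incidence matrix.
    # Play order and 0/1 vectors follow the toy example in the slides (an assumption).
    plays = ["Antony and Cleopatra", "Julius Caesar", "The Tempest",
             "Hamlet", "Othello", "Macbeth"]

    incidence = {              # 1 if the play contains the word, 0 otherwise
        "Brutus":    [1, 1, 0, 1, 0, 0],   # 110100
        "Caesar":    [1, 1, 0, 1, 1, 1],   # 110111
        "Calpurnia": [0, 1, 0, 0, 0, 0],   # complement is 101111
    }

    def vec_and(u, v):
        return [a & b for a, b in zip(u, v)]

    def vec_not(u):
        return [1 - a for a in u]

    result = vec_and(vec_and(incidence["Brutus"], incidence["Caesar"]),
                     vec_not(incidence["Calpurnia"]))
    print(result)                                       # [1, 0, 0, 1, 0, 0] = 100100
    print([p for p, bit in zip(plays, result) if bit])  # ['Antony and Cleopatra', 'Hamlet']

Real systems would use packed bit vectors or, as the next slides argue, avoid materializing the matrix altogether.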
Basic assumptions of Information Retrieval
- Collection: a fixed set of documents
- Goal: retrieve documents with information that is relevant to the user's information need and helps the user complete a task

The classic search model
[Diagram: a task gives rise to an information need ("info about removing mice without killing them"), which is put into verbal form and then formulated as a query ("mouse trap") sent to the search engine over a corpus; results may prompt query refinement. Errors can creep in at each step: misconception, mistranslation, misformulation.]

How good are the retrieved docs?
- Precision: fraction of retrieved docs that are relevant to the user's information need
- Recall: fraction of relevant docs in the collection that are retrieved
- More precise definitions and measurements to follow in the following lecture

Bigger collections
- Consider N = 1 million documents, each with about 1000 words
- Avg 6 bytes/term including spaces/punctuation: 6 GB of data in the documents
- Say there are m = 500K distinct terms among these

Can't build the matrix
- A 500K x 1M matrix has half a trillion 0's and 1's.
- But it has no more than one billion 1's (why?), so the matrix is extremely sparse.
- What's a better representation? We only record the 1 positions.

Inverted index
- For each term t, we must store a list of all documents that contain t.
- Identify each document by a docID, a document serial number.
- Can we use fixed-size arrays for this?
    Brutus    -> 1, 2, 4, 11, 31, 45, 173, 174
    Caesar    -> 1, 2, 4, 5, 6, 16, 57, 132
    Calpurnia -> 2, 31, 54, 101
- What happens if the word Caesar is added to document 14?

Inverted index
- We need variable-size postings lists
  - On disk, a continuous run of postings is normal and best
  - In memory, can use linked lists or variable-length arrays
  - Some tradeoffs in size/ease of insertion
- The dictionary maps each term to its postings list; each docID entry in a postings list is a posting:
    Brutus    -> 1, 2, 4, 11, 31, 45, 173, 174
    Caesar    -> 1, 2, 4, 5, 6, 16, 57, 132
    Calpurnia -> 2, 31, 54, 101
- Postings are sorted by docID (more later on why).

Inverted index construction
[Pipeline: documents to be indexed ("Friends, Romans, countrymen.") -> tokenizer -> token stream (Friends, Romans, Countrymen) -> linguistic modules -> modified tokens (friend, roman, countryman) -> indexer -> inverted index (friend -> 2, 4; roman -> 1, 2; countryman -> 13, 16)]

Stop words
- With a stop list, you exclude from the dictionary entirely the commonest words. Intuition:
  - They have little semantic content: the, a, and, to, be
  - There are a lot of them: ~30% of postings for the top 30 words
- But the trend is away from doing this:
  - Good compression techniques (see CS 276) mean the space for including stop words in a system is very small
  - Good query optimization techniques mean you pay little at query time for including stop words
- You need them for:
  - Phrase queries: "King of Denmark"
  - Various song titles, etc.: "Let it be", "To be or not to be"
  - "Relational" queries: "flights to London"

Normalization
- Need to "normalize" terms in indexed text as well as query terms into the same form
  - We want to match U.S.A. and USA
- We most commonly implicitly define equivalence classes of terms
  - e.g., by deleting periods in a term
- The alternative is to do asymmetric expansion:
  - Enter: window   Search: window, windows
  - Enter: windows  Search: Windows, windows, window
  - Enter: Windows  Search: Windows
- Potentially more powerful, but less efficient

Stemming
- Reduce terms to their "roots" before indexing
- "Stemming" suggests crude affix chopping
  - Language dependent
  - e.g., automate(s), automatic, automation are all reduced to automat
- Example: "for example compressed and compression are both accepted as equivalent to compress" stems to "for exampl compress and compress ar both accept as equival to compress"

Indexer steps: Token sequence
- Produce the sequence of (modified token, document ID) pairs.
  - Doc 1: I did enact Julius Caesar I was killed i' the Capitol; Brutus killed me.
  - Doc 2: So let it be with Caesar. The noble Brutus hath told you Caesar was ambitious

Indexer steps: Sort
- Sort by terms, and then by docID
- This is the core indexing step

Indexer steps: Dictionary & Postings
- Multiple term entries in a single document are merged.
- Split into dictionary and postings.
- Document frequency information is added. (Why frequency? Will discuss later.)

Where do we pay in storage?
- Dictionary: terms and counts
- Postings: lists of docIDs, reached via pointers from the dictionary entries
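To make the indexer steps concrete, here is a minimal sketch, assuming the two toy documents above and a bare-bones tokenizer (lowercasing only, no stemming or stop-word removal): generate (term, docID) pairs, sort them, merge duplicates into postings, and keep the document frequency. The function and variable names are illustrative, not from the slides.

    import re
    from collections import defaultdict

    docs = {  # toy collection: the two documents from the indexer-steps slides
        1: "I did enact Julius Caesar I was killed i' the Capitol; Brutus killed me.",
        2: "So let it be with Caesar. The noble Brutus hath told you Caesar was ambitious",
    }

    def tokenize(text):
        # Crude tokenizer + lowercasing; a real system plugs in linguistic modules here.
        return re.findall(r"[a-z']+", text.lower())

    # 1. Token sequence: (term, docID) pairs.
    pairs = [(term, doc_id) for doc_id, text in docs.items() for term in tokenize(text)]

    # 2. Sort by term, then by docID (the core indexing step).
    pairs.sort()

    # 3. Dictionary & postings: merge duplicates, keep sorted docIDs per term.
    index = defaultdict(list)
    for term, doc_id in pairs:
        if not index[term] or index[term][-1] != doc_id:
            index[term].append(doc_id)

    for term in sorted(index):
        print(f"{term:10s} df={len(index[term])}  postings={index[term]}")
    # e.g.  brutus   df=2  postings=[1, 2]
    #       caesar   df=2  postings=[1, 2]
    #       capitol  df=1  postings=[1]

Sorting the full pair list is what makes the merged postings come out already sorted by docID, which the query-processing slides below rely on.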
The index we just built
- How do we process a query?

Query processing: AND
- Consider processing the query Brutus AND Caesar:
  - Locate Brutus in the dictionary; retrieve its postings.
  - Locate Caesar in the dictionary; retrieve its postings.
  - "Merge" the two postings lists:
      Brutus -> 2, 4, 8, 16, 32, 64, 128
      Caesar -> 1, 2, 3, 5, 8, 13, 21, 34

The merge
- Walk through the two postings lists simultaneously, in time linear in the total number of postings entries:
      Brutus -> 2, 4, 8, 16, 32, 64, 128
      Caesar -> 1, 2, 3, 5, 8, 13, 21, 34
  The merge yields docs 2 and 8.
- If the list lengths are x and y, the merge takes O(x+y) operations.
- Crucial: postings are sorted by docID.

Intersecting two postings lists (a "merge" algorithm)
[Figure: pseudocode for intersecting two sorted postings lists by advancing the pointer that sits on the smaller docID; a Python sketch of this intersection appears after the query-optimization exercise below.]

Boolean queries: Exact match
- The Boolean retrieval model lets you ask any query that is a Boolean expression:
  - Boolean queries use AND, OR and NOT to join query terms
  - Views each document as a set of words
  - Is precise: a document matches the condition or it does not
- Perhaps the simplest model to build an IR system on
- Primary commercial retrieval tool for 3 decades
- Many search systems you still use are Boolean: email, library catalogs, Mac OS X Spotlight

Example: WestLaw (http://www.westlaw.com/)
- Largest commercial (paying subscribers) legal search service (started 1975; ranking added 1992)
- Tens of terabytes of data; 700,000 users
- Majority of users still use Boolean queries
- Example query:
  - What is the statute of limitations in cases involving the federal tort claims act?
  - LIMIT! /3 STATUTE ACTION /S FEDERAL /2 TORT /3 CLAIM
  - /3 = within 3 words, /S = in same sentence

Example: WestLaw (http://www.westlaw.com/)
- Another example query:
  - Requirements for disabled people to be able to access a workplace
  - disabl! /p access! /s work-site work-place (employment /3 place)
- Note that SPACE is disjunction, not conjunction!
- Long, precise queries; proximity operators; incrementally developed; not like web search
- Professional searchers often like Boolean search:
  - You know what you are getting
  - But that doesn't mean such queries actually work better...

Query optimization
- What is the best order for query processing?
- Consider a query that is an AND of t terms.
- For each of the t terms, get its postings, then AND them together.
      Brutus    -> 2, 4, 8, 16, 32, 64, 128
      Calpurnia -> 13, 16
      Caesar    -> 1, 2, 3, 5, 8, 16, 21, 34
- Query: Brutus AND Calpurnia AND Caesar

Query optimization example
- Process terms in order of increasing document frequency: start with the smallest set, then keep cutting further.
- This is why we kept document frequency in the dictionary.
- Execute the query as (Calpurnia AND Brutus) AND Caesar.

More general optimization
- e.g., (madding OR crowd) AND (ignoble OR strife)
- Get document frequencies for all terms.
- Estimate the size of each OR by the sum of its doc. frequencies (conservative).
- Process in increasing order of OR sizes.

Exercise
- Recommend a query processing order for (tangerine OR trees) AND (marmalade OR skies) AND (kaleidoscope OR eyes)
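Below is a minimal Python sketch of the two-pointer merge described above, together with a helper that ANDs several terms in order of increasing postings-list length, as the query-optimization slides recommend. The postings lists are the toy Brutus/Caesar/Calpurnia examples; the function names are my own.

    def intersect(p1, p2):
        # Merge two postings lists sorted by docID in O(len(p1) + len(p2)) time.
        answer, i, j = [], 0, 0
        while i < len(p1) and j < len(p2):
            if p1[i] == p2[j]:
                answer.append(p1[i]); i += 1; j += 1
            elif p1[i] < p2[j]:
                i += 1
            else:
                j += 1
        return answer

    def intersect_many(postings_lists):
        # AND of several terms: process in order of increasing document frequency.
        ordered = sorted(postings_lists, key=len)
        result = ordered[0]
        for p in ordered[1:]:
            if not result:      # intermediate result is empty: we can stop early
                break
            result = intersect(result, p)
        return result

    brutus    = [2, 4, 8, 16, 32, 64, 128]
    caesar    = [1, 2, 3, 5, 8, 13, 21, 34]
    calpurnia = [13, 16]

    print(intersect(brutus, caesar))                    # [2, 8], as in the merge example
    print(intersect_many([brutus, calpurnia, caesar]))  # [] -- Calpurnia's short list is ANDed first, and no doc survives all three terms

Keeping postings sorted by docID is what makes this linear merge possible; without it, each AND would need set lookups or sorting at query time.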
Phrase queries
- We want to be able to answer queries such as "stanford university" as a phrase
  - Thus the sentence "I went to university at Stanford" is not a match.
- The concept of phrase queries has proven easily understood by users; it is one of the few "advanced search" ideas that works
- Many more queries are implicit phrase queries
- For this, it no longer suffices to store only <term : docs> entries

A first attempt: Biword indexes
- Index every consecutive pair of terms in the text as a phrase
- For example the text "Friends, Romans, Countrymen" would generate the biwords
  - friends romans
  - romans countrymen
- Each of these biwords is now a dictionary term
- Two-word phrase query processing is now immediate.

Longer phrase queries
- Longer phrases are processed as we did for wild-cards:
  - stanford university palo alto can be broken into the Boolean query on biwords: stanford university AND university palo AND palo alto
- Without the docs, we cannot verify that the docs matching the above Boolean query actually contain the phrase. We can have false positives!

Extended biwords
- Parse the indexed text and perform part-of-speech tagging.
- Bucket the terms into (say) nouns (N) and articles/prepositions (X).
- Now deem any string of terms of the form NX*N to be an extended biword.
- Each such extended biword is now made a term in the dictionary.
- Example: "catcher in the rye" is tagged N X X N
- Query processing: parse the query into N's and X's, segment it into enhanced biwords, and look them up in the index

Issues for biword indexes
- False positives, as noted before
- Index blowup due to a bigger dictionary
  - Infeasible for more than biwords, and big even for them
- Biword indexes are not the standard solution (for all biwords) but can be part of a compound strategy

Solution 2: Positional indexes
- In the postings, store, for each term, entries of the form:
    <term, number of docs containing term;
     doc1: position1, position2, ... ;
     doc2: position1, position2, ... ;
     etc.>

Positional index example
    <be: 993427;
     1: 7, 18, 33, 72, 86, 231;
     2: 3, 149;
     4: 17, 191, 291, 430, 434;
     5: 363, 367, ...>
- Which of docs 1, 2, 4, 5 could contain "to be or not to be"?
- We use a merge algorithm recursively at the document level
- But we now need to deal with more than just equality

Processing a phrase query
- Extract inverted index entries for each distinct term: to, be, or, not.
- Merge their doc:position lists to enumerate all positions of "to be or not to be".
  - to: 2: 1, 17, 74, 222, 551; 4: 8, 16, 190, 429, 433; 7: 13, 23, 191; ...
  - be: 1: 17, 19; 4: 17, 191, 291, 430, 434; 5: 14, 19, 101; ...
- The same general method works for proximity searches
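Here is a minimal sketch of answering a phrase query with a positional index, assuming a simple in-memory mapping from term to {docID: sorted positions}; the toy data echoes the to/be postings above. It is an illustrative implementation of the positional-merge idea rather than the slides' exact algorithm, and the names are my own.

    # Positional index sketch: term -> {docID: sorted list of positions}.
    pos_index = {
        "to": {2: [1, 17, 74, 222, 551], 4: [8, 16, 190, 429, 433], 7: [13, 23, 191]},
        "be": {1: [17, 19], 4: [17, 191, 291, 430, 434], 5: [14, 19, 101]},
    }

    def phrase_query(terms, index):
        # Return docIDs in which the terms occur at consecutive positions.
        candidate_docs = set(index[terms[0]])
        for t in terms[1:]:
            candidate_docs &= set(index[t])   # docs must contain every query term

        hits = []
        for doc in sorted(candidate_docs):
            starts = index[terms[0]][doc]
            # The phrase starts at position p if term i occurs at p + i for every i.
            if any(all(p + i in index[t][doc] for i, t in enumerate(terms))
                   for p in starts):
                hits.append(doc)
        return hits

    print(phrase_query(["to", "be"], pos_index))
    # [4]: only doc 4 has "to" at position 16 immediately followed by "be" at 17

A production system would walk the sorted position lists with the same two-pointer merge used for docIDs rather than doing membership tests, but the idea is the same; relaxing p + i to a window gives proximity search.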
Positional index size
- A positional index expands postings storage substantially, even though it can be compressed
- Nevertheless, a positional index is now standardly used because of the power and usefulness of phrase and proximity queries, whether used explicitly or implicitly in a ranking retrieval system

Positional index size
- Need an entry for each occurrence, not just once per document
- Index size depends on average document size (why?)
  - Average web page has <1000 terms
  - SEC filings, books, even some epic poems: easily 100,000 terms
- Consider a term with frequency 0.1%:

  Document size | Postings | Positional postings
  1,000         | 1        | 1
  100,000       | 1        | 100

Rules of thumb
- A positional index is 2-4 times as large as a non-positional index
- Positional index size is 35-50% of the volume of the original text
- Caveat: all of this holds for "English-like" languages

Combination schemes
- These two approaches can be profitably combined
  - For particular phrases ("Michael Jackson", "Britney Spears") it is inefficient to keep on merging positional postings lists
  - Even more so for phrases like "The Who"
- Williams et al. (2004) evaluate a more sophisticated mixed indexing scheme
  - A typical web query mixture was executed in 1/4 of the time of using just a positional index
  - It required 26% more space than having a positional index alone

IR vs. databases: Structured vs. unstructured data
- Structured data tends to refer to information in "tables":

  Employee | Manager | Salary
  Smith    | Jones   | 50000
  Chang    | Smith   | 60000
  Ivy      | Smith   | 50000

- Typically allows numerical range and exact match (for text) queries, e.g., Salary < 60000 AND Manager = Smith.

Unstructured data
- Typically refers to free text
- Allows
  - Keyword queries including operators
  - More sophisticated "concept" queries, e.g., find all web pages dealing with drug abuse
- This is the classic model for searching text documents

Semi-structured data
- In fact almost no data is "unstructured"
- E.g., this slide has distinctly identified zones such as the Title and Bullets
- This facilitates "semi-structured" search such as
  - Title contains data AND Bullets contain search
- Or even
  - Title is about Object Oriented Programming AND Author something like stro*rup
  - where * is the wild-card operator

Next time: Ranking search results
- Boolean queries give inclusion or exclusion of docs.
- Often we want to rank/group results
  - Need to measure proximity from query to each doc.
  - Need to decide whether docs presented to user are singletons, or a group of docs covering various aspects of the query.

Outline (recap)
- Boolean Retrieval
- The Classic Boolean Search Model
- Normalization, Stemming, Stop Words
- Term-document incidence
- The Inverted Index
- Dictionary and postings file
- Query Optimization
- Phrase Queries