# CS 124/LINGUIST 180, Lecture 8: Information Retrieval (Intro and Boolean Retrieval)


1/10/09, Dan Jurafsky. Thanks to Chris Manning for these slides from his CS 276 Information Retrieval and Web Search class!

## Outline

- Boolean Retrieval
- The Classic Boolean Search Model
- Normalization, Stemming, Stop Words
- Term-document incidence
- The Inverted Index
  - Dictionary and postings file
- Query Optimization
- Phrase Queries

*(Slides from Chris Manning's CS 276 class)*
## Unstructured (text) vs. structured data

[Two figure slides; the comparison graphics are not preserved in this preview.]
## Unstructured data in 1680

Which plays of Shakespeare contain the words *Brutus* AND *Caesar* but NOT *Calpurnia*? One could grep all of Shakespeare's plays for *Brutus* and *Caesar*, then strip out lines containing *Calpurnia*. Why is that not the answer?

- Slow (for large corpora)
- NOT *Calpurnia* is non-trivial
- Other operations (e.g., find the word *Romans* near *countrymen*) are not feasible

(Sec. 1.1)
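As a sketch of the grep-style approach (the per-play snippets below are illustrative stand-ins, not the actual texts), a linear scan can answer the AND part, handles NOT only by post-filtering, and must rescan the entire corpus for every query:

```python
# Naive linear-scan ("grep"-style) retrieval over a toy corpus.
# The play texts here are short stand-in snippets, not the real plays.
plays = {
    "Antony and Cleopatra": "When Antony found Julius Caesar dead he wept "
                            "when at Philippi he found Brutus slain",
    "Hamlet": "I did enact Julius Caesar I was killed i the Capitol "
              "Brutus killed me",
    "Julius Caesar": "Beware Calpurnia dreams Brutus is an honourable man "
                     "said Caesar",
}

def scan(required, forbidden):
    """Return plays containing every word in `required` and none in `forbidden`."""
    hits = []
    for name, text in plays.items():   # rescans the whole corpus per query
        words = set(text.lower().split())
        if all(w in words for w in required) and not any(w in words for w in forbidden):
            hits.append(name)
    return hits

print(scan({"brutus", "caesar"}, {"calpurnia"}))
```

Even on this toy corpus the cost of the approach is visible: every query touches every word of every document, which is exactly what the inverted index later in the lecture avoids.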

## Term-document incidence

Each matrix entry is 1 if the play contains the word and 0 otherwise. Query: *Brutus* AND *Caesar* but NOT *Calpurnia*.

[The incidence matrix itself was a figure and is not preserved in this preview.]
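A minimal sketch of building such a matrix in one pass over the collection (the per-play word sets below are illustrative, not derived from the real texts):

```python
# Build a term-document incidence matrix from a toy collection.
# docs maps each play to the set of words it contains (illustrative only).
docs = {
    "Antony and Cleopatra": {"antony", "brutus", "caesar", "cleopatra"},
    "Julius Caesar":        {"antony", "brutus", "caesar", "calpurnia"},
    "Hamlet":               {"brutus", "caesar", "hamlet"},
}

doc_names = list(docs)                       # fixed column order
terms = sorted(set().union(*docs.values()))  # row per distinct term

# incidence[t][d] == 1 iff document d contains term t, else 0
incidence = {t: [1 if t in docs[d] else 0 for d in doc_names] for t in terms}

print(incidence["brutus"])     # 0/1 vector over the three plays
print(incidence["calpurnia"])
```

Note the matrix is mostly zeros for a real collection, which motivates the inverted index mentioned in the outline.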
## Incidence vectors

So we have a 0/1 vector for each term. To answer the query, take the vectors for *Brutus*, *Caesar*, and *Calpurnia* (complemented, for the NOT), and bitwise AND them: 110100 AND 110111 AND 101111 = 100100.
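The slide's arithmetic can be checked directly with Python integers, treating each 6-bit vector as a binary number (Calpurnia's vector 010000 is complemented before ANDing):

```python
# Bitwise AND of the 0/1 incidence vectors from the slide.
brutus    = 0b110100
caesar    = 0b110111
calpurnia = 0b010000

MASK = 0b111111                       # keep only the 6 document bits
not_calpurnia = ~calpurnia & MASK     # complement: 101111

result = brutus & caesar & not_calpurnia
print(format(result, "06b"))          # 100100
```

The two 1-bits in the result pick out the two answer documents shown on the next slide.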

## Answers to query

*Antony and Cleopatra*, Act III, Scene ii:

> Agrippa [Aside to Domitius Enobarbus]: Why, Enobarbus,
> When Antony found Julius Caesar dead,
> He cried almost to roaring; and he wept
> When at Philippi he found Brutus slain.

*Hamlet*, Act III, Scene ii:

> Lord Polonius: I did enact Julius Caesar:
> I was killed i' the Capitol; Brutus killed me.

(Sec. 1.1)
## Basic assumptions of Information Retrieval

- Collection: a fixed set of documents
- Goal: retrieve documents with information that is relevant to the user's information need and helps the user complete a task

## The classic search model

Task → Info need → Verbal form → Query → Search engine (over a corpus) → Results, with query refinement feeding back into the query. Example:

- Task: get rid of mice in a politically correct way
- Info need: info about removing mice without killing them
- Verbal form: "How do I trap mice alive?"
- Query: *mouse trap*

Things can go wrong at each step: mis-conception of the info need, mis-translation into verbal form, mis-formulation of the query.
## How good are the retrieved docs?

- Precision: fraction of retrieved docs that are relevant to the user's information need
- Recall: fraction of relevant docs in the collection that are retrieved

More precise definitions and measurements to follow.
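Assuming retrieved results and relevance judgments are represented as sets of document IDs (the example sets below are illustrative), the two measures are one line each:

```python
# Precision and recall over sets of document IDs.
def precision(retrieved, relevant):
    """Fraction of retrieved docs that are relevant."""
    return len(retrieved & relevant) / len(retrieved)

def recall(retrieved, relevant):
    """Fraction of relevant docs that were retrieved."""
    return len(retrieved & relevant) / len(relevant)

retrieved = {1, 2, 3, 4}   # illustrative result set
relevant  = {2, 4, 5}      # illustrative relevance judgments

print(precision(retrieved, relevant))  # 2 of 4 retrieved are relevant
print(recall(retrieved, relevant))     # 2 of 3 relevant are retrieved
```

Note the asymmetry: precision divides by what the system returned, recall by what the collection actually contains.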

## Bigger collections

Consider N = 1 million documents, each with about 1,000 words. At an average of 6 bytes per term (including spaces and punctuation), that is 6 GB of data in the documents. Say there are …
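The slide's size estimate is a simple multiplication, checked here as a sanity test (not part of the original slide):

```python
# Back-of-the-envelope size of the collection:
# 1M docs x ~1,000 words/doc x ~6 bytes/term.
n_docs = 1_000_000
words_per_doc = 1_000
bytes_per_term = 6

total_bytes = n_docs * words_per_doc * bytes_per_term
print(total_bytes)                # 6,000,000,000 bytes
print(total_bytes / 1e9, "GB")    # 6.0 GB
```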