CS 124/LINGUIST 180: From Languages to Information
Dan Jurafsky
Lecture 10: Information Retrieval III — Evaluation and assorted IR/Web topics
Thanks to Chris Manning for these slides from his CS 276 Information Retrieval and Web Search class!

This lecture
• How do we know if our results are any good?
  • Evaluating a search engine
  • Benchmarks
  • Precision and recall
• Remaining assorted IR/Web topics:
  • Results summaries: making our good results usable to a user
  • Relevance feedback
  • Searching the Web

Evaluating search engines

Measures for a search engine
• How fast does it index
  • Number of documents/hour
  • (Average document size)
• How fast does it search
  • Latency as a function of index size
• Expressiveness of query language
  • Ability to express complex information needs
  • Speed on complex queries
• Uncluttered UI
• Is it free?

Measures for a search engine
• All of the preceding criteria are measurable:
  • we can quantify speed/size;
  • we can make expressiveness precise
• The key measure: user happiness
  • What is this?
  • Speed of response/size of index are factors
  • But blindingly fast, useless answers won't make a user happy
• Need a way of quantifying user happiness

Measuring user happiness
• Issue: who is the user we are trying to make happy?
  • Depends on the setting
• Web engine: users find what they want and return to the engine
  • Can measure the rate of return users
• eCommerce site: users find what they want and make a purchase
  • Is it the end user, or the eCommerce site, whose happiness we measure?
  • Measure time to purchase, or fraction of searchers who become buyers?

Measuring user happiness
• Enterprise (company/govt/academic): care about "user productivity"
  • How much time do my users save when looking for information?
• Many other criteria having to do with breadth of access, secure access, etc.

Happiness: elusive to measure
• Most common proxy: relevance of search results
• But how do you measure relevance?
• We will detail a methodology here, then examine its issues
• Relevance measurement requires 3 elements:
  1. A benchmark document collection
  2. A benchmark suite of queries
  3. A usually binary assessment of either Relevant or Nonrelevant for each query and each document
• Some work uses more-than-binary assessments, but that is not the standard

Evaluating an IR system
• Note: the information need is translated into a query
• Relevance is assessed relative to the information need, not the query
• E.g., information need: I'm looking for information on whether drinking red wine is more effective at reducing your risk of heart attacks than white wine.
• Query: wine red white heart attack effective
• You evaluate whether the doc addresses the information need, not whether it has these words
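The three benchmark elements listed above can be held in a very small data structure. Here is a minimal sketch in Python of one way to represent such a test collection; the document texts, IDs, and the single query are made up for illustration, and the binary qrels mapping is what the later precision/recall computations would be scored against.

```python
# A minimal sketch of a test collection (illustrative IDs and texts, not a real benchmark).

# 1. A benchmark document collection: doc_id -> text.
docs = {
    "d1": "Red wine and the risk of heart attacks ...",
    "d2": "White wine pairings for seafood ...",
    "d3": "Wine tourism and tasting tours ...",
}

# 2. A benchmark suite of queries: query_id -> query text.
queries = {"q1": "wine red white heart attack effective"}

# 3. Binary relevance assessments (qrels): for each query, the set of doc_ids
#    judged Relevant with respect to the information need, not the query words.
qrels = {"q1": {"d1"}}
```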
Standard relevance benchmarks
• TREC: the National Institute of Standards and Technology (NIST) has run a large IR test bed for many years
• Reuters and other benchmark doc collections used
• "Retrieval tasks" specified, sometimes as queries
• Human experts mark, for each query and for each doc, Relevant or Nonrelevant
  • or at least for the subset of docs that some system returned for that query

Unranked retrieval evaluation: precision and recall
• Precision: fraction of retrieved docs that are relevant = P(relevant | retrieved)
• Recall: fraction of relevant docs that are retrieved = P(retrieved | relevant)

                  Relevant   Nonrelevant
  Retrieved         tp          fp
  Not retrieved     fn          tn

• Precision P = tp / (tp + fp)
• Recall R = tp / (tp + fn)

Should we instead use the accuracy measure for evaluation?
• Given a query, an engine classifies each doc as "Relevant" or "Nonrelevant"
• The accuracy of an engine: the fraction of these classifications that are correct
  • (tp + tn) / (tp + fp + fn + tn)
• Accuracy is a commonly used evaluation measure in machine learning classification work
• Why is this not a very useful evaluation measure in IR?

Why not just use accuracy?
• How to build a 99.9999% accurate search engine on a low budget: answer every search with "0 matching results found."
• People doing information retrieval want to find something and have a certain tolerance for junk.

Precision/Recall
• You can get high recall (but low precision) by retrieving all docs for all queries!
• Recall is a non-decreasing function of the number of docs retrieved
• In a good system, precision decreases as either the number of docs retrieved or recall increases
  • This is not a theorem, but a result with strong empirical confirmation
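To make the contingency-table definitions above concrete, and to see why accuracy is uninformative when almost every document is Nonrelevant, here is a minimal sketch in Python; the collection size and the judged-relevant set are made-up numbers for illustration.

```python
# A minimal sketch: precision, recall, and accuracy from retrieved/relevant sets.

def precision_recall_accuracy(retrieved, relevant, collection_size):
    tp = len(retrieved & relevant)         # relevant docs that were retrieved
    fp = len(retrieved - relevant)         # retrieved but not relevant
    fn = len(relevant - retrieved)         # relevant but not retrieved
    tn = collection_size - tp - fp - fn    # everything else
    precision = tp / (tp + fp) if retrieved else 0.0
    recall = tp / (tp + fn) if relevant else 0.0
    accuracy = (tp + tn) / collection_size
    return precision, recall, accuracy

# Made-up numbers: a million-doc collection, 10 relevant docs, and the
# "low-budget" engine above that returns nothing for every query.
relevant = {"d%d" % i for i in range(10)}
print(precision_recall_accuracy(set(), relevant, 1_000_000))
# -> (0.0, 0.0, 0.99999): useless results, yet 99.999% "accurate".
```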
Difficulties in using precision/recall
• Should average over large document collection/query ensembles
• Need human relevance assessments
  • People aren't reliable assessors
• Assessments have to be binary
  • Nuanced assessments?
• Heavily skewed by collection/authorship
  • Results may not translate from one domain to another

A combined measure: F
• The combined measure that assesses the precision/recall tradeoff is the F measure, a weighted harmonic mean:
  F = 1 / (α·(1/P) + (1 − α)·(1/R)) = (β² + 1)·P·R / (β²·P + R), where β² = (1 − α)/α
• People usually use the balanced F1 measure, i.e., with β = 1 (or α = ½): F1 = 2PR / (P + R)
• The harmonic mean is a conservative average
• See C. J. van Rijsbergen, Information Retrieval

F1 and other averages (graph)

Evaluating ranked results
• The system can return any number of results
• By taking various numbers of the top returned documents (levels of recall), the evaluator can produce a precision-recall curve

Recall/Precision: a worked example (assume 10 relevant docs in the collection; this example is recomputed in the code sketch below)

  Rank:       1     2     3     4     5     6     7     8     9     10
  Judgment:   R     N     N     R     R     N     R     N     N     N
  Recall:     10%   10%   10%   20%   30%   30%   40%   40%   40%   40%
  Precision:  100%  50%   33%   50%   60%   50%   57%   50%   44%   40%

A precision-recall curve (graph)

Averaging over queries
• A precision-recall graph for one query isn't a very sensible thing to look at
• You need to average performance over a whole bunch of queries.
• But there's a technical issue:
  • Precision-recall calculations place some points on the graph
  • How do you determine a value (interpolate) between the points?

Interpolated precision
• Idea: if locally precision increases with increasing recall, then you should get to count that…
• So you take the maximum of the precisions to the right of that recall value

Evaluation
• Graphs are good, but people want summary measures!
• Precision at fixed retrieval level
  • Precision-at-k: precision of the top k results
  • Perhaps appropriate for most web search: all people want are good matches on the first one or two results pages
  • But: it averages badly and has an arbitrary parameter k
• 11-point interpolated average precision
  • The standard measure in the early TREC competitions: take the precision at 11 recall levels varying from 0 to 1 by tenths, using interpolation (the value for recall 0 is always interpolated), and average them
  • Evaluates performance at all recall levels

Typical (good) 11-point precisions
• SabIR/Cornell 8A1 11-point precision from TREC 8 (1999) (graph)

Yet more evaluation measures…
• Mean average precision (MAP)
  • Average of the precision values obtained for the top k documents, each time a relevant doc is retrieved
  • Avoids interpolation and the use of fixed recall levels
  • MAP for a query collection is the arithmetic average
    • Macro-averaging: each query counts equally

Variance
• For a test collection, it is usual that a system does crummily on some information needs (e.g., MAP = 0.1) and excellently on others (e.g., MAP = 0.7)
• Indeed, it is usually the case that the variance in performance of the same system across queries is much greater than the variance of different systems on the same query.
• That is, there are easy information needs and hard ones!
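The following sketch pulls the ranked-evaluation ideas above together: it recomputes the worked Recall/Precision example, the interpolated precision behind the 11-point measure, and the per-query average precision that MAP averages over queries. The code and its conventions (for example, counting relevant documents that are never retrieved as precision 0) are my own illustration, not taken from the slides.

```python
# Worked example from the Recall/Precision table: 1 = Relevant, 0 = Nonrelevant,
# with 10 relevant documents assumed to exist in the whole collection.
ranked = [1, 0, 0, 1, 1, 0, 1, 0, 0, 0]
total_relevant = 10

# Precision and recall after each rank k.
points = []
tp = 0
for k, rel in enumerate(ranked, start=1):
    tp += rel
    points.append((tp / total_relevant, tp / k))   # (recall, precision)

# Interpolated precision at recall level r: the max precision at any recall >= r.
def interpolated_precision(r):
    return max((p for rec, p in points if rec >= r), default=0.0)

# 11-point interpolated average precision (recall = 0.0, 0.1, ..., 1.0;
# the value at recall 0 is always interpolated, as noted above).
eleven_point = sum(interpolated_precision(i / 10) for i in range(11)) / 11

# Average precision for this query: mean of the precision values at each rank
# where a relevant doc is retrieved; relevant docs never retrieved count as 0
# (one common convention).  Macro-averaging this over a query set gives MAP.
avg_precision = sum(p for (rec, p), rel in zip(points, ranked) if rel) / total_relevant

print(points)                       # reproduces the recall/precision table
print(eleven_point, avg_precision)
```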
TREC
• The TREC Ad Hoc task from the first 8 TRECs is the standard IR task
  • 50 detailed information needs a year
  • Human evaluation of pooled results returned
• More recently, other related tracks: Web track, HARD

A TREC query (TREC 5)
<top>
<num> Number: 225
<desc> Description:
What is the main function of the Federal Emergency Management Agency (FEMA) and the funding level provided to meet emergencies? Also, what resources are available to FEMA such as people, equipment, facilities?
</top>

Standard relevance benchmarks: others
• GOV2
  • Another TREC/NIST collection
  • 25 million web pages
  • Largest collection that is easily available
  • But still 3 orders of magnitude smaller than what Google/Yahoo/MSN index
• NTCIR
  • East Asian language and cross-language information retrieval
• Cross Language Evaluation Forum (CLEF)
  • This evaluation series has concentrated on European languages and cross-language information retrieval.
• Many others

Critique of pure relevance
• Relevance vs. marginal relevance
  • A document can be redundant even if it is highly relevant
    • Duplicates
    • The same information from different sources
  • Marginal relevance is a better measure of utility for the user.
• Using facts/entities as evaluation units more directly measures true relevance.
  • But it is harder to create the evaluation set

Evaluation at large search engines
• Search engines have test collections of queries and hand-ranked results
• Recall is difficult to measure on the web
  • Search engines often use precision at top k, e.g., k = 10
• Search engines also use non-relevance-based measures.
  • Clickthrough on first result
    • Not very reliable if you look at a single clickthrough … but pretty reliable in the aggregate.
  • Studies of user behavior in the lab
  • A/B testing

A/B testing
• Purpose: test a single innovation
• Assume: you have a large search engine up and running.
• Have most users use the old system
• Divert a small proportion of traffic (e.g., 1%) to the new system that includes the innovation
• Evaluate with an automatic measure like clickthrough on first result
• Now we can directly see if the innovation does improve user happiness.
• Probably the evaluation methodology that large search engines trust most
• In principle less powerful than doing a multivariate regression analysis, but easier to understand
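The slides leave "evaluate with an automatic measure like clickthrough" abstract. Purely as an illustrative sketch, the snippet below compares first-result clickthrough rates between the control bucket and the small diverted bucket using a two-proportion z-test; the choice of test, the bucket sizes, and the click counts are my assumptions, not something specified in the lecture.

```python
from math import sqrt

def ab_clickthrough_test(clicks_a, n_a, clicks_b, n_b):
    """Compare clickthrough rates: control bucket A vs. treatment bucket B.

    Returns (ctr_a, ctr_b, z); |z| well above ~2 suggests the difference
    is unlikely to be noise (a standard two-proportion z-test)."""
    ctr_a, ctr_b = clicks_a / n_a, clicks_b / n_b
    pooled = (clicks_a + clicks_b) / (n_a + n_b)
    se = sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    return ctr_a, ctr_b, (ctr_b - ctr_a) / se

# Hypothetical traffic split: 99% of users on the old system, 1% diverted
# to the system with the innovation.  All counts are invented.
print(ab_clickthrough_test(clicks_a=412_000, n_a=990_000,
                           clicks_b=4_450, n_b=10_000))
```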
Results Summaries

Result summaries
• Having ranked the documents matching a query, we wish to present a results list
• Most commonly, a list of the document titles plus a short summary, aka "10 blue links"

Summaries
• The title is typically automatically extracted from document metadata. What about the summaries?
  • This description is crucial.
  • Users can identify good/relevant hits based on the description.
• Two basic kinds:
  • Static
  • Dynamic
• A static summary of a document is always the same, regardless of the query that hit the doc
• A dynamic summary is a query-dependent attempt to explain why the document was retrieved for the query at hand

Static summaries
• Typically a static summary is a subset of the document
  • Extractive summary
• Simplest heuristic: the first 50 or so words of the document
  • Summary cached at indexing time
• More sophisticated: extract from each document a set of "key" sentences
  • Simple NLP heuristics to score each sentence
  • Summary is made up of top-scoring sentences.
• Most sophisticated: NLP used to synthesize a summary
  • Seldom used in IR; research area called text summarization

Dynamic summaries
• Present "windows" within the document that contain the query terms
• "KWIC" snippets: Keyword-in-Context presentation

Techniques for dynamic summaries
• Find small windows in the doc that contain query terms
  • Requires fast window lookup in a document cache
• Score each window with respect to the query
  • Use various features such as window width, position in the document, etc.
  • Combine features through a scoring function
• Challenges in evaluation: judging summaries
  • Easier to do pairwise comparisons than binary relevance assessments

Quicklinks
• For a navigational query such as united airlines, the user's need is likely satisfied on www.united.com
• Quicklinks provide navigational cues on that home page

Relevance Feedback

Relevance feedback
• Relevance feedback: user feedback on the relevance of docs in an initial set of results
  • The user issues a (short, simple) query
  • The user marks some results as relevant or non-relevant.
  • The system computes a better representation of the information need based on the feedback.
  • Relevance feedback can go through one or more iterations.
• Idea: it may be difficult to formulate a good query when you don't know the collection well, so iterate

Similar pages (screenshot)

Relevance feedback: example
• Image search engine: http://nayana.ece.ucsb.edu/imsearch/imsearch.html
• (Screenshots: results for the initial query, the user's relevance feedback, and the results after relevance feedback)

Initial query/results
• Initial query: new space satellite applications
  1. 0.539, 08/13/91, NASA Hasn't Scrapped Imaging Spectrometer
  2. 0.533, 07/09/91, NASA Scratches Environment Gear From Satellite Plan
  3. 0.528, 04/04/90, Science Panel Backs NASA Satellite Plan, But Urges Launches of Smaller Probes
  4. 0.526, 09/09/91, A NASA Satellite Project Accomplishes Incredible Feat: Staying Within Budget
  5. 0.525, 07/24/90, Scientist Who Exposed Global Warming Proposes Satellites for Climate Research
  6. 0.524, 08/22/90, Report Provides Support for the Critics Of Using Big Satellites to Study Climate
  7. 0.516, 04/13/87, Arianespace Receives Satellite Launch Pact From Telesat Canada
  8. 0.509, 12/02/87, Telecommunications Tale of Two Companies
• The user then marks relevant documents with "+" (three of them in this example).
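The expanded query shown next was computed by the system from the marked documents. As an illustration of the kind of computation involved, here is a minimal sketch of a Rocchio-style feedback step (the classic approach covered in IIR Section 9.1.1); the term vectors, weights, and parameter values below are invented for the example, not the ones the system actually used.

```python
from collections import defaultdict

def rocchio(query_vec, relevant_docs, nonrelevant_docs,
            alpha=1.0, beta=0.75, gamma=0.15):
    """Return alpha*q + beta*centroid(relevant) - gamma*centroid(nonrelevant)."""
    new_q = defaultdict(float)
    for term, w in query_vec.items():
        new_q[term] += alpha * w
    for doc_set, sign, coef in ((relevant_docs, 1, beta),
                                (nonrelevant_docs, -1, gamma)):
        for doc in doc_set:
            for term, w in doc.items():
                new_q[term] += sign * coef * w / len(doc_set)
    # Negative weights are usually clipped to zero.
    return {t: w for t, w in new_q.items() if w > 0}

# Toy vectors in the spirit of the space-satellite example (weights invented).
q = {"new": 1.0, "space": 1.0, "satellite": 1.0, "application": 1.0}
rel = [{"satellite": 3.0, "nasa": 2.0, "launch": 1.0},
       {"satellite": 2.0, "space": 2.0, "eos": 1.0}]
nonrel = [{"telecommunications": 2.0, "company": 1.0}]
print(rocchio(q, rel, nonrel))   # a reweighted, expanded query
```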
Expanded query after relevance feedback

   2.074  new            15.106  space
  30.816  satellite       5.660  application
   5.991  nasa            5.196  eos
   4.196  launch          3.972  aster
   3.516  instrument      3.446  arianespace
   3.004  bundespost      2.806  ss
   2.790  rocket          2.053  scientist
   2.003  broadcast       1.172  earth
   0.836  oil             0.646  measure

Results for expanded query (where a document also appeared in the initial results, its initial rank is noted in parentheses)
  1. 0.513, 07/09/91, NASA Scratches Environment Gear From Satellite Plan (was 2)
  2. 0.500, 08/13/91, NASA Hasn't Scrapped Imaging Spectrometer (was 1)
  3. 0.493, 08/07/89, When the Pentagon Launches a Secret Satellite, Space Sleuths Do Some Spy Work of Their Own
  4. 0.493, 07/31/89, NASA Uses 'Warm' Superconductors For Fast Circuit
  5. 0.492, 12/02/87, Telecommunications Tale of Two Companies (was 8)
  6. 0.491, 07/09/91, Soviets May Adapt Parts of SS-20 Missile For Commercial Use
  7. 0.490, 07/12/88, Gaping Gap: Pentagon Lags in Race To Match the Soviets In Rocket Launchers
  8. 0.490, 06/14/90, Rescue of Satellite By Space Agency To Cost $90 Million

Query expansion
• In relevance feedback, users give additional input (relevant/non-relevant) on documents, which is used to reweight the terms in the documents
• In query expansion, users give additional input (good/bad search term) on words or phrases

Query assist (screenshot)

How do we augment the user query?
• Manual thesaurus
  • E.g., MedLine: physician, syn: doc, doctor, MD, medico
  • Can be a query rather than just synonyms
• Global analysis (static; of all documents in the collection)
  • Automatically derived thesaurus (co-occurrence statistics)
  • Refinements based on query-log mining
    • Common on the web
• Local analysis (dynamic)
  • Analysis of documents in the result set
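As a toy illustration of the "automatically derived thesaurus (co-occurrence statistics)" option above, the sketch below counts how often terms co-occur within documents of a tiny invented collection and uses those counts to add expansion terms to a query. The collection, the document-level co-occurrence window, and the cutoff of two expansion terms are all assumptions made for the example.

```python
from collections import Counter, defaultdict
from itertools import combinations

docs = [
    "red wine heart attack risk study",
    "white wine heart health research",
    "wine tasting napa valley tour",
]

# Count document-level term co-occurrences over the toy collection.
cooccur = defaultdict(Counter)
for doc in docs:
    terms = set(doc.split())
    for a, b in combinations(terms, 2):
        cooccur[a][b] += 1
        cooccur[b][a] += 1

def expand(query, k=2):
    """Add the k terms that co-occur most often with the query terms."""
    q_terms = query.split()
    scores = Counter()
    for term in q_terms:
        scores.update(cooccur[term])
    extra = [t for t, _ in scores.most_common() if t not in q_terms][:k]
    return q_terms + extra

print(expand("wine heart"))   # query terms plus two co-occurring expansion terms
```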
Web Search

Brief (non-technical) history
• Early keyword-based engines ca. 1995–1997
  • Altavista, Excite, Infoseek, Inktomi, Lycos
• Paid search ranking: Goto (morphed into Overture.com → Yahoo!)
  • Your search ranking depended on how much you paid
  • Auction for keywords: casino was expensive!

Brief (non-technical) history
• 1998+: Link-based ranking pioneered by Google
  • Blew away all early engines save Inktomi
  • Great user experience in search of a business model
  • Meanwhile Goto/Overture's annual revenues were nearing $1 billion
• Result: Google added paid search "ads" to the side, independent of the search results
  • Yahoo followed suit, acquiring Overture (for paid placement) and Inktomi (for search)
• 2005+: Google gains search share, dominating in Europe and very strong in North America
• 2009: Yahoo! and Microsoft propose a combined paid search offering
• (Figure: a results page showing paid ads alongside the algorithmic results)

User needs
• Need [Brod02, RL04]
  • Informational – want to learn about something (~40% / 65%)
  • Navigational – want to go to that page (~25% / 15%)
  • Transactional – want to do something (web-mediated) (~35% / 20%)
    • Access a service
    • Downloads
    • Shop
  • Gray areas
    • Find a good hub
    • Exploratory search: "see what's there"

How far do people look for results?
(Source: iprospect.com WhitePaper_2006_SearchEngineUserBehavior.pdf)

Users' empirical evaluation of results
• Quality of pages varies widely
  • Relevance is not enough
  • Other desirable qualities (non-IR!):
    • Content: trustworthy, diverse, non-duplicated, well maintained
    • Web readability: displays correctly and fast
    • No annoyances: pop-ups, etc.
• Precision vs. recall
  • On the web, recall seldom matters
• What matters
  • Precision at 1? Precision above the fold?
  • Comprehensiveness – must be able to deal with obscure queries
    • Recall matters when the number of matches is very small
• User perceptions may be unscientific, but they are significant over a large aggregate

Users' empirical evaluation of engines
• Relevance and validity of results
• UI – simple, no clutter, error tolerant
• Trust – results are objective
• Coverage of topics for polysemic queries
• Pre/post-processing tools provided
  • Mitigate user errors (auto spell check, search assist, …)
  • Explicit: search within results, more like this, refine, …
  • Anticipative: related searches
• Deal with idiosyncrasies
  • Web-specific vocabulary
    • Impact on stemming, spell-check, etc.
  • Web addresses typed in the search box
• …

The Web document collection
• No design/co-ordination
• Distributed content creation, linking, democratization of publishing
• Content includes truth, lies, obsolete information, contradictions …
• Unstructured (text, HTML, …), semi-structured (XML, annotated photos), structured (databases) …
• Scale much larger than previous text collections … but corporate records are catching up
• Growth – slowed from the initial "volume doubling every few months" but still expanding
• Content can be dynamically generated