CS 124/LINGUIST 180: From Languages to Information
Dan Jurafsky
Lecture 9: Information Retrieval II (Ranked Retrieval)
Thanks to Chris Manning for these slides from his CS 276 Information Retrieval and Web Search class!

Outline: Ranked Retrieval
- The vector space model:
  - tf-idf:
    - tf: Term frequency
    - idf: Inverse document frequency
- Advanced issues:
  - Zone indices
  - Query term proximity
  - Computing cosine ranking efficiently
- Combining many different features for ranking:
  - Machine learning

Ranked retrieval
- Thus far, our queries have all been Boolean.
  - Documents either match or don't.
- Good for expert users with a precise understanding of their needs and the collection.
  - Also good for applications: applications can easily consume 1000s of results.
- Not good for the majority of users.
  - Most users are incapable of writing Boolean queries (or they are, but they think it's too much work).
  - Most users don't want to wade through 1000s of results.
  - This is particularly true of web search.

Problem with Boolean search: feast or famine (Ch. 6)
- Boolean queries often result in either too few (=0) or too many (1000s) results.
  - Query 1: "standard user dlink 650" → 200,000 hits
  - Query 2: "standard user dlink 650 no card found" → 0 hits
- It takes a lot of skill to come up with a query that produces a manageable number of hits.
  - AND gives too few; OR gives too many.

Ranked retrieval models
- Rather than a set of documents satisfying a query expression, in ranked retrieval models the system returns an ordering over the (top) documents in the collection with respect to a query.
- Free text queries: rather than a query language of operators and expressions, the user's query is just one or more words in a human language.
- In principle, these are two separate choices, but in practice ranked retrieval models have normally been associated with free text queries, and vice versa.

Feast or famine: not a problem in ranked retrieval (Ch. 6)
- When a system produces a ranked result set, large result sets are not an issue.
  - Indeed, the size of the result set is not an issue.
  - We just show the top k (≈ 10) results.
  - We don't overwhelm the user.
- Premise: the ranking algorithm works.

Scoring as the basis of ranked retrieval (Ch. 6)
- We wish to return, in order, the documents most likely to be useful to the searcher.
- How can we rank-order the documents in the collection with respect to a query?
- Assign a score – say in [0, 1] – to each document.
- This score measures how well document and query "match".

Query-document matching scores (Ch. 6)
- We need a way of assigning a score to a query/document pair.
- Let's start with a one-term query.
- If the query term does not occur in the document: the score should be 0.
- The more frequent the query term in the document, the higher the score (should be).
- We will look at a number of alternatives for this.

Take 1: Jaccard coefficient (Ch. 6)
- A commonly used measure of the overlap of two sets A and B:
  - jaccard(A,B) = |A ∩ B| / |A ∪ B|
  - jaccard(A,A) = 1
  - jaccard(A,B) = 0 if A ∩ B = ∅
- A and B don't have to be the same size.
- Always assigns a number between 0 and 1.

Jaccard coefficient: Scoring example
- What is the query-document match score that the Jaccard coefficient computes for each of the two documents below?
  - Query: ides of march
  - Document 1: caesar died in march
  - Document 2: the long march
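A minimal sketch of this Jaccard scoring in Python (the whitespace tokenization and function name are illustrative choices, not from the slides):

```python
def jaccard(a, b):
    """Jaccard coefficient of two sets: |A ∩ B| / |A ∪ B|."""
    a, b = set(a), set(b)
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)

query = "ides of march".split()
doc1 = "caesar died in march".split()
doc2 = "the long march".split()

print(jaccard(query, doc1))  # 1/6 ≈ 0.167 (overlap: {march}; union has 6 terms)
print(jaccard(query, doc2))  # 1/5 = 0.2   (overlap: {march}; union has 5 terms)
```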
Issues with Jaccard for scoring (Ch. 6)
- It doesn't consider term frequency (how many times a term occurs in a document).
- Rare terms in a collection are more informative than frequent terms. Jaccard doesn't consider this information.
- We need a more sophisticated way of normalizing for length.
- Later in this lecture, we'll use cosine-style length normalization instead of |A ∩ B| / |A ∪ B| (Jaccard).

Recall: Binary term-document incidence matrix
- Each document is represented by a binary vector ∈ {0,1}^|V|.

Term-document count matrices
- Consider the number of occurrences of a term t in a document d: tf_t,d.
- Each document is a count vector in ℕ^|V|: a column of the term-document count matrix.

Bag of words model
- The vector representation doesn't consider the ordering of words in a document.
- "John is quicker than Mary" and "Mary is quicker than John" have the same vectors.
- This is called the bag of words model.
- In a sense, this is a step back: the positional index was able to distinguish these two documents.
- We will look at "recovering" positional information later.
- For now: bag of words model.

Term frequency tf
- The term frequency tf_t,d of term t in document d is defined as the number of times that t occurs in d.
- We want to use tf when computing query-document match scores. But how?
- Raw term frequency is not what we want:
  - A document with 10 occurrences of the term is more relevant than a document with 1 occurrence of the term.
  - But not 10 times more relevant.
- Relevance does not increase proportionally with term frequency.

Log-frequency weighting (Sec. 6.2)
- The log frequency weight of term t in d is:
  w_t,d = 1 + log10(tf_t,d) if tf_t,d > 0, and 0 otherwise.
- 0 → 0, 1 → 1, 2 → 1.3, 10 → 2, 1000 → 4, etc.
- Score for a document-query pair: sum over terms t in both q and d:
  score(q, d) = Σ_{t ∈ q ∩ d} (1 + log10 tf_t,d)
- The score is 0 if none of the query terms is present in the document.

Document frequency (Sec. 6.2.1)
- Rare terms are more informative than frequent terms.
  - Recall stop words.
- Consider a term in the query that is rare in the collection (e.g., arachnocentric).
- A document containing this term is very likely to be relevant to the query arachnocentric.
- → We want a high weight for rare terms like arachnocentric.

Document frequency, continued (Sec. 6.2.1)
- Frequent terms are less informative than rare terms.
- Consider a query term that is frequent in the collection (e.g., high, increase, line).
- A document containing such a term is more likely to be relevant than a document that doesn't.
- But it's not a sure indicator of relevance.
- → For frequent terms, we want high positive weights for words like high, increase, and line, but lower weights than for rare terms.
- We will use document frequency (df) to capture this.

idf weight (Sec. 6.2.1)
- df_t is the document frequency of t: the number of documents that contain t.
  - df_t is an inverse measure of the informativeness of t.
  - df_t ≤ N.
- We define the idf (inverse document frequency) of t by:
  idf_t = log10(N / df_t)
- We use log(N/df_t) instead of N/df_t to "dampen" the effect of idf. It will turn out that the base of the log is immaterial.
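A minimal sketch of the log-frequency weight, the overlap score, and the idf weight as defined above (the function names and toy numbers are illustrative):

```python
import math
from collections import Counter

def log_tf(tf):
    """Log-frequency weight: 1 + log10(tf) if tf > 0, else 0."""
    return 1 + math.log10(tf) if tf > 0 else 0.0

def overlap_score(query_terms, doc_terms):
    """Sum of log-frequency weights over terms shared by query and document."""
    tf = Counter(doc_terms)
    return sum(log_tf(tf[t]) for t in set(query_terms) if tf[t] > 0)

def idf(df, N):
    """Inverse document frequency: log10(N / df)."""
    return math.log10(N / df)

print(log_tf(10))         # 2.0
print(log_tf(1000))       # 4.0
print(idf(1, 1_000_000))  # 6.0 (a very rare term gets a high idf)
```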
idf example, suppose N = 1 million (Sec. 6.2.1)

  term          df_t        idf_t
  calpurnia             1       6
  animal              100       4
  sunday            1,000       3
  fly              10,000       2
  under           100,000       1
  the           1,000,000       0

- idf_t = log10(N / df_t); there is one idf value for each term t in a collection.

Effect of idf on ranking
- Does idf have an effect on ranking for one-term queries, like "iPhone"?
- idf has no effect on ranking one-term queries.
- idf affects the ranking of documents for queries with at least two terms.
- For the query capricious person, idf weighting makes occurrences of capricious count for much more in the final document ranking than occurrences of person.

Collection vs. Document frequency (Sec. 6.2.1)
- The collection frequency of t is the number of occurrences of t in the collection, counting multiple occurrences.
- Example:

  Word        Collection frequency   Document frequency
  insurance                  10440                 3997
  try                        10422                 8760

- Which word is a better search term (and should get a higher weight)?

tf-idf weighting (Sec. 6.2.2)
- The tf-idf weight of a term is the product of its tf weight and its idf weight:
  w_t,d = (1 + log10 tf_t,d) × log10(N / df_t)
- Best known weighting scheme in information retrieval.
- Note: the "-" in tf-idf is a hyphen, not a minus sign!
- Alternative names: tf.idf, tf x idf.
- Increases with the number of occurrences within a document.
- Increases with the rarity of the term in the collection.

Final ranking of documents for a query (Sec. 6.2.2)
  Score(q, d) = Σ_{t ∈ q ∩ d} tf.idf_t,d

Binary → count → weight matrix (Sec. 6.3)
- Each document is now represented by a real-valued vector of tf-idf weights ∈ ℝ^|V|.
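A sketch of how a count vector becomes a vector of tf-idf weights (the counts and document frequencies below are made up for illustration; N is taken to be 1 million as in the idf example above):

```python
import math

N = 1_000_000  # assumed collection size, as in the idf example above

def tf_idf(tf, df, n=N):
    """tf-idf weight: (1 + log10 tf) * log10(N / df), or 0 if the term is absent."""
    if tf == 0:
        return 0.0
    return (1 + math.log10(tf)) * math.log10(n / df)

# Hypothetical counts for one document, and document frequencies for each term.
counts = {"fly": 3, "sunday": 1, "the": 12}
df = {"fly": 10_000, "sunday": 1_000, "the": 1_000_000}

weights = {t: tf_idf(counts[t], df[t]) for t in counts}
print(weights)
# fly ≈ 2.95, sunday = 3.0, the = 0.0 (zero weight because its df equals N)
```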
Documents as vectors (Sec. 6.3)
- So we have a |V|-dimensional vector space.
- Terms are axes of the space.
- Documents are points or vectors in this space.
- Very high-dimensional: tens of millions of dimensions when you apply this to a web search engine.
- These are very sparse vectors: most entries are zero.

Queries as vectors
- Key idea 1: do the same for queries: represent them as vectors in the space.
- Key idea 2: rank documents according to their proximity to the query in this space.
  - proximity = similarity of vectors
  - proximity ≈ inverse of distance
- Recall: we do this because we want to get away from the you're-either-in-or-out Boolean model.
- Instead: rank more relevant documents higher than less relevant documents.

Formalizing vector space proximity (Sec. 6.3)
- First cut: distance between two points (= distance between the end points of the two vectors).
- Euclidean distance?
- Euclidean distance is a bad idea . . .
- . . . because Euclidean distance is large for vectors of different lengths.

Why distance is a bad idea (Sec. 6.3)
- The Euclidean distance between q and d2 is large even though the distribution of terms in the query q and the distribution of terms in the document d2 are very similar.

Use angle instead of distance (Sec. 6.3)
- Thought experiment: take a document d and append it to itself. Call this document d′.
- "Semantically" d and d′ have the same content.
- The Euclidean distance between the two documents can be quite large.
- The angle between the two documents is 0, corresponding to maximal similarity.
- Key idea: rank documents according to angle with query.

From angles to cosines (Sec. 6.3)
- The following two notions are equivalent:
  - Rank documents in decreasing order of the angle between query and document.
  - Rank documents in increasing order of cosine(query, document).
- Cosine is a monotonically decreasing function on the interval [0°, 180°].

From angles to cosines (Sec. 6.3)
- But how – and why – should we be computing cosines?

Length normalization (Sec. 6.3)
- A vector can be (length-)normalized by dividing each of its components by its length – for this we use the L2 norm:
  ||x||_2 = √(Σ_i x_i²)
- Dividing a vector by its L2 norm makes it a unit (length) vector (on the surface of the unit hypersphere).
- Effect on the two documents d and d′ (d appended to itself) from the earlier slide: they have identical vectors after length-normalization.
- Long and short documents now have comparable weights.

cosine(query, document) (Sec. 6.3)
  cos(q, d) = (q · d) / (|q| |d|) = Σ_i q_i d_i / ( √(Σ_i q_i²) × √(Σ_i d_i²) )
- q · d is the dot product; q/|q| and d/|d| are unit vectors.
- q_i is the tf-idf weight of term i in the query; d_i is the tf-idf weight of term i in the document.
- cos(q, d) is the cosine similarity of q and d … or, equivalently, the cosine of the angle between q and d.

Cosine for length-normalized vectors
- For length-normalized vectors, cosine similarity is simply the dot product (or scalar product):
  cos(q, d) = q · d = Σ_{i=1}^{|V|} q_i d_i
  for q, d length-normalized.

Cosine similarity illustrated

Cosine similarity among 3 documents
- How similar are the novels SaS (Sense and Sensibility), PaP (Pride and Prejudice), and WH (Wuthering Heights)?
- Term frequencies (counts):

  term        SaS   PaP   WH
  affection   115    58   20
  jealous      10     7   11
  gossip        2     0    6
  wuthering     0     0   38

- Note: to simplify this example, we don't do idf weighting.

3 documents example, contd.
- Log frequency weighting:

  term        SaS    PaP    WH
  affection   3.06   2.76   2.30
  jealous     2.00   1.85   2.04
  gossip      1.30   0      1.78
  wuthering   0      0      2.58

- After normalization:

  term        SaS     PaP     WH
  affection   0.789   0.832   0.524
  jealous     0.515   0.555   0.465
  gossip      0.335   0       0.405
  wuthering   0       0       0.588

- cos(SaS, PaP) ≈ 0.789 × 0.832 + 0.515 × 0.555 + 0.335 × 0.0 + 0.0 × 0.0 ≈ 0.94
- cos(SaS, WH) ≈ 0.79
- cos(PaP, WH) ≈ 0.69
- Why do we have cos(SaS, PaP) > cos(SaS, WH)?

Computing cosine scores

tf-idf weighting has many variants
- Columns headed 'n' are acronyms for weight schemes.
- Why is the base of the log in idf immaterial?

Weighting may differ in queries vs. documents (Sec. 6.4)
- Many search engines allow for different weightings for queries vs. documents.
- SMART notation: denotes the combination in use in an engine, with the notation ddd.qqq, using the acronyms from the previous table.
- A very standard weighting scheme is: lnc.ltc
  - Document: logarithmic tf (l as first character), no idf, and cosine normalization.
  - Query: logarithmic tf (l in leftmost column), idf (t in second column), cosine normalization.

tf-idf example: lnc.ltc (Sec. 6.4)
- Document: car insurance auto insurance
- Query: best car insurance

  Query (ltc):
  Term        tf-raw   tf-wt   df      idf   wt    n'lize
  auto        0        0       5000    2.3   0     0
  best        1        1       50000   1.3   1.3   0.34
  car         1        1       10000   2.0   2.0   0.52
  insurance   1        1       1000    3.0   3.0   0.78

  Document (lnc):
  Term        tf-raw   tf-wt   wt    n'lize
  auto        1        1       1     0.52
  best        0        0       0     0
  car         1        1       1     0.52
  insurance   2        1.3     1.3   0.68

  Product of normalized weights: auto 0, best 0, car 0.27, insurance 0.53

- Exercise: what is N, the number of docs?
- Doc length = √(1² + 0² + 1² + 1.3²) ≈ 1.92
- Score = 0 + 0 + 0.27 + 0.53 = 0.8
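A minimal sketch reproducing the lnc.ltc computation above (the idf values are taken from the table; the variable and function names are illustrative):

```python
import math

def log_tf(tf):
    """Log-frequency weight: 1 + log10(tf) if tf > 0, else 0."""
    return 1 + math.log10(tf) if tf > 0 else 0.0

def normalize(vec):
    """Divide each component by the vector's L2 norm (cosine normalization)."""
    length = math.sqrt(sum(w * w for w in vec.values()))
    return {t: w / length for t, w in vec.items()}

idf = {"auto": 2.3, "best": 1.3, "car": 2.0, "insurance": 3.0}  # from the table

# Document weights (lnc): log tf, no idf, cosine normalization.
doc_tf = {"car": 1, "insurance": 2, "auto": 1}
doc = normalize({t: log_tf(tf) for t, tf in doc_tf.items()})

# Query weights (ltc): log tf, times idf, cosine normalization.
query_tf = {"best": 1, "car": 1, "insurance": 1}
query = normalize({t: log_tf(tf) * idf[t] for t, tf in query_tf.items()})

score = sum(query[t] * doc.get(t, 0.0) for t in query)
print(round(score, 2))  # ≈ 0.8, matching the worked example above
```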
Summary – vector space ranking
- Represent the query as a weighted tf-idf vector.
- Represent each document as a weighted tf-idf vector.
- Compute the cosine similarity score for the query vector and each document vector.
- Rank documents with respect to the query by score.
- Return the top K (e.g., K = 10) to the user.

Parametric and zone indexes
- Thus far, a doc has been a sequence of terms.
- In fact, documents have multiple parts, some with special semantics:
  - Author
  - Title
  - Date of publication
  - Language
  - Format
  - etc.
- These constitute the metadata about a document.

Fields
- We sometimes wish to search by this metadata.
  - E.g., find docs authored by William Shakespeare in the year 1601, containing alas poor Yorick.
- Year = 1601 is an example of a field.
  - Also, author last name = shakespeare, etc.
- Field or parametric index: postings for each field value.
  - Sometimes build range trees (e.g., for dates).
- A field query is typically treated as a conjunction
  (the doc must be authored by shakespeare).

Zone
- A zone is a region of the doc that can contain an arbitrary amount of text, e.g.,
  - Title
  - Abstract
  - References …
- Build inverted indexes on zones as well to permit querying.
  - E.g., "find docs with merchant in the title zone and matching the query gentle rain".

Example zone indexes
- Encode zones in the dictionary vs. in the postings.

Query term proximity (Sec. 7.2.2)
- Free text queries: just a set of terms typed into the query box – common on the web.
- Users prefer docs in which query terms occur within close proximity of each other.
- Let w be the smallest window in a doc containing all query terms, e.g.,
  - for the query strained mercy, the smallest window in the doc "The quality of mercy is not strained" is 4 (words).
- We would like the scoring function to take this into account – how?

Query parsers (Sec. 7.2.3)
- A free text query from the user may in fact spawn one or more queries to the indexes, e.g. for the query rising interest rates:
  - Run the query as a phrase query.
  - If < K docs contain the phrase rising interest rates, run the two phrase queries rising interest and interest rates.
  - If we still have < K docs, run the vector space query rising interest rates.
  - Rank matching docs by vector space scoring.
- This sequence is issued by a query parser.

Efficient cosine ranking (Sec. 7.1)
- Find the K docs in the collection "nearest" to the query ⇒ the K largest query-doc cosines.
- Efficient ranking:
  - Computing a single cosine efficiently.
  - Choosing the K largest cosine values efficiently.
    - Can we do this without computing all N cosines?

Special case – unweighted queries (Sec. 7.1)
- No weighting on query terms.
- Assume each query term occurs only once.
- Then for ranking, we don't need to normalize the query vector.
- A slight simplification of the algorithm above.

Faster cosine: unweighted query (Sec. 7.1)
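The pseudocode for this slide is a figure in the original deck; what follows is only a minimal term-at-a-time sketch of the idea under the unweighted-query assumption, with data layout and names chosen for illustration:

```python
import heapq
from collections import defaultdict

def fast_cosine_score(query_terms, postings, doc_length, k=10):
    """Term-at-a-time scoring with per-document accumulators.

    postings: term -> list of (doc_id, tf_idf_weight) pairs
    doc_length: doc_id -> L2 norm of the document's weight vector
    The query is unweighted, so we just sum document-side weights.
    """
    scores = defaultdict(float)
    for term in query_terms:
        for doc_id, weight in postings.get(term, []):
            scores[doc_id] += weight
    for doc_id in scores:
        scores[doc_id] /= doc_length[doc_id]
    # Pick the K best docs without sorting all of them (heap-based selection,
    # as discussed on the following slides).
    return heapq.nlargest(k, scores.items(), key=lambda item: item[1])
```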
Computing the K largest cosines: selection vs. sorting (Sec. 7.1)
- Typically we want to retrieve the top K docs (in the cosine ranking for the query),
  not to totally order all docs in the collection.
- Can we pick off the docs with the K highest cosines?
- Let J = number of docs with nonzero cosines.
- We seek the K best of these J.

Use heap for selecting top K (Sec. 7.1)
- A binary tree in which each node's value > the values of its children.
- Takes 2J operations to construct, then each of the K "winners" is read off in 2 log J steps.
- For J = 1M, K = 100, this is about 10% of the cost of sorting.

Bottlenecks (Sec. 7.1.1)
- Primary computational bottleneck in scoring: the cosine computation.
- Can we avoid all this computation?
- Yes, but we may sometimes get it wrong:
  - a doc not in the top K may creep into the list of K output docs.
- Is this such a bad thing?

Index elimination (Sec. 7.1.2)
- The basic algorithm FastCosineScore only considers docs containing at least one query term.
- Take this further:
  - Only consider high-idf query terms.
  - Only consider docs containing many query terms.

High-idf query terms only (Sec. 7.1.2)
- For a query such as catcher in the rye:
  - only accumulate scores from catcher and rye.
- Intuition: in and the contribute little to the scores and so don't alter the rank-ordering much.
- Benefit:
  - Postings of low-idf terms have many docs → these (many) docs get eliminated from the set A of contenders.

Docs containing many query terms (Sec. 7.1.2)
- Any doc with at least one query term is a candidate for the top K output list.
- For multi-term queries, only compute scores for docs containing several of the query terms.
  - Say, at least 3 out of 4.
- Imposes a "soft conjunction" on queries, as seen on web search engines (early Google).
- Easy to implement in postings traversal.

3 of 4 query terms (Sec. 7.1.2)
- Postings (doc IDs):

  Antony:    3 4 8 16 32 64 128
  Brutus:    2 4 8 16 32 64 128
  Caesar:    1 2 3 5 8 13 21 34
  Calpurnia: 13 16 32

- Scores are only computed for docs 8, 16 and 32.

Integrating multiple features to determine relevance
- Modern systems – especially on the Web – use a great number of features:
  - Arbitrary useful features – not a single unified model.
  - Log frequency of query word in anchor text?
  - Query word in color on page?
  - # of images on page?
  - # of (out) links on page?
  - PageRank of page?
  - URL length?
  - URL contains "~"?
  - Page edit recency?
  - Page length?
- The New York Times (2008-06-03) quoted Amit Singhal as saying Google was using over 200 such features.

How to combine features to assign relevance to a document?
- Given lots of relevant features:
  - Build a classifier to learn the weights on the features.
- Requires:
  - labeled training data.

Simple example: Using classification for ad hoc IR (Sec. 15.4.1)
- Collect a training corpus of (q, d, r) triples.
  - Relevance r is here binary (but may be multiclass, with 3–7 values).
- A document is represented by a feature vector
  - x = (α, ω), where α is the cosine similarity and ω is the minimum query window size.
  - ω is the shortest text span that includes all query words.
  - Query term proximity is a very important new weighting factor.
- Train a machine learning model to predict the class r of a document-query pair.
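A minimal sketch of computing the ω feature, i.e., the smallest window (in words) containing all query terms; the brute-force approach and names here are illustrative assumptions:

```python
def min_window(doc_terms, query_terms):
    """Smallest number of consecutive words in doc_terms covering every query term.

    Returns None if some query term is missing from the document.
    Brute force over window start positions; fine for short documents.
    """
    needed = set(query_terms)
    if not needed.issubset(doc_terms):
        return None
    best = None
    n = len(doc_terms)
    for start in range(n):
        seen = set()
        for end in range(start, n):
            if doc_terms[end] in needed:
                seen.add(doc_terms[end])
            if seen == needed:
                width = end - start + 1
                best = width if best is None else min(best, width)
                break
    return best

doc = "the quality of mercy is not strained".split()
print(min_window(doc, ["strained", "mercy"]))  # 4, as in the proximity example above
```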
Simple example: Using classification for ad hoc IR (Sec. 15.4.1)
- A linear score function is then:
  Score(d, q) = Score(α, ω) = aα + bω + c
- And the linear classifier is:
  decide relevant if Score(d, q) > θ
- … just like when we were doing text classification.

Simple example: Using classification for ad hoc IR (Sec. 15.4.1)
- [Figure: training examples plotted by cosine score α against term proximity ω, each labeled R (relevant) or N (nonrelevant), with a linear decision surface separating the two classes.]
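A minimal sketch of this two-feature linear classifier; the weights a, b, c and threshold θ below are made-up values for illustration, whereas in practice they would be learned from the labeled (q, d, r) triples:

```python
def score(alpha, omega, a=8.0, b=-0.5, c=0.0):
    """Linear score over the two features: cosine similarity (alpha) and
    minimum query window size (omega). Smaller windows should help, so b < 0."""
    return a * alpha + b * omega + c

def is_relevant(alpha, omega, theta=1.0):
    """Linear classifier: decide relevant if Score(d, q) > theta."""
    return score(alpha, omega) > theta

# A document with high cosine similarity and tight term proximity ...
print(is_relevant(alpha=0.4, omega=3))   # True  (score = 8*0.4 - 0.5*3 = 1.7)
# ... versus one with low similarity and scattered query terms.
print(is_relevant(alpha=0.1, omega=12))  # False (score = 0.8 - 6.0 = -5.2)
```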