lecture7-vectorspace-handout-6-per

# Introduction to Information Retrieval, Sec. 7.1.4

… + cosine(q,d)

- Can use some other linear combination
- Indeed, any function of the two signals of user happiness (more later)
- Now we seek the top K docs by net score

- Denote this by g(d)
- Thus, a quantity like the number of citations is scaled into [0,1]
  - Exercise: suggest a formula for this.

## Top K by net score: fast methods

- First idea: order all postings by g(d)
- Key: this is a common ordering for all postings
- Thus, can concurrently traverse query terms' postings for
  - Postings intersection
  - Cosine score computation
- Exercise: write pseudocode for cosine score computation if postings are ordered by g(d)

## Why order postings by g(d)?

- Under g(d)-ordering, top-scoring docs are likely to appear early in postings traversal
- In time-bound applications (say, we have to return whatever search results we can in 50 ms), this allows us to stop postings traversal early
  - Short of computing scores for all docs in postings

## Champion lists in g(d)-ordering

- Can combine champion lists with g(d)-ordering
- Maintain for each term a champion list of the r docs with highest g(d) + tf-idf(t,d)
- Seek top-K results from only the docs in these champion lists

## High and low lists

- For each term, we maintain two postings lists called high and low
  - Think of high as the champion list
- When traversing postings on a query, only traverse high lists first
  - If we get more than K docs, select the top K and stop
  - Else proceed to get docs from the low lists
- Can be used even for simple cosine scores, without global quality g(d)

...
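
The exercise on cosine score computation over g(d)-ordered postings can be approached in several ways. The following is one possible sketch in Python, not the lecture's own solution. It assumes each postings list holds `(doc_id, weight)` pairs already sorted by descending g(d), and that `weight` is a length-normalized tf-idf weight, so summing weights over query terms gives a cosine-like score. The function name `top_k_net_score` and the `max_docs` parameter (a deterministic stand-in for a time budget such as 50 ms) are hypothetical.

```python
import heapq
from itertools import groupby


def top_k_net_score(query_terms, postings, g, k, max_docs=None):
    """Document-at-a-time scoring over postings sorted by descending g(d).

    postings: term -> list of (doc_id, weight), sorted by (-g[doc_id], doc_id).
    g:        doc_id -> static quality score in [0, 1].
    max_docs: hypothetical stand-in for a time budget; stop after scoring
              this many documents (None means traverse everything).
    Returns up to k (net_score, doc_id) pairs, best first.
    """
    def common_order(posting):
        return (-g[posting[0]], posting[0])

    # Every list shares one ordering, so the lists can be merged in one pass
    # and all postings for the same document come out adjacent.
    streams = [postings.get(term, []) for term in query_terms]
    merged = heapq.merge(*streams, key=common_order)

    top = []   # min-heap of (net_score, doc_id), size at most k
    seen = 0
    for doc_id, group in groupby(merged, key=lambda p: p[0]):
        if max_docs is not None and seen >= max_docs:
            break  # budget exhausted: return what we have so far
        cosine = sum(weight for _, weight in group)
        net = g[doc_id] + cosine
        if len(top) < k:
            heapq.heappush(top, (net, doc_id))
        elif net > top[0][0]:
            heapq.heapreplace(top, (net, doc_id))
        seen += 1
    return sorted(top, key=lambda pair: (-pair[0], pair[1]))


# Toy data (invented for illustration only).
g = {1: 0.75, 2: 0.5, 3: 0.25}
postings = {
    "cat": [(1, 0.125), (2, 0.25), (3, 0.5)],
    "dog": [(1, 0.125), (3, 0.5)],
}
full = top_k_net_score(["cat", "dog"], postings, g, k=2)
early = top_k_net_score(["cat", "dog"], postings, g, k=2, max_docs=2)
```

In this toy data, document 3 has the lowest g(d) but the highest net score, so the early-stopped run misses it. Stopping early trades exactness for speed; it works well only to the extent that high-g(d) documents also tend to have high net scores.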

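The high and low lists described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the lecture's implementation: the names `build_high_low` and `top_k_tiered` are hypothetical, postings are held as in-memory dictionaries from `doc_id` to a tf-idf weight, and the high list for a term is taken to be the r docs with the largest g(d) + tf-idf(t,d). The stopping test uses "at least K docs", one reading of the slide's "more than K".

```python
def build_high_low(postings, g, r):
    """Split each term's postings into a high (champion) and a low list.

    postings: term -> {doc_id: tfidf_weight}
    g:        doc_id -> static quality score in [0, 1]
    r:        number of docs kept in each high list
    """
    high, low = {}, {}
    for term, docs in postings.items():
        ranked = sorted(docs, key=lambda d: (-(g[d] + docs[d]), d))
        high[term] = {d: docs[d] for d in ranked[:r]}
        low[term] = {d: docs[d] for d in ranked[r:]}
    return high, low


def accumulate(query_terms, lists, acc):
    """Add each query term's weights from `lists` into the accumulator."""
    for term in query_terms:
        for doc_id, weight in lists.get(term, {}).items():
            acc[doc_id] = acc.get(doc_id, 0.0) + weight


def top_k_tiered(query_terms, high, low, g, k):
    """Traverse high lists first; fall back to low lists only if needed."""
    acc = {}
    accumulate(query_terms, high, acc)
    if len(acc) < k:
        # Too few candidates in the high tier: also read the low tier.
        accumulate(query_terms, low, acc)
    ranked = sorted(((g[d] + s, d) for d, s in acc.items()),
                    key=lambda pair: (-pair[0], pair[1]))
    return ranked[:k]


# Toy data (invented for illustration only).
g = {1: 0.5, 2: 0.25, 3: 0.75, 4: 0.125}
postings = {
    "cat": {1: 0.5, 2: 0.25, 3: 0.125, 4: 0.5},
    "dog": {2: 0.5, 3: 0.25},
}
high, low = build_high_low(postings, g, r=2)
only_high = top_k_tiered(["cat"], high, low, g, k=2)
with_low = top_k_tiered(["cat"], high, low, g, k=3)
```

Setting every g(d) to 0 gives the plain cosine variant mentioned in the last bullet. Note that when the high tier alone yields enough candidates, any weight a candidate has in a low list is ignored, so the scores are approximate.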