CS 6740: Advanced Language Technologies                                February 25, 2010
Lecture 9: Language Models: More on Query Likelihood
Lecturer: Lillian Lee        Scribes: Navin Sivakumar, Lakshmi Ganesh, Taiyang Chen

Abstract

The language modeling approach to document retrieval scores the relevance of a document d with respect to a query q as the likelihood that a language model induced from d would generate q. This query-generative model is counterintuitive; after all, we are looking for a retrieval system that takes a query as input and generates relevant documents, yet here we appear to be trying to generate the query itself. In this lecture we examine the query-generative perspective of the language modeling approach in more detail and attempt to justify it.

1 Review of the Language Modeling Approach

First, we briefly review the language-modeling approach to scoring documents [PC98]. Recall that if \vec{x} denotes an m-dimensional vector, then x[j] denotes the j-th entry, and we write x[\cdot] for \sum_{j=1}^{m} x[j].

Given a document d, we consider the language model (a probability distribution over strings of terms) induced by the document. The score of a document d with respect to a query q is given by the probability assigned to q by the language model induced by d. More formally, a document d induces a vector \vec{\theta}_d of parameters. We then score d with respect to a query q by P_{\vec{\theta}_d}(\vec{q}), where P_{\vec{\theta}_d} denotes the probability distribution specified by the parameters \vec{\theta}_d, and \vec{q} is the vector of term counts^1 in the query q.

Typically, we consider multinomial distributions P_{\vec{\theta}}, which are parametrized by a length parameter L, specifying the number of trials (in our case, the length of the string being generated), and the probabilities \theta[j] of term v_j occurring in each trial.
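The multinomial generation process just described can be sketched in a few lines of Python. This is an illustrative sketch, not part of the original notes: it fixes the length parameter L to the query length and represents \vec{\theta} as a dictionary mapping terms to probabilities.

```python
import math
from collections import Counter

def multinomial_query_prob(theta, query_terms):
    """Probability that a multinomial with term probabilities `theta`
    (dict: term -> theta[j]) assigns to the term-count vector of
    `query_terms`, with the number of trials L = len(query_terms)."""
    q = Counter(query_terms)          # term-count vector \vec{q}
    L = len(query_terms)
    # Multinomial coefficient: number of rearrangements of the query terms.
    k = math.factorial(L)
    for count in q.values():
        k //= math.factorial(count)
    prob = float(k)
    for term, count in q.items():
        # Unseen terms get probability 0, zeroing out the whole product.
        prob *= theta.get(term, 0.0) ** count
    return prob
```

For example, with theta = {"a": 0.5, "b": 0.5} and query ["a", "a", "b"], the coefficient is 3!/(2!·1!) = 3 and the probability is 3 · 0.5³ = 0.375.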
This gives us a scoring function of the form

    P_{\vec{\theta}}(\vec{q}) = k \prod_j \theta[j]^{q[j]}        (1)

where k is a constant (independent of the \theta[j] parameters) giving the number of possible rearrangements of the terms in q. Since k is independent of the parameters that arise from the document d, the following is equivalent under rank:

    P_{\vec{\theta}}(\vec{q}) \stackrel{rank}{=} \prod_j \theta[j]^{q[j]}        (2)

^1 From time to time we are somewhat loose in interpreting a language model as either a distribution on strings of terms or a distribution on term-count vectors.

From a document d we induce the parameter \theta_d[j] through the rate of occurrence of term v_j in d:

    \theta_d[j] = f_d[j] / f_d[\cdot]        (3)

With Dirichlet smoothing [MP95], we can induce the TF, IDF, and length-normalization terms.

2 Why "Query Likelihood"?

Let us discuss why it is appropriate to consider the query as the object being generated by our model. As a point of comparison, we consider the alternative approach of inducing a language model from a query and considering the document as the object being generated. [...]
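Returning to the estimation step of Section 1, Eq. (3)'s maximum-likelihood estimate can be sketched alongside a smoothed variant. The notes cite [MP95] for Dirichlet smoothing without giving a formula; the form below, (f_d[j] + \mu p_C(v_j)) / (f_d[\cdot] + \mu), with a collection-wide term distribution p_C and pseudo-count mass \mu, is one standard version and is assumed here for illustration.

```python
from collections import Counter

def mle_theta(doc_terms):
    """Eq. (3): theta_d[j] = f_d[j] / f_d[.], the rate of occurrence
    of each term in the document."""
    f = Counter(doc_terms)
    total = sum(f.values())
    return {t: c / total for t, c in f.items()}

def dirichlet_theta(doc_terms, collection_probs, mu=2000.0):
    """Dirichlet-smoothed estimate (assumed standard form, not from the
    notes): theta_d[j] = (f_d[j] + mu * p_C(v_j)) / (f_d[.] + mu).
    `collection_probs` maps every vocabulary term to p_C(v_j)."""
    f = Counter(doc_terms)
    total = sum(f.values())
    return {t: (f.get(t, 0) + mu * p) / (total + mu)
            for t, p in collection_probs.items()}
```

Note that smoothing gives every vocabulary term nonzero probability, so a query term absent from d no longer forces the score in Eq. (2) to zero.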