CS 6740: Advanced Language Technologies                                February 25, 2010
Lecture 9: Language Models: More on Query Likelihood
Lecturer: Lillian Lee        Scribes: Navin Sivakumar, Lakshmi Ganesh, Taiyang Chen

Abstract

The Language Modeling approach to document retrieval scores the relevance of a document $d$ with respect to a query $q$ as the likelihood that a language model induced from $d$ would generate $q$. This query-generative model is counter-intuitive; after all, we are looking for a retrieval system that takes a query as input and generates relevant *documents*; yet, here we appear to be trying to generate the query itself. In this lecture we examine the query-generative perspective of the language modeling approach in more detail and attempt to justify it.

1 Review of the Language Modeling Approach

First, we briefly review the language-modeling approach to scoring documents [PC98]. Recall that if $\vec{x}$ denotes an $m$-dimensional vector, then $x[j]$ denotes the $j$th entry and we write $x[\cdot]$ for $\sum_{j=1}^{m} x[j]$.

Given a document $d$, we consider the language model (a probability distribution over strings of terms) induced by the document. The score of a document $d$ with respect to a query $q$ is given by the probability assigned to $q$ by the language model induced by $d$. More formally, a document $d$ induces a vector $\vec{\theta}_d$ of parameters. We then score $d$ with respect to a query $q$ by $P_{\vec{\theta}_d}(\vec{q})$, where $P_{\vec{\theta}_d}$ denotes the probability distribution specified by the parameters $\vec{\theta}_d$, and $\vec{q}$ is the vector of term counts¹ in the query $q$.

    ¹From time to time we are somewhat loose in interpreting a language model as either a distribution on strings of terms or a distribution on term-count vectors.

Typically, we consider multinomial distributions $P_{\vec{\theta}}$, which are parametrized by a length parameter $L$, specifying the number of trials (in our case, the length of the string being generated), and the probabilities $\theta[j]$ of term $v_j$ occurring in each trial. This gives us a scoring function of the form

    $P_{\vec{\theta}}(\vec{q}) = k \prod_j \theta[j]^{q[j]}$        (1)

where $k$ is a constant (independent of the $\theta[j]$ parameters) giving the number of possible rearrangements of the terms in $q$. Since $k$ is independent of the parameters that arise from the document $d$, the following is equivalent under rank:

    $P_{\vec{\theta}}(\vec{q}) \stackrel{\text{rank}}{=} \prod_j \theta[j]^{q[j]}$        (2)

From a document $d$ we induce the parameter $\theta_d[j]$ through the rate of occurrence of term $v_j$ in $d$:

    $\theta_d[j] = \frac{f_d[j]}{f_d[\cdot]}$        (3)

With Dirichlet smoothing [MP95], we can induce the TF, IDF, and length-normalization terms.

2 Why "Query Likelihood"?

Let us discuss why it is appropriate to consider the query as the object being generated by our model. As a point of comparison, we consider the alternative approach of inducing a language model from a query and considering the document as the object being generated. ...
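To make the Section 1 machinery concrete, here is a minimal Python sketch of query-likelihood ranking under equations (2) and (3), with the Dirichlet smoothing mentioned above. Everything in it (the function names, the toy corpus, the smoothing weight mu) is an illustrative assumption, not part of the lecture; it also sums log-probabilities rather than multiplying raw probabilities, which avoids numerical underflow and preserves the ranking.

```python
import math
from collections import Counter

def dirichlet_theta(term, doc_counts, corpus_probs, mu=2000.0):
    # Smoothed estimate: (f_d[term] + mu * P(term | corpus)) / (f_d[.] + mu).
    # Setting mu = 0 recovers the maximum-likelihood estimate of equation (3).
    doc_len = sum(doc_counts.values())
    return (doc_counts[term] + mu * corpus_probs[term]) / (doc_len + mu)

def query_log_likelihood(query_terms, doc_counts, corpus_probs, mu=2000.0):
    # Rank-equivalent form of equation (2): sum_j q[j] * log theta_d[j].
    # The multinomial coefficient k is dropped; it does not affect ranking.
    q = Counter(query_terms)
    return sum(count * math.log(dirichlet_theta(t, doc_counts, corpus_probs, mu))
               for t, count in q.items())

# Toy corpus of two "documents" plus a background corpus distribution.
docs = {
    "d1": Counter("language models score queries by likelihood".split()),
    "d2": Counter("vector space models score documents by similarity".split()),
}
corpus_counts = Counter()
for counts in docs.values():
    corpus_counts.update(counts)
total = sum(corpus_counts.values())
corpus_probs = {t: n / total for t, n in corpus_counts.items()}

# Keep only in-vocabulary query terms so the corpus_probs lookups succeed.
query = [t for t in "likelihood of queries".split() if t in corpus_probs]

for name, counts in docs.items():
    print(name, round(query_log_likelihood(query, counts, corpus_probs, mu=10.0), 3))
```

With mu = 0 the smoothed estimate reduces to the MLE of equation (3), under which any document missing a query term scores zero; a positive mu instead backs off to the corpus distribution, so a term's contribution when absent depends on how common it is corpus-wide. This back-off behavior is one way to see where the IDF-like effect noted at the end of Section 1 can come from.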