CS 6740: Advanced Language Technologies                                February 25, 2010
Lecture 9: Language Models: More on Query Likelihood
Lecturer: Lillian Lee        Scribes: Navin Sivakumar, Lakshmi Ganesh, Taiyang Chen

Abstract

The Language Modeling approach to document retrieval scores the relevance of a document $d$ with respect to a query $q$ as the likelihood that a language model induced from $d$ would generate $q$. This query-generative model is counter-intuitive; after all, we are looking for a retrieval system that takes a query as input and generates relevant documents, yet here we appear to be trying to generate the query itself. In this lecture we examine the query-generative perspective of the language modeling approach in more detail and attempt to justify it.

1 Review of the Language Modeling Approach

First, we briefly review the language-modeling approach to scoring documents [PC98]. Recall that if $\vec{x}$ denotes an $m$-dimensional vector, then $x[j]$ denotes the $j$th entry and we write $x[\cdot]$ for $\sum_{j=1}^{m} x[j]$.

Given a document $d$, we consider the language model (a probability distribution over strings of terms) induced by the document. The score of a document $d$ with respect to a query $q$ is given by the probability assigned to $q$ by the language model induced by $d$. More formally, a document $d$ induces a vector $\vec{\theta}_d$ of parameters. We then score $d$ with respect to a query $q$ by $P_{\vec{\theta}_d}(\vec{q})$, where $P_{\vec{\theta}_d}$ denotes the probability distribution specified by the parameters $\vec{\theta}_d$, and $\vec{q}$ is the vector of term counts$^1$ in the query $q$.

Typically, we consider multinomial distributions $P_{\vec{\theta}}$, which are parametrized by a length parameter $L$, specifying the number of trials (in our case, the length of the string being generated), and the probabilities $\theta[j]$ of term $v_j$ occurring in each trial. This gives us a scoring function of the form

$$P_{\vec{\theta}}(\vec{q}) = k \prod_j \theta[j]^{q[j]} \qquad (1)$$

where $k$ is a constant (independent of the $\theta[j]$ parameters) giving the number of possible rearrangements of the terms in $q$. Since $k$ is independent of the parameters that arise from the document $d$, the following is equivalent under rank:

$$P_{\vec{\theta}}(\vec{q}) \stackrel{\mathrm{rank}}{=} \prod_j \theta[j]^{q[j]} \qquad (2)$$

From a document $d$ we induce the parameter $\theta_d[j]$ through the rate of occurrence of term $v_j$ in $d$:

$$\theta_d[j] = \frac{f_d[j]}{f_d[\cdot]} \qquad (3)$$

With Dirichlet smoothing [MP95], we can induce the TF, IDF, and length-normalization terms.

$^1$ From time to time we are somewhat loose in interpreting a language model as either a distribution on strings of terms or a distribution on term-count vectors.

2 Why "Query Likelihood"?

Let us discuss why it is appropriate to consider the query as the object being generated by our model. As a point of comparison, we consider the alternative approach of inducing a language model from a query and considering the document as the object being generated. ...
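To make Section 1's scoring function concrete, the following is a minimal Python sketch of query-likelihood ranking. It computes the rank-equivalent score of Equation (2) in log space, estimates $\theta_d[j]$ by the maximum-likelihood rate of Equation (3), and adds one common formulation of Dirichlet smoothing; the exact smoothed estimate, the choice of $\mu$, and the toy documents and query are illustrative assumptions rather than details taken from the lecture.

```python
from collections import Counter
from math import log

def mle_theta(doc_counts):
    """Unsmoothed estimate of Eq. (3): theta_d[j] = f_d[j] / f_d[.]."""
    total = sum(doc_counts.values())
    return {term: n / total for term, n in doc_counts.items()}

def dirichlet_theta(doc_counts, collection_probs, mu=10.0):
    """Dirichlet-smoothed estimate (a common formulation, assumed here
    rather than quoted from the lecture):
        theta_d[j] = (f_d[j] + mu * p_C[j]) / (f_d[.] + mu)
    where p_C[j] is the term's probability in the whole collection."""
    total = sum(doc_counts.values())
    return {term: (doc_counts.get(term, 0) + mu * p) / (total + mu)
            for term, p in collection_probs.items()}

def query_log_likelihood(query_counts, theta):
    """Rank-equivalent log score, i.e. the log of Eq. (2):
    sum_j q[j] * log theta_d[j].  A zero-probability query term
    sends the score to -infinity (the motivation for smoothing)."""
    score = 0.0
    for term, count in query_counts.items():
        p = theta.get(term, 0.0)
        if p == 0.0:
            return float("-inf")
        score += count * log(p)
    return score

# Toy collection and query (hypothetical, for illustration only).
docs = {
    "d1": Counter("the cat sat on the mat".split()),
    "d2": Counter("the dog chased the cat".split()),
}
query = Counter("cat mat".split())

# Collection language model p_C: pool all documents and normalize.
pooled = Counter()
for counts in docs.values():
    pooled.update(counts)
pool_size = sum(pooled.values())
collection_probs = {term: n / pool_size for term, n in pooled.items()}

# Rank the documents by unsmoothed and smoothed query log-likelihood.
for name, counts in docs.items():
    unsmoothed = query_log_likelihood(query, mle_theta(counts))
    smoothed = query_log_likelihood(query, dirichlet_theta(counts, collection_probs))
    print(name, "MLE:", unsmoothed, " Dirichlet:", round(smoothed, 3))
```

Working in log space avoids underflow from multiplying many small per-term probabilities, and the smoothed estimate keeps the score finite when a query term does not occur in the document (here, "mat" is missing from d2), which is exactly where the unsmoothed estimate of Equation (3) breaks down.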