Document Language Models, Query Models, and Risk Minimization for Information Retrieval

John Lafferty
School of Computer Science
Carnegie Mellon University
Pittsburgh, PA 15213

Chengxiang Zhai
School of Computer Science
Carnegie Mellon University
Pittsburgh, PA 15213

ABSTRACT

We present a framework for information retrieval that combines document models and query models using a probabilistic ranking function based on Bayesian decision theory. The framework suggests an operational retrieval model that extends recent developments in the language modeling approach to information retrieval. A language model for each document is estimated, as well as a language model for each query, and the retrieval problem is cast in terms of risk minimization. The query language model can be exploited to model user preferences, the context of a query, synonymy and word senses. While recent work has incorporated word translation models for this purpose, we introduce a new method using Markov chains defined on a set of documents to estimate the query models. The Markov chain method has connections to algorithms from link analysis and social networks. The new approach is evaluated on TREC collections and compared to the basic language modeling approach and vector space models together with query expansion using Rocchio. Significant improvements are obtained over standard query expansion methods for strong baseline TF-IDF systems, with the greatest improvements attained for short queries on Web data.

1. INTRODUCTION

The language modeling approach to information retrieval has recently been proposed as a new alternative to traditional vector space models and other probabilistic models. In the use of language modeling by Ponte and Croft, a unigram language model is estimated for each document, and the likelihood of the query according to this model is used to score the document for ranking. Miller et al. smooth the document language model with a background model using hidden Markov model techniques, and demonstrate good performance on TREC benchmarks. Berger and Lafferty use methods from statistical machine translation to incorporate synonymy into the document language ...

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. To copy otherwise, to republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. SIGIR'01, September 9-12, 2001, New Orleans, Louisiana, USA. Copyright 2001 ACM 1-58113-331-6/01/0009 ...$5.00.
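
The query-likelihood scoring described in the introduction (estimate a unigram language model per document, then rank documents by the likelihood of the query under each model) can be illustrated with a minimal sketch. Everything specific here is an assumption made for illustration and is not taken from the paper: whitespace tokenization, the function names `unigram_counts`, `query_log_likelihood`, and `rank`, and the smoothing step, which uses simple linear interpolation with a collection-wide background model and an arbitrary weight `lam = 0.5` as a stand-in for the smoothing techniques the text mentions.

```python
import math
from collections import Counter


def unigram_counts(text):
    """Count whitespace-separated, lowercased tokens (assumed tokenization)."""
    return Counter(text.lower().split())


def query_log_likelihood(query, doc, collection_counts, collection_len, lam=0.5):
    """Log-likelihood of the query under a smoothed unigram document model.

    Smoothing is linear interpolation with a background (collection) model:
        p(w | d) = lam * p_ml(w | d) + (1 - lam) * p(w | collection)
    The weight `lam` is a hypothetical default, not a value from the paper.
    """
    doc_counts = unigram_counts(doc)
    doc_len = sum(doc_counts.values())
    score = 0.0
    for w in query.lower().split():
        p_doc = doc_counts[w] / doc_len if doc_len else 0.0
        p_bg = collection_counts[w] / collection_len if collection_len else 0.0
        p = lam * p_doc + (1.0 - lam) * p_bg
        if p == 0.0:
            # The word occurs nowhere in the collection: zero likelihood.
            return float("-inf")
        score += math.log(p)
    return score


def rank(query, docs, lam=0.5):
    """Return (score, doc_index) pairs sorted from best to worst."""
    collection_counts = Counter()
    for d in docs:
        collection_counts.update(unigram_counts(d))
    collection_len = sum(collection_counts.values())
    scored = [
        (query_log_likelihood(query, d, collection_counts, collection_len, lam), i)
        for i, d in enumerate(docs)
    ]
    return sorted(scored, reverse=True)


if __name__ == "__main__":
    docs = [
        "language models for information retrieval",
        "markov chains and link analysis",
        "cooking pasta at home",
    ]
    for score, idx in rank("language retrieval", docs):
        print(f"{score:8.3f}  {docs[idx]}")
```

Without the background term, any document missing a single query word would receive zero likelihood, which is why some form of smoothing is needed before scores from different documents can be compared.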