LANGUAGE MODELS

Djoerd Hiemstra
University of Twente
http://www.cs.utwente.nl/~hiemstra

SYNONYMS

GENERATIVE MODELS

DEFINITION

A language model assigns a probability to a piece of unseen text, based on some training data. For example, a language model based on a large English newspaper archive is expected to assign a higher probability to "a bit of text" than to "aw pit tov tags", because the words in the former phrase (or the word pairs or word triples, if so-called N-GRAM MODELS are used) occur more frequently in the data than the words in the latter phrase. For information retrieval, the typical usage is to build a language model for each document. At search time, the top-ranked document is the one whose language model assigns the highest probability to the query.

HISTORICAL BACKGROUND

The term "language models" originates from probabilistic models of language generation developed for automatic speech recognition systems in the early 1980s. Speech recognition systems use a language model to complement the results of the acoustic model, which models the relation between words (or parts of words called phonemes) and the acoustic signal. The history of language models, however, goes back to the beginning of the 20th century, when Andrei Markov used language models (Markov models) to model letter sequences in works of Russian literature. Another famous application of language models is Claude Shannon's models of letter sequences and word sequences, which he used to illustrate the implications of coding and information theory. In the 1990s, language models were applied as a general tool for several natural language processing applications, such as part-of-speech tagging, machine translation, and optical character recognition. Language models were applied to information retrieval by a number of research groups in the late 1990s [4, 7, 14, 15], and they rapidly became popular in information retrieval research.
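The retrieval usage described above can be sketched in a few lines of code. This is a minimal illustration, not the method of any particular system: it uses an unsmoothed unigram (bag-of-words) model per document, and all names (`query_likelihood`, the toy documents) are invented for the example.

```python
from collections import Counter

def query_likelihood(query_terms, doc_terms):
    """P(query | document) under an unsmoothed unigram model.

    Each query term contributes tf(term, doc) / |doc|; the terms are
    assumed independent, so their probabilities multiply. With no
    smoothing, any query term absent from the document gives zero.
    """
    counts = Counter(doc_terms)
    total = len(doc_terms)
    p = 1.0
    for q in query_terms:
        p *= counts[q] / total
    return p

# Toy collection: one "English" document, one nonsense document.
docs = {
    "d1": "a bit of text about language models".split(),
    "d2": "aw pit tov tags and other noise".split(),
}
query = "a bit of text".split()

# Rank documents by the probability their model assigns to the query.
ranked = sorted(docs, key=lambda d: query_likelihood(query, docs[d]),
                reverse=True)
```

Here `d1` is ranked first, since its model assigns the query a nonzero probability while `d2`'s model assigns zero. The zero-probability problem is exactly why practical systems add smoothing, which the unsmoothed sketch above deliberately omits.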
By 2001, the ACM SIGIR conference had two separate sessions on language models, containing 5 papers in total. In 2003, a group of leading information retrieval researchers published a research roadmap, "Challenges in Information Retrieval and Language Modeling", indicating that the future of information retrieval and the future of language modeling cannot be viewed separately from each other.

SCIENTIFIC FUNDAMENTALS

Language models are generative models, i.e., models that define a probability mechanism for generating language. Such generative models might be explained by the following probability mechanism: imagine picking a term T at random from this page by pointing at the page with closed eyes. This mechanism defines a probability P(T|D), which could be defined as the relative frequency of the occurrence of the event, i.e., by the number of occurrences of the word on the page divided by the total number of terms on the page. Suppose the process is repeated n times, picking one term at a time: T1, T2, ..., Tn...
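The random-pointing mechanism can be written out directly. The sketch below assumes the page is just a list of terms; `p_term_given_doc` and `p_sequence` are illustrative names, and the repeated picks are treated as independent, matching the description above.

```python
from collections import Counter

# A tiny stand-in for "this page": 8 terms, "the" occurring 3 times.
page = "the cat sat on the mat the end".split()

def p_term_given_doc(term, doc):
    # P(T|D) as a relative frequency: occurrences of the term
    # divided by the total number of terms on the page.
    return Counter(doc)[term] / len(doc)

def p_sequence(terms, doc):
    # Repeating the pick n times independently:
    # P(T1, ..., Tn | D) = P(T1|D) * ... * P(Tn|D).
    p = 1.0
    for t in terms:
        p *= p_term_given_doc(t, doc)
    return p
```

For the toy page, `p_term_given_doc("the", page)` is 3/8, and picking "the" followed by "cat" has probability (3/8) * (1/8).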
This note was uploaded on 01/09/2012 for the course CS 273, taught by Professor Xifeng Yan during the Spring '11 term at UCSB.