Class-Based n-gram Models of Natural Language
Peter F. Brown*
Peter V. deSouza*
Robert L. Mercer*
Vincent J. Della Pietra*
Jenifer C. Lai*
IBM T. J. Watson Research Center
We address the problem of predicting a word from previous words in a sample of text. In particular,
we discuss n-gram models based on classes of words. We also discuss several statistical algorithms
for assigning words to classes based on the frequency of their co-occurrence with other words. We
find that we are able to extract classes that have the flavor of either syntactically based groupings
or semantically based groupings, depending on the nature of the underlying statistics.
1. Introduction
In a number of natural language processing tasks, we face the problem of recovering a
string of English words after it has been garbled by passage through a noisy channel.
To tackle this problem successfully, we must be able to estimate the probability with
which any particular string of English words will be presented as input to the noisy
channel. In this paper, we discuss a method for making such estimates. We also discuss
the related topic of assigning words to classes according to statistical behavior in a
large body of text.
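The probability of a word string can be estimated from counts in a body of text. As a minimal sketch of the idea (not the paper's own implementation), the following uses maximum-likelihood bigram estimates over a hypothetical toy corpus; the corpus, function names, and the convention of conditioning on the first word are illustrative assumptions:

```python
from collections import Counter

# Hypothetical toy corpus; in practice the counts come from a large body of text.
corpus = "the dog ran the dog sat the cat ran".split()

# Maximum-likelihood bigram estimate: Pr(w2 | w1) = c(w1 w2) / c(w1).
unigrams = Counter(corpus)
bigrams = Counter(zip(corpus, corpus[1:]))

def bigram_prob(w1, w2):
    return bigrams[(w1, w2)] / unigrams[w1]

def string_prob(words):
    """Probability of a word string under the bigram model,
    conditioning on its first word."""
    p = 1.0
    for w1, w2 in zip(words, words[1:]):
        p *= bigram_prob(w1, w2)
    return p
```

A real system would smooth these estimates, since most bigrams never occur in any finite sample of text.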
In the next section, we review the concept of a language model and give a definition
of n-gram models. In Section 3, we look at the subset of n-gram models in which
the words are divided into classes. We show that for n = 2 the maximum likelihood
assignment of words to classes is equivalent to the assignment for which the average
mutual information of adjacent classes is greatest. Finding an optimal assignment of
words to classes is computationally hard, but we describe two algorithms for finding a
suboptimal assignment. In Section 4, we apply mutual information to two other forms
of word clustering. First, we use it to find pairs of words that function together as a
single lexical entity. Then, by examining the probability that two words will appear
within a reasonable distance of one another, we use it to find classes that have some
loose semantic coherence.
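The quantity optimized in Section 3, the average mutual information of adjacent classes, can be computed directly from counts of adjacent class pairs. Here is a minimal sketch under assumed inputs; the toy corpus and the word-to-class assignment are hypothetical, and a real clustering algorithm would search over such assignments rather than fix one:

```python
import math
from collections import Counter

# Hypothetical toy corpus and an illustrative assignment of words to classes.
corpus = "the dog ran the cat sat the dog sat".split()
word_class = {"the": "DET", "dog": "N", "cat": "N", "ran": "V", "sat": "V"}

classes = [word_class[w] for w in corpus]
pair_counts = Counter(zip(classes, classes[1:]))
total = sum(pair_counts.values())

# Marginal counts for the left and right positions of adjacent class pairs.
left, right = Counter(), Counter()
for (c1, c2), n in pair_counts.items():
    left[c1] += n
    right[c2] += n

def average_mutual_information():
    """I(C1; C2) = sum over class pairs of
    Pr(c1, c2) * log2( Pr(c1, c2) / (Pr(c1) Pr(c2)) )."""
    ami = 0.0
    for (c1, c2), n in pair_counts.items():
        p12 = n / total
        ami += p12 * math.log2(p12 / ((left[c1] / total) * (right[c2] / total)))
    return ami
```

Average mutual information is never negative, and it is large when the class of one word strongly predicts the class of the next, which is why a maximum-likelihood class assignment maximizes it.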
In describing our work, we draw freely on terminology and notation from the
mathematical theory of communication. The reader who is unfamiliar with this field
or who has allowed his or her facility with some of its concepts to fall into disrepair
may profit from a brief perusal of Feller (1950) and Gallager (1968). In the first of
these, the reader should focus on conditional probabilities and on Markov chains; in
the second, on entropy and mutual information.
* IBM T. J. Watson Research Center, Yorktown Heights, New York 10598.
© 1992 Association for Computational Linguistics
Computational Linguistics, Volume 18, Number 4
Figure 1
Source-channel setup. A source language model generates a word string W with probability Pr(W); the channel transforms W into the observation Y with probability Pr(Y | W), so that Pr(W) × Pr(Y | W) = Pr(W, Y).