Entropy-based Pruning of Backoff Language Models
Andreas Stolcke
Speech Technology And Research Laboratory
SRI International
Menlo Park, California
ABSTRACT
A criterion for pruning parameters from N-gram backoff language models is developed, based on the relative entropy between the original and the pruned model. It is shown that the relative entropy resulting from pruning a single N-gram can be computed exactly and efficiently for backoff models. The relative entropy measure can be expressed as a relative change in training set perplexity. This leads to a simple pruning criterion whereby all N-grams that change perplexity by less than a threshold are removed from the model. Experiments show that a production-quality Hub4 LM can be reduced to 26% of its original size without increasing recognition error. We also compare the approach to a heuristic pruning criterion by Seymore and Rosenfeld [9], and show that their approach can be interpreted as an approximation to the relative entropy criterion. Experimentally, both approaches select similar sets of N-grams (about 85% overlap), with the exact relative entropy criterion giving marginally better performance.
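The thresholding idea described above can be illustrated on a toy bigram backoff model. The sketch below is not the paper's closed-form computation: it evaluates the relative entropy of pruning a single bigram by brute-force enumeration over a made-up three-word vocabulary, and all probabilities, function names, and the assumed history probability are illustrative only.

```python
import math

# Toy bigram backoff model over a 3-word vocabulary; all numbers invented.
vocab = ["a", "b", "c"]
unigram = {"a": 0.5, "b": 0.3, "c": 0.2}
# Explicit bigram estimates p(w | h); other words back off to the unigram.
bigram = {("a", "b"): 0.6, ("a", "c"): 0.2, ("b", "a"): 0.7}

def backoff_weight(bigrams, h):
    # alpha(h) renormalizes the unigram mass left for words that have
    # no explicit bigram estimate under history h.
    explicit = {w: p for (hh, w), p in bigrams.items() if hh == h}
    return (1.0 - sum(explicit.values())) / (1.0 - sum(unigram[w] for w in explicit))

def prob(bigrams, h, w):
    if (h, w) in bigrams:
        return bigrams[(h, w)]
    return backoff_weight(bigrams, h) * unigram[w]

def relative_entropy_of_pruning(h, w, hist_prob):
    # D(p || p') between the full model and the model with (h, w) pruned,
    # computed by brute force over the affected history. The paper derives
    # a closed-form expression; this loop only demonstrates the quantity.
    pruned = {k: v for k, v in bigram.items() if k != (h, w)}
    d = 0.0
    for w2 in vocab:
        p = prob(bigram, h, w2)
        q = prob(pruned, h, w2)
        d += p * math.log(p / q)
    return hist_prob * d

# Pruning criterion: remove the N-gram if the implied relative change in
# training-set perplexity, exp(D) - 1, falls below a threshold.
# Here we assume (arbitrarily) that history "a" has probability 0.5.
d = relative_entropy_of_pruning("a", "c", 0.5)
print(f"relative perplexity change: {math.exp(d) - 1.0:.4f}")
```

With a threshold of, say, 0.05, the bigram ("a", "c") in this toy model would be pruned, since its implied relative perplexity change is far smaller; a real implementation would apply this test to every explicit N-gram and recompute backoff weights after removal.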
1. Introduction
N-gram backoff models [5], despite their shortcomings, still dominate as the technology of choice for state-of-the-art speech recognizers [4]. Two sources of performance improvements are the use of higher-order models (several DARPA Hub4 sites now use 4-gram or 5-gram models) and the inclusion of more training data from more sources (Hub4 models typically include Broadcast News, NABN and WSJ data). Both of these approaches lead to model sizes that are impractical unless some sort of parameter selection technique is used. In the case of N-gram models, the goal of parameter selection is to choose which N-grams should have explicit conditional probability estimates assigned by the model, so as to maximize performance (i.e., minimize perplexity and/or recognition error) while minimizing model size. As pointed out in [6], pruning (selecting parameters from) a full N-gram model of higher order amounts to building a variable-length N-gram model, i.e., one in which training set contexts are not uniformly represented by N-grams of the same length.
Seymore and Rosenfeld [9] showed that selecting N-grams based on their conditional probability estimates and frequency of use is more effective than the traditional absolute frequency thresholding. In this paper we revisit the problem of N-gram parameter selection by deriving a criterion that satisfies the following desiderata.
Soundness: The criterion should optimize some well-understood information-theoretic measure of language model quality.

Efficiency: