Entropy-based Pruning of Backoff Language Models

Andreas Stolcke
Speech Technology and Research Laboratory
SRI International, Menlo Park, California

ABSTRACT

A criterion for pruning parameters from N-gram backoff language models is developed, based on the relative entropy between the original and the pruned model. It is shown that the relative entropy resulting from pruning a single N-gram can be computed exactly and efficiently for backoff models. The relative entropy measure can be expressed as a relative change in training set perplexity. This leads to a simple pruning criterion whereby all N-grams that change perplexity by less than a threshold are removed from the model. Experiments show that a production-quality Hub4 LM can be reduced to 26% of its original size without increasing recognition error. We also compare the approach to a heuristic pruning criterion by Seymore and Rosenfeld [9], and show that their approach can be interpreted as an approximation to the relative entropy criterion. Experimentally, both approaches select similar sets of N-grams (about 85% overlap), with the exact relative entropy criterion giving marginally better performance.

1. Introduction

N-gram backoff models [5], despite their shortcomings, still dominate as the technology of choice for state-of-the-art speech recognizers [4]. Two sources of performance improvements are the use of higher-order models (several DARPA-Hub4 sites now use 4-gram or 5-gram models) and the inclusion of more training data from more sources (Hub4 models typically include Broadcast News, NABN and WSJ data). Both of these approaches lead to model sizes that are impractical unless some sort of parameter selection technique is used. In the case of N-gram models, the goal of parameter selection is to choose which N-grams should have explicit conditional probability estimates assigned by the model, so as to maximize performance (i.e., minimize perplexity and/or recognition error) while minimizing model size.

As pointed out in [6], pruning (selecting parameters from) a full N-gram model of higher order amounts to building a variable-length N-gram model, i.e., one in which training set contexts are not uniformly represented by N-grams of the same length. Seymore and Rosenfeld [9] showed that selecting N-grams based on their conditional probability estimates and frequency of use is more effective than the traditional absolute frequency thresholding.

In this paper we revisit the problem of N-gram parameter selection by deriving a criterion that satisfies the following desiderata.

Soundness: The criterion should optimize some well-understood information-theoretic measure of language model quality.

Efficiency:
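The abstract describes the pruning procedure at a high level: for every N-gram carrying an explicit conditional probability estimate, compute the relative entropy between the original model and the model with that single N-gram pruned, express it as a relative change in training-set perplexity, and remove the N-gram if the change falls below a threshold. The Python sketch below illustrates only that outer loop under assumed interfaces; the model accessor `explicit_ngrams`, the mutator `remove_and_renormalize`, and the injected `relative_entropy_of_pruning` function are hypothetical placeholders, and the exact, efficient relative-entropy computation the paper derives is not shown in this preview.

```python
import math

def prune_backoff_lm(lm, threshold, relative_entropy_of_pruning):
    """Sketch of the threshold-based pruning loop described in the abstract.

    `lm` is assumed to expose:
      - explicit_ngrams(): iterable of N-grams with explicit probability
        estimates in the model
      - remove_and_renormalize(ngram): drop an N-gram so its probability is
        re-assigned via backoff, recomputing the context's backoff weight
    `relative_entropy_of_pruning(lm, ngram)` is assumed to return
    D(p || p') in nats per word for pruning that single N-gram.
    """
    to_prune = []
    for ngram in lm.explicit_ngrams():
        d = relative_entropy_of_pruning(lm, ngram)
        # Assumed link between relative entropy and training-set perplexity:
        # the relative perplexity increase caused by pruning is exp(D) - 1.
        relative_pp_increase = math.exp(d) - 1.0
        if relative_pp_increase < threshold:
            to_prune.append(ngram)
    # Defer removal so every pruning decision is evaluated against the
    # original, unpruned model, as the single-N-gram analysis assumes.
    for ngram in to_prune:
        lm.remove_and_renormalize(ngram)
    return lm
```

Framed this way, the threshold is a single tunable parameter trading model size against training-set perplexity, which is the trade-off behind the abstract's report that a Hub4 LM could be reduced to 26% of its original size without increasing recognition error.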