A Bit of Progress in Language Modeling
Extended Version

Joshua T. Goodman
Machine Learning and Applied Statistics Group
Microsoft Research
One Microsoft Way
Redmond, WA 98052
joshuago@microsoft.com

August 2001

Technical Report
MSR-TR-2001-72

Microsoft Research
Microsoft Corporation
One Microsoft Way
Redmond, WA 98052
http://www.research.microsoft.com
1 Introduction

1.1 Overview

Language modeling is the art of determining the probability of a sequence of words. This is useful in a large variety of areas including speech recognition, optical character recognition, handwriting recognition, machine translation, and spelling correction (Church, 1988; Brown et al., 1990; Hull, 1992; Kernighan et al., 1990; Srihari and Baltus, 1992). The most commonly used language models are very simple (e.g. a Katz-smoothed trigram model). There are many improvements over this simple model however, including caching, clustering, higher-order n-grams, skipping models, and sentence-mixture models, all of which we will describe below. Unfortunately, these more complicated techniques have rarely been examined in combination. It is entirely possible that two techniques that work well separately will not work well together, and, as we will show, even possible that some techniques will work better together than either one does by itself.

In this paper, we will first examine each of the aforementioned techniques separately, looking at variations on the technique, or its limits. Then we will examine the techniques in various combinations, and compare to a Katz-smoothed trigram with no count cutoffs. On a small training data set, 100,000 words, we can get up to a 50% perplexity reduction, which is one bit of entropy. On larger data sets, the improvement declines, going down to 41% on our largest data set, 284,000,000 words. On a similar large set without punctuation, the reduction is 38%. On that data set, we achieve an 8.9% word error rate reduction. These are perhaps the largest reported perplexity reductions for a language model, versus a fair baseline.

The paper is organized as follows. First, in this section, we will describe our terminology, briefly introduce the various techniques we examined, and describe our evaluation methodology. In the following sections, we describe each technique in more detail, and give experimental results with variations on the technique, determining for each the best variation, or its limits. In particular, for caching, we show that trigram caches have nearly twice the potential of unigram caches. For clustering, we find variations that work slightly better than traditional clustering, and examine the limits. For n-gram models, we examine up to 20-grams, but show that even for the largest models, performance has plateaued by 5 to 7 grams. For skipping models, we give the first detailed comparison of different skipping techniques, and the first that we know of at the 5-gram level. For sentence mixture models, we show that mixtures of up to 64 sentence types can lead to improvements. We then give experiments comparing all techniques, and combining all techniques in various ways. All of ...
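As a point of reference for the terminology above (this is the standard formulation of an n-gram model, restated here rather than quoted from the preview), the probability of a word sequence w_1 ... w_n factors by the chain rule, and a trigram model approximates each conditional probability using only the two preceding words:

    P(w_1 \cdots w_n) = \prod_{i=1}^{n} P(w_i \mid w_1 \cdots w_{i-1})
                      \approx \prod_{i=1}^{n} P(w_i \mid w_{i-2}\, w_{i-1})

The smoothing method (e.g. Katz smoothing) determines how probability mass is reserved for trigrams that never occurred in the training data.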
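The statement that a 50% perplexity reduction is one bit of entropy follows directly from the standard definitions, again stated here for convenience rather than quoted from the report. The cross-entropy per word H of a test sequence and its perplexity PP are

    H = -\frac{1}{n} \sum_{i=1}^{n} \log_2 P(w_i \mid w_1 \cdots w_{i-1}), \qquad \mathrm{PP} = 2^{H}

so a model whose perplexity is half the baseline's satisfies \log_2(\mathrm{PP}/2) = H - 1, i.e. its cross-entropy is exactly one bit lower. This is why the 50% reduction on the 100,000-word set is described as one bit of entropy.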
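To give a flavor of one of the techniques listed above before its dedicated section, the sketch below shows the general idea of a unigram cache: a small unigram distribution built from the recently seen words of the current document is linearly interpolated with a static baseline model. It is only an illustration of the idea, not the report's exact model; the names (UnigramCacheLM, CACHE_WEIGHT, CACHE_SIZE) and all numeric values are made up for the example, whereas the paper tunes such parameters on held-out data.

from collections import Counter, deque

# Illustrative values only, not settings from the paper.
CACHE_WEIGHT = 0.1      # interpolation weight (lambda) for the cache component
CACHE_SIZE = 500        # number of recent words remembered
VOCAB_SIZE = 10_000     # crude floor so unseen words keep nonzero probability

class UnigramCacheLM:
    """Linear interpolation of a static baseline unigram model with a
    unigram cache built from recently observed words (a sketch only)."""

    def __init__(self, baseline_probs):
        self.baseline = baseline_probs          # dict: word -> probability
        self.recent = deque(maxlen=CACHE_SIZE)  # sliding window of history
        self.cache_counts = Counter()

    def observe(self, word):
        # Add the word to the cache; drop the count of the word that falls
        # out of the sliding window when it is full.
        if len(self.recent) == self.recent.maxlen:
            oldest = self.recent[0]
            self.cache_counts[oldest] -= 1
            if self.cache_counts[oldest] == 0:
                del self.cache_counts[oldest]
        self.recent.append(word)
        self.cache_counts[word] += 1

    def prob(self, word):
        p_base = self.baseline.get(word, 1.0 / VOCAB_SIZE)
        p_cache = (self.cache_counts[word] / len(self.recent)) if self.recent else 0.0
        return (1 - CACHE_WEIGHT) * p_base + CACHE_WEIGHT * p_cache

# Tiny usage example with a made-up baseline distribution.
lm = UnigramCacheLM({"the": 0.05, "model": 0.001, "cache": 0.0005})
for w in "the cache helps because the same words tend to recur".split():
    lm.observe(w)
print(lm.prob("cache"), lm.prob("model"))

A trigram cache extends the same idea by keeping counts of recent trigrams rather than single words; that is the variant the paper reports as having nearly twice the potential of the unigram cache.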