Two Decades of Statistical Language Modeling: Where Do We Go from Here?

RONALD ROSENFELD, ASSOCIATE MEMBER, IEEE

Invited Paper

Abstract—Statistical language models estimate the distribution of various natural language phenomena for the purpose of speech recognition and other language technologies. Since the first significant model was proposed in 1980, many attempts have been made to improve the state of the art. We review them here, point to a few promising directions, and argue for a Bayesian approach to integration of linguistic theories with data.

Keywords—Natural language processing, natural language technologies, statistical language modeling.

I. OUTLINE

Statistical language modeling (SLM) is the attempt to capture regularities of natural language for the purpose of improving the performance of various natural language applications. By and large, SLM amounts to estimating the probability distribution of various linguistic units, such as words, sentences, and whole documents.

SLM is crucial for a large variety of language technology applications. These include speech recognition (where SLM got its start), machine translation, document classification and routing, optical character recognition, information retrieval, handwriting recognition, spelling correction, and many more. In machine translation, for example, purely statistical approaches have been introduced in . But even researchers using rule-based approaches have found it beneficial to introduce some elements of SLM and statistical estimation . In information retrieval, a language modeling approach was recently proposed by , and a statistical/information-theoretical approach was developed by .

SLM employs statistical estimation techniques using language training data, that is, text. Because of the categorical nature of language, and the large vocabularies people naturally use, statistical techniques must estimate a large number of parameters, and consequently depend critically on the availability of large amounts of training data.

Over the past 20 years, successively larger amounts of text of various types have become available online. As a result, in domains where such data became available, the quality of language models has increased dramatically. However, this improvement is now beginning to asymptote. Even if online text continues to accumulate at an exponential rate (which it no doubt will, given the growth rate of the World Wide Web), the quality of currently used statistical language models is not likely to improve by a significant factor. One informal estimate from IBM shows that bigram models effectively saturate within several hundred million words, and trigram models are likely to saturate within a few billion words.

Manuscript received January 20, 2000; revised May 2, 2000.
The author is with the School of Computer Science, Carnegie-Mellon University, Pittsburgh, PA 15213 USA (e-mail: [email protected]).
Publisher Item Identifier S 0018-9219(00)08094-4.
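To make the estimation problem concrete, the following is a minimal sketch (not from the paper) of the simplest case discussed above: maximum-likelihood bigram estimation from a toy corpus. The function name and the toy sentences are illustrative assumptions; a real model would add smoothing to cope with the unseen bigrams that make large training sets so critical.

```python
from collections import Counter

def bigram_mle(corpus):
    """Maximum-likelihood bigram estimates P(w2 | w1) from tokenized sentences.

    Illustrative only: real language models must smooth these counts,
    since most bigrams never occur even in very large corpora.
    """
    unigrams, bigrams = Counter(), Counter()
    for sentence in corpus:
        tokens = ["<s>"] + sentence + ["</s>"]  # sentence-boundary markers
        unigrams.update(tokens[:-1])            # count each history word
        bigrams.update(zip(tokens[:-1], tokens[1:]))
    # relative frequency: count(w1, w2) / count(w1)
    return {(w1, w2): c / unigrams[w1] for (w1, w2), c in bigrams.items()}

# Toy corpus (hypothetical data for illustration)
corpus = [["the", "cat", "sat"], ["the", "dog", "sat"]]
p = bigram_mle(corpus)
# "the" is followed by "cat" in one of its two occurrences, so P(cat | the) = 0.5
```

Even this trivial model has one parameter per observed word pair; with realistic vocabularies of tens of thousands of words, the parameter space grows quadratically for bigrams and cubically for trigrams, which is why the data requirements discussed above are so severe.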