SRILM — AN EXTENSIBLE LANGUAGE MODELING TOOLKIT

Andreas Stolcke
Speech Technology and Research Laboratory
SRI International, Menlo Park, CA, U.S.A.
http://www.speech.sri.com/

ABSTRACT

SRILM is a collection of C++ libraries, executable programs, and helper scripts designed to allow both production of and experimentation with statistical language models for speech recognition and other applications. SRILM is freely available for noncommercial purposes. The toolkit supports creation and evaluation of a variety of language model types based on N-gram statistics, as well as several related tasks, such as statistical tagging and manipulation of N-best lists and word lattices. This paper summarizes the functionality of the toolkit and discusses its design and implementation, highlighting ease of rapid prototyping, reusability, and combinability of tools.

1. INTRODUCTION

Statistical language modeling is the science (and often art) of building models that estimate the prior probabilities of word strings. Language modeling has many applications in natural language technology and other areas where sequences of discrete objects play a role, with prominent roles in speech recognition and natural language tagging (including specialized tasks such as part-of-speech tagging, word and sentence segmentation, and shallow parsing). As pointed out in , the main techniques for effective language modeling have been known for at least a decade, although one suspects that important advances are possible, and indeed needed, to bring about significant breakthroughs in the application areas cited above; such breakthroughs have just been very hard to come by [2, 3].

Various software packages for statistical language modeling have been in use for many years; the basic algorithms are simple enough that one can easily implement them with reasonable effort for research use.
One such package, the CMU-Cambridge LM toolkit , has been in wide use in the research community and has greatly facilitated the construction of language models (LMs) for many practitioners.

This paper describes a fairly recent addition to the set of publicly available LM tools, the SRI Language Modeling Toolkit (SRILM). Compared to existing LM tools, SRILM offers a programming interface and an extensible set of LM classes, several non-standard LM types, and more comprehensive functionality that goes beyond language modeling to include tagging, N-best rescoring, and other applications. This paper describes the design philosophy and key implementation choices in SRILM, summarizes its capabilities, and concludes by discussing deficiencies and plans for future development.

For lack of space we must refer to other publications for an introduction to language modeling and its role in speech recognition and other areas [3, 4]....
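
As a rough illustration of what "estimating the prior probability of a word string" from N-gram statistics can look like, the following is a minimal Python sketch of a bigram model with add-one smoothing. It is not SRILM code and does not reflect SRILM's API or its smoothing methods; the function names `train_bigram` and `log_prob`, the boundary markers `<s>` and `</s>`, and the choice of add-one smoothing are all assumptions made for this example.

```python
import math
from collections import Counter


def train_bigram(sentences):
    """Collect bigram statistics from whitespace-tokenized sentences.

    Hypothetical helper for illustration only. Each sentence is padded
    with a start marker <s> and an end marker </s>.
    """
    context_counts = Counter()  # how often each word appears as a bigram context
    bigram_counts = Counter()   # how often each (previous, word) pair appears
    vocab = set()               # every token that can be predicted (includes </s>)
    for sentence in sentences:
        tokens = ["<s>"] + sentence.split() + ["</s>"]
        vocab.update(tokens[1:])  # <s> is never predicted, so it is excluded
        for prev, word in zip(tokens, tokens[1:]):
            context_counts[prev] += 1
            bigram_counts[(prev, word)] += 1
    return context_counts, bigram_counts, vocab


def log_prob(sentence, context_counts, bigram_counts, vocab):
    """Return the base-10 log probability of a word string.

    Uses add-one smoothing (an assumption of this sketch):
        P(w | prev) = (c(prev, w) + 1) / (c(prev) + |V|)
    """
    tokens = ["<s>"] + sentence.split() + ["</s>"]
    vocab_size = len(vocab)
    total = 0.0
    for prev, word in zip(tokens, tokens[1:]):
        p = (bigram_counts[(prev, word)] + 1) / (context_counts[prev] + vocab_size)
        total += math.log10(p)
    return total


if __name__ == "__main__":
    model = train_bigram(["the cat sat", "the dog sat"])
    print(log_prob("the cat sat", *model))
    print(log_prob("sat the cat", *model))
```

In this toy setup, a word string seen in the training data receives a higher probability than a reordering of the same words, which is the basic behavior a recognizer or tagger relies on when it uses an LM as a prior.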
This note was uploaded on 05/08/2010 for the course CS 6.345 taught by Professor Glass during the Spring '10 term at MIT.