SpeechReview - Considerable progress has been made in...

Considerable progress has been made in speech-recognition technology over the last few years, and nowhere has this progress been more evident than in the area of large-vocabulary recognition (LVR). Current laboratory systems are capable of transcribing continuous speech from any speaker with average word-error rates between 5% and 10%. If speaker adaptation is allowed, then after 2 or 3 minutes of speech the error rate will drop well below 5% for most speakers. LVR systems had been limited to dictation applications, since the systems were speaker dependent and required words to be spoken with a short pause between them. However, the capability to recognize natural continuous-speech input from any speaker opens up many more applications. As a result, LVR technology appears to be on the brink of widespread deployment across a range of information technology (IT) systems. This article discusses the principles and architecture of current LVR systems and identifies the key issues affecting their future deployment. To illustrate the various points raised, the Cambridge University HTK system is described. This system is a modern design that gives state-of-the-art performance, and it is typical of the current generation of recognition systems.

IEEE Signal Processing Magazine, September 1996, p. 45. 1053-5888/96/$5.00 © 1996 IEEE

System Overview

Current LVR systems are firmly based on the principles of statistical pattern recognition. The basic methods of applying these principles to the problem of speech recognition were pioneered by Baker, Jelinek, and their colleagues from IBM in the 1970s, and little has changed since [13, 54].

Figure 1 illustrates the main components of an LVR system. An unknown speech waveform is converted by a front-end signal processor into a sequence of acoustic vectors, $Y = y_1, y_2, \ldots, y_T$. Each of these vectors is a compact representation of the short-time speech spectrum covering a period of typically 10 ms. Thus, a typical 10-word utterance might have a duration of around 3 seconds and would be represented by a sequence of $T = 300$ acoustic vectors.

The utterance consists of a sequence of words, $W = w_1, w_2, \ldots, w_n$, and it is the job of the LVR system to determine the most probable word sequence, $\hat{W}$, given the observed acoustic signal $Y$. To do this, Bayes' rule is used to decompose the required probability $P(W \mid Y)$ into two components, that is,

$$\hat{W} = \arg\max_{W} P(W \mid Y) = \arg\max_{W} \frac{P(W)\,P(Y \mid W)}{P(Y)}$$

This equation indicates that to find the most likely word sequence $\hat{W}$, the word sequence that maximizes the product of $P(W)$ and $P(Y \mid W)$ must be found; the denominator $P(Y)$ does not depend on $W$ and so does not affect the maximization. The first of these terms represents the a priori probability of observing $W$ independent of the observed signal, and this probability is determined by a language model. The second term represents the probability of observing the vector sequence $Y$ given some specified word sequence $W$, and this probability is deter-
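The decision rule described above can be sketched in a few lines of Python. This is only an illustrative toy, not the HTK implementation: the candidate word sequences, the probability values, and the function names `num_frames` and `decode` are all assumptions made up for the example. It shows the frame-count arithmetic from the text (3 seconds at 10 ms per vector) and how the language-model and acoustic scores combine in the maximization, working in the log domain so that the product becomes a sum.

```python
import math

FRAME_PERIOD_S = 0.010  # 10 ms per acoustic vector, as stated in the text


def num_frames(duration_s, frame_period_s=FRAME_PERIOD_S):
    """Number of acoustic vectors T for an utterance of the given duration."""
    return round(duration_s / frame_period_s)


def decode(candidates, language_logprob, acoustic_logprob):
    """Return the candidate W that maximizes log P(W) + log P(Y|W).

    P(Y) is identical for every candidate, so it is omitted from the score.
    """
    return max(candidates, key=lambda w: language_logprob[w] + acoustic_logprob[w])


# Hypothetical scores for two candidate transcriptions of one utterance.
language_logprob = {
    "recognize speech": math.log(0.7),    # P(W), from a language model
    "wreck a nice beach": math.log(0.3),
}
acoustic_logprob = {
    "recognize speech": math.log(0.4),    # P(Y|W), from an acoustic model
    "wreck a nice beach": math.log(0.6),
}

T = num_frames(3.0)
best = decode(list(language_logprob), language_logprob, acoustic_logprob)
print(T, best)
```

In this made-up example the acoustic score alone slightly favors the second candidate, but the prior from the language model tips the product toward the first. A real recognizer cannot enumerate every candidate word sequence and has to search the space of hypotheses instead.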

This document was uploaded on 10/24/2011.

