“EIGENLIPS” FOR ROBUST SPEECH RECOGNITION

Christoph Bregler and Yochai Konig

Int. Computer Science Institute, 1947 Center Street, Berkeley, CA 94704, U.S.A.
Computer Science Division, University of California, Berkeley, CA 94720
{bregler,konig}@cs.berkeley.edu
University of Karlsruhe, Institut Prof. Alex Waibel, D-76128 Karlsruhe, Germany

To appear in Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing, Adelaide, Australia 1994

ABSTRACT

In this study we improve the performance of a hybrid connectionist speech recognition system by incorporating visual information about the corresponding lip movements. Specifically, we investigate the benefits of adding visual features in the presence of additive noise and crosstalk (cocktail party effect). Our study extends our previous experiments [3] by using a new visual front end, and an alternative architecture for combining the visual and acoustic information. Furthermore, we have extended our recognizer to a multi-speaker, connected letters recognizer. Our results show a significant improvement for the combined architecture (acoustic and visual information) over just the acoustic system in the presence of additive noise and crosstalk.

1. INTRODUCTION

Most efforts in robust speech recognition focus on methods that reduce signal distortions. The signal distortions may be caused by background noise (additive noise) and by channel effects (convolutional noise). We investigate an alternative approach by incorporating additional information from the signal source itself, like positional information about the visible articulators (lip movements, tongue and teeth positions). In fact it is well known that human speech perception is inherently bi-modal as well [10, 5]. The idea of extending automated speech recognition to the visual modality has already been investigated for a long time.
Among the well-known non-connectionist approaches are the work of Petajan, Bischoff, Bodoff, and Brooke [11] and that of Mase and Pentland [9]. More recently, Goldschen [6] completed a lip reading system in which HMMs were trained to discriminate visual information on a continuous word database. Among recent connectionist systems, Yuhas, Goldstein, and Sejnowski [14] used static images for vowel discrimination, and Wolff, Prasad, Stork, and Hennecke [13] use a modified TDNN for isolated word segments.

We focus on scenarios where the acoustic modality is degraded in a way that causes state-of-the-art speech recognition systems to achieve poor recognition performance. We simulated such situations by adding car noise and crosstalk at different ratios to clean speech. We ran similar experiments on the same database, described in [3]. In this study, however, we use a new visual processing technique, a different acoustic front end, and an alternative hybrid connectionist recognition architecture (MLP/HMM).
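
The degradation step described above, adding a noise recording to clean speech at a chosen ratio, can be sketched as follows. This is only an illustration under stated assumptions: the text does not say how the ratios were defined or computed, so the sketch assumes a signal-to-noise ratio in dB measured over the whole utterance, and the function names `power`, `mix_at_snr`, and `measured_snr_db` are hypothetical. The sine wave and Gaussian noise stand in for real speech and car-noise or crosstalk recordings.

```python
import math
import random


def power(signal):
    """Mean squared amplitude of a sequence of samples."""
    return sum(s * s for s in signal) / len(signal)


def mix_at_snr(speech, noise, snr_db):
    """Add noise to speech, scaled so the mixture has the requested SNR in dB.

    Assumption: SNR is defined over the whole utterance as
    10 * log10(P_speech / P_scaled_noise).
    """
    if len(noise) < len(speech):
        raise ValueError("noise must be at least as long as speech")
    noise = noise[:len(speech)]
    p_speech = power(speech)
    p_noise = power(noise)
    if p_noise == 0:
        raise ValueError("noise signal is silent")
    # Solve snr_db = 10 * log10(p_speech / (gain**2 * p_noise)) for gain.
    gain = math.sqrt(p_speech / (p_noise * 10 ** (snr_db / 10)))
    return [s + gain * n for s, n in zip(speech, noise)]


def measured_snr_db(speech, mixture):
    """SNR of a mixture, given the clean speech it was built from."""
    residual = [m - s for m, s in zip(mixture, speech)]
    return 10 * math.log10(power(speech) / power(residual))


if __name__ == "__main__":
    rng = random.Random(0)
    # Placeholder signals: a 200 Hz tone sampled at 8 kHz and white noise.
    speech = [math.sin(2 * math.pi * 200 * t / 8000) for t in range(8000)]
    noise = [rng.gauss(0, 1) for _ in range(8000)]
    for snr in (20, 10, 0):
        noisy = mix_at_snr(speech, noise, snr)
        print(snr, round(measured_snr_db(speech, noisy), 2))
```

Crosstalk can be simulated the same way by passing a second speaker's recording as `noise`. Scaling the noise rather than the speech keeps the clean signal level fixed across conditions, which is a design choice of this sketch rather than something stated in the text.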