VECTOR QUANTIZATION FOR THE EFFICIENT COMPUTATION
OF CONTINUOUS DENSITY LIKELIHOODS
Enrico Bocchieri
Speech Research Dept., AT&T Bell Laboratories, MH2C 568,
Murray Hill, N.J. 07974
ABSTRACT
In speech recognition systems based on Continuous
Observation Density Hidden Markov Models, the
computation of the state likelihoods is an intensive
task. This paper presents an efficient method for the
computation of the likelihoods defined by weighted
sums (mixtures) of Gaussians. The proposed method
uses vector quantization of the input feature vector to
identify a subset of Gaussian neighbors. It is here
shown that, under certain conditions, instead of
computing the likelihoods of all the Gaussians, one
needs to compute the likelihoods of only the Gaussian
neighbors. Significant (up to a factor of nine)
likelihood computation reductions have been obtained
on various data bases, with only a small loss of
recognition accuracy.
1. MOTIVATION AND
ALGORITHM OVERVIEW
In speech recognition algorithms based on Hidden
Markov Models (HMM), the state likelihoods of the
input observation vectors are typically represented by
either discrete or parametric (continuous) distributions.
In the continuous case, the state likelihood computation
is an intensive task that requires up to 96% of the total
computation in a typical small vocabulary application
(connected digit recognition), and more than 80% of
the total computation in a large vocabulary system
based on beam search of the state path (DARPA
Resource Management). The likelihood computation is
also intensive for tied-mixture systems, as discussed in
[1]. Therefore, an efficient likelihood computation
method can significantly reduce the computational
requirements of a recognition system.
In particular, we have studied an efficient algorithm
for the computation of the state likelihoods represented
by Gaussian mixtures. The generic Gaussian
component likelihood of frame feature vector $f_t$ is
denoted by:
$$G(f_t, \mu_m, U_m), \quad m = 1, \ldots, M \qquad (1)$$
where $\mu_m$ and $U_m$ are the mean vector and diagonal
covariance of the $m$th Gaussian component,
respectively, and $M$ is the total number of the mixture
components of all the states of all the words (speech
units) in the vocabulary. The likelihood $l_s(f_t)$ of
observation $f_t$, given a certain state $s$, is computed
as:
$$l_s(f_t) = \sum_{m \in M_s} c_m \, G(f_t, \mu_m, U_m), \qquad \sum_{m \in M_s} c_m = 1 \qquad (2)$$
where $c_m$ is the mixture weight and $M_s$ is the subset of
mixture components (1) that belong to state $s$.
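As an illustration only, the full (unreduced) computation of (1) and (2) can be sketched in Python with NumPy. The sketch assumes that G is the normalized multivariate Gaussian density with diagonal covariance, which the text does not state explicitly. The function names, the array layout, and the toy values are hypothetical and are not taken from the paper.

```python
import numpy as np

def gaussian_likelihoods(f, means, variances):
    """Eq. (1): likelihood of f under each diagonal-covariance Gaussian.

    f:         (D,)   feature vector of one frame
    means:     (M, D) mean vectors of the pooled components
    variances: (M, D) diagonals of the covariances (assumed layout)
    Returns an array of shape (M,).
    """
    diff = f - means  # broadcasts f against every mean
    log_norm = -0.5 * np.sum(np.log(2.0 * np.pi * variances), axis=1)
    log_exp = -0.5 * np.sum(diff * diff / variances, axis=1)
    return np.exp(log_norm + log_exp)

def state_likelihood(f, means, variances, component_ids, weights):
    """Eq. (2): weighted sum over the components that belong to one state."""
    g = gaussian_likelihoods(f, means[component_ids], variances[component_ids])
    return float(np.dot(weights, g))

# Toy pool of M = 3 two-dimensional components (made-up values).
means = np.array([[0.0, 0.0], [1.0, 1.0], [5.0, 5.0]])
variances = np.ones((3, 2))
f_t = np.array([0.0, 0.0])

# A hypothetical state that owns components 0 and 2 with equal weights.
l_s = state_likelihood(f_t, means, variances, [0, 2], np.array([0.5, 0.5]))
```

In this toy case the component centered at (5, 5) contributes a negligible amount to `l_s`, because `f_t` lies far out on its tail. A practical recognizer would usually work with log likelihoods to avoid numerical underflow; the linear domain is used here only to mirror (2) directly.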
It is well known that the Gaussian models (1) are
statistically accurate only if the input feature vector is
near the Gaussian means. The Gaussian model
provides at best a poor approximation of the likelihood
(and a small likelihood contribution) when the feature
vector falls on its distribution tail (outlier feature
vector). The proposed method computes only the
likelihoods of those Gaussians for which the input
vector is not an outlier. During system training, all the
mixture components (1) are clustered into
neighborhoods. A vector quantizer, consisting of one