EIGENVOICES FOR SPEAKER ADAPTATION

R. Kuhn, P. Nguyen, J.C. Junqua, L. Goldwasser, N. Niedzielski,
S. Fincke, K. Field, M. Contolini

Panasonic Technologies Inc., Speech Technology Laboratory
3888 State Street, Suite 202, Santa Barbara, CA 93105, USA.
Tel. (805) 687-0110; fax: (805) 687-2625; email: kuhn, [email protected]

1. ABSTRACT

We have devised a new class of fast adaptation techniques for
speech recognition, based on prior knowledge of speaker variation.
To obtain this prior knowledge, one applies Principal Component
Analysis (PCA) [9] or a similar technique to a training set of T
vectors of dimension D derived from T speaker-dependent (SD)
models. This offline step yields T basis vectors, which we call
“eigenvoices” by analogy with the eigenfaces employed in face
recognition [14,18]. We constrain the model for new speaker S to be
located in K-space, the space spanned by the first K eigenvoices.
Speaker adaptation then involves estimating the K eigenvoice
coefficients for the new speaker; typically, K is very small
compared to the original dimension D.

We conducted mean adaptation experiments on the Isolet database
[2], using PCA to find the eigenvoices. In these experiments, D
(number of Gaussian mean parameters) was 2808, T was 120, and K was
set to several values between 1 and 20. With a large amount of
supervised adaptation data, most eigenvoice techniques performed
slightly better than MAP or MLLR; with small amounts of supervised
adaptation data or for unsupervised adaptation, some eigenvoice
techniques performed much better. For instance, when the supervised
adaptation data was four letters pronounced once by the new
speaker, the average relative reduction in error rate for an
eigenvoice model with K = 5 was 26% (18.7% error in unit accuracy
for SI baseline vs. 13.8% error for eigenvoice); MAP and MLLR
showed no improvement.
We believe that the eigenvoice approach would yield rapid
adaptation for most speech recognition systems, including ones with
a medium-sized or large vocabulary.

2. WHAT ARE EIGENVOICES?

“There are many examples of families of patterns for which it is
possible to obtain a useful systematic characterization. Often, the
initial motivation might be no more than the intuitive notion that
the family is low dimensional, that is, in some sense, any given
member might be represented by a small number of parameters.
Possible candidates for such families of patterns are abundant both
in nature and in the literature. Such examples include turbulent
flows, human speech, and the subject of this correspondence, human
faces” [10].

[10] introduced “eigenfaces” to researchers working on the
representation and recognition of human faces. Previously, faces
had been modeled with general-purpose image processing techniques.
However, the true dimensionality of “face space” is much lower than
its apparent dimensionality: outside the oeuvre of Pablo Picasso,
human faces differ from each other in minor ways.
Since the publication of [10], face recognition researchers have
applied dimensionality reduction techniques to training images of
faces to characterize the space of variation between faces. Often,
these researchers use PCA, which generates an orthogonal basis
derived from the eigenvectors of the covariance or correlation
matrix of the input data [9]. PCA guarantees that for the original
data, the mean-square error introduced by truncating the expansion
after the K-th eigenvector is minimized. The dimensionality
reduction can be a factor of 50,000 or more [14,18].
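
The PCA step just described can be sketched numerically. The code below is only an illustration under stated assumptions, not the implementation used in any of the cited work: the function name `pca_basis`, the synthetic data, and the sizes T, D, K are all made up for the example, and a thin SVD of the centered data stands in for an explicit eigendecomposition of the covariance matrix (the two give the same eigenvectors). It checks that the mean-square error of a rank-K reconstruction equals the sum of the discarded eigenvalues.

```python
import numpy as np

def pca_basis(X):
    """PCA of the rows of X (T x D), intended for T << D.

    Returns the mean (D,), the covariance eigenvalues in descending
    order, and an orthonormal basis whose columns are the eigenvectors.
    """
    mean = X.mean(axis=0)
    Xc = X - mean
    # Thin SVD avoids forming the D x D covariance matrix explicitly.
    _, s, Vt = np.linalg.svd(Xc, full_matrices=False)
    eigvals = s**2 / X.shape[0]          # eigenvalues of (1/T) Xc^T Xc
    return mean, eigvals, Vt.T

rng = np.random.default_rng(0)
T, D, K = 40, 300, 5                     # illustrative sizes only
# Synthetic low-dimensional structure plus a little noise.
X = rng.normal(size=(T, 8)) @ rng.normal(size=(8, D))
X += 0.1 * rng.normal(size=(T, D))

mean, eigvals, basis = pca_basis(X)
E = basis[:, :K]                         # first K eigenvectors (D x K)
X_hat = mean + (X - mean) @ E @ E.T      # rank-K reconstruction
mse = np.mean(np.sum((X - X_hat) ** 2, axis=1))
cum_var = np.cumsum(eigvals) / eigvals.sum()

print(f"truncation MSE          : {mse:.4f}")
print(f"sum of discarded eigvals: {eigvals[K:].sum():.4f}")
print(f"variance captured by K={K}: {100 * cum_var[K - 1]:.1f}%")
```

The cumulative-variance vector `cum_var` is the kind of quantity one would plot to decide how many eigenvectors to keep.
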
However, other dimensionality reduction techniques can be used:
e.g., linear discriminant analysis, singular value decomposition,
or independent component analysis [3].

[11] proposed that such a technique be applied to SD models to find
speaker space, the topography of variation between speaker models.
Dimensionality reduction techniques are already widely used in
speech recognition, but at the level of acoustic features rather
than of complete speaker models. In the eigenvoice approach, a set
of T well-trained SD models must first be “vectorized”. I.e., for
each speaker, one writes out floating-point coefficients
representing all HMMs trained on that speaker, creating a vector of
some large dimension D. In our Isolet experiments, only Gaussian
mean parameters for each HMM state were written out in this way,
but covariances, transition probabilities, or mixture weights could
be included as well. The T vectors thus obtained are called
“supervectors”; the order in which the HMM parameters are stored in
the supervectors is arbitrary, but must be the same for all T
supervectors. In an offline computation, we apply PCA or a similar
technique to the set of supervectors to obtain T eigenvectors, each
of dimension D: the “eigenvoices”. The first few eigenvoices
capture most of the variation in the data, so we need to keep only
the first K of them, where K < T << D (we let eigenvoice 0 be the
mean vector). These K eigenvoices span “K-space”.

Currently, the most commonly-used speaker adaptation techniques are
MAP [6] and MLLR [13]; neither employs a priori information about
the type of speaker. The EMAP (“extended MAP”) or RMP
(“regression-based model prediction”) approach is an exception:
here, phoneme correlations estimated from training data allow
observations of any phoneme from the new speaker to update the HMMs
for all phonemes [1,4,12]. Like speaker clustering [1,5], our
approach employs prior knowledge about speaker types. However,
clustering diminishes the amount of training data used to train
each HMM, since information is not shared across clusters, while
the eigenvoice approach pools training data independently in each
dimension.

3. FINDING EIGENVOICE COEFFICIENTS

3.1. Projection

Let new speaker S be represented by a point P in K-space. We
devised two techniques for estimating P from adaptation data.
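
As a rough numerical preview of the two estimators (projection, Section 3.1, and MLED, Section 3.2), the following sketch compares them on synthetic data. It is an illustration under several assumptions that are not taken from the experiments: the function names `project` and `mled` are hypothetical, the eigenvoices are random orthonormal columns, there is one Gaussian per state, hard state alignments replace occupation probabilities, the mean vector (eigenvoice 0) is omitted, and states never seen in the adaptation data are simply left at zero when building the supervector for projection.

```python
import numpy as np

def project(E, V):
    """Projection estimator P = E E^T V (E: D x K, orthonormal columns)."""
    return E @ (E.T @ V)

def mled(E, n_feat, inv_cov, frames, states):
    """ML eigenvoice weights for single-Gaussian states, hard alignments.

    Solves the K x K system A w = b, where each frame o_t aligned to
    state s contributes
        A[j, k] += e_s(j)^T C_s^-1 e_s(k),   b[j] += e_s(j)^T C_s^-1 o_t.
    """
    K = E.shape[1]
    Es = E.reshape(-1, n_feat, K)        # Es[s][:, j] is subvector e_s(j)
    A, b = np.zeros((K, K)), np.zeros(K)
    for o, s in zip(frames, states):
        A += Es[s].T @ inv_cov[s] @ Es[s]
        b += Es[s].T @ (inv_cov[s] @ o)
    return np.linalg.solve(A, b)

rng = np.random.default_rng(1)
n_states, n_feat, K = 12, 4, 3           # toy sizes, not the Isolet setup
D = n_states * n_feat
E, _ = np.linalg.qr(rng.normal(size=(D, K)))   # stand-in eigenvoices
w_true = np.array([2.0, -1.0, 0.5])      # the speaker's point in K-space
means = (E @ w_true).reshape(n_states, n_feat)
sigma = 0.1
inv_cov = np.stack([np.eye(n_feat) / sigma**2] * n_states)

# Adaptation data that covers only 4 of the 12 states.
states = rng.choice([0, 3, 5, 8], size=400)
frames = means[states] + sigma * rng.normal(size=(400, n_feat))

w_mled = mled(E, n_feat, inv_cov, frames, states)

# Projection needs a full SD supervector; unseen states stay at zero here.
V = np.zeros((n_states, n_feat))
for s in np.unique(states):
    V[s] = frames[states == s].mean(axis=0)
w_proj = E.T @ V.ravel()                 # coordinates of project(E, V.ravel())

print("true      :", w_true)
print("MLED      :", np.round(w_mled, 3))
print("projection:", np.round(w_proj, 3))
```

In this toy setting the MLED weights should stay close to the true ones even though most states are never observed, while the projection coordinates are distorted by the unseen states, which is the weakness of projection discussed in Section 3.1.
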
The projection estimator for P is similar to a technique commonly
used in the eigenface literature. Let e(1), ..., e(K) be the K
eigenvoices; then E = [e(1)...e(K)] is a matrix of dimension
(D × K). We now train an SD model on the adaptation data, from
which we extract a supervector V of dimension D × 1 and project it
into K-space to obtain P:

$$ P = E \times E^T \times V $$

It is now trivial to generate the adapted HMMs for S from P (if the
D parameters in P represent only the Gaussian means, as for the
experiments below, the remaining HMM parameters can be obtained
from an SI model). The main flaw of the projection method is that
for it to work well, all D parameters should be observed at least
once in the adaptation data.

3.2. Max. Likelihood EigenDecomposition (MLED)

We now derive the maximum-likelihood MLED estimator for P in the
case of Gaussian mean adaptation [15,16]. If m is a Gaussian in a
mixture Gaussian output distribution for state s in a set of HMMs
for a given speaker, let

    n be the number of features
    $o_t$ be the feature vector (length n) at time t
    $C_m^{(s)-1}$ be the inverse covariance for m in state s
    $\hat{\mu}_m^{(s)}$ be the adapted mean for mixture m of s
    $\gamma_m^{(s)}(t)$ be $L(m, s \mid \lambda, o_t)$ (s-m occupation prob.)

To maximize the likelihood of observation $O = o_1 \ldots o_T$
w.r.t. $\lambda$, we iteratively maximize an auxiliary function
$Q(\lambda, \hat{\lambda})$, where $\lambda$ is the current model
and $\hat{\lambda}$ is the estimated model [13]. We have

$$ Q(\lambda, \hat{\lambda}) = -\frac{1}{2} P(O \mid \lambda) \times \sum_{s} \sum_{m} \sum_{t} \gamma_m^{(s)}(t) \, f(o_t, s, m) $$

where

$$ f(o_t, s, m) = \left[ n \log(2\pi) + \log |C_m^{(s)}| + h(o_t, s, m) \right] $$

and

$$ h(o_t, s, m) = (o_t - \hat{\mu}_m^{(s)})^T \, C_m^{(s)-1} \, (o_t - \hat{\mu}_m^{(s)}) $$

Consider the eigenvoice vectors e(j) with j = 1...K:

$$ e(j) = [e_1^{(1)}(j), e_2^{(1)}(j), \ldots, e_m^{(s)}(j), \ldots]^T $$

where $e_m^{(s)}(j)$ represents the subvector of eigenvoice j
corresponding to the mean vector of mixture Gaussian m in state s.
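
The subvector notation can be made concrete with a small sketch. It assumes the setup reported for the Isolet experiments in Section 4 (26 HMMs, six single-Gaussian states each, 18 features, so D = 2808) and one particular storage order; the helper name `subvector`, the row-major layout, and the random stand-in values are illustrative assumptions only. It also checks that a weighted sum of eigenvoices can be taken either on whole supervectors or subvector by subvector, which is the property the rest of the derivation relies on.

```python
import numpy as np

N_HMMS, N_STATES, N_FEAT = 26, 6, 18     # Isolet setup from Section 4
D = N_HMMS * N_STATES * N_FEAT           # supervector dimension

def subvector(e, hmm, state):
    """Return the N_FEAT-dimensional slice of supervector e for one state.

    Assumes means are stored HMM by HMM, then state by state. Any other
    order works, provided it is identical for every supervector.
    """
    return e.reshape(N_HMMS, N_STATES, N_FEAT)[hmm, state]

rng = np.random.default_rng(0)
K = 5
E = rng.normal(size=(D, K))              # stand-in eigenvoices e(1)..e(K)
w = rng.normal(size=K)                   # stand-in weights w(1)..w(K)

mu_hat = E @ w                           # weighted sum of whole supervectors
hmm, state = 2, 4
# The same weighted sum, taken on the subvectors of a single state.
mu_sub = sum(w[j] * subvector(E[:, j], hmm, state) for j in range(K))

print(D, np.allclose(subvector(mu_hat, hmm, state), mu_sub))
```
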
Then we need

$$ \hat{\mu} = [\hat{\mu}_1^{(1)}, \hat{\mu}_2^{(1)}, \ldots, \hat{\mu}_m^{(s)}, \ldots]^T = \sum_{j=1}^{K} w(j) \, e(j) $$

The w(j) are the K coefficients of the eigenvoice model:

$$ \hat{\mu}_m^{(s)} = \sum_{j=1}^{K} w(j) \, e_m^{(s)}(j) $$

To maximize $Q(\lambda, \hat{\lambda})$, set
$\partial Q / \partial w(j) = 0$, j = 1...K; assuming the
eigenvalues are independent,
$\partial w(i) / \partial w(j) = 0$, $i \neq j$. We obtain

$$ \sum_{s} \sum_{m} \sum_{t} \gamma_m^{(s)}(t) \, (e_m^{(s)}(j))^T C_m^{(s)-1} o_t = \sum_{s} \sum_{m} \sum_{t} \gamma_m^{(s)}(t) \left\{ \sum_{k=1}^{K} w(k) \, (e_m^{(s)}(k))^T C_m^{(s)-1} e_m^{(s)}(j) \right\}, \quad j = 1 \ldots K $$

Thus, we have K equations to solve for the K unknown w(j) values.
The computational cost of this online operation is quite
reasonable: for instance, it is much “cheaper” than most
implementations of MLLR. To reduce computational cost, one can
choose a lower K (at the expense of accuracy). Note also that the
Isolet experiments described below involved only one Gaussian per
state s (so the K equations we solved for MLED estimation in the
experiments were a special case of those just given).

4. EXPERIMENTS

4.1. Protocol and Results

We conducted mean adaptation experiments on the Isolet database
[2], which contains 5 sets of 30 speakers, each pronouncing the
alphabet twice. After downsampling to 8 kHz, five splits of the
data were done. Each split took 4 of the sets (120 speakers) as
training data, and the remaining set (30 speakers) as test data;
results given below are averaged over the five splits. Offline, we
trained 120 SD models on the training data, and extracted a
supervector from each. Each SD model contained one HMM per letter
of the alphabet, with each HMM having six single-Gaussian output
states. Each Gaussian involved eighteen “Perceptual linear
predictive” (PLP) [7] cepstral features whose trajectories were
filtered. Thus, each supervector contained D = 26 * 6 * 18 = 2808
parameters.

For each of the 30 test speakers, we drew adaptation data from the
first repetition of the alphabet, and tested on the entire second
repetition. SI models trained on the 120 training speakers yielded
81.3% words correct; SD models trained on the entire first
repetition for each new speaker yielded 59.6%. We also tested three
conventional mean adaptation techniques, using various subsets of
the first alphabet repetition for each speaker as adaptation data.
The three techniques (whose unit accuracy results are shown in
Table 1) are MAP with SI prior (“MAP”), global MLLR with SI priors
(“MLLR G”), and MAP with the MLLR G model as prior (“MLLR G =>
MAP”). For MAP techniques shown here and below, we set τ = 20 (we
verified that results were insensitive to changes in τ).

Using the whole alphabet as adaptation data, we carried out both
supervised and unsupervised adaptation experiments (first-pass SI
recognition for unsupervised adaptation); the results are denoted
as alph. sup. and alph. uns. in Table 1. The other experiments in
Table 1 involve supervised adaptation employing subsets of the
alphabet as adaptation data. These include a balanced alphabet
subset of size 17, bal-17 = {C D F G I J M N Q R S U V W X Y Z},
and two subsets of size 4, AEOW and ABCU, whose membership is given
by their names. Finally, since we can't show all 26 experiments
using a single letter as adaptation data, we show results for D
(the worst MAP result), the average result over all single letters,
ave(1-let.), and the result for A (the best MAP result). For small
amounts of data MLLR
G and MLLR G => MAP give pathologically bad results.

Ad. data       MAP     MLLR G   MLLR G => MAP
alph. sup.     87.4    85.8     87.3
alph. uns.     77.8    81.5     78.5
bal-17         81.0    81.4     81.9
AEOW           79.7    14.4     15.4
ABCU           78.6    17.0     17.5
D (worst)      77.6     3.8      3.8
ave(1-let.)    80.0     3.8      3.8
A (best)       81.2     3.8      3.8

Table 1: NON-EIGENVOICE ADAPTATION

To carry out eigenvoice experiments, we performed PCA on
the T = 120 supervectors (using the correlation matrix), and kept
eigenvoices 0...K (0 is mean vector). First, we studied the effect
of K and of estimation method. For these experiments, shown in
Table 2, the whole alphabet was used as supervised adaptation data
(alph. sup. data option). “PROJ.K” is the eigenvoice model obtained
by projection into K-space, “MLED.K” is the maximum-likelihood
eigenvoice model in K-space, and “MLED.K => MAP” is MAP using
MLED.K as the prior. Comparison with the alph. sup. row of Table 1
shows that MLED.K => MAP outperforms the non-eigenvoice techniques
by a small
amount.

K     PROJ.K   MLED.K   MLED.K => MAP
1     83.4     84.7     88.3
5     81.4     86.5     88.8
10    80.5     87.4     89.0
20    78.5     87.4     89.1

Table 2: EIGENVOICES: VARYING K (alph. sup.)

For unsupervised adaptation or small amounts of adaptation
data, some of the eigenvoice techniques performed much better than
conventional techniques (Table 3). Here, we tested eigenvoice
techniques with K = 5 and K = 10 and the same adaptation data as in
Table 1. Thus, we tried MLED.5, MLED.5 => MAP (“=>MAP” after
“MLED.5” in Table 3), MLED.10, and MLED.10 => MAP (“=>MAP” after
“MLED.10”). For single-letter adaptation, we show W (letter with
worst MLED.5 result), the average results ave(1-let.), and results
for V (letter with best MLED.5 result). Note that unsupervised
MLED.5 and MLED.10 (alph. uns.) are almost as good as supervised
(alph. sup.). The SI performance is 81.3% word correct; Table 3
shows that MLED.5 can improve significantly on this even when the
amount of adaptation data is very small. We know of no other
equally rapid
adaptation method.

Ad. data       MLED.5, =>MAP   MLED.10, =>MAP
alph. sup.     86.5, 88.8      87.4, 89.0
alph. uns.     86.3, 80.8      86.3, 81.4
bal-17         86.5, 86.0      87.0, 86.8
AEOW           86.2, 85.4      85.8, 85.3
ABCU           86.3, 85.2      86.4, 85.5
W (worst)      82.2, 81.8      79.9, 79.2
ave(1-let.)    84.4, 83.9      82.4, 81.8
V (best)       85.7, 85.7      83.2, 83.1

Table 3: EIGENVOICES: PARTIAL ALPHABET

4.2. What Do the Eigenvoices Mean?

We tried to interpret the eigendimensions for one of the five splits
in these experiments. Figure 1 shows how as more eigenvoices are
added, more variation in the training speakers is accounted for.
Eigenvoice 1 accounts for 18.4% of the variation; to account for
50% of the variation, we need the eigenvoices up to and including
number 14.

[Figure 1: Cumulative variation by eigenvoice number. x-axis:
eigenvector #, 0 to 120; y-axis: cumulative % of variation, 0 to
100.]

We looked for acoustic correlates of high (+) or low (-)
coordinates, estimated on both alphabet repetitions, for the 150
Isolet speakers in dimensions 1, 2, and 3. Dimension 1 is closely
correlated with sex (74 of 75 women in the database have - values
in this dimension, all 75 men have + values) and with F0. Dimension
2 correlates strongly with amplitude: - values indicate loudness, +
values softness. Both findings are rather surprising: PLP cepstral
features should not contain pitch or amplitude information.
However, both pitch and amplitude may be strongly correlated with
other types of information (e.g., locations of harmonics, spectral
tilt) which are likely to survive PLP cepstral parametrization.
Finally, + values in dimension 3 correlate with lack of movement or
low rate of change in vowel formants, while speakers with - values
show dramatic movement towards the off-glide.

5. DISCUSSION

Some other researchers share our belief that fast speaker
adaptation can be achieved by quantifying inter-speaker variation.
N. Strom models speaker variation for adaptation in a hybrid
ANN/HMM system by adding an extra layer of “speaker space units”
[17]. There is one such unit per training speaker; when the system
is being trained on speaker i, the activity of unit i is set to 1
and all other activities are set to 0. Strom found moderate
improvement for the adapted system over the baseline for four or
more words. Examination of the connections in the ANN indicated
that male and female speakers form two separate clusters in speaker
space ([17], Fig. 2).

After submission of this paper in April 1998, we became aware of
some excellent research along similar lines, unpublished at that
time. Hu et al. [8] focus on vowel classification by Gaussian
mixture classifiers, but their approach could be extended to cover
all phonemes. PCA is performed on a set of training vectors
consisting, for each speaker, of the concatenated mean feature
vectors for vowels. Vowel data from the new speaker is projected
onto the eigenvectors to estimate the new speaker's deviation from
the training speaker mean vector. Finally, classification is
carried out either by subtracting the deviations from the new
speaker's acoustic data (speaker normalization) or by adjusting the
Gaussian classifier means to reflect the deviation.
This technique can be seen as a special case of the eigenvoice
approach for mean adaptation. In this special case, only HMMs for
vowels are employed, each HMM has a single state with a single
Gaussian output distribution, and the projection technique is used
to estimate the eigenvoice coordinates for the new speaker. Hu et
al. find significant improvements over an SI baseline if their
adaptation approach is used, for both supervised and unsupervised
adaptation. As it did in our experiments, the first coefficient in
their experiments separates men and women (though it accounts for
93.8% of variation vs. only about 18% in our case).

In the small-vocabulary speaker adaptation experiments described in
this paper, the eigenvoice approach reduced the degrees of freedom
for speaker adaptation from D = 2808 to K <= 20 and yielded much
better performance than other techniques for small amounts of
adaptation data. These exciting results provide a strong motivation
for testing the approach in medium- and large-vocabulary systems.
We also plan to study the robustness of the approach to
deterioration in the quantity or quality of the training data:
e.g., fewer training speakers or less data per training speaker,
mismatch between training and test environments, differences in
dialect between training and test speakers. We will also experiment
with discriminative training of the original SD models. Other
important issues include training of mixture Gaussian SD models
(for the resulting eigenvoices to be useful, Gaussian i for
phonetic unit P in a given training SD model must mean the same
thing as Gaussian i for P for another training speaker; how can
this be ensured?) and the performance of eigenvoices found by
dimensionality reduction techniques other than PCA. We hope to
explore Bayesian versions of the approach: estimate the position
$\lambda$ of the new speaker in K-space by maximizing
$P(O \mid \lambda) \times P(\lambda)$ (MLED only maximizes the
first term). Finally, we have begun to apply the eigenvoice
approach to speaker verification and identification, with
encouraging early results.

6. REFERENCES

1. S. Ahadi-Sarkani. “Bayesian and Predictive Techniques for
   Speaker Adaptation”. Ph.D. thesis, Cambridge University, Jan.
   1996.
2. R. Cole, Y. Muthusamy, and M. Fanty. “The ISOLET Spoken Letter
   Database”, http://www.cse.ogi.edu/CSLU/corpora/isolet.html.
3. P. Comon. “Independent component analysis, a new concept?”. Sig.
   Proc., V. 36, No. 3, pp. 287-314, Apr. 1994.
4. S. Cox. “Predictive speaker adaptation in speech recognition”.
   Comp. Speech Lang., V. 9, pp. 1-17, Jan. 1995.
5. S. Furui. “Unsupervised speaker adaptation method based on
   hierarchical spectral clustering”. ICASSP-89, V. 1, pp. 286-289,
   Glasgow, 1989.
6. J.-L. Gauvain and C.-H. Lee. “Maximum a Posteriori Estimation
   for Multivariate Gaussian Mixture Observations of Markov
   Chains”. IEEE Trans. Speech Audio Proc., V. 2, pp. 291-298, Apr.
   1994.
7. H. Hermansky, B. Hanson, and H. Wakita. “Low-dimensional
   representation of vowels based on all-pole modeling in the
   psychophysical domain”. Speech Comm., V. 4, pp. 181-187, 1985.
8. Z. Hu, E. Barnard, and P. Vermeulen. “Speaker Normalization
   using Correlations Among Classes”. To be publ. Proc. Workshop on
   Speech Rec., Understanding and Processing, CUHK, Hong Kong,
   Sept. 1998.
9. I. T. Jolliffe. “Principal Component Analysis”. Springer-Verlag,
   1986.
10. M. Kirby and L. Sirovich. “Application of the Karhunen-Loeve
    Procedure for the Characterization of Human Faces”. IEEE PAMI,
    V. 12, no. 1, pp. 103-108, Jan. 1990.
11. R. Kuhn. “Eigenvoices for Speaker Adaptation”. Internal tech.
    report, STL, Santa Barbara, CA, July 30, 1997.
12. M. Lasry and R. Stern. “A Posteriori Estimation of Correlated
    Jointly Gaussian Mean Vectors”. IEEE PAMI, V. 6, no. 4, pp.
    530-535, July 1984.
13. C. Leggetter and P. Woodland. “Maximum likelihood linear
    regression for speaker adaptation of continuous density hidden
    Markov models”. Comp. Speech Lang., V. 9, pp. 171-185, 1995.
14. B. Moghaddam and A. Pentland. “Probabilistic Visual Learning for
    Object Representation”. IEEE PAMI, V. 19, no. 7, pp. 696-710,
    July 1997.
15. P. Nguyen. “ML linear eigen-decomposition”. Internal tech.
    report, STL, Santa Barbara, CA, Jan. 22, 1998.
16. P. Nguyen. “Fast Speaker Adaptation”. Industrial Thesis Report,
    Institut Eurécom, June 17, 1998.
17. N. Strom. “Speaker Adaptation by Modeling the Speaker Variation
    in a Continuous Speech Recognition System”. ICSLP-96, V. 2, pp.
    989-992, Oct. 1996.
18. M. Turk and A. Pentland. “Eigenfaces for Recognition”. Journ.
    Cognitive Neuroscience, V. 3, no. 1, pp. 71-86, 1991.