SPEAKER CLUSTERING AND TRANSFORMATION FOR SPEAKER ADAPTATION IN LARGE-VOCABULARY SPEECH RECOGNITION SYSTEMS M. Padmanabhan, L. R. Bahl, D. Nahamoo, M. A. Picheny IBM T. J. Watson Research Center P. O. Box 704, Yorktown Heights, NY 10598
1 ABSTRACT

A speaker adaptation strategy is described that is based on finding a subset of speakers, from the training set, who are acoustically close to the test speaker, and using only the data from these speakers (rather than the complete training corpus) to re-estimate the system parameters. Further, a linear transformation is computed for every one of the selected training speakers to better map the training speaker's data to the test speaker's acoustic space. Finally, the system parameters (Gaussian means) are re-estimated specifically for the test speaker using the transformed data from the selected training speakers. Experiments showed that this scheme is capable of reducing the error rate by 10-15% with the use of as little as 3 sentences of adaptation data.
2 INTRODUCTION

In the last couple of years, several advances have been made in improving the error rate of continuous-speech-recognition systems [1]. For instance, typical word-error rates on test data drawn from the Wall Street Journal database, as reported by different participants in the Wall Street Journal task [1], hover in the neighborhood of 12% for large-vocabulary speaker-independent systems. Though this represents a reasonable level of performance on this particular test data, there is still scope for further improvement. One way to improve the performance of these systems is to make the system parameters speaker-dependent. However, large-vocabulary systems tend to have a large number of parameters, and in order to robustly estimate these parameters, a large amount of training data is needed. This implies that the test speaker would have to furnish a large amount of data to specifically train the system to his/her speech. This is usually not a practical solution. Consequently, most systems
use speaker adaptation techniques that require only a small amount of data from the test speaker. This data is used to move the parameters of the speaker-independent system towards speaker-dependent values. In this paper, we present a speaker adaptation method that is based on finding a cluster of speakers who are acoustically 'close' to the test speaker, and using these speakers to estimate model parameters that are closer to the test speaker's data than the speaker-independent model parameters. Further, this method assumes that the basic speech recognition system uses HMM's to model the speech production process, and mixtures of continuous-density Gaussian pdf's to model the output distribution of the HMM's.
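As context for the likelihood computations used throughout the paper, the following is a minimal sketch (our own function names and toy numbers, not from the paper) of evaluating one acoustic frame under a mixture of diagonal-covariance Gaussians, using log-sum-exp for numerical stability:

```python
import numpy as np

# Illustrative sketch only: evaluate the log-likelihood of a frame under
# a mixture of diagonal-covariance Gaussians, the output-distribution
# model assumed above. All names and numbers here are hypothetical.

def log_gaussian_diag(x, mean, var):
    """Log-density of a diagonal-covariance Gaussian at frame x."""
    return -0.5 * np.sum(np.log(2.0 * np.pi * var) + (x - mean) ** 2 / var)

def log_mixture(x, weights, means, variances):
    """Log-likelihood of frame x under the mixture, via log-sum-exp."""
    comp = np.array([np.log(w) + log_gaussian_diag(x, m, v)
                     for w, m, v in zip(weights, means, variances)])
    peak = comp.max()
    return peak + np.log(np.sum(np.exp(comp - peak)))

# Toy example: a two-component mixture in 3 dimensions.
weights = np.array([0.6, 0.4])
means = np.array([[0.0, 0.0, 0.0], [1.0, 1.0, 1.0]])
variances = np.ones((2, 3))
x = np.zeros(3)
ll = log_mixture(x, weights, means, variances)
```

Summing such per-frame log-likelihoods over an utterance is the basic quantity used both for decoding and for the speaker-selection step described later.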
3 TECHNICAL BACKGROUND

Some adaptation schemes that have been proposed recently include transformation methods [2, 3, 6], MAP estimation [4, 5], etc. In [2], the speaker-independent system is transformed to come closer to the test speaker's acoustics by applying a linear transformation to the means of the speaker-independent Gaussians. The transformation is computed so as to maximize the likelihood of the test speaker's adaptation data. The scheme used in [3] is similar; here, the assumption is made that the acoustic space of the test speaker and the training data are related by a linear transformation, and the model parameters for the test speaker are obtained by applying this transformation to the means and covariance matrices of the speaker-independent system. Another related scheme, which actually applies a non-linear transformation to the training data in order to map it to the test speaker's space, is the metamorphic transformation of [6]. In contrast to these transformation schemes, [4, 5] attempt to obtain a Bayesian estimate of the model parameters from the limited amount of adaptation data available from the
test speaker. These schemes assume a prior distribution on the model parameters that leads to a very simple adaptation process. In contrast to the above schemes, the adaptation scheme described here is based on the fact that the training data contains a number of training speakers, some of whom are acoustically closer to the test speaker than the others. If the model parameters are re-estimated from this subset of the training data, they should be reasonably close to the speaker-dependent parameters that would be obtained by training on large amounts of data from the test speaker (if such data were available). Of course, the simplest implementation of such a clustering strategy is gender-dependent processing. However, the performance gains that can be obtained using this strategy are quite constrained [7]. Possible explanations for this are that the speakers in the training corpus that are closest to the test speaker do not necessarily belong to the same gender as the test speaker, and that partitioning the data into two sets is not good enough, as there are too many dissimilar speakers in each set. A further improvement on speaker clustering can be obtained if the acoustic space of each of these training speakers can be transformed to be even closer to the test speaker, to minimize the mismatch between the test and training data. This may be done using linear [2] or non-linear [6] techniques; in this paper, for reasons of simplicity, we have opted to use the MLLR linear technique of [2]. The adaptation scheme is described in more detail in the following section, and is shown to be capable of giving quite large improvements in performance with a very small amount of adaptation data. The notation used in the rest of the paper is as follows: underlining will be used to represent a column vector, and double underlining will be used to represent a matrix.
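For a single Gaussian mean with the covariance held fixed, the MAP estimates of [4, 5] referenced above reduce to a count-weighted interpolation between the prior mean and the data average. The sketch below illustrates that reduction under this assumption; the names tau, counts, and frames are our own, not from the paper:

```python
import numpy as np

# Hedged illustration of the MAP mean update for one Gaussian with a
# fixed diagonal covariance: the posterior mode interpolates between the
# prior mean and the weighted average of the aligned data. Names are ours.

def map_mean(prior_mean, tau, counts, frames):
    """
    prior_mean: (d,) prior (e.g. speaker-independent) mean
    tau:        scalar prior weight
    counts:     (T,) posterior counts of the state for each frame
    frames:     (T, d) acoustic vectors aligned to this state
    """
    total = counts.sum()
    weighted_sum = counts @ frames          # sum_t c(t) x_t
    return (tau * prior_mean + weighted_sum) / (tau + total)

# Toy example: two unit-count frames, prior mean at the origin.
prior = np.zeros(2)
frames = np.array([[1.0, 2.0], [3.0, 4.0]])
counts = np.array([1.0, 1.0])
m = map_mean(prior, tau=2.0, counts=counts, frames=frames)
# With tau = 2 and two unit counts, m = [4, 6] / 4 = [1.0, 1.5]
```

With little data the estimate stays near the prior mean; with abundant data it approaches the sample average, which is the behavior that makes MAP adaptation robust to small adaptation sets.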
4 DESCRIPTION OF ADAPTATION PROCEDURE
Models for the training speakers
Make up acoustic models for each of the speakers in
the training vocabulary, comprising a single Gaussian per context-dependent state (typically referred to as leaves). For purposes of notation, let L denote the total number of leaves, and d the dimension of the acoustic features; then the k-th training speaker is parametrized by \mu_{k,i}, \Sigma_{k,i}, i = 1, \ldots, L, where \mu_{k,i} and \Sigma_{k,i} represent the d-dimensional mean vector and d x d covariance matrix of the Gaussian characterizing the i-th leaf. Further, we assume that the covariance matrices are diagonal. Typically, however, as the amount of data available from each speaker is not sufficient to make up these models, we use the MAP re-estimation strategy of [4] to estimate the parameters. This strategy assumes a prior distribution p(\theta) on the parameters \theta, and attempts to find

\theta_{MAP} = \arg\max_{\theta} \, p(\theta) \, p(y_1^T | \theta) .   (1)

In [4], it was shown that the choice of a normal-Wishart density for the prior distribution on the Gaussian parameters results in a convenient estimation strategy. Consequently, we will choose the prior distribution to be of the form

p(\theta) = \prod_i |\Sigma_i|^{-\alpha} \exp\left( -\frac{\tau_i}{2} (\mu_i - \mu_i^g)^T \Sigma_i^{-1} (\mu_i - \mu_i^g) \right) \exp\left( -\frac{1}{2} \mathrm{tr}\left( \Sigma_i^g \Sigma_i^{-1} \right) \right) .   (2)

Here, \mu_i^g and \Sigma_i^g represent the mean vector and covariance matrix associated with the i-th leaf of the gender-independent model. Further, \tau_i is assumed to be constant. From [4], the term p(y_1^T | \theta) in (1) also takes on a form similar to (2), and the solution to (1) corresponds to choosing the mode of the product of the two distributions.

Preliminaries
Decode the adaptation data for the test speaker, and
do a Viterbi alignment of the data against the decoded script (unsupervised adaptation). Do one iteration of the EM algorithm, starting from a speaker-independent model, and accumulate the counts on the adaptation data. Do a Viterbi alignment of all the training data, using speaker-independent models.

Speaker Clustering
Compute the acoustic likelihood of the adaptation
data, given the alignment, using each training speaker's model. Subsequently, rank the training speakers in order of this likelihood, and pick the top N speakers as being acoustically close to the test speaker. The wsj0/1 training set includes 142 male and 142 female speakers. Fig. 1 shows the distance between a male test speaker and each of the training speakers. For ease of interpretation, the distances have been sorted before being plotted, and the distances to the male and female training speakers are plotted separately. One observation that can be made at this stage is that if the closest N training speakers were selected,
they would include male as well as female speakers. This observation reinforces the argument made earlier about the limitations of gender-dependent processing.

[Fig. 1: Distance between the test speaker and each of the training speakers (sorted), plotted separately for the male and female training speakers.]

Transform computation

Estimate a linear transformation for every training speaker that maps the training speaker's space closer to the test speaker. We use the technique described in [2] to compute this transformation. For the sake of convenience, we will briefly summarize this procedure next. Recall that we have already made up acoustic models, \mu_{k,i}, \Sigma_{k,i}, i = 1, \ldots, L, for the k-th training speaker. As in [2], we will assume that a linear transformation is applied to the means of the training speaker's model, and compute the transformation so as to minimize the following objective function:

\sum_{i,t} c_i(t) \left[ (x_t - A_k \hat{\mu}_{k,i})^T \Sigma_{k,i}^{-1} (x_t - A_k \hat{\mu}_{k,i}) + \log |\Sigma_{k,i}| \right] .   (3)

Here, A_k is a d x (d+1) matrix, and \hat{\mu}_{k,i} is a (d+1) x 1 vector obtained from \mu_{k,i} as (\hat{\mu}_{k,i})^T = [ (\mu_{k,i})^T \; 1 ]. Also, c_i(t) is the posterior probability of being in state i at time t. Unlike [2], however, c_i(t) is not obtained using the model for the k-th training speaker, but is obtained using the gender-independent model. (The reason for doing this is that the models for the training speakers are very crude; consequently, one could expect the alignment of states produced by using these models to be much poorer than in the case where the gender-independent model is used.)

Using the fact that \Sigma_{k,i} is diagonal,

\Sigma_{k,i} = \mathrm{diag}\left( \sigma^2_{k,i,1}, \ldots, \sigma^2_{k,i,d} \right) ,   (4)

we can rewrite (3) as

\min_{A_k^d} \sum_{i,t} \frac{c_i(t)}{\sigma^2_{k,i,d}} \left( x_{t,d} - A_k^d \hat{\mu}_{k,i} \right)^2 ,   (5)

where A_k^d is the d-th row of A_k, and x_{t,d} is the d-th element of x(t). Differentiating (5) with respect to A_k^d, and setting the derivative equal to zero, yields

A_k^d = b_{k,d} \, B_{k,d}^{-1} ,

where

b_{k,d} = \sum_i \left[ \frac{1}{\sigma^2_{k,i,d}} \hat{\mu}_{k,i}^T \left( \sum_t c_i(t) \, x_{t,d} \right) \right] ,   (6)

B_{k,d} = \sum_i \left[ \frac{1}{\sigma^2_{k,i,d}} \hat{\mu}_{k,i} \hat{\mu}_{k,i}^T \left( \sum_t c_i(t) \right) \right] .   (7)

In the above development, it was assumed that the same matrix A_k was applied to all means, corresponding to the case where the summation over i in (6, 7) goes from 1 to L. However, if sufficient data is available, it is possible to compute several transformations, with different transformations being applied to disjoint clusters of leaves. For this case, the summation over i in (6, 7) would be over the leaves that belong to a single cluster. The clusters can be obtained based on the acoustic similarity of the leaves, using a bottom-up procedure, as in [2].

A plot of a typical transformation matrix is shown in Fig. 2. The d x (d+1) matrix A_k can be thought of as applying a square d x d transformation (given by the first d columns of the matrix) on the mean \mu_{k,i}, and then adding a vector (the (d+1)-st column) to the result. The top part of Fig. 2 shows the first d columns of the matrix, and the lower part of the figure shows the last column. As can be seen from the figure, the dominant components of the transformation are the diagonal terms and the last column of the matrix. This seems to imply that the transformation could be carried out while assuming the above structure on the matrix, to simplify the computation process.

[Fig. 2: A typical transformation matrix; top: the first d columns, plotted by row and column; bottom: the last column, plotted by dimension.]

Re-estimating the Gaussians

Once the transformations mapping each training speaker's space to the test speaker's space have been computed, a set of Gaussians can be made up specifically for the test speaker. Note that the transformations were computed under the assumption that they were applied to the means of the training speaker's models, but the variances of the model were assumed to be unchanged; this is equivalent to applying the transformation on the raw data from the training speaker, using the transformed data in making up the new means of the Gaussians, but leaving the variances of the Gaussians unchanged. Further, the Viterbi alignment of the training speaker's data is also used in conjunction with this procedure, in order to allow for the possibility of using different transformations for different leaves.
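The per-dimension closed form in (5)-(7) can be sanity-checked numerically. The sketch below (all function and variable names are ours, not the paper's) accumulates the two sufficient statistics over leaves and solves for one row of the transformation; statistics generated from a known row recover that row exactly:

```python
import numpy as np

# Illustrative sketch of the per-dimension closed form: for dimension d,
# accumulate over the leaves
#   b = sum_i (1/var_i) mu_hat_i^T (sum_t c_i(t) x_{t,d})        -- eq. (6)
#   B = sum_i (1/var_i) mu_hat_i mu_hat_i^T (sum_t c_i(t))       -- eq. (7)
# and solve for the row of the transform, A^d = b B^{-1}.

def estimate_row(mu_hat, var_d, cx_d, c_sum):
    """
    mu_hat: (L, d+1) extended means [mu_i^T 1] for the leaves used
    var_d:  (L,) variances of the leaves in dimension d
    cx_d:   (L,) per-leaf sums   sum_t c_i(t) x_{t,d}
    c_sum:  (L,) per-leaf counts sum_t c_i(t)
    Returns the (d+1,) row of the transformation for dimension d.
    """
    b = (mu_hat * (cx_d / var_d)[:, None]).sum(axis=0)              # eq. (6)
    B = ((c_sum / var_d)[:, None, None]
         * mu_hat[:, None, :] * mu_hat[:, :, None]).sum(axis=0)     # eq. (7)
    return np.linalg.solve(B, b)   # B is symmetric, so b B^{-1} = B^{-1} b

# Consistency check: statistics generated by a known row are recovered.
mu_hat = np.array([[0.0, 1.0], [1.0, 1.0], [2.0, 1.0]])  # 1-D means + bias
true_row = np.array([2.0, 0.5])                          # scale 2, shift 0.5
c_sum = np.ones(3)
cx_d = c_sum * (mu_hat @ true_row)   # sum_t c_i(t) x_{t,d} for each leaf
row = estimate_row(mu_hat, np.ones(3), cx_d, c_sum)
# row recovers [2.0, 0.5]; the transformed mean of leaf i is row @ mu_hat[i]
```

In practice one such solve is done per dimension (and per leaf cluster, if multiple transformations are used), after which the row is applied to the extended means as described above.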
5 RESULTS

The results of applying this adaptation scheme are summarized below. The test data comprises around 15 sentences each from 20 speakers, and represents the Nov'94 evaluation data in the ARPA Wall Street Journal task. For reasons of faster turnaround time, these experiments were performed not with the IBM system used for the Nov'94 evaluation (for a description of this system, see [7]), but with a scaled-down version of that system. The system had 35000 Gaussians, and used the official 20K language model that was provided by NIST [1] for the Nov'94 ARPA evaluation. For comparison purposes, we also indicate the results obtained using the adaptation scheme proposed in [2]. Unsupervised incremental adaptation was used in all cases; i.e., the first 3 sentences were decoded with the speaker-independent system, then the Gaussians were adapted and the adapted prototypes used to decode the next 6 sentences; subsequently the Gaussians were adapted again using the data from the first 9 sentences, and the new prototypes used to decode the remainder of the test data. The error rate for a speaker was calculated on the basis of all sentences from that speaker.

No. of adaptation sentences    0       3       9
Baseline error (%)             14.64   14.64   14.64
Adapted error (%)              -       12.76   12.45
[2]-adaptation error (%)       -       13.45   13.33

The table above shows the average error rate over the 20 speakers, and it may be seen that the adaptation scheme gives a performance improvement of 12.8% with only 3 sentences of adaptation data. However, the performance improvement appears to saturate very quickly, and the scheme does not appear to be able to make better use of large amounts of adaptation data. This appears to be an inherent limitation of transformation-based adaptation schemes, as pointed out in [2, 3]; however, this limitation may be overcome by using a combination of MAP and transformation-based adaptation.
REFERENCES
[1] Proceedings of the ARPA Speech and Natural Language Workshop, Morgan Kaufmann Publishers, 1995.

[2] C. J. Leggetter and P. C. Woodland, "Maximum Likelihood Linear Regression for Speaker Adaptation of Continuous Density HMM's", Computer Speech and Language, vol. 9, no. 2, pp. 171-186, 1995.

[3] V. Digalakis and L. Neumeyer, "Speaker Adaptation Using Combined Transformation and Bayesian Methods", Proceedings of the ICASSP, pp. 680-683, 1995.

[4] J. L. Gauvain and C. H. Lee, "Maximum-a-Posteriori Estimation for Multivariate Gaussian Mixture Observations of Markov Chains", IEEE Trans. Speech and Audio Processing, vol. 2, no. 2, pp. 291-298, Apr. 1994.

[5] G. Zavaliagkos, R. Schwartz and J. Makhoul, "Batch, Incremental and Instantaneous Adaptation Techniques for Speech Recognition", Proceedings of the ICASSP, pp. 676-679, 1995.

[6] J. R. Bellegarda et al., "Experiments Using Data Augmentation for Speaker Adaptation", Proceedings of the ICASSP, 1995.

[7] L. R. Bahl et al., "Performance of the IBM Large Vocabulary Continuous Speech Recognition System on the ARPA Wall Street Journal Task", Proceedings of the ICASSP, pp. 41-44, 1995.