Speaker Adaptation of Context Dependent Deep Neural Networks ...

26 downloads 0 Views 174KB Size Report
Sep 14, 2014 - In this paper we propose a novel speaker adaptation method for a context-dependent deep neural network HMM (CD-DNN-. HMM) acoustic ...
INTERSPEECH 2014

Speaker adaptation of context dependent deep neural networks based on MAP-adaptation and GMM-derived feature processing Natalia Tomashenko1,2 , Yuri Khokhlov1 1

Speech Technology Center, Saint-Petersburg, Russia 2 ITMO University, Saint-Petersburg, Russia

[email protected], [email protected]

Abstract

Nowadays artificial neural networks (ANN) are getting more and more popular in state-of-the-art recognition systems. This is due to the recent success of Deep Neural Networks (DNN) [19]. New efficient adaptation methods need to be developed for them. Some feature-space adaptation methods from the GMM-HMM framework, such as VTLN and fMLLR, have been successfully applied for DNN-HMM systems [4]-[6] although in [5] it was shown that these methods are more efficient for shallow NNs than for deep NNs. Most of the other methods can be classified into several types:

In this paper we propose a novel speaker adaptation method for a context-dependent deep neural network HMM (CD-DNNHMM) acoustic model. The approach is based on using GMMderived features as the input to the DNN. The described technique of processing features for DNNs makes it possible to use GMM-HMM adaptation algorithms in the neural network framework. Adaptation to a new speaker can be simply performed by adapting an auxiliary GMM-HMM model used in calculation of GMM-derived features and can be regarded as adaptation in the feature space for a DNN system. In this work, traditional maximum a posteriori adaptation is performed for an auxiliary GMM-HMM model. Experiments show that the proposed adaptation technique can provide, on average, a 5%-36% relative word error reduction on different adaptation sets under supervised adaptation setup, compared to speaker independent (SI) CD-DNN-HMM systems. In addition, several multi-stream combination techniques are examined in order to improve the performance of the baseline SI model. Index Terms: speaker adaptation, deep neural networks, MAP, CD-DNN-HMM, GMM-derived features, multi-stream recognition

• Linear transformation. This is the most popular approach to speaker adaptation with ANNs. Linear transformation can be applied at different levels of the ANNHMM system: to the input features, as in Linear Input Network transformation (LIN) [7]-[10] or Featurespace Discriminative Linear Regression (fDLR) [5][11]; to the activations of hidden layers, as in Linear Hidden Network transformation (LHN) [7][8]; or to the softmax layer, as in LON [9] or in output-feature discriminative linear regression [11]. Wherever the transformation is applied, weights are initialized with an identity matrix and then they are trained by minimizing the error at the output of the ANN system while keeping the weights of the original ANN fixed. Adaptation of hybrid tiedposterior acoustic models [12] can also be considered a linear transform of posteriors. [13] describes a method based on linear transformation in the feature space and Principal Components Analysis (PCA).

1. Introduction Differences between training and testing conditions may significantly degrade recognition accuracy in automatic speech recognition systems. Adaptation is an efficient way to reduce the mismatches between the models and the data from a particular speaker or channel. Various acoustic model adaptation techniques have been developed to overcome this problem [1]-[18]. For Gaussian Mixture Models - Hidden Markov Model (GMM-HMM) acoustic models, approaches to adaptation may be divided into two classes: model-space adaptation, and feature-space adaptation. Model-space adaptation transforms the model parameters in order to optimize a certain criterion, for example, to maximize posterior probability or likelihood. The most popular approaches are maximum a posteriori (MAP) adaptation [1] and Maximum Likelihood Linear Regression (MLLR) [2]. The MAP estimation maximizes the a posteriori distribution of the adaptation data given the a priori model parameters. MLLR finds the transform that maximizes the likelihood of the adaptation data given the transformed Gaussian parameters. Feature-space adaptation techniques include vocal tract length normalization (VTLN) [3] and feature space MLLR (fMLLR), also known as constrained MLLR (CMLLR)) [2]. They transform acoustic feature vectors and do not require modification of the parameters of the acoustic model, so they are more suitable for real-time on-line ASR systems.

Copyright © 2014 ISCA

• Re-training the entire network or only input or output layers of the DNN using special regularization techniques for improving generalization, such as L2-prior regularization [14], Kullback-Leibler divergence regularization [15], Conservative Training [16]. In [12] only a sub-set of the hidden units with maximum variance (computed on the adaptation data) is retrained. • Non-linear feature-space transformation, such as adaptation with speaker codes [17]. In this work we present a novel approach for the adaptation of DNNs based on using GMM-derived features as the input to DNNs. In the past it was shown [20] that GMM log likelihoods can be effectively used as features for training a MLP phoneme recognizer. Our approach is based on using features derived from a GMM model and GMM-based adaptation techniques. In this work we applied MAP adaptation in feature space for DNN-HMM models. The proposed approach to processing features makes it possible to use other GMM-HMM adaptation algorithms in the DNN framework.

2997

14- 18 September 2014, Singapore

Figure 1: GMM-derived features for DNN-HMM.

The proposed approach can be considered as a non-linear feature-space transformation technique for adapting DNNHMMs. The rest of the paper is organized as follows. In Section 2, GMM-derived features for DNN-HMMs are described. Feature-space adaptation for DNN-HMMs trained on the proposed features is presented in Section 3. Section 4 describes the experimental results of adaptation and feature combination. Finally, conclusions and future developments are presented in Section 5.

If we eliminate the PCA dimensionality reduction step in the process of feature extraction described above, we obtain 1727-dimensional (157·11) features. We will refer to these features as f-gmm-1727. In this scheme we also tried to use log-likelihoods of the features obtained from a Gaussian mixture model, as it is described in [20], but we do not present these results because the performance of features without logarithms was better.

2. GMM-derived features for DNN-HMM

Speaker adaptation of a speaker independent (SI) DNN-HMM model built on f-gmm-d features (here d denotes the dimension of the features) is performed through the adaptation of the auxiliary SI GMM-HMM monophone model, which was used for calculating GMM-derived features. In this work we use the traditional MAP adaptation algorithm for adapting the SI GMM-HMM model [1][18]. Let m denote an index of a Gaussian in SI acoustic model, and μm the mean of this Gaussian. Then the MAP estimation of the mean vector is  τ μm + t γm (t)ot  , (3) μ m = τ + t γm (t)

3. Adaptation

Construction of GMM-derived features for DNNs is motivated by two factors. Firstly, in the past it was shown [20][23] that NN and GMM models may be complementary and benefit from their combination. Secondly, this type of features makes it possible to use various GMM-HMM adaptation algorithms in the DNN framework. Our features are obtained as follows (shown in Figure 1). First, the speech signal is processed in blocks of 16 ms with a shift of 10 ms in order to extract 39-dimensional Mel-frequency cepstral coefficients (MFCC) for every frame. After that, cepstral mean normalization is applied. Then an auxiliary HMMGMM monophone model is used to transform cepstral feature vectors into likelihoods vectors. In this auxiliary HMM-GMM, each phoneme is modeled using a three state left-right context-independent HMM, with 30 Gaussians per state. The silence model is a 1-state HMM. For a given acoustic MFCC-feature vector, a new GMM-derived feature vector is obtained by calculating likelihoods across all the states of the auxiliary HMM-GMM monophone model on the given vector. Suppose ot is the acoustic feature at the time t, then new GMM-derived feature vector ft is calculated as follows: ft = [p1t , . . . , pn t ],

where τ is the parameter that controls the balance between the maximum likelihood estimate of the mean and its prior value; γm (t) is the posterior probability of Gaussian component m at time t. The proposed adaptation scheme for a DNN model is shown in Figure 2. First, MAP adaptation of an auxiliary SI GMM-HMM monophone model is performed and a new speaker-adapted (SA) GMM-HMM model is created. Second, at the recognition stage, GMM-derived features are calculated using this SA GMM-HMM. The proposed approach can be considered a feature space transformation technique with respect to DNN-HMMs trained on GMM-derived features.

(1)

4. Experimental results

where n is the number of states in the auxiliary HMM-GMM monophone model, pit = P (ot | st = i)

4.1. Baseline system The experiments are conducted on a microphone corpus (16 kHz) of spontaneous and read speech. The training set contains approximately 27 hours of speech data from a Russian speech corpus (collected at STC). The corpus constists of recordings from 203 speakers. We use 11×MFCC features (13-dimensional MFCC spliced across 11 frames (±5)) as baseline features and compare them to f-gmm-550 and f-gmm-1727 features. The three SI-DNN models corresponding to these three types of features (11×MFCC, f-gmm-550 and f-gmm-1727) have identical topology (except for the dimension of the input

(2)

is the likelihood estimated using the GMM-HMM. Here st denotes the state index at the time t. In our case n is equal to 157 and this procedure leads to a 157-dimension feature vector per speech frame. We reduce the dimensionality of GMM-derived features from 157 to 50 using PCA. After that, the features are spliced in time taking a context size of 11 frames (i.e., ±5). These features are used as the input for the DNN. We will refer to these resulting features as f-gmm-550 features.

2998

Figure 2: The adaptation scheme.

layer) and are trained on the same training corpus. An auxiliary SI monophone GMM was also trained on the same data. The SI CD-DNN-HMM systems used 1000-neuron hidden layers, and a 2008-neuron output layer. 2008 neurons in the output layer correspond to context-dependent states determined by tree-based clustering in the CD-GMM-HMM system. Rectifier nonlinearities [21][22] are used in the hidden layers. The DNN system was trained using the frame-level cross entropy criterion and the senone alignment generated from the GMM system. The output layer is a soft-max layer. DNNs are trained without layer-by-layer pre-training, and hidden dropout factor (HDF = 0.2) is applied during the training as a regularization [21]. In order to explore trends in the behavior of SI and adapted models trained on the proposed GMM-derived features for DNNs of different depths, two DNN-HMMs are trained for each experiment, with 2 and 4 hidden layers. An 11k vocabulary was used in evaluating both SI and adapted models. No language model is used. All described experiments are conducted on the same test set, which consists of 2395 utterances. The more detailed description of this test set is given in the next section. The performance results in terms of Word Error Rate (WER) for SI DNN-HMM models are presented in Table 1. We can see that DNN-HMMs trained on 11×MFCC perform better than DNN-HMMs trained either on f-gmm-550 or f-gmm-1727 features. In Section 4.3, we explore this issue in more detail and propose ways to improve the quality of SI models.

cept Section 4.2.2) is 5 minutes for each speaker, and the rest of the data (11-12 minutes for a speaker) is used for testing in all experiments. The test set consists of 2395 utterances. The total number of tested words is approximately 19K. The parameter τ in the MAP adaptation formula (3) is set to 5. Adaptation experiments are carried out in the supervised mode. 4.2.1. Adaptation performance for different DNN-HMMs The adaptation results for four DNN-HMMs are presented in Table 2. We can see that adaptation improves SI models by approximately 14-15% absolute, corresponding to 34-36% relative WER reduction (WERR). Table 2: Adaptation performance for different DNN-HMMs. Type of Features f-gmm-1727 f-gmm-550

11×MFCC f-gmm-1727 f-gmm-550

# HL 2 4 2 4 2 4

WER, % (Adaptation) 26.4 26.9 26.8 27.1

Absolute WERR, % 14.3 13.9 15.2 14.5

4.2.2. Adaptation performance depending on the amount of the adaptation data In this experiment we measure the performance of the proposed adaptation method using different amounts of adaptation data. Adaptation sets of different size, from 5 seconds to 5 minutes (for a speaker), are used to adapt a SI 4-hidden-layer DNNHMM trained on f-gmm-1727 features. The results are shown in Figure 3. We can see that even with 5 seconds of adaptation data, we can achieve a performance gain of about 2% absolute (5% relative) WERR. After using 300 seconds of adaptation data, the gain from the adaptation increases up to approximately 14% absolute (34% relative) WERR.

Table 1: Baseline WER results for SI-DNN models (with 2 and 4 hidden layers (#HL)) for three types of features. Type of Features

# HL 2 4 2 4

WER, % 39.7 38.0 40.8 40.8 42.0 41.6

4.3. Feature combination As seen from the results described above (Table 1), DNNHMMs trained on 11×MFCC perform better than DNN-HMMs trained on either f-gmm-550 or f-gmm-1727 features. For the SI 2-hidden-layer DNN, the absolute difference in WER is 2.3% for f-gmm-550 and 1.1% for f-gmm-1727 as compared to 11×MFCC features. As the number of layers in DNNs increases from 2 to 4, the differences in quality between the SI DNN trained on 11×MFCC features and SI DNNs trained on

4.2. Adaptation for DNN The adaptation experiments were conducted on data from 20 speakers (10 male and 10 female speakers) excluded from the training set. For each speaker we have approximately 180 utterances, corresponding to 16-17 minutes of speech data. The amount of the adaptation data (in all experiments ex-

2999

ery frame. The posterior combination results are summarized in Table 3. Combining scores from DNNs trained on 11×MFCC and f-gmm-550 features achieves 1.6% and 0.8% absolute WER improvement for SI DNNs as compared to DNNs trained on 11×MFCC features for DNNs with 2 and 4 hidden layers respectively. Adaptation for posterior combination gives about 30-32% relative WERR compared to SI DNNs trained on 11×MFCC features. Thus we see that posterior combination of the two SI-DNNHMMs gives better results than the feature concatenation approach. Moreover, posterior combination helps to improve the quality of the baseline SI-system which is used for adaptation. This type of multi-stream combination give an improvement of about 2-4% relative WERR over the single best stream. This result can probably be further improved by using other multistream combination techniques.

42 40

38.8 38.8

WER, %

38

36.8 36.8

36

SI Adapted

35.9 35.9 33.9 33.9

34

31.7 31.7

32 30

28.1 28.1 26.9 26.9

28 26 5

10 15 30 60 180 300 Amount of adaptation data, sec

5. Conclusions

Figure 3: Adaptation performance depending on the amount of the adaptation data for the DNN trained on f-gmm-1727 features.

In this paper we propose a novel feature-space transformation method for the adaptation of DNN-HMM models. The proposed method is based on a special method of feature processing for DNNs. Input features for DNN training are generated from the likelihoods of the GMM model. The described method of feature processing is effective for training DNNHMMs when used in combination with other conventional features, such as 11×MFCC. According to the experiments, the best performance is obtained by the posterior combination technique which gives approximately 2-4% relative WERR compared to the the single best stream (the baseline DNN trained on 11×MFCC). The main advantage of these GMM-derived features is the possibility of performing the adaptation of a DNN-HMM model through the adaptation of the auxiliary GMM-HMM. We investigate the performance of MAP adaptation for this scheme. Experiments demonstrate that MAP adaptation is very effective and, for single-stream systems, gives, on average, approximately 35-36% relative WERR for a 5 minute adaptation sample and 5% relative WERR for a 5 second adaptation sample, as compared to a SI-DNN model. If we compare adaptation results in a multi-stream recognition system, we see that the performance improvement from the adaptation for a 5 minute adaptation set is 11% absolute (29% relative) WERR as compared to the SI system. It is worth noting that in the proposed scheme, other methods for the adaptation of the auxiliary GMM-HMM can be used instead of MAP adaptation. Thus, this approach opens new perspectives in the adaptation of DNN-HMMs, since it allows us to use approaches developed for GMM-HMM [18], as well as speaker adaptive training, in DNN-HMM adaptation.

f-gmm-550 and f-gmm-1727 features increase up to 3.6% and 2.8% respectively. The purpose of this set of experiments was to find out whether it is possible to improve the performance of SI systems by applying different multi-stream combination techniques, so-called late and early integration [20]. 4.3.1. Feature concatenation In this experiment, 11×MFCC and f-gmm-550 features are concatenated, resulting in 693-dimension feature vectors, and new DNN SI models are trained. The results with feature concatenation are presented in Table 3. We can see that this type of features helps to reduce WER for the SI model compared to the baseline SI model for f-gmm-550. In addition, the adaptation performance with feature concatenation is slightly better than with f-gmm-550 for the 4-hidden-layer DNN, while for the 2hidden-layer DNN it remains unchanged. However, the performance of feature concatenation without adaptation is still worse than 11×MFCC, especially for the deeper network. Table 3: Feature concatenation and posterior combination results for 11×MFCC and f-gmm-550 features. Acoustic models Type of Features # HL 2 Feature concatenation 4 2 Posterior combination 4

SI 39.8 39.4 38.1 37.2

WER, % Adaptation 26.8 26.3 27.1 26.6

6. Acknowledgements This work was partially financially supported by the Government of the Russian Federation, Grant 074-U01.

4.3.2. Posterior combination

7. References

Finally, we perform state-level system combination experiments. The outputs of the two classifiers can be combined using various multi-stream combination techniques [24] such as sum (or linear combination [6]), product, maximum, minimum, etc. In this work results are shown for the mean of outputs of two DNNs, as this combination gave the best result in the preliminary experiments. System combination is performed at the state-level for ev-

3000

[1]

Gauvain J. L. and Lee C. H., “Maximum a posteriori estimation for multivariate Gaussian mixture observations of Markov chains”, IEEE Trans. Speech and Audio Proc., 2.2: 291-298, 1994.

[2]

Gales M. J. F., “Maximum likelihood linear transformations for HMM-based speech recognition”, Computer speech and language, 12.2: 75-98, 1998.

[3]

Lee L. and Rose R. C., “Speaker normalization using efficient frequency warping procedures”, Acoustics, Speech, and Signal Processing, ICASSP-96. Conference Proceedings, Vol. 1: 353-356, IEEE, 1996.

[4]

Rath, S. P., Povey, D., Vesely, K., and Cernocky, J., “Improved feature processing for deep neural networks”, Proc. of Interspeech, 2013.

[5]

Seide, F., Li, G., Chen, X., and Yu, D., “Feature engineering in context-dependent deep neural networks for conversational speech transcription”, In Automatic Speech Recognition and Understanding (ASRU), 2011 IEEE Workshop on. IEEE: 24-29, 2011.

[6]

Lei, X., Lin, H., and Heigold, G., “Deep neural networks with auxiliary Gaussian mixture models for real-time speech recognition”, Acoustics, Speech and Signal Processing (ICASSP), 2013 IEEE International Conference on. IEEE, 7634-7638, 2013.

[7]

Gemello, R., Mana, F., Scanzio, S., Laface, P., and De Mori, R., “Adaptation of hybrid ANN/HMM models using linear hidden transformations and conservative training”, Acoustics, Speech and Signal Processing, ICASSP 2006 Proceedings. 2006 IEEE International Conference on. Vol. 1. IEEE, 2006.

[8]

Neto, J., Almeida, L., Hochberg, M., Martins, C., Nunes, L., Renals, S., and Robinson, T., “Speaker-adaptation for hybrid HMMANN continuous speech recognition system”, 1995.

[9]

Li, B., and Sim, K. C., “Comparison of discriminative input and output transformations for speaker adaptation in the hybrid NN/HMM systems”, INTERSPEECH, 526-529, 2010.

[20] Pinto, J., and Hermansky, H., “Combining Evidence from a Generative and a Discriminative Model in Phoneme Recognition”, Proceedings of Interspeech, 2414-2417, 2008. [21] “Dahl, G. E., Sainath, T. N., and Hinton, G. E., Improving deep neural networks for LVCSR using rectified linear units and dropout”, Acoustics, Speech and Signal Processing (ICASSP), 2013 IEEE International Conference, 8609-8613, 2013. [22] Zeiler, M. D., Ranzato, M., Monga, R., Mao, M., Yang, K., Le, Q. V., ... and Hinton, G. E., “On rectified linear units for speech processing”, Acoustics, Speech and Signal Processing (ICASSP), 2013 IEEE International Conference, 3517-3521, 2013. [23] Ellis, D. P., and Reyes-Gomez, M., “Investigations into tandem acoustic modeling for the Aurora task”, Eurospeech 2001: Scandinavia: 7th European Conference on Speech Communication and Technology. Aalborg Congress and Culture Centre, AalborgDenmark: proceedings. 189-192. ISCA-Secretariat, 2001. [24] Kittler, J., Hatef, M., Duin, R. P., and Matas, J., “On combining classifiers”, Pattern Analysis and Machine Intelligence, IEEE Transactions on, 20(3): 226-239, 1998.

[10] Trmal, J., Zelinka, J., and Muller, L., “Adaptation of a feedforward artificial neural network using a linear transform”, Text, Speech and Dialogue, 423-430, Springer Berlin Heidelberg, 2010. [11] Yao, K., Yu, D., Seide, F., Su, H., Deng, L., and Gong, Y., “Adaptation of context-dependent deep neural networks for automatic speech recognition”, Spoken Language Technology Workshop (SLT), 2012 IEEE, 366-369, 2012. [12] Stadermann, J., and Rigoll, G., “Two-stage speaker adaptation of hybrid tied-posterior acoustic models”, Proc. IEEE Intl. Conf. on Acoustic, Speech and Signal Processing, 2005. [13] Dupont, S., and Cheboub, L. “Fast speaker adaptation of artificial neural networks for automatic speech recognition”, Acoustics, Speech, and Signal Processing, 2000. ICASSP’00. Proceedings. 2000 IEEE International Conference on. Vol. 3: 1795-1798 IEEE, 2000. [14] Liao, H., “Speaker adaptation of context dependent deep neural networks”, Acoustics, Speech and Signal Processing (ICASSP), 2013 IEEE International Conference on. IEEE, 7947-7951, 2013. [15] Yu, D., Yao, K., Su, H., Li, G., and Seide, F., “KL-divergence regularized deep neural network adaptation for improved large vocabulary speech recognition”, Acoustics, Speech and Signal Processing (ICASSP), 2013 IEEE International Conference on. IEEE, 7893-7897, 2013. [16] Albesano, D., Gemello, R., Laface, P., Mana, F., and Scanzio, S., “Adaptation of artificial neural networks avoiding catastrophic forgetting”, Neural Networks, IJCNN’06. International Joint Conference on. IEEE, 1554-1561, 2006. [17] Abdel-Hamid, O., and Jiang, H., “Fast speaker adaptation of hybrid nn/hmm model for speech recognition based on discriminative learning of speaker code”, Acoustics, Speech and Signal Processing (ICASSP), 2013 IEEE International Conference on. IEEE, 7942-7946, 2013. [18] Woodland, P. C., “Speaker adaptation for continuous density HMMs: A review”, ISCA Tutorial and Research Workshop (ITRW) on Adaptation Methods for Speech Recognition, 2001. [19] Hinton, G., Deng, L., Yu, D., Dahl, G. E., Mohamed, A. R., Jaitly, N., ... and Kingsbury, B., “Deep neural networks for acoustic modeling in speech recognition: The shared views of four research groups”, Signal Processing Magazine, IEEE, 29(6), 82-97, 2012.

3001