INTERSPEECH 2015
GMM-derived features for effective unsupervised adaptation of deep neural network acoustic models

Natalia Tomashenko (1,2) and Yuri Khokhlov (3)

(1) Speech Technology Center, Saint-Petersburg, Russia
(2) ITMO University, Saint-Petersburg, Russia
(3) STC-innovations Ltd, Saint-Petersburg, Russia

[email protected], [email protected]
Abstract

In this paper we investigate GMM-derived features recently introduced for adaptation of context-dependent deep neural network HMM (CD-DNN-HMM) acoustic models. We improve the previously proposed adaptation algorithm by applying the concept of speaker adaptive training (SAT) to DNNs built on GMM-derived features and by using fMLLR-adapted features for training an auxiliary GMM model. Traditional adaptation algorithms, such as maximum a posteriori adaptation (MAP) and feature-space maximum likelihood linear regression (fMLLR), are performed for the auxiliary GMM model used in the SAT procedure for a DNN. Experimental results on the Wall Street Journal (WSJ0) corpus show that the proposed adaptation technique can provide, on average, a 17–28% relative word error rate (WER) reduction on different adaptation sets in an unsupervised adaptation setup, compared to speaker-independent (SI) DNN-HMM systems built on conventional features. We found that fMLLR adaptation for the SAT DNN trained on GMM-derived features outperforms fMLLR adaptation for the SAT DNN trained on conventional features by up to 14% relative WER reduction.

Index Terms: speaker adaptation, deep neural networks (DNN), MAP, fMLLR, CD-DNN-HMM, GMM-derived (GMMD) features, speaker adaptive training (SAT)

1. Introduction

Nowadays, due to the recent success of DNNs [1], artificial neural networks (ANNs) are becoming increasingly popular in state-of-the-art recognition systems. It has been shown [1] that DNN hidden Markov models (HMMs) outperform conventional Gaussian mixture model (GMM) HMMs in many automatic speech recognition (ASR) tasks. However, many adaptation algorithms that have been developed for GMM-HMM systems [2, 3, 4] cannot be easily applied to DNNs because of the different nature of these models. A GMM is a generative model: it is fitted so that the likelihood of the training data given the model is maximized. In contrast, a DNN is a discriminative model whose parameters are trained to minimize the classification error. Since DNN parameter estimation uses discriminative criteria, it is more sensitive to segmentation errors and can be less reliable for unsupervised adaptation. Many new adaptation methods have recently been developed for DNNs [5-35], and a few of them [5, 6, 7, 8, 9] take advantage of the robust adaptability of GMMs. However, there is no universal method for efficiently transferring all adaptation algorithms from the GMM framework to DNN models. The purpose of the present work is to make a step in that direction using GMM-derived features for training DNN models.

Some feature-space adaptation methods from the GMM-HMM framework, such as vocal tract length normalization (VTLN) [36] and fMLLR [2], have been successfully applied to DNN-HMM systems [5, 6, 7]. Most of the other methods can be classified into several types: (1) linear transformations, (2) regularization techniques, (3) auxiliary features, and (4) combinations of GMM and DNN models.

Linear transformation is one of the most popular approaches to speaker adaptation with ANNs. It can be applied at different levels of the ANN-HMM system: to the input features, as in linear input network transformation (LIN) [10, 11, 12, 13] or feature-space discriminative linear regression (fDLR) [6, 14]; to the activations of hidden layers, as in linear hidden network transformation (LHN) [10, 11]; or to the softmax layer, as in LON [12] or in output-feature discriminative linear regression [14]. Wherever the transformation is applied, its weights are initialized with an identity matrix and then trained by minimizing the error at the output of the ANN system while keeping the weights of the original ANN fixed. Adaptation of hybrid tied-posterior acoustic models [15] can also be considered a linear transform of posteriors. The authors of [16] describe a method based on linear transformation in the feature space and principal component analysis (PCA).

The second type of adaptation consists of retraining the entire network, or only a part of it, using special regularization techniques to improve generalization, such as L2-prior regularization [17], Kullback-Leibler divergence regularization [18], and conservative training [19]. In [15] only a subset of the hidden units with maximum variance (computed on the adaptation data) is retrained. The number of speaker-specific parameters is reduced in [20] through factorization based on singular value decomposition (SVD). Regularized adaptive training of subsets of DNN parameters is explored in [21].

Using auxiliary features is another approach, in which the acoustic feature vectors are augmented with additional speaker-specific or channel-specific features computed for each speaker or utterance at both training and test stages. An example of effective auxiliary features is i-vectors [22, 23, 24, 25, 37]; it has been shown that they can be complementary to fMLLR. Alternative methods are adaptation with speaker codes [26] and factorized adaptation [27], which takes into account the underlying factors that contribute to the distorted speech signal.

The most common way of combining GMM and DNN models for adaptation is using GMM-adapted features, for example fMLLR, as input for DNN training [5, 6]. In [7] likelihood scores from DNN and GMM models, both adapted in the
feature space using the same fMLLR transform, are combined at the state level during decoding. The authors of [9] propose combining the GMM and DNN models using the temporally varying weight regression (TVWR) framework. In the speaker-dependent bottleneck (BN) layer training method described in [28], speaker-normalized BN features are derived and are further used for training GMM-HMM systems. In tandem systems, DNN-derived features are also used for training GMMs [38].

Another model-based adaptation approach is learning speaker-specific hidden unit contributions (LHUC) [29], where an amplitude parameter is introduced for each hidden unit, tied on a per-speaker basis, and then estimated using adaptation data. In [30] the shape of the activation function is changed to better fit speaker-specific characteristics.

In this work we present a novel approach to SAT of DNNs based on using GMM-derived features as the input to DNNs. In the past it was shown [39] that GMM log-likelihoods can be effectively used as features for training an MLP phoneme recognizer. Our approach is based on using features derived from a GMM model [8] and GMM-based adaptation techniques. In this paper we investigate the concept of SAT for DNNs and apply the MAP and fMLLR adaptation algorithms in the feature space for DNN-HMM models. The proposed approach to processing features allows other GMM-HMM adaptation algorithms to be used in the DNN framework as well.

The rest of the paper is organized as follows. In Section 2, SAT for DNN-HMMs based on GMM-derived features is introduced. Section 3 describes the experimental results of SAT with the MAP and fMLLR adaptation algorithms. Finally, conclusions are presented in Section 4.
2. SAT for DNN-HMM based on GMM-derived features

Construction of GMM-derived features for adapting DNNs was proposed in [8], where it was demonstrated, using MAP adaptation as an example, that this type of features makes it possible to effectively use GMM-HMM adaptation algorithms in the DNN framework. In this work we improve the previously proposed scheme for GMM-derived feature extraction and apply the concept of SAT to DNNs trained on GMM-derived features. Also, we investigate using fMLLR, another conventional adaptation algorithm, in the proposed approach. Finally, we explore the combination of the MAP and fMLLR techniques for SAT with GMM-derived features.

Our features are obtained as follows (see Figure 1). First, 39-dimensional Mel-frequency cepstral coefficients (MFCC) with delta and acceleration coefficients are extracted with per-speaker cepstral mean normalization (CMN). Then an auxiliary GMM monophone model is used to transform cepstral feature vectors into log-likelihood vectors. At this step, speaker adaptation of the auxiliary SI GMM-HMM monophone model is performed for each speaker in the training corpus, and a new speaker-adapted (SA) GMM-HMM model is created in order to obtain SA GMM-derived features. In the auxiliary GMM, each phoneme is modeled using a three-state left-right context-independent GMM with 30 Gaussians per state. The silence model is a 1-state GMM with 70 Gaussians. For a given acoustic MFCC feature vector, a new GMM-derived feature vector is obtained by calculating log-likelihoods across all the states of the auxiliary GMM monophone model on the given vector.

Figure 1: Using speaker-adapted GMM-derived features for SAT DNN-HMM training. (Pipeline: input sound -> MFCC feature extraction -> cepstral mean normalization -> speaker adaptation of the auxiliary GMM -> GMM-derived feature extraction -> context extension ×11 (±5) -> SAT-DNN training.)

Suppose $o_t$ is the acoustic feature vector at time $t$; then the new GMM-derived feature vector $f_t$ is calculated as follows:

$$f_t = [p_t^1, \ldots, p_t^n], \quad (1)$$

where $n$ is the number of states in the auxiliary GMM-HMM monophone model, and
$$p_t^i = \log P(o_t \mid s_t = i) \quad (2)$$
is the log-likelihood estimated using the GMM-HMM. Here $s_t$ denotes the state index at time $t$. In our case $n$ is equal to 118, and this procedure yields a 118-dimensional feature vector per speech frame. After that, the features are spliced in time with a context size of 11 frames (i.e., ±5). We will refer to these resulting features as 11×118GMMD (or, briefly, GMMD) features. The dimension of the resulting features is 1298 (11×118). These features are used as the input for training the DNN. The proposed approach can be considered a feature-space transformation technique with respect to DNN-HMMs trained on GMM-derived features.
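To make this construction concrete, here is a minimal NumPy sketch of Eqs. (1) and (2) and the ±5 context extension (our own illustration, not the code used in the experiments; the auxiliary model is assumed to use diagonal covariances, and all function and variable names are hypothetical):

```python
import numpy as np

def state_log_likelihood(frames, weights, means, variances):
    """Log-likelihood log P(o_t | s = i) of each frame under one GMM
    state with diagonal covariances.
    frames: (T, D); weights: (M,); means, variances: (M, D)."""
    diff = frames[:, None, :] - means[None, :, :]                  # (T, M, D)
    log_norm = -0.5 * np.log(2.0 * np.pi * variances).sum(axis=1)  # (M,)
    log_comp = (np.log(weights) + log_norm
                - 0.5 * (diff ** 2 / variances).sum(axis=2))       # (T, M)
    # log sum_m w_m N(o_t; mu_m, sigma_m) via log-sum-exp for stability.
    m = log_comp.max(axis=1, keepdims=True)
    return (m + np.log(np.exp(log_comp - m).sum(axis=1, keepdims=True)))[:, 0]

def gmmd_features(frames, states, context=5):
    """GMMD feature matrix for one utterance.
    frames: (T, 39) MFCC + deltas after CMN; states: per-state
    (weights, means, variances) tuples of the auxiliary monophone GMM."""
    # Eq. (1): stack the n per-state log-likelihoods -> (T, n), n = 118.
    ll = np.stack([state_log_likelihood(frames, *s) for s in states], axis=1)
    # Context extension: splice +/-5 frames -> (T, 11 * n) = (T, 1298).
    T = ll.shape[0]
    padded = np.pad(ll, ((context, context), (0, 0)), mode="edge")
    return np.concatenate(
        [padded[i:i + T] for i in range(2 * context + 1)], axis=1)
```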
3. Experimental results

3.1. Baseline system

The experiments were conducted on the WSJ0 corpus [40]. For acoustic model training we used 7138 utterances from 83 speakers in the standard SI-84 training set, corresponding to approximately 15 hours of data (13 hours of speech and 2 hours of silence), recorded with a Sennheiser microphone at 16 kHz. The phoneme set consisted of 39 phonemes and 1 silence model. We use conventional 11×39MFCC features (39-dimensional MFCCs (with CMN) spliced across 11 frames
(±5)) as baseline features and compare them to the proposed GMM-derived features. The two SI DNN models corresponding to these two types of features, 11×39MFCC and 11×118GMMD, have identical topology (except for the dimension of the input layer) and are trained on the same training dataset. An auxiliary monophone GMM was also trained on the same data. For training the SI DNN on GMMD features, we applied the scheme shown in Figure 1, but eliminated the speaker adaptation step.

The SI CD-DNN-HMM systems used four 1000-neuron hidden layers and an approximately 2500-neuron output layer. The neurons in the output layer correspond to context-dependent states determined by tree-based clustering in the CD-GMM-HMM. Rectifier nonlinearities [41, 42] are used in the hidden layers. The DNN system was trained using the frame-level cross-entropy criterion and the senone alignment generated from the GMM system. The output layer is a softmax layer. The DNNs are trained without layer-by-layer pre-training, and a hidden dropout factor (HDF = 0.2) is applied during training as a regularization [41].

In all experiments we consider the SI DNN trained on 11×39MFCC features as the baseline model and compare the performance results of the other models with it.
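Purely as an illustration of this topology, the following NumPy sketch shows the corresponding forward pass (a minimal sketch under our own naming; weight initialization is arbitrary, and the backpropagation training on the cross-entropy criterion is omitted):

```python
import numpy as np

rng = np.random.default_rng(0)

def relu(x):
    return np.maximum(0.0, x)

def softmax(x):
    e = np.exp(x - x.max(axis=1, keepdims=True))
    return e / e.sum(axis=1, keepdims=True)

# Section 3.1 topology: 1298-dim GMMD input (429 for 11x39MFCC),
# four 1000-unit rectifier hidden layers, ~2500 senone outputs.
sizes = [1298, 1000, 1000, 1000, 1000, 2500]
params = [(rng.normal(0.0, 0.01, size=(n_in, n_out)), np.zeros(n_out))
          for n_in, n_out in zip(sizes[:-1], sizes[1:])]

def forward(x, train=False, hdf=0.2):
    """Senone posteriors for a batch of spliced feature vectors x: (B, 1298).
    During training a hidden dropout factor (HDF) of 0.2 is applied."""
    for W, b in params[:-1]:
        x = relu(x @ W + b)
        if train:
            # Inverted dropout: drop each hidden unit with probability hdf,
            # rescaling so expected activations match decoding time.
            x *= (rng.random(x.shape) >= hdf) / (1.0 - hdf)
    W, b = params[-1]
    return softmax(x @ W + b)
```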
Table 1: Summary of WER (%) results on the WSJ0 evaluation set si_et_20. ΔWER = relative WERR.

Type of features   Adaptation (SAT)   WER, %   ΔWER, %
11×39MFCC          SI                 9.55     baseline
11×39MFCC          fMLLR              8.03     16.0
GMMD               SI                 9.28     2.8
GMMD               MAP                7.60     20.4
GMMD               fMLLR              7.85     17.8
GMMD               fMLLR+MAP          7.83     18.0
Table 2: Summary of WER (%) results on the WSJ0 evaluation set si_et_05. ΔWER = relative WERR.

Type of features   Adaptation (SAT)   WER, %   ΔWER, %
11×39MFCC          SI                 3.23     baseline
11×39MFCC          fMLLR              2.76     14.5
GMMD               SI                 3.19     1.2
GMMD               MAP                2.69     16.8
GMMD               fMLLR              2.52     22.0
GMMD               fMLLR+MAP          2.78     13.9
3.2. Test data

Evaluation was carried out on the two standard WSJ0 evaluation tests: (1) si_et_05 and (2) si_et_20. Test si_et_05 is the November 1992 NIST evaluation set with 330 read utterances (5353 words) from 8 speakers (about 40 utterances per speaker). A standard WSJ trigram closed NVP (non-verbalized punctuation) language model (LM) with a 5k-word vocabulary was used during recognition. Test si_et_20 consists of 333 read utterances (5645 words) from 8 speakers. A WSJ trigram open NVP LM with a 20k-word vocabulary was used during recognition. The OOV rate was about 1.5%. Both LMs were pruned as in the Kaldi [43] WSJ recipe with a threshold of 10^-7. Unless explicitly stated otherwise, the adaptation experiments were conducted in an unsupervised mode on the test data, using transcripts obtained from the first decoding pass.
3.3. SAT for DNNs

We trained three SAT DNNs on GMMD features, as shown in Figure 1. For adapting the auxiliary GMM model we used two different adaptation algorithms: MAP [3] and fMLLR [2]. In addition, we trained a SAT DNN on GMMD features in the "fMLLR+MAP" configuration, where MAP adaptation of the auxiliary GMM model was performed after fMLLR adaptation. For comparison purposes we also trained a DNN on conventional 11×39MFCC features with fMLLR. All SAT DNNs had similar topology and were trained as described in Section 3.1 for the SI models. We used GMM-HMM models trained on BN features [44] to obtain the state tying and alignment for training the SAT-DNN models, because we noticed that these BN features provided more accurate alignment for SAT-DNN models built on GMM-derived features. Training of the GMM-HMMs and fMLLR adaptation were carried out using the Kaldi speech recognition toolkit [43].
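As a concrete illustration of one step of this procedure, the sketch below shows a simplified MAP re-estimation of the Gaussian means of a single auxiliary-GMM state, in the spirit of [3] (mean update only, with a fixed relevance factor tau; the function names, the per-state frame alignment, and the reuse of gmmd_features from the earlier sketch are our own assumptions, not the actual Kaldi-based recipe):

```python
import numpy as np

def map_adapt_means(state, frames, tau=10.0):
    """MAP re-estimation of the Gaussian means of one GMM state,
    following [3] in simplified form: only the means are updated,
    weights and variances are kept from the SI model.
    state: (weights, means, variances) with shapes (M,), (M, D), (M, D);
    frames: (T, D) adaptation frames aligned to this state."""
    weights, means, variances = state
    # Component posteriors gamma_{tm} for each adaptation frame.
    diff = frames[:, None, :] - means[None, :, :]
    log_comp = (np.log(weights)
                - 0.5 * np.log(2.0 * np.pi * variances).sum(axis=1)
                - 0.5 * (diff ** 2 / variances).sum(axis=2))   # (T, M)
    gamma = np.exp(log_comp - log_comp.max(axis=1, keepdims=True))
    gamma /= gamma.sum(axis=1, keepdims=True)                  # (T, M)
    # Interpolate the SI means (prior) with the adaptation statistics.
    occ = gamma.sum(axis=0)                                    # (M,)
    first_order = gamma.T @ frames                             # (M, D)
    new_means = (tau * means + first_order) / (tau + occ)[:, None]
    return weights, new_means, variances

def sa_gmmd_features_for_speaker(si_states, frames_per_state, all_frames):
    """Speaker-adapted GMMD features: MAP-adapt every state of the SI
    auxiliary GMM on this speaker's aligned frames, then extract GMMD
    features with the adapted model (gmmd_features: earlier sketch)."""
    sa_states = [map_adapt_means(s, f)
                 for s, f in zip(si_states, frames_per_state)]
    return gmmd_features(all_frames, sa_states)
```

In SAT, this per-speaker adaptation is run for every speaker in the training corpus, and the DNN is trained on the pooled speaker-adapted features.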
3.3.1. Adaptation performance for different DNN-HMMs

The performance results in terms of WER for the SI and SAT DNN-HMM models are presented in Table 1 (si_et_20) and Table 2 (si_et_05). We can see that the SI DNN-HMM trained on GMMD features performs slightly better than the DNN-HMM trained on 11×39MFCC features.

The SAT DNN trained on GMMD features with MAP adaptation demonstrates the best result over all cases on the si_et_20 test. It outperforms the baseline SI DNN model trained on 11×39MFCC features with a 20.4% relative WER reduction (WERR). On the si_et_05 test, the SAT DNN trained on GMMD features with fMLLR adaptation performs better than the other models and gives a 22.0% relative WERR. Moreover, in all experiments the SAT models trained on GMMD features, with fMLLR as well as with MAP adaptation, perform better in terms of WER than the SAT model trained on 11×39MFCC features with fMLLR adaptation. From the last rows of Tables 1 and 2, which show the results of combining the MAP and fMLLR algorithms, we can conclude that this combination does not lead to additional improvement in performance, and for the si_et_05 test it even degrades the result.

The results presented in Tables 3 and 4 demonstrate adaptation performance for different speakers from the si_et_20 and si_et_05 datasets, respectively. The figures marked with * in the tables indicate the best performance over the three adapted acoustic models. Relative WERR (ΔWER) is given in comparison to the baseline model, the SI DNN built on 11×39MFCC. We can observe that the three models behave differently depending on the speaker.

3.3.2. Adaptation performance depending on the size of the adaptation dataset

In this experiment we measured the performance of the proposed adaptation method using different amounts of adaptation data. Adaptation sets of different sizes, from 15 to 210 seconds of speech data per speaker, were used to adapt the SAT DNN-HMM models trained on 11×39MFCC and GMMD features. The results are shown in Figure 2. Relative WERR is given with respect to the baseline SI DNN trained on 11×39MFCC.
Table 3: Speaker adaptation performance on the WSJ0 evaluation set si_et_20. ΔWER = relative WERR with respect to the SI 11×39MFCC baseline; * marks the best of the three adapted models.

Speaker   WER, %            ΔWER, %
          (SI, 11×39MFCC)   11×39MFCC+fMLLR   GMMD+fMLLR   GMMD+MAP
440       8.49              24.1*             22.4         6.9
441       16.02             24.3              26.2         36.4*
442       10.59             13.2              15.8*        14.5
443       9.94              11.3              14.1*        9.9
444       10.87             11.8              18.4         26.3*
445       7.25              13.0              20.4*        18.5
446       7.13              8.5               17.0*        12.8
447       6.59              16.0              0.0          26.0*
All       9.55              16.0              17.8         20.4*

Table 4: Speaker adaptation performance on the WSJ0 evaluation set si_et_05. ΔWER = relative WERR with respect to the SI 11×39MFCC baseline; * marks the best of the three adapted models.

Speaker   WER, %            ΔWER, %
          (SI, 11×39MFCC)   11×39MFCC+fMLLR   GMMD+fMLLR   GMMD+MAP
440       4.15              29.6              40.7*        40.7*
441       4.71              43.3              56.7*        16.7
442       3.32              0.0               8.3          12.5*
443       1.52              -10.0             0.0*         -20.0
444       2.43              33.3*             22.2         22.2
445       3.16              0.0               5.3          15.8*
446       1.89              -15.2             -7.6         23.0*
447       4.87              6.3               15.6*        6.3
All       3.23              14.5              22.0*        16.8

Figure 2: Adaptation performance on the si_et_05 test depending on the size of the adaptation dataset. (Plot: x-axis, amount of adaptation data in seconds, 15 to 400; y-axis, relative WER reduction, %; curves: 11x39MFCC+fMLLR, GMMD+fMLLR, GMMD+MAP; the 400-second GMMD+MAP point is annotated "using extra data".) Relative WERR is given with respect to the baseline SI DNN trained on 11×39MFCC.
We can see that, for adaptation sets of different sizes, the SAT DNN trained on GMMD features with fMLLR performs better than the SAT DNN trained on 11×39MFCC features with fMLLR. In contrast, MAP adaptation reaches the performance of fMLLR adaptation (for 11×39MFCC) only when all the adaptation data are used. However, the gain from MAP adaptation grows monotonically as the sample size increases, while fMLLR adaptation saturates on a small adaptation dataset. Note that in [8] we observed a similar trend for MAP adaptation. We conducted an additional experiment with MAP adaptation, marked in Figure 2 with the text "using extra data", in which we added the data from the WSJ0 si_et_ad dataset to the adaptation data. Hence, in this case we performed the adaptation in a semi-supervised mode: the transcriptions for the si_et_ad dataset were assumed to be known, while the transcriptions for si_et_05 were generated from the first decoding pass. The total duration of the adaptation data was approximately 380 seconds per speaker. This result (21.4% relative WERR) confirms the suggestion that the performance of MAP adaptation had not reached its maximum in the previous experiments.

4. Conclusions

In this work we have investigated GMM-derived features recently introduced for adaptation of DNN-HMM acoustic models. We improved the previously proposed adaptation algorithm by applying the concept of SAT to DNNs built on GMM-derived features and by using fMLLR-adapted features for training an auxiliary GMM model. Traditional adaptation algorithms, such as MAP and fMLLR, were performed for the auxiliary GMM model used in the SAT procedure for a DNN.

Experimental results on the WSJ0 corpus demonstrate that, in an unsupervised adaptation mode, the proposed adaptation technique can provide approximately a 17–20% relative WERR for MAP and an 18–28% relative WERR for fMLLR adaptation on different adaptation sets, compared to the SI DNN-HMM system built on conventional 11×39MFCC features. We found that fMLLR adaptation for the SAT DNN trained on GMM-derived features outperforms fMLLR adaptation for the SAT DNN trained on conventional features by up to 14% relative WER reduction. It has been shown that fMLLR adaptation for GMMD features is effective when a small amount of adaptation data is used, while MAP adaptation works better when more adaptation data are available.

It is worth noting that in the proposed scheme, any other method for adapting the auxiliary GMM can be used instead of MAP or fMLLR. Thus, this approach provides a general framework for transferring adaptation algorithms developed for GMM-HMMs to DNN adaptation.
5. Acknowledgements

This work was partially financially supported by the Government of the Russian Federation, Grant 074-U01, and by the Ministry of Education and Science of the Russian Federation, Contract 14.579.21.0057, ID RFMEFI57914X0057.
6. References

[1] Hinton, G., Deng, L., Yu, D., Dahl, G. E., Mohamed, A. R., Jaitly, N., ... and Kingsbury, B., "Deep neural networks for acoustic modeling in speech recognition: The shared views of four research groups", IEEE Signal Processing Magazine, 29(6): 82-97, 2012.
[2] Gales, M. J. F., "Maximum likelihood linear transformations for HMM-based speech recognition", Computer Speech and Language, 12(2): 75-98, 1998.
[3] Gauvain, J. L., and Lee, C. H., "Maximum a posteriori estimation for multivariate Gaussian mixture observations of Markov chains", IEEE Trans. Speech and Audio Processing, 2(2): 291-298, 1994.
[4] Woodland, P. C., "Speaker adaptation for continuous density HMMs: A review", ISCA Tutorial and Research Workshop (ITRW) on Adaptation Methods for Speech Recognition, 2001.
[5] Rath, S. P., Povey, D., Vesely, K., and Cernocky, J., "Improved feature processing for deep neural networks", Proc. of Interspeech, 2013.
[6] Seide, F., Li, G., Chen, X., and Yu, D., "Feature engineering in context-dependent deep neural networks for conversational speech transcription", Proc. of ASRU, 24-29, 2011.
[7] Lei, X., Lin, H., and Heigold, G., "Deep neural networks with auxiliary Gaussian mixture models for real-time speech recognition", Proc. of ICASSP, 7634-7638, 2013.
[8] Tomashenko, N., and Khokhlov, Y., "Speaker adaptation of context dependent deep neural networks based on MAP-adaptation and GMM-derived feature processing", Proc. of Interspeech, 2997-3001, 2014.
[9] Liu, S., and Sim, K. C., "On combining DNN and GMM with unsupervised speaker adaptation for robust automatic speech recognition", Proc. of ICASSP, 195-199, 2014.
[10] Gemello, R., Mana, F., Scanzio, S., Laface, P., and De Mori, R., "Adaptation of hybrid ANN/HMM models using linear hidden transformations and conservative training", Proc. of ICASSP, Vol. 1, 2006.
[11] Neto, J., Almeida, L., Hochberg, M., Martins, C., Nunes, L., Renals, S., and Robinson, T., "Speaker-adaptation for hybrid HMM-ANN continuous speech recognition system", 1995.
[12] Li, B., and Sim, K. C., "Comparison of discriminative input and output transformations for speaker adaptation in the hybrid NN/HMM systems", Proc. of Interspeech, 526-529, 2010.
[13] Trmal, J., Zelinka, J., and Muller, L., "Adaptation of a feedforward artificial neural network using a linear transform", Text, Speech and Dialogue, 423-430, Springer Berlin Heidelberg, 2010.
[14] Yao, K., Yu, D., Seide, F., Su, H., Deng, L., and Gong, Y., "Adaptation of context-dependent deep neural networks for automatic speech recognition", Proc. of SLT, 366-369, 2012.
[15] Stadermann, J., and Rigoll, G., "Two-stage speaker adaptation of hybrid tied-posterior acoustic models", Proc. of ICASSP, 2005.
[16] Dupont, S., and Cheboub, L., "Fast speaker adaptation of artificial neural networks for automatic speech recognition", Proc. of ICASSP, Vol. 3: 1795-1798, 2000.
[17] Liao, H., "Speaker adaptation of context dependent deep neural networks", Proc. of ICASSP, 7947-7951, 2013.
[18] Yu, D., Yao, K., Su, H., Li, G., and Seide, F., "KL-divergence regularized deep neural network adaptation for improved large vocabulary speech recognition", Proc. of ICASSP, 7893-7897, 2013.
[19] Albesano, D., Gemello, R., Laface, P., Mana, F., and Scanzio, S., "Adaptation of artificial neural networks avoiding catastrophic forgetting", Proc. of IJCNN, 1554-1561, 2006.
[20] Xue, J., Li, J., Yu, D., Seltzer, M., and Gong, Y., "Singular value decomposition based low-footprint speaker adaptation and personalization for deep neural network", Proc. of ICASSP, 6359-6363, 2014.
[21] Ochiai, T., Matsuda, S., Lu, X., Hori, C., and Katagiri, S., "Speaker adaptive training using deep neural networks", Proc. of ICASSP, 6349-6353, 2014.
[22] Karanasou, P., Wang, Y., Gales, M. J., and Woodland, P. C., "Adaptation of deep neural network acoustic models using factorised i-vectors", Proc. of Interspeech, 2180-2184, 2014.
[23] Gupta, V., Kenny, P., Ouellet, P., and Stafylakis, T., "I-vector-based speaker adaptation of deep neural networks for French broadcast audio transcription", Proc. of ICASSP, 6334-6338, 2014.
[24] Senior, A., and Lopez-Moreno, I., "Improving DNN speaker independence with i-vector inputs", Proc. of ICASSP, 225-229, 2014.
[25] Saon, G., Soltau, H., Nahamoo, D., and Picheny, M., "Speaker adaptation of neural network acoustic models using i-vectors", Proc. of ASRU, 55-59, 2013.
[26] Abdel-Hamid, O., and Jiang, H., "Fast speaker adaptation of hybrid NN/HMM model for speech recognition based on discriminative learning of speaker code", Proc. of ICASSP, 7942-7946, 2013.
[27] Li, J., Huang, J. T., and Gong, Y., "Factorized adaptation for deep neural network", Proc. of ICASSP, 5537-5541, 2014.
[28] Doddipatla, R., Hasan, M., and Hain, T., "Speaker dependent bottleneck layer training for speaker adaptation in automatic speech recognition", Proc. of Interspeech, 2199-2203, 2014.
[29] Swietojanski, P., and Renals, S., "Learning hidden unit contributions for unsupervised speaker adaptation of neural network acoustic models", Proc. of IEEE SLT, 2014.
[30] Siniscalchi, S. M., Li, J., and Lee, C. H., "Hermitian polynomial for speaker adaptation of connectionist speech recognition systems", IEEE Transactions on Audio, Speech, and Language Processing, 21(10): 2152-2161, 2013.
[31] Masato, M., and Kawahara, T., "Unsupervised speaker adaptation of DNN-HMM by selecting similar speakers for lecture transcription", Proc. of APSIPA, 2014.
[32] Xue, S., Abdel-Hamid, O., Jiang, H., Dai, L., and Liu, Q., "Fast adaptation of deep neural network based on discriminant codes for speech recognition", IEEE/ACM Transactions on Audio, Speech, and Language Processing (TASLP), 22(12): 1713-1725, 2014.
[33] Miao, Y., Zhang, H., and Metze, F., "Towards speaker adaptive training of deep neural network acoustic models", Proc. of Interspeech, 2189-2193, 2014.
[34] Huang, Z., Li, J., Siniscalchi, S. M., Chen, I. F., Weng, C., and Lee, C. H., "Feature space maximum a posteriori linear regression for adaptation of deep neural networks", Proc. of Interspeech, 2992-2996, 2014.
[35] Tang, Y., Mohan, A., Rose, R. C., and Ma, C., "Deep neural network trained with speaker representation for speaker normalization", Proc. of ICASSP, 6329-6333, 2014.
[36] Lee, L., and Rose, R. C., "Speaker normalization using efficient frequency warping procedures", Proc. of ICASSP, Vol. 1: 353-356, 1996.
[37] Novoselov, S., et al., "RBM-PLDA subsystem for the NIST i-Vector Challenge", Proc. of Interspeech, 378-382, 2014.
[38] Ellis, D. P., and Reyes-Gomez, M., "Investigations into tandem acoustic modeling for the Aurora task", Proc. of Eurospeech, 189-192, 2001.
[39] Pinto, J., and Hermansky, H., "Combining evidence from a generative and a discriminative model in phoneme recognition", Proc. of Interspeech, 2414-2417, 2008.
[40] Paul, D. B., and Baker, J. M., "The design for the Wall Street Journal-based CSR corpus", Proc. of the Workshop on Speech and Natural Language, Association for Computational Linguistics, 357-362, 1992.
[41] Dahl, G. E., Sainath, T. N., and Hinton, G. E., "Improving deep neural networks for LVCSR using rectified linear units and dropout", Proc. of ICASSP, 8609-8613, 2013.
[42] Zeiler, M. D., Ranzato, M., Monga, R., Mao, M., Yang, K., Le, Q. V., ... and Hinton, G. E., "On rectified linear units for speech processing", Proc. of ICASSP, 3517-3521, 2013.
[43] Povey, D., Ghoshal, A., Boulianne, G., Burget, L., Glembek, O., Goel, N., ... and Vesely, K., "The Kaldi speech recognition toolkit", Proc. of ASRU, 2011.
[44] Grezl, F., Karafiat, M., and Vesely, K., "Adaptation of multilingual stacked bottle-neck neural network structure for new language", Proc. of ICASSP, 7654-7658, 2014.