Hidden variability subspace learning for adaptation of deep neural networks
Sarith Fernando, Vidhyasaharan Sethu and Eliathamby Ambikairajah
This paper proposes a deep neural network (DNN) adaptation method, herein referred to as the hidden variability subspace (HVS) method, to achieve improved robustness under diverse acoustic environments arising from differences in conditions, e.g. speaker, channel, duration and environmental noise. In the proposed approach, a set of condition dependent parameters is estimated to adapt the hidden layer weights of the DNN in the hidden variability subspace and thereby reduce the condition mismatch. These condition dependent parameters are then connected to various layers through a new set of adaptively trained weights. We evaluate the proposed hidden variability learning method on a language identification task and show that significant performance gains can be obtained by discriminatively estimating a set of adaptation parameters to compensate for the mismatch in the trained model.
Introduction: Deep neural networks (DNNs) are widely used as front end feature extractors [1] and in a number of end-to-end speech processing systems [2]. Even though DNNs are often trained on large databases to learn variabilities in the input feature space, they remain vulnerable to mismatched conditions between training and testing data, which lead to significant performance degradations [3]. Speaker adaptation is commonly used in DNNs to reduce the mismatch arising from speaker variabilities between training and testing data [4]. Adaptation to acoustic conditions has also been shown to improve DNN based acoustic modelling performance [3]. Adaptation is therefore important, as it enables DNNs to achieve significant performance improvements [2]. Adaptation methods based on linear transformations append a condition dependent linear layer to the original model [5]. In subspace techniques, adaptation is accomplished by changing a subset of the DNN weights in the original model [6], which may help to avoid overfitting. Regularization based adaptation makes training and testing conditions more similar by introducing an additional error term in the training process [5]. Adaptive training approaches include cluster adaptive training (CAT) and well known feature normalization methods such as constrained maximum likelihood linear regression and vocal tract length normalization [7].
In this work, we propose the hidden variability subspace (HVS) technique, which performs adaptation to capture the variability in training and testing conditions without changing the learned base model. In HVS, condition dependent parameters are estimated and then connected to the DNN layers using newly introduced, adaptively trained weights. In addition to the standard acoustic features, the proposed method incorporates utterance information during training. This approach is known as utterance-aware training (UaT). UaT enables DNNs to exploit additional information about an utterance to alter the model parameters for utterance normalization. Firstly, we initialize the HVS during training using utterance information from a domain invariant low variability feature space [8]. After training, the outputs of this HVS can be assumed to be condition/utterance dependent. During testing, a test vector is transformed through the trained HVS to exploit the condition of an utterance. Since DNNs are trained discriminatively in an end-to-end fashion, the hidden layer outputs are generally assumed to be particularly vulnerable to channel and noise conditions when test utterances come from different conditions. In contrast, domain invariant low variability features are extracted in an unsupervised manner, which does not explicitly remove the effect of channel and noise conditions. Therefore, by using these features we believe that the condition of an utterance can be captured more effectively. The main intuition is to use these features to learn the HVS (M in Figure 1) and provide additional information on channel and noise conditions by transforming the hidden layer representations. In our method, we define a new subspace that learns a feature transformation based on the domain invariant low variability feature space.
The proposed method has some similarity to CAT, namely in sharing the idea of changing the original weights during training, but the two also differ in many ways. In HVS the weights are adjusted for each test utterance, while in CAT the parameters are formed by combining the weights from each class. The utterance dependent (UD) transformation matrix is also produced in a different way. Furthermore, comparing our approach to most other DNN adaptation methods [4, 6], the main differences lie in the structure used to represent the additional information and in the training procedure. In other methods [9], the speaker information is given as an additional feature to the DNN. In this paper, however, a parameterized base weight matrix is introduced to represent the utterance/condition variability space, and the utterance information acts as an interpolation vector to transform the original base model into an adapted model.
Proposed Hidden Variability Subspace: A conceptual diagram of our proposal is shown in Figure 1. A DNN comprises many hidden layers stacked on top of each other. The output h^l of any hidden DNN layer l is determined by the previous hidden layer output h^{l-1} as
\[ \boldsymbol{h}^{l} = \mathcal{H}\left(W^{l}\,\boldsymbol{h}^{l-1} + \boldsymbol{b}^{l}\right) \tag{1} \]
where W^l and b^l are the weight matrix and the bias vector respectively, and ℋ denotes the activation function. The proposed HVS adaptation applies a condition dependent transformation, via an utterance dependent (UD) matrix Q^l for layer l, to the existing weights W^l in equation (1) as follows:
\[ \boldsymbol{h}^{l} = \mathcal{H}\left(W^{l}Q^{l}\,\boldsymbol{h}^{l-1} + \boldsymbol{b}^{l}\right) \tag{2} \]
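For illustration, a minimal NumPy sketch of equations (1) and (2) is given below. The layer sizes, the sigmoid choice for ℋ and the identity initialisation of Q^l are illustrative assumptions, not details taken from this work.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

# Illustrative dimensions for layers l-1 and l (assumed, not from the paper).
d_prev, d_l = 64, 32
rng = np.random.default_rng(0)
W = rng.standard_normal((d_l, d_prev))    # base weight matrix W^l
b = np.zeros(d_l)                         # bias vector b^l
h_prev = rng.standard_normal(d_prev)      # previous hidden layer output h^{l-1}

# Equation (1): standard hidden layer.
h_base = sigmoid(W @ h_prev + b)

# Equation (2): utterance dependent matrix Q^l applied before the base weights.
Q = np.eye(d_prev)                        # identity = no adaptation
h_adapted = sigmoid(W @ Q @ h_prev + b)   # equals h_base when Q is the identity
```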
Estimating the full matrix Q^l introduces a vast number of UD parameters, so the utterance representation is reduced by constraining Q^l to be diagonal during training. We propose to learn the diagonal elements p̂^l of Q^l using a low-dimensional utterance representation as given below:
\[ \hat{\boldsymbol{p}}^{l} = \mathcal{H}\left(M^{l}\hat{\boldsymbol{\omega}} + \boldsymbol{\varphi}^{l}\right) \tag{3} \]
where ω̂ is a feature vector in a low dimensional space for a given utterance. For layer l, M^l is the defined subspace and φ^l is the residual, which is characterized by the UD training data. The p̂^l parameter values are estimated in both training and testing using the adaptation data, while ω̂ is extracted separately from the DNN training. Further, adding a nonlinear activation ℋ as in equation (3), specific to ω̂, enables more detailed information to be learned during training. In addition, it makes the learning of p̂^l more scalable with respect to the original DNN model. HVS adaptation produces a more stable Q^l when an additional constraint is applied to the diagonal elements, restricting them to the range [0, r], r ∈ ℝ. In this scheme, Q^l is learned while keeping the initial model weights W^l of layer l fixed. This makes Q^l act as a filter on the existing weights, enriching them with condition/utterance dependent information. Training complexity can be further reduced by using constrained layer adaptation: instead of learning different utterance representations for UD transformations at various levels in the DNN, only one representation of the UD components, taken from the last hidden layer, is kept and updated during the adaptation step. A code sketch of the estimation step in equation (3) is given below.
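The sketch below mirrors equation (3) and the diagonal constraint. The i-vector dimension, the sigmoid activation and the particular way the [0, r] constraint is enforced (scaling the sigmoid output by r and clipping) are assumptions made for illustration only.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

# Assumed sizes: layer input dimension and utterance representation dimension.
d_prev, d_utt = 64, 400
rng = np.random.default_rng(1)
M = 0.01 * rng.standard_normal((d_prev, d_utt))  # subspace M^l (learned)
phi = np.zeros(d_prev)                           # residual phi^l (learned)
omega = rng.standard_normal(d_utt)               # low-dimensional utterance vector (e.g. i-vector)
r = 2.0                                          # assumed upper bound of the [0, r] constraint

# Equation (3): diagonal elements of Q^l from the utterance representation.
p_hat = sigmoid(M @ omega + phi)

# One possible way to keep the diagonal in [0, r]: scale and clip.
p_hat = np.clip(r * p_hat, 0.0, r)
Q = np.diag(p_hat)

# W^l stays fixed; the diagonal Q^l acts as a per-dimension filter on its input.
```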
[Figure 1: block diagram of the proposed system. Stage 1 (pre-training) passes the input features of the training and testing data through the DNN hidden layers; Stage 2 (adaptive training) applies the domain invariant transformation T to obtain domain invariant low variability features, which drive the hidden variability subspace M and its hidden variability transformation, producing mismatch compensated hidden layer features ahead of the softmax layer.]
Fig. 1 Both training and testing domains are mapped to the latent space using the unsupervised domain invariant transformation T (low variability subspace). The transformation M defined in the hidden variability space is learned to minimize the mismatch and to maximize the discriminative power between samples in the DNN. Domain distributions are indicated by dashed ellipsoids. Our learning scheme identifies the transformation M non-linearly.
Experimental Setup: Although there are many possible applications (speaker recognition, speech recognition, language modelling, etc.), one task that is significantly affected by mismatched conditions is language identification (LID): automatically recognizing the language of a given speech segment. LID can use different adaptation techniques to reduce the mismatch between training and testing data caused by background, channel, speaker and duration variabilities. Even though sufficient data is available for training such systems, test utterances can have a very short duration, and these short utterances are the most affected by mismatched conditions. We propose a modified bidirectional long short term memory (BLSTM) structure which is utterance adaptively trained using i-vectors [8]. The i-vectors can be considered as low dimensional representations of the utterance characteristics.
The main purpose of these experiments is to show that the HVS method is more effective for mismatched conditions than for matched conditions. We use the AP17-OLR training set with 10 languages for training the system [10]. These 10 languages have been split into two partitions based on their recording environment conditions: three languages (Japanese, Russian and Korean) are recorded in two different conditions and so are labelled 'mismatched', whereas the other 7 languages are from 'matched' conditions. We test the system on the 1s duration development data. The bottleneck features (BNF) are extracted as in [1], and the BLSTM baseline is trained on the BNF. Cepstral mean and variance normalization is conducted locally on the BNF before they are presented to the BLSTM system. The BLSTM has a single hidden layer of 1024 nodes and 10 classes as the outputs. After the BLSTM layer, feature level averaging is conducted before feeding into the next layer. 400-dimensional i-vectors are then extracted from the same BNF; the universal background model consists of 2048 Gaussians. Only the final hidden layer is adapted in these experiments, reducing the training complexity. First the initial BLSTM base model is trained; then M^l and φ^l are learned using i-vectors from the training utterances, while keeping the initial model weights fixed. Final classification is performed using the BNF as input to the BLSTM together with the test i-vectors. A sketch of this two-stage procedure is given after Table 1.
Table 1 describes the performance of the baseline BLSTM system and the gain attained by the additional HVS transformation. It can be clearly seen that HVS significantly improves performance for 1s duration utterances (from 73.2% to 79.1%), which are particularly susceptible to mismatched conditions. It should be noted that this improvement is substantially larger (14.92%) for the 'mismatched' condition languages (Japanese, Russian and Korean) than for the 'matched' languages.

Table 1: Performance of the proposed HVS compared to the BLSTM system for AP17-OLR 1s duration, for matched and mismatched conditions.

Condition     BLSTM Accuracy [%]   HVS Accuracy [%]   Improvement [%]
Matched       77.43                81.01              4.42
Mismatched    63.66                74.82              14.92
Overall       73.2                 79.1               7.46
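As a rough illustration of the two-stage procedure described above, the PyTorch sketch below treats the classifier head as a single linear layer on top of the averaged BLSTM output, with only the final hidden layer adapted. The BLSTM front end, the loss and the data pipeline are omitted, and the exact way the diagonal elements are constrained is an assumption; layer sizes follow the setup above (1024 hidden nodes, 400-dimensional i-vectors, 10 output classes).

```python
import torch
import torch.nn as nn

class HVSAdaptedHead(nn.Module):
    """Simplified sketch: only the final hidden layer is scaled by the
    utterance dependent diagonal transformation (equations 2 and 3)."""
    def __init__(self, hid_dim=1024, ivec_dim=400, n_classes=10, r=2.0):
        super().__init__()
        self.out = nn.Linear(hid_dim, n_classes)  # base weights W^l and bias b^l
        self.M = nn.Linear(ivec_dim, hid_dim)      # subspace M^l; its bias plays the role of phi^l
        self.r = r                                 # assumed upper bound of the [0, r] constraint

    def forward(self, h_last, ivec):
        # Equation (3): diagonal elements from the i-vector, kept in [0, r] via a scaled sigmoid.
        p_hat = self.r * torch.sigmoid(self.M(ivec))
        # Equation (2) with diagonal Q^l: elementwise scaling of the last hidden output.
        return self.out(p_hat * h_last)

model = HVSAdaptedHead()

# Stage 1 (assumed done elsewhere): train the base BLSTM / classifier on the BNF.
# Stage 2: freeze the base weights and learn only M^l and phi^l from the i-vectors.
for p in model.out.parameters():
    p.requires_grad = False
optimiser = torch.optim.Adam(model.M.parameters(), lr=1e-3)
```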
Results and Analysis: Figure 2 shows the distribution of the final hidden layer outputs from the DNN before and after the proposed transformation in equations (2) and (3) for the Korean language. Ideally, the training (blue/left) and testing (brown/right) feature distributions should lie on top of each other if the training and testing conditions match perfectly. Comparing Figures 2(a) and 2(b), the adapted feature distributions in 2(b) show greater overlap, implying that the adaptation helps to overcome the distribution mismatch between feature vectors from training and testing utterances. The distribution mismatch can also be quantified using the Kullback-Leibler (KL) divergence, calculated between two distributions modelled separately on the BLSTM output feature vectors from training and testing utterances of the Korean language (a sketch of this computation is given after Fig. 2). The resulting KL divergence values were 0.7084 and 0.2489 before and after the transformation respectively, demonstrating a lower mismatch in the transformed space. The KL divergence of all other languages was analysed in the same way. Although every language shows an individual improvement in KL divergence, the 'mismatched' languages (Japanese, Russian and Korean) show the largest improvement. This suggests that the HVS transformation is more effective when there is a mismatch between training and testing data.

Fig. 2 Comparison of training and testing data (a) before and (b) after the HVS transformation for the Korean language.
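The text does not state how the training and testing distributions were modelled for the KL comparison; assuming diagonal Gaussians fitted to the BLSTM output feature vectors, the mismatch could be computed along the following lines.

```python
import numpy as np

def gaussian_kl(mu_p, var_p, mu_q, var_q):
    """KL(p || q) between two diagonal Gaussians."""
    return 0.5 * np.sum(
        np.log(var_q / var_p) + (var_p + (mu_p - mu_q) ** 2) / var_q - 1.0
    )

def feature_mismatch(train_feats, test_feats):
    """Fit diagonal Gaussians to training and testing feature vectors
    (rows = utterances/frames, columns = feature dimensions) and
    return KL(train || test) as a mismatch measure."""
    mu_p, var_p = train_feats.mean(axis=0), train_feats.var(axis=0) + 1e-8
    mu_q, var_q = test_feats.mean(axis=0), test_feats.var(axis=0) + 1e-8
    return gaussian_kl(mu_p, var_p, mu_q, var_q)
```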
Conclusion: In this work, we have proposed the hidden variability subspace (HVS) adaptation method for mismatch adaptation. The proposed HVS method estimates utterance dependent parameters in an HVS and connects them to the DNN layers using newly introduced weights which are adaptively trained. We have evaluated the HVS method on the AP17-OLR 1s duration task and have shown that it can capture variability in training and testing conditions, particularly when there is a significant mismatch in conditions. Experimental results show that HVS learning outperforms a standard BLSTM system by using extra information about the channel/noise condition to perform implicit feature normalization on the speech utterance.

Sarith Fernando, Vidhyasaharan Sethu and Eliathamby Ambikairajah (School of Electrical Engineering and Telecommunications, UNSW, NSW 2052, Australia)
E-mail: [email protected]
Sarith Fernando and Eliathamby Ambikairajah are also with DATA61, Sydney, Australia

References
1. F. Richardson, D. Reynolds, and N. Dehak, "A Unified Deep Neural Network for Speaker and Language Recognition," arXiv preprint arXiv:1504.00923, 2015.
2. S. Fernando, V. Sethu, E. Ambikairajah, and J. Epps, "Bidirectional Modelling for Short Duration Language Identification," presented at Interspeech 2017, Sweden, 2017.
3. J. Fainberg, S. Renals, and P. Bell, "Factorised representations for neural network adaptation to diverse acoustic environments," presented at Interspeech 2017, 2017.
4. L. Samarakoon and K. C. Sim, "On combining i-vectors and discriminative adaptation methods for unsupervised speaker normalization in DNN acoustic models," in Acoustics, Speech and Signal Processing (ICASSP), 2016 IEEE International Conference on, 2016, pp. 5275-5279.
5. D. Yu, K. Yao, H. Su, G. Li, and F. Seide, "KL-divergence regularized deep neural network adaptation for improved large vocabulary speech recognition," in Acoustics, Speech and Signal Processing (ICASSP), 2013 IEEE International Conference on, 2013, pp. 7893-7897.
6. J. Xue, J. Li, D. Yu, M. Seltzer, and Y. Gong, "Singular value decomposition based low-footprint speaker adaptation and personalization for deep neural network," in Acoustics, Speech and Signal Processing (ICASSP), 2014 IEEE International Conference on, 2014, pp. 6359-6363.
7. L. Lee and R. C. Rose, "Speaker normalization using efficient frequency warping procedures," in Acoustics, Speech, and Signal Processing (ICASSP-96), 1996 IEEE International Conference on, 1996, pp. 353-356.
8. N. Dehak, P. A. Torres-Carrasquillo, D. A. Reynolds, and R. Dehak, "Language Recognition via i-vectors and Dimensionality Reduction," in INTERSPEECH, 2011, pp. 857-860.
9. V. Gupta, P. Kenny, P. Ouellet, and T. Stafylakis, "I-vector-based speaker adaptation of deep neural networks for French broadcast audio transcription," in Acoustics, Speech and Signal Processing (ICASSP), 2014 IEEE International Conference on, 2014, pp. 6334-6338.
10. Z. Tang, D. Wang, Y. Chen, and Q. Chen, "AP17-OLR Challenge: Data, Plan, and Baseline," arXiv preprint arXiv:1706.09742, 2017.