Continuous Hindi Speech Recognition Using Kaldi ASR Based on Deep Neural Network

Prashant Upadhyay1, Sanjeev Kumar Mittal2, Omar Farooq1, Yash Vardhan Varshney1 and Musiur Raza Abidi1

1 Department of Electronics, Aligarh Muslim University, Aligarh, Uttar Pradesh, 202002, India
2 Electrical Engineering, Indian Institute of Science, Bangalore, Karnataka, 560012, India
{upadhyaya.prashant, rsr.skm, omarfarooq70}@gmail.com

Abstract. Today, deep learning is one of the most reliable and technically mature approaches for building accurate models for speech recognition and natural language processing (NLP). In this paper, we propose Context-Dependent Deep-Neural-Network HMMs (CD-DNN-HMMs) for large-vocabulary Hindi speech recognition using the Kaldi automatic speech recognition toolkit. Experiments on the AMUAV database demonstrate that CD-DNN-HMMs outperform the conventional CD-GMM-HMM model, improving the word error rate by 3.1% over the conventional triphone model.

Keywords: Deep neural network (DNN), Hidden Markov model (HMM), Speech recognition, Kaldi, Hindi language.

1 Introduction

The deep learning approach has recently become a major research area for those working on automatic speech recognition systems. Its foundations were laid in the late 1980s with the backpropagation algorithm [1] and the Boltzmann machine learning algorithm [2], in which features were trained over one or more layers of hidden units using backpropagation. During that period, however, the limitations of the available hardware restricted the use of Boltzmann machine algorithms [2], so research focused on more efficient algorithms for training layered Boltzmann machines. One well-known layered Boltzmann machine is the restricted Boltzmann machine (RBM), which contains a single hidden layer; Hinton [3] proposed an efficient algorithm for training deep layered RBM machines. Advances in Graphics Processing Unit (GPU) hardware later made it possible to increase the depth of deep neural nets from a single hidden layer to multiple hidden layers. The development of more powerful algorithms together with improved GPU hardware has therefore greatly increased the use of deep neural nets: modern GPUs can handle the training of large-database models and outperform CPUs in peak operations per second [3], [4].


The advantage of GPUs is thus that they reduce computation time and provide a parallel programming environment in which algorithms run much faster. This has attracted researchers working in machine learning: areas such as image processing [6], [7] and speech processing [8], [9], [10] quickly adopted deep learning models for training classifiers. Phone recognition experiments based on deep neural nets [5] showed promising results over the baseline model. The same approach was reported in [8], [9], where Dahl et al. developed the first large-vocabulary continuous speech recognition model based on deep neural nets. The performance obtained with deep neural nets was so good that researchers in speech processing switched to building their acoustic models on them, and DNNs, which provide multiple non-linear layers for modeling the system, became the state of the art for speech recognition [10]-[16].

Previously, most work on automatic speech recognition was based on simple training and decoding of HMM-GMM models, but DNNs have since proved to be the faster route for most automatic speech recognition systems. Deep Neural Networks (DNNs) combined with Hidden Markov Models (HMMs) have shown significant improvements on automatic speech recognition tasks [10]-[16]. RBMs, autoencoders and Deep Belief Networks (DBNs) are some of the neural techniques used to improve the performance of such systems [12], [17]-[19]. Autoencoders are simple three-layer neural networks whose outputs are connected directly back to their inputs; the hidden layer is usually smaller than the input and output layers, and the objective is to minimize the reconstruction error, in other words to find the most efficient compact representation or encoding of the input signal. An RBM, on the other hand, is stochastic rather than deterministic: it uses stochastic units (e.g., with Gaussian distributions) and adjusts its weights to minimize the reconstruction error. In a Deep Belief Network (DBN), several RBMs are stacked to form multiple layers, each layer learning features from the features extracted by the layer beneath it [12], [17]-[19]. Research suggests, however, that whichever of these methods performs best is likely to depend on the dataset used. In this work, the RBM model is considered.

In this paper, a continuous Hindi speech recognition model based on deep neural networks is evaluated using the Kaldi toolkit [20]. Standard Mel-frequency cepstral coefficient (MFCC) features are extracted from the AMUAV corpus, a Hindi speech database containing 1000 phonetically balanced sentences recorded by 100 speakers and covering 54 Hindi phones. Each speaker recorded 10 short phonetically balanced sentences, of which two are common to all speakers, and the audio was sampled at 16000 Hz. A trigram language model was built using the SRILM language modeling toolkit.
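For illustration only (the actual front end in this work is Kaldi's MFCC pipeline), 13-dimensional MFCCs of the kind described above can be sketched in Python with librosa; the file name and frame settings below are assumptions, not the paper's exact configuration:

```python
import librosa

# Load a 16 kHz utterance (hypothetical AMUAV-style file name).
y, sr = librosa.load("amuav_speaker01_sent01.wav", sr=16000)

# 13 MFCCs per 25 ms frame with a 10 ms shift, a common ASR setup.
mfcc = librosa.feature.mfcc(
    y=y, sr=sr, n_mfcc=13,
    n_fft=int(0.025 * sr), hop_length=int(0.010 * sr),
)
print(mfcc.shape)  # (13, number_of_frames)
```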


2 Restricted Boltzmann Machine

Generally, an RBM can be viewed as a two-layer model in which one layer contains m visible units (v) representing the observable data and the other consists of n hidden units (h). Figure 1 shows the undirected graph of an RBM with n hidden and m visible variables.

Fig. 1. The undirected graph of an RBM with n hidden and m visible variables.

A restricted Boltzmann machine (RBM) assigns an energy to every joint configuration of the visible and hidden layer states, given by:

E(\mathbf{v}, \mathbf{h}) = -\mathbf{b}^{T}\mathbf{v} - \mathbf{c}^{T}\mathbf{h} - \mathbf{v}^{T}\mathbf{W}\mathbf{h}    (1)

where W contains the visible/hidden connection weights in matrix form, b is the visible bias vector, and c is the hidden bias vector. The joint probability distribution of the visible and hidden layers is given by the Gibbs distribution p(\mathbf{v}, \mathbf{h}) [2], [3], [12], defined for a given configuration in terms of its energy as:

p(\mathbf{v}, \mathbf{h}) = \frac{e^{-E(\mathbf{v}, \mathbf{h})}}{Z}    (2)

where the normalization factor Z = \sum_{\mathbf{v},\mathbf{h}} e^{-E(\mathbf{v},\mathbf{h})} is known as the partition function. The RBM energy function of Eq. (1) can be rewritten as a sum of terms each depending on at most one hidden unit, which yields:

E(\mathbf{v}, \mathbf{h}) = -\beta(\mathbf{v}) + \sum_{j} \gamma_{j}(\mathbf{v}, h_{j})    (3)

where \beta(\mathbf{v}) = \mathbf{b}^{T}\mathbf{v} and \gamma_{j}(\mathbf{v}, h_{j}) = -(c_{j} + \mathbf{v}^{T}\mathbf{W}_{*,j})h_{j}, with \mathbf{W}_{*,j} denoting the j-th column of \mathbf{W}. To train the restricted Boltzmann machine, the gradient of the log-likelihood of the data with respect to an arbitrary model parameter \theta is computed: for a given training set, learning consists of finding the weights and biases (parameters) under which the training data have high probability. The update of a model parameter \theta is therefore based on the expectations of the energy gradient under the data distribution and the model distribution, computed as:

\Delta\theta = \left\langle \frac{\partial E}{\partial \theta} \right\rangle_{data} - \left\langle \frac{\partial E}{\partial \theta} \right\rangle_{model}    (4)
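The data expectation in Eq. (4) is easy to compute, while the model expectation is intractable, so in practice it is approximated, e.g. with one step of contrastive divergence (CD-1) [3]. The following is a minimal NumPy sketch of a CD-1 update for a binary RBM; the dimensions, learning rate and data are illustrative only:

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def cd1_update(W, b, c, v0, lr=0.1):
    """One contrastive-divergence (CD-1) step for a binary RBM."""
    # Positive phase: hidden probabilities and a sample, given the data.
    ph0 = sigmoid(c + v0 @ W)
    h0 = (rng.random(ph0.shape) < ph0).astype(float)
    # Negative phase: one Gibbs step back to a "reconstruction".
    pv1 = sigmoid(b + h0 @ W.T)
    ph1 = sigmoid(c + pv1 @ W)
    # Gradient approximation of Eq. (4): <.>_data - <.>_model.
    W = W + lr * (np.outer(v0, ph0) - np.outer(pv1, ph1))
    b = b + lr * (v0 - pv1)
    c = c + lr * (ph0 - ph1)
    return W, b, c

# Toy RBM with m = 6 visible and n = 4 hidden units.
m, n = 6, 4
W = 0.01 * rng.standard_normal((m, n))
b, c = np.zeros(m), np.zeros(n)
v = (rng.random(m) < 0.5).astype(float)  # one fake binary training vector
W, b, c = cd1_update(W, b, c, v)
```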

3 Deep Neural Network Using the Kaldi ASR Toolkit

Deep neural networks, together with high-compute-capability GPU cards, are now among the core tools for building speech recognition models. The two most important speech recognition toolkits, HTK 3.5 [21] and Kaldi [22], both include DNNs for building speech recognition models. Kaldi, which is actively maintained and ships straightforward documentation of its DNN scripts, has attracted most researchers working in the field [22] and has therefore become the preferred toolkit for building speech models with DNNs.

Kaldi currently supports three code bases for DNNs [22]. The first, "nnet1", is maintained by Karel Vesely and trains on a single GPU. The second, "nnet2", is maintained by Daniel Povey and is based on the original scripts of Karel Vesely; it supports parallel training using multiple threads across multiple CPUs or GPUs. The third, "nnet3", is an enhanced version of Daniel Povey's "nnet2" code.

Experimental work on DNNs using the Kaldi ASR toolkit has been reported in [23]-[29]. Povey et al. [23] experimented with an n-gram phone model based on sequence-discriminative training: they proposed a lattice-free (LF) version of the Maximum Mutual Information (MMI) criterion for training neural-network acoustic models on five large-vocabulary continuous speech recognition (LVCSR) tasks. Cosi [24] performed phone recognition on the ArtiPhon Italian corpus [25] using the Kaldi ASR toolkit; the DNN results, however, did not beat the baseline method, which Cosi attributed to an insufficient corpus size that prevented proper tuning of the DNN architecture for the ArtiPhon corpus. Miao [26] developed a full-fledged DNN acoustic model, Kaldi+PDNN, using the Kaldi Switchboard 110-hour setup: the initial models were built with the Kaldi toolkit, the DNNs were trained with PDNN (a deep learning toolkit developed in the Theano environment), and the resulting models (GMM+PDNN) were decoded with Kaldi. Four DNN recipes have so far been released for Kaldi+PDNN. Chen et al. [27] reported work based on Long Short-Term Memory (LSTM) recurrent neural networks, performing robust speech recognition by integrating automatic speech recognition and speech enhancement within an LSTM network. Most of the work based on Kaldi speech recognition models, however, targets the English language; in this paper we propose a deep neural network model for the Hindi language.


4 Experiment Setup

The Hindi automatic speech recognition experiments for independent speakers (IS) using the Kaldi toolkit were set up under Ubuntu 16.04 LTS (64-bit), running on a dual-core (4 logical CPUs) Intel(R) Core(TM) i5-4200U CPU @ 1.60 GHz with 4 GB RAM, hosting a single GPU of compute capability 3.5 with 2 GB memory under the CUDA 8.0 toolkit. The DNN implementation for Hindi speech was therefore restricted to "nnet1", i.e., Karel's setup, which uses pre-training and is compiled for CUDA (NVIDIA) GPUs. Experiments were performed on the AMUAV Hindi speech database, which contains 1000 phonetically balanced sentences recorded by 100 speakers using 54 Hindi phones [30]. For feature-space reduction, linear discriminant analysis (LDA) together with a maximum likelihood linear transform (MLLT) was applied, and the resulting features served as the observation input to the Gaussian mixture model (GMM) based acoustic models. For the tied-state triphones, we also built feature-space maximum likelihood linear regression (fMLLR) transforms for a speaker-adapted GMM model. Finally, frame-based cross-entropy training and state-level Minimum Bayes Risk (sMBR) training on the Hindi database were evaluated.
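As a loose illustration of the LDA step alone (Kaldi estimates LDA and MLLT internally on spliced features; the frame counts, dimensions and state labels below are made up), scikit-learn can project spliced MFCC frames to a 40-dimensional space:

```python
import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

rng = np.random.default_rng(0)

# Fake spliced features: 1000 frames of 13 MFCCs with +/-4 frames of
# context (9 x 13 = 117 dims), each labelled with a triphone state.
frames = rng.standard_normal((1000, 9 * 13))
states = rng.integers(0, 200, size=1000)

# LDA reduces the spliced features to 40 dimensions, a typical Kaldi
# target; the MLLT rotation and fMLLR speaker adaptation are omitted.
lda = LinearDiscriminantAnalysis(n_components=40)
feats_40d = lda.fit_transform(frames, states)
print(feats_40d.shape)  # (1000, 40)
```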

Fig. 2. Training of the RBM layers for the Hindi speech model on the AMUAV database.

Figure 2 shows the DNN stages of the Hindi automatic speech recognition system. RBM pre-training is performed unsupervised on top of the fMLLR features. Frame cross-entropy training with mini-batch Stochastic Gradient Descent is then performed so that the network correctly classifies each frame into its corresponding triphone state (i.e., PDF). Finally, sequence training optimizing the sMBR criterion is performed to obtain better frame accuracy with respect to the reference alignment. The system runs 6 iterations of RBM pre-training, followed by frame cross-entropy training and sMBR sequence-discriminative training. Table 1 shows that sequence-discriminative training further reduces the word error rate of the Hindi automatic speech recognition model.
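For intuition only: Kaldi's nnet1 implements this stage in C++/CUDA, but frame cross-entropy training amounts to fitting a feed-forward classifier that maps each (fMLLR-transformed) frame to a triphone-state (PDF) id with mini-batch SGD. A toy PyTorch sketch under that reading, with made-up dimensions and random data:

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Fake training data: 2048 40-dim feature frames, each labelled with
# one of 500 hypothetical triphone-state (PDF) ids.
frames = torch.randn(2048, 40)
pdf_ids = torch.randint(0, 500, (2048,))

# A small feed-forward stack standing in for the DNN; in Karel's setup
# the hidden layers would be initialised from the pretrained RBMs.
model = nn.Sequential(
    nn.Linear(40, 1024), nn.Sigmoid(),
    nn.Linear(1024, 1024), nn.Sigmoid(),
    nn.Linear(1024, 500),
)
opt = torch.optim.SGD(model.parameters(), lr=0.01)
loss_fn = nn.CrossEntropyLoss()

# One epoch of mini-batch SGD over frame/label pairs (batch size 256).
for i in range(0, len(frames), 256):
    x, y = frames[i:i + 256], pdf_ids[i:i + 256]
    opt.zero_grad()
    loss = loss_fn(model(x), y)
    loss.backward()
    opt.step()
```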

Table 1. WER performance for monophone, triphone, LDA+MLLT and Karel's DNN using MFCC features for continuous Hindi speech recognition.

Features                                                                  Word Error Rate (%WER)
Monophone training & decoding                                             18.32
Deltas + delta-deltas training & decoding                                 14.73
LDA + MLLT training & decoding                                            12.62
DNN: Karel's hybrid DNN (dnn4_pretrain-dbn_dnn)                           12.13
DNN: Karel's DNN + sMBR system combination (dnn4_pretrain-dbn_dnn_smbr)   11.63

5 Comparative Performance

This section presents a comparative analysis of work done with Kaldi-DNN on the Hindi, Arabic, English and Italian languages. Ali et al. [31] and Cosi [32] presented complete recipes for building Arabic and Italian speech recognition models with Kaldi. Ali et al. [31] experimented with a broadcast news system using the 200-hour GALE corpus, covering Broadcast Report (BR), Broadcast Conversational (BC) and their combination (BR+BC); here, only their best results for Arabic, obtained on Broadcast Report (BR), are presented. For Italian, the experiments used 10 hours of the Italian FBK ChilIt corpus [32], and for Hindi the AMUAV corpus [30]. We also ran the same Kaldi DNN recipe on English using the TIMIT corpus [33]. Since the corpus sizes differ across languages, comparisons are made only within the results obtained for each respective language with Kaldi-DNN. MFCC features were used as the baseline for all four languages. Table 2 shows the comparative analysis for the different languages.

Table 2. Comparative analysis of the DNN model developed for each language (Arabic, Italian, English and Hindi) against the conventional GMM-HMM (LDA+MLLT) model in terms of word error rate (%WER).

Features                                Arabic (%WER) [31]   Italian (%WER) [32]   English (%WER) [33]   Hindi (%WER) [30]
Corpus set                              GALE corpus          ChilIt corpus         TIMIT corpus          AMUAV corpus
GMM-HMM (LDA+MLLT)                      22.32                12.7                  24.0                  12.62
Karel's hybrid DNN                      17.36                8.6                   18.4                  12.13
Karel's DNN + sMBR system combination   15.81                8.3                   18.1                  11.63


Table 2 shows that, in every case, the DNN outperforms the conventional GMM-HMM model. It can also be observed that the Arabic system, trained on 200 hours of data, shows the largest improvement, and that the Italian system, trained on 10 hours, improves more than the Hindi one. For corpora of similar size, i.e., TIMIT and AMUAV, the absolute results are better for Hindi, but the improvement from the GMM-HMM to the DNN model is larger for TIMIT: for Hindi we obtained an improvement of ~1% over the conventional LDA+MLLT model, whereas for TIMIT it was ~6%. This is because DNN pre-training depends on the labeled frames (phoneme-to-audio alignments) produced by the GMM-HMM system: for the TIMIT corpus these alignments use 39 English phones, whereas for AMUAV they use 54 Hindi phones. A smaller phone set yields better GMM-HMM phone alignments, which in turn improves the ASR model trained with the DNN. Moreover, because of the insufficient size of the AMUAV corpus, it was not feasible to find the optimum weights for updating the model parameters and tuning the DNN layers for the given training data. More accurate results could be obtained with further tuning of the DNN layers, which requires a larger corpus.

6 Conclusion and Future Work

The results show that the performance of the Hindi speech recognition model improves with the use of a deep neural network. The experiments were performed in two phases: in the first, a simple DNN without sMBR was run; in the second, a DNN with sMBR. The first phase gave absolute WER improvements of 6.19%, 2.6% and 0.49% over the monophone, triphone and LDA+MLLT systems respectively; the second gave improvements of 6.69%, 3.1% and 0.99% over the same baselines. Thus sMBR yielded a further 0.5% WER improvement over the pre-trained DNN. More robust results could be obtained with further tuning of the DNN layers. We plan to continue this work by enlarging the AMUAV database and developing more accurate DNN models for continuous Hindi speech. Autoencoders and DBNs were not used in the present study; since we expect DBN usage to improve our results, we plan to employ DBNs in our next experiments, and as future work we will implement our speech model with autoencoders and DBNs and compare them with the RBM-based models.


7 Acknowledgment

The authors would like to acknowledge the Institution of Electronics and Telecommunication Engineers (IETE) for sponsoring the research fellowship during this period of research.

References
1. Rumelhart, D.E., Hinton, G.E., Williams, R.J.: Learning representations by back-propagating errors. Nature. 323, 533–536 (1986).
2. Ackley, D.H., Hinton, G.E., Sejnowski, T.J.: A learning algorithm for Boltzmann machines. Cogn. Sci. 9, 147–169 (1985).
3. Hinton, G.E.: Training Products of Experts by Minimizing Contrastive Divergence. Neural Comput. 14, 1771–1800 (2002).
4. Raina, R., Madhavan, A., Ng, A.Y.: Large-scale Deep Unsupervised Learning Using Graphics Processors. Proc. 26th Int. Conf. Mach. Learn. (ICML 09). 873–880 (2009).
5. Mohamed, A., Dahl, G.E., Hinton, G.: Acoustic Modeling Using Deep Belief Networks. IEEE Trans. Audio, Speech, Lang. Process. 20, 14–22 (2012).
6. Mnih, V., Hinton, G.E.: Learning to detect roads in high-resolution aerial images. Lect. Notes Comput. Sci. 6316 LNCS, 210–223 (2010).
7. Cireşan, D.C., Meier, U., Gambardella, L.M., Schmidhuber, J.: Handwritten Digit Recognition with a Committee of Deep Neural Nets on GPUs. Tech. Rep. No. IDSIA-03-11. 1–8 (2011).
8. Dahl, G.E., Yu, D., Deng, L., Acero, A.: Large vocabulary continuous speech recognition with context-dependent DBN-HMMs. 2011 IEEE Int. Conf. Acoust. Speech Signal Process. 4688–4691 (2011).
9. Dahl, G.E., Sainath, T.N., Hinton, G.E.: Improving Deep Neural Networks for LVCSR Using Rectified Linear Units and Dropout. Int. Conf. Acoust. Speech Signal Process. (ICASSP), 2013 IEEE. 8609–8613 (2013).
10. Hinton, G., Deng, L., Yu, D., Dahl, G.E., Mohamed, A., Jaitly, N., Senior, A., Vanhoucke, V., Nguyen, P., Sainath, T.N., Kingsbury, B.: Deep Neural Networks for Acoustic Modeling in Speech Recognition. IEEE Signal Process. Mag. 82–97 (2012).
11. Hinton, G.E., Srivastava, N., Krizhevsky, A., Sutskever, I., Salakhutdinov, R.R.: Improving neural networks by preventing co-adaptation of feature detectors. arXiv preprint arXiv:1207.0580. 1–18 (2012).
12. Jaitly, N., Hinton, G.: Learning a better representation of speech soundwaves using restricted Boltzmann machines. ICASSP, IEEE Int. Conf. Acoust. Speech Signal Process. - Proc. 1, 5884–5887 (2011).
13. Deng, L., Li, J., Huang, J.T., Yao, K., Yu, D., Seide, F., Seltzer, M., Zweig, G., He, X., Williams, J., Gong, Y., Acero, A.: Recent advances in deep learning for speech research at Microsoft. ICASSP, IEEE Int. Conf. Acoust. Speech Signal Process. - Proc. 8604–8608 (2013).
14. Dahl, G.E., Yu, D., Deng, L., Acero, A.: Context-dependent pre-trained deep neural networks for large-vocabulary speech recognition. IEEE Trans. Audio, Speech Lang. Process. 20, 30–42 (2012).

15. Zeiler, M.D., Ranzato, M., Monga, R., Mao, M., Yang, K., Le, Q.V., Nguyen, P., Senior, A., Vanhoucke, V., Dean, J., Hinton, G.E.: On rectified linear units for speech processing. Int. Conf. Acoust. Speech Signal Process. (ICASSP), 2013 IEEE. 3517–3521 (2013).
16. Seide, F., Li, G., Chen, X., Yu, D.: Feature engineering in Context-Dependent Deep Neural Networks for conversational speech transcription. 2011 IEEE Work. Autom. Speech Recognit. Understanding, ASRU 2011, Proc. 24–29 (2011).
17. Gehring, J., Miao, Y., Metze, F., Waibel, A.: Extracting deep bottleneck features using stacked auto-encoders. Int. Conf. Acoust. Speech Signal Process. (ICASSP). 3377–3381 (2013).
18. Deng, L., Seltzer, M.L., Yu, D., Acero, A., Mohamed, A.R., Hinton, G.: Binary coding of speech spectrograms using a deep auto-encoder. Eleventh Annu. Conf. Int. Speech Commun. Assoc. 1692–1695 (2010).
19. Dahl, G., Mohamed, A.R., Hinton, G.E.: Phone recognition with the mean-covariance restricted Boltzmann machine. Adv. Neural Inf. Process. Syst. 469–477 (2010).
20. Povey, D., Ghoshal, A., Boulianne, G., Burget, L., Glembek, O., Goel, N., Hannemann, M., Motlicek, P., Qian, Y., Schwarz, P., Silovsky, J., Stemmer, G., Vesely, K.: The Kaldi speech recognition toolkit. IEEE Work. Autom. Speech Recognit. Underst. 1–4 (2011).
21. Young, S., Gales, M., Liu, X.A., Povey, D., Woodland, P.: The HTK Book (version 3.5a). Cambridge Univ. Eng. Dep. (2015).
22. Kaldi Home Page (kaldi-asr.org).
23. Povey, D., Peddinti, V., Galvez, D., Ghahremani, P., Manohar, V., Na, X., Wang, Y., Khudanpur, S.: Purely sequence-trained neural networks for ASR based on lattice-free MMI. Proc. of INTERSPEECH. 2751–2755 (2016).
24. Cosi, P.: Phone Recognition Experiments on ArtiPhon with KALDI. Proc. Third Ital. Conf. Comput. Linguist. (CLiC-it 2016) Fifth Eval. Campaign Nat. Lang. Process. Speech Tools Ital. 1749, 0–5 (2016).
25. Canevari, C., Badino, L., Fadiga, L.: A new Italian dataset of parallel acoustic and articulatory data. Proc. Annu. Conf. Int. Speech Commun. Assoc. INTERSPEECH. 2152–2156 (2015).
26. Miao, Y.: Kaldi+PDNN: Building DNN-based ASR Systems with Kaldi and PDNN. arXiv CoRR. abs/1401.6, 1–4 (2014).
27. Chen, Z., Watanabe, S., Erdogan, H., Hershey, J.R.: Speech enhancement and recognition using multi-task learning of long short-term memory recurrent neural networks. Interspeech. 3274–3278 (2015).
28. Zhang, X., Trmal, J., Povey, D., Khudanpur, S.: Improving deep neural network acoustic models using generalized maxout networks. 2014 IEEE Int. Conf. Acoust. Speech Signal Process. 215–219 (2014).
29. Vu, N.T., Imseng, D., Povey, D., Motlicek, P., Schultz, T., Bourlard, H.: Multilingual deep neural network based acoustic modeling for rapid language adaptation. ICASSP 2014. 7639–7643 (2014).
30. Upadhyaya, P., Farooq, O., Abidi, M.R., Varshney, P.: Comparative Study of Visual Feature for Bimodal Hindi Speech Recognition. Arch. Acoust. 40 (2015).
31. Ali, A., Zhang, Y., Cardinal, P., Dahak, N., Vogel, S., Glass, J.: A complete KALDI recipe for building Arabic speech recognition systems. 2014 IEEE Work. Spok. Lang. Technol. SLT 2014 - Proc. 525–529 (2014).

32. Cosi, P.: A KALDI-DNN-based ASR system for Italian. Proc. Int. Jt. Conf. Neural Networks (2015).
33. Lopes, C., Perdigão, F.: Phone recognition on the TIMIT database. Speech Technol. 1, 285–302 (2011).