International Conference on Bangla Speech and Language Processing (ICBSLP), 21-22 September, 2018
Noise Robust End-to-End Speech Recognition For Bangla Language

Sakhawat Hosain Sumit, Tareq Al Muntasir, M. M. Arefin Zaman, Rabindra Nath Nandi and Tanvir Sourov
Socian Ltd, Dhaka, Bangladesh
{sumit, tareq, arefin, rabindra, tanvir}@socian.ai
Abstract— A robust speech recognition system is crucial for real-world applications, and speech signals generally contain a considerable amount of noise. We propose an end-to-end deep learning approach, leveraging recent progress in Automatic Speech Recognition, to recognize continuous Bangla speech in noisy environments. We improve the robustness of our model through data augmentation and a deep model architecture. We evaluate our model on an internal dataset and two available datasets, and achieve strong results on both clean read and noisy speech data. Our model achieves 10.65% and 34.83% CER on CRBLP (clean read) and Babel (noisy) respectively, which is, to the best of our knowledge, state of the art for continuous Bangla speech recognition. On our internal dataset, we achieve 12.31% and 9.15% CER on clean and noisy speech respectively.

Keywords—End-to-End, Noise Robust, Speech Recognition, CTC, Bangla Language
I. INTRODUCTION
In recent years, Deep Neural Network (DNN) based architectures have shown tremendous results on continuous speech recognition tasks compared with classical machine learning approaches such as the Gaussian Mixture Model (GMM) and Hidden Markov Model (HMM) [1, 2, 3, 4, 5]. A relatively new approach, sequence-to-sequence (Seq2Seq) modeling, has gained popularity in the Automatic Speech Recognition (ASR) community, as it addresses the problem of producing variable-length output sequences, which a standard DNN cannot do. Various Seq2Seq models have been proposed recently, including the Recurrent Neural Network (RNN) Transducer [6], Listen, Attend and Spell (LAS) [7], the Neural Transducer [8], Monotonic Alignments [9] and the Recurrent Neural Aligner (RNA) [10]. Besides RNNs, Convolutional Neural Networks (CNNs) have also shown outstanding results in end-to-end ASR [11, 12, 13, 14, 15]. Training these kinds of deep networks is quite expensive and requires a significant amount of time. To make the training procedure faster and smoother, a method called Batch Normalization (BN) is often used [16]. BN accelerates training by allowing higher learning rates and stabilizes optimization by reducing internal covariate shift. Speech signals are not well segmented; to align input-output sequences properly and handle repeated characters, DNNs are trained using Connectionist Temporal Classification (CTC) [17, 18]. In this paper, we propose an end-to-end speech recognition system for continuous Bangla speech, leveraging the capability of DNNs to learn automatically from a large speech corpus without any hand-crafted feature engineering. Our model is trained in an end-to-end fashion to
produce transcriptions from speech. Training a neural network in an end-to-end approach requires a lot of data. Therefore, we augment the training data in several ways so that the model can learn robustness to noise and speaker variation on its own. Our model takes spectrogram feature sequences from the raw audio signal and propagates them through multiple convolutional and recurrent layers, each preceded by a BN layer. In the final layer, we use a fully connected (FC) layer to aggregate the insights learned from the previous layers. Finally, the model output is trained with the CTC criterion, and output sequences are decoded by beam search [45].
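A minimal sketch of this pipeline, assuming a PyTorch-style implementation, is shown below; the filter count (32) and GRU width (800) follow the experimental setup reported later, while the kernel size, the unidirectional GRU and the grapheme inventory size are illustrative assumptions rather than settings taken from this work.

```python
import torch
import torch.nn as nn

class SpeechToGraphemes(nn.Module):
    """Sketch: spectrogram -> conv layers -> GRU stack -> FC -> per-frame
    grapheme log-probabilities, which CTC training and beam search consume."""
    def __init__(self, n_feats=161, n_graphemes=60, n_conv=2, n_gru=3, gru_units=800):
        super().__init__()
        convs, in_ch = [], n_feats
        for _ in range(n_conv):
            convs += [nn.BatchNorm1d(in_ch),                       # BN before each hidden layer
                      nn.Conv1d(in_ch, 32, kernel_size=11, padding=5),
                      nn.ReLU()]
            in_ch = 32
        self.conv = nn.Sequential(*convs)
        self.gru = nn.GRU(32, gru_units, num_layers=n_gru, batch_first=True)
        self.fc = nn.Linear(gru_units, n_graphemes + 1)            # +1 for the CTC blank

    def forward(self, spect):                   # spect: (batch, time, n_feats)
        x = self.conv(spect.transpose(1, 2))    # (batch, 32, time)
        x, _ = self.gru(x.transpose(1, 2))      # (batch, time, gru_units)
        return self.fc(x).log_softmax(dim=-1)   # inputs to the CTC loss / decoder
```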
II. RELATED WORK
English continuous speech can be recognized with an accuracy rate of more than 90% [19, 20]. Unfortunately, for Bangla, work on speech recognition is still at a preliminary stage. Research on recognizing Bangla speech began around 2000. In the early 2000s, researchers tended to use different signal processing techniques to extract features from the speech signal and used different classifiers for recognition. Since then, various attempts have been made to recognize Bangla speech with high accuracy. Reference [21] attempted continuous Bangla speech recognition using an HMM for pattern recognition with the assistance of a stochastic language model. They achieved 85% accuracy, but their grammar was built using only 100 words. Muhammad et al. [22] also used an HMM with Mel-frequency cepstral coefficient (MFCC) features, but only to classify Bangla digits. Even with this small number of classes, it fails to recognize digits with similar pronunciations such as 6 (six), 7 (seven), 8 (eight) and 9 (nine); the system depends on the spectrogram as its feature, and the spectrograms of those digits are very similar. Linear Predictive Coding (LPC) and Artificial Neural Networks (ANNs) were used in [23]. The main limitation of this work is that it covers only four Bangla words uttered by only four persons (two male and two female), and the reported accuracy is not clear from the paper. In [24], LPC, GMM and MFCC were used for feature extraction and Dynamic Time Warping for matching. Their best model achieved at most an 84% recognition rate for one hundred Bangla words in a normal room environment. Though an ANN was used in [25], it covers only ten (10) digits in a noise-free environment, with an average accuracy of 92% in a speaker-independent setting. Very recently, [26] proposed a Bangla phoneme recognition system to help in the
development of a Bangla speech recognizer using a new Linear Predictive Cepstral Coefficient-based feature. Though it can detect phonemes with high accuracy (99%), this method is not an end-to-end speech recognizer. In [27, 28], several ANN architectures were evaluated for Bangla word, character and digit recognition using an end-to-end approach. All the previous methods for Bangla speech recognition were developed for noise-free, well-controlled speech data, and none of them experimented with continuous speech. They did not consider variability in parameters such as speed, noise, regional dialect and loudness, and the size of the evaluation data set was very small in every case. In conclusion, not a single Bangla speech-to-text system has been found in the published literature that works well with real-life, large-scale data.

III. MODEL ARCHITECTURE
A shallow multi-layer neural network model cannot leverage hundreds of hours of speech data; sufficient model capacity is required to work with a large dataset and overcome underfitting. For this reason, a model inspired by deep architectures is needed. We evaluate model architectures of up to 9 layers, including convolutional, recurrent and FC layers. A single FC layer is used to aggregate the local information learned by the network for class discrimination. Optimizing a deep architecture requires a substantial amount of time and often suffers from internal covariate shift and the vanishing gradient problem [29] due to improper weight initialization. To overcome this, we use Batch Normalization before each hidden layer, and Dropout [30] regularization is used to avoid overfitting. After the FC layer, we use a CTC [31] based approach with graphemes (characters) as the output symbols, and beam search is used for decoding output sequences.

A. Convolutional Layers
In convolutional layers, each neuron is connected only to a few nearby neurons in the previous layer, and weight sharing is used to reduce translational variance; in FC layers, by contrast, each neuron is connected to every neuron in the previous layer, which is computationally expensive. Recently, CNN based architectures have shown improvements on ASR tasks [13, 32]. We use convolutional layers at the beginning of our model architecture to learn low-level features and increase the field of view.

B. Recurrent Layers
Current research in speech and language processing has shown that complex recurrence can help the network learn long sequences [3, 7, 33, 34]. Long Short-Term Memory (LSTM) [35] units and Gated Recurrent Units (GRU) [36] are two commonly used recurrent architectures. A recent comprehensive study showed that the performance of GRU is comparable to that of LSTM with a properly tuned forget bias [37]. For smaller data sets, GRU and LSTM with the same number of parameters achieve similar accuracy, whereas GRU is faster to train. We therefore use GRU to reduce training time. For a GRU, the hidden state h_t can be computed as

\[ z_t = \sigma(W_z x_t + U_z h_{t-1}) \]
\[ r_t = \sigma(W_r x_t + U_r h_{t-1}) \]
\[ \tilde{h}_t = \tanh(W_h x_t + U_h (r_t \ast h_{t-1})) \]
\[ h_t = (1 - z_t) \ast h_{t-1} + z_t \ast \tilde{h}_t \]
Here, z_t and r_t represent the update and reset gates respectively, and σ is the sigmoid activation function. W is the input weight matrix and U is the recurrent connection between the current and previous hidden layers.
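A minimal NumPy sketch of a single GRU step, following the equations above (bias terms omitted for brevity; all names are illustrative):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gru_step(x_t, h_prev, W_z, U_z, W_r, U_r, W_h, U_h):
    """One GRU step. Shapes: x_t (d_in,), h_prev (d_h,),
    W_* (d_h, d_in), U_* (d_h, d_h)."""
    z = sigmoid(W_z @ x_t + U_z @ h_prev)              # update gate
    r = sigmoid(W_r @ x_t + U_r @ h_prev)              # reset gate
    h_cand = np.tanh(W_h @ x_t + U_h @ (r * h_prev))   # candidate state
    return (1.0 - z) * h_prev + z * h_cand             # new hidden state
```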
C. Batch Normalization
As we increase our architecture via depth, training takes a significant amount of time and the model can suffer from optimization issues. To address this, we explore Batch Normalization [16] as a technique to accelerate training and improve generalization:

\[ \hat{x} = \gamma \, \frac{x - \mu}{\sqrt{\sigma^2 + \epsilon}} + \beta \]

Here, γ and β are learnable parameters used to scale and shift the normalized activations. The terms μ and σ² are the mean and variance, respectively, computed over each minibatch. ε is a very small positive constant used for numerical stability (preventing division by zero); in our experiments we used 10^-5.
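The normalization above can be written directly in NumPy; this is an illustrative training-time version operating on a (batch, features) minibatch, not the exact implementation used here:

```python
import numpy as np

def batch_norm(x, gamma, beta, eps=1e-5):
    """Batch Normalization following the equation above.
    x: (batch, features); gamma, beta: (features,) learnable scale and shift."""
    mu = x.mean(axis=0)                     # per-feature minibatch mean
    var = x.var(axis=0)                     # per-feature minibatch variance
    x_hat = (x - mu) / np.sqrt(var + eps)   # normalize
    return gamma * x_hat + beta             # scale and shift
```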
D. Connectionist Temporal Classification
Speech signals are often not well segmented with respect to their transcripts and contain some intrinsic noise. To perform end-to-end speech recognition, we need to align audio clips with their corresponding transcripts, yet we have no explicit information about how the characters in a transcript align to the audio signal. CTC solves this problem without any explicit information about the alignment between input and output sequences, and it can ignore blanks and repeated characters without penalty. With CTC we can train a network according to the maximum-likelihood criterion, computing the conditional probability by marginalizing over all possible alignments, under the assumption of conditional independence between output predictions at different time steps given the aligned inputs. The CTC probability distribution can be computed as
\[ P(y \mid x) = \sum_{a \in \mathcal{B}^{-1}(y)} P(a \mid x) \approx \sum_{a \in \mathcal{B}^{-1}(y)} \prod_{t=1}^{T} P(a_t \mid x) \]

Here, x and y represent the acoustic feature vector sequence and the label sequence respectively, and B^{-1}(y) denotes the set of frame-level alignments that collapse to y.
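To make the marginalization concrete, the sketch below enumerates alignments by brute force for a toy example; real systems compute the same quantity with the CTC forward-backward algorithm, and the helper names here are illustrative:

```python
import itertools
import numpy as np

def collapse(path, blank=0):
    """The CTC collapse mapping B: merge repeated labels, then remove blanks."""
    out, prev = [], None
    for p in path:
        if p != prev and p != blank:
            out.append(p)
        prev = p
    return tuple(out)

def ctc_prob(log_probs, target, blank=0):
    """Brute-force P(y|x): sum P(a|x) over every frame-level alignment a with
    B(a) = y, assuming per-frame conditional independence. Only feasible for
    tiny T and C; shown purely to illustrate the equation above."""
    T, C = log_probs.shape
    total = 0.0
    for path in itertools.product(range(C), repeat=T):
        if collapse(path, blank) == tuple(target):
            total += np.exp(log_probs[np.arange(T), list(path)].sum())
    return total

def greedy_decode(log_probs, blank=0):
    """Best-path decoding: per-frame argmax followed by the collapse mapping."""
    return collapse(log_probs.argmax(axis=1), blank)
```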
IV. TRAINING DATA
As end-to-end speech recognition systems require an abundance of labeled training data, we built a modest training dataset for Bangla speech. Furthermore, we also used two available datasets: Babel [38], a private dataset used in both the training and test phases (split randomly), and the CRBLP speech corpus [39], used only in the test phase. Babel contains both read and conversational speech with a considerable amount of noise; in contrast, CRBLP contains noise-free, well-recorded audio signals. We found a large number of utterances in Babel containing no spoken words or only stop words (human expressions), which can barely be used and may affect model optimization. Many utterances also include a significant amount of leading and trailing pause. We trimmed the additional pauses from the audio and ignored corrupted audio signals. Table I presents the training data summary.

TABLE I. TRAINING DATA SUMMARY

Dataset | Speech Type                   | Hours
Babel   | noisy (read & conversational) | 50
Socian  | read                          | 200
Socian  | noisy (read & conversational) | 100
Total   |                               | 350
A. Dataset Construction
Our internal Bangla speech corpus was created from raw data recorded as audio clips with transcripts, together with phone conversations between our employees; the conversational data was transcribed afterwards. For read speech, audio was recorded in both clean and noisy environments. We used short articles, crawled from several web sources [40, 41, 42], as transcripts. The length of these audio clips ranged from a few seconds to several minutes, and it is quite impractical for an RNN to unroll such long audio segments in time. To solve this problem, we used the Kaldi toolkit [43] to align the audio clips with their corresponding transcripts and segment them into clips of at most ten seconds.

B. Data Augmentation
As our actual training data size is relatively small, we tried several augmentation techniques, listed below and sketched after this subsection, to expand the effective size of our training data:
- Volume perturbation
- Time perturbation
- Pitch perturbation
- Noise perturbation

We perturb volume, time and pitch randomly for every utterance. As Babel already contains a good amount of intrinsic noise, and many of its utterances can hardly be recognized even by a human, we do not augment Babel. Even though our internal training data contains some intrinsic noise, we increase the quality and variety of noise through augmentation to improve our model's robustness in real-life scenarios. We found that extensively noisy data leads to worse results and requires a significant amount of time for model optimization, so we add noise from various environments, including bus, cafe, street and pedestrian areas, to only 40% of the data, chosen randomly, using CHiME [44].
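A minimal sketch of two of the perturbations listed above (volume scaling and noise mixing at a target SNR) follows; the ranges are illustrative assumptions, and time and pitch perturbation are typically done by resampling or with external tools rather than in a few lines of NumPy:

```python
import numpy as np

def perturb_volume(wave, low=0.8, high=1.2, rng=np.random):
    """Random gain perturbation (illustrative gain range)."""
    return wave * rng.uniform(low, high)

def add_noise(wave, noise, snr_db, rng=np.random):
    """Mix a noise recording into an utterance at a target SNR (in dB)."""
    if len(noise) < len(wave):
        noise = np.tile(noise, int(np.ceil(len(wave) / len(noise))))
    start = rng.randint(0, len(noise) - len(wave) + 1)
    noise = noise[start:start + len(wave)]
    speech_pow = np.mean(wave ** 2) + 1e-12
    noise_pow = np.mean(noise ** 2) + 1e-12
    scale = np.sqrt(speech_pow / (noise_pow * 10 ** (snr_db / 10.0)))
    return wave + scale * noise
```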
V. RESULT AND DISCUSSION
In order to build a noise-robust, real-world applicable end-to-end speech recognition system, we augmented the data in several ways and evaluated our model on a wide variety of test samples. We used two available data sets and two internally collected test sets, described in Table II. Combined, these test sets present a wide variety of challenging speech environments, including low-signal, read, conversational and noisy speech. We use a 16 kHz sampling rate and a single channel for the audio signal. We find that our models converge after 20 epochs, so we trained each model for 20 epochs over the combined training set described in Table I. We used 32 filters for the convolutional layers and 800 units for the recurrent layers. We used stochastic gradient descent with momentum 0.99 and a minibatch size of 32 utterances. The learning rate is initialized at 1 × 10 and annealed by a constant factor of 0.1 after each epoch. We used a beam width of 10 for the decoder. The training procedure took over a week on a single GeForce GTX 1080 Ti GPU.

A. Read Speech
Read speech is arguably the easier speech recognition task. Our sources of read speech are the publicly available CRBLP speech corpus and an internally built test set; the CRBLP and internal test sets contain clean read speech from one and thirty speakers respectively. Tables III and IV present the performance comparison of our system architectures on read speech. We found that the deep architecture with augmentation gains 1.81% and 6.11% relative CER improvement after augmentation on CRBLP and Socian read speech respectively.

B. Noisy Speech
We evaluate our model's performance on noisy speech using the private Babel dataset and our internally built dataset. The datasets cover various noisy environments, including bus, cafe, street and pedestrian areas. Speech recognition in such environments is a challenging task, and the network's ability to handle a variety of environments is needed for better generalization in real-world scenarios. Tables V and VI present the performance comparison of our system architectures on noisy speech. Results show that the deep architecture with augmentation gains 2.17% and 25.8% relative CER improvement after augmentation on Babel and Socian noisy speech respectively.
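For reference, the WER and CER figures reported in the tables below are edit-distance based; a minimal sketch of how such scores can be computed (our exact scoring script may differ) is:

```python
def edit_distance(ref, hyp):
    """Levenshtein distance (insertions, deletions, substitutions)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        cur = [i] + [0] * len(hyp)
        for j, h in enumerate(hyp, 1):
            cur[j] = min(prev[j] + 1,             # deletion
                         cur[j - 1] + 1,          # insertion
                         prev[j - 1] + (r != h))  # substitution
        prev = cur
    return prev[-1]

def cer(ref, hyp):
    """Character error rate, in percent."""
    return 100.0 * edit_distance(list(ref), list(hyp)) / max(len(ref), 1)

def wer(ref, hyp):
    """Word error rate, in percent."""
    ref_w, hyp_w = ref.split(), hyp.split()
    return 100.0 * edit_distance(ref_w, hyp_w) / max(len(ref_w), 1)
```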
TABLE II. TEST DATA SUMMARY

Test Data | Speech Type                   | Hours
CRBLP     | read                          | 11
Babel     | noisy (read & conversational) | 10
Socian    | read                          | 50
Socian    | noisy (read & conversational) | 40

a. Test set overview of speech data from different environments.
TABLE III. COMPARISON OF WER AND CER ON READ SPEECH

Test set | 2 CNN + 3 GRU + 1 FC (WER / CER) | 3 CNN + 5 GRU + 1 FC (WER / CER)
CRBLP    | 48.04 / 12.83                    | 45.74 / 12.46
Socian   | 56.07 / 19.11                    | 53.687 / 18.42

b. The deeper architecture performs better on both read speech test sets for WER and CER.
TABLE IV. COMPARISON OF WER AND CER AFTER AUGMENTATION ON READ SPEECH

Test set | 2 CNN + 3 GRU + 1 FC (WER / CER) | 3 CNN + 5 GRU + 1 FC (WER / CER)
CRBLP    | 40.28 / 11.06                    | 38.65 / 10.65
Socian   | 32.46 / 12.47                    | 32.19 / 12.31

c. Deep architecture with augmentation gains 1.81% and 6.11% relative CER improvement on CRBLP and Socian read speech respectively.
TABLE V. COMPARISON OF WER AND CER ON NOISY SPEECH

Test set | 2 CNN + 3 GRU + 1 FC (WER / CER) | 3 CNN + 5 GRU + 1 FC (WER / CER)
Babel    | 86.88 / 36.62                    | 87.52 / 37.00
Socian   | 67.62 / 35.24                    | 66.14 / 34.95

d. The deeper architecture performs better on both Babel and Socian noisy speech for WER and CER.
As end-to-end systems require an extensive amount of training data, we increased our training data size through augmentation. The evaluation results indicate that increasing the data and deepening the model architecture lead to better generalized accuracy for both read and noisy Bangla speech.

TABLE VI. COMPARISON OF WER AND CER AFTER AUGMENTATION ON NOISY SPEECH

Test set | 2 CNN + 3 GRU + 1 FC (WER / CER) | 3 CNN + 5 GRU + 1 FC (WER / CER)
Babel    | 84.78 / 35.66                    | 83.32 / 34.83
Socian   | 30.10 / 9.41                     | 28.92 / 9.15

e. Deep architecture with augmentation gains 2.17% and 25.8% relative CER improvement on Babel and Socian noisy speech respectively.
VI. CONCLUSION
In this work, we leverage CNNs to learn important features and GRUs to unroll long feature sequences, with CTC used to align audio signals to their corresponding transcripts. We speed up the learning procedure by applying Batch Normalization, and Dropout regularization is used to mitigate overfitting. We show that leveraging more data through augmentation, together with a deep model architecture, can achieve generalized accuracy for a noise-robust Bangla speech recognition system. Our best model achieves 9.15% and 34.83% CER on Socian and Babel noisy speech respectively, and 12.31% and 10.65% CER on Socian and CRBLP read speech respectively. Though we achieve good CER on both read and noisy speech, WER is still poor; an additional language model could be used to improve word and character level performance.

REFERENCES
[1] G. Hinton et al., "Deep neural networks for acoustic modeling in speech recognition: The shared views of four research groups", IEEE Signal Processing Magazine, 2012, vol. 29(6), pp. 82-97.
[2] B. Kingsbury, T. N. Sainath and H. Soltau, "Scalable minimum Bayes risk training of deep neural network acoustic models using distributed Hessian-free optimization", In Thirteenth Annual Conference of the International Speech Communication Association, 2012.
[3] F. Seide, G. Li, and D. Yu, "Conversational speech transcription using context-dependent deep neural networks", In Twelfth Annual Conference of the International Speech Communication Association, 2011.
[4] N. Jaitly, P. Nguyen, A. Senior, and V. Vanhoucke, "Application of pretrained deep neural networks to large vocabulary speech recognition", In Thirteenth Annual Conference of the International Speech Communication Association, 2012.
[5] G. E. Dahl, D. Yu, L. Deng and A. Acero, "Context-dependent pre-trained deep neural networks for large-vocabulary speech recognition", IEEE Transactions on Audio, Speech, and Language Processing, 2012, vol. 20(1), pp. 30-42.
[6] A. Graves, "Sequence transduction with recurrent neural networks", arXiv preprint arXiv:1211.3711, 2012.
[7] W. Chan, N. Jaitly, Q. Le, and O. Vinyals, "Listen, attend and spell: A neural network for large vocabulary conversational speech recognition", In Acoustics, Speech and Signal Processing (ICASSP), 2016 IEEE International Conference on, pp. 4960-4964.
[8] N. Jaitly, Q. V. Le, O. Vinyals, I. Sutskever, D. Sussillo, and S. Bengio, "An online sequence-to-sequence model using partial conditioning", In Advances in Neural Information Processing Systems, 2016, pp. 5067-5075.
[9] C. Raffel, M. T. Luong, P. J. Liu, R. J. Weiss, and D. Eck, "Online and linear-time attention by enforcing monotonic alignments", arXiv preprint arXiv:1704.00784, 2017.
[10] H. Sak, M. Shannon, K. Rao, and F. Beaufays, "Recurrent neural aligner: An encoder-decoder neural network model for sequence to sequence mapping", In Proc. Interspeech, 2017, pp. 1298-1302.
[11] O. Abdel-Hamid, A. R. Mohamed, H. Jiang, L. Deng, G. Penn, and D. Yu, "Convolutional neural networks for speech recognition", IEEE/ACM Transactions on Audio, Speech, and Language Processing, 2014, vol. 22(10), pp. 1533-1545.
[12] Y. Zhang, M. Pezeshki, P. Brakel, S. Zhang, C. L. Y. Bengio, and A. Courville, "Towards end-to-end speech recognition with deep convolutional neural networks", arXiv preprint arXiv:1701.02720, 2017.
[13] Y. Zhang, W. Chan, and N. Jaitly, "Very deep convolutional networks for end-to-end speech recognition", In Acoustics, Speech and Signal Processing (ICASSP), 2017, pp. 4845-4849.
[14] Y. Qian and P. C. Woodland, "Very deep convolutional neural networks for robust speech recognition", In Spoken Language Technology Workshop (SLT), 2016, pp. 481-488.
[15] R. Collobert, C. Puhrsch, and G. Synnaeve, "Wav2letter: an end-to-end convnet-based speech recognition system", arXiv preprint arXiv:1609.03193, 2016.
[16] S. Ioffe and C. Szegedy, "Batch normalization: Accelerating deep network training by reducing internal covariate shift", arXiv preprint arXiv:1502.03167, 2015.
[17] S. Kim, T. Hori, and S. Watanabe, "Joint CTC-attention based end-to-end speech recognition using multi-task learning", In Acoustics, Speech and Signal Processing (ICASSP), 2017, pp. 4835-4839.
[18] Y. Wang, X. Deng, S. Pu, and Z. Huang, "Residual convolutional CTC networks for automatic speech recognition", arXiv preprint arXiv:1702.07793, 2017.
[19] C. C. Chiu et al., "State-of-the-art speech recognition with sequence-to-sequence models", arXiv preprint arXiv:1712.01769, 2017.
[20] R. Prabhavalkar, K. Rao, T. N. Sainath, B. Li, L. Johnson, and N. Jaitly, "A comparison of sequence-to-sequence models for speech recognition", In Proc. Interspeech, 2017, pp. 939-943.
[21] M. Hasnat, J. Mowla, and M. Khan, "Isolated and continuous Bangla speech recognition: implementation, performance and application perspective", 2007.
[22] G. Muhammad, Y. A. Alotaibi, and M. N. Huda, "Automatic speech recognition for Bangla digits", In Computers and Information Technology (ICCIT), 2009, pp. 379-383.
[23] A. K. Paul, D. Das, and M. M. Kamal, "Bangla speech recognition system using LPC and ANN", In Advances in Pattern Recognition (ICAPR), 2009, pp. 171-174.
[24] M. A. Ali, M. Hossain, and M. N. Bhuiyan, "Automatic speech recognition technique for Bangla words", International Journal of Advanced Science and Technology, 2013, vol. 50.
[25] M. Hossain, M. Rahman, U. K. Prodha and M. Khan, "Implementation of Back-Propagation Neural Network for Isolated Bangla Speech Recognition", arXiv preprint arXiv:1308.3785, 2013.
[26] H. Mukherjee, S. Phadikar, and K. Roy, "An Ensemble Learning-Based Bangla Phoneme Recognition System Using LPCC-2 Features", In Intelligent Engineering Informatics, Springer, Singapore, 2018, pp. 61-69.
[27] M. M. H. Nahid, B. Purkaystha, and M. S. Islam, "Bengali speech recognition: A double layered LSTM-RNN approach", In Computer and Information Technology (ICCIT), 2017, pp. 1-6.
[28] M. Ahmed, P. C. Shill, K. Islam, M. A. S. Mollah and M. A. H. Akhand, "Acoustic modeling using deep belief network for Bangla speech recognition", In Computer and Information Technology (ICCIT), 2015, pp. 306-311.
[29] R. Pascanu, T. Mikolov, and Y. Bengio, "On the difficulty of training recurrent neural networks", In International Conference on Machine Learning, 2013, pp. 1310-1318.
[30] N. Srivastava, G. Hinton, A. Krizhevsky, I. Sutskever and R. Salakhutdinov, "Dropout: a simple way to prevent neural networks from overfitting", The Journal of Machine Learning Research, 2014, vol. 15(1), pp. 1929-1958.
[31] A. Graves, S. Fernández, F. Gomez and J. Schmidhuber, "Connectionist temporal classification: labelling unsegmented sequence data with recurrent neural networks", In Proceedings of the 23rd International Conference on Machine Learning, 2006, pp. 369-376.
[32] T. N. Sainath et al., "Improvements to deep convolutional neural networks for LVCSR", In Automatic Speech Recognition and Understanding (ASRU), 2013, pp. 315-320.
[33] D. Bahdanau, K. Cho and Y. Bengio, "Neural machine translation by jointly learning to align and translate", arXiv preprint arXiv:1409.0473, 2014.
[34] I. Sutskever, O. Vinyals and Q. V. Le, "Sequence to sequence learning with neural networks", In Advances in Neural Information Processing Systems, 2014, pp. 3104-3112.
[35] S. Hochreiter and J. Schmidhuber, "Long short-term memory", Neural Computation, 1997, vol. 9(8), pp. 1735-1780.
[36] K. Cho, B. Van Merriënboer, C. Gulcehre, D. Bahdanau, F. Bougares, H. Schwenk and Y. Bengio, "Learning phrase representations using RNN encoder-decoder for statistical machine translation", arXiv preprint arXiv:1406.1078, 2014.
[37] R. Jozefowicz, W. Zaremba and I. Sutskever, "An empirical exploration of recurrent network architectures", In International Conference on Machine Learning, 2015, pp. 2342-2350.
[38] IARPA Babel Bengali Language Pack, https://catalog.ldc.upenn.edu/LDC2016S08, Accessed Feb, 2018.
[39] F. Alam, S. M. Habib, D. A. Sultana and M. Khan, "Development of annotated Bangla speech corpora", In Spoken Languages Technologies for Under-Resourced Languages, 2010.
[40] Prothom Alo, http://www.prothomalo.com/, Accessed Feb, 2018.
[41] Somewhereinblog, http://www.somewhereinblog.net/, Accessed Feb, 2018.
[42] Bdnews24, https://bdnews24.com/, Accessed May, 2018.
[43] D. Povey et al., "The Kaldi speech recognition toolkit", In IEEE 2011 Workshop on Automatic Speech Recognition and Understanding (No. EPFL-CONF-192584), IEEE Signal Processing Society.
[44] J. Barker, R. Marxer, E. Vincent and S. Watanabe, "The third 'CHiME' speech separation and recognition challenge: Dataset, task and baselines", In Automatic Speech Recognition and Understanding (ASRU), 2015, pp. 504-511.
[45] S. Wiseman and A. M. Rush, "Sequence-to-sequence learning as beam-search optimization", arXiv preprint arXiv:1606.02960, 2016.
from overfitting”, The Journal of Machine Learning Research, 2014, vol. 15(1), pp.1929-1958. A. Graves, S. Fernández, F. Gomez and J. Schmidhuber, “Connectionist temporal classification: labelling unsegmented sequence data with recurrent neural networks”, In Proceedings of the 23rd international conference on Machine learning, 2006, pp. 369376. T.N. Sainath et al, “Improvements to deep convolutional neural networks for LVCSR”, In Automatic Speech Recognition and Understanding (ASRU), 2013, pp. 315-320. D. Bahdanau, K. Cho and Y. Bengio, “Neural machine translation by jointly learning to align and translate”, arXiv preprint arXiv:1409.0473, 2014. I. Sutskever, O. Vinyals and Q.V. Le, “Sequence to sequence learning with neural networks”, In Advances in neural information processing systems, 2014, pp. 3104-3112. S. Hochreiter and J. Schmidhuber, “Long short-term memory”, Neural computation, 1997, vol. 9(8), pp.1735-1780. K. Cho, B. Van Merriënboer, C. Gulcehre, D. Bahdanau, F. Bougares, H. Schwenk and Y. Bengio, “Learning phrase representations using RNN encoder-decoder for statistical machine translation”, arXiv preprint arXiv:1406.1078, 2014. R. Jozefowicz, W. Zaremba and I. Sutskever, “An empirical exploration of recurrent network architectures”, In International Conference on Machine Learning, 2015, pp. 2342-2350. IARPA Babel Bengali Language Pack, https://catalog.ldc.upenn.edu/LDC2016S08, Accessed Feb, 2018. F. Alam, S.M. Habib, D.A. Sultana and M. Khan, “Development of annotated Bangla speech corpora”, In Spoken Languages Technologies for Under-Resourced Languages, 2010. Prothom Alo, http://www.prothomalo.com/, Accessed Feb, 2018. Somewhereinblog, http://www.somewhereinblog.net/, Accessed Feb 2018. Bdnews24, https://bdnews24.com/, Accessed May, 2018. D. Povey et al, “The Kaldi speech recognition toolkit”, In IEEE 2011 workshop on automatic speech recognition and understanding (No. EPFL-CONF-192584). IEEE Signal Processing Society. J. Barker, R. Marxer, E. Vincent and S. Watanabe, “The third ‘CHiME’speech separation and recognition challenge: Dataset, task and baselines”, In Automatic Speech Recognition and Understanding (ASRU), 2015, pp. 504-511. S. Wiseman and A.M. Rush, “Sequence-to-sequence learning as beam-search optimization”, arXiv preprint arXiv:1606.02960, 2016.