Noise-robust speech recognition based on SNR Dependent Temporal Structure Normalisation

Nathan Sherwood¹, Julian Cardenas Barreras², Eduardo Castillo Guerra¹

¹ Department of Electrical and Computer Eng., UNB, Fredericton, NB, Canada
² Faculty of Electrical Eng., UCLV, Santa Clara, Cuba

[email protected], [email protected], [email protected]

Abstract

We introduce a new feature normalisation technique for noise-robust speech recognition called signal-to-noise ratio (SNR) dependent temporal structure normalisation (TSND). TSND is an adjustment to temporal structure normalisation (TSN) that improves performance in high-noise environments. TSN normalises the trend of the features' power spectral density (PSD). The performance of TSND and other filters was assessed using the SPHINX-III speech recognition system with the TIDIGITS speech corpus. Data was corrupted with eight noise types taken from the NOIZEUS database at six different noise levels (SNRs from -5 dB to 20 dB). Results show that TSND performs better than the other filters in extreme noise conditions, reducing word error rate (WER) by a relative 9.88% over mean-variance normalisation (MVN) at -5 dB.

Index Terms: robust automatic speech recognition, feature normalisation, temporal filter

1. Introduction

Automatic speech recognition (ASR) has been the focus of a great deal of research, and its performance has improved to the point where it is extremely effective in clean environments. Realistic applications, however, do not have the luxury of clean data. In noise-filled environments the performance of ASR degrades rapidly: additive and convolutional noise makes such applications painful to use and leaves users frustrated. Consequently, there has also been a significant amount of research on increasing the robustness of ASR. Some approaches use a single technique, others a combination of techniques. These techniques seek to enhance or normalise speech features; some apply pre-processing or post-processing, and some employ adaptive models. A very common and effective means of improving performance is normalisation: by normalising the extracted features we can reduce the effect of noise. Common techniques include cepstral mean normalisation (CMN) [1], cepstral variance normalisation (CVN) [2], and histogram equalisation (HEQ) [3]. These techniques are particularly effective because they are simple, easy to implement, and independent of noise statistics; they work well regardless of the signal-to-noise ratio (SNR) or noise environment [1, 2, 3]. To achieve better performance, these techniques are typically combined with filters such as the relative spectra (RASTA) filter [4] or an autoregressive moving average (ARMA) filter, as in MVA processing [5]. A RASTA filter enhances the modulation spectra that correlate with speech intelligibility while attenuating others. The MVA filter acts as a low-pass filter and has been shown to yield very good results across all noise levels. While these methods bring improvements, performance remains extremely poor in low-SNR environments. One of the more recent techniques to show good results is temporal structure normalisation (TSN) [6]. In this paper we propose a modification to the TSN filter that uses knowledge of the current SNR to improve ASR in noisy environments.

2. Temporal Structure Normalisation

TSN is a technique for reducing the effect of acoustic noise. It was observed that changes in the SNR and other environmental conditions affect the PSD functions of the feature trajectories in various ways [6]. In creating the TSN filter we attempt to smooth the cepstra to better reflect how clean features vary [7]. This is done by normalising only the trend of the PSD function of an utterance toward that of a reference PSD. The process is as follows. A reference PSD, P_{ref,j}(k), is calculated for every feature, yielding one PSD function per cepstral coefficient j, where k is the modulation frequency. For the utterance that TSN is being applied to, test PSD functions P_{test,j}(k) are calculated in the same way. A TSN filter is created for each feature independently, so we reduce the notation to P_{ref}(k) and P_{test}(k) for the cepstral coefficient currently being filtered. We then estimate the desired magnitude response of the filter as

    |H(k)| = \sqrt{P_{ref}(k) / P_{test}(k)}        (1)

At this point one may choose to mix the response with that of some other filter, such as RASTA or ARMA, by simply multiplying the two magnitude responses:

    |H'(k)| = |H(k)| |G(k)|        (2)

where |H'(k)| is the resulting combined response and |G(k)| is the response of the alternative filter. As TSN is implemented with FIR filters, the alternative magnitude response |G(k)| must also be realisable with an FIR filter [7].

Normalising the features completely would be ineffective, as we would lose the ability to distinguish spoken content: without mixing filters the output would satisfy |H(k)|^2 P_{test}(k) = P_{ref}(k) = P_{out}(k). We therefore seek to normalise only the trend of the PSD function, removing the overall shape related to the environment while leaving spoken content intact. To do this we truncate and scale the filter weights, taking a fixed number of weights from either side of the inverse fast Fourier transform (IFFT) of the magnitude response. For the purposes of this paper we keep 30 weights. These weights are then put through a Hanning window and normalised to an overall gain of unity.

To obtain the PSD functions we use the Burg algorithm, which fits an autoregressive (AR) prediction model to estimate the PSD. This yields a smooth PSD, and hence a magnitude response smoother than the ideal response in (1), which is what allows the filter to normalise only the trend.

2.1. SNR Dependent Temporal Structure Normalisation

In building the reference PSD for TSN we use a large number of clean utterances: we calculate the PSD of each feature for each of these utterances and then average them, leaving j average reference PSD functions. The purpose of this averaging is not only to obtain a more general temporal structure but also to remove or mitigate the effects of spoken content. In calculating the test PSD we do not have this luxury. However, if we have knowledge of the kind of environment affecting the utterance, we can use that information to obtain a better filter. With information on the SNR we can build noisy reference PSD functions specific to certain noise levels: by averaging the PSDs of many noisy training samples we obtain a more accurate representation of the general temporal structure that we are attempting to normalise. This is the idea behind SNR Dependent Temporal Structure Normalisation (TSND).

We accomplish this process in two stages. In the first stage we normalise the temporal structure based on the SNR level, in analogy with equation (1):

    |R(k)| = \sqrt{P_{ref}(k) / P_{SNR}(k)}        (3)

where P_{SNR}(k) denotes the PSD averaged over utterances at a specific noise level, and |R(k)| is the first-stage magnitude response. The filter is determined as in TSN, although the truncation and scaling steps could be omitted, depending on the accuracy of the SNR estimate. To mitigate the loss in performance that a mismatch in SNR could cause (the exact SNR is rarely met), we include the same truncation and scaling routine as TSN. If we denote by |R'(k)| the magnitude response of the filter obtained by shortening and scaling |R(k)|, then the output can be written as P_{out}(k) = |R'(k)|^2 P_{SNR}(k). The second stage applies a standard TSN filter after the test utterance has been filtered as outlined above: since the first stage normalises toward an averaged SNR level, the trend of the specific utterance still needs to be normalised.
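To make the two-stage procedure concrete, the sketch below applies equations (1)-(3) to a single cepstral trajectory. This is a minimal illustration under our own naming, not the authors' code: burg_psd is a textbook Burg estimator with the order and FFT size reported in Section 3.3, design_filter performs the truncation, Hanning windowing, and gain scaling described above, and the reference PSDs p_ref and p_snr are assumed to have been averaged from training data beforehand.

```python
import numpy as np

def burg_psd(x, order=15, nfft=256):
    """Two-sided AR (Burg) PSD of one cepstral trajectory x.
    Order 15 and 256 bins match the settings in Section 3.3."""
    x = np.asarray(x, dtype=float)
    a = np.array([1.0])                  # AR polynomial A(z), a[0] = 1
    e = np.dot(x, x) / len(x)            # prediction error power
    f, b = x[1:].copy(), x[:-1].copy()   # forward / backward errors
    for _ in range(order):
        k = -2.0 * np.dot(f, b) / (np.dot(f, f) + np.dot(b, b))
        a = np.concatenate([a, [0.0]]) + k * np.concatenate([[0.0], a[::-1]])
        e *= 1.0 - k * k
        f, b = f[1:] + k * b[1:], b[:-1] + k * f[:-1]
    return e / np.abs(np.fft.fft(a, nfft)) ** 2

def design_filter(p_num, p_den, n_taps=30):
    """FIR weights approximating sqrt(p_num / p_den): IFFT of the
    magnitude response, truncated to n_taps around lag zero,
    Hanning-windowed, and scaled to unit DC gain."""
    h = np.real(np.fft.ifft(np.sqrt(p_num / p_den)))
    half = n_taps // 2
    h = np.concatenate([h[-half:], h[:half]])  # weights from either side
    h *= np.hanning(n_taps)
    return h / np.sum(h)                       # overall gain of unity

def tsnd(x, p_ref, p_snr):
    """Two-stage TSND for one cepstral trajectory x.
    p_ref: averaged clean reference PSD for this feature;
    p_snr: averaged noisy reference PSD for the estimated SNR level."""
    # Stage 1: normalise toward the SNR-specific reference, Eq. (3).
    x = np.convolve(x, design_filter(p_ref, p_snr), mode="same")
    # Stage 2: standard TSN on the stage-1 output, Eq. (1).
    return np.convolve(x, design_filter(p_ref, burg_psd(x)), mode="same")
```

Dropping stage 1 recovers plain TSN; mixing with another filter as in equation (2) would simply multiply the two magnitude responses before the IFFT inside design_filter.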

3. Experiments

3.1. Speech Corpus

This paper uses real-world noise from the NOIZEUS database [8] mixed into the TIDIGITS speech database [9] to create its own speech corpus. The TIDIGITS database is fairly large, consisting of over 25 thousand digit sequences spoken by over 300 men, women, and children. For our purposes we omitted the children and used only the utterances spoken by the 111 men and 114 women. Training and testing sets were created directly following the original TIDIGITS splits, assigning approximately half of the men and women to each. The TIDIGITS database was downsampled to 8 kHz from its original 20 kHz. These clean files were filtered with a modified Intermediate Reference System (IRS) as specified by ITU-T P.862 [10]. This filtering allowed us to preserve the spectrum of the speech signal while mixing in the noise extracted from NOIZEUS.

The NOIZEUS database contains 30 sentences from the IEEE sentence database [8, 11]. These sentences have been corrupted with noise from 8 different real-world environments, namely Babble, Car, Exhibition Hall, Restaurant, Street, Airport, Train Station, and Train, all obtained from the AURORA database [12], at 4 different levels: 0 dB, 5 dB, 10 dB, and 15 dB SNR. The noise was extracted by subtracting the clean utterances from the noisy ones. This noise was then mixed into the TIDIGITS database for all 8 real-world environments across 6 SNR levels: -5 dB, 0 dB, 5 dB, 10 dB, 15 dB, and 20 dB. For each TIDIGITS utterance the signal energy was computed so that the noise could be scaled according to the desired SNR.
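As an illustration of this mixing step, the snippet below scales a noise segment so that the resulting mixture has a target SNR. The function name and the tiling of short noise segments are our own assumptions; the paper only states that per-utterance energy was computed to set the noise level.

```python
import numpy as np

def mix_at_snr(speech, noise, snr_db):
    """Add noise to a clean utterance at a target SNR (in dB)."""
    # Tile the noise if it is shorter than the utterance (an assumption;
    # the paper does not describe how segment lengths were matched).
    reps = int(np.ceil(len(speech) / len(noise)))
    noise = np.tile(noise, reps)[:len(speech)]
    e_speech = np.sum(speech.astype(float) ** 2)
    e_noise = np.sum(noise.astype(float) ** 2)
    # SNR_dB = 10*log10(e_speech / (g^2 * e_noise)); solve for the gain g.
    g = np.sqrt(e_speech / (e_noise * 10.0 ** (snr_db / 10.0)))
    return speech + g * noise
```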

3.2. Speech Recognition System

CMU's SPHINX-3 speech recognition system [13] was used to perform all experiments. The system follows the hidden Markov model (HMM) approach, trained from acoustic features. For these experiments utterances were divided into segments 25 ms long with a frame rate of 100 frames/sec. In accordance with the default parameters of SPHINX, pre-emphasis filtering and a Hamming window were applied to each frame. Features were extracted as 16 Mel-frequency cepstral coefficients (MFCCs) (power + 15 cepstra) with deltas and double deltas appended, yielding a 48-dimensional feature vector. Digit context-dependent triphones were modelled using 3-state HMMs with 250 tied states and eight Gaussian mixtures per state. We used a trigram language model, with the language weight and word insertion probability experimentally set to 12 and 0.1, respectively.
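For readers unfamiliar with the construction of the 48-dimensional vectors, the sketch below appends delta and double-delta coefficients to a (frames x 16) matrix of static MFCCs using the standard regression formula. The window half-width n=2 is our assumption; the paper relies on SPHINX's internal computation and does not state it.

```python
import numpy as np

def add_deltas(static, n=2):
    """Append deltas and double deltas to static MFCCs (frames x 16),
    yielding 48-dimensional feature vectors."""
    def delta(feat):
        # Regression over a window of +/- n frames, edges replicated.
        padded = np.pad(feat, ((n, n), (0, 0)), mode="edge")
        denom = 2.0 * sum(i * i for i in range(1, n + 1))
        num = sum(i * (padded[n + i:n + i + len(feat)] -
                       padded[n - i:n - i + len(feat)])
                  for i in range(1, n + 1))
        return num / denom
    d = delta(static)
    return np.hstack([static, d, delta(d)])   # shape (frames, 48)
```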

Figure 1: Average Matched Results

3.3. Feature Normalisation Techniques

Features are extracted with sphinx_fe, which is part of the SPHINX system; with it we extract the 16 static features. The deltas and double deltas are calculated during training and decoding, as is standard for SPHINX. The static features are then modified using Matlab. First the features are normalised with MVN; then they are processed with one of the temporal filters: RASTA, MVA, TSN, or TSND. MVN was applied on a per-utterance basis. RASTA was implemented as outlined in [4]. MVA processing [5] used an ARMA order of 2, which we found to give the best performance. For TSN we estimate the PSD with the Burg method, using an autoregressive model of order 15, and compute the two-sided PSD over 256 samples, ensuring sufficient resolution. Clean reference PSD functions were calculated from our clean training set with MVN applied (8623 utterances) by averaging the PSD functions of every utterance. The TSND clean reference PSD was created the same way; its additional noise-dependent PSD functions were created by averaging the PSD functions of utterances from the training sets of four noise environments (Airport, Babble, Car, and Train), 33892 utterances in total, at each specific SNR level.
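The following sketch shows how the per-utterance MVN and the averaged reference PSDs described here fit together. It reuses burg_psd from the sketch in Section 2.1; the function names are ours, not part of any toolkit.

```python
import numpy as np

def mvn(features, eps=1e-8):
    """Per-utterance mean and variance normalisation (frames x dims)."""
    return (features - features.mean(axis=0)) / (features.std(axis=0) + eps)

def average_reference_psds(utterances, order=15, nfft=256):
    """Average the Burg PSD of every feature trajectory over a set of
    MVN-processed utterances, giving one reference PSD per feature.
    For TSND, restricting `utterances` to the noisy training data at one
    SNR level yields that level's noise-dependent reference."""
    n_feats = utterances[0].shape[1]
    acc = np.zeros((n_feats, nfft))
    for utt in utterances:
        feats = mvn(utt)
        for j in range(n_feats):
            acc[j] += burg_psd(feats[:, j], order, nfft)
    return acc / len(utterances)
```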

Figure 2: Average difference from matched to single-model performance in untrained noise environments

3.4. Test Methodology

In order to evaluate the performance of these techniques we devised three training and testing scenarios:

• Matched: Trained and tested using the same noise environment and noise level
• Single Model: Trained using a subset of environments and tested on each environment separately
• Multi-condition Single Model: Trained using a subset of environments and multiple noise levels

The matched scenario gives a good overall evaluation of how the methods perform under each noise environment and noise level. The single-model scenario evaluates how the techniques handle unforeseen noise conditions: for each SNR level we train our models on a subset of the noise environments, namely Airport, Babble, Car, and Train. The difference from the matched scenario in the remaining environments indicates how well the techniques work in unforeseen environments. We also included multi-condition training, as it typically improves performance.

3.5. Results and Discussion

Figure 1 shows the results of training and testing under matched noise environments and noise levels. In our noisiest test scenario, -5 dB, MVA, TSN, and TSND improved on MVN by 7.42%, 7.78%, and 9.88%, respectively; TSND thus performed the best at the noisier SNRs. TSND also consistently improves on TSN, even yielding slight improvements at the relatively noise-free 20 dB level.

Table 1: Average difference in WER between Single Model and Multi-condition Single Model (a positive number indicates a reduced WER with multi-condition training)

Method     | -5 dB | 0 dB  | 5 dB  | 10 dB | 15 dB | 20 dB
MVN        | -1.09 | -0.31 | -0.71 | -0.24 | -0.18 |  0.05
MVN+R      |  1.13 |  0.83 | -0.20 | -0.24 | -0.25 |  0.00
MVN+MVA    |  1.66 | -0.88 | -0.45 | -0.33 | -0.09 | -0.04
MVN+TSN    |  2.60 |  0.42 | -0.36 | -0.20 | -0.03 | -0.15
MVN+TSND   |  1.21 |  0.31 | -0.45 | -0.19 |  0.00 | -0.18

While traditional TSN lags behind MVA at most other noise levels, our adjustment, TSND, tends to raise performance to the level of MVA.

Figure 2 shows the difference between single-model and matched results for the noise environments that were not used in training: Exhibition Hall, Restaurant, Street, and Train Station. TSN and TSND are very consistent in untrained noise environments, whereas MVA and RASTA show a larger amount of variation. The performance of TSN(D) can therefore be relied upon even in unforeseen noise environments. MVA, despite being a very powerful algorithm, ended up being the least reliable, producing both very poor and very good results in unseen noise.

By moving to a multi-condition approach we were able to improve results by 1.65% WER on average in -5 dB noise (Table 1). MVA, TSN, and TSND improved by 1.66%, 2.60%, and 1.21%, respectively, again in -5 dB noise. In less noisy environments, 5 dB and better, the multi-condition approach had a very slight, almost negligible, negative impact on results.

4. Conclusions

In this paper we have tested several popular feature normalisation techniques for noise-robust speech recognition. We have shown that TSN is comparable to the very effective MVA technique, especially with the TSND adjustment. TSN(D) has been shown to be a viable option for noise-robust speech recognition and has proved reliable, even in unforeseen noise environments. As no single method is the best performer at every noise level, and techniques like multi-condition training tend to help only at high noise levels, in future work we would like to investigate a switching mechanism that selects among multiple models.

5. References

[1] S. Furui, "Cepstral analysis technique for automatic speaker verification," IEEE Trans. Acoustics, Speech and Signal Processing, vol. 29, no. 2, 1981.
[2] O. Viikki and K. Laurila, "Cepstral domain segmental feature vector normalization for noise robust speech recognition," Speech Communication, vol. 25, 1998.
[3] S.-H. Lin, Y.-M. Yeh, and B. Chen, "A Comparative Study of Histogram Equalization (HEQ) for Robust Speech Recognition," Computational Linguistics and Chinese Language Processing, vol. 12, no. 2, June 2007.
[4] H. Hermansky and N. Morgan, "RASTA Processing of Speech," IEEE Transactions on Speech and Audio Processing, vol. 2, no. 4, Oct 1994.
[5] C.-P. Chen and J. A. Bilmes, "MVA Processing of Speech Features," IEEE Transactions on Audio, Speech, and Language Processing, vol. 15, no. 1, Jan 2007.
[6] X. Xiao, E. S. Chng, and H. Li, "Evaluating the Temporal Structure Normalisation Technique on the Aurora-4 Task," Interspeech, pp. 1070-1073, 2007.
[7] X. Xiao, E. S. Chng, and H. Li, "Normalization of the Speech Modulation Spectra for Robust Speech Recognition," IEEE Transactions on Audio, Speech, and Language Processing, vol. 16, no. 8, Nov 2008.
[8] Y. Hu and P. Loizou, "Subjective evaluation and comparison of speech enhancement algorithms," Speech Communication, vol. 49, pp. 588-601, 2007.
[9] Texas Instruments Incorporated, "A database for speaker-independent digit recognition," in Proc. IEEE ICASSP, Dallas, Texas, 1984.
[10] International Telecommunication Union, "Perceptual evaluation of speech quality (PESQ): An objective method for end-to-end speech quality assessment of narrow-band telephone networks and speech codecs," ITU-T Recommendation P.862, 2001. Online: ftp://ftp.signalogic.com/documentation/PESQ/P862E.pdf, accessed 20 Nov 2012.
[11] IEEE Subcommittee, "IEEE Recommended Practice for Speech Quality Measurement," IEEE Trans. Audio and Electroacoustics, vol. AU-17, no. 3, pp. 225-246, 1969.
[12] H. Hirsch and D. Pearce, "The Aurora Experimental Framework for the Performance Evaluation of Speech Recognition Systems under Noisy Conditions," ISCA ITRW ASR2000, Paris, France, 18-20 Sep 2000.
[13] R. Singh, "The Sphinx Speech Recognition Systems," 10 Oct 2003. Online: http://www.cs.cmu.edu/~rsingh/homepage/sphinx_history.html, accessed 20 Nov 2012.