INTERSPEECH 2010, 26-30 September 2010, Makuhari, Chiba, Japan
Autocorrelation and Double Autocorrelation Based Spectral Representations for a Noisy Word Recognition System

Tetsuya Shimamura, Nguyen Ngoc Dinh
Graduate School of Science and Engineering, Saitama University, Japan
[email protected], [email protected]

Abstract

Two methods of spectral analysis for noisy speech recognition are proposed and tested in a speaker-independent word recognition experiment under an additive white Gaussian noise environment. One applies Mel-frequency cepstral coefficient (MFCC) spectral analysis to the autocorrelation sequence of the speech signal, and the other applies MFCC spectral analysis to its double autocorrelation sequence. The word recognition experiment shows that both proposed methods achieve better results than conventional MFCC spectral analysis applied to the input speech signal itself.
Copyright © 2010 ISCA

Index Terms: word recognition, double autocorrelation, noise

1. Introduction

Speech recognition is used in many systems. In an environment free of noise, high recognition accuracy can be achieved; in noisy environments, however, performance is commonly poor. To improve the performance of recognition systems in noisy environments, we usually need a spectral representation more robust than the linear predictive coding (LPC) representation that is widely used in speech processing fields such as speech coding and speech synthesis. Mansour and Juang [2] proposed a method called the short-time modified coherence (SMC) spectral representation, and Hernando and Nadeu [3] later extended it to the one-sided autocorrelation linear predictive coding (OSALPC) spectral representation. Instead of the speech signal sequence, both apply a linear prediction (LP) filter to the autocorrelation sequence to obtain the spectral feature. Shannon and Paliwal [4] proposed the autocorrelation Mel-frequency cepstral coefficient (AMFCC) method, which uses the higher-lag autocorrelation sequence as the input of Mel-frequency filter bank analysis to obtain a Mel-frequency cepstral coefficient (MFCC) spectral representation. All of the above methods rely on the robustness of the autocorrelation function against noise to improve the robustness of the spectral representation for speech recognition, and recognizers using these spectral representations performed better than typical MFCC or LPC analysis applied directly to the speech signal. Moreover, McGinn and Johnson [5] showed that successively applying the autocorrelation to the speech signal can increase the signal-to-noise ratio (SNR) beyond that of a single autocorrelation. The present paper proposes two methods that use the autocorrelation and double autocorrelation sequences as the input of Mel-frequency filter bank analysis to obtain MFCC spectral features, and applies these spectral representations to a word recognition system.

This paper is organized as follows. Section 2 describes the double autocorrelation function. Its spectral representation estimate is presented in Section 3. The robustness of the proposed spectral representations is demonstrated in Section 4 by comparison with the conventional MFCC method. Conclusions are drawn in Section 5.

2. Autocorrelation and Double Autocorrelation Function

2.1. Un-windowed Type of Autocorrelation Function

For the speech signal frame s(t), t = 0, 1, ..., N-1, the un-windowed type of autocorrelation function (ACF) is defined as

    r_s(i) = Σ_{j=0}^{N-1} s(j) s(j+i),   i = 0, ..., N-1        (1)

When s(t) is corrupted by additive white Gaussian noise, the noisy signal is given by

    x(t) = s(t) + n(t),   t = 0, ..., N-1        (2)

where n(t) denotes the additive white Gaussian noise. Applying (1) to x(t) gives
    r_x(i) = Σ_{j=0}^{N-1} x(j) x(j+i)
           = Σ_{j=0}^{N-1} s(j) s(j+i) + Σ_{j=0}^{N-1} s(j) n(j+i)
             + Σ_{j=0}^{N-1} n(j) s(j+i) + Σ_{j=0}^{N-1} n(j) n(j+i)
           = r_s(i) + r_sn(i) + r_ns(i) + r_n(i),   i = 0, ..., N-1        (3)

where r_n(i) is the autocorrelation sequence of n(t), and r_sn(i) + r_ns(i) is the cross-correlation of signal and noise.
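As a numerical check of the decomposition in (3), the following sketch computes the un-windowed ACF of a noisy frame and verifies that it splits exactly into the four terms. The frame and noise here are illustrative stand-ins, not the paper's data:

```python
import numpy as np

# Synthetic stand-ins for one speech frame and its additive noise
# (illustrative assumptions, not the paper's database).
N = 160
t = np.arange(N)
s = np.exp(-t / 80.0) * np.sin(2 * np.pi * 0.05 * t)
rng = np.random.default_rng(0)
n = 0.1 * rng.standard_normal(N)
x = s + n

def corr(u, v):
    """Un-windowed correlation sum_{j=0}^{N-1} u(j) v(j+i) for lags
    i = 0..N-1, with v zero-padded beyond the end of the frame."""
    N = len(u)
    padded = np.concatenate([v, np.zeros(N)])
    return np.array([np.dot(u, padded[i:i + N]) for i in range(N)])

r_x = corr(x, x)                                   # left-hand side of (3)
r_sum = corr(s, s) + corr(s, n) + corr(n, s) + corr(n, n)
# r_x = r_s + r_sn + r_ns + r_n holds term by term, up to rounding error.
```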
Because signal and noise are uncorrelated, and because the energy of the autocorrelation sequence of additive white Gaussian noise is concentrated at zero lag, the un-windowed type of autocorrelation function (with the zero lag eliminated) improves the SNR compared with the speech signal in the time domain. Instead of applying linear prediction (LP) analysis directly to the speech signal, the conventional SMC [2] and OSALPC [3] methods use this autocorrelation sequence as the input of LP analysis to estimate the spectral feature; using these spectral representations in a speech recognition system gave high recognition accuracy in noisy environments. The conventional AMFCC method [4], on the other hand, applies MFCC spectral analysis to this autocorrelation sequence with the first 2 ms of lags eliminated (the so-called higher-lag autocorrelation function), and is more noise-robust than not only MFCC analysis of the speech signal but also the SMC and OSALPC methods that apply LPC analysis to the autocorrelation sequence.
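The zero-lag concentration that this argument relies on is easy to observe numerically. A minimal sketch (a synthetic sinusoidal "speech" frame and white Gaussian noise are assumptions for illustration) compares the noise ACF at lag 0 with the remaining lags, and the autocorrelation-domain SNR before and after eliminating the zero lag:

```python
import numpy as np

rng = np.random.default_rng(1)
N = 160
t = np.arange(N)
s = np.sin(2 * np.pi * 0.05 * t)     # stand-in for a voiced speech frame
n = rng.standard_normal(N)           # additive white Gaussian noise

def acf(v):
    """Un-windowed ACF, lags 0..N-1, zero-padded past the frame end."""
    N = len(v)
    padded = np.concatenate([v, np.zeros(N)])
    return np.array([np.dot(v, padded[i:i + N]) for i in range(N)])

r_s, r_n = acf(s), acf(n)

# The noise ACF peaks sharply at lag 0 (r_n(0) is the total noise energy).
peak_is_zero_lag = np.argmax(np.abs(r_n)) == 0

# Autocorrelation-domain SNR with and without the zero lag: dropping lag 0
# removes most of the noise energy but little of the signal energy.
snr_all = np.sum(r_s ** 2) / np.sum(r_n ** 2)
snr_no_zero = np.sum(r_s[1:] ** 2) / np.sum(r_n[1:] ** 2)
```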
2.2. Double Autocorrelation Function

The double autocorrelation function is the autocorrelation function applied to the autocorrelation sequence of the speech signal, defined by the following formula:

    wr_s(i) = Σ_{j=1}^{N-1} r_s(j) r_s(j+i),   i = 0, ..., N-1        (4)

where r_s(i) is the autocorrelation function of the speech signal defined by (1). For the noisy speech signal x(t), the double autocorrelation function becomes

    wr_x(i) = Σ_{j=1}^{N-1} r_x(j) r_x(j+i),   i = 0, ..., N-1        (5)

where r_x(i) is the autocorrelation function of x(t) defined by (3). Note that in (5), to achieve the SNR improvement, the zero-lag coefficient r_x(0) of the autocorrelation sequence is not used in computing the double autocorrelation function. Substituting r_x(i) from (3) into (5) shows that, as in the autocorrelation domain, the noise part in the double autocorrelation domain is composed of a pure noise term (the double autocorrelation function of the noise) and a signal-noise cross term (the cross-correlation between signal and noise). Since the noise in the autocorrelation domain, owing to the noise-signal cross term, is no longer white, the double autocorrelation function, which applies the autocorrelation function to the speech autocorrelation sequence, cannot achieve as large an SNR improvement as applying the autocorrelation function to the speech signal sequence. However, the signal-noise cross term in the double autocorrelation domain becomes more correlated with the signal than in the autocorrelation domain [5]; in the frequency domain, the spectrum of the noise part becomes more correlated with the spectrum of the signal part, so a reduction of spectral distortion can be achieved. Figure 1 compares the correlation between the noise part and the signal part of one real speech frame in the double autocorrelation domain with that in the autocorrelation domain (waveform in the time domain and power spectral density in the frequency domain) under a noisy environment at an SNR of 5 dB. From these figures, it can be observed that an SNR improvement is achieved in both the autocorrelation and double autocorrelation domains, and that the spectral distortion is much smaller in the double autocorrelation domain than in the autocorrelation domain.
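The two definitions above translate directly into code. This sketch (again with a synthetic frame as an illustrative assumption) computes the single autocorrelation as in (3) and then the double autocorrelation (5), starting the inner sum at j = 1 so that the zero lag r_x(0) is excluded:

```python
import numpy as np

def acf(v, start_lag=0):
    """r(i) = sum_{j=start_lag}^{N-1} v(j) v(j+i) for lags i = 0..N-1,
    with v zero-padded beyond the end of the frame."""
    N = len(v)
    padded = np.concatenate([v, np.zeros(N)])
    return np.array([np.dot(v[start_lag:], padded[start_lag + i:i + N])
                     for i in range(N)])

rng = np.random.default_rng(2)
N = 160
t = np.arange(N)
x = np.sin(2 * np.pi * 0.05 * t) + 0.3 * rng.standard_normal(N)

r_x = acf(x)                   # single autocorrelation, as in (3)
wr_x = acf(r_x, start_lag=1)   # double autocorrelation (5): j runs from 1
```

Setting start_lag=1 is exactly the zero-lag elimination discussed above; with start_lag=0 the same routine reproduces (1).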
Figure 1: Noise part, clean signal, and noisy signal in the autocorrelation domain ((a) waveform, (b) spectrum) and the double autocorrelation domain ((c) waveform, (d) spectrum).

3. Autocorrelation and Double Autocorrelation Function Spectral Representation

This section presents the estimation of MFCC features on the autocorrelation and double autocorrelation sequences of the speech signal. For a speech recognition system, the most widely used spectral feature is MFCC, and in [4] MFCC spectral analysis on the autocorrelation sequence showed more robust performance than the conventional LPC method. Because of this robustness, we use the MFCC method here for spectral analysis on the autocorrelation and double autocorrelation sequences. We call the two resulting spectral representations SA-MFCC and WA-MFCC, respectively (SA stands for single autocorrelation, to distinguish it from double autocorrelation, which is denoted WA). Flow charts of the SA-MFCC and WA-MFCC spectral estimations are shown in Figure 2.
In the SA-MFCC spectral estimation, the speech signal is first divided into frames of 2N samples with a frame shift of M = N/2 samples, using a rectangular window. An autocorrelation sequence of N samples is then computed. For the WA-MFCC spectral estimation, a frame of 4N samples is used to compute a double autocorrelation sequence of N samples. Before the subsequent MFCC spectral analysis, we apply a pre-emphasis filter [6] followed by a window function. For the pre-emphasis step, a pre-emphasis coefficient of 0.975 is used. To verify the effect of the window function's dynamic range on the recognition results, we use two types of window: the Hamming window and the double dynamic range (DDR) Hamming window. The DDR Hamming window, as defined in [4], is the normalized unbiased autocorrelation function of a Hamming window (so its dynamic range is about twice that of the Hamming window). Note that, unlike the AMFCC method in [4], we apply the pre-emphasis filter not directly to the input speech signal but after obtaining the SNR improvement by the double autocorrelation function, since experimentally a pre-emphasis filter appears to degrade recognition performance in low-SNR noisy environments.

Figure 2: SA-MFCC (left) and WA-MFCC (right) spectral estimation.

4. Word Recognition Experiment

4.1. Speech Database

In our experiments, the isolated-digits part of the TIDIGITS database [8] is used. This part of the database includes the 10 digits and the word /oh/, each spoken twice per speaker. The training data comprise 94 speakers (37 men and 57 women) and the test data comprise 113 speakers (56 men and 57 women). The sampling rate used here is 8 kHz. To make noisy signals for the test phase, artificial white Gaussian noise is added to the clean test data so that the resulting noisy test signals have SNRs from 20 dB to -5 dB in 5 dB steps.

4.2. Spectral Feature Analysis

In our word recognition experiment, the robustness of the SA-MFCC and WA-MFCC spectral representations is compared with that of MFCC analysis applied to the speech signal, for which the frame length is N samples. For speech sampled at 8 kHz, N = 160 is used, so the frame lengths for MFCC, SA-MFCC, and WA-MFCC are 20 ms, 40 ms, and 80 ms, respectively. The number of MFCC coefficients is 12, common to all the methods. Once the MFCC coefficients are estimated for each frame by one of the above methods, a vector composed of the MFCC coefficients and their delta and acceleration coefficients is used as the spectral feature vector input to the recognition system.

4.3. Word Recognition System

For the word recognition system, the HTK toolkit [7] is used. In the training phase, one HMM of five states with four-component Gaussian mixture observation densities is trained for each digit using that digit's training data. Each test utterance is then scored against the 11 digit HMMs to find the best-matching digit. The total recognition accuracy rate is used to evaluate the performance of each method.

4.4. Results

Here, the recognition results are presented. Table 1 and Figure 3 show the results when the Hamming window is used for the two proposed methods, and Table 2 and Figure 4 show the results when the DDR Hamming window is used for the two proposed methods.

Table 1. Recognition accuracy (%) with Hamming window.

SNR (dB)   MFCC    SA-MFCC   WA-MFCC
Clean      99.60   99.12     93.68
20         92.56   97.14     92.29
15         83.43   95.17     91.43
10         61.02   89.06     89.62
5          45.41   70.15     82.70
0          26.35   43.56     60.78
-5         12.11   24.22     42.84

Figure 3: Recognition accuracy with Hamming window.
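For reference, the DDR Hamming window used above can be sketched as the unbiased autocorrelation of a Hamming window; the peak normalization applied here is an assumption about the exact form in [4]:

```python
import numpy as np

def ddr_hamming(N):
    """DDR Hamming window: unbiased ACF of a Hamming window, normalized
    so that the zero lag equals 1 (normalization choice is an assumption)."""
    h = np.hamming(N)
    # Unbiased ACF: divide lag i by its number of overlapping terms, N - i.
    r = np.array([np.dot(h[:N - i], h[i:]) / (N - i) for i in range(N)])
    return r / r[0]

w = ddr_hamming(160)
ham = np.hamming(160)

# Dynamic range in dB: larger for the DDR window than for the plain
# Hamming window, which is the property the experiment exploits.
dr_ham = 20 * np.log10(ham.max() / ham.min())
dr_ddr = 20 * np.log10(w.max() / w.min())
```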
Table 2. Recognition accuracy (%) with DDR Hamming window.

SNR (dB)   SA-MFCC   WA-MFCC
Clean      99.36     99.36
20         97.95     97.91
15         95.29     95.98
10         86.85     89.66
5          66.98     77.59
0          45.21     59.90
-5         21.44     35.08

Figure 4: Recognition accuracy with DDR Hamming window.

From these results, it is observed that with the Hamming window, the SA-MFCC and WA-MFCC methods outperform the MFCC method in noisy environments at SNRs of 15 dB and below. In noiseless and high-SNR environments, both the SA-MFCC and WA-MFCC methods give worse recognition rates than the MFCC method. This is because the dynamic range of the spectrum used in both methods is enlarged compared with the MFCC spectrum, so the Hamming window is not suitable for either method. With the DDR Hamming window, whose dynamic range is about twice as large, the results of both the SA-MFCC and WA-MFCC methods in noiseless and high-SNR environments become comparable to those of the MFCC method, while the improvement in low-SNR environments remains. Compared with the SA-MFCC method, the WA-MFCC method achieves better recognition results, especially in noisy environments at SNRs of 10 dB and below. This is because in the double autocorrelation domain the noise part is more correlated with the signal part than in the autocorrelation domain, as shown in Section 2.

5. Conclusions

In this paper, two robust speech spectral representations have been proposed for speech recognition systems under additive white Gaussian noise environments: the SA-MFCC and WA-MFCC methods, which apply MFCC analysis to the autocorrelation sequence and the double autocorrelation sequence of the speech signal, respectively. We tested both proposed methods in a speaker-independent word recognition experiment. Both methods give better results than the conventional MFCC method applied to the speech signal in low-SNR noisy environments. In particular, with the DDR Hamming window function, both methods achieved recognition results comparable to the MFCC method in noiseless environments and outperformed the MFCC method under additive white Gaussian noise. In our experiment, we tested the proposed methods only for additive white Gaussian noise. For other wideband noises whose energy in the autocorrelation domain concentrates at the lower lags, the same results may still be achievable by eliminating more of the lower lags, rather than only the zero-lag coefficient, when computing the autocorrelation function, as done in [4].

6. References

[1] L. R. Rabiner and B. H. Juang, Fundamentals of Speech Recognition, Prentice-Hall, Englewood Cliffs, 1993.
[2] D. Mansour and B. H. Juang, "The short-time modified coherence representation and its application for noisy speech recognition," IEEE Trans. Acoust., Speech, Signal Processing, vol. 37, no. 6, pp. 795-804, 1989.
[3] J. Hernando and C. Nadeu, "Linear prediction of the one-sided autocorrelation sequence for noisy speech recognition," IEEE Trans. Speech and Audio Processing, vol. 5, no. 1, pp. 80-84, 1997.
[4] B. J. Shannon and K. K. Paliwal, "Feature extraction from higher-lag autocorrelation coefficients for robust speech recognition," Speech Communication, vol. 48, no. 11, pp. 1458-1485, 2006.
[5] D. P. McGinn and D. H. Johnson, "Reduction of all-pole parameter estimation bias by successive autocorrelation," Proc. ICASSP, pp. 1088-1091, 1983.
[6] L. R. Rabiner and B. Gold, Theory and Application of Digital Signal Processing, Prentice-Hall, Englewood Cliffs, 1975.
[7] S. Young, G. Evermann, M. Gales, T. Hain, D. Kershaw, X. A. Liu, G. Moore, J. Odell, D. Ollason, D. Povey, V. Valtchev, and P. Woodland, The HTK Book (for HTK Version 3.4), Cambridge University Engineering Department, 2009.
[8] L. Rabiner, Fundamentals of Speech Recognition Course, http://www.caip.rutgers.edu/~lrr/.