ROBUST MFCC FEATURE EXTRACTION ALGORITHM USING EFFICIENT ADDITIVE AND CONVOLUTIONAL NOISE REDUCTION PROCEDURES Bojan Kotnik, Damjan Vlaj, Zdravko Kačič, Bogomir Horvat Faculty of Electrical Engineering and Computer Science University of Maribor, Smetanova 17, SI-2000 Maribor, Slovenia [email protected], [email protected], [email protected], [email protected]

ABSTRACT In this paper a robust mel frequency cepstral coefficient feature extraction procedure using noise reduction, frame attenuation, and RASTA processing is presented. In the preprocessing stage a hybrid Hamming-Cosine window is applied. To minimize the effect of additive environmental noise on the speech signal, a spectral subtraction based on spectral smoothing is used. A general mel filtering approach is performed on the noise-reduced signal. To detect speech frames, a voice activity detection based on log filter-bank energies is performed, and the log filter-bank magnitudes of noise-only frames are attenuated. To reduce the level of convolutional distortion, a RASTA filtering of log filter-bank energy trajectories is applied. In the final stage, a noise robust feature vector consisting of 12 mel cepstrum coefficients and the log energy is created. To evaluate the speech recognition improvement of the proposed front-end, the Aurora 2 and Aurora 3 databases together with the HTK speech recognition toolkit were chosen. A total relative improvement of 41.14% (Aurora 2) and 45.06% (Aurora 3) over the baseline MFCC front-end is achieved.

1. INTRODUCTION

Automatic extraction of useful information from speech has been a subject of active research for many decades. The mel frequency cepstral feature extraction methods currently used in many speech recognition systems are motivated by properties of the human auditory system and speech perception. The characteristics of the speech signal, or measurements made from the signal, vary with many factors. For example, if the same speech sound uttered by the same person is recorded in different noisy environments using transmission channels with different transfer characteristics (e.g. PSTN, ISDN, GSM), the resulting set of signals will have different characteristics depending upon the nature of the recording environment and the channel transfer function [1]. It is well known that the performance of an automatic speech recognition system degrades in the presence of noise (additive or convolutional). This is due to the acoustic mismatch between the features used to train and test the system and the limited ability of the acoustic models to describe the corrupted speech [2]. With the intention of achieving better multiconditional noise robustness of an automatic speech recognition system, some improvements and refinements of the well-known mel frequency cepstral feature extraction principle are presented in

this paper. Often no special attention is given to the "trivial" time-domain preprocessing stage such as windowing. The performed experiments with an asymmetric hybrid Hamming-Cosine window (described in Section 2.1) showed a recognition improvement over the generally used symmetric Hamming window. Further, special attention is given to the reduction of additive and convolutive noises. To reduce additive noises, a spectral subtraction based on spectral smoothing is proposed. A standard mel filtering approach with triangular, half-overlapped filters is performed on the noise-reduced signal. An endpoint detector based on log filter-bank energies and a hangover criterion is applied (Section 2.4). The log filter-bank magnitudes of noise-only frames are attenuated (frame attenuation principle, Section 2.5). The convolutive distortion, which originates from the transmission channel transfer function, can effectively be removed with RASTA filtering of log filter-bank energy trajectories. This filtering process is described in Section 2.6. Finally, the feature vector, which consists of 13 elements (12 MFCC + logE), is created. The tests are performed on the Aurora 2 and 3 databases. The results are discussed in Section 4, and conclusions are given in Section 5.

2. FEATURE EXTRACTION PROCEDURE

Ideally, noise robustness would be solved in the feature extraction module, since this would eliminate the need for additional application-specific data collection or parameter compensation. Figure 1 shows the proposed feature extraction front-end algorithm. It involves a time-domain preprocessing stage with a hybrid Hamming-Cosine window, a noise reduction stage with spectral subtraction based on spectral smoothing, a filter-bank analysis stage, an endpoint detector based on SNR estimation in the filter-bank domain, the frame attenuation principle, RASTA processing, as well as a DCT-based mel cepstrum feature extraction procedure.

2.1. Framing, windowing, and FFT computation

The input signal y[n] is divided into overlapped frames of length 30 ms (L = 240 samples at a sampling rate of 8 kHz). The frame shift interval is 11.25 ms (90 samples). Each frame is then multiplied by the window function. Choosing an appropriate window consists of selecting its type, length, and placement. Window lengths are typically in the 20-30 ms range. Multiplying a signal by a window has the effect of convolving the signal spectrum with the window spectrum.

ICSLP'02 Proceedings, Denver, Colorado, USA, pp. 445 - 448, 2002.

Figure 2: A comparison of the Hamming and hybrid Hamming-Cosine windows.

2.2. Spectral subtraction based on spectral smoothing

Spectral subtraction based on spectral smoothing is carried out as the additive noise reduction technique [4]. The main advantage of this algorithm is that no explicit detection of non-speech segments is needed. Another advantage of this spectral subtraction approach is a lower degree of added musical distortion in the output signal |X[m,k]|, which is the main handicap of classical spectral subtraction [4].

2.3. Mel filtering

Figure 1: Proposed robust feature extraction algorithm.

Ideally, the window spectrum would have a narrow main lobe and small sidelobes. However, there is a tradeoff between main lobe width and sidelobe attenuation. A wider main lobe results in more averaging across neighbouring frequencies, while larger sidelobes introduce more aliasing from other frequency regions. Therefore, there are two desirable window spectrum features: a narrow main lobe and large attenuation in the sidelobes. In the proposed front-end, a hybrid Hamming-Cosine window (hHCw), given by equation (1) and similar to the one used in G.729 [3], is used:

$$w(n) = \begin{cases} 0.54 - 0.46\cos\left(\dfrac{2\pi n}{\frac{10}{6}L - 1}\right), & n = 0, \ldots, \frac{5}{6}L - 1 \\[2ex] \cos\left(\dfrac{2\pi\left(n - \frac{5}{6}L\right)}{\frac{4}{6}L - 1}\right), & n = \frac{5}{6}L, \ldots, L - 1 \end{cases} \quad (1)$$

Figure 2 shows time- and frequency-domain comparisons of the Hamming (Hw) and the proposed hybrid Hamming-Cosine (hHCw) windows, both of length L = 240 samples. The magnitude spectrum plots in Figure 2 show that the hHCw has a wider main lobe than the Hw, but larger attenuation in the sidelobes. The performed experiments show that better speech recognition results can be achieved with the hHCw than with the Hw. In the consecutive step, a zero-padded N = 256-point Fast Fourier Transform is performed, producing the signal |Y[m,k]|.
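As an illustration, the hybrid window of equation (1) and the framing/FFT stage can be sketched as follows. This is a minimal NumPy sketch under the parameters stated above (L = 240, 90-sample shift, 256-point FFT); the function names are ours, not from the paper:

```python
import numpy as np

def hybrid_hamming_cosine(L=240):
    """Hybrid Hamming-Cosine window of eq. (1): a Hamming-shaped part over
    the first 5L/6 samples and a quarter-cosine fall over the last L/6."""
    n1 = 5 * L // 6                      # 200 samples for L = 240
    n = np.arange(L, dtype=float)
    w = np.empty(L)
    w[:n1] = 0.54 - 0.46 * np.cos(2 * np.pi * n[:n1] / (10 * L / 6 - 1))
    w[n1:] = np.cos(2 * np.pi * (n[n1:] - n1) / (4 * L / 6 - 1))
    return w

def frame_and_fft(y, frame_len=240, frame_shift=90, nfft=256):
    """Split y into 30 ms frames (8 kHz), apply the window, zero-pad to
    nfft samples, and return the magnitude spectra |Y[m, k]|."""
    w = hybrid_hamming_cosine(frame_len)
    num_frames = 1 + (len(y) - frame_len) // frame_shift
    frames = np.stack([y[m * frame_shift: m * frame_shift + frame_len] * w
                       for m in range(num_frames)])
    return np.abs(np.fft.rfft(frames, n=nfft, axis=1))
```

Note the asymmetry: the window starts at 0.08 like a Hamming window but ends with a short cosine taper, which is what widens the main lobe while deepening the sidelobes.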

A commonly used 24-channel (NumChan) mel filter-bank analysis with triangular, half-overlapped filters, described in greater detail in [4], is performed on the noise-reduced signal |X[m,k]|. The filter-bank output fbank[m,i], where i = 1, 2, ..., NumChan denotes the filter-bank channel index, is subject to a natural logarithm, producing the log filter-bank magnitude fln[m,i]:

$$f_{\ln}[m,i] = \ln\left(1 + f_{bank}[m,i]\right) \quad (2)$$
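A sketch of the mel filter-bank and the log compression of equation (2). The exact filter edge placement used in [4] is not given here, so the construction below is a common textbook variant, assuming 24 channels, a 256-point FFT, and 8 kHz sampling:

```python
import numpy as np

def mel_filterbank(num_chan=24, nfft=256, fs=8000):
    """Triangular, half-overlapped filters equally spaced on the mel scale."""
    mel = lambda f: 2595.0 * np.log10(1.0 + f / 700.0)
    imel = lambda m: 700.0 * (10.0 ** (m / 2595.0) - 1.0)
    # filter edge frequencies: num_chan + 2 points from 0 Hz to fs/2
    edges = imel(np.linspace(mel(0.0), mel(fs / 2.0), num_chan + 2))
    bins = np.floor((nfft + 1) * edges / fs).astype(int)
    H = np.zeros((num_chan, nfft // 2 + 1))
    for i in range(num_chan):
        lo, ctr, hi = bins[i], bins[i + 1], bins[i + 2]
        H[i, lo:ctr] = (np.arange(lo, ctr) - lo) / max(ctr - lo, 1)  # rise
        H[i, ctr:hi] = (hi - np.arange(ctr, hi)) / max(hi - ctr, 1)  # fall
    return H

def log_filterbank(X, H):
    """Eq. (2): f_ln[m, i] = ln(1 + fbank[m, i])."""
    return np.log1p(X @ H.T)
```

The `log1p` form of equation (2) keeps the log filter-bank magnitudes non-negative even for near-zero filter-bank outputs, which matters later when noise-only frames are multiplied by a small attenuation factor.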

2.4. Endpoint detection

To classify frames as speech (noisy-speech frames) or non-speech (noise-only frames), an SNR-driven voice activity detector (VAD) is used. The VAD output decision ϑ[m] is ϑ[m] = 1 if frame m of the input signal y[m,n] contains speech or noisy speech, and ϑ[m] = 0 if frame m contains noise only. The SNR of frame m corresponds to the difference between the short-term and long-term log filter-bank energy estimates. The long-term estimate is updated when the VAD decides that the current frame contains noise only, and the log filter-bank energy of the current frame is used as the short-term estimate. The short-term spectral energy Ef[m] is calculated for each frame as:

$$E_f[m] = \sum_{i=1}^{NumChan} f_{\ln}[m,i] \quad (3)$$

where NumChan = 24 is the number of filter-bank channels. The short-term spectral energy of the current frame Ef[m] is used in the update of the long-term mean spectral energy Em[m] as:


$$E_m[m] = \begin{cases} E_m[m-1] + \dfrac{E_f[m] - E_m[m-1]}{100}, & \text{if } E_f[m] - E_m[m-1] < \alpha \\[1ex] E_m[m-1], & \text{otherwise} \end{cases} \quad (4)$$

where α = 10 is the update threshold for the long-term mean spectral energy Em[m]. After determination of the short-term spectral energy Ef[m] and the long-term mean spectral energy Em[m], the speech/non-speech decision is defined as:

$$\vartheta[m] = \begin{cases} 1, & \text{if } E_f[m] - E_m[m] \geq \beta \\ 0, & \text{otherwise} \end{cases} \quad (5)$$
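Equations (3)-(5), together with the long-term update of equation (4) and the hangover described in the text, can be sketched as below. α = 10 and β = 4.5 are the values given in this section; the 5-frame hangover length and the bootstrap of the long-term estimate from the first frame are our assumptions, since the paper does not state them:

```python
import numpy as np

def vad_decisions(f_ln, alpha=10.0, beta=4.5, hangover=5):
    """SNR-driven VAD over log filter-bank magnitudes f_ln[m, i]."""
    Em = None                                  # long-term mean spectral energy
    theta = np.zeros(len(f_ln), dtype=int)
    hang = 0
    for m, frame in enumerate(f_ln):
        Ef = frame.sum()                       # eq. (3): short-term energy
        if Em is None:
            Em = Ef                            # assumed bootstrap of Em
        if Ef - Em < alpha:                    # eq. (4): slow update on noise
            Em = Em + (Ef - Em) / 100.0
        if Ef - Em >= beta:                    # eq. (5): speech decision
            theta[m] = 1
            hang = hangover
        elif hang > 0:                         # hangover keeps weak tails
            theta[m] = 1
            hang -= 1
    return theta
```

The slow 1/100 update in equation (4) makes Em track the noise floor while barely reacting to short speech bursts, so the difference Ef − Em behaves as a frame-level SNR estimate.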

where β = 4.5 is an empirically estimated SNR-equivalent threshold used in the speech/non-speech decision process. Furthermore, the VAD also incorporates a so-called "hangover" criterion, which prevents the misclassification of weak fricatives as noise at the ends of speech segments.

2.5. Frame attenuation principle

The VAD decision ϑ[m] is then used in the frame attenuation stage to attenuate the log filter-bank magnitudes fln[m,i]: the log filter-bank magnitudes of frame m are attenuated if frame m is declared noise-only. The criterion used in the proposed algorithm is given by equation (6):

$$f[m,i] = \begin{cases} f_{\ln}[m,i], & \vartheta[m] = 1 \\ \psi \cdot f_{\ln}[m,i], & \vartheta[m] = 0 \end{cases} \quad (6)$$
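Equation (6) is a one-liner in practice; a minimal sketch with ψ = 0.03 as given below:

```python
import numpy as np

def attenuate_noise_frames(f_ln, theta, psi=0.03):
    """Eq. (6): pass speech frames (theta == 1) through unchanged and scale
    the log filter-bank magnitudes of noise-only frames (theta == 0) by psi,
    rather than discarding them as frame dropping would."""
    f = f_ln.copy()
    f[theta == 0] *= psi
    return f
```

Keeping strongly attenuated noise frames, instead of dropping them, preserves the frame timing for the subsequent RASTA filtering of the temporal trajectories.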

where the attenuation factor ψ = 0.03 is determined empirically. The proposed frame attenuation principle differs from the well-known frame dropping algorithm, where noise-only frames are discarded completely from further processing. The experiments show that better speech recognition results can be achieved with frame attenuation than with frame dropping.

2.6. Temporal RASTA processing for channel normalization

Any stationary convolutive distortion becomes an additive component in the logarithmic spectral energy domain. If stationarity of the transmission channel or microphone characteristic is assumed, it is easy to show that any convolutive distortion of the signal affects the mean of the time trajectory of the logarithmic spectral energy [2]. Methods for processing temporal trajectories of logarithmic energies have already proven effective in dealing with channel variability, and RASTA (RelAtive SpecTrA) processing was introduced as an alternative to mean subtraction [2]. In this paper we apply temporal RASTA processing to the log filter-bank magnitudes. The RASTA filter is a band-pass filter with the following transfer function [2]:

$$H_{RASTA}(z) = 0.1\, z^{4} \cdot \frac{2 + z^{-1} - z^{-3} - 2z^{-4}}{1 - 0.98\, z^{-1}} \quad (7)$$
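The filter of equation (7) can be applied per channel as a simple IIR recursion. The z^4 factor is a 4-frame advance, which makes the filter non-causal; the sketch below realises it causally, so its output is the RASTA-filtered trajectory delayed by four frames (a common practical choice, not stated in the paper):

```python
import numpy as np

def rasta_filter(f, num=(0.2, 0.1, 0.0, -0.1, -0.2), pole=0.98):
    """Apply the causal form of H_RASTA(z) of eq. (7) along the time axis of
    each filter-bank trajectory f[:, i].  The FIR numerator coefficients are
    0.1 * (2, 1, 0, -1, -2); the single pole at 0.98 gives the low cutoff."""
    out = np.zeros_like(f)
    for i in range(f.shape[1]):                    # each channel trajectory
        x = f[:, i]
        y_prev = 0.0
        for m in range(len(x)):
            # FIR part: skip taps that would reach before the first frame
            acc = sum(b * x[m - k] for k, b in enumerate(num) if m - k >= 0)
            y_prev = acc + pole * y_prev           # IIR part
            out[m, i] = y_prev
    return out
```

Because the numerator coefficients sum to zero, a constant trajectory (pure DC, i.e. a stationary channel offset) is rejected, which is exactly the channel normalization property described above.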

Each log filter-bank magnitude component f[m,i], i = 1, 2, ..., NumChan, is filtered with HRASTA(z), producing the RASTA-filtered log filter-bank magnitudes fRASTA[m,i]. The presented RASTA filter attenuates modulation frequency components below 1 Hz and above 10 Hz. Slowly varying components, corresponding to the frequency characteristics of the communication channel, are suppressed, and a robust representation that is less sensitive to environmental effects is obtained. The low-pass filtering also helps to smooth spectral changes between adjacent frames that result from analysis artifacts, such as the position of the window with respect to the pitch period.

2.7. MFCC feature vector generation

In the last step of the proposed feature generation algorithm, the discrete cosine transform (DCT) is applied to the RASTA-filtered log filter-bank magnitudes fRASTA[m,i], i = 1, 2, ..., NumChan, and twelve mel cepstral coefficients C[m,j], j = 1, 2, ..., 12, are obtained. Additionally, the short-term energy Ef[m] computed in the endpoint detection procedure serves as the 13th element of the final feature vector. The dynamic features (delta and acceleration coefficients) are calculated internally in the speech recognition process.
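The final DCT stage can be sketched as follows. This is an HTK-style DCT-II over the 24 channels, keeping C1-C12; the normalisation constant sqrt(2/NumChan) is a common convention and may differ from the one used in the paper:

```python
import numpy as np

def mfcc_from_filterbank(f_rasta, num_ceps=12):
    """DCT of the RASTA-filtered log filter-bank magnitudes, keeping the
    first num_ceps cepstral coefficients C[m, j], j = 1..12."""
    num_chan = f_rasta.shape[1]
    j = np.arange(1, num_ceps + 1)[:, None]        # cepstral index
    i = np.arange(1, num_chan + 1)[None, :]        # filter-bank channel
    # DCT-II basis: C[j] = sqrt(2/N) * sum_i f[i] * cos(pi*j*(i-0.5)/N)
    D = np.sqrt(2.0 / num_chan) * np.cos(np.pi * j * (i - 0.5) / num_chan)
    return f_rasta @ D.T
```

A perfectly flat filter-bank vector yields zero for every coefficient j >= 1, so the cepstra encode only the spectral shape; the overall level is carried by the separate log energy element.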

3. EXPERIMENTS AND RESULTS

Evaluation of the proposed robust feature extraction algorithm was performed on the Aurora 2 and Aurora 3 databases. The speech data in the Aurora 2 database [5] is derived from the TIDigits database; different noises are added in a controlled way at different SNRs. Two training modes are defined: "clean" uses only clean digits and "multiconditional" uses four different noises. Three test sets are defined. "Set A" has the same noises as the multiconditional training, resulting in a matched test condition. "Set B" uses four other noises, which represent realistic scenarios for using a mobile terminal; in this case, there is a mismatch between training and test conditions. The data in "Set C" simulates a channel mismatch. The four languages of the Aurora 3 databases, Finnish, Spanish, German, and Danish, are taken from the corpora recorded as part of the SpeechDat-Car project [6]. These are real recordings made in driving cars with a close-talking microphone and a hands-free microphone. Three train/test configurations were defined: the well-matched condition (WM), the medium mismatched condition (MM), and the highly mismatched condition (HM). In the WM case, 70% of the entire data is used for training and 30% for testing; the training set contains all the variability that appears in the test set. In the MM case, only hands-free microphone data is used for both training and testing. For the HM case, the training data consists of close-talking microphone recordings only, while testing is done on hands-free microphone data. The performance of the presented feature extraction algorithm is evaluated by comparison to standard baseline MFCC features. The configuration of the recognizer is standardized for all tasks. It uses the HTK speech recognition toolkit [7], where the models are words, each composed of 16 emitting states with 3 mixtures per state (with the exception of the silence model, which uses 3 emitting states and 6 mixtures per state).
Table 1 presents a summary of the performance characteristics of the proposed method on the Aurora 2 and 3 databases.


Table 1: The absolute performance (word error rate) and relative performance (relative error reduction over the MFCC baseline) of the proposed noise robust feature extraction algorithm, achieved on the Aurora 2 and Aurora 3 databases.

4. DISCUSSION

Table 1 contains the absolute performance (word error rates) of the proposed noise robust feature extraction algorithm for the different databases and test conditions, as well as the performance relative to the baseline MFCC. It can be seen that the reduction in word error rate is consistent across the Aurora 2 and Aurora 3 databases. The SpeechDat-Car databases of Aurora 3 are realistic databases, so the baseline performance deteriorates much faster with mismatch than with the mismatch created by artificially adding noise to clean speech in the Aurora 2 database. The proposed noise robust front-end algorithm achieves the highest relative improvement in high mismatch conditions. This is the case on Aurora 2 when the clean training/multiconditional test is performed; the relative improvement achieved on this Aurora 2 task is 62.02%. An almost identical result, a relative improvement of 61.03%, is achieved on the Aurora 3 databases in the high mismatch condition. The experiments with the Aurora 3 databases also showed that the proposed endpoint detector in combination with the frame attenuation principle is an important part of the robust feature extraction module: if the endpoint detector were absent, the long segments of silence before the onset of speech in the SpeechDat-Car databases could cause a large number of insertion errors. The main advantage of RASTA processing is evident directly from the high improvement on the "Set C" test of the Aurora 2 database (a relative improvement of 29.56% in the multiconditional training condition). RASTA successfully eliminates convolutive distortion and compensates for the effect of channel mismatch.

5. CONCLUSION

In this article a multiconditionally robust mel frequency cepstral coefficient feature extraction algorithm is proposed. As an alternative to the commonly used symmetric Hamming window, an asymmetric hybrid Hamming-Cosine window (hHCw) is proposed for the preprocessing stage. As the additive noise reduction scheme, a spectral subtraction based on spectral smoothing and minimum statistics is used. A mel filtering approach is performed on the noise-reduced signal. Detection of speech/non-speech frames based on log filter-bank energies is applied. The

frame attenuation principle is used to deal with noise-only frames. For reduction of convolutional distortion, RASTA processing is performed. In the final stage, a noise robust feature vector consisting of 12 mel cepstrum coefficients and the log energy is created. To evaluate the speech recognition improvement of the proposed front-end, the Aurora 2 and Aurora 3 databases together with the HTK speech recognition toolkit were chosen. A total relative improvement of 41.14% (Aurora 2) and 45.06% (Aurora 3) over the baseline MFCC front-end is achieved. The results obtained on the Aurora 2 and Aurora 3 databases show that the proposed method gives a considerable recognition improvement compared to the baseline system and is also appropriate for use in a distributed speech recognition system.

6. REFERENCES

[1] Junqua, J. C. and Haton, J. P., "Robustness in Automatic Speech Recognition", Kluwer Academic Publishers, Norwell, Massachusetts, USA, 1996.
[2] Hermansky, H. and Morgan, N., "RASTA Processing of Speech", IEEE Trans. Speech and Audio Processing, Vol. 2, No. 4, October 1994.
[3] ITU-T, "Coding of Speech at 8 kbit/s using Conjugate-Structure Algebraic-Code-Excited Linear-Prediction (CS-ACELP)", ITU-T Recommendation G.729, March 1996.
[4] Kotnik, B., Kacic, Z., and Horvat, B., "A Multiconditional Robust Front-End Feature Extraction with a Noise Reduction Procedure Based on Improved Spectral Subtraction Algorithm", Eurospeech 2001 Proceedings, pp. 197-200, Aalborg, Denmark, 2001.
[5] Hirsch, H. G. and Pearce, D., "The AURORA Experimental Framework for the Performance Evaluation of Speech Recognition Systems under Noisy Conditions", ISCA ITRW ASR2000, Paris, France, September 2000.
[6] AU/225/00, AU/271/00, AU/273/00, AU/378/00, "Finnish, Spanish, German, Danish Databases for ETSI STQ Aurora WI008 Advanced DSR Front-End Evaluation: Description and Baseline Results".
[7] Young, S., "HTK Book - Version 2.1", Entropic Cambridge Research Laboratory, 1997.

