Robust Speech Recognition with MSC/DRA Feature Extraction on Modulation Spectrum Domain

Naoya Wada and Yoshikazu Miyanaga

Abstract— This report introduces noise-robust speech recognition and proposes advanced speech analysis techniques named MSC (Modulation Spectrum Control) and DRA (Dynamic Range Adjustment). The dynamic range of the cepstrum obtained from noisy speech is usually smaller than that of the same speech without noise, since some speech features are hidden in the noise. This difference may cause recognition errors, so adjusting the dynamic range enables accurate extraction of speech features. The proposed techniques focus on this adjustment: DRA normalizes dynamic ranges, and MSC suppresses the noise corruption of speech feature parameters. Experiments on isolated word recognition were carried out using 40 male and 40 female speakers for training and 5 male and 5 female speakers for testing. As an example, the recognition rate against running-car noise at -10 dB SNR improves from 17% to 64%.
I. INTRODUCTION

Speech recognition systems take two major approaches: continuous speech recognition and word speech recognition. While continuous speech recognition can manage various long utterances, its accuracy is not as high as that of word speech recognition. Word speech recognition is more effective for practical applications such as car navigation systems and mobile terminal units. Although such systems are already used in various noisy environments, heavy noise corruption causes serious recognition errors, so noise robustness is still required in many applications. To reduce the effect of noise on the system, the extraction of robust speech features should be considered. Various noise-robust methods have been developed, such as noise-robust LPC analysis [1],[2], Hidden Markov Model (HMM) decomposition and composition [3],[4],[5], and the extraction of dynamic cepstrum [6],[7]. In spite of these research activities, the most widely used noise-robust technique is still the spectral subtraction (SS) method [8]. However, SS without adaptation to the noise may deteriorate speech features, for example by producing musical noise; accurate estimation of the noise status by SS becomes difficult in some circumstances. As a suitable speech feature, the cepstrum has been widely used. When we consider robust speech recognition, the feature vector should properly include not only noise robustness but also static and dynamic speech features. The delta cepstrum expresses a dynamic feature that approximates the differential of the cepstrum over several frames [9]. Since multiplicative noises in the spectral domain are reduced by subtraction in the quefrency domain, the delta cepstrum has been considered to include noise-robust speech features. We propose new noise-robust speech processing techniques, MSC and DRA. MSC extracts speech components from noisy speech in the modulation spectrum domain more effectively than RASTA [10] and RSF [11]. DRA normalizes the maximum amplitudes of feature parameters and corrects the differences in dynamic range between the trained data and the observed speech data.

II. MODULATION SPECTRUM CONTROL (MSC)

The modulation spectrum, together with the proposed MSC, is illustrated in Fig. 1. Short-time speech characteristics in the frequency domain are obtained by applying windowing and the Fourier transform to the speech waveform in the time domain. The time trajectory of the values at a specific frequency is called the running spectrum. The running spectrum at each frequency constitutes two-dimensional data, and its frequency analysis yields the modulation spectrum. Fig. 2 compares the modulation spectra obtained from noise-free speech and noisy speech (white noise added at 5 dB SNR). The modulation spectrum of noise-free speech is dominant in the modulation frequency band below 10 Hz, while that of noisy speech is dominant below 1 Hz only. Human syllable alternation proceeds at roughly a constant speed, and speech components in the modulation spectral domain are dominant around 4 Hz; in particular, modulation spectral components from 2 Hz to 8 Hz are used for speech recognition. Noise components, on the other hand, are stationary or have no specific alternation rhythm, and they concentrate in the modulation frequency band below 1 Hz. Modulation frequencies higher than 12 Hz can be regarded as miscellaneous noise components or speech components unnecessary for recognition (i.e. speaker characteristics such as tone, pronunciation, etc.). Therefore, noise robustness can be realized by extracting the modulation frequency band from 1 Hz to 12 Hz. Conventional methods extract this band by applying band-pass filtering in the running spectrum domain, which means that conventional methods do not obtain the modulation spectrum actu-
The authors would like to thank the Research and Development Headquarters, Yamatake Corporation for fruitful discussions. This study is supported in part by the Semiconductor Technology Academic Research Center (STARC), Program 112, and the Ministry of Education, Science, Sports and Culture, Grant-in-Aid for Scientific Research (B2) (15300010), 2003-2007. Mr. Wada and Prof. Miyanaga are with the Graduate School of Information Science and Technology, Hokkaido University, Sapporo 060-0814, Japan.
[email protected]
[Fig. 1. The processes of FIR-RASTA and MSC: speech waveform → DFT → running spectrum → log; FIR-RASTA applies band-pass filtering to the running spectrum and compensates the inverse of the running spectrum, whereas MSC applies a DFT to the time trajectory at each frequency to obtain the modulation spectrum and applies weighting there (weighting factor defined over modulation frequencies 0-40 Hz).]

[Fig. 2 panels: (a) modulation spectrum of noise-free speech; (b) modulation spectrum of noisy speech. Axes: frequency vs. modulation frequency.]
ally. However, MSC directly applies weighting to the modulation spectra at each frequency, so it eliminates unnecessary modulation spectral components accurately. This weighting is free from the limitations of band-pass filtering, such as the number of taps, delay, stability and phase distortion, and is a more favorable method for practical speech recognition systems.

III. DYNAMIC RANGE ADJUSTMENT (DRA) ON CEPSTRUM

One of the major causes of noise corruption is the difference in the dynamic ranges of the cepstrum. The dynamic range of the cepstrum is the difference between the maximum and minimum cepstral values of each order. Both the maximum and minimum values are peaks of the cepstrum and express the real characteristics of the speech. Under noise, however, the cepstral peak amplitudes are reduced compared with those of noise-free speech, and the characteristics are degraded. Fig. 3 shows distributions of the number of
Fig. 2. A comparison of modulation spectra obtained from noise-free speech and noisy speech. Black parts indicate dominant modulation spectral components. The sample speech is "Hachinohe". The noise included in the noisy speech is white noise at 5 dB SNR.
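The band-limiting idea behind MSC can be sketched in a few lines of NumPy. This is an illustrative reconstruction, not the authors' implementation: the 1-12 Hz band edges follow the text, while the frame rate and the hard 0/1 weighting are assumptions.

```python
# MSC sketch: DFT each frequency bin's time trajectory (the running spectrum),
# keep only 1-12 Hz modulation components, and transform back.
import numpy as np

def msc_weighting(running_spec, frame_rate_hz=86.2, lo_hz=1.0, hi_hz=12.0):
    """running_spec: (frames, bins) log running spectrum; returns weighted copy."""
    n_frames = running_spec.shape[0]
    mod_freqs = np.fft.rfftfreq(n_frames, d=1.0 / frame_rate_hz)
    weight = ((mod_freqs >= lo_hz) & (mod_freqs <= hi_hz)).astype(float)
    # DFT along the time (frame) axis for every frequency bin at once.
    mod_spec = np.fft.rfft(running_spec, axis=0)
    mod_spec *= weight[:, None]            # weighting in the modulation spectrum domain
    return np.fft.irfft(mod_spec, n=n_frames, axis=0)

# Example: a 4 Hz "speech-like" ripple survives while a constant (0 Hz,
# noise-like) offset is removed.
frame_rate = 86.2                          # ~1 / 11.6 ms frame period
t = np.arange(200) / frame_rate
spec = 5.0 + np.sin(2 * np.pi * 4.0 * t)   # DC offset + 4 Hz modulation
out = msc_weighting(spec[:, None], frame_rate)
print(abs(out.mean()) < 1e-6)              # True: the DC component is gone
```

Unlike an FIR band-pass filter on the running spectrum, this weighting has no tap count, group delay, or phase distortion to trade off, which is the advantage claimed for MSC.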
dynamic ranges of cepstra and shows that the dynamic range is usually reduced by additive noise compared with that of noise-free speech, even if RASTA or MSC is applied. Since speech recognition is a kind of pattern matching, these differences can be compensated by normalizing the amplitudes of both clean speech and noisy speech. DRA adjusts these various dynamic ranges by normalizing the amplitude of the speech features. In DRA, each coefficient of a speech feature vector is adjusted in proportion to its maximum amplitude as

\tilde{f}_i(t) = f_i(t) / \max_{j=1,\dots,m} |f_j(t)|    (i = 1, \dots, m),    (1)

where f_i(t) denotes an element of the feature vector, m
[Fig. 4. Procedure of the speech recognition system. Analysis part: Speech Signal → Fourier Transform → ABS → Mel Filterbank Analysis → Log → Inverse Fourier Transform → Delta Cepstrum → Speech Feature Vector. Training part: Baum-Welch algorithm → HMMs (reference data). Recognition part: input speech for test → Viterbi algorithm → score estimation → results.]

Fig. 3. Distributions of the dynamic ranges of the 1st cepstra obtained from the analysis of 100 Japanese isolated words spoken twice by 5 male speakers. The applied noise is white noise at 10 dB SNR. (a) is obtained from the original cepstra, (b) from the cepstra after RASTA, and (c) from the cepstra after MSC.
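The DRA normalization of (1) can be sketched as follows. This is an illustrative fragment, not the authors' code; it follows the equation as printed, dividing each frame's coefficients by the largest absolute coefficient of that frame.

```python
# DRA sketch per Eq. (1): normalize every feature vector so that all of its
# coefficients fall into the range [-1, 1].
import numpy as np

def dra(features):
    """features: (frames, m) array of feature vectors; returns normalized copy."""
    peak = np.abs(features).max(axis=1, keepdims=True)   # max_j |f_j(t)| per frame
    return features / np.where(peak == 0.0, 1.0, peak)   # guard all-zero frames

f = np.array([[0.5, -2.0, 1.0],
              [0.3,  0.6, -0.2]])
print(dra(f))   # every row now spans at most [-1, 1]
```

Because both the trained data and the observed speech are scaled the same way, the noise-induced shrinkage of the dynamic range no longer separates the two in the pattern-matching step.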
denotes the dimension, and t denotes the frame number. Using (1), all coefficients are adjusted into the range from -1 to 1.

With MSC/DRA, the speech analysis is refined. Using MSC, the influence of the differences between clean and noisy speech is eliminated in the modulation spectrum domain; this process removes the parts of speech unnecessary for recognition, such as speaker characteristics and noise influences. Then, with DRA, the difference in cepstral dynamic range is adjusted. Finally, the computed speech characteristics are compensated so that the cepstrum from noisy speech approaches the one from clean speech for the same word.

IV. EXPERIMENTS

A. Word Recognition Results

In order to evaluate the noise robustness of the proposed techniques, isolated word speech recognition using HMMs [12] has been carried out. Fig. 4 shows the process of the conventional recognition system. The former part of this figure shows the procedure of the speech feature extraction, which consists of ordinary feature extraction based on Mel-Frequency Cepstral Coefficients (MFCC) [13] and HMMs. MFCC is one of the most useful speech features; it is based on human frequency perception and is derived from the speech power spectrum. The latter part shows a training part using the Baum-Welch algorithm and a recognition part using the Viterbi algorithm. The recognition part is implemented in MATLAB. The acoustic models are thirty-two-state, one-mixture-per-state HMMs. The database is the Japanese common voice data 'Chimei' (meaning names of places) delivered by the Japan Electronic Industry Development Association. It consists of 100 Japanese isolated words spoken four times by 90 persons, sampled at 11.025 kHz with 16 bits. Other conditions are described in Table I. RASTA, MSC and DRA are applied separately or together, and several styles of recognition are evaluated under these conditions. Speech feature vectors have 38-dimensional parameters consisting of 12 cepstral coefficients, 12 delta-cepstral coefficients, 12 delta-delta-cepstral coefficients, delta-logarithmic power and delta-delta-logarithmic power. Recognition results are shown in Tables II and III. At first glance, the same tendency of recognition rates is obtained in both the white-noise and running-car-noise environments, and each noise-robust feature extraction method, DRA, RASTA and MSC, improves recognition performance except for
TABLE I
CONDITIONS OF SPEECH ANALYSIS AND RECOGNITION

Recognition task: isolated 100-word vocabulary; 100 Japanese region names from JEIDA
Speech data sampling: 11.025 kHz, 16-bit
Window length: 23.2 ms (256 points)
Frame period: 11.6 ms (128 points)
Window function: Hanning window
Pre-emphasis: 1 - 0.97z^{-1}
Speech analysis: 12-dimensional MFCC
Speech recognition: continuous-HMM; 32-state word HMMs
Training set: 40 females and 40 males, three utterances each
Test set: speaker-independent; 5 females and 5 males, two utterances each
Noise varieties: white noise at SNR 10, 20, 30 dB; running-car noise at SNR -10, 0, 10 dB
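The front-end conditions of Table I (pre-emphasis 1 - 0.97z^{-1}, 23.2 ms Hanning windows every 11.6 ms) can be sketched as follows. The parameter values come from the table; everything else is an illustrative NumPy assumption, not the authors' MATLAB code.

```python
# Framing sketch for 11.025 kHz speech per Table I: pre-emphasis, then
# 256-sample Hanning windows advanced by 128 samples.
import numpy as np

def frames_11k(signal, win=256, hop=128):
    emphasized = np.append(signal[0], signal[1:] - 0.97 * signal[:-1])
    window = np.hanning(win)
    n = 1 + (len(emphasized) - win) // hop
    return np.stack([emphasized[i*hop : i*hop + win] * window for i in range(n)])

x = np.random.randn(11025)            # one second of audio at 11.025 kHz
print(frames_11k(x).shape)            # (85, 256): ~86 frames per second
```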
DRA at higher SNRs. Comparing the recognition performances of RASTA and MSC, MSC is slightly superior to RASTA in both environments, confirming the advantage of MSC. When combined with DRA, both methods show better performance. MSC with DRA shows the best performance among all methods, including RASTA with DRA. In particular, in the running-car noise environment at -10 dB SNR, DRA improves the recognition rate with MSC by 31.15%, while it improves that with RASTA by only 20.75%.

V. CONCLUSIONS

In this report, the new techniques for noise suppression, DRA and MSC, have been presented. MSC emphasizes the modulation frequency bands of speech by weighting in the modulation spectrum domain. DRA normalizes the maximum amplitudes of the cepstrum. Their effectiveness was estimated in speech recognition experiments, and the application of both techniques together shows the best performance. This result indicates that MSC extracts speech characteristics more effectively than RASTA and that a synergistic effect exists between DRA and MSC.

REFERENCES

[1] Tierney J., "A study of LPC analysis of speech in additive noise," IEEE Trans. on Acoust., Speech, and Signal Process., vol. ASSP-28, no. 4, pp. 389-397, Aug. 1980.
[2] Kay S.M., "Noise compensation for autoregressive spectral estimation," IEEE Trans. on Acoust., Speech, and Signal Process., vol. ASSP-28, no. 3, pp. 292-303, March 1980.
[3] Varga A. and Moore R., "Hidden Markov Model decomposition of speech and noise," Proc. IEEE ICASSP, pp. 845-848, 1990.
[4] Gales M.J.F. and Young S.J., "Cepstral parameter compensation for HMM recognition in noise," Speech Communication, vol. 12, no. 3, pp. 231-239, 1993.
TABLE II
RECOGNITION RATES [%] VERSUS WHITE NOISE FOR THE ESTIMATION OF FEATURE EXTRACTION

Speech Feature | 10 dB | 20 dB | 30 dB
Conventional   | 57.30 | 96.70 | 99.35
DRA            | 70.15 | 96.05 | 99.25
RASTA          | 70.45 | 96.95 | 99.20
MSC            | 74.55 | 97.05 | 99.25
RASTA+DRA      | 80.90 | 97.25 | 99.35
MSC+DRA        | 85.05 | 97.10 | 99.15

TABLE III
RECOGNITION RATES [%] VERSUS RUNNING CAR NOISE FOR THE ESTIMATION OF FEATURE EXTRACTION

Speech Feature | -10 dB | 0 dB  | 10 dB
Conventional   | 17.10  | 77.80 | 95.55
DRA            | 25.40  | 76.90 | 95.25
RASTA          | 27.80  | 90.80 | 98.35
MSC            | 32.35  | 90.20 | 98.50
RASTA+DRA      | 48.55  | 89.90 | 97.80
MSC+DRA        | 63.50  | 93.75 | 98.35
[5] Martin F., Shikano K., Minami Y. and Okabe Y., "Recognition of noisy speech by composition of hidden Markov models," IEICE Technical Report, SP92-96, pp. 9-16, Dec. 1992.
[6] Aikawa K. and Saito T., "Noise robustness evaluation on speech recognition using a dynamic cepstrum," IEICE Technical Report, SP94-14, pp. 1-8, June 1994.
[7] Aikawa K., Hattori H., Kawahara H. and Tohkura Y., "Cepstral representation of speech motivated by time-frequency masking: An application to speech recognition," J. Acoust. Soc. Am., vol. 100, no. 1, pp. 603-614, July 1996.
[8] Boll S., "Suppression of acoustic noise in speech using spectral subtraction," IEEE Trans. on Acoust., Speech, and Signal Process., vol. ASSP-27, no. 2, pp. 113-120, 1979.
[9] Furui S., "Speaker-independent isolated word recognition using dynamic features of speech spectrum," IEEE Trans. on Acoust., Speech, and Signal Process., vol. ASSP-34, no. 1, pp. 52-59, Feb. 1986.
[10] Hermansky H. and Morgan N., "RASTA processing of speech," IEEE Trans. Speech and Audio Process., vol. 2, pp. 578-579, Oct. 1994.
[11] Hayasaka N., Miyanaga Y. and Wada N., "Running spectrum filtering in speech recognition," SCIS Signal Processing and Communications with Soft Computing, Oct. 2002.
[12] Rabiner L.R., "A tutorial on hidden Markov models and selected applications in speech recognition," Proceedings of the IEEE, vol. 77, no. 2, Feb. 1989.
[13] Davis S.B. and Mermelstein P., "Comparison of parametric representations for monosyllabic word recognition in continuously spoken sentences," IEEE Trans. on Acoust., Speech, and Signal Process., pp. 357-366, 1980.
[14] N. Wada, Y. Miyanaga, N. Yoshida and S. Yoshizawa, "A consideration about an extraction of features for isolated word speech recognition in noisy environments," ISPACS 2002, DSP2002-33, pp. 19-22, Nov. 2002.