IEICE TRANS. INF. & SYST., VOL.E89–D, NO.3 MARCH 2006

1015

PAPER

Special Section on Statistical Modeling for Speech Processing

PS-ZCPA Based Feature Extraction with Auditory Masking, Modulation Enhancement and Noise Reduction for Robust ASR

Muhammad GHULAM†a), Takashi FUKUDA†∗, Kouichi KATSURADA†, Junsei HORIKAWA†, Nonmembers, and Tsuneo NITTA†, Member

SUMMARY A pitch-synchronous (PS) auditory feature extraction method based on ZCPA (Zero-Crossings Peak-Amplitudes) was proposed previously and was shown to be more robust than conventional ZCPA and MFCC based features. In this paper, first, a non-linear adaptive threshold adjustment procedure is introduced into the PS-ZCPA method to obtain optimal results in noisy conditions with different signal-to-noise ratios (SNRs). Next, auditory masking, a well-known auditory perception phenomenon, and modulation enhancement, which exploits the strong relationship between the modulation spectrum and the intelligibility of speech, are embedded into the PS-ZCPA method. Finally, a Wiener filter based noise reduction procedure is integrated into the method to make it more noise-robust, and the performance is evaluated against ETSI ES202 (WI008), a standard front-end for distributed speech recognition. All experiments were carried out on the Aurora-2J database. The experimental results demonstrate improved performance of the PS-ZCPA method when auditory masking is embedded, and slightly improved performance when modulation enhancement is used. The PS-ZCPA method with Wiener filter based noise reduction also performs better than ETSI ES202 (WI008).
key words: pitch synchronous analysis, ZCPA, auditory masking, modulation enhancement, Wiener filtering

Manuscript received July 11, 2005.
Manuscript revised September 27, 2005.
† The authors are with the Graduate School of Engineering, Toyohashi University of Technology, Toyohashi-shi, 441–8580 Japan.
∗ Presently, with Tokyo Research Laboratory, IBM Japan Ltd.
a) E-mail: [email protected]
DOI: 10.1093/ietisy/e89–d.3.1015

1. Introduction

The performance of automatic speech recognition (ASR) degrades sharply with increasing noise, whereas human beings can recognize speech even in the presence of high background noise. One of the main reasons behind this difference is that the auditory system incorporates several features that make it robust to noise. Therefore, the use of auditory-based feature extraction methods for ASR has increased in recent years because of their robustness in the presence of noise. The EIH (Ensemble Interval Histogram) model [1], proposed by Ghitza, uses an array of level-crossing detectors attached to the outputs of band-pass filters to generate an interval histogram. The EIH model produces dominant periodic temporal structures by analyzing zero-crossing intervals in frequency bands. The ZCPA method [2], which is an improvement of the EIH model, uses peaks rather than level-crossings to measure the intensity of each zero-crossing interval. The ZCPA method has been shown to be more robust and computationally more efficient than the EIH model.

It is well known that the auditory neuron system has a pitch-synchronous mechanism [3], which can be useful for speech detection; however, neither the ZCPA method nor the EIH model utilizes this mechanism. We proposed a pitch-synchronous ZCPA (PS-ZCPA) method [4] that extracts pitch-synchronous features by using the ZCPA method. In the ZCPA method, the positive zero-crossings in each subband are detected, and the intervals between successive positive zero-crossings are calculated. The peaks within the intervals are detected at the same time. Then a histogram of the intervals for all subbands is collected, with the logarithmic peak-values contributing as a weighting factor. In the proposed PS-ZCPA method, a noise-robust non-delayed pitch detection algorithm (PDA) is first applied to extract the pitches of a speech signal and to detect the voiced (V) and unvoiced/silent (U/S) frames (segments) of the signal. For voiced frames, the highest peak, $P_h$, in each pitch interval of each subband is detected and its logarithmic value is obtained. Only the peaks that are above a threshold determined by $P_h$, rather than all the peaks as in the ZCPA method, contribute to the histogram bin count. The count of a histogram bin is increased by adding the corresponding logarithmic peak-values. For unvoiced or silent frames, as there are no pitches, all the logarithmic peak-values contribute to the histogram bin count, as in the ZCPA method.

In [4], the threshold determined by $P_h$ was fixed, and to obtain optimal results for different SNRs, the threshold had to be adjusted manually. To overcome this problem, a procedure is needed that automatically adjusts the threshold parameter for signals with different SNRs. The novel procedure used in this paper checks the noise level at each channel output by analyzing the silent segments at the beginning of speech, and adjusts the threshold parameter accordingly.
Copyright © 2006 The Institute of Electronics, Information and Communication Engineers

A perceived histogram from the PS-ZCPA method is influenced by various kinds of auditory effects. One of the important auditory effects is masking. Masking functions in such a way that a masker component inhibits other components in its vicinity [5]. There are two types of auditory masking. Temporal masking is a phenomenon whereby louder sounds mask other sounds for a short time before and after their occurrence, while in simultaneous masking, high-energy frequency components mask adjacent frequencies with lower energy. From a signal-processing point of view, masking enhances the peaks of a time-spectrum pattern, which is expected to make speech recognition more robust. In this paper, the performance of the PS-ZCPA method is enhanced by incorporating auditory masking into it.

One of the main objectives of front-end processing in robust ASR is to preserve critical linguistic information while suppressing irrelevant information such as speaker-specific characteristics, channel characteristics, and additive noise. To determine the information to be preserved, it is necessary to identify those features of the signal that are necessary for speech recognition. It has been reported that there is a strong correlation between the modulation transfer function and the intelligibility of speech [6]. Low modulation frequencies carry information such as channel characteristics, speaker information, and voice quality, which are assumed not to be crucial for human speech communication. Similarly, high modulation frequencies might be less important for speech communication. In clean conditions, the dominant component of the modulation spectrum of continuous speech lies between 1 Hz and 16 Hz, with its peak around 4 Hz in modulation frequency [7]. To incorporate modulation enhancement into the PS-ZCPA method, envelopes are first extracted from all of the band-pass filter outputs by using peaks only. Next, the envelopes are filtered by a modulation filter, and the modulated envelopes are then converted back into zero-crossing signals with the help of previously stored zero-crossing and phase information. Finally, the modulated signals are processed using the PS-ZCPA method. The performance of the PS-ZCPA method using modulation enhancement is evaluated in this paper both with and without auditory masking.

In a severe noise environment, the temporal structure obtained by the ZCPA-based methods in high frequency bands deteriorates and becomes unreliable for recognition [4]. A noise reduction procedure can serve as a solution to this problem.
In this paper, the performance of the PS-ZCPA method with noise reduction is compared with that of the ETSI (European Telecommunications Standards Institute) standard advanced distributed speech recognition front-end ES202 (WI008) [8], which is based on a Mel-cepstrum representation and is designed to improve recognition performance in background noise. For the noise reduction step, both the PS-ZCPA and the ES202 use a two-stage Wiener filter based noise suppression procedure. In the experiments, the PS-ZCPA method using the noise reduction procedure is evaluated both with and without auditory masking and modulation enhancement.

This paper is organized as follows. Section 2 briefly reviews the PS-ZCPA method and presents the adaptive threshold adjustment procedure (ATAP). Section 3 outlines the experimental setup and the results using ATAP. Sections 4 and 5 describe the implementation details of auditory masking and the modulation filter, respectively, in the PS-ZCPA method. Section 6 describes the PS-ZCPA with Wiener filter based noise reduction. Section 7 gives the experimental results and discussion. Finally, Sect. 8 draws some conclusions.

2. PS-ZCPA Method

Figure 1 shows the block diagram of the PS-ZCPA method with noise reduction and the adaptive threshold adjustment procedure. The PS-ZCPA method [4] is divided into two parts: A) a pitch detector that includes voiced and unvoiced/silent frame detection, and B) a feature extractor. The PS-ZCPA method uses pitch-synchronized peaks to extract features.

Fig. 1 Block diagram of the PS-ZCPA method with adaptive threshold adjustment and noise reduction.

First, the speech signal is passed through a bank of band-pass filters (BPFs). Then the PS-ZCPA-based features are computed by the following procedure:

(1) detect all the zero crossings in each filter output (subband signal),
(2) calculate the inverse of each successive positive zero-crossing interval length, which corresponds to the dominant frequency,
(3) collect histograms of the inverse zero-crossing lengths over all subband signals,
(4) increase the histogram bin count by the logarithmic value of the peak detected in the corresponding zero-crossing interval.

To detect each zero-crossing point precisely, a linear interpolation between the samples preceding and succeeding the positive zero-crossing is performed.

In the PS-ZCPA method, for a voiced frame, the highest peak, $P_h$, within a pitch period obtained by the pitch detector is extracted. Only the peaks whose height is above $l$% of $P_h$ within that pitch period contribute to the histogram bin count. The other (smaller) peaks in that pitch period make no contribution. For unvoiced/silent frames, as there are no pitches, all the peaks contribute to the histogram bin count. The value of $l$ should be chosen carefully so that no important information is lost and heavily noise-corrupted peaks are not counted. A final histogram is obtained by summing the histograms over all the channels. In [4], it was shown that manual adjustment of the threshold, $l$, was necessary to obtain the optimum result for speech signals with different SNRs.
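The feature extraction steps above can be sketched for a single subband as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names, the `pitch_marks` input (sample indices delimiting pitch periods, assumed to come from a separate pitch detector), and the use of `log(1 + peak)` as the logarithmic weight are all hypothetical choices made for the example. The optional `noise_floor` argument is likewise an assumption; leaving it at zero gives the fixed threshold of $l$% of $P_h$ described here.

```python
import numpy as np

def positive_zero_crossings(x):
    """Fractional sample positions of upward zero crossings (linear interpolation)."""
    idx = np.where((x[:-1] < 0) & (x[1:] >= 0))[0]
    frac = -x[idx] / (x[idx + 1] - x[idx])  # denominator is always positive here
    return idx + frac

def ps_zcpa_histogram(x, fs, bin_edges_hz, pitch_marks=None,
                      l_percent=20.0, noise_floor=0.0):
    """Histogram of one subband signal x, weighted by log peak amplitudes.

    pitch_marks=None treats the segment as unvoiced/silent (every peak above
    noise_floor counts). Otherwise a peak counts only if it exceeds
    max(l_percent % of the highest peak in its pitch period, noise_floor).
    """
    hist = np.zeros(len(bin_edges_hz) - 1)
    zc = positive_zero_crossings(x)
    if len(zc) < 2:
        return hist
    starts, ends = zc[:-1], zc[1:]
    freqs = fs / (ends - starts)  # inverse interval length in Hz
    peaks = np.array([x[int(np.ceil(a)):int(np.floor(b)) + 1].max()
                      for a, b in zip(starts, ends)])
    thresh = np.full(len(peaks), float(noise_floor))
    if pitch_marks is not None:
        # index of the pitch period in which each interval starts
        period = np.searchsorted(pitch_marks, starts, side="right") - 1
        for p in np.unique(period):
            in_p = period == p
            p_h = peaks[in_p].max()  # highest peak of this pitch period
            thresh[in_p] = max(l_percent / 100.0 * p_h, noise_floor)
    bins = np.digitize(freqs, bin_edges_hz) - 1
    valid = (peaks > thresh) & (bins >= 0) & (bins < len(hist))
    # log1p keeps the weights non-negative (an assumption of this sketch)
    np.add.at(hist, bins[valid], np.log1p(peaks[valid]))
    return hist
```

Summing the returned histograms over all filter outputs would give the final histogram of a frame. For a 500 Hz sinusoid sampled at 8 kHz, every interval is 16 samples long, so all of the weight falls into the bin that contains 500 Hz.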

2.1 Adaptive Threshold Adjustment Procedure

To avoid having to change the threshold to obtain the optimum result in different noise conditions, a time-domain adaptive threshold adjustment procedure (ATAP) is integrated into the PS-ZCPA method. In ATAP, the noise level is checked in each filter output, and the threshold is adjusted accordingly, as shown in the following steps:

i. Calculate the average value, $P_{avg}$, of the peaks in the first 160 ms (silent segments) of each filter output. $P_{avg}$ corresponds to the noise intensity of the corresponding filter output. This step is performed in the 'Noise intensity calculator' block in Fig. 1.
ii. For voiced segments, set the threshold to the maximum of $l$% of $P_h$ and $P_{avg}$ for each pitch interval. At higher SNR, where the noise level is low, the threshold is automatically adjusted to $l$% of $P_h$ within each pitch period, and at lower SNR, where the noise level is high, it is automatically set to $P_{avg}$. This step ensures that heavily noise-corrupted peaks are not counted and that no important peak information is lost.
iii. For unvoiced/silent segments, fix the threshold to $P_{avg}$.

The peaks that are above the adjusted threshold contribute to the histogram bin count.

3. Experiments on the PS-ZCPA Method

3.1 Database

The performance of the PS-ZCPA method with and without ATAP was evaluated using the Aurora-2J database [9]. In the Aurora-2J database, the utterances are connected Japanese digit strings and the sampling rate is 8 kHz. A selection of eight different real-world noises has been added to the speech over a range of signal-to-noise ratios (SNRs: −5 dB, 0 dB, 5 dB, 10 dB, 15 dB, 20 dB, clean). In the Aurora-2J digit recognition task, the evaluation focuses on robustness against additive noise and against distortion by an unknown transmission channel. There are three tests in the Aurora-2J database to evaluate the performance of all considered techniques [9]. The eight different real noises are divided into two groups for testing. Data in Test A are corrupted by Subway, Babble, Car, and Exhibition noises. Data in Test B are corrupted by Restaurant, Street, Airport, and Station noises. In Test C, channel distortion is included in addition to the additive noise.

In the baseline system [9], there are thirteen recognition units: eleven digit HMMs with sixteen states and twenty Gaussian mixtures, one silence HMM with three states, and one short pause HMM with one state (shared with the middle state of silence). Thirteen Gaussian mixtures are used for silence and short pause. For the experiments in this paper, the training was performed using clean data only and the category was zero (no change at the back-end).

3.2 Experimental Setup

Twenty Hamming-windowed FIR BPFs of order 61, with center frequencies uniformly spaced on the Bark scale between 150 Hz and 3.7 kHz, are used for the experiments on the PS-ZCPA method. The bandwidths of the filters are chosen to equal the critical bandwidths. The frequency range between 0 and 4 kHz is partitioned into eighteen histogram bins uniformly distributed on the Bark scale. Frame lengths are set to $30/f_{ck}$, where $f_{ck}$ are the center frequencies of the filters in kHz, and frames are computed every 10 ms. For the pitch detection algorithm (PDA) [4], the first fourteen filters, where the center frequency of the 14th filter is 1.9 kHz, are used to produce a summary auto-correlogram. The summary auto-correlogram is then used to detect the pitches and the voiced and unvoiced/silent frames of the speech signal.

For the PS-ZCPA method, the value of the threshold parameter $l$ is fixed to 20 when ATAP is used, and to 40 when ATAP is not used. These values of $l$ were found to give the optimum results. However, a variation of ±20 around the optimum value of $l$ does not cause much deviation in the results across different noise conditions.

In the baseline system of Aurora-2J, the feature vectors consist of 12 MFCCs and log energy with their corresponding delta and acceleration coefficients. Thus, each vector contains 39 components in total. The MFCC features are calculated using 25 ms frame lengths at a 10 ms frame shift. The performance of the PS-ZCPA method both with and without ATAP was compared with those of the conventional ZCPA method [2] and the MFCC method (baseline [9]). For the ZCPA method and the PS-ZCPA method, DCT is applied to the histogram to extract twelve cepstral coefficients. The corresponding delta and acceleration coefficients are appended to give feature vectors of dimension 36.

3.3 Experimental Results and Discussion

The experimental results in word accuracy (%Acc) are shown in Tables 1 and 2. In Table 1, (a) and (b) give the results of the PS-ZCPA method without and with ATAP, respectively.
Table 2 shows the relative performance of the PS-ZCPA method with and without ATAP, and of the ZCPA method, in comparison with the baseline. Tables 1 and 2 justify the use of ATAP in the PS-ZCPA method. For example, the PS-ZCPA without ATAP has an overall average word accuracy of 65.40%, while that with ATAP has 68.98%. For all test data sets (Set A, Set B, Set C), the proposed PS-ZCPA with ATAP shows better performance than that without ATAP. The PS-ZCPA method both with and without ATAP also outperforms the ZCPA method. For the rest of the experiments in this paper, the PS-ZCPA method refers to the PS-ZCPA with ATAP unless otherwise mentioned.

Figure 2 shows spectrograms of the utterance /roku/ ('six') for clean speech (Fig. 2 (a)) and for subway noisy speech with SNR = 10 dB (Fig. 2 (b)) using the PS-ZCPA method with ATAP, and for the same noisy speech using the PS-ZCPA method without ATAP (Fig. 2 (c)). From Fig. 2, we can see that ATAP helps to maintain the formant frequencies even in noisy conditions. The distorted temporal structure in high frequency bands (Fig. 2 (c)) is minimized by using ATAP. The linear interpolation described in Sect. 2, which detects precise zero-crossing points, also helps to reduce saturation (distortion) of the spectrogram at higher frequency channels.

Table 1 Performance of the PS-ZCPA method.

Table 2 Relative performances of the PS-ZCPA with and without ATAP, and the ZCPA, compared with MFCC.

Fig. 2 Spectrograms for the utterance /roku/ ('six') obtained using the PS-ZCPA method with ATAP (a) for clean speech, (b) for subway noisy speech with SNR = 10 dB, and (c) that obtained using the PS-ZCPA method without ATAP for the same noisy speech.

4. PS-ZCPA Method with Auditory Masking

Masking is the process or amount by which the threshold of audibility is raised by the presence of another sound. Two types of masking are observed in human auditory perception: simultaneous masking and non-simultaneous (temporal) masking. Simultaneous masking is a frequency domain phenomenon in which a low level signal can be made inaudible by simultaneously occurring stronger signals if the masker and the maskee are close enough to each other in frequency. The maskers affect not only the frequencies within a critical band, but also those in surrounding bands. A spreading function represented by a matrix $S(Z_i, Z_j)$ estimates the effects of masking across the critical bands. The function used in this work was proposed in [10] and is expressed as

$$S(Z_i, Z_j) = 15.81 + 7.5\,(Z_i - Z_j + 0.474) - 17.5\sqrt{1 + (Z_i - Z_j + 0.474)^2} \tag{1}$$

where $Z_i$ and $Z_j$ are the Bark frequencies of the masked signal and the masking signal, respectively. The spreading function is independent of both the center frequency and the level of the masking signal. The distribution of the spreading function $S(Z_i, Z_j)$ is shown in Fig. 3 (a). In this paper, $Z_i$ and $Z_j$ represent the histogram bins. The histogram generated by the PS-ZCPA, $H(k, Z_i)$, where $k$ is the frame number, is then multiplied with $S(Z_i, Z_j)$ as follows:

$$C(k, Z_i) = \sum_{Z_j} S(Z_i, Z_j) \times H(k, Z_i) \tag{2}$$

$C(k, Z_i)$ denotes the value of the spread masked histogram at bin $Z_i$ for the $k$-th frame of the speech. Thus, in the PS-ZCPA method, the count for each histogram bin corresponding to the Bark index is determined first; the histogram bin count is then spread over all the histogram bins by convolving it with the spreading function. Note that the filter bandwidth in the PS-ZCPA method corresponds to the critical bandwidth, and that simultaneous masking acts across the critical bands.

The other kind of masking, in which weak signal components become inaudible in the presence of stronger ones in the same critical band that precede or follow them in time, is called temporal masking. It was shown that the effect of forward masking is stronger than that of backward masking in speech perception [11]. In this paper, temporal masking is implemented using the following unilateral integration model [12], which simulates forward masking:

$$y(k, Z_i) = H(k, Z_i) + C_0 \sum_{n} \alpha^{n} H(k-n, Z_i) - C_1 \sum_{n} \beta^{n} H(k-n, Z_i) \tag{3}$$

where $C_0$ and $C_1$ are constants reflecting the amount of integration, and $\alpha$ and $\beta$ are the exponential decays of the previous response and the masking term, respectively. $H(k, Z_i)$ is the value of the histogram at the $Z_i$-th bin for the $k$-th frame, $y(k, Z_i)$ is the corresponding masked value, and $n$ is the window length of forward masking. The distribution of Eq. (3) is shown in Fig. 3 (b). The solid line in Fig. 3 (b) represents the temporal masking curve obtained with the optimal values of the constants $C_0$, $C_1$, $\alpha$, and $\beta$, while the dotted line represents the curve that gives the minimal recognition accuracy, obtained with the worst combination of the constant values (described in Sect. 7.2).

A block diagram of the PS-ZCPA method followed by auditory masking is shown in Fig. 4. In the experiments with both types of masking, simultaneous masking is applied first, followed by temporal masking.

5. PS-ZCPA Method with Modulation Enhancement

Fig. 3 Distribution of masking curves.

Fig. 4 PS-ZCPA feature extraction with auditory masking.

The intelligibility of speech has a strong correlation with the modulation spectrum of speech. The modulation spectrum is defined as the spectral representation of the temporal envelope of the speech signal. It has been verified that no significant information for perceiving speech is contained below 1 Hz or above 16 Hz in the modulation spectrum of the temporal envelopes [6]. A block diagram of the PS-ZCPA method with modulation enhancement is shown in Fig. 5. The input speech signal is passed through a bank of BPFs, and then the zero crossings, both positive-going and negative-going, as well as the peaks between them are detected in each filter output. Next, the positive and negative peaks are full-wave rectified, and an envelope is extracted by joining the peaks. The zero-crossing values are stored in a buffer and are used to reconstruct the zero-crossings after modulation filtering.

Fig. 5 Block diagram of the PS-ZCPA method with modulation enhancement.

Fig. 6 (a) Magnitude-frequency response of the modulation filter. (b) Extracted envelope (middle) from a BPF output (upper) using positive and negative peaks in each zero-crossing interval. The corresponding modulated envelope is shown in the lower panel.

A modulation filter is applied to the envelope. The modulation filter is designed as a 61st-order FIR filter, and the filter design is the same for all the envelopes. Figure 6 (a) shows the magnitude-frequency characteristics of the modulation filter used in the experiment. The filter enhances components around 1–16 Hz and suppresses other components. An example of an envelope from a band-pass (1017 Hz–1167 Hz) filter output and the corresponding modulated envelope is shown in Fig. 6 (b) for the speech /ro/ (of /roku/ ('six')) at SNR = 5 dB. After modulation filtering, new zero-crossing peak-amplitude information is calculated from the modulated envelope with the help of the previously stored zero-crossing information. Finally, PS-ZCPA features are calculated by pitch intervals [4].

6. PS-ZCPA Method with Noise Reduction

The PS-ZCPA method is robust to noise without any kind of noise reduction procedure (see Table 2); however, embedding such a procedure into the method may further increase its robustness to noise. In this paper, the performance of the PS-ZCPA method with noise reduction is compared with that of the ETSI standard Advanced DSR (Distributed Speech Recognition) front-end ES202 (WI008) [8], which is based on a Mel-cepstrum representation. The noise reduction of WI008 is based on Wiener filter theory and is performed in two stages. The details of the noise reduction algorithm are described in [8], and the corresponding process flow is shown in Fig. 7. The input signal is de-noised in the first stage and the output enters the second stage, where an additional dynamic noise reduction is performed.

Fig. 7 Process flow of noise reduction used in WI008 [8].

In the first stage, after framing an input signal, the linear spectrum of each frame is estimated. The signal spectrum is then smoothed along the time index in the Power Spectrum Density (PSD) Mean block. After that, frequency domain Wiener filter coefficients are calculated by using both the current frame spectrum estimate and the noise spectrum estimate. The noise spectrum is estimated from noise frames, which are detected by a voice activity detector (VAD). The linear Wiener filter coefficients are smoothed along the frequency axis by using a Mel-filter bank, resulting in a Mel-warped frequency domain Wiener filter. The impulse response is obtained by applying a Mel-warped inverse DCT. Finally, the input signal of each stage is filtered in the Apply Filter block. At the end of noise reduction, the DC offset of the noise-reduced signal is removed in the OFF block. Additionally, in the second stage, the aggressiveness of noise reduction is controlled by gain factorization.

To implement noise reduction in the PS-ZCPA method, the noise reduction procedure used in WI008 is adopted without any change. First, the input signal is passed through the Wiener filter based noise reduction procedure, as shown in Fig. 1. Next, the de-noised signal is fed into a bank of BPFs. The filter outputs are then processed to give PS-ZCPA features. It can be noted that the ATAP described in Sect. 2 has little effect on reducing noise.

7. Experiments on the PS-ZCPA Method with Auditory Masking, Modulation Enhancement, and Noise Reduction

7.1 Database

All the experiments were carried out using the same Aurora-2J database described in Sect. 3.1.

7.2 Experimental Setup

The experimental setup is the same as described in Sect. 3.2. ATAP is applied in all the experiments. For temporal masking (Eq. (3)), the constants $C_0$, $C_1$, $\alpha$, and $\beta$ are chosen to give the best performance under the existing setup. In the experiments, the values of the constants are varied as follows: $C_0$ from 0.1 to 0.9, $C_1$ from 0.01 to 0.1, $\alpha$ from 0.1 to 0.9, and $\beta$ from 0.9 to 0.99. The optimal result is found with $C_0 = 0.3$, $C_1 = 0.03$, $\alpha = 0.6$, and $\beta = 0.98$. However, no significant deviation in the overall result is observed with small variations in the constant values.
For example, the optimal recognition accuracy using temporal masking is 72.58% (Table 3), and the recognition accuracy without temporal masking is 68.98% (Table 1 (b)). The worst result obtained is 71.96% (not shown), with the constant values $C_0 = 0.7$, $C_1 = 0.1$, $\alpha = 0.1$, and $\beta = 0.93$. The temporal masking curve for this worst-case scenario, which gives an accuracy of 71.96%, is shown by the dotted line in Fig. 3 (b). The value of $n$ is set to sixteen. Masking is performed on the PS-ZCPA histogram, where the number of bins is eighteen. Then, DCT is applied to the masked histogram to extract 12 cepstral coefficients. The corresponding delta and acceleration coefficients are appended to give 36-dimensional feature vectors.

For the WI008 performance evaluation, 25 ms Hamming-windowed speech segments are taken every 10 ms. Twelve Mel-cepstral parameters along with log power and their delta and acceleration parameters (total dimension 39) are used as feature vectors.

For comparison, the performances of the following methods were evaluated:

a. The PS-ZCPA with simultaneous masking (S-mask)
b. The PS-ZCPA with temporal masking (T-mask)
c. The PS-ZCPA with both types of masking (ST-mask)
d. The PS-ZCPA with modulation, without masking (MOD)
e. The PS-ZCPA with modulation, with both types of masking (MOD ST-mask)
f. WI008
g. The PS-ZCPA with noise reduction, no modulation, no masking (NR)
h. The PS-ZCPA with noise reduction, with both types of masking (NR ST-mask)
i. The PS-ZCPA with noise reduction, modulation, and both types of masking (NR-MOD ST-mask)

7.3 Experimental Results and Discussion

The experimental results in overall average accuracy for the PS-ZCPA method with auditory masking are shown in Table 3. From Tables 1 and 3, we can see that embedding auditory masking increases the performance of the PS-ZCPA method. Simultaneous masking has less effect than temporal masking. In fact, simultaneous masking has a negative effect on the PS-ZCPA in low noise conditions. This suggests that the PS-ZCPA method already has some sort of spectral masking effect integrated. The best result is obtained by embedding both types of masking; for example, the overall average accuracy is increased from 68.98% (Table 1 (b)), obtained without masking, to 73.31%.

Table 3 Performance of the PS-ZCPA method with auditory masking. The results are given in overall average (avg.) accuracies (%). S-mask, T-mask, and ST-mask stand for simultaneous masking, temporal masking, and both types of masking, respectively.

SNR (dB)   S-mask   T-mask   ST-mask
clean      99.87    99.91    99.90
20         94.92    95.16    95.25
15         84.43    85.55    86.14
10         76.30    78.51    79.24
5          57.67    60.73    61.77
0          38.14    42.97    44.15
−5         27.34    32.93    34.28
Avg.       70.29    72.58    73.31

Table 4 shows the experimental results of the PS-ZCPA method with modulation enhancement. Table 4 indicates that a small improvement is achieved by adding modulation enhancement to the PS-ZCPA method. For example, the PS-ZCPA with modulation enhancement has 69.91% overall average accuracy, while that without modulation has 68.98% (Table 1 (b)). The small size of this improvement can be attributed to the fact that the PS-ZCPA method uses dominant frequencies and peaks, which are less corrupted by noise, so that modulation enhancement has little left to improve.

Table 4 Performance of the PS-ZCPA method with modulation (MOD) enhancement. The results are given in overall avg. accuracies (%).

SNR (dB)   MOD      MOD ST-mask
clean      99.90    99.90
20         94.71    95.48
15         83.72    86.54
10         75.73    79.86
5          57.23    62.53
0          38.16    46.60
−5         26.35    35.14
Avg.       69.91    74.20

The experimental results of applying Wiener filter based noise reduction are shown in Table 5. The PS-ZCPA method with noise reduction and both types of masking performs the best, while that with noise reduction and without masking still achieves better recognition performance than WI008. In the experiments, the overall average accuracy of WI008 is 77.98%, that of the PS-ZCPA with noise reduction and without masking is 80.78%, and that with both types of masking is 82.17%. However, the PS-ZCPA method with noise reduction, modulation enhancement, and both types of masking all together shows somewhat degraded performance (the overall average word accuracy is 81.79%). This is probably because, in the PS-ZCPA method, the speech wave and the zero-crossings are reconstructed after noise reduction and modulation filtering, respectively, and some information may be lost during the two reconstruction processes. In particular, after modulation filtering, the zero-crossings are reconstructed with the help of the previously stored zero-crossing information, which is composed of both low and high frequency components, while the modulated envelope is composed only of low frequency components. Further investigation will be carried out on how to minimize this loss when applying both noise reduction and modulation enhancement to the PS-ZCPA method.

Table 5 Performance of the PS-ZCPA method with Wiener filter based noise reduction (NR). The results are given in overall avg. accuracies (%).

SNR (dB)   WI008    NR       NR ST-mask   NR-MOD ST-mask
clean      98.48    99.91    99.90        99.89
20         97.52    98.45    98.75        98.65
15         95.00    96.53    97.23        97.04
10         88.16    90.78    91.79        91.34
5          70.69    73.34    75.61        75.26
0          38.52    44.79    47.47        46.65
−5         7.84     26.50    27.66        28.27
Avg.       77.98    80.78    82.17        81.79

The complete result of the PS-ZCPA method with auditory masking together with Wiener filter based noise reduction, which gives the best performance, is shown in Table 6. It achieves a notable overall relative performance of 66.87% (not shown in the tables) compared with the methods in Table 2.

Table 6 Performance of the PS-ZCPA method with Wiener filter based noise reduction using both types of auditory masking.
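The masking stage behind the ST-mask results (Eqs. (1) to (3), simultaneous masking followed by temporal masking) can be sketched as below. This is an illustrative sketch under several assumptions, not the authors' code: the function names and the `bark` array of bin positions are hypothetical; the spreading function of Eq. (1) is taken to be in dB and converted to a linear power ratio before use; Eq. (2) is read as a convolution over the masker bins, as the accompanying text describes; and the sums in Eq. (3) are assumed to run over the previous $n = 1, \dots, 16$ frames.

```python
import numpy as np

def spreading_matrix_db(bark):
    """Eq. (1): S(Z_i, Z_j) for all pairs of bin positions given in Bark."""
    dz = bark[:, None] - bark[None, :] + 0.474  # Z_i - Z_j + 0.474
    return 15.81 + 7.5 * dz - 17.5 * np.sqrt(1.0 + dz ** 2)

def simultaneous_mask(hist, bark):
    """Spread each frame of hist (frames x bins) across the bins.

    Assumption: S is in dB and is applied as a linear power ratio, with
    C[k, i] = sum_j S[i, j] * H[k, j].
    """
    s_lin = 10.0 ** (spreading_matrix_db(bark) / 10.0)
    return hist @ s_lin.T

def temporal_mask(hist, c0=0.3, c1=0.03, alpha=0.6, beta=0.98, n_win=16):
    """Eq. (3) with the optimal constants reported in Sect. 7.2.

    Assumption: both sums run over the previous n = 1..n_win frames.
    """
    hist = np.asarray(hist, dtype=float)
    y = hist.copy()
    for n in range(1, n_win + 1):
        past = np.zeros_like(hist)
        past[n:] = hist[:-n]  # H(k - n, Z_i), zero before the first frame
        y += (c0 * alpha ** n - c1 * beta ** n) * past
    return y

def st_mask(hist, bark):
    """Simultaneous masking first, followed by temporal masking."""
    return temporal_mask(simultaneous_mask(hist, bark), n_win=16)
```

With the optimal constants, the combined weight $C_0 \alpha^n - C_1 \beta^n$ is positive for $n \le 4$ and negative from $n = 5$ onward, so under these assumptions a strong frame briefly reinforces and then suppresses the frames that follow it. The spreading function evaluates to approximately 0 dB when $Z_i = Z_j$, so a bin is left almost unchanged by its own contribution.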

8. Conclusion

A simple noise-robust adaptive threshold adjustment procedure was embedded into the previously proposed PS-ZCPA-based feature extraction method to increase its robustness. The effects of auditory masking, modulation enhancement and Wiener filter based noise reduction were also investigated. These noise suppression techniques improved the robustness of the PS-ZCPA method. Integrating the masking effect and noise reduction improved the performance of the method, while modulation enhancement yielded little gain. Simultaneous masking had a negative effect in the clean condition. The PS-ZCPA method with noise reduction outperformed the standard DSR front-end WI008. The best result was obtained by the PS-ZCPA method using auditory masking together with Wiener filter based noise reduction.

Acknowledgments

This work was supported by the 21st Century COE Program “Intelligent Human Sensing”, from the Ministry of Education, Culture, Sports, Science and Technology, Japan.

References

[1] O. Ghitza, “Auditory models and human performance in tasks related to speech coding and speech recognition,” IEEE Trans. Speech Audio Process., vol.2, no.1, pp.115–132, Jan. 1994.
[2] D.S. Kim, S.Y. Lee, and R.M. Kil, “Auditory processing of speech signals for robust speech recognition in real-world noisy environments,” IEEE Trans. Speech Audio Process., vol.7, no.1, pp.55–69, Jan. 1999.
[3] T. Hashimoto, Y. Katayama, K. Murata, and I. Taniguchi, “Pitch-synchronous response of cat cochlear nerve fibers to speech sounds,” Japanese J. Physiology, vol.25, pp.633–644, 1975.
[4] M. Ghulam, T. Fukuda, J. Horikawa, and T. Nitta, “A noise-robust feature extraction method based on pitch-synchronous ZCPA for ASR,” Proc. ICSLP04, pp.133–136, 2004.
[5] B.C.J. Moore, An Introduction to the Psychology of Hearing, Fourth ed., Academic Press, New York, 1997.
[6] M.R. Schroeder, “Modulation transfer functions: Definition and measurement,” IEEE Trans. Acoust. Speech Signal Process., vol.ASSP-26, no.6, pp.179–182, 1978.
[7] T. Arai, M. Pavel, H. Hermansky, and C. Avendano, “Syllable intelligibility for temporally filtered LPC cepstral trajectories,” J. Acoust. Soc. Am., vol.105, no.5, pp.2783–2791, 1999.
[8] ETSI ES 202 050 V1.1.1, “Distributed speech recognition; advanced front-end feature extraction algorithm; compression algorithms,” 2002.
[9] S. Nakamura, K. Yamamoto, K. Takeda, S. Kuroiwa, N. Kitaoka, T. Tamada, M. Mizumachi, T. Nishiura, M. Fujimoto, A. Sasou, and T. Endo, “Data collection and evaluation of Aurora-2 Japanese corpus,” IEEE Workshop on Automatic Speech Recognition and Understanding (ASRU), pp.619–623, 2003.
[10] M.R. Schroeder, B.S. Atal, and J.L. Hall, “Optimizing digital speech coders by exploiting masking properties of the human ear,” J. Acoust. Soc. Am., vol.66, no.16, pp.1647–1651, Dec. 1979.
[11] E. Zwicker and U.T. Zwicker, “Audio engineering and psychoacoustics: Matching signals to the final receiver, the human auditory system,” J. Audio Eng. Soc., vol.39, no.3, pp.115–125, 1991.
[12] http://cnsl.kaist.ac.kr/research/kypark/masking/masking.html

Muhammad Ghulam was born in 1973. He received his Bachelor's degree in Computer Science and Engineering in 1997 from Bangladesh University of Engineering and Technology, and his M.E. degree in Knowledge-based Information Engineering in 2003 from Toyohashi University of Technology, Japan. He is currently enrolled as a Ph.D. student in the Department of Electronic and Information Engineering at Toyohashi University of Technology. He is engaged in research on automatic speech recognition. He is a member of the Acoustical Society of Japan (ASJ), and a student member of the IEEE and the ISCA.


Takashi Fukuda received his Ph.D. degree from Toyohashi University of Technology, Japan, in 2005. He is currently engaged as a researcher at IBM Research, Tokyo Research Laboratory. His research field includes automatic speech recognition. He is a member of the ASJ and the ISCA.

Kouichi Katsurada received his Ph.D. degree from Osaka University in 2000. He has been a research associate at the department of knowledge-based information engineering, Toyohashi University of Technology since 2000. His research interests include multimodal interaction, knowledge processing and semantic web. He is a member of AAAI, IPSJ, JSAI, ASJ, NLP and HIS.

Junsei Horikawa was born in 1952. He received his B.E. in 1975 and M.E. in 1977 from Osaka University, Osaka, Japan, and his Ph.D. in 1986 from Tokyo Medical and Dental University, Tokyo, Japan. After engaging in research on physiological mechanisms of the auditory system at the Medical Research Institute, Tokyo Medical and Dental University, he has been a professor in the Department of Knowledge-Based Information Engineering, Toyohashi University of Technology since 1998. His current research fields are auditory neurophysiology and psychology in animals and humans. He is a member of the Acoustical Society of Japan, the Physiological Society of Japan, the Japan Neuroscience Society and the Japanese Neural Network Society.

Tsuneo Nitta was born in Yamaguchi Prefecture, Japan in 1946. He received his B.E.E. degree in 1969 and his Dr. Eng. degree in 1988, both from Tohoku University, Japan. After engaging in research and development at the R&D Center of Toshiba Corporation and Multimedia Engineering Laboratory, where he was a Chief Research Scientist, he has been a Professor at the Graduate School of Engineering, Toyohashi University of Technology since 1998. His current research interests include speech recognition, multi-modal interaction, and acquisition of language and concepts. He received the Best Paper Award from the Institute of Electronics, Information and Communication Engineers (IEICE), Japan, in 1988. He is a member of the Information Processing Society of Japan (IPSJ), the Acoustical Society of Japan (ASJ), the Japanese Society for Artificial Intelligence and the IEEE.
