Chapter # SPEECH SEGREGATION USING AN EVENT-SYNCHRONOUS AUDITORY IMAGE AND STRAIGHT

Toshio Irino*, Roy D. Patterson+, and Hideki Kawahara*

*Faculty of Systems Engineering, Wakayama University, 930 Sakaedani, Wakayama, 640-8510 Japan. +CNBH, Physiology Department, Cambridge University, Downing Street, Cambridge, CB2 3EG, UK.

Abstract: We present new methods to segregate concurrent speech sounds using the auditory image model and a vocoder. The auditory representation of sound, referred to as an 'auditory image', preserves fine temporal information, unlike conventional window-based processing systems. This makes it possible to segregate speech sources with an event-synchronous procedure. Fundamental frequency information is used to estimate the sequence of glottal pulse times for a target speaker, and to suppress the glottal events of other speakers. The procedure leads to robust extraction of the target speech and effective segregation even when the SNR is as low as 0 dB. Moreover, the segregation performance remains high when the speech contains jitter or when the fundamental frequency is not estimated perfectly. This contrasts with conventional comb-filter methods, where errors in event estimation produce a substantial reduction in performance. Test results suggest that this auditory segregation method has potential for speech enhancement in applications such as hearing aids.

Key words: Vocoder, STRAIGHT, Auditory image, AIM, gammachirp auditory filter

1. INTRODUCTION

Speech segregation is an important aspect of speech signal processing, and many segregation systems have been proposed (Parsons, 1976; Lim et al., 1978). Most of the systems are based on the short-time Fourier transform, or a sinusoid model, and they segregate sounds by extracting harmonic components located at integer multiples of the fundamental frequency. It is, however, difficult to extract truly harmonic components when the estimation of the fundamental frequency is disturbed by concurrent noise; the error increases in proportion to the harmonic number.

We have developed an alternative method for speech segregation based on the Auditory Image Model (AIM) and a scheme of event-synchronous processing. AIM was developed to provide a reasonable representation of the "auditory image" we perceive in response to sounds (Patterson et al., 1995). We have also developed an "auditory vocoder" (Irino et al., 2003a,b) for resynthesizing speech from the auditory image using an existing, high-quality vocoder, STRAIGHT (Kawahara et al., 1999). The auditory representation preserves fine temporal information, unlike conventional window-based processing, and this makes it possible to do synchronous speech segregation. We also developed a method to convert F0 into event times and demonstrated the potential of the method in low-SNR conditions.

This paper presents a procedure for segregating speakers using an event-synchronous auditory image with two different methods of resynthesizing the speech of the individual speakers. We also show that the system is robust in the presence of errors in fundamental frequency estimation, which are unavoidable.

2. AUDITORY VOCODER

We propose two versions of the auditory vocoder, as shown in Figs. 1 and 2. They involve the Auditory Image Model (AIM) (Patterson et al., 1995), a robust F0 estimator (e.g., Nakatani and Irino, 2002), and a synthesis module based either on STRAIGHT (Fig. 1) or an auditory synthesis filterbank (Fig. 2). The speech segregation is performed in the event-synchronous version of AIM, which is common to both systems. In section 2.1, we describe the event-synchronous procedure used to enhance speech representations and a method for converting F0 to event times. In section 2.2, we describe two procedures for resynthesizing signals from the auditory image.

2.1 Event-based Auditory Image Model

AIM performs its spectral analysis with a gammatone or gammachirp filterbank on a quasi-logarithmic (ERB) frequency scale. The output is half-wave rectified and compressed in amplitude to produce a simple form of Neural Activity Pattern (NAP). The NAP is converted into a Stabilized Auditory Image (SAI) using a strobe mechanism controlled by an event detector. Basically, it calculates the times between neural pulses in the auditory nerve and constructs an array of time-interval histograms, one for each channel of the filterbank.
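The front-end processing just described can be illustrated with a minimal sketch. This is not the authors' implementation; the FIR filter length, the filter order, the bandwidth parameter b = 1.019, the channel count, and the square-root compression are assumptions made for illustration only.

```python
# Sketch of the AIM front end: gammatone filterbank on an ERB-rate scale,
# half-wave rectification and amplitude compression to form a NAP.
import numpy as np
from scipy.signal import fftconvolve

def erb(fc):
    """Equivalent rectangular bandwidth (Hz) of the auditory filter at fc (Hz)."""
    return 24.7 * (4.37 * fc / 1000.0 + 1.0)

def erb_space(f_lo, f_hi, n_ch):
    """Center frequencies equally spaced on the ERB-rate scale (Glasberg & Moore)."""
    c = 9.26449
    k = 24.7 * c
    e_lo, e_hi = c * np.log(1 + f_lo / k), c * np.log(1 + f_hi / k)
    return k * (np.exp(np.linspace(e_lo, e_hi, n_ch) / c) - 1)

def gammatone_ir(fc, fs, dur=0.04, order=4, b=1.019):
    """FIR gammatone impulse response at center frequency fc (Hz)."""
    t = np.arange(int(dur * fs)) / fs
    return t ** (order - 1) * np.exp(-2 * np.pi * b * erb(fc) * t) * np.cos(2 * np.pi * fc * t)

def nap(x, fs, f_lo=100.0, f_hi=6000.0, n_ch=60, compress=0.5):
    """Neural Activity Pattern: filterbank -> half-wave rectify -> compress."""
    fcs = erb_space(f_lo, f_hi, n_ch)
    out = np.empty((n_ch, len(x)))
    for i, fc in enumerate(fcs):
        ir = gammatone_ir(fc, fs)
        bmm = fftconvolve(x, ir / np.sqrt(np.sum(ir ** 2)), mode="full")[:len(x)]
        out[i] = np.maximum(bmm, 0.0) ** compress   # half-wave rectify + compress
    return fcs, out
```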

Figure 1. Configuration of the auditory vocoder. (Block diagram: the analysis section is the Auditory Image Model, comprising auditory frequency analysis with a gammatone filterbank on the ERB scale, the Neural Activity Pattern (NAP), an event detector and strobing mechanism, the Stabilized Auditory Image (SAI), and the Mellin Image (MI); a mapping module (warped-DCT + nonlinear MRA) and a robust F0 estimator connect it to the STRAIGHT synthesis section, where a pulse/noise generator and a spectral filter produce the output so(t) from the input si(t).) The gray arrows show the processing path used for parameter estimation (training). Once the parameters are fixed, the path with the black arrows is used for resynthesizing speech signals.

Figure 2. Configuration of the auditory vocoder using an auditory synthesis filterbank; there is no training path. (Block diagram: the Auditory Image Model computes basilar membrane motion (BMM) with a gammachirp filterbank (GCFB), the Neural Activity Pattern (NAP), and a Stabilized Auditory Image (SAI) enhanced by event-synchronous strobing; a robust fundamental frequency (F0) estimator drives the target event detector and a pulse/noise generator. In the synthesis module, the average BMM amplitude is recovered from the SAI via SAI-to-NAP conversion, the BMM phase is generated from the glottal/noise source passed through the GCFB, and the output is produced by a temporally reversed gammachirp filterbank with a weighted sum.)

2.1.1 Event detection

An event-detection mechanism was introduced to locate glottal pulses accurately for use as strobe signals. The upper panel of Fig. 3 shows about 30 ms of the NAP of a male vowel. The abscissa is time in ms; the ordinate is the center frequency of the gammatone auditory filter in kHz. The mechanism identifies the time interval associated with the repetition of the neural pattern, and this interval is the fundamental period of the speech at that point. Since the NAP is produced by convolution of the speech signal with the auditory filter, and the group delay of the auditory filter varies with center frequency, it is necessary to compensate for group delay across channels when estimating event timing in the speech sound. The middle panel shows the NAP after group-delay compensation; the operation aligns the responses of the filters in time to each glottal pulse.

The solid line in the bottom panel shows the temporal profile derived by summing across channels of the compensated NAP in the region below 1.5 kHz. The peaks corresponding to glottal pulses are readily apparent in this representation. We extracted peaks locally using an algorithm similar to adaptive thresholding (Patterson et al., 1995). The dotted line is the threshold used to identify peaks, which are indicated by circles and vertical lines. After a peak is found, the threshold decreases gradually to locate the next peak. We also introduced a form of prediction to make the peak detection robust: the threshold is reduced by a certain ratio when the detector does not find activity at the expected period, defined as the median of recent periods. This is indicated by the sudden drop in threshold towards the end of each cycle.

Tests with synthetic sounds confirmed that this algorithm works sufficiently well when the input is clean speech. It is, however, difficult to apply this method under noisy conditions, particularly when the SNR is low. So, we enlisted the F0 information to improve event time estimation.
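The adaptive-threshold peak picking with period prediction can be sketched as follows. The half-life, drop ratio, refractory time, and the number of recent periods used for the median are illustrative assumptions, not values taken from the original system.

```python
# Sketch of adaptive-threshold event detection on the summed (< 1.5 kHz),
# group-delay-compensated NAP profile. Returns candidate glottal event indices.
import numpy as np

def detect_events(profile, fs, half_life=0.005, drop_ratio=0.5,
                  refractory=0.002, tol=0.2):
    decay = 0.5 ** (1.0 / (half_life * fs))    # per-sample threshold decay
    thresh = float(np.max(profile))
    events, periods = [], []
    dropped = False
    for n, x in enumerate(profile):
        if events and (n - events[-1]) < refractory * fs:
            continue                            # too soon after the last event
        expected = np.median(periods) if len(periods) >= 3 else None
        if (expected is not None and not dropped
                and (n - events[-1]) > (1 + tol) * expected):
            thresh *= drop_ratio                # prediction: no event found at the
            dropped = True                      # expected period, drop threshold once
        if x >= thresh:                         # peak found
            if events:
                periods.append(n - events[-1])
                periods = periods[-8:]          # keep only recent periods
            events.append(n)
            thresh = x                          # reset threshold to the peak value
            dropped = False
        else:
            thresh *= decay                     # gradual decay towards the next peak
    return np.array(events, dtype=int)
```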

2.1.2 A robust F0 estimator for event detection

It is easier to estimate the fundamental frequency, F0, than event times for a given SNR. The latest methods for F0 estimation are robust in low-SNR conditions and can estimate F0 within 5% of the true value in 80% of voiced speech segments in babble noise, at SNRs as low as 5 dB (Nakatani and Irino, 2002). So, we developed a method of enhancing event detection with F0 information. Figure 4 shows a block diagram of the procedure. First, candidate event times are calculated using the event-detection mechanism described in the previous section. The half-life of the adaptive threshold is reduced to avoid missing events in the target speech. As a result, the procedure extracts events for both the target and the background speech. Then we produce a temporal sequence of pulses located at the candidate times. For every frame step (e.g., 3 ms), the value of F0 of the target speech in that frame is converted into a temporal function consisting of a triplet of Gaussian functions with a periodicity of 1/F0.


Figure 3. Auditory event detection for clean speech

Figure 4. Selection of event times by F0 information. (Block diagram: event detection based on the phase-compensated NAP, and F0 information of the target speech used to produce a triplet of Gaussians spaced at 1/F0; cross-correlation with maximum-matching selection yields the estimated target events.)

The triplet function is then cross-correlated with the event pulse sequence to find the best-matching lag time. At this best point, the temporal sequence and the triplet of Gaussians are multiplied together to derive a value similar to a likelihood for each event. The likelihood-like values are accumulated and thresholded with an arbitrary value to derive estimates of the event times for the target speech signal.
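A minimal sketch of this F0-guided event selection is given below. The 3-ms frame step follows the text, but the Gaussian width, the lag search grid, and the final relative threshold are assumptions, and the function name is hypothetical.

```python
# Sketch of F0-guided selection of target events from candidate events.
import numpy as np

def select_target_events(cand_events, f0_track, fs, frame_step=0.003,
                         sigma=0.0005, rel_thresh=0.3):
    """cand_events: sample indices of candidate events (target plus background).
       f0_track: target-speech F0 in Hz per frame (0 or negative = unvoiced)."""
    cand_events = np.asarray(cand_events)
    t_ev = cand_events / fs                         # candidate event times (s)
    score = np.zeros(len(cand_events))              # likelihood-like accumulator

    def triplet(times, mu, period):
        """Sum of three Gaussians centred at mu - 1/F0, mu, mu + 1/F0."""
        return sum(np.exp(-0.5 * ((times - (mu + k * period)) / sigma) ** 2)
                   for k in (-1, 0, 1))

    for i, f0 in enumerate(f0_track):
        if f0 <= 0:                                 # skip unvoiced frames
            continue
        period, centre = 1.0 / f0, i * frame_step
        near = np.abs(t_ev - centre) < 1.5 * period # candidates relevant to this frame
        if not near.any():
            continue
        lags = np.linspace(-0.5, 0.5, 51) * period  # cross-correlation lag search
        vals = [triplet(t_ev[near], centre + lag, period).sum() for lag in lags]
        best_mu = centre + lags[int(np.argmax(vals))]
        score[near] += triplet(t_ev[near], best_mu, period)  # accumulate per-event values

    keep = score > rel_thresh * score.max() if score.max() > 0 else score > 0
    return cand_events[keep]
```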

2.1.3 Event-synchronous strobed temporal integration

Figure 5. Event-synchronous strobes for concurrent speech

Using the target event times, we can enhance the auditory image for the target speech. Figure 5a shows a schematic plot of a NAP, after group-delay compensation, for a segment of speech with concurrent vowels. The target speech, with a 10-ms glottal period, is converted into the black activity pattern, while the background speech, with a 7-ms period, is converted into the gray pattern. For every target event time (every 10 ms), the NAP is converted into a two-dimensional auditory image (AI), as shown in Fig. 5b. The horizontal axis of the AI is the time interval (TI) from the event time; the vertical axis is the same as that of the NAP. As shown in Fig. 5b, the activity pattern for the target speech always occurs at the same time interval, whereas that for the background speech changes position. So, we get a "stabilized" version of the target speech pattern in this "instantaneous" auditory image. Temporal integration is performed by weighted averaging of successive frames of the instantaneous AI, as shown in Fig. 5c, and this enhances the pattern for the target speech relative to that for the background speech. In this way, the fine structure of the target pattern is preserved and stabilized when the event detector captures the glottal events of the target speech correctly. The weighted averaging is essentially equivalent to conventional strobed temporal integration, where the weighting function is a ramped exponential with a fixed half-life. In the following experiment, we used a Hanning window spanning five successive SAI frames, although tests indicated that the window shape does not have a large effect as long as the window length is correct.
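The event-synchronous strobing and weighted averaging can be sketched as follows, assuming the NAP has already been group-delay compensated. The five-frame Hanning weighting follows the text; the time-interval span of each image frame is an assumed value.

```python
# Sketch of event-synchronous strobed temporal integration: cut an "instantaneous"
# AI at each target event and average successive AIs with a 5-point Hanning window.
import numpy as np

def instantaneous_ai(nap, event, ti_len):
    """One AI frame: NAP slice starting at the event, zero-padded at the end if needed."""
    frame = nap[:, event:event + ti_len]
    if frame.shape[1] < ti_len:
        frame = np.pad(frame, ((0, 0), (0, ti_len - frame.shape[1])))
    return frame

def enhanced_sai(nap, target_events, fs, ti_max=0.035, n_frames=5):
    """Hanning-weighted average of n_frames successive instantaneous AIs."""
    ti_len = int(ti_max * fs)
    frames = [instantaneous_ai(nap, e, ti_len) for e in target_events]
    w = np.hanning(n_frames + 2)[1:-1]          # the 5 non-zero Hanning weights
    sais = []
    for i in range(len(frames) - n_frames + 1):
        stack = np.stack(frames[i:i + n_frames])
        sais.append(np.tensordot(w / w.sum(), stack, axes=1))
    return sais                                  # one enhanced SAI per event position
```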

2.2 Synthesis procedure

We can now produce an auditory image in which the pattern of the target speech is enhanced. It only remains to resynthesize the target speech from its auditory representation.

2.2.1 Resynthesis using STRAIGHT

Figure 1 includes a synthesis method using STRAIGHT and a mapping module between the auditory image and a time-frequency representation based on the STFT. The method is described by Irino et al. (2003a,b). Briefly, each frame of the 2-D Stabilized Auditory Image (SAI) is converted into a Mellin Image (MI), and then the MI is converted into a spectral distribution using a warped-DCT. The mapping function between the MI and the warped-DCT is constructed using a non-linear multiple regression analysis (MRA). Fortunately, it is possible to determine the parameter values for the non-linear MRA in advance, using clean speech sounds and the analysis section of STRAIGHT, as indicated by the gray arrows in Fig. 1. The speech sounds are then reproduced using a spectral filter excited with the pulse/noise generator shown in the synthesis section of STRAIGHT in Fig. 1. The pulse/noise generator is controlled by the robust F0 estimator in the left box. This version works fairly well and is expected to capture the fine structure of the speech signal in the SAI and MI. But it is rather complex and inflexible because training is required to establish the mapping function. Moreover, the quality of the resynthesized sounds was not always good, despite the use of STRAIGHT. So, we rebuilt the synthesis procedure using a channel-vocoder concept and introduced an approximation in the mapping function.
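As a rough illustration of the trained mapping module, the sketch below substitutes ridge regression for the authors' non-linear MRA; the feature shapes, the regularizer, and the class interface are all assumptions made for illustration.

```python
# Stand-in for the MI -> warped-DCT mapping, trained offline on clean speech
# analyzed by STRAIGHT (ridge regression instead of the original nonlinear MRA).
import numpy as np

class MiToSpectrumMap:
    def __init__(self, reg=1e-3):
        self.reg, self.W = reg, None

    def fit(self, mi_frames, wdct_frames):
        """mi_frames: (n_frames, n_mi) Mellin Image features from clean training speech.
           wdct_frames: (n_frames, n_coef) warped-DCT targets from STRAIGHT analysis."""
        X = np.hstack([mi_frames, np.ones((len(mi_frames), 1))])   # add a bias term
        A = X.T @ X + self.reg * np.eye(X.shape[1])
        self.W = np.linalg.solve(A, X.T @ wdct_frames)
        return self

    def predict(self, mi_frames):
        """Map MI frames of (possibly noisy) speech to spectral coefficients."""
        X = np.hstack([mi_frames, np.ones((len(mi_frames), 1))])
        return X @ self.W
```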

2.2.2 Auditory channel vocoder

Figure 2 shows the revised version based on an auditory filterbank and a channel vocoder (Gold and Rader, 1967). The channel vocoder consisted of a bank of band-limited filters and pulse/noise generators controlled by previously extracted information about fundamental frequency and voicing. There were two banks of gammachirp/gammatone filters, one for analysis and one for synthesis, because of the temporal asymmetry in the filter response. The synthesis filterbank is not essential, however, when we compensate for the phase lag of the analysis filter. The amplitude information of the Basilar Membrane Motion (BMM) is recovered from the SAI via the NAP, as described in the next section. The phase information of the BMM is derived by filterbank analysis of the signal from a pulse/noise generator. We employed the generator in STRAIGHT since it is known to produce pulses and noise with generally flat spectra, and it does not affect the amplitude information. The amplitude and phase information is combined to produce the real BMM for resynthesis.
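A strongly simplified sketch of this resynthesis path is given below. The simple pulse/noise source (in place of the STRAIGHT generator), the envelope normalization with a Hilbert transform, the unvoiced noise level, and the plain channel average in place of the weighted sum of Fig. 2 are assumptions; amp_env and irs stand for the recovered BMM amplitude envelopes and the analysis filterbank impulse responses.

```python
# Sketch of channel-vocoder resynthesis: recovered per-channel amplitude combined
# with the phase of an excitation passed through the filterbank, then summed after
# filtering with the time-reversed impulse responses (phase-lag compensation).
import numpy as np
from scipy.signal import fftconvolve, hilbert

def excitation(f0_track, fs, frame_step=0.003):
    """Pulse train for voiced frames (F0 > 0), low-level noise for unvoiced frames."""
    n = int(len(f0_track) * frame_step * fs)
    x = np.zeros(n)
    t = 0.0
    for i, f0 in enumerate(f0_track):
        start, stop = int(i * frame_step * fs), min(int((i + 1) * frame_step * fs), n)
        if f0 > 0:
            period = fs / f0
            while t < stop:
                if t >= start:
                    x[int(t)] = 1.0
                t += period
        else:
            x[start:stop] = np.random.randn(stop - start) * 0.1
            t = stop
    return x

def resynthesize(amp_env, irs, f0_track, fs):
    """amp_env: (n_ch, n_samples) BMM amplitude envelopes recovered from the SAI/NAP.
       irs: list of analysis filterbank impulse responses, one per channel."""
    n = amp_env.shape[1]
    src = excitation(f0_track, fs)
    src = src[:n] if len(src) >= n else np.pad(src, (0, n - len(src)))
    out = np.zeros(n)
    for ch, ir in enumerate(irs):
        carrier = fftconvolve(src, ir, mode="full")[:n]         # BMM phase of the source
        carrier /= np.abs(hilbert(carrier)) + 1e-12             # strip its own envelope
        bmm = amp_env[ch] * carrier                             # impose recovered amplitude
        out += fftconvolve(bmm, ir[::-1], mode="full")[len(ir) - 1:len(ir) - 1 + n]
    return out / len(irs)                                       # crude channel average
```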

2.2.3 Mapping function from SAI to NAP

Mapping from the SAI to the NAP required one more enhancement. In the previous version, we took the fine temporal structure in both the SAI and the NAP into account when designing the mapping function. As a result, it required a non-linear mapping between multiple inputs and a single output. In the current study, we assumed that the frequency profile of the SAI provides a reasonable representation of the main features of the target speech when we restrict the range of integration in the appropriate way. As shown in Fig. 5c, the target feature is concentrated in the time-interval dimension when the filter center frequency, fc, is high, and it overlaps the next cycle when the center frequency is low. The activity of the interfering speech appears at a low level in the rest of the SAI. So, we limited the range of integration [Tmin, Tmax] on the time-interval axis to

Tmin = -1 ms and Tmax = min(5, max(1.5, 2/fc)) ms,

where fc is the center frequency in kHz. Tmin is constant; Tmax is 5 ms when fc is low and 1.5 ms when fc is high. We applied a weighting function to produce a flat NAP from a regular SAI. The resulting synthetic sounds were reasonably good. Moreover, the assumption simplifies the procedure. It would probably be possible to enhance the sound quality using the fine structure in the SAI.
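The restricted integration can be sketched directly from the equation above. The flat averaging over the allowed range, used here in place of the weighting function mentioned in the text, is a simplification.

```python
# Sketch of the SAI frequency-profile computation with the [Tmin, Tmax] limits,
# where Tmax shrinks from 5 ms to 1.5 ms as the channel center frequency rises.
import numpy as np

def sai_profile(sai, fcs_khz, fs, t_min=-0.001):
    """sai: (n_ch, n_ti) SAI frame whose time-interval axis starts at t_min (s).
       fcs_khz: channel center frequencies in kHz. Returns one value per channel."""
    n_ch, n_ti = sai.shape
    ti = t_min + np.arange(n_ti) / fs                    # time-interval axis (s)
    profile = np.zeros(n_ch)
    for ch, fc in enumerate(fcs_khz):
        t_max = min(5.0, max(1.5, 2.0 / fc)) * 1e-3      # Tmax in seconds
        sel = (ti >= t_min) & (ti <= t_max)
        profile[ch] = sai[ch, sel].mean()                # average over the allowed range
    return profile
```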

3. EXPERIMENTS

Many segregation procedures require precise F0 information for effective segregation. This is unrealistic, because F0 estimation is never error-free and natural vocal vibration contains jitter. So it is important to evaluate how robust the system is to jitter and errors in F0 estimation. The new event-finding procedure produces performance that is the same as, or a little better than, that of the previous version. It is easier to listen to the segregated target sound than to the mixture of target speech and distracter speech, or the mixture of target speech and babble noise.

Figure 6. Waveforms and spectrograms when the F0 estimation error is 5%: (a)(b) the original sound, (c)(d) the mixture of the sound and babble noise, (e)(f) the signals segregated by the proposed method, and (g)(h) by a comb-filter method

We compared the new method to a comb-filter method using sound mixtures in which the estimated F0 was 5% higher than the true value. Figures 6a and 6b show the original, isolated target speech and its spectrogram. The target sound was mixed with babble noise at an SNR of 0 dB to produce the wave in Fig. 6c. At this level, a listener needs to concentrate to hear the target speech in the distracter. The spectrogram in Fig. 6d shows that the features of the target speech are no longer distinctive. The target speech (Fig. 6e) extracted by the auditory vocoder was distorted but entirely intelligible, whereas the distracter speech was decomposed into a non-speech sound, which greatly reduced the interference. The spectrogram in Fig. 6f shows that the vocoder resynthesis restores the harmonic structure of the target speech but not the distracter speech; for example, compare the restoration of harmonic structure just after 800 ms with the lack of harmonic restoration in the region just before 800 ms.

For comparison, we applied a comb-filter method based on the STFT to the same sound. The target speech extracted in this way sounds low-pass filtered, and the babble noise is relatively strong. The target speech wave (Fig. 6g) is noisier than the wave in Fig. 6e. The spectrogram in Fig. 6h shows that much of the harmonic structure is lost by the comb-filter method, particularly the higher harmonics. This follows directly from the fact that the error increases in proportion to the harmonic number. It is a fundamental defect.

4. CONCLUSIONS

We have presented methods to segregate concurrent speech sounds using an auditory model and a vocoder. Specifically, the method involves the Auditory Image Model (AIM), a robust F0 estimator, and a synthesis module based either on STRAIGHT or an auditory synthesis filterbank. The event-synchronous procedure enhances the intelligibility of the target speaker in the presence of concurrent background speech. The resulting segregation performance is better than with conventional comb-filter methods whenever there are errors in fundamental frequency estimation, as there always are in real concurrent speech. Test results suggest that this auditory segregation method has potential for speech enhancement in applications such as hearing aids.

Acknowledgments

This work was partially supported by a project grant from the Faculty of Systems Engineering of Wakayama University, by the Ministry of Education, Science, Sports and Culture, Grant-in-Aid for Scientific Research (B), 15300061, 2003, and by the UK Medical Research Council (G9901257).

REFERENCES

Gold, B. and Rader, C. M., 1967. "The Channel Vocoder," IEEE Trans. Audio and Electroacoustics, AU-15, 148-161.
Irino, T., Patterson, R.D., and Kawahara, H., 2003a. "Speech segregation using event synchronous auditory vocoder," Proc. IEEE ICASSP 2003, Hong Kong.
Irino, T., Patterson, R.D., and Kawahara, H., 2003b. "Speech segregation based on fundamental event information using an auditory vocoder," Proc. 8th European Conference on Speech Communication and Technology (Eurospeech 2003 / Interspeech 2003), 553-556, Geneva, Switzerland.
Kawahara, H., Masuda-Katsuse, I., and de Cheveigne, A., 1999. "Restructuring speech representations using a pitch-adaptive time-frequency smoothing and an instantaneous-frequency-based F0 extraction: Possible role of a repetitive structure in sounds," Speech Communication, 27, 187-207.

Lim, J.S., Oppenheim, A.V., and Braida, L.D., 1978. "Evaluation of an adaptive comb filtering method for enhancing speech degraded by white noise addition," IEEE Trans. ASSP, ASSP-26, 354-358.
Nakatani, T. and Irino, T., 2002. "Robust fundamental frequency estimation against background noise and spectral distortion," Proc. ICSLP 2002, 1733-1736, Denver, Colorado.
Parsons, T.W., 1976. "Separation of speech from interfering speech by means of harmonic selection," J. Acoust. Soc. Am., 60, 911-918.
Patterson, R.D., Allerhand, M., and Giguere, C., 1995. "Time-domain modelling of peripheral auditory processing: a modular architecture and a software platform," J. Acoust. Soc. Am., 98, 1890-1894. http://www.mrc-cbu.cam.ac.uk/cnbh/