SPEECH ENHANCEMENT USING ADAPTIVE FILTERS AND INDEPENDENT COMPONENT ANALYSIS APPROACH

Tomasz Rutkowski*, Andrzej Cichocki* and Allan Kardec Barros**

* Brain Science Institute RIKEN, Wako-shi, Saitama, JAPAN
** Depto. de Engenharia Eletrica, Universidade Federal do Maranhao, Sao Luis - MA - BRAZIL
email: [email protected], [email protected] and [email protected]
http://www.bsp.brain.riken.go.jp/

ABSTRACT

In this paper we consider the problem of enhancing and extracting the speech signal of one speaker corrupted by environmental acoustic noise/interference and by other speakers, using an array of at least two microphones. The preprocessing unit mimics the human auditory system by roughly emulating the cochlea with a nonuniform bandpass filter bank. We construct the filter bank with the center frequencies of the subbands based on an approximation of the target speaker's fundamental frequency. We then apply a blind signal separation method to each subband signal in order to extract the maximum information representing the target speaker's speech. Finally, the desired signal is reconstructed from the independent components representing every subband. Experiments with office room recordings are presented to confirm the validity and good performance of the proposed method in a real-world environment.

1. INTRODUCTION

We propose a system that approximately emulates the human auditory system using a nonuniform bandpass filter bank that tracks the fundamental frequency of the target speaker. The filter bank consists of bandpass filters with bandwidths starting from 100 to 200 Hz for the first filter and doubling for each subsequent filter. For the experiments we use telephone-quality speech signals with a sampling frequency of 8 kHz. We assume the hearing system performs a spectrographic analysis of an auditory stimulus at the cochlea, which can be regarded as a bank of nonuniform self-adaptive filters whose outputs are ordered tonotopically. First, the fundamental frequency of the target speaker in the mixture of convolved speech signals is estimated in order to properly set the center frequencies of the filters. The bank of adaptive band-pass filters processes the available microphone signals around the fundamental frequency of the target speaker and around its harmonics. In the next stage of processing we perform blind source separation (BSS), blind source extraction (BSE) or independent component analysis (ICA) for each frequency sub-band (bin). Efficient learning algorithms are used to perform BSS/BSE/ICA.

Finally, a set of switches is implemented to perform temporal masking and the selection/classification of the independent components whose specific features enhance the voice of the target speaker. The main problem in this stage is to decide which of the obtained subband independent components should be discarded and which contain essential information about the target speaker. We apply a spectral measure in every subband to solve this problem. After such processing we have subband signals that carry speech with enhanced target-speaker information. The last stage of our signal processing system performs the signal reconstruction: the inverse filter bank is applied to correctly reconstruct the target speaker's voice, avoiding aliasing problems with the subband-filtered components. Extensive computer simulation results with arrays of two or more microphones confirm the validity and performance of the proposed approach. We present results for real room recordings with natural reverberation from the walls and objects in the room.

2. AUDITORY SYSTEM FILTER BANK

It is well known that the human auditory system can roughly be described as a nonuniform bandpass filter bank, consisting of strongly overlapping bandpass filters with bandwidths on the order of 50 to 100 Hz for signals below 500 Hz and up to 5000 Hz for signals at higher frequencies [1].


Fig. 1. Conceptual block diagram of the algorithm, which roughly mimics the auditory system. First it estimates the fundamental frequency f0 and processes the microphone signals using a bank of band-pass filters (like the inner ear). Then it processes the mixed/convolved signals with a BSE or blind signal deconvolution (BSD) algorithm for each frequency bin. Finally, it performs masking by a set of switches followed by the inverse filter bank.

The hearing system performs a spectrographic analysis of any auditory stimulus at the cochlea, which can be regarded as a bank of nonuniform self-adaptive filters whose outputs are ordered tonotopically. Recently, many sophisticated and biologically plausible models of such auditory filter banks have been proposed. In this paper we employ the idea of speaker voice extraction/enhancement from natural mixtures recorded in real environments. Following this idea, we propose a preprocessing unit for blind signal separation that transforms the microphone signals into subbands with center frequencies around the fundamental frequency f0 and its harmonics. The fundamental frequency of the target speaker can be estimated from the microphone signals x1 and x2 and/or from the estimated signal y(t). Unvoiced speech is noisy due to the random nature of the signal generated at a narrow constriction in the vocal tract. For both voiced and unvoiced excitation, the vocal tract, acting as a filter, amplifies certain sound frequencies while attenuating others. As a periodic signal, voiced speech has spectra consisting of harmonics of the fundamental frequency of the vocal fold vibration; this frequency, often abbreviated f0, is the physical aspect of speech corresponding to perceived pitch. Our algorithm uses this feature to track the speaker, assuming that the fundamental frequency f0 is known or can be estimated. Our filter bank adapts the center frequencies of the subband bins to enhance the signal of the target speaker.

An important property of speech signals is that they are non-stationary, but they can be regarded as locally stationary. Roughly speaking, speech can be divided in the time domain into voiced and unvoiced sounds, the former having more structure in the frequency domain than the latter. Indeed, voiced sounds are generally regarded as quasi-periodic. Some experiments indicate that humans may use this voiced structure to separate sounds from the background. Moreover, humans understand voiced sounds more easily than unvoiced ones. One open problem is how the auditory system segregates those sounds at a higher level (in the auditory cortex). In this work, we suggest that this can be carried out by exploiting the local statistical independence of a pair of sub-band sounds for each frequency sub-band (bin). Our final objective is to develop an efficient algorithm whose output signal y(t) is a modified version of a speech signal si(t), i.e. y(t) = g(si(t)), where g(.) represents an unavoidable distortion filter and a non-linear transformation operator. Our algorithm also includes the temporal masking characteristic of the auditory system; this is managed by a set of switches which, after the blind separation/extraction algorithm, select the speaker-related information from the subband components.

Fig. 1 shows the conceptual block diagram of the system. It is composed of four parts. The first is the fundamental frequency estimation algorithm, operating in the spectral domain, which estimates the fundamental frequency from the mixed signals and also indicates which parts of the speech are voiced. It should be noted that the estimation of f0 becomes rather difficult when all speakers are at the same distance from the microphones. The second processing unit is a bank of FIR band-pass filters with center frequencies adapted to the estimated f0 of the target speaker.

Fig. 2. Filter bank analysis (a) and synthesis (b) sections for only 2 subbands: analysis filters H0, H1 followed by decimators, and expanders followed by synthesis filters F0, F1.

Fig. 3. Filter bank frequency characteristic for 5 subbands (magnitude in dB versus normalized frequency).

The filter bank processes the signals around the fundamental frequency and around its harmonics. The third section is a bank of BSE (blind signal extraction) units that enhance the desired signal in each frequency sub-band. Finally, the last section performs signal reconstruction from the subband signals. The reconstruction section is the inverse auditory filter bank with respect to the second unit of our system. In this way we reconstruct the signal from the subbands while avoiding aliasing and frequency distortion problems.
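To make the first stage concrete, the sketch below shows one simple way a spectral-domain f0 estimator of this kind could be implemented. The harmonic-summation scoring, the search range, and the function name estimate_f0 are our illustrative assumptions; the paper does not specify its exact estimator.

```python
import numpy as np

def estimate_f0(frame, fs, f0_min=70.0, f0_max=400.0, n_harmonics=5):
    """Rough spectral-domain f0 estimate by harmonic summation.

    Scores each candidate f0 by summing the magnitude spectrum at its
    first few harmonics and returns the best-scoring candidate.
    """
    spectrum = np.abs(np.fft.rfft(frame * np.hanning(len(frame))))
    freqs = np.fft.rfftfreq(len(frame), d=1.0 / fs)
    candidates = np.arange(f0_min, f0_max, 1.0)
    scores = []
    for f0 in candidates:
        harmonics = f0 * np.arange(1, n_harmonics + 1)
        bins = np.searchsorted(freqs, harmonics)
        bins = bins[bins < len(spectrum)]   # drop harmonics above Nyquist
        scores.append(spectrum[bins].sum())
    return candidates[int(np.argmax(scores))]
```

In the full system such an estimate would be tracked frame by frame over the voiced segments, with the voiced/unvoiced decision gating the update.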

3. ADAPTIVE FILTER BANKS

The filter bank consists of slightly overlapping bandpass filters with bandwidths starting from 100 to 200 Hz for the first filter and doubling for each subsequent filter. First, the fundamental frequency of the target speaker in the mixture of convolved speech signals is estimated in order to properly set the center frequencies of the filters. The bank of adaptive band-pass filters then processes the available microphone signals around the fundamental frequency of the target speaker and around its harmonics. Following the frequency sensitivity of the human auditory system, we construct filter banks with center frequencies f0, 4f0, 10f0, 22f0, ..., making each subsequent subband twice as wide towards the higher frequencies, because human speech has less sound representation there. The lowpass and highpass FIR filters are implemented in a cascade configuration. In order to achieve perfect reconstruction of the signal, we design the analysis and synthesis filters with the following constraints [2] in every section (see Fig. 2 for reference). The following constraint prevents distortion:

F_0(z) H_0(z) + F_1(z) H_1(z) = 2 z^{-l},   (1)

and to avoid aliasing problems we apply:

F_0(z) H_0(-z) + F_1(z) H_1(-z) = 0.   (2)

An exemplary five-section filter bank frequency characteristic is presented in Fig. 3. The center frequency of every subband, constructed from pairs of lowpass and highpass filters, is always around f0 or one of its higher harmonics 4f0, 10f0, 22f0, ... for the given target speaker.
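As an illustration of the bank construction just described, the following sketch builds FIR bandpass filters centered at f0, 4f0, 10f0, 22f0, ... with doubling bandwidths. The use of scipy's firwin, the half-bandwidth choice (initially f0), and the tap count are assumptions for the sketch, not the authors' exact design; in particular, this simple analysis bank does not by itself enforce the perfect-reconstruction constraints (1)-(2).

```python
import numpy as np
from scipy.signal import firwin, lfilter

def make_filter_bank(f0, fs, n_subbands=5, numtaps=257):
    """Build FIR bandpass filters centered near f0 and its harmonic
    clusters f0, 4*f0, 10*f0, 22*f0, ... (c[k+1] = 2*c[k] + 2*f0),
    each band twice as wide as the previous one."""
    centers = [f0]
    for _ in range(n_subbands - 1):
        centers.append(2 * centers[-1] + 2 * f0)
    bank = []
    width = f0  # assumed half-bandwidth of the first band
    for c in centers:
        lo = max(c - width, 1.0)
        hi = min(c + width, 0.49 * fs)
        if lo >= hi:          # band falls above Nyquist: stop
            break
        bank.append(firwin(numtaps, [lo, hi], pass_zero=False, fs=fs))
        width *= 2            # each subsequent subband is twice as wide
    return bank

def analyze(x, bank):
    """Split signal x into subband signals with the analysis bank."""
    return [lfilter(h, 1.0, x) for h in bank]
```

For telephone-quality speech (fs = 8 kHz, f0 around 150 Hz) this yields roughly four usable subbands before the band edges reach the Nyquist frequency.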

4. BLIND SOURCE SEPARATION

In order to separate the subband signals we can use any algorithm for BSS/ICA, such as the Natural Gradient algorithm, SOBI, JADE, RICA, etc. [3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16]. In this paper we use a novel algorithm for blind source extraction (BSE). Due to space limitations we present here only the final algorithm, without its theoretical justification. Let us consider a single processing unit for the extraction of a voiced independent subcomponent (see Fig. 4):

y_i(k) = w_i^T x_i(k) = \sum_{j=1}^{m} w_{ij} x_{ij}(k),   (3)

\varepsilon_i(k) = y_i(k) - \sum_{p=1}^{L} b_{ip} y_i(k-p) = w_i^T x_i(k) - \tilde{y}_i(k),   (4)

where m is the number of microphones, i = 1, 2, ..., N (N is the number of frequency bins), w_i = [w_{i1}, ..., w_{im}]^T, and \tilde{y}_i(k) = \sum_{p=1}^{L} b_{ip} y_i(k-p) is the output of a FIR bandpass filter with a suitably chosen center frequency and bandwidth. The coefficients b_{ip} are fixed. It can easily be shown that the weight vector of the BSE processing unit can be iteratively updated as follows:

w_i = \hat{R}_{x_i x_i}^{-1} \hat{R}_{x_i \tilde{y}_i},   w_i = w_i / ||w_i||,   (5)


where

\hat{R}_{x_i x_i} = \frac{1}{N} \sum_{k=1}^{N} x_i(k) x_i^T(k),   \hat{R}_{x_i \tilde{y}_i} = \frac{1}{N} \sum_{k=1}^{N} x_i(k) \tilde{y}_i(k).   (6)

Fig. 4. Single processing unit for blind extraction of an independent voiced component (m denotes the number of microphones; typically m = 2).

Fig. 6. Result for extraction of one speaker from a mixture of three speakers recorded using only two microphones (the target speaker was close to microphone #2).

It should be noted that by changing the center frequency of the bandpass filter we can, in general, extract different components. Moreover, using the above concept we always extract the desired independent components with the highest energy, so the masking set of switches is not necessary.
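A compact numpy sketch of one such BSE unit, iterating eqs. (3)-(6), might look as follows. The block length, iteration count, and random initialization are illustrative assumptions; b stands for the fixed bandpass FIR coefficients b_ip from eq. (4).

```python
import numpy as np
from scipy.signal import lfilter

def bse_extract(X, b, n_iter=20):
    """Blind extraction of one voiced component from subband mixtures.

    X : (m, K) array of m microphone subband signals x_i(k).
    b : FIR bandpass taps defining y~_i(k); b[0] should be 0 so that
        only delayed samples y_i(k-p), p >= 1, enter, as in eq. (4).
    Returns the extracted signal y_i and the weight vector w_i.
    """
    m, K = X.shape
    Rxx = X @ X.T / K                     # R^_{x_i x_i}, eq. (6)
    w = np.random.randn(m)
    w /= np.linalg.norm(w)
    for _ in range(n_iter):
        y = w @ X                         # y_i(k) = w_i^T x_i(k), eq. (3)
        y_bp = lfilter(b, 1.0, y)         # y~_i(k), eq. (4)
        Rxy = X @ y_bp / K                # R^_{x_i y~_i}, eq. (6)
        w = np.linalg.solve(Rxx, Rxy)     # w_i = Rxx^{-1} Rxy, eq. (5)
        w /= np.linalg.norm(w)            # normalization in eq. (5)
    return w @ X, w
```

Because Rxx is fixed for a given data block, only the cross-correlation vector Rxy needs to be recomputed in each iteration.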

5. MULTICHANNEL BLIND DECONVOLUTION/EQUALIZATION

Instead of instantaneous blind extraction of the subband signals we can apply blind deconvolution/equalization, especially if the bandwidth is relatively large. A simple model for multichannel deconvolution/equalization is shown in Fig. 8. For each subband i we perform the following processing:

y_i(k) = x_{i1}(k) - \sum_{p=0}^{L} b_{ip} x_{i2}(k-p),   (7)

with the coefficients updated according to

\Delta b_{ip} = y_i(k) x_{i2}(k-p).   (8)
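A sketch of this per-subband processing with the update rule (7)-(8) is given below. The learning rate mu is our addition for numerical stability (eq. (8) gives only the direction of the update), as is the filter length L.

```python
import numpy as np

def subband_deconvolve(x1, x2, L=32, mu=1e-3):
    """Adaptive decorrelation for one subband (eqs. (7)-(8)).

    x1 : subband signal from the microphone closer to the target.
    x2 : subband signal from the reference/interference microphone.
    Returns the enhanced subband signal y.
    """
    b = np.zeros(L + 1)                     # b_{i0} ... b_{iL}
    y = np.zeros(len(x1))
    for k in range(len(x1)):
        # most recent samples of x2: x2[k], x2[k-1], ..., x2[k-L]
        past = x2[max(0, k - L):k + 1][::-1]
        tap = np.zeros(L + 1)
        tap[:len(past)] = past
        y[k] = x1[k] - b @ tap              # eq. (7)
        b += mu * y[k] * tap                # eq. (8), scaled by mu
    return y
```

At convergence the cross-correlation between y and the delayed reference samples is driven towards zero, which is what removes the interference leaking from x2 into x1.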

6. SIGNAL RECONSTRUCTION FROM INDEPENDENT COMPONENTS

The components carrying spectral maxima around the f0 harmonics, obtained in the previous sections, are taken into the reconstruction. These signals carry speech with enhanced target-speaker information. The last part of our signal processing system performs the signal reconstruction: the inverse filter bank is applied to correctly reconstruct the target speaker's voice, avoiding aliasing problems with the subband-filtered components. Extensive computer simulation results presented in the next section confirm the validity and performance of the proposed approach. We present results for real room recordings with natural reverberation from the walls and objects in the room.
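A minimal sketch of this selection-and-synthesis stage, under stated assumptions: we score each component by the fraction of its spectral energy near harmonics of f0 (one possible choice of the spectral measure mentioned in the introduction), discard low-scoring components (the masking switches), then filter the kept components with their synthesis filters and sum. The threshold, tolerance, and the synthesis_bank argument are illustrative; the exact synthesis filters satisfying eqs. (1)-(2) are not reproduced here.

```python
import numpy as np
from scipy.signal import lfilter

def harmonic_score(y, fs, f0, tol=0.1):
    """Fraction of spectral energy within +/- tol*f0 of a harmonic
    of f0 -- one possible spectral selection measure."""
    spec = np.abs(np.fft.rfft(y)) ** 2
    freqs = np.fft.rfftfreq(len(y), d=1.0 / fs)
    near = np.abs(freqs / f0 - np.round(freqs / f0)) < tol
    near &= freqs > 0.5 * f0                # ignore the DC region
    return spec[near].sum() / (spec.sum() + 1e-12)

def reconstruct(components, synthesis_bank, fs, f0, threshold=0.5):
    """Masking switches plus inverse filter bank: keep harmonic
    components, filter each with its synthesis filter, and sum.
    Returns None if every component is discarded."""
    out = None
    for y, f in zip(components, synthesis_bank):
        if harmonic_score(y, fs, f0) < threshold:
            continue                        # switch: discard this subband
        v = lfilter(f, 1.0, y)
        out = v if out is None else out + v
    return out
```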

7. EXPERIMENTS WITH SPEECH SIGNALS RECORDED IN A REAL ENVIRONMENT

Fig. 5. Room recording plan.

The real-room recordings were made in an empty experimental room, without carpet or any sound-absorbing elements, and therefore with many reverberations (easy to notice even during ordinary conversation). We used two or three cardioid condenser boundary microphones, audio-technica PRO44, which record sounds from a half-cardioid space. Such a configuration lets us record sounds from many directions, similarly to how a human being senses with the ears. Boundary microphones make the task more complicated, because they pick up more reverberation from the surroundings than directional microphones do. The microphones were connected to a high-quality microphone line amplifier and a professional 20-bit multitrack digital recording system in a PC-class computer. The system allows us to record up to 8 channels simultaneously with 20-bit resolution and a sampling frequency of 44.1 kHz. The following recordings were made using natural voices and sounds from the speakers: (i) two mixed male and female voices speaking different phrases in English; (ii) three male voices speaking different phrases in English; (iii) mixed recordings of male and female voices speaking different phrases in English; (iv) mixed human voices and natural sounds (rain, waterfall) or music. We conducted all experiments with the target speaker positioned closer to the microphones than the other sources. The scheme of our recording conditions is presented in Fig. 5.

Exemplary results are shown in Fig. 10 and Fig. 7. For all performed experiments, considerable enhancement of the target speaker was achieved. Due to space limitations, more detailed audio examples will be presented at the workshop.

Fig. 7. Result for extraction of one speaker from a mixture of four speakers recorded using three microphones (the target speaker was close to microphone #1).

Fig. 8. Processing unit for blind deconvolution/equalization for m = 2 microphones.

Fig. 9. Result for extraction of one speaker from a mixture of five speakers recorded using four microphones (the target speaker was close to microphone #3).

8. CONCLUSIONS AND DISCUSSION

In this paper we have described a multistage subband-based system for the extraction and enhancement of a speech signal corrupted by other speakers and other acoustic interference. The proposed approach can be extended to other applications, such as the extraction of biomedical signals with a reduced number of sensors. Open problems include how to extract a speaker with lower energy than the other speakers, or a speech signal with specific features, independently of its distance from the microphones.

Fig. 10. Result for extraction of one speaker from a mixture of two speakers talking during heavy rain, recorded using only two microphones (the target speaker was close to microphone #2).

9. REFERENCES

[1] D. O'Shaughnessy, Speech Communication: Human and Machine, IEEE Press, New York, second edition, 2000.
[2] G. Strang and T. Nguyen, Wavelets and Filter Banks, Wellesley-Cambridge Press, Wellesley, MA, 1996.
[3] S. Amari, "ICA of temporally correlated signals - learning algorithm," in Proceedings of ICA'99: International Workshop on Blind Signal Separation and Independent Component Analysis, Aussois, France, Jan. 1999, pp. 13-18.
[4] S. Amari and A. Cichocki, "Adaptive blind signal processing - neural network approaches," Proceedings of the IEEE, vol. 86, no. 10, pp. 2026-2048, October 1998 (invited paper).
[5] A. K. Barros and A. Cichocki, "RICA - reliable and robust program for independent component analysis," Report and MATLAB program, Brain Science Institute RIKEN, 2-1 Hirosawa, Wako-shi, Saitama, 351-0198 Japan, http://www.riken.nagoya.jp/sensor/allan/RICA or http://go.to/RICA.
[6] A. Belouchrani, K. A. Meraim, and J.-F. Cardoso, "A blind source separation technique using second order statistics," IEEE Transactions on Signal Processing, vol. 45, pp. 434-444, February 1997.
[7] A. Cichocki, R. Thawonmas, and S. Amari, "Sequential blind signal extraction in order specified by stochastic properties," Electronics Letters, vol. 33, no. 1, pp. 64-65, January 1997.
[8] N. Delfosse and P. Loubaton, "Adaptive blind separation of independent sources: a deflation approach," Signal Processing, vol. 45, pp. 59-83, 1995.
[9] A. Hyvärinen and E. Oja, "A fast fixed-point algorithm for independent component analysis," Neural Computation, vol. 9, pp. 1483-1492, 1997.
[10] S. C. Douglas and S.-Y. Kung, "KuicNet algorithm for blind deconvolution," in Proceedings of the 1998 IEEE Workshop on Neural Networks for Signal Processing, New York, 1998, pp. 3-12.
[11] C. Jutten and J. Herault, "Blind separation of sources, Part I: An adaptive algorithm based on neuromimetic architecture," Signal Processing, vol. 24, pp. 1-20, 1991.
[12] L. Molgedey and H. G. Schuster, "Separation of a mixture of independent signals using time-delayed correlations," Physical Review Letters, vol. 72, no. 23, pp. 3634-3637, 1994.
[13] B. A. Pearlmutter and L. C. Parra, "Maximum likelihood blind source separation: A context-sensitive generalization of ICA," in Proceedings of NIPS'96, 1997, vol. 9, pp. 613-619.
[14] L. Tong, V. C. Soon, R. Liu, and Y. Huang, "AMUSE: a new blind identification algorithm," in Proceedings of ISCAS'90, New Orleans, LA, 1990.
[15] J. K. Tugnait, "Blind spatio-temporal equalization and impulse response estimation for MIMO channels using a Godard cost function," IEEE Transactions on Signal Processing, vol. 45, pp. 268-271, January 1997.
[16] S. Choi and A. Cichocki, "Blind separation of nonstationary sources in noisy mixtures," Electronics Letters, vol. 36, pp. 848-849, April 2000.
[17] A. K. Barros and N. Ohnishi, "Removal of quasi-periodic sources from physiological measurements," in Proceedings of ICA'99: International Workshop on Blind Signal Separation and Independent Component Analysis, Aussois, France, Jan. 1999, pp. 185-190.
[18] J. Huang, K.-C. Yen, and Y. Zhao, "Subband-based adaptive decorrelation filtering for co-channel speech separation," IEEE Transactions on Speech and Audio Processing, vol. 8, no. 4, pp. 402-406, July 2000.
[19] A. K. Barros, H. Kawahara, A. Cichocki, S. Kojita, T. Rutkowski, M. Kawamoto, and N. Ohnishi, "Enhancement of a speech signal embedded in noisy environment using two microphones," in Proceedings of the Second International Workshop on ICA and BSS, ICA'2000, Helsinki, Finland, 19-22 June 2000, pp. 423-428.