speech enhancement using 4-element microphone ...

Presented at the International Engineering Research Conference 2010 (IERC 2010) University of San Carlos - Talamban Campus, Cebu City Philippines January 14-16, 2010

SPEECH ENHANCEMENT USING 4-ELEMENT MICROPHONE ARRAY BEAMFORMING COMBINED WITH BLIND SIGNAL SEPARATION

Rodrigo C. Muñoz, Jr. and Nilo T. Bugtai Electronic & Communication Engineering Department De La Salle University, Manila [email protected], [email protected]

ABSTRACT

shortcomings of existing MBD [3]. It also has been successfully applied to speech enhancement in real world noisy environments [4]. Performance of BSS algorithms can be improved further by utilizing directional information of the sources through beamforming. Several approaches have been reported [1-2,46]. In this paper, however, beamforming is utilized as a front end before BSS to remove direct parts of interfering signals as shown in Fig. 1. For this purpose, NBF is chosen among various beamforming techniques for its capability of making sharp directional nulls toward interfering signals with only two microphones. This paper is organized as follows: first, blind signal separation is described briefly and the FNMBD algorithm is described with its structure, properties, and modifications. Next, geometry and filters for NBF is described. Then beam pattern of NBF, FNMBD, and NBF plus FNMBD are investigated and discussed. Finally, simulation results in real world environment are presented.

Blind signal separation is known as a powerful tool for enhancing noisy speech in many real world environments. In this paper, the detail properties and structure of the newly proposed null beamforming-blind separation (NBF-BSS) algorithm is discussed. The performance of the blind signal separation is improved using directional information of sources. The beam patterns of the null beamformer, blind signal separation, and the combination of the two are investigated. It is shown that directional null reduces mainly direct parts of the unwanted signals whereas blind source separation reduces reverberant parts as well as direct parts. Increasing the number of microphone sensors converge into faster and improved separation by removing the direct parts first by null beamforming. Simulation results using real office recordings confirm the expectation.

1. INTRODUCTION

S1

Speech enhancement of noisy speech in real world environments has been studied for a long time. It can be applied to most speech communication systems including automatic speech recognition, speech compression, acoustic echo cancellation. Speech enhancement using multiple microphones can be categorized into beamforming and blind signal separation (BSS). In beamforming, signals are separated using the directional information of the signals with respect to the microphones. On the other hand, blind signal separation does not use any information on the signals and assumes that original signals are statistically independent each other. In general, BSS provides better performance over the beamforming techniques in terms of the number of microphones, speech quality (distortion), and interference reduction. In case of blind speech separation in a noisy room, the received signal at a microphone would be a reverberant and mixed version of sound sources and it is modeled as convolutive mixing. Then restoration of the sound sources can be accomplished by estimating the unmixing (separating) filters that deconvolve the room impulse responses. Multichannel blind deconvoltion (MBD) is one practical method to estimate the unmixing filters. Recently, a fast and stable MBD algorithm has been proposed which removes

X’1

X1

U1

NBF

X2

BSS

U2

X’2

S2

Fig. 1. Speech enhancement system using BSS combined with NBF. 2.

BLIND SIGNAL SEPARATION

2.1. Blind Signal Separation Problem In convolutive mixing with P sources and Q sensors, the mixed signal at the sensor j is given by P

x j (), k = (),å

¥

å1,...,A ji p si k -

i = 1 p= - ¥

p

j =

Q

(1)

k i =1, K, P, are source signals and A ji , p is where si (), the pth coefficient of the (,i j)

1

th

component (from the source i to

the sensor j) of the mixing system. Assume that the unmixing ¥ system is given by W(,z k)() = å p = - ¥ Wp k z - p . Then

cascade. The next subsection introduces the basic of a null beamformer using a linear microphone array.

the i th separated signal is given by u i ()()() k =

Q

¥

å å

j = 1 p= - ¥

3.1. Geometry of NBF

W ij , p k x j k - p

where W ij , p ()k is the (,i j)

th

(2)

The geometry of the beamformer is shown below in Fig. 2. Sources 1 and 2 must be of different nature and are independent to each other. The locations of the sources are known a priori. Denoted by angle 1 and 2, these angles are measured from the normal line with respect to the microphone array. The signs of these angles are dependent to the direction of arrival (DOA) of the signals. To approximate the spatial characteristics and nature of the signals, the sensors must be located with the same separation distance. Consequently, for the purpose of simplifying the goal of this paper, it discussed the 4-microphone array concept. The phase center of the array plays an important role in the geometry of the beamformer for the reason that it serves as the reference in calculating spatial delays of the incoming signals in the aperture of each sensor. The spatial delay is the key in the beamforming analysis. The elements of matrix Af are frequency-domain quantity calculated from the spatial delay of the plane wave from both sensors in reference to phase center of the array. Note that the sign of d1 and d2 , which are the coordinates of the sensors, depend on the advancing direction of the plane wave with the phase center of the array being the point of origin.

component of Wp ()k . The

number of sensors Q is assumed to be equal to or greater than the number of sources P for successful separation. One practical method to estimate the unmixing system W(,z k) is multichannel blind deconvolution (MBD) which minimizes iteratively mutual information between separated signals. In [4], the unmixing filters are approximated with bidirectional filters which are truncated and shifted so that the unmixing signal is given by u()()() k =

L

å

p= 0

Wp k x k - p

(5)

2.2. A Frequency-Domain Normalized Multichannel Blind Deconvolution Algorithm Recently, a new MBD algorithm has been proposed to improve the performance of existing MBD algorithms. It employs the unidirectional filters and normalized gradient terms. The MBD algorithm can be implemented in the frequency domain using the advantage of FFT in an overlap-save manner [11]. It can be expressed in the frequency domain, in simplified matrix form omitting time-domain constraints, as

D W ()()(())() b = {I - y b u b f

f

f

H

}W

f

b

Normal line

Sound source 2

Sound source 1

+ 2

1

d

(6)

d1

where b denotes block index, the superscript f the quantity in the frequency domain, H the hermitian transposition. In addition, I is formed by repeating I number of times corresponding to FFT size. Notice that (6) is actually a timedomain algorithm so that forward/inverse Fourier transforms and proper time-domain constraints are required to compute linear convolution and correlation via circular convolution and correlation, respectively. The algorithm (6) with bidirectional filters is presented in [12]. The frame length N should be larger than 4L to compute L unbiased cross-correlation lags. For the bidirectional case, L+ however, 2L cross-correlations of lag from - (1) to

0

Linear Microphone array

dQ

Phase Center of the array

Fig. 2. Geometry of a linear beamformer. The mixing matrix A f which is assumed to be complex-valued due to the introduction of a model that deals with arriving lags/leads among each element can be expressed as,

AkiBF  exp  j 2 fd k sin(/i  c

(7)

where c is the velocity of sound, dk the position of the

k th microphone, i the incidence angle of the i th source.

L are required and we need N ³ 8L . If we omit this constraint, the cross-correlations are biased but more samples of u()b are utilized.

3.2. The Unmixing Matrix for NBF The filter coefficients of the NBF is computed simply as the

 

inverse of the mixing matrix W f = A f

3. NULL BEAMFORMING

1

. Thus the elements

of the unmixing matrix can be derive using following equations with a bit of manipulations and expressed as

There are several types of beamformers varying from adaptive and non-adaptive form. In this paper, the conventional nullbeamformer (NBF) is presented. A significant characteristic of the NBF is a steep null toward the interfering signal. It greatly supplements the performance of BSS when connected in

2

BF

W11 ()exp fm 



2   j sin f m d1 /

1 c 

Presented at the International Engineering Research Conference 2010 (IERC 2010) University of San Carlos - Talamban Campus, Cebu City Philippines January 14-16, 2010

4.

 exp  j 2 f m d1 (sin 2  sin 1 )/ c 

(8a)

Speech sounds of a male voice located at -30o and a female voice located at 40o, recorded in a normal office, are used as the two mixing sources. The mixing signals, approximately 120cm from the array, are applied to four microphones separated at 4cm. The mixed signals are recorded at 8 KHz sampling rate. The recording is arranged to have 3 segments of data with 10 seconds length per segment. The first segment is used for learning and the remaining segments are used for SIR calculations. The last two segments consist of two 10 seconds segment where only one source is active at each segment. Signal-to-Interference ratio (SIR) is used as a performance measure. The SIR is defined as the ratio of the signal power of the target signal from that of the interfering signal.



1

 exp  j 2 f m d 2 (sin 2  sin 1 )/ c  BF

W12 ()exp fm 



2   j sin fmd2 /

1 c 

 exp  j 2 f m d1 (sin 2  sin 1 )/ c 

(8b)



1

 exp  j 2 f m d 2 (sin 2  sin 1 )/ c  BF W21 ()exp fm 



2   j sin f m d1 / 2 c 

  exp  j 2 f m d1 (sin   sin 2 )/ c 

(8c)



1

 exp  j 2 f m d 2 (sin 1  sin 2 )/ c  BF W22 ()exp fm 



2   j sin f m d 2 / 2 c 

  exp  j 2 f m d1 (sin 1  sin 2 )/ c 

(8d)

4.2. Performance of the BSS (FNMBD) Combined with NBF



1

 exp  j 2 f m d 2 (sin 1  sin 2 )/ c  BF

W13 ()exp fm 



2   j sin fm d3 /

Beam pattern plays an important role in the experimentation and analysis and is expressed as

1 c 

Q

2W

  exp  j 2 f m d1 (sin 2  sin 1 )/ c 

Fl (,f )()exp  



BF



2   j sin f m d 3 / 2 c 

  exp  j 2 f m d1 (sin 2  sin 1 )/ c 

(8f)



 exp  j 2 f m d 2 (sin 2  sin 1 )/ c  BF W14 ()exp fm 



2   j sin fmd 4 /

1

1 c 

  exp  j 2 f m d1 (sin 1  sin 2 )/ c 

(8g)



 exp  j 2 f m d 2 (sin 1  sin 2 )/ c  BF W24 ()exp fm 



1

2   j sin f m d 4 / 2 c 

  exp  j 2 f m d1 (sin 1  sin 2 )/ c 



 exp  j 2 f m d 2 (sin 1  sin 2 )/ c 

f sin / j  fd k

 c

(9)

for given unmixing filters W lk ()f . In Fig. 5, beam patterns of a four (4) element NBF, BSS (FNMBD), and NBF+ BSS are shown. The beam pattern of NBF+FNMBD is obtained for the system in Fig. 1. It is clear that 4-element NBF provides a deep null at the specified DOA of the interfering signal. This reveals the fact that NBF removes the direct parts of the interfering signals effectively. On the other hand, BSS (FNMBD) provides null over the wide range of DOA which implies the fact that BSS removes reverberant parts as well as direct parts of the interfering signals. By directly cascading NBF into FNMBD, the resulting beam pattern shows clear null toward interfering signals as well as reverberant parts. Figure 6 investigate the performance of NBF+FNMBD in terms of SIR. Performance of FNMBD is also measured with input from two different microphones set-up, the 4-element and 2-center element microphone pairs. The performance of BSS (FNMBD) with inputs from 4-element 1/4 is better than that with inputs from 2-center element microphones. This indicates that separation performance of BSS (FNMBD) alone may be improved to a certain extent by increasing the number of microphone elements. However, by combining BSS with NBF, the separation performance is improved significantly. NBF itself provides SIR of 8dB by removing direct parts of the interfering signals. The cascaded BSS then take charges of the reverberant parts. Faster convergence also observed in the combined case.

(8e) W23 ()exp fm 

lk

k 1

1

 exp  j 2 f m d3 (sin 2  sin 1 )/ c 

SIMULATIONS

4.1. Simulation Setup

1

(8h) The estimated source signal components are possibly recovered with different order. To avoid this problem, we have to permute the sensor location by manipulating the working matrix. The solution for the matrix will be 1 0 1 f (9) Wf   A 1 0 It is important to notice that the unmixing filters of the null beamformer are transformed back into the time domain to compute the outputs. It is quite different from the previous experiments where the entire computation is accomplished in the frequency domain [7,8].

 

3

[4]

[5]

[6] Fig. 5. Beam pattern of NBF, FNMBD, and NBF+FNBMD.

[7]

[8]

[9] [10] Fig. 6. SIR Improvement of NBF+BSS over BSS (FNMBD). 5.

[11]

CONCLUSIONS

The performance of BSS(FNMBD) can be improved further by combining with NBF. Investigating the beam patterns of NBF, BSS(FNMBD), and NBF+BSS, the combined system is shown to provide clear improvement in reduction of direct parts of interfering signals. Simulation results using a realworld recording confirm the superior performance of the combined system in terms of SIR and convergence speed.

[12]

[13]

ACKNOWLEDGEMENT

[14]

This work was supported by the PGMA-CHED funded by Philippine Government (HEFDP). REFERENCES [1] R. Muñoz, Jr., “Enhancement of Noisy Speech Using Null Beamforming” International Research Conference 2008 (IRC-2008), West Visayas State University, La Paz, Iloilo City on February 27~29, 2008. [2] R. Muñoz, Jr. and S. H. Nam, “Speech Enhancement using blind signal separation combined with null beamforming,” Proc. The 9thWestern Pacific Acoustic Conference (WESPACIX 2006), Seoul, South Korea, page 204, 2006. [3] S.-I. Amari, S. C. Douglas, A. Cichocki, and H. H. Yang, “Novel on-line adaptive learning algorithms for blind

4

deconvolution using the natural gradient approach,” Proc. IEEE 11th IFAC Symposium on System Identification, SYSID-97, Kitakyushu, Japan, pp. 1057-1062, 1997. S. H. Nam and S. Beack, “A frequency-domain nomalized multichannel deconvolutive algorithm for acoustical signals,” Proc. Int. Workshop on Independent Component Analysis and Signal Separation (ICA'04), Granada, Spain, pp. 522-529, 2004. S. Beack, S. H. Nam, and M. Hahn, “A new speech enhancement algorithm for car environment noise cancellation with MBD and Kalman filtering,” IEICE Trans. Fundamentals, vol. E88-A, n0. 3, pp. 685-689, March 2005. C. Fancourt and L. Parra, “The generalized sidelobe decorrelator,” IEEE Workshop on Applications of Signal Processing to Audio and Acoustics, pp. 167-170, Oct. 2001. H. Saruwatari, T. Kawamura, K. Sawai, A. Kaminuma, M. Sakata, “Blind source separation based on fastconvergence algorithm using ICA and beamforming for real convolutive mixture,” Proceedings of IEEE International Conference on Acoustics, Speech and Signal Processing, pp.3097-3100, May 2002. H. Saruwatari, S. Kurita, K. Takeda, F. Itakura, and K. Shikano, “Blind source separation based on subband ICA and beamforming,” Proc. ICSLP2000, vol.3, pp.99-97, Oct. 2000. J.-F. Cardoso, “Blind signal separation: statistical principles,” Proceedings of the IEEE, vol. 9, no. 10, pp. 2009-2025, 1998. S.-I. Amari, “Natural gradient works efficiently in learning,” Neural Computation, vol. 10, no. 2, pp. 251276, 1998. L. Zhang, A. Cichocki, and S. Amari, “Geometrical structures of FIR manifold and their application to multichannel blind deconvolution,” Proc. IEEE Workshop on Neural Networks for Signal Processing (NNSP'99), Madison, Wisconsin, pp. 303-312, 1999. T.-W. Lee, A. J. Bell, and R. Lambert, “Blind separation of delayed and convolved sources,'' Advances in Neural Information Processing Systems, MIT Press, vol. 9, pp. 758-764, 1997. M. Joho, “A systematic approach to adaptive algorithms for multichannel system identification, inverse modeling, and blind identification,” Ph.D. dissertation, ETH Zurich, December 2000. S.-I. Amari, T.-P. Chen, and A.Cichocki, “Nonholonomic orthogonal learning Algorithms for blind source separation,” Neural Computation, vol. 12, no. 6, pp.