A COMPARISON BETWEEN ALTERNATIVE BEAMFORMING STRATEGIES FOR INTERFERENCE CANCELATION IN NOISY AND REVERBERANT ENVIRONMENT Shmulik Markovich, Sharon Gannot
Israel Cohen
School Engineering Bar-Ilan University Ramat-Gan, 52900, Israel e-mail:
[email protected]
Department of Electrical Engineering Technion - Israel Institute of Technology Haifa 32000, Israel e-mail:
[email protected]
ABSTRACT In speech communication systems the received microphone signals are often degraded by competing speakers, noise signals and room reverberation. Microphone arrays are commonly utilized to enhance the desired speech signal. In this paper two important design criteria, namely the Minimum Variance Distortionless Response (MVDR) and the Linearly Constrained Minimum Variance (LCMV) beamformers, are explored. These structures differ in their treatment of the interference sources. Experimental results using simulated reverberant enclosure are used for comparing the two strategies. It is shown that the LCMV beamformer outperforms the MVDR beamformer provided that the acoustic environment is time-invariant. Index Terms— Array signal processing, Speech Enhancement, Subspace methods, Interference cancellation. 1. INTRODUCTION Speech signals propagating in acoustic environments are often contaminated by interfering sources, such as competing speakers and noise sources, and are also subject to distortion due to the reverberating nature of the enclosure. In recent years microphone arrays are gaining an increased attention by both the research and industrial communities. The spatial diversity, inherent to the array configuration, can be utilized by applying a spatio-temporal filter, usually referred to as beamformer. The main objective of the beamformer is to extract a desired signal, impinging on the array from a specific location, out of noisy measurements thereof. Several criteria can be applied to the design of the beamformer (a comprehensive summary can be found in [1, 2]). Of special interest to the current contribution are the MVDR and the LCMV beamformers. In a MVDR beamformer [3], the power of the output signal is minimized under the constraint that signals arriving from the assumed direction of the desired speech source are processed without distortion. Several researchers proposed modifications to the MVDR for dealing with multiple linear constraints, denoted LCMV. Their work was motivated by the desire to apply further control to the array beam-pattern, beyond that of a steer-direction constraint. Hence, the LCMV can be applied for constructing a beam-pattern satisfying certain constraints for a set of directions, while minimizing the array response in all other directions. Griffiths and Jim have shown that the MVDR beamformer can be efficiently implemented in Generalized Sidelobe Canceler (GSC) structure, which decouples the constrain-
1-4244-2482-5/08/$20.00 ©2008 IEEE
ing and the minimization operations. Breed and Strauss [4] showed later that the LCMV beamformer has an equivalent GSC implementation as well. In this paper we compare the performance of two wideband beamformers in the task of extracting a single desired source from multiple speakers in noisy and reverberant environment. The first, the Transfer Function Generalized Sidelobe Canceler (TF-GSC) [5] is an MVDR beamformer reformulated in the frequency domain. The second is a recently proposed multi-constraint beamformer which belongs to the LCMV family of beamformers. By this comparison we wish to derive guidelines for the treatment of competing speakers. The structure of this paper is as follows. In Sec. 2 the task of enhancing a desired speech source, contaminated by both stationary noise and non-stationary interference signals in reverberant environment is presented. The LCMV beamformer and its GSC implementation are reviewed in Sec. 3. Two paradigms for treating the interference signals are discussed in Sec. 4. In the TF-GSC structure only the array response to the desired signal is constrained, while for the recently proposed multi-constraint beamformer, the array response to all directional sources is constrained. The two alternative beamformers are compared in Sec. 5 using both objective distortion measures and subjective assessment. The contribution is concluded in Sec. 6. 2. PROBLEM FORMULATION Consider M microphones arranged in an arbitrary array, receiving a desired speech signal sd (n) contaminated by Ns stationary noise sources ssi (n), i = 1, . . . , Ns , and Nns non-stationary sources, e.g. competing speakers, denoted sns i (n), i = 1, . . . , Nns . Each of the involved signals undergo filtering by a Room Impulse Response (RIR), representing the acoustic path (usually modeled as a Finite Impulse Response (FIR) filter), before being picked up by the microphones. The M received signals zm (n) are augmented in a vector which can be presented in the Short Time Fourier Transform T (STFT) domain as z(`, k) , z1 (`, k) . . . zM (`, k) where ` is the frame index and k represents the frequency bin. The received signals comprise contributions from the desired source sd (`, k), from all stationary noise s T s1 (`, k) . . . ssNs (`, k) sources ss (`, k) = , ns the non-stationary interference signals s (`, k) = ns T s1 (`, k) . . . sns , and sensor noise sigN (`, k) ns T v1 (`, k) . . . vM (`, k) nals v(`, k) , , usually
203
IEEEI 2008
assumed to be spatially-white. Mathematically, the problem can be formulated as: d
z1 (, k)
z(`, k) = h (`, k)s (`, k)+ (1) H s (`, k)ss (`, k) + H ns (`, k)sns (`, k) + v(`, k)
z(`, k) = H(`, k)s(`, k) + v(`, k)
(2)
the filtering matrix is defined as H(`, k) , where hd (`, k) H s (`, k) H ns (`, k) and the signal vector as T s(`, k) , sd (`, k) (ss (`, k))T (sns (`, k))T . 3. THE LCMV BEAMFORMER AND THE GSC FORMULATION
y(`, k) = w† (`, k)z(`, k) (3) T where w(`, k) = w1 (`, k), . . . , wM (`, k) are the beamformer filters, set to satisfy the LCMV criterion with multiple constraints: argmin {w† (`, k)Φzz (`, k)w(`, k)} † C (`,k)w(`,k)=g (`,k) (4)
where
C † (`, k)w(`, k) = g(`, k)
(5)
is the constraints set. The well-known solution to (4) is given by: w(`, k) = Φ−1 zz (`, k)C(`, k) −1 × C † (`, k)Φ−1 g(`, k) zz (`, k)C(`, k)
(6)
The LCMV can be implemented using the GSC formulation [4]. In this structure the filter set w(`, k) can be split to two orthogonal components [1], one in the constraint plane and the other in the orthogonal subspace: w(`, k)
=
w0 (`, k) − wn (`, k)
=
w0 (`, k) − B † (`, k)q(`, k)
(7)
where B(`, k) is the projection matrix to the “null” subspace, denoted Blocking Matrix (BM), i.e. BC(`, k) = 0. w0 (`, k) is the Fixed Beamformer (FBF) satisfying the constraints set, wn (`, k) is orthogonal to w0 (`, k), and q(`, k) is a set of Adaptive Noise Canceler (ANC) filters adjusted to obtain the (unconstrained) minimization. In the original GSC structure the
1-4244-2482-5/08/$20.00 ©2008 IEEE
−
y(, k)
yANC (, k) zM (), k
u1 (), k
u2 (, k)
ANC
BM q(, k) uM (, k)
Fig. 1. The proposed LCMV beamformer reformulated in a GSC structure. filters q(`, k) are calculated adaptively using the Least Mean Squares (LMS) algorithm. Using [1] the FBF is given by: w0 (`, k) = C(`, k) C † (`, k)C(`, k)
−1
g(`, k).
(8)
The BM can be determined as the projection matrix to the null subspace of the column-space of C:
The multi-microphone measurements can be utilized for enhancing the desired speaker. The enhancement is carried out by designing the beamformer:
w(`, k) =
+
FBF
d
d T where, hd (`, k) , are h1 (`, k) . . . hdM (`, k) the Acoustic Transfer Function (ATF)s, relating the desired source and the M microphones. Similarly, the filtering matrices components {H s (`, k)}im = hsim (`, k), i = 1, . . . , Ns , m = 1, . . . , M and {H ns (`, k)}im = hns im (`, k), i = 1, . . . , Nns , m = 1, . . . , M are the ATFs relating the stationary noise sources and the non-stationary interference sources, respectively, with the microphones. The assumption that the window length is much larger then the RIR length ensures the Multiplicative Transfer Function (MTF) approximation [6] validness. Compactly the problem can be stated as:
yFBF (, k)
z2 (, k)
B(`, k) = I − C(`, k) C † (`, k)C(`, k)
−1
C † (`, k). (9)
q(`, k) can be found in closed-form by applying the Wiener filter or adaptively by applying the (multi-channel) LMS algorithm. A block diagram of the GSC structure is depicted in Fig. 1. 4. ALTERNATIVE DEFINITIONS OF THE CONSTRAINTS SET Two paradigms can be adopted in the design of a beamformer which is aimed at enhancing a desired signal contaminated by both noise and interference. These paradigms differ in their treatment of the interference (competing speech and/or directional noise), which is manifested by the definition of the constraints set, namely C(`, k) and g(`, k). The straight-forward alternative is to apply a single constraint beamformer, usually referred to as MVDR beamformer, which was efficiently implemented by the TF-GSC [5], for the reverberant case. In this structure the desired signal direction is maintained by the FBF, while all other interference sources are treated by the ANC branch. Another alternative suggests defining constraints for both the desired and the interference sources. Two recent contributions adopt this alternative. In the Dual source Transfer Function Generalized Sidelobe Canceler (DTF-GSC) [7] algorithm the FBF block is modified to steer a beam towards the desired signal while simultaneously steering a null towards the interferer. The BM is modified to block both the desired and the non-stationary interference sources. The ANC is hence only responsible for the stationary noise signal mitigation. In a more recent contribution [8] the FBF is modified to treat all directional signals, i.e. maintaining the desired source, while directing nulls towards the non-stationary interference and stationary
204
IEEEI 2008
noise signals. The BM in this structure blocks all directional signals, and the ANC is responsible only for the residual errors. We will show in Sec. 5 that in static scenarios, welldesigned nulls towards all interfering signals (as proposed in [8]) result in an improved undesired signal cancellation compared with the ANC branch in the TF-GSC structure [5]. Naturally, while considering time-varying environments this advantage cannot be guaranteed. 4.1. The TF-GSC Algorithm In TF-GSC [5] structure the constraint set (5) is implemented using the following definitions: d ˜ d (`, k) , h (`, k) C(`, k) = h d h1 (`, k) h iT d hd (`,k) h (`,k) = 1 h2d (`,k) . . . hM d (`,k) 1
and
(10)
1
g(`, k) , g = 1.
(11)
Hence the FBF is given by: w0 (`, k) =
˜ d (`, k) h . ˜ d (`, k)k2 kh
(12)
The BM blocks only the desired signal, and its output signals are given by: um (`, k) = zm (`, k) −
˜ dm (`, k)z1 (`, k), h
m = 2, . . . , M. (13) It is understood that u1 (`, k) in Fig. 1 is identically zero. The total output of the algorithm is given by: y(`, k) = hd1 (`, k)sd (`, k)+ residual noise and interference signals.
4.2. Multi-Constraint GSC A novel method for designing and implementing a multiconstraint beamformer is presented in [8]. It is shown that the following constraint set obtains interference and directional noise mitigation while maintaining the desired source (or sources in the more general case) as received by the first microphone. The constraint matrix C(`, k) is given by: h i ˜ k) , h ˜ d (`, k) E(`, k) C(`, k) = C(`, (15) ˜ (`, k) is defined in (10) and E(`, k) is a basis spanwhere h ning the interference subspace. The desired response is given by: T 1 ... 1 0 ... 0 g(`, k) = g , | {z } | {z } . (16) K
Assume that there exists a segment `˜ during which the only active non-stationary signal is the desired source. The corresponding PSD matrix will then satisfy:
˜ k)f (k) = λ(k)Φ ˆ dzz (`, ˆ szz (`, k)f (k). Φ
(19)
Finally, the desired signal RTFs is obtained using the following normalization: d ˆ ˜ (`, k) , h
ˆ szz (`, k)f (k) Φ ˆ szz (`, k)f (k) Φ
(20)
1
where (·)1 is the first component of the vector corresponding to the reference microphone (arbitrarily chosen to be the first microphone). The interference subspace E(`, k is estimated by the following procedure. Let ` = `1 , . . . , `Nseg , be a set of Nseg frames for which all desired sources are inactive. For every segment we estimate the subspace spanned by the active interferences (both stationary and non-stationary). Let ˆ zz (`i , k) be a PSD estimate at the interference-only frame Φ `i . Using the Eigenvalue Decomposition (EVD) we have ˆ zz (`i , k) = E ˆ i (k)Λ ˆ i (k)E ˆ †i (k). Interference-only segments Φ consist of both directional interference and noise components and spatially-white sensor noise. The larger eigenvalues can be attributed to the coherent signals while the lower eigenvalues to the spatially-white signals. Hence, using pre-defined thresholds all eigenvectors corresponding to weak eigenvalues are discarded. To guarantee that eigenvectors i = 1, 2, . . . , Nseg that are common to more than one segment are not counted more than once they should be collected by the union operator: Nseg
ˆ E(k) ,
N −K
1-4244-2482-5/08/$20.00 ©2008 IEEE
†
ˆ szz (`, k). +Φ (18) Now, applying the Generalized Eigenvalue Decomposition ˜ k) and Φ ˆ dzz (`, ˆ szz (`, k) we have: (GEVD) to Φ ˜ k) ≈ (σ d (`, k))2 hd (`, ˜ k) hd (`, ˜ k) ˆ dzz (`, Φ
(14)
It is therefore shown that exact knowledge of the Relative ˜ d (`, k) suffices for the implementaTransfer Function (RTF)s h tion of the TF-GSC beamformer. This beamformer is only aiming at noise reduction, sacrificing dereverberation, as its desired signal component is identical to the desired signal as received at the first microphone. A practical estimation procedure, based on the non-stationarity of the speech signal is given in [5]. Affes and Grenier [9] proposed an alternative ATF estimation procedure, based on subspace tracking, for a similar GSC structure.
d
The FBF w0 is then constructed using (8) and the definitions (15)-(16), and the BM B(`, k) is constructed using (9) and (15). The reference noise signals are given by u(`, k) , B(`, k)z(`, k). The ANC filters are updated with the multichannel LMS algorithm (see [5] for details.). It is shown in [8] that if the defined constraint set is exactly met, and if the sensor noise v(`, k) is spatially-white, q(`, k) is identically zero. However, in practical scenarios, where the constraint set is replaced by its estimated value, the unavoidable estimation errors result in leakage of directional signals to the reference noise signals u(`, k). In this case the ANC branch contributes to the algorithm’s performance. The total output of the multiconstraint beamformer consists of components similar to the ˜ k), described in (14). A practical method for estimating C(`, necessary for the implementation of the beamformer, is derived in [8]. Here only a brief summary of the estimation method is given. The desired signal RTF is estimated by the following procedure. Consider time frames for which only the stationary sources are active and estimate the corresponding Power Spectrum Density (PSD) matrix: ˆ szz (`, k) ≈ H s (`, k)Λs (`, k) H s (`, k) † + σv2 I. (17) Φ
[
ˆ i (k) E
(21)
i=1
205
IEEEI 2008
ˆ where E(k) is an estimate for the interference subspace basis E(`, k) assumed to be time-invariant in the observation period. Unfortunately, due to arbitrary activity of sources and estimation errors, eigenvectors that correspond to the same source can be manifested as a different eigenvector in each segment. These differences can unnecessarily inflate the number of estimated interference sources. Erroneous rank estimation is one of the causes to the well-known desired signal cancelation phenomenon in beamformer structures, since desired signal components may be included in the null subspace. Consider the following Orthogonal Triangular Decomposition (QRD) of the subspace spanned by the major eigenvectors (weighted in respect to their eigenvalues) obtained by the previous estimation stage: Q(k)R(k) = h 1 ˆ 1 (k)Λ ˆ 12 (k) E
...
1 2
ˆ Nseg (k)Λ ˆ N (k) E seg
i
(22) P (k)
where Q(k) is a unitary matrix, R(k) is an upper triangular matrix with decreasing diagonal absolute values, P (k) is a per1 mutation matrix and (·) 2 is a square root operation performed on each of the diagonal elements. All vectors in Q(k) that do not pass a pruning procedure are not counted as basis vectors of the directional interference subspace. The collection of all vecˆ tors passing the designated thresholds, constitutes E(k), the estimate of the interference subspace basis. Finally, the constraint matrix that determines the beamformer is constructed as h d i ˆ ˜ k) = ˆ C(`, ˆ k) . ˜ (`, k) E(`, h
Similarly, the input SIR of the desired signal relative to the stationary signal i = 1, . . . , Ns : 2 P P d d s ` k s (`, k)h1 (`, k) SIRin,i [dB] = 10 log10 P P 2. s s ` k (si (`, k)hi1 (`, k))
These quantities are compared with the corresponding output SIR: 2 P P d ` k y (`, k) SIRns [dB] = 10 log P P out,i 10 2 ns ` k (yi (`, k)) P P 2 yid (`, k) SIRsout,i [dB] = 10 log10 P` Pk s 2 . ` k (yi (`, k))
For evaluating the distortion imposed on the desired source we also calculated the Segmental Signal to Noise Ratio (SSNR) and Log Spectral Distortion (LSD) distortion measures relating the desired source component at microphone #1, namely sd (`, k)hd1 , and its corresponding component at the output, namely y d (`, k). The algorithms were also subjectively compared by informal listening tests and by the assessment of waveforms and sonograms. The signal as recorded by microphone #1 is depicted in Fig. 2, the TF-GSC [5] output is depicted in Fig. 3, and the output of the multi-constraint algorithm [8] is shown in Fig. 4. It is evident that the multi-constraint beamformer outperforms the TF-GSC beamformer especially in terms of the competing speaker cancellation. In Table 1 the algorithms 4 3.5
5. EXPERIMENTAL STUDY
1-4244-2482-5/08/$20.00 ©2008 IEEE
Frequency [kHz]
10
2.5
5
2 0 1.5 −5 1 −10 0.5 −15
0 0.5
−20
Amplitude
The performance of the two strategies was compared in a simulated room environment, using one desired speech source, one stationary speech-like noise drawn from NOISEX-92 [10] database, and various number of competing speakers (ranging from zero to two). All speech signals were recorded in a quite environment. The image method proposed by Allen and Berkley was used to generate the RIR using the simulator in [11]. All the directional signals were then convolved with the corresponding time-invariant RIRs. The M = 11 microphone signals zm (`, k) were finally obtained by summing up the contributions of all directional sources with an additional uncorrelated sensor noise. The desired signal to sensor noise ratio was set to 41dB (this ratio determines σv2 ). The relative power between the desired source and all interference sources is depicted in the simulation results in Table 1. Denote by y d (`, k), the desired signal component at the beamformer output, yins (`, k); i = 1, . . . , Nns the corresponding non-stationary interference components, and yis (`, k); i = 1, . . . , Ns the stationary interference components. One quality measure used for evaluating the performance of the proposed algorithm is the improvement in the Signal to Interference Ratio (SIR) level. Since, generally, there are several interference sources we will use all pairs of SIR for quantifying the performance. The SIR of the desired signal relative to the non-stationary signal i = 1, . . . , Nns as measured on microphone #1 is defined as follows: 2 P P d d ` k s (`, k)h1 (`, k) SIRns [dB] = 10 log in,i 10 P P 2. ns ns ` k (si (`, k)hi1 (`, k))
15
3
0
−0.5 0
−25
5
10
15
20
25 30 Time [Sec]
35
40
45
50
Fig. 2. Microphone #1 signal. are compared in terms of the objective quality measures, as explained above, for various number of interference sources. It is evident from Table 1 that the multi-constraint beamformer outperforms the TF-GSC algorithm in terms of SIR improvement, as well as distortion level measured by the SSNR and LSD values. The lower distortion of the multi-constraint beamformer can be attributed to the stable nature of the nulls in the beam-pattern as compared with the adaptive nulls of the TFGSC structure. 6. CONCLUSIONS Two alternative beamforming strategies for interference cancelation in noisy and reverberant environment were compared in this work in terms of interference cancellation ability and imposed distortion. The TF-GSC, which belongs to the MVDR
206
IEEEI 2008
Nns
sns 1 − 6 6
0 1 2
Input SIR sns 2 − − 6
TF-GSC ss1 13 13 13
SIR sns 2 − − 16.01
sns 1 − 15.54 13.86
SSNR ss1 28.62 26.77 23.13
6.97 6.31 7.06
LSD 2.73 2.75 2.77
sns 1 − 26.01 23.58
Multi-Constraint SIR SSNR sns ss1 2 − 40.77 14.66 − 36.95 12.72 26.70 30.70 11.75
LSD 1.31 1.35 1.39
Table 1. SIR, SSNR and LSD improvements in dB relative to microphone #1 as obtained by the TF-GSC [5] and the multiconstraint [8] beamformers. Test scenario: Simulated room environment with reverberation time T60 = 300mS, 11 microphones, one desired speaker, one stationary noise, and various number of interfering speakers.
7. REFERENCES
4 3.5
15
Frequency [kHz]
3
10
2.5
5
2 0 1.5 −5 1 −10 0.5 −15
0 0.5 Amplitude
−20 0
−25
−0.5 0
5
10
15
20
25 30 Time [Sec]
35
40
45
4 15
Frequency [kHz]
3
10
2.5
5
2 0 1.5 −5 1 −10 0.5 −15
0 0.5
−20 0
−25
5
10
15
20
25 30 Time [Sec]
35
40
45
50
Fig. 4. The multi-constraint beamformer [8] output.
family, applies a single constraint towards the desired signal, leaving the interference mitigation to the adaptive branch of the beamformer. The recently proposed multi-constraint beamformer implicitly applies carefully designed nulls towards all interference signals, leaving only the treatment of residual signals to the adaptive branch. It is shown that for the timeinvariant scenario the later design shows a significant advantage over the former beamformer design. It remains an open question what is the preferred strategy in slowly time-varying scenarios.
1-4244-2482-5/08/$20.00 ©2008 IEEE
[3] J. Capon, “High-resolution frequency-wavenumber spectrum analysis,” Proc. IEEE, vol. 57, no. 8, pp. 1408–1418, Aug. 1969.
[5] S. Gannot, D. Burshtein, and E. Weinstein, “Signal enhancement using beamforming and nonstationarity with applications to speech,” Signal Processing, vol. 49, no. 8, pp. 1614–1626, Aug. 2001.
3.5
−0.5 0
[2] S. Gannot and I. Cohen, Springer Handbook of Speech Processing. Springer, 2007, ch. Adaptive Beamforming and Postfitering, pp. 199–228.
[4] B. R. Breed and J. Strauss, “A short proof of the equivalence of LCMV and GSC beamforming,” IEEE Signal Processing Lett., vol. 9, no. 6, pp. 168–169, Jun. 2002.
50
Fig. 3. The TF-GSC [5] output.
Amplitude
[1] B. D. Van Veen and K. M. Buckley, “Beamforming: A versatile approach to spatial filtering,” IEEE Trans. Acoust., Speech, Signal Processing, vol. 5, no. 2, pp. 4– 24, Apr. 1988.
[6] Y. Avargel and I. Cohen, “System identification in the short-time Fourier transform domain with crossband filtering,” IEEE Trans. Audio, Speech and Language Processing, vol. 15, no. 4, pp. 1305–1319, May 2007. [7] G. Reuven, S. Gannot, and I. Cohen, “Dual-source transfer-function generalized sidelobe canceler,” IEEE Trans. Audio, Speech and Language Processing, vol. 16, no. 4, pp. 711–727, May 2008. [8] S. Markovich, S. Gannot, and I. Cohen, “Multichannel eigenspace beamforming in a reverberant noisy environment with multiple interfering speech signals,” submitted to IEEE Transactions on Audio, Speech and Language Processing, Jul. 2008. [9] S. Affes and Y. Grenier, “A signal subspace tracking algorithm for microphone array processing of speech,” IEEE Trans. Speech and Audio Processing, vol. 5, no. 5, pp. 425–437, Sep. 1997. [10] A. Varga and H. J. M. Steeneken, “Assessment for automatic speech recognition: II. NOISEX-92: A database and an experiment to study the effect of additive noise on speech recognition systems,” Speech Communication, vol. 12, no. 3, pp. 247–251, Jul. 1993. [11] E. Habets, “Room impulse response (RIR) generator,” http://home.tiscali.nl/ehabets, Oct. 2008.
207
IEEEI 2008