International Workshop on Acoustic Signal Enhancement 2012, 4-6 September 2012, Aachen

MMSE-BASED BLIND SOURCE EXTRACTION IN DIFFUSE NOISE FIELDS USING A COMPLEX COHERENCE-BASED A PRIORI SAP ESTIMATOR

Maja Taseska and Emanuël A. P. Habets

International Audio Laboratories Erlangen∗
Am Wolfsmantel 33, 91058 Erlangen, Germany
{maja.taseska,emanuel.habets}@audiolabs-erlangen.de

ABSTRACT

In many practical situations, we can assume that the desired sound is strongly coherent across the microphone array and corrupted by diffuse noise. We propose an algorithm to extract such desired sounds without prior knowledge of the noise power spectral density (PSD) matrix or the direction of arrival of the desired source(s). We first propose a multichannel noise PSD matrix estimator that is based on the speech presence probability. To differentiate between desired and noise signals, we propose to control the a priori speech absence probability using an estimate of the direct-to-diffuse ratio (DDR). We then obtain an estimate of the desired signal using a parametric multichannel Wiener filter that allows a trade-off between noise reduction and speech distortion. We demonstrate that the noise reduction can be increased and the speech distortion decreased by controlling the trade-off parameter using the estimated DDR.

Index Terms— source extraction, noise reduction, speech presence probability, diffuse noise field

1. INTRODUCTION

Hands-free capture of speech is required in many human-machine interfaces and communication systems. Built-in microphones usually receive a mixture of the desired speech and ambient noise. As the noise degrades the quality and intelligibility of the desired sound, the microphone signals can be processed (i.e., filtered and summed) in order to extract the desired source signal or, in other words, reduce the noise. Most noise reduction algorithms require an accurate estimate of the noise power spectral density (PSD). In practice, the noise signal is unobservable and its PSD needs to be estimated from the noisy microphone signals.
In [1], a minima-controlled recursive averaging (MCRA) noise PSD estimator was proposed that uses a soft-decision update rule based on the a posteriori speech presence probability (SPP). A similar estimator was proposed in [2], where a fixed a priori speech absence probability (SAP) and a fixed a priori signal-to-noise ratio (SNR) were used rather than the signal-dependent quantities of [1]. Recently, Souden et al. [3] proposed a multichannel noise PSD matrix estimator that uses a multichannel SPP estimator [4]. In [3], the authors determine the a priori SAP using the a priori SNR, in a way similar to the MCRA noise PSD estimator. The a posteriori SPP has also been used to control the trade-off between noise reduction and speech distortion. In [5], for example, the SPP was used to control the trade-off parameter of a parametric multichannel Wiener filter (PMWF). In terms of both noise reduction and speech distortion, the SPP-controlled PMWF outperforms the traditional multichannel Wiener filter (MWF) that uses a fixed trade-off parameter.

∗ A joint institution of the University Erlangen-Nuremberg and Fraunhofer IIS.

In this contribution, we aim at extracting sounds that are strongly coherent across the array without prior knowledge of the noise PSD matrix and the direction of arrival of the coherent sound source(s). We assume that the desired source signal is immersed in diffuse noise, a scenario that is often encountered in practice. First, we extend the single-channel noise PSD estimator proposed in [2] to the multichannel case. In order to differentiate between desired sounds and noise, we propose to base the a priori SAP on the direct-to-diffuse ratio (DDR), determined using a recently proposed DDR estimator [6]. Finally, an MMSE estimate of the desired signal received by one of the microphones is obtained by applying a PMWF. Instead of controlling the trade-off parameter using the SPP, we propose to use the estimated DDR, thereby increasing the reduction of diffuse noise and decreasing the distortion of coherent sounds.

This paper is organized as follows: Section 2 gives an overview of the signal model in the short-time Fourier transform (STFT) domain and defines the quantities to be estimated in order to compute the MMSE estimate of the desired signal. In Section 3, we review the MMSE-based noise PSD matrix estimation and the computation of the multichannel SPP [4]. Moreover, we give a brief overview of the DDR estimator used to control the a priori SAP. In Section 4, it is shown how the desired signal is estimated using the PMWF. In Section 5, the performance of the proposed algorithm is evaluated. Section 6 contains the concluding remarks.

2. PROBLEM FORMULATION

In the following, we consider the well-established signal model in multichannel speech processing, where each microphone of an M-element array captures an additive mixture of a desired signal and noise. The signal received at the m-th microphone can be described in the STFT domain as

Y_m(n, k) = X_m(n, k) + V_m(n, k),    (1)

where X_m(n, k) and V_m(n, k) denote the complex spectral coefficients of the desired source signal and the noise component at the m-th microphone, respectively, and n and k are the time and frequency indices. In this work, we assume that the desired signal is spatially coherent across the microphones and that the spatial coherence of the noise follows that of an ideal spherically isotropic sound field [7]. The observed noisy signals can be written in vector notation as y(n, k) = [Y_1(n, k) ... Y_M(n, k)]^T, and the PSD matrix of y(n, k)

is defined as

Φ_yy(n, k) = E{ y(n, k) y^H(n, k) },    (2)

where the superscript H denotes the conjugate transpose. The vectors x and v and the matrices Φ_xx and Φ_vv are defined similarly. The speech and noise signals are assumed to be uncorrelated and zero-mean, such that (2) can be written as

Φ_yy(n, k) = Φ_xx(n, k) + Φ_vv(n, k).    (3)
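In practice, the PSD matrices above are not observable directly and are approximated by recursive averaging of STFT frames, as done later in the paper. A minimal NumPy sketch of one such update (the helper name `recursive_psd` is ours; the form matches the recursive averaging used for the noisy PSD matrix):

```python
import numpy as np

def recursive_psd(Phi_prev, y, alpha=0.8):
    """One-frame recursive estimate of a PSD matrix:
    Phi(n) = alpha * Phi(n-1) + (1 - alpha) * y y^H."""
    return alpha * Phi_prev + (1.0 - alpha) * np.outer(y, y.conj())
```

The smoothing constant alpha trades tracking speed against variance of the estimate.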

We introduce the following standard hypotheses regarding the presence of a desired signal in a given time-frequency bin:

H_0(n, k): y(n, k) = v(n, k), indicating speech absence, and
H_1(n, k): y(n, k) = x(n, k) + v(n, k), indicating speech presence.

Taking the first microphone of the array as a reference, our objective is to estimate the desired signal X_1(n, k).

3. NOISE PSD MATRIX ESTIMATION

Taking into account the speech presence uncertainty, a minimum mean square error (MMSE) estimate of the noise PSD matrix at a given time-frequency bin is given by¹

E{ v v^H | y } = p[H_0 | y] E{ v v^H | y, H_0 } + p[H_1 | y] E{ v v^H | y, H_1 }.    (4)

A common noise estimation technique to approximate (4) is to use a weighted sum of recursively averaged past spectral power values of the noisy observation and an estimate of the noise PSD of the previous frame, as described in [1, 2] for the single-channel case and in [4] for the multichannel case. This estimation technique can be expressed as

Φ̂_vv(n) = p[H_0 | y] ( α_v Φ̂_vv(n−1) + (1 − α_v) y y^H ) + p[H_1 | y] Φ̂_vv(n−1),    (5)

where Φ̂_vv is the estimated noise PSD matrix and 0 ≤ α_v < 1 is a chosen smoothing parameter. Rearranging (5), the following update rule is obtained:

Φ̂_vv(n) = p[H_0 | y] (1 − α_v) y y^H + [ α_v + p[H_1 | y] (1 − α_v) ] Φ̂_vv(n−1)
         = α̃(n) y y^H + (1 − α̃(n)) Φ̂_vv(n−1),    (6)

where α̃(n) = p[H_0 | y] (1 − α_v). In the following, we give a brief overview of the multichannel SPP [4] required to control the time-varying and frequency-dependent smoothing parameter α̃(n).

3.1. Multichannel Speech Presence Probability

Under the assumption that the desired speech and noise components can be modelled as complex multivariate Gaussian random variables, the multichannel SPP estimate is given by [4]

p[H_1(n, k) | y(n, k)] = { 1 + [ q(n, k) / (1 − q(n, k)) ] ξ(n, k) e^{−β(n, k)/ξ(n, k)} }^{−1},    (7)

where q(n, k) = p[H_0(n, k)] denotes the a priori speech absence probability (SAP), and

ξ(n, k) = 1 + tr{ Φ_vv^{−1}(n, k) Φ_xx(n, k) },    (8)

β(n, k) = y^H(n, k) Φ_vv^{−1}(n, k) Φ_xx(n, k) Φ_vv^{−1}(n, k) y(n, k),    (9)

¹ In the sequel, we omit the time and frequency indices where possible.

where tr{·} denotes the trace operator. The a priori SAP q(n, k) can be fixed (cf. [2]), or it can be signal-dependent (cf. [1, 3]). In the following section, we propose to use the DDR to determine the a priori SAP.

3.2. DDR-based a priori Speech Absence Probability

In this work, we estimate the DDR using a complex coherence (CC) based estimator that was recently proposed in [6]. The CC between two signals measured at microphones a and b is defined in the STFT domain as

γ_ab(n, k) = φ_ab(n, k) / √( φ_aa(n, k) φ_bb(n, k) ),    (10)

where φ_ab(n, k) is the cross PSD and φ_aa(n, k) and φ_bb(n, k) are the auto PSDs of the two signals. The DDR estimator in [6] is based on a sound field model in which the sound pressure at any position and time-frequency instance is modelled as a superposition of direct sound, represented by a single monochromatic plane wave, and an ideal diffuse field. Assuming omnidirectional microphones, the CC function can be expressed as

γ_ab(n, k) = [ Γ(n, k) e^{jθ(n, k)} + γ_ab,diff(k) ] / [ Γ(n, k) + 1 ],    (11)

where θ(n, k) is the phase shift of the direct sound between the two microphones, Γ(n, k) denotes the DDR, and γ_ab,diff(k) = sin(κr)/(κr) is the CC of an ideal spherically isotropic sound field, with κ the wavenumber at frequency index k and r the distance between sensors a and b. The PSDs required to compute γ_ab(n, k) using (10) are approximated by temporal averages, and the phase shift θ(n, k) of the direct sound is estimated from the estimated noisy cross PSD, i.e., θ̂(n, k) = ∠ φ̂_ab(n, k). The DDR Γ(n, k) can now be expressed in terms of the estimated CC γ̂_ab(n, k) and the estimated phase shift θ̂(n, k) as

Γ̂(n, k) = [ γ_ab,diff(k) − γ̂_ab(n, k) ] / [ γ̂_ab(n, k) − e^{jθ̂(n, k)} ].    (12)

For a detailed derivation of this estimator, we refer the reader to [6]. Depending on the application, the CC function γ_ab,diff(k) can also be replaced by the spatial coherence corresponding to another noise field.

Clearly, low values of Γ̂(n, k) indicate the absence of the desired coherent source, whereas high values indicate its presence. Based on this observation, we can use Γ̂(n, k) to compute the a priori SAP q(n, k). We propose the following mapping function:

f[Γ(n, k)] = l_min + (l_max − l_min) · 10^{cρ/10} / ( 10^{cρ/10} + Γ(n, k)^ρ ),    (13)

where l_min and l_max determine the minimum and maximum values that the function can attain, c (in dB) controls the offset along the Γ axis, and ρ defines the steepness of the transition region between the desired and non-desired components. The parameters are chosen such that a low DDR corresponds to a high SAP, while a high DDR corresponds to a low SAP.
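For a single frequency bin and microphone pair, the DDR estimate (12) and the mapping (13) could be sketched as follows. The function names are ours, and taking the real part of the complex ratio and flooring it at zero is a practical assumption not stated explicitly above. Note that NumPy's `np.sinc(x)` computes sin(πx)/(πx), so `np.sinc(κr/π)` yields sin(κr)/(κr):

```python
import numpy as np

def estimate_ddr(gamma_hat, theta_hat, kappa, r):
    """CC-based DDR estimate, Eq. (12), for one microphone pair and bin."""
    gamma_diff = np.sinc(kappa * r / np.pi)   # sin(kr)/(kr), diffuse-field CC
    ddr = (gamma_diff - gamma_hat) / (gamma_hat - np.exp(1j * theta_hat))
    return max(ddr.real, 0.0)                 # real, non-negative DDR (assumption)

def ddr_to_sap(ddr, l_min=0.2, l_max=0.8, c=3.0, rho=2.0):
    """Mapping function f[Gamma], Eq. (13); defaults are the q-mapping of Fig. 1."""
    offset = 10.0 ** (c * rho / 10.0)         # c is given in dB
    return l_min + (l_max - l_min) * offset / (offset + ddr ** rho)
```

With the default parameters, a DDR of zero yields the maximum SAP l_max, and the SAP decreases monotonically toward l_min as the DDR grows, as described in the text.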

3.3. Summary of the Noise PSD Matrix Estimator

The proposed multichannel noise PSD matrix estimator is an extension of the single-channel estimator proposed in [2]. In contrast to the algorithm in [2], the a priori SAP is based on the DDR, and the a priori SNR is based on the multichannel observations. The proposed noise PSD matrix estimator can be summarized as follows:

1. Estimate the a priori SAP q(n) from the DDR estimate Γ̂(n) for the current frame, obtained according to (12) and mapped using (13).

2. Estimate p[H_1(n) | y(n)] according to (7), using the estimate from the previous frame Φ̂_vv(n−1) and the current recursive estimate of Φ_yy(n) given by

Φ̂_yy(n) = α_y Φ̂_yy(n−1) + (1 − α_y) y(n) y^H(n).    (14)

3. Compute a recursively smoothed SPP as follows:

p̄(n) = α_p p̄(n−1) + (1 − α_p) p[H_1(n) | y(n)].    (15)

4. Avoid stagnation by setting p[H_1(n) | y(n)] to a chosen maximum value p_max whenever p̄(n) > p_max.

5. Update the noise PSD matrix using p[H_1(n) | y(n)] and (5)-(6).

The parameters α_y and α_p denote smoothing constants.

4. BLIND SOURCE EXTRACTION

Considering the first microphone as a reference, the STFT-domain PMWF is given by [8]

h_{W,μ}(n, k) = Φ_vv^{−1}(n, k) Φ_xx(n, k) u_1 / [ μ(n, k) + tr{ Φ_vv^{−1}(n, k) Φ_xx(n, k) } ],    (16)

where μ(n, k) is the trade-off parameter and u_1 = [1 0 ... 0]^T. A major advantage of this filter is that it does not require an estimate of the propagation vector of the desired source. In this contribution, we propose to control the trade-off parameter based on the estimated DDR using (13), such that μ(n, k) = f[Γ̂(n, k)]. The parameters l_min, l_max, ρ and c are chosen such that μ(n, k) > 1 when the estimated DDR is low, to achieve a larger amount of noise reduction compared to the standard MWF, and μ(n, k) ≈ 0 (i.e., approximately equal to the MVDR beamformer [8]) when the estimated DDR is high, to avoid speech distortion. Examples of the mapping functions for the trade-off parameter μ and the SAP q are depicted in Fig. 1. Finally, the MMSE estimate of the desired signal is obtained according to

X̂_1(n, k) = p[H_1(n, k) | y(n, k)] h_{W,μ}^H(n, k) y(n, k) + p[H_0(n, k) | y(n, k)] G_min(k) Y_1(n, k),    (17)

where the gain factor G_min(k) determines the maximum amount of noise reduction when the desired speech is assumed to be inactive and mitigates speech distortion in case of a false-negative decision.
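Putting the pieces together, the per-bin processing of Sections 3 and 4 might look as follows. This is a minimal NumPy sketch under our own naming and single-bin interface; it implements the SPP of (7)-(9), the soft-decision noise PSD update of (5)-(6), the PMWF of (16), and the gain-floored output of (17):

```python
import numpy as np

def multichannel_spp(y, Phi_vv, Phi_xx, q):
    """A posteriori SPP p[H1 | y], Eqs. (7)-(9), for one time-frequency bin."""
    Phi_vv_inv = np.linalg.inv(Phi_vv)
    B = Phi_vv_inv @ Phi_xx
    xi = 1.0 + np.trace(B).real                      # Eq. (8)
    beta = (y.conj() @ B @ Phi_vv_inv @ y).real      # Eq. (9)
    return 1.0 / (1.0 + (q / (1.0 - q)) * xi * np.exp(-beta / xi))

def update_noise_psd(Phi_vv_prev, y, p_h1, alpha_v=0.75):
    """Soft-decision recursive noise PSD matrix update, Eqs. (5)-(6)."""
    alpha_t = (1.0 - p_h1) * (1.0 - alpha_v)         # p[H0|y](1 - alpha_v)
    return alpha_t * np.outer(y, y.conj()) + (1.0 - alpha_t) * Phi_vv_prev

def pmwf(Phi_vv, Phi_xx, mu):
    """Parametric multichannel Wiener filter h_{W,mu}, Eq. (16), mic 1 as reference."""
    B = np.linalg.solve(Phi_vv, Phi_xx)              # Phi_vv^{-1} Phi_xx
    return B[:, 0] / (mu + np.trace(B).real)         # B u_1 is the first column of B

def mmse_output(y, h, p_h1, g_min=0.1):
    """Gain-floored MMSE estimate of X1, Eq. (17); g_min is a chosen floor."""
    return p_h1 * (h.conj() @ y) + (1.0 - p_h1) * g_min * y[0]
```

Setting mu = 0 recovers the MVDR solution and mu = 1 the standard MWF, matching the trade-off described above; in the proposed method, mu is obtained per bin from the DDR via the mapping (13).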

It is important to note that, by defining the coherent signal component as desired and the diffuse component as noise, the DDR is equivalent to the a priori SNR, which is commonly used to determine the a priori SAP. In contrast to the a priori SNR, the DDR is based on the spatial coherence and assumes that the desired sound field and the noise field are homogeneous.

[Fig. 1 appears here: the mapping functions μ (solid line, left axis, 0 to 5) and q (dashed line, right axis, 0 to 1) plotted against the DDR Γ in dB, from −15 to 15 dB.]

Fig. 1. Parameters controlled by the estimated DDR. For μ: l_min = 0, l_max = 5, ρ = 2, c = 0. For q: l_min = 0.2, l_max = 0.8, ρ = 2, c = 3.

5. PERFORMANCE EVALUATION

5.1. Setup and Performance Measures

We evaluate the performance of the proposed algorithm in terms of the achieved speech enhancement at the output of the PMWF. The analysis was carried out for different SNRs and a reverberation time of 300 ms. Two different types of noise were used: stationary noise with a long-term PSD equal to the long-term PSD of speech, and non-stationary babble noise. In both cases, the CC of the noise signals corresponds to the CC of an ideal diffuse field [9]. The sampling frequency was 16 kHz and the frame length was L = 512 samples. The simulation was performed for a uniform linear array of M = 4 microphones with an inter-microphone spacing of d = 2.3 cm. The desired signals were obtained by convolving 45 s of clean speech with room impulse responses (RIRs) that were generated using an efficient implementation of the image-source model [10]. The PSDs required for the DDR estimate are approximated by averaging over 15 time frames. For these experiments, we used the q and μ mappings with the parameters as illustrated in Fig. 1. The smoothing parameters used in the recursive averaging in (6), (14) and (15) were chosen as 0.75, 0.8 and 0.9, respectively.

We studied the PESQ score improvement [11] and the segmental SNR gain at the output of different beamformers steered by the estimated noise PSD matrix. The PESQ improvement is computed as the difference in PESQ score between the inverse STFT of X̂_1 and the inverse STFT of Y_1. The segmental SNR was obtained by splitting the signals into non-overlapping segments of 10 ms and averaging over the obtained SNR values in dB. The segmental SNRs at the input and output are denoted by S_i and S_o, respectively. We compare the performance of the standard MVDR and Wiener beamformers, the DDR-controlled PMWF, and the estimate given by (17).

5.2. Results

The PESQ improvement at the output of the beamformers is illustrated in Fig. 2 as a function of the input SNR S_i. It can be seen that the proposed MMSE estimator outperforms the standard beamformers. In addition, the DDR-controlled PMWF performs better than the two beamformers with a fixed trade-off. The algorithm leads to a significant PESQ improvement in the case of babble noise, which, due to its non-stationarity, represents a challenging problem for many algorithms. The corresponding segmental SNR gains are shown in Fig. 3. Spectrograms of the desired source signal at the first microphone, the received noisy signal, the standard MWF output and the MMSE-based estimate are illustrated in Fig. 4 for an excerpt of 11 s. The corresponding mapping from the estimated DDR to the a priori SAP is shown in Fig. 5. It can be seen that the SAP is correctly estimated at high frequencies as well, thereby preserving the speech signal at frequencies where the input SNR is low.
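The segmental SNR measure described above could be computed as follows. This is a sketch under our own naming, assuming separate access to the clean and noise components at the evaluation point; skipping empty segments is our own assumption, not specified in the paper:

```python
import numpy as np

def segmental_snr_db(clean, noise, fs=16000, seg_ms=10):
    """Mean over non-overlapping 10 ms segments of the per-segment SNR in dB."""
    seg = int(fs * seg_ms / 1000)                # 160 samples at 16 kHz
    n_seg = min(len(clean), len(noise)) // seg
    snrs = []
    for i in range(n_seg):
        ps = np.sum(clean[i * seg:(i + 1) * seg] ** 2)
        pv = np.sum(noise[i * seg:(i + 1) * seg] ** 2)
        if ps > 0 and pv > 0:                    # skip empty segments (assumption)
            snrs.append(10.0 * np.log10(ps / pv))
    return float(np.mean(snrs))
```

The SNR gain reported in Fig. 3 is then S_o − S_i, the difference between this measure evaluated at the beamformer output and at the reference microphone.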

[Fig. 2 appears here: PESQ improvement versus input SNR S_i (dB) for the four processors μ = 1, μ = 0, μ = f(Γ), and Eq. (17).]

Fig. 2. PESQ improvement for stationary (left) and babble noise (right).

[Fig. 3 appears here: segmental SNR gain S_o − S_i (dB) versus input SNR S_i (dB) for μ = 1, μ = 0, μ = f(Γ), and Eq. (17).]

Fig. 3. SNR gain for stationary (left) and babble noise (right).

[Fig. 4 appears here: spectrograms, frequency (kHz) versus frame index: (a) clean signal, (b) noisy signal, (c) PMWF with μ = 1, (d) PMWF according to (17).]

Fig. 4. Exemplar spectrograms for babble noise (S_i = 11 dB).

[Fig. 5 appears here: frequency (kHz) versus frame index: (a) DDR (Γ), (b) a priori SAP (q).]

Fig. 5. Estimated DDR and the corresponding SAP (S_i = 11 dB).

6. CONCLUSIONS

An algorithm was proposed to blindly extract sounds that are strongly coherent across the array. First, we proposed a multichannel noise PSD matrix estimator that is based on the a posteriori SPP. In contrast to earlier works, we use an estimate of the DDR to determine the a priori SAP. Secondly, we proposed to use the estimated DDR to control the trade-off parameter of the PMWF. Finally, we demonstrated that the proposed DDR-controlled PMWF outperforms the MVDR beamformer and the MWF in terms of segmental SNR improvement and PESQ improvement.

7. REFERENCES

[1] I. Cohen, "Noise spectrum estimation in adverse environments: Improved minima controlled recursive averaging," IEEE Trans. Speech Audio Process., vol. 11, no. 5, pp. 466–475, Sept. 2003.

[2] T. Gerkmann and R. C. Hendriks, "Noise power estimation based on the probability of speech presence," in Proc. IEEE Workshop on Applications of Signal Processing to Audio and Acoustics (WASPAA), New Paltz, NY, 2011.

[3] M. Souden, J. Chen, J. Benesty, and S. Affes, "An integrated solution for online multichannel noise tracking and reduction," IEEE Trans. Audio, Speech, Lang. Process., vol. 19, pp. 2159–2169, 2011.

[4] M. Souden, J. Chen, J. Benesty, and S. Affes, "Gaussian model-based multichannel speech presence probability," IEEE Trans. Audio, Speech, Lang. Process., vol. 18, no. 5, pp. 1072–1077, July 2010.

[5] K. Ngo, A. Spriet, M. Moonen, J. Wouters, and S. H. Jensen, "Incorporating the conditional speech presence probability in multi-channel Wiener filter based noise reduction in hearing aids," EURASIP Journal on Advances in Signal Processing, vol. 2009, 2009.

[6] O. Thiergart, G. Del Galdo, and E. A. P. Habets, "Signal-to-reverberant ratio estimation based on the complex spatial coherence between omnidirectional microphones," in Proc. IEEE Intl. Conf. on Acoustics, Speech and Signal Processing (ICASSP), 2012, pp. 309–312.

[7] G. W. Elko, "Spatial coherence functions," in Microphone Arrays: Signal Processing Techniques and Applications, M. Brandstein and D. Ward, Eds., chapter 4, pp. 61–85. Springer-Verlag, 2001.

[8] J. Benesty, J. Chen, and Y. Huang, Microphone Array Signal Processing, Springer-Verlag, Berlin, Germany, 2008.

[9] E. A. P. Habets, I. Cohen, and S. Gannot, "Generating nonstationary multisensor signals under a spatial coherence constraint," J. Acoust. Soc. Am., vol. 124, no. 5, pp. 2911–2917, Nov. 2008.

[10] E. A. P. Habets, "Room impulse response generator," Tech. Rep., Technische Universiteit Eindhoven, 2006.

[11] A. Rix, J. Beerends, M. Hollier, and A. Hekstra, "Perceptual evaluation of speech quality (PESQ) - a new method for speech quality assessment of telephone networks and codecs," in Proc. IEEE Intl. Conf. on Acoustics, Speech and Signal Processing (ICASSP), 2001, vol. 2, pp. 749–752.
