SPEECH ENHANCEMENT USING A JOINT MAP ESTIMATOR WITH GAUSSIAN MIXTURE MODEL FOR (NON-)STATIONARY NOISE

Balázs Fodor and Tim Fingscheidt

Institute for Communications Technology, Technische Universität Braunschweig
Schleinitzstr. 22, D-38106 Braunschweig, Germany
{b.fodor, t.fingscheidt}@tu-bs.de

ABSTRACT

In many applications, non-stationary Gaussian or stationary non-Gaussian noises can be observed. In this paper we present a joint maximum a posteriori (JMAP) estimation of spectral amplitude and phase. It principally allows for arbitrary speech models (Gaussian, super-Gaussian, ...), while the pdf of the noise DFT coefficients is modeled as a Gaussian mixture model (GMM). Such a GMM covers not only a non-Gaussian stationary noise process, but also a non-stationary process that changes between Gaussian noise modes of different variance, each with a probability given by the respective GMM weight. Accordingly, we provide results for these two types of noise, showing superiority over the Gaussian noise model JMAP estimator even in the case of ideal noise power estimation.

Index Terms: Speech enhancement, MAP estimation, Gaussian noise
1. INTRODUCTION

In speech enhancement, a low level of speech distortion can be achieved by employing an appropriate speech model. Classically, the discrete Fourier transform (DFT) coefficients of the speech signal are modeled by a Gaussian distribution [1, 2]. However, the actual speech content in a noisy signal can be better preserved by applying different speech priors, such as generalized Gamma [3] or super-Gaussian [4] priors.

Analogously, a further reduction of residual noise can be attained by a proper selection of the noise model. Instead of a Gaussian distribution [2, 4], it is sometimes preferable to use a Gaussian mixture model (GMM) of the noise DFT coefficients, as done in [3, 5] in the context of minimum mean-squared error (MMSE) estimators. Such a model is suitable for environments with stationary non-Gaussian noise or with non-stationary Gaussian noise. Also non-stationary non-Gaussian interferers such as babble noise could be modeled by a GMM.

In this contribution we develop a joint maximum a posteriori (JMAP) estimator for speech enhancement under a non-Gaussian noise assumption, which is reflected by a GMM. To ease comparison to [2], a Gaussian model of the speech DFT coefficients is employed; however, our approach is not restricted to the Gaussian speech model. The resulting spectral weighting rule turns out to be analytically intractable, therefore we present a numerical solution, obtained by an offline full search using the JMAP criterion.

The paper is organized as follows: In Section 2, a short review of the reference joint MAP estimator with a Gaussian noise model is given. Section 3 presents the new JMAP estimator based on a noise GMM, followed by the evaluation of the proposed weighting rule in Section 4. Finally, Section 5 gives some concluding remarks.

2. REFERENCE ESTIMATOR WITH GAUSSIAN NOISE MODEL

The input signal y(n) of a speech enhancement system is assumed to consist of the clean speech signal s(n) and the additive noise signal n(n). After segmentation, windowing, and the DFT, the input signal can be written as Y(ℓ,k) = S(ℓ,k) + N(ℓ,k), with ℓ being the frame index and k being the frequency bin index. Using polar coordinates, the input signal can be reformulated as

    R(ℓ,k) e^{jΘ(ℓ,k)} = A(ℓ,k) e^{jα(ℓ,k)} + B(ℓ,k) e^{jβ(ℓ,k)},

where R, A, B (Θ, α, β) are the magnitudes (phases) of the short-time spectra Y, S, and N, respectively¹.

¹ For ease of notation, indices ℓ and k are largely omitted in the rest of the paper.

Using the JMAP error criterion, the estimated magnitude Â and phase α̂ of the clean speech spectrum are computed as:
    \hat{A}, \hat{\alpha} = \arg\max_{A,\alpha} p(A,\alpha|Y) = \arg\max_{A,\alpha} \frac{p(Y|A,\alpha)\, p(A,\alpha)}{p(Y)},   (1)

where p(·) denotes the corresponding probability density function (pdf). Since the probability density p(Y) is independent of the magnitude A and the phase α, (1) can be rewritten as:

    \hat{A}, \hat{\alpha} = \arg\max_{A,\alpha} J(A,\alpha)   (2)

with J(A,α) = p(A,α) · p(Y|A,α). Let us assume that the real and imaginary parts of the speech DFT coefficients are independent and identically distributed (i.i.d.), that A is statistically independent of α, and that α is uniformly distributed. A Gaussian speech model then leads to a Rayleigh pdf of the speech amplitude, yielding

    p(A,\alpha) = \frac{1}{2\pi} \cdot \frac{2A}{\sigma_S^2}\, e^{-A^2/\sigma_S^2}.   (3)

The real and imaginary parts of the noise DFT coefficients N are assumed to be Gaussian i.i.d., therefore p(Y|A,α) turns out to be (subscript G for the Gaussian noise model)

    p_G(Y|A,\alpha) = \frac{1}{\pi\sigma_N^2}\, e^{-\frac{1}{\sigma_N^2}|Y - A e^{j\alpha}|^2}.   (4)

The phase estimate α̂ can be obtained by setting the partial derivative of the natural logarithm of the cost function J(A,α) w.r.t. α equal to zero. Using the phase estimate α̂, the amplitude estimate Â can be
calculated by again setting the partial derivative of ln J(A,α) w.r.t. A equal to zero. Then, the estimate of the clean speech spectrum can be expressed as follows:

    \hat{S}(\ell,k) = \hat{A}(\ell,k)\, e^{j\hat{\alpha}(\ell,k)} = G(\ell,k) \cdot Y(\ell,k).   (5)

The JMAP weighting rule G_G under the Gaussian assumption for both the speech and the noise term turns out to be [2]:

    G_G = \frac{\xi + \sqrt{\xi^2 + 2(1+\xi)\,\xi/\gamma}}{2(1+\xi)},   (6)

being a function of the a posteriori signal-to-noise ratio (SNR) γ(ℓ,k) = R²(ℓ,k)/σ_N²(ℓ,k) and the a priori SNR ξ(ℓ,k) = σ_S²(ℓ,k)/σ_N²(ℓ,k), with σ_N²(ℓ,k) and σ_S²(ℓ,k) being the noise and speech variance in the DFT domain, respectively. A plot of the weighting rule G_G can be seen in Figure 1.

[Fig. 1. Weighting rule G_G of the JMAP estimator using a Gaussian speech and noise model; surface plot over ξ ∈ [−20, 20] dB and γ ∈ [−20, 20] dB.]

[Fig. 2. Weighting rule G_GMM of the JMAP estimator using a Gaussian speech model and a noise GMM with two Gaussian modes according to Table 1; surface plot over ξ ∈ [−20, 20] dB and γ ∈ [−20, 20] dB.]
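For reference, (6) is straightforward to implement. The following is a minimal NumPy sketch (function name and vectorization are our own, not from the paper), taking linear-scale SNRs:

```python
import numpy as np

def gain_jmap_gaussian(xi, gamma):
    """JMAP weighting rule G_G of eq. (6) (Wolfe/Godsill [2]) for a
    Gaussian speech and noise model; xi (a priori SNR) and gamma
    (a posteriori SNR) are linear-scale scalars or arrays."""
    xi = np.asarray(xi, dtype=float)
    gamma = np.asarray(gamma, dtype=float)
    return (xi + np.sqrt(xi**2 + 2.0 * (1.0 + xi) * xi / gamma)) / (2.0 * (1.0 + xi))

# Example: xi = 5 dB, gamma = 0 dB
G = gain_jmap_gaussian(10 ** (5 / 10), 10 ** (0 / 10))
```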
3. NEW JMAP ESTIMATOR WITH GMM NOISE MODEL

Assuming that the real and imaginary parts of the noise DFT coefficients N are zero mean and i.i.d., they can be statistically represented by the complex-valued Gaussian mixture

    p(N) = \sum_{m=1}^{M} c_m\, \mathcal{N}_c(0, \sigma_m^2) = \sum_{m=1}^{M} \frac{c_m}{\pi\sigma_m^2}\, e^{-|N|^2/\sigma_m^2},   (7)

with c_m and σ_m² representing the GMM weight and the variance of the m-th of M complex-valued Gaussian modes, respectively. In a practical implementation, these parameters are commonly trained on noise data by means of the expectation-maximization (EM) algorithm. The conditional pdf of the noisy speech spectrum given the clean speech spectrum can now be formulated as (subscript GMM for the GMM noise model)

    p_{GMM}(Y|A,\alpha) = \sum_{m=1}^{M} \frac{c_m}{\pi\sigma_m^2}\, e^{-\frac{1}{\sigma_m^2}|Y - A e^{j\alpha}|^2}.   (8)
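The EM iteration for such a zero-mean circular complex GMM is particularly simple, since only the weights c_m and the mode variances σ_m² have to be estimated. A minimal sketch (our own implementation, assuming fixed iteration count; not the exact training used in the paper) could look as follows:

```python
import numpy as np

def em_complex_gmm(N, M=2, iters=100, seed=0):
    """Fit a zero-mean circular complex GMM, eq. (7), to noise DFT
    coefficients N (1-D complex array) via EM; returns (c, var)."""
    rng = np.random.default_rng(seed)
    p = np.abs(N) ** 2                            # |N|^2 per training sample
    c = np.full(M, 1.0 / M)                       # mixture weights c_m
    var = np.mean(p) * rng.uniform(0.5, 1.5, M)   # mode variances sigma_m^2
    for _ in range(iters):
        # E-step: responsibilities of each zero-mean complex Gaussian mode
        log_r = np.log(c) - np.log(np.pi * var) - p[:, None] / var
        log_r -= log_r.max(axis=1, keepdims=True)  # avoid underflow
        r = np.exp(log_r)
        r /= r.sum(axis=1, keepdims=True)
        # M-step: re-estimate weights and variances
        Nk = r.sum(axis=0)
        c = Nk / len(N)
        var = (r * p[:, None]).sum(axis=0) / Nk
    return c, var
```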
Assuming again p(A,α) with uniformly distributed α, taking the natural logarithm of J(A,α) in (2), and setting its partial derivative w.r.t. α to zero, we obtain the following expression:

    -2AR\, \frac{\sum_{m=1}^{M} \frac{c_m}{\sigma_m^4}\, e^{-\frac{1}{\sigma_m^2}[R^2+A^2-2AR\cos(\alpha-\Theta)]}}{\sum_{m=1}^{M} \frac{c_m}{\sigma_m^2}\, e^{-\frac{1}{\sigma_m^2}[R^2+A^2-2AR\cos(\alpha-\Theta)]}}\, \sin(\alpha-\Theta) \stackrel{!}{=} 0.   (9)

Independent of the choice of the speech model, the phase estimate turns out to be that of the noisy speech spectrum:

    \hat{\alpha} = \Theta.   (10)
Using this result, the magnitude estimate Â can subsequently be computed for any speech prior p(A,α) by solving the following equation:

    \frac{\partial \ln J_{GMM}}{\partial A}\Big|_{\alpha=\Theta} = \frac{\partial \ln p(A,\alpha)}{\partial A} + 2(R-A)\, \frac{\sum_{m=1}^{M} \frac{c_m}{\sigma_m^4}\, e^{-(R-A)^2/\sigma_m^2}}{\sum_{m=1}^{M} \frac{c_m}{\sigma_m^2}\, e^{-(R-A)^2/\sigma_m^2}} \stackrel{!}{=} 0.   (11)

Since this equation is not solvable in analytical form, a numerical optimization has to be performed. For this purpose, no differentiation of the cost function J_GMM (or its natural logarithm) w.r.t. A is necessary. Instead, the maximization can be applied directly to the cost function J_GMM:

    \hat{A} = \arg\max_A J_{GMM}\big|_{\alpha=\Theta} = \arg\max_A\; p(A,\alpha) \cdot \sum_{m=1}^{M} \frac{c_m}{\pi\sigma_m^2}\, e^{-(R-A)^2/\sigma_m^2}.   (12)
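For a single time-frequency bin, (12) can be solved by a simple grid search over A. The following sketch (grid resolution and search range are our own choices) uses the Gaussian speech prior (3), but any prior p(A,α) could be plugged in:

```python
import numpy as np

def amplitude_map_gmm(R, sigma2_S, c, var, n_grid=2000):
    """Numerically solve eq. (12): maximize J_GMM over the amplitude A
    for one noisy magnitude R, the Gaussian speech prior (3) with
    variance sigma2_S, and a noise GMM (weights c, mode variances var)."""
    c, var = np.asarray(c, float), np.asarray(var, float)
    A = np.linspace(1e-6, 2.0 * R, n_grid)                      # candidate amplitudes
    prior = A / (np.pi * sigma2_S) * np.exp(-A**2 / sigma2_S)   # p(A, alpha), eq. (3)
    lik = np.sum(c / (np.pi * var)
                 * np.exp(-((R - A)[:, None]) ** 2 / var), axis=1)  # p_GMM(Y|A, Theta)
    return A[np.argmax(prior * lik)]                            # A_hat
```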
Employing the Gaussian speech model, i.e., p(A,α) as in (3), and expressing the amplitude via the gain as A = G·R with R² = γσ_N² and σ_S² = ξσ_N², we obtain the following cost function:

    J_{GMM}\big|_{\alpha=\Theta} = \frac{1}{\pi^2\xi} \cdot \frac{\sqrt{\gamma}}{\sigma_N}\, G\, e^{-G^2\gamma/\xi} \cdot \sum_{m=1}^{M} \frac{c_m}{\sigma_m^2}\, e^{-(1-G)^2\gamma\,\frac{\sigma_N^2}{\sigma_m^2}},   (13)

with

    \sigma_N = \sigma_{GMM} = \sqrt{\sum_{m=1}^{M} c_m\, \sigma_m^2}   (14)

being the standard deviation of the noise (model). For a given GMM noise model, a higher noise variance σ_N² = σ_GMM² leads to proportionally higher Gaussian mode variances σ_m², such that σ_N²/σ_m² = const. (see (14)). Therefore, σ_N does not influence the position of the maximum of J_GMM. Hence, the resulting weighting rule G_GMM is also independent of the total variance σ_GMM² of the GMM.
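Owing to this scale invariance, the weighting rule can be tabulated once, offline, for a GMM normalized to σ_GMM² = 1. A minimal NumPy sketch of such a full search, assuming SNR grids on [−20, 20] dB and a gain grid on [0, 10] as used below (grid resolutions are our own choice):

```python
import numpy as np

def build_gain_table(c, var, xi_dB, gamma_dB, G=np.linspace(0.0, 10.0, 1001)):
    """Tabulate G_GMM(xi, gamma) by a full search over the gain G:
    evaluate the cost function J_GMM of eq. (13) on the gain grid and
    keep its maximizer. Factors independent of G are dropped."""
    c, var = np.asarray(c, float), np.asarray(var, float)
    var = var / np.sum(c * var)            # normalize so sigma_GMM^2 = 1, eq. (14)
    table = np.empty((len(xi_dB), len(gamma_dB)))
    for i, xi in enumerate(10.0 ** (np.asarray(xi_dB) / 10.0)):
        for j, gamma in enumerate(10.0 ** (np.asarray(gamma_dB) / 10.0)):
            J = (G * np.exp(-G**2 * gamma / xi)
                 * (np.exp(-np.outer((1.0 - G)**2 * gamma, 1.0 / var)) @ (c / var)))
            table[i, j] = G[np.argmax(J)]
    return table

# Example: GMM of Table 1, 1 dB grids from -20 to +20 dB
table = build_gain_table([0.71, 0.29], [0.28, 2.8],
                         np.arange(-20, 21), np.arange(-20, 21))
```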
Fig. 3. Sketch of the cost function JGMM = f (G) for different ξ at γ = 20dB. 2 2 2 m cm σm 100% ·cm σm /σGMM 1 0.71 0.28 20% 2 0.29 2.8 80% Table 1. Parameters of the noise GMM for Fig. 2 and Fig. 4 2 . σGMM In order to numerically solve the optimization problem GGMM = arg max JGMM with JGMM |α=Θ from (13), we restrict the ˆ G
α=Θ ˆ
computation of SNR pairs (ξ, γ) and gain values G to [-20,20] dB, and [0,10], respectively. Within these ranges, we computed the cost function JGMM (13). Next, we simply searched for the maximum of it to obtain the JMAP estimate GGMM for the Gaussian speech model. To illustrate the resulting weighting rule GGMM we assume now the noise pdf being (modeled by) a GMM with two Gaussians, as given in Table 1. Its parameters are derived from EM training of parts of the NTT Ambient Noise Database [6] (babble noise). The last column of Table 1 shows the variance contribution of each Gaussian mode to the total variance of the GMM. In Figure 2, the resulting weighting rule GGMM is plotted in the range (γ, ξ) = [−20, 20] dB. It is interesting to note that there is a discontinuity in the surface of GGMM . This special behavior is due to the maximum search property of the MAP estimator: The cost function JGMM (13) basically contains two exponential (mixture) functions, one for the speech, one for the noise, and therefore two concurrent peaks in JGMM , as plotted in Figure 3. Note, that the position of the maximum belonging to the noise model is always close to G = 1. In the left corner of Figure 2, the maximum of JGMM (and therefore the actual value of GGMM ) is dominated by the speech model. Going towards the right corner in Figure 2, this maximum peak belonging to the speech model becomes smaller and moves towards a larger G, until it is equally large as the noise component (see middle plot of Fig. 3). Going further towards the right (or to higher values of ξ), the maximum of JGMM jumps over to the dominating noise model. The plateau on the right side of the discontinuity is caused by the noise model. A black line in Figs. 1 and 2 indicates GGMM = 1. In the GGMM > 1 region, the proposed weighting rule behaves more conservative than the reference, as it can also be seen in Figure 4. Between G = 1 and the discontinuity, the proposed weighting rule again is more conservative, this time by not suppressing that much, which ensures a good speech preservation performance. Beyond the discontinuity, towards large γ’s (and small ξ’s), the additive noise dominates, therefore, lower weights are needed in
order to attenuate the disturbing noise. In this region, the proposed weighting rule behaves more aggressively than the reference, which is reflected by smaller weights. Therefore, a greater amount of noise reduction can be obtained.

[Fig. 4. Weighting rules G_G (dashed line) and G_GMM (solid line) for ξ = −5 dB and ξ = 5 dB of the JMAP estimator using a Gaussian speech model; G over γ ∈ [−20, 20] dB.]

4. EVALUATION

In order to precisely show the merit of the proposed JMAP estimator, the evaluation was performed with artificially generated noise signals. It is interesting to note that a noise GMM can well represent two different noise types: stationary non-Gaussian noise or non-stationary Gaussian noise.

As clean speech, we employed 96 speech signals (four male and four female speakers) taken from the NTT database [7] and downsampled to an 8 kHz sampling rate. After segmentation with a Hann window, a frame length of 256 samples, a frame shift of 128 samples, and the DFT, the noise was added in the frequency domain by superimposing the artificially generated noise spectrum on the speech spectrum, the noise being generated as follows (GMM with M = 2 and parameters as given in Table 1): For the stationary (white) non-Gaussian noise, we randomly chose a Gaussian mode m with probability c_m for each time-frequency bin (ℓ,k). Then a normally distributed pseudorandom complex number N(ℓ,k) was generated according to the resulting Gaussian mode m with variance σ_m². In order to generate non-stationary (white) Gaussian noise, all frequency bins belonging to a frame index ℓ were generated from the same Gaussian mode, either m = 1 or m = 2. This leads to a fluctuation of the noise spectral variance from frame to frame, while the pdf of this randomly generated noise spectrum is still the same Gaussian mixture as above. The non-stationary nature of the noise can be seen in the spectrogram plot in Figure 5 as vertical lines of high noise variance (m = 2), while the other frames contain only low-level noise (m = 1). Both generation schemes are sketched in the code example below.
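The two schemes differ only in whether the mode index is drawn per bin or per frame. A minimal sketch, assuming a frames-by-bins spectrum layout (array shapes and seeding are our own choices):

```python
import numpy as np

def gmm_noise_spectrum(n_frames, n_bins, c, var, stationary=True, seed=0):
    """Draw complex noise DFT coefficients N(l,k) from a noise GMM.
    Per-bin mode choice -> stationary (white) non-Gaussian noise;
    per-frame mode choice -> non-stationary (white) Gaussian noise."""
    rng = np.random.default_rng(seed)
    c, var = np.asarray(c, float), np.asarray(var, float)
    if stationary:
        m = rng.choice(len(c), size=(n_frames, n_bins), p=c)   # mode per bin (l,k)
    else:
        m = np.repeat(rng.choice(len(c), size=(n_frames, 1), p=c),
                      n_bins, axis=1)                          # mode per frame l
    s = np.sqrt(var[m] / 2.0)            # std of real and imaginary parts
    return rng.normal(scale=s) + 1j * rng.normal(scale=s)

# Example: non-stationary Gaussian noise from the GMM of Table 1
N = gmm_noise_spectrum(300, 129, [0.71, 0.29], [0.28, 2.8], stationary=False)
```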
[Fig. 5. Spectrogram of the noisy speech signal with artificially generated non-stationary white noise at SNR = 0 dB; frequency [Hz] over frame index.]

We evaluated the performance of the proposed weighting rule w.r.t. speech preservation and noise attenuation. Given a noisy speech signal y(n), we employed the respective clean speech component s̃(n) and noise component ñ(n) of the enhanced signal ŝ(n) = s̃(n) + ñ(n). Based on the clean speech signal s(n) and its processed replica s̃(n), the speech preservation performance was represented by the PESQ MOS score [8] and the segmental speech-to-speech-distortion ratio (SSDR) [9, 10]:

    SSDR_{seg} = \frac{1}{N_\Lambda} \sum_{\ell\in\Lambda} SSDR(\ell),   (15)

    SSDR(\ell) = \mathrm{li}\Big\{ 10\log_{10} \frac{\sum_{\tau=1}^{T} s^2(\tau+\ell T)}{\sum_{\tau=1}^{T} e^2(\tau+\ell T)} \Big\},   (16)

where e(n) = s(n) − s̃(n), Λ is the set of frames belonging to speech activity, N_Λ its cardinality, the operation li{·} limits SSDR(ℓ) to [−10, 30] dB, and T = 256 is the number of samples within a frame. Meanwhile, we assessed the noise attenuation performance by computing the segmental noise attenuation measure based on the noise signal n(n) and the processed noise component ñ(n) [9, 10]:

    NA_{seg} = 10\log_{10}\Big( \frac{1}{L} \sum_{\ell=1}^{L} \frac{\sum_{\tau=1}^{T} n^2(\tau+\ell T)}{\sum_{\tau=1}^{T} \tilde{n}^2(\tau+\ell T)} \Big),   (17)

where L is the number of frames. A compact reference implementation of both segmental measures is sketched below.
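The following sketch implements (15)-(17); note that the speech activity set Λ is approximated here by a crude frame-energy threshold, whereas the paper relies on a proper activity classification (threshold value is our own assumption):

```python
import numpy as np

def ssdr_seg(s, s_tilde, T=256, act_thresh=1e-6):
    """Segmental speech-to-speech-distortion ratio, eqs. (15)-(16):
    frame-wise SSDR limited to [-10, 30] dB, averaged over frames
    with speech activity (crude energy-threshold detection here)."""
    e = s - s_tilde
    vals = []
    for l in range(len(s) // T):
        ps = np.sum(s[l*T:(l+1)*T] ** 2)
        pe = np.sum(e[l*T:(l+1)*T] ** 2)
        if ps > act_thresh:                       # speech-active frames only
            vals.append(np.clip(10 * np.log10(ps / max(pe, 1e-12)), -10.0, 30.0))
    return np.mean(vals)

def na_seg(n, n_tilde, T=256):
    """Segmental noise attenuation, eq. (17)."""
    L = len(n) // T
    ratios = [np.sum(n[l*T:(l+1)*T] ** 2)
              / max(np.sum(n_tilde[l*T:(l+1)*T] ** 2), 1e-12)
              for l in range(L)]
    return 10 * np.log10(np.mean(ratios))
```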
The SNR improvement ΔSNR was measured as the difference between the output and the input SNR values. The input (output) SNR is determined as the ratio of the power of s(n) (s̃(n)) to that of n(n) (ñ(n)). All SNR measurements were performed using the active speech level measurement tool according to ITU-T Recommendation P.56 [11].

Since our proposed JMAP estimator had to be evaluated independently of known noise power estimators for stationary and non-stationary noise, we made the mean noise power σ_N² = σ_GMM² = f(ℓ,k) known both to the investigated reference algorithm and to our new estimator, which can be considered a challenging condition for showing further improvements by a new weighting rule. Subsequent to the a posteriori SNR estimation, we employed the decision-directed approach [1] to estimate the a priori SNR. The weighting rule was then computed by using the proposed JMAP estimator with the noise GMM described in Section 3, realized by a table lookup of G_GMM = f(γ, ξ) with both γ and ξ varying from −20 to +20 dB in 1 dB steps. As a reference, we also employed the JMAP estimator with the Gaussian noise model [2], see (6). A sketch of this per-frame processing chain is given after Table 2.

The results are summarized in Table 2. It can be seen that in both the stationary and the non-stationary case the proposed weighting rule with the GMM noise model achieves better results. On the one hand, it shows a better quality of the speech component, reflected by slightly higher PESQ MOS scores and SSDR values. On the other hand, in both cases a better noise suppression performance of the GMM-based weighting rule can be observed: both the NA and the ΔSNR measures show an improvement of about 1 dB compared to the reference Gaussian noise model. These results were confirmed by informal subjective listening tests, which also testified to the lower amount of musical noise of the proposed weighting rule. We can conclude that in all typical measures, in non-Gaussian stationary as well as in non-stationary Gaussian noise, the proposed JMAP estimator outperformed the reference one.

Table 2. Simulation results for a Gaussian speech model at SNR = 0 dB. Variance contribution of the noise Gaussian modes: 20% (c_1 = 0.71, σ_1² = 0.28) and 80% (c_2 = 0.29, σ_2² = 2.8)

Noise type    | Stationary               | Non-stationary
Noise model   | Gaussian [2] | GMM [new] | Gaussian [2] | GMM [new]
PESQ MOS      | 1.92         | 2.06      | 1.94         | 2.07
SSDRseg [dB]  | 3.74         | 3.98      | 3.71         | 4.00
ΔSNR [dB]     | 8.79         | 9.76      | 8.75         | 9.79
NAseg [dB]    | 11.15        | 12.35     | 10.86        | 11.96
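To make the evaluation setup concrete, here is a single-frame sketch of the described chain: known noise power, decision-directed a priori SNR [1] (the smoothing factor 0.98 is a typical value and our own assumption), 1 dB table lookup (e.g., a table produced by build_gain_table above), and spectral weighting:

```python
import numpy as np

def enhance_frame(Y, sigma2_N, A_prev, table, alpha=0.98):
    """One frame of the evaluation chain of Sec. 4: a posteriori SNR,
    decision-directed a priori SNR [1], lookup of G_GMM = f(gamma, xi)
    on the [-20, 20] dB grid in 1 dB steps, spectral weighting (5).
    Y: noisy DFT coefficients; sigma2_N: known noise power per bin;
    A_prev: enhanced amplitudes of the previous frame;
    table: G_GMM indexed as table[xi_index, gamma_index]."""
    gamma = np.abs(Y) ** 2 / sigma2_N
    xi = alpha * A_prev**2 / sigma2_N + (1.0 - alpha) * np.maximum(gamma - 1.0, 0.0)
    def to_idx(snr):  # quantize a linear SNR to the 1 dB table grid
        return np.clip(np.round(10.0 * np.log10(np.maximum(snr, 1e-12))),
                       -20, 20).astype(int) + 20
    G = table[to_idx(xi), to_idx(gamma)]
    S_hat = G * Y                        # eq. (5) with alpha_hat = Theta
    return S_hat, np.abs(S_hat)          # enhanced spectrum and A_hat for next frame
```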
5. CONCLUSIONS

In this paper, we presented a joint maximum a posteriori estimator assuming a non-Gaussian pdf of the noise DFT coefficients. It turns out that the proposed approach works advantageously not only for stationary non-Gaussian noise, but also for non-stationary Gaussian noise, even in the case where the mean noise variance is made perfectly known to the system.

6. REFERENCES

[1] Ephraim, Y.; Malah, D., "Speech Enhancement Using a Minimum-Mean Square Error Short-Time Spectral Amplitude Estimator," IEEE Trans. on Acoustics, Speech and Signal Processing, vol. 32, no. 6, pp. 1109–1121, Dec. 1984.
[2] Wolfe, P. J.; Godsill, S. J., "Efficient Alternatives to the Ephraim and Malah Suppression Rule for Audio Signal Enhancement," EURASIP Journal on Applied Signal Processing, vol. 2003, pp. 1043–1051, 2003.
[3] Hendriks, R. C.; Heusdens, R.; Kjems, U.; Jensen, J., "On Optimal Multichannel Mean-Squared Error Estimators for Speech Enhancement," IEEE Signal Processing Letters, vol. 16, no. 10, pp. 885–888, Oct. 2009.
[4] Lotter, T.; Vary, P., "Speech Enhancement by MAP Spectral Amplitude Estimation Using a Super-Gaussian Speech Model," EURASIP Journal on Applied Signal Processing, vol. 7, pp. 1110–1126, 2005.
[5] Potamitis, I.; Fakotakis, N.; Kokkinakis, G., "A Trainable Speech Enhancement Technique Based on Mixture Models for Speech and Noise," in EUROSPEECH 2003, Geneva, Switzerland, Sep. 2003.
[6] "Ambient Noise Database for Telephonometry," NTT Advanced Technology Corporation (NTT-AT), 1996.
[7] "Multi-Lingual Speech Database for Telephonometry," NTT Advanced Technology Corporation (NTT-AT), 1994.
[8] "Perceptual Evaluation of Speech Quality (PESQ)," ITU-T Rec. P.862, Feb. 2001.
[9] Fingscheidt, T.; Suhadi, S., "Quality Assessment of Speech Enhancement Systems by Separation of Enhanced Speech, Noise, and Echo," in INTERSPEECH 2007, Antwerp, Belgium, Aug. 2007, pp. 818–821.
[10] Fingscheidt, T.; Suhadi, S.; Stan, S., "Environment-Optimized Speech Enhancement," IEEE Transactions on Audio, Speech & Language Processing, vol. 16, no. 4, pp. 825–834, 2008.
[11] "Objective Measurement of Active Speech Level," ITU-T Rec. P.56, Mar. 1993.