2007 IEEE Workshop on Applications of Signal Processing to Audio and Acoustics
October 21-24, 2007, New Paltz, NY
OVERFITTING-RESISTANT SPEECH DEREVERBERATION

Takuya Yoshioka, Tomohiro Nakatani, Takafumi Hikichi, and Masato Miyoshi
NTT Communication Science Laboratories, NTT Corporation
2-4, Hikari-dai, Seika-cho, Soraku-gun, Kyoto 619-0237, Japan
[email protected]

ABSTRACT

This paper proposes a method that prevents the overfitting problem inherent in the joint source-channel estimation approach to speech dereverberation. The approach has several desirable attributes such as high estimation accuracy. However, the channel-related parameters estimated with the conventional implementation of this approach often overfit the source characteristics present in the observed signals. This overfitting results in unstable behavior of the dereverberation process. The problem stems from the fact that the conventional implementation employs a point estimation scheme to obtain the parameters describing the source characteristics. The proposed method marginalizes the source parameters to mitigate the overfitting problem. Two kinds of experimental results are reported: one was obtained in a single-source situation and shows the ability of the method to prevent overfitting; the other was obtained in a multi-source scenario and indicates the applicability of the proposed method to multi-source situations.

1. INTRODUCTION

Speech signals captured by microphones in a room inevitably contain reverberant components. These reverberant components often degrade the quality of the speech. Thus, many researchers have studied speech dereverberation, which is the process of estimating clean speech signals from observed reverberant signals.

The fundamental problem in speech dereverberation is source-channel identifiability. Reverberant signals are modeled as a source signal filtered with channel impulse responses. Dereverberation is usually achieved by estimating the channel impulse responses or their inverse filters given the reverberant signals. The difficulty of channel estimation lies in the fact that the reverberant signal is invariant under an operation where an arbitrary filter is applied to the channel impulse responses and its inverse filter is applied to the source signal. Therefore, it is essential to exploit some characteristics inherent in the source or the channels.

Several approaches have been proposed for tackling the source-channel identifiability problem. They include the subspace approach [1], the LIME algorithm [2], observation prewhitening [3], and joint source-channel estimation [4]. The joint source-channel estimation approach introduces parameters describing the source characteristics in addition to parameters related to the channel characteristics such as the channel impulse responses. Both the source and channel parameters are estimated from the given reverberant signals. It has been proven that this approach can find the channel parameters under the condition that the source parameters vary in time. This approach appears more promising than the others because it has several desirable attributes such as high estimation accuracy and robustness to channel order misspecification.
Although it is promising, there is still a drawback with the conventional joint source-channel estimation: the estimated channel parameters are often unstable. To be more precise, the norm of the estimates becomes quite large. Such unstable estimates have harmful effects on the output speech quality, especially in online processing or real-recording applications where the channel impulse responses may constantly fluctuate [5]. The reason for the instability is that the channel estimates overfit the source characteristics present in the observed signals. The overfitting results from the fact that this approach obtains the source parameters by point estimation.

To overcome this drawback, we propose estimating the source parameters in the form of a probability distribution rather than obtaining their point estimates. The overfitting problem is mitigated by averaging the criterion for the goodness of the channel parameters over the source probability distribution. This is realized by formulating the dereverberation task as maximum likelihood (ML) estimation of the channel parameters, where the source parameters are treated as random variables and marginalized. The effectiveness of the proposed method is demonstrated by experimental results. Experimental results for a multi-source scenario are also reported, indicating the potential applicability of the proposed method to multi-source situations.

2. TASK FORMULATION

2.1. Speech dereverberation task

The speech dereverberation task in this paper is formulated as follows. Let a source signal be represented by s(n), and the L-tap impulse responses of the channels from the source to M microphones by {h_{m,k}}_{1≤m≤M, 0≤k≤L−1}. We assume M ≥ 2. The reverberant signal observed by the m-th microphone is written as

x_m(n) = \sum_{k=0}^{L-1} h_{m,k} s(n-k),   1 \le m \le M.   (1)

The goal of speech dereverberation is to estimate the source signal s(n) after observing MN samples of the reverberant signals {x_m(n)}_{1≤m≤M, 1≤n≤N}. As is known, the observed signals can be rewritten as multivariate autoregressive (AR) processes (see [6] for details). Let us assume that the first microphone is the closest to the source and, without loss of generality, h_{1,0} = 1. Then, there exist regression coefficients G = {g_{m,k}}_{1≤m≤M, 1≤k≤K} such that

x_1(n) = \sum_{m=1}^{M} \sum_{k=1}^{K} g_{m,k} x_m(n-k) + s(n).   (2)

The source signal s(n) can be obtained by using G based on (2). Therefore, the speech dereverberation task is equivalent to estimating the regression coefficients G.
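To make the role of G concrete, the following Python sketch applies (2) to recover the source signal once estimates of the regression coefficients are available; the function name and array layout are illustrative choices, not part of the paper.

```python
import numpy as np

def dereverberate(x, g):
    """Recover s(n) via Eq. (2): s(n) = x_1(n) - sum_{m,k} g_{m,k} x_m(n-k).

    x : (M, N) array of reverberant signals; x[0] is the reference microphone.
    g : (M, K) array of channel regression coefficients; g[m, k-1] plays the role of g_{m+1,k}.
    """
    M, _ = x.shape
    _, K = g.shape
    s = x[0].copy()
    for m in range(M):
        for k in range(1, K + 1):
            # subtract the k-sample-delayed m-th channel weighted by g_{m,k}
            s[k:] -= g[m, k - 1] * x[m, :-k]
    return s
```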
2.2. ML approach rationale

In this paper, we approach the above task with ML estimation. With the ML approach, the reverberant signals are treated as random variables. A parametric model of these variables is introduced, in which the regression coefficients G are included in the model parameters. After observing the samples of the reverberant signals, the values of G that maximize the log likelihood are found. In order to formulate the speech dereverberation task as an ML estimation task, we introduce a frame-wise AR model of order P for the source signal s(n):

s(n) = \sum_{k=1}^{P} a_{t(n),k} s(n-k) + e(n),   (3)

where t(n) is the index of the frame to which the n-th sample belongs, A_t = {a_{t,k}}_{1≤k≤P} are the regression coefficients of the t-th frame, and e(n) is an innovations process. The innovations process e(n) is Gaussian white noise with variance σ_{t(n)}^2. In this paper, G, A = {A_t}_{1≤t≤T}, and Σ = {σ_t^2}_{1≤t≤T}, where T denotes the number of frames, are called the channel regression coefficients, the source regression coefficients, and the innovation variances, respectively. By using (2), (3), and the Gaussian white noise assumption on e(n), we obtain the conditional probability density function (pdf) of x_1(n) as

p(x_1(n) | X_1(n-1), G, Σ, A, X_{2:M}) = (2\pi)^{-1/2} (σ_{t(n)}^2)^{-1/2} \exp\left( -\frac{e(n)^2}{2 σ_{t(n)}^2} \right),   (4)

where X_1(n-1) and X_{2:M} are respectively defined as

X_1(n-1) = {x_1(k)}_{1≤k≤n−1},   (5)
X_{2:M} = {x_m(n)}_{2≤m≤M, 1≤n≤N}.   (6)
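To make the source model concrete, the following sketch generates a signal according to (3), with the regression coefficients and the innovation variance switching from frame to frame; the function name, the frame layout, and the random seed are illustrative choices, not part of the paper.

```python
import numpy as np

def simulate_source(a, sigma2, frame_len, seed=0):
    """Generate s(n) from the frame-wise AR model of Eq. (3).

    a         : (T, P) regression coefficients A_t, one row per frame.
    sigma2    : (T,)  innovation variances sigma_t^2.
    frame_len : number of samples per frame.
    """
    rng = np.random.default_rng(seed)
    T, P = a.shape
    s = np.zeros(T * frame_len)
    for n in range(T * frame_len):
        t = n // frame_len                           # frame index t(n)
        e = rng.normal(scale=np.sqrt(sigma2[t]))     # Gaussian innovation e(n)
        past = s[max(n - P, 0):n][::-1]              # s(n-1), ..., s(n-P)
        s[n] = a[t, :len(past)] @ past + e           # Eq. (3)
    return s
```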
We hereafter consider X_1 = {x_1(n)}_{1≤n≤N} as random variables and X_{2:M} as deterministic values for mathematical tractability. The log likelihood of the channel regression coefficients G given X = {X_1, X_{2:M}} is derived as follows. The log likelihood is defined by marginalizing the nuisance parameters A and Σ:

l(G | X) = \log \iint p(X_1, A, Σ | G, X_{2:M}) \, dA \, dΣ.   (7)

The pdf p(X_1, A, Σ | G, X_{2:M}) is factorized as

p(X_1, A, Σ | G, X_{2:M}) = p(A, Σ) \prod_{n=1}^{N} p(x_1(n) | X_1(n-1), G, A, Σ, X_{2:M}).   (8)

Note that we have assumed here that A and Σ are independent of G and X_{2:M}. The second pdf on the right-hand side of (8) was already defined in (4). As regards the first pdf, p(A, Σ), we assume the Jeffreys prior:

p(A, Σ) = \prod_{t=1}^{T} p(A_t, σ_t^2) \propto \prod_{t=1}^{T} (σ_t^2)^{-1}.   (9)

Now, the ML-based speech dereverberation task is formulated as finding the G that maximizes the log likelihood (7) given X.
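For intuition, the product in (8) can be evaluated numerically for fixed parameter values. The sketch below computes the sum over n of the log of (4), using the innovations defined by (3), given a source sequence obtained via (2) (for example with the dereverberate() sketch in Section 2.1); all names are illustrative. Maximizing (7) additionally requires integrating such terms over A and Σ, which is the subject of Section 3.

```python
import numpy as np

def complete_loglik(s, a, sigma2, frame_of):
    """Sum of the log of Eq. (4) over n, for fixed A and Sigma.

    s        : (N,) source sequence computed from Eq. (2) for a given G.
    a        : (T, P) frame-wise source regression coefficients A_t.
    sigma2   : (T,) innovation variances sigma_t^2.
    frame_of : (N,) frame index t(n) of each sample.
    """
    P = a.shape[1]
    loglik = 0.0
    for n in range(P, len(s)):                  # the first P samples are skipped for simplicity
        t = frame_of[n]
        e = s[n] - a[t] @ s[n - P:n][::-1]      # innovation e(n) from Eq. (3)
        loglik += -0.5 * np.log(2 * np.pi * sigma2[t]) - e**2 / (2 * sigma2[t])
    return loglik
```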
3. DERIVATION OF ML CHANNEL ESTIMATION

3.1. EM algorithm

To maximize the log likelihood (7), we employ the expectation-maximization (EM) algorithm, because (7) involves marginalization of the nuisance parameters. The EM algorithm iteratively maximizes the expected complete log likelihood, i.e., the auxiliary function, as

M-step:  \hat{G}^{(i+1)} = \arg\max_G Q(G | \hat{G}^{(i)}),   (10)
E-step:  Q(G | \hat{G}^{(i)}) = \langle \log p(X_1, A, Σ | G, X_{2:M}) \rangle_{p(A, Σ | X, \hat{G}^{(i)})},   (11)

where i is the iteration index and \langle \cdot \rangle_p denotes the expectation with respect to the distribution p. In the conventional joint source-channel estimation [4], on the other hand, the update equation set is

\hat{G}^{(i+1)} = \arg\max_G \log p(X_1, \hat{A}^{(i)}, \hat{Σ}^{(i)} | G, X_{2:M}),   (12)
\{\hat{A}^{(i)}, \hat{Σ}^{(i)}\} = \arg\max_{A, Σ} p(A, Σ | X, \hat{G}^{(i)}).   (13)
Comparing the EM update equation set (10) and (11) with its counterpart (12) and (13), we find that the complete log likelihood \log p(X_1, A, Σ | G, X_{2:M}) is averaged over the source parameters A and Σ with respect to their conditional posterior distribution p(A, Σ | X, \hat{G}^{(i)}), instead of plugging in \hat{A}^{(i)} and \hat{Σ}^{(i)}. Thus, the overfitting problem with the conventional joint source-channel estimation will be mitigated by the ML channel estimation. Below we provide sketches of the derivation of the conditional posterior distribution p(A, Σ | X, \hat{G}^{(i)}) and of the maximization of the auxiliary function Q(G | \hat{G}^{(i)}). The details are omitted because of limited space.

3.2. Conditional posterior distribution

The conditional posterior distribution p(A, Σ | X, \hat{G}^{(i)}) is derived as follows. Since X and \hat{G}^{(i)} are given, we have the source estimates \hat{S}^{(i)} = {\hat{s}^{(i)}(n)}_{1≤n≤N} by using (2) as

\hat{s}^{(i)}(n) = x_1(n) - \sum_{m=1}^{M} \sum_{k=1}^{K} \hat{g}_{m,k}^{(i)} x_m(n-k).   (14)

Thus, we have

p(A, Σ | X, \hat{G}^{(i)}) = p(A, Σ | \hat{S}^{(i)}) \approx \prod_{t=1}^{T} p(A_t, σ_t^2 | \hat{S}_t^{(i)}),   (15)

where \hat{S}_t^{(i)} is composed of the samples of \hat{S}^{(i)} belonging to the t-th frame. In (15), we ignored correlations among samples around the frame boundaries. When using the Jeffreys prior (9), the posterior distribution for the t-th frame can be represented as

p(A_t, σ_t^2 | \hat{S}_t^{(i)}) = N\{ a_t | R_t^{-1} r_t, (σ_t^2 / U) R_t^{-1} \} \, χ^{-2}\{ σ_t^2 | U - P, \, U (r_t(0) - r_t^T R_t^{-1} r_t) \}
                              = N\{ a_t | μ_t^{(i)}, Ξ_t^{(i)} \} \, χ^{-2}\{ σ_t^2 | ρ_t^{(i)}, λ_t^{(i)} \},   (16)

where U is the frame size, P is the source regression order, N(a | μ, Ξ) denotes the Gaussian pdf of a with mean μ and covariance matrix Ξ, and χ^{-2}(ξ^2 | ρ, λ) denotes the scaled inverse chi-square distribution of ξ^2 with degrees of freedom ρ and scale parameter λ. Also, by letting N_t denote the first sample index of the t-th frame, a_t, R_t, r_t, and r_t(k) are respectively defined as follows:
a_t = [a_{t,1}, \cdots, a_{t,P}]^T,   (17)
R_t = [r_t(i-j)]_{1≤i,j≤P}   (P-by-P matrix),   (18)
r_t = [r_t(1), \cdots, r_t(P)]^T,   (19)
r_t(k) = \frac{1}{U} \sum_{n=N_t}^{N_t+U-1} \hat{s}^{(i)}(n) \hat{s}^{(i)}(n-k).   (20)
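As an illustration of the E-step quantities in (16)-(20), the sketch below computes, for a single frame of source estimates, the autocorrelation statistics and the posterior parameters appearing in (16); it also returns the posterior mean of 1/σ_t^2, which under the Jeffreys prior works out to (U − P) divided by the residual term U(r_t(0) − r_t^T R_t^{-1} r_t), and which the M-step can use as a frame weight. The function name and the input layout are illustrative.

```python
import numpy as np

def frame_posterior_stats(s_frame, P):
    """Posterior parameters of Eq. (16) for one frame of source estimates.

    s_frame : (U + P,) samples; the last U samples belong to the frame,
              the first P samples provide the required past context.
    P       : source regression order.
    """
    U = len(s_frame) - P
    # r_t(k) = (1/U) * sum_n s(n) s(n-k), k = 0, ..., P   (cf. Eq. (20))
    r = np.array([s_frame[P:] @ s_frame[P - k:len(s_frame) - k]
                  for k in range(P + 1)]) / U
    R = np.array([[r[abs(i - j)] for j in range(P)] for i in range(P)])  # R_t, Eq. (18)
    mu = np.linalg.solve(R, r[1:])            # posterior mean R_t^{-1} r_t of a_t
    cov_over_sigma2 = np.linalg.inv(R) / U    # Xi_t without the sigma_t^2 factor
    rss = U * (r[0] - r[1:] @ mu)             # U (r_t(0) - r_t^T R_t^{-1} r_t)
    e_inv_sigma2 = (U - P) / rss              # posterior mean of 1/sigma_t^2 (Jeffreys prior)
    return mu, cov_over_sigma2, e_inv_sigma2
```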
3.3. Maximization of the auxiliary function

Substituting (8) and (4) into (11), the auxiliary function is expanded as

Q(G | \hat{G}^{(i)}) = \langle \log p(A, Σ) \rangle - \frac{1}{2} \sum_{t=1}^{T} U \left( \log 2\pi + \langle \log σ_t^2 \rangle \right) - \sum_{t=1}^{T} \sum_{n=N_t}^{N_t+U-1} \left\langle \frac{e(n)^2}{2 σ_t^2} \right\rangle,   (21)

where \langle \cdot \rangle denotes the expectation with respect to p(A, Σ | X, \hat{G}^{(i)}).
In (21), only the third term is a function of G. Thus, the derivative of the expected complete log likelihood, i.e., the auxiliary function, with respect to g_{m,k} is

\frac{\partial Q(G | \hat{G}^{(i)})}{\partial g_{m,k}} = - \sum_{t=1}^{T} \sum_{n=N_t}^{N_t+U-1} \int p(σ_t^2 | X, \hat{G}^{(i)}) \frac{1}{σ_t^2} \int p(A_t | σ_t^2, X, \hat{G}^{(i)}) \, \frac{1}{2} \frac{\partial e(n)^2}{\partial g_{m,k}} \, dA_t \, dσ_t^2.

Since e(n) is linear in G, setting this derivative to zero for all m and k yields a set of linear equations in the channel regression coefficients, whose coefficients are given by the posterior moments of A_t and σ_t^2; the updated G is obtained by solving these equations, where V = [V_{i,j}]_{1≤i,j≤M} denotes the corresponding M-by-M block coefficient matrix.
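Because e(n) in (2)-(3) is linear in G for fixed A_t, the zero-derivative condition amounts to weighted normal equations. The sketch below solves them under the simplifying assumption that the expectation over A_t is replaced by its posterior mean μ_t^{(i)}, with ⟨1/σ_t^2⟩ supplied as a per-frame weight (for example from the frame_posterior_stats() sketch above); the exact update would also involve the posterior covariance Ξ_t^{(i)}, which is omitted here for brevity. All names are illustrative.

```python
import numpy as np

def update_g(x, mu, w, frame_of, K):
    """Simplified M-step sketch: weighted least squares for the coefficients G.

    x        : (M, N) reverberant signals; x[0] is the reference microphone.
    mu       : (T, P) posterior means of the source regression coefficients A_t.
    w        : (T,)  per-frame weights, e.g. posterior means of 1/sigma_t^2.
    frame_of : (N,)  frame index t(n) of each sample.
    K        : channel regression order.
    """
    M, N = x.shape
    P = mu.shape[1]
    D = M * K
    V = np.zeros((D, D))                        # weighted normal-equation matrix
    u = np.zeros(D)                             # weighted right-hand side
    for n in range(P + K, N):
        a = mu[frame_of[n]]
        # AR-whitened target: x_1(n) - sum_j a_j x_1(n-j)
        y = x[0, n] - a @ x[0, n - P:n][::-1]
        # AR-whitened regressors for every (m, k) pair
        phi = np.empty(D)
        for m in range(M):
            for k in range(1, K + 1):
                phi[m * K + k - 1] = x[m, n - k] - a @ x[m, n - k - P:n - k][::-1]
        V += w[frame_of[n]] * np.outer(phi, phi)
        u += w[frame_of[n]] * y * phi
    return np.linalg.solve(V, u).reshape(M, K)  # g_{m,k} stored at [m-1, k-1]
```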