Simultaneous Diagonalization in the Frequency Domain (SDIF) for Source Separation Hsiao-Chun Wu and Jose C. Principe Computational Neuro-Engineering Laboratory Department of Electrical and Computer Engineering University of Florida, Gainesville, FL 32611 E-mail:
[email protected] and
[email protected]
ABSTRACT
Reverberant signals recorded by multiple microphones can be described as sums of the sources convolved with different channel impulse responses. Blind source separation of this unknown linear system can be transformed into a set of instantaneous-mixture problems, one for every frequency band. In each frequency band, we may use simultaneous diagonalization algorithms to separate the sources. In addition to our previous simultaneous diagonalization approach, which minimizes the Frobenius norm, we now propose another set of efficient simultaneous diagonalization algorithms based on Hadamard's inequality that make source separation feasible in the frequency domain.
1. INTRODUCTION
Blind source separation arises in many fields such as noise reduction, radar and sonar processing, speech enhancement, separation of rotating-machine noises, and localization in array processing. The basic formulation of this problem is to recover the signals emitted by N sources s(t) from M observations x(t), which are linear mixtures of these sources. The only assumption is the statistical independence of the sources. Many solutions under a simplified instantaneous model have been proposed that test the measurements for independence based on higher-order cumulants, the probability density function, or the temporal (nonstationary) structure of the sources [1-6]. However, it is still difficult to solve the more realistic model of convolutive mixtures. Previous algorithms rely on the restrictive assumption of independent, identically distributed (i.i.d.) or symmetric-PDF processes [7, 8], and such PDF-model-based methods suffer from density mismatch when the assumed model does not fit the actual sources. Hence, a less restrictive criterion that does not require the complete source statistical characteristics was recently proposed: the simultaneous diagonalization of local covariances, either in the frequency or in the time domain. Source separation in the frequency domain has become a new direction in this area [9, 10, 11]. Simultaneous diagonalization in each frequency band leads to promising results for speech separation, since the nonstationary structure creates sufficient constraints even with simple temporal second-order models [11]. The Frobenius norm has been utilized successfully as a measure of simultaneous diagonalization for a BSS method in the time domain [12]. It has also been extended to the frequency domain to separate the source components of the mixture in every frequency band [11]. However, three problems need further investigation in the frequency decomposition: the effect of scaling in the filter structure, the effect of permutation, and the computational complexity of the Frobenius norm. In this paper, we derive another set of simultaneous diagonalization algorithms based on Hadamard's inequality to avoid the high computational complexity of Frobenius-based simultaneous diagonalization [11] in every frequency band.
2. PROBLEM STATEMENT
Assume N independent sources s(t) = [s_1(t), ..., s_N(t)]^T. The observations are

x_i(t) = \sum_{j=1}^{N} \sum_{\tau=0}^{\infty} h_{ij}(\tau)\, s_j(t - \tau)    (1)
where hij(t) is the unknown FIR filter shown in Figure 1.
We observe the mixture x(t) = [x_1(t), ..., x_N(t)]^T. To demix the observations, we need to search for a multi-dimensional FIR filter. The sources s(t) can at best be reconstructed up to a permutation and scaling. We will assume all the signals to be zero mean. Equivalently, Equation (1) can be rewritten in matrix product form as

X(t) = \sum_{\tau=0}^{L} H(\tau)\, S(t - \tau)    (2)

where the H(τ)'s are N × N finite-support filter coefficient matrices of length L (one for each delay τ), S(t−τ) is the delayed source vector [s_j(t−τ)], 1 ≤ j ≤ N, and X(t) = [x_j(t)], 1 ≤ j ≤ N, contains the observations (the mixed signals). The goal of the source separation is simply to estimate the separating filter matrices W(τ), with maximum length M, where

Y(t) = \sum_{\tau=0}^{M} W(\tau)\, X(t - \tau)    (3)

in order that the convolutive product becomes

\sum_{\tau=0}^{L+M+1} W(\tau)\, H(t - \tau) = P\, \mathrm{diag}\big(g_1(t), \ldots, g_N(t)\big)    (4)
where P is a permutation matrix. The procedure is illustrated in Figure 1.

[Figure 1 shows the 2 × 2 mixing channel h_11(t), h_12(t), h_21(t), h_22(t) acting on the sources s_1(t), s_2(t) to produce the observations x_1(t), x_2(t), followed by the separating network w_11(t), w_12(t), w_21(t), w_22(t) that produces the outputs y_1(t), y_2(t).]

Figure 1. An example of a two-source, two-sensor separation problem.
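To make the mixing model of Equations (1)-(2) and Figure 1 concrete, here is a minimal NumPy sketch (with hypothetical source signals and illustrative cross-channel filter taps, not the paper's code) that generates a two-by-two convolutive mixture.

```python
import numpy as np

def convolutive_mix(S, H):
    """Mix N sources through an N x N bank of FIR filters (Equation (1)).

    S : (N, T) array of source signals s_j(t)
    H : (N, N, L) array of FIR coefficients h_ij(tau)
    Returns X : (N, T) array of observations x_i(t).
    """
    N, T = S.shape
    X = np.zeros((N, T))
    for i in range(N):
        for j in range(N):
            # x_i(t) += sum_tau h_ij(tau) s_j(t - tau), truncated to length T
            X[i] += np.convolve(H[i, j], S[j])[:T]
    return X

# Hypothetical example: two sources, direct paths = 1, short cross filters
rng = np.random.default_rng(0)
S = rng.standard_normal((2, 10000))
H = np.zeros((2, 2, 10))
H[0, 0, 0] = H[1, 1, 0] = 1.0            # H11(z) = H22(z) = 1
H[0, 1, :3] = [-0.6, 0.5, 0.43]          # first taps of a cross path
H[1, 0, :3] = [0.9, 0.8, -0.77]
X = convolutive_mix(S, H)
```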
3. SOURCE SEPARATION BY FREQUENCY DECOMPOSITION
The source separation methods in the time domain impose severe restrictions on the filter length [15], and almost all adaptive separation methods take a long time to train if the separating filters are long. On the other hand, we can rewrite Equation (3) in the frequency domain as

Y(\omega, t) = W(\omega)\, X(\omega, t)    (5)

with

X(\omega, t) = \sum_{t'=t}^{t+T} X(t') \exp(-j\omega t'),    (6)

where ω = 2πk/T, k = 0, 1, ..., T (T is the window size of the DFT). We can also utilize a DFT (discrete Fourier transform) similar to Equation (6) to obtain W(ω) and Y(ω, t). Since we assume time-invariant mixing and separating filters, W(ω) does not depend on the time index t. The blind source separation problem has thus been transformed into estimating a series of independent separating matrices W(ω), each of the smaller size N × N. Hence we may process the individual frequency bands in parallel. After each separating matrix estimate is obtained, we collect them and recover the output. The frequency-decomposition method may converge faster [9-11] than the time-domain methods [12, 15] when M » N.
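As a concrete reading of Equation (6), the following sketch computes the windowed DFT X(ω, t) for all sensors; the frame hop and the convention of indexing frames by t are illustrative assumptions, not values prescribed by the paper.

```python
import numpy as np

def windowed_dft(X, T_win, hop):
    """Compute X(omega, t) as in Equation (6) over sliding windows.

    X     : (N, T) array of time-domain observations
    T_win : DFT window size T
    hop   : shift between consecutive windows (illustrative choice)
    Returns Xf : (num_frames, T_win, N) complex array, Xf[t, k] = X(omega_k, t).
    """
    N, T = X.shape
    starts = range(0, T - T_win + 1, hop)
    frames = np.stack([X[:, s:s + T_win] for s in starts])  # (frames, N, T_win)
    # DFT along the time axis of each window: sum_t' X(t') exp(-j omega t')
    Xf = np.fft.fft(frames, axis=-1)
    return np.transpose(Xf, (0, 2, 1))                      # (frames, T_win, N)
```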
4. SCALING AND PERMUTATION IN FREQUENCY BANDS
Even if the separating matrix W(ω) in each frequency band is precisely estimated, such that the product W(ω)H(ω) is a diagonal or a permuted diagonal matrix (where H(ω) is the DFT of the mixing matrix H(t)), there are still two indeterminacies, namely the possible permutation and scaling of the separating matrices in some frequency bands. Both indeterminacies will deteriorate the overall separation performance. The following discussion analyzes these two factors and proposes constraints to reduce the distortion they create.

4.1 The Scaling Problem In Each Frequency Band
For a row vector w_m(ω) of W(ω) at frequency band ω = 2πk/T, if we scale this vector by some arbitrary unknown complex constant a_m(k), the result is a filtered version of the mth output given by

y_m^{filtered}(t) = y_m^{original}(t) \otimes f_{m,k}(t)    (7)
where f_{m,k}(t) is the inverse DFT associated with this scaling factor:

f_{m,k}(t) = \frac{1}{2\pi}\, a_m(k) \exp\!\Big(\frac{j2\pi k t}{T}\Big) + \frac{1}{2\pi} \sum_{p=0,\, p \ne k}^{T-1} \exp\!\Big(\frac{j2\pi p t}{T}\Big)    (8)

(Note that if a_m(k) = 1, f_{m,k}(t) is an impulse and the mth output does not change.) If we denote by f_{m,t}(t) the total resulting effect on the mth output due to the scaling factors a_m(k) in all the frequency bands, we can write f_{m,t}(t) as

f_{m,t}(t) = \sum_{k=0}^{T-1} a_m(k) \exp\!\Big(\frac{j2\pi k t}{T}\Big).    (9)
Equation (9) shows that the demixed output can have many alternate versions, filtered by f_{m,t}(t), as follows (expressing the result in the Z-domain):

Y'(z) = \mathrm{diag}\big(F_{1,t}(z), \ldots, F_{m,t}(z), \ldots, F_{N,t}(z)\big)\, Y(z)    (10)

where Y(z) is the Z-transform of Y(t) and F_{m,t}(z) is the Z-transform of f_{m,t}(t). Hence we need to constrain the diagonal elements of W(z) such that

F_{m,t}(z)\, w_{mm}(z) = 1, \quad \forall\, 1 \le m \le N.    (11)
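One simple way to realize the constraint of Equation (11) is to pin the diagonal of the separating matrix, i.e. force w_mm(ω) = 1 in every band so that the arbitrary scale a_m(k) disappears. The sketch below is an assumption about how such a constraint could be enforced, not the paper's exact procedure.

```python
import numpy as np

def fix_scaling(W):
    """Normalize each row of W(omega) so that its diagonal element is 1.

    W : (T_win, N, N) complex array, one separating matrix per frequency band.
    """
    diag = np.einsum('fii->fi', W)     # w_mm(omega) for every band
    return W / diag[:, :, None]        # divide row m of each band by w_mm(omega)
```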
4.2 The Permutation Problem In Each Frequency Band
If the constraints on the scaling of the separating matrices in all the frequency bands in Equation (11) are utilized, we have actual perfect separating matrices satisfying the condition that W(z)H(z) is diagonal. The problem is that permutations may occur during learning. If that is the case, we are normalizing the rows of the separating matrix by the wrong weight factor. For instance, an interchange between the ith row and the jth row will leave the ith and jth rows of the separating filter matrix in the form w_i(z) = [w_{j1}(z), w_{j2}(z), ..., 1 (the jth element), ..., w_{jN}(z)] and w_j(z) = [w_{i1}(z), w_{i2}(z), ..., 1 (the ith element), ..., w_{iN}(z)]. Effectively these filters will become IIR filters, or at least FIRs with very long tails, and this explains why, in practice, when the separation is not successful the separating filter matrix is composed of FIRs with many nonzero weight coefficients. To avoid the permutation problem we should constrain the length of the separating filters. This was proposed by [11] but never explained. In addition, the truncation problem of the windowed DFT in Equation (6) will be severe if the filter is IIR, since the equivalence between linear convolution and frequency-domain multiplication only holds if the filters are much shorter than the window size T. For these two reasons, we have to constrain the separating filters to a short-support FIR structure. However, the selection of the filter order is difficult, since it is a compromise between the necessity to constrain the solution to avoid the permutation and the required fitting of the number of degrees of freedom of the unknown mixture, compounded by the size of the DFT.

5. SIMULTANEOUS DIAGONALIZATION IN THE FREQUENCY DOMAIN
If we construct two covariance matrices as follows:

\Sigma(\omega, t_1) = \frac{1}{T_1} \sum_{t_1=1}^{T_1-1} X(\omega, t_1)\, X^H(\omega, t_1), \qquad \Sigma(\omega, t_2) = \frac{1}{T_2 - T_1} \sum_{t_2=T_1+1}^{T_2} X(\omega, t_2)\, X^H(\omega, t_2)    (12)
where the superscript H denotes the Hermitian operator and X(ω, t) is a zero-mean vector, then it is obvious that Σ(ω, t_1) and Σ(ω, t_2) are positive definite, and the separating matrix in each frequency band corresponds to the unique generalized eigendecomposition of the pencil Σ(ω, t_1) + λΣ(ω, t_2). If we whiten Σ(ω, t_1) by V(ω), where V(ω)Σ(ω, t_1)V^H(ω) = I and V(ω) = Λ^{-1/2}(ω) U^H(ω) (Λ(ω) and U(ω) are the eigenvalue and eigenvector matrices of Σ(ω, t_1)), then we may use the eigendecomposition to compute the separating matrix W(ω) by

W(\omega) = Q^H(\omega)\, V(\omega)    (13)

where Q(ω) is the eigenvector matrix of the transformed covariance Σ_V(ω) = V(ω)Σ(ω, t_2)V^H(ω). This criterion has been used by [5] for instantaneous mixtures, but it can also be extended to convolutive mixtures. The problem is that it is too time consuming and, since it is an analytic procedure, it cannot compensate for the permutation and scaling problems discussed above.
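For a single frequency band, the straight simultaneous diagonalization of Equations (12)-(13) admits the closed-form sketch below; the frame array Xk, the segment boundary T1 and the use of numpy.linalg.eigh are illustrative assumptions.

```python
import numpy as np

def ssd_one_band(Xk, T1):
    """Straight simultaneous diagonalization (Equations (12)-(13)) for one band.

    Xk : (num_frames, N) complex array of X(omega, t) for a fixed omega
    T1 : frame index splitting the two averaging segments
    Returns W : (N, N) separating matrix for this band.
    """
    seg1, seg2 = Xk[:T1], Xk[T1:]
    Sigma1 = seg1.T @ seg1.conj() / len(seg1)   # Sigma(omega, t1)
    Sigma2 = seg2.T @ seg2.conj() / len(seg2)   # Sigma(omega, t2)
    # Whitening: V = Lambda^{-1/2} U^H with Sigma1 = U Lambda U^H
    lam, U = np.linalg.eigh(Sigma1)
    V = np.diag(lam ** -0.5) @ U.conj().T
    # Eigenvectors Q of the whitened second covariance Sigma_V = V Sigma2 V^H
    SigmaV = V @ Sigma2 @ V.conj().T
    _, Q = np.linalg.eigh(SigmaV)
    return Q.conj().T @ V                       # W(omega) = Q^H V, Equation (13)
```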
6. SIMULTANEOUS DIAGONALIZATION CRITERION
Since the covariance matrices Σ(ω, t_1) and Σ(ω, t_2) are Hermitian and positive definite, by Hadamard's inequality [14] we know that

\det[\Sigma_Y(\omega, t_i)] \le \prod_{j=1}^{N} \sigma_{Y_{jj}}(\omega), \quad i = 1, 2,    (14)

where

\Sigma_Y(\omega, t_i) = W(\omega)\, \Sigma(\omega, t_i)\, W^H(\omega)    (15)

is the output covariance and σ_{Y_jj}(ω) denotes its jth diagonal element. The equality in Equation (14) holds if and only if Σ_Y(ω, t_i) is diagonal. Hence we may utilize Equation (14) to simultaneously diagonalize several covariance matrices. (Note: the series may be decomposed into more than two segments to obtain more than two covariance matrices, similar to Equation (12).) Our criterion is as follows:

J(\omega) = \sum_{t_i} \Big\{ \sum_{j=1}^{N} \log[\sigma_{Y_{jj}}(\omega)] - \log[\det[\Sigma_Y(\omega, t_i)]] \Big\}.    (16)

From the previous discussion of Hadamard's inequality, we know that the global minimum is achieved only when the Σ_Y(ω, t_i)'s are all diagonal, which is the separation solution. Moreover, this solution leads to an adaptive algorithm which can be integrated with the constraints that avoid the permutation and scaling problems.

6.1 The Gradient Descent For Simultaneous Diagonalization
According to our proposed criterion in Equation (16), we can derive the weight increment from the gradient rule

\Delta W(\omega) = -\eta\, \frac{\partial J(\omega)}{\partial W(\omega)}.    (17)

From Equations (16) and (17), we may formulate the weight adaptation as

\Delta W(\omega) = -\eta \sum_{t_i} \Big\{ \mathrm{diag}\{\sigma_{Y_{jj}}(\omega)\}^{-1}\, W(\omega)\, \Sigma(\omega, t_i) - W^{-H}(\omega) \Big\}    (18)

or as its two equivariant rules in Equations (19) and (20):

\Delta W(\omega) = -\eta \sum_{t_i} \Big\{ \mathrm{diag}\{\sigma_{Y_{jj}}(\omega)\}^{-1}\, \Sigma_Y(\omega, t_i)\, W(\omega) - W(\omega) \Big\}    (19)

and

\Delta W(\omega) = -\eta \sum_{t_i} \Big\{ \mathrm{diag}\{\sigma_{Y_{jj}}(\omega)\}^{-1}\, \Sigma_Y(\omega, t_i) - I \Big\}    (20)

where I is the identity matrix and diag{σ_{Y_jj}(ω)} is the diagonal matrix formed by the diagonal elements of Σ_Y(ω, t_i). Equations (18), (19) and (20) are the extension and modification of the learning rule for an instantaneous mixture [13] to the complex-valued frequency domain. We also note that Equations (19) and (20) are much more computationally efficient than the learning rule derived from the Frobenius criterion in [11]: the complexity in [11] is 7N^2, while ours is 2N^2.

6.2 The Source Separation Procedure
According to the descriptions in Sections 3, 4 and 6.1, our source separation algorithm can be stated as follows (a code sketch of the core update appears after this list):

• Initialize all the W(ω)'s.
• Compute the output covariance Σ_Y(ω, t_i) as in Equation (15) for all the segments t_i.
• Compute the weight adaptation ΔW(ω) for every frequency ω by Equation (18), (19) or (20).
• Set W_new(ω) = W_old(ω) + ΔW(ω).
• Constrain W(t) to be a finite-support filter matrix with length M.
• Repeat the first five steps until learning convergence.
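A minimal sketch of one adaptation step with the equivariant rule of Equation (19) for a single frequency band, together with the finite-support constraint from the list above, is given below; the learning rate, array shapes and helper names are illustrative assumptions, not the authors' implementation.

```python
import numpy as np

def sdif_update(W, Sigmas, eta=0.01):
    """One adaptation step with the equivariant rule of Equation (19), one band.

    W      : (N, N) complex separating matrix W(omega)
    Sigmas : list of (N, N) input covariances Sigma(omega, t_i), one per segment
    eta    : learning rate (illustrative value, not from the paper)
    """
    dW = np.zeros_like(W)
    for Sigma in Sigmas:
        Sigma_Y = W @ Sigma @ W.conj().T                      # output covariance, Eq. (15)
        inv_diag = np.diag(1.0 / np.real(np.diag(Sigma_Y)))   # diag{sigma_Yjj}^{-1}
        dW += inv_diag @ Sigma_Y @ W - W                      # bracket of Eq. (19)
    return W - eta * dW                                       # W_new = W_old + Delta W

def constrain_length(W_bands, M):
    """Finite-support constraint: keep only the first M taps of every w_ij(t)."""
    w_time = np.fft.ifft(W_bands, axis=0)   # W(omega) -> w(t), per filter entry
    w_time[M:] = 0.0                        # truncate to length M
    return np.fft.fft(w_time, axis=0)       # back to the frequency bands
```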
7. SIMULATION
We provide two series of experiments in this section. First, in order to quantify the performance of the algorithm, we use zero-mean, unit-variance independent speech sources from the TIMIT database and mix them artificially by computer. Then we use the real-world data downloaded from the website of the Salk Institute to test our method [16].
7.1 BSS from Artificial Convolutive Mixture
As depicted in Figure 1, our mixing channel is composed of two cross paths, H12(z) = -0.6 + 0.5z^{-1} + 0.43z^{-2} + 0.42z^{-3} + 0.31z^{-4} + 0.28z^{-5} - 0.25z^{-6} + 0.2z^{-7} + 0.1z^{-8} and H21(z) = 0.9 + 0.8z^{-1} - 0.77z^{-2} + 0.56z^{-3} + 0.50z^{-4} + 0.43z^{-5} - 0.4z^{-6} + 0.22z^{-7} + 0.1z^{-8} + 0.04z^{-9}, with H11(z) = H22(z) = 1. Using our separation algorithm of Section 6.2 and the straight generalized eigendecomposition of Equation (13), after just 10 to 50 iterations we obtain the final separating filters w_kj(t). The ith output's signal-to-noise ratio SNR(i) is defined by

SNR(i) = 10 \log \left[ \frac{\sum_{t=0}^{L+M+1} \Big( \sum_{j=1}^{N} \sum_{\tau=0}^{L} w_{ij}(t-\tau)\, h_{ji}(\tau) \Big)^2}{\sum_{t=0}^{L+M+1} \Big( \sum_{j=1}^{N} \sum_{\tau=0}^{L} w_{kj}(t-\tau)\, h_{ji}(\tau) \Big)^2} \right], \quad k \ne i.    (21)
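The SNR of Equation (21) only involves the combined mixing-plus-separating impulse responses; a minimal sketch (with hypothetical array layouts) is given below, where C[i, k] holds the total filter from source k to output i.

```python
import numpy as np

def combined_response(W_t, H_t):
    """Combined filters c_ik(t) = sum_j w_ij(t) * h_jk(t) (linear convolution).

    W_t : (N, N, Lw) separating filters w_ij(t)
    H_t : (N, N, Lh) mixing filters h_jk(t)
    """
    N, _, Lw = W_t.shape
    Lh = H_t.shape[2]
    C = np.zeros((N, N, Lw + Lh - 1))
    for i in range(N):
        for k in range(N):
            for j in range(N):
                C[i, k] += np.convolve(W_t[i, j], H_t[j, k])
    return C

def snr_db(C, i, k):
    """Equation (21): energy of source i in output i versus its leakage into output k."""
    return 10.0 * np.log10(np.sum(C[i, i] ** 2) / np.sum(C[k, i] ** 2))
```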
If we denote the straight simultaneous diagonalization as SSD and our algorithm as SDIF, we obtain the performance (in dB) shown in Table 1.

TABLE 1. Comparison between SSD and SDIF (all values in dB).

Scheme         out1      out2      average
SSD,  M=4096   -5.1669   -1.2778   -3.2224
SSD,  M=500    -2.2016    0.8287   -0.6865
SSD,  M=100     1.3327    2.4782    1.9055
SDIF, M=500     7.0863    8.5798    7.8331
SDIF, M=100     9.5707   12.8879   11.2293

Note that all the experiments are conducted with the same window length T as described in Equation (6). The different values of M correspond to different lengths of the final separating filters, i.e., w_ij(t) = 0 for t > M, in both the SSD and SDIF simulations.
8. CONCLUSION
In this paper we utilize the frequency decomposition approach to reduce the network complexity that occurs in the time domain. We analyzed in detail the implications of scaling and permutation in frequency-domain demixing. Based on the simultaneous diagonalization, we derive a new SDIF algorithm. This algorithm is more computationally efficient and takes a shorter time to converge. Our simulations also show that it performs better. However, the constraint on the separating filter length is critical and still under investigation. As a rule of thumb, we need DFT windows about ten times longer than the separating filter to ensure a good approximation between frequency-domain multiplication and temporal linear convolution.

REFERENCES
[1] J.-F. Cardoso and B. Laheld, "Equivariant adaptive source separation," IEEE Transactions on Signal Processing, vol. 44, no. 12, December 1996.
[2] A. J. Bell and T. J. Sejnowski, "An information maximization approach to blind separation and blind deconvolution," Neural Computation, vol. 7, pp. 1129-1159, 1995.
[3] D. Obradovic and G. Deco, "Unsupervised learning for blind source separation: an information-theoretic approach," in Proc. ICASSP, pp. 127-130, 1997.
[4] L. Tong, R. Liu, V. Soon and Y. Huang, "Indeterminacy and identifiability of blind identification," IEEE Trans. on Circuits and Systems, vol. 38, no. 5, pp. 499-509, May 1991.
[5] A. Souloumiac, "Blind source detection and separation using second order nonstationarity," in Proc. ICASSP, pp. 1912-1915, 1995.
[6] L. Molgedey and G. Schuster, "Separation of a mixture of independent signals using time delayed correlations," Physical Review Letters, vol. 72, no. 23, pp. 3634-3637, 1994.
[7] D. Yellin and E. Weinstein, "Criteria for multichannel signal separation," IEEE Trans. on Signal Processing, vol. 42, no. 8, pp. 2158-2168, August 1994.
[8] H. L. Nguyen and C. Jutten, "Blind sources separation for convolutive mixtures," Signal Processing, vol. 45, no. 2, pp. 209-229, August 1995.
[9] P. Smaragdis, "Blind separation of convolved mixtures in the frequency domain," MIT technical report, 1997.
[10] C. Serviere, "Feasibility of source separation in frequency domain," in Proc. ICASSP, pp. 2085-2088, 1998.
[11] L. Parra and C. Spence, "Convolutive blind source separation based on multiple decorrelation," technical report, April 1998.
[12] H.-C. Wu and J. C. Principe, "A unifying criterion for blind source separation and decorrelation: simultaneous diagonalization of correlation matrices," in Proc. NNSP, pp. 496-505, 1997.
[13] K. Matsuoka, M. Ohya and M. Kawamoto, "A neural net for blind separation of nonstationary signals," Neural Networks, vol. 8, no. 3, pp. 411-419, 1995.
[14] B. Noble and J. W. Daniel, Applied Linear Algebra, Prentice-Hall, 1988.
[15] D. Chan, P. Rayner and S. J. Godsill, "Multi-channel signal separation," in Proc. ICASSP, Atlanta, USA, pp. 649-652, May 1996.
[16] T.-W. Lee, A. J. Bell and R. Orglmeister, "Blind source separation of real world signals," in Proc. IEEE International Conference on Neural Networks, vol. 4, pp. 2129-2134, 1997.