LNCS 6365 - Speech Separation via Parallel Factor ... - Springer Link

Speech Separation via Parallel Factor Analysis of Cross-Frequency Covariance Tensor

Xiao-Feng Gong and Qiu-Hua Lin

School of Information and Communication Engineering, Dalian University of Technology, Dalian 116024, China
[email protected]

Abstract. This paper considers the separation of convolutive speech mixtures in the frequency domain within a tensorial framework. By assuming that components associated with neighboring frequency bins of the same source are still correlated, a set of cross-frequency covariance tensors with trilinear structure is established, and an algorithm consisting of consecutive parallel factor (PARAFAC) decompositions is developed. Each PARAFAC decomposition used in the proposed method simultaneously estimates two neighboring frequency responses, one of which is a common factor shared with the subsequent cross-frequency covariance tensor and can thus be used to align the permutations of the estimates across all the PARAFAC decompositions. In addition, the issue of identifiability is addressed, and simulations with synthetic speech signals are provided to verify the efficacy of the proposed method.

Keywords: Blind source separation, Tensor, Parallel factor analysis.

V. Vigneron et al. (Eds.): LVA/ICA 2010, LNCS 6365, pp. 65–72, 2010. © Springer-Verlag Berlin Heidelberg 2010

1 Introduction

Independent component analysis (ICA) aims at recovering multiple independent source signals mixed through unknown channels, using only the observations collected at a set of sensors. Since ICA requires little prior information about the source signals and channels, it has become a widely used method for speech separation [1-6]. Earlier works on ICA mostly focused on instantaneous mixtures. However, the instantaneous mixing model does not always match practical situations. For example, in a reverberant environment, which is often encountered in practice, the signals captured by the microphones are attenuated and delayed versions of multiple source signals superimposed on one another, resulting in a convolutive mixing model. As a result, blind separation of convolutive mixtures has become a key problem in speech processing.

Generally speaking, blind separation of convolutive speech mixtures can be performed in either the time domain or the frequency domain. More precisely, time-domain methods minimize some independence criterion with respect to the coefficients of the mixing channels at each time delay. The problem with time-domain methods is that when the speech signals are recorded in strong reverberation, there are too many parameters to adjust, which may result in convergence difficulty and a heavy computational burden [1]. Frequency-domain methods, on the other hand, transform the deconvolution problem into a set of instantaneous mixture separation problems, and hence facilitate the use of well-studied instantaneous mixture separation methods.

However, there are prices to pay for the advantage of frequency-domain methods. Firstly, the Fourier transform used in frequency-domain methods tends to generate nearly Gaussian signals. Therefore, many existing methods for instantaneous mixture separation that require non-Gaussianity are no longer valid. To solve this problem, the non-stationarity of speech signals is exploited together with second-order statistics, and joint diagonalization or tensor based methods have been proposed [4-6]. Secondly, the independently implemented separation procedures may yield mismatched permutations and scalings of the estimated sources or mixing matrices associated with different frequency bins, resulting in the so-called permutation and scaling ambiguities. Since the scaling ambiguity can be well solved [2], the permutation ambiguity becomes the main problem with frequency-domain methods. In existing works, prior information on both the mixing filters (such as the continuity of the frequency response) and the source signals (such as the covariance across frequency bins) has been used to align the permutations. A detailed survey of permutation correction methods can be found in [1].

In this paper, we propose a new method for convolutive speech separation within the tensorial framework. More precisely, a set of cross-frequency covariance tensors, which incorporate both the non-stationarity and the covariance between neighboring frequency bins, is first established. Then an algorithm comprising consecutive parallel factor analysis (PARAFAC) decompositions is proposed.
Unlike the existing works that estimate one frequency response associated with a single frequency bin each time, the proposed method could generate paired estimates of two neighboring frequency responses in each run of the PARAFAC decomposition. In addition, we develop a permutation correction scheme by exploiting the common factor shared by neighboring cross-frequency covariance tensors.

2 Problem Statement

We consider $M$ mutually uncorrelated speech signals $s(t) = [s_1(t), s_2(t), \ldots, s_M(t)]^T$ collected with an array of $N$ microphones, and denote the recorded mixtures by $x(t) = [x_1(t), x_2(t), \ldots, x_N(t)]^T$. In the noise-free case, $x(t)$ can be modeled as:

$$x(t) = \sum_{\tau=0}^{L-1} H(\tau)\, s(t-\tau) = H(t) \ast s(t) \tag{1}$$

where '$\ast$' denotes the operation of linear convolution, and the $N \times M$ matrix $H(\tau)$ denotes the impulse response matrix of the mixing filter at time-lag $\tau$. Its elements $h_{nm}(\tau)$ are the coefficients of the room impulse response (RIR) between the $m$th source and the $n$th microphone, $n = 1, 2, \ldots, N$, $m = 1, 2, \ldots, M$, and $L$ denotes the maximum channel length. To recover the original speech sources, the goal is to find a demixing filter $G(\tau)$ such that:

$$\hat{s}(t) = \sum_{\tau=0}^{K-1} G(\tau)\, x(t-\tau) \tag{2}$$

where $K$ is the length of the demixing filter, and $\hat{s}(t)$ is the restored source vector. In frequency-domain methods, the convolutive model in (1) is reduced to a set of instantaneous mixing models by applying the short-time Fourier transform (STFT):

$$x(t, f) = \sum_{\tau=1}^{F} w_F(t)\, x(\tau - t)\, e^{2\pi i f \tau} \tag{3}$$

where $F$ is the frame length, and $w_F(t)$ is an $F$-point Hanning window. Then (1) can be rewritten as:

$$x(t, f) = H_f\, s(t, f) \tag{4}$$

where $H_f$ denotes the frequency response of the mixing filter, and $s(t, f)$ is the STFT of $s(t)$. Therefore, the deconvolution problem can be solved by identifying the instantaneous mixture model (4) independently for all the frequency bins. However, the main difficulty with frequency-domain methods is the need to resolve the scaling and permutation ambiguities that arise in the separation procedure, of which the latter remains an open problem. In Section 3, we propose a new permutation correction scheme to address this problem. Before going into the details, we list the assumptions used throughout the paper:

A1) The sources are zero-mean and mutually uncorrelated at each frequency bin $f$. In addition, components associated with adjacent frequency bins of the same source are correlated;
A2) The cross-frequency covariances of different sources for each pair of frequency bins $(f, f+1)$ vary differently with time;
A3) The number of speakers $M$ is known, and not larger than the number of microphones $N$;
A4) The impulse responses of all mixing filters are constant. In addition, arbitrary $M$ columns of $H_f$ are linearly independent.
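The convolutive model (1) and the per-bin frequency responses $H_f$ appearing in (4) can be illustrated with a small NumPy sketch. The function names, array layouts, and the random test data are assumptions made for illustration only; in particular, `np.fft.fft` uses the exponent sign convention of NumPy, which may differ from the one written in (3).

```python
import numpy as np

def convolutive_mix(H, s):
    """Apply x(t) = sum_{tau=0}^{L-1} H(tau) s(t - tau), as in (1).

    H : array (L, N, M); H[tau] is the N x M impulse response matrix at lag tau.
    s : array (M, T); source signals, with s(t) taken as zero for t < 0.
    Returns x : array (N, T).
    """
    L, N, M = H.shape
    T = s.shape[1]
    x = np.zeros((N, T))
    for tau in range(min(L, T)):
        # Samples t >= tau receive the contribution H(tau) s(t - tau).
        x[:, tau:] += H[tau] @ s[:, :T - tau]
    return x

def frequency_responses(H, F):
    """F-point DFT of the mixing filters along the lag axis: shape (F, N, M)."""
    return np.fft.fft(H, n=F, axis=0)

# Hypothetical toy dimensions: L taps, N microphones, M sources, T samples.
rng = np.random.default_rng(0)
L, N, M, T = 8, 3, 2, 200
H = rng.standard_normal((L, N, M))
s = rng.standard_normal((M, T))
x = convolutive_mix(H, s)
H_f = frequency_responses(H, F=64)   # H_f[f] plays the role of the matrix H_f in (4)
```

Each microphone signal is the sum of the sources filtered by the corresponding impulse responses, which is why the result can be cross-checked channel by channel against `np.convolve`.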

3 Proposed Algorithm

3.1 Estimation of Cross Frequency Covariance Tensor

We first define the cross-frequency covariance matrix as follows:

$$R_{t,f} \triangleq E\left( x(t, f)\, x^H(t, f+1) \right) \tag{5}$$

where '$E$' denotes the mathematical expectation. Since speech signals are nonstationary, $R_{t,f}$ varies with time, and the cross-frequency covariance tensor $R_f$ can then be obtained by stacking $T$ temporal samples of $R_{t,f}$ into a third-order tensor as follows:

$$R_f(:,:,k) \triangleq R_{t_k, f} \tag{6}$$

where $R_f(:,:,k)$ denotes the matrix slice of the tensor $R_f \in \mathbb{C}^{N \times N \times T}$ obtained by fixing its third index to $k$ and varying its first and second indices, $k = 1, 2, \ldots, T$. According to (4) and assumptions A1) and A3), (6) can be rewritten as:

$$R_f = \sum_{m=1}^{M} h_{f,m} \circ r_{f,m} \circ h^{*}_{f+1,m} \tag{7}$$

where $h_{f,m}$ and $h_{f+1,m}$ are the $m$th column vectors of $H_f$ and $H_{f+1}$, respectively, $r_{f,m} \triangleq E\left[ s_m(t_1, f)\, s_m^{*}(t_1, f+1), \ldots, s_m(t_T, f)\, s_m^{*}(t_T, f+1) \right]^T$, $s_m(t_k, f)$ is the component of the $m$th source associated with frequency bin $f$ and time instant $t_k$, and '$\circ$' denotes the tensor outer product.¹

In practice, the cross-frequency covariance tensor is unavailable but can be estimated from the collected data sampled at $T$ different time instants. The idea is to average the results obtained from $Q$ successive frames, each overlapping the neighboring one by $3F/4$ samples, where $F$ is the frame length (see Fig. 1). As a result, $R_{t_k, f}$ can be estimated, for $f = 1, 2, \ldots, F-1$, as:

$$\hat{R}_{t_k, f} = \frac{1}{Q} \sum_{q=1}^{Q} x\left(t_k + q\frac{F}{4}, f\right) x^H\left(t_k + q\frac{F}{4}, f+1\right) \tag{8}$$

Then the estimate of the cross-frequency covariance tensor is obtained by replacing $R_{t_k, f}$ in (6) by $\hat{R}_{t_k, f}$.

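[Figure 1: Q successive frames $x_m(t_k, f), x_m(t_k + F/4, f), \ldots, x_m(t_k + (Q-1)F/4, f)$, each of length $F$ and shifted by $F/4$ relative to its predecessor.]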

Fig. 1. Q successive frames for estimating $R_{t_k, f}$
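The frame-averaging estimate of the cross-frequency covariance matrices can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function name and the STFT array layout are hypothetical, the STFT frames are assumed to be already spaced $F/4$ samples apart, and the time samples $t_k$ are assumed to be taken from consecutive non-overlapping blocks of $Q$ frames (the text does not specify their spacing).

```python
import numpy as np

def cross_frequency_covariance_tensor(X, f, Q):
    """Estimate the slices R_f(:, :, k), k = 0..T-1, from STFT frames.

    X : complex array (N, n_frames, n_bins); X[:, j, f] is x(t_j, f),
        with consecutive frames assumed to be F/4 samples apart.
    f : frequency-bin index (0-based); the pair (f, f + 1) is used.
    Q : number of successive frames averaged per time sample t_k.
    Returns R : complex array (N, N, T) with T = n_frames // Q.
    """
    N, n_frames, n_bins = X.shape
    if not 0 <= f < n_bins - 1:
        raise ValueError("f must satisfy 0 <= f < n_bins - 1")
    T = n_frames // Q
    R = np.empty((N, N, T), dtype=complex)
    for k in range(T):
        a = X[:, k * Q:(k + 1) * Q, f]        # (N, Q) frames at bin f
        b = X[:, k * Q:(k + 1) * Q, f + 1]    # (N, Q) frames at bin f + 1
        R[:, :, k] = a @ b.conj().T / Q       # average of x x^H over Q frames
    return R

# Hypothetical STFT data: 3 microphones, 40 frames, 16 frequency bins.
rng = np.random.default_rng(1)
X = rng.standard_normal((3, 40, 16)) + 1j * rng.standard_normal((3, 40, 16))
R = cross_frequency_covariance_tensor(X, f=4, Q=10)
```

Stacking the averaged matrices along the third axis gives an array with the same $N \times N \times T$ layout as the tensor defined in (6).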

3.2 The Algorithm

When the identifiability condition for the unique (up to scaling and permutation ambiguities) PARAFAC decomposition is met (this issue is addressed in Subsection 3.3), we can obtain the estimates of $\{h_{f,m}, h_{f+1,m}\}$ by fitting the trilinear structure of $R_f$. The PARAFAC decomposition is usually carried out under an alternating least squares (ALS) principle [7]. Other methods, such as simultaneous matrix diagonalization or optimization-based approaches, may also be used; see [8] for a detailed comparison of PARAFAC fitting methods. In addition, methods for accelerating ALS-based algorithms, such as compression-based PARAFAC (COMFAC), have been proposed in the literature [9]. We herein adopt the scheme introduced in [10] along with COMFAC to speed up the PARAFAC decomposition.

¹ The outer product of three vectors $a \in \mathbb{C}^{I}$, $b \in \mathbb{C}^{K}$ and $c \in \mathbb{C}^{L}$ is a tensor $T \in \mathbb{C}^{I \times K \times L}$ given by $t_{i,k,l} \triangleq a_i b_k c_l$.
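To make the ALS principle concrete, the following is a minimal sketch of a plain ALS fit of a trilinear model. It is not the accelerated scheme of [10] with COMFAC adopted in this paper; the function names, the random initialization, the fixed iteration count, and the toy dimensions are all assumptions for illustration. The test tensor is built so that its $k$th slice equals $H_f D_k(B_f) H_{f+1}^H$, i.e. the factors are stored in the order (first index, second index, time index).

```python
import numpy as np

def khatri_rao(U, V):
    """Column-wise Kronecker product: row (i * V.shape[0] + j) holds U[i] * V[j]."""
    return np.einsum('ir,jr->ijr', U, V).reshape(-1, U.shape[1])

def parafac_als(X, rank, n_iter=1000, seed=0):
    """Fit X[i, j, k] ~ sum_r A[i, r] B[j, r] C[k, r] by plain ALS."""
    I, J, K = X.shape
    rng = np.random.default_rng(seed)

    def init(n):
        return rng.standard_normal((n, rank)) + 1j * rng.standard_normal((n, rank))

    A, B, C = init(I), init(J), init(K)
    X1 = X.reshape(I, J * K)                        # mode-1 unfolding
    X2 = X.transpose(1, 0, 2).reshape(J, I * K)     # mode-2 unfolding
    X3 = X.transpose(2, 0, 1).reshape(K, I * J)     # mode-3 unfolding
    for _ in range(n_iter):
        # Each update is a linear least-squares solve with the other two factors fixed.
        A = X1 @ np.linalg.pinv(khatri_rao(B, C).T)
        B = X2 @ np.linalg.pinv(khatri_rao(A, C).T)
        C = X3 @ np.linalg.pinv(khatri_rao(A, B).T)
    return A, B, C

# Hypothetical noise-free example with N = 4 microphones, M = 2 sources, T = 6.
rng = np.random.default_rng(2)
N, M, T = 4, 2, 6
Hf = rng.standard_normal((N, M)) + 1j * rng.standard_normal((N, M))
Hf1 = rng.standard_normal((N, M)) + 1j * rng.standard_normal((N, M))
Bf = rng.standard_normal((T, M)) + 1j * rng.standard_normal((T, M))
R = np.einsum('im,jm,km->ijk', Hf, Hf1.conj(), Bf)   # slice k: Hf diag(Bf[k]) Hf1^H

A, B, C = parafac_als(R, rank=M)
R_fit = np.einsum('im,jm,km->ijk', A, B, C)
rel_err = np.linalg.norm(R_fit - R) / np.linalg.norm(R)
```

The recovered columns of `A` and `B` match those of `Hf` and the conjugate of `Hf1` only up to a common permutation and per-column scaling, which is the ambiguity the remainder of this subsection deals with.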

Denote the estimate of $h_{f+1,m}$ by $\hat{h}_{f+1,m}$, and $\hat{H}_{f+1} \triangleq [\hat{h}_{f+1,1}, \ldots, \hat{h}_{f+1,M}]$; then $\hat{H}_{f+1} = H_{f+1} P_f \Lambda_f$, where $P_f \in \mathbb{R}^{M \times M}$ and $\Lambda_f \in \mathbb{R}^{M \times M}$ are the permutation matrix and the diagonal scaling matrix associated with frequency bin $f$, respectively. Furthermore, by denoting $B_f \triangleq [r_{f,1}, r_{f,2}, \ldots, r_{f,M}]$ and letting $D_k(B_f)$ be a diagonal matrix containing the $k$th row of $B_f$, $k = 1, 2, \ldots, T$, we can define a new tensor $C_{f+1} \in \mathbb{C}^{M \times N \times T}$ as follows:

$$\begin{aligned} C_{f+1}(:,:,k) &\triangleq (\hat{H}_{f+1}^H \hat{H}_{f+1})^{-1} \hat{H}_{f+1}^H R_{f+1}(:,:,k) \\ &= (\hat{H}_{f+1}^H \hat{H}_{f+1})^{-1} \hat{H}_{f+1}^H H_{f+1} D_k(B_{f+1}) H_{f+2}^H = D_k(B_{f+1}) (H_{f+2} P_f \Lambda_f)^H \end{aligned} \tag{9}$$

The equation above can be rewritten in the standard PARAFAC form:

$$C_{f+1} = \sum_{m=1}^{M} e_m \circ r_{f+1,m} \circ (h'_{f+2,m})^{*} \tag{10}$$

where $e_m \in \mathbb{C}^{M}$ is the $m$th column vector of the $M \times M$ identity matrix $I_M$, and $h'_{f+2,m}$ is the $m$th column vector of $H_{f+2} P_f \Lambda_f$. By applying the PARAFAC decomposition to $C_{f+1}$ we obtain $\hat{E}_{f+1} \triangleq [\hat{e}_{f+1,1}, \ldots, \hat{e}_{f+1,M}]$ and $\hat{H}_{f+2} \triangleq [\hat{h}_{f+2,1}, \ldots, \hat{h}_{f+2,M}]$, where $\hat{e}_{f+1,m}$ and $\hat{h}_{f+2,m}$ are estimates of $e_m$ and $h'_{f+2,m}$, respectively. In addition, since the PARAFAC decomposition is unique only up to scaling and permutation ambiguities, we have:

$$\begin{cases} \hat{E}_{f+1} = P_{f+1} \Lambda_{f+1} \\ \hat{H}_{f+2} = H_{f+2} P_f P_{f+1} \Lambda_f \Lambda_{f+1} \end{cases} \tag{11}$$

where $\Lambda_{f+1}$ and $P_{f+1}$ are the scaling and permutation matrices associated with frequency bin $f+1$, respectively. If we constrain $\hat{E}_{f+1}$ to be an identity matrix, then from (11), and noting that $P_{f+1}$ is either non-diagonal or equal to the identity matrix, we have $P_{f+1} = \Lambda_{f+1} = I_M$ and $\hat{H}_{f+2} = H_{f+2} P_f \Lambda_f$. We note that $P_f$ and $\Lambda_f$ are associated with the previous frequency bin, indicating that $\{\hat{h}_{f,m}\}$, $\{\hat{h}_{f+1,m}\}$, and $\{\hat{h}_{f+2,m}\}$ are permuted from $\{h_{f,m}\}$, $\{h_{f+1,m}\}$, and $\{h_{f+2,m}\}$ in the same manner, $m = 1, 2, \ldots, M$, and thus the permutations for $\{\hat{h}_{f,m}\}$, $\{\hat{h}_{f+1,m}\}$, $\{\hat{h}_{f+2,m}\}$ are aligned. By applying the aforementioned scheme successively for all the frequency bins, we can finally resolve the permutation ambiguity. As for the scaling ambiguity, we refer to the principle introduced in [2] for its solution.

The potential advantage of the proposed method is that the use of cross-frequency covariance tensors yields an aligned pair of frequency responses in each run of the PARAFAC decomposition, and this property can be used to enable a more reliable permutation correction scheme (not limited to the one proposed herein). However, we note that the proposed permutation aligning strategy is sequential, and thus errors may propagate to subsequent bins if the permutation correction scheme fails at a certain frequency index. Moreover, since equation (9) relies on the left pseudo-inverse of $\hat{H}_{f+1}$ and thus requires at least as many microphones as speakers, the proposed algorithm can only be used in the (over-)determined case, although it has been noted in the literature that the powerful uniqueness properties of PARAFAC can be used to tackle underdetermined problems [10]. These problems will be the focus of our future work.
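The alignment mechanism behind (9) and (10) can be checked numerically in a noise-free setting. The sketch below is an illustration under stated assumptions rather than the paper's procedure: all matrices are random, the permutation and scaling are chosen arbitrarily, and, because the first factor of $C_{f+1}$ is the identity, each horizontal slice `C1[m]` is rank one, so a per-slice SVD is used here as a stand-in for the constrained PARAFAC step.

```python
import numpy as np

rng = np.random.default_rng(3)
N, M, T = 4, 3, 8

def crandn(*shape):
    """Complex Gaussian array (helper introduced for this example)."""
    return rng.standard_normal(shape) + 1j * rng.standard_normal(shape)

H1, H2 = crandn(N, M), crandn(N, M)   # true H_{f+1} and H_{f+2}
B1 = crandn(T, M)                      # B_{f+1}; column m is r_{f+1,m}

# Cross-frequency covariance tensor R_{f+1}: slice k equals H1 D_k(B1) H2^H.
R1 = np.einsum('im,km,jm->ijk', H1, B1, H2.conj())

# Estimate of H_{f+1} carrying an unknown permutation and scaling (H1 P Lambda).
perm = np.array([2, 0, 1])
scale = np.array([0.5, -2.0, 1.5])
H1_hat = H1[:, perm] * scale

# Eq. (9): C_{f+1}(:, :, k) = pinv(H1_hat) R_{f+1}(:, :, k), an M x N x T tensor.
C1 = np.einsum('mi,ijk->mjk', np.linalg.pinv(H1_hat), R1)

# Slice C1[m] (N x T) is rank one; its left singular vector is the conjugate
# of the column of H_{f+2} that carries the same permutation as H1_hat[:, m].
H2_aligned = np.stack(
    [np.linalg.svd(C1[m])[0][:, 0].conj() for m in range(M)], axis=1)

def collinearity(u, v):
    """Absolute cosine between two complex vectors (1.0 means collinear)."""
    return abs(np.vdot(u, v)) / (np.linalg.norm(u) * np.linalg.norm(v))

match = [collinearity(H2_aligned[:, m], H2[:, perm[m]]) for m in range(M)]
```

Every entry of `match` is numerically one, meaning that column $m$ of the recovered $H_{f+2}$ estimate corresponds to the same source as column $m$ of `H1_hat`, up to scaling, without any separate permutation-matching step.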


3.3 Identifiability

We first introduce the following theorem.

Theorem 1 (Kruskal theorem [9]): Given a PARAFAC model

$$\sum_{r=1}^{R} a_r \circ b_r \circ c_r,$$

denote $A = [a_1, \ldots, a_R]$, $B = [b_1, \ldots, b_R]$, and $C = [c_1, \ldots, c_R]$. Then the decomposition of this PARAFAC model is unique up to permutation and scaling when:

$$k_A + k_B + k_C \geq 2R + 2 \tag{12}$$

where $k_A$, $k_B$, and $k_C$ denote the Kruskal ranks of $A$, $B$, and $C$, respectively.²

We then use Theorem 1 to analyze the identifiability of the proposed algorithm. The PARAFAC decomposition for each frequency bin must be unique for the proposed method to be valid, and therefore the models given in (7) and (10), each with $M$ components, should both satisfy the identifiability condition. That is:

$$\begin{cases} k_{H_f} + k_{B_f} + k_{H_{f+1}} \geq 2M + 2 \\ k_{I_M} + k_{B_f} + k_{H_{f+1}} \geq 2M + 2 \end{cases} \tag{13}$$

Obviously, $k_{I_M} = M$. Moreover, by assumptions A3) and A4), we have $k_{H_f} = k_{H_{f+1}} = M$, and (13) reduces to $k_{B_f} \geq 2$, which is satisfied if and only if $B_f$ does not contain collinear columns. Recall that the $m$th column of $B_f$ represents the cross-frequency covariance of the $m$th source for the frequency pair $(f, f+1)$ at different time instants; then, by assumption A2), no two columns of $B_f$ are collinear, so $k_{B_f} \geq 2$ is satisfied. As a result, the identifiability conditions for both (7) and (10) are met, and the proposed algorithm is valid under assumptions A1) to A4).
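For small matrices, the Kruskal rank and the conditions in (13) can be checked by brute force. The sketch below is only practical for a handful of columns, since it enumerates all column subsets; the function names, the tolerance, and the random test matrices are assumptions for illustration.

```python
import numpy as np
from itertools import combinations

def kruskal_rank(A, tol=1e-10):
    """Largest r such that every set of r columns of A is linearly independent."""
    n_cols = A.shape[1]
    k = 0
    for r in range(1, n_cols + 1):
        if all(np.linalg.matrix_rank(A[:, list(c)], tol=tol) == r
               for c in combinations(range(n_cols), r)):
            k = r
        else:
            break
    return k

def satisfies_condition_13(Hf, Bf, Hf1):
    """Check both inequalities of (13) for M = number of columns of Hf."""
    M = Hf.shape[1]
    kH, kB, kH1 = kruskal_rank(Hf), kruskal_rank(Bf), kruskal_rank(Hf1)
    return (kH + kB + kH1 >= 2 * M + 2) and (M + kB + kH1 >= 2 * M + 2)

# Hypothetical example: N = 4 microphones, M = 3 sources, T = 8 time samples.
rng = np.random.default_rng(4)
Hf, Hf1 = rng.standard_normal((4, 3)), rng.standard_normal((4, 3))
Bf = rng.standard_normal((8, 3))

# Two sources whose cross-frequency covariances vary identically over time
# (collinear columns), which violates assumption A2).
Bf_bad = Bf.copy()
Bf_bad[:, 1] = 2.0 * Bf_bad[:, 0]

ok = satisfies_condition_13(Hf, Bf, Hf1)
bad = satisfies_condition_13(Hf, Bf_bad, Hf1)
```

With collinear columns the Kruskal rank of `Bf_bad` drops to one, so the left-hand sides of (13) reach only $2M + 1$ and the uniqueness condition fails.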

4 Simulation Results

In this section, we use simulations with synthetic speech signals to demonstrate the performance of the proposed method. The overall signal-to-interference ratio (SIR) is used as the measure of performance:

$$\chi = 10 \sum_{m=1}^{M} \log \frac{\sum_{t} s_{mm}^2(t)}{\sum_{k \neq m} \sum_{t} s_{mk}^2(t)} \tag{14}$$

where $s_{mm}(t)$ is the component coming from the $m$th source in its estimate $\hat{s}_m(t)$, and $s_{mk}(t)$ is the cross-talk from the $k$th source. Since the observations are synthetic, we have access to the microphone signals $x_{nm}(t)$, $n = 1, 2, \ldots, N$, recorded when only the $m$th source is present. Therefore, (14) can be rewritten as:

$$\chi = 10 \sum_{m=1}^{M} \log \frac{\sum_{t} \left( \sum_{n=1}^{N} g_{mn}(t) \ast x_{nm}(t) \right)^2}{\sum_{k \neq m} \sum_{t} \left( \sum_{n=1}^{N} g_{mn}(t) \ast x_{nk}(t) \right)^2} \tag{15}$$

where $g_{mn}(t)$ is the $(m, n)$th entry of the demixing filter $G(t)$. We compare the proposed method with the PARAFAC method based on stacking the time-varying covariance matrices into a covariance tensor for each frequency bin [5]. Moreover, to enable a clearer comparison, we modify the permutation correction method used in [5] by exploiting the continuity of frequency responses with the scheme introduced in [6]. As a result, the two compared methods are similar in form except that different tensors are used for the PARAFAC decompositions. We consider the scenario in which two speech signals, sampled at 16 kHz with a duration of 10 seconds, are mixed with filters of order 128 and 256. The overall SIR curves of the proposed cross-frequency tensor based PARAFAC (CFT-PARAFAC) and the existing covariance tensor based PARAFAC (CT-PARAFAC) are plotted in Fig. 2.

² The Kruskal rank of a matrix $A$, denoted by $k_A$, is the maximal number $r$ such that any set of $r$ columns of $A$ is linearly independent.
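The overall SIR of (14), with the per-source contributions computed as in (15), can be sketched as follows. The function names and array layouts are hypothetical, the filtered signals are truncated to the input length, and the logarithm is assumed to be base 10 so that the result is in dB.

```python
import numpy as np

def contributions(G, X_single):
    """s_mk(t) = sum_n g_mn(t) * x_nk(t), with '*' denoting linear convolution.

    G        : array (K, M, N); G[:, m, n] holds the taps of g_mn(t).
    X_single : array (N, M, T); X_single[n, k] is x_nk(t), the signal at
               microphone n when only source k is active.
    Returns S : array (M, M, T) with S[m, k] = s_mk(t), truncated to T samples.
    """
    K, M, N = G.shape
    _, n_src, T = X_single.shape
    S = np.zeros((M, n_src, T))
    for m in range(M):
        for k in range(n_src):
            for n in range(N):
                S[m, k] += np.convolve(G[:, m, n], X_single[n, k])[:T]
    return S

def overall_sir(S):
    """Overall SIR of (14): 10 * sum_m log10(target energy / cross-talk energy)."""
    energy = np.sum(np.abs(S) ** 2, axis=2)     # (M, M): sum_t s_mk^2(t)
    target = np.diag(energy)                     # sum_t s_mm^2(t)
    crosstalk = energy.sum(axis=1) - target      # sum_{k != m} sum_t s_mk^2(t)
    return 10.0 * np.sum(np.log10(target / crosstalk))

# Toy check: each output contains its target at amplitude 10 and one
# interferer at amplitude 1, i.e. 20 dB per output and 40 dB overall.
S_toy = np.ones((2, 2, 50))
S_toy[0, 0] *= 10.0
S_toy[1, 1] *= 10.0
sir_toy = overall_sir(S_toy)
```

Because (14) sums the per-source log ratios rather than averaging them, the overall figure grows with the number of sources for a fixed per-source separation quality.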





[Figure 2: two panels plotting overall SIR (dB) against frame length: (a) Mixing filter with order 128; (b) Mixing filter with order 256.]

Fig. 2. Overall SIR versus the frame length with mixing filters of order 128 and 256

From Fig. 2, we can see that the proposed CFT-PARAFAC method offers a larger overall SIR than CT-PARAFAC. In addition, the overall SIR curves of both methods decrease as the frame length increases. This is because a larger frame length can result in an increased number of mismatched frequency responses, which adds to the difficulty of permutation correction. Since the frame length should also not be too small, so that the desired frequency resolution is achieved, a proper frame length needs to be selected for the proposed method.

5 Conclusion

In this paper, we considered the problem of blind separation of convolutive speech mixtures. Covariance tensors across frequency bins are used instead of covariance tensors within a single frequency bin, and a tensorial scheme consisting of successive PARAFAC decompositions is developed. The permutation correction problem is solved in the proposed method by simultaneously extracting and pairing two adjacent frequency responses, and by exploiting the common factors shared by neighboring cross-frequency covariance tensors, under the assumption that components of the same source associated with adjacent frequency bins are correlated. Simulation results have shown that, for mixing filters of order 128 and 256, the overall signal-to-interference ratio of the proposed method is larger than that of the existing covariance tensor based PARAFAC method. This observation further suggests that exploiting covariance across frequency bins in-process (as opposed to in post-processing) may be more advantageous than merely using continuity of frequency responses in-process, and the cross-frequency covariance tensors provide one way to do so.

Acknowledgments. This work was supported by the National Natural Science Foundation of China under Grant No. 60971097.

References

1. Pedersen, M.S., Larsen, J., Kjems, U., Parra, L.C.: A survey of convolutive blind source separation methods. Springer Handbook on Speech Processing and Speech Communication, 1–34 (2007)
2. Murata, N., Ikeda, S., Ziehe, A.: An approach to blind source separation based on temporal structure of speech signal. Neurocomputing 41, 1–24 (2001)
3. Wang, L.D., Lin, Q.H.: Frequency-domain blind separation of convolutive speech mixtures with energy correlation-based permutation correction. In: Zhang, L., Lu, B.-L., Kwok, J. (eds.) ISNN 2010. LNCS, vol. 6063, Springer, Heidelberg (2010)
4. Parra, L.C., Spence, C.: Convolutive blind separation of non-stationary sources. IEEE Transactions on Speech and Audio Processing 8, 320–327 (2000)
5. Nion, D., Mokios, K.N., Sidiropoulos, N.D., Potamianos, A.C.: Batch and adaptive PARAFAC-based blind separation of convolutive speech mixtures. IEEE Transactions on Audio, Speech and Language Processing (to appear)
6. Serviere, C., Pham, D.T.: Permutation correction in the frequency domain in blind separation of speech mixtures. EURASIP Journal on Applied Signal Processing, Article ID 75206, 1–16 (2006)
7. Sidiropoulos, N.D., Bro, R., Giannakis, G.B.: Parallel factor analysis in sensor array processing. IEEE Transactions on Signal Processing 48, 2377–2388 (2000)
8. Tomasi, G., Bro, R.: A comparison of algorithms for fitting the PARAFAC model. Computational Statistics and Data Analysis 50, 1700–1734 (2006)
9. Sidiropoulos, N.D., Giannakis, G.B., Bro, R.: Blind PARAFAC receivers for DS-CDMA systems. IEEE Transactions on Signal Processing 48, 810–823 (2000)
10. De Lathauwer, L., Castaing, J.: Blind identification of underdetermined mixtures by simultaneous matrix diagonalization. IEEE Transactions on Signal Processing 56, 1096–1105 (2008)