
A Geometrically Constrained Independent Vector Analysis Algorithm for Online Source Extraction

Affan H. Khan(✉), Maja Taseska, and Emanuël A.P. Habets

International Audio Laboratories Erlangen, Erlangen, Germany
[email protected]

Abstract. In this paper, an online constrained independent vector analysis (IVA) algorithm that extracts the desired speech signal given the direction of arrival (DOA) of the desired source and the array geometry is proposed. The far-field array steering vector calculated using the DOA of the desired source is used to add a penalty term to the standard cost function of IVA. The penalty term ensures that the speech signal originating from the given DOA is extracted with small distortion. In contrast to unconstrained IVA, the proposed algorithm can be used to extract the desired speech signal online when the number of interferers is unknown or time varying. The applicability of the algorithm in various scenarios is demonstrated using simulations.

1 Introduction

Many modern communication systems require a high-quality hands-free capture of speech using one or more microphones. The signal received at the microphones is usually a mixture of desired and undesired source signals. One common approach to extract the desired source signal from the received mixture is through the use of beamforming algorithms. Beamformers can either be fixed, which requires knowledge of the DOA of the desired source, or data-dependent, which requires an accurate estimate of the second-order statistics (SOS) of the desired and the undesired signals. The estimation of the SOS is a challenging task that can be accomplished, for example, by detecting the activity of the desired sound sources [1,14]. Independent component analysis (ICA) provides an alternative approach to source extraction in which sound sources are separated based on the assumption that their signals are mutually statistically independent. Some common methods to obtain independent components include maximization of non-Gaussianity of the separated signals [10,11], minimization of mutual information [6], and maximum likelihood-based signal estimation [4]. ICA algorithms, however, suffer from scaling and permutation ambiguities and from non-uniqueness of the solution in underdetermined scenarios [6].

The International Audio Laboratories Erlangen are a joint institution of the University of Erlangen-Nuremberg and Fraunhofer IIS.

© Springer International Publishing Switzerland 2015. E. Vincent et al. (Eds.): LVA/ICA 2015, LNCS 9237, pp. 396–403, 2015. DOI: 10.1007/978-3-319-22482-4_46


To improve the quality of the extracted signal and to mitigate the inherent problems in ICA, various researchers have proposed to incorporate prior information by adding constraints to the optimization problem of ICA. DOA-based constraints were first introduced in ICA in [13]. The authors performed source separation through joint diagonalization of SOS [12] with a soft DOA-based constraint on each separation filter. Similarly, in [9], a hard constraint on one of the separation filters was applied to ensure an undistorted response in the direction of the desired source. Both of these are batch algorithms and require prior knowledge of the number of sources. In contrast, the authors in [15] used a soft constraint to extract the desired source without prior knowledge of the number of sources. The use of DOA-based prior information mitigates the permutation ambiguity in ICA, as each separation filter is constrained to extract an independent component originating from a given direction. However, if the sources are close to each other, permutation ambiguity might occur nonetheless.

To solve the permutation ambiguity, IVA was proposed in [8] as a generalization of ICA. In IVA, statistical dependence between the output signals is jointly minimized across all frequency bins such that permutation ambiguity does not occur. An online variant of IVA was proposed in [7].

In this paper, we propose an online geometrically constrained IVA (CIVA) algorithm that works in the frequency domain to extract the desired source whose DOA is known. We augment the standard cost function of IVA with a penalty term that restricts the Euclidean angle between one of the separation filters and the far-field steering vector calculated using the desired source DOA. This ensures that the desired speech signal is always delivered at the output of the corresponding separation filter with small distortion and without knowledge of the number of interferers. In contrast, the unconstrained IVA algorithm introduces higher distortion of the desired speech signal in non-determined and reverberant scenarios.

2 Problem Formulation

We consider a scenario where a sound field composed of $L$ speech signals and background noise is captured by $M$ microphones. The $L$ speech signals are assumed to be mutually statistically independent. The signal received at the $m$-th microphone can be described in the short-time Fourier transform (STFT) domain with sufficiently long time-frames as follows

$$Y_m(n,k) = \sum_{l=1}^{L} A_{m,l}(k)\, S_l(n,k) + V_m(n,k), \tag{1}$$

where $n$ and $k$ represent the time and frequency indices, $S_l(n,k)$ denotes the signal of the $l$-th source, $V_m(n,k)$ denotes the background noise component, and $A_{m,l}(k)$ denotes the acoustic transfer function (ATF) between the $l$-th source and the $m$-th microphone. The $M$ microphone signals can be expressed in vector notation as follows

$$\mathbf{y}(n,k) = \mathbf{A}(k)\,\mathbf{s}(n,k) + \mathbf{v}(n,k), \tag{2}$$

where $\mathbf{s}(n,k) = [S_1(n,k) \cdots S_L(n,k)]^T$, $\mathbf{y}(n,k) = [Y_1(n,k) \cdots Y_M(n,k)]^T$, and $\mathbf{A}(k) = [\mathbf{a}_1(k) \cdots \mathbf{a}_L(k)]$ with $\mathbf{a}_l(k) = [A_{1,l}(k) \cdots A_{M,l}(k)]^T$. In this paper, we consider the problem where only one of the $L$ sources is desired. Without loss of generality, we assume source 1 to be desired and rewrite (2) as follows

$$\mathbf{y}(n,k) = \mathbf{a}_1(k)\, S_1(n,k) + \sum_{u=2}^{L} \mathbf{a}_u(k)\, S_u(n,k) + \mathbf{v}(n,k). \tag{3}$$
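The signal model in (1) to (3) can be illustrated with a small NumPy sketch. The array sizes, the random complex Gaussian stand-ins for the STFT coefficients and ATFs, and the helper name `crandn` are assumptions made only for illustration; they are not part of the paper's simulation setup.

```python
import numpy as np

rng = np.random.default_rng(0)
M, L, N, K = 4, 3, 50, 8   # microphones, sources, frames, frequency bins (illustrative)

def crandn(*shape):
    """Complex Gaussian samples, used only as stand-ins for STFT coefficients."""
    return (rng.standard_normal(shape) + 1j * rng.standard_normal(shape)) / np.sqrt(2)

A = crandn(K, M, L)          # A[k] plays the role of the M x L ATF matrix A(k)
S = crandn(N, K, L)          # S[n, k] plays the role of the source vector s(n, k)
V = 0.01 * crandn(N, K, M)   # V[n, k] plays the role of the noise vector v(n, k)

# Eq. (2): y(n, k) = A(k) s(n, k) + v(n, k), evaluated for all n and k at once.
Y = np.einsum("kml,nkl->nkm", A, S) + V

# Eq. (3): desired source (index 0 here) plus the sum over the undesired sources.
Y_split = (A[None, :, :, 0] * S[:, :, 0, None]
           + np.einsum("kml,nkl->nkm", A[:, :, 1:], S[:, :, 1:])
           + V)
```

Both expressions produce the same `(N, K, M)` array, which is the sense in which (3) is only a regrouping of (2).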

The extraction of source signals from the received mixture in a standard blind source separation (BSS) algorithm is achieved by a demixing matrix $\mathbf{W}(k)$ as follows

$$\hat{\mathbf{s}}(n,k) = \mathbf{W}(k)\, \mathbf{y}(n,k), \tag{4}$$

where $\hat{\mathbf{s}}(n,k)$ is a vector of estimated sources at the output of the BSS algorithm, and each row of the demixing matrix represents a filter. For $L = M$, the demixing matrix $\mathbf{W}(k)$ can then be written out as follows

$$\mathbf{W}(k) = [\mathbf{w}_1(k)\ \mathbf{w}_2(k)\ \mathbf{w}_3(k) \cdots \mathbf{w}_M(k)]^H. \tag{5}$$

The aim in this paper is as follows: given the DOA of the desired source, compute a demixing matrix W(k), by minimizing the statistical dependence among the output signals ˆs(n, k), while ensuring that w1 (k) extracts the desired source signal.
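As a minimal sketch of (4) and (5), the following NumPy code applies a demixing matrix per frequency bin. It assumes a noise-free toy mixture with $M = L$ and a known, invertible mixing matrix, so that the ideal demixing matrix is simply its inverse; the function name `demix` and all sizes are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(1)
M, N, K = 3, 20, 4  # M = L here; illustrative sizes

A = rng.standard_normal((K, M, M)) + 1j * rng.standard_normal((K, M, M))
S = rng.standard_normal((N, K, M)) + 1j * rng.standard_normal((N, K, M))
Y = np.einsum("kml,nkl->nkm", A, S)      # noise-free mixture, Eq. (2) with v = 0

def demix(W, Y):
    """Eq. (4): s_hat(n, k) = W(k) y(n, k) for every frame n and bin k."""
    return np.einsum("kij,nkj->nki", W, Y)

# Eq. (5): the rows of W(k) are the filters w_m^H(k). The ideal choice below is
# only available in this toy setting, where A(k) is known and invertible.
W_ideal = np.linalg.inv(A)
S_hat = demix(W_ideal, Y)
w1 = W_ideal[:, 0, :].conj()             # w_1(k), the filter for the desired source
```

In practice $\mathbf{A}(k)$ is unknown, which is why the demixing matrix has to be estimated from the data.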

3 Geometrically Constrained Independent Vector Analysis

The optimization criterion employed in this paper to estimate the demixing matrix is based on minimization of mutual information. In Sect. 3.1, we review the concept of IVA proposed in [8]. In Sect. 3.2, we present the proposed geometrically constrained IVA algorithm for online extraction of the desired source. For simplicity of derivation, we assume $M = L$ and use the index $m$ for sources and microphones. In the performance evaluation, we demonstrate the applicability of the proposed algorithm to scenarios where $M \neq L$ as well.

3.1 Unconstrained IVA

The derivation of unconstrained IVA in this section follows from [8]. In standard IVA, the source signals are modelled as multivariate random variables $\mathbf{S}_m(n) = [S_m(n,1) \cdots S_m(n,K)]^T$, where $K$ denotes the total number of frequency bins. The cost function of IVA based on mutual information between the multivariate random variables $\hat{\mathbf{S}}_m(n)$ is then given by

$$J_{\text{iva}}(\mathbf{W}) = -\sum_{m=1}^{M} E\left\{\log p\left[\hat{\mathbf{S}}_m(n)\right]\right\} - \sum_{k=1}^{K} \log\left|\det[\mathbf{W}(k)]\right|. \tag{6}$$



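Gradient-based iterative algorithms are used to find a demixing matrix $\mathbf{W}(k)$ that minimizes (6). The iterative update for $\mathbf{W}(k)$ is given by

$$\mathbf{W}_b(k) = \mathbf{W}_{b-1}(k) - \eta\, \frac{\partial J_{\text{iva}}}{\partial \mathbf{W}_{b-1}(k)} = \mathbf{W}_{b-1}(k) - \eta\, \nabla\mathbf{W}_b(k), \tag{7}$$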

where $\eta$ ($\eta \geq 0$) is the learning rate of the algorithm and $b$ is the iteration index. The probability density function (PDF) of the output signals $\hat{\mathbf{S}}_m(n)$, required in (6) to compute the gradient, can be estimated using the data or modeled based on prior knowledge [6]. Since speech signals are known to have a supergaussian PDF, they are modelled in IVA using a multivariate Laplacian distribution as follows

$$p[\mathbf{S}_m(n)] = p[S_m(n,1), \cdots, S_m(n,K)] = \alpha \exp\left(-\sqrt{\sum_{k=1}^{K} |S_m(n,k)|^2}\right). \tag{8}$$
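To make the cost function concrete, the following sketch evaluates a sample estimate of (6) under the Laplacian prior in (8). It assumes that the expectation is replaced by an average over the available frames and that the constant $\log\alpha$ is dropped; the function name `iva_cost` and the array layout are assumptions for illustration.

```python
import numpy as np

def iva_cost(W, Y):
    """Sample estimate of Eq. (6) with the Laplacian prior of Eq. (8).

    W: (K, M, M) demixing matrices, Y: (N, K, M) microphone STFT frames.
    The constant log(alpha) is dropped because it does not depend on W.
    """
    S_hat = np.einsum("kij,nkj->nki", W, Y)              # Eq. (4)
    norms = np.sqrt(np.sum(np.abs(S_hat) ** 2, axis=1))  # (N, M): norm over bins k
    neg_log_p = norms.mean(axis=0).sum()                 # -sum_m E{log p[S_hat_m(n)]}
    log_det = np.log(np.abs(np.linalg.det(W))).sum()     # sum_k log|det W(k)|
    return neg_log_p - log_det
```

The $\log|\det|$ term penalizes demixing matrices that shrink towards singularity, which would otherwise trivially reduce the first term by scaling all outputs towards zero.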

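For the $m$-th output signal, score functions are then calculated as the following partial derivatives

$$\varphi^{(k)}\left[\hat{\mathbf{S}}_m(n)\right] = -\frac{\partial \log p\left[\hat{S}_m(n,1) \cdots \hat{S}_m(n,K)\right]}{\partial \hat{S}_m(n,k)} = \frac{\hat{S}_m(n,k)}{\sqrt{\sum_{k=1}^{K} |\hat{S}_m(n,k)|^2}}. \tag{9}$$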
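The score function in (9) normalizes each output by its norm across all frequency bins, which is what couples the bins in IVA. A minimal sketch follows, in which the function name `score` and the small constant `eps` (a guard against division by zero in silent frames) are assumptions that do not appear in (9).

```python
import numpy as np

def score(S_hat, eps=1e-12):
    """Eq. (9): phi^(k)[S_hat_m(n)] = S_hat_m(n, k) / sqrt(sum_k |S_hat_m(n, k)|^2).

    S_hat: (N, K, M) demixed STFT frames. eps is an assumed guard against
    division by zero for silent frames; it is not part of Eq. (9).
    """
    norms = np.sqrt(np.sum(np.abs(S_hat) ** 2, axis=1, keepdims=True))
    return S_hat / (norms + eps)
```

For a single output with two bins holding $3$ and $4j$, the norm is 5 and the score values are $0.6$ and $0.8j$.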

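Using (9), the gradient $\nabla\mathbf{W}_{\text{iva}}(k)$ of the cost function in (6) is given by

$$\nabla\mathbf{W}_{\text{iva}}(k) = \frac{\partial J_{\text{iva}}}{\partial \mathbf{W}(k)} = E\left\{\boldsymbol{\varphi}^{(k)}(n)\, \mathbf{y}^H(n,k)\right\} - \mathbf{W}^{-H}(k), \tag{10}$$

where

$$\boldsymbol{\varphi}^{(k)}(n) = \left[\varphi^{(k)}\left[\hat{\mathbf{S}}_1(n)\right]\ \varphi^{(k)}\left[\hat{\mathbf{S}}_2(n)\right] \cdots \varphi^{(k)}\left[\hat{\mathbf{S}}_M(n)\right]\right]^T. \tag{11}$$

3.2 Geometrically Constrained IVA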

Due to the inherent scaling ambiguity, the estimated signals in a BSS system must be normalized to a reference microphone. We can therefore replace the desired source ATF vector $\mathbf{a}_1(k)$ in (3) by the relative transfer function (RTF) vector of the desired source with respect to a reference microphone. Assuming far-field propagation, the RTF with respect to the first microphone can be written as

$$\mathbf{g}_1(k) = \left[1\ \ e^{j(2\pi f/c)[\mathbf{r}_2 - \mathbf{r}_1]^T \mathbf{q}_1} \cdots e^{j(2\pi f/c)[\mathbf{r}_M - \mathbf{r}_1]^T \mathbf{q}_1}\right]^T, \tag{12}$$

where $\mathbf{r}_m$ is the location of the $m$-th microphone, $\mathbf{q}_1$ represents a unit-norm vector pointing in the direction of the desired source, $c$ is the speed of sound, and $f = k F_s (2K)^{-1}$ is the frequency in Hertz with $F_s$ being the sampling frequency.
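The far-field RTF vector in (12) can be computed directly from the array geometry and the DOA. In the sketch below, the function name `far_field_rtf`, the speed of sound of 343 m/s, the four-microphone circular layout, the 60 degree DOA, and the bin settings are all assumptions chosen for illustration.

```python
import numpy as np

def far_field_rtf(mic_pos, doa_unit, k, K, fs, c=343.0):
    """Eq. (12): far-field RTF vector g_1(k) relative to the first microphone.

    mic_pos: (M, 3) microphone positions r_m in metres,
    doa_unit: (3,) unit-norm vector q_1 pointing towards the desired source,
    k: frequency-bin index, K: number of frequency bins, fs: sampling rate in Hz.
    c = 343 m/s is an assumed speed of sound.
    """
    f = k * fs / (2 * K)                         # frequency in Hz
    delays = (mic_pos - mic_pos[0]) @ doa_unit   # [r_m - r_1]^T q_1 for each m
    return np.exp(1j * 2 * np.pi * f / c * delays)

# Hypothetical 4-microphone circular array of diameter 2.5 cm in the x-y plane.
angles = 2 * np.pi * np.arange(4) / 4
mics = 0.0125 * np.stack([np.cos(angles), np.sin(angles), np.zeros(4)], axis=1)
q1 = np.array([np.cos(np.pi / 3), np.sin(np.pi / 3), 0.0])  # DOA of 60 degrees
g1 = far_field_rtf(mics, q1, k=100, K=512, fs=16000)
```

The first entry is always 1 because the first microphone is the reference, and all entries have unit magnitude because the far-field model contains phase shifts only.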


We define a penalty function to restrict the Euclidean angle between $\mathbf{w}_1(k)$ and $\mathbf{g}_1(k)$. The Euclidean angle between $\mathbf{w}_1(k)$ and $\mathbf{g}_1(k)$ is defined as

$$\cos\Theta(k) = \frac{\mathrm{Re}\left\{\mathbf{w}_1^H(k)\, \mathbf{g}_1(k)\right\}}{\|\mathbf{w}_1(k)\|\, \|\mathbf{g}_1(k)\|}. \tag{13}$$

The proposed penalty function that steers the filter of interest $\mathbf{w}_1$ in the direction of the desired source is then given by

$$J_p(\mathbf{w}_1) = \sum_{k=1}^{K} \left[\cos\Theta(k) - 1\right]^2. \tag{14}$$
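The behaviour of the penalty in (13) and (14) is easy to check numerically. The sketch below uses hypothetical function names `cos_theta` and `penalty` and a two-microphone, single-bin toy vector.

```python
import numpy as np

def cos_theta(w1, g1):
    """Eq. (13) for all bins: w1 and g1 have shape (K, M)."""
    inner = np.real(np.sum(w1.conj() * g1, axis=1))   # Re{w_1^H(k) g_1(k)}
    return inner / (np.linalg.norm(w1, axis=1) * np.linalg.norm(g1, axis=1))

def penalty(w1, g1):
    """Eq. (14): J_p(w_1) = sum_k [cos Theta(k) - 1]^2."""
    return np.sum((cos_theta(w1, g1) - 1.0) ** 2)

g = np.array([[1.0, 1.0j]])
p_aligned = penalty(3.0 * g, g)                      # same direction, different scale
p_orthogonal = penalty(np.array([[1.0, -1.0j]]), g)  # w^H g = 0
p_opposite = penalty(-g, g)                          # cos Theta = -1
```

The three cases give penalties of 0, 1, and 4 per bin. The penalty is therefore insensitive to a positive real scaling of $\mathbf{w}_1(k)$ and grows as the filter rotates away from the steering vector.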

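The cost function for the geometrically constrained IVA algorithm is then obtained by augmenting the IVA cost function in (6) by $J_p$, i.e.,

$$J_{\text{civa}}(\mathbf{W}) = J_{\text{iva}}(\mathbf{W}) + \lambda\, J_p(\mathbf{w}_1), \tag{15}$$

where $\lambda$ ($\lambda \geq 0$) is the penalty parameter. The gradient of $J_{\text{civa}}$ with respect to the elements of the demixing matrix can be expressed as

$$\nabla\mathbf{W}_{\text{civa}}(k) = \frac{\partial J_{\text{civa}}}{\partial \mathbf{W}(k)} = \nabla\mathbf{W}_{\text{iva}}(k) + \lambda\, \nabla\mathbf{W}_p(k), \tag{16}$$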

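where $\nabla\mathbf{W}_{\text{iva}}(k)$ is the gradient of $J_{\text{iva}}$ given in (10) and $\nabla\mathbf{W}_p(k)$ is the gradient of $J_p$. Since the penalty function $J_p$ is only a function of $\mathbf{w}_1(k)$, the gradient of $J_p$ with respect to the filters $\mathbf{w}_u(k)$ ($u = 2, 3, \ldots, M$) is zero, i.e.,

$$\nabla\mathbf{W}_p(k) = \frac{\partial J_p}{\partial \mathbf{W}(k)} = \begin{bmatrix} \nabla\mathbf{w}_1^H(k) \\ \mathbf{0}_{(M-1) \times M} \end{bmatrix}, \tag{17}$$

where the gradient of the proposed penalty function with respect to $\mathbf{w}_1^*(k)$ is derived based on the theorems in [3]. It is given by

$$\nabla\mathbf{w}_1(k) = C \cdot \left(\cos\Theta(k) - 1\right) \left[\mathbf{g}_1(k) - \mathrm{Re}\left\{\mathbf{w}_1^H(k)\, \mathbf{g}_1(k)\right\} \frac{\mathbf{w}_1(k)}{\|\mathbf{w}_1(k)\|^2}\right], \tag{18}$$

where $C = 1 / \left(\|\mathbf{w}_1(k)\| \cdot \|\mathbf{g}_1(k)\|_2\right)$. Using (10) and (17), the gradient matrix for the geometrically constrained IVA algorithm is given by

$$\nabla\mathbf{W}_{\text{civa}}(k) = \underbrace{E\left\{\boldsymbol{\varphi}^{(k)}(n)\, \mathbf{y}^H(n,k)\right\} - \mathbf{W}^{-H}(k)}_{\nabla\mathbf{W}_{\text{iva}}(k)} + \lambda \underbrace{\begin{bmatrix} \nabla\mathbf{w}_1^H(k) \\ \mathbf{0}_{(M-1) \times M} \end{bmatrix}}_{\nabla\mathbf{W}_p(k)}. \tag{19}$$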
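The closed-form penalty gradient in (18) can be checked against a finite-difference estimate of the complex gradient with respect to $\mathbf{w}_1^*(k)$, in the sense of [3]. The sketch below treats a single frequency bin and takes $C$ as the reciprocal of the product of the two Euclidean norms; the function names, the step size `h`, and the random test vectors are assumptions for illustration.

```python
import numpy as np

def penalty_bin(w, g):
    """[cos Theta - 1]^2 for a single frequency bin, from Eqs. (13)-(14)."""
    c = np.real(np.vdot(w, g)) / (np.linalg.norm(w) * np.linalg.norm(g))
    return (c - 1.0) ** 2

def penalty_grad(w, g):
    """Eq. (18): gradient of the per-bin penalty with respect to conj(w_1)."""
    nw, ng = np.linalg.norm(w), np.linalg.norm(g)
    re_inner = np.real(np.vdot(w, g))            # Re{w_1^H g_1}; vdot conjugates w
    cos_t = re_inner / (nw * ng)
    C = 1.0 / (nw * ng)
    return C * (cos_t - 1.0) * (g - re_inner * w / nw**2)

def numeric_grad(f, w, h=1e-6):
    """Central-difference estimate of df/d(conj w) = 0.5 * (df/dx + 1j * df/dy)."""
    grad = np.zeros_like(w)
    for i in range(w.size):
        e = np.zeros_like(w)
        e[i] = h
        d_re = (f(w + e) - f(w - e)) / (2 * h)
        d_im = (f(w + 1j * e) - f(w - 1j * e)) / (2 * h)
        grad[i] = 0.5 * (d_re + 1j * d_im)
    return grad

def penalty_grad_matrix(w, g):
    """Eq. (17): first row is the Hermitian transpose of Eq. (18), other rows are zero."""
    M = w.size
    G = np.zeros((M, M), dtype=complex)
    G[0] = penalty_grad(w, g).conj()
    return G

rng = np.random.default_rng(2)
w = rng.standard_normal(4) + 1j * rng.standard_normal(4)
g = np.exp(1j * rng.uniform(0, 2 * np.pi, 4))
err = np.max(np.abs(penalty_grad(w, g) - numeric_grad(lambda x: penalty_bin(x, g), w)))
```

The gradient vanishes when $\mathbf{w}_1(k)$ is a positive multiple of $\mathbf{g}_1(k)$, so the penalty exerts no force once the filter is aligned with the steering vector.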

Similar to the online IVA algorithm derived in [7], we obtain an online variant of the CIVA algorithm by omitting the expectation operator in (19). To avoid divergence of the algorithm due to source signal fluctuations, we normalize the gradient matrix at each frame by its Frobenius norm $\|\cdot\|_F$ and update as follows

$$\mathbf{W}_n(k) = \mathbf{W}_{n-1}(k) - \eta\, \frac{\nabla\mathbf{W}_{n,\text{civa}}(k)}{\|\nabla\mathbf{W}_{n,\text{civa}}(k)\|_F}. \tag{20}$$

Finally, scaling ambiguity is mitigated by multiplying with $\mathbf{W}_n^{-1}(k) \odot \mathbf{I}$ at each frame, where $\odot$ denotes the element-wise product.
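One frame of the online update can be sketched as follows, combining the instantaneous form of (19) with the normalized step in (20). This is an illustrative reading of the equations rather than the authors' implementation: the function name `civa_online_step`, the guard `eps`, and the array layout are assumptions, and the final scaling correction is left out.

```python
import numpy as np

def civa_online_step(W_prev, y, g1, eta, lam, eps=1e-12):
    """One online CIVA update for a single frame n, following Eqs. (19)-(20).

    W_prev: (K, M, M) complex demixing matrices W_{n-1}(k),
    y: (K, M) microphone frame y(n, k), g1: (K, M) steering vectors g_1(k).
    eps is an assumed numerical guard; the scaling correction is not applied here.
    """
    K, M, _ = W_prev.shape
    s_hat = np.einsum("kij,kj->ki", W_prev, y)                          # Eq. (4)
    phi = s_hat / (np.sqrt(np.sum(np.abs(s_hat) ** 2, axis=0)) + eps)   # Eq. (9)
    W_new = np.empty_like(W_prev, dtype=complex)
    for k in range(K):
        # Eq. (10) without the expectation operator.
        grad = np.outer(phi[k], y[k].conj()) - np.linalg.inv(W_prev[k]).conj().T
        # Eqs. (17)-(18): the penalty only affects the first row, w_1^H(k).
        w1 = W_prev[k, 0].conj()
        nw, ng = np.linalg.norm(w1), np.linalg.norm(g1[k])
        re_inner = np.real(np.vdot(w1, g1[k]))
        grad_w1 = (re_inner / (nw * ng) - 1.0) / (nw * ng) * (g1[k] - re_inner * w1 / nw**2)
        grad[0] += lam * grad_w1.conj()
        # Eq. (20): step along the Frobenius-normalized gradient.
        W_new[k] = W_prev[k] - eta * grad / (np.linalg.norm(grad) + eps)
    return W_new
```

Because the gradient is normalized, every bin moves by a step of Frobenius norm $\eta$ per frame regardless of the signal level, which is the mechanism that guards against divergence when the source power fluctuates.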


4 Performance Evaluation

The quality of the desired speech signal at the output of the proposed algorithm was evaluated using simulated audio data. To obtain the microphone signals, clean speech signals sampled at 16 kHz were convolved with simulated room impulse responses. Room impulse responses were generated using [5]. A circular microphone array with a diameter of 2.5 cm was employed. The STFT frame size was 1024 samples with 50 % overlap. In all experiments, diffuse noise with a signal-to-noise ratio (SNR) of 30 dB and sensor noise with an SNR of 40 dB were added to the microphone signals. The segmental signal-to-interference ratio (segSIR) and the segmental speech distortion index (segSD) [2] were used to measure the performance, and the desired source signal at the reference microphone was used as ground truth. The performance of the proposed online CIVA algorithm was compared to the online IVA algorithm proposed in [7]. The learning rate $\eta$ was set to 150 and $\lambda$ was set to 10. The filter $\mathbf{w}_1(k)$ that extracts the desired signal in CIVA was initialized with $\mathbf{g}_1(k)$ while all the other filters were initialized with columns of an identity matrix.

In the first scenario, we evaluated the quality of the extracted desired signal when the number of interferers was fixed. The simulations were repeated for $T_{60} = 150$ ms and $T_{60} = 300$ ms on a 20 s speech segment with source positions as depicted in Fig. 1. Sources 1 to 3 were each selected as the desired source in turn, and the results were averaged over the three simulations. The performance measures for $M > L$, $M = L$ and $M < L$ are given in Table 1. With $M = L$, unconstrained IVA provided better interferer suppression as the solution in this case is unique. Nevertheless, constrained IVA still resulted in lower speech distortion. When $M$ was increased from 3 to 4, the CIVA algorithm provided a gain in segSIR of 5.3 dB at $T_{60} = 150$ ms and 3.5 dB at $T_{60} = 300$ ms along with a decrease in desired speech distortion, while the performance of the IVA algorithm deteriorated due to non-uniqueness of the solution with $M = 4$, $L = 3$. The performance of the IVA algorithm deteriorated further when $L$ was increased to 5, while the CIVA algorithm provided a segSIR of 8.7 dB and 3.6 dB for $T_{60} = 150$ ms and $T_{60} = 300$ ms, respectively. Moreover, it can be noted that the CIVA algorithm maintained a very low desired speech distortion in all cases as the proposed penalty restricts the Euclidean angle between $\mathbf{w}_1(k)$ and $\mathbf{g}_1(k)$.

Fig. 1. Room geometry

Table 1. Performance measures for fixed number of sources

| Algorithms | segSD ($T_{60}$ = 150 ms) | segSIR (dB) ($T_{60}$ = 150 ms) | segSD ($T_{60}$ = 300 ms) | segSIR (dB) ($T_{60}$ = 300 ms) |
|---|---|---|---|---|
| Unprocessed mixture | | 1.9 | | 1.5 |
| IVA [7] (M = 3, L = 3) | 0.21 | 14.9 | 0.31 | 4.7 |
| CIVA (M = 3, L = 3) | 0.12 | 11.7 | 0.24 | 5.2 |
| Unprocessed mixture | | 1.9 | | 1.5 |
| IVA [7] (M = 4, L = 3) | 0.27 | 8.2 | 0.39 | 3.8 |
| CIVA (M = 4, L = 3) | 0.04 | 17.0 | 0.07 | 8.7 |
| Unprocessed mixture | | −0.3 | | 0.7 |
| IVA [7] (M = 4, L = 5) | 0.54 | 0.2 | 0.57 | 0.1 |
| CIVA (M = 4, L = 5) | 0.05 | 8.7 | 0.08 | 3.6 |

In the second scenario, the proposed algorithm was evaluated with a time varying number of interferers with $M = 6$ and $T_{60} = 150$ ms. The number of active sources over time is plotted in Fig. 2 (Top). Source 1 was selected as desired. The segSIR improvement calculated for segments of 1 s is plotted in Fig. 2 (Bottom). The experiment showed the effectiveness of the proposed algorithm with an unknown, time varying number of interferers. It must be noted that the online IVA algorithm did not converge as $M$ was greater than $L$. The proposed online CIVA algorithm, however, converges in all cases.

[Figure 2 plots. Top panel: number of speakers versus time [s], with the active-source intervals labelled Source (1 and 2), (1 to 4), (1 to 3), and (1 to 5). Bottom panel: segmental SIR improvement [dB] versus time [s] for CIVA and IVA.]

Fig. 2. Top: Activity pattern of speech sources over time; Bottom: Segmental SIR improvement over time. M = 6, T60 = 150 ms.


5 Conclusions

An online constrained IVA algorithm was developed to extract the desired speech signal given the desired source DOA. The DOA was used to obtain the array steering vector. A penalty function was then added to IVA to penalize the Euclidean angle between one separation filter and the array steering vector. Simulations demonstrated the applicability of the algorithm to scenarios with a fixed and a time-varying number of interferers. Future work includes comparison of the algorithm against beamforming algorithms and evaluation with measured data.

References

1. Araki, S., Sawada, H., Makino, S.: Blind speech separation in a meeting situation with maximum SNR beamformers. In: IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) (2007)
2. Benesty, J., Chen, J., Huang, Y.: Microphone Array Signal Processing. Springer, Berlin (2008)
3. Brandwood, D.: A complex gradient operator and its application in adaptive array theory. IEE Proc. F Commun. Radar Signal Process. 130, 11–16 (1983)
4. Cardoso, J.F.: Infomax and maximum likelihood for blind source separation. IEEE Signal Process. Lett. 4, 112–114 (1997)
5. Habets, E.A.P.: Room Impulse Response Generator. Technical report, Technische Universiteit Eindhoven (2006)
6. Hyvärinen, A., Karhunen, J., Oja, E.: Independent Component Analysis. Wiley, New York (2001)
7. Kim, T.: Real-time independent vector analysis for convolutive blind source separation. IEEE Trans. Circuits Syst. 1, 1431–1438 (2010)
8. Kim, T., Lee, S.Y.: Blind source separation exploiting higher-order frequency dependencies. IEEE Trans. Audio Speech Lang. Process. 15, 70–79 (2006)
9. Knaak, M., Araki, S., Makino, S.: Geometrically constrained independent component analysis. IEEE Trans. Audio Speech Lang. Process. 15, 715–726 (2007)
10. Li, H., Adali, T.: A class of complex ICA algorithms based on the kurtosis cost function. IEEE Trans. Audio Speech Lang. Process. 19, 408–420 (2008)
11. Novey, M., Adali, T.: Complex ICA by negentropy maximization. IEEE Trans. Neural Networks 19, 596–609 (2008)
12. Parra, L., Spence, C.: Convolutive blind separation of non-stationary sources. IEEE Trans. Speech Audio Process. 8, 320–327 (2000)
13. Parra, L.C., Alvino, C.V.: Geometric source separation: merging convolutive source separation with geometric beamforming. IEEE Trans. Speech Audio Process. 10, 352–362 (2002)
14. Taseska, M., Habets, E.A.P.: Spotforming using distributed microphone arrays. In: IEEE Workshop on Applications of Signal Processing to Audio and Acoustics (WASPAA) (2013)
15. Zhang, W., Rao, B.D.: Combining independent component analysis with geometric information and its application to speech processing. In: IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) (2009)