
ROBUST UNDERDETERMINED BLIND AUDIO SOURCE SEPARATION OF SPARSE SIGNALS IN THE TIME-FREQUENCY DOMAIN

Si Mohamed Aziz Sbaï

Abdeldjalil Aïssa-El-Bey

Dominique Pastor

Institut Télécom; Télécom Bretagne; UMR CNRS 3192 Lab-STICC, Technopôle Brest Iroise, France
Université européenne de Bretagne, France

ABSTRACT

We address the problem of blind source separation in the underdetermined and instantaneous mixture case. The proposed method is based on an algorithm developed by Aissa-El-Bey et al. This algorithm requires a good choice of the noise threshold and does not take the noise contribution into account in the inversion process. To overcome these drawbacks, this paper presents a robust underdetermined blind source separation approach. Robustness is achieved by estimating the noise standard deviation and using this estimate both in the inversion process and in the expression of the noise threshold. The good performance of the proposed method is shown by comparison with state-of-the-art methods.

Index Terms: Robustness, sparsity, blind source separation, noise variance, subspace projection.

1. INTRODUCTION

The objective of source separation systems is to estimate the original source signals from their mixtures. The first attempts at solving the audio source separation problem were carried out in the context of Computational Auditory Scene Analysis (CASA). The objective of this approach [1] is to model the process by which the human auditory system perceives sounds and achieves source separation. Afterwards, several approaches, including blind source separation (BSS), have been proposed to address this problem. The term blind indicates that no or very little prior information about the sources and the mixing environment is available. BSS is an important research topic in a variety of fields, including speech and audio processing [2], radar processing [3], medical imaging [4], and communication [5].
BSS problems can be classified according to the nature of the mixing process (instantaneous, convolutive) and the ratio between the number of sources and the number of sensors (determined, underdetermined, overdetermined). This paper is restricted to the instantaneous case, where each observation consists of a sum of sources with different signal intensities in the presence of noise. The underdetermined BSS (UBSS) problem is an ill-posed problem, and its solution cannot be derived without additional assumptions. A general framework for UBSS is to exploit the sparsity of the sources. In this paper, in continuation of [6], [7], [8], we exploit the sparsity of the nonstationary sources in the time-frequency domain. A source is said to be sparse in a given signal representation domain if most of its samples are (almost) zero. The UBSS algorithms proposed in [6], [8] estimate the unknown mixing matrix from the audio mixture data by using a clustering algorithm in the time-frequency domain, and then recover the source signals. These algorithms depend on the noise level and require a good choice of the noise threshold to reject the time-frequency

978-1-4577-0539-7/11/$26.00 ©2011 IEEE


points where the signal is absent. The problem of rejecting (resp. accepting) the time-frequency points where the signal is absent (resp. present) can be considered as a statistical testing problem. The main issues are that very little is known about the signal and that the noise level can be unknown. Nonparametric and robust statistics address such situations [9]. In the present paper, we therefore use recent results on problems of the same type to estimate the noise level. The estimation of the noise level is performed via the Modified Complex Essential Supremum Estimate (MC-ESE) introduced in [10]. The MC-ESE can be regarded as an alternative to standard robust scale estimators such as the popular median absolute deviation (MAD) estimator [11]. In fact, for the application considered below, the MC-ESE outperforms the MAD. The noise standard deviation estimate is then used in place of the unknown true value in the expression of the standard quadratic, or Wald's, test, which guarantees a specific false alarm probability when the noise standard deviation is known, for deciding on the presence or absence of signals. The decision scheme resulting from this estimate-and-plug-in approach basically compares the observation norm to a threshold height whose computation involves the noise standard deviation estimate. Observation norms above this threshold must be kept because they are likely to pertain to signals relevant for the processing. Estimating the noise standard deviation also makes it possible to include this noise information in the inversion process, so as to optimize the estimation of the source Short-Time Fourier Transform (STFT) coefficients.

Summarizing, the contribution of this paper is a robust UBSS (RUBSS) approach that estimates the noise level to remove the noise contribution from the mixtures and optimizes the inversion process.
Note that both the subspace-based methods in [6] and the proposed method require knowledge of the mixing matrix $\mathbf{A}$. Since $\mathbf{A}$ can be estimated by the method described in [6], we work with the estimate of $\mathbf{A}$ in the sequel. This paper is organized as follows. The next section reviews the subspace-based algorithms for time-frequency nondisjoint sources. A RUBSS algorithm is proposed in Section 3. Several measurements on real data are given in Section 4 for numerical performance evaluation, followed by conclusions in Section 5.

2. SUBSPACE BASED ALGORITHM

Consider the following instantaneous mixing model:

\[ \mathbf{x}(t) = \mathbf{A}\,\mathbf{s}(t) + \mathbf{n}(t), \tag{1} \]

where, for any $t \in \mathbb{R}$, $\mathbf{s}(t) = [s_1(t), s_2(t), \dots, s_N(t)]^T$ is the $N \times 1$ source vector, $\mathbf{x}(t) = [x_1(t), x_2(t), \dots, x_M(t)]^T$ is the

ICASSP 2011

$M \times 1$ mixture vector, $\mathbf{A} = [\mathbf{a}_1, \mathbf{a}_2, \dots, \mathbf{a}_N]$ is the complex $M \times N$ mixing matrix and $\mathbf{n}(t) = [n_1(t), n_2(t), \dots, n_M(t)]^T$ is additive noise. It is assumed that the noise components $(n_k(t))_{1 \le k \le M}$ are mutually decorrelated and independent of the sources. Given $\mathbf{x}(t)$, the purpose of UBSS is to estimate $\mathbf{s}(t)$ in the underdetermined case, i.e., when $N > M$. Without loss of generality, we assume that the column vectors of $\mathbf{A}$ all have unit norm, i.e., $\|\mathbf{a}_i\| = 1$ for all $i \in \{1, 2, \dots, N\}$.
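To make the mixing model (1) concrete, the following Python sketch builds an underdetermined mixture with unit-norm mixing columns. It is only an illustration under stated assumptions: the helper names `make_mixing_matrix` and `mix`, the random mixing matrix, the placeholder Gaussian sources (not audio) and the noise level are all hypothetical choices that do not come from the paper.

```python
import numpy as np

def make_mixing_matrix(n_sensors, n_sources, rng):
    """Random complex M x N matrix with unit-norm columns (hypothetical helper)."""
    A = (rng.standard_normal((n_sensors, n_sources))
         + 1j * rng.standard_normal((n_sensors, n_sources)))
    # Normalize each column so that ||a_i|| = 1, as assumed in the text.
    return A / np.linalg.norm(A, axis=0, keepdims=True)

def mix(A, s, noise_std, rng):
    """Instantaneous noisy mixture x(t) = A s(t) + n(t), one column per sample."""
    n_sensors, n_samples = A.shape[0], s.shape[1]
    # Real-valued white Gaussian noise is used here for simplicity (assumption).
    n = noise_std * rng.standard_normal((n_sensors, n_samples))
    return A @ s + n

rng = np.random.default_rng(0)
M, N, T = 3, 4, 8192                 # sensors, sources, samples; N > M
A = make_mixing_matrix(M, N, rng)
s = rng.standard_normal((N, T))      # placeholder sources, not real audio
x = mix(A, s, noise_std=0.1, rng=rng)
print(x.shape)                       # one row per sensor, one column per sample
```

With N = 4 sources and M = 3 sensors, `A` has more columns than rows, so `x` cannot be inverted directly; this is the situation the sparsity assumptions are meant to address.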

Applying the STFT to (1) yields the following model, where the notation is obvious:

\[ \mathcal{S}_{\mathbf{x}}(t, f) = \mathbf{A}\,\mathcal{S}_{\mathbf{s}}(t, f) + \mathcal{S}_{\mathbf{n}}(t, f). \tag{2} \]

In order to preserve only the time-frequency points $(t, f)$ with sufficient energy, a noise thresholding procedure is carried out in [6]: for each time slice specified by some time instant $t$ of the time-frequency domain, we keep the points such that

\[ \frac{\|\mathcal{S}_{\mathbf{x}}(t, f)\|}{\max_{\xi} \|\mathcal{S}_{\mathbf{x}}(t, \xi)\|} > \epsilon, \tag{3} \]

where $\epsilon$ is a fixed threshold (typically, $\epsilon = 0.05$). By keeping points for which (3) holds, time-frequency points with negligible energy are removed. In [6], the following assumptions are made:

• (A1): The column vectors $(\mathbf{a}_i)_{1 \le i \le N}$ of $\mathbf{A}$ are pairwise linearly independent.
• (A2): The number of active sources at any $(t, f)$ is strictly less than the number $M$ of sensors.

Based on (A2), let us denote, at a given time-frequency point $(u, \xi)$, the indexes of the active sources by $J = \{i_1, i_2, \dots, i_r\}$ ($r < M$). Let $\mathbf{s}_J = [s_{i_1}, s_{i_2}, \dots, s_{i_r}]^T$ be the column vector of those active sources, and $\mathbf{A}_J = [\mathbf{a}_{i_1}, \mathbf{a}_{i_2}, \dots, \mathbf{a}_{i_r}]$. Let $\mathbf{P} = \mathbf{I}_M - \mathbf{A}_J (\mathbf{A}_J^H \mathbf{A}_J)^{-1} \mathbf{A}_J^H$ be the orthogonal projection matrix onto the noise subspace of $\mathbf{A}_J$, where $\mathbf{I}_M$ is the $M \times M$ identity matrix. If $s_i$ is an active source, $\|\mathbf{P}\mathbf{a}_i\|$ is very small (zero in the noiseless case). This remark enables us to identify the indexes $i_1, i_2, \dots, i_r$, and hence the sources present at $(u, \xi)$. Indeed, in this case, (2) reduces to

\[ \mathcal{S}_{\mathbf{x}}(t, f) = \mathbf{A}_J\,\mathcal{S}_{\mathbf{s}_J}(t, f) + \mathcal{S}_{\mathbf{n}}(t, f). \tag{4} \]

Then the STFT coefficients of these active sources can be recovered using

\[ \mathcal{S}_{\mathbf{s}_J}(u, \xi) \approx \mathbf{A}_J^{\#}\,\mathcal{S}_{\mathbf{x}}(u, \xi), \tag{5} \]

where $\mathbf{A}_J^{\#} = (\mathbf{A}_J^H \mathbf{A}_J)^{-1} \mathbf{A}_J^H$ is the Moore-Penrose pseudoinverse of $\mathbf{A}_J$. In [12], the authors show that the subspace-based method cannot be guaranteed to work well under the mere condition that the column vectors of the mixing matrix are pairwise linearly independent (A1). In fact, the subspace method requires that any $M \times M$ submatrix of the mixing matrix have full rank, a stronger condition than (A1). Thus, throughout the rest of the paper, we make the following assumption instead of (A1):

• (A1'): Any $M \times M$ submatrix of the mixing matrix has full rank, that is, for all $J \subset \{1, 2, \dots, N\}$ of cardinality less than $M$, $(\mathbf{a}_j)_{j \in J}$ are linearly independent.

In summary, the procedure in [6] for UBSS is as follows:

1. STFT computation of the mixtures and noise thresholding (3): to select the time-frequency components with sufficient energy;
2. Vector clustering: to estimate the matrix $\mathbf{A}$;
3. STFT estimation of the sources (5);
4. Inverse STFT: to recover the signal in the time domain.

3. RUBSS ALGORITHM

Given the STFTs of the mixtures, the first stage in the subspace-based algorithm [6] is the noise thresholding procedure. Our main contributions focus on this stage. Rather than (3), we decide on the presence of the signal by a thresholding test.

3.1. Threshold height

In our model, for each sensor, the observations (STFT coefficients) are assumed to result from signals, randomly present or absent, in independent additive white Gaussian noise (AWGN). By transforming the input complex observations into a real array, this means the following: 1) each observation is assumed to be noise alone or the sum of some random signal and noise; 2) noise is supposed to be "white and Gaussian" in that it is Gaussian distributed with null mean and covariance matrix equal to $\sigma_0^2$ times the $2 \times 2$ identity matrix $\mathbf{I}_2$. We thus model our sequence of observations by a sequence $Y = (Y_k)_{k \in \mathbb{N}}$ such that, for every $k \in \mathbb{N}$, $Y_k = \varepsilon_k \Lambda_k + X_k$, where $\varepsilon = (\varepsilon_k)_{k \in \mathbb{N}}$ is a sequence of random variables valued in $\{0, 1\}$, $\Lambda = (\Lambda_k)_{k \in \mathbb{N}}$ stands for some sequence of 2-dimensional real random signals and $X = (X_k)_{k \in \mathbb{N}}$ is some 2-dimensional WGN with standard deviation $\sigma_0$, that is, the 2-dimensional random vectors $X_k$, $k \in \mathbb{N}$, are i.i.d. Gaussian distributed with null mean vector and covariance matrix $\sigma_0^2 \mathbf{I}_2$. We write $Y = \varepsilon \Lambda + X$. In this model, $Y$ is thus the sequence of observations, $\Lambda$ is the sequence of signals that can be observed in independent AWGN represented by the sequence $X$, and $\varepsilon_k$ models the possible occurrence of $\Lambda_k$, so that each $Y_k$ obeys a binary hypothesis testing model where the null hypothesis is that $\varepsilon_k = 0$ and the alternative one is that $\varepsilon_k = 1$. For each $k$, $\varepsilon_k$, $\Lambda_k$ and $X_k$ are assumed to be independent. For all $u \in \mathbb{C}$, let $T_h$ define the thresholding test with threshold height $h$, i.e.,

\[ T_h(u) = \begin{cases} 1 & \text{if } |u| \ge h, \\ 0 & \text{if } |u| < h. \end{cases} \tag{6} \]

For all $k \in \mathbb{N}$, the false alarm probability of $T_h$ is $P_{FA} = P[\|X_k\| \ge h]$. In many applications, such as radar, we wish to ensure a constant false alarm probability fixed in advance. Since noise is supposed to be Gaussian distributed with null mean and covariance matrix equal to $\sigma_0^2$ times the $2 \times 2$ identity matrix $\mathbf{I}_2$, the false alarm probability is given by $P_{FA} = e^{-h^2 / (2\sigma_0^2)}$.
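The closed-form false alarm probability above can be checked numerically. The following Python sketch is a minimal Monte Carlo illustration, assuming complex noise samples whose real and imaginary parts are i.i.d. Gaussian with standard deviation σ0; the function names, the value of σ0, the threshold height and the sample size are illustrative choices that do not come from the paper.

```python
import numpy as np

def threshold_test(u, h):
    """Thresholding test T_h of (6): 1 where |u| >= h, 0 otherwise."""
    return (np.abs(u) >= h).astype(int)

def false_alarm_probability(h, sigma0):
    """Closed form P_FA = exp(-h^2 / (2 sigma0^2)) for 2-D white Gaussian noise."""
    return np.exp(-h**2 / (2.0 * sigma0**2))

rng = np.random.default_rng(0)
sigma0 = 0.5          # illustrative noise standard deviation (assumption)
h = 1.0               # illustrative threshold height (assumption)
n_samples = 200_000

# Noise-only observations: real and imaginary parts are i.i.d. N(0, sigma0^2).
noise = sigma0 * (rng.standard_normal(n_samples)
                  + 1j * rng.standard_normal(n_samples))

# Fraction of noise-only samples that the test wrongly declares as signal.
empirical = threshold_test(noise, h).mean()
theoretical = false_alarm_probability(h, sigma0)
print(f"empirical P_FA = {empirical:.4f}, closed form = {theoretical:.4f}")
```

The two values should agree up to Monte Carlo fluctuation, since the norm of a 2-dimensional Gaussian vector with covariance σ0² I₂ is Rayleigh distributed.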
Thus, the threshold that ensures a fixed false alarm probability $P_{FA}$ is $h = \sigma_0 \sqrt{-2 \ln P_{FA}}$. It only remains to estimate the noise standard deviation with a robust estimation algorithm. This is what we describe in the subsection below.

3.2. MC-ESE: Estimation of noise standard deviation

MC-ESE stands for modified complex essential supremum estimate. This algorithm, analysed in [13], concerns a finite number of observations that are 2-dimensional independent real random vectors. With the notation of the preceding subsection, the MC-ESE addresses the problem of estimating the noise standard deviation $\sigma_0$ without prior knowledge of the distributions of the signals. The MC-ESE is thus an alternative to the median

absolute deviation about the median (MAD) [11], which can be used to estimate the noise standard deviation under the same assumptions as above. In statistical terms, the signals play the role of outliers amongst the sequence of observations. Inasmuch as the MAD estimator performs well in the presence of very few outliers amongst a sequence of independent and identically distributed (i.i.d.) observations, this estimator will yield a very good estimate of the noise standard deviation, provided that the signal priors are very small in our problem. However, when the signal priors become close to one half, the MAD estimator performs poorly, whereas the MC-ESE, by nature, is capable of providing very good estimates. This is why the MC-ESE and the MAD estimator will be compared in the sequel.

With the notation introduced above, the MC-ESE is calculated in two steps. A first estimate of $\sigma_0$ is computed by setting

\[ \tilde{\sigma}_0 = \operatorname*{argmin}_{\sigma \in [\sigma_{\min}, \sigma_{\max}]} \; \sup_{\ell \in \{1, \dots, L\}} \left| \frac{\sum_{k=1}^{m} \|Y_k\| \, I(\|Y_k\| \le \beta_\ell \sigma \sqrt{2})}{\sum_{k=1}^{m} I(\|Y_k\| \le \beta_\ell \sigma \sqrt{2})} - \sigma \, \frac{\Upsilon_1(\beta_\ell \sqrt{2})}{\Upsilon_0(\beta_\ell \sqrt{2})} \right|, \tag{7} \]

where $I(A)$ is the indicator function of event $A$, and $[\sigma_{\min}, \sigma_{\max}]$ is a search interval computed as follows. Denoting by $Y_{[k]}$, $k = 1, 2, \dots, m$, the sequence of observations $Y_1, Y_2, \dots, Y_m$ sorted by increasing norm, the lower bound of the search interval is $\sigma_{\min} = \|Y_{[k_{\min}]}\| / \sqrt{2}$, where $k_{\min} = m/2 - hm$, $h = 1/\sqrt{4m(1 - Q)}$ and $Q$ is a positive real number less than or equal to $1 - \frac{m}{4(m/2 - 1)^2}$. The upper bound of the search interval is fixed to $\sigma_{\max} = \|Y_{[m]}\| / \sqrt{2}$. The reasons for these choices are given in [10, Section 3.2]. A minimization routine for scalar bounded non-linear functions, such as the MATLAB routine fminbnd.m, can be used for the computation of $\tilde{\sigma}_0$. This first estimate $\tilde{\sigma}_0$ can be improved by defining the MC-ESE

\[ \hat{\sigma}_0 = \lambda \sqrt{\frac{\sum_{k=1}^{m} \|Y_k\|^2 \, I(\|Y_k\| \le \tilde{\sigma}_0 \sqrt{2})}{\sum_{k=1}^{m} I(\|Y_k\| \le \tilde{\sigma}_0 \sqrt{2})}}, \tag{8} \]

where $\lambda$ is an empirical constant. As proposed in [13], a reasonable choice for this constant is $\lambda = \sqrt{\Upsilon_0(\sqrt{2}) / \Upsilon_2(\sqrt{2})} = 1.0937$. A more detailed description of this estimator is beyond the scope of this paper. We simply mention that this estimator stems from [10, Theorem 1, statement (i)], with a heuristic description given in [13].

3.3. Wiener filtering

Given the estimate of the noise standard deviation, we optimize the estimation of the source STFT coefficients by including this noise information in the inversion process (5). Indeed, the procedure described above (the third step) implicitly considers that the points selected at the first step are noiseless, which is not the case. To optimize the source estimation, we acquire information about the variance of the signal and exploit this prior information in a Bayes framework. Using the source identification procedure of the active sources at any time-frequency point, and under the assumption that the number of active sources at any time-frequency point is strictly less than the number of sensors (A2), equation (2) is replaced by

\[ \mathcal{S}_{\mathbf{x}}(t, f) = \mathbf{A}_J\,\mathcal{S}_J(t, f) + \mathcal{S}_{\mathbf{n}}(t, f), \tag{9} \]

where $J = \{i_1, i_2, \dots, i_r\}$ ($r < M$) is the set of indexes of the active sources at time-frequency point $(t, f)$ and $\mathcal{S}_J(t, f) = [\mathcal{S}_{s_{i_1}}(t, f), \mathcal{S}_{s_{i_2}}(t, f), \dots, \mathcal{S}_{s_{i_r}}(t, f)]^T$ is the STFT column vector of the active sources at time-frequency point $(t, f)$. The vector $\mathcal{S}_J(t, f)$ is estimated by transforming the noisy STFT mixture $\mathcal{S}_{\mathbf{x}}(t, f)$ with a linear decision operator that minimizes the estimation mean-square error. The resulting estimator is $\hat{\mathcal{S}}_J(t, f) = \hat{\mathbf{D}}\,\mathcal{S}_{\mathbf{x}}(t, f)$, where $\hat{\mathbf{D}}$ minimizes the risk $E\{\|\mathcal{S}_J(t, f) - \mathbf{D}\,\mathcal{S}_{\mathbf{x}}(t, f)\|^2\}$ when $\mathbf{D}$ ranges over the space of matrices. After computation, we have

\[ \hat{\mathcal{S}}_J(t, f) \approx \hat{\mathbf{R}}_J \mathbf{A}_J^H \left( \mathbf{A}_J \hat{\mathbf{R}}_J \mathbf{A}_J^H + \hat{\sigma}_0^2 \mathbf{I}_M \right)^{-1} \mathcal{S}_{\mathbf{x}}(t, f), \tag{10} \]

where $\hat{\sigma}_0$ is the estimate of the noise standard deviation given by the MC-ESE, and $\hat{\mathbf{R}}_J$ is the empirical correlation matrix of $\mathcal{S}_J(t, f)$.

4. SIMULATIONS

In our simulations, we consider a mixing system with four sources and three sensors. The audio source signals arrive at the sensor array at different angles $\theta_1 = 15^\circ$, $\theta_2 = 30^\circ$, $\theta_3 = 45^\circ$ and $\theta_4 = 75^\circ$. All signals have the same length of $L = 8192$ samples. The RUBSS algorithm is used to separate the four source signals from the mixed signals observed by three sensors, with a false alarm rate fixed to $10^{-2}$. In Figure 1, the performance of the RUBSS algorithm based on the MC-ESE for the noise variance estimation (MCESE-RUBSS) is compared to that of the RUBSS algorithm based on the MAD estimator (MAD-RUBSS) and to that of the subspace-based UBSS algorithm of [6], which plays the role of baseline. The source separation performance is measured by the normalized mean square error (NMSE). The MC-ESE clearly outperforms the MAD and provides slightly better separation results than those obtained by subspace UBSS, without resorting to the empirical thresholds of [6], which are manually chosen for each input signal-to-noise ratio (SNR).

Fig. 1. Comparison of performance between subspace UBSS, MCESE-RUBSS and MAD-RUBSS algorithms: NMSE vs SNR.
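The benefit of replacing the pseudoinverse inversion (5) with the Wiener-type inversion (10) can be illustrated on synthetic data. The Python sketch below is not the authors' simulation: the random mixing columns, the Gaussian stand-ins for source STFT coefficients, the known noise level (which the MC-ESE would supply in practice), the oracle correlation matrix computed from the true sources, the convention that the complex noise covariance equals σ0² I_M, and the NMSE formula are all assumptions made for illustration.

```python
import numpy as np

def pinv_estimate(A_J, X):
    """Pseudoinverse inversion (5): S_J ~ A_J^# X."""
    return np.linalg.pinv(A_J) @ X

def wiener_estimate(A_J, X, R_J, sigma0):
    """Wiener-type inversion (10): R_J A_J^H (A_J R_J A_J^H + sigma0^2 I_M)^{-1} X."""
    M = A_J.shape[0]
    G = A_J @ R_J @ A_J.conj().T + sigma0**2 * np.eye(M)
    return R_J @ A_J.conj().T @ np.linalg.solve(G, X)

def nmse(S_hat, S):
    """NMSE, assumed here to be ||S_hat - S||^2 / ||S||^2."""
    return np.sum(np.abs(S_hat - S) ** 2) / np.sum(np.abs(S) ** 2)

def complex_gaussian(shape, power, rng):
    """Circular complex Gaussian samples with E|z|^2 = power."""
    return np.sqrt(power / 2) * (rng.standard_normal(shape)
                                 + 1j * rng.standard_normal(shape))

rng = np.random.default_rng(0)
M, r, K = 3, 2, 5000            # sensors, active sources, time-frequency points
A_J = complex_gaussian((M, r), 1.0, rng)
A_J /= np.linalg.norm(A_J, axis=0, keepdims=True)    # unit-norm columns
sigma0 = 1.0                    # noise level, assumed known in this sketch

S = complex_gaussian((r, K), 1.0, rng)               # stand-in source coefficients
X = A_J @ S + complex_gaussian((M, K), sigma0**2, rng)

# Oracle empirical correlation matrix, computed from the true sources.
R_J = (S @ S.conj().T) / K

nmse_pinv = nmse(pinv_estimate(A_J, X), S)
nmse_wiener = nmse(wiener_estimate(A_J, X, R_J, sigma0), S)
print(f"NMSE pseudoinverse: {nmse_pinv:.3f}, NMSE Wiener: {nmse_wiener:.3f}")
```

At this low SNR the Wiener-type estimate is expected to give the smaller NMSE, because the pseudoinverse ignores the noise term while (10) trades a small bias for reduced noise amplification. As the noise level tends to zero, the two estimates coincide.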

As expected for both methods (MCESE-RUBSS and subspace-based UBSS), the results of Figures 2 and 3 show a clear degradation of the separation quality as the number of sources increases, for SNR = 25 dB and SNR = 45 dB, respectively. Indeed, the source interference increases with the number of sources, so that assumption (A2) is poorly satisfied.

Fig. 2. Comparison between subspace UBSS and MCESE-RUBSS algorithms when SNR = 25 dB: NMSE vs number of sources.

Fig. 3. Comparison between subspace UBSS and MCESE-RUBSS algorithms when SNR = 45 dB: NMSE vs number of sources.

5. CONCLUSION

In this paper, we have presented a new method for the RUBSS problem. The proposed method is based on a detector that decides on the presence or the absence of a random signal with unknown distribution in a background of additive and independent white Gaussian noise. The signal is detected when the observation is above a threshold height. In order to guarantee a specified false alarm probability, the threshold height is chosen as a function of the noise standard deviation. The MC-ESE gives a robust estimate of this value, and we have used this estimate in the inversion process. Simulation results illustrate the effectiveness of the proposed approach. Our future investigations will focus on alternative approaches in robust and nonparametric detection. The aim is to be fully automatic and independent of the value of the false alarm probability. To this end, we plan to consider a class of signals constrained by a minimum norm bound. This parameter can be fixed according to simulations on suitably chosen synthetic signals.

6. REFERENCES

[1] D. F. Rosenthal and H. G. Okuno, Eds., Computational Auditory Scene Analysis, L. Erlbaum Associates Inc., 1998.

[2] A. Aissa-El-Bey, K. Abed-Meraim, and Y. Grenier, "Underdetermined blind audio source separation using modal decomposition," EURASIP Journal on Audio, Speech, and Music Processing, 2007.

[3] V. Varajarajan and J.L. Krolik, "Multichannel system identification methods for sensor array calibration in uncertain multipath environments," in IEEE Signal Processing Workshop on Statistical Signal Processing (SSP'01), Singapore, 2001, pp. 297–300.

[4] A. Rouxel, D. Le Guennec, and O. Macchi, "Unsupervised adaptive separation of impulse signals applied to EEG analysis," in IEEE International Conference on Acoustics, Speech, Signal Processing, Singapore, 1998, pp. 2888–2897.

[5] K. Abed-Meraim, S. Attallah, T.J. Lim, and M.O. Damen, "A blind interference canceller in DS-CDMA," in IEEE International Symposium on Spread Spectrum Techniques and Applications, Parsippany, 2000, pp. 358–362.

[6] A. Aissa-El-Bey, N. Linh-Trung, K. Abed-Meraim, A. Belouchrani, and Y. Grenier, "Underdetermined blind separation of nondisjoint sources in the time-frequency domain," IEEE Transactions on Signal Processing, vol. 55, pp. 897–907, 2007.

[7] P. Bofill and M. Zibulevsky, "Underdetermined blind source separation using sparse representations," Elsevier Signal Processing, vol. 81, pp. 2353–2362, 2001.

[8] N. Linh-Trung, A. Belouchrani, K. Abed-Meraim, and B. Boashash, "Separating more sources than sensors using time-frequency distributions," EURASIP Journal on Applied Signal Processing, 2005.

[9] H. V. Poor, Ed., An Introduction to Signal Detection and Estimation, Springer-Verlag, 1994.

[10] D. Pastor, "A theoretical result for processing signals that have unknown distributions and priors in white Gaussian noise," Computational Statistics and Data Analysis, vol. 52, pp. 3167–3186, 2008.

[11] F.R. Hampel, "The influence curve and its role in robust estimation," Journal of the American Statistical Association, vol. 69, no. 346, pp. 383–393, Jun. 1974.

[12] D. Peng and Y. Xiang, "Underdetermined blind source separation based on relaxed sparsity condition of sources," IEEE Transactions on Signal Processing, vol. 57, 2009.

[13] F.-X. Socheleau, D. Pastor, and A. Aissa-El-Bey, "Robust statistics based noise variance estimation: Application to wideband interception of non-cooperative communications," IEEE Transactions on Aerospace and Electronic Systems, vol. 2, no. 7, pp. 406–413, Mar. 2011.