Adaptive Cross-Channel Interference Cancellation on Blind Signal Separation Outputs Using Source Absence/Presence Detection and Spectral Subtraction

Gil-Jin Jang¹, Changkyu Choi¹, Yongbeom Lee¹, Yung-Hwan Oh²

¹ Human Computer Interaction Laboratory, Samsung Advanced Institute of Technology
² Spoken Language Laboratory, Korea Advanced Institute of Science and Technology
{giljin.jang,changkyu choi,leey}@samsung.com, [email protected]
http://myhome.naver.com/jangbal/
Abstract

The performance of blind source separation (BSS) is still not satisfactory for application to real environments. The major obstacles appear to be the finite filter length of the assumed mixing model and nonlinear sensor noise. This paper presents a two-step speech enhancement method with stereo microphone inputs. The first step is an ordinary frequency-domain BSS, and the second is the removal of the remaining cross-channel interference by a spectral cancellation approach using a probabilistic source absence/presence detection technique. Our experimental results show good separation enhancement on real recordings of speech and music signals compared to conventional BSS methods.
1. Introduction

In real recording situations, each source signal spreads in all directions and reaches each microphone through "direct paths" and "reverberant paths." Blind source separation (BSS) refers to a class of separation methods that require no source signal information except the number of mixed sources [1]. The signal observed by the jth microphone is expressed as

x_j(t) = \sum_{i=1}^{N} h_{ji}(t) * s_i(t) + n_j(t) ,   (1)
where s_i(t) is the ith source signal, N is the number of sources, * is the convolution operator, x_j(t) is the observed signal, and h_{ji}(t) is the transfer function from source i to sensor j. The noise term n_j(t) refers to the nonlinear distortions due to the characteristics of the recording devices. The assumption that the sources never move often fails due to the dynamic nature of acoustic objects. Moreover, a practical system must set a limit on the length of the impulse response, and the limited length is often a major performance bottleneck in realistic situations [2].
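As a concrete illustration, the following NumPy sketch simulates the convolutive mixing model of Eq. (1). The function name and the finite FIR approximation of h_{ji}(t) are our own illustrative assumptions, not part of the paper.

```python
import numpy as np

def convolutive_mixture(sources, impulse_responses, noise_std=0.0):
    """Simulate Eq. (1): x_j(t) = sum_i h_ji(t) * s_i(t) + n_j(t).

    sources           : list of N 1-D arrays s_i(t).
    impulse_responses : impulse_responses[j][i] is a finite FIR
                        approximation of the transfer function h_ji(t).
    noise_std         : std of the additive sensor-noise term n_j(t).
    """
    num_sensors = len(impulse_responses)
    length = max(len(s) for s in sources)
    x = np.zeros((num_sensors, length))
    for j in range(num_sensors):
        for i, s in enumerate(sources):
            s = np.pad(s, (0, length - len(s)))   # align source lengths
            x[j] += np.convolve(s, impulse_responses[j][i])[:length]
        # sensor noise, here simply white Gaussian for illustration
        x[j] += noise_std * np.random.randn(length)
    return x
```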
This paper proposes a post-processing technique for eliminating the remaining cross-channel interference at the BSS output. The BSS techniques that can be coupled with our method include time- and frequency-domain blind deconvolution [1, 3], beamforming [4], and noise suppression using unidirectional microphones. Our method is motivated by adaptive noise cancellation (ANC) [5]. It considers one BSS output as the noisy signal and the other as the reference noise source, and performs cancellation in the power-spectral domain as conventional spectral subtraction methods do [6]. The advantage of power-spectral subtraction is that it effectively absorbs a small amount of mismatch between the actual filter and the estimated one. The disadvantage is the introduction of musical noise due to the below-zero spectral components resulting from the subtraction. With the help of source absence/presence detection prior to the subtraction, we reduce the error of the cancellation factor estimation and hence minimize the musical noise. Experimental results show that the proposed method outperforms the frequency-domain BSS method in realistic conditions.
2. Frequency-domain Blind Source Separation

For simplicity, we consider only the stereo-input, stereo-output convolutive case. Using the short-time Fourier transform, Equation (1) is rewritten as

X(ω, n) = H(ω) S(ω, n) + N(ω, n) ,   (2)
where ω is a frequency index, H(ω) is the 2 × 2 square mixing matrix, X(ω, n) = [X_1(ω, n)  X_2(ω, n)]^T with

X_j(ω, n) = \sum_{τ=0}^{T−1} e^{−i 2πωτ/T} x_j(t_n + τ) ,

representing the DFT of the frame of size T with shift length T/2 starting at t_n = \lfloor T/2 \rfloor (n − 1) + 1, where \lfloor · \rfloor is the flooring operator; corresponding expressions apply for S(ω, n) and N(ω, n). (In this manuscript, lowercase letters with argument t denote time series, and capital letters with arguments ω and n denote the Fourier transform at frequency ω for the nth frame. Boldfaced letters are column vectors whose components carry the same arguments.) The unmixing process can be formulated in each frequency bin ω as Y(ω, n) = W(ω) X(ω, n), where the 2 × 1 vector Y(ω, n) is an estimate of the original source S(ω, n). We use the non-holonomic ICA learning rule [7], which guarantees an orthogonal solution:

∆W ∝ ϕ(Y) Y^H − diag[ ϕ(Y) Y^H ] ,   (3)

where ^H is the Hermitian transpose, and the polar nonlinear function ϕ(·) is defined by ϕ(Y) = [Y_1/|Y_1|  Y_2/|Y_2|]^T [8]. A disadvantage of this decomposition is that the permutation problem arises in each independent frequency bin [1]; the problem is solved by time-domain spectral smoothing [3].

Figure 1: The separability of the ordinary BSS algorithm. The left two signals are the sensor inputs, each a mixture S1 + S2, and the right two signals are the BSS outputs. The original sources are rock music and speech signals.
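A minimal sketch of one iteration of the learning rule in Eq. (3) for a single frequency bin, given a batch of STFT frames. The trailing right-multiplication by W (the usual natural-gradient convention, which Eq. (3) leaves implicit) and the learning rate are our assumptions.

```python
import numpy as np

def ica_update(W, X, lr=0.01):
    """One non-holonomic ICA step, Eq. (3), in a single frequency bin.

    W : (2, 2) complex unmixing matrix W(w).
    X : (2, M) complex STFT frames X(w, n), n = 1..M.
    """
    Y = W @ X                                  # source estimates Y(w, n)
    phi = Y / np.maximum(np.abs(Y), 1e-12)     # polar nonlinearity [8]
    R = (phi @ Y.conj().T) / X.shape[1]        # sample average of phi(Y) Y^H
    off_diag = R - np.diag(np.diag(R))         # Eq. (3): keep off-diagonal part
    # natural-gradient convention: right-multiply by W (our assumption)
    return W - lr * off_diag @ W
```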
3. Cross-Channel Interference Cancellation

Figure 1 illustrates the input and the output of the ordinary BSS system. The output signals still contain cross-channel interference that is audible and identifiable by human listeners. However, in the first output, if we assume that the speech signal is present only in the regions enclosed by rectangles (call them active blocks), then the regions enclosed by ellipses (inactive blocks) apparently contain the music signal only. When the primary source is present, it often coexists with the secondary source, and interference occurs. Therefore we define the interference probability as the presence probability of the primary source, which is modeled by complex Gaussian distributions [9]. For each frame of the ith BSS output, we denote the set of all frequency components by Y_i(n) = {Y_i(ω, n) | ω = 1, ..., T}, and define two hypotheses H_{i,0} and H_{i,1} that respectively indicate the absence and presence of the primary source:
H_{i,0}: Y_i(n) = S̃_j(n)
H_{i,1}: Y_i(n) = S̃_i(n) + S̃_j(n) ,  i ≠ j ,   (4)

where S̃_i is a filtered version of S_i. Conditioned on Y_i(n), the source absence/presence probabilities are given by

p(H_{i,m} | Y_i(n)) = \frac{ p(Y_i(n) | H_{i,m}) \, p(H_{i,m}) }{ p(Y_i(n) | H_{i,0}) \, p(H_{i,0}) + p(Y_i(n) | H_{i,1}) \, p(H_{i,1}) } ,   (5)

where p(H_{i,0}) is the a priori probability of absence of source i, and p(H_{i,1}) = 1 − p(H_{i,0}) is that of the cross-channel interference. By assuming probabilistic independence among the frequency components, p(Y_i(n) | H_{i,m}) = \prod_ω p(Y_i(ω, n) | H_{i,m}). Then the source absence probability becomes

p(H_{i,0} | Y_i(n)) = \left[ 1 + \frac{p(H_{i,1})}{p(H_{i,0})} \prod_{ω=1}^{T} \frac{p(Y_i(ω, n) | H_{i,1})}{p(Y_i(ω, n) | H_{i,0})} \right]^{-1} ,   (6)

and the posterior probability of H_{i,1} is simply p(H_{i,1} | Y_i(n)) = 1 − p(H_{i,0} | Y_i(n)).
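A sketch of the frame posterior of Eqs. (5)-(6), plugging in the complex-Gaussian likelihoods of Eq. (9) introduced below. The log-domain accumulation is our implementation choice to avoid numerical underflow of the product over frequency bins.

```python
import numpy as np

def absence_posterior(U_mag, lam0, lam1, p_h0=0.5):
    """p(H_{i,0} | Y_i(n)) of Eq. (6) for one frame.

    U_mag      : |U_i(w, n)| over all bins w (1-D array).
    lam0, lam1 : per-bin variances lambda_{i,0}(w), lambda_{i,1}(w).
    p_h0       : prior p(H_{i,0}); p(H_{i,1}) = 1 - p_h0.
    """
    # per-bin log-likelihood ratio of H_{i,1} over H_{i,0}, from Eq. (9)
    llr = U_mag**2 * (1.0 / lam0 - 1.0 / lam1) + np.log(lam0 / lam1)
    ratio = (1.0 - p_h0) / p_h0 * np.exp(np.sum(llr))
    return 1.0 / (1.0 + ratio)
```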
Because the assumed mixing model of ANC is a linear FIR filter architecture, directly applying ANC may not capture the linear filter's mismatch to realistic conditions, namely the nonlinearities due to sensor noise and the practically infinite filter length. Therefore we add a nonlinear feature adopted from spectral subtraction [6]:

|U_i(ω, n)| = f( |Y_i(ω, n)| − α_i b_{ij}(ω) |Y_j(ω, n)| )
∠U_i(ω, n) = ∠Y_i(ω, n) ,  i ≠ j ,   (7)
where αi is the over-subtraction factor, Yi (ω, n) is the ith component of the BSS output Y(ω, n), bij (ω) is the cross-channel interference cancellation factor for frequency ω from channel j to i, and the bounding function f (·) is defined by
f(a) = a if a ≥ ε, and f(a) = ε if a < ε ,   (8)

where the positive constant ε sets a lower bound on the spectral value. The nonlinear operator f(·) suppresses the remaining errors of the BSS, at the cost of introducing musical noise.
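A sketch of the bounded power-spectral subtraction of Eqs. (7)-(8). Keeping the noisy-channel phase for the enhanced spectrum is our reading of the second line of Eq. (7).

```python
import numpy as np

def subtract_interference(Y_i, Y_j, b_ij, alpha=1.0, eps=1e-3):
    """Eqs. (7)-(8): magnitude subtraction with lower bound, phase kept.

    Y_i, Y_j : complex spectra of the two BSS outputs for one frame.
    b_ij     : per-bin cancellation factors b_ij(w).
    alpha    : over-subtraction factor alpha_i.
    """
    mag = np.abs(Y_i) - alpha * b_ij * np.abs(Y_j)   # Eq. (7)
    mag = np.maximum(mag, eps)                       # bounding f(.), Eq. (8)
    phase = Y_i / np.maximum(np.abs(Y_i), 1e-12)     # unit-phase of Y_i
    return mag * phase                               # U_i(w, n)
```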
We evaluate the likelihood of Y_i(ω, n) under each hypothesis by complex Gaussian distributions of |U_i(ω, n)|:

p(Y_i(ω, n) | H_{i,m}) ∝ \exp\left( − \frac{|U_i(ω, n)|^2}{λ_{i,m}(ω)} \right) ,   (9)

where λ_{i,m}(ω) is the variance of the subtracted frames. The variance λ_{i,m}(ω) can be updated at every frame by the following probabilistic averaging formula:

λ_{i,m}(ω) ⇐ \{1 − η_λ \, p(H_{i,m} | Y_i(n))\} \, λ_{i,m}(ω) + η_λ \, p(H_{i,m} | Y_i(n)) \, |U_i(ω, n)|^2 ,   (10)
where the positive constant η_λ defines the adaptation frame rate. The primary source signal is expected to be at least "emphasized" by BSS. Hence we assume that the amplitude of the primary source is greater than that of the interfering source, which is the primary source of the other BSS output channel. While updating the model parameters, it may happen that the variance of the enhanced source, λ_{i,1}, becomes smaller than λ_{i,0}. Since such cases are undesirable, we explicitly exchange the two models when

λ_{i,0}(ω) > λ_{i,1}(ω) .   (11)
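The variance adaptation of Eq. (10) together with the model exchange of Eq. (11) can be sketched as follows; the in-place swap via boolean masking is an implementation detail of ours.

```python
import numpy as np

def update_variances(lam0, lam1, U_mag, p_h0, eta=0.1):
    """Eq. (10) for both hypotheses, then the swap condition of Eq. (11).

    lam0, lam1 : per-bin variances lambda_{i,0}(w), lambda_{i,1}(w).
    U_mag      : |U_i(w, n)| of the current frame.
    p_h0       : frame posterior p(H_{i,0} | Y_i(n)).
    eta        : adaptation rate eta_lambda.
    """
    p_h1 = 1.0 - p_h0
    lam0 = (1.0 - eta * p_h0) * lam0 + eta * p_h0 * U_mag**2
    lam1 = (1.0 - eta * p_h1) * lam1 + eta * p_h1 * U_mag**2
    # Eq. (11): the presence model must keep the larger variance
    swap = lam0 > lam1
    lam0[swap], lam1[swap] = lam1[swap], lam0[swap]
    return lam0, lam1
```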
The next step is updating the interference cancellation factors. We compute the difference between the spectral magnitudes of Y_i and Y_j at frequency ω and frame n:

δ_i(ω, n) = |Y_i(ω, n)| − b_{ij}(ω) |Y_j(ω, n)| .   (12)

We define the cost function J as the ν-norm of the difference weighted by the frame probability:

J(ω, n) = p(H_{i,0} | Y_i(n)) · |δ_i(ω, n)|^ν .   (13)

The gradient-descent learning rule for b_{ij} at frame n is

∆b_{ij}(ω) ∝ − \frac{∂J(ω, n)}{∂b_{ij}(ω)} = p(H_{i,0} | Y_i(n)) · |δ_i(ω, n)|^{ν−1} \, |Y_j(ω, n)| .   (14)

According to earlier findings about natural sound distributions, ν is set to less than 1 for highly kurtotic speech signals [10], greater than 1 for music signals [11], and 2 for pure Gaussian random noise. In the case of the speech signal mixtures, we assign ν = 0.8 for p(H_{i,1} | Y_i(n)), and ν = 1.5 for p(H_{i,0} | Y_i(n)) to fit the distribution of the musical tones.
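A sketch of the gradient step of Eqs. (12)-(14). The explicit sign(δ) factor, which Eq. (14) absorbs into its proportionality, and the magnitude floor that keeps |δ|^(ν−1) finite for ν < 1 are our additions.

```python
import numpy as np

def update_factor(b_ij, Y_i, Y_j, p_h0, nu=1.5, lr=0.01):
    """One gradient-descent step on the cost of Eq. (13).

    b_ij : per-bin cancellation factors b_ij(w).
    p_h0 : frame probability p(H_{i,0} | Y_i(n)) weighting the cost.
    nu   : norm order (e.g. 0.8 for speech-like, 1.5 for music-like).
    """
    delta = np.abs(Y_i) - b_ij * np.abs(Y_j)        # Eq. (12)
    mag = np.maximum(np.abs(delta), 1e-12)          # floor for nu < 1
    grad = p_h0 * mag**(nu - 1.0) * np.sign(delta) * np.abs(Y_j)
    return b_ij + lr * grad                         # Eq. (14): shrink |delta|
```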
4. Evaluation

We conducted experiments designed to demonstrate the performance of the proposed method. The data were recorded in a normal office room. Two loudspeakers played the different sources, and two omnidirectional microphones simultaneously recorded the mixtures at a sampling rate of 16 kHz. The left loudspeaker played either a male or a female speech signal, and the right loudspeaker played one of five different sounds at a time. The speech signals are series of full-sentence utterances, and the music signals are a pop song, a rock song with vocals, and a soft instrumental piece. The distance between the sensors is 50 cm, between the loudspeakers 50 cm, and between a sensor and a loudspeaker 100 cm. The frame length for the frequency-domain BSS was 512 samples, and the same length was used for the cross-channel interference cancellation algorithm. The separation results are compared by the signal-to-interference ratio (SIR), which we define as the logarithm of the ratio of the primary source power to the secondary source power in a channel:

SIR(u_i) [dB] = 10 \log_{10} \frac{E_1(u_i)}{E_2(u_i)} ≈ 10 \log_{10} \frac{E_{1+2}(u_i) − E_2(u_i)}{E_2(u_i)} ,   (15)

where E_1(u_i) and E_2(u_i) are the average powers of the primary and secondary sources in signal u_i, and E_{1+2}(u_i) is the average power when cross-interference occurs. When the two sources are uncorrelated, we can approximate E_1 ≈ E_{1+2} − E_2. Because the exact signals cannot be obtained, we exploit the interference probabilities to estimate the source powers:

E_2(u_i) = \frac{ \sum_n p(H_{i,0} | Y_i(n)) \, \langle u_i(t)^2 \rangle_n }{ \sum_n p(H_{i,0} | Y_i(n)) } ,
E_{1+2}(u_i) = \frac{ \sum_n p(H_{i,1} | Y_i(n)) \, \langle u_i(t)^2 \rangle_n }{ \sum_n p(H_{i,1} | Y_i(n)) } ,

where \langle u_i(t)^2 \rangle_n is the average sample energy of frame n. Table 1 reports the SIRs of the input signals, the BSS outputs, and the interference-cancelled results of the proposed method. The music signals are active over the whole time course, so the estimated interference energy E_2(u_i) is not reliable for those channels; we therefore calculate SIRs only for the first channels, where the speech signals f1 and m1 are the primary sources. Applying the frequency-domain BSS gave a 4 dB average SIR improvement, and the proposed post-processing on the BSS output gave an additional 6 dB average improvement. Figure 2 plots the stepwise processing of the proposed method on the mixture f1-g2. The cross-channel interference in the BSS outputs is largely removed in the final results.
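The probability-weighted power estimates and the SIR of Eq. (15) can be computed as in the following sketch; the per-frame energies and posteriors are assumed to be precomputed.

```python
import numpy as np

def estimate_sir_db(frame_energy, p_h0):
    """SIR of Eq. (15) from the probability-weighted powers.

    frame_energy : <u_i(t)^2>_n, average sample energy per frame.
    p_h0         : p(H_{i,0} | Y_i(n)) per frame; presence is 1 - p_h0.
    """
    p_h1 = 1.0 - p_h0
    E2 = np.sum(p_h0 * frame_energy) / np.sum(p_h0)    # interference power
    E12 = np.sum(p_h1 * frame_energy) / np.sum(p_h1)   # mixed power
    return 10.0 * np.log10(max(E12 - E2, 1e-12) / E2)  # Eq. (15)
```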
5. Conclusions

The ordinary BSS algorithms have inherent separation errors due to the mismatch between the assumed linear model and the real transfer functions. We proposed a post-processing technique applicable to such realistic environments. We deal with nonstationary natural noise sounds under the assumption that the numbers of sources and sensors are strictly two, and that each output of the blind source separation system has a primary and a secondary source signal identified by their relative powers. Any multichannel technique that can produce 'primary' and 'secondary' outputs can be combined with our method, for example blind deconvolution techniques, beamforming, or combinations of multiple directional microphones. When we are interested in a single source, it is possible to use one unidirectional and a number of omnidirectional microphones. The proposed algorithm considers one BSS output as the noisy signal and the other output as the reference noise source, and the cancellation is done in the power-spectral domain as in conventional spectral subtraction methods. The advantage of power-spectral subtraction is that it effectively absorbs a small amount of mismatch between the actual filter and the estimated one, and generates cleanly denoised signals. The disadvantage is the introduction of musical noise due to below-zero spectral components. With the help of source absence/presence detection prior to the subtraction, we reduce the error of the cancellation filter estimation and hence minimize the musical noise.
Figure 2: Step-by-step enhancement result of the developed stereo-input separation method. The first row is the input mixture signal, and the second row is the separation result of the frequency-domain blind source separation (BSS) algorithm. The cross-channel interference probability is computed and represented by the red lines on the waveforms. Based on the computed probabilities, adaptive cross-channel interference cancellation (ACCIC) is performed to remove the leftover cross-channel interference signals. The final denoised result is shown in the third row. (Waveform panels not reproduced in this text version.)

Table 1: Calculated SIRs of the input signals, the BSS outputs, and the interference-cancelled results of the proposed method. The 'mixture' column indicates the types of sources mixed into the stereo input: 'f1' and 'f2' are female speakers, 'm1' and 'm2' are male speakers, and 'g1' to 'g3' are three different music signals. All values are in dB.

mixture    Input    BSS only    Proposed
f1-g1       6.37       7.13       11.04
f1-g2       3.84       8.75       16.57
f1-g3       1.89       5.74       11.11
f1-f2       3.08       6.45       10.90
f1-m2       7.23      10.92       16.82
m1-g1       7.91      10.37       16.15
m1-g2       4.19       8.81       16.36
m1-g3       0.87       4.84       10.97
m1-f2       2.54       9.42       15.74
m1-m2       6.74      11.72       17.46
average     4.47       8.41       14.32
increase      -       +3.95       +5.90

6. References
[1] K. Torkkola, "Blind signal separation for audio signals - are we there yet?," in Proc. ICA99, Aussois, France, pp. 261-266, January 1999.
[2] S. Araki, S. Makino, R. Aichner, T. Nishikawa, and H. Saruwatari, "Subband based blind source separation with appropriate processing for each frequency band," in Proc. ICA2003, Nara, Japan, pp. 499-504, April 2003.
[3] L. Parra and C. Spence, "Convolutive blind separation of non-stationary sources," IEEE Trans. Speech and Audio Processing, vol. 8, pp. 320-327, May 2000.
[4] C. Choi, D. Kong, S. M. Yoon, and H.-K. Lee, "Separation of multiple concurrent speeches using audio-visual speaker localization and minimum variance beam-forming," in Proc. ICSLP, Jeju Island, Korea, October 2004.
[5] B. Widrow, J. R. Glover, J. M. McCool, J. Kaunitz, C. S. Williams, R. H. Hearn, J. R. Zeidler, E. Dong, and R. C. Goodlin, "Adaptive noise cancelling: principles and applications," Proceedings of the IEEE, vol. 63, pp. 1692-1716, December 1975.
[6] S. F. Boll, "Suppression of acoustic noise in speech using spectral subtraction," IEEE Trans. Acoustics, Speech and Signal Processing, vol. 27, no. 2, pp. 113-120, 1979.
[7] S. Choi, S. Amari, A. Cichocki, and R.-w. Liu, "Natural gradient learning with a nonholonomic constraint for blind deconvolution of multiple channels," in Proc. ICA99, Aussois, France, pp. 371-376, January 1999.
[8] H. Sawada, R. Mukai, S. Araki, and S. Makino, "Polar coordinate based nonlinear function for frequency-domain blind source separation," in Proc. ICASSP, Orlando, Florida, May 2002.
[9] N. S. Kim and J.-H. Chang, "Spectral enhancement based on global soft decision," IEEE Signal Processing Letters, vol. 7, pp. 108-110, May 2000.
[10] G.-J. Jang, T.-W. Lee, and Y.-H. Oh, "Learning statistically efficient features for speaker recognition," in Proc. ICASSP, Salt Lake City, Utah, May 2001.
[11] A. J. Bell and T. J. Sejnowski, "Learning the higher-order structures of a natural sound," Network: Computation in Neural Systems, vol. 7, pp. 261-266, July 1996.