Single-channel speech separation using combined EMD and speech-specific information
M. K. Prasanna Kumar¹ · R. Kumaraswamy²
International Journal of Speech Technology, ISSN 1381-2416, Int J Speech Technol (2017) 20:1037–1047, DOI 10.1007/s10772-017-9468-3
Received: 18 June 2017 / Accepted: 27 September 2017 / Published online: 23 October 2017 © Springer Science+Business Media, LLC 2017
Abstract Multi-channel blind source separation (BSS) methods use more than one microphone, so there is a need to develop speech separation algorithms for the single-microphone scenario. In this paper we propose a method for single-channel speech separation (SCSS) that combines empirical mode decomposition (EMD) with speech-specific information. The speech-specific information is derived in the form of source-filter features: source features are obtained from multi-pitch information, and filter information is estimated by formant analysis. To track the multi-pitch information in the mixed signal we apply simple inverse filtering tracking (SIFT) and histogram-based pitch estimation to the excitation-source information. Formant estimation is done using linear predictive (LP) analysis. Pitch and formant estimation are performed both with and without EMD decomposition for better extraction of the individual speakers in the mixture. Combining EMD with speech-specific information provides encouraging results for single-channel speech separation.

Keywords BSS · SCSS · EMD · IMF · SIFT · Multi pitch information
* M. K. Prasanna Kumar
[email protected]
R. Kumaraswamy
[email protected]
1 BMS College of Engineering, Bangalore, Karnataka 560019, India
2 Siddaganga Institute of Technology, Tumkur, Karnataka 572103, India

1 Introduction

Blind source separation is the art of separating sources from a mixture without prior knowledge of the original source signals. It is still an open challenge for researchers, as the solution is not unique. Most research has been done on multi-channel source separation with a linear instantaneous mixing model or a convolutive mixing model, applicable to the well-determined case (P = Q) or the over-determined case (P > Q), possibly with an unknown number of sources. The under-determined problem needs more attention, especially in the single-microphone scenario; this type of BSS problem is normally known as single-channel source separation (SCSS). In this paper we consider two speakers and one channel for the mixed signal, described as

y(n) = s_1(n) + s_2(n)    (1)

In Eq. 1, n = 1, 2, …, N indicates the time index. It is required to separate the two speakers s_1(n) and s_2(n) from the single mixture collected by the microphone, denoted y(n). Recent work on SCSS can be classified into top-down and bottom-up approaches (Hershey et al. 2011). The top-down approach focuses mainly on model-based techniques, which need training information about the sources, whereas bottom-up techniques work directly on the mixed signal without any knowledge of the sources. We follow the bottom-up approach in this paper. Another way of classifying SCSS techniques is into supervised (guided) and unsupervised (unguided) methods, which are driven by model-based
methods and non-model-based methods, respectively (Vincent et al. 2014). Most real-world SCSS problems fall under the unsupervised case. Most supervised methods exploit hidden Markov models (HMMs) and approximate posterior distributions by Gaussian distribution models (Ellis 2006; Reyes-Gomez et al. 2004; Kristjansson et al. 2004). The separation quality depends on the source models, which in turn require thousands of hidden states of the sources. In Stark et al. (2011) an SCSS method based on a source-filter model was proposed, using model-driven multi-pitch estimation with a factorial HMM; the vocal-tract filter was modelled using vector quantization (VQ) or non-negative matrix factorization (NMF). To compute pitch and vocal-tract filters, this method uses a time–frequency spectrogram modelled by Gaussian mixtures, and therefore requires supervised data, namely pitch pairs for the corresponding speech-mixture spectrogram, to learn the Gaussian mixture model (GMM). Model-driven SCSS suffers from long execution times, involving a training phase as well as an inference phase. Another model-based method, based on a maximum-likelihood approach, was presented in Jang and Lee (2003). It requires basis functions of the mixed source signals; for real-world problems, a dictionary of bases for various kinds of natural sounds and a way of finding the most characterizing bases in the dictionary for a generic case are necessary conditions for good separation performance.

Another category of supervised SCSS methods falls under independent component analysis SCSS (ICA-SCSS) (Jang and Lee 2003; Li et al. 2006; Fevotte and Godsill 2006). In these methods the sources are modelled as sparse combinations of a set of temporal basis functions derived from ICA, and with the ICA bases the sources are estimated by maximizing the log-likelihood of the mixed signal; the separation efficiency depends on the overlap of the temporal basis functions. SCSS using wavelet packet decomposition is given in Litvin and Cohen (2011); the method depends on statistical models of the sources, which are trained from samples of each source separately. A computational auditory scene analysis (CASA) based method was presented in Li et al. (2006), which exploits the short-time Fourier transform (STFT): the mixed signal is segmented into time–frequency cells, which are then used to group objects by harmonicity and correlated modulation. However, CASA-based methods suffer from the cross-talk problem, and a suitable cross-talk suppression algorithm is required. SCSS methods based on non-negative matrix factorization (NMF) were presented in (Schmidt and Olsson 2006; Virtanen 2007). A different grouping is proposed in Virtanen (2007), but in practice the sources overlap in the time–frequency domain, making it difficult to obtain correct clustering. Another variant of SCSS based on Itakura–Saito
non-negative matrix factorization (ISNMF) was proposed in Gao et al. (2013); this method uses time–frequency analysis with a gammatone filter bank. A non-model-based SCSS method is proposed in Tengtrairat et al. (2013) by creating a pseudo-stereo mixture from single-channel data and projecting it onto a 2-D histogram; it makes several assumptions, such as sources that are windowed-disjoint orthogonal in the STFT domain, local stationarity of the sources, and phase ambiguity. A combined EMD and ICA approach for SCSS was proposed in Mijovic et al. (2010), which is more suitable for biomedical signals owing to the limitations of ICA for speech signals mentioned earlier. A subspace-decomposition method using EMD and the Hilbert spectrum was proposed in Molla and Hirose (2007), which depends on derived independent basis vectors that are stationary over time. A variable regularized sparse NMF (v-SNMF) combined with EMD is given in Gao et al. (2011), which is mainly focused on regularizing the sparseness of the temporal structure of SNMF; here the sparseness of the temporal structure is imposed element-wise for optimal sparseness.

In the proposed method, EMD is applied to the voiced segments in the time domain, and multi-pitch tracking is then done based on robust simple inverse filtering tracking (SIFT) and a histogram of pitch over all speech frames. The filter characteristics are obtained using LP analysis. Finally, we separate the individual speakers by combining IMF selection from the source-filter characteristics, the unvoiced segments, and missing-data imputation using voiced IMFs. The results show that the proposed method produces results comparable to existing algorithms.

This paper is organised as follows. Section 2 introduces the background of EMD. In Sect. 3 we present the proposed separation algorithm. Section 4 describes the experimental results, Sect. 5 provides a discussion of the results, and Sect. 6 concludes the paper.
2 Background of EMD

2.1 Empirical mode decomposition (EMD)

The EMD method can decompose a nonlinear and non-stationary signal into intrinsic mode functions (IMFs), which consist of components ordered from high frequency to low frequency, represented by
y(n) = Σ_{k=1}^{M} c_k(n) + r_M(n)    (2)
In Eq. 2, c_k(n) represents the kth IMF, r_M(n) is the residual of the signal y(n) and M is the number of IMFs. A function has to satisfy the following conditions to become an IMF. (1) The number of extrema and the number of zero crossings should
be the same or differ at most by one. (2) At any point, the local average of the upper envelope and the lower envelope is zero. The residual is a monotonic function or a constant (Wang et al. 2014; Huang and Shen 1998). The EMD algorithm (Wang et al. 2014) is described as follows.

EMD algorithm:

Step 1 Initialize k = 1 and r(n) = y(n); here the mixed speech and the residual signal are represented by y(n) and r(n) respectively.
Step 2 Extract all extrema of y(n).
Step 3 Find the upper envelope E_u(n) and the lower envelope E_l(n) by interpolation.
Step 4 Find the local mean µ(n):

µ(n) = (E_u(n) + E_l(n)) / 2    (3)

Step 5 Find the difference d(n):

d(n) = y(n) − µ(n)    (4)

if d(n) is an IMF then
    c_k(n) = d(n)    (5)
    r(n) = r(n) − c_k(n)    (6)
    k = k + 1    (7)
    if r(n) is a monotonic function then
        end of EMD
    else
        y(n) = r(n); goto Step 2
    end if
else
    y(n) = d(n); goto Step 2
end if

2.2 Ensemble EMD (EEMD)

A limitation of EMD is mode mixing due to intermittency, which can appear in two ways: a single IMF component containing widely different scales, or a signal of the same scale residing in different IMF components. To overcome this limitation, a noise-assisted method called ensemble EMD (EEMD) can be used. The EEMD method adds uniform white noise of finite amplitude to the original signal and projects the different frequency components onto the corresponding frequency bands, overcoming the mode-mixing problem of EMD (Wu and Huang 2009). The method can be described as

y_i(n) = y(n) + w_i(n)    (8)

y_i(n) = Σ_{k=1}^{M} c_{i,k}(n) + r_{i,M}(n)    (9)

where i = 1, 2, …, L, y(n) is the original signal, w_i(n) is the ith added white noise, y_i(n) is the noisy signal of the ith trial, M is the number of IMF components from EMD, c_{i,k}(n) is the kth IMF of the ith trial, and L is the ensemble number of EEMD.

2.3 Complementary EEMD (CEEMD)

In the EEMD method, a large ensemble number is required to clear the residual of the added white noise from the IMFs, leading to high computational complexity and long computation time. To overcome this limitation of EEMD, complementary EEMD (CEEMD) was developed. The CEEMD method adds white noise in pairs, one positive and one negative, to the original signal and produces two sets of ensemble IMFs, so two different combinations of the original signal and the added white noise are obtained:

y_i^+(n) = y(n) + w_i(n)    (10)

y_i^−(n) = y(n) − w_i(n)    (11)

where y(n) is the original signal, w_i(n) is the ith added white noise, y_i^+(n) is the sum of the original signal and the white Gaussian noise and y_i^−(n) is the difference between them. The original signal y(n) can then be represented as

y(n) = (1/2L) Σ_{k=1}^{M} Σ_{i=1}^{L} ( c_{i,k}^+(n) + c_{i,k}^−(n) ) + (1/2L) Σ_{i=1}^{L} ( r_{i,M}^+(n) + r_{i,M}^−(n) )    (12)

where c_{i,k}^+(n) is the kth IMF of y_i^+(n) and c_{i,k}^−(n) is the kth IMF of y_i^−(n). Choosing a proper amplitude of the additive white noise for the EEMD method remains an open problem (Yeh et al. 2010). For detailed algorithms of EMD, EEMD and CEEMD the reader is referred to (Huang and Shen 1998; Wu and Huang 2009; Yeh et al. 2010).
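As an illustration of the sifting loop of Sect. 2.1, the following is a minimal NumPy/SciPy sketch of EMD. The experiments in Sect. 4 were carried out in MATLAB; the spline envelopes, the crude IMF test and the iteration limits below are simplifying assumptions, not the exact rules of the algorithm.

import numpy as np
from scipy.interpolate import CubicSpline
from scipy.signal import argrelextrema

def emd(y, max_imfs=10, max_sift=50):
    """Minimal EMD sketch following the steps of Sect. 2.1 (Eqs. 3-7)."""
    y = np.asarray(y, dtype=float)
    n = np.arange(len(y))
    imfs, r = [], y.copy()
    for _ in range(max_imfs):
        if len(argrelextrema(r, np.greater)[0]) < 4 or len(argrelextrema(r, np.less)[0]) < 4:
            break                                       # residual is (nearly) monotonic: end of EMD
        h = r.copy()
        for _ in range(max_sift):
            maxima = argrelextrema(h, np.greater)[0]
            minima = argrelextrema(h, np.less)[0]
            if len(maxima) < 4 or len(minima) < 4:
                break
            upper = CubicSpline(maxima, h[maxima])(n)   # E_u(n), upper envelope
            lower = CubicSpline(minima, h[minima])(n)   # E_l(n), lower envelope
            mu = 0.5 * (upper + lower)                  # Eq. (3): local mean
            d = h - mu                                  # Eq. (4): difference
            h = d
            if np.mean(mu ** 2) < 1e-3 * np.mean(h ** 2):
                break                                   # crude IMF test (assumption)
        imfs.append(h)                                  # c_k(n): next IMF, Eq. (5)
        r = r - h                                       # Eq. (6): remove it from the residual
    return np.array(imfs), r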
3 Proposed separation algorithm

In this section we discuss the proposed separation algorithm in detail, starting with multi-pitch tracking from the excitation-source information to detect the number of sources, followed by estimation of the filter characteristics and finally separation of the source signals. Figure 1 shows the core procedure of single-channel speech separation.

Fig. 1 Proposed single channel source separation algorithm

3.1 Voiced and unvoiced classification of mixed speech frames

Classifying the mixed speech into voiced and unvoiced frames makes the estimation of the source-filter characteristics
more effective. It also reduces the computation required for EMD, since in the proposed algorithm EMD is applied only to the voiced frames. We use a three-way classification of the mixed speech signal into silence, unvoiced and voiced frames based on a real-world labelling scheme, which uses the zero-crossing rate and the short-time energy of short frames of the mixed signal, as in the silence/unvoiced/voiced (SUVing) scheme of Greenwood and Kinghorn (1999). Figure 2 shows the zero-crossing rate and short-time energy computed for different frames of the mixed speech signal. From the plot it is clear that when the zero-crossing rate is high the energy is low, indicating an unvoiced frame, while voiced frames are identified by a low zero-crossing rate and high energy. With the labelling scheme of Greenwood and Kinghorn (1999), most frames are labelled voiced and very few unvoiced, since the speech segments contain more voiced than unvoiced regions.
Fig. 2 Classification of voiced and unvoiced frames in mixed speech signal
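A minimal sketch of the frame labelling described above, using short-time energy and the zero-crossing rate; the frame length, hop size and thresholds are illustrative assumptions, not the values used by the authors.

import numpy as np

def label_frames(y, frame_len=320, hop=160, energy_floor=1e-4, zcr_split=0.25):
    """Label frames of the mixed signal as 'silence', 'unvoiced' or 'voiced'
    from short-time energy and zero-crossing rate (thresholds are illustrative)."""
    labels = []
    for start in range(0, len(y) - frame_len + 1, hop):
        frame = np.asarray(y[start:start + frame_len], dtype=float)
        energy = np.mean(frame ** 2)                        # short-time energy
        zcr = np.mean(np.abs(np.diff(np.sign(frame))) > 0)  # zero-crossing rate
        if energy < energy_floor:
            labels.append('silence')
        elif zcr > zcr_split:
            labels.append('unvoiced')   # high ZCR, low energy
        else:
            labels.append('voiced')     # low ZCR, high energy
    return labels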
3.2 EMD of voiced frames in mixed speech signal

We apply EMD only to the voiced frames of the mixed speech for estimating the source-filter characteristics. This process can be described as

y_vf^m(n) = Σ_{k=1}^{M} c_k^m(n) + r_M^m(n),   m = 1, 2, …, V    (13)

In Eq. 13, y_vf^m(n) is the mth voiced frame of the mixed speech signal, M is the number of IMFs considered, c_k^m(n) is the kth IMF of the mth voiced frame, r_M^m(n) is the monotonic residual of the mth voiced frame, and V is the number of voiced frames considered for the mixed speech signal. The stopping criterion for the number of IMFs can be taken as the standard deviation between two successive decompositions falling below a threshold value, or the energy of a particular IMF component being too small to be considered. A Cauchy-type convergence test is used as the stopping criterion to determine the number of IMFs:

SD_k = Σ_{n=0}^{N−1} (c_{k−1}(n) − c_k(n))² / Σ_{n=0}^{N−1} c_{k−1}²(n)    (14)

In Eq. 14, SD_k is the threshold value computed for the kth IMF, and c_{k−1}(n) and c_k(n) are the two successive IMFs under consideration. As shown in Fig. 3, EMD decomposes the original signal into oscillatory sub-band components ordered from high-frequency to low-frequency components; the first five IMFs are shown.

Fig. 3 Intrinsic mode functions of a mixed speech frame
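The Cauchy-type measure of Eq. (14) is straightforward to compute between two successive IMFs; a short sketch follows. The threshold value is not specified in the paper; a small value such as 0.2–0.3 is commonly used in the EMD literature.

import numpy as np

def sd_between_imfs(c_prev, c_curr):
    """Cauchy-type convergence measure of Eq. (14) between two successive IMFs."""
    c_prev = np.asarray(c_prev, dtype=float)
    c_curr = np.asarray(c_curr, dtype=float)
    return np.sum((c_prev - c_curr) ** 2) / np.sum(c_prev ** 2)

# IMF extraction can be stopped once this measure falls below a small threshold
# (the paper does not state the exact value used).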
3.3 Estimation of pitch using excitation source information and SIFT

The excitation-source information can be extracted from the speech signal using linear prediction (LP) analysis (Yegnanarayana et al. 2009). To reduce the computational complexity of cepstrum-based deconvolution in the frequency domain, we use LP analysis to estimate the source and filter components of the speech signal; LP analysis finds these components in the time domain itself. The SIFT (simple inverse filtering tracking) method is a widely used pitch estimation method based on LP analysis of speech: it autocorrelates the LP residual rather than the speech directly. The vocal-tract information is modelled by the LP coefficients, so the LP residual mostly contains the excitation-source information, and its autocorrelation therefore has unambiguous peaks representing the pitch-period information. In LP analysis each sample is predicted as a linear combination of the past p samples, where p is the order of prediction. LP analysis is applied to all voiced frames before EMD and to all IMFs of the voiced frames after EMD:

ỹ_vf^m(n) = −Σ_{j=1}^{p} a_j y_vf^m(n − j),   m = 1, 2, …, V    (15)

c̃_k^m(n) = −Σ_{j=1}^{p} a_j c_k^m(n − j),   k = 1, 2, …, M    (16)

where ỹ_vf^m(n) is the predicted value of the mth voiced frame of the mixed speech signal, c̃_k^m(n) is the predicted value of the kth IMF of the mth voiced frame, {a_j} are the LP coefficients, V is the number of voiced frames in the mixed signal and M is the number of IMFs under consideration. The error between a speech sample and its predicted value is given by

e_vf^m(n) = y_vf^m(n) + Σ_{j=1}^{p} a_j y_vf^m(n − j),   m = 1, 2, …, V    (17)

e_imf^m(n) = c_k^m(n) + Σ_{j=1}^{p} a_j c_k^m(n − j),   k = 1, 2, …, M    (18)

where e_vf^m(n) is the error value of the voiced frames, e_imf^m(n) is the error value of the IMFs, y_vf^m(n) is the mth voiced frame and c_k^m(n) is the kth IMF of the mth voiced frame. The LPCs define the inverse filter

A(z) = 1 + Σ_{j=1}^{p} a_j z^{−j}    (19)

Passing the speech signal through this inverse filter is equivalent to using the optimized values of the LPCs in Eqs. (15) and (16); the minimum-error signal is therefore the LP residual, which contains the excitation-source information. The autocorrelation of the LP residual has unambiguous peaks, and the distance between the two largest peaks represents the fundamental period of the signal. The pitch information is extracted for all voiced speech frames in the mixture, denoted by f_0^m for m = 1, 2, …, V, where V is the number of voiced speech frames in the mixed speech signal. Similarly, pitch is estimated from all IMFs of a particular voiced frame, denoted by f_0^{k,m} for k = 1, 2, …, M, where M is the number of IMFs in every voiced speech frame.
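A minimal sketch of the per-frame pitch estimation of Sect. 3.3 is given below, assuming NumPy/SciPy; the LP order p, the pitch search range and the single-peak rule are illustrative assumptions rather than the exact settings used in the paper.

import numpy as np
from scipy.linalg import solve_toeplitz
from scipy.signal import lfilter

def lpc(frame, p=12):
    """LP coefficients of A(z) = 1 + sum_j a_j z^-j via the autocorrelation method."""
    frame = np.asarray(frame, dtype=float)
    r = np.correlate(frame, frame, 'full')[len(frame) - 1:]
    alpha = solve_toeplitz((r[:p], r[:p]), r[1:p + 1])   # forward predictor
    return np.concatenate(([1.0], -alpha))               # coefficients of A(z), Eq. (19)

def pitch_from_lp_residual(frame, fs, p=12, f0_min=60.0, f0_max=400.0):
    """SIFT-style f0 estimate: autocorrelate the LP residual and pick the peak lag."""
    A = lpc(frame, p)
    residual = lfilter(A, [1.0], frame)                  # e(n): inverse-filtered excitation
    ac = np.correlate(residual, residual, 'full')[len(residual) - 1:]
    lo, hi = int(fs / f0_max), int(fs / f0_min)
    lag = lo + np.argmax(ac[lo:hi])                      # strongest peak -> pitch period
    return fs / lag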
3.4 Determining number of speakers from single channel speech mixture

The fundamental period of the speech in each voiced frame is determined by autocorrelation of the LP residual as described in the previous section. At the end of this step the proposed method has multi-pitch information from the various voiced frames, which is pooled into a pitch-period histogram. An illustration of the histogram of multi-pitch information collected from a single-channel mixture of two speakers is shown in Fig. 4. We further extract the envelope of the pitch-period histogram using the Hilbert envelope, given by Eq. 20.

Fig. 4 Histogram of multi pitch information from single channel mixture of two speakers
h(n) = √( g²(n) + g_h²(n) )    (20)

In Eq. 20, h(n) is the Hilbert envelope, g_h(n) is the Hilbert transform of g(n) and g(n) represents the values of the pitch-period histogram. To eliminate spurious peaks from the histogram envelope and detect peaks more reliably, we further divide the square of each sample of the Hilbert envelope by the moving average of the Hilbert envelope computed over a short window around the sample:

d(n) = h²(n) / [ (1/(2R+1)) Σ_{l=n−R}^{n+R} h(l) ]    (21)

In Eq. 21, d(n) is the smoothed envelope of the histogram with prominent peaks and R is the number of samples corresponding to the short window. The smoothed peaks after removing spurious peaks are shown in Fig. 5.

Fig. 5 Smoothed Hilbert envelope of histogram shown in Fig. 4
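A sketch of Eqs. (20)-(21): the pooled pitch periods are binned into a histogram, its Hilbert envelope is taken, each squared envelope sample is divided by a local moving average, and the prominent peaks are picked. The number of bins, the window half-width R and the peak-height threshold are illustrative assumptions.

import numpy as np
from scipy.signal import hilbert, find_peaks

def prominent_pitch_peaks(pitch_periods, bins=200, R=3):
    """Pool frame-wise pitch periods into a histogram, smooth it (Eqs. 20-21)
    and return the pitch periods at the prominent peaks."""
    g, edges = np.histogram(pitch_periods, bins=bins)
    g = g.astype(float)
    h = np.abs(hilbert(g))                          # Eq. (20): Hilbert envelope of the histogram
    win = np.ones(2 * R + 1) / (2 * R + 1)
    local_avg = np.convolve(h, win, mode='same')    # moving average over a short window
    d = h ** 2 / (local_avg + 1e-12)                # Eq. (21): suppress spurious peaks
    peaks, _ = find_peaks(d, height=0.5 * d.max())  # keep only prominent peaks (assumed threshold)
    centres = 0.5 * (edges[:-1] + edges[1:])
    return centres[peaks]                           # candidate pitch periods of the speakers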
3.5 Estimation of the filter characteristics

Formant frequencies are the resonant frequencies of the vocal tract, and they vary with the vocal-tract configuration. Typically there are three resonances of significance for the human vocal tract below 3500 Hz. Phonemes can be easily distinguished by the frequencies of the first two or three formants F1, F2 and F3: F1 varies from 300 to 1000 Hz, F2 from 850 to 2500 Hz and F3 from 2300 to 3500 Hz. The energy concentration at a formant frequency is higher than at any other frequency of the speech signal, and formant frequencies are speaker specific. Formant frequencies can be obtained from the roots of the LPC polynomial (Snell and Milinazzo 1993) given in (15) and (16). In the z-domain the all-pole system transfer function H(z) is given by the inverse of Eq. (19):

H(z) = 1/A(z) = 1 / ( 1 + Σ_{j=1}^{p} a_j z^{−j} )    (22)

To specify the model order we use the general rule that the order is two times the expected number of formants plus two. In the frequency range [0, fs/2], where fs is the sampling frequency, we expect three formants; therefore we set the model order to 8. The roots of the prediction polynomial returned by the LP analysis are then found. A complex root pair of the LPC polynomial is represented by

z = q e^{±jθ}    (23)

Once the roots are obtained, the angles corresponding to the roots are determined and the angular frequency is converted to Hz. The formant frequency in Hz is

F = (fs/2π) θ  Hz    (24)

where fs is the sampling frequency and θ is the angular frequency. The bandwidth B of a formant is represented by the distance of the prediction-polynomial zero from the unit circle:

B = −(1/2)(fs/2π) log |q e^{±jθ}|²    (25)

We use the criterion that formant frequencies should be greater than 90 Hz with bandwidth less than 400 Hz to determine the formants. We compute the first three formants for every voiced frame of the mixed speech signal, denoted by F_i^m, and for every IMF of a voiced speech frame, denoted by F_i^{k,m}, where i = 1, 2, 3 indexes the formants, m = 1, 2, …, V the voiced speech frames and k = 1, 2, …, M the IMFs of each voiced frame.
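The formant computation of Eqs. (22)-(25) can be sketched as follows; the lpc helper repeats the autocorrelation-method estimate from the pitch sketch above, the order-8 model follows the rule stated in the text, and the use of the natural logarithm in Eq. (25) is an assumption. This is an illustrative sketch, not the authors' implementation.

import numpy as np
from scipy.linalg import solve_toeplitz

def lpc(frame, p):
    """LP coefficients of A(z) = 1 + sum_j a_j z^-j (autocorrelation method)."""
    frame = np.asarray(frame, dtype=float)
    r = np.correlate(frame, frame, 'full')[len(frame) - 1:]
    alpha = solve_toeplitz((r[:p], r[:p]), r[1:p + 1])
    return np.concatenate(([1.0], -alpha))

def formants_from_lpc(frame, fs, n_formants=3):
    """Formant candidates from the roots of the LP polynomial (Eqs. 22-25)."""
    order = 2 * n_formants + 2                  # rule stated in the text (order 8 for 3 formants)
    A = lpc(frame, order)
    roots = np.roots(A)
    roots = roots[np.imag(roots) > 0]           # one root of each complex pair, Eq. (23)
    theta = np.angle(roots)
    freqs = theta * fs / (2 * np.pi)            # Eq. (24): formant frequency in Hz
    bws = -0.5 * (fs / (2 * np.pi)) * np.log(np.abs(roots) ** 2)   # Eq. (25): bandwidth
    keep = (freqs > 90.0) & (bws < 400.0)       # criterion used in the text
    return np.sort(freqs[keep])[:n_formants]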
3.6 Reconstruction of speakers from mixed signal

Reconstruction of the speakers is done with the knowledge of the fundamental frequency f_0^{s_x} estimated earlier for the individual speakers, where x = 1, 2 for the present case of two speakers. We apply a four-stage reconstruction process for the individual speakers. The decision for assigning the selected IMFs of a voiced frame to a speaker is made using min_x |f_0^m − f_0^{s_x}| for x = 1, 2. (1) Selecting IMFs based on the knowledge of the source characteristics f_0^m and f_0^{k,m}, where f_0^m is the fundamental frequency of the mth voiced speech frame and f_0^{k,m} is the fundamental frequency of the kth IMF of the mth voiced speech frame. Based on the minimum absolute difference
min_k |f_0^m − f_0^{k,m}|, one IMF is selected out of the M IMFs for a particular speaker in that voiced speech frame. This process is repeated for all voiced speech frames in the mixed signal; the selected kth IMF of the mth voiced frame is denoted by c_source^{k,m}(n). (2) Selecting IMFs based on the knowledge of the filter characteristics F_i^m and F_i^{k,m}, where F_i^m is the ith formant of the mth voiced speech frame and F_i^{k,m} is the ith formant of the kth IMF of the mth voiced speech frame. Based on the minimum absolute difference min_k |F_i^m − F_i^{k,m}|, one IMF is selected out of the M IMFs for a particular speaker in that voiced speech frame, for i = 1, 2, 3. This process is repeated for all voiced speech frames in the mixed signal; the selected kth IMF of the mth voiced frame is denoted by c_filter^{k,m}(n). (3) The unvoiced frames of the mixed speech are assigned based on the energy content of the individual speakers after steps 1 and 2 of the reconstruction process. Since the majority of the energy lies in the voiced speech frames, they contribute most to the overall energy of the individual speakers; as very few unvoiced frames exist in a two-speaker mixed signal, the unvoiced frames are assigned to the speaker with the higher energy content after reconstruction with steps 1 and 2. The unvoiced speech frames are denoted by y_uvf^q(n) for q = 1, 2, …, U, where U is the number of unvoiced frames in the mixed speech signal, which is very small compared to the number of voiced speech frames. (4) Finally, the missing voiced frames of the individual speakers are imputed by selecting the IMF with the least variance, indicating the presence of a single speaker. This process is repeated for every voiced speech frame until all missing samples are imputed; Fig. 6 shows the variance of all IMFs over each speech frame. The IMF contributing to missing-data imputation is denoted by c_minvar^{k,m}(n). In the above steps k = 1, 2, …, M and m = 1, 2, …, V, where M is the number of IMFs and V is the number of voiced speech frames. Based on these four reconstruction steps, the separated speaker s̃_x(n) is estimated from the mixture as

s̃_x(n) = c_source^{k,m}(n) + c_filter^{k,m}(n) + y_uvf^q(n) + c_minvar^{k,m}(n)    (26)

where x = 1, 2 indicates the speakers in the mixed speech signal.
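The first selection stage of the reconstruction can be sketched as follows, assuming the frame-wise and IMF-wise pitch estimates from the earlier sketches are available as arrays; the array shapes and names are illustrative.

import numpy as np

def assign_and_select_imf(f0_frame, f0_imf, f0_speakers):
    """Stage 1 of the reconstruction as a sketch.

    f0_frame    : shape (V,), pitch of each voiced frame of the mixture
    f0_imf      : shape (M, V), pitch of the kth IMF in the mth voiced frame
    f0_speakers : length-2 sequence with the two speakers' fundamental frequencies
    Returns the speaker assigned to each frame (min_x |f0^m - f0^{s_x}|) and the
    IMF selected in each frame (min_k |f0^m - f0^{k,m}|).
    """
    f0_frame = np.asarray(f0_frame, dtype=float)
    f0_imf = np.asarray(f0_imf, dtype=float)
    f0_speakers = np.asarray(f0_speakers, dtype=float)
    speaker_of_frame = np.argmin(np.abs(f0_frame[None, :] - f0_speakers[:, None]), axis=0)
    imf_of_frame = np.argmin(np.abs(f0_imf - f0_frame[None, :]), axis=0)
    return speaker_of_frame, imf_of_frame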
4 Experimental analysis

In this section we present the results obtained by applying the proposed method to male–male, male–female and female–female speaker mixtures. The recording was done using a single microphone and two speakers in a real room environment.
Fig. 6 Variance of IMFs over speech frames
Initially we compare the results within the proposed algorithm for various source-filter features combined with EMD. We also compare the proposed method with EEMD and CEEMD, and then with other existing SCSS techniques using objective measures such as the signal-to-distortion ratio (SDR), signal-to-artifact ratio (SAR) and signal-to-interference ratio (SIR) proposed in Vincent et al. (2006). We also compute the improvement in signal-to-noise ratio (ISNR), as proposed in Molla and Hirose (2007), for the various SCSS methods. All computations were done in MATLAB R2007b under Windows 7 (64-bit) on an Intel Core i5-2500 CPU at 3.3 GHz with 4 GB RAM. The recovered source signal is decomposed into a target part s_target along with error terms, namely the interference e_interf and the algorithmic artifacts e_artif. These components are used to define the SDR, SIR and SAR measures as follows.
SDR = 10 log10 ( ‖s_target‖² / ‖e_interf + e_artif‖² )    (27)

SIR = 10 log10 ( ‖s_target‖² / ‖e_interf‖² )    (28)

SAR = 10 log10 ( ‖s_target + e_interf‖² / ‖e_artif‖² )    (29)
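Given the three components s_target, e_interf and e_artif of each recovered source (obtained in practice from the BSS_EVAL decomposition of Vincent et al. 2006), Eqs. (27)-(29) reduce to simple energy ratios; a sketch, assuming the decomposition is already available as arrays:

import numpy as np

def bss_ratios(s_target, e_interf, e_artif):
    """SDR, SIR and SAR in dB from the decomposition components of Eqs. (27)-(29)."""
    s_target, e_interf, e_artif = (np.asarray(v, dtype=float)
                                   for v in (s_target, e_interf, e_artif))
    energy = lambda v: np.sum(v ** 2)
    sdr = 10 * np.log10(energy(s_target) / energy(e_interf + e_artif))
    sir = 10 * np.log10(energy(s_target) / energy(e_interf))
    sar = 10 * np.log10(energy(s_target + e_interf) / energy(e_artif))
    return sdr, sir, sar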
Fig. 7 Comparison of averaged SDR for source-filter characteristics with EMD
Fig. 9 Comparison of averaged SAR for source-filter characteristics with EMD
Fig. 8 Comparison of averaged SIR for source-filter characteristics with EMD
Fig. 10 Comparison of averaged ISNR for source-filter characteristics with EMD
Figures 7, 8 and 9 show the internal comparison of the proposed algorithm with various source-filter characteristics over SDR, SIR and SAR respectively. Another objective measure commonly used for SCSS comparison is the ISNR, which measures the distortion between the original and estimated source signals:

ISNR_x = 10 log10 ( Σ_n |s_x(n)|² / Σ_n |s_x(n) − s̃_x(n)|² ) − 10 log10 ( Σ_n |s_x(n)|² / Σ_n |y(n) − s_x(n)|² )    (30)

Here ISNR_x is the ISNR of the xth speaker for x = 1, 2, s_x(n) is the xth original speech signal, s̃_x(n) is the xth reconstructed speech signal and y(n) is the mixed speech signal.
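A direct sketch of Eq. (30), assuming the original source, its estimate and the mixture are available as equal-length arrays:

import numpy as np

def isnr(s_x, s_x_hat, y):
    """ISNR of Eq. (30): output SNR of the reconstruction minus input SNR of the mixture."""
    s_x, s_x_hat, y = (np.asarray(v, dtype=float) for v in (s_x, s_x_hat, y))
    out_snr = 10 * np.log10(np.sum(s_x ** 2) / np.sum((s_x - s_x_hat) ** 2))
    in_snr = 10 * np.log10(np.sum(s_x ** 2) / np.sum((y - s_x) ** 2))
    return out_snr - in_snr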
Figure 10 shows the internal comparison of the proposed algorithm with respect to ISNR over the source-filter characteristics. Figure 11 shows the comparison of the objective measures for EEMD and CEEMD along with EMD. Figures 12, 13, 14 and 15 show the comparison of SDR, SIR, SAR and ISNR, respectively, with other existing SCSS methods. The proposed method is able to separate speech–speech mixtures from real recordings; Fig. 16 shows an example separation for a male–female speech mixture, and the waveforms show the correlation of the estimated sources with the original source signals. Figure 17 compares the SCSS methods in terms of execution time.

Fig. 11 Comparison of objective measures with EMD, EEMD and CEEMD

Fig. 12 Comparison of averaged SDR for various SCSS

The execution time for each method was calculated by repeating the algorithm ten times in the
interval of 2000 samples; the final calculation considered 20,000 samples. Figure 17 shows that EMD and the EMD-variant based methods take significantly less execution time than the highly iterative modified NMF-based algorithm. Table 1 shows the comparison of various single-channel source separation algorithms. Other existing methods such as SOLO and the basic NMF algorithm also take less time to execute, but their separation results for a speech–speech mixture are low. The proposed method takes less time to execute since it is free from iterative optimization procedures. Demo audio files of the proposed source separation technique can be found at https://websitebuilder102.website.com/audio-and-speech-research.
Fig. 13 Comparison of averaged SIR for various SCSS
Fig. 14 Comparison of averaged SAR for various SCSS
Fig. 15 Comparison of averaged ISNR for various SCSS
Fig. 16 Time waveforms of separation for a male–female speech mixture. First: original speech1, second: original speech2, third: mixed speech, fourth: separated speech1, fifth: separated speech2
5 Discussion of results

Fundamental frequency f0 alone, combined with EMD, provides good separation results, but it is desirable to also select IMFs based on formants, since the IMFs selected by formants have more energy than the other IMFs. Of the first three formants, the F1 and F2 regions overlap, as they are closely spaced in frequency irrespective of the gender of the speaker, whereas F3 together with f0 can be used more efficiently to separate the speakers irrespective of gender. Combining both features with EMD produces superior speaker-separation results, as shown in Figs. 7, 8 and 9; EMD combined with f0 and F3 yields the most balanced objective measures. ISNR is maximum when the IMFs selected by all three formants are combined with the IMF selected by the fundamental frequency, irrespective of the gender of the speakers in the mixed speech signal; this is due to the high energy of the IMFs selected based on formants. There is no significant difference between the objective measures obtained with EMD, EEMD and CEEMD; however, the execution time of EMD is slightly lower than that of the other EMD variants, so EMD is still good enough to separate the speakers in a speech–speech mixture when combined with the source-filter characteristics. Except for SAR, the proposed method produces superior separation results for a speech–speech mixture, because EMD is combined with source-filter characteristics instead of a highly iterative optimization between the original and estimated sources.

Fig. 17 Comparison of execution time over number of samples

Table 1 Comparison of single channel source separation algorithms (SAR, SIR, SDR and ISNR in dB)

Mixed signal     | EMD + f0 and F3 (proposed) | Unsupervised NMF       | ISNMF2D                | SOLO
                 | SAR  SIR  SDR  ISNR        | SAR  SIR  SDR  ISNR    | SAR  SIR  SDR  ISNR    | SAR  SIR  SDR  ISNR
Male + male      | 2.3  3.15 4.95 0.52        | 1.9  0.2  1.9  0.1     | 3.1  1.65 1.34 0.24    | 2.9  0.74 0.8  0.46
Male + female    | 1.36 6.1  3.25 3.55        | 0.9  0.3  2.4  0.9     | 2.2  4.75 3.2  1.2     | 1.2  3.6  2.9  0.9
Female + female  | 1.8  6.9  4.5  2.8         | 0.75 0.1  0.2  0.4     | 2.4  5.1  3.27 0.5     | 2.1  2.7  2.7  0.4

6 Conclusion

In this paper we presented a time-domain approach to unsupervised single-channel source separation for speech–speech mixtures from real-world recordings. Combining EMD with speech-specific information such as multi-pitch information and formants produces results comparable to other existing approaches. The results encourage further research into combining EMD with other signal processing approaches. Future research can be directed towards estimating the number of speakers when there are more than two speakers in a single-channel speech mixture, which is quite challenging.
References

Bofill, P. (2008). Identifying single source data for mixing matrix estimation in instantaneous blind source separation. In Proceedings of the ICANN (pp. 759–767). Berlin: Springer.
Douglas, S. C., Sawada, H., & Makino, S. (2005). Natural gradient multichannel blind deconvolution and speech separation using causal FIR filters. IEEE Transactions on Speech and Audio Processing, 13(1), 92–104.
Ellis, D. P. (2006). Model-based scene analysis. Computational auditory scene analysis: Principles, algorithms, and applications, 4, 115–146.
Fevotte, C., & Godsill, S. J. (2006). A Bayesian approach for blind separation of sparse sources. IEEE Transactions on Audio, Speech, and Language Processing, 14(6), 2174–2188.
Gao, B., Woo, W. L., & Dlay, S. S. (2011). Single-channel source separation using EMD-subband variable regularized sparse features. IEEE Transactions on Audio, Speech, and Language Processing, 19(4), 961–976.
Gao, B., Woo, W. L., & Dlay, S. S. (2013). Unsupervised single-channel separation of nonstationary signals using gammatone filterbank and Itakura–Saito nonnegative matrix two-dimensional factorizations. IEEE Transactions on Circuits and Systems I: Regular Papers, 60(3), 662–675.
Greenwood, M., & Kinghorn, A. (1999). SUVing: Automatic silence/unvoiced/voiced classification of speech. Sheffield: Undergraduate Coursework, Department of Computer Science, The University of Sheffield.
Hershey, J. R., Olsen, P. A., Rennie, S. J., & Aron, A. (2011). Audio alchemy: Getting computers to understand overlapping speech. Scientific American Online. http://www.scientificamerican.com/article/speech-gettingcomputersunderstand-overlapping.
Huang, N. E., Shen, Z., & Long, S. R. (1998). The empirical mode decomposition and Hilbert spectrum for nonlinear and non-stationary time series analysis. Proceedings of the Royal Society of London A, 454, 903–995.
http://iitg.vlab.co.in/?sub=59&brch=164&sim=616&cnt=1108.
Jang, G. J., & Lee, T. W. (2003). A maximum likelihood approach to single-channel source separation. Journal of Machine Learning Research, 4, 1365–1392.
Karhunen, J., & Oja, E. (2001). Independent component analysis. Hoboken: Wiley.
Kristjansson, T., Attias, H., & Hershey, J. (2004). Single microphone source separation using high resolution signal reconstruction. In IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP'04) (Vol. 2, pp. ii-817).
Li, P., Guan, Y., Xu, B., & Liu, W. (2006). Monaural speech separation based on computational auditory scene analysis and objective quality assessment of speech. IEEE Transactions on Audio, Speech, and Language Processing, 14(6), 2014–2023.
Li, Y., Amari, S. I., Cichocki, A., Ho, D. W., & Xie, S. (2006). Underdetermined blind source separation based on sparse representation. IEEE Transactions on Signal Processing, 54(2), 423–437.
Litvin, Y., & Cohen, I. (2011). Single-channel source separation of audio signals using bark scale wavelet packet decomposition. Journal of Signal Processing Systems, 65(3), 339–350.
Mijovic, B., De Vos, M., Gligorijevic, I., Taelman, J., & Van Huffel, S. (2010). Source separation from single-channel recordings by combining empirical-mode decomposition and independent component analysis. IEEE Transactions on Biomedical Engineering, 57(9), 2188–2196.
Molla, M. K. I., & Hirose, K. (2007). Single-mixture audio source separation by subspace decomposition of Hilbert spectrum. IEEE Transactions on Audio, Speech, and Language Processing, 15(3), 893–900.
Ozerov, A., & Févotte, C. (2010). Multichannel nonnegative matrix factorization in convolutive mixtures for audio source separation. IEEE Transactions on Audio, Speech, and Language Processing, 18(3), 550–563.
Reyes-Gomez, M. J., Ellis, D. P., & Jojic, N. (2004). Multiband audio modeling for single-channel acoustic source separation. In IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP'04) (Vol. 5, pp. V-641).
Schmidt, M. N., & Olsson, R. K. (2006). Single-channel speech separation using sparse non-negative matrix factorization. In ISCA International Conference on Spoken Language Processing (INTERSPEECH).
Snell, R. C., & Milinazzo, F. (1993). Formant location from LPC analysis data. IEEE Transactions on Speech and Audio Processing, 1(2), 129–134.
Stark, M., Wohlmayr, M., & Pernkopf, F. (2011). Source–filter-based single-channel speech separation using pitch information. IEEE Transactions on Audio, Speech, and Language Processing, 19(2), 242–255.
Tengtrairat, N., Gao, B., Woo, W. L., & Dlay, S. S. (2013). Single-channel blind separation using pseudo-stereo mixture and complex 2-D histogram. IEEE Transactions on Neural Networks and Learning Systems, 24(11), 1722–1735.
Vincent, E., Bertin, N., Gribonval, R., & Bimbot, F. (2014). From blind to guided audio source separation: How models and side information can improve the separation of sound. IEEE Signal Processing Magazine, 31(3), 107–115.
Vincent, E., Gribonval, R., & Févotte, C. (2006). Performance measurement in blind audio source separation. IEEE Transactions on Audio, Speech, and Language Processing, 14(4), 1462–1469.
Virtanen, T. (2007). Monaural sound source separation by nonnegative matrix factorization with temporal continuity and sparseness criteria. IEEE Transactions on Audio, Speech, and Language Processing, 15(3), 1066–1074.
Wang, Y. H., Yeh, C. H., Young, H. W. V., Hu, K., & Lo, M. T. (2014). On the computational complexity of the empirical mode decomposition algorithm. Physica A: Statistical Mechanics and its Applications, 400, 159–167.
Wu, Z., & Huang, N. E. (2009). Ensemble empirical mode decomposition: A noise-assisted data analysis method. Advances in Adaptive Data Analysis, 1(1), 1–41.
Yegnanarayana, B., Swamy, R. K., & Murty, K. S. R. (2009). Determining mixing parameters from multispeaker data using speech-specific information. IEEE Transactions on Audio, Speech, and Language Processing, 17(6), 1196–1207.
Yeh, J. R., Shieh, J. S., & Huang, N. E. (2010). Complementary ensemble empirical mode decomposition: A novel noise enhanced data analysis method. Advances in Adaptive Data Analysis, 2(2), 135–156.
Yilmaz, O., & Rickard, S. (2004). Blind separation of speech mixtures via time frequency masking. IEEE Transactions on Signal Processing, 52(7), 1830–1847.