Improvement of Mask-Based Speech Source Separation Using DNN Ge Zhan, Zhaoqiong Huang, Dongwen Ying, Jielin Pan and Yonghong Yan Key Laboratory of Speech Acoustics and Content Understanding, Chinese Academy of Sciences
[email protected]
Abstract
The speech mask is widely used to separate multiple speech sources, wherein the time-frequency (TF) bins are classified into clusters that correspond to each source. For each source, the separated signal consists of the components on the TF bins that are dominated by this source, whereas the components on the remaining bins are completely masked. Most separation methods ignore the masked components. In fact, the masked components may contain useful information, and mask-based speech source separation can be improved by reconstructing them. This paper proposes a post-processing method that reconstructs the masked frequency components through a deep neural network (DNN). We construct a regression from the reliable frequency components to the masked components. After the mask-based separation, the reliable components are kept unchanged, and the masked components are replaced by the outputs of the DNN. Experimental results confirm that the proposed method significantly improves the mask-based separation, and that the masked components still contribute to speech quality.
Index Terms: spectrographic speech mask, speech presence probability, time-frequency correlation, neighbor factor
1. Introduction
The speech mask has been widely used for speech separation [1]–[6], wherein the time-frequency (TF) bins are classified into clusters that correspond to each source. In general, there are two kinds of speech masks. One is the binary mask, which forces a hard decision about whether speech is present or absent at each TF bin [1]–[3]. The other is the soft mask, whose elements take on a continuum of values between 0 and 1, often interpreted as the probability of speech presence or the energy ratio of the speech signal at each TF bin [4]–[6]. Whether binary or soft, the speech mask can be regarded as a state matrix that represents the presence of the speech source in the TF domain [7].
The fundamental intention of mask-based speech source separation is to extract the frequency components of the speech sources. One common step is to apply the speech mask to the spectrum of the mixture [1]–[6]. For each source, the separated signal consists of the frequency components that the speech mask assigns to the source, whereas the remaining components are removed. Afterwards, the separated signal can be inversely transformed into the corresponding waveform of the source. Such a process is often referred to as TF masking [1]. Through TF masking, the speech source separation problem is cast as a clustering problem based on the spectrum and the mask.
The frequency components of the separated signal are usually considered reliable for perceiving the information carried by the original source signal. The reliable components can contribute to the improvement of speech intelligibility [2], whereas the masked components of the source may also contain useful information. Therefore, a post-processing method is needed to refine the extracted spectrum.
To improve mask-based speech source separation, it is critical to reconstruct the masked components on the basis of the reliable components in the separated signal. The regression from the reliable components to the masked components is highly non-linear. Some recent studies have used sparse representations to model similarly complex regressions and reported good performance [8], [9]. More recently, deep neural networks (DNNs) have been adopted to construct similar regressions, and convincing results have been reported [4]–[6], [10]. In [5], speech separation is taken as a front end before enhancement, and the spectrographic speech mask that represents the speech signal is estimated. In [10], a DNN is trained to model the mapping between noisy and clean spectra, using a large variety of training conditions to overcome the mismatch problem. These works have proven the capability of a DNN to construct a highly non-linear regression.
Therefore, this paper proposes a post-processing method that uses a DNN to improve mask-based speech source separation. The DNN constructs the complex regression from the reliable components to the masked components. The masked components are reconstructed by the DNN and then combined with the reliable components to compose the modified separated signal. To further improve performance, per-utterance statistical information is used as part of the input feature of the DNN. The evaluation results confirm the efficiency of the proposed method.

978-1-5090-4294-4/16/$31.00 ©2016 IEEE
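As an illustration of the two mask types described in the introduction, the ideal binary mask and an energy-ratio soft mask can be computed from the (known) source spectra. This is a toy NumPy sketch; the data and variable names are ours, not the paper's:

```python
import numpy as np

# Toy power spectra of two known sources, shape (frames, freq_bins).
rng = np.random.default_rng(0)
S1 = rng.standard_normal((4, 5)) ** 2   # power spectrum of source 1
S2 = rng.standard_normal((4, 5)) ** 2   # power spectrum of source 2

# Ideal binary mask: 1 where source 1 dominates the TF bin, else 0.
ibm = (S1 > S2).astype(float)

# Soft (ratio) mask: energy ratio of source 1 at each TF bin, in [0, 1].
irm = S1 / (S1 + S2)

assert set(np.unique(ibm)) <= {0.0, 1.0}
assert np.all((irm >= 0.0) & (irm <= 1.0))
```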
2. Problem Formulation
To simplify the problem and make it straightforward to analyze, the binary mask [1]–[3] is chosen to extract the separated signal in the proposed method. One feasible way of generating the binary mask is via the azimuths of the speech sources. The azimuths, which can be obtained by sound source localization methods such as [11], [12], can be fed to a speech mask estimator such as [7]. The details of generating the binary mask are beyond the scope of this paper. The bin-wise TF masking is given by

S_S = S_M .* M,    (1)

where M is the speech mask of one speech source, S_M is the signal of the multiple speech sources, S_S is the separated signal of the source, and .* denotes bin-wise multiplication in the TF domain. Due to the nature of bin-wise TF masking, the masked components are unknowable, and thus S_S inevitably lacks some useful information. Fig. 1 demonstrates the difference among the frequency components of the different signals,
Figure 1: (a) is the log-power spectrum of two speech sources; (b) is the ideal binary mask of one source in (a), where the bright color indicates 1 and the dark color indicates 0; (c) is the extracted spectrum obtained via (a) and (b); (d) is the spectrum of the original signal of the source in (b).
where the separated signal lacks a large number of components in the TF domain. Therefore, the proposed post-processing method is dedicated to overcoming this defect by reconstructing the masked components. A DNN is adopted to model the regression from the reliable components to the masked components of the binary-masked separated signal. The separated signal is the basic material for DNN training. However, after TF masking, the separated signal contains plenty of zero-valued frequency components where the original bins are masked. These massive zero-valued components form an artificial concentration at zero that is inconsistent with the natural distribution of the speech signal in the TF domain; moreover, the natural distribution varies with the speaker. The DNN training may suffer from the confusion between these two conflicting distributions of the input data. It is therefore assumed in the proposed method that per-utterance statistical information can help the DNN overcome this confusion. The per-utterance statistical information is thus utilized to represent the masked components, which are unknowable in the separated signal.
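The bin-wise TF masking of Eq. (1), and the zero-valued components it leaves behind, can be sketched as follows (toy NumPy example; shapes and names are ours):

```python
import numpy as np

# Minimal sketch of Eq. (1): S_S = S_M .* M, the bin-wise product of the
# mixture spectrum with a binary mask. Toy shapes (frames, freq_bins).
S_M = np.arange(12, dtype=float).reshape(3, 4)   # mixture spectrum (toy)
M = np.array([[1, 0, 1, 0],
              [0, 1, 1, 0],
              [1, 1, 0, 0]], dtype=float)        # binary mask of one source

S_S = S_M * M   # separated signal: masked bins become exactly zero

# The masked bins carry no information after masking -- this is the defect
# the proposed post-processing is meant to address.
n_masked = int(np.sum(M == 0))
assert np.all(S_S[M == 0] == 0)
assert n_masked == 6
```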
3. Proposed Method
A flow chart of the proposed method is illustrated in Fig. 2. The front end is a typical mask-based speech source separation that performs binary TF masking; its input is the log-power spectrum of the mixture of two interfering speech sources. The output of the front end is the separated signal, which is in fact the log-power spectrum of one speech source. Then, the DNN is trained to construct the regression from the reliable frequency components to the masked components.
3.1. Basic Training
The DNN in the proposed method is a feed-forward network. With many levels of non-linearities [13], the DNN is capable
Figure 2: The flow chart of the proposed method.
of constructing the regression required in the proposed method. Fig. 3 illustrates the training procedure of the DNN. To train the DNN, a common training strategy similar to that in [10] is adopted. The DNN is first pre-trained as a stack of restricted Boltzmann machines (RBMs) [14], wherein the first is a Gaussian-Bernoulli RBM and the others are Bernoulli-Bernoulli RBMs. The RBMs are trained one by one before they are stacked. The pre-training is conducted under the contrastive divergence criterion [14]. Afterwards, the DNN is fine-tuned under the minimum mean square error criterion [10]. Such a training strategy is intended to avoid poor local minima. The trained DNN is regarded as the baseline model to demonstrate the contribution of the masked components to speech quality.
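The fine-tuning stage under the minimum mean square error criterion can be sketched as a plain NumPy gradient step. This is a minimal single-hidden-layer illustration with toy dimensions, not the paper's three 1024-unit hidden layers, and the RBM pre-training is omitted here:

```python
import numpy as np

rng = np.random.default_rng(1)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

# Toy dimensions; the paper uses 1419 inputs, three 1024-unit hidden
# layers, and 129 outputs.
n_in, n_hid, n_out = 8, 16, 4
W1 = rng.standard_normal((n_in, n_hid)) * 0.1
b1 = np.zeros(n_hid)
W2 = rng.standard_normal((n_hid, n_out)) * 0.1
b2 = np.zeros(n_out)

X = rng.standard_normal((32, n_in))    # input: reliable components (toy)
Y = rng.standard_normal((32, n_out))   # target: masked components (toy)

def forward(X):
    H = sigmoid(X @ W1 + b1)
    return H, H @ W2 + b2

lr = 0.1
H, P = forward(X)
loss0 = np.mean((P - Y) ** 2)

# One backpropagation step under the MSE criterion (fine-tuning).
dP = 2.0 * (P - Y) / Y.size
dW2 = H.T @ dP; db2 = dP.sum(0)
dH = (dP @ W2.T) * H * (1 - H)
dW1 = X.T @ dH; db1 = dH.sum(0)
W2 -= lr * dW2; b2 -= lr * db2
W1 -= lr * dW1; b1 -= lr * db1

_, P = forward(X)
assert np.mean((P - Y) ** 2) < loss0   # the gradient step reduces the MSE
```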
Figure 3: The training procedure of the DNN.
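The input layer of the network in Fig. 3 corresponds to 11 adjacent 129-bin log-power frames concatenated into 1419 inputs, while the output regenerates one 129-bin frame (see Section 4). A sketch of this feature construction follows; the edge padding at utterance boundaries is our assumption, as the paper does not specify it:

```python
import numpy as np

# Each training example concatenates 11 adjacent frames of the 129-bin
# log-power spectrum: 11 * 129 = 1419 inputs; the target is one frame.
n_frames, n_bins, context = 50, 129, 11
spec = np.random.default_rng(2).standard_normal((n_frames, n_bins))

half = context // 2
padded = np.pad(spec, ((half, half), (0, 0)), mode="edge")  # assumed padding
X = np.stack([padded[t:t + context].reshape(-1) for t in range(n_frames)])
Y = spec  # the output layer regenerates the spectrum frame by frame

assert X.shape == (50, 1419)
assert Y.shape == (50, 129)
```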
Figure 4: The distribution of the frequency components: (a) is the line graph; (b) is the normal distribution. The color red indicates the reliable components, and the color blue indicates the masked components. The means and the variances of the normal distributions in (b) are calculated from the statistical information given by (a).
3.2. Modified Training
Fig. 4 illustrates the line graph and the distribution of the components on one frequency of the speech source used in Fig. 1. Both the reliable components and the masked components have an inherent distribution. Moreover, the distribution of the speech signal varies with the speaker. The separated signal contains plenty of zero-valued frequency components where the original components are masked. These massive zero-valued components are inconsistent with the natural distribution of the speech signal in the TF domain, which leads to a potential confusion in the DNN training. To resolve this confusion, the statistical information of the extracted spectrum is utilized to represent the masked components. The statistical information varies with the speaker. Meanwhile, for a specific speaker, the magnitude of the frequency components on each frequency has a wide variation. Within this variation range, it is assumed that the per-utterance means of the frequency components can be helpful. In the TF domain, a per-utterance mean is calculated on each frequency. Then, the per-utterance means are used to represent the masked components. That is, the magnitudes of the masked components are replaced by the per-utterance means, while the reliable components remain unchanged. Another DNN, with the same structure as the baseline DNN, is trained to adapt to the modified separated signal, with the training strategy unchanged. This newly trained DNN is proposed to further improve the mask-based speech source separation.
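The per-utterance mean replacement described above can be sketched as follows; the toy spectrogram and mask are ours:

```python
import numpy as np

# Masked bins of the separated log-power spectrum are replaced by the
# per-utterance mean of the reliable bins on the same frequency.
# Shapes: (frames, freq_bins); toy data.
spec = np.array([[1.0, 2.0, 3.0],
                 [3.0, 0.0, 5.0],
                 [0.0, 4.0, 0.0]])
mask = np.array([[1, 1, 1],
                 [1, 0, 1],
                 [0, 1, 0]], dtype=bool)   # True = reliable bin

means = np.zeros(spec.shape[1])
for f in range(spec.shape[1]):
    means[f] = spec[mask[:, f], f].mean()  # mean over reliable bins only

modified = np.where(mask, spec, means)     # fill masked bins, keep the rest

assert modified[2, 0] == 2.0   # mean of reliable bins (1.0, 3.0) on freq 0
assert modified[1, 1] == 3.0   # mean of reliable bins (2.0, 4.0) on freq 1
assert np.all(modified[mask] == spec[mask])   # reliable bins unchanged
```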
4. Evaluation
The TIMIT database [16] is selected to make the speech source signals. The TIMIT database contains two data sets, the training set and the testing set. The training set of TIMIT is used to make the training set for the DNN. First, the ten short utterances of the same speaker are concatenated, in two different orders, into two long utterances. Then, the two long utterances are used as two interfering speech sources and superimposed to make one mixture. In the same way, the testing set of TIMIT is used to make the testing set for the DNN. A training data set of nearly 20 hours is selected for the DNN training. A matched testing data set consists of 20 long utterances from the training data set of TIMIT. Similarly, a mismatched testing data set is organized with 20 long utterances from the testing data set of TIMIT.
For signal analysis, the frame size is set to 256 samples and the frame shift to 128 samples; the sampling rate of the TIMIT database is 16 kHz. The log-power spectrum of the signals is obtained via a 256-point discrete Fourier transform (DFT). The utterances are first transformed into log-power spectra. The mask-based speech source separation is fed with the log-power spectrum to generate the separated signal. The DNN takes the separated signal as input and reconstructs its masked frequency components. The reliable components of the separated signal are kept, and the masked components are replaced with the output of the trained DNN. The modified separated signal can then be inversely transformed into the waveform.
The structure of the proposed DNN is illustrated in Fig. 3. The number of input nodes is set to 1419, because the input of the DNN is a long frame formed by concatenating 11 adjacent frames of the log-power spectrum. There are three hidden layers, together with one input layer and one output layer. The number of nodes in each hidden layer is empirically set to 1024. The number of output nodes is set to 129 in order to generate the improved log-power spectrum frame by frame.
The proposed method actually trains two DNNs, namely the baseline DNN and the modified DNN. The evaluation of the baseline DNN is intended to demonstrate the importance of the masked components to speech quality. The evaluation of the modified DNN is arranged to further improve the performance, and also to verify the use of per-utterance statistical information to represent the masked components. The same speech source in Fig.
1 is used to give an informal comparison, which is illustrated in Fig. 5. One can see that both the baseline DNN and the modified DNN can reconstruct the masked components in the extracted log-power spectrum, which allows the separated signal to get closer to the original source signal. Within the two red rectangles, there are more
Figure 5: (a) is the extracted spectrum which is a part of (c) in Fig. 1, with the masked components represented by 0; (b) is the modified spectrum with the masked components represented by per-utterance means on each frequency; (c) is the spectrum reconstructed by the baseline DNN; (d) is the spectrum reconstructed by the modified DNN; (e) and (f) are the same spectrum of the original source signal.
Table 1: Objective evaluation in matched case.

Methods        LSD      SegSNR    PESQ
Raw            3.8245   13.0491   2.9620
Baseline DNN   1.7951   15.9634   3.5021
Modified DNN   1.6922   16.4065   3.6053
Table 2: Objective evaluation in mismatched case.

Methods        LSD      SegSNR    PESQ
Raw            3.8245   13.0491   2.9620
Baseline DNN   1.8375   15.8761   3.4911
Modified DNN   1.7721   16.3937   3.5833
reconstructed components generated by the modified DNN. The comparison between the two encircled parts indicates that the modified DNN generates a more adequate reconstruction of the masked components.
Three objective evaluation methods are conducted to evaluate the proposed method. The log-spectral distortion (LSD) is obtained by comparing the improved spectrum with the original spectrum of the same speech source. Then, the improved spectrum is inversely transformed into the waveform. The segmental signal-to-noise ratio (SNR) and the perceptual evaluation of speech quality (PESQ) are obtained by comparing this waveform with the original speech source signal. The three methods are conducted in both the matched and the mismatched case. Tables 1 and 2 present the evaluation results for the two cases, respectively. "Raw" in the tables means that the extracted spectrum obtained from the mask-based speech source separation is directly used for evaluation, in order to reflect the improvement offered by the DNNs. Note that the segmental SNR figures are in dB, while the LSD and PESQ figures are not. For segmental SNR and PESQ, a higher figure means better performance; for LSD, a lower figure means better performance.
The results of the objective evaluations confirm the efficiency of the proposed method. The baseline DNN is capable of constructing the regression from the reliable components to the masked components in the separated signal, owing to its capability of modeling highly non-linear relationships. Furthermore, according to the evaluation results, representing the masked components by per-utterance means is helpful in training: the potential confusion caused by the masked components is efficiently overcome, and the modified DNN learns the highly non-linear relationship more easily. Overall, the proposed method substantially improves the mask-based speech source separation.
5. Conclusion and Discussion
This paper proposes a post-processing method that uses a DNN to improve mask-based speech source separation. The proposed method uses the DNN to model the mapping from the reliable frequency components to the masked frequency components. The evaluation results show that the proposed method substantially improves the performance of the mask-based separation, which indicates that the masked components still contribute to speech quality. The proposed method can serve as a post-processing procedure for conventional mask-based separation.
6. Acknowledgment This work was supported by the National Program on Key Basic Research Project (2013CB329302), the National Natural Science Foundation of China (Nos. 61271426, 11461141004, 91120001), the Strategic Priority Research Program of the Chinese Academy of Sciences (Grant Nos. XDA06030100, XDA06030500), and by the CAS Priority Deployment Project (KGZD-EW-103-2).
7. References
[1] O. Yilmaz and S. Rickard, "Blind separation of speech mixtures via time-frequency masking," IEEE Trans. on Signal Process., 52(7): 1830–1847, 2004.
[2] D.L. Wang, "On ideal binary mask as the computational goal of auditory scene analysis," in Speech Separation by Humans and Machines. Springer US, 181–197, 2005.
[3] M. Cobos and J.J. Lopez, "Maximum a posteriori binary mask estimation for underdetermined source separation using smoothed posteriors," IEEE Trans. on Audio, Speech, and Language Process., 20(7): 2059–2064, 2012.
[4] H. Sawada, S. Araki, and S. Makino, "Underdetermined convolutive blind source separation via frequency bin-wise clustering and permutation alignment," IEEE Trans. on Audio, Speech, and Language Process., 19(3): 516–527, 2011.
[5] A. Narayanan and D.L. Wang, "Investigation of speech separation as a front-end for noise robust speech recognition," IEEE/ACM Trans. on Audio, Speech, and Language Process., 22(4): 826–835, 2014.
[6] B. Li and K.C. Sim, "A spectral masking approach to noise-robust speech recognition using deep neural networks," IEEE/ACM Trans. on Audio, Speech, and Language Process., 22(8): 1296–1305, 2014.
[7] G. Zhan, Z.Q. Huang, D.W. Ying, J.L. Pan, and Y.H. Yan, "Spectrographic speech mask estimation using the time-frequency correlation of speech presence," INTERSPEECH 2015, 2287–2291.
[8] J.F. Gemmeke, T. Virtanen, and A. Hurmalainen, "Exemplar-based sparse representations for noise robust automatic speech recognition," IEEE Trans. on Audio, Speech, and Language Process., 19(7): 2067–2080, 2011.
[9] J.F. Gemmeke, "Noise robust ASR: Missing data techniques and beyond," Ph.D. thesis, Radboud University Nijmegen, 2011.
[10] Y. Xu, J. Du, L.R. Dai, and C.H. Lee, "A regression approach to speech enhancement based on deep neural networks," IEEE/ACM Trans. on Audio, Speech, and Language Process., 23(1): 7–19, 2015.
[11] D.W. Ying and Y.H. Yan, "Robust and fast localization of single speech source using a planar array," IEEE Signal Processing Letters, 20(9): 909–912, 2013.
[12] Z.Q. Huang, G. Zhan, D.W. Ying, and Y.H. Yan, "Robust multiple speech localization using time delay histogram," ICASSP 2016, 3191–3195.
[13] H. Larochelle, Y. Bengio, J. Louradour, and P. Lamblin, "Exploring strategies for training deep neural networks," Journal of Machine Learning Research, 10(Jan): 1–40, 2009.
[14] G.E. Hinton, L. Deng, D. Yu, G.E. Dahl, et al., "Deep neural networks for acoustic modeling in speech recognition: The shared views of four research groups," IEEE Signal Processing Magazine, 29(6): 82–97, 2012.
[15] A. Varga and H. Steeneken, "Assessment for automatic speech recognition: II. NOISEX-92: A database and an experiment to study the effect of additive noise on speech recognition systems," Speech Commun., 12(3): 247–251, 1993.
[16] J.S. Garofolo, "Getting started with the DARPA TIMIT CD-ROM: An acoustic phonetic continuous speech database," Nat. Inst. Standards Technol. (NIST), Gaithersburg, MD, prototype as of Dec. 1988.