Multi-Metrics Learning for Speech Enhancement

Szu-Wei Fu, Ting-yao Hu, Yu Tsao, and Xugang Lu

Abstract—This paper presents a novel deep neural network (DNN) based speech enhancement method that aims to enhance the magnitude and phase components of speech signals simultaneously. The novelty of the proposed method is two-fold. First, to avoid the difficulty of direct clean phase estimation, the proposed algorithm adopts real and imaginary (RI) spectrograms to prepare both input and output features. In this way, the clean phase spectrograms can be effectively estimated from the enhanced RI spectrograms. Second, based on the RI spectrograms, a multi-metrics learning (MML) criterion is derived to optimize multiple objective metrics simultaneously. Different from the concept of multi-task learning, which incorporates heterogeneous features in the output layers, the MML criterion uses an objective function that considers different representations of speech signals (RI spectrograms, log power spectrograms, and waveform) during the enhancement process. Experimental results show that the proposed method notably outperforms the conventional DNN-based speech enhancement system that enhances the magnitude spectrogram alone. Furthermore, the MML criterion can further improve some objective metrics without trading off other objective metric scores.

Index Terms—Deep neural network, multi-objective learning, phase processing, speech enhancement.
I. INTRODUCTION
Recently, various types of deep learning based denoising models have been proposed and extensively investigated [1-4]. Compared to traditional speech enhancement models, deep neural networks (DNNs) have demonstrated a superior ability to model the non-linear relationship between noisy and clean speech [1, 5]. However, most denoising models focus only on processing the magnitude spectrogram (e.g., the log-power spectra, LPS), leaving the phase in its original noisy condition. This may be because there is no clear structure in a phase spectrogram, which makes estimating the clean phase from the noisy phase extremely difficult [6].
On the other hand, several studies have shown the importance of phase when spectrograms are resynthesized back into time-domain waveforms. Le Roux [7] demonstrated that when the inconsistency between the magnitude and phase spectrograms is maximized, the same magnitude spectrogram can lead to extremely diverse resynthesized sounds, depending on the phase with which it is combined. Paliwal et al. [8] confirmed the importance of phase for perceptual quality in speech enhancement, especially when the window overlap and the length of the Fourier transform are increased. Therefore, to further improve the performance of speech enhancement, phase information has been considered in several recent studies [9]. Williamson et al. [6, 10] employed a DNN to estimate the complex ratio mask (cRM) from a set of complementary features, so that magnitude and phase can be jointly enhanced through the cRM. The quality of the enhanced speech is improved compared to the ideal ratio mask (IRM) based model, while intelligibility remains roughly unchanged. Although the phase itself remains unchanged, Wang et al. [11] proposed a DNN model for time-domain signal reconstruction; the system tries to learn an optimal masking function given the noisy phase by incorporating a speech resynthesizer into the network. Meanwhile, Chen et al. [12] proposed to adopt the rate-domain complex-valued ideal ratio mask (RDcIRM) as the training target of the DNN. When compared with a state-of-the-art system [13], their algorithm produces de-reverberated speech with higher speech intelligibility and quality scores, making it more suitable for human-listening applications.

Speech enhancement is a task that tries to improve several objectives, such as the intelligibility and quality of a noisy speech signal. Therefore, multiple metrics are applied to evaluate the performance of enhanced speech in different aspects. In this paper, we propose a novel DNN model that processes magnitude and phase jointly and improves performance on different metrics simultaneously. The proposed network attempts to learn the mapping function from a noisy RI spectrogram to a clean RI spectrogram. The low-level representation of the RI spectrogram allows other targets (e.g., the log power spectrogram and waveform reconstruction) to be included in the objective function. Each target can correspond to a metric; hence, the learning process is referred to as multi-metrics learning (MML) in this paper. Different from a usual multi-objective optimization problem [14], the targets in MML do not conflict with each other, which implies that there are no serious trade-offs between the different metrics.
II. NOISY PHASE
For DNN-based speech enhancement, the noisy and clean speech signals are usually first converted into the frequency domain to extract their LPS as input features and output targets, respectively [1]. The enhanced signal in the time domain is then synthesized from the combination of the enhanced LPS and phase information, which is borrowed from the original noisy speech. Figure 1 presents an example of clean magnitude and phase spectrograms (top) and the thresholded phase difference between clean and noisy speech under high and low SNR conditions (bottom). Using the noisy phase may not be a serious problem in high SNR conditions, since the noisy phase is similar to the clean phase in high-energy regions (bottom-left of Fig. 1). To briefly explain the reason, the noisy phase in a time-frequency (T-F) unit is defined as $\mathrm{arctan2}(N_i, N_r)$, where $N_i$ and $N_r$ are the noisy signal's imaginary and real parts, respectively, and $\mathrm{arctan2}$ is similar to the arc tangent of $N_i/N_r$, except that the signs of both arguments are considered to determine the appropriate quadrant [15]. Here, the expression of the phase is simplified as follows:

$$\arctan\frac{N_i}{N_r} = \arctan\frac{S_i + n_i}{S_r + n_r} \qquad (1)$$

where $S_i$, $S_r$ and $n_i$, $n_r$ are the imaginary and real parts of the speech and noise, respectively. When the SNR of the noisy speech and the energy in the speech T-F unit are high enough, that is, $|S_i| \gg |n_i|$ and $|S_r| \gg |n_r|$, the noisy phase in (1) approximates the clean phase, as in (2):

$$\arctan\frac{N_i}{N_r} \approx \arctan\frac{S_i}{S_r} \qquad (2)$$

Equation (2) explains why the structures in the top-left and bottom-left panels of Fig. 1 are similar to each other. However, this is not the case in low SNR conditions, where the quality of the synthesized signal can be considerably improved by using an enhanced phase [6].
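To illustrate this effect numerically, the following minimal numpy sketch compares noisy and clean phases of random complex spectra at different SNRs. The random spectra merely stand in for speech and noise T-F units, so the exact numbers are only illustrative; the point is that the average absolute phase difference shrinks as the SNR (and hence the |S| >> |n| condition) improves.

```python
import numpy as np

rng = np.random.default_rng(0)
L = 257
S = rng.standard_normal(L) + 1j * rng.standard_normal(L)   # stand-in clean spectrum
n = rng.standard_normal(L) + 1j * rng.standard_normal(L)   # stand-in noise spectrum

def mean_abs_phase_diff(snr_db):
    # Scale the noise to the requested SNR, then compare noisy and clean phases.
    scale = np.sqrt(np.mean(np.abs(S) ** 2) / np.mean(np.abs(n) ** 2)) * 10 ** (-snr_db / 20)
    N = S + scale * n
    diff = np.angle(N) - np.angle(S)                        # arctan2-based phases
    return np.mean(np.abs(np.angle(np.exp(1j * diff))))     # wrap to (-pi, pi] before averaging

for snr in (12, 0, -12):
    print(snr, "dB ->", round(mean_abs_phase_diff(snr), 3))
```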
III. MULTI-METRICS LEARNING
One possible way to enhance the phase is to employ a conventional DNN model to estimate the clean phase from the noisy phase. Due to the lack of structure in the phase spectrogram (as shown in the top-right of Fig. 1), however, it is difficult for a machine learning model (even a deep one) to learn the relationship between the clean and noisy phase [6]. Williamson et al. [10] found that the structures in RI spectrograms are similar to those of magnitude spectrograms. Based on this observation, Fig. 2 presents the proposed DNN structure for speech enhancement based on RI spectrograms. Rather than processing the phase directly, the network aims to estimate the clean RI spectrogram from the noisy RI spectrogram. The enhanced phase then follows from the definition $y_p = \mathrm{arctan2}(y_i, y_r)$, where $y_p$, $y_i$, and $y_r$ are the enhanced phase, imaginary part, and real part in a T-F unit, respectively. If the enhanced real and imaginary parts are estimated accurately, the phase is thereby enhanced as well.
Fig. 1. Example of clean magnitude (top-left) and phase (top-right) spectrograms, and the phase difference between clean and noisy speech (engine noise) under 12 dB (bottom-left) and -12 dB (bottom-right) SNR. Regions in blue indicate that the absolute phase difference is smaller than 0.1.
Since the outputs of the proposed network are RI spectrograms, in addition to reconstructing the clean RI spectrogram, we can incorporate log power spectrogram and waveform reconstructions into the objective function:

$$O = \alpha\left(\|\hat{y}_i - y_i\|_2^2 + \|\hat{y}_r - y_r\|_2^2\right) + \beta\,\big\|\log(\hat{y}_i^2 + \hat{y}_r^2) - \log(y_i^2 + y_r^2)\big\|_2^2 + \gamma\,\|\hat{w}_y - w_y\|_2^2 \qquad (3)$$

where $\hat{y}_i, \hat{y}_r \in \mathbb{R}^L$ and $y_i, y_r \in \mathbb{R}^L$ are the clean and enhanced imaginary and real spectrograms, respectively, and $L$ is the dimension of the spectrum. $\hat{w}_y, w_y \in \mathbb{R}^{2L-2}$ are the corresponding clean and enhanced waveforms, and α, β, and γ are weighting factors for the different targets. Note that the first term is the original objective function for the RI spectrum. The second term concerns the log power spectrogram and tries to minimize the log-spectral distortion (LSD, in dB) [16], and the last term concerns waveform reconstruction and is related to metrics evaluated in the time domain. Hence, the learning process is called multi-metrics learning (MML) in this paper. Although the last two terms may seem redundant, they actually help the enhanced speech approach the clean speech more closely (as will be seen in the experiments). Since the arctan2 function is not a continuous function of $y_i$ and $y_r$, it is unsuitable to directly include a phase target in (3) for a gradient-based optimization method such as a DNN. The second term in (3) can also be expressed as a function of $y$ and $\hat{y}$ in matrix form as follows:

$$\big\|\log(\hat{y}_i^2 + \hat{y}_r^2) - \log(y_i^2 + y_r^2)\big\|_2^2 = \big\|\log(P\,\mathrm{sqr}(I\hat{y})) - \log(P\,\mathrm{sqr}(Iy))\big\|_2^2 \qquad (4)$$
Fig. 2. Proposed network with pseudo hidden and output layer(s): the true output layer (clean real and imaginary parts) is estimated from the noisy real and imaginary parts, and a fixed-weight pseudo output layer on top produces the other objectives (e.g., magnitude spectrogram, waveform).
where $\hat{y} = [\hat{y}_r \; \hat{y}_i]^T$ and $y = [y_r \; y_i]^T \in \mathbb{R}^{2L}$ are the vertically cascaded vectors of the clean and enhanced RI spectrograms, respectively, $I \in \mathbb{R}^{2L \times 2L}$ is the identity matrix, $\mathrm{sqr}(\cdot)$ is the element-wise square function, and $P \in \mathbb{R}^{L \times 2L}$ is the permutation matrix defined as:

$$P = [\, I_{L \times L} \;\; I_{L \times L} \,] \qquad (5)$$

The last term in (3) can also be expressed as a function of $y$ and $\hat{y}$ through the inverse discrete Fourier transform (IDFT):

$$\|\hat{w}_y - w_y\|_2^2 = \|(C U_1 \hat{y}_r - S U_2 \hat{y}_i) - (C U_1 y_r - S U_2 y_i)\|_2^2 = \|F\hat{y} - Fy\|_2^2 \qquad (6)$$

where $U_1, U_2 \in \mathbb{R}^{(2L-2) \times L}$ are the matrices that recover the even symmetry of the real part and the odd symmetry of the imaginary part, respectively, and $C, S \in \mathbb{R}^{(2L-2) \times (2L-2)}$ are the cosine and sine matrices of the IDFT. $F \in \mathbb{R}^{(2L-2) \times 2L}$ is defined as:

$$F = [\, C U_1 \;\; -S U_2 \,] \qquad (7)$$

Hence, from (4) and (6), the objective function in (3) can be reformulated as follows:

$$O = \alpha \|\hat{y} - y\|_2^2 + \beta \big\|\log(P\,\mathrm{sqr}(I\hat{y})) - \log(P\,\mathrm{sqr}(Iy))\big\|_2^2 + \gamma \|F\hat{y} - Fy\|_2^2 \qquad (8)$$

It can easily be seen that all the terms in (8) are directly related to the output vector $y$ and can be expressed as combinations of matrix multiplications and non-linear functions, as in a typical neural network. Therefore, the proposed network can be equivalently represented by additional pseudo hidden and output layer(s) with fixed weights, augmenting the true output layer and stacked above it, as shown in Fig. 2. In this paper, we refer to this augmentation as the "pseudo layer" because of its characteristics and structure. The gradient passes through the pseudo layer to adjust the weights before the true output layer. Unlike multi-task learning [17], which makes the "model" able to process different tasks at the same time, the proposed MML tries to improve the performance of the "outputs" under different metrics simultaneously.
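To make the pseudo-layer formulation concrete, the following numpy sketch (our own illustration, not code from the paper) builds the fixed matrices P and F of (5) and (7) for L = 257, checks that F maps a stacked RI vector to the corresponding time-domain frame, and evaluates the objective (8) for one frame. In practice the same fixed matrices would live inside the training framework so that gradients flow back to the network; the small constant added inside the logarithm is our addition for numerical stability, and analysis-window/overlap-add details are omitted since the paper does not specify them.

```python
import numpy as np

L = 257                      # one-sided spectrum dimension (as in the experiments)
N = 2 * L - 2                # time-domain frame length (512 samples)

# Matrix P of Eq. (5): adds the squared real and imaginary parts bin by bin.
P = np.hstack([np.eye(L), np.eye(L)])                     # L x 2L

# U1 / U2 recover the even / odd symmetry of the full spectrum from the one-sided parts.
U1 = np.zeros((N, L))
U2 = np.zeros((N, L))
U1[:L] = np.eye(L)
U2[:L] = np.eye(L)
for k in range(1, L - 1):                                 # mirrored bin N-k corresponds to bin k
    U1[N - k, k] = 1.0                                    # even symmetry of the real part
    U2[N - k, k] = -1.0                                   # odd symmetry of the imaginary part

# Cosine and sine matrices of the inverse DFT (with the 1/N factor), and F of Eq. (7).
n_idx = np.arange(N)[:, None]
k_idx = np.arange(N)[None, :]
C = np.cos(2 * np.pi * n_idx * k_idx / N) / N
S = np.sin(2 * np.pi * n_idx * k_idx / N) / N
F = np.hstack([C @ U1, -S @ U2])                          # (2L-2) x 2L

# Sanity check: F applied to a stacked RI vector reproduces the time-domain frame.
x = np.random.randn(N)
X = np.fft.rfft(x)
assert np.allclose(F @ np.concatenate([X.real, X.imag]), x)

def mml_objective(y_hat, y, alpha=1.0, beta=0.03, gamma=0.0, eps=1e-10):
    """Eq. (8) for one frame; y_hat / y are the stacked clean / enhanced RI vectors (length 2L)."""
    ri_term = np.sum((y_hat - y) ** 2)
    lps_term = np.sum((np.log(P @ y_hat ** 2 + eps) - np.log(P @ y ** 2 + eps)) ** 2)
    wav_term = np.sum((F @ y_hat - F @ y) ** 2)
    return alpha * ri_term + beta * lps_term + gamma * wav_term
```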
IV. EXPERIMENTS
A. Experimental Setup

In our experiments, the TIMIT corpus [18] was used to prepare the training and test sets. For the training set, 600 utterances were randomly selected and corrupted with five noise types (Babble, Car, Jackhammer, Pink, and Street) at five SNR levels (-10 dB, -5 dB, 0 dB, 5 dB, and 10 dB). Another 100 randomly selected utterances were mixed to form the test set. To make the experimental conditions more realistic, both the noise types and the SNR levels of the training and test sets were mismatched. Thus, we intentionally adopted three other noise signals, white Gaussian noise (WGN, stationary) and Engine and Baby cry (non-stationary), at another five SNR levels (-12 dB, -6 dB, 0 dB, 6 dB, and 12 dB) to form the test set. All the results reported below are averaged across the three noise types. In this work, 257-dimensional (L = 257) LPS features (for the baseline) and 512-dimensional RI spectrograms were extracted from the speech waveforms as acoustic features. Mean and variance normalization was applied to the input feature vectors to make the training process more stable. The DNNs in this experiment have three hidden layers with 1000 nodes per layer. All the models employ parametric rectified linear units (PReLUs) [19] as activation functions and are trained using Adam [20] with batch normalization [21]. To evaluate the performance of the different models, LSD was used to measure differences in the frequency domain. In addition, the perceptual evaluation of speech quality (PESQ) [22] and short-time objective intelligibility (STOI) [23] scores were employed to evaluate speech quality and intelligibility, respectively.

B. Experimental Results

1) The Effect of Phase in Different SNR Conditions

In this section, we briefly verify the explanation given in Section II. The baseline enhancement system (LPS as output targets) is used to denoise the magnitude spectrogram. To investigate the effect of phase under different SNRs, the segmental SNRs (SSNRs) of the waveforms synthesized with the noisy and clean phases are compared. Table I shows the SSNR increment obtained by combining the enhanced magnitude with the clean phase, and verifies our observation that using the noisy phase in low SNR conditions seriously degrades the signal.
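For reference, the following numpy sketch (our own illustration) extracts the LPS and stacked RI features described in the setup above and applies mean/variance normalization. The 512-sample Hamming window and 50% hop are assumptions; the paper only reports the feature dimensions, and a 512-point FFT yields 2 x 257 = 514 stacked RI values, slightly more than the reported 512, so the exact framing may differ.

```python
import numpy as np

def stft_frames(wave, frame=512, hop=256):
    """Windowed one-sided STFT; the Hamming window and 50% hop are assumptions."""
    win = np.hamming(frame)
    n_frames = 1 + (len(wave) - frame) // hop
    frames = np.stack([wave[t * hop:t * hop + frame] * win for t in range(n_frames)])
    return np.fft.rfft(frames, axis=1)                    # shape: (n_frames, 257) for frame=512

def features(wave, eps=1e-10):
    spec = stft_frames(wave)
    lps = np.log(np.abs(spec) ** 2 + eps)                 # 257-dim LPS (baseline features/targets)
    ri = np.concatenate([spec.real, spec.imag], axis=1)   # stacked RI features
    return lps, ri

def normalize(x):
    """Per-dimension mean/variance normalization of the input features."""
    return (x - x.mean(axis=0)) / (x.std(axis=0) + 1e-10)
```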
TABLE I
SSNR INCREMENT BY APPLYING CLEAN PHASE UNDER DIFFERENT SNR CONDITIONS

Input SNR             12 dB    6 dB     0 dB     -6 dB    -12 dB
SSNR increment (dB)   0.837    1.082    1.263    1.381    1.492

TABLE II
PERFORMANCE COMPARISON OF DIFFERENT SYSTEMS IN TERMS OF LSD, STOI, AND PESQ

               DNN-baseline              RI-DNN (α=1, β=0, γ=0)     MML-DNN (α=1, β=0.03, γ=0)
SNR (dB)   LSD     STOI    PESQ      LSD     STOI    PESQ       LSD     STOI    PESQ
12         3.115   0.814   2.334     3.928   0.849   2.546      3.743   0.832   2.546
6          3.404   0.778   2.140     4.095   0.814   2.328      3.892   0.802   2.389
0          3.747   0.717   1.866     4.339   0.747   2.054      4.128   0.740   2.153
-6         4.114   0.626   1.609     4.640   0.636   1.750      4.420   0.643   1.883
-12        4.426   0.521   1.447     4.944   0.498   1.503      4.703   0.522   1.628
Avg.       3.761   0.691   1.879     4.389   0.709   2.036      4.177   0.709   2.120
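The SSNR increment reported in Table I above can be measured along the lines of the sketch below; the frame length, hop size, and per-frame SNR clipping range are assumptions (the paper does not specify them), and `istft` denotes any consistent inverse STFT (a hypothetical helper, not defined here).

```python
import numpy as np

def segmental_snr(ref, est, frame=512, hop=256, floor=-10.0, ceil=35.0):
    """Mean per-frame SNR (dB) between reference and estimate, clipped to [floor, ceil] dB."""
    snrs = []
    for start in range(0, min(len(ref), len(est)) - frame + 1, hop):
        r = ref[start:start + frame]
        e = est[start:start + frame]
        snrs.append(10 * np.log10(np.sum(r ** 2) / (np.sum((r - e) ** 2) + 1e-12) + 1e-12))
    return float(np.mean(np.clip(snrs, floor, ceil)))

# The SSNR increment of Table I compares two resyntheses of the same enhanced
# magnitude spectrogram against the clean waveform `clean`:
#   inc = segmental_snr(clean, istft(enh_mag, clean_phase)) \
#       - segmental_snr(clean, istft(enh_mag, noisy_phase))
```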
2) Comparison of the Proposed System

We compare the proposed system with the DNN-baseline model, which enhances only the magnitude spectrogram. To separately investigate the effects of applying the RI spectrogram and of MML, the proposed network with β = γ = 0 is denoted as RI-DNN, and the proposed network trained with the multi-metrics objective is denoted as MML-DNN. Table II shows the average LSD, STOI, and PESQ scores of the three models on the test set. As expected, the DNN-baseline model reaches the lowest LSD, since it enhances the LPS directly (not through the RI spectrogram). However, in terms of the other two metrics, which are evaluated in the time domain, the proposed RI-DNN shows improvements over the baseline. This suggests that enhancing the LPS alone only optimizes LSD and hence may neglect other metrics [6, 24]. In addition, the improvements in PESQ are more significant than those in STOI, which implies that phase information is important for perceptual quality [8]. Next, the effects of the last two terms in (8) are investigated. First, when only the second term (log power spectrogram reconstruction, α = 0, β = 1, γ = 0) is included in the objective function, the results (not shown due to limited space) are similar to those of the baseline system. When only the third term (waveform reconstruction, α = 0, β = 0, γ = 1) is applied, the results (not shown due to limited space) are similar to those of RI-DNN, because the relation between the first and third terms in (8) is just a linear transformation through the matrix F. Hence, enhancing the RI spectrograms is similar to enhancing the waveforms directly. Therefore, in the following experiments, we focus on the effect of incorporating the log power spectrogram reconstruction as an additional target, and β is fixed to 0.03 empirically.
Fig. 3. Pseudo layer: applying the LPS estimation term in the objective function makes the estimated real and imaginary parts entangle with each other.
Next, we investigate the effect of the MML criterion by simultaneously applying the first and second terms in (8) as our objective function. The experimental results for MML-DNN are also presented in Table II. Compared to RI-DNN (α = 1, β = 0, γ = 0), MML with the LPS estimation term (α = 1, β = 0.03, γ = 0) effectively reduces the LSD scores. Meanwhile, the PESQ and STOI scores are also improved, especially in low SNR conditions. This may be because RI-DNN estimates all the output nodes independently, whereas MML makes the estimated real and imaginary spectrograms influence each other (as shown in Fig. 3). In other words, the RI components in the same frequency bin have to cooperate with each other to produce a good estimate of the LPS. This constraint may help the DNN generalize better and improve performance, especially in low SNR conditions.

V. CONCLUSION

The contribution of this paper is three-fold. First, we proposed a novel DNN-based speech enhancement model that uses the RI spectrogram as input and output features. Second, based on RI spectrograms, an MML criterion that considers multiple metrics was derived. Third, we confirmed that by incorporating information from different representations of speech signals (RI spectrograms and LPS), the proposed MML-DNN-based enhancement system can outperform a system that uses LPS features alone in terms of STOI and PESQ, both of which are highly correlated with human perception. In the future, we will investigate integrating the proposed MML criterion with multi-task learning to further improve the model's enhancement capability.
REFERENCES

[1] Y. Xu, J. Du, L.-R. Dai, and C.-H. Lee, "An experimental study on speech enhancement based on deep neural networks," IEEE Signal Processing Letters, vol. 21, pp. 65-68, 2014.
[2] X. Lu, Y. Tsao, S. Matsuda, and C. Hori, "Speech enhancement based on deep denoising autoencoder," in INTERSPEECH, 2013, pp. 436-440.
[3] Y. Zhao, D. Wang, I. Merks, and T. Zhang, "DNN-based enhancement of noisy and reverberant speech," in ICASSP, 2016, pp. 6525-6529.
[4] S.-W. Fu, Y. Tsao, and X. Lu, "SNR-aware convolutional neural network modeling for speech enhancement," in INTERSPEECH, 2016.
[5] D. Liu, P. Smaragdis, and M. Kim, "Experiments on deep learning for speech denoising," in INTERSPEECH, 2014, pp. 2685-2689.
[6] D. S. Williamson, Y. Wang, and D. Wang, "Complex ratio masking for joint enhancement of magnitude and phase," in ICASSP, 2016, pp. 5220-5224.
[7] J. Le Roux, "Phase-controlled sound transfer based on maximally-inconsistent spectrograms," Signal, vol. 5, p. 10, 2011.
[8] K. Paliwal, K. Wójcicki, and B. Shannon, "The importance of phase in speech enhancement," Speech Communication, vol. 53, pp. 465-494, 2011.
[9] K. Li, B. Wu, and C.-H. Lee, "An iterative phase recovery framework with phase mask for spectral mapping with an application to speech enhancement," in INTERSPEECH, 2016.
[10] D. S. Williamson, Y. Wang, and D. Wang, "Complex ratio masking for monaural speech separation," IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 24, pp. 483-492, 2016.
[11] Y. Wang and D. Wang, "A deep neural network for time-domain signal reconstruction," in ICASSP, 2015, pp. 4390-4394.
[12] T.-H. Chen, C. Huang, and T.-S. Chi, "Dereverberation based on bin-wise temporal variations of complex spectrogram," in ICASSP, 2017.
[13] K. Han, Y. Wang, D. Wang, W. S. Woods, I. Merks, and T. Zhang, "Learning spectral mapping for speech dereverberation and denoising," IEEE Transactions on Audio, Speech, and Language Processing, vol. 23, pp. 982-992, 2015.
[14] Y. Jin and B. Sendhoff, "Pareto-based multiobjective machine learning: An overview and case studies," IEEE Transactions on Systems, Man, and Cybernetics, Part C (Applications and Reviews), vol. 38, pp. 397-415, 2008.
[15] H. S. Kasana, Complex Variables: Theory and Applications. PHI Learning Pvt. Ltd., 2005.
[16] J. Du and Q. Huo, "A speech enhancement approach using piecewise linear approximation of an explicit model of environmental distortions," in INTERSPEECH, 2008, pp. 569-572.
[17] R. Caruana, "Multitask learning," Machine Learning, vol. 28, pp. 41-75, 1997.
[18] J. W. Lyons, "DARPA TIMIT acoustic-phonetic continuous speech corpus," National Institute of Standards and Technology, 1993.
[19] K. He, X. Zhang, S. Ren, and J. Sun, "Delving deep into rectifiers: Surpassing human-level performance on ImageNet classification," in Proceedings of the IEEE International Conference on Computer Vision, 2015, pp. 1026-1034.
[20] D. Kingma and J. Ba, "Adam: A method for stochastic optimization," arXiv preprint arXiv:1412.6980, 2014.
[21] S. Ioffe and C. Szegedy, "Batch normalization: Accelerating deep network training by reducing internal covariate shift," in Proceedings of the 32nd International Conference on Machine Learning (ICML), 2015, pp. 448-456.
[22] A. Rix, J. Beerends, M. Hollier, and A. Hekstra, "Perceptual evaluation of speech quality (PESQ), an objective method for end-to-end speech quality assessment of narrowband telephone networks and speech codecs," ITU-T Recommendation P.862, 2001.
[23] C. H. Taal, R. C. Hendriks, R. Heusdens, and J. Jensen, "An algorithm for intelligibility prediction of time-frequency weighted noisy speech," IEEE Transactions on Audio, Speech, and Language Processing, vol. 19, pp. 2125-2136, 2011.
[24] T. Gerkmann, M. Krawczyk-Becker, and J. Le Roux, "Phase processing for single-channel speech enhancement: History and recent advances," IEEE Signal Processing Magazine, vol. 32, pp. 55-66, 2015.