IEEE TRANSACTIONS ON AUDIO, SPEECH, AND LANGUAGE PROCESSING, VOL. 20, NO. 4, MAY 2012

A Multi-Frame Approach to the Frequency-Domain Single-Channel Noise Reduction Problem Yiteng Arden Huang, Member, IEEE, and Jacob Benesty

Abstract—This paper focuses on the class of single-channel noise reduction methods that are performed in the frequency domain via the short-time Fourier transform (STFT). The simplicity and relative effectiveness of this class of approaches make them the dominant choice in practical systems. Over the past years, many popular algorithms have been proposed. These algorithms, no matter how they are developed, have one feature in common: the solution is eventually formulated as a gain function applied to the STFT of the noisy signal in the current frame only, implying that the interframe correlation is ignored. This assumption is inaccurate for speech enhancement since speech is a highly self-correlated signal. In this paper, by taking the interframe correlation into account, a new linear model for speech spectral estimation and several optimal filters are proposed. They include the multi-frame Wiener and minimum variance distortionless response (MVDR) filters. With these filters, both the narrowband and fullband signal-to-noise ratios (SNRs) can be improved. Furthermore, with the MVDR filter, speech distortion at the output can be zero. Simulations present promising results that support the merits established by the theoretical analysis.

Index Terms—Frequency domain, interframe correlation, maximum signal-to-noise ratio (SNR) filter, minimum variance distortionless response (MVDR) filter, single-channel noise reduction, speech enhancement, tradeoff filter, Wiener filter.

I. INTRODUCTION

Every speech communication and processing system suffers from the ubiquitous presence of additive noise, but today's widespread cellular phones and hands-free handsets are more likely to be used in acoustically adverse environments where background noise from different origins is loud and where the microphone may not be in close proximity to the speech source. The noise degrades the perceptual quality of speech and will impair speech intelligibility when the signal-to-noise ratio (SNR) comes down to a certain level. Noise reduction intends to suppress such additive noise for purposes of speech enhancement. Operating on the noisy speech captured by a single microphone, noise reduction algorithms generally can enhance only the perceptual quality of speech when presented directly to a human listener with normal hearing, but may improve both speech quality and intelligibility when the enhanced speech goes through a voice communication channel before being played out [1] and/or for the hearing impaired [2]. So single-channel noise reduction (SCNR) has a large variety of applications, including mobile phones, hearing aids, and voice over Internet protocol (VoIP), just to name a few.

The first SCNR system was developed over 45 years ago by Schroeder [3], [4]. The principle of Schroeder's system is the nowadays well-known spectral magnitude subtraction method. This work, however, has not received much public attention, probably because it was a purely analog implementation and, more importantly, it was never published in journals or conferences outside of the Bell System. The interest in a digital form of the spectral subtraction technique was sparked by a 1974 paper by Weiss, Aschkenasy, and Parsons [5]. A few years later, Boll, in his often-cited paper [6], reintroduced the spectral subtraction method, for the first time in the framework of digital short-time Fourier analysis. These early algorithms were all based on an intuitive and simple idea: the clean speech spectrum can be restored by subtracting an estimate of the noise spectrum from the noisy speech spectrum, where the noise spectrum is estimated and updated during silent periods. Though practically effective, the spectral magnitude subtraction approach is by no means optimal. It was thanks to the papers [7] and [1] that the spectral subtraction technique began being examined in the framework of optimal estimation theory. This treatment initiated the development of many new noise reduction algorithms in the last three decades.

Manuscript received February 03, 2011; revised June 09, 2011; accepted October 16, 2011. Date of publication October 31, 2011; date of current version February 24, 2012. The associate editor coordinating the review of this manuscript and approving it for publication was Prof. Daniel P. W. Ellis. Y. Huang is with WeVoice, Inc., Bridgewater, NJ 08807 USA (e-mail: [email protected]). J. Benesty is with INRS-EMT, University of Quebec, Montreal, QC H5A 1K6, Canada. Color versions of one or more of the figures in this paper are available online at http://ieeexplore.ieee.org. Digital Object Identifier 10.1109/TASL.2011.2174226
They include the Wiener filter, which intends to directly recover the complex (amplitude and phase) spectrum (i.e., the waveform in the time domain) of the clean speech [1], [7], and, in contrast, those in which only the spectral amplitude of the clean speech is estimated while its phase is copied from that of the noisy signal. The spectral amplitude can be taken as the square root of a maximum-likelihood (ML) estimate of the clean speech's power spectrum. This leads to the spectral power subtraction method [7], [8], which is subtly different from the ML spectral amplitude estimator [7]. In addition to the classical approach of ML estimation, the Bayesian decision rule was also found to be very useful. Ephraim and Malah introduced a celebrated minimum mean square error (MMSE) estimator for spectral amplitude (MMSE-SA) in [9]. This original idea was later enriched by the MMSE estimator for log spectral amplitude (MMSE-LSA) [10] and other generalized Bayesian estimators [11]–[13], which minimize the posterior expectation of various distance measures between the actual and estimated speech spectral amplitudes. Maximum a posteriori (MAP) is

1558-7916/$31.00 © 2011 IEEE


another important Bayesian decision rule, based on which Wolfe and Godsill developed a MAP spectral amplitude estimator (MAP-SA) [14]. In the aforementioned ML, MMSE, and MAP spectral amplitude estimators, it is commonly assumed that the short-time Fourier transforms (STFTs) of speech and noise are zero-mean, independent, complex Gaussian random processes. These assumptions are practically reasonable but may not be strictly true. Alternatively, a super-Gaussian model was suggested in combination with the MAP-SA approach in [15]. More complicated statistical speech models (e.g., the hidden Markov model) can also be used [16], but then no closed-form solution can be deduced. While SCNR has been widely studied in the time domain and other transform domains as well (see [17] and [18] for more complete discussions on those subjects), the frequency-domain techniques are by far the most popular choice in practical systems for their simplicity and relative effectiveness. In this paper, we will focus only on this class of approaches. Despite using distinct optimization rules (ML, MMSE, or MAP), spectral distance measures (linear versus log), and statistical models for speech [Gaussian, super-Gaussian, or hidden Markov model (HMM)], the existing frequency-domain noise reduction algorithms have one feature in common: the solution is eventually expressed as a gain function applied to the short-time Fourier transform (STFT) of the noisy signal at each frequency. This is due to a simplified formulation of the problem in which it has been implicitly assumed that the STFT of the current frame is uncorrelated with that of the neighboring frames. However, this is not accurate for speech enhancement since speech is a highly self-correlated signal. Consequently, by taking the interframe correlation into account, we should be able to develop more sophisticated algorithms with hopefully better noise reduction results.
In this case, when we estimate the STFT of the clean speech in the current frame, we use the STFTs of the noisy signal not only in the current frame but also in the previous frames (with respect to the same frequency). This leads to a new model similar to a microphone array system: we have multiple noisy speech observations; their speech components are correlated, while their noise components are presumably uncorrelated or correlated in a different way than the speech components. As a result, the multichannel (here, multi-frame) Wiener filter and the minimum variance distortionless response (MVDR) filter that are usually associated with microphone arrays will be developed for SCNR in this paper. It is well known that the gain functions of the existing frequency-domain SCNR algorithms cannot improve the narrowband SNR, and fullband noise reduction is achieved at the price of speech distortion. With the new algorithms developed in this paper, we will show that both the narrowband and fullband SNRs can be improved. Moreover, with the MVDR filter, the detrimental speech distortion in the output can be zero. An early attempt at exploiting the interframe correlation of speech in subbands was reported in [19]. A simple first-order autoregressive (AR) model was used to describe the variation of speech, and hence a Kalman filter was developed to estimate the clean speech signal in each subband. The coefficients of the subband AR models need to be estimated from the noisy


microphone signal, and their estimates are usually biased in practice. So this method is subject to errors from model misspecification. In a recent paper [20], it was also suggested that the interframe correlation of speech STFTs could be exploited, and an iterative optimization scheme was proposed to improve the traditional frequency-domain Wiener filter. Such a scheme is at best suboptimal, while the multi-frame Wiener and MVDR filters developed here are all globally optimal with respect to the cost functions defined in their own problem formulations.

This paper is organized as follows. Section II describes the classical formulation of the SCNR problem. Section III presents the new linear model for speech spectral estimation that takes the interframe correlation into account. Then, in Section IV, some useful performance measures for SCNR that fit the linear model are defined. In Section V, we develop the multi-frame Wiener, MVDR, and tradeoff filters for SCNR. The experimental results are presented in Section VI, and finally the conclusions are drawn in Section VII.

II. PROBLEM FORMULATION

The noise reduction problem considered in this paper is one of recovering the desired signal (or clean speech) $x(t)$, $t$ being the time index, of zero mean from the noisy observation (microphone signal) [21]–[23]

$$y(t) = x(t) + v(t), \qquad (1)$$

where $v(t)$ is the unwanted additive noise, which is assumed to be a zero-mean random process, white or colored, but uncorrelated with $x(t)$. To simplify the development and analysis of the main ideas of this work, we further assume that all signals are Gaussian and wide-sense stationary. Using the short-time Fourier transform (STFT), (1) can be rewritten in the frequency domain as

$$Y(k,n) = X(k,n) + V(k,n), \qquad (2)$$

where $Y(k,n)$, $X(k,n)$, and $V(k,n)$ are the STFTs of $y(t)$, $x(t)$, and $v(t)$, respectively, at frequency-bin $k$ and time-frame $n$. Since $x(t)$ and $v(t)$ are uncorrelated by assumption, the variance of $Y(k,n)$ is

$$\phi_Y(k,n) = E\left[\left|Y(k,n)\right|^2\right] = \phi_X(k,n) + \phi_V(k,n), \qquad (3)$$

where $E[\cdot]$ denotes mathematical expectation, and $\phi_X(k,n) = E[|X(k,n)|^2]$ and $\phi_V(k,n) = E[|V(k,n)|^2]$ are the variances of $X(k,n)$ and $V(k,n)$, respectively.

III. A NEW LINEAR MODEL FOR SPEECH SPECTRAL ESTIMATION

In the classical linear model, we try to estimate our desired signal, $X(k,n)$, from the observation signal, $Y(k,n)$, by applying a complex gain to it [23], i.e.,

$$\hat{X}(k,n) = H^{*}(k,n)\,Y(k,n) = X_f(k,n) + V_f(k,n), \qquad (4)$$
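As a quick numerical illustration of this classical single-frame approach, the sketch below applies the optimal gain $\phi_X/\phi_Y$ to synthetic complex Gaussian STFT coefficients at one bin. The variances and sample count are illustrative assumptions, not values from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

phi_X, phi_V = 4.0, 1.0   # assumed spectral variances at one bin (k, n)
N = 100_000               # number of independent draws of that bin

def ccn(var, size):
    """Zero-mean circularly symmetric complex Gaussian samples."""
    return np.sqrt(var / 2) * (rng.normal(size=size) + 1j * rng.normal(size=size))

X, V = ccn(phi_X, N), ccn(phi_V, N)
Y = X + V                                  # noisy STFT coefficients, cf. (2)

H_W = phi_X / (phi_X + phi_V)              # single-frame Wiener gain phi_X / phi_Y
X_hat = H_W * Y

mse_wiener = np.mean(np.abs(X_hat - X) ** 2)
mse_raw = np.mean(np.abs(Y - X) ** 2)      # no filtering: equals phi_V on average
print(H_W, mse_wiener < mse_raw)           # prints: 0.8 True
```

The sample MSE of the filtered estimate approaches $\phi_X\phi_V/\phi_Y$, strictly below the unfiltered MSE $\phi_V$, which is the standard property of the scalar Wiener gain.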


where the superscript $*$ denotes complex conjugation, $X_f(k,n) = H^{*}(k,n)X(k,n)$ is the filtered desired signal, and $V_f(k,n) = H^{*}(k,n)V(k,n)$ is the residual noise. Using the mean-square error (MSE) between the estimated and desired signals, we can easily derive the optimal Wiener gain, which is real and is given by [23]

$$H_W(k,n) = \frac{\phi_X(k,n)}{\phi_Y(k,n)} = 1 - \frac{\phi_V(k,n)}{\phi_Y(k,n)}. \qquad (5)$$

As a result, the estimate of $X(k,n)$ in the Wiener sense is [23]

$$\hat{X}_W(k,n) = H_W(k,n)\,Y(k,n). \qquad (6)$$

In (4), we implicitly assumed that the observation signal at the current time-frame is uncorrelated with itself at the previous time-frames. This, however, is not true since speech signals are well known to be highly colored over long periods of time. Therefore, the interframe correlation should be taken into account in the derivation of any noise reduction algorithm. For that, the complex gain in (4) needs to be replaced by a finite-impulse-response (FIR) filter, i.e.,

$$\hat{X}(k,n) = \mathbf{h}^{H}(k,n)\,\mathbf{y}(k,n), \qquad (7)$$

where the superscript $H$ denotes complex transpose-conjugation, $L$ is the number of consecutive time-frames, and

$$\mathbf{h}(k,n) = \left[H_0(k,n)\ H_1(k,n)\ \cdots\ H_{L-1}(k,n)\right]^{T},$$
$$\mathbf{y}(k,n) = \left[Y(k,n)\ Y(k,n-1)\ \cdots\ Y(k,n-L+1)\right]^{T}$$

are vectors of length $L$. The case $L = 1$ corresponds to the conventional frequency-domain approach. Note that this concept, to take into account the interframe correlation in a speech enhancement algorithm, was introduced in [23] but in the Karhunen–Loève expansion (KLE) domain. In [25], the interframe correlation was also used to improve the a priori SNR estimator.

Let us now decompose the signal $\mathbf{y}(k,n)$ into the following form:

$$\mathbf{y}(k,n) = \mathbf{x}(k,n) + \mathbf{v}(k,n), \qquad (8)$$

where $\mathbf{x}(k,n)$ and $\mathbf{v}(k,n)$ are defined in a similar way to $\mathbf{y}(k,n)$, so that $\mathbf{h}^{H}(k,n)\mathbf{x}(k,n)$ is a filtered version of the desired signal at $L$ consecutive time-frames and $\mathbf{h}^{H}(k,n)\mathbf{v}(k,n)$ is the residual noise, which is uncorrelated with $\mathbf{h}^{H}(k,n)\mathbf{x}(k,n)$.

At time-frame $n$, our desired signal is $X(k,n)$ (and not the whole vector $\mathbf{x}(k,n)$). However, the vector $\mathbf{x}(k,n)$ in (8) contains both the desired signal, $X(k,n)$, and the components $X(k,n-l)$, $l = 1, \ldots, L-1$, which are not the desired signals at time-frame $n$ but signals that are correlated with $X(k,n)$. Therefore, the elements $X(k,n-l)$, $l \neq 0$, contain both a part of the desired signal and a component that we consider as an interference. This suggests that we should decompose $X(k,n-l)$ into two orthogonal components corresponding to the part of the desired signal and the interference:

$$X(k,n-l) = X_d(k,n-l) + X_i(k,n-l), \qquad (9)$$

where

$$X_d(k,n-l) = \rho_X(k,n,l)\,X(k,n), \qquad (10)$$

$$X_i(k,n-l) = X(k,n-l) - \rho_X(k,n,l)\,X(k,n), \qquad (11)$$

and

$$\rho_X(k,n,l) = \frac{E\left[X(k,n-l)\,X^{*}(k,n)\right]}{E\left[\left|X(k,n)\right|^{2}\right]} \qquad (12)$$

is the interframe correlation coefficient of the signal $X(k,n)$. Hence, we can write the vector $\mathbf{x}(k,n)$ as

$$\mathbf{x}(k,n) = X(k,n)\,\boldsymbol{\gamma}_X(k,n) + \mathbf{x}_i(k,n) = \mathbf{x}_d(k,n) + \mathbf{x}_i(k,n), \qquad (13)$$

where $\mathbf{x}_d(k,n) = X(k,n)\boldsymbol{\gamma}_X(k,n)$ is the desired signal vector, $\mathbf{x}_i(k,n)$ is the interference signal vector, and

$$\boldsymbol{\gamma}_X(k,n) = \left[1\ \rho_X(k,n,1)\ \cdots\ \rho_X(k,n,L-1)\right]^{T} = \frac{E\left[\mathbf{x}(k,n)X^{*}(k,n)\right]}{\phi_X(k,n)} \qquad (14)$$

is the (normalized) interframe correlation vector. Substituting (13) into (8), we get

$$\hat{X}(k,n) = X(k,n)\,\mathbf{h}^{H}(k,n)\boldsymbol{\gamma}_X(k,n) + \mathbf{h}^{H}(k,n)\mathbf{x}_i(k,n) + \mathbf{h}^{H}(k,n)\mathbf{v}(k,n), \qquad (15)$$

where $X(k,n)\,\mathbf{h}^{H}(k,n)\boldsymbol{\gamma}_X(k,n)$ is the filtered desired signal and $\mathbf{h}^{H}(k,n)\mathbf{x}_i(k,n)$ is the residual interference. We observe that the estimate of the desired signal is the sum of three terms that are mutually uncorrelated. The first one is clearly the filtered desired signal, while the two others are the filtered undesired signals (interference-plus-noise). Therefore, the variance of $\hat{X}(k,n)$ is

$$\phi_{\hat{X}}(k,n) = \mathbf{h}^{H}\boldsymbol{\Phi}_{x_d}(k,n)\mathbf{h} + \mathbf{h}^{H}\boldsymbol{\Phi}_{x_i}(k,n)\mathbf{h} + \mathbf{h}^{H}\boldsymbol{\Phi}_{v}(k,n)\mathbf{h}, \qquad (16)$$

where

$$\boldsymbol{\Phi}_{x_d}(k,n) = \phi_X(k,n)\,\boldsymbol{\gamma}_X(k,n)\boldsymbol{\gamma}_X^{H}(k,n), \qquad (17)$$

$$\boldsymbol{\Phi}_{x_i}(k,n) = E\left[\mathbf{x}_i(k,n)\mathbf{x}_i^{H}(k,n)\right], \qquad (18)$$
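The interframe model above can be checked numerically. In the sketch below, a stationary AR(1) sequence stands in for the per-bin speech STFT trajectory (an assumption made only for illustration), the normalized interframe correlation vector of (14) is estimated by sample averaging, and the orthogonality between the desired-signal and interference components of (13) is verified.

```python
import numpy as np

rng = np.random.default_rng(1)

L, N, a = 3, 100_000, 0.9
w = rng.normal(size=N) + 1j * rng.normal(size=N)
X = np.zeros(N, dtype=complex)
for t in range(1, N):
    X[t] = a * X[t - 1] + np.sqrt(1 - a**2) * w[t]   # stand-in for X(k, n) over n

# Stack x(k, n) = [X(n), X(n-1), ..., X(n-L+1)]^T over many frames n.
n_idx = np.arange(L - 1, N)
frames = X[n_idx[:, None] - np.arange(L)[None, :]]   # shape (N-L+1, L)
cur = frames[:, 0]                                   # current-frame coefficient

phi_X = np.mean(np.abs(cur) ** 2)
gamma = (frames * cur.conj()[:, None]).mean(axis=0) / phi_X   # estimate of (14)

x_d = cur[:, None] * gamma[None, :]   # desired-signal component of (13)
x_i = frames - x_d                    # interference component
cross = (x_i * cur.conj()[:, None]).mean(axis=0)   # should vanish (orthogonality)

print(np.round(np.abs(gamma), 2), np.max(np.abs(cross)) < 0.05)
```

For this AR(1) stand-in, the estimated vector approaches $[1,\ a,\ a^2]^T$, and the interference component is (up to sampling error) uncorrelated with the current-frame coefficient, as the decomposition requires.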

$$\boldsymbol{\Phi}_{v}(k,n) = E\left[\mathbf{v}(k,n)\mathbf{v}^{H}(k,n)\right]. \qquad (19)$$

Here, $\boldsymbol{\Phi}_{x_d}(k,n)$ is the correlation matrix (whose rank is equal to 1) of $\mathbf{x}_d(k,n)$ and $\boldsymbol{\Phi}_{v}(k,n)$ is the correlation matrix of $\mathbf{v}(k,n)$.

IV. PERFORMANCE MEASURES

In this section, we give some very useful measures that fit well with the linear model developed in Section III, where the interframe correlation is taken into account. We define the narrowband and fullband input SNRs as [23]

$$\mathrm{iSNR}(k,n) = \frac{\phi_X(k,n)}{\phi_V(k,n)} \qquad (20)$$

and

$$\mathrm{iSNR}(n) = \frac{\sum_{k}\phi_X(k,n)}{\sum_{k}\phi_V(k,n)}. \qquad (21)$$

It is easy to show that [23]

$$\min_{k}\,\mathrm{iSNR}(k,n) \le \mathrm{iSNR}(n) \le \max_{k}\,\mathrm{iSNR}(k,n). \qquad (22)$$

To quantify the level of noise remaining at the output of the FIR filter, we define the narrowband output SNR as the ratio of the variance of the filtered desired signal over the variance of the residual interference-plus-noise¹, i.e.,

$$\mathrm{oSNR}\left[\mathbf{h}(k,n)\right] = \frac{\phi_X(k,n)\left|\mathbf{h}^{H}(k,n)\boldsymbol{\gamma}_X(k,n)\right|^{2}}{\mathbf{h}^{H}(k,n)\boldsymbol{\Phi}_{in}(k,n)\mathbf{h}(k,n)}, \qquad (23)$$

where

$$\boldsymbol{\Phi}_{in}(k,n) = \boldsymbol{\Phi}_{x_i}(k,n) + \boldsymbol{\Phi}_{v}(k,n) \qquad (24)$$

is the interference-plus-noise covariance matrix.

¹In this paper, we consider the interference as part of the noise in the definitions of the performance measures.

For the particular filter $\mathbf{h}(k,n) = \mathbf{i}$, where $\mathbf{i}$ is the first column of the identity matrix (of size $L \times L$), we have

$$\mathrm{oSNR}\left[\mathbf{i}\right] = \mathrm{iSNR}(k,n). \qquad (25)$$

And for the particular case $L = 1$, we also have

$$\mathrm{oSNR}\left[H(k,n)\right] = \mathrm{iSNR}(k,n). \qquad (26)$$

Hence, in the two previous scenarios, the narrowband SNR cannot be improved.

Now, let us define the quantity

$$\mathrm{tr}\left[\boldsymbol{\Phi}_{in}^{-1}(k,n)\boldsymbol{\Phi}_{x_d}(k,n)\right] = \phi_X(k,n)\,\boldsymbol{\gamma}_X^{H}(k,n)\boldsymbol{\Phi}_{in}^{-1}(k,n)\boldsymbol{\gamma}_X(k,n), \qquad (27)$$

where $\mathrm{tr}[\cdot]$ denotes the trace of a square matrix. This quantity corresponds to the maximum eigenvalue, $\lambda_{\max}(k,n)$, of the matrix $\boldsymbol{\Phi}_{in}^{-1}(k,n)\boldsymbol{\Phi}_{x_d}(k,n)$, since this matrix has rank one. It also corresponds to the maximum output SNR, since the filter

$$\mathbf{h}_{\max}(k,n) = \boldsymbol{\Phi}_{in}^{-1}(k,n)\boldsymbol{\gamma}_X(k,n), \qquad (28)$$

which maximizes $\mathrm{oSNR}[\mathbf{h}(k,n)]$ [(23)], is the maximum eigenvector of $\boldsymbol{\Phi}_{in}^{-1}(k,n)\boldsymbol{\Phi}_{x_d}(k,n)$, for which the corresponding eigenvalue is $\lambda_{\max}(k,n)$. As a result, we have

$$\mathrm{oSNR}\left[\mathbf{h}(k,n)\right] \le \lambda_{\max}(k,n),\quad \forall\,\mathbf{h}(k,n), \qquad (29)$$

and

$$\lambda_{\max}(k,n) \ge \mathrm{iSNR}(k,n). \qquad (30)$$

We define the fullband output SNR as

$$\mathrm{oSNR}\left[\mathbf{h}(:,n)\right] = \frac{\sum_{k}\phi_X(k,n)\left|\mathbf{h}^{H}(k,n)\boldsymbol{\gamma}_X(k,n)\right|^{2}}{\sum_{k}\mathbf{h}^{H}(k,n)\boldsymbol{\Phi}_{in}(k,n)\mathbf{h}(k,n)}, \qquad (31)$$

and it can be verified that [23]

$$\mathrm{oSNR}\left[\mathbf{h}(:,n)\right] \le \sum_{k}\mathrm{oSNR}\left[\mathbf{h}(k,n)\right] \qquad (32)$$

and

$$\mathrm{oSNR}\left[\mathbf{h}(:,n)\right] \le \sum_{k}\lambda_{\max}(k,n). \qquad (33)$$

As a result, the fullband output SNR is upper bounded no matter how the filters are chosen.

The noise-reduction factor [26], [27] quantifies the amount of noise that is rejected by the filter. This quantity is defined as the ratio of the variance of the noise at the microphone over the variance of the interference-plus-noise remaining after the filtering operation. The narrowband and fullband noise-reduction factors are then

$$\xi_{nr}\left[\mathbf{h}(k,n)\right] = \frac{\phi_V(k,n)}{\mathbf{h}^{H}(k,n)\boldsymbol{\Phi}_{in}(k,n)\mathbf{h}(k,n)} \qquad (34)$$

and

$$\xi_{nr}\left[\mathbf{h}(:,n)\right] = \frac{\sum_{k}\phi_V(k,n)}{\sum_{k}\mathbf{h}^{H}(k,n)\boldsymbol{\Phi}_{in}(k,n)\mathbf{h}(k,n)}. \qquad (35)$$

The noise-reduction factors are expected to be lower bounded by 1 for optimal filters. So the more the noise is reduced, the higher are the values of the noise-reduction factors.

In practice, the FIR filter, $\mathbf{h}(k,n)$, may distort the desired signal. In order to evaluate the level of this distortion, we define the speech-reduction factor [23] as the variance of the desired signal over the variance of the filtered desired signal at

the output of the filter. Therefore, the narrowband and fullband speech-reduction factors are defined as

$$\xi_{sr}\left[\mathbf{h}(k,n)\right] = \frac{\phi_X(k,n)}{\phi_X(k,n)\left|\mathbf{h}^{H}(k,n)\boldsymbol{\gamma}_X(k,n)\right|^{2}} = \frac{1}{\left|\mathbf{h}^{H}(k,n)\boldsymbol{\gamma}_X(k,n)\right|^{2}} \qquad (36)$$

and

$$\xi_{sr}\left[\mathbf{h}(:,n)\right] = \frac{\sum_{k}\phi_X(k,n)}{\sum_{k}\phi_X(k,n)\left|\mathbf{h}^{H}(k,n)\boldsymbol{\gamma}_X(k,n)\right|^{2}}. \qquad (37)$$

An important observation is that the design of a filter that does not distort the desired signal requires the constraint

$$\mathbf{h}^{H}(k,n)\,\boldsymbol{\gamma}_X(k,n) = 1. \qquad (38)$$

Thus, the speech-reduction factor is equal to 1 if there is no distortion and is expected to be greater than 1 when distortion occurs. By making the appropriate substitutions, one can derive the relationships

$$\frac{\mathrm{oSNR}\left[\mathbf{h}(k,n)\right]}{\mathrm{iSNR}(k,n)} = \frac{\xi_{nr}\left[\mathbf{h}(k,n)\right]}{\xi_{sr}\left[\mathbf{h}(k,n)\right]} \qquad (39)$$

and

$$\frac{\mathrm{oSNR}\left[\mathbf{h}(:,n)\right]}{\mathrm{iSNR}(n)} = \frac{\xi_{nr}\left[\mathbf{h}(:,n)\right]}{\xi_{sr}\left[\mathbf{h}(:,n)\right]}. \qquad (40)$$

When no distortion occurs, the gain in SNR coincides with the noise-reduction factor. Another useful performance measure is the speech-distortion index [26], [27], defined as

$$\upsilon_{sd}\left[\mathbf{h}(k,n)\right] = \left|\mathbf{h}^{H}(k,n)\boldsymbol{\gamma}_X(k,n) - 1\right|^{2} \qquad (41)$$

in the narrowband case and as

$$\upsilon_{sd}\left[\mathbf{h}(:,n)\right] = \frac{\sum_{k}\phi_X(k,n)\left|\mathbf{h}^{H}(k,n)\boldsymbol{\gamma}_X(k,n) - 1\right|^{2}}{\sum_{k}\phi_X(k,n)} \qquad (42)$$

in the fullband case. The speech-distortion index is always greater than or equal to 0 and should be upper bounded by 1 for optimal filters; so the higher its value, the more the desired signal is distorted.

V. OPTIMAL FILTERS

In this part, we derive three fundamental filters with the linear interframe model and show how they are related to each other. We also show the relationship with $\mathbf{h}_{\max}(k,n)$. For that, we first need to derive the MSE criterion and its relation with the MSEs of speech distortion and residual interference-plus-noise. We define the narrowband error signal between the estimated and desired signals as

$$\mathcal{E}(k,n) = \hat{X}(k,n) - X(k,n), \qquad (43)$$

which can be written as the sum of two error signals:

$$\mathcal{E}(k,n) = \mathcal{E}_d(k,n) + \mathcal{E}_r(k,n), \qquad (44)$$

where

$$\mathcal{E}_d(k,n) = \left[\mathbf{h}^{H}(k,n)\boldsymbol{\gamma}_X(k,n) - 1\right]X(k,n) \qquad (45)$$

is the signal distortion due to the complex filter and

$$\mathcal{E}_r(k,n) = \mathbf{h}^{H}(k,n)\mathbf{x}_i(k,n) + \mathbf{h}^{H}(k,n)\mathbf{v}(k,n) \qquad (46)$$

represents the residual interference-plus-noise. The narrowband MSE is then

$$J\left[\mathbf{h}(k,n)\right] = E\left[\left|\mathcal{E}(k,n)\right|^{2}\right] = J_d\left[\mathbf{h}(k,n)\right] + J_r\left[\mathbf{h}(k,n)\right], \qquad (47)$$

where

$$J_d\left[\mathbf{h}(k,n)\right] = \phi_X(k,n)\left|\mathbf{h}^{H}(k,n)\boldsymbol{\gamma}_X(k,n) - 1\right|^{2} \qquad (48)$$

and

$$J_r\left[\mathbf{h}(k,n)\right] = \mathbf{h}^{H}(k,n)\,\boldsymbol{\Phi}_{in}(k,n)\,\mathbf{h}(k,n). \qquad (49)$$

For the particular filter $\mathbf{h}(k,n) = \mathbf{i}$, the narrowband MSE is

$$J\left[\mathbf{i}\right] = \phi_V(k,n), \qquad (50)$$

so there is neither noise reduction nor speech distortion. We can now define the narrowband normalized MSE (NMSE) as

$$\tilde{J}\left[\mathbf{h}(k,n)\right] = \frac{J\left[\mathbf{h}(k,n)\right]}{J\left[\mathbf{i}\right]} = \mathrm{iSNR}(k,n)\,\upsilon_{sd}\left[\mathbf{h}(k,n)\right] + \frac{1}{\xi_{nr}\left[\mathbf{h}(k,n)\right]}, \qquad (51)$$

where

$$\upsilon_{sd}\left[\mathbf{h}(k,n)\right] = \frac{J_d\left[\mathbf{h}(k,n)\right]}{\phi_X(k,n)} \qquad (52)$$

and

$$\xi_{nr}\left[\mathbf{h}(k,n)\right] = \frac{\phi_V(k,n)}{J_r\left[\mathbf{h}(k,n)\right]}. \qquad (53)$$

This shows how the narrowband MSEs are related to some of the performance measures. In the same way, we define the fullband MSE as

$$J\left[\mathbf{h}(:,n)\right] = \frac{1}{K}\sum_{k=0}^{K-1}J\left[\mathbf{h}(k,n)\right]. \qquad (54)$$
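The algebra linking the MSE decomposition to the performance measures can be verified numerically. The sketch below uses synthetic stand-ins for $\phi_X$, $\phi_V$, $\boldsymbol{\gamma}_X$, $\boldsymbol{\Phi}_{in}$, and an arbitrary filter (all assumed values for illustration) and checks the NMSE identity $\tilde{J} = \mathrm{iSNR}\cdot\upsilon_{sd} + 1/\xi_{nr}$.

```python
import numpy as np

rng = np.random.default_rng(2)

L = 4
phi_X, phi_V = 2.0, 0.5                        # assumed variances at one bin
gamma = (0.8 ** np.arange(L)).astype(complex)  # assumed gamma_X, first entry 1

A = rng.normal(size=(L, L)) + 1j * rng.normal(size=(L, L))
Phi_in = 0.1 * (A @ A.conj().T) + phi_V * np.eye(L)   # a valid Hermitian PSD Phi_in

h = rng.normal(size=L) + 1j * rng.normal(size=L)      # arbitrary complex filter

J_d = phi_X * abs(h.conj() @ gamma - 1) ** 2          # speech-distortion MSE
J_r = (h.conj() @ Phi_in @ h).real                    # residual interference-plus-noise MSE
J = J_d + J_r                                         # total narrowband MSE

iSNR = phi_X / phi_V
upsilon_sd = abs(h.conj() @ gamma - 1) ** 2           # speech-distortion index
xi_nr = phi_V / J_r                                   # noise-reduction factor

nmse = J / phi_V                                      # normalized MSE
print(np.isclose(nmse, iSNR * upsilon_sd + 1 / xi_nr))   # prints: True
```

The identity holds exactly for any filter, since $J_d/\phi_V = \mathrm{iSNR}\cdot\upsilon_{sd}$ and $J_r/\phi_V = 1/\xi_{nr}$ by definition.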

We then deduce the fullband NMSE:

$$\tilde{J}\left[\mathbf{h}(:,n)\right] = \frac{J\left[\mathbf{h}(:,n)\right]}{\frac{1}{K}\sum_{k}\phi_V(k,n)} = \mathrm{iSNR}(n)\,\upsilon_{sd}\left[\mathbf{h}(:,n)\right] + \frac{1}{\xi_{nr}\left[\mathbf{h}(:,n)\right]}. \qquad (55)$$

It is straightforward to see that minimizing the narrowband MSE for each $(k,n)$ is equivalent to minimizing the fullband MSE. It is clear that the objective of noise reduction in the frequency domain with interframe filtering is to find optimal filters at each frequency-bin and time-frame that would either directly minimize $J[\mathbf{h}(k,n)]$, or minimize $J_d[\mathbf{h}(k,n)]$ or $J_r[\mathbf{h}(k,n)]$ subject to some constraint.

A. Wiener

The Wiener filter is easily derived by taking the gradient of the narrowband MSE, $J[\mathbf{h}(k,n)]$, with respect to $\mathbf{h}(k,n)$ and equating the result to zero:

$$\mathbf{h}_W(k,n) = \boldsymbol{\Phi}_y^{-1}(k,n)\,E\left[\mathbf{y}(k,n)X^{*}(k,n)\right], \qquad (56)$$

where $\boldsymbol{\Phi}_y(k,n) = E[\mathbf{y}(k,n)\mathbf{y}^{H}(k,n)]$ is the covariance matrix of $\mathbf{y}(k,n)$ and $E[\mathbf{y}(k,n)X^{*}(k,n)]$ is the cross-correlation vector between $\mathbf{y}(k,n)$ and $X(k,n)$, but

$$E\left[\mathbf{y}(k,n)X^{*}(k,n)\right] = E\left[\mathbf{x}(k,n)X^{*}(k,n)\right] = \phi_X(k,n)\,\boldsymbol{\gamma}_X(k,n), \qquad (57)$$

so that (56) becomes

$$\mathbf{h}_W(k,n) = \phi_X(k,n)\,\boldsymbol{\Phi}_y^{-1}(k,n)\,\boldsymbol{\gamma}_X(k,n). \qquad (58)$$

The Wiener filter can also be rewritten as

$$\mathbf{h}_W(k,n) = \left[\mathbf{I}_L - \boldsymbol{\Phi}_y^{-1}(k,n)\boldsymbol{\Phi}_{in}(k,n)\right]\mathbf{i}. \qquad (59)$$

From Section III, it is easy to verify that

$$\boldsymbol{\Phi}_y(k,n) = \phi_X(k,n)\,\boldsymbol{\gamma}_X(k,n)\boldsymbol{\gamma}_X^{H}(k,n) + \boldsymbol{\Phi}_{in}(k,n). \qquad (60)$$

Determining the inverse of $\boldsymbol{\Phi}_y(k,n)$ from (60) with Woodbury's identity,

$$\boldsymbol{\Phi}_y^{-1}(k,n) = \boldsymbol{\Phi}_{in}^{-1}(k,n) - \frac{\boldsymbol{\Phi}_{in}^{-1}(k,n)\boldsymbol{\gamma}_X(k,n)\boldsymbol{\gamma}_X^{H}(k,n)\boldsymbol{\Phi}_{in}^{-1}(k,n)}{\phi_X^{-1}(k,n) + \boldsymbol{\gamma}_X^{H}(k,n)\boldsymbol{\Phi}_{in}^{-1}(k,n)\boldsymbol{\gamma}_X(k,n)}, \qquad (61)$$

and substituting the result into (58) leads to another interesting formulation of the Wiener filter:

$$\mathbf{h}_W(k,n) = \frac{\phi_X(k,n)\,\boldsymbol{\Phi}_{in}^{-1}(k,n)\boldsymbol{\gamma}_X(k,n)}{1 + \lambda_{\max}(k,n)}, \qquad (62)$$

that we can rewrite as

$$\mathbf{h}_W(k,n) = \frac{\boldsymbol{\Phi}_{in}^{-1}(k,n)\boldsymbol{\Phi}_{x_d}(k,n)}{1 + \lambda_{\max}(k,n)}\,\mathbf{i}. \qquad (63)$$

Using (62), we find that the narrowband output SNR is

$$\mathrm{oSNR}\left[\mathbf{h}_W(k,n)\right] = \lambda_{\max}(k,n), \qquad (64)$$

and the narrowband speech-distortion index is a clear function of this narrowband output SNR:

$$\upsilon_{sd}\left[\mathbf{h}_W(k,n)\right] = \frac{1}{\left[1 + \lambda_{\max}(k,n)\right]^{2}}. \qquad (65)$$

Interestingly, the higher the value of $\lambda_{\max}(k,n)$ (i.e., by increasing the number of interframes), the less the desired signal is distorted with the Wiener filter at frequency-bin $k$. Clearly,

$$\mathrm{oSNR}\left[\mathbf{h}_W(k,n)\right] \ge \mathrm{iSNR}(k,n), \qquad (66)$$

since the Wiener filter maximizes the narrowband output SNR. It is of great interest to observe that the two filters $\mathbf{h}_W(k,n)$ and $\mathbf{h}_{\max}(k,n)$ are equivalent up to a scaling factor. With the Wiener filter, the narrowband noise-reduction factor is

$$\xi_{nr}\left[\mathbf{h}_W(k,n)\right] = \frac{\left[1 + \lambda_{\max}(k,n)\right]^{2}}{\mathrm{iSNR}(k,n)\,\lambda_{\max}(k,n)}. \qquad (67)$$

Using (65) and (67) in (51), we find the minimum NMSE (MNMSE):

$$\tilde{J}\left[\mathbf{h}_W(k,n)\right] = \frac{\mathrm{iSNR}(k,n)}{1 + \lambda_{\max}(k,n)} \le 1. \qquad (68)$$

B. Minimum Variance Distortionless Response

The celebrated minimum variance distortionless response (MVDR) filter proposed by Capon [28], [29] is usually derived in a context where we have at least two sensors (or microphones) available. Interestingly, with the linear interframe model, we can also derive the MVDR (with one sensor only) by minimizing the MSE of the residual interference-plus-noise, $J_r[\mathbf{h}(k,n)]$, with the constraint that the desired signal is not distorted. Mathematically, this is equivalent to

$$\min_{\mathbf{h}(k,n)}\ \mathbf{h}^{H}(k,n)\boldsymbol{\Phi}_{in}(k,n)\mathbf{h}(k,n) \quad \text{subject to} \quad \mathbf{h}^{H}(k,n)\boldsymbol{\gamma}_X(k,n) = 1, \qquad (69)$$

for which the solution is

$$\mathbf{h}_{\mathrm{MVDR}}(k,n) = \frac{\boldsymbol{\Phi}_{in}^{-1}(k,n)\boldsymbol{\gamma}_X(k,n)}{\boldsymbol{\gamma}_X^{H}(k,n)\boldsymbol{\Phi}_{in}^{-1}(k,n)\boldsymbol{\gamma}_X(k,n)}. \qquad (70)$$

Using (60) and Woodbury's identity again, but with respect to $\boldsymbol{\Phi}_y(k,n)$ this time, we can rewrite the MVDR as

$$\mathbf{h}_{\mathrm{MVDR}}(k,n) = \frac{\boldsymbol{\Phi}_y^{-1}(k,n)\boldsymbol{\gamma}_X(k,n)}{\boldsymbol{\gamma}_X^{H}(k,n)\boldsymbol{\Phi}_y^{-1}(k,n)\boldsymbol{\gamma}_X(k,n)}. \qquad (71)$$

The Wiener and MVDR filters are simply related as follows:

$$\mathbf{h}_W(k,n) = C(k,n)\,\mathbf{h}_{\mathrm{MVDR}}(k,n), \qquad (72)$$

where

$$C(k,n) = \frac{\lambda_{\max}(k,n)}{1 + \lambda_{\max}(k,n)} \qquad (73)$$

is a scaling factor. Here again, the two filters $\mathbf{h}_W(k,n)$ and $\mathbf{h}_{\mathrm{MVDR}}(k,n)$ are equivalent up to a scaling factor. From a narrowband point of view, this scaling is not significant, but from a fullband point of view it can be important since speech signals are broadband in nature. Indeed, it can easily be verified that this scaling factor affects the fullband output SNRs and fullband speech-distortion indices. While the narrowband output SNRs of the Wiener and MVDR filters are the same, the fullband output SNRs are not, because of the scaling factor. It is clear that we always have

$$\mathrm{oSNR}\left[\mathbf{h}_{\mathrm{MVDR}}(k,n)\right] = \mathrm{oSNR}\left[\mathbf{h}_W(k,n)\right] = \lambda_{\max}(k,n), \qquad (74)$$

$$\upsilon_{sd}\left[\mathbf{h}_{\mathrm{MVDR}}(k,n)\right] = 0, \qquad (75)$$

$$\xi_{sr}\left[\mathbf{h}_{\mathrm{MVDR}}(k,n)\right] = 1, \qquad (76)$$

$$\xi_{nr}\left[\mathbf{h}_{\mathrm{MVDR}}(k,n)\right] = \frac{\lambda_{\max}(k,n)}{\mathrm{iSNR}(k,n)}, \qquad (77)$$

and

$$\tilde{J}\left[\mathbf{h}_{\mathrm{MVDR}}(k,n)\right] = \frac{\mathrm{iSNR}(k,n)}{\lambda_{\max}(k,n)}. \qquad (78)$$

C. Tradeoff

In the tradeoff approach, we try to compromise between noise reduction and speech distortion. Instead of minimizing the MSE as we already did to find the Wiener filter, we could minimize the speech-distortion MSE with a constraint on the residual interference-plus-noise. Mathematically, this is equivalent to

$$\min_{\mathbf{h}(k,n)}\ J_d\left[\mathbf{h}(k,n)\right] \quad \text{subject to} \quad J_r\left[\mathbf{h}(k,n)\right] = \beta\,\phi_V(k,n), \qquad (79)$$

where $0 < \beta < 1$ to ensure that we get some noise reduction. By using a Lagrange multiplier, $\mu \ge 0$, to adjoin the constraint to the cost function, we easily deduce the tradeoff filter

$$\mathbf{h}_{T,\mu}(k,n) = \phi_X(k,n)\left[\phi_X(k,n)\boldsymbol{\gamma}_X(k,n)\boldsymbol{\gamma}_X^{H}(k,n) + \mu\,\boldsymbol{\Phi}_{in}(k,n)\right]^{-1}\boldsymbol{\gamma}_X(k,n). \qquad (80)$$

Again by using (60) and Woodbury's identity with respect to $\boldsymbol{\Phi}_{in}(k,n)$, we can rewrite the tradeoff filter as

$$\mathbf{h}_{T,\mu}(k,n) = \frac{\phi_X(k,n)\,\boldsymbol{\Phi}_{in}^{-1}(k,n)\boldsymbol{\gamma}_X(k,n)}{\mu + \lambda_{\max}(k,n)}. \qquad (81)$$

The Lagrange multiplier is determined according to $\beta$ such that the constraint in (79) is met. However, in practice this is not easy. Alternatively, this parameter is chosen in an ad-hoc way, and we see that:
• for $\mu = 1$, $\mathbf{h}_{T,1}(k,n) = \mathbf{h}_W(k,n)$, which is the Wiener filter;
• for $\mu = 0$, $\mathbf{h}_{T,0}(k,n) = \mathbf{h}_{\mathrm{MVDR}}(k,n)$, which is the MVDR filter;
• $\mu > 1$ results in a filter with low residual noise at the expense of high speech distortion;
• $\mu < 1$ results in a filter with high residual noise and low speech distortion.

Again, we observe here as well that the tradeoff and Wiener filters are equivalent up to a scaling factor. As a result, the narrowband output SNR with the tradeoff filter is obviously the same as the narrowband output SNR with the Wiener filter, i.e.,

$$\mathrm{oSNR}\left[\mathbf{h}_{T,\mu}(k,n)\right] = \lambda_{\max}(k,n), \qquad (82)$$

and does not depend on $\mu$. However, the narrowband speech-distortion index is now both a function of the variable $\mu$ and the narrowband output SNR:

$$\upsilon_{sd}\left[\mathbf{h}_{T,\mu}(k,n)\right] = \left[\frac{\mu}{\mu + \lambda_{\max}(k,n)}\right]^{2}. \qquad (83)$$

From (83), we observe how $\mu$ can affect the desired signal. The tradeoff filter is interesting from several perspectives since it encompasses both the Wiener and MVDR filters. It is then useful to study the fullband output SNR and the fullband speech-distortion index of the tradeoff filter, which both depend on the variable $\mu$.
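The three filters and their scaling relations can be checked with a short numerical sketch. The quantities $\phi_X$, $\boldsymbol{\gamma}_X$, and $\boldsymbol{\Phi}_{in}$ below are synthetic stand-ins (assumptions for illustration only); the checks mirror (62), (70), (72)-(73), and (81).

```python
import numpy as np

rng = np.random.default_rng(3)

L = 4
phi_X = 3.0
gamma = (0.85 ** np.arange(L)).astype(complex)    # assumed gamma_X, first entry 1

B = rng.normal(size=(L, L)) + 1j * rng.normal(size=(L, L))
Phi_in = B @ B.conj().T + np.eye(L)               # interference-plus-noise covariance

Phi_in_inv = np.linalg.inv(Phi_in)
lam = (phi_X * (gamma.conj() @ Phi_in_inv @ gamma)).real   # lambda_max, cf. (27)

h_wiener = phi_X * Phi_in_inv @ gamma / (1 + lam)          # cf. (62)
h_mvdr = Phi_in_inv @ gamma / (gamma.conj() @ Phi_in_inv @ gamma)   # cf. (70)

def h_tradeoff(mu):
    return phi_X * Phi_in_inv @ gamma / (mu + lam)         # cf. (81)

assert np.isclose(h_mvdr.conj() @ gamma, 1.0)              # distortionless, cf. (38)
assert np.allclose(h_wiener, lam / (1 + lam) * h_mvdr)     # scaling, cf. (72)-(73)
assert np.allclose(h_tradeoff(1.0), h_wiener)              # mu = 1 gives Wiener
assert np.allclose(h_tradeoff(0.0), h_mvdr)                # mu = 0 gives MVDR
print(lam > 0)                                             # prints: True
```

All three filters are collinear with $\boldsymbol{\Phi}_{in}^{-1}\boldsymbol{\gamma}_X$; only the scalar in front changes, which is exactly why their narrowband output SNRs coincide.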

Using (80) in (31), we find that the fullband output SNR is

$$\mathrm{oSNR}\left[\mathbf{h}_{T,\mu}(:,n)\right] = \frac{\displaystyle\sum_{k}\phi_X(k,n)\,\frac{\lambda_{\max}^{2}(k,n)}{\left[\mu + \lambda_{\max}(k,n)\right]^{2}}}{\displaystyle\sum_{k}\phi_X(k,n)\,\frac{\lambda_{\max}(k,n)}{\left[\mu + \lambda_{\max}(k,n)\right]^{2}}}. \qquad (84)$$

We propose the following.

1) Property 5.1: The fullband output SNR of the tradeoff filter is an increasing function of the parameter $\mu$.
Proof: Indeed, using the proof given in [30] by simply replacing integrals by sums, we find that

$$\frac{\partial\,\mathrm{oSNR}\left[\mathbf{h}_{T,\mu}(:,n)\right]}{\partial\mu} \ge 0, \qquad (85)$$

proving that the fullband output SNR is increasing when $\mu$ is increasing.

From Property 5.1, we deduce that the MVDR filter gives the smallest fullband output SNR, which is

$$\mathrm{oSNR}\left[\mathbf{h}_{T,0}(:,n)\right] = \frac{\sum_{k}\phi_X(k,n)}{\displaystyle\sum_{k}\frac{\phi_X(k,n)}{\lambda_{\max}(k,n)}}. \qquad (86)$$

We give another interesting property.

2) Property 5.2: We have

$$\mathrm{oSNR}\left[\mathbf{h}_{T,\mu}(:,n)\right] \le \frac{\sum_{k}\phi_X(k,n)\,\lambda_{\max}^{2}(k,n)}{\sum_{k}\phi_X(k,n)\,\lambda_{\max}(k,n)},\quad \forall\mu \ge 0. \qquad (87)$$

Proof: Easy to show from (84).

While the fullband output SNR is upper bounded, it is easy to show that the fullband noise-reduction factor and fullband speech-reduction factor are not. So when $\mu$ goes to infinity, so do $\xi_{nr}[\mathbf{h}_{T,\mu}(:,n)]$ and $\xi_{sr}[\mathbf{h}_{T,\mu}(:,n)]$.

The fullband speech-distortion index is

$$\upsilon_{sd}\left[\mathbf{h}_{T,\mu}(:,n)\right] = \frac{\displaystyle\sum_{k}\phi_X(k,n)\,\frac{\mu^{2}}{\left[\mu + \lambda_{\max}(k,n)\right]^{2}}}{\displaystyle\sum_{k}\phi_X(k,n)}. \qquad (88)$$

3) Property 5.3: The fullband speech-distortion index of the tradeoff filter is an increasing function of the parameter $\mu$.
Proof: It is straightforward to verify that

$$\frac{\partial\,\upsilon_{sd}\left[\mathbf{h}_{T,\mu}(:,n)\right]}{\partial\mu} \ge 0, \qquad (89)$$

which ends the proof.

It is clear that

$$0 \le \upsilon_{sd}\left[\mathbf{h}_{T,\mu}(:,n)\right] \le 1. \qquad (90)$$

Therefore, as $\mu$ increases, the fullband output SNR increases at the price of more distortion to the desired signal.

4) Property 5.4: With the tradeoff filter $\mathbf{h}_{T,\mu}(k,n)$, $\mu \ge 0$, the fullband output SNR is always greater than or equal to the fullband input SNR, i.e., $\mathrm{oSNR}[\mathbf{h}_{T,\mu}(:,n)] \ge \mathrm{iSNR}(n)$, $\forall\mu \ge 0$.
Proof: We know that

$$\lambda_{\max}(k,n) \ge \mathrm{iSNR}(k,n), \qquad (91)$$

which implies that

$$\sum_{k}\frac{\phi_X(k,n)}{\lambda_{\max}(k,n)} \le \sum_{k}\phi_V(k,n) \qquad (92)$$

and hence

$$\mathrm{oSNR}\left[\mathbf{h}_{T,0}(:,n)\right] \ge \mathrm{iSNR}(n). \qquad (93)$$

But from Property 5.1, we have

$$\mathrm{oSNR}\left[\mathbf{h}_{T,\mu}(:,n)\right] \ge \mathrm{oSNR}\left[\mathbf{h}_{T,0}(:,n)\right]; \qquad (94)$$

as a result,

$$\mathrm{oSNR}\left[\mathbf{h}_{T,\mu}(:,n)\right] \ge \mathrm{iSNR}(n), \qquad (95)$$

which completes the proof.

To end this section, let us mention that the decision-directed method [31], which is a reliable estimator of the narrowband input SNR, will fit very well with the proposed algorithms since this estimator implicitly assumes that the successive frames are correlated.

VI. EXPERIMENTAL RESULTS

In this section, we present the experimental results of the proposed frequency-domain SCNR algorithms that may use multiple consecutive STFT frames. A comparison with the traditional single-frame Wiener filter will be used to study and validate the merits of exploiting interframe correlations. Due to the limitation of space, the focus is placed on showing the performance of the new multi-frame Wiener and MVDR filters.

A. Setup and Metrics

In our experiments, the microphone signal is artificially synthesized by adding computer-generated white Gaussian random noise or prerecorded real-world noise to a clean speech signal. The clean speech signals were recorded from 22 female and 24 male talkers with a Sennheiser ME-80 Condenser Super Cardoid 50–15000 Hz microphone in the Bell Labs Murray Hill’s anechoic chamber in July/August 1989. Each talker provided 2 to 8 minutes of “conversational speech” that is a “story” about anything that came to his/her mind. All recordings were originally digitized at a sampling rate of 48 kHz with 16 bits per sample. They were then downsampled to 8 kHz for our use. In the experiments presented here, we consider only two female and two male speakers. Each story was cut to have the same length of 60 s. For real-world noise, we only consider car

1264

IEEE TRANSACTIONS ON AUDIO, SPEECH, AND LANGUAGE PROCESSING, VOL. 20, NO. 4, MAY 2012

TABLE I ACCURACY OF THE FREQUENCY ANALYSIS AND RESYNTHESIS PROCEDURE USED IN THE PRESENTED EXPERIMENTS WITH A 75% OVERLAP BETWEEN NEIGHBORING WINDOWS. THE DISTORTION LEVEL IS MEASURED WITH REFERENCE TO THE INPUT SIGNAL

and babble noise. The car noise is fairly stationary but colored with an energy roll-off (approximately 12 dB per octave) towards high frequencies. The babble noise was recorded in the New York Stock Exchange (NYSE). It is not only colored but also nonstationary with mixtures of nearly inaudible voices and sporadic cell phone rings. The noise level is adjusted according to that of the clean speech and a specified input SNR. In the following, if not explicitly stated otherwise, the noise is white Gaussian random noise and the speech source is the first female talker. The fullband output SNR and speech distortion measures are used in our experiments. Both the conventional definitions (see [26] and [27]) and the new definitions (introduced above in Section IV) of these two measures will be investigated. Moreover, we will use the Perceptual Evaluation of Speech Quality (PESQ [34]) to obtain an objective assessment of the overall quality of the speech that are enhanced by the proposed noise reduction algorithms. Algorithm Implementation The algorithms discussed and developed in this paper are all frequency-domain approaches. The STFT is implemented with the Kaiser window and the fast Fourier transform (FFT). The window size in samples is set to be a power of 2. For the traditional single-frame Wiener filter, an overlap of 50% between neighboring windows is commonly used while for the proposed multi-frame Wiener filter and MVDR algorithms we adopt an overlap of 75% to retain a higher inter-frame correlation. The overlap-add method is used for signal reconstruction in the time domain to avoid errors caused by circular convolution. This analysis and synthesis procedure is nearly perfect in Matlab, resulting in little distortion in the reconstructed signal if no manipulation is carried out to its frequency-domain representations. The level of distortion varies with , as shown in Table I. 
The window size determines the FFT resolution and also affects the calculated interframe correlations. These two effects can have probably opposite impacts on the performance of the developed algorithms, which will be explored in the experiments. In all experiments, we use the first 100 frames to compute the initial estimates of and by averaging with a batch method. The rest of the signal frames are then used for performance evaluation. In this process, the estimates of and are recursively updated according to

(96) (97)

and are the forgetting factors. where In order to remove the uncertainty of voice activity detection in an otherwise more practical but less rigorous performance comparison, we choose to update the estimates of noise statistics continuously from the noise signal in these simulations. After and become available at time-frame , is computed as . The interframe correlation vector is then taken as the transpose of the first of speech row vector of normalized by its first element. To compute the inverse of in (59), the technique of regularization is used, so that is replaced by (98) where

is the regularization factor. It is empirically set as in this paper. When the inverse of instead of needs to be calculated as in (70), the technique of regularization will be similarly applied. B. Wiener Filters We first show the performance of the traditional single-frame Wiener filter, which provides a benchmark for studying other takes noise reduction filters. Such a Wiener filter (corresponding to 32 ms) and 50% overlapping windows. In this experiment, we let the input SNR be dB and consider , 0.86, and 0.72. We intend to examine how the fullband performance varies with . Fig. 1 plots the results. Using a large (close to 1), we cannot capture the short-term variation of inherently nonstationary speech signals, but on the other hand with a small , the sample estimate of the signal variance has a large variation due to a limited number of data to do averaging. So the best performance is achieved with a moderate . An interesting observation is that . The perforthe output SNR reaches its peak when is then also included mance of the Wiener filter with in Fig. 1 for easy comparison. Since is directly computed and continuously updated from the noise signal in our simulations, the match of to leads to the same tracking characteristics for and , and hence better noise reduction performances. The second experiment considers the multi-frame Wiener filters. Again, dB. We set and with a 75% overlap. We let go from 1 up to 8. The results are presented in Fig. 2. The performance of the traditional Wiener filter ( , , and a 50% overlap) is also included for comparison. When becomes larger, the size of the covariance matrix that needs to be inverted in the multi-frame Wiener filters grows. As a result, the optimal forgetting factors and that produce the best noise reduction performance increase. It is evident that using multiple STFT frames is helpful. 
Comparing the best-case scenario of L = 4 against that of L = 1 (using the same windowing scheme of K = 64 and an overlap of 75%), we see that the output SNR is improved by approximately 2 dB while the level of speech distortion remains almost the same. However, when L is larger than 4, the gain in the SNR no longer increases significantly. At the same time, the performance becomes more sensitive to the forgetting factors and the algorithm becomes more computationally intensive due to the larger size of the covariance matrix. A comparison of the multi-frame Wiener filter

HUANG AND BENESTY: MULTI-FRAME APPROACH TO THE FREQUENCY-DOMAIN SINGLE-CHANNEL NOISE REDUCTION PROBLEM

Fig. 1. Effect of the forgetting factors on the fullband performance of the traditional Wiener filter without using the interframe correlation (L = 1). The window size is K = 256 (32 ms) with a 50% overlap and iSNR = 0 dB.

Fig. 2. Comparison in the fullband performance of the Wiener filters using various numbers of consecutive STFT frames L to the traditional Wiener filter using K = 256 with a 50% overlap and L = 1. The forgetting factors are equal, the window size is K = 64 (8 ms), and iSNR = 0 dB.

(L = 4, K = 64, and a 75% overlap) to the traditional Wiener filter that uses K = 256 with a 50% overlap reveals that the former still outperforms the latter, although the gain in SNR becomes smaller (it reduces to about 0.5 dB). The third experiment was designed to reveal how the window size affects the performance of the multi-frame Wiener filter. We consider iSNR = 0 dB and L = 4, make the two forgetting factors equal, but let K (a power of 2) vary from 32 to 256. Fig. 3 shows the results of this study. When K is small (e.g., 32), the FFT resolution is poor. In this case, increasing K helps improve the performance, but when K moves from 64 to 128, the obtained performance gain becomes very marginal. This is due to the fact that a large K corresponds to a long gap in time between two consecutive frames and hence leads to a weaker interframe correlation.
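The weakening interframe correlation can be seen from simple arithmetic. With a 75% overlap the frame hop is K/4 samples, and an 8-kHz sampling rate is implied by K = 256 corresponding to 32 ms; the time gap between consecutive frames then grows linearly with K:

```python
fs = 8000  # sampling rate implied by K = 256 corresponding to 32 ms

def frame_gap_ms(K, overlap=0.75):
    """Time gap in milliseconds between consecutive STFT frames."""
    hop = int(K * (1.0 - overlap))
    return 1000.0 * hop / fs

for K in (32, 64, 128, 256):
    print(K, frame_gap_ms(K))  # gap grows from 1 ms (K = 32) to 8 ms (K = 256)
```

The farther apart two frames are in time, the less self-correlated the speech they cover, which is consistent with the marginal gain observed beyond K = 64.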


Fig. 3. Effect of the window size on the fullband performance of the Wiener filters using multiple consecutive STFT frames. The traditional Wiener filter (L = 1, K = 256 with a 50% overlap) is used as the benchmark. For the multi-frame Wiener filters, we set L = 4 and 75% for overlapping. The forgetting factors are equal, and iSNR = 0 dB.

Fig. 4. Comparison of the traditional single-frame (L = 1) and the proposed multi-frame (L = 4) Wiener filters under different input SNR conditions. The forgetting factors are equal and the window size is K = 64 (8 ms) with a 75% overlap.

The fourth and last experiment with the Wiener filters examines the benefit of using multiple frames under different input SNR conditions. This time, we fix K = 64 and again make the two forgetting factors equal. We let iSNR be -10, 0, or 10 dB, and L be 1 or 4. The results are visualized in Fig. 4. It is promising that the gain of using multiple frames is observable over a practically wide range of input SNRs. An interesting discovery is that the gain is greater for a low iSNR than for a high iSNR. Before we conclude this subsection, one thing needs to be clarified and discussed: which set of performance measures we used for the experiments presented above. As a matter of fact, we used the conventional definitions (see [26] and [27]) instead of the definitions introduced in Section IV that are based on the signal decomposition of (15).
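The decomposition of (15) that underlies the new definitions can be sketched as follows; the component names and the convention that the interframe correlation vector is normalized so its first element equals 1 follow the text, while the function itself is our illustration.

```python
import numpy as np

def decompose_output(h, x, v, gamma_x):
    """Split the filter output h^H (x + v) for one frequency bin into the
    three components of (15): filtered desired speech, residual
    interference, and residual noise. x and v hold L consecutive STFT
    frames of speech and noise (current frame first); gamma_x is the
    interframe correlation vector, normalized so gamma_x[0] == 1."""
    X = x[0]                        # desired current-frame speech coefficient
    x_i = x - X * gamma_x           # interference: part uncorrelated with X
    x_fd = X * np.vdot(h, gamma_x)  # filtered desired speech
    x_ri = np.vdot(h, x_i)          # residual interference
    v_rn = np.vdot(h, v)            # residual noise
    return x_fd, x_ri, v_rn

# With the identity (single-frame) filter the interference vanishes, so the
# two sets of performance measure definitions coincide, as stated in the text.
h = np.array([1.0, 0.0], dtype=complex)
gamma_x = np.array([1.0, 0.4 + 0.2j])
x = np.array([2.0 + 1.0j, 0.5 - 0.3j])
v = np.array([0.1 + 0.1j, -0.2j])
x_fd, x_ri, v_rn = decompose_output(h, x, v, gamma_x)
```

The new output SNR treats the residual interference as part of the noise, i.e., it compares the power of x_fd against the combined powers of x_ri and v_rn.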


IEEE TRANSACTIONS ON AUDIO, SPEECH, AND LANGUAGE PROCESSING, VOL. 20, NO. 4, MAY 2012

Fig. 5. Signal waveforms for a multi-frame (L = 4) Wiener filter (only the first 20 seconds are shown): the clean speech x(t), the additive Gaussian noise v(t), the filtered desired speech x (t), the residual interference x (t), and the residual noise v (t). The input SNR is 0 dB, K = 64 with a 75% overlap, and  =  = 0:9.

Fig. 6. Comparison in the fullband performance of the MVDR filters using various numbers of consecutive STFT frames L. The forgetting factors are  =  , the window size is K = 64 (8 ms) with a 75% overlap, and iSNR = 0 dB.

These two sets of definitions differ in the way they treat the interference. The interference is produced by multi-frame processing from the speech components of the previous frames that are uncorrelated with the desired speech in the current frame. So when L = 1 as in the traditional Wiener filter, the interference is always zero and the two sets of performance measure definitions are equivalent; but in the proposed framework that uses multiple frames, it is more insightful to consider the interference as a part of the noise in the new definition of the output SNR, and similarly in that of the speech distortion measure. While this makes sense, we found that the new definitions are not reliable in practice with the Wiener filters. By definition, the desired speech is uncorrelated with the interference vector, and therefore the three components of the narrowband error in (43) through (46) are mutually uncorrelated too. As a result, the narrowband MSE is the sum of three terms [see (47) and (49)], which are all

Fig. 7. Effect of the window size on the fullband performance of the MVDR filter using multiple consecutive STFT frames. The number of used frames is L = 4, there is a 75% overlap, the forgetting factors are  =  , and iSNR = 0 dB.

non-negative. When the narrowband MSE is minimized by the Wiener filter, the power of the residual interference will be small (at least smaller than the MSE), and the conventional and new definitions of the output SNR (respectively, the speech distortion) should not be significantly different. In practice, however, the estimate of the interframe correlation vector can never be perfectly accurate, and the estimation error leads to a leakage of speech into the interference. Consequently, there remain some correlations between the calculated speech distortion [see (45)] and the calculated residual interference in (46). So the decomposition of (47) through (49) does not hold, and the power of the estimated residual interference can be even greater than the minimum MSE, which makes the new definitions unreliable. This issue is illustrated by the waveforms presented in Fig. 5. They are the results of the Wiener filter with L = 4, iSNR = 0 dB, K = 64 with a 75% overlap, and both forgetting factors set to 0.9. We see that the residual interference is clearly correlated with the filtered desired speech and has a large variance. If the new definitions are used, the resulting output SNR and speech distortion values do not reflect our perception of the quality of the enhanced signal: the output of this Wiener filter looks and sounds similar to the clean speech and the noise level has been clearly reduced. The conventional definitions yield values consistent with this perception.

C. MVDR Filters

In the simulations of the MVDR filters, the new performance measures are used. While the signal decomposition can be problematic as explained above, the MVDR filters minimize the variance of the residual interference plus noise under the constraint of no distortion in the filtered desired speech signal, so that the speech leakage into the interference is suppressed too. The first experiment considers the MVDR filters that use different numbers of frames for iSNR = 0 dB and with the window size being set as K = 64. Again we make the two forgetting factors equal. Fig. 6 shows the results.
The levels of the speech distortion index of these MVDR filters clearly indicate that the speech distortionless constraint has been met regardless of the value of L.



Fig. 8. Fullband performance of the MVDR filter under different input SNR conditions. The forgetting factors are equal, the number of used frames is L = 12, and the window size is K = 64 (8 ms) with a 75% overlap.

As L increases, the optimal forgetting factors become larger and the best oSNR improves. When L reaches 12, the benefit of further increasing the number of frames vanishes and is strongly outweighed by the cost of increased complexity. Fig. 7 presents the results of the experiment that investigates the impact of the window size on the MVDR filter. The input SNR is 0 dB and L = 4. It is clear that the optimal forgetting factors do not change much with the window size, and K = 64 produces the best output SNR (only slightly better than the adjacent window sizes). Fig. 8 shows the fullband performance of the MVDR filter under different input SNR conditions. We consider L = 12 and K = 64, and make the two forgetting factors equal. The speech distortionless constraint has always been satisfied and the gain in SNR declines as the input SNR increases.
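A sketch of the multi-frame MVDR filter used in these experiments, under the assumption (consistent with the text) that it minimizes the residual interference-plus-noise variance subject to a distortionless constraint on the filtered desired speech. Here phi_in denotes the interference-plus-noise covariance matrix and gamma_x the interframe correlation vector; the closed form is a standard MVDR solution, not copied from the paper's equations.

```python
import numpy as np

def multiframe_mvdr(phi_in, gamma_x, delta=1e-2):
    """Multi-frame MVDR filter for one frequency bin:
    minimize h^H Phi_in h subject to h^H gamma_x = 1.
    Closed form: h = Phi_in^{-1} gamma_x / (gamma_x^H Phi_in^{-1} gamma_x),
    with diagonal loading applied to Phi_in as in the simulations."""
    L = phi_in.shape[0]
    phi_r = phi_in + delta * np.trace(phi_in).real / L * np.eye(L)
    t = np.linalg.solve(phi_r, gamma_x)
    return t / np.vdot(gamma_x, t)

# The distortionless constraint h^H gamma_x = 1 holds by construction,
# which is why the speech distortion index stays at zero regardless of L.
gamma_x = np.array([1.0, 0.5 - 0.25j, 0.2j])
h = multiframe_mvdr(np.eye(3, dtype=complex), gamma_x)
```

Because the constraint is enforced exactly, any speech that leaks into the interference estimate is suppressed along with the noise rather than counted as distortion.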

D. Perceptual Quality

The conducted research indicates that the output SNR and the speech distortion index together provide a complete and insightful picture of the noise reduction performance. They are closely aligned with our perception of the quality of the enhanced signals in informal listening tests, provided that the proper set of definitions is applied, since each set has some caveats, as explained at the end of Section VI-C. Using the same set of definitions, it has become clear that exploiting the interframe correlation is helpful to both the Wiener and MVDR filters; but comparing the performance of the Wiener and MVDR filters across different sets of performance measure definitions could be contested. So for this task, we chose to use the PESQ measure, which has been found to correlate more highly with the subjective ratings of the overall quality of enhanced speech signals than other widely known objective measures [35], [36]. All four talkers were used to find the average PESQ score for each tested condition. Such a raw PESQ mean opinion score (MOS) is then mapped to the PESQ MOS-LQO (listening quality objective) to

Fig. 9. Comparison of the traditional single-frame Wiener filter (WF, L = 1, K = 256 with a 50% overlap), the multi-frame Wiener filter (WF, L = 4, K = 64 with a 75% overlap), and the MVDR filter (MVDR, L = 12, K = 64 with a 75% overlap) using the PESQ MOS-LQO measure in (a) white Gaussian noise, (b) car noise, and (c) NYSE babble noise. For the single-frame Wiener filter, the forgetting factor is 0.6. For the multi-frame Wiener filter, it is 0.8. For the MVDR filter, it is 0.9.

make a linear connection to subjective MOS using the following mapping function [37]:

MOS-LQO = 0.999 + (4.999 - 0.999) / [1 + exp(-1.4945 PESQ + 4.6607)]. (99)

PESQ MOS-LQO ranges between 1.02 and 4.55. Fig. 9 shows the results for the three different noise types. For the traditional single-frame Wiener filter (K = 256 with a 50% overlap), we set the forgetting factors to 0.6 according to the results presented in Fig. 1. For the proposed multi-frame Wiener filters, we make L = 4 and K = 64 with a 75% overlap, and set the forgetting factors to 0.8. For the MVDR filter, we set L = 12 and K = 64 with a 75% overlap, and the forgetting factors to 0.9. The multi-frame Wiener filter always performs better than its single-frame counterpart for all noise types. It is noted that the MVDR filter produces low speech distortion but high residual noise. When the input SNR
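The mapping function (99) from ITU-T Rec. P.862.1 [37] can be implemented directly; the endpoints of the raw PESQ range reproduce the 1.02 to 4.55 MOS-LQO range quoted above.

```python
import math

def pesq_to_mos_lqo(pesq_raw):
    """ITU-T P.862.1 mapping of a raw PESQ score to MOS-LQO:
    y = 0.999 + (4.999 - 0.999) / (1 + exp(-1.4945 * x + 4.6607))."""
    return 0.999 + 4.0 / (1.0 + math.exp(-1.4945 * pesq_raw + 4.6607))

# Raw PESQ scores lie in [-0.5, 4.5]; the mapped range is about [1.02, 4.55].
lo = pesq_to_mos_lqo(-0.5)
hi = pesq_to_mos_lqo(4.5)
```

The logistic shape compresses differences at the extremes of the raw scale, so the mapping mainly affects comparisons near the ends of the quality range.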


IEEE TRANSACTIONS ON AUDIO, SPEECH, AND LANGUAGE PROCESSING, VOL. 20, NO. 4, MAY 2012

is low (lower than 10 dB), the high level of the residual noise outweighs the speech distortion in the PESQ measure, so the MVDR filter yields lower PESQ scores than the two Wiener filters. Conversely, when the input SNR gets practically high, speech distortion becomes much easier to perceive against the lower residual noise in the background. Consequently, the MVDR filter has higher PESQ scores than the Wiener filters in those conditions.

VII. CONCLUSION

In this paper, we presented an insightful analysis of the frequency-domain SCNR algorithms whose solutions are all finally expressed as gain functions applied to the spectrum of the noisy speech only in the current frame. We explained that this common feature is due to the disregard of the interframe correlation, which may be strong for speech. By taking the interframe correlation into account, we proposed a new linear model for speech spectral estimation and developed three optimal filters, namely, the Wiener, MVDR, and tradeoff filters. It was proved that with these filters, both the narrowband and fullband output SNRs can be improved. Furthermore, with the MVDR filter, speech distortion at the output is theoretically zero. Extensive simulation results were reported and clearly justify the advantage of exploiting the interframe correlation for SCNR.

REFERENCES

[1] J. S. Lim and A. V. Oppenheim, "Enhancement and bandwidth compression of noisy speech," Proc. IEEE, vol. 67, no. 12, pp. 1586-1604, Dec. 1979.
[2] V. Bray and M. Valente, "Can omni-directional hearing aids improve speech understanding in noise?," Audiol. Online, Sep. 2001. [Online]. Available: http://www.audiologyonline.com/articles/article detail.asp?article id=300, accessed Jan. 30, 2011.
[3] M. R. Schroeder, "Apparatus for suppressing noise and distortion in communication signals," U.S. Patent 3,180,936, 1965.
[4] M. R. Schroeder, "Processing of communications signals to reduce effects of noise," U.S. Patent 3,403,224, 1968.
[5] M. R. Weiss, E. Aschkenasy, and T. W. Parsons, "Processing speech signals to attenuate interference," in Proc. IEEE Symp. Speech Recognition, Apr. 15-19, 1974, pp. 292-295.
[6] S. F. Boll, "Suppression of acoustic noise in speech using spectral subtraction," IEEE Trans. Acoust., Speech, Signal Process., vol. ASSP-27, no. 2, pp. 113-120, Apr. 1979.
[7] R. J. McAulay and M. L. Malpass, "Speech enhancement using a soft-decision noise suppression filter," IEEE Trans. Acoust., Speech, Signal Process., vol. 28, pp. 137-145, Apr. 1980.
[8] M. Berouti, M. Schwartz, and J. Makhoul, "Enhancement of speech corrupted by acoustic noise," in Proc. IEEE ICASSP, 1979, pp. 208-211.
[9] Y. Ephraim and D. Malah, "Speech enhancement using a minimum mean-square error short-time spectral amplitude estimator," IEEE Trans. Acoust., Speech, Signal Process., vol. ASSP-32, no. 6, pp. 1109-1121, Dec. 1984.
[10] Y. Ephraim and D. Malah, "Speech enhancement using a minimum mean-square error log-spectral amplitude estimator," IEEE Trans. Acoust., Speech, Signal Process., vol. ASSP-33, no. 2, pp. 443-445, Apr. 1985.
[11] C. H. You, S. N. Koh, and S. Rahardja, "Beta-order MMSE spectral amplitude estimation for speech enhancement," IEEE Trans. Speech Audio Process., vol. 13, no. 4, pp. 475-486, Jul. 2005.
[12] P. C. Loizou, "Speech enhancement based on perceptually motivated Bayesian estimators of the magnitude spectrum," IEEE Trans. Speech Audio Process., vol. 13, no. 5, pp. 857-869, Sep. 2005.

[13] E. Plourde and B. Champagne, "Generalized Bayesian estimators of the spectral amplitude for speech enhancement," IEEE Signal Process. Lett., vol. 16, no. 6, pp. 485-488, Jun. 2009.
[14] P. J. Wolfe and S. J. Godsill, "Efficient alternatives to the Ephraim-Malah suppression rule for audio signal enhancement," EURASIP J. Appl. Signal Process., Special Issue: Digital Audio for Multimedia Commun., pp. 1043-1051, Sep. 2003.
[15] T. Lotter and P. Vary, "Noise reduction by maximum a posteriori spectral amplitude estimation with supergaussian speech modelling," in Proc. Int. Workshop Acoust. Echo Noise Control, Sep. 2003, pp. 83-86.
[16] Y. Ephraim, D. Malah, and B.-H. Juang, "On the application of hidden Markov models for enhancing noisy speech," IEEE Trans. Acoust., Speech, Signal Process., vol. 37, no. 12, pp. 1846-1856, Dec. 1989.
[17] J. Benesty and J. Chen, Optimal Time-Domain Noise Reduction Filters—A Theoretical Study. Berlin, Germany: Springer, 2011.
[18] J. Benesty, J. Chen, and Y. Huang, Speech Enhancement in the Karhunen-Loeve Expansion Domain. San Rafael, CA: Morgan & Claypool, 2011.
[19] W.-R. Wu and P.-C. Chen, "Subband Kalman filtering for speech enhancement," IEEE Trans. Circuits Syst. II, Analog Digit. Signal Process., vol. 45, no. 8, pp. 1072-1083, Aug. 1998.
[20] J. Le Roux, E. Vincent, Y. Mizuno, H. Kameoka, N. Ono, and S. Sagayama, "Consistent Wiener filtering: Generalized time-frequency masking respecting spectrogram consistency," in Proc. 9th Int. Conf. Latent Variable Anal. Signal Separat. (LVA/ICA), Sep. 2010, pp. 89-96.
[21] P. Vary and R. Martin, Digital Speech Transmission: Enhancement, Coding and Error Concealment. Chichester, U.K.: Wiley, 2006.
[22] P. Loizou, Speech Enhancement: Theory and Practice. Boca Raton, FL: CRC, 2007.
[23] J. Benesty, J. Chen, Y. Huang, and I. Cohen, Noise Reduction in Speech Processing. Berlin, Germany: Springer-Verlag, 2009.
[24] J. Benesty, J. Chen, and Y. Huang, "On noise reduction in the Karhunen-Loeve expansion domain," in Proc. IEEE ICASSP, 2009, pp. 25-28.
[25] I. Cohen, "Relaxed statistical model for speech enhancement and a priori SNR estimation," IEEE Trans. Speech Audio Process., vol. 13, pp. 870-881, Sep. 2005.
[26] J. Benesty, J. Chen, Y. Huang, and S. Doclo, "Study of the Wiener filter for noise reduction," in Speech Enhancement, J. Benesty, S. Makino, and J. Chen, Eds. Berlin, Germany: Springer-Verlag, 2005, ch. 2, pp. 9-41.
[27] J. Chen, J. Benesty, Y. Huang, and S. Doclo, "New insights into the noise reduction Wiener filter," IEEE Trans. Audio, Speech, Lang. Process., vol. 14, no. 4, pp. 1218-1234, Jul. 2006.
[28] J. Capon, "High resolution frequency-wavenumber spectrum analysis," Proc. IEEE, vol. 57, no. 8, pp. 1408-1418, Aug. 1969.
[29] R. T. Lacoss, "Data adaptive spectral analysis methods," Geophysics, vol. 36, pp. 661-675, Aug. 1971.
[30] M. Souden, J. Benesty, and S. Affes, "On the global output SNR of the parameterized frequency-domain multichannel noise reduction Wiener filter," IEEE Signal Process. Lett., vol. 17, pp. 425-428, May 2010.
[31] Y. Ephraim and D. Malah, "Speech enhancement using a minimum mean-square error short-time spectral amplitude estimator," IEEE Trans. Acoust., Speech, Signal Process., vol. ASSP-32, no. 6, pp. 1109-1121, Dec. 1984.
[32] O. Frost, "An algorithm for linearly constrained adaptive array processing," Proc. IEEE, vol. 60, no. 1, pp. 926-935, Jan. 1972.
[33] M. Er and A. Cantoni, "Derivative constraints for broad-band element space antenna array processors," IEEE Trans. Acoust., Speech, Signal Process., vol. 31, no. 6, pp. 1378-1393, Dec. 1983.
[34] Perceptual Evaluation of Speech Quality (PESQ), an Objective Method for End-to-End Speech Quality Assessment of Narrowband Telephone Networks and Speech Codecs, ITU-T Rec. P.862, 2001.
[35] T. Rohdenburg, V. Hohmann, and B. Kollmeier, "Objective perceptual quality measures for the evaluation of noise reduction schemes," in Proc. 9th Int. Workshop Acoust. Echo Noise Control, 2005, pp. 169-172.
[36] Y. Hu and P. C. Loizou, "Evaluation of objective quality measures for speech enhancement," IEEE Trans. Audio, Speech, Lang. Process., vol. 16, no. 1, pp. 229-238, Jan. 2008.
[37] Mapping Function for Transforming Raw Result Scores to MOS-LQO, ITU-T Rec. P.862.1, 2003.


Yiteng (Arden) Huang (S'97-M'01) received the B.S. degree from Tsinghua University, Beijing, China, in 1994 and the M.S. and Ph.D. degrees from the Georgia Institute of Technology (Georgia Tech) in 1998 and 2001, respectively, all in electrical and computer engineering. From March 2001 to January 2008, he was a Member of Technical Staff at Bell Laboratories, Murray Hill, NJ. In January 2008, he founded WeVoice, Inc., Bridgewater, NJ, and served as its CTO. His current research interests are in acoustic signal processing and multimedia communications. He has coauthored/coedited eight books and published many journal and conference papers. Dr. Huang served as an Associate Editor for the EURASIP Journal on Applied Signal Processing from 2004 to 2008 and for the IEEE SIGNAL PROCESSING LETTERS from 2002 to 2005. He served as a technical cochair of the 2005 Joint Workshop on Hands-Free Speech Communication and Microphone Arrays and the 2009 IEEE Workshop on Applications of Signal Processing to Audio and Acoustics. He received the 2008 Best Paper Award and the 2002 Young Author Best Paper Award from the IEEE Signal Processing Society, the 2000-2001 Outstanding Graduate Teaching Assistant Award from the School of Electrical and Computer Engineering, Georgia Tech, the 2000 Outstanding Research Award from the Center of Signal and Image Processing, Georgia Tech, and the 1997-1998 Colonel Oscar P. Cleaver Outstanding Graduate Student Award from the School of Electrical and Computer Engineering, Georgia Tech. He is leading the efforts as the Principal Investigator to develop new voice communication systems for NASA's next-generation EVA (Extra Vehicular Activity) spacesuits. In 2009 and 2010, he was granted two NASA Tech Brief awards for his contributions.


Jacob Benesty was born in 1963. He received the M.S. degree in microwaves from Pierre and Marie Curie University, Paris, France, in 1987, and the Ph.D. degree in control and signal processing from Orsay University, Orsay, France, in April 1991. During the Ph.D. degree (from November 1989 to April 1991), he worked on adaptive filters and fast algorithms at the Centre National d'Etudes des Telecommunications (CNET), Paris, France. From January 1994 to July 1995, he worked at Telecom Paris University on multichannel adaptive filters and acoustic echo cancellation. From October 1995 to May 2003, he was first a Consultant and then a Member of the Technical Staff at Bell Laboratories, Murray Hill, NJ. In May 2003, he joined the University of Quebec, INRS-EMT, in Montreal, QC, Canada, as a Professor. His research interests are in signal processing, acoustic signal processing, and multimedia communications. He is the inventor of many important technologies. In particular, he was the Lead Researcher at Bell Labs who conceived and designed the world's first real-time hands-free full-duplex stereophonic teleconferencing system. Also, he and T. Gaensler conceived and designed the world's first PC-based multi-party hands-free full-duplex stereo conferencing system over IP networks. He is the editor of the book series Springer Topics in Signal Processing. He has coauthored and coedited many books in the area of acoustic signal processing. He is also the lead editor-in-chief of the reference Springer Handbook of Speech Processing (Springer-Verlag, 2007). Dr. Benesty was the cochair of the 1999 International Workshop on Acoustic Echo and Noise Control and the general cochair of the 2009 IEEE Workshop on Applications of Signal Processing to Audio and Acoustics. He was a member of the IEEE Signal Processing Society Technical Committee on Audio and Electroacoustics and a member of the editorial board of the EURASIP Journal on Applied Signal Processing.
He is the recipient, with Morgan and Sondhi, of the IEEE Signal Processing Society 2001 Best Paper Award. He is the recipient, with Chen, Huang, and Doclo, of the IEEE Signal Processing Society 2008 Best Paper Award. He is also the coauthor of a paper for which Y. Huang received the IEEE Signal Processing Society 2002 Young Author Best Paper Award. In 2010, he received the “Gheorghe Cartianu Award” from the Romanian Academy.