IEEE TRANSACTIONS ON SPEECH AND AUDIO PROCESSING, VOL. 11, NO. 5, SEPTEMBER 2003


A Perceptually Motivated Approach for Speech Enhancement

Yi Hu, Student Member, IEEE, and Philipos C. Loizou, Member, IEEE

Abstract—A new perceptually motivated approach is proposed in this paper for enhancement of speech corrupted by colored noise. The proposed approach takes into account the frequency masking properties of the human auditory system and reduces the perceptual effect of the residual noise. This new perceptual method is incorporated into a frequency-domain speech enhancement method and a subspace-based speech enhancement method. A better power spectrum/autocorrelation function estimator was also developed to improve the performance of the proposed algorithms. Objective measures and informal listening tests demonstrated significant improvements over other methods when tested with TIMIT sentences corrupted by various types of colored noise.

Index Terms—Multitaper power spectrum estimation, multiwindow covariance matrix estimation, perceptual weighting, spectral subtraction method, speech enhancement, subspace method.

Manuscript revised June 5, 2003. This work was supported in part by the National Institute of Deafness and other Communication Disorders/National Institutes of Health under Grant R01 DC03421. The associate editor coordinating the review of this manuscript and approving it for publication was Prof. Li Deng. The authors are with the Department of Electrical Engineering, University of Texas at Dallas, Richardson, TX 75083-0688 USA (e-mail: [email protected]; [email protected]). Digital Object Identifier 10.1109/TSA.2003.815936

I. INTRODUCTION

Most speech enhancement methods suffer from an annoying residual noise known as musical noise. Several methods have been proposed to reduce the effect of this residual noise [1]–[4]. Some of these methods attempt to reduce the perceptual effect of the residual noise by using a model of the human auditory system of the kind widely used in wideband audio coding [5]–[7]. This model is based on the fact that additive noise is inaudible to the human ear as long as it falls below some masking threshold. Methods exist to calculate the masking threshold in the frequency domain using critical band analysis and the excitation pattern of the basilar membrane [5]. The idea behind the perceptually based methods for speech enhancement is to shape the residual noise spectrum in such a way that it falls below the masking threshold, thereby making the residual noise inaudible. Virag [3] applied such a model in spectral subtraction using a two-stage approach. In the first stage, a nonaggressive spectral subtraction algorithm was applied to the noisy speech, and the enhanced speech was used to get an estimate of the masking thresholds. The second stage incorporated the estimated masking thresholds in the subtractive process. In [4], Jabloun and Champagne proposed a way to incorporate the above auditory model in subspace-based methods for speech enhancement. They first performed the eigen-decomposition of the clean signal covariance matrix (obtained by subtracting the estimated noise covariance matrix from the noisy speech covariance matrix) and then estimated the masking thresholds using the spectrum derived from an eigen-to-frequency domain transformation. The masking thresholds were transformed back to the eigenvalue domain using a frequency-to-eigen domain transformation and then incorporated into the signal-subspace approach. In [8], the authors proposed a perceptual post-filter for signal-subspace enhancement using masking thresholds estimated by first pre-processing the signal through spectral subtraction. Similar techniques for incorporating masking thresholds in speech enhancement were also used in [2], [9], [10].

The success of the above methods depends largely on accurate estimation of the masking thresholds in noise. In most cases, these methods rely on pre-enhancing the signal to estimate the thresholds and then incorporating the estimated thresholds in the speech enhancement algorithm. In low signal-to-noise ratio (SNR) conditions, however, the estimated masking threshold levels might deviate from the true ones, possibly leading to additional residual noise caused by the inaccuracy of the estimated thresholds.

In this paper, we propose a new perceptually motivated approach for speech enhancement which does not depend on accurate estimation of the masking thresholds. The proposed approach is motivated by the perceptual weighting technique used in low-rate analysis-by-synthesis speech coders [11], [12], which takes into account the frequency-domain masking properties of the human auditory system. Analysis-by-synthesis speech coders (e.g., CELP) obtain the optimal excitation for LPC synthesis in a closed-loop manner by minimizing a perceptually weighted error criterion [12]. This criterion was motivated by frequency masking experiments [13] which showed that the auditory system has a limited ability to detect noise in frequency bands in which the speech signal has high energy, i.e., near the formant peaks. Hence, to capitalize on this masking effect, the quantization noise needs to be distributed unevenly across the frequency bands in such a way that it is masked by the speech signal. This can be done using a perceptual filter derived from the LPC analysis of the speech signal [11]. The above perceptual error criterion was incorporated in the proposed approach in both the time and frequency domains using a constrained minimization framework. In the frequency domain, we show that the proposed approach has the form of a modified version of the Wiener filter.

In the proposed methods, we also address a fundamental issue that is partly responsible for the musical noise, namely, the inaccurate estimation of the power spectrum and the covariance matrix. This issue is unduly ignored by most speech enhancement methods; however, it is very



important for obtaining good speech quality. For better spectrum estimates, we propose the use of multitaper spectral estimators, which have been shown to have better bias and variance properties than traditional spectral estimation methods. For better covariance matrix estimation, we propose the use of a multiwindow estimator of the covariance matrix, as suggested in [14].

This paper is organized as follows. Section II describes the perceptual weighting technique, Section III presents the perceptually motivated frequency domain method for speech enhancement, and Section IV presents the perceptually motivated subspace method. The experimental results are given in Section V, and the conclusions in Section VI.

II. INTRODUCTION TO THE PERCEPTUAL WEIGHTING TECHNIQUE

In this section, we first introduce the perceptual weighting technique used extensively in analysis-by-synthesis speech coders, and then derive the perceptual weighting matrices incorporated in the subspace method and the frequency domain method, respectively.

Fig. 1. LPC spectrum of the vowel /iy/ and the frequency response of the corresponding perceptual weighting filter, with $p$ equal to 20 and $\gamma_1$ and $\gamma_2$ set to 1 and 0.9, respectively.

A. Perceptual Weighting

In most low-rate speech coders (e.g., CELP), the excitation used for LPC synthesis is selected in a closed-loop fashion using a perceptually weighted error criterion [12]. This error criterion exploits the masking properties of the auditory system. More specifically, it is based on the fact that the auditory system has a limited ability to detect quantization noise near the high-energy regions of the spectrum (e.g., near the formant peaks). Quantization noise near the formant peaks is masked by the formant peaks, and is therefore not audible. Auditory masking can be exploited by shaping the frequency spectrum of the error so that less emphasis is placed near the formant peaks and more emphasis is placed on the spectral valleys, where any amount of noise present will be audible. The error is shaped using the following perceptual filter:

$$W(z) = \frac{A(z/\gamma_1)}{A(z/\gamma_2)} \quad (1)$$

where $A(z) = 1 - \sum_{k=1}^{p} a_k z^{-k}$ is the LPC polynomial, $a_k$ are the short-term linear prediction coefficients, $\gamma_1$ and $\gamma_2$ are parameters that control the energy of the error in the formant regions, and $p$ is the prediction order. Fig. 1 shows an example of the spectrum of the perceptual filter for a speech segment extracted from the vowel /iy/ in “heed.” As can be seen, more emphasis is placed in the spectral valleys and less emphasis is placed near the formant peaks.

The above perceptual weighting filter was incorporated into the proposed methods to shape the residual noise and make it inaudible. This was done by using a perceptually weighted error criterion rather than a mean square error criterion in the speech enhancement algorithm.

B. Perceptual Weighting in the Time Domain

The perceptually weighted error signal $\epsilon_w(n)$ is related to the error signal $\epsilon(n)$ by

$$\epsilon_w(n) = \epsilon(n) * w_p(n)$$

where $*$ denotes the convolution operator and $w_p(n)$ is the impulse response of $W(z)$. In practice, a truncated impulse response can be used. With the above equation, the perceptual weighting matrix $\mathbf{W}$ used in the subspace method (with time-domain constraints) is a lower triangular matrix of the form

$$\mathbf{W} = \begin{bmatrix} w_p(0) & 0 & \cdots & 0 \\ w_p(1) & w_p(0) & \cdots & 0 \\ \vdots & \vdots & \ddots & \vdots \\ w_p(N-1) & w_p(N-2) & \cdots & w_p(0) \end{bmatrix} \quad (2)$$

where $w_p(n)$, $0 \le n \le N-1$, is the truncated impulse response of the perceptual weighting filter $W(z)$ given in (1) and $N$ is the frame length.

C. Perceptual Weighting Matrix in the Frequency Domain

Let $V(\omega_k)$ be the frequency response of the perceptual filter $W(z)$ defined in (1), i.e.,

$$V(\omega_k) = W(e^{j\omega_k}), \qquad \omega_k = \frac{2\pi k}{N}, \quad k = 0, 1, \ldots, N-1. \quad (3)$$

Denote the frequency response of the error signal by $E(\omega_k)$ and the frequency response of the perceptually weighted error signal by $E_w(\omega_k)$. $E_w(\omega_k)$ is related to $E(\omega_k)$ by $E_w(\omega_k) = V(\omega_k)\,E(\omega_k)$. With the above equation, the perceptual weighting matrix $\mathbf{W}_f$ used in the frequency domain method can be defined as the following diagonal matrix:

$$\mathbf{W}_f = \mathrm{diag}\big(V(\omega_0), V(\omega_1), \ldots, V(\omega_{N-1})\big) \quad (4)$$

where $N$ is the FFT length.
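As a concrete illustration of (1)–(4), the sketch below derives the LPC coefficients of a frame, forms $W(z)$, and returns both the frequency response $V(\omega_k)$ of (3) (whose diagonal arrangement gives $\mathbf{W}_f$ in (4)) and the lower-triangular matrix of (2). This is a minimal sketch under stated assumptions, not the authors' code: the function names are ours, and the autocorrelation-method LPC is one standard choice.

```python
import numpy as np
from scipy.signal import freqz, lfilter
from scipy.linalg import toeplitz

def levinson_durbin(r, p):
    """Autocorrelation-method LPC via Levinson-Durbin; returns
    [1, a_1, ..., a_p] for A(z) = 1 + sum_k a_k z^-k (the sign
    convention differs from the paper's, but W(z) is unchanged)."""
    a = np.zeros(p + 1)
    a[0] = 1.0
    err = r[0]
    for i in range(1, p + 1):
        k = -(r[i] + a[1:i] @ r[i - 1:0:-1]) / err
        a[1:i] = a[1:i] + k * a[i - 1:0:-1]
        a[i] = k
        err *= 1.0 - k * k
    return a

def perceptual_weighting(frame, p=20, g1=1.0, g2=0.9, nfft=512):
    """Builds W(z) = A(z/g1)/A(z/g2) of eq. (1); returns V(w_k) of
    eq. (3) and the lower-triangular Toeplitz matrix W of eq. (2)."""
    N = len(frame)
    r = np.correlate(frame, frame, mode="full")[N - 1:]
    a = levinson_durbin(r, p)
    num = a * g1 ** np.arange(p + 1)        # A(z/g1): a_k -> a_k g1^k
    den = a * g2 ** np.arange(p + 1)        # A(z/g2): a_k -> a_k g2^k
    _, V = freqz(num, den, worN=nfft, whole=True)   # eq. (3)
    imp = np.zeros(N); imp[0] = 1.0
    w_imp = lfilter(num, den, imp)          # truncated impulse response of W(z)
    W = toeplitz(w_imp, np.zeros(N))        # lower-triangular matrix, eq. (2)
    return V, W
```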


III. PERCEPTUALLY MOTIVATED FREQUENCY DOMAIN METHOD FOR SPEECH ENHANCEMENT

This section presents the perceptually motivated frequency domain method, followed by the implementation details.


A. Principles of Perceptually Based Frequency Domain Approach

Assuming that the noise is additive and uncorrelated with the speech, the noisy signal can be written as

$$\mathbf{y} = \mathbf{x} + \mathbf{n} \quad (5)$$

where $\mathbf{y}$, $\mathbf{x}$, and $\mathbf{n}$ are the $N$-dimensional noisy speech, clean speech, and noise vectors, respectively. We denote the $N$-point discrete Fourier transform matrix by

$$\mathbf{F} = \begin{bmatrix} \mathbf{f}_0^T \\ \mathbf{f}_1^T \\ \vdots \\ \mathbf{f}_{N-1}^T \end{bmatrix}, \qquad \mathbf{f}_k = \begin{bmatrix} 1 & e^{-j\omega_k} & \cdots & e^{-j\omega_k(N-1)} \end{bmatrix}^T$$

where $\omega_k = 2\pi k/N$. The Fourier transform of the noisy speech can then be written as

$$\mathbf{Y} = \mathbf{F}\mathbf{y} = \mathbf{F}\mathbf{x} + \mathbf{F}\mathbf{n} = \mathbf{X} + \mathbf{N} \quad (6)$$

where $\mathbf{X} = \mathbf{F}\mathbf{x}$ and $\mathbf{N} = \mathbf{F}\mathbf{n}$ are the vectors containing the spectral components of the clean speech vector $\mathbf{x}$ and the noise vector $\mathbf{n}$, respectively.

Let $\hat{\mathbf{X}} = \mathbf{G}\mathbf{Y}$ be the linear estimator of $\mathbf{X}$, where $\mathbf{G}$ is an $N \times N$ matrix. The error signal obtained by this estimation is given by

$$\boldsymbol{\epsilon} = \hat{\mathbf{X}} - \mathbf{X} = (\mathbf{G} - \mathbf{I})\mathbf{X} + \mathbf{G}\mathbf{N} \triangleq \boldsymbol{\epsilon}_x + \boldsymbol{\epsilon}_n$$

where $\boldsymbol{\epsilon}_x = (\mathbf{G} - \mathbf{I})\mathbf{X}$ represents the speech distortion in the frequency domain and $\boldsymbol{\epsilon}_n = \mathbf{G}\mathbf{N}$ represents the residual noise in the frequency domain. After defining the energy of the spectral speech distortion, $\bar{\epsilon}_x^2$, and the energy of the spectral residual noise, $\bar{\epsilon}_n^2$, as

$$\bar{\epsilon}_x^2 = E[\boldsymbol{\epsilon}_x^H \boldsymbol{\epsilon}_x], \qquad \bar{\epsilon}_n^2 = E[\boldsymbol{\epsilon}_n^H \boldsymbol{\epsilon}_n] \quad (7)$$

we can obtain the optimal linear estimator by solving the following constrained optimization problem:

$$\min_{\mathbf{G}} \bar{\epsilon}_x^2 \quad \text{subject to} \quad \bar{\epsilon}_n^2 \le \alpha \quad (8)$$

where $\alpha$ is a positive number. The estimator derived this way minimizes the energy of the frequency domain speech distortion while maintaining the energy of the frequency domain residual noise below the preset threshold $\alpha$.

In this paper, we propose the use of a perceptually weighted residual noise in place of the constraint in (8). The main motivation is to shape the spectrum of the residual noise so that it is not perceptually audible. The perceptually weighted residual noise can be obtained as follows:

$$\boldsymbol{\epsilon}_n^w = \mathbf{W}_f\, \boldsymbol{\epsilon}_n$$

where $\mathbf{W}_f$ is the $N$-dimensional diagonal perceptual weighting matrix defined in (4). Defining the energy of the perceptually weighted residual noise as

$$\bar{\epsilon}_{n,w}^2 = E[(\boldsymbol{\epsilon}_n^w)^H \boldsymbol{\epsilon}_n^w] \quad (9)$$

the proposed optimal linear estimator can be obtained by solving the following constrained optimization problem:

$$\min_{\mathbf{G}} \bar{\epsilon}_x^2 \quad \text{subject to} \quad \bar{\epsilon}_{n,w}^2 \le \alpha \quad (10)$$

The solution to (10) can be found using a method similar to that in [19]. Specifically, $\mathbf{G}$ is a stationary feasible point if it satisfies the gradient equation of the objective function

$$L(\mathbf{G}, \mu) = \bar{\epsilon}_x^2 + \mu\,(\bar{\epsilon}_{n,w}^2 - \alpha) \quad (11)$$

where $\mu$ is the Lagrangian multiplier. From $\nabla_{\mathbf{G}} L(\mathbf{G}, \mu) = 0$ we have

$$\mathbf{G}\mathbf{R}_X + \mu\, \mathbf{W}_f^H \mathbf{W}_f\, \mathbf{G}\, \mathbf{R}_N = \mathbf{R}_X \quad (12)$$

where $\mathbf{R}_X = E[\mathbf{X}\mathbf{X}^H]$ and $\mathbf{R}_N = E[\mathbf{N}\mathbf{N}^H]$.

To simplify matters, we assume that a gain is applied to each frequency component individually, i.e., we assume that $\mathbf{G}$ is a diagonal matrix. The matrices $\mathbf{R}_X$ and $\mathbf{R}_N$ are asymptotically diagonal [15] (assuming that the time-domain covariance matrices of $\mathbf{x}$ and $\mathbf{n}$ are Toeplitz), and the diagonal elements of $\mathbf{R}_X$ and $\mathbf{R}_N$ are the power spectrum components $P_x(\omega_k)$ and $P_n(\omega_k)$ of the clean speech vector $\mathbf{x}$ and the noise vector $\mathbf{n}$, respectively. Denoting the diagonal elements of $\mathbf{G}$ by $g_k$ and the diagonal elements of the matrix $\mathbf{W}_f$ by $V(\omega_k)$ as in (3), (12) can be simplified to

$$g_k\, P_x(\omega_k) + \mu\, |V(\omega_k)|^2\, g_k\, P_n(\omega_k) = P_x(\omega_k).$$

The gain function $g_k$ for the $k$th frequency component is therefore given by

$$g_k = \frac{P_x(\omega_k)}{P_x(\omega_k) + \mu\, |V(\omega_k)|^2\, P_n(\omega_k)} \quad (13)$$

Since in practice we do not have access to the power spectrum of the clean speech, we use $\hat{P}_x(\omega_k) = P_y(\omega_k) - P_n(\omega_k)$, where $P_y(\omega_k)$ is the power spectrum of the noisy speech. The above solution can be viewed as a modified version of the Wiener filter or a modified version of the power spectral subtraction method. Note that if $|V(\omega_k)|^2 = 1$ for all $k$, (13) has the form of the parametric over-subtraction method of [1] with over-subtraction factor $\mu$, and if in addition $\mu = 1$, (13) reduces to the Wiener filter.
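A one-function sketch of the gain in (13) follows; the function name and the small positive floor on the estimated clean-speech spectrum are our assumptions (the paper's spectral floor is described in Step 1 below).

```python
import numpy as np

def perceptual_gain(P_y, P_n, V, mu):
    """Gain function of eq. (13): g_k = Px/(Px + mu*|V|^2*Pn),
    with Px estimated as Py - Pn and floored to stay positive."""
    P_x = np.maximum(P_y - P_n, 1e-3 * P_n)   # assumed floor; see Step 1 below
    return P_x / (P_x + mu * np.abs(V) ** 2 * P_n)
```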


B. Power Spectrum Estimation

From (13), it can be seen that the linear estimator in the frequency domain depends on the accurate estimation of the power spectra $P_x(\omega)$ and $P_n(\omega)$. Informal listening tests revealed that the quality of the enhanced speech is greatly influenced by the estimation of the power spectrum, in that different types of power spectrum estimators yield different speech quality. In this paper, we consider using the multitaper method proposed by Thomson [16] for spectrum estimation.

Direct spectrum estimation based on Hamming windowing is the most often used power spectrum estimator for speech enhancement. Although windowing reduces the bias, it does not reduce the variance of the spectral estimate [17]. The idea behind the multitaper spectrum estimator [16], [18] is to reduce this variance by computing a small number of direct spectrum estimators, each with a different taper (window), and then averaging the spectral estimates (the underlying philosophy is similar to that of Welch's periodogram method [17]). If the $L$ tapers are chosen to be pairwise orthogonal and properly designed to prevent leakage, then the resulting multitaper spectral estimator will be superior to the periodogram in terms of reduced bias and variance. In fact, the variance of the multitaper estimate will be smaller than the variance of each individual spectral estimate by a factor of $L$. The multitaper method therefore obtains a satisfactory tradeoff between estimation bias and estimation variance. The multitaper spectrum estimator is given by

$$\hat{S}^{mt}(\omega) = \frac{1}{L}\sum_{m=0}^{L-1} \hat{S}_m(\omega), \qquad \hat{S}_m(\omega) = \left|\sum_{n=0}^{N-1} h_m(n)\, x(n)\, e^{-j\omega n}\right|^2 \quad (14)$$

where $h_m(n)$ is the $m$th data taper used for the spectral estimate $\hat{S}_m(\omega)$, which is also called the $m$th eigenspectrum. The tapers are chosen to be orthonormal, i.e., $\sum_n h_m(n) h_l(n)$ equals 0 if $l \neq m$ and 1 if $l = m$. A good set of orthogonal data tapers with good leakage properties is given by the discrete prolate spheroidal sequences (dpss), which are a function of a prescribed mainlobe width $W$; $L$ is chosen to be less than $2NW$, where $N$ is the number of samples and $W$ is the mainlobe width expressed in units of normalized frequency [16]. Fig. 2 shows example plots of the first four dpss, $h_m(n)$, $m = 1, 2, 3, 4$, for $NW = 4$.

Note that in (14) the eigenspectra $\hat{S}_m(\omega)$ were uniformly weighted. Thomson [16] also proposed an adaptive method in which the eigenspectra are weighted as follows:

$$\hat{S}^{mt}(\omega) = \frac{\sum_{m=0}^{L-1} b_m^2(\omega)\,\lambda_m\,\hat{S}_m(\omega)}{\sum_{m=0}^{L-1} b_m^2(\omega)\,\lambda_m} \quad (15)$$

where $\lambda_m$ is the eigenvalue corresponding to taper $h_m(n)$ (note that the tapers $h_m(n)$ are eigenvectors of a symmetric Toeplitz matrix given in [16]), and

$$b_m(\omega) = \frac{S(\omega)}{\lambda_m S(\omega) + (1-\lambda_m)\,\sigma^2} \quad (16)$$

where $\sigma^2$ is the signal energy and $S(\omega)$ is the true spectrum.

Fig. 2. First four discrete prolate spheroidal sequences estimated assuming NW = 4 with N = 160.
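The estimator in (14)–(16) can be sketched as follows, using SciPy's dpss tapers; the fixed iteration count and the function name are our assumptions rather than details from the paper.

```python
import numpy as np
from scipy.signal.windows import dpss

def multitaper_spectrum(x, L=4, NW=4, nfft=None, n_iter=2):
    """Adaptive multitaper spectrum, eqs. (14)-(16): computes L
    eigenspectra with dpss tapers, then iteratively re-weights
    them with Thomson's adaptive weights b_m(w)."""
    N = len(x)
    nfft = nfft or N
    tapers, eigvals = dpss(N, NW, Kmax=L, return_ratios=True)
    # eigenspectra S_m(w), eq. (14)
    S_m = np.abs(np.fft.rfft(tapers * x, n=nfft, axis=1)) ** 2
    S = S_m[0]                       # initial single-taper estimate
    sigma2 = np.mean(x ** 2)         # signal energy
    for _ in range(n_iter):
        b = S / (eigvals[:, None] * S
                 + (1.0 - eigvals[:, None]) * sigma2)   # eq. (16)
        w = b ** 2 * eigvals[:, None]
        S = np.sum(w * S_m, axis=0) / np.sum(w, axis=0)  # eq. (15)
    return S
```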

The term $b_m(\omega)$ depends on the true spectrum $S(\omega)$; in practice, however, an estimate of $S(\omega)$ is used to compute $b_m(\omega)$ iteratively. That is, (14) is first used with $L$ set equal to 1 to obtain an initial estimate of $S(\omega)$, which is then substituted into (16) to obtain $b_m(\omega)$. The estimated weights are substituted into (15) to obtain the multitaper spectral estimate, with $L$ now set to a number smaller than $2NW$. The new multitaper spectral estimate $\hat{S}^{mt}(\omega)$ is then substituted back into (16) to get new weights, and so forth. A more detailed description of multitaper algorithms can be found in [18, pp. 361–371].

Our proposed approach incorporated the above adaptive multitaper method with $NW = 4$. Fig. 3 shows examples of the spectrum of a speech segment computed using a Hamming-window spectral estimator and the multitaper spectral estimator. Note that the multitaper spectrum has significantly less variance than the traditional estimate obtained by Hamming windowing the data.

C. Implementation Details

The proposed perceptually motivated frequency domain method can be formulated using the following six steps. For each speech frame:

Step 1) Compute the multitaper power spectrum $P_y(\omega_k)$ of the noisy speech using (15), and estimate the multitaper power spectrum of the clean speech signal by $\hat{P}_x(\omega_k) = P_y(\omega_k) - P_n(\omega_k)$, where $P_n(\omega_k)$ is the multitaper power spectrum of the noise; $P_n(\omega_k)$ can be obtained using noise samples collected during speech-absent frames. Spectral-floor any negative elements of $\hat{P}_x(\omega_k)$ as follows:

$$\hat{P}_x(\omega_k) = \begin{cases} \hat{P}_x(\omega_k), & \text{if } \hat{P}_x(\omega_k) > \beta\, P_n(\omega_k) \\ \beta\, P_n(\omega_k), & \text{otherwise} \end{cases}$$

where $\beta$ is the spectral floor.

Step 2) Compute the value of $\mu$ used in (13) according to

$$\mu = \mu_0 - \frac{\mathrm{SNR}}{s}, \qquad 1 \le \mu \le \mu_{\max} \quad (17)$$

where $\mu_{\max}$ is the maximum allowable value of $\mu$, $\mu_0$ and $s$ are constants, and the SNR (in dB) is computed as

$$\mathrm{SNR} = 10\log_{10}\frac{\sum_k \hat{P}_x(\omega_k)}{\sum_k P_n(\omega_k)}.$$



Fig. 3. Comparison of the power spectrum of the vowel /iy/ estimated by the multitaper method (top panel) and the direct spectrum estimator with a Hamming window (bottom panel).

As discussed in [19], the value of $\mu$ controls the tradeoff between residual noise and speech distortion in the frequency domain. A larger value of $\mu$ will yield more speech distortion with less residual noise, and vice versa. To better control this tradeoff, we propose the above method for selecting the value of $\mu$ according to the estimated SNR [1]. We set $\mu_{\max}$ to 5 in this paper according to experimental results.

Step 3) Compute the perceptual filter $W(z)$ in (1) using a $p$th order LPC analysis, and form the perceptual weighting matrix $\mathbf{W}_f$ as per (4). A 12th order LPC analysis was used.

Step 4) Estimate the gain function $g_k$ using (13).

Step 5) Obtain the enhanced frequency spectrum by $\hat{X}(\omega_k) = g_k\, Y(\omega_k)$, where $Y(\omega_k)$ is the FFT spectrum of the noisy speech. Note that the noisy speech was Hamming-windowed for the estimation of $Y(\omega_k)$.

Step 6) Apply the inverse FFT to $\hat{X}(\omega_k)$ to obtain the enhanced speech signal.

The estimator was applied to 32-ms duration frames of the noisy signal with 50% overlap between frames. The enhanced speech signal was combined using the overlap-and-add approach [20].

There are at least two different ways to estimate the perceptual filter $W(z)$. One is to first enhance the signal (say, with spectral subtraction), and then estimate $W(z)$ using the pre-enhanced signal. The major drawback of this approach, however, is that any distortion introduced in enhancing the signal in the first stage might lead to inaccurate estimates of $W(z)$ in the second stage. The second approach is to obtain $W(z)$ directly from the noisy speech. We chose the latter approach because we found that, at least for 5 dB speech-shaped noise, the estimated formant frequencies of the corrupted speech were close to those of the clean speech [21]. The values of $\gamma_1$ and $\gamma_2$ in $W(z)$ were set to 1 and 0.9, respectively.
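Steps 1)–6) amount to the loop sketched below, reusing the hypothetical helpers from the earlier sketches (multitaper_spectrum, perceptual_weighting, perceptual_gain); the spectral-floor constant and the mu0, s values in (17) are illustrative assumptions, not values given in the paper.

```python
import numpy as np

def enhance(noisy, P_n, frame_len=256, mu0=4.2, s=6.25, mu_max=5.0):
    """Frequency-domain method, Steps 1)-6): 32-ms frames at 8 kHz,
    50% overlap, overlap-and-add synthesis [20]. P_n is the noise
    power spectrum (length frame_len//2 + 1) from speech-absent frames."""
    hop = frame_len // 2
    win = np.hamming(frame_len)
    out = np.zeros(len(noisy))
    for i in range(0, len(noisy) - frame_len + 1, hop):
        frame = noisy[i:i + frame_len]
        P_y = multitaper_spectrum(frame, nfft=frame_len)          # Step 1
        P_x = np.maximum(P_y - P_n, 0.002 * P_n)                  # assumed spectral floor
        snr = 10.0 * np.log10(P_x.sum() / P_n.sum())
        mu = float(np.clip(mu0 - snr / s, 1.0, mu_max))           # Step 2, eq. (17)
        V, _ = perceptual_weighting(frame, p=12, nfft=frame_len)  # Step 3
        g = perceptual_gain(P_y, P_n, V[:frame_len // 2 + 1], mu) # Step 4, eq. (13)
        Y = np.fft.rfft(win * frame)                              # Step 5
        out[i:i + frame_len] += np.fft.irfft(g * Y, n=frame_len)  # Step 6, overlap-add
    return out
```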

IV. PERCEPTUALLY MOTIVATED SUBSPACE METHOD FOR SPEECH ENHANCEMENT

In the previous section, the optimal estimator was derived in the frequency domain. In this section, we derive the optimal estimator in the time domain using the signal subspace framework.

A. Principles of Perceptually Based Subspace Approach

The model used in the subspace approach is the same as the one used in the frequency domain method, the main difference being that the constraints are now applied in the time domain rather than in the frequency domain. The time-domain subspace approach is briefly outlined below (a more thorough treatment can be found in [19], [22]).

Let $\hat{\mathbf{x}} = \mathbf{H}\mathbf{y}$ be a linear estimator of the clean speech $\mathbf{x}$, where $\mathbf{H}$ is an $N \times N$ matrix. The error signal obtained by this estimation is given by

$$\boldsymbol{\epsilon} = \hat{\mathbf{x}} - \mathbf{x} = (\mathbf{H} - \mathbf{I})\mathbf{x} + \mathbf{H}\mathbf{n} \triangleq \boldsymbol{\epsilon}_x + \boldsymbol{\epsilon}_n$$

where $\boldsymbol{\epsilon}_x = (\mathbf{H} - \mathbf{I})\mathbf{x}$ represents the speech distortion and $\boldsymbol{\epsilon}_n = \mathbf{H}\mathbf{n}$ represents the residual noise [19]. Defining the energy of the speech distortion and the energy of the residual noise as

$$\bar{\epsilon}_x^2 = E[\boldsymbol{\epsilon}_x^T \boldsymbol{\epsilon}_x], \qquad \bar{\epsilon}_n^2 = E[\boldsymbol{\epsilon}_n^T \boldsymbol{\epsilon}_n]$$

the optimum linear estimator can be obtained by solving the following time domain constrained optimization problem:

$$\min_{\mathbf{H}} \bar{\epsilon}_x^2 \quad \text{subject to} \quad \bar{\epsilon}_n^2 \le \alpha \quad (18)$$

where $\alpha$ is a positive constant. Such an approach was proposed in [19] for white noise and later extended to colored noise in [22]–[25]. An attempt was made in [19] to extend this approach to the KLT domain using spectral domain constraints; however, no explicit method was proposed in [19] to directly shape the spectrum of the residual noise in the frequency domain.

As in the proposed perceptually motivated frequency domain method, we propose the use of a perceptually weighted residual noise in place of the constraint in (18). The perceptually weighted residual noise $\boldsymbol{\epsilon}_n^w$ can be obtained as follows:

$$\boldsymbol{\epsilon}_n^w = \mathbf{W}\, \boldsymbol{\epsilon}_n$$

where $\mathbf{W}$ is defined in (2). Defining the energy of the perceptually weighted residual noise as

$$\bar{\epsilon}_{n,w}^2 = E[(\boldsymbol{\epsilon}_n^w)^T \boldsymbol{\epsilon}_n^w]$$


the proposed optimal linear estimator can be obtained by solving the following time-domain constrained optimization problem:

$$\min_{\mathbf{H}} \bar{\epsilon}_x^2 \quad \text{subject to} \quad \bar{\epsilon}_{n,w}^2 \le \alpha \quad (19)$$

It can be shown, using the method of Lagrange multipliers, that the solution to (19) satisfies the following equation:

$$\mathbf{H}\mathbf{R}_x + \mu\, \mathbf{W}^T \mathbf{W}\, \mathbf{H}\, \mathbf{R}_n = \mathbf{R}_x \quad (20)$$

where $\mu$ is the Lagrange multiplier and $\mathbf{R}_x$ and $\mathbf{R}_n$ are the covariance matrices of the clean speech and the noise, respectively. The techniques proposed in [22], [25] can be used to further simplify the above equation. In [25], a nonunitary matrix $\mathbf{V}$ was constructed that simultaneously diagonalizes the two matrices $\mathbf{R}_x$ and $\mathbf{R}_n$ as follows [26]:

$$\mathbf{V}^T \mathbf{R}_x \mathbf{V} = \boldsymbol{\Lambda}, \qquad \mathbf{V}^T \mathbf{R}_n \mathbf{V} = \mathbf{I} \quad (21)$$

where $\boldsymbol{\Lambda}$ and $\mathbf{V}$ are the eigenvalue matrix and eigenvector matrix, respectively, of $\boldsymbol{\Sigma} = \mathbf{R}_n^{-1}\mathbf{R}_x$, i.e., $\boldsymbol{\Sigma}\mathbf{V} = \mathbf{V}\boldsymbol{\Lambda}$. It can be shown that $\boldsymbol{\Lambda}$ is a real matrix [27], assuming that $\mathbf{R}_n$ is positive definite. After substituting (21) into (20) and letting $\tilde{\mathbf{H}} = \mathbf{H}\mathbf{V}^{-T}$, we get the following:

$$\mu\, \mathbf{W}^T \mathbf{W}\, \tilde{\mathbf{H}} + \tilde{\mathbf{H}}\, \boldsymbol{\Lambda} = \mathbf{V}^{-T} \boldsymbol{\Lambda} \quad (22)$$

The above equation has the form of the well-known Lyapunov equation encountered frequently in control theory (more precisely, a Sylvester equation of the form $AX + XB = C$). It can be solved numerically using the algorithm proposed in [28]; explicit solutions can be found in [29, p. 414]. After solving (22) for $\tilde{\mathbf{H}}$, the enhanced speech signal can be obtained by $\hat{\mathbf{x}} = \tilde{\mathbf{H}}\mathbf{V}^T \mathbf{y}$.
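Since (22) is an AX + XB = C problem, the Bartels–Stewart algorithm of [28] applies directly; a sketch using SciPy's generalized eigensolver for (21) and Sylvester solver for (22) follows, with all function and variable names our own.

```python
import numpy as np
from scipy.linalg import eigh, solve_sylvester

def subspace_filter(R_x, R_n, W, mu, K=None):
    """Solves eq. (22) and returns H = H~ V^T (see eqs. (20)-(22)).
    eigh(R_x, R_n) yields V with V^T R_n V = I and V^T R_x V = diag(lam),
    i.e., the joint diagonalization of eq. (21)."""
    lam, V = eigh(R_x, R_n)        # generalized eigenproblem, ascending order
    if K is not None:              # Step 5: zero the N-K noise-subspace eigenvalues
        lam[:len(lam) - K] = 0.0
    Lam = np.diag(lam)
    A = mu * (W.T @ W)             # left coefficient of eq. (22)
    C = np.linalg.inv(V).T @ Lam   # right-hand side V^{-T} Lam
    H_tilde = solve_sylvester(A, Lam, C)
    return H_tilde @ V.T           # enhanced frame: x_hat = H @ y
```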

B. Covariance Matrix Estimation

From (20), it can be seen that the linear estimator depends on the accurate estimation of the covariance matrices $\mathbf{R}_x$ and $\mathbf{R}_n$. Informal listening tests showed that the quality of the enhanced speech is greatly influenced by the estimation of the covariance matrix. In this paper, we use a multiwindow estimator of the covariance matrix [14]. Compared with the covariance matrix estimators commonly used in [19], [24], [25], the multiwindow estimator greatly reduces the estimation variance, and it can be used in place of the commonly used covariance estimators to improve performance.

To obtain the $K \times K$ multiwindow covariance matrix estimator [14] from the incoming $M$-dimensional data vector $\mathbf{y}$ ($M > K$), we first form a Hankel matrix $\mathbf{Y}$ from $\mathbf{y}$. The matrix $\mathbf{Y}$ has the form

$$\mathbf{Y} = \begin{bmatrix} y(1) & y(2) & \cdots & y(M-K+1) \\ y(2) & y(3) & \cdots & y(M-K+2) \\ \vdots & \vdots & \ddots & \vdots \\ y(K) & y(K+1) & \cdots & y(M) \end{bmatrix}$$

where $y(i)$ is the $i$th element of $\mathbf{y}$. Let $\mathbf{Y} = [\bar{\mathbf{y}}_1, \bar{\mathbf{y}}_2, \ldots, \bar{\mathbf{y}}_{M-K+1}]$, where $\bar{\mathbf{y}}_i$ is the $i$th column of $\mathbf{Y}$. Then, for each taper $m$, the estimate $\hat{\mathbf{R}}_m$ is computed as

$$\hat{\mathbf{R}}_m = (\mathbf{Y}\mathbf{D}_m)(\mathbf{Y}\mathbf{D}_m)^T$$

where $\mathbf{D}_m$ is a diagonal matrix having in its diagonal the $m$th discrete prolate spheroidal sequence [16], and $L$ is the number of sequences used for covariance estimation ($L$ was set to 4 in this paper). The final estimate $\hat{\mathbf{R}}$ was obtained by averaging the $L$ estimates:

$$\hat{\mathbf{R}} = \frac{1}{L} \sum_{m=1}^{L} \hat{\mathbf{R}}_m.$$
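A sketch of the multiwindow estimator as reconstructed above (the exact weighting and normalization used in [14] may differ); SciPy supplies both the Hankel construction and the dpss tapers.

```python
import numpy as np
from scipy.linalg import hankel
from scipy.signal.windows import dpss

def multiwindow_covariance(y, K, L=4, NW=4):
    """K x K multiwindow covariance estimate from data vector y,
    averaging L dpss-tapered outer-product estimates (Sec. IV-B)."""
    M = len(y)
    Y = hankel(y[:K], y[K - 1:])          # K x (M-K+1) Hankel matrix
    tapers = dpss(M - K + 1, NW, Kmax=L)  # one dpss per column weighting
    R = np.zeros((K, K))
    for h in tapers:
        Yw = Y * h                        # Y D_m: taper the columns
        R += Yw @ Yw.T                    # R_m = (Y D_m)(Y D_m)^T
    return R / L                          # average of the L estimates
```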

C. Implementation Details

The proposed approach can be formulated using the following six steps. For each speech frame:

Step 1) Compute the covariance matrix $\hat{\mathbf{R}}_y$ of the noisy signal, and estimate the covariance matrix of the clean signal by $\hat{\mathbf{R}}_x = \hat{\mathbf{R}}_y - \hat{\mathbf{R}}_n$, where the noise covariance matrix $\hat{\mathbf{R}}_n$ can be computed using noise samples collected during speech-absent frames. The matrix $\boldsymbol{\Sigma}$ can then be obtained as $\boldsymbol{\Sigma} = \hat{\mathbf{R}}_n^{-1}\hat{\mathbf{R}}_x$.

Step 2) Perform the eigen-decomposition of $\boldsymbol{\Sigma}$: $\boldsymbol{\Sigma}\mathbf{V} = \mathbf{V}\boldsymbol{\Lambda}$.

Step 3) Assuming that the eigenvalues of $\boldsymbol{\Sigma}$ are ordered as $\lambda_1 \ge \lambda_2 \ge \cdots \ge \lambda_N$, estimate the dimension $K$ of the speech signal subspace; that is, $K$ is equal to the number of strictly positive eigenvalues of $\boldsymbol{\Sigma}$.

Step 4) Compute the value of $\mu$ as in (17). Here the SNR is computed from the eigenvalues of $\boldsymbol{\Sigma}$ as $\mathrm{SNR} = 10\log_{10}(\mathrm{tr}(\boldsymbol{\Lambda})/N)$; note that this SNR estimation is based on the whitened speech and noise signals. $\mu_{\max}$ is set to 5 here according to experimental results.

Step 5) Compute the perceptual filter $W(z)$ in (1) using a $p$th order LPC analysis, and form the perceptual weighting matrix $\mathbf{W}$ as per (2). Solve the Lyapunov equation in (22) for $\tilde{\mathbf{H}}$, where the last $N - K$ diagonal values of $\boldsymbol{\Lambda}$ are set to zero.

Step 6) Estimate the enhanced speech signal by $\hat{\mathbf{x}} = \tilde{\mathbf{H}}\mathbf{V}^T \mathbf{y}$, where $\tilde{\mathbf{H}}$ is the solution of (22).

The estimator was applied to rectangularly windowed frames of the noisy signal which overlapped each other by 50%. Two extra future frames were used to estimate the covariance matrices, and the frame length $N$ was set to 40 samples (assuming an 8 kHz sampling frequency). The enhanced speech signal was Hamming-windowed and combined using the overlap-and-add approach [20]. As in the frequency domain method discussed in the previous section, the perceptual weighting filter $W(z)$ was obtained directly from the noisy speech. The values of $\gamma_1$ and $\gamma_2$ were set to 1 and 0.9, respectively.


Fig. 4. Spectrum of the speech-shaped noise.



Fig. 5. Comparative performance for speech-shaped noise at 5 dB, 10 dB, 15 dB, and 20 dB in terms of mean MBSD measure, segmental SNR measure (s-SNR) and overall SNR measure, for 60 TIMIT sentences produced by 30 male speakers and 30 female speakers.

V. EXPERIMENTAL RESULTS AND EVALUATION

For evaluation purposes, we used 60 sentences from the TIMIT database, downsampled to 8 kHz. The sentences were produced by 30 male and 30 female speakers. For colored noise, we used speech-shaped noise, Volvo car interior noise, and F16 cockpit noise, added to the clean speech files at 5 dB, 10 dB, 15 dB, and 20 dB SNR. The speech-shaped noise, included in the HINT database [30], was computed by filtering white noise through an FIR filter whose frequency response matched the long-term spectrum of the sentences in the HINT database. The spectrum of the speech-shaped noise is shown in Fig. 4.

The modified bark spectral distortion (MBSD) measure [31] and the segmental and overall (global) SNR measures [20] were used for evaluation of the proposed perceptually weighted frequency domain approach (PW-FREQ) and the perceptually based subspace approach (PW-SS). The MBSD measure is an improved version of the bark spectral distortion (BSD) [32], and was found to be highly correlated with speech quality [31]. For comparative purposes, we also implemented and evaluated the time domain constraint (TDC) approach in [25] and the spectral subtraction (SPECSUB) method of Berouti et al. [1]. The TDC approach in [25] generalized the TDC approach proposed for white noise in [19] to the colored noise case; however, no shaping of the residual noise is done in [25].

Figs. 5, 6, and 7 give the mean results in terms of the three objective measures for 60 TIMIT sentences corrupted by speech-shaped noise, F16 cockpit noise, and Volvo car interior noise, respectively. The results are given separately for each objective measure.

Fig. 6. Comparative performance for F16 cockpit noise at 5 dB, 10 dB, 15 dB and 20 dB in terms of mean MBSD measure, segmental SNR measure (s-SNR) and overall SNR measure, for 60 TIMIT sentences produced by 30 male speakers and 30 female speakers.

As can be seen from Figs. 5 and 6, according to the MBSD measure and the segmental SNR measure, our proposed methods, PW-FREQ and PW-SS, outperformed the spectral subtraction method in [1] and the TDC approach in [25], respectively. For Volvo car interior noise, PW-FREQ performed the best in terms of segmental SNR and overall SNR, although all the methods yielded comparable MBSD values. Informal listening tests showed that the best speech quality was obtained by the PW-FREQ method, with no noticeable musical noise, consistent with the lowest values of the MBSD measure obtained by this method.

Listening tests also showed that the speech quality of the proposed methods was sensitive to the choice of the maximum allowable value of $\mu$, i.e., $\mu_{\max}$, in (17). Good quality was achieved using $\mu_{\max} = 5$; however, we found that better quality can be achieved using larger values of $\mu_{\max}$.
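For reference, a common implementation of the segmental SNR measure used in these comparisons is sketched below; the frame length and clamping range are conventional choices, not parameters reported in the paper.

```python
import numpy as np

def segmental_snr(clean, enhanced, frame_len=256, lo=-10.0, hi=35.0):
    """Mean per-frame SNR (dB), with each frame's SNR clamped to
    [lo, hi] dB, as is customary for the segmental SNR measure."""
    snrs = []
    for i in range(0, len(clean) - frame_len + 1, frame_len):
        s = clean[i:i + frame_len]
        e = s - enhanced[i:i + frame_len]      # residual error
        snr = 10 * np.log10(np.sum(s ** 2) / (np.sum(e ** 2) + 1e-12))
        snrs.append(np.clip(snr, lo, hi))
    return float(np.mean(snrs))
```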


Fig. 7. Comparative performance for Volvo car noise at 5 dB, 10 dB, 15 dB and 20 dB in terms of mean MBSD measure, segmental SNR measure (s-SNR) and overall SNR measure, for 60 TIMIT sentences produced by 30 male speakers and 30 female speakers.

The implementation of the proposed algorithms involves two critical components: the perceptual weighting technique, which exploits the masking properties of the human auditory system, and the multitaper/multiwindow method, which reduces the estimation variance of the power spectrum and covariance matrix estimators. Since a reduced estimation variance improves enhancement performance, the multitaper/multiwindow estimator, as expected, contributed a great deal to the improvement of the algorithm performance in all SNR conditions. On the other hand, our informal listening tests indicated that the benefit of the perceptual weighting technique was more evident in the high SNR (15 dB and above) conditions, and was not as obvious in the low SNR conditions. We suspect this is partly due to the way we obtained the perceptual weighting filters, and also to the method we chose to estimate the value of $\mu$.

VI. SUMMARY AND CONCLUSIONS

A new method of incorporating perceptual weighting into a frequency domain method and a subspace-based method was proposed for better noise shaping. Unlike other methods, the proposed approach does not depend on the estimation of masking thresholds, which are difficult to estimate accurately in noise without preprocessing the corrupted signal. The use of a multitaper power spectrum estimator was proposed for the frequency domain method, and the use of a multiwindow covariance estimator was proposed for the subspace method. Objective measures and informal listening tests showed significant improvements over other speech enhancement methods.

ACKNOWLEDGMENT

The authors would like to thank the anonymous reviewers for their valuable comments.

REFERENCES

[1] M. Berouti, R. Schwartz, and J. Makhoul, “Enhancement of speech corrupted by acoustic noise,” in Proc. IEEE Int. Conf. Acoust., Speech, Signal Processing, 1979, pp. 208–211.

[2] D. E. Tsoukalas, J. N. Mourjopoulos, and G. Kokkinakis, “Speech enhancement based on audible noise suppression,” IEEE Trans. Speech Audio Processing, vol. 5, pp. 479–514, Nov. 1997. [3] N. Virag, “Single channel speech enhancement based on masking properties of the human auditory system,” IEEE Trans. Speech Audio Processing, vol. 7, pp. 126–137, Mar. 1999. [4] F. Jabloun and B. Champagne, “A perceptual signal subspace approach for speech enhancement in colored noise,” in Proc. IEEE Int. Conf. Acoust., Speech, Signal Processing, vol. 1, 2002, pp. 569–572. [5] J. D. Johnston, “Transform coding of audio signals using perceptual noise criteria,” IEEE J. Select. Areas Commun., vol. 6, pp. 314–323, Feb. 1988. [6] K. Brandenburg and G. Stoll, “ISO-MPEG-1 audio: a generic standard for coding of high quality digital audio,” J. Audio Eng. Soc., vol. 42, pp. 780–792, 1994. [7] T. Painter and A. Spanias, “Perceptual coding of digital audio,” Proc. IEEE, vol. 88, pp. 451–515, 2000. [8] M. Klein and P. Kabal, “Signal subspace speech enhancement with perceptual post-filtering,” in Proc. IEEE Int. Conf. Acoust., Speech, Signal Processing, 2002, pp. 537–540. [9] A. A. Azirani, R. J. L. Bouquin, and G. Faucon, “Optimizing speech enhancement by exploiting masking properties of the human ears,” in Proc. IEEE Int. Conf. Acoust., Speech, Signal Processing, 1995, pp. 800–803. [10] S. Gustafsson, P. Jax, and P. Vary, “A novel psychoacoustically motivated audio enhancement algorithm preserving background noise characteristics,” in Proc. IEEE Int. Conf. Acoust., Speech, Signal Processing, 1998, pp. 397–400. [11] B. S. Atal and M. R. Schroeder, “Predictive coding of speech and subjective error criteria,” IEEE Trans. Acoust., Speech, Signal Processing, vol. ASSP-27, pp. 247–254, 1979. [12] P. Kroon and B. S. Atal, “Predictive coding of speech using analysis-by-synthesis techniques,” in Advances in Speech Signal Processing, S. Furui and M. Sondhi, Eds. New York: Marcel Dekker, 1992, pp. 141–164. [13] M. R. Schroeder, B. S. Atal, and J. L. Hall, “Optimizing digital speech coders by exploiting masking properties of the human ear,” J. Acoust. Soc. Amer., vol. 66, pp. 1647–1652, 1979. [14] L. T. McWhorter and L. L. Scharf, “Multiwindow estimators of correlation,” IEEE Trans. Signal Processing, vol. 46, pp. 440–448, Feb. 1998. [15] R. Gray, “On the asymptotic eigenvalue distribution of toeplitz matrices,” IEEE Trans. Inform. Theory, vol. IT-18, pp. 725–730, 1972. [16] D. J. Thomson, “Spectrum estimation and harmonic analysis,” Proc. IEEE, vol. 70, pp. 1055–1096, Sept. 1982. [17] S. M. Kay, Modern Spectral Estimation. Englewood Cliffs, NJ: Prentice-Hall, 1988. [18] D. B. Percival and A. T. Walden, Spectral Analysis for Physical Applications: Multitaper and Conventional Univariate Techniques. Cambridge, U.K.: Cambridge Univ. Press, 1993. [19] Y. Ephraim and H. L. Van Trees, “A signal subspace approach for speech enhancement,” IEEE Trans. Speech Audio Processing, vol. 3, pp. 251–266, 1995. [20] J. R. Deller, J. Hansen, and J. G. Proakis, Discrete-Time Processing of Speech Signals. New York: IEEE Press, 2000. [21] G. Parikh, “The Effect of Noise on the Spectrum of Speech,” M.S. thesis, Univ. of Texas at Dallas, 2002. [22] Y. Hu and P. C. Loizou, “A subspace approach for enhancing speech corrupted by colored noise,” IEEE Signal Processing Lett., pp. 204–206, July 2002. [23] A. Rezayee and S. Gazor, “An adaptive KLT approach for speech enhancement,” IEEE Trans. Speech Audio Processing, vol. 9, pp. 87–95, Feb. 2001. [24] U. 
Mittal and N. Phamdo, “Signal/noise KLT based approach for enhancing speech degraded by colored noise,” IEEE Trans. Speech Audio Processing, vol. 8, pp. 159–167, Mar. 2000. [25] Y. Hu and P. C. Loizou, “A subspace approach for enhancing speech corrupted by colored noise,” in Proc. IEEE Int. Conf. Acoust., Speech, Signal Processing, vol. 1, Orlando, FL, May 2002, pp. 573–576. [26] S. R. Searle, Matrix Algebra Useful for Statistics. New York: Wiley, 1982. [27] G. Strang, Linear Algebra and Its Applications, 3rd ed. New York: Harcourt Brace Jovanovich, 1988. [28] R. H. Bartels and G. W. Stewart, “Solution of the matrix equation AX + XB = C,” Commun. ACM, vol. 15, no. 9, pp. 820–822, 1972. [29] P. Lancaster and M. Tismenetsky, The Theory of Matrices. New York: Academic, 1985. [30] M. Nilsson, S. Soli, and J. Sullivan, “Development of the hearing in noise test for the measurement of speech reception thresholds in quiet and in noise,” J. Acoust. Soc. Amer., vol. 95, pp. 1085–1099, 1994.


[31] W. Yang, M. Benbouchta, and R. Yantorno, “Performance of the modified bark spectral distortion as an objective speech quality measure,” in Proc. IEEE Int. Conf. Acoust., Speech, Signal Processing, 1998, pp. 541–544. [32] S. Wang, A. Sekey, and A. Gersho, “An objective measure for predicting subjective quality of speech coders,” IEEE J. Select. Areas Commun., vol. 10, pp. 819–829, 1992.

Yi Hu (S’01) received the B.S. and M.S. degrees in electrical engineering from the University of Science and Technology of China (USTC) in 1997 and 2000, respectively. He is currently pursuing the Ph.D. degree in electrical engineering at the University of Texas at Dallas, Richardson. His research interests are in the general area of signal processing, ASIC/FPGA design of DSP algorithms, and VLSI CAD algorithms.


Philipos C. Loizou (S’90–M’91) received the B.S., M.S., and Ph.D. degrees, all in electrical engineering, from Arizona State University, Tempe, in 1989, 1991 and 1995, respectively. From 1995 to 1996, he was a Postdoctoral Fellow in the Department of Speech and Hearing Science at Arizona State University, working on research related to cochlear implants. He was an Assistant Professor at the University of Arkansas at Little Rock from 1996 to 1999. He is now an Associate Professor in the Department of Electrical Engineering at the University of Texas at Dallas. His research interests are in the areas of signal processing, speech processing, and cochlear implants. Dr. Loizou is a member of the Industrial Technology Track Technical Committee of the IEEE Signal Processing Society, and was also an Associate Editor of the IEEE TRANSACTIONS ON SPEECH AND AUDIO PROCESSING (1999–2002).