IEEE TRANSACTIONS ON AUDIO, SPEECH, AND LANGUAGE PROCESSING, VOL. 16, NO. 4, MAY 2008
Environment-Optimized Speech Enhancement
Tim Fingscheidt, Senior Member, IEEE, Suhadi Suhadi, and Sorel Stan, Senior Member, IEEE
Abstract—In this paper, we present a training-based approach to speech enhancement that exploits the spectral statistical characteristics of clean speech and noise in a specific environment. In contrast to many state-of-the-art approaches, we do not model the probability density function (pdf) of the clean speech and the noise spectra. Instead, subband-individual weighting rules for noisy speech spectral amplitudes are separately trained for speech presence and speech absence from noise recordings in the environment of interest. Weighting rules for a variety of cost functions are given; they are parameterized and stored as a table look-up. The speech enhancement system simply works by computing the weighting rules from the table look-up indexed by the a posteriori signal-to-noise ratio (SNR) and the a priori SNR for each subband computed on a Bark scale. Optimized for an automotive environment, our approach outperforms known—environment-independent—speech enhancement techniques, namely the a priori SNR-driven Wiener filter and the minimum mean square error (MMSE) log-spectral amplitude estimator, both in terms of speech distortion and noise attenuation. Index Terms—Noise reduction, speech enhancement.
I. INTRODUCTION
IN THE past three decades, many approaches to speech enhancement have been proposed. Their aim often is to find a reasonable compromise between a sufficiently strong noise attenuation, a natural-sounding residual noise (i.e., low noise distortion), and an almost untouched speech signal component (i.e., low speech signal distortion). Many approaches propose a new spectral weighting rule. This can principally be done in two ways. First, a distinct cost function can be employed, e.g., the minimum mean square error (MMSE) of complex spectral values, spectral amplitudes, log-spectral amplitudes, or perceptually motivated variants of these [1]–[4], the (joint) maximum a posteriori (MAP) criterion [5], [6], etc. Second, a different probability density function (pdf) for the clean speech spectrum and/or the noise spectrum can be modeled from thorough observation of real-world signals. The pdf of the real and imaginary parts of the clean speech spectrum is usually modeled as a Gaussian pdf [2]–[4], but a gamma pdf [7] or a Laplacian pdf [8], [9] matches even better. Alternatively, a supergaussian pdf [5], [6] can be applied to model the pdf of the clean speech spectral amplitude. Instead of a single pdf, a certain number of pdfs of the clean speech spectrum, e.g., a Gaussian
Manuscript received February 28, 2007; revised January 11, 2008. The associate editor coordinating the review of this manuscript and approving it for publication was Dr. Rainer Martin. T. Fingscheidt and S. Suhadi are with the Institute for Communications Technology, Braunschweig Technical University, D-38106 Braunschweig, Germany (e-mail:
[email protected];
[email protected]). S. Stan was with BenQ Mobile, Munich, Germany. He is now with Bardehle Pagenberg, D-81679 Munich, Germany (e-mail:
[email protected]). Digital Object Identifier 10.1109/TASL.2008.920062
mixture model [10], a complex Gaussian mixture model [11], or a mixture of autoregressive Gaussian models [12] can be estimated through a training procedure to derive the corresponding MMSE spectral amplitude weighting rules.

The choice of the cost function (leading to a specific estimator) and the chosen clean speech spectrum pdf turn out to be crucial points. A mismatch between signals and their modeled spectrum pdf, as well as the choice of an inappropriate cost function, usually leads to an increased level of residual noise and/or speech distortion. Porter and Boll [13] tackled this problem by assuming a Gaussian pdf for the noise spectrum, and derived generalized estimators as a function of some unknown clean speech spectral amplitude pdf. Then, they used clean speech training data instead of an analytically formulated clean speech spectral amplitude pdf to derive the estimator and implemented it as a table look-up.

As a step further, we have proposed a data-driven approach using clean speech and noise training data to train the table look-up of a so-called ideal gain averaging (IGA) spectral amplitude weighting rule [14]. Frequency-individual spectral amplitude weighting rules are trained as a function of a given a posteriori SNR, as well as a priori SNR, which is computed through a modified decision-directed approach. As no a priori knowledge of the clean speech and noise signals (i.e., their spectrum pdf) is provided, the weighting rules need to be trained separately for speech presence and speech absence to effectively exploit the statistical characteristics of clean speech and noise signals. A practical solution with R-order polynomial least-squares (R-PLS) parameterized Bark scale spectral amplitude weighting rules [15] reduces the memory requirements of the table look-up for a DSP implementation.

A similar approach has recently been proposed by Erkelens et al. [16], [17], where a spectral amplitude weighting rule is trained only under stationary noise of known spectral variance. While it is also trained as a function of a given a posteriori SNR and a priori SNR, the frequency-independent weighting rule for the whole utterance (i.e., for both speech presence and speech absence) is generated with a different a priori SNR computation. Wherever there is a lack of training data, Erkelens employs the state-of-the-art weighting rules of the respective cost function under the Gaussian speech model. Finally, all weighting rules are implemented simply as a table look-up without any gain parameterization.

In this paper, we present several new spectral amplitude estimators optimized for a specific acoustic environment. No assumption on the spectrum pdf is made at all. All the new spectral weighting rules result in a table look-up indexed by the a priori and the a posteriori SNR values. The known IGA approach is improved by a modified training formula. Furthermore, we derive its error criterion and discuss an interesting relation to an algorithm from Loizou [4]. Compared to our earlier works, we improve the assembly of data for the training of the Bark scale
1558-7916/$25.00 © 2008 IEEE
weighting rules, as well as the weighting rule parameterization using a weighted R-PLS (R-WPLS) model [18]. In an automotive environment, we investigate how specific the training data should be (car-specific versus general automotive environment), and we also evaluate the performance of the proposed algorithms in an environment that is different from the training environment.

The performance of our new approach will be compared to the well-known Wiener filter (i.e., MMSE error criterion), employing decision-directed a priori SNR estimation [1], [2], and to Ephraim and Malah's MMSE-LSA estimator [3], both briefly recapitulated in Section II. An outline of our approach then follows in Section III. In Section IV, the performance of all algorithms in terms of segmental speech-to-speech distortion ratio (SSDR) and noise attenuation (NA) is evaluated along with a subjective test.
II. SOME KNOWN APPROACHES
After a short-time Fourier transform (STFT) of length $K$, the clean speech spectrum $S_\ell(k)$ subject to additive noise $N_\ell(k)$ at frame $\ell$ and frequency bin $k$ is expressed as $Y_\ell(k) = S_\ell(k) + N_\ell(k)$. The time-domain signals of noisy speech, clean speech, and noise are $y(n)$, $s(n)$, and $n(n)$, respectively. Following noise estimation by the minimum statistics approach [19], the resulting noise spectral variance $\hat{\sigma}^2_{N,\ell}(k)$ and the noisy speech spectrum $Y_\ell(k)$ are used to compute the a posteriori SNR

$$\gamma_\ell(k) = \frac{|Y_\ell(k)|^2}{\hat{\sigma}^2_{N,\ell}(k)}. \tag{1}$$

Then, the a priori SNR $\xi_\ell(k)$ is estimated via the decision-directed (DD) approach of Ephraim and Malah [2]

$$\xi_\ell(k) = \max\left\{ \beta\,\frac{|\hat{S}_{\ell-1}(k)|^2}{\hat{\sigma}^2_{N,\ell-1}(k)} + (1-\beta)\,\max\{\gamma_\ell(k) - 1,\ 0\},\ \xi_{\min} \right\} \tag{2}$$

where the parameters $\beta$ and $\xi_{\min}$ are set to 0.98 and $-15$ dB, respectively. Minimizing the mean square error between the clean speech spectrum and its estimate yields the Wiener filter weighting rule, which can be computed using the a priori SNR [1] ("WF-DD")

$$G_\ell(k) = \frac{\xi_\ell(k)}{1 + \xi_\ell(k)}. \tag{3}$$

Note that due to the use of Ephraim and Malah's DD approach, the Wiener filter becomes an estimator of good performance with highly reduced musical tones.

Ephraim and Malah [3] proposed MMSE estimation of the log-spectral amplitudes ("MMSE-LSA"). Under the assumption that the real and imaginary parts of clean speech spectra have a Gaussian distribution, the MMSE-LSA estimator can be derived and formulated as follows:

$$G_\ell(k) = \frac{\xi_\ell(k)}{1 + \xi_\ell(k)}\,\exp\left\{ \frac{1}{2} \int_{v_\ell(k)}^{\infty} \frac{e^{-t}}{t}\,dt \right\} \tag{4}$$

with

$$v_\ell(k) = \frac{\xi_\ell(k)}{1 + \xi_\ell(k)}\,\gamma_\ell(k). \tag{5}$$

The integral term in (4) is usually implemented as a table look-up.

Furthermore, the performance of both estimators can be improved by considering the speech presence probability. Still under the assumption that the pdf of clean speech and noise spectra is modeled as a complex Gaussian, both estimators are reformulated following [20], [21] as (6), with the speech absence probability (SAP) being computed from [22] subject to an SAP upper limit. The new SAP parameter helps the reference algorithms to reduce the amount of residual noise without increasing the perceivable speech distortion. The gain is determined by (3) or (4), with the a priori SNR being replaced as in [2], leading to (7). The clean speech spectrum estimate is finally computed by applying the resulting gain to the noisy speech spectrum $Y_\ell(k)$ (8).

III. NEW ENVIRONMENT-OPTIMIZED APPROACH
A. Training: Data Acquisition and Preprocessing
We assume the availability of noise recordings from the target environment of our speech enhancement system. Using a training database consisting of these noises artificially added to clean speech signals, the noise spectral variance $\hat{\sigma}^2_{N,\ell}(k)$ is estimated by applying the minimum statistics approach [19]. Then, the a posteriori SNR values are computed with (1) for each frequency bin in each frame of the training database.
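As a rough illustration of the quantities introduced so far, the following sketch computes an a posteriori SNR as in (1) for one frame, together with a decision-directed a priori SNR and a Wiener gain in the form commonly given in the literature [1], [2]. The function names, the array layout (one value per frequency bin), and the default parameters (0.98 and -15 dB, as quoted in Section II) are assumptions made for this example. Noise estimation by minimum statistics [19] is not part of the sketch; the noise variance is simply passed in.

```python
import numpy as np


def a_posteriori_snr(Y, noise_var):
    """A posteriori SNR: noisy power divided by the estimated noise variance."""
    return np.abs(Y) ** 2 / noise_var


def dd_a_priori_snr(gamma, prev_amp_sq, prev_noise_var,
                    beta=0.98, xi_min_db=-15.0):
    """Decision-directed a priori SNR (linear scale).

    prev_amp_sq is the squared clean speech amplitude of the previous frame:
    an estimate at test time, or the known clean value during training.
    """
    xi_min = 10.0 ** (xi_min_db / 10.0)
    xi = (beta * prev_amp_sq / prev_noise_var
          + (1.0 - beta) * np.maximum(gamma - 1.0, 0.0))
    return np.maximum(xi, xi_min)  # lower bound on the a priori SNR


def wiener_gain(xi):
    """Wiener filter weighting rule driven by the a priori SNR."""
    return xi / (1.0 + xi)
```

Passing the known clean speech amplitude of the previous frame as `prev_amp_sq` corresponds to the training-time variant described next, whereas passing the previous estimate corresponds to the test-time DD approach.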
FINGSCHEIDT et al.: ENVIRONMENT-OPTIMIZED SPEECH ENHANCEMENT
Also, the respective a priori SNR values are computed, however, by using a slightly modified DD approach:

$$\xi_\ell(k) = \max\left\{ \beta\,\frac{|S_{\ell-1}(k)|^2}{\hat{\sigma}^2_{N,\ell-1}(k)} + (1-\beta)\,\max\{\gamma_\ell(k) - 1,\ 0\},\ \xi_{\min} \right\} \tag{9}$$

with $\beta$ and $\xi_{\min}$ as in (2). Here, the clean speech spectral amplitude estimate $|\hat{S}_{\ell-1}(k)|$ in (2) is replaced by the correct clean speech spectral amplitude $|S_{\ell-1}(k)|$, which is only known during training, where we have access to the clean speech signal. Note that to compensate the DD bias in low SNR conditions, Erkelens et al. proposed two alternatives: First, a bias-compensating factor can be applied [16], see (10), with the clean speech spectral amplitude being estimated using the MMSE-STSA or MMSE-LSA approach [2], [3]. As a second alternative, the estimate in (2) can be computed through the MMSE estimator [17], [23], see (11). In our approach, we avoid the bias simply by using the correct value $|S_{\ell-1}(k)|$ in (9) instead. The weighting rules of Erkelens et al. were trained under stationary noise with known spectral variance. In our approach, real noise signals from the target environment are used instead, and the estimated noise spectral variance $\hat{\sigma}^2_{N,\ell}(k)$ is employed for training.

Now the pair of SNR values $(\gamma_\ell(k), \xi_\ell(k))$ in dB is quantized in each frame and each frequency bin with a fixed step size (in dB) over a limited quantization range (in dB) according to (12), yielding an a posteriori SNR index $i$ and an a priori SNR index $j$, each with its own number of quantization levels. Along with the index pair $(i, j)$, we store the respective (known) complex spectral values $S_\ell(k)$ and $N_\ell(k)$ and, for ease of later computations, $Y_\ell(k)$. Following this procedure, for each pair of $i$ and $j$ (i.e., for each pair of the quantized a posteriori and a priori SNR values), three vectors with a number of $N_{i,j}(k)$ complex spectral values are aggregated at frequency bin $k$: $\mathbf{s}_{i,j}(k)$ for the clean speech, $\mathbf{n}_{i,j}(k)$ for the noise, and $\mathbf{y}_{i,j}(k)$ for the noisy speech. The vector index $\nu = 1$ denotes the first instance where SNR index pair $(i, j)$ occurred, and $\nu = N_{i,j}(k)$ denotes the last instance seen in training with SNR index pair $(i, j)$. The vector with the absolute value taken of all its elements is denoted as $|\mathbf{s}_{i,j}(k)|$, and analogously for the vectors $|\mathbf{n}_{i,j}(k)|$ and $|\mathbf{y}_{i,j}(k)|$. In the same manner, the estimated noise spectral variance is also aggregated.

B. Training: Realizations for Some Cost Functions
Apart from avoiding prior assumptions, e.g., on the clean speech or noise spectrum pdf, a particular strength of our approach is the simple incorporation of almost arbitrary cost functions. In the following, we show different possibilities of how to exploit the collected vectors $\mathbf{s}_{i,j}(k)$, $\mathbf{n}_{i,j}(k)$, and $\mathbf{y}_{i,j}(k)$ to compute the scalar gain factor $G_{i,j}(k)$ that is optimal for a given cost function. The realization of any of the following cost functions will result in a set of frequency-individual weighting rules $G_{i,j}(k)$.

1) Ideal Gain Averaging (IGA): Without defining an explicit cost function, an intuitive solution is the computation of ideal gains and their averaging over all instances in the training database which yielded an SNR index pair $(i, j)$ at frequency bin $k$ [14] ("IGA"):

$$G_{i,j}(k) = \frac{1}{N_{i,j}(k)} \sum_{\nu=1}^{N_{i,j}(k)} G^{\mathrm{ideal}}_{\nu}(k). \tag{13}$$

The real-valued ideal gain is given as

$$G^{\mathrm{ideal}}_{\nu}(k) = \frac{|S_\nu(k)|}{|Y_\nu(k)|} \tag{14}$$

since multiplication of $|Y_\nu(k)|$ with $G^{\mathrm{ideal}}_{\nu}(k)$ yields the correct clean speech spectral amplitude $|S_\nu(k)|$. The problem with (14) is, however, that gain values greater than 1 may occur, especially if the a posteriori SNR is smaller than the a priori SNR. As these cases occur very rarely, the respective $N_{i,j}(k)$ is very small, and consequently the corresponding weighting rule $G_{i,j}(k)$ is statistically not reliable. Moreover, due to the nonideal nature of noise and SNR estimators, this may lead in practice to occasional noise amplification. To prevent this from happening, the gains should always be limited to the range [0, 1]. Instead of applying an explicit limitation to the gains in (14), we found the computation of an ideal gain with a Wiener filter-like formulation to be preferable:

$$G^{\mathrm{ideal}}_{\nu}(k) = \frac{|S_\nu(k)|^2}{|S_\nu(k)|^2 + |N_\nu(k)|^2}. \tag{15}$$

Note that the correct noise and clean speech spectral amplitudes are used, and that the direct multiplication of $|Y_\nu(k)|$ with $G^{\mathrm{ideal}}_{\nu}(k)$ yields close-to-perfect clean speech reconstruction, at least from an auditive point of view. This still justifies the term ideal gain for (15). In our investigations, we also found that employing the noise spectral variance $\hat{\sigma}^2_{N}$ can decrease
the variance on the average computation of the scalar gain factor in (13). Taking this fact into account, it is desirable to reformulate (15) as follows:

$$G^{\mathrm{ideal}}_{\nu}(k) = \frac{|S_\nu(k)|^2}{|S_\nu(k)|^2 + \hat{\sigma}^2_{N,\nu}(k)} \tag{16}$$

where $\nu$ indexes the training instances. Please note that (16) does not really mean the decision for a specific cost function or error criterion, since the clean speech spectral amplitudes $|S_\nu(k)|$ are used. However, the implied cost function for averaging the ideal gain as done in (13) can be given as (17) with (18). By inserting (18) into (17), and setting the partial derivative of the cost function w.r.t. the gain to zero, it is easy to show that the final result is (13). It is worth noting that (17) resembles Loizou's perceptually motivated cost function [4, Eq. (14)] for a particular choice of its parameter. Loizou, however, proposed as a denominator term the squared clean speech spectral amplitude, which emphasizes spectral valley regions, where masking effects are not as strong as in the spectral peak regions. On the other hand, our denominator term can hardly be explained as exploiting masking effects. However, since it tends to be larger than the squared clean speech spectral amplitude, our gain averaging will not penalize residual noise in the spectral valley regions as much as Loizou's approach does.

2) Spectral Value MMSE (MMSE-SV): The cost function for the MMSE of complex spectral values is given as the squared error between the vector of clean speech spectral values and the vector of its estimates. The minimization of the cost function is written with superscript $H$ denoting the Hermitian, i.e., the conjugate complex transpose. Inserting the gain-weighted noisy speech vector as the estimate then leads to the MMSE result, which uses the additive signal model. Assuming that clean speech and noise are statistically independent,1 the result simplifies further for a sufficiently large number of training instances. The final spectral value MMSE gain is then computed by (19). Here again we have the situation that the gain values may become larger than 1. Approximating the denominator, and again using the statistical independence of clean speech and noise, we can rewrite (19) as (20) ("MMSE-SV"). This resembles the Wiener filter formula (3) and is also similar to (15) inserted into (13), except that here the summation (or averaging) takes place separately in the numerator and in the denominator.

1 This independence is actually not the case. If background noise increases in volume, speakers usually do so as well. Also, articulation changes significantly (Lombard effect).

3) Spectral Amplitude MMSE (MMSE-SA): The cost function for the MMSE of (real-valued) spectral amplitudes can be formulated as the squared error between the vector of clean speech spectral amplitudes and the vector of its estimates. The minimization of the cost function is written with superscript $T$ denoting the transpose. Inserting (21) then leads to the result (22). To prevent the gain values from becoming larger than 1, we reapply the approximation of the denominator from (20) and obtain (23) ("MMSE-SA").

4) Log-Spectral Amplitude MMSE (MMSE-SAlog): The cost function for the MMSE of (real-valued) log-spectral amplitudes can be formulated as the squared error between the logarithms of the clean speech spectral amplitude vector and of its estimate, where the logarithm applied to a vector shall be the same vector with the logarithm of its elements. Minimization of the cost function is again carried out with respect to the gain.
With (21) inserted, this can be solved more easily in scalar notation. This then leads to the result (24). In analogy to (15), we reapply the approximation of the denominator to avoid gain values greater than 1, and we obtain (25) ("MMSE-SAlog").

C. Training: Speech Absence Probability
In known speech enhancement techniques, the weighting rules are commonly derived based on some a priori knowledge of the clean speech and noise signals, i.e., the pdf assumptions of the clean speech and noise spectra. It is advantageous to reuse this a priori knowledge as the speech absence probability (SAP) to further improve the performance of the weighting rules [2]. We therefore also apply the SAP, in a binary form as a voice activity detection (VAD) decision, to the resulting frequency-dependent weighting rules. In each frequency bin, this leads to a separate weighting rule for cases with speech being present and for cases with speech being absent. In contrast, Erkelens et al. [16], [17] proposed to train only a general weighting rule for the whole utterance, i.e., for both speech presence and speech absence. The separation will of course increase the memory requirements and the system complexity, and it implies a dependence on the VAD performance. However, as we do not make use of explicit a priori knowledge of clean speech and noise pdfs, we require separate training of weighting rules for speech presence and speech absence to independently exploit the spectral statistical characteristics of the clean speech and noise spectra. For this purpose, we compute the VAD decision of each frequency bin simply by applying a threshold to the speech absence probability, which is computed as a function of the a priori SNR [22]. Note that it is not advantageous to use an ideal VAD (i.e., operating on clean speech) in training. Instead, one should employ the very same VAD scheme in training and in test (i.e., operating on noisy speech) to have better matched conditions.

D. Training Option 1: Bark Scale Weighting Rule
As an option to save memory for the frequency-individual weighting rules, we redefine the weighting rules in subbands instead of frequency bins. Please note that our frequency- or subband-dependent weighting rules differ from Erkelens' approach [16], [17], where only a single weighting rule is trained for all frequency bins. Taking the human auditory system into consideration, we redefine all weighting rules in Section III-B on the Bark scale with $B$ Bark bands (up to 4 kHz). As opposed to [15], the Bark scale weighting rule for each Bark band $b$ is calculated according to (13), (20), (23), and (25) with $k$ replaced by $b$. The data in the vectors is assembled from all frequency bins belonging to that Bark band $b$. Note that each of these vectors now has a length equal to the total number of instances collected over all frequency bins of that Bark band.

E. Training Option 2: R-WPLS Parameterization
At the end of the training, we obtain 2-D raw weighting rules indexed by the SNR index pair $(i, j)$ for each frequency bin $k$ or, alternatively, for each Bark band $b$. Storing all weighting rules as a table look-up may still exceed the memory limitations of certain applications. To further reduce the memory requirements, we propose, for each a posteriori SNR index $i$ and Bark band $b$, an optional parameterization of the 1-D weighting rule as a function of the a priori SNR index $j$. The gain as a function of $j$ has a fixed total number of values, of which a number of consecutive values within a regression range shall contain at least those gain values that were seen in training.

The R-PLS model could be the first alternative to parameterize the raw weighting rule [15]. As we are dealing with raw weighting rules, the R-PLS parameterization has difficulty fitting their irregular shape (as shown later in Fig. 2), in which sharp peaks mostly occur in cases with sparse training data (see also Fig. 1). Therefore, we propose to apply an R-order weighted polynomial least-squares (R-WPLS) parameterization, which puts a higher weighting factor on SNR index pairs with more training data (i.e., with smaller variance) [18], [24]. In the R-WPLS parameterization for each a posteriori SNR index $i$ and Bark band $b$, the elements of the weighting matrix [18, Sec. 14.2] are chosen to be proportional to the amount of training data of the respective SNR index pair. This yields a better fit of the parameterized weighting rule to the raw one for SNR index pairs often seen in training than for those seldom seen in training.2 As a result of the R-WPLS parameterization for each a posteriori SNR index $i$ and Bark band $b$, we obtain the polynomial coefficients, as well as the two additional regression range parameters.

2 Please note that the data matrix of the R-WPLS parameterization [18, Eq. (14.5)] is computed as an $(N \times (R + 1))$ Vandermonde matrix [18, Eq. (4.9)] evaluated on the average value of all instances of $\xi(k)$ in the quantization interval with the a priori SNR index $j$.

Table I summarizes the average memory requirements of our environment-optimized approach after the Bark scale redefinition and an R-WPLS parameterization with $R = 4$. Using the
TABLE I
MEMORY REQUIREMENTS AFTER THE BARK SCALE REDEFINITION AND THE R-WPLS PARAMETERIZATION (PERCENT VALUES FOR K = 256, B = 19, R = 4; FURTHER CAPTION VALUES RELATING TO N: 15, 20)
4-WPLS parameterization, a comparable performance to the original nonparameterized approach can still be attained with a memory requirement of 10.6 kB (assuming that each gain value is represented by 16 bits). In comparison, Erkelens' single weighting rule requires up to 7.2 kB of memory [17]. Please note that all values in Table I are computed by considering only the gain values, as well as the two additional regression range parameters.

F. Testing: Outline of the New Algorithm
The proposed algorithm for speech enhancement as used in the testing session is as follows: At first, the noise spectral variance is computed by applying the same noise estimation approach as in the training session. Then, the a posteriori SNR is computed. In contrast to training, (2) is applied to compute the a priori SNR values. The VAD decision is subsequently computed for each frequency bin based on the resulting a priori SNR values [22]. Depending on the VAD decision, the respective (Bark band) weighting rule is selected. Finally, the clean speech spectral amplitude is estimated by simply applying (8).

In the computation of the a priori SNR in testing, we are aware that applying (2) does not compensate for the DD bias in low SNR conditions. As an alternative approach, the bias could be reduced by employing (10) or (11) to compute the a priori SNR. Unfortunately, this again implies an assumption of a Gaussian clean speech spectrum model. In our context, this alternative did not yield measurable improvements.

In the case of the R-WPLS parameterization, after SNR quantization (12), the a posteriori SNR index $i$ is used to retrieve the respective R-WPLS Bark scale weighting rule parameters, with frequency bin $k$ belonging to Bark band $b$. The appropriate gain value is then computed according to (26), which involves the largest a priori SNR value in the quantization interval of the respective index. Please note that the last case in (26) represents gain extrapolation for the unseen data at high a priori SNR. Meanwhile, in [16] and [17], those unseen gain values are replaced by the state-of-the-art weighting rules derived from the respective cost function under the Gaussian speech model instead.

IV. EXPERIMENTAL RESULTS
We evaluate the performance of the proposed approaches at a sampling rate of 8 kHz in automotive and office environments. In our
experiment, 40 different utterances comprising eight different speakers (four male and four female), and car-specific noise signals from four different cars (VW Golf, Benz-124, Dodge Intrepid, and Mitsubishi Pajero) are taken from the NTT database and NTT-AT database, respectively. Each car-specific noise signal with 12-min total duration is segmented according to the length of the utterances, yielding 88 car-specific noise segments. All signals are then split into two sets of equal size (20 different utterances spoken by two male and two female speakers and 44 car-specific noise segments). Combining the clean speech and the noise signals, noisy speech utterances in each car-specific environment are obtained for the training and testing sessions, respectively.

We alternatively perform training also in a more general automotive and in an office environment: Here, we use 38 car noise signals (different from the car-specific noise signals) and 38 office noise signals from the NTT-AT database, respectively. For each environment, we randomly take a segment in each of these 38 noise signals. After combination with the clean speech training data from above, we obtain the noisy speech utterances in the general automotive and office environments, respectively.

All these training and test sets are then prepared for -5 (for training only), 0, 5, 10, 15, and 20 dB SNR conditions, respectively. In total, we get sets of training files for the car-specific environments, one set of training files for the general automotive environment, one set of training files for the office environment, and sets of test files for the car-specific environments.

As reference systems, we simulate the a priori SNR-driven Wiener filter [1], as well as Ephraim and Malah's MMSE-LSA estimator [3], both making use of the speech absence probability as sketched in Section II. They apply the same weighting rule to all frequency bins. Noise estimation for all compared approaches is done using minimum statistics [19].
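Preparing noisy material for a given SNR condition, as described above, amounts to scaling the noise before adding it to the clean speech. The sketch below shows one simple way to do this. It assumes that the SNR is defined as the ratio of the mean powers of the whole utterance and the whole noise segment; the text does not state which level measurement was actually used, and the function name `mix_at_snr` is hypothetical.

```python
import numpy as np


def mix_at_snr(speech, noise, snr_db):
    """Scale `noise` so that speech power / noise power equals snr_db, then add.

    Assumption: SNR is measured over the full signals (mean power),
    not with an active speech level meter.
    """
    speech = np.asarray(speech, dtype=float)
    noise = np.asarray(noise, dtype=float)[: len(speech)]
    if len(noise) < len(speech):
        raise ValueError("noise segment is shorter than the utterance")
    speech_power = np.mean(speech ** 2)
    noise_power = np.mean(noise ** 2)
    scale = np.sqrt(speech_power / (noise_power * 10.0 ** (snr_db / 10.0)))
    scaled_noise = scale * noise
    # Return the scaled noise as well, since training needs clean speech
    # and noise separately in addition to the noisy mixture.
    return speech + scaled_noise, scaled_noise
```

Returning the scaled noise alongside the mixture keeps the clean speech, noise, and noisy speech signals aligned, which is what the training procedure of Section III relies on.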
The STFT length is $K = 256$, and the window is a flat-top Hann window with a rising edge of 40 samples, 120 samples being 1, and a trailing edge of 40 samples. Consequently, this makes a frame length of 200 samples and a frame shift of 160 samples.

In our first experiment, we investigate the distribution of the quantized SNR pairs $(i, j)$ during training of a weighting rule for any of the cost functions discussed in Section III-B. Fig. 1 shows the result for a low and a mid Bark band in speech presence and speech absence, respectively. We observe that in the first Bark band, automotive noise seems to be dominant w.r.t. speech, indicated by the large amount of negative a priori SNR values. On the other hand, the 14th Bark band results show that positive a priori SNR values very likely indicate speech presence. Due to the lower bound in (9), a priori SNR values below -15 dB do not occur. Interestingly, the distribution of the quantized SNR pairs is also lower-bounded by a curve spanning from the (4, -15) dB to the (20, 3) dB grid point. Considering the modified DD approach (9), it is apparent that this phenomenon occurs because of the dependence of the a priori SNR values on the a posteriori SNR values. For a posteriori SNR values well above 1, the a priori SNR is lower-bounded by the a posteriori SNR scaled by one minus the DD smoothing parameter (set to 0.98), which in dB is about 17 dB below the a posteriori SNR.
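The two grid points and the 17 dB offset quoted above can be checked with a short calculation. It assumes the usual decision-directed form, in which the a priori SNR is the sum of a nonnegative smoothed term and $(1-\beta)\max\{\gamma - 1, 0\}$, with $\beta = 0.98$ and a floor of $-15$ dB; the symbols $\beta$, $\gamma$, and $\xi$ are used here for the smoothing parameter, the a posteriori SNR, and the a priori SNR.

$$
\begin{aligned}
\xi &\ge (1-\beta)\,(\gamma - 1) \approx (1-\beta)\,\gamma \quad (\gamma \gg 1), \\
10\log_{10}(1-\beta) &= 10\log_{10}(0.02) \approx -17.0\ \text{dB}, \\
\gamma = 20\ \text{dB} = 100: \quad 10\log_{10}(0.02 \cdot 99) &= 10\log_{10}(1.98) \approx 3.0\ \text{dB}, \\
\xi = -15\ \text{dB} \approx 0.0316: \quad \gamma = 1 + \frac{0.0316}{0.02} \approx 2.58 &\;\Rightarrow\; 10\log_{10}(2.58) \approx 4.1\ \text{dB}.
\end{aligned}
$$

Under these assumptions, the bounding curve leaves the $-15$ dB floor at an a posteriori SNR of about 4 dB and reaches about 3 dB at an a posteriori SNR of 20 dB, in line with the grid points named in the text.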
Please note that the chosen step size (in dB) has been found to be a compromise between the algorithm performance and the memory requirements (after R-WPLS parameterization). A smaller step size can give a slight performance improvement, but it requires more memory and a higher order of the R-WPLS model. On the other hand, larger step sizes further reduce the required memory; however, speech distortion is introduced.

Let us now introduce the measures of quality that we apply in our evaluation (see also [25]). At first, we compute the weighting rule values based on the noisy speech signal $y(n)$. Next, the clean speech and noise signals $s(n)$ and $n(n)$ are separately subject to the obtained weighting rule in the spectral domain to produce the filtered clean speech and the filtered noise, respectively. The SSDR of a segment of fixed sample length is then formulated as a ratio in dB involving the speech distortion, i.e., the difference between the filtered clean speech and the clean speech, where a shift by a fixed number of samples compensates for the sample delay in the filtered signals. Limiting the SSDR of each segment in dB then leads to the final segmental SSDR (27), an average over a set of frames. The normalizing term in (27) is the number of elements in this set, which could represent either all frames of the test data or a subset with speech being present. In addition, we exclude frames of insignificant speech power by taking only frames whose SSDR exceeds a lower threshold (in dB) into the set. The segmental noise attenuation (segmental NA) can be computed in analogy, see (28).

Fig. 1. Data distribution during training of weighting rules in general automotive noise, separated into speech presence and speech absence: b = 1 Bark (upper) and b = 14 Bark (lower).

Fig. 2. Raw IGA weighting rules trained in general automotive noise in speech presence and speech absence: b = 1 Bark (upper) and b = 14 Bark (lower).
An example of Bark scale IGA weighting rules for the same Bark bands is depicted in Fig. 2. In the general automotive environment, the dominance of the car noise energy in low Bark bands is now nicely represented by the fact that the weighting rules in the first Bark band for speech absence exhibit a smaller value than those in the 14th Bark band. Apart from that, it is also shown that the weighting rule in speech absence does not always completely suppress the signal, due to VAD decision errors during training. The nonzero weighting rule measured in speech absence can advantageously help to preserve the naturalness of speech and noise, particularly in the transition from speech presence to speech absence, or vice versa.
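How a raw IGA table of the kind shown in Fig. 2 could be accumulated from training data is sketched below for a single frequency bin or Bark band. This is a minimal sketch under stated assumptions, not the authors' implementation: the quantization grid (1 dB steps from -15 dB to 20 dB, with clipping at the edges) is hypothetical, the ideal gain uses the Wiener-like form with the estimated noise variance described for (16), and the names `quantize_db` and `train_iga_table` are made up for this example.

```python
import numpy as np


def quantize_db(x_db, lo_db, step_db, n_levels):
    """Map dB values to integer indices on a uniform grid (clipped at the edges)."""
    idx = np.round((np.asarray(x_db, dtype=float) - lo_db) / step_db).astype(int)
    return np.clip(idx, 0, n_levels - 1)


def train_iga_table(S, noise_var, gamma_db, xi_db,
                    lo_db=-15.0, step_db=1.0, n_levels=36):
    """Average ideal gains per (a posteriori, a priori) SNR index pair.

    S         : clean speech spectral values of all training instances
    noise_var : estimated noise spectral variance for each instance
    gamma_db  : a posteriori SNR in dB for each instance
    xi_db     : a priori SNR in dB for each instance
    """
    clean_power = np.abs(np.asarray(S)) ** 2
    ideal_gain = clean_power / (clean_power + np.asarray(noise_var, dtype=float))

    i = quantize_db(gamma_db, lo_db, step_db, n_levels)
    j = quantize_db(xi_db, lo_db, step_db, n_levels)

    gain_sum = np.zeros((n_levels, n_levels))
    count = np.zeros((n_levels, n_levels), dtype=int)
    np.add.at(gain_sum, (i, j), ideal_gain)  # unbuffered accumulation
    np.add.at(count, (i, j), 1)

    table = np.full((n_levels, n_levels), np.nan)  # NaN marks unseen pairs
    seen = count > 0
    table[seen] = gain_sum[seen] / count[seen]
    return table, count
```

Calling the function once on the instances classified as speech presence and once on those classified as speech absence would give the two separate tables per band. The returned counts indicate how much training data supports each entry, which is the kind of information a weighted fit can use.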
The term is the number of elements in set , which could represent either all frames of the test data or a subset with speech being present. No further restriction is made on set . Going up the curves in Figs. 3–5, the markers of each curve belong to the SNR conditions of 0, 5, 10, 15, 20 dB, respectively. The more a curve is located in the upper right of the figure, the larger the values of the segmental NA and segmental SSDR become, the less residual noise and speech distortion remain, and the better the algorithm performs. After the use of the Bark scale and the 4-WPLS parameterization, we present the segmental SSDR and NA results for all new environment-optimized techniques discussed in Section III, namely: ideal gain averaging “IGA” (13), spectral value MMSE “MMSE-SV” (20), spectral amplitude MMSE “MMSE-SA” (23), and log-spectral amplitude MMSE “MMSE-SAlog” (25). Please note that the segmental NA measures in these plots are evaluated for the whole
832
IEEE TRANSACTIONS ON AUDIO, SPEECH, AND LANGUAGE PROCESSING, VOL. 16, NO. 4, MAY 2008
Malah’s MMSE-LSA estimator “MMSE-LSA” (4) are shown as well. All evaluated algorithms employ the DD approach to compute the a priori SNR and the speech presence probability estimator. The performance differences of the algorithms therefore reflect the respective weighting rules and their interaction with the DD a priori SNR computation. As a further reference, multiplication between the noisy specand the instantaneous ideal gain trum (29)
Fig. 3. Segmental SSDR for speech presence versus segmental NA for the whole utterance in a car-specific environment; training with general automotive data.
Fig. 4. Segmental SSDR for speech presence versus segmental NA for the whole utterance in a car-specific environment; training with car-specific data.
Fig. 5. Segmental SSDR for speech presence versus segmental NA for the whole utterance in a car-specific environment; training with office data.
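The two axes of Figs. 3–5 can be pictured with the following sketch of frame-based measures. It is only an approximation under stated assumptions: the segmental NA is taken as the frame-averaged ratio of input noise energy to filtered noise energy, expressed in dB, and the segmental SSDR as the frame-averaged ratio in dB of clean speech energy to distortion energy, limited to the range from `r_min` to `r_max`. The frame length, the limits, the assumption of time-aligned signals, and the function names are all hypothetical and are not restated from the paper's definitions.

```python
import numpy as np

def _frames(x, frame_len):
    """Split a 1-D signal into non-overlapping frames, dropping the remainder."""
    x = np.asarray(x, dtype=float)
    n = len(x) // frame_len
    return x[: n * frame_len].reshape(n, frame_len)

def segmental_na_db(noise, filtered_noise, frame_len=256):
    """Frame-averaged noise attenuation in dB over the whole utterance."""
    e_in = np.sum(_frames(noise, frame_len) ** 2, axis=1)
    e_out = np.sum(_frames(filtered_noise, frame_len) ** 2, axis=1)
    ratio = e_in / np.maximum(e_out, 1e-12)
    return float(10.0 * np.log10(np.mean(ratio)))

def segmental_ssdr_db(speech, filtered_speech, active,
                      frame_len=256, r_min=-10.0, r_max=30.0):
    """Frame-averaged speech-to-speech-distortion ratio in dB.

    Only frames flagged in `active` (speech presence) are averaged.
    """
    s = _frames(speech, frame_len)
    e = _frames(filtered_speech, frame_len) - s  # distortion of the speech component
    e_s = np.maximum(np.sum(s ** 2, axis=1), 1e-12)
    e_e = np.maximum(np.sum(e ** 2, axis=1), 1e-12)
    ssdr = np.clip(10.0 * np.log10(e_s / e_e), r_min, r_max)
    return float(np.mean(ssdr[np.asarray(active, dtype=bool)]))
```

Both functions need the filtered speech and noise components separately, which presumes that the weighting rule computed from the noisy signal is also applied to the clean speech and to the noise alone.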
utterance. On the other hand, as speech distortion occurs in speech-only regions (disregarding any distortion in nonspeech regions, e.g., the sound of aspiration or coughing), the segmental SSDR is measured only in speech presence. In Figs. 3–5, we present the averaged simulation results of the new environment-optimized techniques tested in the four car-specific testing sets. For comparison, the a priori SNR-driven Wiener filter “WF-DD” (3) and Ephraim and
computed similar to (16) serves as an upper limit of quality, at least for IGA. It results in an almost perfect clean speech spectral amplitude (with phase distortion), indicated by a competitive noise attenuation and 2–4 dB less distorted speech. Considering that (16) is the practical implementation of (14), this result can be employed to validate the term ideal gain in (14). First, all new weighting rules are trained with the general automotive training set. As shown in Fig. 3, the proposed MMSE-SA technique introduces the most speech distortion; however, it gives the highest noise attenuation. Although MMSE-LSA allows for better speech preservation than the Wiener filter, it cannot give a competitive noise attenuation. An explanation is the increase of the MMSE-LSA gain curve while the a posteriori SNR is decreasing (see [3, Fig. 1]). The proposed MMSE-SV, MMSE-SAlog, and IGA approaches outperform the reference algorithms in terms of both noise attenuation and speech distortion. MMSE-SAlog and IGA are very close in performance, both being the best techniques among the proposed algorithms. Nonetheless, informal listening tests confirm the superiority of IGA over all other schemes from an auditory point of view, with even a slight advantage over MMSE-SAlog. Next, we train the new weighting rules by employing the car-specific training sets, and apply them to the corresponding car-specific testing sets as before. After taking the average values over all four testing sets, the results are depicted in Fig. 4. Unexpectedly, training with data from the same specific car in which the test is performed does not bring any improvement at all. Even though they yield a minor improvement in speech preservation, all proposed weighting rules (except IGA) show a slight performance degradation in noise attenuation.
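For orientation, the WF-DD reference used in these comparisons combines a decision-directed (DD) a priori SNR estimate with the Wiener gain. A minimal sketch for one subband is given below. The smoothing factor `alpha`, the floor `xi_min`, the zero initialization of the previous clean speech power, and the function name `wf_dd_gains` are assumptions chosen for illustration; the exact settings behind the reported results are not restated here.

```python
import numpy as np

def wf_dd_gains(noisy_power, noise_power, alpha=0.98, xi_min=10 ** (-15 / 10)):
    """Per-frame Wiener gains for one subband with a DD a priori SNR.

    noisy_power: |Y|^2 per frame; noise_power: noise power estimate per frame.
    """
    noisy_power = np.asarray(noisy_power, dtype=float)
    noise_power = np.asarray(noise_power, dtype=float)
    gains = np.empty_like(noisy_power)
    prev_clean_power = 0.0  # estimated |S|^2 of the previous frame (assumed start)
    for m in range(len(noisy_power)):
        gamma = noisy_power[m] / noise_power[m]   # a posteriori SNR
        inst = max(gamma - 1.0, 0.0)              # instantaneous SNR estimate
        # Decision-directed mix of the previous estimate and the current frame.
        xi = alpha * prev_clean_power / noise_power[m] + (1.0 - alpha) * inst
        xi = max(xi, xi_min)                      # floor on the a priori SNR
        gains[m] = xi / (1.0 + xi)                # Wiener gain
        prev_clean_power = gains[m] ** 2 * noisy_power[m]
    return gains
```

Because the gain depends on the a posteriori SNR only through the smoothed a priori SNR, it reacts slowly to frame-to-frame fluctuations, which is the interaction with the DD computation referred to in the comparison of the weighting rules.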
Retesting the weighting rules on their respective training sets shows a much better performance, which brings us to the conclusion that the performance deterioration occurs because of too few training data from each specific car. For practical applications, it is advantageous to have training data that covers a wide range of possible circumstances of the target environment, e.g., noise situations such as open window, rotating fan, accelerating machine, etc. Nevertheless, it is a good (generalizing) result that training a weighting rule in a general automotive environment (in this case: a number of cars) outperforms the reference speech enhancement techniques, even when it is tested in a car unseen in training. In the last experiment, we train our environment-optimized weighting rules in the office environment, and we apply them again to the car-specific testing sets. In this way, we investigate how a severe mismatch between training and testing data can
TABLE II PREFERENCE SCORES FOR IGA COMPARED TO WIENER FILTER IN THE CAR-SPECIFIC ENVIRONMENT
to the reference speech enhancement techniques, it turns out that our approach yields a consistently higher noise attenuation together with the same or even less speech distortion. At SNR conditions ranging from 0 to 15 dB, preferences over 72% were achieved.
ACKNOWLEDGMENT
impact the performance of the proposed techniques, as depicted in Fig. 5. The proposed MMSE-SA and MMSE-SV techniques clearly show the poorest performance in preserving the speech content, although they give the highest noise attenuation level. In comparison to Figs. 3 and 4, MMSE-SAlog and IGA also lose considerably in terms of speech quality. Therefore, all proposed algorithms can be considered environment-dependent, although IGA in Fig. 5 still exceeds the performance of the reference algorithms. In summary, the results reveal that the occurring mismatch degrades the overall performance of the proposed algorithms. Among all environment-dependent algorithms, IGA turns out to be the best scheme in matched training and test conditions, and it shows competitive performance even in mismatched conditions. Finally, we conduct a subjective listening test in an A/B fashion with a total of four randomly chosen speech samples comprising two speakers (one male and one female) in car noise. Sixteen expert and nonexpert listeners have to give their preference either to the IGA approach or to the Wiener filter, which are presented randomly in both orders. The IGA weighting rules are trained on the general automotive training set. Each test subject performs the test three times, namely for the 0, 5, and 15 dB SNR conditions, respectively. Table II shows that there is a significant preference in all three SNR cases for the new environment-optimized speech enhancement technique IGA, ranging from 72.66% to 78.91%. In the subjective listening tests, most of the listeners affirm that low-level musical tones are still audible in the enhanced signals of both algorithms, especially for the 0-dB SNR condition. In this condition, relatively more speech distortions are audible in the signals enhanced by the Wiener filter. Nevertheless, it is also reported that a few samples enhanced by the IGA technique sound a bit metallic.
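The reported preference scores can be checked with simple arithmetic. The sketch below assumes 128 votes per SNR condition (16 listeners, four samples, two presentation orders); this count is inferred from the test description and is not stated explicitly. Under that assumption, 93 and 101 votes for IGA reproduce the reported 72.66% and 78.91%. The one-sided binomial tail against a 50% chance level is added only as an illustration of what a significant preference could mean; it treats all votes as independent, which is a simplification, and it is not necessarily the test used for Table II.

```python
from math import comb

def preference_percent(votes_for, total_votes):
    """Share of votes for one algorithm, in percent."""
    return 100.0 * votes_for / total_votes

def binomial_tail(k, n, p=0.5):
    """P(X >= k) for X ~ Binomial(n, p), i.e. a one-sided tail probability."""
    return sum(comb(n, i) * p ** i * (1.0 - p) ** (n - i) for i in range(k, n + 1))

# Assumed vote count per SNR condition: listeners x samples x presentation orders.
TOTAL_VOTES = 16 * 4 * 2

for votes in (93, 101):
    score = preference_percent(votes, TOTAL_VOTES)
    tail = binomial_tail(votes, TOTAL_VOTES)
    print(f"{votes}/{TOTAL_VOTES} votes: {score:.2f}% preference, tail probability {tail:.2e}")
```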
In speech absence the superiority of the IGA estimator versus the Wiener filter is obviously perceivable. Being trained in the target environment, the proposed IGA algorithm yields more noise attenuation, particularly in low frequencies (where the automotive noise is mostly concentrated), which is even clearly audible in the 15-dB SNR condition.
V. CONCLUSION
We addressed the problem of how to optimize a speech enhancement system for a specific environment, specifically in an automotive environment. We proposed a new approach that does not require an explicit pdf model of the clean speech and noise spectra. Using the a posteriori SNR and the a priori SNR as an index, frequency-dependent weighting rules are trained from some recorded noise data in the target environment and are stored as a table look-up. Having a distinct weighting rule in speech presence and speech absence for each frequency (or Bark) band helps to improve the performance of the proposed approach. In comparison
The authors would like to thank the anonymous reviewers for their constructive comments that helped to improve the paper.
REFERENCES
[1] P. Scalart and J. Filho, “Speech enhancement based on a priori signal to noise estimation,” in Proc. ICASSP’96, Atlanta, GA, May 1996, pp. 629–632.
[2] Y. Ephraim and D. Malah, “Speech enhancement using a minimum mean-square error short-time spectral amplitude estimator,” IEEE Trans. Acoust., Speech, Signal Process., vol. 32, no. 6, pp. 1109–1121, Dec. 1984.
[3] Y. Ephraim and D. Malah, “Speech enhancement using a minimum mean-square error log-spectral amplitude estimator,” IEEE Trans. Acoust., Speech, Signal Process., vol. 33, no. 2, pp. 443–445, Apr. 1985.
[4] P. Loizou, “Speech enhancement based on perceptually motivated Bayesian estimators of the magnitude spectrum,” IEEE Trans. Speech Audio Process., vol. 13, no. 5, pp. 857–869, Sep. 2005.
[5] T. Lotter and P. Vary, “Noise reduction by maximum a posteriori spectral amplitude estimation with supergaussian speech modeling,” in Proc. IWAENC, Kyoto, Japan, Sep. 2003, pp. 83–86.
[6] T. Lotter and P. Vary, “Speech enhancement by MAP spectral amplitude estimation using a super-Gaussian speech model,” EURASIP J. Appl. Signal Process., vol. 7, pp. 1110–1126, 2005.
[7] R. Martin, “Speech enhancement using MMSE short time spectral estimation with gamma distributed speech priors,” in Proc. ICASSP’02, Orlando, FL, May 2002, pp. 504–512.
[8] R. Martin and C. Breithaupt, “Speech enhancement in the DFT domain using Laplacian speech priors,” in Proc. IWAENC’03, Kyoto, Japan, Sep. 2003, pp. 87–90.
[9] B. Chen and P. Loizou, “Speech enhancement using a MMSE short time spectral amplitude estimator with Laplacian speech modelling,” in Proc. ICASSP’05, Philadelphia, PA, May 2005, pp. 1097–1100.
[10] I. Potamitis, N. Fakotakis, and G. Kokkinakis, “A trainable speech enhancement technique based on mixture models for speech and noise,” in Proc. Eurospeech’03, Geneva, Switzerland, Sep. 2003, pp. 573–576.
[11] G. Ding, X. Wang, Y. Cao, F. Ding, and Y. Tang, “Speech enhancement based on speech spectral complex Gaussian mixture model,” in Proc. ICASSP’05, Philadelphia, PA, Mar. 2005, pp. 165–168.
[12] Y. Cheng, D. O’Shaughnessy, and P. Kabal, “Speech enhancement using a statistically derived filter mapping,” in Proc. Int. Conf. Spoken Lang. Process., Banff, AB, Canada, Oct. 1992, pp. 515–518.
[13] J. Porter and S. Boll, “Optimal estimators for spectral restoration of noisy speech,” in Proc. ICASSP’84, San Diego, CA, Mar. 1984, pp. 18A.2.1–18A.2.4.
[14] T. Fingscheidt and S. Suhadi, “Data-driven speech enhancement,” in Proc. ITG-Fachtagung “Sprachkommunikation”, Kiel, Germany, Apr. 2006, VDE-Verlag.
[15] S. Suhadi, S. Stan, and T. Fingscheidt, “A novel environment-dependent speech enhancement method with optimized memory footprint,” in Proc. Int. Conf. Spoken Lang. Process., Pittsburgh, PA, Sep. 2006, pp. 249–252.
[16] J. Erkelens, J. Jensen, and R. Heusdens, “A general optimization procedure for spectral speech enhancement methods,” in Proc. EUSIPCO’06, Florence, Italy, Sep. 2006, CD-ROM.
[17] J. Erkelens, J. Jensen, and R. Heusdens, “A data-driven approach to optimizing spectral speech enhancement methods for various error criteria,” Proc. Speech Commun., Special Iss. Speech Enhancement, vol. 49, no. 7–8, pp. 530–541, July–Aug. 2007.
[18] S. Haykin, Adaptive Filter Theory, 2nd ed. Englewood Cliffs, NJ: Prentice-Hall, 1991.
[19] R. Martin, “Noise power spectral density estimation based on optimal smoothing and minimum statistics,” IEEE Trans. Acoust., Speech, Signal Process., vol. 9, no. 5, pp. 504–512, Jul. 2001.
[20] D. Malah, R. Cox, and A. Accardi, “Tracking speech-presence uncertainty to improve speech enhancement in nonstationary noise environments,” in Proc. ICASSP’99, Phoenix, AZ, Mar. 1999, pp. 789–792.
[21] J. Li, M. Akagi, and Y. Suzuki, “Improved hybrid microphone array post-filter by integrating a robust speech absence probability estimator for speech enhancement,” in Proc. ICSLP’06, Pittsburgh, PA, Sep. 2006, pp. 2130–2133.
[22] I. Cohen, “Optimal speech enhancement under signal presence uncertainty using log-spectral amplitude estimator,” IEEE Signal Process. Lett., vol. 9, no. 4, pp. 112–116, Apr. 2002.
[23] P. Wolfe and S. Godsill, “Simple alternatives to the Ephraim and Malah suppression rule for speech enhancement,” in Proc. 11th IEEE Signal Process. Workshop Statist. Signal Process., Singapore, Aug. 2001, pp. 496–499.
[24] R. Carroll and D. Ruppert, Transformation and Weighting in Regression. New York: Chapman & Hall, 1988.
[25] S. Gustafsson, “Enhancement of audio signals by combined acoustic echo cancellation and noise reduction,” Ph.D. dissertation, Inst. Commun. Equipment Data Process., Technical Univ. of Aachen, Aachen, Germany, 1999, Aachener Beiträge zu digitalen Nachrichtensystemen, P. Vary, Ed.
Tim Fingscheidt (S’93–M’98–SM’04) received the Dipl.-Ing. degree in electrical engineering and the Ph.D. degree from Aachen University of Technology, Aachen, Germany, in 1993 and 1998, respectively. He further pursued his work on joint speech and channel coding as a Consultant in the Speech Processing Software and Technology Research Department, AT&T Labs, Florham Park, NJ. In 1999, he entered the Signal Processing Department of Siemens AG (COM Mobile Devices), Munich, Germany, and contributed to speech codec standardization in ETSI, 3GPP, and ITU-T. In 2001, he became Team Leader for Audio Applications. In 2005, he joined Siemens Corporate Technology, Munich, Germany, leading the speech technology development activities in recognition, synthesis, and speaker verification. Since 2006, he has been a Professor at the Institute for Communications Technology, Braunschweig Technical University, Braunschweig, Germany.
His research interests are speech signal processing for automotive applications, with focus on speech enhancement and instrumental quality measures. Dr. Fingscheidt received several awards, among them a prize from the Vodafone Mobile Communications Foundation in 1999 and the 2002 prize from the Information Technology branch of the Association of German Electrical Engineers (VDE ITG).
Suhadi Suhadi received the B.S. degree in electrical engineering from Bandung Institute of Technology, Bandung, Indonesia, in 1999 and the M.S. degree in information and communication systems from the Technical University of Hamburg-Harburg, Hamburg, Germany, in 2003. He is currently pursuing the Ph.D. degree in the area of speech enhancement and noise reduction at the Institute for Communications Technology, Braunschweig Technical University, Braunschweig, Germany.
Sorel Stan (M’92–SM’05) received the Dipl.-Ing. degree in electronics and telecommunications from the University Politehnica of Bucharest, Bucharest, Romania, in 1992, with a research thesis prepared at the Politecnico di Torino, Turin, Italy, under a TEMPUS scholarship, and the Ph.D. degree in computational and applied mathematics from the University of the Witwatersrand, Johannesburg, South Africa, in 1999. From 1992 to 1994, he was an Assistant Professor at the University Politehnica of Bucharest in the field of pattern recognition and neural networks. From 1995 to 1998, he was a Scientific Consultant for the Anglo-American Corporation of South Africa working on remote sensing, hyperspectral image processing, inverse problems, and data fusion. In 1998, he joined the German Aerospace Center (DLR), Oberpfaffenhofen, Germany, as a Research Scientist, where he was involved in the preparation of the Synthetic Radar Topography Mission of the Space Shuttle Endeavour (STS-99) jointly organized by NASA and DLR. In 2000, he joined the R&D Labs of Siemens AG, Mobile Devices, Munich, Germany, where he worked until 2005 on speech signal processing with an emphasis on robust speech and speaker recognition, multilingual speech synthesis, and language identification. During 2006, Dr. Stan was a Technology Manager at BenQ Mobile, Munich, Germany. Since 2006, he has been with the Munich office of the Bardehle Pagenberg Intellectual Property Law Firm as a Technology Expert advising on the prosecution and litigation of patents in wireless communications and computer technology.