Noise Codebook Adaptation for Codebook-Based Noise Reduction

Tobias Rosenkranz
Siemens Audiologische Technik GmbH, 91058 Erlangen, Germany
Email: [email protected]

Abstract—In this paper, a codebook-based noise reduction algorithm is presented. Whereas known methods utilize autoregressive (AR) modeling of the spectral envelopes, cepstral modeling is preferred here; a comparison of both methods is given. Furthermore, a novel, robust method for adapting the noise codebooks is introduced, which builds on the cepstral envelope modeling. A perceptual validation of the method is provided.

I. INTRODUCTION

Conventional single-channel noise reduction algorithms typically have problems with non-stationary noise. For example, the well-known minimum statistics approach [1] estimates the noise by searching for minima in the spectrum of the noisy input signal within a search frame of a certain duration. Because this duration is typically around one second, fast changes of the noise characteristics cannot be followed.

Codebook-based approaches [2]–[4] try to overcome this problem by incorporating a priori knowledge about speech and different noise types. The methods mentioned perform a joint estimation of the speech and noise spectra on a frame-by-frame basis. The estimates are taken from predefined sets of typical speech and noise spectra, which are stored in codebooks. The frames are typically 20 to 40 ms long, so that fast fluctuations of the signal characteristics can be followed almost instantaneously. However, these methods rely on the accuracy of the underlying models (i.e., the codebooks). A model mismatch or inaccurate training data used for the generation of the codebooks leads to a noticeable loss of performance. In this paper, a method for dynamically adapting the noise codebooks is presented, which has the potential to lessen these problems.

This paper is organized as follows: In Sect. II the basics of the underlying method are explained in detail. Sect. III compares the AR-parameter-based envelope modeling used in [2], [3] with the cepstrum-based envelope modeling used in this publication. Sect. IV introduces the proposed adaptation of the noise codebooks, Sect. V presents perceptual test results, and Sect. VI concludes the paper.
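The minimum-statistics idea sketched above — tracking per-bin spectral minima over a search frame of roughly one second — can be illustrated with a toy sliding-window tracker. This is a simplified sketch only: the method of [1] additionally applies optimal smoothing and a bias correction, and all names below are illustrative.

```python
from collections import deque

def rolling_min_noise_psd(frames, search_len=50):
    """Toy sliding-window minimum tracker per frequency bin.

    `frames` is a sequence of per-frame power spectra (lists of equal
    length); `search_len` plays the role of the ~1 s search frame from
    [1]. For every frame, the per-bin minimum over the most recent
    `search_len` frames is returned as the noise estimate.
    """
    history = deque(maxlen=search_len)
    estimates = []
    for frame in frames:
        history.append(frame)
        estimates.append([min(h[b] for h in history)
                          for b in range(len(frame))])
    return estimates
```

Because each estimate lags behind by up to `search_len` frames, a sudden noise burst only enters the estimate once it becomes the window minimum — which is exactly the sluggishness the codebook-based approach is designed to avoid.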

II. MODEL-BASED SIGNAL ESTIMATION

A. Signal Model

An additive noise model

    x(k) = s(k) + n(k)    (1)

is considered, where k is the discrete time index, s(k) and n(k) denote the clean speech and the noise signal (which may be colored and non-stationary), respectively, and x(k) is the noisy input (i.e., the microphone signal). Speech and noise are assumed to be uncorrelated. Therefore, the power spectrum of the noisy input can be expressed as the sum of the speech and noise components:

    S_{xx}(e^{j\Omega}) = S_{ss}(e^{j\Omega}) + S_{nn}(e^{j\Omega}).    (2)

Spectral envelopes are considered, which are denoted by a downwards pointing hat \check{\cdot}. These envelopes can be decomposed into a gain-normalized spectral shape \bar{S} and a gain \sigma^2. Consequently, the speech and noise spectral envelopes can be written as

    \check{S}_{ss}(e^{j\Omega}) = \sigma_s^2 \bar{S}_{ss}(e^{j\Omega}),    (3a)
    \check{S}_{nn}(e^{j\Omega}) = \sigma_n^2 \bar{S}_{nn}(e^{j\Omega}),    (3b)

and, analogously to (2), the envelope of the noisy input is

    \check{S}_{xx}(e^{j\Omega}) = \check{S}_{ss}(e^{j\Omega}) + \check{S}_{nn}(e^{j\Omega}).

In this paper, parametric representations of these spectral shapes are used. The parameters are denoted by \theta_s and \theta_n for speech and noise, respectively, and can be either AR parameters or cepstral coefficients (CC, see Sect. III).

B. Codebook Generation

Codebooks for speech and different noise classes are generated from training data, which is assumed to be representative of the respective signals. The codebook generation comprises segmenting the training data into blocks of equal length, computing the corresponding parameter vector for each block, and applying the LBG algorithm [5] to the whole set of parameter vectors. Thus, the codebooks can also be regarded as vector quantizers. Since each parameter vector of the codebooks corresponds to a spectral shape, the codebooks contain typical spectral shapes of the respective signals.

C. Joint Speech and Noise Estimation

Fig. 1. Illustration of codebook-based joint speech- and noise-estimation. (Block diagram: speech and noise codebooks feed a gain adaptation stage, followed by likelihood computation on the noisy input.)
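The codebook generation described in Sect. II-B can be sketched as follows. This is a minimal, pure-Python rendition of the LBG splitting-and-refinement procedure [5], not the authors' implementation; the function name and defaults are illustrative.

```python
def lbg_codebook(vectors, size, eps=1e-3, iters=20):
    """Minimal LBG vector quantizer design (splitting + k-means refinement).

    `vectors` is a list of parameter vectors (e.g. cepstral coefficients
    per training block); `size` should be a power of two. Returns `size`
    codebook entries, i.e. typical spectral shapes of the training data.
    """
    dim = len(vectors[0])
    centroid = [sum(v[d] for v in vectors) / len(vectors) for d in range(dim)]
    codebook = [centroid]
    while len(codebook) < size:
        # split every centroid into a slightly perturbed pair
        codebook = [[c + s * eps for c in entry]
                    for entry in codebook for s in (+1.0, -1.0)]
        for _ in range(iters):
            # nearest-neighbour assignment (squared Euclidean distance)
            cells = [[] for _ in codebook]
            for v in vectors:
                i = min(range(len(codebook)),
                        key=lambda i: sum((a - b) ** 2
                                          for a, b in zip(v, codebook[i])))
                cells[i].append(v)
            # centroid update; empty cells keep their old entry
            codebook = [[sum(v[d] for v in cell) / len(cell) for d in range(dim)]
                        if cell else entry
                        for entry, cell in zip(codebook, cells)]
    return codebook
```

Since each resulting entry is the centroid of a Voronoi cell of training vectors, the relative cell sizes directly provide the a priori probabilities used later in the MMSE estimation.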

In Fig. 1, the basic principle of the codebook-based estimation scheme is illustrated. A joint speech and noise spectral estimation is performed, where the estimates are restricted to be elements of the trained codebooks. The estimation procedure is as follows: For each possible pair of codebook combinations (i.e., one entry each from the speech and noise codebooks), a gain adaptation is performed so that the sum of the gain-adapted speech and noise envelopes minimizes a spectral distance measure between this sum and the envelope of the noisy input [6]. Then, the likelihood of the pair having produced the observation (i.e., the noisy input) is computed. For computing the likelihood, it is assumed that the noisy input follows a multivariate normal distribution, i.e.,

    p(x|\theta) = \frac{1}{(2\pi)^{K/2} \det\{R_x\}^{1/2}}\, e^{-\frac{1}{2} x^T R_x^{-1} x},    (4)

where x = [x_1, \ldots, x_K]^T denotes a block of noisy speech samples of length K and \theta = [\theta_s, \theta_n, \sigma_s^2, \sigma_n^2]^T denotes a vector containing the speech and noise parameters as well as the speech and noise gains \sigma_s^2 and \sigma_n^2. The covariance matrix R_x of the noisy input can be expressed as the sum of the speech and noise covariance matrices, since the signals are uncorrelated; thus R_x = R_s + R_n.

Equation (4) must be evaluated for each possible pair of codebook combinations, i.e., for each \theta^{ij} = [\theta_s^i, \theta_n^j, \hat{\sigma}_s^{ij,2}, \hat{\sigma}_n^{ij,2}]^T, where \theta_s^i and \theta_n^j denote the i-th speech codebook entry and the j-th noise codebook entry, respectively, and \hat{\sigma}_s^{ij,2}, \hat{\sigma}_n^{ij,2} are the estimated gains (estimation takes place in the "gain adaptation" block in Fig. 1), which are calculated according to [6]. As mentioned earlier, the parameters \theta_s^i and \theta_n^j correspond to gain-normalized spectral shapes, denoted by \bar{S}_{ss}^i(e^{j\Omega}) and \bar{S}_{nn}^j(e^{j\Omega}). The spectral envelopes corresponding to \theta^{ij} are therefore \check{S}_{ss}^i(e^{j\Omega}) = \hat{\sigma}_s^{ij,2} \bar{S}_{ss}^i(e^{j\Omega}) and \check{S}_{nn}^j(e^{j\Omega}) = \hat{\sigma}_n^{ij,2} \bar{S}_{nn}^j(e^{j\Omega}). Equation (4) can then be written using these envelopes: For large K, the covariance matrices can be approximated by circulant matrices by adding entries in the upper right and lower left corners. The Fourier transform of these matrices yields the corresponding power spectra, which are approximated by their spectral envelopes. See [2] for a more detailed derivation. Thus, the logarithm of (4), i.e., the log-likelihood function for each codebook pair, can be written as

    \ln p(x|\theta^{ij}) \approx -\frac{K}{2}\ln 2\pi - \frac{K}{4\pi}\int_0^{2\pi} \ln \check{S}_{xx}^{ij}(e^{j\Omega})\,d\Omega - \frac{K}{4\pi}\int_0^{2\pi} \frac{S_{xx}(e^{j\Omega})}{\check{S}_{xx}^{ij}(e^{j\Omega})}\,d\Omega,    (5)
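A discretized version of the log-likelihood (5), sampling the two integrals on N DFT bins, might look as follows. This is a sketch with illustrative names; the gain adaptation of [6] is assumed to have already produced the two envelopes.

```python
import math

def log_likelihood(sxx, s_env_speech, s_env_noise, K):
    """Discretised version of the log-likelihood (5), assuming all three
    spectra are sampled on the same N DFT bins.

    `sxx` is the observed noisy power spectrum; the two envelopes are the
    gain-adapted codebook-pair estimates, whose sum approximates the
    envelope of the noisy input. Each integral over [0, 2*pi) becomes a
    sum weighted by 2*pi/N, so K/(4*pi) * (2*pi/N) = K/(2*N).
    """
    N = len(sxx)
    sxx_hat = [s + n for s, n in zip(s_env_speech, s_env_noise)]
    log_det = sum(math.log(e) for e in sxx_hat) / N   # mean log-envelope
    ratio = sum(o / e for o, e in zip(sxx, sxx_hat)) / N  # mean Sxx/Sxx_hat
    return -K / 2.0 * (math.log(2.0 * math.pi) + log_det + ratio)
```

Note that for fixed observed spectrum the expression is maximized when the estimated envelope matches the observation bin by bin, which is what makes the likelihood a useful codebook-pair score.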

where \check{S}_{xx}^{ij}(e^{j\Omega}) = \check{S}_{ss}^i(e^{j\Omega}) + \check{S}_{nn}^j(e^{j\Omega}) denotes the estimated spectral envelope of the noisy input, which equals the sum of the estimates of the speech and noise components.

Based on the likelihood values of each codebook pair, maximum likelihood (ML) or minimum mean-square error (MMSE) estimates of the real parameters (and therefore of the real spectral envelopes) of speech and noise can be obtained. The ML estimate is

    \hat{\theta}_{ML} = \underset{\theta}{\operatorname{argmax}}\; p(x|\theta).    (6)

In the ML estimation scheme, the estimate \hat{\theta}_{ML} comprises one pair of codebook entries together with the respective gains.

The MMSE estimate is given as the conditional expectation \hat{\theta}_{MMSE} = E\{\theta|x\}, which can be written using Bayes' rule as

    \hat{\theta}_{MMSE} = \int_\Theta \theta\, p(\theta|x)\,d\theta = \int_\Theta \theta\, \frac{p(x|\theta)\, p(\theta)}{p(x)}\,d\theta.    (7)

This integral can be approximated in the discrete case as

    \hat{\theta}_{MMSE} = \frac{1}{N_s N_n} \sum_{i=1}^{N_s} \sum_{j=1}^{N_n} \theta^{ij}\, \frac{p(x|\theta^{ij})\, p(\theta_s^i)\, p(\theta_n^j)}{p(x)},    (8)

with

    p(x) = \frac{1}{N_s N_n} \sum_{i=1}^{N_s} \sum_{j=1}^{N_n} p(x|\theta^{ij})\, p(\theta_s^i)\, p(\theta_n^j),    (9)
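The discrete MMSE estimate (8)/(9) amounts to a posterior-weighted average over all codebook pairs. A sketch with illustrative names follows; the log-sum-exp step is an added numerical-stability measure (the paper does not discuss it), needed because the log-likelihoods from (5) scale with the block length K. The 1/(N_s N_n) factors cancel between (8) and (9).

```python
import math

def mmse_estimate(thetas, loglikes, priors_s, priors_n):
    """Posterior-weighted average over all speech/noise codebook pairs.

    `thetas[i][j]` is the parameter vector of pair (i, j), `loglikes[i][j]`
    its log-likelihood from (5); `priors_s`/`priors_n` are the a priori
    probabilities of the speech and noise entries.
    """
    Ns, Nn = len(thetas), len(thetas[0])
    logw = [[loglikes[i][j] + math.log(priors_s[i]) + math.log(priors_n[j])
             for j in range(Nn)] for i in range(Ns)]
    m = max(max(row) for row in logw)
    w = [[math.exp(lw - m) for lw in row] for row in logw]
    z = sum(sum(row) for row in w)          # proportional to p(x), cf. (9)
    dim = len(thetas[0][0])
    return [sum(w[i][j] * thetas[i][j][d]
                for i in range(Ns) for j in range(Nn)) / z
            for d in range(dim)]
```

Because of the steepness of (5) noted later in Sect. V, the posterior weights usually concentrate on a single pair, which is why the MMSE and ML estimates behave almost identically in practice.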

where N_s and N_n denote the speech codebook size and the noise codebook size, respectively. The terms p(\theta_s^i) and p(\theta_n^j) represent the a priori distributions of the speech and noise parameters, respectively, which are proportional to the number of training vectors in the respective Voronoi cells.

Finally, estimates of the speech and noise spectral envelopes are obtained by converting the parameter estimate \hat{\theta}_{\{ML,MMSE\}} into spectral envelopes. Noise reduction is performed by applying a Wiener filter comprising the estimated envelopes of speech and noise,

    \hat{H}(e^{j\Omega}) = \frac{\hat{S}_{ss}(e^{j\Omega})}{\hat{S}_{ss}(e^{j\Omega}) + \hat{S}_{nn}(e^{j\Omega})},    (10)

where \hat{S}_{ss}(e^{j\Omega}) and \hat{S}_{nn}(e^{j\Omega}) denote the estimates of the speech envelope and the noise envelope, respectively, which are obtained from the estimated parameter set.

The method is computationally very demanding because (5) has to be computed for a large number of codebook pairs. However, it is in principle possible to parallelize the algorithm, so that a large amount of computation time can be saved.

III. ENVELOPE MODELING

In previous work on the topic, AR modeling has been used to model the spectral envelopes of speech and noise. The AR envelope of a speech signal block (all considerations hold analogously for the noise) is given as

    \check{S}_{ss}(e^{j\Omega}) = \frac{\sigma_s^2}{|A_s(e^{j\Omega})|^2},    (11)

with

    A_s(e^{j\Omega}) = \sum_{l=0}^{p} a_{s,l}\, e^{-j\Omega l},    (12)

where the a_{s,l} are the parameters of a linear predictor of order p, which are computed using the Levinson-Durbin recursion. These parameters are referred to as AR parameters. The vectors \theta_s, \theta_n introduced in Sect. II-A may consist of AR parameters.

In this work, however, cepstrum-based envelope modeling is used. The q-th cepstral coefficient of a speech signal block is given as

    c_{s,q} = \frac{1}{N} \sum_{\mu=0}^{N-1} \log(|S_\mu|)\, e^{j\frac{2\pi}{N}\mu q},    (13)

where S_\mu is the N-point DFT of the speech signal and \mu is the discrete frequency index. It can be seen that the cepstral transform performs a Fourier analysis of the log-spectrum along the frequency axis. Thus, the higher coefficients represent spectral components that vary quickly with frequency and therefore model the spectral fine structure. The lower coefficients, on the other hand, represent the rough spectral structure and thus the spectral envelope of a signal.¹ The coefficient c_{s,0} accounts for the DC component and is therefore related to the signal power or the signal gain; the relation to the gain factor is c_{s,0} = \log \sigma_s^2. In order to extract the spectral envelope, all coefficients beyond the model order are set to zero; for the spectral shape, c_{s,0} is set to zero as well. Transforming the nulled cepstrum back into the frequency domain yields the spectral envelope or shape. Therefore, the vectors \theta_s, \theta_n can also contain these non-zero cepstral coefficients.

¹ In a strict sense, this envelope does not match the classic definition of an envelope, meaning that it does not touch the maxima of the spectrum. In this paper, the term "envelope" is used for a spectrum that does not contain the spectral fine structure.
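The cepstral envelope extraction described above — computing (13), zeroing all coefficients beyond the model order, and transforming back — can be sketched as follows. A plain O(N²) DFT is used for clarity (a real system would use an FFT), and the function name is illustrative; the mirrored coefficients are kept so that the reconstructed log-envelope stays real.

```python
import cmath
import math

def cepstral_envelope(power_spectrum, order):
    """Cepstrum-based spectral envelope following (13).

    `power_spectrum` holds |S_mu|^2 on N DFT bins; `order` is the cepstral
    model order. Returns the power envelope on the same N bins.
    """
    N = len(power_spectrum)
    log_mag = [0.5 * math.log(p) for p in power_spectrum]   # log |S_mu|
    # (13): DFT of the log-spectrum along the frequency axis
    ceps = [sum(log_mag[mu] * cmath.exp(2j * math.pi * mu * q / N)
                for mu in range(N)) / N for q in range(N)]
    # lifter: keep c_0..c_order plus the mirrored coefficients c_{N-order}..
    lifted = [c if (q <= order or q >= N - order) else 0.0
              for q, c in enumerate(ceps)]
    # transform the nulled cepstrum back to the (log-)frequency domain
    log_env = [sum(lifted[q] * cmath.exp(-2j * math.pi * mu * q / N)
                   for q in range(N)).real for mu in range(N)]
    return [math.exp(2.0 * le) for le in log_env]           # back to power
```

Choosing `order` below the quefrency of the pitch period is precisely what keeps the pitch harmonics out of the envelope, as discussed for Fig. 2.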

Fig. 2. Comparison of cepstrum-based and AR-parameter-based envelope modeling. (Plot: power spectrum of a voiced speech frame together with its cepstral envelope and AR envelope; level in dB versus frequency from 0 to 8000 Hz.)
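For comparison, the AR envelope of (11)/(12) can be prototyped with the Levinson-Durbin recursion mentioned in Sect. III. This is a minimal sketch with illustrative names, assuming the autocorrelation sequence of the signal block is already available.

```python
import cmath
import math

def levinson_durbin(r, p):
    """Levinson-Durbin recursion: autocorrelation r[0..p] -> prediction
    error filter coefficients a[0..p] (a[0] = 1) and residual variance,
    which plays the role of sigma_s^2 in (11).
    """
    a = [0.0] * (p + 1)
    a[0] = 1.0
    err = r[0]
    for m in range(1, p + 1):
        k = -(r[m] + sum(a[l] * r[m - l] for l in range(1, m))) / err
        new_a = a[:]
        for l in range(1, m):
            new_a[l] = a[l] + k * a[m - l]
        new_a[m] = k
        a = new_a
        err *= (1.0 - k * k)
    return a, err

def ar_envelope(a, err, n_bins):
    """Evaluate (11): err / |sum_l a_l e^{-j Omega l}|^2 on n_bins."""
    env = []
    for mu in range(n_bins):
        w = 2.0 * math.pi * mu / n_bins
        A = sum(al * cmath.exp(-1j * w * l) for l, al in enumerate(a))
        env.append(err / abs(A) ** 2)
    return env
```

Because the all-pole model concentrates its poles near the strongest spectral peaks, this envelope exhibits exactly the pitch-dependent maxima-hugging behavior discussed for Fig. 2.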

In Fig. 2, a comparison of CC- and AR-envelope modeling is given. The gray solid line shows the power spectrum of a frame of voiced speech. One can see that the AR envelope (black dash-dotted line) tends to model the spectral maxima, while the CC envelope (black dashed line) lies in the middle between the spectral minima and maxima. The tendency of the AR envelope to model the spectral maxima causes problems when dealing with voiced speech, because the envelope then depends to a certain degree on the pitch frequency. As can be seen in Fig. 2, two maxima of the AR envelope coincide with the 2nd and 4th harmonic of the speech power spectrum. This means that two signals with different pitch frequencies but otherwise identical spectral envelopes are modeled differently by the AR model. This is not the case for the CC model, as long as the model order is chosen small enough that the cepstral coefficient corresponding to the pitch frequency is not included in the model. The advantage of the CC model is therefore that spectral envelope and pitch are strictly separated, and thus a more accurate model of the spectral envelope is obtained.

The property of the CC model to lie in the middle between the spectral minima and maxima, however, poses a problem for the noise reduction task. Consider a Wiener filter computed with spectral envelopes according to (10), applied to a frame of voiced speech corrupted by noise. Since the AR model tends to touch the spectral maxima, the filter cannot attenuate between the pitch harmonics, so there will be residual noise. With the CC model there will be less residual noise, but the speech spectrum is distorted instead, since the parts of the pitch harmonics that lie above the CC envelope are attenuated. This speech distortion is perceived as more disturbing than the residual noise introduced by the AR model.

To compensate for both the residual noise and the speech distortion, appropriate pitch-adaptive comb filters \hat{P}(e^{j\Omega}) have to be multiplied with the speech envelopes so that the harmonic structure is restored. This is particularly important for the CC model because of the introduced speech distortion. Therefore, in a real system a pitch estimator is needed in order to restore the harmonicity of voiced speech.

IV. ADAPTATION OF THE NOISE CODEBOOKS

So far, only static codebooks have been considered. However, with real signals it may happen that the noise cannot be modeled sufficiently well. On the one hand, one is interested in keeping the noise codebooks as small as possible because
• the complexity has to be kept as low as possible, and
• more codebook entries lead to more ambiguities, which degrades estimation performance [6].
On the other hand, the codebooks should model every variation of the noise. But it is not always achievable to cover all possible variations of a noise class with just a few codebook entries. Moreover, composing a balanced ensemble of training data may be an impossible task. It is therefore advantageous to make the noise codebooks adaptive.

If, e.g., the acoustic conditions (e.g., the distance between microphone and noise source) differ between training and testing, then a difference between the model and the actual noise is observed. Such a difference can be modeled as a linear filtering in many cases. If it were possible to identify that filter, one could modify the codebook entries by linear filtering and thus achieve a better model of the current noise situation. It is especially efficient to do this modification in the cepstral domain, because there a linear filtering corresponds to a simple addition.

Fig. 3. Illustration of the codebook shift (axes: Parameter 1 vs. Parameter 2). Small blue dots indicate training vectors, small red dots indicate the actual noise. Big green dots indicate codebook vectors; the big green cross indicates the center of gravity of the codebook. The red cross indicates the long-term noise estimate. The codebook vectors after the shift are shown in red.
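Since a linear filtering is a simple addition in the cepstral domain, the codebook adaptation reduces to vector arithmetic. A minimal sketch (illustrative names, assuming the codebook entries and the long-term noise estimate are given as cepstral coefficient vectors of equal dimension):

```python
def shift_codebook(codebook, longterm_cepstrum):
    """Cepstral-domain codebook shift as sketched in Fig. 3.

    The difference between the long-term noise estimate and the codebook's
    center of gravity identifies the linear filter; in the cepstral domain
    it is simply added to every codebook entry.
    """
    dim = len(longterm_cepstrum)
    centre = [sum(entry[d] for entry in codebook) / len(codebook)
              for d in range(dim)]
    shift = [longterm_cepstrum[d] - centre[d] for d in range(dim)]
    return [[entry[d] + shift[d] for d in range(dim)] for entry in codebook]
```

After the shift, the center of gravity of the codebook coincides with the long-term estimate, while the relative spread of the entries — the modeled short-term variation — is preserved.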

To identify this filter, it is proposed to use a long-term noise estimator (e.g., minimum statistics). This is illustrated in Fig. 3. The small blue dots correspond to training vectors (for convenience, only two dimensions have been plotted), while the big green dots indicate codebook vectors. The center of gravity of all training vectors is also the temporal mean of the parametric representation (i.e., the spectral shape) of the noise training data. It is depicted by a green cross. Furthermore, it can be assumed that the parametric representation of the spectral shape of a long-term noise estimate (red cross) lies at the temporal mean of the actual noise (small red dots). The filter that has to be identified is then the difference between the center of gravity of the training vectors and the long-term noise estimate.

The adaptation can be performed directly by a shift of the codebook vectors if the parameters are cepstral coefficients. This is not the case with AR parameters, because a shift in the AR domain does not correspond to a linear filtering. There, it would be necessary to go into the log-spectral domain, perform the shift, and go back to the AR domain. However, it is not guaranteed that there is an equivalent of the shifted envelope in the AR domain. Thus, the cepstral domain is preferred. Fig. 4 shows an overview of the whole codebook-based noise reduction system, including codebook shift and pitch-adaptive comb filtering.

V. TEST RESULTS

A perceptual evaluation of the method was performed in the form of A/B comparisons. The test persons were presented with two differently processed sound examples together with the unprocessed reference. The overall preference as well as speech and noise artifacts were each given a rating between −2 and +2 in steps of 1, where −2 meant "A much better

than B" and +2 meant "B much better than A". Three algorithms were used: a state-of-the-art long-term noise estimator based on minimum statistics, the codebook-based algorithm (using the CC model) with static codebooks, and the codebook-based algorithm with adaptive codebooks. For the codebook-based algorithms, perfect pitch information was provided. Test data was not included in the training data. A total of 8 different sound examples was processed, with 2 different speakers (male/female), 2 different signal-to-noise ratios (10 dB and 0 dB), and 2 different noises: one rather stationary street noise (Sig. 1) and one rather non-stationary street noise with cars passing by very quickly (Sig. 2). The MMSE estimation scheme was used; however, the results from the ML estimation scheme are almost identical, which is due to the steepness of (5). The relevant parameters are listed in Tab. I.

Fig. 4. Overview block diagram of codebook-based noise reduction. In this figure, X(e^{j\Omega}) denotes the input spectrum, \hat{S}_{nn}^{LT}(e^{j\Omega}) is the long-term noise estimate, and \hat{H}(e^{j\Omega}) is the noise reduction Wiener filter. (Blocks: minimum statistics drives the shift of the noise codebook; the speech and noise codebooks feed the envelope estimation, yielding \hat{S}_{ss}(e^{j\Omega}) and \hat{S}_{nn}(e^{j\Omega}); pitch/voicing estimation drives the comb filter \hat{P}(e^{j\Omega}); a gain rule produces \hat{H}(e^{j\Omega}).)

TABLE I
RELEVANT PARAMETERS.

    Speech codebook size Ns     1024 entries
    Noise codebook size Nn      4 entries
    Speech-model order p        16
    Noise-model order q         10
    Blocklength K               1024 samples
    Overlap K − R               896 samples
    DFT-length N                512 samples
    Sampling frequency fs       16 kHz

TABLE II
MEDIAN AND MEAN SCORES OF A/B-COMPARISON, CODEBOOK-BASED ALGORITHM VS. MINIMUM STATISTICS (RESULTS WERE OBTAINED WITH 15 SUBJECTS).

                            10 dB SNR        0 dB SNR
                           Median   Mean    Median   Mean
    Sig. 1  Preference      ±0.0   +0.22     ±0.0   +0.17
            Speech Art.     −1.0   −0.67     −1.0   −0.89
            Noise Art.      −1.0   −0.78     −1.0   −0.56
    Sig. 2  Preference      −2.0   −1.72     −2.0   −1.78
            Speech Art.     −1.0   −0.83     −0.5   −0.44
            Noise Art.      −1.5   −1.44     −1.0   −1.28

TABLE III
MEDIAN AND MEAN SCORES OF A/B-COMPARISON, ADAPTIVE VS. STATIC CODEBOOKS (RESULTS WERE OBTAINED WITH 9 SUBJECTS).

                            10 dB SNR        0 dB SNR
                           Median   Mean    Median   Mean
    Sig. 1  Preference      −1.0   −0.63     −1.0   −0.83
            Speech Art.     ±0.0   +0.08     ±0.0   −0.58
            Noise Art.      −1.0   −0.33     ±0.0   −0.79
    Sig. 2  Preference      −1.0   −0.71     −0.5   −0.50
            Speech Art.     −1.0   +0.17     ±0.0   ±0.00
            Noise Art.      −1.0   −0.13     ±0.0   −0.25

The results are summarized in Tab. II and Tab. III. One can see that especially in the non-stationary scenario the codebook-based method is strongly preferred over the minimum statistics algorithm, while it is only slightly preferred in the rather stationary scenario. This is because the minimum statistics approach is not able to track the fast noise variations. Tab. III shows that the adaptive version of the codebook-based method is preferred especially in the rather stationary case. This is because the adaptation relies on an accurate long-term noise estimate, which is of course better if the noise is stationary. It has to be noted that the perfect pitch information greatly improves performance. State-of-the-art pitch estimators do not provide sufficient accuracy for signal-to-noise ratios below about 10 dB. Thus, the 10 dB scenario can be regarded as a realistic case, whereas the 0 dB scenario shows the potential of codebook-based algorithms if an accurate pitch estimate were available.

VI. CONCLUSION

In this paper, a codebook-based noise reduction algorithm was presented which is based on previous work on the topic. It was shown that cepstral envelope modeling enables a simple and efficient codebook adaptation based on a long-term noise estimate. It was further shown that the codebook adaptation leads to a performance gain especially for stationary noise signals, and that the codebook-based algorithm is superior to state-of-the-art methods especially for highly non-stationary noise. Future work includes investigating methods that do not need pitch information as well as methods that reduce complexity.

ACKNOWLEDGMENT

This work is part of the project "Model-Based Hearing Aid", which was partially funded by the Bundesministerium für Bildung und Forschung.

REFERENCES

[1] R. Martin, "Noise Power Spectral Density Estimation Based on Optimal Smoothing and Minimum Statistics," IEEE Trans. Speech Audio Process., vol. 9, no. 5, pp. 504–512, Jul. 2001.
[2] M. Kuropatwinski and W. B. Kleijn, "Estimation of the Short-Term Predictor Parameters of Speech Under Noisy Conditions," IEEE Trans. Audio, Speech, Language Process., vol. 14, no. 5, pp. 1645–1655, Sep. 2006.
[3] S. Srinivasan, J. Samuelsson, and W. B. Kleijn, "Codebook-Based Bayesian Speech Enhancement for Nonstationary Environments," IEEE Trans. Audio, Speech, Language Process., vol. 15, no. 2, pp. 441–452, Feb. 2007.
[4] T. Rosenkranz, "Modeling the Temporal Evolution of LPC Parameters for Codebook-Based Speech Enhancement," in International Symposium on Image and Signal Processing and Analysis, Salzburg, Sep. 2009, pp. 455–460.
[5] Y. Linde, A. Buzo, and R. M. Gray, "An Algorithm for Vector Quantizer Design," IEEE Trans. Commun., vol. COM-28, no. 1, pp. 84–95, Jan. 1980.
[6] S. Srinivasan, J. Samuelsson, and W. B. Kleijn, "Codebook Driven Short-Term Predictor Parameter Estimation for Speech Enhancement," IEEE Trans. Audio, Speech, Language Process., vol. 14, no. 1, pp. 163–176, Jan. 2006.