Noise Reduction on Speech Codec Parameters
Nicolas Duetsch, Hervé Taddei, Christophe Beaugeant and Tim Fingscheidt
Technical University Munich, Arcisstr. 21, 80333 Munich, Germany Email:
[email protected] Siemens AG ICM MP, Haidenauplatz 1, 81675 Munich, Germany Email:
[email protected] Abstract— When transmitting speech in packet networks system, a codec is used to compress. In order to improve the signal quality, background noise is attenuated. Noise reduction can either be done as preprocessing before speech encoding or in the network. In that case the bitstream is decoded, speech enhancement is performed in the time and/or frequency domain and the processed signal is re-encoded. Both methods are computationally expensive. In this paper an approach to reduce environmental background noise by modifying the codec parameters is discussed. It will be explained which speech codec parameters are important with regard to noise reduction and a method to adapt these parameters will be proposed.
I. INTRODUCTION

Where phones are used, the energy of the background noise can be very high. This is especially the case for cellular phones, which are often used in very noisy environments, such as in cars or at crowded places. Speech codecs are usually not very robust when encoding speech in noisy environments. Therefore, noise reduction is most of the time done at the device microphone before encoding the speech [1]. A few recent studies have focused on the interaction between noise reduction and speech coding [2], [3] to enhance the global performance of the combined noise reduction / speech coding system. These studies, however, are limited to the interaction between two independent blocks. A further possibility is to consider embedded solutions, where noise reduction is really integrated into the speech codec itself [4]. Such embedded solutions allow applying noise reduction in the network by working directly on the transmitted codec parameters. In this article, we investigate such embedded systems. After introducing in section II the so-called "(speech) codec parameters", we present an experiment that allows us to figure out which parameters have an influence on the surrounding noise level. By replacing parameters from a noisy signal with the ones obtained from a less noisy signal, as described in section III, we show that certain parameters have an important influence on the surrounding noise level. Accordingly, in section IV we focus our study on attempts to adapt these parameters to obtain an efficient noise reduction through processing of the speech codec parameters.

II. THE AMR CODEC

Speech codecs can be classified into three types: waveform-based codecs (e.g. G.711), parameter-based
codecs (e.g. vocoders) and hybrid codecs (e.g. CELP, RELP) [5]. In mobile communications, air interface transmission requires a low bit rate, while the end-user asks for high intelligibility of the transmitted speech. These requirements are fulfilled by hybrid codecs. 3GPP (3rd Generation Partnership Project) chose the CELP-based AMR (Adaptive Multi Rate) codec as the mandatory speech codec for UMTS and also GSM for coding of speech at 8 kHz sampling frequency. This speech codec consists of a multirate speech coder, a source-controlled rate scheme including a Voice Activity Detection (VAD), a comfort noise generation system and an error concealment mechanism to compensate for the effects of transmission errors and packet loss [6]. In the following, a short introduction to the AMR speech coder is given. The AMR codec processes speech frames of 20 ms length. Each frame is divided into 4 subframes of equal length. The codec is based on the Code-Excited Linear Predictive (CELP) coding model using a 10th-order linear prediction filter. The filter coefficients, usually called Linear Prediction Coefficients (LPC), are computed for each frame by solving a linear system of equations using the Levinson-Durbin algorithm. The LPC coefficients are further quantized and transmitted as LSP (Line Spectral Pair) parameters to the decoder. After filtering the input speech signal by the LPC filter, a residual signal is obtained. This signal is transmitted to reconstruct the speech at the decoder. To do so, an adaptive codebook search is first performed on a subframe basis, leading to a pitch delay and an adaptive gain value. Using these parameters, a new residual signal is computed by subtracting the excitation of the adaptive codebook multiplied by the gain factor. The resulting signal is used to perform another codebook search. The resulting parameters of the latter search are the index of the algebraic (or fixed) codebook vector and the fixed gain value.
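As an illustration of the LPC analysis step described above, the Levinson-Durbin recursion can be sketched in plain Python. This is a textbook floating-point version operating on a precomputed autocorrelation sequence, not the fixed-point implementation used inside the AMR coder:

```python
def levinson_durbin(r, order):
    """Levinson-Durbin recursion: solve the LP normal equations.
    r: autocorrelation sequence r[0] .. r[order] of a windowed speech frame.
    Returns (a, e): prediction coefficients a[1..order] and the final
    prediction error energy e."""
    a = [0.0] * (order + 1)
    e = r[0]
    for i in range(1, order + 1):
        # Reflection coefficient for stage i
        acc = r[i]
        for j in range(1, i):
            acc -= a[j] * r[i - j]
        k = acc / e
        # Extend the order-(i-1) coefficient set to order i
        a_new = a[:]
        a_new[i] = k
        for j in range(1, i):
            a_new[j] = a[j] - k * a[i - j]
        a = a_new
        e *= (1.0 - k * k)  # prediction error energy shrinks at every stage
    return a[1:], e
```

For a 10th-order AMR-style analysis one would call `levinson_durbin(r, 10)` on the autocorrelation of each 20 ms frame.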
The following set of parameters is transmitted to the decoder: the LSP parameters, the pitch delay, the fixed codebook index, and both the fixed and adaptive gains. At the decoder, the received pitch index (adaptive codebook index) is used to find the fractional pitch lag. The adaptive excitation is found by interpolating the past excitation (at the pitch delay) using an FIR filter. The received algebraic codebook index identifies
the algebraic code vector. Then both gain factors are decoded. The codebook excitations are multiplied by their respective gains and summed up to form an excitation signal for the LPC synthesis filter. The LPC coefficients of this synthesis filter are obtained from the LSP parameters. Finally, a post-processing algorithm is applied to enhance the quality of the reconstructed speech. In the following, we call the transmitted vectors (LSP parameters, pitch delay, index of the fixed code vector and both fixed and adaptive gains) the "(speech) codec parameters".

III. REPLACEMENT OF CODEC PARAMETERS
A. Principle

In this section, we consider a basic end-to-end transmission of speech through a network using an AMR codec at both ends. The encoder transmits a bitstream every 20 ms to the receiver.
Fig. 1. Experimental setup for the exchange of codec parameters. [Diagram: at the near-end device, speech + noise at 10 dB SNR and speech + noise at 20 dB SNR are each encoded; both bitstreams travel over the network to the far-end device, where a modified decoder combines LPC coefficients, adaptive codebook vector / gain and fixed codebook vector / gain from the two bitstreams to produce the decoded speech.]
Fig. 1 shows the experimental setup of our test. Our goal is to figure out which codec parameters are the most useful to reduce the environmental noise. For this purpose, the AMR decoder was slightly modified such that it can read two bitstreams coming from files encoded with different noise levels. Then, according to the desired experiment, the decoder can use some of the parameters encoded with a high noise level and some parameters encoded with a low noise level. This permits us to evaluate the influence of the codec parameters on the noise level as well as on the quality of the decoded speech. To compute files with different Signal to Noise Ratio (SNR) levels, noise at different levels is added to the near-end speech using the ITU software tool library [7]. The energy of the speech is normalized to an active speech level of -26 dB, while the energy of the background noise is set to -36 dB and -46 dB, respectively. For the same speech and noise inputs, this leads to two files with SNRs of 10 dB and 20 dB, respectively.
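The level adjustment can be illustrated with a simple RMS-based scaling in Python. Note that the ITU STL actually measures the *active* speech level (per ITU-T P.56), so this sketch only approximates the normalization procedure used in the paper:

```python
import math

def rms_level_db(x):
    """RMS level in dB relative to full scale, for samples in [-1, 1]."""
    rms = math.sqrt(sum(s * s for s in x) / len(x))
    return 20.0 * math.log10(rms)

def scale_to_level(x, target_db):
    """Scale the signal so that its RMS level equals target_db."""
    gain = 10.0 ** ((target_db - rms_level_db(x)) / 20.0)
    return [s * gain for s in x]

# Speech scaled to -26 dB with noise at -36 dB yields a 10 dB SNR;
# noise at -46 dB yields a 20 dB SNR.
```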
The noisy speech files are encoded into bitstreams and then decoded. The decoded 10 dB file is considered as the reference noisy speech in our experiment. From the encoded bitstreams, certain parameters are extracted. We exchange parameter(s) of the 10 dB bitstream with the respective one(s) of the 20 dB bitstream, keeping the others unchanged. Accordingly, we obtain several decoded signals, whose properties can be evaluated to find out which parameters have the most influence with regard to noise level and speech quality.
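Conceptually, the modified decoder builds a hybrid parameter set for every frame. The sketch below uses illustrative field names and values, not the actual AMR bitstream layout:

```python
def exchange_parameters(noisy_frame, cleaner_frame, fields):
    """Build a hybrid parameter set: take `fields` from the 20 dB (cleaner)
    bitstream, everything else from the 10 dB (noisy) bitstream."""
    hybrid = dict(noisy_frame)
    for f in fields:
        hybrid[f] = cleaner_frame[f]
    return hybrid

# Hypothetical per-frame parameter sets (field names are illustrative)
frame_10db = {"lsp": (1, 2, 3), "pitch_delay": 40, "fixed_index": 17,
              "fixed_gain": 1200, "adaptive_gain": 0.8}
frame_20db = {"lsp": (1, 2, 4), "pitch_delay": 41, "fixed_index": 9,
              "fixed_gain": 700, "adaptive_gain": 0.7}

# Decode with the fixed gain taken from the less noisy bitstream
hybrid = exchange_parameters(frame_10db, frame_20db, ["fixed_gain"])
```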
B. Subjective listening test

In order to evaluate the effect of exchanging different parameters, several subjective tests were performed. A pre-selection was done through informal tests and showed that it is sufficient to focus our study on the exchange of the LPC coefficients, of the fixed codebook gain, or of both gains. More formal listening tests were conducted using a Comparison Category Rating (CCR) test [8]. Basically, in a CCR test, pairs of decoded speech files are played to naive listeners. After each pair, the listeners decide which file they prefer and give a score ranging from -3 to +3; -3 meaning "the first is much better than the second" and +3 "the second is much better than the first". 8 people took part in the test and listened to decoded / modified speech files (with car noise added) derived from four different speakers (two male, two female).

C. Results of the CCR test

The results of the listening test are depicted in Table I. In the left column one can see which parameters are exchanged in the first file of the test pair. "AMR codec" refers to the unchanged decoded speech file with noisy parameters (the 10 dB file). The substituted parameters of the second file of the test pair are listed in the first row of the table. The values in the table are averaged scores of our CCR test.

TABLE I. Results of the CCR listening test
              AMR codec   both gains   fixed gain
both gains      -0.14
fixed gain      +0.25       +0.36
LPC coeff.      -1.04       -0.98       -0.96
As an example, the -0.14 in the table (column "both gains", row "AMR codec") denotes that the listeners slightly preferred the unchanged decoded speech files to the modified files with replaced fixed and adaptive gains. By analyzing Table I we find the following ranking of the codec parameter compositions: 1) Exchange of the fixed codebook gain 2) Unchanged AMR codec (reference codec)
3) Modification of both gains 4) Replacement of the LPC coefficients. The subjective listening test clearly indicates that replacing the fixed codebook gain by the one extracted from a higher-SNR file gives better performance than replacing any other codec parameter. One advantage of changing this parameter is that the noise is effectively reduced while only a reasonable amount of distortion is introduced into the speech. This is due to the fact that the LPC filter basically models the formants, which characterize human speech. Thus the LPC-filtered residual is composed of a formant-less speech part and a noisy signal part, which is hardly affected by the LPC filter. Because of the structure of CELP-based codecs, this noisy residual is mainly taken into account during the fixed codebook search and not during the adaptive codebook search. As the vectors of the fixed codebook are all normalized, the fixed gain represents the level of the noise. Reducing this fixed gain in a proper manner will reduce the noise level.
The ranking of the listening test furthermore shows that an exchange of both codebook gains decreases the speech quality. This property is confirmed by our informal listening tests, where it was noticed that the modified speech gets very distorted when replacing both gains. This can be explained by the fact that the adaptive codebooks of the near-end and of the far-end diverge. This leads to a divergence between the excitation of the synthesis filter at the decoder and the excitation signal at the encoder side. Because of the relation between excitation signal and adaptive codebook, the difference between both excitation signals is amplified. It turned out that it is very difficult to reduce the noise by modifying the adaptive codebook gain or the pitch lag. These parameters are in practice very "sensitive", and their exchange creates important distortion on the speech.

By looking at the results in Table I, one can moreover see that a replacement of the LPC coefficients achieves a worse score. Such an exchange does not have any effect on the reduction of the noise level and introduces significant distortion to the speech. This can be explained by the fact that the excitation of the synthesis filter is not modified by an exchange of codec parameters, and thus the noisy part of the filtered signal is colored without any power attenuation. Besides, the speech part of the modified signal gets drastically distorted, because the mismatch of the LPC coefficients involves a wrong reconstruction of the formants. It turns out from the CCR test that this mismatch has a huge influence on the quality of speech (scores around -1 in Table I).

To summarize: from the results of the CCR listening test and from theoretical considerations, we can assess that the level of background noise is more or less represented by the gain of the fixed codebook. Therefore modifications of this gain could lead to noise reduction. Accordingly, in the next section we present algorithms dealing with fixed gain noise reduction.

IV. NOISE REDUCTION BY FIXED GAIN MODIFICATION

Basically, the fixed gain can be seen as a multiplicative factor applied to the noisy signal. We make the hypothesis that a parallel between this factor and the amplitude of the noise can be drawn, as suggested by the results of section III. The basic idea we developed is to modify the fixed gain according to an estimation of the fixed gain of the noise. Such considerations lead us to investigate an extrapolation of short-term spectral attenuation [9] to a fixed gain noise reduction.

Fig. 2. Action chart of the modification of the fixed gain. [Diagram: speech + noise is encoded; a fixed gain modification block, driven by a noise gain estimation that uses the VAD output of an encoder running on the clean speech, alters the fixed gain in the bitstream (LPC coefficients, adaptive codebook vector / gain and fixed codebook vector are left unchanged) before decoding.]

The principle of noise reduction using fixed gain modification is depicted in Fig. 2. We use for the modification a two step approach, extrapolated from the short-term spectral attenuation: 1) Estimation of the fixed gain of the noise, 2) Application of an attenuation rule to the fixed gain of the noisy speech signal.
In the following subsections, we describe the noise estimation and the attenuation rules, derived from noise reduction in the frequency domain.

A. Noise estimation

In order to modify the fixed gain g, an estimate ĝ_n of the fixed gain of the noise signal is needed. This gain is assumed to be stationary or slowly varying relative to the speech. The estimation is done during speech pauses, when only background noise is present. To do so, the VAD results obtained using the same speech file but without background noise are used. They deliver a "perfect" voice detection, as shown in Fig. 2. Thus, we are studying the feasibility of the noise reduction based on a fixed gain modification and disregarding any problems caused by the possible non-robustness of the VAD.
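A simple way to obtain such an estimate, assuming the ideal VAD flags described above, is a recursive average of the decoded fixed gain over noise-only subframes. The smoothing constant below is an assumption for illustration, not a value from the paper:

```python
def estimate_noise_gain(fixed_gains, vad_flags, alpha=0.95):
    """Track the fixed-codebook gain of the background noise.
    fixed_gains: decoded fixed gains, one per subframe.
    vad_flags: True where speech is active (ideal VAD from the clean file).
    The estimate is updated only during speech pauses and held otherwise."""
    est = 0.0
    initialized = False
    estimates = []
    for g, speech in zip(fixed_gains, vad_flags):
        if not speech:
            if not initialized:
                est, initialized = g, True
            else:
                est = alpha * est + (1.0 - alpha) * g  # recursive smoothing
        estimates.append(est)
    return estimates
```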
B. Gain modification algorithms

1) Gain subtraction: The method described in this subsection is derived from the spectral amplitude subtraction [10] in the frequency domain, where an estimate of the noise signal spectrum is subtracted from the noisy speech spectrum. This approach is transferred to the parameter domain. Of course, one has to consider that the fixed gain g of the noisy speech signal is not the sum of the gains g_s and g_n of the speech-only and noise-only signals, because of the non-linearity of the AMR codec. Thus a first order Taylor approximation is done, and we assume that the derivative coefficients are equal to one. This holds if we suppose time-invariant derivative coefficients: if no background noise is present, g is equal to g_s, and during speech pauses, g is equal to g_n. Summarizing these assumptions, we state that the fixed gain of the noisy speech signal is a function of the following type:

g ≈ g_s + g_n    (1)
Using these considerations, an estimate ĝ_n of the noise gain is subtracted from the fixed gain g of the noisy speech signal in order to reduce the background noise of the speech signal:

ĝ = g − ĝ_n    (2)
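Rule (2), together with the non-negativity constraint discussed in the text, amounts to a one-line function:

```python
def gain_subtraction(g_noisy, g_noise_est):
    """Attenuation rule (2): subtract the estimated noise gain from the
    encoded fixed gain; clamp at zero when the estimate exceeds the gain."""
    return max(g_noisy - g_noise_est, 0.0)
```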
Here ĝ is the new evaluated fixed codebook gain, g is the fixed gain of the noisy speech signal, and ĝ_n is the estimated fixed gain of the background noise. As the gain has to be non-negative, the evaluated gain is set to zero if the estimate is bigger than the gain of the encoded speech with added background noise.

2) Squared gain subtraction: Besides spectral amplitude subtraction, spectral power subtraction is usually performed in "classical" noise reduction [9]. Translating this method to the codec parameter domain, and according to the assumption (1) of the previous subsection, this leads us to subtract the squared estimated noise gain from the squared gain of the encoded signal, resulting in a new squared evaluated gain that is passed on to the decoder:

ĝ² = g² − ĝ_n²    (3)
Similar to the treatment of negative values in (2), the evaluated gain of (3) is set to zero if the squared noise gain estimate is bigger than the squared gain of the noisy speech signal.
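The squared (power-style) subtraction rule (3), including the clamping to zero, can be written as:

```python
import math

def squared_gain_subtraction(g_noisy, g_noise_est):
    """Attenuation rule (3): power-style subtraction in the gain domain,
    clamped to zero when the noise estimate dominates."""
    diff = g_noisy * g_noisy - g_noise_est * g_noise_est
    return math.sqrt(diff) if diff > 0.0 else 0.0
```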
3) Gain subtraction with recursive SNR estimation: Another approach to modify the encoded gain is to multiply it by a scalar H lying in the range [0, 1]. In the extreme cases, the encoded gain is either completely attenuated or not attenuated at all:

ĝ = H · g    (4)

The scalar H is a transposition of the Wiener filter, which is also used for noise reduction in the frequency domain [11]:

H = SNR / (1 + SNR)    (5)

If the signal-to-noise ratio is chosen as

SNR = (g − ĝ_n) / ĝ_n    (6)

another implementation of (2) is obtained. A different way to estimate the signal-to-noise ratio is to compute it recursively. Therefore a time index k is introduced, which refers to the subframe number, as the fixed gain is computed on a subframe basis in the AMR encoder. SNR(k) is the ratio of the former evaluated gain ĝ(k−1) and the present noise gain estimate ĝ_n(k), weighted with a factor β, updated by the ratio of the encoded gain g(k) and the noise gain estimate from the present subframe, weighted with (1 − β):

SNR(k) = β · ĝ(k−1)/ĝ_n(k) + (1 − β) · g(k)/ĝ_n(k)    (7)
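Combining (4), (5) and (7), one subframe update of the recursive-SNR rule can be sketched as follows; the symbol names mirror the text (g: encoded fixed gain, ĝ_n: noise gain estimate, ĝ(k−1): previous evaluated gain):

```python
def recursive_snr_gain(g_noisy, g_noise_est, prev_eval_gain, beta=0.98):
    """Attenuation rules (4), (5) and (7): recursive SNR estimate,
    Wiener-style factor H = SNR / (1 + SNR), applied to the encoded gain."""
    snr = beta * (prev_eval_gain / g_noise_est) \
        + (1.0 - beta) * (g_noisy / g_noise_est)
    h = snr / (1.0 + snr)
    return h * g_noisy
```

With beta = 0 the rule reduces to the instantaneous Wiener weighting of (5) and (6); with beta close to 1 the attenuation factor changes only slowly from subframe to subframe.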
The factor β lies in the interval [0, 1] and affects the rate of updating the SNR. Equation (7) presents the advantage of smoothing the SNR. As the multiplication factor H is linked with the SNR by (5), short-term modifications of this factor are also avoided. This leads to a processed speech signal without any sudden modifications, which is generally preferred by the human ear; the processed speech signal therefore sounds more pleasant.

C. Results

After implementing the different methods to calculate a new gain for the fixed codebook, some informal listening tests were done to figure out which methods provide efficient noise reduction with as few artifacts on speech and residual noise as possible. Besides the results of the listening tests presented in this section, the new evaluated gains are visualized in Fig. 4 - 6. As reference signal, the unmodified fixed gain is shown in Fig. 3. The gain subtraction algorithm (2) achieves a very high reduction of the background noise, but the residual noise is not stationary and the speech presents artifacts. The non-stationarity of the noise is a result of a too high estimate ĝ_n of the noise gain. When this estimate is bigger than the encoded gain of the noisy speech file, the new evaluated gain is set to zero. The sudden changes between gains of value zero and non-zero gain values can be seen in Fig. 4 and
Fig. 3. Encoded gain of the fixed codebook (reference). [Plot: fixed gain (0-2000) versus time (0-10 s).]

Fig. 4. Modified fixed gain with gain subtraction. [Plot: fixed gain (0-2000) versus time (0-10 s).]
create the non-stationarity of the background noise. Furthermore, the fixed gain is also decreased during speech parts, resulting in a lower speech level. Additionally, the speech presents artifacts, as shown by the informal listening tests. Worse results are obtained by modifying the fixed gain with the squared gain subtraction algorithm (3). As the listening tests indicate, the background noise is hardly attenuated and becomes non-stationary. In addition, the speech presents artifacts. The reason is that the evaluated gain of the fixed codebook is almost not decreased during speech, and is often set to zero during speech pauses, as Fig. 5 shows. Since the gain of the adaptive codebook is very small during speech pauses, the excitation of the synthesis filter is mainly controlled by the fixed excitation. The fixed excitation is the result of the multiplication of the fixed codebook vector by the (evaluated) fixed gain. Therefore the synthesized noise is either completely attenuated
Fig. 5. Modified fixed gain with squared gain subtraction. [Plot: fixed gain versus time (0-10 s).]

Fig. 6. Modified fixed gain with recursive SNR estimation. [Plot: fixed gain versus time (0-10 s).]
(evaluated gain is zero) or not at all (evaluated gain is not decreased), resulting in drastic short-term changes of the energy of the noise. In summary, this method is not sufficient for noise reduction. The gain subtraction method with recursive SNR estimation (4), (5) and (7) achieves good noise reduction. Our informal tests show that the noise level is less attenuated than with the subtraction method (2), but the residual noise obtained by recursive SNR estimation stays stationary and is more pleasant to listen to. It can be compared to a kind of comfort noise. Furthermore, the noise level during speech is also decreased without introducing many artifacts. Fig. 6 plots the reduced gain. During speech pauses, we observe that the gain stays stable compared to the previous solution in Fig. 5. Besides, Fig. 6 shows that the evaluated gain is slightly smoothed during speech activity periods. The smoothness property
of this evaluated fixed gain is due to the recursive computation of the SNR (7). In our simulation, β was chosen equal to a high value (0.98) to produce strong smoothing. Compared to the encoded gain, the evaluated gain is much lower during speech pauses and only a bit lower during speech periods. This results in a large decrease of the noise level and a small decrease of the speech level. The informal listening tests confirmed this observation. According to our listening tests, the gain modification with recursive SNR estimation ((4), (5) and (7)) leads to the best trade-off between the amount of noise reduction and the artifacts introduced on the residual noise and speech. It is also important to emphasize that fixed gain modification provides processed speech signals of good quality. If artifacts are indeed introduced in the speech or in the residual noise, they are not drastically more important than the ones introduced by the classical tandeming of independent noise reduction and AMR coding [1]. Our proposed method used a VAD computed on a clean signal, that is to say an ideal VAD output. This allowed us to avoid possible problems created by a "real" VAD and to focus on the gain weighting rule. This strong hypothesis has to be taken into account when evaluating these results. Nevertheless, further experiments using the VAD of the AMR on the noisy signal were carried out after the writing of this paper; our current results confirm the results of the feasibility study proposed in this article.

V. CONCLUSION

A CCR listening test showed that the fixed codebook gain can be used to reduce environmental noise in the codec parameter domain. Accordingly, different gain modification techniques have been introduced and discussed. It was shown that good results can be achieved with some simple computations on the codec parameters.
Further investigations on processing of the adaptive codebook gain after speech pauses, or on the fixed gain estimation of the noise signal, would be profitable. Post-processing of the evaluated fixed gain could be another direction to enhance our current solution. The low complexity of the presented methods makes them quite suitable for mobile phones. It would also allow performing noise reduction in the network without any additional delay or distortion of the speech signal caused by tandeming of codecs (Tandem Free Operation).

REFERENCES

[1] P. Jax, R. Martin, P. Vary, M. Adrat, I. Varga, W. Frank, and M. Ihle, "A noise suppression system for the AMR speech codec," in Proc. KONVENS, 2000.
[2] R. Martin and R. Cox, "New speech enhancement techniques for low bit rate speech coding," in Proc. IEEE Workshop on Speech Coding, 1999, pp. 165-167.
[3] D. Virette, P. Scalart, and C. Lamblin, "Analysis of background noise reduction techniques for robust speech coding," in Proc. EUSIPCO, 2002.
[4] R. Chandran and D. J. Marchok, "Compressed domain noise reduction and echo suppression for network speech enhancement," in Proc. of the 43rd IEEE Midwest Symposium on Circuits and Systems, vol. 1, August 2000, pp. 10-13.
[5] P. Vary, U. Heute, and W. Hess, Digitale Sprachsignalverarbeitung. Teubner Verlag, 1998.
[6] Mandatory Speech Codec speech processing functions; AMR speech codec; General Description, 3GPP TS 26.071, June 2002.
[7] Software tools for speech and audio coding standardization, ITU-T G.191 STL 2000, November 2000.
[8] Methods for Subjective Determination of Transmission Quality, ITU-T P.800, August 1996.
[9] S. Boll, "Suppression of acoustic noise in speech using spectral subtraction," IEEE Trans. Acoust., Speech, Signal Processing, vol. ASSP-27, no. 2, April 1979, pp. 113-120.
[10] J. Lim and A. Oppenheim, "Enhancement and bandwidth compression of noisy speech," Proc. IEEE, vol. 67, 1979, pp. 1586-1604.
[11] S. Vaseghi, Advanced Signal Processing and Digital Noise Reduction. New York: Wiley-Teubner, 1996.