CAUSAL MULTI QUANTILE NOISE SPECTRUM ESTIMATION FOR SPECTRAL SUBTRACTION

Mohsen Farhadloo, Abolghasem Sayadian, Meysam Asgari, Mina Panahi
Department of Electrical Engineering, Amirkabir University of Technology (Tehran Polytechnic), 13597-45778, Tehran, Iran
phone: +(98) 021 64543381, email: [email protected], [email protected]

ABSTRACT

Suppression of additive noise from a speech signal is a fundamental problem in audio signal processing. In this paper we present a novel algorithm for single-channel speech enhancement. The algorithm consists of two steps: first, estimation of the noise power spectrum with a multi-quantile method; second, elimination of the estimated noise from the observed signal by spectral subtraction or Wiener filtering. Instead of a single global quantile for all frequency bands, we divide the entire frequency range into three regions and use a different quantile in each region. Our simulation results show that the new method performs better than quantile-based noise estimation.

1. INTRODUCTION

Suppression of background noise is one of the most prominent challenges in speech communication systems. Background acoustic noise almost always corrupts the clean speech and degrades the performance of speech processing systems, for example in speech communication, speech analysis, and speech recognition. Since clean speech is desired in such systems, speech enhancement is often used as a preprocessing step to improve their performance. Noise reduction is a difficult problem, largely due to the wide variety of background noise types and levels. Various speech enhancement techniques have been studied for the purpose of eliminating noise. Two major categories are single-channel and multi-channel speech enhancement systems. The focus of this paper is single-channel speech enhancement, where the only available input is the noisy speech. Multi-channel systems use more than one microphone and may therefore provide additional information about the noise statistics. Some techniques also use a priori information about the clean speech and/or the noise to simplify the problem. Spectral subtraction is one of the most popular single-channel methods: the noise spectrum is first estimated, and the estimate is then subtracted from the noisy signal spectrum. One method for estimating the noise spectrum is Boll's method [1], which is based on Voice Activity Detection (VAD). In this method, those segments of the

noisy signal in which speech is absent are assumed to be noise, and the noise spectrum is computed or updated by averaging the spectra of these segments; the estimated noise spectrum is then subtracted from each segment of the noisy signal, and the clean speech is obtained by transforming the resulting spectrum back into the time domain. This method assumes that the noise is stationary in the interval between two nonspeech segments, and as the signal-to-noise ratio (SNR) decreases, reliable detection of nonspeech segments becomes increasingly difficult. Another approach for estimating the noise spectrum, known as minimum statistics, can update the noise estimate even during speech segments [4]. Since the minimum is sensitive to outliers, [6] uses a quantile other than the minimum in all frequency bands. In this paper a new subtractive-type speech enhancement scheme based on quantiles is proposed, which uses different quantiles in different bands; to handle nonstationary noise conditions, we use the low-frequency region for spectral subtraction as described in [8].
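The subtractive pipeline just described (estimate the noise magnitude spectrum, subtract it from each frame, and resynthesize with the noisy phase) can be sketched as follows. This is a minimal single-channel illustration, not the exact system of this paper; the frame length, hop size, and flooring at zero are illustrative assumptions.

```python
import numpy as np

def spectral_subtraction(noisy, noise_mag, frame_len=200, hop=100):
    """Subtract an estimated noise magnitude spectrum frame by frame.

    noisy     : 1-D noisy speech samples
    noise_mag : estimated |V(f)| for one frame (length frame_len//2 + 1)
    """
    win = np.hamming(frame_len)
    out = np.zeros(len(noisy))
    norm = np.zeros(len(noisy))
    for start in range(0, len(noisy) - frame_len + 1, hop):
        frame = noisy[start:start + frame_len] * win
        spec = np.fft.rfft(frame)
        mag = np.abs(spec)
        # Subtract the noise magnitude; floor at zero to avoid negative magnitudes.
        clean_mag = np.maximum(mag - noise_mag, 0.0)
        # Reuse the noisy phase, the usual choice in spectral subtraction.
        clean = np.fft.irfft(clean_mag * np.exp(1j * np.angle(spec)), frame_len)
        # Weighted overlap-add resynthesis.
        out[start:start + frame_len] += clean * win
        norm[start:start + frame_len] += win ** 2
    return out / np.maximum(norm, 1e-12)
```

With a zero noise estimate the routine reconstructs its input, which is a convenient sanity check on the analysis/synthesis chain.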

2. NOISE SPECTRUM ESTIMATION

In this section we consider the problem of estimating the noise spectrum from the observed noisy spectrum. We assume an additive noise model in which speech and noise are independent:

y(n, t_i) = s(n, t_i) + v(n, t_i)     (1)

where y(n, t_i), s(n, t_i) and v(n, t_i) are the sampled noisy speech, clean speech and noise of the i-th frame, respectively. In the frequency domain the noisy speech signal in (1) can be expressed as

Y(f, t_i) = S(f, t_i) + V(f, t_i)     (2)

where Y(f, t_i), S(f, t_i) and V(f, t_i) are the Fourier transforms of the i-th frame of the noisy speech, clean speech and noise, respectively. In [6] an algorithm for noise estimation was proposed that uses a quantile other than the minimum. It is based on the fact that even in speech segments not all frequency bands are occupied by speech, and for a significant percentage of the time the energy in each frequency band is at the noise level. In this approach, for each frequency band the spectra of all frames in the entire utterance are sorted in ascending order, and the q-th temporal quantile is taken as

noise spectrum. This method, however, ignores a well-known fact: human speech information mostly lies within the 50-3500 Hz frequency region. In other words, we can expect the noise to have more noticeable power outside the human speech frequency region than inside it. Based on this observation, we divide the entire frequency range into three regions and use a different quantile in each. More precisely, the region 0-50 Hz is considered the low-frequency region, 50-3500 Hz the middle-frequency region, and frequencies above 3500 Hz the high-frequency region. For every frequency in the low-frequency region the q_low-quantile is taken as the noise spectrum, and for frequencies in the middle- and high-frequency regions the q_middle- and q_high-quantiles are taken, respectively. If we denote the estimated noise spectrum by |V̂(f)|, then we will have

|V̂(f)| = |Y(f, t_{q_low·T})|      if 0 ≤ f < 50 Hz
         |Y(f, t_{q_middle·T})|   if 50 Hz ≤ f ≤ 3500 Hz     (3)
         |Y(f, t_{q_high·T})|     if f > 3500 Hz
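A minimal sketch of this multi-quantile estimator follows. The T x K magnitude-spectrogram layout and the vectorized indexing are assumptions of this illustration; the default quantiles are the values selected experimentally later in the paper.

```python
import numpy as np

def multi_quantile_noise(mag, freqs, q_low=0.9, q_mid=0.55, q_high=0.7):
    """Estimate |V^(f)| per frequency bin from a magnitude spectrogram.

    mag   : (T, K) array of |Y(f, t_i)| for T frames and K frequency bins
    freqs : (K,) array of bin centre frequencies in Hz
    """
    T = mag.shape[0]
    sorted_mag = np.sort(mag, axis=0)          # ascending order per frequency bin
    # Pick the band-dependent quantile for each bin, as in eq. (3).
    q = np.where(freqs < 50, q_low, np.where(freqs <= 3500, q_mid, q_high))
    idx = np.minimum((q * T).astype(int), T - 1)
    return sorted_mag[idx, np.arange(len(freqs))]
```

Each bin's estimate is the magnitude of the frame ranked at position q·T in that bin's sorted order, with q chosen by frequency region.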

where T is the total number of frames in the entire utterance. The method presented above requires the entire received speech utterance for estimating the noise spectrum; that is, only after the whole speech signal has been received is the noise spectrum estimated and then subtracted from all frames of the utterance. Clearly, the Wiener filter designed to eliminate the estimated noise spectrum is then a noncausal filter. An efficient approximate technique solves this problem: a fixed-length buffer of ∆ frames stores the received noisy speech frames up to time t, and the multi-quantile noise spectrum is estimated by determining the q_low, q_middle and q_high quantiles from the frames stored in the buffer. When the buffer is full, [6] suggests removing its smallest and largest elements; in this paper we instead remove the frame that entered the buffer first. The buffer can therefore be viewed as a running buffer that moves in time, which improves adaptation to nonstationary noise conditions.

3. SPECTRAL SUBTRACTION

In conventional spectral subtraction, if the estimated noise spectrum |V̂(f)| is available, the clean speech can be estimated as

Ŝ(f) = (|Y(f)| − λ|V̂(f)|) · Y(f)/|Y(f)|     if λ|V̂(f)| ≤ |Y(f)|
Ŝ(f) = 0                                    otherwise            (4)

where λ is a positive constant introduced to reduce musical noise. Musical noise is an artificial noise generated when the estimated clean speech is transformed into the time domain using the phase of the noisy signal. To adapt spectral subtraction to nonstationary noise conditions, [8] suggests changing λ adequately in each frame instead of taking a fixed value. Using the ratio between the energies of the noisy signal and the estimated noise spectrum in the low-frequency region (0-50 Hz), λ_{t_i} for the i-th frame is defined as

λ_{t_i} = Σ_{f=0}^{50} |Y(f)| / Σ_{f=0}^{50} |V̂(f)|     (5)

and (4) is then replaced by

Ŝ(f) = (|Y(f)| − λ_{t_i}|V̂(f)|) · Y(f)/|Y(f)|     if λ_{t_i}|V̂(f)| ≤ |Y(f)|
Ŝ(f) = 0                                          otherwise            (6)

In [8] a preliminary noise spectrum, obtained from nonspeech frames at the beginning of the utterance, is used as V̂(f) under the assumption that the initial frames of each utterance are silence. In practice the initial frames of the received signal may not be silence, so this assumption is not always valid; we therefore use the multi-quantile noise spectrum estimation described in Section 2 for noise estimation.

4. EXPERIMENT RESULTS

Experiments using real speech data were carried out to show the effectiveness of the proposed method. The observed speech signal was segmented into frames of 25 ms length with 50 percent frame shift. For each frame the power spectrum was estimated through a Hamming-windowed 2048-point Fast Fourier Transform (FFT). The observed speech signals were produced by adding three types of nonstationary noise to the clean speech. The speakers are an Iranian male and an Iranian female, and the sampling frequency is 8 kHz. Three kinds of noise (babble, factory, and f16) were prepared for the experiments. Figure 1 shows the waveforms of these noises.

Figure 1: Waveforms of prepared noises: (a) babble noise, (b) f16 noise, and (c) factory noise.

First we considered the problem of determining the value of q in each frequency region, i.e. the low-, middle- and high-frequency regions. To solve this problem we used the Signal Estimation Error (SEE) as an objective criterion, defined as

SEE = (1/T) Σ_{i=1}^{T} ( 1 − Σ_n ŝ(n, t_i) / Σ_n s(n, t_i) )     (7)

where ŝ(n, t_i) and s(n, t_i) are the estimated and the original clean speech, respectively. In our experiments we found this criterion more effective than the Mean Squared Error (MSE) criterion. The optimal value of q in each region was determined experimentally. For all speech signals corrupted with babble and f16 noise, the noise spectrum was estimated with the proposed method; after computing λ_{t_i} in each frame, the estimated noise spectrum was subtracted from the observed speech signal as described in Section 3. This procedure was repeated for all values of q_low, q_middle and q_high within (0, 1), and the SEE was computed. The set of q_low, q_middle and q_high yielding the lowest SEE was selected as optimal. The averaged results for the two kinds of prepared noise (babble and f16) are summarized in Table 1.

Table 1: Averaged values of q based on babble and f16 noise
         q_low   q_middle   q_high
Male     0.85    0.6        0.75
Female   0.9     0.55       0.7

As expected, in the low- and high-frequency regions, where the noise has significant power, the value of q is greater than in the middle-frequency region. Following Table 1, we set q_low = 0.9, q_middle = 0.55 and q_high = 0.7 in our experiments. We should mention that the MSE is a well-known and conventional criterion pervasively used to evaluate estimation methods such as speech enhancement. In our experiments, however, the proposed SEE criterion proved more effective than MSE. We repeated the above procedure for determining the optimal values of q in each region, first with the MSE criterion and then with the proposed SEE criterion. Using these two optimal sets of q, we then estimated the clean speech with the method of Sections 2 and 3.
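The SEE criterion of eq. (7) takes only a few lines to compute. This sketch assumes the signals are already split into T frames of N samples each and stored as 2-D arrays, an assumption of the illustration rather than a detail specified in the paper:

```python
import numpy as np

def signal_estimation_error(est_frames, clean_frames):
    """Signal Estimation Error, eq. (7).

    est_frames, clean_frames : (T, N) arrays holding the estimated and the
    original clean speech, split into T frames of N samples each.
    """
    # Per-frame ratio of summed estimated speech to summed clean speech.
    ratio = est_frames.sum(axis=1) / clean_frames.sum(axis=1)
    # Average the per-frame deviation of that ratio from 1 over all frames.
    return np.mean(1.0 - ratio)
```

A perfect estimate gives SEE = 0, and an estimate that uniformly loses a fraction of the signal energy gives an SEE equal to that fraction.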
Subsequently, we asked several listeners to evaluate the quality of the estimated clean speech and found that the estimate obtained with the SEE criterion was subjectively better than the one obtained with the MSE criterion. In other words, the SEE criterion possesses a higher correlation with subjective sound quality than the MSE criterion does. Figure 2 shows the behavior of the proposed method with the above settings for the case where the additive noise is factory noise, the speaker is male and the SNR equals 0 dB; the clean speech, the noisy signal and the estimated clean speech are shown. Figures 3 and 4 show the babble and f16 noise cases, where only the noisy and the estimated clean speech are shown in each figure.

Figure 2: Waveforms of (a) clean speech, (b) clean speech contaminated with factory noise, (c) estimated clean speech via the proposed method.

Figure 3: Waveforms of (a) clean speech contaminated with babble noise, (b) estimated clean speech

Figure 4: Waveforms of (a) clean speech contaminated with f16 noise, (b) estimated clean speech.

Table 2 gives the evaluated SEE against SNR for factory noise. The results for babble and f16 noise are summarized in Tables 3 and 4.

Table 2: Values of signal estimation error for factory noise at different SNRs
SNR (dB)   -5      0       5       10      15      20
SEE        0.040   0.073   0.109   0.112   0.110   0.110

Table 3: Values of signal estimation error for babble noise at different SNRs
SNR (dB)   -5      0       5       10      15      20
SEE        0.400   0.054   0.076   0.116   0.120   0.115

Table 4: Values of signal estimation error for f16 noise at different SNRs
SNR (dB)   -5      0       5       10      15      20
SEE        0.158   0.045   0.108   0.120   0.118   0.115

The small SEE values point to the effectiveness of our method, and comparing the clean speech with the estimated one shows the suitability of the proposed criterion. An important parameter that must be determined when using a buffer for noise spectrum estimation is the buffer length. Several experiments were carried out to determine a proper buffer length. Averaged results for different buffer lengths, with q_low = 0.9, q_middle = 0.55 and q_high = 0.7, at different SNRs and for the three kinds of noise are reported in Figures 5, 6 and 7. In these experiments we varied the SNR from -5 dB to 20 dB; at each SNR the SEE was evaluated for each buffer length, and the results averaged over SNR are reported.
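The causal running-buffer scheme of Section 2, whose length these experiments examine, can be sketched as follows. This is a minimal illustration: the class interface and the use of a deque are assumptions, and the default buffer length of 105 frames is the value selected by the buffer-length experiments reported here.

```python
import numpy as np
from collections import deque

class RunningQuantileNoiseEstimator:
    """Causal multi-quantile noise estimate over a fixed-length frame buffer.

    When the buffer is full, the oldest frame is dropped (first in, first out),
    so the estimate can track nonstationary noise.
    """

    def __init__(self, freqs, q_low=0.9, q_mid=0.55, q_high=0.7, buf_len=105):
        # Band-dependent quantile per frequency bin, as in eq. (3).
        self.q = np.where(freqs < 50, q_low,
                          np.where(freqs <= 3500, q_mid, q_high))
        self.buf = deque(maxlen=buf_len)   # maxlen evicts the oldest frame

    def update(self, frame_mag):
        """Add one frame of |Y(f)| and return the current noise estimate."""
        self.buf.append(np.asarray(frame_mag, dtype=float))
        mags = np.sort(np.stack(list(self.buf)), axis=0)  # ascending per bin
        n = len(self.buf)
        idx = np.minimum((self.q * n).astype(int), n - 1)
        return mags[idx, np.arange(mags.shape[1])]
```

`deque(maxlen=...)` implements exactly the first-in-first-out removal proposed here, in contrast to dropping the smallest and largest elements as in [6].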

Figure 5: Values of signal estimation error for factory noise for different buffer lengths.

Figure 6: Values of signal estimation error for babble noise for different buffer lengths.

Figure 7: Values of signal estimation error for f16 noise for different buffer lengths.

According to these figures, the most suitable buffer length ∆ is 105: the larger the buffer, the lower the corresponding SEE, until the length reaches 105; beyond that point the SEE increases as the buffer is enlarged.

5. CONCLUSION

In this paper quantile-based noise spectrum estimation is improved. Exploiting the well-known fact that human speech information lies mostly within the 50-3500 Hz frequency region, the total frequency content of the observed speech signal is divided into three frequency regions, and in each region a different quantile is used.

REFERENCES

[1] S. F. Boll, "Suppression of acoustic noise in speech using spectral subtraction," IEEE Trans. Acoustics, Speech, Signal Processing, vol. ASSP-27, no. 2, pp. 113-120, Apr. 1979.
[2] K. Fukunaga, Introduction to Statistical Pattern Recognition, Academic Press, 1990.
[3] J. D. Gibson, B. Koo, and S. D. Gray, "Filtering of colored noise for speech enhancement and coding," IEEE Trans. Signal Process., vol. 39, no. 8, pp. 1732-1742, Aug. 1991.
[4] R. Martin, "Spectral subtraction based on minimum statistics," in Proc. EUSIPCO, Sep. 1994, pp. 1182-1185.
[5] J. Sohn, N. S. Kim, and W. Sung, "A statistical model-based voice activity detection," IEEE Signal Process. Lett., vol. 6, no. 1, pp. 1-3, Jan. 1999.
[6] V. Stahl, A. Fischer, and R. Bippus, "Quantile based noise estimation for spectral subtraction and Wiener filtering," in Proc. IEEE ICASSP, Jun. 2000, pp. 1875-1878.
[7] J. Yamauchi and T. Shimamura, "Noise estimation using high frequency regions for spectral subtraction," IEICE Trans. Fundamentals, vol. E85-A, no. 3, pp. 723-727, Mar. 2002.
[8] K. Yamashita and T. Shimamura, "Nonstationary noise estimation using low-frequency regions for spectral subtraction," IEEE Signal Processing Letters, vol. 12, no. 6, June 2005.
[9] Signal Processing Information Base, "Noise data". Available from: .
