The Employment of Bayesian Method in Noise Reduction and Packet Loss Replacement
Alaa Rahimi, Seyed Ghorshi, Ali Sarafnia
School of Science and Engineering, Sharif University of Technology, International Campus, Kish Island, IRAN
[email protected]
Abstract - Speech enhancement in real-time applications improves the quality and intelligibility of speech and reduces communication fatigue. Owing to the responsiveness demanded of such systems and the spread of online real-time applications, including VoIP, state-space models are now widely used. This paper presents a speech enhancement method based on an adaptive Bayesian-Kalman filter and Bayesian-MAP estimation. The method combines a Bayesian-Kalman filter for noise reduction with Bayesian-MAP estimation of the parameters of lost speech segments. Performance evaluation indicates that the proposed method reduces noise more effectively than the Wiener filter method.
Keywords - Speech Enhancement; PLC; Bayesian Kalman Filter; Bayesian MAP Estimation
I. INTRODUCTION
Speech enhancement improves the quality of speech and reduces communication fatigue. It is used in applications such as mobile phones, VoIP, teleconferencing systems, speech recognition, hearing aids and in-vehicle communication systems. Speech enhancement covers noise reduction, packet loss concealment (PLC) and bandwidth extension. Its main objective is to maximize noise reduction while minimizing speech distortion. Adaptive noise reduction techniques are based on matching the prediction coefficients to the trajectory of the observation signal. These techniques assume that the speech is contaminated with noise.
Kalman introduced his systematic state-space approach to linear filtering, based on the least square error method, in 1960 [1]. Although the Kalman filter is computationally more expensive than many other filtering techniques, its effectiveness has led to its use for noise reduction in a wide range of applications [2].
There are three major error types in a computer network: bit error, jitter and packet loss. Packet loss occurs when packets of data fail to reach the destination node [3]. It can be caused by several factors, including fading, channel congestion, packet corruption during transfer and faulty routing in the network. Because the fraction of lost packets increases as network traffic increases, network performance is often measured in terms of packet loss. A lost packet might be retransmitted in order to ensure that all data is received. In speech communication over the network, a lost packet might be replaced with zeros, but a high-quality communication system needs a suitable algorithm to replace the missing speech segments in order to avoid quality reduction. This type of packet replacement is called packet loss concealment (PLC). Autoregressive (AR) models are used to restore lost segments of speech signals in many applications; for example, AR models have been used to estimate non-recoverable errors in compact disc systems [4]. The speech production model commonly used for PLC is the AR model, whose coefficients are obtained using the Yule-Walker equations [5]. Vaseghi and Rayner [6] proposed a pitch-based AR interpolator that exploits the long-term correlation of near-periodic signals. In 2002, Kauppinen and Roth proposed a method that replaces the missing segments of the excitation with zeros and uses the LP model to estimate the missing segments from past speech segments only [7]. Because estimating the excitation of the AR source filter is crucial in AR-based interpolators, a time-reversed excitation substitution algorithm with a multi-rate post-processing module for audio gap restoration was introduced in [8].
This paper aims to implement a state-space model for the noise reduction part of speech enhancement. The state-space model is implemented in the time domain and is employed in a Bayesian state-space Kalman filter. In addition, a Bayesian-MAP estimation technique is used for packet loss replacement, which is the second part of speech enhancement. The proposed method combines the Bayesian Kalman filter and Bayesian-MAP estimation for noise reduction and replacement of lost speech segments. A low-order recursive AR model is also used for forward and backward prediction of the lost speech segment.
The rest of this paper is organized as follows. Section II presents the proposed model specifications and estimation objectives. Experimental illustrations and performance evaluation results are given in Section III, and Section IV concludes the paper.
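As an illustration of how AR coefficients are commonly obtained from the Yule-Walker equations mentioned above, the following is a minimal sketch using the Levinson-Durbin recursion. It is not taken from this paper: the function names `autocorr` and `levinson_durbin` are hypothetical, and the biased autocorrelation estimator is an assumption. The sign convention follows x(t) = sum_k a_k x(t-k) + e(t).

```python
def autocorr(x, max_lag):
    """Biased autocorrelation estimates r[0..max_lag] of a sequence."""
    n = len(x)
    return [sum(x[i] * x[i + lag] for i in range(n - lag)) / n
            for lag in range(max_lag + 1)]


def levinson_durbin(r, order):
    """Solve the Yule-Walker equations for x(t) = sum_k a_k x(t-k) + e(t).

    r is the autocorrelation sequence r[0..order].
    Returns (a, err): the coefficients a_1..a_P and the variance of the
    prediction error (the excitation).
    """
    a = []
    err = r[0]
    for m in range(order):
        # Reflection coefficient for model order m + 1
        acc = r[m + 1] - sum(a[j] * r[m - j] for j in range(m))
        k = acc / err
        # Update the lower-order coefficients, then append the new one
        a = [a[j] - k * a[m - 1 - j] for j in range(m)] + [k]
        err *= (1.0 - k * k)
    return a, err


if __name__ == "__main__":
    # Exact autocorrelation of an AR(1) process with a_1 = 0.5
    print(levinson_durbin([1.0, 0.5, 0.25], 2))  # ([0.5, 0.0], 0.75)
```

In practice `autocorr` would be applied to a frame of known speech samples and its output passed to `levinson_durbin` with a low model order.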
II. MODEL SPECIFICATION AND OBJECTIVES
A. Bayesian-Kalman Filter for Noise Reduction
In VoIP, the speech signal might be received noisy, and some parts of the speech might be lost due to packet loss. Because some parts of the speech signal might also be lost
55th International Symposium ELMAR-2013, 25-27 September 2013, Zadar, Croatia
during the noise reduction process, the proposed model first applies the noise reduction technique, using the Bayesian Kalman filter, to the received signal. Afterwards, a packet loss estimation model is applied to the de-noised speech signal. The speech signal is modeled as an AR process as:
\[ x(t) = \sum_{k=1}^{P} a_k \, x(t-k) + e(t) \tag{1} \]

\[ \mathbf{x}(t) = A_x \, \mathbf{x}(t-1) + \mathbf{e}(t) \tag{2} \]

where $a$ is the coefficient vector of the $P$-th order autoregressive model of the speech. Equation (1) can also be written in a state-space Kalman filtering model as:

\[
\begin{bmatrix} x(t) \\ x(t-1) \\ x(t-2) \\ \vdots \\ x(t-P+1) \end{bmatrix}
=
\begin{bmatrix}
a_1 & a_2 & \cdots & a_{P-1} & a_P \\
1 & 0 & \cdots & 0 & 0 \\
0 & 1 & \cdots & 0 & 0 \\
\vdots & \vdots & \ddots & \vdots & \vdots \\
0 & 0 & \cdots & 1 & 0
\end{bmatrix}
\begin{bmatrix} x(t-1) \\ x(t-2) \\ x(t-3) \\ \vdots \\ x(t-P) \end{bmatrix}
+
\begin{bmatrix} e(t) \\ 0 \\ 0 \\ \vdots \\ 0 \end{bmatrix}
\tag{3}
\]
where $\mathbf{x}(t)$ is a $P \times 1$ dimensional signal vector at time $t$, $A_x$ is the $P \times P$ dimensional state transition matrix between times $t-1$ and $t$, and $\mathbf{e}(t)$ is a $P$-dimensional uncorrelated input excitation vector of the state equation. In this work, we assume that the channel distortion equals $I$. The block diagram of the proposed Bayesian Kalman filter method for noise reduction is illustrated in Fig. 1.

Figure 1. Block diagram of Kalman Filter model for speech enhancement.

B. Bayesian-MAP Parameter Estimation of Lost Segment
In our proposed method, we first find the coefficients of the previous packet; the excitations of the previous packets are then used to extract and estimate the excitation of the lost segment. Because the lost segments might be long, a backward estimation is also applied, and the combination of the backward and forward interpolations forms the final restored segment. It is also assumed that the total length of the signal is $N$, and that the signal is made up of the lost frames of the speech signal, the frames preceding them and the frames following them:

\[
\mathbf{x} =
\begin{pmatrix} \text{Previous} \\ \text{Lost} \\ \text{Next} \end{pmatrix}
=
\begin{pmatrix} \mathbf{x}_1 \\ \mathbf{0} \\ \mathbf{x}_2 \end{pmatrix}
+
\begin{pmatrix} \mathbf{0} \\ \mathbf{x}_u \\ \mathbf{0} \end{pmatrix}
\tag{4}
\]

where $\mathbf{x}_u$ represents the unknown samples, i.e. the lost segment of the speech signal, $\mathbf{x}_1$ denotes the known samples preceding the lost segment and $\mathbf{x}_2$ denotes the known samples following the lost segment.

C. Excitation Generation of Lost Segment
The autocorrelation function is employed to estimate the pitch period of the frame immediately preceding the lost frame [9]. In order to generate the excitation of the lost segment, the last $k$ samples of the previous excitation signal are concatenated, where $k$ is equal to the pitch period of the excitation signal of the previous segment. In addition, $k$ samples from the succeeding frames are used for excitation concatenation. It should be noted that these samples have the same length as the lost segment. Furthermore, a recursive low-order AR model uses the generated excitation signal in the synthesis filter to reconstruct the lost speech frames. The block diagram of the proposed model for packet loss replacement is illustrated in Fig. 2.

D. Proposed Method
In this method, a Bayesian Kalman filter is employed for noise reduction of the speech signal. In addition to noise reduction, the Bayesian-MAP coefficient estimator and the excitation generating function are used to estimate the lost segment of the speech signal. The best interpolation between the forward and backward estimations is chosen according to the length of the lost segment and the mean squared error (MSE) of the forward and backward linear estimations. In the proposed method, the destination node receives a noisy observation signal; the signal passes through the Bayesian Kalman filter, which removes the background noise. Afterwards, the PLC technique is applied to the signal to replace the lost segments of the speech. The results of the proposed method are given in the following section.
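The noise reduction stage of Section II-A can be illustrated with a conventional Kalman filter built on the companion-form state equation (3). The sketch below is an assumption-laden illustration, not the adaptive Bayesian-Kalman filter of this paper: the AR coefficients `a`, the excitation variance `q` and the observation-noise variance `r` are taken as known and fixed, the observation is modeled as y(t) = x(t) + n(t) (channel distortion equal to the identity), and all function names are hypothetical.

```python
import random


def companion(a):
    """State transition matrix of Eq. (3) for AR coefficients a_1..a_P."""
    p = len(a)
    A = [[0.0] * p for _ in range(p)]
    A[0] = list(a)
    for i in range(1, p):
        A[i][i - 1] = 1.0
    return A


def matmul(X, Y):
    """Product of two matrices stored as lists of rows."""
    return [[sum(X[i][k] * Y[k][j] for k in range(len(Y)))
             for j in range(len(Y[0]))] for i in range(len(X))]


def kalman_denoise(y, a, q, r):
    """Estimate x(t) from y(t) = x(t) + n(t), with x(t) an AR process.

    q: excitation variance, r: noise variance (both assumed known).
    """
    p = len(a)
    A = companion(a)
    At = [list(col) for col in zip(*A)]
    s = [0.0] * p  # state estimate [x(t), ..., x(t-P+1)]
    P = [[1.0 if i == j else 0.0 for j in range(p)] for i in range(p)]
    out = []
    for obs in y:
        # Prediction step: s <- A s, P <- A P A^T + Q
        s = [sum(A[i][k] * s[k] for k in range(p)) for i in range(p)]
        P = matmul(matmul(A, P), At)
        P[0][0] += q  # the excitation drives the first state only
        # Update step with observation matrix H = [1, 0, ..., 0]
        innovation = obs - s[0]
        S = P[0][0] + r
        K = [P[i][0] / S for i in range(p)]
        s = [s[i] + K[i] * innovation for i in range(p)]
        row0 = list(P[0])
        P = [[P[i][j] - K[i] * row0[j] for j in range(p)] for i in range(p)]
        out.append(s[0])
    return out


def noisy_ar_example(a, n, q, r, seed):
    """Synthetic test data: an AR signal plus white Gaussian noise."""
    rng = random.Random(seed)
    hist = [0.0] * len(a)
    clean = []
    for _ in range(n):
        x = sum(a[j] * hist[-1 - j] for j in range(len(a)))
        x += rng.gauss(0.0, q ** 0.5)
        hist.append(x)
        clean.append(x)
    noisy = [c + rng.gauss(0.0, r ** 0.5) for c in clean]
    return clean, noisy
```

A typical call would be `kalman_denoise(noisy, a, q, r)` on a frame whose AR coefficients were estimated beforehand; in a real system `a`, `q` and `r` would have to be estimated and updated frame by frame.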
Figure 2. Bayesian-MAP Estimation for PLC from previous packets.
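The excitation generation of Section II-C can be sketched as follows, covering the forward direction only. This is a simplified illustration under stated assumptions: the pitch period is taken as the lag that maximizes an unnormalized autocorrelation within a caller-supplied lag range, and the function names (`pitch_period`, `extend_excitation`, `ar_synthesize`) are hypothetical. A backward pass could reuse the same functions on time-reversed samples from the succeeding frame.

```python
def pitch_period(frame, min_lag, max_lag):
    """Lag in [min_lag, max_lag] that maximizes the autocorrelation."""
    n = len(frame)
    best_lag, best_val = min_lag, float("-inf")
    for lag in range(min_lag, max_lag + 1):
        val = sum(frame[i] * frame[i + lag] for i in range(n - lag))
        if val > best_val:
            best_lag, best_val = lag, val
    return best_lag


def extend_excitation(prev_exc, k, gap_len):
    """Repeat the last k excitation samples to fill gap_len samples."""
    period = prev_exc[-k:]
    return [period[i % k] for i in range(gap_len)]


def ar_synthesize(a, excitation, history):
    """Run the synthesis filter x(t) = sum_k a_k x(t-k) + e(t).

    history holds at least P known samples before the gap, oldest first.
    """
    p = len(a)
    buf = list(history[-p:])
    out = []
    for e in excitation:
        x = sum(a[j] * buf[-1 - j] for j in range(p)) + e
        buf.append(x)
        out.append(x)
    return out
```

For example, `extend_excitation([1, 2, 3, 4, 5], 2, 5)` returns `[4, 5, 4, 5, 4]`, and `ar_synthesize([0.5], [0, 0, 0], [8.0])` returns `[4.0, 2.0, 1.0]`.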
III. EXPERIMENTAL RESULTS
The proposed method has been tested on 100 sentences spoken by a female speaker, selected randomly from the TIMIT database [10]. We also compared the Bayesian-Kalman filter method with the Wiener filter method in terms of spectrograms and MSE versus SNR to observe the efficiency of these methods for noise reduction. Fig. 3 illustrates the spectrograms of the original clean speech, the noisy speech and the Kalman de-noised speech signal, respectively. It can be observed from Fig. 3 that the noise has been reduced by the Bayesian-Kalman filter and that the estimated signal no longer appears noisy. Fig. 4 illustrates the spectrograms of the original clean speech, the noisy speech and the speech signal estimated using the Wiener filter method. Comparing Fig. 3 with Fig. 4 shows the differences between these methods: the high-frequency components of the speech signal are modelled better by the Bayesian-Kalman filter than by the Wiener filter method. Table I compares the SNRs at the input with those of the estimated signals and gives the mean squared error (MSE) values at different SNRs. It can be seen from Table I that as the SNR of the input signal increases, the SNR of the estimated signal also increases and the MSE decreases. It can also be observed from Fig. 5 that as the SNR at the input increases, the MSE at the output decreases in both methods. Figs. 3, 4 and 5 show that the Bayesian-Kalman filter method outperforms the Wiener filter in noise reduction. Better speech quality and intelligibility were also perceived in listening experiments after noise reduction of noisy speech by the Bayesian Kalman filter method compared with the Wiener filter method.

Figure 3. Spectrograms of clean and noisy speech signals using the Bayesian-Kalman Filter: (a) Clean speech signal, (b) Noisy speech signal at 10 dB SNR, (c) De-noised speech signal.

TABLE I. ILLUSTRATION OF DIFFERENT SNRS AT THE INPUT, ESTIMATED SIGNALS AND MSE

Input SNR    SNR of de-noised speech using    Estimated MSE using       Estimated MSE using
(dB)         Bayesian-Kalman filter (dB)      Bayesian-Kalman filter    Wiener filter
-5           6.59                             0.000078                  0.0004007
0            9.86                             0.000041                  0.0004000
5            13.15                            0.00002                   0.000315
10           16.85                            0.000009                  0.0001889

Figure 4. Spectrograms of clean and noisy speech signals using the Wiener Filter: (a) Clean speech signal, (b) Noisy speech signal at 10 dB SNR, (c) De-noised speech signal.
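The paper does not state how the SNR and MSE figures were computed. The sketch below assumes the usual definitions (MSE as the mean squared difference between clean and estimated samples, SNR as the ratio of clean-signal power to error power in dB) and then derives the SNR gain implied by the Bayesian-Kalman column of Table I. The function names are hypothetical.

```python
import math


def mse(clean, estimate):
    """Mean squared error between a clean signal and its estimate."""
    return sum((c - e) ** 2 for c, e in zip(clean, estimate)) / len(clean)


def snr_db(clean, estimate):
    """SNR in dB: clean-signal power over error power."""
    signal_power = sum(c * c for c in clean)
    error_power = sum((c - e) ** 2 for c, e in zip(clean, estimate))
    return 10.0 * math.log10(signal_power / error_power)


# (input SNR, de-noised SNR) pairs in dB, copied from Table I
TABLE_I = [(-5, 6.59), (0, 9.86), (5, 13.15), (10, 16.85)]

# SNR gain of the Bayesian-Kalman filter at each input SNR
gains = [round(out - inp, 2) for inp, out in TABLE_I]

if __name__ == "__main__":
    print(gains)  # [11.59, 9.86, 8.15, 6.85]
```

Under these definitions the gain shrinks from 11.59 dB at an input SNR of -5 dB to 6.85 dB at 10 dB, i.e. the filter helps most when the input is noisiest.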
[Plot: "Different Input SNR Values vs. MSE"; curves: Kalman Filter, Wiener Filter; x-axis: SNR (dB), from -5 to 10; y-axis: Mean Squared Error (MSE), in units of 10^-4]
Figure 5. Mean Squared Error vs. Signal to Noise Ratio of input speech.
For the purpose of packet loss replacement, an artificial loss is created in the 10 ms to 32 ms portion of the phoneme /ey/ of a spoken sentence, and Bayesian-MAP estimation is used to estimate the linear prediction coefficients needed to estimate the lost segment of speech. It is also assumed that the length of the available frames (i.e. the previous and succeeding frames) is the same as the length of the lost segment. Fig. 6 illustrates the result of the proposed method for noise reduction and PLC. The comparison is made between (a) 20 ms of the original clean signal, (b) the same 20 ms of the noisy speech at 10 dB SNR, (c) the de-noised signal after the Bayesian Kalman filter is applied and (d) the estimated speech segment after the proposed Bayesian-MAP estimation is applied for PLC. It can be noted that the estimated de-noised signal overlaps the clean signal in part (c), and that the estimated speech segment overlaps the clean speech segment in part (d).
IV. CONCLUSIONS
In this paper, a state-space speech enhancement method was proposed. The enhancement system combines noise reduction and packet loss concealment within a single versatile model. A dynamic Bayesian Kalman filter and Bayesian-MAP coefficient estimation were proposed for noise reduction of the observation signal and for estimation of the coefficients of the lost segment, respectively. The results of the Bayesian Kalman noise reduction method at different SNRs were analyzed, and PLC results of the proposed method were presented. It was observed that the proposed method can be effective for long-term correlation estimation. Furthermore, with an accurate approximation of the excitation of the lost segment and a Bayesian-MAP estimate of the coefficients, the reconstruction from the proposed method matches the original speech segment reasonably well. The experimental results indicate that the proposed Bayesian enhancement method can reduce noise and estimate lost packets.
Figure 6. 20ms of de-noised and estimated speech segment: (a) Original segment, (b) Noisy observation segment, (c) De-noised segment, (d) Estimated segment.
REFERENCES
[1] R. E. Kalman, "A new approach to linear filtering and prediction problems," Journal of Basic Engineering, vol. 82, no. 1, pp. 35-46, March 1960.
[2] S. R. Miralavi, S. Ghorshi, and A. Tahaei, "A Kalman filter approach to packet loss replacement in presence of additive noise," in Proc. IEEE International Conference on Information Science, Signal Processing and their Applications (ISSPA), 2012, pp. 352-356.
[3] J. F. Kurose and K. W. Ross, Computer Networking: A Top-Down Approach, 5th Edition. New York: Addison-Wesley, 2010.
[4] A. J. E. M. Janssen, R. Veldhuis, and L. B. Vries, "Adaptive interpolation of discrete-time signals that can be modeled as AR processes," IEEE Transactions on Acoustics, Speech and Signal Processing, vol. 34, no. 2, pp. 317-330, April 1986.
[5] Z. Guoqiang and W. B. Kleijn, "Autoregressive model-based speech packet-loss concealment," in Proc. IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2008, pp. 4797-4800.
[6] S. Vaseghi and P. J. W. Rayner, "Detection and suppression of impulsive noise in speech communication systems," IEEE Proceedings, Part 1, 137(1), pp. 38-46, 1990.
[7] I. Kauppinen and K. Roth, "Audio signal restoration - theory and applications," in Proc. 5th Int. Conf. on Digital Audio Effects, Hamburg, Germany, 2002, pp. 105-110.
[8] P. A. Esquef and L. W. P. Biscainho, "An efficient model-based multirate method for reconstruction of audio signals across long gaps," IEEE Transactions on Audio, Speech and Language Processing, vol. 14, no. 4, pp. 1391-1400, 2006.
[9] L. R. Rabiner, "On the use of autocorrelation analysis for pitch detection," IEEE Transactions on Acoustics, Speech and Signal Processing, vol. 25, pp. 24-33, February 1977.
[10] TIMIT Dictionary available: