IEEE TRANSACTIONS ON AUDIO, SPEECH, AND LANGUAGE PROCESSING, VOL. 15, NO. 4, MAY 2007
A Soft Voice Activity Detection Using GARCH Filter and Variance Gamma Distribution
Rasool Tahmasbi and Sadegh Rezaei
Abstract—This paper presents a robust algorithm for a voice activity detector (VAD) based on a generalized autoregressive conditional heteroscedasticity (GARCH) filter, a variance gamma distribution (VGD), and an adaptive threshold function. GARCH models are new statistical methods that are used especially in economic time series. There is a consensus that speech signals exhibit variances that change through time, and GARCH models are a popular choice for modeling these changing variances. A speech signal is assumed to have a VGD because the VGD has heavier tails than the Gaussian distribution (GD). The distribution of the noise signal is assumed to be Gaussian. In the proposed method, heteroscedasticity is modeled by GARCH, and then the parameters of the distributions are estimated recursively. Finally, hard detection is the result of comparing a multiple observation likelihood ratio test (MOLRT) with an adaptive threshold function. The simulation results show that the proposed VAD is able to operate down to 5 dB and in nonstationary environments.
Index Terms—Estimation theory, generalized autoregressive conditional heteroscedasticity (GARCH) model, heteroscedasticity, probability distribution, voice activity detection (VAD).
I. INTRODUCTION
VOICE ACTIVITY DETECTION (VAD) refers to the ability to distinguish voice from noise, and is an integral part of a variety of speech communication systems, such as speech coding [1], speech recognition, audio conferencing, hands-free telephony [2], speech enhancement [3], wireless communication [4], [5], and echo cancellation. In recent years, numerous researchers have studied different strategies for detecting speech in noise and the influence of the VAD on the performance of speech processing systems. Sohn [1] proposed a robust VAD algorithm based on a statistical likelihood ratio test (LRT) involving a single observation vector. Later, Cho [6] suggested an improvement based on a smoothed LRT. It has been shown recently [7], [8] that incorporating long-term speech information into the decision rule benefits speech/pause discrimination in high-noise environments. For example, Ramírez [9] proposed an LRT involving multiple and independent observations. In [10], the method is a little different, but it incorporates long-term information too: the signal is first decorrelated using an orthogonal transformation and then a hidden Markov model (HMM) is employed. They assumed that the distribution
of speech is Laplacian, because, in [11], it is shown that the speech signal has a Laplacian distribution (LD). In this paper, we assume that the speech signal has a variance gamma distribution (VGD), since VGD is a generalization of LD and in specific cases VGD becomes LD. Note that the approaches of [1] and [9] are performed in the frequency domain, but [10] is performed in the time domain. However, it takes time to decorrelate signals via orthogonalization. Our proposed method is performed in the time domain; as in [10], it is necessary that the signal be uncorrelated. To decorrelate signals, we used the generalized autoregressive conditional heteroscedasticity (GARCH) model, which can model both noise and speech heteroscedasticity. However, estimating the GARCH parameters is time consuming; to solve this problem, a predefined estimation of parameters is presented.
This paper is organized as follows. In Section II, the GARCH model is introduced. We show that it can model heteroscedasticity, and we show that every GARCH series is uncorrelated. Section III presents statistical models of speech and noise and an adaptive threshold function for discriminating between them. In Section IV, the algorithm of the proposed method is presented. Section V illustrates the experimental results of the proposed method.

II. GARCH MODEL
GARCH models are new statistical methods that are used especially in economic time series. GARCH stands for generalized autoregressive conditional heteroscedasticity. Loosely speaking, heteroscedasticity can be thought of as time-varying variance (i.e., volatility). Conditional implies a dependence on the observations of the immediate past, and autoregressive describes a feedback mechanism that incorporates past observations into the present. GARCH, then, is a mechanism that includes past variances in the explanation of future variances.
More specifically, GARCH is a time series modeling technique that uses past variances and past variance forecasts to forecast future variances.

Definition: Let $\{\varepsilon_t\}$ be a sequence of i.i.d. random variables with the standard Gaussian distribution. $\{x_t\}$ is called a GARCH(q,p) process if

$$x_t = \sigma_t \varepsilon_t \tag{1}$$

where $\sigma_t$ is a nonnegative process such that

$$\sigma_t^2 = \alpha_0 + \sum_{i=1}^{q} \alpha_i x_{t-i}^2 + \sum_{j=1}^{p} \beta_j \sigma_{t-j}^2 \tag{2}$$

and

$$\alpha_0 > 0, \quad \alpha_i \ge 0 \ (i = 1, \ldots, q), \quad \beta_j \ge 0 \ (j = 1, \ldots, p). \tag{3}$$

For $p = 0$, the process reduces to ARCH(q) (autoregressive conditional heteroscedasticity of order q). In ARCH(q) processes the conditional variance is specified as a linear function of past sample variances only, whereas the GARCH(p,q) process allows lagged conditional variances to enter as well [12].

Manuscript received June 26, 2006; revised November 12, 2006. The associate editor coordinating the review of this manuscript and approving it for publication was Prof. Mark Hasegawa-Johnson. The authors are with Amir Kabir (Polytechnic) University, Tehran 158754413, Iran. Digital Object Identifier 10.1109/TASL.2007.894521
1558-7916/$25.00 © 2007 IEEE

Fig. 1. Results of the proposed VAD with 0-dB SNR. (a) Clean speech. (b) Noisy speech with zero SNR. (c) Estimation of noisy speech via GARCH model. (d) Soft detection. (e) Hard detection.

TABLE I MEAN OF GARCH PARAMETERS

TABLE II VARIANCE OF GARCH PARAMETERS

To review elementary aspects of the GARCH model, denote $E_{t-1}(\cdot)$ as the conditional expectation given the past information up to time $t-1$ (which is denoted by $\mathcal{F}_{t-1}$; i.e., $E_{t-1}(\cdot) = E(\cdot \mid \mathcal{F}_{t-1})$, where $\mathcal{F}_{t-1}$ is the sigma field generated by $\{x_{t-1}, x_{t-2}, \ldots\}$). So

(4)

and the conditional variance is

(5)

Now suppose $\{x_t\}$ is a GARCH process. So

(6)

Therefore, we have

(7)

and its conditional expectation and variance are

(8)

(9)
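The moment properties discussed in this section can be checked numerically. The following is a minimal sketch, not the authors' code: it assumes the standard GARCH(1,1) recursion $\sigma_t^2 = \alpha_0 + \alpha_1 x_{t-1}^2 + \beta_1 \sigma_{t-1}^2$ with Gaussian innovations and uses the textbook stationary variance $\alpha_0 / (1 - \alpha_1 - \beta_1)$. The parameter values and the helper names `simulate_garch11`, `lag1_autocorr`, and `excess_kurtosis` are illustrative assumptions and are not taken from Tables I and II.

```python
import numpy as np

def simulate_garch11(n, alpha0, alpha1, beta1, seed=0):
    """Simulate x_t = sigma_t * eps_t with standard Gaussian eps_t and
    sigma_t^2 = alpha0 + alpha1 * x_{t-1}^2 + beta1 * sigma_{t-1}^2."""
    rng = np.random.default_rng(seed)
    eps = rng.standard_normal(n)
    x = np.empty(n)
    sigma2 = np.empty(n)
    # Assumption: start the recursion at the stationary (unconditional) variance.
    sigma2[0] = alpha0 / (1.0 - alpha1 - beta1)
    x[0] = np.sqrt(sigma2[0]) * eps[0]
    for t in range(1, n):
        sigma2[t] = alpha0 + alpha1 * x[t - 1] ** 2 + beta1 * sigma2[t - 1]
        x[t] = np.sqrt(sigma2[t]) * eps[t]
    return x, sigma2

def lag1_autocorr(v):
    """Sample autocorrelation at lag 1."""
    v = v - v.mean()
    return float(np.dot(v[:-1], v[1:]) / np.dot(v, v))

def excess_kurtosis(v):
    """Sample kurtosis minus 3 (zero for a Gaussian)."""
    v = v - v.mean()
    return float(np.mean(v ** 4) / np.mean(v ** 2) ** 2 - 3.0)

# Illustrative parameters (hypothetical, chosen so that alpha1 + beta1 < 1).
x, sigma2 = simulate_garch11(100_000, alpha0=0.1, alpha1=0.1, beta1=0.8)
theoretical_var = 0.1 / (1.0 - 0.1 - 0.8)

print("sample variance      :", x.var(), "(stationary value:", theoretical_var, ")")
print("lag-1 autocorr of x  :", lag1_autocorr(x))       # close to zero: uncorrelated
print("lag-1 autocorr of x^2:", lag1_autocorr(x ** 2))  # positive: volatility clustering
print("excess kurtosis      :", excess_kurtosis(x))     # positive: heavier tails than Gaussian
```

In this sketch the series itself is uncorrelated while its squares are not, which is the sense in which the unconditional variance is constant but the conditional variance `sigma2` changes over time.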
As you can see, the mean and variance of a GARCH process are constant, but its conditional variance changes over time. So, in comparison with the autoregressive moving average (ARMA) process, the GARCH process models the time variation of variance (i.e., volatility). Other interesting properties of GARCH processes are as follows:
1) the process is uncorrelated, i.e., $\mathrm{Cov}(x_t, x_s) = 0$ for $t \neq s$, where $x_t$ denotes the GARCH process;
2) $x_t$ has heavier tails than the Gaussian distribution (GD).
Note that property 1) together with the normality assumption of noises ensures that the samples are independent. As discussed in [13], when the mean level of a series stays close to zero over the entire period and changes in variance (volatility) occur, the series can be modeled through GARCH.
It is clear that speech signals have these properties [Fig. 1(b)]. Cohen [14] showed that speech can be modeled through GARCH(1,1). McNeil [15] argued that a GARCH(1,1) model with Student innovations is enough to remove the dependence in return series. Sometimes, a filter with normal innovations is enough too. The filtered return series, which is defined to be

(10)
should be an approximately i.i.d. series. [There are several criteria for checking goodness of fit, such as the Akaike information criterion (AIC).] We used a GARCH(1,1) to model the speech signal. Therefore

$$\sigma_t^2 = \alpha_0 + \alpha_1 x_{t-1}^2 + \beta_1 \sigma_{t-1}^2 \tag{11}$$

where $x_t$ is the signal, $\sigma_t^2$ is its conditional variance, and $\alpha_0$, $\alpha_1$, and $\beta_1$ should be estimated. We call garchfit in MATLAB to calibrate the GARCH model. The mean and variance of these three parameters for 100 different speech signals with different SNRs are shown in Tables I and II. As you see, the estimated values are related to the SNR: reducing SNR causes increased but reduced , and vice versa.
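Filtering a signal with predefined GARCH(1,1) parameters, rather than refitting them for every utterance, can be sketched as follows. This is an illustration under stated assumptions, not the authors' implementation: the function name `garch11_filter`, the start-up value of the recursion, and the parameter values are hypothetical (the entries of Table I are not reproduced here), and the filtered series is taken to be the signal divided by its estimated conditional standard deviation.

```python
import numpy as np

def garch11_filter(x, alpha0, alpha1, beta1):
    """Run the GARCH(1,1) variance recursion over x with fixed parameters.

    Returns the conditional variances sigma2 and the filtered series
    z_t = x_t / sigma_t (assumed definition of the filtered series).
    """
    if not (alpha0 > 0 and alpha1 >= 0 and beta1 >= 0 and alpha1 + beta1 < 1):
        raise ValueError("need alpha0 > 0, alpha1, beta1 >= 0, alpha1 + beta1 < 1")
    x = np.asarray(x, dtype=float)
    sigma2 = np.empty_like(x)
    # Assumption: initialize with the stationary variance of the model.
    sigma2[0] = alpha0 / (1.0 - alpha1 - beta1)
    for t in range(1, x.size):
        sigma2[t] = alpha0 + alpha1 * x[t - 1] ** 2 + beta1 * sigma2[t - 1]
    z = x / np.sqrt(sigma2)
    return sigma2, z

# Toy signal: a quiet segment followed by a loud one (stand-in for silence/speech).
rng = np.random.default_rng(1)
signal = np.concatenate([0.1 * rng.standard_normal(1000),
                         1.0 * rng.standard_normal(1000)])
sigma2, z = garch11_filter(signal, alpha0=0.01, alpha1=0.1, beta1=0.85)
print("mean conditional variance, quiet part:", sigma2[:1000].mean())
print("mean conditional variance, loud part :", sigma2[1000:].mean())
```

The recursion costs one multiply-add per sample, which is why fixed parameters keep the filter cheap compared with a full maximum-likelihood fit.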
III. STATISTICAL ASPECTS
Denote by $\hat{x}_t$ the estimation of noisy speech via GARCH. After estimating noisy speech through the GARCH model, the result is a series of data that have heavier tails than the Gaussian and are uncorrelated; by the normality assumption of noises, they become independent. In situations such as this, a distribution like the VGD is appropriate [16], [17]. We distinguish speech from silence as follows. Assume that $X_t$ is an $n$-dimensional vector of data which is the estimation of noisy speech via the GARCH model at time $t$, i.e.,

(12)

where $'$ denotes the transposing operation. Then, we compare two hypotheses using a multiple observation likelihood ratio test (MOLRT)

$H_0$: $X_t$ is silence
$H_1$: $X_t$ is speech. (13)

Before starting the next section, we review elementary features of the MOLRT and the distributions of speech and noise.

A. Multiple Observation Likelihood Ratio Test
In a two-hypothesis test, the optimal decision rule that minimizes the error probability is the Bayes classifier. In the LRT, it is assumed that the number of observations is fixed and represented by a vector. The performance of the decision procedure can be improved by incorporating more observations in the statistical test. In a two-class classification problem, a MOLRT can be defined by

(14)

where $f_1$ and $f_0$ are the probability density functions (pdfs) of speech and noise, respectively. If the observations are independent, then

(15)

An equivalent log-LRT can be defined by taking logarithms

(16)

By defining

(17)

the MOLRT can be recursively computed

(18)

and the decision rule, which selects between speech and silence, is defined by

(19)

As discussed in [9], the use of the MOLRT for voice activity detection is mainly motivated by two factors: 1) the optimal behavior of the so-defined decision rule; and 2) a multiple observation vector for classification defines a reduced-variance LRT, reporting clear improvements in robustness against the acoustic noise present in the environment.

B. Distribution of Speech
A random variable is said to be VGD with parameters , , , and if its density is given by

(20)

with

(21)

where $K$ is the modified Bessel function of the third kind [18, ch. 11], and $\Gamma$ is the gamma function. The parameter domain is restricted to and . If , then the distribution is symmetric and and are shape parameters. The moment-generating function of is given by

(22)

where . Therefore, we obtain the centralized moments by

(23)

In [19], estimation of the parameters of the VGD via a moment matching method is presented. Denote and as skewness and kurtosis, respectively. Then, the parameter estimates are

(24)

(25)

(26)

(27)

(28)

In [19], it is shown that if (skewness) is close to zero, then these estimators can successfully obtain good approximations
to , , , and . On the other hand, in [11], it is shown that the speech signal has a symmetric distribution. So

(29)

(30)

(31)

(32)

As you can see, estimation of is related to (kurtosis). To simplify the computations, we assumed a fixed value for . For example, [10] assumed the Laplacian case (LD is a specific case of VGD). Therefore, by assuming that the speech distribution is symmetric, then for a fixed , the estimation of the parameters via the moment matching method is as follows. If the vector $X_t$ was estimated at time $t$, then the estimation of and is

(33)

(34)

(35)

and can be recursively computed via

(36)

(37)

(38)

(39)

Note that and are defined to shorten the above equations.

C. Distribution of Noise
We assume that the noise is Gaussian. Therefore, its pdf is given by

(40)

If the vector $X_t$ was observed, then the estimations of and are

(41)

(42)

and can be recursively computed as shown in (36)–(39), i.e.,

(43)

(44)

D. Adaptive Threshold
As discussed in Section III-A, speech is active at time $t$ if the decision rule (19) selects speech (based on the MOLRT). It is better, however, to compare the likelihood ratio with a threshold function at each time $t$, i.e., to decide between speech and silence by

(45)

or speech is active if

(46)

where the speech and noise pdfs are VGD and GD, respectively, and their parameters are estimated at time $t$ via (34), (36)–(39), and (43), (44). It is natural to assume that the threshold function is a function of the parameters of the speech distribution and . An appropriate approach is to define

(47)

where is a constant threshold and is defined in (21) and its parameters are estimated based on $X_t$ at time $t$. If we define

(48)

then

(49)

Therefore, using (47) and (49) in (46) leads to the following approach. Speech is active if

(50)

The left side of (50) can be viewed as a criterion for detecting silence or speech activity and is a criterion for soft detection. This criterion will be compared with the constant threshold, and hard detection will be derived. As you see, reducing the number of computations in (50) is another advantage.

IV. PROPOSED VOICE ACTIVITY DETECTION ALGORITHM
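Before the algorithm itself, the statistical machinery of Section III can be illustrated with a simplified sketch. It is not the paper's equations (33)–(50): as an assumption, the speech model is the Laplacian special case of the VGD (the case adopted in [10]), the noise model is a zero-mean Gaussian, both are fitted to the same trailing window by matching the second moment, and the soft score is the per-sample log-likelihood ratio compared with a constant threshold of zero. The function name `soft_detection` and the window length are hypothetical. Cumulative sums play the role of the recursive updates, so each new sample costs a constant amount of work.

```python
import numpy as np

def soft_detection(z, n):
    """Per-sample log-likelihood ratio (Laplace vs. Gaussian) over a trailing
    window of at most n samples; positive values favor the heavy-tailed model."""
    z = np.asarray(z, dtype=float)
    # Prefix sums give every trailing-window sum in O(1) per time step.
    c_abs = np.concatenate(([0.0], np.cumsum(np.abs(z))))
    c_sq = np.concatenate(([0.0], np.cumsum(z ** 2)))
    t = np.arange(1, z.size + 1)
    start = np.maximum(t - n, 0)      # shorter windows during start-up
    length = t - start
    sum_abs = c_abs[t] - c_abs[start]
    sum_sq = c_sq[t] - c_sq[start]
    sigma2 = np.maximum(sum_sq / length, 1e-12)  # Gaussian variance, zero mean assumed
    b = np.sqrt(sigma2 / 2.0)                    # Laplace scale with the same variance
    ll_laplace = -length * np.log(2.0 * b) - sum_abs / b
    ll_gauss = -0.5 * length * np.log(2.0 * np.pi * sigma2) - sum_sq / (2.0 * sigma2)
    return (ll_laplace - ll_gauss) / length      # normalize by segment length

# Toy input: Gaussian samples (noise-like) followed by Laplacian samples (speech-like).
rng = np.random.default_rng(2)
z = np.concatenate([rng.standard_normal(10_000), rng.laplace(size=10_000)])
soft = soft_detection(z, n=1000)
hard = soft > 0.0                                # constant threshold (assumption)
print("mean soft score, Gaussian part :", soft[1000:10_000].mean())
print("mean soft score, Laplacian part:", soft[11_000:].mean())
```

Dividing by the window length keeps scores from short start-up windows comparable with those from full windows, which mirrors the length normalization the algorithm applies to the soft detection.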
For implementation of the proposed VAD, its algorithm is presented and shown in Fig. 2. In this algorithm, GARCH parameters are selected from Table I. This table can be extended for different conditions by using the garchfit command in MATLAB. The presented algorithm can be divided into two parts: one for the start-up period, in which fewer than $n$ samples are available ($n$ being the dimension of the observation vector of Section III), and one for later times. Since in the start-up period the length of the observation vector is less than $n$, we cannot use the recursive equations (36)–(39) and (43), (44) to estimate the distribution parameters; instead, we should use (33)–(35) and (41), (42) there. For computing soft detection, we use information up to time $t$. On the other hand, since the log likelihood is influenced by the length of the segment, we multiply the soft detection by a length-dependent factor. After the start-up period, since the length of the observation vector is $n$, we use the recursive equations (34), (36)–(39), and (43), (44). The computational complexity of the proposed method is , where is the length of the noisy speech. Note that the complexities of methods [1], [10], and [9] are , , and , respectively.

Fig. 2. Proposed VAD algorithm.

V. EXPERIMENTAL RESULTS
In this section, the results of the proposed method are presented. The speech signals are obtained from http://www.dailywave.com. Fig. 1(a) shows the manually marked clean speech sample. Fig. 1(b) shows the noisy speech with Gaussian noise at 0-dB SNR. GARCH estimation of noisy speech is shown in Fig. 1(c). The estimated GARCH parameters are , , and . As discussed before, Tables I and II show the mean and variance of GARCH parameters for different noisy speech signals with Gaussian noise. Since the estimation of GARCH parameters is related to the signal-to-noise ratio (SNR), it is possible to use these approximated parameters instead of using an algorithm for finding exact estimates of the parameters. The left side of (50) is computed and shown in Fig. 1(d). This criterion is compared with the constant threshold, and a hard decision is derived. The hard detection is shown in Fig. 1(e). To evaluate the performance of the proposed VAD, the speech and silence intervals are marked manually; then, the hard decision of the VAD is compared with the manually marked intervals. For Fig. 1, the probability of true detection (PTD), which is defined as the number of frames correctly classified as speech versus nonspeech, divided by the total number of frames in the test sample, is equal to 0.94. The results of the proposed method with different SNRs are presented in Table III. The PTD is more than 96% when the SNR is greater than 2 dB, and in very noisy cases the PTD is near 75%, which are good results.

TABLE III PROBABILITY OF TRUE DETECTION (%)

Fig. 3. Results of the proposed VAD with t distribution noise (5-dB SNR). (a) Clean speech. (b) Noisy speech. (c) Estimation of noisy speech via GARCH model. (d) Soft detection. (e) Hard detection.
So, the presented algorithm can be viewed as a robust algorithm.
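The PTD used in these experiments is a frame-level accuracy, and can be computed as in the following sketch. The function name `probability_of_true_detection` and the toy label arrays are illustrative assumptions; the labels are taken to be one boolean per frame, with True meaning speech.

```python
import numpy as np

def probability_of_true_detection(hard_decision, reference):
    """Fraction of frames whose speech/nonspeech label matches the manual marking."""
    hard = np.asarray(hard_decision, dtype=bool)
    ref = np.asarray(reference, dtype=bool)
    if hard.shape != ref.shape:
        raise ValueError("decision and reference must have the same length")
    return float(np.mean(hard == ref))

# Toy example: 8 frames, the detector disagrees with the manual marking on 2 of them.
reference = [0, 0, 1, 1, 1, 1, 0, 0]
decision = [0, 1, 1, 1, 1, 0, 0, 0]
ptd = probability_of_true_detection(decision, reference)
print("PTD:", ptd)
```

Because the measure counts correct silence frames as well as correct speech frames, a detector that always outputs one class can still score well on recordings dominated by that class, which is worth keeping in mind when comparing values across test samples.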
Fig. 4. Results of the proposed VAD with beta distribution noise (10-dB SNR). (a) Clean speech. (b) Noisy speech. (c) Estimation of noisy speech via GARCH model. (d) Soft detection. (e) Hard detection.
Also, to further assess our algorithm, we examine it with non-Gaussian noises. Using the same format as Fig. 1, the results of the proposed method with noises from the t distribution (with four degrees of freedom) and the beta distribution (with parameters and ) are shown in Figs. 3 and 4, respectively. Since beta random numbers are not zero-mean, we subtract the mean before adding these noises to the clean speech signal. Also, Fig. 5 shows the results of the proposed method with colored noise. To generate colored noise, we passed white noise through a finite-impulse response filter. The PTDs for these noisy speech signals are 0.96, 0.94, and 0.93, respectively. Note that for Figs. 3 and 5, the SNR is 5 dB, and for Fig. 4 it is 10 dB. The PTDs of the previously published algorithms and the presented method are given in Table IV with different SNRs and different noises. The PTD of the proposed method is higher than
Fig. 5. Results of the proposed VAD with colored noise (5-dB SNR). (a) Clean speech. (b) Noisy speech. (c) Estimation of noisy speech via GARCH model. (d) Soft detection. (e) Hard detection. TABLE IV PROBABILITY OF TRUE DETECTION (%)
the PTDs achieved by the methods of Ramírez and Segura [9], Gazor and Zhang [10], and Sohn et al. [1].

VI. CONCLUSION
The objective of this paper is to exploit the properties of new statistical tools, such as the GARCH model and heavy-tailed distributions, to find a robust algorithm for VAD in the presence of a high level of noise. The results show that the performance of the presented VAD is quite good when we take advantage of the adaptive threshold function. The complexity of the proposed algorithm is very low due to the fact that it is performed in the time domain, and the estimations can be computed recursively.

REFERENCES
[1] J. Sohn, N. S. Kim, and W. Sung, “A statistical model-based voice activity detection,” IEEE Signal Process. Lett., vol. 6, no. 1, pp. 1–3, Jan. 1999.
[2] N. R. Garner, P. A. Barrett, D. M. Howard, and A. M. Tyrrell, “Robust noise detection for speech detection and enhancement,” Electron. Lett., vol. 33, no. 4, pp. 270–271, Feb. 1997.
[3] A. Rezayee and S. Gazor, “An adaptive KLT approach for speech enhancement,” IEEE Trans. Speech Audio Process., vol. 9, no. 2, pp. 87–95, Feb. 2001.
[4] F. Beritelli, S. Casale, and A. Cavallaero, “A robust voice activity detector for wireless communications using soft computing,” IEEE J. Sel. Areas Commun., vol. 16, no. 12, pp. 1818–1829, Dec. 1998.
[5] D. K. Freeman, G. Cosier, C. B. Southcott, and I. Boyd, “The voice activity detector for the pan European digital cellular mobile telephone service,” in Proc. Int. Conf. Acoust., Speech, Signal Process., May 1989, pp. 369–372.
[6] Y. D. Cho, K. Al-Naimi, and A. Kondoz, “Improved voice activity detection based on a smoothed statistical likelihood ratio,” in Proc. Int. Conf. Acoust., Speech, Signal Process., 2001, vol. 2, pp. 737–740.
[7] J. Ramírez, J. C. Segura, M. C. Benítez, A. de la Torre, and A. Rubio, “A new Kullback–Leibler VAD for speech recognition in noise,” IEEE Signal Process. Lett., vol. 11, no. 2, pp. 666–669, Feb. 2004.
[8] ——, “Efficient voice activity detection algorithms using long-term speech information,” Speech Commun., vol. 42, no. 3–4, pp. 271–287, Oct. 2004.
[9] J. Ramírez and J. C. Segura, “Statistical voice activity detection using a multiple observation likelihood ratio test,” IEEE Signal Process. Lett., vol. 12, no. 10, pp. 689–692, Oct. 2005.
[10] S. Gazor and W. Zhang, “A soft voice activity detector based on a Laplacian–Gaussian model,” IEEE Trans. Speech Audio Process., vol. 11, no. 5, pp. 498–505, Sep. 2003.
[11] ——, “Speech probability distribution,” IEEE Signal Process. Lett., vol. 10, no. 7, pp. 204–207, Jul. 2003.
[12] T. Bollerslev, “Generalized autoregressive conditional heteroskedasticity,” J. Econometrics, vol. 31, pp. 307–327, 1986.
[13] D. Pena, G. C. Tiao, and R. S. Tsay, A Course in Time Series Analysis. New York: Wiley, 2001, ch. 1 and 9.
[14] I. Cohen, “Modeling speech signals in time-frequency domain using GARCH,” Signal Process., vol. 84, pp. 2453–2459, 2004.
[15] A. McNeil, R. Frey, and P. Embrechts, Quantitative Risk Management: Concepts, Techniques and Tools. Princeton, NJ: Princeton Univ. Press, 2005, ch. 4.
[16] D. B. Madan, P. P. Carr, and E. C. Chang, “The variance gamma process and option pricing,” Eur. Finance Rev., vol. 2, pp. 79–105, 1998.
[17] E. Daal and D. Madan, “An empirical examination of the variance-gamma model for foreign currency option,” J. Business, vol. 78, pp. 134–176, 2005.
[18] M. Abramowitz and I. Stegun, Handbook of Mathematical Functions. New York: Dover, 1968, ch. 11.
[19] E. Seneta, “Fitting the variance-gamma to financial data,” J. Appl. Probability, vol. 41A, pp. 177–187, 2004.
Rasool Tahmasbi received the B.Sc. degree in mathematical statistics from Shiraz University, Shiraz, Iran, in 2004 and the M.Sc. degree in mathematical statistics from Amir Kabir University (Polytechnic), Tehran, Iran, in 2006. He is currently pursuing the Ph.D. degree in applied mathematics at Amir Kabir University, Tehran, Iran. His main research interests are statistical and adaptive signal processing and speech processing.
Sadegh Rezaei received the B.Sc. degree from Ahvaz University, Ahvaz, Iran, in 1981, the M.Sc. degree from Tarbiyat Modares University, Tehran, Iran, in 1986, and the Ph.D. degree from Adelaide University, Adelaide, SA, Australia, all in statistics in 1996. From 1996 to 2004, he was with the Department of Statistics, Ahvaz University. He is currently with the Department of Statistics, Amir Kabir University, Tehran. His main research interest is statistical speech processing.