
Available online at www.sciencedirect.com

Procedia Engineering

Procedia Engineering 29 (2012) 2655 – 2660 www.elsevier.com/locate/procedia

2012 International Workshop on Information and Electronics Engineering (IWIEE)

Speech Endpoint Detection Method Based on TEO in Noisy Environment

LI Jie a,*, ZHOU Ping b, JING Xinxing c, DU Zhiran a

a School of Computer Science and Engineering, Guilin University of Electronic Technology, Guilin Guangxi 541004, China
b School of Electric Engineering and Automation, Guilin University of Electronic Technology, Guilin Guangxi 541004, China
c College of Information and Communication, Guilin University of Electronic Technology, Guilin Guangxi 541004, China

Abstract

Speech endpoint detection is a crucial component of a speech recognition system, especially in noisy environments. Its accuracy affects both the computational complexity and the recognition performance of the system. This paper proposes an endpoint detection method for speech signals based on the Teager Energy Operator (TEO). It uses a three-state transition and judgment mechanism based on double thresholds, which ensures accuracy in noisy environments and robustness to changes in absolute signal level. Experimental comparison with two other endpoint detection algorithms shows that the proposed algorithm has better detection capability in low signal-to-noise ratio (SNR) environments.

© 2011 Published by Elsevier Ltd. Selection and/or peer-review under responsibility of Harbin University of Science and Technology. Open access under CC BY-NC-ND license.

Keywords: Teager Energy Operator; Endpoint Detection; Speech Recognition; Noise

1. Introduction

Speech endpoint detection, also known as voice activity detection, distinguishes speech from non-speech segments in a signal recorded in a noisy environment and then determines the start and end points of the speech. It is an important preparatory step for speech recognition, and its accuracy directly affects the subsequent feature extraction and the final recognition results. Research shows that, even under ideal conditions, more than half of the errors in speech recognition are caused by inaccurate endpoint detection [1].


* Corresponding author. E-mail address: [email protected].

1877-7058 © 2011 Published by Elsevier Ltd. Open access under CC BY-NC-ND license. doi:10.1016/j.proeng.2012.01.367


In high-noise environments, existing algorithms cannot always detect speech endpoints accurately. This paper takes TEO as the acoustic feature and proposes an endpoint detection method for noisy environments that works by analyzing the TEO of the speech signal. Experiments show that, in high-noise environments, this method can accurately detect the endpoints of the speech signal.

2. Teager energy operator

2.1 The characteristics of TEO

The studies of Teager [2] and others show that real sound sources are generated by the interaction of non-linear vortices in the vocal tract. To measure this non-linear process in speech, [2] proposed the TEO, and Kaiser [3] gave its computing formulas in continuous and discrete form. For a continuous-time signal $x(t)$, TEO is defined as:

$$\psi[x(t)] = (x'(t))^2 - x(t)\,x''(t) \tag{1}$$

where $x'(t) = dx(t)/dt$ and $x(t)$ is the continuous time-domain signal. For a discrete-time signal, equation (1) can be approximated as:

$$\psi[x(n)] = x(n)^2 - x(n-1)\,x(n+1) \tag{2}$$

where $\psi[x(n)]$ is the TEO of the signal and $x(n)$ is the sample of the discrete signal at time $n$. Assume a signal with amplitude $A$, initial phase $\phi$, frequency $f$ and sampling frequency $f_s$, whose discrete form is $x(n) = A\cos(\Omega n + \phi)$ with angular frequency $\Omega = 2\pi f / f_s$. Three adjacent samples of the signal form the following set of equations:

$$\begin{cases} x(n) = A\cos(\Omega n + \phi) \\ x(n+1) = A\cos[\Omega(n+1) + \phi] \\ x(n-1) = A\cos[\Omega(n-1) + \phi] \end{cases} \tag{3}$$

When $\Omega < \pi/4$, $\sin\Omega \approx \Omega$, and solving equations (3) gives

$$\psi[x(n)] = x(n)^2 - x(n-1)\,x(n+1) = A^2\sin^2(\Omega) \approx A^2\Omega^2 \tag{4}$$

It can be seen from the above equation that the output of TEO contains not only the amplitude information $A$ but also the frequency information $\Omega$.
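Equations (2) and (4) can be checked numerically. The following is a minimal Python sketch, not the authors' implementation: the function name `teager_energy`, the test-tone parameters, and the choice to drop the two boundary samples are all assumptions made for illustration.

```python
import numpy as np

def teager_energy(x):
    """Discrete TEO of equation (2): psi[n] = x[n]^2 - x[n-1] * x[n+1].

    The first and last samples lack a neighbour on one side, so this sketch
    simply omits them and returns len(x) - 2 values (an assumption).
    """
    x = np.asarray(x, dtype=float)
    return x[1:-1] ** 2 - x[:-2] * x[2:]

# Hypothetical test tone: A = 2, f = 200 Hz, fs = 8 kHz, phi = 0.3 rad.
A, f, fs, phi = 2.0, 200.0, 8000.0, 0.3
omega = 2 * np.pi * f / fs            # Omega = 2*pi*f/fs, well below pi/4
n = np.arange(400)
psi = teager_energy(A * np.cos(omega * n + phi))

exact = A**2 * np.sin(omega) ** 2     # exact form in equation (4)
approx = A**2 * omega**2              # small-Omega approximation in equation (4)
```

For a pure tone, every element of `psi` equals `exact` up to rounding, independent of the phase, and `approx` differs from it by under one percent at this frequency.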
2.2 The noise characteristics of TEO

Suppose $x(n)$ is a broadband stationary random signal. Then

$$E\{\psi[x(n)]\} = E[x(n)^2] - E[x(n-1)\,x(n+1)] = R_x(0) - R_x(2)$$

where $R_x(k)$ is the autocorrelation function of $x(n)$. For noisy speech, suppose the signal $x(n)$ is the sum of the clean speech signal $s(n)$ and zero-mean additive noise $\omega(n)$. The TEO of the noisy speech signal $x(n)$ is:

$$\psi[x(n)] = \psi[s(n)] + \psi[\omega(n)] + 2\psi[s(n), \omega(n)] \tag{5}$$

where $\psi[s(n), \omega(n)]$ is the cross-$\psi$ energy of $s(n)$ and $\omega(n)$. Since $s(n)$ and $\omega(n)$ are independent and $\omega(n)$ is zero-mean, the expected value of $\psi[s(n), \omega(n)]$ is zero, so $E\{\psi[x(n)]\} = E\{\psi[s(n)]\} + E\{\psi[\omega(n)]\}$. Furthermore, $E\{\psi[\omega(n)]\}$ is negligible compared to $E\{\psi[s(n)]\}$, so $E\{\psi[x(n)]\} \approx E\{\psi[s(n)]\}$. The derivation shows that TEO suppresses zero-mean noise and enhances speech, while remaining simple and cheap to compute.

3. The description of the algorithm

3.1 Overall description of the algorithm

The detailed steps of the algorithm are shown in Figure 1. First, the TEO of the input speech signal is computed. Second, the TEO sequence is windowed and divided into frames. Third, the TEO energy of each frame is obtained and the TEO curve is smoothed. Finally, the endpoints of the speech segment are detected by the double-threshold, three-state judgment.

Fig. 1 Detailed description of the algorithm
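The first three steps of Section 3.1 can be sketched in Python as follows. This is an illustration under stated assumptions, not the authors' code: the frame length of 200 samples, the shift of 100 samples and the Hamming window are taken from the test setup in Section 4.1, while the function name `frame_teo_energy` and the use of the mean absolute windowed TEO as the per-frame energy are assumptions, since the paper does not give the exact per-frame formula.

```python
import numpy as np

def frame_teo_energy(x, frame_len=200, hop=100):
    """Per-frame TEO energy of a speech signal (illustrative sketch).

    Step 1: sample-wise TEO, psi[n] = x[n]^2 - x[n-1] * x[n+1].
    Step 2: split psi into overlapping frames and apply a Hamming window.
    Step 3: reduce each frame to one value; the mean absolute value used
            here is an assumption, not a formula stated in the paper.
    """
    x = np.asarray(x, dtype=float)
    psi = x[1:-1] ** 2 - x[:-2] * x[2:]
    if len(psi) < frame_len:
        raise ValueError("signal is shorter than one frame")
    window = np.hamming(frame_len)
    n_frames = 1 + (len(psi) - frame_len) // hop
    energy = np.empty(n_frames)
    for i in range(n_frames):
        frame = psi[i * hop : i * hop + frame_len] * window
        energy[i] = np.mean(np.abs(frame))
    return energy
```

The returned array has one value per frame and is the curve that the smoothing and threshold steps operate on.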

3.2 Smoothing

To reduce false detections caused by noise jitter and to improve the robustness of the system, the TEO energy is smoothed before endpoint detection:

$$E[\psi(n)]' = \frac{E[\psi(n-1)] + E[\psi(n)] + E[\psi(n+1)]}{3} \tag{6}$$
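Equation (6) is a three-point moving average. A minimal Python sketch is given below; it assumes that $n$ indexes frames and that the first and last values, which have only one neighbour, are left unchanged. Neither the function name `smooth3` nor the boundary rule comes from the paper.

```python
import numpy as np

def smooth3(energy):
    """Three-point moving average of equation (6).

    Interior points are replaced by the mean of themselves and their two
    neighbours. The two end points are copied unchanged (an assumption).
    """
    e = np.asarray(energy, dtype=float)
    out = e.copy()
    if len(e) >= 3:
        out[1:-1] = (e[:-2] + e[1:-1] + e[2:]) / 3.0
    return out
```

For example, an isolated spike `[0, 0, 9, 0, 0]` becomes `[0, 3, 3, 3, 0]`, which is how the smoothing lowers the chance that a single noisy frame crosses a threshold.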

3.3 Double-threshold, three-state endpoint judgment

Endpoint detection starts by determining two thresholds for the TEO energy. The low threshold has a small value; it is sensitive to changes in the signal and is easily exceeded. The high threshold has a large value, and the signal must reach a certain intensity before it is exceeded. Exceeding the low threshold does not necessarily mark the beginning of speech, since it may be caused by a short burst of noise. Exceeding the high threshold is attributed to the speech signal.

Since the first and last short periods of a recording generally contain no speech and only background noise, the threshold levels of the TEO energy can be calculated from a few "silent" frames. In the experiments, we found that the energy of the few frames assumed to be silent is sometimes too high, which leads to a very high threshold level, and sometimes too low, so that quiet sections are also taken as speech. To avoid these errors, we introduce the mean energy $E_{ave}$ and the lowest energy $E_{mn}$ of the entire utterance. The dynamic thresholds are calculated as:

$$E_2 = \max[\,0.4\,m_{mx} + 0.6\,E_{mn},\; 0.3\,E_{ave} + 0.7\,E_{mn}\,]$$

$$E_1 = \max[\,E_2 + m_{mx} - m_{mn},\; m_2 + (E_{ave} - E_{mn})/10\,] \tag{7}$$

where $m_{mx}$ and $m_{mn}$ are the maximum and minimum TEO energy over the first five and last five frames.
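The threshold computation of equation (7) and a double-threshold, three-state decision can be sketched in Python as below. Several points are assumptions rather than facts from the paper: the symbol printed as "m2" in equation (7) is not defined in the text and is read here as E2; E1 is used as the high threshold and E2 as the low one, which follows from E1 being at least E2 by construction; and the transition rules between the three states (silence, transition, speech) are a common double-threshold scheme, since the paper does not spell them out. All function and variable names are hypothetical.

```python
import numpy as np

SILENCE, TRANSITION, SPEECH = 0, 1, 2

def dynamic_thresholds(energy, n_edge=5):
    """Return (E1, E2) following equation (7) for a per-frame energy curve."""
    e = np.asarray(energy, dtype=float)
    edge = np.concatenate([e[:n_edge], e[-n_edge:]])  # assumed-silent frames
    m_max, m_min = edge.max(), edge.min()
    e_ave, e_min = e.mean(), e.min()
    e2 = max(0.4 * m_max + 0.6 * e_min, 0.3 * e_ave + 0.7 * e_min)
    # ASSUMPTION: the undefined "m2" in the printed equation is read as E2.
    e1 = max(e2 + m_max - m_min, e2 + (e_ave - e_min) / 10.0)
    return e1, e2

def detect_endpoints(energy, high, low):
    """Return (start_frame, end_frame) of the first speech segment, or None.

    Assumed rules: above `low` a candidate start is recorded (TRANSITION);
    the candidate is confirmed once `high` is exceeded (SPEECH) and dropped
    if the energy falls back to `low` or below first; speech ends at the
    last frame before the energy returns to `low` or below.
    """
    state, start = SILENCE, None
    for i, e in enumerate(energy):
        if state == SILENCE:
            if e > high:
                state, start = SPEECH, i
            elif e > low:
                state, start = TRANSITION, i
        elif state == TRANSITION:
            if e > high:
                state = SPEECH
            elif e <= low:
                state, start = SILENCE, None
        elif e <= low:  # state == SPEECH and energy has dropped
            return start, i - 1
    return (start, len(energy) - 1) if state == SPEECH else None
```

A practical detector would usually add minimum-duration checks for speech and silence; they are left out here to keep the state logic visible.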

4. Experiments and results

4.1 Test set

To test the proposed method, speech samples were recorded with a sound card and stored as WAV files. The samples use a sampling frequency of 8 kHz, 8-bit quantization, 16-bit sampling accuracy, a frame size of 200 samples with a Hamming window, and a frame shift of 100 samples. The background noise consists of four common noise types from the NOISEX-92 noise library: white noise (White), babble noise from background speakers (Babble), factory noise (Factory) and high-speed car noise (Volvo). At four different SNRs, endpoint detection is applied to the recorded samples and the results are compared with the reference endpoints to obtain the correct detection rate. Before the test, the location of the speech endpoints in each sample is marked manually; these manual marks serve as the reference for deciding whether a detected endpoint is accurate.
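The paper does not state how the noise recordings were scaled to reach each SNR. One common approach, sketched here in Python as an assumption, is to scale the noise so that the ratio of average speech power to average noise power over the whole utterance matches the target. The function name `mix_at_snr` is hypothetical.

```python
import numpy as np

def mix_at_snr(speech, noise, snr_db):
    """Add `noise` to `speech` at a target global SNR in dB (illustrative).

    The noise is truncated to the speech length and scaled by a single gain
    so that 10*log10(P_speech / P_scaled_noise) equals `snr_db`.
    """
    speech = np.asarray(speech, dtype=float)
    noise = np.asarray(noise, dtype=float)[: len(speech)]
    if len(noise) < len(speech):
        raise ValueError("noise must be at least as long as speech")
    p_speech = np.mean(speech ** 2)
    p_noise = np.mean(noise ** 2)
    gain = np.sqrt(p_speech / (p_noise * 10 ** (snr_db / 10.0)))
    return speech + gain * noise
```

With `snr_db=-5`, the scaled noise carries roughly three times the average power of the speech, which is the hardest condition reported in the experiments.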



4.2 Experimental results

The content of the speech sample is "wo dao Beijing qu". Endpoint detection is applied to the speech signal using both the traditional Energy-Zero endpoint detection method and the algorithm in this paper, and the detected endpoints are marked. Figure 2 compares the endpoint detection results of the two methods.

(a) Energy-Zero

(b) TEO

Fig. 2 Endpoint detection results of the two methods

In Figure 2, (a) shows the result of the traditional double-threshold energy and zero-crossing endpoint detection method: the top panel is the waveform of the speech signal, the middle panel is the short-time average energy, and the bottom panel is the short-time average zero-crossing rate. Vertical lines mark the detected endpoints (likewise in the following figures). (b) shows the result of the algorithm in this paper: the upper panel is the speech waveform and the lower panel is the TEO energy. The results of the two methods are almost the same at high SNR. Figure 3 shows the results of the algorithm in this paper when the above speech sample is mixed with different noises at an SNR of -5 dB.

(a) White noise

(b) Babble noise

(c) Factory noise

(d) Volvo noise

Fig. 3 Endpoint detection results of speech corrupted by different noises (SNR = -5 dB)

4.3 Result analysis

The experiments show that the algorithm in this paper is simple in theory, requires little computation, is easy to implement in hardware, judges endpoints with high accuracy, is robust to most noise types, has strong anti-interference ability, and uses adaptive decision criteria. However, its endpoint decisions for unvoiced segments are not very satisfactory, mainly because unvoiced segments, like white noise, have no obvious time-domain or frequency-domain characteristics that distinguish them from most noise. Table 1 compares the detection accuracy of the algorithm in this paper with that of the energy method and the spectral entropy method under four kinds of noise and four SNRs.

Table 1 The detection accuracy rates (%) of 3 endpoint detection methods at different SNRs

Noise     Method             15 dB    5 dB     0 dB     -5 dB
White     Energy             90.67    64.38    55.68    42.33
          Spectral entropy   93.22    85.79    78.44    69.74
          TEO                95.31    86.36    77.27    70.56
Babble    Energy             78.34    56.96    44.58    22.78
          Spectral entropy   85.49    78.64    71.33    58.76
          TEO                88.46    81.82    65.82    50.91
Factory   Energy             80.67    58.33    46.64    24.67
          Spectral entropy   88.67    82.78    74.47    63.76
          TEO                96.73    90.91    88.64    76.85
Volvo     Energy             77.34    56.96    44.58    22.78
          Spectral entropy   87.33    80.33    73.23    61.58
          TEO                89.38    83.28    79.55    76.73
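As a compact summary of Table 1, the sixteen accuracy values of each method can be averaged. The Python sketch below does only that arithmetic on the numbers copied from the table; the unweighted mean is a choice made here for illustration and is not a figure reported in the paper.

```python
# Accuracy rates (%) copied from Table 1, sixteen values per method,
# listed in reading order: four noise types, each at 15, 5, 0 and -5 dB.
table1 = {
    "Energy": [90.67, 64.38, 55.68, 42.33, 78.34, 56.96, 44.58, 22.78,
               80.67, 58.33, 46.64, 24.67, 77.34, 56.96, 44.58, 22.78],
    "Spectral entropy": [93.22, 85.79, 78.44, 69.74, 85.49, 78.64, 71.33, 58.76,
                         88.67, 82.78, 74.47, 63.76, 87.33, 80.33, 73.23, 61.58],
    "TEO": [95.31, 86.36, 77.27, 70.56, 88.46, 81.82, 65.82, 50.91,
            96.73, 90.91, 88.64, 76.85, 89.38, 83.28, 79.55, 76.73],
}

# Unweighted mean over all sixteen noise/SNR conditions (an assumed summary).
mean_accuracy = {method: sum(v) / len(v) for method, v in table1.items()}
```

The means come out at roughly 81.2 for TEO, 77.1 for spectral entropy and 54.2 for the energy method.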

As can be seen from Table 1, across the different noise environments the correct detection rate of the TEO-based endpoint detection method is significantly higher than that of the energy method, and its overall detection accuracy is better than that of the spectral entropy method. In particular, it maintains a high detection accuracy at low SNR.

5. Conclusion



This paper gives a theoretical analysis of the frequency-domain characteristics and noise-reduction properties of the TEO and presents a TEO-based voice activity detection algorithm for low SNR. Comparative experiments across different SNRs and detection methods show that the algorithm has good endpoint detection capability at low SNR.

References

[1] ZHANG Junchang, JIANG Fei, LIU Hong. Study on endpoint detection based on multi-characteristic jointed in noisy environment. Computer Engineering and Applications, 2009(45): 114-116.
[2] Teager H, Teager S. Evidence for nonlinear production mechanisms in the vocal tract. Speech Production & Speech Modeling[A], 1990(55): 241-261.
[3] Kaiser J F. On a simple algorithm to calculate the energy of a signal. IEEE International Conference on Acoustics, Speech and Signal Processing[A], 1990: 381-384.