ENDPOINT DETECTION IN CONTINUOUS SPEECH USING NEURAL NETWORKS

Filip Orság*
[email protected]

František Zbořil*
[email protected]

* Faculty of Information Technology, Department of Intelligent Systems, Božetěchova 2, CZ-612 66 Brno

Abstract: This paper deals with the problem of endpoint detection in continuous speech. A back-propagation neural network was used to recognise “non-speech” frames of the speech signal. Several different sets of features were extracted from the speech signal to serve the purpose of endpoint detection, and the results of the experiments are summarised.
Key Words: speech recognition, endpoint detection, neural networks
1 INTRODUCTION
Endpoint detection is a very important problem in many speech-processing systems. Systems that process a word as a unit have to locate its beginning and end. The problem of detecting (locating) the endpoints would seem to be easily solvable for a human, but it has been found to be a very complex and challenging task, in many cases, for a machine. In some situations it is not so difficult to determine the position of the endpoints, e.g. when the signal-to-noise ratio is high enough. But these “mostly ideal” cases are not common. In real cases the noise interferes with some phonemes so much that it is difficult to recognise them correctly. Such phonemes are [1]: unvoiced fricatives (/f/ - “five”, /T/ - “thing”, /h/ - “help”, /S/ - “show”) or voiced fricatives that become unvoiced at the end of a word (“has”), unvoiced stops – plosives (/p/ - “pea”, /t/ - “tea”, /k/ - “kick”), nasals at the end (“gone”), and trailing vowels at the end (“zoo”).

2 MEANS OF ENDPOINT DETECTION
There are many methods that could be used to determine the endpoints, but many of them are not reliable enough to locate the exact position of either the beginning or the end of a word in continuous speech. Some of them are described in this chapter.

First, some elementary notation must be declared. Assume a signal s(n) that is N samples long. This signal is separated into frames of a length of typically 10-20 ms, which gives us K = 220 samples in case the sampling frequency is 22 kHz and the frame length 10 ms. The frames may be windowed, if necessary, to improve the frequency characteristic. To obtain better results, it is useful to define an overlapping of the frames. The overlapping helps especially when the frame length exceeds the recommended limits (i.e. > 10-20 ms). The overlapping is also practical in cases in which a higher precision is needed, because it refines the frame step. If the overlapping were O = 20 samples and the frame length K = 220 samples, then the effective frame length would be 200 samples, which is a more precise step than the 220-sample step. The indices or time instants n of the whole signal s(n) fit in a range from 1 to N, i.e. n = 1, 2, ..., N. There are M frames and the index m of a frame fits in a range from 1 to M, i.e. m = 1, 2, ..., M. If the overlapping is O, then it follows from the above that N ≥ M(K − O). The m-th frame is defined as

    s(m, k) = s\left( (m-1) \cdot (K-O) + k - 1 \right), \quad m = 1, 2, \ldots, M;\ k = 1, 2, \ldots, K.    (1)
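To make the framing concrete, the following is a minimal NumPy sketch of this scheme, using the example values above (K = 220, O = 20); the function name and the trimming to complete frames are illustrative choices, not part of the original method:

    import numpy as np

    def split_into_frames(s, K=220, O=20):
        """Split signal s into overlapping frames of length K with overlap O,
        advancing by the effective frame length K - O, as in equation (1)."""
        step = K - O
        M = (len(s) - O) // step              # number of complete frames
        frames = np.empty((M, K))
        for m in range(M):
            frames[m] = s[m * step : m * step + K]
        return frames

    # Example: one second of a 22 kHz signal
    signal = np.random.randn(22000)
    frames = split_into_frames(signal)
    print(frames.shape)                       # (109, 220)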
2.1 NOISE THRESHOLD
The most common and easiest method of endpoint detection is the determination of a noise threshold. In this case the threshold is determined from a noise signal s_noise, the length of which is N_noise samples. Then the threshold T is

    T = \frac{1}{N_{noise}} \sum_{n=1}^{N_{noise}} \left| s_{noise}(n) \right|.    (2)
This method is disadvantageous. Assume a noisy environment and a low-quality signal source. It is then very difficult to determine a threshold that would be usable to separate all phonemes, because the unvoiced phonemes are so much like the noise that it is practically impossible to recognize them. On the other hand, the method is attractive because of its simplicity; it is usable in case there is a very good signal source, i.e. a high signal-to-noise ratio.
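As a minimal sketch of the threshold method (assuming NumPy; the frame-level decision rule below is an illustrative assumption, since the text defines only the threshold itself):

    import numpy as np

    def noise_threshold(s_noise):
        """Average magnitude of a pure-noise recording, equation (2)."""
        return np.mean(np.abs(s_noise))

    def is_speech(frame, T):
        """Illustrative decision: a frame whose average magnitude
        exceeds the noise threshold T is taken to be speech."""
        return np.mean(np.abs(frame)) > T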
2.2 ENERGY AND ZERO-CROSSING RATE
Another widely used method is based on a comparison of the energy and zero-crossing rate (ZCR) of all the frames. The energy of the m-th frame is

    E(m) = \sum_{k=1}^{K} s^2(m, k),    (3)

and the ZCR is

    Z(m) = \frac{1}{2K} \sum_{k=2}^{K} \left| \mathrm{sgn}(s(m, k)) - \mathrm{sgn}(s(m, k-1)) \right|,    (4)

where

    \mathrm{sgn}(s(m, k)) = \begin{cases} +1, & s(m, k) \geq 0 \\ -1, & s(m, k) < 0 \end{cases}    (5)
The ZCR is higher in case a frame fits in an unvoiced region of the speech or in case it does not belong to the speech at all, i.e. the noise has higher values of the ZCR. The value of the energy is high in case a frame fits in a voiced region of the speech. Knowing this, it is possible to find approximate endpoints of a word. The method was published in detail in [2] and is based on a search for thresholds of the ZCR and energy.
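A short NumPy sketch of both measures (function names are illustrative):

    import numpy as np

    def energy(frame):
        """Short-time energy of one frame, equation (3)."""
        return np.sum(frame ** 2)

    def zero_crossing_rate(frame):
        """Zero-crossing rate of one frame, equations (4) and (5)."""
        signs = np.where(frame >= 0, 1.0, -1.0)      # sgn of equation (5)
        return np.sum(np.abs(np.diff(signs))) / (2 * len(frame))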
2.3 SEARCHING FOR THE PHONEME BOUNDS
Another way to detect the endpoints is a method based on a comparison of the autocorrelation values of two frames. The method consists of four main steps. Because it is too extensive to be described here in detail, readers interested in this method are referred to [3]. The method proceeds as follows (a sketch of the first two steps is given after the list):

1. Calculation of a function B(m) for each frame m = l_2 + 1, l_2 + 2, ..., M − l_1:

    B(m) = b \cdot \frac{\left| R(m+l_1, 0) - R(m-l_2, 0) \right|}{R(m+l_1, 0) + R(m-l_2, 0)} + \sum_{k=1}^{max_K} \left| R(m+l_1, k) - R(m-l_2, k) \right|,    (6)

where b is a multiplication factor, l_1 is the forward shift, l_2 the backward shift, and max_K is the maximal order of the function R(m, k), which is the k-th order autocorrelation of the m-th frame defined as

    R(m, k) = \sum_{n=1}^{K-k} s(m, n) \cdot s(m, n+k), \quad k = 0, 1, \ldots, K-1.    (7)

2. Smoothing of the function B(m) [3]:

    B_f(m) = \frac{1}{3} \left[ B(m-1) + B(m) + B(m+1) \right], \quad m = 2, 3, \ldots, M-1.    (8)
3. Searching for the local maxima and minima. The maxima reflect the disparity of the two compared frames, i.e. the frames m + l_1 and m − l_2, and the minima reflect their similarity.

4. Exclusion of the insignificant extremes. Not all of the extremes found in step 3 are bounds between phonemes; only the significant ones can be. This is the reason for excluding the insignificant extremes.

After these steps there is a set of maxima, each of which could be a bound between phonemes and therewith an endpoint as well. But this method is not easily applicable to endpoint detection in continuous speech, because it is difficult to discern between the found phoneme bounds and the real endpoints; each found bound could be an endpoint. The method is usable only when it is clear that there is just one word in the given utterance. Then the results are very good, although when there is a fricative at the end or beginning of the word, this method fails like the other algorithms.
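The following NumPy sketch implements steps 1 and 2; the parameter values b, l1, l2 and max_k are placeholders here ([3] gives the actual choices), and a small constant is added to the denominator to guard against silent frames:

    import numpy as np

    def autocorrelation(frame, max_k):
        """Autocorrelations R(m, k) for k = 0..max_k, equation (7)."""
        K = len(frame)
        return np.array([np.dot(frame[:K - k], frame[k:]) for k in range(max_k + 1)])

    def boundary_function(frames, b=1.0, l1=1, l2=1, max_k=16):
        """Dissimilarity B(m) of the frames m + l1 and m - l2, equation (6)."""
        M = len(frames)
        R = np.array([autocorrelation(f, max_k) for f in frames])
        B = np.zeros(M)
        for m in range(l2, M - l1):          # m = l2 + 1, ..., M - l1 (1-based)
            r_fwd, r_bwd = R[m + l1], R[m - l2]
            B[m] = (b * abs(r_fwd[0] - r_bwd[0]) / (r_fwd[0] + r_bwd[0] + 1e-12)
                    + np.sum(np.abs(r_fwd[1:] - r_bwd[1:])))
        return B

    def smooth(B):
        """Three-point smoothing B_f(m), equation (8)."""
        Bf = B.copy()
        Bf[1:-1] = (B[:-2] + B[1:-1] + B[2:]) / 3.0
        return Bf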
3 SEARCHING FOR THE ENDPOINTS USING A NEURAL NETWORK
Neural networks are well known for their ability to separate vectors and to classify them. This was the main reason to choose a neural network in this case. The decision whether a frame is or is not a part of an utterance is a typical task for a neural network. From all the existing types of neural networks, the feedforward neural network was chosen as the best fit for this task.
3.1 CHOICE OF THE FEATURES
The choice of a feature set is very important. There are many possibilities and combinations, but in such cases it is usually true that less complicated is better. Various sets of features were used in the training and recognition processes. The simplest choice, the energy and zero-crossing rate of a frame, results from the method mentioned in section 2.2. The second set contains the zero-crossing rate as well, but instead of the simple energy a Sum of the Magnitudes of the Fourier Spectrum (SMFS) is calculated. The Discrete Fast Fourier Transformation (DFFT) of the m-th frame of the signal s is defined as

    S(m, k) = \sum_{n=1}^{K} s(m, n) \cdot e^{-jkn \frac{2\pi}{K}}, \quad k = 0, 1, \ldots, K-1.    (9)
Then the sum of the magnitudes of the spectrum is defined as follows:

    SMFS(m) = \sum_{k=0}^{K-1} \left| S(m, k) \right|.    (10)
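In code, the SMFS of one frame is essentially a one-liner (NumPy sketch; np.fft.fft computes the discrete Fourier transform of equation (9)):

    import numpy as np

    def smfs(frame):
        """Sum of the magnitudes of the Fourier spectrum, equation (10)."""
        return np.sum(np.abs(np.fft.fft(frame)))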
The SMFS may seem to be equivalent to the energy, because the shape of the curve that stands for the energy flow of a word is nearly the same as that of the SMFS, but there is an important difference. The energy of the problematic phonemes is usually very low, which causes difficulties when trying to recognize them against a noisy background. But if we compare the spectrum of a frame containing noise with that of a frame containing an unvoiced phoneme, we can see a difference between them. The spectrum of the noise is usually very well balanced (which does not hold for the spectrum of a phoneme) and its magnitude values are lower than those of the phoneme. This results in a lower error rate of the endpoint detection when using the SMFS instead of the energy.

The next four sets contain the LPCs (Linear Prediction Coefficients). More exactly, these coefficients are the coefficients of a linear prediction filter (see [1]). To obtain these parameters, the autocorrelation method of autoregressive (AR) modelling is used. The generated filter might not model the process exactly even if the data sequence is truly an AR process of the correct order. This is because the autocorrelation method implicitly windows the data, that is, it assumes that signal samples beyond the length N are 0. The LPCs are the least squares solution to

    Xa \approx b,    (11)
where

    X = \begin{pmatrix}
    x(1)   & 0      & \cdots & 0      \\
    x(2)   & x(1)   & \ddots & \vdots \\
    \vdots & x(2)   & \ddots & 0      \\
    x(N)   & \vdots & \ddots & x(1)   \\
    0      & x(N)   & \ddots & x(2)   \\
    \vdots & \ddots & \ddots & \vdots \\
    0      & \cdots & 0      & x(N)
    \end{pmatrix}, \quad
    a = \begin{pmatrix} a(1) \\ \vdots \\ a(K) \end{pmatrix}, \quad
    b = \begin{pmatrix} 1 \\ 0 \\ \vdots \\ 0 \end{pmatrix},    (12)

and N is the length of the signal x, K is the order of the prediction and a(i) is the i-th LP coefficient. Solving the least squares problem via the normal equations leads to the Yule-Walker equations

    \begin{pmatrix}
    R(0)   & R(1)   & \cdots & R(K-1) \\
    R(1)   & R(0)   & \ddots & \vdots \\
    \vdots & \ddots & \ddots & R(1)   \\
    R(K-1) & \cdots & R(1)   & R(0)
    \end{pmatrix} \cdot
    \begin{pmatrix} a(1) \\ a(2) \\ \vdots \\ a(K) \end{pmatrix} =
    \begin{pmatrix} -R(1) \\ -R(2) \\ \vdots \\ -R(K) \end{pmatrix},    (13)
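A minimal sketch of the autocorrelation method with the Levinson-Durbin recursion (NumPy; numerical safeguards, e.g. against a zero prediction error, are omitted for brevity):

    import numpy as np

    def lpc(frame, order):
        """LP coefficients a(1)..a(order) of one frame, obtained by
        solving the Yule-Walker equations (13) with Levinson-Durbin."""
        K = len(frame)
        # autocorrelations R(0)..R(order) of the frame, equation (7)
        R = np.array([np.dot(frame[:K - k], frame[k:]) for k in range(order + 1)])
        a = np.zeros(order)
        err = R[0]                            # prediction error power
        for i in range(order):
            # reflection coefficient of the next recursion step
            k = -(R[i + 1] + np.dot(a[:i], R[i:0:-1])) / err
            a[:i] = a[:i] + k * a[:i][::-1]
            a[i] = k
            err *= 1.0 - k * k
        return a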
where R(k) is the k-th order autocorrelation estimated for the signal x. The Yule-Walker equations are solved in O(K²) flops by the Levinson-Durbin algorithm. This was applied to each frame m = 1, 2, ..., M.

The last two sets are composed of the Mel Frequency Spectral Coefficients (MFSC) and Mel Frequency Cepstral Coefficients (MFCC). These coefficients (above all the MFCC) are widely used in speech recognition. A “mel” is a unit of measure of the perceived pitch or frequency of a tone. It does not correspond linearly to the physical frequency of the tone, as the human auditory system apparently does not perceive pitch in a linear manner [1]. The MFSCs of the m-th frame can be computed using the DFFT (equation (9)) as

    MFSC(m, i) = \sum_{k=0}^{K-1} F_i(k) \cdot \left| S(m, k) \right|^2,    (14)

where F_i is the i-th filter from a bank of i = 1, 2, ..., I triangular filters based upon the mel-scale [1], [3]. The MFCCs of the m-th frame result from the MFSCs as

    MFCC(m, n) = \sum_{i=1}^{I} \log \left( MFSC(m, i) \right) \cdot \cos \left( n (i - 0.5) \frac{\pi}{I} \right).    (15)
This implies that there are I MFSCs, from which it is possible to calculate even more than I MFCCs, but this is not usually done [1].
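A sketch of both coefficient sets, under stated assumptions: the filter-bank construction below (linearly spaced centres on the mel scale, unit-height triangles over the 0..fs/2 half of the spectrum) is one common choice rather than necessarily the one used here, and a small constant guards the logarithm:

    import numpy as np

    def mel(f):
        """Physical frequency in Hz to the mel scale."""
        return 2595.0 * np.log10(1.0 + f / 700.0)

    def mel_to_hz(m):
        return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

    def mel_filter_bank(I, K, fs):
        """I triangular filters F_i(k) over K FFT bins."""
        edges = mel_to_hz(np.linspace(0.0, mel(fs / 2.0), I + 2))
        bins = np.floor(edges / fs * K).astype(int)
        F = np.zeros((I, K))
        for i in range(I):
            lo, mid, hi = bins[i], bins[i + 1], bins[i + 2]
            F[i, lo:mid] = (np.arange(lo, mid) - lo) / max(mid - lo, 1)
            F[i, mid:hi] = (hi - np.arange(mid, hi)) / max(hi - mid, 1)
        return F

    def mfcc(frame, I=12, n_coeffs=12, fs=22050):
        """MFCCs of one frame via the MFSCs, equations (14) and (15)."""
        K = len(frame)
        power = np.abs(np.fft.fft(frame)) ** 2       # |S(m, k)|^2
        mfsc = mel_filter_bank(I, K, fs) @ power     # equation (14)
        n = np.arange(1, n_coeffs + 1)[:, None]
        i = np.arange(1, I + 1)[None, :]
        dct = np.cos(n * (i - 0.5) * np.pi / I)      # cosine basis of equation (15)
        return dct @ np.log(mfsc + 1e-12)            # MFCCs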
3.2 THE NEURAL NETWORK
For this purpose, a feedforward neural network with either one or two hidden layers and one output neuron was chosen. The number of input neurons is the same as the length of an input vector. The net was trained using the Levenberg-Marquardt back-propagation method, which proved itself to be the best choice. The performance goal (mean-square error) of the net was set to 10^-10. This was a very hard limit on the training, but the algorithm showed its strengths and the performance goal was reached many times. Of course, it was not possible to reach the performance goal every time, and that is why the limit of maximum epochs was set to 500 and the minimum gradient was set to 10^-12. The tangential (hyperbolic tangent) function was chosen as the transfer function of all the neurons.
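As an illustration of the topology, the sketch below builds the smallest one-hidden-layer variant in PyTorch. Note its assumptions: the paper trains with Levenberg-Marquardt back-propagation (as in MATLAB's trainlm), for which plain PyTorch has no optimizer, so ordinary gradient descent on the mean-square error is substituted here, and the ±1 target coding and the input size of 2 are likewise illustrative choices:

    import torch
    import torch.nn as nn

    # Smallest tested one-hidden-layer topology (5x1); the input size
    # depends on the chosen feature set (here 2, e.g. ZCR + SMFS).
    net = nn.Sequential(
        nn.Linear(2, 5),
        nn.Tanh(),                    # "tangential" transfer function
        nn.Linear(5, 1),
        nn.Tanh(),
    )

    def train(net, features, targets, epochs=500, goal=1e-10):
        """Train until the MSE goal or the epoch limit is reached.
        features: (n_frames, 2) tensor; targets: (n_frames, 1) tensor
        with +1 for speech and -1 for non-speech frames (assumed coding)."""
        optimiser = torch.optim.Adam(net.parameters(), lr=1e-3)
        loss_fn = nn.MSELoss()
        for _ in range(epochs):
            optimiser.zero_grad()
            loss = loss_fn(net(features), targets)
            if loss.item() < goal:
                break
            loss.backward()
            optimiser.step()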
3.3 TESTS AND RESULTS
To get the most objective results, five different network topologies were applied. The first tested network contained 2 hidden layers, both with 10 neurons. The second one was the same as the first, but with only 5 neurons in the first hidden layer. The next three tested configurations consisted of only one hidden layer with 10, 5, and 2 neurons.

All networks were trained to find the "non-speech" frames of an utterance and then their error rate was determined on other utterances. The utterances were recorded via an ordinary microphone with a rather low signal-to-noise ratio, which should test the quality of the networks and the chosen features. The sampling frequency of the recording was 22050 Hz and the precision was 16 bits per sample. The features were calculated from frames of different lengths: 1024 samples (which is in this case 46 ms), 512 samples (23 ms) and 256 samples (11 ms). Usually a length of 10-20 ms is accepted as a standard in speech processing, but for this task a frame length of 46 ms was tested as well. The frames overlapped by half their length, which resulted in a 5 ms step in case of the 11 ms frame length, an 11 ms step in case of the 23 ms frame length and a 23 ms step in case of the 46 ms frame length.
K = 1024 (i.e. 46 ms), O = 512

Topology   Energy+ZCR  SMFS+ZCR  LPCs(4)  LPCs(8)  LPCs(16)  LPCs(24)  MFSCs(12)  MFCCs(12)
10x10x1       6.7         1.1      1.7      4.2      7.6       11.2       3.4       16.0
5x10x1        4.2         0.6      4.2      8.4     10.1        2.5       0.8       10.1
10x1          1.7         0.8      2.5      3.4      9.2        9.2       1.7       33.6
5x1           2.5         0.3      3.4      7.6      4.2        5.4       2.5       37.0
2x1           7.6         0.3      8.4      7.6      9.2        3.3       2.9        3.4

K = 512 (i.e. 23 ms), O = 256

Topology   Energy+ZCR  SMFS+ZCR  LPCs(4)  LPCs(8)  LPCs(16)  LPCs(24)  MFSCs(12)  MFCCs(12)
10x10x1      17.6         1.9      2.5      7.9      3.8        4.2      11.0       10.8
5x10x1        5.8         1.3      2.1      8.3      6.3        3.4       2.5        7.1
10x1          5.9         1.7      2.5      4.6      6.7        4.2       2.1        8.8
5x1           6.7         0.8      4.6      6.3      5.8        2.5       3.5       12.5
2x1           4.6         0.8      2.5      5.6      5.0        7.1       1.7        2.1

K = 256 (i.e. 11 ms), O = 128

Topology   Energy+ZCR  SMFS+ZCR  LPCs(4)  LPCs(8)  LPCs(16)  LPCs(24)  MFSCs(12)  MFCCs(12)
10x10x1       2.0         1.7      2.3      4.2      8.9        6.0       2.3       11.2
5x10x1        1.9         0.8      3.1      4.8      5.2        3.1       2.1        7.7
10x1          1.5         1.6      4.6      7.3      6.4        9.1       3.1       15.8
5x1           3.5         0.8      1.7     11.0      4.6        8.5       1.0        4.4
2x1           1.2         0.6      2.9      9.2      4.0        5.4       2.1       10.8
Tab. 1: Error rate (%) of the endpoint detection using a feedforward neural network. Three different frame lengths were used to train the neural networks, five network topologies were tested to find out the influence of the topology, and eight sets of features were compared to find the best solution (the numbers in parentheses give the count of coefficients in the set).
The error rate was calculated as

    ERR = \frac{F}{M} \cdot 100,    (16)
where M is the total count of the frames of all the test utterances and F is the count of the misclassified frames (i.e. the frames which were classified as non-speech frames although they were speech frames, and vice versa). The results of all tests are summarized in Table 1. They show that in this case it is not useful to enlarge the network, because the smaller networks were in many cases able to reach a lower error rate than the larger ones. The smaller the network, the better its usability in real-time applications (because of its speed). Also, the features that are usually used in speech processing (the LPCs, the MFCCs) proved to be unsuitable for the task of endpoint detection using a neural network. The best choice seems to be a combination of the ZCR together with the SMFS, and the MFSCs. Each of the networks was able to find the beginning or end of a word when using the ZCR and SMFS or the MFSCs, even if it contained a "problematic" phoneme mentioned in chapter 1. Both of these features are based on the Fourier transformation, so the speed of the FFT algorithm limits their usage, but with a good and really fast FFT it is possible to apply them in real-time applications.
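For completeness, the error rate of equation (16) as a short NumPy sketch:

    import numpy as np

    def error_rate(predicted, actual):
        """Percentage of misclassified frames, equation (16)."""
        F = np.sum(np.asarray(predicted) != np.asarray(actual))
        return F / len(actual) * 100.0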
4 CONCLUSION
The results of the tests proved that a simple solution might be much better than a complicated one. The task of endpoint detection is really not easy at all for a machine, but the feedforward neural network with a good back-propagation training algorithm defended its position in speech processing. Inaccurate endpoint detection can cause misclassification more than any other possible mistake. The results proved that it is possible to detect the beginning and end of a word in an utterance quite accurately, which gives a very good base for further processing of the found word (i.e. recognition of this word, speaker verification based upon this word, etc.).

ACKNOWLEDGEMENTS
The research has been done with the support of FRVS project No. FR0835/2002/G1, GA CACR project No. 102/01/1485 and Research Intention No. CEZ: J22/98:262200012.

REFERENCES

[1] Deller, J.R., Hansen, J.H.L., Proakis, J.G.: Discrete-Time Processing of Speech Signals, New York, USA, IEEE Press, 2000, ISBN 0-7803-5386-2
[2] Rabiner, L.R., Sambur, M.R.: An Algorithm for Determining the Endpoints of Isolated Utterances, Bell System Technical Journal, vol. 54, pp. 297-315, 1975
[3] Orság, F.: Anwendungen der digitalen Sprachverarbeitung (Applications of Digital Speech Processing), diploma thesis, Brno, 2001