Incremental Learning Algorithm for Speech Recognition

Hisham Darjazini ([email protected]), Qi Cheng ([email protected]), Ranjith Liyana-Pathirana ([email protected])
University of Western Sydney, School of Engineering, Locked Bag 1797, Penrith South DC NSW 1797, Australia

Abstract - This paper presents an implementation of an incremental learning neural network algorithm for speech recognition. The algorithm has been investigated using TIMIT speech samples and shown to achieve high recognition accuracy.

KEY WORDS
Incremental learning, speech recognition

I. INTRODUCTION

Incremental learning updates a recognizer using information obtained from unknown input data (ID). It has the advantage of adapting to changing input without requiring time-consuming retraining. Incremental learning algorithms have mostly been designed and tested for pattern recognition applications [1], [8], [9], [10]. Speech signals vary from speaker to speaker and, even for the same speaker, from time to time, which makes incremental learning a suitable tool for speech recognition. Although incremental learning has been applied to speech enhancement [3], very little research has been reported in the literature on its use in speech recognition. In this paper, we propose an implementation of a feed-forward incremental learning algorithm based on the method developed by Darjazini and Tibbitts [2].

II. WEIGHT SET ADDITION ALGORITHM

The weight set addition algorithm is based on a method for speech recognition that employs a comb of phone sub-recognizers [2]. As shown in Fig. 1, the method employs 55 sub-recognizers for the recognition of 54 phones and a silent period.

Fig. 1. Comb of Sub-recognizers.

All the sub-recognizers are implemented using an identical feed-forward neural network (FF-NN). Each sub-recognizer has an output referred to as the Phone Identification Response (PIR), a continuous variable between 0 and 1. A sub-recognizer indicates that the input speech contains a specific phone if the value of the PIR is close to 1. The tolerance in the network is set to 0.05; therefore, any PIR greater than or equal to 0.95 is taken as an indication of a potential match.

The algorithm extracts a new weight set (WS) from a new unknown data set during recognition. In this algorithm, the sub-recognizer contains two phases of back-propagation instead of one. In the initial trial the network behaves as a normal back-propagation network. In subsequent trials the network performs the incremental learning process by first running the network with the available weight set and measuring the error at the output. If the resulting error is greater than the maximum allowed error, the process is terminated and the input is reported as a non-recognized phone. If the error is less than the maximum allowed error and the output is higher than a minimum acceptable value, the incremental learning phase starts. The goal of the incremental learning phase is to achieve an acceptable value at the output layer by adjusting the weight set using an adaptive learning rate. When this is achieved, the new weight set is saved and sent to the MLWA (see Fig. 2) for later reference. In subsequent recognitions, the new set, as well as all the existing sets, is tested as a potential weight set candidate in the FF-NN. The weight set that produces the highest PIR is selected as the updated weight set. This function is performed by the Most Likelihood Weight Activator (MLWA) unit, shown in Fig. 2.

Fig. 2. Choice of Weight Set for Incremental Learning.
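A minimal sketch of the weight set addition and MLWA selection logic described above is given below. It assumes hypothetical helper functions run_network and incremental_backprop and uses thresholds quoted elsewhere in this paper (PIR match at 0.95, incremental learning range roughly 0.77 to 0.989); it is an illustration under those assumptions, not the authors' implementation.

```python
# Illustrative sketch of the weight set addition step and MLWA selection.
# run_network(ws, mfcc) is assumed to return a PIR in [0, 1];
# incremental_backprop(ws, mfcc) is assumed to return an adjusted copy of ws.
PIR_MATCH = 0.95   # PIR at or above this value counts as a potential match
IL_LOW = 0.77      # lower end of the incremental learning range (0.77 to 0.989)

def recognize_and_update(weight_sets, mfcc, run_network, incremental_backprop):
    """One recognition step of a sub-recognizer with weight set addition."""
    # MLWA: run every stored weight set and keep the one with the highest PIR.
    pirs = [run_network(ws, mfcc) for ws in weight_sets]
    best = max(range(len(pirs)), key=pirs.__getitem__)

    if pirs[best] >= PIR_MATCH:
        # Output close enough to 1: report a match, no weight update needed.
        return "match", weight_sets
    if pirs[best] >= IL_LOW:
        # Acceptable but not a match: run the incremental learning phase and
        # store the adjusted weights as a new candidate set for the MLWA.
        new_ws = incremental_backprop(weight_sets[best], mfcc)
        return "updated", weight_sets + [new_ws]
    # Error exceeds the maximum allowed: terminate as a non-recognized phone.
    return "not_recognized", weight_sets
```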

In Fig. 2, WS1 is obtained from the initial training session, i.e. from the early stages of incremental learning. Subsequent weight sets, along with WS1, are kept in statistical order in the MLWA, and the weight set with the highest probability is the one used most often; other sets come into use more often later on.

Fig. 3 shows the multi-layer structure of a sub-recognizer. The input layer contains 17 processing elements (PEs) that receive 17 input elements representing the Mel-scale frequency cepstrum coefficients (MFCCs) of the corresponding phone. In this structure, the input layer acts as a buffer to the subsequent hidden layers. There are three hidden layers, H1, H2 and H3, containing 34, 51 and 34 PEs respectively. The output layer contains one PE representing a measure of the match between the input speech (stimulus) and a particular phone.

Fig. 3. Structure of Sub-recognizer.

III. EXPERIMENTAL RESULTS AND DISCUSSION

The input data was extracted from 75 spoken sentences of the TIMIT speech database. The sentences were spoken by 25 speakers (5 female and 20 male). Each speaker has one of the three main dialects of American English, and the accents were chosen arbitrarily. The data was mixed to produce as much variety as possible for every phone, so that each sub-recognizer is exposed to, and must deal with, the most varied forms of the same phone. In the primitive representation of the input data, 54 distinct phones appeared in 2440 samples, segmented from 637 words. The table in the appendix shows these phones and their numbers of occurrence.

Experiments were performed by first initiating (first run) the sub-recognizers with the back-propagation learning algorithm and the Delta rule. The exit condition of this session was the number of iterations, set at 500, and all learning rates were initialised to 0.5. The weights were initialised to random, normally distributed values, and the learning set contained non-clustered stimuli. The maximum accepted error (tolerance) was 0.01 and the incremental learning width was 0.219, i.e. the range was from 0.989 to 0.77. The initial session provides the first weight set (WS1) for the MLWA and determines the first cluster of the input data. The number of phone samples for the initial session was 15, and the sub-recognizer converged to the target after 50 epochs. In each epoch, the network adjusted the inner weights of the hidden layers. An error monitor measured the mean squared error (MSE) at each hidden layer and the effect of a particular PE on the overall result of the network. The accuracy of the PIR was within an error value of 0.01, below the tolerance value. The overall performance on the initial learning set was 94.44% accuracy.
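As a rough, self-contained illustration of the sub-recognizer layout (17-34-51-34-1) and the initial back-propagation session described above, the sketch below uses the quoted settings (learning rate 0.5, at most 500 iterations, tolerance 0.01, normally distributed initial weights). The sigmoid activations, the bias handling and the weight scale are assumptions, not details taken from the paper.

```python
import numpy as np

LAYER_SIZES = [17, 34, 51, 34, 1]   # input (17 MFCCs), H1, H2, H3, output (PIR)
LEARNING_RATE = 0.5
MAX_ITERATIONS = 500
TOLERANCE = 0.01                    # maximum accepted error quoted in the paper

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def init_weights(rng):
    """Random, normally distributed weights; biases folded in as extra columns.
    The 0.1 scale (to keep the sigmoids out of saturation) is an assumption."""
    return [rng.standard_normal((n_out, n_in + 1)) * 0.1
            for n_in, n_out in zip(LAYER_SIZES[:-1], LAYER_SIZES[1:])]

def forward(weights, x):
    """Return the activations of every layer for one 17-element MFCC vector."""
    activations = [x]
    for W in weights:
        x = sigmoid(W @ np.append(x, 1.0))   # append 1.0 for the bias term
        activations.append(x)
    return activations

def train_initial_session(weights, stimuli, targets):
    """Plain back-propagation with the delta rule on the initial learning set."""
    for _ in range(MAX_ITERATIONS):
        mse = 0.0
        for x, t in zip(stimuli, targets):
            acts = forward(weights, x)
            delta = (acts[-1] - t) * acts[-1] * (1.0 - acts[-1])  # output delta
            for layer in reversed(range(len(weights))):
                a_in = np.append(acts[layer], 1.0)
                grad = np.outer(delta, a_in)
                # Propagate the error to the previous layer before updating.
                delta = (weights[layer][:, :-1].T @ delta) * acts[layer] * (1.0 - acts[layer])
                weights[layer] -= LEARNING_RATE * grad
            mse += float((acts[-1] - t) ** 2)
        if mse / len(stimuli) < TOLERANCE:    # exit early once the error is small
            break
    return weights
```

Under these assumptions, a single call to train_initial_session over the 15-sample initial learning set would correspond to the first-run session that produces WS1.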


Fig. 4 illustrates the performance of the sub-recognizer in the initial session: Fig. 4(a) shows the mean square error (MSE) and Fig. 4(b) shows the PIR values at the end of the session. The network converged successfully within a short time, at about 50 epochs. The remaining samples were presented to the network in the incremental learning stage, where the performance was close to 99.20%. The failed cases came from samples whose PIRs fell outside the predetermined incremental learning range. In the initiation session, some of the phone sets required up to 13 trials to achieve convergence. This was partially due to the wide range of phone types in the input data; the diversity of the input data resulted in large distances between some of the stimuli presented to the network. Fig. 5 shows two trials on the phone /s/: Fig. 5(a) shows the MSE and the PIR for one of the non-converged trials. When this occurred, the trial was restarted with a new randomly generated weight set, and convergence was eventually achieved, as shown in Fig. 5(b). In several of the phone sets, there was some overlap at the phonemic borders. Such phone sets were therefore amalgamated, with one set of phones treated as a unique cluster within the larger phone set, which was then used to create a larger learning set; for example, the sets /b/ and /bcl/ were merged. The aim is to achieve a larger variety of phone forms in the phonemic knowledge representation. The amalgamation resulted in more learning sessions and more complex work in the MLWA. Finally, the number of distinct phones in the phonemic knowledge was chosen to be 55, and the overall performance of the system converged to the target.
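The restart-on-non-convergence behaviour and the phone-set amalgamation described above can be sketched as follows. The helper names (init_fn, train_fn, converged_fn) and the sample identifiers are hypothetical placeholders, and the limit of 13 trials simply mirrors the number reported above.

```python
import numpy as np

MAX_TRIALS = 13   # some phone sets reportedly needed up to 13 trials

def train_with_restarts(init_fn, train_fn, converged_fn, stimuli, targets, seed=0):
    """Restart training with fresh random weights until a trial converges."""
    rng = np.random.default_rng(seed)
    for trial in range(1, MAX_TRIALS + 1):
        weights = init_fn(rng)                     # new randomly generated weight set
        weights = train_fn(weights, stimuli, targets)
        if converged_fn(weights, stimuli, targets):
            return weights, trial
    raise RuntimeError("no convergence within the allowed number of trials")

# Amalgamation of overlapping phone sets, e.g. /b/ and /bcl/, into one cluster
# before building the larger learning set (sample identifiers are placeholders).
phone_samples = {"b": ["b_01", "b_02"], "bcl": ["bcl_01"]}
phone_samples["b+bcl"] = phone_samples.pop("b") + phone_samples.pop("bcl")
```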

Fig. 4. The Sub-recognizer Performance in the Training Session.

Fig. 5. Illustration of Training Experiments on the Phone /s/.

IV. CONCLUSION

The proposed incremental learning algorithm allows the original sub-recognizers to be updated without causing the system to lose its original phonemic knowledge or suffer from catastrophic forgetting. The system has demonstrated excellent performance. One critical parameter worth mentioning is the incremental learning range: it strongly affects system performance and has to be predetermined. A wrongly chosen range could result in false recognition of phonemically adjacent phones and may lead to a catastrophic forgetting situation.

ABBREVIATIONS

ACL = Activation Control Line
FF-NN = Feed-Forward Neural Network
ID = Input Data
MFCC = Mel-scale Frequency Cepstrum Coefficients
MLWA = Most Likelihood Weight Activator
MSE = Mean Square Error
PE = Processing Element
PIR = Phone Identification Response
WS = Weight Set

APPENDIX

Phone set used in the learning trials and the number of samples for each phone:

ch 12, jh 15, dh 48, f 33, s 126, sh 38, th 10, v 40, z 42, em 3
en 15, eng 1, m 73, n 137, ng 23, nx 13, epi 21, h# 1, pau 22, el 18
hh 15, hv 24, l 82, r 87, w 43, y 24, b 43, bcl 3, d 59, dcl 13
dx 44, g 23, gcl 3, k 87, kcl 13, p 51, q 64, t 85, tcl 22, aa 64
ae 69, ah 34, ao 41, aw 15, ax 75, axh 7, axr 41, ay 31, eh 57, er 37
ey 46, ih 91, ix 136, iy 112, ow 38, oy 11, uh 9, uw 6, ux 28

REFERENCES

[1] D. Chakraborty and N. Pal, A novel learning scheme for multilayered perceptrons to realize proper generalization and incremental learning, IEEE Transactions on Neural Networks, 14(1), January 2003, 1-14.
[2] H. Darjazini and J. Tibbitts, The construction of phonemic knowledge using clustering methodology, Proceedings of the Fifth Australian International Conference on Speech Science and Technology, Dec. 1994, Vol. 1, 202-208.
[3] L. Deng, J. Droppo and A. Acero, Incremental Bayes learning with prior evolution for tracking nonstationary noise statistics from noisy speech data, Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing, Apr. 2003, Vol. 1, 6-10.
[4] L. Fu, H. Hsu and J.C. Principe, Incremental backpropagation learning networks, IEEE Transactions on Neural Networks, 7(3), Nov. 1996, 757-761.
[5] C.M. Higgins and R.M. Goodman, Incremental learning with rule-based neural networks, Proceedings of the IEEE International Joint Conference on Neural Networks, 1991, 875-880.
[6] T. Hoya, On the capability of accommodating new classes within probabilistic neural networks, IEEE Transactions on Neural Networks, 14(2), Mar. 2003, 450-453.
[7] W.L. Mahood, Incremental learning mechanisms for speech understanding, Proceedings of the IEEE International Workshop on Tools for Artificial Intelligence: Architectures, Languages, and Algorithms, Oct. 1989, 237-243.
[8] R. Polikar, L. Udpa, S. Udpa and V. Honavar, Learn++: An incremental learning algorithm for supervised neural networks, IEEE Transactions on Systems, Man, and Cybernetics - Part C: Applications and Reviews, 31(4), Nov. 2001, 497-508.
[9] M.T. Vo, Incremental learning using the time delay neural network, Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing, Vol. 2, April 1994, 629-632.
[10] D. Wang and B. Yuwono, Incremental learning of complex temporal patterns, IEEE Transactions on Neural Networks, 7(6), Nov. 1996, 1465-1481.
