Dipanwita Paul et al. / International Journal of Engineering Science and Technology (IJEST)
AUTOMATED SPEECH RECOGNITION OF ISOLATED WORDS USING NEURAL NETWORKS

DIPANWITA PAUL
Masters Research Scholar, School of Education Technology, Jadavpur University, Kolkata 700032, India
[email protected]
Dr. RANJAN PAREKH
Asst. Professor, School of Education Technology, Jadavpur University, Kolkata 700032, India
[email protected]

Abstract: This paper presents a methodology for automated recognition of isolated words independent of the speaker. It uses a feature vector consisting of a combination of the first three formant frequencies of the vocal tract and the mean zero crossing rate (ZCR) of the audio signal. Formant frequencies are estimated by simulating the vocal tract with an LPC filter and calculating its resonant frequencies. The ZCR is computed by partitioning the audio signal into segments and counting the number of times the signal crosses the zero amplitude level within each segment. A neural network (multi-layer perceptron) is used as a classifier for identifying the spoken word. The network is trained using a set of specific words uttered by nine speakers (both male and female) and tested on the same words uttered by a different set of speakers. The accuracies obtained indicate that the feature set performs better than contemporary works in the extant literature.

Keywords: Speech recognition; formant frequencies; zero crossing rate; neural network.

1. Introduction

Automated Speech Recognition (ASR) is a popular and challenging area of research in developing human-computer interaction. Typical application areas include data entry into forms, database management and control, keyboard enhancement, hands-free and eyes-free control of manufacturing processes, biometrics, computer aided instruction and so on. The main challenges of speech recognition lie in modeling the variations of speech uttered by different individuals belonging to different geographical regions, social backgrounds, ages, genders, occupations etc. Added complications are introduced by the fact that the same word might be spoken differently by the same individual at different points of time and in different contexts. From the viewpoint of computer based processing, analyzing speech signals is generally more difficult than image processing, since audio signals vary over time and hence the signal properties keep changing continuously. This paper provides a study for modeling isolated words spoken by different individuals, using four features, namely the first three formant frequencies and the mean zero crossing rate of the audio signal. The paper is organized as follows: section 2 reviews earlier work in this area, section 3 describes the proposed approach, section 4 tabulates the details of the experiments and the results obtained, and section 5 provides the overall conclusions and future scope.

2. Previous Work

Initial approaches to content based audio similarity, which compared audio samples directly, had limited accuracy because different speech signals may use different digitization parameters. Later approaches used features extracted from the audio files to characterize them. In [1] the author has used features based on the wavelet transform to recognize five Malayalam words recorded by different persons. The feature vector consists of statistical metrics derived from the wavelet coefficients: mean, standard deviation, energy, kurtosis and
skewness. The classifier used in [1] is a layered feed-forward artificial neural network trained with the back-propagation algorithm. The accuracy obtained with prerecorded samples is 100%, and with real-time testing 66%. In [2] the authors present an approach to developing an ASR system for the Urdu language. The proposed system is based on an open-source speech recognition framework called Sphinx-4, which uses the statistical Hidden Markov Model (HMM) approach, and targets a speaker-independent ASR system for a small vocabulary. In [3] the authors study the problem of speech recognition in the presence of non-stationary sudden noise, which is very likely to occur in home environments. As a model compensation method for this problem, they investigated the use of a factorial hidden Markov model (FHMM) architecture developed from a clean-speech hidden Markov model (HMM) and a sudden-noise HMM. In [5] the authors present a speaker-independent large-vocabulary continuous speech recognition technique for recognizing Mongolian vocabulary. In [6] the authors present a framework of HMM parameter adaptation for improving automatic speech recognition (ASR) performance in noisy environments, which combines clean hidden Markov models (HMMs) with a noise model. In [7], it was shown that 2-D cepstrum (TDC) analysis enables a compact representation of speech signals; the authors proposed a TDC-HMM speech recognizer with a small number of acoustic observations. In [8] the authors present a new robust feature for speech recognition, obtained from cepstral mean normalized reduced-order Linear Predictive Coding (LPC) coefficients derived from speech frames decomposed using the Discrete Wavelet Transform (DWT). In [9] the authors present an approach to speech recognition using frequency spectral information on the Mel scale to improve the speech feature representation in an HMM based recognition approach. In [10] the authors present automated isolated-word speech recognition for the Malay language that relies on the well known and widely used statistical method for characterizing speech patterns, the Hidden Markov Model (HMM). In [12] the authors present a vowel (Devanagari) recognition system based on the LPC model used as a feature extraction technique, with a distance measure used for recognition; in the course of testing, the system recognized 88% of the vowels in the database.

3. Proposed Approach

3.1. Formant Frequencies

Formants are the resonance frequencies of the vocal tract and typically contribute to the intelligibility of speech. They are measured by observing peaks in the sound spectrum. Different types of sounds uttered by the human voice box, especially the vowels, can be characterized by their frequency components, and this forms the basis of speech recognition by frequency analysis of the formants. The formant with the lowest frequency is called ff1 and the subsequent higher formants are called ff2, ff3 and so on. Since the formant frequencies are properties of the vocal tract, they need to be inferred rather than measured directly. To estimate formant frequencies the speech signal is modeled as if it were generated by a particular kind of source and filter; here an LPC filter is used. LP (linear prediction) is a mathematical operation which estimates the current sample of a discrete signal as a linear combination of several previous samples.
The prediction error, i.e. the difference between the predicted and actual values, is called the residual. If xn' is the predicted value of xn, it is given by Eq. (1), where {ai}, i = 1, ..., p, are the filter coefficients.

$$x_n' = a_1 x_{n-1} + a_2 x_{n-2} + \cdots + a_p x_{n-p} = \sum_{i=1}^{p} a_i x_{n-i} \qquad (1)$$
The number of filter coefficients p is determined from the empirical rule in Eq. (2), as per the procedure outlined in [13], where fs is the sampling rate of the digitized speech signal.

$$p = 2 + \frac{f_s}{1000} \qquad (2)$$
To find the formant frequencies from the filter, the locations of the resonances that make up the filter need to be determined. This involves treating the filter coefficients as a polynomial and solving for the roots of the polynomial. If rk is the k-th root, then the corresponding formant frequency (ffk) is given by Eq. (3), where Im(rk) and Re(rk) designate the imaginary and real parts of the root. The computations are done only for roots having Im(rk) > 0.

$$ff_k = \arctan\!\left(\frac{\operatorname{Im}(r_k)}{\operatorname{Re}(r_k)}\right), \qquad \operatorname{Im}(r_k) > 0 \qquad (3)$$
The formant frequencies are scaled by a factor of fs/2π to convert them to Hertz and then sorted. The three lowest values correspond to the formant frequencies ff1, ff2 and ff3 and are used here in the feature vector.
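As an illustration of how Eqs. (1)-(3) fit together, the following is a minimal Python/NumPy sketch of the formant estimation step. It is not the authors' original implementation: the LPC coefficients are fitted here with the standard autocorrelation (Yule-Walker) method, the Hamming window is an added assumption, and the function names are purely illustrative.

```python
import numpy as np

def lpc_coefficients(x, p):
    """Fit the order-p LPC filter of Eq. (1) by the autocorrelation
    (Yule-Walker) method and return the coefficients a_1 ... a_p."""
    x = x * np.hamming(len(x))                       # taper (an added assumption)
    r = np.correlate(x, x, mode="full")[len(x) - 1:len(x) + p]
    R = np.array([[r[abs(i - j)] for j in range(p)] for i in range(p)])
    return np.linalg.solve(R, r[1:p + 1])

def first_three_formants(segment, fs):
    """Estimate (ff1, ff2, ff3) in Hz for one speech segment: model order
    from Eq. (2), roots of the LPC polynomial, angles scaled by fs/(2*pi)."""
    p = int(2 + fs / 1000)                           # Eq. (2): p = 46 for fs = 44100 Hz
    a = lpc_coefficients(segment.astype(np.float64), p)
    # Prediction-error filter A(z) = 1 - a1 z^-1 - ... - ap z^-p
    roots = np.roots(np.concatenate(([1.0], -a)))
    roots = roots[np.imag(roots) > 0]                # keep roots with Im(r_k) > 0 only
    # Angle of each root (Eq. (3)); arctan2 is used for numerical robustness
    freqs = np.arctan2(np.imag(roots), np.real(roots)) * fs / (2 * np.pi)
    return tuple(np.sort(freqs)[:3])                 # three lowest resonances
```

A practical implementation would usually also discard spurious very-low-frequency roots and check formant bandwidths, details the paper does not discuss.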
3.2. Zero Crossing Rate (ZCR)

The ZCR of an audio signal is a measure of the number of times the signal crosses the zero amplitude line by a transition from positive to negative or vice versa. The audio signal is divided into temporal segments 20 ms in size, and the zero crossing rate of the k-th segment (zk) is computed using Eq. (4), where sgn x(n) indicates the sign of the n-th sample x(n) and N is the total number of samples in the segment. sgn x(n) can take three possible values, +1, 0 and -1, depending on whether the sample is positive, zero or negative. The segment length is fixed at 20 ms because the human perceptual system is generally not more precise, and because speech signals remain approximately stationary over intervals of 5-20 ms [11].

$$z_k = \frac{1}{2N}\sum_{n=1}^{N} \left|\operatorname{sgn} x(n) - \operatorname{sgn} x(n-1)\right| \qquad (4)$$
The mean zero crossing rate Zm for all segments is calculated as per Eq. (5), where S is the total number of segments in the audio file.

$$Z_m = \frac{1}{S}\sum_{k=1}^{S} z_k \qquad (5)$$
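A minimal sketch of Eqs. (4)-(5) in the same NumPy style is given below; the handling of the sample at each segment boundary is a small assumption not fixed by the text.

```python
import numpy as np

def mean_zero_crossing_rate(x, fs, frame_ms=20):
    """Mean ZCR (Eq. (5)) of a signal cut into 20 ms segments,
    each segment's rate z_k computed as in Eq. (4)."""
    n = int(fs * frame_ms / 1000)                        # samples per 20 ms segment
    rates = []
    for k in range(len(x) // n):
        seg = x[k * n:(k + 1) * n]
        signs = np.sign(seg)                             # sgn x(n): +1, 0 or -1
        z_k = np.sum(np.abs(np.diff(signs))) / (2 * n)   # Eq. (4)
        rates.append(z_k)
    return float(np.mean(rates))                         # Eq. (5): average over all S segments
```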
3.3. Feature Vector and Classification

The feature vector consists of four elements, the first three being the formant frequencies and the fourth being the mean ZCR of the file, and is represented in Eq. (6).
$$F = \{\, ff_1,\; ff_2,\; ff_3,\; Z_m \,\} \qquad (6)$$
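Combining the two sketches above, the feature vector of Eq. (6) for one utterance could be assembled as follows (again an illustration, not the authors' code):

```python
import numpy as np

def feature_vector(audio, fs):
    """4-element feature vector F of Eq. (6): first three formants plus mean ZCR."""
    ff1, ff2, ff3 = first_three_formants(audio, fs)   # sketched in Sec. 3.1
    z_m = mean_zero_crossing_rate(audio, fs)          # sketched in Sec. 3.2
    return np.array([ff1, ff2, ff3, z_m])
```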
Each word is classified through a training phase and a testing phase. The training set consists of 9 utterances of the word by 9 speakers (both male and female), while the testing set consists of utterances of the same word by a different set of 9 speakers (both male and female). Classification is done using neural networks (MLP : multi-layer perceptron) and the results are compared with those obtained using the L1 (Manhattan) and L2 (Euclidean) distance metrics.

4. Experimentations and Results

Experiments to study the efficiency of the proposed method are done using three different data sets:
(1) Dataset-1 : 3 words : pull (w1), hot (w2), head (w3)
(2) Dataset-2 : 4 words : pull (w1), teacher (w2), vow (w3), thigh (w4)
(3) Dataset-3 : 5 words : pull (w1), teacher (w2), heart (w3), vow (w4), thigh (w5)

The words are taken from the British Council Phonemic Chart [4], and 18 speakers, 9 male and 9 female, were asked to utter the words in a controlled laboratory environment. The speech signals were digitized at a sampling rate of 44100 Hz with 16 bits per sample and saved in mono WAV format. Before feature extraction the audio files were subjected to a pre-processing step involving amplitude normalization and DC offset correction, after which a fixed block of 25,000 samples was isolated from each file and fed to the feature extraction block to compute the feature vector as per Eq. (6).
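The pre-processing step described above could look roughly as follows; the type of amplitude normalization and the position of the 25,000-sample block are not specified in the paper, so peak normalization and the leading block are assumptions.

```python
import numpy as np
from scipy.io import wavfile

def preprocess(path, n_samples=25000):
    """Load a mono 16-bit WAV file, correct the DC offset, normalize the
    amplitude and keep a fixed block of samples for feature extraction."""
    fs, x = wavfile.read(path)                 # fs = 44100 Hz for the recordings used here
    x = x.astype(np.float64)
    x = x - np.mean(x)                         # DC offset correction
    x = x / (np.max(np.abs(x)) + 1e-12)        # peak-amplitude normalization (assumed)
    return fs, x[:n_samples]                   # fixed 25,000-sample block (position assumed)
```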
4.1. Dataset-1 (3 words)

The training and testing plots for each feature are shown in Fig. 1. Fig. 1(a) indicates the variation in the value of the first formant frequency ff1 for each of the three words uttered by the 9 speakers during the training phase, while Fig. 1(b) indicates the same for the other set of 9 speakers used for the testing phase. Similarly, Fig. 1(c) and 1(d) show the variations of the second formant frequency ff2, Fig. 1(e) and 1(f) show the variations of the third formant frequency ff3, and Fig. 1(g) and 1(h) show the variations of the mean ZCR Zm, during the training and testing phases respectively.
Fig. 1: Plot of features corresponding to 3 words: (a, b) training and testing plots for the first formant frequency (ff1), (c, d) training and testing plots for the second formant frequency (ff2), (e, f) training and testing plots for the third formant frequency (ff3), (g, h) training and testing plots for the mean ZCR (Zm).
The class probability of each test sample is estimated using a neural network (multi-layer perceptron : MLP). The network architecture used is 4-30-3, i.e. 4 input nodes (for the 4-element feature vector), 30 nodes in the hidden layer and 3 output nodes (for discriminating between 3 words), with log-sigmoid activation functions for
both the neural layers, a learning rate of 0.01 and a Mean Square Error (MSE) threshold of 0.005 for convergence. The convergence plot and MLP output are shown in Fig. 2. The accuracy obtained for the 3-word dataset is 88.89%.
Fig. 2: NN classification for 3 words: (a) convergence plot, (b) MLP output.
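The 4-30-3 network described above can be illustrated with a small NumPy back-propagation sketch. This is not the authors' implementation; in particular, the weight initialization is arbitrary, and the four features would in practice have to be scaled to a comparable range (e.g. [0, 1]) before training, which the paper does not discuss.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def train_mlp(X, T, n_hidden=30, lr=0.01, mse_goal=0.005, max_epochs=50000, seed=0):
    """Minimal 4-30-3 MLP trained by batch gradient descent, mirroring the
    description above: log-sigmoid activations in both layers, learning rate
    0.01 and an MSE convergence threshold of 0.005.
    X: (n_samples, 4) scaled feature vectors; T: (n_samples, 3) one-hot targets."""
    rng = np.random.default_rng(seed)
    W1 = rng.normal(scale=0.5, size=(X.shape[1], n_hidden))
    b1 = np.zeros(n_hidden)
    W2 = rng.normal(scale=0.5, size=(n_hidden, T.shape[1]))
    b2 = np.zeros(T.shape[1])
    for _ in range(max_epochs):
        H = sigmoid(X @ W1 + b1)                 # hidden layer (log-sigmoid)
        Y = sigmoid(H @ W2 + b2)                 # output layer (log-sigmoid)
        E = Y - T
        if np.mean(E ** 2) < mse_goal:           # MSE convergence criterion
            break
        dY = E * Y * (1 - Y)                     # back-propagate the squared error
        dH = (dY @ W2.T) * H * (1 - H)
        W2 -= lr * (H.T @ dY)
        b2 -= lr * dY.sum(axis=0)
        W1 -= lr * (X.T @ dH)
        b1 -= lr * dH.sum(axis=0)
    return W1, b1, W2, b2

def classify(x, W1, b1, W2, b2):
    """Predicted word index = output node with the highest activation."""
    return int(np.argmax(sigmoid(sigmoid(x @ W1 + b1) @ W2 + b2)))
```

The same sketch applies to Datasets 2 and 3 with 4 and 5 output nodes respectively (one-hot targets of length 4 or 5).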
4.2. Dataset-2 (4 words)

The training and testing plots for each feature are shown in Fig. 3. Fig. 3(a) indicates the variation in the value of the first formant frequency ff1 for each of the four words uttered by the 9 speakers during the training phase, while Fig. 3(b) indicates the same for the other set of 9 speakers used for the testing phase. Similarly, Fig. 3(c) and 3(d) show the variations of the second formant frequency ff2, Fig. 3(e) and 3(f) show the variations of the third formant frequency ff3, and Fig. 3(g) and 3(h) show the variations of the mean ZCR Zm, during the training and testing phases respectively.
Fig. 3: Plot of features corresponding to 4 words: (a, b) training and testing plots for the first formant frequency (ff1), (c, d) training and testing plots for the second formant frequency (ff2), (e, f) training and testing plots for the third formant frequency (ff3), (g, h) training and testing plots for the mean ZCR (Zm).
The class probability of each test sample is estimated using a neural network (multi-layer perceptron : MLP). The network architecture used is 4-30-4, i.e. 4 input nodes (for the 4-element feature vector), 30 nodes in the hidden layer and 4 output nodes (for discriminating between 4 words), with log-sigmoid activation functions for both the neural layers, a learning rate of 0.01 and a Mean Square Error (MSE) threshold of 0.005 for convergence. The convergence plot and MLP output are shown in Fig. 4. The accuracy obtained for the 4-word dataset is 86.11%.
Fig. 4: NN classification for 4 words: (a) convergence plot, (b) MLP output.
4.3. Dataset-3 (5 words)

The training and testing plots for each feature are shown in Fig. 5. Fig. 5(a) indicates the variation in the value of the first formant frequency ff1 for each of the five words uttered by the 9 speakers during the training phase, while Fig. 5(b) indicates the same for the other set of 9 speakers used for the testing phase. Similarly, Fig. 5(c) and 5(d) show the variations of the second formant frequency ff2, Fig. 5(e) and 5(f) show the variations of the third formant frequency ff3, and Fig. 5(g) and 5(h) show the variations of the mean ZCR Zm, during the training and testing phases respectively.

The class probability of each test sample is estimated using a neural network (multi-layer perceptron : MLP). The network architecture used is 4-30-5, i.e. 4 input nodes (for the 4-element feature vector), 30 nodes in the hidden layer and 5 output nodes (for discriminating between 5 words), with log-sigmoid activation functions for both the neural layers, a learning rate of 0.01 and a Mean Square Error (MSE) threshold of 0.005 for convergence. The convergence plot and MLP output are shown in Fig. 6. The accuracy obtained for the 5-word dataset is 82.22%.
Fig. 5: Plot of features corresponding to 5 words: (a, b) training and testing plots for the first formant frequency (ff1), (c, d) training and testing plots for the second formant frequency (ff2), (e, f) training and testing plots for the third formant frequency (ff3), (g, h) training and testing plots for the mean ZCR (Zm).
Fig. 6: NN classification for 5 words: (a) convergence plot, (b) MLP output.
4.4. Analysis

Table 1 indicates the accuracies obtained by applying the proposed algorithm to Dataset-1 (3 words), Dataset-2 (4 words) and Dataset-3 (5 words). In each case, the accuracy obtained with the proposed feature set is compared with the accuracy obtained by applying the existing algorithm of [1] to the same data set. Classification is done using neural networks (MLP : multi-layer perceptron), and the results are compared with those obtained using the L1 (Manhattan) and L2 (Euclidean) distance metrics (a sketch of this distance-based matching is given after Table 1). X represents the cases where NN convergence could not be achieved and hence the accuracy could not be calculated.

Table 1: Recognition Accuracies (%)
Dataset                Neural Network   Euclidean Distance   Manhattan Distance
Dataset-1 (proposed)        88.89             92.59                88.89
Dataset-1 (existing)            X             74.07                77.78
Dataset-2 (proposed)        86.11             75.00                75.00
Dataset-2 (existing)            X             50.00                52.78
Dataset-3 (proposed)        82.22             68.89                73.33
Dataset-3 (existing)            X             35.56                35.56
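The paper does not spell out how the L1 and L2 metrics were turned into classifiers; one plausible reading, used here purely for illustration, is nearest-neighbour matching of a test feature vector against the training vectors.

```python
import numpy as np

def distance_classify(test_vec, train_vecs, train_labels, metric="L2"):
    """Assign a test feature vector (Eq. (6)) to the class of its nearest
    training vector under the Manhattan (L1) or Euclidean (L2) distance."""
    diffs = train_vecs - test_vec
    if metric == "L1":
        d = np.sum(np.abs(diffs), axis=1)            # Manhattan distance
    else:
        d = np.sqrt(np.sum(diffs ** 2, axis=1))      # Euclidean distance
    return train_labels[int(np.argmin(d))]
```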
5. Conclusions and Future Scope

This paper presents an automated system for speaker-independent recognition of isolated words. The feature vector consists of the first three formant frequencies of the vocal tract and the mean zero crossing rate of the audio signal. Experiments to demonstrate the performance of the system were done using three different data sets consisting of 3, 4 and 5 words uttered by 18 different speakers. Classification was done using multi-layer neural networks, and the recognition accuracies obtained are also compared with standard vector comparison methods, namely the Manhattan and Euclidean distances. The results demonstrate the superiority of the features over the contemporary work [1] based on wavelet features. Future improvements to the system could be investigated by incorporating features based on cepstral coefficients.

References
[1] Ambalathody P. (2010), "Main Project - Speech Recognition Using Wavelet Transform," Internet: www.scribd.com/doc/36950981/MainProject-Speech-Recognition-using-Wavelet-Transform, unpublished.
[2] Ashraf J., et al. (2010), "Speaker Independent Urdu Speech Recognition Using HMM," in Proceedings of Natural Language Processing and Information Systems / 15th International Conference on Applications of Natural Language to Information Systems, Cardiff, Wales, pp. 140-148.
[3] Betkowska A., Shinoda K., Furui S. (2007), "Robust Speech Recognition Using Factorial HMMs for Home Environments," EURASIP Journal on Advances in Signal Processing, Vol. 2007, Issue 1, Article ID 20593, 10 pages.
[4] British Council, "British Council Phonemic Chart," www.teachingenglish.org.uk/try/activities/phonemic-chart.
[5] Gao G., Biligetu, Nabuqing, Zhang S. (2006), "A Mongolian Speech Recognition System Based on HMM," in Proceedings of the 2006 International Conference on Intelligent Computing: Part II, Computational Intelligence, Vol. 4114/2006, Kunming, China, pp. 667-676.
[6] Shen H., et al. (2005), "HMM Parameter Adaptation Using the Truncated First-Order VTS and EM Algorithm for Robust Speech Recognition," in Proceedings of Computational Intelligence and Security (CIS 2005), China, pp. 979-984.
[7] Jarina R., Kuba M., Paralic M. (2005), "Compact Representation of Speech Using 2-D Cepstrum - An Application to Slovak Digits Recognition," 8th International Conference on Text, Speech and Dialogue (TSD 2005), Karlovy Vary, pp. 342-347.
[8] Nehe N. S., Holambe R. S. (2009), "New Robust Subband Cepstral Feature for Isolated Word Recognition," in Proceedings of the International Conference on Advances in Computing, Communication and Control, Mumbai, pp. 326-330.
[9] Patel I., Rao Y. S. (2010), "Speech Recognition Using Hidden Markov Model with MFCC-Subband Technique," International Conference on Recent Trends in Information, Telecommunication and Computing (ITC 2010), Kochi, pp. 168-172.
[10] Rossi F., Ainon R. N. (2008), "Isolated Malay Speech Recognition Using Hidden Markov Models," in the proceedings of an international conference, Kuala Lumpur, pp. 721-725.
[11] Spanias A. (1994), "Speech Coding: A Tutorial Review," Proceedings of the IEEE, 82(10), pp. 1541-1582.
[12] Thorat R. A., Jadhav R. A. (2009), "Speech Recognition System," International Conference on Advances in Computing, Communication and Control, Mumbai, pp. 607-609.
[13] University College London (UCL), Department of Phonetics and Linguistics, Lecture 10: Speech Signal Analysis, http://www.phon.ucl.ac.uk/courses/spsci/matlab/lect10.html.