Proc. of Int. Conf. on Emerging Trends in Engineering and Technology
Automatic Speech Recognition System for Isolated & Connected Words of Hindi Language by Using Hidden Markov Model Toolkit (HTK)
Annu Choudhary, Mr. R.S. Chauhan, Mr. Gautam Gupta
Department of Electronics and Communication Engineering (ECE), JMIT Radaur (Haryana)
[email protected],
[email protected]
Abstract— Speech recognition is the process of converting an acoustic waveform into text corresponding to the information conveyed by the speaker. This paper discusses the implementation of an isolated-word and connected-word Automatic Speech Recognition (ASR) system for words of the Hindi language. The system is developed with the Hidden Markov Model Toolkit (HTK), which is based on the Hidden Markov Model (HMM), a statistical approach. Initially the system is trained for 100 distinct Hindi words. The paper also describes the working of the HTK tool, which is used in the various phases of an ASR system, by presenting a detailed architecture of an ASR system developed using various HTK library modules and tools. The recognition results show that the overall system accuracy is 95% for isolated words and 90% for connected words.
Index Terms— HMM, HTK, Mel Frequency Cepstral Coefficient (MFCC), Automatic Speech Recognition (ASR), Hindi, isolated word ASR, connected word ASR.
I. INTRODUCTION
Speech is the basic medium through which we communicate with each other. Research on automatic speech recognition by machine has attracted much attention over the last five decades. It is easier for people to interact with computers via speech than through input devices such as keyboards and pointing devices. This can be achieved by developing an Automatic Speech Recognition (ASR) system, which allows a computer to identify the words that a person speaks into a microphone or telephone and convert them into text. Speech therefore has the potential of being an important medium of interaction between human beings and computers. Communication among human beings is dominated by languages such as Hindi and Punjabi, so people want to communicate with computers that can speak and recognize speech in their native language; this would enable even a common man in India to communicate with the computer in his regional language. However, there is a lot of scope to develop an ASR system for the Hindi language. Some work has been done in this direction for isolated words with small vocabularies, but the amount of work in Indian regional languages has not yet reached the critical level already achieved for other languages in developed countries. Developing an ASR system for Hindi is therefore the need of the hour.
II. LITERATURE SURVEY
This section reviews the work done for various languages such as English, Hindi, Punjabi and Marathi in the last few years.
Tarun Pruthi et al. [1] described a speaker-dependent, isolated word recognizer for Hindi. Features were extracted using LPC and the system was developed using HMM. The sounds of two male speakers were recorded, and the vocabulary consists of the Hindi digits (0, pronounced "shoonya", to 9, pronounced "nau"). Although the system gives good performance, the design is speaker dependent and has a very small vocabulary. Gupta [2] worked on an isolated word speech recognition tool for the Hindi language using continuous HMMs, with word acoustic models used for recognition. Again the vocabulary contains the digits of the Hindi language. Results were good when tested in speaker-dependent mode and are satisfactory for other sounds too; the main drawback is that the vocabulary size is too small. Anup Kumar Paul, Dipankar Das et al. [3] developed a Bangla speech recognition system using LPC and an Artificial Neural Network (ANN). The system has two major parts: signal processing and speech pattern recognition. Start and end points are detected during the signal processing stage, and pattern recognition is then performed by the ANN. Speech signals were recorded with an audio wave recorder in a normal room environment. Al-Qatab et al. [4] implemented an Arabic automatic speech recognition engine using HTK. The engine recognizes both continuous speech and isolated words. The developed system used an Arabic dictionary built manually from the speech of 13 speakers and a vocabulary of 33 words. R.L.K. Venkateswarlu, R. Ravi Teja et al. [5] developed an efficient speech recognition system for Telugu letter recognition. Both MLP (Multilayer Perceptron) and TLRN (Time-Lagged Recurrent Neural Network) models were trained and tested on a dataset in which four different speakers (2 male and 2 female) uttered the letters 10 times each. Speaker-dependent mode is used for recognition of the Telugu letters; in this mode the test data presented to the network is the same as the training data. R. Kumar [6] implemented a system which involves only one speaker and recognizes isolated words of the Punjabi language, and further extended the work to compare the performance of speech recognition for a small vocabulary of speaker-dependent isolated spoken words using the Hidden Markov Model (HMM) and Dynamic Time Warping (DTW) techniques. The presented work emphasized a template-based recognizer approach using linear predictive coding with dynamic programming computation, and vector quantization with HMM-based recognizers, in isolated word recognition tasks. Ahmad A. M. Abushariah [7] worked on English speech recognition, designing and implementing an English digit speech recognition system with a MATLAB GUI. The work was based on the Hidden Markov Model (HMM), which provides a highly reliable way of recognizing speech, and the Mel Frequency Cepstral Coefficients (MFCC) technique was used to extract the features. The system covers all English digits (zero through nine). M. Singh et al. [8] described a speaker-independent, real-time, isolated word ASR system developed for the Punjabi language, in which Vector Quantization and Dynamic Time Warping (DTW) approaches were used for recognition.
A database of features (LPC coefficients or LPC-derived coefficients) of the training data was created to train the system; for testing, the test pattern (the features of the test token) was compared with each reference pattern using dynamic time warp alignment. The system was developed for a small isolated word vocabulary. Bharti W. Gawali, Santosh Gaikwad et al. [9] present a Marathi database and an isolated word recognition system. Mel-Frequency Cepstral Coefficients (MFCC) and Dynamic Time Warping (DTW) are used for feature extraction, and the Marathi speech database was designed using the Computerized Speech Lab. The vocabulary consists of the vowels of Marathi, isolated words starting with a vowel, and simple Marathi sentences. The voices of 35 speakers were recorded and each word was repeated 3 times. The comparative recognition accuracy of DTW and MFCC is presented in that paper. Kuldeep Kumar [10] worked on ASR for the Hindi language, building a speech recognition system developed using the Hidden Markov Model Toolkit (HTK). Acoustic word models are used to recognize isolated words. The system is trained for 30 Hindi words, with training data collected from eight speakers, and the overall accuracy of the presented system is 94%. K. Kumar et al. [11] worked on a connected-words speech recognition system for the Hindi language; the Hidden Markov Model Toolkit (HTK) was used to develop the system, which was trained to recognize any sequence of words.
III. STATISTICAL FRAMEWORK OF AN ASR
As shown in Fig. 1, an ASR mainly comprises five parts: acoustic analysis for feature extraction, an acoustic model based on the statistical HMM approach, a language model, a pronunciation dictionary, and the recognizer (decoder). The sound waves captured by a microphone at the front end are fed to the acoustic analysis module, where the input speech is first converted into a series of feature vectors which are then forwarded to the decoder. The decoding module, with the help of the acoustic, language and pronunciation models, produces the results. Broadly, the speech recognition problem can be divided into the following four steps: signal parameterization using a feature extraction technique such as MFCC or PLP; acoustic scoring with Gaussian mixture models (GMMs); sequence modeling with hidden Markov models (HMMs); and generating competing hypotheses from the scores of the knowledge sources (acoustic, language and pronunciation models) and selecting the best one as the final output with the help of a decoder.
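The paper does not write out the underlying decision rule, but the decoding step just described is conventionally formalised by the standard maximum a posteriori (Bayes) formulation, sketched here for completeness:

\hat{W} \;=\; \arg\max_{W} P(W \mid O) \;=\; \arg\max_{W} \frac{P(O \mid W)\, P(W)}{P(O)} \;=\; \arg\max_{W} P(O \mid W)\, P(W)

where O is the sequence of acoustic feature vectors (here MFCCs), P(O | W) is the likelihood assigned by the HMM/GMM acoustic model, P(W) is the language model probability of the word sequence W, and the pronunciation dictionary links the words in W to the acoustic model units; P(O) can be dropped because it does not depend on W.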
Fig. 1 Architecture of ASR
IV. ASR SYSTEM DESCRIPTION
The ASR is implemented using the Hidden Markov Model Toolkit (HTK) [12] version 3.4. The Linux operating system Ubuntu version 11.10 has been used for developing the ASR.
A. System Architecture & Implementation
The recognition process is shown in Fig. 2. The ASR system architecture mainly comprises four components: training data preparation, acoustic analysis, acoustic model generation, and system testing (recognition).
Training Data Preparation: This phase consists of recording and labelling the speech signal. The implemented system is trained for 100 distinct Hindi language words. First the data is recorded with a dynamic unidirectional microphone using the recording tool Audacity, in .wav format, and the recorded .wav files are saved with HTK transcriptions. The sampling rate used for recording is 16 kHz. Each word is uttered 10 times, in two data files of 5 utterances each, so the 100 distinct words result in 100 × 10 = 1000 sample files. The labelling tool WaveSurfer is used to label the speech waveforms. Each label file, saved in .lab format, is a simple text file; these files are used in the acoustic model generation phase of the system. An illustrative sketch of such a label file is given below.
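The paper does not reproduce its labelling output, but HTK-style word-level label files typically contain one line per labelled segment with a start time, an end time (in 100 ns units) and the label name. The following minimal sketch is an assumption for illustration only; the path, word name and segment times are not the authors' actual data.

#!/bin/sh
# Illustrative sketch only: a word-level HTK label file for one hypothetical
# training utterance. Path, word name and segment times are assumptions.
mkdir -p data/train/lab
cat > data/train/lab/taj_01.lab <<'EOF'
0 9200000 taj
EOF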
Acoustic Analysis: The speech recognition tools cannot work directly on speech waveforms, so acoustical analysis is performed on them and the original waveform is converted into a series of acoustic vectors. The Mel Frequency Cepstral Coefficient (MFCC) technique has been used for feature extraction. The computation steps of MFCC include: Framing: the signal is segmented into successive frames of 20 ms to 40 ms, overlapping with each other. Windowing: each frame is multiplied by a windowing function (e.g. a Hamming window). Extracting: a vector of acoustical coefficients, giving a compact representation of the spectral properties of the frame, is extracted from each windowed frame. The configuration file (.conf) is a text file which specifies the various configuration parameters, such as the format of the speech files (HTK), the feature extraction technique (MFCC), the frame length (25 ms), the frame period (10 ms), the number of MFCC coefficients (12), etc. The acoustic vector (.mfcc) files are used in both the training and decoding phases of the system. The HCopy tool of HTK is used for this purpose.
Acoustic Model Generation: An acoustic model is defined as a reference model against which comparisons are made to recognize unknown utterances. There are two kinds of acoustic models, viz. word models and phoneme models. The word model has been used, as it is suitable for a small vocabulary, together with the statistical Hidden Markov Modelling (HMM) approach for system training. In this phase of the implementation, an HMM is first formed using a prototype, which has to be generated for each word in the dictionary. The same topology is used for all the HMMs: 4 active states (with observation functions) and two non-emitting states (the initial and the last state, with no observation function). Single Gaussian distributions with diagonal covariance matrices are used as observation functions; they are described by a mean vector and a variance vector in a text description file known as the prototype. This predefined prototype, along with the acoustic vectors (.mfcc files) and the training labels (.lab files), is used by the HTK tool HInit to initialise the system. In the second step of this phase, the HTK tool HRest is used to estimate the optimal values of the HMM parameters (transition probabilities, and the mean and variance vectors of each observation function). This iterative step is known as re-estimation and is repeated several times for each HMM to be trained. These embedded re-estimations indicate convergence through a change measure (the convergence factor). The final step of the acoustic model generation phase, called the convergence test, is repeated until the absolute value of the convergence factor no longer decreases from one HRest iteration to the next. In our system implementation the re-estimation iterations are repeated five times, so five successive HMMs per word in the vocabulary are generated.
Task Definition: Before entering the final stage of testing the developed system, the basic architecture of the recognizer, i.e. the language model (task grammar), and the word dictionary, i.e. the pronunciation model (task dictionary), have to be defined. The task grammar, specified using Extended Backus-Naur Form (EBNF), is written in a text file and compiled with the HTK tool HParse to generate the task network (.slf). The task dictionary, also a text file, establishes a correspondence between the names of the HMMs and the names of the task grammar variables. The names of the labels are also added to this correspondence, as these names indicate the symbols that will be output by the recognizer; they are optional, and if they are not given the names of the grammar variables are used by default for the output. A sketch of the shell commands typically used to drive these steps is given below.
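The paper drives HTK from shell scripts but does not list them. The following sketch shows, under stated assumptions, how the feature extraction, word-model training and task definition steps above are commonly invoked; every path and file name (analysis.conf, targets.scp, trainlist.txt, model/hmm0, proto, gram.txt, wdnet.slf), the example word "taj" and the grammar word list are hypothetical, and the option sets follow common HTK tutorial usage rather than the authors' actual scripts.

#!/bin/sh
# Sketch under assumptions: HTK front end, word-model training and task
# definition. All file names, paths and words are illustrative placeholders.

# Acoustic analysis configuration matching the parameters quoted in the paper
# (25 ms frames, 10 ms frame period, 12 MFCC coefficients, Hamming window).
cat > analysis.conf <<'EOF'
# HTK coding parameters
SOURCEFORMAT = WAV
TARGETKIND   = MFCC
TARGETRATE   = 100000.0
WINDOWSIZE   = 250000.0
USEHAMMING   = T
NUMCEPS      = 12
EOF

# Feature extraction: targets.scp lists "source.wav target.mfcc" pairs.
HCopy -T 1 -C analysis.conf -S targets.scp

# Initialisation and re-estimation of one word HMM (repeated for each of the
# 100 vocabulary words; shown here for the hypothetical word "taj").
mkdir -p model/hmm0 model/hmm1 model/hmm2 model/hmm3 model/hmm4 model/hmm5
HInit -A -T 1 -S trainlist.txt -M model/hmm0 -H proto -l taj -L data/train/lab taj
for i in 1 2 3 4 5; do
  # Five re-estimation passes, as in the paper; each pass reads the model
  # produced by the previous one and writes an updated version.
  HRest -A -T 1 -S trainlist.txt -M model/hmm$i \
        -H model/hmm$((i-1))/taj -l taj -L data/train/lab taj
done

# Task definition: EBNF task grammar compiled into a word network (.slf).
# The word list is only an illustrative subset of the 100-word vocabulary;
# <$WORD> allows one or more words, covering the connected-word case.
cat > gram.txt <<'EOF'
$WORD = TAJ | MEHAL | EK | IMARAT | HAI;
( <$WORD> )
EOF
HParse gram.txt wdnet.slf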
System Testing: This stage is responsible for generating the transcription of an unknown utterance. As in training data preparation, the test signal is also converted into a series of acoustic vectors (.mfcc) using the HTK tool HCopy. This input observation, along with the HMM definitions, the word dictionary, the task network and the names of the generated HMMs (the HMM list), is taken as input by the HTK tool HVite, which generates the output in a transcription file (.mlf). The HVite tool processes the signal using the Viterbi algorithm, based on the token passing algorithm, which matches it against the recognizer's Markov models. The transcription file is then processed by a filtering module which extracts the recognized word from the file and displays it in the form of text. Shell programming is used to run the various commands, which makes the implemented system more abstract and fast; a sketch of the decoding and scoring commands is given below.
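Under the same assumptions as above (standard HTK tutorial usage; the paper does not list its scripts), the decoding of a test utterance and the scoring described in the performance analysis might be driven roughly as follows; hmmsdef.mmf, dict, hmmlist, refs.mlf and the other names are illustrative placeholders.

#!/bin/sh
# Sketch under assumptions: decoding an unknown utterance with HVite and
# scoring the output with HResults. File names are hypothetical placeholders.

# Convert the test recording into acoustic vectors, exactly as in training.
HCopy -T 1 -C analysis.conf data/test/input.wav data/test/input.mfcc

# Viterbi (token passing) decoding against the word network and word HMMs.
HVite -A -T 1 -H hmmsdef.mmf -w wdnet.slf -i recout.mlf \
      dict hmmlist data/test/input.mfcc

# A small filtering script (sed/awk) can then pull the recognized word labels
# out of recout.mlf and display them as plain text, as described in the paper.

# Performance analysis: align the recognizer output against the reference
# transcriptions and report correctness, accuracy and error counts.
HResults -I refs.mlf hmmlist recout.mlf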
Fig. 2 Recognition process of speech signal
Performance Analysis: The system performance is analyzed with the HTK tool HResults. The output transcription file produced by HVite is compared with the corresponding original reference transcription file. The following equations give the performance measures, where N is the number of words in the test set, D the number of deletions, S the number of substitutions and I the number of insertions:
Percentage Accuracy (PA) = [(N − D − S − I) / N] × 100
where PA gives the word accuracy rate. The Word Error Rate (WER) is then calculated as
WER = 100% − PA
and is used as one of the criteria to evaluate the performance of the system.
V. HINDI CHARACTER SET
Hindi is mostly written in a script called Nagari or Devanagari, which is phonetic in nature. Like English, Hindi also has vowels and consonants.
A. Vowels
As shown in Table I, there is a separate symbol for each vowel; the Hindi language has 12 vowels. The consonants carry an implicit vowel (अ). To indicate a vowel sound other than the implicit one (i.e. अ), a vowel sign (matra) is attached to the consonant. The vowels and their equivalent matras are shown in the following table.
TABLE I. HINDI VOWEL SET
B. Consonants
The consonant set of Hindi is shown in Table II. The consonants are categorized according to the place and manner of articulation. There are 5 vargs (groups) and 9 non-varg consonants. Each varg has 5 consonants, the last of which is called a nasal. The first four consonants of each varg are divided into a primary and a secondary pair; the primary consonants are unvoiced whereas the secondary consonants are voiced. The primary consonants may be aspirated or un-aspirated, and aspirated consonants have an additional h sound. The remaining 9 non-varg consonants contain 5 semivowels, 3 sibilants and 1 aspirate. The complete Hindi consonant set with its phonetic properties is given in the following table.
TABLE II. HINDI CONSONANT SET
C. Other Characters
Besides consonants and vowels, the Hindi language also contains the anuswar (◌ं), the visarga (◌ः), the chandrabindu (◌ँ), the avagraha (ऽ) and a few other signs. The anuswar denotes the nasal consonant sounds.
VI. RECOGNITION RESULTS
The system is trained for 100 distinct words. The recognition accuracy is then calculated for isolated and connected words in a noise-free environment and tabulated below: Table III and Table IV show the overall accuracy and word correction rate for isolated words and connected words respectively.
TABLE III. ACCURACY FOR ISOLATED WORDS
Spoken words | Feature extraction technique | Recognized words | Accuracy | Word error rate
100          | MFCC                         | 95               | 95%      | 5%
TABLE IV. ACCURACY FOR CONNECTED WORDS

Connected words                        | I | D | S | N | PA
Taj mehal ek imarat hai                | 0 | 0 | 0 | 5 | 100%
Yeh agra mein Yamuna nadi              | 0 | 0 | 1 | 5 | 80%
Ke kinare sathit bhavan apni           | 1 | 0 | 1 | 5 | 60%
Sundrta vajah sansar saat ajubo        | 0 | 0 | 0 | 5 | 100%
Sadharan darshak bhi esh ko            | 0 | 0 | 0 | 5 | 100%
Dekh kar khush ho jate                 | 0 | 0 | 1 | 5 | 80%
Lahor Jahangir ka makbra sone          | 2 | 0 | 0 | 5 | 60%
Amritsar mandir mugal Kaleen par       | 0 | 0 | 0 | 5 | 100%
Saahi khajane payi kharch nahi         | 1 | 0 | 0 | 5 | 80%
Hui thi samye hinduo ne                | 0 | 0 | 0 | 5 | 100%
Bahut nirman kiya tha inme             | 0 | 0 | 0 | 5 | 100%
Aelora vrinda van bhavaya giri         | 0 | 0 | 1 | 5 | 80%
Delhi lal kila aur uske                | 0 | 0 | 0 | 5 | 100%
Samne khadi jama mashjid kala          | 0 | 0 | 0 | 5 | 100%
Ache namune bne deevane khas           | 1 | 0 | 0 | 5 | 80%
Aam sunder humaun akbar pahli          | 0 | 0 | 0 | 5 | 100%
Shasan prashidh buland darvaja panch   | 0 | 0 | 1 | 5 | 80%
Salim chitre ishvar mahima samjhne     | 0 | 0 | 0 | 5 | 100%
Sadhan mana karta kabhi chote          | 0 | 0 | 0 | 5 | 100%
Pado prem har saptaah kaam             | 0 | 0 | 0 | 5 | 100%
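As a consistency check (the paper quotes only the aggregate figures), the overall connected-word accuracy given in the abstract follows directly from the column totals of Table IV, using the formulas of the performance analysis section:

N = 20 \times 5 = 100, \quad I = 5, \quad D = 0, \quad S = 5

\mathrm{PA} = \frac{N - D - S - I}{N} \times 100 = \frac{100 - 0 - 5 - 5}{100} \times 100 = 90\%, \qquad \mathrm{WER} = 100\% - \mathrm{PA} = 10\%

The isolated-word figures of Table III follow the same formula: 95 of the 100 spoken words are recognized correctly, giving PA = 95% and WER = 5%.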
VII. CONCLUSION
In conclusion, an efficient, abstract and fast ASR system for regional languages like Hindi is important so that all the people who know only their regional language can communicate with the computer. The work implemented in this paper is a step towards the development of such systems. The work has been carried out for isolated and connected words, and may further be extended to larger vocabulary sizes and to continuous speech recognition.
REFERENCES
[1] T. Pruthi, S. Saksena and P. K. Das, "Swaranjali: Isolated Word Recognition for Hindi Language using HMM and VQ", International Conference on Multimedia Processing and Systems (ICMPS), IIT Madras, 2010.
[2] R. Gupta, "Speech Recognition for Hindi", M. Tech. Project Report, Department of CSE, IIT Mumbai, 2006.
[3] A. K. Paul, D. Das and Md. M. Kamal, "Bangla Speech Recognition System using LPC and ANN", Seventh International Conference on Advances in Pattern Recognition (ICAPR), 4-6 Feb. 2009, pp. 171-174.
[4] B. A. Q. Al-Qatab, "Arabic Speech Recognition Using Hidden Markov Model Toolkit (HTK)", International Symposium in Information Technology (ITSim), Kuala Lumpur, June 15-17, 2010.
[5] R. L. K. Venkateswarlu, R. Ravi Teja and R. Vasantha Kumari, "Developing Efficient Speech Recognition System for Telugu Letter Recognition", International Conference on Computing, Communication and Applications (ICCCA), 22-24 Feb. 2012.
[6] R. Kumar, "Comparison of HMM and DTW for Isolated Word Recognition of Punjabi Language", in Proceedings of Progress in Pattern Recognition, Image Analysis, Computer Vision, and Applications, Sao Paulo, Brazil, vol. 6419 of Lecture Notes in Computer Science (LNCS), pp. 244-252, Springer Verlag, November 8-11, 2010.
[7] A. A. M. Abushariah et al., "English Digits Speech Recognition System Based on Hidden Markov Models", International Conference on Computer and Communication Engineering (ICCCE 2010), 11-13 May 2010, Kuala Lumpur, Malaysia.
[8] R. Kumar and M. Singh, "Spoken Isolated Word Recognition of Punjabi Language Using Dynamic Time Warping Technique", demo in Proceedings of Information Systems for Indian Languages, Punjabi University, Patiala, India, March 9-11, 2011, vol. 139 of Communications in Computer and Information Science (CCIS), p. 301, Springer Verlag.
[9] B. W. Gawali, S. Gaikwad, P. Yannawar and S. C. Mehrotra, "Marathi Isolated Word Recognition System using MFCC and DTW Features", ACEEE International Journal on Information Technology, vol. 1, no. 1, March 2011.
[10] K. Kumar and R. K. Aggarwal, "Hindi Speech Recognition System using HTK", International Journal of Computing and Business Research, vol. 2, issue 2, May 2011.
[11] K. Kumar, R. K. Aggarwal and A. Jain, "A Hindi Speech Recognition System for Connected Words using HTK", International Journal of Computational Systems Engineering, vol. 1, no. 1, 2012.
[12] "Hidden Markov Model Toolkit (HTK)", available at http://htk.eng.cam.ac.uk, 2012.