time performance. Keywords: speech recognition, genetic algorithms, MFCC, ... spoken database querying for novice users, âhands busyâ applications in ..... [1 ] Sakoe, H. and Chiba, S., " Dynamic Programming Optimization for. Spoken Word ...
International Journal of Emerging Technology and Advanced Engineering Website: www.ijetae.com (ISSN 2250-2459, ISO 9001:2008 Certified Journal, Volume 4, Issue 1, January 2014)
Speech Feature Extraction and Recognition Using Genetic Algorithm Hitesh Gupta1, Deepinder Singh Wadhwa2 BGIET, Sangrur, Punjab, India. A speech interface would support many valuable applications — for example, telephone directory assistance, spoken database querying for novice users, “hands busy” applications in medicine or fieldwork, office dictation devices, or even automatic voice translation into foreign languages.
Abstract People are so comfortable with speech that they would also like to interact with computers via speech, rather than having to resort to primitive interfaces such as keyboards and pointing devices. However, undesired noises in environment like sound from heavy machines, vehicles are also present that causes undesired effects in speech transmission and acquiring systems reducing the performance of the machine receiving the speech signal. Advanced speech enhancement algorithms can be classified in main three categories, namely; filtering/estimation based noise reduction, beam forming and active noise cancellation (ANC) techniques. Recent work shows improved performance results when GA is applied on speech signals recorded under noisy conditions. The work here applies evolutionary computation in form of genetic algorithm to select the features that are responsible for discriminating the different words. In doing so, the amount of feature elements to be used also gets reduced and hence system can be made to recognize the word-speech with realtime performance.
1.1.2 Paradigm for Speech Recognition A general model for speech recognition, as shown in Fig.1.1 is used here.
Keywords: speech recognition, genetic algorithms, MFCC, FBCC
I. INTRODUCTION Figure 1.1. General block diagram of a task oriented speech recognition system.
1.1 Speech Recognition 1.1.1 Introduction to Speech Recognition Speech is a natural mode of communication for people. We learn all the relevant skills during early childhood, without instruction, and we continue to rely on speech communication throughout our lives. Vocalizations can vary widely in terms of their accent, pronunciation, articulation, roughness, nasality, pitch, volume, and speed; moreover, during transmission, our irregular speech patterns can be further distorted by background noise and echoes, as well as electrical characteristics (if telephones or other electronic equipment are used). All these sources of variability make speech recognition, even more than speech generation, a very complex problem. Yet people are so comfortable with speech that we would also like to interact with our computers via speech, rather than having to resort to primitive interfaces such as keyboards and pointing devices.
The conditions of evaluation — and hence the accuracy of any system — can vary along the following dimensions: Vocabulary size and confusability It is easy to discriminate among a small set of words, but error rates naturally increase as the vocabulary size grows. Speaker dependence vs. Independence Speaker independence is difficult to achieve because a system’s parameters become tuned to the speaker(s) that it was trained on, and these parameters tend to be highly speaker-specific. Error rates are typically 3 to 5 times higher for speaker independent systems than for speaker dependent ones.
363
International Journal of Emerging Technology and Advanced Engineering Website: www.ijetae.com (ISSN 2250-2459, ISO 9001:2008 Certified Journal, Volume 4, Issue 1, January 2014) Isolated, discontinuous, or continuous speech Isolated speech means single words; discontinuous speech means full sentences in which words are artificially separated by silence; and continuous speech means naturally spoken sentences. Isolated and discontinuous speech recognition is relatively easy because word boundaries are detectable and the words tend to be cleanly pronounced.
The speech enhancement is not only useful for storage and transmission of speech data but it can play vital role in improving much need system based speech recognition where accurate identification of words and sentences can provide automation in most of the human-machine based interface and also be useful in machine-machine interaction based automation. It is obvious that speech enhancement can boost up the performance of speech recognition systems by keeping low word error rate (WER). There are various types of advanced speech enhancement algorithms in literature and they can be classified in main three categories, namely; filtering/estimation based noise reduction, beam forming and active noise cancellation (ANC) techniques. The development for speech recognition system has been for a while. The recognition platform can be divided into three types. Dynamic Time Warping (DTW)[1], the earliest platform, uses the variation in frame's time for adjustment and further recognition. Later, Artificial Neural Network (ANN) replaced DTW. Finally, Hidden Markov Model was developed to adopt statistics for improved recognition performance. Besides the recognition platform, the process of speech recognition also includes: recording of voice signal, point detect, pre-emphasis, speech feature capture, etc. The final step is to transfer the input sampling feature to recognition platform for matching. In recent years, study on Genetic Algorithm can be found in many research papers [2], [3], [4]. They demonstrated different characteristics in Genetic Algorithm than others. For example, parallel search based on random multi-points, instead of a single point, was adopted to avoid being limited to local optimum. In the operation of Genetic Algorithm, it only needs to establish the objective function without auxiliary operations, such as differential operation. Therefore, it can be used for the objective functions for all types of problems. Because artificial neural network has better speech recognition speed and less calculation load than others, it is suitable for chips with lower computing capability. Therefore, artificial neural network was adopted in this study as speech recognition platform. Most artificial neural networks for speech recognition are back-propagation neural networks. The local optimum problem [5] with Steepest Descent Method makes it fail to reach the highest recognition rate. In this study, Genetic Algorithm was used to improve the drawback.
Task and language constraints Even with a fixed vocabulary, performance will vary with the nature of constraints on the word sequences that are allowed during recognition. The difficulty of a task is more reliably measured by its perplexity than by its vocabulary size. Read vs. spontaneous speech Systems can be evaluated on speech that is either read from prepared scripts, or speech that is uttered spontaneously. Spontaneous speech is vastly more difficult Adverse conditions A system’s performance can also be degraded by a range of adverse conditions like environmental noise ,acoustical distortions, different microphones ,limited frequency bandwidth and altered speaking manner. The central issue in speech recognition is dealing with variability. Currently, speech recognition systems distinguish between two kinds of variability: acoustic and temporal. Acoustic variability covers different accents, pronunciations, pitches, volumes, and so on, while temporal variability covers different speaking rates. These two dimensions are not completely independent — when a person speaks quickly, his acoustical patterns become distorted as well — but it’s a useful simplification to treat them independently. 1.1.3 Research in the field of speech recognition Various approaches for noise reduction and speech enhancements have been investigated and developed. Wiener filter can be adaptively estimated used in an environment where the surrounding noise has time-varying characteristics. Adaptive algorithms such as Least Mean Square (LMS) and Recursive Least Squares (RLS) are well known examples and also widely used.
364
International Journal of Emerging Technology and Advanced Engineering Website: www.ijetae.com (ISSN 2250-2459, ISO 9001:2008 Certified Journal, Volume 4, Issue 1, January 2014) 1.2 Genetic Algorithms
Evolutionary algorithms are thus made efficient because they are flexible, and relatively easy to hybridize with domain-dependent heuristics.
1.2.1 Introduction Darwin’s principle “Survival of the fittest” captured can be used as a starting point in introducing evolutionary computation. Biological species have solved the problems of chaos, chance, nonlinear interactivities and temporality. These problems proved to be in equivalence with the classic methods of optimization. The evolutionary concept can be applied to problems where heuristic solutions are not present or which leads to unsatisfactory results. As a result, evolutionary algorithms are of recent interest, particularly for practical problems solving. In an evolutionary algorithm, a representation scheme is chosen by the researcher to define the set of solutions that form the search space for the algorithm. A number of individual solutions are created to form an initial population. The following steps are then repeated iteratively until a solution has been found which satisfies a pre-defined termination criterion. Each individual is evaluated using a fitness function that is specific to the problem being solved. Based upon their fitness values, a number of individuals are chosen to be parents. New individuals, or offspring, are produced from those parents using reproduction operators. The fitness values of those offspring are determined. Finally, survivors are selected from the old population and the offspring to form the new population of the next generation. The mechanisms determining which and how many parents to select, how many offspring to create, and which individuals will survive into the next generation together represent a selection method. Many different selection methods have been proposed in the literature, and they vary in complexity. Typically, though, most selection methods ensure that the population of each generation is the same size. From the optimization point of view, the main advantage of evolutionary computation techniques is that they do not have much mathematical requirements about the optimization problems. All they need is an evaluation of the objective function. As a result, they are applied to nonlinear problems, defined on discrete, continuous or mixed search spaces, constrained or unconstrained.
1.3 Speech Recognition using Genetic Algorithms An important pre-processing step in Automatic Speech Recognition systems is to detect the presence of noise. It has been shown that accurate speech endpoint detection improves the isolated word recognition accuracy. Also, proper location of regions of speech reduces the amount of processing. This aspect is also important for mobile telephony. Thus, for developing speech recognition device capable of working in car an appropriate endpoint detection algorithm is needed. The endpoint detection problem is nontrivial for non stationary backgrounds where artifacts (i.e., non speech events) may be introduced by the speaker, the recording environment, and the transmission system. Ordinarily, the rate of zero crossing and short time energy is used for endpoint detection. However, this is not a trivial process and many different algorithms have been developed. 1.3.1. Speech pre-processing The speech signal needs be pre-processed prior to entering the recognition platform. The speech preprocessing includes point detection, hamming windows, speech feature, etc. 1. Fixed-size frame and Dynamic-size frame 2. Point Detection 3. Hamming Window 4. Feature capture 1.3.2 Speech recognition platform Back Propagation Neural Network (BPNN) is the most commonly used structure in an Artificial Neural Network (ANN). Although ANN has fast recognition rate and fault tolerance, it is not perfect because its SDM has a problem with Local Optimum. To prevent this from happening, GA is adopted to assist in SDM for obtaining optimal weight and improved recognition performance.
365
International Journal of Emerging Technology and Advanced Engineering Website: www.ijetae.com (ISSN 2250-2459, ISO 9001:2008 Certified Journal, Volume 4, Issue 1, January 2014) 1.3.3 Genetic algorithm Process
2.1 Feature Extraction The features extracted out of the users’ speech must be robust to noise and insensitive to the changes in the characteristics of the microphone and environment. Feature vector based on Fourier-Bessel coefficients will characterizes both speech and speaker [6]. However the perceptual and cepstral characteristics of Bessel coefficients have not been explored yet. FBCC based feature extraction indicates an improved accuracy and efficiency in comparison to MFCC features extracted [7]. The block diagram of estimating or extracting the MFCC and FBCC features is shown in Fig. 2.2.
Figure 1.2. The evolutionary process of GA
II. METHODOLOGY The process of speech recognition in noisy condition is accomplished in two phases: 1. Feature extraction 2. Speech recognition using features extracted Figure 2.1 depicts the methodology of the proposed work. In the first phase, the features from the underlying speakers’ database are extracted and stored in a feature database. The database for features is now accessed by the speech recognition block to recognize the user for a new speech input, using Genetic Algorithm as an evolutionary computation tool for optimizing the search in the database. The result in the form of matched user ID is shown as an output of the work from the database.
Figure 2.2 Block diagram of extraction of speech features.
The FBCC feature extraction technique is applied in the proposed work to extract the features of users’ speech. The zeroth order Fourier-Bessel series expansion of a signal considered over some arbitrary interval is expressed as: (2.1) Where the analysis. The as:
,
and
is the time frame of
coefficients in (2.1) can be calculated
(2.2) Where is the order of FB expansion, and th are the zero and the first order Bessel functions. The coefficients extracted from the test users’ speech databases are stored in the features table, which will be accessed by the optimization algorithm designed using genetic algorithm for speech recognition in matching the speech with the existing users’ database.
Fig. 2.1 Proposed methodology
366
International Journal of Emerging Technology and Advanced Engineering Website: www.ijetae.com (ISSN 2250-2459, ISO 9001:2008 Certified Journal, Volume 4, Issue 1, January 2014) 2.2 Speech recognition using Genetic Algorithm As genetic algorithm is well known for handling noisy functions well hence, the methodology adopted here uses genetic algorithm for recognizing the speech features and matching the same from the available database generated from the FBCC feature extraction technique. The speech features are recognized using genetic algorithm following a systematic technique which includes iterations consisting of a process looping around below mentioned steps: 1. Selection 2. Reproduction 3. Evaluation 4. Replacement
The speech signals are pre-processed before working on them. The signals are down converted to a sampling frequency of 16 kHz, and the length of the signals is constrained to at the most 4 seconds. For making a robust speech recognizer for different users, and for efficient working of the genetic algorithm, a database is required which should contain the extracted features of the user pertaining to different utterances. The extracted feature database of the utterances is made using FBCC, as discussed in the previous chapter. The features extracted are accessed by the genetic algorithm to search out the best match. The utterance is added with different types of noise, the features of the signal with added noise are extracted and the genetic algorithm finds optimally the best match for the features extracted with respect to the feature database, and displays the result for best match The simulations results for different types of noises are as below in table 3.1
2.3 The basic genetic algorithm is as follows: • [start] Genetic random population of n chromosomes (suitable solutions for the problem) • [Fitness] Evaluate the fitness f(x) of each chromosome x in the population • [New population] Create a new population by repeating following steps until the New population is complete o [Selection] select two parent chromosomes from a population according to their fitness ( the better fitness, the bigger chance to get selected). o [crossover] With a crossover probability, cross over the parents to form new offspring (children). If no crossover was performed, offspring is the exact copy of parents. o [Mutation] With a mutation probability, mutate new offspring at each locus (position in chromosome) o [Accepting] Place new offspring in the new population. • [Replace] Use new generated population for a further sum of the algorithm. • [Test] If the end condition is satisfied, stop, and return the best solution in current population. • [Loop] Go to step2 for fitness evaluation. III.
Table 3.1. simulation results for different types of noises S.no
Type of noise added
Inpu t user
Input Sampl e
Recognis ed user
Recognis ed sample
Case 1
White gaussi annoise
10
3
10
3
Case 2
Carnoise
34
7
34
7
Case 3
Bnoise
23
5
23
6
Case 1
S IMULATION AND RESULTS
Simulation was taken out in Matlab and in order to simulate the analytics discussed previously, the database of 40 speakers is taken. This database contains different utterances of 40 different speakers, both male and female speakers, and each speaker has uttered 8 different sentences.
Figure 3.1.1 Speech Signal
367
International Journal of Emerging Technology and Advanced Engineering Website: www.ijetae.com (ISSN 2250-2459, ISO 9001:2008 Certified Journal, Volume 4, Issue 1, January 2014) Case 3
Figure 3.1.2. White noise added to the signal.
Case 2
Figure 3.3.1. Speech signal
Figure 3.2.1. Speech Signal
Figure 3.3.2 Added B noise to the signal.
The simulation shown accurate results for low intensity noise like white Gaussian noise and car noise, whereas for high intensity noise like B-noise and impulse noise simulation results were quite comparable. IV.
CONCLUSION
In this paper an advanced algorithm has been proposed for speech recognition using Genetic algorithm. It is found genetic algorithm for optimization gives better result for speech recognition. Hidden Markov model (HMM) was the widely used method in speech recognition but the level of accuracy using HMM was strongly influenced by the optimization of extraction process and modeling methods while on the other hand better results can be achieved with the help of Genetic algorithm.
Figure 3.2.2 Car noise signal added.
368
International Journal of Emerging Technology and Advanced Engineering Website: www.ijetae.com (ISSN 2250-2459, ISO 9001:2008 Certified Journal, Volume 4, Issue 1, January 2014) It is found that recognition accuracy for feature extraction with FBCC features in comparison with MFCC is better. In this thesis work combination of FBCC and Genetic algorithm has been implemented and better results are achieved. This algorithm has been tested on samples of various users with and without adding noise and a high degree of accuracy is achieved during recognition V.
REFERENCES [1 ] Sakoe, H. and Chiba, S., " Dynamic Programming Optimization for Spoken Word Recognition", IEEE Transactions on Signal Processing, Vol. 26, pp 43- 49. (1978). [2 ] Chu, S. H. , "Combination of GA and SDM to Improve ANN Training Efficiency", Shu-Te University, MS Thesis, Taiwan. (2003). [3 ] Chen, S. C., "Use of GA in CSD Coded Finite Impulse Digital Filter (FIR)", Shu-Te University, MS Thesis, Taiwan, (2003) [4 ] Chu, W. C. ,"Speech Coding Algorithms'', John Wiley & Sons, 9780- 471-37312-4, USA. (2003). [5 ] Yeh, Y. C. , "Implementation and Application of Artificial Neural Network", Ru Lin Publication, 957499628X, Taiwan ,(1993) [6 ] K. Gopalan, T. R. Anderson and E. J. Cupples, "A comparison of speaker identification result using features based on cepstrum and Fourier-Bessel expansion," IEEE Trans. Acoust. Speech Signal Processing, vol.7, no. 3, pp. 289-294, May., 1999 [7 ] P. Ganga Mohan, Chetana Prakash, Surayakanath Gangashetty, "Bessel transform for image resizing", Proceedings of International journal in Speech Processing Pages 42-62
FUTURE SCOPE
In fact the proposed method gives better results than the algorithms proposed earlier but there is always a window of improvement. Better results may be achieved by putting genetic algorithm with a combination of neural or fuzzy logics. Better feature extraction techniques can be implemented to achieve higher degree of accuracy in less computational time.
369