Genetic Algorithm Approach for Sinhala Speech Recognition
P. G. N. Priyadarshani, N. G. J. Dias
Department of Statistics and Computer Science, University of Kelaniya, Kelaniya, Sri Lanka
[email protected]
Amal Punchihewa
School of Engineering and Adv. Technology, Massey University, Palmerston North, New Zealand
[email protected]
Abstract— For decades, researchers around the world have attempted to develop a natural interface between humans and computers that enables a computer to speak and understand natural language as humans do. Although speech recognition systems have been developed to some extent for several widely used languages, people still prefer to work in their native language. Speaker dependency has also been a major issue: most users prefer a speaker independent recognizer, because a speaker dependent recognizer keeps reference templates only for enrolled speakers, so each new user has to go through a training phase. In this research, a Genetic Algorithm (GA) was successfully applied with Mel Frequency Cepstral Coefficients (MFCC) to identify separately pronounced Sinhala words on both speaker independent and speaker dependent platforms.
B. Pattern Recognition Approach
During the last six decades, the greatest degree of success in speech recognition has been achieved with pattern recognition paradigms [1]. The major feature of this approach is that it uses a well formulated mathematical framework and establishes consistent speech pattern representations, for reliable pattern comparison, from a set of labeled training samples via a formal training algorithm. Most ASR systems based on the pattern recognition approach consist of two phases: training of speech patterns, and recognition of patterns via pattern comparison. During the training phase, speech knowledge is brought to the system: the system learns the reference patterns representing the different speech sounds, such as phrases, words and phones, that constitute the vocabulary of the application. Each reference is learned from spoken instances and stored either as a template obtained by some averaging method or as a model that characterizes the statistical properties of the pattern. The utility of the method depends heavily on the pattern comparison stage, in which the unknown speech is compared directly with each pattern learned in the training phase and classified according to the goodness of match. There are two methods of pattern comparison: the template based approach and the stochastic approach.
Keywords-genetic algorithm; MFCC; selection; crossover; mutation; threshold; HCI
I. INTRODUCTION
Automatic Speech Recognition (ASR) is a sub-domain that helps to build a natural interface between human and computer; it is the ability to understand spoken words and convert them into text. ASR converts an acoustic signal, captured by a microphone or a telephone, into a machine readable format. The recognized words can be the final result for applications such as data entry and command and control, or they can serve as input to further linguistic processing in order to achieve speech understanding. Speech recognition techniques can be classified broadly into the acoustic phonetic approach, the pattern recognition approach and the artificial intelligence approach, as described below.
C. Artificial Intelligence Approach
The Artificial Intelligence (AI) approach (knowledge based approach) involves two basic ideas. Firstly, it involves studying the thought processes of human beings. Secondly, it deals with representing those processes via machines (such as computers and robots). AI is behavior of a machine which, if performed by a human being, would be called intelligent. It makes machines smarter and more useful, and is less expensive than natural intelligence. In this way, the AI approach attempts to automate the recognition procedure according to the way a person applies his or her intelligence in visualizing, analyzing, and finally making a decision on the measured acoustic features. Therefore it is a hybrid of the acoustic phonetic approach and the pattern recognition approach. Although recognition systems that use acoustic phonetic knowledge to develop classification rules for speech sounds have been proposed, template based approaches have been very effective in the design of a variety of speech recognition systems, even though they provided little
A. Acoustic Phonetic Approach
The acoustic phonetic approach was the first approach used by researchers to recognize speech. It assumes that finite, distinct phonetic units exist and that they can be broadly characterized by features of the speech signal. The acoustic properties associated with phonetic units vary for a number of reasons, for example between speakers and because of the coarticulation effect, but the acoustic phonetic approach does not account for these variations [1, 2, 9].
978-1-4673-2527-1/12/$31.00 ©2012 IEEE
insight about human speech processing. On the other hand, a large body of linguistic and phonetic literature provided insights and understanding to human speech processing. In its pure form, knowledge engineering design involves the direct and explicit incorporation of expert’s speech knowledge into a recognition system. This knowledge is usually derived from careful study of spectrograms and is incorporated using rules or procedures [1, 9].
efficient Sinhala speech recognizer that works well in both speaker independent and speaker dependent modes.
Among the restrictions, speaker dependency is at the top of the list and is common to all approaches. Speaker dependent speech recognition systems require speaker enrollment: the user must provide samples of his or her speech before using the system. Other systems are said to be speaker independent, in that no enrollment is necessary. In the speaker dependent approach, the recognizer keeps reference templates only for enrolled speakers, so each new user has to go through a training phase and the user domain is limited to those who have trained the system. In contrast, a speaker independent system is developed to operate for any speaker of a particular language.
A. Front-end Processing & Parameterization
The choice of feature extraction method plays a crucial role in the accuracy and speed of speech recognition. Feature extraction is complicated by the fact that speech signals are not stationary over time [3]; therefore, short-time spectral analysis was employed. The speech signals were pre-processed prior to feature extraction. They were pre-emphasized to boost the high frequency energy, eliminating spectral tilt [4]. The start point and the end point of the speech were determined to isolate the word utterance. The signal was then divided into equal sized frames of short duration (23 ms), which were overlapped to avoid losing information at the frame boundaries. A Hamming window was applied to each frame to mitigate the effects of truncating the signal.
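The pre-processing chain described above can be illustrated with a minimal sketch in plain Python. The 23 ms frame duration comes from the text; the pre-emphasis coefficient (0.97), the 50% frame overlap and all function names are assumptions made for illustration only, since the paper does not state them. End-point detection is omitted.

```python
import math

def pre_emphasize(signal, alpha=0.97):
    """First-order high-pass filter y[n] = x[n] - alpha * x[n-1].
    alpha=0.97 is an assumed, conventional value (not given in the paper)."""
    if not signal:
        return []
    return [signal[0]] + [signal[n] - alpha * signal[n - 1]
                          for n in range(1, len(signal))]

def hamming(length):
    """Hamming window w[n] = 0.54 - 0.46 * cos(2*pi*n / (length - 1))."""
    if length == 1:
        return [1.0]
    return [0.54 - 0.46 * math.cos(2 * math.pi * n / (length - 1))
            for n in range(length)]

def frame_signal(signal, sample_rate, frame_ms=23.0, overlap=0.5):
    """Split a signal into equal sized, overlapping, Hamming-windowed frames.
    overlap=0.5 is an assumption; the paper only says the frames overlap."""
    frame_len = int(round(sample_rate * frame_ms / 1000.0))
    hop = max(1, int(round(frame_len * (1.0 - overlap))))
    window = hamming(frame_len)
    frames = []
    for start in range(0, len(signal) - frame_len + 1, hop):
        chunk = signal[start:start + frame_len]
        frames.append([s * w for s, w in zip(chunk, window)])
    return frames

# Usage: 0.1 s of a 440 Hz tone sampled at 16 kHz (synthetic test input).
sample_rate = 16000
tone = [math.sin(2 * math.pi * 440 * n / sample_rate)
        for n in range(sample_rate // 10)]
frames = frame_signal(pre_emphasize(tone), sample_rate)
```

Each resulting frame would then be passed to the MFCC computation; trailing samples that do not fill a whole frame are dropped in this sketch.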
III. METHODOLOGY
This section presents the approach of the research. It is divided into two sub-sections: front-end processing & parameterization, and speech recognition.
II. RESEARCH PROBLEM
A. Motivation
Currently, there is a great demand for a Human-Computer Interface (HCI) that works with Sinhala speech. With such an interface, it would be possible to develop useful software across different fields and to make efficient use of Information Technology. However, no reliable speech recognition methodology has yet been developed to identify Sinhala speech.
Mel Frequency Cepstral Coefficients (MFCC) were used to model the speech signal because they have emerged as a promising feature extraction method [5, 6]. Not only the static features but also the dynamic features of speech should be taken into account to better represent the vocal tract. The derivatives of the MFCCs represent these dynamic features well. Accordingly, a speech signal could be represented as a matrix of dimension n×36 if 12 MFCCs along with their first order and second order derivatives were considered, where n is the number of frames of the speech signal.
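To make the n×36 representation concrete, the following sketch appends first and second order derivatives to an n×12 matrix of static MFCCs. The derivative is approximated here with simple central differences, which is an assumption; the paper does not specify how the derivatives were computed, and the function names are hypothetical.

```python
def deltas(rows):
    """Time derivative of each coefficient, using central differences
    in the interior and one-sided differences at the first and last frame."""
    n = len(rows)
    if n < 2:
        return [[0.0] * len(r) for r in rows]
    out = []
    for t in range(n):
        lo, hi = max(t - 1, 0), min(t + 1, n - 1)
        out.append([(b - a) / (hi - lo) for a, b in zip(rows[lo], rows[hi])])
    return out

def stack_features(mfcc):
    """Concatenate static, first order and second order features per frame.
    An n x 12 input yields an n x 36 output."""
    d1 = deltas(mfcc)
    d2 = deltas(d1)
    return [s + a + b for s, a, b in zip(mfcc, d1, d2)]

# Usage: 5 synthetic frames whose 12 coefficients grow linearly with time.
mfcc = [[float(t)] * 12 for t in range(5)]
features = stack_features(mfcc)
```

For this linear input the first derivative is 1 in every frame and the second derivative is 0, which gives a quick sanity check of the stacking order.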
B. Significance
Genetic Algorithm (GA) is an optimization technique that simulates the phenomenon of natural evolution, where species search for increasingly beneficial adaptations for survival within their complex environments. In this research, the important mechanisms of natural adaptation were imported into speaker independent Sinhala speech recognition.
B. Speech Recognition
A GA was applied for automatic recognition of speech. GAs are search algorithms based on the mechanics of natural selection and natural genetics, and they are a major part of neural and evolutionary computing. They are stochastic (non-deterministic) global search and optimization methods that mimic natural biological evolution. GAs operate on a population of potential solutions, applying the principle of survival of the fittest to produce successively better approximations to a solution. At each generation of a GA, a new set of approximations is created by selecting individuals according to their level of fitness in the problem domain and reproducing them using operators borrowed from natural genetics. This process leads to the evolution of populations of individuals that are better suited to their environment than the individuals from which they were created, just as in natural adaptation [7, 8].
C. Challenges
Speaker dependency strongly degrades the usability of a speech recognition system and the satisfaction of its users, because each user has to undergo training initially. However, developing a system that overcomes inter-speaker and intra-speaker variability is challenging. Intra-speaker variability is the variation in the speech of a single speaker caused by differences in speaking style, such as voice quality, articulation rate, stress and pitch contour; no speaker can reproduce a voice signal exactly. Inter-speaker variability is the variation between voice signals generated by different speakers due to various physical, physiological and behavioural characteristics. In addition, adapting the GA mechanism to this particular problem is also challenging.
According to Holland [6], GA moves from one population of chromosomes to a new population by using a kind of natural selection together with the genetics inspired operators of crossover, mutation, and inversion. Each chromosome consists of genes, each gene being an instance of a particular allele.
D. Research Aim and Objectives
In this research, the robustness of GA and MFCC in speech recognition was explored. The objective was to develop an
average distance between all the occurrences of a subpopulation. If the minimum distance is less than the recognition threshold of the word represented by the initial population, then the word is recognized. If the recognition threshold was not reached in the initial population after 50 generations, the search moved to the next sub-population, and so on until the recognition threshold of the word to recognize was reached. The stop criterion was that either the word is recognized or all sub-populations have been covered. The number of generations was limited to 50 because experiments showed that the population tended to converge after about 50 generations, so further generations brought no benefit.
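The distance-based fitness and the recognition threshold can be sketched as follows. The paper does not say how the Euclidean distance is taken between MFCC matrices with different numbers of frames, so this sketch assumes that only the frames common to both matrices are compared; that choice and all function names are illustrative. The sketch checks a single, unevolved sub-population and leaves out the 50-generation GA loop.

```python
import math
from itertools import combinations

def euclidean_distance(a, b):
    """Euclidean distance between two MFCC matrices (lists of frames).
    Assumption: only the first min(len(a), len(b)) frames are compared."""
    n = min(len(a), len(b))
    total = 0.0
    for row_a, row_b in zip(a[:n], b[:n]):
        total += sum((x - y) ** 2 for x, y in zip(row_a, row_b))
    return math.sqrt(total)

def recognition_threshold(sub_population):
    """Average pairwise distance between all occurrences of one word."""
    pairs = list(combinations(sub_population, 2))
    return sum(euclidean_distance(a, b) for a, b in pairs) / len(pairs)

def is_recognized(unknown, sub_population, threshold):
    """The word is accepted when the best (smallest) distance is below
    the threshold of the word that the sub-population represents."""
    best = min(euclidean_distance(unknown, ind) for ind in sub_population)
    return best < threshold

# Usage with tiny synthetic "matrices" of one frame and two coefficients.
sub_population = [[[0.0, 0.0]], [[3.0, 4.0]], [[6.0, 8.0]]]
threshold = recognition_threshold(sub_population)
accepted = is_recognized([[1.0, 0.0]], sub_population, threshold)
```

In the full procedure this check would be repeated after each generation, and the search would move on to the next sub-population when the threshold is not reached.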
Each chromosome can be thought of as a point in the search space of candidate solutions. In the computer world, the chromosomes in a GA population typically take the form of bit strings, so each locus in the chromosome has two possible alleles: 0 and 1. The selection operator chooses those chromosomes in the population that will be allowed to reproduce, and on average the fitter chromosomes produce more offspring than the less fit ones. Crossover exchanges subparts of two chromosomes, roughly mimicking biological recombination between two single-chromosome (haploid) organisms; mutation randomly changes the allele values of some locations in the chromosome; and inversion reverses the order of a contiguous section of the chromosome, thus rearranging the order in which genes are arrayed. Inversion, the fourth element of GAs, is rarely used in today's implementations as it is not well established [8]. In brief, a GA has a population of chromosomes and involves three types of operators: selection, crossover, and mutation.
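For the classical bit-string case described above, the three chromosome-level operators can be sketched in a few lines. The function names are illustrative, and the sketch is generic rather than the implementation used in this work, whose chromosomes are MFCC matrices.

```python
import random

def one_point_crossover(a, b, rng):
    """Exchange the tails of two equal-length bit strings at a random locus."""
    point = rng.randrange(1, len(a))
    return a[:point] + b[point:], b[:point] + a[point:]

def mutate(bits, rate, rng):
    """Flip each allele independently with probability `rate`."""
    return [1 - bit if rng.random() < rate else bit for bit in bits]

def invert(bits, rng):
    """Reverse the order of a randomly chosen contiguous section."""
    i, j = sorted(rng.sample(range(len(bits) + 1), 2))
    return bits[:i] + bits[i:j][::-1] + bits[j:]

# Usage with a seeded generator so that runs are repeatable.
rng = random.Random(0)
child_1, child_2 = one_point_crossover([0] * 8, [1] * 8, rng)
mutated = mutate(child_1, 0.01, rng)
inverted = invert([1, 0, 0, 1, 1, 0, 1, 0], rng)
```

Crossover and inversion only rearrange existing alleles, whereas mutation is the only operator here that can introduce an allele value missing from both parents at a given locus.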
3) Selection: Selection determines which individuals in the population will create offspring for the next generation, and how many offspring each will create. The purpose of selection is to emphasize the fitter individuals in the population in the hope that their offspring will in turn have even higher fitness. Selection has to be balanced with variation from crossover and mutation (the exploitation/exploration balance): selection that is too strong means that suboptimal highly fit individuals will take over the population, reducing the diversity needed for further change and progress; selection that is too weak will result in evolution that is too slow. Numerous selection schemes have been proposed in the GA literature, including fitness proportionate selection, rank selection, tournament selection, and elitist selection. Here, an elitist selection scheme was followed to protect the best individuals in the population, and fitness proportionate selection was used to generate the rest of the individuals of the subsequent population. Retaining the best individuals of a generation unchanged in the next generation is called elitism; it was first introduced by Kenneth De Jong in 1975 [8]. Without it, such individuals can be lost if they are not selected to reproduce or if they are destroyed by crossover or mutation.
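Elitism as described above can be sketched as follows. Because fitness in this work is a distance, the best individuals are those with the smallest fitness value. The number of elites kept per generation (`n_elite`) is not stated in the paper, so the default of 2 and the function names are assumptions.

```python
def select_elites(population, fitness, n_elite):
    """Return the n_elite individuals with the smallest fitness (distance)."""
    ranked = sorted(range(len(population)), key=lambda i: fitness[i])
    return [population[i] for i in ranked[:n_elite]]

def next_generation(population, fitness, offspring, n_elite=2):
    """Copy the elites unchanged and fill the remaining slots with offspring
    produced by selection, crossover and mutation."""
    elites = select_elites(population, fitness, n_elite)
    return elites + offspring[:len(population) - n_elite]

# Usage: labels stand in for MFCC matrices.
population = ["a", "b", "c", "d"]
fitness = [3.0, 1.0, 4.0, 2.0]
new_population = next_generation(population, fitness, ["w", "x", "y", "z"])
```

The population size stays constant, and the two closest matches ("b" and "d" here) survive regardless of what crossover and mutation do to the offspring.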
1) Initialization of the Population: The population size is the number of chromosomes in the population (in one generation). If there are too few chromosomes, the GA has few possibilities to perform crossover and only a small part of the search space is explored. On the other hand, if there are too many chromosomes, the GA slows down. Previous research has shown that increasing the population size beyond some limit is not useful, and the most suitable population size depends mainly on the problem [8]. Here the population was the set of reference templates (learning corpora) managed by our GA. Each individual chromosome in our population represents an MFCC matrix rather than a vector of binary digits. The number of columns is the number of MFCC coefficients calculated per frame and the number of rows is the number of frames in a particular individual (word). In other words, each row holds the MFCC coefficients of one frame of the utterance. Although the number of columns is fixed for each individual, the number of rows varies because the length varies from one utterance to another. Therefore,
Elitism can rapidly increase the performance of a GA because it prevents the loss of the best solution found, and it is a successful variant of the general process of constructing a new population. In fitness proportionate selection, the expected value of an individual (i.e., the expected number of times an individual will be selected to reproduce) is that individual's fitness divided by the average fitness of the population. Although the most common method for implementing fitness proportionate selection is roulette wheel sampling, Stochastic Universal Sampling (SUS) was used here because the roulette wheel method could, in the worst case, allocate all offspring to the worst individual in the population. Early in the search the fitness variance in the population is high and a small number of individuals are much fitter than the others. As a consequence, they and their descendants multiply quickly in the population, in effect preventing the GA from doing any further exploration. This is known as premature convergence. There are several scaling methods that map raw fitness values to expected values so as to make the GA less susceptible to premature convergence. One is sigma scaling, given in equation (2), which keeps the selection pressure relatively constant over the course of the run.
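A minimal sketch of Stochastic Universal Sampling is given below. It assumes that the expected values are non-negative and already oriented so that a larger value means a better individual; how the distance-based fitness of this work is converted into such values is not specified in the paper. The function name and arguments are illustrative.

```python
import random

def stochastic_universal_sampling(expected_values, n_select, rng):
    """Select n_select indices with a single spin and equally spaced pointers.
    Each individual is chosen a number of times close to its share of the
    total expected value, unlike independent roulette wheel spins."""
    total = sum(expected_values)
    step = total / n_select
    start = rng.random() * step          # single random offset in [0, step)
    chosen = []
    index = 0
    cumulative = expected_values[0]
    for k in range(n_select):
        pointer = start + k * step
        # Advance to the individual whose cumulative segment holds the pointer.
        while cumulative <= pointer and index < len(expected_values) - 1:
            index += 1
            cumulative += expected_values[index]
        chosen.append(index)
    return chosen

# Usage: four individuals, four selections.
picks = stochastic_universal_sampling([2.0, 1.0, 1.0, 0.0], 4, random.Random(0))
```

With these values the pointer spacing equals 1, so the first individual is picked exactly twice, the second and third once each, and the individual with zero expected value never, whatever the random offset.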
$$N = s \times w \times r, \tag{1}$$
where N is the number of individuals in the population, s is the number of speakers, w is the number of words and r is the number of repetitions.
The population was divided into w (number of distinct words) sub-populations. Initially, a single sub-population, made up of all the occurrences of one word, was selected as the initial population.
2) Evaluation of the Population: A GA most often requires a fitness function that assigns a score (fitness) to each chromosome in the current population. The fitness of a chromosome depends on how well that chromosome solves the problem. To evaluate the population, the fitness of each individual was calculated. In this work, a given word was recognized on the basis of minimum distance. Therefore the fitness of each individual is the Euclidean distance between the MFCC matrix of the word to recognize and the MFCC matrix of that individual, and the best fitness is the smallest distance. A recognition threshold was established for each distinct word. It is the
$$\mathrm{ExpVal}(i) = \begin{cases} 1 + \dfrac{f(i) - \bar{f}}{\sigma} & \text{if } \sigma \neq 0 \\ 1 & \text{if } \sigma = 0 \end{cases}, \tag{2}$$
where ExpVal(i) is the expected value of individual i, f(i) is the fitness of i, $\bar{f}$ is the mean fitness of the population and σ is the standard deviation of the population fitnesses.
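Equation (2) can be evaluated with the short sketch below, which follows the form of the equation as printed here. It assumes that σ is the population (not sample) standard deviation and that f is oriented so that a larger value is better; both points, and the function name, are assumptions.

```python
import statistics

def sigma_scaled_expected_values(fitness):
    """Map raw fitness values to expected values with sigma scaling:
    1 + (f(i) - mean) / sigma when sigma != 0, and 1 otherwise."""
    mean = statistics.fmean(fitness)
    sigma = statistics.pstdev(fitness)   # population standard deviation (assumed)
    if sigma == 0:
        return [1.0 for _ in fitness]
    return [1.0 + (f - mean) / sigma for f in fitness]

# Usage
expected = sigma_scaled_expected_values([1.0, 2.0, 3.0])
uniform = sigma_scaled_expected_values([5.0, 5.0, 5.0])
```

An individual with mean fitness receives an expected value of 1, and the expected values always sum to the population size. Individuals more than one standard deviation below the mean receive a negative value in this form, which would have to be handled before the values are used for sampling.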
V. DISCUSSIONS
Test I achieved 96.13% accuracy while Test II achieved 96.00% accuracy. Test I therefore showed that the proposed GA is capable of handling multiple speakers, and Test II showed that the proposed GA is independent of the speaker. Further, word recognition for registered speakers was not noticeably better than for an unregistered speaker. The results indicated satisfactory precision in both cases. Three types of recognition errors were identified: substitution, deletion and insertion, as shown in TABLE I. Substitution occurs if a word is misclassified as another word in the vocabulary, deletion occurs if a word is not recognized, and insertion occurs if a word that does not belong to the vocabulary is recognized.
4) Crossover: In a GA, crossover is the main distinguishing genetic operator used to vary chromosomes from one generation to the next. After the parent chromosomes are selected, the crossover operator is applied to produce new chromosomes. A number of crossover techniques exist: one point crossover, two point crossover, cut and splice, and uniform crossover. One point crossover randomly chooses a locus (crossover point) and exchanges the subsequences before and after that locus between two chromosomes to create two offspring. In this work, one point crossover was used to avoid superfluous crossover. Each chromosome was subjected to crossover with probability 0.80.
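One way to apply one point crossover to MFCC-matrix chromosomes is sketched below. The crossover probability of 0.80 comes from the text. Treating a whole frame (matrix row) as the unit of exchange and cutting both parents at the same frame index are assumptions, since the paper does not describe how the locus is defined for matrices of different lengths.

```python
import random

def crossover_frames(parent_a, parent_b, rng, probability=0.80):
    """One point crossover on two MFCC matrices (lists of frames).
    Assumption: the cut is made at the same frame index in both parents."""
    limit = min(len(parent_a), len(parent_b))
    if limit < 2 or rng.random() >= probability:
        return list(parent_a), list(parent_b)   # parents pass through unchanged
    cut = rng.randrange(1, limit)
    return (parent_a[:cut] + parent_b[cut:],
            parent_b[:cut] + parent_a[cut:])

# Usage: a 4-frame and a 6-frame utterance with 3 coefficients per frame.
rng = random.Random(0)
short = [[0.0] * 3 for _ in range(4)]
long_ = [[1.0] * 3 for _ in range(6)]
child_1, child_2 = crossover_frames(short, long_, rng, probability=1.0)
```

Under this assumption each child starts with the leading frames of one parent and inherits the length of the other, so the number of columns stays fixed while the number of rows may differ between children.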
Further, constructing templates for entire words was helpful to avoid the errors due to segmentation or classification of smaller acoustically more variable units such as phonemes.
5) Mutation: A mutation is a change of a gene at a randomly determined locus. The altered gene may strengthen or weaken the recognition. The mutation probability is usually very low; here each offspring was subjected to mutation with probability 0.01. Mutation helps to prevent the population from stagnating at a local optimum.
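The mutation step can be sketched as follows. The probability of 0.01 per offspring comes from the text. The way the selected gene is altered is not described in the paper, so perturbing one randomly chosen coefficient with Gaussian noise, and the noise scale of 0.1, are purely illustrative assumptions.

```python
import random

def mutate_offspring(matrix, rng, probability=0.01, scale=0.1):
    """With the given probability, alter one gene (one coefficient of one
    frame) at a randomly determined locus. The Gaussian perturbation and
    its scale are assumptions, not the paper's stated method."""
    child = [row[:] for row in matrix]          # leave the input untouched
    if rng.random() < probability:
        frame = rng.randrange(len(child))
        coeff = rng.randrange(len(child[frame]))
        child[frame][coeff] += rng.gauss(0.0, scale)
    return child

# Usage: force a mutation to show that exactly one gene changes.
original = [[0.0] * 3 for _ in range(4)]
mutant = mutate_offspring(original, random.Random(1), probability=1.0)
```

At the default probability about one offspring in a hundred is altered, which keeps mutation a background source of diversity rather than the main search operator.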
IV. RESULTS
To evaluate the performance, two types of tests were carried out. In Test I, six repetitions of each word made by three speakers who participated in the learning process were used. In Test II, ten repetitions of each word generated by a completely new speaker were used. The obtained Word Recognition Rates (WRR) are given in TABLE I. Fifty words that do not belong to our vocabulary were used to check the insertion errors.
VI. CONCLUSIONS
The study reveals that GA is a competitive candidate for speech recognition on a speaker independent platform. However, the quality of any speech recognizer depends on a number of factors, including the nature of the utterance, speaker dependency, size of vocabulary, language complexity, and environmental conditions. This study covered only a vocabulary of thirty Sinhala words. In future developments, it is essential to increase the number of words while maintaining the recognition accuracy and speed. In addition, the utterance type should be extended to continuous and spontaneous speech, and robustness against noise has to be addressed.
6) Learning Corpora: Initially 30 distinct Sinhala words were selected as the vocabulary. The vocabulary was limited to 30 words because of the heavy workload of the process. Ten repetitions of each word from four speakers were employed as the reference patterns; therefore, the population size was 1200. The population was divided into 30 sub-populations, one for each distinct word, each consisting of 40 individuals. The choice of the initial population was random for each word to be recognized.
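The corpus arithmetic above can be checked with a short sketch. The counts (4 speakers, 30 words, 10 repetitions) come from the text; the tuple-based labels and function names are hypothetical stand-ins for the recorded utterances.

```python
def population_size(speakers, words, repetitions):
    """Number of individuals N = s * w * r."""
    return speakers * words * repetitions

def split_into_sub_populations(labels):
    """Group individual indices by word, one sub-population per word.
    Each label is a (word, speaker, repetition) tuple."""
    sub_populations = {}
    for index, (word, _speaker, _repetition) in enumerate(labels):
        sub_populations.setdefault(word, []).append(index)
    return sub_populations

# Usage: 30 words, 4 speakers, 10 repetitions each.
labels = [(w, s, r) for w in range(30) for s in range(4) for r in range(10)]
sub_populations = split_into_sub_populations(labels)
```

This reproduces the figures in the text: 1200 individuals in total, split into 30 sub-populations of 40 individuals each.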
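TABLE I. WRR FOR FIVE SPEAKERS

| Speaker | Number of Substitutions | Number of Deletions | Number of Insertions | WRR (%) |
|---------|-------------------------|---------------------|----------------------|---------|
| S1      | 0                       | 0                   | 3                    | 98.50   |
| S2      | 1                       | 2                   | 5                    | 96.00   |
| S3      | 2                       | 1                   | 9                    | 94.00   |
| S4      | 2                       | 0                   | 6                    | 96.00   |
| S5      | 3                       | 1                   | 4                    | 96.00   |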
REFERENCES
[1] L. Rabiner, B. H. Juang, and B. Yegnanarayana, Fundamentals of Speech Recognition, Pearson Education, 2011.
[2] M. A. Anusuya and S. K. Katti, “Speech recognition by machine: A review,” International Journal of Computer Science and Information Security, vol. 6, pp. 181-205, 2009.
[3] I. Mcloughlin, Applied Speech and Audio Processing with Matlab Examples, School of Computer Engineering, Nanyang Technological University, Singapore, Cambridge University Press, 2009.
[4] M. Heldner, “Spectral emphasis as an additional source of information in accent detection,” Prosody 2001: ISCA Tutorial and Research Workshop on Prosody in Speech Recognition and Understanding, Red Bank, NJ, pp. 57-60.
[5] R. Proksa, “Visualization of stages of determining cepstral factors in speech recognition systems,” Journal of Medical Informatics & Technologies, vol. 13, ISSN 1642-6037, pp. 121-128, 2009.
[6] E. Wong and S. Sridharan, “Comparison of linear prediction cepstrum coefficients and mel-frequency cepstrum coefficients for language identification,” Proceedings of International Symposium on Intelligent Multimedia, Video and Speech Processing, Hong Kong, May 2001.
[7] D. E. Goldberg, Genetic Algorithms in Search, Optimization & Machine Learning, Pearson Education, 2004.
[8] M. Mitchell, An Introduction to Genetic Algorithms, The MIT Press, 1999.
[9] http://www.telecom.tuc.gr/~ntsourak/tutorial_sr.htm, visited: April 2011.