Dynamic Time Warping Based Speech Recognition for Isolated Sinhala Words

P. G. N. Priyadarshani, N. G. J. Dias
Department of Statistics and Computer Science, University of Kelaniya, Kelaniya, Sri Lanka
[email protected]

Amal Punchihewa
School of Engineering and Advanced Technology, Massey University, Palmerston North, New Zealand
[email protected]
Abstract— There is currently a considerable tendency to develop Automatic Speech Recognition (ASR) systems that can track and identify human speech in specific local languages, because people prefer to work with computers in their native language. In Sri Lanka, a sizable portion of the population is discouraged from using computers simply because of the language barrier and the difficulty of using conventional interfaces. Consequently, there is great demand for a computer interface that enables communication in Sinhala. This paper presents an approach to recognizing Sinhala speech based on Dynamic Time Warping (DTW) and Mel Frequency Cepstral Coefficients (MFCC). Correct recognition was achieved in several phases, and each phase is described in detail.

Keywords—DTW; MFCC; Hamming window; FFT; DCT

I. INTRODUCTION
In general, there are many forms of human communication: speech, sign language, body language and gestures, textual language, pictorial language, etc. However, speech and hearing have evolved as the main means of communication among human beings. Speech is regarded as the most powerful form of communication because of its rich dimensions: in addition to the spoken text (words), these dimensions carry the gender, attitude, emotion, health condition and identity of the speaker. Such information is very important for effective communication.
In contrast, communication between the computer and the human is done mainly through keyboard- and screen-oriented systems. In the current Sri Lankan context, this restricts the use of computers to the small fraction of the population who are both computer literate and conversant in English. Accordingly, the major barrier between computers and people in Sri Lanka is the non-adaptability to English: English is not the mother tongue of most people (Sinhala is the mother tongue of the majority), and there is a large proportion of under-educated people in rural areas of Sri Lanka.
Meanwhile, speech technologies are emerging as the next-generation user interface for computers [12]. Speech recognition is the recognition of natural speech by a computer: spoken words and phrases are identified and converted into a machine-understandable format. In particular, people prefer to use their voice to interact with the computer because it is very convenient and inexpensive, and the advantages extend in many directions, such as freeing the hands for other tasks. Speech synthesis is the artificial production of human speech and can be seen as the opposite of speech recognition [1, 2, 6]. Further, speech technologies are recognized as a promising biometric technique for authenticating people: speaker identification and speaker verification play major roles in user authentication and verification [5, 7].
II. RESEARCH PROBLEM

One of the major difficulties in speech recognition is that, although different recordings of the same word include more or less the same sounds in the same order, the durations of the sub-words within the word do not match. The situation becomes worse when there are similar-sounding words. Consequently, recognizing words by matching them against reference templates gives inaccurate results if there is no temporal alignment.
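To illustrate the temporal alignment this paper relies on, the following is a minimal sketch of the classic DTW distance between two feature sequences. The Euclidean local distance and the three-way step pattern are common choices assumed here; the paper's exact settings are not given in this excerpt.

import numpy as np

def dtw_distance(ref, test):
    """Minimal dynamic time warping distance between two
    feature sequences (arrays of frames x coefficients)."""
    n, m = len(ref), len(test)
    # Accumulated-cost matrix initialised to infinity; D[0, 0] anchors the path.
    D = np.full((n + 1, m + 1), np.inf)
    D[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = np.linalg.norm(ref[i - 1] - test[j - 1])  # local frame distance
            # Allow match, insertion and deletion steps.
            D[i, j] = cost + min(D[i - 1, j], D[i, j - 1], D[i - 1, j - 1])
    return D[n, m]

In recognition, this distance would be computed between the test utterance's feature sequence and each reference template, and the word of the closest template selected.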
Working with large vocabularies is often complicated. Firstly, a large vocabulary is more likely to contain words that sound like one another than a small vocabulary is. Secondly, a large number of reference patterns must be maintained at considerable cost, and recognition becomes slower since more time is spent searching. Accordingly, increasing the vocabulary of a speech recognition system increases its complexity and decreases its recognition rate. On the other hand, increasing the number of words is not enough if the recognizer is unable to differentiate similar-sounding words.
A. Motivation
In order to extend the benefits of Information Technology to a wider proportion of the population, there is a need for a natural interface beyond the keyboard-and-screen interfaces that are widely used at present. Human speech has been recognized as the most suitable means of overcoming this severe problem, and this was the motivation behind this research.

B. Significance
Currently there is no proper speech recognition approach for the Sinhala language, and research in this field is still at an infant stage in Sri Lanka. Moreover, most systems developed for other languages based on Dynamic Time Warping (DTW) have a limited vocabulary, for instance ten words, whereas this work aimed at a considerably larger vocabulary exceeding a thousand words while maintaining a high recognition rate.

C. Challenges
Comparing a test pattern with the stored references and correctly identifying the word is a central issue in any pattern recognition approach. The same speaker may pronounce the same word differently, and even two utterances of the same word with the same duration might differ in the middle. Generally, variations in speech duration are not spread evenly over the different sounds: consonants vary only slightly in length, whereas vowels vary a great deal over different utterances of the same word.
D. Research Aim and Objectives
In this research, the fitness of the DTW algorithm, a dynamic programming technique, was investigated in conjunction with MFCCs for identifying separately pronounced Sinhala words. Our attempt was to use a considerably large number of frequently used Sinhala words. The main objective was to develop an efficient speech recognizer as the basis for a natural human-machine interface.

III. METHODOLOGY
Feature extraction and feature matching are the two main phases of the speech recognition system. Figure 1 shows the basic structure of the recognizer.
Figure 1. Block diagram of the system: feature extraction, pattern comparison against a vocabulary of reference patterns (a database), a decision rule, and output of the written word.

A. Feature Extraction
Converting the sound wave into a parametric representation is a major part of any speech recognition approach. Here both static and dynamic features of speech were used, because the vocal tract is not completely characterized by static parameters alone: the articulators change their positions continuously during speech generation, and that information is useful for speech recognition. MFCCs along with their first and second time derivatives were used as the feature vector, because they have shown better performance in both speech recognition and speaker recognition than other conventional speech parameterization methods, and the derivatives of the MFCCs better reflect the dynamic changes of the human voice over time [8, 9]. Accordingly, the static information was captured by the MFCC coefficients, while the dynamic information was captured by the delta MFCC and delta-delta MFCC, which are the first and second order time derivatives.
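The exact delta formulation is not given in this excerpt; the following is a minimal sketch of one common regression-based way to compute the delta (and, by composition, delta-delta) coefficients as time derivatives of the MFCC sequence. The window half-width K = 2 is an assumption.

import numpy as np

def delta(features, K=2):
    """Regression-based time derivative of a (frames x coeffs)
    feature matrix; a common delta formulation, not necessarily
    the exact one used in this work."""
    padded = np.pad(features, ((K, K), (0, 0)), mode='edge')
    denom = 2 * sum(k * k for k in range(1, K + 1))
    return sum(k * (padded[K + k:len(features) + K + k]
                    - padded[K - k:len(features) + K - k])
               for k in range(1, K + 1)) / denom

# Delta-delta is simply the delta of the delta coefficients:
# full_features = np.hstack([mfcc, delta(mfcc), delta(delta(mfcc))])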
The feature extraction phase includes two sub-phases, namely pre-processing and MFCC generation, as described below.

1) Pre-processing: Before extracting the relevant features, the voice signal should be pre-processed to eliminate problems that may arise in feature extraction. Pre-processing consists of pre-emphasis, end point detection, framing and windowing.

a) Pre-emphasis: The spectrum of voiced segments has more energy at lower frequencies than at higher frequencies. This is called spectral tilt [10], and it is caused by the nature of the glottal pulse. Due to this effect, the information associated with high frequencies is difficult to include in the feature vector. Pre-emphasis boosts the energy in the high frequencies by passing the signal through a filter, mitigating the above problem. As a consequence, high-frequency energy contributes more information to the acoustic model.
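A minimal sketch of a first-order pre-emphasis filter follows. The coefficient 0.97 is a conventional choice and an assumption here, as the paper does not state the value it used.

import numpy as np

def pre_emphasis(signal, alpha=0.97):
    """First-order high-pass filter y[n] = x[n] - alpha * x[n-1],
    boosting high-frequency energy to counteract spectral tilt."""
    return np.append(signal[0], signal[1:] - alpha * signal[:-1])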
b) End point detection: Detecting the start point and end point of the speech signal is very important in speech processing; it isolates the word utterance from the rest of the signal. Normally the signal contains an unvoiced part as well as a voiced part, and in most cases it carries environmental noise as well as noise generated by equipment such as the microphone. The speaker may also produce sound artifacts, including lip smacks, heavy breathing, and mouth clicks and pops.
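The end point detection algorithm is not specified in this excerpt; a simple short-term-energy thresholding scheme along the following lines is one common approach. The frame length, hop size and relative threshold below are illustrative assumptions, not values from the paper.

import numpy as np

def detect_endpoints(signal, frame_len=400, hop=160, rel_threshold=0.02):
    """Return (start, end) sample indices of the voiced region,
    using short-term energy against a relative threshold."""
    energies = np.array([np.sum(signal[i:i + frame_len] ** 2)
                         for i in range(0, len(signal) - frame_len, hop)])
    threshold = rel_threshold * energies.max()
    active = np.where(energies > threshold)[0]
    if active.size == 0:
        return 0, len(signal)  # nothing above threshold; keep whole signal
    return active[0] * hop, active[-1] * hop + frame_len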
c) Framing and windowing: Speech signals are not stationary (they are referred to as quasi-stationary), but when examined over a sufficiently short period of time (5 ms to 100 ms) their characteristics are fairly stationary; over longer periods the signal characteristics change to reflect the different speech sounds being spoken. Therefore, short-time spectral analysis is the most appropriate way to characterize speech signals [3]. Accordingly, before extracting the relevant features, the entire signal was divided into frames of the same, considerably small, duration. The frames were arranged in an overlapped manner to preserve the information in the signal, and windowing was applied to avoid problems due to truncation of the signal. Here the Hamming window given by equation (1) was employed successfully [3]:

w(n) = 0.54 − 0.46 cos(2πn / (N − 1)),  0 ≤ n ≤ N − 1;
w(n) = 0,  otherwise,    (1)

where N is the window length.
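A minimal sketch of framing with overlap plus Hamming windowing, implementing equation (1). The 25 ms frame length and 10 ms hop are conventional values assumed here, not figures taken from the paper.

import numpy as np

def frame_and_window(signal, sample_rate, frame_ms=25, hop_ms=10):
    """Split the signal into overlapping frames and apply the
    Hamming window of equation (1) to each frame."""
    frame_len = int(sample_rate * frame_ms / 1000)
    hop = int(sample_rate * hop_ms / 1000)
    n = np.arange(frame_len)
    window = 0.54 - 0.46 * np.cos(2 * np.pi * n / (frame_len - 1))  # eq. (1)
    frames = [signal[i:i + frame_len] * window
              for i in range(0, len(signal) - frame_len + 1, hop)]
    return np.array(frames)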
2) MFCC Generation: MFCCs are based on the known variation of the human ear's critical bandwidths with frequency, and the process followed to extract them emulates the way human ears capture and process sounds. The MFCC technique makes use of two types of filters, namely linearly spaced filters and logarithmically spaced filters. To capture the phonetically important characteristics of speech, the signal was expressed on the Mel frequency scale. This scale has linear frequency spacing below 1000 Hz and logarithmic spacing above 1000 Hz. The normal speech waveform may vary from time to time depending on the physical condition of the speaker's vocal cords, but MFCCs are less susceptible to these variations [11]. The relationship between the Mel warping and linear frequency is given by equation (2):

f_mel = 2595 · log10(1 + f_Hz / 700).    (2)
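Equation (2) translates directly into code; a minimal sketch:

import numpy as np

def hz_to_mel(f_hz):
    """Convert linear frequency in Hz to the Mel scale, equation (2)."""
    return 2595.0 * np.log10(1.0 + f_hz / 700.0)

# Example: 1000 Hz maps to roughly 1000 mel on this scale,
# the point at which the spacing turns logarithmic.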
MFCCs are coefficients that collectively make up the Mel-frequency cepstrum (MFC). They are extracted from the speech signal after it has been framed and windowed, and generating them involves a number of steps. First, it is essential to transform the windowed signal from the time domain to the frequency domain, because signal analysis is easier in the frequency domain than in the time domain. Fourier analysis, a technique for extracting the different frequency components of a complex wave such as a speech wave, has been mathematically well established for a long time, but computing the individual frequency components of a wave was challenging until Cooley and Tukey developed the Fast Fourier Transform (FFT), a very powerful numerical analysis method [4]. For each time window of length N, the frequency content was calculated by the FFT as given by the following equation:
X(k) = Σ_{n=0}^{N−1} y(n) · e^{−2πjkn/N},  0 ≤ k ≤ N − 1.

The log energies were then converted to cepstral coefficients with the Discrete Cosine Transform (DCT), as given in equation (5):

c(l) = Σ_{m=1}^{M} Y′(m) · cos(πl(m − 0.5) / M),    (5)

where c(l) is the l-th MFCC for a specific frame and Y′(m) is the logarithm of the square magnitude. Finally, the dynamic information associated with the speech signal was estimated by the delta features described above.
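A minimal sketch of the DCT of equation (5), converting a vector of log energies Y′(m) into cepstral coefficients. The choice of 13 retained coefficients and the function name are illustrative assumptions.

import numpy as np

def log_energies_to_mfcc(log_energies, num_ceps=13):
    """Apply the DCT of equation (5) to M log energies Y'(m),
    returning the first num_ceps cepstral coefficients c(l)."""
    M = len(log_energies)
    m = np.arange(1, M + 1)
    return np.array([np.sum(log_energies *
                            np.cos(np.pi * l * (m - 0.5) / M))
                     for l in range(1, num_ceps + 1)])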