International Journal of Computer Science and Information Security (IJCSIS), Vol. 15, No. 5, May 2017
Voice Recognition System in Bahasa Indonesia for the Control of a Media Player

Deepak Balram and Kuang Yow Lian
Department of Electrical Engineering, National Taipei University of Technology, Taiwan.
[email protected] and [email protected]

Abstract - In this modern era, voice recognition has great importance in the area of artificial intelligence. In this paper, we implement a voice recognition system for the Bahasa Indonesia language by interfacing an e-Box. Bahasa Indonesia is an Austronesian language, considered a branch of Malay, spoken by more than 23 million people in the world, which gives this study great significance. Bahasa Indonesia is infused with highly distinctive accents and vocabularies, and this poses a challenge in implementing a voice recognition system in this language. A new database in Bahasa Indonesia is created as part of this work. Front-end processing is carried out using the Mel Frequency Cepstral Coefficient (MFCC) algorithm, and the dynamic coefficients are included to improve its accuracy. The Gaussian Mixture Model (GMM) approach is followed for modelling the extracted features. The Graphical User Interface (GUI) created is ported to the Embed-Box 4864 (e-Box), which acts as the voice recognition system. A comparative study of accuracy based on gender is also carried out as part of this work.

Keywords: Gaussian Mixture Model, Graphical User Interface, Mel Frequency Cepstral Coefficient, voice recognition system.
I. INTRODUCTION

The importance of voice recognition technology is that it enables humans to communicate with machines by voice, the means of communication most natural and convenient to them. Advancements in voice recognition have made human life easier in various fields. Voice recognition systems have a wide range of applications, and using voice as an interface has gained prominent importance. Voice recognition has advanced to the point where even language is no longer a barrier to communication. Even though English is considered the global language, the majority of people in the world cannot speak English and use their native languages for communication. According to the British Council, only 25% of the people in the world have some understanding of English. The majority of research in voice recognition technology is based on English, so it is relevant to carry out research in other native languages as well, since 75% of the people in the world have limited or no knowledge of English. The Ethnologue catalogue of world languages lists 7,099 known living languages [1]. Our work focuses on the implementation of voice recognition technology in Bahasa Indonesia, the official language of Indonesia. Indonesia is the world's fourth most populous country, with a population of over 260 million. Bahasa Indonesia is a standardized variety of Malay, a major language of the Austronesian family with official status in Singapore, Brunei and Malaysia apart from Indonesia. Even though there are over 700 regional languages in Indonesia, Bahasa Indonesia predominates in business, education, media and administration [2]. It is a language composed of distinctive accents and vocabularies. Bahasa Indonesia has no concept of tenses or plurals and differs considerably from English in grammatical usage.
This language is infused with considerable vagueness and euphemism, so implementing voice recognition technology in Bahasa Indonesia is a challenge. The first stage in the voice recognition process is the extraction of features from the voice samples. Extracting the most relevant features of the voice signal is an important step in implementing a voice recognition system. Pre-processing of the voice samples is carried out as the first step of the feature extraction stage to enhance the accuracy of the feature extraction process. The important features hidden in a voice signal that are required for the classification process are extracted in the feature extraction stage, and the unwanted features are discarded [3]. There are different techniques to extract the features of voice samples, and the Mel Frequency Cepstral Coefficient is one of the most efficient algorithms used for the extraction process. The output of the MFCC process is a matrix of features extracted from every frame of the voice signal. The logarithmic positioning of frequency bands in MFCC makes it an apt method that approximates the human auditory response more closely than other existing systems [4]. The performance of MFCC is affected by many factors, including the type of windowing method adopted, the number of filters used, etc. Figure 1 shows the waveform analysis of the words "mulai", "maju", "mundur", "jedah" and "berhenti" recorded by female speaker 1 as part of the work.
Figure 1: Waveform analysis of the 5 words in Bahasa Indonesia recorded by female speaker 1
The features extracted using MFCC undergo the modelling process in the second stage of voice recognition. Parametric modelling of the extracted features is carried out using GMM [5]. Virtually any real-world distribution can be represented as a mixture of Gaussian distributions. The GMM algorithm is fast, so its computational time is lower than that of other methods such as the Hidden Markov Model (HMM) and Deep Neural Network (DNN) [6]. The computational complexity of the GMM depends on the mixture size used for modelling. The extracted features of every word are modelled when training the system, and the word recognition decision is made during testing using the trained models [7]. The e-Box can be made into a voice recognition system by porting the voice recognition program into it. The e-Box 4864 is a compact PC and can be programmed according to the demands of the application. The miniaturized size of the e-Box is an advantage in real-time applications. The e-Box 4864 is insensitive to temperature and its power consumption is very low; as it uses only 20 W of power, it can be considered a green PC of the future. The e-Box 4864 runs on a virtual operating system. The Graphical User Interface created using Matlab can be ported to the e-Box by following certain steps. In the final stage, the efficiency of the voice recognition system is analyzed and noted. Figure 2 shows the design of the proposed voice recognition system.

Figure 2: Voice recognition system (feature extraction and parametric modelling in the training stage; the testing stage yields the final decision, with the system running on the e-Box)
The rest of the paper is organized as follows: the system overview is described in Section II, followed by results and discussions in Section III. Section IV concludes this paper.

II. SYSTEM OVERVIEW

A. Feature Extraction Stage

MFCC is used as the feature extraction algorithm in this paper due to its high efficiency and lower complexity compared with other algorithms. In MFCC, the voice sample is divided into small frames of a few milliseconds each, and the features in each frame of the voice signal are extracted [8]. Figure 3 represents the various steps involved in the MFCC-based feature extraction stage [9][10].
Figure 3: Block diagram of the feature extraction stage (speech input, pre-emphasis, framing, windowing, FFT, Mel filter bank, DCT, delta energy)

After pre-emphasis is carried out to enhance the input speech signal, the pre-emphasized speech signal is divided into shorter frames. The characteristics of a speech signal remain stationary only for a short period of time, so processing should be carried out over short intervals; it is for this reason that framing is performed. There may be an optional overlap between every frame and its previous frame during the framing process. In the windowing step, the discontinuities at the start and end of the frames are eliminated. A Hamming window is used as the windowing function in this work [11]:

W(n) = 0.54 - 0.46 cos(2*pi*n/(N-1)),  0 <= n <= N-1    (1)
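To make the framing and windowing steps concrete, the following is a minimal NumPy sketch; the 25 ms/10 ms frame geometry and the 0.97 pre-emphasis coefficient are assumed typical values, not parameters stated in the paper.

```python
import numpy as np

def frame_and_window(signal, frame_len=400, hop=160):
    """Pre-emphasize, frame, and Hamming-window a speech signal.

    frame_len=400 and hop=160 correspond to 25 ms frames with a 10 ms
    shift at 16 kHz sampling (assumed values).
    """
    # Pre-emphasis (assumed coefficient 0.97): y[n] = x[n] - 0.97*x[n-1]
    emphasized = np.append(signal[0], signal[1:] - 0.97 * signal[:-1])

    n_frames = 1 + (len(emphasized) - frame_len) // hop
    # Hamming window, Eq. (1): w(n) = 0.54 - 0.46*cos(2*pi*n/(N-1))
    window = np.hamming(frame_len)

    return np.stack([
        emphasized[i * hop : i * hop + frame_len] * window
        for i in range(n_frames)
    ])  # shape: (n_frames, frame_len)
```

Each row of the returned matrix is one windowed frame, ready for the FFT and mel filtering steps that follow.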
In equation (1), W(n) denotes the windowing function and N represents the number of samples in each frame. Mel frequency warping is carried out in the frequency domain, and in order to convert the time-domain signal to the frequency domain, the Fast Fourier Transform (FFT) is applied. In the Mel filter bank stage, the frequency spectrum is mapped onto the mel scale. Below 1000 Hz the mel scale is linear, and above 1000 Hz it is logarithmic in nature [12]. The mel scale conversion is done using the formula:

Mel(f) = 2595*log10(1 + f/700)    (2)

where Mel(f) represents the mel frequency and f denotes the frequency in Hz. For the conversion of the log mel spectrum back to the time domain, we carry out the Discrete Cosine Transform (DCT). The output obtained after applying the DCT to the log mel spectrum is the set of Mel Frequency Cepstral Coefficients:

C_m = sum_{k=1}^{K} log(S_k) cos[m*pi*(k - 1/2)/K],  m = 0, 1, ..., K-1    (3)

In equation (3), C_m denotes the m-th MFCC, S_k is the output of the k-th mel filter, and K is the number of filters. The dynamic coefficients are considered at the final stage of feature extraction to improve its accuracy. The delta coefficients are computed based on the equation:

d_t = sum_{n=1}^{N} n*(c_{t+n} - c_{t-n}) / (2 * sum_{n=1}^{N} n^2)    (4)

where d_t denotes the delta coefficient at time t and the static coefficients are represented by c_{t+n} and c_{t-n}.
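The mel conversion of equation (2) and the delta computation of equation (4) can be illustrated with a short NumPy sketch; the regression window N = 2 is an assumed typical value, as the paper does not specify it.

```python
import numpy as np

def hz_to_mel(f):
    """Eq. (2): Mel(f) = 2595 * log10(1 + f/700)."""
    return 2595.0 * np.log10(1.0 + f / 700.0)

def delta(cepstra, N=2):
    """Eq. (4)-style delta coefficients: for each frame t,
    d_t = sum_{n=1..N} n*(c_{t+n} - c_{t-n}) / (2 * sum_{n=1..N} n^2).

    cepstra: (n_frames, n_coeffs) array of static MFCCs.
    """
    denom = 2 * sum(n * n for n in range(1, N + 1))
    # Repeat the edge frames so deltas are defined at the boundaries.
    padded = np.pad(cepstra, ((N, N), (0, 0)), mode="edge")
    d = np.zeros_like(cepstra, dtype=float)
    for t in range(cepstra.shape[0]):
        d[t] = sum(
            n * (padded[t + N + n] - padded[t + N - n])
            for n in range(1, N + 1)
        ) / denom
    return d
```

Note that hz_to_mel(1000) is approximately 1000, reflecting the near-linear behaviour of the mel scale below 1000 Hz.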
B. Parametric Modelling

A GMM consists of a finite number of mixtures which represent a word based on mean, weight and covariance matrix [13]. The mixture size determines the computational complexity involved in the GMM. The GMM represents the Probability Density Function (PDF) of a random variable, and the Gaussian mixture density is denoted by the equation [14]:

p(x | lambda) = sum_{i=1}^{M} w_i g_i(x)    (5)

In equation (5), the mixture weights are represented by w_i, i = 1, ..., M, and the component densities by g_i(x), i = 1, ..., M. A Gaussian mixture model can be denoted as a collective representation of the estimated parameters, including mean, weight and covariance:

lambda = {w_i, mu_i, Sigma_i},  i = 1, ..., M    (6)

The estimation of the different parameters of the GMM is carried out iteratively using the Expectation-Maximization (EM) algorithm. The mean, variance and mixture weight, which are the estimated parameters, are denoted by the equations given below [15]:

Mean: mu_i = sum_{t=1}^{T} Pr(i | x_t) x_t / sum_{t=1}^{T} Pr(i | x_t)    (7)

Variance: sigma_i^2 = sum_{t=1}^{T} Pr(i | x_t) x_t^2 / sum_{t=1}^{T} Pr(i | x_t) - mu_i^2    (8)

Mixture weight: w_i = (1/T) sum_{t=1}^{T} Pr(i | x_t)    (9)

In equations (7)-(9), x_t, t = 1, ..., T, denotes the feature vectors and Pr(i | x_t) is the a posteriori probability of mixture i given x_t. The
word classification is done based on the estimated maximum log-likelihood value.

III. RESULTS AND DISCUSSIONS

A new database is created in the Bahasa Indonesia language by recording the voices of 8 native speakers of Bahasa Indonesia: 4 male and 4 female speakers aged 21-30 years are selected for the experiment. The application of this voice recognition system is to control a media player by voice. Most media players include mechanisms to control functionalities including play, stop, pause, forward and backward, so we have created a database of 5 words: "mulai", "maju", "mundur", "jedah" and "berhenti". The English translation of "mulai" is start, "maju" is forward, "mundur" is backward, "jedah" is pause and "berhenti" is stop. Figure 4 shows the plot of the spectrum of frequencies for the 5 words recorded by female speaker 1.
Figure 4: Spectrogram analysis of the 5 words recorded by female speaker 1.

The number of syllables for the words "mulai", "maju", "mundur" and "jedah" is 2, and for "berhenti" it is 3. The time duration of the word "mulai" is around 0.78 s, "maju" 0.64 s, "mundur" 0.79 s, "jedah" 0.73 s and "berhenti" 1.05 s.
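The per-word GMM training via EM (equations (7)-(9)) and the maximum log-likelihood classification used to produce the results that follow can be sketched in NumPy. This is a minimal diagonal-covariance illustration; the mixture size, iteration count and initialization are assumed values, and the paper's own implementation is built in Matlab.

```python
import numpy as np

def fit_gmm(X, M=4, n_iter=50, seed=0):
    """Fit a diagonal-covariance GMM with M mixtures by EM (Eqs. (7)-(9)).

    X: (T, D) array of MFCC feature vectors for one word.
    M=4 and n_iter=50 are assumed values; the paper does not state them.
    Returns (weights, means, variances).
    """
    rng = np.random.default_rng(seed)
    T, D = X.shape
    means = X[rng.choice(T, M, replace=False)]        # initialize from data
    var = np.tile(X.var(axis=0) + 1e-6, (M, 1))
    w = np.full(M, 1.0 / M)
    for _ in range(n_iter):
        # E-step: a posteriori probabilities Pr(i | x_t)
        log_p = (np.log(w)
                 - 0.5 * np.log(2 * np.pi * var).sum(axis=1)
                 - 0.5 * (((X[:, None, :] - means) ** 2) / var).sum(axis=2))
        log_p -= log_p.max(axis=1, keepdims=True)
        resp = np.exp(log_p)
        resp /= resp.sum(axis=1, keepdims=True)
        Ni = resp.sum(axis=0) + 1e-12
        # M-step: Eq. (7) mean, Eq. (8) variance, Eq. (9) mixture weight
        means = resp.T @ X / Ni[:, None]
        var = resp.T @ (X ** 2) / Ni[:, None] - means ** 2 + 1e-6
        w = Ni / T
    return w, means, var

def log_likelihood(X, model):
    """Total log-likelihood of the frames X under one word's GMM."""
    w, means, var = model
    log_p = (np.log(w)
             - 0.5 * np.log(2 * np.pi * var).sum(axis=1)
             - 0.5 * (((X[:, None, :] - means) ** 2) / var).sum(axis=2))
    m = log_p.max(axis=1, keepdims=True)
    return float((m[:, 0] + np.log(np.exp(log_p - m).sum(axis=1))).sum())

def classify(models, X):
    """Pick the word whose model yields the maximum log-likelihood."""
    return max(models, key=lambda word: log_likelihood(X, models[word]))
```

With one model trained per word, an utterance's MFCC frames are scored against all five word models and the highest-scoring word is returned as the decision.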
Figure 5: Bar plot of MFCC for the word "berhenti".

Figure 5 shows the bar plot of the MFCCs computed for female speaker 1 uttering the word "berhenti". The input voice is fed to the e-Box through the "Mic in" port using a microphone. The Graphical User Interface (GUI) is created from the voice recognition code using the GUIDE tool and converted to an .exe file using the deploy tool. The application component and the e-Box component are created, the two are combined using the target analyzer, and an image is built. Using this method, the GUI is ported to the e-Box successfully. Table 1 shows the percentage accuracy of the voice recognition system in the Bahasa Indonesia language.

Table 1: Percentage accuracy of the voice recognition system

Word       Female (%)   Male (%)
mulai      86.66        84.99
maju       91.66        89.99
mundur     88.32        83.33
jedah      93.33        91.66
berhenti   94.99        93.33
Table 1 depicts the accuracy for all five Bahasa Indonesia words used in the work. The average accuracy of the voice recognition system is 89.82%. The word "berhenti", having the longest time duration (1.05 s) and the most syllables (3) of the five words, records the highest percentage accuracy of 94.16. The word "maju" has a shorter time duration (0.64 s) than the word "mulai" (0.78 s), yet the calculation shows that the percentage accuracy of "maju" is better than that of "mulai". So even though the time duration of the samples affects the percentage accuracy of the voice recognition system, a small difference in the time duration of two words has negligible or no effect on percentage accuracy.
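The per-word and per-gender averages quoted in this section follow directly from Table 1; a small hypothetical script verifies the arithmetic:

```python
# Accuracy values (%) from Table 1 as (female, male) pairs per word.
table = {
    "mulai": (86.66, 84.99),
    "maju": (91.66, 89.99),
    "mundur": (88.32, 83.33),
    "jedah": (93.33, 91.66),
    "berhenti": (94.99, 93.33),
}

per_word = {w: (f + m) / 2 for w, (f, m) in table.items()}
female_avg = sum(f for f, _ in table.values()) / len(table)
male_avg = sum(m for _, m in table.values()) / len(table)
overall = sum(per_word.values()) / len(per_word)

print(round(per_word["berhenti"], 2))  # 94.16, the highest per-word accuracy
print(round(male_avg, 2))              # 88.66, matching the reported male average
print(round(overall, 2))               # close to the reported overall 89.82
```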
Figure 6: Efficiency chart of the voice recognition system (percentage accuracy per word: mulai 85.82, maju 90.82, mundur 85.82, jedah 92.49, berhenti 94.16)

Figure 6 shows the efficiency chart of the voice recognition system. The chart gives a clear comparison of the accuracy of all five Bahasa Indonesia words used in the voice recognition system. The word "berhenti" has the highest accuracy, and the words "mulai" and "mundur" have the lowest. Figure 7 shows the graphical representation of accuracy based on gender. Analyzing the graph, we find that gender has an effect on the accuracy of the system: female speakers achieve better accuracy than male speakers in this Bahasa Indonesia voice recognition system. The average accuracy of all the female speakers is 90.9% and that of the male speakers is 88.66%.
Figure 7: Graphical representation of accuracy based on gender
IV. CONCLUSIONS

The distinctive accents and vocabularies of Bahasa Indonesia pose a challenge to the implementation of a voice recognition system in the language. Eight native speakers of Bahasa Indonesia are selected for this work, and their accents differ even when recording the same words; this extent of variation in accent is not seen in English spoken by speakers from the same country. When the difference in the time duration of words is very small, it has negligible or no effect on the accuracy of the voice recognition system in Bahasa Indonesia. The average accuracy of the female speakers is higher than that of the male speakers for all five words, so gender influences the accuracy of this voice recognition system based on Bahasa Indonesia. The application of this voice recognition system can be further extended to other fields such as robotic control.
ACKNOWLEDGMENT

This work is supported by the Ministry of Science and Technology, ROC, under Grant 105-2221-E-027-061.

REFERENCES

[1] Simons, Gary F. and Charles D. Fennig (eds.), Ethnologue: Languages of the World, Twentieth edition. Dallas, Texas: SIL International, 2017. Online version: http://www.ethnologue.com.
[2] M. A. Ayu and T. Mantoro, "An Example-Based Machine Translation approach for Bahasa Indonesia to English: An experiment using MOSES," 2011 IEEE Symposium on Industrial Electronics and Applications, Langkawi, 2011, pp. 570-573.
[3] K. Saeed and M. K. Nammous, "A speech-and-speaker identification system: Feature extraction, description, and classification of speech-signal image," IEEE Transactions on Industrial Electronics, vol. 54, no. 2, pp. 887-897, 2007.
[4] J. R. Deller Jr., J. H. L. Hansen and J. G. Proakis, Discrete-Time Processing of Speech Signals, 2nd edition, IEEE Press, New York, 2000.
[5] Z. Weng, L. Li and D. Guo, "Speaker recognition using weighted dynamic MFCC based on GMM," 2010 International Conference on Anti-Counterfeiting, Security and Identification, Chengdu, 2010, pp. 285-288.
[6] T. Marwala, U. Mahola and F. V. Nelwamondo, "Hidden Markov Models and Gaussian Mixture Models for Bearing Fault Detection Using Fractals," 2006 IEEE International Joint Conference on Neural Networks, Vancouver, BC, 2006, pp. 3237-3242.
[7] R. Lippmann, E. Martin and D. Paul, "Multi-style training for robust isolated-word speech recognition," IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP '87), 1987, pp. 705-708.
[8] K. S. R. Murty and B. Yegnanarayana, "Combining evidence from residual phase and MFCC features for speaker recognition," IEEE Signal Processing Letters, vol. 13, no. 1, pp. 52-55, Jan. 2006.
[9] R. Togneri and D. Pullella, "An Overview of Speaker Identification: Accuracy and Robustness Issues," IEEE Circuits and Systems Magazine, vol. 11, no. 2, pp. 23-61, 2011.
[10] R. Vergin, D. O'Shaughnessy and A. Farhat, "Generalized mel frequency cepstral coefficients for large-vocabulary speaker-independent continuous-speech recognition," IEEE Transactions on Speech and Audio Processing, vol. 7, no. 5, pp. 525-532, Sep. 1999.
[11] V. D. Ambeth Kumar, T. Vijaya Rajasekar, S. Malathi, P. Athiyaman and V. Kamalakannan, "Gender Identification Using a Speech Dependent Voice Frequency Estimation System," International Journal of Computer Science and Information Security (IJCSIS), vol. 14, no. 11, November 2016.
[12] M. A. Hossan, S. Memon and M. A. Gregory, "A novel approach for MFCC feature extraction," 2010 4th International Conference on Signal Processing and Communication Systems, Gold Coast, QLD, 2010, pp. 1-5.
[13] B. Xiang and T. Berger, "Efficient text-independent speaker verification with structural Gaussian mixture models and neural network," IEEE Transactions on Speech and Audio Processing, vol. 11, no. 5, pp. 447-456, Sept. 2003.
[14] B. L. Pellom and J. H. L. Hansen, "An efficient scoring algorithm for Gaussian mixture model based speaker identification," IEEE Signal Processing Letters, vol. 5, no. 11, pp. 281-284, Nov. 1998.
[15] D. A. Reynolds and R. C. Rose, "Robust text-independent speaker identification using Gaussian mixture speaker models," IEEE Transactions on Speech and Audio Processing, vol. 3, no. 1, pp. 72-83, Jan. 1995.