Comparison of Voice Features for Arabic Speech Recognition

Mansour Alsulaiman, Ghulam Muhammad, Zulfiqar Ali
Speech Processing Group, College of Computer and Information Sciences, King Saud University, PO Box 51178, Riyadh 11543, Saudi Arabia
Email: {msuliman, ghulam, zuali}@ksu.edu.sa

Abstract— The selection of speech features for speech recognition has been investigated for languages other than Arabic. The Arabic language has its own characteristics, so some speech features may be better suited to Arabic speech recognition than others. In this paper, several feature extraction techniques are explored to find the features that give the highest recognition rate. Our investigation shows that Mel-Frequency Cepstral Coefficients (MFCC) give the best result. We also use an operator well known in the image processing field to modify the way MFCC are calculated; this results in a new feature that we call LBPCC. We propose the way this operator is applied and conduct experiments to test the proposed feature.

Keywords- MFCC; LPC; LBPCC; ANN; HMM; Arabic speech recognition

I. INTRODUCTION

Speech recognition is an important tool in human-computer interaction, especially in this information age. Speech recognition systems consist mainly of two modules: the recognition engine module and the speech feature extraction module. Research in speech recognition has advanced in recent years, especially with new recognition tools such as ANN [1], HMM [2], and Support Vector Machines (SVM) [3]. Most of this research is on the English language, and little has been done on Arabic speech. Arabic speech has some distinct features that are not found in other languages. Hence, this research tackles an important subject in Arabic speech recognition: finding the speech feature best suited to it.

Automatic speech recognition is becoming more feasible these days due to advances in speech feature extraction methods, advances in recognition machines, and the availability of fast processing power. The recognition engine in a speech recognition system is a general recognition tool that can be applied to speech or to any other signal or sequence, so there may be no difference in using it for any language. The speech features, however, may depend on the language of the speech.

In the first part of this paper we concentrate on comparing the following speech features: Mel-Frequency Cepstral Coefficients (MFCC), Linear Prediction Coefficients (LPC), and some other features, with the goal of achieving better Arabic speech recognition. These features are evaluated alone or complemented with other components. The components used for complementing are energy (E), zero mean static coefficients (Z), the first derivative (D), and the second derivative (A). The speech features, alone and with different combinations of components, are used to find the best speech feature for Arabic speech recognition. We found MFCC to be the best speech feature. In this part, an Artificial Neural Network (ANN) was the recognition engine.

In the second part of the paper we propose a method that modifies the way the speech features are calculated in order to improve the result of MFCC. The method did not improve the result, but the investigation showed that the concept is promising; further research is needed, where other windows may give better results. The Hidden Markov Model (HMM) was the recognition engine used in the second part.

The rest of the paper is organized as follows: section 2 describes the speech recognition system and its components, the speech recognition engine and speech feature extraction; section 3 provides the results of digit recognition with MFCC, LPC, and other features; section 4 gives the recognition rates of MFCC and LBPCC for consonant recognition; and finally section 5 draws some conclusions.

II. ARABIC SPEECH RECOGNITION

A. Speech Recognition Engine

Speech recognition systems generally assume that the speech signal is a realization of some message encoded as a sequence of one or more symbols. To recognize the underlying symbol sequence given a spoken utterance, the continuous speech waveform is first converted to a sequence of equally spaced discrete parameter vectors. This sequence of parameter vectors is assumed to form an exact representation of the speech waveform for the duration covered by a single vector (typically 20 ms or so), on the basis that the speech waveform can be regarded as stationary over this period. Although this is not strictly true, it is a reasonable approximation. In this work we used ANN and HMM as the recognition engines; we describe them briefly below.

1) Artificial Neural Network: ANN aims to imitate the way the human brain works. It consists of neurons connected by weights, and the network learns by adjusting these weights. ANNs can be classified in many ways depending on their configuration and mode of operation; the main classification is into supervised and unsupervised networks. In a supervised ANN the network is trained by giving it the input and the corresponding output. In an unsupervised ANN the network is only given the input samples, and it adjusts its weights so that it responds similarly to similar inputs. The ANN that we use is the back-propagation network, the most widely used form of supervised neural network. It consists of an input layer, an output layer, and one or more hidden layers, as shown in Figure 1.
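As an illustration of the back-propagation training just described, the following minimal sketch implements one weight update for a network with one hidden layer of sigmoid units. The layer sizes, learning rate, and library (numpy) are our own choices for the example, not settings taken from the paper.

    import numpy as np

    # Minimal back-propagation sketch: one hidden layer of sigmoid
    # units, trained by propagating the output error back through
    # the weights. Sizes and learning rate are illustrative only.
    rng = np.random.default_rng(0)

    def sigmoid(x):
        return 1.0 / (1.0 + np.exp(-x))

    def train_step(x, target, w1, w2, lr=0.1):
        h = sigmoid(x @ w1)                     # input -> hidden
        y = sigmoid(h @ w2)                     # hidden -> output
        delta_out = (y - target) * y * (1 - y)  # output-layer error
        delta_hid = (delta_out @ w2.T) * h * (1 - h)
        w2 -= lr * np.outer(h, delta_out)       # adjust the weights
        w1 -= lr * np.outer(x, delta_hid)
        return w1, w2

    # Example: 4 inputs, 3 hidden neurons, 2 outputs.
    w1 = rng.normal(scale=0.1, size=(4, 3))
    w2 = rng.normal(scale=0.1, size=(3, 2))
    w1, w2 = train_step(np.ones(4), np.array([1.0, 0.0]), w1, w2)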

2) Hidden Markov Model: HMM is based on a mathematical model known as a Markov chain (Markov model) [4]. A Markov model is similar to a finite-state automaton in which each edge is labeled with a probability; these probabilities govern the transitions from one state to another, which occur at discrete time intervals. The Markov model is used only in limited problems because it requires observable states, and in most problems the state sequence is not observable. For such problems the more general model, the Hidden Markov Model, is used. In an HMM the state sequence is hidden (not observable) and the observations are probabilistic functions of the states. It is a doubly stochastic process, since both the emission of events and the transitions between states are probabilistic [5]. HMM is a well-known modeling technique for speech recognition and text-dependent speaker identification [6].
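To make the doubly stochastic structure concrete, the following sketch computes the likelihood of an observation sequence under a small discrete HMM with the forward algorithm; all probabilities here are invented for illustration.

    import numpy as np

    # A two-state discrete HMM: A holds transition probabilities,
    # B emission probabilities, pi the initial distribution. The
    # forward algorithm sums over all hidden state sequences.
    A = np.array([[0.7, 0.3],
                  [0.4, 0.6]])
    B = np.array([[0.9, 0.1],
                  [0.2, 0.8]])
    pi = np.array([0.5, 0.5])

    def forward_likelihood(obs):
        alpha = pi * B[:, obs[0]]          # initialize with first symbol
        for o in obs[1:]:
            alpha = (alpha @ A) * B[:, o]  # propagate, absorb emission
        return alpha.sum()                 # P(observations | model)

    print(forward_likelihood([0, 1, 1, 0]))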

Figure 1. A Two-Layer Neural Network

B. Speech Features

The first step in speech feature extraction is to divide the speech waveform into many overlapping windows whose length is around 20 msec; the overlap period is normally 10 msec. Each window is multiplied by a smoothing window before its features are extracted. The most widely used smoothing windows are the Hamming, Hanning, and triangular windows; we used the Hamming window in our work. The most widely used speech features in speech processing are LPC and MFCC, so we explain them briefly below.
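A minimal sketch of this framing step, assuming a 16 kHz sampling rate (the paper does not state one): 20 ms windows, a 10 ms shift, and a Hamming window applied to each frame.

    import numpy as np

    # Frame a waveform into overlapping windows and apply a Hamming
    # window to each frame, as described above. The 16 kHz sampling
    # rate is an assumption; the paper does not state one.
    def frame_signal(signal, fs=16000, win_ms=20, shift_ms=10):
        win = int(fs * win_ms / 1000)        # samples per window
        shift = int(fs * shift_ms / 1000)    # shift between windows
        n_frames = 1 + (len(signal) - win) // shift
        frames = np.stack([signal[i * shift : i * shift + win]
                           for i in range(n_frames)])
        return frames * np.hamming(win)      # smooth the frame edges

    frames = frame_signal(np.random.randn(16000))  # 1 s of dummy speech
    print(frames.shape)                            # (99, 320)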

(1) Linear Prediction Coefficients (LPC): Vocal tract properties can be modeled by an all-pole model with the help of LPC features. These features represent the main vocal tract resonance properties in the acoustic spectrum. Each speaker has his own formant structure, which is the major difference between speakers; LPC highlights these formant structures and thus helps differentiate between speakers [7]. LPC are independent of pitch and intensity [8]. Atal showed the usefulness of LPC in speaker identification tasks [9]. The modern LPC-based feature extractor consists of five major sections [10]: pre-emphasis, frame blocking, windowing, autocorrelation analysis, and LPC computation.

(2) Mel-Frequency Cepstral Coefficients (MFCC): The Mel-cepstrum makes use of a principle of the auditory system, namely that it has higher discriminating power at lower frequencies than at higher frequencies. Cepstral coefficients are the most widely used features in speaker recognition for many reasons, the most important being that they represent vocal tract changes well, that they can contend with convolutional channel distortion, and that they are robust against noise [11]. The major components of MFCC extraction are [10]: frame blocking, windowing, fast Fourier transformation, Mel-frequency filtering, and discrete cosine transformation, as shown in Figure 2.

Figure 2. Block Diagram for MFCC Extractor

Besides MFCC and LPC, other speech features are used in speech processing. In our work we compared MFCC, LPC, and some other well-known speech features [12], [13]. The speech features that we used are listed below with the abbreviations used in [13]:

• Linear Prediction Coefficients (LPC)
• LPC Cepstral Coefficients (LPCEPSTRA)
• Linear Prediction Reflection Coefficients (LPREFC)
• Mel-Frequency Cepstral Coefficients (MFCC)
• Linear Mel-filter Bank Channel Outputs (MELSPEC)

The above features were used alone or complemented by one or more other components: energy (E), zero mean static coefficients (Z), the first derivative (D), and the second derivative (A). When we add one complement we code this as Feature_complement; for example, when we add energy we get Feature_E. Similarly, when we add delta coefficients, acceleration coefficients, or zero mean static coefficients we get Feature_D, Feature_A, and Feature_Z, respectively. We may also add more than one complement; for example, when we add delta and double delta we get Feature_D_A [13].
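These qualifiers follow the HTK convention [13], where the base feature kind and its complements are selected in the feature-extraction configuration. A hypothetical HTK-style configuration for MFCC_E_D_A might look as follows; the paper does not give its configuration, so the values below are only an example.

    TARGETKIND = MFCC_E_D_A   # base feature plus energy, delta, acceleration
    TARGETRATE = 100000.0     # 10 ms frame shift, in units of 100 ns
    WINDOWSIZE = 200000.0     # 20 ms analysis window
    USEHAMMING = T            # apply a Hamming window to each frame
    NUMCHANS   = 24           # Mel filter bank channels
    NUMCEPS    = 12           # cepstral coefficients per frame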

III. DIGIT RECOGNITION WITH MFCC, LPC, AND OTHER FEATURES

A. The Data Base

When we did our research there was no available database of Arabic speech with many samples of each Arabic digit, so we had to build our own. This database was beneficial for this research and for many projects that followed by the authors and other researchers.

The database we built consists of the 10 Arabic digits spoken by 15 male speakers, with each speaker uttering 10 samples of each digit. The recording was done in a normal laboratory environment. A speaker spoke one or more digits in one vocalization, and we later segmented the speech into individual digits.

B. The Experiments

The neural network consisted of 480 neurons in the input layer, 12 neurons in the hidden layer, and 10 neurons in the output layer. We used 12 coefficients per window, the window overlap was 10 msec, and the maximum digit length was 410 msec. Since each speaker provided 10 samples of each of the 10 digits, the experiments used a total of 1400 files: the data of 11 speakers (11 x 10 x 10 = 1100 files) for training the network, and the data of 3 speakers (3 x 10 x 10 = 300 files) for testing it. We used the speech features alone and combined with other components, as explained in section II.B.
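The 480 inputs are consistent with stacking 12 coefficients from about 40 windows (410 ms at a 10 ms shift). A minimal sketch of such a 480-12-10 network using scikit-learn is shown below; the library, solver settings, and random stand-in data are our assumptions, since the paper does not state the implementation.

    import numpy as np
    from sklearn.neural_network import MLPClassifier

    # Hypothetical reconstruction of the 480-12-10 topology: 480 inputs
    # (12 coefficients x 40 windows), 12 hidden neurons, 10 digit
    # classes. Random arrays stand in for the real feature vectors.
    X_train = np.random.randn(1100, 480)      # one row per training file
    y_train = np.random.randint(0, 10, 1100)  # digit labels 0..9

    net = MLPClassifier(hidden_layer_sizes=(12,),  # one hidden layer
                        activation="logistic",     # sigmoid units
                        max_iter=500)
    net.fit(X_train, y_train)
    print(net.predict(np.random.randn(300, 480)))  # classify test files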

(1) Features only: We tested the system using the speech features without any additions. The results are listed in Table 1 below. In this part, and in all other parts of section III, we repeated each experiment more than once to verify the conclusion; the numbers in the tables are therefore averages over all trials.

TABLE 1: RECOGNITION RESULTS FOR THE BASIC FEATURES

Features      Recognition Rate (%)
LPC           83.60
LPREFC        83.00
LPCEPSTRA     84.20
MFCC          84.30
MELSPEC       84.67

(2) Features with one added complement: We added delta, energy, and zero mean static coefficients separately. The results when adding energy or zero mean were not good, so we only give the results of adding delta, in Table 2.

TABLE 2: RECOGNITION RESULTS FOR THE BASIC FEATURES WITH ONE SUB FEATURE ADDED

Features      Features only (%)   Feature_D (%)
LPC           83.60               80.78
LPREFC        83.00               85.60
LPCEPSTRA     84.20               86.00
MFCC          84.30               85.40
MELSPEC       84.67               63.98

(3) Features with two added complements: We added delta and double delta. The results are given in Table 3.

TABLE 3: RECOGNITION RESULTS FOR THE BASIC FEATURES WITH TWO SUB FEATURES ADDED

Features      Features only (%)   Feature_D_A (%)
LPC           83.60               76.88
LPREFC        83.00               78.00
LPCEPSTRA     84.20               85.00
MFCC          84.30               83.72
MELSPEC       84.67               72.83

(4) Features with four added sub features: We added energy, delta, double delta, and zero mean static coefficients, obtaining Feature_E_D_A_Z. The results are listed in Table 4.

TABLE 4: RECOGNITION RESULTS FOR THE BASIC FEATURES WITH FOUR SUB FEATURES ADDED

Features      Features only (%)   Feature_E_D_A_Z (%)
LPC           83.60               29.10
LPREFC        83.00               79.07
LPCEPSTRA     84.20               31.45
MFCC          84.30               86.71
MELSPEC       84.67               60.80

(5) Effect of Noise: To check the robustness of the features, we added noise to the original waveforms (using HTK) and tested some of the above methods. The results indicated that LPCEPSTRA coefficients tolerated noise better than the other LPC coefficients.

C. DISCUSSION

From the above results we conclude that the selected features have comparable performance when used alone. When combined with complementing components, MFCC was the best overall: its performance improved a little or degraded only a little, whereas the other features performed poorly with some of the complementing components. In the configuration Feature_E_D_A_Z, MFCC gave the best result of all the experiments. We also saw that, as expected, cepstral coefficients tolerated noise better than the other features. Many other experiments were done with one, two, and three complements; we do not report them here because they could not be run for all five of the features that we investigated.

IV. CONSONANT RECOGNITION WITH MFCC AND LBPCC

We now investigate modifying the way the MFCC are calculated by using an operator from the image processing field, the Local Binary Pattern (LBP) operator.

A. Local Binary Pattern Cepstral Coefficients (LBPCC)

The local binary pattern (LBP) operator [14] has been widely used in various applications. To extract the LBPCC features, the LBP operator is applied to the logarithm of the Mel-weighted spectrum of the speech, where the Mel-filter bank filters the input power spectrum through a bank of 24 Mel filters. The major components of the LBPCC feature extractor are shown in Figure 3. The LBP operator assigns a new value to each entry of the matrix by thresholding the 3x3 neighbourhood of each entry against the center value and treating the result as a binary number: the binary number is formed from all the neighbours, starting at the cell to the left of the center and moving counterclockwise, and is then converted to a decimal number [15]. For example, a 3x3 window is shaded in Figure 4(a) and its values are shown in Figure 4(b). A new array is constructed in which each cell is 1 if its value is greater than the center value and 0 if it is less, as shown in Figure 4(c). A binary number is then generated from this array, starting from the second value of the first column, converted to a decimal number as in Figure 4(d), and used to replace the central value of the 3x3 window. In this way a 3x3 window is applied to each value of the Mel-weighted power spectrum, and the resulting decimal number replaces the original value of the spectrum. In the 3x3 window, horizontal values represent change in time, vertical values represent change in frequency, and diagonal values represent change in both time and frequency; hence the generated LBPCC can be viewed as a summarization of all these changes.
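The following sketch applies the basic 3x3 LBP operator to a matrix such as the log Mel-weighted spectrum and reproduces the worked example of Figure 4. The neighbour ordering (starting at the cell left of the center and moving counterclockwise) follows the description above; the implementation itself is ours.

    import numpy as np

    # 3x3 LBP over a (frequency x time) matrix: each interior cell is
    # replaced by the decimal value of the 8-bit pattern obtained by
    # thresholding its neighbours against it, starting at the neighbour
    # left of the center and moving counterclockwise.
    OFFSETS = [(0, -1), (1, -1), (1, 0), (1, 1),
               (0, 1), (-1, 1), (-1, 0), (-1, -1)]

    def lbp(matrix):
        out = matrix.copy()
        rows, cols = matrix.shape
        for i in range(1, rows - 1):
            for j in range(1, cols - 1):
                bits = "".join(str(int(matrix[i + di, j + dj] > matrix[i, j]))
                               for di, dj in OFFSETS)
                out[i, j] = int(bits, 2)   # binary pattern -> decimal
        return out

    # The worked example of Figure 4: pattern 10011010 -> 154.
    window = np.array([[-1.6, 1.56, 0.42],
                       [1.23, 0.90, 0.96],
                       [-1.8, -0.5, 1.78]])
    print(lbp(window)[1, 1])  # 154.0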

Figure 3. Block Diagram for LBPCC Extractor

Figure 4. The Local Binary Pattern operator: (a) a 3x3 window shaded on the Mel-weighted power spectrum; (b) its values [[-1.6, 1.56, 0.42], [1.23, 0.90, 0.96], [-1.8, -0.5, 1.78]]; (c) the thresholded pattern [[0, 1, 0], [1, , 1], [0, 0, 1]]; (d) binary 10011010 = decimal 154.

B. The Database

To evaluate the LBPCC-based speech recognition system, we used Arabic consonants pronounced by 44 male speakers. Each speaker recorded 10 utterances of each consonant; seven utterances of each consonant are used to train the system and the remaining three to evaluate its performance. The total numbers of training and testing samples for each consonant are therefore 44 x 7 = 308 and 44 x 3 = 132, respectively.

C. The Experiments

Two experiments are performed using the above database, with MFCC features in the first and LBPCC features in the second. In this section we used 12 coefficients together with their 12 first derivatives and 12 second derivatives. HMM is used to construct the model of each consonant: five states are used to model each consonant, and each state contains three Gaussian mixtures. The results of the MFCC-based system are presented in Table 5 and those of the LBPCC-based system in Table 6. In each table, the first column shows the consonant used for testing, the second column gives the number of correctly recognized samples, and the third column presents the recognition rate for that consonant.
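A sketch of this modeling setup using the hmmlearn library (our choice of toolkit; the paper does not name one): one HMM per consonant with five states and three Gaussian mixtures per state, trained on that consonant's feature sequences and scored against competing models at recognition time.

    import numpy as np
    from hmmlearn.hmm import GMMHMM

    # One model per consonant: 5 states, 3 Gaussian mixtures per state,
    # 36-dimensional features (12 coefficients + deltas + double deltas).
    # Random sequences stand in for the 308 real training utterances.
    train_seqs = [np.random.randn(40, 36) for _ in range(308)]

    model = GMMHMM(n_components=5, n_mix=3,
                   covariance_type="diag", n_iter=20)
    model.fit(np.vstack(train_seqs),
              lengths=[len(s) for s in train_seqs])

    # Recognition: score a test utterance under every consonant model
    # and pick the one with the highest log-likelihood.
    print(model.score(np.random.randn(40, 36)))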

TABLE 5: RECOGNITION RATES FOR MFCC

Consonant   Correctly Recognized   Recognition Rate (%)
أ           131                    99.24
ء           131                    99.24
ب           107                    81.06
ت           102                    77.27
ث           82                     62.12
ج           123                    93.18
ح           122                    92.42
خ           128                    96.97
د           117                    88.64
ذ           83                     62.88
ر           112                    84.85
ز           120                    90.91
س           113                    85.61
ش           100                    75.76
ص           127                    96.21
ض           113                    85.61
ط           116                    87.88
ظ           109                    82.58
ع           127                    96.21
غ           121                    91.67
ف           88                     66.67
ق           125                    94.70
ك           118                    89.39
ل           127                    96.21
م           120                    90.91
ن           127                    96.21
ه           109                    82.58
و           124                    93.94
ي           122                    92.42
Overall     3344                   87.36

TABLE 6: RECOGNITION RATES FOR LBPCC

Consonant   Correctly Recognized   Recognition Rate (%)
أ           125                    94.70
ء           130                    98.48
ب           68                     51.52
ت           69                     52.27
ث           26                     19.70
ج           116                    87.88
ح           105                    79.55
خ           117                    88.64
د           56                     42.42
ذ           108                    81.82
ر           89                     67.42
ز           100                    75.76
س           93                     70.45
ش           103                    78.03
ص           113                    85.61
ض           116                    87.88
ط           65                     49.24
ظ           87                     65.91
ع           118                    89.39
غ           122                    92.42
ف           105                    79.55
ق           112                    84.85
ك           109                    82.58
ل           100                    75.76
م           112                    84.85
ن           103                    78.03
ه           67                     50.76
و           112                    84.85
ي           113                    85.61
Overall     2859                   74.69

D. DISCUSSION

The maximum recognition rate for any consonant using MFCC is 99.24%, while using LBPCC it is 98.48%. In the worst case, the minimum results for MFCC and LBPCC are 62.12% and 19.70%, respectively. The best recognition rates for both types of features occur for the same consonant, ء, and the worst results also occur for the same consonant, ث, showing that both features capture the same type of characteristics. Comparing Table 5 and Table 6 overall, LBPCC did not improve the overall recognition rate; specifically, its rate was 12.67% lower than that of MFCC. Nonetheless, this investigation showed that the general idea is valid and that further investigation is needed.

V. CONCLUSION

In this paper, Arabic speech recognition is performed to evaluate the performance of different types of speech features: LPC, LPCEPSTRA, LPREFC, MFCC, and MELSPEC. These features are examined in various ways, alone and complemented with other components such as energy (E), zero mean static coefficients (Z), the first derivative (D), and the second derivative (A). MFCC had the best overall performance. A new approach to deriving speech features, named LBPCC, is also introduced in this paper. The computation of LBPCC is the same as that of MFCC except that in LBPCC the DCT is performed after applying the LBP operator to the log of the Mel-weighted power spectrum, while in MFCC the DCT is applied directly to the log of the Mel-weighted power spectrum. The experiments showed that MFCC and LBPCC have the same type of characteristics, but LBPCC did not improve the recognition rate. This is not what we aimed for, but it showed that the concept might work; hence we will investigate further, using other windows and other operators, to obtain better results.

ACKNOWLEDGMENT

This work is supported by the National Plan for Science and Technology at King Saud University under grant number 08-INF167-02. The authors are grateful for this support.

REFERENCES

[1] C. M. Bishop, Neural Networks for Pattern Recognition, Oxford University Press, 1995.
[2] L. R. Rabiner, "A tutorial on Hidden Markov Models and selected applications in speech recognition", Proceedings of the IEEE, vol. 77, no. 2, Feb 1989.
[3] B. E. Boser, I. M. Guyon and V. N. Vapnik, "A training algorithm for optimal margin classifiers", in D. Haussler (ed.), 5th Annual ACM Workshop on COLT, ACM Press, Pittsburgh, PA, pp. 144-152, 1992.
[4] L. E. Baum and T. Petrie, "Statistical inference for probabilistic functions of finite state Markov chains", Ann. Math. Stat., vol. 37, pp. 1554-1563, 1966.
[5] T. Matsui and S. Furui, "Comparison of text-independent speaker recognition methods using VQ-distortion and discrete/continuous HMMs", Proceedings of ICASSP, pp. II.157-164, March 1992.
[6] Z. Ali, M. Aslam, M. E. Ana María and E. Gonzalo, "Text-independent speaker identification using VQ-HMM model based multiple classifier system", Lecture Notes in Computer Science, Springer, vol. 6438, pp. 116-125, 2010.
[7] X. Lu and J. Dang, "An investigation of dependencies between frequency components and speaker characteristics for text-independent speaker identification", Speech Communication, vol. 50, no. 4, pp. 312-322, 2008.
[8] J. R. Deller, J. H. L. Hansen and J. G. Proakis, Discrete-Time Processing of Speech Signals, IEEE Press, New York, NY, 2000.
[9] B. S. Atal, "Effectiveness of linear prediction characteristics of the speech wave for automatic speaker identification and verification", J. Acoust. Soc. Amer., vol. 55, no. 6, pp. 1304-1312, 1974.
[10] Z. Razak, N. J. Ibrahim, M. Y. Idna Idris, et al., "Quranic verse recitation recognition module for support in J-QAF learning: a review", International Journal of Computer Science and Network Security (IJCSNS), vol. 8, no. 8, pp. 207-216, Aug 2008.
[11] Z. Ali, M. Aslam and M. E. Ana María, "A speaker identification system using MFCC features with VQ technique", Proceedings of the 3rd International Symposium on Intelligent Information Technology Application, IEEE, pp. 115-119, 2009.
[12] L. R. Rabiner and B. H. Juang, Fundamentals of Speech Recognition, Prentice Hall Signal Processing Series, 1993.
[13] S. Young et al., The HTK Book, Cambridge University Engineering Department, 2006.
[14] T. Ojala, M. Pietikainen and T. Maenpaa, "Multiresolution gray-scale and rotation invariant texture classification with local binary patterns", IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 24, no. 7, pp. 971-987, July 2002.
[15] T. Ahonen, A. Hadid and M. Pietikainen, "Face description with local binary patterns: Application to face recognition", IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 28, no. 12, pp. 2037-2041, December 2006.