Text-Independent Speaker Identification Using VQ-HMM Model Based Multiple Classifier System

Ali Zulfiqar1, Aslam Muhammad2, A.M. Martinez-Enriquez3, and G. Escalada-Imaz4

1 Department of CS & IT, University of Gujrat, Pakistan
[email protected]
2 Department of CS & E, U.E.T., Lahore, Pakistan
[email protected]
3 Department of Computer Science, CINVESTAV-IPN, Mexico
[email protected]
4 Artificial Intelligence Research Institute, CSIC, Barcelona, Spain
[email protected]
Abstract. No single feature extraction or modeling technique for voice/speech is suitable in every type of environment. In many real-life applications, it is not feasible to fold every feature extraction and modeling technique into a single classifier for speaker identification, because doing so makes the system overly complex. Instead of exploring ever more techniques or complicating the system, it is more reasonable to build classifiers from existing techniques and then combine them with different combination schemes to enhance the performance of the system. Accordingly, this paper describes the design and implementation of a VQ-HMM based Multiple Classifier System using different combination techniques. The results show that the system based on the confusion matrix significantly improves the identification rate.

Keywords: Speaker identification, classifier combination, HMM, VQ, MFCC, LPC.
1 Introduction

Speaker identification (SI) identifies an unknown speaker by comparing his or her voice with the registered speakers' voices stored in a database. SI can be text-dependent or text-independent [1]. A text-independent SI system is not limited to recognizing speakers on the basis of the same sentences stored in the database, whereas a text-dependent SI system can only recognize speakers who utter the same sentence every time [2]. SI can be further divided into closed-set SI and open-set SI [3]. In closed-set speaker identification, the unknown speech signal comes from one of the registered speakers; an open-set system must decide whether the unknown signal comes from a registered speaker or from an unregistered one. A closed-set text-independent speaker identification (CISI) system must capture the particular features of a voice even in a noisy environment.
Two widely used feature extraction techniques, Mel-frequency Cepstral Coefficients (MFCC) [4] and Linear Prediction Coefficients (LPC) [5], and two modeling techniques, Hidden Markov Models (HMM) [6] and Vector Quantization (VQ) [7], are used to construct different classifiers. The classifiers differ from one another in the feature extraction and modeling techniques they use. MFCC simulates the behavior of the human ear and uses the Mel frequency scale. LPC features represent the main vocal tract resonance properties in the acoustic spectrum and make it possible to distinguish one speaker from another, because each speaker is characterized by his or her own formant structure. An HMM, based on the mathematical model of a Markov chain, is a doubly stochastic process that recognizes speakers very well in both text-dependent and text-independent SI systems. VQ is implemented through the LBG algorithm to reduce and compress the feature vectors into a small number of highly representative vectors.

Speaker identification made by a single decision-making scheme is always risky, because no single type of feature is suitable for all environments. Thus, this paper describes a Multiple Classifier System (MCS) for CISI which reduces errors and wrong identifications. The basic idea is to analyze the results obtained by different classifiers and then integrate the classifiers so that their reliability is enhanced by a proper combination technique. The principal objective of this work is to obtain a better identification rate for CISI by using various combination techniques to compound the outputs of the individual classifiers.

The paper proceeds as follows: Section 2 describes the steps followed by all three developed classifiers for SI. Section 3 explains the different combination techniques used in the MCS to coalesce the normalized measurement-level outputs of the single classifiers into a joint decision. Section 4 presents the results obtained from system testing and experimentation. Finally, we conclude our work in Section 5.
2 Single Classifier Speaker Identification System

Three different classifiers are designed and implemented:

- MFCC based on Vector Quantization (VQ) (classifier K1);
- MFCC based on Hidden Markov Models (HMM) (classifier K2); and
- LPC based on VQ (classifier K3) (see Figure 1).

All three classifiers perform closed-set text-independent speaker identification. Each classifier carries out the following steps of an SI task [8], [9], [10]:
Digital speech data acquisition. Acoustic events such as phonemes occur on time scales of 10 ms to 100 ms [11]. Therefore, every speech signal is digitized into frames, where the duration of each frame is 23 ms at the sampling frequency of 11025 Hz and 16 ms at the sampling frequency of 8000 Hz.

Feature extraction. Feature extraction is the process by which the speech signal is converted into a parametric representation for further analysis and processing. This process is critical to the high performance of a CISI system: appropriate features must be extracted from the speech, otherwise the identification rate suffers significantly. LPC and MFCC based feature vectors are extracted from the speech of each registered speaker.

Acoustic model. It is not practical to use all extracted feature vectors directly to determine the identity of a speaker. Therefore, two modeling techniques, HMM and VQ, are used to construct an acoustic model of each registered speaker; a minimal sketch of VQ codebook training with the LBG algorithm is given below.
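The paper gives no code for the VQ model construction; the following is a minimal NumPy sketch of the LBG algorithm as commonly described (function and parameter names are ours, not the paper's):

```python
import numpy as np

def lbg_codebook(features, codebook_size=32, eps=0.01, n_iters=20):
    """Train a VQ codebook with the LBG algorithm: start from the global
    centroid, repeatedly split every codeword into a perturbed pair, and
    refine each stage with k-means-style iterations.

    features: (n_vectors, dim) array of MFCC or LPC feature vectors.
    """
    codebook = features.mean(axis=0, keepdims=True)
    while len(codebook) < codebook_size:
        # Split each codeword into two slightly perturbed copies.
        codebook = np.vstack([codebook * (1 + eps), codebook * (1 - eps)])
        for _ in range(n_iters):
            # Assign every training vector to its nearest codeword ...
            d = np.linalg.norm(features[:, None, :] - codebook[None, :, :], axis=2)
            nearest = d.argmin(axis=1)
            # ... and move each codeword to the centroid of its cell.
            for k in range(len(codebook)):
                cell = features[nearest == k]
                if len(cell):
                    codebook[k] = cell.mean(axis=0)
    return codebook

# Example: a 32-codeword speaker model from 500 dummy 13-dimensional vectors.
codebook = lbg_codebook(np.random.randn(500, 13))
print(codebook.shape)  # (32, 13)
```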
Pattern matching. For the identification of an unknown registered speaker, his or her feature vectors are compared with each speaker's acoustic model present in the speaker database.

Identification decision. When the feature vectors of an unknown speaker are compared with the acoustic model of each registered speaker, the decision in the case of the VQ modeling technique is made by computing the distortion: the speaker whose acoustic model has the minimum distortion with respect to the unknown speech signal is recognized as the true speaker. For HMM, the decision is made using the maximum likelihood criterion (a sketch of both criteria follows Figure 1).

Fig. 1. Three Different Classifiers (K1: MFCC with VQ; K2: MFCC with HMM; K3: LPC with VQ)
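The two decision criteria can be sketched as follows (our code with hypothetical model interfaces; the `score()` method stands for a log-likelihood function such as hmmlearn's `GaussianHMM.score`, which is our assumption, not the paper's implementation):

```python
import numpy as np

def vq_distortion(features, codebook):
    """Average Euclidean distance of the feature vectors to their
    nearest codeword in a speaker's codebook."""
    d = np.linalg.norm(features[:, None, :] - codebook[None, :, :], axis=2)
    return d.min(axis=1).mean()

def identify_vq(test_features, codebooks):
    """VQ decision: the registered speaker whose codebook yields the
    minimum distortion on the unknown speech is declared the true speaker."""
    return min(codebooks, key=lambda s: vq_distortion(test_features, codebooks[s]))

def identify_hmm(test_features, hmms):
    """HMM decision: the speaker whose model assigns the maximum
    (log-)likelihood to the unknown speech is declared the true speaker."""
    return max(hmms, key=lambda s: hmms[s].score(test_features))
```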
In order to construct the three individual classifiers K1, K2, and K3 and check their performance, we carried out the following experimentation. K1 is obtained by combining MFCC with VQ, and K2 and K3 by combining MFCC with HMM and LPC with VQ, respectively (see Figure 1). Additionally, to make the system robust against noise, a consistent noise is added during the recording of each sentence. A database of more than 700 voice samples, recorded in two sessions separated by two to three weeks, is used to evaluate the performance of the system. It contains the utterances of 44 speakers, 30 males and 14 females. Each speaker recorded 6 different sentences at the sampling frequencies of 8000 Hz and 11025 Hz using the PRAAT software (http://www.fon.hum.uva.nl/praat/download_win.html). The sentences are:

Sentence 1: Decimal digits from zero to nine.
Sentence 2: All people smile in the same language.
Sentence 3: Betty bought bitter butter. But the butter was so bitter that she bought new butter to make the bitter butter better.
Sentence 4: A random text from a selected topic, read for 35 seconds.
Sentence 5: The speaker's roll number or employee ID.
Sentence 6: A random text from a selected topic, read for 12 seconds.

The identification rates of all three classifiers at the sampling frequencies of 8000 Hz and 11025 Hz are depicted in Figure 2 and Figure 3, respectively, when the classifiers are trained and tested with the following sentences:

Training sentence: Betty bought bitter butter. But the butter was so bitter that she bought new butter to make the bitter butter better.
Testing sentence: All people smile in the same language.
Fig. 2. Identification rate of classifiers at 8000 Hz (K1: 90.91%, K2: 86.36%, K3: 61.36%)

Fig. 3. Identification rate of classifiers at 11025 Hz (K1: 97.73%, K2: 93.18%, K3: 68.18%)
The identification rate of these classifiers is computed using the following relation:

Identification Rate = (Truly Identified Speakers / Total No. of Speakers) × 100%
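For instance, the 90.91% reported for K1 at 8000 Hz corresponds to 40 of the 44 registered speakers being correctly identified, as this short sketch (our code) confirms:

```python
def identification_rate(truly_identified, total_speakers):
    """Identification rate, expressed as a percentage."""
    return 100.0 * truly_identified / total_speakers

# 40 of 44 registered speakers correctly identified -> 90.91%,
# matching the rate reported for K1 at 8000 Hz.
print(round(identification_rate(40, 44), 2))  # 90.91
```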
The above experiments show that the higher sampling frequency yields a better identification rate than the lower one, and that classifier K1 (the MFCC-based VQ classifier) is better than the other classifiers at both sampling frequencies. We use the results at the sampling frequency of 8000 Hz to construct the Multiple Classifier System (MCS), because in that case even the best classifier has an identification rate of only 90.91%, so there is considerable room for improvement.
3 Multiple Classifier Speaker Identification System (MCSIS)

The MCSIS uses the outputs of all three classifiers to make a joint decision about the identity of the speaker. MCSISs can be categorized according to the level of classifier output that they use during combination [12], [13]. There are three levels of classifier output:

• Abstract Level. Each classifier outputs only the identity of a speaker; this level carries the least information.
• Rank Level. Each classifier provides a set of speaker identities ranked in descending order of likelihood.
• Measurement Level. Each classifier provides a measurement value for every registered speaker, whether or not that speaker is the correct one, and thus conveys the greatest amount of information.

Different combination techniques [14], the Sum Rule, Product Rule, Min Rule, and Max Rule, are used in this work to combine the normalized measurement-level outputs of the classifiers into a joint decision about the identity of the speaker.

3.1 Normalized Measurement Level Output of Classifiers

The measurement level provides quite useful information about every speaker, as shown in Table 1 and Table 2. The first row of each table presents the measurement-level output of a classifier using MFCC features (Table 1) and LPC features (Table 2) for speaker S1 when compared against eight registered speakers (S1–S8). A major problem in measurement-level combination is the incomparability of classifier outputs [15]: as the first rows of the two tables show, the outputs of classifiers based on different feature vectors differ in range and are therefore not directly comparable. In consequence, before combining them, the outputs of each classifier must be treated. The measurement-level output of each classifier is therefore normalized to probabilities by dividing each element of the row by the sum of all elements of that row (a short sketch follows Table 2).

Table 1. Measurement level output for a classifier using MFCC features
            S1     S2     S3     S4     S5     S6     S7     S8
S1          1.549  1.922  2.096  2.216  2.098  2.377  2.192  2.308
Normalized  0.092  0.115  0.125  0.132  0.125  0.142  0.131  0.138
Table 2. Measurement level output for a classifier using LPC features

            S1     S2     S3     S4     S5     S6     S7     S8
S1          0.585  0.692  1.429  0.742  0.754  1.310  1.395  1.694
Normalized  0.069  0.082  0.170  0.088  0.090  0.134  0.166  0.201
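As a minimal sketch of this normalization (our Python/NumPy code, using the Table 1 values):

```python
import numpy as np

# Measurement-level output of the MFCC-based classifier for speaker S1
# against speakers S1..S8 (first row of Table 1).
row = np.array([1.549, 1.922, 2.096, 2.216, 2.098, 2.377, 2.192, 2.308])

# Normalize to probabilities by dividing each element by the row sum.
normalized = row / row.sum()
print(np.round(normalized, 3))
# [0.092 0.115 0.125 0.132 0.125 0.142 0.131 0.138]  -- second row of Table 1
```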
After normalization, the outputs of all classifiers lie in the interval [0, 1]. We now need a suitable combination technique for the MCSIS; the techniques discussed in the following subsections can provide a better identification rate than the individual classifiers.

3.2 Sum Rule (Linear Combination)

Linear combination is the simplest technique for an MCS. For each speaker, the sum of the outputs of all classifiers is calculated, and the true speaker is decided by the maximum value obtained after combination [13], [16], [17]. Suppose that there are three classifiers K1, K2, and K3 and five speakers (S1, S2, S3, S4, and S5), and that the outputs of the classifiers are represented by the vectors O_1, O_2, and O_3:

O_1 = [a1 a2 a3 a4 a5]^T,  O_2 = [b1 b2 b3 b4 b5]^T,  O_3 = [c1 c2 c3 c4 c5]^T
where aj, bj, and cj (j = 1, 2, ..., 5) are positive real numbers. These output vectors are combined into an output matrix:

O = [ a1  b1  c1
      a2  b2  c2
      a3  b3  c3
      a4  b4  c4
      a5  b5  c5 ]

The Sum rule is defined as

O_sum = Σ_{i=1}^{k} O_i,

where O_i is the i-th column of the output matrix and k is the number of classifiers. After combining by the Sum rule, the output becomes

O_sum = [ a1 + b1 + c1
          a2 + b2 + c2
          a3 + b3 + c3
          a4 + b4 + c4
          a5 + b5 + c5 ]
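As a minimal sketch (our code, not the paper's; the matrix values are taken from the example used later in Section 3.4):

```python
import numpy as np

# Normalized outputs of k = 3 classifiers (columns) for 5 speakers (rows).
O = np.array([[0.0, 0.3, 0.2],
              [0.4, 0.3, 0.2],
              [0.6, 0.5, 0.4],
              [0.0, 0.0, 0.1],
              [0.2, 0.1, 0.3]])

O_sum = O.sum(axis=1)                 # Sum rule: add the columns element-wise
print(O_sum)                          # [0.5 0.9 1.5 0.1 0.6]
print("speaker", O_sum.argmax() + 1)  # the largest row wins -> speaker 3
```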
If the value in the first row of O_sum is the largest, the result of the MCS is speaker 1 (S1); similarly, if the value in the second row is the largest, the result is speaker 2 (S2), and so on.

3.3 Product Rule (Logarithmic Combination)

The Product rule, also called logarithmic combination, is another simple rule for a classifier combination system. It works in the same manner as linear combination, except that instead of being summed, the outputs for each speaker from all classifiers are multiplied [12], [16]. The Product rule is defined as

O_prod = Π_{i=1}^{k} O_i
When the output of any classifier for a particular speaker is zero, that value is replaced by a very small positive real number (otherwise a single zero would annul the entire product). After combining the output vectors of all classifiers, the output becomes:

O_prod = [ a1 · b1 · c1
           a2 · b2 · c2
           a3 · b3 · c3
           a4 · b4 · c4
           a5 · b5 · c5 ]
The decision criterion is the same as for the Sum rule.
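A corresponding sketch of the Product rule, including the replacement of zeros described above (the value of the small constant is our assumption):

```python
import numpy as np

EPS = 1e-10  # stand-in for the "very small positive real number" above

O = np.array([[0.0, 0.3, 0.2],
              [0.4, 0.3, 0.2],
              [0.6, 0.5, 0.4],
              [0.0, 0.0, 0.1],
              [0.2, 0.1, 0.3]])

O_prod = np.where(O == 0.0, EPS, O).prod(axis=1)  # Product rule
print("speaker", O_prod.argmax() + 1)  # 0.6 * 0.5 * 0.4 = 0.12 wins -> speaker 3
```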
3.4 Min Rule

The Min rule measures the likelihood of a given speaker by taking the minimum normalized measurement-level output for that speaker across the classifiers. The final identification decision is then made by determining the maximum of these values [12], [13].

Example 1: Consider the output matrix obtained by combining the output vectors of the three classifiers, where each column represents the output of one classifier for five speakers:

O = [ 0.0  0.3  0.2
      0.4  0.3  0.2
      0.6  0.5  0.4
      0.0  0.0  0.1
      0.2  0.1  0.3 ]

Each element of O_min is the minimum value of the corresponding row of O, where each row holds the outputs of all classifiers for a particular speaker:

O_min = [0.0 0.2 0.4 0.0 0.1]^T

The final decision is the maximum value of the vector O_min, which is 0.4; this value shows that the true speaker is speaker 3.
3.5 Max Rule

In the Max rule, the combined output for a speaker is the maximum of the output values provided by the different classifiers for that speaker [12], [16]. For a better explanation, consider the following example.

Example 2: Assume three classifiers and five speakers with the same output matrix as in Example 1:

O = [ 0.0  0.3  0.2
      0.4  0.3  0.2
      0.6  0.5  0.4
      0.0  0.0  0.1
      0.2  0.1  0.3 ]

The combined output vector is obtained by selecting the maximum value from each row of the output matrix:

O_max = [0.3 0.4 0.6 0.1 0.3]^T

The maximum value in O_max is 0.6, which corresponds to speaker 3, so the joint decision of all classifiers is speaker 3.
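Both rules can be verified on the example matrix in a few lines (a sketch; our code):

```python
import numpy as np

O = np.array([[0.0, 0.3, 0.2],   # the output matrix of Examples 1 and 2
              [0.4, 0.3, 0.2],
              [0.6, 0.5, 0.4],
              [0.0, 0.0, 0.1],
              [0.2, 0.1, 0.3]])

O_min = O.min(axis=1)  # Min rule: [0.0 0.2 0.4 0.0 0.1], maximum 0.4
O_max = O.max(axis=1)  # Max rule: [0.3 0.4 0.6 0.1 0.3], maximum 0.6
print("Min rule -> speaker", O_min.argmax() + 1)  # speaker 3
print("Max rule -> speaker", O_max.argmax() + 1)  # speaker 3
```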
3.6 Confusion Matrix

The confusion matrix is a handy tool for evaluating the performance of a classifier. It contains information on both the truly identified speakers and the misclassified ones [15], [18]. Each column of the matrix represents a true speaker. Assume that 50 voice samples of speaker 3 are tested by the identification system. If all of these samples are truly identified, the value at the 3rd row, 3rd column will be 50, with zeros elsewhere in that column. On the other hand, values of 3, 1, 0, 41, and 5 in the 1st through 5th rows of the 4th column indicate that speaker 4 is misclassified 3 times as speaker 1, once as speaker 2, and 5 times as speaker 5, and is truly identified 41 times. A confusion matrix is shown in Figure 4.
                     True Speakers
                   1    2    3    4    5
Identified   1  [ 50    0    0    3    0
Speakers     2     0   47    0    1    0
             3     0    2   50    0    1
             4     0    0    0   41    0
             5     0    1    0    5   49 ]

Fig. 4. A Confusion Matrix
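A short sketch (our code) of how such a matrix is read: per-speaker accuracy is the diagonal divided by the column sums, and the overall rate is the trace divided by the total count.

```python
import numpy as np

# The confusion matrix of Fig. 4: columns are true speakers,
# rows are the speakers the system identified.
C = np.array([[50,  0,  0,  3,  0],
              [ 0, 47,  0,  1,  0],
              [ 0,  2, 50,  0,  1],
              [ 0,  0,  0, 41,  0],
              [ 0,  1,  0,  5, 49]])

per_speaker = 100.0 * np.diag(C) / C.sum(axis=0)  # accuracy per true speaker
overall = 100.0 * np.trace(C) / C.sum()           # overall identification rate
print(np.round(per_speaker, 1))  # [100.  94. 100.  82.  98.]
print(round(overall, 2))         # 94.8
```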
4 Results

The identification rates of the MCSIS after applying the Sum, Product, Min, and Max rules to the outputs of the individual classifiers are presented in Figure 5. The Max rule as the combination rule in the MCS shows an increase of 4.54% in identification rate over the best individual classifier, which achieved 90.91%.
Fig. 5. Identification Rates of Combination Techniques (values shown: 95.45%, 88.64%, 81.82%, and 70.45%; the Max Rule achieves the highest rate, 95.45%)
Some combination techniques show identification rates even poorer than those of the individual classifiers. A comparison of the identification rates of the best individual classifier K1, the best combination technique (the Max rule), and the confusion matrix technique is depicted in Figure 6.
Fig. 6. Comparison of Confusion Matrix Technique with Max Rule and Individual Classifier (Individual Classifier: 90.91%, Max Rule: 95.45%, Confusion Matrix: 97.72%)
5 Conclusion and Future Work

The MFCC-based VQ classifier, the LPC-based VQ classifier, and the MFCC-based HMM classifier are combined into a Multiple Classifier System (MCS). The normalized measurement-level outputs of the classifiers are combined using the Min, Max, Product, and Sum rules. The Max rule demonstrates better results than the other combination techniques, improving the identification rate by 4.54% over the best individual classifier. When the classifiers are combined using the confusion matrix, however, the proposed text-independent multiple classifier system shows an improvement of 6.81% over the best individual classifier and 2.27% over the Max rule. The experiments show that the confusion matrix based MCS produces excellent results compared to each individual classifier, and that these results are also better than those of the other combination techniques, i.e., the Sum, Product, Min, and Max rules.

In the case studied, our proposed MCS for CISI gives the same importance to the result obtained by each classifier. To enhance the decision process, the output of a classifier could instead be weighted more heavily when its performance is better than that of the other classifiers in the tested environment; validating this is our future work, for which we continue to run tests.
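As a sketch of this weighted variant (the weights shown are hypothetical, chosen proportional to the individual identification rates at 8000 Hz; estimating proper weights is precisely the future work described above):

```python
import numpy as np

O = np.array([[0.0, 0.3, 0.2],   # normalized outputs, one column per classifier
              [0.4, 0.3, 0.2],
              [0.6, 0.5, 0.4],
              [0.0, 0.0, 0.1],
              [0.2, 0.1, 0.3]])

# Hypothetical weights, proportional to the identification rates of
# K1, K2, and K3 at 8000 Hz; learning proper weights is future work.
w = np.array([90.91, 86.36, 61.36])
w = w / w.sum()

O_weighted = O @ w                         # weighted Sum rule
print("speaker", O_weighted.argmax() + 1)  # speaker 3
```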
References

[1] Furui, S.: Recent Advances in Speaker Recognition. Pattern Recognition Letters 18(9), 859–872 (1997)
[2] Chen, K., Wang, L., Chi, H.: Methods of Combining Multiple Classifiers with Different Features and Their Application to Text-Independent Speaker Identification. International Journal of Pattern Recognition and Artificial Intelligence 11(3), 417–445 (1997)
[3] Reynolds, D.A.: An Overview of Automatic Speaker Recognition Technology. In: Proc. IEEE ICASSP, vol. 4, pp. 4072–4075 (2002)
[4] Godino-Llorente, J.I., Gómez-Vilda, P., Sáenz-Lechón, N., Velasco, M.B., Cruz-Roldán, F., Ballester, M.A.F.: Discriminative Methods for the Detection of Voice Disorders. In: ISCA Tutorial and Research Workshop on Non-Linear Speech Processing, The COST-277 Workshop (2005)
[5] Xugang, L., Jianwu, D.: An Investigation of Dependencies between Frequency Components and Speaker Characteristics for Text-independent Speaker Identification. Speech Communication 50(4), 312–322 (2007)
[6] Huang, X.D., Ariki, Y., Jack, M.A.: Hidden Markov Models for Speech Recognition. Edinburgh University Press, Edinburgh (1990)
[7] Linde, Y., Buzo, A., Gray, R.M.: An Algorithm for Vector Quantizer Design. IEEE Transactions on Communications 28, 84–95 (1980)
[8] Higgins, J.E., Damper, R.I., Harris, C.J.: A Multi-Spectral Data Fusion Approach to Speaker Recognition. In: Fusion 1999, 2nd International Conference on Information Fusion, Sunnyvale, CA, pp. 1136–1143 (1999)
[9] Premakanthan, P., Mikhael, W.B.: Speaker Verification/Recognition and the Importance of Selective Feature Extraction: Review. In: Proc. of 44th IEEE MWSCAS, vol. 1, pp. 57–61 (2001)
[10] Razak, Z., Ibrahim, N.J., Idna Idris, M.Y., et al.: Quranic Verse Recitation Recognition Module for Support in J-QAF Learning: A Review. International Journal of Computer Science and Network Security (IJCSNS) 8(8), 207–216 (2008)
[11] Becchetti, C., Ricotti, L.P.: Speech Recognition: Theory and C++ Implementation. John Wiley & Sons, Chichester (1999)
[12] Kittler, J., Hatef, M., Duin, R.P.W., Matas, J.: On Combining Classifiers. IEEE Transactions on Pattern Analysis and Machine Intelligence 20(3), 226–239 (1998)
[13] Kuncheva, L.I., Bezdek, J.C., Duin, R.P.W.: Decision Templates for Multiple Classifier Fusion: An Experimental Comparison. Pattern Recognition 34(2), 299–314 (2001)
[14] Shakhnarovich, G., Darrell, T.: On Probabilistic Combination of Face and Gait Cues for Identification. In: Proc. 5th IEEE Int'l Conf. on Automatic Face and Gesture Recognition, pp. 169–174 (2002)
[15] Ho, T.K., Hull, J.J., Srihari, S.N.: Decision Combination in Multiple Classifier Systems. IEEE Transactions on Pattern Analysis and Machine Intelligence 16(1), 66–75 (1994)
[16] Tumer, K., Ghosh, J.: Linear and Order Statistics Combiners for Pattern Classification. In: Sharkey, A. (ed.) Combining Artificial Neural Networks, pp. 127–162. Springer, Heidelberg (1999)
[17] Chen, K., Chi, H.: A Method of Combining Multiple Probabilistic Classifiers through Soft Competition on Different Feature Sets. Neurocomputing 20(1-3), 227–252 (1998)
[18] Kuncheva, L.I., Jain, L.C.: Designing Classifier Fusion Systems by Genetic Algorithms. IEEE Transactions on Evolutionary Computation 4(4), 327–336 (2000)