Improving SVM by Modifying Kernel Functions for Speaker Identification Task

Siwar Zribi Boujelbene (First Author), Faculty of Humanities and Social Sciences of Tunis (FSHST), Tunis, Tunisia, [email protected]
Dorra Ben Ayed Mezghani (Corresponding Author), High Institute of Computer Science of Tunis (ISI), Tunis, Tunisia, [email protected]
Noureddine Ellouze, National Engineering School of Tunis (ENIT), Tunis, Tunisia, [email protected]

doi: 10.4156/jdcta.vol4.issue6.12
Abstract

The support vector machine (SVM) was the first proposed kernel-based method. It uses a kernel function to map data from the input space into a high-dimensional feature space in which it searches for a separating hyperplane. SVM aims to maximize the generalization ability, which depends on the empirical risk and the complexity of the machine. SVM has been widely adopted in real-world applications, including speech recognition. In this paper, an empirical comparison of kernel selection for SVM is presented and discussed for text-independent speaker identification on the TIMIT corpus. We focus on SVMs trained using linear, polynomial and radial basis function (RBF) kernels. Results show that the best performance was achieved using the polynomial kernel, with a speaker identification rate of 82.47%.
Keywords: Support Vector Machine, Kernel Function, Speaker Identification

1. Introduction

The support vector machine (SVM) has gained much attention since its inception [4, 3]. It is a powerful learning algorithm based on recent advances in statistical learning theory. It delivers state-of-the-art performance in real-world applications and is a useful algorithm for pattern recognition tasks. Indeed, SVM has been used for isolated handwritten digit recognition [4, 16, 3], object recognition [1], speaker identification [14, 15], face detection in images [13], text categorization [19], etc.

Speaker identification is the task of determining which of the registered speakers a given utterance comes from. It can be either text-independent or text-dependent. Text-independent means that the identification procedure should work for any text in either training or testing [11]. This is a different problem from text-dependent speaker identification, where the text in both training and testing is the same or is known.

The first use of SVM for speaker identification was reported by Schmidt and Gish [15], where SVMs were trained directly on the acoustic spaces characterizing the client data and the impostor data. During testing, the segment score is obtained by averaging the SVM output scores over all frames. In [15], Schmidt and Gish achieved mixed results on Switchboard, a very noisy speech corpus.

In this paper, we investigate the best choice among SVM kernels, namely the linear, polynomial and radial basis function (RBF) kernels. We evaluated the SVM classifier for text-independent speaker identification using the TIMIT corpus, which provides high-quality recordings of speech. The task was not straightforward, especially since it required applying different types of kernels to different feature datasets.

This paper is organized as follows: Section 2 presents an overview of support vector machines. Section 3 briefly describes speaker identification using support vector machines. Section 4 presents the experimental conditions, and Section 5 discusses the results.
2. Overview of Support Vector Machine

An SVM is a binary classifier that makes its decisions by constructing a hyperplane that optimally separates two classes. The hyperplane is defined by $x \cdot w + b = 0$, where $w$ is the normal to the plane.
For linearly separable data $\{x_i, y_i\}$, $x_i \in \mathbb{R}^d$, $y_i \in \{-1, 1\}$, $i = 1 \ldots N$, the optimal hyperplane is determined according to the maximum margin criterion [2]. This is achieved by minimizing

$$\frac{\|w\|^2}{2} \quad (1)$$

subject to $(x_i \cdot w + b)\, y_i \ge 1, \ \forall i$. The solution for the optimal hyperplane $w_0$ is a linear combination of a small subset of the data, $x_s$, $s \in \{1 \ldots N\}$, known as the support vectors, which satisfy $(x_s \cdot w_0 + b)\, y_s = 1$.

For data that are not linearly separable, no hyperplane exists for which all points satisfy the above inequality. To overcome this problem, slack variables $\zeta_i$ are introduced, and the objective becomes minimizing

$$\frac{1}{2}\|w\|^2 + C \sum_i L(\zeta_i) \quad \text{subject to} \quad (x_i \cdot w + b)\, y_i \ge 1 - \zeta_i \quad (2)$$

where $L$ is the loss function, $C$ is a hyper-parameter that balances minimizing the empirical risk against maximizing the margin, and the second term represents the empirical risk associated with the marginal or misclassified points.

According to Burges [2], the dual formulation of (2) with $L(\zeta_i) = \zeta_i$, which is more conveniently solved, is

$$\max_{\alpha} \left( \sum_i \alpha_i - \frac{1}{2} \sum_{i,j} \alpha_i \alpha_j y_i y_j \, x_i \cdot x_j \right) \quad \text{subject to} \quad 0 \le \alpha_i \le C, \quad \sum_i \alpha_i y_i = 0 \quad (3)$$

where $\alpha_i$ is the Lagrange multiplier of the $i$th constraint in the primal optimization problem. The dual can be solved using standard quadratic programming techniques. The optimal hyperplane $w_0$ is given by

$$w_0 = \sum_i \alpha_i y_i x_i \quad (4)$$
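As a concrete illustration (ours, not part of the original paper), the soft-margin problem (2) and its dual (3) are what standard SVM libraries solve internally. A minimal sketch using scikit-learn's SVC on synthetic data, recovering $w_0$ via Eq. (4):

    # Illustrative sketch (not from the paper): training a soft-margin linear SVM.
    # Assumes scikit-learn and numpy; the data here are synthetic.
    import numpy as np
    from sklearn.svm import SVC

    rng = np.random.default_rng(0)
    X = np.vstack([rng.normal(-1, 1, (50, 2)), rng.normal(1, 1, (50, 2))])
    y = np.hstack([-np.ones(50), np.ones(50)])

    clf = SVC(kernel="linear", C=1.0)   # C balances empirical risk vs. margin, cf. Eq. (2)
    clf.fit(X, y)

    # The solver returns the support vectors x_s and the dual coefficients
    # alpha_i * y_i; w0 of Eq. (4) is their weighted sum.
    w0 = clf.dual_coef_ @ clf.support_vectors_
    print(w0, clf.intercept_)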
The extension to non-linear boundaries is achieved through the use of kernels that satisfy Mercer's condition [8]. In practice, the most commonly used kernels are the linear, polynomial and radial basis function (RBF) kernels. The linear kernel takes the form

$$K(x_i, x_j) = x_i \cdot x_j \quad (5)$$

The polynomial kernel takes the form

$$K(x_i, x_j) = (x_i \cdot x_j + 1)^p \quad (6)$$

where $p$ is the degree of the polynomial. The RBF kernel takes the form

$$K(x_i, x_j) = \exp\left[ -\frac{1}{2} \left( \frac{\|x_i - x_j\|}{\sigma} \right)^2 \right] \quad (7)$$

where $\sigma$ is the width of the radial basis function. The dual for the non-linear case is thus

$$\max_{\alpha} \left( \sum_i \alpha_i - \frac{1}{2} \sum_{i,j} \alpha_i \alpha_j y_i y_j \, K(x_i, x_j) \right) \quad \text{subject to} \quad 0 \le \alpha_i \le C, \quad \sum_i \alpha_i y_i = 0 \quad (8)$$
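For concreteness, the kernels of Eqs. (5)-(7) can be written directly in a few lines of numpy; this sketch is ours and is not part of the original paper:

    # Illustrative numpy implementations of Eqs. (5)-(7); not from the paper.
    import numpy as np

    def linear_kernel(xi, xj):
        return xi @ xj                                   # Eq. (5)

    def polynomial_kernel(xi, xj, p):
        return (xi @ xj + 1.0) ** p                      # Eq. (6)

    def rbf_kernel(xi, xj, sigma):
        d = np.linalg.norm(xi - xj)                      # Euclidean distance
        return np.exp(-0.5 * (d / sigma) ** 2)           # Eq. (7)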
Training an SVM requires solving a quadratic programming (QP) problem. The large memory requirements of QP may be largely overcome by chunking [13]. However, when a near-singular Hessian is encountered, an optimizer may find a suboptimal solution. Reference [5] describes the Sequential Minimal Optimization (SMO) method, which solves the SVM QP problem without any extra matrix storage and without using numerical QP optimization steps at all. In [17, 7, 18], Zribi Boujelbene et al. show that this method can be used successfully for speech recognition.
3. Text-Independent Speaker Identification Using SVM

The goal of speaker identification is to determine the identity of a speaker, from a group of speakers, by his/her voice. So, for each speaker, a model of his/her voice must be created. The speaker is prompted to read a line of text; in a text-independent speaker identification task, the transcription of the text is ignored. The simplest method consists of constructing classifiers that separate each speaker from all of the others [12, 6]. So, if there are $n$ speakers, $n$ classifiers must be constructed. The identity of the speaker is then determined by the classifier that yields the largest utterance score, given by

$$\arg\max_{1 \le j \le n} \sum_{i=1}^{N} \left( \sum_{s} \alpha_{sj} y_{sj} \, K(x_{sj}, x_i) + b_j \right) \quad (9)$$

where $x_{sj}$ are the support vectors of the $j$th classifier, $N$ is the length of the utterance in frames, and $\alpha_{sj}$ and $y_{sj}$ are the corresponding Lagrange multipliers and class labels.
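The following sketch (ours, with hypothetical names) shows one way Eq. (9) could be realized in code: `models` is assumed to map each speaker to a trained binary SVM (speaker vs. rest), and each classifier's decision_function plays the role of the inner sum of Eq. (9):

    # Illustrative sketch of Eq. (9): pick the speaker whose classifier yields
    # the largest summed frame score over the utterance. `models` is assumed to
    # map a speaker id to a trained sklearn.svm.SVC; not from the paper.
    def identify_speaker(frames, models):
        """frames: (N, d) array of feature vectors for one utterance."""
        scores = {spk: clf.decision_function(frames).sum()   # sum over the N frames
                  for spk, clf in models.items()}
        return max(scores, key=scores.get)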
4. Experimental Conditions

Experiments were performed on the DR1 dialect region of the TIMIT corpus. This data consists of 47 speakers (16 female and 31 male), each prompted to read ten sentences. Two sentences have the prefix "sa" (sa1 and sa2); sentences "sa1" and "sa2" are different from each other but are the same across speakers. Three sentences have the prefix "si" and five have the prefix "sx"; these eight latter sentences are different from one another and differ across speakers. Frames of data corresponding to silence were removed from the utterances. The data were recorded at a sample rate of 16 kHz with a resolution of 16 bits. The features were derived from the audio recordings using 12th-order MFCC analysis plus energy, together with their deltas and delta-deltas, making up a thirty-nine-dimensional feature vector. Experiments were done on the four datasets defined in Table 1:

Table 1. Datasets and splitting
Dataset   Size (KB)   Instances   Characteristics of every utterance
W1        135         470         middle frame
W3        398         1410        three middle frames
W5        660         2350        five middle frames
W7        922         3290        seven middle frames
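As an illustration of this front-end (our sketch, assuming the librosa library; the paper does not specify its exact analysis parameters, and the file name is hypothetical), the 39-dimensional vectors could be computed as follows:

    # Illustrative 39-dimensional feature extraction (12 MFCCs plus an
    # energy-like coefficient, with deltas and delta-deltas), sketched with
    # librosa; the paper's exact front-end settings are not specified.
    import librosa
    import numpy as np

    y, sr = librosa.load("utterance.wav", sr=16000)       # hypothetical file name
    static = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13)  # 13 rows: C0 proxies energy
    delta = librosa.feature.delta(static)
    delta2 = librosa.feature.delta(static, order=2)
    features = np.vstack([static, delta, delta2]).T       # shape (frames, 39)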
To handle the SVM multi-class problem, we considered the one-versus-one classification approach. This approach avoids several problems encountered with other approaches [14]: first, it is much easier to separate two speakers than to separate one speaker from all others; second, the number of training vectors is roughly equal for both classes.

To evaluate the performance of the SVM classifier, we used the cross-validation approach. This choice was motivated by the fact that this approach has unfortunately been under-utilized in the machine learning community [10, 9]. In [18], Zribi Boujelbene et al. suggested that this approach is the most powerful way to estimate the generalization error of speech recognition systems. For our experiments, the data were randomly partitioned into ten equally sized subsets, where 90% were used for training and the remaining 10% for testing. This procedure was repeated ten times, each time with different test data. For each run $i$, the identification rate $IR_i$ was computed as

$$IR_i(\%) = \frac{\text{number of correctly identified vectors (run } i)}{\text{total number of vectors (run } i)} \times 100 \quad (10)$$

The mean of the ten $IR_i$ values gives the IR over all the data.
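A minimal sketch of this protocol (ours, using scikit-learn's StratifiedKFold; the paper's exact partitioning tool is not specified), where X and y stand in for the feature vectors and speaker labels of one dataset:

    # Illustrative 10-fold cross-validation computing Eq. (10) per fold.
    import numpy as np
    from sklearn.model_selection import StratifiedKFold
    from sklearn.svm import SVC

    def cross_validated_ir(X, y, **svm_params):
        skf = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)
        rates = []
        for train_idx, test_idx in skf.split(X, y):
            clf = SVC(**svm_params).fit(X[train_idx], y[train_idx])
            correct = (clf.predict(X[test_idx]) == y[test_idx]).sum()
            rates.append(100.0 * correct / len(test_idx))  # IR_i of Eq. (10)
        return np.mean(rates)                              # mean IR over the 10 folds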
5. Results and Discussion

Our motivation was to analyze how much the results on the different datasets depend on the kernel function of the SVM. In this paper, we focused on SVMs trained using linear, polynomial and RBF kernels, and we attempted to identify the best choice among them for this particular task. Table 2 presents the performance of SVMs trained with linear, polynomial and RBF kernels on the TIMIT speaker identification task using the different datasets.

Table 2 shows that SVMs trained using the linear kernel achieved an IR between 5.11% and 39.40%, while SVMs trained using the polynomial kernel with p = 1 achieved an IR between 5.32% and 38.89%; these two classifiers behaved almost identically whatever dataset was used. SVMs trained using the polynomial kernel with p = 10 showed a marked improvement in speaker identification performance as the number of windows increased: the IR was 6.38% for W1, 74.89% for W3, 81.79% for W5 and 77.81% for W7, so the best IR was obtained on the W5 dataset. SVMs trained using the polynomial kernel with p = 100 reported the lowest IR, 2.13%, on all datasets; this suggests that polynomial-kernel SVMs perform worse when the value of p is increased too far. SVMs trained using the RBF kernel with σ = 0.01 achieved an IR between 4.68% and 23.77%, whereas with σ = 0.1 the IR ranged between 4.25% and 68.51%. We conclude that SVMs trained using the RBF kernel with σ = 0.1 performed better than those with σ = 0.01 for W3, W5 and W7. However, SVMs trained using the polynomial kernel with p = 10 performed significantly better than SVMs trained using the RBF kernel with σ = 0.1 on all datasets.

Table 2. Data-dependent kernel selection for SVM on the TIMIT speaker identification task (IR, %)
Kernel function        W1     W3     W5     W7
Linear                 5.11   36.95  39.40  38.14
Polynomial (p = 1)     5.32   36.59  38.89  37.96
Polynomial (p = 10)    6.38   74.89  81.79  77.81
Polynomial (p = 100)   2.13   2.13   2.13   2.13
RBF (σ = 0.01)         4.68   21.35  22.98  23.77
RBF (σ = 0.1)          4.25   68.51  28.25  28.45
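For illustration, the comparison of Table 2 could be scripted as below (our sketch, reusing the hypothetical cross_validated_ir helper from Section 4; `datasets` is a placeholder for the W1-W7 feature sets). Note that scikit-learn parameterizes the RBF kernel as exp(-γ‖x - x'‖²), so σ in Eq. (7) maps to γ = 1/(2σ²), and matching Eq. (6) exactly requires gamma=1.0 and coef0=1.0:

    # Illustrative sweep over the kernel settings compared in Table 2.
    configs = [
        dict(kernel="linear"),
        dict(kernel="poly", degree=1, gamma=1.0, coef0=1.0),
        dict(kernel="poly", degree=10, gamma=1.0, coef0=1.0),
        dict(kernel="poly", degree=100, gamma=1.0, coef0=1.0),
        dict(kernel="rbf", gamma=1.0 / (2 * 0.01 ** 2)),   # sigma = 0.01, Eq. (7)
        dict(kernel="rbf", gamma=1.0 / (2 * 0.1 ** 2)),    # sigma = 0.1
    ]
    for name, (X, y) in datasets.items():                  # e.g. {"W1": (X1, y1), ...}
        for params in configs:
            print(name, params, cross_validated_ir(X, y, **params))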
It can easily be observed that SVMs trained using the polynomial kernel on the W5 dataset provided better performance than the other kernels for the speaker identification task. Table 3 shows the behavior of SVMs trained using different degrees of the polynomial kernel on the W5 dataset.

Table 3. Evaluation of the polynomial kernel for SVM on the TIMIT speaker identification task
Polynomial degree   IR (%)
2                   71.06
4                   78.47
8                   82.47
10                  81.79
14                  81.61
18                  79.57
20                  77.91

Table 3 shows that SVMs trained using the polynomial kernel with p = 8 performed better than the other polynomial kernel degrees, achieving an IR of 82.47%.
6. Conclusion

In this paper, we investigated the best choice among SVM kernels, namely the linear, polynomial and radial basis function (RBF) kernels, for text-independent speaker identification using the TIMIT corpus. Different degrees of the polynomial kernel and different widths of the RBF kernel were evaluated. To specify our feature space, we explored all sentences pronounced by all male and female speakers to ensure a multi-speaker environment. Four datasets were defined depending on the number of middle frames in each feature vector. Our results showed that the best performance was achieved using the polynomial kernel with p = 8 applied to the dataset characterized by five middle frames. In future work, we will study the performance of SVM for the speaker identification task using all dialect regions of the TIMIT corpus, and we will study the performance of other popular learning algorithms, namely Naïve Bayes, Decision Trees, Radial Basis Function Networks and Multi-Layer Perceptrons.
7. References

[1] Volker Blanz, Bernhard Schölkopf, Heinrich Bülthoff, Christopher J. C. Burges, Vladimir Vapnik, Thomas Vetter, "Comparison of View-Based Object Recognition Algorithms Using Realistic 3D Models", Artificial Neural Networks - ICANN'96, Springer Lecture Notes in Computer Science, vol. 1112, Berlin, pp. 251-256, 1996.
[2] Christopher J. C. Burges, "A Tutorial on Support Vector Machines for Pattern Recognition", Data Mining and Knowledge Discovery, vol. 2, no. 2, pp. 121-167, 1998.
[3] Christopher J. C. Burges, Bernhard Schölkopf, "Improving the Accuracy and Speed of Support Vector Learning Machines", Advances in Neural Information Processing Systems 9, MIT Press, Cambridge, MA, pp. 375-381, 1997.
[4] Corinna Cortes, Vladimir Vapnik, "Support-Vector Networks", Machine Learning, vol. 20, pp. 273-297, 1995.
[5] John Platt, "Fast Training of Support Vector Machines Using Sequential Minimal Optimization", Advances in Kernel Methods - Support Vector Learning, MIT Press, pp. 185-208, 1998.
[6] Siwar Zribi Boujelbene, Dorra Ben Ayed Mezghani, Noureddine Ellouze, "Robust Text Independent Speaker Identification Using Hybrid GMM-SVM System", Journal of Convergence Information Technology (JCIT), vol. 3, no. 2, pp. 103-110, June 2009.
[7] Siwar Zribi Boujelbene, Dorra Ben Ayed Mezghani, Noureddine Ellouze, "Improved Feature Data for Robust Speaker Identification Using Hybrid Gaussian Mixture Models - Sequential Minimal Optimization System", International Review on Computers and Software (IRECOS), vol. 4, no. 3, pp. 344-350, May 2009.
[8] Richard Courant, David Hilbert, Methods of Mathematical Physics, Interscience, 1953.
[9] Sholom M. Weiss, Casimir A. Kulikowski, Computer Systems that Learn, Morgan Kaufmann, San Mateo, CA, 1991.
[10] Mohamed Bentoumi, Gilles Millerioux, Gérard Bloch, Latifa Oukhellou, Patrice Aknin, "Classification de défauts de rail par SVM" (Rail defect classification using SVM), in Proceedings of the First IEEE International Conference on Signals, Circuits and Systems (SCS'04), Monastir, Tunisia, pp. 242-245, 2004.
[11] Herbert Gish, Michael Schmidt, "Text-Independent Speaker Identification", IEEE Signal Processing Magazine, pp. 18-32, 1994.
[12] Pedro Moreno, Purdy Ho, "A New SVM Approach to Speaker Identification and Verification Using Probabilistic Distance Kernels", in Proceedings of the European Conference on Speech Communication and Technology, pp. 2965-2968, 2003.
[13] Edgar E. Osuna, Robert Freund, Federico Girosi, "An Improved Training Algorithm for Support Vector Machines", in Proceedings of the 1997 IEEE Workshop on Neural Networks for Signal Processing, Amelia Island, FL, pp. 276-285, 1997.
[14] Michael Schmidt, "Identifying Speakers with Support Vector Networks", in Interface '96 Proceedings, Sydney, 1996.
[15] Michael Schmidt, Herbert Gish, "Speaker Identification via Support Vector Machines", in Proceedings of ICASSP, pp. 105-108, 1996.
[16] Bernhard Schölkopf, Christopher J. C. Burges, Vladimir Vapnik, "Extracting Support Data for a Given Task", in Proceedings of the First International Conference on Knowledge Discovery & Data Mining, AAAI Press, Menlo Park, CA, 1995.
[17] Siwar Zribi Boujelbene, Dorra Ben Ayed Mezghani, Noureddine Ellouze, "Applications of Combining Classifiers for Text-Independent Speaker Identification", in Proceedings of the 16th IEEE International Conference on Electronics, Circuits and Systems (ICECS 2009), Hammamet, Tunisia, pp. 723-726, December 2009.
[18] Siwar Zribi Boujelbene, Dorra Ben Ayed Mezghani, Noureddine Ellouze, "Vowel Phoneme Classification Using SMO Algorithm for Training Support Vector Machines", in Proceedings of the IEEE International Conference on Information and Communication Technologies: from Theory to Applications (ICTTA-08), Syria, pp. 1-5, 2008.
[19] Thorsten Joachims, "Text Categorization with Support Vector Machines", Technical Report LS VIII No. 23, University of Dortmund, ftp://ftp-ai.informatik.uni-dortmund.de/pub/Reports/report23.ps.Z, 1997.