International Journal of Computational Intelligence and Applications  World Scientific Publishing Company

SVM AGAINST GMM/SVM FOR DIALECT INFLUENCE ON AUTOMATIC SPEAKER RECOGNITION TASK

KAWTHAR YASMINE ZERGAT, ABDERRAHMANE AMROUCHE
Speech Com. & Signal Proc. Lab.-LCPTS, Faculty of Electronics and Computer Sciences, USTHB, Bab Ezzouar, 16 111, Algeria
Email: [email protected], [email protected]

A main challenge for current research in speech science is modeling individual variation in spoken language. Speakers have their own speaking styles, which depend on their accent and dialect as well as on their socioeconomic background. This paper uses the SVM and hybrid GMM/SVM classifiers to investigate the influence of dialect and of database size on the automatic text-independent speaker recognition task. Principal Component Analysis (PCA) is used to extract the most representative features. Experimental results show that the size of the database has an important impact on the accuracy of both the SVM and the GMM/SVM systems, which is not the case for the speaker's dialect. We also discuss the use of PCA in the front-end and back-end processing stages. Applying PCA dimensionality reduction improves recognition accuracy, especially for the hybrid GMM/SVM classifier; however, it does not yield a clear-cut observation about the effect of dialect on system performance.

Keywords: Speaker Recognition; SVM; MFCC; GMM/SVM; Size of Database; Dialect; PCA.

1. Introduction

This paper considers the text-independent speaker recognition task, the process of automatically recognizing an unknown speaker among a set of other speakers. Two cases can be distinguished: the "closed set" case, where the test segment belongs to the training database, and the "open set" case, where the test segment can come from other sources [1]. In this paper we deal with automatic text-independent speaker recognition in the open-set case, focusing on the effect of dialect and database size using the SVM [2] and the hybrid GMM/SVM [3][4] classifiers. Dialects are varieties of speech within a given language; the Oxford English Dictionary (OED) describes a dialect as "one of the subordinate forms or varieties of a language arising from local peculiarities of vocabulary, pronunciation and idiom". Several works have addressed the influence of dialect on automatic speaker recognition. In [5], the authors presented a model for parallel dialect, language, and speaker recognition on the Cell processor using Gaussian mixture models, while [6] explores the influence of dialects in automatic speaker recognition systems, determining the influence of dialectal variation in the reference population on speaker identification accuracy. In [7], multi-stream dialect classification using SVM-GMM hybrid classifiers is evaluated; after extracting dialect-dependent features, the authors compare the individual performance of each feature on a corpus of three dialects of Spanish.


On the other hand, we also investigated the influence of Principal Component Analysis (PCA) [8][9], applied to the Mel Frequency Cepstral Coefficient (MFCC) feature vectors, on both the SVM and the hybrid GMM/SVM classifiers. PCA is used to reduce the computational complexity and to speed up the training process. In our previous work [10], the hybrid GMM/SVM was used to investigate the dialect effect; the results showed that dialect has a small impact on system accuracy. In this paper, a comparative study of the SVM and GMM/SVM classifiers is carried out to better investigate the dialect effect, using the PCA dimensionality reduction algorithm on the MFCC feature vectors. This paper is organized as follows: the GMM and SVM models are presented in Sections 2 and 3, the experimental evaluation and results are given in Sections 4 and 5, and conclusions are drawn in Section 6.

2. Gaussian Mixture Model

The Gaussian Mixture Model (GMM) is a type of density model used to represent the speaker model within a probabilistic framework. GMMs are robust and easy to implement [3], and are commonly used for language identification, gender identification, and automatic speaker recognition. A mixture model λ of M multivariate Gaussians assigns to a D-dimensional cepstral vector x the likelihood

p(x | λ) = Σ_{i=1}^{M} π_i b_i(x)    (1)

where π_i are the mixture weights and b_i(x), i = 1, ..., M, are the component densities given by

b_i(x) = 1 / ((2π)^{D/2} |Σ_i|^{1/2}) · exp[ −(1/2) (x − µ_i)′ Σ_i^{−1} (x − µ_i) ]    (2)

with mean vector µ_i and covariance matrix Σ_i. The mixture weights satisfy the constraint Σ_{i=1}^{M} π_i = 1.
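As an illustration, Eqs. (1) and (2) can be evaluated directly with NumPy. This is a minimal sketch with toy mixture parameters, not the configuration used in the experiments:

```python
import numpy as np

def gaussian_density(x, mu, cov):
    """Component density b_i(x) of Eq. (2) for a full covariance matrix."""
    d = x.shape[0]
    diff = x - mu
    norm = (2 * np.pi) ** (d / 2) * np.sqrt(np.linalg.det(cov))
    return np.exp(-0.5 * diff @ np.linalg.inv(cov) @ diff) / norm

def gmm_likelihood(x, weights, means, covs):
    """Mixture likelihood p(x | lambda) of Eq. (1)."""
    return sum(w * gaussian_density(x, m, c)
               for w, m, c in zip(weights, means, covs))

# Toy 2-component, 2-D mixture; the weights sum to one as required.
weights = np.array([0.4, 0.6])
means = [np.zeros(2), np.ones(2)]
covs = [np.eye(2), 2 * np.eye(2)]
p = gmm_likelihood(np.array([0.5, 0.5]), weights, means, covs)
```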

These parameters are estimated using the Expectation-Maximization (EM) algorithm [3]. For speaker recognition, each speaker is modeled by a GMM and referred to by its model λ. The Universal Background Model (UBM) [3] is generally a large GMM trained on many speech files to represent the speaker-independent distribution of the features; its parameters (means, variances, and weights) are found with the EM algorithm. The hypothesized speaker-specific model is derived by adapting the parameters of the UBM

using the speaker's training speech and a form of Bayesian adaptation, MAP [3]. The specifics of the adaptation are as follows [3]. Given a UBM model and training vectors from the hypothesized speaker, X = {x_1, x_2, ..., x_T}, we first determine the probabilistic alignment of the training vectors with the UBM mixture components. That is, for mixture i in the UBM, we compute

Pr(i | x_t) = π_i b_i(x_t) / Σ_{j=1}^{M} π_j b_j(x_t)    (3)

n_i(X) = Σ_{t=1}^{T} Pr(i | x_t)    (4)

E_i(X) = (1 / n_i(X)) Σ_{t=1}^{T} Pr(i | x_t) x_t    (5)

This is the same as the expectation step of the EM algorithm. Finally, these new sufficient statistics computed from the training data are combined with the old UBM sufficient statistics for mixture i to create the adapted parameters of mixture i:

µ̂_i = α_i E_i(X) + (1 − α_i) µ_i,   i = 1, ..., M    (6)

α_i = n_i(X) / (n_i(X) + r)    (7)

where r is a fixed relevance factor, set to 16 [4].

3. Support Vector Machine

The Support Vector Machine (SVM) [2] is a powerful discriminative classifier based on minimizing the generalization error; it has been used successfully in pattern recognition tasks such as speaker recognition. The SVM fits an Optimal Separating Hyperplane (OSH) between classes by focusing on the training samples that lie at the edge of the class distributions, the support vectors, and separates the classes with a maximum-margin hyperplane. The SVM decision function is constructed from sums of a kernel function K(·, ·) as follows:

f(x) = sign( Σ_{i=1}^{N} α_i t_i K(x, x_i) + b )   with   Σ_{i=1}^{N} α_i t_i = 0    (8)

where t_i are the ideal outputs, x_i are the support vectors drawn from the training data, α_i are the Lagrange multipliers, and b is the bias. When the data are not linearly separable in the finite-dimensional input space, a kernel function K(·, ·) is used; this leads to

4

Kawthar Yasmine ZERGAT, Abderrrahmane AMROUCHE

an easier separation of the two classes by a hyperplane, since a linear hyperplane in the high-dimensional kernel feature space corresponds to a nonlinear decision boundary in the original input space. More details can be found in Vapnik's book [11] and Burges' tutorial [12]. The Radial Basis Function (RBF) and polynomial kernels are the most commonly used, and take respectively the following forms:

K(x, x_i) = exp( −γ ||x − x_i||² )    (9)

K(x_i, x_j) = (x_i · x_j + 1)^d    (10)

where γ is the width of the Radial Basis Function and d is the order of the polynomial. Another approach that has become popular is the hybrid system. Its main goal is to exploit the complementary information that the traditional GMM method provides to the SVM model. In this approach, instead of using the MFCC features directly, the adapted Gaussian means of the mixture components, obtained from the Universal Background Model (UBM) by maximum a posteriori (MAP) adaptation, are used as input to the SVM, which performs the discrimination and decision task.

4. Experimental Protocol

4.1. Corpora

The corpus used in this work is derived from the TIMIT database [13], one of the first corpora available with a large number of speakers, which has been used in many speaker recognition studies. This database includes phonetic and word transcriptions as well as a 16-bit, 16 kHz speech file for each utterance, recorded in ".ADC" format. It consists of a set of 8 sentences of about 3 s each, spoken by 491 speakers of English divided into 8 dialect regions (Dr1 to Dr8) of the United States. We selected 5 phonetically rich sentences (SX recordings) for training and 3 other utterances (SI sentences), different from the previous ones, for testing; in this way, the text independence of the speaker recognition task is preserved. Table 1 lists the TIMIT subsets:

Table 1. TIMIT subsets.

Subset   Dialect          Number of speakers
Dr1      New England      47
Dr2      Northern         90
Dr3      North Midland    86
Dr4      South Midland    65
Dr5      Southern         65
Dr6      New York City    47
Dr7      Western          66
Dr8      Army Brat        25

4.2. Experiments

To study the influence of dialect on speaker recognition performance, four systems were developed, as follows:

Fig. 1. Model of the SVM system (process A).

In all experiments the signal pre-processing is as follows. First, a Voice Activity Detection (VAD) technique, widely used in speech and speaker recognition applications, is applied. For a given speech utterance, the energy of every frame is computed and an empirical threshold is determined from the maximum frame energy; each segment is then classified as speech or silence, and the silent (non-speech) segments are removed. Success in speaker recognition depends on extracting and modeling the speaker-dependent characteristics of the speech signal that effectively distinguish between talkers. In this work, 12 MFCCs, augmented with their delta and double-delta cepstral coefficients, are extracted to form 36-dimensional feature vectors. These features are computed with a 20 ms Hamming window shifted by 10 ms; the window tapers the signal at the frame boundaries and thus reduces edge effects [6]. Finally, Cepstral Mean Subtraction (CMS) [14] is applied, subtracting the cepstral mean from the feature vectors in order to center the data around their average. The SVM is based on the principle of structural risk minimization and is considered well suited to classification, which is why it is used in our work. The main difficulty of the SVM classifier is setting its optimal parameters (C, γ) to achieve the lowest misclassification rate; they are determined during the training phase, while the final testing phase evaluates the robustness of the classifier. To calculate the classification function class(x) in the

6

Kawthar Yasmine ZERGAT, Abderrrahmane AMROUCHE

SVM model, the Radial Basis Function (RBF) kernel was used; all the results presented were obtained with this kernel. In this paper, the SVM was trained directly on the acoustic space, which characterizes the client data and the impostor data; 40 unknown speakers were used to represent the impostors in the recognition task. In the second system, the PCA-SVM model, the goal is to study the effect of reducing the dimension of the feature vectors on recognition accuracy. Principal Component Analysis (PCA) is applied to the feature vectors in the front-end part of the Automatic Speaker Recognition (ASR) system, for each speaker independently, in order to better represent the intra-speaker variability and to reduce the effective size of the input data (MFCCs), as illustrated in the following figure.

Fig. 2. Model of the PCA-SVM system (process B).
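The front-end PCA step of process B can be sketched as a plain NumPy eigendecomposition of the feature covariance; the reduced dimension k below is chosen for illustration only, not taken from the paper:

```python
import numpy as np

def pca_reduce(X, k):
    """Project feature vectors onto the k leading principal components.
    X is (n_frames, n_dims); returns (n_frames, k)."""
    Xc = X - X.mean(axis=0)                 # centre the point cloud
    cov = np.cov(Xc, rowvar=False)
    eigvals, eigvecs = np.linalg.eigh(cov)  # eigenvalues in ascending order
    top = eigvecs[:, np.argsort(eigvals)[::-1][:k]]
    return Xc @ top

# 36-dimensional MFCC+delta features reduced to 20 components (k illustrative).
X = np.random.randn(500, 36)
Z = pca_reduce(X, 20)
```

Because the projection directions are eigenvectors of the sample covariance, the retained components are mutually uncorrelated, which is the decorrelation property discussed in Section 5.2.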

In the next experiment, the hybrid GMM/SVM approach is evaluated. The SVM algorithm (discriminative part) is used to classify supervectors built from GMM parameters (generative part). Gaussian mixture models with M = 32 mixtures were used, 491 speaker models were trained, and the parameter α_i was calculated as in Eq. (7). For GMM-MAP training, only the mean values of the Gaussian components were adapted, with a relevance factor of 16; the weight vector and the covariance matrix were not modified. A gender-balanced UBM consisting of 2048 mixture components was trained using the EM algorithm. The UBM models the general acoustic space of 120 unknown speakers (impostors), 60 male and 60 female, each uttering five different sequences. We then trained an SVM model using the target GMM supervectors and an SVM background consisting of the GMM supervectors of 40 impostors labeled as (−1), for scoring as follows.


Fig. 3. Model of the GMM-SVM system (process C).
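The supervector construction at the heart of process C, MAP adaptation of the UBM means (Eqs. (3)-(7)) followed by stacking, can be sketched as follows. Diagonal covariances are assumed for simplicity, and the mixture count, dimensions, and data are toy values rather than the paper's configuration:

```python
import numpy as np

def map_adapt_means(X, weights, means, covs, r=16.0):
    """MAP adaptation of the UBM means, Eqs. (3)-(7), with diagonal
    covariances.  Returns the (M, D) matrix of adapted means."""
    M, D = means.shape
    # Eq. (3): posterior alignment of each frame to each mixture.
    log_dens = np.stack([
        -0.5 * np.sum((X - means[i]) ** 2 / covs[i]
                      + np.log(2 * np.pi * covs[i]), axis=1)
        for i in range(M)])                          # shape (M, T)
    dens = weights[:, None] * np.exp(log_dens)
    post = dens / dens.sum(axis=0, keepdims=True)
    n = post.sum(axis=1)                             # Eq. (4)
    E = (post @ X) / n[:, None]                      # Eq. (5)
    alpha = n / (n + r)                              # Eq. (7)
    return alpha[:, None] * E + (1 - alpha)[:, None] * means  # Eq. (6)

# Toy UBM with M = 4 mixtures in D = 3 dimensions; stacking the adapted
# means gives the GMM supervector that is fed to the SVM.
rng = np.random.default_rng(0)
weights = np.full(4, 0.25)
means = rng.standard_normal((4, 3))
covs = np.ones((4, 3))
X = rng.standard_normal((200, 3))
supervector = map_adapt_means(X, weights, means, covs).ravel()
```

Note how the relevance factor r controls the balance in Eq. (6): mixtures with much data (large n_i) move toward the data mean E_i(X), while rarely visited mixtures stay close to the UBM mean.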

In the PCA-GMM/SVM system, we applied PCA dimensionality reduction to the adapted means in the back-end processing. PCA identifies the principal components of the sample, i.e., the directions of projection of the point cloud that carry the maximum variance, as follows.

Fig. 4. Model of the PCA-GMM- SVM system (process D).
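For completeness, the SVM scoring shared by all four processes, the decision function of Eq. (8) with the RBF kernel of Eq. (9), can be sketched with toy support vectors; γ, the multipliers, and the bias below are illustrative, not the tuned values:

```python
import numpy as np

def rbf_kernel(x, xi, gamma=0.5):
    """RBF kernel of Eq. (9); gamma is an illustrative width."""
    return np.exp(-gamma * np.sum((x - xi) ** 2))

def svm_decision(x, support_vectors, alphas, targets, b):
    """Decision function of Eq. (8): sign of the kernel expansion."""
    s = sum(a * t * rbf_kernel(x, sv)
            for a, t, sv in zip(alphas, targets, support_vectors))
    return np.sign(s + b)

# Toy expansion with two support vectors of opposite classes; the
# multipliers satisfy the constraint sum(alpha_i * t_i) = 0.
svs = [np.array([0.0, 0.0]), np.array([2.0, 2.0])]
alphas, targets, b = [1.0, 1.0], [+1, -1], 0.0
label = svm_decision(np.array([0.2, 0.1]), svs, alphas, targets, b)
```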

For the classification phase, an Automatic Speaker Recognition (ASR) system predetermines threshold values for its False Acceptance Rate (FAR) and its False Rejection Rate (FRR). The False Acceptance Rate, or False Match Rate, is the probability that the system incorrectly matches the input pattern to a non-matching template in the database, while the False Rejection Rate, or False Non-Match Rate, is the probability that the system fails to detect a match between the input pattern and a matching template in the database. When the two rates are equal, their common value is the Equal Error Rate (EER): the proportion of false acceptances equals the proportion of false rejections. The lower the EER, the higher the accuracy of the ASR system.
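Under these definitions, the EER can be estimated from sets of genuine and impostor trial scores by sweeping a decision threshold; this sketch uses synthetic Gaussian scores rather than real system outputs:

```python
import numpy as np

def equal_error_rate(target_scores, impostor_scores):
    """Sweep a threshold over all observed scores and return the EER,
    the point where FAR (impostors accepted) meets FRR (targets rejected)."""
    thresholds = np.sort(np.concatenate([target_scores, impostor_scores]))
    far = np.array([(impostor_scores >= t).mean() for t in thresholds])
    frr = np.array([(target_scores < t).mean() for t in thresholds])
    idx = np.argmin(np.abs(far - frr))
    return (far[idx] + frr[idx]) / 2

rng = np.random.default_rng(1)
targets = rng.normal(2.0, 1.0, 1000)    # genuine-trial scores
impostors = rng.normal(0.0, 1.0, 1000)  # impostor-trial scores
eer = equal_error_rate(targets, impostors)
```

The better separated the two score distributions, the lower the EER, which is why the figures in Section 5 report system accuracy as EER (%).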

5. Experimental Results

5.1. Speaker Recognition using SVM and GMM-SVM

To evaluate the influence of dialect and database size on Automatic Speaker Recognition (ASR), a comparative study of the SVM and GMM-SVM systems was carried out. The following figure presents the results, in terms of EER, for the different dialects and the different sizes of the subsets contained in the TIMIT dataset.

Fig. 5. Results in terms of EER (%).

Figure 5 shows the EER (%) of the speaker recognition system with both the SVM and GMM/SVM classifiers. As expected, in most cases the GMM-SVM outperforms the SVM system; for example, the EER obtained with the SVM model on the Dr8 subset is 26.89%, whereas it is below 22.1% for the hybrid GMM/SVM system. Even though three subsets of the TIMIT corpus, Dr4, Dr5 and Dr7, have almost the same number of speakers but different dialects, the GMM-SVM and SVM accuracies are nearly the same across these subsets. For example, for the SVM classifier, the EER on Dr4 (dialect: South Midland, 65 speakers) is 8.8%, on Dr5 (dialect: Southern, 65 speakers) it is 8.71%, and on Dr7 (dialect: Western, 66 speakers) it is 8.18%. On the other hand, a difference in accuracy of about 1.3% is observed with the SVM classifier between the Dr1 and Dr6 subsets: the EER on Dr1 (dialect: New England, 47 speakers) is 14.83%, while it is 16.4% on Dr6 (dialect: New York City, 47 speakers). We can therefore say that the SVM is more sensitive to speaker dialect than the hybrid GMM/SVM classifier. However, the number of speakers has a strong influence on the verification rate of both classifiers: the larger the number of speakers, the lower the EER. This is clearly seen by comparing Dr8 (dialect: Army Brat, 25 speakers), where the EER is 26.89%, with Dr2 (dialect: Northern, 90 speakers), where the EER is 6.93% for the SVM classifier.

5.2. Influence of the PCA algorithm on the Speaker Recognition task

The main goal of the experiments in this section is to evaluate the recognition performance of both the SVM and GMM/SVM classifiers when PCA dimensionality reduction is used. The results are shown in the following figure.

Fig. 6. Results in terms of EER (%).

From the above figure, we can observe that applying PCA dimensionality reduction to the feature vectors increases the accuracy of both the SVM and the hybrid GMM/SVM classifiers. This can be explained by the redundancy of the speech signal [14]: the signal contains highly informative elements alongside less significant samples that are repeated along the signal (MFCCs), so without reduction the system is trained on unnecessary samples, which costs time, performance, and reliability. The PCA algorithm transforms the features into an orthogonal feature space and allows the low-weight transformed features to be dropped, improving performance by removing correlations between variables. On the other hand, with the PCA-SVM system, we can better observe the impact of dialect on the SVM classifier for the different subsets that have the same number of speakers but different dialects. Another observation from the results is that applying PCA in the front-end processing of the hybrid system yields better performance than that obtained with the SVM system, because in the hybrid GMM/SVM system the inputs of the SVM classifier are means adapted using the UBM model. By computing the means of the reduced MFCC feature vectors, we can better observe the benefit of eliminating redundant information from the input feature vectors (MFCCs) with the PCA algorithm.


6. Conclusion

The SVM and GMM/SVM classifiers were used to study the effect of dialectal variation in the reference population on speaker recognition accuracy. First, a comparative study of the SVM and GMM/SVM classifiers on the TIMIT corpus was presented; the results confirm that the hybrid GMM/SVM system outperforms the SVM one. The effect of dialect and of database size on the performance of Automatic Speaker Recognition (ASR) systems was then studied. The results show that the dialect effect varies from one classifier to another: the SVM classifier is more sensitive to some dialects than to others, which is not the case for the hybrid GMM/SVM model. However, the size of the database (the number of speakers) strongly affects the accuracy of both systems; as shown in this paper, the best performance is obtained when the database is large. From these observations, we cannot conclude whether or not dialect has an impact on ASR systems. The paper also evaluated the influence of applying PCA dimensionality reduction in the front-end and back-end parts of ASR systems. PCA input dimensionality reduction was shown to improve both the SVM and GMM/SVM classifiers, especially for large populations and varied speaker dialects. In future work, other types of features, such as prosodic and voice quality features, can be fused with the proposed method to further improve speaker recognition accuracy.

References

1. L. Jian, S. Shuifa, Z. Jianwei and L. Xiaoli, Speaker recognition with VAD, Sch. of Inf. Eng., Zhejiang Univ. of Media & Commun. (Hangzhou, China, 2009).
2. R. Dehak, N. Dehak, P. Kenny and P. Dumouchel, Linear and non linear kernel GMM supervector machines for speaker verification, in Proc. Interspeech 2007 (Antwerp, Belgium, 2007), pp. 302-305.
3. D. A. Reynolds, T. F. Quatieri and R. B. Dunn, Speaker verification using adapted Gaussian mixture models, Digital Signal Processing 10 (2000), pp. 19-41.
4. W. M. Campbell, D. E. Sturim, D. A. Reynolds and A. Solomonoff, SVM based speaker verification using a GMM supervector kernel and NAP variability compensation, in Proc. ICASSP (2006).
5. N. Malyska, S. Mohindra, D. A. Reynolds and J. Kepner, Language, dialect, and speaker recognition using Gaussian mixture models on the Cell processor, High Performance Embedded Computing (Lexington, 2008).
6. A. Moreno, M. García-Gomar, E. Martínez and J. Castaño, The influence of dialects in automatic speaker recognition systems, in IAFPA (2006).
7. R. Chitturi and J. H. L. Hansen, Multi-stream dialect classification using SVM-GMM hybrid classifiers (2007), pp. 431-436.
8. C. Hanilci and F. Ertas, VQ-UBM based speaker verification through dimension reduction using local PCA, in Proc. 19th European Signal Processing Conference (EUSIPCO) (Spain, 2011).
9. K. Y. Lee, Local fuzzy PCA based GMM with dimension reduction on speaker identification, Pattern Recognition Letters 25 (2004), pp. 1811-1817.
10. K. Y. Zergat, A. Amrouche and N. Asbai, Effect of dialect, size of population and PCA on speaker verification performance, in Proc. 5th International Conference on ICT for the Amazigh Language, TICAM (Morocco, 2012).
11. V. Vapnik, Statistical Learning Theory, John Wiley (New York, 1998).
12. C. J. C. Burges, A tutorial on support vector machines for pattern recognition, Data Mining and Knowledge Discovery 2 (1998), pp. 1-47.
13. J. Garofolo, L. Lamel, W. Fisher, J. Fiscus, D. Pallett, N. Dahlgren and V. Zue, TIMIT Acoustic-Phonetic Continuous Speech Corpus, Linguistic Data Consortium (Philadelphia, 1993).
14. T. Kinnunen and H. Li, An overview of text-independent speaker recognition: from features to supervectors, Speech Communication 52 (2010), pp. 12-40.