Document not found! Please try again

Perfermances Evaluation of GMM-UBM and GMM-SVM for Speaker

0 downloads 0 Views 165KB Size Report
SVMs [3],[4] with the efficient and effective speaker representation offered by. GMM-UBM [5],[6] to obtain hybrid GMM-SVM system [7],[8]. The focus of this paper ...
Perfermances Evaluation of GMM-UBM and GMM-SVM for Speaker Recognition in Realistic World Nassim Asbai, Abderrahmane Amrouche, and Mohamed Debyeche Speech Communication and Signal Processing Laboratory, Faculty of Electronics and Computer Sciences, USTHB, P.O. Box 32, El Alia, Bab Ezzouar, 16111, Algiers, Algeria {asbainassim,mdebyeche}@gmail.com, [email protected]

Abstract. In this paper, an automatic speaker recognition system for realistic environments is presented. In fact, most of the existing speaker recognition methods, which have shown to be highly efficient under noise free conditions, fail drastically in noisy environments. In this work, features vectors, constituted by the Mel Frequency Cepstral Coefficients (MFCC) extracted from the speech signal are used to train the Support Vector Machines (SVM) and Gaussian mixture model (GMM). To reduce the effect of noisy environments the cepstral mean subtraction (CMS) are applied on the MFCC. For both, GMM-UBM and GMM-SVM systems, 2048-mixture UBM is used. The recognition phase was tested with Arabic speakers at different Signal-to-Noise Ratio (SNR) and under three noisy conditions issued from NOISEX-92 data base. The experimental results showed that the use of appropriate kernel functions with SVM improved the global performance of the speaker recognition in noisy environments. Keywords: Speaker recognition, Noisy environment, MFCC, GMMUBM, GMM-SVM.

1

Introduction

Automatic speaker recognition (ASR) has been the subject of extensive research over the past few decades [1]. These can be attributed to the growing need for enhanced security in remote identity identification or verification in such applications as telebanking and online access to secure websites. Gaussian Mixture Model (GMM) was the state of the art of speaker recognition techniques [2]. The last years have witnessed the introduction of an effective alternative speaker classification approach based on the use of Support Vector Machines (SVM) [3]. The basis of the approach is that of combining the discriminative characteristics of SVMs [3],[4] with the efficient and effective speaker representation offered by GMM-UBM [5],[6] to obtain hybrid GMM-SVM system [7],[8]. The focus of this paper is to investigate into the effectiveness of the speaker recognition techniques under various mismatched noise conditions. The issue of the Arabic language, customary in more than 300 million peoples around the B.-L. Lu, L. Zhang, and J. Kwok (Eds.): ICONIP 2011, Part II, LNCS 7063, pp. 284–291, 2011. c Springer-Verlag Berlin Heidelberg 2011 

Perfermances Evaluation of GMM-UBM and GMM-SVM

285

world, which remains poorly endowed in language technologies, challenges us and dictates the choice of a corpus study in this work. The remainder of the paper is structured as follows. In sections 2 and 3, we discuss the GMM and SVM classification methods and briefly describe the principles of GMM-UBM at section 4. In section 5, experimental results of the speaker recognition in noisy environment using GMM, SVM and GMM-SVM systems based using ARADIGITS corpora are presented. Finally, a conclusion is given in Section 6.

2

Gaussian Mixture Model (GMM)

In GMM model [9], there exist k underlying components {ω1 , ω2 , ..., ωk } in a d-dimensional data set. Each component follows some Gaussian distribution in the space. The parameters of the component ωj include λj = {μj , Σ1 , ..., πj } , in which μj = (μj [1], ..., μj [d]) is the center of the Gaussian distribution, Σj is the covariance matrix of the distribution and πj is the probability of the component ωj . Based on the parameters, the probability of a point coming from component ωj appearing at xj = x[1], ..., x[d] can be represented by Pr(x/λj ) =

−1  1 T exp{− ) (x − μ (x − μj )}  j −1 2 (2π)d/2 | j |

1

(1)

Thus, given the component parameter set {λ1 , λ2 , ..., λk } but without any component information on an observation point , the probability of observing is estimated by k  P r(x/λj )πj (2) Pr(x/λj ) = j=1

The problem of learning GMM is estimating the parameter set λ of the k component to maximize the likelihood of a set of observations D = {x1 , x2 , ..., xn }, which is represented by n Pr(D/λ) = Πi=1 P r(xi /λ)

3

(3)

Support Vector Machines (SVM)

SVM is a binary classifier which models the decision boundary between two classes as a separating hyperplane. In speaker verification, one class consists of the target speaker training vectors (labeled as +1), and the other class consists of the training vectors from an ”impostor” (background) population (labeled as -1). Using the labeled training vectors, SVM optimizer finds a separating hyperplane that maximizes the margin of separation between these two classes. Formally, the discriminate function of SVM is given by [4]: N  αi ti K(x, xi ) + d] f (x) = class(x) = sign[ i=1

(4)

286

N. Asbai, A. Amrouche, and M. Debyeche

 Here ti ε{+1, −1} are the ideal output values, N i=1 αi ti = 0 and αi > 0 ¿ 0. The support vectors xi , their corresponding weights αi and the bias term d, are determined from a training set using an optimization process. The kernel function K(, ) is designed so that it can be expressed as K(x, y) = Φ(x)T Φ(y) where Φ(x) is a mapping from the input space to kernel feature space of high dimensionality. The kernel function allows computing inner products of two vectors in the kernel feature space. In a high-dimensional space, the two classes are easier to separate with a hyperplane. To calculate the classification function class (x) we use the dot product in feature space that can also be expressed in the input space by the kernel [13]. Among the most widely used cores we find: – Linear kernel: K(u, v) = u.v; – Polynomial kernel: K(u, v) = [(u.v) + 1]d ; – RBF kernel: K(u, v) = exp(−γ|u.v|2 ). SVMs were originally designed primarily for binary classification [11]. Their extension problem of multi-class classification is still a research topic. This problem is solved by combining several binary SVMs. One against all: This method constructs K SVMs models (one SVM for each class). The ith SVM is learned with all the examples. The ith class is indexed with positive labels and all others with negative labels. This ith classifier builds hyperplane between the ith class and other K -1 class. One against one: This method constructs K(K − 1)/2 classifiers where each is learned on data from two classes. During the test phase and after construction of all classifiers, we use the proposed voting strategy.

4

GMM-UBM and GMM-SVM Systems

The GMM-UBM [2] system implemented for the purpose of this study uses MAP [12] estimation to adapt the parameters of each speaker GMM from a clean gender balanced UBM. For the purpose of consistency, a 2048-mixture UBM is used for both GMM-UBM and GMM-SVM systems. In the GMM-SVM system, the GMMs are obtained from training, testing and background utterances using the same procedure as that in the GMM-UBM system. Each client training supervector is assigned a label of +1 whereas the set of supervectors from a background dataset representing a large number of impostors is given a label of -1.The procedure used for extracting supervectors in the testing phase is exactly the same as that in the training stage (in the testing phase, no labels are given to the supervectors).

5 5.1

Results and Discussion Experimental Protocol and Data Collection

Arabic digits, which are polysyllabic, can be considered as representative elements of language, because more than half of the phonemes of the Arabic language are included in the ten digits. The speech database used in this work is

Perfermances Evaluation of GMM-UBM and GMM-SVM

287

a part of the database ARADIGITS [13]. It consists of a set of 10 digits of the Arabic language (zero to nine) spoken by 60 speakers of both genders with three repetitions for each digit. This database was recorded by speakers from different regions Algerians aged between 18 and 50 years in a quiet environment with an ambient noise level below 35 dB, in WAV format, with a sampling frequency equal to 16 kHz. To simulate the real environment we used noises extracted from the database Noisex-92 (NATO: AC 243/RSG 10). In parameterization phase, we specified the feature space used. Indeed, as the speech signal is dynamic and variable, we presented the observation sequences of various sizes by vectors of fixed size. Each vector is given by the concatenation of the coefficients mel cepstrum MFCC (12 coefficients), these first and second derivatives (24 coefficients), extracted from the middle window every 10 ms. A cepstral mean subtraction (CMS) is applied to these features in order to reduce the effect of noise. 5.2

Speaker Recognition in Quiet Environment Using GMM and SVM

The experimental results, given in Fig.1, show that the performances are better for males speakers (98, 33%) than females (96, 88%). The recognition rate is better for a GMM with k = 32 components (98.19%) than other GMMs with other numbers of components. Now, if we compare between the performances of classifiers (GMM and SVM), we note that GMM with k = 32 components yields better results than SVM (linear SVM (88.33%), SVM with RBF kernel (86.36%) and SVM with polynomial kernel with degree d = 2 (82.78%)). 5.3

Speaker Recognition in Noisy Environments Using GMM and SVM

In this part we add noises (of factory and military engine) extracted from the NATO base NOISEX’92 (Varga), to our test database ARADIGITS that con-

Fig. 1. Histograms of the recognition rate of different classifiers used in a quiet environment

288

N. Asbai, A. Amrouche, and M. Debyeche

taining 60 speakers (30 male and 30 female). From the results presented in Fig.2 and Fig.3, we find that the SVMs are more robust than the GMM. For example, recognition rate equal to 67.5%.(for SVN using polynomial kernel with d=2). than GMM used in this work. But, in other noise (factory noise) we find that GMM (with k=32) gives better performances (recognition rate equal to 61.5% with noise of factory at SNR = 0dB) than SVM. This implies that SVMs and GMM (k=32) are more suitable for speaker recognition in a noisy environment and also we note that the recognition rate varies from noise to another. As that as far as the SNR increases (less noise), recognition is better.

Fig. 2. Performances evaluation for speaker recognition systems in noisy environment corrupted by noise of factory

Fig. 3. Performances evaluation for speaker recognition systems in noisy environment corrupted by military engine

5.4

Speaker Recognition in Quiet Environment Using GMM-UBM and GMM-SVM

The result in terms of equal-error rate (EER) shown by DET curve (Detection Error trade-off curve) showed in Fig.4: 1. When the GMM supervector is used, with MAP estimation [12], as input to the SVMs, the EER is 2.10%. 2. When the GMM-UBM is used the EER is 1.66%. In the quiet environment, we can say that, the performances of GMM-UBM and GMM-SVM are almost similar with a slight advantage for GMM-UBM.

Perfermances Evaluation of GMM-UBM and GMM-SVM

289

Fig. 4. DET curve for GMM-UBM and GMM-SVM

5.5

Speaker Recognition in Noisy Environments Using GMM-UBM and GMM-SVM

The goal of the experiments doing in this section is to evaluate the recognition performances of GMM-UBM and GMM-SVM when the quality of the speech data is contaminated with different levels of different noises extracted from the NOISEX’92 database. This provides a range of speech SNRs (0, 5, and 10 dB). Table 1 and 2 present the experimental results in terms of equal error rate (EER) in real world. As expected, it is seen that there is a drop in accuracy for this approaches with decreasing SNR. Table 1. EER in speaker recognition experiments with GMM-UBM method under mismatched data condition using different noises

The experimental results given in Table 1 and 2 show that the EERs for GMM-SVM are higher for mismatched conditions noise. We can observe that, the difference between EERs in clean and noisy environment for two systems GMM-UBM and GMM-SVM. So, it is noted that again, the usefulness of GMMSVM in reducing error rates is noisy environment against GMM-UBM.

290

N. Asbai, A. Amrouche, and M. Debyeche

Table 2. EERs in speaker recognition experiments with GMM-SVM method under mismatched data condition using different noises

5.6

Conclusion

The aim of our study in this paper was to evaluate the contribution of kernel methods in improving system performance of automatic speaker recognition (RAL) (identification and verification) in the real environment, often represented by an acoustic environment highly degraded. Indeed, the determination of physical characteristics discriminating one speaker from another is a very difficult task, especially in adverse environment. For this, we developed a system of automatic speaker recognition on text independent mode, part of which recognition is based on classifier using kernel functions, which are alternatively SVM (with linear, polynomial and radial kernels) and GMM. On the other hand, we used GMM-UBM, especially the system hybrid GMMSVM, which the vector means extracted from GMM-UBM with 2048 mixtures for UBM in step of modeling are inputs for SVMs in phase of decision. The results we have achieved conform all that SVM and SVM-GMM techniques are very interesting and promising especially for tasks such as recognition in a noisy environments.

References 1. Dong, X., Zhaohui, W.: Speaker Recognition using Continuous Density Support Vector Machines. Electronics Letters 37, 1099–1101 (2001) 2. Reynolds, D.A., Quatiery, T., Dunn, R.: Speaker Verification Using Adapted Gaussian Mixture Models. Dig. Signal Process. 10, 19–41 (2000) 3. Cristianni, N., Shawe-Taylor, J.: An Introduction to Support Vector Machines and Other Kernel-Based Learning Methods. Cambridge University Press (2000) 4. Wan, V.: Speaker Verification Using Support Vector Machines, Ph.D Thesis, University of Sheffield (2003) 5. Campbel, W.M., Sturim, D.E., Reynolds, D.A.: Support Vector Machines Using GMM Supervectors for Speaker Verification. IEEE Signal Process. Lett. 13(5), 115–118 (2006) 6. Minghui, L., Yanlu, X., Zhigiang, Y., Beigian, D.: A New Hybrid GMM/SVM for Speaker Verification. In: Proc. Int. Conf. Pattern Recognition, vol. 4, pp. 314–317 (2006)

Perfermances Evaluation of GMM-UBM and GMM-SVM

291

7. Campbel, W.M., Sturim, D.E., Reynolds, D.A., Solomonoff, A.: SVM Based Speaker Verification Using a GMM Supervector Kernel and NAP Variability Compensation. In: Proc. IEEE Conf. Acoustics, Speech and Signal Processing, vol. 1, pp. 97–100 (2007) 8. Dehak, R., Dehak, N., Kenny, P., Dumouchel, P.: Linear and Non Linear Kernel GMM Supervector Machines for Speaker Verification. In: Proc. Interspeech, pp. 302–305 (2007) 9. McLachlan, G., Peel, D.: Finite Mixture Models. Wiley-Interscience (2000) 10. Moreno, P.J., Ho, P.P., Vasconcelos, N.: A Generative Model Based Kernel for SVM Classification in Multimedia Applications. In: Neural Informations Processing Systems (2003) 11. Cortes, C., Vapnik, V.: Support-Vector Networks. Machine Learning 20(3), 273– 297 (1995) 12. Ben, M., Bimbot, F.: D-MAP: a Distance-Normalized MAP Estimation of Speaker Models for Automatic Speaker Verification. In: Proc. IEEE Conf. Acoustics, Speech and Signal Processing, vol. 2, pp. 69–72 (2008) 13. Amrouche, A., Debyeche, M., Taleb Ahmed, A., Rouvaen, J.M., Ygoub, M.C.E.: Efficient System for Speech Recognition in Adverse Conditions Using Nonparametric Regression. Engineering Applications on Artificial Intelligence 23(1), 85–94 (2010)

Suggest Documents