Speaker Identification and Verification using Support Vector Machines and Sparse Kernel Logistic Regression

Marcel Katz, Sven E. Krüger, Martin Schafföner, Edin Andelic, Andreas Wendemuth
IESK, Cognitive Systems, University of Magdeburg, Germany
[email protected]

Abstract. In this paper we investigate two discriminative classification approaches for frame-based speaker identification and verification, namely Support Vector Machines (SVMs) and Sparse Kernel Logistic Regression (SKLR). SVMs have already shown good results in regression and classification in several fields of pattern recognition as well as in continuous speech recognition. While the non-probabilistic output of the SVM has to be translated into conditional probabilities, the SKLR produces these probabilities directly. In speaker identification and verification experiments both discriminative classification methods outperform the standard Gaussian Mixture Model (GMM) system on the POLYCOST database.

1 Introduction

Speaker recognition is already in widespread use, e.g. for access control to restricted areas or in telephone banking, where it is important to verify that the person giving the credit card number is actually the owner of the card. The field of speaker recognition can be divided into two tasks, namely speaker identification and speaker verification. In speaker identification there is a set of known speakers or clients (closed set) and the problem is to decide which person from this set is talking. In speaker verification the system has to verify whether a person is who he or she claims to be (open set). The main difficulty in this setup is that the system has to deal with both known and unknown speakers, so-called clients and impostors, respectively.

In speech and speaker recognition, Gaussian mixtures are usually a good choice for modeling the distribution of the speech samples, both in continuous speech recognition with multi-state left-to-right Hidden Markov Models (HMMs) and in text-independent approaches with single-state HMMs [1]. However, good performance of GMMs depends on a sufficient amount of data for parameter estimation. In speaker recognition the amount of speech data available for each client is limited: normally only a few seconds of speech are available, so the parameter estimation might not be very robust. Especially when data is limited, SVMs show good generalization and better classification performance than GMMs. SVMs have also been integrated into HMMs to model the acoustic feature vectors in continuous speech recognition [2].

There have been several approaches to integrating SVMs into speaker verification systems. One method is to discriminate not between frames but between entire utterances. Since these utterances have different lengths, a mapping from a variable-length pattern to a fixed-size vector is needed. Methods like the Fisher kernel [3] map the score sequences produced by the GMMs into a high-dimensional score space, where an SVM is then used to classify the data in a second step, e.g. [4].

2 GMM based Speaker Recognition

In the past several years there has been a lot of progress in the field of speaker recognition [5][1]. State-of-the-art systems are based on a speaker-independent universal background model (UBM) which is trained on the speech of a large number of different speakers. For each client of the system a speaker-dependent GMM is then derived from the background model by adapting the parameters of the UBM with a maximum a posteriori (MAP) approach [6]. The GMM is a weighted sum of M component Gaussian densities:

p(x|\lambda) = \sum_{i=1}^{M} c_i \, \mathcal{N}(x \mid \mu_i, \Sigma_i)    (1)

where c_i is the weight of the i-th component and N(x|μ_i, Σ_i) is a multivariate Gaussian with mean vector μ_i and covariance matrix Σ_i. The mixture model is trained using standard methods like the Expectation Maximization (EM) algorithm. Finally, the probability that a test sentence X = {x_1, ..., x_N} is generated by the specific speaker model λ is calculated as the log-likelihood over the whole sequence:

\log P(X|\lambda) = \sum_{i=1}^{N} \log P(x_i|\lambda).    (2)

Given a set of speakers, the task of speaker identification is to deduce the most likely client from a given speech sequence:

\hat{C} = \operatorname*{argmax}_{k} P(X|\lambda_k).    (3)
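As an illustration of equations (1)-(3), the following minimal sketch scores a test sequence against a set of speaker GMMs and picks the most likely client. It is not the original experimental code; the helper names and the use of numpy/scipy are our own assumptions.

import numpy as np
from scipy.special import logsumexp
from scipy.stats import multivariate_normal

def gmm_log_likelihood(X, weights, means, covs):
    """Per-frame log p(x|lambda) of equation (1), summed over the sequence as in equation (2)."""
    # log of each weighted component density, shape (num_frames, M)
    comp = np.stack([np.log(w) + multivariate_normal.logpdf(X, mean=m, cov=c)
                     for w, m, c in zip(weights, means, covs)], axis=1)
    return logsumexp(comp, axis=1).sum()

def identify(X, speaker_models):
    """Closed-set identification, equation (3): argmax over the client models."""
    scores = {name: gmm_log_likelihood(X, *params) for name, params in speaker_models.items()}
    return max(scores, key=scores.get)

Here speaker_models is assumed to map a client name to its (weights, means, covariances) tuple.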

In the case of speaker verification, the decision to accept a client is usually based on the ratio between the likelihood of the specific speaker model and that of the background model. Defining P(X|λ_k) as the probability of client C_k producing the sentence X and P(X|Ω) as the probability of the background model, the client is accepted if the ratio is above a speaker-independent threshold δ:

\log \frac{P(X|\lambda_k)}{P(X|\Omega)} > \delta.    (4)

This results in two possible error rates. The first is the false-reject (FR) error rate (P_{Reject|Target}): the speaker is the claimed client, but the likelihood ratio of equation (4) is lower than the threshold. The second is the false-accept (FA) error rate (P_{Accept|NonTarget}): the speaker is not the claimed one, but the likelihood ratio is higher than δ and the speaker is accepted. As a performance measure of a speaker verification system, the decision cost function (DCF) is used. The DCF is defined as a weighted sum of the FR and FA probabilities:

C_{det} = C_{FR} \cdot P_{Reject|Target} \cdot P_{Target} + C_{FA} \cdot P_{Accept|NonTarget} \cdot P_{NonTarget}    (5)

with the predefined weights C_{FR}, C_{FA} and prior probabilities P_{Target}, P_{NonTarget} = 1 − P_{Target}.
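The decision rule of equation (4) and the cost of equation (5) are straightforward to compute once the log-likelihoods are available; the sketch below uses our own helper names and, as defaults, the cost parameters reported later in Section 6.

def accept(logp_client, logp_ubm, delta):
    """Accept the claimed identity if the log-likelihood ratio exceeds the threshold, equation (4)."""
    return (logp_client - logp_ubm) > delta

def dcf(p_fr, p_fa, c_fr=10.0, c_fa=1.0, p_target=0.01):
    """Decision cost function of equation (5)."""
    return c_fr * p_fr * p_target + c_fa * p_fa * (1.0 - p_target)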

3 Support Vector Machines

Support Vector Machines (SVMs) were first introduced by Vapnik and developed from the theory of Structural Risk Minimization (SRM) [7]. We give only a short overview of SVMs here and refer to [8] for more details and further references. SVMs are linear classifiers that can be generalized to non-linear classification by the so-called kernel trick. Instead of applying the linear methods directly to the input space R^d, they are applied to a higher-dimensional feature space F which is non-linearly related to the input space via the mapping Φ : R^d → F. Instead of computing the dot product in F explicitly, a kernel function k(x_i, x_j) satisfying Mercer's conditions is used. A possible kernel function is the Gaussian radial basis function (RBF) kernel:

k(x_i, x_j) = \exp\left( \frac{-\|x_i - x_j\|^2}{2\sigma^2} \right).    (6)

Suppose we have a training set of input samples x ∈ R^d with corresponding targets y ∈ {1, −1}. The SVM finds an optimal separating hyperplane in F by solving the quadratic programming problem of maximizing

W(\alpha) = \sum_{i=1}^{N} \alpha_i - \frac{1}{2} \sum_{i=1}^{N} \sum_{j=1}^{N} \alpha_i \alpha_j y_i y_j k(x_i, x_j)    (7)

under the constraints \sum_{i=1}^{N} \alpha_i y_i = 0 and 0 \le \alpha_i \le C for all i. The parameter C allows us to specify how strictly the classifier should fit the training data.
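In practice, such a frame-level SVM with the RBF kernel of equation (6) can be trained with an off-the-shelf solver. The paper does not state which implementation was used, so the following scikit-learn call is only an illustrative stand-in; note that scikit-learn parameterizes the RBF kernel by gamma = 1/(2σ²).

import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 39))               # stand-in for 39-dimensional frame vectors
y = np.where(X[:, 0] > 0, 1, -1)             # dummy client / non-client frame labels

sigma = 5.0                                  # assumed kernel width, tuned on a development set
clf = SVC(kernel="rbf", gamma=1.0 / (2.0 * sigma ** 2), C=1.0)
clf.fit(X, y)
scores = clf.decision_function(X)            # signed distances to the separating hyperplane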

The output of the SVM is a distance measure between a pattern and the decision boundary:

f(x) = \sum_{i=1}^{N} \alpha_i k(x_i, x) + b    (8)

where the pattern x is assigned to the positive class if f(x) > 0. To obtain a posterior class probability, the distributions P(f|y = +1) and P(f|y = −1) of f(x) are modeled, and the probability of the class given the output is computed via Bayes' rule, which leads to the sigmoid [9]:

P(y = +1|x) = g(f(x), A, B) = \frac{1}{1 + \exp(A f(x) + B)}    (9)

where the parameters A and B are calculated by maximum likelihood estimation [9].
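The parameters A and B of equation (9) can be fitted by minimizing the negative log-likelihood of the sigmoid on held-out SVM outputs. The sketch below is a simplified version of Platt's procedure (it omits the smoothed targets and uses a generic optimizer), not the authors' implementation.

import numpy as np
from scipy.optimize import minimize

def fit_sigmoid(f, y):
    """Fit A, B of P(y=+1|x) = 1/(1+exp(A*f(x)+B)) by maximum likelihood; y in {+1, -1}."""
    t = (y + 1) / 2.0                                    # targets mapped to {0, 1}
    def nll(params):
        A, B = params
        p = 1.0 / (1.0 + np.exp(A * f + B))
        eps = 1e-12                                      # guard against log(0)
        return -np.sum(t * np.log(p + eps) + (1 - t) * np.log(1 - p + eps))
    return minimize(nll, x0=np.array([-1.0, 0.0]), method="Nelder-Mead").x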

4 Sparse Kernel Logistic Regression

Consider again a binary classification problem, now with targets y ∈ {0, 1}. The success probability that a sample x belongs to class 1 is given by P(y = 1|x), and P(y = 0|x) = 1 − P(y = 1|x) is the probability that it belongs to class 0. In Kernel Logistic Regression we model the posterior probability of the class membership using the kernel expansion f(x) of equation (8). Relating f(x) to a probability P(x, α) through the logit transfer function

\operatorname{logit}\{P(x, \alpha)\} = \log \frac{P(x, \alpha)}{1 - P(x, \alpha)} = f(x)    (10)

results in the probability

P(x, \alpha) = \frac{1}{1 + \exp(-f(x))}.    (11)

If we assume that the training data is drawn from a Bernoulli distribution conditioned on the samples x, the negative log-likelihood (NLL) l{α} of the conditional probability P(y|x, α) can be written as

l\{\alpha\} = -\sum_{i=1}^{N} \left[ y_i f(x_i) - \log\left(1 + \exp(f(x_i))\right) \right] + \frac{\lambda}{2}\|f\|^2    (12)

with the ridge penalty \frac{\lambda}{2}\|f\|^2 = \frac{\lambda}{2}\alpha^T K \alpha to avoid over-fitting to the training data [10], where K is the kernel matrix with entries K_{ij} = k(x_i, x_j).
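For a given coefficient vector α, the regularized NLL of equation (12) can be evaluated directly from the kernel matrix. A minimal numpy sketch, with our own argument names:

import numpy as np

def regularized_nll(alpha, K, y, lam):
    """Negative log-likelihood of equation (12); K is the (N, N) kernel matrix, y holds labels in {0, 1}."""
    f = K @ alpha                                        # f(x_i) for all training samples
    nll = -np.sum(y * f - np.logaddexp(0.0, f))          # log(1 + exp(f)) computed stably
    return nll + 0.5 * lam * alpha @ K @ alpha           # ridge penalty (lambda/2) alpha^T K alpha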

To minimize the regularized NLL we set the derivative \partial l\{\alpha\} / \partial \alpha to zero and use the Newton-Raphson algorithm to solve equation (12) iteratively. This algorithm is also referred to as iteratively re-weighted least squares (IRLS), e.g. [11]:

\alpha_{new} = \left(K + \lambda W^{-1}\right)^{-1} z    (13)

with the adjusted response

z = K\alpha_{old} + W^{-1}(y - p)    (14)

where p is the vector of fitted probabilities with i-th element P(α_old, x_i) and W is the N × N weight matrix with entries P(α_old, x_i)(1 − P(α_old, x_i)) on the diagonal.
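A single Newton-Raphson/IRLS step of equations (13) and (14) can be written down directly. The sketch below assumes the full (non-sparse) kernel matrix and clips the weights to keep W invertible, which is our own numerical safeguard:

import numpy as np

def irls_step(alpha_old, K, y, lam):
    """One IRLS update: equations (13) and (14)."""
    p = 1.0 / (1.0 + np.exp(-(K @ alpha_old)))           # fitted probabilities, equation (11)
    w = np.maximum(p * (1.0 - p), 1e-10)                 # diagonal of W, clipped away from zero
    W_inv = np.diag(1.0 / w)
    z = K @ alpha_old + W_inv @ (y - p)                  # adjusted response, equation (14)
    return np.linalg.solve(K + lam * W_inv, z)           # equation (13)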

A sparse solution can be achieved if we involve only basis functions corresponding to a subset S of the training set R:

f(x) = \sum_{i=1}^{q} \alpha_i k(x_i, x), \qquad q \ll N    (15)

with q training samples. If we apply equation (15) instead of (8) in the IRLS algorithm, we obtain the sparse formulation

\alpha_{new} = \left(K_{Nq}^{T} W K_{Nq} + \lambda K_{qq}\right)^{-1} K_{Nq}^{T} W \tilde{z}    (16)

with \tilde{z} = K_{Nq}\alpha_{old} + W^{-1}(y - p), the N × q matrix K_{Nq} = [k(x_i, x_j)], x_i ∈ R, x_j ∈ S, and the q × q regularization matrix K_{qq} = [k(x_i, x_j)], x_i, x_j ∈ S. This sparse variant was introduced in [11]. The SKLR minimizes the NLL iteratively by adding training samples to a subset S of selected training vectors until the algorithm converges. Starting with an empty subset, we minimize the NLL for each x_l ∈ R of the training set:

l\{x_l\} = -y^{T} (K_{Nq}^{l} \alpha_l) + \mathbf{1}^{T} \log\left(1 + \exp(K_{Nq}^{l} \alpha_l)\right) + \frac{\lambda}{2} \alpha_l^{T} K_{qq}^{l} \alpha_l    (17)

with the N × (q + 1) matrix K_{Nq}^{l} = [k(x_i, x_j)], x_i ∈ R, x_j ∈ S ∪ {x_l}, and the (q + 1) × (q + 1) regularization matrix K_{qq}^{l} = [k(x_i, x_j)], x_i, x_j ∈ S ∪ {x_l}. The vector that yields the largest decrease in NLL is then added to the subset:

x_l^{*} = \operatorname*{argmin}_{x_l \in R} l\{x_l\}.    (18)

While the original Newton-Raphson algorithm estimates the parameters α iteratively by applying the IRLS algorithm, we can use a one-step approximation here [11]: in each step the new α is approximated by the fitted result from the current subset S, estimated in the previous minimization step.
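A rough outline of the resulting greedy training procedure (equations (15)-(18) with the one-step approximation) is sketched below. It is a simplified reading of [11] with a fixed number of iterations instead of a convergence test, not the authors' implementation, and it is written for clarity rather than speed.

import numpy as np

def sklr_train(K_full, y, lam, max_basis=50):
    """Greedy sparse KLR; K_full is the (N, N) kernel matrix, y holds labels in {0, 1}."""
    N = K_full.shape[0]
    subset, alpha = [], np.zeros(0)

    def nll(idx, a):
        # regularized NLL of equation (17) for the candidate subset idx
        f = K_full[:, idx] @ a
        return (-np.sum(y * f - np.logaddexp(0.0, f))
                + 0.5 * lam * a @ K_full[np.ix_(idx, idx)] @ a)

    def one_step_alpha(idx):
        # one sparse IRLS step, equation (16), started from the previous coefficients
        a0 = np.append(alpha, 0.0)
        K_Nq = K_full[:, idx]
        p = 1.0 / (1.0 + np.exp(-(K_Nq @ a0)))
        w = np.maximum(p * (1.0 - p), 1e-10)
        z = K_Nq @ a0 + (y - p) / w
        A = K_Nq.T @ (K_Nq * w[:, None]) + lam * K_full[np.ix_(idx, idx)]
        return np.linalg.solve(A, K_Nq.T @ (w * z))

    for _ in range(max_basis):
        candidates = [l for l in range(N) if l not in subset]
        if not candidates:
            break
        best_l, best_val = None, np.inf
        for l in candidates:
            trial = subset + [l]
            val = nll(trial, one_step_alpha(trial))
            if val < best_val:                   # equation (18): largest decrease in NLL
                best_val, best_l = val, l
        subset.append(best_l)
        alpha = one_step_alpha(subset)           # one-step approximation of the new alpha
    return subset, alpha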


5 Multi-class problems

Naturally, kernel logistic regression can be extended to multi-class problems directly. For comparison with binary classifiers like the SVM, however, we use a common one-versus-one approach in which C(C − 1)/2 classifiers learn pairwise decision rules [12], which is easier than solving C large problems. The pairwise probabilities \mu_{ij} \equiv P(q_i | q_i \text{ or } q_j, x) of a class q_i, given that a sample vector x belongs to either q_i or q_j, are transformed into the posterior probability P(q_i|x) by [13]:

P(q_i|x) = 1 \Big/ \left( \sum_{j=1, j \neq i}^{C} \frac{1}{\mu_{ij}} - (C - 2) \right).    (19)
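Equation (19) combines the C(C − 1)/2 pairwise probabilities into class posteriors in closed form. A small sketch, assuming an array layout mu[i, j] ≈ P(q_i | q_i or q_j, x) of our own choosing:

import numpy as np

def pairwise_posteriors(mu):
    """Combine pairwise probabilities mu[i, j] = P(q_i | q_i or q_j, x) into P(q_i | x), equation (19)."""
    C = mu.shape[0]
    post = np.empty(C)
    for i in range(C):
        s = sum(1.0 / mu[i, j] for j in range(C) if j != i)
        post[i] = 1.0 / (s - (C - 2))
    return post / post.sum()      # optional renormalization; the raw estimates need not sum exactly to one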

6 Experiments

For all experiments we used the POLYCOST dataset [14]. This dataset contains 110 speakers (63 females and 47 males) from different European countries. The dataset is divided into four baseline experiments (BE1-BE4), of which we used the text-independent set BE4 for speaker identification and the text-dependent set BE1 for the speaker verification experiments.

In the feature extraction, the speech data is divided into frames of 25 ms at a frame rate of 10 ms and a voiced/unvoiced decision is obtained using a pitch detection algorithm. Only the voiced speech is then parameterized using 12 mel-cepstrum coefficients and the frame energy. The first- and second-order derivatives are added, resulting in feature vectors of 39 dimensions (a sketch of such a front end is given after Table 1). The parameters of the baseline GMM models were estimated using the HTK toolkit [15].

For the identification experiments we used 2 sentences of each speaker for training and 2 sentences as development test set. The evaluation set contains up to 5 sentences per speaker, giving 664 true-identity tests in total. From every speaker there is only 10 to 20 seconds of free speech for training and about 5 seconds per speaker for evaluation. The parameters of the different classifiers were validated on the development set.

Table 1. Speaker identification experiments on the POLYCOST database using different classification methods.

    Classifier   IER (%)
    GMM          10.84
    SVM           8.89
    SKLR          8.58
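The following sketch of the front end described above uses librosa; the original experiments used HTK, so the sampling rate, window settings and the substitution of c0 for the frame energy are our assumptions rather than the authors' exact configuration.

import numpy as np
import librosa

def extract_features(wav_path, sr=8000):
    """13 cepstral coefficients (c0 standing in for the energy term) with first and second
    derivatives, giving 39-dimensional frame vectors."""
    y, sr = librosa.load(wav_path, sr=sr)
    n_fft, hop = int(0.025 * sr), int(0.010 * sr)        # 25 ms frames, 10 ms frame shift
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13, n_fft=n_fft, hop_length=hop)
    feats = np.vstack([mfcc,
                       librosa.feature.delta(mfcc),
                       librosa.feature.delta(mfcc, order=2)])
    # The paper keeps only voiced frames (pitch-based decision); that step is omitted here.
    return feats.T                                       # shape (num_frames, 39)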

Each utterance is assigned to the speaker with the highest speaker-model score as defined in equation (3). Since all speakers are known to the system, the error rate is simply computed as the Identification Error Rate (IER). As can be seen in Table 1, both the SVM and the SKLR classifiers clearly outperform the GMM baseline system. The SKLR classifier reduces the IER of the baseline system by about 20.8% relative.

In the verification experiments the sentence "Joe took father's green shoe bench out" is used as a fixed password sentence shared by all clients. The classifiers are trained on 4 sentences of each speaker, using the same parameters as in the identification experiments. For the GMM environment, a gender-independent background model is trained on 22 non-client speakers from the POLYCOST database.

[Fig. 1. DET curves for the three systems on the POLYCOST-BE1 verification task.]

The evaluation test consists of 664 true client tests and 824 impostor trials. The results of the three classifiers are given in Figure 1 as a detection error tradeoff (DET) plot. The DET curve shows the tradeoff between false rejects (FR) and false accepts (FA) as the decision threshold is varied [16]. Additionally, we report the Equal Error Rate (EER) and the DCF as performance measures in Table 2. The parameters of the cost function used in the experiments are C_{FR} = 10, C_{FA} = 1 and P_{Target} = 0.01. As the table shows, the DCF on the evaluation test is reduced from 0.034 for the GMM baseline system to 0.019 for the SVM system. While the SVM clearly outperforms the GMM baseline, the SKLR performs only slightly better than the GMM system. This might be due to the fact that no separate parameter tuning was carried out for the verification task, so that the SVM exhibits a better generalization performance than the SKLR in this setting.
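The EER reported in Table 2 is the operating point of the DET curve where the FR and FA rates coincide. A simple estimate from client and impostor score arrays (our own helper, using a plain threshold sweep rather than interpolation) is:

import numpy as np

def equal_error_rate(client_scores, impostor_scores):
    """Sweep the decision threshold over all observed scores and return the point where
    false-reject and false-accept rates are (approximately) equal."""
    thresholds = np.sort(np.concatenate([client_scores, impostor_scores]))
    fr = np.array([(client_scores < t).mean() for t in thresholds])
    fa = np.array([(impostor_scores >= t).mean() for t in thresholds])
    i = np.argmin(np.abs(fr - fa))
    return 0.5 * (fr[i] + fa[i])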

Table 2. Comparison of the EER and the DCF for the three systems on the POLYCOST-BE1 speaker verification task.

    Classifier   EER (%)   DCF
    GMM          4.09      0.034
    SVM          2.16      0.019
    SKLR         3.31      0.029

7 Conclusion

In this paper we presented two discriminative methods for frame-based speaker identification and verification. Both methods outperform the GMM baseline in the speaker recognition experiments. Because the decision process depends directly on the discrimination between the different speaker models, there is no need for score normalization by a background model. The advantage of the SKLR is that it directly models the posterior probability of the class membership. The main drawback of the discriminative classification methods is the time- and memory-consuming parameter estimation, which makes it impossible to use larger datasets directly. One idea is not to use a multi-class method like the one-versus-all or one-versus-one approach of Section 5, but to use a fixed set of background speakers. The speech sentences are then classified in a one-versus-background approach, which is computationally more efficient if the background set is not too large. In our future research we will extend the verification system to larger datasets and different speech conditions, such as telephone and cellular speech.

References

1. D. Reynolds, "An overview of automatic speaker recognition technology," in Proc. IEEE International Conference on Acoustics, Speech, and Signal Processing, vol. 4, 2002, pp. 4072-4075.
2. S. E. Krüger, M. Schafföner, M. Katz, E. Andelic, and A. Wendemuth, "Speech recognition with support vector machines in a hybrid system," in Proc. EuroSpeech, 2005, pp. 993-996.
3. T. Jaakkola and D. Haussler, "Exploiting generative models in discriminative classifiers," in Advances in Neural Information Processing Systems, M. Kearns, S. Solla, and D. Cohn, Eds., vol. 11. Cambridge, MA, USA: MIT Press, 1999, pp. 487-493.
4. V. Wan and S. Renals, "Speaker verification using sequence discriminant support vector machines," IEEE Transactions on Speech and Audio Processing, vol. 13, no. 2, pp. 203-210, 2005.
5. M. Przybocki and A. Martin, "NIST speaker recognition evaluation chronicles," in Proceedings of ODYSSEY - The Speaker and Language Recognition Workshop, 2004.
6. D. Reynolds, T. Quatieri, and R. Dunn, "Speaker verification using adapted Gaussian mixture models," Digital Signal Processing, vol. 10, pp. 19-41, 2000.
7. V. N. Vapnik, The Nature of Statistical Learning Theory, 2nd ed., ser. Information Science and Statistics. Berlin: Springer, 2000.
8. C. Burges, "A tutorial on support vector machines for pattern recognition," Data Mining and Knowledge Discovery, vol. 2, no. 2, pp. 121-167, June 1998.
9. J. Platt, "Probabilistic outputs for support vector machines and comparisons to regularized likelihood methods," in Advances in Large-Margin Classifiers, P. Bartlett, B. Schölkopf, D. Schuurmans, and A. Smola, Eds. Cambridge, MA, USA: MIT Press, October 2000, pp. 61-74.
10. A. Hoerl and R. Kennard, "Ridge regression: Biased estimation for nonorthogonal problems," Technometrics, vol. 12, pp. 55-67, 1970.
11. J. Zhu and T. Hastie, "Kernel logistic regression and the import vector machine," Journal of Computational and Graphical Statistics, vol. 14, pp. 185-205, 2005.
12. T. Hastie and R. Tibshirani, "Classification by pairwise coupling," in Advances in Neural Information Processing Systems 10, M. I. Jordan, M. J. Kearns, and S. A. Solla, Eds. Cambridge, MA, USA: MIT Press, 1998.
13. D. Price, S. Knerr, L. Personnaz, and G. Dreyfus, "Pairwise neural network classifiers with probabilistic outputs," in Advances in Neural Information Processing Systems 7, G. Tesauro, D. Touretzky, and T. Leen, Eds. Cambridge, MA, USA: MIT Press, 1995, pp. 1109-1116.
14. H. Melin and J. Lindberg, "Guidelines for experiments on the POLYCOST database," in Proceedings of a COST 250 Workshop on Application of Speaker Recognition Techniques in Telephony, Vigo, Spain, 1996, pp. 59-69.
15. S. Young, G. Evermann, D. Kershaw, G. Moore, J. Odell, D. Ollason, D. Povey, V. Valtchev, and P. Woodland, The HTK Book, Cambridge University Engineering Department, 2002.
16. A. Martin, G. Doddington, T. Kamm, M. Ordowski, and M. Przybocki, "The DET curve in assessment of detection task performance," in Proc. EuroSpeech, vol. 4, 1997, pp. 1895-1898.