Speaker Recognition from Coded Speech Using Support Vector Machines

Artur Janicki and Tomasz Staroszczyk

Institute of Telecommunication, Warsaw University of Technology, ul. Nowowiejska 15/19, 00-665 Warsaw, Poland
[email protected], [email protected]

Abstract. We propose using support vector machines (SVMs) to recognize speakers from signals transcoded with different speech codecs. Experiments with SVM-based text-independent speaker classification using a linear GMM-supervector kernel are presented for six different codecs and for uncoded speech. Both matched conditions (the same codec used for creating the speaker models and for testing) and mismatched conditions were investigated. SVMs proved to provide high speaker recognition accuracy, although they required a higher number of Gaussian mixtures than the baseline GMM-UBM system. In mismatched conditions, the Speex codec was shown to perform best for creating robust speaker models.

Keywords: speaker recognition, speaker classification, speech coding, support vector machines

1 Introduction

A speaker recognition system is often required to analyze the voices of remote speakers; this is why the speech signal needs to be transmitted. This in turn implies that the recognition system has to analyze a signal which has been transcoded using one of the existing speech codecs. This can happen in a speaker identification or verification system working in a client-server architecture, where the speech signal is transmitted to the system over the Internet. Likewise, a speaker verification system in a bank should work robustly regardless of whether the customer is calling from a landline, a mobile or an Internet phone. This shows why there is a need to make speaker recognition robust not only against a change of microphone or against a speaker's inter-session variability, but also against the various speech codecs used in voice transmission.

1.1 Impact of speech coding on speaker recognition

Several studies have already been conducted on speaker recognition from coded speech. In the majority of cases, researchers used speaker recognition based on Gaussian mixture models, where speaker models were adapted from a universal background model, e.g. using the MAP (maximum a posteriori) algorithm (GMM-UBM systems) [1]. Usually two cases are considered:


– matched conditions - when the speaker recognition system trained using speech transcoded with codec X is tested on speech transcoded with the same codec X;
– mismatched conditions - when the system trained using speech transcoded with codec X (or not coded at all) is tested on speech transcoded with a different codec Y.

This was the case, for example, in [2], where the authors showed, on the NIST 1998 speaker recognition evaluation corpus, how much recognition accuracy is affected by transcoding with the GSM 06.10, G.723.1 and G.729 codecs. They reported that the GSM 06.10 codec gave the best results in both matched and mismatched conditions, while G.723.1 proved to be the worst (the EER rose from 4% to 12% for female speakers), so the performance degradation was consistent with decreasing perceptual quality.

GSM speech codecs were examined in [3], although only in matched conditions. The authors showed that both speaker identification and verification performance are degraded by these codecs, blaming the low LPC order used in them. They reached a speaker classification accuracy of 68.5% for GSM 06.10 and 71.8% for GSM 06.60.

Speaker recognition from speech coded with the GSM 06.60, G.729, G.723.1 and MELP codecs was researched in [4], both in matched and mismatched conditions. The authors used the GMM-UBM technique with gender-dependent UBM models. They found that recognition accuracy decreases as the mismatch between the quality of the "training" and "testing" codecs increases. It was shown that handset-dependent score normalization (HNORM) improved the results, especially in mismatched conditions.

Speaker identification from speech transcoded with the GSM 06.60 codec was also described in [5]. The researchers classified 60 speakers pronouncing 10 digits in Arabic, recorded in the ARADIGIT corpus, and obtained an identification error rate of 21.94%. In [6] the researchers examined the Speex codec, using their own speech corpus.
In various experiments they showed, among other things, that Speex can also serve well for creating speaker models used to test GSM-encoded speech. Several studies investigated the possibility of recognizing speakers directly from codec parameters (e.g. [2], [7]), however with results still inferior to those achieved by analyzing the synthesized (transcoded) speech.

1.2 SVMs and speaker recognition

Support vector machines started to be used in speaker recognition in the mid-1990s [8], soon after a detailed description of SVMs appeared in [9]. Since then they have been used successfully in many studies. In [10] the authors used SVMs with the Fisher kernel and an LR (likelihood ratio) kernel with spherical normalization. On the PolyVar speech corpus they achieved up to a 33% relative improvement in speaker verification accuracy compared to GMM-UBM systems. In [11] the authors proposed using an SVM to classify supervectors containing GMM parameters (more precisely, Gaussian mixture mean values). They used

the linear Kullback-Leibler kernel, which for M Gaussian components can be expressed as:

K(\mathrm{utt}_a, \mathrm{utt}_b) = \sum_{i=1}^{M} \lambda_i \, \mu_i^{a\,T} \Sigma_i^{-1} \mu_i^{b} = \sum_{i=1}^{M} \left( \sqrt{\lambda_i}\, \Sigma_i^{-1/2} \mu_i^{a} \right)^{T} \left( \sqrt{\lambda_i}\, \Sigma_i^{-1/2} \mu_i^{b} \right)    (1)

where λ_i, μ_i and Σ_i are the parameters (weight, mean vector and covariance matrix) of the i-th Gaussian component, for utterances a and b. They also proposed another kernel, called the GMM L2 inner product. The authors showed DET curves for a NIST SRE 2005 task, where SVMs outperformed the classical GMM ATNorm approach. They also noted the considerably lower computational complexity of the SVM approach. In [12] the author proposed an improved supervector approach, which was faster and used smaller speaker models.
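Equation (1) says that the linear KL kernel between two utterances equals a plain inner product of suitably scaled supervectors. The following NumPy sketch illustrates that equivalence; function and variable names are ours (not from the cited work), and diagonal covariances are assumed, as is usual for MFCC-based GMMs:

```python
import numpy as np

def kl_supervector(means, weights, covars):
    """Right-hand side of Eq. (1): scale each adapted mean vector by
    sqrt(lambda_i) * Sigma_i^{-1/2} and stack all of them into one supervector.

    means : (M, D) adapted Gaussian means; weights : (M,); covars : (M, D) diagonal.
    """
    scaled = np.sqrt(weights)[:, None] * means / np.sqrt(covars)
    return scaled.ravel()

def kl_linear_kernel(means_a, means_b, weights, covars):
    """Left-hand side of Eq. (1): sum_i lambda_i * mu_i^a.T Sigma_i^{-1} mu_i^b."""
    k = 0.0
    for lam, mu_a, mu_b, sigma in zip(weights, means_a, means_b, covars):
        # with diagonal Sigma_i, Sigma_i^{-1} mu_b is just element-wise division
        k += lam * np.dot(mu_a / sigma, mu_b)
    return k
```

This equivalence is what lets a standard linear-kernel SVM operate directly on the scaled supervectors instead of evaluating the kernel pairwise.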

1.3 Aims of this study

Following the promising results of SVM-based speaker recognition in several studies, we decided to apply it to coded speech. In this study we wanted to investigate the following:
– How well does SVM-based speaker classification perform for coded speech?
– What classification accuracy can be achieved with SVMs when recognizing speakers in mismatched conditions?
– Which codec is best for creating speaker models resistant to the mismatch?
Our results will be compared, among others, with the experiments on speaker recognition from speech transcoded with the GSM codecs [3] and with the study concerning recognition from Speex-transcoded speech [6].

2 Experiment setup

2.1 Speech data

The TIMIT speech corpus [13] was used as the database of recordings. Although it was originally designed for studies on speech recognition, this corpus has also been used in a number of studies on speaker recognition (e.g. [3], [14]), as it contains recordings of 630 speakers, which is a relatively large number. However, the TIMIT corpus contains single-session recordings only, so the problem of a speaker's inter-session variability is not covered in this study. Each of the speakers utters 10 sentences, each lasting 3.2 s on average. Five of these sentences (the SX recordings) were used for training the system, while the remaining five (the SA and SI sentences) were used for testing. The SA sentences are the same for every speaker, but they were used in the testing part only, so the text-independence of the speaker recognition was preserved. The audio material per speaker is relatively short (ca. 32 s in total for training and testing, compared e.g. to 120 s in [6]), which makes speaker classification on TIMIT an even bigger challenge.


2.2 Tested codecs

In our experiments we decided to recognize speech transcoded with the codecs most commonly used in Internet telephony:
– G.711 (PCM) - used in fixed telephony, but also in VoIP. The A-law option was used in this study.
– G.723.1 - a codec based on MP-MLQ and ACELP; here the option with a bitrate of 6.4 kbps was used.
– GSM 06.10 (also known as GSM Full Rate) - designed in the early 1990s for GSM telephony, but used in VoIP as well. Bitrate: 13 kbps.
– GSM 06.60 (also known as GSM Enhanced Full Rate) - an enhanced version of GSM 06.10, offering a 12.2 kbps bitrate.
– G.729 - operates at a bitrate of 8 kbps and is based on CS-ACELP. Used in VoIP, especially when limited bandwidth is available.
– Speex - a CELP-based lossy codec used in VoIP, offering 10 compression levels at bitrates of 2.15-24.6 kbps (level 8 was used in this study, as it showed the best performance in mismatched conditions in [6]).

2.3 Classification

We decided to use a hybrid SVM-GMM approach: the SVM algorithm (the discriminative part) was used to classify supervectors made of GMM parameters (the generative part). In this way, following successful examples such as [11] and [12], we hoped to benefit from the advantages of both the discriminative and the generative approach.

A UBM was trained using 200 speakers, and the remaining 430 were used for the classification experiments, as in [3]. The speech files were parameterized using 19 MFCC coefficients (plus the 0th one), with a frame length of 30 ms and a 10 ms analysis step. UBM models were created separately for speech transcoded with each codec, using the GMM algorithm with various numbers of mixture components. The speaker models were created by adapting the UBM with the MAP algorithm with the relevance factor RF = 1. Only the mean values of the Gaussian components were adapted; the weight vector and the covariance matrices were not modified. The adapted mean values of each Gaussian mixture were stacked into a column, thus forming high-dimensional supervectors (SVs).

Since we had ca. 16 s of training speech material (the SX recordings), we could either create one SV per speaker using all of it, or split the signal into equal parts and create several training SVs per speaker. Initial experiments showed that using 8 SVs per speaker yielded good classification results, so this number was used in further experiments. A higher number of SVs per speaker sometimes left individual SVs with too little training data.

We decided to assess classification accuracy by classifying each of the test sentences separately. The same procedure of generating SVs was followed for each test sentence, so 430 x 5 = 2150 supervectors were created for testing. Classification accuracy was determined as the ratio of correctly classified sentences to the total number of test recordings (i.e. 2150).
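The mean-only MAP adaptation with RF = 1 described above can be sketched as follows. This is an illustrative reconstruction, not the authors' Matlab code: scikit-learn's GaussianMixture stands in for the UBM, and random vectors stand in for real MFCC frames (19 MFCCs plus the 0th coefficient give 20 dimensions):

```python
import numpy as np
from sklearn.mixture import GaussianMixture

def map_adapt_means(ubm, frames, relevance_factor=1.0):
    """MAP-adapt only the Gaussian means of a UBM to one speaker's frames,
    then stack the adapted means into a supervector.
    Weights and covariances of the UBM stay fixed, as in the paper."""
    post = ubm.predict_proba(frames)              # (T, M) responsibilities
    n = post.sum(axis=0)                          # (M,) soft occupation counts
    # normalized first-order statistics E_i[x] per mixture component
    ex = (post.T @ frames) / np.maximum(n, 1e-10)[:, None]
    alpha = n / (n + relevance_factor)            # data-dependent adaptation weight
    adapted = alpha[:, None] * ex + (1 - alpha)[:, None] * ubm.means_
    return adapted.ravel()                        # high-dimensional supervector

# illustrative usage: 8-component diagonal UBM on synthetic 20-dim "frames"
rng = np.random.default_rng(1)
ubm = GaussianMixture(n_components=8, covariance_type='diag',
                      random_state=0).fit(rng.standard_normal((2000, 20)))
sv = map_adapt_means(ubm, rng.standard_normal((300, 20)))
```

With M = 256 mixtures and 20-dimensional features, the real supervectors of the paper would have 256 x 20 = 5120 dimensions.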


The experiments were run in the Matlab environment, using the LIBSVM toolbox for SVM classification [16] and the h2m toolbox for GMM training [17]. The actual classification was performed by an SVM with the classical Kullback-Leibler kernel. When testing classification in matched conditions, the UBM, the training SVs and the tested SVs were all created from speech transcoded with the same codec. In the experiments with mismatched conditions, the UBM and the training SVs were created from speech transcoded with codec X and tested on SVs created from speech transcoded with codec Y.
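The classification stage itself amounts to training a linear-kernel SVM on the supervectors, with 8 training SVs per speaker as described in Sect. 2.3. A toy sketch with scikit-learn standing in for LIBSVM (the number of speakers, dimensions and data are illustrative, not the real 256-mixture supervectors):

```python
import numpy as np
from sklearn.svm import SVC

# Toy setup: 3 "speakers", each represented by 8 training supervectors
# scattered around a speaker-specific center (all values synthetic).
rng = np.random.default_rng(2)
dim, svs_per_speaker = 40, 8
centers = 3.0 * rng.standard_normal((3, dim))
X, y = [], []
for speaker, center in enumerate(centers):
    for _ in range(svs_per_speaker):
        X.append(center + 0.1 * rng.standard_normal(dim))
        y.append(speaker)

# Linear kernel on supervectors; SVC handles the multi-class case internally
clf = SVC(kernel='linear').fit(np.array(X), y)

# classify one "test sentence" supervector drawn near speaker 1's center
test_sv = centers[1] + 0.1 * rng.standard_normal(dim)
predicted_speaker = int(clf.predict(test_sv[None, :])[0])
```

In the paper's setup each of the 2150 test supervectors would be classified this way, and accuracy computed as the fraction of correct predictions.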

3 Results

Before the experiments with speaker recognition from coded speech started, classification was first run on speech of original quality (fs = 16 kHz). It turned out that for M = 16 Gaussian mixtures the achieved accuracy was worse than in [3]: only slightly over 89%, compared to 97.8% in that study, which used the GMM-UBM approach with M = 16 as well. When we tried speaker classification from coded speech, the same held, e.g. for speech transcoded with the GSM 06.10 codec: 58.5% compared to 68.5% in [3].

Fig. 1. Classification accuracy with SVM-GMM classifier for different numbers of Gaussian components, for clean and coded speech (GSM 06.10 codec). Dashed horizontal lines show recognition results of a GMM-UBM system, reported in [3].

So we decided to increase M. After doubling it, the result for the GSM 06.10 codec using the SVM-GMM approach was almost the same as for the GMM-UBM algorithm: it reached 68%, see Fig. 1. However, due to the unsatisfactory result for clean speech, we decided to increase the number of Gaussian components further. For M = 256 the accuracy of speaker classification for clean speech outperformed the result achieved in [3] for GMM-UBM classification. What is even more important, the accuracy of classification for coded speech (here, for the GSM 06.10 codec) increased significantly, from 58.5% to almost 85%, and the

distance between the performance for uncoded and coded speech decreased from over 30% relative to less than 14%. We concluded that the gap between the performance for clean and coded speech narrows for higher M. Thus, in further experiments we used speaker modelling with 256 Gaussian mixtures.

Table 1. Speaker recognition accuracy [%] for systems trained (in rows) and tested (in columns) with different codecs. The diagonal shows results for matched conditions.

training\testing   uncoded  G.711  G.723.1  G.729  GSM 06.10  GSM 06.60  Speex 8  average  stddev
uncoded              89.67  87.40    49.49  51.49      51.49      51.44    83.63    66.37   17.59
G.711                86.42  88.23    46.98  47.02      50.47      64.65    80.60    66.34   16.07
G.723.1              71.63  68.88    73.81  61.63      60.33      71.30    75.58    69.02    4.64
G.729                65.16  62.37    57.95  77.12      37.72      77.91    62.74    63.00    8.91
GSM 06.10            69.63  70.98    55.26  44.98      83.12      57.72    72.23    64.84   10.45
GSM 06.60            72.00  65.54    63.07  63.21      41.86      84.28    63.07    64.72    7.90
Speex 8              86.60  84.23    62.88  53.07      62.79      67.67    86.65    71.99   11.87

The next experiments checked how speaker recognition using the SVM-GMM approach is affected by speech coding. First, matched conditions were used, i.e. the system was tested with sentences encoded with the same codec as used during model training. The performance of speaker classification using the SVM-GMM technique was consistent with the speech quality offered by the given codec. The best-quality codecs (G.711, and Speex even when operating at quality level 8) yielded the best recognition results (88.2% and 86.7%, respectively), while G.723.1 and G.729 proved to be the worst in this trial, with recognition accuracies of 73.8% and 77.1%, respectively (see the diagonal in Table 1). A similar relation between speaker recognition performance and codec quality was also reported in other studies, using other classification techniques (e.g. in [2], [6]). The decrease in performance when changing from speech sampled at 16 kHz (98% accuracy) to narrow-band speech sampled at 8 kHz (89.7% accuracy) is noteworthy, too.

Finally, speaker classification performance was checked under mismatched conditions. As expected, the accuracy decreased here, although not uniformly. The decrease was more significant when there was a remarkable mismatch between the "training" and "testing" codecs. Testing the system trained with speech transcoded using the high-quality G.711 codec on uncoded speech, or on speech transcoded using another high-quality codec (e.g. Speex), results in only a minor decrease in accuracy (see Table 1). On the contrary, testing it with the lower-quality G.723.1 or G.729 codecs makes the recognition accuracy almost twice as bad. Accuracy in the G.723.1/GSM 06.60 mismatch is hardly affected (a decrease from 73.8% to 71.3%), as both codecs are CELP-based and offer similar, slightly lower speech quality. This phenomenon had been previously reported in [4] for GMM-UBM-based recognition.


Fig. 2. Recognition accuracy for speaker models created with different codecs against the codec of the tested speech. The enlarged markers denote matched conditions.

Fig. 2 can help to choose the codec most suitable for creating speaker models resistant to mismatched conditions. The line corresponding to Speex is close to the G.711 and "uncoded" lines for the high-quality codecs, but also stays quite high for the remaining ones, so this codec seems to be the best candidate. This is confirmed by the highest average recognition rate shown in Table 1, and is consistent with the results reported in [6] for GMM-UBM classification. G.723.1 shows good results here, too: for some codecs (GSM 06.60, G.729) the speaker models created with G.723.1 outperform those created with Speex. The results for G.723.1 are also the most stable across all the tested codecs, so its standard deviation of accuracy is the lowest (less than 5%). In general, G.711 behaves similarly to uncoded speech, due to the simple nature of this codec.

The result for the G.723.1/Speex 8 mismatch is somewhat surprising: it turned out to be higher than in the matched conditions (75.6% vs. 73.8%). Also, the G.729 models tested on GSM 06.60 speech perform surprisingly well. These cases require further investigation.

4 Conclusions and future work

Our experiments showed that SVM-based speaker classification from coded speech yields high accuracy, so it can be used for this purpose. However, the SVM-GMM technique used here required a higher number of Gaussian mixtures than the baseline GMM-UBM system [3]. This can be explained by the fact that in the classical GMM-UBM approach the parameters of a tested speech signal are verified against the speaker model, while in the SVM-GMM technique the decision is based on a model of the tested speech, as the SVM is in fact discriminating between speaker models. This is why the speaker modelling must be more precise to achieve similar results.

As in other studies, the speaker classification accuracy was consistent with codec quality, so Speex and G.711 showed the best results. In mismatched conditions the degradation of recognition accuracy is less significant if the "training" and "testing" codecs offer similar speech quality.

Future work can explore the possibility of using channel compensation techniques for SVMs, such as NAP, to decrease the impact of codec mismatch.
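As a pointer for that future work: NAP (nuisance attribute projection) removes a low-rank "nuisance" subspace, which here could model codec variability, from the supervectors before SVM training: phi' = (I - U U^T) phi. A minimal NumPy sketch; in practice the orthonormal basis U would be estimated from cross-codec training data, whereas here it is random for illustration:

```python
import numpy as np

def nap_project(supervectors, nuisance_basis):
    """Project out the span of `nuisance_basis` (orthonormal columns)
    from each supervector (one supervector per row)."""
    U = nuisance_basis                          # (D, k) orthonormal basis
    return supervectors - (supervectors @ U) @ U.T

# illustrative: remove a 2-dimensional "codec" subspace from 20-dim supervectors
rng = np.random.default_rng(3)
U, _ = np.linalg.qr(rng.standard_normal((20, 2)))   # orthonormal (20, 2)
svs = rng.standard_normal((5, 20))
cleaned = nap_project(svs, U)
```

After projection the supervectors carry no component along the nuisance directions, so a linear-kernel SVM trained on them ignores that variability by construction.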

References

1. Reynolds, D.: Gaussian Mixture Models. Encyclopedia of Biometric Recognition (2008)
2. Quatieri, T., Singer, E., Dunn, R., Reynolds, D., Campbell, J.: Speaker and Language Recognition Using Speech Codec Parameters. In: Proc. Eurospeech '99, Budapest, vol. 2, pp. 787-790 (1999)
3. Besacier, L., Grassi, S., Dufaux, A., Ansorge, M., Pellandini, F.: GSM Speech Coding and Speaker Recognition. In: Proc. ICASSP, pp. 1085-1088 (2000)
4. Dunn, R., Quatieri, T., Reynolds, D., Campbell, J.: Speaker Recognition from Coded Speech in Matched and Mismatched Conditions. In: Proc. ODYSSEY-2001, Crete, Greece, pp. 115-120 (2001)
5. Krobba, A., Debyeche, M., Amrouche, A.: Evaluation of Speaker Identification System Using GSMEFR Speech Data. In: Proc. 2010 International Conference on Design & Technology of Integrated Systems in Nanoscale Era, Hammamet, pp. 1-5 (2010)
6. Stauffer, A., Lawson, A.: Speaker Recognition on Lossy Compressed Speech Using the Speex Codec. In: Proc. Interspeech 2009, Brighton, pp. 2363-2366 (2009)
7. Moreno-Daniel, A., Juang, B.-H., Nolazco-Flores, J.: Speaker Verification Using Coded Speech. In: LNCS 3287, Berlin Heidelberg, pp. 366-373 (2004)
8. Schmidt, M., Gish, H.: Speaker Identification via Support Vector Classifiers. In: Proc. IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP '96), Atlanta, USA (1996)
9. Vapnik, V.: The Nature of Statistical Learning Theory. Springer-Verlag, New York (1995)
10. Wan, V., Renals, S.: SVMSVM: Support Vector Machine Speaker Verification Methodology. In: Proc. IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP '03), Hong Kong, vol. 2, pp. 221-224 (2003)
11. Campbell, W., Sturim, D., Reynolds, D.: Support Vector Machines Using GMM Supervectors for Speaker Verification. IEEE Signal Processing Letters 13, 308-311 (2006)
12. Anguera, X.: MiniVectors: an Improved GMM-SVM Approach for Speaker Verification. In: Proc. Interspeech 2009, Brighton, UK, pp. 2351-2354 (2009)
13. Garofolo, J., Lamel, L., Fisher, W., Fiscus, J., Pallett, D., Dahlgren, N., Zue, V.: TIMIT Acoustic-Phonetic Continuous Speech Corpus. Linguistic Data Consortium, Philadelphia (1993)
14. Yu, E., Mak, M.-W., Kung, S.-Y.: Speaker Verification from Coded Telephone Speech Using Stochastic Feature Transformation and Handset Identification. In: Proc. 3rd IEEE Pacific-Rim Conference on Multimedia, Hsinchu, Taiwan, pp. 598-606 (2002)
15. SoX - Sound eXchange, http://sox.sourceforge.net/
16. Chih-Chung, C., Chih-Jen, L.: LIBSVM - A Library for Support Vector Machines, http://www.csie.ntu.edu.tw/~cjlin/libsvm/
17. Cappe, O.: h2m Toolkit, http://www.tsi.enst.fr/~cappe/