EFFECTIVE BACKGROUND DATA SELECTION IN SVM SPEAKER RECOGNITION FOR UNSEEN TEST ENVIRONMENT: MORE IS NOT ALWAYS BETTER

Jun-Won Suh, Yun Lei, Wooil Kim, and John H.L. Hansen

Center for Robust Speech Systems (CRSS)
Erik Jonsson School of Engineering and Computer Science
University of Texas at Dallas, Richardson, Texas, USA
{jxs064200, yxl059200, wikim, John.Hansen}@utdallas.edu
ABSTRACT

This study focuses on a procedure for selecting effective negative examples for improved Support Vector Machine (SVM) based speaker recognition. Selection of a background dataset, comprising a group of negative examples, is critical to developing an effective decision surface between the enrolled speaker and the rejection space of outside speakers. Previous studies generally fix the number of examples based on development data, but for real applications this does not guarantee sustained performance on unseen data. In the proposed method, the error is estimated on the support vectors to select the background dataset, thereby customizing the background dataset for each enrollment speaker instead of training all models with a fixed background dataset. The proposed method achieves equivalent or improved EER and DCF compared with previous SVM-based studies and provides consistent performance on unseen data, yielding a 6% relative improvement in EER and DCF on NIST SRE 2010.

Index Terms— Speaker recognition, background dataset selection, support vector machine, data evaluation

1. INTRODUCTION

Current state-of-the-art speaker recognition systems generally employ a Support Vector Machine (SVM) using a Gaussian Mixture Model (GMM) supervector along with Joint Factor Analysis (JFA) as the feature [1]. JFA captures the speaker and session variability in the GMM supervectors. Recently, JFA has been shown to be capable of representing both the speaker and common factors, where the SVM uses both information types as input features, and a range of kernel and parameterization methods have also been used to improve classification performance for SVM-based speaker recognition [1, 2].

This project was funded by AFRL through a subcontract to RADC Inc. under FA8750-09-C-0067, and partially by the University of Texas at Dallas from the Distinguished University Chair in Telecommunications Engineering held by J.H.L. Hansen.
In general, the SVM builds a hyperplane for the enrollment speaker using both positive examples (the enrollment speaker) and negative examples (impostor speakers) [3]. The limited number of positive examples for each enrollment speaker requires effective use of the negative examples, since an essentially unlimited pool of negative examples is available. Effective selection of the background dataset, which should include as many negative examples as possible that are close to the enrollment speaker, can improve performance of the SVM-based speaker recognition system [4, 5], while classification actually degrades when all negative examples are used. A drawback of this approach to SVM background selection is that it requires a specific number of negative examples, and finding the best number of background entries for unknown evaluation data is a challenging problem. Researchers generally fix system parameters such as the number of entries in the background dataset, the feature dimension, and the number of Gaussian mixtures before new evaluation data (e.g., NIST SRE 2010) is tested, using available pre-evaluation data such as SRE 2008. However, these parameters do not guarantee sustained performance on new, unseen evaluation data. Moreover, previous studies [4] have shown that differently sized background datasets for enrollment speakers result in differing system performance. Estimation of the SVM error boundary [6, 7] measures the suitability of an enrollment speaker model trained with background datasets of various sizes. Our system employs this error estimation scheme to select the background dataset for evaluation data without tuning to a specific dataset size, while avoiding the risk of selecting too many poorly representative background entries for unseen evaluation data.

This paper is organized as follows. Sec. 2 describes the proposed process for effective dataset selection. The system description and specific parameter setup are given in Sec. 3. An extensive performance assessment and results are given in Sec. 4. Finally, conclusions are presented in Sec. 5.
Fig. 1. Overall performance (min. DCF vs. background dataset size) on the NIST SRE 2008 and 2010 evaluation sets, with baselines using all negative examples.

Fig. 2. Average error variation with increasing background dataset size for the SRE-08 and SRE-10 evaluation sets.
2. PROPOSED METHOD
The selection of SVM background datasets was previously considered by McLaren et al. [4, 5], where the support vector frequency [4] or a coefficient-based metric [5] was used to rank the negative examples for selecting the background dataset. The support vector frequency method selects negative examples by evaluating them against the target SVM model and then choosing the negative examples closest to the enrollment speaker as the background dataset. The issue is that a fixed number of negative examples does not always provide consistent performance across different evaluation data. Fig. 1 shows results of background dataset selection using this support vector frequency method for the SRE-08 and SRE-10 male 5 min telephone evaluation sets, where SRE-04 and SRE-05 are used as the background dataset pool. The best background dataset size (500), tuned on the SRE-08 evaluation set, does not produce the best performance on the SRE-10 evaluation set. This variation in performance across datasets suggests that the SVM system needs an improved method for finding effective background datasets. The following sections introduce the proposed background dataset selection method, which does not require a fixed number of negative examples. The main goal of the proposed method is to provide consistent performance for new evaluation sets without requiring a fixed background dataset size.

2.1. Evaluation of SVM Model and Decision Rule

In constructing the SVM hyperplane, it is necessary to obtain the outlier/reject speakers close to the decision surface, since outliers close to the optimum decision boundary yield better classification. The enrollment speaker model is constructed using its own positive examples, and the negative examples are rank-ordered by support vector frequency, so a distinctly different set of negative examples is used to construct each enrollment speaker model. The differences between two SVMs trained for the same enrollment speaker with rank-ordered background datasets of different sizes provide knowledge of the effect of each dataset increment. The size of the background dataset, $p$, is increased in increments of 100 (i.e., $p = 100$, $p + 1 = 200$, $\ldots$, $p + l = 100 \times l$). The trained SVM consists of the support vectors $\mathbf{x}_i$, norm vector $\mathbf{w}_p$, and bias $b$, and satisfies the inequality of Eq. (1):
$y_i(\mathbf{x}_i \cdot \mathbf{w}_p + b) - 1 \ge 0, \quad \forall i$   (1)

The norm vector is formulated to maximize the margin between the positive class and the $p$ negative examples [8]:

$\mathbf{w}_p = \sum_i \alpha_i y_i \mathbf{x}_i$   (2)
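As a minimal sketch of Eqs. (1)-(2) (not the authors' implementation, which used the SVMLight toolkit [3]), a linear SVM can be trained per enrollment speaker and its norm vector and bias read off the dual coefficients; scikit-learn and the array names `pos` and `neg` are illustrative assumptions:

```python
import numpy as np
from sklearn.svm import SVC

def train_speaker_svm(pos, neg):
    """Train one enrollment-speaker SVM on positive examples `pos`
    (the enrollment speaker) and the size-p ranked background `neg`."""
    X = np.vstack([pos, neg])
    y = np.concatenate([np.ones(len(pos)), -np.ones(len(neg))])
    svm = SVC(kernel="linear").fit(X, y)
    # Eq. (2): w_p = sum_i alpha_i y_i x_i; sklearn stores alpha_i * y_i
    # in dual_coef_ and the support vectors x_i in support_vectors_.
    w_p = (svm.dual_coef_ @ svm.support_vectors_).ravel()
    # Eq. (3) below writes O = w . x - b, so b here is the negative of
    # sklearn's intercept (whose convention is w . x + intercept).
    b = -float(svm.intercept_[0])
    return w_p, b
```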
The output function for the $k$-th example is

$O_{p,k} = \mathbf{w}_p \cdot \mathbf{x}_k - b$   (3)

and the error is measured by

$Err_{n,p} = \frac{1}{m}\,\mathrm{card}\{\, j : O_{p,j} - \theta < 0 \,\}$   (4)
The development examples $j$ are drawn from the negative-example dataset, excluding examples used to train the enrollment speaker models, and $m$ is the total number of such examples. The enrollment speaker is denoted by $n$. The decision rule $\theta$ is obtained from the first $p = 100$ background dataset, since these $p$ examples are the negatives closest to the positive examples:

$\theta = \frac{1}{100} \sum_{k=1}^{100} O_{p,k}$   (5)
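Continuing the sketch under the same assumptions, Eqs. (3)-(5) reduce to a few NumPy operations on the trained hyperplane, with supervectors stored as rows:

```python
import numpy as np

def output_scores(w_p, b, X):
    """Eq. (3): O_{p,k} = w_p . x_k - b for each row x_k of X."""
    return X @ w_p - b

def decision_rule(w_p, b, closest_100):
    """Eq. (5): theta is the mean output over the first p = 100
    background examples (the negatives closest to the speaker)."""
    return float(np.mean(output_scores(w_p, b, closest_100)))

def estimate_error(w_p, b, dev_neg, theta):
    """Eq. (4): Err_{n,p} is the fraction of the m held-out development
    negatives whose output falls below the decision rule theta."""
    return float(np.mean(output_scores(w_p, b, dev_neg) - theta < 0.0))
```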
Fig. 3. Average Error Difference (ED) variation with increasing background dataset size for the SRE-08 and SRE-10 evaluation sets.
The average error for SRE-08 and SRE-10 is shown in Fig. 2; the error decreases as the background dataset size increases.
2.2. Background Dataset Selection Using Error Difference

The difference between two consecutive $Err$ terms is called the Error Difference (ED) and is calculated as

$ED_{n,p} = Err_{n,p} - Err_{n,p+1}$   (6)

where $p$ indexes the background dataset as before. The averages of ED for SRE-08 and SRE-10 are shown in Fig. 3. The best performance in Fig. 1 is correlated with the slope of ED in Fig. 3: the best performance for SRE-08 occurs at a background dataset size of 500 in Fig. 1, the steepest slope occurs at size 500 in Fig. 3, and the trend is similar for SRE-10. Background dataset selection therefore picks the steepest ED slope for each enrollment speaker. The interval between datasets is fixed at 100 examples, and the error decreases with increasing dataset size, as in Fig. 2. The simplified method to find the steepest slope for the background dataset is

$\mathrm{DatasetSelection} = \max_p \{\, ED_{n,p} - ED_{n,p+1} \,\}$   (7)

$ED_{n,p+1}$ is the model predicted to yield the smallest ED between two models; therefore, the $p + 1$ background dataset is selected to train the SVM for the enrollment speaker.
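A minimal sketch of the selection rule in Eqs. (6)-(7), assuming the $Err$ values of Eq. (4) have already been computed for one enrollment speaker at each candidate dataset size (function and variable names are illustrative):

```python
import numpy as np

def select_background_size(errors, sizes):
    """Eqs. (6)-(7): pick the background dataset at the steepest ED slope.

    errors: Err_{n,p} values for one enrollment speaker, one per
            candidate size in `sizes` (e.g. sizes = [100, 200, ..., 900]).
    """
    err = np.asarray(errors, dtype=float)
    ed = err[:-1] - err[1:]      # Eq. (6): ED_{n,p} = Err_{n,p} - Err_{n,p+1}
    slope = ed[:-1] - ed[1:]     # Eq. (7): ED_{n,p} - ED_{n,p+1}
    p = int(np.argmax(slope))    # index of the steepest drop in ED
    return sizes[p + 1]          # the p + 1 dataset trains the final SVM
```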
3. SYSTEM DESCRIPTION

3.1. Baseline System Setup
For parameterization, a 60-dimensional feature (19 MFCCs with log energy + Δ + ΔΔ), extracted with a 25 ms analysis window and a 10 ms skip rate and filtered by feature warping over a 3 s sliding window, is employed. The system also employs factor analysis, followed by Linear Discriminant Analysis (LDA) and Within-Class Covariance Normalization (WCCN), for the SVM system [1]; similar SVM processing represents the baseline here. The NIST 2004, 2005, and 2006 SRE enrollment data are used to train the gender-dependent UBM with 1024 mixtures. The total variability matrix was trained on Switchboard II Phases 2 and 3, Switchboard Cellular Parts 1 and 2, and the NIST SRE 2004, 2005, and 2006 male enrollment data with 5 or more recording sessions per speaker; a total of 400 factors were used. The LDA matrix is trained on the same data as the total variability matrix, and its dimension is set to 140 in our experiments. Finally, the within-class covariance matrix was trained using SRE-04 and SRE-05 data, and a cosine kernel was used to build the SVM systems.

3.2. Evaluation Dataset

The proposed algorithm is evaluated on the 5 min - 5 min telephone-telephone condition of the NIST 2008 and 2010 speaker recognition evaluation (SRE) corpora. The evaluation dataset was limited to male speakers.

3.3. Background Dataset and Score Normalization Set

The background dataset consists of SRE-04 and SRE-05, with a total of 2718 utterances. Each utterance is parameterized and used as a negative example. The support vector frequency method sorts the 2718 negative examples, and the top 100 are used for the decision rule. The enrollment speaker models are built with 200 to 900 negative examples, and the remaining 1718 examples are used for error estimation.

4. RESULTS

NIST SRE uses the detection cost function (DCF) to assess system performance. The proposed method achieves equivalent or improved performance in both DCF and Equal Error Rate (EER) compared with previous studies [4, 5], without fixing the background dataset size.

4.1. Background Dataset Selection Analysis

A different background dataset size is selected for each enrollment speaker by the proposed ED method. Table 1 shows the number of enrollment speakers assigned to each background dataset size for the SRE-08 and SRE-10 evaluation data.
Table 1. Number of enrollment speakers selecting each background dataset size.

Background Dataset Size    300   400   500   600   700   800   900
SRE-08                     175   144   129    69    57    45    29
SRE-10                     347   265   182   187    91    69    62
4.2. Background Dataset Selection Evaluation

Here, SRE-08 and SRE-10 5 min - 5 min male telephone data are used for evaluation, as well as for selection of the background dataset described in Sec. 2.2. The SRE-04 and SRE-05 corpora are used as the background dataset pool, and the proposed method is applied to find the best background dataset. Table 2 shows results for the best background dataset selected by the support vector frequency approach and by our proposed method. Here, the 2718 utterances of SRE-04 and SRE-05 serve as the background dataset, and 15 different background dataset sizes are evaluated with SRE-08 data: the size is incremented by 100 up to 1000 and by 200 up to 2000, as in Fig. 1. The proposed method uses eight background dataset sizes, incremented by 100 from 200 up to 900, with the remaining examples used for error estimation. The 5th dataset, consisting of 500 negative examples, yields the best DCF (0.543) for SRE-08, and the proposed method obtains a slightly better DCF (0.542). The best-performing background size of 500 was applied to the SRE-10 evaluation data for the NIST SRE submission; here the proposed method improves the DCF to 0.547, a 6% relative improvement over the submission result. The best performance on SRE-10 with a fixed background size is 300, with a DCF of 0.575, and the proposed method still performs better.

Table 2. Comparison of the proposed ED selection method with the best fixed-size background dataset (500 entries for SRE-08).
         Fixed-Size Background Dataset     Proposed ED Selection Method
         Size    Min. DCF   EER (%)        Min. DCF   EER (%)
SRE-08   500     0.543      4.98           0.542      4.97
SRE-10   500     0.583      5.83           0.547      5.49
SRE-10   300     0.575      5.79           --         --
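For reference, a minimal sketch of the standard EER computation from separate target and impostor score lists (the common definition, not the authors' scoring code):

```python
import numpy as np

def equal_error_rate(target_scores, impostor_scores):
    """EER: the operating point where the miss rate equals the
    false-alarm rate, found by scanning all observed score thresholds."""
    target = np.asarray(target_scores, dtype=float)
    impostor = np.asarray(impostor_scores, dtype=float)
    thresholds = np.sort(np.concatenate([target, impostor]))
    miss = np.array([np.mean(target < t) for t in thresholds])
    fa = np.array([np.mean(impostor >= t) for t in thresholds])
    i = int(np.argmin(np.abs(miss - fa)))
    return (miss[i] + fa[i]) / 2.0
```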
5. DISCUSSION AND CONCLUSIONS

A new method was proposed to find the best background speaker model without requiring the same fixed number of negative examples for every background dataset. The ED term measures the error between models, and the largest ED difference (the steepest slope in Fig. 3) selects a suitable background dataset for each enrollment speaker. This background dataset is used as the negative examples for training the enrollment speaker model. In this way, each enrollment speaker is trained with the most informative and flexibly sized set of negative speaker examples. The NIST SRE-04 and SRE-05 datasets were used as the background dataset pool. Background dataset selection using the ED method finds the minimum DCF/EER and is more robust to new unseen data than selecting a fixed-size dataset. The proposed method also enables us to reach the minimum DCF for a new background dataset or new evaluation data. For future work, a range of other kernels can be studied to project the support vectors into higher dimensions, and more accurate estimation of the SVM error bound can be studied for background dataset selection.

6. REFERENCES

[1] N. Dehak, P. Kenny, R. Dehak, P. Ouellet, and P. Dumouchel, "Front-end factor analysis for speaker verification," IEEE Transactions on Audio, Speech, and Language Processing, 2010.
[2] P. Kenny, G. Boulianne, P. Ouellet, and P. Dumouchel, "Speaker and session variability in GMM-based speaker verification," IEEE Transactions on Audio, Speech, and Language Processing, vol. 15, no. 4, pp. 1448–1460, 2007.
[3] T. Joachims, "SVMLight: Support Vector Machine," http://svmlight.joachims.org/, University of Dortmund, 1999.
[4] M. McLaren, B. Baker, R. Vogt, and S. Sridharan, "Improved SVM speaker verification through data-driven background dataset collection," in Proc. ICASSP, 2009, pp. 4041–4044.
[5] M. McLaren, B. Baker, R. Vogt, and S. Sridharan, "Exploiting multiple feature sets in data-driven impostor dataset selection," in Proc. ICASSP, 2010, pp. 4434–4437.
[6] V. Vapnik and O. Chapelle, "Bounds on error expectation for support vector machines," Neural Computation, vol. 12, no. 9, pp. 2013–2036, 2000.
[7] K. Duan, S.S. Keerthi, and A.N. Poo, "Evaluation of simple performance measures for tuning SVM hyperparameters," Neurocomputing, vol. 51, pp. 41–59, 2003.
[8] C.J.C. Burges, "A tutorial on support vector machines for pattern recognition," Data Mining and Knowledge Discovery, vol. 2, no. 2, pp. 121–167, 1998.