Modelling and Compensation for Language Mismatch in Speaker Verification

Abhinav Misra, John H. L. Hansen¹

Center for Robust Speech Systems (CRSS), Erik Jonsson School of Engineering & Computer Science, The University of Texas at Dallas (UTD), Richardson, Texas, USA
{abhinav.misra, john.hansen}@utdallas.edu
Abstract

Language mismatch represents one of the more difficult challenges in achieving effective speaker verification in naturalistic audio streams. The proportion of bilingual speakers worldwide continues to grow, making speaker verification for speech technology more difficult. In this study, three specific methods are proposed to address this issue. Experiments are conducted on the PRISM (Promoting Robustness in Speaker Modeling) evaluation set. We first show that adding small amounts of multi-lingual seed data to the Probabilistic Linear Discriminant Analysis (PLDA) development set leads to a significant relative improvement of +17.96% in system Equal Error Rate (EER). Second, we compute the eigendirections that represent the distribution of the multi-lingual data added to PLDA. We show that by adding these new eigendirections as part of the Linear Discriminant Analysis (LDA), and then minimizing them to directly compensate for language mismatch, further performance gains for speaker verification are achieved. By combining both multi-lingual PLDA and this minimization step with the new set of eigendirections, we obtain a +26.03% relative improvement in EER. In practical scenarios, it is highly unlikely that multi-lingual seed data representing the languages present in the test set would be available.
Hence, in the third phase, we address such scenarios by proposing a method for Locally Weighted Linear Discriminant Analysis (LWLDA). In this third method, we reformulate the LDA equations to incorporate a local affine transform that weights same-speaker samples. This method effectively preserves the local intrinsic information represented by the multimodal structure of the within-speaker scatter matrix, thereby helping to improve the class-discriminating ability of LDA. It also extends the ability of LDA to transform the speaker i-Vectors to dimensions greater than the total number of speaker classes. Using LWLDA, a relative improvement of +8.54% is obtained in system EER. LWLDA provides even more gain when multi-lingual seed data is available, improving system performance by a relative +26.03% in terms of EER. We also compare LWLDA to the recently proposed Nearest Neighbor Non-Parametric Discriminant Analysis (NDA). We show that LWLDA is not only better than NDA in terms of system performance but is also computationally less expensive. Comparative studies on the DARPA Robust Automatic Transcription of Speech (RATS) corpus also show that LWLDA consistently outperforms NDA and LDA on different evaluation conditions. Our solutions offer new directions for addressing a challenging problem which has received limited attention in the speaker recognition community.

Keywords: Speaker Verification, Language Mismatch, Linear Discriminant Analysis, Probabilistic Linear Discriminant Analysis

¹ This project was funded by AFRL under contract FA8750-12-1-0188 and partially by the University of Texas at Dallas from the Distinguished University Chair in Telecommunications Engineering held by J. H. L. Hansen.
1. Introduction

Speaker verification involves identifying a person from his/her voice [1]. Performance in speaker verification degrades when any mismatch exists between enrollment and test conditions [2, 3]. Several studies have attempted to address this problem [4, 5, 6, 7], but the emphasis has been predominantly on addressing channel mismatch, with the hope of simultaneously reducing the impact of language mismatch. Compensation for mismatch between enrollment and test languages has received relatively limited attention.
Previous studies directed at compensating for language mismatch include training speaker models using both enrollment and test languages [8]. In another study [9], the authors incorporated Language Identification (LID) as a first layer in speaker verification to detect the test language, and then used an appropriate model trained on that language for scoring. Both of these studies, however, require a Gaussian Mixture Model-Universal Background Model (GMM-UBM) speaker verification system [10], which is no longer state-of-the-art in speaker verification. In [11], a language-dependent subspace is estimated using a Joint Factor Analysis (JFA) [6] framework and then suppressed as a nuisance attribute. This approach is close to the state-of-the-art system, but requires significant multi-lingual seed data to train; such amounts of data may not be available for all mismatched languages. In [12], the authors demonstrate the impact of language mismatch on the NIST Speaker Recognition Evaluation 2006 (SRE-2006) corpus, but fall short of proposing any solution to the problem. In [13], the authors propose adding small amounts of multi-lingual data to a Probabilistic Linear Discriminant Analysis (PLDA) development set and achieve a significant improvement. In this study, we expand on the preliminary work considered in [13], and propose three complementary solutions for language mismatch. In contrast to our previous study, we use the standard PRISM evaluation set [14]. PRISM combines NIST SRE data-sets from 2004 to 2010, as well as Fisher [15] and Switchboard [16]. The enrollment set in our system consists of English recordings, while the entire test set comprises non-English recordings. We show improvement by adding small amounts of multi-lingual seed data, representing the test-set languages, to PLDA. In [17], we showed improvement in language recognition by maximizing the mismatch between the different languages present in the LDA development set. We proposed a method entitled Between-Class Covariance Correction (BCC), in which we compute the eigendirections representing the multi-modal distribution of different language i-Vectors, and show that
incorporating these eigendirections in LDA leads to an improvement in language recognition performance. This improvement motivated us to apply the same method to speaker recognition as well. The difference in speaker recognition, though, is that instead of maximizing the mismatch between languages, we aim to minimize it. Hence, we formulate Within-Class Covariance Correction (WCC), in which the eigendirections are added to the within-class covariance matrix.

LDA works well only when the samples in a class are unimodal Gaussian. In [18], the authors showed that when data in a within-class scatter matrix comes from different channel sources, it is distributed in the form of different clusters, with each cluster corresponding to a separate channel source. In such cases, the LDA process tends to give an inaccurate estimate of the within-class scatter matrix, which negatively affects overall speaker verification performance. Motivated by this observation, we present an alternative non-parametric discriminant analysis technique (LWLDA) that measures the within-speaker variation on a local basis using an affinity matrix. The affinity matrix is chosen such that nearby data pairs in the within-speaker scatter are kept closer, while far-apart data pairs are not forced to be close to each other. Weighting the within-speaker data in this way helps to locally preserve its multimodal structure, and improves system performance.

The remainder of this paper is outlined as follows. We first give a brief overview of the speaker verification system used in our experiments in Section 2. Section 3 presents the formulated multi-lingual PLDA method. In Section 4, the WCC method is presented, and Section 5 discusses the proposed LWLDA technique of dimensionality reduction. Section 6 presents the experimental setup and discusses the results of this study, while Section 7 concludes the study.
2. System Description

In this section, we present a brief review of the i-Vector-PLDA based speaker verification system used in this study, as shown in Figure 1.
Figure 1: System diagram of the i-Vector-PLDA speaker verification system employed in this study (MFCC extraction, UBM and T.V. matrix trained on development data; evaluation and PLDA development i-Vectors passed through the LDA matrix and length normalization before PLDA produces the verification score).
An i-Vector is obtained by mapping a sequence of feature vectors, representing a speech utterance, to a fixed-length vector [7]. There are several ways to compute this mapping. One of the more traditional ways is to first compute Mel-Frequency Cepstral Coefficients (MFCCs) from the speech utterance, and then collect Baum-Welch statistics using the cepstral coefficients and a separately trained UBM. Recently, senone posteriors from a Deep Neural Network (DNN) trained for Automatic Speech Recognition (ASR) have also been used to calculate the statistics [19]. In this study, since the proposed methods are focused mainly on improving the back-end of state-of-the-art speaker verification systems, we choose the standard traditional i-Vector extraction method. We first compute the zero- and first-order statistics from the cepstral features using the mixtures of a UBM. Next, a supervector is constructed by appending together the first-order statistics computed for each mixture component. This supervector is
assumed to obey a linear model of the following form:

M = m + Tv,   (1)
where M is the given supervector obtained after concatenation of the new means, m corresponds to a supervector obtained after concatenation of the means of each mixture component, T is a Total Variability (TV) matrix, and v is a hidden variable with a standard normal prior (zero mean and identity covariance matrix). The TV matrix spans a subspace containing speaker- and channel-specific information. It is trained using a method similar to that for estimating eigenvoices in speech recognition [20]. For each speech utterance u, an i-Vector ω is obtained as the mean of the posterior probability p(v|u). Once an i-Vector has been extracted, it is subjected to one or more channel compensation techniques, such as Linear Discriminant Analysis (LDA) [21], Within-Class Covariance Normalization (WCCN) [22], or Nuisance Attribute Projection (NAP) [23]. Next, the channel-compensated i-Vectors are length normalized [24], to allow for the use of a Gaussian based PLDA classifier [25]. During PLDA based classification, it is assumed that i-Vectors follow an affine model similar to Eq. (1). For any speaker having R utterances (r = 1, ..., R), each of the R i-Vectors can be decomposed as:
\omega_r = m + \phi\beta + \epsilon_r,   (2)
where ω_r is an i-Vector representing the r-th utterance of a speaker, m is a global offset, φ corresponds to a matrix that spans the speaker-specific subspace, β is a standard-normally distributed latent variable, and ε_r is a residual channel-dependent term that is assumed to have zero mean and a full covariance matrix Σ. All model parameters {m, φ, Σ} are obtained from a large collection of development i-Vectors using Expectation Maximization (EM) as in [26].
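As a concrete illustration of the statistics-collection step described at the start of this section, the following is a minimal numpy sketch of zero- and first-order Baum-Welch statistics computed against a diagonal-covariance UBM. It is a sketch under stated assumptions, not the exact recipe used in this study; all function and variable names are illustrative.

```python
import numpy as np
from scipy.stats import multivariate_normal

def baum_welch_stats(features, ubm_weights, ubm_means, ubm_covs):
    """Zero- and first-order statistics of one utterance against a UBM.

    features:    (T, D) array of MFCC frames
    ubm_weights: (C,) mixture weights
    ubm_means:   (C, D) component means
    ubm_covs:    (C, D) diagonal covariances
    """
    C = ubm_weights.shape[0]
    # Frame-level log-likelihood of each mixture component.
    log_lik = np.stack([
        multivariate_normal.logpdf(features, ubm_means[c], np.diag(ubm_covs[c]))
        for c in range(C)
    ], axis=1) + np.log(ubm_weights)              # (T, C)
    # Posterior (occupancy) probabilities gamma_t(c).
    log_norm = np.logaddexp.reduce(log_lik, axis=1, keepdims=True)
    gamma = np.exp(log_lik - log_norm)            # (T, C)
    N = gamma.sum(axis=0)                         # zero-order stats, (C,)
    F = gamma.T @ features                        # first-order stats, (C, D)
    # The supervector appends the first-order stats of all components.
    return N, F
```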
To obtain a verification score, a log-likelihood ratio is calculated as:
score = \log \frac{p(\omega_1, \omega_2 \,|\, H_{target})}{p(\omega_1, \omega_2 \,|\, H_{nontarget})},   (3)
where H_target is the hypothesis that both ω_1 and ω_2 share the same latent variable β, and H_nontarget is the hypothesis that the i-Vectors were generated from different latent variables. It is assumed that length normalization ensures the i-Vectors are normally distributed, and based on that assumption the following closed-form solution is used to compute the likelihood ratio:
score = \log N\!\left( \begin{bmatrix} \omega_1 \\ \omega_2 \end{bmatrix};
\begin{bmatrix} m \\ m \end{bmatrix},
\begin{bmatrix} \Sigma_{tot} & \Sigma_{ac} \\ \Sigma_{ac} & \Sigma_{tot} \end{bmatrix} \right)
- \log N\!\left( \begin{bmatrix} \omega_1 \\ \omega_2 \end{bmatrix};
\begin{bmatrix} m \\ m \end{bmatrix},
\begin{bmatrix} \Sigma_{tot} & 0 \\ 0 & \Sigma_{tot} \end{bmatrix} \right),   (4)

where \Sigma_{tot} = \phi\phi^T + \Sigma and \Sigma_{ac} = \phi\phi^T.
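The closed-form score of Eq. (4) can be computed directly from the PLDA parameters. The following is a minimal numpy sketch, assuming the model parameters {m, φ, Σ} have already been estimated; it is illustrative rather than the exact implementation used in this study.

```python
import numpy as np

def plda_llr(w1, w2, m, phi, sigma):
    """Verification score of Eq. (4) for two length-normalized i-Vectors.

    m: global mean; phi: speaker subspace; sigma: residual covariance.
    Sigma_tot = phi phi^T + sigma, Sigma_ac = phi phi^T.
    """
    d = m.shape[0]
    sigma_tot = phi @ phi.T + sigma
    sigma_ac = phi @ phi.T
    x = np.concatenate([w1, w2]) - np.concatenate([m, m])
    # Covariance of the stacked i-Vector pair under each hypothesis.
    cov_tar = np.block([[sigma_tot, sigma_ac], [sigma_ac, sigma_tot]])
    cov_non = np.block([[sigma_tot, np.zeros((d, d))],
                        [np.zeros((d, d)), sigma_tot]])

    def log_gauss(x, cov):
        sign, logdet = np.linalg.slogdet(cov)
        return -0.5 * (x @ np.linalg.solve(cov, x) + logdet
                       + x.size * np.log(2 * np.pi))

    return log_gauss(x, cov_tar) - log_gauss(x, cov_non)
```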
3. Method-I: Multi-lingual PLDA

In the formulation of PLDA, we assume probabilistic generative models of i-Vector distributions. Specifically, the evaluation i-Vectors are assumed to obey a Gaussian distribution with model parameters {m, φ, Σ}. The model parameters are computed using a set of development i-Vectors intended to represent the distribution of the enrollment and test i-Vectors. It has been shown that if there is a mismatch between development and evaluation data, the performance of a speaker verification system degrades [27]. Hence, model parameters obtained from development i-Vectors should closely match the distribution of the enrollment and test i-Vectors. Motivated by this observation, we consider adding small amounts of multi-lingual data, representing all languages from the test set, to the development data used for training the PLDA parameters. While adding the multi-lingual data, we make sure that no additional speakers are added and the total number of recordings used to train the PLDA remains
the same. This is done by selecting bilingual speakers from the PLDA development set and replacing some of their English recordings with their non-English recordings. In total, 8.04% of all the English recordings from the PLDA development set are replaced by non-English recordings. This ensures that the improvement is obtained solely due to the multi-lingual data, and not because of any additional information added by new speakers or recordings. To assess the acoustic model proximity, we also compute the Mahalanobis distance between each of the evaluation i-Vectors and the i-Vectors used to train PLDA. This is done to show that the multi-lingual PLDA better represents the distribution of the evaluation i-Vectors compared to the original English-only PLDA. The Mahalanobis distance measures the separation of a point p from a distribution D as:
mahal_d = \sqrt{(p - \mu)^T \Sigma^{-1} (p - \mu)},   (5)

where µ and Σ are the mean and covariance of the distribution D. For mahal_d, the smaller the distance, the closer the point is to the distribution. Figure 2 shows the mean Mahalanobis distance between the evaluation i-Vectors and the i-Vectors used to train the English-only PLDA and the multi-lingual PLDA. It can be observed that, compared to the English-only PLDA, the multi-lingual PLDA is closer (i.e., smaller distance) to the evaluation i-Vectors by a relative 7.57%.

It can be noted that multi-lingual data can also be added during the i-Vector extraction stage. In [13], the authors showed that adding data to the UBM or TV matrix offers an improvement in performance. However, the improvement obtained by such additions is small compared to the addition within the PLDA stage. One reason may be that, even though addition at the i-Vector extraction stage encodes multi-lingual information in the i-Vectors, most of this information is lost through the subsequent LDA and length normalization stages. On the other hand, addition at the PLDA stage is directly followed by the scoring stage, and hence yields a more effective improvement in overall system performance.
Figure 2: Mean Mahalanobis distance (in squared units) of eval i-Vectors from two sets of PLDAs: English-only PLDA (208.19) and multi-lingual PLDA (192.41).
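For reference, the distance of Eq. (5) can be computed as in the following numpy sketch, where eval_ivectors and plda_ivectors are illustrative placeholders for the evaluation i-Vectors and the PLDA development i-Vectors.

```python
import numpy as np

def mean_mahalanobis(eval_ivectors, plda_ivectors):
    """Mean Mahalanobis distance (Eq. 5) of eval i-Vectors from the
    distribution of the i-Vectors used to train PLDA."""
    mu = plda_ivectors.mean(axis=0)
    sigma = np.cov(plda_ivectors, rowvar=False)
    sigma_inv = np.linalg.inv(sigma)
    diffs = eval_ivectors - mu                          # (N, D)
    d2 = np.einsum('nd,de,ne->n', diffs, sigma_inv, diffs)
    return np.sqrt(d2).mean()
```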
4. Method-II: Within-Class Covariance Correction (WCC)

In this section, we present Within-Class Covariance Correction (WCC) and how it can be adapted to compensate for language mismatch in speaker verification. Before starting, let us recall the formulation of LDA. LDA aims to find a set of dimensions onto which speaker i-Vectors can be projected such that the separability between different speakers is maximized, while at the same time the separability between samples of the same speaker is minimized. This is accomplished by maximizing the ratio of the between-class scatter matrix S_b to the within-class scatter matrix S_w. S_b and S_w are computed as:
S_b = \frac{1}{n} \sum_{spk=1}^{p} n_{spk} (\mu_{spk} - \mu)(\mu_{spk} - \mu)^T,   (6)

S_w = \frac{1}{n} \sum_{spk=1}^{p} \sum_{j=1}^{n_{spk}} (\omega_j^{spk} - \mu_{spk})(\omega_j^{spk} - \mu_{spk})^T,   (7)
where the number of speakers (or classes) is p, ω is an i-Vector, and n_spk is the number of i-Vectors corresponding to speaker spk. µ_spk is the mean of the i-Vectors belonging to speaker spk, while µ is the global mean of all n i-Vectors present in the development data-set. In order to formulate a criterion for class separability, after computing the scatter matrices we need to convert them to a single number. This number should be larger when the between-class scatter is larger or the within-class scatter is smaller. One typical criterion is:

f = tr(S_w^{-1} S_b).   (8)
Our aim is to optimize f by finding a linear transformation A that transforms the i-Vectors from an input space x of high dimensionality to an output space y of lower dimensionality as:

y = A^T x.   (9)
Since scatter matrices have the form of covariance matrices, Sw and Sb in the y space can be computed from Sw and Sb in the x space using the following equation:
S_y = A^T S_x A,   (10)
where both S_x and S_y can be either S_w or S_b. Hence, f can also be written as:

f = tr((A^T S_w A)^{-1} (A^T S_b A)).   (11)
Here, the value of A that optimizes f is given by the eigenvectors corresponding to the largest eigenvalues of S_w^{-1} S_b [21].
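In practice, A can be obtained by solving the equivalent generalized eigenvalue problem S_b v = λ S_w v, which avoids explicitly inverting S_w. A minimal scipy-based sketch, with illustrative names, follows:

```python
import numpy as np
from scipy.linalg import eigh

def lda_transform(Sw, Sb, n_dims):
    """LDA projection A: eigenvectors of Sw^{-1} Sb with the largest
    eigenvalues, obtained from the generalized problem Sb v = lambda Sw v."""
    eigvals, eigvecs = eigh(Sb, Sw)          # ascending eigenvalues
    order = np.argsort(eigvals)[::-1]        # sort descending
    return eigvecs[:, order[:n_dims]]        # columns form A

# Projection to the reduced space, per Eq. (9):
# y = lda_transform(Sw, Sb, n_dims=200).T @ x
```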
To compensate for language mismatch, we should maximize the separation between different speakers, while at the same time minimizing the separation between the different languages spoken by each speaker. To achieve this, we modify the within-class scatter matrix S_w to include an additional term representing the covariance of the different languages around a global mean. We define this new term as the "between-language covariance" S_bl and compute it as:

S_{bl} = \frac{1}{l} \sum_{i=1}^{l} (\mu_i - \mu)(\mu_i - \mu)^T,   (12)
where µ_i is the mean of the i-Vectors representing language i, l is the total number of languages present in the development set, and µ is the global mean of all the development i-Vectors. For the current study, there are a total of 25 different languages present in the LDA development set. However, there is an imbalance in the number of i-Vectors available for each language. As discussed in the previous section, only 8.04% of the development data is replaced by non-English recordings, while the rest is all English. Hence, the S_bl term would be heavily biased by the overwhelming majority of English i-Vectors. In order to ensure that all languages carry similar weight in the computation of S_bl, we perform some pruning. Most languages here have around 150 i-Vectors in the development set; for languages with more than that, we consider only the first 150 i-Vectors. Also, some languages have fewer than 10 i-Vectors; we discard these low-count languages entirely in the S_bl computation. After addressing the data imbalance, we compute S_bl and add it to S_w as:

S_w^{new} = S_w + \alpha S_{bl},   (13)
where α is a scaling factor chosen heuristically. The guiding principle behind this heuristic was the observation that as the values in the between-language covariance matrix and the within-speaker covariance matrix become similar to each other, maximum improvement is obtained in system
performance. For the value α = 1, the values are most similar to each other, and hence we obtain the best performance. It can be noted that the rank of S_bl is much lower than the rank of S_w. Since we are adding a term to the within-class scatter matrix, the method is known as Within-Class Covariance Correction (WCC). There is another similar method widely used in speaker recognition, called Within-Class Covariance Normalization (WCCN). In [22], WCCN is employed to minimize the inter-session variability of speakers by modifying the generalized linear kernel of a Support Vector Machine (SVM) based speaker recognition system: the within-class covariance matrix, similar to S_w, is computed first, and its inverse is then used to normalize a linear kernel. In WCC, on the other hand, language-specific covariances are first computed by subtracting the global mean from the language-specific means; all the language-specific covariances are then averaged together to obtain the S_bl term. Hence, WCC differs from WCCN both in its core methodology and in how the training labels are used.
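A minimal sketch of the WCC computation, combining Eqs. (12) and (13) with the pruning described above, is given below. The cap of 150 i-Vectors per language and the minimum count of 10 follow the text, while the function and variable names are illustrative.

```python
import numpy as np

def wcc_correction(ivectors, lang_labels, Sw, alpha=1.0, cap=150, min_count=10):
    """Within-Class Covariance Correction (Eq. 13), a sketch.

    ivectors:    (N, D) LDA development i-Vectors
    lang_labels: length-N array of language labels
    Sw:          within-class scatter matrix of the development set
    """
    mu = ivectors.mean(axis=0)
    groups = []
    for lang in np.unique(lang_labels):
        subset = ivectors[lang_labels == lang]
        if len(subset) < min_count:
            continue                       # discard low-count languages
        groups.append(subset[:cap])        # cap each language at ~150 i-Vectors
    # Between-language covariance around the global mean (Eq. 12).
    Sbl = np.zeros((ivectors.shape[1], ivectors.shape[1]))
    for subset in groups:
        diff = subset.mean(axis=0) - mu
        Sbl += np.outer(diff, diff)
    Sbl /= len(groups)
    return Sw + alpha * Sbl                # corrected within-class scatter
```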
5. Method-III: Locally Weighted Linear Discriminant Analysis (LWLDA)

The traditional approach to computing LDA has three major disadvantages. First, it assumes a Gaussian distribution for both the intra- and inter-speaker scatter matrices. However, in the presence of noise or channel distortions, data from the same speaker is not necessarily distributed in a Gaussian manner [28, 29]. Second, LDA can map x to at most p − 1 (number of speakers − 1) dimensions. This occurs because S_b has rank p − 1, and consequently the rank of S_w^{-1} S_b is also p − 1. Hence, only p − 1 eigenvalues of S_w^{-1} S_b are non-zero, while the rest are
all zero. The final disadvantage of LDA is that the computation of S_b is based on a scatter of the mean vectors. Hence, class-separability information encoded in the covariance differences is not effectively captured, limiting the classification ability of the algorithm. Motivated by these observations, the authors of [29] proposed a method of nearest-neighbor discriminant analysis (NDA)
that measures the intra- and inter-speaker scatter matrices on a local basis using the nearest-neighbor rule. NDA showed improvement on the DARPA Robust Automatic Transcription of Speech (RATS) and NIST SRE 2004-2010 corpora [29, 30]. However, we observed that on the PRISM set, NDA fails to provide any system gain. In this section, we present a new method of locally weighted linear discriminant analysis (LWLDA). LWLDA, like NDA, is based on non-parametric discriminant analysis. However, unlike NDA, LWLDA focuses on weighting the within-speaker i-Vectors. The weight matrix is computed such that the complex (multi-modal) structure of the within-speaker data is preserved. This is achieved by constraining the values of the weight matrix to lie between 0 and 1: the values are large when i-Vectors are close, and small when they are far apart. Hence, far-apart sample pairs belonging to the same class have less influence on the within-speaker scatter computation than closer sample pairs. Sample pairs belonging to different classes are not weighted by the weight/affinity matrix, since we want them to be separated from each other irrespective of whether any affinity exists between them. Based on our experiments, we observed that LWLDA gives better performance than NDA on the PRISM data-set.

5.1. Choice of Affinity Matrix

One of the simplest choices of an affinity matrix H is to assign H_{i,j} = 1 when i-Vectors are neighbors and H_{i,j} = 0 otherwise. However, this kind of hard thresholding does not represent the contribution that far-apart i-Vectors might still make to the S_w computation. Hence, we consider a Gaussian function that varies with the local density h of the data samples as our affinity matrix:
H_{i,j} = \exp\!\left( -\frac{\|\omega_i - \omega_j\|^2}{h_i h_j} \right),   (14)
where ||·|| denotes the Euclidean norm and h_i is a scaling factor that takes into account the distribution of samples around ω_i. It is defined as:

h_i = \|\omega_i - \omega_i^{(k)}\|,   (15)
where ω_i^{(k)} is the k-th nearest neighbor of ω_i. The value of k is derived heuristically, and can vary for different distributions.

5.2. Reformulating the LDA Equations

Once an affinity matrix is chosen, we next need to incorporate it in the LDA equations. To accomplish this, the LDA equations are first reformulated in a non-parametric manner, using data pairs [31]. It is possible to rewrite S_b and S_w, given in Equations 6 and 7, in the following alternative form:

S_b' = \frac{1}{2} \sum_{i,j=1}^{n} W_{i,j}^b (\omega_i - \omega_j)(\omega_i - \omega_j)^T,   (16)

S_w' = \frac{1}{2} \sum_{i,j=1}^{n} W_{i,j}^w (\omega_i - \omega_j)(\omega_i - \omega_j)^T,   (17)

where

W_{i,j}^w = \begin{cases} \frac{1}{n_{spk}}, & spk_i = spk_j \\ 0, & spk_i \neq spk_j, \end{cases}   (18)

W_{i,j}^b = \begin{cases} \frac{1}{n} - \frac{1}{n_{spk}}, & spk_i = spk_j \\ \frac{1}{n}, & spk_i \neq spk_j, \end{cases}   (19)
where spk is a speaker class and n_spk is the number of i-Vectors corresponding to speaker spk. Equations 16 and 17 are identical to Equations 6 and 7, respectively, as proven in Appendix A. It can be observed from the new formulation that (1/n − 1/n_spk) is negative, while 1/n and 1/n_spk are positive. Hence, when data pairs belong to the same class, the terms in S_b' are weighted negatively, making S_b' smaller, while the terms in S_w' are weighted positively, making S_w' larger. The exact opposite happens when data pairs belong to different classes: terms in S_b' are weighted positively, making S_b' larger, while terms in S_w' are given zero weight, making S_w' smaller. Therefore, the new formulation conforms to our notion of LDA, in which the distance between samples of different classes is maximized, while the distance between samples of the same class is minimized.

5.3. Application of the Affinity Matrix

Finally, after reformulating the LDA equations in a pairwise manner, we apply the affinity transform H. This is accomplished by simply replacing W with H in the above equations. It can be noted that if the affinity value is set to 1 for all sample pairs, S_b' and S_w' become equal to S_b and S_w, respectively.
Hence, we can say that LWLDA is simply a localized variant of LDA. The transformation matrix A, as defined in Section 4, is computed in the same way as in the original LDA, except that the local S_b' and S_w' are now used instead of the global S_b and S_w. As a result of the affinity transform, S_b' generally has a much higher rank than p − 1.

5.4. Computational Complexity Analysis

The between-class scatter matrix S_b in NDA is computed as:
S_b = \sum_{spk=1}^{p} \sum_{\substack{l=1 \\ l \neq spk}}^{p} \sum_{j=1}^{n_{spk}} w_j^{spk,l} \left( \omega_j^{spk} - LM_j^{spk,l} \right) \left( \omega_j^{spk} - LM_j^{spk,l} \right)^T,   (20)
where w_j^{spk,l} is the weighting function, and LM_j^{spk,l} is the local mean of the k-NN samples for ω_j^{spk} from class l, given as:

LM_j^{spk,l} = \frac{1}{k} \sum_{q=1}^{k} NN_q(\omega_j^{spk}, l),   (21)
where NN_q(ω_j^{spk}, l) is the q-th nearest neighbor of ω_j^{spk} in class l. S_w is computed in a similar way to S_b, except that the weighting function is set to 1 and the local gradients are computed within each class. Once we obtain the scatter matrices, A is computed, just as in LDA, as the eigenvectors corresponding to the largest eigenvalues of S_w^{-1} S_b.
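For reference, a direct (unoptimized) sketch of the NDA between-class scatter of Eqs. (20) and (21) is shown below. The weighting function w_j^{spk,l}, whose exact form is not given here, is set to 1 for brevity; all names are illustrative.

```python
import numpy as np

def nda_between_scatter(ivectors_by_class, k):
    """Sketch of the NDA between-class scatter (Eq. 20) with w set to 1.

    ivectors_by_class: list of (n_spk, D) arrays, one per speaker class.
    Note the nested loops over classes, competing classes and samples
    that drive the O(p^2 * n_spk * k) cost discussed below.
    """
    p = len(ivectors_by_class)
    D = ivectors_by_class[0].shape[1]
    Sb = np.zeros((D, D))
    for spk in range(p):                        # every class
        for l in range(p):                      # every competing class
            if l == spk:
                continue
            for w_j in ivectors_by_class[spk]:  # every sample of class spk
                # Local mean of the k nearest neighbors in class l (Eq. 21).
                d = np.linalg.norm(ivectors_by_class[l] - w_j, axis=1)
                knn = ivectors_by_class[l][np.argsort(d)[:k]]
                diff = w_j - knn.mean(axis=0)
                Sb += np.outer(diff, diff)
    return Sb
```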
It can be observed that the NDA computation has four nested loops. This leads to a computational time complexity of O(p(p − 1) n_spk k), which simplifies to O(p^2 n_spk k). On the other hand, LWLDA has only two nested loops, with a time complexity of O(p n_spk). Hence, it is clear that NDA involves significantly more computation than LWLDA.
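To make the comparison concrete, the following numpy sketch computes the LWLDA scatter matrices of Eqs. (14)-(19), applying the affinity to the within-speaker weights. Whether the affinity replaces or multiplies the 1/n_spk factor of Eq. (18) is an implementation choice; this sketch multiplies them, and all names are illustrative.

```python
import numpy as np

def lwlda_scatters(ivectors, spk_labels, k=5):
    """Sketch of the LWLDA scatter matrices (Eqs. 14-19).

    Same-speaker pairs are weighted by the Gaussian affinity H (Eq. 14);
    different-speaker pairs keep the parametric weights of Eq. (19).
    """
    n, D = ivectors.shape
    # Pairwise squared Euclidean distances.
    d2 = ((ivectors[:, None, :] - ivectors[None, :, :]) ** 2).sum(-1)
    # Local scale h_i: distance to the k-th nearest neighbor (Eq. 15).
    h = np.sqrt(np.sort(d2, axis=1)[:, k])
    H = np.exp(-d2 / np.outer(h, h))                    # affinity (Eq. 14)
    same = spk_labels[:, None] == spk_labels[None, :]   # same-speaker mask
    counts = np.array([(spk_labels == s).sum() for s in spk_labels])
    Ww = np.where(same, H / counts[:, None], 0.0)       # affinity-weighted (18)
    Wb = np.where(same, 1.0 / n - 1.0 / counts[:, None], 1.0 / n)  # (19)

    def pairwise_scatter(W):
        # (1/2) sum_ij W_ij (w_i - w_j)(w_i - w_j)^T, expanded as in App. A
        # (valid for the symmetric W used here).
        S = (W.sum(axis=1)[:, None] * ivectors).T @ ivectors
        return S - ivectors.T @ (W @ ivectors)

    return pairwise_scatter(Wb), pairwise_scatter(Ww)   # S_b', S_w'
```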
6. Evaluation

NDA was initially proposed to address the DARPA RATS corpus for speaker verification [32]. Hence, to facilitate comparison, we evaluate LWLDA on both the SRE and DARPA RATS corpora. First, we discuss the experiments performed on SRE and then move to the DARPA RATS corpus.

6.1. Database: SRE

Telephone-channel data consisting of female bilingual speakers from SRE05, SRE06 and SRE08 is extracted from the PRISM trial list. The data consists of speakers that have at least one recording in English and one in some other language. There are a total of 265 such bilingual speakers in these SRE sets, speaking 25 non-English languages. The English recordings of all speakers form the enrollment set, while the non-English recordings form the test set. A Cartesian product of all sessions from these 265 speakers is used to generate the language mismatch trials, giving a total of 1,379 target trials and 364,056 non-target trials. Also, 242 of these 265 speakers have more than one English recording. English recordings from these 242 speakers are used in a similar way to form the language match trials, giving a total of 1,313 target trials and 346,632 non-target trials.

6.1.1. Experimental Setup

After both the language match and the language mismatch trials have been established, cepstral features are extracted using a 25-ms Hamming window.
Every 10 ms, 12 MFCCs (excluding C0) are computed. Next, delta and delta-delta coefficients are added to produce a 36-dimensional feature vector for each recording. For the UBM, TV matrix, LDA and PLDA, gender-dependent (female) training data is used. The data is extracted from Fisher 1 and 2, Switchboard phases 2 and 3, Switchboard cellular phases 1 and 2, and all the Mixer speakers (SRE04-10) that were not used in creating the evaluation trials. Specifically, the UBM and TV matrix are trained using only Fisher and Mixer data, while the LDA and PLDA models are trained using those speakers from all the training data that have at least six sessions. After the i-Vectors are extracted, LDA is applied, followed by length normalization. Finally, PLDA is used to obtain a verification score for each trial using Eq. (4).

To measure system performance, we chiefly employ two metrics: a) Equal Error Rate (EER), and b) the minimum detection cost function with the NIST SRE-2008 definition (MinDCF-08) [33]. Both of these metrics measure the discrimination ability of a speaker recognition system. While conducting our experiments, we also used the minimum detection cost function with the NIST SRE-2010 definition (MinDCF-10) [34] as a metric. However, it did not provide useful information for comparing the proposed and competing methods, due to minimal variation in its magnitude; hence we do not report results using MinDCF-10. Also, since we are not dealing with calibration issues, the log-likelihood-ratio cost function (C_llr) [35] metric was not used.

6.1.2. Results

Table 1 shows the performance of the system for the language match and language mismatch trials. It can be observed that the EER more than doubles when language mismatch is present. After the addition of multi-lingual data, a relative improvement of 17.96% in EER is obtained in speaker verification over the original language mismatch trials. After applying WCC, further improvement is observed, with the EER dropping to 3.99%. This corresponds to a 9.83% relative improvement in EER with respect to the multi-lingual PLDA system, and a total 26.03% relative improvement with respect to the original baseline system.
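As a reference for how the primary metric above is computed, the following is a small sketch of EER estimation from trial scores by sweeping the decision threshold; it is illustrative, not the NIST scoring tool.

```python
import numpy as np

def equal_error_rate(target_scores, nontarget_scores):
    """EER: the operating point where the miss (false reject) rate
    equals the false-alarm rate, found by sweeping the threshold."""
    scores = np.concatenate([target_scores, nontarget_scores])
    labels = np.concatenate([np.ones_like(target_scores),
                             np.zeros_like(nontarget_scores)])
    order = np.argsort(scores)
    labels = labels[order]
    # Miss and false-alarm rates at every candidate threshold.
    fr = np.cumsum(labels) / labels.sum()                   # miss rate
    fa = 1.0 - np.cumsum(1 - labels) / (1 - labels).sum()   # false alarm
    idx = np.argmin(np.abs(fr - fa))
    return 0.5 * (fr[idx] + fa[idx])
```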
Figure 3: DET plots showing the system performances when English development data is used for LWLDA and NDA. Curves: language mismatch using NDA (5.51), LDA (5.39), LWLDA (4.93), WCC (3.99), and language match using LDA (2.41). The numbers in parentheses correspond to points on the curve denoting EER.

Figure 4: DET plots showing the system performances when multi-lingual development data is used for LWLDA and NDA. Curves: language mismatch using LDA (5.39), multi-lingual PLDA (4.42), NDA (4.04), LWLDA (3.99), and language match using LDA (2.41). The numbers in parentheses correspond to points on the curve denoting EER.
Enrollment   Test          Dev set         Proposed Method      EER (%)   Min. DCF-08   Total Eval. Errors (FA+FR)
English      English       English         -                    2.41      0.94          -10,369
English      Non-English   English         LWLDA                4.93      2.23          -1,601
English      Non-English   English         -                    5.39      2.26          baseline
English      Non-English   English         NDA                  5.51      2.09          +471
English      Non-English   Multi-lingual   Multi-lingual PLDA   4.42      1.94          -3,375
English      Non-English   Multi-lingual   NDA                  4.04      1.66          -4,698
English      Non-English   Multi-lingual   WCC (α = 1)          3.99      1.86          -4,871
English      Non-English   Multi-lingual   LWLDA                3.99      1.73          -4,871

Table 1: Language mismatch in speaker verification: results on the SRE database. Here, total errors refer to the absolute count of False Alarm and False Reject errors relative to the baseline; a (+) sign indicates more errors than the baseline system, a (-) sign fewer.
Next, LWLDA is applied to the original system with all-English LDA development data. As observed from Table 1, a relative improvement of 8.54% in EER is obtained for the language mismatch trials. Hence, even if multi-lingual seed data representing the languages present in the test set is not available, we can still obtain an improvement in system performance using LWLDA. NDA, on the other hand, fails to provide any improvement over the language mismatch trials when only English development data is available; system performance after applying NDA remains almost the same. Figure 3 shows the Detection Error Tradeoff (DET) curves representing the system performances using our proposed methods, as well as the comparative methods. It can also be observed from Table 1 that when multi-lingual development data is available, LWLDA yields further improvement: we obtain a 26.03%
relative improvement in EER with respect to the baseline language mismatch system when using LWLDA. This shows that as the distribution of within-speaker i-Vectors becomes more multi-modal with the addition of new languages, LWLDA becomes even more effective in improving system performance. NDA also provides a 25.04% relative improvement in EER with respect to the baseline language mismatch system in the case of multi-lingual development data. Since LWLDA and NDA provide similar improvement with multi-lingual development data, we count the total number of errors each system makes in order to distinguish their performance more clearly. In Table 1, we report the number of errors made by a system relative to the baseline language mismatch system employing LDA: a (+) sign indicates more errors than the baseline, a (-) sign fewer. These errors are computed when the system operates at the middle of the DET curve, with the threshold set at the EER. As can be observed from Table 1, the proposed LWLDA system with 3.99% EER makes 173 fewer errors than the NDA based system with 4.04% EER. We further compute McNemar's test of statistical significance [36] to interpret the performance difference between these two systems. We observed that for target trials (miss probability) the difference between the systems is highly statistically significant, with p = 0.0001, much less than 0.05. However, for non-target trials (false alarms), the difference is not statistically significant (p = 0.25). While conducting these significance tests, we chose the EER as the operating point of the systems. Hence, even though the systems differ in their raw error counts, overall they are not clearly different at a statistical significance level. Further insight into system performance can be obtained from Figure 4. Note that although WCC needs multi-lingual development data, we do not show it in Figure 4: WCC provides performance similar to LWLDA on a multi-lingual development set, and its DET curve overlaps with that of LWLDA. Therefore, for the sake of clarity, we show WCC in Figure 3 instead of Figure 4.
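For reference, an exact (binomial) form of McNemar's test over paired trial decisions can be sketched as follows; the boolean error arrays are illustrative placeholders for the per-trial decisions of the two systems at their EER thresholds.

```python
import numpy as np
from scipy.stats import binom

def mcnemar_exact(errors_a, errors_b):
    """Exact McNemar test on paired trial decisions.

    errors_a / errors_b: boolean arrays, True where system A / B errs
    on the same trial.  Only trials where the systems disagree matter.
    """
    n01 = np.sum(errors_a & ~errors_b)   # A errs, B correct
    n10 = np.sum(~errors_a & errors_b)   # B errs, A correct
    n = n01 + n10
    # Two-sided exact binomial test with p = 0.5 under the null
    # hypothesis that both systems are equally likely to err.
    k = min(n01, n10)
    p = 2.0 * binom.cdf(k, n, 0.5)
    return min(p, 1.0)
```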
Database      LDA    LWLDA   NDA
SRE           0.36   2.11    499.18
DARPA RATS    2.33   24.45   5104.05

Table 2: Time taken, in seconds, by each of the comparative language mismatch compensation algorithms on the SRE and DARPA RATS corpora.
Finally, Table 2 shows the time taken (in seconds) by LDA, LWLDA and NDA to compute the reduced set of i-Vector dimensions. All algorithms are compared using an Intel Core i7 [email protected] x 8 processor with 32 GB RAM. It can be observed that on both SRE and DARPA RATS, NDA takes considerably more time than LDA and LWLDA. This corroborates the computational complexity analysis described in the previous section. For SRE there are 21,788 development i-Vectors, while DARPA RATS has 55,982 development i-Vectors; hence the time taken by the algorithms is smaller for SRE than for DARPA RATS. It can be noted that although NDA and LWLDA provide similar performance with multi-lingual development data, LWLDA has a competitive edge due to its lower computational complexity.

6.2. Database: DARPA RATS

For the second set of experiments, we use the corpus available as part of the RATS speaker recognition task. The data, distributed by the Linguistic Data Consortium (LDC) as LDC2012E49, LDC2012E63, LDC2012E69, LDC2012E85 and LDC2012E117, contains speech in five languages: Levantine Arabic, Dari, Farsi, Pashto and Urdu. We divide these data into three parts for system training, enrollment and test. There are a total of 305 speakers available for evaluation (enrollment and test), while 5,913 speakers are set aside for training the UBM, TV matrix and PLDA. The speakers represent both male and female genders. Each speaker's telephone recordings are retransmitted through 8 channels (A-H); in addition to these 8 channels, the speaker's original telephone-channel recording is also available. Our evaluation and development data-sets contain recordings from all 9 channels. Each speaker model is trained using all sessions coming from the 8 extremely degraded communication channels as well as the original telephone-channel recording. A trial is designed
using one speaker model and one test session. The test sessions are also chosen to represent all 9 sources of speech recordings. To evaluate system performance, we consider three duration-specific tasks with the following enrollment-test conditions: 120s, 30s and 10s. The total number of trials for each condition is created by taking a Cartesian product of all the enrollment and test sessions. This leads to roughly 3.2 million trials, with 10,617 target trials and 3,227,568 non-target trials. It can be noted that in all the enrollment-test conditions, the UBM, TV matrix and PLDA are trained using complete recordings. To quantify system performance, we use the same metrics as before.

6.2.1. Experimental Setup

The speech is parameterized using 60-dimensional Mel-Frequency Cepstral Coefficients (MFCCs) containing delta and delta-delta coefficients. A gender-independent, 2048-component, full-covariance UBM is trained using 55,982 recordings representing 5,913 speakers. Each of these recordings is roughly 15 minutes long. The zero- and first-order statistics are computed for each recording from the UBM, and then used to compute a 600-dimensional TV matrix. Next, we extract 600-dimensional i-Vectors from the TV matrix and apply LDA, NDA or LWLDA for channel compensation. After channel compensation, we perform length normalization [24] and finally classify the i-Vectors using PLDA scoring.

6.2.2. Results

Table 3 shows the performance of our system using all three methods. It can be observed that LWLDA gives better performance than LDA and NDA in all enrollment-test conditions. In terms of EER, LWLDA provides a 7.6% relative improvement over LDA and an 11.50% relative improvement over NDA in the 30s-30s evaluation condition. The other evaluation conditions also show improvement with LWLDA, although the degree is smaller than that obtained for the 30s-30s condition. Furthermore, since RATS has limited multi-lingual data, our experiments also show that LWLDA is a better channel compensation technique than either LDA or NDA.
Figure 5: DET plots showing the system performances on the 120s-120s duration task of the DARPA RATS corpus. Curves: NDA (5.51), LDA (5.32), LWLDA (5.04). The numbers in parentheses correspond to points on the curve denoting EER.

Figure 6: DET plots showing the system performances on the 30s-30s duration task of the DARPA RATS corpus. Curves: NDA (8.52), LDA (8.16), LWLDA (7.54). The numbers in parentheses correspond to points on the curve denoting EER.
Eval. Cond.   Proposed Method   EER (%)   Min. DCF-08   Total Eval. Errors (FA+FR)
120s-120s     NDA               5.51      2.39          +6,152
              LDA               5.32      2.26          baseline
              LWLDA             5.04      2.31          -9,067
30s-30s       NDA               8.52      3.88          +11,658
              LDA               8.16      3.67          baseline
              LWLDA             7.54      3.63          -20,076
10s-10s       NDA               19.69     7.78          +39,830
              LDA               18.46     7.58          baseline
              LWLDA             17.99     7.38          -15,219

Table 3: Speaker recognition performance on the DARPA RATS corpus. Here, total errors refer to the absolute count of False Alarm and False Reject errors relative to the baseline; a (+) sign indicates more errors than the baseline system, a (-) sign fewer.
DARPA carried out the RATS project in five phases, using a list of metrics to measure system performance. These metrics, apart from measuring the Equal Error Rate (EER), also measure the miss probability and False Alarm (FA) rate at different points on the DET curve. The miss rate at a 2.5% false-alarm rate, or the miss rate at a 4% false-alarm rate, are some of the common metrics used to report results on RATS data. It can be observed from Figures 5, 6 and 7 that LWLDA outperforms both NDA and LDA on these other DARPA performance metrics as well. Since each channel in the RATS corpus has its own distinct distortion characteristics, RATS exhibits greater multi-modality in the within-speaker i-Vectors than the SRE corpus, in which all of the within-speaker i-Vectors are extracted from telephone-channel data. The source of multi-modality in the SRE database is mainly due to: i) speakers overlapping between the SRE and Switchboard corpora, and ii) the multi-lingual data available for speakers. This can explain why LWLDA performs better than NDA on DARPA RATS.
Figure 7: DET plots showing the system performances on the 10s-10s duration task of the DARPA RATS corpus. Curves: NDA (19.69), LDA (18.46), LWLDA (17.99). The numbers in parentheses correspond to points on the curve denoting EER.
Finally, from Table 2 it can be observed that NDA takes 1.4 hours to process the i-Vectors extracted from the RATS data. By comparison, LWLDA takes around 200 times less time, and LDA over 1,000 times less, to perform the same task.
7. Conclusion

In this study, we considered a systematic investigation to address the problem of language mismatch in speaker verification. We proposed three methods and showed improvement using each solution. First, we showed that adding small amounts of multi-lingual seed data to PLDA helps improve system performance: by replacing just 8.04% of the English recordings in the PLDA development set with non-English recordings, we obtained a relative improvement of 17.96% in terms of EER. Next, we applied WCC and showed further improvement in system performance. We showed that adding new information to the Fisher ratio in the LDA computation leads to improved system performance. The new information is in the form of new eigendirections that are
computed from the development data and added to the denominator of the Fisher ratio, so that the resulting set of eigendirections exhibits minimal language mismatch. Finally, we formulated and employed LWLDA to obtain improvement for the case where no multi-lingual seed data is available to add to the PLDA development set. This localized variant of LDA helps preserve the local structure of the within-class scatter, and also produces a full-rank between-class scatter matrix. Both of these advantages make it a highly attractive discriminant analysis tool compared to traditional LDA. We also compared LWLDA to another recently proposed non-parametric discriminant analysis method, NDA. Using SRE data, we showed that LWLDA performs better than NDA when only English development data is available. We also showed the superior performance of LWLDA over NDA and LDA on the DARPA RATS corpus. We analyzed the computational complexity of LWLDA and showed that its execution time is significantly smaller than that of NDA (i.e., more than 200 times smaller). All the solutions proposed in this study are general purpose and can be applied to other forms of mismatch experienced in speaker verification. Hence, future work could focus on adapting these methods to suppress other types of mismatch, such as channel or environment mismatch (and others, as highlighted in [1]), in speaker verification. Since LWLDA supports dimensionality reduction to spaces of any dimension, it can be extremely useful in areas such as language recognition, where the number of classes (languages) is usually smaller than the number of i-Vector dimensions. In fact, LWLDA can be useful whenever channel, noise, aging, environment or language causes within-speaker variability.
Appendix A. Reformulating LDA Equations

Rewriting Eq. (7) as:

S_w = \sum_{spk=1}^{p} \sum_{i=1}^{n_{spk}} \Big( \omega_i^{spk} - \frac{1}{n_{spk}} \sum_{j=1}^{n_{spk}} \omega_j^{spk} \Big) \Big( \omega_i^{spk} - \frac{1}{n_{spk}} \sum_{j=1}^{n_{spk}} \omega_j^{spk} \Big)^T
    = \sum_{i=1}^{n} \omega_i \omega_i^T - \sum_{spk=1}^{p} \frac{1}{n_{spk}} \sum_{i,j=1}^{n_{spk}} \omega_i^{spk} (\omega_j^{spk})^T
    = \sum_{i=1}^{n} \Big( \sum_{j=1}^{n} W_{i,j}^w \Big) \omega_i \omega_i^T - \sum_{i,j=1}^{n} W_{i,j}^w \omega_i \omega_j^T
    = \frac{1}{2} \sum_{i,j=1}^{n} W_{i,j}^w \big( \omega_i \omega_i^T + \omega_j \omega_j^T - \omega_i \omega_j^T - \omega_j \omega_i^T \big),   (A.1)

which is the same as Eq. (17). Next, we know that the scatter matrix for the entire distribution is the sum of the between-class and within-class scatter matrices:

S_m = S_b + S_w = \sum_{i=1}^{n} (\omega_i - \mu)(\omega_i - \mu)^T.   (A.2)

Hence, we have:

S_b = \sum_{i=1}^{n} \omega_i \omega_i^T - \frac{1}{n} \sum_{i,j=1}^{n} \omega_i \omega_j^T - S_w
    = \sum_{i=1}^{n} \Big( \sum_{j=1}^{n} \frac{1}{n} \Big) \omega_i \omega_i^T - \frac{1}{n} \sum_{i,j=1}^{n} \omega_i \omega_j^T - S_w
    = \frac{1}{2} \sum_{i,j=1}^{n} \Big( \frac{1}{n} - W_{i,j}^w \Big) \big( \omega_i \omega_i^T + \omega_j \omega_j^T - \omega_i \omega_j^T - \omega_j \omega_i^T \big),   (A.3)

which gives Eq. (16).
References

[1] J. H. L. Hansen, T. Hasan, Speaker recognition by machines and humans: A tutorial review, IEEE Signal Processing Magazine 32 (6) (2015) 74-99. doi:10.1109/MSP.2015.2462851.

[2] B. Ma, H. Meng, M. Man-Wai, Effects of device mismatch, language mismatch and environmental mismatch on speaker verification, in: Proc. IEEE ICASSP, 2007.

[3] B. C. Haris, G. Pradhan, A. Misra, S. Shukla, R. Sinha, S. R. M. Prasanna, Multi-variability speech database for robust speaker recognition, in: Communications (NCC), 2011 National Conference on, 2011. doi:10.1109/NCC.2011.5734775.

[4] D. Reynolds, Comparison of background normalization methods for text-independent speaker verification, in: Proc. InterSpeech, 1997.

[5] P. Kenny, G. Boulianne, P. Ouellet, P. Dumouchel, Joint factor analysis versus eigenchannels in speaker recognition, IEEE Trans. Audio Speech Lang. Process.

[6] P. Kenny, P. Ouellet, N. Dehak, V. Gupta, P. Dumouchel, A study of inter-speaker variability in speaker verification, IEEE Trans. Audio Speech Lang. Process.

[7] N. Dehak, P. Kenny, R. Dehak, P. Dumouchel, P. Ouellet, Front-end factor analysis for speaker verification, IEEE Trans. Audio Speech Lang. Process.

[8] B. Ma, H. Meng, English-Chinese bilingual text-independent speaker verification, in: Proc. IEEE ICASSP, 2004.

[9] M. Akbacak, J. Hansen, Language normalization for bilingual speaker recognition systems, in: Proc. IEEE ICASSP, 2007.

[10] D. Reynolds, T. Quatieri, R. Dunn, Speaker verification using adapted Gaussian mixture models, Digital Signal Process. 10 (1-3) (2000) 19-41.

[11] L. Lu, Y. Dong, X. Zhao, J. Liu, H. Wang, The effect of language factors for robust speaker recognition, in: Proc. IEEE ICASSP, 2009.

[12] N. Brummer, L. Burget, J. Cernocky, O. Glembek, F. Grezl, M. Karafiat, D. van Leeuwen, P. Matejka, P. Schwarz, A. Strasheim, Fusion of heterogeneous speaker recognition systems in the STBU submission for the NIST speaker recognition evaluation 2006, IEEE Trans. Audio Speech Lang. Process. doi:10.1109/TASL.2007.902870.

[13] A. Misra, J. H. Hansen, Spoken language mismatch in speaker verification: An investigation with NIST-SRE and CRSS bi-ling corpora, in: IEEE Spoken Language Technology Workshop (SLT), 2014.

[14] L. Ferrer, H. Bratt, L. Burget, H. Cernocky, O. Glembek, M. Graciarena, A. Lawson, Y. Lei, P. Matejka, O. Plchot, et al., Promoting robustness for speaker modeling in the community: the PRISM evaluation set.

[15] C. Cieri, D. Miller, K. Walker, The Fisher corpus: A resource for the next generations of speech-to-text, in: International Conference on Language Resources and Evaluation, 2004.

[16] J. Godfrey, E. Holliman, J. McDaniel, Switchboard: Telephone speech corpus for research and development.

[17] A. Misra, Q. Zhang, F. Kelly, J. H. Hansen, Between-class covariance correction for linear discriminant analysis in language recognition, in: Proc. Odyssey, 2016.

[18] M. McLaren, D. van Leeuwen, Source-normalized LDA for robust speaker recognition using i-vectors from multiple speech sources, IEEE Trans. Audio Speech Lang. Process. 20 (3) (2012) 755-766. doi:10.1109/TASL.2011.2164533.

[19] Y. Lei, N. Scheffer, L. Ferrer, M. McLaren, A novel scheme for speaker recognition using a phonetically-aware deep neural network, in: Proc. IEEE ICASSP, 2014, pp. 1695-1699. doi:10.1109/ICASSP.2014.6853887.

[20] P. Kenny, G. Boulianne, P. Dumouchel, Eigenvoice modeling with sparse training data, IEEE Transactions on Speech and Audio Processing 13 (3) (2005) 345-354. doi:10.1109/TSA.2004.840940.

[21] K. Fukunaga, Introduction to Statistical Pattern Recognition, 2nd ed., New York: Academic Press, 1990, ch. 10.

[22] A. Hatch, S. Kajarekar, A. Stolcke, Within-class covariance normalization for SVM-based speaker recognition, in: Proc. InterSpeech, Pittsburgh, Pennsylvania, 2006.

[23] A. Solomonoff, W. Campbell, C. Quillen, Nuisance attribute projection, Speech Communication, Elsevier Science BV.

[24] D. Garcia-Romero, C. Y. Espy-Wilson, Analysis of i-Vector length normalization in speaker recognition systems, in: Proc. InterSpeech, Florence, Italy, 2011.

[25] P. Kenny, Bayesian speaker verification with heavy tailed priors, in: Proc. Odyssey, Brno, Czech Republic, 2010.

[26] S. J. D. Prince, J. H. Elder, Probabilistic linear discriminant analysis for inferences about identity, in: Proc. IEEE ICCV, 2007, pp. 1-8. doi:10.1109/ICCV.2007.4409052.

[27] D. Garcia-Romero, A. McCree, Supervised domain adaptation for i-vector based speaker recognition, in: Proc. IEEE ICASSP, 2014, pp. 4047-4051. doi:10.1109/ICASSP.2014.6854362.

[28] P. Kenny, Bayesian speaker verification with heavy-tailed priors, in: Proc. Odyssey, 2010.

[29] S. Sadjadi, J. Pelecanos, W. Zhu, Nearest neighbor discriminant analysis for robust speaker recognition, in: Proc. InterSpeech, 2014.

[30] S. Sadjadi, S. Ganapathy, J. Pelecanos, The IBM 2016 speaker recognition system, in: Proc. Odyssey, 2016.

[31] M. Sugiyama, Dimensionality reduction of multimodal labeled data by local Fisher discriminant analysis, J. Mach. Learn. Res. 8 (2007) 1027-1061. URL http://dl.acm.org/citation.cfm?id=1248659.1248694

[32] K. Walker, S. Strassel, The RATS radio traffic collection system, in: Proc. Odyssey, 2012.

[33] The NIST year 2008 speaker recognition evaluation plan, 2008, available at http://www.itl.nist.gov/iad/mig/tests/sre/2008/sre08_evalplan_release4.pdf.

[34] The NIST year 2010 speaker recognition evaluation plan, 2010, available at http://www.nist.gov/itl/iad/mig/upload/NIST_SRE10_evalplan-r6.pdf.

[35] N. Brümmer, J. du Preez, Application-independent evaluation of speaker detection, Computer Speech & Language 20 (2-3) (2006) 230-275.

[36] D. A. van Leeuwen, A. F. Martin, M. A. Przybocki, J. S. Bouten, NIST and NFI-TNO evaluations of automatic speaker recognition, Computer Speech & Language 20 (2006) 128-158.