Determination of A Priori Decision Thresholds for Phrase-Prompted Speaker Verification

M. W. Mak, W. D. Zhang, and M. X. He
Centre for Multimedia Signal Processing, Department of Electronic and Information Engineering, The Hong Kong Polytechnic University, Hong Kong

Abstract—Speaker verification systems are often compared based on an equal error rate (equal chance of false acceptance and false rejection) obtained by adjusting a decision threshold during verification. However, the threshold should be found before verification, because the identity of a claimant is actually unknown in real-world situations. This paper presents a novel method to determine the decision thresholds of speaker verification systems using enrollment data only. In the method, a speaker model is trained to differentiate the voice of the corresponding speaker from that of a general population. This is accomplished by using the speaker's utterances and those of some other speakers (denoted as anti-speakers) as the training set. Then, an operation environment is simulated by presenting the utterances of some pseudo-impostors (none of them an anti-speaker) to the speaker model. The threshold is adjusted until the chance of falsely accepting a pseudo-impostor falls below an application-dependent level. Experimental evaluations based on 138 speakers of the YOHO corpus suggest that, with a simulated operation environment, it is possible to determine the best compromise between false acceptance and false rejection.

Keywords—Speaker verification; threshold determination; elliptical basis function networks.

I. Introduction

The determination of decision thresholds is a very important problem in speaker verification. A large threshold could make the system annoying to users, while a small one could result in a vulnerable system. Conventional threshold determination methods [1], [2] typically compute the distributions of inter- and intra-speaker distortions, and then choose a threshold that equalizes the overlapping areas of the distributions, i.e. that equalizes the false acceptance rate (FAR) and the false rejection rate (FRR). The success of this approach, however, relies on whether the estimated distributions match the speaker- and impostor-class distributions. Another approach derives the threshold of a speaker solely from his/her own voice and speaker model [3]. Session-to-session speaker variability, however, contributes much bias to the threshold, rendering the verification system unusable. Due to the difficulty of determining a reliable threshold, researchers often report the equal error rate (EER) of verification systems based on the assumption that an a posteriori threshold can be optimally adjusted during verification. A real-world application, however, is only realistic with a priori thresholds, which should be determined before verification.

This project was supported by The Hong Kong Polytechnic University Grant No. 1.42.37.A042. M. X. He is with the Ocean Remote Sensing Institute, Ocean University of Qingdao, China.

In recent years, research effort has focused on the normalization of speaker scores to minimize error rates. This includes the likelihood ratio scoring proposed by Higgins et al. [4], where verification decisions are based on the ratio of the likelihood that the observed speech is uttered by the true speaker to the likelihood that it is spoken by an impostor. The a priori threshold is then set to 1.0, with the claimant being accepted (rejected) if the ratio is greater (less) than 1.0. Subsequent work based on likelihood normalization [5], [6], cohort normalized scoring [7], and minimum verification error training [8] also shows that including an impostor model during verification not only improves speaker separability, but also allows decision thresholds to be set easily. Although these approaches help to select an appropriate threshold, they may cause the system to favor rejecting true speakers, resulting in a high FRR. For example, Higgins et al. [4] reported an FRR more than 10 times larger than the FAR. A recent report [9] based on a similar normalization technique but a different threshold-setting procedure also found that the average of FAR and FRR is about 3 to 5 times larger than the EER, suggesting that the EER could be an over-optimistic estimate of the true system performance.

This paper proposes an a priori threshold determination method to address the above problem. The method differs from that of Higgins et al. in that, rather than using a ratio speaker set formed by pooling the nearest reference speakers, we use two speaker sets, namely an anti-speaker set and a pseudo-impostor set, to determine the threshold. For each speaker, a speaker model is trained to differentiate the voices of the speaker and the anti-speakers. Then, the pseudo-impostor set is used to determine the threshold.
To enhance the capability of the speaker models without increasing the enrollment time, we sample the utterances of 45 anti-speakers and 45 pseudo-impostors to form a training set for building the speaker models and for determining the thresholds. An operation environment for the speaker model is thereby effectively simulated. Experimental results show that the simulated operation environment enables the verification performance to be accurately predicted at enrollment time, thereby providing a reliable means of determining the decision thresholds. This paper is organized as follows. Section II outlines the speaker models and the verification procedure. The a priori threshold determination methods are explained in Section III, and their performance is compared in Section IV. The proposed speaker models and threshold determination

methods are compared with those of Higgins et al. [4] in Section V. Finally, we conclude our discussion in Section VI.

II. Speaker Verification

A. Speaker Models: EBF Networks

Elliptical basis function (EBF) networks have been used as speaker models in this work [10]. EBF networks can be applied to speaker verification as follows. Each registered speaker is assigned an EBF network with two outputs. The first output is trained to produce a `1' for the speaker's speech and a `0' for other speakers' utterances, and vice versa for the second output. Therefore, two sets of data are required for constructing a speaker model: one derived from the speaker and the other from other speakers. We denote the second set of data as the anti-speaker set hereafter. Of particular interest is that the EBF networks incorporate the idea of likelihood ratio scoring in their discriminative training procedure. An EBF network does not require a set of cohort or background speakers during verification; rather, it embeds the characteristics of the background speakers in its parameter estimation procedure during enrollment.

B. Verification Procedure

For each verification session, the feature vectors derived from the utterances of a claimant are concatenated to form a vector sequence $\mathcal{T} = [\vec{x}_1, \vec{x}_2, \ldots, \vec{x}_T]$. The sequence is then divided into a number of overlapping segments containing $T_s$ consecutive vectors. Note that this approach is similar to that of [11], where each segment is considered to be independent. For a segment $\mathcal{T}_s$, the normalized average outputs

$$z_k = \frac{1}{T_s} \sum_{\vec{x} \in \mathcal{T}_s} \frac{e^{\tilde{y}_k(\vec{x})}}{\sum_{r=1}^{2} e^{\tilde{y}_r(\vec{x})}}, \quad k = 1, 2 \qquad (1)$$

corresponding to the speaker and anti-speaker classes are computed, where $\tilde{y}_k(\vec{x}) = y_k(\vec{x}) / P(C_k)$ represents the scaled output and $P(C_k)$ the prior probability of class $C_k$. Verification decisions are based on the criterion:

$$\text{If } z_1 - z_2 \begin{cases} > \tau & \text{then accept the claimant} \\ \le \tau & \text{then reject the claimant} \end{cases} \qquad (2)$$

where $\tau \in [-1, 1]$ is an a priori threshold that has been determined during enrollment (see Section III below). A verification decision is made for each segment, and the error rate (either FAR or FRR) is the proportion of incorrect verification decisions to the total number of decisions. Details of the verification procedure can be found in [10].

III. A Priori Threshold Determination

A. FARs and FRRs versus Thresholds

To determine the a priori thresholds, we need to obtain the FAR and FRR as a function of the threshold using enrollment data only. We propose three methods to achieve this goal. They are denoted as Baseline, Pseudo-Impostor Based Threshold Determination (PIBTD), and Sampling Pseudo-Impostor Based Threshold Determination (SPIBTD) in this paper.

A.1 Baseline

This method, being very similar to that of Higgins et al. [4], forms a baseline for comparison. Specifically, for each registered speaker in the system, five other speakers whose speech is closest to that of the speaker are selected from the population to form an anti-speaker set. Note that this is analogous to the ratio speakers of Higgins et al. The speech of the speaker and the anti-speakers is used to train a speaker model. Then, the same speech data from the anti-speakers are applied to the speaker model according to the above verification procedure. The FAR as a function of the threshold is obtained by adjusting the threshold $\tau$, resulting in an FAR curve. Similarly, the speaker's utterances which have been used to train the speaker model are presented to the model to obtain an FRR curve. A typical example of these curves is shown in Fig. 1(a). It shows that the FAR during verification is considerably higher than that during enrollment, suggesting that the system is vulnerable to impostors' attacks.

A.2 Pseudo-Impostor Based Threshold Determination (PIBTD)

Using the same set of utterances for training the speaker model as well as for producing the FAR and FRR curves has a serious drawback. After training, the speaker model is likely to be biased towards representing the training utterances of the speaker and the anti-speakers. If the same set of utterances is applied to the speaker model to produce the FAR and FRR curves, the curves obtained are likely to be biased. To resolve this problem, PIBTD uses an alternative set of speakers, called the pseudo-impostor set, together with another set of utterances (which the speaker model has never seen before) produced by the registered speaker to obtain the FAR and FRR curves. More specifically, after training the speaker model, five pseudo-impostors are randomly selected from the population and applied to the speaker model. These pseudo-impostors, being different from the anti-speakers and never seen by the speaker model before, are more likely to form a better representation of the impostor population. This prevents the FAR curve from shifting drastically along the threshold axis during verification, as shown in Fig. 1(b). Therefore, the verification error rate becomes more predictable.

A.3 Sampling Pseudo-Impostor Based Threshold Determination (SPIBTD)

Obviously, the representation of the impostor population can be improved by increasing the number of pseudo-impostors and anti-speakers. However, increasing the size of these sets will also lead to unrealistic enrollment time. SPIBTD aims at reducing the error rate and improving the robustness of the thresholds without increasing the enrollment time. The basic idea is to randomly select the feature vectors from a large number of pseudo-impostors and anti-speakers for training a speaker model as well as for determining a threshold. In this way, the number of training vectors and the enrollment time remain the same as compared to PIBTD. Another advantage of this sampling strategy is that the resulting training vectors become more representative of the impostor population, for they are derived from more pseudo-impostors than in PIBTD. As shown in Fig. 1(c), this makes the position of the FAR curves more predictable as compared to PIBTD. The reduction in the displacement between the FAR curves obtained during enrollment and verification means that the FAR curve may be used to determine the threshold. More specifically, the threshold is adjusted until the FAR obtained during enrollment falls below an application-dependent level.
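The scoring rule in (1) and the decision rule in (2) can be sketched in a few lines. The Python fragment below is only an illustration: the per-frame network outputs and class priors are hypothetical stand-ins for a trained EBF network, not the paper's implementation.

```python
import math

def segment_scores(frame_outputs, priors):
    """Eq. (1): average, over a segment, of the softmax of the scaled
    outputs y~_k(x) = y_k(x) / P(C_k), giving z_1 and z_2."""
    T_s = len(frame_outputs)
    z = [0.0, 0.0]
    for y in frame_outputs:
        scaled = [yk / pk for yk, pk in zip(y, priors)]
        denom = sum(math.exp(v) for v in scaled)
        for k in range(2):
            z[k] += math.exp(scaled[k]) / denom
    return [zk / T_s for zk in z]

def verify(frame_outputs, priors, tau):
    """Eq. (2): accept the claimant iff z_1 - z_2 > tau."""
    z1, z2 = segment_scores(frame_outputs, priors)
    return z1 - z2 > tau
```

Because the softmax terms for the two classes sum to one at every frame, z_1 + z_2 = 1 and hence z_1 - z_2 lies in [-1, 1], which is why the threshold tau is confined to that interval.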

B. Threshold Selection Schemes

Once the FAR and FRR curves are obtained, a threshold can be determined as follows. If the FAR and FRR curves cross each other, the crossing point is chosen as the a priori threshold; the a priori threshold therefore equalizes the chances of false acceptance and false rejection during enrollment. However, when the FAR and FRR curves do not cross each other, there exists a range of thresholds for which both FAR and FRR are zero. Four threshold selection schemes are proposed to handle this situation; they are summarized in Table I.

IV. Experimental Evaluations

In this work, all 138 speakers (108 male, 30 female) in the YOHO corpus [12] have been used for the experimental evaluations. For each speaker in the corpus, there are 4 enrollment sessions with 24 utterances in each session and 10 verification sessions of 4 utterances each. Each utterance is composed of three 2-digit numbers (e.g. 34-52-67). All sessions were recorded in an office environment using a high-quality telephone handset and sampled at 8 kHz.

The enrollment process involves two steps. First, for each speaker in the corpus, 72 utterances from the speaker's first three enrollment sessions and 480 utterances from the 4 enrollment sessions of 5 anti-speakers (Baseline and PIBTD) were used to train a speaker model. For SPIBTD, the 480 utterances were randomly selected from 45 anti-speakers. Second, the a priori threshold was determined by using either the anti-speaker set (Baseline) or the pseudo-impostor set (PIBTD and SPIBTD)¹ together with the speaker's speech. The speaker's speech was derived from the training utterances (Baseline) or from other utterances which the model had never seen before (PIBTD and SPIBTD).

Verification was performed using each speaker in the corpus as a claimant, with 45 impostors being randomly selected from the remaining speakers (excluding the anti-speakers and pseudo-impostors) and rotating through all speakers. The speaker's utterances, derived from his/her 10 verification sessions, were concatenated to form a sequence of feature vectors. Similarly, impostors' feature vectors were randomly selected from the utterances of the 45 impostors and then concatenated to form a vector sequence whose length is the same as that formed by the speaker's utterances. Verification decisions were made according to (2) with the segment length T_s in (1) set to 300.² For each genuine trial, a window covering 300 vectors was advanced along the vector sequence by one vector position. This arrangement produces approximately 1000 genuine trials and 1000 impostor attempts for each speaker.

LP-derived cepstral coefficients were used as acoustic features. For each utterance, the silent regions were removed by a silence detection algorithm based on the energy and zero-crossing rate of the signal. The remaining signals were pre-emphasized by a filter with transfer function 1 - 0.95z^{-1}. Twelfth-order LP-derived cepstral coefficients were computed using a 28 ms Hamming window at a frame rate of 14 ms. These feature vectors were used to train a set of speaker models (EBF networks) with 12 inputs, two outputs, and 32 centers, where 8 centers were contributed by the corresponding speaker and the remaining 24 by the anti-speakers.

¹ Five pseudo-impostors were used in PIBTD, whereas in SPIBTD the pseudo-impostor set is constructed by randomly selecting the feature vectors of 45 pseudo-impostors.
² This is roughly equivalent to the length of 3 utterances; thus, the results can be compared with those of [4].

A. FAR and FRR Versus Thresholds

Fig. 1 depicts the FARs and FRRs as a function of the threshold for one of the 138 speakers. Some interesting results can be observed from these figures. First, Fig. 1(a) shows that there is a large displacement between the FAR curve corresponding to enrollment and that corresponding to verification when anti-speakers' utterances were used to determine the FAR curve during enrollment. Second, when pseudo-impostors were used to obtain the FAR curve during enrollment, the displacement is considerably reduced, as shown in Fig. 1(b). Third, the displacement is further reduced for SPIBTD (Fig. 1(c)), where feature vectors were randomly sampled from a large number of pseudo-impostors, suggesting that the verification performance of the system can be reliably predicted. Fig. 1(c) also suggests that the FAR curve provides a reliable means of determining the threshold.

B. Comparing Different Threshold Selection Schemes

B.1 Using Baseline

Fig. 2 compares the four selection schemes for the baseline method by plotting the a posteriori thresholds against the a priori thresholds corresponding to the 138 speakers in the YOHO corpus. The a posteriori thresholds were chosen to equalize FAR and FRR during verification. Fig. 3 plots the FAR versus the FRR of the 138 speakers using the different threshold selection schemes. Fig. 2(a) shows that most of the a priori thresholds are greater than the a posteriori ones. This suggests that choosing the zero crossing of the FRR curves (Scheme I)


Fig. 1. FARs and FRRs as a function of the decision threshold of a speaker during enrollment (solid) and verification (dots) using (a) Baseline, (b) PIBTD, and (c) SPIBTD. The label "Th" denotes the a priori threshold found by the corresponding best threshold selection scheme.
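Curves like those in Fig. 1 are obtained by sweeping the decision threshold over a pool of per-segment scores. A minimal sketch (the score lists and threshold grid below are made up for illustration):

```python
def far_frr_curves(genuine_scores, impostor_scores, taus):
    # FAR(tau): percentage of impostor segments with z1 - z2 above tau.
    # FRR(tau): percentage of genuine segments with z1 - z2 at or below tau.
    far = [100.0 * sum(s > t for s in impostor_scores) / len(impostor_scores)
           for t in taus]
    frr = [100.0 * sum(s <= t for s in genuine_scores) / len(genuine_scores)
           for t in taus]
    return far, frr
```

Applying this to enrollment-time scores gives curves like the solid ones in Fig. 1, and applying it to verification scores gives the dotted ones.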

Scheme   Description                                                          Best for
I        Selecting the zero crossings of the FRR curves as thresholds         None
II       Selecting the middle of the zero crossings of the FAR and FRR
         curves as thresholds                                                 Baseline
III      Selecting the zero crossings of the FAR curves as thresholds         PIBTD
IV       Selecting the point on the FAR curve at which the FAR attains
         a pre-defined value                                                  PIBTD & SPIBTD

TABLE I
Threshold selection schemes. All FAR and FRR curves are obtained at enrollment time. The last column lists the best threshold determination method(s) for each threshold selection scheme.
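The four schemes can be written as simple rules over the enrollment-time curves. The sketch below assumes an ascending threshold grid, a non-increasing FAR curve, and a non-decreasing FRR curve; it is a plausible reading of Table I, not the authors' code.

```python
def select_threshold(taus, far, frr, scheme, target_far=0.5):
    # If the curves cross, every scheme reduces to the crossing point.
    for t, fa, fr in zip(taus, far, frr):
        if fa <= fr:
            if fa > 0 or fr > 0:
                return t      # curves cross at a nonzero error rate
            break             # FAR and FRR are both zero over a range of taus
    t_far0 = min(t for t, fa in zip(taus, far) if fa == 0)  # FAR zero crossing
    t_frr0 = max(t for t, fr in zip(taus, frr) if fr == 0)  # FRR zero crossing
    if scheme == 1:
        return t_frr0                        # Scheme I
    if scheme == 2:
        return 0.5 * (t_far0 + t_frr0)       # Scheme II
    if scheme == 3:
        return t_far0                        # Scheme III
    # Scheme IV: largest tau at which the FAR still attains the target level
    return max(t for t, fa in zip(taus, far) if fa >= target_far)
```

With non-crossing curves, Schemes I-IV pick progressively smaller thresholds, which mirrors the shift from FRR-focused to FAR-focused selection discussed below.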


Fig. 2. A posteriori equal error thresholds versus a priori thresholds found by choosing (a) the zero crossings of FRR curves (Scheme I), (b) the middle of the zero crossing of FAR and FRR curves (Scheme II), (c) the zero crossings of FAR curves (Scheme III), and (d) the point at which FAR attains 0.5% (Scheme IV) as thresholds. The baseline method was used to obtain the FAR and FRR curves in all cases.

is likely to overestimate the thresholds, resulting in a high FRR during verification, as shown in Fig. 3(a). On the other hand, Figs. 2(c) and 2(d) suggest that using Schemes III and IV is likely to yield underestimated thresholds, resulting in a high FAR during verification (see Figs. 3(c) and 3(d)). Therefore, a reasonable choice is to select the middle of the zero crossings of the FAR and FRR curves as the threshold, i.e. Scheme II. This scheme not only results in comparable a priori and a posteriori thresholds, as shown in Fig. 2(b), but also simultaneously minimizes FAR and FRR for some of the speakers during verification, as shown in Fig. 3(b). However, there are still many speakers with a low FAR but a very high FRR, or vice versa, suggesting that the baseline method is not very robust.

B.2 Using PIBTD

Fig. 4 plots the a posteriori thresholds versus the a priori thresholds for PIBTD using the four threshold selection schemes mentioned above. As with Baseline, choosing the zero crossings of the FRR curves is likely to overestimate the thresholds, resulting in a high FRR, as shown in Fig. 5(a). This overestimation is reduced by choosing the middle of the zero crossings of the FAR and FRR curves, as shown in Fig. 4(b). However, Fig. 5(b) shows that this still leads to a high FRR for many speakers. For PIBTD, both Scheme III and Scheme IV give a very good match between the a priori and a posteriori thresholds, as shown in Figs. 4(c) and 4(d). They also give a reasonable trade-off between FARs and FRRs, as illustrated in Figs. 5(c) and 5(d).
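The a posteriori thresholds used in Figs. 2 and 4 equalize FAR and FRR on the verification scores themselves. A hedged sketch of how such a threshold can be located (the grid search below is an assumption; the paper does not specify the search procedure):

```python
def a_posteriori_threshold(genuine, impostor, taus):
    # Grid-search the tau that best equalizes FAR and FRR on verification
    # scores; returns (tau, far, frr) at that point.
    best = None
    for t in taus:
        far = 100.0 * sum(s > t for s in impostor) / len(impostor)
        frr = 100.0 * sum(s <= t for s in genuine) / len(genuine)
        if best is None or abs(far - frr) < best[0]:
            best = (abs(far - frr), t, far, frr)
    return best[1], best[2], best[3]
```

Such a threshold can only be computed after the claimants' identities are known, which is why it serves here as a reference for judging the a priori thresholds rather than as a deployable setting.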


Fig. 3. FRRs versus FARs corresponding to 138 speakers. All errors are based on the a priori thresholds determined by choosing (a) the zero crossings of FRR curves (Scheme I), (b) the middle of the zero crossings of FAR and FRR curves (Scheme II), (c) the zero crossings of FAR curves (Scheme III), and (d) the point at which the FAR attains 0.5% as thresholds. The baseline method was used in all cases.


Fig. 4. A posteriori equal error thresholds versus a priori thresholds found by choosing (a) the zero crossings of FRR curves, (b) the middle of the zero crossings of FAR and FRR curves, (c) the zero crossings of FAR curves, and (d) the point at which the FAR attains 0.5% as thresholds. PIBTD was used to obtain the FAR and FRR curves.


Fig. 5. FRRs versus FARs corresponding to 138 speakers. All errors are based on the a priori thresholds determined by choosing (a) the zero crossings of FRR curves, (b) the middle of the zero crossing of FAR and FRR curves, (c) the zero crossings of FAR curves, and (d) the point at which the FAR attains 0.5% as thresholds. PIBTD was used in all cases.

B.3 Using SPIBTD

In SPIBTD, the FAR curves obtained during enrollment become very close to those obtained during verification (see Fig. 1(c)) because the speech is sampled from a large set of pseudo-impostors. Therefore, it makes sense to use the FAR curves obtained during enrollment to determine the threshold. The question that remains is which part of the FAR curve should be used for determining the threshold. To this end, we plot the FAR curves of 10 speakers in Fig. 6. A closer look at this figure reveals that the FAR curves become flat when the error rate is close to zero. This is caused by the fact that the feature vectors of the 45 pseudo-impostors form a much more scattered distribution in the feature space than those of 5 anti-speakers or 5 pseudo-impostors. Recall from (1) and (2) that verification decisions are based on the average difference between the two scaled network outputs. Therefore, if the pseudo-impostors' vectors spread over a wide region of the feature space, the chance that some of these vectors lie close to the speaker's vectors becomes high. This phenomenon is illustrated in Fig. 7, where the distributions of the speaker's speech and the impostors' speech are assumed to be uni-modal. In Fig. 7(a), the false acceptance region is small because the pseudo-impostor patterns spread over a small area of the feature space, resulting in a small FAR (E1). On the other hand, Fig. 7(b) shows


Fig. 6. FAR curves of 10 speakers obtained by using SPIBTD.


Fig. 7. Diagrams showing the effect of having a scattered distribution of pseudo-impostor patterns on the FAR curves. The dashed lines represent the decision boundaries formed by setting the decision threshold to T1 and T2.

that if the pseudo-impostor patterns spread over a large area, more patterns will be falsely accepted. The consequence is that the FAR changes by an insignificant amount over a large range of threshold values, suggesting that using the zero crossing of the FAR curves may result in unreliable thresholds. Fig. 6 shows that the threshold becomes very sensitive to the FAR when the latter falls below 0.5%. To overcome this difficulty, Scheme IV focuses on the region where the threshold is less sensitive to the FAR, and selects the threshold at which the FAR attains an application-dependent level. In this work, the level was set to 0.5%. The results in Fig. 8(a) are comparable to those in Fig. 2(a) and Fig. 4(a), suggesting that using the zero crossing of the FRR curve as the threshold is not appropriate for SPIBTD. The large number of speakers with high FRRs in Fig. 9(a) also agrees with this observation.
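The flat tail described above, and the reason Scheme IV reads the threshold off at the 0.5% level rather than at the zero crossing, can be reproduced with synthetic segment scores (all numbers below are illustrative, not taken from the paper):

```python
# Synthetic impostor segment scores (z1 - z2): a dense mass well below the
# speaker region plus a thin tail of outliers, mimicking the scattered
# pseudo-impostor distribution of Fig. 7(b).
impostor = [-0.8 + 0.001 * i for i in range(990)]                    # dense mass
impostor += [0.2, 0.3, 0.4, 0.5, 0.55, 0.6, 0.62, 0.64, 0.66, 0.68]  # thin tail

def far(tau):
    # False acceptance rate, in percent, at threshold tau.
    return 100.0 * sum(s > tau for s in impostor) / len(impostor)

taus = [i / 100 for i in range(-100, 101)]
# Scheme IV: take the largest threshold at which the FAR still attains 0.5%.
tau_scheme_iv = max(t for t in taus if far(t) >= 0.5)
# The zero crossing of the FAR curve sits much further right (0.68 here),
# dragged out by the tail, while the curve is nearly flat in between.
```

Here the FAR drops below 1% already near tau = 0.2 but does not reach zero until 0.68, so the zero crossing is an unstable estimate while the 0.5% point is not.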

While choosing the middle of the zero crossings of the FAR and FRR curves as the threshold brings the a priori thresholds slightly closer to the a posteriori thresholds, it still causes an unacceptably high FRR for many speakers, as shown in Fig. 9(b). Fig. 9(c) suggests that selecting the zero crossings of the FAR curves as thresholds can only slightly reduce the number of speakers with a high FRR. This is because the thresholds are very sensitive to the FAR in the region of the zero crossings, resulting in overestimated thresholds. Comparisons between Fig. 8(c) and Fig. 8(d), as well as between Fig. 9(c) and Fig. 9(d), reveal that choosing a threshold that produces a pre-defined FAR at enrollment time is the best approach, as it gives the best compromise between FAR and FRR. We can see from Fig. 9 that the number of speakers with a high FRR is progressively reduced as we shift the focus from the FRR curves to the FAR curves (Scheme I to Scheme IV). Clearly, the sampling strategy of SPIBTD not only makes the verification performance more predictable, but also provides us with a reliable means of determining the thresholds. The main reason is that sampling the speech of a large number of impostors and anti-speakers produces a better representation of the impostor population and builds more robust speaker models.

C. Comparisons Based on the Best Threshold Selection Scheme

The above results show that Baseline, PIBTD, and SPIBTD require different threshold selection schemes to achieve the best trade-off between FAR and FRR (see Table I). It is of interest to compare the results of these methods using their respective best threshold selection schemes.³ To this end, we compare Fig. 2(b), Fig. 4(c), Fig. 4(d), and Fig. 8(d) in terms of robustness in threshold determination, and compare Fig. 3(b), Fig. 5(c), Fig. 5(d), and Fig. 9(d) in terms of the error rates obtained during verification. Evidently, the a priori and a posteriori thresholds obtained by the baseline method have the largest difference, causing a small FAR but a very large FRR, or vice versa, for most of the speakers, as shown in Fig. 3(b). This makes the performance of the system difficult to predict. While the number of speakers with a high FAR is smaller in Figs. 5(c) and 5(d), this is achieved at the cost of increasing the number of speakers with a high FRR. Among all the methods, SPIBTD produces the most predictable system when it is combined with Scheme IV.

D. Comparisons Based on Average Error Rates

Table II summarizes the average (over the 138 speakers in the YOHO corpus) FAR, FRR, and EER obtained by Baseline, PIBTD, and SPIBTD with the different threshold selection schemes. The results show that using Scheme I results in a very high FRR during verification, although both FAR and FRR during enrollment are very small. Scheme II is only appropriate for the baseline method, as it produces a comparatively high FRR during verification for PIBTD

³ Here, we consider the scheme that produces the best balance between FAR and FRR to be the best scheme.


Fig. 8. A posteriori equal error thresholds versus a priori thresholds found by choosing (a) the zero crossings of FRR curves, (b) the middle of the zero crossings of FAR and FRR curves, (c) the zero crossings of FAR curves, and (d) the point at which the FAR (during enrollment) attains 0.5% as thresholds. SPIBTD was used to obtain the FAR and FRR curves.


Fig. 9. FRRs versus FARs corresponding to 138 speakers. All errors are based on the a priori thresholds determined by choosing (a) the zero crossings of FRR curves, (b) the middle of the zero crossing of FAR and FRR curves, (c) the zero crossings of FAR curves, and (d) the point at which FAR attains 0.5% as thresholds. SPIBTD was used in all cases.

and SPIBTD. Scheme III produces a good compromise between FAR and FRR for PIBTD, but not for Baseline and SPIBTD. Finally, Scheme IV gives the lowest FAR and FRR for SPIBTD, as well as a good compromise between FAR and FRR for PIBTD. One should bear in mind that the figures in Table II are averages over 138 speakers. Therefore, a good match between the average FAR and FRR does not mean that the match is also good for individual speakers. For example, Fig. 5(d) shows that combining PIBTD and Scheme IV produces unmatched FAR and FRR for many speakers, although closely matched averages (3.41% FAR versus 4.31% FRR) are obtained. One may notice that the FARs during enrollment for Scheme IV in Table II are not equal to 0.5%. This is because the FAR and FRR curves of some speakers cross each other and the intersection point was chosen as the threshold, resulting in an FAR higher than 0.5%. Therefore, the average FAR is slightly higher than the pre-defined value. Table II also demonstrates that SPIBTD is the best method in terms of equal error rate (EER), suggesting that sampling a large set of pseudo-impostors and anti-speakers improves the capability of the EBF networks in modeling the anti-speakers and rejecting impostors. The EER of SPIBTD is also about half of that of Higgins et al. [4] (0.7% against 1.8%), suggesting that sampling the utterances of a large number of anti-speakers produces a more robust speaker model. Note also that even our baseline method has a lower EER than that of Higgins et al. This implies that incorporating anti-speakers' speech into a speaker model has merits.

to produce a more robust speaker model. Note also that even our baseline method has a lower ERR as compared to that of Higgins et al. This implies that incorporating anti-speakers' speech into a speaker model has merits. The last column in Table II lists the FAR obtained by setting all thresholds to zero. Recall from Section II that the average di erence of the two normalized network outputs is compared with a threshold for making a veri cation decision. Recall also that Baseline, PIBTD, and SPIBTD use the same set of speaker data (but di erent sets of antispeaker data) for building a speaker model. Therefore, for a xed threshold, the FAR becomes an indicator of how good the anti-speakers are modeled by the EBF networks. The last column of Table II clearly shows that SPIBTD is more capable of modeling the anti-speakers. V. Comparing to Higgins' Model
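As background for the comparison in this section, the decision rule recalled above (the average difference of the two normalized network outputs compared with a threshold) can be sketched as follows. The frame-wise sum-to-one normalization and all names are our assumptions, since Section II is not reproduced here.

```python
import numpy as np

def verify(speaker_outputs, anti_outputs, threshold):
    """Illustrative decision rule: accept the claimant when the average
    difference between the normalized speaker-class and anti-speaker-class
    network outputs exceeds the a priori threshold.

    speaker_outputs, anti_outputs -- per-frame outputs of the EBF network
    for the speaker class and the anti-speaker class (assumed names).
    """
    s = np.asarray(speaker_outputs, dtype=float)
    a = np.asarray(anti_outputs, dtype=float)
    # Normalize the two outputs so that they sum to one in each frame
    # (an assumption; the paper's exact normalization is in Section II).
    diff = (s - a) / (s + a)
    return bool(diff.mean() > threshold)  # True = accept the claim
```

Note that only two network outputs are evaluated per frame, which is the computational point made in the comparison with Higgins' model.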

Our proposed speaker models and threshold determination methods have three advantages over Higgins' ones [4]. First, Higgins' model requires the 5 closest speakers (among 137) to be selected to form the ratio speaker set during verification, which incurs a computational burden. Our methods, on the other hand, embed the features of the ratio speakers in the speaker models during enrollment. Rather than selecting the 5 closest ratio speakers, our methods only need to sample the feature vectors of 45 anti-speakers when constructing a speaker model, which takes only a fraction of the time. One may argue that we simply shift the computational burden from the verification sessions to the enrollment sessions. However, a low computational overhead during verification is certainly an advantage in real-world applications where real-time response is essential. During verification, our methods need to compute only two likelihood functions (one for the speaker class and the other for the anti-speaker class), whereas Higgins' model requires six (one for the speaker and another five for the ratio speakers).

The second advantage of our proposed speaker models is that they are more robust in rejecting impostors. If an impostor's speech is closer to the speaker's speech than to the speech of the 5 ratio speakers, Higgins' model will accept the impostor. In our case, the speaker model is constructed by sampling the speech of the corresponding speaker and of 45 anti-speakers. The latter forms a better representation of the impostor population than the speech of only 5 ratio speakers. It is therefore more likely that an impostor's speech is closer to the anti-speakers' speech than to the speaker's speech, resulting in a rejection.

The third advantage is that our methods provide several means of finding a threshold that strikes a balance between FAR and FRR, whereas Higgins' model makes no attempt to achieve this goal. Our experimental results show that the proposed methods not only produce an ERR that is about half of that obtained by Higgins et al. (0.70% versus 1.8%), but also produce a good compromise between FAR and FRR. For example, Higgins et al. obtained 4.2% FRR and 0.37% FAR using 3 utterances per trial, whereas we obtained 3.94% FRR and 1.12% FAR using speech segments whose length is approximately equal to three utterances.

TABLE II
Average error rates obtained by different methods. For Scheme IV, the pre-defined FAR was set to 0.5%. ERR and zero-threshold FAR are per method and are shown on the first row of each group.

                      Enrollment       Verification           Zero Thr.
Method     Scheme    FAR %   FRR %    FAR %   FRR %   ERR %   FAR %
Baseline   I          0.00    0.26     0.03   55.21    1.11   59.74
           II         0.00    0.00     4.40    3.21
           III        0.04    0.00    43.81    0.16
           IV         0.00    0.00    46.02    0.14
PIBTD      I          0.25    0.36     0.21   38.70           10.38
           II         0.26    0.26     0.57   10.80
           III        0.29    0.24     3.09    4.71
           IV         0.33    0.26     3.41    4.31
SPIBTD     I          0.18    0.29     0.16   41.72    0.70
           II         0.18    0.18     0.18   18.99
           III        0.19    0.16     0.32    9.81
           IV         0.62    0.15     1.12    3.94

VI. Conclusions

This paper addresses the problem of determining a priori thresholds for phrase-prompted speaker verification. Conventional approaches have been compared with the proposed one in experimental evaluations based on 138 speakers of the YOHO corpus. It was shown that robust thresholds can be obtained by simulating an operation environment as close as possible to the real one. The proposed method is able to predict the verification performance accurately using enrollment data only, leading to more reliable thresholds, and to find a better balance between FARs and FRRs.

References

[1] S. Furui. Cepstral analysis technique for automatic speaker verification. IEEE Trans. on Acoustics, Speech, and Signal Processing, ASSP-29(2):254-272, 1981.
[2] D. K. Burton. Text-dependent speaker verification using vector quantization source coding. IEEE Trans. on Acoustics, Speech, and Signal Processing, ASSP-35(2):133-143, 1987.
[3] J. M. Naik, L. P. Netsch, and G. R. Doddington. Speaker verification over long distance telephone lines. In Proc. ICASSP'89, 1989.
[4] A. Higgins, L. Bahler, and J. Porter. Speaker verification using randomized phrase prompting. Digital Signal Processing, 1:89-106, 1991.
[5] T. Matsui and S. Furui. Likelihood normalization for speaker verification using a phoneme- and speaker-independent model. Speech Communication, 17:109-116, 1995.
[6] C. S. Liu, C. H. Lee, B. H. Juang, and A. E. Rosenberg. Speaker recognition based on minimum error discriminative training. In Proc. ICASSP'94, volume 1, pages 325-328, 1994.
[7] A. E. Rosenberg, J. DeLong, C. H. Lee, B. H. Juang, and F. K. Soong. The use of cohort normalized scores for speaker verification. In Proc. ICSLP'92, volume 2, pages 599-602, 1992.
[8] A. E. Rosenberg, O. Siohan, and S. Parthasarathy. Speaker verification using minimum verification error training. In Proc. ICASSP'98, pages 105-108, 1998.
[9] J. B. Pierrot et al. A comparison of a priori threshold setting procedures for speaker verification in the CAVE project. In Proc. ICASSP'98, pages 125-128, 1998.
[10] M. W. Mak and C. K. Li. Elliptical basis function networks and radial basis function networks for speaker verification: A comparative study. In Proc. IJCNN'99, July 1999.
[11] D. A. Reynolds and R. C. Rose. Robust text-independent speaker identification using Gaussian mixture speaker models. IEEE Trans. on Speech and Audio Processing, 3(1):72-83, 1995.
[12] J. P. Campbell, Jr. Testing with the YOHO CD-ROM voice verification corpus. In Proc. ICASSP'95, pages 341-344, 1995.