
SCORE NORMALISATION IN A MULTI-BAND SPEAKER VERIFICATION SYSTEM

Roland Auckenthaler† and John S. Mason‡

†Department of Electronics, Technical University Graz, Inffeldgasse 12, A-8010 GRAZ, AUSTRIA
‡Department of Electrical & Electronic Engineering, University of Wales Swansea, SA2 8PP, UK
email: feeaucken, [email protected]

RÉSUMÉ

Recently, the use of frequency sub-bands in speech processing has been proposed for both speech recognition and speaker identification as a means of improving performance. In the case of speaker identification (verification), this approach raises interesting questions about score normalisation, a factor known to be important in gaining accuracy. This study considers score normalisation in the context of sub-band division, with the aim of answering questions such as whether normalisation should be performed while the bands are sub-divided, after the bands have been recombined, or even through a combination of the two. In addition, world-model and cohort normalisation schemes are compared using different combinations. It is shown that better verification equal-error rates are obtained with small, unconstrained cohorts than with a world-model, for a limited number of speakers. Also, while normalisation at the sub-band stage gives comparable results, it does not always give a better EER than normalisation after the bands are recombined.

ABSTRACT

Recently, sub-band processing has been proposed for both speech and speaker recognition as a means of improving performance. In the case of speaker recognition (verification) this form of processing raises interesting questions on score normalisation, a factor known to be important in gaining good performance. This paper considers score normalisation in the context of sub-band processing with the aim of answering such questions as whether normalisation should be performed within the sub-bands, after recombination, or in both positions. Furthermore, world-model and cohort normalisation schemes are compared, together with various combinations. It is shown that better verification equal-error rates (EER) are obtained with small-sized, unconstrained cohorts than with a world-model, under the conditions of closed-set speakers considered. Also, while normalisation within sub-bands does provide comparable results, it does not generally give better EER than a single normalisation after sub-band recombination.

1. INTRODUCTION

It is well known that the performance of speaker verification systems is sensitive to the a priori values set for the decision threshold. The underlying reason for this sensitivity is the typical variation found in the speech signal from one occasion to the next. This variation, which can be due to both the speaking characteristics and the general environment and recording conditions, is difficult to model accurately and thus tends to cause classifier scores to be less discriminative. To counter this effect it is common practice to normalise the scores prior to the decision stage.

The two most widely reported approaches to score normalisation make use of either a general world-model [1] or a speaker-dependent set of individual models known as a cohort [2] [3]. In both cases the assumption is that the variations in the speech which cause classifier scores to vary have similar effects on both the model under test and the normalising system, so that normalisation reduces the effects of these variations.

In terms of robustness in recognition generally, a recent development has been the use of sub-band processing. First proposed for speech recognition, its conceptual motivation comes from studies of how humans are deemed to perform the task. In the context of speaker recognition, [4] reports merits of higher bands, and [5] and [6] both examine sub-band processing per se. Possible benefits include the ability to omit or de-weight any sub-bands severely contaminated by noise, and the overall performance of sub-band systems has been shown to compare well with the conventional full-band approach [6].

In combining these two aspects of speaker verification, namely score normalisation and sub-band processing, an interesting question arises as to the position and form of the normalisation process. This paper examines this question, reporting on world-model normalisation and cohort normalisation, both before and after recombination, and also on various combinations of these cases.

2. SCORE NORMALISATION

It is normal to consider the output of a classifier as a conditional probability of the form $p(O \mid m_i)$, where $O$ is the observation under test (in this case a sequence of features) and $m_i$ is a given speaker model. Ideally, we would like the alternative, $p(m_i \mid O)$, that is, the probability of the model given the observation. Fortunately, Bayes' theorem relates these two:

$$p(m_i \mid O) = \frac{p(O \mid m_i)\, p(m_i)}{p(O)} \qquad (1)$$

providing a means for estimating one from the other. It is normal to assume equal probability for all speakers, hence $p(m_i)$ can be ignored. $p(O)$ is the probability of the observation itself. In the current context it is useful to consider replacing $p(O)$ by alternative normalising probabilities or likelihoods, derived from other speaker sets. As mentioned above, two forms are common in practice, namely the world-model [1] and cohorts dependent on $i$ [2] [3].

2.1. World-Model

For the world-model case, $p(O)$ is replaced by a probability $p(O \mid M)$, which is an output derived from a general world-model, $M$:

$$p(m_i \mid O) = \frac{p(O \mid m_i)}{p(O \mid M)} \qquad (2)$$

[Figure 1 panels: a) Normalisation before Recombination; b) Normalisation after Recombination. Blocks shown: splitting of frequency band, classification, normalisation ($L_M$), recombination using equal weighted summation of log-likelihood distances, final decision.]
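As a minimal sketch of why equation (2) can stabilise the decision threshold, the snippet below works in the log domain, where the ratio becomes a difference. It assumes, purely for illustration, that a change of session shifts the target and world-model log-likelihoods by the same amount; the function name and all numerical values are hypothetical and are not taken from the experiments.

```python
def world_model_score(log_p_target: float, log_p_world: float) -> float:
    """Log of equation (2): log p(O|m_i) - log p(O|M)."""
    return log_p_target - log_p_world


# Hypothetical log-likelihoods for one utterance recorded in a clean session.
clean_target, clean_world = -41.0, -44.5

# Idealised assumption: a change of recording conditions lowers both
# log-likelihoods by the same amount.
session_shift = -6.0
noisy_target = clean_target + session_shift
noisy_world = clean_world + session_shift

clean_score = world_model_score(clean_target, clean_world)
noisy_score = world_model_score(noisy_target, noisy_world)

# The raw target score moves with the session; the normalised score does not.
print("raw:", clean_target, noisy_target)
print("normalised:", clean_score, noisy_score)
```

In practice the shift is only approximately common to both models, so the cancellation is partial rather than exact.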

2.2. Cohort Models

Using a cohort of speakers to provide a replacement for p(O) is based on the idea that only those models likely to be most influential, namely those nearest to the target speaker, need be included, [7]. Then

$$p(m_i \mid O) = \frac{p(O \mid m_i)}{\left[\prod_{j=1}^{n} p(O \mid m_j)\right]^{1/n}} \qquad (3)$$

Thus here $p(O)$ is replaced by the geometric average probability across a cohort of speakers. The selection of the cohort is in itself an interesting problem. The approach by [2] selects the models most competitive with that of the target speaker, on the basis that they will respond similarly to unwanted variations. Alternatively, the cohort might be selected according to the test utterance [8]. Recently, [9] examines the different forms of cohort selection and world-model normalisation and shows benefits of cohort selection at the time of testing, referred to by [9] as unconstrained cohort selection. Finally, a decision must be made as to whether or not the target speaker is allowed in the normalisation system. Furui [10] refers to these two situations as a posteriori probability and likelihood ratio, respectively. In the comparison by [9], the two are shown to perform essentially the same for unconstrained cohorts, but for pre-selected cohorts the likelihood ratio is reported to be better.
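The cohort scheme described above can be sketched in the log domain, where the log of a geometric average is simply the mean of the log-likelihoods. The sketch assumes that an unconstrained cohort is formed from the best-scoring models for the test utterance; that ranking rule, the function names and the numbers are illustrative assumptions rather than the procedure of [9].

```python
from typing import Dict, List


def unconstrained_cohort(log_liks: Dict[str, float], target: str,
                         size: int, allow_target: bool) -> List[str]:
    """Pick the `size` best-scoring models for this test utterance."""
    candidates = [m for m in log_liks if allow_target or m != target]
    ranked = sorted(candidates, key=lambda m: log_liks[m], reverse=True)
    return ranked[:size]


def cohort_score(log_liks: Dict[str, float], target: str,
                 size: int, allow_target: bool) -> float:
    """Target log-likelihood minus the mean cohort log-likelihood
    (the log of the geometric average over the cohort)."""
    cohort = unconstrained_cohort(log_liks, target, size, allow_target)
    mean_log = sum(log_liks[m] for m in cohort) / len(cohort)
    return log_liks[target] - mean_log


# Hypothetical log-likelihoods of one test utterance against four models.
scores = {"spk_a": -40.0, "spk_b": -42.0, "spk_c": -43.0, "spk_d": -47.0}

with_target = cohort_score(scores, "spk_a", size=2, allow_target=True)
without_target = cohort_score(scores, "spk_a", size=2, allow_target=False)
print(with_target, without_target)
```

With the target allowed, the cohort here is {spk_a, spk_b}; with it excluded, the next closest model enters and the cohort becomes {spk_b, spk_c}.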

Here, initially we examine cohorts with and without the target speaker, in the context of unconstrained cohorts, and show there is negligible difference between the two cases.

2.3. Normalisation on Sub-Band Processing

With the full band divided into sub-bands we can perform normalisation in different ways: within the sub-bands before recombination; after the sub-band scores are recombined; or in both positions. Figure 1 illustrates the first two possibilities. Furthermore, different forms of normalisation, world-model or cohort, can be adopted in various combinations.

Figure 1: Concept of normalisation in a sub-band verification system

3. EXPERIMENTAL CONDITIONS

All experiments use standard mel-frequency, log-spectral estimates and a closed subset of 20 male speakers from the BT-Millar verification database, sampled at 8 kHz. This digit database contains 25 versions (5 sessions) of digits one to nine and zero for each speaker. For sub-band processing the 32 mel spectral components are sub-divided equally without overlap. Classifiers are simple nearest-neighbour VQ systems, with a codebook of 16 codewords, trained (10 versions from 2 sessions) and tested (15 versions from 3 subsequent sessions) in text-dependent mode. Scores at the sub-band outputs are equally weighted. While such an arrangement might be regarded as sub-optimal in a number of respects (for example, [6] shows the mel-scale to be sub-optimal, particularly in the context of sub-band processing), it nonetheless serves the current purpose of investigating score normalisation in the context of sub-band processing. It follows from equations 2 and 3 that normalised scores can be obtained from:

$$\tilde{L}_i = L_i - L_M \qquad (4)$$

where $L_i$ is the log-likelihood for the target and replaces $p(O \mid m_i)$, and $L_M$ is the normalising value, representing either the world model or the cohort, and replaces $p(O)$, with the log forms converting division to subtraction.

4. EXPERIMENTS

The first benchmark experiments relate to a conventional full-band arrangement. Four normalisations are considered: world-model and three types of cohort model. For comparison with [9], cohort models with the target speaker allowed and with the target speaker not allowed are examined. The third cohort method is based on an arithmetic rather than a geometric average (see equation 3), with the target speaker not allowed. Results are shown in Figure 2 with equal error rate (%) plotted against cohort size. The two parallel profiles for the cohort arrangements show some benefit for the unconstrained cohorts. The closeness of the two cohort profiles can be explained by the closeness of the speaker set: excluding the target speaker from the one cohort merely introduces the next closest model, giving a shift of one speaker to the profiles.

[Plot: verification EER (%) against cohort size for CN-T, CN+T, CN-S and WM.]

Figure 2: Comparison of different normalisation methods: unconstrained cohort normalisation (CN+T: target model included; CN-T: no target model included; CN-S: no target model, but summed average of probabilities) is compared with world-model normalisation (WM).

The overall trends of these results are very similar to those of [9], even though the experiments are with different databases and different classifiers. In both cases the cohort performance degrades almost linearly with increasing size, though here the fall is faster and, unlike in [9], actually crosses the world-model EER (at a cohort size of 12 or 13). This difference is likely to be due to differences in the size and training of the world model. Figure 2 also shows that the arithmetic average in the cohort method performs slightly better at larger cohort sizes. For further experiments the geometric average is used because it is easier to implement for log-likelihood calculations.

4.1. Normalisation before and after Recombination

Of primary importance in this paper is the influence of score normalisation in the context of sub-band processing. Clearly normalisation can be positioned within each sub-band before recombination, or immediately after recombination, or in both positions. Furthermore, a world model or cohorts can be used in either position. In the case of cohorts, it is likely that different sub-bands will have different cohorts for a given speaker, and the question then arises as to whether or not this will enhance recognition performance. These variations are examined for 2, 4 and 8 sub-bands, and results equivalent to the single-band case (Figure 2) are shown in Figure 3.

The 3 plots again show benefits of cohort over world-model normalisation, and again there is a linear degradation as the cohort size increases for normalisation after recombination. For small cohort sizes (less than 6) this is the best arrangement; however, for larger cohorts sub-band normalisation, i.e. before recombination, is better. The latter profiles go from linear towards an exponential form as the number of sub-bands increases. In the case of the world model there can be no difference between normalisation before and after recombination, since the processes are merely that of addition and subtraction of the same components. Indeed, this would also be true in the case of cohort normalisation if, for any given target speaker, the cohorts contained the same speaker models across sub-bands. Clearly this is not the case, given the different profiles for the two cohort arrangements.

Generally, the best performance comes from the two sub-band arrangement, a conclusion also drawn from previous speaker identification experiments [6]. Performance with 4 sub-bands and normalisation after recombination is very similar to that of the full-band approach.

[Plots: verification EER (%) against cohort size for CN-B, CN-A and WM; panels for 2 Sub-Bands, 4 Sub-Bands and 8 Sub-Bands.]

Figure 3: Comparison of normalisation performed on 2, 4 and 8 sub-bands for world-model normalisation (WM), cohort normalisation before recombination (CN-B) and cohort normalisation after recombination (CN-A)

| Recombination | WM | CN1 | CN5 | CN10 | CN15 |
|---|---|---|---|---|---|
| 1 band | 2.37 | 0.58 | 1.29 | 2.08 | 2.75 |
| 2 bands before | 1.98 | 0.85 | 1.11 | 1.66 | 2.47 |
| 2 bands after | 1.98 | 0.55 | 1.11 | 1.74 | 2.58 |
| 4 bands before | 2.51 | 1.37 | 1.41 | 1.76 | 2.66 |
| 4 bands after | 2.51 | 0.66 | 1.36 | 2.31 | 3.36 |
| 8 bands before | 4.53 | 3.56 | 3.06 | 3.28 | 4.22 |
| 8 bands after | 4.53 | 1.74 | 2.68 | 3.60 | 4.78 |

Table 1: Verification error in % for the full band and for two, four and eight sub-bands, where normalisation is performed before OR after recombination (WM: world-model normalisation; CN: cohort normalisation, with the number giving the cohort size)
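The argument of Section 4.1 about the position of normalisation can be checked on a toy example. The sketch below assumes equal-weighted summation of sub-band log-likelihoods and an unconstrained cohort made of the best-scoring non-target models; the function names and all values are hypothetical and do not come from the BT-Millar experiments.

```python
from typing import Dict, List

Band = Dict[str, float]  # model name -> log-likelihood in one sub-band


def cohort_mean(scores: Band, target: str, n: int) -> float:
    """Mean log-likelihood of the n best-scoring non-target models."""
    others = sorted((v for m, v in scores.items() if m != target), reverse=True)
    top = others[:n]
    return sum(top) / len(top)


def recombine(bands: List[Band]) -> Band:
    """Equal-weighted sum of the sub-band log-likelihoods of each model."""
    return {m: sum(b[m] for b in bands) for m in bands[0]}


def cohort_before(bands: List[Band], target: str, n: int) -> float:
    """Normalise inside each sub-band (own cohort per band), then sum."""
    return sum(b[target] - cohort_mean(b, target, n) for b in bands)


def cohort_after(bands: List[Band], target: str, n: int) -> float:
    """Sum the sub-band scores first, then normalise once."""
    full = recombine(bands)
    return full[target] - cohort_mean(full, target, n)


def world_before(bands: List[Band], world: List[float], target: str) -> float:
    return sum(b[target] - w for b, w in zip(bands, world))


def world_after(bands: List[Band], world: List[float], target: str) -> float:
    return recombine(bands)[target] - sum(world)


# Hypothetical log-likelihoods for two sub-bands and three models.
bands = [
    {"a": -20.0, "b": -21.0, "c": -25.0},
    {"a": -22.0, "b": -26.0, "c": -23.0},
]
world = [-23.5, -24.5]  # hypothetical world-model log-likelihood per band

print(world_before(bands, world, "a"), world_after(bands, world, "a"))
print(cohort_before(bands, "a", 1), cohort_after(bands, "a", 1))
print(cohort_before(bands, "a", 2), cohort_after(bands, "a", 2))
```

With a cohort of one, the two sub-bands pick different competitors ("b" and "c"), so the two positions give different scores. With a cohort of two, both sub-bands use the same models and the scores coincide, as they always do for the world model.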

4.2. Combining different normalisations

Finally we consider combinations of normalisation, before and after the recombination stage. Given the evidence that different cohorts are found across the sub-bands, the question arises as to whether combining normalisations before and after score recombination is of benefit.

Two sub-bands:

| Normalisation before | CN1 after | CN5 after | CN10 after | CN15 after |
|---|---|---|---|---|
| CN1 | 0.58 | 0.83 | 1.28 | 1.84 |
| CN5 | 0.54 | 1.10 | 1.73 | 2.41 |
| CN10 | 0.55 | 1.12 | 1.73 | 2.53 |
| CN15 | 0.55 | 1.11 | 1.74 | 2.58 |
| WM | 0.55 | 1.11 | 1.74 | 2.58 |

Four sub-bands:

| Normalisation before | CN1 after | CN5 after | CN10 after | CN15 after |
|---|---|---|---|---|
| CN1 | 0.60 | 1.05 | 1.71 | 2.41 |
| CN5 | 0.65 | 1.28 | 2.03 | 2.83 |
| CN10 | 0.66 | 1.32 | 2.08 | 3.01 |
| CN15 | 0.65 | 1.33 | 1.09 | 3.02 |
| WM | 0.65 | 1.33 | 2.10 | 3.03 |

Table 2: Verification error in % for two and four sub-bands, where normalisation is performed before AND after recombination (WM: world-model normalisation; CN: cohort normalisation, with the number giving the cohort size)

Table 2 shows that normalisation after recombination is the dominant component and that the influence of normalisation in the sub-bands prior to recombination is small or insignificant, even though it has been shown that sub-bands have different cohorts. This finding might change under more adverse conditions where normalisation can play a bigger role.
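A two-stage arrangement of this kind can be sketched as follows. How the first stage is applied to the competing models is not spelled out in the text, so the sketch assumes that every model is normalised within each sub-band as if it were the claimed speaker, and that the second stage draws its cohort from the recombined, already normalised scores. All names and values are hypothetical.

```python
from typing import Dict, List

Band = Dict[str, float]  # model name -> log-likelihood in one sub-band


def cohort_mean(scores: Band, claimant: str, n: int) -> float:
    """Mean score of the n best-scoring models other than the claimant."""
    others = sorted((v for m, v in scores.items() if m != claimant), reverse=True)
    top = others[:n]
    return sum(top) / len(top)


def cohort_normalise(scores: Band, n: int) -> Band:
    """Normalise every model as if it were the claimed speaker (assumption)."""
    return {m: v - cohort_mean(scores, m, n) for m, v in scores.items()}


def world_normalise(scores: Band, world: float) -> Band:
    """Subtract the same world-model log-likelihood from every model."""
    return {m: v - world for m, v in scores.items()}


def recombine(bands: List[Band]) -> Band:
    """Equal-weighted sum of sub-band scores per model."""
    return {m: sum(b[m] for b in bands) for m in bands[0]}


# Hypothetical log-likelihoods for two sub-bands and three models.
bands = [
    {"a": -20.0, "b": -21.0, "c": -25.0},
    {"a": -22.0, "b": -26.0, "c": -23.0},
]
world = [-23.5, -24.5]

# Cohort normalisation after recombination only.
after_only = cohort_normalise(recombine(bands), 1)["a"]

# World-model normalisation before, cohort normalisation after.
wm_bands = [world_normalise(b, w) for b, w in zip(bands, world)]
wm_then_cn = cohort_normalise(recombine(wm_bands), 1)["a"]

# Cohort normalisation in both positions.
cn_bands = [cohort_normalise(b, 1) for b in bands]
cn_then_cn = cohort_normalise(recombine(cn_bands), 1)["a"]

print(after_only, wm_then_cn, cn_then_cn)
```

Under these assumptions a world-model stage before recombination shifts every model's recombined score by the same amount, so a cohort stage afterwards cancels it exactly. This is at least consistent with the WM row for two sub-bands in Table 2 matching the "2 bands after" row of Table 1.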

5. COMMENTS AND CONCLUSIONS

In this paper different forms of score normalisation in the context of sub-band processing and speaker verification are examined. For the conventional single-band case the performance of cohort and world-model normalisations shows very similar trends to those reported recently by [9], even though different databases and different classifiers are used. Both show the benefits of the unconstrained cohort arrangement, particularly when the cohort size is small.

In the case of sub-band processing, normalisation after recombination would appear to be sufficient, certainly under the present 'clean' experimental conditions. Normalisation within the sub-bands has insignificant effect if followed by normalisation after recombination. Again, it is shown that the overall recognition performance (in this case in terms of verification EER, in contrast to speaker identification in [6]) is slightly but consistently better than the conventional full-band case.

The assessment so far has been based on a closed set of speakers (i.e. impostors are in the world-model and are allowed to be in the cohort), under clean-speech matched conditions, and in a text-dependent mode. The next stage is to expand beyond these semi-ideal conditions to where a greater degree of robustness is required, where normalisation is known to play a more significant role, and where sub-band processing might also contribute further.

6. REFERENCES

1. M. J. Carey and E. S. Parris. Adapting input transformations using alpha-nets for whole word speech recognition. In Proc. EUROSPEECH, volume 2, pages 555–558, 1991.
2. A. E. Rosenberg, J. Delong, C. H. Lee, B. H. Juang, and F. K. Soong. The use of cohort normalised scores for speaker recognition. In Proc. ICSLP, pages 599–602, 1992.
3. T. Matsui and S. Furui. Comparison of text-independent speaker recognition methods using VQ-distortion and discrete/continuous HMMs. In Proc. ICASSP, volume 2, pages 157–160, 1992.
4. Q. Lin, E.-E. Jan, C. Che, D.-S. Yuk, and J. L. Flanagan. Selective use of the Speech Spectrum and a VQGMM Method for Speaker Identification. In Proc. ICSLP, volume 4, pages 2415–2418, 1996.
5. L. Besacier and J. Bonastre. Subband approach for automatic speaker recognition. In Proc. AVBPA, pages 195–202, 1997.
6. R. Auckenthaler and J. Mason. Equalising Sub-Band Error Rates in Speaker Recognition. In Proc. EUROSPEECH, volume 5, pages 2203–2206, 1997.
7. M. J. Carey, E. S. Parris, and S. J. Bennett. Speaker verification. In Proc. I.O.A., volume 14, Pt 6, 1992.
8. T. Matsui and S. Furui. Concatenated phoneme models for text-variable speaker recognition. In Proc. ICASSP, volume 2, pages 391–394, 1993.
9. A. M. Ariyaeeinia and P. Sivakumaran. Analysis and Comparison of score normalisation methods for text-dependent speaker verification. In Proc. EUROSPEECH, pages 1379–1382, 1997.
10. S. Furui and T. Matsui. Free-text speaker recognition methods using phoneme class model. In 1st IEEE Workshop on I.V.T.T.A. and Piscataway, 1992.