List of Figures

1 Illustration of the segmentation of the database collected over a period of three months into training and testing sets
2 Recognition process
3 %Error against total number of mixtures for TI ergodic CDHMMs (10 version training)
4 %Error against number of training versions for TI codebooks
5 As Figure 3 for TD constrained CDHMMs (10 version training)
6 As Figure 4 for TD codebooks
7 %Error against the number of training versions for a TI 32 element VQ, and 32 mixture single state CDHMM
8 %Error against the number of training versions for TD DTW, 8 element VQ and 1 mixture 8 state CDHMM
9 DTW text-dependent digit performance

List of Tables

I Summary of best TI and TD VQ performance
II Summary of best overall performance with 1, 5 and 10 training versions
Speaker recognition using hidden Markov models, dynamic time warping and vector quantisation

Kin Yu, John Mason & John Oglesby
Speech Research Group, Department of Electrical Engineering
University of Wales Swansea SA2 8PP, UK
e-mail: [email protected] & [email protected]
Phone: +44 792 294564  Fax: +44 792 295686
Abstract

This paper evaluates continuous density hidden Markov models (CDHMM), dynamic time warping (DTW) and distortion-based vector quantisation (VQ) for speaker recognition, emphasising the performance of each model structure across incremental amounts of training data. Text-independent (TI) experiments are performed with VQ and CDHMMs, and text-dependent (TD) experiments are performed with DTW, VQ and CDHMMs. We show that for TI speaker recognition VQ performs better than an equivalent CDHMM with one training version, but is outperformed by the CDHMM when trained with ten versions. For TD experiments we show that DTW outperforms VQ and CDHMMs for sparse amounts of training data, but with more data the performance of each model is indistinguishable. The performance of the TD procedures is consistently superior to TI, which is attributed to subdividing the speaker recognition problem into smaller speaker-word problems. We also show a large variation in performance across the different digits, concluding that digit zero is the best digit for speaker discrimination.
Keywords: Speaker recognition, dynamic time warping, vector quantisation, continuous density hidden Markov models.
1. Introduction
Several analytical approaches have been applied to the task of speaker recognition, many of which originate in speech recognition. Dynamic time warping (DTW), vector quantisation (VQ) and continuous density hidden Markov models (CDHMM) are three of the most common approaches and are the three considered in this paper. DTW is a successful isolated-word processing technique often associated with small fixed-vocabulary speech recognition. The technique applied here to speaker recognition follows that of Furui [1]. VQ is a coding technique applied to speech data to form a representative set of features. This set, or codebook, can be used to represent a given speaker. Among the first to apply this technique to speaker recognition were Soong et al. [2] and Buck et al. [3]. A stochastic modelling technique known as CDHMMs has also been applied to the problem of speaker recognition, providing similar advantages for speaker recognition as for speech recognition. Poritz [4] proposes the use of a five state ergodic discrete HMM for text-independent (TI) speaker recognition, which Tishby [5] expands to an eight state autoregressive CDHMM. Clearly there are some fundamental differences in these three approaches which might well lead to differences in performance.
A good example is the utilisation of time information. DTW is based on time alignment and is therefore inextricably and highly dependent on timing: thus test feature sets should have the same frame-rate and be presented to the system in an equivalent time series manner to the training sets. In contrast, basic VQ has no such constraints and is essentially time-independent, giving identical results even if the time sequence of the test features were to be randomly shuffled. Interestingly, HMMs, configured either as `left-to-right' models or as ergodic models, can approach both of these cases in terms of time-dependency. Such differences in fundamental characteristics prompt the question of relative performances. Text-dependent and text-independent experiments cast light on these issues. Irvine [6] in text-dependent (TD) experiments compares the three approaches considered here, concluding that VQ provides the best performance. Matsui and Furui [7] compare VQ, discrete HMMs and CDHMMs for TI speaker recognition, illustrating improved performance of CDHMMs over discrete HMMs, and of VQ over CDHMMs with limited amounts of training data.

This paper concerns a comparison of DTW, VQ and CDHMMs for TI and TD recognition and also shows performance trends in each case as more training data is made available. The emphasis of the experiments is on the performance of the models under incremental amounts of training data, in an attempt to identify the best approach. Consider the scenario of a single enrolment session where the client might reasonably be expected to utter just one or two versions of the digit set. Under these circumstances, which approach to recognition gives the best performance: TD or TI; DTW, VQ or CDHMM? This paper attempts to address these questions. In each case VQ, DTW or CDHMM is used in what might be regarded as a de facto standard configuration, without attempts to fine-tune to give better absolute performance. One could for example introduce discriminative training or, in the case of the CDHMM, consider many variants in parameter estimation. The tying of parameters across states has the effect of pooling training data, thereby increasing the reliability of estimates, but at a risk of possible reduced discrimination. A good example is the use of a global covariance [8]; however we leave this and other variants as avenues for further work.

Speaker recognition is an umbrella term embracing two related tasks: speaker identification and speaker verification, with the difference between the two being at the decision stage [9]. Identification is a one-from-N decision where the test speaker is assumed to be one of a set of known persons (closed set), and verification is a yes/no decision on a claimed identity. In practical terms verification has many more applications, primarily in the general area of access control. Identification is however useful in recognition studies since it tends to be more sensitive to changes in system parameters, it maximises the use of databases, and performance trends are usually applicable to verification. Hence the
task chosen for this study is identification.

2. Approaches to recognition
2.1 Vector quantisation
Vector quantisation is a non-parametric data reduction technique used to form a series of vectors known as codewords from a training set. The technique used for experimentation follows that of the classical generalised Lloyd or Linde, Buzo and Gray (LBG) algorithm [10]. This technique is an unsupervised training procedure, described as follows. Presented with a pool of training features, a single vector mean is calculated, then perturbed to form two separate vectors around the first mean. These are then reclustered using a variance convergence metric, which finds natural partitions of the pooled vectors. These two vectors are then perturbed to form four vectors, and subsequent reclustering takes place. This splitting and reclustering continues until the required 2^R centroids have been created, where R is the codebook rate. Recognition is performed using accumulated distortion to obtain a likelihood measure of the speaker belonging to the model. VQ is inherently TI, though a level of text dependency can be achieved with appropriate selection of training data. Thus TI and TD experiments are performed using this modelling technique.
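For concreteness, a minimal sketch of this splitting-and-reclustering procedure and of distortion-based scoring is given below. It is an illustration under simplifying assumptions, not the experimental implementation: it uses plain Euclidean distortion rather than the weighted cepstral distance employed in the experiments, a fixed number of Lloyd iterations stands in for the variance convergence criterion, and the function names are our own.

```python
import numpy as np

def train_codebook(features, rate, n_iter=20, eps=1e-3):
    """Grow a 2**rate codebook by LBG splitting: perturb each centroid
    into a pair, then recluster (Lloyd iterations) until it settles."""
    codebook = features.mean(axis=0, keepdims=True)    # single global mean
    for _ in range(rate):                              # double the size each pass
        codebook = np.vstack([codebook * (1 + eps), codebook * (1 - eps)])
        for _ in range(n_iter):                        # recluster to nearest centroid
            d = ((features[:, None, :] - codebook[None]) ** 2).sum(-1)
            nearest = d.argmin(axis=1)
            for k in range(len(codebook)):
                members = features[nearest == k]
                if len(members):
                    codebook[k] = members.mean(axis=0)
    return codebook

def distortion(features, codebook):
    """Accumulated quantisation distortion of a test utterance against a
    speaker codebook; the lower the total, the better the match."""
    d = ((features[:, None, :] - codebook[None]) ** 2).sum(-1)
    return d.min(axis=1).sum()
```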
2.2 Dynamic time warping

Dynamic time warping is a feature matching scheme that inherently accomplishes time alignment of reference and test features through a dynamic programming (DP) procedure. The DTW technique chosen for experimentation is an implementation proposed by Furui [1], but here using static features only. His method of constructing a reference template is as follows. One training utterance is designated as the initial template, to which a second is time aligned by DTW. The average of the two patterns is then taken to produce a new template, to which a third utterance is time aligned. This process is repeated until all the training utterances have been combined into a single template. The final template length is equal to that of the initially chosen utterance, irrespective of whether this utterance is representative of the length of the training utterances. There are no global constraints on the DP path except for fixed endpoints. Recognition is performed by summing a `city-block' metric over a three point unweighted symmetric warp path between the template and the test utterance. Since the method of template construction is inherently TD, only TD experiments are reasonable using this technique.
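The sketch below illustrates this style of template construction under the stated constraints: a city-block local distance, a three-point unweighted symmetric step set, fixed endpoints and no other global path constraints. The names (dtw_align, build_template) and the exact averaging rule are our own illustrative choices; Furui's implementation may differ in detail.

```python
import numpy as np

def dtw_align(ref, test):
    """DP alignment with fixed endpoints and three unweighted local steps;
    returns the accumulated city-block distance and the warp path."""
    n, m = len(ref), len(test)
    cost = np.full((n, m), np.inf)
    back = np.zeros((n, m, 2), dtype=int)
    cost[0, 0] = np.abs(ref[0] - test[0]).sum()
    for i in range(n):
        for j in range(m):
            if i == j == 0:
                continue
            steps = [(i - 1, j), (i, j - 1), (i - 1, j - 1)]
            pi, pj = min((s for s in steps if s[0] >= 0 and s[1] >= 0),
                         key=lambda s: cost[s])
            cost[i, j] = cost[pi, pj] + np.abs(ref[i] - test[j]).sum()
            back[i, j] = pi, pj
    path, ij = [], (n - 1, m - 1)
    while ij != (0, 0):
        path.append(ij)
        ij = tuple(back[ij])
    path.append((0, 0))
    return cost[-1, -1], path[::-1]

def build_template(utterances):
    """Furui-style template: align each new utterance to the running
    template, average along the warp path; template length stays fixed
    at the length of the first utterance."""
    template = utterances[0].astype(float).copy()
    for utt in utterances[1:]:
        aligned = np.zeros_like(template)
        counts = np.zeros(len(template))
        _, path = dtw_align(template, utt)
        for i, j in path:                 # pool test frames warped onto frame i
            aligned[i] += utt[j]
            counts[i] += 1
        template = (template + aligned / counts[:, None]) / 2
    return template
```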
2.3 Hidden Markov models
Hidden Markov models are stochastic approximations of the process being modelled. Based on a first order Markov chain, this method of model construction using states and transitions allows for greater flexibility in the type of data that is modelled; hence, using appropriate HMM topologies, both text-dependent and text-independent experiments can be performed. The hidden Markov model toolkit (HTK [11]) is used in this set of experiments. This toolkit is specifically designed for connected word speech recognition using CDHMMs, but can be configured for speaker recognition with ease. Training a particular model is performed in two stages. The first stage uniformly partitions the training observations into N segments of equal length, and assigns these vectors to each state. From these pooled vectors the mixture components of each state are calculated (a single mixture component is given by one vector mean and variance). This simple estimate of each state is followed by a Viterbi segmentation algorithm, where each observation in the training sequence is aligned using Viterbi recognition; re-computation of the mixture components occurs during realignment of the observations. Initial estimates of the transition probabilities are assigned to 1/D_i, where D_i is the number of transitions allowed out of state i. Second stage re-estimation is then performed on this Viterbi estimated CDHMM using the Baum-Welch (forward-backward) algorithm [12] until convergence. Diagonal covariance matrices are assumed in the estimation of the output distributions. Recognition is performed by a Viterbi process to calculate the log probability for each utterance. For TI experiments the CDHMM is configured ergodically, allowing for transitions from any state to any other state in the model [13]. For TD experiments, a constrained CDHMM is used. The constraints imposed on the model allow for transitions from one state to the following state and to itself; thus the model has a time sequence structure.
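A brief sketch of the first-stage initialisation follows: uniform segmentation of the observations across the N states, and transition probabilities set to 1/D_i for both the ergodic (TI) and constrained (TD) topologies. This illustrates the initialisation described above and is not the HTK implementation itself.

```python
import numpy as np

def uniform_segment(observations, n_states):
    """First-stage initialisation: partition the training observations
    into N segments of equal length, one segment per state."""
    return np.array_split(observations, n_states)

def init_transitions(n_states, ergodic=True):
    """Initial transition probabilities a_ij = 1/D_i, where D_i is the
    number of transitions permitted out of state i."""
    allowed = np.zeros((n_states, n_states), dtype=bool)
    if ergodic:
        allowed[:] = True                  # TI: any state to any other state
    else:
        for i in range(n_states):          # TD: self-loop and next state only
            allowed[i, i] = True
            if i + 1 < n_states:
                allowed[i, i + 1] = True
    return allowed / allowed.sum(axis=1, keepdims=True)
```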
2.4 Speech database and pre-processing

The recognition task is performed on a subset of the BT Millar speech database. This database is collected in a quiet environment, using a high quality microphone. During collection, each speaker responds to a visual prompt to utter isolated digits (one to nine, zero, oh and nought) in a random order, a total of five times in each of the five sessions, the approximate timings of which are indicated in Figure 1. The database correspondingly contains 25 repetitions of each of the vocabulary items from each speaker. The sessions take place over a period of approximately three months, with speakers encouraged to divide sessions evenly across this period. The speech is recorded at 20 kHz using 16 bits (linear) per sample. In these experiments the data is bandpass filtered to telephone bandwidth and downsampled to 8 kHz prior to feature extraction. The database is divided into training and testing sets (Figure 1).
Fig. 1. Illustration of the segmentation of the database collected over a period of three months into training and testing sets (Sessions 1-5 at weeks 0, 2, 4, 6 and 8; versions 1-10 from Sessions 1-2 form the training set, versions 11-25 from Sessions 3-5 the testing set)
The first ten versions, i.e. the first two collection sessions, are reserved for training, with the remaining fifteen repetitions reserved for testing. Training sets are varied as follows: training using one version is taken from the first set of the first collection. Training using two versions is taken from the first two sets from the first collection. Training using seven versions takes all the utterances from the first collection and the first two from the second collection. This incremental training data set selection is continued until all the data from the training set is exhausted; thus a series of experiments using one through ten training versions is performed. A subset of speakers is adopted. The data from twenty males, all of approximately the same age, is used. This represents a confusable subset, magnifying performance trends. The vocabulary is reduced to ten digits, 1 through 9 and zero; thus there are 20 × 15 × 10 = 3000 test tokens for this identification task.
There are two popular parameterisation techniques, namely linear predictive coding (LPC) and cepstral analysis. Both techniques provide a compact representation of the speech spectral envelope. The cepstral representation has been suggested as superior for speaker recognition [14]; thus cepstral analysis, in particular mel scale warped cepstra [15], is used to parameterise the speech. Analysis is performed over a Hamming window of 32 ms, with 50% overlap. Pooled inverse variance weighting [16] is applied to each of the 14 cepstral coefficients, leading to a weighted cepstral distortion for VQ and DTW.
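For illustration, the sketch below computes comparable mel-warped cepstra with the librosa library. It approximates the pipeline described above: librosa's mel filterbank and DCT conventions differ in detail from [15], resampling alone stands in for the explicit telephone bandpass filtering, and the pooled inverse variance weighting of [16] would be applied to the resulting coefficients afterwards.

```python
import librosa

def mel_cepstra(wav_path, n_coeffs=14):
    """Mel-warped cepstra over 32 ms Hamming windows with 50% overlap,
    after downsampling to 8 kHz."""
    y, sr = librosa.load(wav_path, sr=8000)        # resample to 8 kHz
    frame = int(0.032 * sr)                        # 256 samples = 32 ms
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_coeffs,
                                n_fft=frame, hop_length=frame // 2,
                                window='hamming')
    return mfcc.T                                  # one row per frame
```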
3. TI and TD experiments

In text-independent (TI) experiments a single multi-digit model is trained for each speaker and recognition experiments performed with test versions over the same set of digits (1 to 9 and zero). In the text-dependent (TD) case, digit-specific models are trained for each speaker, so that each speaker has 10 models, one for each digit, and each model has approximately 1/10th of the amount of training data when compared with the corresponding TI case. Recognition assumes knowledge of the utterance being spoken. Models are trained with one through to ten versions, to indicate trends across training versions. Only the training data pertaining to a specific speaker is used to train each model, and no a priori statistics are provided for the CDHMM. The general configuration for recognition is illustrated in Figure 2.
Fig. 2. Recognition process (the test utterance is scored with a similarity/dissimilarity measure against each of the n speaker models, and decision logic selects the recognised speaker)
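Figure 2 amounts to a closed-set best-score decision over the per-speaker measures; a minimal sketch follows, where score_fn (a placeholder name) stands for accumulated VQ distortion or DTW distance (lower is better) or CDHMM Viterbi log probability (higher is better).

```python
def identify(test_features, models, score_fn, higher_is_better=False):
    """Closed-set identification: score the test utterance against every
    speaker model and return the best-matching speaker."""
    scores = {spk: score_fn(test_features, model)
              for spk, model in models.items()}
    pick = max if higher_is_better else min
    return pick(scores, key=scores.get)
```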
3.1 Model parameters
In DTW the models are fully determined by the training data and the vocabulary: the length of the model is the primary parameter and this is determined by the training data. There is one model per vocabulary item per speaker. In contrast, in the cases of VQ and CDHMMs, decisions on model parameters need to be made. For VQ the primary factor is the codebook size, and experimental results described below indicate that the optimum size is very dependent on the phonetic make-up of the data and not on the amount of training data. The CDHMM case is less straightforward since the model topology is also a variable, and as a consequence two primary parameters are the number of states and the number of mixtures in each state. An indication of overall model size is the product of these two, giving the total number of mixtures in the model [7][17]. In the following sections we look at recognition performance for VQ and CDHMMs in terms of the respective model parameters, for both TI and TD conditions, prior to comparing the performances for the three different approaches.

3.1.1 Text-independent parameters

Results from experiments with various CDHMM topologies trained with ten versions and tested on the standard set are summarised in Figure 3. From these results it is noticed that the performance of the model correlates highly with the total number of mixtures in the model, i.e. the number of states times the number of mixtures per state. This trend has also been observed by Matsui et al. [7] and Zhu et al. [17], also in TI experiments. Of all the model variations considered, the thirty-two mixture equivalent models (2m16s, 8m4s and 32m1s) produce the best performance. From this subset of models, a thirty-two mixture, single state model (32m1s) is chosen for subsequent text-independent experiments, a form which has been used by other researchers [7][17][18]. Such single state models with neutral transitions are essentially identical to the approach known as a mixture Gaussian VQ [19].
Fig. 3. %Error against total number of mixtures for TI ergodic CDHMMs (10 version training); labels of the form 1m4s denote 1 mixture, 4 states
Fig. 4. %Error against number of training versions for TI codebooks
For a comparison we require the second modelling technique, VQ, to be of a similar size. Figure 4 shows codebook performance as a function of the codebook rate and amount of training data. Noticeable performance differences occur between 16 and 32 codewords, and between 32 and 64 codewords. Above 64 codewords performance improvements are small. A 32 element VQ codebook is chosen, despite its slightly sub-optimal performance, to match the number of means used in the CDHMM.

3.1.2 Text-dependent parameters

Results for corresponding TD experiments are shown for VQ and CDHMM in Figure 6 and Figure 5 respectively. For VQ, the performance again improves with the codebook size but, in contrast to the all-digit TI case (Figure 4), there is little improvement beyond a codebook size of 8. This is true across the complete range of training versions. We return to these comparisons later. The text-dependent CDHMM results (Figure 5) also provide an interesting contrast to their TI equivalent in Figure 3. The TD results show a clear minimum region when the total number of mixtures is between 8 and 16, either side of which error rates increase. Similar curves are observed for different amounts of training data.
Fig. 5. As Figure 3 for TD constrained CDHMMs (10 version training)
Fig. 6. As Figure 4 for TD codebooks
Within this region the state/mixture combinations which give the best performances are 5m2s, 2m6s and 1m8s, suggesting that performance is little affected by the state transition parameters of the CDHMM [20]. The error profile is likely to be governed by the phonetic diversity of the vocabulary and the amount of training data available for reliable parameter estimation. Hence an 8-element VQ codebook and a constrained 8-state single-mixture CDHMM are chosen to compare with the DTW modelling technique.

3.2 Performance of the three approaches
Performances are now compared in TI mode for VQ and CDHMMs, and in TD mode for VQ, CDHMMs and DTW, with an emphasis on the amount of training data. McNemar's test is subsequently applied to the identification performance to quantify the significance of the differences at a 95% confidence level. The Millar database contains 60 speakers in all, and sample experiments with all 60 speakers show similar trends to those reported here for the 20 speaker subset.
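A sketch of the significance test follows, applying the continuity-corrected McNemar statistic to the paired per-token outcomes of two classifiers on the same test set; 3.841 is the 95% critical value of chi-square with one degree of freedom. The function name and the boolean error-list representation are our own.

```python
def mcnemar_significant(errors_a, errors_b):
    """McNemar's test on paired identification outcomes. errors_a/errors_b
    hold one boolean per test token (True = misclassified). b counts tokens
    only classifier A got wrong, c tokens only classifier B got wrong."""
    b = sum(ea and not eb for ea, eb in zip(errors_a, errors_b))
    c = sum(eb and not ea for ea, eb in zip(errors_a, errors_b))
    if b + c == 0:
        return False                              # classifiers agree everywhere
    stat = (abs(b - c) - 1) ** 2 / (b + c)        # continuity-corrected statistic
    return stat > 3.841                           # chi-square(1), 95% level
```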
Fig. 7. %Error against the number of training versions for a TI 32 element VQ, and 32 mixture single state CDHMM
Fig. 8. %Error against the number of training versions for TD DTW, 8 element VQ and 1 mixture 8 state CDHMM
3.2.1 Text-independent tests

Figure 7 illustrates the identification performance of a 32-element VQ codebook and a 32-mixture single state CDHMM. For 1 and 2 version training VQ performs better than the CDHMM, but for 7, 8, 9 and 10 version training the CDHMM outperforms the simpler modelling technique. Between these two regions the performance of the two classifiers is essentially the same. Clearly the CDHMM requires more training data than an equivalent sized VQ.

3.2.2 Text-dependent tests

Figure 8 illustrates the identification performance of an 8-element VQ codebook, DTW and a single mixture 8-state CDHMM. DTW is consistently the best performer. The VQ and CDHMM show similar trends to those of the TI experimental results, with a cross-over in the region of 6-version training beyond which the CDHMM gives a lower error rate than VQ. Performances for the three approaches converge with an increasing number of training versions. McNemar's test with a 95% confidence level is considered at the 1, 5 and 10 training version points. In summary, we can conclude that with 1 version training the difference between VQ and DTW is not significant. However, at 5 and 10 version training the superiority of DTW over both VQ and CDHMM is significant.

4. Performances of text-independent and text-dependent models
In both VQ and CDHMM experiments TD performance is better than TI. Consider first the case of VQ. The TD profiles in Figure 6 are consistently lower than the lowest TI profiles in Figure 4. Similar TD over TI improvements are evident for CDHMMs from Figures 5 and 3 respectively. These differences, while small, are consistent and are attributable only to the subdivision of the data at the time of training, or to the knowledge of the vocabulary item at the time of testing. Table I emphasises the point of TD superiority by comparing values for the best text-dependent and text-independent VQ codebooks irrespective of size. Table II reinforces this difference across the three modelling techniques by ordering the best overall models for 1, 5 and 10 training versions.

5. Subdividing the problem
It is noticed from Figure 5 for TD CDHMMs that, irrespective of the structure of a CDHMM, it is the total number of mixtures in the model which dominates the performance characteristic. As mentioned above, in the region of good performance (8 to 16 total mixtures) the best topologies include 5m2s, 2m6s and 1m8s, suggesting any influence of transition information is negligible. This leads to the conclusion that subdividing the training data to form multiple models for a speaker is a principal criterion for improving performance, whether a VQ or CDHMM is chosen. This subclassification also explains the good performance of the DTW technique, where the problem is inherently divided according to the vocabulary and acoustic segments of the training data.

5.1 Digit performance
Extending the theme of problem subdivision, we look at the performance of individual digits. Text-dependent DTW digit performance is illustrated in Figures 9a and 9b. The best digit across the range of training versions is zero (this is found to be true for both VQ and CDHMMs, although not shown). The good performance can be attributed to two distinctive characteristics of this utterance. The first is its length: zero is found to be on average the longest utterance, hence presenting more information for both testing and training. The second characteristic is the voiced fricative of the first phoneme, shown by Parris and Carey [21] to be a particularly useful phoneme in speaker recognition. Consistently good performance across various training versions is also illustrated for digits 1 and 9. The worst performers are digits 4, 8, 6 and 2.
Fig. 9. DTW text-dependent digit performance: (a) digits 2, 3, 4, 7 and zero; (b) digits 1, 5, 6, 8 and 9
A large variation is observed across the digits. For example, Figure 9a shows digit 4 performing badly, while digit zero performs well across all training versions, with a performance difference of 6.3% at their closest point. Hence, in a password system consisting only of digits, judicious choice could significantly improve performance.

6. Conclusion
Perhaps the most surprising overall finding presented in this paper is the superior performance of DTW over both VQ and the CDHMM. As mentioned above, the CDHMM performance is likely to improve with estimation of certain parameters, e.g. by adopting tied variances across states. This can be viewed as one step in moving the CDHMM towards a DTW or a VQ approach, and continuing in this vein, the DTW may be viewed as merely a degenerate case of the CDHMM [22]. In turn the VQ approach may be regarded as a degenerate case of DTW. Considering the latter pair, the essential difference between VQ and DTW is the inherent time-alignment within DTW, and the results indicate that some time-sequence information within speech, completely lost in VQ, is captured and used to increase speaker discrimination by DTW.
TABLE I
Summary of best TI and TD VQ performance

No. of training versions   TI (256 VQ)   TD (16 VQ)
 1                         24.5%         23.5%
 5                         11.8%          9.2%
10                          5.2%          3.8%
TABLE II
Summary of best overall performance with 1, 5 and 10 training versions (NB the best results are all TD)

No. of training              Best overall model
versions          First           Second            Third
 1                TD DTW 21.5%    TD 16 VQ 23.5%    -
 5                TD DTW 5.8%     TD 32 VQ 9.1%     TD 1m8s CDHMM 13.4%
10                TD DTW 2.8%     TD 32 VQ 3.8%     TD 1m8s CDHMM 3.9%
In contrast, the lack of recognition sensitivity to the number of CDHMM states suggests that the state transition probabilities do not themselves contribute to discrimination, but merely aid in the alignment of speech events to states. Finally, given the observation that DTW and VQ are both degenerate cases of the CDHMM, the question is raised of how a CDHMM might be customised to harness the time-sequence information, thereby equalling or out-performing the DTW approach.

7. Acknowledgements
The authors wish to thank BT Labs for the use of the Millar database, and for continuing financial support for this work. The authors would also like to thank the referees for their constructive comments, which we believe have improved the content of this paper.
References

[1] S. Furui. Cepstral analysis technique for automatic speaker verification. IEEE Trans. ASSP-29, pages 254-272, 1981.
[2] F. K. Soong, A. E. Rosenberg, L. R. Rabiner, and B. H. Juang. A vector quantization approach to speaker recognition. Proc. ICASSP-85, 1:387-390, March 1985.
[3] J. T. Buck, D. K. Burton, and J. E. Shore. Text-dependent speaker recognition using vector quantisation. Proc. ICASSP-85, 1:391-394, 1985.
[4] A. Poritz. Linear predictive hidden Markov models and the speech signal. Proc. ICASSP-82, 2:1291-1294, 1982.
[5] N. Z. Tishby. On the application of mixture AR hidden Markov models to text independent speaker recognition. IEEE Trans. Signal Processing, 39:563-570, 1991.
[6] D. A. Irvine and F. J. Owens. A comparison of speaker recognition techniques for telephone speech. Proc. Eurospeech-93, 3:2275-2278, 1993.
[7] T. Matsui and S. Furui. Comparison of text-independent speaker recognition methods using VQ-distortion and discrete/continuous HMMs. IEEE Trans. Speech and Audio Processing, 2:456-459, 1994.
[8] J. J. Webb and E. L. Rissanen. Speaker identification experiments using HMMs. Proc. ICASSP-93, 2:387-390, 1993.
[9] G. R. Doddington. Speaker recognition - identifying people by their voices. Proc. IEEE, Vol. 73, No. 11, 1985.
[10] Y. Linde, A. Buzo, and R. M. Gray. An algorithm for vector quantizer design. IEEE Trans. Communications, 28:84-95, 1980.
[11] S. J. Young and P. C. Woodland. HTK: Hidden Markov model toolkit V1.4 user manual. Cambridge University Engineering Department, Speech Group, 1992.
[12] L. E. Baum. An inequality with applications to statistical estimation for probabilistic functions of Markov processes. Inequalities, 3:1-8, 1972.
[13] S. Ran et al. Speaker recognition using continuous ergodic HMMs. Proc. SST-94, pages 706-713, 1994.
[14] B. Atal. Effectiveness of linear prediction characteristics of the speech wave for automatic speaker identification and verification. J. Acoust. Soc. Am., Vol. 55, pages 1304-1312, 1974.
[15] S. B. Davis and P. Mermelstein. Comparison of parametric representations for monosyllabic word recognition in continuously spoken sentences. IEEE Trans. ASSP-28, pages 357-366, 1980.
[16] Y. Tohkura. A weighted cepstral distance measure for speech recognition. Proc. ICASSP-86, pages 761-764, 1986.
[17] X. Zhu, Y. Gao, S. Ran, F. Chen, I. Macleod, B. Millar, and M. Wagner. Text-independent speaker recognition using VQ, mixture Gaussian VQ and ergodic HMMs. Proc. ESCA-1994, pages 55-58, 1994.
[18] J. de Veth and H. Bourlard. Comparison of hidden Markov model techniques for speaker verification. Proc. ESCA-94, 1994.
[19] H. Gish and M. Schmidt. Text-independent speaker identification. IEEE Signal Processing Magazine, Vol. 11:18-32, 1994.
[20] J. B. Millar, F. Chen, I. Macleod, S. Ran, H. Tang, M. Wagner, and X. Zhu. Overview of speaker verification studies towards technology for robust user-conscious secure transactions. Proc. SST-94, pages 744-749, 1994.
[21] E. S. Parris and M. J. Carey. Discriminative phonemes for speaker identification. Proc. ICSLP-94, 4:1843-1846, 1994.
[22] B. Juang. On the hidden Markov model and dynamic time warping for speech recognition - a unified view. AT&T Bell Laboratories Technical Journal, pages 1213-1243, 1984.