Predictive Systems for Speaker Identification

Predictive Systems for Speaker Identification: Heuristics for Model Selection

T. Artières
LAFORIA UA CNRS 1095, Tour 46-00, Boite 169, Université Paris 6
4 place Jussieu, 75252 Paris cedex 05, France
[email protected]
fax: (33)-1-44-27-70-00

1. Introduction

Speaker recognition methods rely either on the direct classification of speech characteristics or on the modeling of speech utterances. Classification-based techniques extract discriminant information from the signal and may offer high performance even for large populations. However, since they cannot easily accommodate new speakers, they are not adequate when the population changes frequently and is not limited to a fixed set. This is not a limitation for modeling approaches, which attempt to identify the speech production system independently for each speaker. Several approaches from this second group have been proposed over the years: Vector Quantization has been the most popular technique; more recently, Hidden Markov Model (HMM) and Predictive Neural Network (PNN) systems have been developed. Although they may behave differently, HMMs and PNNs share many similarities. The main limitation of the modeling approach lies in the discrepancy between training and operational objectives. Being trained to estimate the probability density of the signal, these systems cannot efficiently synthesise discriminant information; this is done indirectly through the choice of adequate models. NNs offer a wide range of models of different complexities and natures. However, when training criteria do not reflect the classification goal, automatic training is not sufficient for selecting a model, and the model space must be explored through heuristic search. We are interested here in neural predictive systems for Automatic Speaker Identification (ASI). These models have recently been shown to be good candidates for this task and have been reported to outperform HMMs [Hattori 92]. Capabilities of basic PNNs have been discussed earlier [Artières 93, Artières 94]. We study here the limitations of these models, analyse them, and propose improved systems. The improved systems are shown to realise a good compromise between several model parameters: complexity, incorporation of non-stationarity, error modeling, and the nature of the prediction task. Sections 2 and 3 introduce, respectively, the predictive models and the database. Limitations of single-state models are analysed in Section 4, and new models are presented in Section 5.

2. Predictive Systems

In the predictive approach, speakers are identified through an auto-regressive modeling of the speech signal. A thorough discussion of this hypothesis may be found in [Grenier 80]. Stationary linear models (Vectorial Autoregressive models, VARs) have recently been tested on large populations of clean speech [Bimbot 92], where they have shown good results provided the test utterance is long enough. Non-linear predictive models have been shown to outperform linear ones on a variety of tasks, including speech recognition [Iso 90] and speaker recognition [Hattori 92, Artières 94]. Predictive models obey the following equation:

Xt = F(Ct) + εt,  ∀ t = 1...T

where X1T = {X1, ..., XT} is a parameter-vector sequence resulting from the analysis of a speech utterance, Ct is the prediction context for frame Xt (for example the p preceding frames), εt is an independently, identically distributed (i.i.d.) noise, and F is a time-independent (or locally independent) function. Training aims at estimating the parameters of F so as to reach a minimal residual error. In the following, we will implicitly use Xt both for the random variable and its realisations in time. A probabilistic interpretation of the model's behaviour can be derived when hypotheses are made on the nature of εt. If εt is N(µ, Σ), the conditional probability density of Xt given Ct is:

P(Xt | F(Ct)) = P(εt) = (2π)^(-d/2) |Σ|^(-1/2) exp( -(1/2) [xt - F(Ct) - µ]^T Σ^(-1) [xt - F(Ct) - µ] )
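To make the density above concrete, here is a minimal sketch (in Python with NumPy, which is of course not part of the paper) of how a frame's log-likelihood under a predictor output F(Ct) with Gaussian residual N(µ, Σ) could be computed; the function and variable names are illustrative, not from the paper.

```python
import numpy as np

def residual_log_density(x_t, f_ct, mu, sigma):
    """Log of P(X_t | F(C_t)) = P(eps_t) for eps_t ~ N(mu, sigma).

    x_t   : observed frame, shape (d,)
    f_ct  : predictor output F(C_t), shape (d,)
    mu    : mean of the prediction error, shape (d,)
    sigma : full covariance of the prediction error, shape (d, d)
    """
    d = x_t.shape[0]
    r = x_t - f_ct - mu                    # residual, as in the density above
    _, logdet = np.linalg.slogdet(sigma)   # log |Sigma|
    quad = r @ np.linalg.solve(sigma, r)   # r^T Sigma^{-1} r
    return -0.5 * (d * np.log(2.0 * np.pi) + logdet + quad)
```

With µ = 0 and Σ = I, this reduces (up to an additive constant) to the Euclidean squared prediction error, which is the criterion used by the plain PNN models below.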

3. Data set

All the experiments presented here have been performed on a small set of 15 female speakers from the same dialect of the well-known TIMIT database. Since this database is clean and was recorded in a single session, speaker identification is easy when the test utterance is long enough. In order to test our methods, we have used short segments of different lengths. The 5 SX sentences have been used for training, and the five remaining sentences (3 SI + 2 SA) for testing. Input data for the models are vectors resulting from a 16-order LPCC analysis, using 25.6 ms Hamming windows with an overlap of 15.6 ms. The duration of an n-frame utterance is thus n/100 seconds. All NNs used in the experiments are Multi-Layer Perceptrons (MLPs) with one hidden layer, trained to produce frame Xt given a prediction context of the frame.
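As a quick sanity check on these analysis parameters: a 25.6 ms window with a 15.6 ms overlap gives a 10 ms frame shift, hence n frames cover about n/100 seconds. A throwaway Python helper (not part of the original setup) makes the arithmetic explicit:

```python
def utterance_duration_s(n_frames, window_ms=25.6, overlap_ms=15.6):
    """Approximate duration covered by n analysis frames.

    The frame shift is window - overlap = 10 ms, so n frames
    span roughly n * 10 ms = n / 100 seconds.
    """
    shift_ms = window_ms - overlap_ms
    return n_frames * shift_ms / 1000.0
```

For instance, the 0.5 s test segments used below correspond to about 50 frames.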

4. Performances and complexity compromise for single-state models

The link between the classification performance of a predictive system and the model complexity is not straightforward [Artières 93]. Figures 1.a and 1.b show prediction and classification (on 0.5 s speech segments) performances for one-hidden-layer MLPs of increasing complexity. A comparison of Fig. 1.a and Fig. 1.b shows that overfitting arises earlier for classification than for prediction and is much more severe. Moreover, controlling the latter, e.g. through regularization, has only indirect repercussions on the former. As a consequence, it is difficult to exploit the full capabilities of PNNs for sequence classification. This difficulty will arise with any non-discriminant algorithm, whatever the models are, NNs or HMMs.

Fig. 1: Performances of stationary neural predictive models as a function of their complexity (i.e. number of hidden cells). Fig. 1.a shows classification performances (with error bars) on short speech segments (0.5 s). The five curves in Fig. 1.b represent, from top to bottom: the mean prediction error of a speaker's predictor on his training data ("Training"); on his test data ("Intra-Speaker Generalization"); on the test data of other speakers ("Inter-Speaker Generalization"); the mean error of the best impostor ("Best Impostor"); and the difference between "Best Impostor" and "Intra-Speaker Generalization" ("Separation"), which measures the distance between a speaker model and the most confusable impostor model.

In order to analyse the correlation between prediction and classification errors, we have introduced in Fig. 1.b the "Separation" measure between the true talker and his closest competitor. It allows the two errors to be correlated on a frame basis: the isolated-frame classification performance is maximal for 5 hidden cells, which corresponds to the maximum of the "Separation" measure in Fig. 1.b. This no longer holds for longer test segments (as shown above for 0.5 s), where the 10-hidden-cell models are systematically the best. This is a consequence of the correlation of prediction errors: the probability of misclassification decreases with the test duration, the rate of this decrease being governed by the correlation between successive prediction errors. Since errors become less correlated as the predictor complexity increases, the optimal model complexity also increases with the size of the test utterance. Figure 2 shows another aspect of this phenomenon: the performance evolution as a function of the model complexity is not the same for the different classes of phonemes. This suggests using models which automatically adapt themselves to broad phonetic categories, which is what we do with the multi-state models discussed below.
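The segment-length effect discussed above follows directly from how a predictive system classifies: each speaker's predictor is run over the test segment, and the speaker with the smallest accumulated residual wins. A minimal sketch, with hypothetical predictor callables (any function F mapping a context to a frame; names are illustrative):

```python
import numpy as np

def identify_speaker(frames, contexts, predictors):
    """Return the index of the predictor with the lowest mean squared
    prediction error over the whole test segment (Euclidean error,
    i.e. the plain PNN criterion).

    frames     : array (T, d) of observed frames X_t
    contexts   : array (T, p*d) of prediction contexts C_t
    predictors : list of callables, one per speaker
    """
    scores = []
    for predict in predictors:              # one predictor per speaker
        preds = np.array([predict(c) for c in contexts])
        scores.append(np.mean(np.sum((frames - preds) ** 2, axis=1)))
    return int(np.argmin(scores))
```

Because the per-frame errors of a given predictor are correlated, averaging over a longer segment reduces the variance of these scores more slowly than for independent errors, which is why the optimal model complexity shifts with the test duration.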

Fig. 2: Comparison of non-linear predictors of various complexities (measured by the number of hidden cells, hc) and VARs. For each model, the mean prediction error (intra-speaker generalization) is plotted on broad phonetic classes (TIMIT's labelling).

5. Improved Modeling

5.1. Non-stationarity and error modeling

The mismatch between training and test criteria may be reduced by using smarter modeling. We propose here two improvements which remove oversimplistic hypotheses of the above PNN models. First, the white-noise hypothesis on the prediction error does not correspond to the behaviour of the models; we will use a full Gaussian instead. Second, the underlying stationarity hypothesis of the PNN model is too strong; we will use ergodic multi-state predictive models [Hattori 92] and show that they may be effective at improving simple PNNs. We have combined these two improvements. These models are close to non-linear autoregressive HMMs. They can be trained via a Baum-Welch-like algorithm [Tsuboka 90], although we will use here a simpler training algorithm based on the Viterbi approximation: training alternates the re-estimation of the predictors' weights and the computation of the error characteristics along the optimal sequence of states. Note that this multi-state PNN model (MPNN) may be interpreted both as an ergodic model and as a mixture of autoregressive models. [Matsui 92] has shown that ergodic HMMs do not outperform Gaussian Mixture Models for speaker recognition, and provided evidence that performance mainly depends on the number of mixtures rather than on the number of states. Figure 3 shows the relative performances of one-state and three-state models, with a Euclidean (PNN and MPNN) or Gaussian error (PNNG and MPNNG). The main results are as follows:

• The Gaussian modeling of the prediction error always improves the performances. This improvement is an inverse function of the predictor complexity (number of hidden cells). This is true for one-state and three-state models, with a single exception for the 10-hidden-cell, three-state models, which may be caused by the great number of parameters (≅2000) of the corresponding MPNNG.
• Three-state models (MPNN or MPNNG) always outperform the corresponding one-state model (PNN or PNNG).
• The optimal complexity of the predictors depends on the error modeling. For one-state models, the best models are the 10-hidden-unit predictors with a Euclidean error, and the 5-hidden-unit predictors with a Gaussian error. This optimal complexity is identical here for PNN and MPNN, and for PNNG and MPNNG.
• Multi-state models automatically focus on different clusters of the pattern space.
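The Viterbi-approximation training described above can be sketched as an alternation between state assignment and predictor re-estimation. For a self-contained illustration, this sketch substitutes per-state linear least-squares predictors for the MLPs and ignores transition costs; only the alternation scheme, not the predictor form, reflects the paper's algorithm.

```python
import numpy as np

def train_multistate(contexts, targets, n_states=3, n_iter=5, seed=0):
    """Viterbi-style alternation for an ergodic multi-state predictive model.

    1) each frame is assigned to the state whose predictor gives the
       smallest residual (a simplified 'optimal sequence of states');
    2) each state's predictor is re-fitted on its assigned frames.
    Linear predictors (weight matrices W[s]) stand in for the MLPs.
    """
    rng = np.random.default_rng(seed)
    d_ctx = contexts.shape[1]
    W = [0.01 * rng.normal(size=(d_ctx, targets.shape[1]))
         for _ in range(n_states)]
    states = rng.integers(n_states, size=len(targets))  # random init
    for _ in range(n_iter):
        # (2) re-estimate each state's predictor on its current frames
        for s in range(n_states):
            idx = states == s
            if idx.sum() >= d_ctx:          # enough frames for a stable fit
                W[s], *_ = np.linalg.lstsq(contexts[idx], targets[idx],
                                           rcond=None)
        # (1) reassign every frame to its best-predicting state
        errs = np.stack([np.sum((targets - contexts @ w) ** 2, axis=1)
                         for w in W])
        states = errs.argmin(axis=0)
    return W, states
```

A full MPNNG would additionally re-estimate the Gaussian error parameters (µ, Σ) of each state along the optimal state sequence.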

Fig. 3: Comparison of PNN and PNNG (Fig. 3.a) on short (0.5 s) speech segments (training and test data), and of the corresponding three-state MPNNs and MPNNGs (Fig. 3.b), as a function of utterance length.

5.2. Prediction task

Another important parameter of a model is the prediction task it is trained on, e.g. which context should be used for predicting frame t. Although several methods have been proposed for input-variable selection on prediction tasks, they are not relevant for our classification problem. Once again, things are more complex here, since the best predictor is unlikely to be the best candidate for classification. The complexity of a speech prediction task may be defined by some measure of dependency between the value to predict and the predictor variables. We will not introduce formal definitions here, but it is clear that, for speaker modeling, predicting t from t-1 alone is more difficult than predicting t from (t-1, t-2), which in turn is easier than predicting t from (t-3, t-4). Figure 4.a shows the performances of 6 different single-state predictors of the same architecture (15 hidden cells) which use different contexts, arranged in increasing order of complexity (estimated through a conditional entropy measure). The curves for the three different durations all show two peaks, at (t-2, t-1) and (t-3, t-2).
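The paper ranks contexts by a conditional entropy measure without giving details. Under a Gaussian assumption, the conditional entropy H(Xt | Ct) is, up to an additive constant, half the log-determinant of the residual covariance of the best linear predictor, which suggests the following rough, hypothetical proxy (the function is our illustration, not the paper's measure):

```python
import numpy as np

def context_entropy_proxy(frames, lags):
    """Proxy for H(X_t | X_{t-l}, l in lags): fit a linear predictor on
    the chosen lagged frames and return log det of the residual
    covariance (exact for Gaussian processes, a heuristic otherwise).
    Lower values indicate an easier, more informative prediction context.
    """
    t0 = max(lags)
    ctx = np.hstack([frames[t0 - l: len(frames) - l] for l in lags])
    tgt = frames[t0:]
    W, *_ = np.linalg.lstsq(ctx, tgt, rcond=None)
    resid = tgt - ctx @ W
    cov = np.atleast_2d(np.cov(resid.T))
    return np.linalg.slogdet(cov)[1]
```

On a synthetic AR(1) signal, the proxy is lower for context (t-1) than for (t-3), matching the intuition above that nearby frames yield the easier prediction task.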

6. Discussion

From the above results, it is clear that improved performances result from a compromise between different model parameters which cannot be learned automatically and must be searched heuristically in some model parameter space. We have analysed the limitations of simple PNN approaches and have proposed dimensions of this parameter space which have proved efficient for improving performances. They make it possible to control the model complexity, its adequacy to the task, and the nature of the prediction task. Figure 4 plots the performances along two parameters of this model space and shows a local optimum, which is our best single-state model.

Fig. 4: Influence of the prediction task and of the model complexity (number of hidden cells) for PNN-based systems. Fig. 4.a shows the behaviour (for various test lengths) of 15-hidden-cell PNNs on prediction tasks of increasing complexity. Fig. 4.b shows classification performances of PNN-based systems as a function of the model complexity and of the prediction task (ordered by increasing complexity).

References

[Artières 93] Artières T., Gallinari P., 1993: Neural models for extracting speaker characteristics in speech modelization systems, Eurospeech, III, 2263-2266.
[Artières 94] Artières T., Gallinari P., 1994: Adequacy of neural predictors for speaker identification, WCNN.
[Artières 95] Artières T., Gallinari P., 1995: Multi-state predictive neural networks for text-independent speaker recognition, to be published in Eurospeech.
[Bimbot 92] Bimbot F., Mathan L., De Lima A., Chollet G., 1992: Standard and target-driven AR-vector models for speech analysis and speaker recognition, ICASSP.
[Grenier 80] Grenier Y., 1980: Utilisation de la prédiction linéaire en reconnaissance et adaptation au locuteur, JEP 80, 163-171.
[Hattori 92] Hattori H., 1992: Text-independent speaker recognition using neural networks, ICASSP, II, 153-156.
[Iso 90] Iso K., Watanabe T., 1990: Speaker-independent word recognition using a neural prediction model, ICASSP 90, 441-444.
[Matsui 92] Matsui T., Furui S., 1992: Comparison of text-independent speaker recognition methods using VQ-distortion and discrete/continuous HMMs, ICASSP, 157-160.
[Tsuboka 90] Tsuboka E., Takada Y., Waita H., 1990: Neural predictive hidden Markov model, ICSLP 90.