Learning Vector Quantization in Text-Independent Automatic Speaker Recognition

Thomas E. Filgueiras Filho, Ronaldo O. Messina, Euvaldo F. Cabral Jr.
{tefil, ronaldo, [email protected]}
Escola Politécnica da Universidade de São Paulo, Departamento de Engenharia Eletrônica
Laboratório de Comunicações e Sinais, Grupo Comunicação Homem–Máquina
São Paulo, Brasil, 1998
Abstract

In this paper we report a comparison between Learning Vector Quantization (LVQ) and two other common approaches to text-independent speaker recognition, namely Gaussian Mixture Models (GMM) and Vector Quantization (VQ). The results show that the neural approach is less efficient, in terms of recognition scores, than the GMM.
1. Introduction

Automatic Speaker Recognition (ASR) is the task of identifying a person (speaker) solely by his/her voice using a machine (the term system will also be used). Before recognition, the speech signal is converted into a temporal sequence of feature vectors. There are two modes: speaker verification and speaker identification. In the former, the person claims an identity and the machine accepts or rejects the claim; in the latter, the machine determines the identity of the person. When the machine must recognize the unknown person as one of a set, the recognition is called closed-set; when the machine must also decide whether the person belongs to the set at all, the recognition is called open-set. When the person must use the same word or phrase that was used during training, the system is called text-dependent; when the person is free to say anything, the system is called text-independent. There is another class, named restricted-text, in which the system prescribes the utterance that must be spoken. The ASR system in this article works in closed-set, text-independent mode. There are many applications of ASR, e.g., restricting access to places or information, banking transactions (together with speech recognition), and others. Independence from what must be uttered gives more flexibility to the system, but increases the difficulty of determining the person's identity, due to the greater variability faced by the system. In this paper, three different approaches to ASR are compared. Two are statistical methods and the other is an Artificial Neural Network (ANN). The statistical methods are the Gaussian Mixture Model (GMM) [8] and Vector Quantization (VQ) using the Linde-Buzo-Gray (LBG) algorithm [5]; they approximate the distribution of the feature vectors continuously and discretely, respectively. The ANN approach is Learning Vector Quantization (LVQ); the algorithms LVQ1 [3], LVQ2.1 [3] and LVQ3 [4] were applied to recognition. The rest of this paper develops as follows. In the next section the speech corpus is described, and in Section 3 the conversion of the digitized speech signal into feature vectors is presented. The recognition schemes adopted are described in Section 4, where each method is discussed in more detail. Section 5 presents the tests performed and the results obtained. Finally, Section 6 discusses the whole work.
2. Speech Corpus

Two sets of training and testing utterances, formed by phonetically balanced phrases in Brazilian Portuguese [1], were recorded by ten male speakers. The utterances were obtained in an office environment with a standard microphone and digitized with 16 bits of resolution at 11,025 samples/s. The sets were gathered in different sessions, separated by two weeks. The first set consists of twelve training phrases, selected by a person without any phonetic knowledge, resulting in about 25 s of speech, and ten test phrases from the most balanced list in [1]. The second set consists of six training phrases, the first six from the third list, which concentrates the major kinds of sounds, such as vowels, nasals, fricatives, etc., resulting in 13 s of speech. There are ten test phrases, from the least balanced list. On average, the test phrases from both sets have a duration of 2.32 s. No noise reduction or normalization was applied in the recording sessions. All the speakers were male in order to increase the confusability among the speakers' voices. The speakers were 19-29 years old and none of them presented any speech disability.
3. Feature Vectors

Before recognition, the speech waveform must be put in a representation suited to processing by a digital computer, and this representation must retain the main information about the subject who produced the voice. The waves of pressure in air are transduced to values of electrical voltage and digitized. The resulting waveform is converted into a sequence of N-dimensional vectors that represent the speech spectrum; these vectors are hereafter called feature vectors. The feature vectors are extracted in a frame-by-frame analysis, i.e. the speech signal is divided into short overlapping segments (frames); these frames have a duration of approximately 10-30 ms, with 5-20 ms of overlap. The frames are short because the vocal tract does not alter its shape in such a short interval of time; the overlap between segments guarantees that the variations in the vocal tract are captured smoothly by the temporal sequence of vectors. From each frame, an N-dimensional vector is calculated. In this work, the vectors are extracted at a rate of 100 vectors/s and the frames have a duration of 23.22 ms. The feature vectors used in this work are the Mel-Frequency Cepstral Coefficients (MFCC) [2]. They were chosen because they are fast to calculate and they provide a compact representation of the speech spectrum that has proved efficient for speaker recognition.
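As an illustration of this front end, the following is a minimal sketch that extracts 20 MFCC per frame with the frame and hop sizes used here (23.22 ms corresponds to 256 samples at 11,025 samples/s). The librosa library and the file name are assumptions; the original work predates such tools.

```python
# A minimal sketch of the MFCC front end described above, using the
# librosa library (an assumption; not the tool used in the original work).
import librosa

signal, sr = librosa.load("utterance.wav", sr=11025)  # hypothetical file name

mfcc = librosa.feature.mfcc(
    y=signal,
    sr=sr,
    n_mfcc=20,       # 20 coefficients, as in the experiments of Section 5
    n_fft=256,       # 256 samples ~= 23.22 ms at 11,025 samples/s
    hop_length=110,  # ~10 ms hop -> ~100 feature vectors per second
)
# mfcc has shape (20, n_frames): one feature vector per analysis frame
```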
4. Recognition Schemes

In ASR there is a training phase, in which each speaker furnishes some samples of his/her voice so that the system "gets to know" the speaker. In this article, these samples are used to create a model for each speaker, and the decision on the identity is based on an appropriate metric, suited to each approach. In the GMM approach, the training consists in approximating the distribution of the feature vectors of each speaker in the set by a mixture of Gaussian densities. Recognition on new speech is made by calculating the likelihood of the feature vectors under each model, and the decision is made in favor of the most likely model.
In the VQ approach, the distribution of the feature vectors is approximated by a discrete set of vectors. These vectors are the centroids of a partition of the vector space and minimize the distortion (the mean distance between the centroid and each vector) in each cell of the partition. Recognition is made by calculating the mean distance between the new vectors and the centroids of each speaker, and using the minimum of these distances to point out the speaker. Recognition with the LVQ works in the same way as with the VQ; what changes is the way the centroids are obtained, which follows each LVQ algorithm. The main difference between these methods is that the LVQ is trained in a discriminative way, i.e. the vectors from the other classes are also used during training.
4.1. Gaussian Mixture Models

A mixture density is a weighted sum of $M$ component densities; for a feature vector $\vec{x}_t$ and a speaker model $\lambda_s$, the mixture density is defined as

$$p(\vec{x}_t \mid \lambda_s) = \sum_{i=1}^{M} p_i^s \, b_i^s(\vec{x}_t) \qquad (1)$$

where the $p_i^s$ are the mixture weights and the $b_i^s(\vec{x}_t)$ are the component densities, each of which is a $D$-variate Gaussian function of the form

$$b_i^s(\vec{x}_t) = \frac{1}{(2\pi)^{D/2}\,|\Sigma_i|^{1/2}} \exp\!\left(-\frac{1}{2}\,(\vec{x}_t - \vec{\mu}_i)^{T}\,\Sigma_i^{-1}\,(\vec{x}_t - \vec{\mu}_i)\right) \qquad (2)$$

where $\vec{\mu}_i$ is the mean vector and $\Sigma_i$ is the covariance matrix. To ensure that the mixture is a true probability density function, the mixture weights must satisfy the constraint $\sum_{i=1}^{M} p_i^s = 1$. The mean vectors, covariance matrices and mixture weights of all component densities compose the model

$$\lambda = \{p_i, \vec{\mu}_i, \Sigma_i\}, \qquad i = 1, 2, \ldots, M,$$
parameterizing the complete Gaussian mixture. Each speaker is then represented by a model $\lambda$, which parameterizes a GMM characteristic of the person [7]. In the training phase, the maximum likelihood (ML) estimation method is used to estimate the GMM parameters; [7] presents a variation of the Expectation-Maximization (EM) algorithm to find the ML parameters. The EM algorithm consists of, given a set of training vectors $X = \{\vec{x}_t\}_{t=1}^{T}$ and an initial model $\lambda$, calculating a new model $\bar{\lambda}$ such that $p(X \mid \bar{\lambda}) \geq p(X \mid \lambda)$. In the next iteration the new model becomes the initial model, and so on until some convergence threshold is reached. The formulas presented in [7, 8] were used to calculate the new model, and a random initialization was used for the EM algorithm.
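The iteration and stopping rule just described can be sketched as follows; the `reestimate` and `log_likelihood` functions are hypothetical placeholders standing in for the update formulas of [7, 8].

```python
# A sketch of the EM iteration described above. `reestimate` applies the
# re-estimation formulas of [7, 8] (abstracted here) and `log_likelihood`
# computes log p(X | model); both are hypothetical placeholders.
def em_train(X, model, reestimate, log_likelihood, threshold=1e-4):
    prev = log_likelihood(X, model)
    while True:
        model = reestimate(X, model)   # EM guarantees p(X|new) >= p(X|old)
        curr = log_likelihood(X, model)
        if curr - prev < threshold:    # convergence threshold reached
            return model
        prev = curr
```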
The testing phase consists of presenting all the $T$ test vectors $\{\vec{x}_t\}_{t=1}^{T}$ that compose the observation sequence to the models $\lambda_1, \lambda_2, \ldots, \lambda_S$ that represent the GMMs of the $S$ speakers, and finding the one with the highest probability of having produced the test sequence.
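For concreteness, here is a minimal sketch of this train-and-score scheme, with scikit-learn's GaussianMixture standing in for the EM implementation of [7, 8]; the diagonal covariance type and the mixture count shown are illustrative assumptions.

```python
# A minimal sketch of GMM training and closed-set identification, using
# scikit-learn's GaussianMixture in place of the authors' own EM code.
import numpy as np
from sklearn.mixture import GaussianMixture

def train_speaker_gmm(train_vectors, n_mixtures=32):
    """Fit one GMM on a single speaker's MFCC vectors, shape (n_frames, 20)."""
    gmm = GaussianMixture(
        n_components=n_mixtures,
        covariance_type="diag",  # a common choice for speaker GMMs; assumption
        init_params="random",    # the paper uses a random EM initialization
    )
    return gmm.fit(train_vectors)

def identify(test_vectors, models):
    """Return the index of the model most likely to have produced the test
    sequence (highest mean log-likelihood over the test vectors)."""
    scores = [m.score(test_vectors) for m in models]
    return int(np.argmax(scores))
```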
4.2. Vector Quantization

As the distribution of the feature vectors of a person's voice has an unknown shape, it can be modeled nonparametrically, for example with Vector Quantization (VQ) [5]. This method is very simple and performs well in speaker recognition [9]. The VQ consists in choosing a set of M vectors which approximates the pdf of the feature vectors (of a particular subject) in a discrete way. These vectors form a codebook, which is used as a model of each speaker's voice; it is believed that this discrete set of vectors represents basic acoustic units that are particular to each person's voice, so they should be representative of the person's identity. The vectors comprising the codebook are called codewords. The algorithm used in this work to calculate the codebooks is the well-known LBG (Linde-Buzo-Gray) algorithm [5]. The LBG algorithm starts with one codeword, the mean of the training vectors, and applies a splitting procedure until the desired number of codewords is reached. This algorithm has the inconvenience that, due to the splitting procedure, it can only produce codebooks whose size is a power of two. In the case of the MFCC, the distance between two D-dimensional feature vectors $\vec{x}$ and $\vec{y}$ can be calculated simply as the Euclidean norm between the vectors, because the cosine expansion performs a "cheap" principal component analysis [2], which means that the vectors are in an orthogonal basis. A norm that takes into account the variance of the individual vector components, however, should provide a better partition of the space.
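The following is a minimal sketch of the LBG splitting procedure and of the distortion-based matching described in Section 4; the perturbation factor, the number of refinement iterations, and the NumPy implementation are assumptions, not the authors' exact settings.

```python
# A sketch of LBG codebook training: start from the global mean and split
# until M codewords are reached (M a power of two), refining the codewords
# with nearest-neighbor reassignment and centroid updates after each split.
import numpy as np

def lbg(vectors, n_codewords, eps=0.01, n_iter=20):
    codebook = vectors.mean(axis=0, keepdims=True)  # one codeword: the mean
    while len(codebook) < n_codewords:
        # split each codeword into a perturbed pair
        codebook = np.vstack([codebook * (1 + eps), codebook * (1 - eps)])
        for _ in range(n_iter):  # refinement (Lloyd-style) iterations
            d = np.linalg.norm(vectors[:, None, :] - codebook[None], axis=2)
            nearest = d.argmin(axis=1)
            for k in range(len(codebook)):
                members = vectors[nearest == k]
                if len(members):
                    codebook[k] = members.mean(axis=0)
    return codebook

def vq_distortion(test_vectors, codebook):
    """Mean distance from each test vector to its nearest codeword; the
    speaker whose codebook gives the minimum distortion is chosen."""
    d = np.linalg.norm(test_vectors[:, None, :] - codebook[None], axis=2)
    return d.min(axis=1).mean()
```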
4.3. Learning Vector Quantization

This method is a variant of the nearest-neighbor method, but without the use of statistical samples of tokens of known vectors; instead, a number of reference vectors are selected for each class, whose values are optimized in the learning process. The reference vectors approximate the probability density functions of the pattern classes; or, more accurately, the nearest neighbors define decision surfaces between the pattern classes which seem to approximate those of the theoretical Bayes classifier very closely [3]. Three algorithms were used in the training phase, LVQ1 [3], LVQ2.1 [3] and LVQ3 [4], and the results of each one were evaluated in the testing phase. In LVQ1, only the nearest codebook vector is updated at each learning step, and this update is made differently for correct and incorrect classifications, in a way compatible with the criterion used for identification. In LVQ2.1, two codebook vectors are updated simultaneously: one belongs to the same class as the training vector and the other to a wrong class; furthermore, the training vector must fall into a window of relative width w around the codebook vectors. LVQ2.1 pays no attention to what might happen to the locations of the reference vectors in the long run if this process were continued, so it seems necessary to include corrections that ensure the codebook keeps approximating the class distributions. LVQ3 combines these ideas: the two nearest neighbors are updated if one belongs to the same class as the training vector and the other to a wrong class, or if both belong to the class of the training vector; the training vector must still fall within the window. The testing phase consists in finding the nearest codebook vector to a given test vector and assigning the class of that reference vector to the test vector; it is the same for the three methods above. A sketch of the LVQ1 update is given below.
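The sketch below implements the LVQ1 rule of [3]: move the nearest codeword toward the training vector on a correct classification and away from it on an incorrect one. The initialization, learning-rate schedule and NumPy implementation are assumptions, not the settings used in the experiments.

```python
# A minimal sketch of LVQ1 training and of the nearest-codeword test phase.
import numpy as np

def lvq1(vectors, labels, codebook, cb_labels, alpha=0.05, epochs=10):
    codebook = codebook.copy()
    for epoch in range(epochs):
        lr = alpha * (1 - epoch / epochs)  # decaying learning rate (assumed)
        for x, y in zip(vectors, labels):
            c = np.linalg.norm(codebook - x, axis=1).argmin()  # nearest codeword
            if cb_labels[c] == y:
                codebook[c] += lr * (x - codebook[c])  # correct class: attract
            else:
                codebook[c] -= lr * (x - codebook[c])  # wrong class: repel
    return codebook

def classify(x, codebook, cb_labels):
    """Testing phase: assign the class of the nearest codebook vector."""
    return cb_labels[np.linalg.norm(codebook - x, axis=1).argmin()]
```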
5. Experiments and Results

The recognition experiments were performed in four different ways: the first experiment (EXP1) uses the training and testing phrases of the first set; in the second (EXP2), the training and testing phrases of the second set were used; in the third (EXP3), the training phrases of the first set were used along with the testing phrases of the second set; and in the fourth (EXP4), the training phrases of the second set were used along with the testing phrases of the first set. Twenty MFCC were used, obtained from frames of 23.22 ms separated by 10 ms; the same MFCC were used in all the experiments. The results obtained with different numbers of centroids are shown in Tables 1 to 4.
        GMM    VQ     LVQ1   LVQ2.1  LVQ3
EXP1    95%    73%    37%    39%     39%
EXP2    88%    63%    50%    53%     52%
EXP3    67%    49%    59%    63%     63%
EXP4    78%    57%    64%    66%     67%

Table 1. Recognition with 2 centroids.
For the LVQ, the number of codewords used was the number of centroids per speaker in the experiments above, multiplied by the number of speakers.
        GMM    VQ     LVQ1   LVQ2.1  LVQ3
EXP1    100%   93%    88%    89%     89%
EXP2    97%    92%    88%    89%     88%
EXP3    85%    87%    59%    61%     61%
EXP4    91%    87%    64%    64%     64%

Table 2. Recognition with 8 centroids.
        GMM    VQ     LVQ1   LVQ2.1  LVQ3
EXP1    100%   100%   94%    95%     95%
EXP2    96%    99%    99%    99%     99%
EXP3    91%    82%    77%    78%     77%
EXP4    88%    79%    75%    77%     77%

Table 3. Recognition with 32 centroids.
6. Conclusions

Because the second set of training phrases concentrates more phonetic variation, high recognition scores are still obtained in EXP2 with half the speech time available for training. In EXP3 and EXP4, where the testing and training phrases come from different sessions, a temporal variation is present. This variation decreases the recognition score, so it is an important factor in designing ASR systems. In this work no strategy was used to compensate for this variation, as the main interest was evaluating the modeling techniques. One very popular approach to compensating for temporal variation is the use of delta and delta-delta features [6]. The training times for the VQ and the LVQ were the shortest and the longest, respectively; for the GMM, the training time is much longer than for the VQ but shorter than for the LVQ. The LVQ is the slowest because its training is discriminative, so a larger number of vectors is presented. The number of free parameters that must be estimated in the GMM is larger than in the VQ and the LVQ, which results in a longer training phase, especially when the number of mixtures is large; the number of training vectors must then also be larger to produce a good estimate. In the testing phase, the difference in recognition time is negligible and all of the methods performed in real time. For the VQ and the LVQ, the recognition score increases with the number of centroids (with some small oscillations), but there must be enough training vectors for proper convergence. A number of 32 or more centroids is adequate for the tests performed in this work.
        GMM    VQ     LVQ1   LVQ2.1  LVQ3
EXP1    100%   100%   96%    96%     96%
EXP2    94%    98%    100%   100%    100%
EXP3    91%    88%    83%    83%     82%
EXP4    85%    82%    79%    80%     80%

Table 4. Recognition with 64 centroids.
In more general tasks, however, more work should be done to optimize the recognition system, along with modifications to the pre-processing. Increasing the number of mixtures in the GMM does not necessarily increase the recognition score: the results show that there is an optimum number of mixtures with maximum performance, lying between 16 and 32 mixtures. For a small number of centroids (2), the LVQ performance is worse than that of the VQ or the GMM; this may be due to the discriminative training used to obtain the centroids. In general, the performance of LVQ1, LVQ2.1 and LVQ3 is the same, but LVQ3 has a lower computational cost than the others. Better results might be obtained using a varying learning rate. Comparing the scores of the models generated by the LVQ, the VQ and the GMM, the latter gives, in general, better performance than the former two in experiments EXP3 and EXP4. This may be due to the higher modeling capacity of the GMM [8]: a combination of Gaussians can smoothly approximate a large number of arbitrary distributions, whereas with the LVQ or the VQ the distribution is approximated in a coarser way. As further work we suggest that other features be tested, and that speech under degraded conditions also be tested to verify the robustness of the features, as well as the use of delta features to compensate for the temporal variation. A larger database should be used in order to verify the modeling capabilities of these methods when there is more overlap between the classes.
References

[1] A. Alcaim, J. A. Solewicz, and J. A. de Moraes. Frequência de ocorrência dos fones e lista de frases foneticamente balanceadas no português falado no Rio de Janeiro. Revista da Sociedade Brasileira de Telecomunicações, 7(1):23-41, December 1992.
[2] S. B. Davis and P. Mermelstein. Comparison of parametric representations for monosyllabic word recognition in continuously spoken sentences. IEEE Transactions on Acoustics, Speech, and Signal Processing, ASSP-28(4):357-366, August 1980.
[3] T. Kohonen. Self-Organization and Associative Memory. Springer-Verlag, New York, 1989.
[4] T. Kohonen. The self-organizing map. Proceedings of the IEEE, 78(9):1464-1480, September 1990.
[5] Y. Linde, A. Buzo, and R. M. Gray. An algorithm for vector quantizer design. IEEE Transactions on Communications, COM-28(1):84-95, January 1980.
[6] L. R. Rabiner and B.-H. Juang. Fundamentals of Speech Recognition. Prentice Hall, Englewood Cliffs, New Jersey, 1993.
[7] D. A. Reynolds. A Gaussian Mixture Modeling Approach to Text-Independent Speaker Identification. Ph.D. thesis, Georgia Institute of Technology, 1992.
[8] D. A. Reynolds. Robust text-independent speaker identification using Gaussian mixture models. IEEE Transactions on Speech and Audio Processing, 3(1):72-83, January 1995.
[9] F. K. Soong, A. Rosenberg, L. R. Rabiner, and B. Juang. A vector quantization approach to speaker recognition. In Proceedings of IEEE ICASSP '85, pages 387-390, Tampa, FL, March 1985.