Improving GMM-based Speaker Recognition Using Trained Voice Activity Detection

Artur Janicki, Szymon Biały
Institute of Telecommunications, Warsaw University of Technology
Nowowiejska 15/19, 00-665 Warszawa, Poland
e-mail:
[email protected];
[email protected]
Abstract—This paper describes a method of increasing the effectiveness of speaker recognition based on Gaussian mixture models (GMM) in the presence of noise. The method consists in using a voice activity detector (VAD) and training it for given environmental conditions. The system was verified for 44 Polish speakers, for different types of noise, and was also tested for different feature vectors and GMM model sizes. The results are presented and discussed. It turns out that introducing VAD improves recognition significantly; however, the degree of improvement depends on the noise type.
I. INTRODUCTION

Speaker recognition is a fast-developing sub-domain of speech processing. It can be applied in access control, in managing speech signal data or in personalisation, e.g. of settings in electronic equipment. Speaker recognition can be divided into two problems: the first one is speaker identification (detection), when the system has to identify a speaker from within a closed set of N speakers; the second one is speaker verification, when a speaker claims an identity. Both problems are similar and similar techniques are used for both of them, but they differ in the decision level used when giving the final answer. Unfortunately, speaker recognition methods are not ideal and they are prone to errors, especially in the presence of noise. This article proposes one way of improving their robustness.

II. PROBLEM DEFINITION

There exist various speaker recognition techniques. They can serve for text-dependent and/or text-independent recognition. Usually they are based on spectral parameters of speech, either by analysing their statistical distribution within a longer speech fragment, or by first segmenting the signal into e.g. phonemes and analysing the statistics within phonetic clusters. Some studies add pitch information to the spectral features to enhance recognition effectiveness [3]. Those parameters are then used to build speaker models. For that purpose e.g. hidden Markov models (HMM), vector quantisation (VQ) or neural networks are used. Recently, the Gaussian mixture model (GMM) technique has been shown by many researchers to be
very successful, especially for text-independent speaker recognition. For example, [5] presents a GMM-based speaker verification method which compares the likelihood for the hypothesised speaker model against a generalised "background" speaker model. A multi-space probability distribution GMM (MSD-GMM) is used for joint modelling of spectral and pitch features in [3]. The good performance of GMM in speaker recognition convinced us to use this technique in this paper, too.

Speaker recognition, however, can suffer severely if the speech is disturbed by noise. Systems which perform well for clean speech turn out not to be robust against noise and make many identification errors, such as false alarms or misses. To cope with this problem, pre-processing of the input speech signal is performed, such as spectral subtraction [8] or Wiener filtering [2], or the speaker models are adapted to noisy conditions. Some improvement can also be achieved by separate treatment of voiced and unvoiced speech. Still, those methods are not fully satisfactory and there is a need for a robust speech activity detector. Such an energy-based speech activity detector was already used in the front-end processing block of a speaker identification system [5], but in this paper a newer, more universal approach will be presented and tested for speaker recognition.

III. PROPOSED APPROACH

To enhance the effectiveness of GMM-based speaker recognition, it was proposed to use voice activity detection (VAD), developed based on the idea taken from [2]. In order to make correct decisions about voice/noise segments, the authors of [2] proposed to use a combination of features: amplitude level, zero crossing rate (ZCR), spectral information and Gaussian mixture model likelihood. Those features were normalised and weighted by four different factors, obtained during minimum classification error (MCE) training. The authors confirmed the effectiveness of their VAD by testing it for three different types of noise (craft noise, background speech and sensor room) and showing that their system performs better than systems based on a single feature only [2].
Following those promising results, it was proposed to use a similar VAD and employ it in speaker recognition. Three features were taken into account: amplitude, zero crossing rate and spectrum. The amplitude level was calculated as the logarithm of the energy of the Hamming-windowed signal; the zero crossing rate (ZCR) was the number of times the signal crosses the zero value within a given time frame. The spectrum was analysed by dividing the frequency domain into 10 sub-bands and calculating the energy in each of them. In the original implementation a fourth feature was also used: the GMM likelihood, which was meant to cope with background speakers. Recognition in the presence of background speech was not analysed in this paper, so this component was omitted.

The research described in this paper involved training the VAD system for different types of noise, applying it to speaker recognition using Gaussian mixture models with different feature vectors, and verifying the speaker recognition performance. The effectiveness of speaker recognition was verified for Polish-speaking voices taken from a speech corpus.

IV. TRAINING OF VAD

To function correctly, the VAD module had to be trained to adapt to a given type of noise. The minimum classification error (MCE) technique [7] was used as the training method. Following [2], a labelled file is used to train the weight of each of the three features. Each frame of the utterance is judged as speech or non-speech, and during the training process the weights are updated sequentially, frame by frame. The misclassification measure of a training data frame x_t is defined as

d_k(x_t) = -g_k(x_t) + g_i(x_t)    (1)

where the MCE criterion is defined using the discriminative functions g_k(x_t), g_i(x_t), which represent the difference between the sum of weighted features for a training frame (k denotes the speech or non-speech cluster of the current data frame, i denotes the other cluster, non-speech or speech respectively, of the previous data frames) and a threshold. The loss function l_k is defined as a differentiable sigmoid function approximating the 0-1 step loss function:

l_k(x_t) = \left(1 + \exp(-\gamma \cdot d_k(x_t))\right)^{-1}    (2)

Initially the weights w_1, ..., w_3 are set to 1/3 and the scale is changed to logarithmic (the weights are then denoted as \tilde{w}_1, ..., \tilde{w}_3). The set of weights \tilde{w} is adjusted sequentially every time a frame is given, according to the following rule:

\tilde{w}(t+1) = \tilde{w}(t) - \varepsilon_t \nabla l_k(x_t)    (3)

where \varepsilon_t is a monotonically decreasing learning step size and \nabla l_k(x_t) is the gradient of the loss function. At the end, the linear weights are restored as follows:
w_k = \frac{\exp \tilde{w}_k}{\sum_{k=1}^{K} \exp \tilde{w}_k}    (4)
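As a rough illustration of this training procedure, the sketch below implements the sequential weight update of equations (1)-(4) in Python. It is a simplified reading of the description above, not the original implementation: the features are assumed to be already extracted and normalised per frame, the working threshold is fixed at zero during the update, the competing-class term is taken from the current frame only, and the learning-rate schedule and all function names are our own assumptions.

```python
import numpy as np

def mce_train_vad(features, labels, gamma=1.0, eps0=0.1):
    """Sketch of sequential MCE training of the three VAD feature weights (eqs. 1-4).

    features : (T, 3) numpy array of normalised per-frame features
               (log energy, ZCR, combined sub-band spectral score).
    labels   : (T,) numpy array, 1 for speech frames, 0 for non-speech frames.
    Returns the linear weights and a threshold separating the two classes.
    """
    K = features.shape[1]
    w_tilde = np.log(np.full(K, 1.0 / K))   # start from equal weights, on a log scale
    theta = 0.0                             # working threshold, assumed fixed during the update
    for t, (x, is_speech) in enumerate(zip(features, labels), start=1):
        eps_t = eps0 / t                                 # monotonically decreasing step size (assumed schedule)
        w = np.exp(w_tilde) / np.exp(w_tilde).sum()      # eq. (4): back to linear weights
        s = float(np.dot(w, x))                          # weighted feature sum
        # eq. (1): own-class vs. competing-class discriminative functions
        g_k = (s - theta) if is_speech else (theta - s)
        d_k = -2.0 * g_k                                 # = -g_k + g_i, since g_i = -g_k here
        l_k = 1.0 / (1.0 + np.exp(-gamma * d_k))         # eq. (2): sigmoid loss
        # eq. (3): gradient of the loss w.r.t. the log-scale weights (chain rule
        # through the softmax of eq. 4)
        c = -2.0 if is_speech else 2.0                   # d d_k / d s
        grad = gamma * l_k * (1.0 - l_k) * c * w * (x - s)
        w_tilde -= eps_t * grad
    w = np.exp(w_tilde) / np.exp(w_tilde).sum()
    scores = features @ w
    # final threshold: a simple separator between speech and non-speech scores
    threshold = 0.5 * (scores[labels == 1].mean() + scores[labels == 0].mean())
    return w, threshold
```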
Having calculated the weights during MCE training, the threshold which best divides speech and non-speech frames is computed. Table I shows sample results of VAD training for speech signals disturbed with different noise types.

TABLE I
SAMPLE RESULTS OF MCE VAD TRAINING

noise type    weight 1   weight 2   weight 3   threshold
no noise      0.52982    0.19591    0.27427      -2.86
Volvo         0.17671    0.70960    0.11369       0.62
factory       0.19282    0.00078    0.80640      10.54
oper. room    0.60777    0.00016    0.39205       8.29
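To show how trained weights and a threshold (such as those in Table I) would then be used, the sketch below extracts the three features frame by frame and marks each frame as speech or non-speech. The frame length, the way the 10 sub-band energies are collapsed into a single spectral score, and the feature normalisation are illustrative assumptions, since they are not specified above.

```python
import numpy as np

def vad_frame_features(signal, frame_len=400, hop=160, n_bands=10):
    """Per-frame VAD features: log energy of the Hamming-windowed frame, ZCR,
    and one spectral score derived from 10 sub-band energies
    (here their mean log energy -- an illustrative choice)."""
    window = np.hamming(frame_len)
    rows = []
    for start in range(0, len(signal) - frame_len + 1, hop):
        frame_raw = signal[start:start + frame_len]
        frame = frame_raw * window
        log_energy = np.log(np.sum(frame ** 2) + 1e-12)
        zcr = np.count_nonzero(np.diff(np.sign(frame_raw)))
        spectrum = np.abs(np.fft.rfft(frame)) ** 2
        subbands = np.array_split(spectrum, n_bands)
        spectral_score = np.mean([np.log(b.sum() + 1e-12) for b in subbands])
        rows.append([log_energy, zcr, spectral_score])
    return np.asarray(rows)

def vad_decisions(signal, weights, threshold, **frame_kwargs):
    """Mark each frame as speech (True) when the weighted, normalised
    feature sum exceeds the trained threshold."""
    feats = vad_frame_features(signal, **frame_kwargs)
    feats = (feats - feats.mean(axis=0)) / (feats.std(axis=0) + 1e-12)  # assumed normalisation
    return feats @ np.asarray(weights) > threshold
```

Frames classified as non-speech would then simply be discarded before GMM training and scoring.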
V. EXPERIMENTS

To verify the effectiveness of the proposed method of improving speaker recognition, it was decided to use voice samples contained in PoInt, the Polish Intonation Corpus [1]. This database was developed at Adam Mickiewicz University in Poznan and was constructed to serve research on Polish intonation; however, the number of recorded speakers (44) makes it suitable also for experiments on speaker recognition. The speech was recorded in an anechoic chamber at a sampling frequency of 44.1 kHz, but for our experiments it was downsampled to 16 kHz. To estimate how the recognition performs in noisy conditions, noise samples taken from the NOISEX-92 noise corpus [4] were added to the clean speech taken from the PoInt database.

The Matlab® environment was used both for VAD training and for training the GMM models. The Gaussian mixture models were trained using the expectation maximisation (EM) algorithm [6] on 30 s of speech per speaker, with a frame length of 25 ms and an analysis step of 10 ms, which gives about 3000 data points per speaker. 10 s of speech per speaker were used for testing. Several experiments were conducted, such as verification of the VAD module's impact on speaker detection for different feature vectors (both for clean and noisy speech) or for different numbers of Gaussian mixtures used in speaker modelling. Other tests were run to show how the VAD module helps in speaker recognition in different noisy conditions, depending on how the VAD had been trained.

Two criteria of speaker recognition effectiveness were used. Because the log-likelihood is a measure of how well an examined voice fits the hypothesised speaker model, the first measure is the difference between the minimal log-likelihood for the correct speaker and the maximal log-likelihood for all the remaining speakers. If the value is negative, we experience false alarms, i.e. a voice is wrongly recognised as belonging to the hypothesised speaker. The higher this distance is, the lower the probability of a false alarm. In our experiments, the sum of 44 such distances for all voices in the examined corpus will be presented in the results. The second criterion is the ratio of misrecognised speakers, i.e. the ratio between the number of voices that failed to be distinguished from all the remaining ones and the total number of speakers (44).
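A condensed sketch of this recognition and evaluation pipeline is given below. It uses scikit-learn's GaussianMixture in place of the original Matlab implementation, the log-likelihood distance is computed according to our reading of the criterion (one test voice per speaker, average per-frame log-likelihood), and all names are illustrative.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

def train_speaker_models(train_feats, n_mixtures=128):
    """Train one diagonal-covariance GMM per speaker with the EM algorithm.
    train_feats: dict speaker_id -> (n_frames, n_dims) array of MFCC-based features."""
    models = {}
    for spk, feats in train_feats.items():
        models[spk] = GaussianMixture(n_components=n_mixtures,
                                      covariance_type='diag',
                                      max_iter=100).fit(feats)
    return models

def evaluate(models, test_feats):
    """Two criteria: summarised log-likelihood distance and ratio of misrecognised speakers.
    For each hypothesised speaker, the distance is the log-likelihood of that speaker's
    own test voice under their model minus the best score of any other test voice."""
    speakers = list(models.keys())
    # score(X) returns the average per-frame log-likelihood of X under the model
    ll = {m: {v: models[m].score(test_feats[v]) for v in speakers} for m in speakers}
    distances = []
    for spk in speakers:
        best_impostor = max(ll[spk][other] for other in speakers if other != spk)
        distances.append(ll[spk][spk] - best_impostor)   # negative -> false alarms possible
    misrecognised = sum(d < 0 for d in distances)
    return sum(distances), misrecognised / len(speakers)
```

With 44 speakers, the first returned value corresponds to the summarised log-likelihood distance reported below and the second to the ratio of misrecognised speakers.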
VI. RESULTS

The first figures show how introducing the VAD module changes speaker recognition for different feature vectors. The analysed vectors were 12 mel-frequency cepstral coefficients (MFCC) with and without energy, and MFCC coefficients with delta and delta-delta values. Those cases were examined for speech disturbed with white noise (Fig. 1) with segmental SNR = 2.2 dB (for clean speech the recognition was errorless). The impact of VAD usage was also tested for different numbers of Gaussian mixtures (Fig. 2).
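For reference, the three feature configurations compared here could be produced roughly as follows (using librosa purely for illustration; the original experiments used Matlab, and the exact MFCC settings are assumptions).

```python
import numpy as np
import librosa

def mfcc_features(signal, sr=16000, n_mfcc=12, with_energy=False, with_deltas=False):
    """12 MFCCs, optionally with log-energy and delta / delta-delta coefficients
    (25 ms frames, 10 ms step, as in the experiments)."""
    n_fft, hop = int(0.025 * sr), int(0.010 * sr)
    mfcc = librosa.feature.mfcc(y=signal, sr=sr, n_mfcc=n_mfcc,
                                n_fft=n_fft, hop_length=hop)
    feats = [mfcc]
    if with_energy:
        energy = librosa.feature.rms(y=signal, frame_length=n_fft, hop_length=hop)
        feats.append(np.log(energy + 1e-12))
    if with_deltas:
        feats.append(librosa.feature.delta(mfcc, order=1))
        feats.append(librosa.feature.delta(mfcc, order=2))
    return np.vstack(feats).T   # (n_frames, n_dims)
```

In this sketch, `mfcc_features(sig, with_deltas=True)` would correspond to the "MFCC+delta+delta2" configuration.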
Fig. 1. Ratio of misrecognised speakers (above) and summarised log-likelihood distance for various feature vectors (noisy speech); without and with VAD.

The figures clearly show the improvement in recognition after introducing the VAD module: the summarised log-likelihood distance reaches positive values for the last two feature vectors (Fig. 1), and the misrecognition error drops from 55% to 48% for the "MFCC+delta" vector, and even from 91% down to 32% for the "MFCC+delta+delta2" vector. As the final result was best for the latter case, this feature vector was selected for the subsequent simulations. It is worth noting that without VAD, using the "MFCC+delta+delta2" vector makes practically no sense, because it results in many false alarms (see also the large negative log-likelihood distance in Fig. 1). The improvement is clearly visible for various numbers (32-512) of Gaussian mixtures, see Fig. 2.

Fig. 2. Summarised log-likelihood distance against the number of mixtures; speech with white noise, segm. SNR = 2.7 dB; without and with VAD.

The next experiments showed that even though VAD is helpful for various types of noise (see Fig. 3 for the results for white, pink and brown noise, with segmental SNR of 2.7 dB, 1.2 dB and -1.5 dB respectively; the VAD had been trained for the given type of noise), the degree of improvement is not the same. White and pink noise both gained more than 100k in summarised log-likelihood distance, but for pink noise it did not manage to rise above zero. Misrecognition errors decreased from 91% to 32% and from 97% to 61% respectively, so VAD performed significantly worse for pink noise. The improvement for brown noise was smaller, but here the starting point was different: the recognition correctness both with and without VAD was 100% anyway.
Fig. 3. Effectiveness of speaker recognition for noisy speech.
Similar experiments were also run for speech mixed with sample types of additive noise taken from the NOISEX-92 corpus [4]. Fig. 4 shows the results for the "Volvo car" noise, the "factory" noise and the "destroyer operations room" noise, with segmental SNR of -1.3 dB, -1.5 dB and -3.3 dB respectively. Here the results were also positive, but not the same for all three noises. The factory and operations room noises benefited from a significant growth of the log-likelihood distance, which also resulted in a decrease of the misrecognition error: from 84% down to only 27% for the factory noise, and from 89% down to 63% for the operations room noise. The summarised log-likelihood distance for the Volvo noise decreased slightly, but the misrecognition ratio was 0% in both cases anyway.
Fig. 4. Effectiveness of speaker recognition for speech in the presence of environmental noise.
It was also checked how a VAD trained for one noise performs for other types of noise. Fig. 5 presents the results of recognition using a VAD adapted to factory noise. Obviously, it performs very well for the same type of noise, but it also performs satisfactorily for some other types (white noise, operations room noise). However, it worsens the recognition for the Volvo noise, causing almost half of the recognition decisions to be wrong, while recognition was 100% correct without using the VAD adjusted to factory noise.
Fig. 5. Using VAD trained for factory noise in speaker recognition for speech disturbed with different types of noise.
VII. CONCLUSION

The experiments showed a strong positive impact of employing a voice activity detector in speaker recognition. Both the summarised log-likelihood distance measure and the percentage of misrecognised speakers showed that introducing the VAD module improved speaker detection in most of the examined cases, even for noisy speech signals with a low SNR value (only a few dB). The enhancement was not equal everywhere; for example, for brown noise it was lower than in other cases. In one examined case (the Volvo car noise) VAD did not show any improvement, but this noise hardly had an impact on recognition at all. Apparently the presented VAD is not suitable for this type of (low-frequency) noise. The experiments also showed that the VAD should be oriented towards the given noise characteristics and, if possible, trained for the given type of noise. Otherwise, an improperly trained VAD can disturb speaker recognition. Potential future work in this area could involve searching for a universal VAD suitable for a wider range of noise types, including background speech.
REFERENCES

[1] M. Karpinski, J. Klesta, "The project of an intonational database for the Polish language", Proc. Prosody 2000, pp. 113-118, Adam Mickiewicz University, Poznan, Poland, 2001.
[2] Y. Kida, T. Kawahara, "Voice Activity Detection Based on Optimally Weighted Combination of Multiple Features", Proc. Interspeech 2005 - Eurospeech, Lisbon, Portugal, 2005.
[3] Ch. Miyajima, Y. Hattori, K. Tokuda, T. Masuko, T. Kobayashi, T. Kitamura, "Speaker Identification Using Gaussian Mixture Models Based on Multi-Space Probability Distribution", Proc. IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP '01), pp. 433-436, Salt Lake City, USA, 2001.
[4] NOISEX-92 project, web info: http://www.speech.cs.cmu.edu/comp.speech/Section1/Data/noisex.html
[5] D. A. Reynolds, T. F. Quatieri, R. B. Dunn, "Speaker Verification Using Adapted Gaussian Mixture Models", Digital Signal Processing, vol. 10, no. 1-3, 2000.
[6] D. A. Reynolds, R. C. Rose, "Robust text-independent speaker identification using Gaussian mixture speaker models", IEEE Trans. Speech and Audio Process., vol. 3, no. 1, pp. 72-83, Jan. 1995.
[7] X. Wang, "A Modified Minimum Classification Error (MCE) Training Algorithm for Dimensionality Reduction", Journal of VLSI Signal Processing, vol. 32, pp. 19-28, Kluwer Academic Publishers, the Netherlands, 2002.
[8] P. Yu, Z. Cao, "Text-Independent Speaker Identification Under Noisy Conditions Using Speech Enhancement", Networks and Digital Signal Processing (CSNDSP-2002), Staffordshire, UK, 2002.