Huapeng Wang, Cuiling Zhang, Jun Yang, Ming Wu. Experimental Study on a Forensic Semi-Automatic Speaker Verification Using Vowel Cepstral with Limited Questioned Data. 2013 International Scientific Conference "Archibald Reiss Days", (3), pp. 151-158.
UDC: 343.982:534
EXPERIMENTAL STUDY ON A FORENSIC SEMI-AUTOMATIC SPEAKER VERIFICATION USING VOWEL CEPSTRAL WITH LIMITED QUESTIONED DATA

Lecturer Huapeng Wang, MA
Professor Cuiling Zhang, PhD
Department of Forensic Science and Technology, National Police University of China
Professor Jun Yang, BA, PhD
Ming Wu, PhD
Institute of Acoustics, Chinese Academy of Sciences, Beijing

Supported by the NNSF of China under Grants No. 11004217 and No. 11074279.

Abstract: This study tests the discriminant performance of a forensic semi-automatic speaker verification system within the likelihood ratio (LR) framework using limited data from questioned voices. Vowel tokens of /a/ from 42 speakers in a database of Standard Chinese recorded in different sessions were selected for this test. Vowel cepstral coefficients were modeled using both the multivariate kernel-density (MVKD) approach and the Gaussian mixture model (GMM) separately. Vowel cepstral coefficients from cross-validated comparisons of each same-speaker pair and different-speaker pair were used to calculate LRs. The accuracy of the system was measured using the log-likelihood-ratio cost function (Cllr). The results indicate that the GMM outperforms the MVKD model, showing better discriminant performance and accuracy for this task. Therefore, GMM is proposed for modeling vowel cepstral coefficients when the questioned data are limited.

Keywords: forensic speaker identification; likelihood ratio; limited questioned data; vowel cepstral; GMM
INTRODUCTION

The application of the Bayesian approach is established as a theoretical framework for any forensic discipline [1]. The interpretation of the strength of forensic evidence using likelihood ratios has been recommended by many forensic statisticians and forensic scientists [2-7]. According to Morrison (2009), "we are in the midst of a paradigm shift in the forensic comparison sciences and the new paradigm can be characterized as quantitative data-based implementation of the LR framework with quantitative evaluation of the reliability of results" [8]. The new paradigm was initially and successfully used in DNA comparison in the mid-1990s, and is gradually spreading to other branches of forensic science, such as speech, fingerprint, handwriting and glass evidence.

In forensic investigations involving speech evidence, the questioned recording provided to the forensic speech expert is usually short and sometimes contains only a few words. This is unavoidable in forensic speaker recognition casework, and the expert has to make the best possible use of the limited data available.

In forensic speaker recognition, two kinds of systems are often used within the LR framework to evaluate the strength of speech evidence. One is the forensic fully automatic speaker verification system, which deals with the speech signal as a whole. The state-of-the-art technique used in these fully automatic systems is the Gaussian mixture model-universal background model (GMM-UBM) [9-11] proposed by D. A. Reynolds. However, the performance of this approach can be degraded when the speech samples are affected by noise, varying transmission channels and limited speech duration. It is hard to give a reliable verification result using fully automatic speaker verification systems when only a single short recording is provided. The other is the forensic semi-automatic verification system, in which the desired vowels (monophthongs, diphthongs or triphthongs) are labeled manually by the operator and the speech features are usually extracted manually or semi-automatically. The common features used in forensic semi-automatic
verification systems are fundamental frequency (pitch) values, pitch curves, formant center frequencies [12] [13] and coefficient values from discrete cosine transforms fitted to formant trajectories [14]. The semi-automatic verification system is a better choice when the questioned voice is very short and only very limited data can be used, because vowel tokens can be analyzed one by one to extract as much speaker-specific information as possible, although the time and labor costs are considerable.

In this paper a forensic semi-automatic speaker verification system with automatic parameters of manually selected vowel tokens was used to evaluate the strength of limited speech evidence within the LR framework. It should be noted that the aim of this study is to see how much evidence strength the system can provide and what performance can be achieved under the condition of very limited questioned data. This does not mean that limited questioned data are sufficient for evidence evaluation in real forensic practice. In addition, LRs are calculated separately using both GMM and MVKD with vowel cepstral coefficients, and the performance of the system using the two resulting sets of LRs is compared. Finally, the fusion of the results from all the tokens used is also discussed.
INTERPRETATION FRAMEWORK OF THE EVIDENCE

In a forensic investigation, forensic experts have to interpret and evaluate the strength of the evidence even when it is very limited (it is often impossible to obtain additional questioned recordings). Bayes' theorem is the current state-of-the-art basis for the interpretation of forensic evidence, and the likelihood-ratio framework is the main concern of forensic scientists [15]. The LR framework does not force the forensic expert to make "yes" or "no" decisions; such decisions correspond to posterior odds (probabilities) and should be devolved upon the court. Under Bayes' theorem, forensic scientists provide the strength of the evidence (a likelihood ratio), which supports either the prosecution hypothesis (H0) or the defence hypothesis (H1); this is combined with prior background knowledge (the prior odds, the province of the court) to give the posterior odds (also the province of the court) for judicial outcomes or issues [16]. It allows the prior odds to be revised, in the light of the present evidence, by the measure of uncertainty (the LR, the province of the forensic scientist) applied to the pair of competing hypotheses H0 and H1. Bayes' theorem in its odds form is shown in Eq. (1):

p(H0 | E) / p(H1 | E) = [ p(E | H0) / p(E | H1) ] x [ p(H0) / p(H1) ]        (1)

where E denotes the evidence, the first factor on the right-hand side is the likelihood ratio and the second factor is the prior odds.
This hypothetico-deductive reasoning based on the odds form of Bayes' theorem allows the strength of evidence to be evaluated using LRs, which yields a quantitative degree of support for one hypothesis against the other. That is to say, the strength of the present evidence is expressed in the form of the LR of two alternative hypotheses. Its numerator estimates the probability of obtaining the evidence assuming that hypothesis H0 is true; its denominator estimates the probability of obtaining the same evidence assuming that hypothesis H1 is true. The relative strength of the evidence in support of a hypothesis is reflected in the magnitude of the LR: the deviation of the LR from unity (= 1) provides an indication of the strength of the evidence. The more the LR deviates from unity, the greater the support for either the prosecution hypothesis (when LR > 1) or the defence hypothesis (when LR < 1). The closer the LR approaches unity, the less useful the evidence becomes, because it provides an almost equal degree of support for both hypotheses [17]. The more similar the questioned recording and the suspect's recording(s) are, the more likely it is that they come from the same speaker and the higher the ratio will be. However, this must be
balanced by the typicality of the samples or features in the relevant population. The more typical the two samples or features are of the relevant population, the more likely it is that they could have been taken at random from that population, and the lower the ratio will be. The numerator of the LR quantifies the degree of similarity between the questioned recording and the suspect's recording(s), and the denominator quantifies the degree of typicality of the questioned recording in the reference (relevant) background population. Hence, the LR is the result of the interaction between the two factors of similarity and typicality, and the LR framework makes it clear that both factors are necessary to evaluate the strength of evidence.
CALCULATION OF LIKELIHOOD RATIOS

A. Multivariate kernel-density approach

LRs can be calculated using the multivariate kernel density (MVKD) formula developed by Aitken and Lucy [18] and implemented by Morrison [19]. This formula evaluates the difference between the suspect's and the questioned samples with respect to their typicality in a reference distribution estimated using data from samples taken from the appropriate relevant population. Within-speaker variance is estimated using a normal distribution model, and between-speaker variance is estimated using a kernel density model. Whereas most automatic speaker recognition systems first calculate scores of difference between pairs of speech samples and then use these scores as inputs to a discriminative or generative model, the generative Aitken and Lucy formula calculates LRs via direct estimation of the probability densities of the original feature variables.

B. GMM approach

In forensic speaker recognition, when a forensic speaker verification system is used to compare the questioned recording and the suspect's recording, a statistical model of one of them (normally the suspect's recording, because its duration is usually longer than that of the questioned recording) has to be established using features extracted from the corresponding speech. The likelihood values are then estimated by comparing the features of the questioned recording against the suspect's model. The suspect's statistical model can be represented by a GMM, which has been successfully applied to cepstral feature vectors in many fully automatic speaker recognition systems [20]. For a D-dimensional feature vector x, the mixture density used for the likelihood function is defined as:
p(x | λ) = Σ_{i=1}^{M} w_i p_i(x)        (2)

The mixture density is a weighted linear combination of M unimodal Gaussian densities p_i(x), each parameterized by a D x 1 mean vector μ_i and a D x D covariance matrix Σ_i:

p_i(x) = (2π)^{-D/2} |Σ_i|^{-1/2} exp{ -(1/2) (x - μ_i)^T Σ_i^{-1} (x - μ_i) }        (3)

The w_i are the mixture weights and satisfy Σ_{i=1}^{M} w_i = 1. The parameters of the GMM are denoted λ = {w_i, μ_i, Σ_i}, i = 1, ..., M.

The forensic-speaker-recognition evidence does not consist in the speech per se, but in the degree of similarity between speaker-dependent features extracted from the questioned voice and the same features extracted from the suspect's voice, represented by his/her GMM [21]. When the questioned voice is limited, only the suspect's voice can be used to evaluate within-speaker variability. GMMs can be used to model both the within-speaker variability of the suspect and the between-speaker variability, so the likelihoods are explicitly represented by probability density models. Mathematically, H0 is represented by a model denoted λ_H0, which characterizes the hypothesis H0 in the feature space; for a Gaussian distribution, λ_H0 comprises the mean vector and covariance matrix parameters. The alternative hypothesis H1 is similarly represented by a model denoted λ_H1. For a set of questioned feature vectors X, the LR is then expressed by Eq. (4):

LR = p(X | λ_H0) / p(X | λ_H1)        (4)

In this study we propose a new solution for modeling the
automatic parameters (Mel-frequency cepstral coefficients, MFCC) of vowel tokens using a GMM. The middle stable segment (32 ms) of each vowel /a/ token was selected to compute the MFCCs. The MFCCs were then pooled to train a multi-dimensional GMM whose dimensionality equals the order of the MFCCs.
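For illustration, the following minimal Python sketch shows how the vowel MFCCs and the two GMMs could be obtained and compared. The librosa and scikit-learn calls, the file paths and the variable names are illustrative assumptions, not the implementation used in this study.

# Illustrative sketch only: librosa and scikit-learn stand in for the actual tools.
import numpy as np
import librosa
from sklearn.mixture import GaussianMixture

SR = 11025        # sampling rate of the recordings
N_MFCC = 14       # 14th-order MFCCs
PREEMPH = 0.97    # pre-emphasis coefficient

def vowel_mfcc(wav_path, start_s, dur_s=0.032):
    # Extract MFCC frames from the manually labelled stable 32 ms segment of a vowel token.
    y, _ = librosa.load(wav_path, sr=SR)
    seg = y[int(start_s * SR): int((start_s + dur_s) * SR)]
    seg = np.append(seg[0], seg[1:] - PREEMPH * seg[:-1])          # pre-emphasis
    mfcc = librosa.feature.mfcc(y=seg, sr=SR, n_mfcc=N_MFCC,
                                n_fft=256, hop_length=128, window="hamming")
    return mfcc.T                                                   # frames x 14

def train_gmm(frames, n_mix):
    # Pool MFCC frames and fit a diagonal-covariance GMM with the EM algorithm.
    return GaussianMixture(n_components=n_mix, covariance_type="diag",
                           max_iter=200).fit(frames)

def log10_lr(questioned_frames, suspect_gmm, reference_gmm):
    # Mean per-frame log-likelihood under each model, converted to a log10 LR.
    return (suspect_gmm.score(questioned_frames)
            - reference_gmm.score(questioned_frames)) / np.log(10)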
Fig. 1 demonstrates the LR calculation process. The dotted curve represents the within-speaker distribution trained on the suspect's data, which corresponds to hypothesis H0. The solid curve represents the between-speaker distribution trained on the reference data, which corresponds to hypothesis H1. The feature used here is the first formant (F1) of vowel /a/ for male speakers. Suppose the measured mean F1 of the criminal's voice is 800 Hz; it is evaluated under the two distributions mentioned above. For the within-speaker distribution the likelihood value is p1 = 0.0103; for the between-speaker distribution the likelihood value is p2 = 0.0048; the LR for this example is therefore p1/p2 = 2.1321. That is to say, according to this analysis, the evidence (only F1 of vowel /a/) is 2.1321 times more likely under hypothesis H0 than under hypothesis H1. The result supports the claim that the questioned recording and the suspect's recording were spoken by the same person, but its strength is very low: an LR of this magnitude would not provide strong or valuable evidence for the court.
Fig. 1 Demonstration of LR calculation using GMM
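A toy numeric version of the calculation in Fig. 1 is sketched below. All distribution parameters are hypothetical placeholders chosen only to show how the two density evaluations produce an LR; they are not the values behind the figure.

from scipy.stats import norm

f1_measured = 800.0                                    # measured F1 of the questioned /a/ (Hz)
# within-speaker distribution (H0), trained on the suspect's tokens (hypothetical parameters)
p1 = norm(loc=820.0, scale=40.0).pdf(f1_measured)
# between-speaker distribution (H1), a two-component reference mixture (hypothetical parameters)
p2 = 0.5 * norm(loc=700.0, scale=80.0).pdf(f1_measured) \
   + 0.5 * norm(loc=900.0, scale=80.0).pdf(f1_measured)
print("LR =", p1 / p2)                                 # LR > 1 supports H0, LR < 1 supports H1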
C. Accuracy of measurement

The log-likelihood-ratio cost function (Cllr) is used to assess the accuracy of the semi-automatic forensic speaker verification system in this study. It is independent of prior probabilities and costs, and has been adopted in the National Institute of Standards and Technology Speaker Recognition Evaluations (NIST SRE). It is calculated using Eq. (5):

Cllr = (1/2) [ (1/Nss) Σ_{i=1}^{Nss} log2(1 + 1/LRss_i) + (1/Nds) Σ_{j=1}^{Nds} log2(1 + LRds_j) ]        (5)
In Eq. (5), Nss and Nds are the numbers of same-speaker and different-speaker comparisons, and LRss_i and LRds_j are the LR values calculated from the same-speaker and different-speaker comparisons respectively. As a metric of the reliability of a speaker verification system, Cllr has previously been used both in forensic automatic speaker recognition systems and in acoustic-phonetic forensic voice comparison research [22]. More reliable systems produce smaller Cllr values and less reliable systems produce larger Cllr values.
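A direct implementation of Eq. (5) is short; the sketch below, with assumed array names, is one way to compute it.

import numpy as np

def cllr(lr_ss, lr_ds):
    # lr_ss: LRs from same-speaker comparisons; lr_ds: LRs from different-speaker comparisons.
    lr_ss = np.asarray(lr_ss, dtype=float)
    lr_ds = np.asarray(lr_ds, dtype=float)
    penalty_ss = np.mean(np.log2(1.0 + 1.0 / lr_ss))   # penalises small same-speaker LRs
    penalty_ds = np.mean(np.log2(1.0 + lr_ds))         # penalises large different-speaker LRs
    return 0.5 * (penalty_ss + penalty_ds)

# A well-performing system gives large same-speaker LRs and small different-speaker LRs,
# so Cllr is close to zero; an uninformative system (all LRs = 1) gives Cllr = 1.
print(cllr([100.0, 50.0], [0.01, 0.02]))               # approximately 0.02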
EXPERIMENTS

The database contains recordings of Standard Chinese from 42 male speakers aged between 19 and 23. Each speaker was recorded in three different sessions: the first session was about one week earlier than the second, and about one month earlier than the third. Speech was elicited by a research assistant, who phoned the speakers and asked them a series of questions such as "What's your name?" and "What's your mobile telephone number?". Recordings were made via the university internal telephone system using a KCM HCD9999/TSDL telephone, which has a built-in analogue cassette tape recording facility. Recordings were digitized and saved as 16-bit PCM sound files at a sampling frequency of 11.025 kHz.

The number "8" /pa/ was chosen for analysis because it is considered a lucky number by many Chinese speakers, so more tokens of it can be found in the telephone numbers, and speakers tend to stress it when giving these numbers. In addition, according to Gea de Jong and colleagues at the University of Cambridge [23], the pronunciations of /iː, ɑː, ɔː/ have remained quite stable compared to other monophthongs.

A cross-validated procedure was adopted in the experiment as follows:
1) Four tokens of /a/ were selected randomly from all tokens of each speaker in the database (each speaker has 22-106 tokens). Each time only one token was used as questioned data, so there are four LR results per same-speaker comparison.
2) Every questioned token was compared with the same speaker's remaining tokens (the four selected tokens being excluded) across all three non-contemporaneous recordings.
3) Every questioned token was compared with all tokens of each of the other speakers across all three non-contemporaneous recordings to obtain different-speaker LRs.
4) All speakers' data except those of the speakers being compared were included in the reference population database during LR calculation.

The 42 speakers produced 42 same-speaker pairs and 42(42-1)/2 = 861 different-speaker pairs. LRs were calculated using Aitken and Lucy's MVKD formula and the GMM approach respectively. 14th-order Mel-frequency cepstral coefficients (MFCC) were used as multi-dimensional features; 256-sample Hamming windows were applied to the stable part of vowel /a/, with a pre-emphasis coefficient of 0.97. In the training of the GMMs, one mixture component was used for the suspect's features to represent within-speaker variation and two mixture components were used for the reference population model to represent between-speaker variation (we assume that only very limited tokens are available as questioned data); with more components the GMM does not converge within 200 iterations. The Expectation-Maximization (EM) algorithm and diagonal covariance matrices were used in GMM training.
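The comparison scheme in steps 1)-4) can be sketched roughly as follows. The data structures and the scoring function compute_log10_lr are assumed placeholders standing in for the MVKD or GMM scoring described above; this is one possible reading of the procedure, not the authors' code.

from itertools import combinations
import random

def enumerate_comparisons(tokens_by_speaker, compute_log10_lr, n_questioned=4, seed=0):
    # tokens_by_speaker: dict speaker_id -> list of per-token feature arrays (all sessions pooled)
    rng = random.Random(seed)
    speakers = sorted(tokens_by_speaker)
    questioned = {s: rng.sample(range(len(tokens_by_speaker[s])), n_questioned)
                  for s in speakers}
    lr_ss, lr_ds = [], []
    for s in speakers:                                   # 42 same-speaker comparisons (4 tokens each)
        rest = [t for i, t in enumerate(tokens_by_speaker[s]) if i not in questioned[s]]
        for qi in questioned[s]:
            lr_ss.append(compute_log10_lr(tokens_by_speaker[s][qi], rest,
                                          excluded=(s,), pool=tokens_by_speaker))
    for a, b in combinations(speakers, 2):               # 42*41/2 = 861 different-speaker pairs
        for qi in questioned[a]:
            lr_ds.append(compute_log10_lr(tokens_by_speaker[a][qi], tokens_by_speaker[b],
                                          excluded=(a, b), pool=tokens_by_speaker))
    return lr_ss, lr_ds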
RESULTS AND DISCUSSION

In the discussion below, LRs are frequently expressed on a base-ten logarithmic scale. This is convenient because on a logarithmic scale large positive numbers provide greater support for the same-speaker hypothesis and large negative numbers provide greater support for the different-speaker hypothesis. For example, a log10 LR of +1 indicates that the evidence is 10 times more likely to be observed under the same-speaker hypothesis than under the different-speaker hypothesis, and a log10 LR of -1 indicates that the evidence is 10 times more likely to be observed under the different-speaker hypothesis than under the same-speaker hypothesis.

Cross-validated results and Cllr values for the different comparisons are given in Table 1. In Table 1, "T1", "T2", "T3" and "T4" represent the first, second, third and fourth questioned token of each speaker respectively, which were used to calculate the cross-validated LRs. "Mean" represents the mean of the LRs obtained when each of the tokens T1-T4 was used. "Fusion" represents the linear fusion of the LR values resulting from each of the four tokens. The fusion results were calculated using the FoCal
toolkit [24], and logistic regression calibration was also applied in the fusion procedure. "ER_SS" and "ER_DS" represent the error ratios of the same-speaker comparisons (false rejection probability) and the different-speaker comparisons (false acceptance probability) respectively. The Cllr values in the table show that the third token (T3) of each speaker gives the largest Cllr and the fourth (T4) the smallest, whichever of MVKD or GMM was used. The Cllr value of "Mean" is smaller than either of these, but it is not as good as the fusion result.

Table 1 Error Ratio and Cllr
Fig. 2 and Fig. 3 show the Tippett plots of the mean and the fused cross-validated LRs using MVKD and GMM respectively. The curves rising to the left represent the proportion of different-speaker comparisons with log10 LRs equal to or greater than the value indicated on the x-axis. The curves rising to the right represent the proportion of same-speaker comparisons with log10 LRs equal to or less than the value indicated on the x-axis. The vertical line is the threshold, which is zero on the base-ten logarithmic scale. The solid curves represent the fused LRs and the dashed curves represent the mean LRs.
Fig. 2 Tippett plot of the mean (dashed lines) and fused (solid lines) cross-validated LRs using the MVKD approach.
Fig. 3 Tippett plot of the mean (dashed lines) and fused (solid lines) cross-validated LRs using GMM approach.
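Tippett plots of this kind can be reproduced with a few lines of plotting code. The sketch below assumes matplotlib and arrays of log10 LRs, and follows the plotting conventions described above; it is an illustration, not the plotting code used for Figs. 2-3.

import numpy as np
import matplotlib.pyplot as plt

def tippett_plot(log10_lr_ss, log10_lr_ds):
    xs_ss = np.sort(np.asarray(log10_lr_ss))
    xs_ds = np.sort(np.asarray(log10_lr_ds))
    p_ss = np.arange(1, len(xs_ss) + 1) / len(xs_ss)    # proportion of same-speaker LRs <= x
    p_ds = 1.0 - np.arange(len(xs_ds)) / len(xs_ds)     # proportion of different-speaker LRs >= x
    plt.plot(xs_ss, p_ss, label="same-speaker")
    plt.plot(xs_ds, p_ds, label="different-speaker")
    plt.axvline(0.0, color="grey")                      # log10 LR = 0 threshold
    plt.xlabel("log10 likelihood ratio")
    plt.ylabel("cumulative proportion")
    plt.legend()
    plt.show()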
From Fig. 2 and Table 1 we can see that the mean error ratio for different-speaker comparisons is high, i.e. the false-positive ratio is high, which is undesirable in forensic evidence evaluation. If the four tokens of every speaker were pooled together in the MVKD approach, the results (not listed here) would be worse than the mean of the LRs; therefore, pooling the four tokens into MVKD together is not a good way to calculate LR values. After fusion, the different-speaker error ratio decreased from 0.2613 to 0.1603 and the Cllr value decreased from 0.4867 to 0.4556. Judging by the Cllr values, the fused results using GMM are more reliable than those using MVKD, because data points that deviate from the mean of the features are refined during GMM modeling, which is not the case for the MVKD approach.
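The linear logistic-regression fusion used for the "Fusion" results can be approximated as in the sketch below. scikit-learn is an assumed stand-in for the FoCal toolkit, and the prior-odds offset converts the posterior log-odds returned by the classifier back into calibrated log-LRs.

import numpy as np
from sklearn.linear_model import LogisticRegression

def fuse_log10_lrs(per_token_log10_lrs, labels):
    # per_token_log10_lrs: (n_comparisons, 4) array of log10 LRs from tokens T1-T4
    # labels: 1 for same-speaker comparisons, 0 for different-speaker comparisons
    X = np.asarray(per_token_log10_lrs, dtype=float)
    y = np.asarray(labels)
    clf = LogisticRegression(C=1e6).fit(X, y)               # near-unregularised linear fusion
    prior_logodds = np.log(y.mean() / (1.0 - y.mean()))     # training-set prior to subtract
    fused_ln_lr = clf.decision_function(X) - prior_logodds  # posterior log-odds -> log LR
    return fused_ln_lr / np.log(10)                         # convert to base-10 logarithm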
Compared with the results of [25], in which the error ratio of same-speaker comparisons was 0.0476 and the error ratio of different-speaker comparisons was 0.0667, calculated using the first three formants of /a/, the error ratios of this study are much higher. This is because only one token was used here to assess the evidence strength of one vowel; the purpose was to test whether the forensic speaker verification methods can still work under the extreme condition of very limited questioned data. If more tokens or more vowels were measured and their results fused, much better discriminant performance could be achieved. Moreover, recordings from all three sessions were used in the analysis in this study, whereas only recordings from one session were used in [25]; comparisons using non-contemporaneous recordings can be expected to give poorer results than comparisons using contemporaneous recordings.
CONCLUSION

This study aimed to test the discriminant performance of forensic speaker verification systems based on statistical models when very limited questioned data are provided. The results show that both the MVKD and GMM methods with vowel cepstral coefficients give reasonable discriminant performance under this particular condition of very limited questioned data, but the GMM approach outperforms the MVKD approach in this case.
REFERENCES
1. D. Meuwly and A. Drygajlo, "Forensic Speaker Recognition Based on a Bayesian Framework and Gaussian Mixture Modeling (GMM)", in Proc. Odyssey 2001 ISCA Speaker Recognition Workshop, Crete, Greece, 2001: 145-150.
2. C. G. G. Aitken, F. Taroni, Statistics and the Evaluation of Evidence for Forensic Scientists, 2nd ed., Wiley, Chichester, UK, 2004.
3. Association of Forensic Science Providers, Standards for the formulation of evaluative forensic science expert opinion, Sci. Justice 49, 2009: 161-164.
4. D. J. Balding, Weight-of-evidence for Forensic DNA Profiles, Wiley, Chichester, UK, 2005.
5. J. Buckleton, A framework for interpreting evidence, in J. Buckleton, C. M. Triggs, S. J. Walsh (Eds.), Forensic DNA Evidence Interpretation, CRC, Boca Raton, FL, 2005: 27-63.
6. I. W. Evett, Towards a uniform framework for reporting opinions in forensic science casework, Sci. Justice 38, 1998: 198-202.
7. D. Lucy, Introduction to Statistics for Forensic Scientists, Wiley, Chichester, UK, 2005.
8. Geoffrey Stewart Morrison, Forensic voice comparison and the paradigm shift, Science and Justice 49, 2009: 298-308.
9. A. Drygajlo, Forensic automatic speaker recognition, IEEE Signal Processing Magazine, 24(2), 2007: 132-135.
10. D. A. Reynolds, T. F. Quatieri, and R. Dunn, "Speaker verification using adapted Gaussian mixture models", Digital Signal Processing, vol. 10, no. 1-3, 2000: 19-41.
11. Gonzalez-Rodriguez J., Drygajlo A., Ramos-Castro D., Garcia-Gomar M., Ortega-Garcia J., Robust estimation, interpretation and assessment of likelihood ratios in forensic speaker recognition, Computer Speech and Language, 20(2-3), 2006: 331-335.
12. Geoffrey Stewart Morrison, Cuiling Zhang, Philip Rose, An empirical estimate of the precision of likelihood ratios from a forensic-voice-comparison system, Forensic Science International, vol. 208, issues 1-3, 20 May 2011: 59-65.
13. Becker, Timo, Jessen, Michael, Grigoras, Catalin, Forensic speaker verification using formant features and Gaussian mixture models, INTERSPEECH 2008: 1505-1508.
14. Morrison, G. S., A comparison of procedures for the calculation of forensic likelihood ratios from acoustic-phonetic data: Multivariate kernel density (MVKD) versus Gaussian mixture model-universal background model (GMM-UBM), Speech Communication, 53(2), 2011: 242-256.
15. Daniel Ramos-Castro, Extended abstract for Best Ph.D. Thesis Award: Forensic evaluation of the evidence using automatic speaker recognition systems, VI Jornadas en Tecnología del Habla and II Iberian SLTech Workshop, FALA 2010.
16. A. Drygajlo, D. Meuwly, and A. Alexander, Statistical methods and Bayesian interpretation of evidence in forensic automatic speaker recognition, in Proc. Eurospeech 2003, Geneva, Switzerland, 2003: 689-692.
17. P. Rose, Technical forensic speaker recognition: Evaluation, types and testing of evidence, Computer Speech and Language, vol. 20, 2006: 159-191.
18. Aitken, C. G. G. and Lucy, D., Evaluation of trace evidence in the form of multivariate data, Applied Statistics, vol. 54, 2004: 109-122.
19. Morrison, G. S., Matlab Implementation of Aitken & Lucy's (2004) Forensic Likelihood-Ratio Software Using Multivariate-Kernel-Density Estimation [software], 2007. Available: http://geoff-morrison.net.
20. Douglas A. Reynolds, Richard C. Rose, Robust text-independent speaker identification using Gaussian mixture models, IEEE Trans. Speech and Audio Process., vol. 3, no. 1, Jan. 1995: 72-83.
21. P. Rose, Forensic Speaker Identification, Taylor & Francis, 2002.
22. N. Brümmer, J. du Preez, Application-independent evaluation of speaker detection, Computer Speech and Language, vol. 20, 2006: 230-275.
23. De Jong, Gea; McDougall, Kirsty; Hudson, Toby; Nolan, Francis, The speaker discriminating power of sounds undergoing historical change: A formant-based study, Proceedings of ICPhS, Saarbrücken, 2007: 1813-1816.
24. N. Brümmer, FoCal Toolkit, July 2005. http://niko.brummer.googlepages.com/focal/
25. Wang Huapeng, Yang Jun, The comparison of "Idiot's Bayes" and multivariate kernel-density in forensic speaker identification using Chinese vowel /a/, 2010 3rd International Congress on Image and Signal Processing, vol. 8, 2010: 3533-3537.