Automatically Trained TTS for Effective Attacks to Anti-spoofing System
Galina Lavrentyeva¹, Alexandr Kozlov¹, Sergey Novoselov¹, Konstantin Simonchik¹,², and Vadim Shchemelinin¹,²
¹ Speech Technology Center Limited, St. Petersburg, Russia, www.speechpro.com
² ITMO University, St. Petersburg, Russia, www.ifmo.ru
{lavrentyeva,kozlov-a,novoselov,simonchik,shchemelinin}@speechpro.com
Abstract. This article continues a priority line of research on the problem of spoofing voice biometric systems. We keep exploring speech synthesis spoofing attacks based on creating a text-to-speech voice. In this work we focus on a completely automatic way of creating new voices for a text-to-speech system and investigate the vulnerability of a state-of-the-art spoofing detection system to such attacks. The results of our experiments demonstrate that 10 seconds of speech material are enough to increase the EER up to 19.67%. Since an automatic method of training synthesized voices allows perpetrators to increase the number of spoofing attacks on biometric systems, we raise the issue of the relevance of this new type of spoofing attack and of the development of effective methods to detect it.
Keywords: spoofing, anti-spoofing, speaker recognition, TTS
1 Introduction
Due to the growing interest in reliable authentication methods for restricting access to informational resources, biometric authentication has improved significantly in recent years. For example, speaker recognition systems are widely used in Internet banking, for customer identification during calls to a call center, and for passive identification of possible criminals against a preset "blacklist" [1], [2]. The latest overviews of speaker recognition systems show that the EER is down to 1.5-2% for text-independent [3] and down to 1% for text-dependent [4] speaker recognition systems in various conditions. Although text-dependent systems offer better performance for authentication purposes, text-independent systems, which do not need any specific vocal password, are also used in specific scenarios, for example in IVR (Interactive Voice Response) systems in call centers for telephone banking. Text-independent automatic speaker verification (ASV) systems, like text-dependent ones, are widely acknowledged to be vulnerable to spoofing attacks. A
multitude of different effective spoofing methods have been proposed in the literature in recent years. For example, [5] describes methods based on "Replay attack", "Cut and paste", "Handkerchief tampering" and "Nasalization tampering". The survey [1] additionally considers spoofing attacks based on "Speech synthesis" and "Voice conversion" as the most potent ones. In response to the potential threat of spoofing attacks, much research has focused on finding the most effective anti-spoofing countermeasures [1] and on investigating the vulnerability of ASV systems. Several papers evaluating ASV system robustness against spoofing attacks [5], [6], [7] showed the high importance of new robust anti-spoofing countermeasures. One of the most successful spoofing methods is Text-To-Speech (TTS) synthesis based on the target speaker's voice. [6] examines a spoofing method that uses a hybrid TTS system combining Unit Selection and Hidden Markov Models (HMM). The authors showed that the likelihood of false acceptance can reach 98% when high-quality speech synthesis and a speech database recorded in studio quality are used. [8] explored the robustness of a text-dependent verification system against spoofing based on the described synthesis method, using automatically labeled "free" speech recorded in the telephone channel, and obtained a False Acceptance error of 10.8%. In this case the perpetrator does not need expert knowledge to prepare a synthesized voice, so this method is more likely to be used by criminals. In this paper we focus on the anti-spoofing system (ASS) that was introduced at the Automatic Speaker Verification Spoofing and Countermeasures (ASVspoof) Challenge 2015 [9] as the primary system [10].
The aim of this research was to assess the robustness of this ASS against an attack scenario based on Text-to-Speech (TTS) synthesis with automatically labeled speech, and to determine the resulting Equal Error Rate (EER) of the ASS, that is, the operating point at which the rate of accepting spoofed speech equals the rate of rejecting human speech. For this purpose we considered speech databases recorded in the telephone channel.
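Since the EER is the main figure of merit throughout this paper, a minimal sketch of how it can be estimated from detector scores may be helpful. It is an illustration only, not the scoring tool used in the experiments. The function name `equal_error_rate` and the convention that a higher score means "more human-like" are assumptions made for this example.

```python
def equal_error_rate(human_scores, spoof_scores):
    """Estimate the EER of a spoofing detector from two score lists.

    Assumption for this sketch: higher score = more likely human speech.
    Returns (eer, threshold), where eer is a fraction in [0, 1].
    """
    best = None
    # Every observed score is a candidate decision threshold.
    for t in sorted(set(human_scores) | set(spoof_scores)):
        # False rejection: human speech scored below the threshold.
        frr = sum(s < t for s in human_scores) / len(human_scores)
        # False acceptance: spoofed speech scored at or above the threshold.
        far = sum(s >= t for s in spoof_scores) / len(spoof_scores)
        gap = abs(far - frr)
        # Keep the first threshold at which the two error rates are closest.
        if best is None or gap < best[0]:
            best = (gap, (far + frr) / 2.0, t)
    return best[1], best[2]


# Toy scores, invented for illustration only.
eer, threshold = equal_error_rate([0.9, 0.8, 0.7, 0.4], [0.6, 0.3, 0.2, 0.1])
print(f"EER = {100 * eer:.1f}% at threshold {threshold}")
```

On a finite score set the two error rates rarely coincide exactly, so the sketch reports their average at the closest threshold. Interpolating between neighbouring thresholds is an equally common choice.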
2 Anti-spoofing System
The spoofing detection method was first introduced at the ASVspoof Challenge 2015, where it achieved 0.008% EER for known spoofing attacks and 3.922% EER for unknown types of spoofing attacks [10]. It should be mentioned that zero spoofing detection error was achieved for the HMM-based spoofing attacks of the ASVspoof Challenge evaluation base. That was the motivation for including the anti-spoofing system in the ASV system to detect impostors. This anti-spoofing system consists of four main components:
– Pre-detector
– Acoustic feature extractor
– TV i-vector extractor
– SVM classifier
The pre-detector checks whether the input signal has zero temporal energy and, if so, declares the signal a spoofing attack. Otherwise acoustic features are extracted from the signal. As front-end acoustic features we used 12 Mel-Frequency Cepstral Coefficients (MFCC), 12 Mel-Frequency Principal Coefficients (MFPC) and 12 Cos-Phase Principal Coefficients (CosPhasePC) based on the phase spectrum, with first and second derivatives. To obtain these coefficients, Hamming windowing was used with a window length of 256 and 50% overlap.
For acoustic space modelling we used the standard TV-JFA approach, which is the state of the art in speaker verification systems [3], [4], [11]. In this version of joint factor analysis, the i-vector of the Total Variability (TV) space is extracted by a modification of JFA, which is a usual Gaussian factor analyser defined on the mean supervectors of the Universal Background Model (UBM) and the total variability matrix T. The UBM is represented by a Gaussian mixture model (GMM) of the described features. The diagonal-covariance UBM was trained by the standard EM algorithm. For the anti-spoofing method the UBM was a 1024-component GMM, and the dimension of the TV space was 400. The i-vectors derived from the different extractors were concatenated into one common i-vector, which was then passed to the SVM classifier at the classification step.
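The pre-detection and framing steps can be sketched as follows. This is a simplified illustration under stated assumptions, not the implementation from [10]: the window length of 256 is taken to be in samples, "zero temporal energy" is interpreted as at least one analysis frame whose energy is exactly zero, and the names `hamming`, `frames` and `has_zero_energy_frame` are hypothetical.

```python
import math

FRAME_LEN = 256        # window length from the text (assumed to be in samples)
HOP = FRAME_LEN // 2   # 50% overlap between consecutive frames


def hamming(n):
    """Symmetric Hamming window of length n."""
    return [0.54 - 0.46 * math.cos(2.0 * math.pi * i / (n - 1)) for i in range(n)]


def frames(signal, frame_len=FRAME_LEN, hop=HOP):
    """Split a signal into overlapping frames and apply the Hamming window."""
    window = hamming(frame_len)
    out = []
    for start in range(0, len(signal) - frame_len + 1, hop):
        chunk = signal[start:start + frame_len]
        out.append([s * w for s, w in zip(chunk, window)])
    return out


def has_zero_energy_frame(signal, frame_len=FRAME_LEN, hop=HOP):
    """Pre-detection sketch: True if any frame of the raw signal has zero energy.

    Assumption: exact digital silence is treated as a sign of a spoofed signal.
    """
    for start in range(0, len(signal) - frame_len + 1, hop):
        energy = sum(s * s for s in signal[start:start + frame_len])
        if energy == 0.0:
            return True
    return False


# Toy input: one frame of digital silence followed by a sine wave.
tone = [math.sin(0.1 * i) for i in range(512)]
print(has_zero_energy_frame([0.0] * FRAME_LEN + tone))  # silence is flagged
print(len(frames(tone)))                                # number of analysis frames
```

Only signals that pass the pre-detector would go on to feature extraction, where the windowed frames feed the MFCC, MFPC and CosPhasePC computations.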
3 Spoofing attack modelling
To model spoofing attacks on our ASV system we chose a method of TTS voice synthesis based on previously prepared and segmented free speech of the ASV system user. The scheme of such a spoofing attack is shown in Figure 1. This spoofing method used a hybrid TTS system developed by Speech Technology Center Ltd (STC), which is based on HMM and Unit Selection [12]. The authors of [6] demonstrated that with a synthesized voice built from 8 minutes of free speech, recorded in a studio environment and manually labeled, the spoofer can achieve a 19.1% likelihood of false acceptance. In our work we used the STC base of free speech recorded in the telephone channel in order to evaluate the effectiveness of this synthesis for spoofing purposes. The segmentation of this base was obtained by the STC automatic speech recognition (ASR) module [14]. For labeling we used an automatic labeling module, which included automatic F0 period labeling and phone labeling. This was done by the HMM-based ASR module. Phone labeling was based on forced alignment of the transcription and the signal. Details can be found in [14]. The process of automatically building a TTS voice is described in [15].
Fig. 1. The scheme of spoofing using the TTS method
4 Experiments
4.1 TTS training database
As mentioned before, we used the STC base of monologues recorded in the telephone channel as the prepared base for TTS voice construction. This base contains Russian speech records of 60 speakers with 3-6 records for each. In order to achieve good synthesis quality we took only those speakers whose speech time was more than 10 seconds. Thus we got 16 voices for the telephone channel. Using these voices we produced train and test bases. Each of them consists of 20 records per voice of phrases that are possible during interaction with an IVR system. Examples of these phrases include: "Listen to the information on the capabilities of the new system. All categories. Sort by popularity.", "Find the nearest flight to Brazil. Departure from Moscow. Airfare. Hand luggage. One ticket.", "Get the buying rate of the pound sterling on 3 May 2014.", etc. The originally recorded phrases were included in the test database. For each speaker, attempts to access the ASV system with the spoofing detection module were made using the generated records.
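The speaker selection step can be illustrated with a short sketch. It assumes that "speech time" means the total duration over all records of a speaker, which the text does not state explicitly. The function name `select_speakers` and the toy durations are hypothetical.

```python
def select_speakers(recordings, min_seconds=10.0):
    """Return the speakers whose total speech time exceeds min_seconds.

    recordings: iterable of (speaker_id, duration_in_seconds) pairs.
    Assumption: durations are summed over all records of a speaker.
    """
    totals = {}
    for speaker, duration in recordings:
        totals[speaker] = totals.get(speaker, 0.0) + duration
    return sorted(s for s, total in totals.items() if total > min_seconds)


# Toy durations, invented for illustration only.
records = [("a", 4.0), ("a", 7.5), ("b", 3.0), ("b", 2.0), ("c", 10.0)]
voices = select_speakers(records)
print(voices)            # speakers kept for TTS voice construction
print(len(voices) * 20)  # synthesized records per base at 20 phrases per voice
```

With the 16 voices and 20 phrases per voice reported above, the same arithmetic gives 320 synthesized records in each of the train and test bases.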
4.2 Evaluation results
Our experiments focused on evaluating the vulnerability of the described spoofing detection system to the proposed spoofing method, based on TTS voice construction, and on comparing it with the vulnerability of this system to the effective Hidden Markov model (HMM) and TTS based speech synthesis spoofing attacks from the ASVspoof Challenge 2015. In Table 1 the evaluation EER results for the new spoofing type are presented,
in comparison with the EER for the spoofing types proposed by the ASVspoof Challenge 2015. Here:
– S3 is the Hidden Markov model based speech synthesis system using speaker adaptation techniques [16] and only 20 adaptation utterances.
– S4 is the Hidden Markov model based speech synthesis system using speaker adaptation techniques [16] and only 40 adaptation utterances.
– S10 is a speech synthesis algorithm implemented with the open-source MARY Text-To-Speech system (MaryTTS) [17].
It can be seen that the results obtained for the proposed automatically trained TTS
Table 1. EER of spoofing detection module for different spoofing algorithms.

Spoofing algorithm    EER (%)
TTS-based             19.667
S3                     0.000
S4                     0.000
S10                   19.571
based spoofing method are comparable with the result for MaryTTS spoofing attacks, which turned out to be the most powerful spoofing attack in the ASVspoof Challenge. According to the obtained results, it can be suggested that a hybrid speech synthesis system poses a greater threat. Figure 2 shows the DET curve of the spoofing detection module for spoofing
Fig. 2. DET curve of spoofing detection module against TTS based spoofing attack
attacks based on automatically trained TTS.
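For readers who want to reproduce a DET plot from their own detector scores, the points of such a curve can be computed as sketched below. This is an illustration, not the plotting code behind Figure 2. It assumes that a higher score means "more human-like", and the function name `det_points` is hypothetical. A DET curve plots the miss rate against the false alarm rate on normal deviate (probit) axes, which is why the inverse normal CDF is applied to both rates.

```python
from statistics import NormalDist


def det_points(human_scores, spoof_scores):
    """Return DET curve points as (probit(false_alarm), probit(miss)) pairs.

    Assumption for this sketch: higher score = more likely human speech.
    """
    probit = NormalDist().inv_cdf
    points = []
    for t in sorted(set(human_scores) | set(spoof_scores)):
        # Miss: human speech rejected at this threshold.
        miss = sum(s < t for s in human_scores) / len(human_scores)
        # False alarm: spoofed speech accepted at this threshold.
        fa = sum(s >= t for s in spoof_scores) / len(spoof_scores)
        # The probit transform is undefined at 0 and 1, so skip those points.
        if 0.0 < miss < 1.0 and 0.0 < fa < 1.0:
            points.append((probit(fa), probit(miss)))
    return points


# Toy scores, invented for illustration only.
for x, y in det_points([0.9, 0.8, 0.5, 0.3], [0.7, 0.4, 0.2, 0.1]):
    print(f"{x:+.3f} {y:+.3f}")
```

The EER corresponds to the point where the curve crosses the diagonal, that is, where both coordinates are equal.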
5 Conclusion
In this paper we investigated the automatic TTS training method. We compared the hybrid TTS with MaryTTS and obtained comparable results, confirming that the proposed automatic method is suitable for effective spoofing attacks on the spoofing detection system and that even 10 minutes of speech material is enough to increase the EER of the anti-spoofing system up to 19.67%. These results show that it is highly important to improve spoofing detection methods against speech synthesis spoofing attacks. In future work we intend to use these results to improve our anti-spoofing system and make it robust to spoofing attacks based on automatically trained TTS.
References
1. Z. Wu, N. Evans, T. Kinnunen, J. Yamagishi, F. Alegre, and H. Li: Spoofing and countermeasures for speaker verification: A survey. In: Speech Communication, vol. 66, no. 0, pp. 130–153 (2015)
2. Matveev Yu. N.: Biometric technologies of person identification by voice and other modalities. In: Vestnik MGTU. Priborostroenie, Special Issue 3(3) "Biometric Technologies", pp. 46–61 (2012)
3. Kozlov, A., Kudashev, O., Matveev, Yu., Pekhovsky, T., Simonchik, K., Shulipa, A.: SVID speaker recognition system for the NIST SRE 2012. Lecture Notes in Computer Science (LNCS), vol. 8113, pp. 278–285 (2013)
4. S. Novoselov, T. Pekhovsky, K. Simonchik: STC Speaker Recognition System for the NIST i-Vector Challenge. In: Proc. Odyssey 2014 - The Speaker and Language Recognition Workshop (2014)
5. Villalba E., Lleida E.: Speaker verification performance degradation against spoofing and tampering attacks. In: Proc. of the FALA 2010 Workshop, pp. 131–134 (2010)
6. Shchemelinin V., Topchina M., Simonchik K.: Vulnerability of Voice Verification Systems to Spoofing Attacks with TTS Voices Based on Automatically Labeled Telephone Speech. In: Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics), Vol. 8773, No. LNAI, pp. 475–481 (2014)
7. Sebastien Marcel, Mark S. Nixon, Stan Z. Li: Handbook of Biometric Anti-spoofing: Trusted Biometrics Under Spoofing Attacks. Springer (2014)
8. Shchemelinin V., Topchina M., Simonchik K.: Vulnerability of Voice Verification Systems to Spoofing Attacks by TTS Voices Based on Automatically Labeled Telephone Speech. In: SPECOM (2014)
9. Zhizheng Wu, Tomi Kinnunen, Nicholas Evans, Junichi Yamagishi, Cemal Hanilçi, Md Sahidullah, Aleksandr Sizov: ASVspoof 2015: the First Automatic Speaker Verification Spoofing and Countermeasures Challenge 2015. Available online: http://www.spoofingchallenge.org/is2015_asvspoof.pdf
10. S. Novoselov, A. Kozlov, G. Lavrentyeva, K. Simonchik, V. Shchemelinin: STC Anti-spoofing Systems for the ASVspoof 2015 Challenge. In: Proc. Interspeech (2015)
11. N. Dehak et al.: Support Vector Machines versus Fast Scoring in the Low-Dimensional Total Variability Space for Speaker Verification. In: Proc. Interspeech, pp. 1559–1562 (2009)
12. Chistikov P., Korolkov E.: Data-driven Speech Parameter Generation For Russian Text-to-Speech System. Computational Linguistics and Intellectual Technologies. Proceedings of the Annual International Conference "Dialogue". Issue 11 (18), Volume 1 of 2, pp. 103–111 (2012)
13. Simonchik K., Shchemelinin V.: "STC SPOOFING" Database for Text-Dependent Speaker Recognition Evaluation. Proceedings of SLTU-2014 Workshop, St. Petersburg, Russia, 14-16 May 2014, pp. 221–224 (2014)
14. Natalia A. Tomashenko, Yuri Y. Khokhlov: Fast Algorithm for Automatic Alignment of Speech and Imperfect Text Data. In: Speech and Computer, Lecture Notes in Computer Science, Volume 8113, pp. 146–153 (2013)
15. Solomennik A., Chistikov P., Rybin S., Talanov A., Tomashenko N.: Automation of New Voice Creation Procedure For a Russian TTS System. Vestnik MGTU. Priborostroenie, Special Issue 2 "Biometric Technologies", pp. 29–32 (2013)
16. J. Yamagishi, T. Kobayashi, Y. Nakano, K. Ogata, and J. Isogai: Analysis of speaker adaptation algorithms for HMM-based speech synthesis and a constrained SMAPLR adaptation algorithm. IEEE Trans. Audio, Speech and Language Processing, vol. 17, no. 1, pp. 66–83 (2009)
17. Z. Wu, S. Gao, E. S. Chng, and H. Li: A study on replay attack and anti-spoofing for text-dependent speaker verification. In: Proc. Asia-Pacific Signal Information Processing Association Annual Summit and Conference (APSIPA ASC) (2014)