Development of Excitation Source Information Based

Development of Excitation Source Information Based Countermeasures for Replay Attacks

A Project Phase II submitted in partial fulfilment of the requirements for the award of the degree of MASTER OF TECHNOLOGY in Electronics and Communication Engineering (Communication Engineering)

By JAGABANDHU MISHRA 2016342002

Under the Supervision of Dr. DEBADATTA PATI

DEPARTMENT OF ELECTRONICS & COMMUNICATION ENGINEERING
National Institute of Technology Nagaland
Chumukedima, Dimapur-797103, Nagaland (INDIA)
MAY, 2018

Acknowledgement

I take the honour and privilege to express my deep sense of gratitude to my esteemed guide Dr Debadatta Pati, Assistant Professor of ECE Department, National Institute of Technology Nagaland, for his scholarly instructive guidance, support and continued encouragement throughout my work. Without our inspiring discussions on certain issues at times, I would never have been able to complete this work. I greatly admire his attitude towards research, creative thinking, hard work and dedication. I thank the Head of the Department, Dr P. Chinnamuthu, and other faculty members for their help and support in carrying out my work. My thanks go to Mr. Madhusudan Singh sir and Krishna Dutta sir for providing an excellent facility and maintaining a good ambience in the Speech Processing and Pattern Recognition (SPPRC) lab to carry out my work. Their friendly and helpful nature has made my work easy. I would also like to thank the Department of Electronics and Information Technology (DeitY), Govt. of India, for providing financial support, as my work is part of the ongoing DeitY sponsored project titled "Development of Excitation Source Features Based Spoof Resistant and Robust Audio-Visual Person Identification System". Finally, I must express my very profound gratitude to my parents for providing me with unfailing support and continuous encouragement throughout my years of study and through the process of researching. This accomplishment would not have been possible without them.

Sincerely
Jagabandhu Mishra

Abstract

A replay attack is an approach for gaining unauthorized access to an automatic speaker verification (SV) system by using the target's pre-recorded speech. This work investigates the usefulness of the linear prediction (LP) residual signal and of modelling the glottal flow derivative (GFD) signal to counter replay attacks. In detecting replay signals, the major clues lie in tracing the characteristics of the record and playback devices, which are reflected dominantly in low-frequency regions due to the loudspeaker and in high-frequency regions due to two-stage A/D conversion. In that direction, most attempts try to derive the discriminative features directly from the speech signal. In signal processing terms, the speech signal can be represented as the response of a slowly time-varying vocal-tract system excited by a fast time-varying impulse-like excitation source. Due to the presence of vocal-tract resonances, the playback device characteristics may distort the spectral patterns of the impulse-like excitation signal more than those of the speech signal. Because mel-scale filters are tightly spaced in low-frequency regions and inverse mel-scale filters are tightly spaced in high-frequency regions, residual mel-frequency cepstral coefficients (RMFCC) and residual inverse mel-frequency cepstral coefficients (RIMFCC) are derived from the LP residual signal, and glottal flow derivative mel-frequency cepstral coefficients (GFDMFCC) are derived from the GFD signal. The effectiveness of these features is demonstrated on the ASVspoof2017 database.

In terms of equal error rate (EER), the best RMFCC performance is 14.57% and the best RIMFCC performance is 15.35%. The fusion of RMFCC and RIMFCC features improves the performance to 10.14%, which is better than the state-of-the-art spectral centroid magnitude coefficients (SCMC) feature performance of 11.49% and the constant Q cepstral coefficients (CQCC) performance of 15.12%. The fusion of RMFCC and RIMFCC with CQCC and SCMC further improves the performance to 8.72%. Further, the GFD signal based GFDMFCC feature provides an EER of 20.53%. The fusion of GFDMFCC with CQCC and SCMC provides an EER of 8.18%. These outcomes demonstrate the usefulness of excitation source information for developing countermeasures to replay attacks.

Contents

List of Figures
List of Tables

1 Introduction
  1.1 Speaker recognition
  1.2 Spoofing
    1.2.1 Types of Spoofing and its effectiveness
    1.2.2 Impersonation
    1.2.3 Speech synthesis
    1.2.4 Voice conversion
    1.2.5 Replay attack
    1.2.6 Need of countermeasures for replay attacks
  1.3 Thesis contribution
  1.4 Organization of the Thesis

2 Countermeasures to replay attacks: A Review

3 Countermeasures to replay attacks: Excitation source information
  3.1 Analysis of replay signal
  3.2 Motivation for using excitation source information
  3.3 LP residual signal
    3.3.1 Parameterization of LP residual signal
    3.3.2 Selection of proper LP order
  3.4 GFD signal
    3.4.1 Estimation and parameterization of GFD signal

4 Experimental studies
  4.1 Database
  4.2 ASVSpoof2017 database
    4.2.1 Experiment setup on ASVSpoof2017 database
    4.2.2 Results and discussion on ASVspoof2017 database
  4.3 IITG-MV replayed database
    4.3.1 Experiment setup on IITG-MV database
      GMM and GMM-UBM
      Results and discussion on IITG-MV database
      Performance evaluation based at equal confidence level

5 Conclusion

List of Publications

List of Figures

1.1 Automatic speaker verification system with genuine and replay attempts.
1.2 Speech signals and their spectrograms of actual and replay data.
3.1 Typical frequency response of high quality smart phone loudspeaker [1].
3.2 Effect of replay devices across different frequency bands.
3.3 Replay speech generation model from actual speech.
3.4 Speech and LP residual signals of actual and corresponding replay utterances. (a-b) Actual and replay speech signals in time domain, (c-d) corresponding LP spectrum. (e-f) Actual and replay LP residual signals in time domain, (g-h) corresponding LP residual spectrum.
3.5 Spectrograms of band filtered LP residual signals at (a) 0-500 Hz (low) and (b) 3-4 kHz (high) frequency regions. (c) and (d) represent the corresponding replay counterparts.
3.6 Filter bank structure of mel and inverse mel filters.
3.7 RIMFCC feature extraction block diagram.
3.8 RMFCC feature extraction block diagram.
3.9 Typical glottal flow and its derivative waveforms [2].
3.10 Estimated GFD signal of actual and corresponding replay signal.
4.1 DET curves from replay detection experiments with RMFCC, RIMFCC, SCMC features and their different fusions.
4.2 DET curves from replay detection experiments with GFDMFCC, RMFCC and CQCC features. DET curves of fused features are also shown.
4.3 DET plot showing the performance of mfcc and rmfcc features based SV systems with zero-effort attempts and under replay attacks. `P' stands for replay attacks.
4.4 FARs without and with replay attacks for males, females and wholeset.
4.5 Comparison of baseline mfcc and rmfcc based systems with common CM values 0.5. (a) mfcc and (b) rmfcc based systems. PTimp represents number of replay trials rejected as impostors. CMpimp represents average CM value of rejected replay trials.
4.6 Comparison of baseline mfcc and rmfcc based systems with common CM values 0.5. (a) mfcc and (b) rmfcc based systems. PTimp represents number of replay trials rejected as impostors. CMpimp represents average CM value of rejected replay trials.
4.7 Comparison of baseline mfcc and rmfcc based systems with common CM values 0.5. (a) mfcc and (b) rmfcc based systems. PTimp represents number of replay trials rejected as impostors. CMpimp represents average CM value of rejected replay trials.
4.8 Confidence levels vs scores distribution of all truly accepted common actual and corresponding replay trials from mfcc and rmfcc based systems.

List of Tables

1.1 A brief summary of spoofing techniques, their expected accessibility and risk.
2.1 A brief summary on different studies made on replay attacks and its countermeasures.
3.1 Spectral distortion (SD) between actual and corresponding playback speech signals of male and female speakers.
3.2 Performance of RMFCC and RIMFCC features for different LP orders.
3.3 The correlation between the LP based synthesized speech signals derived from LP residual and GFD signal by using five different methods. The parameters are computed using speech of five different speakers.
3.4 Spectral distortion (SD) between actual and replay GFD signals estimated by DYPSA and DPI methods. The parameters are computed using speech of five different speakers.
4.1 Details of ASVspoof2017 dataset [3].
4.2 Parameters used for deriving features to perform replay detection task.
4.3 Performance of the proposed LP residual features and compared with state-of-the-art speech signal based SCMC feature.
4.4 Performance of the GFDMFCC feature for replay attack detection task.
4.5 Performance comparison of the GFDMFCC, RMFCC, RIMFCC features with state-of-the-art SCMC, CQCC features for replay attack detection task.
4.6 Summary of the dataset used in this work for genuine, impostor and replay trials. The speech samples are taken from IITG-MV database [4].
4.7 Speaker verification performance of the short-term mel-cepstral features derived from speech (mfcc) and LP residual (rmfcc) signals with GMM-UBM based SV system. In case of zero-effort replay attacks the performance is expressed in terms of ZFAR. Under replay attacks the performance is expressed in terms of PFAR.
4.8 Speaker verification performance of equally confident mfcc and rmfcc based systems for zero-effort attempts and with replay attacks. TZ = Truly accepted by zero-effort trials, TZec = TZ at equal CM value, CMec = Average confidence measure (CM) value of TZec, PTimp = Detected as impostors in replay trials and CMpimp = Average CM value of PTimp.

Chapter 1 Introduction

1.1 Speaker recognition

Speech, as one of the most information-rich biosignals, is the primary means of human communication. Besides the message relayed through spoken words, the speech signal conveys information about the speaker's identity, enabling recognition of the person both by human listeners and by automatic speaker recognition (SR) techniques. SR is the task of recognizing persons by using the information available in their speech samples. An SR task performed by a machine is called automatic speaker recognition. Based on the task mode, SR is classified as speaker identification (SI) or speaker verification (SV). In SI, the task is to determine the identity of an unknown speaker by comparison with a set of enrolled models. Accordingly, SI system performance is evaluated in terms of identification accuracy. If the unknown speaker's model is available to the machine, the task is called closed-set SI, otherwise open-set SI. In SV, the task is to validate the claimed identity of the speaker against the respective model, and thus involves a 1:1 comparison. The decision of the SV system is binary: accept or reject. These decisions may be either true or false. Accordingly, the performance of an SV system is measured in terms of two parameters: false acceptance rate (FAR) and false rejection rate (FRR) [5]. Both parameters are equally important and ideally should be as low as possible.

A compromise measure is the operating point at which FAR = FRR, called the equal error rate (EER). An SV system with a lower EER has better verification accuracy.
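The relation between FAR, FRR and EER can be illustrated with a minimal sketch. It assumes that higher scores favour acceptance and approximates the EER by sweeping the decision threshold over the observed scores; the score lists and function names are hypothetical and are not taken from the experiments in this thesis.

```python
def far_frr(genuine, impostor, threshold):
    """Return (FAR, FRR) when every claim scoring >= threshold is accepted."""
    far = sum(s >= threshold for s in impostor) / len(impostor)
    frr = sum(s < threshold for s in genuine) / len(genuine)
    return far, frr


def equal_error_rate(genuine, impostor):
    """Approximate the EER: pick the threshold where |FAR - FRR| is smallest."""
    best = None
    for t in sorted(set(genuine) | set(impostor)):
        far, frr = far_frr(genuine, impostor, t)
        gap = abs(far - frr)
        if best is None or gap < best[0]:
            best = (gap, (far + frr) / 2.0, t)
    return best[1], best[2]


# Hypothetical verification scores for genuine and impostor trials.
genuine_scores = [2.1, 1.7, 0.9, 1.4, 0.2]
impostor_scores = [-1.0, 0.3, -0.4, 1.0, -2.2]
eer, eer_threshold = equal_error_rate(genuine_scores, impostor_scores)
print(f"EER = {eer:.2f} at threshold {eer_threshold}")
```

With so few trials the FAR and FRR curves are coarse step functions, so the sweep only approximates the crossing point; on large trial lists the same procedure gives a much finer estimate.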

1.2 Spoofing

Significant developments in technology have made SV an acceptable biometric authentication system for public use in applications such as smartphone log-in, e-commerce, mobile banking and physical access control [6]. In contrast to other biometric technologies, such as fingerprint and face recognition, SV has a considerably wider range of potential applications since it requires no additional hardware investment: a speech signal can be acquired through the native communication channel of a particular application via a landline phone, conventional cellular phone, radio phone, satellite phone, smartphone, tablet, or a desktop PC with a headset. The SV (as an authentication) system is mostly used either in unattended scenarios without human supervision, or from remote locations through various communication channels.

In unattended scenarios, an impostor with mala fide intention (i.e. seeking unauthorized access) can spoof any target speaker by using artificial speech samples of the latter. In speaker verification terminology, an attempt to manipulate or bias the system decision through spoofing is called a spoof attack. It can be carried out by impersonation, by using sophisticated speech synthesis and voice conversion algorithms, or by replaying pre-recorded speech samples of the target. Further, remote accessibility makes the SV system more prone to spoof attacks. The vulnerability of state-of-the-art SV systems to spoof attacks is widely acknowledged [7]. These facts encourage researchers to give much wider attention to developing countermeasures.

1.2.1 Types of Spoofing and its effectiveness

Spoofing refers to the presentation of a falsified or manipulated speech sample of a target speaker at the microphone input of a biometric system. In general, spoofing can be done in four ways: impersonation, voice conversion, speech synthesis and replay attacks. In impersonation, the impostor produces artificial speech samples of a target speaker either with the help of professional impersonators or of people whose natural voice is similar to that of the target. Publicly available speech synthesis and voice conversion algorithms are able to generate a very natural-sounding voice of any target [7]. It is also possible for an impostor to acquire or steal the target speaker's speech samples, store them in recorded form, and later play them back before the machine for unauthorized authentication. In speaker verification terminology this is called a replay attack. All types of spoofing attacks are described in detail in the subsections below.

1.2.2 Impersonation

Impersonation refers to spoofing attacks in which a speaker attempts to imitate the speech of the target speaker and thereby bias the system decision. Impersonators can readily adapt their voice to manipulate the decision of an SV system, but succeed only when their natural voice is already similar to that of the target speaker. Linguistic expertise was also not found to be useful when the voice of the target speaker was very different from that of the impersonator. Since impersonation is thought to involve mostly the mimicking of prosodic and stylistic cues, it is considered more effective in fooling human listeners than today's state-of-the-art SV systems.

The threat of impersonation is not fully understood because the available studies are limited and involve small datasets, so it is perhaps not surprising that there is no prior work investigating countermeasures against impersonation [8]. If the threat is proven to be genuine, the design of appropriate countermeasures might be challenging. The spoofing attacks discussed below can all be assumed to leave traces of the physical properties of the recording and playback devices, or signal processing artefacts from synthesis or conversion systems, whereas impersonators are live human beings who produce entirely natural speech.

1.2.3 Speech synthesis

Speech synthesis, commonly referred to as text-to-speech (TTS), is a technique for generating intelligible, natural-sounding artificial speech for any arbitrary text. Speech synthesis is used widely in applications including in-car navigation systems, e-book readers and voice-over functions for the visually impaired. Typical speech synthesis systems have two main components: text analysis and speech waveform generation, which are sometimes referred to as the front-end and back-end, respectively. In the text analysis component, input text is converted into a linguistic specification consisting of elements such as phonemes. In the speech waveform generation component, speech waveforms are generated from the produced linguistic specification.

A considerable volume of research in the literature has demonstrated the vulnerability of SV to synthetic voices generated with a variety of approaches to speech synthesis. Experiments using formant, diphone and unit selection-based synthetic speech, in addition to the simple cut-and-paste of speech waveforms, have been reported in [8].

Only a small number of attempts to discriminate synthetic speech from natural speech have been investigated, and currently there is no general solution that is independent of specific speech synthesis methods. Some reported methods discriminate actual from synthesized speech using F0 statistics or the phase characteristics of vocoders. Because human ears are insensitive to phase variation and vocoders are in general minimum phase in nature, phase provides a clue for discrimination. Because reliable prosody modelling is difficult in both unit selection and statistical parametric speech synthesis, other approaches to synthetic speech detection use F0 statistics [8].

1.2.4 Voice conversion

Voice conversion is a sub-domain of voice transformation, which aims to convert one speaker's voice towards that of another. The field has attracted increasing interest in the context of fooling SV systems for over a decade [9]. Unlike TTS, which requires text input, voice conversion operates directly on speech samples. In particular, the goal is to transform, according to a conversion function T with parameters θ, the feature vectors (x) corresponding to speech from a source speaker (spoofer) so that they are closer to those of a target speaker (y):

y = T(x, θ)    (1.1)
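As an illustration of Eq. (1.1), the sketch below takes T to be a per-dimension affine map whose parameters θ = (scale, shift) are chosen so that the converted source features match the mean and spread of the target features. This particular form of T is an assumption made only for illustration; the conversion functions used in the literature cited here are more elaborate, and the function names and feature vectors are hypothetical.

```python
from statistics import mean, pstdev


def fit_theta(source_vectors, target_vectors):
    """Estimate per-dimension (scale, shift) so converted source statistics
    match the target statistics. Assumed toy form of theta in y = T(x, theta)."""
    theta = []
    for src_dim, tgt_dim in zip(zip(*source_vectors), zip(*target_vectors)):
        scale = pstdev(tgt_dim) / pstdev(src_dim)
        shift = mean(tgt_dim) - scale * mean(src_dim)
        theta.append((scale, shift))
    return theta


def convert(x, theta):
    """Apply y = T(x, theta) with T an affine map in each feature dimension."""
    return [scale * xi + shift for xi, (scale, shift) in zip(x, theta)]


# Hypothetical 2-dimensional feature vectors from a source and a target speaker.
source = [[1.0, 10.0], [3.0, 30.0]]
target = [[2.0, 5.0], [6.0, 7.0]]
theta = fit_theta(source, target)
converted = [convert(x, theta) for x in source]
print(converted)
```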

A number of methods have been reported in the literature to discriminate converted voice from actual voice [8]. Converted speech exhibits some variation in the phase of the signal, and cosine phase and modified group delay function parameters have been shown to be effective for discrimination [8, 10, 11].

1.2.5 Replay attack

Replay attacks involve the presentation of previously recorded speech from a genuine client in the form of continuous speech recordings, or samples resulting from the concatenation of shorter segments. Replay is an example of a low-effort spoofing attack: it requires simply the replaying of a previously captured speech signal. The availability of cheap and high quality recording and playback devices (i.e. smart phones) has made replay attacks more easily accessible and highly effective [12]. The risk of replay attacks is even higher if recordings of a speaker are publicly available.

Fig. 1.1 shows the process by which an impostor can bias the SV system using replay spoofing. As the figure makes clear, the acoustic difference between actual and replay signals arises for the following reasons:

• Acoustic effects introduced by the recording device.
• Acoustic conditions in the environment where the voice was acquired.
• Acoustic effects of the replay device.
• Acoustic conditions in the environment where the attack takes place.

Figure 1.1: Automatic speaker verification system with genuine and replay attempts.

1.2.6 Need of countermeasures for replay attacks

Table 1.1: A brief summary of spoofing techniques, their expected accessibility and risk.

Spoofing technique   Description                                          Accessibility   Risk
Impersonation        Target's voice mimic                                 Low             Low
Speech synthesis     Speaker-specific speech generation from text input   Medium          Medium-High
Voice conversion     Speaker identity conversion using speech only        Medium          Medium-High
Replay attacks       Replay of pre-recorded utterance                     High            High

Table 1.1 shows the accessibility and risk of different spoofing techniques as reported in [8]. The attacks are ordered in terms of accessibility. Replay attacks require the least effort (high accessibility): no signal processing knowledge is needed, only a high quality record and playback device, which is available at a nominal price. Voice conversion and speech synthesis require specialised algorithms, in addition to appropriate hardware and parameters describing the target's voice. They belong to a class of higher-effort (medium accessibility) spoofing attacks. Impersonation is the least accessible: professional impersonators are few, and mimicking the exact formant patterns of a target's voice is very difficult.

From the risk point of view, replay signals are an exact replication of the actual signal (assuming high quality record and replay devices and negligible acoustic variation at the time of record and replay), so the chance of biasing the system decision is higher. Initially, there was a general hypothesis that low-effort replay attacks are less effective, and so little attention was paid by the research community to countermeasures [7]. The availability of cheap and high quality recording and playback devices (i.e. smart phones) has made replay attacks more easily accessible and highly effective [12].

1.3 Thesis contribution

We started our work by analyzing actual and replay signals in both the time domain and the spectrographic domain, and by analyzing the effect of replay on the features used by state-of-the-art SV systems. Fig. 1.2 shows the time domain and spectrographic representations of actual and corresponding high quality replay speech signals. In the time domain there is a change in signal patterns, mainly in the form of distortion. It results from noise and reverberation added by the devices involved, such as the microphone, pre-amplifier, A/D converter and loudspeaker [13]. However, the respective spectrograms are very similar and hard to distinguish. In other words, the speaker-specific information in short-term spectral features remains almost intact in replay signals. This is mainly due to the use of high quality devices that are able to retain the average spectral patterns of the original speech in replay signals. The short-term spectral patterns and formant-related information are hardly distinguishable, and thereby state-of-the-art SV systems that commonly use short-term spectral features become quite vulnerable to replay attacks.

Figure 1.2: Speech signals and their spectrograms of actual and replay data.

A study on spoofing threat assessment found that playback attacks provoke higher levels of FAR, prompting the research community to focus on developing countermeasures [14]. As a result, several methods have been developed and reported in the literature [7]. These methods mostly process the short-term spectral patterns to develop countermeasures. Despite their success in reducing the FAR, reliability still remains a concern. This may be due to the use of similar spectral information both for the speaker verification task and for the countermeasures. It indicates the need for alternative evidence to counter replay attacks.

Considering the observations we made, the short-term spectral energies are almost the same in actual and replay signals. Features derived directly from the speech signal may not be able to capture the minute acoustic variations caused by the replay process. In linear prediction analysis, the LP coefficients generally model the vocal-tract resonances [15]. As we have seen, the resonance information remains almost intact, so the linear prediction (LP) error signal may be useful for capturing the minute variations in the replay signal. The LP error signal is popularly known as the LP residual signal. A further observation was made by analyzing the actual and replay signals band-wise (i.e. 0-500, 500-1500, 1500-3000 and 3000-4000 Hz). The signal components in the 0-500 Hz band (the record and replay devices are not able to reproduce the low-frequency components) and in the 3000-4000 Hz band (the effect of multiple A/D conversions) are found to be largely affected.
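The LP residual mentioned above can be sketched as follows: LP coefficients are estimated from a frame, and the frame is then inverse-filtered with them so that what remains is the prediction error. The sketch assumes the autocorrelation method with the Levinson-Durbin recursion, which is one common way of doing LP analysis and is not necessarily the exact procedure used in this thesis; the toy frame and function names are hypothetical, and windowing and pre-emphasis are omitted.

```python
def autocorrelation(frame, max_lag):
    """Autocorrelation r[0..max_lag] of one speech frame."""
    n = len(frame)
    return [sum(frame[i] * frame[i + lag] for i in range(n - lag))
            for lag in range(max_lag + 1)]


def levinson_durbin(r, order):
    """Return predictor coefficients a[1..order] for s_hat[n] = sum_k a[k] * s[n-k]."""
    a = [0.0] * (order + 1)
    err = r[0]  # prediction error energy; assumes a non-silent frame (r[0] > 0)
    for i in range(1, order + 1):
        k = (r[i] - sum(a[j] * r[i - j] for j in range(1, i))) / err
        updated = a[:]
        updated[i] = k
        for j in range(1, i):
            updated[j] = a[j] - k * a[i - j]
        a = updated
        err *= 1.0 - k * k
    return a[1:]


def lp_residual(frame, order):
    """Inverse-filter the frame with its own LP coefficients to get the residual."""
    coeffs = levinson_durbin(autocorrelation(frame, order), order)
    residual = []
    for n, sample in enumerate(frame):
        predicted = sum(coeffs[k] * frame[n - 1 - k]
                        for k in range(order) if n - 1 - k >= 0)
        residual.append(sample - predicted)
    return coeffs, residual


# Toy frame: a decaying exponential, which a first-order predictor models well.
frame = [0.9 ** n for n in range(50)]
coeffs, residual = lp_residual(frame, order=1)
print(round(coeffs[0], 3))
```

For the toy frame almost all of the signal is predictable, so the residual energy is concentrated in the first sample; for real speech the residual instead retains the impulse-like excitation around the glottal closure instants.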

In this work the potential of the LP residual signal is explored for developing countermeasures to replay attacks. The spectral patterns of the LP residual signal are almost flat, yet the residual spectra of actual and replay signals differ at both the high and the extreme low frequencies. Therefore mel and inverse mel warped cepstral features (RMFCC and RIMFCC) are derived from the LP residual signal and used independently to classify actual and replay signals. A comparison study is made with the two features independently to decide the LP order that best separates actual and replay signals. It is observed that the optimal LP order is 20 for the RMFCC feature and 10 for the RIMFCC feature. These results show that residual signals capture the low-frequency distortions effectively with higher-order LP analysis and the high-frequency distortions with lower-order LP analysis. Finally, linear fusion is performed at the score level to obtain the overall performance.
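The difference between mel and inverse mel warping can be sketched by comparing filter centre frequencies. The sketch assumes the common mel formula 2595 log10(1 + f/700) and obtains the inverse mel centres by flipping the mel centres about the band, so that the dense region moves from low to high frequencies; this flipping construction, the filter count and the 4 kHz upper limit are illustrative assumptions rather than the exact filter bank design used in this work.

```python
import math


def hz_to_mel(f_hz):
    return 2595.0 * math.log10(1.0 + f_hz / 700.0)


def mel_to_hz(mel):
    return 700.0 * (10.0 ** (mel / 2595.0) - 1.0)


def mel_centres(n_filters, f_max):
    """Centre frequencies equally spaced on the mel scale: dense at low frequencies."""
    m_max = hz_to_mel(f_max)
    return [mel_to_hz(m_max * i / (n_filters + 1)) for i in range(1, n_filters + 1)]


def inverse_mel_centres(n_filters, f_max):
    """Assumed inverse mel layout: mel centres mirrored, dense at high frequencies."""
    return sorted(f_max - c for c in mel_centres(n_filters, f_max))


mel = mel_centres(10, 4000.0)
inv = inverse_mel_centres(10, 4000.0)
mel_gaps = [b - a for a, b in zip(mel, mel[1:])]
inv_gaps = [b - a for a, b in zip(inv, inv[1:])]
print([round(c) for c in mel])
print([round(c) for c in inv])
```

The spacing between mel centres widens with frequency, while the spacing between inverse mel centres narrows, which is why a mel-warped feature resolves low-frequency detail and an inverse-mel-warped feature resolves high-frequency detail.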

Further, the speech production mechanism can be modelled as a slowly varying vocal-tract system excited by a fast varying excitation source [15, 16]. The vocal-tract system behaves as a resonator, and the quasi-periodic sequence of air pulses from the vibration of the vocal folds acts as the source of excitation. The speaker-specific vocal-tract variations are reflected in the form of resonances, and the excitation characteristics in the form of harmonics [16]. The LP spectrum popularly models the vocal-tract characteristics, and the LP residual represents the excitation signal [16]. The effectiveness of the LP residual derived features motivated us to derive the GFD signal (a representation of excitation information) from actual and replay signals and to observe the variations between them. Initially, the mel warped cepstral features (GFDMFCC) are derived from the GFD signal and used to classify actual and replay signals.

The ASVspoof2017 database is used for analyzing the effectiveness of the RMFCC, RIMFCC and GFDMFCC features using a simple Gaussian mixture model (GMM) classifier. The performance of the derived features is also compared with state-of-the-art spoof detection features (i.e. SCMC and CQCC). We also created a replayed version of the Indian Institute of Technology Guwahati Multi-Variability (IITG-MV) database and used it to investigate the effectiveness of the RMFCC feature for detecting replay signals in a speaker verification experiment, where the classic Gaussian mixture model with universal background model (GMM-UBM) classifier is used to build speaker-specific models.
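One common way to use GMMs for replay detection is to model genuine and replay features with separate GMMs and to score an utterance by the average log-likelihood ratio of its frames. The sketch below assumes that two-model scoring scheme with diagonal covariances; the model parameters, feature vectors and function names are hypothetical, and training the GMMs (for example by expectation-maximization) is outside the sketch.

```python
import math


def log_gauss_diag(x, mean, var):
    """Log density of a diagonal-covariance Gaussian at vector x."""
    return sum(-0.5 * (math.log(2.0 * math.pi * v) + (xi - m) ** 2 / v)
               for xi, m, v in zip(x, mean, var))


def gmm_loglik(x, gmm):
    """Log-likelihood of x under a GMM given as a list of (weight, mean, var)."""
    logs = [math.log(w) + log_gauss_diag(x, m, v) for w, m, v in gmm]
    top = max(logs)  # log-sum-exp for numerical stability
    return top + math.log(sum(math.exp(value - top) for value in logs))


def llr_score(frames, genuine_gmm, replay_gmm):
    """Average per-frame log-likelihood ratio; positive favours genuine speech."""
    total = sum(gmm_loglik(f, genuine_gmm) - gmm_loglik(f, replay_gmm) for f in frames)
    return total / len(frames)


# Hypothetical pre-trained models over 2-dimensional features.
genuine_gmm = [(0.5, [0.0, 0.0], [1.0, 1.0]), (0.5, [1.0, 1.0], [1.0, 1.0])]
replay_gmm = [(1.0, [4.0, 4.0], [1.0, 1.0])]

score_genuine = llr_score([[0.2, 0.1], [0.9, 1.1]], genuine_gmm, replay_gmm)
score_replay = llr_score([[3.8, 4.1], [4.2, 3.9]], genuine_gmm, replay_gmm)
print(score_genuine > 0.0, score_replay < 0.0)
```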

1.4 Organization of the Thesis

Chapter 2 reviews the existing attempts to develop countermeasures against replay attacks and compares these methods. Finally, the issues to be addressed as part of this thesis work are identified and the organization of the work is outlined.

Chapter 3 reports the usefulness of excitation source information for detecting replay attacks. It also describes in detail the various parameterization methods for LP residual signals and the estimation and parameterization of the GFD signal.

Chapter 4 describes in detail the various replay attack detection systems, together with the experimental results and discussion.

Chapter 5 summarizes the work presented in this thesis by listing its major contributions. Some directions for further research on replay attack detection using the information present in GFD signals are also mentioned.

Chapter 2 Countermeasures to replay attacks: A Review

This chapter presents a brief review of the existing studies on developing replay attack detection systems. The review aims to understand the seriousness of replay attacks on SV systems and the existing countermeasure approaches, and then to identify the most promising directions for further exploration. As mentioned earlier, the existing studies are very few and were carried out under diverse experimental conditions.

The first study on playback attack detection (PAD) uses an audio fingerprint (spectral peakmap) approach [17]. In that study, the spectral peakmap of the incoming recording is first extracted and then compared with those of all previously presented samples. If any previous sample is sufficiently similar to the current sample, the current attempt is recognized as a replay attempt. The key idea of this spoofing detection technique is to decide, based on a similarity score, whether the presented sample matches any previously stored speech sample. Experimental results on the RSR2015 database showed that, as a result of the replay attack, the equal error rate (EER) and false acceptance rate (FAR) both increased from 2.92% to 25.56% and 78.36%, respectively. This confirmed the vulnerability of speaker verification to replay attacks. On the other hand, the spoofing countermeasure was able to reduce the FARs from 78.36% and 73.14% to 0.06% and 0.0% for male and female systems, respectively, in the face of replay spoofing. The experiments confirmed the effectiveness of the proposed anti-spoofing technique. The drawback of this approach is that the database of stored samples grows dynamically as the number of attempts increases, and maintaining such a growing database is challenging.
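The peakmap comparison described above can be sketched in a simplified form. The sketch assumes that a peakmap is the set of the strongest frequency bins in each frame and that similarity is the fraction of shared peaks; it ignores time alignment between recordings, and the threshold, toy spectrograms and function names are hypothetical rather than those of the method in [17].

```python
def peak_map(spectrogram, peaks_per_frame=2):
    """Set of (frame index, bin index) pairs for the strongest bins of each frame."""
    peaks = set()
    for t, frame in enumerate(spectrogram):
        strongest = sorted(range(len(frame)), key=lambda k: frame[k], reverse=True)
        peaks.update((t, k) for k in strongest[:peaks_per_frame])
    return peaks


def similarity(map_a, map_b):
    """Fraction of peaks shared by two peakmaps, in [0, 1]."""
    if not map_a or not map_b:
        return 0.0
    return len(map_a & map_b) / min(len(map_a), len(map_b))


def is_replay(current_map, stored_maps, threshold=0.8):
    """Flag the attempt if it matches any previously stored attempt too closely."""
    return any(similarity(current_map, m) >= threshold for m in stored_maps)


# Toy magnitude spectrograms (2 frames x 4 bins).
earlier_attempt = [[0.1, 0.9, 0.2, 0.8], [0.7, 0.1, 0.6, 0.2]]
noisy_copy = [[0.12, 0.85, 0.25, 0.79], [0.69, 0.12, 0.61, 0.18]]
fresh_attempt = [[0.9, 0.1, 0.8, 0.2], [0.1, 0.7, 0.2, 0.6]]

stored = [peak_map(earlier_attempt)]
replay_flag = is_replay(peak_map(noisy_copy), stored)
fresh_flag = is_replay(peak_map(fresh_attempt), stored)
print(replay_flag, fresh_flag)
```

The list of stored peakmaps grows with every accepted attempt, which makes the maintenance burden noted above concrete: each new trial must be compared against every stored entry.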

The study in [18] uses low frequency information (i.e., pop noise) to distinguish a replay signal from an actual signal. The authors focus on voice liveness detection (VLD), which aims to validate whether the presented speech signal originated from a live human or not. They use the phenomenon of pop noise, a distortion that occurs when human breath reaches a microphone, as liveness evidence. They proposed pop noise detection algorithms and showed through an experimental study that these can be used to discriminate live voice signals from artificial ones. They noted that a loudspeaker is not able to reproduce the breathing noise of a speaker (present at low frequencies, below the pitch), and they use this information to design a replay detector as a countermeasure against replay attacks. On in-house data, the VLD technique was able to classify live and artificial speech with an EER of 3.95%.

As previously mentioned, the development of countermeasures to protect SV systems from replay attacks is a pressing issue. The speech research community therefore recently organized the ASVspoof 2017 challenge on replay spoofing detection. The organizers built a common database [3] and offered it to the community for the task of discriminating replayed speech from actual speech. They described the database, the protocols and the initial findings. The evaluation entailed highly heterogeneous acoustic recording and replay conditions, which increased the equal error rate (EER) of a baseline SV system from 1.76% to 31.46%. Submissions were received from 49 research teams, 20 of which improved upon a baseline replay spoofing detector EER of 24.77% in terms of replay/non-replay discrimination. In the challenge, various new and innovative techniques were discussed to counter replay attacks. The details of the reported works are tabulated in Table 2.1. An average detection EER of 6.73% has been shown as the best single system performance on the evaluation set.


In [19], different features previously used in speech synthesis and voice conversion spoof detection experiments are used to detect replay samples. The features used are Constant-Q Cepstral Coefficients (CQCCs), Mel Frequency Cepstral Coefficients (MFCCs), Linear Frequency Cepstral Coefficients (LFCCs), Inverted Mel Frequency Cepstral Coefficients (IMFCCs), Rectangular Filter Cepstral Coefficients (RFCCs), Linear Prediction Cepstral Coefficients (LPCCs), Subband Spectral Flux Coefficients (SSFCs), Subband Spectral Centroid Frequency Coefficients (SCFCs) and Subband Spectral Centroid Magnitude Coefficients (SCMCs). The authors evaluate the proposed countermeasures using two recently introduced databases, including the dataset provided for the ASVspoof 2017 challenge, and perform a cross-database validation to check the consistency of their countermeasures. They evaluate the performance of each feature independently using simple two-class Gaussian Mixture Models (GMMs). Finally, the overall performance is evaluated using score-level fusion of several base classifiers by logistic regression. Their best system achieved an equal error rate of 10.52% on the challenge evaluation set. From this set of experiments, they draw some general conclusions regarding feature extraction for replay attack detection and identify which features show the most promising results. These findings are as follows:



• The use of voice activity detection to remove silence frames seemed to hurt performance in all cases. This might suggest that non-speech frames carry some information about the playback device.

• Cepstral Mean Normalization was shown to improve the performance. However, more advanced techniques such as CMVN, sliding-window CMVN or Feature Warping were not beneficial.

• A higher number of filters and static coefficients than is used in other applications seems to improve the detection accuracy.

• Using static plus delta plus delta-delta coefficients provided the best results in all cases.

• Subband Spectral Centroid Magnitude Coefficients (SCMCs) seem to be very promising features for replay attack detection. These systems showed both the lowest error rates and the most consistent performance across different experiments.

Table 2.1: A brief summary of different studies on replay attacks and their countermeasures.

| Study | Countermeasure | EER (%) | Database |
|---|---|---|---|
| Wu et al. [17] | Spectral peak map | 0.06 | In-house data |
| Shiota et al. [18] | Breath noise (LF) | 3.95 | In-house data |
| Kinnunen et al. [3, 20] | CQCC | 15.12 | ASVspoof2017 [3] |
| Font et al. [19] | SCMC | 11.49 | ASVspoof2017 |
| Font et al. [19] | MFCC | 27.12 | ASVspoof2017 |
| Font et al. [19] | IMFCC | 30.91 | ASVspoof2017 |
| Patil et al. [21] | Instantaneous Frequency (VESA-IFCC) | 14.06 | ASVspoof2017 |
| Witkowski et al. [22] | CQCC (6-8 kHz) | 17.31 | ASVspoof2017 |
| Witkowski et al. [22] | LPCCres (6-8 kHz) | 27.61 | ASVspoof2017 |
| Jelil et al. [23] | EF+PSRMS+IFCC+CQCC+MFCC | 13.95 | ASVspoof2017 |

The work in [21] uses instantaneous frequency information to detect replay spoofing. The authors proposed a novel replay detector based on Variable length Teager Energy Operator Energy Separation Algorithm-Instantaneous Frequency Cosine Coefficients (VESA-IFCC) for the ASVspoof 2017 challenge. The key idea is to exploit the contribution of the instantaneous frequency (IF) in each subband energy via the energy separation algorithm (ESA) to capture possible changes in the spectral envelope of replayed speech (due to the transmission and channel characteristics of the replay device). The IF is computed from narrowband components of the speech signal, and the DCT is applied to the IF to obtain the proposed feature set. They also compare the performance of the proposed VESA-IFCC feature set with features developed for detecting synthetic and voice-converted speech, namely CQCC, CFCCIF and prosody-based features. On the development set, the proposed VESA-IFCC features, when fused at score level with a variant of CFCCIF and prosody-based features, gave the lowest EER of 0.12%. On the evaluation set, this combination gave an EER of 18.33%. However, post-evaluation results of the challenge indicate that the VESA-IFCC features alone gave a relatively low EER of 14.06% (i.e., a relative reduction of 16.11% compared to the baseline CQCC). They concluded that the proposed feature set exploits the contribution of the individual IFs in each subband energy via the proposed VTEO-based ESA. Spectrographic analysis demonstrated the effectiveness of the proposed approach in discriminating replayed speech from natural speech with respect to differences in spectral energy density (in high frequency regions) and spectral smearing due to the replay device. For the ASVspoof 2017 challenge task, the proposed feature set performed better than the baseline CQCC.

The work in [22] uses the high frequency parts of conventional LPCC, CQCC and MFCC features to detect replay attacks. The authors address a replay spoofing attack against a speaker recognition system by detecting that the analyzed signal has passed through multiple analogue-to-digital (A/D) conversions. Specifically, they show that most of the cues that enable detection of replay attacks can be found in the high-frequency band of the replayed recordings. They also used the LPCC residual (LPCCres) feature to detect replay attacks. The results of the investigated methods show a significant improvement in comparison to the baseline system of the ASVspoof 2017 challenge: a relative EER reduction of 70% was achieved on the development set and of 30% on the evaluation set, with a minimum EER of 17.31%. They also investigated the spectral alterations introduced in the process of replay spoofing and provided evidence that significant spoofing cues related to multiple anti-aliasing filtering can be found at high frequencies. Several methods of high-frequency fine-grained parametrization were scrutinized. The fine-tuned CQCC showed the strongest generalization to unseen data, reducing the EER by 30%. The proposed approach does not solve the spoof detection problem completely, but it introduces a significant improvement over the baseline CQCC-GMM system.

A source feature based replay attack detection work is reported in [23], where the authors proposed two source features: the epoch strength feature (EF) and the Peak to Side Lobe Ratio of the Hilbert Envelope of the Linear Prediction Residual (PSRMS). They concluded that the source features, when combined with CQCC, MFCC and instantaneous frequency cepstral coefficients (IFCC), gave the best EER of 13.95% on the ASVspoof2017 database.

In [24], the author used LP residual phase cepstral coefficients (LPRPC) and LP residual Hilbert envelope cepstral coefficients (LPRHEC) to counter voice conversion and speech synthesis spoofing attacks. The author reported that, around the glottal closure instants (GCIs), the spoofed signal has a higher signal-to-noise ratio (SNR) than the actual signal. This is due to the higher degree of fluctuation in the actual LP residual signal than in the spoofed one.

A brief summary of this chapter is given in Table 2.1. Although some attempts have been made to counter replay attacks, the effectiveness of the proposed countermeasures is still not up to par. A possible reason is that most of the attempts derive the features directly from the speech signal. Because the short-term spectral energies of actual and replay signals are almost indistinguishable, the derived features are unable to classify actual and replay signals well. These observations indicate the need for alternative evidence to counter replay attacks.

In this work, we use excitation source features, derived from LP residual and GFD signals, as alternative evidence to counter replay attacks. Some works have reported the analysis of low frequency information (due to the effect of the loudspeaker) to counter replay attacks [13, 18], and others have reported the analysis of high frequency information (the effect of multiple A/D conversions) to design countermeasures [22]. However, the joint use of low and high frequency information to counter replay attacks is still missing. In that direction, discriminative features are derived by looking into both low and high frequency evidence, as explained in the following chapters.

Chapter 3 Countermeasures to replay attacks: Excitation source information

3.1 Analysis of replay signal

Replay signals are re-recorded and played-back versions of the actual signal. The dominant acoustic differences between actual and replay signals are caused by the effects of the recording and replay devices and by the acoustic variations in the environment at the time of recording and playback. In that direction, we first analyze the characteristics of the recording and playback devices and then analyze the overall disturbance in both the time and spectral domains of actual and replayed signals.

In a replay attack, the attacker first acquires the target speech, stores it in digitally recorded form, and later presents it to the machine through a loudspeaker. The loudspeaker creates distortion in the low frequency region in the form of strong attenuation. The typical frequency response of a smartphone loudspeaker is shown in Fig. 3.1. In the low frequency region (0-500 Hz) the response is strongly attenuated, with a gain between about -20 dB and 0 dB. Beyond 500 Hz, the response oscillates within ±5 dB. The diaphragm has to move a comparatively longer distance to reproduce the low frequency components, resulting in strong attenuation [25]. Further, in a replay attempt the trial speech passes through (at least) two A/D conversions: once at the attacker's end for storing the target speech samples, and once during the trial with the machine. The A/D conversion uses an anti-aliasing filter that suppresses the high frequency components to satisfy the Nyquist criterion. Suppressing the high frequency components multiple times (as in a replay signal) creates distortion in the high frequency regions [22]. The environmental effects (where the recordings and presentations take place) also introduce distortion to the signal in the form of noise. In the worst-case scenario, in which the attacker possesses high quality recordings (likely the case with an attacker), the environmental impacts can be neglected. The recording process and the loudspeaker are the major sources of distortion in replay signals, and tracing such distortions provides major clues for detection.

Figure 3.1: Typical frequency response of a high quality smartphone loudspeaker [1].

As discussed in Chapter 1, the effects of low and high frequency distortions are hardly distinguishable in the overall time-domain and spectrogram representations. This is because the low and high frequency distortions (due to the replay device) are masked in spectrograms, which represent average short-term spectral energy that is largely dominated by the formant resonances confined to the range from 500 Hz to 3000 Hz. The low and high frequency distortions can, however, be observed in band-filtered signals. Fig. 3.2 shows the actual and replay signals filtered through four different bands: 0-500 Hz, 500-1500 Hz, 1.5-3 kHz and 3-4 kHz. These bands are chosen based on prior knowledge of the range of formants and device distortions. The distortions in the replay signal are comparatively larger in the 0-500 Hz and 3-4 kHz bands. The 500-3000 Hz range is dominated by formants.

Figure 3.2: Effect of replay devices across different frequency bands. Panels: actual speech and replay speech, with band-filtered versions at 0-500 Hz, 500-1500 Hz, 1500-3000 Hz and 3000-4000 Hz; x-axis: time (sec), y-axis: magnitude.

A quantitative measurement of the distortion is made using a sub-band spectral distortion measure [26]. Speech samples of five speakers (3 male and 2 female) collected in two different sessions (A and B) are considered for the measurement. The spectral distortion values for three different bands are reported in Table 3.1. As very little distortion is observed in the 500-1500 Hz and 1500-3000 Hz bands (see Fig. 3.2), they are merged and the spectral distortion is measured over the 500-3000 Hz band. The average spectral distortion is significantly higher in the low (0-500 Hz) and high (3-4 kHz) regions, reflecting the effect of the loudspeaker and the double quantization process. The distortion is comparatively smaller for female speakers, which indicates that replay attacks on female speakers may be harder to detect.

The spectrum of the speech signal is dominant around the formants, whereas the impulse-like LP residual signal has a flat spectrum. The distortions at the lower and higher ends may therefore explicitly corrupt this flat spectrum. With that motivation, the usefulness of the LP residual signal is investigated.

Table 3.1: Spectral distortion (SD) between actual and corresponding playback speech signals of male and female speakers, measured at three sub-bands.

| Speaker | Sessions | SD (dB), 0-500 Hz | SD (dB), 500-3000 Hz | SD (dB), 3-4 kHz |
|---|---|---|---|---|
| Spk1 | (Actual-A, Playback-A) | 9.08 | 7.70 | 9.30 |
| Spk1 | (Actual-B, Playback-B) | 10.11 | 8.25 | 10.00 |
| Spk2 | (Actual-A, Playback-A) | 9.58 | 7.95 | 9.78 |
| Spk2 | (Actual-B, Playback-B) | 9.45 | 8.31 | 10.08 |
| Spk3 | (Actual-A, Playback-A) | 11.85 | 8.35 | 10.22 |
| Spk3 | (Actual-B, Playback-B) | 9.53 | 7.65 | 9.06 |
| Spk4 | (Actual-A, Playback-A) | 8.78 | 6.45 | 9.57 |
| Spk4 | (Actual-B, Playback-B) | 8.77 | 7.20 | 9.03 |
| Spk5 | (Actual-A, Playback-A) | 8.61 | 6.93 | 8.78 |
| Spk5 | (Actual-B, Playback-B) | 8.50 | 7.23 | 8.38 |
| Average | | 9.43 | 7.60 | 9.42 |
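The exact definition of the sub-band spectral distortion measure of [26] is not reproduced in this chapter. The following is only a minimal sketch of one common choice, the root-mean-square difference of the log power spectra restricted to a band, computed on a single pair of time-aligned frames. The function name, the FFT length and the `eps` guard are assumptions made for illustration.

```python
import numpy as np

def subband_spectral_distortion(x, y, fs, band, n_fft=512, eps=1e-12):
    """RMS log-spectral difference (dB) between x and y inside `band` = (lo, hi) in Hz.

    Assumed definition for illustration; x and y are time-aligned frames.
    """
    # Power spectra of the two frames (truncated or zero-padded to n_fft samples).
    px = np.abs(np.fft.rfft(x, n_fft)) ** 2
    py = np.abs(np.fft.rfft(y, n_fft)) ** 2
    freqs = np.fft.rfftfreq(n_fft, d=1.0 / fs)
    lo, hi = band
    mask = (freqs >= lo) & (freqs < hi)
    # Difference of the log spectra in dB, then RMS over the selected bins.
    diff_db = 10.0 * np.log10(px[mask] + eps) - 10.0 * np.log10(py[mask] + eps)
    return float(np.sqrt(np.mean(diff_db ** 2)))

# Toy example: a "replay" frame whose 250 Hz component is attenuated by 20 dB.
fs = 8000
n = np.arange(512)
actual = np.sin(2 * np.pi * 250 * n / fs) + np.sin(2 * np.pi * 2000 * n / fs)
replay = 0.1 * np.sin(2 * np.pi * 250 * n / fs) + np.sin(2 * np.pi * 2000 * n / fs)
sd_low = subband_spectral_distortion(actual, replay, fs, (0, 500))
sd_mid = subband_spectral_distortion(actual, replay, fs, (500, 3000))
```

In this toy case the distortion shows up only in the 0-500 Hz band, which mirrors the band-wise comparison reported in Table 3.1. For real utterances the measure would typically be averaged over many frames.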

3.2 Motivation for using excitation source information

From a signal processing point of view, the generation of replay signals can be modeled as shown in Figure 3.3.

Figure 3.3: Replay speech generation model from actual speech.

Accordingly, the replay signal can be expressed as the convolution of the input (actual) speech signal with the impulse response $h_{pd}(n)$ of the recording and playback devices. Mathematically, the replay speech signal $s_p(n)$ can be expressed as

$$ s_p(n) = s_a(n) * h_{pd}(n) \tag{3.1} $$

where $*$ denotes convolution. In terms of the popular LP speech production model, the actual speech signal $s_a(n)$ can be expressed as

$$ s_a(n) = \hat{s}_a(n) + r_a(n) \tag{3.2} $$

and

$$ \hat{s}_a(n) = \sum_{k=1}^{p} a_k \, s_a(n-k) \tag{3.3} $$

where, with a proper LP order $p$, $\hat{s}_a(n)$ models the vocal-tract component of the actual speech signal in terms of the LP coefficients (LPCs) $a_k$. The prediction error $r_a(n)$, called the LP residual, models the excitation component [15, 16]. By employing Equations 3.1, 3.2 and 3.3, the replay speech signal $s_p(n)$ can be expressed as

$$ s_p(n) = \left[ \sum_{k=1}^{p} a_k \, s_a(n-k) + r_a(n) \right] * h_{pd}(n) $$

$$ \implies s_p(n) = \sum_{k=1}^{p} a_k \, s_a(n-k) * h_{pd}(n) + r_a(n) * h_{pd}(n) \tag{3.4} $$
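The step from the bracketed form to Equation 3.4 relies only on the linearity of convolution. As a quick numerical check, the sketch below convolves a sum of two signals with a short impulse response and compares it with the sum of the individual convolutions. The signals and the three-tap response `h_pd` are random or hypothetical stand-ins, not measured device data.

```python
import numpy as np

rng = np.random.default_rng(0)
s_hat = rng.standard_normal(200)       # stand-in for the predicted (vocal-tract) part
r = 0.1 * rng.standard_normal(200)     # stand-in for the LP residual
h_pd = np.array([0.6, 0.3, 0.1])       # hypothetical device impulse response

# Left side of Equation 3.4: convolve the sum with the device response.
lhs = np.convolve(s_hat + r, h_pd)
# Right side of Equation 3.4: convolve each component separately, then add.
rhs = np.convolve(s_hat, h_pd) + np.convolve(r, h_pd)
max_abs_diff = float(np.max(np.abs(lhs - rhs)))
```

The two sides agree up to floating-point rounding, so the device response acts on the vocal-tract and excitation components separately.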

Equation 3.4 shows that both the vocal-tract and the excitation source components of the actual speech $s_a(n)$ are affected by $h_{pd}(n)$. The vocal-tract response is largely manifested in the form of resonances occurring mainly in the frequency range of 500 Hz to 3500 Hz for an 8 kHz sampling frequency. Typically, the first three dominant vocal-tract resonances occur around 500 Hz, 1500 Hz and 2500 Hz, and around these regions the loudspeaker gain is within ±1 dB (see Figure 3.1), which is low compared with other regions. Thus, the vocal-tract resonances may not be affected much by the loudspeaker response, and the distortions present in the replay signal are mainly due to distortions in the LP residual signal. These conclusions are further strengthened by analyzing Fig. 3.4, which shows, for a speech segment sampled at 8 kHz, the temporal and spectral representations of the actual and corresponding replay speech segments and of their LP residuals. In the time domain, similar amplitude variations (pulsations) are visible in both the replay speech and the replay LP residual signals. This indicates that the replay distortions also appear in the LP residual signal.

Figure 3.4: Speech and LP residual signals of actual and corresponding replay utterances. (a-b) Actual and replay speech signals in the time domain, (c-d) corresponding LP spectra. (e-f) Actual and replay LP residual signals in the time domain, (g-h) corresponding LP residual spectra. Time axes in msec, frequency axes in kHz.

In the spectral domain, it can be observed that the LP spectra of the actual and replay signals are very similar, almost indistinguishable. The formants are almost intact in the replay signal. Any feature derived around the formants therefore seems unlikely to be useful in discriminating replay signals. This may be the reason why state-of-the-art SV systems that use MFCC or LPCC features representing formant resonances are vulnerable to replay attacks. On the other hand, the LP residual spectra are noticeably distinct across the entire frequency range. The actual LP residual spectrum is nearly flat, but that overall flatness is disturbed in the corresponding replay spectrum, particularly in the lower (0-500 Hz) and higher (3-4 kHz) subbands. This can be observed more clearly in the spectrograms of the band-filtered (0-500 Hz and 3-4 kHz) LP residual signals shown in Fig. 3.5. To obtain better resolution, the spectrogram of the slowly varying (0-500 Hz) LP residual signal is shown in narrowband form and that of the fast varying (3-4 kHz) signal in wideband form. The dark portions of the original spectrograms are smeared in the respective replay counterparts, mainly due to the effect of attenuation. These facts motivate us to investigate excitation source signals for developing countermeasures against replay attacks.

Figure 3.5: Spectrograms of band-filtered LP residual signals in the (a) 0-500 Hz (low, narrowband) and (b) 3-4 kHz (high, wideband) frequency regions. (c) and (d) show the corresponding replay counterparts. Axes: time (sec) versus frequency (Hz).

The excitation source information of a speech signal can be derived in two forms: the LP residual is known as a parametric form of the excitation, whereas the GFD signal is an estimate of the excitation source.

3.3 LP residual signal

The LP residual is computed by inverse filtering of the speech signal, as given in Equation 3.5:

$$ r(n) = s(n) - \sum_{k=1}^{p} a_k \, s(n-k) \tag{3.5} $$

where $s(n)$, $r(n)$ and $p$ are the speech signal, the LP residual signal and the LP order, respectively.
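Equation 3.5 can be sketched in a few lines. The example below is a minimal illustration, assuming the LP coefficients are estimated with the autocorrelation method (the text does not state which estimation method was used) and using a synthetic second-order autoregressive signal in place of real speech. The function names are hypothetical.

```python
import numpy as np

def lp_coefficients(frame, order):
    """Estimate a_1..a_p with the autocorrelation method (assumed here)."""
    frame = np.asarray(frame, dtype=float)
    # Autocorrelation at lags 0..order.
    full = np.correlate(frame, frame, mode="full")
    mid = len(frame) - 1
    acf = full[mid:mid + order + 1]
    # Normal equations R a = r with the Toeplitz matrix R[i, j] = acf[|i - j|].
    idx = np.abs(np.subtract.outer(np.arange(order), np.arange(order)))
    return np.linalg.solve(acf[idx], acf[1:order + 1])

def lp_residual(frame, a):
    """Equation 3.5: r(n) = s(n) - sum_k a_k s(n - k), with zero initial state."""
    frame = np.asarray(frame, dtype=float)
    inverse_filter = np.concatenate(([1.0], -np.asarray(a, dtype=float)))
    return np.convolve(frame, inverse_filter)[:len(frame)]

# Synthetic test signal: s(n) = 1.3 s(n-1) - 0.6 s(n-2) + e(n).
rng = np.random.default_rng(0)
e = rng.standard_normal(2000)
s = np.zeros(2000)
for n in range(2000):
    s[n] = e[n]
    if n >= 1:
        s[n] += 1.3 * s[n - 1]
    if n >= 2:
        s[n] -= 0.6 * s[n - 2]

a_hat = lp_coefficients(s, order=2)
residual = lp_residual(s, a_hat)
```

On this toy signal the estimated coefficients come out close to the values used to generate it, and the residual has much lower variance than the signal, which is the whitening behaviour the flat LP residual spectrum refers to.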


3.3.1 Parameterization of LP residual signal

Earlier works (although few) have explored LP residual based features such as LP residual phase cepstral coefficients (LPRPC), LP residual Hilbert envelope cepstral coefficients (LPRHEC) and linear prediction cepstral coefficients of the residual (LPCCres) for the detection of replay signals [22, 24]. However, these features give a compact representation of the LP residual information present across the entire frequency range. Based on our observations, we decided to process the LP residual signal in the desired frequency regions, namely 0-500 Hz and 3-4 kHz. For that, the non-uniform spacing of the mel scale is exploited. On the mel scale the bands are tightly spaced in the low frequency region, and the reverse holds on the inverse-mel scale [27]. The mel and inverse-mel scales are expressed as [27]

$$ f_{mel} = 2595 \log_{10}\left(1 + \frac{f_{lin}}{700}\right) \tag{3.6} $$

$$ f_{invmel} = f_{mel}(f_{high}) + f_{mel}(f_{low}) - f_{mel}\left[f_{high} + f_{low} - f_{lin}\right] \tag{3.7} $$

where $f_{mel}$ and $f_{invmel}$ are the mels and inverse mels corresponding to the linear frequency $f_{lin}$ in Hz, and $f_{low}$ and $f_{high}$ correspond to the lower and higher ends of the spectrum. For a signal with sampling frequency $F_s$, usually $f_{low} = 0$ and $f_{high} = F_s/2$.

Figure 3.6: Filter bank structure of mel and inverse mel filters. Panels: mel bands and inverse mel bands; x-axis: frequency (Hz).

Figure 3.7: RIMFCC feature extraction block diagram.

Figure 3.8: RMFCC feature extraction block diagram.
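As a small illustration of Equations 3.6 and 3.7, the sketch below evaluates both warpings at the band edges used in this chapter, with $f_{low} = 0$ and $f_{high} = 4000$ Hz (i.e., $F_s = 8$ kHz). The function names are chosen here for illustration only.

```python
import numpy as np

def hz_to_mel(f_lin):
    """Equation 3.6."""
    return 2595.0 * np.log10(1.0 + np.asarray(f_lin, dtype=float) / 700.0)

def hz_to_invmel(f_lin, f_low=0.0, f_high=4000.0):
    """Equation 3.7: the mel warping mirrored within [f_low, f_high]."""
    f_lin = np.asarray(f_lin, dtype=float)
    return hz_to_mel(f_high) + hz_to_mel(f_low) - hz_to_mel(f_high + f_low - f_lin)

freqs = np.array([0.0, 500.0, 3500.0, 4000.0])
mel = hz_to_mel(freqs)
invmel = hz_to_invmel(freqs)

# Width, in warped units, of the lowest and highest 500 Hz bands on each scale.
mel_low_width, mel_high_width = mel[1] - mel[0], mel[3] - mel[2]
inv_low_width, inv_high_width = invmel[1] - invmel[0], invmel[3] - invmel[2]
```

The 0-500 Hz band occupies more of the mel axis than the 3.5-4 kHz band, and the situation is exactly reversed on the inverse-mel axis. Filters spaced uniformly on the warped axis are therefore denser at low frequencies for mel and denser at high frequencies for inverse mel.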

The distribution of $M = 32$ bands with $F_s = 8$ kHz on the mel and inverse-mel scales is shown in Fig. 3.6. On the mel scale, the frequency range up to 1 kHz is covered by closely spaced overlapping triangular filters, while a smaller number of more widely spaced filters of similar shape is used to average the high frequency zone. Thus, mel filters represent the low frequency region more accurately than the high frequency region. We follow standard cepstral analysis and compute cepstral features called residual MFCC (RMFCC); the block diagram of the feature computation is shown in Fig. 3.8. These features effectively capture the low frequency regions. However, the information that lies in the higher frequency range is not effectively captured by RMFCC features. We therefore use the inverse-mel filter bank, which captures the higher frequency region effectively by using a larger number of closely spaced filters there (see Fig. 3.6), and process the LP residual signal following the same procedure as for RMFCC (see Fig. 3.7); the resulting features are called inverse RMFCC (RIMFCC). The effectiveness of these features is verified individually by conducting replay attack detection experiments on the ASVspoof2017 database. Finally, score-level linear fusion is used to evaluate the final performance, as explained in the following chapters.


The detailed block-wise description of RMFCC and RIMFCC feature extraction is as follows.

Preprocessing:- Before the features are extracted, the speech signal needs to undergo several signal conditioning steps. These tasks include:-

1. Signal mean subtraction:- It is performed to remove the DC component (which does not carry any information) from the signal.

2. Signal normalization:- It is performed to reduce the dynamic variations of the signal.

3. Pre-emphasis:- In the speech signal, the energy of the low frequency components is very high compared to the energy of the high frequency components. To compensate for this, a first-order differentiator is used as a pre-emphasizer to enhance the high frequency components of the input signal.

$$ s_o(n) = ZT^{-1}\left[ \left(1 - a z^{-1}\right) \cdot ZT\left[s_i(n)\right] \right] \tag{3.8} $$

where ZT denotes the Z-transform, $s_o(n)$ and $s_i(n)$ are the output and input signals, respectively, and $a$ is the filter coefficient, whose value generally lies between 0.9 and 0.99.
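The three preprocessing steps can be sketched as follows. This is a minimal illustration: the text does not say how the signal is normalized, so scaling by the peak absolute value is an assumption, and the default `a = 0.97` is simply one value from the stated 0.9-0.99 range. In the time domain, Equation 3.8 reduces to $s_o(n) = s_i(n) - a\,s_i(n-1)$.

```python
import numpy as np

def preprocess(signal, a=0.97):
    """Mean subtraction, peak normalization (assumed) and first-order pre-emphasis."""
    x = np.asarray(signal, dtype=float)
    x = x - np.mean(x)                      # 1. remove the DC component
    peak = np.max(np.abs(x))
    if peak > 0:
        x = x / peak                        # 2. scale into the range [-1, 1]
    # 3. time-domain form of (1 - a z^-1): s_o(n) = s_i(n) - a * s_i(n - 1),
    #    with the first sample passed through unchanged.
    return np.append(x[0], x[1:] - a * x[:-1])

toy = preprocess([1.0, 3.0, 1.0, -1.0], a=0.5)
```

For the toy input, mean subtraction gives [0, 2, 0, -2], normalization gives [0, 1, 0, -1], and pre-emphasis with a = 0.5 gives [0, 1, -0.5, -1].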

Frame blocking:- Speech is a non-stationary signal; to apply signal processing operations, it is assumed to be stationary over a short segment of time (i.e., 10-30 ms). In that direction, the speech signal is divided into frames of N samples, with adjacent frames separated by M samples, where M is less than N. The first frame consists of the first N samples. The second frame begins M samples after the start of the first frame and overlaps it by N - M samples, and so on. This process continues until all the speech is accounted for by one or more frames. We have chosen N = 320 and M = 160 (assuming the signal is stationary over a 20 ms segment at a sampling frequency of 16 kHz).


Windowing:- The next step is to window each individual frame to minimize the signal discontinuities at the beginning and end of each frame. The idea is to minimize the spectral distortion by using the window to taper the signal towards zero at the beginning and end of each frame. If we define the window as $w(n)$, $0 \le n \le N-1$, where $N$ is the frame length, then the result of windowing is the signal

$$ s_o(n) = w(n) \cdot s_i(n) \tag{3.9} $$

where $w(n)$ is the window function. Here we use the Hamming window (to reduce the problem of spectral leakage and zero offset) to analyze the speech frames:

$$ w(n) = 0.54 - 0.46 \cos\left(\frac{2\pi n}{N-1}\right), \quad 0 \le n \le N-1 \tag{3.10} $$
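Frame blocking and windowing can be sketched together. The example below uses the frame length N = 320 and shift M = 160 given above. As a simplifying assumption, trailing samples that do not fill a complete frame are dropped rather than zero-padded; the function names are hypothetical.

```python
import numpy as np

def frame_signal(x, frame_len=320, frame_shift=160):
    """Split x into overlapping frames; an incomplete last frame is dropped (assumed)."""
    x = np.asarray(x, dtype=float)
    if len(x) < frame_len:
        return np.empty((0, frame_len))
    n_frames = 1 + (len(x) - frame_len) // frame_shift
    starts = frame_shift * np.arange(n_frames)
    return np.stack([x[s:s + frame_len] for s in starts])

def hamming(N):
    """Equation 3.10."""
    n = np.arange(N)
    return 0.54 - 0.46 * np.cos(2.0 * np.pi * n / (N - 1))

x = np.arange(800, dtype=float)        # toy signal: each sample equals its index
frames = frame_signal(x)               # 4 frames of 320 samples
windowed = frames * hamming(320)       # Equation 3.9 applied to every frame
```

Because the toy signal stores the sample index, the first value of each frame shows directly where that frame starts (0, 160, 320, 480).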

DFT:- In this stage, the discrete Fourier transform (DFT) of the windowed speech segment is taken to convert the speech frames from the time domain to the frequency domain. The DFT is computed using Equation 3.11. The DFT length $N$ is always chosen to be at least as large as the length of the windowed speech segment to avoid time aliasing.

$$ S(k) = \sum_{n=0}^{N-1} s(n) \, e^{-j 2\pi k n / N}, \quad 0 \le k \le N-1 \tag{3.11} $$

Power spectrum:- After the DFT of each windowed speech segment is calculated, the power spectrum $|S(k)|^2$ is computed.
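A minimal sketch of the DFT and power spectrum steps is given below. The FFT length of 512 is an assumption for illustration, and only the non-redundant half of the spectrum is kept, which is sufficient for a real-valued frame.

```python
import numpy as np

def power_spectrum(frame, n_fft=512):
    """|S(k)|^2 for k = 0..n_fft/2 (non-redundant half for a real-valued frame)."""
    frame = np.asarray(frame, dtype=float)
    if n_fft < len(frame):
        raise ValueError("n_fft must be at least the frame length to avoid time aliasing")
    spectrum = np.fft.rfft(frame, n=n_fft)   # zero-pads the frame up to n_fft samples
    return np.abs(spectrum) ** 2

# Toy frame: 320 samples of a cosine whose frequency sits exactly on DFT bin 32.
frame = np.cos(2.0 * np.pi * 32 * np.arange(320) / 512)
pspec = power_spectrum(frame)
peak_bin = int(np.argmax(pspec))
```

The power spectrum of the toy frame peaks at bin 32, as expected for a tone placed on that bin.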

Mel/Inverse-mel filtering:- As mentioned earlier, the power spectrum of the windowed speech segment is passed through the mel or inverse-mel filter bank to compute the mel frequency or inverse-mel frequency coefficients. The equations and structure of the mel and inverse-mel filter banks are given in Equations 3.6, 3.7 and 3.12 and in Fig. 3.6.

$$ S(m) = \sum_{k=0}^{N-1} |S(k)|^2 \, H_m(k), \quad 0 \le m \le M-1 \tag{3.12} $$

where $H_m(k)$ is the filter bank function and $M$ is the number of filters used.
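The filter bank step can be sketched as below. This is an illustrative construction, not necessarily the one used in the experiments: the filters are triangles whose edges are equally spaced on the mel scale, and the inverse-mel bank is obtained by mirroring the mel bank along the frequency axis, which matches Equation 3.7 when $f_{low} = 0$. The sum in Equation 3.12 is taken over the non-redundant bins 0 to N/2 only. All function names and the FFT length are assumptions.

```python
import numpy as np

def hz_to_mel(f):
    return 2595.0 * np.log10(1.0 + np.asarray(f, dtype=float) / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (np.asarray(m, dtype=float) / 2595.0) - 1.0)

def mel_filterbank(n_filters=32, n_fft=512, fs=8000.0):
    """Triangular filters H_m(k) with edges equally spaced on the mel scale."""
    edges_hz = mel_to_hz(np.linspace(hz_to_mel(0.0), hz_to_mel(fs / 2.0), n_filters + 2))
    bin_freqs = np.arange(n_fft // 2 + 1) * fs / n_fft
    H = np.zeros((n_filters, len(bin_freqs)))
    for m in range(n_filters):
        left, centre, right = edges_hz[m], edges_hz[m + 1], edges_hz[m + 2]
        rising = (bin_freqs - left) / (centre - left)
        falling = (right - bin_freqs) / (right - centre)
        # Triangle = smaller of the two slopes, clipped at zero outside the support.
        H[m] = np.clip(np.minimum(rising, falling), 0.0, None)
    return H

def inverse_mel_filterbank(n_filters=32, n_fft=512, fs=8000.0):
    """Mirror the mel bank so that the narrow filters sit at high frequencies."""
    return mel_filterbank(n_filters, n_fft, fs)[::-1, ::-1]

def filterbank_energies(pspec, H):
    """Equation 3.12 for one frame, with pspec holding bins 0..n_fft/2."""
    return H @ pspec

H_mel = mel_filterbank()
H_inv = inverse_mel_filterbank()
energies = filterbank_energies(np.ones(257), H_mel)
```

With these settings, most mel filter peaks fall below 1 kHz and most inverse-mel filter peaks fall above 3 kHz, which is the property RMFCC and RIMFCC rely on.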


log(.) and DCT:- The base-10 logarithm of the mel/inverse-mel coefficients is taken, and the inverse DCT of type 3 of these log coefficients is computed as per Equation 3.13; the results are called mel or inverse-mel cepstral coefficients. Here the DCT length $M$ should always be greater than the number of filters used to compute the mel or inverse-mel coefficients.

$$ c(n) = \sum_{m=0}^{M-1} \log_{10}\left(S(m)\right) \cos\left(\frac{\pi n (m - 0.5)}{M}\right), \quad 0 \le n \le C-1 \tag{3.13} $$

where $C$ is the number of cepstral coefficients to be computed.
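The last step can be sketched as a direct evaluation of Equation 3.13. The sketch follows the equation as printed, with the filter index running from 0 to M-1 and an offset of -0.5. Many toolkits instead use an offset of +0.5 for a zero-based index (the usual DCT-II convention), so the offset is exposed here as a hypothetical `shift` parameter. The default of 13 coefficients and the `eps` guard are also assumptions.

```python
import numpy as np

def cepstral_coefficients(fb_energies, n_ceps=13, shift=-0.5, eps=1e-12):
    """c(n) = sum_m log10(S(m)) * cos(pi * n * (m + shift) / M), n = 0..n_ceps-1."""
    S = np.asarray(fb_energies, dtype=float)
    M = len(S)
    log_S = np.log10(S + eps)               # eps guards against log10(0)
    n = np.arange(n_ceps)[:, None]          # column of cepstral indices
    m = np.arange(M)[None, :]               # row of filter indices 0..M-1
    basis = np.cos(np.pi * n * (m + shift) / M)
    return basis @ log_S

flat = np.full(32, 10.0)                    # flat filter bank output, log10 = 1
c_printed = cepstral_coefficients(flat)                      # offset as in Equation 3.13
c_dct2 = cepstral_coefficients(flat, n_ceps=5, shift=0.5)    # DCT-II style offset
```

For a flat filter bank output, the zeroth coefficient equals the number of filters under either convention, and with the +0.5 offset all higher coefficients vanish.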

3.3.2 Selection of proper LP order

In order to compute the LP residual signal from the speech signal using the inverse filtering approach, we have to fix the LP order. To decide the proper LP order for deriving the RMFCC and RIMFCC features, a series of experiments is performed by varying the LP order using the training and development sets of the ASVspoof2017 database. The performance of the RMFCC and RIMFCC features for different LP orders is given in Table 3.2.

The RMFCC features extracted from the LP residual at different LP orders give consistent performance in discriminating replay and non-replay signals. This may be due to the small variation in the spectral distortions with respect to LP order at lower frequencies. The best performance, an EER of 17.95%, is achieved at an LP order of 20, so we set p = 20 for the computation of RMFCC features. In the case of RIMFCC features, the performance deteriorates severely at higher LP orders. For example, for 18 ≤ p ≤ 28 the EER rises from 20.78% to 30.25%. In the lower order range, the performance is almost consistent, with slight variations. The best performance, an EER of 17.38%, is achieved at p = 10, which is therefore used for the computation of RIMFCC features. These optimum LP orders show that, in the detection of replay signals, the low frequency distortion is captured effectively with higher order LP analysis and the high frequency distortion with lower order LP analysis.

Table 3.2: Performance (EER, %) of RMFCC and RIMFCC features for different LP orders.

| LP order | 6 | 8 | 10 | 14 | 18 | 20 | 24 | 28 |
|---|---|---|---|---|---|---|---|---|
| RMFCC | 19.21 | 18.38 | 18.42 | 18.26 | 18.26 | 17.95 | 18.05 | 19.27 |
| RIMFCC | 18.25 | 17.96 | 17.38 | 18.20 | 20.78 | 22.59 | 27.20 | 30.25 |

Finally, the LP order is set to 20 for the computation of RMFCC features and to 10 for the computation of RIMFCC features.
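The selection procedure amounts to computing an EER for each candidate LP order and keeping the order with the lowest value. The sketch below shows one simple way to compute an EER from detection scores by a threshold sweep, followed by the selection step applied to the values of Table 3.2. It is an illustration only, not the evaluation tool used in the experiments, and it assumes that higher scores indicate genuine (non-replay) speech.

```python
import numpy as np

def equal_error_rate(genuine_scores, replay_scores):
    """EER in percent, assuming higher scores mean 'more likely genuine'."""
    genuine = np.asarray(genuine_scores, dtype=float)
    replay = np.asarray(replay_scores, dtype=float)
    thresholds = np.sort(np.concatenate([genuine, replay]))
    # False rejection rate: genuine trials scored below the threshold.
    frr = np.array([np.mean(genuine < t) for t in thresholds])
    # False acceptance rate: replay trials scored at or above the threshold.
    far = np.array([np.mean(replay >= t) for t in thresholds])
    i = int(np.argmin(np.abs(far - frr)))
    return 100.0 * (far[i] + frr[i]) / 2.0

# Toy scores (made up for illustration).
toy_eer = equal_error_rate([0.2, 0.6, 0.7, 0.9], [0.1, 0.3, 0.4, 0.65])

# Selection step using the EER values reported in Table 3.2.
lp_orders = [6, 8, 10, 14, 18, 20, 24, 28]
eer_table = {
    "RMFCC": [19.21, 18.38, 18.42, 18.26, 18.26, 17.95, 18.05, 19.27],
    "RIMFCC": [18.25, 17.96, 17.38, 18.20, 20.78, 22.59, 27.20, 30.25],
}
best_order = {name: lp_orders[int(np.argmin(eers))] for name, eers in eer_table.items()}
```

Applied to Table 3.2, the selection step returns order 20 for RMFCC and order 10 for RIMFCC, in agreement with the choices made above.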

3.4 GFD signal

The performance of the LP residual based features in countering replay attacks motivates us to derive the glottal flow derivative (GFD) signal, i.e., the excitation signal, from the speech signal. The LP residual signal is an approximate representation of the excitation signal, whereas the GFD is an estimate of it. The GFD signal is thus expected to contain relatively richer information about the excitation component of speech. It is therefore interesting to exploit such information for the detection of replay signals.

Figure 3.9: Typical glottal flow and its derivative waveforms [2].

Typical waveforms of the glottal flow and its derivative are shown in Fig. 3.9 [2]. The whole cycle is divided into three segments: closed phase, open phase and return phase. The closed phase corresponds to the time interval during which the vocal folds are closed, so no flow occurs. The open phase corresponds to the time interval during which the vocal folds are either fully or partially open, and there is nonzero airflow. The return phase is related to the completion of the speech production mechanism. It is defined as the time interval from the most negative value of the glottal flow derivative to the time of glottal closure. The return phase is particularly important, as it determines the amount of high-frequency energy present in both the source and the speech. The more rapid the glottal closure, the shorter the return phase, resulting in more high-frequency energy and less spectral tilt. This component is especially affected by the response of the recording and replay devices. For example, the multiple recordings introduce aliasing effects that affect the high frequency regions, and the loudspeaker causes sharp attenuation in the low frequency region, disturbing the spectral flatness [13]. It can be observed from Fig. 3.9 that the rapidness of the glottal activity is relatively clearer in the GFD signal. Motivated by these observations, we explore the GFD signal for the detection of replay signals.

3.4.1 Estimation and parameterization of GFD signal

The glottal flow and its derivative are assumed to consist of coarse- and fine-structure components [2, 28]. The coarse structure represents the general shape of the glottal cycle, and the fine structure represents the ripples and aspiration. In the literature, several algorithms are available for the estimation of the GFD signal. The most contemporary methods include the DYPSA, YAGA, ZFR, SEDREAMS and DPI algorithms [29, 30]. These algorithms perform comparably well, although they rely on different approaches. For example, the DYPSA and YAGA approaches follow dynamic programming, whereas the others follow a smoothing process. Also, unlike the others, the DPI algorithm uses the residual signal. Thus these techniques may have different properties in terms of reliability, accuracy and robustness. They are evaluated here in the context of the task objective, for instance, how accurately the estimated GFD signal regenerates the original speech. If any deviation appears due to replay activity, then the corresponding GFD signal may regenerate a different synthesized speech. Tracing such differences may be useful in detecting the replay signal. The linear prediction coding (LPC) approach is employed for speech synthesis as per equation 3.5. Various methods of glottal closing instant (GCI) computation are employed to compute the GFD signal using the complex cepstrum decomposition method described in [31]. The GFD signals are then analyzed using a correlation method and a spectral distortion method to select the best estimate in the context of replay detection. The synthesized speech and the correlation are computed using equations 3.14 and 3.15.

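$$ s_l(n) = ZT^{-1}\left[ \frac{ZT[e_l(n)]}{1 - \sum_{k=1}^{P} a_k z^{-k}} \right] \tag{3.14} $$

$$ \mathrm{Corr}_l = \frac{\sum_{n=1}^{l_n} \left(s_r(n) - \bar{s}_r(n)\right)\left(s_l(n) - \bar{s}_l(n)\right)}{l_n \sqrt{\mathrm{Var}(s_r(n))}\,\sqrt{\mathrm{Var}(s_l(n))}} \tag{3.15} $$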

where $s_l(n)$ is the synthesized speech, $l_n$ represents the length of the signal, $l$ represents the index ($1 \le l \le 5$) of the five different GFD estimation methods and $e_l(n)$ is the GFD signal estimated by the corresponding method. The correlation is computed between the speech synthesized from each GFD estimate and a reference synthesized speech $s_r(n)$, obtained by setting $e_l(n) = r(n)$, the LP residual. The method showing the higher correlation is considered for analysis.
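The correlation-based selection can be illustrated with a minimal sketch in plain Python. It assumes the reference and candidate synthesized signals are already available as equal-length lists of samples; the function names `correlation` and `best_method` are hypothetical and not part of the original work.

```python
import math

def correlation(s_ref, s_syn):
    """Normalized correlation between two equal-length signals (cf. Eq. 3.15)."""
    if len(s_ref) != len(s_syn) or len(s_ref) < 2:
        raise ValueError("signals must have the same length (at least 2 samples)")
    n = len(s_ref)
    mean_r = sum(s_ref) / n
    mean_l = sum(s_syn) / n
    # Cross term averaged over the signal length
    cov = sum((a - mean_r) * (b - mean_l) for a, b in zip(s_ref, s_syn)) / n
    var_r = sum((a - mean_r) ** 2 for a in s_ref) / n
    var_l = sum((b - mean_l) ** 2 for b in s_syn) / n
    # Assumes neither signal is constant (non-zero variance)
    return cov / math.sqrt(var_r * var_l)

def best_method(s_ref, candidates):
    """Name of the candidate whose synthesized speech correlates most with s_ref."""
    return max(candidates, key=lambda name: correlation(s_ref, candidates[name]))
```

In this sketch `candidates` would map a method name such as "DYPSA" or "DPI" to the speech synthesized from that method's GFD estimate.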

Figure 3.10: Estimated GFD signal of actual and corresponding replay signal. [Panels: GFD (DYPSA) of actual speech, GFD (DPI) of actual speech, GFD (DYPSA) of replay speech, GFD (DPI) of replay speech; x-axis: Samples; y-axis: Magnitude.]


Table 3.3: Correlation between the LP based synthesized speech signals derived from the LP residual and from the GFD signal estimated by five different methods. The parameters are computed using the speech of five different speakers.

Correlation coefficient
Speakers   DYPSA    YAGA     ZFR       SEDREAMS   DPI
Spk1       0.685    0.237    -0.083    0.104      0.696
Spk2       0.782    0.108    0.192     0.480      0.691
Spk3       0.727    0.090    0.002     0.021      0.652
Spk4       0.708    0.096    0.0566    0.135      0.681
Spk5       0.696    0.232    -0.1338   0.218      0.606
Average    0.720    0.153    0.006     0.191      0.665

Table 3.4: Spectral distortion (SD) between actual and replay GFD signals estimated by the DYPSA and DPI methods. The parameters are computed using the speech of five different speakers.

Spectral distortion
Speakers   DYPSA    DPI
Spk1       13.54    14.53
Spk2       13.27    13.06
Spk3       16.82    16.52
Spk4       14.35    13.42
Spk5       14.72    12.83
Average    14.54    14.07

Table 3.3 shows the correlation values estimated from the speech samples of five different speakers for the five different methods. On average, DYPSA shows the highest correlation value of 0.720, followed by DPI with 0.665. These two methods provide almost equal correlation. To choose between them, we further compare these two methods with reference to our task objective. The GFD signals are estimated from the replay counterparts of the speech samples, and the spectral distortion between the actual and the corresponding replay speech samples is computed. The method with the higher spectral distortion is considered for experimental analysis. Table 3.4 shows the spectral distortion values. It can be observed that the DYPSA approach provides a relatively higher average distortion value and is thus considered for the estimation of the GFD signal. The GFD signals of the actual and replay signals estimated using the DYPSA and DPI methods are depicted in Fig. 3.10. From the figure, we observe that the replay signal shows more oscillations (zero crossings) and lower epoch strengths than the actual signal.

In modelling the GFD signal, we exploit the advantage of the mel-scale distribution and process the GFD signal in the mel-frequency cepstral domain. The mel-scale filters are tightly spaced in the low-frequency region, say below 500 Hz. In that region the excitation source information is predominantly reflected in the spectrum. We follow standard cepstral analysis and compute cepstral features called GFD mel-filter cepstral coefficients (GFDMFCC). The first sixteen coefficients, including the zeroth one, are used as representative features. The experimental results and observations are presented in the following chapters.

Chapter 4
Experimental studies

Databases are needed to conduct experiments for developing countermeasures against replay attacks.

4.1 Database

For conducting the replay attack experiments we use two databases:

• ASVspoof2017 database
• IITG-MV replayed database

4.2 ASVspoof2017 database

The ASVspoof 2017 database is collected in variable conditions and is suitable for studying replay attacks in a more practical setting [3]. The authentic recordings of the ASVspoof 2017 database are part of the RedDots corpus, while the replay recordings are obtained by passing the authentic recordings through different replay devices. The dataset consists of three subparts: the training set, the development set and the evaluation set. The dataset is collected at a 16 kHz sampling rate. The statistics are given in Table 4.1. Altogether the database consists of 3566 authentic recordings and 14466 replay recordings from 42 speakers. For tuning the systems, the training set is used to build the genuine and the spoof speech models and the development set is used for testing. The systems submitted to the challenge evaluation use only the training set for building the models and the evaluation set as the test set. Post evaluation, the models are rebuilt using both the training and the development set utterances. The results presented in this work for the evaluation set are the ones obtained using the latter configuration.

Table 4.1: Details of ASVspoof2017 dataset [3]

Dataset        Number of    Number of utterances
subparts       speakers     Actual     Playback
Training       10           1508       1508
Development    8            760        950
Evaluation     24           1298       12008
Total          42           3566       14466

4.2.1 Experiment setup on ASVspoof2017 database

Here, we have focused our efforts on finding discriminative features. This approach is in line with the general observation that the design of spoofing countermeasures should start with the search for a good set of discriminative features rather than with the design of complex classifiers. Therefore, our proposal uses relatively simple classifiers. In particular, two Gaussian Mixture Models (GMMs) are trained on genuine and replay speech, respectively, using maximum likelihood estimation [19, 32]. The score is computed as the log-likelihood ratio for the test utterance given both classifiers, as per equation 4.1. The description of the GMM classifier is given in subsection 4.3.1.

$$ \mathrm{LLR} = \log(L_{actual}) - \log(L_{replay}) \tag{4.1} $$

where $L_{actual}$ and $L_{replay}$ are the likelihoods of the test utterance given the actual and replay GMMs, respectively.
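As a minimal sketch of the scoring in equation 4.1, assume the per-frame log-likelihoods of a test utterance under the actual and replay GMMs have already been computed, and that frames are treated as independent so that the utterance log-likelihood is the sum over frames. The function names and the default threshold of zero are illustrative assumptions, not values taken from this work.

```python
def llr_score(loglik_actual, loglik_replay):
    """Log-likelihood ratio of an utterance (Eq. 4.1) from per-frame log-likelihoods."""
    if len(loglik_actual) != len(loglik_replay):
        raise ValueError("both models must score the same frames")
    # log L = sum of frame log-likelihoods under the independence assumption
    return sum(loglik_actual) - sum(loglik_replay)

def is_genuine(loglik_actual, loglik_replay, threshold=0.0):
    """Accept the utterance as actual speech when the LLR exceeds the threshold."""
    return llr_score(loglik_actual, loglik_replay) > threshold
```

A positive score favours the actual-speech model, a negative score the replay model.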

The whole ASVspoof2017 database is grouped into three sets: the training set, the development set and the evaluation set. The speech data from the training set is used to build the actual and replay models. The data from the development set is used to tune the system parameters, and the evaluation set is used to estimate the system performance. Accordingly, in this work the development set is used for tuning the computational parameters (with only the training set used to train the classifiers), and the final performance is estimated on the evaluation set (with both the training and development sets used to train the classifiers).

The RMFCC and RIMFCC features are computed from the LP residual signal (obtained with the optimal LP orders of 20 and 10, respectively; see Section 3.3.2) using 20 ms Hamming-windowed frames with a shift of 10 ms. In [19], it is shown that the effect of replay devices is significantly reflected in low-SNR regions, such as unvoiced or silence regions. Therefore, we do not apply voice activity detection (VAD) and consider all frames. To compute the RMFCC and RIMFCC features, the log magnitude spectrum of each LP residual frame is warped with 32-band mel and inverse-mel filters, respectively, and subsequently the inverse discrete cosine transform (IDCT) is computed. The components of the IDCT (excluding c0), together with their first and second order delta coefficients, are used as the RMFCC and RIMFCC features. The performance of the proposed features is investigated individually and then compared with the state-of-the-art features. The configuration of the optimized parameters of the different features is reported in Table 4.2. The state-of-the-art SCMC and CQCC features are derived as described in [19] and [20].

4.2.2 Results and discussion on ASVspoof2017 database

The effectiveness of the derived features is tested on the ASVspoof 2017 database. The results of the RMFCC and RIMFCC features with optimal LP order on the evaluation set are given in Table 4.3. The corresponding DET curves are shown in Fig. 4.1.

Table 4.2: Parameters used for deriving features to perform the replay detection task.

Features      f_min - f_max (Hz)   Dimensions
RMFCC         0-8000               70 + Δ + ΔΔ
RIMFCC        0-8000               35 + Δ + ΔΔ
GFDMFCC       0-8000               16 (including C0) + Δ + ΔΔ
CQCC [20]     15.62-8000           20 (including C0) + Δ + ΔΔ
SCMC [19]     100-8000             40 + Δ + ΔΔ

The RMFCC feature provides an EER of 14.57% and the RIMFCC feature 15.35%. These performances indicate that LP residual based features are effective in detecting replay signals. The RMFCC and RIMFCC features contain different information, and this may be the reason for the difference in their respective performance. The better performance of the RMFCC feature suggests that the distortion in replay signals is comparatively more significant in the low-frequency regions. Motivated by the advantage of the fusion approach [33], we combine the evidence of the RMFCC and RIMFCC features at score level and use it for the detection task. The linear weighted combination scheme is used for the fusion of evidence [33]. In that scheme the individual decisions are weighted by their respective performance and then added, resulting in the overall score. The performance is significantly improved to an EER of 10.14%. This is better than the EER of 11.49% of the state-of-the-art speech signal based SCMC feature reported in [19]. The performance is further improved to 9.54% by combining the evidence of the speech based SCMC feature with the LP residual based features. These observations show the usefulness of deriving LP residual based features for the detection of replay signals.
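The linear weighted score-level fusion can be sketched as follows. The exact weighting rule of [33] is not reproduced here: as an assumption for illustration, each system is weighted in inverse proportion to its EER, with the weights normalized to sum to one, and the scores of the systems are assumed to be on comparable scales. The function names are hypothetical.

```python
def fusion_weights(eers):
    """Weights inversely proportional to EER, normalized to sum to one (assumed rule)."""
    inverse = [1.0 / e for e in eers]
    total = sum(inverse)
    return [v / total for v in inverse]

def fuse(scores_per_system, weights):
    """Weighted sum of scores; scores_per_system[k][j] is system k's score on trial j."""
    n_trials = len(scores_per_system[0])
    return [
        sum(w * scores[j] for w, scores in zip(weights, scores_per_system))
        for j in range(n_trials)
    ]
```

With the EERs of 14.57% (RMFCC) and 15.35% (RIMFCC), this rule gives the RMFCC scores a slightly larger weight than the RIMFCC scores.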

Table 4.3: Performance of the proposed LP residual features compared with the state-of-the-art speech signal based SCMC feature.

Features                  Evaluation EER (%)
RMFCC (LP order 20)       14.57
RIMFCC (LP order 10)      15.35
RMFCC+RIMFCC              10.14
SCMC [19]                 11.49
SCMC+RMFCC+RIMFCC         9.54

Figure 4.1: DET curves from replay detection experiments with RMFCC, RIMFCC, SCMC features and their different fusions. [Axes: False Alarm probability (in %) against Miss probability (in %); curves: RMFCC, RIMFCC, SCMC, RMFCC+RIMFCC, SCMC+RMFCC+RIMFCC.]

The performance of the RMFCC and RIMFCC features motivates us to derive the GFD signal from the speech signal. Initially, the GFD signal is parameterized by mel-warped filtering, giving the GFDMFCC feature, and its effectiveness is tested on ASVspoof2017 and compared with the state-of-the-art CQCC feature. The results are given in Table 4.4, and the corresponding DET curves are shown in Fig. 4.2. The proposed GFDMFCC feature provides an EER of 20.53%. In comparison, the LP residual based RMFCC feature provides an EER of 14.57%, and the speech signal based CQCC feature provides an EER of 15.12%. Individually, the GFDMFCC feature gives the poorest performance. The reason may be the presence of only coarse structure information in the GFD signal. However, it is interesting to note that the joint use of CQCC and GFDMFCC gives an EER of 9.23%, slightly better than the 9.34% obtained by combining CQCC with the RMFCC and RIMFCC features, indicating the significance of the GFD based feature.

Table 4.4: Performance of the GFDMFCC feature for the replay attack detection task.

Features                  Evaluation EER (%)
GFDMFCC                   20.53
RMFCC                     14.57
RIMFCC                    15.35
RMFCC+RIMFCC              10.14
CQCC [20]                 15.12
CQCC+RMFCC+RIMFCC         9.34
CQCC+GFDMFCC              9.23

Figure 4.2: DET curves from replay detection experiments with GFDMFCC, RMFCC and CQCC features. DET curves of fused features are also shown. [Axes: False Alarm probability (in %) against Miss probability (in %); curves: GFDMFCC, RMFCC, CQCC, CQCC+RMFCC, CQCC+GFDMFCC.]

For further comparison, both the LP residual based features and the GFD signal based feature are fused with the state-of-the-art CQCC and SCMC features at score level. The fused performances are given in Table 4.5. The residual based RMFCC and RIMFCC features combined with the speech based SCMC and CQCC features give an EER of 8.72%, whereas the GFD signal based GFDMFCC feature combined with the SCMC and CQCC features gives an EER of 8.12%. These results indicate that the GFDMFCC feature carries more information complementary to the state-of-the-art speech based CQCC and SCMC features than the residual based RMFCC and RIMFCC features do. Therefore, in further study we plan to model both the fine and the coarse structure of the GFD signal, and to derive a discriminative feature able to capture all the discriminative evidence present in the replay signal.

Table 4.5: Performance comparison of the GFDMFCC, RMFCC, RIMFCC features with state-of-the-art SCMC, CQCC features for the replay attack detection task.

Features                    Evaluation EER (%)
CQCC+SCMC+RMFCC+RIMFCC      8.72
CQCC+SCMC+GFDMFCC           8.12


4.3 IITG-MV replayed database

The dataset is manually developed using the publicly available Indian Institute of Technology Guwahati Multi-Variability (IITG-MV) database [4]. The Phase-I (office) and Phase-II (laboratory) datasets of the IITG-MV database are collected through a microphone and cover all practically variable conditions except background and/or channel noise. Such conditions suit replay attackers, and the datasets are therefore considered for this study. The speech samples are collected in two modes: read speech and conversational speech. We prefer the latter mode for the following reasons. The speaker characteristics, including behavioral traits, are relatively better manifested in conversational speech, and in general the replay attacker preferably acquires the speech samples (secretly) while the target is in conversation with others.

The Phase-I and Phase-II datasets of the IITG-MV database contain speech samples of 148 (112 male and 36 female) non-native English speakers, recorded at the rate of 16000 samples/second. The duration of the speech samples per speaker varies from 10 to 15 minutes. For this experimental study, we consider the speech data of 76 (45 male and 31 female) speakers and segregate it into two groups: Dataset-I and Dataset-II. Dataset-I includes the speech data of 5 male and 6 female speakers, amounting to one hour from each gender, for building gender-independent UBM models. Dataset-II is developed with the speech data of 65 speakers (comprising 40 males and 25 females) for evaluation purposes. The first two minutes of each speaker's speech data are used for enrollment. The remaining data are divided into several segments of 30 seconds duration and used for test trials. Each test segment of each speaker is used as a genuine trial for the same target model and as an impostor trial against the models of the other speakers of the same gender. This results in a huge number of trials. The detailed statistics are summarized in Table 4.6. Altogether there are 42440 trials, which include 1274 genuine and 41166 impostor attempts from 706 male and 568 female speech samples.
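The trial counts follow directly from the protocol: every test segment gives one genuine trial and one impostor trial against each other speaker of the same gender. A short check, using the speaker and segment counts reported in Table 4.6 (the function name is illustrative):

```python
def trial_counts(n_speakers, n_test_segments):
    """Genuine and impostor trial counts for one gender under the matched-gender protocol."""
    genuine = n_test_segments                       # one genuine trial per segment
    impostor = n_test_segments * (n_speakers - 1)   # one trial against every other model
    return genuine, impostor

male = trial_counts(40, 706)
female = trial_counts(25, 568)
total = sum(male) + sum(female)
print(male, female, total)  # (706, 27534) (568, 13632) 42440
```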

The replay speech samples are generated manually by replaying the original data through a high-quality CREATIVE-SBS-A35 loudspeaker in an acoustically controlled environment (i.e. inside a room with no fan or air-conditioner noise) and recording at the rate of 16000 samples/second. We put careful effort into acquiring good-quality replay speech samples. To verify the quality of the replay data, the original and replay recordings were played in front of a few participants. They could hardly differentiate between them, which confirms the quality of the replay data. The replay attempts are considered only for the 1274 target claimants.

Table 4.6: Summary of the dataset used in this work for genuine, impostor and replay trials. The speech samples are taken from the IITG-MV database [4].

Statistics                         Male     Female   Total
Background speakers                05       06       11
Target speakers                    40       25       65
Genuine trials                     706      568      1274
Impostor trials                    27534    13632    41166
Replay trials (target claimants)   706      568      1274

4.3.1 Experiment setup on IITG-MV database

The developments in speaker verification technology have made the state-of-the-art SV system a convenient tool for person authentication. This is mainly due to the use of vocal-tract information based short-term cepstral features (mfcc) and probabilistic modelling techniques. Although advanced modelling techniques are available in the literature, the classic Gaussian mixture model with universal background model (GMM-UBM) is standard and popular, requires less data and is suitable for replay studies. In this study we consider mfcc and rmfcc with their first and second order delta coefficients as the representative features for the speech and LP residual signals, and conduct the experiments with a GMM-UBM based SV system. The description of GMM-UBM classification is given later in this subsection.

The speaker verification performance is generally shown by a DET curve and measured in terms of the equal error rate (EER), the operating point at which the false rejection rate (FRR) and the false acceptance rate (FAR) are equal. In a false rejection, a genuine speaker is classified as an impostor. In a false acceptance, an impostor is accepted as the target speaker. Replay attackers usually aim at specific (target) speakers. Thus, under the replay attack scenario the FAR is the more relevant parameter for evaluating the system vulnerability. We consider two parameters: the zero-effort false acceptance rate (ZFAR) and the false acceptance rate under replay attacks (PFAR). The increase of PFAR relative to ZFAR indicates the system vulnerability to replay attacks. The EER, or equivalently the ZFAR, is computed considering both genuine and zero-effort impostor trials, without replay attacks. We call it the baseline performance. The decision threshold is set based on the score distribution of the baseline system. The PFAR is computed only with the target trials. In this case, all target trials with actual speech are considered genuine and those with replay speech are considered impostors. The PFAR is measured at the threshold of the baseline system. Further, it is assumed that the attackers know the gender of the target speaker. We segregate the dataset into male and female speaker sets and perform the trials with matched gender. Later, the whole-set performance is evaluated by pooling the respective genuine and impostor trial scores of the male and female speakers.
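The evaluation protocol above can be sketched as follows, assuming higher scores mean stronger support for the claimed speaker. The threshold search simply picks the observed score at which FAR and FRR are closest, which is a simplification of a proper EER estimate; the function names are hypothetical.

```python
def far(impostor_scores, threshold):
    """Percentage of impostor (or replay) trials accepted at the given threshold."""
    accepted = sum(1 for s in impostor_scores if s > threshold)
    return 100.0 * accepted / len(impostor_scores)

def frr(genuine_scores, threshold):
    """Percentage of genuine trials rejected at the given threshold."""
    rejected = sum(1 for s in genuine_scores if s <= threshold)
    return 100.0 * rejected / len(genuine_scores)

def eer_threshold(genuine_scores, impostor_scores):
    """Observed score at which FAR and FRR are closest (approximate EER point)."""
    candidates = sorted(set(genuine_scores) | set(impostor_scores))
    return min(
        candidates,
        key=lambda t: abs(far(impostor_scores, t) - frr(genuine_scores, t)),
    )

# Toy scores, for illustration only
genuine = [2.0, 3.0, 4.0, 5.0]
zero_effort = [0.0, 1.0, 2.5, 1.5]
replay = [3.5, 4.5, 1.0, 2.2]

theta = eer_threshold(genuine, zero_effort)  # baseline threshold
zfar = far(zero_effort, theta)               # zero-effort FAR at the baseline threshold
pfar = far(replay, theta)                    # FAR under replay at the same threshold
```

The key point of the protocol is that `theta` is fixed from the baseline trials and reused unchanged for the replay trials.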

GMM and GMM-UBM

In speech and speaker recognition, the acoustic events are usually modeled by Gaussian probability density functions (PDFs), described by a mean vector and a covariance matrix. However, a unimodal PDF with only one mean and covariance is unsuitable for modeling all variations of a single event in speech signals. Therefore, a mixture of single densities is used to model the complex structure of the probability density. For a D-dimensional feature vector denoted as $x_t$, the mixture density for speaker $s$ is defined as a weighted sum of $M_g$ component Gaussian densities, as given by equation 4.2 [32].

$$ P(x_t \mid s) = \sum_{i=1}^{M_g} w_i P_i(x_t) \tag{4.2} $$


where $w_i$ are the weights and $P_i(x_t)$ are the component densities. Each component density is a D-variate Gaussian function of the form

$$ P_i(x_t) = \frac{1}{(2\pi)^{D/2} |\Sigma_i|^{1/2}} \exp\left\{ -\frac{1}{2} (x_t - \mu_i)^T \Sigma_i^{-1} (x_t - \mu_i) \right\} \tag{4.3} $$

where $\mu_i$ is the mean vector and $\Sigma_i$ is the covariance matrix of the $i$-th component. The mixture weights have to satisfy the constraint 4.4 [32].

$$ \sum_{i=1}^{M_g} w_i = 1 \tag{4.4} $$

The complete Gaussian mixture density is parameterized by the mean vectors, the covariance matrices and the mixture weights of all component densities. These parameters are collectively represented by

$$ \lambda_s = \{ w_i, \mu_i, \Sigma_i \}, \quad i = 1, \ldots, M_g \tag{4.5} $$

Training the GMMs

To determine the model parameters of the GMM of a speaker, the GMM has to be trained. In the training process, the maximum likelihood (ML) procedure is adopted to estimate the model parameters. For a sequence of training vectors $X = [x_1, x_2, \ldots, x_T]$, the GMM likelihood can be written (assuming independence of the observations) as

$$ P(X \mid s) = \prod_{t=1}^{T} P(x_t \mid s) \tag{4.6} $$

Usually the logarithm of this quantity is taken, which is commonly called the log-likelihood function. From equations 4.2 and 4.6, the log-likelihood function can be written as [32]

$$ \log[P(X \mid s)] = \sum_{t=1}^{T} \log\left[ \sum_{i=1}^{M_g} w_i P_i(x_t) \right] \tag{4.7} $$


Often, the log-likelihood value is divided by T to normalize out duration effects. Also, since the incorrect assumption of independence underestimates the actual likelihood value with dependencies, scaling by T can be considered a rough compensation factor. The parameters of a GMM can be estimated using maximum likelihood (ML) estimation. The main objective of ML estimation is to derive the optimum model parameters that maximize the likelihood of the GMM. The likelihood value is, however, a highly nonlinear function of the model parameters, and direct maximization is not possible. Instead, maximization is done through iterative procedures. Of the many techniques developed to maximize the likelihood value, the most popular is the iterative expectation maximization (EM) algorithm.
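The duration-normalized log-likelihood of equations 4.2, 4.3 and 4.7 can be sketched in plain Python. The sketch assumes diagonal covariance matrices (each component is described by a list of per-dimension variances) and strictly positive weights; the function names are illustrative.

```python
import math

def log_gaussian_diag(x, mean, var):
    """Log of a D-variate Gaussian with diagonal covariance (Eq. 4.3)."""
    d = len(x)
    log_det = sum(math.log(v) for v in var)
    maha = sum((xi - mi) ** 2 / vi for xi, mi, vi in zip(x, mean, var))
    return -0.5 * (d * math.log(2.0 * math.pi) + log_det + maha)

def log_mixture(x, weights, means, variances):
    """log sum_i w_i P_i(x) (Eq. 4.2), computed with log-sum-exp for stability."""
    terms = [
        math.log(w) + log_gaussian_diag(x, m, v)
        for w, m, v in zip(weights, means, variances)
    ]
    peak = max(terms)
    return peak + math.log(sum(math.exp(t - peak) for t in terms))

def avg_log_likelihood(frames, weights, means, variances):
    """Eq. 4.7 divided by the number of frames T to normalize out duration."""
    total = sum(log_mixture(x, weights, means, variances) for x in frames)
    return total / len(frames)
```

Working in the log domain avoids the numerical underflow that the product in equation 4.6 would cause for long utterances.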

Expectation maximization (EM) algorithm

The EM algorithm begins with an initial model and estimates a new model such that the likelihood of the model increases with each iteration. This new model is taken as the initial model in the next iteration, and the entire process is repeated until a certain convergence threshold is reached or a predetermined number of iterations has been made. The steps followed in the EM algorithm are summarized below.

1. Initialization:- In this step an initial estimate of the parameters is obtained. The performance of the EM algorithm depends on this initialization. Generally, the LBG or K-means algorithm is used to initialize the GMM parameters.

2. Likelihood computation:- In each iteration the posterior probability of the $i$-th mixture is computed as in equation 4.8

$$ \Pr(i \mid x_t) = \frac{w_i P_i(x_t)}{\sum_{j=1}^{M_g} w_j P_j(x_t)} \tag{4.8} $$

3. Parameter update:- Given the posterior probabilities, the model parameters are updated according to the following expressions [32]

$$ w_i = \frac{1}{T} \sum_{t=1}^{T} \Pr(i \mid x_t) \tag{4.9} $$

$$ \mu_i = \frac{\sum_{t=1}^{T} \Pr(i \mid x_t)\, x_t}{\sum_{t=1}^{T} \Pr(i \mid x_t)} \tag{4.10} $$

$$ \sigma_i^2 = \frac{\sum_{t=1}^{T} \Pr(i \mid x_t)\, |x_t - \mu_i|^2}{\sum_{t=1}^{T} \Pr(i \mid x_t)} \tag{4.11} $$

In the estimation of the model parameters, it is possible to choose either full covariance matrices or diagonal covariance matrices. It is more common to use diagonal covariance matrices for GMMs, since a linear combination of diagonal covariance Gaussians has the same modeling capability as full covariance matrices. Another reason is that speech utterances are usually parameterized with cepstral features. Cepstral features are more compact and discriminative and, most importantly, they are nearly uncorrelated, which allows diagonal covariances to be used by the GMMs. The iterative process is normally carried out 10 times, at which point the model is assumed to have converged to a local maximum.
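One EM iteration (equations 4.8 to 4.11) can be sketched for scalar features as below. This is a simplified illustration under stated assumptions: the data are one-dimensional, the variance update uses the newly estimated mean, every component receives a non-zero soft count, and a small variance floor (`var_floor`, an assumed safeguard not described in the text) prevents collapse. The function names are hypothetical.

```python
import math

def gauss(x, mean, var):
    """Univariate Gaussian density."""
    return math.exp(-0.5 * (x - mean) ** 2 / var) / math.sqrt(2.0 * math.pi * var)

def em_step(data, weights, means, variances, var_floor=1e-6):
    """One EM iteration for a 1-D GMM (Eqs. 4.8 to 4.11)."""
    m = len(weights)
    # E-step: posterior Pr(i | x_t) for every sample and component (Eq. 4.8)
    post = []
    for x in data:
        joint = [weights[i] * gauss(x, means[i], variances[i]) for i in range(m)]
        total = sum(joint)
        post.append([j / total for j in joint])
    # M-step: re-estimate weights, means and variances (Eqs. 4.9 to 4.11)
    new_w, new_mu, new_var = [], [], []
    for i in range(m):
        n_i = sum(p[i] for p in post)
        mu = sum(p[i] * x for p, x in zip(post, data)) / n_i
        var = sum(p[i] * (x - mu) ** 2 for p, x in zip(post, data)) / n_i
        new_w.append(n_i / len(data))
        new_mu.append(mu)
        new_var.append(max(var, var_floor))
    return new_w, new_mu, new_var

# Toy usage: two well-separated clusters, 10 iterations as in the text
data = [-2.1, -1.9, -2.0, 2.0, 1.9, 2.1]
w, mu, var = [0.5, 0.5], [-1.0, 1.0], [1.0, 1.0]
for _ in range(10):
    w, mu, var = em_step(data, w, mu, var)
```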

Maximum a posteriori (MAP) adaptation:- Gaussian mixture models for a speaker can be trained using the modeling described earlier. For this, sufficient training data must be available in order to create a model of the speaker. Another way of estimating a statistical model, which is especially useful when the available training data is of short duration, is maximum a posteriori (MAP) adaptation of a background model trained on the speech data of several other speakers. This background model is a large GMM trained with a large amount of data that encompasses the different kinds of speech that may be encountered by the system. These different kinds may include different channel conditions, compositions of speakers, acoustic conditions, etc. The MAP adaptation steps are summarized below.

For each mixture $i$ of the background model, $\Pr(i \mid x_t)$ is calculated as

$$ \Pr(i \mid x_t) = \frac{w_i P_i(x_t)}{\sum_{j=1}^{M_g} w_j P_j(x_t)} \tag{4.12} $$

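Using $\Pr(i \mid x_t)$, the statistics for the weight, mean and variance are calculated as follows

$$ n_i = \sum_{t=1}^{T} \Pr(i \mid x_t) \tag{4.13} $$

$$ E_i(x_t) = \frac{1}{n_i} \sum_{t=1}^{T} \Pr(i \mid x_t)\, x_t \tag{4.14} $$

$$ E_i(x_t^2) = \frac{1}{n_i} \sum_{t=1}^{T} \Pr(i \mid x_t)\, x_t^2 \tag{4.15} $$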

These new statistics calculated from the training data are then used to adapt the background model, and the new weights ($\hat{w}_i$), means ($\hat{\mu}_i$) and variances ($\hat{\sigma}_i^2$) are given by

$$ \hat{w}_i = \left[ \frac{\alpha_i n_i}{T} + (1 - \alpha_i) w_i \right] \gamma \tag{4.16} $$

$$ \hat{\mu}_i = \alpha_i E_i(x_t) + (1 - \alpha_i) \mu_i \tag{4.17} $$

$$ \hat{\sigma}_i^2 = \alpha_i E_i(x_t^2) + (1 - \alpha_i)(\sigma_i^2 + \mu_i^2) - \hat{\mu}_i^2 \tag{4.18} $$

A scale factor $\gamma$ is used, which ensures that all the new mixture weights sum to 1. $\alpha_i$ is the adaptation coefficient, which controls the balance between the old and new model parameter estimates. $\alpha_i$ is defined as [32]

$$ \alpha_i = \frac{n_i}{n_i + r} \tag{4.19} $$

where $r$ is a fixed relevance factor, which determines the extent of mixing of the old and new estimates of the parameters. Low values of $\alpha_i$ ($\alpha_i \to 0$) de-emphasize the new parameter estimates obtained from the data, while higher values ($\alpha_i \to 1$) emphasize the use of the new training data-dependent parameters. Generally, only the mean values are adapted. It is experimentally shown that mean adaptation gives slightly better performance than adapting all three parameters.
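Mean-only MAP adaptation (equations 4.12 to 4.14, 4.17 and 4.19) can be sketched for scalar features as follows. The default relevance factor of 16 is a commonly used value and is an assumption here, since the value used in this work is not stated; the function names are hypothetical.

```python
import math

def gauss(x, mean, var):
    """Univariate Gaussian density."""
    return math.exp(-0.5 * (x - mean) ** 2 / var) / math.sqrt(2.0 * math.pi * var)

def map_adapt_means(data, weights, means, variances, relevance=16.0):
    """Adapt the means of a 1-D background GMM to the enrollment data."""
    m = len(weights)
    n = [0.0] * m      # soft counts n_i (Eq. 4.13)
    first = [0.0] * m  # sum over t of Pr(i | x_t) * x_t
    for x in data:
        joint = [weights[i] * gauss(x, means[i], variances[i]) for i in range(m)]
        total = sum(joint)
        for i in range(m):
            p = joint[i] / total  # posterior Pr(i | x_t) (Eq. 4.12)
            n[i] += p
            first[i] += p * x
    adapted = []
    for i in range(m):
        if n[i] == 0.0:
            adapted.append(means[i])  # no data for this mixture: keep the UBM mean
            continue
        e_x = first[i] / n[i]                 # E_i(x_t) (Eq. 4.14)
        alpha = n[i] / (n[i] + relevance)     # adaptation coefficient (Eq. 4.19)
        adapted.append(alpha * e_x + (1.0 - alpha) * means[i])  # Eq. 4.17
    return adapted
```

With 16 frames all equal to 1.0, a single-component background model with mean 0 and a relevance factor of 16, the soft count is 16, so the adaptation coefficient is 0.5 and the adapted mean lies halfway between the old mean and the data mean.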

Testing

In the identification phase, mixture densities are calculated for every feature vector for all speakers, and the speaker with the maximum likelihood is selected as the identified speaker. For example, if $S$ speaker models $[s_1, s_2, \ldots, s_S]$ are available after training, speaker identification can be done on a new speech data set. First, the sequence of feature vectors $X = [x_1, x_2, \ldots, x_T]$ is calculated. Then the speaker model $\hat{s}$ is determined which maximizes the a posteriori probability $P(s_s \mid X)$. That is, according to the Bayes rule

$$ \hat{s} = \arg\max_{1 \le s \le S} P(s_s \mid X) = \arg\max_{1 \le s \le S} \frac{P(X \mid s_s)}{P(X)} P(s_s) \tag{4.20} $$

Assuming equal prior probability for all speakers and statistical independence of the observations, the decision rule for the most probable speaker can be redefined as [32]

$$ \hat{s} = \arg\max_{1 \le s \le S} \sum_{t=1}^{T} \log P(x_t \mid s_s) \tag{4.21} $$

The decision in verification is obtained by comparing the score computed using the model of the claimed speaker $s_s$, given by $P(s_s \mid X)$, to a predefined threshold $\theta$. The claim is accepted if $P(s_s \mid X) > \theta$, and rejected otherwise.
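The two decision rules can be sketched as below, assuming the per-frame log-likelihoods of the test utterance under each enrolled speaker model are already available. The dictionary layout and function names are illustrative assumptions.

```python
def identify(frame_logliks_by_speaker):
    """Eq. 4.21: speaker whose model gives the largest summed frame log-likelihood."""
    return max(
        frame_logliks_by_speaker,
        key=lambda spk: sum(frame_logliks_by_speaker[spk]),
    )

def verify(score, threshold):
    """Accept the identity claim when the score exceeds the predefined threshold."""
    return score > threshold
```

Identification compares all enrolled models against each other, whereas verification only compares the claimed speaker's score against the threshold.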

4.3.2 Results and discussion on IITG-MV database

The experimental results are given in Table 4.7, and the corresponding DET plots are shown in Figure 4.3. The respective baseline performances of the mfcc and rmfcc feature based systems are 2.97% and 5.38% for males, 3.69% and 5.46% for females, and 4.24% and 5.65% for the whole set. In all cases the mfcc features provide relatively better performance. However, the importance of the rmfcc feature becomes evident under replay attacks. In that case the respective performances (PFAR) of the mfcc and rmfcc features are 38.81% and 15.72% for male speakers, 65.14% and 51.40% for female speakers, and 54.08% and 31.16% for the whole set. Compared with the corresponding ZFAR, both the mfcc and rmfcc features show a higher PFAR, indicating a sharp degradation in performance under replay attacks. This is mainly due to the use of high-quality replay recordings. However, the performance degradation in the case of the rmfcc features is significantly smaller. For example, for the male, female and whole speaker sets, the differences between PFAR and ZFAR for the mfcc features are 35.84%, 61.45% and 49.84%, respectively. The corresponding values for the rmfcc features are 10.34%, 45.94% and 25.51%. Further, it is observed that fooling the SV system through replay attacks seems to be easier for female speakers, because the differences between ZFAR and PFAR are relatively larger for female trials. This may be due to their fine spectral structures. Interestingly, in the female trials also the rmfcc feature shows comparatively more robustness against replay attacks.

Table 4.7: Speaker verification performance of the short-term mel-cepstral features derived from speech (mfcc) and LP residual (rmfcc) signals with the GMM-UBM based SV system. For zero-effort attempts the performance is expressed in terms of ZFAR. Under replay attacks the performance is expressed in terms of PFAR.

Dataset      Features   ZFAR    PFAR     Difference (PFAR-ZFAR)
Male         mfcc       2.97    38.81    35.84
Male         rmfcc      5.38    15.72    10.34
Female       mfcc       3.69    65.14    61.45
Female       rmfcc      5.46    51.40    45.94
Whole-set    mfcc       4.24    54.08    49.84
Whole-set    rmfcc      5.65    31.16    25.51

The histogram shown in Figure 4.4 clearly displays the potential of the rmfcc features against replay attacks. In zero-effort attempts the mfcc based system performs well, showing a lower ZFAR. Under replay attempts both systems are affected, but the rmfcc based system keeps the FAR values below the corresponding values of the standard mfcc system. These observations show that, although the baseline performances are not comparable, the LP residual signal on its own carries potentially discriminative information against replay signals.

Figure 4.3: DET plot showing the performance of mfcc and rmfcc features based SV systems with zero-effort attempts and under replay attacks. `P' stands for replay attacks.

Figure 4.4: FARs without and with replay attacks for males, females and whole-set. [Bar chart; upper panel: ZFAR, without playback attacks; lower panel: PFAR, with playback attacks; groups: Male, Female, Whole-set; series: rmfcc, mfcc.]

Although the experimental results and the histogram show the robustness of the LP residual signal against replay attacks, confusion may arise due to the large differences in the baseline performances of the mfcc and rmfcc based systems (refer to Figure 4.4). In all cases the respective baseline performances of the mfcc and rmfcc systems are not comparable, indicating that the former contains stronger speaker-specific evidence. It may be argued that the lower error rate of the rmfcc features under the replay attack scenario is due to the presence of weak speaker-specific evidence rather than to replay signal specific discriminatory information. For example, during actual attempts the rmfcc features (due to weak evidence) reject a larger number of genuine trials, and may obviously reject them in replay trials too. Similarly, it may happen that during actual attempts many genuine trials are accepted by both the mfcc and rmfcc features, but with different levels of confidence. The genuine trials accepted by the rmfcc features with comparatively low confidence may be rejected under replay trials. In order to avoid this confusion, we perform different comparative studies at common levels and investigate the usefulness of the LP residual signal evidence against replay attacks. We consider three conditions: comparison on commonly accepted zero-effort target trials, on zero-effort target trials accepted with equal confidence, and finally at equal rejection rate. Under those conditions we re-evaluate the performance of the mfcc and rmfcc based systems and analyze the outcomes in the following subsection using a confidence measure.

Performance evaluation at equal confidence level

In speaker verification terminology the confidence measure (CM) is an estimate of the degree of certainty of the trial speaker hypothesis for a given decision score. Roughly speaking, the CM value is intended to complement the system decisions. For example, for a given score `x' the speaker is accepted with the confidence `y' (CM value). In this study, because of its suitability to the Bayesian decision process, the margin-derived measurement method is considered for CM estimation [34]. In that approach the CM value for a score $S_c$ is defined (Equation 4.22) as the absolute difference between FAR and FRR for that score taken as threshold. It takes into account the distribution of errors with respect to a score. Thus, the closer the score is to the decision threshold, the lower the confidence. A higher magnitude of the CM value shows the strength of the decision taken by the machine.

$$\mathrm{CM}(S_c) = |\mathrm{FAR}(S_c) - \mathrm{FRR}(S_c)| \tag{4.22}$$
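To make Equation 4.22 concrete, the following minimal sketch estimates FAR and FRR empirically from two lists of scores and evaluates the unsigned CM for a given score. The empirical estimation, the acceptance rule (accept when the score is at or above the threshold), the function names and the toy scores are all assumptions made for illustration; they are not taken from the experiments reported here.

```python
import numpy as np


def far_frr(genuine_scores, impostor_scores, threshold):
    """Empirical FAR and FRR when `threshold` is used as the decision threshold.

    Assumption: a trial is accepted when its score is >= threshold.
    """
    genuine = np.asarray(genuine_scores, dtype=float)
    impostor = np.asarray(impostor_scores, dtype=float)
    far = float(np.mean(impostor >= threshold))  # impostors wrongly accepted
    frr = float(np.mean(genuine < threshold))    # genuine trials wrongly rejected
    return far, frr


def confidence_measure(score, genuine_scores, impostor_scores):
    """Unsigned CM of Equation 4.22: |FAR(Sc) - FRR(Sc)| with Sc taken as threshold."""
    far, frr = far_frr(genuine_scores, impostor_scores, score)
    return abs(far - frr)


# Toy score lists, made up purely for illustration.
genuine = [0.2, 0.6, 0.8, 0.9]
impostor = [-0.7, -0.4, 0.1, 0.3]

for sc in (-0.5, 0.25, 0.85):
    print(sc, confidence_measure(sc, genuine, impostor))
```

With these toy scores, FAR and FRR are equal at a score of 0.25, so the CM is zero there and grows as the score moves away from that point in either direction.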

The other advantage of this approach is that the CM value is zero at the point where FAR = FRR, i.e. the EER decision threshold point. In general a score above the threshold is considered as genuine, otherwise impostor. Since the CM value defined in Equation 4.22 is a function of the score, we can easily use the CM values in assigning the system decisions. For example, the CM values of the scores lower than the threshold may be considered for impostors and vice versa. Alternatively, by assigning a system threshold (th) dependent scaling factor α (a function of $S_c$) to the CM values as in Equation 4.23, we can distinguish whether the decision is for genuine or impostor. For example, a negative CM value for a decision score means the corresponding utterance has come from an impostor and vice versa.

$$\mathrm{CM}(S_c) = \alpha(S_c) \cdot |\mathrm{FAR}(S_c) - \mathrm{FRR}(S_c)| \tag{4.23}$$

$$\alpha(S_c) = \begin{cases} +1 & \text{for } S_c \geq th \\ -1 & \text{otherwise} \end{cases} \tag{4.24}$$
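The signed form of the confidence measure can be sketched as follows. The function names and the example FAR, FRR, score and threshold values are hypothetical and serve only to show how the sign is attached.

```python
def alpha(score, th):
    """Scaling factor of Equation 4.24: +1 if the score is at or above th, else -1."""
    return 1.0 if score >= th else -1.0


def signed_cm(score, far_at_score, frr_at_score, th):
    """Signed CM of Equation 4.23 for a score with known FAR and FRR at that score."""
    return alpha(score, th) * abs(far_at_score - frr_at_score)


# Hypothetical values: threshold 0.25, one score above it and one below it.
cm_accept = signed_cm(score=0.8, far_at_score=0.0, frr_at_score=0.75, th=0.25)
cm_reject = signed_cm(score=-0.5, far_at_score=0.75, frr_at_score=0.0, th=0.25)
print(cm_accept, cm_reject)
```

Both decisions have the same magnitude of confidence, and only the sign tells whether the trial is treated as genuine (positive) or as an impostor (negative).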

Figure 4.5: Comparison of baseline mfcc and rmfcc based systems with common CM values ≥ 0.5 for male speakers. (a) mfcc and (b) rmfcc based systems. PTimp represents the number of replay trials rejected as impostors. CMpimp represents the average CM value of rejected replay trials. [Plot of CM value against genuine and playback trials; (a) PTimp = 12, CMpimp = −0.16; (b) PTimp = 79, CMpimp = −0.50.]

For our comparison purpose, we consider decisions taken at equal confidence level by two different systems to be comparable. The steps involved in this comparative study are described below.

1. The truly accepted zero-effort trials (TZ) are selected by using their respective threshold values.

2. The CM values of the truly accepted zero-effort trials are computed by using Equation 4.23.

3. The decisions from two different systems are considered to be at equal confidence level provided their respective CM values differ by less than or equal to 0.05. A 5% tolerance is allowed so that exact matching is not required.

4. The comparative study of mfcc and rmfcc based systems is made only for the commonly and truly accepted zero-effort trials with equal confidence level (TZec).

5. The trials selected in step 4 are verified under replay attempts, where the decisions with negative CM values are considered as impostors (Equation 4.23).

6. In case of replay attempts, the system whose decisions are associated with a larger number of negative CM values is considered comparatively more robust against replay attacks.
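The six steps above can be sketched as a small selection routine. This is only a sketch under stated assumptions: each system is represented by a dictionary that maps a trial identifier to its signed CM value (Equation 4.23), a trial counts as truly accepted when its zero-effort CM is non-negative, and the function name, trial identifiers and CM values are all hypothetical.

```python
def equal_confidence_comparison(zero_cm_a, zero_cm_b, replay_cm_a, replay_cm_b, tol=0.05):
    """Compare two systems on commonly accepted trials of equal confidence.

    Each argument maps a trial id to the signed CM that a system assigns to
    the zero-effort attempt (zero_cm_*) or to its replayed version (replay_cm_*).
    """
    # Steps 1-2: trials truly accepted by each system (non-negative signed CM).
    accepted_a = {t for t, cm in zero_cm_a.items() if cm >= 0}
    accepted_b = {t for t, cm in zero_cm_b.items() if cm >= 0}
    # Steps 3-4: common trials whose CM values differ by at most `tol`.
    common = sorted(
        t for t in accepted_a & accepted_b
        if abs(zero_cm_a[t] - zero_cm_b[t]) <= tol
    )
    # Step 5: replayed versions with negative CM are detected as impostors.
    detected_a = [t for t in common if replay_cm_a[t] < 0]
    detected_b = [t for t in common if replay_cm_b[t] < 0]
    return common, detected_a, detected_b


# Hypothetical signed CM values for four trials.
zero_mfcc = {"t1": 0.60, "t2": 0.70, "t3": 0.52, "t4": -0.10}
zero_rmfcc = {"t1": 0.58, "t2": 0.40, "t3": 0.55, "t4": 0.30}
replay_mfcc = {"t1": 0.40, "t2": 0.50, "t3": -0.10, "t4": 0.20}
replay_rmfcc = {"t1": -0.45, "t2": -0.20, "t3": -0.50, "t4": -0.30}

common, det_mfcc, det_rmfcc = equal_confidence_comparison(
    zero_mfcc, zero_rmfcc, replay_mfcc, replay_rmfcc
)
# Step 6: the system that rejects more replayed trials is the more robust one.
print(len(common), len(det_mfcc), len(det_rmfcc))
```

In this toy case the second system rejects both equal-confidence trials when they are replayed, while the first rejects only one, so the second would be judged more robust under step 6.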

Figure 4.6: Comparison of baseline mfcc and rmfcc based systems with common CM values ≥ 0.5 for female speakers. (a) mfcc and (b) rmfcc based systems. PTimp represents the number of replay trials rejected as impostors. CMpimp represents the average CM value of rejected replay trials. [Plot of CM value against genuine and playback trials; (a) PTimp = 31, CMpimp = −0.12; (b) PTimp = 32, CMpimp = −0.54.]

The CM values vs scores distribution plot of all truly accepted common actual and corresponding replay trials from mfcc and rmfcc based systems is shown in Figure 4.8. The density of the distribution in the negative region (CM < 0) is comparatively higher for the rmfcc based system, indicating that a higher number of replay trials are considered as impostors.

The individual comparative parameters are summarized in Table 4.8. In that table the values in the third column represent the respective numbers of zero-effort truly accepted male, female and combined trials by the mfcc and rmfcc based systems. At equal confidence level the mfcc and rmfcc systems commonly select 238 male trials, 253 female trials, and in total 409 trials with average CM values of 0.54, 0.56 and 0.55, respectively. We consider these trials as references and verify the corresponding decisions with replay attempts. Ideally all these reference trials should be detected as impostors in replay attempts, but neither system achieves this. It means both systems are fooled by replay trials and only some are detected as impostors. The success in detecting the replay trials is comparatively better for the rmfcc based system. For example, the mfcc based system detects only 86 male, 78 female and in total 106 replay trials as impostors, whereas the rmfcc based system detects 184 male, 119 female and in total 275 replay trials as impostors. Further, from the last column of Table 4.8 it can be observed that in all cases the CM values are comparatively more negative for the rmfcc based system. It shows that, as compared to mfcc, the rmfcc based system detects the replay trials with higher confidence, particularly in the case of female speaker trials where the former is largely confused. This confirms the usefulness of the LP residual signal in detecting replay signals.

Figure 4.7: Comparison of baseline mfcc and rmfcc based systems with common CM values ≥ 0.5 for the whole set. (a) mfcc and (b) rmfcc based systems. PTimp represents the number of replay trials rejected as impostors. CMpimp represents the average CM value of rejected replay trials. [Plot of CM value against genuine and playback trials; (a) PTimp = 21, CMpimp = −0.10; (b) PTimp = 121, CMpimp = −0.47.]

Figure 4.8: Confidence levels vs scores distribution of all truly accepted common actual and corresponding replay trials from mfcc and rmfcc based systems. [Plot of CM value against scores for mfcc and rmfcc.]

Table 4.8: Speaker verification performance of equally confident mfcc and rmfcc based systems for zero-effort attempts and with replay attacks. TZ = truly accepted zero-effort trials, TZec = TZ at equal CM value, CMec = average confidence measure (CM) value of TZec, PTimp = trials detected as impostors in replay trials and CMpimp = average CM value of PTimp.

| Dataset | Features | TZ   | TZec | CMec | PTimp | CMpimp |
|---------|----------|------|------|------|-------|--------|
| Male    | mfcc     | 684  | 238  | 0.54 | 86    | -0.23  |
| Male    | rmfcc    | 668  | 238  | 0.54 | 184   | -0.37  |
| Female  | mfcc     | 546  | 253  | 0.56 | 78    | -0.13  |
| Female  | rmfcc    | 537  | 253  | 0.56 | 119   | -0.38  |
| Total   | mfcc     | 1220 | 409  | 0.55 | 106   | -0.17  |
| Total   | rmfcc    | 1202 | 409  | 0.55 | 275   | -0.38  |

A case-by-case analysis of replay trials for the male, female and whole sets is shown in Figures 4.5, 4.6 and 4.7, respectively. In these representations the trials with higher baseline CM values (≥ 0.5) are selected for clear visualization and observation. The lower half of these figures (corresponding to negative CM values) shows impostors. It can be observed that in all cases the rmfcc based system detects more replay trials as impostors. In some cases both mfcc and rmfcc systems detect the replay trials as impostors, but the confidence level of the latter is relatively higher. In the instances where replay trials are missed, the rmfcc based system is confused at almost equal confidence with the respective genuine trials, suggesting explicit modeling of the LP residual signal for a more effective countermeasure. All these observations indicate the importance of excitation source information both in designing a replay attack detection system and in designing a replay attack resistant speaker verification system.
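The counts in Table 4.8 can also be read as detection rates, i.e. the share of equal-confidence reference trials (TZec) that are rejected when replayed (PTimp). This ratio is a derived quantity that is not reported in the table itself, and the names `table_4_8` and `detection_rate` are illustrative assumptions.

```python
# (TZec, PTimp) counts copied from Table 4.8.
# The derived percentage is an illustrative quantity, not a reported result.
table_4_8 = {
    ("Male", "mfcc"): (238, 86),
    ("Male", "rmfcc"): (238, 184),
    ("Female", "mfcc"): (253, 78),
    ("Female", "rmfcc"): (253, 119),
    ("Total", "mfcc"): (409, 106),
    ("Total", "rmfcc"): (409, 275),
}


def detection_rate(tz_ec, pt_imp):
    """Percentage of equal-confidence reference trials rejected when replayed."""
    return 100.0 * pt_imp / tz_ec


rates = {key: detection_rate(*counts) for key, counts in table_4_8.items()}

for (dataset, feature), rate in rates.items():
    print(f"{dataset:7s} {feature:6s} {rate:5.1f}% of replay trials detected")
```

In each of the three sets the rmfcc rate is higher than the mfcc rate, in line with the counts discussed above.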

Chapter 5 Conclusion

This work demonstrates the usefulness of processing excitation source information in the form of the LP residual signal and the GFD signal for developing countermeasures to replay attacks. The countermeasures largely depend on tracing the device response in the replay signal. The existing countermeasures mostly use speech signal based features. These features largely reflect the formant modulations that appear in the spectrum, which remain almost intact in the replay signal. So the SV systems that use similar spectral features for verification and countermeasures remain vulnerable to replay attacks. As an alternative, we investigate and demonstrate the usefulness of LP residual signal based features for detection of replay signals.

It is observed that the LP residual spectral patterns are significantly distorted by the playback device response, indicating the presence of discriminatory information. This information is quite effective in detecting the replay signals. We also observed that the responses of the record and replay devices produce significant distortion in low frequency regions due to the effect of the loudspeaker and in high frequency regions due to the effect of multiple A/D conversions. The parametric representations of LP residual spectral information in the form of RMFCC (capturing low frequency discriminative evidence) and RIMFCC (capturing high frequency discriminative evidence) features effectively detect the replay signals. While deciding the proper LP order for the replay detection study, it was seen that the low frequency distortion is well captured by higher order LP analysis and vice versa.

The performance of the LP residual based RMFCC and RIMFCC features is comparable with the state-of-the-art speech signal based SCMC feature. The proposed representations reflect different evidence among themselves and with respect to the SCMC feature as well. If we are able to successfully exploit the advantage of the fusion approach, the performance can be improved to an acceptable level. These observations lead to the conclusion that the LP residual signal based features are indeed useful to counter replay attacks. The LP residual signal implicitly represents the excitation source information, which includes pitch and glottal flow parameters; this motivates us to derive the GFD signals.

First, we investigate various GFD signal estimation methods and select the best possible approach, particularly in the context of replay detection tasks. In that sense the DYPSA approach of estimating the GFD signal is found to be relatively more effective. With that representation, mel-warped cepstral parameterization is employed to model the GFD signal. In the GMM based replay detection system the proposed GFD-MFCC feature provides a performance of 20.53% with the ASVspoof2017 database. The reason may be the absence of fine structure information. In comparison to the excitation source information based residual feature, the joint use of the proposed GFD-MFCC and the speech signal based constant-Q cepstral coefficients (CQCC) and SCMC provides an improved performance of 8.18%. It is also observed that in the estimated GFD signals the epoch strength is lower and the oscillation is higher in the case of the replay signal.

The future plan is to use both coarse and fine structure information to estimate the GFD signal, to derive a well-discriminating feature that is able to capture all the discriminative evidence present in the replay signal, and then to investigate its potential for detection of replay signals.

Bibliography

[1] J. Villalba and E. Lleida, "Preventing replay attacks on speaker verification systems," in Int. Carnahan Conf. on Security Technology (ICCST), October 2011, pp. 1-8.

[2] M. D. Plumpe, T. F. Quatieri, and D. A. Reynolds, "Modeling of the glottal flow derivative waveform with application to speaker identification," IEEE Transactions on Speech and Audio Processing, vol. 7, no. 5, pp. 569-586, 1999.

[3] T. Kinnunen, M. Sahidullah, H. Delgado, M. Todisco, N. Evans, J. Yamagishi, and K. A. Lee, "The asvspoof 2017 challenge: Assessing the limits of replay spoofing attack detection," in INTERSPEECH, 2017.

[4] Haris B. C., G. Pradhan, S. R. M. Prasanna, R. K. Das, and R. Sinha, "Multivariability speaker recognition database in indian scenario," Int. J. of Speech Technology (Springer), vol. 15, no. 4, pp. 441-453, March 2012.

[5] J. P. Campbell, "Speaker recognition: A tutorial," Proceedings of IEEE, vol. 85, no. 9, pp. 1437-1462, 1997.

[6] T. Kinnunen and H. Li, "An overview of text-independent speaker recognition: from features to supervectors," Speech Communication, vol. 52, pp. 12-40, 2010.

[7] Z. Wu, N. Evans, T. Kinnunen, J. Yamagishi, F. Alegre, and H. Li, "Spoofing and countermeasures for speaker verification: A survey," Speech Communication, vol. 66, pp. 135-153, 2015.

[8] N. Evans, T. Kinnunen, J. Yamagishi, Z. Wu, F. Alegre, and P. De Leon, "Speaker recognition anti-spoofing," in Handbook of Biometric Anti-Spoofing. Springer, 2014, pp. 125-146.

[9] B. L. Pellom and J. H. Hansen, "An experimental study of speaker verification sensitivity to computer voice-altered imposters," in Acoustics, Speech, and Signal Processing, 1999. Proceedings., 1999 IEEE International Conference on, vol. 2. IEEE, 1999, pp. 837-840.

[10] Z. Wu, E. S. Chng, and H. Li, "Detecting converted speech and natural speech for anti-spoofing attack in speaker recognition," in Thirteenth Annual Conference of the International Speech Communication Association, 2012.

[11] Z. Wu, T. Kinnunen, E. S. Chng, H. Li, and E. Ambikairajah, "A study on spoofing attack in state-of-the-art speaker verification: the telephone speech case," in Signal & Information Processing Association Annual Summit and Conference (APSIPA ASC), 2012 Asia-Pacific. IEEE, 2012, pp. 1-5.

[12] K. A. Lee, B. Ma, and H. Li, "Speaker verification makes its debut in smartphone," IEEE Signal Processing Society Speech and Language Technical Committee Newsletter, 2013.

[13] J. Villalba and E. Lleida, "Detecting replay attacks from far-field recordings on speaker verification systems," in Lecture Notes in Computer Science. Springer, pp. 274-285, 2011.

[14] F. Alegre, A. Janicki, and N. Evans, "Re-assessing the threat of replay spoofing attacks against automatic speaker verification," in Proceedings of International Conference of the Biometrics Special Interest Group (BIOSIG), 2014.

[15] J. Makhoul, "Linear prediction: A tutorial review," Proc. IEEE, vol. 63, no. 4, pp. 561-580, Apr. 1975.

[16] S. R. M. Prasanna, C. S. Gupta, and B. Yegnanarayana, "Extraction of speaker-specific excitation information from linear prediction residual of speech," Speech Commun., vol. 48, pp. 1243-1261, Jun. 2006.

[17] Z. Wu, S. Gao, E. S. Chng, and H. Li, "A study on replay attack and anti-spoofing for text-dependent speaker verification," in Proceedings of Asia-Pacific Signal Information Processing Association Annual Summit and Conference (APSIPA ASC), 2014.

[18] S. Shiota, F. Villavicencio, J. Yamagishi, N. Ono, I. Echizen, and T. Matsui, "Voice liveness detection algorithms based on pop noise caused by human breath for automatic speaker verification," in INTERSPEECH, 2015.

[19] R. Font, J. M. Espin, and M. J. Cano, "Experimental analysis of features for replay attack detection results on the asvspoof 2017 challenge," in INTERSPEECH, 2017.

[20] H. Delgado, M. Todisco, M. Sahidullah, N. Evans, T. Kinnunen, K. A. Lee, and J. Yamagishi, "Asvspoof 2017 version 2.0: meta-data analysis and baseline enhancements."

[21] H. A. Patil, M. R. Kamble, T. B. Patel, and M. Soni, "Novel variable length teager energy separation based instantaneous frequency features for replay detection," in INTERSPEECH, 2017.

[22] M. Witkowski, S. Kacprzak, P. Zelasko, K. Kowalczyk, and J. Galka, "Audio replay attack detection using high-frequency features," in INTERSPEECH, 2017.

[23] S. Jelil, R. K. Das, S. R. M. Prasanna, and R. Sinha, "Spoof detection using source, instantaneous frequency and cepstral features," in INTERSPEECH, 2017.

[24] C. Hanilci, "Linear prediction residual features for automatic speaker verification anti-spoofing," Multimedia Tools and Applications, pp. 1-13, 2017.

[25] F. Rumsey and T. McCormick, Sound and recording: applications and theory. CRC Press, 2014.

[26] F. Nordin and T. Eriksson, "A speech spectrum distortion measure with interframe memory," in Proc. IEEE Int. Conf. on Acoustics, Speech, and Signal Processing (ICASSP'01), vol. 2. IEEE, 2001, pp. 717-720.

[27] S. Chakroborty, A. Roy, and G. Saha, "Improved closed set text-independent speaker identification by combining mfcc with evidence from flipped filter banks," International Journal of Signal Processing, vol. 4, no. 2, pp. 114-122, 2007.

[28] T. Ananthapadmanabha and G. Fant, "Calculation of true glottal flow and its components," Speech Communication, vol. 1, no. 3-4, pp. 167-184, 1982.

[29] T. Drugman, M. Thomas, J. Gudnason, P. Naylor, and T. Dutoit, "Detection of glottal closure instants from speech signals: A quantitative review," IEEE Transactions on Audio, Speech, and Language Processing, vol. 20, no. 3, pp. 994-1006, 2012.

[30] A. Prathosh, T. Ananthapadmanabha, and A. Ramakrishnan, "Epoch extraction based on integrated linear prediction residual using plosion index," IEEE Transactions on Audio, Speech, and Language Processing, vol. 21, no. 12, pp. 2471-2480, 2013.

[31] T. Drugman, B. Bozkurt, and T. Dutoit, "Causal-anticausal decomposition of speech using complex cepstrum for glottal source estimation," Speech Communication, vol. 53, no. 6, pp. 855-866, 2011.

[32] D. A. Reynolds, "Speaker identification and verification using gaussian mixture speaker models," Speech Commun., vol. 17, pp. 91-108, Aug. 1995.

[33] D. J. Mashao and M. Skosan, "Combining classifier decisions for robust speaker identification," Pattern Recognition, vol. 39, no. 1, pp. 147-155, 2006.

[34] N. Poh and S. Bengio, "Improving fusion with margin-derived confidence in biometric authentication tasks," in AVBPA. Springer, 2005, pp. 474-483.

List of Publications

National and International Conferences

• Jagabandhu Mishra, Madhusudan Singh and Debadatta Pati, "Processing linear prediction residual signal to counter replay attacks," SPCOM-2018.

• Jagabandhu Mishra, Madhusudan Singh and Debadatta Pati, "LP residual features to counter replay attacks," ICSigSys-2018.

• Jagabandhu Mishra, Madhusudan Singh and Debadatta Pati, "Exploring linear prediction residual signal for developing countermeasures to playback attacks," SCEECS-2018.

• Madhusudan Singh, Jagabandhu Mishra and Debadatta Pati, "Usefulness of linear prediction residual signal for development of replay attacks detection system," NCC-2017.

• Madhusudan Singh, Jagabandhu Mishra and Debadatta Pati, "Development of playback attacks detection system," IEEE-TENCON-2017.

• Madhusudan Singh, Jagabandhu Mishra and Debadatta Pati, "Replay attack: Its effect on GMM-UBM based text-independent speaker verification system," IEEE-UPCON-2016.

Papers Submitted

• Jagabandhu Mishra, Madhusudan Singh and Debadatta Pati, "Modelling glottal flow derivative signal for detection of replay speech samples," NASA/ESA-AHS-2018.
