Multimed Tools Appl DOI 10.1007/s11042-016-3350-1

Multi-factor authentication model based on multipurpose speech watermarking and online speaker recognition

Mohammad Ali Nematollahi 3 & Hamurabi Gamboa-Rosales 1 & Francisco J. Martinez-Ruiz 1 & Jose I. De la Rosa-Vargas 1 & S. A. R. Al-Haddad 2 & Mansour Esmaeilpour 3

Received: 9 May 2015 / Revised: 10 January 2016 / Accepted: 9 February 2016
© Springer Science+Business Media New York 2016

Abstract In this paper, a Multi-Factor Authentication (MFA) method is developed by combining a Personal Identification Number (PIN), a One-Time Password (OTP), and speaker biometrics through speech watermarks. To this end, a multipurpose digital speech watermarking is applied to embed semi-fragile and robust watermarks simultaneously in the speech signal, providing tamper detection and proof of ownership respectively. For the blind semi-fragile watermark, the Discrete Wavelet Packet Transform (DWPT) and Quantization Index Modulation (QIM) are used to embed the watermark in the angle of the wavelet sub-bands where more speaker-specific information is available. For copyright protection of the speech, a blind and robust speech watermark is embedded by applying DWPT and multiplication, manipulating the amplitude of the wavelet sub-bands where less speaker-specific information is available. Experimental results on TIMIT, MIT, and MOBIO demonstrate a trade-off among the recognition performance of the speaker recognition system, robustness, and capacity, which is illustrated by various triangles. Furthermore, a threat model and attack analysis are used to evaluate the feasibility of the developed MFA model. Accordingly, the developed MFA model is able to enhance the security of the system against spoofing and communication attacks while improving recognition performance.

* Mohammad Ali Nematollahi [email protected]

1 Department of Electronics Engineering, Autonomous University of Zacatecas, 98000 Zacatecas, Zac., Mexico

2 Department of Computer & Communication Systems Engineering, Faculty of Engineering, University Putra Malaysia, UPM, Serdang 43400 Selangor Darul Ehsan, Malaysia

3 Computer Engineering Department, Hamedan Branch, Islamic Azad University, Hamedan, Iran


Keywords Speech watermarking · Online speaker recognition · Discrete wavelet packet transform · Threat model · Attack analysis · Multi-factor authentication

1 Introduction

Many biometric features such as face, iris, hand, fingerprint, and speech have been used in biometric systems. Although the False Acceptance Rate (FAR) and False Rejection Rate (FRR) of online speaker recognition systems are not as good as those of other biometric techniques, online speaker recognition remains a suitable technique due to its cost-effectiveness, lower signal processing complexity (owing to the one-dimensional nature of speech), and ease of use with fewer recording restrictions. Speaker recognition comprises speaker identification, which identifies an unknown speaker from a population of known speakers, and speaker verification, the most popular form of biometric verification, which aims to verify the claimed identity of a given speaker from a population of known speakers [28]. The speaker recognition system may not be popular for on-site applications (where the person needs to be in front of the system to be recognized) because it cannot provide the same level of reliability and security as other biometric recognition techniques. On the other hand, it remains popular for online applications, where the person accesses the system through a remote terminal such as a telephone or a network [6]. However, two main problems still exist. Firstly, the recognition performance of online speaker recognition is not high enough (it cannot provide 100 % correct recognition), so users are inconvenienced by false non-matches [18, 30]. Secondly, online speaker recognition is widely used in unattended telephony applications, which are more exposed to malicious manipulation and interference than other biometrics. Therefore, speech has more potential for spoofing attacks.
According to a recent study [38], spoofing is the most likely attack on an online speaker recognition system. In addition to this vulnerability, a high FAR can also threaten the security of online speaker recognition systems. Accordingly, multimodal and Multi-Factor Authentication (MFA) approaches have been applied to solve the recognition-performance and security problems by combining different authentication factors. A few MFA systems have been proposed that embed another authentication factor (i.e. a PIN or token) into the biometric signal to increase the recognition performance and security of the overall system [15]. One model provides Two-Factor Authentication (2FA) by using semi-fragile watermarking to embed smart-card data into the iris image [15]. Although redundant embedding is applied in this system to improve robustness against unintentional attacks, recognition performance can still suffer, since adding more watermark bits can degrade it. Another 2FA technique combines the speaker biometric and a password through speaker recognition and speech recognition respectively [30]. In this technique, the verbal information in the speech signal undergoes speech recognition and is matched against the verbal information in the host's password file. In addition, the speaker biometric information in the speech signal undergoes speaker recognition to verify the user's speech. However, the high error rates of both the speech recognition and speaker recognition systems make this 2FA system inconvenient for users. Moreover, due to the lack of extraction of robust speaker biometric information, it is possible for an attacker to spoof this system.

To overcome these deficiencies, this paper develops an MFA model based on online speaker recognition and multipurpose speech watermarking to provide speech ownership protection and speech tamper detection through the communication channel. However, applying multipurpose speech watermarking can significantly degrade online speaker recognition performance [3, 11, 12, 26], because multipurpose speech watermarking and online speaker recognition have opposite goals: as the Signal-to-Watermark Ratio (SWR) decreases and the robustness of the watermark increases, speaker recognition performance can decrease [3, 4, 11, 12]. Since the main aim of MFA is to enhance recognition performance, applying robust watermark technology in this context is questionable due to its potential degradation of online speaker recognition performance. Therefore, the robust and semi-fragile watermark bits are embedded where less and more speaker-specific sub-bands are available respectively. Basically, discriminative speaker features lie in the low and high frequency bands: the glottis contributes between 100 and 400 Hz, the piriform fossa between 4 and 5 kHz, and the constriction of the consonants around 7.5 kHz [5, 16, 23].

The rest of this paper is organized as follows: first, the proposed MFA model is described; second, the applied methodology is discussed; third, the threat model and attack analysis for the developed MFA model are studied; fourth, experimental results on the proposed system are evaluated; and finally, conclusions and future trends are drawn.

2 General MFA model

In this part, the MFA model is developed based on online speaker recognition and multipurpose speech watermarking technology. Three phases, namely sign-up, login, and recognition, are discussed in detail, together with the possibility of changing the PIN. For clarity, Table 1 presents the notation of the proposed MFA model, which is shown in Fig. 1. Each phase of Fig. 1 (whose idea is inspired by [21]) is explained in detail as follows:

2.1 Sign-up phase

Before login, the speaker must register with the system. This phase can be completed in front of the system or via a secure channel. The speaker needs to perform the following steps:

Step 1. First, the speaker (SPKRi) provides an identity (IDi) and speech (SPKi), and selects a PIN (PINi) personally.

Step 2. Then, the system (OSRS) computes BMi and Mdli as follows:

BMi = VFE(SPKi)
Mdli = VFM(BMi)

Step 3. The system (OSRS) saves Mdli and PINi.

Table 1 Applied notations for the proposed MFA model based on speech biometrics and digital speech watermarking

Symbol        Notation
OSRS          Online speaker recognition system.
SPKR          Speaker (user).
SPKi          Speech of the speaker.
IDi           Identity of the speaker.
BMi           Speaker biometric feature of the speaker.
PINi          PIN selected by the speaker.
Θ1, Θ2, Θ3    Thresholds for the speaker biometric and speech watermarking systems.
Key1          Private key shared between SPKR and OSRS.
Key2          Private key shared between SPKR and OSRS.
OTP           One-Time Password sent by the online speaker recognition system to the speaker.
⊕             XOR operation.
WM_EX(.)      Watermark extraction process.
WM_EM(.)      Watermark embedding process.
Hash(.)       One-way hash function.
VFE(.)        Extracts the speaker biometric feature from the speech signal.
VFM(.)        Models the speaker biometric feature for SPKRi.
VFS(.)        Computes the score.

Step 4. Finally, the speaker (SPKRi) is registered to the system (OSRS) through IDi, Mdli, and PINi.
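The sign-up steps above can be sketched minimally in Python. The stand-ins for VFE(.) and VFM(.) below are toy placeholders (a real system would extract MFCC-style features and train a speaker model), and hashing the PIN with SHA-256 before storage is an extra precaution not mandated by the notation, which simply saves Mdli and PINi:

```python
import hashlib

def vfe(speech_samples):
    # Toy stand-in for VFE(.): a real extractor would compute MFCCs or
    # similar speaker-specific features; here we use two frame statistics.
    n = len(speech_samples)
    mean = sum(speech_samples) / n
    energy = sum(s * s for s in speech_samples) / n
    return (mean, energy)

def vfm(bm):
    # Toy stand-in for VFM(.): a real system would train a speaker model
    # (e.g. a GMM); here the "model" is the feature tuple itself.
    return bm

def sign_up(database, spkr_id, speech, pin):
    """Steps 1-4 of the sign-up phase: extract BMi, build Mdli,
    and store them with the (hashed) PINi under IDi."""
    bm = vfe(speech)       # BMi = VFE(SPKi)
    mdl = vfm(bm)          # Mdli = VFM(BMi)
    database[spkr_id] = {
        "model": mdl,
        "pin_hash": hashlib.sha256(pin.encode()).hexdigest(),
    }
    return database[spkr_id]

db = {}
record = sign_up(db, "SPKR1", [0.1, -0.2, 0.05, 0.3], "1234")
```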

2.2 Login phase

When the speaker needs to be recognized by the MFA system, he or she must perform the following steps:

Step 1. First, the speaker (SPKRi) requests to be recognized by the system (OSRS). Then, the speaker receives Key1, Key2, Hash(.), and an OTP from the online system.

Step 2. Next, the speaker (SPKRi) pronounces a sentence as Si, enters the PIN (C_PINi) into the system (OSRS), and enters the OTP.

Step 3. The speaker performs the following operations:

R = Hash(C_PINi)
WM1 = R ⊕ Key1
WM2 = OTP ⊕ Key2
SWi = WM_EM(Si, WM1, WM2)

Step 4. Finally, the speaker (SPKRi) sends the watermarked speech signal (SWi) during the identification process. Apart from SWi, the speaker (SPKRi) should also send the claimed identity (IDi) in the verification process.
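The Step-3 operations can be sketched as follows. SHA-256 stands in for the unspecified Hash(.), XOR is applied byte-wise, and the OTP is mapped through the same hash so both payloads have the key length; WM_EM itself (the DWPT embedding of Section 3) is left abstract. These choices are illustrative assumptions:

```python
import hashlib

def xor_bytes(a: bytes, b: bytes) -> bytes:
    # Byte-wise XOR; both operands are assumed to have equal length.
    return bytes(x ^ y for x, y in zip(a, b))

def login_payload(c_pin: str, otp: str, key1: bytes, key2: bytes):
    """Step 3 of the login phase:
    R = Hash(C_PINi), WM1 = R xor Key1, WM2 = OTP xor Key2."""
    r = hashlib.sha256(c_pin.encode()).digest()        # R = Hash(C_PINi)
    wm1 = xor_bytes(r, key1)                           # WM1 = R xor Key1
    otp_block = hashlib.sha256(otp.encode()).digest()  # fixed-size OTP block (assumption)
    wm2 = xor_bytes(otp_block, key2)                   # WM2 = OTP xor Key2
    return wm1, wm2

key1 = bytes(range(32))
key2 = bytes(reversed(range(32)))
wm1, wm2 = login_payload("1234", "998877", key1, key2)
```

XOR-ing with a shared private key is self-inverse, so the receiver can undo it with the same key in the recognition phase.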


Fig. 1 The proposed MFA model

2.3 Recognition phase

When a request (SWi) is received by the system (OSRS), the following steps must be performed:

Step 1. First, the system (OSRS) checks the validity of the request (SWi) for speaker identification, and of the claimed identity (IDi) for speaker verification.


Step 2. When Step 1 is valid, the following operations must be performed:

EBMi = VFE(SWi)
EMdli = VFM(EBMi)
Con1 ← VFS(EMdli, Mdli)
[EWM1, EWM2] = WM_EX(SWi)
Con2 ← EWM1 ⊕ Key1
ER = EWM2 ⊕ Key2
Con3 ← ER ? Hash(C_PINi)

Step 3. Check the following conditions:

If Con1 > Θ1 & Con2 < Θ2 & Con3 < Θ3
  Accept the speaker (SPKRi) with the identity IDi for speaker verification.
  Identify the speaker (SPKRi) with the identity IDi for speaker identification.
Else
  Reject the speaker (SPKRi) with the identity IDi for speaker verification.
  Unable to identify the speaker (SPKRi) with the identity IDi for speaker identification.
End
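A matching sketch of the recognition-phase checks follows. The comparison operator "?" and the thresholds are abstract in the notation, so byte-mismatch counts stand in for the watermark distances, and the biometric score VFS(EMdli, Mdli) is passed in as a number; interpreting Con2 as a match on the recovered hashed PIN and Con3 as a match on the recovered OTP block is one plausible reading, not the definitive scheme:

```python
def xor_bytes(a: bytes, b: bytes) -> bytes:
    return bytes(x ^ y for x, y in zip(a, b))

def byte_distance(a: bytes, b: bytes) -> int:
    # Number of mismatching bytes; stands in for the unspecified "?" comparison.
    return sum(x != y for x, y in zip(a, b))

def recognize(ewm1, ewm2, key1, key2, expected_r, expected_otp_block,
              biometric_score, theta1, theta2, theta3):
    """Recognition-phase decision: Con1 > Theta1 and Con2 < Theta2 and Con3 < Theta3."""
    con1 = biometric_score                                           # VFS(EMdli, Mdli)
    con2 = byte_distance(xor_bytes(ewm1, key1), expected_r)          # recovers R from EWM1
    con3 = byte_distance(xor_bytes(ewm2, key2), expected_otp_block)  # recovers the OTP block
    return con1 > theta1 and con2 < theta2 and con3 < theta3

key1, key2 = b"\x01" * 16, b"\x02" * 16
r, otp_block = b"\xaa" * 16, b"\x5b" * 16
ewm1, ewm2 = xor_bytes(r, key1), xor_bytes(otp_block, key2)
ok = recognize(ewm1, ewm2, key1, key2, r, otp_block, 0.9, 0.5, 1, 1)
```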

2.4 Change PIN

The speaker (SPKRi) can also change the PIN freely. For this purpose, the following steps are performed to change the old PINi to a new PIN′i:

Step 1. First, the speaker (SPKRi) sends a password-change request to the system (OSRS).

Step 2. The system (OSRS) sends (Hash(.), Key1, Key2) to the speaker (SPKRi).

Step 3. The speaker (SPKRi) provides his or her identity (IDi) and speech (Si), and enters the old password (PINi) personally. The PINi is secured by Key1:

M1 = Hash(PINi)
PIN_old = M1 ⊕ Key1
M2 = Hash(PIN′i)
PIN_new = M2 ⊕ Key2

Step 4. The speaker (SPKRi) sends the request (IDi, PIN_old, PIN_new, Si) through a secure channel.

Step 5. Next, the following operations are performed to verify the identity of the speaker (SPKRi) in the system (OSRS):

EBMi = VFE(Si)
EMdli = VFM(EBMi)
Con1 ← VFS(EMdli, Mdli)
R1 = PIN_old ⊕ Key1
Con2 ← R1 ? Hash(PINi)


Step 6. Check the following conditions:

If Con1 > Θ1 & Con2 < Θ2
  R2 = PIN_new ⊕ Key2
  Replace PINi with R2 in the system (OSRS).
Else
  Reject the request for the PIN change.
End

3 Methodology

Figure 2 shows the critical bands chosen to embed the watermark for the robust and semi-fragile watermarking techniques. As shown in Fig. 2, the bands selected for the robust watermark carry less speaker-specific information, which causes less degradation of the recognition performance of the online speaker recognition system. In contrast, the bands selected for the semi-fragile watermark carry more speaker-specific information, which ties the watermark intrinsically to the speaker biometric for tamper detection, so any attempt by an adversary destroys the semi-fragile watermark. Furthermore, the semi-fragile speech watermark causes negligible degradation of recognition performance due to its very small watermark intensity. To prevent conflict between wavelet sub-bands, the speech signal is decomposed into 16 critical bands by applying DWPT, whose packet binary-tree algorithm follows the same lines as the wavelet decomposition process [24]. Then, 8 critical bands (numbers 2, 3, 4, 5, 6, 7, 13, and 14), where the F-ratio is low, are chosen for the robust speech watermarking technique, and the remaining critical bands (numbers 1, 8, 9, 10, 11, 12, 15, and 16) are chosen for the semi-fragile speech watermarking technique. As seen, the sub-band samples are sent directly to the watermark extraction blocks to detect the watermarks for the matching block. Figure 2 also presents the overall MFA framework based on the proposed multipurpose speech watermarking technique and online speaker recognition.
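The decomposition-and-routing step can be sketched in pure Python. A Haar filter bank stands in for the unspecified analysis wavelet, bands are taken in natural rather than frequency order, and the band numbers of Fig. 2 are treated as 1-based indices; all of these are simplifying assumptions:

```python
def haar_split(x):
    # One Haar analysis stage: approximation (average) and detail (difference).
    approx = [(x[2 * i] + x[2 * i + 1]) / 2 for i in range(len(x) // 2)]
    detail = [(x[2 * i] - x[2 * i + 1]) / 2 for i in range(len(x) // 2)]
    return approx, detail

def dwpt(x, levels):
    """Full wavelet packet decomposition: both halves of every node are
    split again, giving 2**levels sub-bands at the last level."""
    bands = [list(x)]
    for _ in range(levels):
        nxt = []
        for band in bands:
            approx, detail = haar_split(band)
            nxt.extend([approx, detail])
        bands = nxt
    return bands

# Band assignment from the text (1-based band numbers):
ROBUST_BANDS = [2, 3, 4, 5, 6, 7, 13, 14]            # low F-ratio: robust watermark
SEMI_FRAGILE_BANDS = [1, 8, 9, 10, 11, 12, 15, 16]   # high F-ratio: semi-fragile watermark

frame = [float(i % 7) for i in range(256)]           # toy frame
bands = dwpt(frame, levels=4)                        # 16 critical bands
robust_bands = [bands[i - 1] for i in ROBUST_BANDS]
fragile_bands = [bands[i - 1] for i in SEMI_FRAGILE_BANDS]
```

In practice a library such as PyWavelets would replace the hand-rolled filter bank, and the frequency ordering of the packet nodes would need to be respected when mapping band numbers to tree leaves.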

3.1 Robust speech watermarking algorithm

In this section, a multiplicative technique along with DWPT is applied to embed the robust watermark in the speech signal. In order to develop a suitable robust speech watermarking technique for the MFA

Fig. 2 The developed multipurpose speech watermarking technique for online speaker recognition systems by applying DWPT decomposition


model, it is important to provide a trade-off between robustness and imperceptibility, which directly influences online speaker recognition performance. Therefore, the statistical order of the methods in [29] is improved to manipulate the amplitude of the wavelet sub-bands where less speaker-specific information is available. For this purpose, the speech signal is divided into non-overlapping frames of length N. Then, all samples in a frame are multiplied based on Eqs. (1) and (2):

Σ_{i=1}^{N} r_i = α Σ_{i=1}^{N} s_i,   if m_i = 1   (1)

Σ_{i=1}^{N} r_i = (1/α) Σ_{i=1}^{N} s_i,   if m_i = 0   (2)

where α corresponds to the intensity of the watermark and is slightly greater than 1. Whenever α is increased, the robustness of the watermark increases but the imperceptibility decreases. s_i and r_i correspond to the ith samples of the original and watermarked frame respectively, N is the length of the original data sequence, and m_i is the watermark bit to be embedded. According to recent studies [1, 29], if the variance of the noise, the variance of the host speech signal, and α are known, the watermark bits can be extracted from the energy of the watermarked speech signal by using a predefined threshold, as in Eq. (3):

Σ_{i=1}^{N} r_i² ≷_0^1 T   (3)

where T corresponds to the value of the threshold. However, under a gain attack, which multiplies all samples by a constant, the watermark cannot be detected properly. Therefore, a rational watermark extraction technique should be applied to detect the watermark bits at the receiver. For this purpose, two equal sets A and B are selected from the host speech signal. If they do not have the same energy, their energies are equalized by applying a distortion signal. Then, using Eqs. (1) and (2), the watermark bits are embedded into set A. For watermark extraction, Eq. (4) is applied:

R = (Σ_A r_i^Order) / (Σ_B r_i^Order) ≷_0^1 T   (4)

In Eq. (4), Order must be an even number; Order = 4 is assumed for a better trade-off between imperceptibility and robustness. The details of the threshold (T) estimation for a noisy channel are discussed in Appendix A. The main purpose is to estimate a T that differentiates the two intervals (−∞, T] and [T, ∞) for 0 and 1 respectively. This value cannot be computed exactly; therefore, the threshold is estimated experimentally by simulation. The simulation is run for various threshold values (T) to compute the corresponding BER, and the T with the minimum BER is chosen as the optimum T. In the following, the described statistical model for robust watermarking is applied to embed the watermark bits in the less speaker-specific sub-bands of the DWPT. The embedding and extraction processes are described in detail in the following algorithms:

Robust embedding steps:

a) The host speech signal is segmented into frames Fi of length M.
b) Apply DWPT with D levels on each frame to calculate the different wavelet sub-bands.


c) A data sequence of length N is arranged from the specific selected sub-bands of the last level of the DWPT.
d) The data sequence of length N is divided into two sets A and B of equal length N/2. The two sets must have the same energy; if they do not, their energies are equalized by adding a distortion.
e) In order to improve watermark bit detection, a channel coding technique (Hamming coding) can be applied.
f) Apply Eqs. (1) and (2) to embed the coded watermark bits into set A.
g) The watermarked signal is reconstructed by applying the inverse DWPT.

Robust extraction steps:

a) The watermarked speech signal is segmented into frames Fi of length M (which may be assumed to be a public key between the sender and receiver).
b) Apply DWPT with D levels on each frame to calculate the different wavelet sub-bands (which may be assumed to be a public key between the sender and receiver).
c) A data sequence of length N is arranged from the specific selected sub-bands of the last level of the DWPT.
d) The data sequence of length N is divided into two sets A and B of equal length N/2, each with the same energy.
e) The watermark bits are extracted as in Eq. (4).
f) A decoding technique is applied to detect the watermark bits.
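The multiplicative rule of Eqs. (1)-(2) and the gain-invariant detector of Eq. (4) can be sketched on a single data sequence. Taking set A as the first half and set B as the second, with the halves already energy-equalized, is a simplifying assumption; α, the even Order, and the threshold T = 1 follow the text:

```python
ALPHA = 1.05   # watermark intensity, slightly greater than 1
ORDER = 4      # even Order in Eq. (4)

def embed_bit(seq, bit):
    """Eqs. (1)-(2): scale set A (first half) by alpha for bit 1
    or by 1/alpha for bit 0; set B (second half) is the reference."""
    half = len(seq) // 2
    gain = ALPHA if bit == 1 else 1.0 / ALPHA
    return [x * gain for x in seq[:half]] + list(seq[half:])

def extract_bit(seq):
    """Eq. (4): R = sum_A r_i**Order / sum_B r_i**Order, decided against
    T = 1. A constant channel gain scales numerator and denominator
    equally, so the ratio (and the decision) is unchanged."""
    half = len(seq) // 2
    num = sum(x ** ORDER for x in seq[:half])
    den = sum(x ** ORDER for x in seq[half:])
    return 1 if num / den > 1.0 else 0

# Toy sequence whose halves match, as the energy-equalization step ensures.
host = [0.3, -0.5, 0.2, 0.7, 0.3, -0.5, 0.2, 0.7]
```

Because the even Order makes every term positive and the gain cancels in the ratio, this detector survives the constant-gain attack that defeats the energy threshold of Eq. (3).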

3.2 Semi-fragile speech watermarking algorithm

In this section, the idea of angle quantization index modulation for watermarking in [27] is extended to the speech signal in order to develop a semi-fragile speech watermarking technique for the proposed MFA model. In addition, the distortion effect of the angle quantization is optimized by using the Lagrange method. Therefore, optimized angle quantization along with DWPT is applied to change the energy ratio between two blocks where more speaker-specific wavelet sub-bands are available. Because the quantization of the signal's angles is sensitive to intentional manipulation, it is highly desirable to use a semi-fragile watermark that is tied intrinsically to the speaker-specific sub-bands in order to provide authentication over an unknown channel. Two sets from the host speech signal are required for applying angle quantization. Therefore, x1 and x2 are selected from the host speech signal in order to calculate their polar coordinates as in Eqs. (5) and (6):

θ = arctan(x2 / x1)   (5)

r = sqrt(x1² + x2²)   (6)

Then, the watermark bits are embedded in the angle θ by using Quantization Index Modulation (QIM). However, the watermark may not be detected even without any attack due to the sensitivity and fragility inherent in angle quantization. Therefore, the ratio of energy between two blocks of the host speech signal is quantized in order to overcome this problem. Furthermore, every watermark bit is repeatedly embedded in a frame to reduce the error of the developed semi-fragile speech watermarking technique. For this purpose, every frame is segmented into blocks of length L. Next, two sets X and Y are selected from each block. Then, the angle θ is computed by Eq. (7):

θ = arctan( (Σ_{i=1}^{Lb/2} y_i²) / (Σ_{i=1}^{Lb/2} x_i²) )   (7)

In order to minimize the variation of Y after angle quantization, the Lagrange method is applied to optimize the set Y. Therefore, an optimization problem is formulated in Eq. (8) to estimate the set Y:

Cost:      J(Y) = Σ_{i=1}^{Lb/2} (y_i^Q − y_i)²
Condition: C(X) = Σ_{i=1}^{Lb/2} (y_i^Q)² − θ^Q · E_X = 0   (8)

The optimized values of Eq. (8) are estimated by using the Lagrange method as in Eq. (9):

∇J(Y) = λ ∇C(X)   (9)

Equations (10) and (11) give the optimized values in Eq. (9):

y_i^{Q,Opt} = y_i / (1 − λ^Opt)   (10)

λ^Opt = 1 − sqrt( E_Y / (θ^Q · E_X) )   (11)
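The closed form can be checked numerically: computing λ^Opt = 1 − sqrt(E_Y / (θ^Q · E_X)) and rescaling Y by y_i / (1 − λ^Opt), per Eqs. (10) and (11), makes the constraint Σ(y_i^Q)² = θ^Q · E_X of Eq. (8) hold. The numeric values below are arbitrary toy inputs:

```python
import math

def optimize_y(y, theta_q, e_x):
    """Eqs. (10)-(11): lambda_opt = 1 - sqrt(E_Y / (theta_q * E_X)),
    then y_i^(Q,Opt) = y_i / (1 - lambda_opt)."""
    e_y = sum(v * v for v in y)
    lam = 1.0 - math.sqrt(e_y / (theta_q * e_x))
    return [v / (1.0 - lam) for v in y]

y = [0.4, -0.1, 0.25, 0.3]   # toy Y set
e_x = 0.5                    # toy energy of the X set
theta_q = 1.2                # toy quantized energy ratio
y_opt = optimize_y(y, theta_q, e_x)
residual = sum(v * v for v in y_opt) - theta_q * e_x  # Eq. (8) constraint
```

Since the update is a uniform rescaling of Y, it is also the minimum-distortion solution among all Y vectors satisfying the energy constraint, which is exactly what the Lagrange formulation asks for.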

Semi-fragile embedding steps:

a) The host speech signal is segmented into frames Fi of length M.
b) Apply DWPT with D levels on each frame to calculate the different wavelet sub-bands.
c) A data sequence of length N is arranged from the specific selected sub-bands of the last level of the DWPT.
d) The data sequence of length N is divided into blocks of length L. From each block, two equal-length sets X and Y of size L/2 are selected.
e) The energy ratio E_Y/E_X of the two sets X and Y is computed.
f) The watermark bit is repeatedly embedded into all the blocks of a frame as in Eq. (12):

θ^Q = ⌊(θ + m_i · Δ) / (2Δ)⌋ · 2Δ + m_i · Δ   (12)

where Δ is the quantization step, and m_i and θ^Q correspond to the watermark bit and the watermarked angle of the energy ratio respectively. Whenever the quantization step is decreased, the imperceptibility increases but the robustness decreases.
g) To minimize the watermarking distortion, the Lagrange method is applied to compute the set Y.
h) The watermarked signal is reconstructed by applying the inverse DWPT.


Figure 3 shows the block diagram of the embedding process in the proposed multipurpose speech watermarking technique. In the following, the watermark extraction process, which is the reverse of the embedding process, is described:

Semi-fragile extraction steps:

a) The watermarked speech signal is segmented into frames Fi of length M.
b) Apply DWPT with D levels on each frame to calculate the different wavelet sub-bands.
c) A data sequence of length N is arranged from the specific selected sub-bands of the last level of the DWPT.
d) The data sequence of length N is divided into blocks of length L. From each block, two equal-length sets X and Y of size L/2 are selected.
e) The energy ratio E_Y/E_X of the two sets X and Y is computed.
f) The binary watermark bit is extracted from the watermarked angle as in Eq. (13):

b̂_k = argmin_{b_k ∈ {0, 1}} | r_k − Q_{b_k}(r_k) |   (13)

where r_k corresponds to the angle of the energy ratio of the watermarked signal at the receiver and Q_{b_k} corresponds to the quantization function for the watermark bit b_k ∈ {0, 1}.
g) Steps e to f are repeated until all the watermark bits are extracted from the blocks of a frame.
h) In order to decide on the extracted bit for a frame, the watermark bits extracted from all the blocks are compared to a threshold, which is a number between 0.5 and 1. If the majority of the bits extracted from all blocks is higher than the threshold, the extracted watermark is 1; otherwise, it is 0. Whenever the threshold is set near 1, the fragility of the developed semi-fragile watermark increases.
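The per-block quantization of Eq. (12) and the minimum-distance extraction of Eq. (13) can be sketched as follows, under one plausible reading of the formulas: embedding snaps θ onto the lattice {2kΔ + m_iΔ} of its bit, and extraction picks the bit whose lattice is nearest to the received angle:

```python
import math

DELTA = 0.05  # quantization step; smaller = more imperceptible, more fragile

def qim_embed(theta, bit):
    """Eq. (12): theta_q = floor((theta + bit*delta) / (2*delta)) * 2*delta
    + bit*delta, which lands exactly on the lattice of the embedded bit."""
    return math.floor((theta + bit * DELTA) / (2 * DELTA)) * 2 * DELTA + bit * DELTA

def nearest_lattice(theta, bit):
    # Nearest point of the bit's lattice {2k*delta + bit*delta}.
    return round((theta - bit * DELTA) / (2 * DELTA)) * 2 * DELTA + bit * DELTA

def qim_extract(theta_rx):
    """Eq. (13): b_hat = argmin over bits of |r_k - Q_bk(r_k)|."""
    return min((0, 1), key=lambda b: abs(theta_rx - nearest_lattice(theta_rx, b)))
```

Any perturbation of the received angle larger than Δ/2 can flip the decision, which is precisely the fragility the semi-fragile watermark relies on for tamper detection.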

Fig. 3 Block diagram of embedding process in the proposed multipurpose speech watermarking technique


Figure 4 shows the block diagram of the extraction process in the proposed multipurpose speech watermarking technique.

4 Threat model

In developing the MFA model based on multipurpose speech watermarking and online speaker recognition, the most important issue is analyzing the security of the proposed MFA model. First, however, the definition of security should be clarified. For that reason, in this study the security of the proposed MFA model is discussed in two main parts. In the first part, the security requirements of the MFA model are discussed as the main goals to achieve. In the second part, the attacker model is discussed, covering the potential attacks that the MFA model must deal with.

4.1 Security requirements of the proposed MFA model

Based on the main requirements of the proposed MFA model, two applications of the watermarking are discussed as follows:

a) Fingerprinting: This application is useful for identifying the legitimate speaker who pronounced the speech signal. The ownership of the speech signal should be traceable even when an adversary seriously colludes against the speech signal. The adversary should also not be able to easily create ambiguity for the legitimate speaker when detecting his or her fingerprints, which have already been embedded into the speech signal. To achieve this watermark property, robust digital speech watermarking should be applied. Let S be the speaker, C the speech signal, A an adversary, Dist(.,.) the perceptual distance measurement between two speech signals, T(.) the tracing function for detecting the watermark, and t the threshold. This property can be formally defined as:

Definition: For the fingerprinted speech signal C, C ∈ S is robust against any adversary attack J = A(C) such that Dist(C, J) < t. When an efficient function T(.) is available, then T(J) ∈ S.

Fig. 4 Block diagram of extraction process in the proposed multipurpose speech watermarking technique


b) Tamper detection: This application is useful for checking the originality of the speech signal. A receiver of the speech signal should be able to ensure that the speech signal has not been tampered with by an unauthorized party. To achieve this watermark property, semi-fragile digital speech watermarking should be applied. Let CHL(.) be the unintentional degradation function (i.e., the channel effect) on the speech signal and V(.) the efficient tamper-proofing verification function which extracts the semi-fragile watermark.

Definition: For the authenticated speech signal C, C ∈ S is free from any tampering when both of the following conditions hold:

I. For any negligible effect on the speech signal C' = CHL(C) such that Dist(C, C') < t, then Prob[V(C') = Yes] < ε1.
II. For any adversary A and the speech signal J = A(C) such that Dist(C, J) < t, then Prob[V(J) = Yes] > ε1.

These two statements show that when the speech signal is tampered with intentionally, the probability of tamper detection is higher than the threshold, whereas when the speech signal is only manipulated unintentionally, the probability of tamper detection is less than the threshold and may be negligible. The reason for using both types of watermarks is that the robust and semi-fragile watermarks each serve different applications of the proposed MFA model. For instance, robust speech watermarking can ensure that the adversary does not alter the speech signal, which is important against replay attacks, non-repudiation, and session hijacking. Semi-fragile speech watermarking, on the other hand, can ensure that the speech signal is not tampered with intentionally, which protects the developed MFA model against spoofing, template, eavesdropping, theft, and copying attacks.

4.2 Attacker model for the proposed MFA model

Apart from the security requirements for the developed MFA model, it is crucial to look at the attacks the proposed MFA model must deal with. A suitable attacker model can properly improve the security of the system, since it can reveal the potentially vulnerable points of the proposed MFA model. Although it cannot predict which kinds of attacks will be used by an adversary, it enables rigorous treatment of the potential attacks. In the following, two categories of attacks are discussed: general attacks and signal processing attacks.

4.2.1 General attacks

Guessing attack It is highly desirable for an MFA system to be secure against guessing or exhaustive search attacks. In effect, a guessing attack increases the FAR of the online speaker recognition system. The increase can be achieved by a brute-force search in which an adversary records or synthesizes speech. Using the False Match Rate (FMR) results of [30], the keyspace for the speech lies between 1/0.007 ≈ 142.9 and 1/0.0003 ≈ 3333.3. Furthermore, a 20-bit PIN has a keyspace of 2^20 = 1,048,576.


It can be seen that PIN (1,048,576) > speech (3333.3). Although neither the keyspace of the PIN nor that of the speech is large enough on its own to be secure against guessing and exhaustive search attacks, employing both the PIN and the speaker biometric can be adequate, since guessing the combination of PIN and speaker biometric is not easy: a large keyspace can defend the MFA system against these attacks. Beyond this, the adversary may try to extract the watermark from the speech signal. Yet even when the watermark can be extracted, it is just a secured message, the result of hashing the PIN and encrypting it with a key (Hash(PIN) ⊕ Key). Therefore, the adversary needs both the key and the hashing function.
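The keyspace comparison can be reproduced directly; the FMR bounds 0.007 and 0.0003 are the values taken from [30]:

```python
# Effective keyspace of the speech factor: the reciprocal of the FMR.
speech_low = 1 / 0.007      # about 142.9 effective guesses at the worst FMR
speech_high = 1 / 0.0003    # about 3333.3 effective guesses at the best FMR
pin_space = 2 ** 20         # 20-bit PIN: 1,048,576 combinations

# When an adversary must guess both factors, the keyspaces multiply.
combined_low = pin_space * speech_low
```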

Plain text or template attack Plain-text or template attacks mainly happen on the speech biometric side. An adversary can attack the speech biometric because speech is not a secret; therefore, full speech template protection is not achievable for the speaker biometric. The best way to assure this security is to authenticate speech that is captured live rather than supplied as a pre-recorded file. An OTP can improve the security of the MFA system by revealing any manipulation of the speech template.

Eavesdropping, theft, and copying attacks One of the threats is stealing the PIN. This may be done by an eavesdropping attack, which requires the adversary to have a physical presence. Using the combination of speaker biometric, PIN, and OTP as an MFA system is a good defense against this attack because the adversary would need to steal all of these factors. Furthermore, theft and copying attacks are difficult because the OTP and PIN are hidden in the speech signal. In addition to the watermarking technology that protects the speaker biometric template from theft and copying attacks, using the Exclusive OR (XOR) operation as an encryption can secure the MFA system.

Counterfeiting or spoofing attack Similar to theft and copying attacks, a forgery attack can threaten the speaker biometric at the sensor. Although the speaker biometric can be replaced easily and has no secrecy, the communication channel of the speaker biometric system can be protected by a combination of robust and semi-fragile speech watermarking.

Replay attack In a replay attack, an adversary tries to insert a speech signal on the channel between the speaker and the online speaker recognition system. Even when the speech is encrypted, the adversary can still put the encrypted data on the channel, and when speech is sent directly, the speech signal can be replayed.
The main defense mechanism is verification of the legitimacy of the speech signal, which is successfully done by digital speech watermarking: robust speech watermarking can ensure that the adversary has not altered the speech signal. Furthermore, using the OTP as a timestamp can resist a replay attack, since at any time a delayed transmission of the watermarked speech signal can be terminated by the MFA system.

Trojan horse attack A Trojan horse attack masquerades as a trusted application to gain the speaker's information, and is used to steal the PIN and speaker biometric. The main defense is assurance about the legitimacy and trustworthiness of the authentication capture sensor. Not much can be done once the speaker enters his or her speaker biometric and PIN into a Trojan horse. However, using an OTP can help the MFA system resist this attack: even when the speaker biometric is replaced by one containing a Trojan horse that produces a yes-match for anyone, the adversary cannot produce the PIN and OTP.

Multimed Tools Appl

Denial-of-service attack In some conditions, an adversary tries to increase the FRR to trigger the system's lockout mechanism, which exists to limit the number of incorrect attempts. Such an attack can be defended against by combining the speaker biometric and PIN in an MFA system, in which the adversary cannot simply make an arbitrary number of incorrect attempts.

Session hijack In some situations, a previously valid watermarked speech signal may be recorded and exploited for unauthorized access, which is known as session hijacking. For every login session, a unique OTP is embedded as a timestamp into the speech signal; the uniqueness of each session guarantees the freshness property [9].

Man-in-the-middle The speech recorded at the sensor must pass through many components of the online speaker recognition system; therefore, the integrity of the system must be reliable against man-in-the-middle attacks [34]. Using two keys together with a hash function protects the watermark from any misuse. Furthermore, the semi-fragile watermark is tied intrinsically to the speaker biometric, which prevents the adversary from injecting a compromised PIN; any such attempt is detected through the semi-fragile watermark.

Non-repudiation In some conditions, it is required that the sender cannot deny having sent the speech signal to the receiver; the sender's ability to deny is known as plausible deniability [22]. This threat is countered by combining the speaker biometric, OTP, and PIN in an MFA system. As a result, it is difficult for the speaker to deny the transmission, because all three factors, the speaker biometric, the PIN, and the OTP timestamp, are present in the speech signal at the same time.
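The OTP-as-timestamp defense against replay and session hijacking described above can be sketched as follows. This is a Python illustration; SHA-256 standing in for the hash function and the 5-second delay tolerance are assumptions, since the paper fixes neither.

```python
import hashlib

MAX_DELAY_S = 5.0  # assumed transmission-delay tolerance; the paper fixes no value

def make_otp(secret, session_id, timestamp):
    """Derive a one-time value bound to a session and a timestamp.

    SHA-256 stands in for the paper's (unspecified) hash function.
    """
    msg = f"{secret}|{session_id}|{timestamp:.3f}".encode()
    return hashlib.sha256(msg).hexdigest()

def verify_otp(secret, session_id, otp, timestamp, now, seen):
    """Reject stale (replayed) or already-used (hijacked-session) OTPs."""
    if now - timestamp > MAX_DELAY_S:   # delayed too long -> terminate session
        return False
    if otp in seen:                     # reuse across sessions -> hijack attempt
        return False
    if otp != make_otp(secret, session_id, timestamp):
        return False                    # forged or mismatched OTP
    seen.add(otp)
    return True
```

The `seen` set enforces per-session uniqueness, and the delay check terminates replays of an old but otherwise valid watermarked signal.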

4.2.2 Signal processing attacks

Sometimes an adversary applies signal processing operations, such as adding noise, filtering, or compression, to remove the watermark signature from the speech signal; these operations also distort the speech itself. Providing security against signal processing attacks is important precisely because such security is difficult to formalize, and designing a digital speech watermark that can resist every possible signal processing operation is very hard.
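The noise-addition attack mentioned above can be sketched as follows (a Python illustration; the paper's experiments are in MATLAB). The 440 Hz tone used as a stand-in for a watermarked frame and the 20 dB target are arbitrary choices for the example.

```python
import math
import random

def add_awgn(signal, snr_db, rng):
    """Add white Gaussian noise so the result has the requested SNR in dB.

    An illustrative noise-addition attack only; the filtering and
    compression attacks mentioned above are not reproduced here.
    """
    p_sig = sum(x * x for x in signal) / len(signal)   # average signal power
    p_noise = p_sig / (10 ** (snr_db / 10))            # target noise power
    sigma = math.sqrt(p_noise)
    return [x + rng.gauss(0.0, sigma) for x in signal]

# A 440 Hz tone at 16 kHz stands in for a watermarked speech frame.
signal = [math.sin(2 * math.pi * 440 * i / 16000) for i in range(10000)]
noisy = add_awgn(signal, 20.0, random.Random(0))
```

Sweeping `snr_db` and measuring the watermark's bit error rate after extraction is exactly the kind of robustness evaluation reported later in Fig. 10.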

4.3 Attack analysis of the proposed MFA model

The amount of risk and threat for single-factor authentication methods increases due to the lack of security in ordinary online speaker recognition systems. These systems are vulnerable to malware attacks, replay attacks, offline brute-force attacks, key-logger Trojans, dictionary attacks, and shoulder surfing. Recently, Two-Factor Authentication (2FA) has become mandatory in many governmental policies [19]. Four levels of assurance are defined by the Office of Management and Budget (OMB 04-04) as in [7]; each level indicates the degree of confidence that the user is in fact legitimate. Table 2 presents these four levels. As seen in Table 2, applying a cryptographic hash function, speaker biometrics, and multipurpose speech watermarking places the developed MFA system at level 4. Furthermore, the registration of the speaker biometric features and the PIN at enrolment for each user gives the developed MFA system sufficient protection against different attacks.

Table 2 Levels of assurance in authentication systems based on OMB 04-04

| Level | Secret | Definition |
| Level 1 | Any type of token with no identity proof | Low confidence assurance available in identifier technique. |
| Level 2 | Single-factor authentication with some identity proof | Medium confidence assurance available in identifier technique. |
| Level 3 | MFA with stringent identity proof | High confidence assurance available in identifier technique. |
| Level 4 | MFA + crypto token with registration per person | Very high confidence assurance available in identifier technique. |

Due to the diversity of proposed user authentication methods, a standard defining five levels of user authentication has been established [19]; Table 3 shows these five levels. From the protection requirements summarized in Table 4, it can be concluded that the proposed MFA system withstands the various attacks and is therefore at level 5.

5 Experimental setup

In this part, a simulation was conducted to evaluate the performance of the developed MFA model. For the baseline, two state-of-the-art systems using Mel Frequency Cepstral Coefficients (MFCC) were employed: GMM-UBM [33] and i-vector PLDA [10, 17] speaker verification systems. For the evaluation of speaker identification, only a GMM speaker identification system [31, 32] was constructed to study the effects of multipurpose speech watermarking on recognition performance. The MSR Identity MATLAB Toolbox v1.0 [35] was used to construct the speaker verification systems, the VOICEBOX MATLAB toolbox [8] was applied to the speaker identification system, and the DataHash MATLAB function [36] was used as the hash function. The remaining components, the semi-fragile and robust speech watermarking, were implemented directly in MATLAB. The speech signals of the TIMIT [13], MIT [37], and MOBIO [25] databases were used in this experiment.

Table 3 Five levels of user authentication [19, 20]

| Level 1 | Uses offline registration of identification information such as PIN, OTP, etc. |
| Level 2 | Uses a soft token issued based on a reliable identification of the user, already performed by the government through a passport, driver's license, etc. |
| Level 3 | Uses a combination of an accredited certificate (a soft token) with other security factors such as a mobile phone, security card, security token, etc. |
| Level 4 | Uses a combination of an accredited certificate (a soft token) with other hardware security devices like OTP. |
| Level 5 | Uses a combination of an accredited certificate (a soft token) with watermarked biometric information, like a key with fingerprints. |

Table 4 Required authentication protection mechanism for each level [19, 20]

| Required protection | Level 1 | Level 2 | Level 3 | Level 4 | Level 5 |
| Online guessing | Yes | Yes | Yes | Yes | Yes |
| Replay | Yes | Yes | Yes | Yes | Yes |
| Eavesdropper | No | Yes | Yes | Yes | Yes |
| Verifier impersonation | No | No | Yes | Yes | Yes |
| Man-in-the-middle | No | No | Yes | Yes | Yes |
| Session hijacking | No | No | No | Yes | Yes |
| Signer impersonation | No | No | No | No | Yes |

The simulation parameters were assumed as follows: (a) L = 8, Δ = π/64, Tp = 0.9, α = 1.15, and T = 0.95; (b) the wavelet level was 4, and Daubechies' wavelet function was used for the DWPT; (c) each frame was 32 ms long, equal to N = Fs × 0.032 = 512 samples; (d) for channel coding, the Hamming method was used with n = 15 and k = 11. Due to the mutually competing criteria inherent in watermarking, Fig. 5 shows triangles for different online speaker recognition systems based on multipurpose speech watermarking, each focusing on a single watermarking criterion. As seen, concentrating on one watermarking criterion degrades the others. Therefore, providing reasonable and acceptable performance by trading off capacity, robustness, and recognition performance (or imperceptibility) is application dependent. Due to the correlation between imperceptibility (in terms of SWR or SNR) and speaker recognition performance, only the EER and the identification rate are depicted in these triangles. For better visualization, each criterion on each axis was normalized to the range 0 to 1 with respect to the maximum on that axis; for example, the capacity axis was divided by 32 bps and the watermark's robustness by 100. Furthermore, each axis is arranged in ascending order for consistency, so a larger criterion value lies farther from the axis origin. For clarity, the experimental results for the online speaker identification and verification systems are discussed in separate triangles. Figures 6 and 7 present the effect of capacity on online speaker recognition performance and on the robustness of the watermarking system.
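The Hamming(15,11) channel coding in item (d) can be sketched as below. This is the standard textbook construction with parity bits at positions 1, 2, 4, and 8; the paper does not specify its bit ordering, so this systematic layout is an assumption.

```python
def hamming1511_encode(data):
    """Encode 11 data bits into a 15-bit Hamming codeword (even parity)."""
    code = [0] * 16                                # 1-indexed positions 1..15
    dpos = [i for i in range(1, 16) if i & (i - 1)]  # non-power-of-two slots
    for bit, pos in zip(data, dpos):
        code[pos] = bit
    for p in (1, 2, 4, 8):                         # parity positions
        parity = 0
        for i in range(1, 16):
            if i & p and i != p:
                parity ^= code[i]
        code[p] = parity
    return code[1:]

def hamming1511_decode(code15):
    """Correct up to one bit error and return the 11 data bits."""
    code = [0] + list(code15)
    syndrome = 0
    for p in (1, 2, 4, 8):
        parity = 0
        for i in range(1, 16):
            if i & p:
                parity ^= code[i]
        if parity:
            syndrome += p
    if syndrome:
        code[syndrome] ^= 1                        # flip the erroneous bit
    dpos = [i for i in range(1, 16) if i & (i - 1)]
    return [code[i] for i in dpos]
```

A single flipped watermark bit per 15-bit block, for example one caused by channel noise, is corrected before the PIN is reassembled.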
Clearly, embedding more watermark bits injects more noise into the speech signal and increases the probability of detecting watermark bits in error, which affects the other watermark criteria. As seen, whenever the capacity in bps increased, the other criteria, including the robustness and imperceptibility of the watermarked signal, decreased, and the recognition performance of the MFA system degraded accordingly. Figures 8 and 9 show the effect of robustness on the other watermarking criteria. As seen, whenever the watermark was embedded with greater intensity, imperceptibility decreased. Furthermore, embedding fewer watermark bits reduced the probability of detecting the watermark in error, which also matters for the watermark's robustness. Clearly, increasing the robustness of the watermark can decrease the recognition performance.
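The intensity-imperceptibility trade-off is easiest to see in scalar QIM. The sketch below is generic textbook QIM using the step size Δ = π/64 assumed in the simulation parameters, not the paper's angle-domain implementation: a larger Δ tolerates more noise (robustness) at the cost of larger embedding distortion (imperceptibility).

```python
import math

def qim_embed(x, bit, delta):
    """Quantize a sample onto the bit's dithered lattice (standard scalar QIM)."""
    offset = 0.0 if bit == 0 else delta / 2
    return round((x - offset) / delta) * delta + offset

def qim_extract(y, delta):
    """Decode by choosing the nearer of the two lattices."""
    d0 = abs(y - qim_embed(y, 0, delta))
    d1 = abs(y - qim_embed(y, 1, delta))
    return 0 if d0 <= d1 else 1
```

Decoding stays correct as long as the accumulated noise on a sample is below Δ/4, which is why enlarging Δ buys robustness while enlarging the worst-case embedding error Δ/2.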


Fig. 5 Triangles for online (a) speaker verification and (b) speaker identification systems based on multipurpose speech watermarking

From the above discussion, it can be concluded that the optimum system depends on the application. For forensic use, where the owner of the speech signal must be identified, a system with less robustness and less capacity is suitable.

Fig. 6 Triangles for the effect of capacity on other watermarking criteria in online speaker verification system

Fig. 7 Triangles for the effect of capacity on other watermarking criteria in online speaker identification system

For data copyright protection, a system with more robustness and less capacity is more suitable, while for steganographic purposes a system with high capacity is desirable. Furthermore, the results confirm that speaker verification and identification behave similarly under speech watermarking. Figure 10 presents the BERs at different SNRs for both the robust and semi-fragile speech watermarking techniques. As seen, the robust watermark could be extracted even at very low SNR: its detection error was below 10 % for SNRs higher than 20 dB. The semi-fragile watermark, however, requires a high SNR to be extractable; even for SNRs above 80 dB, its detection error did not fall below 5 %. Therefore, the multipurpose speech watermarking technique can provide tamper detection and proof of ownership at the same time. Next, the performance of the MFA system based on online speaker recognition and multipurpose digital speech watermarking is evaluated. The output of the MFA system was affirmative only when both authentication factors, the speaker PIN and the speaker biometric features, were satisfied at the same time. This was evaluated by applying a Boolean AND operation between the result of the online speaker recognition system and the watermark extracted from the speech signal. To investigate the performance of the MFA system, 8 of the 10 waves from each of the 630 speakers of the TIMIT database were used to train the speaker models. For testing, the 630 speakers were divided into two sets of 315, each speaker contributing the 2 waves not used in training. The first set was considered legitimate users, and the encrypted PIN of each speaker was embedded into the speech signals; based on speaker recognition and PIN extraction from this set, the FRR was estimated.
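The Boolean AND fusion of the two factors can be sketched as follows; the names and the score convention are illustrative, not the paper's API.

```python
def mfa_decision(speaker_score, threshold, extracted_pin, expected_pin):
    """Affirmative only if the biometric factor AND the watermark factor pass.

    speaker_score/threshold stand in for the verifier's output; the
    extracted PIN comes from the semi-fragile watermark channel.
    """
    biometric_ok = speaker_score >= threshold
    pin_ok = extracted_pin == expected_pin
    return biometric_ok and pin_ok
```

Because acceptance requires both factors, an impostor who clears the biometric threshold but lacks the embedded PIN (or vice versa) is still rejected, which is what drives the FAR toward zero in the results below.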
The second set was considered impostor users, and no encrypted PIN was embedded into their speech signals; based on speaker recognition and extraction of a random PIN from this set, the FAR was estimated. For the MIT and MOBIO speech databases, assumptions similar to TIMIT were used to evaluate FAR and FRR: the testing sets were likewise divided into legitimate users and impostor users, and the performance of the MFA was then evaluated. Figure 11 presents the DET curve for the TIMIT, MIT, and MOBIO databases. As seen, when both factors are used for authentication, changing the threshold value changes only the FRR.

Fig. 8 Triangles for the effect of the watermark's robustness on (a) 32 bps, (b) 16 bps, and (c) 4 bps in online speaker verification system

Therefore, it is impossible to extract and decrypt the PIN and simultaneously generate the speaker's biometric features; this removes the FAR from the developed MFA system. For better visualization in Fig. 11, the results of each database were shifted by β and 2β, respectively. Table 5 compares recognition performance, security level, time, and memory between the speaker recognition systems and the developed MFA system. Although time and memory are slightly greater for the developed MFA system, it outperforms the plain speaker recognition systems in terms of both security and recognition performance. This improvement comes from allocating a unique PIN to each speaker, embedded into the less speaker-specific sub-bands of the speech.


Fig. 9 Triangles for the effect of the watermark’s robustness on (a) 32 bps, (b) 16 bps, and (c) 4 bps in online speaker identification system

As shown in Table 5, the performance for both the MIT and MOBIO speech databases was significantly lower than for the TIMIT database due to channel and environmental distortion effects. Furthermore, it is confirmed that the i-vector speaker verification system performed better than the GMM-UBM system, because the i-vector system uses low-dimensional feature vectors compared to GMM-UBM. However, applying multipurpose digital speech watermarking in the MFA model with both the i-vector and GMM-UBM speaker verification systems improved the EER to the same value of 0 %.


Fig. 10 BER (%) for different SNRs for both robust and semi-fragile speech watermarking techniques

Fig. 11 EER for the developed MFA model for different speech databases

Table 5 Comparison between the developed MFA and current speaker recognition systems

| System | Identification rate (%) | EER (%) for i-vector | EER (%) for GMM-UBM | Security level | Time (s) | Memory (Kb) |
| Speaker recognition for TIMIT | 99.5 | 0.73 | 1.12 | Level 2 (single authentication factor) | 234 | 872,432 |
| Speaker recognition for MIT | 54.65 | 15.43 | 23.04 | Level 2 (single authentication factor) | 251 | 575,324 |
| Speaker recognition for MOBIO | 50.43 | 45.23 | 46.24 | Level 2 (single authentication factor) | 371 | 654,234 |
| The developed MFA model | 100 | 0 | 0 | Level 4 (MFA + crypto token + registration per person) | 421 | 948,432 |

6 Conclusion and future works

In this paper, a MFA method was developed by combining multipurpose speech watermarking and online speaker recognition. For this purpose, a multipurpose speech watermarking technique was developed by applying DWPT, QIM, and multiplication. This watermarking technique is applied to improve the accuracy and communication channel security of speaker recognition systems for online biometric applications. Because the robust watermark is embedded in the less speaker-specific sub-bands of the speech, its degradation effect on the recognition performance is minimal. Furthermore, the fragile watermark is tied intrinsically to the speaker biometric for tamper detection, so any adversarial attempt destroys it. The threat model and attack analysis of the proposed system were studied. It was shown that the communication channel security and recognition performance of online speaker recognition systems can be enhanced by the proposed MFA model through multipurpose speech watermarking technology, and that the proposed MFA model reaches level 5. The proposed MFA model is developed as a proof of concept only; therefore, a gap between the theoretical and real implementations always exists. For future work, there is an opportunity for researchers to develop the MFA model further.

Acknowledgments The authors would like to thank the anonymous reviewers who made helpful comments on drafts of this paper.

Appendix A

The distribution of the Discrete Fourier Transform (DFT) coefficients is assumed to be Weibull, whereas the distribution of the DWPT sub-bands is assumed to be a Generalized Gaussian Distribution (GGD) [2]. Assuming a zero mean (μ_s = 0) and variance σ_s², the GGD can be defined as in Eq. (14):

$$f_s(s;\mu,\sigma_s,v)=\frac{1}{2\,\Gamma(1+1/v)\,A(\sigma_s,v)}\exp\left(-\left|\frac{s-\mu}{A(\sigma_s,v)}\right|^{v}\right)\tag{14}$$

where $\Gamma(\cdot)$ is the Gamma function, $\Gamma(x)=\int_0^{\infty}t^{x-1}e^{-t}\,dt\approx\sqrt{2\pi}\,x^{x-1/2}e^{-x}$, and $v$ controls the shape of the distribution, which can be estimated from the statistical moments of the signal.
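As a numerical sanity check of Eq. (14), the Python sketch below assumes the common scale factor A(σ_s, v) = σ_s·sqrt(Γ(1/v)/Γ(3/v)), which makes the variance equal σ_s²; this parameterization is an assumption about the paper's A(σ_s, v). Under it, integrating x⁴ against the pdf reproduces the closed-form fourth moment used later in Eq. (20).

```python
import math

def ggd_pdf(x, sigma, v):
    """Zero-mean GGD of Eq. (14), scaled so that the variance is sigma**2.

    A = sigma*sqrt(Gamma(1/v)/Gamma(3/v)) is the standard choice and an
    assumed match for the paper's A(sigma_s, v).
    """
    A = sigma * math.sqrt(math.gamma(1 / v) / math.gamma(3 / v))
    return math.exp(-abs(x / A) ** v) / (2 * A * math.gamma(1 + 1 / v))

def mu4_formula(sigma, v):
    """Closed-form fourth moment, matching Eq. (20)."""
    return sigma**4 * math.gamma(1 / v) * math.gamma(5 / v) / math.gamma(3 / v) ** 2

def mu4_numeric(sigma, v, R=20.0, n=100000):
    """Trapezoidal integration of x^4 * pdf over [-R, R]."""
    h = 2 * R / n
    total = 0.0
    for i in range(n + 1):
        x = -R + i * h
        w = 0.5 if i in (0, n) else 1.0
        total += w * x**4 * ggd_pdf(x, sigma, v)
    return total * h
```

For v = 2 the GGD reduces to the Gaussian and the formula gives the familiar μ₄ = 3σ⁴.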


If the watermarked speech signal passes through an AWGN channel, the received watermarked speech signal can be formulated as in Eqs. (15) and (16):

$$r_i=\alpha\,s_i+n_i \quad \text{if } m_i=1\tag{15}$$

$$r_i=\frac{1}{\alpha}\,s_i+n_i \quad \text{if } m_i=0\tag{16}$$

where $n_i$ is the noise contaminating the watermarked speech signal. To estimate the detection statistic when the watermark bit is 1, Eq. (17) is expressed:

$$R|1=\frac{\sum_A(\alpha s_i+n_i)^4}{\sum_B(s_i+n_i)^4}=\frac{\alpha^4\sum_A s_i^4+4\alpha^3\sum_A s_i^3 n_i+6\alpha^2\sum_A s_i^2 n_i^2+4\alpha\sum_A s_i n_i^3+\sum_A n_i^4}{\sum_B s_i^4+4\sum_B s_i^3 n_i+6\sum_B s_i^2 n_i^2+4\sum_B s_i n_i^3+\sum_B n_i^4}\tag{17}$$

As seen, the sums of the different terms in Eq. (17) affect the detection threshold. By the Central Limit Theorem (CLT), each series in the numerator and denominator can be approximated by a Normal distribution. Because μ is large and the speech frames are long, the Normal approximation almost always generates positive values, which makes it suitable for modeling terms such as $\sum_A n_i^4$ that are always positive. Equations (18) and (19) compute the mean and variance, respectively:

$$E\left\{\sum s_i^4\right\}=\sum E\{s_i^4\}=M\mu_4\tag{18}$$

$$\operatorname{var}\left(\sum s_i^4\right)=E\left\{\left(\sum s_i^4-M\mu_4\right)^2\right\}=\sum E\left\{\left(s_i^4-\mu_4\right)^2\right\}=\sum\left(E\{s_i^8\}-\mu_4^2\right)=M\mu_8-M\mu_4^2\tag{19}$$

where M is the length of each of the sets A and B. By applying the moments of the GGD for r = 4 and r = 8, Eqs. (20) and (21) are obtained:

$$\mu_4=\frac{\sigma_s^4\,\Gamma(1/v)\,\Gamma(5/v)}{\Gamma^{2}(3/v)}\tag{20}$$

$$\mu_8=\frac{\sigma_s^8\,\Gamma^{3}(1/v)\,\Gamma(9/v)}{\Gamma^{4}(3/v)}\tag{21}$$


By considering Eqs. (18) and (19), Eq. (22) is formulated:

$$\sum s_i^4\sim N\!\left(M\mu_4,\;M\mu_8-M\mu_4^2\right)\tag{22}$$

If the mean of the noise is assumed to be zero, Eq. (23) can be expressed:

$$n_i\sim N\!\left(0,\sigma_n^2\right)\;\Rightarrow\;E\{n_i^m\}=\begin{cases}0 & \text{for } m=2k+1\\ (m-1)(m-3)\cdots1\cdot\sigma_n^m & \text{for } m=2k\end{cases}\tag{23}$$

Then, the Normal approximation of the fourth-moment noise component can be estimated as in Eq. (24):

$$\sum n_i^4\sim N\!\left(3M\sigma_n^4,\;96M\sigma_n^8\right)\tag{24}$$

The other terms in Eq. (17) are computed in Eqs. (25) to (27):

$$\sum s_i^3 n_i\sim N\!\left(0,\;M\mu_6\sigma_n^2\right),\qquad \mu_6=\frac{\sigma_s^6\,\Gamma^{2}(1/v)\,\Gamma(7/v)}{\Gamma^{3}(3/v)}\tag{25}$$

$$\sum s_i^2 n_i^2\sim N\!\left(M\sigma_s^2\sigma_n^2,\;3M\mu_4\sigma_n^4-M\sigma_s^4\sigma_n^4\right)\tag{26}$$

$$\sum s_i n_i^3\sim N\!\left(0,\;15M\sigma_s^2\sigma_n^6\right)\tag{27}$$
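The constants 3Mσ_n⁴ and 96Mσ_n⁸ in Eq. (24) follow directly from the double-factorial moment formula of Eq. (23), since var(n⁴) = E{n⁸} − (E{n⁴})² = (105 − 9)σ_n⁸. A short Python check:

```python
def gaussian_even_moment(m, sigma):
    """E{n^m} = (m-1)(m-3)...1 * sigma^m for even m, per Eq. (23)."""
    assert m % 2 == 0
    result = sigma ** m
    k = m - 1
    while k > 1:
        result *= k
        k -= 2
    return result

# Constants used in Eq. (24): E{n^4} = 3*sigma^4 and
# var(n^4) = E{n^8} - (E{n^4})^2 = (105 - 9)*sigma^8 = 96*sigma^8.
var_n4 = gaussian_even_moment(8, 1.0) - gaussian_even_moment(4, 1.0) ** 2
```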

In order to simplify the computation, two auxiliary parameters p and q are introduced in Eq. (28), so that R|1,p,q can be formulated as in Eq. (29):

$$p=\sum_B s_i^4,\qquad q=\frac{\sum_A s_i^4}{\sum_B s_i^4}\tag{28}$$

$$R|1,p,q=\frac{\alpha^4 pq+4\alpha^3\sum_A s_i^3 n_i+6\alpha^2\sum_A s_i^2 n_i^2+4\alpha\sum_A s_i n_i^3+\sum_A n_i^4}{p+4\sum_B s_i^3 n_i+6\sum_B s_i^2 n_i^2+4\sum_B s_i n_i^3+\sum_B n_i^4}=\frac{u}{w}\tag{29}$$

where u and w denote the numerator and denominator, whose distributions are given by Eqs. (30) and (31):

$$f_U(u)\sim N\!\Big(\alpha^4 pq+6\alpha^2 M\sigma_s^2\sigma_n^2+3M\sigma_n^4,\;16\alpha^6 M\mu_6\sigma_n^2+36\alpha^4\big(3M\mu_4\sigma_n^4-M\sigma_s^4\sigma_n^4\big)+16\alpha^2\cdot15M\sigma_s^2\sigma_n^6+96M\sigma_n^8\Big)\tag{30}$$

$$f_W(w)\sim N\!\Big(p+6M\sigma_s^2\sigma_n^2+3M\sigma_n^4,\;16M\mu_6\sigma_n^2+36\big(3M\mu_4\sigma_n^4-M\sigma_s^4\sigma_n^4\big)+16\cdot15M\sigma_s^2\sigma_n^6+96M\sigma_n^8\Big)\tag{31}$$


The density of u/w is computed to estimate the pdf of R|1,p,q. Considering the independence and Normal distribution of the two parameters u and w, Eq. (32) can be expressed:

$$f_{R|1,p,q}(r)=\int_{-\infty}^{\infty}|w|\,f_{U,W}(wr,w)\,dw\tag{32}$$

Since U and W are independent Normal variables, $f_{U,W}(u,w)$ factorizes as in Eq. (33):

$$f_{U,W}(u,w)=f_U(u)\,f_W(w)\tag{33}$$

Equation (34) is the closed-form solution of Eq. (32), which has already been discussed in the literature [14]:

$$D(r)=\frac{b(r)\,c(r)}{a^{3}(r)}\,\frac{1}{\sqrt{2\pi}\,\sigma_u\sigma_w}\left[2\Phi\!\left(\frac{b(r)}{a(r)}\right)-1\right]+\frac{1}{a^{2}(r)\,\pi\,\sigma_u\sigma_w}\,e^{-\frac{1}{2}\left(\frac{\mu_u^2}{\sigma_u^2}+\frac{\mu_w^2}{\sigma_w^2}\right)}\tag{34}$$

Each parameter in Eq. (34) is defined in Eqs. (35) to (38):

$$a(r)=\sqrt{\frac{r^2}{\sigma_u^2}+\frac{1}{\sigma_w^2}}\tag{35}$$

$$b(r)=\frac{\mu_u r}{\sigma_u^2}+\frac{\mu_w}{\sigma_w^2}\tag{36}$$

$$c(r)=\exp\!\left(\frac{1}{2}\,\frac{b^{2}(r)}{a^{2}(r)}-\frac{1}{2}\left(\frac{\mu_u^2}{\sigma_u^2}+\frac{\mu_w^2}{\sigma_w^2}\right)\right)\tag{37}$$

$$\Phi(r)=\int_{-\infty}^{r}\frac{1}{\sqrt{2\pi}}\,e^{-u^2/2}\,du\tag{38}$$
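Equations (34)-(38) form the standard closed-form density for the ratio of two independent Normal variables (Hinkley's result). The sketch below implements them with Python's math module so the density can be checked numerically; the parameter values in the check are arbitrary.

```python
import math

def phi_cdf(x):
    """Standard Normal CDF, Eq. (38)."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def ratio_density(r, mu_u, sigma_u, mu_w, sigma_w):
    """Density D(r) of u/w for independent Normals, Eqs. (34)-(37)."""
    a = math.sqrt(r**2 / sigma_u**2 + 1.0 / sigma_w**2)            # Eq. (35)
    b = mu_u * r / sigma_u**2 + mu_w / sigma_w**2                   # Eq. (36)
    c = math.exp(0.5 * b**2 / a**2
                 - 0.5 * (mu_u**2 / sigma_u**2 + mu_w**2 / sigma_w**2))  # Eq. (37)
    term1 = (b * c / a**3) / (math.sqrt(2 * math.pi) * sigma_u * sigma_w) \
            * (2.0 * phi_cdf(b / a) - 1.0)
    term2 = 1.0 / (a**2 * math.pi * sigma_u * sigma_w) \
            * math.exp(-0.5 * (mu_u**2 / sigma_u**2 + mu_w**2 / sigma_w**2))
    return term1 + term2
```

With both means zero the expression collapses to the Cauchy density, a convenient special case for verifying the implementation.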

As a result, Eq. (39) formulates the density of R|1:

$$f_{R|1}(r|1)=\int_{L}^{U}\int_{-\infty}^{\infty}f_{R|1,p,q}(r|1,p,q)\,f_P(p)\,f_Q(q)\,dp\,dq\tag{39}$$

The lower bound L and upper bound U are applied to restrict the energy ratio between the two sets A and B, as stated in Eq. (40):

$$L<\frac{\sum_A r_i^4}{\sum_B r_i^4}<U\tag{40}$$
