
Speaker Identification Using Discriminative Features and Sparse Representation

Yu-Hao Chin, Jia-Ching Wang, Senior Member, IEEE, Chien-Lin Huang, Kuang-Yao Wang, and Chung-Hsien Wu, Senior Member, IEEE

Abstract— Speaker identification is an important topic with relevance to various disciplines. This paper proposes a novel speaker identification system that consists of two major components: feature extraction and a sparse representation classifier (SRC). Although SRC has been utilized for many classification purposes, few studies have provided insight into the link between SRC and the i-vector, the most commonly used speaker identification feature. To combine the i-vector and SRC effectively, we use probabilistic principal component analysis and the Bartlett test to extract high-quality i-vectors that construct a discriminative dictionary in SRC, supporting effective speaker identification. Besides improving the dictionary from the i-vector side, we also utilize dictionary learning to further enhance the content of the dictionary. Two learning methods are proposed: a robust principal component analysis dictionary and an SVD-dictionary. Furthermore, we propose constructing a noise dictionary and combining it with the original dictionary to absorb and suppress noise during sparse coding. Various coding methods are utilized and analyzed. A comparison with existing speaker identification methods reveals that the proposed method outperforms the baselines and confirms its feasibility.

Index Terms— Sparse representation classifier (SRC), speaker identification.

I. INTRODUCTION

STUDIES of biometric identification have been performed for decades. In recent years, studies on access security have increasingly emphasized biometric identification because of the uniqueness of biological characteristics. This work focuses on voiceprints. The use of voice to identify a speaker is called speaker identification, which is increasingly utilized in daily life as smart phones and Internet services grow in popularity. For example, many cellphones can be unlocked vocally. Most speaker identification systems involve statistical Gaussian mixture models (GMMs) [1]–[3].

Manuscript received August 14, 2016; revised February 10, 2017; accepted February 10, 2017. Date of publication March 6, 2017; date of current version June 5, 2017. The associate editor coordinating the review of this manuscript and approving it for publication was Dr. Venu Govindaraju. Y.-H. Chin and J.-C. Wang are with the Department of Computer Science and Information Engineering, National Central University, Taoyuan City 320, Taiwan (e-mail: [email protected]). C.-L. Huang is with Voicebox Technologies Inc., Bellevue, WA 98004 USA. K.-Y. Wang was with the Department of Computer Science and Information Engineering, National Central University, Taoyuan City 320, Taiwan. He is now with Wistron Corporation, Xinbei City 221, Taiwan. C.-H. Wu is with the Department of Computer Science and Information Engineering, National Cheng Kung University, Tainan 701, Taiwan (e-mail: [email protected]). Color versions of one or more of the figures in this paper are available online at http://ieeexplore.ieee.org. Digital Object Identifier 10.1109/TIFS.2017.2678458

The widespread use of GMMs for speaker modeling is motivated by efficient parameter estimation procedures that maximize the likelihood between models and data. Another framework for speaker recognition is driven by deep neural networks (DNNs) trained for automatic speech recognition (ASR) [4]. Some studies explore speaker recognition by emphasizing the optimal organization and utilization of the speaker information present in the training and development data [5]. Although such issues are important, they are beyond the scope of this paper.

This paper adopts the sparse representation classifier (SRC) [6] for speaker identification. SRC has recently attracted considerable attention in many fields, such as face recognition [6], iris recognition [7], texture classification [8], abnormal event detection [9], music genre classification [10], object tracking [11], machinery fault diagnosis [12], and machine condition monitoring [13]. Based on sparse representation, SRC searches for the most compact representation of a signal as a linear combination of atoms in an overcomplete dictionary. A test vector is classified by assigning it to the target class that minimizes the reconstruction error computed from the associated combination coefficients.

Several investigations have applied SRC to speaker recognition. In 2011, Kua et al. [14] combined supervector extraction and SRC for speaker verification, but their experimental results revealed that the method did not significantly outperform the baseline (GMM-UBM). In 2012, Boominathan and Murty [15] directly utilized vocal features to construct a dictionary for SRC-based speaker verification. They used the orthogonal matching pursuit (OMP) algorithm, a greedy approximation to l0 optimization, for sparse representation. However, the SRC-based approach in [15] underperformed the existing Gaussian mixture model-universal background model (GMM-UBM) approach. Herein, we combine the GMM-UBM with SRC after making several improvements to both. Experimental results show that the proposed system indeed outperforms the conventional GMM-UBM.

Many state-of-the-art speaker identification systems extract i-vectors [16] from speech signals; these i-vectors can be regarded as compact representations of speaker utterances [17]. Namely, an i-vector is a concise representation of the characteristics of a speaker, and it captures speaker and channel variability in automatic speaker verification (ASV) systems [18]. The relevant literature supports the following two observations: 1) the i-vector has already become a major feature


representation in studies of speaker-related tasks; 2) sparse representation is a common classifier in other audio-related recognition tasks, but the work of [15] implies that SRC does not achieve satisfactory performance in speaker identification. Considering these issues, it is intuitive to use the i-vector as the feature and SRC as the classifier in a speaker identification system. However, the experimental results in [14] imply that directly combining SRC with the i-vector may not be sufficient for speaker identification. Both the i-vector and SRC should be improved so that they integrate well with each other and enhance performance.

This paper develops a novel speaker identification system that produces discriminative i-vectors and dictionaries. First, to obtain discriminative atoms, this paper utilizes probabilistic principal component analysis (PPCA) to extract PPCA-supervectors in a Bayesian manner [19], and chooses the number of eigenvalues using the Bartlett test [20] to remove redundant components. The remaining components are expected to support the construction of a high-quality variability matrix in the extraction of i-vectors. Second, this paper proposes enhancing the discriminability of the dictionary using robust principal component analysis (RPCA) [21] or singular value decomposition (SVD) [22]. A noise dictionary-based SRC is also investigated, but it does not achieve satisfactory performance. The performance of these dictionary construction methods is analyzed in detail. In addition, various coefficient coding methods are investigated, and their relationship to the speaker identification task is discussed.

The rest of this paper is organized as follows. Section II explores related work. Section III provides an overview of the proposed system. Section IV describes feature extraction in detail. Section V explores SRC. Section VI summarizes the performance of the proposed method and presents the corresponding analyses. Section VII draws conclusions and makes recommendations for future research.

II. RELATED WORK

This section reviews methods of automatic speaker recognition. First, the term "speaker recognition" should be clarified: it covers two tasks, speaker identification and speaker verification [23]. Speaker identification is the task of identifying the speaker of an utterance from a set of known speakers. In speaker verification, by contrast, the speaker of the utterance claims an identity, and the system verifies the correctness of the claim. Although this paper is concerned with speaker identification, verification methods are also explored herein because they are similar to identification methods. The challenge of speaker recognition arises from the many sources of variability in a signal, such as within-speaker variability, task stress, vocal effort and style, emotion, physiological phenomena, and disguise. Many studies have addressed the problems of noise and channel variability by relying on high-level knowledge. For example, the accent of a person's speech often remains unchanged even when ill health changes other aspects of the voice [23]. The first research on automatic speaker recognition was carried out in 1926 [24] and was based on speech waveforms.

Later, in 1962, the voiceprint was proposed [25], enabling speakers to be identified by visually comparing speech spectrograms. However, later works questioned the effectiveness of the voiceprint method [26], [27]. MFCC [28] and linear predictive coding-based [29] features were subsequently utilized in speaker recognition, and several feature normalization methods were proposed [30]–[32]. These works reveal that vocal feature extraction was the main focus of early research. Recently, however, speaker modeling methods built on vocal features have significantly improved identification performance, even though further improving feature robustness is difficult.

Most of these methods are related to GMMs. The GMM has become a popular speaker modeling tool and was first used in speaker recognition in [2]. The probabilistic nature of the GMM can be exploited to capture the variability of speech. In 2000, the universal background model (UBM) was introduced [33]. The UBM is an initial state of the GMM and can be adapted to yield a speaker-dependent GMM; this combination is called GMM-UBM. In 2006, [34] further used the GMM-UBM to extract supervectors from a signal [35], and then used the supervectors as features to train a support vector machine (SVM) for speaker identification. Besides the SVM, factor analysis is a dominant trend operating on supervectors. The current state-of-the-art factor analysis method is the i-vector method [36]; another is joint factor analysis (JFA) [37]. Some studies have combined factor analysis-based methods with the SVM [36], [38], [39] and achieved good performance.

Like the SVM, SRC has also been used in speaker recognition [14], [40]–[43]. In 2010, Naseem et al. [44] utilized SRC in speaker identification. They developed two sub-systems using GMM supervectors constructed from Mel-frequency cepstral coefficients (MFCCs) and spectral centroid frequencies, respectively. These GMM supervectors were then adopted to construct a dictionary for SRC. The fusion of the two sub-systems slightly improved speaker identification performance. In 2012, Haris et al. [40] proposed an SRC-based method for speaker verification that combines supervector extraction and SRC; they further constructed dictionaries using KSVD-related methods. In 2013, Chen et al. [41] combined the i-vector and atom-aligned sparse representation to implement systems for speaker identification and verification; their experiments demonstrated robustness against emotional variability in speech. In 2014, Nie et al. [42] utilized sparse representation to eliminate the effect of intrinsic variability on speaker verification, showing that an SRC-based system is robust against intrinsic variability. In 2015, Haris et al. extended this line of work [43] with joint sparse coding over learned dictionaries; the computation cost and recognition performance were discussed in [43]. Other speaker recognition approaches have also been proposed recently [45]–[54], such as multimodal neural network-based speaker identification [45], speaker identification under noisy and reverberant conditions [46], and deep neural network-based speaker recognition [50]. Although these methods are also important, they are beyond the scope of this paper.


Fig. 1. Flowchart of proposed system.

III. SYSTEM OVERVIEW

Figure 1 provides an overview of the framework of the proposed system, which consists of two main modules: one for feature extraction and the other for SRC. First, acoustic features are extracted from the clips. A UBM is trained and used to extract PPCA-supervectors; in the implementation of PPCA, the number of eigenvalues is determined using the Bartlett test to discard redundant components. Next, by training the total variability matrix, the PPCA-supervector is converted to a lower-dimensional space that exhibits channel and speaker information. The resulting i-vectors are used to construct a dictionary via dictionary learning methods, and the learned dictionary is finally employed in SRC.

IV. FEATURE EXTRACTION

This section illustrates the stepwise process of feature extraction, summarized in four parts: GMM-supervector, PPCA-supervector, Bartlett test, and i-vector.

A. GMM-Supervector

The GMM-supervector is a commonly used feature in speaker identification. Figure 2 displays the extraction process. To obtain a supervector, vocal features (such as MFCCs) are first extracted from the audio files of a background corpus, which consists of a large number of speech clips. From these clips, a UBM is constructed using the expectation-maximization (EM) algorithm. When the system receives a speech clip, the supervector extraction adapts the Gaussian components of the UBM to the received clip using the method proposed in [33], yielding an adapted GMM. Finally, the mean vectors of the adapted Gaussian components are concatenated into a column vector, called the supervector. The supervector has been shown to capture the vocal characteristics of an audio file well. A sketch of this extraction is given below.
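The following is a minimal sketch of the MAP mean adaptation of [33], assuming a scikit-learn GaussianMixture serves as the UBM and that only the means are adapted. The relevance factor r = 16 is a common choice in the literature, not a value stated in this paper.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

def gmm_supervector(ubm: GaussianMixture, X: np.ndarray, r: float = 16.0):
    """MAP-adapt the UBM means to one clip and concatenate them.

    X is a (frames, dims) matrix of vocal features from a single clip.
    """
    gamma = ubm.predict_proba(X)                 # posteriors, (frames, components)
    n = gamma.sum(axis=0)                        # soft frame counts per component
    Ex = (gamma.T @ X) / np.maximum(n, 1e-10)[:, None]  # per-component data means
    alpha = (n / (n + r))[:, None]               # data-dependent adaptation weights
    adapted_means = alpha * Ex + (1.0 - alpha) * ubm.means_
    return adapted_means.reshape(-1)             # concatenated supervector
```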

Fig. 2. Flowchart of GMM-supervector extraction.

B. PPCA-Supervector

Although the performance of the supervector is promising, its dimensionality is very high. Principal component analysis (PCA) is extensively used to reduce the dimensionality of data while retaining its characteristics by eliminating unimportant principal components.

Probabilistic PCA [55] applies Bayesian theory to the estimation of principal components. A probabilistic model is constructed in PPCA, and it is expected to capture the characteristics of the data better than conventional PCA. The details of PPCA are as follows. First, consider the factor analysis equation [19]:

x = Vz + u + ε   (1)

where x is a d × 1 raw vector, V is a d × M transformation matrix with M < d, and u is the d × 1 mean vector. The latent factor z is assumed to follow a Gaussian distribution N(0, I), and the noise is defined as ε ∼ N(0, σ²I). Under these assumptions, the original data x can be modeled as a Gaussian distribution N(u, σ²I + WWᵀ), where W denotes the factor loading matrix. A model of this form is called a PPCA model [19], [56], [57]. In conventional PCA, on the other hand, the main ingredient V is identified to reduce the dimensionality of the data:

x = u + Vz   (2)

According to Eqs. (1) and (2), the factor analysis formula of PPCA is similar to that of PCA; the PPCA model can be regarded as PCA with an additional noise term. Multiple PPCA models can be mixed with each other. Defining the mixture weight, mean vector, factor loading matrix, and noise variance of the i-th component as w_i, u_i, W_i, and σ_i², respectively, PPCA models are mixed as follows [19]:

p(x) = Σ_{i=1}^{M} w_i p(x|i)   (3)

p(x|i) = N(u_i, σ_i²I + W_i W_iᵀ)   (4)

Comparing Eq. (4) with the formula for the Gaussian mixture model, the Gaussian components are replaced by PPCA models. Maximum likelihood estimation is utilized to estimate W and σ. The likelihood function p(x|W, σ²) ∼ N(u, σ²I + WWᵀ) is

p(x|W, σ²) = |2π(σ²I + WWᵀ)|^(−1/2) exp(−(1/2)(x − u)ᵀB(x − u))   (5)


Fig. 3. Flowchart of extraction of PPCA-supervector.

where B = (σ²I + WWᵀ)⁻¹. The logarithmic form of Eq. (5) is

ln p(x|W, σ²) = −(1/2)( d ln(2π) + ln|σ²I + WWᵀ| + (x − u)ᵀ(σ²I + WWᵀ)⁻¹(x − u) )   (6)

To obtain the optimum values of W and σ, Eq. (6) is partially differentiated with respect to W and σ. The optimum values are then

W* = U(Y − σ²I)^(1/2) R   (7)

σ²* = (1/(d − m)) Σ_{i=m+1}^{d} λ_i   (8)

where U is a d × m matrix formed from the m selected eigenvectors, Y is an m × m matrix formed from the m selected eigenvalues, and R is an identity matrix. Based on the above equations, the latent factor z has the following form:

z = (σ²I + WᵀW)⁻¹ Wᵀ(x − u)   (9)

Figure 3 shows a flowchart of the PPCA-supervector estimation steps. First, an adapted latent factor UBM is constructed. The latent factor is then extracted from the input speech clip x. Finally, the PPCA-supervector is obtained via the adapted latent factor UBM. A more detailed illustration of PPCA-supervector extraction can be found in [19].

C. Bartlett Test

Although the PPCA-supervector approach reduces the dimensionality of the supervector, the target dimensionality can be determined in several ways. In most work on PPCA, the number of selected components (the dimensionality of the PPCA-supervector) depends on the researchers' experience or a pre-defined threshold on the values of the selected components. However, such methods are unstable across tasks because the characteristics of the components are not analyzed precisely. Instead of determining the number of components by experience or tuning, this paper uses the Bartlett test to determine the number of selected principal components [20]. Specifically, in a mixture of PPCA components, each PPCA component has its own ingredients, so the number of principal components should be determined independently for each PPCA component; the Bartlett test is used for this determination. The conventional Bartlett test determines whether n samples have the same variance. The hypothesis is defined as

H₀: σ₁² = σ₂² = ... = σ_n²   (10)

For the selection of principal components, this paper assumes that the small eigenvalues are equal; the hypothesis thus becomes

H₀: λ_{k+1} = λ_{k+2} = ... = λ_n   (11)

where k denotes the number of selected principal components. The Bartlett test statistic can be expressed as

T = ( Π_{q=k+1}^{n} λ_q ) / ( Σ_{q=k+1}^{n} λ_q / (n − k) )^(n−k)   (12)

If the number of observations m is sufficiently large, the distribution of T can be approximated by a χ² distribution, and T takes the following form:

T ≈ (m − (2n − 11)/6)( (n − k) ln λ̄ − Σ_{q=k+1}^{n} ln λ_q )   (13)

where λ̄ is the average of the last (n − k) eigenvalues appearing in Eq. (13). Equations (12) and (13) are used to select k by trying various values of k and determining whether H₀ is rejected. To implement this verification process, a pre-defined threshold χ²_{α,n} must be determined, where α is the significance level. The hypothesis H₀ is rejected if T > χ²_{α,n}; otherwise, it is accepted. If H₀ is rejected for the current value of k, then k principal components are selected; otherwise, k is replaced by (k − 1) and the verification is repeated. Figure 4 illustrates the whole verification process for a case with 36 eigenvalues. A sketch of this selection procedure, together with the closed-form PPCA solution, is given below.

Fig. 4. Verification process of Bartlett test.
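The following minimal sketch combines the Bartlett-based selection of k (Eqs. (11)–(13)) with the closed-form PPCA solution (Eqs. (7)–(9)). It assumes eigenvalues sorted in descending order, uses the paper's χ²_{α,n} threshold and its accept/reject rule, and takes R = I in Eq. (7); the function names are ours.

```python
import numpy as np
from scipy.stats import chi2

def bartlett_select_k(eigvals, m, alpha=0.05):
    """Select the number of principal components k per Eqs. (11)-(13).

    eigvals: all n eigenvalues in descending order; m: number of observations.
    Starting from a large k, k is accepted once H0 (equality of the last
    n - k eigenvalues) is rejected; otherwise k is decremented, as described.
    """
    n = len(eigvals)
    threshold = chi2.ppf(1.0 - alpha, df=n)      # the paper's chi^2_{alpha,n}
    for k in range(n - 2, 0, -1):
        tail = eigvals[k:]                        # the last (n - k) eigenvalues
        T = (m - (2 * n - 11) / 6.0) * (
            (n - k) * np.log(tail.mean()) - np.log(tail).sum())  # Eq. (13)
        if T > threshold:                         # H0 rejected: keep k components
            return k
    return 1

def ppca_closed_form(X, k):
    """Closed-form PPCA (Eqs. (7)-(9)) on row-wise data X, keeping k components."""
    u = X.mean(axis=0)
    Xc = X - u
    eigvals, eigvecs = np.linalg.eigh(np.cov(Xc, rowvar=False))
    order = np.argsort(eigvals)[::-1]             # descending eigenvalues
    eigvals, eigvecs = eigvals[order], eigvecs[:, order]
    sigma2 = eigvals[k:].mean()                   # Eq. (8): mean of discarded eigenvalues
    W = eigvecs[:, :k] @ np.diag(np.sqrt(eigvals[:k] - sigma2))  # Eq. (7), R = I
    Z = np.linalg.solve(sigma2 * np.eye(k) + W.T @ W, W.T @ Xc.T).T  # Eq. (9)
    return W, sigma2, u, Z
```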


Fig. 5. Flowchart of feature extraction.

D. i-Vector

After extracting the PPCA-supervector, the proposed system transforms it into the i-vector [16]. The i-vector can be regarded as a modified version of joint factor analysis (JFA) [37], so JFA is considered first. JFA decomposes a supervector of a speech utterance, M, into a speaker- and session-independent supervector s, a speaker subspace constructed from A and H, and a session subspace constructed from K. The decomposition is expressed by

M = s + Ay + Kq + Hz   (14)

where q denotes the factors of K, and y and z denote the factors of A and H, respectively. Motivated by the JFA technique, a highly effective approach named the i-vector was recently presented for speaker recognition. The i-vector technique decomposes the supervector of the speaker utterance M into a speaker- and session-independent supervector s and a total variability space built from T. The decomposition becomes

M = s + Tf   (15)

where f is the speaker- and session-dependent factor vector in the total variability space. Training the total variability matrix T resembles training the A matrix in JFA, except that every utterance is treated as if it were spoken by a different speaker. The factor vector f is called an i-vector, and it has far fewer dimensions than a supervector.

The above four subsections can be summarized as follows. Figure 5 shows the feature extraction process, which is composed of three major parts. First, a UBM is trained using the universal background dataset, and PPCA with the Bartlett test is applied to obtain a latent factor model. When the system receives a speech clip, its latent factor is extracted, and the PPCA-supervector is obtained through a MAP-based adaptation between the extracted latent factor and the latent factor model. The PPCA-supervectors of the training utterances are used to estimate a total variability matrix.
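The following is a simplified sketch of estimating f in Eq. (15). It assumes an isotropic covariance σ²I for the supervector residual, whereas the full i-vector extractor weights each UBM component by its Baum–Welch statistics; it illustrates only the algebra of the MAP point estimate under a standard-normal prior on f.

```python
import numpy as np

def extract_ivector(M_sv, s, T, sigma2=1.0):
    """Simplified point estimate of f in Eq. (15), M = s + T f.

    M_sv: supervector of the clip; s: speaker/session-independent supervector;
    T: total variability matrix (supervector_dim x ivector_dim).
    """
    R = T.shape[1]
    # MAP estimate under the prior f ~ N(0, I):
    # f = (I + T^T T / sigma2)^{-1} T^T (M - s) / sigma2
    A = np.eye(R) + T.T @ T / sigma2
    b = T.T @ (M_sv - s) / sigma2
    return np.linalg.solve(A, b)
```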

Finally, the i-vector of a speech clip can be extracted using its PPCA-supervector and the pre-trained total variability matrix.

V. SPARSE REPRESENTATION CLASSIFIER

Sparse representation [6] has been the focus of extensive research in recent years. Define the feature vector of a clip as k ∈ ℝ^L. Sparse representation seeks a column vector a ∈ ℝ^J that, multiplied by a pre-trained dictionary D ∈ ℝ^(L×J), approximates k:

â = argmin_a ‖a‖₁ subject to ‖Da − k‖₂ ≤ ε   (16)

The ℓ₁-norm encourages the elements of a, called the coefficients of the sparse representation, to follow a sparse distribution. To implement identification, for each class (i.e., speaker) g, the residual r_g(k) is calculated [58]:

r_g(k) = ‖k − D δ_g(â)‖₂   (17)

where δ_g(â) ∈ ℝ^J is a column vector whose nonzero entries are those of â associated with class g. The class with the smallest residual is taken as the predicted class of the test clip. A sketch of this classification rule is given below.
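As an illustration, the following sketch implements Eqs. (16) and (17) by solving the equivalent Lagrangian (LASSO) form of Eq. (16) with scikit-learn; the regularization weight alpha is a hypothetical setting, not a value from the paper.

```python
import numpy as np
from sklearn.linear_model import Lasso

def src_identify(k, D, labels, alpha=0.01):
    """SRC: sparse-code k over D, then pick the class with smallest residual.

    D: (L, J) dictionary whose columns are atoms; labels: (J,) speaker id
    of each atom; alpha: l1 weight in min ||D a - k||_2^2 + alpha * ||a||_1.
    """
    coder = Lasso(alpha=alpha, fit_intercept=False, max_iter=5000)
    coder.fit(D, k)                  # atoms act as regression features
    a_hat = coder.coef_
    residuals = {}
    for g in np.unique(labels):
        delta_g = np.where(labels == g, a_hat, 0.0)   # keep class-g coefficients
        residuals[g] = np.linalg.norm(k - D @ delta_g)  # Eq. (17)
    return min(residuals, key=residuals.get)
```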

VI. EXPERIMENTAL RESULTS

Three datasets were employed in the performance study: NIST 2005, NIST 2006, and NIST 2008. NIST 2005 contains telephone conversations and some auxiliary microphone data. The telephone utterances were collected for the Mixer Corpus by the Linguistic Data Consortium,¹ using the Fishboard platform; they were collected together with multi-channel data simultaneously recorded from several auxiliary microphones. Most of the clips in NIST 2005 are English utterances, but a few are in four other languages.

¹Linguistic Data Consortium: https://www.ldc.upenn.edu/


The clips of the 8con condition were used as the training data. Each of 136 speakers provided eight recordings, each lasting five minutes, yielding 1088 training clips. The 1con clips, 983 in total, were used as the testing dataset. This experimental setup is referred to as 8con-1con. NIST 2006 and NIST 2008 were used to train the UBM, whose number of components was set to 512. MFCCs, delta MFCCs, and double-delta MFCCs were used as the vocal feature set, with 36 dimensions in total. In feature extraction, the frame size was 32 ms and the frame shift was 10 ms. The vocal features were used to extract supervectors; the dimensionality of a supervector was 18432 (512 × 36), and the dimensionality of the i-vector was set to 400. A sketch of this feature configuration is given below.
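The sketch below assumes librosa, a 12/12/12 split of static, delta, and double-delta MFCCs (consistent with the stated 36 dimensions, though the split is not given explicitly in the paper), and an assumed 8 kHz telephone sampling rate.

```python
import librosa
import numpy as np

def vocal_features(path):
    """36-dim feature set: 12 MFCCs + deltas + double deltas per frame.

    Frame size 32 ms and frame shift 10 ms follow the paper's setup;
    the 8 kHz rate and 12-coefficient split are assumptions.
    """
    y, sr = librosa.load(path, sr=8000)   # telephone speech; 8 kHz assumed
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=12,
                                n_fft=int(0.032 * sr),      # 32 ms frames
                                hop_length=int(0.010 * sr)) # 10 ms shift
    d1 = librosa.feature.delta(mfcc)
    d2 = librosa.feature.delta(mfcc, order=2)
    return np.vstack([mfcc, d1, d2]).T    # (frames, 36)
```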

TABLE I. Comparison of ACCs of different approaches.

TABLE II. Comparison of numbers of selected principal components in approach 3.

A. Comparison of Proposed and Other Approaches

The performance of the proposed approach was compared with that of other approaches. Four approaches were employed: three baselines and the proposed approach.

1) GMM-SV + i-vector + LDA + WCCN + CD (baseline) [36]: This baseline extracts GMM-supervectors from the clips, and the i-vectors are subsequently obtained. Next, linear discriminant analysis is carried out to reduce the dimensionality of the i-vector, and within-class covariance normalization (WCCN) [59] is applied. Finally, cosine distance-based identification (CD) is adopted to identify the speaker of the test clip. In CD, to identify a test clip, the system selects eight training clips from each class and computes the inner products between the test clip and the selected training clips. The inner products of the clips of the same class are averaged to yield a similarity index between that class and the test clip, and the class with the largest similarity index is the identified class. A sketch of CD is given after this list.

2) PPCA-SV + i-vector + CD: This approach extracts PPCA-supervectors from the clips, obtains the i-vectors, and applies the cosine distance measure.

3) PPCA-SV + i-vector + SRC: This approach extracts PPCA-supervectors from the clips. In the PPCA, the number of selected components is unified across the principal component selections of all PPCA components and is empirically set to 15. The i-vectors are subsequently obtained, and SRC is applied.

4) PPCA-SV + Bartlett test + i-vector + SRC: The proposed approach extracts PPCA-supervectors from the clips. Unlike approach 3, the number of selected principal components is determined by the Bartlett test for each PPCA component individually rather than being unified. The value of α is set to 0.05 in the Bartlett test, and the dimensionality of the supervector is reduced from 18432 to 14390 after the principal components are selected. The i-vectors are subsequently obtained, and SRC is applied.
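A minimal sketch of the CD rule described in approach 1 follows. In the 8con setup each class contributes exactly its eight training i-vectors, so averaging over a class's clips realizes the described scheme; the variable names are ours.

```python
import numpy as np

def cd_identify(test_iv, train_ivs, train_labels):
    """Cosine distance-based identification (CD).

    train_ivs: (N, d) training i-vectors; train_labels: (N,) speaker ids.
    Averages the cosine similarities between the test i-vector and each
    class's training i-vectors, then picks the most similar class.
    """
    t = test_iv / np.linalg.norm(test_iv)
    X = train_ivs / np.linalg.norm(train_ivs, axis=1, keepdims=True)
    sims = X @ t   # inner products of length-normalized vectors = cosine similarity
    scores = {g: sims[train_labels == g].mean() for g in np.unique(train_labels)}
    return max(scores, key=scores.get)
```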

Table I displays the performance of the four approaches: the first column specifies the approach and the second its accuracy (ACC). The parameters in these approaches were not tuned. From Table I, approach 4 yields the highest ACC, 77.01%. To verify the effectiveness of the Bartlett test, the number of selected principal components in approach 3 was tuned to determine whether it could outperform approach 4. Table II presents the results: the performance of approach 3 improves as more principal components are selected, but its best performance does not exceed that of approach 4. This observation indicates that the Bartlett test-based PPCA outperforms the conventional unified-and-tuned PPCA.

B. Dictionary in SRC

We now discuss the dictionary in SRC. For a discriminative dictionary, the atoms of a class should be as pure as possible. Therefore, this paper uses RPCA or SVD to remove redundant information from the clips and thereby obtain a discriminative dictionary.

1) Proposed approach with RPCA-dictionary: This approach follows approach 4 (PPCA-SV + Bartlett test + i-vector + SRC) except for the dictionary construction. When constructing the dictionary in SRC, for each speaker we individually apply RPCA [21] to decompose the matrix of enrollment i-vectors, denoted D, into

D = L + E   (18)

where L denotes a low-rank matrix and E denotes a sparse error. The low-rank matrices of the speakers are merged to form the dictionary in SRC, called the RPCA-dictionary. A sketch of this decomposition is given below.
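The following sketch computes the decomposition in Eq. (18) by principal component pursuit with the inexact augmented Lagrangian method. The parameters lam and mu follow common defaults from the RPCA literature, not values from this paper.

```python
import numpy as np

def rpca(D, lam=None, mu=None, n_iter=200):
    """RPCA sketch: split D into low-rank L and sparse error E (Eq. (18))."""
    m, n = D.shape
    lam = lam or 1.0 / np.sqrt(max(m, n))        # common default weight
    mu = mu or 0.25 * m * n / np.abs(D).sum()    # common default step size
    S = lambda X, t: np.sign(X) * np.maximum(np.abs(X) - t, 0.0)  # soft threshold
    L = np.zeros_like(D); E = np.zeros_like(D); Y = np.zeros_like(D)
    for _ in range(n_iter):
        U, s, Vt = np.linalg.svd(D - E + Y / mu, full_matrices=False)
        L = (U * S(s, 1.0 / mu)) @ Vt            # singular-value thresholding
        E = S(D - L + Y / mu, lam / mu)          # sparse error update
        Y = Y + mu * (D - L - E)                 # dual ascent on D = L + E
    return L, E
```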

2) Proposed approach with SVD-dictionary: This approach also follows approach 4, except for the dictionary construction. The dictionary is constructed using the SVD technique.


TABLE III. Comparison of ACCs of different dictionaries.

TABLE IV. Comparison of numbers of selected components in the SVD-dictionary.

TABLE V. Comparison of methods of coefficient optimization.

We call such a dictionary the SVD-dictionary. SVD [22] decomposes D into

D = UΣVᵀ   (19)

where U denotes an m × m matrix, Σ denotes a diagonal matrix of singular values, and Vᵀ denotes an n × n matrix.

The performances of the two dictionaries are shown in Table III and discussed in detail. Both approaches in Table III outperform the original proposed approach, and the RPCA-dictionary clearly outperforms the SVD-dictionary. Various numbers of selected components in the SVD-dictionary were tried to confirm that the RPCA-dictionary outperforms the SVD-dictionary; Table IV shows the performance of the SVD-dictionary with different numbers of selected components. The RPCA-dictionary outperforms the SVD-dictionary in all cases. We believe that the modeling of the sparse error, which can be regarded as noise, causes the RPCA-dictionary to outperform the SVD-dictionary.

Accordingly, a more intuitive dictionary construction method was used to suppress the noise in the clips. First, we compute a mean atom for each class. Next, the mean atom of the corresponding class is subtracted from each atom in the dictionary, and the subtracted atoms are merged into a noise dictionary. The noise dictionary is combined with the original dictionary into a new dictionary, which is then used in SRC. We expected that the noise in a clip would be absorbed by the coefficients corresponding to the noise dictionary during sparse coding. However, the system yields an ACC of only 76.81%, which is not satisfactory. Inspecting the sparse coefficients shows that few noise coefficients are nonzero. We believe this is caused by the sparsity constraint in the sparse coding, which makes it hard for the noise coefficients to absorb noise because nonzero coefficients are penalized in the optimization. A sketch of the noise-dictionary construction is given below.
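The following minimal sketch implements the noise-dictionary construction just described, assuming atoms are stored as columns of D with one speaker label per atom.

```python
import numpy as np

def build_noise_dictionary(D, labels):
    """Append per-class mean-subtracted atoms to the original dictionary.

    D: (dim, J) dictionary whose columns are atoms; labels: (J,) speaker ids.
    """
    noise_atoms = np.empty_like(D)
    for g in np.unique(labels):
        idx = labels == g
        mean_atom = D[:, idx].mean(axis=1, keepdims=True)  # per-class mean atom
        noise_atoms[:, idx] = D[:, idx] - mean_atom        # residual (noise) part
    return np.hstack([D, noise_atoms])  # combined dictionary for sparse coding
```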

C. Optimization of Coefficients in SRC

For comparison, several SRC-related methods are used herein to identify speakers; the key differences among them concern how the coefficients are optimized. The selected methods are described below.

1) Proposed approach with kernel sparse representation classifier (KSRC): This approach replaces the SRC part of the proposed approach (PPCA-SV + Bartlett test + i-vector + SRC) with KSRC [60]. KSRC projects the i-vector onto a kernel space using kernel linear discriminant analysis [61], which can enhance the discriminability of the i-vector; the sparse coding is then implemented in the kernel space.

2) Proposed approach with Bayesian compressive sensing (BCS): This approach replaces the SRC part of the proposed approach with BCS [62]. Like SRC, BCS represents the testing clip as a mixture of the i-vectors in the dictionary. However, BCS does not measure each i-vector directly, but rather makes a set of related measurements, each of which is a linear combination of the original i-vectors [62]; the coefficients are estimated probabilistically.

3) Proposed approach with approximate Bayesian compressive sensing (ABCS): This approach replaces the SRC part of the proposed approach with ABCS [63]. ABCS is a modification of BCS: since a closed-form solution is rarely obtained with BCS, ABCS finds an approximate solution.

Table V displays the experimental results. All approaches in Table V outperform the original approach that uses SRC, and KSRC obtains the best performance. The probabilistic optimization of coefficients does not yield much improvement here. KSRC, on the other hand, attends to the variability among clips and classes, and its clear improvement indicates that variability is the key issue in the speaker identification task.

VII. CONCLUSION

This paper proposes a novel speaker identification system that consists of two major parts: feature extraction and SRC. In feature extraction, a PPCA-supervector is constructed using PPCA, and the number of eigenvalues is determined using the Bartlett test; in this manner, the dimensionality of each component is determined. The i-vector is subsequently extracted. Finally, SRC is implemented to identify the speaker of the test clip. In the experiments, this paper enhances the performance of SRC by choosing the primary elements of the dictionary, which compensates for session and channel variability and makes the dictionary discriminative. Furthermore, this paper constructs a noise dictionary to absorb variability and suppress noise. Finally, this paper obtains the sparse coefficients


by KSRC, BCS, and ABCS. KSRC considers the variability existing in the clips and classes, while BCS and ABCS model the sparse coefficients probabilistically. Experiments demonstrate the effectiveness of the proposed approach.

REFERENCES

[1] A. Matza and Y. Bistritz, “Skew Gaussian mixture models for speaker recognition,” IET Signal Process., vol. 8, no. 8, pp. 860–867, 2014.
[2] D. A. Reynolds and R. C. Rose, “Robust text-independent speaker identification using Gaussian mixture speaker models,” IEEE Trans. Speech Audio Process., vol. 3, no. 1, pp. 72–83, Jan. 1995.
[3] B. L. Pellom and J. H. L. Hansen, “An efficient scoring algorithm for Gaussian mixture model based speaker identification,” IEEE Signal Process. Lett., vol. 5, no. 11, pp. 281–284, Nov. 1998.
[4] Y. Lei, N. Scheffer, and M. McLaren, “A novel scheme for speaker recognition using a phonetically-aware deep neural network,” in Proc. IEEE Int. Conf. Acoust., Speech, Signal Process., May 2014, pp. 1695–1699.
[5] G. Liu and J. H. L. Hansen, “An investigation into back-end advancements for speaker recognition in multi-session and noisy enrollment scenarios,” IEEE/ACM Trans. Audio, Speech, Language Process., vol. 22, no. 12, pp. 1978–1992, Dec. 2014.
[6] J. Wright, A. Y. Yang, A. Ganesh, S. S. Sastry, and Y. Ma, “Robust face recognition via sparse representation,” IEEE Trans. Pattern Anal. Mach. Intell., vol. 31, no. 2, pp. 210–227, Feb. 2009.
[7] J. K. Pillai, V. M. Patel, R. Chellappa, and N. K. Ratha, “Secure and robust iris recognition using random projections and sparse representations,” IEEE Trans. Pattern Anal. Mach. Intell., vol. 33, no. 9, pp. 1877–1893, Sep. 2011.
[8] L. Liu and P. W. Fieguth, “Texture classification from random features,” IEEE Trans. Pattern Anal. Mach. Intell., vol. 34, no. 3, pp. 574–586, Mar. 2012.
[9] Y. Cong, J. Yuan, and J. Liu, “Sparse reconstruction cost for abnormal event detection,” in Proc. IEEE Int. Conf. Comput. Vis. Pattern Recognit., 2011, pp. 3449–3456.
[10] Y. Panagakis, C. Kotropoulos, and G. R. Arce, “Music genre classification via sparse representations of auditory temporal modulations,” in Proc. 17th Eur. Signal Process. Conf., Aug. 2009, pp. 1–5.
[11] Z. Han, J. Jiao, B. Zhang, Q. Ye, and J. Liu, “Visual object tracking via sample-based adaptive sparse representation (AdaSR),” Pattern Recognit., vol. 44, no. 9, pp. 2170–2183, 2011.
[12] H. Liu, C. Liu, and Y. Huang, “Adaptive feature extraction using sparse coding for machinery fault diagnosis,” Mech. Syst. Signal Process., vol. 25, no. 2, pp. 558–574, 2011.
[13] H. Liu, Y. Li, N. Li, and C. Liu, “Robust visual monitoring of machine condition with sparse coding and self-organizing map,” in Proc. Int. Conf. Intell. Robot. Appl. (ICIRA), Nov. 2010, pp. 642–653.
[14] J. M. K. Kua, E. Ambikairajah, J. Epps, and R. Togneri, “Speaker verification using sparse representation classification,” in Proc. IEEE Int. Conf. Acoust., Speech, Signal Process., May 2011, pp. 4548–4551.
[15] V. Boominathan and K. S. R. Murty, “Speaker recognition via sparse representations using orthogonal matching pursuit,” in Proc. IEEE Int. Conf. Acoust., Speech, Signal Process., Mar. 2012, pp. 4381–4384.
[16] A. Kanagasundaram, R. Vogt, D. Dean, S. Sridharan, and M. Mason, “i-vector based speaker recognition on short utterances,” in Proc. 12th Annu. Conf. Int. Speech Commun. Assoc., 2011, pp. 2341–2344.
[17] S. Cumani and P. Laface, “Large-scale training of pairwise support vector machines for speaker recognition,” IEEE/ACM Trans. Audio, Speech, Language Process., vol. 22, no. 11, pp. 1590–1600, Nov. 2014.
[18] B. V. Srinivasan, Y. Luo, D. Garcia-Romero, D. N. Zotkin, and R. Duraiswami, “A symmetric kernel partial least squares framework for speaker recognition,” IEEE Trans. Audio, Speech, Language Process., vol. 21, no. 7, pp. 1415–1423, Jul. 2013.
[19] T. Hasan and J. H. L. Hansen, “Acoustic factor analysis for robust speaker verification,” IEEE/ACM Trans. Audio, Speech, Language Process., vol. 21, no. 4, pp. 842–853, Apr. 2013.
[20] M. S. Bartlett, “Tests of significance in factor analysis,” Brit. J. Psychol., vol. 3, no. 2, pp. 77–85, 1950.
[21] J. Wright, A. Ganesh, S. Rao, and Y. Ma, “Robust principal component analysis: Exact recovery of corrupted low-rank matrices via convex optimization,” in Proc. Neural Inf. Process. Syst., 2009, pp. 2080–2088.
[22] G. H. Golub and C. Reinsch, “Singular value decomposition and least squares solutions,” Numer. Math., vol. 14, no. 5, pp. 403–420, Apr. 1970.

[23] J. H. L. Hansen and T. Hasan, “Speaker recognition by machines and humans: A tutorial review,” IEEE Signal Process. Mag., vol. 32, no. 6, pp. 74–99, Nov. 2015.
[24] J. H. Wigmore, “New mode of identifying criminals,” Amer. Inst. Crim. Criminol., vol. 17, no. 2, pp. 165–166, 1926.
[25] L. G. Kersta, “Voiceprint identification,” Nature, vol. 196, no. 4861, pp. 1253–1257, Dec. 1962.
[26] B. E. Koenig, “Spectrographic voice identification: A forensic survey,” J. Acoust. Soc. Amer., vol. 79, no. 6, pp. 2088–2090, 1986.
[27] H. F. Hollien, Forensic Voice Identification. New York, NY, USA: Academic, 2002.
[28] S. B. Davis and P. Mermelstein, “Comparison of parametric representations for monosyllabic word recognition in continuously spoken sentences,” IEEE Trans. Acoust., Speech, Signal Process., vol. 28, no. 4, pp. 357–366, Aug. 1980.
[29] H. Hermansky, “Perceptual linear predictive (PLP) analysis of speech,” J. Acoust. Soc. Amer., vol. 87, no. 4, pp. 1738–1752, 1990.
[30] S. Furui, “Cepstral analysis technique for automatic speaker verification,” IEEE Trans. Acoust., Speech, Signal Process., vol. 29, no. 2, pp. 254–272, Apr. 1981.
[31] H. Hermansky and N. Morgan, “RASTA processing of speech,” IEEE Trans. Speech Audio Process., vol. 2, no. 4, pp. 578–589, Oct. 1994.
[32] H. Boril and J. H. L. Hansen, “Unsupervised equalization of Lombard effect for speech recognition in noisy adverse environments,” IEEE Trans. Audio, Speech, Language Process., vol. 18, no. 6, pp. 1379–1393, Aug. 2010.
[33] D. A. Reynolds, T. F. Quatieri, and R. B. Dunn, “Speaker verification using adapted Gaussian mixture models,” Digit. Signal Process., vol. 10, nos. 1–3, pp. 19–41, 2000.
[34] W. M. Campbell, D. E. Sturim, and D. A. Reynolds, “Support vector machines using GMM supervectors for speaker verification,” IEEE Signal Process. Lett., vol. 13, no. 5, pp. 308–311, May 2006.
[35] P. Kenny, M. Mihoubi, and P. Dumouchel, “New MAP estimators for speaker recognition,” in Proc. Interspeech, Geneva, Switzerland, 2003, pp. 2964–2967.
[36] N. Dehak, P. J. Kenny, R. Dehak, P. Dumouchel, and P. Ouellet, “Front-end factor analysis for speaker verification,” IEEE/ACM Trans. Audio, Speech, Language Process., vol. 19, no. 4, pp. 788–798, Apr. 2011.
[37] P. Kenny, “Joint factor analysis of speaker and session variability: Theory and algorithms,” Tech. Rep. CRIM-06/08-13, 2005. [Online]. Available: http://www.crim.ca/perso/patrick.kenny/
[38] S. Roweis, “EM algorithms for PCA and SPCA,” in Adv. Neural Inf. Process. Syst., vol. 10. Denver, CO, USA: MIT Press, 1998, pp. 626–632.
[39] N. Dehak, R. Dehak, P. Kenny, N. Brümmer, P. Ouellet, and P. Dumouchel, “Support vector machines versus fast scoring in the low-dimensional total variability space for speaker verification,” in Proc. Interspeech, 2009, pp. 1559–1562.
[40] B. C. Haris and R. Sinha, “Sparse representation over learned and discriminatively learned dictionaries for speaker verification,” in Proc. IEEE Int. Conf. Acoust., Speech, Signal Process., Mar. 2012, pp. 4785–4788.
[41] L. Chen and Y. Yang, “Emotional speaker recognition based on i-vector through atom aligned sparse representation,” in Proc. IEEE Int. Conf. Acoust., Speech, Signal Process., May 2013, pp. 7760–7764.
[42] Y. Nie, M. Xu, and H. Xianyu, “Intrinsic variation robust speaker verification based on sparse representation,” in Proc. Asia–Pacific Signal Inf. Process. Assoc. Annu. Summit Conf., Oct. 2014, pp. 1–4.
[43] B. C. Haris and R. Sinha, “Robust speaker verification with joint sparse coding over learned dictionaries,” IEEE Trans. Inf. Forensics Security, vol. 10, no. 10, pp. 2143–2157, Oct. 2015.
[44] I. Naseem, R. Togneri, and M. Bennamoun, “Sparse representation for speaker identification,” in Proc. IEEE Int. Conf. Pattern Recognit., Aug. 2010, pp. 4460–4463.
[45] N. Almaadeed, A. Aggoun, and A. Amira, “Speaker identification using multimodal neural networks and wavelet analysis,” IET Signal Process., vol. 4, no. 1, pp. 18–28, 2015.
[46] X. Zhao, Y. Wang, and D. Wang, “Robust speaker identification in noisy and reverberant conditions,” IEEE/ACM Trans. Audio, Speech, Language Process., vol. 22, no. 4, pp. 836–845, Apr. 2014.
[47] J. C. Wang, Y. H. Chin, W. C. Hsieh, C. H. Lin, Y. R. Chen, and E. Siahaan, “Speaker identification with whispered speech for the access control system,” IEEE Trans. Autom. Sci. Eng., vol. 12, no. 4, pp. 1191–1199, Apr. 2015.


[48] L. Schmidt, M. Sharifi, and I. L. Moreno, “Large-scale speaker identification,” in Proc. IEEE Int. Conf. Acoust., Speech, Signal Process., May 2014, pp. 1650–1654.
[49] J. Wang and M. T. Johnson, “Physiologically-motivated feature extraction for speaker identification,” in Proc. IEEE Int. Conf. Acoust., Speech, Signal Process., May 2014, pp. 1690–1694.
[50] F. Richardson, D. Reynolds, and N. Dehak, “Deep neural network approaches to speaker and language recognition,” IEEE Signal Process. Lett., vol. 22, no. 10, pp. 1671–1675, Oct. 2015.
[51] P. Motlicek, S. Dey, S. Madikeri, and L. Burget, “Employment of subspace Gaussian mixture models in speaker recognition,” in Proc. IEEE Int. Conf. Acoust., Speech, Signal Process., Apr. 2015, pp. 4445–4449.
[52] W. Zhu, S. O. Sadjadi, and J. W. Pelecanos, “Nearest neighbor based i-vector normalization for robust speaker recognition under unseen channel conditions,” in Proc. IEEE Int. Conf. Acoust., Speech, Signal Process., Apr. 2015, pp. 4684–4688.
[53] P. Kenny, T. Stafylakis, J. Alam, and M. Kockmann, “JFA modeling with left-to-right structure and a new backend for text-dependent speaker recognition,” in Proc. IEEE Int. Conf. Acoust., Speech, Signal Process., Apr. 2015, pp. 4689–4693.
[54] H. Sun, K. A. Lee, and B. Ma, “A new study of GMM-SVM system for text-dependent speaker recognition,” in Proc. IEEE Int. Conf. Acoust., Speech, Signal Process., Apr. 2015, pp. 4195–4199.
[55] M. E. Tipping and C. M. Bishop, “Probabilistic principal component analysis,” J. Roy. Statist. Soc. B, vol. 61, no. 3, pp. 611–622, 1999.
[56] T. Hasan and J. H. L. Hansen, “Factor analysis of acoustic features using a mixture of probabilistic principal component analyzers for robust speaker verification,” in Proc. Odyssey, Jun. 2012, pp. 243–247.
[57] M. E. Tipping and C. M. Bishop, “Mixtures of probabilistic principal component analyzers,” Neural Comput., vol. 11, no. 2, pp. 443–482, Feb. 1999.
[58] J. C. Wang, Y. H. Chin, B. W. Chen, C. H. Lin, and C. H. Wu, “Speech emotion verification using emotion variance modeling and discriminant scale-frequency maps,” IEEE/ACM Trans. Audio, Speech, Language Process., vol. 23, no. 10, pp. 1552–1562, Oct. 2015.
[59] A. Kanagasundaram, D. Dean, R. Vogt, M. McLaren, S. Sridharan, and M. Mason, “Weighted LDA techniques for i-vector based speaker verification,” in Proc. IEEE Int. Conf. Acoust., Speech, Signal Process., Mar. 2012, pp. 4781–4784.
[60] L. Zhang et al., “Kernel sparse representation-based classifier,” IEEE Trans. Signal Process., vol. 60, no. 4, pp. 1684–1695, Apr. 2012.
[61] G. Baudat and F. Anouar, “Generalized discriminant analysis using a kernel approach,” Neural Comput., vol. 12, no. 10, pp. 2385–2404, 2000.
[62] T. N. Sainath, A. Carmi, D. Kanevsky, and B. Ramabhadran, “Bayesian compressive sensing for phonetic classification,” in Proc. IEEE Int. Conf. Acoust., Speech, Signal Process., Mar. 2010, pp. 4370–4373.
[63] A. Carmi, P. Gurfil, D. Kanevsky, and B. Ramabhadran, “ABCS: Approximate Bayesian compressed sensing,” Human Lang. Technol., IBM, New York, NY, USA, Tech. Rep. RC 24816, 2009.

Yu-Hao Chin received the B.S. degree in applied information management from National Central University, Taoyuan, Taiwan, in 2013. He is currently pursuing the Ph.D. degree in computer science and information engineering with National Central University. His research interests include speaker recognition, affective computing, and audio event recognition.

Jia-Ching Wang (SM’09) received the M.S. and Ph.D. degrees in electrical engineering from National Cheng Kung University, Tainan, Taiwan, in 1997 and 2002, respectively. He was an Honorary Fellow with the Department of Electrical and Computer Engineering, University of Wisconsin–Madison, in 2008 and 2009. Currently, he is an Associate Professor with the Department of Computer Science and Information Engineering, National Central University. His research interests include signal processing, machine learning, and VLSI architecture design. He is an Honorary Member of the Phi Tau Phi Scholastic Honor Society and a member of ACM and IEICE.


Chien-Lin Huang is currently a speech scientist at Voicebox Technologies Inc. Previously, he was a scientist at NICT, Japan, and at I2R, Singapore. His research focuses on speech recognition, speaker recognition, and speech retrieval. He is an active member of the speech and language processing communities. He has coauthored over 40 technical papers and holds two U.S. patents.

Kuang-Yao Wang received the M.S. degree in computer science and information engineering from National Central University, Taoyuan, Taiwan, in 2013. He is currently an Engineer with Wistron Corporation. His research interest is speaker recognition.

Chung-Hsien Wu (SM’03) received the B.S. degree in electronics engineering from National Chiao Tung University, Hsinchu, Taiwan, in 1981, and the M.S. and Ph.D. degrees in electrical engineering from National Cheng Kung University (NCKU), Tainan, Taiwan, in 1987 and 1991, respectively. Since 1991, he has been with the Department of Computer Science and Information Engineering, NCKU, where he became a Distinguished Professor in 2004 and served as the Chairman from 1999 to 2002. From 2009 to 2015, he was the Deputy Dean of the College of Electrical Engineering and Computer Science, NCKU. In 2003, he was a Visiting Scientist in the Computer Science and Artificial Intelligence Laboratory, Massachusetts Institute of Technology, Cambridge, MA, USA. His research interests include affective computing, speech recognition/synthesis, and spoken language processing. He received the Outstanding Research Award of the National Science Council in 2010 and the Distinguished Electrical Engineering Professor Award of the Chinese Institute of Electrical Engineering, Taiwan, in 2011. He was an Associate Editor of the IEEE Transactions on Audio, Speech, and Language Processing (2010–2014) and the IEEE Transactions on Affective Computing (2010–2014). He is currently an Associate Editor of the ACM Transactions on Asian and Low-Resource Language Information Processing and the APSIPA Transactions on Signal and Information Processing. He served as an Asia Pacific Signal and Information Processing Association Distinguished Lecturer and as the Speech, Language, and Audio Technical Committee Chair in 2013–2014.