Text-Independent Speaker Verification over a Telephone Network by Radial Basis Function Networks

Man Wai Mak
Department of Electronic Engineering, The Hong Kong Polytechnic University, Hung Hom, Hong Kong
Email: [email protected]
ABSTRACT

This paper presents several text-independent speaker verification experiments based on Radial Basis Function (RBF) and Elliptical Basis Function (EBF) networks. The experiments involve 76 speakers from dialect region 2 of the TIMIT and NTIMIT databases. Each speaker was modelled by a 12-input, 2-output network in which one output represents the speaker class and the other the anti-speaker class. The results show that both RBF and EBF networks are very robust in detecting impostors in clean speech, with the EBF networks being significantly better than the RBF networks in this respect. For clean speech, a false acceptance rate of 0.06% and a false rejection rate of 0.19% were achieved. For telephone speech, however, the false acceptance and false rejection rates increase to 11.7% and 8.71%, respectively. It is concluded that better pre-processing techniques are required to reduce the effects of noise and channel variations.
1. INTRODUCTION

In recent years, the telephone network has become an increasingly popular channel for bank services and financial transactions. These services are usually supported by telephone-based account access systems in which eligibility for system entry is determined solely by a Personal Identification Number (PIN) entered via the telephone keypad. However, PINs can be lost, stolen, or counterfeited, so this form of security is not sufficient on its own. Speaker verification technology may be a solution to this problem. The objective of speaker verification is to verify an identity claim made by an unknown speaker and to decide whether to accept or reject the speaker. The possibility of fraud could be reduced if speaker verification and PINs were used together.

One important issue in speaker verification is the robustness of the verification system against impostors. To improve robustness, we need a classifier capable of detecting novel data: instead of blindly classifying the input data as one of K classes, it should be able to indicate that the input data do not belong to any known class. Neural network classifiers with local receptive fields, such as Radial Basis Function (RBF) and Elliptical Basis Function (EBF) networks, seem to be the best candidates for this task. The present study therefore aims at developing a robust text-independent speaker verification system based on these types of networks, and at comparing their performance on clean and telephone speech. Although there have been a few reports on using RBF networks for speaker identification [5] and speaker verification [7], whose results show that RBF networks are robust and efficient in recognising speakers, none of these studies used a well-known speech database that can be accessed by other researchers. This motivates our use of the TIMIT and NTIMIT databases in the present study.

The structure of this paper is as follows. The RBF and EBF networks are introduced in the next section. Section 3 presents the speech databases and the speaker verification experiments, and discusses the results. Finally, a conclusion is drawn in Section 4.
2. RBF AND EBF NETWORKS

RBF and EBF networks are feedforward neural networks in which the non-linearity of the hidden layer is provided by Gaussian basis functions. Broomhead and Lowe [1] and Moody and Darken [6] were among the first to use this technique. The outputs of an RBF network with I inputs, J function centers, and K outputs have the form
$$y_k(\mathbf{x}_p) = w_{k0} + \sum_{j=1}^{J} w_{kj}\,\phi_j(\mathbf{x}_p), \qquad p = 1,\ldots,P \text{ and } k = 1,\ldots,K, \tag{1}$$

where

$$\phi_j(\mathbf{x}_p) = \exp\left\{-\sum_{i=1}^{I} \frac{(x_{pi} - \mu_{ji})^2}{2\sigma_j^2}\right\}. \tag{2}$$

In (1) and (2), the \mathbf{x}_p are the input vectors, \mu_{ji} and \sigma_j are the function centers and function widths respectively, and w_{k0} is a bias term. In matrix notation, (1) can be written as Y = \Phi W, where Y is a P × K matrix, \Phi is a P × (J+1) matrix, and W is a (J+1) × K matrix. Training an RBF network involves minimizing the total squared error

$$E = \frac{1}{2}(D - \Phi W)^T(D - \Phi W), \tag{3}$$

where D is a P × K target matrix. Since (3) is quadratic in W, once the elements of \Phi are known, W can be computed efficiently by the pseudo-inverse method:

$$W = (\Phi^T \Phi)^{-1} \Phi^T D = \Phi^{+} D, \tag{4}$$

where \Phi^{+} is the pseudo-inverse of \Phi. The function centers \mu_{ji} and the function widths \sigma_j can be found by the K-means algorithm and the K-nearest neighbor heuristic, respectively [5,6].

According to (2), the function widths are equal for all input dimensions, hence the name "radial". This type of network is particularly suitable when the variances of the input data are the same for all input dimensions; however, this is seldom the case in complex pattern recognition problems. To relax this equal-variance restriction, EBF networks are introduced. In an EBF network, each cluster of data in the input space is associated with a Gaussian basis function of the form

$$\phi_j(\mathbf{x}_p) = \exp\left\{-\frac{1}{2\gamma_j}\left(\mathbf{x}_p - \boldsymbol{\mu}_j\right)^T \Sigma_j^{-1} \left(\mathbf{x}_p - \boldsymbol{\mu}_j\right)\right\}, \qquad j = 1,\ldots,J, \tag{5}$$

where \Sigma_j is the covariance matrix associated with the j-th cluster and \gamma_j is a parameter controlling the spread of the basis function. Note that when \Sigma_j = \mathrm{diag}\{\sigma_j^2, \sigma_j^2, \ldots, \sigma_j^2\} and \gamma_j = 1, the EBF network reduces to an RBF network. One simple way of estimating the centers \boldsymbol{\mu}_j and the covariance matrices \Sigma_j is to compute the sample average and the sample covariance:

$$\boldsymbol{\mu}_j \approx \frac{1}{N_j}\sum_{\mathbf{x} \in \Omega_j} \mathbf{x} \qquad \text{and} \qquad \Sigma_j \approx \frac{1}{N_j}\sum_{\mathbf{x} \in \Omega_j} \left(\mathbf{x} - \boldsymbol{\mu}_j\right)\left(\mathbf{x} - \boldsymbol{\mu}_j\right)^T, \tag{6}$$
where Ω_j is the set of training vectors in the j-th cluster and N_j is the number of vectors in Ω_j. Alternatively, we can employ the EM algorithm to estimate these values [3]. As for RBF networks, the weight matrix W of an EBF network can be obtained by the singular value decomposition technique once all the elements of the matrix Φ have been found.

The advantage of EBF networks is that fewer function centers are required than for RBF networks. This is because each basis function in an EBF network has a full covariance matrix, which represents the clusters better, whereas each cluster in an RBF network is associated with a diagonal covariance matrix with equal variances. This characteristic is demonstrated in Fig. 1, where two overlapping Gaussian distributions have to be classified into two classes; the function centers and the contours of the Gaussian functions are also shown. The classification accuracy obtained with an RBF network is 72.0%, which increases to 89.5% when an EBF network is used.
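To make this concrete, the following is a minimal NumPy sketch (not the authors' implementation) of the building blocks above: the design matrix of Eq. (5), which reduces to Eq. (2) when the covariances are isotropic and γ_j = 1; the pseudo-inverse weight solution of Eq. (4); the sample estimates of Eq. (6); and the K-nearest-neighbour width heuristic. All function and variable names are illustrative.

```python
import numpy as np

def ebf_design_matrix(X, centers, covs, gamma=1.0):
    """Hidden-layer outputs of Eq. (5), plus a bias column.
    For an RBF network, pass isotropic covariances
    diag(sigma_j^2, ..., sigma_j^2) and gamma = 1 (Eq. (2))."""
    P, J = X.shape[0], len(centers)
    Phi = np.ones((P, J + 1))                  # column 0 carries the bias w_k0
    for j in range(J):
        d = X - centers[j]                     # deviations from centre j
        m = np.einsum('pi,ij,pj->p', d, np.linalg.inv(covs[j]), d)
        Phi[:, j + 1] = np.exp(-m / (2.0 * gamma))
    return Phi

def train_output_weights(Phi, D):
    """Least-squares output weights W = Phi^+ D of Eq. (4); np.linalg.pinv
    uses the singular value decomposition, as in the paper."""
    return np.linalg.pinv(Phi) @ D

def sample_center_and_cov(X_cluster):
    """Sample mean and covariance of one cluster, Eq. (6)."""
    mu = X_cluster.mean(axis=0)
    d = X_cluster - mu
    return mu, (d.T @ d) / len(X_cluster)

def knn_widths(centers, K=2):
    """K-nearest-neighbour heuristic [5,6]: sigma_j is the mean distance
    from centre j to its K nearest neighbouring centres."""
    dist = np.linalg.norm(centers[:, None] - centers[None, :], axis=-1)
    dist.sort(axis=1)                          # row j: distances from centre j
    return dist[:, 1:K + 1].mean(axis=1)       # skip the zero self-distance
```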
3. SPEAKER VERIFICATION EXPERIMENTS

3.1 Speech Database

A subset of a phonetically balanced, continuous-speech, telephone-bandwidth speech database (NTIMIT) [2] was used in the verification experiments. For comparison purposes, the clean version of this database (TIMIT) was also used. The TIMIT and NTIMIT databases contain 630 speakers divided into eight dialect regions. Each speaker spoke two dialectal sentences (the SA sentence set), five phonetically-compact sentences (the SX sentence set), and three phonetically-diverse sentences (the SI sentence set). The two sentences in the SA set are the same for all speakers, and some sentences in the SX set are shared by some speakers; all sentences in the SI set, however, are different. It is therefore reasonable to use the SA and SX sentence sets for training and the SI sentence set for testing in order to obtain text-independent verification performance.

In the experiments, 76 speakers (23 female and 53 male) from dialect region 2 were used. These speakers were divided into a speaker set, an anti-speaker set, and an impostor set, as shown in Table 1. The speaker set and the anti-speaker set contain 20 speakers each, while the impostor set contains 36 speakers. Each speaker in the speaker set has a personalized network (RBF or EBF) modeling the characteristics of his/her own voice.

Speaker set:      faem0 fajw0 fcaj0 fcmm0 fcyl0 fdas1 fdnc0 fdxw0 marc0 mbjv0 mcew0 mctm0 mdbp0 mdem0 mdlb0 mdlc2 mdmt0 mdps0 mdss0 mdwd0
Anti-speaker set: feac0 fhlm0 fjkl0 fkaa0 flma0 flmc0 fmjb0 fmkf0 mefg0 mhrm0 mjae0 mjbg0 mjde0 mjeb0 mjhi0 mjma0 mjmd0 mjpm0 mjrp0 mkah0
Impostor set:     fmmh0 fpjf0 frll0 fscn0 fskl0 fsrh0 ftmg0 mkaj0 mkdt0 mkjo0 mmaa0 mmag0 mmds0 mmgk0 mmxs0 mppc0 mprb0 mrab0 mrcw0 mrfk0 mrgs0 mrhl0 mrjh0 mrjm0 mrjm1 mrjt0 mrlj0 mrlr0 mrms0 msat0 mtat1 mtbc0 mtdb0 mtjg0 mwsb0 mzmb0
Table 1: The names of the speakers in the speaker set, anti-speaker set, and impostor set.

The feature vectors that characterize a speaker's voice were derived from the following LPC analysis procedure. The silent regions of the time-domain speech signals were removed using the information in the .phn files of the NTIMIT database. The remaining signals were pre-emphasized by a filter with transfer function H(z) = 1 − 0.95z⁻¹. Then, for every 14 ms of speech, 12th-order LPC-derived cepstral coefficients were computed using a 28 ms Hamming window. The cepstral vectors were then passed through a cepstral mean removal procedure, which has been found effective in compensating for channel variations [8] (a code sketch of this procedure appears later in this subsection).

3.2 Training and Verification Procedures

For each network, the feature vectors derived from the SA and SX sentence sets were used for training. The data derived from the speaker in the speaker set constitute the speaker class; the anti-speaker class consists of the data derived from all anti-speakers. Each network therefore has 12 inputs, a varying number of hidden nodes, and 2 outputs, with each output node representing one class. During training, the K-means algorithm was applied separately to the data of each speaker (the speaker and all anti-speakers) so that each of them contributes an equal number of function centers. The K-nearest neighbor heuristic (with K = 2) was then applied to all function centers to find the function widths. For each EBF network, the covariance matrix of each cluster was found using (6), and γ_j was set to 2, a value found empirically in a pilot verification experiment involving five speakers, five anti-speakers, and five impostors. Finally, the singular value decomposition technique was applied to find the pseudo-inverse of the matrix Φ, from which the weight matrix W was obtained.

Since a 1-from-K coding scheme was used for the target vectors and the output units are linear, the network outputs are estimates of the a posteriori probabilities P(C_k | \mathbf{x}) [9]; moreover, the outputs for an arbitrary input are guaranteed to sum to unity. It has been shown [10] that the average network output is an estimate of the prior probability P(C_k).
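As a concrete illustration of the feature-extraction procedure of Section 3.1, here is a minimal sketch covering pre-emphasis, 28 ms Hamming windows at a 14 ms shift, 12th-order LPC-derived cepstra, and cepstral mean removal. The autocorrelation/Levinson-Durbin LPC method and the LPC-to-cepstrum recursion are assumptions, since the paper does not state how the coefficients were computed; 16 kHz is the sampling rate of the TIMIT waveforms.

```python
import numpy as np

def lpc_cepstrum(frame, order=12):
    # LPC by the autocorrelation method with the Levinson-Durbin recursion
    # (an assumption; the paper does not name the LPC method used).
    r = np.correlate(frame, frame, mode='full')[len(frame) - 1:][:order + 1]
    a = np.zeros(order + 1); a[0] = 1.0          # A(z) = 1 + a_1 z^-1 + ...
    e = r[0]
    for i in range(1, order + 1):
        k = -(r[i] + a[1:i] @ r[1:i][::-1]) / e
        prev = a.copy()
        for j in range(1, i):
            a[j] = prev[j] + k * prev[i - j]
        a[i] = k
        e *= 1.0 - k * k
    # Standard LPC-to-cepstrum recursion for this sign convention.
    c = np.zeros(order + 1)
    for n in range(1, order + 1):
        c[n] = -a[n] - sum(m * c[m] * a[n - m] for m in range(1, n)) / n
    return c[1:]

def cepstral_features(x, fs=16000, order=12):
    """12th-order LPC cepstra from 28 ms Hamming windows every 14 ms,
    after pre-emphasis, followed by cepstral mean removal [8]."""
    x = np.append(x[0], x[1:] - 0.95 * x[:-1])   # H(z) = 1 - 0.95 z^-1
    win, hop = int(0.028 * fs), int(0.014 * fs)
    w = np.hamming(win)
    C = np.array([lpc_cepstrum(w * x[s:s + win], order)
                  for s in range(0, len(x) - win + 1, hop)])
    return C - C.mean(axis=0)                    # cepstral mean removal
```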
It follows that if the number of patterns in the training set is not evenly distributed among the classes, the network will be biased towards the classes with larger membership in the training set. In the present study, each speaker contributes the same number of sentences for training; consequently, the ratio of training vectors between the speaker class and the anti-speaker class is about 1:20 (each network is trained on one speaker from the speaker set and the 20 speakers from the anti-speaker set). The network will therefore strongly favor the anti-speaker class during verification. To solve this problem, we can use the method suggested by Lowe and Webb [4] to weight the error function according to the prior class probabilities. Alternatively, we can perform this scaling during verification by dividing each network output by the prior probability of the corresponding class. Since the outputs can be regarded as a posteriori probabilities, we can apply Bayes' theorem

$$y_k(\mathbf{x}) \approx P(C_k \mid \mathbf{x}) = \frac{p(\mathbf{x} \mid C_k)\,P(C_k)}{p(\mathbf{x})}, \qquad k = 1, 2, \tag{7}$$

to obtain the scaled likelihoods

$$\frac{y_k(\mathbf{x})}{P(C_k)} \approx \frac{p(\mathbf{x} \mid C_k)}{p(\mathbf{x})}, \qquad k = 1, 2. \tag{8}$$

To verify a speaker, the unknown vectors $X = \{\mathbf{x}_1, \mathbf{x}_2, \ldots, \mathbf{x}_T\}$ are fed to the network, and the scaled log-likelihoods

$$L_k = \sum_{t=1}^{T} \log \frac{y_k(\mathbf{x}_t)}{P(C_k)} \approx \log \prod_{t=1}^{T} \frac{p(\mathbf{x}_t \mid C_k)}{p(\mathbf{x}_t)}, \qquad k = 1, 2, \tag{9}$$

are obtained for the speaker and anti-speaker classes. Since the logarithm is a monotonically increasing function, we can drop the logarithmic operation without affecting the verification decision; hence (9) is simplified to

$$L_k' = \sum_{t=1}^{T} \frac{y_k(\mathbf{x}_t)}{P(C_k)}. \tag{10}$$

Finally, the verification decision is based on the following criterion: accept the speaker if L'_1 − L'_2 > ξ, and reject the speaker otherwise, where ξ is a threshold by which we can control the trade-off between the false rejection rate and the false acceptance rate.

It is possible to use all the test vectors from a speaker's SI sentence set to make a single verification decision. The result, however, would be either "accept" or "reject", and the error rate would be either 100% or 0%. To increase the resolution of the error rate, we divided the sequence of test vectors into a number of overlapping segments. In the experiments, each segment contains 200 consecutive vectors (2.8 seconds of speech), and a verification decision was made every ten vectors; thus the value of T in (9) and (10) is 200, and the error rate is the proportion of incorrect verifications to the total number of verification decisions. The false rejection rate of a particular speaker was obtained by presenting the speaker's SI sentence set to his/her own network, and the threshold value yielding this rejection rate was recorded. Using this threshold, the false acceptance rate was obtained by feeding the SI sentence sets of all impostors (here, impostors include all speakers in the impostor set and all speakers in the speaker set except the one whose network is being tested) to the network.
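The segment-based decision rule above can be summarized in a short sketch. It scores overlapping 200-vector segments with the scaled likelihoods of Eq. (10) and compares L'_1 − L'_2 with the threshold ξ; the array layout and names are assumptions.

```python
import numpy as np

def segment_score(outputs, priors):
    """Scaled likelihoods L'_k of Eq. (10) for one segment.
    outputs: (T, 2) network outputs y_k(x_t), column 0 the speaker class
    and column 1 the anti-speaker class; priors: the P(C_k)."""
    return (outputs / priors).sum(axis=0)

def verify_segments(outputs, priors, xi, T=200, step=10):
    """One accept/reject decision per overlapping segment of T vectors,
    advancing `step` vectors at a time, as in the experiments."""
    decisions = []
    for s in range(0, len(outputs) - T + 1, step):
        L = segment_score(outputs[s:s + T], priors)
        decisions.append(L[0] - L[1] > xi)       # accept iff L'_1 - L'_2 > xi
    return np.array(decisions)
```

With the 1:20 class ratio of these experiments, the priors would be about (1/21, 20/21), so dividing by them rescales the speaker-class evidence accordingly.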
3.3 Verification Results

Table 2 shows the false acceptance rates (FAR) and false rejection rates (FRR) achieved by the RBF and EBF networks with various numbers of function centers. All values are averages over the 20 speakers in the speaker set, and the columns titled "Ave %" are the averages of FAR and FRR.

No. of centers                  TIMIT                                         NTIMIT
per network        RBF Networks         EBF Networks         RBF Networks          EBF Networks
                 FAR%   FRR%   Ave%   FAR%   FRR%   Ave%   FAR%   FRR%   Ave%   FAR%   FRR%   Ave%
      42         8.12  13.19  10.66   0.22   3.63   1.93  44.67  12.39  28.53  12.17   8.23  10.20
      84         4.33   3.73   4.03   0.06   0.19   0.13  23.34   8.10  15.72  11.70   8.71  10.21
     126         2.13   3.61   2.65   0.07   0.18   0.13  18.95   8.43  13.69  12.74   8.35  10.55
     168         1.80   3.65   2.73   0.11   0.18   0.15  17.20   8.32  12.76  13.27   8.26  10.77
Table 2: False acceptance rates (FAR), false rejection rates (FRR), and average error rates (Ave) for different numbers of function centers per network.

3.4 Discussion

The results show that the EBF networks achieve significantly lower error rates than the RBF networks. Moreover, the RBF networks require a larger number of function centers to achieve an error rate comparable to that of the EBF networks: for example, the average error rate of an EBF network with 42 function centers (2 function centers per speaker) is 1.93%, whereas an RBF network needs 126 function centers to reach 2.65%. This implies that the full covariance matrices model the training data accurately and that the EBF networks are robust in detecting impostors. The results also show, however, that the error rates for telephone speech are more than ten times higher than those for clean speech. This suggests that better pre-processing techniques have to be adopted or developed to reduce the effects of noise and channel variations.
4. CONCLUSION

In this paper, radial basis function and elliptical basis function networks have been applied to text-independent speaker verification. Several verification experiments using the TIMIT and NTIMIT databases have been conducted to demonstrate the capabilities of these networks. It is concluded that the EBF networks are better than the RBF networks at modeling speaker characteristics and at detecting impostors. Further research is required to find better pre-processing techniques that reduce the effects of noise and channel variations. In this study, the value of γ_j was chosen empirically without theoretical grounds; one possible future development is therefore to optimize this parameter based on, for example, the eigenvalues of the covariance matrices and the distances between the function centers. Another possible direction is to incorporate a cost matrix into the error function so that the costs of false acceptance and false rejection can differ.
5. REFERENCES

[1] Broomhead, D.S. and Lowe, D. 1988. Multivariable functional interpolation and adaptive networks, Complex Systems, 2, 321-355.
[2] Jankowski, C., Kalyanswamy, A., Basson, S., and Spitz, J. 1990. NTIMIT: a phonetically balanced, continuous speech, telephone bandwidth speech database, Proc. ICASSP'90, 109-112.
[3] Lin, S.H. and Kung, S.Y. 1996. Face detection/recognition by probabilistic decision-based neural networks, IEEE Trans. on Neural Networks, Special Issue on Biometric Identification, Dec. 1996 (to appear).
[4] Lowe, D. and Webb, A.R. 1990. Exploiting prior knowledge in network optimization: an illustration from medical prognosis, Network: Computation in Neural Systems, 1, 299-323.
[5] Mak, M.W., Allen, W.G., and Sexton, G.G. 1993. Comparing multi-layer perceptrons and radial basis function networks in speaker recognition, J. of Microcomputer Applications, 16, 147-159.
[6] Moody, J. and Darken, C.J. 1989. Fast learning in networks of locally-tuned processing units, Neural Computation, 1, 281-294.
[7] Oglesby, J. and Mason, J.S. 1991. Radial basis function networks for speaker recognition, Proc. ICASSP'91, 393-396.
[8] Reynolds, D.A. 1994. Experimental evaluation of features for robust speaker identification, IEEE Trans. on Speech and Audio Processing, 2 (4), 639-643.
[9] Richard, M.D. and Lippmann, R.P. 1991. Neural network classifiers estimate Bayesian a posteriori probabilities, Neural Computation, 3, 461-483.
[10] Wan, E.A. 1990. Neural network classification: a Bayesian interpretation, IEEE Trans. on Neural Networks, 1 (4), 303-305.
[Figure 1 appears here: two scatter plots showing the data, the function centers, and the contours of the Gaussian functions. (a) Classification by an RBF network. (b) Classification by an EBF network.]
Fig. 1: Classification of two overlapping Gaussian distributions by (a) an RBF network and (b) an EBF network. Both networks have two inputs, two function centers, and two outputs. For the EBF network, the value of γ_j was set to 2. The first Gaussian distribution has mean $\boldsymbol{\mu}_1 = [2\ \ 2]^T$ and covariance matrix $\Sigma_1 = \begin{bmatrix} 3 & 0 \\ 0 & 1 \end{bmatrix}$, while the second has mean $\boldsymbol{\mu}_2 = [-1\ \ {-1}]^T$ and covariance matrix $\Sigma_2 = \begin{bmatrix} 1 & 0 \\ 0 & 3 \end{bmatrix}$. The classification accuracies for the RBF and EBF networks are 72.0% and 89.5%, respectively.
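For completeness, a setup like that of Fig. 1 can be approximated with the functions from the sketch in Section 2; the number of samples per class and the random seed below are assumptions, as the paper does not report them.

```python
import numpy as np

# Toy reproduction of the Fig. 1 setup (sample counts are assumed).
rng = np.random.default_rng(0)
X1 = rng.multivariate_normal([2, 2],   [[3, 0], [0, 1]], 500)
X2 = rng.multivariate_normal([-1, -1], [[1, 0], [0, 3]], 500)
X = np.vstack([X1, X2])
D = np.repeat(np.eye(2), 500, axis=0)          # 1-from-K target vectors

mu1, S1 = sample_center_and_cov(X1)            # Eq. (6) estimates
mu2, S2 = sample_center_and_cov(X2)
Phi = ebf_design_matrix(X, np.array([mu1, mu2]),
                        np.array([S1, S2]), gamma=2.0)
W = train_output_weights(Phi, D)
accuracy = np.mean((Phi @ W).argmax(axis=1) == D.argmax(axis=1))
print(f"EBF classification accuracy: {accuracy:.1%}")
```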