Low Complexity Speaker Authentication Techniques Using Polynomial Classifiers
W. M. Campbell and C. C. Broun
Motorola SSG, Scottsdale, AZ
(Correspondence to W.M.C.: [email protected]; C.C.B.: [email protected])
ABSTRACT
Modern authentication systems require high-accuracy, low-complexity methods. High accuracy ensures secure access to sensitive data. Low computational requirements produce high transaction rates for large authentication populations. We propose a polynomial-based classification system that combines high accuracy and low complexity using discriminative techniques. Traditionally, polynomial classifiers have been difficult to use for authentication because of either low accuracy or problems associated with large training sets. We detail a new training method that solves these problems. The new method achieves high accuracy by implementing discriminative classification between in-class and out-of-class feature sets. A separable approach to the problem enables the method to be applied to large data sets. Storage is reduced by eliminating redundant correlations in the in-class and out-of-class sets. We also show several new techniques that can be applied to balance prior probabilities and facilitate low-complexity retraining. We illustrate the method by applying it to the problem of speaker authentication using voice. We demonstrate the technique on a multisession speaker verification database collected over a one month period. Using a third order polynomial-based scheme, the new system gives less than one percent average equal error rate using only one minute of training data and less than five seconds of testing data per speaker.

Keywords: Polynomials, pattern classification, user authentication, speaker verification
1. INTRODUCTION
Biometrics is an emerging technology with outstanding potential in many modern authentication systems. Biometrics simplifies the interface to the human user by eliminating the need for passwords. Passwords are cumbersome at best: they are difficult to remember, must be changed frequently, and are subject to "cracking." Biometrics solves these problems by using distinguishing characteristics of individuals. Characteristics commonly used for authentication are voice, fingerprints, hand geometry, iris structure, facial characteristics, etc. Access is controlled through a verification process which determines whether a claimant's characteristics match those of the claimed identity. As biometric technology continues to grow, a need is emerging for methods that provide a high number of verifications per unit time.

In this paper, we propose the use of polynomial classifiers for verification. Polynomials have many attractive properties for use in modern authentication systems. First, polynomials fit well with traditional DSP architectures, since they involve repetitive multiply-add operations. Second, polynomials are known to be universal approximators for the ideal Bayes decision rule [1]. This property implies that as the degree of the polynomial is increased, and with enough data, we can achieve rates arbitrarily close to the Bayes error rate. Of course, in practice one is limited by mismatch between enrollment and verification.

Previous techniques for polynomial classification have had limited success. Because of the combinatorial explosion of the number of terms, low order polynomials (degree 1 and degree 2) are typically used. Training techniques for polynomial classification fall into two categories. The first category estimates the parameters of the polynomial expansion based on in-class data [2,3]. This method approximates the class-specific probabilities [4]. Since out-of-class data is not used for training a specific model, accuracy is limited. A second category of methods involves discriminative training [4] with a mean-squared error criterion. The goal of these methods is to approximate the a posteriori distribution for each class. This approach traditionally involves the decomposition of large matrices, so that it is intractable for large training sets in terms of both computation and storage.
We propose a new method which uses discriminative training and solves the large data set problem. In addition, we show that considerable computational improvements can be gained with the new method. Section 2 describes the authentication problem in detail, illustrates the basic polynomial discriminant function, and outlines the training problem. Section 3 describes a new algorithm for training polynomial classifiers for authentication systems; methods to lower computational complexity as well as increase accuracy are described. Section 4 describes the application of our method to the speaker authentication problem. Speaker authentication is the process of verifying the identity of an individual through voice. We illustrate the basic scenario for enrollment and verification, demonstrate the application of the new methods to the verification problem, and show that excellent results are obtained.
2. AUTHENTICATION
2.1. Hypothesis Testing
Authentication requires a decision between two hypotheses,

    H_0: the claimant is an impostor,
    H_1: the claimant is performing a valid access,                    (1)

given an observation x whose class is unknown. Statistical decision theory indicates that the choice between these hypotheses should be made using the probability distributions corresponding to the two classes. In particular, the Bayes decision rule [2] indicates that the decision should be based upon the a posteriori probabilities of each class:

    if p(H_0 | x) >= p(H_1 | x), decide H_0;
    if p(H_0 | x) <  p(H_1 | x), decide H_1,                           (2)
where x is an observation vector. A common method for solving the hypothesis testing problem is to approximate an ideal output for the classifier on a set of training data. That is, if f_0(x) and f_1(x) are the discriminant functions [4] for H_0 and H_1 respectively, then the ideal output of f_0(x) is 0 on all valid-access observation vectors and 1 on all impostor observation vectors. A similar statement holds for f_1(x). If the discriminant function f_0, for example, is optimized for mean-squared error over all possible functions f,

    f_0 = argmin_f  E_{x,H} [ | f(x) - y_0(x, H) |^2 ],                (3)
then the solution is f_0(x) = p(H_0 | x); see [4]. In equation (3), E_{x,H} is the expectation operator over the joint distribution of x and the two hypotheses, and y_0(x, H) is the ideal output for f_0. Thus, the least squares optimization problem gives the functions necessary for the hypothesis test (2). If the discriminant function in (3) is allowed to vary only over a given class (in our case, polynomials with a limited degree), then the optimization problem (3) gives an approximation of the a posteriori probabilities [4]. Once we obtain an approximation to the a posteriori probabilities using polynomial discriminant functions f_0 and f_1, the Bayes decision rule (2) can be written as

    if f_0(x) >= f_1(x), decide H_0;
    if f_0(x) <  f_1(x), decide H_1.                                   (4)

Using the fact that f_0(x) = 1 - f_1(x), see [4], we can rewrite (4) as

    if f_1(x) <= 1/2, decide H_0;
    if f_1(x) >  1/2, decide H_1.                                      (5)
The Bayes decision rule thus involves setting a threshold on the polynomial discriminant function modeling the valid-access hypothesis. Two observations should be made on the rule (5). First, generalization to the Bayes decision rule with a cost function, or to a Neyman-Pearson type rule [2], involves replacing the threshold 1/2 by a more general threshold T. Then, for example, the threshold can be varied according to the desired false acceptance rate or the desired false rejection rate. A second observation is that the method models the a posteriori probabilities for both H_0 and H_1 simultaneously using one discriminant function (because of the relation f_0 = 1 - f_1). This is unlike methods that model the class-specific distribution p(x | H_1); for those methods, improved performance is gained by also modeling p(x | H_0); see, for instance, the cohort normalization methods in [5,6].

Figure 1. Processing for one model of the classifier. (Block diagram: feature vectors are passed through the discriminant function, whose scores are averaged; the authentication model w parameterizes the discriminant function.)
2.2. Polynomial Discriminant Functions
The basic embodiment of the polynomial classifier is shown in Figure 1. The classifier consists of several parts. Feature vectors, x_1, ..., x_N, are introduced into the classifier. For each feature vector, x, a polynomial discriminant function, f(x) = w^t p(x), is evaluated. The authentication model is given by w. The vector p(x) consists of the polynomial basis terms of the input feature vector; e.g., for two features, x = [x_1  x_2]^t, and second order, the vector is given by

    p(x) = [ 1   x_1   x_2   x_1^2   x_1 x_2   x_2^2 ]^t.              (6)
In general, polynomial basis terms of the form

    x_{i_1} x_{i_2} ... x_{i_k}                                        (7)

are used, where k is less than or equal to the polynomial order, K. For each input feature vector, x_i, a score is produced by the inner product w^t p(x_i). The score is then averaged over all feature vectors to produce the final output,

    Score = (1/N) sum_{i=1}^{N} w^t p(x_i).                            (8)

Viewed in another manner, we are averaging the output of a discriminant function [2]. The accept/reject decision for the system in Figure 1 is performed by comparing the output score s in (8) to a threshold. If s < T, then reject the claim; otherwise, accept the claim. This strategy is the one demonstrated in (5).
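As an illustration of the scoring computation in (6)-(8), the following sketch builds the polynomial basis vector p(x) by brute force and averages the inner-product scores over a sequence of feature vectors. It is a minimal sketch, assuming numpy and an already-trained model vector w; the helper names are ours and not part of the paper. For two features and second order, the term ordering reproduces (6).

```python
import numpy as np
from itertools import combinations_with_replacement

def poly_basis(x, order):
    """Polynomial basis terms of x up to the given order, eq. (6)-(7):
    all monomials x_{i1} x_{i2} ... x_{ik} with k <= order (k = 0 gives the constant 1)."""
    x = np.asarray(x, dtype=float)
    terms = []
    for k in range(order + 1):
        for idx in combinations_with_replacement(range(len(x)), k):
            terms.append(np.prod(x[list(idx)]) if k > 0 else 1.0)
    return np.array(terms)

def score(w, features, order):
    """Average discriminant-function output over all feature vectors, eq. (8)."""
    return np.mean([w @ poly_basis(x, order) for x in features])
```

For 12 features and order 3 this gives 455 basis terms, matching the count used in Section 3.1.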
2.3. Training
In order to optimize the performance of the classifier in Figure 1, we use discriminative training with a mean-squared error criterion, as discussed in Section 2.1. For a valid claim, an output of 1 is desired; for impostor data, an output of 0 is desired. The resulting problem is

    w* = argmin_w [ sum_{i=1}^{N_user} | w^t p(x_i) - 1 |^2  +  sum_{i=1}^{N_imp} | w^t p(y_i) |^2 ].     (9)
Here, the user's training data consists of x_1, ..., x_{N_user}, and the example impostor data consists of y_1, ..., y_{N_imp}. The training method can be written in matrix form. First, define M_user as the matrix whose rows are the polynomial expansions of the user's data; i.e.,

    M_user = [ p(x_1)^t ;  p(x_2)^t ;  ... ;  p(x_{N_user})^t ].       (10)

Define a similar matrix, M_imp, for the impostor data, and define

    M = [ M_user ;  M_imp ].                                           (11)
The problem (9) then becomes

    w* = argmin_w || M w - o ||^2,                                     (12)

where o is the vector consisting of N_user ones followed by N_imp zeros (i.e., the ideal output).
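A direct (if memory-hungry) way to solve (12) is to form M and o explicitly and call a least-squares routine. The sketch below does exactly that, reusing the hypothetical poly_basis helper from the earlier sketch; it is only meant to make the problem setup concrete, and Section 3 replaces it with a far cheaper method.

```python
import numpy as np

def train_direct(user_vecs, imp_vecs, order):
    """Solve w* = argmin ||Mw - o||^2, eq. (12), by explicit least squares.
    poly_basis: see the sketch after eq. (8)."""
    M = np.vstack([poly_basis(x, order) for x in list(user_vecs) + list(imp_vecs)])
    o = np.concatenate([np.ones(len(user_vecs)), np.zeros(len(imp_vecs))])
    w, *_ = np.linalg.lstsq(M, o, rcond=None)
    return w
```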
3. TRAINING ALGORITHMS

3.1. The Basic Training Algorithm
To understand the difficulty of training using traditional matrix methods such as the QR decomposition or the SVD, we illustrate the training problem (12) with a short example. Suppose we are training for one user and have 99 example impostors. For the case of speaker authentication, it is typical to have feature vectors of size 12 with 6000 vectors for each speaker and impostor (corresponding to 60 seconds of speech at 100 feature vectors per second). The number of rows of M is then 6000 x 100 = 600,000; the number of columns for a third order polynomial is 455. Therefore, the matrix M has 273 million entries, occupying 2.2 gigabytes of storage (if 8 bytes per entry are used). Solving this problem using traditional methods requires the QR decomposition of the matrix M, a task which is intractable because of the large amount of memory required. The new method we propose solves this problem using very little memory; for the example above, it involves only the manipulation of a matrix occupying 1.7 megabytes, a savings of roughly a factor of 1000 in memory.

The problem (12) can be solved by partitioning and the method of normal equations [7],

    M^t M w = M^t o.                                                   (13)
We rearrange (13) to

    ( M_user^t M_user + M_imp^t M_imp ) w = M_user^t 1,                (14)

where 1 is the vector of all ones. If we define R_user = M_user^t M_user and define R_imp similarly, then (14) becomes

    ( R_user + R_imp ) w = M_user^t 1.                                 (15)
Our new training method is based on (15). There are several advantages to this approach. First, the matrices R_imp and R_user are fixed in size; i.e., they do not vary with the size of the training data set. Let M_terms equal the number of terms in p(x); then R_imp and R_user are matrices of size M_terms x M_terms. Second, the computation is partitioned. We can calculate R_user and R_imp separately at costs of O(N_user M_terms^2) and O(N_imp M_terms^2), respectively. The calculation of these two matrices is the most significant part of the computation. Note that R_imp can be precomputed and stored. Third, the terms in the right-hand side of (15) can be calculated as a submatrix of R_user; we denote the resulting vector a_user.

One potential drawback of the method should be noted. Since the normal equation method squares the condition number of the matrix M, it might be thought that there would be problems solving (14). In practice, this squaring has not caused solution instability. Many linear approximation problems have large condition numbers (e.g., linear FIR filter design), but yield good results.

The matrix R_user (and its impostor counterpart) in (15) has many redundant terms. In order to reduce storage and computation, only the unique terms should be calculated. The terms in R_user consist exactly of sums of the polynomial basis terms of order 2K. We denote the vector of basis terms of order 2K as p_2(x). Table 1 shows the number of unique terms in R_user for 12 features and different polynomial orders. The redundancy in R_user is very structured.
Table 1. Term redundancies for the matrix R_user (12 features).

    Order   Terms in R_user   Unique Terms   Ratio
      2             8,281          1,820      4.55
      3           207,025         18,564     11.15
      4         3,312,400        125,970     26.30
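The counts in Table 1 can be reproduced from the number of monomials of degree at most K in n variables, which is C(n+K, K). A small check of our own (not from the paper) for n = 12 features:

```python
from math import comb

def num_basis_terms(n_features, order):
    """Monomials of degree <= order in n_features variables, i.e., the length of p(x)."""
    return comb(n_features + order, order)

for K in (2, 3, 4):
    n_p = num_basis_terms(12, K)          # length of p(x)
    total = n_p ** 2                      # entries in R_user = sum of p(x) p(x)^t
    unique = num_basis_terms(12, 2 * K)   # distinct monomials of degree <= 2K, length of p_2(x)
    print(K, total, unique, round(total / unique, 2))
# Prints: 2 8281 1820 4.55 / 3 207025 18564 11.15 / 4 3312400 125970 26.3, matching Table 1.
```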
Suppose we have the polynomial basis terms of order k and wish to calculate the terms of order k+1. Assume that every term is of the form (7) with i_1 <= i_2 <= ... <= i_k. If we collect the k-th order terms whose last index is i_k = l into a vector u_l, then the (k+1)-th order terms ending with i_{k+1} = l are obtained as

    [ x_l u_1 ;  x_l u_2 ;  ... ;  x_l u_l ].                          (16)

We can then construct p(x) as follows. Initialize a vector with 1 and the first order terms. Then recursively calculate the (k+1)-th order terms from the k-th order terms using (16). Concatenate the different orders to get the final result. The resulting training algorithm is shown in Table 2.

One can plot the index of a term in the vector p_2(x) versus the index of that term in the matrix R_user, where the elements of R_user are indexed as a one-dimensional array in column major form. The resulting plot is shown in Figure 2 for 8 features and a polynomial order of 3. Note the interesting self-similar structure; this structure becomes more detailed as the polynomial order is increased.

It should be noted that the training algorithm in Table 2 is not limited to a polynomial basis of monomials. One key property which makes the algorithm possible is the fact that the approximation basis elements form a semigroup; that is, the product of two basis elements is again a basis element. For example, if x_1 x_2 and x_3 are basis elements, then x_1 x_2 x_3 is also a basis element. This property enables one to compute only the non-redundant elements in the matrix R. Another key property is the ability to partition the problem.
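The recursion (16) can be written compactly in code. The sketch below is our own illustration (helper names are not from the paper); it builds p(x) order by order, keeping the terms of order k grouped by their last variable index so that (16) can be applied directly.

```python
import numpy as np

def poly_basis_recursive(x, order):
    """Build p(x) using the recursion of eq. (16). Assumes order >= 1.

    groups[l] holds the current-order terms whose last index i_k equals l."""
    x = np.asarray(x, dtype=float)
    n = len(x)
    basis = [np.array([1.0])]                       # order-0 term
    groups = [np.array([x[l]]) for l in range(n)]   # order-1 terms, grouped by last index
    basis.append(np.concatenate(groups))
    for _ in range(2, order + 1):
        # (k+1)-th order terms ending in l: x_l times the k-th order groups u_1, ..., u_l
        groups = [x[l] * np.concatenate(groups[:l + 1]) for l in range(n)]
        basis.append(np.concatenate(groups))
    return np.concatenate(basis)
```

For x = [x_1, x_2] and order 2 this returns [1, x_1, x_2, x_1^2, x_1 x_2, x_2^2], i.e., the vector in (6).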
Table 2. Training Algorithm.

 1) Let r_imp = 0 and i = 1.
 2) Retrieve feature vector y_i from the background set of impostors.
 3) Let r_imp = r_imp + p_2(y_i).
 4) Let i = i + 1.
 5) Go to step 2 until all impostor feature vectors are processed.
 6) Repeat steps 7 to 13 for each user k, first setting r_{user,k} = 0 and i = 1.
 7) Retrieve feature vector i, x_{k,i}, for user k.
 8) Let r_{user,k} = r_{user,k} + p_2(x_{k,i}).
 9) Let i = i + 1.
10) Go to step 7 until all of user k's feature vectors are processed.
11) Map r_{user,k} to R_{user,k}, r_{user,k} to a_{user,k}, and r_imp to R_imp.
12) Compute R = R_{user,k} + R_imp and its Cholesky decomposition, R = L^t L.
13) Solve for w_{user,k} by using a triangular solver twice on L^t L w_{user,k} = a_{user,k}.
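A compact rendering of the normal-equation training in (13)-(15) and Table 2 is sketched below, under our own naming assumptions and using numpy/scipy. For clarity we accumulate R_user and R_imp directly as sums of outer products p(x) p(x)^t rather than storing only the unique terms p_2(x) and mapping them into R as the paper does; the key point is that only M_terms x M_terms quantities are ever formed (about 1.7 megabytes for 455 terms at 8 bytes per entry, versus the 2.2 gigabyte design matrix of Section 3.1).

```python
import numpy as np
from scipy.linalg import cho_factor, cho_solve

def accumulate_R(feature_vecs, order):
    """R = sum_i p(x_i) p(x_i)^t. We store the full (redundant) matrix for brevity;
    the paper instead accumulates only p_2(x) and maps it into R."""
    R = None
    for x in feature_vecs:
        p = poly_basis(x, order)          # hypothetical helper from the sketch after eq. (8)
        R = np.outer(p, p) if R is None else R + np.outer(p, p)
    return R

def train_user(user_vecs, R_imp, order):
    """Solve (R_user + R_imp) w = a_user, eq. (15), with a Cholesky factorization."""
    R_user = accumulate_R(user_vecs, order)
    a_user = R_user[:, 0].copy()          # the first basis term is the constant 1, so
                                          # column 0 of R_user equals M_user^t 1 = a_user
    c, low = cho_factor(R_user + R_imp)
    return cho_solve((c, low), a_user)

# R_imp can be computed once from the impostor background and reused for every enrollee:
# R_imp = accumulate_R(impostor_vecs, order); w = train_user(user_vecs, R_imp, order)
```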
Figure 2. Plot of the index of each term in p_2(x) (vertical axis) versus its index in R_user (horizontal axis), for 8 features and a polynomial order of 3.
If a linear approximation space is used, then we obtain the same problem as in (12). The resulting matrix can be partitioned, and thus the problem can be broken up to ease memory usage.
3.2. Weighting
The method of training presented so far produces scores that depend on the number of users and the class priors in the training set. To understand the problem, consider equation (9). If we fix the number of data vectors for the user, N_user, and increase the number of vectors for the example impostors, N_imp, then the error term involving the impostor data will tend to dominate the overall error. This property causes the optimization to focus more on approximating 0 on the impostor data and less on approximating 1 on the user data. The final optimization result will tend to decrease the score range on actual user data; i.e., as we increase N_imp arbitrarily, the optimal solution is the zero function. A statistical viewpoint of this process is that the problem (9) gives an approximation to the a posteriori probabilities, which involve the prior probabilities of the impostor and user data. The a posteriori probability will be small if the prior probability, p(H_1), is small, since p(H_1 | x) = p(H_1) p(x | H_1) / p(x). In particular, with very large amounts of impostor data, p(H_1) goes to zero. Since the training set does not necessarily represent the priors for actual use, the error criterion must be modified.

By compensating for the class priors, the score ranges are normalized and the effects of overtraining are mitigated, yielding improved classification performance. Compensation improves performance by balancing the number of vectors in the distinct classes of the classifier. Existing approaches to prior compensation include dividing output scores by the prior probability, equalizing the number of times each feature vector is used for each class per epoch of training, and vector quantizing the data; for a summary see [8]. Our approach fits well with the polynomial classifier and equates to weighting the rows of M. The new training problem is formulated as

    w* = argmin_w [ (1/N_user) sum_{i=1}^{N_user} | w^t p(x_i) - 1 |^2  +  (1/N_imp) sum_{i=1}^{N_imp} | w^t p(y_i) |^2 ].     (17)
Intuitively, the goal of the weighting is to equalize the contributions of the error on the user data and the error on the impostor set. This equalization prevents overtraining on the impostor data set. Because of the linear structure of the equations, the factors in (17) can be brought into the matrix equations, giving

    min_w || D M w - D o ||^2,                                         (18)

where D is a diagonal matrix. With weighting, equation (15) becomes

    ( R_user + (N_user / N_imp) R_imp ) w = a_user.                    (19)
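In code, the weighting only changes how the two accumulated matrices are combined before the solve. A minimal sketch, reusing the hypothetical helpers from the earlier training sketch:

```python
from scipy.linalg import cho_factor, cho_solve

def train_user_weighted(user_vecs, R_imp, n_imp, order):
    """Solve (R_user + (N_user / N_imp) R_imp) w = a_user, eq. (19)."""
    R_user = accumulate_R(user_vecs, order)   # see the sketch after Table 2
    a_user = R_user[:, 0].copy()
    c, low = cho_factor(R_user + (len(user_vecs) / n_imp) * R_imp)
    return cho_solve((c, low), a_user)
```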
3.3. Reinforcement and New User Addition
Reinforcement is the process of regenerating a given user's model with newly acquired data. New user addition is the process of adding a new individual to either the impostor set or the user set. The proposed algorithm allows both processes to be performed with minimal computation. Reinforcement can be performed quickly if we store the vectors r_user and r_imp. In particular, the impostor background vector r_imp typically need only be computed once and can then be stored. Updating the user's authentication model is performed by updating the r_user vector with each new feature vector, x_i,

    r_user,new = r_user,old + p_2(x_i).                                (20)
Retraining occurs after (20) has been performed for all new feature vectors. This reduces computation considerably, since we do not need to recompute r_user from scratch. New user addition involves a similar process; we only need to retrieve the appropriate r vector and then add the new user's data via an equation similar to (20). Table 3 illustrates a reinforcement algorithm using all of the methods previously discussed, i.e., normal equations, compression of R, and weighting. Note the computational savings due to the availability of the vectors r_user and r_imp from the original training process; see step 1. Note also that the computational complexity of the algorithm in Table 3 is dominated by the Cholesky decomposition.

Table 3. Reinforcement Algorithm.

 1) Retrieve r_user and r_imp (generated in the original training sequence).
 2) Read in the user's new feature vector x_i.
 3) Compute r_user = r_user + p_2(x_i).
 4) Go to step 2 until all new feature vectors for the user are processed.
 5) Obtain a_user as a subvector of r_user.
 6) Compute r = r_user + (N_user / N_imp) r_imp.
 7) Generate R from r via the mapping r -> R.
 8) Compute the Cholesky decomposition R = L^t L.
 9) Compute w_user by solving L^t L w_user = a_user using back substitution.
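The incremental update in (20) and Table 3 is cheap because only the stored user statistics are touched before a single solve. A sketch under the same assumptions as the earlier training code (full outer products kept in place of the compressed p_2 vector for brevity; helper names are ours):

```python
import numpy as np
from scipy.linalg import cho_factor, cho_solve

def reinforce(R_user, new_vecs, R_imp, n_user, n_imp, order):
    """Update the stored user statistics with new data, cf. eq. (20), and re-solve eq. (19).
    Here R_user plays the role of the stored r_user (full matrix kept for brevity)."""
    for x in new_vecs:
        p = poly_basis(x, order)              # hypothetical helper from the sketch after eq. (8)
        R_user = R_user + np.outer(p, p)      # incremental update, one vector at a time
    a_user = R_user[:, 0].copy()
    c, low = cho_factor(R_user + (n_user / n_imp) * R_imp)
    return R_user, cho_solve((c, low), a_user)
```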
4. SPEAKER AUTHENTICATION

4.1. Basic Scenario
The basic scenario for speaker authentication (or verification) is shown in Figure 3. The user requests access to the system. The system responds by prompting the user to read a particular text. After the user has spoken the text, the system either grants or denies access. This scenario encompasses two processes. First, "liveness" testing is performed: by prompting the user with a random set of doublets (e.g., "23-45-56"), the system counteracts spoofing (using a tape recorder or other method). Second, after liveness testing has been performed, the system verifies the identity of the individual. This scenario for speaker verification is referred to as text-prompted [9]. Other methods include text dependent (e.g., using a fixed "voice password") and text independent (the user can utter an arbitrary phrase). There are several advantages to our implementation of text-prompted verification. First, the text corresponding to a "voice password" does not have to be stored, which decreases model size. Second, "liveness" testing is not compromised, since text prompting randomly changes the input. Third, the user does not have to remember a "voice password."
Figure 3. Scenario for speaker verification. (Block diagram: the user's input speech is captured by a microphone, digitized (A/D), and passed to the computer and verification system, which issues the prompted text and returns an accept/reject access decision.)
4.2. Enrollment
Before verification can be performed, the user must enroll in the system. The basic process of enrollment is as follows. First, the user repeats a set of phrases as prompted by the enrollment module. This process can vary from several seconds to several minutes, depending upon the security level required by the system. Second, the classifier shown in Figure 1 must be trained to distinguish between the user and other potential impostors. The process of training the verification system is straightforward. Since the classifier in Figure 1 is discriminative, we must train the enrollee against a simulated impostor background. An impostor background is a fixed set of speakers (typically, for access control, 100 speakers works well) which are used as the out-of-class examples for training. That is, when a user enrolls, the model is trained so that the user's features are labeled with a 1 and the features from the impostor background set are labeled with a 0. Training is accomplished as described in Section 3.
4.3. Verification
The verification process is shown in Figure 4. Feature extraction is performed and the result is passed through a classifier. Note that the structure of the classifier in Figure 4 is the same as that of Figure 1; the main difference is that the resulting score is compared to a threshold. If the score is below a threshold T, the claimant is rejected; otherwise, he is accepted. The performance of the verification system is measured in terms of the average EER (equal error rate) across the speaker population. The equal error rate is determined using the FAR (false acceptance rate)
Figure 4 (block diagram): input speech is passed through feature extraction and the classifier; the resulting score is compared to the threshold T (accept if the score exceeds T).
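As a worked illustration of the EER metric used above (our own sketch, not from the paper), the equal error rate can be estimated from sets of valid-access and impostor scores by sweeping the threshold T and finding the point where the false acceptance and false rejection rates are closest:

```python
import numpy as np

def equal_error_rate(genuine_scores, impostor_scores):
    """Estimate the EER: the error rate at the threshold where FAR and FRR are closest."""
    genuine_scores = np.asarray(genuine_scores, dtype=float)
    impostor_scores = np.asarray(impostor_scores, dtype=float)
    eer, gap = 1.0, np.inf
    for t in np.unique(np.concatenate([genuine_scores, impostor_scores])):
        far = np.mean(impostor_scores >= t)   # false acceptance rate at threshold t
        frr = np.mean(genuine_scores < t)     # false rejection rate at threshold t
        if abs(far - frr) < gap:
            gap, eer = abs(far - frr), (far + frr) / 2.0
    return eer
```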