AN ITERATIVE TECHNIQUE FOR TRAINING SPEAKER VERIFICATION SYSTEMS

W. M. Campbell
Motorola Human Interface Lab, Tempe, AZ 85284, USA

ABSTRACT

As biometrics progresses from the lab into practical embedded applications, the need for systems that are computationally simple, memory efficient, and accurate becomes a priority. Polynomial classification systems have high potential to fit these requirements. Previous work has shown that polynomial techniques applied to speaker verification lead to accurate systems with simple multiply-add structures well suited to DSP architectures. One of the challenges of the polynomial method is to find memory-efficient techniques for training. We show that a simple matrix index-mapping technique combined with iterative training reduces memory requirements drastically during training. We apply the new method to the YOHO database to show its equivalence to prior approaches.

1. INTRODUCTION

Speaker recognition has been approached with several classification techniques. The most popular recent approaches have been Gaussian Mixture Models (GMMs) [1] and Hidden Markov Models (HMMs) [2, 3]. These approaches give excellent results when combined with cohort normalization techniques [4]. An alternative technique presented in [5] uses a discriminative polynomial classifier for speaker verification. The main advantages of this approach are:

   

- The training method is able to handle large amounts of enrollment data in a straightforward manner.
- The architecture is based upon a simple multiply-add only structure.
- The classifier is trained discriminatively with an algorithm that achieves the global minimum.
- The classifier output approximates a posteriori probabilities. This eliminates the need to perform cohort selection and cohort scoring; cohorts are incorporated as part of the training process.

A drawback of this technique is that the training process requires the solution of a large (for small platforms) matrix problem. We propose to solve this problem with a novel combination of iterative techniques and an implicitly constructed matrix. Iterative techniques for solving linear equations have typically been used in two areas. In the numerical analysis community, methods are targeted toward solving large sparse systems [6]. In the engineering community, approaches have concentrated on using iterative methods for recursive learning [7]. We experiment with iterative methods from both areas.

The paper is organized as follows. In Section 2, we review the previous method for training a polynomial classifier. In Section 3, we explain how iterative techniques can be applied to the problem and overview several iterative methods. Section 4 shows how to compute a key component of the iterative algorithm. Section 5 applies the algorithm to the YOHO database.

2. DIRECT TRAINING METHOD

The output of a polynomial classifier is given by f(x) = w^t p(x). Here, x is a feature vector, w is a vector of coefficients (the speaker model), and p(x) is a vector of monomial basis terms of degree K or less.

For example, if K = 2 and x = [x_1 x_2]^t, then

$$p(\mathbf{x}) = \begin{bmatrix} 1 & x_1 & x_2 & x_1^2 & x_1 x_2 & x_2^2 \end{bmatrix}^t. \qquad (1)$$
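To make the basis construction concrete, the following is a minimal sketch (not the paper's code) of how p(x) can be generated for an arbitrary degree K, assuming NumPy; terms are emitted in order of increasing degree so that (1) is reproduced for K = 2.

```python
import itertools
import numpy as np

def monomial_basis(x, K):
    """Return p(x): all monomials of the entries of x with degree <= K.

    Terms are generated in order of increasing degree; for K = 2 and
    x = [x1, x2] this yields [1, x1, x2, x1^2, x1*x2, x2^2] as in (1).
    """
    terms = [1.0]
    for degree in range(1, K + 1):
        # combinations_with_replacement lists each monomial exactly once
        for idx in itertools.combinations_with_replacement(range(len(x)), degree):
            terms.append(np.prod([x[i] for i in idx]))
    return np.array(terms)

# Example: degree-2 basis of a 2-dimensional feature vector
print(monomial_basis(np.array([2.0, 3.0]), 2))  # [1. 2. 3. 4. 6. 9.]
```

For 12-dimensional features and K = 3 this construction yields C(15,3) = 455 terms, the model size used in Section 5; the vector p_2(x) introduced below is the same construction with degree 2K = 6, giving C(18,6) = 18,564 terms.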

When enrolling speaker k, we train the output of the classifier to approximate 1 on the speaker's feature vectors and 0 on the anti-speaker data. We use a mean-squared error criterion for this training process; the resulting classifier approximates a posteriori probabilities [7]. Verification is accomplished by averaging the output of the classifier, f(x), over the feature vectors derived from an input utterance. The resulting average score is compared to a threshold and an accept/reject decision is made. Training the classifier is accomplished using the algorithm in Table 1. In the table, p_2(x) is the vector of all monomials of degree less than or equal to 2K.

Table 1: Training Speaker k's Model.
1) Set r_k = 0
2) For i = 1 to N_k
3)   Retrieve the i-th feature vector, x_{k,i}, for speaker k
4)   Let r_k = r_k + p_2(x_{k,i})
5) Next i
6) Retrieve r_imp
7) Compute r = r_imp + (c - 1) r_k
8) Map r to R
9) Solve R w_k = subvector(c r_k)

The vector p_2(x) can be calculated efficiently using the method in [5].

We typically choose c = 1 or c = N/N_k - 1, where N_k is the number of feature vectors for speaker k, and N is the total number of feature vectors in the anti-speaker population plus N_k. A vector, r_imp, representing the anti-speaker population is retrieved in Step 6. For the YOHO database and the polynomial degree used in our experiments, this 18,564-element vector represents the essential attributes of approximately 2.8 million 12-dimensional feature vectors in the anti-speaker population. We map the combination of the anti-speaker vector and speaker k's vector to R in Step 8. This mapping process is explained in detail in Section 4. It produces a large matrix, R, which is then used in a linear equation to solve for the speaker model, w_k. Our goal is to replace Steps 8-9 by an iterative technique to reduce storage substantially.
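As an illustration, here is a minimal sketch of the direct method of Table 1, reusing the monomial_basis helper sketched above. The index_map argument is hypothetical (a flat array giving, for each column-major entry of R, its index in r); constructing such a map is the subject of Section 4, and the right-hand side assumes p_2 is generated so that its first 455 entries coincide with p.

```python
import numpy as np

def train_speaker_direct(speaker_frames, r_imp, index_map, n_terms=455, K=3, c=1.0):
    """Direct training (Table 1): accumulate r_k, expand r to R, solve for w_k.

    speaker_frames: feature vectors for speaker k
    r_imp:          precomputed anti-speaker vector (accumulated p_2 terms)
    index_map:      hypothetical flat array; entry m is the index in r of the
                    m-th column-major entry of R (the Section 4 mapping)
    """
    r_k = np.zeros_like(r_imp)
    for x in speaker_frames:                       # Steps 1-5
        r_k += monomial_basis(x, 2 * K)            # accumulate p_2(x)
    r = r_imp + (c - 1.0) * r_k                    # Step 7
    R = r[index_map].reshape(n_terms, n_terms, order="F")  # Step 8: map r to R
    rhs = c * r_k[:n_terms]                        # Step 9: subvector(c r_k), the accumulated p(x)
    return np.linalg.solve(R, rhs)                 # speaker model w_k
```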

3. ITERATIVE TRAINING

Iterative methods are a common technique for solving linear equations of the form Ax = b. The basic structure of an iterative method is as follows. First, an initial guess, x_0, is made. Then a descent direction, d, is estimated using data from previous iterations, the (typically) unaltered matrix A, and the current best solution, x_i. In many cases, this involves computing a product Ap, where p is some auxiliary vector. The new solution estimate is then given by x_{i+1} = x_i + alpha d, where alpha is some suitably chosen scalar.

A common method for iterative training is implemented in the Kaczmarz algorithm for recursive learning [7, 8]. The method uses the update

$$\mathbf{x}_{i+1} = \mathbf{x}_i + \eta\,(b_j - \mathbf{a}_j \mathbf{x}_i)\,\mathbf{a}_j^t, \qquad (2)$$

where a_j is the j-th row of A, b_j is the j-th entry of b, and 0 < eta ||a_j||_2^2 < 2. We use eta = 1/||a_j||_2^2 in our experiments. The two main advantages of this method are (1) it is computationally simple, and (2) the update involves only one row of A.
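For illustration, a minimal sketch (not the authors' implementation) of the update (2) with eta = 1/||a_j||_2^2 and rows taken cyclically, assuming NumPy and a dense A:

```python
import numpy as np

def kaczmarz_solve(A, b, iterations=1000):
    """Kaczmarz iteration (2): sweep the rows of A, projecting the current
    estimate onto the hyperplane a_j x = b_j using eta = 1 / ||a_j||^2."""
    x = np.zeros(A.shape[1])
    for it in range(iterations):
        j = it % A.shape[0]                  # cycle through the rows
        a_j = A[j]
        eta = 1.0 / np.dot(a_j, a_j)
        x = x + eta * (b[j] - np.dot(a_j, x)) * a_j
    return x

# Tiny usage example on a symmetric positive definite system
A = np.array([[4.0, 1.0], [1.0, 3.0]])
b = np.array([1.0, 2.0])
print(kaczmarz_solve(A, b))   # approaches the solution [1/11, 7/11]
```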

More sophisticated algorithms for iterative training are the successive over-relaxation (SOR) algorithm and the conjugate gradient (CG) algorithm. The SOR algorithm is an extension of the well-known Gauss-Seidel method [6] with a parameter 0 < omega < 2 that can be varied to give different convergence rates. The conjugate gradient algorithm [6] is another popular method. It has the advantage that there are no direct parameters to estimate, and its convergence rate is determined by the condition of the matrix A. We chose these iterative methods because of their common use and applicability to our problem; many other methods are available, see [6].

We use iterative methods to solve the equation shown in Step 9 of Table 1. Several properties of R are critical. First, R is symmetric, nonnegative definite, and square by structure. Second, we assume (with no violations in practice) that R is nonsingular. These properties allow all of the mentioned iterative methods to be applied.

4. MATRIX-VECTOR MULTIPLY ALGORITHM

The core of our iterative algorithm is a method for computing Rw for an arbitrary w without explicitly performing the mapping from r to R. The basic idea is to utilize the structure of the matrix R. We specialize to the case where we map a speaker's vector r_k to a matrix structure as in Step 8 of Table 1. The matrix R_k is obtained from a sum of outer products

$$\mathbf{R}_k = \sum_{i=1}^{N_k} p(\mathbf{x}_{k,i})\, p(\mathbf{x}_{k,i})^t. \qquad (3)$$

The mapping in Step 8 is based upon the fact that it is not necessary to compute the sum of outer products (3) directly. Instead, one can compute the subset of unique entries (i.e., the vector p_2(x)) and then map this result to the final matrix. The advantage of this approach is that memory and computation are saved. For instance, for the system we present in Section 5, the matrix R is of size 455 x 455; thus, computing p_2(x) saves a factor of 207,025/18,564, approximately 11, in computation and storage.

A straightforward way to implement the mapping is to precompute an index function. We first label each entry of the matrix R_k in column-major order from 1 to the number of entries. The structure of R_k is determined by one outer-product term from (3), p(x_k) p(x_k)^t. Using a computer algebra package, one can compute the outer product and p_2(x) with symbolic variables. An exhaustive search for each entry of R_k in p_2(x) then yields the required index map. An example of such a mapping is shown in Figure 1.

[Figure 1: Index function for 8 features and degree 3 (index in r_k plotted against index in R_k).]

The difficulty in using an index function is that the index map must be stored. To avoid this problem, we propose an alternate method based upon a simple property. Suppose we have a feature vector with n variables, x_1, ..., x_n. Now let q_1, ..., q_n be the first n primes. Then

$$x_{i_1} x_{i_2} \cdots x_{i_k} \;\mapsto\; q_{i_1} q_{i_2} \cdots q_{i_k} \qquad (4)$$

defines a 1-1 map between the monomials and a subset of the positive integers. This mapping turns the process of locating a monomial term into a numerical search. Based upon the mapping in (4), an algorithm for computing an arbitrary product, Rw, can be derived; see Table 2. The basic idea is as follows. We first compute the numerical equivalents of p(x) and p_2(x) (for "symbolic" x) using the mapping (4) in Steps 1-2. We sort the resulting vector v_2 so that it can be searched quickly. Steps 4-12 then obtain the i-th entry of y using a standard matrix multiply; i.e., the (i,j)-th entry of R is multiplied by the j-th entry of w and summed over j to obtain y_i.
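Before turning to Table 2, a brief illustration (a sketch, not from the paper) of why the mapping (4) is one-to-one: by unique prime factorization, two monomials receive the same code only if they are the same monomial.

```python
# The first n primes encode the n variables; a monomial maps to the product
# of the primes of its variables, as in mapping (4).
primes = [2, 3, 5, 7, 11]          # q1..q5 for a 5-variable example

def code(var_indices):
    """Code of the monomial x_{i1} * x_{i2} * ... (indices may repeat)."""
    c = 1
    for i in var_indices:
        c *= primes[i]
    return c

print(code([0, 1]))     # x1*x2      -> 2*3   = 6
print(code([0, 0, 2]))  # x1^2 * x3  -> 2*2*5 = 20
print(code([1, 2]))     # x2*x3      -> 3*5   = 15  (distinct from x1*x2)
```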

Table 2: Calculation of y = Rw.
1) Let q be the vector of the first n primes.
2) Let v = p(q) and v_2 = p_2(q).
3) Sort v_2 into a numerically increasing vector, v_2'. Store the permutation, pi, that maps v_2' to v_2.
4) For i = 1 to (number of rows of R)
5)   Let y_i = 0
6)   For j = 1 to (number of columns of R)
7)     Compute n = v_i v_j
8)     Perform a binary search for n in v_2'; call the index of the resulting location i_n'
9)     Using the permutation pi, find the index, i_n, in v_2 corresponding to the index i_n' in v_2'
10)    y_i = y_i + r_{i_n} w_j
11)  Next j
12) Next i
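To make the procedure concrete, here is a minimal sketch of Table 2 in Python (not the paper's C implementation). It assumes the entries of r were accumulated in the same increasing-degree order used by the monomial_codes helper below, and uses Python's bisect for the binary search of Step 8.

```python
import bisect
import itertools

def monomial_codes(primes, degree):
    """Integer code of each monomial of degree <= `degree` under mapping (4):
    the monomial x_{i1}...x_{ik} is encoded as the product q_{i1}...q_{ik}."""
    codes = [1]
    for d in range(1, degree + 1):
        for idx in itertools.combinations_with_replacement(range(len(primes)), d):
            prod = 1
            for i in idx:
                prod *= primes[i]
            codes.append(prod)
    return codes

def implicit_matvec(r, w, primes, K):
    """Compute y = R w using only the compressed vector r (accumulated p_2 entries),
    following the steps of Table 2; `primes` are the first n primes."""
    v = monomial_codes(primes, K)            # Steps 1-2: codes of p(q)
    v2 = monomial_codes(primes, 2 * K)       #            codes of p_2(q)
    order = sorted(range(len(v2)), key=lambda i: v2[i])   # Step 3: permutation pi
    v2_sorted = [v2[i] for i in order]
    y = [0.0] * len(v)
    for i in range(len(v)):                  # Steps 4-12
        for j in range(len(v)):
            code = v[i] * v[j]               # Step 7: code of the (i, j) entry of R
            pos = bisect.bisect_left(v2_sorted, code)   # Step 8: binary search
            y[i] += r[order[pos]] * w[j]                # Steps 9-10
    return y
```

Only r (18,564 doubles for the Section 5 system), the code vectors, and the permutation need to be kept, rather than the full 455 x 455 matrix; this is the source of the memory savings quantified in Section 5.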

5. RESULTS

We applied our method to the YOHO database for speaker verification [9]. YOHO uses combination-lock phrases, e.g., "26-81-57," for enrollment and verification. The YOHO database has 4 enrollment sessions consisting of 24 phrases each; all sessions were used for training. For verification, there are 4 phrases in each of 10 sessions, for a total of 40 tests per speaker. We performed verification using all speakers as impostors and true claimants.

Feature extraction was performed by examining 30 ms frames every 10 ms. For each frame, mean removal, pre-emphasis, and a Hamming window were applied. Then, 12 LP coefficients were obtained and transformed to 12 LP cepstral coefficients (LPCCs). Cepstral mean subtraction was performed on the result.

A polynomial classifier of degree 3 was applied to the 12-dimensional feature vectors generated. This resulted in a speaker model with 455 coefficients per speaker. The anti-speaker population vector, r_imp, was constructed by computing an r_k for each speaker and then summing across all speakers. This approach is a compromise since we are including actual impostors in our training; we have addressed this issue in [5]. We incorporated the algorithm in Table 2 into several iterative methods: SOR, CG, the Kaczmarz method, steepest descent, and a preconditioned CG method. The results of these methods applied to training the first speaker, 101, are shown in Figure 2. The metric epsilon_i used to indicate progress is the norm of the residual divided by the norm of the right-hand side, b; i.e., if the equation to be solved is Ax = b, then

$$\varepsilon_i = \frac{\|\mathbf{A}\mathbf{x}_i - \mathbf{b}\|_2}{\|\mathbf{b}\|_2}. \qquad (5)$$
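For reference, a minimal sketch of the stopping test implied by (5), assuming NumPy; in our setting the product A x_i would be formed with the implicit multiply of Table 2 rather than an explicit A.

```python
import numpy as np

def relative_residual(A, x, b):
    """epsilon_i from (5): norm of the residual divided by the norm of b."""
    return np.linalg.norm(A @ x - b) / np.linalg.norm(b)

# Typical use inside an iterative loop (Section 5 iterates until epsilon_i < 1e-4):
#     if relative_residual(R, w, rhs) < 1e-4:
#         break
```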

Note that since we have assumed R to be invertible, the error should ideally go to zero. We can thus base our convergence criterion on epsilon_i falling below a chosen value. From Figure 2, several conclusions can be made. First, methods such as steepest descent and the Kaczmarz algorithm converge slowly; in practice, we found the solution did not perform well even after 1000 iterations. Second, the CG method gives acceptable results after 1000 iterations, but this amount of computation is unacceptable; one would typically expect significantly fewer than 455 iterations (the dimension of R) for convergence. Finally, the SOR method performs the best with no preconditioning. We found that SOR with omega = 1.2 converged faster than omega = 1.0 (after trying several values of omega).

[Figure 2: Comparison of Iterative Methods. Relative residual epsilon_i versus iteration (0 to 1000) for the Kaczmarz, steepest descent, CG, SOR (omega = 1.0 and 1.2), and preconditioned CG methods.]

We explored preconditioning the matrix R to achieve better convergence. The condition number of R was estimated to be about 5.2 x 10^7; this explained the difficulty in the CG iterations. After examining several standard approaches, we settled on a diagonal preconditioner. The matrix R arises from a matrix product M^t M. Applying a diagonal matrix, D, to normalize the column norms of M is the same as computing MD; this results in the matrix product DRD. A convenience of using the matrix D is that it can be obtained from the entries of R. That is,

$$\mathbf{D} = \operatorname{diag}\!\left(\frac{1}{\sqrt{R_{1,1}}}, \ldots, \frac{1}{\sqrt{R_{n,n}}}\right). \qquad (6)$$
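A minimal sketch of the scaling in (6), assuming NumPy and, for clarity, an explicit R with a direct solve standing in for the preconditioned CG iteration described below:

```python
import numpy as np

def diagonal_preconditioner(R):
    """D from (6): diag(1/sqrt(R_11), ..., 1/sqrt(R_nn)), stored as a vector,
    so that D R D has unit diagonal and a much smaller condition number."""
    return 1.0 / np.sqrt(np.diag(R))

def solve_preconditioned(R, rhs):
    """Solve R w = rhs via the scaled system (D R D) y = D rhs, then w = D y."""
    d = diagonal_preconditioner(R)
    R_scaled = (d[:, None] * R) * d[None, :]   # D R D
    y = np.linalg.solve(R_scaled, d * rhs)
    return d * y
```

Because D requires only the diagonal of R, it can be obtained through the same mapping used for the implicit product, without forming R explicitly.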

After applying this preconditioner, the condition number was reduced to 5.7 x 10^3. The resulting preconditioned CG algorithm (Precond. CG) is the fastest converging method in Figure 2. We note that we also applied preconditioning to the other iterative methods; in no case did we obtain gains as substantial as in the CG case.

We implemented the preconditioned CG method in C and enrolled all speakers in the YOHO database using iterative methods. After experimenting with several convergence values, we found that if we iterated until epsilon_i < 10^-4, the average EER (equal error rate) was the same as that obtained from a direct solution method. On average, the method required about 100 iterations per speaker to converge.

We contrast the memory usage of the original direct approach with the new iterative method. For the original approach, we allocate space for r (double precision, 8 x 18,564 bytes), the index map (16-bit int, 2 x 455 x 455 bytes), and R (double precision, 8 x 455 x 455 bytes), for a total of 2,218,762 bytes. For the new method, we store r (double precision, 8 x 18,564 bytes), v (16-bit int, 2 x 18,564 bytes), v_2' (32-bit int, 4 x 18,564 bytes), pi (16-bit int, 2 x 18,564 bytes), and scratch space for the iterative algorithm (double precision, 455 x 5 x 8 bytes), for a total of 315,224 bytes. The memory savings is thus 2,218,762 / 315,224, approximately a factor of 7.

6. CONCLUSIONS

We have presented a new set of techniques for iterative training of polynomial classifiers. These methods reduce memory requirements for training, allowing either larger problems to be solved in less memory or smaller problems to fit more readily on small devices. An application of the method to the YOHO database resulted in a quickly converging method giving accuracy equivalent to previous systems with substantial memory savings.

7. REFERENCES

[1] D. A. Reynolds, "Automatic speaker recognition using Gaussian mixture speaker models," The Lincoln Laboratory Journal, vol. 8, no. 2, pp. 173-192, 1995.
[2] J. Colombi, D. Ruck, S. Rogers, M. Oxley, and T. Anderson, "Cohort selection and word grammar effects for speaker recognition," in Proceedings of the International Conference on Acoustics, Speech, and Signal Processing, pp. 85-88, 1996.
[3] C. Che and Q. Lin, "Speaker recognition using HMM with experiments on the YOHO database," in Proceedings of Eurospeech, pp. 625-628, 1995.
[4] A. E. Rosenberg, J. DeLong, C.-H. Lee, B.-H. Juang, and F. K. Soong, "The use of cohort normalized scores for speaker verification," in Proceedings of the International Conference on Spoken Language Processing, pp. 599-602, 1992.
[5] W. M. Campbell and K. T. Assaleh, "Polynomial classifier techniques for speaker verification," in Proceedings of the International Conference on Acoustics, Speech, and Signal Processing, pp. 321-324, 1999.
[6] G. H. Golub and C. F. Van Loan, Matrix Computations. Johns Hopkins University Press, 1989.
[7] J. Schürmann, Pattern Classification. John Wiley and Sons, Inc., 1996.
[8] S. Kaczmarz, "Angenäherte Auflösung von Systemen linearer Gleichungen," Bull. Internat. Acad. Polon. Sciences et Lettres, pp. 355-357, 1937.
[9] J. P. Campbell, Jr., "Testing with the YOHO CD-ROM voice verification corpus," in Proceedings of the International Conference on Acoustics, Speech, and Signal Processing, pp. 341-344, 1995.
