FIGURES OF MERIT FOR BIOMETRIC VERIFICATION AND A MEANS FOR DIMENSION REDUCTION

Raymond Veldhuis, Asker Bazen
Universiteit Twente, Fac. EEMCS
P.O. Box 217, 7500 AE Enschede, The Netherlands
[email protected]

Optimum reduction of the number of dimensions of the feature vector in biometric verification requires a figure of merit for the verification performance. We show that, under certain conditions, various figures of merit can be approximated by the same expression. We select one, the discrimination distance, as an objective function for dimension reduction. This new method for dimension reduction is called maximum discrimination analysis (MDA). It is shown to have a better verification performance than linear discriminant analysis.

INTRODUCTION

Reduction of the number of dimensions of the feature vector is desirable in biometric verification. One reason is that it helps to keep the computational complexity within limits. Another reason is that, for systems with parameters derived from training data, it may improve the performance, e.g. in terms of the equal-error rate or the probability of error. This is called the Hughes phenomenon [5]. Here we consider likelihood-ratio-based verification, based on known normal probability densities, with the dimension reduction achieved by an orthogonal projection. In this case the Hughes phenomenon does not play a role and it is important to reduce the number of dimensions while keeping the performance as high as possible. Such a process requires a measure of performance as an objective function that is maximized. This is a problem, because the performance of a verification system is fully characterized by the receiver operating characteristic, which cannot be translated into a unique figure of merit. Dimension reduction is often based on linear discriminant analysis (LDA), e.g. [1]. LDA maximizes the distinction between classes based on their covariance matrices. It results in one single linear transform for all classes. In [4] it is suggested to use the divergence or the Bhattacharyya distance as an objective function. In [7] an objective function based on the geometric properties of the underlying probability densities is proposed and compared with other objective functions. However, both [4] and [7] consider the classification problem, which is different from the verification problem. A generally accepted figure of merit in the (biometric) verification community is the equal-error rate, even though it is recognized that it only provides limited information about the performance. It is defined as the false-acceptance rate α (or the false-reject rate β) at the point of operation where α = β. A disadvantage is that, in the normal case considered here, a closed expression for the equal-error rate in terms of the parameters of the densities does not exist. It is, therefore, computed from experimental data.

We first review figures of merit that may serve as objective functions for dimension reduction. A verification system is called 'good' when the within-class variances are much smaller than the total variance. For 'good' verification systems we show that a number of different figures of merit can all be approximated by the same expression. One figure of merit, the discrimination distance between the probability density of the genuine feature vector and that of a feature vector randomly drawn from the observation space (further referred to as discrimination), is used as a basis for dimension reduction. By means of simulations, we illustrate that the relation between the logarithm of the equal-error rate and the discrimination is, to a good approximation, linearly decreasing. This indicates that the discrimination can be acceptable as a measure of performance to the biometric verification community. The discrimination is used as a basis for a new method of dimension reduction called maximum discrimination analysis (MDA). This is a method to find the class-dependent orthogonal projection that keeps the discrimination in the subspace maximal. A numerical procedure to obtain the MDA transform from the parameters of the probability densities is presented. This procedure is simple and converges rapidly. We show for synthetic data that the verification performance in terms of equal-error rates obtained with MDA is superior to that obtained with linear discriminant analysis.

FIGURES OF MERIT FOR BIOMETRIC VERIFICATION

We assume that the elements $x_i$, $i = 1, \ldots, m$ of a feature vector $x$, randomly drawn from the observation space, are independent and identically distributed and have normal probability densities with zero mean. This probability density is denoted by $p(x)$, with $E\{x_i^2\} = 1$. Furthermore, we assume that the elements of a genuine feature vector are uncorrelated. The probability density of a feature vector belonging to a class with mean $\mu$ is denoted by $p(x|\mu)$, with $E\{x_i\} = \mu_i$ and $E\{(x_i - \mu_i)^2\} = \sigma_i^2$. In practice, these conditions can always be met by applying a linear transform that simultaneously whitens the observation space and decorrelates the genuine feature vector [4] (a code sketch of such a transform follows the list of figures of merit below). The likelihood ratio given class mean $\mu$ is denoted by $L(x;\mu) \stackrel{\mathrm{def}}{=} \frac{p(x|\mu)}{p(x)}$ and the log-likelihood ratio by $l(x;\mu) \stackrel{\mathrm{def}}{=} \log(L(x;\mu))$. A 'good' verification system has a within-class variance that is much smaller than the between-class variance, i.e. $\sigma_i^2 \ll 1$, $i = 1, \ldots, m$. We restrict ourselves to figures of merit for which a closed expression can be obtained. Later we will present an empirical relation between the equal-error rate and two of these figures of merit. Candidate figures of merit are:

• The log area under the receiver operating characteristic [3]:
$$a_{ROC} \stackrel{\mathrm{def}}{=} -\log \int_0^1 \beta \, d\alpha, \qquad (1)$$
with $\alpha$ the false-acceptance and $\beta$ the false-reject rate.

• The Bhattacharyya distance [4]:
$$d_B = -\log \int \sqrt{p(x|\mu)\,p(x)}\, dx. \qquad (2)$$

• The log-relative expectation distance, defined as:
$$\varrho \stackrel{\mathrm{def}}{=} \log\!\left(\frac{E\{L(x;\mu)|\mu\} - E\{L(x;\mu)\}}{\mathrm{std}\{L(x;\mu)\}}\right) = \frac{1}{2}\log\!\left(E\{L^2(x;\mu)\} - 1\right). \qquad (3)$$

• The discrimination [2] (or Kullback-Leibler distance [6]):
$$d_{dis} \stackrel{\mathrm{def}}{=} \int \log\!\left(\frac{p(x|\mu)}{p(x)}\right) p(x|\mu)\, dx = E\{l(x;\mu)|\mu\}. \qquad (4)$$

• The divergence [4] (or symmetrical discrimination)¹:
$$d_{div} \stackrel{\mathrm{def}}{=} \int \log\!\left(\frac{p(x|\mu)}{p(x)}\right) p(x|\mu)\, dx + \int \log\!\left(\frac{p(x)}{p(x|\mu)}\right) p(x)\, dx = E\{l(x;\mu)|\mu\} - E\{l(x;\mu)\}. \qquad (5)$$

¹ The detection index, defined in the context of signal detection [8], is the divergence normalized by $\mathrm{std}\{l(x;\mu)\}$.
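The simultaneous whitening and decorrelation assumed at the start of this section is not spelled out in the text. The following is a minimal numpy sketch of one standard way to construct such a transform from estimated total and within-class covariance matrices; the function and variable names are ours, not from the paper.

```python
import numpy as np

def whitening_decorrelating_transform(sigma_total, sigma_within):
    """Return A such that A @ sigma_total @ A.T = I (whitened observation
    space) and A @ sigma_within @ A.T is diagonal (uncorrelated genuine
    feature vector)."""
    # Whiten the total covariance: W = D^{-1/2} E^T with sigma_total = E D E^T.
    d, e = np.linalg.eigh(sigma_total)
    w = (e / np.sqrt(d)).T
    # Rotate so that the transformed within-class covariance becomes diagonal;
    # a rotation leaves the whitened total covariance equal to the identity.
    _, u = np.linalg.eigh(w @ sigma_within @ w.T)
    return u.T @ w
```

After applying such a transform, the diagonal of the transformed within-class covariance gives the $\sigma_i^2$ and the transformed class mean gives $\mu$.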

Table 1 shows the expressions for the figures of merit listed above and their approximations if $\sigma_i^2 \ll 1$, $i = 1, \ldots, m$.

Table 1: Figures of merit (FOM) for verification and their approximation for $\sigma_i^2 \ll 1$, $i = 1, \ldots, m$.

$a_{ROC}$
  Expression: $-\log\!\left(\frac{1}{2}\int_0^\infty \frac{\beta^2(L)}{L^2}\, dL\right)$
  Approximation: $\frac{1}{2}\sum_i (\mu_i^2 - \log\sigma_i^2) + \log 4 - \log\int_0^\infty P^2\{\chi^2_m > u\}\, e^{u/2}\, du$

$d_B$
  Expression: $\frac{1}{4}\left(\sum_i \frac{\mu_i^2}{1+\sigma_i^2} + \sum_i \log\frac{(1+\sigma_i^2)^2}{\sigma_i^2}\right) - \frac{m}{2}\log 2$
  Approximation: $\frac{1}{4}\sum_i (\mu_i^2 - \log\sigma_i^2) - \frac{m}{2}\log 2$

$\varrho$
  Expression: $\frac{1}{2}\log\!\left(\frac{1}{\sqrt{\prod_i \sigma_i^2(2-\sigma_i^2)}}\exp\!\left(\frac{1}{2}\sum_i \frac{2\mu_i^2}{2-\sigma_i^2}\right) - 1\right)$
  Approximation: $\frac{1}{4}\sum_i (\mu_i^2 - \log\sigma_i^2) - \frac{m}{4}\log 2$

$d_{dis}$
  Expression: $\frac{1}{2}\sum_i (\mu_i^2 + \sigma_i^2 - \log\sigma_i^2) - \frac{m}{2}$
  Approximation: $\frac{1}{2}\sum_i (\mu_i^2 - \log\sigma_i^2) - \frac{m}{2}$

$d_{div}$
  Expression: $\frac{1}{2}\sum_i \frac{(1-\sigma_i^2)^2 + (1+\sigma_i^2)\mu_i^2}{\sigma_i^2}$
  Approximation: $\frac{1}{2}\sum_i \frac{1+\mu_i^2}{\sigma_i^2}$

Except for the divergence $d_{div}$, we observe that for 'good' verification systems all figures of merit can be approximated by $A\left(\sum_i \mu_i^2 - \log\sigma_i^2\right) - Bm + C$, with positive $A$, $B$ and $C$. This is not immediately evident for $a_{ROC}$, but it follows from Figure 1, which shows $\log\int_0^\infty P^2\{\chi^2_m > u\}\, e^{u/2}\, du$ as a function of $m$.

Figure 1: $\log\int_0^\infty P^2\{\chi^2_m > u\}\, e^{u/2}\, du$ as a function of $m$ and its straight-line approximation.
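As a small numeric illustration of the approximations in Table 1, the sketch below evaluates the closed-form Gaussian expressions for the discrimination, the Bhattacharyya distance and the divergence, and compares the first two with their $\sigma_i^2 \ll 1$ approximations. The particular parameter draw and the variable names are ours and serve only as an example.

```python
import numpy as np

rng = np.random.default_rng(0)
m = 25
sigma2 = rng.uniform(0.01, 0.1, m)            # small within-class variances ("good" system)
mu = rng.normal(0.0, np.sqrt(1.0 - sigma2))   # between-class spread fills up the remaining variance

# Exact figures of merit for the Gaussian model of this section.
d_dis = 0.5 * (np.sum(mu**2 + sigma2 - np.log(sigma2)) - m)
d_div = 0.5 * np.sum(((1.0 - sigma2)**2 + (1.0 + sigma2) * mu**2) / sigma2)
d_bat = 0.25 * np.sum(mu**2 / (1.0 + sigma2)
                      + np.log((1.0 + sigma2)**2 / sigma2)) - 0.5 * m * np.log(2.0)

# Common form A * sum_i(mu_i^2 - log sigma_i^2) - B*m + C of the approximations.
core = np.sum(mu**2 - np.log(sigma2))
print(d_dis, 0.5 * core - 0.5 * m)                    # discrimination vs. its approximation
print(d_bat, 0.25 * core - 0.5 * m * np.log(2.0))     # Bhattacharyya distance vs. its approximation
print(d_div, 0.5 * np.sum((1.0 + mu**2) / sigma2))    # divergence: a different, 1/sigma_i^2 behaviour
```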

The expression $\sum_i \mu_i^2 - \log(\sigma_i^2)$ can be interpreted as follows. Let $p(x)$ and $p(x|\mu)$ be the result of a linear transform on an outcome space with zero mean and covariance matrix $\Sigma_T$, and let the original genuine feature vector have mean $\nu$ and covariance matrix $\Sigma_W$. Then $\sum_i \mu_i^2 - \log(\sigma_i^2) = \nu^T \Sigma_T^{-1} \nu - \log\!\left(\frac{|\Sigma_W|}{|\Sigma_T|}\right)$. The first term on the right-hand side of this expression is the squared Mahalanobis distance from the class mean to the outcome-space mean. It says that verification performance is higher for unlikely classes. The second term is a measure of the relative within-class variability. The lower this is, the higher the performance.

Next we study how the discrimination and the divergence relate to the equal-error rate. Since there is no closed expression for the equal-error rate, this is done experimentally. The discrimination is chosen because of its simple expression. The divergence is chosen because its approximation for $\sigma_i^2 \ll 1$, $i = 1, \ldots, m$ differs from that of the other figures of merit and because [4] recommended it as an objective function for dimension reduction. We use the exact expressions of Table 1, not their approximations. This means that the results are generally valid and not only when $\sigma_i^2 \ll 1$, $i = 1, \ldots, m$. Figure 2 shows combinations of the log equal-error rate and the discrimination (·) and combinations of the log equal-error rate and the divergence (+) for 100 randomly drawn parameter sets $\{(\mu_i, \sigma_i^2)\}_{i=1}^m$, with $m = 25$. The log equal-error rates were computed by means of Monte-Carlo simulations. The $\sigma_i^2$ were drawn from a uniform probability density on the interval [0, 1]. The $\mu_i$ were drawn from a normal probability density with zero mean and variance $1 - \sigma_i^2$. This dependency of $\mu_i$ on $\sigma_i^2$ ensures that the sum of the within-class variance and the between-class variance equals the total variance, which must hold for verification. Figure 2 illustrates that the relation between the log equal-error rate and the discrimination can be approximated well by a straight line. This means that the discrimination can be used to predict the equal-error rate. At the same time it shows that the relation between the log equal-error rate and the divergence is less consistent. Therefore, we have chosen the discrimination as the objective function for dimension reduction.
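A Monte-Carlo estimate of the equal-error rate such as the one used for Figure 2 could be set up roughly as follows. This is only a sketch of the experiment described above: the sample size, the threshold sweep and all names are ours, not the authors' implementation.

```python
import numpy as np

def eer_by_simulation(mu, sigma2, n=200_000, seed=None):
    """Monte-Carlo estimate of the equal-error rate of the
    log-likelihood-ratio test for one class (mu, diag(sigma2))."""
    rng = np.random.default_rng(seed)
    m = len(mu)

    def llr(x):
        # log p(x | mu) - log p(x) for the Gaussian model of this section
        return np.sum(-0.5 * np.log(sigma2) - 0.5 * (x - mu)**2 / sigma2
                      + 0.5 * x**2, axis=1)

    genuine = llr(mu + rng.normal(size=(n, m)) * np.sqrt(sigma2))
    impostor = llr(rng.normal(size=(n, m)))
    # Sweep the threshold over the pooled scores and find where FAR ~ FRR.
    thr = np.sort(np.concatenate([genuine, impostor]))
    far = 1.0 - np.searchsorted(np.sort(impostor), thr) / n
    frr = np.searchsorted(np.sort(genuine), thr) / n
    return 0.5 * (far + frr)[np.argmin(np.abs(far - frr))]

rng = np.random.default_rng(1)
m = 25
sigma2 = rng.uniform(0.0, 1.0, m)
mu = rng.normal(0.0, np.sqrt(1.0 - sigma2))
d_dis = 0.5 * (np.sum(mu**2 + sigma2 - np.log(sigma2)) - m)
print(d_dis, eer_by_simulation(mu, sigma2, seed=2))
```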

MAXIMUM DISCRIMINATION ANALYSIS

We rewrite the discrimination (4) as
$$d_{dis}(\mu, \Lambda) = \frac{1}{2}\left(\mu^T\mu + \mathrm{trace}(\Lambda) - \log|\Lambda| - m\right), \qquad (6)$$
with $\Lambda$ an $m \times m$ diagonal matrix with $\lambda_{ii} = \sigma_i^2$. The aim of MDA is to determine a subspace of the observation space with an orthonormal basis $\{v_1, \ldots, v_n\}$, $n < m$, such that the discrimination after projection onto this subspace is maximum.
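Equation (6) translates directly into a few lines of code. A minimal sketch (the function name is ours):

```python
import numpy as np

def discrimination(mu, lam):
    """d_dis(mu, Lambda) of Eq. (6); lam holds the diagonal of Lambda,
    i.e. the within-class variances sigma_i^2."""
    mu, lam = np.asarray(mu, float), np.asarray(lam, float)
    return 0.5 * (mu @ mu + lam.sum() - np.log(lam).sum() - lam.size)
```

For the projected quantities used in the next step, the trace and the log-determinant of the projected covariance replace the two sums.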

Figure 2: Relation between equal-error rate and discrimination (·), and equal-error rate and divergence (+).

Let $V_n = (v_1, \ldots, v_n)$. The discrimination after projection is
$$d_{dis}(V_n^T\mu, V_n^T \Lambda V_n) = \frac{1}{2}\left(\mu^T V_n V_n^T \mu + \mathrm{trace}(V_n^T \Lambda V_n) - \log|V_n^T \Lambda V_n| - n\right). \qquad (7)$$
For $\mu = 0$, it can be shown that the basis $V_n$ that maximizes this expression is identical to the one resulting from LDA. If this is not the case, direct maximization is cumbersome. Therefore, we follow an iterative approach in which we first determine the optimal projection onto a one-dimensional subspace. That is, we look for the $v_1$, with $\|v_1\| = 1$, which maximizes
$$v_1^T \mu \mu^T v_1 + v_1^T \Lambda v_1 - \log(v_1^T \Lambda v_1). \qquad (8)$$

Note that the order in the first term has changed and that the trace operator and the determinant have disappeared. Each subsequent $v_i$, $i = 2, \ldots, m$ is then found by maximizing
$$v_i^T \mu \mu^T v_i + v_i^T \Lambda v_i - \log(v_i^T \Lambda v_i) \qquad (9)$$
under the constraints that $\|v_i\| = 1$ and $v_i \perp v_1, \ldots, v_{i-1}$. We write $v_i = V_i^{\perp} w_i$, in which the columns of $V_i^{\perp}$ are orthonormal and span the subspace orthogonal to $v_1, \ldots, v_{i-1}$. The constrained maximization of (9), or (8), is then equivalent to the unconstrained maximization, as a function of $w_i$ and $\lambda$, of
$$w_i^T M w_i + w_i^T C w_i - \log(w_i^T C w_i) - \lambda(w_i^T w_i - 1), \qquad (10)$$
with $M = V_i^{\perp T} \mu \mu^T V_i^{\perp}$ and $C = V_i^{\perp T} \Lambda V_i^{\perp}$. Setting the derivative w.r.t. $w_i$ equal to zero results in
$$\left(M + C\left(1 - \frac{1}{w_i^T C w_i}\right)\right) w_i = \lambda w_i. \qquad (11)$$

This nonlinear equation is solved iteratively by taking $\hat{w}_i^{(0)} = V_i^{\perp T}\mu / \|V_i^{\perp T}\mu\|$ as an initial estimate and selecting $\hat{w}_i^{(j)}$ as the eigenvector of $M + C\left(1 - \frac{1}{\hat{w}_i^{(j-1)T} C\, \hat{w}_i^{(j-1)}}\right)$ for which (10) is maximum.
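Putting Eqs. (8)-(11) together, the greedy procedure can be sketched in numpy as follows. This is our reading of the description above, not the authors' code; the fixed number of inner iterations, the fallback for a (near-)zero projected mean, and all names are assumptions made for illustration.

```python
import numpy as np

def _orth_complement(V):
    """Orthonormal basis of the subspace orthogonal to the columns of V."""
    u, _, _ = np.linalg.svd(V, full_matrices=True)
    return u[:, V.shape[1]:]

def mda(mu, sigma2, n, iters=10):
    """Greedy maximum discrimination analysis: an m x n matrix V_n with
    orthonormal columns, chosen column by column via Eqs. (8)-(11)."""
    mu = np.asarray(mu, float)
    lam = np.diag(np.asarray(sigma2, float))
    m = mu.size
    V = np.zeros((m, 0))
    for _ in range(n):
        P = _orth_complement(V) if V.shape[1] else np.eye(m)
        M = P.T @ np.outer(mu, mu) @ P
        C = P.T @ lam @ P
        pmu = P.T @ mu
        if np.linalg.norm(pmu) > 1e-12:
            w = pmu / np.linalg.norm(pmu)          # initial estimate w_i^(0)
        else:
            w = np.linalg.eigh(C)[1][:, 0]         # mu carries no information here
        for _ in range(iters):
            a = M + C * (1.0 - 1.0 / (w @ C @ w))  # matrix of Eq. (11)
            _, vecs = np.linalg.eigh(a)
            # keep the eigenvector for which the objective (10) is maximum
            obj = [v @ M @ v + v @ C @ v - np.log(v @ C @ v) for v in vecs.T]
            w = vecs[:, int(np.argmax(obj))]
        V = np.column_stack([V, P @ w])
    return V
```

The discrimination after reducing to $n$ dimensions is then obtained by evaluating (7) with the returned basis.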

We found that this procedure converges rapidly, e.g. in fewer than 5 iterations for $m = 50$.

The performance of MDA depends on the parameters $\{(\mu_i, \sigma_i^2)\}_{i=1}^m$. If $\mu_i = 0$, $i = 1, \ldots, m$, MDA operates identically to LDA, with, of course, the same performance. In all other cases, MDA results in a higher discrimination after dimension reduction than LDA. How much higher depends on the parameters. In order to illustrate the effectiveness of MDA for dimension reduction, we plot the results of an example of dimension reduction by MDA and LDA with randomly drawn $\{(\mu_i, \sigma_i^2)\}_{i=1}^m$, $m = 25$. The $\sigma_i^2$ were drawn from a uniform probability density on the interval [0, 1] and the $\mu_i$ from a normal probability density with zero mean and variance $1 - \sigma_i^2$. Figure 3 shows the discrimination and the equal-error rate as functions of the reduced number of dimensions for MDA (∇) and LDA (∗). The figure illustrates that MDA outperforms LDA in terms of both discrimination and equal-error rate. For instance, an equal-error rate of $10^{-2}$ requires 4 dimensions with MDA, but 7 with LDA. The reason for MDA's better performance is that it makes use of the discriminative power of the mean $\mu$, which is ignored by LDA. Note that all expressions and approximations for the figures of merit in Table 1 involve $\mu$.

CONCLUSIONS

Several figures of merit for biometric verification have been reviewed. It has been illustrated that one of these, the discrimination distance, has an approximately linear relation to the logarithm of the equal-error rate. Therefore, it can serve as an objective function for dimension reduction prior to biometric verification. MDA has been proposed as a new method for dimension reduction. An experiment with synthetic data has shown that MDA outperforms LDA in terms of discrimination and equal-error rate. The reason for MDA's better performance is that it makes use of the discriminative power of the mean $\mu$, which is ignored by LDA.

Figure 3: Discrimination (top panel) and equal-error rate (bottom panel) as functions of the number of dimensions for MDA (∇) and LDA (∗).

REFERENCES

[1] P. N. Belhumeur, J. Hespanha, and D. Kriegman. Eigenfaces vs. Fisherfaces: Recognition using class specific linear projection. IEEE Transactions on Pattern Analysis and Machine Intelligence, Special Issue on Face Recognition, 19(7):711-720, 1997.
[2] R.E. Blahut. Principles and Practice of Information Theory. Addison-Wesley Publishing Company, Reading, Massachusetts, 1987.
[3] A. P. Bradley. The use of the area under the ROC curve in the evaluation of machine learning algorithms. Pattern Recognition, 30(7):1145-1159, 1997.
[4] K. Fukunaga. Introduction to Statistical Pattern Recognition. Morgan Kaufmann, San Diego, second edition, 1990.
[5] G.F. Hughes. On the mean accuracy of statistical pattern recognizers. IEEE Transactions on Information Theory, 14(1):55-63, 1968.
[6] S. Kullback and R.A. Leibler. On information and sufficiency. Annals of Mathematical Statistics, 22:79-86, 1951.
[7] M. Ordowski and G.G.L. Meyer. Geometric linear discriminant analysis for pattern recognition. Pattern Recognition, 37(3):421-428, 2004.
[8] H.L. van Trees. Detection, Estimation and Modulation Theory, Part I. John Wiley and Sons, New York, 1968.