Extracting Discriminative Information from Cohort Models

Amin Merati, Norman Poh and Josef Kittler

Abstract— Cohort models are non-match models available in a biometric system; they can be other enrolled models in the gallery of the system. Cohort models have been widely used in biometric systems. A well-established scheme such as T-norm exploits cohort models to predict the statistical parameters of non-match scores for biometric authentication. Cohort models have also been used to predict the failure or the recognition performance of a biometric system. In this paper we show that cohort models sorted by their similarity to the claimed target model can produce a discriminative score pattern. We also show that polynomial regression can be used to extract discriminative parameters from these patterns. These parameters can be combined with the raw score to improve the recognition performance of an authentication system. The experimental results obtained for the face and fingerprint modalities of the Biosecure database validate this claim.

I. INTRODUCTION

Biometric authentication is a process that uses a person’s physical and behavioural characteristics to verify an identity claim. In a typical biometric authentication system, a statistical model is built for each subject identity from a few samples of the client collected during the enrolment phase. This model is also called the target model or reference model. During the test phase, a query sample is compared against the target model by a classifier, which produces an output score. This score, generally a similarity score and also called the raw verification score, is compared against a threshold to accept or reject the identity claim. The larger the similarity score, the more probable it is that the identity claim is true.

Many degrading factors, such as noise in the query sample, affect the similarity score. The classifier is not always capable of removing the effect of these factors, which is why authentication remains a challenging problem. One popular way to improve the recognition performance of a biometric expert is to use a pool of cohort models. Cohort models are non-match models (from different subjects), which are either other reference models in the database or the reference models of another database. Scoring the query sample against the cohort models as well as the claimed target model yields a set of scores called cohort scores. The cohort scores and the raw similarity score are all subject to the same degradation, so the cohort scores can be used to normalize the raw score and improve the recognition performance.

The authors are with the Centre for Vision, Speech and Signal Processing, University of Surrey, Guildford, UK.

978-1-4244-7580-3/10/$26.00 ©2010 Crown

A. Related Work on Using Cohorts

One of the well-established cohort-based score normalization methods is T-norm [1]. T-norm was initially proposed for speaker verification, but Poh et al. [2] showed that it can also be used for the fingerprint modality. T-norm uses the first and second order moments of the cohort scores to normalize the raw scores. The idea is to use the cohort scores to estimate the first and second order moments of the scores produced by non-match query samples. Therefore, we expect the normalized score obtained from a non-match sample to have zero mean and unit variance. Tulyakov et al. [3] proposed to combine the maximum of the cohort scores, or “the second best score” in an identification scenario, using an SVM classifier. Aggarwal et al. [4] proposed to use the maximum of the cohort scores as the best competing hypothesis in likelihood-ratio based score normalization.

Recently, some approaches have been proposed that use the pattern of sorted cohort scores to predict the performance of an identification system. Wang et al. [5] proposed a performance metric based on similarity scores, using the Perfect Recognition Similarity Scores, which are obtained by scoring all the enrolled samples against all the enrolled reference samples in a closed-set identification system. These scores were then sorted by value and used to build a system that predicts performance against intrinsic factors affecting the recognition system. They also proposed differential features to predict the system performance against extrinsic factors affecting the recognition system. Li et al. [6] proposed to predict recognition failure in an identification system using differential features obtained by subtracting each of the other sorted scores from the best score. Aggarwal et al. [4] also used a few top cohort scores obtained from the cohort models as input features of an SVM classifier to improve the recognition performance of a verification system.

In this paper, we revisit the cohort score ordering issue. We show that ordered score distributions have distinctive characteristic properties for true client claims and impostors. This discriminatory information can be extracted by informative modelling and used for decision making to enhance the system performance.

B. Discriminative Cohort Pattern

Cohort models are non-match models, which produce non-match scores. However, for the same query, the cohort models that are closer to the target model produce cohort scores that are closer to the raw verification score than the other cohort scores. Therefore, these cohort

Fig. 1. Block diagram of sorted cohort models for each target model. In this figure, 10 sorted cohort models for 3 target models are shown. The target models are shown in the left-most column and are labelled T1–T3. The cohort models are labelled C1–C10. In each row, the cohort models are sorted by their similarity (closeness) to the target model, so that the left-most cohort model is the most similar to the target model and the right-most cohort model is the most dissimilar.

scores can be discriminative to some extent, depending on the closeness of the cohort models to the target model. Using this intuition, we can sort the cohort models based on their closeness to each target model individually. Figure 1 shows a block diagram with 10 cohort models that are sorted in terms of similarity for each of 3 target models (arranged in rows).

The mean and the variance of the scores obtained from cohort models sorted by their similarity to the target are shown in Figure 2(a) for the face modality. The statistical moments of the cohort scores obtained from match queries and non-match queries are shown in red and blue, respectively. The cohort models of rank one, at the left of the graph, are the closest cohort models, and the cohort models with the highest rank, at the right, are the most dissimilar ones. The distance between the means of the cohort scores for match queries and non-match queries indicates the amount of discriminative information conveyed by a given rank order. The cohort models of rank order 1 are the most discriminative. As the rank order increases, this discrimination decreases and eventually reaches zero, which happens around a rank order of 150 for the face modality. Beyond this equilibrium point, the distance between the two means starts increasing again, meaning that the discrimination power increases again.

To be more objective, we explicitly measure the discrimination power of each rank order in terms of Equal Error Rate (EER). EER is the unique operating point at which the probability of falsely accepting a non-match claim (false acceptance) is the same as the probability of falsely rejecting a match claim (false rejection). The EER curve computed on the previous data is shown in Figure 2(c), as a function of the rank order of

Fig. 3. Sorting the cohort models in an offline training process and extracting discriminative parameters in an online process.

the cohort models. The decrease in EER after the equilibrium point is an interesting finding because it shows that the cohort models most dissimilar to the target model also have some discriminative power in distinguishing positive from negative accesses in a verification system.

An alternative to this type of sorting is to sort the cohort scores by value rather than by similarity to the target model. This has already been proposed as a way of improving the recognition performance of an identification system [6]. The mean and variance of the value-sorted cohort scores for each query sample are shown in Figure 2(b) for the face modality. As can be clearly observed, the distance between the means of the scores obtained from match and non-match queries is very small in comparison to the other method of sorting, which means that much less discriminatory information is available. This is also shown by the red graph of Figure 2(c), in which the EER for the different ranks of value-sorted cohort scores is relatively high and almost constant. It means that none of the rank orders of the value-sorted cohort scores is discriminative.

The process of sorting the cohort models is performed offline using a training data set, as shown in Figure 3. For biometric modalities such as fingerprint, in which the target and cohort models consist of only one sample, the closeness of the target and cohort models is measured by directly comparing the two samples, so no additional training data set is needed. The scores obtained from the sorted cohort models are used to extract discriminative parameters in an online process. These parameters are then combined with the raw verification score to improve the recognition performance.

C. Our Contributions

Our contribution in this paper is two-fold:
1) Discriminative cohort pattern generated by sorting cohort models. We show that by sorting cohort models with respect to their closeness to the target model,

[Figure 2: three plots against cohort model order. The vertical axis is the mean and variance of cohort scores in (a) and (b), and EER (%) in (c).]

(a) The mean and the variance of scores of cohort models sorted by similarity for the face modality.
(b) The mean and the variance of cohort scores sorted by value for the face modality.
(c) EER vs. rank order of cohort models for the face modality.

Fig. 2. The mean and the variance of the cohort scores for the face modality are shown in (a) and (b). The blue graphs in (a) and (b) show the statistical moments of cohort scores obtained from non-match queries, whereas the red ones show those obtained from match queries. The graphs obtained from cohort models sorted by similarity to the target model are shown in (a), and the graphs obtained from cohort scores sorted by value are shown in (b). In (a), cohort models of rank order 1 are the most similar to the target and the models of rank order 325 are the most dissimilar. The blue graph in (c) shows the EER versus the rank order of cohort models sorted by similarity to the target, as in (a), and the red graph shows the EER versus the rank order for scores sorted by value, as in (b).

cohort scores of positive and negative accesses would exhibit a discriminative pattern.
2) Extracting discriminative parameters from sorted cohorts. We show that polynomial regression can be used to extract the discriminative parameters from the cohort score patterns.

D. Paper Organisation

This paper is organised as follows: Section II reviews the recent work on using cohorts in biometric authentication. We explain our proposed algorithm for extracting discriminative parameters in Section III. Section IV reviews the experimental results. Finally, conclusions are drawn in Section V.

II. PRIOR WORK

Let $y$ be a matching score obtained by comparing a query sample with a target (claimed) model in the database. Furthermore, let $y^c \in \mathcal{Y}^c$ be a cohort score obtained by comparing the query sample with a cohort model, where $\mathcal{Y}^c$ is the set of cohort scores. Note that $y$ is the result of comparing a query sample with a claimed model, whereas $y^c$ is the result of comparing the sample with a cohort model (of a different person). We describe the statistics of $\mathcal{Y}^c$ using its first and second order moments, $\mu^c = E_{y \in \mathcal{Y}^c}[y]$ and $(\sigma^c)^2 = E_{y \in \mathcal{Y}^c}[(y - \mu^c)^2]$, where $E[\cdot]$ denotes the expectation. Using this notation, T-norm can be defined as:

$$y^{T} = \frac{y - \mu^c}{\sigma^c} \tag{1}$$

Instead of using the moments, Aggarwal et al. [4] proposed the following normalization:

$$y^{Ag} = \frac{y}{\max(y^c)} \quad \text{s.t. } y^c \in \mathcal{Y}^c \tag{2}$$

arguing that the first and second order moments are not necessarily representative of the cohort score set $\mathcal{Y}^c$. Tulyakov et al. [3] proposed a learning-based approach, formulating the combination of the original match score with the maximum of the cohort scores as a fusion problem:

$$y^{Tul} = P(C \mid y, \max(y^c)) \quad \text{s.t. } y^c \in \mathcal{Y}^c \tag{3}$$

where $P(C \mid x)$ denotes the posterior probability of a true claim given the observation $x$, in this case a vector with two elements. A multi-layer perceptron was used to approximate (3) in [3]. They used other enrolled subjects in the database as the cohort models.

III. METHODOLOGY

Let $\mathcal{Y}^c_{sim} = [y_{c_1}, y_{c_2}, y_{c_3}, \ldots, y_{c_K}]$ denote a vector of cohort scores obtained from $K$ cohort models sorted with respect to their closeness or similarity to the target model, in which $y_{c_1}$ is the score obtained from the closest cohort model and $y_{c_K}$ is the score obtained from the furthest cohort model. These cohort scores can be considered as discrete points on a function of the rank order:

$$y_{c_i} = f(i) \tag{4}$$
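The construction of the rank-ordered score vector can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function names `rank_cohort` and `sorted_cohort_scores` are hypothetical, and the toy matcher (models represented as single numbers, similarity as negated absolute difference) is an assumption chosen only to make the example runnable. The offline step ranks the cohort models once per target model; the online step scores the query against the cohort and arranges the scores in that stored order.

```python
from typing import Callable, List, Sequence


def rank_cohort(target: float,
                cohort: Sequence[float],
                similarity: Callable[[float, float], float]) -> List[int]:
    """Offline step: indices of cohort models, most similar to the target first."""
    sims = [similarity(target, model) for model in cohort]
    return sorted(range(len(cohort)), key=lambda k: sims[k], reverse=True)


def sorted_cohort_scores(query: float,
                         cohort: Sequence[float],
                         order: Sequence[int],
                         matcher: Callable[[float, float], float]) -> List[float]:
    """Online step: cohort scores of the query, arranged by the stored rank order."""
    return [matcher(query, cohort[k]) for k in order]


# Toy stand-ins (assumptions): a "model" is a single number and the
# matcher returns the negated absolute difference as a similarity score.
def toy_matcher(a: float, b: float) -> float:
    return -abs(a - b)


target_model = 5.0
cohort_models = [1.0, 4.5, 9.5, 6.0]
order = rank_cohort(target_model, cohort_models, toy_matcher)
y_sim = sorted_cohort_scores(4.0, cohort_models, order, toy_matcher)
print(order)  # [1, 3, 0, 2]
print(y_sim)  # [-0.5, -2.0, -3.0, -5.5]
```

Because the ranking depends only on the target and cohort models, it can be computed once and reused for every query made against that target.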

Polynomial regression can be used to approximate this function with a polynomial of degree $n$:

$$f(i) \approx A_n i^n + A_{n-1} i^{n-1} + \cdots + A_2 i^2 + A_1 i + A_0 \tag{5}$$
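A least-squares fit of (5) can be sketched with NumPy as below. The helper name `cohort_poly_coeffs` and the synthetic scores are assumptions made for illustration; the paper does not state which fitting routine was used. `numpy.polynomial.polynomial.polyfit` returns the coefficients in ascending order, with the constant term A0 first.

```python
import numpy as np
from numpy.polynomial import polynomial as P


def cohort_poly_coeffs(sorted_scores, degree):
    """Least-squares fit of eq. (5); returns [A0, A1, ..., An]."""
    scores = np.asarray(sorted_scores, dtype=float)
    ranks = np.arange(1, scores.size + 1, dtype=float)  # rank order i = 1..K
    return P.polyfit(ranks, scores, degree)


# Synthetic pattern (assumption) that follows an exact quadratic in the rank order.
i = np.arange(1, 11)
demo_scores = 2.0 + 0.5 * i - 0.1 * i**2
A = cohort_poly_coeffs(demo_scores, degree=2)
print(np.round(A, 6))  # approximately [2.0, 0.5, -0.1]
```

With several hundred cohort models and a higher degree, fitting on raw ranks can become ill-conditioned; rescaling the ranks to a small interval before fitting is one common remedy, although it is not something the text above describes.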

Let $A = [A_0, A_1, A_2, \ldots, A_n]$ be the vector of coefficients of this polynomial. Then the scores obtained from the sorted cohort models can be approximated by the $n + 1$ parameters that are the elements of $A$. These parameters can be combined with the raw score to improve the recognition performance of a verification system as follows:

$$y^{P} = P(C \mid y, A) \tag{6}$$

where $P(C \mid y, A)$ denotes the posterior probability of a true claim, given the raw score and the polynomial coefficients. We used a logistic regression classifier to approximate the posterior probability.

IV. EXPERIMENTAL RESULTS

A. Database and Matching Algorithm

In order to assess the merit of the proposed method, we used the fingerprint and face modalities of the Biosecure data set [7].

• Fingerprint: For the fingerprint modality, we used the NIST fingerprint matcher (“Bozorth3”). In this database, six fingers (the thumb, middle and index fingers of both hands) were recorded. The fingerprints were scanned with two devices, namely thermal and optical sensors. Each subject provided 4 impressions per device and per finger. Therefore, each subject supplied a total of 4 impressions × 6 fingers × 2 devices = 48 impressions.
• Face: The collected face images are divided into three groups, namely fa, fnf and fwf. Face images in group fa were collected using the lower-resolution device (webcam), whereas face images in groups fnf and fwf were collected using the higher-quality device (digital camera) [8]. Face images in group fnf were collected without flash, whereas those in group fwf were collected with flash. Each subject provides one image per group. The face matcher that we used was provided by OmniPerception Ltd [8].

Five disjoint groups of subjects were identified, with the first four groups (referred to as g1–g4, respectively) constituting enrollees, and the final group forming a separate set of cohort users that provides a pool of cohort models. Subjects in g1 and g2 were used as enrollees in the development (dev) set, and those in g3 and g4 as enrollees in the evaluation (eva) set. Subjects in group g5 were used as cohort models. The numbers of subjects in g1–g4 are {84, 83, 83, 81}, respectively. The total number of cohort users is 84 for both modalities. For the purpose of obtaining the cohort scores, only the first of the four samples of each cohort user was used.

We require each of the dev and eva sets to have its own enrollment and query data sets, i.e., $D_{d,enrol}$ and $D_{d,query}$ for $d \in \{dev, eva\}$. Recall that there are four impressions (images) per modality (finger or face), per subject and per device. The first fingerprint impression (resp. face image) was used as the enrollment template (or model) for the target user. In order to generate match (genuine) scores, the second impression was used to produce scores for $D_{d,enrol}$, whereas the remaining two samples were used to produce scores for $D_{d,query}$ (see Table I). To generate the non-match scores, we used query samples of g3 for $D_{dev,enrol}$; of g4 for $D_{dev,query}$; of g1 for $D_{eva,enrol}$; and of g2 for $D_{eva,query}$, as listed in Table II. In this way, the non-match scores in all four data sets are completely disjoint. In the empirical evaluation reported in the next section, we use $D_{dev,query}$ as our training set and $D_{eva,query}$ as our test set. Note that the enrollees and non-match subjects in these two data sets are completely disjoint. This simulates a scenario where the development and operational data have disjoint subjects, a very realistic condition in practice. Using the conventional machine learning terms, we shall treat $D_{eva,query}$ as the test set and the remaining three data sets as the training set.

TABLE I
NUMBER OF MATCH SAMPLES PER ENROLLEE

Data sets, D    enrol    query
dev (g1+g2)     1        2
eva (g3+g4)     1        2

Note: The number of match scores of $D_{dev,query}$ is the number of subjects in both g1 and g2 multiplied by 2.

TABLE II
NON-MATCH SCORE GENERATION

Data sets, D    enrol    query
dev (g1+g2)     g3       g4
eva (g3+g4)     g1       g2

Note: The entry g3 reads “the subjects in g3 are used to impersonate the enrollees of g1 and g2”.

B. Performance Metrics

In this paper, performance is reported either as Equal Error Rate (EER) or as False Rejection Rate (FRR) at some important values of False Acceptance Rate (FAR). EER is the operating point (after tuning the decision threshold) at which FAR is equal to FRR. In order to collate the statistics from the 24 fingerprint experiments (due to six fingers, two acquisition devices and the cross-validation protocol) and the 6 face experiments, it is necessary to cater for the uneven performance of the respective baseline (unnormalized) systems. For this purpose, we use the following relative performance metric:

$$\text{rel. change of EER} = \frac{EER_{algo} - EER_{baseline}}{EER_{baseline}},$$

where $EER_{algo}$ is the EER of a given score normalization procedure and $EER_{baseline}$ is the EER of the original system without any score normalization. The relative change of FRR is defined similarly. By collating the performance of the different systems, one can establish confidence intervals of the relative merit of our proposal with respect to the baseline system. These confidence intervals can be conveniently visualized using a boxplot, which shows the median, the first and third quartiles, as well as the fifth and 95th percentiles of the data. Note that a negative value of the relative change of EER (resp. FRR) implies an improvement over the baseline system.

C. Results

We compared the following approaches:

• Baseline: the original system output without any post-processing
• T-norm: post-processing using T-norm, as in (1)
• Tulyakov’s approach: post-processing using (3)
• Aggarwal’s approach: post-processing using (2)
• Our proposal: post-processing using (6) (labelled as “poly regression degree n”, in which n is the degree of the polynomial used for regression)
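The post-processing variants listed above can be sketched in a few lines of NumPy. This is an illustration under stated assumptions rather than the code used in the experiments: all function names are hypothetical, and `fit_logistic` is a plain gradient-descent stand-in for whatever logistic regression trainer was actually used. The functions `t_norm` and `aggarwal_norm` follow (1) and (2) directly, while `tulyakov_features` and `proposal_features` build the input vectors of the classifiers in (3) and (6).

```python
import numpy as np
from numpy.polynomial import polynomial as P


def t_norm(y, cohort):
    """Eq. (1): centre and scale y with the cohort mean and standard deviation."""
    cohort = np.asarray(cohort, dtype=float)
    return (y - cohort.mean()) / cohort.std()  # population moments, as in Section II


def aggarwal_norm(y, cohort):
    """Eq. (2): divide y by the largest cohort score."""
    return y / np.max(cohort)


def tulyakov_features(y, cohort):
    """Classifier input of eq. (3): raw score and maximum cohort score."""
    return np.array([y, np.max(cohort)], dtype=float)


def proposal_features(y, sorted_cohort, degree):
    """Classifier input of eq. (6): raw score followed by [A0, ..., An]."""
    scores = np.asarray(sorted_cohort, dtype=float)
    ranks = np.arange(1, scores.size + 1, dtype=float)
    return np.concatenate(([y], P.polyfit(ranks, scores, degree)))


def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))


def fit_logistic(X, labels, lr=0.1, n_iter=2000):
    """Plain gradient-descent logistic regression (illustrative trainer, an assumption)."""
    X = np.asarray(X, dtype=float)
    labels = np.asarray(labels, dtype=float)
    w, b = np.zeros(X.shape[1]), 0.0
    for _ in range(n_iter):
        err = sigmoid(X @ w + b) - labels
        w -= lr * (X.T @ err) / labels.size
        b -= lr * err.mean()
    return w, b


def posterior(x, w, b):
    """Estimate of P(C | x) for one feature vector x."""
    return float(sigmoid(np.dot(x, w) + b))


# Toy check on a one-dimensional feature: negatives on the left, positives on the right.
w, b = fit_logistic([[-2.0], [-1.0], [1.0], [2.0]], [0, 0, 1, 1])
print(t_norm(5.0, [1.0, 3.0]))         # 3.0
print(aggarwal_norm(2.0, [1.0, 4.0]))  # 0.5
print(posterior([1.0], w, b) > 0.5)    # True
```

In use, one feature vector per access claim would be stacked into the training matrix, with label 1 for match claims and 0 for non-match claims, and `posterior` would then play the role of the fused score.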

TABLE III
AVERAGE FRR (%) FOR DIFFERENT METHODS AND 6 FINGERPRINT DATA SETS, WHEN FAR = 0.1%

data set  Bline  T-n   Aggr  Tuly  Polreg 1  Polreg 2  Polreg 3
fo1       6.28   6.03  6.49  8.05  5.57*     6.16      6.18
fo2       5.09   3.83  4.11  3.90  2.96      2.82*     2.84
fo3       8.78   5.83  6.63  7.08  5.27      5.25*     5.27
fo4       9.23   7.66  8.37  7.94  7.03      6.89*     6.94
fo5       7.77   7.16  7.45  7.36  6.81      6.82      6.76*
fo6       9.65   7.23  7.14  6.96  6.29      6.30      6.22*

Note: Each entry is the average FRR of a two-fold cross validation result, reported for each finger, sensor type and normalization method. The smallest average FRR value of all methods (in a row) is marked with an asterisk. Bline stands for the baseline approach; T-n for T-norm; Aggr for Aggarwal’s approach; Tuly for Tulyakov’s approach; and Polreg n stands for our proposal based on polynomial regression of degree n.

(a) EER relative change for the fingerprint modality
(b) EER relative change for the face modality

Fig. 4. EER relative change of different cohort-based normalization methods for the fingerprint modality (a) and the face modality (b).
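The metrics of Section IV-B can be computed from lists of genuine and impostor scores as sketched below. The function names and the toy scores are assumptions for illustration, and the EER is estimated on the observed score values only, without interpolation, which is one of several common conventions and not necessarily the one used for the reported figures.

```python
from typing import Sequence, Tuple


def far_frr(genuine: Sequence[float], impostor: Sequence[float],
            threshold: float) -> Tuple[float, float]:
    """FAR and FRR when a claim is accepted if score >= threshold."""
    far = sum(s >= threshold for s in impostor) / len(impostor)
    frr = sum(s < threshold for s in genuine) / len(genuine)
    return far, frr


def equal_error_rate(genuine: Sequence[float], impostor: Sequence[float]) -> float:
    """EER estimate: average of FAR and FRR at the threshold where they are closest."""
    thresholds = sorted(set(genuine) | set(impostor))
    far, frr = min((far_frr(genuine, impostor, t) for t in thresholds),
                   key=lambda pair: abs(pair[0] - pair[1]))
    return (far + frr) / 2.0


def frr_at_far(genuine: Sequence[float], impostor: Sequence[float],
               target_far: float) -> float:
    """FRR at the lowest observed threshold whose FAR does not exceed target_far."""
    for t in sorted(set(genuine) | set(impostor)):
        far, frr = far_frr(genuine, impostor, t)
        if far <= target_far:
            return frr
    return 1.0  # no observed threshold meets the target: every claim is rejected


def relative_change(value_algo: float, value_baseline: float) -> float:
    """Relative change with respect to the baseline; negative means improvement."""
    return (value_algo - value_baseline) / value_baseline


# Toy scores (assumption): higher means more similar.
genuine = [0.9, 0.8, 0.7, 0.4]
impostor = [0.6, 0.3, 0.2, 0.1]
print(equal_error_rate(genuine, impostor))  # 0.25
print(frr_at_far(genuine, impostor, 0.0))   # 0.25
```

Applying `relative_change` to the EER of a normalized system and of its baseline gives one point of the kind collated in Fig. 4; repeating this over all experiments gives the sample from which a boxplot can be drawn.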

The relative changes of EER for the above-mentioned algorithms are shown in Figures 4(a) and (b) for the fingerprint and face modalities, respectively. As can be observed, our proposal outperforms all the competing cohort-based algorithms, including T-norm. The best polynomial model is of a different degree for each modality and was obtained using simple cross validation. For fingerprint, the best performance is obtained with polynomials of degree 2 or 3, whereas for the face modality the best degree is 6. In order to find out how sensitive the generalization performance is to the degree of the polynomial, we carried out additional experiments by varying the degree of the polynomial function from 2 to 4 for the fingerprint modality and from 6 to 8 for the face modality. The results in Figures 4(a) and (b) show that the degree of the polynomial function has little impact on the generalization performance.

Boxplots of the relative change of FRR at the three important values FAR = 0.1%, FAR = 1% and FAR = 10% for the fingerprint modality are shown in Figures 5(a), (b) and (c), respectively. As can be observed in all three figures, the proposed method of using polynomial regression of cohort scores outperforms the other cohort-based methods. Tables III and IV list the average values of FRR (obtained via a two-fold cross validation procedure, treating dev and eva as two different folds) at FAR = 0.1% for the fingerprint and the face modalities, respectively. We observe that our proposed method, for different degrees of the polynomial, is better than the other normalization methods.

We note that the fingerprint performance obtained with the thermal device is considerably worse than that of the optical device. This is because the fingerprint images captured by the thermal device are generally of poorer quality. Sample fingerprint images captured by the two sensors and their associated local quality maps are shown in Figure 6. The thermal sensor has a significantly smaller area, so users are required to swipe their finger across the sensor. The SDK provided then stitches the images together to form a larger fingerprint image such as the one shown in Figure 6(c). Two sources of error are possible here: the error introduced during the stitching process (possibly introducing spurious minutiae or deleting existing ones) and the manner in which fingerprints are placed on and swiped across the sensor. Since our proposal is modality-independent and the performance metric employed is defined relative to the baseline system, the higher error rates of the thermal sensor are not a major concern. In fact, by using more data sets, one can be even more confident about the conclusions drawn.

V. CONCLUSIONS AND FUTURE WORKS

A. Conclusions

In this paper, we showed that cohort models sorted with respect to their closeness to the target model produce discriminative score patterns for match and non-match queries.

[Figure 5: three boxplots of the FRR relative change (%) per method (baseline, T-norm, Aggarwal, Tulyakov, poly regression) for the fingerprint modality.]

(a) Relative change of FRR at FAR = 0.1% for the fingerprint modality
(b) Relative change of FRR at FAR = 1% for the fingerprint modality
(c) Relative change of FRR at FAR = 10% for the fingerprint modality

Fig. 5. FRR (False Rejection Rate) relative change for the fingerprint modality when FAR (False Acceptance Rate) equals (a) 0.1%, (b) 1% and (c) 10%, over 24 experiments.

TABLE IV
FRR (%) FOR DIFFERENT METHODS AND 6 FACE DATA SETS, WHEN FAR = 0.1%

data set  Bline  T-n    Aggr   Tuly    Polreg 6  Polreg 7  Polreg 8
fa1       29.03  28.57  29.43  25.91*  27.23     27.13     26.35
fnf1      12.20  14.33  16.10  11.06   11.43     10.82*    11.28
fwf1      11.16  9.43   9.91   10.34   8.36*     8.40      8.99
fa2       30.52  23.90  27.40  24.78   23.55     23.86     23.43*
fnf2      15.94  16.12  20.05  15.76   15.45     15.04*    15.27
fwf2      10.23  11.76  12.56  8.68    7.49      7.51      7.35*

Note: Each entry is the FRR for each face, sensor device and normalization method. The smallest FRR value of all methods (in a row) is marked with an asterisk. Bline stands for the baseline approach; T-n for T-norm; Aggr for Aggarwal’s approach; Tuly for Tulyakov’s approach; and Polreg n stands for using polynomial regression of degree n to extract parameters from cohort scores.

(a) Optical sensor (b) Optical quality (c) Thermal sensor (d) Thermal quality

Fig. 6. Samples of two fingerprints acquired using two fingerprint sensors and their associated local quality maps [9].

We also showed that polynomial regression can be used to model these score patterns in order to extract discriminative parameters. These parameters can be combined with the raw score to improve the recognition performance of the verification system. The performance gains achieved in our experiments on fingerprint and face databases ranged from 6% to 14% over the baseline and from 3% to 6% over the state-of-the-art normalization methods. In the near future, we plan (i) to employ other dimensionality reduction algorithms, such as principal component analysis, in order to extract the most discriminative parameters from the cohort information; and (ii) to investigate the problem in the multimodal setting.

VI. ACKNOWLEDGMENTS

This work was supported partially by the advanced researcher fellowship PA0022 121477 of the Swiss National Science Foundation and by the EU-funded Mobio project (www.mobioproject.org) grant IST-214324.

REFERENCES

[1] R. Auckenthaler, M. Carey, and H. Lloyd-Thomas, “Score Normalization for Text-Independent Speaker Verification Systems,” Digital Signal Processing (DSP) Journal, vol. 10, pp. 42–54, 2000.
[2] N. Poh, A. Merati, and J. Kittler, “Making better biometric decisions with quality and cohort information: A case study in fingerprint verification,” in Proc. 17th European Signal Processing Conf. (Eusipco), Glasgow, 2009, accepted.
[3] S. Tulyakov, Z. Zhang, and V. Govindaraju, “Comparison of combination methods utilizing t-normalization and second best score model,” in IEEE Conf. on Computer Vision and Pattern Recognition Workshop, 2008.
[4] G. Aggarwal, N.K. Ratha, R.M. Bolle, and R. Chellappa, “Multibiometric cohort analysis for biometric fusion,” in IEEE Int’l Conf. on Acoustics, Speech and Signal Processing, 2008.
[5] Peng Wang, Qiang Ji, and J.L. Wayman, “Modeling and predicting face recognition system performance based on analysis of similarity scores,” IEEE Trans. Pattern Analysis and Machine Intelligence, vol. 29, no. 4, pp. 665–670, April 2007.
[6] W. Li, X. Gao, and T.E. Boult, “Predicting biometric system failure,” Computational Intelligence for Homeland Security and Personal Safety, 2005. CIHSPS 2005. Proceedings of the 2005 IEEE International Conference on, pp. 57–64, 31 2005-April 1 2005.
[7] J. Ortega-Garcia, J. Fierrez, F. Alonso-Fernandez, J. Galbally, M. R. Freire, J. Gonzalez-Rodriguez, C. Garcia-Mateo, J.-L. Alba-Castro, E. Gonzalez-Agulla, E. Otero-Muras, S. Garcia-Salicetti, L. Allano, B. Ly-Van, B. Dorizzi, J. Kittler, T. Bourlai, N., F. Deravi, R. Ng, M. Fairhurst, J. Hennebert, A. Humm, M. Tistarelli, L. Brodo, J. Richiardi, A. Drygajlo, H. Ganster, F. Sukno, S.-Kaushik Pavani, A. Frangi, L. Akarun, and A. Savran, “The multi-scenario multi-environment biosecure multimodal database (bmdb),” IEEE Trans. on Pattern Analysis and Machine Intelligence, 2009.
[8] N. Poh, T. Bourlai, and J. Kittler, “A BioSecure DS2 Report on the Technological Evaluation of Score-level Quality-dependent and Cost-sensitive Multimodal Biometric Performance,” submitted, 2007.
[9] Y. Chen, S.C. Dass, and A.K. Jain, “Fingerprint Quality Indices for Predicting Authentication Performance,” in LNCS 3546, 5th Int’l. Conf. Audio- and Video-Based Biometric Person Authentication (AVBPA 2005), New York, 2005, pp. 160–170.