A Verification Protocol and Statistical Performance Analysis for Face Recognition Algorithms

Syed A. Rizvi (1), P. Jonathon Phillips (2), and Hyeonjoon Moon (3)

(1) Department of Applied Sciences, College of Staten Island of City University of New York, Staten Island, NY 10314
(2) National Institute of Standards and Technology, Gaithersburg, MD 20899, [email protected]
(3) Department of Electrical and Computer Engineering, State University of New York at Buffalo, Amherst, NY 14260
Abstract

Two key performance characterizations of biometric algorithms (face recognition in particular) are (1) verification performance and (2) performance as a function of database size and composition. This characterization is required for developing robust face recognition algorithms and for successfully transitioning algorithms from the laboratory to the real world. In this paper we (1) present a general verification protocol and apply it to the results from the Sep96 FERET test, and (2) discuss and present results on the effects of database size and variability on identification and verification performance.
1. Introduction

Face recognition algorithms have matured to the point of being demonstrated in real-world identification and verification applications. In identification problems, the input to an algorithm is an unknown face, and the algorithm reports back the estimated identity of that face from a database of known individuals. In verification problems, by contrast, the input is a face with a claimed identity, and the algorithm either accepts or rejects the claim. For these demonstrations to be successful, it is crucial that algorithm performance for the intended applications can be estimated or predicted from known performance results. (These predictions are used in making design decisions for demonstration systems.) Making such predictions requires (1) performance results for identification and verification scenarios, and (2) a characterization of performance in terms of database size and the variability of the images in the database.

(This work was performed as part of the Face Recognition Technology (FERET) program, which is sponsored by the U.S. Department of Defense Counterdrug Technology Development Program. Portions of this work were done while Jonathon Phillips was at the US Army Research Laboratory.)
However, to date most results in the literature have been limited to a single identification performance measure on a database of images; i.e., on database 'X', algorithm 'A' correctly identifies faces n% of the time (or, more generally, a single cumulative match curve on database 'X'). This implicitly assumes that identification performance on a single database is predictive of verification performance. Since identification and verification are different problems, they need separate evaluation techniques. We present a general verification evaluation protocol and the results of applying it to the Sep96 FERET test [7].
A key performance issue in face recognition is how the size of the database affects performance. The effects of scaling are important because algorithms are usually developed and initially evaluated on databases that are smaller than those encountered in applications. To make an intelligent choice of which algorithms are appropriate for an application, one would like to be able to predict performance on larger databases. We investigate the effects of database size and composition on verification and identification performance.
2. The Sep96 FERET test

The Sep96 FERET testing protocol was designed so that algorithm performance can be computed for (1) identification and verification evaluation protocols, and (2) a variety of different galleries and probe sets [7]. (The gallery is the set of known individuals. An image of an unknown face presented to the algorithm is called a probe, and the collection of probes is called the probe set.) The FERET tests were independently administered. A development set of images was provided to researchers to develop and train face recognition algorithms; the algorithms were then tested on images from a sequestered set.

In the Sep96 protocol, an algorithm is given two sets of images: the target set and the query set. We introduce this terminology to distinguish these sets from the gallery and probe sets that are used in computing performance statistics. The target set is given to the algorithm as the set of known facial images. The images in the query set are the unknown facial images to be identified. For each image $q_i$ in the query set $Q$, an algorithm reports the similarity $s_i(k)$ between $q_i$ and each image $t_k$ in the target set $T$. The key property of the new protocol, which allows for greater flexibility in scoring, is that for any two images $q_i$ and $t_k$, we know $s_i(k)$.

From the output files, algorithm performance can be computed for virtual galleries and probe sets. A gallery $G$ is a virtual gallery if $G$ is a proper subset of the target set, i.e., $G \subset T$. Similarly, $P$ is a virtual probe set if $P \subset Q$. For a given gallery $G$ and probe set $P$, the performance scores are computed by examining the similarity measures $s_i(k)$ such that $q_i \in P$ and $t_k \in G$. The virtual gallery and probe set technique allows us to characterize algorithm performance for identification and verification. Also, performance can be broken out by database size or probe category, e.g., FB probes or duplicate probes. (A sketch of this bookkeeping is given below.)

In the September 1996 FERET test, the target set contained 3323 images and the query set 3816 images. All the images in the target set were frontal images. The query set consisted of all the images in the target set plus rotated images and digitally modified images. We designed the digitally modified images to test the effects of illumination and scale. For each query image $q_i$, an algorithm outputs the similarity measure $s_i(k)$ for all images $t_k$ in the target set. The output for each query image $q_i$ is sorted by the similarity scores $s_i(\cdot)$.

There are two versions of the September 1996 test; the target and query sets are the same for each version. The first version requires that the algorithms be fully automatic. (In the fully automatic version the testees are given only a list of the images in the target and query sets; locating the faces in the images must be done automatically.) In the second version the eye coordinates are given. The Sep96 test was administered in September 1996 and in March 1997 (Table 1 lists when the groups took the test).
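To make the virtual gallery and probe set idea concrete, the following sketch (in Python, with hypothetical variable and file names) shows how scores for any gallery $G \subset T$ and probe set $P \subset Q$ can be extracted from the full similarity matrix produced during the test. It is an illustration of the protocol's bookkeeping under assumed data layouts, not the code used in the FERET evaluation.

```python
import numpy as np

def virtual_scores(sim, target_ids, query_ids, gallery_ids, probe_ids):
    """Extract the similarity sub-matrix for a virtual gallery and probe set.

    sim[i, k] is the similarity s_i(k) between query image i and target image k,
    as reported by the tested algorithm.  gallery_ids and probe_ids name the
    images chosen for the virtual gallery G (a subset of the target set) and
    the virtual probe set P (a subset of the query set).
    """
    t_index = {t: k for k, t in enumerate(target_ids)}
    q_index = {q: i for i, q in enumerate(query_ids)}
    rows = [q_index[q] for q in probe_ids]      # probes q_i in P
    cols = [t_index[t] for t in gallery_ids]    # gallery images t_k in G
    return sim[np.ix_(rows, cols)]              # scores s_i(k) with q_i in P, t_k in G

# Hypothetical usage: score one FB probe against a one-image gallery.
# sub = virtual_scores(sim, target_ids, query_ids,
#                      gallery_ids=["00001_fa.tif"], probe_ids=["00001_fb.tif"])
```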
Two algorithms from the MIT Media Laboratory (MIT) were tested. The first was the algorithm tested in March 1995 [6]; this algorithm was retested so that improvement since March 1995 could be measured. The second algorithm was based on more recent work [5]. The University of Maryland (UMd) took the test in September 1996 and March 1997. The algorithm tested in September 1996 was based on Etemad and Chellappa [3], and the March 1997 algorithm was based on Zhao et al. [13]. To provide a performance baseline, we implemented normalized correlation and principal component analysis (PCA) based face recognition algorithms [10]. In our implementation of PCA, all images were scaled, masked, and histogram equalized. The training set consisted of 500 faces. Faces were represented by their projections onto the first 200 eigenvectors and were identified by a nearest-neighbor classifier using the L1 metric. For normalized correlation, the images were geometrically normalized and masked. (A sketch of the PCA baseline appears at the end of this section.)

We only report results for the semi-automatic case (eye coordinates given), because the MIT Media Lab and the U. of Southern California were the only groups to take the fully automatic test. When comparing performance among algorithms, one must note when the tests were administered, because the performance of the algorithms from a particular group improves over time, and the performance level of face recognition algorithms in general also improves over time.

The images were taken from the FERET database of facial images [8]. The facial images were collected in 15 sessions between August 1993 and July 1996. Sessions lasted one or two days, and the location and setup did not change during a session. To maintain a degree of consistency throughout the database, the photographer used the same physical setup in each photography session. However, because the equipment had to be reassembled for each session, there was variation from session to session. Images of an individual were acquired in sets of 5 to 11 images, collected under relatively unconstrained conditions. Two frontal views were taken (FA and FB); a different facial expression was requested for the second frontal image. By July 1996, 1564 sets of images were in the database, for 14,126 total images. The database contains 1199 individuals and 365 duplicate sets of images (images taken on different days).
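The following sketch illustrates the kind of processing the PCA baseline performs: project geometrically normalized, histogram-equalized face images onto a small number of eigenvectors and match with a nearest-neighbor rule under the L1 metric. It is a minimal illustration under assumed array shapes and preprocessing, not the baseline implementation used in the test.

```python
import numpy as np

def train_pca(train_faces, n_components=200):
    """train_faces: (n_images, n_pixels) array of preprocessed (scaled, masked,
    histogram-equalized) training faces, e.g. 500 faces.  Returns the mean face
    and the leading eigenvectors of the training set."""
    mean = train_faces.mean(axis=0)
    centered = train_faces - mean
    # SVD of the centered data gives the eigenvectors of the covariance matrix.
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    return mean, vt[:n_components]            # first n_components eigenfaces

def project(faces, mean, eigenfaces):
    """Represent faces by their projections onto the eigenfaces."""
    return (faces - mean) @ eigenfaces.T

def identify(probe_coeffs, gallery_coeffs):
    """Nearest-neighbor identification with the L1 (city-block) metric."""
    d = np.abs(probe_coeffs[:, None, :] - gallery_coeffs[None, :, :]).sum(axis=2)
    return d.argmin(axis=1)                   # index of closest gallery face per probe
```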
3. Verification Model

In our verification model, an algorithm is presented with an image $p$ that is claimed to be of the same person as in image $g$. The system either accepts or rejects the claim. (If $p$ and $g$ are images of the same person, then we write $p \sim g$; otherwise, $p \not\sim g$.) The verification problem for a single individual is equivalent to the detection problem with a gallery $G = \{g\}$ and probe set $P$.
Table 1. List of groups that took the Sep96 test, broken out by test version and date administered.

Version of test          Group                                    Test date
Fully automatic          MIT [5, 6]                               September 1996
                         U. of Southern California (USC) [12]     March 1997
Eye coordinates given    Baseline PCA [10]                        Baseline
                         Baseline correlation                     Baseline
                         Excalibur Corp. (Carlsbad, CA)           September 1996
                         MIT                                      September 1996
                         Michigan State U. (MSU) [9]              September 1996
                         Rutgers U. [11]                          September 1996
                         UMd [3]                                  September 1996 and March 1997
                         USC                                      March 1997
For each probe $p$, the algorithm determines whether $p \sim g$. This measures system performance for a single person (whose face is in $g$). However, we need performance over a population of people. In our model, the system allows access to a group of people $\{g_i : i = 1, \ldots, I\}$, which we call the gallery $G$. The system is tested by comparing each probe in the probe set $P = \{p_k : k = 1, \ldots, K\}$ with all gallery images. Mathematically, the system decides whether $p_k \sim g_i$ for all $i$ and $k$. We reduce this to $I$ detection problems, in which the gallery consists of a single image $g_i$ and the probe set is $P$. (We assume that there is only one image per person in the gallery.) We obtain overall system performance by summing the results of the $I$ individual detection problems.

We briefly review the following terminology commonly used in signal detection theory [2]. For a probe $p_k \in P$, two possible events are defined: either $p_k$ is in the gallery (event "ig") or $p_k$ is not in the gallery (event "nig"). The probability of responding "yes" (Y) given that event "ig" has occurred is termed the hit rate, and the probability of responding "yes" given that event "nig" has occurred is termed the false alarm rate. That is,
$$\text{Hit Rate} = P(Y \mid ig) \qquad (1)$$

and

$$\text{False Alarm Rate} = P(Y \mid nig). \qquad (2)$$
Our decisions are made by a Neyman-Pearson observer. The observer responds "yes" if $s_i(k) \geq c$ and "no" if $s_i(k) < c$. By the Neyman-Pearson theorem [4], this decision rule maximizes the hit rate for a given false alarm rate. A given cut-off $c$ divides both distributions, "ig" and "nig," into two intervals. The verification rate, $P(Y \mid ig)$, and the false alarm rate, $P(Y \mid nig)$, are the proportions of the "ig" and "nig" distributions in the interval $s_i(k) \geq c$. By varying $c$ from its minimum to its maximum value we obtain a set of hit and false alarm rate pairs. A plot of these pairs is the receiver operating characteristic (ROC), also known as the relative operating characteristic [2, 4].

The verification protocol considers one gallery image at a time. For a given cut-off $c$, the verification and false alarm rates for an image $g_i$ are denoted by $P_{c,i}(Y \mid ig)$ and $P_{c,i}(Y \mid nig)$, respectively. The gallery hit and false alarm rates at the cut-off $c$, denoted by $P_c(Y \mid ig)$ and $P_c(Y \mid nig)$, are then computed by
$$P_c(Y \mid ig) = \frac{1}{|G|} \sum_{i=1}^{|G|} P_{c,i}(Y \mid ig) \qquad (3)$$

and

$$P_c(Y \mid nig) = \frac{1}{|G|} \sum_{i=1}^{|G|} P_{c,i}(Y \mid nig), \qquad (4)$$

which are the averages of $P_{c,i}(Y \mid ig)$ and $P_{c,i}(Y \mid nig)$ over the gallery $G$. The verification ROC is computed by varying $c$ from $-\infty$ to $+\infty$.
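A minimal sketch of how the ROC in equations (3) and (4) could be computed from a matrix of similarity scores, assuming one gallery image per person and hypothetical array layouts; it illustrates the per-gallery-image averaging, not the scoring code used for the FERET results.

```python
import numpy as np

def verification_roc(sim, match):
    """sim[i, k]: similarity s_i(k) between probe i and gallery image k.
    match[i, k]: True if probe i and gallery image k show the same person.
    Assumes every gallery image has at least one matching and one non-matching probe.
    Returns (false alarm rates, hit rates), one pair per cut-off c, averaged
    over the gallery as in equations (3) and (4)."""
    cutoffs = np.unique(sim)
    hit_rates, fa_rates = [], []
    for c in cutoffs:
        yes = sim >= c                                   # Neyman-Pearson observer
        # Per-gallery-image rates P_{c,i}(Y|ig) and P_{c,i}(Y|nig) ...
        hit_i = np.array([yes[match[:, k], k].mean() for k in range(sim.shape[1])])
        fa_i = np.array([yes[~match[:, k], k].mean() for k in range(sim.shape[1])])
        # ... averaged over the gallery, equations (3) and (4).
        hit_rates.append(hit_i.mean())
        fa_rates.append(fa_i.mean())
    return np.array(fa_rates), np.array(hit_rates)
```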
For a given algorithm, the choice of a suitable hit and false alarm rate pair depends on the particular application. However, for performance evaluation and comparison, the equal error rate is quoted. The equal error rate is the point where the incorrect rejection and false alarm rates are equal; that is, $P(N \mid ig) = P(Y \mid nig)$. (The incorrect rejection rate is one minus the hit rate.)
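The equal error rate can be read off the ROC as the point where the incorrect rejection rate $1 - P(Y \mid ig)$ crosses the false alarm rate. A small sketch of that lookup, assuming ROC sample points such as those returned by the hypothetical `verification_roc` sketch above:

```python
import numpy as np

def equal_error_rate(fa_rates, hit_rates):
    """Return the approximate rate at which the incorrect rejection rate
    (1 - hit rate) equals the false alarm rate, given sampled ROC points."""
    rejection = 1.0 - np.asarray(hit_rates)
    fa = np.asarray(fa_rates)
    idx = np.argmin(np.abs(rejection - fa))   # closest crossing among sampled cut-offs
    return (rejection[idx] + fa[idx]) / 2.0   # average the two rates at the crossing
```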
4. Verification Results

In this section we report the results of applying the verification protocol to the Sep96 FERET test scores. For the results in this section, the gallery consisted of 1196 individuals with one FA image per person. The performance was computed for the following two probe sets:
FB images: the second frontal image, taken from the same set as the image in the gallery.

Duplicate images: an image taken in a different session (on a different date) or taken under special circumstances (e.g., the subject was wearing glasses or had a different hair length).

Figure 1 shows the ROC performance for FB probes; figure 2 shows the ROC performance for duplicate probes. In table 2 we report the equal error rates for FB and duplicate probes.

Figure 1. ROC for FB probes. (Gallery size = 1196, number of probes = 3120.)

Figure 2. ROC for duplicate probes. (Gallery size = 730, number of probes = 3120.)

Table 2. Equal error rates for the algorithms.

Algorithm              Equal error rate (%)
                       FB probes    Duplicate probes
Baseline PCA           7            23
Baseline correlation   4            25
Excalibur              5            17
MIT Mar95              5            24
MIT Sep96              4            23
MSU                    3            24
Rutgers                6            18
UMd Sep96              7            22
UMd Mar97              1            16
USC                    2            16

5. Effects of Gallery and Probe Set Size
5.1 Verification

A concern for verification is how performance is affected by an increase in the size of the probe set. Performance is determined by $P(s_i(k) \mid ig)$ and $P(s_i(k) \mid nig)$, which are generated by sampling from $P_{ig}$ and $P_{nig}$, the universes of probes in the gallery and not in the gallery. When the size of the probe set increases (sampling from $P_{ig}$ and $P_{nig}$), the accuracy of the estimates of $P(s_i(k) \mid ig)$, $P(s_i(k) \mid nig)$, and the resulting ROC increases. We investigated this by running two empirical experiments with the baseline PCA algorithm (a sketch of the resampling procedure appears below).

In the first experiment we varied the size of the probe set while keeping the gallery fixed. The gallery size is 1196, and we computed ROCs for probe sets of size 500, 1000, 1500, 2000, and 2500 (fig. 3). Examining figure 3, we see that verification performance converges: there is very little difference among the ROCs for the probe sets of size 1500, 2000, and 2500.

In the second experiment we kept the probe set fixed and varied the size of the gallery. In this experiment the probe set consisted of 3120 images and was the same for all runs. We randomly generated galleries of size 200, 400, and 600. For each gallery size, we randomly generated 10 galleries and computed the ROCs for duplicate probes. For each gallery size, we found the best- and worst-performing ROCs and plotted them in figure 4. This clearly shows that the accuracy of the estimate of the ROC increases as the size of the gallery increases.
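A sketch of the second experiment's resampling loop, under assumptions not stated in the paper: it reuses the hypothetical `verification_roc` helper from Section 3, draws gallery columns at random from a precomputed similarity matrix, and ranks the resulting curves by the area under the sampled ROC points to pick the best and worst, as plotted in figure 4.

```python
import numpy as np

rng = np.random.default_rng(0)

def gallery_size_experiment(sim, match, sizes=(200, 400, 600), trials=10):
    """For each gallery size, draw `trials` random galleries (columns of sim),
    compute the verification ROC for each, and keep the best and worst curves.
    verification_roc: see the sketch in Section 3 (hypothetical helper)."""
    results = {}
    for n in sizes:
        rocs = []
        for _ in range(trials):
            cols = rng.choice(sim.shape[1], size=n, replace=False)
            fa, hit = verification_roc(sim[:, cols], match[:, cols])
            auc = abs(np.trapz(hit, fa))     # |trapz| since fa is monotone decreasing
            rocs.append((auc, fa, hit))
        rocs.sort(key=lambda r: r[0])
        results[n] = {"worst": rocs[0][1:], "best": rocs[-1][1:]}
    return results
```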
Figure 3. ROCs for the baseline PCA algorithm for duplicate probes as a function of the size of the probe set.

Figure 4. ROCs for the baseline PCA algorithm for duplicate probes as a function of the size of the gallery.
5.2 Identification

For identification, the corresponding concern is the effect on performance as the size of the gallery increases. To study this effect, we randomly generated galleries of size 200, 400, 600, and 800. For each gallery size, we randomly generated 100 galleries and scored them against duplicate probe sets. The scores were calculated for the baseline PCA algorithm.

The identification scores are reported as cumulative match scores. Let $K$ be the size of a probe set and $R_k$ the number of probes for which the correct answer is in the top $k$. For each $k$ we compute $m_k = R_k / K$, the proportion of probes for which the correct answer is in the top $k$ matches. The results are presented in a graph with $k$ on the horizontal axis and $m_k$ on the vertical axis; this graph is a plot of the cumulative match scores. The $m_k$ scores report performance for a single gallery and probe set. To summarize performance over a sample population of galleries, we average the cumulative match scores: if $m_k^\ell = R_k^\ell / K$ for gallery $G^\ell$, then the average rank-$k$ score is $\overline{m}_k = \frac{1}{N} \sum_\ell m_k^\ell$, and the curve $(k, \overline{m}_k)$ is the average cumulative match score. Figure 5 reports the average cumulative match scores for galleries of size 200, 400, 600, and 800, which show a decrease in performance as gallery size increases.

The average cumulative score reports performance as a function of absolute rank. When comparing galleries of different sizes, it is insightful to compare the relative performance of galleries. In relative performance, we measure the fraction of probes that are in the top $p\%$ of the gallery; the average percent score for $p\%$ is $m'_p = \overline{m}_{\lfloor pN/100 \rfloor}$, where $N$ is the gallery size. (The relative performance for a gallery $G$ is denoted by $m'_p$.) Figure 6 reports the results from figure 5 as average cumulative percent scores. The graphs of the average cumulative percent scores in figure 6 are very close in value. This suggests the hypothesis that the distributions of $m'_p$ are the same for the different-sized galleries. We performed hypothesis testing with the bootstrap permutation test on the means of $m'_{05}$ (chap. 15, Efron and Tibshirani [1]). The null hypothesis was that the distributions of $m'_{05}$ were the same for all gallery sizes. The results show that in all cases, the null hypothesis cannot be rejected. We repeated the procedure for $m'_{10}$, which produced the same conclusion: we cannot reject the null hypothesis. (A sketch of the cumulative match score computation appears at the end of this subsection.)

Figure 5. Average cumulative rank scores for galleries of size 200, 400, 600, and 800. The average was computed from 100 galleries for each gallery size. Scores are for duplicate probe sets.

Figure 6. The cumulative match scores in figure 5 rescaled so that the rank is a percentage of the gallery.
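A minimal sketch of the cumulative match score computation and of the averaging over randomly generated galleries, under assumed array layouts; it assumes one correct gallery image per probe, and the helper names are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

def cumulative_match(sim, correct_col):
    """sim[i, k]: similarity between probe i and gallery image k.
    correct_col[i]: column index of the correct gallery image for probe i.
    Returns m_k = R_k / K for k = 1, ..., |gallery| (the cumulative match scores)."""
    order = np.argsort(-sim, axis=1)                         # gallery indices, best match first
    ranks = np.argmax(order == correct_col[:, None], axis=1) + 1
    return np.array([(ranks <= k).mean() for k in range(1, sim.shape[1] + 1)])

def average_cms(sim, correct_col, gallery_size, n_galleries=100):
    """Average the cumulative match scores over randomly drawn galleries of one size,
    keeping only the probes whose correct gallery image is in the drawn gallery."""
    curves = []
    for _ in range(n_galleries):
        cols = np.sort(rng.choice(sim.shape[1], size=gallery_size, replace=False))
        keep = np.isin(correct_col, cols)
        sub = sim[np.ix_(np.where(keep)[0], cols)]
        new_correct = np.searchsorted(cols, correct_col[keep])
        curves.append(cumulative_match(sub, new_correct))
    return np.mean(curves, axis=0)                           # averaged rank-k scores
```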
6. Conclusion

We have devised a verification test protocol for the Sep96 FERET test and reported results for this protocol. This allows for an independent assessment of algorithms in a key potential application area of face recognition. By comparing MIT's performance between March 1995 and September 1996, and the University of Maryland's performance between September 1996 and March 1997, one sees the improvement in both algorithms. This shows that algorithm development is a dynamic process and that evaluations such as FERET make an important contribution to computer vision. It also shows that, when comparing algorithms, one needs to note when the tests were administered. The tests let researchers know the strengths of their algorithms and where improvements could be made. By knowing their weaknesses, researchers knew where to concentrate their efforts to improve performance.

The databases used to develop and test algorithms are usually smaller than the databases that will be encountered in applications. Therefore, we need to be able to predict the effects on performance of increases in the size of the gallery and probe set. For verification, we have shown that the accuracy of performance estimates increases as the size of the probe set increases. We have empirically shown for one algorithm that expected relative cumulative match scores are not a function of gallery size.
References

[1] B. Efron and R. J. Tibshirani. An Introduction to the Bootstrap. Chapman and Hall, 1993.
[2] J. P. Egan. Signal Detection Theory and ROC Analysis. Academic Press, 1975.
[3] K. Etemad and R. Chellappa. Discriminant analysis for recognition of human face images. J. Opt. Soc. Am. A, 14:1724–1733, August 1997.
[4] D. Green and J. Swets. Signal Detection Theory and Psychophysics. John Wiley & Sons, 1966.
[5] B. Moghaddam, C. Nastar, and A. Pentland. Bayesian face recognition using deformable intensity surfaces. In Proceedings Computer Vision and Pattern Recognition 96, pages 638–645, 1996.
[6] B. Moghaddam and A. Pentland. Probabilistic visual learning for object detection. IEEE Trans. PAMI, 17(7):696–710, 1997.
[7] P. J. Phillips, H. Moon, P. Rauss, and S. Rizvi. The FERET evaluation methodology for face-recognition algorithms. In Proceedings Computer Vision and Pattern Recognition 97, pages 137–143, 1997.
[8] P. J. Phillips, H. Wechsler, J. Huang, and P. Rauss. The FERET database and evaluation procedure for face-recognition algorithms. Image and Vision Computing Journal, 16(5):295–306, 1998.
[9] D. Swets and J. Weng. Using discriminant eigenfeatures for image retrieval. IEEE Trans. PAMI, 18(8):831–836, 1996.
[10] M. Turk and A. Pentland. Eigenfaces for recognition. J. Cognitive Neuroscience, 3(1):71–86, 1991.
[11] J. Wilder. Face recognition using transform coding of gray scale projections and the neural tree network. In R. J. Mammone, editor, Artificial Neural Networks with Applications in Speech and Vision, pages 520–536. Chapman Hall, 1994.
[12] L. Wiskott, J.-M. Fellous, N. Kruger, and C. von der Malsburg. Face recognition by elastic bunch graph matching. IEEE Trans. PAMI, 17(7):775–779, 1997.
[13] W. Zhao, R. Chellappa, and A. Krishnaswamy. Discriminant analysis of principal components for face recognition. In 3rd International Conference on Automatic Face and Gesture Recognition, in press.