SIViP (2011) 5:401–413 DOI 10.1007/s11760-011-0244-6
ORIGINAL PAPER
Face authentication with undercontrolled pose and illumination

Maria De Marsico · Michele Nappi · Daniel Riccio
Received: 15 September 2010 / Revised: 14 December 2010 / Accepted: 17 February 2011 / Published online: 7 August 2011 © Springer-Verlag London Limited 2011
Abstract Though face recognition gained significant attention and credibility in the last decade, very few commercial applications are presently able to actually benefit from it. As a matter of fact, the high intra-class variability, mainly due to changes in pose and lighting, strongly hinders a wider diffusion of this biometry in real-world applications. In this paper, we propose FACE (Face Analysis for Commercial Entities), a framework for face recognition that is robust to both pose and light variations, thanks to the correction strategies it implements. The framework also includes two separate indices for the quantitative assessment of these two kinds of distortion, which allow evaluating the conditions of the sample at hand before submitting it to the classifier. Moreover, FACE implements two reliability margins which, differently from the preceding two indices, estimate the "acceptability" of each single response from the classifier. Experimental results show that FACE, by overcoming the problems due to pose and lighting variations, provides an accuracy (in terms of Recognition Rate) that improves, in some respects, on the present state of the art. On one side, corrections of pose and light allow FACE to achieve good results even in non-optimal conditions. On the other side, the integration of distortion measures and reliability margins into the recognition process improves such results further, with a significant increase in system accuracy.
M. De Marsico (B)
Sapienza Università di Roma, via Salaria 113, 00198 Rome, Italy
e-mail: [email protected]

M. Nappi · D. Riccio
Università di Salerno, via Ponte don Melillo, 84084 Fisciano, Italy
e-mail: [email protected]

D. Riccio
e-mail: [email protected]
Keywords Biometrics · Face recognition · Pose · Illumination · Correlation index
1 Introduction

Secure access to resources, either logical (data and information) or physical (laboratories, military zones), is a basic issue in modern society. Technological development has fostered the introduction and improvement of increasingly efficient, as well as complex, techniques to address it. Nowadays, face biometry is studied mostly for security applications, and only timid attempts exist to extend the related technologies to commercial contexts. This is due to the fact that, despite advances in research, real-world scenarios remain a challenge, since some key conditions can significantly affect recognition performance. Actually, it often happens that gallery (stored) face images are captured in quite controlled conditions, in a rather predefined pose and under uniform lighting, while probes (test images acquired during actual system operation) might be captured in a less controlled setting. This can be anticipated as a quite typical situation for commercial applications. Illumination and pose variations are therefore especially crucial factors that may, and actually do, hinder correct recognition in these scenarios. As a consequence, face biometry is not yet sufficiently mature to be exploited in totally uncontrolled settings, and this limits its wider diffusion. Many experiments demonstrate that illumination may induce greater differences in the appearance of a single individual than those among different individuals. Therefore, many approaches to illumination invariance have been explored. Gao et al. [8] define a new extension of the edge map technique, namely the Line Edge Map: face contours are extracted and combined in segments, which are in turn organized in lines. The Hausdorff distance is also modified
in order to manage these new feature vectors. A novel prefiltering criterion is applied before performing the actual testing operation. However, Fisherfaces remain superior to this methodology, thanks to their capability to maximize the between-person variability while minimizing the within-person differences. According to this and similar results, Li et al. [13] argue that the performance of a recognition system can often be further improved by combining several suitable linear techniques. As for robustness to lighting variations in purely 2D methods, one of the present trends is to exploit image normalization techniques, rather than devising distortion-invariant recognition techniques. A comparison among many lighting normalization algorithms is performed in [18]. SQI (Self-Quotient Image) [21] and LBP (Local Binary Pattern) [3] prove to represent the best compromise in terms of simplicity, speed and authentication accuracy when the adopted classifier is of the eigenspace-based type. The use of generative methods seems an interesting alternative to the above approaches. However, not much has been done along this line yet. One of the few related works has been proposed by Georghiades et al. [9]. The face shape and the albedo are extracted from a few images of a subject by means of shape-from-shading techniques. The derived 3D model is then used to synthesize a wide set of face views in different poses and illumination conditions. The core underlying hypothesis is that, for each fixed pose of the object/face, all its views under different illuminations form a convex cone in the image space. Such a convex cone is therefore computed for every subject and pose, and then approximated by means of a low-dimensional linear subspace. In the testing phase, the pose of the subject is estimated, and the identity with the nearest convex cone, according to the Euclidean distance, is assigned to it. This method outperforms many others in terms of recognition rate. The downside is the non-trivial computational cost of the training phase. Generative methods are also used to address distortions due to pose changes. As an alternative, multi-view face recognition directly extends frontal face recognition; related algorithms require gallery images in every pose. On the other hand, face recognition across pose aims at recognizing a face acquired from a quite novel viewpoint [23]. In [24], one frontal and one profile image are used as a starting point to recover both 3D shape and texture information. The resulting 3D virtual face model subsequently allows generating face images in completely arbitrary poses. In [22], the frontal image of a face in non-frontal pose is reconstructed from a single image, in order to avoid the computationally expensive process of 3D model generation. The underlying assumption is that there exists a linear mapping between subregions (patches) of non-frontal and frontal images of the same subject.
Despite the sketched limitations, even if face recognition is less reliable than, say, fingerprints or iris, and even if a privacy violation is sometimes asserted [11], it still seems a good compromise among different requirements. It entails lighter computation and produces less psychological discomfort, since people are quite accustomed to being photographed. Therefore, it is very promising and actively investigated in the field of massive recognition applications. An example is the organization of databases that collect photographs of interesting subjects (actors, public personalities) in different situations and automatically classify them according to the subjects' identity. As discussed above, light and pose variations are among the main factors that undermine the accuracy of face retrieval systems. Nevertheless, some proposals try to address this kind of application, such as Polar Rose [25], iPhoto [26] and Google Picasa [27]. Polar Rose is proposed as a tool whose design devotes particular attention to the connection with social networks such as Facebook and Flickr. Therefore, photo tagging, as well as the import of photo albums and names from Facebook, is among the main provided functions. The core technology of this software is the reconstruction of a 3D face model, starting from two-dimensional images. On the other hand, Picasa and iPhoto are more oriented to purely managing and organizing photo albums, so that they rely on a semi-automatic training procedure. This implies that the user submits a query and refines its results by providing relevance feedback, i.e. by discriminating, in several steps, significant images from uninteresting ones. Along a more research-oriented line, Becker and Ortiz [4] evaluated some of the most popular face recognition algorithms in the literature (PCA, LDA, ICA, SVMs) to assess the feasibility of using them in the kind of settings considered above. Their benchmark consisted of image datasets from Facebook. Their conclusion is that none of the tested algorithms is presently sufficiently robust to acquisition variations (e.g. pose and illumination) in a real setting, and therefore none is sufficiently mature to guarantee satisfying performances in a real application. In this paper, we propose FACE, a framework for the completely automatic analysis of images, aiming at extracting and classifying the faces in them. FACE (Face Analysis for Commercial Entities) reduces the impact of pose and lighting distortions on the accuracy of the recognition system by implementing correction procedures that try to bring the acquired face back to a frontal pose and to uniform lighting. A cloud of interest points is located during the segmentation process of the input image. These points are then used to correct the pose through affine transformations of the involved regions. After a pseudo-frontal pose has been obtained, lighting is corrected by implementing the SQI method, where each pixel is divided by the mean of its neighbourhood [21]. The located interest points on the face allow FACE to derive important additional information regarding the acquisition
conditions of the sample at hand. In particular, two ad hoc (for face biometry) measures are defined, one for the pose (Sample Pose—SP) and the other for lighting (Sample Illumination—SI), which are inversely related to the "effort" needed to correct the original image: a high measure means a low distortion. Such measures allow discarding some samples (a priori sample selection). This happens when the obtained values are too low, anticipating a scarcely significant response from the classifier. FACE also exploits two general-purpose (i.e. suited to any biometry) reliability margins (System Response Reliability—SRR), SRR I and SRR II, which evaluate each single response from the system by considering the overall composition of the classifier gallery. SRR makes it possible to assess the quality of the response and to discard those whose reliability index is below a pre-defined threshold (a posteriori response selection). In summary, we have a sample correction phase, immediately followed by the sample assessment, before the recognition procedure, and a final response assessment. To perform our experiments, FACE was included in the framework that was implemented and placed at researchers' disposal by Becker and Ortiz [4]. In this way, it was possible to compare its performances with the other implemented algorithms. Tests were performed both on the database included in the framework (AT&T [28]) and on a dataset including 1,500 photographs of celebrities downloaded from [29]. Results show that FACE is definitely better than the other algorithms in the framework, with an 11% increment in Accuracy. Since FACE does not need any training phase, a further advantage is that it is not necessary to retrain it after adding new photographs; this translates into the ability to tag such newcomers without having to re-tag all the others. FACE was also tested on the FERET database [15] to assess the precision of interest point location on the face and how it affects recognition accuracy. The paper is organized in five further sections. Section 2 presents a short survey of the state of the art regarding distortion measures and reliability margins, to better place those implemented in FACE. Section 3 presents the methods for face location and for the correction of pose and lighting distortions. The feature extraction process and template matching are described in Sect. 4. Section 5 presents experimental results. Section 6 draws some conclusions.
2 System reliability measurements

Recognition systems might not be equally reliable, due to the possibly different accuracy of their procedures, or even to the different intrinsic difficulty of processing the corresponding biometries. On the other hand, not even all responses from the same system might be equally reliable, due to the possibly varying quality of the input from time to time.
In many situations, an unreliable response is not necessarily equivalent to a rejection; rather, it may represent a valid reason to perform a further check. The definition of a measure for response reliability is then crucial for the overall accuracy of the system. Obviously, it is not possible to naively rely on the single value of the distance/similarity between the probe and the returned subject. In order for this information to reflect a kind of response reliability, the relation between the returned subject and the rest of the gallery should also be considered. In other words, results should be evaluated differently also based on the similarity of gallery subjects, i.e. on how discriminative the biometric feature is. Classical performance measures, such as the Recognition Rate, actually measure the overall ability of a system to perform a correct recognition. As a matter of fact, they are used to compare the performances of different systems. However, they provide a global estimate of the recognition ability, while they do not provide any cue about the reliability of a single testing operation. A more complex approach to evaluating system/response reliability relies on the quality of input data. However, it is hard to devise a metric estimating the confidence in an identification result, given the quality of a biometric acquisition. Input quality is exploited, for example, by Kryszczuk et al. [12]. They use Bayesian networks to evaluate the probability of a correct verification decision by a classifier given the available evidence, coming from several sources. In particular, a vector of signal-domain quality measures accounts for significant characteristics of each biometry and of each classification method. It is worth noticing that such metrics should also depend on the exploited classification method, as different algorithms might be sensitive to specific problems in the input, while being robust to others. A different solution found in literature is the use of confidence measures based on margins. Each margin measures the amount of "risk" associated with a single subsystem response, after observing its scores. Margins can be applied to any subsystem, regardless of the nature of its input. Poh and Bengio [16] introduce a confidence margin based on the False Acceptance Rate (FAR) and False Rejection Rate (FRR) of a biometric system. They show how it behaves better than those by Freund [7] and by Vapnik [19], which are exploited for boosting and within statistical learning theory, respectively. The latter can only be calculated by supposing that the target output (class label, i.e. impostor/genuine in this case) is known, so that they are only significant in training. On the other hand, since Poh and Bengio's margin relies on an estimate (FAR, FRR) of the actual distribution of genuine/impostor subjects' scores, a quite high number of responses are marked as reliable. Although the error rate of the global system is reduced, this might not be appropriate for applications requiring a very high security level. Kryszczuk et al. [12] also exploit
a margin similar to Poh and Bengio's one, which is defined in terms of Correct Acceptance (CA) and Correct Rejection (CR) accuracies at a given acceptance threshold. In general, the frequentist approach to reliability is considered valid only under the assumption that the distribution of the scores of the testing set is similar to that of the development set, and also stable in time.

3 Face detection and normalization

The first step in a face recognition-based application is the detection of one or more faces within an image. In FACE, the face and its characteristic points are located through the approach presented in [14]. This extended Active Shape Model (STASM) algorithm first submits the image to a global face detector (Viola-Jones [20] or Rowley [17]), which extracts each region including a face; each such region is separately input to the Active Shape Model (ASM). ASM searches for relevant landmarks by minimizing a global distance between candidate image points and their homologues on a general model (shape model), which is pre-computed over a wide set of training images. The algorithm locates 68 interest points, with a precision that decreases with pose distortions. Figure 1 shows an image in non-optimal pose and illumination conditions and the set of STASM landmarks. These points are exploited later for the normalization process, bringing the face image to a canonical pose and illumination. This is a core issue in a quite uncontrolled setting, where these two represent the main hindering elements for classification. As we will describe below, FACE separately addresses them using different starting points from literature.
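As an illustration of this two-stage detection step, the following minimal Python sketch runs a Viola-Jones cascade (as bundled with OpenCV) on an input image and crops each detected face region; the `fit_asm_landmarks` stub merely stands in for the ASM/STASM fitting stage, whose implementation is not reproduced here. Parameter values are typical defaults, not taken from the paper.

```python
import cv2  # OpenCV ships with pre-trained Viola-Jones cascades


def detect_face_regions(image_path):
    """Stage 1: global face detector (Viola-Jones cascade)."""
    img = cv2.imread(image_path)
    gray = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)
    cascade = cv2.CascadeClassifier(
        cv2.data.haarcascades + "haarcascade_frontalface_default.xml")
    # Each hit is (x, y, w, h); detector parameters are illustrative defaults
    boxes = cascade.detectMultiScale(gray, scaleFactor=1.1, minNeighbors=5)
    return [gray[y:y + h, x:x + w] for (x, y, w, h) in boxes]


def fit_asm_landmarks(face_region):
    """Stage 2 (stub): an ASM fitter such as STASM would return the
    68 landmark points for this face region."""
    raise NotImplementedError("plug in an ASM/STASM implementation here")
```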
Fig. 1 The original face image (left) and landmarks located by STASM (right)

Fig. 2 The six main steps of the pose and illumination correction process

3.1 Correction of pose and illumination

Many algorithms in the literature address pose normalization, aiming at improving classification accuracy. Among the most efficient ones, we can mention that proposed by Blanz and Vetter [5]. It makes it possible to obtain a 3D face model, starting from one or more 2D images and modifying a generic 3D model (morphable model). A similar idea is applied in Polar Rose (see Sect. 1) and, in principle, allows correcting pose and illumination simultaneously. The computational cost is its true limit, above all when processing a high number of faces. FACE exploits a less complex, yet equally effective, approach to pose normalization. It starts from the set of 68 relevant points located by STASM, but uses only 13 of them, as shown in Fig. 2 (the red ones). The centres of the eyes are used to correct head roll (Fig. 2a). In order to obtain a normalized image, we first transform the original one so that the right-hand side of the represented face is always the better exposed (best) one. To this aim, d_l and d_r are defined as the distances between the external corners of the left and right eye, respectively, and the tip of the nose (Fig. 2b); they make it possible to locate the better exposed half of the face in the original image. If it is the right one (d_r ≥ d_l), the image is left unchanged; otherwise, it is reflected with respect to the vertical axis (horizontal flip). The points in Fig. 2c provide the exact face profile. The first point upwards (the light one) is not provided by STASM, but is an additional one derived as the midpoint of the segment joining its immediate neighbours (the first two points upwards on the nose sides, as shown in Fig. 1, right). The face profile and the image borders delimit the right and left face regions. The normalization process continues by applying a stretching operation to all rows of the right region, in order to obtain the same constant length. The lines passing through some of the interest points in Fig. 2d are then used to divide the so-processed right half face into horizontal and vertical bands. The choice of the delimiting points is based on the elements that mainly drive the normalization
process, namely the bands containing, respectively, the eyebrow, the eye, the nose and the mouth. Lines are resized to make these interest points fall in pre-determined positions. Finally, the left half face is reconstructed by reflecting the right half (Fig. 2e). The last step might seem redundant, as most recognition algorithms could still work on half of the face. However, we perform it to allow the use of our pose correction procedure even in those systems where recognition relies on the detection of special interest points, e.g. the centres of both eyes, and on the measurement of their distances to derive geometric features. After completing the pose correction process, the resulting image undergoes illumination correction. This is performed through the SQI algorithm described in [21]. The value of each image pixel is divided by the mean of the values in its neighbourhood, represented by a square mask of size k × k (in our case k = 8). The final result of the overall normalization process is shown in Fig. 2f.
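A minimal sketch of this illumination correction step, assuming a grayscale image stored as a NumPy array; the neighbourhood mean is obtained with a uniform box filter, while the epsilon guard and the final rescaling to [0, 255] are our additions, not specified in the paper.

```python
import numpy as np
from scipy.ndimage import uniform_filter


def sqi_correction(gray, k=8, eps=1e-6):
    """Self-Quotient Image: divide each pixel by the mean of its k x k neighbourhood."""
    gray = gray.astype(np.float64)
    local_mean = uniform_filter(gray, size=k)  # mean over the k x k square mask
    quotient = gray / (local_mean + eps)
    # Rescale to [0, 255] for display/storage (our assumption)
    quotient -= quotient.min()
    return (255.0 * quotient / (np.ptp(quotient) + eps)).astype(np.uint8)
```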
3.2 Measuring pose and illumination distortions

A possible (preliminary) way to evaluate the expectable reliability of a FACE response is to consider the amount of "effort" needed to correct pose and illumination in the probe image: more correction implies lower reliability. This method is bound to the biometry at hand and to the specific normalization process. In practice, if an image is too distorted, we discard it without proceeding with recognition. As described above, the normalization procedure aims at recovering a frontal pose of the face presented in the input image, starting from the points located through the approach described in [14]. The distribution of such points on the face is a good starting point to evaluate the degree of distortion corrected by the pose normalization process. On the other hand, the measure of the original lighting distortion is directly performed using the grey levels of the input image. The index related to pose distortion is given by the linear combination of three components, which are, respectively, inversely proportional to roll, yaw and pitch. In general, these three measures vary concomitantly, so that they should be jointly considered; however, in our case, the considered pose variations are such that a good distortion measure can also be obtained by considering the weight of each of them singularly. As a matter of fact, we underline that we are interested in estimating the rough influence of the pose distortion on the recognition process, while an accurate estimation of the exact rotation angles would be beyond our goals and therefore uselessly burdensome. In practice, roll is approximated with the angle θ formed by the line passing through the centres of the eyes and the x axis (Fig. 3a). We then define the contribution of the roll to the pose distortion as:
roll = min(2θ/π, 1)    (1)

Fig. 3 Feature points used to compute pose and illumination distortion measures
As for yaw, we compute the above-defined left distance d_l and right distance d_r between the external corner of each eye and the nose tip (Fig. 3b). In a frontal pose, such distances tend to be equal; otherwise, one of them increases at the expense of the other. The contribution of the yaw to the pose distortion is defined as:

yaw = (max(d_l, d_r) − min(d_l, d_r)) / max(d_l, d_r)    (2)

The pitch is considered in a similar way; the involved distances are e_u and e_d, which are, respectively, the distances of the root of the nose and of the chin from the nose tip, so that it can be defined as:

pitch = (max(e_u, e_d) − min(e_u, e_d)) / max(e_u, e_d)    (3)

Notice that e_u and e_d are surely more subject-dependent than d_l and d_r, the latter being coarsely equal when the face is seen frontally. However, in a frontal pose, even e_u and e_d are not dramatically different. Moreover, we remind again that we merely need a very rough evaluation of the amount of deformation the pose underwent with respect to a frontal one. As a matter of fact, we do not use such indices directly for correction, but only to compare their combination with a deformation threshold. All three factors range from 0 (almost no distortion) to 1 (worst situation, highest distortion). The Sample Pose index can now be defined as a weighted linear combination:

SP = α · (1 − roll) + β · (1 − yaw) + γ · (1 − pitch)    (4)

with α + β + γ = 1. In our case, we experimentally found that the values α = 0.1, β = 0.6, γ = 0.3 work pretty well.
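The following sketch computes the three factors of Eqs. (1)-(3) and combines them into SP as in Eq. (4). It assumes the relevant landmarks are available as (x, y) coordinate pairs; the parameter names are ours, for illustration only.

```python
import math


def sample_pose(eye_l, eye_r, corner_l, corner_r, nose_tip, nose_root, chin,
                alpha=0.1, beta=0.6, gamma=0.3):
    """Sample Pose index (Eq. 4); all inputs are (x, y) landmark tuples."""
    dist = lambda a, b: math.hypot(a[0] - b[0], a[1] - b[1])
    # Roll (Eq. 1): angle between the inter-eye line and the x axis
    theta = abs(math.atan2(eye_r[1] - eye_l[1], eye_r[0] - eye_l[0]))
    roll = min(2.0 * theta / math.pi, 1.0)
    # Yaw (Eq. 2): asymmetry of the eye-corner-to-nose-tip distances
    d_l, d_r = dist(corner_l, nose_tip), dist(corner_r, nose_tip)
    yaw = (max(d_l, d_r) - min(d_l, d_r)) / max(d_l, d_r)
    # Pitch (Eq. 3): asymmetry of nose-root and chin distances from the nose tip
    e_u, e_d = dist(nose_root, nose_tip), dist(chin, nose_tip)
    pitch = (max(e_u, e_d) - min(e_u, e_d)) / max(e_u, e_d)
    return alpha * (1 - roll) + beta * (1 - yaw) + gamma * (1 - pitch)
```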
The reason for the much lower value of α is that, once the image has been corrected using the computed angle θ, roll no longer influences the perceived appearance of the face.

When a face is lit up in an optimal way, neither shadows nor saturated zones appear on it due to the light reflected by the skin. As a consequence, some face regions assume a uniform grey-level distribution (i.e. they present quite similar grey-level histograms). Based on this hypothesis, we select some reference points, starting from those located on the face during the pose normalization step, which are useful to identify those face regions that are less prone to variations under optimal lighting conditions. In particular, eight points can be automatically computed as intermediate positions, starting from those returned by the algorithm in [14] (Fig. 3d). For each point, we select a portion of the image which is proportional to the square containing the face. For each portion w, we compute the histogram h_w and its centre of mass, using the formula:

mc(w) = Σ_{i=0}^{255} i · h_w(i) / Σ_{i=0}^{255} h_w(i)    (5)

The complete mc is a vector of eight elements that correspond to the centres of mass of the eight selected sub-regions. The lighting distortion index is defined as the variance of the elements of vector mc, normalized by a function F into the interval [0, 1]. The Sample Illumination index is defined so as to be inversely proportional to the distortion:

SI = 1 − F(std(mc))    (6)
In the case at hand, we used the sigmoid as the normalization function, with parameters experimentally computed to introduce the least distortion in the distribution of the original data. We did not use a more popular method like min-max because the maximum value to normalize is not known a priori and can span a significant range; in these conditions, min-max is known to give poor performances [10]. Notice that, as with SP, the higher the value, the more reliable the response. Both SP and SI can be used to perform an a priori selection of promising templates, i.e. of templates that will possibly induce a correct recognition. On the other hand, templates with SP and/or SI below appropriate thresholds may be discarded, or even undergo a different recognition process. The influence of the thresholds for SP and SI on the final recognition accuracy is discussed in the section about experimental results.
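A minimal sketch of the SI computation under stated assumptions: the eight reference patches are given as a list of grayscale sub-images, and the sigmoid parameters (midpoint and slope) are illustrative placeholders for the values the paper tunes experimentally.

```python
import numpy as np


def sample_illumination(patches, mid=20.0, slope=0.15):
    """Sample Illumination index (Eq. 6) from eight grayscale patches."""
    mc = []
    for w in patches:
        h, _ = np.histogram(w, bins=256, range=(0, 256))
        levels = np.arange(256)
        # Centre of mass of the patch histogram (Eq. 5)
        mc.append((levels * h).sum() / max(h.sum(), 1))
    spread = np.std(mc)
    # F: sigmoid squashing the spread into (0, 1); parameters are illustrative
    F = 1.0 / (1.0 + np.exp(-slope * (spread - mid)))
    return 1.0 - F
```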
4 Face matching

We have already cited the recent work by Becker and Ortiz [4], who used some typical image datasets from Facebook as a benchmark. They demonstrated that many of the most
well-known techniques for face classification [1] are still too sensitive to image distortions to be profitably used in commercial applications, which typically run in partially or totally uncontrolled settings. In this paper, we propose to perform image matching by a localized version of the spatial correlation index. In its global form, given two images A and B with respective pixel mean values Ā and B̄, it is defined as:

s(A, B) = Σ_{i=0}^{n−1} Σ_{j=0}^{m−1} (A(i, j) − Ā)(B(i, j) − B̄) / √( Σ_{i=0}^{n−1} Σ_{j=0}^{m−1} (A(i, j) − Ā)² · Σ_{i=0}^{n−1} Σ_{j=0}^{m−1} (B(i, j) − B̄)² )    (7)
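A minimal sketch of this correlation index for two equally sized grayscale arrays, together with the localized variant described in the following paragraph; the block and search-window sizes are our illustrative choices, not values from the paper.

```python
import numpy as np


def correlation(a, b, eps=1e-12):
    """Global spatial correlation index of Eq. (7) for equally sized arrays."""
    a = a.astype(np.float64) - a.mean()
    b = b.astype(np.float64) - b.mean()
    return (a * b).sum() / (np.sqrt((a * a).sum() * (b * b).sum()) + eps)


def localized_score(A, B, block=16, search=4):
    """Sum of the local maxima of the correlation over sub-regions of A,
    each matched within a +/- search window around the same position in B."""
    n, m = A.shape
    total = 0.0
    for i in range(0, n - block + 1, block):
        for j in range(0, m - block + 1, block):
            rA = A[i:i + block, j:j + block]
            best = -1.0
            for di in range(-search, search + 1):
                for dj in range(-search, search + 1):
                    y, x = i + di, j + dj
                    if 0 <= y <= n - block and 0 <= x <= m - block:
                        rB = B[y:y + block, x:x + block]
                        best = max(best, correlation(rA, rB))
            total += best
    return total
```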
In FACE, we adapted it to work locally, over single sub-regions r_A and r_B of images A and B. For each sub-region r_A, we search, in a limited window around the same position in B, for the region r_B that maximizes the correlation s(r_A, r_B). The global correlation S(A, B) is obtained as the sum of the local maxima. This approach is more accurate, but also more computationally expensive. However, some tricks, such as the pre-computation of some quantities in the matching formulae, code optimization and the reduced required resolution, make it possible to perform a considerable number of matches (hundreds) in less than one second on medium-to-low-end computing equipment. The identification protocol requires that the system gallery G contains a number l > 0 of acquisitions (templates) for each enrolled identity I_j, j = 1, ..., |H|, where H is the set of such identities. When a new query image p is submitted to FACE, it is matched against all the templates g_k, k = 1, ..., |G|, in the gallery G by computing the corresponding correlation index S(p, g_k). The list of values so computed is sorted in decreasing order, and the identity I_j with more images in the first n positions is returned.

4.1 System response reliability

We also exploit a biometry-independent kind of margin, which is actually independent both from the biometry at hand and from any processing performed over the images, in particular normalization. This margin results from defining a system/gallery-dependent metric that we call System Response Reliability (SRR) [2]. It stems from the consideration that the gallery is an integral component of an identification system, and its composition may influence recognition performance. However, present methodologies do not support tuning to the database; our metric also attempts to limit this drawback. It measures the ability to separate genuine subjects from impostors, for each probe, in the sharpest possible way. For the readers' convenience, we report here its rationale and core concepts, as well as its most recent formulation, which is slightly improved with respect to [2].
Let A be an identification system and G its gallery of correctly enrolled identities. Assume there are l > 0 acquisitions (templates) for each genuine identity. Let p be a person to be identified (probe). The system first computes the distances d(p, g_i), i = 1, ..., |G|, between p and each template in the gallery. Distances are ordered so that d(p, g_{i_1}) ≤ d(p, g_{i_2}) ≤ ... ≤ d(p, g_{i_|G|}). We analysed two different functions ϕ_j(p), j = 1, 2, used to compute two alternative SRR measures. SRR I uses the Relative Distance (the relative distance between the scores of the first two retrieved distinct identities), while SRR II uses the Density Ratio (the relative amount of gallery templates that are "near" to the retrieved identity). In both cases, a normalization function F is used to normalize distances into the interval [0, 1) (e.g. min-max), so that each ϕ_j(p) is independent from the specific measure d. The first function ϕ_1(p) is defined as:

ϕ_1(p) = (F(d(p, g_{i_2})) − F(d(p, g_{i_1}))) / F(d(p, g_{i_|G|}))    (8)

where g_{i_2} is the second distinct identity in the returned ordered list. It relates to the degree of uncertainty by which the system identified person p: if ϕ_1(p) is high, a person exists in the gallery who is much more similar to p than all the others; if it is small, then the retrieved g_{i_1} is only the nearest within a set of identities at a similar distance from p. As experimentally observed, the relative distance tends to be small for impostors and high for genuine subjects, independently of the biometry and of the classification method. On the contrary, the value below which a small relative distance probably pertains to an impostor depends on both the biometry and the classifier, as we will discuss below. Given the ordered F-normalized distances F(d(p, g_i)), the second function ϕ_2(p), i.e. the Density Ratio, is computed using the ratio between the number of gallery subjects, distinct from the returned identity, lying at a distance lower than twice F(d(p, g_{i_1})), and the cardinality |G| of the gallery:

ϕ_2(p) = 1 − |N_b| / |G|,  where N_b = { g_{i_k} ∈ G | F(d(p, g_{i_k})) < 2 · F(d(p, g_{i_1})) }    (9)
Even in this formulation, the higher the ability of the algorithm to discriminate between a genuine subject and an impostor, the lower the probability of finding identities different from the correct one at a small distance from it. Notice that, in both cases, the values of ϕ_j(p) fall within the interval [0, 1]. It is worth noticing that ϕ_1(p) is quite easier to compute, but also more sensitive to outliers. These are represented by situations in which either the first two retrieved subjects are occasionally very similar, even if quite different from the rest of the gallery, or the last subject is occasionally very far (high distance) from the rest of the gallery. On the other hand, ϕ_2(p) is a little more expensive, but considers a significant local
neighbourhood of the retrieved subject. Therefore, neither of the preceding situations, in particular far outliers, affects the result. After defining ϕ_j(p), we need to identify a value ϕ̄_j fostering a correct separation between genuine subjects and impostors. Each value (ϕ̄_1 or ϕ̄_2) marks the point of maximum uncertainty and varies with the biometry and with the classifier, so that it must be estimated from time to time during the setting up of the single subsystem. The optimal ϕ̄_j is the value that minimizes the wrong estimates of function ϕ_j(p), which can occur in two kinds of situation. In one case, an impostor is erroneously recognized with ϕ_j(p) higher than ϕ̄_j, so that the false acceptance is erroneously supported by a high value of ϕ_j(p). Otherwise, a genuine subject is recognized with ϕ_j(p) lower than ϕ̄_j, so that the acceptance may possibly be questioned due to a low value of ϕ_j(p). We also define S(ϕ_j(p), ϕ̄_j) as the width of the subinterval from ϕ̄_j to the proper extreme of the overall interval [0, 1) of possible values, depending on the comparison between the current ϕ_j(p) and ϕ̄_j:

S(ϕ_j(p), ϕ̄_j) = 1 − ϕ̄_j  if ϕ_j(p) > ϕ̄_j;  ϕ̄_j  otherwise    (10)

The SRR index (SRR I or SRR II) can finally be defined as:

SRR_j = |ϕ_j(p) − ϕ̄_j| / S(ϕ_j(p), ϕ̄_j)    (11)

In detail, we first measure the absolute distance between ϕ_j(p) and the "critical" point ϕ̄_j. Such distance gets higher values for ϕ_j(p) much higher than ϕ̄_j (genuine subjects), or for ϕ_j(p) much lower than ϕ̄_j (impostors). However, it is also important to take into account how significant it is with respect to the subinterval over which it is measured. This allows comparing the reliability of different responses and finally estimating a threshold th_j above which response reliability can be considered satisfactory enough. As an example, assume ϕ̄_j = 0.1; a reject with ϕ_j(p) = 0.02 (absolute distance 0.08 and SRR_j = 0.8) is to be considered more reliable than a reject with ϕ_j(p) = 0.08 (absolute distance 0.02 and SRR_j = 0.2), and this can be reflected by the absolute distances alone. However, such a response must also be considered more reliable than an accept with ϕ_j(p) = 0.19 (absolute distance 0.09 and SRR_j = 0.1), which in fact spans a proportionally smaller distance from the critical point towards the other end of the overall interval. SRR I or SRR II can be used (a posteriori) to select only the responses whose reliability is above a suitable threshold. The more stringent the application requirements, the higher the thresholds. Their influence on the final system accuracy is discussed in the section about experimental results. Notice that different analyses, such as ROC analysis, can be used to assess the overall system ability to return a correct recognition or rejection. On the other hand, this is not the aim of the SRR measures, which rather aim at allowing the system to
only return responses that are sufficiently reliable. Except for the lower number of “usable” responses, a ROC analysis performed after exploiting SRR would result in better accuracy performances.
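A minimal sketch of the two reliability margins under stated assumptions: distances are already F-normalized into [0, 1) (e.g. via min-max), identities[k] gives the identity of the k-th gallery template, and phi_bar is the pre-estimated critical point of Eqs. (10)-(11).

```python
def srr(distances, identities, phi_bar, use_density_ratio=False):
    """SRR I (Relative Distance) or SRR II (Density Ratio), Eqs. (8)-(11).
    distances: F-normalized distances to all gallery templates, in [0, 1)."""
    order = sorted(range(len(distances)), key=lambda k: distances[k])
    d1 = distances[order[0]]                 # best-match distance
    first_id = identities[order[0]]
    if use_density_ratio:
        # phi_2 (Eq. 9): share of gallery templates of OTHER identities
        # lying closer than twice the best-match distance
        nb = sum(1 for k in order
                 if identities[k] != first_id and distances[k] < 2.0 * d1)
        phi = 1.0 - nb / len(distances)
    else:
        # phi_1 (Eq. 8): gap to the second distinct identity,
        # normalized by the largest distance in the ordered list
        d2 = next(distances[k] for k in order if identities[k] != first_id)
        phi = (d2 - d1) / distances[order[-1]]
    # Width of the subinterval on the side of phi (Eq. 10), then SRR (Eq. 11)
    span = (1.0 - phi_bar) if phi > phi_bar else phi_bar
    return abs(phi - phi_bar) / span
```

With phi_bar = 0.1, a rejected probe with ϕ_j(p) = 0.02 yields SRR_j = 0.8, matching the worked example above.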
5 Experiments

FACE was designed and implemented to be exploited in poorly controlled settings. An example of use is the automatic classification of photographs. A number of photo archives of celebrities are available online, above all for singers and actors. The huge classification work performed by fans or by archive managers suggests a strong interest in applications and techniques able to support it. A tool able to automatically classify the faces in a photograph can significantly reduce the amount of work needed to create such archives. But it can also supply further functionalities, such as the ability to tag one's own images, or those of one's friends, on the pages of social networks such as Facebook or Flickr. Actually, the classification of images that are familiar to a single user is a simplified version of the problem, due to the reduced number of classes to which the interesting photographs belong (relatives or friends). On the other hand, think of the support that such a tool could provide to a magazine editorial staff in retrieving photographs with a specific well-known face, or even with all and/or only a set of celebrities (gossip, demonstrations, history). In this case, not only the difficulties related to image distortions (pose, illumination) have to be addressed, but also those related to the huge number of classes to which the faces to classify may belong. FACE was adapted to be included in the framework developed by Becker and Ortiz [4], from which we also partially inherited test data, in order to have a clearer comparison with the other methods. Nevertheless, we also added further archives, which present characteristics that are especially interesting in our case and not always found in other standard datasets. In detail, tests were performed using two different datasets: the ORL Database and CelebrityDB. The former was created by AT&T Laboratories in Cambridge [28] and includes face images of 40 different subjects, with a 92 × 112 pixel resolution and 8-bit depth. The latter was created ad hoc by downloading images from [29] and includes 1,500 images of 50 celebrities, with a 64 × 100 pixel resolution and 256 grey levels. The main difference between the two datasets is in the amount and quality of distortions within images. The first dataset is mainly affected by pose variations, while expression and lighting variations are less significant. On the contrary, the second dataset is characterized by strong variations in pose, expression and illumination. Many photographs were taken without asking the subject to pose, indoors as well as outdoors, with a resulting high variability in image conditions. We processed the CelebrityDB
twice, the first time with the procedure implemented in [6] (the same used by Becker and Ortiz) and the second time with ours. We obtained two normalized databases, CDB-I and CDB-II, respectively. A different subset of CelebrityDB (normalized with our method), with only three images for each of the 50 celebrities, was used to obtain a further database named CDB-III. The AT&T dataset was only normalized with the technique in [6]. FACE was also tested on a number of samples from the FERET database [15]. This includes 14,051 images taken from 1,199 persons in 15 different sessions between 1993 and 1996. Experiments on this database aimed at assessing three critical aspects of the overall authentication process: (a) the precision of the STASM location of interest points; (b) the effects of a bad point location on system recognition performances; and (c) FACE's accuracy in terms of Cumulative Match Score (CMS) and Receiver Operating Characteristic (ROC) curve. We used three subsets of the whole database, which are classified in [15] as FAFB (frontal faces with expression variations), FAFC (frontal faces with lighting variations) and QR (quarter right—pose variations).

5.1 Testing system accuracy

Among the techniques described and implemented in [4], we considered Support Vector Machines (SVM), Incremental Support Vector Machines (ISVM), Principal Component Analysis (PCA), Linear Discriminant Analysis (LDA), Incremental Linear Discriminant Analysis (ILDA) and Independent Component Analysis (ICA). We compared them with FACE, which was further stressed by setting n = 1 in the identification protocol (Sect. 4), i.e. only the identity I_j in the first position was returned. Our first goal was to evaluate the benefits of the normalization procedure described in Sect. 3.1. To this aim, we used the AT&T, CDB-I and CDB-II databases. Results in terms of Accuracy (in practice, RR) are reported in the first three rows of Table 1. Accuracy actually refers to Recognition Rate, but we will continue using this term to be consistent with the exploited framework. All methods perform quite well with the AT&T database (accuracy always above 90%, with FACE reaching 98%).
Table 1 Accuracy of recognition methods across four testing datasets

DB        FACE (%)   SVM (%)   ISVM (%)   PCA (%)   LDA (%)   ILDA (%)   ICA (%)
AT&T      98         97        96         93        96        96         91
CDB-I     87         44        43         27        42        38         31
CDB-II    90         65        64         47        75        69         45
CDB-III   75         45        48         33        NA        50         33
This was expected, due to its characteristics. Table 1 shows that FACE always performs better than the other algorithms, even in more difficult settings, especially when the comparison involves CDB-I. Related to this, it is important to notice that the results with CDB-II are far better than those with CDB-I. This holds for all methods, demonstrating that the introduced normalization technique achieves a general relative accuracy increment of about 40%. FACE is slower than the other methods, due to a higher number of operations needed to compute correlations. As noticed above, this drawback can be limited by pre-computing most needed values and storing them as a feature vector. On the other hand, since tagging is usually performed offline, speed is not so critical. Moreover, the lack of a training phase, and of the related need to re-tag the whole dataset when new images are inserted, makes the technique even more attractive. The second experiment was performed using CDB-III. Its goal was to demonstrate that FACE can provide good performances even with a limited number of available images. While this agrees with the fact that there is no training phase, it must be considered that a higher number of photographs of the same subject increases the probability of retrieving it in the first positions of the answer set. The fourth row in Table 1 shows that FACE recognized 75% of the subjects with only three photographs each, compared with the 50% accuracy provided by ILDA, which was the best result among the other tested algorithms. Notice that LDA did not return any result with this database. This might be due to the number of available images per subject being too low to perform a suitable training phase. In this respect, we remind again that no image was used for training, because FACE does not need specific training on each database. Therefore, one image from each triple was used as a probe, while the remaining two were the gallery templates. This differs from the other methods, which need to exploit a (separate) subset of images to learn the reduced feature space, and generally do this from scratch for each new database.

5.2 Integrating system reliability measures

Table 1 shows that CDB-III is the most problematic of the databases involved in the experimentation, in terms of obtained performances. As a matter of fact, all systems provided an RR lower than 80%. For this reason, all the described reliability indices (both application-dependent and general-purpose) were tested on CDB-III, aiming at enhancing system performances. In this second evaluation, we only considered FACE, since it not only proved to be the most effective method with this dataset, but also maintained and improved this trend over the other datasets. The aim of this second group of experiments was to determine the contribution that the single reliability indices might provide to improving the accuracy of a
face recognition system under uncontrolled conditions. This contribution was separately evaluated for each index (SP, SI, SRR I and SRR II). According to what is implemented in the protocol defined by Becker and Ortiz in [4], CDB-III was divided into two subsets: Gallery, the set of images used for enrolling subjects into the system, and Probe, all remaining images used for system testing. Figure 4 shows, in the form of a surface, the variation of system performances measured in terms of Accuracy, when all images in the Probe set with distortion indices SP and SI below a fixed threshold are discarded beforehand as inadequate. In other terms, an image in the probe set is considered as adequate, and therefore submitted to the system to be identified, only if SP ≥ th_P and SI ≥ th_I. On the other hand, Fig. 5 shows the variation of the percentage of images that are marked as adequate, corresponding to the variation of the thresholds for the SP and SI indices. It is also possible to observe the influence that SP and SI singularly exert, by simply setting th_I = 0 and th_P = 0 in turn.

Fig. 4 Accuracy variations corresponding to variations of thresholds for SP and SI indices

Fig. 5 Variation in the number of adequate images corresponding to variations of thresholds for SP and SI indices
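A minimal sketch of this two-stage selection, assuming the helper functions of the previous sections (the names are ours): samples are filtered a priori on SP and SI, and responses are filtered a posteriori on SRR.

```python
def a_priori_ok(sp, si, th_p=0.0, th_i=0.0):
    """Submit a probe to the classifier only if both distortion indices pass."""
    return sp >= th_p and si >= th_i


def a_posteriori_ok(srr_value, th_srr=0.4):
    """Keep a response only if its reliability margin is high enough;
    0.4 is the compromise threshold reported in the experiments."""
    return srr_value >= th_srr
```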
Table 2 Precision (in pixels) of STASM location on FERET images with 256 × 384 pixel resolution

DB     Left eye   Right eye   Nose    Mouth
FAFB   3.14       3.16        7.71    3.61
FAFC   2.85       2.83        6.31    2.67
QR     4.45       4.46        13.7    12.3
Fig. 6 Accuracy variations and number of reliable responses corresponding to variations of SRR I threshold
Fig. 7 Accuracy variations and number of reliable responses corresponding to variations of SRR II threshold
As we can observe from Fig. 4, system performances stay fixed at 70% accuracy, with 100% of the responses marked as adequate, when neither of the two indices is considered (th_I = th_P = 0). On the contrary, with very high thresholds (th_I = 0.8 and th_P = 0.8), the number of responses marked as adequate reduces to 70%, while accuracy settles at 80%. Figures 6 and 7 show performance variations when using different thresholds for SRR I and SRR II, in terms of accuracy and responses marked as reliable. The best compromise is generally obtained with a threshold value of about 0.4, since about 70% of the responses are still marked as reliable, while system accuracy increases from 70 to 85%.

5.3 Testing on FERET database

The first experiment on FERET aimed at assessing the precision of the interest point location performed by STASM on a face. As expected, such precision decreases as image distortion increases (pose, illumination, expression and occlusion). Some images in FERET are coupled with a file containing various information about the image itself (file name, acquisition date, etc.); position information about some face elements (left eye centre, right eye centre, nose tip and mouth centre) is especially interesting and is expressed as pairs of (x, y) coordinates. Such information is always present for images in FAFB, while it seldom appears for images in FAFC and QR. In this experiment, we measured the distance (in pixels) between the position identified by STASM and the one in the file attached to the image. Table 2 reports the
results obtained on the single datasets. For better readability, we remind that the resolution of FERET images is 256 × 384 pixels. We can observe from Table 2 that the ASM implemented by STASM is much more precise in eye location than in nose and mouth location, above all when the pose is not frontal, as in the QR dataset. In particular, the nose tip is the point presenting the most significant error; this can be ascribed to the 2D nature of the image, which does not allow a correct distinction of a region characterized, on the other hand, by a strong depth variation. In other words, being a prominent structure of the face, the nose is subject to greater visual transformations even with a little pose variation, and this also hinders STASM processing. An imprecise location of the nose tip causes a kind of aberration in the normalized image, due to the importance of this point in the image reshaping process. An example of normalization in cases of correct/incorrect location is reported in Fig. 8.

Fig. 8 Example faces normalized by FACE, with correct (top) and incorrect (bottom) interest points location by STASM

In the second experiment, we measured FACE accuracy in terms of Cumulative Match Score (CMS) and Receiver Operating Characteristic (ROC) curve, when faces that were wrongly processed by STASM are left in the dataset (circles), and when they are left out of it (squares). The curves obtained for the three datasets FAFB, FAFC and QR are shown in Figs. 9, 10 and 11, respectively. We can observe that an
Fig. 9 CMC and ROC curves of FACE system on FAFB dataset
Fig. 10 CMC and ROC curves of FACE system on FAFC dataset
Fig. 11 CMC and ROC curves of FACE system on QR dataset
incorrect location affects identification more than verification. In other words, since we are in a "closed universe" test, an incorrectly normalized subject is not identified by FACE and, since it is surely in the gallery, this is always counted as an error. On the contrary, during verification, a distance above the acceptance threshold, when the two subjects' identities are really different, is considered as a correct reject regardless of the correctness of face normalization, which only influences false rejects. The problem is how to identify those faces for which the normalization process was poor. An important observation is that, while the use of SRR can represent a solution to this, or anyway contribute to relieving the effects of a bad normalization (diamonds—SRR-based selection of responses, with SRR II), the pose distortion index does not provide significant information, since it is computed just using the points identified by STASM. When using SRR, corrupted samples are left in the dataset, and responses are considered reliable or not according to the associated SRR value. The number of unreliable responses is about 20% in FAFB, 48% in FAFC and 44% in QR. The curve with plus markers reports the application of the SRR index when corrupted samples are left out of the dataset. As we can observe from Figs. 9, 10 and 11, it always represents the best performance; of course, the downside is the lower number of "useful" responses, since part of the samples was first removed by the corrupted-sample selection (about 15% in FAFB, 30% in FAFC and 37% in QR), and then the SRR index further filtered the responses obtained from the remaining dataset. When SRR is applied to only uncorrupted samples, the number of unreliable responses is about 8% in FAFB, 18% in FAFC and 25% in QR. In this last case, percentages are computed considering only correctly normalized samples.
6 Conclusions

In this work, we proposed FACE, a new framework for face analysis and classification that is able to significantly improve accuracy performances with respect to the state of the art, in situations where the acquisition conditions are not optimal. This is typical of applications such as photo tagging over social networks like Facebook, or the cataloguing of celebrities' images in a magazine editorial office. Furthermore, we introduced measures for pose and lighting distortion and showed that they can be used to discard images whose quality is not sufficient to guarantee an accurate response from the recognition system. System reliability was also measured after each response in terms of SRR I and SRR II, two reliability indices based on the analysis of single system responses in relation to the composition of the gallery. Experimental results show an improvement of system performances with all the presented
indices. Among the aspects to investigate in the near future, we certainly include the combination of such indices, aiming at further enhancing system accuracy. This combination is a complex problem in itself, since it requires an accurate statistical analysis of their possible interdependencies.
References

1. Abate, A.F., Nappi, M., Riccio, D., Sabatino, G.: 2D and 3D face recognition: a survey. Pattern Recognit. Lett. 28(14), 1885–1906 (2007)
2. Abate, A.F., De Marsico, M., Nappi, M., Riccio, D.: Data normalization and fusion in multibiometric systems. In: Proceedings of the Twelfth International Conference on Distributed Multimedia Systems (DMS 2007)
3. Ahonen, T., Hadid, A., Pietikainen, M.: Face recognition with local binary patterns. In: Sebe, N., Lew, M.S., Huang, T.S. (eds.) The Eighth European Conference on Computer Vision (ECCV 2004), pp. 469–481 (2004)
4. Becker, B., Ortiz, E.: Evaluation of face recognition techniques for application to Facebook. In: Proceedings of the IEEE International Conference on Automatic Face & Gesture Recognition, December 2009
5. Blanz, V., Vetter, T.: Face recognition based on fitting a 3D morphable model. IEEE Trans. Pattern Anal. Mach. Intell. 25(9), 1063–1074 (2003)
6. Bolme, D.S., Beveridge, J.R., Teixeira, M., Draper, B.A.: The CSU face identification evaluation system: its purpose, features, and structure. In: ICVS, Graz, Austria (2003)
7. Freund, Y., Schapire, R.: A short introduction to boosting. J. Jpn. Soc. Artif. Intell. 14(5), 771–780 (1999)
8. Gao, Y., Leung, M.K.H.: Face recognition using line edge map. IEEE Trans. Pattern Anal. Mach. Intell. 24(6), 764–779 (2002)
9. Georghiades, A.S., Belhumeur, P.N., Kriegman, D.J.: From few to many: illumination cone models for face recognition under variable lighting and pose. IEEE Trans. Pattern Anal. Mach. Intell. 23(6), 643–660 (2001)
10. Jain, A.K., Nandakumar, K., Ross, A.: Score normalization in multimodal biometric systems. Pattern Recognit. 38(12), 2270–2285 (2005)
11. Johnson, M.L.: Biometrics and the threat to civil liberties. Computer 37(4), 90–92 (2004)
12. Kryszczuk, K., Richiardi, J., Prodanov, P., Drygajlo, A.: Reliability-based decision fusion in multimodal biometric verification. EURASIP J. Adv. Signal Process. 2007(1), 74–83 (2007)
13. Li, Q., Ye, J., Kambhamettu, C.: Linear projection methods in face recognition under unconstrained illuminations: a comparative study. In: Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR 2004) (2004)
14. Milborrow, S., Nicolls, F.: Locating facial features with an extended active shape model. In: Proceedings of the European Conference on Computer Vision, pp. 504–513 (2008)
15. Phillips, P.J., Wechsler, H., Huang, J., Rauss, P.: The FERET database and evaluation procedure for face-recognition algorithms. Image Vis. Comput. 16(5), 295–306 (1998)
16. Poh, N., Bengio, S.: Improving fusion with margin-derived confidence in biometric authentication tasks. In: Proceedings of the Fifth International Conference on Audio- and Video-Based Biometric Person Authentication (AVBPA), pp. 474–483 (2005)
17. Rowley, H.A., Baluja, S., Kanade, T.: Neural network-based face detection. IEEE Trans. Pattern Anal. Mach. Intell. 20(1), 23–38 (1998)
18. Ruiz-del-Solar, J., Quinteros, J.: Illumination compensation and normalization in eigenspace-based face recognition: a comparative study of different pre-processing approaches. Pattern Recognit. Lett. 29(14), 1966–1979 (2008)
19. Vapnik, V.N.: Statistical Learning Theory. Springer, New York (1998)
20. Viola, P., Jones, M.: Rapid object detection using a boosted cascade of simple features. In: IEEE CVPR, pp. 511–518 (2001)
21. Wang, H., Li, S.Z., Wang, Y., Zhang, J.: Self quotient image for face recognition. In: Proceedings of the International Conference on Image Processing, pp. 1397–1400 (2004)
22. Wang, Y., Pan, G., Wu, Z.: Sphere-spin-image: a viewpoint-invariant surface representation for 3D face recognition. In: Proceedings of the International Conference on Computational Science (ICCS 2004), Lecture Notes in Computer Science, vol. 3037, pp. 427–434 (2004)
23. Xiujuan, C., Shiguang, S., Xilin, C., Wen, G.: Linear regression for pose-invariant face recognition. IEEE Trans. Image Process. 16(7), 1716–1725 (2007)
24. Zhang, X., Gao, Y., Leung, M.K.H.: Recognizing rotated faces from frontal and side views: an approach toward effective use of mugshot databases. IEEE Trans. Inf. Forensics Secur. 3(4), 1966–1979 (2008)
25. Polar Rose. http://www.polarrose.com/. Accessed 02/02/2010
26. Apple iPhoto. http://www.apple.com/it/ilife/iphoto/. Accessed 03/02/2010
27. Google Picasa. http://picasa.google.it. Accessed 03/02/2010
28. AT&T Laboratories Cambridge: The Database of Faces. http://www.cl.cam.ac.uk/research/dtg/attarchive/facedatabase.html. Accessed 03/02/2010
29. http://www.celebritywonder.com/html/gallery.shtml