IEEE TRANSACTIONS ON INFORMATION FORENSICS AND SECURITY, VOL. 2, NO. 4, DECEMBER 2007
Shape-Driven Gabor Jets for Face Description and Authentication

Daniel González-Jiménez and José Luis Alba-Castro
Abstract—This paper proposes, through the combination of concepts and tools from different fields within the computer vision community, an alternative path to the selection of key points in face images. The classical way of attempting to solve the face recognition problem using algorithms which encode local information is to localize a predefined set of points in the image, extract features from the regions surrounding those locations, and choose a measure of similarity (or distance) between corresponding features. Our approach, namely shape-driven Gabor jets, aims at selecting a subject-specific set of points and features for a given client. After applying a ridges and valleys detector to a face image, characteristic lines are extracted and a set of points is automatically sampled from these lines, where Gabor features (jets) are calculated. So each face is depicted by a set of points and their respective jets. Once two sets of points from face images have been extracted, a shape-matching algorithm is used to solve the correspondence problem (i.e., map each point from the first image to a point within the second image) so that the system is able to compare shape-matched jets. As a byproduct of the matching process, geometrical measures are computed and compiled into the final dissimilarity function. Experiments on the AR face database confirm good performance of the method against expression and, mainly, lighting changes. Moreover, results on the XM2VTS and BANCA databases show that our algorithm achieves better performance than implementations of the elastic bunch graph matching approach and other related techniques.

Index Terms—AR face database, BANCA database, elastic bunch graph matching (EBGM), face authentication/recognition, Gabor jets, ridges and valleys, shape contexts, XM2VTS database.
I. INTRODUCTION
In automatic face recognition, the selection of points for feature extraction is, together with the feature extraction itself, one of the most important steps in designing algorithms that encode local information. A well-known class of techniques used for face recognition is based on elastic graph matching (EGM), where a reference model grid is adjusted to an input image. This grid can be either rectangular, such as in the approaches by Duc et al. [2] and Kotropoulos et al. [4], or defined by universal face points, such as pupils, tip of the nose, corners of
Manuscript received October 5, 2006; revised August 12, 2007. This work was supported in part by the Spanish Ministry of Education under Project TEC2005-07212/TCM and in part by the European Sixth Framework Programme under the Network of Excellence BIOSECURE (IST-2002-507604). The associate editor coordinating the review of this manuscript and approving it for publication was Dr. Anil Jain. The authors are with the Signal Theory and Communications Department, ETSI Telecommunications, University of Vigo, Vigo 36310, Spain (e-mail: [email protected]; [email protected]). Color versions of one or more of the figures in this paper are available online at http://ieeexplore.ieee.org. Digital Object Identifier 10.1109/TIFS.2007.910238
the mouth, etc., as is done in the methods by Wiskott et al. [1] and by Smeraldi and Bigun [3]. Regardless of this choice, each grid node is attached to features that characterize the neighborhood of its location in the image, and the use of a particular feature has led to different approaches. For instance, [1]–[3] use Gabor-based features, while in [4], and in the work proposed by Zafeiriou et al. [23], a multiscale morphological analysis was employed. Although some results derived from the works of [2] and [4] will be presented, in this paper we will focus on the comparison with the elastic bunch graph matching (EBGM) algorithm [1]. This technique computes multiscale and multiorientation Gabor responses (jets) at a set of so-called fiducial points, located at specific face regions (eyes, tip of the nose, mouth; that is, "universal" landmarks). Finding every fiducial point relies on a matching process between the candidate jet and a bunch of jets extracted from the corresponding fiducial points of training faces. This matching problem is solved by maximizing a function that takes texture and mesh distortion into account. In this way, several variables can affect the accuracy of the final positions, such as differences in pose, illumination conditions, and insufficient representativeness of the stored bunch of jets. Once the fiducial points are adjusted, only textural information (Gabor jets) is used in the classifier.

The main novelty of this paper is somewhat conceptual, since our ultimate goal should be to exploit individual face structure so that the system focuses on subject-specific discriminative points/regions, rather than on universal landmarks like EBGM does. In this sense, the concrete processing that is applied to faces in order to achieve that goal is (although critical) just an issue of implementation. In practical terms, the main differences between EBGM and our current approach lie in the way we locate and match points and in the final dissimilarity function, which uses not only texture but also geometrical information.

Our method locates salient points in face images by means of a ridges and valleys detector [6]. Low-level descriptors, such as edges or ridges and valleys, have already been used for face recognition, motivated by cognitive psychological studies [9], [10] which indicated that human beings can recognize line drawings as quickly and almost as accurately as gray-level pictures. For instance, Takács [11] proposes using edge map coding and a modified pixel-wise Hausdorff distance to compare faces. Gao and Leung [12] introduce a compact face feature, the so-called line edge map (LEM), to code and recognize faces. In [13], Alba-Castro et al. propose a supervised discriminant Hausdorff distance to compare sketches obtained by means of a ridges and valleys detector. In this work, however, we will also use texture information to improve classification accuracy.
In fact, at each of the located points, Gabor jets will be calculated and stored for further comparison. One of the main advantages of localizing points by means of a ridges and valleys detector is that, as only some basic image operations are needed, the computational load is reduced with respect to the original EBGM algorithm and, at the same time, possible discriminative locations are found at an early stage of the recognition process. In this sense, we say that this method is inherently discriminative, in contrast to trainable parametric models. Some of the located points may belong to "universal" landmarks, but many others are person dependent. The correspondence between points of two faces only uses geometrical information and is based on shape contexts [15]. This way, a comparison between shape-driven jets is feasible. As a byproduct of the correspondence algorithm, we extract measures of local geometrical distortion, and the final dissimilarity function compiles geometrical and textural information. To the best of our knowledge, the combination of tools we apply (low-level face description + shape matching + feature extraction) is also novel in the field of face recognition.

The paper is organized as follows: Section II introduces the ridges and valleys detector. Grid adjustment and selection of points are described in Section III, while Section IV explains texture extraction through Gabor filters. Section V shows the algorithm used to match points between two faces, and Section VI presents the texture dissimilarity function. The sketch distortion term is introduced in Section VII. Section VIII proposes a linear combination to fuse shape and texture scores. In Section IX, we conduct experiments on the AR face database [19] to test the performance of the system against lighting and expression changes. Experimental results are given in Sections X and XI for the XM2VTS [17] and BANCA [16] databases, respectively. Finally, conclusions and future research lines are drawn in Section XII. A preliminary version of this paper appeared in [28].

II. RIDGES AND VALLEYS DETECTOR

First of all, shape information must be extracted from face images. Although other approaches, such as edges, could be used, face shape has been obtained through a ridges and valleys detector. Contrary to edges, where people agree on their mathematical characterization, the case of ridges and valleys is more complex, and several mathematical characterizations exist that try to formalize the intuitive notion of a ridge/valley. In this paper, we have used the ridges and valleys obtained by thresholding the so-called multilocal level-set extrinsic curvature (MLSEC) [6], [7]. The main reasons that support the choice of the MLSEC are its invariance to both rigid image motions and monotonic grayscale changes and, mainly, its high continuity and meaningful dynamic range [7]. Basically, the MLSEC operator works as follows:
• computing the normalized gradient vector field of the smoothed image;
• calculating the divergence of this vector field, which is bounded and gives an intuitive measure of valleyness (positive values running from 0 to 2) and ridgeness (negative values running from −2 to 0);
• thresholding the response, so that image pixels where the MLSEC response is smaller than a (negative) ridge threshold are considered ridges, and those pixels where the response is larger than a (positive) valley threshold are considered valleys.

Fig. 1. Applying the ridges and valleys detector to the same face image using two different smoothing filters. Left: Original image. Center-left: Valleys and ridges image. Center-right: Thresholded ridges image. Right: Thresholded valleys image.

Several parameters must be adjusted, such as the two thresholds, which were kept fixed throughout this work. Also, the smoothing filter applied to the faces can be modified, leading to a more or less detailed shape image. Fig. 1 shows the result of applying the ridges and valleys operator to the same face image using two different smoothing filters. One of the interesting properties of the MLSEC operator is that it behaves well in the presence of illumination changes [8], due to the fact that the response of the operator depends on the orientations of the normalized gradient fields rather than on their magnitudes. Besides its desirable illumination behavior, the relevance of valleys in face-shape description has been pointed out by several cognitive science works. Among others, Pearson et al. [14] hypothesize that this kind of filter could be used as an early step by the human visual system (HVS), because several similarities exist between valley responses and the way human beings analyze faces.
1) Valley positions provide the observer with 3-D information about the shape of the object being observed. Valleys of a 3-D surface with uniform albedo are placed at those points where the surface normal is perpendicular to the point-observer axis and, in a similar way, ridges are placed at those points where the surface normal is collinear with the point-observer axis.
2) The response of a valley detector depicts the face in a similar way to how a human could have drawn it, showing the position and extent of the main facial features, as can be seen in the rightmost column of Fig. 1.
3) The ability of the HVS to recognize faces decreases dramatically if negative images are used instead of positive ones. Valleys, as well as ridges, do not remain at the same position when the image is negated (valleys become ridges and vice versa), and it seems clear from Fig. 1 that it is more difficult for humans to infer identity from the ridges image than from the valleys one.
Although the last statement would seem to favor the use of valleys instead of ridges, results reported in [8] do not clearly support this hypothesis. Ridges, valleys, and edges were evaluated and compared (using both Euclidean and Hausdorff-based
distances) in a face recognition framework under illumination and expression changes. Both ridges and valleys clearly outperformed edges while, on average, ridges turned out to work slightly better than valleys. Following these results, we decided to focus on the use of ridges in our experiments, although some results with valleys will also be presented.
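For illustration, a minimal NumPy/SciPy sketch of the MLSEC-style computation outlined above is given next. The smoothing scale and the two thresholds below are placeholders (not the values used in our experiments), and `np.gradient` stands in for whatever discrete derivative scheme is preferred.

```python
import numpy as np
from scipy.ndimage import gaussian_filter

def mlsec_sketch(image, sigma=2.0, ridge_thr=-0.25, valley_thr=0.25):
    """Ridge/valley masks from the divergence of the normalized gradient field.
    sigma and both thresholds are illustrative placeholders."""
    smoothed = gaussian_filter(image.astype(float), sigma)
    gy, gx = np.gradient(smoothed)            # gradient of the smoothed image
    mag = np.hypot(gx, gy) + 1e-12            # avoid division by zero
    nx, ny = gx / mag, gy / mag               # normalized gradient vector field
    # Divergence of the normalized field: bounded, positive on valleys
    # (up to 2) and negative on ridges (down to -2).
    div = np.gradient(nx, axis=1) + np.gradient(ny, axis=0)
    return div < ridge_thr, div > valley_thr  # boolean (ridges, valleys) masks
```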
III. SHAPE SAMPLING

Fig. 2. Left: Original rectangular dense grid. Center: Sketch. Right: Grid adjusted to the sketch.
Once the ridges and valleys have been extracted from a new image, we must sample the obtained lines in order to keep a set of points that depicts the face. For generality and ease of notation, hereinafter we will refer to the binary image (ridges or valleys) obtained as a result of the previous step as the sketch (i.e., the methodology that will be introduced in this and the following sections is valid for both ridges and valleys; however, in agreement with the discussion at the end of Section II, and unless otherwise stated, the presented numerical results were obtained using ridges).

In order to select a set of points from the original sketch, a dense rectangular grid of $n_1 \times n_2$ nodes is applied onto the face image, and each grid node is moved toward its nearest point of the sketch. In order to avoid the case in which two or more grid nodes coincide on the same sketch point, a flag is set to 1 the first time a sketch point is used, so that the other grid nodes must find their corresponding sketch points among the remaining ones. Finally, we obtain a vector of points $\mathcal{P} = \{\mathbf{p}_1, \mathbf{p}_2, \ldots, \mathbf{p}_n\}$, where $\mathbf{p}_i = (x_i, y_i)$ and $n = n_1 \times n_2$. Typical sizes for $n$ are 100 or more nodes. These points sample the original sketch, as can be seen in Fig. 2. Obviously, uniform sampling could also be used, but the main reason to use the "deformable" rectangular grid relies on the fact that, this way, there is a naive mapping between points coming from the same node of their respective rectangular grids, which will be used as a baseline algorithm for point matching. This will become clearer in Section X, with some experimental results.
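A sketch of this sampling step, assuming a boolean sketch image with at least $n_1 \times n_2$ nonzero pixels, could read as follows; the greedy first-come assignment plays the role of the flag mechanism described above.

```python
import numpy as np

def sample_sketch(sketch, n1=10, n2=10):
    """Snap an n1 x n2 rectangular grid onto the nearest unused sketch points."""
    ys, xs = np.nonzero(sketch)                       # sketch pixel coordinates
    pts = np.column_stack([xs, ys]).astype(float)
    h, w = sketch.shape
    gx, gy = np.meshgrid(np.linspace(0, w - 1, n2), np.linspace(0, h - 1, n1))
    nodes = np.column_stack([gx.ravel(), gy.ravel()])
    used = np.zeros(len(pts), dtype=bool)             # flag: point already taken
    sampled = []
    for node in nodes:
        d = np.linalg.norm(pts - node, axis=1)
        d[used] = np.inf                              # each sketch point used once
        j = int(np.argmin(d))
        used[j] = True
        sampled.append(pts[j])
    return np.asarray(sampled)                        # n = n1 * n2 points
```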
IV. EXTRACTING TEXTURAL INFORMATION

A set of 40 Gabor filters $\{\psi_m\}_{m=1,\ldots,40}$, using the same configuration as in [1], is used to extract textural information. These filters are biologically motivated convolution kernels in the shape of plane waves restricted by a Gaussian envelope [32], as shown next:

$$\psi_m(\mathbf{x}) = \frac{\|\mathbf{k}_m\|^2}{\sigma^2}\,\exp\!\left(-\frac{\|\mathbf{k}_m\|^2\|\mathbf{x}\|^2}{2\sigma^2}\right)\left[\exp\!\left(i\,\mathbf{k}_m^{\top}\mathbf{x}\right) - \exp\!\left(-\frac{\sigma^2}{2}\right)\right] \qquad (1)$$

where $\mathbf{k}_m$ contains information about scale and orientation, and the same standard deviation $\sigma$ is used in both directions for the Gaussian envelope. The region surrounding a pixel $\mathbf{x}$ in the image is encoded by the convolution of the image patch with these filters, and the set of responses is called a jet. So a jet is a vector with 40 complex coefficients, and it provides information about a specific region of the image. At point $\mathbf{p}_i$, we obtain the following feature vector:

$$\mathcal{J}_{\mathbf{p}_i} = \left\{\mathcal{J}_{\mathbf{p}_i}^{(m)} = \left(I * \psi_m\right)(\mathbf{p}_i)\right\}_{m=1,\ldots,40} \qquad (2)$$

where $\mathcal{J}_{\mathbf{p}_i}^{(m)}$ stands for the $m$th coefficient of the feature vector extracted from $\mathbf{p}_i$. So, for a given face with a set of points $\mathcal{P} = \{\mathbf{p}_1, \ldots, \mathbf{p}_n\}$, we obtain $n$ Gabor jets $\{\mathcal{J}_{\mathbf{p}_1}, \ldots, \mathcal{J}_{\mathbf{p}_n}\}$.
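A compact sketch of the jet extraction in (1)–(2) is shown below. The filter constants (five scales, eight orientations, maximum frequency $\pi/2$, spacing factor $\sqrt{2}$, and $\sigma = 2\pi$) are the settings commonly used with [1] and are assumptions here, as is the fixed kernel support; points are assumed to lie far enough from the image border.

```python
import numpy as np

def gabor_kernel(k_vec, sigma=2 * np.pi, size=33):
    """Kernel of (1) for wave vector k_vec (constants assumed, see text)."""
    half = size // 2
    y, x = np.mgrid[-half:half + 1, -half:half + 1]
    k2, r2 = float(k_vec @ k_vec), x**2 + y**2
    envelope = (k2 / sigma**2) * np.exp(-k2 * r2 / (2 * sigma**2))
    wave = np.exp(1j * (k_vec[0] * x + k_vec[1] * y)) - np.exp(-sigma**2 / 2)
    return envelope * wave

# 5 scales x 8 orientations, following the configuration of [1].
KERNELS = [gabor_kernel((np.pi / 2) / np.sqrt(2)**s
                        * np.array([np.cos(t), np.sin(t)]))
           for s in range(5) for t in np.arange(8) * np.pi / 8]

def extract_jet(image, point, kernels=KERNELS):
    """Jet of (2): the 40 complex filter responses at a single point."""
    px, py = int(point[0]), int(point[1])
    half = kernels[0].shape[0] // 2
    patch = image[py - half:py + half + 1, px - half:px + half + 1].astype(float)
    return np.array([np.sum(patch * k) for k in kernels])
```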
V. SHAPE CONTEXTS

Fig. 3. Log-polar histogram located over a point of the face: shape context.

Suppose that shape information has been extracted from two incoming images, say $I_1$ and $I_2$. Let $S_1$ and $S_2$ be the sketches for these images, and let $\mathcal{P} = \{\mathbf{p}_1, \ldots, \mathbf{p}_n\}$ be the set of points for $S_1$, and $\mathcal{Q} = \{\mathbf{q}_1, \ldots, \mathbf{q}_n\}$ the set of points for $S_2$. Given that we do not have any label regarding which pair of points should be matched, if we want to compare feature vectors from both images, we need to obtain a function $\xi$ that maps each point from $\mathcal{P}$ to a point within $\mathcal{Q}$:

$$\xi: \mathcal{P} \longrightarrow \mathcal{Q}, \qquad \mathbf{p}_i \longmapsto \mathbf{q}_{\xi(i)} \qquad (3)$$

where $i = 1, \ldots, n$. Hence, the feature vector $\mathcal{J}_{\mathbf{p}_i}$ from $I_1$ will be compared to $\mathcal{J}_{\mathbf{q}_{\xi(i)}}$, extracted from $I_2$. In order to obtain the mapping $\xi$ between the sets of points, we have adopted the idea described in [15]. For each point in the constellation, we compute a 2-D histogram (called shape context) of the relative positions of the remaining points, so that a vector of distances and a vector of angles are calculated
for each point. Bins are uniform in log-polar space (i.e., the logarithm of distances is computed). Each (distance, angle) pair will increase the number of counts in the adequate bin of the histogram, as shown in Fig. 3. So, finally, each face shape is depicted through a set of 2-D histograms. Once the sets of histograms are computed for both faces, we must match each point in the first set with a point from the second set. A point $\mathbf{p}_i$ from $\mathcal{P}$ is matched to the point $\mathbf{q}_j$ from $\mathcal{Q}$ for which the term

$$C\!\left(\mathbf{p}_i, \mathbf{q}_j\right) = \frac{1}{2}\sum_{k}\frac{\left[h_{\mathbf{p}_i}(k) - h_{\mathbf{q}_j}(k)\right]^2}{h_{\mathbf{p}_i}(k) + h_{\mathbf{q}_j}(k)} \qquad (4)$$

is minimized, where $h_{\mathbf{p}_i}$ and $h_{\mathbf{q}_j}$ denote the corresponding shape context histograms and $k$ in (4) runs over the number of bins in the 2-D histogram. As explained in [15], not only distances between histograms could be considered, but also appearance-based differences. In this sense, we could introduce Gabor jet dissimilarities (Section VI) in the function to be minimized, but it would require more computation ($n^2$ jet comparisons). In order to decrease the burden, we could restrict the search to the neighborhood of each point, provided that faces are in a fixed position (i.e., without rotation). In future research, we plan to assess the behavior of the matching process when both geometrical and textural information are used, as well as the impact on performance (and computational time) caused by constraining the search to the region surrounding each point.
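The following sketch computes the log-polar histograms and the costs of (4). For the assignment itself, a one-to-one matching via the Hungarian method (as used in [15]) is shown; a simple per-point minimization of (4) would also fit the description above. The bin counts and radial limits are placeholders.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def shape_contexts(points, n_r=5, n_theta=12):
    """One log-polar histogram of relative positions per point (placeholder bins)."""
    n = len(points)
    diff = points[:, None, :] - points[None, :, :]
    dist = np.linalg.norm(diff, axis=2)
    ang = np.arctan2(diff[..., 1], diff[..., 0]) % (2 * np.pi)
    r_edges = np.logspace(np.log10(0.125), np.log10(2.0), n_r + 1) * dist[dist > 0].mean()
    H = np.zeros((n, n_r * n_theta))
    for i in range(n):
        others = np.arange(n) != i
        r_bin = np.searchsorted(r_edges, dist[i, others]) - 1     # log-spaced radii
        t_bin = (ang[i, others] / (2 * np.pi) * n_theta).astype(int) % n_theta
        ok = (r_bin >= 0) & (r_bin < n_r)
        np.add.at(H[i], r_bin[ok] * n_theta + t_bin[ok], 1)
    return H

def match_points(H1, H2):
    """Chi-square costs of (4) plus a one-to-one assignment (Hungarian method)."""
    num = (H1[:, None, :] - H2[None, :, :]) ** 2
    den = H1[:, None, :] + H2[None, :, :] + 1e-12
    C = 0.5 * (num / den).sum(axis=2)
    rows, cols = linear_sum_assignment(C)     # xi: point i of P -> point cols[i] of Q
    return cols, C                            # mapping and full cost matrix
```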
A. Invariance to Scaling, Translation, and Rotation

Invariance to translation is intrinsic to the shape context definition, since all measurements are taken with respect to points over the face lines. To achieve scale invariance, we measure how big the object (face) is. One way to do this is by adding the distances between all points in the constellation, that is:

$$D = \sum_{i=1}^{n}\sum_{j=1}^{n}\left\|\mathbf{p}_i - \mathbf{p}_j\right\| \qquad (5)$$

This distance $D$ gives an idea of the size of the face, so that it can be normalized to a standard scale just by resizing the input image by a factor of $D_{std}/D$, where $D_{std}$ is the distance for a standard face size. Also, if we are looking for a more accurate estimation of the size of the face, an iterative process can be applied until the ratio $D/D_{std}$ is sufficiently close to 1 for a given threshold. Furthermore, we can provide for rotation invariance, as explained next. The vectors of angles are calculated taking the x axis as reference. This is enough if we are sure that the faces are in an upright position. But to deal with rotations in plane (i.e., if we do not know the rotation angle of the heads), we must take a relative reference for the shape-matching algorithm to perform correctly. Consider, for the set of points $\mathcal{P}$, the centroid of the constellation

$$\bar{\mathbf{p}} = \frac{1}{n}\sum_{i=1}^{n}\mathbf{p}_i \qquad (6)$$

For each point $\mathbf{p}_i$, we will use the vector $\bar{\mathbf{p}} - \mathbf{p}_i$ as the x axis, so that rotation invariance is achieved. In [15], the tangent vector at each contour point was treated as the positive x axis to achieve rotation invariance; as long as we do not work on contours, it is not easy to define this tangent vector. Also, the angle between the two images can be computed as

$$\Delta\phi = \frac{1}{n}\sum_{i=1}^{n}\left[\angle\!\left(\bar{\mathbf{q}} - \mathbf{q}_{\xi(i)}\right) - \angle\!\left(\bar{\mathbf{p}} - \mathbf{p}_i\right)\right] \qquad (7)$$

so that the system is able to put both images in a common position for further comparison. If we do not take this angle into account, textural extraction will not be useful for our purpose.

VI. TEXTURE DISSIMILARITY

Let $\{\mathcal{J}_{\mathbf{p}_i}\}_{i=1,\ldots,n}$ be the set of jets calculated for $I_1$ and $\{\mathcal{J}_{\mathbf{q}_{\xi(i)}}\}_{i=1,\ldots,n}$ the set of jets computed for $I_2$. The similarity function between these two faces is given by

$$\mathcal{S}_{texture}(I_1, I_2) = \sum_{i=1}^{n}\left\langle \mathcal{J}_{\mathbf{p}_i}, \mathcal{J}_{\mathbf{q}_{\xi(i)}} \right\rangle \qquad (8)$$

where $\langle \cdot, \cdot \rangle$ represents the normalized dot product between correspondent jets, but considers that only the moduli of the jet coefficients are used. Texture dissimilarity is simply calculated as $\mathcal{D}_{texture} = -\mathcal{S}_{texture}$. Using (8), it follows that

$$\mathcal{D}_{texture}(I_1, I_2) = \sum_{i=1}^{n} -\left\langle \mathcal{J}_{\mathbf{p}_i}, \mathcal{J}_{\mathbf{q}_{\xi(i)}} \right\rangle \qquad (9)$$
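The similarity bracket and the dissimilarity of (8)–(9) translate directly into code; `xi` below is the mapping obtained from the shape-matching step.

```python
import numpy as np

def jet_similarity(jet_a, jet_b):
    """Normalized dot product of the jet moduli, the bracket term in (8)."""
    ma, mb = np.abs(jet_a), np.abs(jet_b)
    return float(ma @ mb / (np.linalg.norm(ma) * np.linalg.norm(mb) + 1e-12))

def texture_dissimilarity(jets1, jets2, xi):
    """D_texture of (9): negated similarities accumulated over matched pairs."""
    return -sum(jet_similarity(jets1[i], jets2[xi[i]]) for i in range(len(jets1)))
```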
VII. MEASURING DISSIMILARITY BETWEEN SETS OF POINTS

We have defined two different terms to measure the geometrical dissimilarity between two sets of points depicting their respective sketches:

$$\mathcal{D}_{SC}(I_1, I_2) = \sum_{i=1}^{n} C\!\left(\mathbf{p}_i, \mathbf{q}_{\xi(i)}\right) \qquad (10)$$

$$\mathcal{D}_{E}(I_1, I_2) = \sum_{i=1}^{n}\left\|\mathbf{p}_i - \mathbf{q}_{\xi(i)}\right\| \qquad (11)$$

Equation (10) computes dissimilarity by adding the individual costs between matched points represented in (4). On the other hand, (11) calculates dissimilarity by summing the norm of the difference vector between matched points. The linear combination of these two distance measures

$$SKD(I_1, I_2) = \gamma\,\mathcal{D}_{SC}(I_1, I_2) + \delta\,\mathcal{D}_{E}(I_1, I_2) \qquad (12)$$
is what we call sketch distortion (SKD). Figs. 4 and 5 give a visual understanding of this concept. Fig. 4 shows two instances of face images from subject A, while faces in Fig. 5 belong
Fig. 4. Top left: First image from subject A. Center: Sketch. Right: Grid adjusted to the sketch. Bottom left: Second image from subject A. Center: Sketch. Right: Grid adjusted to the sketch.
Fig. 6. TER (evaluation and test sets) against the shape weight in the combined dissimilarity (13).
Fig. 5. Top left: First image from subject B. Center: Sketch. Right: Grid adjusted to the sketch. Bottom left: Second image from subject B. Center: Sketch. Right: Grid adjusted to the sketch.
TABLE I SKETCH DISTORTION (SKD) BETWEEN THE FACE IMAGES FROM FIGS. 4 AND 5
to subject B. The visual geometric difference between the two people is reflected in the sketch distortion term, whose values are shown in Table I.

VIII. COMBINING SHAPE AND TEXTURE

Although shape information is somehow encoded in the jets (they have been calculated at shape-sampled points, and they have found their correspondent jets for comparison by shape matching), we thought of linearly combining both shape and texture scores, leading to the final dissimilarity measure

$$\mathcal{D}(I_1, I_2) = \beta\,\mathcal{D}_{texture}(I_1, I_2) + SKD(I_1, I_2) \qquad (13)$$

From (9)–(13), it immediately follows that $\mathcal{D}(I_1, I_2)$ is equal to

$$\mathcal{D}(I_1, I_2) = \sum_{i=1}^{n}\left[-\beta\left\langle \mathcal{J}_{\mathbf{p}_i}, \mathcal{J}_{\mathbf{q}_{\xi(i)}} \right\rangle + \gamma\,C\!\left(\mathbf{p}_i, \mathbf{q}_{\xi(i)}\right) + \delta\left\|\mathbf{p}_i - \mathbf{q}_{\xi(i)}\right\|\right] \qquad (14)$$
Equation (14) shows that each contribution of jet dissimilarity is modified by a geometrical distortion term [the so-called local sketch distortion (LSKD)]. A high value of the LSKD for the pair $(\mathbf{p}_i, \mathbf{q}_{\xi(i)})$ means that local differences exist between matched points, so that jet dissimilarity will also be high. This fact is more likely to occur when the incoming faces do not represent the same person. Even if the LSKD is low but the faces do not belong to the same person, textural information will increase the dissimilarity between them. On the other hand, when the faces belong to the same subject, low LSKD values should generally be achieved, so that matched points are located over the same face region, resulting in low jet dissimilarity. Thus, the measurement in (14) reinforces discrimination between subjects.

As a preliminary result, Fig. 6 shows the performance of the system over a subset of the XM2VTS database [17] as a function of the shape weight. In this figure, the total error rate (TER)² on the evaluation and test sets is plotted against this weight, where a zero value corresponds to the case in which only textural information is taken into account. As we can see, there is a range of values for which the TER is below the one obtained by only using texture information, and the value which minimizes the TER in the evaluation set also minimizes the TER in the test set.

²The TER is defined as the sum of the false acceptance rate (FAR) and the false rejection rate (FRR), which are common measures to assess the performance of biometric systems.
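Putting (14) into code is straightforward once the matching has been run; `jet_similarity` is the helper sketched after (9), `C` is the cost matrix of (4), and the weight values below are placeholders for those selected on the evaluation set.

```python
import numpy as np

def combined_dissimilarity(jets1, jets2, pts1, pts2, xi, C,
                           beta=1.0, gamma=0.5, delta=0.5):
    """D of (14): jet dissimilarity plus local sketch distortion, pair by pair.
    The weights are placeholders, not the tuned values."""
    total = 0.0
    for i, j in enumerate(xi):
        lskd = gamma * C[i, j] + delta * np.linalg.norm(pts1[i] - pts2[j])
        total += -beta * jet_similarity(jets1[i], jets2[j]) + lskd
    return total
```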
Fig. 7. Face images from the AR face database. The top row shows images from the first session: a) Neutral, b) Smile, c) Anger, d) Scream, e) Left light on, f) Right light on, and g) Both lights on, while the bottom row presents the shots recorded during the second session h)–n).
IX. TESTING THE SYSTEM AGAINST LIGHTING AND EXPRESSION VARIATIONS

A. Database

In order to test the behavior of the system in the presence of illumination and expression changes, we used the AR face database [19]. Each subject in the database participated in two recording sessions separated by two weeks. For our experiments, we considered the images from 106 subjects (half men and half women) showing different facial expressions and illumination changes. Fig. 7 presents the shots taken for one subject of the database during the first and second sessions (top and bottom rows, respectively). On the top row, from left to right, the first four images present facial expression changes: a) Neutral, b) Smile, c) Anger, and d) Scream, while the last three shots were taken under different lighting conditions with neutral expression: e) Left light on, f) Right light on, and g) Both lights on. Analogously, the bottom row presents the same configuration for the images recorded during the second session, h)–n).

Following [20], we distinguish between gallery and probe images. The gallery contains images of known individuals, which are used to build templates, while the probe set contains images of subjects with unknown identity that must be compared against the gallery. A closed-universe model is used to assess system performance, meaning that every subject in the probe set is also present in the gallery.

B. Facial Expression Changes

In this experiment, we assess the recognition accuracy of the system when only a neutral face is available as gallery and the probe images show expression variations. Fig. 8 presents the cumulative match score for rank $r$ (the percentage of successful identifications of a subject within the $r$ best matches) when the neutral shot a) is used as gallery and shots b), c), and d) are presented to the system as probe. Clearly, images with smiling and angry expressions are correctly recognized, but the algorithm fails to identify screaming faces. The same behavior is observed with the images from the second session, i.e., with h) as gallery and shots i), j), and k) as probe. Averaging the results from both tests, the recognition rates at rank 1 are 92%, 99%, and 37% for smiling, angry, and screaming faces, respectively. Clearly, angry
Fig. 8. System performance with expression variations. Gallery: shot a) (neutral face from first session). Probe: shots b), c), and d). Clearly, the system only fails to recognize screaming faces.
faces are easier than smiling ones, due to the fact that the appearance variation when changing from neutral to anger is smaller than that when changing from neutral to smile (see the second row of Fig. 7). Regarding screaming faces, it is clear (bottom row of Fig. 9) that the appearance variation is very large and, hence, recognition is a difficult task when only neutral faces are used as gallery (Gabor jets will differ significantly even if they are extracted at exactly corresponding positions). No significant differences are obtained when valleys are used instead of ridges (average recognition rates at rank 1 of 93%, 99%, and 35%). Ridges and valleys are image-based descriptors that sketch the face shape and, accordingly, the obtained representations depend on the current emotion shown in the image. We would like to highlight that, although the position and shape of the lines obviously vary with expression, these lines keep representing the main facial features in a consistent manner (compare the two rows of Fig. 9).

In [12], results were reported under the same conditions using LEM. That approach achieves 78.57%, 92.86%, and 31.25% with smiling, angry, and screaming expressions, respectively. Clearly, our method outperforms LEM in the three cases, although the degradation suffered with screaming faces is similar for both approaches.

In order to give less weight to those regions that are more affected by expression changes, Martínez [30] proposed to
Fig. 9. Top row: ridges and valleys for the neutral expression. Bottom row: ridges and valleys for the screaming expression. Although the position and shape of the sketch lines obviously vary with expression, these lines keep representing the main facial features in a consistent manner.
Fig. 10. Top row: ridges and valleys for the neutral expression with diffuse light. Bottom row: ridges and valleys for the neutral expression when both lights are switched on. Although the obtained sketch is not completely invariant to lighting changes (for instance, some valleys from the nose region, shown in purple in the top row, disappear in the presence of strong lighting; valleys associated with "wrinkles", shown in blue in the bottom row, appear; and some ridges, shown in red in both rows, change), the reported results (see text) demonstrate that the system achieves robust behavior under the tested conditions.
use optical flow between the two images to be compared. The best reported results were approximately 96% for smiling, 84% for angry, and 70% for screaming faces. The optical flow-based technique outperforms ours significantly when screaming faces are tested. However, our approach is comparable to [30] when testing smiling expression, and clearly provides better performance with angry faces. We would like to highlight that obtaining an expression-invariant face recognition system was not the point of this research. However, it has been demonstrated that our method behaves reasonably well (to a certain extent) in the presence of expression changes. As a future research line, we could apply a similar idea to that of Martínez (weigh the different jet contributions according to the deformation provoked by expression variations) in order to improve performance with screaming faces.
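For reference, the rank-$r$ cumulative match scores reported in Figs. 8 and 11 can be computed as follows under the closed-universe assumption (one gallery template per subject; `dissim` is the probe-by-gallery dissimilarity matrix):

```python
import numpy as np

def cumulative_match_scores(dissim, probe_ids, gallery_ids, max_rank=10):
    """Fraction of probes whose true identity is among the r best gallery matches."""
    gallery_ids = np.asarray(gallery_ids)
    ranks = []
    for i, pid in enumerate(probe_ids):
        order = np.argsort(dissim[i])                  # best match first
        ranks.append(int(np.where(gallery_ids[order] == pid)[0][0]) + 1)
    ranks = np.asarray(ranks)
    return [(ranks <= r).mean() for r in range(1, max_rank + 1)]
```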
C. Illumination Variation

In this experiment, we assess the performance of the system under different illumination conditions. The neutral face with diffuse light is used as gallery, while the probe images are shots taken under lighting changes. Fig. 11 presents the cumulative match score for rank $r$ when shot a) is used as gallery and shots e), f), and g) are presented to the system as probe. Similar behavior is observed with images from the second session, that is, shot h) as gallery and shots l) (100% recognition rate), m) (100% recognition rate), and n) (93% recognition rate) as probe. The results using valleys are even slightly better (100%, 100%, and 96% recognition rates). Clearly, the performance only drops slightly when both lights are switched on. This evidence shows that the system can be affected by extreme lighting conditions, such as overillumination, since these could provoke apparent changes in the face shape (see Fig. 10).

In [12], results were also reported under the same lighting conditions using LEM. That approach achieves 92.86%, 91.07%, and 74.11% with the left light on, right light on, and both lights on, respectively. In all cases, SDGJ (with both ridges and valleys) performs better than LEM.
Fig. 11. System performance under lighting variations. Gallery: shot a). Probe: shots e), f), and g).
X. FACE AUTHENTICATION ON THE XM2VTS DATABASE

We tested our method on the XM2VTS database using both configurations I and II of the Lausanne protocol [18]. The XM2VTS database contains synchronized image and speech data recorded from 295 subjects (200 clients, 25 evaluation impostors, and 70 test impostors) during four sessions taken at one-month intervals. The database was divided into three sets: a training set, an evaluation set, and a test set. The training set was used to build client models, while the evaluation set was used to estimate thresholds. Finally, the test set was employed to assess system performance. Configurations I and II of the Lausanne protocol differ in the distribution of client training and client evaluation data, with configuration II representing the most realistic case. In configuration I, there are three training images per client and, in configuration II, four training images per client are available. Hence, for a given test image, we have three and four scores, respectively, which were fused using the median rule in order to obtain better results.
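The fusion step is a plain median rule over the per-template dissimilarities; a minimal sketch (threshold estimated on the evaluation set, as the protocol prescribes):

```python
import numpy as np

def accept_claim(template_dissims, threshold):
    """Median-rule fusion of the 3 (config. I) or 4 (config. II) per-template
    dissimilarities, followed by thresholding."""
    return float(np.median(template_dissims)) <= threshold
```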
Fig. 12. Set of points used for jet extraction in the EBGM approach. Blue triangles represent manually annotated vertices, while red dots represent the middle points of the edges connecting manual vertices.

A. Comparison With EBGM

As discussed in the introduction, the EBGM algorithm looks for a set of predefined points, such as the pupils, the corners of the mouth, etc., from where jets are extracted. In order to compare our approach with an implementation of the EBGM, we decided to use a set of manually annotated landmarks³ so that we can assess the performance of an "ideal" EBGM (without the effect of fiducial point search errors). Strictly speaking, since no graph-matching stage is used, we cannot talk of EBGM, but of a manually annotated face-like mesh. However, a perfect graph-matching step would output those manual positions and, hence, we refer to this approach as an "ideal" EBGM. Although only 68 points were marked in each face, after Delaunay tessellation, the middle points of some of the edges connecting the original vertices were also included in the final set, as shown in Fig. 12.

In EBGM, the correspondences between points are known, so there is no need to match vertices from the faces to be compared. However, in order to show that the shape context algorithm works properly for our purposes, a comparison between shape-matched jets (extracted at manual positions) was also included in the tests, namely EBGM with shape contexts (EBGM-SC). The first two rows of Table II show the results obtained with EBGM and EBGM-SC. As we can see, both approaches perform almost the same over configurations I and II, which confirms that shape context matching is an adequate choice. In [1], no grid distortion was taken into account and, in order to perform a fair comparison, we tested our approach without the sketch distortion term, that is, with $\gamma = \delta = 0$ (third row of Table II). Although there are no statistically significant differences between SDGJ-SC and EBGM, it is clear that our approach achieves identical (even better) performance without the need for manual localization of "fiducial points."

As explained in Section III, the original rectangular grid is deformed toward the sketch and, thus, node $i$ is displaced to position $\mathbf{p}_i$; the same occurs in the second image, in which node $i$ moves toward $\mathbf{q}_i$. We decided to use a shape-matching algorithm to map points but, in fact, a naive mapping between $\mathbf{p}_i$ and $\mathbf{q}_i$ based on spatial coherence exists, as long as both of them come from node $i$ of their respective rectangular grids. This inherent mapping was used in one experiment (SDGJ-NM, SDGJ with naive matching), whose results are presented in the last row of Table II. As shown, the performance with the naive mapping is much worse, reflecting again the importance of the shape-matching algorithm. Finally, we ran experiments using valleys instead of ridges. In both configurations, the performance with valleys was worse than that of ridges: total error rates of 8.26% and 5.14% were obtained in configurations I and II, respectively.

³Available at: http://www-prima.inrialpes.fr/FGnet/data/07-XM2VTS/xm2vts-markup.html.

B. Measuring $\mathcal{D}_{SC}$ and $\mathcal{D}_{E}$ Performance
Up to now, shape distortion has not been taken into account in the experiments. First of all, we consider the different components of the sketch distortion term on their own (i.e., as individual classifiers). As explained before, shape contexts have been used to match points from the manually annotated set in the EBGM-SC algorithm, showing adequate behavior according to the achieved results. As a byproduct of this matching, the $\mathcal{D}_{SC}$ and $\mathcal{D}_{E}$ measures are also obtained for the set of predefined fiducial points. These shape distortions are likewise computed for the SDGJ-SC approach. From Table III, which presents the classification performance for each measure, we can highlight the following.
1) None of them is a strong classifier, as the error rates are high.
2) One of the two measures consistently outperforms the other in the two configurations, for both the EBGM-SC and SDGJ-SC algorithms.
3) According to these measures, the distribution of points obtained from the SDGJ approach is more discriminative than the set of manually localized fiducial points.
Although the grid distortion defined in [1] is slightly different from the $\mathcal{D}_{E}$ measure introduced here, statement 3) is in agreement with the fact that [1] does not consider these grid distortions when comparing two faces. In fact, as a final experiment, the classification performance of the grid distortion given by
$$GD(I_1, I_2) = \sum_{e=1}^{E}\left(\Delta\mathbf{x}_e^{I_1} - \Delta\mathbf{x}_e^{I_2}\right)^2 \qquad (15)$$

that is, the grid distortion (GD) as defined in [1], was measured. $E$ stands for the number of edges connecting manual vertices (see Fig. 12), $\Delta\mathbf{x}_e^{I_1}$ is the $e$th edge vector in face $I_1$, and the same holds for $I_2$. The TER obtained was more than 70%, confirming once again that the distribution of "universal" fiducial points is not discriminative at all (at least, according to the measures we have tested).

C. Shape and Texture Combination Results

As discussed previously, a linear combination of shape and texture scores was used in the final dissimilarity function. In order to select adequate values for the weights, we fixed $\beta$ and performed a grid search on $(\gamma, \delta)$, preserving the values that minimize the TER on the evaluation set. These optimal values were used in the test phase, achieving a TER of 5.99% and 4.06% in configurations I and II, respectively.

D. Results From Other Researchers

Three public face verification contests have been organized on the XM2VTS database, in years 2000 [21], 2003 [22], and 2006 [24].
TABLE II FACE AUTHENTICATION ON THE XM2VTS DATABASE. FALSE ACCEPTANCE RATE (FAR), FALSE REJECTION RATE (FRR), AND TOTAL ERROR RATE (TER) OVER THE TEST SET FOR OUR SHAPE-DRIVEN APPROACH (WITHOUT SKETCH DISTORTION) AND THE EBGM ALGORITHM
TABLE III FACE AUTHENTICATION ON THE XM2VTS DATABASE. FALSE ACCEPTANCE RATE (FAR), FALSE REJECTION RATE (FRR), AND TOTAL ERROR RATE (TER) OVER THE TEST SET FOR $\mathcal{D}_{SC}$ AND $\mathcal{D}_{E}$ COMPUTED FROM EBGM-SC AND SDGJ-SC
TABLE IV RESULTS FROM OTHER RESEARCHERS ON THE XM2VTS DATABASE
The first three rows of Table IV show results achieved with approaches that share some algorithmic similarity with ours and with the methods discussed in the introduction.
• The Aristotle University of Thessaloniki, AUT (2000), tested the morphological-based EGM algorithm [4], competing in year 2000.
• The IDIAP institute, IDIAP (2000), implemented the system described in [2] (rectangular grid attributed with Gabor features), also taking part in the contest held in year 2000.
• The Tubitak Bilten University, TB (2003), entered the competition of year 2003, testing an implementation of EBGM.
In [23], discriminant information is exploited in a modified morphological EGM, achieving clear improvements over the raw data in configuration I. Several steps of discriminant analysis were tested (only with configuration I):
1) node weighting;
2) similarity measure (textural and geometrical information) weighting;
3) weighting of the morphological feature coefficients;
4) all discriminative steps combined.
Table IV also shows some of the best results on the XM2VTS: an LDA-based approach [UniS-NC (2003)] and a complex ensemble of learning classifiers based on the manipulation of Gabor features [CAS (2006)]. The former took part in the competition held in 2003 [22], while the latter participated in the recent contest of 2006 [24].
E. Accuracy-Based Jet Selection

It is clear that, if no discriminative steps are applied to the approach of [23], our method outperforms it significantly. However, the inclusion of all discriminative stages in [23] leads to much better performance than SDGJ.

Although shape-driven points have been proven to be more discriminative than universal landmarks, and quite robust to illumination and expression changes, it must be remembered that the location of these positions relies on an image-based operator (ridges and valleys), which could be affected by inexact face localization or image noise. In such cases, some of the selected positions are likely to fall outside the face region (neck, hair, etc.), while others could lie on "noisy" ridges and/or valleys. Moreover, it is well known that not all face regions have equal discriminatory power [2], [5], [23], and we should take these facts into account to improve our results. In order to discard noisy/nondiscriminative locations, keeping only the best positions, in [29] we introduced a simple client-specific technique for the selection of such locations: by measuring the accuracy of each Gabor jet (considered to be an individual classifier), we only preserve those with a TER below a threshold in the evaluation phase (see Fig. 13 for an example). Hence, it is a hard weighting function (selected or not) based on the individual classification accuracy of each jet. In this case, the similarity between a test image (with jets $\mathcal{J}_{\mathbf{q}_{\xi(i)}}$) claiming identity $\mathcal{C}$ (whose jets are $\mathcal{J}_{\mathbf{p}_i}$) is given by

$$\mathcal{S}(I_{test}, \mathcal{C}) = \frac{1}{n_{\mathcal{C}}}\sum_{i=1}^{n} w_i \left\langle \mathcal{J}_{\mathbf{p}_i}, \mathcal{J}_{\mathbf{q}_{\xi(i)}} \right\rangle \qquad (16)$$

where $n_{\mathcal{C}} = \sum_i w_i$ represents the number of selected locations for client $\mathcal{C}$. The weight $w_i$ is equal to 1 if the corresponding jet from client $\mathcal{C}$ was selected, and 0 otherwise. A TER of 2.52% was achieved in configuration I of the XM2VTS, thus outperforming (although nonsignificantly) the results of [23].

However, if we observe the two last rows of Table IV, it is clear that our method does not currently provide the best performance on the database. As highlighted in the abstract and introduction, the main novelty of this paper is conceptual in the
TABLE V RESULTS REPORTED ON THE BANCA DATABASE FROM OTHER RESEARCHERS
Fig. 13. Left: Original set of shape-driven points for client 003 of the XM2VTS database. Right: Set of preserved shape-driven points after accuracy-based selection (Section X-E).
Fig. 14. Examples of images from the controlled, degraded, and adverse conditions of the BANCA database.
sense that it proposes a different way (exploiting individual face structure) to look for discriminative points in face images, and it represents, to the best of our knowledge, a first attempt in this direction. We are confident that there is still room for improvement, as demonstrated by the fact that using simple discriminant analysis clearly reduced the difference with the CAS algorithm (2.52% versus 0.96%), and further research is needed in order to decrease error rates (e.g., by choosing better metrics to compare features, through the selection of the most discriminative jet coefficients, etc.).
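A sketch of the selection-based similarity of (16) follows; the 0/1 mask `weights` is assumed to have been learned per client on the evaluation set, and `jet_similarity` is the helper sketched after (9).

```python
import numpy as np

def selected_similarity(client_jets, test_jets, xi, weights):
    """Similarity of (16): average jet similarity over the selected locations."""
    w = np.asarray(weights, dtype=float)          # 1 = jet selected, 0 = discarded
    sims = np.array([jet_similarity(client_jets[i], test_jets[xi[i]])
                     for i in range(len(client_jets))])
    return float((w * sims).sum() / w.sum())
```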
XI. FACE AUTHENTICATION ON THE BANCA DATABASE

We have also used the English part of the BANCA database (comprising 52 subjects, half men and half women) on protocols matched controlled (MC) and pooled (P) to test our method. The subjects in this database were captured in three different scenarios (controlled, degraded, and adverse) over 12 different sessions spanning three months. Examples of images from these three conditions are shown in Fig. 14. Since the images were extracted from video sequences in which the subjects were asked to utter a few sentences, expression changes (especially mouth motion) can appear. Moreover, in the degraded and adverse scenarios, there are no constraints on lighting, distance to the camera, etc., and the resolution of the degraded images is clearly worse than that of the two remaining scenarios.

In order to propose an experimental protocol, it is necessary to define a development set, in which the system can be
TABLE VI OUR RESULTS ON THE BANCA DATABASE ON CONFIGURATIONS MC AND P WITH $\beta = 1$ AND $\gamma = \delta = 0$
adjusted by setting thresholds, etc., and an evaluation set,⁴ where system performance can be assessed. For this reason, two disjoint subsets were created, each one with 26 people (13 men and 13 women). So, when one subset is used as the development set, the other is used for evaluation, and vice versa. In the experiments carried out, three specific operating conditions, corresponding to three different values of the cost ratio between false acceptances and false rejections, namely $R = 0.1$, $R = 1$, and $R = 10$, have been considered. The so-called weighted error rate (WER), given by

$$WER(R) = \frac{FRR + R \cdot FAR}{1 + R} \qquad (17)$$

was calculated for the test data of both groups at the three different values of $R$. Both protocols (MC and P) use the same data to build client models (controlled images from session 1), but differ significantly in the set of images used for testing: MC only uses controlled images, whilst P also tests adverse and degraded faces. These facts make protocol P more challenging than protocol MC. The experiments have been carried out employing the preregistered images of 55 × 51 pixels used for the competition contests at ICBA 2004 [25] and ICPR 2004 [26]. Results from the five methods which entered these two competitions are given in Table V (protocol MC was used in ICBA 2004 and configuration P was employed for the contest in ICPR 2004).

Table VI shows the performance of the algorithm when only texture is taken into account ($\beta = 1$, $\gamma = \delta = 0$). Taking the SKD term into account yielded a small improvement in average WER: 4.42% and 10.43% for protocols MC and P, respectively. From the comparison between these results and Table V, we can highlight that:
• our system does not provide the best authentication rates over this database, but
• it shows the smallest degradation in performance when changing from protocol MC to protocol P, as the average WER is only 2.36 times worse.

⁴Note that the concept of the evaluation set is different in the BANCA and XM2VTS protocols: BANCA's evaluation set is analogous to XM2VTS's test set.
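A one-line transcription of (17):

```python
def weighted_error_rate(far, frr, R):
    """WER of (17) at cost ratio R (0.1, 1, and 10 in these experiments)."""
    return (frr + R * far) / (1 + R)
```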
For instance, both IDIAP approaches worked better than ours on configuration MC, but our algorithm outperformed them on protocol P.

To provide baseline results for the two contests mentioned before (ICPR and ICBA), a set of algorithms developed by Colorado State University (CSU) [27] was tested. The best results obtained with the implementation of the EBGM approach belonging to this set are much worse (average WERs of 8.79% and 14.21% for protocols MC and P, respectively) than those obtained with our approach, even if only texture is taken into account, which indicates that SDGJ selects more discriminative locations for face authentication.

The same set of system parameters (Gabor filter frequencies, for instance) was used for the experiments on the XM2VTS and BANCA databases, despite the difference in the resolution of the images tested (roughly 150 × 115 pixels in XM2VTS and 55 × 51 pixels in BANCA). The performance on BANCA is expected to improve when using higher resolution images. In fact, with 150 × 115 pixels, the average WER obtained through the combination of shape and texture was 9.47% for protocol P.

XII. CONCLUSIONS AND FURTHER RESEARCH

The main novelty of this paper is somewhat conceptual, since it proposes an alternative way to the selection of key points in face images. Our ultimate goal should be to exploit individual face structure so that the system focuses on subject-specific discriminative points/regions, rather than on universal landmarks. In this sense, the choice of the particular face shape descriptor, point matching algorithm, and feature extraction method are (although critical) just implementation issues. Biological reasons [14], as well as the better behavior compared to edges [8], motivated the use of ridges and valleys for face shape description. Analogously, the selection of Gabor filters for feature extraction was inspired both by biological reasons [31], [32] and by their wide use in face recognition [1]–[3], [33]. Finally, we chose shape context matching because it has proven to be a robust descriptor, performing accurately in object recognition/retrieval tasks via shape [15]. The combination of these techniques is also novel in the field of face recognition. Briefly, the algorithm can be summarized as follows.
• Facial structure is exploited through the use of a ridges and valleys detector, so that points are automatically sampled from the lines depicting the subject's face.
• At each shape-driven position, a set of Gabor filters is applied, obtaining Gabor jets which provide the textural information about the face.
• Given two images and their respective sets of points, shape context matching is used to determine which pairs of jets should be compared, obtaining, at the same time, two geometrical measures between faces, whose linear combination forms the sketch distortion term.
Further experiments should be conducted in order to assess the performance of the matching process when Gabor jet dissimilarities are taken into account along with histogram distances, as well as the impact (both on performance and computational
time) of constraining the search to the region surrounding each point.

Experimental results on the AR face database demonstrate that, although our system has not been particularly designed to cope with expression changes, it behaves reasonably well in the presence of such variations. In order to improve the results with large expression changes (i.e., screaming faces), we plan to apply a function that weighs the different facial regions according to the deformation caused by a given expression. Moreover, tests under different lighting conditions confirm the good performance of the system under illumination changes. We have demonstrated empirically that the distribution of shape-driven points is more discriminative than the distribution of fiducial points as used in [1]. Experimental results on the XM2VTS database show that our approach performs marginally better than an ideal EBGM, without the need for localizing "universal" fiducial points. Also, it has been demonstrated that a simple linear combination of texture and shape scores improves the performance of the system (compared to the texture-only method), although this improvement is not always significant. The comparison with other raw (i.e., without discriminant analysis) EGM methods reveals that our system achieves lower error rates. The application of a simple (hard) feature selection stage provoked clear improvements in performance (≈61%), achieving better results than the morphological EGM with several steps of discriminant analysis. However, as a future research line, we plan to study which jet coefficients are the most discriminative, as well as to select appropriate soft local weights for the shape-driven features. The achieved results on the BANCA database also confirm that changing from an "easy" protocol to a more challenging configuration provokes less degradation in performance than with other methods, and that our approach clearly outperforms an implementation of the EBGM algorithm on this database.

ACKNOWLEDGMENT

The authors would like to thank A. López and A. Pujol for providing the code of the MLSEC operator and for their insights on the ridge and valley detector. Moreover, the authors would also like to thank the reviewers for their helpful comments that improved the quality of this paper.

REFERENCES

[1] L. Wiskott, J. M. Fellous, N. Kruger, and C. von der Malsburg, "Face recognition by elastic bunch graph matching," IEEE Trans. Pattern Anal. Mach. Intell., vol. 19, no. 7, pp. 775–779, Jul. 1997.
[2] B. Duc, S. Fischer, and J. Bigun, "Face authentication with Gabor information on deformable graphs," IEEE Trans. Image Process., vol. 8, no. 4, pp. 504–516, Apr. 1999.
[3] F. Smeraldi and J. Bigun, "Retinal vision applied to facial features detection and face authentication," Pattern Recognit. Lett., vol. 23, no. 4, pp. 463–475, 2002.
[4] C. Kotropoulos, A. Tefas, and I. Pitas, "Frontal face authentication using morphological elastic graph matching," IEEE Trans. Image Process., vol. 9, no. 4, pp. 555–560, Apr. 2000.
[5] A. Tefas, C. Kotropoulos, and I. Pitas, "Using support vector machines to enhance the performance of elastic graph matching for frontal face authentication," IEEE Trans. Pattern Anal. Mach. Intell., vol. 23, no. 7, pp. 735–746, Jul. 2001.
[6] A. M. López, F. Lumbreras, J. Serrat, and J. J. Villanueva, "Evaluation of methods for ridge and valley detection," IEEE Trans. Pattern Anal. Mach. Intell., vol. 21, no. 4, pp. 327–335, Apr. 1999.
[7] A. M. López, D. Lloret, J. Serrat, and J. J. Villanueva, "Multilocal creaseness based on the level-set extrinsic curvature," Comput. Vis. Image Understand., vol. 77, pp. 111–144, 2000.
[8] A. Pujol, A. López, J. L. Alba, and J. J. Villanueva, "Ridges, valleys and Hausdorff based similarity measures for face description and matching," in Proc. Int. Workshop Pattern Recognition Information Systems, Setubal, Portugal, Jul. 2001, pp. 80–90.
[9] I. Biederman and J. Gu, "Surface versus edge-based determinants of visual recognition," Cognit. Psychol., vol. 20, pp. 38–64, 1988.
[10] V. Bruce, E. Hanna, N. Dench, P. Healey, and M. Burton, "The importance of 'mass' in line drawings of faces," Appl. Cognit. Psychol., vol. 6, pp. 619–628, 1992.
[11] B. Takács, "Comparing face images using the modified Hausdorff distance," Pattern Recognit., vol. 31, no. 12, pp. 1873–1881, 1998.
[12] Y. Gao and M. Leung, "Face recognition using line edge map," IEEE Trans. Pattern Anal. Mach. Intell., vol. 24, no. 6, pp. 764–779, Jun. 2002.
[13] J. L. Alba-Castro, A. Pujol, A. López, and J. J. Villanueva, "Improving shape-based face recognition by means of a supervised discriminant Hausdorff distance," in Proc. IEEE Int. Conf. Image Processing, 2003, vol. 3, pp. 901–904.
[14] D. E. Pearson, E. Hanna, and K. Hanna, "Computer-generated cartoons," in Images and Understanding. Cambridge, U.K.: Cambridge Univ. Press, 1990, pp. 46–60.
[15] S. Belongie, J. Malik, and J. Puzicha, "Shape matching and object recognition using shape contexts," IEEE Trans. Pattern Anal. Mach. Intell., vol. 24, no. 4, pp. 509–522, Apr. 2002.
[16] E. Bailly-Baillière, S. Bengio, F. Bimbot, M. Hamouz, J. Kittler, J. Marièthoz, J. Matas, K. Messer, V. Popovici, F. Porèe, B. Ruiz, and J.-P. Thiran, "The BANCA database and evaluation protocol," in Proc. AVBPA, 2003, pp. 625–638 [Online]. Available: http://www.ee.surrey.ac.uk/banca
[17] K. Messer, J. Matas, J. Kittler, J. Luettin, and G. Maitre, "XM2VTSDB: The extended M2VTS database," in Proc. AVBPA, 1999, pp. 72–77.
[18] J. Luettin and G. Maître, "Evaluation protocol for the extended M2VTS database (XM2VTSDB)," IDIAP, Tech. Rep. RR-21, 1998.
[19] A. M. Martínez and R. Benavente, "The AR face database," Comput. Vis. Ctr. (CVC), Tech. Rep. 24, 1998.
[20] P. J. Phillips, H. Moon, S. Rizvi, and P. Rauss, "The FERET evaluation methodology for face recognition algorithms," IEEE Trans. Pattern Anal. Mach. Intell., vol. 22, no. 10, pp. 1090–1104, Oct. 2000.
[21] J. Matas, M. Hamouz, K. Jonsson, J. Kittler, Y. Li, C. Kotropoulos, A. Tefas, I. Pitas, T. Tan, H. Yan, F. Smeraldi, J. Bigun, N. Capdevielle, W. Gerstner, S. Ben-Yacoub, Y. Abeljaoued, and E. Mayoraz, "Comparison of face verification results on the XM2VTS database," in Proc. ICPR, 2000, vol. 4, pp. 858–863.
[22] K. Messer, J. Kittler, M. Sadeghi, S. Marcel, C. Marcel, S. Bengio, F. Cardinaux, C. Sanderson, J. Czyz, L. Vandendorpe, S. Srisuk, M. Petrou, W. Kurutach, A. Kadyrov, R. Paredes, B. Kepenekci, F. B. Tek, G. B. Akar, F. Deravi, and N. Mavity, "Face verification competition on the XM2VTS database," in Proc. AVBPA, 2003, pp. 964–974.
[23] S. Zafeiriou, A. Tefas, and I. Pitas, "Exploiting discriminant information in elastic graph matching," in Proc. IEEE ICIP, 2005, vol. 3, pp. 768–771.
[24] K. Messer, J. Kittler, J. Short, G. Heusch, F. Cardinaux, S. Marcel, Y. Rodriguez, S. Shan, Y. Su, W. Gao, and X. Chen, "Performance characterisation of face recognition algorithms," in Proc. ICBA, 2006, pp. 1–11.
[25] K. Messer, J. Kittler, M. Sadeghi, M. Hamouz, A. Kostin, S. Marcel, S. Bengio, F. Cardinaux, C. Sanderson, N. Poh, Y. Rodriguez, K. Kryszczuk, J. Czyz, L. Vandendorpe, J. Ng, H. Cheung, and B. Tang, "Face authentication test on the BANCA database," in Proc. ICBA, 2004, pp. 8–15.
[26] K. Messer, J. Kittler, M. Sadeghi, M. Hamouz, A. Kostin, F. Cardinaux, S. Marcel, S. Bengio, C. Sanderson, N. Poh, Y. Rodriguez, J. Czyz, L. Vandendorpe, C. McCool, S. Lowther, S. Sridharan, V. Chandran, R. Paredes-Palacios, E. Vidal, L. Bai, L. Shen, Y. Wang, C. Yueh-Hsuan, L. Hsien-Chang, H. Yi-Ping, A. Heinrichs, M. Mueller, A. Tewes, C. von der Malsburg, R. Wurtz, Z. Wang, F. Xue, Y. Ma, Q. Yang, C. Fang, X. Ding, S. Lucey, R. Goss, and H. Schneiderman, "Face authentication test on the BANCA database," in Proc. ICPR, 2004, pp. 523–532.
[27] [Online]. Available: http://www.cs.colostate.edu/evalfacerec/index.html
[28] D. González-Jiménez and J. L. Alba-Castro, "Shape contexts and Gabor features for face description and authentication," in Proc. IEEE ICIP, 2005, pp. 962–965.
[29] D. González-Jiménez and J. L. Alba-Castro, "Pose correction and subject-specific features for face authentication," in Proc. ICPR, 2006, vol. 4, pp. 602–605.
[30] A. M. Martínez, "Recognizing expression variant faces from a single sample per class," in Proc. IEEE Int. Conf. Computer Vision Pattern Recognition, 2003, vol. 1, pp. 353–358.
[31] J. G. Daugman, "Two-dimensional spectral analysis of cortical receptive field profiles," Vis. Res., vol. 20, pp. 847–856, 1980.
[32] J. G. Daugman, "Complete discrete 2D Gabor transforms by neural networks for image analysis and compression," IEEE Trans. Acoust., Speech Signal Process., vol. 36, no. 7, pp. 1169–1179, Jul. 1988.
[33] C. Liu, "Gabor-based kernel PCA with fractional power polynomial models for face recognition," IEEE Trans. Pattern Anal. Mach. Intell., vol. 26, no. 5, pp. 572–581, May 2004.
Daniel González-Jiménez received the Telecommunications Engineer degree from the Universidad de Vigo, Vigo, Spain, in 2003, where he is currently pursuing the Ph.D. degree in the field of face-based biometrics. His research interests include computer vision and image processing.
José Luis Alba-Castro received the M.Sc. and Ph.D. degrees (Hons.) in telecommunications engineering from the Universidad de Santiago, Santiago, Spain, in 1990, and the Universidad de Vigo, Vigo, Spain, in 1997, respectively. His research interests include computer vision, statistical pattern recognition, automatic speech and speaker recognition, and image-based biometrics. He has written several papers and led several R&D projects on these topics. He is an Associate Professor of discrete signal processing, pattern recognition, image processing, and biometrics at the Universidad de Vigo.