A 2D+3D Face Authentication System Robust Under Pose and Illumination Variations

Filareti Tsalakanidou (1,2), Sotiris Malassiotis (2), and Michael G. Strintzis (1,2)*

(1) Information Processing Laboratory, Electrical and Computer Engineering Department, Aristotle University of Thessaloniki, Greece, [email protected], [email protected]

(2) Informatics and Telematics Institute, 1st Km Thermi-Panorama Rd, Thessaloniki 57001, Greece, [email protected]

* This work is funded by research project BioSec IST-2002-001766 (Biometrics and Security, http://www.biosec.org), under the Information Society Technologies (IST) priority of the 6th Framework Programme of the European Community.

Abstract

A complete face authentication system integrating 2D color and 3D depth images is described in this paper. Depth information, acquired by a novel low-cost 3D and color sensor based on the structured light approach, is used for robust face detection, localization and 3D pose estimation. To cope with illumination and pose variations, 3D information is used for the normalization of the input images. Illumination compensation exploits depth data to recover the illumination of the scene and relight the image under frontal lighting. The performance of the proposed system is tested on a face database of more than 3,000 images, in conditions similar to those encountered in real-world applications. Experimental results show that when normalized images, depicting upright orientation and frontal lighting, are used for authentication, significantly lower error rates are achieved. Moreover, the combination of color and depth data results in increased authentication accuracy, compared to the use of each modality alone.

1 Introduction

Face recognition has received significant attention in the last 15 years, due to the increasing number of commercial and law enforcement applications requiring reliable personal authentication (e.g. access control, surveillance of people in public places, security of transactions, mug-shot matching, human-computer interaction) and the availability of low-cost recording devices. The majority of existing face recognition algorithms use 2D intensity images of the face. Although such systems have advanced to be fairly accurate under controlled conditions, extrinsic imaging parameters such as pose and illumination are still responsible for serious deterioration of their performance,


as was shown in recent public facial recognition benchmarks [20]. To improve performance under these conditions, three-dimensional (3D) face recognition techniques have been proposed, which exploit information about the 3D structure of the human face [4]. The use of 3D images for personal identification is mainly motivated by the fact that the shape of faces conveys important discriminatory information, which can be used either alone or as an additional cue, combined with the reflectance data represented by 2D intensity images, to enhance face recognition performance. Another motivation comes from the fact that the 3D structure of the face is inherently insensitive to illumination conditions, use of cosmetics or face pigment. Moreover, the use of 3D data can significantly aid pose estimation and compensation. The earliest approach adopted towards 3D face recognition is based on the extraction of 3D facial features by means of differential geometry techniques [22]. Commonly, surface properties such as curvature are used to localize facial features invariant to rigid transformations of the face, by segmenting the surface into concave, convex and saddle regions. Facial ridge and valley lines are then detected and subsequently used for the extraction of facial features such as the eyes, eyebrows, nose and mouth, by making use of prior knowledge of face anatomy [12]. Face recognition is based on the comparison of feature vectors representing the spatial relationships between extracted facial features. Local maximum and minimum principal curvature measurements, i.e. ridge and valley lines, are used in [13, 21] for the construction of Extended Gaussian Images (EGI) of the face. Matching between test and gallery depth images is performed using rotation-invariant correlation between the respective EGIs. More sophisticated techniques, such as Point Signatures [6], have been proposed for 3D surface feature extraction and have also been applied to 3D face recognition. Point Signatures describe the local structure of face points and are used to find correspondences between 3D faces. When matching two faces, first the rigid parts of the face surfaces are identified, then a correspondence is established between

the two surfaces based on the correlation of signature vectors at corresponding points, and finally a similarity measure based on the optimal transformation between the two surfaces is calculated. Other 3D face recognition approaches are based on the iterative closest point (ICP) algorithm, which is widely used for registration of 3D models [14]. At each iteration, the closest point correspondences between two faces are established and then the average distance between these correspondences is minimized by a rigid transformation. The matching efficiency of ICP is improved by considering additional features, such as color or curvature, or by using a weighted distance. The most significant disadvantage of feature-based approaches is that they rely on accurate 3D maps of faces, usually extracted by expensive off-line 3D scanners. In practice, 3D digitizers produce range images containing several artifacts, noise and missing pixels. The applicability of feature-based approaches when using such data is questionable, especially if computation of curvature information is involved, since curvature is a second derivative. Also, the computational cost associated with the extraction of such features (e.g. curvatures) is high. This hinders the use of such techniques in real-world applications. Appearance-based methods like PCA or Fisherfaces simplify 3D data processing by using 2D depth images (see Fig. 1b), where pixel values represent the distance of corresponding points from the sensor. The main problem of such techniques is the alignment of 3D images under pose variations, but this can be overcome, since 3D data can be exploited for accurate pose estimation and compensation. The combination of 2D and 3D images for multimodal face recognition using PCA has also been proposed, and significant performance improvement has been achieved [24, 5]. In [5], a database of significant size was used to produce comparative results of face identification using eigenfaces for 2D, 3D and their combination and for varying image quality. This test, however, considered only frontal images captured under constant illumination. In [23], a database containing several variations (pose, illumination, expressions, glasses, several recording sessions) was used. The embedded hidden Markov model method was applied to both color and depth images and low error rates were reported. To cope with intrapersonal variations due to viewpoint or lighting conditions, the face database was enriched with artificially generated examples depicting variations in pose and illumination. In this paper, we describe and evaluate a complete face authentication system using a combination of 2D color and 3D range images captured in real-time. A completely different approach compared to [23] is adopted. Variations of facial images due to changes in pose or lighting conditions are compensated prior to classification. Several novel techniques are proposed which, taking as input a pair of 2D and 3D images, produce a pair of normalized images depicting frontal pose and illumination. Also, a novel face detection and localization method relying exclusively on depth data is presented.

Figure 1. (a) Image captured by the color camera with the color pattern projected. (b) Computed range image. Bright pixels correspond to points closer to the camera and white pixels correspond to undetermined depth values.

The algorithm exhibits robustness to background clutter, occlusion, face pose variations and harsh illumination conditions by exploiting depth information and prior knowledge of face geometry and symmetry. The efficiency and robustness of the proposed system are demonstrated on a data set of significant size and compared with semi-automatic rectification. The paper is organized as follows. A brief description of the 3D sensor is given in Section 2. Pose and illumination compensation algorithms are presented in Sections 3 and 4 respectively. Face classification using 2D and 3D data is outlined in Section 5, while experimental results evaluating the developed techniques are presented in Section 6. Finally, conclusions are drawn in Section 7.

2 Acquisition of 3D Data

For the recording of 2D and 3D facial data, a sensor capable of synchronous real-time acquisition of 3D images and associated color 2D images is employed [9]. The sensor is built using low-cost devices: an off-the-shelf CCTV color camera and a standard slide projector. Acquisition of 3D data is based on an improved and extended version of the Coded Light Approach (CLA). By rapidly alternating the color-coded light pattern with a white-light pattern, color and depth images are acquired almost simultaneously (see Figure 1). For the experiments in this paper, the system was optimized for an access control application scenario, leading to an average depth accuracy of 0.5mm for objects located about 1 meter from the camera in an effective working space of 60cm × 50cm × 50cm. The spatial resolution of the range images is equal to the horizontal color camera resolution. The vertical range image resolution is also close to the vertical resolution of the color camera, for a low-bandwidth surface such as the face. For objects outside the working volume the projected pattern will be fuzzy, leading to undetermined depth values. The acquired range images contain artifacts and missing points, mainly over areas that cannot be reached by the projected light and/or over highly refractive (e.g. eyeglasses) or low-reflective surfaces (e.g. hair, beard).

3 Pose Compensation

Recent public facial recognition benchmarks have shown that while the best recognition techniques were successful on large face databases recorded in controlled conditions, their performance deteriorated seriously in uncontrolled conditions, mainly due to variations in illumination and head rotations [20]. Such variations have proven to be one of the biggest problems of face recognition systems. Several techniques have been proposed to recognize faces under varying pose. One approach is the automatic generation of novel views resembling the pose in the probe image. This is achieved either by using a face model (an active appearance model (AAM) [7] or a deformable 3D model [3]) or by warping frontal images using the estimated optical flow between probe and gallery [2]. Classification is subsequently based on the similarity between the probe image and the generated view. A different approach is based on building a pose-varying eigenspace by recording several images of each person under varying pose. Representative techniques are the view-based subspace [19] and the predictive characterized subspace [10]. Unlike the techniques described above, in our system pose estimation and compensation is based solely on 3D images and thus exhibits robustness under varying illumination and occlusion of facial features (beard and hair, glasses, etc.). Another advantage stemming from the availability of 3D data is that geometric artifacts, which may be present after image warping, are avoided. The aim of the pose compensation algorithm described in this section is to generate, given a pair of color and depth images, novel corresponding color and depth images depicting a frontal, upright face orientation. Also, the center of the face on the input image is aligned with the center of the face in the gallery images of the same person with pixel accuracy. The first step of the algorithm is face detection. Segmentation of the head from the body relies on statistical modelling of the head and torso points using a mixture-of-Gaussians assumption. The parameters of the model are then estimated by means of the Expectation Maximization algorithm and by incorporating a priori constraints on the relative dimensions of the body parts [16]. Next, we estimate the 3D head pose based on the detection of the nose [16]. After the tip of the nose is localized, a 3D line is fitted to the 3D coordinates of pixels on the ridge of the nose. This 3D line defines two of the three degrees of freedom of the face orientation. The third degree of freedom, that is the rotation angle around the nose axis, is then estimated by finding the 3D plane that cuts the face into two bilaterally symmetric parts. The error of this pose estimation algorithm, tested on more than 2,000 images, is less than 2°.
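To make the geometry of this estimate concrete, the sketch below shows one way the fitted nose axis and the bilateral symmetry plane could be turned into a face coordinate frame. It is a minimal sketch in Python with hypothetical function names and conventions, not the authors' implementation, which is detailed in [16].

```python
import numpy as np

def fit_nose_axis(ridge_points):
    """Fit a 3D line to points sampled along the nose ridge.

    ridge_points: (N, 3) array of 3D coordinates on the ridge of the nose.
    Returns the centroid of the points and the unit direction of the line
    (first principal axis of the point cloud).
    """
    centroid = ridge_points.mean(axis=0)
    _, _, vt = np.linalg.svd(ridge_points - centroid, full_matrices=False)
    return centroid, vt[0]

def face_rotation(nose_dir, symmetry_normal):
    """Assemble a head rotation matrix from the nose axis (two degrees of
    freedom) and the normal of the bilateral symmetry plane (the remaining
    rotation around the nose axis)."""
    z = nose_dir / np.linalg.norm(nose_dir)           # axis along the nose
    x = symmetry_normal - np.dot(symmetry_normal, z) * z
    x = x / np.linalg.norm(x)                          # left-right axis
    y = np.cross(z, x)                                 # completes a right-handed frame
    return np.stack([x, y, z], axis=1)                 # columns are the face axes
```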

Figure 2. Pose compensation example. (a) Original color image, (b) original depth image showing the detected head blob and the estimated local coordinate system fixed on the nose, (c) rectified depth image, (d) symmetry-based interpolation, (e) final linearly interpolated depth image, (f) rectified color image.

Once the tip of the nose and the pose of the face have been estimated, a 3D coordinate frame aligned with the face is defined, centered on the tip of the nose. A warping procedure is subsequently applied on the input depth image to align this local coordinate frame with a reference coordinate frame, bringing the face into an upright orientation. The reference coordinate frame is defined during the training of faces using the gallery images, as will be described below. The transformation between the local and reference coordinate frames is further refined to pixel accuracy by applying the ICP surface registration algorithm [1] between the warped depth image and a reference (gallery) depth image corresponding to the claimed person ID. The rectified depth image contains missing pixels. Some of the missing values are determined simply by copying the corresponding symmetric pixel values from the other side of the face. Remaining missing pixel values are linearly interpolated from neighboring points. The interpolated depth map is subsequently used to rectify the associated color image, also using 3D warping (see Fig. 2). For the training phase a similar but simpler pose compensation algorithm is applied. To obtain optimal results, the pose of the face is estimated by manually selecting three points on the input image, defining a local 3D coordinate frame. Then, the input color and depth images are warped to align this local coordinate frame with the coordinate frame of the camera, using the surface interpolation algorithm described above. For one of the pose-compensated depth images of each person, a simplified version of the automatic pose estimation algorithm above is applied, thus estimating a reference coordinate frame. This last step is important, since the slant of the nose differs from person to person.
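The symmetry-based filling of missing depth values followed by linear interpolation can be sketched as below. The assumption that the rectified face's symmetry axis coincides with the central image column, as well as the function signature, are illustrative simplifications rather than the authors' actual code.

```python
import numpy as np

def fill_missing_depth(depth, valid):
    """Fill undetermined values in a pose-rectified depth image.

    depth: (H, W) float array of depth values.
    valid: (H, W) boolean mask of pixels with determined depth.
    Assumes the face has been rectified so that its bilateral symmetry
    axis lies on the central image column (an assumption for brevity).
    """
    filled = depth.copy()
    h, w = depth.shape
    mirrored = depth[:, ::-1]
    mirrored_valid = valid[:, ::-1]

    # Step 1: copy values from the bilaterally symmetric pixel when available.
    use_mirror = ~valid & mirrored_valid
    filled[use_mirror] = mirrored[use_mirror]

    # Step 2: linearly interpolate remaining holes along each image row.
    still_missing = ~valid & ~mirrored_valid
    cols = np.arange(w)
    for r in range(h):
        miss = still_missing[r]
        if miss.any() and (~miss).any():
            filled[r, miss] = np.interp(cols[miss], cols[~miss], filled[r, ~miss])
    return filled
```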

4 Illumination Compensation

In this section, an algorithm is described that compensates illumination by generating from the input image a novel image, relit from a frontal direction. Our approach is inspired by recent work on image-based scene relighting used for rendering realistic images [26, 8]. Image relighting relies on inverting the rendering equation, i.e. the equation that relates the image brightness to the object material and geometry and to the illumination of the scene. Given several images of the scene under different conditions, this equation may be solved (although it is an ill-posed problem) to recover the illumination distribution, which is then used to re-render the scene under novel illumination. The first step is therefore to recover the scene illumination from a pair of color and depth images. Assuming that the scene is illuminated by a single light source, a technique is adopted that learns the non-linear relationship between the image brightness and the light source direction $L$ using a set of artificially generated bootstrap images. For each subject in the database, we use the reference pose-compensated depth image $I_r$ to render $N$ virtual views of the face, illuminated from different directions. The set of light source directions is uniformly sampled from a section of the positive hemisphere. To decrease the dimensionality of the problem, from each rendered image a feature vector is extracted containing locally weighted averages of image brightness over $M$ preselected image locations ($M = 30$ in our experiments). The sample locations are chosen so as to include face areas with similar albedo (i.e. the skin). Feature vectors $x_i$, $i = 1, \dots, N$, extracted from all the images and normalized to have zero mean and unit variance, are then used as samples of the $M$-dimensional illuminant direction function $L = G(x)$. Approximating this function by $\tilde{G}$ using the samples is a regression problem that may be efficiently solved using Support Vector Machines (SVM) [15]. Assume now that we want to compute the similarity between a pose-compensated probe image and gallery images of a person $j$ in the gallery. A feature vector $x$ is computed from the probe image as described previously. Then, an estimate of the light source direction is given by $\tilde{G}_j(x)$, i.e. the SVM regression function computed for person $j$ during the training phase. Given the estimate of the light source direction $L$, relighting the input image with frontal illumination $L_0$ is straightforward. Let $I_C$, $I_D$ be the input pose-compensated color and depth images respectively and $\tilde{I}_C$ the illumination-compensated image. Then, the image irradiance at each pixel $u$ is approximated by
$$I_C(u) = A(u)\,R(I_D, L, u), \qquad \tilde{I}_C(u) = A(u)\,R(I_D, L_0, u) \qquad (1)$$
where $A$ is the unknown face albedo or texture function (geometry-independent component) and $R$ is a rendering of the surface with constant albedo. Equation (1) can be rewritten as
$$\tilde{I}_C(u) = I_C(u)\,\frac{R(I_D, L_0, u)}{R(I_D, L, u)},$$
i.e. the illumination-compensated image is given by multiplication of the input image with a ratio image.
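The light-direction regression step described above can be sketched as follows. scikit-learn's SVR is used here as a stand-in for whatever SVM regression implementation the authors employed, and the sample points, patch size and bootstrap renderings are hypothetical placeholders.

```python
import numpy as np
from sklearn.svm import SVR

def brightness_features(image, sample_points, radius=3):
    """Locally averaged brightness at M preselected face locations,
    normalized to zero mean and unit variance as described above."""
    feats = np.array([image[r - radius:r + radius + 1,
                            c - radius:c + radius + 1].mean()
                      for r, c in sample_points])
    return (feats - feats.mean()) / (feats.std() + 1e-8)

def train_light_regressors(bootstrap_images, light_dirs, sample_points):
    """Fit one support vector regressor per component of the light direction,
    using the artificially relit bootstrap renderings of the reference face."""
    X = np.stack([brightness_features(im, sample_points) for im in bootstrap_images])
    return [SVR(kernel='rbf').fit(X, light_dirs[:, k]) for k in range(3)]

def estimate_light_direction(probe_image, regressors, sample_points):
    """Evaluate the learned regression for a probe image; return a unit direction."""
    x = brightness_features(probe_image, sample_points).reshape(1, -1)
    L = np.array([reg.predict(x)[0] for reg in regressors])
    return L / np.linalg.norm(L)
```

Training one regressor per direction component is a design choice made here because SVR handles a single output at a time; the paper does not specify how the vector-valued regression is decomposed.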

Figure 3. Illumination compensation example. (a) Original image, (b) $R(I_D, L, u)$, (c) novel image relit by frontal light.

Figure 3 illustrates the relighting of side-illuminated images. The same relighting procedure is also applied to the training images. It is then expected that illumination-compensated probe and gallery images of the same person will only differ up to a scale factor, since the intensity of the light source cannot be recovered. This scale factor is cancelled by taking the logarithm of the images (which makes the factor additive instead of multiplicative) and subsequently subtracting the mean value. Although the description of the above relighting technique considers a single-channel image, color images may be handled equally well by applying the same procedure (illuminant estimation and relighting) separately for each channel. An important advantage of the previously described algorithm is its flexibility in coping with complex illumination conditions by adaptation of the rendering function $R$ above. For example, accounting for attached shadows may be achieved simply by activating shadowing in the rendering engine. On the other hand, from our experience with different rendering models, good results may also be obtained with relatively simple renderings.
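A minimal sketch of the ratio-image relighting and the log-domain scale cancellation, for a single channel, is given below; the variable names and the small epsilon guard are illustrative assumptions.

```python
import numpy as np

def relight_frontal(I_C, R_L, R_L0, eps=1e-6):
    """Ratio-image relighting of one pose-compensated channel.

    I_C  : input color channel (pose compensated)
    R_L  : constant-albedo rendering of the depth image under the
           estimated light direction L
    R_L0 : the same rendering under the frontal light direction L0
    """
    relit = I_C * (R_L0 + eps) / (R_L + eps)
    # The unknown light-source intensity leaves a global multiplicative
    # factor; the logarithm makes it additive and subtracting the mean
    # removes it, as described above.
    log_relit = np.log(relit + eps)
    return log_relit - log_relit.mean()
```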

5 Face Classification

Normalized color and depth images are subsequently provided as input to a color classifier and a depth classifier respectively. For practical reasons, only the red component of the color images was used. Pixel values are normalized to have zero mean and unit variance before classification. We have experimented with several state-of-the-art techniques, including Embedded Hidden Markov Models (EHMM) [18], Probabilistic Matching (PM) [17] and Elastic Graph Matching (EGM) [25]. The latter two were shown to give the lowest error rates. We finally selected the PM algorithm, since its computational complexity was one order of magnitude smaller than that of the EGM algorithm for similar error rates. For depth images the same algorithms were tested. In this case, pixel values correspond to distance from the sensor rather than brightness. In addition, we have conducted experiments with a simplified version of the Point Signature algorithm [6]. The results using this feature-based approach were not as good as those of the appearance-based techniques, possibly due to the sensitivity of this algorithm to noise and artifacts (e.g. holes) in depth images. Finally, PM was adopted for both color and depth image classification. The scores (maximum-likelihood probabilities) returned by each classifier are subsequently normalized. This consists of first estimating, for each enrolled user $i$, the mean $\mu_i$ and standard deviation $\sigma_i$ of the genuine transaction score distribution (by means of robust estimation techniques) using a set of scores extracted from a bootstrap set of images. Then, we apply the tanh normalization technique to map the scores into the interval $[0, 1]$:
$$s' = \frac{1}{2}\left\{\tanh\left(0.05\,\frac{s - \mu_i}{\sigma_i}\right) + 1\right\}$$
where $s$ and $s'$ are the initial and normalized scores respectively [11]. Once the scores are normalized, fusion of the color and depth modalities is achieved by simply adding the corresponding scores.
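The score normalization and fusion step is simple enough to state directly in code; this is a sketch of the formula above, with the per-user statistics passed in as plain tuples (an assumed convention).

```python
import numpy as np

def tanh_normalize(score, mu_i, sigma_i):
    """Map a raw matching score for enrolled user i into [0, 1]."""
    return 0.5 * (np.tanh(0.05 * (score - mu_i) / sigma_i) + 1.0)

def fused_score(color_score, depth_score, color_stats, depth_stats):
    """Sum of the normalized color and depth scores for the claimed identity."""
    s_color = tanh_normalize(color_score, *color_stats)
    s_depth = tanh_normalize(depth_score, *depth_stats)
    return s_color + s_depth

# Example: the stats are the (mean, std) of the genuine score distribution
# for the claimed user, estimated from a bootstrap set of images.
print(fused_score(0.82, 0.74, color_stats=(0.8, 0.1), depth_stats=(0.7, 0.15)))
```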

6 Experimental Results

The focus of the experimental evaluation was to investigate the efficiency of the complete face authentication system under conditions similar to those encountered in real-world applications. Therefore, a database containing several facial appearance variations was recorded. 73 volunteers participated in two recording sessions, the second taking place two weeks after the first. For each subject several images depicting different appearance variations were acquired: facial expressions (smiling, laughing), illumination (side spot light), pose variations (±20°), images with glasses, and frontal images (see Figure 4). In total, more than 3,000 image pairs were recorded. The PM algorithm was trained using 4 images per person from the first recording session, including one frontal image, two images depicting facial expressions and one image with glasses. Testing was performed using the remaining images. Table 1 reports the equal error rates achieved with the proposed compensation scheme. These are compared with the case of manual pose normalization, i.e. three points over the eyes and mouth were selected by a human operator and used to rectify the images. In this case, rectification is performed by 2D affine warping of the images. As shown in Table 1, the proposed scheme results in authentication errors which are much lower than those achieved by manual image normalization, especially for images depicting pose variations. Moreover, a significant improvement is reported when the illumination compensation algorithm is used.

Figure 4. Examples of face color images and corresponding depth maps depicting expression, pose and illumination variations, recorded during a single session.

              A     F     E     P     I     G
C            7.6   2.1   6.3   9.3   5.2   3.2
C (PC+IC)    4.7   1.2   5.1   6.9   3.1   3.8
D            5.7   1.2   4.6   7.1   2.2   4.8
D (PC)       4.3   1.2   4.7   5.1   2.3   4.8
C+D          5.2   1.0   3.9   8.6   2.7   3.2
C+D (PC+IC)  2.8   0.6   3.5   3.6   1.4   3.4

Table 1. Equal error rates (%) for different image variations (A: All variations, F: Frontal, E: Expressions, P: Pose, I: Illumination, G: Glasses) and modalities (C: Color, D: Depth, C+D: Color + Depth). Rows labeled PC and/or IC show results with the proposed pose compensation (PC) and illumination compensation (IC) techniques, while the remaining rows show results with manual image rectification.

In Figure 5, the receiver operating characteristic curve is shown. By combining color and depth information, a correct recognition rate of 95% is obtained at a false acceptance rate of 0.5%.

7 Conclusions

In the present paper, a novel approach for 2D and 3D face authentication has been proposed, based on automatic image normalization algorithms that exploit the availability of 3D information. Given a pair of color and depth images, the pose of the face is first estimated using depth data only, and is then compensated, resulting in a frontal view. Finally, the scene illumination is recovered from the pair of images, and a new color image, relit from a frontal direction, is generated. Significant improvements in face classification accuracy were obtained using this scheme. The system was shown to be most sensitive to pose variations and facial expressions. Although the proposed pose compensation procedure significantly reduces the error rates for images depicting pose variations, it can lead to some distortion on the side of the face where occluded parts of the face appear after rotation.

Figure 5. Receiver operating characteristic curve showing false acceptance rate (FAR) versus correct recognition rate (1-FRR) for the two modalities (Color, Depth) and their combination.

We plan to deal with this issue by investigating novel classification schemes able to disregard missing pixels. Finally, it is emphasized that the proposed face classification scheme may be easily combined with the majority of existing 2D systems to improve their performance, and that it is capable of real-time face authentication under arbitrary pose or lighting variations.

References

[1] P. J. Besl and N. D. McKay. A method for registration of 3-D shapes. IEEE Trans. Pattern Analysis and Machine Intelligence, 14(2):239-256, February 1992.
[2] D. Beymer and T. Poggio. Face recognition from one example view. Proc. IEEE Int. Conf. Computer Vision, 1:500-507, 1995.
[3] V. Blanz, S. Romdhani, and T. Vetter. Face identification across different poses and illuminations with a 3D morphable model. Proc. Int. Conf. on Automatic Face and Gesture Recognition, 1:202-207, 2002.
[4] K. W. Bowyer, K. Chang, and P. J. Flynn. A survey of approaches to three-dimensional face recognition. Proc. Int. Conf. Pattern Recognition, 2:358-361, 2004.
[5] K. Chang, K. Bowyer, and P. Flynn. Face recognition using 2D and 3D facial data. Proc. Workshop on Multimodal User Authentication, 1:25-32, December 2003.
[6] C.-S. Chua, F. Han, and Y.-K. Ho. 3D human face recognition using point signature. Proc. 4th IEEE Int. Conf. on Automatic Face and Gesture Recognition, 1:233-238, 2000.
[7] T. F. Cootes, G. J. Edwards, and C. J. Taylor. Active appearance models. IEEE Trans. Pattern Analysis and Machine Intelligence, 23(6):681-685, 2001.
[8] P. Debevec, T. Hawkins, C. Tchou, H.-P. Duiker, W. Sarokin, and M. Sagar. Acquiring the reflectance field of a human face. Proc. SIGGRAPH 2000, Computer Graphics Proceedings, Annual Conference Series, 1:145-156, July 2000.
[9] F. Forster, P. Rummel, M. Lang, and B. Radig. The HISCORE camera: a real time three dimensional and color camera. Proc. Int. Conf. Image Processing, 2:598-601, October 2001.
[10] D. B. Graham and N. M. Allinson. Face recognition from unfamiliar views: subspace methods and pose dependency. Proc. Int. Conf. on Automatic Face and Gesture Recognition, 1:348-353, 1998.
[11] A. Jain, K. Nandakumar, and A. Ross. Score normalization in multimodal biometric systems. Pattern Recognition, 2005.
[12] T. K. Kim, S. C. Kee, and S. R. Kim. Real-time normalization and feature extraction of 3D face data using curvature characteristics. Proc. 10th IEEE Int. Workshop on Robot and Human Interactive Communication, 1:74-79, 2001.
[13] J. C. Lee and E. Milios. Matching range images of human faces. Proc. IEEE Int. Conf. on Image Processing, 1:722-726, 1990.
[14] X. Lu, D. Colbry, and A. K. Jain. Three-dimensional model based face recognition. Proc. Int. Conf. Pattern Recognition, 1:362-366, 2004.
[15] S. Malassiotis and M. G. Strintzis. Pose and illumination compensation for 3D face recognition. Proc. Int. Conf. Image Processing, 1:91-94, 2004.
[16] S. Malassiotis and M. G. Strintzis. Robust real-time 3D head pose estimation from range data. Pattern Recognition, 2005.
[17] B. Moghaddam, W. Wahid, and A. Pentland. Beyond eigenfaces: probabilistic matching for face recognition. Proc. Int. Conf. on Automatic Face and Gesture Recognition, 1:30-35, April 1998.
[18] A. V. Nefian and M. H. Hayes. An embedded HMM-based approach for face detection and recognition. Proc. IEEE Int. Conf. Acoustics, Speech and Signal Processing, 6:3553-3556, 1999.
[19] A. Pentland, B. Moghaddam, and T. Starner. View-based and modular eigenspaces for face recognition. Proc. Conf. Computer Vision and Pattern Recognition, 1:84-91, 1994.
[20] P. Phillips, P. Grother, R. Micheals, D. Blackburn, E. Tabassi, and J. Bone. Face recognition vendor test 2002: evaluation report. Technical report, March 2003.
[21] H. T. Tanaka, M. Ikeda, and H. Chiaki. Curvature-based face surface recognition using spherical correlation. Principal directions for curved object recognition. Proc. IEEE Int. Conf. on Automatic Face and Gesture Recognition, 1:372-377, 1998.
[22] E. Trucco and A. Verri. Introductory Techniques for 3-D Computer Vision. Prentice-Hall, NJ, 1998.
[23] F. Tsalakanidou, S. Malassiotis, and M. G. Strintzis. Face localization and authentication using color and depth images. IEEE Transactions on Image Processing, 14(2):152-168, February 2005.
[24] F. Tsalakanidou, D. Tzovaras, and M. G. Strintzis. Use of depth and colour eigenfaces for face recognition. Pattern Recognition Letters, 24(9-10):1427-1435, June 2003.
[25] L. Wiskott, J.-M. Fellous, N. Krüger, and C. von der Malsburg. Face recognition by elastic bunch graph matching. IEEE Trans. Pattern Analysis and Machine Intelligence, 19(3):775-779, July 1997.
[26] L. Zhang and D. Samaras. Face recognition under variable lighting using harmonic image exemplars. Proc. Conf. Computer Vision and Pattern Recognition, 1:19-25, 2003.
