RGB-D-Based Face Reconstruction and Recognition

Gee-Sern (Jison) Hsu, Member, IEEE, Yu-Lun Liu, Hsiao-Chia Peng, and Po-Xun Wu
Abstract—Most RGB-D-based research focuses on scene reconstruction, gesture analysis, and simultaneous localization and mapping, but only a few studies consider its impact on face recognition. A common yet challenging scenario considered in face recognition takes a single 2D face of frontal pose as the gallery and other poses as the probe set. We consider a similar scenario, but with an RGB-D image pair taken at frontal pose for each subject in the gallery and only 2D images with a large scope of pose variations in the probe set, and study the advantage of the additional depth map on top of the regular RGB image. To tackle the cases with the depth map corrupted by quantization noise, which are often encountered when the face is not close enough to the RGB-D camera, we propose a resurfacing approach as a preprocessing phase. We formulate the 3D face reconstruction using the RGB-D image as a constrained optimization and compare the results with different reconstruction settings. The reconstructed 3D face allows the generation of 2D faces with specific poses, which can be matched against the probes. To deal with occlusion and expression variations, an automatic landmark detection algorithm is exploited to identify the parts on a given probe that are good for recognition. Experiments on benchmark databases show that the additional depth map substantially improves the cross-pose recognition performance, and the landmark-based component selection also improves the recognition under occlusion and expression variation. The performance comparison with other contemporary approaches also shows the effectiveness of the proposed approach.

Index Terms—Face recognition, face reconstruction, RGB-D images.
I. INTRODUCTION
THE arrival of low-cost RGB-D cameras, such as the Microsoft Kinect and Asus Xtion, has created a new landscape for computer vision to explore. While tremendous efforts have been devoted to gesture analysis, scene reconstruction and SLAM (Simultaneous Localization and Mapping) [1]–[3], only limited research is available on face recognition using RGB-D images [4]–[6]. Although several 3D-based approaches can handle recognition across poses using a single RGB image [7], [8], the advantage given by the additional depth map in the RGB-D image is so far unclear and yet to be studied.
Manuscript received March 15, 2014; revised July 29, 2014; accepted September 15, 2014. Date of publication October 1, 2014; date of current version November 10, 2014. This work was supported by the Ministry of Science and Technology under Grant 103-2221-E-011-106-MY2. The associate editor coordinating the review of this manuscript and approving it for publication was Prof. Bir Bhanu. The authors are with the Artificial Vision Laboratory, National Taiwan University of Science and Technology, Taipei 10607, Taiwan (e-mail: [email protected]; [email protected]; [email protected]; [email protected]). Color versions of one or more of the figures in this paper are available online at http://ieeexplore.ieee.org. Digital Object Identifier 10.1109/TIFS.2014.2361028
We consider a scenario in which an RGB-D image of a frontal face is collected for each subject in the gallery set, but the probe set contains RGB images only. This scenario simulates a situation in which an RGB-D camera is available only at the user registration phase, while recognition must be carried out on RGB images that can be easily obtained by a regular camera without the depth channel. This scenario is more practical, with a broader scope of applications, than that considered in [4]–[6], where RGB-D images are required in both the registration and recognition phases, i.e., in both the gallery and probe sets.

We propose an approach that reconstructs a 3D face from an RGB-D image for each subject in the gallery, aligns the reconstructed 3D model to a probe using facial landmarks, and recognizes the probe using Sparse Representation-based Classification (SRC) [9]. The primary variable tackled in this study is pose, i.e., cross-pose recognition, although partial occlusion and expression variation are also considered in our experiments.

The research on RGB-D based face recognition is at an early stage, and only a few works have been reported. The algorithm in [4] exploits 3D facial symmetry to construct a canonical frontal view, shape and texture, of each gallery face. The canonical depth map and texture of a probe face are then sparsely approximated from dictionaries learned from training data. The texture is transformed from RGB to a discriminant color space before sparse coding, and the reconstruction errors from the sparse coding are added for individual identities in the dictionary. The probe face is assigned the identity with the smallest reconstruction error. Combining face detection in the color image and nose tip localization in the depth map, Ciaccio et al. [5] crop the facial region and detect two fiducial points to normalize the cropped face and localize the 3D face center. The normalized face is segmented into patches for feature extraction and matching against the probe. It is experimentally shown on the CurtinFaces database [4] that features blending the Local Binary Pattern (LBP) and a covariance descriptor yield a better performance than either feature alone. Both methods in [4] and [5] are compared with the proposed approach in the performance evaluation reported in Sec. IV. Goswami et al. [6] compute four entropy maps on the RGB and depth images with varying patch sizes, and a visual saliency map on the RGB image. Histogram of Oriented Gradients (HOG) [10] descriptors are extracted from these five entropy/saliency maps as features, and a random decision forest classifier is exploited for recognition. Their experimental results indicate that RGB-D based face recognition outperforms many 2D and 3D approaches. The scenario considered in the aforementioned studies [4]–[6] assumes both depth and color images of
Fig. 1. Profiles of the RGB-D images taken at different distances. The quantization noise on the depth data deteriorates the reading of the profile, and the deterioration worsens as the distance increases.
a probe available at the recognition phase. However, we allow each probe to be associated with one color face image only, without the depth map. In addition, we propose a solution to handle the issue of significant quantization noise on the depth map, which is encountered when the face is more than 1.2m away from the RGB-D camera, as shown in Fig. 1. This issue has not been discussed in any previous work.

Other closely related research deals with cross-pose recognition using a single 2D frontal face as a gallery sample and faces of other poses as the probe set. Most of the 3D-based approaches are composed of 3D reconstruction of each 2D gallery face, synthesis of novel 2D views using the reconstructed model, and matching of the synthesized images against the probes. The morphable model [11] uses prior knowledge, including 3D face shapes and textures collected from hundreds of 3D scans, to build a 3D model from a 2D face. Although regarded as a good solution for cross-pose recognition, it is expensive in storage and computation because of the huge amount of dense 3D scan data considered. A similar approach with automatic feature localization is given in [12], which reports a good performance for poses less than 45° but degrades significantly for large poses. The Generic Elastic Model (GEM) [7] claims that the depth of a gallery face can be accurately reconstructed by a generic depth map with 2D dense meshes built on the landmarks of both the gallery face and the generic model. The gallery face and the generic model of non-frontal poses are aligned through landmarks obtained by a commercial tool. The study only considers the cases when both eyes are visible, and thus works only for poses up to a yaw angle of 60°. It is further improved in an extended work [13]. Although the reconstruction appears better than the original GEM [7], it is verified only on poses up to a yaw angle of 30° on Multi-PIE [14]; its performance on poses with large rotations is yet to be studied. Arguing that many 3D face models with the Lambertian assumption ignore specular and diffuse reflections, a Heterogeneous Specular and Diffuse (HSD) 3D surface approximation is proposed in [8] and proven effective with high recognition rates on extreme poses. Experiments on the CMU PIE database reveal that the HSD outperforms many contemporary approaches. However, the requirement of multiple frontal images with various illumination conditions for the surface approximation substantially weakens its practical applicability. Some studies,
for example [15], only consider the depth map without the RGB image for recognition.

Our method is partially related to the 3D reconstruction of a single 2D facial image using the 3D scan of a different face as a reference in [16]. The 3D reference face offers an initial estimate of the facial surface parameters, and the algorithm recomputes these parameters so that the Lambertian projection of the final 3D model can be close to the 2D facial image. However, the reference model considered in [16] is a high-resolution face scan (160 pixels on average between the centers of the eyes, if the model is from the FRGC [17]), and the method in [16] fails to handle low-resolution depth maps, like those collected from a Kinect sensor. We consider three distances between a subject's face and the Kinect sensor: 1m, 1.5m and 2m, corresponding to 45, 26 and 20 pixels on average between the centers of the eyes, respectively. The challenge in this setting is not just imposed by the low-resolution depth map; the quantization error on the depth map also poses a substantial threat to the 3D reconstruction.

The rest of the paper is organized as follows: we first present the 3D reconstruction given an RGB-D image in Sec. II, including the handling of low-resolution depth maps corrupted by quantization noise. Given a probe with an arbitrary pose, we present a procedure in Sec. III to align the reconstructed model to the probe using stabilized facial landmarks, followed by the SRC-based recognition. An extensive experimental study on four different databases, namely the Biwi [18], Eurecom [19], CurtinFaces [4] and RGBDFaces, along with a comparison with contemporary approaches, is reported in Sec. IV. Sec. V concludes this study.
II. 3D RECONSTRUCTION WITH RGB-D IMAGES

It is required in our reconstruction phase that the depth map in the RGB-D image reveals the facial surface property to some extent, i.e., the depth map must reveal some 3D face shape. This requirement cannot be met when the distance between the subject and the RGB-D camera is larger than some threshold, which, according to our experiments on the Kinect sensor, is around 1.7 meters. Fig. 1 shows the profile view of the RGB-D images of a face taken at different distances. As the distance increases, the signal-to-noise ratio (S/N) decreases and the quantization noise worsens the measurement of the depth. The quantization noise can substantially corrupt the raw data and make the depth map far from revealing much of the 3D face shape. Therefore, the proposed reconstruction is composed of two stages. The first stage aims at handling the corrupted depth map when the gallery image is taken at a relatively large distance. Given a face image preprocessed by the first stage or taken at a close distance,¹ the second stage aims at reconstructing and refining the depth map so that the difference between the perspective projection of the depth map and the RGB part of the input image is minimized.

¹ Several publicly available databases were made with a close distance, e.g., one meter, between the subject and camera.
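As a rough, hypothetical illustration (not part of the paper's pipeline), one simple way to decide whether the first preprocessing stage is needed is to count the distinct depth levels inside the face crop, since a strongly quantized map collapses to a few levels; the threshold below is an assumed value only.

```cpp
#include <opencv2/core/core.hpp>
#include <set>

// Hypothetical heuristic: a face depth crop dominated by quantization steps
// (i.e., taken far from the sensor) collapses to a small number of distinct
// depth levels, and should be sent through the resurfacing stage first.
// The level threshold below is an assumed value, not from the paper.
bool needsResurfacing(const cv::Mat& faceDepthMM)   // CV_16U face crop, in mm
{
    std::set<unsigned short> levels;
    for (int y = 0; y < faceDepthMM.rows; ++y)
        for (int x = 0; x < faceDepthMM.cols; ++x) {
            unsigned short d = faceDepthMM.at<unsigned short>(y, x);
            if (d > 0) levels.insert(d);             // 0 marks a missing reading
        }
    const size_t kMinLevels = 64;                    // assumed, tune per sensor
    return levels.size() < kMinLevels;
}
```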
Fig. 2. Facial landmarks detected in our experiment on face depth corrupted by quantization noise. The color regions are the landmark patches replaced by high resolution depth patches in the proposed preprocessing phase.
A. Resurfacing of Facial Depth Corrupted by Quantization Noise

This stage is required only for the cases where the raw depth map is deteriorated by quantization noise, like the cases with distances larger than 1.5m in Fig. 1. Although the raw depth map in such cases loses the precision to capture fine depth gradients, it still offers a coarse measure of the overall face depth, and the RGB image, i.e., the RGB part of the RGB-D image, holds some details in the other dimensions of the face. We propose an approach that keeps the coarse measure of the overall face depth while amending the quantized and corrupted depth patches at specific local regions. The approach replaces the quantized depth patches at several landmark regions by the corresponding depth patches from a high-resolution face scan, and resurfaces the face in a way that best describes each replaced depth patch and its neighboring region. It consists of the following steps:
1) Resize the raw depth map D0 to Di such that Di has the same size as a preselected high-resolution 3D face scan Si. The scan Si can be selected from a 3D face database, for example the FRGC [17]. The raw RGB image is also resized to Ii with the same size as Di.
2) Detect the facial landmarks on Ii, Di and Si using an approach developed based on the 2D landmark detection algorithm [20]. Details are given subsequently.
3) Replace the quantized depth patches in a few selected local regions on Di by the fine-layered depth patches taken from the corresponding local regions of Si. These selected local depth patches, called landmark patches, are determined by the subsets of the landmarks at the specific regions. Fig. 2 shows five of these regions used in our study, including the eyes, nose tip, mouth and chin.
4) Compute the surface in the form of a low-order polynomial that best describes each fine-layered landmark patch and its neighboring region with quantization errors. The union of such locally smoothed surfaces gives the amended depth map.
There are two key components in the above procedure: one is the landmark detection, and the other is the smoothed surface estimation given a noisy dataset.
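As a hedged sketch of steps 1 and 3, and of the spirit of the polynomial smoothing in step 4, the fragment below resizes the raw depth map, overwrites the landmark patches with the reference scan's depth, and then runs PCL's Moving Least Squares. Here depthToCloud/cloudToDepth are hypothetical conversion helpers, the patch and search radii are illustrative values, and the exact MLS API varies slightly across PCL versions.

```cpp
#include <opencv2/core/core.hpp>
#include <opencv2/imgproc/imgproc.hpp>
#include <pcl/point_types.h>
#include <pcl/search/kdtree.h>
#include <pcl/surface/mls.h>
#include <vector>

// Hypothetical helpers: convert a metric depth map to a point cloud and back.
pcl::PointCloud<pcl::PointXYZ>::Ptr depthToCloud(const cv::Mat& depth);
cv::Mat cloudToDepth(const pcl::PointCloud<pcl::PointXYZ>& cloud, cv::Size size);

// Steps 1 and 3: resize the raw depth D0 to the reference scan's size and
// replace the quantized landmark patches by the reference depth patches.
cv::Mat replaceLandmarkPatches(const cv::Mat& D0, const cv::Mat& Si,
                               const std::vector<cv::Point>& landmarks)
{
    cv::Mat Di;
    cv::resize(D0, Di, Si.size());                       // step 1
    const int half = 8;                                  // assumed patch radius
    for (size_t k = 0; k < landmarks.size(); ++k) {      // step 3
        cv::Rect roi(landmarks[k].x - half, landmarks[k].y - half,
                     2 * half + 1, 2 * half + 1);
        roi &= cv::Rect(0, 0, Di.cols, Di.rows);         // clip to the image
        Si(roi).copyTo(Di(roi));
    }
    return Di;
}

// Step 4 (in spirit): smooth the amended depth with a low-order polynomial
// Moving Least Squares fit over local neighborhoods.
cv::Mat resurface(const cv::Mat& amendedDepth)
{
    pcl::PointCloud<pcl::PointXYZ>::Ptr cloud = depthToCloud(amendedDepth);
    pcl::search::KdTree<pcl::PointXYZ>::Ptr tree(new pcl::search::KdTree<pcl::PointXYZ>);
    pcl::PointCloud<pcl::PointXYZ> smoothed;

    pcl::MovingLeastSquares<pcl::PointXYZ, pcl::PointXYZ> mls;
    mls.setInputCloud(cloud);
    mls.setSearchMethod(tree);
    mls.setPolynomialOrder(2);          // low-order polynomial surface
    mls.setSearchRadius(0.02);          // assumed radius in meters
    mls.process(smoothed);

    return cloudToDepth(smoothed, amendedDepth.size());
}
```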
1) Landmark Detection: Substantial progress has been made on facial landmark detection in recent years [20]–[22]. The part-based tree-structured model in [20] can simultaneously solve face detection, landmark localization, and pose estimation, and is experimentally proven better than most of the state of the art in all three tasks. The core of the method is a mixture of trees with a shared pool of parts; each facial landmark is modeled as a part, and a global mixture is used to capture the topological variations across pose. The model detects 68 landmarks on faces with yaw angles between ±45° and 39 landmarks in cases with yaw angles beyond this range. It requires the face to be large enough for this many landmarks to be localized. For smaller faces, such as the cases considered in this study with distances to the RGB-D camera larger than 1.5m, the model needs to be redefined. We retrained the model on the CurtinFaces database with faces resized to 27 pixels between the centers of the eyes, and defined 26 landmarks for yaw between ±45° and 16 landmarks for larger angles.

The landmarks obtained from the above settings are based on RGB images. To further stabilize the landmarks on the RGB-D images, each landmark is moved around within its depth neighborhood to see if there is a spot with better localization in depth. This relocation improves the localization of the landmarks on the nose tip, nasal base, eye corners, mouth corners, and around the facial contour.

2) Smoothed Surface Estimation: We applied Moving Least Squares (MLS) [23] to smooth z_{r,0}, the raw depth data of the reference model, so that the measurement noise in z_{r,0} can be removed and the smoothed surface z_r can best approximate z_{r,0}. Given a subset of z_{r,0} in the form of a point cloud, denoted as P_k = {p_i}_{i=1,...,N_k}, the goal is to determine a novel set of points, R_k = {r_i}_{i=1,...,N_k}, on a low-order polynomial that minimizes the distance between P_k and R_k. The smoothed surface z_r can then be obtained from {R_k}_{∀k}. Modified from the MLS reported in [23] for better efficiency, our method is composed of the following steps:

1) Use P_k to determine a local plane H_0 with origin q_0 and normal n_0 so that the following weighted sum can be computed,

$$\sum_{i=1}^{N_k} \left( u_0(x_i, y_i) - \mu_{i,0} \right)^2 \phi\left( \| p_i - q_0 \| \right) \tag{1}$$

where u_0(x_i, y_i) is the distance of r_i to H_0, with the location of its projection onto H_0 given by (x_i, y_i); μ_{i,0} is the distance of p_i to H_0, i.e., μ_{i,0} = n_0 · (p_i − q_0); and φ(·) is a Gaussian function so that the points closer to q_0 are weighted more. R_k is assumed to be described by a low-order polynomial in terms of the coordinates (x_i, y_i) on H_0, i.e., r_i = f(x_i, y_i | Θ_0) and u_0(x_i, y_i) = n_0 · (f(x_i, y_i | Θ_0) − q_0), where f(x_i, y_i | Θ_0) is a polynomial surface with parameters Θ_0 that define the local geometry of R_k.

2) Because H_0 can be uniquely defined given q_0 and n_0, one can change them to q_1 and n_1 and obtain a novel plane H_1. Given that the order of the polynomial f(x_i, y_i | Θ) is fixed (so that the number of parameters
of f(x_i, y_i | Θ) is fixed), a parameter estimation problem can be defined as the minimization of the weighted sum:

$$\Theta_k^*,\; n_k^*,\; q_k^* = \arg\min_{\Theta,\, n,\, q} \sum_{i=1}^{N_k} \left( u(x_i, y_i) - \mu_i \right)^2 \phi\left( \| p_i - q \| \right) \tag{2}$$
The above can be repeated on the other subsets {P_k}_{∀k} for estimating {Θ_k, n_k, q_k}_{∀k} and {R_k}_{∀k}. A key issue in this scheme is the initial estimates of n_0 and q_0. A few possible ways are given in [23]; however, from our experiments we found that the minimum principal component extracted from P_k offers a good estimate of n_0, and the centroid of P_k is appropriate as q_0. To extract the principal components, one needs to solve for the eigenvectors of the covariance C_k,

$$C_k = \frac{1}{N_k} \sum_{i=1}^{N_k} (p_i - q_0)(p_i - q_0)^T \tag{3}$$
where q_0 is the centroid of P_k and is considered the origin of the initial plane H_0. The normal vector n_0 of H_0 is given by the eigenvector of C_k associated with the lowest eigenvalue. Following the above approach, the surface normal n_r can be obtained from the estimated polynomials f(x_i, y_i | Θ_k). Given n_r and the associated 2D image I_r, ρ_r can be estimated using the method presented in the next section with some simplification.

B. Iterative Face Surface Estimation

Assuming that the face surface is Lambertian, I(x, y), its projection on an image plane, can be written as

$$I(x, y) = \rho(x, y)\, h(x, y) \cdot n(x, y) = \rho(x, y)\, R(x, y) \tag{4}$$
where ρ(x, y) is the surface albedo at the point (x, y), h(x, y) ∈ R³ is the lighting cast on (x, y) with an intensity in each of the three directions, n(x, y) is the face surface normal at (x, y), and the reflectance R(x, y) = h(x, y) · n(x, y). The irradiance model in (4) connects the 3D surface, in terms of its normal n(x, y), to its 2D projection I(x, y). For simplicity of notation, the coordinates (x, y) are often dropped in the rest of the paper, and n(x, y), for example, is written as n. It is assumed that the reflectance can be approximated by spherical harmonics, i.e.,

$$R(x, y) \approx l \cdot Y(n) \tag{5}$$

where l is the lighting coefficient vector and Y(n) is the spherical harmonic vector [24] which, in the second-order approximation, takes the following form:

$$Y(n) = \left[ c_0,\; c_1 n_x,\; c_1 n_y,\; c_1 n_z,\; c_2 n_x n_y,\; c_2 n_x n_z,\; c_2 n_y n_z,\; c_2 (n_x^2 - n_y^2)/2,\; c_2 (3 n_z^2 - 1)/(2\sqrt{3}) \right]^T \tag{6}$$

where c_0 = 1/√(4π), c_1 = √3/√(4π), and c_2 = 3√5/√(12π). The difference between h · n and l · Y(n) is that the lighting intensity and direction are all merged into h in the former and separated from n, while in the latter they are split into the lighting vector l and the spherical harmonics Y(n), which depend solely on the components of the surface normal n, namely n_x, n_y and n_z.

The estimation of the depth z(x, y), the surface normal n and the albedo ρ can be formulated as the minimization of ||I − ρ l · Y(n)|| subject to the constraints L_z ∗ d_z ≈ 0 and L_ρ ∗ d_ρ ≈ 0, where d_z = z(x, y) − z_s(x, y), d_ρ = ρ(x, y) − ρ_s(x, y), z_s(x, y) is the depth in the RGB-D image, and ρ_s(x, y) is the albedo estimated from the RGB-D image, which can initially be assumed to be a Gaussian-blurred grayscale image. L_z and L_ρ are the Laplacian of Gaussian (LoG) operators that identify the spots with large changes in the depth and albedo differences, respectively. Such a formulation allows one to solve the constrained minimization using the following Lagrangian function L:

$$\mathcal{L} = \sum_{x,y} \left( I - \rho\, l \cdot Y(n) \right)^2 + \lambda_1 \left( L_z * d_z \right)^2 + \lambda_2 \left( L_\rho * d_\rho \right)^2 \tag{7}$$

where λ_1 and λ_2 are Lagrange multipliers. Because the RGB-D camera is basically a stereo camera with a known baseline and camera parameters, one can readily align the RGB image with the depth image. Given the aligned RGB-D image, there are several ways to solve the minimization of L over l, z and ρ, and the following steps are exploited for better efficiency:

1) Assuming that z = z_s and ρ = ρ_s, and computing the surface normal n from z_s, the suboptimal spherical harmonic coefficients l* can be obtained by solving the squared error term, Σ_{x,y} (I − ρ l · Y(n))².

2) Given the obtained l* and keeping z = z_s, the most appropriate albedo ρ can be determined by solving the partial minimization, Σ_{x,y} (I − ρ l* · Y(n))² + λ_2 (L_ρ ∗ d_ρ)². This ends up with ρ* = I/(l* · Y(n)) and the parameter in L_ρ that makes L_ρ ∗ d_ρ ≈ 0.

3) Given (7) with l* and ρ*, the suboptimal depth z* can be determined by first solving an approximated z̃* using the surface normal n computed from z_s, and then using z̃* to compute an intermediate normal n_i. n_i can then be used to update z̃* to z*.

4) Repeat the above until the difference between successive depths, ||z*(i + 1) − z*(i)||, falls below a threshold.

The above resurfacing scheme is applied to each face in the gallery to amend the facial depth corrupted by the quantization noise. Each amended face is influenced by the high-resolution 3D face scan whose landmarked depth patches were used for resurfacing. The high-resolution 3D face scan can be taken from a database, and we chose the FRGC [17] in the experiments. We have observed that, for example, an Asian face in the gallery can be Caucasianized in parts if the 3D face scan comes from a Caucasian, and vice versa. A sample is shown in Fig. 3, where the ground truth is an Asian male (A.M.) and the reconstructed profiles vary depending on the high-resolution 3D face scan used in the resurfacing.
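As an illustration of the spherical harmonics in (6) and the lighting estimate of step 1 above, the sketch below evaluates the second-order basis for a unit normal and solves the linear least-squares problem for l* with OpenCV. It is a simplified stand-in, not the paper's implementation, and assumes the per-pixel intensities, normals (computed from z_s) and albedo values have already been gathered into vectors.

```cpp
#include <opencv2/core/core.hpp>
#include <cmath>
#include <vector>

// Second-order spherical harmonics basis Y(n) of Eq. (6) for a unit normal.
cv::Mat shBasis(double nx, double ny, double nz)
{
    const double kPi = 3.14159265358979323846;
    const double c0 = 1.0 / std::sqrt(4.0 * kPi);
    const double c1 = std::sqrt(3.0) / std::sqrt(4.0 * kPi);
    const double c2 = 3.0 * std::sqrt(5.0) / std::sqrt(12.0 * kPi);
    double y[9] = { c0,
                    c1 * nx, c1 * ny, c1 * nz,
                    c2 * nx * ny, c2 * nx * nz, c2 * ny * nz,
                    c2 * (nx * nx - ny * ny) / 2.0,
                    c2 * (3.0 * nz * nz - 1.0) / (2.0 * std::sqrt(3.0)) };
    return cv::Mat(9, 1, CV_64F, y).clone();    // clone: y is a local buffer
}

// Step 1 (sketch): with z = z_s and a fixed albedo, stack one row
// rho * Y(n)^T per pixel and solve the linear least squares for l*.
cv::Mat estimateLighting(const std::vector<double>& intensity,   // I(x, y)
                         const std::vector<cv::Vec3d>& normal,   // n from z_s
                         const std::vector<double>& albedo)      // rho(x, y)
{
    const int N = static_cast<int>(intensity.size());
    cv::Mat A(N, 9, CV_64F), b(N, 1, CV_64F);
    for (int i = 0; i < N; ++i) {
        cv::Mat Yn = shBasis(normal[i][0], normal[i][1], normal[i][2]);
        for (int j = 0; j < 9; ++j)
            A.at<double>(i, j) = albedo[i] * Yn.at<double>(j, 0);
        b.at<double>(i, 0) = intensity[i];
    }
    cv::Mat l;                                  // lighting coefficients (9 x 1)
    cv::solve(A, b, l, cv::DECOMP_SVD);         // least-squares solution l*
    return l;
}
```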
Fig. 3. Profiles of the ground truth (G.T.) and the resurfaced faces using four different high resolution 3D face scans: Asian female (A.F.), Asian male (A.M.), Caucasian female (C.F.) and Caucasian male (C.M.).

Fig. 4. The reconstructed model is aligned with the probe image according to its pose and scale determined by the landmark detection at poses L30 and L60.

III. LANDMARK-ASSISTED AND SRC-BASED RECOGNITION

Given a probe image, one must determine the pose and scale of the 2D face to be generated by the reconstructed 3D face model, and the pose and scale must align with those of the probe, so that the generated face can be matched against the probe. We handle this alignment issue using the facial landmark detection algorithm proposed in [20]. The algorithm can simultaneously solve for face detection, landmark localization, and pose estimation, and is experimentally proven better than most of the state of the art in all three tasks. We only use a partial set of the landmarks for alignment, including those on the eye corners, both ends of the eyebrows, the nose and nostrils, and the mouth corners, as shown in Fig. 4. This leads to 16 landmarks for poses with yaw angles ≤30° and 12 landmarks for larger poses, because in the latter cases the eye and eyebrow on the farther side become invisible.

Because Sparse Representation-based Classification (SRC) has proven effective for face recognition, especially in handling illumination, expression and occlusion [9], [26], but has rarely been applied to tackling pose, it is explored in this study. Given a set of m 3D reconstructed gallery faces, denoted as M = [M_1, M_2, ..., M_m], and a probe q, all labeled with the aforementioned landmarks, the core part of SRC solves for the linear representation of q in the span of P, where P = [P_1, P_2, ..., P_m] is a matrix with its column P_i being a feature vector extracted from M_i (details on the extraction of P_i are given subsequently). One can therefore write q = Pr + μ, where r is a sparse vector and μ is noise with bounded energy, i.e., ||μ||_2 < ε. Following the rules in compressive sensing [9], r can be obtained by solving the following l_1-minimization:

$$r^* = \arg\min_r \|r\|_1, \quad \text{subject to } \|q - Pr\|_2 \leq \epsilon \tag{8}$$
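For illustration only, the fragment below solves the unconstrained (Lagrangian) form of this l_1-minimization with a plain ISTA (iterative shrinkage-thresholding) loop rather than the TNIP solver discussed next; P is assumed to be a column-normalized CV_64F feature matrix, and the step size, λ and iteration count are illustrative values.

```cpp
#include <opencv2/core/core.hpp>

// Soft-thresholding operator used by ISTA.
static cv::Mat softThreshold(const cv::Mat& v, double t)
{
    cv::Mat a = cv::abs(v);
    cv::Mat s = a - t;                          // shrink magnitudes by t
    s.setTo(0, s < 0);
    cv::Mat sign = cv::Mat::zeros(v.size(), v.type());
    sign.setTo(1, v > 0);
    sign.setTo(-1, v < 0);
    return s.mul(sign);
}

// ISTA sketch for  min_r 0.5*||q - P r||_2^2 + lambda*||r||_1.
// P: d x m gallery feature matrix (CV_64F), q: d x 1 probe feature vector.
cv::Mat sparseCode(const cv::Mat& P, const cv::Mat& q,
                   double lambda = 0.01, int iters = 200)
{
    cv::Mat r = cv::Mat::zeros(P.cols, 1, P.type());
    double f = cv::norm(P);                     // Frobenius norm of P
    double L = f * f + 1e-12;                   // crude Lipschitz bound on ||P^T P||
    for (int k = 0; k < iters; ++k) {
        cv::Mat grad = P.t() * (P * r - q);     // gradient of the quadratic term
        r = softThreshold(r - grad / L, lambda / L);
    }
    return r;   // the probe is assigned the identity with the largest coefficient
}
```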
A comprehensive discussion on the solutions for the above l_1-minimization is given in [25], where five fast algorithms were evaluated on face recognition performance under illumination variations. We select one of the best, the TNIP (Truncated Newton Interior-Point) algorithm, for our experiments. The TNIP exploits gradient projection (GP) and searches for the sparse vector r along certain gradient directions with a fast convergence speed. It reformulates problem (8) into the following form:

$$r^* = \arg\min_r \frac{1}{2} \|q - Pr\|_2^2 + \lambda \|r\|_1 \tag{9}$$

where λ is the Lagrange multiplier. Such a formulation enables the solution using quadratic programming.

In summary, the landmark-assisted and SRC-based recognition consists of the following steps:
1) Use the landmarks in q to estimate the nominal yaw and pitch angles of the probe, denoted as (φ_r, φ_p), respectively.
2) Rotate M_i to (φ_r, φ_p), and capture the 2D projection of M_i on an image plane, denoted as P_i^{r,p}. Determine a scale factor s_i so that the distance between the landmarks on P_i^{r,p} and q is minimized. This involves the conformation of the regions of interest in P_i^{r,p} and q.
3) Form a matrix P^{r,p} = [p_1^{r,p}, p_2^{r,p}, ..., p_m^{r,p}], where p_i^{r,p} is a feature vector extracted from P_i^{r,p}; it can be normalized downsampled pixel intensities or other features. The Local Binary Pattern (LBP) [27] is chosen as it gives the most consistent results in our experiments.
4) Solve r*_{r,p} = arg min_r (1/2)||q − P^{r,p} r||_2^2 + λ||r||_1 with quadratic programming, and determine the best-matched subject in the gallery by locating the maximum sparse coefficient in r*_{r,p}.
5) To make the above approach more robust to pose and scale misalignment, one may generate multiple model-based images within some neighborhood of the nominal yaw and pitch angles, so that each subject contributes a set of images and the base matrix can be expressed as P^{r,p,δ} = [P_1^{r,p,δ}, P_2^{r,p,δ}, ..., P_m^{r,p,δ}], where δ defines the range of the neighborhood and P_i^{r,p,δ} = [p_i^{r±δ_1, p±δ_1}, p_i^{r±δ_2, p±δ_2}, ..., p_i^{r±δ_k, p±δ_k}].

IV. EXPERIMENTS

We selected the Biwi Kinect Head Pose Database (abbreviated as Biwi) [18], the Eurecom Kinect Database (Eurecom) [19], the CurtinFaces [4] and the RGBDFaces (made for this research, to be introduced below) for the experiments. The experiments were run on a Windows-7 workstation with a 3.4GHz CPU and 8GB of RAM, and the code was written in C/C++ with OpenCV ver. 2.4.6 and PCL (Point Cloud Library) ver. 1.6. On average, reconstruction takes 38.5 secs for each face and recognition takes 0.3 sec.

Biwi contains 20 subjects with pose variations within ±75° in yaw, ±60° in pitch, and ±50° in roll. It comprises 24 sequences with 16 subjects recorded once and 4
Fig. 5. From left, Column 1 are the probe images, Column 2 are the RGB-D images directly taken from the Kinect sensor, Column 3 are the RGB-D images processed by an interpolation filter, Column 4 are the 2D faces generated based on the proposed RGB-D reconstruction, Column 6 are generated using [16] with an RGB image and a 3D reference model, Column 5 are generated in the same way but with the reference depth replaced by the depth map from the RGB-D image.
subjects twice, at around 1 meter away from the RGB-D camera (Kinect). The faces are 90 × 110 pixels in size on average. Eurecom contains 52 subjects taken in two sessions. In each session, the faces are captured in RGB-D image pairs with nine states, namely neutral, smile, open mouth, left profile, right profile, occlusion on eyes, occlusion on mouth, occlusion using paper, and light on. Although this study focuses on pose, we also test the performance on the other states.

To highlight the contribution of the depth map in the RGB-D image, we compare the proposed approach with face reconstruction using the RGB image only, following the state-of-the-art approach in [16]. A sample of the results on Biwi is shown in Fig. 5. The first column from the left shows the probe images. The second column, “RGB-D original”, shows the RGB-D images directly taken from the Kinect sensor. The third column, “RGB-D interpolated”, shows the 2D projected images based on the 3D faces built by running a bilateral filter on the raw RGB-D images to interpolate the fragmented point clouds. The fourth column, “RGB-D reconst.”, shows the 2D projected images based on the 3D faces using the proposed reconstruction. The rightmost column, “Ref. only”, shows the projected faces using the approach in [16] with a single RGB image only and the 3D scanned model of a different face as reference. The second column from the right, “Ref.+Depth map”, shows the case in which the reference 3D model in [16] is replaced by the depth map from the RGB-D image.

The contribution of the depth map can be seen in the fifth column, “Ref.+Depth map”, where the reference depth is replaced by the depth map of the RGB-D image. The reconstructed faces are quite similar to those in the fourth column, “RGB-D reconst.” The recognition performances of the four cases are shown in Fig. 6 for different yaw angles. The recognition rate of “Ref. only” degrades substantially when
Fig. 6. Recognition rates of the selected methods, “RGB-D reconst.”, “RGB-D interpolated”, “Ref.+ Depth map”, and “Ref. only”, with yaw varying from L60 (60° to the left) to R60 (60° to the right) on Biwi database.
the yaw angle ≥30°. The performances of “RGB-D reconst.” and “Ref.+Depth map” are close, although the former appears slightly better. This closeness reveals that the raw depth map dominates the depth recovery and the reference model does not contribute much in this case. One of the main themes in this study is to quantitatively justify the contribution made by the additional depth map. The closeness between the fourth and fifth columns highlights the fact that the depth map is crucial for precise face reconstruction, and the 3D reference face model proposed in [16] can be completely ignored if an RGB-D camera can be used for user registration. Quite a few RGB-based methods handling cross-pose recognition require a 3D reference face model for depth initialization and propose ways for recovering the depth of gallery faces [7], [13], [16]. The fifth column, “Ref.+Depth map”, is an example showing that an RGB-based method can be substantially improved if the faces in the gallery are taken using an RGB-D camera, instead of a
Fig. 7. A sample set from our RGBDFaces database, which contains 28 subjects and 11 poses (L75°, L60°, L45°, . . . , R60°, R75°), shown here at 3 of the distances (1m, 1.5m and 2m) between the Kinect camera and the subject.
Fig. 9. A sample from the Eurecom database; the top row from the right are Neutral, Light On, Left Profile (90° yaw to the left), Right Profile (90° yaw to the right), Occlusion on Eyes; the bottom row are Occlusion on Mouth, Occlusion by Paper, Mouth Open and Smile.

Fig. 8. Performance comparison of “RGB-D reconst.” and “RGB-D interpolated” on the RGBDFaces dataset with 11 poses and 3 distances (1m, 1.5m and 2m) between the Kinect sensor and the subject.
regular RGB camera. Although considered out of the scope of this paper, it would be a valuable study to evaluate RGB-based methods with the required 3D reference face replaced by the depth map from an RGB-D camera, extending the results from the RGB domain to the RGB-D domain.

Fig. 6 also shows that “RGB-D interpolated” performs well; however, this is the case with dense RGB-D data when the subject is close to the RGB-D camera, i.e., ≤1m. Its performance degrades as the distance to the RGB-D camera increases, because the quantization error imposed on the depth map also increases. To study the performance variation of the proposed approach with RGB-D images taken at different distances, we built a new RGB-D face dataset of our own, called RGBDFaces.² RGBDFaces contains 28 subjects, 11 poses (L75°, L60°, L45°, . . . , R60°, R75°) and 5 distances (1m, 1.2m, 1.5m, 1.7m and 2m) between the Kinect sensor and the subject. The set of a sample subject is shown in Fig. 7.

² Available at https://sites.google.com/site/avlrgbdfacedatabase.

This database allows us to compare the performance of “RGB-D reconst.” and “RGB-D interpolated”, and the result is shown in Fig. 8, where only 3 distances are shown for better readability. Although the interpolation filter can be an option when the distance to the Kinect sensor is short, i.e., ≤1m, its performance degrades significantly when the subject moves slightly farther, e.g., ≥1.5m, from the Kinect
sensor. “RGB-D reconst.” with the proposed resurfacing maintains its performance well as the distance increases.

An example of the 9 settings in the Eurecom database is shown in Fig. 9, with each face labeled with landmarks. When testing on images with occlusion and expression, only the regions with landmarks are considered, i.e., the recognition is performed on partial regions of the faces. We use the neutral faces in Session 1 as the gallery set to reconstruct the 3D facial models, and the remaining 8 settings, along with those in Session 2, as the probe set. The recognition performance is shown in Fig. 10. Because the model-based frontal faces are almost equally well built by “RGB-D reconst.”, “Ref.+Depth map” and “Ref. only”, and only the regions with landmarks are used for handling occlusions and smile, the recognition performances also appear equally good. The most difficult cases, observed with the lowest recognition rates, are the left and right profiles; however, similar to the tests on Biwi, the proposed “RGB-D reconst.” performs the best, followed by “Ref.+Depth map”, and “Ref. only” performs the worst. Among the rest of the settings, the cases with sunglasses and occlusion with paper are also relatively difficult to handle.

We also compared the proposed “RGB-D reconst.” with Li et al. [4] and Ciaccio et al. [5] on the CurtinFaces database. CurtinFaces offers more than 5,000 images of 52 subjects. A pose-expression subset is selected for this comparison, which contains 5 poses (L60°, L30°, 0°, R30° and R60°) and 7 expressions per pose, plus the profile poses, i.e., L90° and R90° with neutral expression,
Fig. 10. Performance comparison of “RGB-D reconst.”, “Ref.+Depth map”, and “Ref. only” on the Eurecom dataset with 9 settings: Neutral, Light On, Left Profile (90° yaw to the left), Right Profile (90° yaw to the right), Occlusion on Eyes, Occlusion on Mouth, Occlusion by Paper, Mouth Open and Smile.
Fig. 11. Performance comparison of “RGB-D reconst.” with Li et al. [4] and Ciaccio et al. [5] on the pose-expression subset of the CurtinFaces database.
yielding a total of 1,924 images. Only one pair of RGB-D images at 0° is used as the gallery for each subject, and the rest of the RGB images are all probes (unlike [4], which considers RGB-D images as probes). Fig. 11 shows that the approach in [5] is outperformed by the proposed approach and [4], both of which perform well up to yaw 60° with recognition rates over 97%. However, the proposed approach outperforms [4] at 90°.

The cases that failed to be recognized are primarily caused by inaccuracies in the reconstruction and landmark detection. For the cases without much quantization noise, i.e., with a distance to the camera ≤1m, the facial depth can be captured at a fine depth resolution and most gallery faces can be accurately reconstructed. The few failures in such cases are mostly caused by the misalignment between the probe and the gallery faces due to inaccuracy in the landmark detection. However, for the cases with large quantization noise (distance ≥1.5m), most of the failures are caused by inaccuracy in the resurfacing, because the resurfaced face loses some similarity to the actual face, although the misalignment between the probe and gallery faces also leads to some failures. Fig. 12 shows a couple of cases that failed to be recognized in our experiments. The 1.5m-Rec. and 1.7m-Rec. are the resurfaced faces given the raw data 1.5m-Raw and 1.7m-Raw, respectively. The nose and eyes are not appropriately reconstructed, causing the SRC classifier to fail to find the right match.

Fig. 12. A failure sample taken from the cases with large quantization noise, caused mostly by components, e.g., the nose and eyes, being inaccurately reconstructed, making the SRC classifier fail to find the right match. 1.5m-Rec. (1.7m-Rec.) is the reconstruction from the raw data 1.5m-Raw (1.7m-Raw).

V. CONCLUSION
The major differences between other RGB-D based face recognition studies and this one are the following: 1) Others consider RGB-D images available in both the gallery and probe sets, while ours considers RGB-D images in the gallery set only, and the probe set is composed of RGB images without the depth map; 2) Others have not considered the cases with significant quantization noise on the depth map, which are encountered when the subject is more than 1.2m away from the RGB-D camera. We propose a resurfacing scheme to handle the quantization noise, and an approach to exploit the RGB-D images for face reconstruction and recognition. This study shows that, because of the depth map in the RGB-D image, the 3D face can be accurately reconstructed and used to recognize faces with extreme poses and other variables. The experiments also reveal that other 3D face reconstruction methods may benefit from using RGB-D images for recovering the face depth. RGB-D based face recognition is at an early stage. New algorithms, protocols and databases are expected to emerge soon, and this study is one such endeavor.

REFERENCES

[1] B. Holt, E.-J. Ong, H. Cooper, and R. Bowden, “Putting the pieces together: Connected poselets for human pose estimation,” in Proc. IEEE Int. Conf. Comput. Vis. Workshops, Nov. 2011, pp. 1196–1201.
[2] N. Silberman, D. Hoiem, P. Kohli, and R. Fergus, “Indoor segmentation and support inference from RGBD images,” in Proc. 12th Eur. Conf. Comput. Vis., 2012, pp. 746–760.
[3] P. Henry, M. Krainin, E. Herbst, X. Ren, and D. Fox, “RGB-D mapping: Using Kinect-style depth cameras for dense 3D modeling of indoor environments,” Int. J. Robot. Res., vol. 31, no. 5, pp. 647–663, Apr. 2012.
[4] B. Y. L. Li, A. S. Mian, W. Liu, and A. Krishna, “Using Kinect for face recognition under varying poses, expressions, illumination and disguise,” in Proc. IEEE Workshop Appl. Comput. Vis., Jan. 2013, pp. 186–192.
[5] C. Ciaccio, L. Wen, and G. Guo, “Face recognition robust to head pose changes based on the RGB-D sensor,” in Proc. IEEE 6th Int. Conf. Biometrics, Theory, Appl., Syst., Sep./Oct. 2013, pp. 1–6.
[6] G. Goswami, S. Bharadwaj, M. Vatsa, and R. Singh, “On RGB-D face recognition using Kinect,” in Proc. IEEE 6th Int. Conf. Biometrics, Theory, Appl., Syst., Sep./Oct. 2013, pp. 1–6.
[7] U. Prabhu, J. Heo, and M. Savvides, “Unconstrained pose-invariant face recognition using 3D generic elastic models,” IEEE Trans. Pattern Anal. Mach. Intell., vol. 33, no. 10, pp. 1952–1961, Oct. 2011.
[8] X. Zhang and Y. Gao, “Heterogeneous specular and diffuse 3-D surface approximation for face recognition across pose,” IEEE Trans. Inf. Forensics Security, vol. 7, no. 2, pp. 506–517, Apr. 2012.
[9] J. Wright, A. Y. Yang, A. Ganesh, S. S. Sastry, and Y. Ma, “Robust face recognition via sparse representation,” IEEE Trans. Pattern Anal. Mach. Intell., vol. 31, no. 2, pp. 210–227, Feb. 2009.
[10] N. Dalal and B. Triggs, “Histograms of oriented gradients for human detection,” in Proc. IEEE Comput. Soc. Conf. Comput. Vis. Pattern Recognit., vol. 2, Jun. 2005, pp. 886–893.
[11] V. Blanz and T. Vetter, “Face recognition based on fitting a 3D morphable model,” IEEE Trans. Pattern Anal. Mach. Intell., vol. 25, no. 9, pp. 1063–1074, Sep. 2003.
[12] D. Jiang, Y. Hu, S. Yan, L. Zhang, H. Zhang, and W. Gao, “Efficient 3D reconstruction for face recognition,” Pattern Recognit., vol. 38, no. 6, pp. 787–798, Jun. 2005.
[13] J. Heo and M. Savvides, “Gender and ethnicity specific generic elastic models from a single 2D image for novel 2D pose face synthesis and recognition,” IEEE Trans. Pattern Anal. Mach. Intell., vol. 34, no. 12, pp. 2341–2350, Dec. 2012.
[14] T. Sim, S. Baker, and M. Bsat, “The CMU pose, illumination, and expression database,” IEEE Trans. Pattern Anal. Mach. Intell., vol. 25, no. 12, pp. 1615–1618, Dec. 2003.
[15] R. Min, J. Choi, G. Medioni, and J. Dugelay, “Real-time 3D face identification from a depth camera,” in Proc. IEEE 1st Int. Conf. Pattern Recognit., Nov. 2012, pp. 1739–1742.
[16] I. Kemelmacher-Shlizerman and R. Basri, “3D face reconstruction from a single image using a single reference face shape,” IEEE Trans. Pattern Anal. Mach. Intell., vol. 33, no. 2, pp. 394–405, Feb. 2011.
[17] P. J. Phillips et al., “Overview of the face recognition grand challenge,” in Proc. IEEE Comput. Soc. Conf. Comput. Vis. Pattern Recognit., vol. 1, Jun. 2005, pp. 947–954.
[18] G. Fanelli, M. Dantone, J. Gall, A. Fossati, and L. Van Gool, “Random forests for real time 3D face analysis,” Int. J. Comput. Vis., vol. 101, no. 3, pp. 437–458, 2013.
[19] T. Huynh, R. Min, and J.-L. Dugelay, “An efficient LBP-based descriptor for facial depth images applied to gender recognition using RGBD face data,” in Proc. Asian Conf. Comput. Vis. Workshop, 2013, pp. 133–145.
[20] X. Zhu and D. Ramanan, “Face detection, pose estimation, and landmark localization in the wild,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., Jun. 2012, pp. 2879–2886.
[21] P. N. Belhumeur, D. W. Jacobs, D. Kriegman, and N. Kumar, “Localizing parts of faces using a consensus of exemplars,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., Jun. 2011, pp. 545–552.
[22] J. M. Saragih, S. Lucey, and J. F. Cohn, “Deformable model fitting by regularized landmark mean-shift,” Int. J. Comput. Vis., vol. 91, no. 2, pp. 200–215, 2011.
[23] M. Alexa, J. Behr, D. Cohen-Or, S. Fleishman, D. Levin, and C. T. Silva, “Computing and rendering point set surfaces,” IEEE Trans. Vis. Comput. Graphics, vol. 9, no. 1, pp. 3–15, Jan./Mar. 2003.
[24] M. Hazewinkel, Encyclopaedia of Mathematics: An Updated and Annotated Translation of the Soviet ‘Mathematical Encyclopaedia’. The Netherlands: Springer-Verlag, 1997.
[25] A. Y. Yang, S. S. Sastry, A. Ganesh, and Y. Ma, “Fast l1-minimization algorithms and an application in robust face recognition: A review,” in Proc. 17th IEEE Int. Conf. Image Process., Sep. 2010, pp. 1849–1852.
[26] W. Deng, J. Hu, and J. Guo, “Extended SRC: Undersampled face recognition via intraclass variant dictionary,” IEEE Trans. Pattern Anal. Mach. Intell., vol. 34, no. 9, pp. 1864–1870, Sep. 2012.
[27] M. Heikkila and M. Pietikainen, “A texture-based method for modeling the background and detecting moving objects,” IEEE Trans. Pattern Anal. Mach. Intell., vol. 28, no. 4, pp. 657–662, Apr. 2006.
Gee-Sern (Jison) Hsu (M’09) received the dual M.S. degree in electrical and mechanical engineering, and the Ph.D. degree in mechanical engineering from the University of Michigan, Ann Arbor, MI, USA, in 1993 and 1995, respectively. He was a Post-Doctoral Fellow with the University of Michigan from 1995 to 1996, and a Senior Research Staff with the National University of Singapore, Singapore, from 1997 to 2000. In 2001, he joined Penpower Technology, Fremont, CA, USA, where he led research on face recognition and intelligent video surveillance. In 2007, he joined the Department of Mechanical Engineering, National Taiwan University of Science and Technology, Taipei, Taiwan, where he is currently an Associate Professor. His research interests are in the areas of computer vision and pattern recognition, in particular, face recognition and license plate recognition. Dr. Hsu and his team received the Best Innovation Award at the SecuTech Expo from 2005 to 2007.
Yu-Lun Liu received the B.S. degree in aerospace engineering from Tamkang University, Taipei, Taiwan, in 2012, and the M.S. degree in mechanical engineering from the National Taiwan University of Science and Technology, Taipei, Taiwan, in 2014, where he is currently pursuing the Ph.D. degree with the Graduate Institute of Color and Illumination Technology. His research interests include computer graphics and face recognition.
Hsiao-Chia Peng received the B.S. degree in mechanical engineering from National Chung Cheng University, Chiayi, Taiwan, in 2007. She is currently pursuing the Ph.D. degree in mechanical engineering with the National Taiwan University of Science and Technology, Taipei, Taiwan. Her research interests include image processing, pattern recognition, and face recognition.
Po-Xun Wu received the B.S. degree in material science and engineering from the National Taiwan University of Science and Technology, Taipei, Taiwan, in 2012, where he is currently pursuing the M.S. degree in mechanical engineering. His research interest focuses on face recognition.