Computer Vision Winter Workshop 2006, Ondřej Chum, Vojtěch Franc (eds.), Telč, Czech Republic, February 6–8, Czech Pattern Recognition Society
Estimation of Face Depth Maps from Color Textures using Canonical Correlation Analysis
Michael Reiter¹, René Donner¹,², Georg Langs¹,², and Horst Bischof²
Abstract We propose a method for estimating face depth maps from color face images. The method is based on Canonical Correlation Analysis (CCA) which exploits the correlation between face color texture and surface depth. The results of experiments conducted on a database of 218 3D scans with corresponding color images show that only a small number of canonical factors are needed to describe the functional relation of depth and texture with reasonable accuracy.
1 Introduction
The recovery of depth and shape information from 2D face images makes it possible to deal with the effects of changing illumination conditions and viewing angle [16]. It can be used, for example, to remove or reduce the effects of illumination and thereby increase recognition accuracy in complex lighting situations. As another example, consider a face image acquired by a surveillance camera showing the face at an arbitrary viewing angle. Matching with a frontal-view image stored in a database could be performed by transforming the stored image, i.e. rendering a synthetic face image with the corresponding viewing angle and lighting conditions using a 3D depth map of the face. Different approaches exist for recovering shape from 2D face images. Yuille et al. [16] use Singular Value Decomposition (SVD) to reconstruct shape and albedo from multiple images under varying illumination conditions. By isolating each of the factors that govern the face appearance, their model allows the shape of faces to be predicted and face images to be generated under new illumination conditions using less training data than standard appearance models. Shape-from-shading algorithms [17, 7] have been applied to face images, where different constraints such as symmetry or (piece-wise) constant albedo are used to render the problem well-posed. In statistical approaches [3, 1, 2, 14] the relationship of shape and intensity is learned from a set of training examples, i.e. intensity images with corresponding shapes. In [3] a 3D morphable model is learned from 3D scans, which can be matched with new input images by an iterative optimization procedure.
1 Pattern Recognition and Image Processing Group, Vienna University of Technology, Favoritenstr. 9, A-1040 Vienna, Austria
2 Institute for Computer Graphics and Vision, Graz University of Technology, Inffeldg. 16 2.OG, A-8010 Graz, Austria
Figure 1: Overview of the algorithm: During training, canonical factor pairs are generated from a set of training examples. They are used for the prediction of a depth map from RGB data during application.
The parameters of the matched model can be used to determine the shape of the input face. In this paper, we propose a statistical method for predicting 3D depth maps of faces from frontal-view color face images based on canonical correlation analysis (CCA) [10]. The basic idea of our approach is that the relationship of depth and face texture, as a combined effect of illumination direction, albedo and shape, can be modelled effectively with a small number of factor pairs, i.e., correlated linear features in the space of depth images and color images. The method is not limited to face images but can generally be applied to other classes of objects (surfaces) having similar structure, variability and shadows. An overview of the proposed method is depicted in Fig. 1. The CCA approach allows the vector spaces of color images and the corresponding shapes to be taken into account simultaneously. It determines linear combinations of variables (canonical variates) in each of the two signals, which are
pairwise maximally correlated. The directions of maximum correlation (canonical factors) capture relevant signal components, constituting the functional relation of the two signals. There exist a number of related regression techniques, such as Multivariate Linear Regression (MLR) [8], Partial Least Squares [9, 8] and Reduced-Rank Wiener Filtering (see, for example, [6]). CCA, in particular, has some very attractive properties (for example, it is invariant w.r.t. affine transformations, and thus scaling, of the input variables) and can not only be used for regression purposes, but whenever one needs to establish a relation between two sets of measurements (e.g., finding corresponding points in stereo images [4]). In signal processing, CCA is used for optimal reduced-rank filtering [11], where the goal is data reduction, robustness against noise and high computational efficiency. It has also been successfully applied to pattern classification [13], appearance-based 3D pose estimation [12] and stereo vision [4]. The rest of this paper is organized as follows: Section 2 introduces Canonical Correlation Analysis. In Section 3 the experimental setup is explained and the results are presented, while Section 4 provides a conclusion and an outlook.
2 Canonical Correlation Analysis for prediction of depth maps from intensity images
Canonical Correlation Analysis is a very powerful tool that is especially well suited for relating two sets of measurements (signals). Like principal component analysis (PCA), CCA reduces the dimensionality of the original signals, since only a few factor pairs are normally needed to represent the relevant information; unlike PCA, however, CCA takes into account the relationship between the two signal spaces (in the correlation sense), which makes it better suited for regression tasks than PCA. We regard the image vector of the RGB color image and the depth map as two correlated random vectors $\mathbf{x} \in \mathbb{R}^p$ and $\mathbf{y} \in \mathbb{R}^q$, where $p$ and $q$ correspond to the dimensionalities of the vector spaces of the RGB images and the depth maps, respectively. In [3] two separate eigenspaces are generated by PCA and a linear regression on the eigenspace coefficients is used to model the relation of $\mathbf{x}$ and $\mathbf{y}$.
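For concreteness, a minimal data-preparation sketch in Python/NumPy is given below. All array names and sizes are illustrative assumptions, not part of the paper: each registered RGB image and depth map is flattened into the vectors $\mathbf{x} \in \mathbb{R}^p$ and $\mathbf{y} \in \mathbb{R}^q$ and centered, since the covariance matrices used below assume zero-mean variables.

```python
import numpy as np

# Illustrative sizes only; the actual raster depends on the registration used.
H, W, N = 64, 64, 150                     # image raster and number of training pairs
rgb_images = np.random.rand(N, H, W, 3)   # stand-in for registered RGB images
depth_maps = np.random.rand(N, H, W)      # stand-in for registered depth maps

# Vector representations: one training example per column.
X = rgb_images.reshape(N, -1).T           # p x N with p = H * W * 3
Y = depth_maps.reshape(N, -1).T           # q x N with q = H * W

# Center both signals; the within- and between-set covariances assume zero mean.
x_mean = X.mean(axis=1, keepdims=True)
y_mean = Y.mean(axis=1, keepdims=True)
Xc, Yc = X - x_mean, Y - y_mean
```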
2.1 CCA
We use CCA to find pairs of directions $\mathbf{w}_x$ and $\mathbf{w}_y$ that maximize the correlation between the projections $x = \mathbf{w}_x^T \mathbf{x}$ and $y = \mathbf{w}_y^T \mathbf{y}$. In the context of CCA, the projections $x$ and $y$ are also referred to as canonical variates. Formally, the directions can be found as maxima of the function
$$\rho = \frac{E[xy]}{\sqrt{E[x^2]\,E[y^2]}} = \frac{E[\mathbf{w}_x^T \mathbf{x}\mathbf{y}^T \mathbf{w}_y]}{\sqrt{E[\mathbf{w}_x^T \mathbf{x}\mathbf{x}^T \mathbf{w}_x]\,E[\mathbf{w}_y^T \mathbf{y}\mathbf{y}^T \mathbf{w}_y]}} = \frac{\mathbf{w}_x^T C_{xy} \mathbf{w}_y}{\sqrt{\mathbf{w}_x^T C_{xx} \mathbf{w}_x\;\mathbf{w}_y^T C_{yy} \mathbf{w}_y}},$$
Figure 2: Scheme of canonical correlation analysis regression. The (empirical) canonical factors $W_x$ and $W_y$ maximize the correlation of the projections $W_x^T X$ and $W_y^T Y$ of the sets of training data $X$ and $Y$.
whereby $C_{xx} \in \mathbb{R}^{p \times p}$ and $C_{yy} \in \mathbb{R}^{q \times q}$ are the within-set covariance matrices of $\mathbf{x}$ and $\mathbf{y}$, respectively, while $C_{xy} \in \mathbb{R}^{p \times q}$ denotes their between-set covariance matrix. A number of at most $k = \min(p, q)$ factor pairs $\langle \mathbf{w}_x^i, \mathbf{w}_y^i \rangle$, $i = 1, \ldots, k$, can be obtained by successively solving
$$\mathbf{w}^i = (\mathbf{w}_x^{iT}, \mathbf{w}_y^{iT})^T = \arg\max_{(\mathbf{w}_x^i, \mathbf{w}_y^i)} \{\rho\}$$
subject to $\rho(\mathbf{w}_x^j, \mathbf{w}_y^i) = \rho(\mathbf{w}_x^i, \mathbf{w}_y^j) = 0$ for $j = 1, \ldots, i-1$. The factor pairs $\mathbf{w}^i$ can be obtained as solutions (i.e., eigenvectors) of a generalized eigenproblem (for details see, e.g., [12]). The extremum values $\rho(\mathbf{w}^i)$, which are referred to as canonical correlations, are obtained as the corresponding eigenvalues. By employing CCA, we perform regression on only a small number (compared to the original dimensionality of the data) of linear features, i.e. derived linear combinations of the original response variables $\mathbf{y}$. Thus, CCA can be used to compute the (reduced) rank-$n$ regression parameter matrix by using only $n < k$ factor pairs. Thereby, in contrast to standard multivariate regression, CCA takes advantage of the correlations of the response variables to improve predictive accuracy [5]. In our approach, we use the regression scheme depicted in Fig. 2, where we perform regression of the response variables $\mathbf{y}$ onto the leading canonical variates $\mathbf{x}_{\mathrm{proj}} = W_x^T \mathbf{x}$, where $W_x = (\mathbf{w}_x^1, \ldots, \mathbf{w}_x^n)$ with $n < k$. Analogously to standard multivariate regression, CCA can also be formulated directly as a linear least squares problem. It can easily be shown that minimizing
$$\mathrm{RSS}(\mathbf{w}_x, \mathbf{w}_y) = E[(\mathbf{w}_x^T \mathbf{x} - \mathbf{w}_y^T \mathbf{y})^2] = E[\mathbf{w}_x^T \mathbf{x}\mathbf{x}^T \mathbf{w}_x] - 2E[\mathbf{w}_x^T \mathbf{x}\mathbf{y}^T \mathbf{w}_y] + E[\mathbf{w}_y^T \mathbf{y}\mathbf{y}^T \mathbf{w}_y] = \mathbf{w}_x^T C_{xx} \mathbf{w}_x - 2\mathbf{w}_x^T C_{xy} \mathbf{w}_y + \mathbf{w}_y^T C_{yy} \mathbf{w}_y,$$
subject to the constraints $\mathbf{w}_x^T C_{xx} \mathbf{w}_x = 1$ and $\mathbf{w}_y^T C_{yy} \mathbf{w}_y = 1$,
yields the first canonical factor pair. An iterative (online) CCA algorithm based on the above formulation is described in [4]. This formulation also allows multiple factor pairs to be obtained successively.
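The following sketch (our own NumPy illustration, not the authors' implementation) computes canonical factor pairs from the centered data matrices Xc and Yc of the previous snippet. It uses the standard whitening/SVD route to the generalized eigenproblem and adds a ridge term `reg` to the within-set covariances as a stand-in for the regularization discussed in [12]; for image-sized $p$ and $q$ one would in practice work in a lower-dimensional (e.g. PCA) subspace rather than forming the full covariance matrices.

```python
import numpy as np

def inv_sqrt(C):
    """Symmetric inverse square root of a positive definite matrix."""
    evals, evecs = np.linalg.eigh(C)
    return evecs @ np.diag(1.0 / np.sqrt(evals)) @ evecs.T

def cca(Xc, Yc, n_factors, reg=1e-3):
    """Canonical factor pairs of centered data Xc (p x N) and Yc (q x N).

    Returns Wx (p x n), Wy (q x n) and the n leading canonical correlations.
    `reg` is a ridge term that keeps the within-set covariances invertible;
    it plays the role of the regularization parameter tuned by cross-validation.
    """
    N = Xc.shape[1]
    Cxx = Xc @ Xc.T / (N - 1) + reg * np.eye(Xc.shape[0])
    Cyy = Yc @ Yc.T / (N - 1) + reg * np.eye(Yc.shape[0])
    Cxy = Xc @ Yc.T / (N - 1)

    # Whitening reduces the generalized eigenproblem to an ordinary SVD:
    # K = Cxx^{-1/2} Cxy Cyy^{-1/2} = U diag(rho) V^T.
    Cxx_is, Cyy_is = inv_sqrt(Cxx), inv_sqrt(Cyy)
    U, rho, Vt = np.linalg.svd(Cxx_is @ Cxy @ Cyy_is, full_matrices=False)

    Wx = Cxx_is @ U[:, :n_factors]      # canonical factors in RGB-image space
    Wy = Cyy_is @ Vt.T[:, :n_factors]   # canonical factors in depth-map space
    return Wx, Wy, rho[:n_factors]
```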
2.2 Predicting depth maps
Given a training set of $N$ pairs of RGB images $X = (\mathbf{x}_1, \ldots, \mathbf{x}_N)$ and corresponding depth maps $Y = (\mathbf{y}_1, \ldots, \mathbf{y}_N)$, where $\mathbf{x}_j$ and $\mathbf{y}_j$ are vector representations, we obtain empirical canonical factor pairs $\mathbf{w}^i$. A subset of factor pairs $\mathbf{w}^i$ with $i = 1, \ldots, n < k$ is then used for the prediction of depth maps from new RGB images. The subset corresponds to the canonical factors with the highest canonical correlations in the training set. We denote the matrices of the leading empirical canonical factors by $W_x = (\mathbf{w}_x^1, \ldots, \mathbf{w}_x^n)$ and $W_y = (\mathbf{w}_y^1, \ldots, \mathbf{w}_y^n)$, respectively. Given a specific input RGB image $\mathbf{x}$, the prediction $\mathbf{y}_{\mathrm{predicted}}$ is obtained as
$$\mathbf{y}_{\mathrm{predicted}}(\mathbf{x}) = R_{\mathrm{cca}}^T \mathbf{x}. \qquad (1)$$
The matrix $R_{\mathrm{cca}}$ is pre-computed during training by
$$R_{\mathrm{cca}} = P^{\dagger} Y^T, \qquad (2)$$
where $P$ contains the projections of the training images onto the leading canonical factors, i.e.,
$$P = W_x^T X, \qquad (3)$$
and $P^{\dagger}$ denotes the pseudo-inverse of $P$.¹

¹ Note that the multiplication of $\mathbf{x}$ with $R_{\mathrm{cca}}$ can be performed in two successive steps. In standard MLR, $P = X$.
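A minimal sketch of training and prediction, continuing the snippets above (again our own illustration; function and variable names are hypothetical). The centered depth maps are regressed onto the leading canonical variates $P = W_x^T X$ via the pseudo-inverse, and a new image is predicted in the two steps mentioned in the footnote: project onto $W_x$, then apply the learned linear map. Here n_factors = 5 is used, the rank reported as optimal in Section 3; the exact matrix layout differs slightly from Eqs. (1)–(3) but implements the same reduced-rank regression.

```python
# Training: rank-n regression of the centered depth maps onto the
# canonical variates P = Wx^T Xc, solved with the pseudo-inverse.
Wx, Wy, rho = cca(Xc, Yc, n_factors=5, reg=1e-3)
P = Wx.T @ Xc                        # n x N projections of the training images
R = Yc @ np.linalg.pinv(P)           # q x n regression matrix

def predict_depth(x_rgb):
    """Predict a vectorized depth map from a vectorized RGB image."""
    x_proj = Wx.T @ (x_rgb - x_mean[:, 0])   # step 1: project onto the factors
    return R @ x_proj + y_mean[:, 0]         # step 2: map to depth space
```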
3 Experimental results
Setup For the experiments the USF Human-ID 3D Face Database [15] was used. The 3D shape of the faces is acquired with a CyberWare laser scanner, which gives a cylindrical depth map of the face with a horizontal resolution of 1 degree. The faces were remapped into a Cartesian coordinate system with the z-direction parallel to the ray going through the center between the eyes and a resolution corresponding to the vertical resolution of the scans. Regions not belonging to the face were discarded. The evaluation was performed on a set of 218 face images. During training, we use 5-fold cross-validation on the training set of 150 images to determine the optimal number of factor pairs and the optimal value of the regularization parameter needed for CCA (for details of how to perform the regularization see [12]). The performance is evaluated on the test set of the remaining 68 images. The prediction is assessed qualitatively and quantitatively.
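A sketch of this model-selection loop is given below; the candidate grids are assumptions (the paper does not list them). Five-fold cross-validation is run over the number of factor pairs and the regularization parameter, scoring each setting by the mean absolute depth error on the held-out fold.

```python
def select_hyperparameters(X, Y, n_grid=(1, 2, 5, 10, 20, 50),
                           reg_grid=(1e-4, 1e-3, 1e-2), k=5, seed=0):
    """k-fold cross-validation over the number of factor pairs and the ridge term.

    X is p x N, Y is q x N (uncentered); returns the (n_factors, reg) pair
    with the lowest mean absolute validation error.
    """
    N = X.shape[1]
    folds = np.array_split(np.random.default_rng(seed).permutation(N), k)
    best, best_err = None, np.inf
    for n_factors in n_grid:
        for reg in reg_grid:
            fold_errors = []
            for val_idx in folds:
                tr_idx = np.setdiff1d(np.arange(N), val_idx)
                Xt, Yt = X[:, tr_idx], Y[:, tr_idx]
                xm, ym = Xt.mean(1, keepdims=True), Yt.mean(1, keepdims=True)
                Wx, _, _ = cca(Xt - xm, Yt - ym, n_factors, reg)
                R = (Yt - ym) @ np.linalg.pinv(Wx.T @ (Xt - xm))
                pred = R @ (Wx.T @ (X[:, val_idx] - xm)) + ym
                fold_errors.append(np.abs(pred - Y[:, val_idx]).mean())
            if np.mean(fold_errors) < best_err:
                best, best_err = (n_factors, reg), np.mean(fold_errors)
    return best
```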
Figure 3: Box plots of the absolute median pixel value error for CCA regression using different numbers of factor pairs. The horizontal lines correspond to the lower and upper quartile of the MLR results. The optimum of CCA regression with a 19.1% smaller median error is obtained using 5 factor pairs.
Results Since the number of training examples N is much lower than the dimensionality of the signal spaces, the maximum possible number of factor pairs k is determined by the number of training pairs (150). The quantitative results given in Figures 3 and 4 show that, while the optimal number of factor pairs (which determines the rank of the regression matrix) is actually much smaller than 150, the depth error improves to 85% of the error of standard regression with 5 factor pairs. Only a fraction of the available factor pairs is sufficient for predicting the depth map with higher accuracy than full-rank MLR achieves. The resulting mean depth error is 6.93 voxels. Note that most of the error is caused by the distortions at the boundary of the faces. In Figure 5 the spectrum of the canonical correlations is plotted. Figure 6 shows the factor pairs 1, 2 and 3 corresponding to the 3 largest canonical correlations. Qualitative results are shown in Figure 7. The first and second columns show the original (ground truth) and predicted 3D depth maps (as textureless surfaces) of two example faces, respectively. In the third column the original (ground truth) 3D surface with face texture is depicted, in the fourth column the 3D surface predicted by CCA is shown with texture, and in the fifth column the difference of the depth values is visualized on the same scale.
4 Conclusion
We have presented a method for predicting face depth maps from color images using Canonical Correlation Analysis. Only a small number of factor pairs is needed to predict the depth map from color images with reasonable accuracy. The increase in accuracy when using only a small number of factor pairs indicates that the noise in the training data degrades the MLR results. Despite the simple nature of the algorithm (reconstruction is performed by a matrix multiplication), which does not utilize an explicit illumination model, reasonable depth predictions are achieved.
Figure 6: The first 3 canonical factors, represented as RGB color images (a–c) and depth images (d–f), respectively.
[Figure 7 column labels: Original surface, Reconstructed surface, Original face patch, Reconstructed face patch, Difference surface]
Figure 7: Original images and CCA-predicted depth maps for 3 faces, along with the approximation error. The original texture is projected onto both depth maps. Note that the approximation (depth) error is shown on the same scale as the original and the reconstruction, thus displaying the high accuracy of the reconstruction.
However, in our experiments we had controlled, non-varying illumination conditions. A more profound analysis w.r.t. illumination sensitivity and a comparison to other shape-from-shading methods is needed.
Acknowledgement This research has been supported by the Austrian Science Fund (FWF) under the grant P17083-N04 (AAMIR). Part of this work has been carried out within the K-plus Competence center ADVANCED COMPUTER VISION funded under the K plus program.
References [1] J.J. Atick, P.A. Griffin, and A.N. Redlich. Statistical approach to shape from shading: reconstruction of three-dimensional face surfaces from single two-dimensional images. Neural Computation, 1996. [2] Ronen Basri and David Jacobs. Photometric stereo with general, unknown lighting. In CVPR, 2001. [3] V. Blanz, S. Romdhani, and T. Vetter. Face identification across different poses and illuminations with a 3d morphable model. In Proc. Automatic Face and Gesture Recognition, 2002. [4] Magnus Borga. Learning Multidimensional Signal Processing. Linköping Studies in Science and Technology, Dissertations, No. 531. Department of Electrical Engineering, Linköping University, Linköping, Sweden, 1998.
Figure 4: Box plots of the pixel value error for CCA regression relative to standard MLR, using different numbers of factor pairs. The horizontal lines correspond to the lower and upper quartile of the MLR results, which coincide with the median in this relative diagram. The quartiles for the CCA results correspond to the variation w.r.t. the MLR result on the same individual images.
[11] Yingbo Hua, Maziar Nikpour, and Petre Stoica. Optimal reduced-rank estimation and filtering. IEEE Transactions on Signal Processing, 49(3):457–469, 2001. [12] Thomas Melzer, Michael Reiter, and Horst Bischof. Appearance models based on kernel canonical correlation analysis. Pattern Recognition, 39(9):1961–1973, 2003. [13] A. Pezeshki, L.L. Scharf, M.R. Azimi-Sadjadi, and Y. Hua. Underwater target classification using canonical correlations. In Proc. of MTS/IEEE Oceans '03, 2003. [14] John A. Robinson and Justen R. Hyde. Estimation of face depths by conditional densities. In BMVC, 2005. [15] Sudeep Sarkar. USF Human-ID 3D Face Database, University of South Florida. [16] A.L. Yuille, D. Snow, and R. Epstein. Determining generative models of objects under varying illumination: Shape and albedo from multiple images using SVD and integrability. International Journal of Computer Vision, 1999. [17] W. Zhao and R. Chellappa. Illumination-insensitive face recognition using symmetric shape-from-shading. In IEEE CVPR, 2000.
[Plot: canonical correlation versus the number of factors]
Figure 5: Spectrum of canonical correlations.
[5] L. Breiman and J.H. Friedman. Predicting multivariate responses in multiple linear regression. Journal of the Royal Statistical Society, 59(1):3–54, 1997. [6] Konstantinos I. Diamantaras and S.Y. Kung. Principal Component Neural Networks. John Wiley & Sons, 1996. [7] Roman Dovgard and Ronen Basri. Statistical symmetric shape from shading for 3d structure recovery of faces. In ECCV, 2004. [8] T. Hastie, R. Tibshirani, and J. Friedman. The Elements of Statistical Learning. New York: Springer-Verlag, 2001. [9] A. Höskuldsson. PLS regression methods. Journal of Chemometrics, 2:211–228, 1988. [10] H. Hotelling. Relations between two sets of variates. Biometrika, 28:321–377, 1936.