Shape from Recognition and Learning: Recovery of 3-D Face Shapes

Dibyendu Nandy & Jezekiel Ben-Arie
EECS Department, University of Illinois at Chicago, Chicago, IL 60607
Email: {dnandy, [email protected]}

(This work was supported by the National Science Foundation under Grant Nos. IRI-9623966 and IRI-97-11925, and by the Army Research Office under Grant No. DAAG 55-98-1-0424.)

Abstract

In this paper, a novel framework for the recovery of the 3D surfaces of faces from single images is developed. The underlying principle is shape from recognition, i.e. the idea that pre-recognizing face parts constrains the space of possible solutions to the image irradiance equation, thus allowing robust recovery of the 3D structure of a specific part. Shape recovery of each recognized part is based on specialized backpropagation-based neural networks, each of which is employed in the recovery of a particular face part. A principal component representation allows classes of objects such as noses, lips, etc. to be encoded efficiently. The specialized networks are designed and trained to map the principal component coefficients of the shading images to another set of principal component coefficients that represent the corresponding 3D surface shapes. A method for integrating the recovered 3D surface regions by minimizing the sum squared error in overlapping areas is also derived. Quantitative analysis of the reconstruction of the surface parts shows relatively small errors, indicating that the method is robust and accurate. The recovery of a complete face is performed by minimal squared error merging of face parts.

1. Shape from Recognition

Shape from shading is generally an ill-posed problem because of the multiplicity of solutions to the image irradiance equation at each point. Several methods have been used to constrain the solution, as outlined below. In our approach, the solution space is significantly reduced by pre-recognizing particular face parts; the derivation of their 3D surfaces from the much smaller solution space is quite robust. The brightness of a point in an image, I(x, y), determines only the projection of the surface normal n(x, y) onto the incident illumination direction i, as expressed by the image irradiance equation I(x, y) = R(i, n(x, y)). R() is the reflectance function, which is an inner product for Lambertian surfaces. If surface normals are assigned arbitrarily, an infinite number of normal vector fields can give rise to the same intensity image, which makes the problem extremely ill-posed. The problem can be made better posed by enforcing integrability through various constraints, since the integrability condition must be satisfied by smooth surfaces. Most shape from shading methods are based on variational approaches that minimize the average error in satisfying the image irradiance equation, with constraints included in the functional using Lagrange multipliers. This includes the methods proposed by Horn [3] and variations thereof, including photometric stereo [15], direct reconstruction of height by enforcing differential smoothness constraints [6], using line drawing interpretation to reconstruct piecewise smooth surfaces [8], or using only local differential conditions [9]. Each method has some shortcomings; for example, in [9] it is assumed that the surface is locally spherical. Only a few approaches using neural networks for the shape from shading problem have been proposed. These include [7], in which it is shown that the approximate surface normals of ellipsoidal shapes can be reconstructed by training a network with intensity patterns. Also, [4] models the reconstruction of surface normals of a hybrid reflectance surface with a photometric stereo approach using known multiple illuminations. Some novel approaches have been proposed in recent years, including [5], where the authors use topological considerations to classify singular points in images and use a weighted distance transform to globally recover surface shape. In [14], diffuse illumination is modeled by interreflections and the 3D shape is recovered by adjusting line-of-sight/horizon height values. Another recent work using principal components is proposed in [1]: complete 3D face surfaces are used to solve the image irradiance equation in a parametric form, whereas we employ neural networks. Also, the face images that can be recovered in [1] are limited to a single pose, size and location.
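To make the pointwise ambiguity concrete, the following minimal sketch (an illustration, not part of the original method) samples several unit normals that all satisfy the Lambertian irradiance equation for the same brightness under a fixed light direction; the light direction, albedo and intensity values are arbitrary placeholders.

```python
import numpy as np

def normals_with_same_brightness(i, intensity, albedo=1.0, n_samples=5):
    """Sample unit normals n satisfying albedo * dot(n, i) = intensity.

    All such normals lie on a cone around the light direction i, so a single
    pixel brightness cannot determine the surface orientation on its own.
    """
    i = np.asarray(i, dtype=float)
    i = i / np.linalg.norm(i)
    cos_a = intensity / albedo            # required cosine between n and i
    sin_a = np.sqrt(1.0 - cos_a ** 2)

    # Orthonormal basis (u, v) for the plane perpendicular to i.
    u = np.cross(i, [0.0, 0.0, 1.0])
    if np.linalg.norm(u) < 1e-8:          # i parallel to the z axis
        u = np.cross(i, [0.0, 1.0, 0.0])
    u /= np.linalg.norm(u)
    v = np.cross(i, u)

    phis = np.linspace(0.0, 2.0 * np.pi, n_samples, endpoint=False)
    return [cos_a * i + sin_a * (np.cos(p) * u + np.sin(p) * v) for p in phis]

if __name__ == "__main__":
    light = np.array([0.2, 0.3, 0.93])
    for n in normals_with_same_brightness(light, intensity=0.8):
        # Every sampled normal reproduces the same image irradiance.
        print(np.round(n, 3), np.dot(n, light / np.linalg.norm(light)))
```

Any shape from shading method therefore needs additional constraints to tie neighboring normals together; our approach instead restricts the solution space by recognizing the part first.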

In this paper, we focus on the recovery of the 3D shape of the human face from single images. The shape from recognition framework allows us to recover the 3D surfaces of recognized and segmented face parts such as the nose, eyes, lips, forehead, etc. The face location and orientation are estimated using neural networks [12] and principal components [10]. Specific face parts like noses and eyes are detected using our recently developed model-based segmentation [2]. Once a part is detected, the shape recovery problem is reduced to estimating the specific 3D shape of an example of that class of objects, such as a nose or lips. Given such a class of objects, we show that it is possible to train a neural network to associate the appearance of that class under varying pose and illumination with its true 3D shape. The 3D shape variation within each class is then described by a lower-dimensional parameter space that captures only the variations generating the set of images of that particular part. Specialized neural networks are designed to map intensity input patterns belonging to such classes of parts to their 3D shapes under varying pose and illumination. The efficiency of training is improved significantly by using principal components to represent the intensities and 3D surfaces, thus reducing the number of parameters that the network needs to learn. We assume that the surface reflectance of faces is Lambertian, and we model the image formation process under a single illumination with varying source direction. Face images are rendered using the Lambertian model from range data of human heads, as detailed in Section 2. The same range data can give rise to multiple images depending on the illumination; thus, we have a many-to-one mapping problem. Given an upright and approximately frontal face image in a field of view, the nose, lips, eyes, cheeks, and forehead lie in approximately the same region relative to each other. By applying rectangular masks (with small overlaps) to a face image, these parts are analyzed separately and in parallel. This is illustrated in Fig. 1. As derived in Section 3, each such rectangular imaged part is projected onto its corresponding set of principal components to obtain the corresponding intensity principal component coefficients (IPCC). These are then input to the corresponding backpropagation network, which is trained to map the IPCC to the corresponding 3D shape principal component coefficients (SPCC). The resulting SPCC are used to reconstruct the 3D shape of the part via the Karhunen-Loeve transform, and the recovered parts are combined using a local error constraint (detailed in Section 5) to recover the complete 3D face surface.
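The per-part branch of this pipeline (mask, IPCC, network, SPCC, range patch) can be summarized by the short sketch below. The array sizes, the stand-in network and the random data are placeholder assumptions used only to make the sketch self-contained; the actual components are developed in Sections 3-5.

```python
import numpy as np

def recover_part(patch, mean_img, pcs_img, net, mean_rng, pcs_rng):
    """One branch of Fig. 1: image patch -> IPCC -> network -> SPCC -> range patch."""
    ipcc = pcs_img.T @ (patch.ravel() - mean_img)             # project onto intensity PCs
    spcc = net(ipcc)                                          # learned IPCC -> SPCC mapping
    return (mean_rng + pcs_rng @ spcc).reshape(patch.shape)   # reconstruct the range patch

def fake_net(ipcc):
    # Placeholder for a trained backpropagation network (Section 4).
    return 0.1 * ipcc

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    patch = rng.random((16, 16))                              # dummy 16x16 nose patch
    mean_img, pcs_img = rng.random(256), rng.random((256, 10))
    mean_rng, pcs_rng = rng.random(256), rng.random((256, 10))
    nose_range = recover_part(patch, mean_img, pcs_img, fake_net, mean_rng, pcs_rng)
    print(nose_range.shape)                                   # (16, 16) recovered range patch
```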

2. Rendering Model

Human head models used here are described by range data defined by the function r(h, θ), where h and θ are the height and the pan (azimuthal) angular coordinate, respectively.


Figure 1. Shown is a block diagram of the data flow in our shape recovery algorithm. At the bottom left is an intensity image on which specified regions indicate the general locations of the eyes, nose, lips, cheeks, etc. Note that only symbolic patches of the eyes, nose and lips are shown, and the dataflow is illustrated only for the nose region. Each imaged part is projected onto its respective principal components to generate intensity principal component coefficients (IPCC). These are input to a backpropagation based neural network trained to output the corresponding 3D shape principal component coefficients (SPCC). The SPCC are used to recover the individual 3D shapes of the face parts. These 3D shapes, in the form of range data, are integrated together in a smooth manner by applying corrections to the overlapping areas.

The Cartesian coordinates of each point on the surface are given by

S(h, θ) = [x(h, θ), y(h, θ), z(h, θ)] = [x_0 + r(h, θ) cos θ,  y_0 + h,  z_0 + r(h, θ) sin θ]        (1)

for some (x_0, y_0, z_0) which relates the origins of the two coordinate systems. Projections of the face from different viewpoints are related to the Cartesian coordinate system through two rotations: the pan/azimuth (rotation about the vertical y axis, analogous to θ) and the tilt/elevation (rotation about the horizontal x axis, given by φ). The pan can be accommodated by directly changing the value of θ after the origins of the two coordinate systems are made to coincide. Tilted viewpoints φ can be obtained by applying a rotation matrix R_φ to the Cartesian description. Uniform scale variations can also be incorporated by applying a matrix σI to the Cartesian coordinate transformation. The local tangent plane at a surface point is spanned by the vectors ∂S/∂h and ∂S/∂θ. The surface normal is thus given by the cross product

n_S(h, θ) ∝ (∂S/∂h) × (∂S/∂θ).

The unit normal can be shown to be

n_S(h, θ) = (1/D) [ ∂r/∂θ sin θ + r cos θ,   −r ∂r/∂h,   r sin θ − ∂r/∂θ cos θ ],        (2)

where the normalizing factor is

D = sqrt( r² + (∂r/∂θ)² + r² (∂r/∂h)² ).        (3)

The intensity I(h, θ) can now be computed using the Lambertian reflectance model

I(h, θ) = ρ(h, θ) n_S(h, θ) · i,        (4)

where i ≡ (i_x, i_y, i_z) is a vector representing the incident light, n_S(h, θ) is the surface normal given by Eqs. (2) and (3), and ρ(h, θ) is the albedo. Implicit in the rendering is the nonlinearity that sets the image point I(x, y) = 0 for self-occluding points, i.e., where the computed reflectance is negative. Also, for the purposes of this paper, ρ(h, θ) is assumed to be constant, though our strategy does not preclude the use of varying albedo, especially in the region of the eyes and eyebrows. Eq. (1) is then used to obtain the projected image from a desired viewpoint (θ, φ) using standard interpolation techniques. The corresponding range data r(x, y) from the same viewing direction can be obtained similarly. An example of such a rendered face image and the corresponding range image is shown in Fig. 2.

In our work, head models of various people are rendered from various viewpoints and with various illumination directions. The view directions are approximately frontal, in the range θ ∈ [−15°, +15°] and φ ∈ [−5°, +5°]. Similarly, the illumination directions are chosen in the range [−30°, +30°] for both angles. Scale variations in the range σ ∈ [0.95, 1.05] are also applied. These parameters are selected at random to generate datasets of the intensity function I(x, y) and the corresponding range function r(x, y), which span the range of views and intensities for approximately frontal face images. The two datasets of intensity and range are then delineated by rectangular masks corresponding to parts like the eyes, nose, lips, cheeks and forehead. We let I_p(x, y) and r_p(x, y) represent such delineated rectangular regions.
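As an illustration of this rendering model (a minimal sketch, not the authors' implementation), the following code builds the Cartesian surface of Eq. (1) from a cylindrical range map, estimates the tangent vectors and normal of Eqs. (2)-(3) by finite differences, and applies the Lambertian shading of Eq. (4) with the self-occlusion clipping. The toy range map and light direction are placeholders, and the final projection to an (x, y) image by interpolation is omitted.

```python
import numpy as np

def render_lambertian(r, h, theta, light, albedo=1.0):
    """Render a Lambertian intensity map from cylindrical range data r(h, theta).

    r     : (H, T) array of radii, parameterized as in Eq. (1)
    h     : (H,) heights, theta : (T,) azimuth angles in radians
    light : unit 3-vector i of the incident illumination
    """
    H, Th = np.meshgrid(h, theta, indexing="ij")
    # Cartesian surface, Eq. (1), with (x0, y0, z0) = 0.
    S = np.stack([r * np.cos(Th), H, r * np.sin(Th)], axis=-1)
    # Tangent vectors dS/dh and dS/dtheta by finite differences.
    S_h = np.gradient(S, h, axis=0)
    S_t = np.gradient(S, theta, axis=1)
    n = np.cross(S_h, S_t)                               # unnormalized surface normal
    n /= np.linalg.norm(n, axis=-1, keepdims=True)
    # Lambertian shading, Eq. (4), clipped at zero for self-occluding points.
    return np.clip(albedo * n @ np.asarray(light), 0.0, None)

if __name__ == "__main__":
    h = np.linspace(-1.0, 1.0, 64)
    theta = np.linspace(-np.pi / 3, np.pi / 3, 64)
    r = 1.0 + 0.1 * np.cos(3 * theta)[None, :] * np.exp(-h[:, None] ** 2)  # toy "head"
    light = np.array([0.3, 0.2, 0.93])
    light /= np.linalg.norm(light)
    I = render_lambertian(r, h, theta, light)
    print(I.shape, float(I.min()), float(I.max()))
```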

3. Principal Component Representation

Each face part is associated with intensity data I_p(x, y) and range data r_p(x, y), parameterized on the set of view directions, scales and illumination directions described in Section 2. The aim of the neural network based shape recovery is to learn the mapping between I_p(x, y) and the corresponding r_p(x, y). The dimensionality of the problem depends on the size of the imaged part as well as on the range of viewing and illumination parameters over which we seek to generalize. Principal component analysis (PCA) by the Karhunen-Loeve transform [11] provides a meaningful way to reduce the dimensionality and consequently the computational complexity. Here, we briefly outline the procedure. Given a dataset consisting of N shading images (or range images) of a class, say noses, the principal component representation of any sample image I_p(x, y) within the same class is given by

I_p(x, y) ≈ I_pm(x, y) + Σ_{i=1}^{P} α_i φ_i^I(x, y),        (5)

where I_pm(x, y) = (1/N) Σ_{i=1}^{N} I_p^i(x, y) is the mean image, φ_i^I(x, y) are the principal components of the dataset, and P ≪ N. The coefficients of the PCA, α_i, are given by

α_i = Σ_{x,y} [ I_p(x, y) − I_pm(x, y) ] φ_i^I(x, y).        (6)

The principal components are the eigenvectors of the dataset, obtained by solving the eigenvalue equation R φ = λ φ, where R is the autocovariance matrix of the data (the N images), λ is the eigenvalue and φ is the eigenvector or principal component. This representation (Eq. (5)) is the best linear approximation in a minimum mean squared error sense. The mean squared error in reconstruction is given by

ε = Σ_{i=P+1}^{N} λ_i.

Figure 2. A sample range image r(x, y) (left) and the corresponding intensity image I(x, y) (right), viewed from direction (9°, 2°) and illuminated from direction (2°, 13°), rendered according to Section 2.

Using our rendering model, we generate samples of intensity and range for each face part with varying, randomly selected pose, scale and illumination from the ranges defined in Section 2. We apply the Karhunen-Loeve transform separately to the training samples of the intensity and range data to obtain the principal components {φ^I} and {φ^R}, respectively. {φ^I} are used to generate the intensity principal component coefficients (IPCC) for the input. {φ^R} are used to reconstruct the range data r_p(x, y) from the neural network generated 3D shape principal component coefficients (SPCC), using Eq. (5). This is illustrated in Fig. 1. Fig. 3 and Fig. 4 show the average and the first few principal components for the nose under the variations we seek to generalize and learn.
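A minimal sketch of this Karhunen-Loeve step (computing the mean, the leading principal components, and the coefficients of Eqs. (5)-(6)) is given below. It assumes the training patches are stacked as rows of a NumPy array and uses an SVD of the centered data, which yields the same eigenvectors as the autocovariance formulation; the synthetic data and patch size are placeholders.

```python
import numpy as np

def fit_pca(samples, P):
    """samples: (N, D) array, one vectorized patch per row. Returns mean and P components."""
    mean = samples.mean(axis=0)
    centered = samples - mean
    # Right singular vectors of the centered data = eigenvectors of the autocovariance.
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    return mean, vt[:P].T                      # components as columns of a (D, P) matrix

def project(patch, mean, components):
    """Eq. (6): coefficients of a vectorized patch in the principal component basis."""
    return components.T @ (patch - mean)

def reconstruct(coeffs, mean, components):
    """Eq. (5): approximate the patch from its P coefficients."""
    return mean + components @ coeffs

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    data = rng.random((200, 16 * 16))          # 200 synthetic 16x16 nose patches
    mean, comps = fit_pca(data, P=8)
    coeffs = project(data[0], mean, comps)     # IPCC (or SPCC for range patches)
    approx = reconstruct(coeffs, mean, comps)
    print(coeffs.shape, float(np.abs(approx - data[0]).mean()))
```

As noted in Section 4, the number of components P is chosen per part so that the mean reconstruction error stays below 5%.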

Figure 3. The average and first 8 principal components of intensity data for noses. The principal components are shown added to the average image to show how each function modifies the average. The coefficients obtained by the projection of the input intensity field on these basis functions are used as inputs.

Figure 4. The average and first 8 principal components of range data for noses. The principal components are shown added to the average image to show how each function modifies the average. The coefficients corresponding to these basis functions are used as outputs to train our network.

4. Learning Shading Patterns

We now describe a feed-forward neural network that learns the many-to-one mapping from a local image shading pattern to the local 3D surface shape. The neural model is a three-layer structure with input, hidden and output layers. The learning problem is quite complex, as the network is expected to estimate the relative surface heights of neighboring pixels under varying illumination conditions as well as variations of the surface type. We found in prior experiments that the task of estimating a complete 3D face shape for different people, under conditions of varying pose, scale and illumination, is too complex for a single neural network. This is the reason for dividing the recovery task among specialized networks that recover individual parts. The input intensities and output heights are normalized such that the value at the center grid point is shifted to 0, and are then scaled to lie in the range [−1, 1]. The IPCC of the input shading patterns and the SPCC of the output range patterns are generated using PCA as described in Section 3.

Figure 5. a) The shading pattern of a particular nose illuminated from the direction (14.4°, 11.7°). b) The recovered nose range image using the backpropagation network and the Karhunen-Loeve transform. c) The original 3D nose shape for comparison.

The input implicitly has a local receptive field in the image domain: it consists of the IPCC, the intensity principal component coefficients of a rectangular patch extracted from the face image. The illumination vector is assumed to be known and is given by its direction cosines. This is a reasonable assumption, as several algorithms can successfully estimate the illumination direction, for example [16]. The eigen-coefficients of the illumination patterns (IPCC) and of the range data (SPCC) are then used to form the input and output training sets for the neural network. The number of principal components P is determined such that the mean reconstruction error is less than 5%. We construct and train specialized networks for each part, such as the nose, eyes and lips. Each network is a feed-forward three-layer network with sigmoidal nonlinearities at the output of each hidden node. The networks are trained with the well-known backpropagation algorithm [13], which minimizes the mean squared error at the output. Each network thus learns to map the IPCC at the input to the SPCC at the output, which encode the corresponding intensity and range data. As PCA is used to derive these representations, data that are near each other (have small Euclidean distance) in the intensity domain (or range domain) are also near each other in terms of the principal component coefficients. Thus the approximate SPCC generated by the neural network recover 3D shape information (range) that closely approximates the true range. This is the underlying principle of our neural network approach. Training is carried out in batch mode until the average error on the output set is less than 1%. In use, the network is presented with IPCC vectors and generates corresponding SPCC vectors; the SPCC are used to reconstruct the range using Eq. (5). The individual range parts are then integrated into a complete surface using the method described in Section 5.
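The specialized networks can be sketched as follows: an illustrative three-layer network with a sigmoidal hidden layer, trained by batch gradient descent on mean squared error, standing in for the backpropagation networks of [13]. The layer sizes, learning rate and synthetic training data below are placeholder assumptions, not values from the paper.

```python
import numpy as np

class PartNetwork:
    """Three-layer feed-forward net: IPCC -> sigmoid hidden layer -> SPCC."""

    def __init__(self, n_in, n_hidden, n_out, seed=0):
        rng = np.random.default_rng(seed)
        self.W1 = rng.normal(0, 0.1, (n_in, n_hidden))
        self.b1 = np.zeros(n_hidden)
        self.W2 = rng.normal(0, 0.1, (n_hidden, n_out))
        self.b2 = np.zeros(n_out)

    def forward(self, X):
        self.H = 1.0 / (1.0 + np.exp(-(X @ self.W1 + self.b1)))  # sigmoidal hidden layer
        return self.H @ self.W2 + self.b2                          # linear output (SPCC)

    def train_batch(self, X, Y, lr=0.05, epochs=2000):
        """Batch backpropagation minimizing mean squared error at the output."""
        for _ in range(epochs):
            out = self.forward(X)
            d_out = (out - Y) / len(X)                   # gradient of MSE w.r.t. output
            d_hidden = (d_out @ self.W2.T) * self.H * (1 - self.H)
            self.W2 -= lr * self.H.T @ d_out
            self.b2 -= lr * d_out.sum(axis=0)
            self.W1 -= lr * X.T @ d_hidden
            self.b1 -= lr * d_hidden.sum(axis=0)
        return float(np.mean((self.forward(X) - Y) ** 2))

if __name__ == "__main__":
    rng = np.random.default_rng(1)
    ipcc = rng.normal(size=(500, 12))                    # synthetic IPCC inputs
    spcc = np.tanh(ipcc @ rng.normal(size=(12, 10)))     # synthetic SPCC targets
    net = PartNetwork(n_in=12, n_hidden=20, n_out=10)
    print("final MSE:", net.train_batch(ipcc, spcc))
```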

5. Integrating Surface Patches

The reconstructed parts r_p(x, y) are defined on rectangular fields of view in the input domain. These regions are integrated in a smooth manner by a mechanism that minimizes the difference in heights over the overlapping pixels of neighboring regions. Each patch r_p^i has a floating height h^i because of the normalization described in Section 4, where i indexes the location of the rectangular support region. The relative height difference Δh_ij between two overlapping patches r_p^i and r_p^j is found by minimizing the sum squared error E in heights over the overlapping grid points indexed by {x, y}:

E = Σ_{(x,y) ∈ overlap} ( h^i(x, y) − h^j(x, y) − Δh_ij )².        (7)

Minimizing E by setting ∂E/∂(Δh_ij) = 0 gives

Δh_ij = (1 / Area(overlap)) Σ_{(x,y) ∈ overlap} ( h^i(x, y) − h^j(x, y) ).        (8)

This relative height correction is applied to every pair of overlapping patches in local operations. Note that the rectangular support regions are fixed for the face images, so these corrections can be applied directly to the overlapping areas of each such region. After correction, the overlapping grid points still carry multiple height estimates with minimum sum squared error; these are averaged to generate the height estimate at each such point. This results in a complete 3D shape recovery.

Figure 7. The recovered facial structure obtained by integrating overlapping patches of the reconstructed local parts.

Figure 8. The original 3D face shape shown with a mesh.
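A minimal sketch of this integration step is shown below (an illustration, not the authors' implementation): for each new range patch it computes the correction Δh_ij of Eq. (8) against the patches already placed, applies it, and averages the overlap. The patch placement, sizes and synthetic surface are arbitrary placeholders.

```python
import numpy as np

def merge_patch(canvas, counts, patch, top, left):
    """Paste `patch` onto `canvas` after removing its floating height, Eq. (8).

    canvas : running sum of height estimates on the full face grid
    counts : number of estimates contributed at each grid point (for averaging)
    """
    h, w = patch.shape
    region = canvas[top:top + h, left:left + w]
    cnt = counts[top:top + h, left:left + w]
    mask = cnt > 0                                   # grid points already covered
    if mask.any():
        # Eq. (8): mean height difference over the overlapping grid points.
        dh = np.mean(region[mask] / cnt[mask] - patch[mask])
        patch = patch + dh
    canvas[top:top + h, left:left + w] += patch
    counts[top:top + h, left:left + w] += 1

if __name__ == "__main__":
    true_surface = np.fromfunction(lambda y, x: np.sin(x / 10.0) + 0.01 * y, (40, 60))
    canvas, counts = np.zeros((40, 60)), np.zeros((40, 60), dtype=int)
    # Two patches of the same surface carrying different (unknown) height offsets.
    merge_patch(canvas, counts, true_surface[:, :35] + 3.0, top=0, left=0)
    merge_patch(canvas, counts, true_surface[:, 25:] - 1.5, top=0, left=25)
    recovered = canvas / counts
    # Up to a single global offset, the merged surface matches the original.
    print(float(np.std(recovered - true_surface)))
```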


Figure 6. a) A face image illuminated frontally. b) Recovered range data. c) Original range data for the face.

6. Results and Conclusions

We now present some results of the approach used for 3D surface shape recovery. As mentioned above, we generate about 3000 samples of intensity and range data pairs for each face part, with variations in scale, pose and illumination direction. Half of the data set is used for training and half for testing the 3D shape recovery neural networks. The performance of the neural networks is first analyzed with respect to their accuracy in mapping the IPCC to the SPCC. However, the real measure of performance can only be obtained from an error measure in the domain of the recovered 3D shape parts. Hence we measure the mean percentage error per pixel for each face part. Table 1 shows the mean RMS errors in the SPCC and the mean percentage errors per pixel in the recovered parts. As can be seen, the mean error per pixel is about 2% for the lips, which have the largest errors, implying that the estimated range is within 2% of the true range for all the parts. The mean error of the reconstructed face structure after integration is less than 5%; the slight increase is due to the errors introduced by averaging in the overlap regions. As an example, Fig. 5(a) shows the intensity pattern of a nose image; the recovered surface and the original surface are also shown in the form of range images. In Fig. 6a), we show the intensity image of a face not included in the training data. The recovered range image is shown in Fig. 6b), which can be compared to the original range data in Fig. 6c). These data are also shown as meshes in Fig. 7 and Fig. 8. As can be seen, the recovered surface is remarkably accurate compared to the original, indicating that both the PCA-based local patch reconstruction and the minimum sum squared error integration are quite robust. We have proposed a framework for the recovery of 3D surface shape by recognition and learning. The recognition principle significantly reduces the space of valid surfaces, thus allowing for a robust solution.

Parts         SPCC RMS error   avg. % err/pixel
forehead      3.56e-4          0.89%
nose          3.47e-4          0.90%
right eye     2.59e-4          0.69%
left eye      2.53e-4          0.70%
right cheek   3.23e-4          0.86%
left cheek    2.60e-4          0.82%
lips          7.73e-4          2.03%

Table 1. Mean RMS errors in SPCC and mean % errors per pixel in the recovered parts. Both errors are seen to be very small.
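The exact normalization behind the per-pixel percentage error in Table 1 is not spelled out in the text; one plausible reading (an assumption, used here only for illustration) is the mean absolute per-pixel range error expressed as a percentage of the true range span of the patch:

```python
import numpy as np

def mean_percent_error_per_pixel(recovered, truth):
    """Mean absolute per-pixel range error as a percentage of the true range span.

    This normalization is an assumption for illustration; the paper does not
    specify how the per-pixel percentages in Table 1 are normalized.
    """
    span = truth.max() - truth.min()
    return 100.0 * np.mean(np.abs(recovered - truth)) / span

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    truth = rng.random((32, 24))
    recovered = truth + rng.normal(0, 0.01, truth.shape)   # synthetic 1%-level error
    print(f"{mean_percent_error_per_pixel(recovered, truth):.2f}%")
```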

We conjecture that the human visual system might use similar mechanisms to efficiently recover the 3D shapes of objects, first recognizing generic object parts and then inferring the 3D shapes of those parts. In this work, face parts are first recognized and then the specific 3D shapes of the parts are recovered by a neural network that learns the mapping between the input image domain and the output 3D shape domain. The shape recovery method is based on a neural network that learns brightness patterns and maps them to corresponding range data. Specifically, given a certain class of objects, we propose that it is possible to train a neural network to map the appearance of such a class under varying pose and illumination to its true 3D shape. Specialized neural networks are trained to recover range information for specific face parts such as the nose, lips, eyes, cheeks, and forehead. In prior experiments, we found that specialization is necessary, since the 3D recovery of a full face by a single network is too difficult and yields much less accurate reconstructions. The high dimensionality of such data also contributes to high computational complexity; hence, principal component analysis is used to represent both the intensity and 3D shape data in lower dimensional spaces. Individual neural networks are successfully trained to map the principal component coefficients of image-range pairs under varying pose and illumination. The recovered range data for each part must then be integrated with the others to generate the complete surface. Towards this end, we propose a method for integrating reconstructed surface patches by minimizing the sum squared error in overlapping areas. Individual face parts like the nose and lips are successfully recovered, and the complete 3D shape of the face is generated using minimum squared error merging.

References

[1] J. J. Atick, P. A. Griffin, and A. N. Redlich. Statistical Approach to Shape from Shading: Reconstruction of Three-Dimensional Face Surfaces from Single Two-Dimensional Images. Neural Computation, 8:1321–1340, 1996.
[2] The authors. Model based segmentation and detection of affine transformed shapes in cluttered images. In 1998 IEEE International Conference on Image Processing (ICIP'98), volume III, pages 75–79, October 1998.
[3] B. K. P. Horn. Calculating the Reflectance Map. Computer Vision, Graphics & Image Processing, 33(2):174–206, 1986.
[4] T.-E. Kim, S.-H. Lee, S.-H. Ryu, and J.-S. Choi. Shape Recovery of Hybrid Reflectance Surface using Neural Network. In 1997 IEEE International Conference on Image Processing (ICIP'97), volume III, pages 416–419, October 1997.
[5] R. Kimmel and A. M. Bruckstein. Global Shape From Shading. Computer Vision & Image Understanding, 20(1):23–38, 1998.
[6] Y. G. Leclerc and A. F. Bobick. The direct computation of height from shading. In 1991 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR'91), Lahaina, Maui, Hawaii, 1991.
[7] S. R. Lehky and T. J. Sejnowski. Network model of shape-from-shading: neural function arises from both receptive and projective fields. Nature, 333:452–454, May 1988.
[8] J. Malik and D. Maydan. Recovering Three Dimensional Shape from a Single Image of Curved Objects. IEEE Transactions on Pattern Analysis and Machine Intelligence, 11(6), 1989.
[9] A. P. Pentland. Local Shading Analysis. IEEE Transactions on Pattern Analysis and Machine Intelligence, 6(2):170–187, 1984.
[10] A. P. Pentland, B. Moghaddam, and T. Starner. View-Based and Modular Eigenspaces for Face Recognition. In 1994 IEEE Conference on Computer Vision and Pattern Recognition, 1994.
[11] A. Rosenfeld and A. C. Kak. Digital Image Processing. Academic Press, NY, 1982.
[12] H. A. Rowley, S. Baluja, and T. Kanade. Neural Network-Based Face Detection. IEEE Transactions on Pattern Analysis and Machine Intelligence, 20(1):23–38, 1998.
[13] D. E. Rumelhart, G. E. Hinton, and R. J. Williams. Learning internal representations by error propagation. In Parallel Distributed Processing: Explorations in the Microstructure of Cognition, Vol. 1: Foundations, pages 318–362. MIT Press, Cambridge, MA, 1986.
[14] A. J. Stewart and M. S. Langer. Toward Accurate Recovery of Shape from Shading Under Diffuse Lighting. IEEE Transactions on Pattern Analysis and Machine Intelligence, 19(9):1020–1025, Sept. 1997.
[15] R. J. Woodham. Photometric Method for Determining Surface Orientation from Multiple Images. Optical Engineering, 19(1):139–144, 1980.
[16] Q. Zheng and R. Chellappa. Estimation of illuminant direction, albedo and shape from shading. IEEE Transactions on Pattern Analysis and Machine Intelligence, 13:680–702, 1991.
