Shape from Recognition: A Novel Approach for 3-D Face Shape Recovery

Dibyendu Nandy, Member, IEEE, and Jezekiel Ben-Arie, Member, IEEE
Abstract—In this paper, we develop a novel framework for robust recovery of three-dimensional (3-D) surfaces of faces from single images. The underlying principle is shape from recognition, i.e., the idea that pre-recognizing face parts can constrain the space of possible solutions to the image irradiance equation, thus allowing robust recovery of the 3-D structure of a specific part. Parts of faces like the nose, lips, and eyes are recognized and localized using robust expansion matching filter templates under varying pose and illumination. Specialized backpropagation-based neural networks are then employed to recover the 3-D shape of particular face parts. Representation using principal components allows efficient encoding of classes of objects such as noses, lips, etc. The specialized networks are designed and trained to map the principal component coefficients of the part images to another set of principal component coefficients that represent the corresponding 3-D surface shapes. To achieve robustness to viewing conditions, the networks are trained with a wide range of illumination and viewing directions. A method for merging recovered 3-D surface regions by minimizing the sum squared error in overlapping areas is also derived. Quantitative analysis of the reconstruction of the surface parts under varying illumination and pose shows relatively small errors, indicating that the method is robust and accurate. Several examples showing recovery of the complete face also illustrate the efficacy of the approach.

Index Terms—Backpropagation, expansion matching (EXM), principal components analysis, shape from shading, shape from recognition.
I. INTRODUCTION
Shape from shading is an ill-posed problem: a particular brightness at an image point can give rise to a multiplicity of possible three-dimensional orientations of the underlying surface. Hence, the space of valid solutions has to be constrained in some manner to allow shape recovery. In our approach, the solution space is constrained by dividing an object such as a human face into generic parts like the nose, lips, and eyes and then inferring their three-dimensional (3-D) shapes from the resulting solution space, which is much smaller. We call this concept shape from recognition [16], [2]. We use the principle of expansion matching (EXM) [5], [19], [3] to robustly detect and localize the parts of interest such as the eyes, lips, and nose.

Manuscript received July 13, 1999; revised September 18, 2000. This work was supported in part by the National Science Foundation under Grants IIS-9979774, IRI-9876904, and IRI-9711925 and by the Army Research Office under Grant DAAG 55-98-1-0424. The associate editor coordinating the review of this manuscript and approving it for publication was Dr. Eric L. Miller.
D. Nandy is with the Media Processing Technology Group, Tellabs Operations, Inc., Bolingbrook, IL 60440 USA.
J. Ben-Arie is with the Department of Electrical Engineering and Computer Science, University of Illinois, Chicago, IL 60607-7053 USA (e-mail: [email protected]).
Publisher Item Identifier S 1057-7149(01)00913-7.
Once such parts are recognized and isolated, the solution space is effectively constrained to particular classes of shapes. These shapes may be combined to recover the entire 3-D shape. The representation is made compact by using principal components to efficiently encode the 3-D structure of shapes such as the nose, lips, eyes, and other face parts. Specialized neural networks, trained using the back-propagation algorithm [21], are designed to map the principal component coefficients of the intensity images to the principal component coefficients that represent the corresponding 3-D surface shapes for varying illumination and view directions. This can be regarded as shape from learning [16], [2]. The classification of parts of a face makes the approach more general than simply attempting to learn the mapping for entire face images. Explicitly, the shape recovery process is simplified by limiting it to variations in face parts; implicitly, the approach encodes the combination of variations observed in a larger set of complete faces. The recovery of individual parts also simplifies the training process, as the complexity of each network is reduced. Using neural networks has the additional advantage of speedy reconstruction once the network is trained. The process is summarized in Fig. 1. We conjecture that the human visual system might use similar mechanisms to first generically detect an object part; its 3-D shape may then be efficiently recovered from the reduced space of solutions that exists for that class of objects due to variations in the actual shape, pose, and illumination.

II. PREVIOUS WORK IN SHAPE FROM SHADING

We now briefly review the shape from shading problem and the work of others in this area. The brightness of a point $(x, y)$ in an image $I(x, y)$ is a function of the incident illumination and the corresponding surface normal, given by the image irradiance equation

$$ I(x, y) = R(\hat{\mathbf{n}}(x, y), \mathbf{i}) \qquad (1) $$

where $R(\cdot)$ is the reflectance function, $\hat{\mathbf{n}}$ is the unit surface normal, and $\mathbf{i}$ is the illumination direction. There are an infinite number of normal vector fields that can give rise to a given intensity image if the surface normals are only required to be constrained by (1). This makes the problem extremely ill-posed. The problem is made better posed by enforcing additional constraints, such as integrability, by different means [6]; smooth surfaces always fulfill the integrability condition [8].
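To make this ambiguity concrete, the following minimal sketch (ours, not part of the original paper) assumes the Lambertian reflectance used later in Section III and shows two clearly different surface orientations that produce identical brightness under the same light:

```python
import numpy as np

# For a Lambertian surface, brightness depends only on the angle between
# the light direction i and the unit normal n, so a whole cone of normals
# around i yields the same value of (1) -- the shape from shading ambiguity.
i = np.array([0.0, 0.0, 1.0])           # light along the viewing axis

def brightness(n):
    n = n / np.linalg.norm(n)           # unit surface normal
    return max(0.0, float(i @ n))       # Lambertian instance of eq. (1)

n1 = np.array([np.sin(0.3), 0.0, np.cos(0.3)])  # tilted in the x-z plane
n2 = np.array([0.0, np.sin(0.3), np.cos(0.3)])  # tilted in the y-z plane
print(brightness(n1), brightness(n2))   # both print cos(0.3) ~= 0.955
```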
Fig. 1. Shown is a block diagram of the data flow in our shape recovery algorithm. The intensity image on the left is processed by expansion matching (EXM) detectors for the eyes, nose, lips, etc. Rectangular windows at these locations extract the intensity patches for further processing. Symbolic patches of the eyes, nose, and lips are shown, and the data flow for only the nose region is illustrated. Each of these imaged parts is projected onto its respective principal components to generate intensity principal component coefficients (IPCC). These are input to a backpropagation-based neural network trained to give the corresponding 3-D shape principal component coefficients (SPCC). The SPCC are used to recover the individual 3-D shape of the face parts. These 3-D shapes, in the form of range data, are integrated together in a smooth manner by applying corrections to the overlapping areas.
Most shape from shading methods have been based on variational approaches that minimize the average error in satisfying the image irradiance equation, with surface continuity constraints included in the functional using Lagrange multipliers. This includes the method proposed by Horn [9] and variations thereof, including photometric stereo [23], direct reconstruction of height by enforcing differential smoothness constraints [12], line-drawing interpretation to reconstruct piecewise smooth surfaces [14], and methods using only local differential conditions [18]. Each method has its unique constraints; for example, in [18] it is assumed that the surface is locally spherical. In addition, all the variational methods are iterative and also require knowledge of the exact reflectance function of the surface.

Some novel approaches have been proposed in recent years. In [11], the authors use topological considerations to classify singular points in images and use a weighted distance transform to globally recover surface shape. In [22], diffuse illumination is modeled by interreflections, and complex 3-D shapes like drapes are recovered by iteratively adjusting line-of-sight/horizon height values. A recent work using principal components is proposed in [1], where complete 3-D face shapes are used to iteratively solve the image irradiance equation in parametric form. Unlike that approach, we use the principle of recognition-based shape from shading and recover faces by parts, and we use neural networks to recover principal components of 3-D shapes from image principal components. Generically recognizing faces allows us to detect faces in images and then recover the 3-D shape of faces over the range of poses and illuminations that the neural networks are trained with.

A few approaches using neural networks for the shape from shading problem have also been proposed. These include [13], in which it is shown that it is possible to reconstruct the approximate surface normals of ellipsoidal shapes by training a network with intensity patterns; however, this work is limited to one type of surface. Also, [10] models the reconstruction of surface normals of a hybrid reflectance surface using a photometric stereo approach with known multiple illuminations.

Most of the above algorithms are designed to recover the 3-D shape of smooth but arbitrary objects. However, the information available in the image is limited, so the problem is one of estimating a 3-D shape in a space with an excessively large number of degrees of freedom. In contrast, we know that prior knowledge, contextual information, and prior expectations influence human interpretation of sensory data [7]. We base our shape from recognition and learning approach on this idea.
Fig. 2. Coordinate system for mapping range data. $r(h, \theta)$ is the cylindrical range information in $(h, \theta)$ coordinates. $\phi$ (pan angle) and $\psi$ (tilt angle) define the viewpoint and illumination vectors used in our model.
III. IMAGING MODEL FOR FACE RANGE DATA

The human head models used in our work are described by range data defined by the function $r(h, \theta)$, where $h$ and $\theta$ are the height and pan or azimuthal angular coordinates, respectively. The Cartesian coordinates of each point on the surface are given by

$$ x = r(h, \theta)\sin\theta, \quad y = h - h_0, \quad z = r(h, \theta)\cos\theta \qquad (2) $$

for some $h_0$ which relates the origins of the two coordinate systems. Different projections of the face are obtained by two rotations, the pan or azimuthal angle $\phi$ and the tilt or elevation angle $\psi$. Changes in $\phi$ may be directly applied to the data $r(h, \theta)$, and $\psi$ is changed through a rotation matrix. The geometry is illustrated in Fig. 2; in this figure, the nose of the head is along the $z$ direction.

The surface normal at a surface point $\mathbf{p} = (x, y, z)$ is given by the cross product

$$ \mathbf{n} = \frac{\partial \mathbf{p}}{\partial h} \times \frac{\partial \mathbf{p}}{\partial \theta}. \qquad (3) $$
The unit normal can be shown to be

$$ \hat{\mathbf{n}} = \frac{\mathbf{n}}{\|\mathbf{n}\|}. \qquad (4) $$

Fig. 3. Sample intensity image $I(x, y)$ (left) and the corresponding range image $r(x, y)$ (right), viewed from direction $(-9^\circ, 2^\circ)$ and illuminated from direction $(-2^\circ, 13^\circ)$, rendered according to Section III.

We assume that the surface reflectance of faces is Lambertian (diffuse reflectance)¹ and we model the image formation process under a single illumination with varying source direction. The image brightness can now be computed as

$$ I(x, y) = \mathbf{i} \cdot \hat{\mathbf{n}}(x, y) \qquad (5) $$

where $\mathbf{i}$ is a vector representing the direction of incident light and $\hat{\mathbf{n}}$ is the surface normal of (4). Equation (2) is used to generate the projected image $I(x, y)$; the corresponding range data $r(x, y)$ is obtained similarly. In our experiments, approximately frontal view directions are chosen from a narrow range of pan and tilt angles, and the illumination directions are chosen from a substantially wider range; these parameter ranges define our training and test data. An example of such a rendered face image and the corresponding range image is shown in Fig. 3.

¹For Lambertian surfaces, the reflectance function $R(\cdot)$ in (1) is a function of the inner product of $\mathbf{i}$ and $\mathbf{n}$. It must be emphasized here that surfaces have different kinds of reflectance, varying from completely diffuse to completely specular (mirror-like). Most surfaces may be modeled as being primarily diffuse with some specular part.
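A minimal NumPy sketch of the rendering model of (2)–(5) follows. It is an illustration under the stated Lambertian, constant-albedo assumption; the function name and sampling are ours, and the tilt rotation and projection onto the image grid are omitted for brevity:

```python
import numpy as np

def render_head(r, h, theta, light):
    """Shade a cylindrical range map r[h, theta] under a distant light.

    r     : 2-D array of radial distances, one per (height, azimuth) sample
    h     : 1-D array of height coordinates
    theta : 1-D array of azimuth (pan) coordinates, nose near theta = 0
    light : unit 3-vector, direction of the incident illumination
    """
    T, H = np.meshgrid(theta, h)        # grids of shape (len(h), len(theta))
    # Eq. (2): Cartesian coordinates of each surface point.
    p = np.stack([r * np.sin(T), H, r * np.cos(T)], axis=-1)
    # Eq. (3): normal as the cross product of the two tangent vectors.
    n = np.cross(np.gradient(p, h, axis=0), np.gradient(p, theta, axis=1))
    # Eq. (4): unit normal (flip the sign if it points into the head
    # for your parameterization).
    n_hat = n / np.linalg.norm(n, axis=-1, keepdims=True)
    # Eq. (5): Lambertian shading, clipped at zero for attached shadows.
    return np.clip(n_hat @ light, 0.0, None)
```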
IV. RECOGNITION AND SEGMENTATION OF FACE PARTS

We segment the face image into parts such as the eyes, nose, lips, and forehead for analysis. The face center and the relative positions of facial features are determined using expansion matching (EXM) filters [5] for each individual feature. Since the relative positions of face features are more or less constant in humans, nonspecific parts like the cheeks and forehead are derived from the positions of detected features like the eyes, nose, and lips. We now describe the EXM approach for localizing face features.

EXM is an optimal matching method based on nonorthogonal expansion of the image signal onto template-similar basis functions. Detection using EXM is founded on optimizing a novel discriminative signal-to-noise ratio (DSNR), defined by Ben-Arie and Rao [19], [5], [3] as the ratio of the signal response power at the center location of a feature template to the response power elicited at all off-center points. We can design an EXM-based detector for any given shape or feature model by optimizing the DSNR. Here we present only the main concepts of EXM and refer the reader to previous works by the authors [19], [5], [17] for details.

Different faces, imaged in frontal poses and illuminations, are approximately (manually) registered and averaged. The average face image $\mathbf{t}$ (in lexicographic vector form) then forms the template for designing the EXM filter. We model the distortions from the template in the image to be processed as additive uncorrelated noise. This is not strictly true; however, when we seek to detect a feature like a nose, it occurs in the context of other face features like eyes or cheeks, and since these are in general dissimilar from the nose image, the assumption of reduced correlation is reasonable. The face autocorrelation model is given by $R_{tt}$ and the autocorrelation of the "noise" is assumed to be $R_{vv}$. The EXM face filter $\mathbf{g}$ is given by [19], [5]

$$ \mathbf{g} = c\,(R_{tt} + R_{vv})^{-1}\,\mathbf{t} \qquad (6) $$

where $c$ is a scaling factor. As we see from (6), the EXM filter is similar to the Wiener restoration filter, where the signal being restored is the ideal detection result (a Dirac delta) and the feature template is the blurring function. Like the Wiener filter, the EXM filter is efficiently implemented in the frequency domain. In two dimensions, it is given by

$$ G(u, v) = \frac{T^{*}(u, v)}{|T(u, v)|^{2} + S_{vv}(u, v)/S_{tt}(u, v)} \qquad (7) $$

where $S_{vv}(u, v)$ and $S_{tt}(u, v)$ are the spectral densities of the expected noise/distortion and of the correlation coefficients, respectively. The term $S_{vv}(u, v)/S_{tt}(u, v)$ is usually replaced by a constant parameter to make the filter independent of the correlation process; our choice of this parameter provides a good balance between robustness to noise and good discrimination.

The vector $\mathbf{g}$ may be reordered as a two-dimensional filter matrix $g(x, y)$. This forms an EXM mask for the entire face, as shown in Fig. 4(a). Specific feature filters for the eyes, nose, and lips are extracted from $g(x, y)$ by windowing with Gaussians $w_i(x, y; \sigma_i)$; the $\sigma_i$'s are given in Table I, and the locations $(x_i, y_i)$ are determined by the location of the features in the average face. Thus, the EXM detector for each feature part is given by

$$ g_i(x, y) = g(x, y)\, w_i(x - x_i,\, y - y_i). \qquad (8) $$

The various feature detectors are shown in Fig. 4.

Fig. 4. (a) EXM template for the entire face, (b) EXM nose detector, (c) EXM right-eye detector, (d) EXM left-eye detector, and (e) EXM lip detector.

TABLE I
$\sigma$'S USED FOR THE GAUSSIAN WINDOWS TO DESIGN EXM FILTERS FOR THE FACE PARTS
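As an illustration of (6)–(8), the sketch below builds the frequency-domain EXM filter with the spectral-density ratio replaced by a constant, as described above, and windows the full-face mask into a feature detector. The function names and the value of gamma are our placeholders, not the paper's:

```python
import numpy as np

def exm_detect(image, template, gamma=1e-2):
    """Apply an EXM detector; returns the response map and its peak."""
    # Eq. (7): Wiener-restoration form of the EXM filter, with the ratio
    # S_vv/S_tt replaced by the constant gamma (illustrative value).
    T = np.fft.fft2(template, s=image.shape)        # zero-padded template
    G = np.conj(T) / (np.abs(T) ** 2 + gamma)
    response = np.real(np.fft.ifft2(np.fft.fft2(image) * G))
    # A sharp peak (high DSNR) marks the candidate feature location.
    peak = np.unravel_index(np.argmax(response), response.shape)
    return response, peak

def feature_filter(g, center, sigma):
    """Eq. (8): window the full-face EXM mask g(x, y) with a Gaussian
    centered on the feature location found in the average face."""
    ys, xs = np.indices(g.shape)
    w = np.exp(-((xs - center[1]) ** 2 + (ys - center[0]) ** 2)
               / (2.0 * sigma ** 2))
    return g * w
```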
We illustrate the results of face localization on two differently posed and illuminated faces. The detection results for each filter are presented, followed by the original face with the detected locations. As seen in Figs. 5 and 6, the detection results are quite robust. This approach allows us to apply overlapping windows (see Fig. 7) to a detected face with respect to its center, capturing the nose, eyes, forehead, cheeks, and lips. This is analogous to foveating on the center of a size-normalized face image and knowing the approximate relative positions and sizes of the various face parts. Once a part is detected, the shape recovery problem is reduced to estimating the specific 3-D shape of an example of that class of objects.

V. PRINCIPAL COMPONENT ANALYSIS

Each face part is associated with intensity data $I(x, y)$ and range data $r(x, y)$, parameterized on the set of view directions, scales, and illumination directions as described in Section III. The aim is to learn the mapping between $I(x, y)$ and the corresponding $r(x, y)$. The dimensionality of the problem depends on the size of the imaged part as well as the range of the viewing and illumination parameters that we seek to generalize. Principal component analysis (PCA) by the Karhunen–Loeve transform [20] provides a meaningful way to reduce the dimensionality and consequently the computational complexity. Here, we briefly outline the procedure.

Given a dataset consisting of $M$ intensity images (each reordered as a vector) of a class, say noses, the principal component representation of any sample image $I$ within the same class is given by

$$ I \approx \mu + \sum_{i=1}^{K} a_i \phi_i \qquad (9) $$

where $\mu$ is the mean image, the $\phi_i$ are the principal components of the dataset, and the $a_i$ are the corresponding coefficients. The coefficients of the PCA are given by

$$ a_i = \phi_i^{T}(I - \mu). \qquad (10) $$

The principal components are the eigenvectors of the dataset, obtained by solving the eigenvalue equation

$$ C\,\phi_i = \lambda_i\,\phi_i \qquad (11) $$

where $C$ is the auto-covariance of the data (the images), $\lambda_i$ is the eigenvalue, and $\phi_i$ is the eigenvector or principal component. This representation (9) is the best linear approximation in a minimum mean squared error sense; the mean squared error in reconstruction with only $K$ bases is given by

$$ \epsilon_K^2 = \sum_{i > K} \lambda_i. \qquad (12) $$

A similar set of principal components $\psi_i$ is extracted for the range data $r$, which may be represented by

$$ r \approx \mu_r + \sum_{i=1}^{K_r} b_i \psi_i. \qquad (13) $$

The $\phi_i$ are used to generate the intensity principal component coefficients (IPCC) for the input; the $\psi_i$ are used to reconstruct the range data from the neural network generated 3-D shape principal component coefficients (SPCC). Figs. 8 and 9 show the average and the first few principal components for the nose under the variations we seek to generalize and learn. Tables II and III compare the window sizes of each feature with the subspace dimension required to represent them accurately for our training data of about 1500 randomly generated sample images. The dimension is chosen so that the mean squared error in reconstruction, as given by (12), is less than 5% for the intensity data and less than 1% for the range data. The two tables also show the energies of the first five principal components (eigenvalues) for each feature.
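The PCA of (9)–(13) can be sketched as follows (our illustration; the SVD route is equivalent to solving (11) and is cheaper when the image dimension greatly exceeds the number of samples):

```python
import numpy as np

def fit_pca(X, energy=0.95):
    """X: (M, N) matrix of M vectorized patches (intensity or range).
    Keeps the leading components capturing the requested energy fraction
    (95% for intensity, 99% for range, per Tables II and III)."""
    mu = X.mean(axis=0)
    # Singular vectors of the centered data = eigenvectors of C in (11).
    U, s, Vt = np.linalg.svd(X - mu, full_matrices=False)
    lam = s ** 2 / X.shape[0]                         # eigenvalues
    k = int(np.searchsorted(np.cumsum(lam) / lam.sum(), energy)) + 1
    return mu, Vt[:k], lam[:k]

def project(x, mu, phi):
    return phi @ (x - mu)         # eq. (10): IPCC or SPCC coefficients

def reconstruct(a, mu, phi):
    return mu + phi.T @ a         # eq. (9)/(13): low-dimensional synthesis
```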
Fig. 5. (a) Nose detection, (b) detection of right eye, (c) detection of left eye, and (d) lip detection. These raw EXM filtering results are shown in reverse brightness; thus, the locations are indicated by the black spots, which are easily extracted by simple peak detection. The results are indicated in (e) the original image using the × symbol.

Fig. 6. (a) Nose detection, (b) detection of right eye, (c) detection of left eye, and (d) lip detection. These raw EXM filtering results are shown in reverse brightness; thus, the locations are indicated by the black spots, which are easily extracted by simple peak detection. The results are indicated in (e) the original image using the × symbol.
VI. LEARNING SHADING PATTERNS

Given a class of face features, we train a neural network to associate the appearance of such features under varying pose and illumination with their true 3-D shape. Since the same 3-D surface can give rise to multiple images depending on the pose and illumination, we have a many-to-one mapping from such appearances in an image to the underlying 3-D shape. Specialized neural networks are designed to map intensity input patterns belonging to such classes of parts to their 3-D shapes under varying pose and illumination. These parts may then be merged to recover the complete facial structure using a minimum mean squared error criterion. The efficiency of training the networks is improved significantly by using the principal component representation outlined in Section V to reduce the number of parameters that each network needs to learn.
TABLE II
PRINCIPAL COMPONENT REPRESENTATION FOR THE IMAGES OF THE DIFFERENT FACE PARTS. THE IMAGE DATA CONSIST OF THE DIFFERENT FACES USED, WITH THE POSE AND ILLUMINATION VARIATIONS. THE ORIGINAL SIZE OF THE IMAGED PART IS COMPARED TO THE DIMENSIONALITY OF THE PRINCIPAL COMPONENT SUBSPACE USED; THIS SUBSPACE REPRESENTS 95% OF THE ENERGY OF THE REPRESENTATION. THE FIRST FIVE EIGENVALUES ARE ALSO PROVIDED TO INDICATE THE RATE OF FALL OF THE SIGNIFICANCE OF THE PRINCIPAL COMPONENTS
Fig. 7. This figure illustrates the overlapping windows applied to the face images after detection of the face parts. Each window is outlined in a different shade of gray to delineate it from the others.
TABLE III
PRINCIPAL COMPONENT REPRESENTATION FOR THE 3-D SHAPES OF THE DIFFERENT FACE PARTS. THE RANGE DATA CONSIST OF THE DIFFERENT FACES USED, WITH POSE VARIATIONS. THE ORIGINAL SIZE OF THE RANGE IMAGE IS COMPARED TO THE DIMENSIONALITY OF THE PRINCIPAL COMPONENT SUBSPACE USED; THIS SUBSPACE REPRESENTS 99% OF THE ENERGY OF THE REPRESENTATION. THE FIRST FIVE EIGENVALUES ARE ALSO PROVIDED TO INDICATE THE RATE OF FALL OF THE SIGNIFICANCE OF THE PRINCIPAL COMPONENTS
Fig. 8. Average and first eight principal components of intensity data for noses. The principal components are shown added to the average image to show how each function modifies the average. The coefficients obtained by projection of the input intensity field on these basis functions are used as inputs.
Fig. 9. Average and first eight principal components of range data for noses. The principal components are shown added to the average image to show how each function modifies the average. The coefficients corresponding to these basis functions are used as outputs to train our network.
Using our rendering model, we generate about 3000 samples of intensity and range data for each face part, with pose and illumination randomly selected from the ranges defined in Section III. The training data is generated from range models of seven human heads; these are shown in Fig. 10(a)–(g). The pose and illumination of these models are changed to generate
Fig. 10. Data used in experiments. (a)–(g) show example intensity images from training data using seven head models; (h) shows an image from data used for testing only to show the generalization of the model.
training data for the individual parts. We assume a Lambertian reflectance and a constant albedo model for our experiments. Half of this data set is used in training the networks. The other half is used in showing that the errors in the shape recovery are acceptably small. An eighth range image, Fig. 10(h), is used to
TABLE IV
THE NEURAL NETWORKS USED ARE FULLY CONNECTED AND TRAINED USING THE BACKPROPAGATION ALGORITHM. THE INPUT NODES PROVIDE A LOCAL REGION OF SUPPORT FOR EACH NETWORK, DEFINED BY THE UNDERLYING INTENSITY PATCH IT REPRESENTS. THE HIDDEN AND OUTPUT NODES PROVIDE THE PROCESSING FUNCTIONS, WHICH ARE PRIMARILY AN AFFINE TRANSFORM OF THE INPUT VECTOR TO THAT NODE FOLLOWED BY A SIGMOIDAL NONLINEARITY. THE TRAINING USES THE BACKPROPAGATION ALGORITHM AND IS RUN FOR 50 000 ITERATIONS IN A BATCH MODE. THE AVERAGE ERROR OF THE OUTPUT COEFFICIENTS FOR THE TRAINING DATA AT THE END OF THIS PERIOD IS ABOUT 10⁻²
generate test data for demonstrating the generalization of the training to data not used in training.

In the training data, the input intensities and output heights of the pixels in each patch are normalized such that the center pixel is set to 0; the input intensity and output height are then scaled to a fixed range. The IPCC of the input shading patterns and the SPCC of the output range patterns are generated using PCA as given in Section V. The input implicitly has a local receptive field in the image domain: the image is transformed into the IPCC for a rectangular patch extracted from the face image. The illumination vector is assumed to be known and is defined by its direction cosines in the frame of reference of the 3-D face model. This is a reasonable assumption, as there are several algorithms which can estimate the illumination direction, e.g., [24]. The IPCC along with the direction cosines of the illumination vector form the input, and the corresponding SPCC form the output for training the neural networks. The network thus learns to map the IPCC at the input to the SPCC at the output. As PCA is used to derive these representations, data which are near (have small Euclidean distance to) each other in the principal component subspace are also near in the full image (or range) space. This is a direct result of the low mean representation error of the PCA coefficients. Thus, even approximate SPCC generated by the neural network recover 3-D shape information (range) which closely approximates the true shape.

The neural model is a three-layer structure with input, hidden, and output layers. The hidden and output nodes have a sigmoidal nonlinearity which compresses the output to a bounded range. The training is done using back-propagation [21] and is carried out in a batch mode until the average error is less than 1% on the output set; in our experiments, 50 000 batch training cycles were used. The details of the structure of each neural network used for the various parts are given in Table IV. Note that the input includes the IPCC's supplemented by the illumination vector, and thus the number of input nodes for each network is the IPCC dimension plus the three direction cosines. In usage, the network is given IPCC vectors as input and generates the corresponding SPCC vector. The SPCC is used to reconstruct the range using (13). Individual range parts are then merged together into a complete surface using the method described in Section VII.
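A minimal sketch of one specialized network follows; it shows only the forward pass of the three-layer structure, with illustrative layer sizes (Table IV lists the actual per-part configurations) and with the SPCC targets assumed to be scaled into the sigmoid's range. Fitting the weights by batch backpropagation [21] is standard and omitted here:

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

class ShapeNet:
    """Three-layer network mapping IPCC + illumination direction cosines
    to SPCC (forward pass only; weights would be fit by backpropagation)."""

    def __init__(self, n_ipcc, n_hidden, n_spcc):
        n_in = n_ipcc + 3                 # IPCC plus the illumination vector
        self.W1 = rng.normal(0.0, 0.1, (n_hidden, n_in))
        self.b1 = np.zeros(n_hidden)
        self.W2 = rng.normal(0.0, 0.1, (n_spcc, n_hidden))
        self.b2 = np.zeros(n_spcc)

    def forward(self, ipcc, light):
        x = np.concatenate([ipcc, light])          # input layer
        hdn = sigmoid(self.W1 @ x + self.b1)       # hidden layer
        return sigmoid(self.W2 @ hdn + self.b2)    # SPCC estimate
```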
VII. MERGING SURFACE PATCHES

The reconstructed parts are defined to have rectangular fields of support in the input domain. These regions are merged together in a smooth manner by minimizing the difference in heights in the overlapping pixels of neighboring regions. Each patch has a floating height because of the normalization described in Section VI. Let $j$ index the location of a rectangular support region and let $r_j(i)$ be the recovered height of patch $j$ at an overlapping grid point $i$. The relative height difference $d_{jk}$ between two overlapping patches $j$ and $k$ is found by minimizing the sum squared error in heights over the overlapping area grid points indexed by $i$

$$ E_{jk} = \sum_{i} \left[ r_j(i) - r_k(i) - d_{jk} \right]^2. \qquad (14) $$

Minimizing $E_{jk}$ by setting $\partial E_{jk} / \partial d_{jk} = 0$ gives

$$ d_{jk} = \frac{1}{N_o} \sum_{i} \left[ r_j(i) - r_k(i) \right] \qquad (15) $$
where $N_o$ is the number of pixels in the overlap area. The pixels in any overlapping area of two adjacent recovered range patches thus have an average reconstruction height difference given by $d_{jk}$; obviously, $d_{jk} = -d_{kj}$. Two such adjacent patches are merged by applying the relative height difference: $d_{jk}$ is added to $r_k$ to minimize the MSE over the overlap area. Note that the rectangular support regions are fixed for the face images, so these corrections can be applied directly to the overlapping areas of each such region. The overlapping area grid points then still carry multiple heights with minimum sum squared errors; these heights are averaged to generate a unique and consistent height estimate for each point. This results in a complete 3-D shape recovery.
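A small sketch of the merging step of (14)–(15) and the final averaging, assuming (as noted above) that the fixed rectangular patches have already been resampled onto a common face grid; the function names are ours:

```python
import numpy as np

def merge_offset(r_j, r_k, overlap):
    """Eq. (15): least-squares height offset between two patches; overlap
    is a boolean mask selecting their common grid points. Adding the
    returned d_jk to r_k aligns it with r_j in the MSE sense of (14)."""
    return float((r_j[overlap] - r_k[overlap]).mean())

def fuse(patches, masks, shape):
    """Average the (already offset-corrected) heights wherever several
    patches cover the same grid point, giving one consistent estimate."""
    acc, cnt = np.zeros(shape), np.zeros(shape)
    for r, m in zip(patches, masks):
        acc[m] += r[m]
        cnt[m] += 1
    return acc / np.maximum(cnt, 1)                # avoid division by zero
```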
TABLE V
MEAN RMS ERRORS IN THE RECOVERED SPCC FOR THE VARIOUS PARTS. THE FIRST COLUMN SHOWS THE MEAN RMS ERROR OVER ALL THE PRINCIPAL COMPONENTS USED IN THE REPRESENTATION; THE REMAINING COLUMNS SHOW THE RMS ERRORS FOR THE FIRST FIVE PRINCIPAL COMPONENTS. THESE ERRORS ARE SEEN TO BE RELATIVELY SMALL FOR THE DIFFERENT FACE PARTS
TABLE VI
MEAN RMS ERRORS IN SPCC (FROM TABLE V) AND THE RESULTING AVERAGE ABSOLUTE PERCENTAGE ERROR PER PIXEL (AAPEPP) IN THE RECOVERED PARTS. BOTH ERRORS ARE SEEN TO BE RELATIVELY SMALL. THE AAPEPP DEFINES THE TOLERANCE OF THE NETWORKS IN TERMS OF THE ACCURACY OF RECOVERING 3-D SHAPES
VIII. EXPERIMENTAL RESULTS

We present here some results using our system for surface shape recovery. As mentioned above, we generate about 3000 samples of intensity and range data pairs for each face part, with variations in pose and illumination direction. Seven head models [Fig. 10(a)–(g)] are used for this data. The view directions are randomly generated from the approximately frontal range described in Section III, and the illumination directions are randomly generated from their wider range. Half of the data set is used in training and half in evaluating the performance metrics for the 3-D shape recovery neural networks. Full face recovery is also demonstrated in this section. Head models from the seven used for training are used to generate input images which are oriented and illuminated from directions not used in training. An eighth head model [Fig. 10(h)] is also used to demonstrate the generalization capability to reconstruct faces not used in training.

The performance of the neural networks is first analyzed with respect to accuracy in mapping the IPCC to the SPCC. As seen in Table V, the mean RMS error over all the principal components, as well as the RMS errors of the five largest principal component coefficients considered individually for each part, are small, indicating that the neural networks are robust in their mapping. However, the real measure of performance can be obtained only from an error measure in the domain of the recovered 3-D shape parts. Hence, we measure the average absolute percentage error per pixel (AAPEPP) for each recovered face part. This error measure is defined as follows. Given the estimated range $\hat{r}(i)$ at pixel $i$ for a part, the original range measure $r(i)$, and $N$ pixels in that part,

$$ \mathrm{AAPEPP} = \frac{100}{N} \sum_{i=1}^{N} \left| \frac{\hat{r}(i) - r(i)}{r(i)} \right|. \qquad (16) $$

Table VI shows the mean RMS errors for all the principal components in the SPCC and the AAPEPP errors per pixel in the recovered parts. As can be seen, the AAPEPP is about 2% for the lips, which have the largest errors, implying that the estimated range is within 2% of the true range for all the parts. The mean error of the reconstructed 3-D face shapes after merging of the individual parts is less than 3%. As an example of the recovery of individual parts, in Fig. 11 we show the intensity pattern due to the image of a nose; the recovered nose shape and the original shape are shown in the form of range images.
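The error measure of (16) reduces to a one-line computation (our sketch; range values are assumed nonzero, which holds for the cylindrical distances used here):

```python
import numpy as np

def aapepp(r_est, r_true):
    """Eq. (16): average absolute percentage error per pixel between the
    recovered and true range of one part (flattened arrays)."""
    return 100.0 * float(np.mean(np.abs((r_est - r_true) / r_true)))
```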
Fig. 11. (a) Shading pattern of a particular nose illuminated from the direction (14.4°, 11.7°), (b) recovered nose range image using the backpropagation network and PCA, and (c) original 3-D nose shape for comparison.
We now demonstrate the robustness of our approach to strong illumination variations and significant pose variations. To illustrate this clearly, for each of our examples we generate two intensity images at different poses and illuminated from quite different directions; this is apparent from the different intensity images input to the system, as shown. The recovered surface shape is then shown along with the original surface. The reconstruction is seen to be quite accurate, both visually and by the AAPEPP indicated in the figure captions. These results are shown in Figs. 12–14. We also demonstrate, in Fig. 15, the successful reconstruction of the 3-D shape from a head model with which the system was not trained. It can be observed that the reconstruction error is not significantly larger than with the other models.

IX. CONCLUSIONS

We have proposed a framework for recovery of surface shape by recognition and learning. The recognition principle significantly reduces the space of valid surfaces, thus allowing for a robust solution. In this work, we have outlined this principle and focused on the problem of recovery of face shapes.
Fig. 12. (a) Face posed at (8.2°, 1.2°) off center and an illumination of (−20.2°, 21.8°), along with the recovered range data (b) and the original range data (c) (AAPEPP: 2.19%). (d) The same face posed at (−5.2°, 1.1°) off center and an illumination of (9.9°, 7.5°), along with the recovered range data (e) and the original range data (f) (AAPEPP: 2.88%).

Fig. 13. (a) Face posed at (6.9°, 3.5°) off center and an illumination of (−15.5°, 10.2°), along with the recovered range data (b) and the original range data (c) (AAPEPP: 1.31%). (d) The same face posed at (−4.8°, 2.1°) off center and an illumination of (14.5°, 26.2°), along with the recovered range data (e) and the original range data (f) (AAPEPP: 1.41%).
Fig. 14. (a) Face posed at (−7.6°, 1.5°) off center and an illumination of (−25.1°, 19.4°), along with the recovered range data (b) and the original range data (c) (AAPEPP: 1.52%). (d) The same face posed at (9.4°, 0.5°) off center and an illumination of (11.2°, 14.7°), along with the recovered range data (e) and the original range data (f) (AAPEPP: 2.27%).

Fig. 15. Recovery of face shape from untrained models. (a) A face posed at (10.2°, 1.2°) off center and an illumination of (−21.3°, 22.5°), along with the recovered range data (b) and the original range data (c) (AAPEPP: 3.03%). (d) The same face posed at (7.8°, 2.1°) off center and an illumination of (−10.2°, 5.2°), along with the recovered range data (e) and the original range data (f) (AAPEPP: 2.84%).
Face parts can first be detected and localized, and then the specific 3-D shapes of the parts can be recovered by a neural network that learns the mapping between the input image domain and the output 3-D shape domain. The shape recovery method is based on using neural networks to learn brightness patterns and map them to corresponding range data. Specifically, given a certain class of objects, we show that it is possible to train a neural network to learn to map the appearance of such a class of objects under varying pose and illumination to its true 3-D shape.

Specialized neural networks are trained to recover range information of specific face parts like the nose, lips, eyes, cheeks, and forehead. In prior experiments, we found that specialization is necessary, since the 3-D recovery of a full face by a single network is too difficult and yields less accurate reconstruction. The high dimensionality of such data also contributes to high computational complexity. Hence, principal component analysis is used to represent both the intensity and 3-D shape data in lower dimensional spaces. Individual neural networks are successfully trained to map the coefficients of the principal components for image-range pairs under varying pose and illumination. The recovered range data for each part must then be merged with others to generate the complete surface. Toward this end, we proposed a method for merging reconstructed surface patches by minimizing the sum squared error in overlapping areas. Individual parts of faces like the nose and lips are successfully recovered and merged to generate the 3-D shape of the face. The mean error rates in the surface recovery are less than 3% per pixel. In addition, we demonstrate using examples that the system is robust to quite large variations in illumination direction as well as to significant variations in pose. We also demonstrate accurate shape recovery for a face model not used in training.

The above experiments are constrained to face data using a Lambertian reflectance model, with frontal and near-frontal variations in the views of faces and strong variations in illumination. We see that the networks are able to learn to associate shapes with corresponding images robustly over a wide range of illumination variations and moderate variations in pose. In future work, we plan to extend this approach to more generalized reflectance models as well as larger variations in pose. Methods to make the approach robust to errors in illumination estimates will be investigated. The approach will also be tested with a larger database and with real imagery after appropriate training.

The results presented in this work may have implications in understanding the way the brain solves the 3-D shape perception problem. We conjecture that the brain is good at visual recognition of generic classes of objects. Knowing an object class immediately reduces the problem to recovering shape variations which differ with varying viewpoints and occurrences of specific examples of that class. This drastically reduces the solution space of the shape from shading problem, making it more tractable.

REFERENCES

[1] J. J. Atick, P. A. Griffin, and A. N. Redlich, “Statistical approach to shape from shading: Reconstruction of three-dimensional face surfaces from single two-dimensional images,” Neural Comput., vol. 8, pp. 1321–1340, 1996.
[2] J. Ben-Arie and D. Nandy, “A neural network approach for reconstructing surface shape from shading,” in Proc. IEEE 1998 Int. Conf. Image Processing, vol. 2, Chicago, IL, Oct. 1998, pp. 972–976.
[3] J. Ben-Arie and K. R. Rao, “A novel approach for template matching by nonorthogonal image expansion,” IEEE Trans. Circuits Syst. Video Technol., vol. 3, pp. 71–84, Feb. 1993.
[4] ——, “On the recognition of occluded shapes and generic faces using multiple-template expansion matching,” in Proc. 1993 IEEE Conf. Computer Vision Pattern Recognition, New York, June 1993, pp. 214–219.
[5] ——, “Optimal template matching by nonorthogonal image expansion using restoration,” Int. J. Mach. Vis. Applicat., vol. 7, pp. 69–81, Mar. 1994.
[6] R. T. Frankot and R. Chellappa, “A method for enforcing integrability in shape from shading algorithms,” IEEE Trans. Pattern Anal. Machine Intell., vol. 11, no. 4, pp. 439–451, 1988.
[7] R. L. Gregory, The Intelligent Eye. New York: McGraw-Hill, 1970.
[8] H. W. Guggenheimer, Differential Geometry. New York: Dover, 1977.
[9] B. K. P. Horn, “Calculating the reflectance map,” Comput. Vis. Graph. Image Process., vol. 33, no. 2, pp. 174–206, 1986.
[10] T.-E. Kim, S.-H. Lee, S.-H. Ryu, and J.-S. Choi, “Shape recovery of hybrid reflectance surface using neural network,” in Proc. 1997 IEEE Int. Conf. Image Processing, vol. 3, Oct. 1997, pp. 416–419.
[11] R. Kimmel and A. M. Bruckstein, “Global shape from shading,” Comput. Vis. Image Understand., vol. 20, no. 1, pp. 23–38, 1998.
[12] Y. G. Leclerc and A. F. Bobick, “The direct computation of height from shading,” in Proc. 1991 IEEE Computer Society Conf. Computer Vision Pattern Recognition, Lahaina, Maui, HI, 1991.
[13] S. R. Lehky and T. J. Sejnowski, “Network model of shape-from-shading: Neural function arises from both receptive and projective field,” Nature, vol. 333, pp. 452–454, May 1988.
[14] J. Malik and D. Maydan, “Recovering three dimensional shape from a single image of curved objects,” IEEE Trans. Pattern Anal. Machine Intell., vol. 11, no. 6, 1989.
[15] B. Moghaddam and A. P. Pentland, “Probabilistic visual learning for object representation,” IEEE Trans. Pattern Anal. Machine Intell., vol. 19, pp. 696–710, July 1997.
[16] D. Nandy and J. Ben-Arie, “Shape from recognition and learning: Recovery of face shape,” in Proc. IEEE Computer Society 1999 Conf. Computer Vision Pattern Recognition, vol. 2, Fort Collins, CO, June 1999, pp. 2–7.
[17] ——, “Generalized feature extraction using expansion matching,” IEEE Trans. Image Processing, vol. 8, pp. 22–33, Jan. 1999.
[18] A. P. Pentland, “Local shading analysis,” IEEE Trans. Pattern Anal. Machine Intell., vol. 6, no. 2, pp. 170–187, 1984.
[19] K. R. Rao and J. Ben-Arie, “Multiple template matching using expansion matching,” IEEE Trans. Circuits Syst. Video Technol., vol. 4, pp. 490–503, Oct. 1994.
[20] A. Rosenfeld and A. C. Kak, Digital Image Processing. New York: Academic, 1982.
[21] D. E. Rumelhart, G. E. Hinton, and R. J. Williams, “Learning internal representations by error propagation,” in Parallel Distributed Processing: Explorations in the Microstructure of Cognition, Vol. 1: Foundations. Cambridge, MA: MIT Press, 1986, pp. 318–362.
[22] A. J. Stewart and M. S. Langer, “Toward accurate recovery of shape from shading under diffuse lighting,” IEEE Trans. Pattern Anal. Machine Intell., vol. 19, pp. 1020–1025, Sept. 1997.
[23] R. J. Woodham, “Photometric method for determining surface orientation from multiple images,” Opt. Eng., vol. 19, no. 1, pp. 139–144, 1980.
[24] Q. Zheng and R. Chellappa, “Estimation of illuminant direction, albedo and shape from shading,” IEEE Trans. Pattern Anal. Machine Intell., vol. 13, pp. 680–702, 1991.
Dibyendu Nandy (S’91–M’00) received the B.E. degree in electrical engineering from the University of Poona, India, in 1990, the M.S. degree in electrical engineering from Florida State University, Tallahassee, in 1992, and the Ph.D. degree from the Electrical Engineering and Computer Science Department, University of Illinois, Chicago, in 1999. Currently he is with the Media Processing Technology Group, Tellabs Operations, Inc., Bolingbrook, IL. His research interests lie in signal and image processing, neural networks, and models for visual and auditory processing. He has authored or coauthored more than 15 publications in these areas. Dr. Nandy is a Member of Sigma Xi, Tau Beta Pi, and Phi Kappa Phi. He has been a recipient of the University Fellowship at Florida State University and the Andrew Foundation Fellowship at the University of Illinois at Chicago.
Jezekiel Ben-Arie (M’91) received the B.Sc., M.Sc., and Dr.Sc. degrees from the Technion—Israel Institute of Technology, Haifa. Currently, he is with the Electrical Engineering and Computer Science Department, University of Illinois, Chicago, as an Associate Professor. His areas of specialization are machine vision, image processing, and neural networks. He has also worked in auditory processing and sound localization. In these areas, he has more than 120 technical publications. In 1986, he discovered the probabilistic peaking effect of viewed angles and distances. In 1992, he developed the nonorthogonal expansion matching (EXM) method, which is quite useful in the recognition of highly occluded objects and in motion video tracking and compression. Recently, he developed the affine invariant spectral signatures (AISS), the volumetric frequency representation (VFR) for 3-D shape representation and recognition, and a novel approach for human activity recognition. Prof. Ben-Arie currently serves as an Associate Editor of the IEEE TRANSACTIONS ON IMAGE PROCESSING.