2D/3D VIRTUAL FACE MODELING

SoonKee Chung1, Jean-Charles Bazin2* and Inso Kweon1
1: RCV Lab, KAIST, South Korea
2: Ikeuchi Laboratory, The University of Tokyo, Japan
*: Corresponding author: [email protected]

ABSTRACT

We propose a novel and simple framework that solves two popular problems in digital photography: 2D face synthesis and 3D face modeling. 2D face synthesis aims at creating a new face, usually by mixing two or more portraits. We extend this notion to the combination of human and statue faces. The goal of 3D face modeling is to reconstruct a face in three dimensions from one or several images. These two tasks are often treated as separate problems although they both consider face modeling. In this paper, we propose a unified and general framework for both the 2D and 3D cases that runs in a fully automatic manner. Our work also creates stereoscopic views for entertainment 3D displays. Experimental results and subjective tests confirm the validity of our approach.

Index Terms— face synthesis, 3D face, 3D display, stereoscopic view

1. INTRODUCTION

This paper is dedicated to two popular problems in digital photography: 2D face synthesis and 3D face modeling. 2D face synthesis creates a new face picture by combining two or several portraits. It results in a virtual face which does not correspond to any particular person. It has industrial applications for anonymizing people in pictures and for pure entertainment purposes. Given one or several images of a person, 3D face modeling aims to reconstruct his/her face in three dimensions. The cinematography industry is particularly demanding of such technology, especially for special effects (e.g. The Curious Case of Benjamin Button and Pirates of the Caribbean).

Among the several works dedicated to 2D face synthesis, Bitouk et al. introduced an automatic system for swapping the faces of two different persons with minimal artifacts [1]. Given a large dataset of faces, [2] starts by generating an approximate coherent image and then synthesizes texture with a non-stationary image quilting model. However, the existing methods are limited to human faces. On the contrary, our goal is to model faces having different textures and structures, especially human and statue faces, as shown in Fig 1-top.
Fig. 1. Our concept of 2D (top) and 3D (bottom) face modeling. The inputs are surrounded in green and the outputs, automatically obtained by the proposed algorithm, are in red.

Concerning 3D face modeling, a large amount of literature has been published. Much of the research focuses on building hardware systems [3][4], and the obtained 3D models are usually very accurate. However, these systems are often expensive and/or require several input images. Another approach is referred to as single view 3D face reconstruction, like [5][6]. Interesting results can be obtained, but most of the methods following this approach require manual interaction and are very specific. On the contrary, we are interested in developing a method which is automatic and general, in the sense that it encompasses the tasks of 2D face synthesis and 3D face modeling in a unified framework.

The remainder of this paper is composed of two main sections. First, we propose a mathematical formulation of 2D face modeling for human-statue face synthesis, which we call virtual face sculpting. In the second section, we generalize this approach to the 3D face modeling application.

2. 2D FACE MODELING

Given the pictures of a statue face and a human face, our goal is to synthesize a virtual sculpture that has the appearance of the input statue but also looks like the input human face. A typical example is given in Fig 1-top. It is a very challenging task because (1) the virtual sculpture must preserve the color and texture consistency of the original statue and (2) the structure of the input human face must not be modified in the virtual sculpture, otherwise the person will not be recognizable.
In order to achieve this goal, we formulate the task as a graph labeling problem whose main steps are the following:

• Align the two faces with respect to facial features (eyes, nose, mouth, etc.)
• Create a “patch library” containing all the patches of the input statue face
• Initialize an empty output face whose size is the same as that of the human face; the whole facial region is unknown at the initial stage
• Find, for each pixel of the output face, the best patch from the “patch library” by minimizing an energy function

Each of these steps is explained in more detail in the following sections.

2.1. Face Alignment and Patch Library

This section presents the two preliminary steps of our method: face alignment and creation of the patch library. An interesting observation is that, although the human and statue faces have different textures and structures, they have similar local regions around anchor points such as the eyes and mouth. Therefore, we start by automatically aligning the two faces from these facial anchor points. This facilitates the patch comparison in the next step. Face alignment can be easily performed using the Active Shape Model (ASM) algorithm. ASM is a general algorithm that learns, through training, to detect local features describing the shape of an object. We applied [7] to detect all the important facial features in both faces. Then we performed a warping to align these points and thus obtained aligned human and statue faces. ASM also provides the shape boundary of the face, which we use to mask out the non-facial region in the image. Subsequently, we break down the facial region into small patches centered at every pixel (the patch size is determined by the user, usually 7x7). These patches are then stored in a patch library, which forms the list of candidate patches for each pixel of the output face.
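To make the patch-library step concrete, here is a minimal Python/numpy sketch, assuming the statue image is already aligned and given as a grayscale array together with an ASM-derived boolean face mask; the function name and interface are illustrative, not the authors' implementation.

import numpy as np

def build_patch_library(statue_img, face_mask, patch_size=7):
    # Collect one candidate patch per facial pixel of the aligned statue.
    # statue_img: 2D float/uint8 array; face_mask: same-size boolean array.
    half = patch_size // 2
    h, w = statue_img.shape
    # Reflect-pad so patches at the face boundary stay well defined.
    padded = np.pad(statue_img.astype(np.float32), half, mode="reflect")
    library = {}
    for y in range(h):
        for x in range(w):
            if face_mask[y, x]:
                library[(y, x)] = padded[y:y + patch_size, x:x + patch_size]
    return library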
2.2. Energy Function

Our main idea to create the virtual face statue is to re-combine the patches contained in the patch library (i.e. from the input statue face) in an appropriate configuration. This configuration must (1) respect the structure of the given human face (which we call the similarity constraint) and (2) minimize the “patch effect” (which we call the smoothness constraint). We formulate this task as a graph labeling problem, where each patch is associated to a label. The nodes of the graph, noted G, are the set of M pixels of the output image. For each pixel p of G with position x, the goal is to find the best label (patch), noted l_x, from the patch library. We define the associated energy as the sum of two terms:

E(L) = \sum_{p \in G} E_{sim}(l_x) + \alpha \sum_{\langle p,q \rangle \in G} E_{smooth}(l_x, l_y)    (1)

where E_{sim} and E_{smooth} are respectively the energies for the similarity and the smoothness constraints, and \alpha is a relative weight to balance these two competing energies. The notation \langle p,q \rangle indicates that the pixels p and q are neighbors. The optimal combination of labels L^* = (l_{p_1}, \dots, l_{p_M}) is obtained by minimizing the energy E(L):

L^* = \arg\min_{L} E(L)    (2)

The similarity term aims to find the patches having a similar structure. Color is not an appropriate measure because the human and statue faces have very different appearances, as can be observed in Fig 1; thus we prefer gradient information. We finally define the similarity as the normalized sum of squared differences (NSSD):

E_{sim}(l_x) = \frac{1}{2} \, \frac{\| G(P(x)) - G(P(l_x)) \|^2}{\| G(P(x)) - \bar{G}_{P} \|^2 + \| G(P(l_x)) - \bar{G}_{l_x} \|^2}    (3)

where G(P) is the gradient of a patch P, \bar{G}_{P} is the mean gradient of the patch P(x) and \bar{G}_{l_x} is the mean gradient of the patch P(l_x). E_{sim}(l_x) is a normalized term which returns a real value in [0, 1] [8]. If G(P(x)) and G(P(l_x)) are equal, then the returned value is 0.

The smoothness energy encodes the overlapping coherence between two adjacent patches P(l_x) and P(l_y). This energy term is defined as the NSSD over their overlapping area. It can also be considered as an extension of [9] to larger and rectangular patches.

To find the labeling minimizing the MRF energy (Eq (2)), we apply Belief Propagation (BP) [10][11]. BP is a popular iterative algorithm that treats the energy costs at each pixel as potentials and propagates messages throughout the graph to find the best label for each pixel. Using the facial anchor points detected in section 2.1, we apply a warping step to project each pixel of the output image into the statue face. This drastically decreases the number of candidate patches (i.e. the search space) and imposes some a priori geometric information about the faces.
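As an illustration of Eq. (3), the similarity energy can be written in a few lines. The sketch below is a plain numpy transcription under the assumption that patches are small grayscale arrays; the gradient operator (finite differences) and the stabilizing epsilon are our own choices, not specified in the paper.

import numpy as np

def patch_gradient(patch):
    # Gradient magnitude of a patch via central finite differences.
    gy, gx = np.gradient(patch.astype(np.float32))
    return np.hypot(gx, gy)

def e_sim(human_patch, candidate_patch, eps=1e-8):
    # NSSD of Eq. (3): returns 0 when the gradient patches are
    # identical, and is normalized to stay within [0, 1].
    g_h = patch_gradient(human_patch)
    g_c = patch_gradient(candidate_patch)
    num = np.sum((g_h - g_c) ** 2)
    den = np.sum((g_h - g_h.mean()) ** 2) + np.sum((g_c - g_c.mean()) ** 2)
    return 0.5 * num / (den + eps)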
2.3. Results and Analysis

We applied our 2D face modeling framework to various pictures. Representative results are displayed in Figure 2 (preliminary results were presented in [13]). For comparison, we also applied the Poisson Image Editing method [12]. One may consider the Poisson outputs to be pleasing results, but it is important to note that they contain inappropriate colors (i.e. colors that do not exist in the input statue) and thus cannot be considered a virtual statue. On the contrary, our method maintains the color, texture and structure consistencies of the input statue. It is also interesting to observe that we can reconstruct small features like cheek wrinkles, hair streaks and eyeglasses.

Fig. 2. First column: input human faces. Second column: input statue faces. Third column: output faces using Poisson Image Editing [12]. Right column: output faces using our proposed method. Notice the color blending from the human faces, which is not wanted in our goal.
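For readers who wish to reproduce the baseline, Poisson Image Editing [12] is available off the shelf as OpenCV's seamless cloning. A minimal sketch follows; the file names are hypothetical and the rectangular mask is a simplification of the ASM face mask used in our comparison.

import cv2
import numpy as np

# Hypothetical inputs: an aligned human/statue face pair of similar size.
human = cv2.imread("human_face.jpg")
statue = cv2.imread("statue_face.jpg")

# Rectangular mask with a margin; an ASM face mask would be tighter.
mask = np.zeros(human.shape[:2], dtype=np.uint8)
mask[10:-10, 10:-10] = 255
center = (statue.shape[1] // 2, statue.shape[0] // 2)

# Gradient-domain blend of the human face into the statue picture.
poisson = cv2.seamlessClone(human, statue, mask, center, cv2.NORMAL_CLONE)
cv2.imwrite("poisson_baseline.jpg", poisson)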
3. 3D FACE MODELING

In the previous section, we showed how to transfer 2D color patches to create a virtual statue. If 3D information about the input statue face is also known, a natural question is: what would the output face look like in 3D if we directly transferred the depth values associated with the selected patches? The motivation behind this extension is simple: if the input statue face is associated with a 3D model (obtained by any method), each pixel of the statue face has a depth value and thus it is possible to create a “depth patch library”. By combining the color information and the depth patches, we can therefore output a virtual 3D face. This idea allows us to explore an alternative approach to the 3D face modeling problem. In this section, we present the generalization of our 2D patch transfer-based framework to 3D face modeling.

3.1. Energy Term Generalization

We now present the energy term generalization to handle depth information. We do not modify the similarity energy of Eq (1) since we still aim for the output face to portray the same structure as the input human face. For the smoothness energy, we add one more term that encodes the depth consistency within the local region of the output face, so as to obtain a coherent 3D surface. This consistency is defined as the NSSD of the depth values over the overlapping region. Combining depth and color terms in the smoothness energy permits us to obtain a pleasing 3D face in terms of 3D sensation as well as photometric coherence. By transferring depth information, our method outputs a virtual 3D face, with its associated depth map, and can synthesize virtual viewpoints in a fully automatic manner (cf Fig 1-bottom). We can also automatically generate stereoscopic views for 3D display (cf Fig 3).
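To illustrate the view synthesis step, a virtual left view can be obtained by shifting each pixel horizontally in proportion to its depth (depth-image-based rendering). The sketch below is our own illustrative scheme, not the renderer used in the paper: the disparity scale is arbitrary and disocclusion holes are filled naively from the left neighbor.

import numpy as np

def render_left_view(image, depth, max_disparity=20):
    # image: HxWx3 array (the original right view);
    # depth: HxW array normalized to [0, 1], larger meaning closer.
    h, w = depth.shape
    left = np.zeros_like(image)
    filled = np.zeros((h, w), dtype=bool)
    disparity = (depth * max_disparity).astype(int)
    for y in range(h):
        for x in range(w):
            nx = x + disparity[y, x]  # closer pixels shift further right
            if nx < w:
                left[y, nx] = image[y, x]  # last write wins; a z-buffer
                filled[y, nx] = True       # would be more rigorous
    # Naive hole filling: copy the nearest filled pixel on the left.
    for y in range(h):
        for x in range(1, w):
            if not filled[y, x]:
                left[y, x] = left[y, x - 1]
    return left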
3.2. Comparative Analysis From Different Input 3D Models

This section studies the influence of the input 3D model on the output 3D face. So far, our method has focused on face modeling from a statue face and a human face. We now also consider two input human faces to explore the full feasibility of our proposed method. For the experiments with an input human 3D face, we used a dataset consisting of 3D models of various people in several orientations [11]. We compared the 3D faces generated from a statue face and from several human faces, and carried out a subjective test in which participants voted for the most pleasing 3D faces generated by our method using different input 3D faces (called reference faces). We generated stereoscopic virtual views to display the 3D faces using the Nvidia 3D Vision kit. Since our method obtains 3D information, we can generate the virtual left view of the stereoscopic image pairs; the right view is simply the original human image. The virtual left views, as well as their stereoscopic images, are shown in Figure 3. The subjective test comprises four kinds of reference faces: a) statue face, b) random selection of a human face from the 3D face database, c) manual selection of a human face with an appearance similar to the target face, and d) automatic selection of a human face (by PCA face recognition). The test results are shown in Figure 4. From the results, we may conclude that the 3D faces modeled from reference faces whose appearances are similar to the target faces lead to a more pleasing 3D display experience, which was intuitively expected. Indeed, similar appearance in 2D images may imply similar 3D structure.
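Selection strategy (d) can be realized with an eigenfaces-style nearest neighbor in PCA space. The sketch below uses scikit-learn and assumes the database faces are aligned, grayscale and flattened to equal-length vectors; the number of components is an arbitrary choice, not a value from the paper.

import numpy as np
from sklearn.decomposition import PCA

def select_reference(target_face, gallery_faces):
    # target_face: 1D vector (flattened aligned face);
    # gallery_faces: NxD array of flattened faces from the 3D database.
    pca = PCA(n_components=min(50, len(gallery_faces) - 1))
    gallery_coeffs = pca.fit_transform(gallery_faces)
    target_coeffs = pca.transform(target_face[None, :])
    dists = np.linalg.norm(gallery_coeffs - target_coeffs, axis=1)
    return int(np.argmin(dists))  # index of the most similar face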
Fig. 3. Left column: original faces (with unknown 3D). Middle column: virtual left views automatically generated by the proposed method. Right column: generated stereoscopic views for 3D display.

Fig. 4. Subjective test on the 3D display of faces modeled from different kinds of reference faces (cf text for details).

4. CONCLUSION

In this paper, we proposed a unified framework, based on patch transfer, for the two tasks of 2D and 3D face modeling. For the 2D case, we considered structure information and a smoothness constraint on the color. By combining the faces of a human and a statue, we created a virtual face sculpture in a fully automatic manner. We then generalized the formulation to the 3D case by adding a depth smoothness term. By transferring not only the appearance but also the depth, we managed to obtain a 3D face model from a single view automatically. We also generated stereoscopic views for 3D display applications. Experimental results and subjective tests confirmed our approach.

5. ACKNOWLEDGEMENT

This research was partially supported by the Ministry of Knowledge Economy (MKE), Korea, under the Human Resources Development Program for Convergence Robot Specialists supervised by the National IT Industry Promotion Agency (NIPA). The authors also thank Roger Blanco Ribera and Quang Pham for their early work on face modeling.

6. REFERENCES

[1] D. Bitouk, N. Kumar, S. Dhillon, P. N. Belhumeur, and S. K. Nayar, “Face swapping: automatically replacing faces in photographs,” ACM Transactions on Graphics (SIGGRAPH’08), vol. 27, no. 3, 2008.

[2] U. Mohammed, S. J. D. Prince, and J. Kautz, “Visio-lization: Generating novel facial images,” ACM Transactions on Graphics (SIGGRAPH’09), vol. 28, no. 3, pp. 57:1–57:8, 2009.

[3] D. Bradley, W. Heidrich, T. Popa, and A. Sheffer, “High resolution passive facial performance capture,” ACM Transactions on Graphics (SIGGRAPH’10), vol. 29, no. 3, 2010.

[4] T. Beeler, B. Bickel, P. Beardsley, R. Sumner, and M. Gross, “High-quality single-shot capture of facial geometry,” ACM Transactions on Graphics (SIGGRAPH’10), vol. 29, no. 3, 2010.

[5] S.W. Park, J. Heo, and M. Savvides, “3D face reconstruction from a single 2D face image,” in Computer Vision and Pattern Recognition Workshops, 2008.

[6] A. Niswar, E.P. Ong, and Z. Huang, “Pose-invariant 3D face reconstruction from a single image,” in SIGGRAPH-Asia, 2010.

[7] S. Milborrow and F. Nicolls, “Locating facial features with an extended active shape model,” in European Conference on Computer Vision (ECCV’08), 2008.

[8] V. Kolmogorov, A. Criminisi, A. Blake, G. Cross, and C. Rother, “Bi-layer segmentation of binocular stereo video,” in IEEE Conference on Computer Vision and Pattern Recognition (CVPR’05), vol. 2, pp. 407–414, 2005.

[9] L.Y. Wei and M. Levoy, “Fast texture synthesis using tree-structured vector quantization,” in SIGGRAPH’00, 2000, pp. 479–488.

[10] P. F. Felzenszwalb and D. P. Huttenlocher, “Efficient belief propagation for early vision,” International Journal of Computer Vision (IJCV), vol. 70, no. 1, pp. 41–54, 2006.

[11] J. Sun, L. Yuan, J. Jia, and H.-Y. Shum, “Image completion with structure propagation,” in SIGGRAPH’05, 2005, pp. 861–868.

[12] P. Perez, M. Gangnet, and A. Blake, “Poisson image editing,” ACM Transactions on Graphics (SIGGRAPH’03), vol. 22, no. 3, pp. 313–318, 2003.

[13] J.C. Bazin, S.K. Chung, R.B. Ribera, Q. Pham, and I.S. Kweon, “Virtual face sculpting,” SIGGRAPH 2010 Poster, 2010.