Active Appearance Models for Automatic Fitting of 3D Morphable Models

Nathan Faggian, Andrew P. Paplinski
Clayton School of Information Technology, Monash University, Victoria, Australia
{nathanf,app}@mail.csse.monash.edu.au

Abstract

This paper presents a fast fitting method for 3D Morphable Models (3DMMs). In most cases, fitting a Morphable Model to an image is done using slow nonlinear optimization. We avoid this by introducing a relationship to Active Appearance Models (AAMs) that can be used to linearize the nonlinear optimization problem of 3DMM fitting. Using the combination of a constrained AAM and a closed-form 2D-to-3D fitting method for 3DMMs, we show that a perceptually close 3D shape can be extracted from the AAM fittings. We show preliminary results of the method and a simplified exterior-orientation solution. Finally, we conclude with a tracking algorithm based on the combination of the AAM and 3DMM, presenting the tracking and reconstruction errors.

1. Introduction

In this paper we focus on the combination of the Active Appearance Model and the 3D Morphable Model. Our goal is to model the human face in a manner that is suitable for real-time applications. We present a solution to 3D modelling of the human face that adheres to three constraints: 1) modelling is fast; 2) a sparse set of features can be used; 3) the result is constrained to the face domain. To meet these constraints we use PCA models for shape and texture built from a 3D database of laser-scanned heads, choosing the AAM for its speed of fitting and the 3DMM for its representational power and natural extension to tracking.

2. PCA Models for Shape and Texture

A 3D Morphable Model (3DMM) is a statistical representation of both the shape and texture of an object in a certain domain [11, 9]. A 3DMM is built from 3D scans, which are first put into dense correspondence [3].

Proceedings of the IEEE International Conference on Video and Signal Based Surveillance (AVSS'06) 0-7695-2688-8/06 $20.00 © 2006

Jamie Sherrah Clarity Visual Intelligence Victoria, Australia [email protected]

The aligned data is then stacked into two matrices, one for texture samples and one for shape samples. Principal Component Analysis [1] is then computed on the matrices. This produces two models, one for shape variation and one for texture variation, which define the 3DMM:

ŝ = s̄ + S·diag(σ_s)·c_s
t̂ = t̄ + T·diag(σ_t)·c_t (1)

where ŝ and t̂ are novel (3N × 1) shape and texture vectors, s̄ and t̄ are the (3N × 1) mean shape and texture vectors, S and T are the (3N × M) column spaces (shape and texture eigenvectors) of the shapes and textures, σ_s and σ_t are the corresponding shape and texture eigenvalues, and c_s and c_t are (M × 1) shape and texture coefficients. The linear equations (1) describe the variation of shape within the span of the training heads. The coefficients c are scaled by the corresponding eigenvalues σ (dominance) of the eigenvectors S, T. By varying the coefficients it is possible to render different heads within the span of the original 3D scans. 3DMMs are similar to simpler 2D models such as the Active Appearance Model [4], although pose as well as illumination can be addressed directly because the model is three-dimensional.

Active Appearance Models are low-dimensional 2D models; Cootes et al. [4] presented them as a method to model objects in images. The concept is the popular modelling-by-synthesis approach to image analysis, a technique that has been used broadly in the field of computer vision. AAMs are similar to the 3DMM (equation (1) also applies to AAMs); the key difference is the dimensionality of the model. AAMs are generally constructed from hand-labelled images, which are aligned using Procrustes analysis [5] before PCA is used to build the shape and texture models. Figure (1) demonstrates the effect of varying both the shape and texture models of the AAM, in this case along the first principal component.
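To make equation (1) concrete, the following minimal sketch (not the authors' code; the data is a random stand-in for registered laser-scan shapes) builds the shape half of the model with PCA and synthesizes a novel shape:

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-in training data: K aligned shapes, each 3N values (x, y, z per vertex).
K, N = 50, 100
shapes = rng.normal(size=(K, 3 * N))          # rows are training shapes

# PCA on the shape matrix: mean s_bar, eigenvectors S, per-mode deviations sigma.
s_bar = shapes.mean(axis=0)
U, svals, Vt = np.linalg.svd(shapes - s_bar, full_matrices=False)
S = Vt.T                                      # (3N x K) eigenvector columns
sigma = svals / np.sqrt(K - 1)                # standard deviation of each mode

# Equation (1): a novel shape from coefficients c_s over the first M modes.
def novel_shape(c_s, M=10):
    return s_bar + S[:, :M] @ (np.diag(sigma[:M]) @ c_s)

s_hat = novel_shape(rng.normal(size=10))
print(s_hat.shape)                            # a (3N,) shape vector
```

Setting c_s to zero recovers the mean shape s̄; scaling a single coefficient sweeps one mode of variation, as in figure (1).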

3. 3DMM Fitting

Fitting a 3DMM involves inverting the rendering process, and is necessarily a nonlinear operation.

(a) First mode of variation in the shape model: s̄ varied from −3.5√σ₁ to +3.5√σ₁.
(b) First mode of variation in the texture model: t̄ varied from −3.5√σ₁ to +3.5√σ₁.

Figure 1. AAM: Shape and texture models.

Current popular methods for fitting 3DMMs involve iterative gradient-descent approaches [11, 9], although any nonlinear optimization could be used. Fitting involves a search for the shape, texture and 3D rendering parameters. This is done using an analysis-by-synthesis approach, where the parameters are inferred from the difference between a rendered head (image) and the input image. Matching the texture implies a near photo-realistic result and effectively minimizes the cost function:

||F(s̄ + S·diag(σ)·c_s, t̄ + T·diag(σ)·c_t) − I||²_F (2)

where F is a rendering function that, given shape and texture coefficients, produces an image aligned with the input image, I. These methods have been shown to work well. The only drawback is the time taken to arrive at a result, which is typically seconds or minutes.

3.1. Fast Morphable Model Fitting

If it is possible to ignore the texture information for 3DMM fitting, then a viable alternative to the standard optimization methods is available: the recently proposed approach of Blanz et al. [2], a concise and mathematically optimal method for reconstructing a 3DMM from a sparse set of either 2D or 3D feature points. The method has two advantages: 1) it relies only on linear operators, and 2) it operates in real time (at the expense of model accuracy). Assuming that only a small set of corresponding points is available, the method minimizes the cost function:

||L·V·S·diag(σ)·c_s − r||²_F (3)

where L is a camera matrix containing the full set of intrinsic and extrinsic parameters, V is a subset-selection matrix, S is the (3N × M) column (eigenvector) space of the training shapes, σ are the corresponding eigenvalues, c_s are the shape coefficients, and r is a (2P × 1) set of feature points. Blanz' regularized solution [2] for fitting a given set of points y, provided by the AAM, estimates the shape coefficients c_s, where L (multiplied with V) is a mapping function:

y = L·c_s (4)

where c_s is an (M × 1) vector of shape coefficients and y is a (2P × 1) vector of demeaned AAM features in image coordinates:

y = r − L·v̄ (5)

such that L is a (2P × 3N) mapping matrix and v̄ is the mean (3N × 1) shape vector. When L is assumed to be known, it might seem obvious to solve for c_s (equation 4) with an application of the pseudo-inverse of L:

c_s = L⁺·y (6)

The significant problem with this method is that it does not produce a meaningful result: it minimizes the wrong error. The derived vector c_s is not restricted to the span of the face model (3DMM) and thus cannot be used. A more appropriate error to minimize is one within the span of the 3DMM:

y = L·S·diag(σ)·c_s (7)

where S is the (3N × M) eigenvector matrix and σ is the (M × M) corresponding eigenvalue matrix. The solution can be derived using the SVD after a simple restating of the equation:

y = Q·c_s (8)

where Q is the (2P × M) image-plane projection of the subset of scaled eigenvectors. Taking the SVD Q = U·diag(s_i)·Vᵀ and modifying the singular values (by adding a regularization term, η) provides a regularized fitting that is applied directly as the solution to (7):

c_s = V·diag(s_i / (s_i² + η))·Uᵀ·y (9)

This provides a concise, closed-form solution to the problem of estimating a dense 3D shape from a sparse set of 2D points. It is, however, dependent on the selection of the regularization term η, shown in figure (2).
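The closed-form solve of equations (7) to (9) can be sketched in a few lines. The example below is illustrative only: Q and y are synthetic stand-ins for the projected basis and the demeaned AAM features, and it also shows how increasing η shrinks the coefficient norm toward zero, the behaviour demonstrated in figure (2):

```python
import numpy as np

rng = np.random.default_rng(1)

# Stand-ins: Q is the (2P x M) projected, scaled eigenvector subset; y the
# demeaned 2D feature vector (both would come from the AAM/3DMM in practice).
P, M = 40, 10
Q = rng.normal(size=(2 * P, M))
c_true = rng.normal(size=M)
y = Q @ c_true + 0.05 * rng.normal(size=2 * P)   # noisy observations

def regularized_fit(Q, y, eta):
    """Closed-form solve of y = Q c_s with singular values damped by eta."""
    U, s, Vt = np.linalg.svd(Q, full_matrices=False)
    return Vt.T @ ((s / (s**2 + eta)) * (U.T @ y))

# As eta grows, the coefficient norm shrinks toward zero (the mean shape).
norms = [np.linalg.norm(regularized_fit(Q, y, eta)) for eta in (1e-3, 1e1, 1e3)]
print(norms)
```

Note the damping factor s_i/(s_i² + η): for η = 0 it reduces to the plain pseudo-inverse of equation (6) restricted to the model span.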

(a) ŝ  (b) η = 10⁻³  (c) η = 10¹  (d) η = 10³  (e) s̄

Figure 2. Variation of η on a random 3D head, ŝ, demonstrating how the estimate approaches the mean, s̄.

In figure (2), as the η value increases the estimate tends to the mean 3DMM shape; that is, the norm of the estimated coefficients approaches zero. This is a desirable denoising and smoothing effect. Blanz showed that this is a statistically optimal method for 3DMM shape reconstruction given a minimal number of features, so it represents a good choice for 3DMM fitting given our real-time requirement.

4. AAMs

The only problem with Blanz' fitting method is that the user must define both V and r. The matrix V maps the smaller set of 2D coordinates in an image to corresponding 3D vertices in the 3DMM; it maps from the high-dimensional 3DMM shape vector to the smaller (2D or 3D) shape vector, r. The key contribution of this paper is an automatic way to deduce the mapping V and the shape vector r. We use AAMs built from the span of the 3DMM 3D head basis: specifically, AAMs constructed using the 2D projection of the 3DMM data, which inherently provide the important 2D-to-3D mapping, V. This was identified in [6] and is exploited in this work for fully automatic fitting of 3DMMs.

4.1. AAM Fitting

The AAM fitting algorithm used in our method is based on the Inverse Compositional Image Alignment (ICIA) AAM fitting method [8]. We chose ICIA because it is a fast technique for AAM fitting that can run in real time. ICIA reverses the roles of the template and the image; this allows the computation related to the Jacobian (which defines how the image pixels move with respect to a transformation) to be precomputed, resulting in a very efficient AAM fitting algorithm. When applied to AAMs, ICIA optimizes two components that define the change in shape of the AAM: the local transformation W(s̄; α) and the global transformation N(s̄; γ). These transforms describe the variation of shape in two different cases: 1) α defines a shift of individual vertices from the normalized model; 2) γ is a similarity transformation that moves the AAM in an image, allowing the AAM to express different scales, rotations and translations.

W(s̄; α) ≃ ŝ = s̄ + S·α,  N(s̄; γ) ≃ ŝ = s̄ + S*·γ

where (W, N) are the local and similarity transforms, s̄ is the mean shape, (α, γ) are parameter vectors and (S, S*) define the column spaces of the local and similarity transforms. Fitting AAMs using ICIA is then the minimization of:

Σ_x [I(N(W(s̄; α); γ)) − T(x)]² (10)

where I(N(W(s̄; α); γ)) denotes the pixels of the image sampled under the globally and locally transformed AAM and T(x) is a rendered AAM using the mean shape and mean texture. The cost function is minimized by estimating the correct updates Δα, Δγ to the α and γ parameters, which are then inverted and composed with the previously updated α, γ.

4.2. AAMs across Pose

Tracking across small changes in facial pose can be accomplished if the AAM is built with 3D pose variations [6]. In this paper we track generated 3D heads as ground-truth models with AAMs built from projected 3D data containing ±30 degrees of yaw rotation. This allows our AAMs to track corresponding pose variations of the generated test data. It is important to note that this can only be done reliably because we make use of 3DMM data during our AAM construction process. In doing this we avoid the difficulty of hand-labelling shape models and reduce their influence on the fitting error [7]. To build the AAMs we control the rendering process and can easily determine 2D shape from 3D projections. This is necessary for AAM texture and shape model construction, for which we retain 95% of the variance in both models.

5. Simplified Exterior Orientation

To apply Blanz' fitting method, the rigid motion of the 3DMM with respect to the image must be estimated. This is a search for the optimal rotation, scaling and translation of the model that minimizes the difference between the projected 3D model (X) and the AAM points (x̂). A rigid-body transformation of the model is defined as:

[X̂; 1] = [R, t; 0, 1]·[X; 1] (11)

where R is the (3 × 3) rotation matrix, t is the (3 × 1) translation vector and X̂ is the (3 × 1) rotated and translated X vector. Using a scaled orthographic projection we can describe the observed 2D AAM points as projected model points that have undergone an unknown rigid transformation:

x̂ = [s, 0, 0; 0, s, 0]·[R, t; 0, 1]·[X; 1] (12)

where s is the scaling constant and x̂ is an observed 2D AAM point. Under a scaled orthographic projection the equations for the elements of x̂ become:

[x̂_x; x̂_y] = [s·r₁, s·t_x; s·r₂, s·t_y]·[X; 1] (13)

where r₁, r₂ are the first and second rows of the rotation matrix and t_x, t_y are the x, y translations. The solution to the

rigid motion of the model and the scale is coupled (equation 12); it is not easily solved directly and should be derived in a least-squares sense. If it is assumed that rotations are relatively small, then the first-order approximation of the Rodrigues equation for rotation becomes:

R = [1, −v_z·sinθ, v_y·sinθ; v_z·sinθ, 1, −v_x·sinθ; −v_y·sinθ, v_x·sinθ, 1] (14)

For a least-squares form we first examine the expanded exponential canonical form, for now ignoring the effects of scale and translation and solving x̂ = R·X:

[x̂_x; x̂_y] = [1, −v_z·sinθ, v_y·sinθ; v_z·sinθ, 1, −v_x·sinθ]·X (15)

This can be rearranged into the least-squares equations:

[−X_y, X_z, 0; X_x, 0, −X_z]·[v_z·sinθ; v_y·sinθ; v_x·sinθ] = [x̂_x − X_x; x̂_y − X_y] (16)

Given equation 16 it is possible to estimate (to first order) the rotation between the 3D model and the 2D feature points. This is applied iteratively to estimate the true rigid rotation of the feature points. Once rotation is determined it is possible to address translation and scale. Ignoring the effects of rotation, the equation for translation and scaling (x̂ = s·X + t) is:

[x̂_x; x̂_y] = [s, 0, 0, t_x; 0, s, 0, t_y]·[X; 1] (17)

This can also take the form of a least-squares equation:

[X_x, 1, 0; X_y, 0, 1]·[s; t_x; t_y] = [x̂_x; x̂_y] (18)

and from this t_x, t_y and s can be calculated.

6. Experiments

In our experiments we used a 3DMM to generate ground-truth data. Using the 3DMM we were able to generate data for AAM construction, generate random (identity) testing data, and compute the correspondence between the 2D AAM and the 3DMM implicitly. For our AAM fittings we trained two models with the generated head data, with pose variation (in yaw) from 0 to 30 degrees and from −30 to 0 degrees. This was done using the full set of our 75-head 3D database [10], rotated in increments of 5 degrees. We used two models because AAM fitting is not robust enough to allow the full −30 to 30 degrees of shape modelling; this step (as expected) improved fitting.

7. Computing Fitting Error

In order to compute the error for the 3DMM fitting we decided to compute the shape error only in regions within the 2D AAM fitting. This required identifying the pixels on the 3DMM that map in 2D to the convex hull of the AAM, to form an error mask. Using the error mask we are able to compute the per-vertex error between the estimate and the ground truth. An example of the error region is shown in figure (3.a) and an example of the error for a fitting is drawn as a normalized image in figure (3.b).

(a) The error mask, determined as the 3D vertices that project to the convex hull of the 2D AAM. (b) The error image of a fitting; the fitting is shown in figure 5.

Figure 3. 3DMM Error mask examples.

7.1. Tracking

Face tracking is a motivating experiment for our proposed combination of the AAM and the 3DMM. We devised a method that keeps track of global rotation and median-filters the 3DMM shape estimate that Blanz' fitting provides. Our simple algorithm relies only on ICIA AAM fitting and Blanz' 3DMM fitting:

Algorithm 1 Fitting using an AAM to acquire r and V
Require: rows(r) > 6 + N
while r do
  [R_n, s_n, t_n] = Ω_P(c_{n−1}, R_{n−1}, s_{n−1}, t_{n−1}) {linear pose estimation of the pose parameters}
  R_n = R_n · R_{n−1} {incremental rotation is composed with the global rotation}
  [R_n, s_n, t_n] → L_n {the camera matrix L is assembled from the pose and scale estimates}
  [c_n] = B_I(L_n, r_n, η) {Blanz' method [2] is used to estimate the new identity; η is set by the user}
end while
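A heavily simplified sketch of the tracking loop in Algorithm 1 is given below. The pose estimation step (Ω_P) is deliberately stubbed out and the AAM measurement is simulated with noise, so this shows only the measure / fit / median-filter structure, not the authors' implementation:

```python
import numpy as np

rng = np.random.default_rng(3)

# Toy stand-ins for the real components of Algorithm 1.
P, M, FRAMES = 40, 8, 15
Q = rng.normal(size=(2 * P, M))           # projected, scaled shape basis
c_true = rng.normal(size=M)               # identity of the tracked head

def aam_measurement(frame):
    """Stub for ICIA AAM fitting: returns noisy demeaned 2D features y."""
    return Q @ c_true + 0.1 * rng.normal(size=2 * P)

def blanz_fit(y, eta=1.0):
    """Regularized closed-form fit of equation (9)."""
    U, s, Vt = np.linalg.svd(Q, full_matrices=False)
    return Vt.T @ ((s / (s**2 + eta)) * (U.T @ y))

# Tracking loop: per-frame fits, median-filtered over a sliding window
# (the pose estimation of section 5 is omitted here for brevity).
history, estimates = [], []
for frame in range(FRAMES):
    y = aam_measurement(frame)
    history.append(blanz_fit(y))
    window = np.array(history[-5:])
    estimates.append(np.median(window, axis=0))   # median-filtered shape

final = estimates[-1]
print(np.linalg.norm(final - c_true))
```

The median filter over a short window suppresses frame-to-frame jitter in the identity estimate while still allowing it to drift with genuine changes in the measurements.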

(a) Rotation error in pitch (degrees) versus incremental rotation. (b) Rotation error in roll (degrees) versus incremental rotation. (c) Rotation error in yaw (degrees) versus incremental rotation. (d) Shape error (Euclidean) versus incremental rotation.

Figure 4. Average error in pose estimates and shape for 100 randomly generated 3D heads.

In the first step of the algorithm the measurement vector r is determined by the AAM fitting process. This is then used in the linear pose equations of section 5. The 3DMM shape (X from equation 12) is encoded as the model coefficients c_n, and this is used along with the mapping V to estimate the scale, rotation and translation of the 3D model with respect to the 2D AAM fitting. The incremental rotation estimate is then multiplied with a global rotation to keep track of object rotations across measurements. The camera matrix L is then assembled from the pose estimates. Finally Blanz' method is applied to solve for the 3DMM parameters given a user-specified η, and the whole process repeats for the next measurement.

8. Results

To justify our fusion of the AAM and 3DMM we evaluated the proposed tracking algorithm. For the tracking experiment we generated 100 random 3D heads, then rotated the heads from −30 to +30 degrees in increments of 5 degrees in yaw. Using the appropriate AAM (depending on whether the yaw rotation was negative or positive) we then extracted the measurement vector, r. This was found to be reasonably accurate, with an average error distance of 8 pixels from the


ground-truth 2D shape. Given the AAM fitting we then applied our tracking (algorithm 1). Figure (4.a-c) shows the average error for each yaw, pitch and roll rotation. When the error is averaged across all 100 trackings, the pose error is 0.17 degrees for yaw, 0.25 degrees for pitch and −4.45 degrees for roll. This is a motivating result, showing that our simple method is accurate to within (on average) 5 degrees of the true rotation. Figure (4.d) shows the overall shape error between the ground-truth 3D heads and the 3DMM fitting. Using Blanz' fitting method, our results were similar to those of [2]. For the experiment we tried many different η values but found unity to perform best; this also produced the most perceptually pleasing result (figure 5). After tracking we measured the per-vertex fitting error (section 7). When averaged across all 100 trackings this was found to be 39.38 units.

9. Conclusion

We have proposed a combination of the AAM and 3DMM and shown preliminary results. The key benefit of this approach is that the labelling of feature points for Blanz' 3DMM fitting is now automated. Through automating

the correspondence computation, we can quickly provide a 3DMM point-based fitting that is suitable as a good initial estimate of the 3D structure from a subset of 2D features. In addition to this automation we demonstrated a pose estimation algorithm, shown to be accurate to within approximately 5 degrees in our tests. Finally, we presented a combination of Blanz' fitting, AAM segmentation and the pose estimation technique to form a tracking algorithm.

(a) From the original (gamma 0, theta 0, phi 30) to the ground truth (gamma 0, theta 0, phi 30) to the linear estimate (gamma 1.3, theta −1.8, phi 31): from the original image to the AAM segmentation to the rendered facial estimate.

Figure 5. Estimated 3D shape from a sparse set of 2D AAM features.

10. Acknowledgments

We thank Sami Romdhani from the Faculty of Computer Science at UniBas for putting the USF data into correspondence, and the Australian Research Council for its continued funding.

References

[1] C. Bishop. Neural Networks for Pattern Recognition. Oxford University Press, 1995.
[2] V. Blanz, A. Mehl, T. Vetter, and H.-P. Seidel. A statistical method for robust 3D surface reconstruction from sparse data. In Second International Symposium on 3D Data Processing, Visualization and Transmission, pages 293-300, 2004.
[3] C. Basso, T. Vetter, and V. Blanz. Regularized 3D morphable models. In Workshop on Higher-Level Knowledge in 3D Modeling and Motion Analysis, 2003.
[4] T. F. Cootes, G. J. Edwards, and C. J. Taylor. Active appearance models. In Proc. European Conference on Computer Vision, volume 2, pages 484-498. Springer, 1998.


[5] M. Devrim. Generalized Procrustes analysis and its applications in photogrammetry. Technical report, Swiss Federal Institute of Technology, 2003.
[6] N. Faggian, S. Romdhani, J. Sherrah, and A. Paplinski. Color active appearance model analysis using a 3D morphable model. In Digital Image Computing: Techniques and Applications, December 2005.
[7] R. Gross, I. Matthews, and S. Baker. Generic vs. person specific active appearance models. Technical report, The Robotics Institute, Carnegie Mellon University, 2003.
[8] I. Matthews and S. Baker. Active appearance models revisited. International Journal of Computer Vision, 60(2), 2004.
[9] S. Romdhani and T. Vetter. Estimating 3D shape and texture using pixel intensity, edges, specular highlights, texture constraints and a prior. In IEEE Conference on Computer Vision and Pattern Recognition, 2005.
[10] S. Sarkar. USF DARPA HumanID 3D face database. University of South Florida, Tampa, FL.
[11] T. Vetter and V. Blanz. A morphable model for the synthesis of 3D faces. In SIGGRAPH 1999, Computer Graphics Proceedings, pages 187-194, 1999.
