AN ANTHROPOMETRIC SHAPE MODEL FOR ESTIMATING HEAD ORIENTATION

T. HORPRASERT, Y. YACOOB and L. S. DAVIS
Computer Vision Laboratory, University of Maryland, College Park, MD 20742
Abstract
An approach for estimating 3D head orientation in a monocular image sequence is presented. The approach employs recently developed image-based parameterized tracking for the face and face features to locate the area in which an estimation of point feature locations is performed. This involves tracking of five points (four at the eye corners and the fifth is the tip of the nose). We describe an approach that relies on the coarse structure of the face to compute orientation relative to the camera plane. Our approach employs symmetry of the eyes and anthropometric statistics to estimate the head yaw, roll and pitch.
Keywords: Face orientation, feature tracking, anthropometry.

The support of the Defense Advanced Research Projects Agency (DARPA Order No. C635) under Contract N00014-95-1-0521 is gratefully acknowledged.

1 Introduction

Watching people move is a favorite human pastime, and the shapes of people and their parts play important roles for both humans and computer programs. Human shape is dynamic, due to the many degrees of freedom of articulation of the human body and the deformations of the body and its parts induced by muscular action. Human shape variability is also highly constrained by both genetic and environmental factors, and is characterized by a high degree of (approximate) symmetry and (approximate) invariants of body length scales and ratios. The field of anthropometry [6,14] is concerned with the tabulation and modeling of the distributions of these scales and ratios, and can serve as a useful source of shape constraints for analyzing videos of people when prior information about the specific individuals being viewed is not available.

In this paper we are concerned with the problem of watching an arbitrary individual's head in a video sequence (as might be seen by a camera attached
to a personal computer) and following the person's head movements in 3-D, to help determine, for example, where on the screen a person might be looking, whether the person is attending to the information being presented on the screen, whether the person is speaking to the machine, etc. More specifically, our goal is to recover and track the three degrees of rotational freedom of a human head, without prior knowledge of the exact shape of the head and face being observed.

The approach we develop is a two-stage approach in which the person being watched is first classified according to gender, race and age. This classification can be used to index into tabulated anthropometric distributions of face length scales (in our case eye fissure and nose length) to constrain as tightly as possible a head orientation recovery algorithm that integrates face symmetries with these length scale distributions. This instantaneous orientation estimation algorithm is combined with a robust motion analysis algorithm (which both aids the tracking of facial features and provides additional constraints to an orientation tracker) to estimate the sequence of head orientations observed.

In this paper we focus on both the static orientation estimation algorithm and the feature tracking method employed by the system (which employs recently developed image-based parameterized tracking [1] for the face and face features to locate the area in which an estimation of the eye corner and nose tip positions is performed). We also provide an error analysis of the orientation estimation procedure, considering errors due to both image analysis and anthropometric modeling. The latter estimates the expected errors in head orientation recovery using the variances of the subpopulations identified. This could, for example, allow adaptive control of the screen resolution with which information is presented to a user as a function of expected orientation accuracy.

There are two principal approaches in the literature to recovering head orientation: model-based and appearance-based. Model-based approaches typically recover an individual's 3-D head shape using stereo or shading, and then use the 3-D model to estimate the head's orientation. On the other hand, appearance-based approaches such as view interpolation (see [5]) or eigenface approaches (see [11]) are used in conjunction with a training stage (e.g., a person sits in front of a computer and is asked to look at several points on a screen by turning his head) that constructs the association between appearance and head orientation. Appearance models are expected to be less accurate than model-based approaches since they involve computing a global, lower-dimensional representation of the data (thus tuning the representation to "mean views"). Previous model-based and appearance-based algorithms include [13,7]. In Tsukamoto et al. [13], the orientation is modeled as a linear combination of disparities between facial regions and several face models. Gee and Cipolla [7] estimate head orientation based on knowledge of the individual face geometry
and assuming a weak perspective imaging model. Their method also depends on the distance between the eyes and the mouth, which often changes during facial expressions. In work related to ours, DeCarlo and Metaxas [3] used anthropometry to constrain a deformable model during head tracking.

We define head orientation as the three rotation angles in an orthogonal coordinate system located at the camera imaging plane. In this system, the roll angle is zero when the eyes are horizontal, the yaw angle is zero when the face is in frontal view, and the pitch is zero when the bridge of the nose is parallel to the image plane of the camera. The pitch angle definition is somewhat arbitrary, since inclinations of the nasal bridge differ between individuals. Our objective is to develop a vision system that can recover the orientation of a head in 3-D without prior knowledge of the exact head surface shape of the people being viewed.
2 A Perspective Model for Head Orientation Recovery

The computational model for head orientation recovery is based on face symmetry and anthropometric modeling of structure. Our approach recovers orientation from the distances among five points: the four eye corners and the nose tip. We show how these distances allow us to recover the absolute rotation angles about the Z and Y axes. However, these distances are insufficient for recovering the rotation around the X axis. Thus, a statistical analysis of face structure based on anthropometry is employed to estimate the angle of rotation about the X-axis.
[Figure 1: camera-centered coordinate system with the image plane; right eye corners E1, E2, left eye corners E3, E4, and the tip of the nose N; rotation angles α, β, γ.]
Figure 1: Geometric configuration of head orientation and coordinate system.
We employ a coordinate system fixed to the camera, with the origin at its focal point (see Figure 1). The image plane coincides with the XY-plane and the viewing direction coincides with the Z-axis. α, β and γ (pitch, yaw and roll, respectively) denote the three rotation angles of the head about the X-, Y- and Z-axis, respectively. Our model for head orientation estimation assumes that the four eye corners are collinear in 3D; this assumption can be relaxed to account for a slight horizontal tilt in the eyes of Asian subjects (see Farkas [6] for statistics on the deviation of the eye corners from collinearity).

Let upper-case letters denote coordinates and distances in 3D and lower-case letters denote their respective 2D projections. Let E1, E2, E3 and E4 denote the four eye corners and e1, e2, e3 and e4 denote their projections in the image plane ((u_i, v_i) are the coordinates of the points). A complete treatment of the estimation of these angles is given in [9].

The roll angle is trivial to compute from the eye corners. The head yaw is recovered based on the assumptions that the eyes are of equal width and that E1, E2, E3 and E4 are collinear. Then, from the symmetry of the eye widths, we compute the yaw angle as a function of measured distances in the image and the focal length of the camera. Only relative distances among the four corners of the eyes are required; the method has no dependence on face structure, the distance of the face from the camera, or feature translation in the image. The computation of the pitch angle assumes that the width of the eyes E and the length of the nasal bridge N are represented by a distribution function drawn from anthropometric data.

The estimation of yaw is given by

\beta = \arctan\!\left( \frac{f}{(S-1)\,u_1} \right)    (1)

where f is the focal length and u_1 is given by an expression, detailed in [9], whose terms involve only distances between points in the image. S is obtained by solving

\frac{\Delta u}{\Delta v} = \frac{u_2 - u_1}{v_1 - v_2} = -\,\frac{(S-1)\left(S - \left(1 + \frac{2}{Q}\right)\right)}{(S+1)\left(S + \left(1 + \frac{2}{Q}\right)\right)}    (2)

where only distances in the image are used (Q and M involve only distances in the image). The estimation of pitch is given by

\alpha = \arcsin\!\left( \frac{f}{p_0\,(p_1^2 + f^2)} \left[\, p_1^2 - \sqrt{p_0^2 p_1^2 - f^2 p_1^2 + f^2 p_0^2} \,\right] \right)    (3)

where p_0 is the nasal bridge projection at 0 degrees and p_1 is the measured projection of the nasal bridge (p_0 is estimated using anthropometry).
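To make the recovery step concrete, the following is a minimal numerical sketch (ours, not the authors' code). It assumes the pitch expression reconstructed above, the simple model of Figure 3 in which a nasal bridge of length N rotates about a point at distance Z from the camera, and an obvious slope formula for the roll that the paper does not spell out; the function names and the average-face numbers of Section 3 are used for illustration only.

import math

def roll_from_eye_corners(e1, e4):
    # Roll: orientation of the line through the outer eye corners e1 = (u1, v1),
    # e4 = (u4, v4). The paper only says roll is trivial to compute from the eye
    # corners; this slope-based formula is the obvious choice, not necessarily theirs.
    (u1, v1), (u4, v4) = e1, e4
    return math.atan2(v4 - v1, u4 - u1)

def pitch_from_nose_projection(p0, p1, f):
    # Pitch from projected nasal-bridge lengths, using the expression reconstructed
    # above: p0 = projection at 0 degrees pitch (predicted from anthropometry),
    # p1 = measured projection, f = focal length (all in the same units).
    root = math.sqrt(p0 * p0 * p1 * p1 - f * f * p1 * p1 + f * f * p0 * p0)
    return math.asin(f * (p1 * p1 - root) / (p0 * (p1 * p1 + f * f)))

# Forward check with the average-face numbers of Section 3:
# nasal bridge N = 50 mm, face-camera distance Z = 500 mm, focal length f = 16 mm.
N, Z, f = 50.0, 500.0, 16.0
p0 = f * N / Z                                         # bridge projection at 0 degrees
for true_pitch in (-30.0, -10.0, 0.0, 5.0):
    a = math.radians(true_pitch)
    p1 = f * N * math.cos(a) / (Z - N * math.sin(a))   # perspective projection of the bridge
    print(true_pitch, round(math.degrees(pitch_from_nose_projection(p0, p1, f)), 2))

Beyond the tangent angle arcsin(N/Z) discussed in Section 3, the projected bridge length alone no longer determines the pitch uniquely, so the check above is restricted to smaller rotations.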
[Figure 2 panels: X-Rotation (α), Y-Rotation (β) and Z-Rotation (γ); vertical axis: localization accuracy (pixels); horizontal axis: rotation angle (degrees).]
Figure 2: The localization accuracy of the rotation about each axis, while maintaining 0.5 (solid line), 1 (dashed line), 2 (dash-dot line), and 5 (dotted line) degrees of error.
3 Error Analysis Simulation

In this section the effects of both image error (localization of the eye corners) and model error (due to expected variations in length from the anthropometric models) on orientation recovery are presented. In this analysis, we assume that the structure is that of an average adult American male. Thus, the expected length of the nasal bridge is 50 mm, and the expected length of the eye fissure is 31.3 mm. The distance between model and camera is 500 mm, and the focal length is 16 mm. We also assume a pixel size of 16 µm (typical of what we might encounter with a PC-mounted camera).

Figure 2 illustrates the localization accuracy of image measurements required to obtain a given accuracy (0.5 to 5.0 degrees) in recovering orientation, as a function of the absolute orientation. As can be observed, very high localization accuracy is needed for yaw and roll of about 0 degrees. We call these angles degenerate angles. While the degenerate angles for yaw and roll are independent of any length scales, the degenerate angle for pitch depends on the angle φ = arcsin(N/Z), where N is the length of the nasal bridge and Z is the distance between the face and the camera. In this particular analysis, φ = arcsin(50/500) ≈ 5.74 degrees. Generally, this is the angle for which the line of sight to the nose tip is tangent to the circle of radius N centered on the face (see Figure 3).
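For concreteness, these numbers translate into image-plane quantities as in the following sketch (an illustrative calculation with the values quoted above; the conversion to pixels is ours, not the paper's).

import math

f_mm, pixel_mm = 16.0, 0.016    # focal length and pixel size (16 micrometers)
Z_mm = 500.0                    # face-to-camera distance
N_mm, E_mm = 50.0, 31.3         # nasal-bridge and eye-fissure lengths (average adult male)

f_pix = f_mm / pixel_mm                       # focal length in pixels: 1000
nose_pix = f_pix * N_mm / Z_mm                # frontal nasal-bridge projection: 100 pixels
eye_pix = f_pix * E_mm / Z_mm                 # frontal eye-fissure projection: 62.6 pixels
phi = math.degrees(math.asin(N_mm / Z_mm))    # degenerate pitch angle: about 5.74 degrees

print(f_pix, round(nose_pix, 1), round(eye_pix, 1), round(phi, 2))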
Figure 3: An illustration of the rotation about the X-axis. Considering rotations from -90 to 90 degrees, the projected point on the image plane reaches its highest position when the angle of rotation is φ degrees. This tangent angle is influenced by the distance between the center of rotation and the lens, and by the radius of the rotation.
In addition to depending on the localization accuracy of the image processing, the recovery of pitch also depends on the deviation of the observed face from the assumed face anthropometry, as illustrated in Figure 4. The plots show the errors in pitch recovery from applying our computational model to various face structures. The top graph displays the nasal bridge error for N ± σ_N and N ± 2σ_N (N = 50 mm and σ_N = 3.6 mm). The bottom graph displays the eye width error for E ± σ_E and E ± 2σ_E (E = 31.3 mm and σ_E = 1.2 mm). Notice that the errors are small away from the degenerate angle. The horizontal axis represents the actual pitch, while the vertical axis is the estimated pitch obtained by employing an average face structure.
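The sensitivity plotted in Figure 4 can be explored with a small simulation (an illustrative sketch only, not the authors' procedure: it reuses the pitch expression reconstructed in Section 2 together with the rotating-bridge model of Figure 3, and clamps the estimate when the measured projection exceeds what the assumed average face allows).

import math

def estimate_pitch(p0, p1, f):
    # Pitch expression reconstructed in Section 2, with clamping: near the
    # degenerate angle a longer-than-assumed nose can produce a projection p1
    # that has no real solution under the assumed p0.
    rad = max(0.0, p0 * p0 * p1 * p1 - f * f * p1 * p1 + f * f * p0 * p0)
    s = f * (p1 * p1 - math.sqrt(rad)) / (p0 * (p1 * p1 + f * f))
    return math.asin(max(-1.0, min(1.0, s)))

f, Z = 16.0, 500.0                  # focal length and face-camera distance (mm)
N_avg, sigma_N = 50.0, 3.6          # average nasal-bridge length and its S.D. (mm)
p0_assumed = f * N_avg / Z          # projection predicted for the average face

for N_true in (N_avg - 2 * sigma_N, N_avg - sigma_N, N_avg + sigma_N, N_avg + 2 * sigma_N):
    for actual in (-60.0, -30.0, -10.0, 0.0):
        a = math.radians(actual)
        p1 = f * N_true * math.cos(a) / (Z - N_true * math.sin(a))  # "measured" projection
        print(N_true, actual, round(math.degrees(estimate_pitch(p0_assumed, p1, f)), 2))

In this sketch the discrepancy between computed and actual pitch shrinks as the actual pitch moves away from the degenerate angle, in line with the observation above.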
[Figure 4 plots: computed pitch (degrees) versus actual pitch (degrees); curves labeled +3.6, -3.6, +7.2, -7.2 (nasal-bridge deviation) and +1.2, -1.2, +2.4, -2.4 (eye-width deviation).]
a) Nasal bridge error (+/- 1, 2 S.D.)
b) Eye width error (+/- 1, 2 S.D.)
Figure 4: An illustration showing the errors of our computational model for various face structures.
[Figure 5 diagram: Frame N and Frame N+1 feed Whole Face Region Tracking (motion estimation), then Local Features Region Tracking (motion estimation), then Fine Matching (color correlation) for the inner right eye corner, the outer right eye corner, the tip of the nose, etc.]
Figure 5: Diagram showing the hierarchical framework for facial feature tracking.
4 Facial Features Tracking

We employ a hierarchical method for tracking the facial features used by the head orientation recovery model. We assume the initial locations of these points are given; this initialization problem has been addressed in the literature (see, e.g., [10], [8] and [12]). Given these points in the first frame, we track each point independently by employing recently developed image-based parameterized tracking [1]. This performs both global region tracking of the whole face and local region tracking of the individual facial features. We then refine our position estimates by employing color correlation. Figure 5 shows the hierarchical framework for facial feature tracking.

From face and feature tracking, the locations of the feature points in the next frame are estimated. The positional refinement step involves creating a template, from the current frame, of each feature point and then performing a color correlation over a search region around the estimated location in the
next frame. In the correlation, a mismatch score M, which is a measure of the difference between the template T and the target image I, is determined. M is defined as follows:

M = \sum_{i,j} \max\left( |R_{T_{i,j}} - R_{I_{i,j}}|,\; |G_{T_{i,j}} - G_{I_{i,j}}|,\; |B_{T_{i,j}} - B_{I_{i,j}}| \right)    (4)
where R, G and B represent the red, green and blue pixel values respectively, and i, j are the pixel indices. Our orientation computational model assumes that the corners of the eyes are collinear; we take this constraint into account while the corners of the eyes are being tracked. While performing the correlation over the search region, the mismatch scores and their corresponding pixel positions are tabulated. Then, we consider the best m candidates for each feature. Next, a minimum squared-error line-fitting algorithm [4] is used to determine a single "eye" line. More precisely, given the set of 4m candidate points {(x_i, y_i)}, we find c_0 and c_1 that minimize the error function
\sum_{i=1}^{n} \left[ (c_0 + c_1 x_i) - y_i \right]^2
Once the best-fit line is obtained, we then choose, among the candidates for each eye corner, the one that minimizes

\frac{|d|}{|d_{max}|} + \frac{M}{M_{max}}    (5)

where |d| is the vertical distance from the candidate point to the line, and |d_{max}| and M_{max} are the maximum values of |d| and M respectively. Representative results of our tracking method are shown in Figure 6, and the orientation recovery results are shown in Figure 7.
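The refinement step can be summarized in code. The following is a minimal sketch under stated assumptions, not the authors' implementation: it assumes RGB images stored as uint8 NumPy arrays, a square search window that stays inside the image, and hypothetical helper names; the mismatch score and the selection criterion follow Eqs. (4) and (5), and the eye line is fitted to the pooled 4m candidates by ordinary least squares.

import numpy as np

def mismatch(template, patch):
    # Mismatch score M of Eq. (4): sum over pixels of the largest per-channel
    # absolute difference between the template T and the image patch I.
    return int(np.abs(template.astype(int) - patch.astype(int)).max(axis=2).sum())

def corner_candidates(image, template, top_left, radius, m):
    # Evaluate M over a (2*radius+1)^2 search window around the predicted
    # template position (top-left corner) and keep the m best (position, score) pairs.
    h, w = template.shape[:2]
    cands = []
    for dy in range(-radius, radius + 1):
        for dx in range(-radius, radius + 1):
            y, x = top_left[0] + dy, top_left[1] + dx
            patch = image[y:y + h, x:x + w]
            if patch.shape[:2] == (h, w):                 # skip windows falling off the image
                cands.append(((y, x), mismatch(template, patch)))
    return sorted(cands, key=lambda c: c[1])[:m]

def fit_eye_line(points):
    # Least-squares line v = c0 + c1*u through the pooled candidates of all four corners.
    u = np.array([p[1] for p in points], dtype=float)
    v = np.array([p[0] for p in points], dtype=float)
    c1, c0 = np.polyfit(u, v, 1)
    return c0, c1

def select_candidate(cands, c0, c1):
    # Pick the candidate minimizing |d|/|d_max| + M/M_max (Eq. (5)), where d is
    # the vertical distance from the candidate to the fitted eye line.
    d = [abs(y - (c0 + c1 * x)) for (y, x), _ in cands]
    M = [score for _, score in cands]
    d_max, M_max = (max(d) or 1.0), (max(M) or 1.0)
    scores = [di / d_max + Mi / M_max for di, Mi in zip(d, M)]
    return cands[int(np.argmin(scores))][0]

In use, corner_candidates would be run once per eye corner, the four candidate sets pooled into fit_eye_line, and select_candidate applied to each corner's own candidate list; the nose tip, which does not lie on the eye line, would simply take its best-matching candidate.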
5 Summary

We have presented an approach for computing head orientation by employing face symmetry and a statistical analysis of face structure from anthropometry. The rotations of the head about the Z- and Y-axes can be computed directly from the relative distances among the corners of the eyes and the camera focal length. The estimation of the orientation of the head about the X-axis employs an anthropometric analysis of face structure. A preliminary experiment on real images has demonstrated the performance of our approach.
Figure 6: Illustration of the result of applying the hierarchical tracking method to a sequence of 24 frames, captured at 10 frames/sec, of a face rotating in the yaw direction. Only every other frame, starting at the first frame, is shown.
[Figure 7 plot: angle (degrees) versus frame number; curves for pitch, yaw and roll.]
Figure 7: The plot shows the sequence of rotation angles recovered through the image sequence shown in Figure 6.
References

1. M. J. Black and Y. Yacoob, "Tracking and recognizing facial expressions in image sequences using local parameterized models of image motion," ICCV, 1995, 374-381.
2. R. Chellappa, C. L. Wilson and S. A. Sirohey, "Human and machine recognition of faces: A survey," Proc. of the IEEE, Vol. 83, 1995, pp. 705-740.
3. D. DeCarlo and D. Metaxas, "The integration of optical flow and deformable models with applications to human face shape and motion estimation," IEEE CVPR, 1996, 231-238.
4. R. O. Duda and P. E. Hart, Pattern Classification and Scene Analysis, John Wiley & Sons, 1973.
5. T. Ezzat and T. Poggio, "Facial analysis and synthesis using image-based models," International Conference on Face and Gesture, 1996, 116-121.
6. L. G. Farkas, Anthropometry of the Head and Face, 2nd edition, Raven Press, 1994.
7. A. Gee and R. Cipolla, "Estimating gaze from a single view of a face," ICPR, 1994, 758-760.
8. R. Herpers et al., "Edge and keypoint detection in facial regions," Proc. of Int. Conf. on Automatic Face and Gesture Recognition, 1996, 212-217.
9. T. Horprasert, Y. Yacoob and L. S. Davis, "Computing 3-D head orientation from a monocular image sequence," Proc. of Int. Conf. on Automatic Face and Gesture Recognition, 1996, 242-247.
10. K. M. Lam and H. Yan, "Locating and extracting the eye in human face images," Pattern Recognition, Vol. 29, No. 5, pp. 771-779, 1996.
11. A. Pentland, B. Moghaddam and T. Starner, "View-based and modular eigenspaces for face recognition," IEEE CVPR, 1994, 84-91.
12. M. J. T. Reinders, R. W. C. Koch and J. J. Gerbrands, "Locating facial features in image sequences using neural networks," Proc. of Int. Conf. on Automatic Face and Gesture Recognition, pp. 230-235, October 1996.
13. A. Tsukamoto, C. Lee and S. Tsuji, "Detection and pose estimation of human face with synthesized image models," ICPR, 1994, 754-757.
14. J. Young, Head and Face Anthropometry of Adult U.S. Citizens, Government Report DOT/FAA/AM-93/10, July 1993.