HEAD POSE ESTIMATION BY NON-LINEAR EMBEDDING AND MAPPING

Nan Hu, Weimin Huang
Institute for Infocomm Research (I2R)
21 Heng Mui Keng Terrace, Singapore 119613
{nhu,wmhuang}@i2r.a-star.edu.sg

Surendra Ranganath
Dept. of Electrical & Computer Engineering, National University of Singapore
4 Engineering Drive 3, Singapore 117576
[email protected]

ABSTRACT

In this paper, we present a new scheme to robustly estimate the head pose from either video sequences or individual images. Developed from ISOMAP, we learn a person-independent, nonlinear embedding space (which we call a 2-D feature space) for different poses. A nonlinear interpolation is proposed to map new sequences or images into this 2-D feature space. For video sequences in particular, we propose an adaptive local fitting technique to filter out unreasonable mappings. By exploiting the intrinsic characteristics of the embedding, we then estimate the head pose of the sequence or image. Experiments reported in this paper show robust results.


1. INTRODUCTION

Different head poses, although represented as high-dimensional images, should lie on some low-dimensional manifold. Recently, new nonlinear dimensionality reduction techniques have been introduced, including ISOMAP [2] and LLE [6]. Both methods have been shown to successfully embed a manifold hidden in a high-dimensional space into a low-dimensional space in several examples, and several authors have applied ISOMAP or LLE to practical problems, e.g., [8] and [7]. When dealing with head images, however, both ISOMAP and LLE fail to embed multiple persons' data together, since both are person-dependent, i.e., they can only work on an individual person's data. Here, we propose a new scheme based on ISOMAP to build a unified 2-D feature space for multiple persons' head pose images, and we nonlinearly map a new image or sequence into this space to find the corresponding pose. Results show that our method works on head images even when the head is totally or partially turned away from the camera.

2. METHODOLOGY

In this section, we describe our head pose estimation method. The algorithm is composed of two parts: i) unified embedding to find the 2-D feature space, and ii) parameter learning to find a person-independent mapping. We use foreground segmentation and edge detection to extract the head in each frame of the sequence; however, our algorithm can also be used with other head tracking algorithms [1].

2.1. Data Description and Preprocessing

The data we used are image sequences of heads taken by a fixed video camera from a near-horizontal view. Each person sits in a chair that can rotate; we rotate the chair while the person keeps the head pose fixed, so that the tilt angle stays at about 90 degrees. We segment and extract the head from each frame of the sequence. Since the size of the head in each image can differ, we normalize each head image to 24 × 16 pixels. Histogram equalization and Gaussian filtering are then applied to each size-normalized image to reduce noise and the effect of varying illumination. A sample of the normalized and Gaussian-filtered heads is shown in Fig. 1.

Fig. 1. An example of the head sequence used in our proposed method.
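For concreteness, the preprocessing of Sec. 2.1 can be sketched as follows. This is an illustrative sketch, not the code used in the paper: it assumes the head bounding box is already available from some detector or tracker (the paper uses foreground segmentation and edge detection), and the function and parameter names are our own.

```python
# Minimal preprocessing sketch (not the authors' code): given a head bounding
# box from any detector/tracker, produce the 24 x 16 normalized patch described
# in Sec. 2.1. Function and variable names are illustrative assumptions.
import cv2
import numpy as np

def preprocess_head(frame_gray, bbox, size=(16, 24), blur_sigma=1.0):
    """Crop, resize to 24 x 16 (rows x cols), equalize and smooth a head patch."""
    x, y, w, h = bbox                      # bounding box assumed to come from
    head = frame_gray[y:y + h, x:x + w]    # foreground/edge-based extraction
    head = cv2.resize(head, size, interpolation=cv2.INTER_AREA)  # dsize = (width, height)
    head = cv2.equalizeHist(head)          # reduce illumination variation
    head = cv2.GaussianBlur(head, (3, 3), blur_sigma)  # suppress noise
    return head.astype(np.float32).ravel() # 384-D vector for the embedding step
```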

2.2. Unified Embedding

2.2.1. Nonlinear Dimensionality Reduction

The ISOMAP method [2] finds coordinates in R^d for data that lie on a d-dimensional manifold embedded in a D-dimensional space with D >> d. The aim of using ISOMAP is to preserve the topological structure of the data, i.e., Euclidean distances in R^d should correspond to geodesic distances (distances along the manifold). Fig. 2(a) shows the embedding of the sequence sampled in Fig. 1 using ISOMAP. The sequence is made up of 967 frames. We use a 2-D embedding since, physically, the motion is a pan rotation only (a 2-D case). As can be seen, the embedding arranges the different pan poses in a circular order, and the outline of the embedding is ellipse-like. One point that needs to be emphasized is that we do not use the temporal relation between frames to achieve the embedding, since the goal is to obtain an embedding that preserves the geometry of the manifold. Fig. 2(b) shows the result of the classic linear dimensionality reduction method (PCA) on the same sequence; we also choose a 2-D embedding to make it comparable. As can be seen, using a classic linear method such as PCA leads to an erroneous embedding in our case.
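As an illustration of the comparison in Fig. 2, the 2-D ISOMAP and PCA embeddings of one person's sequence can be computed as sketched below. This is not the authors' code; it assumes the preprocessed 384-dimensional head vectors are stacked in an array X, and uses scikit-learn's Isomap as a stand-in for the implementation of [2] (the neighborhood size is an assumed parameter).

```python
# Sketch of the 2-D embeddings compared in Fig. 2 (illustrative, not the
# authors' code). X is assumed to be an (n_frames, 384) array of head vectors.
from sklearn.manifold import Isomap
from sklearn.decomposition import PCA

def embed_sequence(X, n_neighbors=8):
    """Return 2-D ISOMAP and PCA embeddings of one person's head sequence."""
    z_isomap = Isomap(n_neighbors=n_neighbors, n_components=2).fit_transform(X)
    z_pca = PCA(n_components=2).fit_transform(X)
    return z_isomap, z_pca
```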




Fig. 2. 2-D embedding of the sequence sampled in Fig. 1 (a) by ISOMAP, (b) by PCA.

2.2.2. Embedding Multiple Manifolds

Although ISOMAP is good at representing the hidden manifold of high-dimensional data in a low-dimensional embedding space, as shown in Fig. 2(a), it fails when trying to embed multiple people's data together into one manifold. Since the distances between different points of the same person are typically much smaller than the distances between corresponding points of different persons, the residual-variance minimization used in ISOMAP preserves mainly the large contributions, i.e., the distances between points of different persons. This is shown in Fig. 3(a), where ISOMAP is used to embed two people's manifolds with all inputs spatially registered. As a result, the embedding shows separate manifolds (note that one manifold has degenerated into a point, because the embedding is dominated by the cross-person distances, which are much larger than the within-person distances). Another fundamental problem is that different persons have differently shaped manifolds, as can be seen in Fig. 3(b).

Table 1. A complete description of our unified embedding algorithm.

1. Individual Embedding. Let Y^P = {y_1^P, ..., y_{n_P}^P} be the vector sequence of length n_P in the original measurement space for person P. ISOMAP is used to embed Y^P into a 2-D embedded space; Z^P = {z_1^P, ..., z_{n_P}^P} are the corresponding coordinates in the 2-D embedded space for person P.

2. Ellipse Fitting. For person P, fit an ellipse to Z^P, resulting in an ellipse with parameters: center c_e^P = (c_x^P, c_y^P)^T, major and minor axes a^P and b^P respectively, and orientation Φ_e^P.

3. Multiple Embedding. For person P, let z_i^P = (z_{i1}^P, z_{i2}^P)^T, i = 1, ..., n_P. Rotate, scale and translate every z_i^P to obtain
   z*_i^P = [ 1/a^P  0 ; 0  1/b^P ] [ cos Φ_e^P  −sin Φ_e^P ; sin Φ_e^P  cos Φ_e^P ] (z_i^P − c_e^P).
   Identify the frontal-face frames for person P and the corresponding {z*_i^P} of these frames. The mean of these points is calculated, and the embedded space is rotated so that this mean lies at the 90-degree angle. After that, choose a frame l showing the left profile and test whether z*_l^P is close to 0 degrees. If not, set z*_i^P = [ −1  0 ; 0  1 ] z*_i^P.


Fig. 3. (a) Embedding obtained by ISOMAP on the combination of two persons' sequences. (b) Separate embeddings of the two persons' manifolds.

To embed multiple persons' data and find the 2-D feature space, each person's manifold is first embedded separately using ISOMAP. We then employ an ellipse fitting method [3] to best represent each manifold before normalizing it. Fig. 4 shows the ellipse fitted to the sequence sampled in Fig. 1. After fitting an ellipse to each manifold, we normalize all the manifolds into a unified embedding space, as detailed in Table 1. Fig. 5 shows two of the sequences normalized into the 2-D feature space.


Fig. 4. The ellipse (solid line) fitted to the embedded sequence (dotted points).
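A rough sketch of Table 1, steps 2-3 (ellipse fitting and normalization) is given below. It is illustrative only: the ellipse is approximated by the covariance (second-moment) ellipse of the embedded points rather than by the direct least-squares fit of [3], the rotation sign convention is assumed, and the frontal-face/left-profile alignment step of Table 1 is omitted.

```python
# Illustrative sketch of Table 1, steps 2-3 (not the authors' code).
import numpy as np

def fit_ellipse(Z):
    """Z: (n, 2) ISOMAP coordinates. Return center, axes (a, b), orientation phi."""
    c = Z.mean(axis=0)
    cov = np.cov((Z - c).T)
    evals, evecs = np.linalg.eigh(cov)             # ascending eigenvalues
    a, b = np.sqrt(2.0 * evals[::-1])              # semi-axis estimates (approximate scale convention)
    phi = np.arctan2(evecs[1, -1], evecs[0, -1])   # orientation of the major axis
    return c, a, b, phi

def normalize_manifold(Z, c, a, b, phi):
    """Map the fitted ellipse roughly onto the unit circle (Table 1, step 3)."""
    # rotate by -phi so the major axis aligns with the x-axis (sign convention assumed)
    R = np.array([[np.cos(phi),  np.sin(phi)],
                  [-np.sin(phi), np.cos(phi)]])
    S = np.diag([1.0 / a, 1.0 / b])
    return (S @ R @ (Z - c).T).T
```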


Fig. 5. Two sequences whose low-dimensional embedded manifolds have been normalized into the unified embedding space (shown separately).

2.3. Person-Independent Mapping

2.3.1. RBF Interpolation

As described in Table 1, denote the input images of person P from a sequence as Y^P = {y_1^P, ..., y_{n_P}^P} ⊂ R^D and the set of corresponding points in the feature space (the unified embedded space) as Z*^P = {z*_1^P, ..., z*_{n_P}^P}, where n_P is the number of frames for person P. We can then learn a nonlinear interpolative mapping from the input images to the corresponding coordinates in the feature space using radial basis functions (RBF). We combine all the persons' sequences, Γ = {Y^{P_1}, ..., Y^{P_k}} = {y_1, ..., y_{n_0}}, and their corresponding feature-space coordinates, Λ = {Z*^{P_1}, ..., Z*^{P_k}} = {z*_1, ..., z*_{n_0}}, where n_0 = n_{P_1} + ... + n_{P_k} is the total number of input images. For each coordinate of the feature space, we take the interpolative mapping function to have the form

    f(y) = ω_0 + Σ_{i=1}^{M} ω_i ψ(|y − c_i|),        (1)

where ψ(·) is a real-valued radial basis function, the ω_i are real coefficients, the c_i are centers of the basis functions in R^D, and |·| is the norm on R^D (the original input space). Choices for the basis function include the thin-plate spline (ψ(u) = u² log u), the multiquadric (ψ(u) = sqrt(u² + a²)), the Gaussian (ψ(u) = exp(−u²/2σ²)), etc. In our experiments, we use Gaussian basis functions and employ the k-means clustering algorithm [4] to find the corresponding centers. Once the basis centers have been determined, the widths σ_i² are set equal to the variances of the points in the corresponding clusters.

To decide the number of basis functions to use, we experimentally tested various values of M and calculated the mean squared error of the RBF output. For every value of M, we used leave-one-out cross-validation, i.e., we take out in turn one person's data for testing and combine all the remaining persons' data to learn the parameters of the RBF interpolation system. Fig. 6 shows the results of this test for different numbers of basis functions (from 2 to 50). As can be seen in Fig. 6, to avoid both underfitting and overfitting, a good choice of the number of basis functions is M = 8.

Let ψ_i = ψ(|y − c_i|). By introducing an extra basis function ψ_0 = 1, (1) can be written as

    f(y) = Σ_{i=0}^{M} ω_i ψ_i.        (2)

Let the points in the feature space be written as z*_i = (z*_{i1}, z*_{i2}). After obtaining the centers c_i and determining the widths σ_i², to determine the weights ω_i we merely have to solve a set of linear equations

    f_l(y_i) = Σ_{j=0}^{M} ω_{lj} ψ(|y_i − c_j|) = z*_{il},   i = 1, ..., n_0,  l = 1, 2.        (3)

Defining the matrices

    Ω = [ ω_{10} ... ω_{1M} ; ω_{20} ... ω_{2M} ],
    Z = [ z*_{11} ... z*_{n_0 1} ; z*_{12} ... z*_{n_0 2} ],
    Ψ = [ ψ_{10} ... ψ_{n_0 0} ; ... ; ψ_{1M} ... ψ_{n_0 M} ],  with ψ_{ij} = ψ(|y_i − c_j|) and ψ_{i0} = 1,

(3) can be written in matrix form as

    Ω Ψ = Z.        (4)

The least-squares solution for Ω is then given by

    Ω = Z Ψ†,        (5)

where Ψ† = Ψ^T (Ψ Ψ^T)^{−1} is the pseudo-inverse of Ψ.

Fig. 6. Mean squared fitting error for different values of M (the number of basis functions).

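A minimal sketch of the RBF training and mapping of Eqs. (1)-(5) is shown below (not the authors' implementation). It assumes the pooled head vectors are in Y (n_0 × D) and the unified-space coordinates in Zstar (n_0 × 2); M = 8 follows the cross-validation result in Fig. 6, and the per-cluster width computation is an assumption about how the cluster variances are used.

```python
# Sketch of the person-independent RBF mapping, Eqs. (1)-(5) (illustrative,
# not the authors' code). Y: (n0, D) stacked head vectors, Zstar: (n0, 2)
# unified-space coordinates; M = 8 follows the cross-validation in Fig. 6.
import numpy as np
from sklearn.cluster import KMeans

def train_rbf(Y, Zstar, M=8):
    km = KMeans(n_clusters=M, n_init=10).fit(Y)
    centers = km.cluster_centers_
    # one width per center: mean squared distance within each cluster (assumed)
    sigma2 = np.array([np.mean(np.sum((Y[km.labels_ == j] - centers[j]) ** 2, axis=1))
                       for j in range(M)]) + 1e-8
    Psi = np.ones((M + 1, len(Y)))                             # row 0 is the bias psi_0 = 1
    d2 = ((Y[None, :, :] - centers[:, None, :]) ** 2).sum(-1)  # (M, n0) squared distances
    Psi[1:] = np.exp(-d2 / (2.0 * sigma2[:, None]))            # Gaussian basis responses
    Omega = Zstar.T @ np.linalg.pinv(Psi)                      # Eq. (5): Omega = Z Psi^+
    return centers, sigma2, Omega

def map_to_feature_space(y, centers, sigma2, Omega):
    d2 = ((centers - y) ** 2).sum(-1)
    psi = np.concatenate(([1.0], np.exp(-d2 / (2.0 * sigma2))))
    return Omega @ psi                                         # 2-D point in the feature space
```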

2.3.2. Adaptive Local Fitting

The RBF interpolation maps an image or a video sequence into the 2-D feature space and finds the corresponding coordinate or sequence of coordinates. When processing video sequences for attentive behavior detection, a temporal continuity requirement and a temporal local-linearity assumption can be applied to correct unreasonable mappings, to reduce head localization errors in individual frames, and to smooth the outputs of the RBF interpolation. We propose an adaptive local fitting (ALF) technique composed of two parts: 1) adaptive outlier correction; 2) locally linear fitting.

In adaptive outlier correction, assuming temporal continuity of the head video sequence and of the corresponding 2-D features, estimates that are far from those of their S (an even number, e.g. S = 2s_0) temporally nearest neighbor (S-TNN) frames are defined as outliers. Let z_t be the output of the RBF interpolation system for the t-th frame, and let D_t^S be the mean distance between z_t and the points {z_{t−k} | −s_0 ≤ k ≤ s_0, k ≠ 0}:

    D_t^S = (1/S) Σ_{k=−s_0, k≠0}^{s_0} |z_t − z_{t−k}|,        (6)

where |·| is the norm on the 2-D feature space. For the t-th frame, we wait until the (t + s_0)-th image (to obtain all S-TNNs) before making the update. We calculate D_t^S and adaptively update the mean M_t and the variance V_t of the sequence {D_{s_0+1}^S, ..., D_t^S} as follows:

    M_t = (1/(t − s_0)) [ (t − s_0 − 1) M_{t−1} + D_t^S ],
    V_t = (1/(t − s_0 − 1)) [ Σ_{j=s_0+1}^{t} (D_j^S)² − (t − s_0) M_t² ].

To check for outliers, we set a threshold h = λ sqrt(V_t), where λ is a tolerance coefficient; using different values of λ makes the system tolerant to different degrees of sudden change in the head pose. If D_t^S − M_t > h, we deem the point z_t an outlier and set z_t = (1/S) Σ_{j=t−s_0, j≠t}^{t+s_0} z_j.
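The adaptive outlier correction just described might be sketched as follows (illustrative only; s_0, λ and the treatment of the first and last s_0 frames are assumptions, and the locally linear fitting step of [5] is not included).

```python
# Sketch of adaptive outlier correction (Sec. 2.3.2); illustrative only.
# z: (T, 2) array of RBF outputs; s0 and lam (lambda) are assumed parameters.
import numpy as np

def adaptive_outlier_correction(z, s0=2, lam=3.0):
    z = z.copy()
    S = 2 * s0
    D, mean, var = [], 0.0, 0.0
    for t in range(s0, len(z) - s0):            # wait for all S temporal neighbors
        nbrs = np.r_[z[t - s0:t], z[t + 1:t + s0 + 1]]
        Dt = np.mean(np.linalg.norm(z[t] - nbrs, axis=1))              # Eq. (6)
        D.append(Dt)
        n = len(D)
        mean = ((n - 1) * mean + Dt) / n                               # running mean M_t
        var = (np.sum(np.square(D)) - n * mean ** 2) / max(n - 1, 1)   # running variance V_t
        if n > 1 and Dt - mean > lam * np.sqrt(var):                   # outlier test
            z[t] = nbrs.mean(axis=0)                                   # replace by S-TNN mean
    return z
```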

In locally linear fitting, we assume local linearity within a temporal window of length L and employ the technique suggested in [5] for linear fitting to smooth the output of the RBF interpolation. After the above processing, the head pose angle is easily estimated as

    θ_t = tan^{−1}(z_{t2} / z_{t1}).        (7)

3. EXPERIMENTAL RESULTS

In this section, we present our head pose estimation results. In the experiment, we use 7 persons' sequences, in which each person slowly rotates the head through three complete circles continuously. To test the generalization ability of our person-independent mapping function, we use leave-one-out cross-validation (LOOCV), i.e., we take out in turn one sequence as the test data and use all the remaining sequences for parameter learning. Fig. 7 shows the results of the person-independent mapping used to estimate the head pose angle in each frame for four of the 7 sequences, where these 7 sequences are used in turn as the test data in the LOOCV procedure. The green lines correspond to the "ground truth" head pose angle, obtained by calculating the projection of the test sequence into the unified 2-D embedded space. This ground truth can be compared to the pose angles estimated by the person-independent RBF interpolation system, shown as red lines; it can be seen that the latter are very good approximations to the ground truth. The values above the small head images are the pose angles of those images calculated from the person-independent mapping.

4. DISCUSSION AND CONCLUSION

Our nonlinear embedding method is robust across different persons and varied illumination. The data we used were recorded under different illumination conditions, with or without lights on, and in different rooms with different (inhomogeneous) backgrounds. The unified embedding and the nonlinear mapping make our method person-independent, regardless of whether the person is in our database. In addition, our system is also robust to small facial expressions, as in Fig. 7(d) where the person is smiling. Our system uses small images (24 × 16) for the estimation. For large images, we can shrink the size and adjust the aspect ratio to fit our system; in fact, large head images are sometimes not as easy to capture as small ones. Future work is to extend our method to a system that can also work with different tilt angles of the head and large facial expressions such as laughs.


Fig. 7. Four of the LOOCV results of our person-independent mapping system for estimating the head pose angle. Green lines correspond to "ground truth" pose angles, while red lines show the pose angles estimated by the person-independent mapping.

5. REFERENCES

[1] M. Yang, D. Kriegman, and N. Ahuja, "Detecting Faces in Images: A Survey," IEEE Trans. PAMI, vol. 24, no. 1, pp. 34-58, Jan. 2002.
[2] J. Tenenbaum, V. de Silva, and J. Langford, "A Global Geometric Framework for Nonlinear Dimensionality Reduction," Science, vol. 290, no. 5500, pp. 2319-2323, Dec. 2000.
[3] A. Fitzgibbon, M. Pilu, and R. Fisher, "Direct Least-Square Fitting of Ellipses," IEEE Trans. PAMI, vol. 21, no. 5, pp. 476-480, June 1999.
[4] R. Duda, P. Hart, and D. Stork, Pattern Classification, 2nd ed., John Wiley & Sons, 2000.
[5] M. C. Hutcheson, "Trimmed Resistant Weighted Scatterplot Smooth," Master's Thesis, Cornell University, Ithaca, NY, 1995.
[6] S. Roweis and L. Saul, "Nonlinear Dimensionality Reduction by Locally Linear Embedding," Science, vol. 290, no. 5500, pp. 2323-2326, Dec. 2000.
[7] A. Elgammal and C. Lee, "Separating Style and Content on a Nonlinear Manifold," Proc. CVPR'04, pp. 478-485, 2004.
[8] M. Vlachos et al., "Non-Linear Dimensionality Reduction Techniques for Classification and Visualization," Proc. 8th ACM SIGKDD Int'l Conf. on Knowledge Discovery and Data Mining, pp. 645-651, 2002.
