
A Particle Filtering Framework for Joint Video Tracking and Pose Estimation
Chong Chen, Dan Schonfeld, Fellow, IEEE

Abstract— A method is introduced to track an object's motion and estimate its pose directly from 2D image sequences. The Scale-Invariant Feature Transform (SIFT) is used to extract corresponding feature points from the image sequences. We demonstrate that pose estimation from the corresponding feature points can be formulated as a solution to Sylvester's equation. We show that the proposed approach to the solution of Sylvester's equation is equivalent to the classical SVD method for 3D-3D pose estimation. However, whereas classical SVD cannot be used for pose estimation directly from 2D image sequences, our method based on Sylvester's equation provides a new approach to pose estimation. Smooth video tracking and pose estimation are obtained by using the solution to Sylvester's equation within the importance sampling density of the particle filtering framework. Finally, computer simulation experiments conducted on synthetic data and real-world videos demonstrate the effectiveness of our method in both robustness and speed compared with other similar object tracking and pose estimation methods.

Index Terms— Video Tracking, Pose Estimation, Sylvester's Equation.

I. INTRODUCTION

Video tracking and pose estimation have received a large amount of attention over the past few decades. Tracking can be used to monitor the motion of objects as characterized by their position coordinates in video sequences. In many applications, such as human-computer interaction, interactive conferencing, and virtual reality, it is also important to monitor the rotation of the object out of the image plane (namely, pan and tilt), and the rotation angles can be obtained by using various methods for pose estimation. In this paper, given video sequences but no additional information, we focus on the problem of tracking and pose estimation of a rigid object directly from these 2D images, and we present a particle filtering approach to estimate both tracking and pose parameters. The premise of our approach is that particle filters can be used to estimate a state vector representation of affine models to capture the coordinates and rotation of objects in video sequences. Braathen et al. [1] estimate the pose state with particle filtering by first estimating the camera parameters and 3D head geometry from a set of training images. An earlier effort for head tracking and pose estimation based on particle filtering has been presented in [2]. This method, unlike our approach, also relies on a learning scheme to estimate the pose by using a large collection of training images. They use Gaussian and Gabor filters for modeling image features for head pose estimation.

C. Chen and D. Schonfeld are with the Department of Electrical and Computer Engineering, University of Illinois at Chicago, Chicago, IL 60607 (e-mail: [email protected], [email protected])

II. R ELATED W ORK Pose analysis aims to recover body poses and motion parameters from static images or video sequences. There are two basic problems in the field of pose estimation: (a) camera pose estimation and (b) object pose estimation. Both views attempt to estimate the pose from points or lines with known correspondences. These two topics are closely related and many algorithms can be applied in the solution of both problems. For example, in [3], the head pose is recovered by formulating the problem as a camera pose estimation problem. Many approaches have been proposed during the past years, and most can be classified into two major categories: (i) feature-based methods and (ii) appearance-based methods. Appearance-based methods often rely on some training to obtain templates for object pose estimation. Usually, the main difficulty with appearance-based methods is that they depend on the image conditions under which learning is performed. Thus, changes in these conditions require that the entire learning process be repeated [4]. Image normalization applied to the learning appearance models has been used to make the models more robust to illumination changes [5]. Feature-based methods rely on corresponding feature points from different images. Solutions with redundant data can be classified into two classes: (i) linear methods and (ii) nonlinear methods [6]. Nonlinear solutions are generally more robust to noise; however, they suffer from heavy computation, usually require a good initialization, and most cannot guarantee convergence. While linear methods require less computation, the results are often inferior due to lack of accuracy and corruption by noise. In the literature, the study of the orientation problem is mainly focused on investigation of the uniqueness or the number of solutions. A family of linear algorithms [7] has been developed to provide a unique solution for the 4-point, 5-point and n-point camera pose estimation problem. This method provides an efficient solution by exploiting data redundancy while avoiding iterations and employing the Sylvester resultants to eliminate variables in their computation. Andreff et al. [8] model the rigid transform between a camera mounted onto a robot end-effector and the end-effector itself as a Sylvester’s equation by using structure-from-motion with known robot motion for hand-eye calibration. Sylvester’s equation is used to obtain a linear solution based on the rotation and translation parameters. In contrast, we rely on Sylvester’s equation to estimate the pose parameters by solution of a constrained optimization problem. The object pose estimation from corresponding points based

on SVD techniques has been well established [9]. Horn's algorithm computes the eigensystem of a derived matrix and is similar to the SVD approach [10]. Olsson et al. [11] generalize the method in [10] by incorporating point, line and plane features in a common framework for finding the globally optimal solution to the problem of pose estimation, but it can only deal with a known object and also parameterizes rotations with quaternions.

The remainder of this paper is organized as follows: In Section 3, we introduce the method of pose estimation based on Sylvester's equation. In Section 4, we introduce the particle filtering framework for estimating the object tracking and pose parameters from video sequences. In Section 5, we conduct computer simulation experiments and comparisons to existing methods. Finally, we present a summary of our results in Section 6.

III. POSE ESTIMATION BASED ON SYLVESTER'S EQUATION

The proposed approach for pose estimation is based on corresponding points. We rely on feature points based on the Scale-Invariant Feature Transform (SIFT) [12]. We begin with a known pose, estimate the pose change between two successive frames, and obtain the current state by accumulating all of the past results.

A. Adaptive Block Matching

In our system, we only consider the feature points within the object region in each image. Therefore, a tracking algorithm must first be applied to extract the region of interest. The tracking system used in our approach is based on motion-based particle filtering (MBPF) [13]. In MBPF, Adaptive Block Matching (ABM) is first used for motion estimation, which characterizes the mean of a Gaussian density function for importance sampling. Initially, the ABM output is a mask (i.e., a union of rectangular blocks) covering the estimated position of the target. In our system, an ellipse is optimally fitted to the mask using the least-squares method. The center of the ellipse is used to represent the 2D translation of the object from its last position. The SIFT features are extracted from the entire image, and only those within the area of the ellipse are kept and used for the solution of Sylvester's equation, which provides the 3D rotation estimate. Both the 2D translation from ABM and the 3D rotation parameters provided by Sylvester's equation are used to define the mean of the importance sampling density function in particle filtering.

B. Outlier Rejection

In general, the area obtained by using ABM includes some scene background, and we remove unwanted feature pairs based on their slopes and lengths. As can be seen in Fig. 1, if an arrow's direction or length is distinctly different from most of the arrows, it indicates a false match. Based on the results presented in [14] as well as our experiments, we define a match as incorrect if its slope lies outside $m_{slope} \pm 0.2$ rad (around $\pm 11.5$ degrees) or its length lies outside $m_{length}(1 \pm 0.3)$ pixels, where $m_{slope}$ is the mean slope and $m_{length}$ is the mean length computed from all of the feature points. RANSAC could also be introduced here to identify incorrect matches [15].

Fig. 1. Feature points extracted with SIFT: each arrow indicates the moving direction and distance of a feature point from two successive images. (a) from the whole image; (b) from the object region.
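As an illustration of the matching and outlier-rejection step above, the following sketch assumes an OpenCV build that exposes cv2.SIFT_create; the function names, the Lowe ratio test, and the default tolerances are illustrative choices, not part of the original system.

```python
import cv2
import numpy as np

def matched_points(img_prev, img_next):
    """Extract SIFT keypoints in two frames and return matched (x, y) pairs."""
    sift = cv2.SIFT_create()
    kp1, des1 = sift.detectAndCompute(img_prev, None)
    kp2, des2 = sift.detectAndCompute(img_next, None)
    matcher = cv2.BFMatcher(cv2.NORM_L2)
    # Lowe's ratio test to keep only distinctive matches (illustrative threshold).
    good = [m for m, n in matcher.knnMatch(des1, des2, k=2)
            if m.distance < 0.75 * n.distance]
    p_t = np.array([kp1[m.queryIdx].pt for m in good])   # points at time t
    p_t1 = np.array([kp2[m.trainIdx].pt for m in good])  # points at time t+1
    return p_t, p_t1

def reject_outliers(p_t, p_t1, slope_tol=0.2, length_tol=0.3):
    """Discard pairs whose displacement direction or length deviates from the mean."""
    d = p_t1 - p_t
    slopes = np.arctan2(d[:, 1], d[:, 0])    # direction of each arrow (rad)
    lengths = np.hypot(d[:, 0], d[:, 1])     # length of each arrow (pixels)
    keep = (np.abs(slopes - slopes.mean()) <= slope_tol) & \
           (np.abs(lengths - lengths.mean()) <= length_tol * lengths.mean())
    return p_t[keep], p_t1[keep]
```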

C. Projection from 3D to 2D

The problem is to determine the rotation angles and translation vector from the matched points between two successive images. Here, we adopt the image projection model based on the camera pinhole model presented in [9]. This model assumes that the relative depth within the object is small compared to the distance of the object to the camera. Given a point on the 3D real object, its positions before and after motion have the following relationship:

\begin{pmatrix} x^j_{t+1} \\ y^j_{t+1} \\ z^j_{t+1} \end{pmatrix} = R_{3\times3} \begin{pmatrix} x^j_t \\ y^j_t \\ z^j_t \end{pmatrix} + T_{3\times1} ,   (1)

where $R_{3\times3}$ is a $3\times3$ rotation matrix, $T_{3\times1}$ is a $3\times1$ translation vector, and $j$ $(j = 1, 2, \ldots)$ is the index of matched feature pairs. Our goal is to estimate the rotation matrix and the translation vector.

With the perspective projection model, if $(u^j_t, v^j_t)$ and $(u^j_{t+1}, v^j_{t+1})$ are the corresponding projected points in the image, we have

\frac{x^j_k}{z^j_k} = \frac{u^j_k}{f} ,   \frac{y^j_k}{z^j_k} = \frac{v^j_k}{f} ,   k = t, t+1 ,   (2)

where $f$ is the focal length. Substituting Eq. (2) into Eq. (1), we have

\begin{pmatrix} u^j_{t+1} \\ v^j_{t+1} \\ f \end{pmatrix} = \frac{z^j_t}{z^j_{t+1}} R_{3\times3} \begin{pmatrix} u^j_t \\ v^j_t \\ f \end{pmatrix} + T'_{3\times1} ,   (3)

where $T'_{3\times1} = T_{3\times1} \cdot f / z^j_{t+1}$ is the $3\times1$ translation vector within the image plane [16].

However, the ratio $z^j_t / z^j_{t+1}$ is unknown. The primary difficulty in these models is that the depth information is lost in the process of projecting 3D objects into 2D images. Here, we regard $z^j_t / z^j_{t+1} = 1$. This assumption is valid when the object is comparably far from the camera and the depth change is small between successive frames in the video sequence (usually, less than 1/24 second).


A similar model, the weak perspective camera model, is used in cases with a known object [17]. Adams et al. [18] show that this simple homography-based method works fast and accurately when the planar assumption is valid, and that the accuracy degrades as the excursions from the plane become significant. For simplicity, in the remainder of this presentation, we assume that the ratio of the depth associated with corresponding features in adjacent frames is unity. Thus, Eq. (3) is simplified as

\begin{pmatrix} u^j_{t+1} \\ v^j_{t+1} \\ f \end{pmatrix} = R_{3\times3} \begin{pmatrix} u^j_t \\ v^j_t \\ f \end{pmatrix} + T'_{3\times1} .   (4)

Furthermore, if we consider only the first two rows of the above matrix equation and rearrange, we obtain

\begin{pmatrix} u^j_{t+1} \\ v^j_{t+1} \end{pmatrix} = \begin{pmatrix} r_{11} & r_{12} \\ r_{21} & r_{22} \end{pmatrix} \begin{pmatrix} u^j_t \\ v^j_t \end{pmatrix} + \begin{pmatrix} l_x \\ l_y \end{pmatrix} ,   (5)

where $l_x = r_{13} f + t'_x$ and $l_y = r_{23} f + t'_y$ are the new translation parameters.

D. Lagrange Method

The discussion above is presented for one pair of points. Considering all of the matched points obtained from SIFT, we can compute the least-squares estimate with respect to the rotation and translation variables. Thus, the problem of pose estimation from image sequences is given by

\min_{R_1, R_2, L_x, L_y} (U_{t+1} - R_1 P_t - L_x)(U_{t+1} - R_1 P_t - L_x)^T + (V_{t+1} - R_2 P_t - L_y)(V_{t+1} - R_2 P_t - L_y)^T ,   (6)

subject to

R_1 R_1^T + r_{13}^2 = 1 ,   R_2 R_2^T + r_{23}^2 = 1 ,   R_1 R_2^T + r_{13} r_{23} = 0 ,

where $R_{2\times2} = \begin{pmatrix} R_1 \\ R_2 \end{pmatrix} = \begin{pmatrix} r_{11} & r_{12} \\ r_{21} & r_{22} \end{pmatrix}$ is the top-left $2\times2$ block of $R_{3\times3}$,

P_{t+1} = \begin{pmatrix} U_{t+1} \\ V_{t+1} \end{pmatrix} = \begin{pmatrix} u^1_{t+1} & u^2_{t+1} & \ldots \\ v^1_{t+1} & v^2_{t+1} & \ldots \end{pmatrix} ,   P_t = \begin{pmatrix} U_t \\ V_t \end{pmatrix} = \begin{pmatrix} u^1_t & u^2_t & \ldots \\ v^1_t & v^2_t & \ldots \end{pmatrix} ,   and   L = \begin{pmatrix} L_x \\ L_y \end{pmatrix} = \begin{pmatrix} l_x & l_x & \ldots \\ l_y & l_y & \ldots \end{pmatrix} .

We apply the Lagrange multiplier method to solve the constrained optimization problem, and obtain

F = (U_{t+1} - R_1 P_t - L_x)(U_{t+1} - R_1 P_t - L_x)^T + (V_{t+1} - R_2 P_t - L_y)(V_{t+1} - R_2 P_t - L_y)^T + \lambda_1 (R_1 R_1^T + r_{13}^2 - 1) + \lambda_2 (R_2 R_2^T + r_{23}^2 - 1) + 2\lambda_3 (R_1 R_2^T + r_{13} r_{23}) ,   (7)

where $\lambda_1$, $\lambda_2$ and $\lambda_3$ are Lagrange multipliers. The partial derivatives of $F$ are taken with respect to the rotation and translation coefficients to yield

\partial F / \partial R_1 = (-U_{t+1} + R_1 P_t + L_x) P_t^T + \lambda_1 R_1 + \lambda_3 R_2 = 0 ,   \partial F / \partial R_2 = (-V_{t+1} + R_2 P_t + L_y) P_t^T + \lambda_2 R_2 + \lambda_3 R_1 = 0 ,   (8)

\partial F / \partial L_x = -U_{t+1} + R_1 P_t + L_x = 0 ,   \partial F / \partial L_y = -V_{t+1} + R_2 P_t + L_y = 0 .   (9)

From Eq. (9), we have

L = \begin{pmatrix} L_x \\ L_y \end{pmatrix} = \begin{pmatrix} U_{t+1} - R_1 P_t \\ V_{t+1} - R_2 P_t \end{pmatrix} .   (10)

Since there is only one translation vector for all of the feature points within a frame, the translation vector can be determined from the following equation once the rotation parameters are known [9]:

\begin{pmatrix} l_x \\ l_y \end{pmatrix} = \begin{pmatrix} \bar{u}_{t+1} \\ \bar{v}_{t+1} \end{pmatrix} - \begin{pmatrix} r_{11} & r_{12} \\ r_{21} & r_{22} \end{pmatrix} \begin{pmatrix} \bar{u}_t \\ \bar{v}_t \end{pmatrix} ,   (11)

where $\begin{pmatrix} \bar{u}_t \\ \bar{v}_t \end{pmatrix} = \begin{pmatrix} \sum_{j=1}^{M} u^j_t / M \\ \sum_{j=1}^{M} v^j_t / M \end{pmatrix}$ and $\begin{pmatrix} \bar{u}_{t+1} \\ \bar{v}_{t+1} \end{pmatrix} = \begin{pmatrix} \sum_{j=1}^{M} u^j_{t+1} / M \\ \sum_{j=1}^{M} v^j_{t+1} / M \end{pmatrix}$ are the means of the feature points from the two sequential images, respectively, and $M$ is the number of feature points.

Define $\bar{P}_{t+1} = \begin{pmatrix} \bar{U}_{t+1} \\ \bar{V}_{t+1} \end{pmatrix} = \begin{pmatrix} \bar{u}_{t+1} & \bar{u}_{t+1} & \ldots \\ \bar{v}_{t+1} & \bar{v}_{t+1} & \ldots \end{pmatrix}$ and $\bar{P}_t = \begin{pmatrix} \bar{U}_t \\ \bar{V}_t \end{pmatrix} = \begin{pmatrix} \bar{u}_t & \bar{u}_t & \ldots \\ \bar{v}_t & \bar{v}_t & \ldots \end{pmatrix}$. Then, Eq. (10) is rewritten as

L = \begin{pmatrix} L_x \\ L_y \end{pmatrix} = \begin{pmatrix} \bar{U}_{t+1} - R_1 \bar{P}_t \\ \bar{V}_{t+1} - R_2 \bar{P}_t \end{pmatrix} .   (12)

By substituting $L$ in Eq. (8), we obtain

-(U_{t+1} - \bar{U}_{t+1}) P_t^T + R_1 (P_t - \bar{P}_t) P_t^T + \lambda_1 R_1 + \lambda_3 R_2 = 0 ,
-(V_{t+1} - \bar{V}_{t+1}) P_t^T + R_2 (P_t - \bar{P}_t) P_t^T + \lambda_2 R_2 + \lambda_3 R_1 = 0 .   (13)

Denote $P'_{t+1} = P_{t+1} - \bar{P}_{t+1}$, $P'_t = P_t - \bar{P}_t$, $A = P'_t P_t^T$ and $B = P'_{t+1} P_t^T$. Then, Eq. (13) can be simplified as

-U'_{t+1} P_t^T + R_1 A + \lambda_1 R_1 + \lambda_3 R_2 = 0 ,   -V'_{t+1} P_t^T + R_2 A + \lambda_3 R_1 + \lambda_2 R_2 = 0 ,   (14)

which can be further combined into a matrix equation:

-B + R_{2\times2} A + \Lambda R_{2\times2} = 0 ,   (15)

where $\Lambda = \begin{pmatrix} \lambda_1 & \lambda_3 \\ \lambda_3 & \lambda_2 \end{pmatrix}$ is the Lagrange multiplier matrix, whose elements were introduced in Eq. (7). Eq. (15) is known as Sylvester's equation. It is also called Lyapunov's equation in the special case where $\Lambda = A^T$. Once Sylvester's equation has been solved, the rotation angles between successive frames are obtained. If we assume that we begin with a known pose (for example, the frontal view of the face), then the estimate of the current pose is acquired by accumulating all of the past pose results. The initial pose can be determined by using various methods, including template-matching techniques.
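The quantities entering Eqs. (13)-(15) can be assembled directly from the matched points. A minimal sketch, assuming the matched points are stacked column-wise as 2 x M arrays in the same layout as P_t and P_{t+1} above (function names are illustrative):

```python
import numpy as np

def sylvester_terms(P_t, P_t1):
    """Build A and B of Eq. (15) from 2xM arrays of matched image points.

    P_t[:, j]  = (u_t^j, v_t^j),  P_t1[:, j] = (u_{t+1}^j, v_{t+1}^j).
    """
    P_t_bar = P_t.mean(axis=1, keepdims=True)    # column of feature means at t
    P_t1_bar = P_t1.mean(axis=1, keepdims=True)  # column of feature means at t+1
    A = (P_t - P_t_bar) @ P_t.T                  # A = P'_t P_t^T      (2x2)
    B = (P_t1 - P_t1_bar) @ P_t.T                # B = P'_{t+1} P_t^T  (2x2)
    return A, B

def translation(R2, P_t, P_t1):
    """Recover (l_x, l_y) from Eq. (11) once the 2x2 rotation block R2 is known."""
    return P_t1.mean(axis=1) - R2 @ P_t.mean(axis=1)
```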


E. Kronecker Product Approach

Sylvester's equation can be solved with the Kronecker product algorithm, the Bartels-Stewart method, the Hessenberg-Schur approach, subspace techniques, or other numerical methods. An example solution with the Kronecker product and the vec operator is given by [19]

vec(R_{2\times2}) = (I \otimes \Lambda + A \otimes I)^{-1} vec(B) .   (16)
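A sketch of the Kronecker/vec solution, assuming the matrix Λ has already been fixed (Section F below discusses how it is estimated); the comment spells out the column-major vec identity used here:

```python
import numpy as np

def solve_R(A, B, Lam):
    """Solve -B + R A + Lam R = 0 for the 2x2 block R (Eq. (15))."""
    I = np.eye(2)
    # Column-major vec identity: vec(Lam R + R A) = (kron(I, Lam) + kron(A.T, I)) vec(R).
    M = np.kron(I, Lam) + np.kron(A.T, I)
    vec_R = np.linalg.solve(M, B.flatten(order='F'))
    return vec_R.reshape(2, 2, order='F')

# Equivalently, SciPy's built-in solver can be used:
# R = scipy.linalg.solve_sylvester(Lam, A, B)   # solves Lam @ R + R @ A = B
```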

F. Lagrange Multipliers

In our case, the challenge in solving Sylvester's equation is that the matrix $\Lambda$ is unknown. In our experiments, we first determine an approximation of the matrix $R_{2\times2}$. For example, we can obtain an initial estimate of $\Lambda$ under the assumption that $R_{2\times2}$ is an identity matrix (i.e., all three rotation angles are zero):

\Lambda_0 = B - A = P'_{t+1} P_t^T - P'_t P_t^T .   (17)

This initial estimate is reasonable since we only focus on the rotation between two successive images at a time, and these angles are generally close to 0. Then, we can obtain a symmetric $\Lambda_1$ by replacing $\Lambda_0(1,2)$ and $\Lambda_0(2,1)$ with $(\Lambda_0(1,2) + \Lambda_0(2,1))/2$, yielding the first evaluation. We must also iteratively compute $\Lambda_i$ to ensure that the constraint $R_1 R_2^T \pm \sqrt{(1 - R_1 R_1^T)(1 - R_2 R_2^T)} = 0$ is satisfied. The choice of $+$ or $-$ is decided by the direction of the rotation.

G. Sylvester's Equation and SVD

Haralick et al. [9] obtained the same equation as Eq. (15) for the 3D-3D case:

-B_{3\times3} + R_{3\times3} A_{3\times3} + \Lambda_{3\times3} R_{3\times3} = 0 ,   (18)

where $R_{3\times3}$ is the $3\times3$ rotation matrix, $\Lambda_{3\times3} = \begin{pmatrix} \lambda_1 & \lambda_4 & \lambda_5 \\ \lambda_4 & \lambda_2 & \lambda_6 \\ \lambda_5 & \lambda_6 & \lambda_3 \end{pmatrix}$, $B_{3\times3} = Q'_{t+1} Q'^T_t$ and $A_{3\times3} = Q'_t Q'^T_t$. Here we use $Q'_t$ and $Q'_{t+1}$ to denote the corresponding 3D feature points minus their mean. We can determine $\Lambda_{3\times3}$ by right-multiplying by $R^T_{3\times3}$ to obtain

\Lambda_{3\times3} = B_{3\times3} R^T_{3\times3} - R_{3\times3} A_{3\times3} R^T_{3\times3} .   (19)

Because both $\Lambda_{3\times3}$ and $R_{3\times3} A_{3\times3} R^T_{3\times3}$ are symmetric, we have

B_{3\times3} R^T_{3\times3} = (B_{3\times3} R^T_{3\times3})^T = R_{3\times3} B^T_{3\times3} .   (20)

As suggested in [9], we can apply SVD to obtain

B_{3\times3} = U S V^T ,   (21)

where $U$ and $V$ are unitary and $S$ is a diagonal matrix. Finally, we can solve $R_{3\times3}$ from Eq. (20) and Eq. (21):

R_{3\times3} = U V^T .   (22)

We can see that SEA (the Sylvester's equation algorithm) is identical to the SVD approach in the 3D-3D case, with or without noise. Providing that we have $R R^T = I$, this conclusion can be further extended from 3D to any dimension.
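For the 3D-3D case of Eqs. (18)-(22), a hedged numpy sketch of the SVD-based recovery; the determinant check at the end is a common practical safeguard against reflections and is not part of the derivation above:

```python
import numpy as np

def rotation_3d3d(Q_t, Q_t1):
    """Recover R from 3xM arrays of corresponding 3D points via Eqs. (21)-(22)."""
    Qc_t = Q_t - Q_t.mean(axis=1, keepdims=True)     # Q'_t : centered points at t
    Qc_t1 = Q_t1 - Q_t1.mean(axis=1, keepdims=True)  # Q'_{t+1}
    B = Qc_t1 @ Qc_t.T                               # B = Q'_{t+1} Q'_t^T
    U, S, Vt = np.linalg.svd(B)
    R = U @ Vt                                       # Eq. (22): R = U V^T
    if np.linalg.det(R) < 0:                         # guard against a reflection
        U[:, -1] *= -1
        R = U @ Vt
    return R
```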

However, in our current 2D case, the orthonormality constraints have changed, i.e., $R_{2\times2} R_{2\times2}^T = \begin{pmatrix} 1 - r_{13}^2 & -r_{13} r_{23} \\ -r_{13} r_{23} & 1 - r_{23}^2 \end{pmatrix}$, which in general is not the identity matrix. Therefore, SVD approaches do not lead to a solution to this problem. Nonetheless, a solution can be computed by using other direct or numerical methods.

IV. POSTPROCESSING: REFINED SEQUENTIAL POSE ESTIMATION

Our system estimates the pose by computing the state update between frames, and the current pose is obtained by accumulating all of the past pose results. There could be accumulation errors with this mechanism. Particle filtering can avoid large shifts in this process and provide smooth estimates of the transform parameters from ABM and Sylvester's equation.

A. Particle Filtering

Particle filtering methods propagate the probability density function used to characterize objects in video sequences based on Bayes' rule:

p(x_k | z_{1:k}) \propto p(z_k | x_k) p(x_k | z_{1:k-1}) ,   (23)

where

p(x_k | z_{1:k-1}) = \int p(x_k | x_{k-1}) p(x_{k-1} | z_{1:k-1}) \, dx_{k-1} ,   (24)

and $x$ and $z$ are used to denote the state and observed data, respectively. For our application, a seven-parameter hyper-ellipse is used to define the state vector:

x = (O_x, O_y, a, b, r_x, r_y, r_z) ,   (25)

where $(O_x, O_y)$ represent the coordinates of the center of the ellipse, $(a, b)$ denote the short and long radii of the ellipse, and $(r_x, r_y, r_z)$ represent the rotation angles around the x, y and z-axis, respectively. Notice that the change in the center's coordinates corresponds to the translation $T$, and the scale of the object in the image plane can be ascertained from the radii. Moreover, we rely on Euler's rotation theorem, $R_{3\times3} = R_x R_y R_z$, where $R_x$, $R_y$ and $R_z$ are the rotation matrices corresponding to the angles $r_x$, $r_y$ and $r_z$, respectively. In particular, $r_z$ is the rotation angle in the image plane.

Particle filters use Monte Carlo sampling to recursively estimate the posterior probability density function (pdf) $p(x_k | z_{1:k})$ by a set of random particles $x^i_k$ with associated weights $\omega^i_k$, $(i = 1, 2, \ldots, N)$, where $N$ is the number of particles and $k$ is the time step. In our case, each time step represents one frame of the video sequence. The particles $x^i \sim q(x)$ are randomly drawn from a proposal importance density $q(\cdot)$. Consequently, an approximation is given by

p(x_k | z_{1:k}) \approx \sum_{i=1}^{N} \omega^i_k \, \delta(x_k - x^i_k) .   (26)

The unnormalized weights are chosen using the principle of importance sampling as

\omega^i_k \propto \omega^i_{k-1} \frac{p(z_k | x^i_k) \, p(x^i_k | x^i_{k-1})}{q(x^i_k | x^i_{k-1}, z_k)} .   (27)

Once the weight for each particle has been evaluated, it is normalized, and the true state is estimated by a discrete weighted approximation:

\hat{x}_k = \sum_{i=1}^{N} \omega^i_k x^i_k .   (28)
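A schematic of the weighting and estimation steps in Eqs. (26)-(28); the proposal, likelihood, and transition densities are placeholders standing in for the specific choices developed in the following subsections:

```python
import numpy as np

def filter_step(particles, weights, z_k, propose, likelihood, transition, proposal_pdf):
    """One particle-filter update: propagate, reweight (Eq. (27)), normalize, estimate (Eq. (28))."""
    new_particles = np.array([propose(x) for x in particles])
    # Importance weights: w_k^i proportional to w_{k-1}^i * p(z|x) p(x|x_prev) / q(x|x_prev, z)
    w = np.array([w_prev * likelihood(z_k, x) * transition(x, x_prev) / proposal_pdf(x, x_prev, z_k)
                  for w_prev, x, x_prev in zip(weights, new_particles, particles)])
    w /= w.sum()                                        # normalization
    x_hat = (w[:, None] * new_particles).sum(axis=0)    # Eq. (28): weighted mean state
    return new_particles, w, x_hat
```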

Obviously, when $q(\cdot)$ is close to the true posterior, the particles are more effective. The idea is then to put more particles in areas where the posterior may have higher density, to avoid useless particles that lead to more expensive computation.

B. Importance Sampling

We adopt the approach used in motion-based particle filters for selecting the importance density [13], and extend this approach by incorporating the solution of Sylvester's equation into this framework. Several advantages emerge from this choice of importance function. First, we economize on processing time by searching only the state space around the estimated position. Second, the tracker is adaptive to all kinds of motion, even with poor a priori knowledge of the object's dynamics. The online motion estimation of the object does not make any assumption about the underlying motion of the object.

As shown in Fig. 2, the translation component of the motion vector is given by the difference between the position of the center of the ellipse newly fitted by ABM and the position of the center of the previous mean-state ellipse, and the rotation component of the motion vector results directly from Sylvester's equation. Consequently, the motion vector is defined as

\Delta x_k = [x_c(k) - x_c(k-1), \; y_c(k) - y_c(k-1), \; 0, \; 0, \; r_x(k), \; r_y(k), \; r_z(k)] ,   (29)

where $(x_c(k), y_c(k))$ is the center of the ellipse at time index $k$, and $(r_x(k), r_y(k), r_z(k))$ is the pose update. Then we have the following sampling scheme:

x^i_k = x^i_{k-1} + \Delta x_k + \upsilon^i_k ,   \upsilon_k \sim N(0, \Sigma_G) .   (30)

The importance function is then a sum of functions:

q(x_k | x_{k-1}, z_k) = \sum_{i=1}^{N} N_{x_k}(x^i_{k-1} + \Delta x_k, \Sigma_G) ,   (31)

where $N(\mu, \Sigma)$ is a seven-dimensional Gaussian density function:

N(\mu, \Sigma) = \frac{1}{\sqrt{(2\pi)^7 |\Sigma|}} \exp\left[-\frac{1}{2}(x - \mu)^T \Sigma^{-1} (x - \mu)\right] .   (32)

In Eq. (30), the covariance $\Sigma_G$ can be adjusted according to the motion. For simplicity, the covariance matrix $\Sigma_G$ remains fixed during the entire video sequence in our implementation of the joint tracking and pose estimation using particle filtering.
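A sketch of the proposal step in Eqs. (29)-(30): each particle is shifted by the motion vector obtained from ABM and Sylvester's equation and then perturbed by Gaussian noise. The state layout follows Eq. (25); names and defaults are illustrative:

```python
import numpy as np

def propose_particles(particles, center_new, center_old, pose_update, sigma_G, rng=None):
    """Shift particles by the motion vector of Eq. (29) and add Gaussian noise (Eq. (30))."""
    rng = np.random.default_rng() if rng is None else rng
    dx = center_new[0] - center_old[0]
    dy = center_new[1] - center_old[1]
    rx, ry, rz = pose_update                  # rotation update from Sylvester's equation
    delta = np.array([dx, dy, 0.0, 0.0, rx, ry, rz])
    noise = rng.multivariate_normal(np.zeros(7), sigma_G, size=len(particles))
    return particles + delta + noise          # x_k^i = x_{k-1}^i + delta_x_k + noise
```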

C. System Dynamics

The system dynamics is modelled by a discrete-time random-walk process, reflecting the absence of prior knowledge of the object's dynamics. The state equation is given by $x_k = x_{k-1} + v_k$, where $v_k \sim N(0, \Sigma_P)$ is a seven-dimensional Gaussian noise process. Therefore, we have the transition probability

p(x_k | x_{k-1}) = N_{x_k}(x_{k-1}, \Sigma_P) .   (33)

D. Likelihood Function

The likelihood function $p(z_k | x^i_k)$ acts as a measure of how close a particle is to the true state. In this paper, we propose the overall likelihood function as a product of three independent functions:

p(z_k | x^i_k) = p^{(i)}_{color} \times p^{(i)}_{gradient} \times p^{(i)}_{feature} ,   (34)

where $p^{(i)}_{color}$, $p^{(i)}_{gradient}$ and $p^{(i)}_{feature}$ denote the likelihoods based on the color, gradient and features of the i-th sample, respectively. Both the color and gradient relate to the tracking parameters, whereas the features are associated with the pose parameters.

E. Color Module

The color model proposed in [20] is used in this paper. Assume that the color distributions are discretized into $m$ bins, and the histograms are produced with the function $h(g_i)$, which assigns the color at location $g_i$ to the corresponding bin. Here, the histograms are computed in RGB space using $8\times8\times8$ bins. Let us use $q_k = \{q^i_k\}_{i=1,\ldots,n}$ and $p_k = \{p^i_k\}_{i=1,\ldots,n}$ to denote the normalized model histogram and normalized particle histogram at time $k$, respectively. The model histogram serves as an estimate of the true histogram and is used to measure the particle weight. A popular measure between two distributions $p$ and $q$ is provided by the Bhattacharyya coefficient:

\rho[p, q] = \sum_{i=1}^{n} \sqrt{p^{(i)} q^{(i)}} .   (35)

The larger $\rho$ is, the more similar the two distributions are. As a distance between two distributions, we define the measure

d = \sqrt{1 - \rho[p, q]} .   (36)

Thus, small distances correspond to large weights given by

p^{(i)}_{color} = \frac{1}{\sqrt{2\pi}\sigma_c} e^{-\frac{d^2}{2\sigma_c^2}} = \frac{1}{\sqrt{2\pi}\sigma_c} e^{-\frac{1 - \rho[p,q]}{2\sigma_c^2}} ,   (37)

which are specified by a Gaussian with variance $\sigma_c$.
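A sketch of the color likelihood of Eqs. (35)-(37), assuming normalized 8 x 8 x 8 RGB histograms as described above; the value of sigma_c is an illustrative placeholder:

```python
import numpy as np

def rgb_histogram(patch, bins=8):
    """Normalized 8x8x8 RGB histogram of an image patch (H x W x 3, values 0..255)."""
    hist, _ = np.histogramdd(patch.reshape(-1, 3), bins=(bins,) * 3, range=[(0, 256)] * 3)
    return hist.ravel() / hist.sum()

def color_likelihood(hist_model, hist_particle, sigma_c=0.2):
    """Color likelihood of Eq. (37) from two normalized histograms."""
    rho = np.sum(np.sqrt(hist_model * hist_particle))   # Bhattacharyya coefficient, Eq. (35)
    return np.exp(-(1.0 - rho) / (2.0 * sigma_c ** 2)) / (np.sqrt(2.0 * np.pi) * sigma_c)
```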

F. Gradient Module

We use the model developed in [21]. The observations are collected on lines normal to the contour. Let $Z_{k,\varphi}$ be the edge-detection observations on line $\varphi$ at time $k$, $Z_{k,\varphi} = (Z_1, Z_2, \ldots, Z_I)$, where $I$ is the number of edges detected. Because of background clutter, there can be multiple edges along each normal line; at most one of the $I$ edges is the true contour point. Let $p_0$ be the prior probability that none of the $I$ edges is the true contour edge. With the assumptions introduced in [22] that the clutter is a Poisson process along the line with spatial density $\gamma$ and that the true target measurement is normally distributed with standard deviation $\sigma_z$, we obtain the edge likelihood model as follows [13]:

p(Z_{k,\varphi} | \lambda_\varphi) \propto \frac{1}{\sqrt{2\pi}\sigma_z p_0 \gamma} \exp\left(-\frac{(\min_j (Z_j - \lambda_\varphi))^2}{2\sigma_z^2}\right) ,   (38)

where $\lambda_\varphi$ represents the pixel along the normal line $\varphi$ belonging to the sample ellipse. By assuming independence between the $L$ different normal lines, we have the following overall likelihood function for sample $i$:

p^{(i)}_{gradient} = \prod_{\varphi=1}^{L} p(Z_{k,\varphi} | \lambda_\varphi) .   (39)
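A hedged sketch of the edge likelihood of Eqs. (38)-(39), taking as input the edge positions already detected along each normal line (the edge detection itself and the parameter values are outside the scope of this sketch):

```python
import numpy as np

def gradient_likelihood(edges_per_line, contour_per_line, sigma_z=1.0, p0=0.1, gamma=0.05):
    """Product over normal lines of the single-line edge likelihood, Eqs. (38)-(39).

    edges_per_line[l]   : 1-D array of edge positions Z_1..Z_I detected on line l
    contour_per_line[l] : position lambda_l of the sample ellipse on line l
    """
    likelihood = 1.0
    for Z, lam in zip(edges_per_line, contour_per_line):
        d2 = np.min((np.asarray(Z) - lam) ** 2)   # squared distance to the nearest edge
        p_line = np.exp(-d2 / (2.0 * sigma_z ** 2)) / (np.sqrt(2.0 * np.pi) * sigma_z * p0 * gamma)
        likelihood *= p_line
    return likelihood
```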

G. Feature Module

Let $R^{(i)}_{2\times2E}$, $(i = 1, 2, \ldots)$, be the transform matrix of the i-th particle, and let the translation parameters $l^{(i)}_E$ be obtained according to Eq. (11). We can now compute $p^{(i)}_{feature}$ by first projecting the feature points in the frame at time $t$ into frame $t + 1$. The projected points $(u^{(i)j}_{t+1E}, v^{(i)j}_{t+1E})$ are given by

\begin{pmatrix} u^{(i)j}_{t+1E} \\ v^{(i)j}_{t+1E} \end{pmatrix} = R^{(i)}_{2\times2E} \begin{pmatrix} u^j_t \\ v^j_t \end{pmatrix} + \begin{pmatrix} l^{(i)}_{xE} \\ l^{(i)}_{yE} \end{pmatrix} ,   (40)

where $j$ is the index of feature points. Because there are usually many pairs of feature points extracted with SIFT, we cannot guarantee perfect matching between all of the corresponding feature points, i.e., $u^{(i)j}_{t+1E} = u^j_{t+1}$ and $v^{(i)j}_{t+1E} = v^j_{t+1}$ for each $j$. Clearly, small distances should be associated with large weights. Therefore, the likelihood for the feature model is computed by

p^{(i)}_{feature} = \sum_j \frac{1}{\sqrt{(u^j_{t+1} - u^{(i)j}_{t+1E})^2 + (v^j_{t+1} - v^{(i)j}_{t+1E})^2}} .   (41)

H. Initialization

To initialize our system, we must begin with a known pose and position of the target in the first image. This can be done automatically or manually. In our simulations, we mark points manually around the object. Our system subsequently produces an ellipse from these points by using the least-squares method [23]. After initialization, particles are generated from a predefined pdf. The choice of the pdf depends on the movement characteristics of the target (e.g., fast versus slow moving objects).
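For the initialization step, a sketch using OpenCV's least-squares ellipse fit (cv2.fitEllipse, in the spirit of the direct least-squares method of [23]); the manually marked points are assumed to be given as an N x 2 array, and the known initial pose is passed in:

```python
import cv2
import numpy as np

def initial_state(marked_points, pose0=(0.0, 0.0, 0.0)):
    """Build the initial state vector of Eq. (25) from manually marked boundary points."""
    pts = np.asarray(marked_points, dtype=np.float32)
    (ox, oy), (d1, d2), angle = cv2.fitEllipse(pts)   # least-squares ellipse fit
    a, b = sorted((d1 / 2.0, d2 / 2.0))               # short and long radii
    rx, ry, rz = pose0                                 # known initial pose (e.g. frontal view)
    return np.array([ox, oy, a, b, rx, ry, rz])
```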

V. EXPERIMENTAL RESULTS

We demonstrate our algorithm on different image sequences with preliminary results. We have used 200 particles per frame in the following simulations. We compare our proposed approach to pose estimation based on SEA with several well-known methods for pose estimation. Specifically, we implemented a method for pose estimation based on Template Matching (TM) [2]. We have also used the Pointing Database [24] to construct the head pose model, where each pose is trained with all 15 samples. However, our implementation of the TM method differs from the original algorithm presented in [2]. Specifically, in order to make the head model robust to background clutter, the original algorithm learned a face skin model from the training data, and we have not provided this model in our implementation of the TM method. For simplicity, we have implemented and tested the TM method on real videos corresponding to tilt and pan independently. Finally, we also implemented an approach to pose estimation using the essential matrix (EM) method [25].

A. Synthetic Points with Ground Truth

We first demonstrate the noise robustness of our algorithm with synthetic points; the results are shown in Fig. 3. Here we use 20 points and run the experiment 5000 times. Since the evaluation of the Lagrange multiplier matrix $\Lambda$ is crucial, we first evaluate $\Lambda$ by perturbing the matrix with noisy data. We use the SNR to measure the error introduced in the matrix $\Lambda$, and the effect is evaluated by calculating the error of the rotation angles computed from the perturbed matrix. We fix the SNR of the input points at 10 dB for this experiment. As shown in Fig. 3(a), the result is promising. For instance, the error in the rotation angles is about 5 degrees when the SNR of $\Lambda$ is 5 dB. The results under different noise levels for the input feature points are shown in Fig. 3(b). We also compare SEA with the Least-Squares Method (LSM). In the existing linear solutions, the orthogonality constraints on the rotation matrix are weakly enforced or not enforced at all [6]. However, it is known that the robustness of a pose estimation algorithm can be improved by adding these constraints. LSM is a method that does not consider the constraints, and it deviates greatly when a high noise level is present.

B. Large Angle Rotations

Fig. 2. The system diagram.

The Girl sequence has 318 frames with a resolution of 720 × 576 pixels and a frame rate of 25 frames per second (fps). To achieve a lower computational cost, we first resize the images to 128 × 96 pixels using bilinear interpolation, where the head region is about 40 × 40 pixels. In this video sequence, a girl turns to the right, left, up, and down, respectively, by almost 90 degrees in each direction. This video has a clear blue background and is uniformly illuminated, except for a few frames. This video sequence is used to demonstrate the robustness of SEA for pose estimation of large angles. In Fig. 4, the girl moves her head up and down in the vertical direction, and the curve in the middle is the false estimate in the horizontal direction with the proposed SEA. Also, as


Fig. 3. Pose estimation results on the synthetic data: (a) with different Lagrange multiplier evaluations; (b) with different input SNR.


Fig. 4. Pose estimation for the video of Girl Nodding: the red curve is the true pose; the green curve is the result from the Template Matching (TM) method [2]; and the blue curve is obtained based on our proposed Sylvester’s Equation (SE) algorithm. Also the blue curve in the middle is the false estimate in the horizontal direction with the proposed SE algorithm.

shown, this false estimation has been mostly suppressed. It is important to note that in our implementation of the TM approach [2], only the templates in the vertical direction are employed. Had we used templates in both the vertical and horizontal directions, the pose estimate would have degraded, while the computation of the TM algorithm would have increased substantially because more templates would need to be compared. The out-of-plane rotation is most useful in pose estimation, so the numbers representing the estimated rotation angles of pan and tilt, respectively, are listed in the tables under each figure, and these numbers are also depicted visually using the horizontal and vertical bars at the edge of the images. The scale of the object is identified by the size of the ellipse. Moreover, the orientation of the ellipse also indicates the in-plane rotation angle. As shown in Fig. 5, the ground truth data grows over time, corresponding to the rising of the head upward in the image, but the results from the TM method [2] falsely indicate flipping, which points to the sensitivity of the TM algorithm

Index   TRUE (degree)   EM (degree)   TM (degree)   SEA (degree)
156     (-28, 0)        (-52, -5)     (-32, 0)      (-28, 7)
157     (-33, 0)        (-50, -5)     (-20, 0)      (-34, 6)
158     (-38, 0)        (-45, -6)     (-30, 0)      (-40, 5)
159     (-42, 0)        (-43, -7)     (-43, 0)      (-43, 4)

Fig. 5. Pose estimation results of the video Girl Nodding: (a) the Essential Matrix (EM) method [25]; (b) the Template Matching (TM) method [2]; (c) the proposed Sylvester’s Equation Algorithm (SEA).

to tracking; the error can be extremely large compared to the ground truth data, while the SEA results are smoother because of the particle filtering. Although the TM algorithm [2] achieves tracking and pose estimation simultaneously with particle filtering, it is still sensitive to the tracking steps, because the template matching is applied based on the tracking. In comparison, the output of SEA is stable and not as sensitive to the tracking, because some incorrectly matched feature pairs have been discarded before calculating the pose. Some quantitative comparisons are shown in Table I. As can be seen from Fig. 5(a), the pose in Frame 156 is approximately 30 degrees. However, the EM method provides an estimate of over 50 degrees. In the remainder of the sequence, the girl's head rises increasingly higher. Yet, the


estimated angles provided by EM become increasingly smaller. The EM algorithm assumes that the camera has been calibrated and that all of the necessary parameters are known. Therefore, the focal length f is replaced with 1 in Eq. (4). In practice, these assumptions are violated, and consequently the EM estimate is inaccurate. SEA is the only pose estimation method whose estimate is close to the ground truth, while both EM and TM lead to large errors, as shown in the experiment. In particular, since the video sequence depicts a woman who raises her head upward slowly, the red bars associated with the pose estimate provided by SEA lengthen smoothly, as shown in Fig. 5(c); whereas the red bars corresponding to EM and TM are decreasing and discontinuous, as illustrated in Fig. 5(a) and (b), respectively.

C. Arbitrary Rotation with Long Sequences

TABLE II
COMPARISON OF SPEED FOR THE VIDEO Lab

Methods     RLS       TPE       SEA
Speed (s)   44.4348   17.0936   17.7656

SEA is also tested on long sequences. The Lab sequence has 2423 frames with a resolution of 640 × 480 pixels and a frame rate of 30 fps. The images are first resized to 160 × 120 pixels using bilinear interpolation for a lower computational cost. A man randomly rotates his head, including some large angles and combined angles, at changing speed. This video is taken in a room environment with full illumination, and this experiment is used to illustrate the performance of SEA over long running times. Here SEA is compared to several methods. First, an algorithm based on RANSAC followed by Least Squares (RLS) is considered. As mentioned in Section III, some falsely matched points are detected by their dissimilar length and slope in SEA. In RLS, however, RANSAC is applied to remove feature-point outliers, and LSM is subsequently employed to estimate the rotation matrix. Second, an approach which performs tracking followed by pose estimation (TPE), with the same coefficients as in SEA, is also evaluated. As shown in Fig. 6, all three systems are able to recognize the combined rotations correctly and reveal pan and tilt separately in the initial frames. Although the object is correctly identified with RLS using our proposed tracking approach, the pose estimation results become incorrect after 450 frames, while TPE and SEA maintain an accurate estimate throughout the entire sequence. Since TPE also follows the true values closely, this indicates that our proposed method is not sensitive to the tracking results. As explained earlier, some false corresponding feature pairs are eliminated prior to pose estimation. The proposed method provides an accurate pose estimate when most feature pairs have been correctly identified. A comparison of the computation time for each frame is provided in Table II. In particular, the man also made some facial expressions on purpose. These expressions could involve the movement of

most feature points, resulting in fluctuations of the estimation results with all three methods. But since such motion of the feature points is generally slight, the fluctuation in the output is acceptable. In addition, the estimated results also return to the position before the expressions once these behaviors stop. Note that the high performance of TPE is due to the novel use of motion-based particle filtering for tracking and the proposed solution to pose estimation based on Sylvester's equation introduced in this paper. Moreover, while the performance of the SEA algorithm is superior to TPE, the increase in the computational load of SEA compared to TPE is nearly negligible, as shown in Table II.

D. Translation and Depth Change

SEA is then tested when there are translation and depth changes. The Car sequence has 230 frames with a resolution of 320 × 240 pixels and a frame rate of 30 fps. The size of the car in the video is quite small, and the environment is somewhat dark. It is even hard for human eyes to recognize the car and its pose at first sight, but this has no effect on SEA, since it is based on SIFT features, which are robust to illumination conditions. In the video, a car slows down at a crossing, waits for the signals, and then changes its direction. This test sequence exhibits complicated position and pose changes, including pure translation, translation with turning, depth change and occlusion. In Fig. 7, the first two pictures indicate that the translation does not influence SEA since, before calculating the pose, all of the features are shifted by subtracting their mean. Moreover, there are other cars behind our target, but the tracker is not attracted by them, due to the use of the color module and gradient module. In the middle frame, the car is partly occluded by a pole, but the results after the particle filter in our system are robust to the occlusion. Furthermore, the car is moving and turning, and SEA can also distinguish translation and rotation properly. There is depth change in the last two images, and depth change could cause problems for our system. We regard $z^j_t / z^j_{t+1} = 1$ in Section III-C on the assumption that the depth change is small and motion in this direction is slow. In this experiment, this assumption is valid because the speed of vehicles is usually slow at crossings. However, if the car moves quickly toward or away from the camera, $z^j_t / z^j_{t+1}$ should be adjusted according to the size change of the object. Since minor depth changes are reflected by very slight scale modifications in the projected video sequence, whereas horizontal and vertical movements translate into well-defined spatial changes in the captured image sequence, the object in one of the two successive frames (the t-th or the (t+1)-th) should be properly enlarged to eliminate the impact of the depth change. Details pertaining to scale change estimation for the tracked object in the video sequence are described in the presentation of the ABM algorithm in [13].

VI. CONCLUSION

In this paper, we introduced a method to estimate tracking and pose parameters within the framework of particle filtering.


TABLE I
COMPARISON OF RESULTS FOR THE VIDEO Girl

Methods                         Template Matching   Sylvester's Equation
Root Mean-Square Error (RMSE)   13.35               6.01
RMSE in the other direction     not applicable      6.76
SNR (dB)                        8.9199              16.4713
Speed (second/frame)            18.6703             15.9099

Index   RLS (degree)   TPE (degree)   SEA (degree)
372     (-15, 46)      (-19, 44)      (-16, 45)
820     (55, 52)       (15, 71)       (10, 66)
1099    (87, 13)       (24, -30)      (22, -43)
1336    (79, 13)       (37, 35)       (32, 34)
1593    (83, -1)       (28, 34)       (26, 49)
1948    (74, 5)        (10, 5)        (10, 6)
2054    (80, 3)        (31, 6)        (34, 10)
2421    (-29, 17)      (10, 6)        (12, 7)

Fig. 6. Pose estimation results of the video Lab: (a) the RANSAC and Least Squares (RLS) method; (b) the tracking followed by pose estimation (TPE) method; (c) the proposed Sylvester’s Equation algorithm (SEA).

Index                  47       109      173       195       218
SEA (degree, degree)   (0, 1)   (-1, 3)  (2, -18)  (5, -46)  (9, -58)

Fig. 7. Pose estimation results of a moving car.

Our approach can be used to directly estimate the 3D rotation parameters from 2D image sequences without constructing a 3D model or performing system training and learning prior to estimation. Corresponding SIFT points are used as features for pose estimation from successive images. We demonstrate that the pose estimation problem can be formulated as a solution to Sylvester's equation. Moreover, we incorporate our approach in a particle filtering framework for both video tracking and pose estimation. The resulting algorithm provides a smooth and efficient estimation of the affine motion parameters (location coordinates and rotation angles). An error analysis demonstrating the robustness of the proposed pose estimation method to noise degradation has been presented by Chen et al. [26]. In the future, we plan to incorporate the learning-based template matching approach into our system to recover

from accumulation errors. We also plan to extend the proposed framework by relying on centralized and distributed solutions to Sylvester’s equation for pose estimation from multiple camera views.

REFERENCES

[1] B. Braathen, M. S. Bartlett, G. L. Ford, and J. Movellan, “3-D head pose estimation from video by nonlinear stochastic particle filtering,” in Proc. 8th Joint Symposium on Neural Computation, May 2001.
[2] S. O. Ba and J. M. Odobez, “Evaluation of multiple cues head pose tracking algorithm in indoor environments,” in Proc. International Conference on Multimedia and Expo, Amsterdam, 2005.
[3] S. Ohayon and E. Rivlin, “Robust 3D head tracking using camera pose estimation,” in Proc. 18th International Conference on Pattern Recognition, August 2006, pp. 1063–1066.


[4] F. Dornaika and F. Davoine, “Head and facial animation tracking using appearance-adaptive models and particle filters,” in Proc. Conference on Computer Vision and Pattern Recognition Workshop, 2004, vol. 10, pp. 153–162.
[5] W. S. Zheng, J. H. Lai, and P. C. Yuen, “Weakly supervised learning on pre-image problem in kernel methods,” in Proc. 18th International Conference on Pattern Recognition, 2006, vol. 2, pp. 711–715.
[6] Q. Ji, M. S. Costa, R. M. Haralick, and L. G. Shapiro, “A robust linear least-squares estimation of camera exterior orientation using multiple geometric features,” ISPRS Journal of Photogrammetry and Remote Sensing, vol. 52, no. 2, pp. 75–93, 2000.
[7] L. Quan and Z. Lan, “Linear n-point camera pose determination,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 21, no. 8, pp. 774–780, 1999.
[8] N. Andreff, R. Horaud, and B. Espiau, “Robot hand-eye calibration using structure-from-motion,” International Journal of Robotics Research, vol. 20, no. 3, pp. 228–248, 2001.
[9] R. M. Haralick, H. Joo, C.-N. Lee, X. Zhuang, V. G. Vaidya, and M. B. Kim, “Pose estimation from corresponding point data,” IEEE Transactions on Systems, Man and Cybernetics, vol. 19, no. 6, pp. 698–700, 1989.
[10] B. K. P. Horn, H. M. Hilden, and S. Negahdaripour, “Closed-form solution of absolute orientation using orthonormal matrices,” Journal of the Optical Society of America A, vol. 5, pp. 1127–1135, 1988.
[11] C. Olsson, F. Kahl, and M. Oskarsson, “The registration problem revisited: Optimal solutions from points, lines and planes,” in IEEE Computer Society Conference on Computer Vision and Pattern Recognition, 2006, pp. 1206–1213.
[12] D. G. Lowe, “Distinctive image features from scale-invariant keypoints,” International Journal of Computer Vision, vol. 60, pp. 91–110, 2004.
[13] N. Bouaynaya and D. Schonfeld, “Complete system for head tracking using motion-based particle filter and randomly perturbed active contour,” in Proc. SPIE, Image and Video Communications and Processing, 2005, vol. 5685, pp. 864–873.
[14] E. Delponte, F. Isgrò, F. Odone, and A. Verri, “SVD-matching using SIFT features,” Special Issue on the Vision, Video and Graphics Conference, vol. 68, no. 5, pp. 415–431, 2006.
[15] M. A. Fischler and R. C. Bolles, “Random sample consensus: A paradigm for model fitting with applications to image analysis and automated cartography,” Communications of the ACM, vol. 24, no. 6, pp. 381–395, 1981.
[16] T. Jebara, A. Azarbayejani, and A. Pentland, “3D structure from 2D motion,” IEEE Signal Processing Magazine, vol. 16, no. 3, pp. 66–84, 1999.
[17] R. Horaud, F. Dornaika, B. Lamiroy, and S. Christy, “Object pose: The link between weak perspective, paraperspective, and full perspective,” International Journal of Computer Vision, vol. 22, no. 2, pp. 173–189, 1997.
[18] H. Adams, S. Singh, and D. Strelow, “An empirical comparison of methods for image-based motion estimation,” in Proc. IEEE/RSJ International Conference on Intelligent Robots and Systems, 2002, vol. 1, pp. 123–128.
[19] C. F. Van Loan, “The ubiquitous Kronecker product,” Journal of Computational and Applied Mathematics, vol. 123, pp. 85–100, 2000.
[20] K. Nummiaro, E. Koller-Meier, and L. Van Gool, “An adaptive color-based particle filter,” Image and Vision Computing, vol. 21, no. 11, pp. 99–110, 2002.
[21] M. Isard and A. Blake, “Contour tracking by stochastic propagation of conditional density,” in Proc. European Conference on Computer Vision, May 1996, pp. 343–356.
[22] M. Isard and A. Blake, “Condensation – conditional density propagation for visual tracking,” International Journal of Computer Vision, vol. 29, no. 1, pp. 5–28, 1998.
[23] A. W. Fitzgibbon, M. Pilu, and R. B. Fisher, “Direct least squares fitting of ellipses,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 21, pp. 476–480, 1999.
[24] N. Gourier, D. Hall, and J. L. Crowley, “Estimating face orientation from robust detection of salient facial features,” in Proc. Pointing 2004, ICPR, International Workshop on Visual Observation of Deictic Gestures, 2004.
[25] X. Zhuang, T. S. Huang, and R. M. Haralick, “Two-view motion analysis: a unified algorithm,” Journal of the Optical Society of America A: Optics, Image Science, and Vision, vol. 3, no. 9, pp. 1492–1500, 1986.
[26] C. Chen, D. Schonfeld, J. Yang, and M. Mohamed, “Pose estimation from video sequences based on Sylvester's equation,” in SPIE Proceedings of Electronic Imaging: Science and Technology, Conference on Visual Communications and Image Processing, San Jose, California, 2007.

Chong Chen is currently a Ph.D. candidate working in the Multimedia Communications Laboratory, Department of Electrical and Computer Engineering, University of Illinois at Chicago. He received the B.E. degree in electrical engineering from Xi'an Jiaotong University, China, in 2002, and the M.S. degree in electrical and computer engineering from the Institute of Automation, Chinese Academy of Sciences, China, in 2005. His research interests are computer vision, image analysis and computer graphics.

Dan Schonfeld received the B.S. degree in Electrical Engineering and Computer Science from the University of California at Berkeley, and the M.S. and Ph.D. degrees in Electrical and Computer Engineering from The Johns Hopkins University, in 1986, 1988, and 1990, respectively. In 1990, he joined the University of Illinois at Chicago, where he is currently a Professor in the Departments of Electrical and Computer Engineering, Computer Science, and Bioengineering. Dr. Schonfeld has authored over 170 technical papers in various journals and conferences. He was co-author of papers that won the Best Student Paper Awards in Visual Communication and Image Processing 2006 and IEEE International Conference on Image Processing 2006 and 2007. Dr. Schonfeld has been elevated to the rank of IEEE Fellow ”for contributions to image and video analysis.” He is currently serving as Deputy Editor-in-Chief of the IEEE Transactions on Circuits and Systems for Video Technology and Special Sections Area Editor for the IEEE Signal Processing Magazine. His current research interests are in multi-dimensional signal processing; image and video analysis; computer vision; and genomic signal processing.