Int J Comput Vis DOI 10.1007/s11263-009-0267-4
Quasi-perspective Projection Model: Theory and Application to Structure and Motion Factorization from Uncalibrated Image Sequences

Guanghui Wang · Q.M. Jonathan Wu
Received: 1 December 2008 / Accepted: 29 June 2009 © Springer Science+Business Media, LLC 2009
Abstract This paper addresses the problem of factorization-based 3D reconstruction from uncalibrated image sequences. Previous studies on structure and motion factorization are based either on a simplified affine assumption or on general perspective projection. The affine approximation is widely adopted due to its simplicity, whereas the extension to the perspective model suffers from recovering the projective depths. To fill the gap between the simplicity of the affine model and the accuracy of the perspective model, we propose a quasi-perspective projection model for structure and motion recovery of rigid and nonrigid objects within the factorization framework. The novelty and contribution of this paper are as follows. First, under the assumption that the camera is far away from the object and undergoes small lateral rotations, we prove that the imaging process can be modeled by quasi-perspective projection, which is shown to be more accurate than the affine model by both geometric error analysis and experimental studies. Second, we apply the model to establish a framework for rigid and nonrigid factorization under the quasi-perspective assumption. Finally, we propose an Extended Cholesky Decomposition to recover the rotation part of the Euclidean upgrading matrix. We also prove that the last column of the upgrading matrix corresponds to a global scale and translation of the camera and may thus be set freely. The proposed method is validated and evaluated extensively on synthetic and real image sequences, and improved results over existing schemes are observed.

Keywords Structure from motion · Computational models of vision · Quasi-perspective projection · Imaging geometry · Matrix factorization · Singular value decomposition · Euclidean reconstruction

The work is supported in part by the Natural Sciences and Engineering Research Council of Canada, and the National Natural Science Foundation of China under Grant No. 60575015.

Electronic supplementary material The online version of this article (http://dx.doi.org/10.1007/s11263-009-0267-4) contains supplementary material, which is available to authorized users.

G. Wang (✉) · Q.M.J. Wu
Department of Electrical and Computer Engineering, University of Windsor, 401 Sunset, Windsor, N9B 3P4, Ontario, Canada
e-mail: [email protected]

Q.M.J. Wu
e-mail: [email protected]

G. Wang
National Laboratory of Pattern Recognition, Institute of Automation, Chinese Academy of Sciences, Beijing 100080, China
1 Introduction

The problem of structure and motion recovery from image sequences is an important theme in computer vision, and great progress has been made for different applications during the last two decades (Hartley and Zisserman 2004). Among these methods, the factorization-based approach is widely studied for its robustness and accuracy, since it deals uniformly with the data of all images (Poelman and Kanade 1997; Quan 1996; Tomasi and Kanade 1992; Triggs 1996). The factorization algorithm was first proposed by Tomasi and Kanade (1992) in the early 1990s. Its main idea is to factorize the tracking matrix into motion and structure matrices simultaneously by singular value decomposition (SVD) with a low-rank approximation. The algorithm assumes an orthographic projection model; it was extended to weak-perspective and paraperspective projection by Poelman and Kanade (1997). The orthographic,
weak-perspective, and paraperspective projections can be generalized as the affine camera model. More generally, Christy and Horaud (1996) extended the above methods to a perspective camera model by incrementally performing the factorization under the affine assumption; the method is an affine approximation to full perspective projection. Triggs (1996) and Sturm and Triggs (1996) proposed a full projective reconstruction method via rank-4 factorization of a scaled tracking matrix, with projective depths recovered from pairwise epipolar geometry. The method was further studied in Han and Kanade (2000), Heyden et al. (1999), and Mahamud and Hebert (2000), where different iterative schemes were proposed to recover the projective depths by minimizing reprojection errors. Recently, Oliensis and Hartley (2007) provided a complete theoretical convergence analysis of the iterative extensions. Unfortunately, no iteration has been shown to converge sensibly, and they proposed a simple extension, called CIESTA, to give a reliable initialization to other algorithms.

The above methods work only for rigid objects and static scenes. In the real world, however, many scenarios are nonrigid or dynamic, such as articulated motion, human faces carrying different expressions, lip movements, hand gestures, and moving vehicles. To deal with such situations, many extensions of the factorization algorithm were proposed to relax the rigidity constraint. Costeira and Kanade (1998) first discussed how to recover the motion and shape of several independently moving objects via factorization under orthographic projection. Bascle and Blake (1998) proposed a method for factorizing facial expressions and poses based on a set of preselected basis images. Recently, Li et al. (2007) proposed to segment multiple rigid-body motions from point correspondences via subspace separation. Yan and Pollefeys (2005, 2008) proposed a factorization-based approach to recover the structure and kinematic chain of articulated objects. The pioneering work of Bregler et al. (2000) demonstrated that the 3D shape of a nonrigid object can be expressed as a weighted linear combination of a set of shape bases; the shape bases and camera motions are then factorized simultaneously for all time instants under the rank constraint of the tracking matrix. Following this idea, the method was extensively investigated and developed by many researchers, such as Brand (2001, 2005), Del Bue et al. (2006, 2004), Torresani et al. (2008, 2001), Xiao et al. (2006), and Xiao and Kanade (2005). Recently, Rabaud and Belongie (2008) relaxed Bregler's assumption (2000) by assuming that only small neighborhoods of shapes are well-modeled by a linear subspace, and proposed a novel approach that solves the problem within a manifold-learning framework.

Most nonrigid factorization methods are based on the affine camera model due to its simplicity. The approach was extended to perspective projection in Xiao and Kanade (2005) by iteratively recovering the projective depths. Perspective factorization is more complicated, and convergence to the correct depths is not guaranteed, especially in nonrigid scenarios (Hartley and Zisserman 2004). Vidal and Abretske (2006) showed that the constraints among multiple views of a nonrigid shape consisting of k shape bases can be reduced to multilinear constraints, and presented a closed-form solution to the reconstruction of a nonrigid shape consisting of two shape bases. Hartley and Vidal (2008) proposed a closed-form solution for nonrigid shape and motion with calibrated cameras or fixed intrinsic parameters. Since the factorization is only defined up to a nonsingular transformation matrix, many researchers adopt metric constraints to recover the matrix and upgrade the factorization to Euclidean space (Brand 2001; Bregler et al. 2000; Del Bue et al. 2004; Torresani et al. 2001). However, the rotation constraint may cause ambiguity in the combination of shape bases; Xiao et al. (2006) proposed a basis constraint to resolve the ambiguity and provided a closed-form solution.

The essence of the factorization algorithm is to find a low-rank approximation of the tracking matrix, and most algorithms adopt SVD to compute this approximation. Alternatively, Hartley and Schaffalitzky (2003) proposed power factorization (PF) to find the low-rank approximation, which can handle missing data in the tracking matrix. It was extended to nonrigid factorization in both metric space (Wang et al. 2008) and affine space (Wang and Wu 2008a). Vidal et al. (2008) proposed to combine the PF algorithm with motion segmentation. Other nonlinear methods deal with an incomplete tracking matrix with unavailable entries, such as the Damped Newton method (Buchanan and Fitzgibbon 2005) and a Levenberg-Marquardt based method (Chen 2008). Torresani et al. (2008) proposed a Probabilistic Principal Components Analysis algorithm to estimate 3D shape and motion with missing data. Camera calibration is an indispensable step in retrieving 3D metric information from 2D images. Many self-calibration algorithms have been proposed to calibrate fixed camera parameters (Maybank and Faugeras 1992; Hartley 1997; Luong and Faugeras 1997), varying parameters (Heyden and Åström 1997; Pollefeys et al. 1999), and affine camera models (Quan 1996).

Previous studies on factorization are based either on the affine camera model or on perspective projection. The affine assumption is widely adopted due to its simplicity, although it is only an approximation of the real imaging process. The extension to the perspective model, on the other hand, suffers from the recovery of the projective depths, which is computationally intensive and has no convergence guarantee. In this paper, we make a trade-off between the simplicity of affine and the accuracy of full perspective projection and propose a novel framework for the problem. Assuming that the camera is far away from the object with small lateral rotations, which is
similar to the affine assumption and easily satisfied in practice, we propose a quasi-perspective projection model and give an error analysis of different projection models. The model is proved to be more accurate than the affine approximation, since the projective depths are implicitly embedded in the shape matrix, while its computational complexity is similar to that of affine. We apply this model to the factorization algorithm and establish a framework for rigid and nonrigid factorization under quasi-perspective projection, and we elaborate the computational details of recovering the Euclidean upgrading matrix. To the best of our knowledge, there is no similar report in the literature. The idea was first proposed at CVPR 2008 (Wang and Wu 2008b); this paper presents more theoretical analysis and experimental evaluations.

The remaining part of the paper is organized as follows. The definition and background of the factorization algorithm are given in Sect. 2. The proposed quasi-perspective model and its error analysis are elaborated in Sect. 3. The application to rigid factorization under the proposed model is detailed in Sect. 4. Quasi-perspective nonrigid factorization is presented in Sect. 5. Extensive experimental evaluations on synthetic data are given in Sect. 6, and test results on real image sequences are reported in Sect. 7. Finally, concluding remarks are presented in Sect. 8.
2 Background on Factorization

2.1 Problem Definition

Under perspective projection, a 3D point X_j is projected onto an image point x_ij in frame i according to

$$\lambda_{ij}\, x_{ij} = P_i X_j = K_i [R_i, T_i]\, X_j \qquad (1)$$

where λ_ij is a non-zero scale factor, commonly called the projective depth; x_ij = [u_ij, v_ij, 1]^T and X_j = [x_j, y_j, z_j, 1]^T are expressed in homogeneous form; P_i is the projection matrix of the i-th frame; R_i and T_i are the corresponding rotation matrix and translation vector of the camera with respect to the world system; and K_i is the camera calibration matrix of the form

$$K_i = \begin{bmatrix} f_i & \varsigma_i & u_{0i} \\ 0 & \kappa_i f_i & v_{0i} \\ 0 & 0 & 1 \end{bmatrix} \qquad (2)$$

where f_i represents the camera's focal length; [u_{0i}, v_{0i}]^T are the coordinates of the camera's principal point; ς_i refers to the skew factor; and κ_i is called the aspect ratio of the camera. For precise industrial CCD cameras, we may assume zero skew, known principal point, and unit aspect ratio, i.e., ς_i = 0, u_{0i} = v_{0i} = 0, and κ_i = 1. The camera is then simplified to a single intrinsic parameter.
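As a concrete illustration of (1), the following minimal numpy sketch projects a world point with the simplified one-parameter camera; the function name and the numerical values are ours, for illustration only.

```python
import numpy as np

def project(K, R, T, X):
    """Project a 3D point X (world frame) by eq. (1).
    Returns the inhomogeneous image point and the projective
    depth lambda_ij (the third homogeneous coordinate)."""
    P = K @ np.hstack([R, T.reshape(3, 1)])   # P_i = K_i [R_i, T_i]
    x = P @ np.append(X, 1.0)                 # lambda_ij * x_ij
    return x[:2] / x[2], x[2]

# one-parameter camera: zero skew, centered principal point, unit aspect
K = np.diag([1000.0, 1000.0, 1.0])
m, lam = project(K, np.eye(3), np.array([0.0, 0.0, 200.0]),
                 np.array([1.0, 2.0, 3.0]))
```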
When the distance of an object from the camera is much greater than the depth variation of the object, we may assume an affine camera model. Under the affine assumption, the last row of the projection matrix has the form P_{3i}^T ≃ [0, 0, 0, 1], where '≃' denotes equality up to scale. The projection process (1) can then be simplified by removing the scale factor λ_ij:

$$\bar{x}_{ij} = A_i \bar{X}_j + \bar{T}_i \qquad (3)$$

where A_i ∈ R^{2×3} is composed of the upper-left 2×3 submatrix of P_i; x̄_ij = [u_ij, v_ij]^T and X̄_j = [x_j, y_j, z_j]^T are the non-homogeneous forms of x_ij and X_j, respectively; and T̄_i is the corresponding translation vector, which is in fact the image of the world origin. Under affine projection, it is easy to verify that the centroid of a set of space points projects to the centroid of their images. Therefore, the translation term vanishes if the image points in each frame are registered to the corresponding centroid, and the projection simplifies to

$$\bar{x}_{ij} = A_i \bar{X}_j \qquad (4)$$
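The centroid property above suggests a one-line registration step; here is a minimal numpy sketch (the function name is ours):

```python
import numpy as np

def register_to_centroid(W):
    """Subtract the per-frame centroid from a 2m x n tracking matrix
    so the translation term of eq. (3) vanishes, yielding eq. (4)."""
    return W - W.mean(axis=1, keepdims=True)
```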
The problem of structure from motion is defined as follows: given n tracked feature points of an object across a sequence of m frames, {x_ij | i = 1,...,m, j = 1,...,n}, recover the structure {X_ij | i = 1,...,m, j = 1,...,n} and the motion {R_i, T_i} of the object. The factorization-based algorithm has proved to be an effective method to deal with this problem. As shown in Table 1, the algorithms can generally be classified into four categories according to the camera assumption and object property: (i) rigid object under the affine assumption; (ii) rigid object under perspective projection; (iii) nonrigid object under the affine assumption; (iv) nonrigid object under perspective projection. In Table 1, 'Quasi-Persp' stands for the quasi-perspective projection model discussed in this paper. The symbols W, M, S, B, and H in the table are defined in the following subsections.

2.2 Rigid Factorization

Under the affine assumption (4), the projection from space to the sequence is expressed as

$$\underbrace{\begin{bmatrix} \bar{x}_{11} & \cdots & \bar{x}_{1n} \\ \vdots & \ddots & \vdots \\ \bar{x}_{m1} & \cdots & \bar{x}_{mn} \end{bmatrix}}_{W_{2m\times n}} = \underbrace{\begin{bmatrix} A_1 \\ \vdots \\ A_m \end{bmatrix}}_{M_{2m\times 3}} \underbrace{[\bar{X}_1, \ldots, \bar{X}_n]}_{\bar{S}_{3\times n}} \qquad (5)$$

where W is called the tracking matrix, and M and S̄ are called the motion matrix and shape matrix, respectively. It is evident that the rank of the tracking matrix is at most 3.
Table 1 Classification of structure and motion factorization of rigid and nonrigid objects

Classification             Tracking matrix   Motion matrix        Shape matrix        Upgrading matrix
Rigid     Affine           W ∈ R^(2m×n)      M ∈ R^(2m×3)         S̄ ∈ R^(3×n)         H ∈ R^(3×3)
          Perspective      Ẇ ∈ R^(3m×n)      M ∈ R^(3m×4)         S ∈ R^(4×n)         H ∈ R^(4×4)
          Quasi-Persp      W ∈ R^(3m×n)      M ∈ R^(3m×4)         S ∈ R^(4×n)         H ∈ R^(4×4)
Nonrigid  Affine           W ∈ R^(2m×n)      M ∈ R^(2m×3k)        B̄ ∈ R^(3k×n)        H ∈ R^(3k×3k)
          Perspective      Ẇ ∈ R^(3m×n)      M ∈ R^(3m×(3k+1))    B ∈ R^((3k+1)×n)    H ∈ R^((3k+1)×(3k+1))
          Quasi-Persp      W ∈ R^(3m×n)      M ∈ R^(3m×(3k+1))    B ∈ R^((3k+1)×n)    H ∈ R^((3k+1)×(3k+1))
The rank constraint can be easily imposed by performing SVD on the tracking matrix W and truncating it to rank 3. However, the decomposition is not unique, since it is only defined up to a nonsingular linear transformation matrix H ∈ R^{3×3} as W = (MH)(H^{-1}S̄). The decomposition is in fact just one of the affine reconstructions of the object; by inserting H into the factorization, we can upgrade the reconstruction from affine to Euclidean space. We will refer to this matrix as the (Euclidean) upgrading matrix in the following. Many researchers utilize the metric constraints on the motion matrix to recover it (Poelman and Kanade 1997; Quan 1996), which is in essence a self-calibration process under the constraints of simplified camera parameters.

When the perspective projection model (1) is adopted, the factorization equation can be formulated as

$$\underbrace{\begin{bmatrix} \lambda_{11}x_{11} & \cdots & \lambda_{1n}x_{1n} \\ \vdots & \ddots & \vdots \\ \lambda_{m1}x_{m1} & \cdots & \lambda_{mn}x_{mn} \end{bmatrix}}_{\dot{W}_{3m\times n}} = \underbrace{\begin{bmatrix} P_1 \\ \vdots \\ P_m \end{bmatrix}}_{M_{3m\times 4}} \underbrace{\begin{bmatrix} \bar{X}_1 & \cdots & \bar{X}_n \\ 1 & \cdots & 1 \end{bmatrix}}_{S_{4\times n}} \qquad (6)$$

where Ẇ is called the projective-depth-scaled tracking matrix, whose rank is at most 4 if a consistent set of scalars λ_ij is present; M and S are the camera matrix and homogeneous shape matrix, respectively. Obviously, any such factorization corresponds to a valid projective reconstruction, which is defined up to a projective transformation matrix H ∈ R^{4×4}; we can still use the metric constraint to recover the upgrading matrix. The most difficult part of perspective factorization is to recover projective depths that are consistent with (1). One method is to estimate the depths pairwise from the fundamental matrix and then string them together (Sturm and Triggs 1996; Triggs 1996); the disadvantage of this approach is its computational cost and possible error accumulation. Another method is to start with initial depths λ_ij = 1 and iteratively refine the depths by reprojection (Han and Kanade 2000; Hartley and Zisserman 2004; Mahamud and Hebert 2000). However, there is no guarantee that the procedure converges to a global minimum; as recently proved in Oliensis and Hartley (2007), no iteration has been shown to converge sensibly.

2.3 Nonrigid Factorization

When an object is nonrigid, many studies follow Bregler's assumption (Bregler et al. 2000) that the nonrigid structure can be approximated by a linearly weighted combination of k rigid shape bases:

$$\bar{S}_i = \sum_{l=1}^{k} \omega_{il} B_l \qquad (7)$$

where B_l ∈ R^{3×n} is a shape basis that embodies a principal mode of the deformation, and ω_il ∈ R is called the deformation weight. Under this assumption and the affine camera model, the nonrigid factorization is modeled as

$$\underbrace{\begin{bmatrix} \bar{x}_{11} & \cdots & \bar{x}_{1n} \\ \vdots & \ddots & \vdots \\ \bar{x}_{m1} & \cdots & \bar{x}_{mn} \end{bmatrix}}_{W_{2m\times n}} = \underbrace{\begin{bmatrix} \omega_{11}A_1 & \cdots & \omega_{1k}A_1 \\ \vdots & \ddots & \vdots \\ \omega_{m1}A_m & \cdots & \omega_{mk}A_m \end{bmatrix}}_{M_{2m\times 3k}} \underbrace{\begin{bmatrix} B_1 \\ \vdots \\ B_k \end{bmatrix}}_{\bar{B}_{3k\times n}} \qquad (8)$$

We call M the nonrigid motion matrix and B̄ the nonrigid shape matrix, which is composed of the k shape bases. It is easy to see from (8) that the rank of the nonrigid tracking matrix W is at most 3k. The decomposition can be achieved by SVD with the rank-3k constraint and is defined up to a nonsingular upgrading matrix H ∈ R^{3k×3k}. Once the matrix is known, A_i, ω_il, and S̄_i can be recovered accordingly from M and B̄. The computation of H here is more complicated than in the rigid case. Many researchers (Brand 2001; Del Bue et al. 2004; Torresani et al. 2001) adopt the metric constraints on the motion matrix; however, these constraints may be insufficient when the object deforms at varying speed. Xiao et al. (2006) proposed a basis constraint to resolve this ambiguity.
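All three factorizations (5), (6), and (8) share the same computational core: a truncated SVD of the tracking matrix. A minimal numpy sketch under our own naming (rank r = 3, 4, or 3k, as appropriate):

```python
import numpy as np

def low_rank_factorize(W, r):
    """Rank-r factorization of a tracking matrix W by SVD truncation.
    Returns M_hat, S_hat with W ~ M_hat @ S_hat; the pair is defined
    only up to a nonsingular r x r upgrading matrix H."""
    U, s, Vt = np.linalg.svd(W, full_matrices=False)
    sr = np.sqrt(s[:r])
    return U[:, :r] * sr, sr[:, None] * Vt[:r]
```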
Similarly, the factorization under perspective projection can be formulated as follows (Xiao and Kanade 2005):

$$\dot{W}_{3m\times n} = \underbrace{\begin{bmatrix} \omega_{11}P_1^{(1:3)} & \cdots & \omega_{1k}P_1^{(1:3)} & P_1^{(4)} \\ \vdots & \ddots & \vdots & \vdots \\ \omega_{m1}P_m^{(1:3)} & \cdots & \omega_{mk}P_m^{(1:3)} & P_m^{(4)} \end{bmatrix}}_{M_{3m\times(3k+1)}} \underbrace{\begin{bmatrix} B_1 \\ \vdots \\ B_k \\ \mathbf{1}^T \end{bmatrix}}_{B_{(3k+1)\times n}} \qquad (9)$$

where Ẇ is the depth-scaled tracking matrix as in (6); P_i^{(1:3)} and P_i^{(4)} denote the first three columns and the fourth column of P_i, respectively; and 1 = [1,...,1]^T is an n-vector of unit entries. The rank of the correctly scaled tracking matrix is at most 3k+1. The decomposition is defined up to a transformation H ∈ R^{(3k+1)×(3k+1)}, which can be determined in a similar but more complicated way. Just as in the rigid case, the most difficult part of nonrigid perspective factorization is to determine the projective depths. Since there is no pairwise fundamental matrix for deformable features, we can only use the iterative method to recover the depths, although it is more likely to get stuck in a local minimum in the nonrigid situation.

3 Quasi-perspective Projection

In this section, we propose a new quasi-perspective projection model to fill the gap between the simplicity of the affine camera and the accuracy of perspective projection.

3.1 Quasi-perspective Projection

Under perspective projection, the image formation process is shown in Fig. 1. In order to ensure a large overlapping part of the object to be reconstructed, the camera usually undergoes only small movements between adjacent views, especially for images of a video sequence. Suppose O_w−X_wY_wZ_w is the world coordinate system selected on the object to be reconstructed, and O_i−X_iY_iZ_i is the camera system with O_i the optical center of the camera. Without loss of generality, we assume there is a reference camera system O_r−X_rY_rZ_r. As the world system can be set freely, we align it with the reference frame as illustrated in Fig. 1. Therefore, the rotation R_i of frame i with respect to the reference frame is the same as the rotation of the camera with respect to the world system.

Definition 1 (Axial and lateral rotation) The orientation of a camera is usually described by roll-pitch-yaw angles. For the i-th frame, we define the pitch, yaw, and roll as the rotations α_i, β_i, and γ_i of the camera with respect to the X_w, Y_w, and Z_w axes of the world system. As shown in Fig. 1, the optical axes of the cameras usually point towards the object. For convenience of discussion, we call γ_i the axial rotation angle, and α_i, β_i the lateral rotation angles.

Proposition 2 Suppose the camera undergoes small lateral rotation with respect to the reference frame. Then the variation of the projective depth λ_ij is mainly proportional to the depth of the space point, and the projective depths of a point at different views have a similar trend of variation.

Proof Suppose the rotation and translation of the i-th frame with respect to the world system are R_i = [r_1i, r_2i, r_3i]^T and T_i = [t_xi, t_yi, t_zi]^T, respectively. Then the projection matrix can be written as

$$P_i = K_i[R_i, T_i] = \begin{bmatrix} f_i r_{1i}^T + \varsigma_i r_{2i}^T + u_{0i} r_{3i}^T & f_i t_{xi} + \varsigma_i t_{yi} + u_{0i} t_{zi} \\ \kappa_i f_i r_{2i}^T + v_{0i} r_{3i}^T & \kappa_i f_i t_{yi} + v_{0i} t_{zi} \\ r_{3i}^T & t_{zi} \end{bmatrix} \qquad (10)$$

Let us decompose the rotation matrix into the rotations around the three axes as R(γ_i)R(β_i)R(α_i). Then we have

$$R_i = R(\gamma_i)R(\beta_i)R(\alpha_i) = \begin{bmatrix} C\gamma_i & -S\gamma_i & 0 \\ S\gamma_i & C\gamma_i & 0 \\ 0 & 0 & 1 \end{bmatrix} \begin{bmatrix} C\beta_i & 0 & S\beta_i \\ 0 & 1 & 0 \\ -S\beta_i & 0 & C\beta_i \end{bmatrix} \begin{bmatrix} 1 & 0 & 0 \\ 0 & C\alpha_i & -S\alpha_i \\ 0 & S\alpha_i & C\alpha_i \end{bmatrix} = \begin{bmatrix} C\gamma_i C\beta_i & C\gamma_i S\beta_i S\alpha_i - S\gamma_i C\alpha_i & C\gamma_i S\beta_i C\alpha_i + S\gamma_i S\alpha_i \\ S\gamma_i C\beta_i & S\gamma_i S\beta_i S\alpha_i + C\gamma_i C\alpha_i & S\gamma_i S\beta_i C\alpha_i - C\gamma_i S\alpha_i \\ -S\beta_i & C\beta_i S\alpha_i & C\beta_i C\alpha_i \end{bmatrix} \qquad (11)$$
Fig. 1 Imaging process of an object. (a) Camera setup with respect to the object. (b) The relationship between the world coordinate system and the camera systems at different viewpoints
where 'S' stands for the sine function and 'C' stands for the cosine function. By inserting (10) and (11) into (1), we have

$$\lambda_{ij} = [r_{3i}^T, t_{zi}]\, X_j = -(S\beta_i)x_j + (C\beta_i S\alpha_i)y_j + (C\beta_i C\alpha_i)z_j + t_{zi} \qquad (12)$$

From Fig. 1, we know that the rotation angles α_i, β_i, γ_i of the camera with respect to the world system are the same as those with respect to the reference frame. Under small lateral rotations, i.e., small angles α_i and β_i, we have Sβ_i ≪ Cβ_iCα_i and Cβ_iSα_i ≪ Cβ_iCα_i. Thus (12) can be approximated by

$$\lambda_{ij} \approx (C\beta_i C\alpha_i)\, z_j + t_{zi} \qquad (13)$$

All features {x_ij | j = 1,...,n} in the i-th frame correspond to the same rotation angles α_i, β_i, γ_i and translation t_zi. It is evident from (13) that the projective depths of a point in all frames have a similar trend of variation, proportional to the value z_j of the space point. In fact, the projective depths do not depend on the axial rotation γ_i at all. □

Proposition 3 Under small lateral rotations and the further assumption that the distance of the camera to the object is much larger than the depth of the object, i.e., t_zi ≫ z_j, the ratio of {λ_ij | i = 1,...,m} corresponding to any two different frames can be approximated by a constant.

Proof Taking the reference frame as an example, the ratio of the projective depths of any frame i to those of the reference frame can be written as

$$\mu_i = \frac{\lambda_{rj}}{\lambda_{ij}} \approx \frac{(C\beta_r C\alpha_r)z_j + t_{zr}}{(C\beta_i C\alpha_i)z_j + t_{zi}} = \frac{C\beta_r C\alpha_r (z_j/t_{zi}) + t_{zr}/t_{zi}}{C\beta_i C\alpha_i (z_j/t_{zi}) + 1} \qquad (14)$$

where Cβ_iCα_i ≤ 1. Under the assumption t_zi ≫ z_j, the ratio can be approximated by

$$\mu_i = \frac{\lambda_{rj}}{\lambda_{ij}} \approx \frac{t_{zr}}{t_{zi}} \qquad (15)$$

All features in a frame have the same translation term. Thus from (15) we see that the projective depth ratios of two frames have the same approximation μ_i for all features. □

According to Proposition 3, we have λ_ij = (1/μ_i)λ_rj. Thus the perspective projection equation (1) can be approximated by

$$\frac{1}{\mu_i}\lambda_{rj}\, x_{ij} = P_i X_j \qquad (16)$$

Let us denote λ_rj as 1/ϱ_j and reformulate (16) as

$$x_{ij} = P_{qi} X_{qj} \qquad (17)$$

where

$$P_{qi} = \mu_i P_i, \qquad X_{qj} = \varrho_j X_j \qquad (18)$$

We call (17) the quasi-perspective projection model. Compared with general perspective projection, quasi-perspective assumes that the projective depths of different frames are defined up to a constant μ_i. Thus the projective depths are implicitly embedded in the scalars of the homogeneous structure X_qj and the projection matrix P_qi, and the difficult problem of estimating the unknown depths is avoided. The model is more general than the affine projection model (3), where all projective depths are simply assumed to be equal.
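The following numpy sketch numerically checks Propositions 2 and 3: it compares the true depths of eq. (12) against the approximation λ_ij ≈ λ_rj/μ_i with μ_i = t_zr/t_zi from (15), for a camera with small lateral rotations and large t_z. The point ranges, angles, and function names are an illustrative setup of ours, not the paper's experiment.

```python
import numpy as np

def rot(alpha, beta, gamma):
    """R = R(gamma) R(beta) R(alpha), as in eq. (11)."""
    ca, sa = np.cos(alpha), np.sin(alpha)
    cb, sb = np.cos(beta), np.sin(beta)
    cg, sg = np.cos(gamma), np.sin(gamma)
    Rx = np.array([[1, 0, 0], [0, ca, -sa], [0, sa, ca]])
    Ry = np.array([[cb, 0, sb], [0, 1, 0], [-sb, 0, cb]])
    Rz = np.array([[cg, -sg, 0], [sg, cg, 0], [0, 0, 1]])
    return Rz @ Ry @ Rx

rng = np.random.default_rng(0)
X = np.vstack([rng.uniform(-10, 10, (3, 50)), np.ones(50)])  # points
tz_r, tz_i = 200.0, 210.0
R_i = rot(np.radians(3), np.radians(-4), np.radians(30))     # small alpha, beta
lam_r = X[2] + tz_r                             # reference frame, R = I
lam_i = np.hstack([R_i[2], tz_i]) @ X           # true depths, eq. (12)
lam_hat = lam_r * (tz_i / tz_r)                 # lambda_rj / mu_i, eq. (15)
print(np.max(np.abs(lam_i - lam_hat) / lam_i))  # small relative error
```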
3.2 Error Analysis of Different Projection Models

In this subsection, we give a heuristic analysis of the imaging errors of the quasi-perspective and affine camera models with respect to general perspective projection. For simplicity, the frame subscript 'i' is omitted in the following. Suppose the intrinsic parameters of the cameras are known and all images are normalized by applying the inverse K_i^{-1} to each frame. The projection matrices under the different models can then be written as

$$P = \begin{bmatrix} r_1^T & t_x \\ r_2^T & t_y \\ r_3^T & t_z \end{bmatrix}, \quad r_3^T = [-S\beta,\; C\beta S\alpha,\; C\beta C\alpha] \qquad (19)$$

$$P_q = \begin{bmatrix} r_1^T & t_x \\ r_2^T & t_y \\ r_{3q}^T & t_z \end{bmatrix}, \quad r_{3q}^T = [0,\; 0,\; C\beta C\alpha] \qquad (20)$$

$$P_a = \begin{bmatrix} r_1^T & t_x \\ r_2^T & t_y \\ \mathbf{0}^T & t_z \end{bmatrix}, \quad \mathbf{0}^T = [0,\; 0,\; 0] \qquad (21)$$

where P is the projection matrix of perspective projection, P_q that of the quasi-perspective assumption, and P_a that of affine projection. Clearly, the main difference between these projection matrices lies only in the last row. For a space point X̄ = [x, y, z]^T, its projection under the different camera models is given by

$$m = P\begin{bmatrix} \bar{X} \\ 1 \end{bmatrix} = \begin{bmatrix} u \\ v \\ r_3^T\bar{X} + t_z \end{bmatrix}, \qquad (22)$$

$$m_q = P_q\begin{bmatrix} \bar{X} \\ 1 \end{bmatrix} = \begin{bmatrix} u \\ v \\ r_{3q}^T\bar{X} + t_z \end{bmatrix}, \qquad (23)$$

$$m_a = P_a\begin{bmatrix} \bar{X} \\ 1 \end{bmatrix} = \begin{bmatrix} u \\ v \\ t_z \end{bmatrix} \qquad (24)$$

where

$$u = r_1^T\bar{X} + t_x, \qquad v = r_2^T\bar{X} + t_y, \qquad (25)$$

$$r_3^T\bar{X} = -(S\beta)x + (C\beta S\alpha)y + (C\beta C\alpha)z, \qquad (26)$$

$$r_{3q}^T\bar{X} = (C\beta C\alpha)z \qquad (27)$$

and the nonhomogeneous image points can be denoted as

$$\bar{m} = \frac{1}{r_3^T\bar{X} + t_z}\begin{bmatrix} u \\ v \end{bmatrix}, \qquad (28)$$

$$\bar{m}_q = \frac{1}{r_{3q}^T\bar{X} + t_z}\begin{bmatrix} u \\ v \end{bmatrix}, \qquad (29)$$

$$\bar{m}_a = \frac{1}{t_z}\begin{bmatrix} u \\ v \end{bmatrix} \qquad (30)$$

The point m̄ is the ideal image under perspective projection. Let us define e_q = |m̄_q − m̄| as the error of quasi-perspective and e_a = |m̄_a − m̄| as the error of affine, where '|·|' stands for the norm of a vector. Then we have

$$e_q = |\bar{m}_q - \bar{m}| = \left|\frac{r_3^T\bar{X} + t_z}{r_{3q}^T\bar{X} + t_z}\,\bar{m} - \bar{m}\right| = \left|\frac{(r_3^T - r_{3q}^T)\bar{X}}{r_{3q}^T\bar{X} + t_z}\right| |\bar{m}| = \left|\frac{-(S\beta)x + (C\beta S\alpha)y}{(C\beta C\alpha)z + t_z}\right| |\bar{m}|, \qquad (31)$$

$$e_a = |\bar{m}_a - \bar{m}| = \left|\frac{r_3^T\bar{X} + t_z}{t_z}\,\bar{m} - \bar{m}\right| = \left|\frac{r_3^T\bar{X}}{t_z}\right| |\bar{m}| = \left|\frac{-(S\beta)x + (C\beta S\alpha)y + (C\beta C\alpha)z}{t_z}\right| |\bar{m}| \qquad (32)$$

Based on the above equations, it is rational to state the following results for the different projection models.

1. The axial rotation angle γ around the Z-axis has no influence on the images m̄, m̄_q, and m̄_a.
2. When the distance of the camera to the object is much larger than the object depth, both m̄_q and m̄_a are close to m̄.
3. When the camera system is aligned with the world system, i.e., α = β = 0, we have r_{3q}^T = r_3^T = [0, 0, 1] and e_q = 0. Thus m̄_q = m̄, and the quasi-perspective assumption is equivalent to perspective projection.
4. When the rotation angles α and β are small, we have e_q < e_a, i.e., the quasi-perspective assumption is more accurate than the affine assumption.
5. When the space point lies on the plane through the world origin and perpendicular to the principal axis, i.e., the direction of r_3^T, we have α = β = 0 and z = 0. It is then easy to verify that m̄ = m̄_q = m̄_a.
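A small numpy sketch of the error formulas above, useful for reproducing comparisons like those in Sect. 6.1; the function and its signature are our own illustration, assuming a normalized camera (K = I):

```python
import numpy as np

def model_errors(Xbar, r1, r2, r3, t):
    """Image errors of quasi-perspective (eq. (31)) and affine
    (eq. (32)) relative to the perspective image of point Xbar.
    r3 is assumed of the form [-Sb, Cb*Sa, Cb*Ca] as in eq. (19)."""
    tx, ty, tz = t
    u = r1 @ Xbar + tx
    v = r2 @ Xbar + ty
    m = np.array([u, v]) / (r3 @ Xbar + tz)          # eq. (28)
    m_q = np.array([u, v]) / (r3[2] * Xbar[2] + tz)  # eq. (29), r3q
    m_a = np.array([u, v]) / tz                      # eq. (30)
    return np.linalg.norm(m_q - m), np.linalg.norm(m_a - m)
```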
4 Quasi-Perspective Rigid Factorization

Under quasi-perspective projection (17), the factorization equation of a tracking matrix is expressed as

$$\begin{bmatrix} x_{11} & \cdots & x_{1n} \\ \vdots & \ddots & \vdots \\ x_{m1} & \cdots & x_{mn} \end{bmatrix} = \begin{bmatrix} \mu_1 P_1 \\ \vdots \\ \mu_m P_m \end{bmatrix} [\varrho_1 X_1, \ldots, \varrho_n X_n] \qquad (33)$$

which can be written concisely as

$$W_{3m\times n} = M_{3m\times 4}\, S_{4\times n} \qquad (34)$$
The form is similar to perspective factorization (6). However, the projective depths in (33) are embedded in the motion and shape matrices, so there is no need to estimate them explicitly. By performing SVD on the tracking matrix and imposing the rank-4 constraint, W may be factorized as M̂_{3m×4}Ŝ_{4×n}. However, the decomposition is not unique, since it is defined up to a nonsingular linear transformation H_{4×4} as M = M̂H and S = H^{-1}Ŝ. If a reasonable upgrading matrix is recovered, the Euclidean structure and motions can easily be recovered from the shape matrix S and motion matrix M. Due to the special form of (33), the recovery of the upgrading matrix has some special properties compared with those under affine and perspective projection. We show the computational details below.
4.1 Recovery of the Euclidean Upgrading Matrix

We adopt the metric constraint to compute an upgrading matrix H_{4×4}. Let us split the matrix into two parts as

$$H = [H_l \,|\, H_r] \qquad (35)$$

where H_l denotes the first three columns of H and H_r the fourth column. Suppose M̂_i is the i-th triple of rows of M̂; then from M̂_iH = [M̂_iH_l | M̂_iH_r], we know that

$$\hat{M}_i H_l = \mu_i P_i^{(1:3)} = \mu_i K_i R_i, \qquad (36)$$

$$\hat{M}_i H_r = \mu_i P_i^{(4)} = \mu_i K_i T_i \qquad (37)$$

Let us denote C_i = M̂_i Q M̂_i^T, where Q = H_lH_l^T is a 4×4 symmetric matrix. As in previous factorization studies (Han and Kanade 2000; Quan 1996), we adopt a simplified camera model with only one parameter, K_i = diag(f_i, f_i, 1). Then from

$$C_i = \hat{M}_i Q \hat{M}_i^T = (\mu_i K_i R_i)(\mu_i K_i R_i)^T = \mu_i^2 K_i K_i^T = \mu_i^2 \begin{bmatrix} f_i^2 & & \\ & f_i^2 & \\ & & 1 \end{bmatrix} \qquad (38)$$

we can obtain the following constraints:

$$\begin{cases} C_i(1,2) = C_i(2,1) = 0 \\ C_i(1,3) = C_i(3,1) = 0 \\ C_i(2,3) = C_i(3,2) = 0 \\ C_i(1,1) - C_i(2,2) = 0 \end{cases} \qquad (39)$$

Since the factorization (33) is defined up to a global scalar as W = MS = (εM)(S/ε), we may set μ_1 = 1 to avoid the trivial solution Q = 0. We thus have 4m+1 linear constraints in total on the 10 unknowns of Q, which can be solved via least squares. Ideally, Q is a positive semidefinite symmetric matrix, and the matrix H_l can be recovered from Q via matrix decomposition.

Definition 4 (Vertical extended upper triangular matrix) Suppose U is an n×k (n > k) matrix. We call U a vertical extended upper triangular matrix if it has the form

$$U_{ij} = \begin{cases} u_{ij} & \text{if } i \le j + (n-k) \\ 0 & \text{if } i > j + (n-k) \end{cases} \qquad (40)$$

where U_ij denotes the (i, j)-th element of U and u_ij is a scalar. For example, an n×(n−1) vertical extended upper triangular matrix can be written explicitly as

$$U = \begin{bmatrix} u_{11} & u_{12} & \cdots & u_{1(n-1)} \\ u_{21} & u_{22} & \cdots & u_{2(n-1)} \\ & u_{32} & \cdots & u_{3(n-1)} \\ & & \ddots & \vdots \\ & & & u_{n(n-1)} \end{bmatrix} \qquad (41)$$

Proposition 5 (Extended Cholesky Decomposition) Suppose Q_n is an n×n positive semidefinite symmetric matrix of rank k. Then it can be decomposed as Q_n = H_kH_k^T, where H_k is an n×k matrix of rank k. Furthermore, the decomposition can be written as Q_n = Λ_kΛ_k^T with Λ_k an n×k vertical extended upper triangular matrix. The number of degrees of freedom of the matrix Q_n is nk − ½k(k−1), which is the number of unknowns in Λ_k.

The proof of Proposition 5 is given in Appendix 1. From the Extended Cholesky Decomposition we can easily obtain the following result.

Result 6 The matrix Q recovered from (39) is a 4×4 positive semidefinite symmetric matrix of rank 3. It can be decomposed as Q = H_lH_l^T, where H_l is a 4×3 matrix of rank 3. The decomposition can further be written as Q = Λ_3Λ_3^T with Λ_3 a 4×3 vertical extended upper triangular matrix.

The computation of H_l is very simple. Suppose the SVD of Q is U_4Σ_4U_4^T, where U_4 is a 4×4 orthogonal matrix and Σ_4 = diag(σ_1, σ_2, σ_3, 0) is a diagonal matrix with σ_i the singular values of Q. Then we immediately have

$$H_l = U^{(1:3)} \begin{bmatrix} \sqrt{\sigma_1} & & \\ & \sqrt{\sigma_2} & \\ & & \sqrt{\sigma_3} \end{bmatrix} \qquad (42)$$

where U^{(1:3)} denotes the first three columns of U. The vertical extended upper triangular matrix Λ_3 can then be constructed from H_l as shown in Appendix 1. The computation is an extension of the Cholesky Decomposition to positive semidefinite symmetric matrices, whereas the general Cholesky Decomposition applies only to positive definite symmetric matrices. From the number of unknowns in Λ_3 we know that Q has only 9 degrees of freedom.
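To make the constraint stacking concrete, here is a hedged numpy sketch that assembles the linear system (39) on the 10 distinct entries of Q, pins the global scale via C_1(3,3) = μ_1² = 1, solves by least squares, and extracts H_l as in (42). The function name and the exact way the scale is pinned are our own choices, not prescribed by the paper.

```python
import numpy as np

def upgrade_left(M_hat, m):
    """Recover Q = Hl Hl^T from the rotation constraints (39), then
    Hl by eq. (42). M_hat is the 3m x 4 factor of the rank-4 SVD."""
    idx = [(a, b) for a in range(4) for b in range(a, 4)]  # 10 unknowns

    def row(p, q):
        # linear coefficients of p^T Q q in the 10 upper-triangle entries
        c = np.zeros(10)
        for t, (a, b) in enumerate(idx):
            c[t] = p[a] * q[a] if a == b else p[a] * q[b] + p[b] * q[a]
        return c

    A = []
    for i in range(m):
        Mi = M_hat[3 * i:3 * i + 3]
        A.append(row(Mi[0], Mi[1]))                      # C_i(1,2) = 0
        A.append(row(Mi[0], Mi[2]))                      # C_i(1,3) = 0
        A.append(row(Mi[1], Mi[2]))                      # C_i(2,3) = 0
        A.append(row(Mi[0], Mi[0]) - row(Mi[1], Mi[1]))  # C_i(1,1)-C_i(2,2)
    A.append(row(M_hat[2], M_hat[2]))                    # C_1(3,3) = mu_1^2 = 1
    A = np.asarray(A)
    b = np.zeros(A.shape[0])
    b[-1] = 1.0
    q, *_ = np.linalg.lstsq(A, b, rcond=None)
    Q = np.zeros((4, 4))
    for t, (a, bb) in enumerate(idx):
        Q[a, bb] = Q[bb, a] = q[t]
    w, U = np.linalg.eigh(Q)                 # symmetric eigendecomposition
    w, U = w[::-1], U[:, ::-1]               # descending order
    Hl = U[:, :3] * np.sqrt(np.maximum(w[:3], 0.0))      # eq. (42)
    return Q, Hl
```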
Remark 7 In Result 6, we assume Q is positive semidefinite. However, the recovered matrix Q may fail to be positive semidefinite in the case of noisy data, in which case we cannot adopt the above method to decompose it into the form H_lH_l^T or Λ_3Λ_3^T. In this case, let us denote

$$\Lambda_3 = \begin{bmatrix} h_1 & h_2 & h_3 \\ h_4 & h_5 & h_6 \\ & h_7 & h_8 \\ & & h_9 \end{bmatrix} \qquad (43)$$

and substitute Λ_3Λ_3^T for the matrix Q in (38). A best estimate of Λ_3 in (43) can then be obtained by minimizing the cost function

$$J_1 = \min_{\Lambda_3} \frac{1}{2} \sum_{i=1}^{m} \left[ C_i^2(1,2) + C_i^2(1,3) + C_i^2(2,3) + \big(C_i(1,1) - C_i(2,2)\big)^2 \right] \qquad (44)$$

The minimization can be solved via any nonlinear optimization technique, such as Gradient Descent or the Levenberg-Marquardt (LM) algorithm.

Remark 8 In Result 6, we claim that the symmetric matrix Q can be decomposed into Λ_3Λ_3^T. In practice, the recovery of Λ_3 is unnecessary, since the upgrading matrix (35) is not unique; we can simply decompose the matrix into H_lH_l^T as shown in (42). However, this decomposition is impossible when Q is not positive semidefinite. In such cases, it is suggested to parameterize Q with Λ_3, since the vertical extended upper triangular form (43) removes 3 unknowns; we then only need to optimize 9 parameters in the minimization scheme (44).

We now show how to recover the right part H_r of the upgrading matrix (35). From the quasi-perspective equation (17), we have

$$\varrho_j x_{ij} = (\mu_i P_i^{(1:3)})(\varrho_j \bar{X}_j) + (\mu_i P_i^{(4)})\, \varrho_j \qquad (45)$$

Summing the coordinates of all features in the i-th frame gives

$$\sum_{j=1}^{n} \varrho_j x_{ij} = \mu_i P_i^{(1:3)} \sum_{j=1}^{n} (\varrho_j \bar{X}_j) + \mu_i P_i^{(4)} \sum_{j=1}^{n} \varrho_j \qquad (46)$$

where μ_iP_i^{(1:3)} can be recovered from M̂_iH_l, and μ_iP_i^{(4)} = M̂_iH_r. Since the world coordinate system can be chosen freely, we may set Σ_{j=1}^n (ϱ_j X̄_j) = 0, which is equivalent to setting the origin of the world system at the gravity center of the scaled space points. On the other hand, since the reconstruction is defined up to a global scalar, we may simply set Σ_{j=1}^n ϱ_j = 1. Thus equation (46) simplifies to

$$\hat{M}_i H_r = \sum_{j=1}^{n} \varrho_j x_{ij} = \begin{bmatrix} \sum_j \varrho_j u_{ij} \\ \sum_j \varrho_j v_{ij} \\ 1 \end{bmatrix} \qquad (47)$$

which provides 3 linear constraints on the four unknowns of H_r. We can therefore obtain 3m equations from the sequence, and H_r can be recovered via linear least squares. From the above analysis, we note that the solution of H_r is not unique, as it depends on the selection of the world origin Σ_{j=1}^n (ϱ_j X̄_j) and the global scalar Σ_{j=1}^n ϱ_j. In fact, H_r may be set freely, as shown in the following proposition.

Proposition 9 (Recovery of H_r) Suppose H_l in (35) is already recovered. Construct a matrix H̃ = [H_l | H̃_r], where H̃_r is an arbitrary 4-vector that is independent of the three columns of H_l. Then H̃ must be a valid upgrading matrix, i.e., M̃ = M̂H̃ is a valid Euclidean motion matrix, and S̃ = H̃^{-1}Ŝ corresponds to a valid Euclidean shape matrix.

The proof can be found in Appendix 2. According to Proposition 9, the value of H_r can be set to any 4-vector that is independent of H_l. In practice, H_r may be set from the SVD of H_l:

$$H_l = U_{4\times4}\Sigma_{4\times3}V_{3\times3}^T = [u_1, u_2, u_3, u_4] \begin{bmatrix} \sigma_1 & 0 & 0 \\ 0 & \sigma_2 & 0 \\ 0 & 0 & \sigma_3 \\ 0 & 0 & 0 \end{bmatrix} [v_1, v_2, v_3]^T \qquad (48)$$

where U and V are two orthogonal matrices and Σ is a diagonal of the three singular values. Choosing an arbitrary value σ_r between the largest and smallest singular values of H_l, we may set

$$H_r = \sigma_r u_4, \qquad H = [H_l, H_r] \qquad (49)$$

This construction guarantees that H is invertible and has the same condition number as H_l, so that the inverse H^{-1} can be computed with good precision.

After recovering the Euclidean motion and shape matrices, the intrinsic parameters and pose of the camera associated with each frame can easily be computed as follows:

$$\mu_i = \left\| M_{i(3)}^{(1:3)} \right\|, \qquad (50)$$

$$f_i = \frac{1}{\mu_i}\left\| M_{i(1)}^{(1:3)} \right\| = \frac{1}{\mu_i}\left\| M_{i(2)}^{(1:3)} \right\|, \qquad (51)$$

$$R_i = \frac{1}{\mu_i} K_i^{-1} M_i^{(1:3)}, \qquad T_i = \frac{1}{\mu_i} K_i^{-1} M_i^{(4)} \qquad (52)$$

where M_{i(t)}^{(1:3)} denotes the t-th row of M_i^{(1:3)}. The result is obtained under the quasi-perspective assumption, which is a close approximation to general perspective projection. The solution may be further optimized towards perspective projection by minimizing the image reprojection residuals

$$J_2 = \min_{(K_i, R_i, T_i, \mu_i, X_j)} \frac{1}{2} \sum_{i=1}^{m}\sum_{j=1}^{n} |\bar{x}_{ij} - \hat{x}_{ij}|^2 \qquad (53)$$

where x̂_ij denotes the reprojected image point computed via perspective projection (1). This minimization process is termed bundle adjustment (Hartley and Zisserman 2004) and is usually solved via Levenberg-Marquardt iterations.

4.2 Outline of the Algorithm

The implementation of the rigid factorization algorithm is summarized as follows.

Algorithm 10 (Quasi-perspective rigid factorization) Given the tracking matrix W ∈ R^{3m×n} of a sequence with small camera movements, compute the Euclidean structure and motion parameters under quasi-perspective projection.

1. Balance the tracking matrix via point-wise and image-wise rescalings, as in Sturm and Triggs (1996), to improve numerical stability;
2. Perform rank-4 SVD factorization on the tracking matrix to obtain a solution of M̂ and Ŝ;
3. Compute the left part H_l of the upgrading matrix according to (42), or (44) when Q is not positive semidefinite;
4. Compute H_r and H according to (49);
5. Recover the Euclidean motion matrix M = M̂H and shape matrix S = H^{-1}Ŝ;
6. Estimate the camera parameters and pose from (50) to (52);
7. Optimize the solution via bundle adjustment (53).

Remark 11 In the above analysis, as in other factorization algorithms, we usually assume a one-parameter camera model as in (38) so that we may use this constraint to recover the upgrading matrix H. When the one-parameter assumption is not satisfied in real applications, it is possible to take the proposed solution as an initial value and optimize the camera parameters via the Kruppa constraint arising from pairwise images (Wang et al. 2008).

Remark 12 The essence of quasi-perspective factorization (34) is to find a rank-4 approximation MS of the tracking matrix, i.e., to minimize the Frobenius norm ‖W − MS‖²_F. Most studies adopt the SVD of W and truncate it to the desired rank. However, when the tracking matrix is incomplete, e.g., some features are missing in some frames due to occlusions, it is hard to perform the SVD. In the case of missing data, we can replace step 2 of Algorithm 10 with the power factorization algorithm (Hartley and Schaffalitzky 2003; Wang and Wu 2008a) to obtain a least-squares solution of M̂ and Ŝ, and then upgrade the solution to Euclidean space according to the proposed scheme.
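Putting the pieces together, here is a minimal end-to-end sketch of Algorithm 10 (steps 2 to 6, skipping balancing and bundle adjustment) under the one-parameter camera. `upgrade_left` refers to the sketch given after eq. (42); all other names are illustrative, and the choice of σ_r is one arbitrary valid pick.

```python
import numpy as np

def quasi_perspective_rigid(W):
    """End-to-end sketch of Algorithm 10, steps 2-6, assuming Q comes
    out positive semidefinite (otherwise eq. (44) would be needed)."""
    m = W.shape[0] // 3
    # step 2: rank-4 SVD factorization
    U, s, Vt = np.linalg.svd(W, full_matrices=False)
    M_hat = U[:, :4] * np.sqrt(s[:4])
    S_hat = np.sqrt(s[:4])[:, None] * Vt[:4]
    # step 3: left part of the upgrading matrix
    Q, Hl = upgrade_left(M_hat, m)
    # step 4: Hr = sigma_r * u4 from the SVD of Hl, eqs. (48)-(49)
    Uh, sh, _ = np.linalg.svd(Hl)
    Hr = sh[1] * Uh[:, 3]            # any sigma_r in [sh[2], sh[0]] works
    H = np.column_stack([Hl, Hr])
    # step 5: Euclidean motion and shape
    M = M_hat @ H
    S = np.linalg.solve(H, S_hat)
    # step 6, per frame: mu_i, f_i, R_i, T_i from eqs. (50)-(52)
    params = []
    for i in range(m):
        Mi = M[3 * i:3 * i + 3]
        mu = np.linalg.norm(Mi[2, :3])               # eq. (50)
        f = np.linalg.norm(Mi[0, :3]) / mu           # eq. (51)
        Kinv = np.diag([1 / f, 1 / f, 1.0])
        R = Kinv @ Mi[:, :3] / mu                    # eq. (52)
        T = Kinv @ Mi[:, 3] / mu
        params.append((mu, f, R, T))
    return M, S, params
```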
5 Quasi-perspective Nonrigid Factorization

For nonrigid factorization, we still follow Bregler's assumption (7) and represent a nonrigid shape by a weighted combination of k shape bases. Under quasi-perspective projection, the structure is expressed in homogeneous form with nonzero scalars. Let us denote the scale-weighted nonrigid structure associated with the i-th frame as S̄_i = [ϱ_1X̄_1, ..., ϱ_nX̄_n], and the l-th scale-weighted shape basis as B_l = [ϱ_1X̄_1^l, ..., ϱ_nX̄_n^l]. Then from (7) we have

$$\bar{X}_j = \sum_{l=1}^{k} \omega_{il} \bar{X}_j^l, \qquad j = 1, \ldots, n \qquad (54)$$

Multiplying both sides by the weight scale ϱ_j gives

$$\varrho_j \bar{X}_j = \sum_{l=1}^{k} \omega_{il} (\varrho_j \bar{X}_j^l), \qquad j = 1, \ldots, n \qquad (55)$$

from which we immediately have the following result:

$$S_i = \begin{bmatrix} \bar{S}_i \\ \boldsymbol{\varrho}^T \end{bmatrix} = \begin{bmatrix} \sum_{l=1}^{k} \omega_{il} B_l \\ \boldsymbol{\varrho}^T \end{bmatrix} \qquad (56)$$

where ϱ = [ϱ_1, ..., ϱ_n]^T. We call (56) the Extended Bregler's assumption for the homogeneous case. Under this extension, the quasi-perspective projection of the i-th frame can be formulated as

$$W_i = (\mu_i P_i) S_i = [\mu_i P_i^{(1:3)}, \mu_i P_i^{(4)}] \begin{bmatrix} \sum_{l=1}^{k} \omega_{il} B_l \\ \boldsymbol{\varrho}^T \end{bmatrix} = [\omega_{i1}\mu_i P_i^{(1:3)}, \ldots, \omega_{ik}\mu_i P_i^{(1:3)}, \mu_i P_i^{(4)}] \begin{bmatrix} B_1 \\ \vdots \\ B_k \\ \boldsymbol{\varrho}^T \end{bmatrix} \qquad (57)$$

Thus the nonrigid factorization under quasi-perspective projection can be expressed as

$$W_{3m\times n} = \begin{bmatrix} \omega_{11}\mu_1 P_1^{(1:3)} & \cdots & \omega_{1k}\mu_1 P_1^{(1:3)} & \mu_1 P_1^{(4)} \\ \vdots & \ddots & \vdots & \vdots \\ \omega_{m1}\mu_m P_m^{(1:3)} & \cdots & \omega_{mk}\mu_m P_m^{(1:3)} & \mu_m P_m^{(4)} \end{bmatrix} \begin{bmatrix} B_1 \\ \vdots \\ B_k \\ \boldsymbol{\varrho}^T \end{bmatrix} \qquad (58)$$
or, concisely in matrix form,

$$W_{3m\times n} = M_{3m\times(3k+1)}\, B_{(3k+1)\times n} \qquad (59)$$

The factorization expression is similar to (9); however, the difficult problem of estimating the projective depths is avoided here. The rank of the tracking matrix is at most 3k+1, and the factorization is again defined up to a transformation matrix H ∈ R^{(3k+1)×(3k+1)}. Suppose the SVD factorization of the tracking matrix under the rank constraint is W = M̂B̂. As in the rigid case, we can adopt the metric constraint to compute an upgrading matrix. Let us split the matrix into k+1 parts as

$$H = [H_1, \ldots, H_k \,|\, H_r] \qquad (60)$$

where H_l ∈ R^{(3k+1)×3} (l = 1,...,k) denotes the l-th triple of columns of H, and H_r denotes the last column of H. Then we have

$$\hat{M}_i H_l = \omega_{il}\mu_i P_i^{(1:3)} = \omega_{il}\mu_i K_i R_i, \qquad (61)$$

$$\hat{M}_i H_r = \mu_i P_i^{(4)} = \mu_i K_i T_i \qquad (62)$$

Similar to (38) in the rigid case, let us denote C_{ii'} = M̂_i Q_l M̂_{i'}^T with Q_l = H_lH_l^T; we get

$$C_{ii'} = \hat{M}_i Q_l \hat{M}_{i'}^T = (\omega_{il}\mu_i K_i R_i)(\omega_{i'l}\mu_{i'} K_{i'} R_{i'})^T = \omega_{il}\omega_{i'l}\mu_i\mu_{i'} K_i (R_i R_{i'}^T) K_{i'}^T \qquad (63)$$

where i and i' (= 1,...,m) correspond to different frame numbers, and l = 1,...,k corresponds to different shape bases. Assuming a simplified camera model with only one parameter, K_i = diag(f_i, f_i, 1), we have

$$C_{ii} = \hat{M}_i Q_l \hat{M}_i^T = \omega_{il}^2\mu_i^2 \begin{bmatrix} f_i^2 & & \\ & f_i^2 & \\ & & 1 \end{bmatrix} \qquad (64)$$

from which we can obtain the following four constraints:

$$\begin{cases} f_1(Q_l) = C_{ii}(1,2) = 0 \\ f_2(Q_l) = C_{ii}(1,3) = 0 \\ f_3(Q_l) = C_{ii}(2,3) = 0 \\ f_4(Q_l) = C_{ii}(1,1) - C_{ii}(2,2) = 0 \end{cases} \qquad (65)$$

These constraints are similar to (39) in the rigid case. However, the matrix Q_l in (64) is a (3k+1)×(3k+1) symmetric matrix; according to Proposition 5 it has 9k degrees of freedom, since it can be decomposed into the product of a (3k+1)×3 vertical extended upper triangular matrix and its transpose. Given m frames, we have 4m linear constraints on Q_l. It would appear that with enough features and frames the matrix Q_l could be solved linearly by stacking all the constraints in (65). Unfortunately, the rotation constraints alone may be insufficient when an object deforms at varying speed, since most of the constraints are redundant. Xiao and Kanade (2005) proposed a basis constraint to resolve this ambiguity. The main idea is to select k frames that contain independent shapes and treat them as a set of bases. Suppose the first k frames are independent of each other; then their corresponding weighting coefficients can be set as

$$\omega_{il} = \begin{cases} 1 & \text{if } i, l = 1, \ldots, k \text{ and } i = l \\ 0 & \text{if } i, l = 1, \ldots, k \text{ and } i \ne l \end{cases} \qquad (66)$$

From (63) we can obtain the following basis constraint:

$$C_{ii'} = \begin{bmatrix} 0 & 0 & 0 \\ 0 & 0 & 0 \\ 0 & 0 & 0 \end{bmatrix} \quad \text{if } i = 1, \ldots, k,\ i' = 1, \ldots, m,\ \text{and } i \ne l \qquad (67)$$

Given m images, (67) provides 9m(k−1) linear constraints on the matrix Q_l (some of the constraints are redundant, since Q_l is symmetric). By combining the rotation constraint (65) and the basis constraint (67), the matrix Q_l can be computed linearly. Each H_l, l = 1,...,k, can then be decomposed from Q_l according to the following result.

Result 13 The matrix Q_l is a (3k+1)×(3k+1) positive semidefinite symmetric matrix of rank 3. It can be decomposed as Q_l = H_lH_l^T, where H_l is a (3k+1)×3 matrix of rank 3. The decomposition can further be written as Q_l = Λ_3Λ_3^T with Λ_3 a (3k+1)×3 vertical extended upper triangular matrix.

This result is easily derived from Proposition 5. Note that Proposition 9 remains valid in the nonrigid case, so the vector H_r in (60) can be set to an arbitrary (3k+1)-vector that is independent of all columns of {H_l}_{l=1,...,k}. After recovering the Euclidean upgrading matrix, the camera parameters, motions, shape bases, and weighting coefficients can easily be determined from the motion matrix M = M̂H and shape matrix B = H^{-1}B̂.
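As in the rigid case, the numerical entry point is a truncated SVD, now at rank 3k+1. A minimal sketch, together with a crude heuristic (ours, not the paper's) for guessing k from the singular-value spectrum:

```python
import numpy as np

def nonrigid_factorize(W, k):
    """Rank-(3k+1) truncation of the nonrigid tracking matrix (59).
    The factors remain defined up to the (3k+1) x (3k+1) upgrading
    matrix H, to be recovered from (65) plus the basis constraint (67)."""
    r = 3 * k + 1
    U, s, Vt = np.linalg.svd(W, full_matrices=False)
    sr = np.sqrt(s[:r])
    return U[:, :r] * sr, sr[:, None] * Vt[:r]

def estimate_num_bases(W, tol=1e-3):
    """Guess k from rank(W) ~ 3k + 1; the threshold is a heuristic,
    and real (noisy) data needs a more careful model-selection rule."""
    s = np.linalg.svd(W, compute_uv=False)
    rank = int(np.sum(s > tol * s[0]))
    return max(1, round((rank - 1) / 3))
```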
6 Evaluations on Synthetic Data

6.1 Evaluation on Quasi-perspective Projection

During the simulation, we randomly generated 200 points within a 20×20×20 cube in space, as shown in Fig. 2(a), where only the first 50 points are displayed for simplicity. The depth variation of the space points in the Z-direction is shown in Fig. 2(b). We simulated 10 images of these points by perspective projection. The image size is set at 800×800, and the camera parameters are set as follows: the focal lengths are set randomly between 900 and 1100, the principal point is set at the image center, and the skew is zero. The rotation angles are set randomly between ±5°. The X and Y positions of the cameras are set randomly between ±15, while the Z positions are set evenly from 200 to 220. The true projective depths λ_ij associated with these points across the 10 views are shown in Fig. 2(c), where the values are normalized to unit mean. We then estimate λ_1j and μ_i from (13) and (14) and construct the estimated projective depths λ̂_ij = λ_1j/μ_i; the registered result is shown in Fig. 2(d). We can see from the experiment that the recovered projective depths are very close to the ground truth and are generally proportional to the variation of the space points in the Z-direction. If we adopt the affine camera model, it is equivalent to setting all projective depths to 1, whose error is obviously much bigger than that of the quasi-perspective assumption.

Fig. 2 Evaluation on projective depth approximation of the first 50 points. (a) and (b) Coordinates of the synthetic space points. (c) and (d) The real and approximated projective depths under the quasi-perspective assumption

According to the projection equations (28) to (32), different images are obtained under different camera models. We generated three sets of images from the simulated space points via the general perspective, affine, and quasi-perspective projection models, and compared the errors of the quasi-perspective projection model (31) and the affine assumption (32). The mean errors of the different models in each frame are shown in Fig. 3(a); the histogram distribution of the errors for all 200 points across the 10 frames is shown in Fig. 3(b). From the results, we can see that the error of the quasi-perspective assumption is much smaller than that under the affine assumption.

The influence of different imaging conditions on the quasi-perspective assumption was also investigated. First, we fix the camera position as given in the first test and vary the amplitude of the rotation angles from ±5° to ±50° in steps of 5°. At each step, we check the relative error of the recovered projective depths, defined as

$$e_{ij} = \frac{|\lambda_{ij} - \hat{\lambda}_{ij}|}{\lambda_{ij}} \times 100\ (\%) \qquad (68)$$
where λ̂_ij is the estimated projective depth. We carried out 100 independent tests at each step to obtain statistically meaningful results. The mean and standard deviation of e_ij are shown in Fig. 4(a). We then fix the rotation angles at ±5° and vary the relative distance of the camera to the object (i.e., the ratio between the distance of the camera to the object center and the object depth) from 2 to 20 in steps of 2. The mean and standard deviation of e_ij at each step over 100 tests are shown in Fig. 4(b). The results show that the quasi-perspective projection is a good approximation (e_ij < 0.5%) when the rotation angles are less than ±35° and the relative distance is larger than 6. Note that these results are obtained from noise-free data.

Fig. 3 Evaluation of the imaging errors by different camera models. (a) The mean error in each frame. (b) The histogram distribution of the errors under the quasi-perspective and affine projection models

Fig. 4 Evaluation on quasi-perspective projection under different imaging conditions. (a) The relative error of the estimated depths with different rotation angles. (b) The relative error with respect to different relative distances

6.2 Evaluation on Rigid Factorization

We added Gaussian white noise to the initially generated 10 images, varying the noise level from 0 to 3 pixels in steps of 0.5. At each noise level, we reconstructed the 3D structure of the object, which is defined up to a similarity transformation with respect to the ground truth. We register the reconstructed model with the ground truth and calculate the reconstruction error, defined as the mean point-wise distance between the reconstructed structure and the ground truth. The mean and standard deviation of the error over 100 independent tests are shown in Fig. 5. The proposed algorithm (Quasi) is compared with Poelman and Kanade (1997) under the affine assumption (Affine) and Han and Kanade (2000) under perspective projection (Persp). We then take these solutions as initial values and perform perspective optimization through LM iterations. It is evident that the proposed method performs much better than the affine method, and the optimized solution (Quasi+LM) is very close to perspective projection with optimization (Persp+LM).

The proposed model is based on the assumption of a large relative camera-to-object distance and small camera rotations. We compared the effect of the two factors on different camera models. In the first case, we vary the relative distance from 4 to 18 in steps of 2. At each relative distance, we generated 20 images with the following parameters: the rotation angles are confined to ±5°, and the X and Y positions of the camera are set randomly between ±15. We recovered the structure and computed the reconstruction error for each group of images; the mean errors of the different methods are shown in Fig. 6(a). In the second case, we increase the rotation angles to the range of ±20° and retain the other camera parameters as in the first case; the mean reconstruction error is given in Fig. 6(b). The results are evaluated over 100 independent tests with 1-pixel Gaussian noise. We can draw the following conclusions from the results. (1) The error of quasi-perspective projection is consistently smaller than that of affine, especially at small relative distances. (2) Both reconstruction errors of affine and quasi-perspective projection increase greatly when the relative distance is less than 6, since both models are based on the large-distance assumption. (3) The error at each relative distance increases with the rotation angles, especially at small relative distances, since the projective depths are related to the rotation angles. (4) Theoretically, the relative distance and rotation angles have no influence on the result of full perspective projection; however, the error of perspective projection also increases slightly with increasing rotation angles and decreasing relative distance. This is because we estimate the projective depths iteratively starting from an affine assumption (Han and Kanade 2000), and the iteration easily gets stuck in local minima due to bad initialization.

Fig. 5 Evaluation on rigid factorization. The mean (a) and standard deviation (b) of the reconstruction errors by different algorithms at different noise levels

Fig. 6 The mean reconstruction error of different projection models with respect to varying relative distance. The rotation angles of the camera are confined to a range of (a) ±5° and (b) ±20°

We compared the computation time of different factorization algorithms without LM optimization. The program was implemented in Matlab 6.5 on an Intel Pentium 4 3.6 GHz CPU. In this test, we use all 200 feature points and vary the frame number from 5 to 200 to generate different data sizes. The actual computation times (in seconds) for the different data sets are listed in Table 2.
Table 2 The average computation time (seconds) of different algorithms

Frame number    5       10      50      100     150     200
Affine          0.015   0.015   0.031   0.097   0.156   0.219
Quasi           0.015   0.016   0.047   0.156   0.297   0.531
Persp           0.281   0.547   3.250   6.828   10.58   15.25
The computation time for perspective projection is measured at 10 iterations (it usually takes about 30 iterations to compute the projective depths in perspective factorization). Clearly, the computation time of quasi-perspective is at the same level as that under the affine assumption, while perspective factorization is computationally far more intensive than the other methods.

6.3 Evaluation on Nonrigid Factorization

In this test, we generated a synthetic cube with 6 evenly distributed points on each visible edge. Three sets of moving points on adjacent surfaces of the cube move across the surfaces at constant speed, as shown in Fig. 7(a); each moving set is composed of 5 points. The cube with the moving points can be taken as a nonrigid object with 2 shape bases. We generated 10 frames with the same camera parameters as in the first test of the rigid case, and reconstructed the structure associated with each frame by the proposed method, as shown in Fig. 7(b) and (c). The structure after optimization is visually the same as the ground truth, while the result before optimization is slightly deformed due to the perspective effect.

We compared our method with nonrigid factorization under the affine assumption (Xiao et al. 2006) and under perspective projection (Xiao and Kanade 2005). The mean and standard deviation of the reconstruction errors with respect to different noise levels are shown in Fig. 8. It is clear that the proposed method performs much better than that under the affine camera model.

Fig. 7 Simulation results on nonrigid factorization. (a) Two synthetic cubes with moving points in space. (b) The quasi-perspective factorization result for two frames superimposed on the ground truth. (c) The final structures after optimization

Fig. 8 Evaluation on nonrigid factorization. The mean (a) and standard deviation (b) of the reconstruction errors by different algorithms at different noise levels
7 Evaluation on Real Image Sequences

We tested the proposed method on many real sequences; we report the results of four experiments here. All images in the tests, except those of the Franck face sequence, were captured by a Canon Powershot G3 camera at a resolution of 1024 × 768. In order to ensure a large overlap of the object to be reconstructed, the camera underwent only small movements during image acquisition, hence the quasi-perspective assumption is satisfied for all these sequences. Please refer to the supplementary video for details of these test results.

7.1 Test on Stone Post Sequence

There are 8 images in the stone post sequence, which were taken at the Sculpture Park near downtown Windsor. We established initial correspondences using the technique of Wang (2006) and eliminated outliers iteratively as in Torr et al. (1998). In total, 3693 reliable features were tracked across the sequence; the features in two frames, with their relative disparities, are shown in Fig. 9. We recovered the 3D structure of the object and the camera motions using the proposed algorithm, as well as some previous methods. The recovered camera focal lengths are listed in Table 3, where we give the result of the first frame only due to limited space; 'Quasi+LM', 'Affine+LM', and 'Persp+LM' stand for quasi-perspective, affine, and perspective factorization with global optimization, respectively. Figure 9 shows the reconstructed VRML model with texture and the corresponding triangulated wireframe viewed from different viewpoints. The reconstructed model is visually plausible and realistic.

To give a comparative quantitative evaluation, we reproject the reconstructed 3D structure back onto the images and calculate the reprojection errors, defined as the distances between the detected and reprojected image points. Figure 10 shows the histogram distributions of the errors using 9 bins; the corresponding mean ('Mean') and standard deviation ('STD') of the errors are listed in Table 3. We can see that the reprojection error of the proposed model is much smaller than that under the affine assumption.

Fig. 9 Reconstruction result of the stone post sequence. (a) Three images from the sequence, where the tracked features with relative disparities are overlaid on the second and third images. (b) The reconstructed VRML model of the scene shown from different viewpoints with texture mapping. (c) The corresponding triangulated wireframe of the reconstructed model
7.2 Test on Fountain Base Sequence

There are 7 images in the fountain base sequence, which were also taken at the Sculpture Park in Windsor. The correspondences were established using the same technique as in the previous test. In total, 4218 reliable features were tracked across the sequence, as shown in Fig. 11(a). Figure 11(b) and (c) show the reconstructed VRML model with texture mapping and the corresponding triangulated wireframe from different viewpoints. The model looks realistic, and most details are correctly recovered by the method. A comparative analysis of the camera parameters and reprojection errors is presented in Table 3 and Fig. 10, respectively. We can see from the results that the proposed scheme outperforms the one under the affine camera model.

7.3 Test on Dynamic Grid Sequence

There are 12 images in the dynamic grid sequence. The background of the sequence consists of two orthogonal sheets with square grids, which are used as ground truth for evaluation. On the two orthogonal surfaces, there are three objects that move linearly in three directions. We established correspondences using the method of Wang (2006) and eliminated outliers interactively.

Table 3 Camera parameters of the first frame and reprojection errors in the real sequence tests

Sequence        Method       Focus (f)   Mean    STD
Stone post      Quasi+LM     2151.8      0.421   0.292
                Affine+LM    2167.3      0.667   0.461
                Persp+LM     2154.6      0.237   0.164
Fountain base   Quasi+LM     2140.5      0.418   0.285
                Affine+LM    2153.4      0.629   0.439
                Persp+LM     2131.7      0.240   0.168
In total, 206 features were tracked across the sequence, where 140 features belong to the static background and 66 features belong to the three moving objects, as shown in Fig. 12(a). We recovered the metric structure of the scenario using the proposed method. Figure 12(b) and (c) show the reconstructed VRML models and corresponding wireframes associated with two dynamic positions. It is clear that the dynamic structure is correctly recovered. Taking the two orthogonal gridded sheets of the background as ground truth, we compute the angle (in degrees) between the two reconstructed background surfaces, the length ratio of the two diagonals of each square grid, and the angle formed by the two diagonals. The mean errors of these three values are denoted by Eα1, Erat, and Eα2, respectively. The mean reprojection error Erep1 of the reconstructed structure is also computed. For comparison, the results obtained by the different methods are listed in Table 4; the proposed model outperforms the affine model.

7.4 Test on Franck Face Sequence

The Franck face sequence was downloaded from the European working group on face and gesture recognition (www-prima.inrialpes.fr/FGnet/). We selected 60 frames with various facial expressions for the test. The image resolution is 720 × 576, and there are 68 tracked features across the sequence, which were also downloaded from the Internet. Figure 13 shows the reconstructed models of four frames obtained with the proposed method. Different facial expressions are correctly recovered, though some points are not very accurate due to tracking errors. The result could be used for visualization and recognition. For analysis, the relative reprojection errors Erep2 produced by the different methods are listed in Table 4. We can see that in all these tests, the accuracy of the proposed method is fairly close to that of full perspective projection and much better than that of the affine assumption.
Fig. 10 The histogram distributions of the reprojection errors by different algorithms in the real sequence tests. (a) Result of the stone post sequence. (b) Result of the fountain base sequence
Fig. 11 Reconstruction result of the fountain base sequence. (a) Three images from the sequence, where the tracked features with relative disparities are overlaid on the second and third images. (b) The reconstructed VRML model of the scene shown from different viewpoints with texture mapping. (c) The corresponding triangulated wireframe of the reconstructed model
Table 4 Performance comparison on the grid and face sequences

Method      Eα1     Eα2     Erat    Erep1   Erep2
Quasi       1.62    0.75    0.12    4.37    5.26
Affine      2.35    0.92    0.15    5.66    6.58
Persp       1.28    0.63    0.10    3.64    4.35
Quasi+LM    0.58    0.26    0.04    1.53    2.47
Affine+LM   0.96    0.37    0.07    2.25    3.19
Persp+LM    0.52    0.24    0.04    1.46    1.96
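For reference, the surface-angle error Eα1 can be measured by fitting a plane to the reconstructed points of each background sheet and comparing the fitted normals (the true angle is 90°). The sketch below is illustrative only; the 3 × N point arrays and the function names are assumptions, not taken from the original implementation:

    import numpy as np

    def plane_normal(X):
        # Least-squares plane normal of 3xN points: the direction of least
        # variance, i.e. the last left singular vector of the centered data.
        Xc = X - X.mean(axis=1, keepdims=True)
        return np.linalg.svd(Xc)[0][:, -1]

    def surface_angle_deg(Xa, Xb):
        # Angle between the two fitted background planes, in degrees.
        c = abs(plane_normal(Xa) @ plane_normal(Xb))
        return np.degrees(np.arccos(np.clip(c, -1.0, 1.0)))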
8 Conclusion

In this paper, we proposed a quasi-perspective projection model and analyzed the projection errors of different projection models. We applied the proposed model to rigid and nonrigid factorization and elaborated the computation of the Euclidean upgrading matrix. The proposed method avoids the difficult problem of computing projective depths in perspective factorization. It is computationally simple and achieves better accuracy than the affine approximation. The proposed model is suitable for structure and motion factorization of short sequences with small camera motions. Experiments demonstrated the improvements of our algorithm over existing techniques. It should be noted that the small-rotation assumption of the proposed model is not very restrictive and is usually satisfied in many real applications. During image acquisition of an object to be reconstructed, we tend to control the camera movement so as to guarantee a large overlap, which also facilitates the feature tracking process.
Fig. 12 Reconstruction results of the dynamic grid sequence. (a) Three images from the sequence overlaid with the tracked features, with relative disparities shown in the second and third images; note the three moving objects. (b) The reconstructed VRML model of the structure shown from different viewpoints with texture mapping. (c) The corresponding triangulated wireframe of the reconstructed model
For a long sequence of images taken around an object, the assumption is violated. However, we can simply divide the sequence into several subsequences with small movements, then register and merge the results of the subsequences to reconstruct the structure of the whole object.

Acknowledgements The authors would like to thank the anonymous reviewers for their valuable comments and constructive suggestions. The work is supported in part by the Natural Sciences and Engineering Research Council of Canada, and the National Natural Science Foundation of China under Grant No. 60575015.
Appendix 1: Proof of Proposition 5

Extended Cholesky Decomposition: Suppose $Q_n$ is an $n \times n$ positive semidefinite symmetric matrix of rank $k$. Then it can be decomposed as $Q_n = H_k H_k^T$, where $H_k$ is an $n \times k$ matrix of rank $k$. Furthermore, the decomposition can be written as $Q_n = \Lambda_k \Lambda_k^T$ with $\Lambda_k$ an $n \times k$ vertical extended upper triangular matrix. The number of degrees of freedom of the matrix $Q_n$ is $nk - \frac{1}{2}k(k-1)$, which is the number of unknowns in $\Lambda_k$.

Proof Since $Q_n$ is an $n \times n$ positive semidefinite symmetric matrix of rank $k$, it can be decomposed by SVD as

$$Q_n = U \Sigma U^T = U \,\mathrm{diag}(\sigma_1, \ldots, \sigma_k, 0, \ldots, 0)\, U^T \tag{69}$$

where $U$ is an $n \times n$ orthogonal matrix and $\Sigma$ is a diagonal matrix with the $\sigma_i$ the singular values of $Q_n$.
Fig. 13 Reconstruction of different facial expressions in the Franck face sequence. (a) Four frames from the sequence with the 68 tracked features overlaid on the last frame. (b) Front, side, and top views of the reconstructed VRML models with texture mapping. (c) The corresponding triangulated wireframe of the reconstructed model
Thus we immediately have

$$H_k = U_{(1:k)} \,\mathrm{diag}(\sqrt{\sigma_1}, \ldots, \sqrt{\sigma_k}) = \begin{bmatrix} H_{ku} \\ H_{kl} \end{bmatrix} \tag{70}$$

such that $Q_n = H_k H_k^T$, where $U_{(1:k)}$ denotes the first $k$ columns of $U$, $H_{ku}$ denotes the upper $(n-k) \times k$ submatrix of $H_k$, and $H_{kl}$ denotes the lower $k \times k$ submatrix of $H_k$. By applying RQ decomposition to $H_{kl}$, we have $H_{kl} = \Lambda_{kl} O_k$, where $\Lambda_{kl}$ is an upper triangular matrix and $O_k$ is an orthogonal matrix. Let us denote $H_{ku} O_k^T$ by $\Lambda_{ku}$ and construct the $n \times k$ vertical extended upper triangular matrix $\Lambda_k = \begin{bmatrix} \Lambda_{ku} \\ \Lambda_{kl} \end{bmatrix}$. Then we have $H_k = \Lambda_k O_k$, and

$$Q_n = H_k H_k^T = (\Lambda_k O_k)(\Lambda_k O_k)^T = \Lambda_k \Lambda_k^T \tag{71}$$

It is easy to verify that the number of degrees of freedom of the matrix $Q_n$ (i.e., the number of unknowns in $\Lambda_k$) is $nk - \frac{1}{2}k(k-1)$; for example, with $n = 4$ and $k = 3$ this gives $12 - 3 = 9$ unknowns. The proposition can be regarded as an extension of the Cholesky decomposition to positive semidefinite symmetric matrices, whereas the Cholesky decomposition itself applies only to positive definite symmetric matrices.
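As a sanity check, the construction in the proof can be reproduced numerically. The following sketch uses numpy and scipy.linalg.rq and assumes the rank $k$ is known in advance; the function and variable names are illustrative:

    import numpy as np
    from scipy.linalg import rq

    def extended_cholesky(Q, k):
        # Factor an n x n positive semidefinite symmetric Q of rank k as
        # Q = L @ L.T with L vertical extended upper triangular.
        n = Q.shape[0]
        U, s, _ = np.linalg.svd(Q)            # Q = U diag(s) U^T, eq. (69)
        Hk = U[:, :k] * np.sqrt(s[:k])        # Q = Hk Hk^T, eq. (70)
        Hku, Hkl = Hk[:n - k], Hk[n - k:]     # upper (n-k) x k / lower k x k blocks
        Lkl, Ok = rq(Hkl)                     # RQ decomposition: Hkl = Lkl Ok
        return np.vstack((Hku @ Ok.T, Lkl))   # Lambda_k, so Q = L @ L.T, eq. (71)

    A = np.random.randn(6, 3)                 # quick check on a rank-3 PSD matrix
    Q = A @ A.T
    L = extended_cholesky(Q, 3)
    assert np.allclose(Q, L @ L.T)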
Appendix 2: Proof of Proposition 9

Recovery of $H_r$: Suppose $H_l$ in (35) is already recovered. Let us construct a matrix $\tilde{H} = [H_l \,|\, \tilde{H}_r]$, where $\tilde{H}_r$ is an arbitrary 4-vector that is independent of the three columns of $H_l$. Then $\tilde{H}$ must be a valid upgrading matrix, i.e., $\tilde{M} = \hat{M}\tilde{H}$ is a valid Euclidean motion matrix, and $\tilde{S} = \tilde{H}^{-1}\hat{S}$ corresponds to a valid Euclidean shape matrix.
Proof Suppose the correct transformation matrix is $H = [H_l \,|\, H_r]$. Then from

$$S = H^{-1}\hat{S} = \begin{bmatrix} \mu_1 \bar{X}_1 & \cdots & \mu_n \bar{X}_n \\ \mu_1 & \cdots & \mu_n \end{bmatrix} \tag{72}$$

we can obtain one correct Euclidean structure $[\bar{X}_1, \ldots, \bar{X}_n]$ of the object, under a certain world coordinate frame, by dehomogenizing the shape matrix $S$. The arbitrarily constructed matrix $\tilde{H} = [H_l \,|\, \tilde{H}_r]$ and the correct matrix $H$ are defined up to a $4 \times 4$ invertible matrix $G$ of the form

$$H = \tilde{H} G, \qquad G = \begin{bmatrix} I_3 & g \\ 0^T & s \end{bmatrix} \tag{73}$$

where $I_3$ is a $3 \times 3$ identity matrix, $g$ is a 3-vector, $0$ is a zero 3-vector, and $s$ is a nonzero scalar. Under the transformation matrix $\tilde{H}$, the motion $\hat{M}$ and shape $\hat{S}$ are transformed to

$$\tilde{M} = \hat{M}\tilde{H} = \hat{M} H G^{-1} = M \begin{bmatrix} I_3 & -g/s \\ 0^T & 1/s \end{bmatrix} \tag{74}$$

$$\tilde{S} = \tilde{H}^{-1}\hat{S} = (H G^{-1})^{-1}\hat{S} = G(H^{-1}\hat{S}) = s \begin{bmatrix} \mu_1(\bar{X}_1 + g)/s & \cdots & \mu_n(\bar{X}_n + g)/s \\ \mu_1 & \cdots & \mu_n \end{bmatrix} \tag{75}$$

We can see from (75) that the new shape $\tilde{S}$ is simply the original structure after a translation $g$ and a scaling by $1/s$, which does not change the Euclidean structure. From (74) we have $\tilde{M}_{(1:3)} = M_{(1:3)}$, which indicates that the first three columns of the new motion matrix (corresponding to the rotation part) do not change, while the last column, which corresponds to the translation part, is modified in accordance with the translation and scale change of the structure. Therefore, the constructed matrix $\tilde{H}$ is a valid transformation matrix that upgrades the factorization from projective space to Euclidean space.
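The proposition can also be verified numerically. The sketch below builds a synthetic example: it replaces the last column of a random upgrading matrix by an arbitrary independent vector and checks that the recovered shape changes only by the translation $g$ and scale $1/s$ of (73)-(75). All names and data are illustrative:

    import numpy as np

    rng = np.random.default_rng(0)
    H = rng.standard_normal((4, 4))        # a "correct" upgrading matrix [Hl | Hr]
    S_hat = rng.standard_normal((4, 20))   # recovered projective shape, 20 points
    S = np.linalg.inv(H) @ S_hat
    X = S[:3] / S[3]                       # true Euclidean points, dehomogenized

    Ht = H.copy()
    Ht[:, 3] = rng.standard_normal(4)      # arbitrary independent last column
    St = np.linalg.inv(Ht) @ S_hat
    Xt = St[:3] / St[3]

    G = np.linalg.inv(Ht) @ H              # G of eq. (73): [[I3, g], [0^T, s]]
    g, sc = G[:3, 3], G[3, 3]
    assert np.allclose(G[:3, :3], np.eye(3)) and np.allclose(G[3, :3], 0.0)
    assert np.allclose(Xt, (X + g[:, None]) / sc)   # eq. (75)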
References

Bascle, B., & Blake, A. (1998). Separability of pose and expression in facial tracking and animation. In Proceedings of the international conference on computer vision (pp. 323–328) 1998.
Brand, M. (2001). Morphable 3D models from video. In Proceedings of IEEE conference on computer vision and pattern recognition (Vol. 2, pp. 456–463) 2001.
Brand, M. (2005). A direct method for 3D factorization of nonrigid motion observed in 2D. In Proceedings of IEEE conference on computer vision and pattern recognition (Vol. 2, pp. 122–128) 2005.
Bregler, C., Hertzmann, A., & Biermann, H. (2000). Recovering nonrigid 3D shape from image streams. In Proceedings of IEEE conference on computer vision and pattern recognition (Vol. 2, pp. 690–696) 2000.
Buchanan, A. M., & Fitzgibbon, A. W. (2005). Damped Newton algorithms for matrix factorization with missing data. In Proceedings of IEEE conference on computer vision and pattern recognition (Vol. 2, pp. 316–322) 2005.
Chen, P. (2008). Optimization algorithms on subspaces: revisiting missing data problem in low-rank matrix. International Journal of Computer Vision, 80(1), 125–142.
Christy, S., & Horaud, R. (1996). Euclidean shape and motion from multiple perspective views by affine iterations. IEEE Transactions on Pattern Analysis and Machine Intelligence, 18(11), 1098–1104.
Costeira, J., & Kanade, T. (1998). A multibody factorization method for independently moving objects. International Journal of Computer Vision, 29(3), 159–179.
Del Bue, A., Smeraldi, F., & de Agapito, L. (2004). Non-rigid structure from motion using nonparametric tracking and non-linear optimization. In IEEE workshop on articulated and nonrigid motion (ANM04), held in conjunction with CVPR 2004 (pp. 8–15), June 2004.
Del Bue, A., Lladó, X., & de Agapito, L. (2006). Non-rigid metric shape and motion recovery from uncalibrated images using priors. In Proceedings of IEEE conference on computer vision and pattern recognition (Vol. 1, pp. 1191–1198) 2006.
Han, M., & Kanade, T. (2000). Creating 3D models with uncalibrated cameras. In Proceedings of IEEE computer society workshop on the application of computer vision (WACV 2000), December 2000.
Hartley, R. (1997). Kruppa's equations derived from the fundamental matrix. IEEE Transactions on Pattern Analysis and Machine Intelligence, 19(2), 133–135.
Hartley, R., & Schaffalitzky, F. (2003). PowerFactorization: 3D reconstruction with missing or uncertain data. In Australia–Japan advanced workshop on computer vision, 2003.
Hartley, R., & Vidal, R. (2008). Perspective nonrigid shape and motion recovery. In ECCV (1), Lecture notes in computer science: Vol. 5302 (pp. 276–289). Berlin: Springer.
Hartley, R., & Zisserman, A. (2004). Multiple view geometry in computer vision (2nd edn.). Cambridge: Cambridge University Press.
Heyden, A., & Åström, K. (1997). Euclidean reconstruction from image sequences with varying and unknown focal length and principal point. In IEEE conference on computer vision and pattern recognition (pp. 438–443) 1997.
Heyden, A., Berthilsson, R., & Sparr, G. (1999). An iterative factorization method for projective structure and motion from image sequences. Image and Vision Computing, 17(13), 981–991.
Li, T., Kallem, V., Singaraju, D., & Vidal, R. (2007). Projective factorization of multiple rigid-body motions. In IEEE conference on computer vision and pattern recognition, 2007.
Luong, Q., & Faugeras, O. (1997). Self-calibration of a moving camera from point correspondences and fundamental matrices. International Journal of Computer Vision, 22(3), 261–289.
Mahamud, S., & Hebert, M. (2000). Iterative projective reconstruction from multiple views. In IEEE conference on computer vision and pattern recognition (Vol. 2, pp. 430–437) 2000.
Maybank, S., & Faugeras, O. (1992). A theory of self-calibration of a moving camera. International Journal of Computer Vision, 8(2), 123–151.
Oliensis, J., & Hartley, R. (2007). Iterative extensions of the Sturm/Triggs algorithm: convergence and nonconvergence. IEEE Transactions on Pattern Analysis and Machine Intelligence, 29(12), 2217–2233.
Poelman, C., & Kanade, T. (1997). A paraperspective factorization method for shape and motion recovery. IEEE Transactions on Pattern Analysis and Machine Intelligence, 19(3), 206–218.
Pollefeys, M., Koch, R., & Van Gool, L. (1999). Self-calibration and metric reconstruction in spite of varying and unknown intrinsic camera parameters. International Journal of Computer Vision, 32(1), 7–25.
Quan, L. (1996). Self-calibration of an affine camera from multiple views. International Journal of Computer Vision, 19(1), 93–105.
Rabaud, V., & Belongie, S. (2008). Re-thinking non-rigid structure from motion. In IEEE conference on computer vision and pattern recognition, 2008.
Sturm, P. F., & Triggs, B. (1996). A factorization based algorithm for multi-image projective structure and motion. In European conference on computer vision (2) (pp. 709–720) 1996.
Tomasi, C., & Kanade, T. (1992). Shape and motion from image streams under orthography: a factorization method. International Journal of Computer Vision, 9(2), 137–154.
Torr, P. H. S., Zisserman, A., & Maybank, S. J. (1998). Robust detection of degenerate configurations while estimating the fundamental matrix. Computer Vision and Image Understanding, 71(3), 312–333.
Torresani, L., Yang, D. B., Alexander, E. J., & Bregler, C. (2001). Tracking and modeling non-rigid objects with rank constraints. In IEEE conference on computer vision and pattern recognition (Vol. 1, pp. 493–500) 2001.
Torresani, L., Hertzmann, A., & Bregler, C. (2008). Nonrigid structure-from-motion: estimating shape and motion with hierarchical priors. IEEE Transactions on Pattern Analysis and Machine Intelligence, 30(5), 878–892.
Triggs, B. (1996). Factorization methods for projective structure and motion. In Proceedings of the IEEE conference on computer vision and pattern recognition (pp. 845–851). San Francisco, California, USA, 1996.
Vidal, R., & Abretske, D. (2006). Nonrigid shape and motion from multiple perspective views. In European conference on computer vision (2). Lecture notes in computer science: Vol. 3952 (pp. 205–218). Berlin: Springer.
Vidal, R., Tron, R., & Hartley, R. (2008). Multiframe motion segmentation with missing data using PowerFactorization and GPCA. International Journal of Computer Vision, 79(1), 85–105.
Wang, G. (2006). A hybrid system for feature matching based on SIFT and epipolar constraints (Tech. Rep.). Department of Electrical and Computer Engineering, University of Windsor.
Wang, G., Tsui, H.-T., & Wu, J. (2008). Rotation constrained power factorization for structure from motion of nonrigid objects. Pattern Recognition Letters, 29(1), 72–80.
Wang, G., & Wu, Q. J. (2008a). Stratification approach for 3D Euclidean reconstruction of nonrigid objects from uncalibrated image sequences. IEEE Transactions on Systems, Man, and Cybernetics, Part B, 38(1), 90–101.
Wang, G., & Wu, J. (2008b). Quasi-perspective projection with applications to 3D factorization from uncalibrated image sequences. In IEEE conference on computer vision and pattern recognition, 2008.
Wang, G., Wu, J., & Zhang, W. (2008). Camera self-calibration and three dimensional reconstruction under quasi-perspective projection. In Proceedings of the Canadian conference on computer and robot vision (pp. 129–136) 2008.
Xiao, J., & Kanade, T. (2005). Uncalibrated perspective reconstruction of deformable structures. In Proceedings of the international conference on computer vision (Vol. 2, pp. 1075–1082) 2005.
Xiao, J., Chai, J., & Kanade, T. (2006). A closed-form solution to nonrigid shape and motion recovery. International Journal of Computer Vision, 67(2), 233–246.
Yan, J., & Pollefeys, M. (2005). A factorization-based approach to articulated motion recovery. In IEEE conference on computer vision and pattern recognition (2) (pp. 815–821) 2005.
Yan, J., & Pollefeys, M. (2008). A factorization-based approach for articulated nonrigid shape, motion and kinematic chain recovery from video. IEEE Transactions on Pattern Analysis and Machine Intelligence, 30(5), 865–877.