Structure and Motion Factorization under Quasi-Perspective Projection with Missing Data in Tracking Matrix

Guanghui Wang†‡, Q. M. Jonathan Wu†, Wei Huang†
† Department of ECE, University of Windsor, 401 Sunset, Windsor, ON, Canada, N9B 3P4.
‡ Department of Control Engineering, Aviation University, Changchun, 130022, China.

Abstract

The paper focuses on the problem of structure and motion factorization from uncalibrated image sequences. Building on our earlier study of quasi-perspective projection, we analyze the imaging errors of different projection models and propose to adopt the power factorization algorithm to handle the missing-data problem. The main contribution lies in two aspects. First, we carry out an error analysis of quasi-perspective projection and prove that it is more accurate than the affine model under small camera movements. Second, we propose to utilize power factorization to factorize the tracking matrix. Compared with the SVD-based method, the algorithm can work with incomplete tracking data and is computationally cheaper than other methods. The proposed method is evaluated on synthetic and real image sequences, and better results are observed.

1. Introduction

Structure and motion recovery from image sequences is an important theme in computer vision, and great progress has been made during the last two decades [4]. The factorization method was proposed by Tomasi and Kanade [10] in the early 1990s under the assumption of an orthographic projection model. Its main idea is to factorize the tracking matrix into motion and structure matrices simultaneously by singular value decomposition (SVD) with a low-rank constraint. The algorithm was extended to weak-perspective and paraperspective projection by Poelman and Kanade [7]. Quan [8] proposed a self-calibration algorithm for affine factorization. Different from SVD-based methods, Hartley and Schaffalitzky [5] proposed a power factorization algorithm to find a low-rank approximation of the tracking matrix under the affine assumption. An extension of the algorithm to the nonrigid case was studied in [14].


More generally, Christy and Horaud [2] extended the above methods to the perspective camera model by incrementally performing the factorization under the affine assumption; the method is an affine approximation to general perspective projection. Triggs and Sturm [9, 12] proposed a full projective reconstruction method via rank-4 factorization of a scaled tracking matrix, with the projective depths recovered from pairwise epipolar geometry. The method was further studied in [3, 6], where subspace constraints are embedded to recover the projective depths iteratively. In order to deal with nonrigid or dynamic scenes, many extensions stemming from the factorization algorithm were proposed to relax the rigidity constraint [1, 11, 14].

Most factorization methods are based on the affine model due to its simplicity. However, the affine model may cause large reconstruction errors if its assumption is not strictly satisfied. Perspective factorization is more complicated, and there is no guarantee that it will converge to the correct projective depths. In our earlier study [15], we proposed a quasi-perspective model under the assumption of small camera movements to make a trade-off between the simplicity of affine and the accuracy of perspective projection. In this paper, we first give an error analysis of different projection models. Then we propose to apply the power factorization algorithm to quasi-perspective factorization; the algorithm is simple and can work with an incomplete tracking matrix.

2. Camera projection model

Perspective projection is the most general camera model in computer vision. Under this model, a 3D point in space $\mathbf{X}_j = [x_j, y_j, z_j, 1]^T$ is projected onto the image point $\mathbf{x}_{ij} = [u_{ij}, v_{ij}, 1]^T$ in the $i$-th frame via a rank-3 projection matrix $P_i \in \mathbb{R}^{3\times4}$ as

$$\lambda_{ij}\,\mathbf{x}_{ij} = P_i \mathbf{X}_j = K_i [R_i, T_i] \mathbf{X}_j \tag{1}$$

where $K_i$, $R_i$, and $T_i$ are the calibration matrix, rotation matrix, and translation vector of the camera with respect to the world frame, and $\lambda_{ij}$ is a nonzero scale factor, commonly called the projective depth. Suppose the rotation angles around the three world axes are $\alpha_i$, $\beta_i$, $\gamma_i$; then we have [15]

$$\lambda_{ij} = P_{3i}^T \mathbf{X}_j = [\mathbf{r}_{3i}^T, t_{zi}] \mathbf{X}_j = -(\sin\beta_i)x_j + (\cos\beta_i \sin\alpha_i)y_j + (\cos\beta_i \cos\alpha_i)z_j + t_{zi} \tag{2}$$

where $P_{3i}^T$ and $\mathbf{r}_{3i}^T$ denote the third rows of $P_i$ and $R_i$, respectively, and $t_{zi}$ is the third element of $T_i$.

The perspective model is computationally complicated due to the unknown scalar $\lambda_{ij}$. When the distance from the object to the camera is much greater than the depth variation of the object, we may assume an affine camera model. Under the affine assumption, the last row of the projection matrix has the form $P_{3i}^T = [0, 0, 0, 1]$, so the unknown scalar is eliminated. This is equivalent to assuming all projective depths $\lambda_{ij} = 1$, which is only valid when the ratios of the true depths of different 3D points remain approximately constant through the sequence.

Under the affine condition, if we further assume that the camera undergoes small movements, we proved [15] that (i) the variation of the projective depth $\lambda_{ij}$ is mainly proportional to the depth of the space point, i.e. $\lambda_{ij} \approx (\cos\beta_i \cos\alpha_i)z_j + t_{zi}$; and (ii) the ratio of the projective depths corresponding to any two different frames can be approximated by a constant, i.e. $\lambda_{1j}/\lambda_{ij} \approx t_{z1}/t_{zi} = \mu_i$. Thus if we set $\varepsilon_j = 1/\lambda_{1j}$ and replace $P_i$ with $\mu_i P_i$ and $\mathbf{X}_j$ with $\varepsilon_j \mathbf{X}_j$, the projection (1) is approximated by

$$\mathbf{x}_{ij} = (\mu_i P_i)(\varepsilon_j \mathbf{X}_j) \tag{3}$$

We call (3) the quasi-perspective projection. We showed experimentally that this model is more accurate than the affine camera model, since the projective depths are implicitly embedded in the shape matrix.
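As a quick numerical check of assumptions (i) and (ii), the following sketch (ours, with illustrative parameter values, not code from the paper) evaluates the projective depth of Eq. (2) for small random rotations and a distant camera, and verifies that the depth ratio between frames is nearly constant across points:

```python
import numpy as np

def projective_depth(alpha, beta, t_z, X):
    """lambda_ij of Eq. (2): the third row of P_i applied to the
    point X = [x, y, z]^T (the angle gamma does not appear)."""
    x, y, z = X
    return (-np.sin(beta) * x + np.cos(beta) * np.sin(alpha) * y
            + np.cos(beta) * np.cos(alpha) * z + t_z)

rng = np.random.default_rng(0)
points = rng.uniform(-10, 10, (50, 3))            # object of size ~20
angles = np.deg2rad(rng.uniform(-5, 5, (8, 2)))   # small alpha_i, beta_i
t_z = rng.uniform(250, 270, 8)                    # distant camera

lam = np.array([[projective_depth(a, b, tz, X) for X in points]
                for (a, b), tz in zip(angles, t_z)])   # lambda_ij, 8 x 50

# assumption (ii): lambda_1j / lambda_ij is nearly constant over j,
# and close to mu_i = t_z1 / t_zi
ratios = lam[0] / lam
print(np.max(np.std(ratios, axis=1)))                  # tiny spread over points
print(np.max(np.abs(np.mean(ratios, axis=1) - t_z[0] / t_z)))
```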

3. Error analysis of different projections

We now present a simple analysis of the imaging errors of the quasi-perspective and affine camera models with respect to general perspective projection. For simplicity, the frame subscript $i$ is omitted in this section. Let us set the origin of the world system at the center of the object and place the camera along the $Z$ direction. Suppose the camera's intrinsic parameters are known and the image is normalized by applying $K^{-1}$ to each frame. Then the projection matrices $P$ of perspective projection, $P_q$ of quasi-perspective projection, and $P_a$ of affine projection can be written as

$$P = \begin{bmatrix} \mathbf{r}_1^T & t_x \\ \mathbf{r}_2^T & t_y \\ \mathbf{r}_3^T & t_z \end{bmatrix}, \quad P_q = \begin{bmatrix} \mathbf{r}_1^T & t_x \\ \mathbf{r}_2^T & t_y \\ \mathbf{r}_{3q}^T & t_z \end{bmatrix}, \quad P_a = \begin{bmatrix} \mathbf{r}_1^T & t_x \\ \mathbf{r}_2^T & t_y \\ \mathbf{0}^T & t_z \end{bmatrix} \tag{4}$$

where $\mathbf{r}_3^T = [-\sin\beta, \cos\beta\sin\alpha, \cos\beta\cos\alpha]$, $\mathbf{r}_{3q}^T = [0, 0, \cos\beta\cos\alpha]$, $\mathbf{0}^T = [0, 0, 0]$, and $T = [t_x, t_y, t_z]^T$ is the translation vector. It is clear that the first two rows of these projection matrices are the same. For a space point $\mathbf{X} = [\bar{\mathbf{X}}^T, 1]^T = [x, y, z, 1]^T$, the images under the different camera models are given by

$$\mathbf{m} = P\mathbf{X} = \begin{bmatrix} u \\ v \\ \mathbf{r}_3^T\bar{\mathbf{X}} + t_z \end{bmatrix} \tag{5}$$
$$\mathbf{m}_q = P_q\mathbf{X} = \begin{bmatrix} u \\ v \\ \mathbf{r}_{3q}^T\bar{\mathbf{X}} + t_z \end{bmatrix} \tag{6}$$
$$\mathbf{m}_a = P_a\mathbf{X} = \begin{bmatrix} u \\ v \\ t_z \end{bmatrix} \tag{7}$$

where $u = \mathbf{r}_1^T\bar{\mathbf{X}} + t_x$, $v = \mathbf{r}_2^T\bar{\mathbf{X}} + t_y$, $\mathbf{r}_3^T\bar{\mathbf{X}} = -(\sin\beta)x + (\cos\beta\sin\alpha)y + (\cos\beta\cos\alpha)z$, and $\mathbf{r}_{3q}^T\bar{\mathbf{X}} = (\cos\beta\cos\alpha)z$. The inhomogeneous image points can be written as

$$\bar{\mathbf{m}} = \frac{1}{\mathbf{r}_3^T\bar{\mathbf{X}} + t_z}\begin{bmatrix} u \\ v \end{bmatrix} \tag{8}$$
$$\bar{\mathbf{m}}_q = \frac{1}{\mathbf{r}_{3q}^T\bar{\mathbf{X}} + t_z}\begin{bmatrix} u \\ v \end{bmatrix}, \quad \bar{\mathbf{m}}_a = \frac{1}{t_z}\begin{bmatrix} u \\ v \end{bmatrix} \tag{9}$$

Let us define the errors of $\bar{\mathbf{m}}_a$ and $\bar{\mathbf{m}}_q$ with respect to $\bar{\mathbf{m}}$ as

$$\mathbf{e}_a = \bar{\mathbf{m}}_a - \bar{\mathbf{m}} = \frac{\mathbf{r}_3^T\bar{\mathbf{X}}}{t_z}\,\bar{\mathbf{m}} \tag{10}$$
$$\mathbf{e}_q = \bar{\mathbf{m}}_q - \bar{\mathbf{m}} = \frac{(\mathbf{r}_3^T - \mathbf{r}_{3q}^T)\bar{\mathbf{X}}}{\mathbf{r}_{3q}^T\bar{\mathbf{X}} + t_z}\,\bar{\mathbf{m}} \tag{11}$$

From the above equations, it is easy to obtain the following conclusions.

Conclusion 1. When the distance from the camera to the object is much larger than the object size, both $\bar{\mathbf{m}}_q$ and $\bar{\mathbf{m}}_a$ are very close to $\bar{\mathbf{m}}$. If the space point lies on the plane that passes through the world origin and is perpendicular to the principal axis, we have $\alpha = \beta = 0$ and $z = 0$; it is easy to verify that $\bar{\mathbf{m}} = \bar{\mathbf{m}}_q = \bar{\mathbf{m}}_a$ in this case.

Conclusion 2. When the rotation angles $\alpha$ and $\beta$ are small, $\bar{\mathbf{m}}_q$ is much closer to $\bar{\mathbf{m}}$ than $\bar{\mathbf{m}}_a$ is. When the camera system is aligned with the world system, we have $\mathbf{r}_{3q}^T = \mathbf{r}_3^T = [0, 0, 1]$ and $\bar{\mathbf{m}}_q = \bar{\mathbf{m}}$.
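The relative magnitudes of $\mathbf{e}_a$ and $\mathbf{e}_q$ can be checked numerically. The sketch below is our illustration: the angle and translation values are arbitrary small-motion choices, and the rotation is built as $R = R_y(\beta)R_x(\alpha)$ with $\gamma = 0$ so that its third row matches $\mathbf{r}_3^T$ above.

```python
import numpy as np

def R_yx(alpha, beta):
    """Rotation with gamma = 0 whose third row equals
    [-sin(b), cos(b)sin(a), cos(b)cos(a)], as in the text."""
    Ry = np.array([[np.cos(beta), 0, np.sin(beta)],
                   [0, 1, 0],
                   [-np.sin(beta), 0, np.cos(beta)]])
    Rx = np.array([[1, 0, 0],
                   [0, np.cos(alpha), -np.sin(alpha)],
                   [0, np.sin(alpha), np.cos(alpha)]])
    return Ry @ Rx

alpha, beta = np.deg2rad([3.0, -4.0])    # small rotation angles
t = np.array([2.0, -1.0, 260.0])         # t_z much larger than object size
X_bar = np.array([5.0, -7.0, 4.0])       # a point on an object of size ~20

R = R_yx(alpha, beta)
u = R[0] @ X_bar + t[0]
v = R[1] @ X_bar + t[1]
r3X = R[2] @ X_bar                                 # full third row
r3qX = np.cos(beta) * np.cos(alpha) * X_bar[2]     # quasi-perspective

m   = np.array([u, v]) / (r3X + t[2])              # Eq. (8)
m_q = np.array([u, v]) / (r3qX + t[2])             # Eq. (9)
m_a = np.array([u, v]) / t[2]                      # Eq. (9)

print("|e_q| =", np.linalg.norm(m_q - m))          # Eq. (11): tiny
print("|e_a| =", np.linalg.norm(m_a - m))          # Eq. (10): much larger
```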

4. Structure and motion factorization

Given $n$ tracked features $\mathbf{x}_{ij}$ across a sequence of $m$ frames, we want to recover the structure and motion of the scene. The factorization-based algorithm has proved to be an effective method for this problem. Under quasi-perspective projection (3), the factorization equation of the tracking matrix is expressed as

$$\underbrace{\begin{bmatrix} \mathbf{x}_{11} & \cdots & \mathbf{x}_{1n} \\ \vdots & \ddots & \vdots \\ \mathbf{x}_{m1} & \cdots & \mathbf{x}_{mn} \end{bmatrix}}_{W_{3m\times n}} = \underbrace{\begin{bmatrix} \mu_1 P_1 \\ \vdots \\ \mu_m P_m \end{bmatrix}}_{M_{3m\times 4}} \underbrace{[\varepsilon_1\mathbf{X}_1, \cdots, \varepsilon_n\mathbf{X}_n]}_{S_{4\times n}} \tag{12}$$

where $W$ is called the tracking matrix, and $M$ and $S$ are called the motion matrix and shape matrix, respectively; $M_i$ stands for the motion matrix of the $i$-th frame and $S_j$ for the $j$-th point in space. It is clear that the rank of the tracking matrix is at most 4, and the rank constraint can easily be imposed by performing SVD on $W$ and truncating it to rank 4 [15]. In real applications, however, it is hard to have all the features tracked across the whole sequence, and the SVD decomposition cannot deal with a tracking matrix in which some entries are unavailable. Let us define the cost function

$$J = \frac{1}{2}\min_{(M,S)} \|W - MS\|_F^2 \tag{13}$$
$$\phantom{J} = \frac{1}{2}\sum_i\sum_j \min_{(M_i,S_j)} \|\mathbf{x}_{ij} - M_i S_j\|^2 \tag{14}$$

The purpose of factorization is to find a rank-4 approximation $MS$ of $W$. Instead of SVD, we adopt the power factorization algorithm [5] to solve the problem. Starting from a randomly selected rank-4 matrix $M \in \mathbb{R}^{3m\times 4}$, we iteratively perform the following two steps until convergence:
1. Update the shape matrix $S$ by minimizing (13);
2. Update the motion matrix $M$ by minimizing (13).

Each minimization is very simple and can be done via least squares, which allows us to deal with a tracking matrix in which some entries are unavailable. In the case of missing data, we use the cost function (14) to update $M$ and $S$ by solving, in the least-squares sense, a set of equations derived from the available features only. There is currently no theoretical proof of the convergence of the algorithm; nevertheless, through extensive simulations, we find that it usually converges to a correct solution within 4 iterations. Suppose $W_t$ is the reprojected tracking matrix at the $t$-th iteration; convergence can be determined by checking the variation

$$\Delta_t = \frac{1}{mn}\|W_t - W_{t-1}\|_F^2 \tag{15}$$
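For concreteness, here is a minimal sketch of the alternating scheme with missing entries. This is our generic low-rank implementation in the spirit of [5]: the block structure $\mu_i P_i$ of $M$ is not enforced, and the QR re-conditioning of $M$ between iterations is one common way to keep the iteration stable.

```python
import numpy as np

def power_factorization(W, mask, rank=4, n_iter=50, tol=1e-10, seed=0):
    """Rank-r approximation W ~ M @ S using only the observed entries
    (mask[r, c] == True where W[r, c] is available), by alternating
    least squares."""
    rng = np.random.default_rng(seed)
    M = rng.standard_normal((W.shape[0], rank))
    S = np.zeros((rank, W.shape[1]))
    W_prev = None
    for _ in range(n_iter):
        M, _ = np.linalg.qr(M)                 # keep columns well-conditioned
        for j in range(W.shape[1]):            # step 1: update shape S
            o = mask[:, j]
            S[:, j] = np.linalg.lstsq(M[o], W[o, j], rcond=None)[0]
        for r in range(W.shape[0]):            # step 2: update motion M
            o = mask[r]
            M[r] = np.linalg.lstsq(S[:, o].T, W[r, o], rcond=None)[0]
        W_t = M @ S                            # reprojected tracking matrix
        if W_prev is not None and np.mean((W_t - W_prev) ** 2) < tol:
            break                              # variation test, cf. Eq. (15)
        W_prev = W_t
    return M, S

# toy check: a rank-4 matrix with roughly 20% of its entries missing
rng = np.random.default_rng(1)
W = rng.standard_normal((30, 4)) @ rng.standard_normal((4, 200))
mask = rng.random(W.shape) > 0.2
M, S = power_factorization(W, mask)
print(np.abs(W - M @ S)[mask].max())           # small residual on observed data
```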

Similar to SVD factorization, the low-rank decomposition $MS$ is not unique, since it is defined only up to a nonsingular linear transformation matrix $H \in \mathbb{R}^{4\times4}$. We adopt the metric constraint to compute this transformation [15]; the camera parameters and the Euclidean structure of the scene can then be recovered from $M \leftarrow MH$ and $S \leftarrow H^{-1}S$. It should be noted that the algorithm can also be extended to the nonrigid case; the main difference lies in the dimensions of the motion and shape matrices [15].

5. Evaluation on synthetic data

We randomly generated 200 points within a cube of 20 × 20 × 20 in space and simulated 10 images of these points by perspective projection. The image size is set at 800 × 800. The camera parameters are set as follows: the focal lengths vary randomly from 1000 to 1100; the three rotation angles are set randomly between ±5°; the X and Y positions of the cameras are set randomly between ±20, while the Z positions are set randomly between 250 and 270. These imaging conditions are close to the quasi-perspective assumption. During the test, 1-pixel Gaussian white noise is added to the image points.

We first tested the convergence of the algorithm under two conditions: (i) using all data in the 10 frames; (ii) randomly deleting 20% of the entries from the initial tracking matrix and running the algorithm on the remaining data. At each iteration, we recorded the variation (15) of the reprojected tracking matrix, as shown in Fig. 1(a). The algorithm converges quickly in both cases, and the missing data has no noticeable influence on the convergence speed in this experiment. However, the algorithm may fail when the proportion of missing entries is over 50%.
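A sketch of this synthetic setup is given below (ours; the rotation convention and the principal point at the image center are assumptions, as the paper does not specify them). It also produces the 20%-missing mask and can be fed to the power_factorization sketch of Section 4.

```python
import numpy as np

rng = np.random.default_rng(2)
n_pts, n_frames = 200, 10
X = np.vstack([rng.uniform(-10, 10, (3, n_pts)),   # points in a 20x20x20 cube
               np.ones((1, n_pts))])               # homogeneous, 4 x n

def rot(alpha, beta, gamma):
    # R = Rz(gamma) @ Ry(beta) @ Rx(alpha); the convention is our assumption
    ca, cb, cg = np.cos([alpha, beta, gamma])
    sa, sb, sg = np.sin([alpha, beta, gamma])
    Rx = np.array([[1, 0, 0], [0, ca, -sa], [0, sa, ca]])
    Ry = np.array([[cb, 0, sb], [0, 1, 0], [-sb, 0, cb]])
    Rz = np.array([[cg, -sg, 0], [sg, cg, 0], [0, 0, 1]])
    return Rz @ Ry @ Rx

W = np.zeros((3 * n_frames, n_pts))
for i in range(n_frames):
    f = rng.uniform(1000, 1100)                          # focal length
    K = np.array([[f, 0, 400], [0, f, 400], [0, 0, 1]])  # 800x800 image
    R = rot(*np.deg2rad(rng.uniform(-5, 5, 3)))          # angles within +/-5 deg
    T = np.r_[rng.uniform(-20, 20, 2), rng.uniform(250, 270)]
    x = K @ np.hstack([R, T[:, None]]) @ X               # projection, Eq. (1)
    x /= x[2]                                            # perspective division
    x[:2] += rng.normal(0, 1, (2, n_pts))                # 1-pixel Gaussian noise
    W[3 * i:3 * i + 3] = x

# randomly drop 20% of the features; a missing feature removes all
# three rows of its column in that frame
obs = rng.random((n_frames, n_pts)) > 0.2
mask = np.repeat(obs, 3, axis=0)
# M, S = power_factorization(W, mask)   # see the sketch in Section 4
```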


Figure 1. (a) The convergence property of the algorithm; (b) the average computation time of different algorithms for different data sets.

We compared the average computation time of the different algorithms. The program was implemented in Matlab on a Dell Inspiron 600m laptop with a Pentium(R) 1.8 GHz CPU. The real computation time for different data sets (varying the frame number from 10 to 250) is shown in Fig. 1(b), where 'Quasi' stands for the quasi-perspective factorization via SVD [15], 'Affine' for the affine factorization [8], and '10% Pers' for 10% of the computation time taken by the perspective factorization [3]. The proposed algorithm is computationally cheaper than the other methods.

We recovered the 3D structure of the points and registered it with the ground truth. During the test, we varied the noise level from 0 to 3 pixels with a step of 0.5, and we define the reconstruction error as the point-wise distance between the recovered structure and the ground truth. The mean and standard deviation of the distances over 100 independent tests are shown in Fig. 2(a). It is evident that the proposed method has almost the same accuracy as the SVD-based quasi-perspective factorization, and the result is much better than that of the affine model.
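The Euclidean reconstruction is determined only up to a similarity transform, so registration with the ground truth requires an alignment step. A closed-form (Umeyama-style) least-squares alignment such as the sketch below is one standard way to compute the point-wise error; this is our illustration, and X_rec and X_gt are hypothetical 3 × n arrays.

```python
import numpy as np

def register_similarity(X_rec, X_gt):
    """Closed-form similarity alignment (scale s, rotation R,
    translation t) of X_rec onto X_gt; both are 3 x n arrays."""
    mu_r = X_rec.mean(axis=1, keepdims=True)
    mu_g = X_gt.mean(axis=1, keepdims=True)
    A, B = X_rec - mu_r, X_gt - mu_g
    U, D, Vt = np.linalg.svd(B @ A.T)
    C = np.eye(3)
    C[2, 2] = np.sign(np.linalg.det(U @ Vt))   # guard against reflections
    R = U @ C @ Vt
    s = np.trace(np.diag(D) @ C) / (A * A).sum()
    t = mu_g - s * R @ mu_r
    return s, R, t

# point-wise reconstruction error after alignment:
# s, R, t = register_similarity(X_rec, X_gt)
# err = np.linalg.norm(s * R @ X_rec + t - X_gt, axis=0)
# print(err.mean(), err.std())
```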




Figure 2. (a) The mean and STD of the reconstruction errors; (b) the histogram distribution of reprojection errors of the fountain sequence.

Acknowledgment

The work is supported in part by the Canada Research Chair program and the National Natural Science Foundation of China under grant no. 60575015.

Figure 3. Three images of the fountain sequence with tracked features, and the reconstructed VRML model and wireframe.

6. Evaluation on real sequence

Due to space limitations we report only one result, on a fountain sequence of 7 images with a resolution of 1024 × 768. We established a total of 4218 reliable correspondences using the system of [13]. Fig. 3 shows three of the images with the tracked features, with the relative disparities overlaid on the second and third images. We recovered the Euclidean structure by the proposed method; the reconstructed VRML model and the corresponding triangulated wireframe, viewed from different viewpoints, are shown in Fig. 3. The result is visually plausible and realistic.

After reconstruction, we reprojected the 3D points to the images and calculated the reprojection errors; Fig. 2(b) shows the histogram distribution of these errors. We also randomly deleted 20% of the tracking data and compared the algorithm (with and without missing data) with the SVD-based method. The results are comparable, without much difference.
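A sketch of the reprojection-error computation is given below (ours; it assumes the 3m × n layout of Eq. (12), with hypothetical M, S, W, and mask coming from the factorization stage).

```python
import numpy as np

def reprojection_errors(M, S, W, mask):
    """Reproject the recovered structure through M and compare with the
    observed tracks; W, M, S follow the 3m x n layout of Eq. (12)."""
    W_hat = M @ S
    errs = []
    for i in range(M.shape[0] // 3):
        x_hat = W_hat[3 * i:3 * i + 3]
        x_hat = x_hat[:2] / x_hat[2]          # homogeneous normalization
        o = mask[3 * i]                       # features observed in frame i
        d = x_hat[:, o] - W[3 * i:3 * i + 2][:, o]
        errs.append(np.linalg.norm(d, axis=0))
    return np.concatenate(errs)

# histogram as in Fig. 2(b), e.g.:
# hist, edges = np.histogram(reprojection_errors(M, S, W, mask), bins=16)
```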

7. Conclusions

In this paper, we carried out an error analysis of the quasi-perspective projection model and proved that it is more accurate than the affine model. We then proposed a power factorization scheme to solve the factorization problem; it has almost the same accuracy as SVD-based factorization but is computationally much cheaper. The most important attribute of the algorithm is that it can easily deal with the missing-data problem. Experiments demonstrated the advantages of the proposed algorithm, which can also be extended to nonrigid factorization.

References

[1] C. Bregler, A. Hertzmann, and H. Biermann. Recovering non-rigid 3D shape from image streams. In Proc. CVPR (2), pages 690–696, 2000.
[2] S. Christy and R. Horaud. Euclidean shape and motion from multiple perspective views by affine iterations. IEEE T-PAMI, 18(11):1098–1104, 1996.
[3] M. Han and T. Kanade. Creating 3D models with uncalibrated cameras. In Proc. WACV, 2000.
[4] R. Hartley and A. Zisserman. Multiple View Geometry in Computer Vision. Second edition, 2004.
[5] R. Hartley and F. Schaffalitzky. PowerFactorization: 3D reconstruction with missing or uncertain data. In Proc. AJAWCV, 2003.
[6] S. Mahamud and M. Hebert. Iterative projective reconstruction from multiple views. In Proc. CVPR (2), pages 430–437, 2000.
[7] C. Poelman and T. Kanade. A paraperspective factorization method for shape and motion recovery. IEEE T-PAMI, 19(3):206–218, 1997.
[8] L. Quan. Self-calibration of an affine camera from multiple views. IJCV, 19(1):93–105, 1996.
[9] P. F. Sturm and B. Triggs. A factorization based algorithm for multi-image projective structure and motion. In Proc. ECCV (2), pages 709–720, 1996.
[10] C. Tomasi and T. Kanade. Shape and motion from image streams under orthography: a factorization method. IJCV, 9(2):137–154, 1992.
[11] L. Torresani, A. Hertzmann, and C. Bregler. Learning non-rigid 3D shape from 2D motion. In NIPS, 2003.
[12] B. Triggs. Factorization methods for projective structure and motion. In Proc. CVPR, pages 845–851, 1996.
[13] G. Wang. A hybrid system for feature matching based on SIFT and epipolar constraints. Tech. Rep., University of Windsor, 2006.
[14] G. Wang and J. Wu. Stratification approach for 3D Euclidean reconstruction of nonrigid objects from uncalibrated image sequences. IEEE T-SMCB, 38(1):90–101, 2008.
[15] G. Wang and J. Wu. Quasi-perspective projection with applications to 3D factorization from uncalibrated image sequences. In Proc. CVPR, 2008.
