The Factorization Method with Linear Motions Mei Han Takeo Kanade October 1999 CMU-RI-TR-99-23
The Robotics Institute Carnegie Mellon University Pittsburgh, PA 15213
Abstract In this paper we describe the factorization method with linear motions. We design a unified representation of scene structure and moving objects by assuming that the objects are moving linearly and with constant speeds. The representation enables the subspace constraints to be applied to the measurement matrix so that the scene structure, moving trajectories and camera motion are reconstructed simultaneously. We also discuss the solutions for degenerate cases. Preliminary results on synthetic data are also presented.
©1999 Carnegie Mellon University
Keywords: structure from motion, linear motion
1 Introduction In this paper we present the factorization method with linear motions. It provides a new representation that incorporates scene structure and the linear motions of objects. It does not require pre-segmentation, and the number of moving objects may be unknown. It recovers camera motion, scene structure and linear motion trajectories simultaneously. This report describes the method and presents preliminary experiments on synthetic data.
1.1 Background
This work is based on the system we built for 3D scene analysis of video sequences by homography ([4]). The homography-based framework consists of a robust homography algorithm, a camera motion solver, and a dense projective depth map recovery. The main applications of the system are motion detection and scene structure recovery. The open problem of that system is temporal integration: it integrates geometrical recovery information pairwise, so its complexity increases exponentially with the length of the sequence. Factorization is a robust and efficient method for accurately recovering the shape and motion of an object from a sequence of images. It achieves its accuracy and robustness by taking advantage of the large stream of redundant input, and it can be viewed as one approach to information integration. Based on this observation we design the factorization method with linear motions and incorporate it into the homography-based system to recover scene structure, camera motion and motion trajectories simultaneously. The output of the previous system provides patch-based tracking results, while the factorization method integrates the information over the video sequence and recovers the scene structure and ego motion, which that system then uses to start dense depth refinement, as well as the motion trajectories.
1.2 Scenario
Video sequences of our interest are taken from a moving airborne platform where the ego-motion is complex and the scene is relatively distant but not necessarily flat. That includes:
- distant scenes where weak or paraperspective projection is a good enough approximation and moving objects can, and have to, be abstracted as points;
- scenes with 3D structure where parallax exists and makes image registration approaches insufficient;
- multiple moving objects whose number is unknown;
- motions that can be approximated as linear with constant speeds within short periods of the sequence.
The goal of the factorization method is to recover camera motion, scene structure and linear motion trajectories simultaneously without pre-segmentation. One example of the scenario is shown in Figure 1.
(a) (b) (c) Figure 1: (a) 1st (b) 21st (c) 35th image of the sequence.
1.3 Related work
Factorization methods ([8] and [7]). These methods only recover the shape of one object and the camera motion.
Multibody factorization method ([3]). This method regards each object as an independent motion/structure space and solves each one by the factorization method. It does not require beforehand segmentation either, but the process is tricky in rank determination and block detection. More importantly, the method does not work on our scenario because it requires that each object be close and clear as one 3D object so as to provide enough information as an independent space.
Non-rigid parallax ([1]). Avidan and Shashua's abstraction of moving objects as points inspired our motion definition. This method recovers linear motions by 3D line fitting, and it requires camera calibration information and beforehand motion detection.
Plane plus parallax ([5]). This 2D method provides a neat representation of projective depth. There is continuing work on temporal integration similar to trilinear constraints ([6]). It is mainly a 2D approach and does not recover motion trajectories and scene structure in 3D.
Motion segmentation ([9]). Its focus is on 2D parametric motion model selection, while our method incorporates segmentation within the structure-from-motion representation. Motion segmentation methods are mostly region based and do not help in our aerial image scenario.
1.4 Pros and Cons
Good points of our method:
- No requirement of knowledge of the number of moving objects.
- No requirement of beforehand motion detection and pre-segmentation.
- Unified representation of scene and moving objects.
- Recovering scene structure, camera motion and moving trajectories simultaneously.
- No tricky threshold for rank determination and motion segmentation.
- Robustness and efficiency of the factorization method.
Assumptions we have to make:
- Objects move linearly.
- Objects move with constant speeds.
For aerial images the above assumptions are easy to satisfy within short periods of video sequences.
1.5 Applications
We have already started working on the following applications:
- motion detection and recovery
- 3D mosaicking
- multiresolution video sequences
2 Representation
2.1 Scene and Moving Objects
We regard every feature point as a moving point with constant speed: points from the static scene are points with zero speed, and moving points are points with their corresponding constant speeds.

Definition 1 A point p equals s + v, where s is the initial position of the point and v is its moving speed.

As in the factorization representation, a point p_m is observed in frame f at image coordinates (u_{fm}, v_{fm}) (using orthographic projection as an example):

    u_{fm} = i_f · (p_m − t_f) + o_x
    v_{fm} = j_f · (p_m − t_f) + o_y    (1)

where m = 1 … M and f = 1 … F, M is the number of feature points and F is the number of frames. Simplifying the above equations with calibrated camera parameters, these equations can be written as

    u_{fm} = m_f · p_m + t_{fx}
    v_{fm} = n_f · p_m + t_{fy}    (2)

where

    t_{fx} = −(t_f · i_f)
    t_{fy} = −(t_f · j_f)    (3)

    m_f = i_f
    n_f = j_f    (4)

All feature point coordinates (u_{fm}, v_{fm}) are put in a 2F × M measurement matrix W:

    W = [ u_{11} … u_{1M} ]
        [   ⋮         ⋮   ]
        [ u_{F1} … u_{FM} ]
        [ v_{11} … v_{1M} ]
        [   ⋮         ⋮   ]
        [ v_{F1} … v_{FM} ]    (5)

Each column of the measurement matrix contains the observations for a single point, while each row contains the observed u-coordinates or v-coordinates for a single frame. In our unified representation of scene points and moving points, the 7 × M shape matrix is

    S = [ s_1 s_2 … s_M ]
        [ v_1 v_2 … v_M ]
        [ 1   1  …  1   ]    (6)

and the 2F × 7 motion matrix is

    M = [ i_{1x} i_{1y} i_{1z}  1·i_{1x} 1·i_{1y} 1·i_{1z}  t_{1x} ]
        [ i_{2x} i_{2y} i_{2z}  2·i_{2x} 2·i_{2y} 2·i_{2z}  t_{2x} ]
        [   ⋮                                                 ⋮    ]
        [ i_{Fx} i_{Fy} i_{Fz}  F·i_{Fx} F·i_{Fy} F·i_{Fz}  t_{Fx} ]
        [ j_{1x} j_{1y} j_{1z}  1·j_{1x} 1·j_{1y} 1·j_{1z}  t_{1y} ]
        [ j_{2x} j_{2y} j_{2z}  2·j_{2x} 2·j_{2y} 2·j_{2z}  t_{2y} ]
        [   ⋮                                                 ⋮    ]
        [ j_{Fx} j_{Fy} j_{Fz}  F·j_{Fx} F·j_{Fy} F·j_{Fz}  t_{Fy} ]    (7)

Therefore, we get the standard matrix equation of factorization:

    W = M S    (8)

where M is the 2F × 7 motion matrix whose rows are the [m_f m'_f t_{fx}] and [n_f n'_f t_{fy}] vectors, and S is the 7 × M shape matrix whose columns are the [s_m v_m 1]^T vectors. For weak perspective and paraperspective projections, we can get similar representations.
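To make the unified representation concrete, the following NumPy sketch (not code from the report; frame count, point count, cameras and speeds are all assumed random stand-ins) builds the 7 × M shape matrix of equation (6) and the 2F × 7 motion matrix of equation (7), and checks that the resulting measurement matrix W = MS has rank at most 7.

```python
import numpy as np

rng = np.random.default_rng(0)
F, num_pts = 20, 30                       # assumed numbers of frames and points

# shape matrix S (7 x M): initial positions s_m, speeds v_m, and a row of ones
s = rng.standard_normal((3, num_pts))
v = 0.1 * rng.standard_normal((3, num_pts))
S = np.vstack([s, v, np.ones((1, num_pts))])

# motion matrix M (2F x 7): rows [i_f, f*i_f, t_fx] and [j_f, f*j_f, t_fy]
rows_u, rows_v = [], []
for f in range(1, F + 1):
    Q, _ = np.linalg.qr(rng.standard_normal((3, 3)))  # random camera rotation
    i_f, j_f = Q[:, 0], Q[:, 1]                       # orthonormal camera axes
    t_fx, t_fy = rng.standard_normal(2)
    rows_u.append(np.concatenate([i_f, f * i_f, [t_fx]]))
    rows_v.append(np.concatenate([j_f, f * j_f, [t_fy]]))
M = np.array(rows_u + rows_v)

W = M @ S                                 # measurement matrix, equation (8)
print(np.linalg.matrix_rank(W))           # 7 for generic data
```

For generic cameras and speeds the rank is exactly 7; the degenerate cases discussed later correspond to this rank dropping.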
3 Factorization with Linear Motions 3.1 Moving Coordinate System
We define a moving world coordinate system so that we can get the camera translation vectors immediately from the image data:

    [ u_{11} … u_{1M} ]   [ m_1  m'_1 ]                       [ t_{1x} ]
    [   ⋮         ⋮   ]   [  ⋮    ⋮   ]  [ s_1 s_2 … s_M ]    [   ⋮    ]
    [ u_{F1} … u_{FM} ] = [ m_F  m'_F ]  [ v_1 v_2 … v_M ]  + [ t_{Fx} ]  [ 1 1 … 1 ]    (9)
    [ v_{11} … v_{1M} ]   [ n_1  n'_1 ]                       [ t_{1y} ]
    [   ⋮         ⋮   ]   [  ⋮    ⋮   ]                       [   ⋮    ]
    [ v_{F1} … v_{FM} ]   [ n_F  n'_F ]                       [ t_{Fy} ]
We have

    Σ_{m=1}^{M} u_{fm} = Σ_{m=1}^{M} (m̃_f · p_m + t_{fx}) = m̃_f · Σ_{m=1}^{M} p_m + M t_{fx}
    Σ_{m=1}^{M} v_{fm} = Σ_{m=1}^{M} (ñ_f · p_m + t_{fy}) = ñ_f · Σ_{m=1}^{M} p_m + M t_{fy}    (10)

where m̃_f = [m_f m'_f] and ñ_f = [n_f n'_f]. As the points are either static or moving linearly with constant speeds, we design the world coordinate system as one with fixed orientation whose origin moves linearly with constant speed. The origin speed is the average of the moving speeds over the entire feature point set. Therefore, we know

    Σ_{m=1}^{M} p_m = 0    (11)

and the elements of the translation vectors are computed as:

    t_{fx} = (1/M) Σ_{m=1}^{M} u_{fm}
    t_{fy} = (1/M) Σ_{m=1}^{M} v_{fm}    (12)
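With the moving world origin of equation (11), equation (12) reduces to taking row means of the measurement matrix. A minimal sketch (assuming W is a NumPy array laid out as in equation (5)):

```python
import numpy as np

def translations_from_rows(W):
    """With the moving world origin (equation (11)), the translations
    t_fx (first F rows) and t_fy (last F rows) are the row means of W
    (equation (12))."""
    return W.mean(axis=1)

# toy check: a measurement matrix whose rows have known means
W = np.array([[1.0, 2.0, 3.0],
              [4.0, 5.0, 6.0]])
t = translations_from_rows(W)
print(t)   # [2. 5.]
```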
3.2 Decomposition
Once we know the translation vector, we subtract it from W:

    Ŵ = W − [ t_{1x} … t_{Fx} t_{1y} … t_{Fy} ]^T [ 1 1 … 1 ] = M̂ Ŝ = M̂ A A^{−1} Ŝ = M S    (13)

where

    M = M̂ A
    S = A^{−1} Ŝ

and

    M = [ m_1  m'_1 ]
        [  ⋮    ⋮   ]
        [ m_F  m'_F ]
        [ n_1  n'_1 ]
        [  ⋮    ⋮   ]
        [ n_F  n'_F ]    (14)

    S = [ s_1 s_2 … s_M ]
        [ v_1 v_2 … v_M ]    (15)
It is known that the rank of the measurement matrix Ŵ is at most 6 no matter how many moving objects there are. We use the SVD to perform the factorization and obtain the best possible rank-6 approximation of Ŵ as M̂ Ŝ.
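The SVD-based rank-6 decomposition can be sketched as follows (an illustrative implementation, with the singular values split evenly between the two factors; the report does not prescribe this particular split):

```python
import numpy as np

def rank6_factor(W_hat):
    """Best rank-6 approximation of W_hat in the least-squares sense,
    split into M_hat (2F x 6) and S_hat (6 x M)."""
    U, sv, Vt = np.linalg.svd(W_hat, full_matrices=False)
    root = np.sqrt(sv[:6])
    return U[:, :6] * root, root[:, None] * Vt[:6]

# sanity check on a matrix that is exactly rank 6
rng = np.random.default_rng(1)
W_hat = rng.standard_normal((40, 6)) @ rng.standard_normal((6, 30))
M_hat, S_hat = rank6_factor(W_hat)
assert np.allclose(M_hat @ S_hat, W_hat)
```

For an exactly rank-6 input the product M̂ Ŝ reproduces Ŵ; with noisy data it is the closest rank-6 matrix in the Frobenius norm.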
3.3 Normalization
We again take orthographic projection as the example; the weak perspective and paraperspective cases are much the same except for different representations of the metric constraints. As in the factorization method, the decomposition of Ŵ into the product of M̂ and Ŝ is only determined up to a linear transformation matrix A.
We determine this matrix A by observing that the rows of the motion matrix M (the [m_f m'_f] and [n_f n'_f] vectors) must be of a certain form:

    |m_f|^2 = 1    (16)
    |n_f|^2 = 1    (17)
    |m'_f|^2 = f^2    (18)
    |n'_f|^2 = f^2    (19)
    m_f · n_f = 0    (20)
    m'_f · n'_f = 0    (21)
    m'_f · n_f = 0    (22)
    m_f · n'_f = 0    (23)

Define

    A = [ B_1 B_2 ]    (24)

where A is a 6 × 6 matrix and B_1, B_2 are both 6 × 3 matrices. Then

    M̂ B_1 = [ m_1 ]        M̂ B_2 = [ m'_1 ]       [ m_1 ]
            [  ⋮  ]                [   ⋮  ]       [  ⋮  ]
            [ m_F ]                [ m'_F ]  = N  [ m_F ]
            [ n_1 ]                [ n'_1 ]       [ n_1 ]
            [  ⋮  ]                [   ⋮  ]       [  ⋮  ]
            [ n_F ]                [ n'_F ]       [ n_F ]    (25)

where

    N = diag(1, 2, …, F, 1, 2, …, F)    (26)
we get the relation between B_1 and B_2:

    B_2 = K B_1    (27)

where

    K = M̂^+ N M̂    (28)

and M̂^+ is the generalized inverse of M̂. Equations (16), (17) and (20) give constraints on the entries of Q_1 = B_1 B_1^T, and equations (18), (19) and (21) are constraints on the entries of Q_2 = B_2 B_2^T. From the relation between B_1 and B_2 (equation (27)), we have

    Q_2 = B_2 B_2^T = K B_1 B_1^T K^T = K Q_1 K^T    (29)

which changes the constraints on Q_2 to constraints on Q_1. Equations (22) and (23) are constraints on Q_3 = B_2 B_1^T, which are also constraints on Q_1:

    Q_3 = B_2 B_1^T = K B_1 B_1^T = K Q_1    (30)

SVD is used to get B_1 from Q_1, and B_2 is computed from equation (27), so we get A as the linear transformation. The final results can be aligned with any orientation of the world coordinate system. We use all eight constraints on M not only for the best least-squares approximation, but also to have enough constraints to obtain linear solutions for Q_1. We will explore this in more detail in the next section.
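Equation (28) is direct to evaluate once M̂ is known. A minimal sketch (assuming NumPy, with a random stand-in for M̂ used only to check dimensions):

```python
import numpy as np

def compute_K(M_hat, F):
    """K = pinv(M_hat) @ N @ M_hat (equation (28)), where
    N = diag(1..F, 1..F) (equation (26)) scales the u-rows and v-rows
    of the motion matrix by their frame index."""
    N = np.diag(np.tile(np.arange(1.0, F + 1), 2))
    return np.linalg.pinv(M_hat) @ N @ M_hat

# shape check with a random 2F x 6 stand-in for M_hat
K = compute_K(np.random.default_rng(2).standard_normal((20, 6)), F=10)
print(K.shape)   # (6, 6)
```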
3.4 Degenerate Cases
The standard process described above solves the full rank case, whose structure and motion spaces are both rank 3. This is the case when the scene is three dimensional and the speeds of the moving objects are distributed in 3D space as well. In this section we discuss the solutions for degenerate cases.
3.4.1 Degenerate Shape
We do not address this problem because the previous homography-based system handles this case nicely. If the scene has a degenerate shape, such as a planar scene, the homography approach detects the case and solves for the trajectories directly.
3.4.2 Degenerate Motion Space
First we need to decide which kind of approximation is the best, that is, to decide whether the situation is rank 3 (no motion), rank 4 (object(s) moving in the same direction), rank 5 (objects moving in a two dimensional space) or rank 6, which is the standard case. The rank of the measurement matrix is one important clue. There are two main reasons why it is hard to compute the right rank. One is that we use orthographic, weak perspective or paraperspective projection to approximate the perspective projection, which induces noise in the rank computation. The other is that the data themselves are noisy. We use two algorithms to choose the right approximation:
- Rank computation from noisy data ([2]). This method builds a noise model of the input data and estimates the rank of the measurement matrix W from the singular values and the noise model. We improve the approach by using W^T W instead of W.
- Error analysis. This is a brute-force method. We estimate the structure (scene shape and trajectories) and camera motion for the different rank approximations, that is, we run the factorization four times, for ranks 3, 4, 5 and 6. We design an error metric to decide which result is the best approximation:

    Error = E_1 + E_2 + E_3    (31)

where

    E_1 = |Ŵ − M̂ Ŝ|^2    (32)

    E_2 = Σ_{f=1}^{F} ((|m_f| − 1)^2 + (|n_f| − 1)^2 + |m_f · n_f|^2)    (33)

    E_3 = Σ_{f=1}^{F} (|m'_f − f m_f|^2 + |n'_f − f n_f|^2)    (34)

The approximation with the least error is taken as the best one. The rank 3 situation can be solved by the standard factorization method. We now describe how to solve the rank 4 and rank 5 cases.
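The error metric of equations (31)-(34) is straightforward to compute from a candidate solution. A sketch (assuming the recovered vectors are stored as NumPy arrays; the function name and array layout are illustrative, not from the report):

```python
import numpy as np

def rank_selection_error(W_hat, M_hat, S_hat, m, n, mp, npf):
    """Error = E1 + E2 + E3 of equations (31)-(34).

    m, n   : (F, 3) arrays of the recovered m_f and n_f vectors
    mp, npf: (F, 3) arrays of the recovered m'_f and n'_f vectors
    """
    F = m.shape[0]
    f = np.arange(1, F + 1)[:, None]                   # frame indices 1..F
    E1 = np.sum((W_hat - M_hat @ S_hat) ** 2)          # residual, eq. (32)
    E2 = np.sum((np.linalg.norm(m, axis=1) - 1) ** 2
                + (np.linalg.norm(n, axis=1) - 1) ** 2
                + np.einsum('ij,ij->i', m, n) ** 2)    # metric terms, eq. (33)
    E3 = np.sum((mp - f * m) ** 2) + np.sum((npf - f * n) ** 2)  # eq. (34)
    return E1 + E2 + E3

# a perfectly consistent toy solution scores zero
m = np.tile([1.0, 0.0, 0.0], (2, 1))
n = np.tile([0.0, 1.0, 0.0], (2, 1))
f = np.arange(1, 3)[:, None]
err = rank_selection_error(np.zeros((2, 2)), np.zeros((2, 3)),
                           np.zeros((3, 2)), m, n, f * m, f * n)
assert err == 0.0
```

Running this metric for each candidate rank and keeping the minimizer implements the brute-force selection described above.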
Rank 4 Degenerate Case The rank 4 case is when the moving objects are all moving in the same or the opposite direction, i.e., the moving space is one dimensional. We take the i direction of the world coordinate system aligned with the moving direction. The origin is still moving with constant speed. Therefore,

    Ŵ = M S    (35)

with

    M = [ i_{1x} i_{1y} i_{1z}  1·i_{1x} ]
        [ i_{2x} i_{2y} i_{2z}  2·i_{2x} ]
        [   ⋮                      ⋮     ]
        [ i_{Fx} i_{Fy} i_{Fz}  F·i_{Fx} ]
        [ j_{1x} j_{1y} j_{1z}  1·j_{1x} ]
        [ j_{2x} j_{2y} j_{2z}  2·j_{2x} ]
        [   ⋮                      ⋮     ]
        [ j_{Fx} j_{Fy} j_{Fz}  F·j_{Fx} ]

    S = [ s_{1x} s_{2x} … s_{Mx} ]
        [ s_{1y} s_{2y} … s_{My} ]
        [ s_{1z} s_{2z} … s_{Mz} ]
        [ v_{1x} v_{2x} … v_{Mx} ]
We define

    A = [ B_1 B_2 ]    (36)

where A is a 4 × 4 matrix, B_1 is a 4 × 3 matrix and B_2 is a 4 × 1 matrix, and

    B_2 = K B_{11}    (37)

where B_{11} is the first column of B_1 and K is defined in equation (28). Now the constraints (16), (17) and (20) still hold while the others cannot be represented. Fortunately these three kinds of constraints are enough to solve Q_1 linearly for rank 4. Similarly we compute B_1 by decomposition of Q_1 and B_2 by equation (37). The alignment R is determined by:

    B_2 R = K (B_1 R)_1    (38)

which constrains the alignment to make the i direction the moving direction.
Rank 5 Degenerate Case The rank 5 case is when the objects are moving on a plane. Similarly, we assume the i direction and j direction of the world coordinate system are aligned with the two dimensional moving space. The origin is still moving with constant speed. Therefore,

    Ŵ = M S    (39)

with

    M = [ i_{1x} i_{1y} i_{1z}  1·i_{1x} 1·i_{1y} ]
        [ i_{2x} i_{2y} i_{2z}  2·i_{2x} 2·i_{2y} ]
        [   ⋮                      ⋮        ⋮     ]
        [ i_{Fx} i_{Fy} i_{Fz}  F·i_{Fx} F·i_{Fy} ]
        [ j_{1x} j_{1y} j_{1z}  1·j_{1x} 1·j_{1y} ]
        [ j_{2x} j_{2y} j_{2z}  2·j_{2x} 2·j_{2y} ]
        [   ⋮                      ⋮        ⋮     ]
        [ j_{Fx} j_{Fy} j_{Fz}  F·j_{Fx} F·j_{Fy} ]

    S = [ s_{1x} s_{2x} … s_{Mx} ]
        [ s_{1y} s_{2y} … s_{My} ]
        [ s_{1z} s_{2z} … s_{Mz} ]
        [ v_{1x} v_{2x} … v_{Mx} ]
        [ v_{1y} v_{2y} … v_{My} ]
We define

    A = [ B_1 B_2 ]    (40)

where

    B_2 = K [ B_{11} B_{12} ]    (41)

B_{11} and B_{12} are the first two columns of B_1 and K is defined as in equation (28). Now the constraints (16), (17) and (20) still hold while the others cannot be represented, but these three kinds of constraints are not enough to solve Q_1 linearly in the rank 5 case. We found a way to represent these constraints with five parameters c, where c is the third column of B_1 and is therefore a 5 × 1 vector. With the help of c, we represent the constraints (18) and (19) as:

    |m'_f|^2 = M̂_f B_2 B_2^T M̂_f^T + f^2 M̂_f c c^T M̂_f^T = f^2    (42)

where M̂_f is the f-th row of the matrix M̂. The other constraints are expressed in terms of B_2 B_2^T and c c^T. We know that the constraints on B_2 B_2^T can be changed to constraints on Q_1. Therefore, we get one linear equation set in Q_1 and c c^T. This is not a full rank linear equation set; however, it is full rank in Q_1 given c c^T. So we turn the problem into a non-linear optimization over the 5 parameters: we solve a small scale non-linear optimization to compute the c vector first, then generate the Q_1 matrix by least squares. B_1 and B_2 are calculated from Q_1. The alignment R is determined by:

    B_2 R = K [ (B_1 R)_1 (B_1 R)_2 ]    (43)

which constrains the alignment to put the i–j plane of the world coordinate system on the moving plane. The above equation is solved by the least-eigenvalue method.
4 Experiments We apply the factorization method with linear motions to synthetic data under the perspective projection model. We generate sequences of 100 frames with 49 feature points from the scene and 0 to 4 objects moving in random directions. We add 2% noise to the data. The results shown here are generated by the weak perspective factorization method with linear motions. The experimental results show our method is robust and efficient in dealing with different situations. Figure 2 shows the situation when 4 objects are moving. Blue dots denote the feature points from the scene and the other colors denote different moving objects.
(a) (b) Figure 2: Rank 6 case: reconstructed scene with moving trajectories. (a) shows the scene structure and initial positions of moving objects. (b) shows the scene as well as the moving trajectories. Figure 3 is the situation when there is no moving object. The method detects that the rank is 3 and recovers the scene structure correctly. Figure 4 shows the case when there is only one moving object. The method detects that the rank is 4 and recovers the scene structure and the moving trajectory. Figure 5 is the case when there are two moving objects but they are moving in the same or opposite direction. The rank is still 4. Figure 6 shows the case when there are two moving objects whose directions are on a plane. The method detects that the rank is 5 and recovers the scene structure and the moving trajectories. Figure 7 is the case when there are
(a) Figure 3: Rank 3 case: reconstructed scene with no moving object.
(a) (b) Figure 4: Rank 4 case: reconstructed scene with moving trajectory. (a) shows the scene structure and initial position of moving object. (b) shows the scene as well as the moving trajectory.
(a) (b) Figure 5: Rank 4 case: reconstructed scene with moving trajectories. (a) shows the scene structure and initial positions of moving objects. (b) shows the scene as well as the moving trajectories.
three moving objects but their moving directions are distributed on a plane. The rank is still 5.
(a) (b) Figure 6: Rank 5 case: reconstructed scene with moving trajectories. (a) shows the scene structure and initial positions of moving objects. (b) shows the scene as well as the moving trajectories.
(a) (b) Figure 7: Rank 5 case: reconstructed scene with moving trajectories. (a) shows the scene structure and initial positions of moving objects. (b) shows the scene as well as the moving trajectories.
5 Conclusion In this paper we describe the factorization method with linear motions. We design a unified representation of scene structure and moving objects by assuming that the objects are moving linearly and with constant speeds. The representation enables the subspace constraints to be applied to the measurement matrix so that the scene structure, moving trajectories (moving directions and speeds) and camera motion are reconstructed simultaneously. We are working on further experiments and analysis of the method's robustness and efficiency.
Acknowledgements Many thanks to Simon Baker, Martial Hebert, David Larose and Teck-khim Ng for useful suggestions and comments on this work.
References
[1] S. Avidan and A. Shashua. Non-rigid parallax for 3D linear motion. In Proceedings of the 1998 DARPA Image Understanding Workshop, pages 199-201, 1998.
[2] T. Boult and L. G. Brown. Factorization-based segmentation of motions. In Proceedings of the 1991 Visual Motion Workshop, pages 179-186, 1991.
[3] J. P. Costeira and T. Kanade. A multibody factorization method for independently moving objects. IJCV, 29(3):159-179, 1998.
[4] M. Han and T. Kanade. Homography-based 3D scene analysis of video sequences. In DARPA98, pages 154-160, 1998.
[5] M. Irani, P. Anandan, and D. Weinshall. From reference frames to reference planes: Multi-view parallax geometry and applications. In ECCV98, 1998.
[6] S. Peleg, A. Shashua, D. Weinshall, M. Werman, and M. Irani. Multi-sensor representation of extended scenes using multi-view geometry. In DARPA98, pages 57-61, 1998.
[7] C. Poelman and T. Kanade. A paraperspective factorization method for shape and motion recovery. PAMI, 19(3):206-218, 1997.
[8] C. Tomasi and T. Kanade. Shape and motion from image streams under orthography: A factorization method. IJCV, 9(2):137-154, 1992.
[9] P. H. S. Torr and D. W. Murray. Outlier detection and motion segmentation. SPIE, 2059:432-443, 1993.