Factorization-based Non-rigid Shape Modeling and Tracking in Stereo-Motion

Yu Huang, Jilin Tu, Thomas S. Huang
IFP, U. of Illinois at Urbana-Champaign
{yuhuang, jilintu, [email protected]}

Abstract

In recent years, researchers have tackled the topic of augmenting "structure from motion" with stereo information. Unfortunately, nearly all stereo-motion algorithms assume that the scene is rigid. In [16] the factorization-based structure-from-motion method for rigid objects was proposed, and it was soon extended to non-rigid or deformable objects in [17]. In this paper we propose a framework for factorization-based non-rigid shape modeling and tracking in stereo-motion. We construct a measurement matrix consisting of the stereo-motion data captured from a stereo rig. Organized in a particular way, this matrix can be factored by Singular Value Decomposition (SVD) into the 3D basis shapes, their configuration weights, the rigid motion and the camera geometry. With this, the stereo correspondences can be inferred from motion correspondences, requiring only that a minimum of 3K point stereo correspondences (where K is the dimension of the shape basis space) are determined in advance. This framework retains the usage potential of rank constraints, while offering further advantages such as simpler correspondence and accurate reconstruction even with short image sequences. Results on synthesized data and real stereo sequences are given to demonstrate its performance.

1. Introduction

Tracking an object and recovering its 3D shape (especially non-rigid shape) from sequences of images are fundamental problems in computer vision. They have various applications such as scene modeling, robot navigation, object recognition and virtualized reality [1-2]. Traditionally there exist two vision-based methods for 3D reconstruction: visual motion and stereo vision. Visual motion, also called structure from motion, recovers 3D information from an image sequence acquired under relative motion between the camera and the object. Stereo vision (structure from stereo) estimates 3D structure by triangulation from two widely displaced views of the same object. Both methods depend on solving the notorious correspondence problem. This problem is relatively easy to handle in visual motion [16] because features such as corners, edges or junctions have strong temporal association even without any prior knowledge of the dynamic model; however, a long image sequence is still required for accurate 3D reconstruction. Comparatively, stereo vision enjoys a much easier reconstruction task by triangulation, but the difficulty is transferred to an ill-posed stereo correspondence task, even with the epipolar constraints [5].

In structure from motion, Tomasi and Kanade [16] proposed one of the most influential approaches, the factorization method for rigid objects under orthographic projection. Many extensions have been put forward: Poelman and Kanade [14] extended it to the para-perspective or affine camera; Triggs [18] generalized this technique to full perspective projection; Costeira and Kanade relaxed the rigidity constraints in a multi-body factorization method [4]; Irani and Anandan [9] proposed covariance-weighted factorization, which factors noisy feature correspondences with a high degree of directional uncertainty into structure and motion.

Previously, 3D non-rigid motion and structure recovery relied on assumptions about the scene, for example a deformable model or minimal deviation from rigid body motion. Bregler et al. [17] proposed the first factorization-based framework for 3D reconstruction of non-rigid or deformable objects. Shortly afterwards, Brand [1] proposed a flexible factorization approach that minimizes the deformations relative to the mean shape by introducing an optimal correction matrix.

1.1 Stereo-Motion

Researchers have tackled the topic of augmenting "structure from motion" with stereo information. Some works are feature-based [5], while others follow the "direct" method using spatial and temporal image gradient information [15]. The notable problem is how to fully utilize the redundant information in stereo-motion analysis, but practically the more important issue is how to make the two basic cues complement each other.

Unfortunately, we find that nearly all stereo-motion algorithms assume that the scene is rigid [5, 7, 12, 15]. An impressive stereo-motion analysis framework based on factorization under affine projection was proposed in [7]: the measurement matrix of the stereo-motion data, organized in a particular way in which all of the motion correspondences and a part of the stereo correspondences have been established, can be decomposed into the 3D structure of the scene, the camera parameters, the rigid motion and the stereo geometry. With this, the remaining stereo correspondences can be inferred. Only a few stereo-motion papers have considered non-rigid motion. Liao et al. [10] used a relaxation-based algorithm to co-operatively match features in both the temporal and spatial domains. In [11] a grid acting as a deformable model was used for 3D non-rigid structure recovery. Carceroni and Kutulakos [2] presented an algorithm that computes the motion and shape of dynamic surface elements to represent the spatio-temporal structure of a scene observed by multiple cameras under known lighting conditions. Vedula et al. [19] considered computation of dense non-rigid scene flow from the optic flow of stereo image sequences, assuming various instantaneous properties of the scene are known. Caspi and Irani [3] discuss how to align two image sequences captured by a stereo rig moving jointly in space, under the strict constraint that the two uncalibrated video cameras have approximately the same center of projection but different 3D orientations. Neumann and Aloimonos [13] put forward a method to compute 3D motion fields on a non-rigid face using multiple (eleven) calibrated cameras; in a global optimization step the head is fitted to the spatio-temporal contour and multi-camera stereo data derived from all images. Although some researchers have recently discussed human face tracking from a stereo rig, such as the work in [20], they only tracked the head pose/orientation with the implicit assumption that the head is a rigid body.

1.2 Overview

Our motivations come from the work in [17] and [7]. We propose a stereo-motion analysis framework for 3D non-rigid shape recovery and tracking based on factorization. After performing singular value decomposition (SVD) on the well-organized stereo-motion measurement matrix, we can factorize it into 3D basis shapes, their configuration weights, the stereo geometry and the rigid motion parameters. Based on this, we can infer stereo correspondences from motion correspondences, requiring only that at least 3K point stereo correspondences (where K is the dimension of the shape basis space) are established by classical stereo matching techniques, such as epipolar constraints. This framework naturally retains the usage potential of rank constraints as in [17]. It is an extension of the work in [7] to non-rigid objects, so advantages such as simpler correspondence and accurate reconstruction even with short sequences are preserved.

Sect. 2.1 reviews the factorization work for the non-rigid motion model in [17]. Our extension to stereo-motion is described in Sect. 2.2. In Sect. 2.3 we discuss how to infer stereo correspondences. Sect. 3 provides our experimental results on synthesized data and real sequences.

2. Stereo-Motion Analyses and Factorization

2.1 Non-rigid Motion Model

First we describe the factorization framework for non-rigid motion proposed in [17]. The shape of the non-rigid object is described by a key-frame basis set $S_1, S_2, \ldots, S_K$. Each key-frame $S_i$ is a 3xP matrix describing P points. The shape of a specific configuration $S_t$ at time frame t is a linear combination of the basis set:

$$S_t = \sum_{i=1}^{K} l_{t,i} S_i, \qquad S_t, S_i \in \mathbb{R}^{3 \times P}, \; l_{t,i} \in \mathbb{R}. \tag{1}$$

Assume a weak-perspective (scaled orthographic) model for the camera projection process. The 2D image points $(u_{t,i}, v_{t,i})$ are related to the 3D points of a configuration $S_t$ at a specific time frame t by

$$\begin{bmatrix} u_{t,1} & \cdots & u_{t,P} \\ v_{t,1} & \cdots & v_{t,P} \end{bmatrix} = R'_t \left( \sum_{i=1}^{K} l_{t,i} S_i \right) + T'_t, \tag{2}$$

$$R'_t = \begin{bmatrix} r_1 & r_2 & r_3 \\ r_4 & r_5 & r_6 \end{bmatrix}, \tag{3}$$

where $R'_t$ (2x3) contains the first two rows of the full 3D rigid rotation matrix $R_t$, and $T'_t$ is the 2D rigid translation vector (it consists of the first two components of the 3D translation vector $T_t$). The weak-perspective scaling is implicitly coded in $l'_{t,1}, \ldots, l'_{t,K}$. We can eliminate $T'_t$ by subtracting the mean of all 2D image points, and then assume that $S_t$ is centered at the origin. We can rewrite the linear combination in (2) as a matrix multiplication:

$$\begin{bmatrix} u_{t,1} & \cdots & u_{t,P} \\ v_{t,1} & \cdots & v_{t,P} \end{bmatrix} = \begin{bmatrix} l'_{t,1} R'_t & \cdots & l'_{t,K} R'_t \end{bmatrix} \cdot \begin{bmatrix} S_1 \\ S_2 \\ \vdots \\ S_K \end{bmatrix}. \tag{4}$$

Stacking all point tracks over the whole sequence into a large measurement matrix W, we can write

$$W = \underbrace{\begin{bmatrix} l'_{1,1} R'_1 & \cdots & l'_{1,K} R'_1 \\ l'_{2,1} R'_2 & \cdots & l'_{2,K} R'_2 \\ \vdots & \ddots & \vdots \\ l'_{N,1} R'_N & \cdots & l'_{N,K} R'_N \end{bmatrix}}_{Q'} \cdot \underbrace{\begin{bmatrix} S_1 \\ S_2 \\ \vdots \\ S_K \end{bmatrix}}_{B}. \tag{5}$$

Here the 2Nx3K matrix Q' contains for each time frame t the pose $R'_t$ and the configuration weights $l'_{t,1}, \ldots, l'_{t,K}$, and the 3KxP matrix B codes the K key-frame basis shapes $S_i$. In the noise-free case the rank of W is r ≤ 3K, so this factorization can be realized using SVD, i.e. $W = U \Sigma V^T = \hat{Q}' \cdot \hat{B}$, keeping only the first r singular values and singular vectors.
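In code, this truncated-SVD factorization step is straightforward. Below is a minimal NumPy sketch (our own illustration, not the authors' implementation), assuming the 2NxP measurement matrix W has already been assembled from centered feature tracks:

```python
import numpy as np

def factor_measurement_matrix(W, K):
    """Factor a 2N x P measurement matrix W into Q_hat (2N x 3K)
    and B_hat (3K x P) by truncated SVD, as in equation (5).
    In the noise-free case rank(W) <= 3K, so only the first 3K
    singular values/vectors are kept."""
    U, s, Vt = np.linalg.svd(W, full_matrices=False)
    r = 3 * K
    sqrt_s = np.sqrt(s[:r])            # split singular values evenly
    Q_hat = U[:, :r] * sqrt_s          # 2N x 3K pose/weight factor
    B_hat = sqrt_s[:, None] * Vt[:r]   # 3K x P shape-basis factor
    return Q_hat, B_hat

# Toy usage: N = 10 frames, P = 50 points, K = 2 basis shapes.
W = np.random.randn(2 * 10, 50)
Q_hat, B_hat = factor_measurement_matrix(W, K=2)
assert Q_hat.shape == (20, 6) and B_hat.shape == (6, 50)
```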

The next step is to extract the pose $R'_t$ and the shape basis weights $l'_{t,1}, \ldots, l'_{t,K}$ from the matrix $\hat{Q}'$. Each block $\hat{Q}'_t$ of $\hat{Q}'$ can be written as (dropping the time index for convenience) [17]

$$\hat{Q}'_t = \begin{bmatrix} l'_1 R' & \cdots & l'_K R' \end{bmatrix} = \begin{bmatrix} l'_1 r'_1 & l'_1 r'_2 & l'_1 r'_3 & \cdots & l'_K r'_1 & l'_K r'_2 & l'_K r'_3 \\ l'_1 r'_4 & l'_1 r'_5 & l'_1 r'_6 & \cdots & l'_K r'_4 & l'_K r'_5 & l'_K r'_6 \end{bmatrix}.$$

The elements of $\hat{Q}'_t$ can be reordered into a new matrix:

$$\bar{Q}'_t = \begin{bmatrix} l'_1 r'_1 & l'_1 r'_2 & l'_1 r'_3 & l'_1 r'_4 & l'_1 r'_5 & l'_1 r'_6 \\ l'_2 r'_1 & l'_2 r'_2 & l'_2 r'_3 & l'_2 r'_4 & l'_2 r'_5 & l'_2 r'_6 \\ \vdots & & & & & \vdots \\ l'_K r'_1 & l'_K r'_2 & l'_K r'_3 & l'_K r'_4 & l'_K r'_5 & l'_K r'_6 \end{bmatrix} = \begin{bmatrix} l'_1 \\ l'_2 \\ \vdots \\ l'_K \end{bmatrix} \cdot \begin{bmatrix} r'_1 & r'_2 & r'_3 & r'_4 & r'_5 & r'_6 \end{bmatrix}, \tag{6}$$
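Since the reordered matrix in (6) is the outer product of the weight vector and the rotation entries, one SVD per frame recovers both, up to the usual sign and scale ambiguity. A minimal NumPy sketch of this extraction step, following [17] (variable names and the sign convention are our own):

```python
import numpy as np

def extract_pose_and_weights(Qt_hat, K):
    """Given the 2 x 3K block Qt_hat = [l'_1 R' ... l'_K R'] for one
    frame, reorder it into the K x 6 matrix of equation (6) and
    factor it (rank 1) into weights l' (K,) and rotation rows R' (2x3)."""
    Q_bar = np.zeros((K, 6))
    for i in range(K):
        block = Qt_hat[:, 3 * i: 3 * i + 3]   # 2x3 block l'_i R'
        Q_bar[i, :3] = block[0]               # l'_i [r1 r2 r3]
        Q_bar[i, 3:] = block[1]               # l'_i [r4 r5 r6]
    U, s, Vt = np.linalg.svd(Q_bar)
    l = U[:, 0] * s[0]                        # weights, up to sign/scale
    r = Vt[0]                                 # [r1 ... r6], up to sign/scale
    # Fix the scale so the first rotation row has unit norm.
    scale = np.linalg.norm(r[:3])
    l, r = l * scale, r / scale
    # Resolve the sign ambiguity (here: force r1 >= 0, our choice).
    if r[0] < 0:
        l, r = -l, -r
    return l, r.reshape(2, 3)
```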

which shows that $\bar{Q}'_t$ has rank 1 and can also be factored by SVD. Because this factorization is not unique, there exists an invertible matrix G that ortho-normalizes all of the sub-blocks $\hat{Q}'_t$, leading to the alternative factorization

$$Q' = \hat{Q}' \cdot G, \qquad B = G^{-1} \cdot \hat{B}. \tag{7}$$

Irani exploited rank constraints in [8] for optic flow estimation in the case of rigid motion. Building on this technique, a framework for robust tracking can be set up as follows [17]. Since the matrix W is assumed to have rank r, all P columns of W are constrained to be linear combinations of r "basis tracks", the columns of Q'. A basis of the r-dimensional subspace can be set up as long as a minimum of r linearly independent columns of W are available; all the other columns of W can then be inferred from this basis plus additional local image constraints. As in the original Lucas-Kanade tracking, the local patch flow [u, v] is computed by solving

$$\begin{bmatrix} u_{t,i} & v_{t,i} \end{bmatrix} \cdot \begin{bmatrix} \sum I_x^2 & \sum I_x I_y \\ \sum I_x I_y & \sum I_y^2 \end{bmatrix} = \begin{bmatrix} \sum I_x I_t & \sum I_y I_t \end{bmatrix}. \tag{8}$$

The trick is that all the NxP flow vectors over the whole sequence are coded relative to one reference frame (the template). Hence we can write [8]

$$\begin{bmatrix} U & V \end{bmatrix} \cdot \begin{bmatrix} II_{xx} & II_{xy} \\ II_{xy} & II_{yy} \end{bmatrix} = \begin{bmatrix} II_{xt} & II_{yt} \end{bmatrix}, \tag{9}$$

where the $II_{pq}$ (p, q = x, y) are PxP diagonal matrices built from $I_x$, $I_y$ for all P local image patches [8], and the $II_{pq}$ (p = x, y; q = t) are NxP matrices built from $I_x$, $I_y$, $I_t$ for all local patches.

Then we split $\hat{Q}'$ into $\hat{Q}'_u$ and $\hat{Q}'_v$, where $\hat{Q}'_u$ contains all odd rows and $\hat{Q}'_v$ contains all even rows. As Q' is a basis of the tracks in W, suppose we can find a matrix $\hat{B}$ which satisfies

$$\hat{Q}'_u \hat{B} = U, \qquad \hat{Q}'_v \hat{B} = V. \tag{10}$$

Substituting (10) into (9), we get

$$\begin{bmatrix} \hat{Q}'_u \hat{B} & \hat{Q}'_v \hat{B} \end{bmatrix} \cdot \begin{bmatrix} II_{xx} & II_{xy} \\ II_{xy} & II_{yy} \end{bmatrix} = \begin{bmatrix} II_{xt} & II_{yt} \end{bmatrix}. \tag{11}$$

For a long sequence (N >> P) this system is highly over-constrained in the entries of $\hat{B}$. It can also be exploited to handle occlusion when the number of missing points is not overly large: the product $\hat{Q}'_u \hat{B}$ (and likewise $\hat{Q}'_v \hat{B}$) provides a good prediction of the displacements of missing points at certain time frames, as long as the disappearing points are visible in enough frames [17].
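Equation (11) is linear in the unknown entries of $\hat{B}$ and decouples into one small least-squares problem per tracked point. A minimal sketch of this solve (our own formulation; the exact bookkeeping of the gradient sums in [8] differs slightly, and all names are illustrative):

```python
import numpy as np

def solve_basis_coefficients(Qu, Qv, IIxx, IIxy, IIyy, IIxt, IIyt):
    """Solve equation (11) for B_hat (r x P) in the least-squares
    sense, one point at a time.
    Qu, Qv           : N x r   odd/even rows of Q_hat'
    IIxx, IIxy, IIyy : N x P   per-frame, per-patch gradient sums
    IIxt, IIyt       : N x P   per-frame, per-patch gradient-time sums"""
    N, r = Qu.shape
    P = IIxt.shape[1]
    B_hat = np.zeros((r, P))
    for j in range(P):
        # Stack the x- and y-flow constraints for point j.
        M = np.vstack([
            IIxx[:, j, None] * Qu + IIxy[:, j, None] * Qv,
            IIxy[:, j, None] * Qu + IIyy[:, j, None] * Qv,
        ])                                    # (2N) x r
        rhs = np.concatenate([IIxt[:, j], IIyt[:, j]])
        B_hat[:, j], *_ = np.linalg.lstsq(M, rhs, rcond=None)
    return B_hat
```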

2.2 Stereo-Motion Model

Figure 1. Non-rigid Object in Stereo-motion.

Below we also utilize the rank constraints to help stereo correspondence. As shown in Figure 1, let (R, T) be the rotational and translational relationship between the stereo cameras. Under a scaled orthographic camera model we can again assume the shape is centered at the origin. Therefore the translation T can be dropped from the shape relationship, since the depth component of the translation only affects the scale factor and the image-plane component is eliminated by centering. The 3D coordinates of any point with respect to the two camera coordinate frames, $S_l$ and $S_r$, and the corresponding shape bases, $S_{l,i}$ and $S_{r,i}$ (i = 1, 2, ..., K), are thus related by

$$S_r = R \cdot S_l, \qquad S_{r,i} = R \cdot S_{l,i}, \quad i = 1, 2, \ldots, K. \tag{12}$$

Now we rewrite (5), making the weak-perspective scaling explicit, as

$$W = \underbrace{\begin{bmatrix} F_1 & & \\ & \ddots & \\ & & F_N \end{bmatrix}}_{F} \cdot \underbrace{\begin{bmatrix} l_{1,1} R_1 & \cdots & l_{1,K} R_1 \\ l_{2,1} R_2 & \cdots & l_{2,K} R_2 \\ \vdots & \ddots & \vdots \\ l_{N,1} R_N & \cdots & l_{N,K} R_N \end{bmatrix}}_{Q} \cdot \underbrace{\begin{bmatrix} S_1 \\ S_2 \\ \vdots \\ S_K \end{bmatrix}}_{B}, \tag{13}$$

where $F_t$ is the 2x3 scaled orthographic projection matrix given by

$$F_t = s_t \begin{bmatrix} 1 & 0 & 0 \\ 0 & 1 & 0 \end{bmatrix}, \tag{14}$$

with $s_t$ the scale factor at time frame t. Applying the non-rigid motion model to the two cameras separately, one obtains two image measurement matrices:

$$W_l = F_l Q_l B_l, \qquad W_r = F_r Q_r B_r. \tag{15}$$

Because the shape is centered at the origin, we can omit the translation component in the relationship between the two rigid-motion representations for the two camera coordinate frames and consider only the rotation components, $R_{r,t} R = R R_{l,t}$ (a derivation is given in the Appendix). Consequently, we can write

$$\Psi = \begin{bmatrix} W_l \\ W_r \end{bmatrix} = \begin{bmatrix} F_l Q_l B_l \\ F'_r \tilde{E} Q_l B_l \end{bmatrix} = \underbrace{\begin{bmatrix} F_l & \\ & F'_r \end{bmatrix}}_{H} \underbrace{\begin{bmatrix} I_{3N} \\ \tilde{E} \end{bmatrix}}_{E} Q_l B_l, \tag{16}$$

where $F'_r$ has absorbed the scaling change of $F_r$ due to the translation T, and the 3Nx3N matrix $\tilde{E}$ is block-diagonal:

$$\tilde{E} = \begin{bmatrix} R & & \\ & \ddots & \\ & & R \end{bmatrix}. \tag{17}$$

Equation (16) represents the decomposition of the stereo-motion correspondences into the 3D structure $B_l$, the rigid motion and shape basis weights $Q_l$, the stereo geometry E and the camera parameters H. Obviously, like $W_l$ and $W_r$, $\Psi$ has rank at most 3K: rank($\Psi$) ≤ 3K. Based on this rank property, stereo matching can be inferred from motion correspondences, as shown below.

Bregler et al. formulated in [17] an extension of the factorization technique to multiple views, with the additional assumption that all views share the same deformation coefficients at a particular time instant. Unfortunately this assumption is quite strict: as observed above, the shape basis weights $l'_{t,1}, \ldots, l'_{t,K}$ in equation (4) implicitly code the weak-perspective scaling, and it is not easy to guarantee that the scaling factors of multiple views are approximately equal. Moreover, in that extension the stereo correspondences must be known prior to factorization.
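As a sanity check on (16), the rank bound rank($\Psi$) ≤ 3K is easy to verify numerically. The following sketch (our own illustration, under noise-free scaled-orthographic assumptions with unit scale) synthesizes stereo-motion data per the model above and inspects the singular values of the stacked matrix:

```python
import numpy as np

rng = np.random.default_rng(0)
N, P, K = 20, 50, 3                       # frames, points, basis shapes

def random_rotation(rng):
    # QR of a Gaussian matrix gives a random orthogonal matrix.
    Q, _ = np.linalg.qr(rng.standard_normal((3, 3)))
    return Q * np.sign(np.linalg.det(Q))  # force det = +1

S = rng.standard_normal((K, 3, P))        # key-frame basis shapes
R_lr = random_rotation(rng)               # stereo rotation R

Wl, Wr = [], []
for t in range(N):
    l = rng.standard_normal(K)            # configuration weights
    St = np.tensordot(l, S, axes=1)       # 3 x P shape at frame t
    pts_l = random_rotation(rng) @ St     # rigid motion, left frame
    pts_r = R_lr @ pts_l                  # same points, right frame
    Wl.append(pts_l[:2])                  # orthographic projection
    Wr.append(pts_r[:2])

Psi = np.vstack(Wl + Wr)                  # 4N x P stereo-motion matrix
sv = np.linalg.svd(Psi, compute_uv=False)
print("numerical rank:", np.sum(sv > 1e-9 * sv[0]))  # expect <= 3K = 9
```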

2.3 Stereo Matching Inference

A stereo rig of cameras observes an object undergoing rigid and deformable motion, during which N pairs of images are taken. Distinct feature points are extracted from the stereo image sequences, and in each sequence they are tracked separately using a motion correspondence method. At this point the stereo correspondences have not yet been established, while the estimated dense motion correspondences are assumed to be mostly correct. From these motion correspondences, the measurement matrices $W_l^*$ and $W_r^*$ can be constructed; unlike $W_l$ and $W_r$, their columns have not been put into correspondence.

As $\Psi$ has rank at most 3K, a basis of the 3K-dimensional column subspace can be set up as long as a minimum of 3K linearly independent columns of $\Psi$ are available; all the other columns of $\Psi$ can then be inferred from this basis. Suppose k matches (k ≥ 3K) are obtained by some stereo correspondence technique with epipolar constraints (to simplify the 1D search along the epipolar line, image rectification can be performed prior to stereo matching). The corresponding columns of $W_l^*$ and $W_r^*$ are stacked into a 4Nxk sub-matrix $\Psi_k$, whose SVD is $\Psi_k = U_k \Sigma_k V_k^T$. The first 3K' columns of $U_k$ form the optimal basis of the 3K'-dimensional vector subspace (note K' is the estimated number of shape bases, which may differ from the true number K). Let $a_1, a_2, \ldots, a_{3K'}$ be the extracted basis vectors of the column space of $\Psi$ and let the 4Nx3K' matrix $A = [a_1, a_2, \ldots, a_{3K'}]$; then any column v of $\Psi$ is a linear combination of the columns of A. Let the two 2Nx3K' matrices $A_l$ and $A_r$ be the top and bottom halves of A, so that the rows of $A_l$ correspond to $W_l^*$ and the rows of $A_r$ to $W_r^*$. For a column $v_l$ of $W_l^*$, its stereo correspondence $v_r$ in $W_r^*$ can be predicted from $A_l$ and $A_r$ as:

$$v_r = (A_r A_l^+) v_l, \tag{18}$$

where $A_l^+$ is the pseudo-inverse of $A_l$, given by

$$A_l^+ = (A_l^T A_l)^{-1} A_l^T. \tag{19}$$

The predicted result may not be exact due to noise, so a measure for feature matching is needed. Normally we calculate the least-mean-squares error (LMSE) between the predicted and observed positions over the entire image sequence. However, even when this measure is small we cannot guarantee a correct stereo match, so an additional measure based on windowed template matching is also taken into account, i.e. the average normalized correlation must be high enough [7]; if not, the image feature is ignored. Finally, all the inferred stereo correspondences are grouped together to re-estimate the basis A, which should then be more accurate; this process can be iterated until convergence.

Once all the stereo correspondences are obtained, we reconstruct the 3D deformable shape via triangulation from the views of the calibrated stereo cameras. Consequently, we can compute the 3D shape basis from the measurement matrix of 3D point positions by factorization, similar to (5), and then extract the pose parameters and shape basis configuration weights by the rank-1 constraints of (6). Different from (5), this time we can extract all nine components of the rotation matrix rather than only the top two rows. Recovering the pose $R_t$ and the original configuration weights $l_{t,1}, \ldots, l_{t,K}$ realizes 3D non-rigid tracking.
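The whole inference step of (18)-(19) reduces to a subspace projection. The sketch below (our own illustration with hypothetical names; the correlation check of [7] is only indicated by a comment) builds the basis A from k seed matches and predicts, for each left-image track, its corresponding right-image track:

```python
import numpy as np

def infer_stereo_correspondences(Wl_star, Wr_star, seed_l, seed_r, K_prime):
    """Infer stereo correspondences from motion correspondences.
    Wl_star, Wr_star : 2N x P_l and 2N x P_r unordered track matrices
    seed_l, seed_r   : indices of k >= 3K' pre-matched columns
    Returns, for each left column, the index of the best right column."""
    Psi_k = np.vstack([Wl_star[:, seed_l], Wr_star[:, seed_r]])  # 4N x k
    U, _, _ = np.linalg.svd(Psi_k, full_matrices=False)
    A = U[:, :3 * K_prime]                # basis of the column space
    N2 = Wl_star.shape[0]
    A_l, A_r = A[:N2], A[N2:]             # top / bottom halves
    M = A_r @ np.linalg.pinv(A_l)         # prediction map of eq. (18)
    matches = []
    for j in range(Wl_star.shape[1]):
        v_r_pred = M @ Wl_star[:, j]      # predicted right-image track
        # Match to the closest right track (LMSE over the sequence);
        # a normalized-correlation check [7] would follow in practice.
        errs = np.linalg.norm(Wr_star - v_r_pred[:, None], axis=0)
        matches.append(int(np.argmin(errs)))
    return matches
```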

3. Experiments

3.1 Synthetic Data

In our experiments, we randomly generate 3D basis shapes, their corresponding weights and rigid (rotation) motions to create ground-truth data of multiple features in multiple stereo frames. The object basis shapes are generated by sampling points uniformly inside a cube of size 60cm centered at the origin. The first basis shape is given the largest weight (the sum of all weights is equal to one) in order to guarantee that the overall shape has a strong rigid component. The three Euler angles of each rotation matrix are sampled uniformly in [-30˚, 30˚], and the generated object points are imaged by the stereo cameras under full perspective and quantized to pixels in each image.
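For concreteness, a condensed sketch of this data-generation protocol follows (our own reconstruction of the described setup; the pixel focal length and the Euler-angle convention are assumptions, not from the paper):

```python
import numpy as np

rng = np.random.default_rng(42)
N, P, K = 30, 100, 5

# Basis shapes: points uniform in a 60cm cube centered at the origin.
S = rng.uniform(-30, 30, size=(K, 3, P))            # units: cm

def euler_rotation(a):
    """Rotation matrix from three Euler angles (radians), ZYX order."""
    cz, sz = np.cos(a[0]), np.sin(a[0])
    cy, sy = np.cos(a[1]), np.sin(a[1])
    cx, sx = np.cos(a[2]), np.sin(a[2])
    Rz = np.array([[cz, -sz, 0], [sz, cz, 0], [0, 0, 1]])
    Ry = np.array([[cy, 0, sy], [0, 1, 0], [-sy, 0, cy]])
    Rx = np.array([[1, 0, 0], [0, cx, -sx], [0, sx, cx]])
    return Rz @ Ry @ Rx

frames = []
for t in range(N):
    w = rng.uniform(0, 1, K); w[0] = 0.0
    w = 0.3 * w / w.sum(); w[0] = 0.7               # dominant rigid part
    angles = np.deg2rad(rng.uniform(-30, 30, 3))
    pts = euler_rotation(angles) @ np.tensordot(w, S, axes=1)
    pts[2] += 450                                   # ~4.5m from the rig
    f = 1500                                        # focal length in pixels (assumed)
    uv = f * pts[:2] / pts[2]                       # full perspective
    uv += rng.normal(0, 1, uv.shape)                # 1-pixel Gaussian noise
    frames.append(np.round(uv))                     # quantize to pixels
```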

Figure 2. Errors with respect to (a) N and (b) K'.

To make the weak-perspective model feasible, we make the depth variation between the closest and furthest object points much smaller than the distance from the stereo camera rig. Meanwhile, the two optic axes of the stereo cameras converge at the 3D centroid of the object points (i.e. the fixation point). Gaussian noise of zero mean and 1-pixel variance is added to the final measurement matrix (on top of the quantization error). In the stereo vision configuration shown in Figure 1, the deformable object is located about 4.5m away from the stereo cameras, which are 60cm apart. Each camera has a focal length of 15m and a resolution of 512x512. The key issue of the proposed method is the accuracy of the inferred stereo correspondences. The averaged errors (i.e. the average Euclidean distance between the inferred and true feature image positions) are given in Figures 2 and 3 as the sequence length N and the estimated number of shape bases K' vary. In all experiments, the number of points in the shape basis is P = 100. In Figure 2(a), K' = K = 5, k = 15 and the biggest weight (for the mean shape) is 0.7. We found the error grows much larger when N < 8, so that part of the results is not shown. Note the error is less than 2 pixels when N is bigger than 20, which is mainly due to the added Gaussian noise and the quantization error. In Figure 2(b), K = 10 and the biggest weight for the mean shape is 0.4; we let k = 3K' as K' varies. The error decreases as K' increases.

Figure 3. Performance with respect to N: (a) correctly matched points, (b) relative error percentage.

We also calculate the number of correctly matched features and the average relative reconstruction error (divided by the true value) as N varies; the results are shown in Figures 3(a) and 3(b), with K = K' = 5 and k = 20. As N increases, the number of correctly matched points increases while the relative error decreases.

3.2 Real Data

In the experimental setup, two digital video cameras are mounted vertically and connected to a PC through 1394 links. The human face recordings in the collected videos are captured at a resolution of 320x240 and 30 frames per second. They contain rigid head motions and non-rigid eye/eyebrow/mouth facial motions. It is difficult to estimate optical flow for facial motions using traditional gradient-based or template-matching methods because the facial surface is smooth and its motion is non-rigid and subject to environmental illumination change. We therefore use a Bezier-volume model-based face tracker to obtain the optical flow around the face area [6]. For each camera, we track the facial motion using an independent face tracker with a dense 3D geometrical mesh model.

The first experiment is to reconstruct the facial structure from rigid facial motions. In the videos, the human head moves up and backward within 30 frames. A pair of stereo images with the tracked points depicted is shown in Figure 4. Since the face trackers are applied independently to the video sequences of the two cameras, we do not know the correspondences between the mesh points of the face models used by the two trackers, except for the points at the eye corners and mouth corners. We identify these points as distinct feature points (shown in red), and the correspondences of the remaining points are inferred using the bases factorized from the optical flow vectors of these distinct feature points. In the rigid motion case, we take the number of bases K = 3. Figure 5 shows the found correspondences of the optical flows estimated by the two face trackers. The red trajectories are the optical flow of the mesh points mapped from the upper camera view to the lower camera view using equation (18); the green trajectories show the corresponding mesh-point trajectories found in the lower-camera video. After the correspondence is established, the 3D facial geometric structure at each time instant can be reconstructed. Figure 6 shows the reconstructed mesh points in 3D space.

Figure 4. Tracking result for rigid motion: (a) upper camera, (b) lower camera.

In order to verify our approach on non-rigid motion, we further selected a stereo video sequence in which the subject opens the mouth within 8 frames. As shown in Figure 7, the distinct facial features (depicted in red) are the eye corners, mouth corners, nostrils, and the centers of the upper and lower lips. As the non-rigid motion only contains the mouth opening, we take K = 6 in this case. The found correspondences of the optical flow trajectories are shown in Figure 8; it can be seen that most of the found trajectories are caused by the opening mouth. The reconstructed 3D facial geometric structure is shown in Figure 9, where the purple dots are the reconstructed 3D points.

Figure 5. Corresponding optical flow trajectories.
Figure 6. The reconstructed points in 3D space.

4. Conclusion and Future Work

We have presented a framework for recovering 3D non-rigid shape and motion viewed from calibrated stereo cameras. The approach is factorization-based, so it naturally retains the usage potential of rank constraints. Meanwhile, it provides a mechanism for inferring stereo correspondences from motion correspondences, requiring only that a minimum of 3K point stereo correspondences are obtained initially. The combination of motion and stereo cues offers advantages such as simpler stereo correspondence and accurate reconstruction even with short sequences. Experimental results on synthesized data and real stereo sequences demonstrate the method's performance. Future work will address outlier detection for "robust" factorization, and model-based tracking along with model refinement.

Figure 7. Tracking results for non-rigid motion: (a) upper camera, (b) lower camera.

Acknowledgement

We thank Dr. Zhengyou Zhang of Microsoft Research for allowing us to use the test stereo sequences.

Figure 8. Optical flow trajectory correspondence.

Figure 9. The reconstructed face.

References

[1] M. E. Brand, "Morphable 3D models from video", IEEE CVPR'01, December 2001.
[2] R. L. Carceroni and K. N. Kutulakos, "Multi-View Scene Capture by Surfel Sampling: From Video Streams to Non-Rigid 3D Motion, Shape and Reflectance", ICCV'01, June 2001.
[3] Y. Caspi and M. Irani, "Alignment of Non-Overlapping Sequences", IEEE ICCV'01, July 2001.
[4] J. Costeira and T. Kanade, "A Multibody Factorization Method for Independently Moving Objects", IJCV, 29(3): 159-179, Sept. 1998.
[5] F. Dornaika and R. Chung, "Stereo Correspondence from Motion Correspondence", IEEE CVPR'99, pp. 70-75, 1999.
[6] H. Tao and T. S. Huang, "Explanation-based facial motion tracking using a piecewise Bezier volume deformation model", CVPR'99, 1999.
[7] P. K. Ho and R. Chung, "Stereo-Motion that Complements Stereo and Motion Analysis", CVPR'97, pp. 213-218, 1997.
[8] M. Irani, "Multi-Frame Optical Flow Estimation Using Subspace Constraints", IEEE ICCV, September 1999.
[9] M. Irani and P. Anandan, "Factorization with Uncertainty", ECCV'00, June 2000.
[10] W.-H. Liao, S. J. Aggarwal, and J. K. Aggarwal, "The reconstruction of dynamic 3D structure of biological objects using stereo microscope images", Machine Vision and Applications, 9:166-178, 1997.
[11] S. Malassiotis and M. G. Strintzis, "Model-based joint motion and structure estimation from stereo images", CVIU, 65(1):79-94, 1997.
[12] R. Mandelbaum, G. Salgian, and H. Sawhney, "Correlation-based estimation of ego-motion and structure from motion and stereo", IEEE ICCV'99, 1999.
[13] J. Neumann and Y. Aloimonos, "Spatio-temporal stereo using multi-resolution subdivision surfaces", IJCV, 47(1): 181-193, 2002.
[14] C. J. Poelman and T. Kanade, "A Paraperspective Factorization Method for Shape and Motion Recovery", ECCV'94, pp. 97-108, 1994.
[15] G. Stein and A. Shashua, "Direct estimation of motion and extended scene structure for a moving stereo rig", Proc. IEEE CVPR'98, 1998.
[16] C. Tomasi and T. Kanade, "Shape and Motion from Image Streams under Orthography: a Factorization Method", IJCV, vol. 9, no. 2, pp. 137-154, 1992.
[17] L. Torresani, D. Yang, G. Alexander, and C. Bregler, "Tracking and Modelling Non-Rigid Objects with Rank Constraints", Proc. IEEE CVPR, 2001.
[18] B. Triggs, "Factorization Methods for Projective Structure and Motion", Proc. IEEE CVPR'96, pp. 845-851, 1996.
[19] S. Vedula, S. Baker, P. Rander, R. Collins, and T. Kanade, "Three-Dimensional Scene Flow", Proc. ICCV'99, pp. 722-729, 1999.
[20] R. Yang and Z. Zhang, "Eye Gaze Correction with Stereovision for Video Tele-Conferencing", ECCV'02, 2002.
[21] Z. Zhang et al., "A Robust Technique for Matching Two Uncalibrated Images Through the Recovery of the Unknown Epipolar Geometry", Artificial Intelligence Journal, 78:87-119, October 1995.

Appendix

Because the shape is centered at the origin, we can omit the translation components in the rigid-motion representations of the two camera coordinate frames. Assume the relationship between them is

$$\begin{bmatrix} R_{r,t} & 0_{3\times1} \\ 0_{1\times3} & 1 \end{bmatrix} \begin{bmatrix} R & T \\ 0_{1\times3} & 1 \end{bmatrix} = \begin{bmatrix} R & T \\ 0_{1\times3} & 1 \end{bmatrix} \begin{bmatrix} R_{l,t} & 0_{3\times1} \\ 0_{1\times3} & 1 \end{bmatrix}.$$

So we have

$$\begin{bmatrix} R_{r,t} R & R_{r,t} T \\ 0_{1\times3} & 1 \end{bmatrix} = \begin{bmatrix} R R_{l,t} & T \\ 0_{1\times3} & 1 \end{bmatrix}.$$

Obviously,

$$R_{r,t} R = R R_{l,t}, \qquad R_{r,t} T = T,$$

from which we get

$$(R_{r,t} - I) T = 0. \tag{20}$$

Therefore T is an eigenvector of $R_{r,t}$ corresponding to the eigenvalue 1.0, i.e. T lies along the rotation axis of $R_{r,t}$. But a fixed T cannot satisfy (20) at every time instant t = 1, 2, ..., N (the rotation axis of $R_{r,t}$ generally changes over time) unless T is always equal to $0_{3\times1}$. So T does not influence the relationship between the structure representations in the two camera coordinate frames when both shapes are centered at their respective origins.
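This argument is easy to check numerically: the fixed vectors of a rotation matrix form its rotation axis, so two frames with different axes already force T = 0. A small sketch (our own, purely illustrative):

```python
import numpy as np

def rotation(axis, angle):
    """Rodrigues' formula: rotation by `angle` about unit `axis`."""
    axis = np.asarray(axis, float); axis /= np.linalg.norm(axis)
    K = np.array([[0, -axis[2], axis[1]],
                  [axis[2], 0, -axis[0]],
                  [-axis[1], axis[0], 0]])
    return np.eye(3) + np.sin(angle) * K + (1 - np.cos(angle)) * (K @ K)

R1 = rotation([0, 0, 1], 0.4)   # frame 1: rotation about the z axis
R2 = rotation([0, 1, 0], 0.3)   # frame 2: rotation about the y axis
for R in (R1, R2):
    # The eigenvector with eigenvalue 1 is the rotation axis.
    w, v = np.linalg.eig(R)
    axis = np.real(v[:, np.argmin(np.abs(w - 1))])
    print("unit-eigenvalue eigenvector:", np.round(axis, 3))
# Only T = 0 satisfies (R1 - I)T = 0 and (R2 - I)T = 0 simultaneously.
```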
