
2013 International Conference on Virtual Reality and Visualization

Anaglyph 3D Stereoscopic Visualization of 2D Video based on Fundamental Matrix
Zhihan LU∗†, Shafiq ur Réhman∗†, Muhammad Sikandar Lal Khan†, Haibo Li‡
∗ Shenzhen Institutes of Advanced Technology, Shenzhen, PRC. † i2lab., Tillämpad fysik och elektronik (TFE), Umeå University, Sweden. ‡ Royal Institute of Technology (KTH), Sweden. Email: (zhihan.lu, shafiq.urrehman, muhammad.sikandar.lal.khan)@umu.se

In this paper, a low-cost and simple 2D-to-3D visualization algorithm for anaglyph generation is presented. The proposed algorithm employs a pseudo-3D graphics approach based on stereoscopic 3D camera pose estimation in successive frames, considering fundamental-matrix-based homography correspondence. To strengthen the pseudo-3D effect, the final output images extracted from the 2D video are rendered in a stereoscopic 3D style based on a color code. Our approach makes a proper trade-off between rendering performance and computational complexity. Subjective evaluation shows that the proposed method provides good 3D perception, and viewers can feel the 3D depth of the generated video more strongly with yellow-blue glasses or other anaglyph rendering technology. The remainder of this paper is organized as follows. Section II provides an overview of related work and reviews homography-matrix-based camera pose estimation for plane correspondence in 2D-to-3D video conversion. Section III describes our approach to camera pose estimation using structure-from-motion techniques, with a description of 3D generation using the fundamental matrix. Concluding remarks are provided in Section IV.

Abstract—In this paper, we propose a simple anaglyph 3D stereo generation algorithm for 2D video sequences captured with a monocular camera. In our novel approach we employ a camera pose estimation method to generate stereoscopic 3D directly from 2D video without explicitly building a depth map. Our cost-effective method is suitable for arbitrary real-world video sequences and produces smooth results. We use image stitching based on plane correspondence using the fundamental matrix. To this end we also demonstrate that correspondence-plane image stitching based on the homography matrix alone cannot produce good results. Furthermore, we utilize the structure-from-motion (with fundamental matrix) based reconstructed camera pose model to accomplish the visual anaglyph 3D illusion. The proposed approach demonstrates very good performance for most video sequences.
Keywords—Anaglyph, 3D video, 2D to 3D conversion.

I. INTRODUCTION
Presenting two offset images separately to the left and right eyes creates an illusion of depth in the features of an image; this is known as 3D stereoscopic visualization. Nowadays the availability of 3DTV and 3D cinema has increased interest in 3D content rendering and visualization. Normally, 3D video rendering systems pay the penalty of increased bandwidth, and current 3D video capture and recording systems are much more expensive. It is also a fact that a tremendous amount of 2D content exists in both the commercial market and personal recordings, so it makes sense to devise simple and cost-effective methods to convert 2D content into 3D visualizations. The availability of 2D digital cameras in everyday consumer electronics (such as mobile phones and tablet PCs) calls for cost-effective and simple methods for 3D rendering and visualization of users' personally recorded 2D videos. Anaglyph images are used to encode stereo image pairs for viewing and are considered one of the most cost-effective methods of stereoscopic 3D visualization. The earliest attempt to construct a 3D image via the anaglyph stereo approach was demonstrated by W. Rollmann in 1853 [15]. The most common method for creating an anaglyph is to obtain two separate images from left and right viewpoints, superimpose and align them, and then remove complementary colors from each of the views [11]. Personal-video 2D-to-3D anaglyph visualization can be a simple and cost-effective option for affordable 2D capturing and rendering devices: the anaglyph output can be shown on a normal display, and viewers only need cheap anaglyph glasses to enjoy the content in 3D.
978-0-7695-5150-0/13 $26.00 © 2013 IEEE DOI 10.1109/ICVRV.2013.59
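The complementary-color composition described above can be sketched in a few lines. The snippet below is an illustrative example, not the paper's implementation: it composes a yellow-blue anaglyph (matching the glasses used later in the paper) by taking the red and green channels from the left view and the blue channel from the right view, assuming RGB channel order.

```python
import numpy as np

def yellow_blue_anaglyph(left, right):
    """Compose a yellow-blue anaglyph: the left view supplies the red and
    green channels (yellow filter), the right view supplies blue."""
    assert left.shape == right.shape and left.shape[2] == 3
    out = np.empty_like(left)
    out[..., 0] = left[..., 0]   # R from the left view
    out[..., 1] = left[..., 1]   # G from the left view
    out[..., 2] = right[..., 2]  # B from the right view
    return out

# Tiny synthetic stereo pair: the "right" view is the left view shifted
# horizontally by one pixel to mimic binocular disparity.
left = np.zeros((4, 8, 3), dtype=np.uint8)
left[:, 2:4] = 255                 # a white vertical bar
right = np.roll(left, 1, axis=1)   # shifted copy
ana = yellow_blue_anaglyph(left, right)
# Non-overlapping regions come out yellow (left only) or blue (right only);
# the overlap stays full white, as in the color-code scheme of Sec. III-E.
```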

II. RELATED WORK
Generally, for rendering and visualizing 3D video from 2D video, current commercial systems employ depth-image-based rendering and visualization. Depth information is not available from a monoscopic 2D view, and most stereoscopic 3D anaglyph generation methods reconstruct this depth information, hence they are termed depth estimation methods [1], [3], [13]. These depth-based methods need complex and time-consuming algorithms. Related research using homography to rectify images for generating stereoscopic 3D has shown the feasibility of creating stereoscopic 3D without a depth map [9]. Recently, [17] proposed a homography-based method in which two related images are aligned to the same image plane to generate a 3D anaglyph visualization of personal photos. The results are not accurate for continuous video sequences, but are good enough for generating a stereoscopic 3D image from two related static images. For an arbitrary real-world continuous video sequence, using the homography matrix directly generates non-smooth results that are too unstable to watch (as discussed in the following section). The most important reason is that

the homography matrix can only represent an approximate relation between two 3D spatial perspective views; over a series of continuous video frames, the accumulated imprecision leads to a non-smooth video effect. The randomness of RANSAC [16] is another reason, although improved results have recently been reported [6]. There has been a series of studies on how to use the fundamental matrix to rectify two images and stitch them [5]. These methods employ epipolar geometry and compute two resampling transformation matrices which make the epipolar lines parallel to the horizontal x-axis in both images, and then resample and stitch both images. These methods assume the epipolar lines are parallel to the x-axis, which sends the epipole to the point at infinity (1, 0, 0)^T; in this case at least one of the transformation matrices is an affine matrix. However, in the stereoscopic 3D video generation process, the original 2D video frame motion is not always a horizontal movement, so applying these methods directly makes the movie whirl around the horizontal epipolar line. In this paper, we propose a novel method which does not need to rotate the movie frames in order to stitch them. We use a simple and fast camera pose estimation method to reconstruct more accurate 3D information, using the 3D cues hidden in the edges of the RGB image instead of depth. Our alignment approach rectifies the two images into the same plane and sets a shift between them to match the distance between the two eyes. We employ uncrossed parallax (two views with parallel orientations) to generate the image behind the screen, creating a 3D stereo view for anaglyph coding and visualization.
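The rectified-geometry assumption mentioned above (epipolar lines parallel to the x-axis, epipole at infinity (1, 0, 0)^T) can be checked numerically. The following sketch uses invented values, not data from the paper: for a purely horizontal translation with no rotation, the essential matrix yields horizontal epipolar lines and an epipole proportional to (1, 0, 0)^T.

```python
import numpy as np

def skew(v):
    """Skew-symmetric cross-product matrix [v]_x."""
    return np.array([[0.0, -v[2], v[1]],
                     [v[2], 0.0, -v[0]],
                     [-v[1], v[0], 0.0]])

# Purely horizontal camera translation, no rotation (rectified setup).
t = np.array([1.0, 0.0, 0.0])
E = skew(t) @ np.eye(3)          # essential matrix E = [t]_x R with R = I

# The epipole is the right null vector of E: here it is (1, 0, 0)^T,
# i.e. at infinity along the x-axis, exactly as rectification assumes.
e = np.linalg.svd(E)[2][-1]

# Every epipolar line l = E x has a zero x-coefficient: horizontal lines.
x = np.array([0.3, -0.2, 1.0])   # an arbitrary normalized image point
l = E @ x
```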

Fig. 1. Left: the homography vs. fundamental matrix based visualization experiment platform. Right: the rotation-data analysis of the visualization experiment, based on the fundamental matrix (red) and the homography matrix (blue).

It is known [4] that the homography induced by the plane n^T X + d = 0, with coordinates Π_E = (n^T, d)^T, is:

H_ij = K_i [R − t n^T / d] K_j^{−1}

If the camera rotates about its optical centre, the group of transformations the images may undergo is a special group of homographies [2], [17]:

H_ij = K_i R_i R_j^T K_j^{−1}

The method proposed by [2] neglects the translation of the camera during video shooting and is not suitable for our purpose when applied directly; it generates perceptually inaccurate stereoscopic 3D video results. Real-world content lies in 3D space, and a 2D projective (homography) correspondence between image planes cannot accurately model 3D content. Hence, a 2D perspective transform, or homography, is not suitable for calculating the geometric image transformation when applied directly. Our stereo visualization experiment shows that the homography matrix causes a deviation in the stereoscopic generation process when applied directly. In our pose-estimation experiment we visualize the rotation of a simple virtual object using the rotation obtained from the homography matrix and from the fundamental matrix, while the camera moves back and forth along the x-axis (as shown in Fig. 1-a). For this experiment we use BazAR, a vision-based fast detection library [8]. Our experiment shows that the translational motion of the camera produces arbitrary rotations when using pairwise homography for image plane correspondence, i.e. the rotation matrix is altered by translation, whereas the fundamental-matrix-based projection remains unaltered. The rotation data analysis also confirms that a smoother motion trajectory can be achieved with the fundamental matrix method. Fig. 1-b depicts the motion trajectory plot of the rotation data, where the results based on the fundamental matrix (shown in red) are smoother compared to the homography matrix (in blue).
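The plane-induced homography above can be verified numerically. The sketch below uses made-up intrinsics and pose (not values from the paper): it builds H = K [R − t n^T / d] K^{−1} and confirms that a 3D point lying on the plane projects consistently through H.

```python
import numpy as np

# Illustrative values: identical intrinsics for both views, a small
# rotation about the y-axis, and a translation.
K = np.array([[500.0, 0, 320], [0, 500.0, 240], [0, 0, 1]])
theta = 0.1
R = np.array([[np.cos(theta), 0, np.sin(theta)],
              [0, 1, 0],
              [-np.sin(theta), 0, np.cos(theta)]])
t = np.array([0.2, 0.0, 0.05])
n = np.array([0.0, 0.0, 1.0])   # plane normal
d = -5.0                        # plane n^T X + d = 0, i.e. Z = 5

# Plane-induced homography H = K [R - t n^T / d] K^{-1}
H = K @ (R - np.outer(t, n) / d) @ np.linalg.inv(K)

# Check: a 3D point on the plane projects consistently through H.
X = np.array([1.0, -2.0, 5.0])   # satisfies n^T X + d = 0
xj = K @ X                       # projection in the reference view
xi = K @ (R @ X + t)             # projection in the second view
xi_h = H @ xj                    # same point mapped through H
# Both should agree up to scale (homogeneous coordinates).
```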
For the 3D correspondence in this work, we first derive the fundamental matrix from image-plane point correspondences and then employ epipolar geometry to compute the pairwise homography. This yields both accurate rotation and translation (as explained in the following section).

A. Homography and Fundamental Matrix Based Visualization
3D rendering methods are based on human depth perception, which relies on many factors such as geometry, size, occlusion, focus, stereo disparity, motion parallax, and binocular parallax. Binocular parallax methods generate a 3D effect by presenting a different image to each eye through a filtering mechanism created by the viewer's 3D glasses. To create a sense of depth from 2D images, researchers employ techniques that recover the geometry of a 3D scene and the camera motion from a set of images of a static scene. The process of estimating 3D structure from a 2D image sequence, by recovering the motion of a camera and the shape of the objects in front of it, is known as structure from motion (SFM). The most common SFM method is the feature-based approach (see [14] for more details). Feature-based SFM depends on robust feature detection and matching, geometric image transformation, image stitching, and adjustment. For robust feature detection and matching between two corresponding images, a SIFT feature detector [10] with FLANN matching [12] is normally employed to extract the key points in a given video sequence. Formally, for each frame I_k, the (k+n)-th frame I_{k+n} is used for 3D coding; a temporal distance of n frames is kept to obtain reasonable geometric correspondences.
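The matching stage can be illustrated with a toy example. FLANN accelerates the nearest-neighbour search with an approximate index; the sketch below (using synthetic descriptors, not real SIFT output) performs a brute-force search with Lowe's ratio test, which is the acceptance criterion such matchers apply.

```python
import numpy as np

def ratio_test_matches(desc1, desc2, ratio=0.75):
    """Brute-force nearest-neighbour matching with Lowe's ratio test:
    accept a match only if the best distance is clearly smaller than the
    second best. FLANN replaces the exhaustive search with an approximate
    index but applies the same acceptance criterion."""
    matches = []
    for i, d in enumerate(desc1):
        dists = np.linalg.norm(desc2 - d, axis=1)
        j1, j2 = np.argsort(dists)[:2]      # best and second-best neighbour
        if dists[j1] < ratio * dists[j2]:
            matches.append((i, int(j1)))
    return matches

# Toy 4-D "descriptors": rows of desc2 are noisy copies of rows of desc1,
# plus one unrelated distractor row.
rng = np.random.default_rng(0)
desc1 = rng.normal(size=(5, 4))
desc2 = np.vstack([desc1 + 0.01 * rng.normal(size=(5, 4)),
                   rng.normal(size=(1, 4))])
m = ratio_test_matches(desc1, desc2)
# Each descriptor should match its noisy copy: (0, 0), (1, 1), ..., (4, 4).
```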

III. ALGORITHM DESCRIPTION
The first step of our algorithm applies structure-from-motion (SFM) techniques to obtain information about the geometry of the 3D scene from the 2D images, which means extracting the 3D


In S, the elements a and b should be approximately equal; if there is a large difference between a and b, we assume that both cameras have the same pose.

S = [ a 0 0 ; 0 b 0 ; 0 0 0 ],   W = [ 0 1 0 ; −1 0 0 ; 0 0 1 ]

Here, the two candidate rotation matrices are R_a = U W V^T and R_b = U W^T V^T. We use the important constraint det(R_a) = 1; otherwise R_a is not a rotation matrix. If this constraint is not satisfied, the cause must be a mistake in the frame registration step, because a true rotation matrix is always orthogonal with unit determinant; in this case H is replaced with the identity matrix I. The translation vector is t = [U(1,3) U(2,3) U(3,3)]^T, i.e. the third column of U. We obtain four candidate solutions: P_1 = [R_a | t], P_2 = [R_a | −t], P_3 = [R_b | t], P_4 = [R_b | −t]. To find the unique correct solution, we reconstruct all corresponding points with each P_i, i = 1, 2, 3, 4. The correct solution P is the one for which most (75%) of the reconstructed points lie in front of both camera views.
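The four-solution disambiguation described above can be sketched as follows. In this illustrative example (pose and points are invented, not from the paper), an essential matrix is built from a known pose, R_a, R_b, and ±t are recovered by SVD, and the candidate that places triangulated points in front of both cameras is selected.

```python
import numpy as np

def skew(v):
    return np.array([[0.0, -v[2], v[1]],
                     [v[2], 0.0, -v[0]],
                     [-v[1], v[0], 0.0]])

def triangulate(P1, P2, x1, x2):
    """Linear (DLT) triangulation of one normalized correspondence."""
    A = np.vstack([x1[0] * P1[2] - P1[0], x1[1] * P1[2] - P1[1],
                   x2[0] * P2[2] - P2[0], x2[1] * P2[2] - P2[1]])
    X = np.linalg.svd(A)[2][-1]
    return X[:3] / X[3]

# Ground-truth relative pose (invented for illustration).
th = 0.2
R_true = np.array([[np.cos(th), 0, np.sin(th)],
                   [0, 1, 0],
                   [-np.sin(th), 0, np.cos(th)]])
t_true = np.array([1.0, 0.1, 0.1])
t_true /= np.linalg.norm(t_true)
E = skew(t_true) @ R_true

# SVD of E gives two candidate rotations and a translation up to sign.
U, S, Vt = np.linalg.svd(E)
W = np.array([[0.0, 1, 0], [-1, 0, 0], [0, 0, 1]])
Ra, Rb = U @ W @ Vt, U @ W.T @ Vt
if np.linalg.det(Ra) < 0:   # enforce det(R) = +1: a reflection is
    Ra = -Ra                # not a valid rotation
if np.linalg.det(Rb) < 0:
    Rb = -Rb
t = U[:, 2]                 # third column of U

# Cheirality test: keep the candidate placing points in front of both views.
P1 = np.hstack([np.eye(3), np.zeros((3, 1))])
pts = np.array([[0.3, -0.2, 4.0], [-0.5, 0.4, 5.0], [0.1, 0.1, 6.0]])
best = None
for Rc, tc in [(Ra, t), (Ra, -t), (Rb, t), (Rb, -t)]:
    P2 = np.hstack([Rc, tc[:, None]])
    n_front = 0
    for X in pts:
        x1 = X / X[2]                   # projection in view 1 (K = I)
        x2 = R_true @ X + t_true
        x2 = x2 / x2[2]                 # projection in view 2
        Xr = triangulate(P1, P2, x1, x2)
        if Xr[2] > 0 and (Rc @ Xr + tc)[2] > 0:
            n_front += 1
    if best is None or n_front > best[0]:
        best = (n_front, Rc, tc)
# best now holds the candidate consistent with all points in front.
```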

Fig. 2. A block diagram based overview of our proposed algorithm for Anaglyph 3D stereoscopic visualization of 2D personal video.

pose information of the camera into a projection matrix. Because the image formation process is not generally invertible [14], the projection is transformed indirectly into a homography matrix (based on the fundamental matrix and epipolar geometry) to rectify the image. SFM techniques have been used in a wide range of applications, including photogrammetric survey [14], the automatic reconstruction of virtual-reality models from video sequences [18], and the determination of camera motion [7]. In our proposed algorithm, we first decompose the 2D video into frames and select two images (with overlapping areas) for matching to find the corresponding pairs. Secondly, the projection matrix is estimated based on the fundamental matrix and epipolar geometry. It is then transferred into a homography correspondence to reconstruct synthesized views for anaglyph 3D stereo coding. Finally, the two images are rectified based on the homography matrix and color coded (as shown in Fig. 2). For frame feature detection and matching, our algorithm uses the SIFT feature detector and the optical flow method to track the features across consecutive frames; the matching results are more accurate than SIFT with FLANN.
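The optical-flow tracking step can be illustrated with a minimal single-window Lucas-Kanade solver. Production trackers are pyramidal and iterate over many windows; this toy version, on a synthetic image pair, only shows the core 2x2 normal-equation solve.

```python
import numpy as np

def lucas_kanade(I1, I2, x, y, w=7):
    """Solve the 2x2 Lucas-Kanade normal equations for the displacement
    of the window centred at integer pixel (x, y)."""
    Iy, Ix = np.gradient(I1)            # np.gradient returns axis-0 first
    win = np.s_[y - w:y + w + 1, x - w:x + w + 1]
    ix, iy = Ix[win].ravel(), Iy[win].ravel()
    it = (I2 - I1)[win].ravel()         # temporal derivative
    A = np.array([[ix @ ix, ix @ iy], [ix @ iy, iy @ iy]])
    b = -np.array([ix @ it, iy @ it])
    return np.linalg.solve(A, b)        # estimated (vx, vy)

# Synthetic frame pair: a Gaussian blob that shifts by (0.3, 0.2) pixels.
yy, xx = np.mgrid[0:41, 0:41].astype(float)

def blob(cx, cy):
    return np.exp(-((xx - cx) ** 2 + (yy - cy) ** 2) / 18.0)

I1, I2 = blob(20.0, 20.0), blob(20.3, 20.2)
flow = lucas_kanade(I1, I2, 20, 20)     # recovers roughly (0.3, 0.2)
```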

C. Transforming the Projection Matrix into the Homography Matrix H
The relation between the fundamental matrix F and the homography matrix H can be written as:

F_ij ≈ [e_ij]_× H_ij

Hence, F corresponds to the H of a plane together with the epipole. For this work, the projection matrix of the first camera is assumed to be P = K[I | 0], and SFM is then employed to estimate the projection matrix P' = K[R | t]. To avoid missing projected parts in the image plane, we consider the plane Z = 0 and use a homography matrix to map between the two views: H = K[R_1 R_2 t], where R_1 and R_2 are respectively the first and second columns of the rotation matrix R. In practice, the translation obtained from the fundamental matrix is not suited to real image coordinates, so we select the normalized translation from the homography matrix. We also tried a normalized translation from the identity matrix I, which means the transformation between the two images is null. Comparing the results, we found the normalized translation from the homography matrix to be more accurate.
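The mapping H = K[R_1 R_2 t] for the plane Z = 0 can be checked numerically. The sketch below (with illustrative intrinsics and pose, not the paper's values) confirms that H, applied to the plane coordinates (X, Y, 1), agrees with projecting the corresponding Z = 0 point through P' = K[R | t].

```python
import numpy as np

# Illustrative pose and intrinsics.
K = np.array([[500.0, 0, 320], [0, 500.0, 240], [0, 0, 1]])
th = 0.15
R = np.array([[np.cos(th), 0, np.sin(th)],
              [0, 1, 0],
              [-np.sin(th), 0, np.cos(th)]])
t = np.array([0.5, 0.0, 2.0])

# Homography mapping plane coordinates (X, Y, 1) on the plane Z = 0
# into the second image: H = K [r1 r2 t].
H = K @ np.column_stack([R[:, 0], R[:, 1], t])

# Check against projecting the 3D point directly with P' = K [R | t].
X3d = np.array([0.4, -0.3, 0.0])             # a point with Z = 0
x_direct = K @ (R @ X3d + t)
x_homog = H @ np.array([X3d[0], X3d[1], 1.0])
# The two homogeneous image points should agree up to scale.
```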

A. Fundamental Matrix F and Essential Matrix E
The RANSAC algorithm [16] is employed to find good matches, and the eight-point algorithm is then used to calculate the fundamental matrix F [18]. From F the essential matrix E is estimated as E = K^T F K, where K is the camera calibration matrix. Here we use an important constraint: if det(E) = 0 we keep E; otherwise H is replaced with the identity matrix.
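The relations used in this step can be checked numerically. In the sketch below, the RANSAC and eight-point estimation is replaced by constructing F directly from a known, invented pose, so that the rank-2 constraint det(F) = 0 and the standard relation E = K^T F K can be verified.

```python
import numpy as np

def skew(v):
    return np.array([[0.0, -v[2], v[1]],
                     [v[2], 0.0, -v[0]],
                     [-v[1], v[0], 0.0]])

# Illustrative calibration and relative pose.
K = np.array([[500.0, 0, 320], [0, 500.0, 240], [0, 0, 1]])
th = 0.1
R = np.array([[np.cos(th), 0, np.sin(th)],
              [0, 1, 0],
              [-np.sin(th), 0, np.cos(th)]])
t = np.array([1.0, 0.2, 0.1])

E = skew(t) @ R                                  # essential matrix from pose
Kinv = np.linalg.inv(K)
F = Kinv.T @ E @ Kinv                            # fundamental matrix

# The constraints used in the text:
detF = abs(np.linalg.det(F))                     # should be ~ 0 (rank 2)
E_back = K.T @ F @ K                             # recover E via E = K^T F K
s = np.linalg.svd(E, compute_uv=False)           # two equal values, one ~ 0
```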

D. Stereoscopic 3D for 2D Video
The indirect homography construction ensures that corresponding points in the two image planes satisfy x' = Hx. Fig. 3 shows the comparison results: a stereoscopic 3D anaglyph view generated by our proposed indirect method and by the direct H-matrix-based method. The results of our proposed method are perceptually superior. The average processing time, measured in MATLAB 7.12.0 (64-bit) on a 3.4 GHz desktop computer, is listed in Tab. I. It indicates that the average increase in time is far less than 10%, which means it is acceptable to spend a little more processing time for

B. Extracting the Projection Matrix
Suppose the projection matrix is P' = [R | t]. The essential matrix is decomposed by SVD as [U, S, V] = SVD(E), and S is computed.


algorithm is shown both analytically and experimentally, based on subjective tests. The main advantages of our algorithm are its simplicity, robustness, and cost-effectiveness for personal videos. The proposed indirect approach helps users convert their real-world 2D videos into 3D anaglyph visualizations. The proposed method can only reconstruct 3D from video with transverse motion, because it gives up the depth reconstruction procedure in order to generate the 3D anaglyph visualization quickly. In this paper we do not consider object segmentation, so the method is only suitable for static scenes captured by active vision, or for segmented scenes. In future studies, we will optimize our algorithm with respect to the above-mentioned issues.

Fig. 3. Stereoscopic 3D views generated, from left to right, by (a) our proposed method and (b) the direct homography H matrix method.

ACKNOWLEDGMENTS
The authors are thankful to the Chinese Academy of Sciences Fellowship for Young Foreign Scientists (2012Y1GA0002), the National Natural Science Fund of China (61070147), and the Ministry of Culture of the PRC's S&T innovation fund under grant (16-2012).

Fig. 4. Stereoscopic 3D result generated for an indoor 2D video sequence using our proposed algorithm.

REFERENCES
[1] L. Azzari, F. Battisti, A. Gotchev, M. Carli, and K. Egiazarian. A modified non-local mean inpainting technique for occlusion filling in depth-image-based rendering. In Proc. SPIE, Stereoscopic Displays and Applications XXII, volume 7863, 2011.
[2] M. Brown and D. G. Lowe. Automatic panoramic image stitching using invariant features. Int. J. Comput. Vision, 74(1):59–73, 2007.
[3] H. Fradi and J. Dugelay. Improved depth map estimation in stereo vision. In Proc. SPIE, Stereoscopic Displays and Applications XXII, volume 7863, 2011.
[4] R. Hartley and A. Zisserman. Multiple View Geometry in Computer Vision. Cambridge University Press, New York, NY, USA, 2003.
[5] R. I. Hartley. Theory and practice of projective rectification. Int. J. Comput. Vision, 35(2):115–127, 1999.
[6] C. Kurz, T. Thormählen, and H. Seidel. Bundle adjustment for stereoscopic 3D. In Proc. 5th Int. Conf. Computer Vision/Computer Graphics Collaboration Techniques, 2011.
[7] K. N. Kutulakos and J. R. Vallino. Calibration-free augmented reality. IEEE Trans. Visualization and Computer Graphics, 1998.
[8] V. Lepetit and P. Fua. Keypoint recognition using randomized trees. IEEE Trans. Pattern Anal. Mach. Intell., 28(9):1465–1479, 2006.
[9] C. Loop and Z. Zhengyou. Computing rectifying homographies for stereo vision. In IEEE Computer Vision and Pattern Recognition (CVPR), volume 1, 1999.
[10] D. G. Lowe. Distinctive image features from scale-invariant keypoints. Int. J. Comput. Vision, 60(2):91–110, 2004.
[11] H. C. McKay. Three-dimensional Photography: Principles of Stereoscopy. American Photography, 1951.
[12] M. Muja and D. G. Lowe. Fast approximate nearest neighbors with automatic algorithm configuration. In Proc. Int. Conf. Computer Vision Theory and Applications (VISSAPP'09), pages 331–340, 2009.
[13] C. Niquin, S. Prevost, and Y. Remion. A point cloud based pipeline for depth reconstruction from autostereoscopic sets. In Proc. SPIE, Stereoscopic Displays and Applications XXI, volume 7524, 2010.
[14] D. P. Robertson and R. Cipolla. Structure from Motion. Practical Image Processing and Computer Vision, John Wiley, 2009.
[15] R. Spottiswoode and N. Spottiswoode. The Theory of Stereoscopic Transmission & Its Application to the Motion Picture. Cambridge University Press, 1993.
[16] P. H. S. Torr and D. W. Murray. The development and comparison of robust methods for estimating the fundamental matrix. Int. J. Comput. Vision, 1997.
[17] S. Yousefi, F. Kondori, and H. Li. 3D gestural interaction for stereoscopic visualization on mobile devices. In Computer Analysis of Images and Patterns (CAIP), pages 555–562, 2011.
[18] A. Zisserman, A. Fitzgibbon, and G. Cross. VHS to VRML: 3D graphical models from video sequences. In Proc. IEEE Int. Conf. Multimedia Computing and Systems, volume 2, 1999.

better and more accurate 3D stereoscopic visualization effects. The proposed algorithm was evaluated on our own collection of indoor video sequences. The results were smooth enough for 3D anaglyph generation (see Fig. 4); from these images, one can perceive a coherent motion of the camera's shooting angle.

E. Color Code
We use a color anaglyph method based on a color code to visualize the stereoscopic output. The user can experience the 3D scene by wearing special glasses with a yellow filter on the left eye and a blue filter on the right eye. With this method, the overlapping region of the two images is restored to full RGB color after the calculation, while the differing regions of the two images are rendered into the yellow and blue channels respectively.

IV. CONCLUDING REMARKS
In this work we have proposed a novel and cost-effective method for anaglyph 3D stereoscopic visualization of 2D video based on indirect homography matrix calculation using the fundamental matrix. The advantage of our proposed

TABLE I
PROCESSING TIME PER FRAME (PTPF) OF THE PROPOSED FUNDAMENTAL-MATRIX METHOD (FM) AND THE DIRECT HOMOGRAPHY METHOD (DHM) FOR DIFFERENT VIDEO SEQUENCES.

Frame Size            | No. of frames | PTPF for FM (s)  | PTPF for DHM (s) | Avg. % increase
Standard
320 x 240             | 319           | 136.99 (0.42/f)  | 129.39 (0.41/f)  | 5.9%
640 x 480             | 297           | 520.73 (1.75/f)  | 488.75 (1.65/f)  | 6.5%
1600 x 1200           | 128           | 859.71 (6.72/f)  | 823.26 (6.43/f)  | 4.4%
Wide Screen
360p (640 x 360)      | 89            | 84.71 (0.95/f)   | 82.84 (0.93/f)   | 2.3%
480p (856 x 480)      | 107           | 154.31 (1.44/f)  | 149.14 (1.39/f)  | 3.5%
720p (1280 x 720)     | 98            | 351.13 (3.58/f)  | 342.67 (3.49/f)  | 2.5%
1080p (1920 x 1080)   | 136           | 919.02 (6.70/f)  | 882.70 (6.50/f)  | 4.1%