DEPTH RECOVERY FROM UNSYNCHRONIZED VIDEO STREAMS Chunxiao Zhou and Hai Tao

Department of Computer Engineering, University of California, Santa Cruz, CA 95064
Email: [email protected]

Abstract

Dynamic depth recovery from unsynchronized video streams is an open problem. In this paper, we propose a novel framework for estimating dense depth information of dynamic scenes using multiple video streams captured from unsynchronized stationary cameras. The main idea is to convert the unsynchronized depth computation problem into the traditional synchronized depth computation problem, to which many classic stereo algorithms can be applied. We solve this problem by first imposing two assumptions about the scene motions and the time difference between cameras: the scene motion is represented using a local constant velocity model, and the camera temporal difference is modeled as a constant within a short period of time. Based on these models, the geometric relations between the images of moving scene points, the scene depth, the scene motions, and the camera temporal offset are investigated, and an estimation method is developed to compute the camera temporal difference. The three main steps of the proposed algorithm are 1) estimation of the temporal offsets between multiple cameras, 2) synthesis of synchronized image pairs based on the estimated camera temporal offsets and the optical flow fields computed in each view, and 3) stereo computation using the synthesized synchronous image pairs. The algorithm is tested on both synthetic data and real images. Promising quantitative and qualitative experimental results are demonstrated in the paper.


1 Introduction

Depth recovery is a classic problem in computer vision. It plays a very important role in many applications such as 3-D reconstruction, object recognition, motion analysis, image based rendering, tracking, robotics, and human computer interaction. In recent years, with advances in computing and imaging technologies, capturing multiple high quality video streams has become easier, and the problem of recovering depth maps of dynamic scenes has received increasing attention. In using the term "dynamic", we refer to the problem of depth recovery not only across multiple views, but also across different time instants. When a dynamic scene is imaged in multiple synchronized video streams from different fixed viewpoints, depth information can be computed using stereo algorithms at any time instant. For such configurations, dynamic stereo algorithms have recently been developed that take advantage of the temporal coherency of the scene motion to achieve more robust depth estimation [4], [16], [24], [29]. However, the requirement of synchronization among cameras greatly limits the use of these algorithms in many applications where synchronization is difficult to achieve, for technical or economic reasons. Being able to compute depth from unsynchronized video streams will significantly reduce the complexity of the video capture system and benefit many applications, from home video editing to 3D reconstruction in visual sensor networks. The problem we attempt to solve in this paper is: when a dynamic scene is captured by multiple stationary unsynchronized cameras, such as consumer camcorders, how can the depth information of the scene be estimated? The problem of unsynchronized dynamic depth recovery from two views is depicted in Figure 1. When the cameras are synchronized, if a 3D scene point is observed in two views at a certain time instant, the scene point must lie at the intersection of the two rays formed by the image points and the camera centers (see Figure 2 for an illustration). This is known as the triangulation process in the literature on stereo


computation (see [7] for a nice review of early stereo algorithms). The remaining problem is how to find the correct correspondences across views, a problem that most stereo algorithms are designed to solve [2], [3], [8], [12]. Scharstein and Szeliski provide a taxonomy of dense stereo correspondence techniques [17] as well as a test bed for the quantitative evaluation of stereo algorithms.

Figure 1. Unsynchronized dynamic depth recovery. The inputs are two unsynchronized views (left and right); the output is a sequence of depth maps.


Figure 2. Images of a moving 3D scene point in two synchronized views.

When the cameras are not synchronized, however, the problem becomes much more complicated. The triangulation process is no longer valid because the same scene point observed in the two views may be imaged at different time instants. When the scene point is moving, the two image rays in general will not intersect (see Figure 3 for an illustration). An implication of this phenomenon is that, in theory, if a scene point can move at arbitrary speed in any direction, stereo computation from unsynchronized video streams is an ill-posed problem: as illustrated in Figure 3, there are infinitely many solutions, each formed by picking an arbitrary point on each image ray.


Figure 3. Images of a moving 3D scene point in two unsynchronized views.

In this paper, we propose to solve this problem by first imposing two reasonable assumptions regarding the scene motions and the temporal offset between different cameras. We then develop a simple algorithm that first estimates the temporal offset and then converts the unsynchronized video sequences into synchronized ones. Once this conversion is completed, traditional synchronized stereo algorithms are employed to compute the depth information of the dynamic scenes. Our work is related to, and much inspired by, the pioneering work of Caspi and Irani [5], in which two algorithms were developed to achieve spatio-temporal alignment of video sequences using a simple parametric model that describes the transformation between two views and an additional parameter that describes the temporal offset between cameras. However, in the unsynchronized stereo problem, the transformation between different views is determined not only by the camera parameters, but also by the depth values of individual pixels. A simple yet promising method is proposed in this paper to solve this problem. In another related work, Avidan and Shashua [1] proposed a method called trajectory triangulation for computing the 3D positions of moving scene points from multiple images of the points. Linear and conic motion trajectory models were considered in their work. The proposed algorithm is also closely related to recent work on dynamic

stereo computation, which is concerned with recovering the depth information of dynamic scenes [4], [16], [24], [29], [31]. Recently, Shimizu et al. [19] proposed a method for calculating the 3D position of an object with two unsynchronized cameras. The 3D position is measured as the crossing point of 3D lines through the detected positions in the latest image frame and in past frames. However, that algorithm is aimed at object tracking rather than dense depth recovery. To the best of our knowledge, all existing algorithms are based on synchronized camera capture systems, and there is no previous work on recovering dense depth information of dynamic scenes using unsynchronized cameras. The paper is organized as follows. Section 2 explains the main ideas of the proposed approach. Section 3 describes an implementation in detail. In Section 4, quantitative and qualitative experimental results on synthetic and real data are described and analyzed. Discussions and conclusions can be found in Section 5.

2 Unsynchronized Dynamic Depth Recovery

2.1 The basic geometric constraint

As discussed in the previous section, if the scene motion is arbitrary, there are infinitely many solutions to the unsynchronized stereo problem. However, due to the physical laws that govern the dynamics of scene objects, the motions of these objects are smooth during a short period of time. If we assume that the motion of a scene point can be locally approximated by a linear motion, a simple geometric constraint can be developed for estimating the 3D position of a scene point from its four images in the two views. As shown in Figure 4, suppose a scene point undergoing linear motion is imaged at time instants t, t', t+1, and t'+1 in the two views as a, b', c, and d'. In the general case, the linear trajectory can be found by intersecting the plane ac1c and the plane b'c2d', where c1 and c2 are the two camera centers. The 3D positions A, B, C, and D of the scene point can then be found by intersecting the linear trajectory with c1a, c1c, c2b', and c2d'. For

the special case in which the two planes are coplanar, any line in the epipolar plane is a solution; therefore, an additional constraint is needed to find a unique solution. The geometric constraint resulting from the linear motion assumption implies a very simple depth computation algorithm for unsynchronized cameras. However, to compute a dense depth map, the correspondences of each pixel in the other three images need to be found. When the pixel lies in an area without prominent features, this is a difficult task, because any estimation error in the single-view correspondences (i.e., the optical flow fields) or in the cross-view correspondences will propagate into the final solution. In the following section, we propose a simplified algorithm in which the problem of finding 4-correspondences for every point is avoided.


Figure 4. Determining the linear trajectory of the scene point by intersecting the two planes ac1c and b'c2d'.

For a pair of unsynchronized video cameras, if their frame rates are the same and they are roughly aligned to frame accuracy (here "roughly aligned" means the absolute temporal difference between two aligned frames is less than one frame interval, which is usually 1/30 second), then a very simple geometric constraint can be derived for the four images of a moving scene point if we impose two simple assumptions. The first assumption is that the scene point moves at a constant velocity over a short period of time. The second assumption is that the camera temporal difference is constant over a short period of time. It should be noted that the assumption of a constant camera temporal difference ∆t is only local,

and a gradual change of ∆t is allowed in this model. The assumption of linear motion with constant velocity is the simplest scene motion model. It is also possible to use a more complex model, such as a constant acceleration model.

2.2 Converting asynchronous streams to synchronized streams

Instead of solving the 4-correspondence problem for every scene point, we propose to first estimate the camera temporal offset ∆t using prominent feature points and then synthesize synchronized frames in one of the two views. A standard stereo algorithm is then employed to compute the depth for each frame in the synthetic synchronized sequences.


Figure 5. The relation between the camera temporal offset ∆t and the cross ratio.

2.2.1 Computing the camera temporal offset

The camera temporal offset ∆t is computed using a sparse set of 4-correspondences. In Figure 5, for the four points A, B, C, and D, when the linear motion model is assumed, the cross ratio is invariant under perspective projection. With the constant velocity model and the constant camera offset, the cross ratio can be computed as

(AB/BD) / (AC/CD) = (∆t/1) / (1/∆t) = ∆t².   (1)

Since the cross ratio is invariant under perspective projection,

(ab/bd) / (ac/cd) = ∆t².   (2)

This implies an algorithm for finding ∆t, which is described as follows.

Step 1: Compute 4-feature correspondences.

Step 2: For each 4-correspondence, compute the cross ratio in one of the views, e.g. the left view. This can be achieved by first finding the epipolar lines of b' and d'. The intersections of these epipolar lines with the line formed by the two points in the first view are then computed using

b = (a × c) × (Fᵀ b'),   d = (a × c) × (Fᵀ d'),   (3)

where F is the fundamental matrix between the two views. The cross ratio is then computed as in (2).

Step 3: If (ab/bd) / (ac/cd) < 0, discard the correspondence. Otherwise, ∆t is computed as

∆t = +√(∆t²)   if ⟨ab, ac⟩ ≥ 0,
∆t = −√(∆t²)   if ⟨ab, ac⟩ < 0,   (4)

where ⟨·,·⟩ denotes the dot product operator and ∆t² is the value computed in (2). It should be mentioned that this method does not work if the four 3D points A, B, C, D and the camera centers are coplanar, because the intersections b and d cannot be found: the line from a to c and the epipolar lines through b and d are parallel (Figure 6). In this degenerate case, there are in general infinitely many solutions, both for ∆t and for the trajectory. This can be shown geometrically, as well as by counting the number of unknowns and the number of constraints. A special case is of particular interest and will be discussed in more detail here. Suppose that the two cameras are arranged in a standard stereo setup with horizontal epipolar lines and a baseline of B, but without synchronization. Figure 7 shows a scene point moving in a direction parallel to the camera baseline, with depth d, temporal camera offset ∆t, and motion v. It can be shown that any solution of the form (αd, αv, (1 − α)B + α∆t·v) generates exactly the same 4-correspondence. We call this phenomenon the depth-motion-time ambiguity in unsynchronized stereo computation.
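To make Steps 1-3 concrete, the following is a minimal sketch, not the authors' implementation, of the per-correspondence ∆t computation of equations (2)-(4). The function name estimate_dt, the input conventions, the fundamental-matrix sign convention, and the handling of the degenerate case are assumptions made for illustration.

```python
import numpy as np

def estimate_dt(a, c, b2, d2, F):
    """Estimate the temporal offset dt from one 4-correspondence.

    a, c   : images of the moving point in view 1 at times t and t+1 (pixels)
    b2, d2 : images of the same point in view 2 at times t+dt and t+1+dt
    F      : fundamental matrix such that F.T @ x2 is the epipolar line of a
             view-2 point x2 in view 1 (convention assumed here)
    Returns dt in (-1, 1), or None for degenerate/inconsistent cases.
    """
    a_h, c_h = np.append(a, 1.0), np.append(c, 1.0)
    b2_h, d2_h = np.append(b2, 1.0), np.append(d2, 1.0)

    line_ac = np.cross(a_h, c_h)               # line through a and c in view 1
    # Equation (3): intersect the motion line with the epipolar lines of b', d'.
    b = np.cross(line_ac, F.T @ b2_h)
    d = np.cross(line_ac, F.T @ d2_h)
    if abs(b[2]) < 1e-12 or abs(d[2]) < 1e-12:
        return None                             # parallel lines: coplanar case
    b, d = b[:2] / b[2], d[:2] / d[2]

    ab, ac = b - np.asarray(a, float), c - np.asarray(a, float)
    bd, cd = d - b, d - np.asarray(c, float)
    if np.linalg.norm(ac) < 1e-9:
        return None                             # no measurable motion in view 1
    u = ac / np.linalg.norm(ac)                 # signed lengths along the motion
    ab_l, bd_l, ac_l, cd_l = ab @ u, bd @ u, ac @ u, cd @ u
    if abs(bd_l) < 1e-9 or abs(cd_l) < 1e-9:
        return None
    dt_sq = (ab_l / bd_l) / (ac_l / cd_l)       # equation (2): cross ratio = dt^2
    if dt_sq < 0:
        return None                             # Step 3: discard
    dt = np.sqrt(dt_sq)
    return dt if ab @ ac >= 0 else -dt          # equation (4): sign of <ab, ac>
```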


Figure 6. When the 3D scene motion and the camera centers are coplanar, ∆t cannot be computed using Equation (2).


Figure 7. The depth-motion-time ambiguity in unsynchronized stereo computation.

2.2.2 Synthesis of synchronized image pairs


Once the time offset ∆t between the cameras is estimated, images in the second view are synthesized to form synchronized video sequences with respect to the image sequence in the first view. This is accomplished by first estimating the optical flow fields between consecutive frames in the second view and then warping the images to form new images at time instants in between the frames. Figure 8 illustrates this process. Let I(t+∆t) and I(t+∆t+1) denote the images at time instants t+∆t and t+∆t+1. First, the optical flow field from I(t+∆t) to I(t+∆t+1) and the optical flow field from I(t+∆t+1) to I(t+∆t) are computed. These optical flow fields are denoted as f1 = f(I(t+∆t) → I(t+∆t+1)) and f2 = f(I(t+∆t+1) → I(t+∆t)). Then images I(t+∆t) and I(t+∆t+1) are warped to the new time instant t+1 using (1−∆t)×f1 and ∆t×f2, respectively. The two resulting warped images are the predicted images at time t+1 from the frames at t+∆t and t+∆t+1. These two images are combined using blending factors of ∆t and 1−∆t. In summary, the image at time t+1 is synthesized as

I(t+1) = ∆t × W(I(t+∆t), (1−∆t)×f1) + (1−∆t) × W(I(t+∆t+1), ∆t×f2),   (5)

where W(I, f) is a forward warping function that warps image I using the optical flow f. It should be mentioned that an approximation is used in this image synthesis procedure. The ratio between segments on a 3D linear trajectory, which is (1−∆t):∆t in our case, is not a projective invariant in general, so the ratio between the projected 2D segments is not exactly (1−∆t):∆t. However, when the inter-frame scene motion is small relative to the object-to-camera distance, this is a very good approximation.
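As an illustration of equation (5), the sketch below blends two forward-warped frames using a simple nearest-neighbor splatting warp. It assumes grayscale floating-point images and dense flow fields from any off-the-shelf optical flow estimator; the function names and the splatting scheme are not from the paper.

```python
import numpy as np

def forward_warp(img, flow):
    """Nearest-neighbor forward warp (splat) of a grayscale image by a
    per-pixel flow field of shape (H, W, 2), with simple hole handling."""
    h, w = img.shape
    out = np.zeros((h, w), dtype=np.float64)
    weight = np.zeros((h, w), dtype=np.float64)
    ys, xs = np.mgrid[0:h, 0:w]
    xd = np.clip(np.round(xs + flow[..., 0]).astype(int), 0, w - 1)
    yd = np.clip(np.round(ys + flow[..., 1]).astype(int), 0, h - 1)
    np.add.at(out, (yd, xd), img)          # accumulate splatted intensities
    np.add.at(weight, (yd, xd), 1.0)       # count contributions per pixel
    weight[weight == 0] = 1.0              # unfilled pixels stay at zero
    return out / weight

def synthesize(I_a, I_b, f1, f2, dt):
    """Equation (5): I(t+1) = dt*W(I(t+dt), (1-dt)*f1) + (1-dt)*W(I(t+1+dt), dt*f2)."""
    W1 = forward_warp(I_a, (1.0 - dt) * f1)
    W2 = forward_warp(I_b, dt * f2)
    return dt * W1 + (1.0 - dt) * W2
```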


Figure 8. Synthesized image at t + 1 from images at t + ∆t and t + ∆t + 1 .

2.2.3 Synchronous stereo

Once the image at time t+1 is synthesized in the second view, it can be used together with the real first-view image at t+1 to compute depth in this frame. Existing stereo algorithms can be used for this purpose. Careful readers will notice that the depth map for each time instant is computed independently. The 3D motions of scene points in space, which would require the registration of the depth maps or the full 4-correspondences, are not computed. In other words, depth maps at different time instants are not registered. This is exactly how the difficult 4-correspondence problem is avoided. In our approach, the only requirement is that the synthesized images should be similar to the real ones. This can be achieved even when the estimated optical flow is wrong, a phenomenon that is best observed in textureless areas.

3 Implementation of the algorithm

The proposed unsynchronized stereo algorithm has been implemented. Figure 9 illustrates the main phases of the algorithm. Details on the methods adopted and various problems encountered will be discussed in the following subsections.


The Unsynchronized Depth Recovery Algorithm
1. Camera calibration and pose estimation.
2. Coarse temporal correspondence between cameras.
3. Finding features and 4-correspondences.
4. Robust estimation of ∆t.
5. Synthesizing synchronized video streams.
6. Depth computation.

Figure 9. The Unsynchronized Depth Recovery Algorithm

3.1 Camera calibration and pose estimation

In the proposed algorithm, we assume the intrinsic and extrinsic camera parameters are known. In our implementation, two methods are used to calibrate the cameras intrinsically and extrinsically. In the first method, we adopt Zhang's algorithm [32], which computes the camera matrix from a planar calibration object (see Figure 10). Since the algorithm also computes the relative 3D pose of each camera with respect to the calibration object, it is possible to use the same algorithm to compute the relative poses between cameras. However, when the cameras are unsynchronized, this method is not applicable, so we compute the camera pose using a static scene without any moving objects. When the scene lacks natural 3D features, additional 3D objects are inserted into the scene to provide sufficient information for computing the camera poses. Zhang's algorithm is depicted in Figure 11. The inputs of Zhang's calibration algorithm are several files listing the coordinates of the detected corners on the calibration board in various orientations, a model file with a sorted list of coordinates in a separate coordinate system, and a Boolean which determines whether


distortion is being modeled or not. The outputs are the estimates of the camera centers with respect to the various images, the horizontal and vertical camera focal lengths, the camera distortion, and the homographies mapping the various calibration board planes to the model plane, which is assumed to lie in the plane z = 0 for ease of calculation.

Figure 10. Calibration Images

Since the coordinates of the detected image corners do not physically correspond to the coordinates of the corners on the model plane, all points are normalized to make them relatively close before the homographies from the model plane to each image plane are calculated. We normalize the coordinates of the detected image corners for each image in turn so that the average distance from the image center to the points is 1.0. The homography, by mapping one plane to another and thus providing a correspondence between two sets of points on different planes, encodes a composition of a 3-D rotation and a 3-D translation. Once the homographies are estimated from the sets of corresponding corners, the next step is to decompose the homographies to form an initial guess of the rotations and translations that relate the different planes, as well as of the intrinsic camera parameters.

To reconcile the measured corners with the ideal ones, the camera distortion needs to be taken into account. Since most of the distortion is in the radial direction from the lens center, only two parameters are modeled in Zhang's algorithm. After all parameters have been estimated, an optimization step adjusts all parameters slightly in order to improve the accuracy of the entire model. In this process, the Levenberg-Marquardt algorithm is applied to minimize the reprojection error.

Zhang's Algorithm of Camera Calibration and Pose Estimation
1. Print a pattern and attach it to a planar surface.
2. Take a few images of the model plane under different orientations by moving the plane.
3. Detect the feature points in the images.
4. Normalize the coordinates of the points.
5. Compute homographies that map one image plane to another.
6. Decompose homographies to get an initial calibration result.
7. Distortion estimation.

Figure11. Zhang’s Algorithm of Camera Calibration and Pose Estimation Bouguet’s algorithm [34] provides an alternative for camera calibration and pose estimation. Bouguet’s algorithm was partially inspired from Zhang’s work. Their estimations of homographies are identical. The main differences exist in the camera distortion model and the intrinsic parameter estimation from homographies. If the overlap of two views is relatively large, it is easier to adopt Bouguet’s method. With only a set of calibration planes in various orientations observed by both two cameras, camera calibration and pose estimation can be achieved together in one procedure. In addition, Bouguet released a very useful camera calibration matlab toolbox online [34] as well as many detailed examples and important references. 14

In a real system, it is possible to simplify the above calibration and setup procedures. To obtain the intrinsic camera parameters, self-calibration methods [23] [30] can be employed to avoid the use of calibration objects. The requirement of a completely static scene in the calibration and pose estimation stages can be dropped with background estimation techniques that detect and remove moving foreground objects in each frame. It may even be possible to estimate camera poses using feature points on the moving objects, which is itself an open research problem.

3.2 Coarse temporal correspondence between cameras

To compute an accurate temporal offset between the cameras, our algorithm requires the unsynchronized video sequences to be roughly temporally aligned first. When the two cameras are close to each other, the image transformation between the two views can be approximated by an affine transform. Under such conditions, the method developed by Caspi and Irani [5] can be used for computing the rough temporal alignment. Another, much simpler method is to align the sequences by detecting a unique movement or change in the dynamic scene, automatically or manually. Currently, our experimental results are obtained by manually aligning the two video sequences with an accuracy of up to several frames.

3.3 Finding features and 4-correspondences

In order to compute the camera temporal offset, we need to extract 4-correspondences on the moving object to compute the cross ratio.

3.3.1 Motion Detection

Image subtraction is probably the simplest way to detect motion. We subtract the current image from the previous one to separate the moving objects from the static background. Since the image sequences are noisy, an erosion operation is first used to reduce image differences caused by noise; the resulting holes are then filled using dilation and hole-filling operations. Finally, erosion is applied again to eliminate the enlargement effect of the dilation operation. Figure 12 shows an example of motion detection in a dynamic scene.
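A minimal OpenCV sketch of this subtraction-and-morphology step is shown below; the threshold and kernel sizes are illustrative assumptions, not values from the paper.

```python
import cv2

def detect_motion(prev_gray, curr_gray, thresh=25):
    """Image-subtraction motion mask with the erode/dilate/fill/erode cleanup
    described above (threshold and kernel sizes are illustrative)."""
    diff = cv2.absdiff(curr_gray, prev_gray)
    _, mask = cv2.threshold(diff, thresh, 255, cv2.THRESH_BINARY)
    k = cv2.getStructuringElement(cv2.MORPH_ELLIPSE, (3, 3))
    mask = cv2.erode(mask, k)                         # suppress isolated noise
    mask = cv2.dilate(mask, k, iterations=2)          # recover eroded object area
    k_big = cv2.getStructuringElement(cv2.MORPH_ELLIPSE, (9, 9))
    mask = cv2.morphologyEx(mask, cv2.MORPH_CLOSE, k_big)  # fill interior holes
    mask = cv2.erode(mask, k, iterations=2)           # undo the dilation growth
    return mask
```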


Figure 12. (a) current frame, (b) next frame, (c) image subtraction, (d) erosion result, (e) dilation result, (f) filling result, (g) second erosion result, (h) detected motion.

3.3.2 Feature Detection

After finding the moving objects in the dynamic scene, we detect features on the moving objects by means of the Harris corner detector [9]. Corner detection is based on the following matrix of image derivatives:

M = [ Σ (∂I/∂x)²          Σ (∂I/∂x)(∂I/∂y) ]
    [ Σ (∂I/∂x)(∂I/∂y)    Σ (∂I/∂y)²       ].   (6)

Since ∆I ≈ (∆x, ∆y) M (∆x, ∆y)ᵀ, if at a certain point the two eigenvalues of the matrix M are large, then a small motion in any direction will cause a prominent change in grey level. This indicates that the point is a corner. In fact, the geometric interpretation is encoded in the eigenvectors and eigenvalues of the matrix


M. Note that M is symmetric, and it has two nonnegative eigenvalues. In its eigenbasis, M can be expressed as

M = [ λ₁   0  ]
    [ 0    λ₂ ].   (7)

The eigenvectors of M encode the directions of variation, while the eigenvalues of M encode the strength of the variation. Thus, a corner is detected if the smaller eigenvalue of M is large enough. An alternative method to detect corners is to locate the points where the corner response function R = det(M) − k × Trace²(M) is larger than a threshold, where k is a small number (0.04, as suggested by Harris). Another useful corner response function is R = det(M) ÷ Trace(M). To avoid corners due to image noise, it is useful to smooth the images with a Gaussian filter. This should, however, not be done on the input images, but on the images containing the squared image derivatives. In practice, if the number of detected corners is much larger than necessary, it can be restricted by raising the corner response threshold or by non-maximum suppression.
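A short sketch of this response computation is given below (not the paper's implementation); the derivative kernels and the Gaussian window width are assumptions, and the smoothing is applied to the squared-derivative images as recommended above.

```python
import cv2
import numpy as np

def harris_response(gray, k=0.04, sigma=1.5):
    """Harris corner response R = det(M) - k * trace(M)^2."""
    gray = gray.astype(np.float32)
    Ix = cv2.Sobel(gray, cv2.CV_32F, 1, 0, ksize=3)
    Iy = cv2.Sobel(gray, cv2.CV_32F, 0, 1, ksize=3)
    # Entries of M in equation (6), summed with a Gaussian window.
    Sxx = cv2.GaussianBlur(Ix * Ix, (0, 0), sigma)
    Syy = cv2.GaussianBlur(Iy * Iy, (0, 0), sigma)
    Sxy = cv2.GaussianBlur(Ix * Iy, (0, 0), sigma)
    det = Sxx * Syy - Sxy * Sxy
    trace = Sxx + Syy
    return det - k * trace * trace

# Corners are local maxima of the response above a threshold; non-maximum
# suppression (e.g. comparing against a dilated response map) keeps one
# detection per neighborhood.
```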

Figure 13. Detected features.

Another issue encountered in our implementation is that the feature points need to be localized with sub-pixel accuracy when computing ∆t. This is accomplished by locally fitting the image with a quadratic surface and finding its stationary point. This step is important because the motion of scene points in each view is relatively small in a video sequence, and errors in the feature points will result in an inaccurate estimate of the camera temporal offset. Figure 13 illustrates an example of feature detection.
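The following is a minimal sketch of such a quadratic-surface refinement, assuming a corner response map (or image patch) and a 5×5 fitting window; the window size, least-squares solver, and fallback behaviour are assumptions.

```python
import numpy as np

def refine_subpixel(response, x, y, r=2):
    """Refine an integer corner location (x, y) by fitting a quadratic surface
    over a (2r+1)^2 window and moving to its stationary point. Assumes the
    window lies inside the array. Returns floating-point (x, y)."""
    win = response[y - r:y + r + 1, x - r:x + r + 1].astype(np.float64)
    ys, xs = np.mgrid[-r:r + 1, -r:r + 1]
    A = np.stack([xs**2, ys**2, xs * ys, xs, ys, np.ones_like(xs)],
                 axis=-1).reshape(-1, 6)
    a, b, c, d, e, _ = np.linalg.lstsq(A, win.ravel(), rcond=None)[0]
    # Stationary point of a*x^2 + b*y^2 + c*x*y + d*x + e*y + f.
    H = np.array([[2 * a, c], [c, 2 * b]])
    if abs(np.linalg.det(H)) < 1e-9:
        return float(x), float(y)               # degenerate fit: keep integer location
    dx, dy = np.linalg.solve(H, [-d, -e])
    return x + float(dx), y + float(dy)
```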

3.3.3 Finding 4-correspondences

A technique for matching two uncalibrated images has been developed by Zhang et al. [31]. It finds initial matches using correlation and relaxation methods, followed by the robust Least-Median-of-Squares (LMedS) technique. This robust technique employs the epipolar geometry estimated from the initial correspondences to find and discard false matches. As shown by the authors, this method performs very well, even in images with numerous repetitive patterns. In our system, Zhang's algorithm for finding feature points across two views is applied, and a post-processing step that links pair-wise correspondences into 4-correspondences has been developed. There are multiple ways of choosing frames for computing pair-wise correspondences. We denote the first and second frames in camera 1 as v1 and v3, and the first and second frames in camera 2 as v2 and v4. One method is to find pair-wise correspondences for v1v2, v2v3, and v3v4 (see Figure 14). Another method that has been considered is to find pair-wise correspondences for the image pairs v1v3, v2v4, and v1v2. Our experience shows that the first method is more reliable, probably because the pair-wise algorithms tend to find similar correspondences when all the transformations across views are similar. It is also possible to modify Zhang's algorithm to find the 4-correspondences directly, which would hopefully make the point tracks more reliable.
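One possible way to link the pair-wise matches into 4-correspondences is sketched below, purely as an illustration; the pixel tolerance, the data layout, and the brute-force search are assumptions (a spatial hash would be used in practice). In the paper's notation, v1 and v3 are the two camera-1 frames and v2, v4 the camera-2 frames, so each linked quadruple corresponds to (a, b', c, d').

```python
def link_four_correspondences(m12, m23, m34, tol=0.5):
    """Chain pair-wise matches (v1,v2), (v2,v3), (v3,v4) into 4-tuples.

    Each m_ij is a list of (point_in_i, point_in_j) pairs; points are (x, y).
    Two points are treated as the same feature if they lie within tol pixels.
    """
    def close(p, q):
        return abs(p[0] - q[0]) <= tol and abs(p[1] - q[1]) <= tol

    quads = []
    for p1, p2 in m12:
        for q2, p3 in m23:
            if not close(p2, q2):
                continue
            for q3, p4 in m34:
                if close(p3, q3):
                    quads.append((p1, p2, p3, p4))
    return quads
```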


Figure 14. Detecting 4-correspondences

3.4 Robust estimation of ∆t

Once 4-correspondences are found, they are used for computing the camera temporal offset ∆t. As mentioned in the previous section, such feature points should lie on the moving objects, and ideally the motion trajectory should be as perpendicular to the epipolar plane as possible. A candidate 4-correspondence is used in the computation of ∆t only if it satisfies the following two conditions:

|a − c| > τ_l   and   α > τ_α,   (8)

where |a − c| is the motion of the feature point in the first view (Figure 15) and α is the angle between ac and the epipolar line passing through a. In our implementation, the two thresholds were set to τ_l = 1.5 pixels and τ_α = arctan(0.3).


Figure 15. The relationship between the camera temporal offset ∆t and the cross ratio.

In practice, we may not be able to find good 4-correspondences at a certain time instant. In order to make the estimation more robust, 4-correspondences are collected in a temporal sliding window from time t − W to time t + W, and all the features in this window are used to estimate ∆t. Under the assumption of a locally constant temporal offset, this aggregation makes the estimation more robust. In our implementation, W = 4. In each sliding window, a candidate ∆t is computed for each 4-correspondence, and a RANSAC (Random Sample Consensus) style procedure is used to obtain a robust estimate of ∆t from all the candidates. For each candidate ∆t, we count how many other candidates agree with it within a given threshold and form a voting set of the accepted candidates. The robust result is then obtained by averaging all candidate ∆t values in the maximum voting set.
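A compact sketch of this voting scheme is given below; the agreement tolerance is an assumed value and the function name is hypothetical.

```python
import numpy as np

def robust_dt(candidates, tol=0.05):
    """Consensus estimate of dt from per-correspondence candidates collected
    in the sliding window (tol is an assumed agreement threshold)."""
    candidates = np.asarray(candidates, dtype=np.float64)
    if candidates.size == 0:
        return None
    # Each candidate votes; keep the largest consensus set and average it.
    agree = np.abs(candidates[:, None] - candidates[None, :]) < tol
    best = np.argmax(agree.sum(axis=1))
    return float(candidates[agree[best]].mean())
```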

3.5 Synthesizing synchronized video streams

New images are synthesized in one of the views to create synchronized video streams. As described in Section 2.2.2, in the second camera view, the images I(t+∆t) and I(t+∆t+1) are warped to the time instant

t+1 using (1 − ∆t) × f1 and ∆t × f2, respectively. Two forward warping steps are required in this process. In our implementation, we approximate this process by two backward warping procedures [28], where the optical flow fields used are −(1 − ∆t) × f1 and −∆t × f2. This is a good approximation to the forward warping method when the motion is not too large.
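The backward-warp approximation can be written with cv2.remap, as in the sketch below; this is only an illustration under the stated flow conventions, not the paper's code.

```python
import numpy as np
import cv2

def backward_warp(img, fwd_flow):
    """Approximate forward warping by fwd_flow with a backward warp: each
    output pixel samples the source image at (x, y) - fwd_flow(x, y), i.e.
    the warp is applied with the negated flow, as described above."""
    h, w = img.shape[:2]
    gx, gy = np.meshgrid(np.arange(w, dtype=np.float32),
                         np.arange(h, dtype=np.float32))
    map_x = (gx - fwd_flow[..., 0]).astype(np.float32)
    map_y = (gy - fwd_flow[..., 1]).astype(np.float32)
    return cv2.remap(img, map_x, map_y, cv2.INTER_LINEAR)

# Usage (dt from Section 3.4; f1, f2 from e.g. cv2.calcOpticalFlowFarneback):
# W1 = backward_warp(I_a, (1.0 - dt) * f1)
# W2 = backward_warp(I_b, dt * f2)
# I_synth = dt * W1 + (1.0 - dt) * W2        # blend of equation (5)
```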

3.6 Depth computation

Once the synthetic synchronized video stream is rendered, a traditional synchronized stereo algorithm can be applied to compute the depth information. A stereo algorithm based on Tao's previous work [21] has been used in our implementation. In contrast with approaches that rely on local matching measures and relaxation, the framework employs depth- and visibility-based rendering within a global matching criterion to compute depth. A color-segmentation-based depth representation guarantees smoothness in textureless regions, and hypothesizing depth from neighboring segments enables the propagation of correct depth and produces reasonable depth values for unmatched regions. A practical algorithm that integrates all these aspects is presented in Tao's paper. Recently published dynamic stereo algorithms can also be applied to the depth computation. Note that most dense stereo algorithms require a rectification pre-processing step.
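As a stand-in for this step (the paper's segmentation-based global matcher [21] is not publicly packaged), the following sketch runs OpenCV's semi-global matcher on one rectified, synchronized pair; the file names and matcher parameters are assumptions.

```python
import cv2

# Illustrative stand-in for the depth computation on a rectified pair:
# the left frame is the real view-1 image, the right frame is the
# synthesized synchronized view-2 image (hypothetical file names).
left = cv2.imread("left_t1.png", cv2.IMREAD_GRAYSCALE)
right = cv2.imread("right_t1_synth.png", cv2.IMREAD_GRAYSCALE)

sgbm = cv2.StereoSGBM_create(minDisparity=0, numDisparities=64, blockSize=5,
                             P1=8 * 5 * 5, P2=32 * 5 * 5)
disparity = sgbm.compute(left, right).astype("float32") / 16.0  # fixed-point to pixels
```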

4 Experimental results and analysis

The proposed algorithm has been tested on synthetic and real image sequences. A single set of parameters is used to obtain all the experimental results shown in this paper. Quantitative as well as qualitative evaluations will be described in this section.

4.1 Synthetic sequence

A synthetic sequence has been generated using 3ds Max™ to obtain quantitative results on how well the proposed algorithm performs. The dynamic scene consists of part of the interior of a room and a moving, framed painting. Images as well as depth maps are synthesized at two viewpoints mimicking a standard stereo setup. To generate unsynchronized video streams from the two viewpoints, the rendered sequences are temporally sub-sampled at the same rate but with different offsets. For example, in order to achieve ∆t = 1/3, we select frames 3n from the first sequence but frames 3n+1 from the second sequence. We show an example with ∆t = 0.5. This is the most difficult case for synthesizing the synchronized sequence because the new frame is farthest from the nearest real frame. Figure 16 shows images in the first view (left view) and the corresponding true disparity maps, which are converted from the depth maps generated by 3ds Max™. The floating-point disparity values range approximately from 3 to 13 pixels in this sequence.


Figure 16. Frames 30 and 50 (from top to bottom) in the left view and the corresponding true disparity maps in the Painting sequence.

Using the method described in Section 3.4, the camera temporal offset ∆t is estimated for every frame based on features collected in the neighboring 9 frames. The results are shown in Figure 17. The estimation is fairly accurate, especially when the vertical motion, which is perpendicular to the epipolar lines, is large at the two ends of the sequence.

Figure 17. The estimation of ∆t in the Painting sequence. The true value is ∆t = 0.5.

Using the estimated camera temporal offset, synchronized images are synthesized. The real images in the second view can be rendered using 3ds Max™ to serve as the ground truth for testing the performance of the image synthesis process. In Figure 18(a-b), the synthesized frame 40 in the right view and the real frame 40 in the second view are shown. Figure 18(c) shows the absolute difference between the two images, and Figure 18(d) shows one of the optical flow fields used for the image warping. The quality of the warped image is very good, except that some errors occur in the occluded regions and around the depth boundaries.


Figure 18. (a) Synthetic frame 40 in the right view, (b) real frame 40 in the right view, (c) the absolute difference between the two images, with an average value of 0.293 and a maximum value of 13.04, and (d) one of the optical flow fields (the x component) used in the image warping.

Disparity maps are computed using the synthetic synchronized image sequences. Some of the results are shown in Figure 19. The estimated disparity values are compared with the ground truth in each frame, and the mean square errors are computed both for the whole scene and for the moving object only. The disparity has also been computed using the same stereo algorithm on the original two unsynchronized sequences, and the mean square errors are computed as well. The MSE values for both experiments are shown in Figure 20. It can be observed that the proposed algorithm dramatically improves the results. The MSE value for the moving part is 0.746 pixels on average for our algorithm. Since our depth algorithm uses discrete disparity values with an interval of 0.25 pixels, the MSE values also contain rounding errors. In comparison, when the disparity is computed directly from the unsynchronized sequences, the average MSE is 9.98 pixels. In order to analyze the disparity error resulting from the lack of synchronization, we use the SSD (Sum of Squared Differences) function as an example to estimate the disparity between two rectified images. When both the left image and the right image are rectified, we can search along the scanline and find the disparity value d by minimizing the SSD function (equation (9)).

SSD(d) = Σ_window ( I_l(x, y) − I_r(x + d, y) )²,   (9)

where I_l is the left image, I_r is the right image, and the sum is taken over a local window. If the 2D motion during the temporal offset is (m_x, m_y), the relation between the left and right images can be modeled as

I_l(x, y) = I_r(x + d_0 + m_x, y + m_y),   (10)

where m_x and m_y are the 2D motion in the horizontal and vertical directions, and d_0 is the true disparity value. With a local Taylor expansion,

I_r(x + d_0 + m_x, y + m_y) ≈ I_r(x, y) + (d_0 + m_x) ∂I_r/∂x + m_y ∂I_r/∂y.   (11)

Plugging this into equation (9), we have

SSD(d) = Σ_window ( (d − d_0 − m_x) ∂I_r/∂x − m_y ∂I_r/∂y )².   (12)

Minimizing the SSD value with respect to the disparity d, we have

d = d_0 + m_x + m_y · ( Σ (∂I_r/∂x)(∂I_r/∂y) ) / ( Σ (∂I_r/∂x)² ).   (13)

This shows that the disparity error caused by the lack of synchronization has two components: the horizontal motion m_x contributes to the disparity error directly, while the vertical motion m_y contributes through the weighting factor Σ (∂I_r/∂x)(∂I_r/∂y) / Σ (∂I_r/∂x)², which is composed of local image gradients.
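A quick numerical sanity check of equation (13), under the linearized model of equation (12), can be run as follows; the synthetic texture and the motion values are arbitrary choices for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Smooth synthetic "right image" texture and its gradients.
Ir = np.cumsum(np.cumsum(rng.standard_normal((64, 64)), axis=0), axis=1)
Iy, Ix = np.gradient(Ir)                     # gradients along rows (y) and columns (x)

d0, mx, my = 2.0, 0.7, 0.4                   # true disparity and 2D motion (illustrative)

# Closed-form minimizer of equation (12), i.e. equation (13).
d_predicted = d0 + mx + my * np.sum(Ix * Iy) / np.sum(Ix * Ix)

# Brute-force minimization of the linearized SSD over candidate disparities.
cands = np.linspace(-2.0, 8.0, 4001)
ssd = [np.sum(((d - d0 - mx) * Ix - my * Iy) ** 2) for d in cands]
d_bruteforce = cands[int(np.argmin(ssd))]

print(d_predicted, d_bruteforce)             # the two values agree to the grid resolution
```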


Figure 19. Disparity maps computed in frame 50 (a) using the proposed algorithm and (b) using the original unsynchronized images.

4.2 Real sequences - indoor

Two different consumer camcorders, a Canon Elura M20 and a Sony PC-110, are used to capture the two unsynchronized sequences in this experiment. The cameras are fixed on tripods at roughly the same height and point toward the center of the scene. The cameras are individually calibrated using a planar pattern.


The camera distortion is estimated and compensated for. The relative pose between the two cameras is estimated using Zhang's algorithm [30], and the fundamental matrix between the two cameras is derived. The other steps are the same as in the synthetic data experiment, except that ground truth data is not available for the camera temporal offset, for real synchronized views in the second camera, or for the disparity maps. Figure 21(a-b) shows an original frame in the first view and the synthesized synchronized view, and Figure 21(c) shows the corresponding disparity map estimated with our algorithm. Figure 22 shows the estimates of the camera temporal offset, with an average value of 0.64. A similar result is illustrated in Figures 23 and 24. In Figure 23(a-b), the synchronized image pair is shown, and Figure 23(c) shows the corresponding disparity map estimated with our algorithm. Figure 24 shows the estimates of the camera temporal offset, with an average value of 0.471.

Figure 20. Average disparity values on the moving part of the scene


Figure 21. The Balloon sequence: (a) the original frame 50 in the first view, (b) the synthesized synchronized view at the same time instant, and (c) the corresponding disparity maps in the left view.

Figure 22. The estimates of ∆t in the Balloon sequence. The average ∆t is 0.64.


Figure 23. The Frisbee sequence: (a) the original frame 15 in the first view (b) the synthesized synchronized view at the same time instant and (c) the corresponding disparity maps in the left view.

Figure 24. The estimated ∆t in the Frisbee sequence.

4.3 Real sequences - outdoor

The proposed algorithm has also been tested on outdoor scenes. The camera setup is similar to the one used for the indoor sequences, except that the two cameras are at different heights to avoid the coplanar configuration discussed in Section 2.2.1. In Figure 25, some of the initial frames and the computed depth maps are shown. Figure 26 shows the estimates of the camera temporal offset, with an average value of 0.592.


Figure 25. The Walking sequence I. Upper row: the original frames. Lower row: the estimated depth maps.

Figure 26. The estimated ∆t in the Walking sequence.

A similar result is illustrated in Figures 27 and 28. In Figure 27, some of the initial frames and the computed depth maps are shown. For comparison, the disparity has also been computed using the same stereo algorithm on the two original unsynchronized sequences. It can be observed that the proposed algorithm dramatically improves the results on the moving person. Figure 28 shows the estimates of the camera temporal offset, with an average value of 0.583.


Figure 27. The Walking sequence II. Upper row: the original frames. Middle row: using the proposed algorithm. Lower row: using the original unsynchronized images.

Figure 28. The estimated ∆t in the Walking sequence.


5 Discussions and conclusions

We have proposed in this paper a simple method to compute depth information from unsynchronized video streams. The algorithm is based on two reasonable assumptions regarding the scene motion and the camera temporal offset. The proposed algorithm avoids the difficult 4-correspondence problem by computing depth from synthesized views. A feature-based method has been developed for the robust estimation of camera temporal offsets. Promising experimental results have been obtained with a relatively straightforward implementation. During the development of this algorithm, we have noticed that our algorithm is sensitive to the magnitude of the motion during the temporal offset period. If the motion is small compared with the errors caused by camera calibration, pose estimation, and quantization, the estimation of the temporal offset becomes unstable. Fortunately, the depth computation result is still good enough, since unsynchronized image pairs are very close to synchronized image pairs when the motion is small. Moreover, the temporal offset may then be computed using the linear motion model over a longer period (the temporal offset plus several frame intervals).

However, accurate dense estimation of optical flow becomes difficult when the motion is large. As we know, the performance of the depth computation depends on the synthesized synchronized image pairs, which are created from the optical flow and the original unsynchronized image pairs. An iterative algorithm that jointly refines the temporal offset and the optical flow is expected to be developed in the future. Along the course of developing this algorithm, we have noticed many new and exciting research problems in unsynchronized stereo computation. Some of them include camera calibration and pose estimation with moving objects in the scene, stereo computation from unsynchronized moving cameras, and the robust estimation of camera temporal offsets when very few features can be detected on the moving objects.


References

[1.] S. Avidan and A. Shashua, "Trajectory Triangulation: 3D Reconstruction of Moving Points from a Monocular Image Sequence," IEEE Transactions on Pattern Analysis and Machine Intelligence (PAMI), vol. 22, no. 4, pp. 348-357, 2000.
[2.] P. N. Belhumeur, "A Bayesian approach to binocular stereopsis," Int. Journal of Computer Vision, vol. 19, no. 3, pp. 237-260, August 1996.
[3.] Y. Boykov, O. Veksler, and R. Zabih, "Fast approximate energy minimization via graph cuts," in Proc. Int. Conf. on Computer Vision (ICCV'99), Sept. 1999.
[4.] R. L. Carceroni and K. N. Kutulakos, "Scene capture by surfel sampling: from multi-view streams to non-rigid 3D motion, shape and reflectance," in Proc. Int. Conf. on Computer Vision (ICCV'01), July 2001.
[5.] Y. Caspi and M. Irani, "Spatio-Temporal Alignment of Sequences," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 24, no. 11, November 2002.
[6.] J. Davis, R. Ramamoorthi, and S. Rusinkiewicz, "Spacetime stereo: A unifying framework for depth from triangulation," in IEEE Conf. on Computer Vision and Pattern Recognition, 2003.
[7.] U. R. Dhond and J. K. Aggarwal, "Structure from stereo: a review," IEEE Transactions on Systems, Man, and Cybernetics, vol. 19, no. 6, pp. 1489-1510, 1989.
[8.] K. J. Hanna and N. E. Okamoto, "Combining stereo and motion analysis for direct estimation of scene structure," in Proc. Int. Conf. on Computer Vision, pp. 357-265, 1993.
[9.] C. Harris and M. Stephens, "A combined corner and edge detector," Fourth Alvey Vision Conference, pp. 147-151, 1988.
[10.] L. Hong and G. Chen, "Segment-Based Stereo Matching Using Graph Cuts," in Proc. IEEE Conf. on Computer Vision and Pattern Recognition (CVPR'04), June 2004.
[11.] T. Kanade and M. Okutomi, "A stereo matching algorithm with an adaptive window: Theory and experiment," IEEE Trans. on Pattern Analysis and Machine Intelligence, 16(9), 1994.
[12.] K. N. Kutulakos and S. M. Seitz, "A theory of shape by space carving," in Proc. Int. Conf. on Computer Vision (ICCV'99), pp. 307-314, 1999.
[13.] M. Lin and C. Tomasi, "Surfaces with Occlusions from Layered Stereo," Ph.D. thesis, Stanford University, 2002.
[14.] R. Mandelbaum, G. Salgian, and H. Sawhney, "Correlation based estimation of ego-motion and structure from motion and stereo," in Int. Conf. on Computer Vision, pp. 544-550, 1999.

[15.] S. Roy and I. J. Cox, "A maximum-flow formulation of the N-camera stereo correspondence problem," in Proc. Int. Conf. on Computer Vision (ICCV'98), Bombay, India, January 1998.
[16.] D. Scharstein and R. Szeliski, "Stereo matching with nonlinear diffusion," International Journal of Computer Vision, 28(2):155-174, July 1998.
[17.] D. Scharstein and R. Szeliski, "A Taxonomy and Evaluation of Dense Two-Frame Stereo Correspondence Algorithms," Int. Journal of Computer Vision, 47(1/2/3):7-42, April-June 2002.
[18.] J. Shi and C. Tomasi, "Good Features to Track," in IEEE Conference on Computer Vision and Pattern Recognition, pp. 593-600, 1994.
[19.] S. Shimizu, H. Fujiyoshi, Y. Nagasaka, and T. Takahashi, "A Pseudo Stereo Vision Method for Unsynchronized Cameras," in Proc. of Asian Conference on Computer Vision (ACCV'04), vol. 1, pp. 575-580, 2004.
[20.] J. Sun, H. Y. Shum, and N. N. Zheng, "Stereo matching using belief propagation," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 25, pp. 1-13, 2003.
[21.] H. Tao, H. S. Sawhney, and R. Kumar, "Dynamic depth recovery from multiple synchronized video streams," in IEEE Conf. on Computer Vision and Pattern Recognition, 2001.
[22.] P. H. S. Torr and C. Davidson, "IMPSAC: A synthesis of importance sampling and random sample consensus to effect multi-scale image matching for small and wide baselines," in The Sixth European Conference on Computer Vision, pp. 819-833, 2000.
[23.] P. H. S. Torr, T. Wong, D. W. Murray, and A. Zisserman, "Cooperating motion processes," in Proceedings of the British Machine Vision Conference, pp. 126-129, 1991.
[24.] S. Vedula, S. Baker, P. Rander, R. Collins, and T. Kanade, "Three-dimensional scene flow," in Proc. Int. Conf. on Computer Vision, pp. II-722-729, Sept. 1999.
[25.] G. Wolberg, "Digital Image Warping," Wiley-IEEE Press, July 1990.
[26.] Y. Xiong and L. Matthies, "Error Analysis of a Real-Time Stereo System," in IEEE Conf. on Computer Vision and Pattern Recognition, 1997.
[27.] G. Young and R. Chellappa, "3-D motion estimation using a sequence of noisy stereo images: models, estimation and uniqueness results," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 12, pp. 735-759, 1990.
[28.] L. Zhang, B. Curless, and S. M. Seitz, "Spacetime Stereo: Shape Recovery for Dynamic Scenes," in Proc. IEEE Conf. on Computer Vision and Pattern Recognition (CVPR), Madison, WI, June 2003, pp. 367-374.

[29.] Y. Zhang and C. Kambhamettu, "Integrated 3D scene flow and structure recovery from multiview image sequences," in Proc. IEEE Conf. on Computer Vision and Pattern Recognition (CVPR'00), pp. II-674-681, South Carolina, June 2000.
[30.] Y. Zhang and C. Kambhamettu, "On 3D scene flow and structure estimation," in IEEE Conf. on Computer Vision and Pattern Recognition, 2001.
[31.] Z. Zhang, R. Deriche, O. Faugeras, and Q.-T. Luong, "A Robust Technique for Matching Two Uncalibrated Images Through the Recovery of the Unknown Epipolar Geometry," Artificial Intelligence Journal, vol. 78, pp. 87-119, October 1995.
[32.] Z. Zhang, "A flexible new technique for camera calibration," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 22, pp. 1330-1334, 2000.
[33.] C. Zhou and H. Tao, "Depth Computation of Dynamic Scenes Using Unsynchronized Video Streams," in Proc. IEEE Conf. on Computer Vision and Pattern Recognition (CVPR'03), pp. II-351-358, 2003.
[34.] J.-Y. Bouguet, Camera Calibration Toolbox for Matlab, http://www.vision.caltech.edu/bouguetj/calib_doc/index.html
