International Journal of Computer Vision, x, 1-20 (1997)

© 1997 Kluwer Academic Publishers, Boston. Manufactured in The Netherlands.

3-D Scene Data Recovery using Omnidirectional Multibaseline Stereo

SING BING KANG

[email protected] Digital Equipment Corporation, Cambridge Research Lab, One Kendall Square, Bldg. 700, Cambridge, MA 02139

RICHARD SZELISKI

[email protected] Microsoft Corporation, One Microsoft Way, Redmond, WA 98052-6399

Received XX, 1996. Revised YY, 1996.

Abstract.

A traditional approach to extracting geometric information from a large scene is to compute multiple 3-D depth maps from stereo pairs or direct range finders, and then to merge the 3-D data. However, the resulting merged depth maps may be subject to merging errors if the relative poses between depth maps are not known exactly. In addition, the 3-D data may also have to be resampled before merging, which adds additional complexity and potential sources of errors. This paper provides a means of directly extracting 3-D data covering a very wide field of view, thus bypassing the need for numerous depth map merging. In our work, cylindrical images are first composited from sequences of images taken while the camera is rotated 360° about a vertical axis. By taking such image panoramas at different camera locations, we can recover 3-D data of the scene using a set of simple techniques: feature tracking, an 8-point structure from motion algorithm, and multibaseline stereo. We also investigate the effect of median filtering on the recovered 3-D point distributions, and show the results of our approach applied to both synthetic and real scenes.

Keywords: Omnidirectional stereo, multibaseline stereo, panoramic structure from motion, 8-point algorithm, scene modeling, 3-D median filtering.

1. Introduction

A traditional approach to extracting geometric information from a large scene is to compute multiple (possibly numerous) 3-D depth maps from stereo pairs, and then to merge the 3-D data [9], [12], [24], [29]. This is not only computationally intensive, but the resulting merged depth maps may be subject to merging errors, especially if the relative poses between depth maps are not known exactly. The 3-D data may also have to be resampled before merging, which adds additional complexity and potential sources of errors. 3-D data registration and merging are very much simplified if the motion of the stereo pair is known. One simple instance of this is fixing the location

of the center of the reference camera of a stereo pair and constraining the motion of the stereo pair to rotation about the vertical axis. However, unless a motorized rotary table is used to control the amount of rotation, the rotation between successive stereo views still has to be estimated. There is still the same (but much more constrained) problem of registration and 3-D resampling, albeit with only one rotational degree of freedom. This paper provides a means of directly extracting 3-D data covering a very wide field of view, thus bypassing the need for numerous depth map merging. In our work, cylindrical images are first composited from sequences of images taken while the camera is rotated 360° about a vertical axis. By taking such image panoramas at different


camera locations, we can recover 3-D data from the scene using a set of simple techniques: feature tracking, 8-point direct and iterative structure from motion algorithms, and multibaseline stereo. There are several advantages to this approach. First, the cylindrical image mosaics can be built quite accurately, since the camera motion is very restricted. Second, the relative pose of the various camera locations can be determined with much greater accuracy than with regular structure from motion applied to images with narrower fields of view. Third, there is no need to build or purchase a specialized stereo camera whose calibration may be sensitive to drift over time; any conventional video camera on a tripod will suffice. Our approach can be used to construct models of building interiors, both for virtual reality applications (games, home sales, architectural remodeling) and for robotics applications (navigation). In this paper, we describe our approach to generating 3-D data corresponding to a very wide field of view (specifically, 360°), and show results of our approach on both synthetic and real scenes. We first review relevant work in Section 2 before delineating our basic approach in Section 3. The method used to extract wide-angle images (i.e., panoramic images) is described in Section 4. Section 5 reviews the 8-point algorithm and shows how it can be applied to cylindrical panoramic images. Section 6 describes two methods for extracting 3-D point data: the first relies on unconstrained tracking and uses an 8-point structure from motion algorithm, while the second constrains the search for feature correspondences to epipolar lines (traditional stereo). We briefly outline our approach to modeling the data in Section 7; details are given elsewhere [15]. Finally, we show results of our approach in Section 8 and close with a discussion and conclusions.

2. Relevant work

There is a significant body of work on range image recovery using stereo (good surveys can be found in [3], [6]). Most work on stereo uses images with limited fields of view. One of the earliest works to use panoramic images is the omnidirectional stereo system of Ishiguro [13], which uses

two panoramic views. Each panoramic view is created by one of the two vertical slits of a camera image sweeping around 360°; the cameras (which are displaced in front of the rotation center) are rotated by very small angles, typically about 0.4°. One of the disadvantages of this method is the slow data acquisition, which takes about 10 minutes. The camera angular increments must be approximately 1/f radians, and are assumed to be known a priori. Murray [22] generalizes Ishiguro et al.'s approach by using all the vertical slits of the image (except in the paper, he uses a single image raster). This would be equivalent to structure from known motion or motion stereo. The advantage is more efficient data acquisition, done at lower angular resolution. The analysis involved in this work is similar to Bolles et al.'s [4] spatiotemporal epipolar analysis, except that the temporal dimension is replaced by that of angular displacement. Another related work is that of plenoptic modeling [21]. The idea is to composite rotated camera views into panoramas, and based on two cylindrical panoramas, project disparity values between these locations to a given viewing position. While a disparity value is estimated for each pixel in each cylindrical panorama, no explicit 3-D model is ever constructed. Our approach is similar to that of [21] in that we composite rotated camera views into panoramas as well. However, we go a step further by reconstructing 3-D feature points and modeling the scene based upon the recovered points. We use multiple panoramas for more accurate 3-D reconstruction.

3. Overview of approach

Our ultimate goal is to generate a photorealistic model to be used in a variety of scenarios. We are interested in providing a simple means of generating such models. We also wish to minimize the use of CAD packages as a means of 3-D model generation, since such an effort is labor-intensive [36]. In addition, we would like to generate our 3-D scene models using off-the-shelf equipment. In our case, we use a workstation with a framegrabber (real-time image digitizer) and a standard 8-mm camcorder.

Fig. 1. Generating a scene model from multiple 360° panoramic views (omnidirectional multibaseline stereo, recovered 3-D scene points, modeled scene).

Our approach is straightforward: at each camera location in the scene, we capture sequences of images while rotating the camera about the vertical axis passing through the camera optical center. We composite each set of images to produce a panorama at each camera location. We then use stereo to extract 3-D data of the scene. Finally, we model the scene using these 3-D data as input and render it with textures extracted from the input 2-D images. Our approach is summarized in Figure 1. Using panoramic images, we can directly extract 3-D data covering a very wide field of view, thus bypassing the need for numerous depth map merging. Multiple depth map merging is not only computationally intensive, but the resulting merged depth maps may be subject to merging errors, especially if the relative poses between depth maps are not known exactly. The 3-D data may also have to be resampled before merging, which adds additional complexity and potential sources of errors. Using multiple camera locations in stereo analysis significantly reduces the number of ambiguous matches and also has the effect of reducing errors by averaging [23], [16]. This is especially important for images with very wide fields of view, because depth recovery is unreliable near the epipoles¹, where the looming effect takes place, resulting in very poor depth cues.


4. Extraction of panoramic images

A panoramic image is created by compositing a series of rotated camera images, as shown in Figure 2. In order to create this panoramic image, we first have to ensure that the camera is rotating about an axis passing through its optical center, i.e., we must eliminate motion parallax when panning the camera around. To achieve this, we manually adjust the position of the camera relative to an X-Y precision stage (mounted on the tripod) such that the motion parallax effect disappears when the camera is rotated back and forth about the vertical axis [30].

Fig. 2. Compositing multiple rotated camera views into a panorama. The marks indicate the locations of the camera optical and rotation center.

Prior to image capture of the scene, we calibrate the camera to compute its intrinsic camera parameters (specifically its focal length f, aspect ratio r, and radial distortion coefficient κ). The camera is calibrated by taking multiple snapshots of a planar dot pattern grid with known depth separation between successive snapshots. We use an iterative least-squares algorithm (Levenberg-Marquardt) to estimate the camera intrinsic and extrinsic parameters (except for κ) [34]. κ is determined using a 1-D search (Brent's parabolic interpolation in 1-D [25]) with the least-squares algorithm as the black box.

The steps involved in extracting a panoramic scene are as follows:

• At each camera location, capture a sequence while panning the camera around 360°.
• Using the intrinsic camera parameters, correct the image sequence for r, the aspect ratio, and κ, the radial distortion coefficient.
• Convert the (r, κ)-corrected 2-D flat image sequence to cylindrical coordinates, with the focal length f as the cross-sectional radius. An example of a sequence of corrected images (of an office) is shown in Figure 3.
• Composite the images (with only an x-directional DOF, which is equivalent to motion in the angular dimension of cylindrical image space) to yield the desired panorama [31]. The relative displacement of one frame to the next is coarsely determined by using phase correlation [19]. This technique estimates the 2-D translation between a pair of images by taking 2-D Fourier transforms of both images, computing the phase difference at each frequency, performing an inverse Fourier transform, and searching for a peak in the magnitude image. Subsequently, the image translation is refined using local image registration by directly comparing the overlapped regions between the two images [31], [32].
• Correct for slight errors in the resulting length (which in theory equals 2πf) by propagating the residual displacement error equally across all images and recompositing. The error in length is usually within a percent of the expected length.

Fig. 3. Example undistorted image sequence (of an office).

An example of a panoramic image created from the office scene in Figure 3 is shown in Figure 4.

Fig. 4. Panorama of office scene after compositing.

Fig. 5. Cylindrical coordinate system. P is any point on the cylindrical surface while P' is the point projected onto the x-z plane. θ is the angle subtended by the x-axis and the line segment O-P', O being the center of the coordinate frame. f is the camera focal length while v is the height of point P.
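As an illustration of the coarse alignment step above, the following is a minimal sketch of phase correlation in Python/NumPy. The function name and interface are ours (not from the paper's implementation), and it assumes two equal-sized grayscale frames.

```python
import numpy as np

def phase_correlation(im1, im2):
    """Coarse 2-D translation between two equal-sized grayscale frames:
    take the 2-D FFTs, keep only the phase difference, invert, and
    locate the correlation peak."""
    F1, F2 = np.fft.fft2(im1), np.fft.fft2(im2)
    cross = F1 * np.conj(F2)
    cross /= np.abs(cross) + 1e-12          # normalized cross-power spectrum
    corr = np.abs(np.fft.ifft2(cross))
    dy, dx = np.unravel_index(np.argmax(corr), corr.shape)
    # Peaks past the midpoint correspond to negative shifts (FFT wrap-around).
    if dy > im1.shape[0] // 2:
        dy -= im1.shape[0]
    if dx > im1.shape[1] // 2:
        dx -= im1.shape[1]
    return dx, dy
```

The estimated shift is then used only as a starting point; the final alignment comes from the direct local registration of the overlapped regions [31], [32].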

5. Recovery of epipolar geometry

In order to extract 3-D data from a given set of panoramic images, we first have to know the relative positions of the camera corresponding to the panoramic images. For a calibrated camera, this is equivalent to determining the epipolar geometry between a reference panoramic image and every other panoramic image. The epipolar geometry dictates the epipolar constraint, which refers to the locus of possible image projections in one image given an image point in another image. For planar image planes, the epipolar constraint is in the form of straight lines [7]. For cylindrical images, epipolar curves are sinusoids [21]. We use the 8-point algorithm [20], [11] to extract the essential matrix, which yields both the relative camera placement and the epipolar geometry. This is done pairwise, namely between a reference panoramic image and another panoramic image. There are, however, four possible solutions [20], [11]. The solution that yields the most positive projections (i.e., projections away from the camera optical centers) is chosen.

5.1. 8-point algorithm: Basics

We briefly review the 8-point algorithm here. If the camera is calibrated, i.e., its intrinsic parameters are known, then for any two corresponding image points (at two different camera placements) (u, v, w)^T and (u', v', w')^T in 3-D, we have

(u', v', w')\, E\, (u, v, w)^T = 0.   (1)

The matrix E is called the essential matrix, and is of the form E = [t]_× R, where R and t are the rotation matrix and translation vector, respectively, and [t]_× is the matrix form of the cross product with t. If the camera is not calibrated, we have a more general relation between two corresponding image points (on the image plane) (u, v, 1)^T and (u', v', 1)^T, namely

(u', v', 1)\, F\, (u, v, 1)^T = 0.   (2)

F is called the fundamental matrix and is also of rank 2, F = [t]_× A, where A is an arbitrary 3 × 3 matrix. The fundamental matrix is the generalization of the essential matrix E, and is usually employed to establish the epipolar geometry and to recover projective depth [8], [27]. In our case, since we know the camera parameters, we can recover E. Let e be the vector comprising e_{ij}, where e_{ij} is the (i,j)th element of E. Then for all the point matches, we have from (1)

u u' e_{11} + u v' e_{21} + u w' e_{31} + v u' e_{12} + v v' e_{22} + v w' e_{32} + w u' e_{13} + w v' e_{23} + w w' e_{33} = 0,   (3)

from which we get a set of linear equations of the form

A e = 0.   (4)

If the number of input points is small, the output of the algorithm is sensitive to noise. On the other hand, it turns out that normalizing the 3-D point location vectors on the cylindrical image reduces the sensitivity of the 8-point algorithm to noise. This is similar in spirit to Hartley's application of isotropic scaling [11] prior to using the 8-point algorithm (though Hartley's algorithm is for recovering the fundamental matrix and not the essential matrix).


The 3-D cylindrical points are normalized according to the relation

u = (f \sin\theta,\; v,\; f \cos\theta) \;\rightarrow\; \hat{u} = u / |u|,   (5)

i.e., we normalize each vector so that it is a unit direction in space. The coordinate system used for the cylinder is shown in Figure 5. With N panoramic images, we solve (N − 1) sets of linear equations of the form (4). The kth set corresponds to the panoramic image pair 1 and (k + 1). Notice that the solution for t is defined only up to an unknown scale. In our work, we measure the distance between camera positions; this enables us to recover the scale. However, we can relax this assumption by carrying out the following steps:

• Fix the camera distance of the first pair (pair 1) to, say, unit distance. Assign the camera distances for all the other pairs to be the same as the first.
• Calculate the essential matrices for all the pairs of panoramic images, assuming unit camera distances.
• For each pair, compute the 3-D points.
• To estimate the relative distance between camera positions for pair j ≠ 1 (i.e., not the first pair), find the scale of the 3-D points corresponding to pair j that minimizes the distance error to those corresponding to pair 1. Robust statistics is used to reject outliers; specifically, only the best 50% are used. Note that outlier rejection is performed at this step of relative scale recovery only.

Once the relative scales are recovered, all of the scaled points are used in determining the optimal merged 3-D positions.
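To make the linear step concrete, the sketch below assembles and solves the system of (4) from normalized ray directions as in (5), then enforces the essential-matrix structure by SVD. This is a generic 8-point implementation under our own naming, not the authors' code; the choice among the four (R, t) decompositions and the relative-scale step described above are omitted.

```python
import numpy as np

def essential_from_rays(rays1, rays2):
    """Linear 8-point estimate of the essential matrix from N >= 8
    corresponding unit ray directions (two N x 3 arrays).  Builds one
    row of A e = 0 (eq. 4) per correspondence, takes the null vector,
    then enforces the essential-matrix singular values."""
    rays1, rays2 = np.asarray(rays1, float), np.asarray(rays2, float)
    # Row n holds the coefficients of rays2[n]^T E rays1[n] = 0,
    # with E flattened row-major.
    A = np.einsum('ni,nj->nij', rays2, rays1).reshape(len(rays1), 9)
    _, _, Vt = np.linalg.svd(A)
    E = Vt[-1].reshape(3, 3)
    U, _, Vt = np.linalg.svd(E)
    return U @ np.diag([1.0, 1.0, 0.0]) @ Vt

# Normalized cylindrical rays per (5), e.g.:
#   rays = np.stack([f * np.sin(theta), v, f * np.cos(theta)], axis=1)
#   rays /= np.linalg.norm(rays, axis=1, keepdims=True)
```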

5.2. Tracking features for 8-point algorithm

The 8-point algorithm assumes that feature point correspondences are available. Feature tracking is a challenge in that purely local tracking fails because the displacement can be large (of the order of about 100 pixels, in the direction of camera motion). To mitigate this problem, we use spline-based tracking, which attempts to globally minimize the image intensity differences. This yields estimates of optic flow, which in turn is used by a local tracker to refine the amount of feature displacement.


The optic flow between a pair of cylindrical panoramic images is first estimated using spline-based image registration between the pair [33], [35]. In this image registration approach, the displacement fields u(x, y) and v(x, y) (i.e., displacements in the x- and y-directions as functions of the pixel location) are represented as two-dimensional splines controlled by a smaller number of displacement estimates which lie on a coarser spline control grid. Once the initial optic flow has been found, the best candidates for tracking are then chosen. The choice is based on the minimum eigenvalue of the local Hessian, which is an indication of local image texturedness. Subsequently, using the initial optic flow as an estimate of the displacement field, we use the Shi-Tomasi tracker [28] with a window of size 25 pixels × 25 pixels to further refine the displacements of the chosen point features. Why did we apply the spline-based tracker before using the Shi-Tomasi tracker? This approach is used to take advantage of the complementary characteristics of these two trackers, namely:

1. The spline-based image registration technique is capable of tracking features with larger displacements. This is done through coarse-to-fine image registration; in our work, we use 6 levels of resolution. While this technique generally results in good tracks (sub-pixel accuracy) [35], poor tracks may result in areas in the vicinity of object occlusions/disocclusions.
2. The Shi-Tomasi tracker is a local tracker that fails at large displacements. It performs better for a small number of frames and for relatively small displacements, but deteriorates at large numbers of frames and in the presence of rotation on the image plane [35]. We are considering a small number of frames at a time, and image warping due to local image plane rotation is not expected. The Shi-Tomasi tracker is also capable of sub-pixel accuracy.

The approach that we have undertaken for object tracking can be thought of as a "fine-to-finer" tracking approach. In addition to feature displacements, a measure of the reliability of the tracks is available (according to match errors and local texturedness, the latter indicated by the minimum eigenvalue of the local Hessian [28], [35]).

As we will see later in Section 8.1, this is used to cull possibly bad tracks and improve the 3-D estimates. Once we have extracted point feature tracks, we can then proceed to recover the 3-D positions corresponding to these feature tracks. 3-D data recovery is based on the simple notion of stereo.
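The texturedness criterion used to select candidate features (the minimum eigenvalue of the local gradient matrix, in the spirit of Shi-Tomasi [28]) can be sketched as follows. This assumes SciPy is available for the window summation and is only an illustration of the selection rule, not the trackers of [33], [35].

```python
import numpy as np
from scipy.signal import convolve2d

def min_eigenvalue_map(image, window=25):
    """Texturedness score per pixel: the smaller eigenvalue of the 2x2
    matrix of windowed gradient products (the 'local Hessian' of the
    SSD match error).  Good candidate features have a large score."""
    gy, gx = np.gradient(image.astype(np.float64))
    k = np.ones((window, window))
    sxx = convolve2d(gx * gx, k, mode='same')
    syy = convolve2d(gy * gy, k, mode='same')
    sxy = convolve2d(gx * gy, k, mode='same')
    # Closed-form smaller eigenvalue of [[sxx, sxy], [sxy, syy]].
    tr, det = sxx + syy, sxx * syy - sxy * sxy
    return tr / 2.0 - np.sqrt(np.maximum(tr * tr / 4.0 - det, 0.0))
```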

6. Omnidirectional multibaseline stereo

The idea of extracting 3-D data simultaneously from more than the theoretically sufficient number of two camera views is founded on two simple tenets: statistical robustness from redundancy and disambiguation of matches due to overconstraints [23], [16]. The notion of using multiple camera views is even more critical when using panoramic images taken at the same vertical height, which results in the epipoles falling within the images. If only two panoramic images are used, points that are close to the epipoles will not be reliable. It is also important to note that this problem will persist if all the multiple panoramic images are taken at camera positions that are collinear. In the experiments described in Section 8, the camera positions are deliberately arranged such that all the positions are not collinear. In addition, all the images are taken at the same vertical height to maximize view overlap between panoramic images. We use three related approaches to reconstruct 3-D from multiple panoramic images. 3-D data recovery is done either by (1) using just the 8-point algorithm on the tracks and directly recovering the 3-D points, (2) proceeding with an iterative least-squares method to refine both camera pose and 3-D feature locations, or (3) going a step further to impose epipolar constraints in performing a full multiframe stereo reconstruction. The first approach is termed unconstrained tracking and 3-D data merging, the second approach is iterative structure from motion, and the third approach is constrained depth recovery using epipolar geometry.

6.1. Reconstruction Method 1: Unconstrained feature tracking and 3-D data merging

In this approach, we use the tracked feature points across all panoramic images and apply the 8-point algorithm. From the extracted essential matrix and camera relative poses, we can then directly estimate the 3-D positions. The sets of 2-D image data are used to determine (pairwise) the essential matrix. The recovery of the essential matrix turns out to be reasonably stable; this is due to the large (360°) field of view. A problem with the 8-point algorithm is that optimization occurs in function space and not image space, i.e., it is not minimizing the error in distance between a 2-D image point and its corresponding epipolar line. Deriche et al. [5] use a robust regression method called least-median-of-squares to minimize the distance error between expected (from the estimated fundamental matrix) and given 2-D image points. We have found that extracting the essential matrix using the 8-point algorithm is relatively stable as long as (1) the number of points is large (at least in the hundreds), and (2) the points are well distributed over the field of view. In this approach, we use the same set of data to recover Euclidean shape. In theory, the recovered positions are only true up to a scale. Since the distances between camera locations are known and measured, we are able to get the true scale of the recovered shape. Note, however, that this approach does not depend critically on knowing the camera distances, as indicated in Section 5.1.

Once we have recovered the camera poses, i.e., the rotation matrices and translation vectors, we use the following method for estimating the 3-D point positions p_i. Let u_{ik} be the ith point of image k, \hat{v}_{ik} be the unit vector from the optical center to the panoramic image point in 3-D space, l_{ik} be the corresponding line passing through both the optical center and the panoramic image point in space, and t_k be the camera translation associated with the kth panoramic image (note that t_1 = 0). The equation of line l_{ik} is then r_{ik} = \lambda_{ik} \hat{v}_{ik} + t_k. Thus, for each point i (that is constrained to lie on line l_{i1}), we minimize the error function

E_i = \sum_{k=2}^{N} \| r_{i1} - r_{ik} \|^2   (6)

where N is the number of panoramic images. By taking the partial derivatives of E_i with respect to \lambda_{ij}, j = 1, ..., N, equating them to zero, and solving, we get

\lambda_{i1} = \frac{\sum_{k=2}^{N} t_k^T \left( \hat{v}_{i1} - (\hat{v}_{i1}^T \hat{v}_{ik}) \hat{v}_{ik} \right)}{\sum_{k=2}^{N} \left( 1 - (\hat{v}_{i1}^T \hat{v}_{ik})^2 \right)},   (7)

from which the reconstructed 3-D point is calculated using the relation p_{i1} = \lambda_{i1} \hat{v}_{i1}. Note that a more optimal manner of estimating the 3-D point is to minimize the expression

E_i = \sum_{k=1}^{N} \| p_{i1} - r_{ik} \|^2.   (8)

A detailed derivation involving (8) is given in Appendix A.1. To simplify the inverse texture-mapping of the input images onto the recovered 3-D mesh of the estimated points, the projections of the estimated 3-D points have to coincide with the 2-D image locations in the reference image. This can be justified by noting that since the feature tracks originate from the reference image, it is reasonable to assume that there is no uncertainty in feature location in the reference image (see [2] for a discussion of similar ideas). An immediate problem with the approach of feature tracking and data merging is its reliance on tracking, which makes it relatively sensitive to tracking errors. It inherits the problems associated with tracking, such as the aperture problem and sensitivity to changing amounts of object distortion at different viewpoints. However, this problem is mitigated if the number of sampled points is large. In addition, the advantage is that there is no need to specify the minimum and maximum depths and resolution associated with a multibaseline stereo depth search (e.g., see [23], [16]). This is because the points are extracted directly and analytically once the correspondence is established.
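A direct implementation of the closed-form depth estimate (7) is straightforward. The sketch below uses our own function name and assumes unit ray directions and camera translations expressed in the reference frame (t_1 = 0).

```python
import numpy as np

def depth_on_reference_ray(v1, rays, translations):
    """Closed-form depth lambda_1 along the reference ray (eq. 7).
    v1 is the unit ray in the reference panorama; `rays` and
    `translations` hold the unit rays v_k and camera translations t_k
    of the other panoramas (k = 2..N)."""
    num, den = 0.0, 0.0
    for vk, tk in zip(rays, translations):
        c = float(v1 @ vk)                  # cosine of the ray angle
        num += float(tk @ (v1 - c * vk))
        den += 1.0 - c * c
    return num / den                        # reconstructed point: lam * v1
```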

6.2. Reconstruction Method 2: Iterative panoramic structure from motion

The 8-point algorithm recovers the camera motion parameters directly from the panoramic tracks, from which the corresponding 3-D points can be


computed. However, the camera motion parameters may not be optimally recovered, even though experiments by Hartley using narrow view images indicate that the motion parameters are close to optimal [11]. Using the output of the 8-point algorithm and the recovered 3-D data, we can apply an iterative least-squares minimization to refine both camera motion and 3-D positions simultaneously. This is similar to work done by Szeliski and Kang on structure from motion using multiple narrow camera views [34]. As input to our reconstruction method, we use 3-D normalized locations of cylindrical image points. The equation linking a 3-D normalized cylindrical image position u_{ij} in frame j to its 3-D position p_i, where i is the track index, is

u_{ij} = \mathcal{P}\left( R_j^{(k)} p_i + t_j^{(k)} \right) = \mathcal{F}(p_i; q_j, t_j)   (9)

where \mathcal{P}(\cdot) is the projection transformation, and R_j^{(k)} and t_j^{(k)} are the rotation matrix and translation vector, respectively, associated with the relative pose of the jth camera. We represent each rotation by a quaternion q = [w, (q_0, q_1, q_2)] with a corresponding rotation matrix

R(q) = \begin{pmatrix} 1 - 2q_1^2 - 2q_2^2 & 2q_0 q_1 - 2wq_2 & 2q_0 q_2 + 2wq_1 \\ 2q_0 q_1 + 2wq_2 & 1 - 2q_0^2 - 2q_2^2 & 2q_1 q_2 - 2wq_0 \\ 2q_0 q_2 - 2wq_1 & 2q_1 q_2 + 2wq_0 & 1 - 2q_0^2 - 2q_1^2 \end{pmatrix}   (10)

(alternative representations for rotations are discussed in [1]). The projection equation is given simply by

\begin{pmatrix} u \\ v \\ w \end{pmatrix} = \mathcal{P}\begin{pmatrix} x \\ y \\ z \end{pmatrix} = \frac{1}{\sqrt{x^2 + y^2 + z^2}} \begin{pmatrix} x \\ y \\ z \end{pmatrix}.   (11)

In other words, all the 3-D points are projected onto the surface of a 3-D unit sphere. To solve for the structure and motion parameters simultaneously, we use the iterative Levenberg-Marquardt algorithm. The Levenberg-Marquardt method is a standard non-linear least squares technique [25] that works well in a wide range of situations. It provides a way to vary smoothly between the inverse-Hessian method and the steepest descent method. The merit or objective function that we minimize is

C(a) = \sum_i \sum_j c_{ij} \left| u_{ij} - \mathcal{F}(a_{ij}) \right|^2,   (12)

where \mathcal{F}(\cdot) is given in (9) and

a_{ij} = \left( p_i^T, q_j^T, t_j^T \right)^T   (13)

is the vector of structure and motion parameters which determine the image of point i in frame j. The weight c_{ij} in (12) describes our confidence in measurement u_{ij}, and is normally set to the inverse variance \sigma_{ij}^{-2}. We set c_{ij} = 1. The Levenberg-Marquardt algorithm first forms the approximate Hessian matrix

A = \sum_i \sum_j c_{ij} \left( \frac{\partial \mathcal{F}(a_{ij})}{\partial a} \right)^T \frac{\partial \mathcal{F}(a_{ij})}{\partial a}   (14)

and the weighted gradient vector

b = - \sum_i \sum_j c_{ij} \left( \frac{\partial \mathcal{F}(a_{ij})}{\partial a} \right)^T e_{ij},   (15)

where e_{ij} = u_{ij} - \mathcal{F}(a_{ij}) is the image plane error of point i in frame j. Given a current estimate of a, the algorithm computes an increment \Delta a towards the local minimum by solving

(A + \lambda I)\, \Delta a = -b,   (16)

where \lambda is a stabilizing factor which varies over time [25]. Note that the matrix A is an approximation to the Hessian matrix, as the second-derivative terms are left out. As mentioned in [25], inclusion of these terms can be destabilizing if the model fits badly or is contaminated by outlier points. To compute the required derivatives for (14) and (15), we compute derivatives with respect to each of the fundamental operations (perspective projection, rotation, translation) and apply the chain rule. The equations for each of the basic derivatives are given in Appendix A.2. The derivation is exactly the same as in [34], except for the projection equation.
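The following sketch shows one Levenberg-Marquardt update assembled as in (14)-(16). For brevity it differentiates a generic stacked residual e(a) = u - F(a) numerically rather than with the chain-rule derivatives of Appendix A.2, and it omits the schedule for adjusting the stabilizing factor; the function name and interface are ours.

```python
import numpy as np

def lm_step(residual, a, lam=1e-3):
    """One Levenberg-Marquardt update following (14)-(16).  `residual`
    returns the stacked residual vector e(a) = u - F(a) for a float
    parameter vector a; its Jacobian J = de/da = -dF/da is approximated
    here by central differences."""
    e = residual(a)
    eps = 1e-6
    J = np.empty((e.size, a.size))
    for m in range(a.size):
        da = np.zeros_like(a)
        da[m] = eps
        J[:, m] = (residual(a + da) - residual(a - da)) / (2 * eps)
    A = J.T @ J        # approximate Hessian of eq. (14), since J = -dF/da
    b = J.T @ e        # gradient vector of eq. (15), again using dF/da = -J
    delta = np.linalg.solve(A + lam * np.eye(a.size), -b)   # eq. (16)
    return a + delta
```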


Fig. 6. Principle of omnidirectional multibaseline stereo. O1 ... Ok represent the centers of the panoramic images 1 to k, respectively, with panoramic image 1 taken to be the reference image.

6.3. Reconstruction Method 3: Constrained depth recovery using epipolar geometry

As a result of the first reconstruction method's reliance on tracking, it suffers from the aperture problem and hence yields a limited number of reliable points. The approach of using the epipolar geometry to limit the search is designed to reduce the severity of this problem. Given the epipolar geometry, for each image point in the reference panoramic image, a constrained search is performed along the line of sight through the image point. Subsequently, the position along this line which results in the minimum match error at the projected image coordinates corresponding to the other viewpoints is chosen. Using this approach results in a denser depth map, due to the epipolar constraint. This constraint reduces the aperture problem during search (which theoretically only occurs if the direction of ambiguity is along the epipolar line of interest). The principle is the same as that described in [16]. The principle of multibaseline stereo in the context of multiple panoramic images is shown in Figure 6. In that figure, take the panoramic image with center O1 as the reference image. Suppose we are interested in determining the depth associated with image point u as shown in Figure 6. Given minimum and maximum depths as well as the depth resolution, we then project hypothesized 3-D points onto the rest of the panoramic images (six points in our example).


For each hypothesized point in the search, we find the sum of squared intensity errors between the local windows centered at the projected image points and the local window in the reference image. The hypothesized 3-D point that results in the minimum error is then taken to be the correct location. Note that the 3-D points on a straight line are projected to image points that lie on a sinusoidal curve on the cylindrical panoramic image. The window size that we use is 25×25. The results do not seem to change significantly with slight variations of the window size (e.g., 23×23 and 27×27). There was significant degradation in the quality of the results for small window sizes (we have tried 11×11). While this approach mitigates the aperture problem, it suffers from a much higher computational demand. In addition, the recovered epipolar geometry is still dependent on the output quality of the 8-point algorithm (which in turn depends on the quality of tracking). The user also has to specify minimum and maximum depths as well as the resolution of the depth search. An alternative to working in cylindrical coordinates is to project sections of the cylinder to a tangential rectilinear image plane, rectify it, and use the rectified planes for multibaseline stereo. This mitigates the computational demand, as the search is restricted to horizontal scanlines in the rectified images. However, there is a major problem with this scheme: reprojecting to rectilinear coordinates and rectifying is problematic due to the increasing distortion away from the new center of projection. This creates a problem with matching using a window of a fixed size. As a result, this scheme of reprojecting to rectilinear coordinates and rectifying is not used. A point can be made for using the original image sequences in the projection onto composite planar images to get scan-line epipolar geometry for a speedier stereo matching process. There are questions as to what the optimal projection directions should be and how many projections should be used. The simplest approach, as described in the previous paragraph, would be to project subsets of images to pairwise tangent planes. However, because the relative rotation within the image sequence cannot be determined exactly, one would still encounter the problem of constructing composite planar images that are not exactly physically correct.


In addition, one would still have to use a variable window size scheme due to the varying amount of distortion across the composite planar image (e.g., comparing a local low-distortion area in the middle of one composite planar image with another area that has high distortion near the side of another composite planar image). Our approach of directly using the cylindrical images is mostly a matter of expediency.
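A sketch of the constrained search for a single reference pixel is given below. The cylindrical projection follows the coordinate system of Figure 5; the row/column conventions, window handling, pose ordering, and function names are our own simplifying assumptions rather than the paper's implementation.

```python
import numpy as np

def project_to_panorama(p, R, t, f, width, v0):
    """Project a 3-D point into a cylindrical panorama: transform into
    the camera frame (R p + t), map the azimuth to a column, and map the
    cylindrical height v = f*y / sqrt(x^2 + z^2) to a row (Figure 5)."""
    x, y, z = R @ p + t
    theta = np.arctan2(x, z) % (2 * np.pi)
    col = theta / (2 * np.pi) * width
    row = v0 + f * y / np.hypot(x, z)
    return int(round(row)), int(round(col))

def depth_by_ssd(theta0, v, f, panos, poses, depths, width, v0, win=12):
    """Constrained multibaseline search along the line of sight through
    the reference pixel (theta0, v): for each candidate depth, project
    the hypothesized 3-D point into every other panorama and keep the
    depth that minimizes the summed SSD of local windows.  panos[0] is
    the reference panorama; poses[k] = (R, t) for panorama k."""
    ray = np.array([f * np.sin(theta0), v, f * np.cos(theta0)])
    ray /= np.linalg.norm(ray)
    r0, c0 = project_to_panorama(ray, np.eye(3), np.zeros(3), f, width, v0)
    ref = panos[0][r0 - win:r0 + win + 1, c0 - win:c0 + win + 1]
    best_depth, best_err = None, np.inf
    for d in depths:
        point, err = d * ray, 0.0
        for pano, (R, t) in zip(panos[1:], poses[1:]):
            r, c = project_to_panorama(point, R, t, f, width, v0)
            w = pano[r - win:r + win + 1, c - win:c + win + 1]
            if w.shape != ref.shape:
                continue  # window fell outside this panorama; skip it
            err += float(np.sum((w - ref) ** 2))
        if err < best_err:
            best_depth, best_err = d, err
    return best_depth
```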

7. Stereo data segmentation and modeling

Once the 3-D stereo data has been extracted, we can then model them with a 3-D mesh and texture-map each face with the associated part of the 2-D image panorama. We have done work to reduce the complexity of the resulting 3-D mesh by planar patch fitting and boundary simplification. Our simplification and noise reduction algorithm is based on a segmentation of the input surface mesh into surface patches using a least squares fitting of planes. Simplification is achieved by extracting, approximating, and triangulating the boundaries between surface patches. The displayed models shown in this paper are rendered using our modeling system. A more detailed description of model extraction from range data is given in [15].

8. Experimental results

In this section, we present the results of applying our approach to recover 3-D data from multiple panoramic images. We have used both synthetic and real images to test our approach. As mentioned earlier, in the experiments described in this section, the camera positions are deliberately arranged so that all of the positions are not collinear. In addition, all the images are taken at the same vertical height to maximize overlap between panoramic images.

8.1. Synthetic scene

The synthetic scene is a room comprising objects such as tables, tori, cylinders, and vases.

One half of the room is textured with a mandrill image while the other is textured with a regular Brodatz pattern. The synthetic objects and images are created using Rayshade, which is a program for creating ray-traced color images [17]. The synthetic images created are free from any radial distortion, since Rayshade is currently unable to model this camera characteristic. The omnidirectional synthetic depth map of the entire room is created by merging the depth maps associated with the multiple views taken around inside the room. The composite panoramic view of the synthetic room from its center is shown in Figure 11. From left to right, we can observe the vases resting on a table, vertical cylinders, a torus resting on a table, and a larger torus. The results of applying both reconstruction methods (i.e., unconstrained search with 8-point and constrained search using epipolar geometry) can be seen in Figure 7. We get many more points using constrained search (about 3 times more), but the quality of the 3-D reconstruction appears more degraded (compare Figure 7(b) with (d)). This is in part due to matching occurring at integral values of pixel positions, limiting its depth resolution. The dimensions of the synthetic room are 10 (length) × 8 (width) × 6 (height), and the specified resolution is 0.01. The quality of the recovered 3-D data appears to be enhanced by applying a 3-D median filter. The median filter works in the following manner: for each feature point in the cylindrical panoramic image, find the other feature points within a certain neighborhood radius (20 in our case). Then sort the 3-D depths associated with the neighborhood feature points, find the median depth, and rescale the depth associated with the current feature point such that the new depth is the median depth. As an illustration, suppose the original 3-D feature location is v_i = d_i \hat{v}_i, where d_i is the original depth and \hat{v}_i is the 3-D unit vector from the camera center in the direction of the image point. If d_{med} is the median depth within its neighborhood, then the filtered 3-D feature location is given by v_i' = (d_{med}/d_i) v_i = d_{med} \hat{v}_i. However, the median filter also has the effect of rounding off corners.
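The 3-D median filter just described can be sketched as follows; for simplicity the neighborhood of a feature includes the feature itself, and the neighborhood radius is taken in pixels on the panorama.

```python
import numpy as np

def median_filter_depths(pixels, depths, radius=20.0):
    """For each feature, replace its depth with the median depth of the
    features whose panoramic image locations lie within `radius` pixels.
    The filtered 3-D point is then (d_med / d_i) * v_i = d_med * v_hat_i."""
    pixels = np.asarray(pixels, dtype=float)     # N x 2 image locations
    depths = np.asarray(depths, dtype=float)     # N original depths d_i
    filtered = np.empty_like(depths)
    for i, p in enumerate(pixels):
        near = np.linalg.norm(pixels - p, axis=1) <= radius
        filtered[i] = np.median(depths[near])
    return filtered
```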

Fig. 7. Comparison of 3-D points recovered of the synthetic room: (a) correct distribution; (b) unconstrained 8-point; (c) iterative; (d) constrained search; (e) median-filtered 8-point; (f) median-filtered iterative; (g) median-filtered constrained; (h) top view of 3-D mesh of (e). The four camera locations are indicated by +'s in (h).

The mesh in Figure 7(h) and the three views in Figure 8 are generated by our 3-D modeling system described in [15]. As can be seen from these figures, the 3-D recovered points and the subsequent model based on these points basically preserved the shape of the synthetic room. While shape distortions can be easily seen at the edges, texture-mapping tends to reduce the visual effect of incorrectly recovered shapes away from the edges.

Fig. 8. Three views of modeled synthetic room of Figure 7(h). Note the distorted shape of the torus in (c) and the top parts of the cylinders in (b) and (c) that do not look circular.

In addition, we performed a series of experiments to examine the effect of both "bad" track removal and median filtering on the quality of the recovered depth information of the synthetic room. The feature tracks are sorted in increasing order according to the error in matching². We continually remove the tracks that have the worst match error, recovering the 3-D point distribution at each instant.


Fig. 11. Panorama of synthetic room after compositing.

Table 1. Comparison of 3-D RMS error between unconstrained and constrained stereo results (n is the number of points).

                   constrained (n=10040)   8-point (n=3057)   8-point (n=1788)
original           0.315039                0.393777           0.302287
median-filtered    0.266600                0.364889           0.288079

From the graph in Figure 10, we see an interesting result: as more tracks are taken out, retaining the better ones, the quality of 3-D point recovery improves, up to a point. The improvement in the accuracy is not surprising, since the worse tracks, which are more likely to result in worse 3-D estimates, are removed. However, as more and more tracks are removed, the gap between the amount of accuracy demanded of the tracks, given an increasingly smaller number of available tracks, and the track accuracy available, grows. This results in generally worse estimates of the epipolar geometry, and hence of the 3-D data. Concomitant to the reduction of the number of points is the sensitivity of the recovery of both the epipolar geometry (in the form of the essential matrix) and the 3-D data. This is evidenced by the fluctuation of the curves at the lower end of the graph. Another interesting result that can be observed is that the 3-D point distributions that have been median filtered have lower errors, especially for higher numbers of recovered 3-D points. As indicated by the graph in Figure 10, the accuracy of the point distribution derived from just the 8-point algorithm is almost equivalent to that of using an iterative least-squares (Levenberg-Marquardt) minimization, which is statistically optimal near the true solution. This result is in agreement with Hartley's application of the 8-point algorithm to narrow-angle images [11]. It is also worth noting that the accuracy of the iterative algorithm is best at smaller numbers of input points, suggesting that it is more stable given a smaller number of input data.


Table 1 lists the 3-D errors of both constrained and unconstrained (8-point only) methods for the synthetic scenes. It appears from this result that the constrained method yields better results (after median filtering) and more points (a result of reducing the aperture problem). In practice, as we shall see in the next section, problems due to misestimation of the camera intrinsic parameters (specifically the focal length, aspect ratio, and radial distortion coefficient) cause 3-D reconstruction from real images to be worse. This is a subject of on-going research.


Fig. 10. 3-D RMS error vs. number of points. The original number of points (corresponding to 100%) is 3057. The dimensions of the synthetic room are 10 (length) × 8 (width) × 6 (height). The curves compare the 8-point method (known and unknown camera distance), the iterative method, and their median-filtered counterparts. Note that by "8-point," we mean reconstruction method 1, with the application of the 8-point algorithm and direct 3-D position calculation. The "iterative" method is reconstruction method 2. The results of reconstruction method 3 are not represented here because the selected points (2-D location and sample size) are different.

Fig. 12. Extracted 3-D points and mesh of office scene: (a) unconstrained 8-point; (b) median-filtered version of (a); (c) iterative; (d) median-filtered version of (c); (e) constrained search; (f) median-filtered version of (e); (g) 3-D mesh of (b). Notice that the recovered distributions shown in (c) and (d) appear more rectangular than those shown in (a) and (b). The camera locations are indicated by +'s in (g).


8.2. Real scenes

The setup that we used to record our image sequences consists of a DEC Alpha workstation with a J300 framegrabber, and a camcorder (Sony Handycam CCD-TR81) mounted on an X-Y position stage affixed to a tripod stand. The camcorder settings are made such that its field of view is maximized (at about 43°). To reiterate, our method of generating the panoramic images is as follows:

• Calibrate the camcorder using an iterative Levenberg-Marquardt least-squares algorithm [34].
• Adjust the X-Y position stage while panning the camera left and right to remove the effect of motion parallax; this ensures that the camera is then rotated about its optical center.
• At each camera location, record onto tape an image sequence while rotating the camera, and then digitize the image sequence using the framegrabber.
• Using the recovered camera intrinsic parameters (focal length, aspect ratio, radial distortion factor), undistort each image.
• Project each image, which is in rectilinear image coordinates, into cylindrical coordinates (whose cross-sectional radius is the camera focal length).
• Composite the frames into a panoramic image. The number of frames used to extract a panoramic image in our experiments is typically about 50.

Fig. 9. Three texture-mapped views of the modeled office scene of Figure 12(g).

We recorded image sequences of two scenes, namely an office scene and a lab scene. A panoramic image of the office scene is shown in Figure 4. We extracted four panoramic images corresponding to four different locations in the office. (The spacing between these locations is about 6 inches and the locations are roughly at the corners of a square. The size of the office is about 10 feet by 15 feet.) The results of 3-D point recovery of the office scene are shown in Figure 12, with three sample views of its model shown in Figure 9. As can be seen from Figure 12, the results due to the constrained search approach look much worse. This may be directly attributed to the inaccuracy of the extracted intrinsic camera parameters. As a consequence, the composited panoramas may actually not be exactly physically correct. In fact, as the matching (with the epipolar constraint) is in progress, it has been observed that the actual correct matches are not exactly along the epipolar lines; there are slight vertical drifts, generally of the order of about one or two pixels.

Fig. 13. Panorama of laboratory after compositing.

Another example of a real scene is shown in Figure 13. A total of eight panoramas at eight different locations (about 3 inches apart, ordered

roughly in a zig-zag fashion) in the lab are extracted. The longest dimensions of the L-shaped lab are about 15 feet by 22.5 feet. The 3-D point distribution is shown in Figure 15, while Figure 14 shows three views of the recovered model of the lab. As can be seen, the shape of the lab has been reasonably well recovered; the "noise" points at the bottom of Figure 15(a) correspond to positions outside the laboratory, since there are parts of the transparent laboratory window that are not covered. This reveals one of the weaknesses of any correlation-based algorithm (namely, all stereo algorithms): they do not work well with image reflections and transparent material. Again, we observe that the points recovered using constrained search are worse.

The errors that were observed with the real scene images, especially with constrained search, are due to the following practical problems:

• The auto-iris feature of the camcorder used cannot be deactivated (even though the focal length was kept constant). As a result, there may in fact be slight variations in focal length as the camera was rotated.
• The camera may not be rotating exactly about its optical center, since the adjustment of the X-Y position stage is done manually and there may be human error in judging the absence of motion parallax.
• The camera may not be rotating about a unique axis all the way around (assumed to be vertical) due to some play or unevenness of the tripod.
• There were digitization problems. The images digitized from tape (i.e., while the camcorder is playing the tape) contain scan lines that are occasionally horizontally shifted; this is probably caused by a degraded blanking signal not being properly detected by the framegrabber. However, compositing many images averages out most of these artifacts.
• The extracted camera intrinsic parameters may not be very precise.

Fig. 14. Three texture-mapped views of the modeled laboratory scene of Figure 15(g).

As a result of the problems encountered, the resulting composited panorama may not be physically correct. This especially causes problems with constrained search given the estimated epipolar geometry (through the essential matrix). We actually widened the search a little by allowing search as much as a couple of pixels away from the epipolar line. The results look only slightly better, however; while there is a greater chance of matching the correct locations, there is also a greater chance of confusion. This relaxed mode of search further increases the computational demand and has the effect of loosening the constraints, thus making this approach less attractive.

9. Discussion and conclusions

We have shown that omnidirectional depth data (whose denseness depends on the amount of local texture) can be extracted using a set of simple techniques: camera calibration, image compositing, feature tracking, the 8-point algorithm, and constrained search using the recovered epipolar geometry. The advantage of our work is that we are able to extract depth data within a wide field of view simultaneously, which removes many of the traditional problems associated with recovering camera pose and narrow-baseline stereo. Despite the practical problems caused by using unsophisticated equipment, which result in slightly incorrect panoramas, we are still able to extract reasonable 3-D data.

Fig. 15. Extracted 3-D points and mesh of laboratory scene: (a) unconstrained 8-point; (b) median-filtered version of (a); (c) iterative; (d) median-filtered version of (c); (e) constrained search; (f) median-filtered version of (e); (g) 3-D mesh of (b). The camera locations are indicated by +'s in (g).

Omnidirectional Multibaseline Stereo incorrect panoramas, we are still able to extract reasonable 3-D data. Thus far, the best real data results come from using unconstrained tracking and the 8-point algorithm (both direct and iterative structure from motion). Results also indicate that the application of 3-D median ltering improves both the accuracy and appearance of stereo-computed 3-D point distribution. In terms of the di erences between the three reconstruction methods, reconstruction methods 1 (8-point and direct 3-D calculation) and 2 (iterative structure from motion) yield virtually the same results, which suggests that the 8-point algorithm applied to panoramic images gives near optimal camera motion estimates. This is consistent with the intuition that widening the eld of view with the attendant increase in image resolution results in more accurate estimation of egomotion; this was veri ed experimentally by Tian et al. [37]. One can then deduce that the iterative technique is usually not necessary. In the case of reconstruction method 3, where constrained search using the epipolar constraint is performed, denser data are obtained at the expense of much higher computation. In addition, the minimum and maximum depths have to be speci ed a priori. Based on real data results, the accuracy obtained using reconstruction method 3 is more sensitive to how physically correct the panoramas are composited. The compositing error can be attributed to the error in estimating the camera intrinsic parameters, namely the focal length and radial distortion factor [14]. To expedite the panorama image production in critical applications that require close to real-time modeling, special camera equipment may be called for. One such possible specialized equipment is Ahuja's camera system (as reported in [10], [18]), in which the lens can be rotated relative to the imaging plane. However, we are currently putting our emphasis on the use of commercially available equipment such as a cheap camcorder. Even if all the practical problems associated with imperfect data acquisition were solved, we still have the fundamental problem of stereo|that of the inability to match and extract 3-D data in textureless regions. In scenes that involve mostly textureless components such as bare walls and ob-


Currently, the omnidirectional data, while obtained through a 360° view, has limited vertical view. We plan to extend this work by merging multiple omnidirectional data sets obtained at both different heights and different locations. We will also look into the possibility of extracting panoramas of larger height extents by incorporating tilted (i.e., rotated about a horizontal axis) camera views. This would enable scene reconstruction of a building floor involving multiple rooms with a good vertical view. We are currently characterizing the effects of misestimated intrinsic camera parameters (focal length, aspect ratio, and the radial distortion factor) on the accuracy of the recovered 3-D data. In summary, our set of methods for reconstructing 3-D scene points within a wide field of view has been shown to be quite robust and accurate. Wide-angle reconstruction of 3-D scenes is conventionally achieved by merging multiple range images; our methods have been demonstrated to be a very attractive alternative for wide-angle 3-D scene model recovery. In addition, these methods do not require specialized camera equipment, thus making commercialization of this technology easier and more direct. We strongly feel that this development is a significant step toward attaining the goal of creating photorealistic 3-D scenes with minimum human intervention.

Acknowledgments

We would like to thank Andrew Johnson for the use of his 3-D modeling and rendering program and Richard Weiss for helpful discussions.

Appendix

A.1. Optimal point intersection

In order to find the point closest to all of the rays whose line equations are of the form r = t_k + \lambda_k \hat{v}_k, we minimize the expression

E = \sum_k \| p - (t_k + \lambda_k \hat{v}_k) \|^2   (A1)

where p is the optimal point of intersection to be determined. Taking the partials of E with respect to \lambda_k and p and equating them to zero, we have

\frac{\partial E}{\partial \lambda_k} = 2 \hat{v}_k^T (t_k + \lambda_k \hat{v}_k - p) = 0   (A2)

\frac{\partial E}{\partial p} = -2 \sum_k (t_k + \lambda_k \hat{v}_k - p) = 0.   (A3)

Solving for \lambda_k in (A2), noting that \hat{v}_k^T \hat{v}_k = 1, and substituting \lambda_k in (A3) yields

\sum_k \left( t_k - \hat{v}_k (\hat{v}_k^T t_k) - p + \hat{v}_k (\hat{v}_k^T p) \right) = 0,

from which

p = \left[ \sum_k A_k \right]^{-1} \left[ \sum_k A_k t_k \right] = \left[ \sum_k A_k \right]^{-1} \left[ \sum_k p_k \right],   (A4)

where

A_k = I - \hat{v}_k \hat{v}_k^T

is the perpendicular projection operator for ray \hat{v}_k, and p_k = t_k - \hat{v}_k (\hat{v}_k^T t_k) = A_k t_k is the point along the viewing ray r = t_k + \lambda_k \hat{v}_k closest to the origin. Thus, the optimal intersection point for a bundle of rays can be computed as a weighted sum of adjusted camera centers (indicated by the t_k's), where the weighting is in the direction perpendicular to the viewing ray. A more "optimal" estimate can be found by minimizing the formula

E = \sum_k \lambda_k^{-2} \| p - (t_k + \lambda_k \hat{v}_k) \|^2   (A5)

with respect to p and the \lambda_k's. Here, by weighting each squared perpendicular distance by \lambda_k^{-2}, we are downweighting points further away from the camera. The justification for this formula is that the uncertainty in the \hat{v}_k direction defines a conical region of uncertainty in space centered at the camera, i.e., the uncertainty in point location (and hence the inverse weight) grows linearly with \lambda_k. However, implementing this minimization requires an iterative non-linear solver.
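A direct implementation of (A4) takes only a few lines; the sketch below assumes unit ray directions v_k and uses our own function name.

```python
import numpy as np

def intersect_rays(translations, directions):
    """Least-squares intersection of the rays r_k = t_k + lam_k * v_k,
    computed in closed form via (A4): p = (sum A_k)^-1 (sum A_k t_k)
    with A_k = I - v_k v_k^T (the v_k are assumed to be unit vectors)."""
    S = np.zeros((3, 3))
    rhs = np.zeros(3)
    for t, v in zip(translations, directions):
        A = np.eye(3) - np.outer(v, v)   # perpendicular projection operator
        S += A
        rhs += A @ np.asarray(t, dtype=float)
    return np.linalg.solve(S, rhs)
```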

A.2. Elemental transform derivatives

The derivative of the projection function (11) with respect to its 3-D arguments and internal parameters is straightforward:

\frac{\partial \mathcal{P}(x)}{\partial x} = \frac{1}{D} \begin{pmatrix} y^2 + z^2 & -xy & -xz \\ -xy & x^2 + z^2 & -yz \\ -xz & -yz & x^2 + y^2 \end{pmatrix}

where

D = (x^2 + y^2 + z^2)^{3/2}.

The derivatives of an elemental rigid transformation (9) x' = R x + t are

\frac{\partial x'}{\partial x} = R, \qquad \frac{\partial x'}{\partial t} = I,

and

\frac{\partial x'}{\partial q} = -R\, C(x)\, G(q),

where

C(x) = \begin{pmatrix} 0 & -z & y \\ z & 0 & -x \\ -y & x & 0 \end{pmatrix}

and

G(q) = 2 \begin{pmatrix} -q_0 & w & q_2 & -q_1 \\ -q_1 & -q_2 & w & q_0 \\ -q_2 & q_1 & -q_0 & w \end{pmatrix}

(see [26]). The derivatives of a screen coordinate with respect to any motion or structure parameter can be computed by applying the chain rule and the above set of equations.

Notes

1. For a pair of images taken at two different locations, the epipoles are the locations on the image planes which are the intersection between these image planes and the line joining the two camera optical centers. An excellent description of stereo vision is given in [7].
2. Note that in general, a "worse" track in this sense need not necessarily translate to a worse 3-D estimate. A high match error may be due to apparent object distortion at different viewpoints.


References

1. N. Ayache. Artificial Vision for Mobile Robots: Stereo Vision and Multisensory Perception. MIT Press, Cambridge, Massachusetts, 1991.
2. A. Azarbayejani and A. P. Pentland. Recursive estimation of motion, structure, and focal length. IEEE Transactions on Pattern Analysis and Machine Intelligence, 17(6):562-575, June 1995.
3. S. T. Barnard and M. A. Fischler. Computational stereo. Computing Surveys, 14(4):553-572, December 1982.
4. R. C. Bolles, H. H. Baker, and D. H. Marimont. Epipolar-plane image analysis: An approach to determining structure from motion. International Journal of Computer Vision, 1:7-55, 1987.
5. R. Deriche, Z. Zhang, Q.-T. Luong, and O. Faugeras. Robust recovery of the epipolar geometry for an uncalibrated stereo rig. In Third European Conference on Computer Vision (ECCV'94), pages 567-576, Springer-Verlag, Stockholm, Sweden, May 1994.
6. U. R. Dhond and J. K. Aggarwal. Structure from stereo: A review. IEEE Transactions on Systems, Man, and Cybernetics, 19(6):1489-1510, November/December 1989.
7. O. Faugeras. Three-Dimensional Computer Vision: A Geometric Viewpoint. MIT Press, Cambridge, Massachusetts, 1993.
8. O. D. Faugeras. What can be seen in three dimensions with an uncalibrated stereo rig? In Second European Conference on Computer Vision (ECCV'92), pages 563-578, Springer-Verlag, Santa Margherita Ligure, Italy, May 1992.
9. F. P. Ferrie and M. D. Levine. Integrating information from multiple views. In IEEE Workshop on Computer Vision, pages 117-122, IEEE Computer Society, 1987.
10. D. H. Freedman. A camera for near, far, and wide. Discover, 16(48):48, November 1995.
11. R. Hartley. In defence of the 8-point algorithm. In Fifth International Conference on Computer Vision (ICCV'95), pages 1064-1070, IEEE Computer Society Press, Cambridge, Massachusetts, June 1995.
12. K. Higuchi, M. Hebert, and K. Ikeuchi. Building 3D models from unregistered range images. Technical Report CMU-CS-93-214, Carnegie Mellon University, November 1993.
13. H. Ishiguro, M. Yamamoto, and S. Tsuji. Omnidirectional stereo. IEEE Transactions on Pattern Analysis and Machine Intelligence, 14:257-262, 1992.
14. S. B. Kang and R. Weiss. Characterization of Errors in Compositing Panoramic Images. Technical Report 96/2, Digital Equipment Corporation, Cambridge Research Lab, June 1996.
15. S. B. Kang, A. Johnson, and R. Szeliski. Extraction of Concise and Realistic 3-D Models from Real Data. Technical Report 95/7, Digital Equipment Corporation, Cambridge Research Lab, October 1995.
16. S. B. Kang, J. Webb, L. Zitnick, and T. Kanade. A multibaseline stereo system with active illumination and real-time image acquisition. In Fifth International Conference on Computer Vision (ICCV'95), pages 88-93, Cambridge, Massachusetts, June 1995.


17. C. E. Kolb. Rayshade User's Guide and Reference Manual. August 1994.
18. A. Krishnan and N. Ahuja. Panoramic image acquisition. In IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR'96), San Francisco, California, June 1996.
19. C. D. Kuglin and D. C. Hines. The phase correlation image alignment method. In IEEE 1975 Conference on Cybernetics and Society, pages 163-165, New York, September 1975.
20. H. C. Longuet-Higgins. A computer algorithm for reconstructing a scene from two projections. Nature, 293:133-135, 1981.
21. L. McMillan and G. Bishop. Plenoptic modeling: An image-based rendering system. Computer Graphics (SIGGRAPH'95), 39-46, August 1995.
22. D. W. Murray. Recovering range using virtual multicamera stereo. Computer Vision and Image Understanding, 61(2):285-291, 1995.
23. M. Okutomi and T. Kanade. A multiple baseline stereo. IEEE Transactions on Pattern Analysis and Machine Intelligence, 15(4):353-363, April 1993.
24. B. Parvin and G. Medioni. B-rep from unregistered multiple range images. In IEEE Int'l Conference on Robotics and Automation, pages 1602-1607, IEEE Society, May 1992.
25. W. H. Press, B. P. Flannery, S. A. Teukolsky, and W. T. Vetterling. Numerical Recipes in C: The Art of Scientific Computing. Cambridge University Press, Cambridge, England, second edition, 1992.
26. A. A. Shabana. Dynamics of Multibody Systems. J. Wiley, New York, 1989.
27. A. Shashua. Projective structure from uncalibrated images: Structure from motion and recognition. IEEE Transactions on Pattern Analysis and Machine Intelligence, 16(8):778-790, August 1994.
28. J. Shi and C. Tomasi. Good features to track. In IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR'94), pages 593-600, IEEE Computer Society, Seattle, Washington, June 1994.
29. H.-Y. Shum, K. Ikeuchi, and R. Reddy. Principal component analysis with missing data and its application to object modeling. In IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR'94), pages 560-565, IEEE Computer Society, Seattle, Washington, June 1994.
30. G. Stein. Accurate internal camera calibration using rotation, with analysis of sources of error. In Fifth International Conference on Computer Vision (ICCV'95), pages 230-236, Cambridge, Massachusetts, June 1995.
31. R. Szeliski. Image Mosaicing for Tele-Reality Applications. Technical Report 94/2, Digital Equipment Corporation, Cambridge Research Lab, June 1994.
32. R. Szeliski. Video mosaics for virtual environments. IEEE Computer Graphics and Applications, 22-30, March 1996.
33. R. Szeliski and J. Coughlan. Hierarchical spline-based image registration. In IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR'94), pages 194-201, IEEE Computer Society, Seattle, Washington, June 1994.


34. R. Szeliski and S. B. Kang. Recovering 3D shape and motion from image streams using nonlinear least squares. Journal of Visual Communication and Image Representation, 5(1):10-28, March 1994.
35. R. Szeliski, S. B. Kang, and H.-Y. Shum. A parallel feature tracker for extended image sequences. In IEEE International Symposium on Computer Vision, pages 241-246, Coral Gables, Florida, November 1995.
36. C. J. Taylor, P. E. Debevec, and J. Malik. Reconstructing polyhedral models of architectural scenes from photographs. In Fourth European Conference on Computer Vision (ECCV'96), pages 659-668, Springer-Verlag, Cambridge, England, April 1996.
37. T. Y. Tian, C. Tomasi, and D. J. Heeger. Comparison of approaches to egomotion computation. In IEEE Computer Society Conference on Computer Vision and Pattern Recognition, pages 315-320, San Francisco, CA, June 1996.
