Structure and Motion Estimation from Dynamic Silhouettes under Perspective Projection

Tanuja Joshi, Narendra Ahuja, Jean Ponce
Beckman Institute, University of Illinois
405 North Mathews Ave, Urbana, IL 61801, U.S.A.

Abstract

We address the problem of estimating the structure and motion of a smooth curved object from its silhouettes observed over time by a trinocular stereo rig under perspective projection. We first construct a model for the local structure along the silhouette for each frame in the temporal sequence. The local models are then integrated into a global surface description by estimating the motion between successive frames. The algorithm tracks certain surface features (parabolic points) and image features (silhouette inflections and frontier points), which are used to bootstrap the motion estimation process. The entire silhouette, along with the reconstructed local structure, is then used to refine the initial motion estimate. We have implemented the proposed approach and report results on real images.

Submitted to the International Conference on Computer Vision, 1995.

1 Introduction

An object's silhouette and its deformations as the object moves with respect to the camera reflect the object's structure and motion characteristics. For objects that have little or no surface detail, such as surface markings or texture, silhouettes serve as an important cue for the estimation of object structure and motion. Several methods have been proposed for structure estimation from silhouettes under known camera motion [1, 4, 8, 14, 15, 16, 17]. These approaches have demonstrated that given a set of three or more nearby views of a smooth object, the structure of the object up to second order can be obtained along its silhouette. The recovery of structure and motion from a monocular sequence of silhouettes has been investigated by Giblin et al. [7]. Considering the case of a smooth curved object rotating about a fixed axis with constant angular velocity, they have shown that given a complete set of orthographic silhouettes, the rotation axis and the velocity can be recovered, along with the visible region of the object surface. In the case of a known angular velocity, they have also shown that the rotation axis can be recovered from the silhouettes of the object over a short time interval.

The problem of estimating structure and motion of a smooth object undergoing arbitrary unknown motion was considered in [9]. The structure and motion were estimated by tracking the dynamic silhouettes observed by a trinocular stereo rig over time under orthographic projection. In this paper, we consider the analogous problem for the case of perspective projection. The resulting structure and motion estimation algorithm will be useful in situations where the viewer has no knowledge or control of the object's motion (see [14] for a complementary approach, where a viewer plans its motion for building a global model of an object). Another application is model construction for object recognition [10, 13]. Due to self-occlusion, a simple motion such as rotation on a turntable may not reveal all the interesting parts of a complex object. It is desirable to be able to construct the model completely by observing the object even as it undergoes an unknown arbitrary motion.

1.1 Problem Statement and Approach

Consider a smooth curved object in a scene observed by a viewer. The viewing cone grazes the object surface along an occluding contour. The image curve where the viewing cone intersects the image plane is defined as the object's silhouette. At each point on the occluding contour the surface normal is orthogonal to the viewing direction. This property makes the 3D occluding contour and the 2D silhouette viewpoint-dependent. Thus, if the object is moving relative to the viewer, the silhouettes appearing in successive images will be projections of different 3D contours on the surface. This is in contrast with an image sequence containing only viewpoint-independent features.

In this paper, we address the problem of estimating structure and motion of a smooth curved object from its silhouette observed over time by a trinocular stereo rig under perspective projection. Figure 1 shows an overview of our approach. To relate the silhouettes observed at time instants t and t+1 (or equivalently, to estimate the relative motion between time t and t+1), we construct a model for the local structure along the silhouette at each time instant. We use trinocular imagery for our analysis since three images taken from known relative positions are sufficient to recover the local structure (up to second order) along the silhouette [4, 15, 16]. If the motion of the object is small, the structure estimated at time t will also be valid at time t+1. We use the structure computed at time t to estimate the motion between the frames at time t and t+1. The local models are then integrated into a global surface description using the estimated motion between successive frames.

Note that for structure estimation from trinocular images, or for motion estimation from a pair of images taken by the central camera at different times, we need to establish a relationship or correspondence between the points on the silhouettes. Since the 3D occluding contour changes with motion between the viewer and the object, successive silhouettes are projections of different 3D contours. Therefore there is no true 3D point-to-point correspondence between a pair of silhouettes. For the triplet of images observed at a given time by the three cameras, the epipolar geometry is known, and we establish correspondences between distinct 3D points lying on a common epipolar curve and then estimate local structure parameters as explained in Sect. 2.

[Figure 1 shows the trinocular images I1(t), I2(t), I3(t) and I1(t+1), I2(t+1), I3(t+1) feeding structure estimation; the resulting contours at t and t+1 feed motion estimation, which builds the global structure.]
Figure 1: Block diagram of the approach.

However, for the images taken by a camera at successive time instants, the epipolar geometry is unknown. Unless we know the correspondences we cannot estimate the motion, but we need to know the motion, or equivalently the epipolar geometry, to establish the correspondences for all the points on the silhouette. To bootstrap the process of motion estimation, we use some detectable and trackable silhouette features (inflections and frontier points) to obtain an initial estimate of the unknown motion. The initial estimate of the motion enables us to get an estimate of the epipolar geometry, thus allowing identification of the correspondences for the rest of the silhouette. We then iteratively refine the initial motion estimate, updating the correspondences for all the silhouette points at each iteration.

In Sect. 2 the algorithm for structure estimation using trinocular imagery is described. The algorithm used for motion estimation from dynamic silhouettes under perspective projection is discussed in Sect. 3. To demonstrate the feasibility of our method, we present experimental results on a sequence of real images. We conclude with comments in Sect. 4.

2 Structure Estimation Using Trinocular Imagery

2.1 Modeling the Local Structure

The local structure (up to second order) at a surface point P is defined by the 3D location of P in the global coordinate frame (in our case, the coordinate frame of the central camera), the surface normal at P, the two principal directions in the tangent plane, and the principal curvatures. At each point P, we define a local coordinate frame (Xl, Yl, Zl) whose origin coincides with P, the Xl-axis being aligned with the outward normal, and the Yl- and Zl-axes being aligned with the principal directions. The local surface up to second order is a paraboloid [5], given by the equation

    Xl = (1/2) κ1 Yl² + (1/2) κ2 Zl²,    (1)

where κ1 and κ2 are the principal curvatures at P. The signs of κ1 and κ2 define the point type: if κ1 and κ2 have the same sign (resp. opposite signs), P is an elliptic (resp. hyperbolic) point. Elliptic points project onto convex silhouette points, while hyperbolic points project onto concave silhouette points. If either κ1 or κ2 is zero, P is a parabolic point and the silhouette has an inflection [3, 11, 16]. Equation (1) can be rewritten in matrix form as

    Ql^T Ml Ql = 0,    (2)

where Ml is a symmetric 4×4 matrix and Ql = [Xl, Yl, Zl, 1]^T is the vector of homogeneous coordinates of a point Q on the paraboloid at P. (Notation: all boldface letters denote coordinate vectors or arrays; we use capital letters to denote 3D points in the scene and small letters to denote their projections in the image, e.g., p is the projection of a 3D point P.) The rigid transformation parameters defining the local coordinate frame at P with respect to the world coordinate frame (in our case, the coordinate frame of the central camera), together with the principal curvatures κ1 and κ2, completely determine the local structure at P.

Let (X, Y, Z) be the camera-centered coordinate frame, where the Z-axis coincides with the optical axis and the XY-plane is the image plane (see Fig. 2). Let the point P (which is the origin of the local frame) be at (X0, Y0, Z0) in the camera-centered frame. We denote by ζ the angle made by the Xl-axis with the normal to the silhouette, by θ the angle between the X-axis and the normal to the silhouette, and by γ the angle between the viewing direction and one of the principal directions (say the Zl-axis). The six-tuple (ζ, θ, γ, X0, Y0, Z0) defines the local frame with respect to the camera-centered frame. To completely describe the object surface locally, we need to specify ζ, θ, γ, X0, Y0, Z0, κ1 and κ2 for each point on the silhouette.

A point Q in the local coordinate frame defined at P can be represented in the coordinate frame of the central camera by

    Q = T0 R0 Ql,    (3)

where T0 is the 4×4 matrix for a translation by (X0, Y0, Z0), R0 is the 4×4 matrix for the rotation between the two coordinate frames, and Ql = [Xl, Yl, Zl, 1]^T and Q = [X, Y, Z, 1]^T are the homogeneous coordinate vectors of Q in the local and camera-centered frames, respectively. Eliminating Ql between (2) and (3), we obtain the equation of the paraboloid in the camera-centered frame as

    Σ = Q^T M Q = 0,    (4)

where

    M = T0^{-T} R0 Ml R0^{-1} T0^{-1}.    (5)
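To make the change of frames in equations (2)-(5) concrete, here is a small illustrative sketch (our own code, not from the paper); it assumes R0 is supplied as a 3×3 rotation taking local-frame axes to camera-frame axes and simply assembles Ml and M:

```python
import numpy as np

def paraboloid_matrix_local(k1, k2):
    """Symmetric 4x4 matrix Ml with Ql^T Ml Ql = 0 encoding equation (1):
    Xl = 0.5*k1*Yl^2 + 0.5*k2*Zl^2."""
    Ml = np.zeros((4, 4))
    Ml[1, 1] = 0.5 * k1          # (1/2) k1 Yl^2 term
    Ml[2, 2] = 0.5 * k2          # (1/2) k2 Zl^2 term
    Ml[0, 3] = Ml[3, 0] = -0.5   # the symmetric pair contributes -Xl
    return Ml

def paraboloid_matrix_camera(Ml, R0, X0, Y0, Z0):
    """Equation (5): M = T0^{-T} R0 Ml R0^{-1} T0^{-1}, with R0 (3x3) embedded
    in a 4x4 homogeneous matrix and T0 the translation by (X0, Y0, Z0)."""
    T0 = np.eye(4); T0[:3, 3] = [X0, Y0, Z0]
    R = np.eye(4); R[:3, :3] = R0
    T0_inv = np.linalg.inv(T0)
    return T0_inv.T @ R @ Ml @ np.linalg.inv(R) @ T0_inv
```

A point Q then lies on the local paraboloid exactly when Q^T M Q = 0, with Q in homogeneous camera coordinates.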



Figure 2: Projection geometry.

By definition, the surface normal at the contour point P is orthogonal to the viewing direction. This condition can be expressed as

    NP · VP = 0,    (6)

where NP = [∂Σ/∂X, ∂Σ/∂Y, ∂Σ/∂Z]^T and VP = [X0, Y0, Z0]^T (see Fig. 2). This is a linear condition in (X, Y, Z), implying that the contour of the paraboloid is a planar curve. Eliminating Z between (4) and (6) gives the equation of the silhouette in the image coordinates, which is a conic.

The paraboloid model described in this section is used to model the structure at each point along the silhouette. From a single projection of the paraboloid, we can obtain five constraints on the eight structure parameters (ζ, θ, γ, X0, Y0, Z0, κ1 and κ2):

- The surface normal NP at P is orthogonal to the viewing direction VP. Also, the tangent tp to the silhouette at p lies in the tangent plane at P, implying that NP is orthogonal to tp. Thus, knowing tp and VP we can compute the surface normal NP completely:

      NP = tp × VP.    (7)

  This gives the angles ζ and θ directly.

- The image coordinates of p = (x0, y0) give a constraint on X0, Y0 and Z0:

      x0 = X0 f / Z0,    (8)
      y0 = Y0 f / Z0,    (9)

  where f is the focal length.

- The curvature of the silhouette at p gives a constraint on κ1, κ2 and γ.

To complete the local structure model, we need to estimate the depth Z0 and obtain two more constraints on κ1, κ2 and γ. As explained in Sect. 2.3, we obtain the required constraints using the matched points from the other two images of the trinocular imagery. The next section presents the method used to find correspondences.
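As a small illustration of the first two constraints (not the paper's code), the sketch below computes the surface normal from equation (7), given the silhouette tangent lifted to a 3D direction tp and the viewing direction VP, and back-projects an image point using (8)-(9) once a depth Z0 is assumed; f denotes the focal length.

```python
import numpy as np

def surface_normal(t_p, V_P):
    """Equation (7): N_P = t_p x V_P (normalized). The angles zeta and theta
    of the local frame can be read off from this normal."""
    n = np.cross(t_p, V_P)
    return n / np.linalg.norm(n)

def backproject(x0, y0, f, Z0):
    """Equations (8)-(9): x0 = X0*f/Z0 and y0 = Y0*f/Z0, so a depth Z0 along
    the view line determines X0 and Y0."""
    return np.array([x0 * Z0 / f, y0 * Z0 / f, Z0])
```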

2.2 Finding Correspondences

When the relative motion between the object and the camera is known for a pair of images (either within the trinocular stereo rig or in the sequence taken by a camera), we can define an epipolar plane for each point in each image. Let I1 and I2 be two images taken by cameras with optical centers C1 and C2, respectively. The epipolar plane for these two cameras and a point p1 (which is the projection of a contour point P1) in image I1 is defined as the plane passing through p1, C1 and C2. For two images I1(t) and I1(t+1) taken by a camera at times t and t+1, the epipolar plane for a point p1 in I1(t) is defined as the plane passing through p1 and containing the instantaneous translational velocity. In either case, the epipolar lines are defined as the intersections of the epipolar plane with the two image planes. As in conventional stereo, we can match points lying on corresponding epipolar lines; the difference here is that the two matched image points are not projections of the same 3D point (see Fig. 3). Matching based on epipolar geometry was employed in previous approaches as well [4, 9, 15, 16]. Using the epipolar plane for camera C2 we can find a match point p2 in image I2. Similarly, if we have a third camera C3, the corresponding epipolar plane can be used to find a match point p3 in image I3. Thus we have a triple of points matched in the three images using the epipolar geometry.

Consider for a moment a continuous relative motion between a camera and an object. At each point on the silhouette, the points matched using the instantaneous epipolar geometry trace a 3D curve on the object such that at each point the curve is tangent to the view line. This curve is called the epipolar curve [4, 15, 16]. In the trinocular imagery case, the points matched using epipolar geometry lie on the corresponding epipolar curve (see Fig. 3). When the surface normal at a point is not parallel to the epipolar plane normal, the epipolar curve is a regular curve; point P1 in Fig. 3 is one such point. But at points such as Q in Fig. 3, where the normal to the surface is parallel to the epipolar plane normal, the epipolar curve degenerates into a point. In fact, in such a case the matched points q1 and q2 are projections of the same 3D point. Such points are the frontier points for the corresponding camera motion [7]. This fact was used under orthographic projection to estimate the translation given an estimate of the rotation parameters [9]. We also make use of this fact in estimating the motion parameters for the perspective projection case, as explained in Sect. 3.
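To make the matching geometry concrete, here is an illustrative sketch (not the paper's implementation) that computes the epipolar line in the second image for a pixel p1 of the first image. It uses the essential-matrix form of the same epipolar-plane construction, assumes a simple pinhole model with a common known focal length f, and takes R12, t12 as the known relative pose mapping camera-1 coordinates into camera-2 coordinates:

```python
import numpy as np

def skew(v):
    """Cross-product matrix: skew(v) @ u == np.cross(v, u)."""
    return np.array([[0.0, -v[2], v[1]],
                     [v[2], 0.0, -v[0]],
                     [-v[1], v[0], 0.0]])

def epipolar_line(p1, f, R12, t12):
    """Homogeneous epipolar line l2 in image 2 (normalized coordinates) for
    pixel p1 = (x1, y1) of image 1: candidate matches p2 = (x2, y2) satisfy
    [x2/f, y2/f, 1] @ l2 == 0."""
    E = skew(t12) @ R12                             # essential matrix of the pair
    p1_hat = np.array([p1[0] / f, p1[1] / f, 1.0])  # normalized coordinates of p1
    return E @ p1_hat
```

Intersecting this line with the silhouette of the second image gives the epipolar match; within the trinocular rig the pose (R12, t12) is known from calibration, while between frames of the central camera it comes from the current motion estimate.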


Figure 3: Epipolar geometry.

2.3 Structure Estimation

Previous approaches for estimation of structure under known camera motion [4, 15, 16] have used the epipolar parameterization of the surface. In [4] a differential formulation is presented to construct the structure along the silhouette. In [16] the radial plane, i.e., the plane containing the view line and the surface normal, is used to estimate one of the surface curvatures: the view lines are projected onto the radial plane, and a circle tangent to the three projected view lines is estimated and taken to be the osculating circle of the radial curve (the curve of intersection of the radial plane with the surface). Like Szeliski and Weiss [15], we choose the epipolar plane instead of the radial plane for the computation of one of the surface curvatures. Since the epipolar curve is tangent to the view line at every point, we can estimate the osculating circle of the epipolar curve by finding the circle that is tangent to the three view lines.

If the three optical centers of the trinocular rig are collinear, both the left and right images share a common epipolar plane with the central image. In general the optical centers will not be collinear, making the three view lines non-coplanar; in such a case the three view lines are projected onto a common plane. As shown in Fig. 4, there are in general four circles tangent to a set of three lines in a plane [16]. But if we know on which side of each view line the object lies, the circle can be identified uniquely (the solid circle in Fig. 4).

Figure 4: Estimating the osculating circle.

The point where this circle touches the central view line is an estimate of the 3D surface point; this gives the depth of the point along the central view line. The curvature ke of the fitted circle is an estimate of the curvature of the epipolar curve. We can compute the normal curvature kn of the surface along the view line (which is the tangent direction of the epipolar curve) using ke and the angle α between the surface normal and the epipolar plane normal:

    kn = ke cos α.    (10)

Knowing the normal curvature gives us a constraint on the surface curvatures κ1, κ2 and the angle γ, since the normal curvature along a tangent direction making an angle γ with one of the principal directions is related to κ1 and κ2 by the Euler formula [5]:

    kn = κ1 cos²γ + κ2 sin²γ.    (11)

We obtain one more constraint on the surface structure from the fact that the tangent to the 3D contour is along a direction conjugate to the view line [12] (two directions U and V are said to be conjugate if the change in the surface normal along direction U is orthogonal to direction V, and vice versa). Once the depth of the points along the silhouette is computed, we can estimate the direction of the tangent to the 3D contour; this gives a constraint on κ1, κ2 and γ. This constraint, together with the constraint given by the curvature of the silhouette in the central image and the constraint given by equations (10) and (11), gives three equations which are solved to obtain the values of the structural parameters κ1, κ2 and γ.
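A minimal sketch of this construction (illustrative code, assuming the three view lines have already been projected onto the common epipolar plane and are given in 2D plane coordinates by a point and a unit normal oriented toward the object — that orientation is what singles out the correct one of the four tangent circles):

```python
import numpy as np

def circle_tangent_to_lines(points, normals):
    """Circle tangent to three coplanar lines. Line i passes through points[i]
    and has unit normal normals[i] pointing toward the object, so the center c
    and radius r satisfy  normals[i] . (c - points[i]) = r  for each i."""
    A = np.zeros((3, 3))
    b = np.zeros(3)
    for i, (p, n) in enumerate(zip(points, normals)):
        A[i, :2] = n        # n . c ...
        A[i, 2] = -1.0      # ... - r
        b[i] = n @ p        # = n . p
    cx, cy, r = np.linalg.solve(A, b)
    return np.array([cx, cy]), r

def normal_curvature(r_e, alpha):
    """Equation (10): kn = ke * cos(alpha), with ke = 1/r_e the curvature of
    the fitted circle and alpha the angle between the surface normal and the
    epipolar plane normal."""
    return np.cos(alpha) / r_e
```

The estimated 3D surface point is the foot of the center on the central view line (center − r·n for that line's normal n), which gives the depth; ke = 1/r then feeds equations (10) and (11).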


2.4 Experimental Results


We have applied the method described in this section to a sequence of real images. The image sequence was taken by a calibrated camera system observing an object placed on a turntable.


Figure 5: (a-c) Sample triple of images of the trinocular imagery; (d-f) corresponding detected contours; (g) recovered 3D contour; (h) recovered Gaussian curvature along the contour.



Figure 6: (a-b) Depth around the silhouette before and after interpolation, respectively; (c-d) one of the surface curvatures along the silhouette before and after interpolation, respectively.

Figure 5(a-c) shows a sample triple of images of the trinocular sequence. Figure 5(d-f) shows the detected contours in the corresponding images. The 3D contour, reconstructed using the algorithm of Sect. 2.3, is displayed in Fig. 5(g), and the recovered Gaussian curvature along the contour is shown in Fig. 5(h). The gaps in the reconstructed contour arise because, as we approach the frontier points, i.e. as the angle between the surface normal and the epipolar plane normal becomes smaller, the epipolar-match-based reconstruction becomes less reliable. A simple method for finding the epipolar match of a silhouette point pc in the central image is to search for the edgel closest to the corresponding epipolar line. To improve the accuracy of the reconstruction, we use interpolation to find the intersection of the silhouette with the epipolar line. Figure 6 shows the improvement in the depth and curvature values reconstructed using this improved method.
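The interpolation step can be sketched as follows (illustrative code): the silhouette is taken as an ordered array of edgel coordinates, the epipolar line is given in homogeneous form (a, b, c) with a·x + b·y + c = 0, and the segment that crosses the line is interpolated linearly.

```python
import numpy as np

def silhouette_epipolar_intersection(silhouette, line):
    """Sub-pixel intersection of an ordered silhouette polyline (N x 2 array)
    with an epipolar line (a, b, c); returns the interpolated image point, or
    None if the silhouette does not cross the line."""
    a, b, c = line
    d = silhouette @ np.array([a, b]) + c            # signed distances (up to scale)
    for i in range(len(silhouette) - 1):
        if d[i] == 0.0:
            return silhouette[i]
        if d[i] * d[i + 1] < 0.0:                    # sign change: segment crosses the line
            t = d[i] / (d[i] - d[i + 1])             # linear interpolation parameter
            return (1 - t) * silhouette[i] + t * silhouette[i + 1]
    return None
```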


Figure 7: (a-b) Contours with stereo angles of 10 and 20 degrees, respectively; (c) both contours overlaid; (d-e) surface curvature with stereo angles of 10 and 20 degrees, respectively; (f) both curvature plots overlaid.


We also studied the effect of changing the stereo angle of the trinocular rig on the reconstructed contour and curvatures. It is expected that the reconstruction method (which is based on detecting a circle tangent to three lines) will introduce some smoothing of the recovered surface; the larger the stereo angle of the trinocular rig, the greater the smoothing effect. If the surface shape changes more rapidly than the viewing angle between the stereo cameras, we notice an appreciable change in the recovered contour as the stereo angle changes. In Fig. 7, the change in the recovered depth is appreciable at the lower right corner of the contour. To avoid oversmoothing the surface, we choose a smaller stereo angle.

3 Motion Estimation

Motion estimation under orthographic projection was presented in [9]. In this section, we consider the analogous problem and extend the technique to the more general case of perspective projection. We obtain 3D contours on the surface in the successive frames I1(t) and I1(t+1) of the central camera using trinocular imagery. These contours are related by the unknown motion. Let us assume that between consecutive time instants t and t+1, the object undergoes a rotation by an unknown angle θ about an unknown axis Ω = [ωx, ωy, ωz]^T passing through the origin, followed by an unknown translation [tx, ty, tz]^T. Let T be the matrix corresponding to the unknown translation and let R be the matrix corresponding to the unknown rotation.

In the case of orthographic projection, the direction of the epipolar lines is determined by the rotational parameters alone, whereas in the case of perspective projection it is determined by both the rotational and the translational parameters. Hence, under perspective projection we cannot separate the estimation of the rotational and translational parameters as is possible under orthographic projection, and the motion estimation algorithm has to be modified accordingly.

The motion estimation is done in two steps. In the first step, described in Sect. 3.1, we use the change in the surface normal at the inflections to estimate the rotation parameters, and we obtain an estimate of the translation parameters using the frontier points. In the second step, the estimate of the motion parameters is refined using the rest of the silhouette. Both steps perform non-linear least-squares minimization. The next two sections describe these two steps in detail. Experimental results are presented in Sect. 3.3.

3.1 Obtaining the Initial Estimate

3.1.1 Initial Estimate of the Rotational Parameters

Inflections are projections of parabolic points on the object surface [3, 11, 16]. Parabolic points are defined as the points where the Gaussian curvature is zero. On generic surfaces these points lie on continuous curves called parabolic curves. With relative motion between the object and the viewer, the inflections in successive images will be projections of neighbouring points lying on the parabolic curve. A parabolic point has a single asymptotic direction (which is also one of the principal directions) and the surface is locally cylindrical. Let us begin with the following lemma [9, 10]; we have included a proof of the lemma in the appendix for completeness.

Lemma 1. At a parabolic point, if we move along the surface in any direction (hence, in particular, along the direction tangent to the parabolic curve), the change in the surface normal is perpendicular to the asymptotic direction.

Consider the projection p of a parabolic point P in the central image I1(t); p will be an inflection point of the silhouette. If we track the inflection p into image I1(t+1) of the central camera, we will be moving along the parabolic curve at P. From Lemma 1, the change dN in the surface normal N should be orthogonal to the asymptotic direction, say A, at P. We can compute A from the local structure estimated at time t using trinocular imagery. Let p′ be the tracked inflection in I1(t+1), which is the projection of P′, a neighbouring parabolic point on the surface. Let the unit surface normal at P′ be N′. From Lemma 1,

    A · dN = A · (R⁻¹N′ − N) = A · R⁻¹N′ − A · N = 0.    (12)

But the asymptotic direction A lies in the tangent plane, implying A · N = 0. Equation (12) reduces to

    A · R⁻¹N′ = 0.    (13)

We parameterize the rotation by three angles: the angle φ made by Ω with the Z-axis, the angle ψ between its projection in the XY-plane and the X-axis, and the rotation angle θ. To recover the rotation parameters, we need to track at least three inflections. With n ≥ 3 inflections present in images I1(t) and I1(t+1), we use least-squares minimization with the objective function given by

    Σ_{i=1..n} [ Ai · (R⁻¹N′i) ]².    (14)

The minimization is done over the three parameters φ, ψ and θ. Note that we first have to find a match between the sets of inflections on the silhouettes in the two images. Since inflections are discrete points on the silhouette, they are in general few in number, and all possible matches between the two sets of inflections can be considered. Moreover, if the motion is small, the ordering of the inflections along the silhouette is in general maintained; this condition further reduces the number of possible matches. We match the inflections such that (1) the ordering of the matched inflections along the silhouette is maintained, and (2) the angle between the normals at the matched inflections is small.
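This first minimization step can be sketched as follows (illustrative code using SciPy, not the paper's implementation; the (φ, ψ, θ) parameterization matches the description above, and the inflections are assumed to be already matched):

```python
import numpy as np
from scipy.optimize import least_squares
from scipy.spatial.transform import Rotation

def rotation_from_angles(phi, psi, theta):
    """Rotation by angle theta about the unit axis whose angle with the Z-axis
    is phi and whose XY-projection makes angle psi with the X-axis."""
    axis = np.array([np.sin(phi) * np.cos(psi),
                     np.sin(phi) * np.sin(psi),
                     np.cos(phi)])
    return Rotation.from_rotvec(theta * axis).as_matrix()

def estimate_rotation(A, N_next, x0=(0.1, 0.1, 0.1)):
    """Minimize objective (14), sum_i [A_i . (R^-1 N'_i)]^2, over (phi, psi, theta).
    A[i] is the asymptotic direction at the i-th tracked inflection at time t,
    N_next[i] the unit surface normal at its match at time t+1 (n >= 3)."""
    def residuals(params):
        R = rotation_from_angles(*params)
        return np.einsum('ij,ij->i', A, N_next @ R)   # each entry is A_i . (R^T N'_i)
    return least_squares(residuals, x0).x
```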

3.1.2 Initial Estimate of the Translational Parameters

Under orthographic projection, the rotation parameters alone determine the direction of the epipolar plane normal. With an estimate of the rotation parameters, the epipolar plane normal can be estimated and, in turn, frontier points can be detected. Once the frontier points are detected and matched, estimation of the translational parameters becomes a linear problem [9]. This is not the case under perspective projection, where the rotational parameters alone do not determine the epipolar plane normal: the frontier points cannot be detected unless the translation parameters are estimated first.

Note that the goal of this step is only to obtain an initial estimate of the motion parameters. Our experience shows that the epipolar plane normal direction computed with the orthographic projection approximation is close to the corresponding epipolar plane normal direction for the perspective case. Given an initial estimate of the rotation parameters, we therefore estimate the epipolar plane normal with the orthographic projection approximation and use it to detect the frontier points. Needless to say, the detected frontier points are only an approximation of the true frontier points of the perspective projection, but this approximation gives a good initial estimate of the translational parameters, which is refined in the second step.

Recall that matched frontier points are projections of the same 3D point. Therefore, if a frontier point f in image I1(t) matches the frontier point f′ in image I1(t+1), with F and F′ the corresponding 3D points, we have

    F′ = R F + T.    (15)

This implies that for a given rotation, estimating the translation becomes a linear problem once the frontier points are detected and matched. This is taken as the initial estimate of the translation parameters for the given estimate of the rotation parameters.
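Given R and the matched 3D frontier points, equation (15) is linear in T; a one-line least-squares sketch (illustrative, with F and F_prime the stacked frontier points at times t and t+1):

```python
import numpy as np

def estimate_translation(R, F, F_prime):
    """Initial translation from matched frontier points, equation (15):
    F'_j = R F_j + T, so the least-squares T is the mean of F'_j - R F_j.
    F and F_prime are (m, 3) arrays of matched 3D frontier points."""
    return np.mean(F_prime - F @ R.T, axis=0)
```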

3.2 Refining the Motion Estimate

In the second step, we use the structure along the entire silhouette to refine the estimate of the motion parameters obtained in the first step. With an estimate of R and T, we can determine the location of the camera in frame I1(t+1) relative to the local paraboloids in frame I1(t). We can also determine the epipolar plane for each point in image I1(t). Once the structure parameters of the local paraboloids and the epipolar plane are known, we can estimate the curvature of the epipolar curve at each point.

Consider a point pi on the silhouette in image I1(t) (Fig. 8). We can predict the match point ᵖp′i in frame I1(t+1) as the projection of a point ᵖP′i which satisfies the following two conditions:
1. ᵖP′i lies on the estimated osculating circle of the epipolar curve, and
2. the surface normal at ᵖP′i is orthogonal to the viewing direction.
But we can also detect the epipolar match point ᵈp′i in image I1(t+1) as the intersection point of the silhouette at t+1 and the estimated epipolar line corresponding to pi.

In the refinement step, we iteratively minimize the sum of the squared distances between ᵖp′i and ᵈp′i over all silhouette points pi. The minimization is over the six-dimensional space of R and T parameters. In our algorithm, we iterate on the rotation parameters and, at each iteration step, perform a non-linear least-squares minimization to determine the best translation parameters, i.e. those giving the minimum sum of distances.


Figure 8: Predicted and detected epipolar match.

The algorithm for motion estimation can be written as follows.

1. Obtain an initial estimate (φ0, ψ0, θ0) of the rotation parameters using the tracked inflections, as explained in Sect. 3.1.1. Set φ = φ0, ψ = ψ0 and θ = θ0.

2. For the given rotation parameters φ, ψ, θ:
   (a) Detect and match the frontier points f and f′ in the two central frames with the orthographic projection approximation. Using the matched frontier points, compute the initial estimate of the translation parameters as explained in Sect. 3.1.2.
   (b) Knowing the local structure at each point on the silhouette, refine the estimate of the translation parameters to minimize the sum S of squared distances between the predicted and the detected epipolar match points over all silhouette points:

       S = Σ_{i=1..n} dist(ᵖp′i, ᵈp′i)².

3. Minimize S by updating the values of the rotation parameters φ, ψ and θ, and repeating Step 2.
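The nested minimization of Steps 2-3 can be sketched schematically as follows (illustrative code; prediction_error and initial_translation are hypothetical callables standing, respectively, for the vector of 2D distances between predicted and detected epipolar matches and for the frontier-point translation estimate of Sect. 3.1.2 — neither is spelled out here):

```python
import numpy as np
from scipy.optimize import least_squares, minimize

def refine_motion(rot0, data, prediction_error, initial_translation):
    """Outer search over the rotation parameters (phi, psi, theta); for each
    candidate rotation the translation is re-estimated by an inner
    least-squares over the per-point match distances."""
    def best_S_for_rotation(rot):
        T0 = initial_translation(rot, data)                                 # step 2(a)
        fit = least_squares(lambda T: prediction_error(rot, T, data), T0)   # step 2(b)
        return fit.cost                                                     # = S/2 for this rotation
    result = minimize(best_S_for_rotation, x0=np.asarray(rot0),
                      method='Nelder-Mead')                                 # step 3
    return result.x
```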


3.3 Implementation and Results


We can potentially consider the entire silhouette for the computation of the sum S. However, since the structure estimation using epipolar matches becomes less reliable as we approach the frontier points, we exclude the points close to the frontier points from the computation of S. Since we have the depth values along the detected contour and can predict the 3D location of the predicted match point, we could use the 3D distance between the detected and predicted match points. But this might propagate the errors in structure estimation into the estimation of motion, as was confirmed through experiments. Therefore, the 2D distance was chosen as the error measure. Note that if R and T represent the relative motion from time t to t+1, then R^T and −R^T T represent the relative motion from t+1 to t. We can therefore use the structural parameters estimated at time t+1 to predict the silhouette at time t, making use of all the structural information available; this improved the performance of the motion estimation algorithm.

We applied the method to the sequence considered in Sect. 2.4, where an object placed on a turntable was observed by a fixed camera system. The three optical centers were not collinear, and the axis of rotation was not parallel to the image plane of the central camera, making the motion under consideration a general one. The stereo angle was 10 degrees and the angle of rotation between successive time instants was 5 degrees. Table 1 lists the recovered motion parameters after each step on a sample set of frames. For each step, we also list the angle between the true and estimated rotation axes and the error in the rotation angle. All angles are in degrees.

Frame | Step     | Rotation axis (ωx, ωy, ωz) | Error in axis | Rotation angle | Error in angle | Translation (tx, ty, tz) in mm | Sum S
True  |          | (0.008, 0.94, 0.33)        |               | 5.0            |                | (125.6, 0.64, -5.03)           |
1     | 1st Step | (0.0339, 0.942, 0.335)     | 1.5           | 5.05           | 0.0493         | (127., -2.94, -4.98)           | 5.09
1     | Final    | (-0.0197, 0.944, 0.33)     | 1.61          | 5.17           | 0.172          | (128., 4.69, -5.35)            | 4.55
2     | 1st Step | (0.0677, 0.953, 0.294)     | 4.            | 5.59           | 0.586          | (142., -8.4, -6.15)            | 25.7
2     | Final    | (0.0312, 0.947, 0.32)      | 1.42          | 5.25           | 0.248          | (132., -2.57, -5.41)           | 10.
3     | 1st Step | (-0.0424, 0.942, 0.334)    | 2.92          | 5.003          | 0.003          | (125., 7.7, -5.08)             | 19.3
3     | Final    | (-0.11, 0.939, 0.324)      | 6.82          | 5.002          | 0.002          | (125., 17., -5.49)             | 14.2
4     | 1st Step | (-0.0557, 0.948, 0.314)    | 3.79          | 5.04           | 0.0409         | (127., 9.29, -3.23)            | 132.
4     | Final    | (0.00372, 0.944, 0.33)     | 0.268         | 4.87           | 0.132          | (122., 1.19, -4.88)            | 8.74
5     | 1st Step | (0.163, 0.933, 0.322)      | 8.9           | 4.72           | 0.277          | (117., -19.7, -3.95)           | 22.4
5     | Final    | (0.0475, 0.942, 0.332)     | 2.25          | 4.98           | 0.0185         | (125., -4.69, -4.77)           | 6.37
6     | 1st Step | (0.0469, 0.909, 0.415)     | 5.72          | 4.68           | 0.317          | (113., -4.19, -3.4)            | 239.
6     | Final    | (-0.0301, 0.946, 0.324)    | 2.24          | 5.17           | 0.165          | (130., 6.13, -5.15)            | 9.06
7     | 1st Step | (-0.0855, 0.956, 0.282)    | 6.07          | 5.35           | 0.35           | (136., 14.4, -6.2)             | 95.8
7     | Final    | (0.0105, 0.949, 0.315)     | 0.915         | 5.42           | 0.424          | (137., 0.476, -5.87)           | 13.3

Table 1: Result of the motion estimation. Errors in axis and angle are in degrees.

Figure 9: (a-b) Two views of the global structure after 8 frames.

Figure 10: (a-b) Two views of the global structure after 18 frames.


Although the motion was constant through the sequence, this information was not used anywhere in the algorithm. The first step of the minimization was found to be stable with respect to the initial guess; its result is used as the initial guess for the second step. Figure 9 shows two different views of the global structure estimated after 8 frames: the detected 3D occluding contours are displayed in a common coordinate frame after derotating them using the estimated motion. Figure 10 shows two views of the global structure after extending it over 18 frames.

4 Conclusions

We have used the relationship between certain silhouette features (inflections) and a model of the local surface structure to estimate both the motion and the global surface structure from perspective views of the silhouettes of a moving object. In estimating the motion we also used another set of points on the silhouette: the frontier points. Although estimating motion from silhouettes is more difficult than using viewpoint-independent features, it has the advantage that even a single silhouette provides considerable information about the surface: the surface normal, the sign of the Gaussian curvature, and a constraint on the principal curvatures at each surface point. We have extended the method presented in [9] for orthographic projection to the more practical case of perspective projection. The results obtained on real images are encouraging and demonstrate the validity of the method.

Currently the motion estimation is done purely between two successive frames. To make the motion estimation more robust, a motion model over multiple frames could be used. It can be shown that three images taken from known relative positions are sufficient to differentiate between viewpoint-dependent and viewpoint-independent edges in the scene [16]. Thus, using trinocular imagery we can detect any viewpoint-independent edges present and utilize them as well in motion estimation (see [2, 6] for motion estimation based on viewpoint-independent edges). We have assumed that three inflections are visible so that the initial estimate of the rotation parameters can be computed; when this is not true, other features such as bitangent points could be utilized. We have also assumed that the topology of the silhouette does not change between frames; our current method should be extended to handle situations where the topology changes.

Appendix: Proof of Lemma 1

Let us parameterize the surface at a parabolic point P by two parameters u and v, such that the u-axis is along the asymptotic direction A at P and the v-axis is along a direction perpendicular to A in the tangent plane. The u- and v-axes span the tangent plane. The second fundamental form in this basis is given by

    II = [ 0  0 ]
         [ 0  k ].    (16)

The change in the normal along a direction D = [u, v]^T in the tangent plane is given by

    dN = II D = [0, k v]^T.    (17)

Clearly the change in the normal along D is along the v-axis, which is orthogonal to the asymptotic direction (the u-axis). Q.E.D.

Acknowledgments: This work was supported in part by the Advanced Research Projects Agency under Grant N00014-93-1-1167 and in part by the National Science Foundation under Grant IRI-9224815. Jean Ponce was supported in part by the Center for Advanced Study of the University of Illinois at Urbana-Champaign. The authors also wish to thank Prof. David Kriegman and B. Vijayakumar for providing the image sequence used in the experiments.

References

[1] E. Arbogast and R. Mohr. Reconstruction de surfaces à partir des contours critiques, dans un contexte multi-images. Technical Report RR LIFIA (94), LIFIA, 1989.
[2] E. Arbogast and R. Mohr. An egomotion algorithm based on the tracking of arbitrary curves. In Proc. European Conf. Comp. Vision, pages 467-475, 1992.
[3] A. Blake and R. Cipolla. Robust estimation of surface curvature from deformation of apparent contours. In Proc. European Conf. Comp. Vision, pages 465-474, 1990.
[4] A. Blake and R. Cipolla. Surface shape from the deformation of apparent contours. Int. J. of Comp. Vision, 9(2):83-112, 1992.
[5] M. P. do Carmo. Differential Geometry of Curves and Surfaces. Prentice-Hall, Englewood Cliffs, NJ, 1976.
[6] O. Faugeras and T. Papadopoulo. Disambiguating stereo matches with spatio-temporal surfaces. In J. Mundy and A. Zisserman, editors, Geometric Invariance in Computer Vision, pages 310-331. MIT Press, Cambridge, Mass., 1992.
[7] P. Giblin, J. Rycroft, and F. Pollick. Moving surfaces. In Mathematics of Surfaces V. Cambridge University Press, 1993. To appear.
[8] P. Giblin and R. Weiss. Reconstruction of surfaces from profiles. In Proc. IEEE Conf. Comp. Vision Patt. Recog., pages 136-144, 1987.
[9] T. Joshi, N. Ahuja, and J. Ponce. Silhouette-based structure and motion estimation of a smooth object. In Proc. ARPA Image Understanding Workshop, 1994.
[10] T. Joshi, J. Ponce, B. Vijayakumar, and D. J. Kriegman. HOT curves for modelling and recognition of smooth curved 3D shapes. In Proc. IEEE Conf. Comp. Vision Patt. Recog., pages 876-880, 1994.
[11] J. J. Koenderink. What does the occluding contour tell us about solid shape? Perception, 13:321-330, 1984.
[12] J. J. Koenderink. Solid Shape. MIT Press, Cambridge, MA, 1990.
[13] D. J. Kriegman, B. Vijayakumar, and J. Ponce. Reconstruction of HOT curves from image sequences. In Proc. IEEE Conf. Comp. Vision Patt. Recog., pages 20-26, 1993.
[14] K. Kutulakos and C. Dyer. Global surface reconstruction by purposive control of observer motion. Technical Report 1141, Computer Science Dept., University of Wisconsin, Madison, 1993.
[15] R. Szeliski and R. Weiss. Robust shape recovery from occluding contours using a linear smoother. In Proc. IEEE Conf. Comp. Vision Patt. Recog., pages 666-667, 1993.
[16] R. Vaillant and O. Faugeras. Using extremal boundaries for 3-D object modeling. IEEE Trans. Patt. Anal. Mach. Intell., 14(2):157-172, 1992.
[17] J. Y. Zheng. Acquiring 3-D models from sequences of contours. IEEE Trans. Patt. Anal. Mach. Intell., 16(2):163-178, 1994.
