Simultaneous Motion, Depth And Slope Estimation With A Camera-Grid
Hanno Scharr and Tobias Schuchert
Institute for Chemistry and Dynamics of the Geosphere, ICG III, Forschungszentrum Jülich GmbH, 52425 Jülich, Germany
[email protected]
Abstract Standard optical flow methods are very successful for displacement estimation in image sequences using a brightness change constraint equation (BCCE). Recently a multi-camera generalization of the brightness constancy model has been proposed in [22] and tested for several special cases. The complete, unrestricted model has not been investigated yet. This paper bridges this gap and gives additional detail for the derivation of the model and an appropriately adapted total least squares (TLS) parameter estimation. We use 1d or 2d camera grids acquiring a 4 or 5 dimensional data set, i.e. spatiotemporal data coming from each camera with one or two additional dimensions for the camera position. From such a data set for each surface patch 3d motion, 3d position, and surface normals are estimated simultaneously using a weighted TLS estimation scheme. The implications coming from the 2d or 3d nullspace of the occurring data-matrix of the system of equations are presented in detail. In experiments we test the performance of this approach.
1 Introduction
Motion estimation as well as disparity estimation are standard tasks for optical flow algorithms, as is well known from early publications (e.g. [13, 18] and many more). The assumption on the local data used to estimate optical flow is usually formulated as a brightness change constraint equation (BCCE). In [22] a multi-camera extension of the brightness constancy constraint model for simultaneous local estimation of 3d motion, 3d position and surface normals has been introduced.
Figure 1: Top left: a camera grid. Top right: an input image acquired at one of the 25 camera positions. Bottom: reconstructed cube with 3d velocity (left) and normals (right).
There the model has been tested for 4 special cases: estimation of depth only using 1d and 2d camera grids, estimation of depth and 3d motion, and estimation of depth and normals. In the current paper we simultaneously estimate all the parameters of the model. An image sequence can be interpreted as data in a 3d x-y-t-space. For optical flow estimation in this space a BCCE defines a linear model for the changes of gray values due to local object motion. The result of the calculation is a displacement vector field. Assuming the data comes from a single fixed camera, displacements come from object motion. Assuming a moving camera looking at a fixed scene, displacements (then usually called disparities) are inversely proportional to local depth. This is known as structure from camera motion (e.g. [16]). The basic idea for the new estimation technique
presented here is to interpret the camera position s = (sx, sy) as additional dimension(s) of the data (see Fig. 1, top left). Hence all image sequences acquired by a 1d camera grid can be combined to sample a 4d volume in x-y-sx-t-space. If a 2d camera grid is used (as e.g. in [19]) we get a 5d volume in x-y-sx-sy-t-space. We define brightness constancy in this space as a vanishing total differential of the intensity data. A camera model is used to project a dynamic surface patch into the image. This yields equations for the x- and y-position of the patch as a function of camera position and time. Plugging the linearized total differentials of x and y into the total differential of the intensity results in the sought-after model. It boils down to an affine optical flow model with 3 dimensions (sx, sy, t) behaving like the time dimension in a usual affine optical flow model. A detailed derivation can be found in Sec. 2. As this linear data model has the same form as usual BCCEs, no special estimation framework has to be established. We use the structure tensor method (e.g. [2]), but other methods can be applied as well (e.g. the ones in [1, 12]). But as the parameters in the model are mixed motion, normals and disparity parameters, they have to be disentangled. Especially when 2d camera grids are used, multiple independent measurements are present in each model equation. These not only have to be disentangled, but also recombined respecting their error estimates (cmp. Sec. 4). Our target application is plant growth estimation. E.g. from [23] we know that plant motion can be measured by optical flow using a frame rate of less than one image per minute. Consequently our scene can be interpreted to be fixed for considerably smaller time intervals. Thus, instead of using a 2d camera grid built from real cameras, which would be expensive and hard to calibrate, we use a single camera on an XY moving stage. Image sequence acquisition with the moving stage can be done within a few seconds. The acquired sequence of multi-position camera images samples a 4d volume in x-y-sx-sy-space at an approximately fixed point in time. A typical result using real data acquired under lab conditions is depicted in Fig. 1. Related work. There is rich literature on optical flow estimation techniques (see the overviews [1, 12]) and many extensions have been developed. There are extensions towards affine motion estimation [8, 9], scene flow [27] and brightness
changes [6, 11]. Robust [3] and variational estimators [4, 13, 21], regularization schemes [15], and special filters [7, 10, 14] have been studied. Recently, depth estimation via optical flow and epipolar geometry has been presented, relaxing the constraint of parallel lines of sight [24]. Most of these extensions are either already in our model or can be applied seamlessly, which is an advantage over other reconstruction methods. There already exist frameworks for simultaneous motion and stereo analysis (e.g. [5, 26, 28]). Also stereo for curved surfaces still finds attention [17]. To the best of our knowledge no available approach can deal with slanted surfaces and simultaneously estimate 3d motion. The current paper is an extension of [22] as stated above. Paper organization. We start by deriving the constraint equation (Sec. 2) followed by a compact revision of the structure tensor TLS estimator (Sec. 3). Sec. 4 then treats the disentangling of the model parameters. An experimental validation of the novel model for 1d and 2d camera grids within this estimation framework is presented in Sec. 5.
2 Derivation of the BCCE
In this section the constraint equation describing local changes in data coming from a camera grid shall be developed, given a 3d object/motion model, a camera model and a brightness change model.
2.1 Surface Patch Model
For each pixel at a given point in time we want to estimate depth, motion and surface normals from a local neighborhood. Thus as object/motion model we use a surface patch valid for the whole neighborhood. This surface element at world coordinate position (X0, Y0, Z0) is modelled as a function of time t and local world coordinates (ΔX, ΔY) with X- and Y-slopes Zx and Zy, moving with velocity (Ux, Uy, Uz):
X = X0 + Ux t + ΔX
Y = Y0 + Uy t + ΔY
Z = Z0 + Uz t + Zx ΔX + Zy ΔY   (1)

The surface normal is then (−Zx, −Zy, 1).
Figure 2: Pinhole projection of a point P and its patch center P0.
2.2 Camera Model
As camera model we use a pinhole camera at world coordinate position (sx, sy, 0), looking into Z-direction (cmp. Fig. 2)

(x, y)^T = (f/Z) (X − sx, Y − sy)^T   (2)

where camera position space is sampled equidistantly using a camera grid (cmp. Fig. 1, top left).
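For illustration only, a minimal numerical sketch of Eqs. 1 and 2, assuming numpy (all function and variable names are ours, not part of the paper): it projects a point of the moving, slanted surface patch into the image of a camera at grid position (sx, sy).

```python
import numpy as np

def project_patch_point(X0, Y0, Z0, Ux, Uy, Uz, Zx, Zy,
                        dX, dY, t, sx, sy, f):
    """Project a surface patch point (Eq. 1) with a pinhole camera at
    grid position (sx, sy, 0) looking into Z-direction (Eq. 2)."""
    # world coordinates of the patch point at time t (Eq. 1)
    X = X0 + Ux * t + dX
    Y = Y0 + Uy * t + dY
    Z = Z0 + Uz * t + Zx * dX + Zy * dY
    # pinhole projection relative to the camera position (Eq. 2)
    x = f * (X - sx) / Z
    y = f * (Y - sy) / Z
    return x, y

# example: motionless, fronto-parallel patch center at 100 mm depth, f = 10 mm
print(project_patch_point(0, 0, 100.0, 0, 0, 0, 0, 0, 1.0, 0.0, 0, 0, 0, 10.0))
```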
2.3 Brightness Change Model
We sample (sx, sy)-space using several cameras, each of which acquires an image sequence. We combine all of these sequences into one 5d data set sampling the continuous intensity function I(x, y, sx, sy, t). We will now formulate the brightness model in this 5d space. We assume that the acquired brightness of a surface element is constant under camera translation, meaning we are looking at a Lambertian surface. In addition this brightness shall be constant in time, i.e. we need temporally constant illumination, which can be relaxed easily using the approach shown in [11]. We see that brightness does not change with one temporal and two spatial coordinates. In other words there is a 3d manifold in our 5d space in which I does not change. Thus the total differential dI vanishes in this manifold. The brightness model therefore is

dI = Ix dx + Iy dy + Isx dsx + Isy dsy + It dt = 0   (3)

We use the notation I∗ = ∂I/∂∗ for derivatives of the image intensities I. Please note that all derivatives and differentials in this equation have physical units. Image coordinates x and y are given in pixels, i.e. the physical length of a sensor element on the camera chip. Camera coordinates sx and sy are given in camera-to-camera displacements. And time t is given in "frames", i.e. the inverse of the temporal acquisition rate. This is important when interpreting unit-free parameters estimated from the model equations.
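For intuition, the five derivatives Ix, Iy, Isx, Isy, It entering Eq. 3 can be approximated from the sampled 5d data set by central differences, e.g. with numpy as sketched below. The paper itself relies on optimized derivative filters (cmp. [7, 14]); the axis order and spacings here are our assumptions.

```python
import numpy as np

# I is sampled on a 5d grid with assumed axis order (x, y, sx, sy, t);
# dx, dy, dsx, dsy, dt are the (physical) sample spacings.
def bcce_derivatives(I, dx=1.0, dy=1.0, dsx=1.0, dsy=1.0, dt=1.0):
    """Central-difference estimates of Ix, Iy, Isx, Isy, It (Eq. 3)."""
    Ix  = np.gradient(I, dx,  axis=0)
    Iy  = np.gradient(I, dy,  axis=1)
    Isx = np.gradient(I, dsx, axis=2)
    Isy = np.gradient(I, dsy, axis=3)
    It  = np.gradient(I, dt,  axis=4)
    return Ix, Iy, Isx, Isy, It
```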
2.4 Combination of the 3 Models
In order to derive a single linear equation from the above models, we first apply the pinhole camera (Eq. 2) to the moving surface element (Eq. 1)

(x, y)^T = (f/Z) (X0 + Ux t + ΔX − sx,  Y0 + Uy t + ΔY − sy)^T   (4)

Consequently we can calculate the differentials dx and dy for a fixed surface location (i.e. for constant ΔX and ΔY)

(dx, dy)^T = (f/Z) ((Ux − Uz x/f) dt − dsx,  (Uy − Uz y/f) dt − dsy)^T   (5)

This equation is nonlinear in Uz as Z = Z0 + Uz t + Zx ΔX + Zy ΔY. We linearize it via the assumption that Uz t is small compared to the overall depth, |Z0| ≫ |Uz t|, and omit Uz t in the denominator. For image-based expressions we use the following notation (cmp. Fig. 2):

local pixel coordinates:  x = x0 + Δx,  y = y0 + Δy,  with pixel to patch center pixel distances
    Δx = f (1 − Zx x/f) ΔX / Z0,   Δy = f (1 − Zy y/f) ΔY / Z0,
3d optical flow and disparity:
    ux = (f/Z0) Ux,   uy = (f/Z0) Uy,   uz = −Uz/Z0,   v = −f/Z0,
slopes:
    zx = Zx / (Z0 (1 − Zx x/f)),   zy = Zy / (Z0 (1 − Zy y/f)).   (6)

The expressions for the slopes become obvious using the following linearization

−f / (Z0 + Zx ΔX + Zy ΔY) ≈ −(f/Z0) (1 − Zx ΔX/Z0 − Zy ΔY/Z0) = v + zx Δx + zy Δy   (7)

assuming |v| ≫ |zx Δx| and |v| ≫ |zy Δy|. Then |v| ≫ |zx Δx| ⇔ 1 ≫ |Zx Δx/(f − Zx x)|. This is usually well fulfilled for not too large slopes Zx and
normal angle lenses, as then Zx x is substantially smaller than the focal length f and Δx ≪ f. For our Sony XC-55 camera the pixel size is 7.4 µm and f = 12.5 mm, i.e. nΔx/f ≈ n · 0.6e−3 for the nth neighbor pixel. An analogous calculation holds for the y-direction. Plugging Eqs. 5–7 into the brightness model Eq. 3 yields the sought-after BCCE

0 = Ix (v dsx + (ux + x0 uz) dt) + Ix Δx (zx dsx + uz dt) + Ix Δy zy dsx + Isx dsx
  + Iy (v dsy + (uy + y0 uz) dt) + Iy Δy (zy dsy + uz dt) + Iy Δx zx dsy + Isy dsy + It dt   (8)
where all nonlinear terms coming from the multiplications with zx Δx and zy Δy are neglected. This is allowed as long as approximation Eq. 7 holds. We decompose (and rearrange) this equation into data vector d and parameter vector p:

d = (Ix, Iy, Ix Δx, Ix Δy, Iy Δy, Iy Δx, Isx, Isy, It)^T
p = (v dsx + (ux + x0 uz) dt,  v dsy + (uy + y0 uz) dt,  zx dsx + uz dt,  zy dsx,  zy dsy + uz dt,  zx dsy,  dsx,  dsy,  dt)^T   (9)

and get the model equation d^T p = 0. This linear BCCE can be used for combined estimation of 3d optical flow, disparity and normal of the imaged surface element. Looking at the data vector we observe that this model is very similar to an affine optical flow model (see e.g. [9]), but here it has 3 dimensions that take the same role time has in an affine model. The different terms and properties of this model have been investigated individually in [22] by looking at several special cases in experiments. Here we examine the full model for a 1d and a 2d camera grid. We get the constraint equation for a 1d grid by setting dsy = 0 in Eq. 8. Written as data and parameter vector we get

d = (Ix, Iy, Ix Δx, Ix Δy, Iy Δy, Isx, It)^T
p = (v dsx + (ux + x0 uz) dt,  (uy + y0 uz) dt,  zx dsx + uz dt,  zy dsx,  uz dt,  dsx,  dt)^T   (10)

All experiments are done using a simple total least squares parameter estimation scheme, which we briefly review in the next section.
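As a small sketch of how the nine-component data vector d of Eq. 9 could be assembled per pixel, assuming the derivative values from above and the local offsets Δx, Δy to the neighborhood center (all names are ours):

```python
import numpy as np

def data_vector(Ix, Iy, Isx, Isy, It, dx_loc, dy_loc):
    """Data vector d of Eq. 9 for one pixel of the 2d-grid model.

    Ix..It are the derivatives at that pixel, dx_loc/dy_loc its offset
    (Δx, Δy) from the neighborhood/patch center."""
    return np.array([Ix, Iy,
                     Ix * dx_loc, Ix * dy_loc,
                     Iy * dy_loc, Iy * dx_loc,
                     Isx, Isy, It])
```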
3 Revision of the Structure Tensor

In this total least squares parameter estimation method a model with linear parameters pj has to be given in the form d^T p = 0 with data-dependent vector d and parameter vector p. In our case we have one equation for each 4d or 5d pixel. In order to get an over-determined system of equations, one uses the assumption that within a small neighborhood Ω of pixels i all equations are approximately solved by the same set of parameters p. We get
d_i^T p = e_i   for all pixels i in Ω   (11)
with errors ei which have to be minimized by the sought-after solution p̃. Using a matrix D composed of the vectors di via Dij = (di)j, Eq. 11 becomes Dp = e. We minimize e in a weighted 2-norm

||e||²_W = p^T D^T W D p =: p^T J p  →  min   (12)

where W is a diagonal matrix containing the weights. In our case Gaussian weights are used (see Sec. 5). The matrix J = D^T W D is the so-called structure tensor. The space of solutions p̃ is spanned by the eigenvector(s) to the smallest eigenvalue(s) of J. We call this space the null-space of J, as the smallest eigenvalue(s) are 0 if the model is fulfilled perfectly, i.e. e = 0. For standard optical flow the null-space is 1d, here it is 2d or 3d, depending on the model used. If there is not enough variation in the data, the so-called aperture problem occurs, meaning the null-space has a higher dimension than indicated by the model. Here, we assume that enough data variation is present. Handling of other cases can be deduced from literature (e.g. [25]).
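A minimal sketch of this weighted TLS step (Eqs. 11 and 12): stack the data vectors of a neighborhood Ω into D, form J = D^T W D with Gaussian weights, and take the eigenvectors belonging to the k smallest eigenvalues as the null-space basis (k = 2 for the 1d-grid model, k = 3 for the 2d-grid model; function and argument names are ours):

```python
import numpy as np

def nullspace_basis(D, positions, sigma_pos, k):
    """Eigenvectors of J = D^T W D to the k smallest eigenvalues (Eq. 12).

    D: (n, m) matrix, one data vector d_i per row (Eq. 11).
    positions: (n, ndim) pixel offsets used for the Gaussian weights W.
    k: dimension of the sought null-space (2 or 3 here)."""
    w = np.exp(-0.5 * np.sum(positions**2, axis=1) / sigma_pos**2)
    J = D.T @ (w[:, None] * D)          # structure tensor, W diagonal
    evals, evecs = np.linalg.eigh(J)    # eigenvalues in ascending order
    return evecs[:, :k]                 # columns span the null-space
```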
4 Disentangling the parameters
Using a 1d camera grid and model Eq. 10, the two eigenvectors belonging to the smallest eigenvalues span the null-space of J. For a 2d grid and Eq. 9, three eigenvectors span the null-space. For all models, we linearly combine these eigenvectors such that all but exactly one component of {dsx, dsy, dt} vanish. From the linear combination with dt ≠ 0 and dsx = dsy = 0, we calculate the motion components. From the other one or two eigenvectors with dt = 0 we derive depth and normals.
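One way to form these linear combinations is to solve a small linear system in the eigenvector coefficients: force the unwanted components of {dsx, dsy, dt} (the last entries of p in Eq. 9) to zero and the remaining one to 1. A sketch under these assumptions (indices follow Eq. 9 for the 2d-grid model; the helper name is ours):

```python
import numpy as np

def combine_eigenvectors(E, keep, zero):
    """Linearly combine null-space eigenvectors (columns of E) such that
    the components listed in `zero` vanish and component `keep` equals 1.

    For the 2d-grid model (Eq. 9): dsx, dsy, dt are components 6, 7, 8
    (0-based)."""
    rows = [E[keep, :]] + [E[z, :] for z in zero]
    rhs = np.zeros(len(rows)); rhs[0] = 1.0
    coeff = np.linalg.solve(np.array(rows), rhs)  # k equations, k unknowns
    return E @ coeff

# usage: p_t  = combine_eigenvectors(E, keep=8, zero=(6, 7))   # only dt != 0
#        p_sx = combine_eigenvectors(E, keep=6, zero=(7, 8))   # only dsx != 0
```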
4.1 Depth and normals
For the 1d camera grid we build linear combinations p̃1 and p̃2 of the eigenvectors such that component 6 of p̃1 (i.e. dsx) and component 7 of p̃2 (i.e. dt) vanish, respectively, i.e. p̃1,6 = 0 and p̃2,7 = 0. Then we get v, zx, and zy via

v = p̃2,1/p̃2,6,   zx = p̃2,3/p̃2,6,   zy = p̃2,4/p̃2,6   (13)
as dt = 0 for this parameter vector (cmp. Eq. 10). For the 2d camera grid, two linearly independent eigenvectors with dt = 0 can be derived. Thus we could estimate v, zx, and zy from either of the two eigenvectors, but could not decide which one is better. Fortunately one can estimate the error covariance matrix (see [20]) for the estimation pairs and rotate sx-sy-space such that this matrix becomes diagonal. Then the estimates are independent and can be recombined respecting their then known individual errors. A more detailed description of this approach can be found in [22]. For the estimation of uz a similar subspace rotation has to be applied. But there the rotation is in x-y-space, not in sx-sy-space. This is described next.
4.2 Motion components

Here, we argue for the 1d camera grid model, but for the 2d camera grid an analogous derivation is valid. The motion components are calculated from p̃1 where dsx = 0. We can calculate uz from components 3 and 5 individually, but want to treat them as two independent measurements, uz,1 and uz,2. They are independent only in the eigensystem of their error covariance matrix C, which we approximate as derived in [20]. It can be diagonalized by a rotation R as C = R^T Σ R, where Σ (with Σii = σi²) is a diagonal matrix and the σi² are the error variances of the independent measurements. As R rotates the x-y coordinate system, we show how this affects uz,1 and uz,2. For dsx = 0 we identify the remaining nonzero components of the model equation (10) as an affine flow model

It + ∇I (A x + u) = 0,  i.e.  It + (Ix, Iy) [ diag(uz,1, uz,2) (x, y)^T + (ux, uy)^T ] = 0   (14)

Rotating the coordinate system by R yields

It + ∇I R^T (A R x + u) = 0  ⇔  It + ∇I R^T A R x + ∇I R^T u = 0   (15)

The diagonal entries of R^T A R are the independent measurements uz,1 = (c² p̃1,3 + s² p̃1,5)/p̃1,7 and uz,2 = (s² p̃1,3 + c² p̃1,5)/p̃1,7, where c and s are the components of the normalized first eigenvector of C. The two measurements are combined via

uz = (uz,1/σ1² + uz,2/σ2²) / (1/σ1² + 1/σ2²)   (16)

respecting their errors σi. We then get

ux = p̃1,1/p̃1,7 − x0 uz,   uy = p̃1,2/p̃1,7 − y0 uz   (17)

with central pixel coordinates x0 and y0.
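A compact sketch of Eqs. 16 and 17 for the 1d-grid case, assuming p̃1 (with dsx = 0) and the per-measurement error variances σ1², σ2² are already available; the rotation R of Eq. 15 is omitted for brevity, i.e. c = 1, s = 0 is assumed, and all names are ours:

```python
def motion_from_p1(p1, x0, y0, var1, var2):
    """Recover (ux, uy, uz) from the linear combination p̃1 with dsx = 0.

    Components (0-based, Eq. 10): 0 -> (ux + x0*uz)*dt, 1 -> (uy + y0*uz)*dt,
    2 -> zx*dsx + uz*dt, 4 -> uz*dt, 6 -> dt."""
    dt = p1[6]
    uz1 = p1[2] / dt                    # first uz measurement
    uz2 = p1[4] / dt                    # second uz measurement
    # inverse-variance weighted combination (Eq. 16)
    uz = (uz1 / var1 + uz2 / var2) / (1.0 / var1 + 1.0 / var2)
    # remove the depth-motion contribution (Eq. 17)
    ux = p1[0] / dt - x0 * uz
    uy = p1[1] / dt - y0 * uz
    return ux, uy, uz
```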
5 Experiments
We want to test the performance of the 1d and 2d camera grid models for simultaneous estimation of motion, depth and normals. In [22] a 1d camera grid model for motion and depth, but without normals, has been investigated. We get it from Eq. 9 by setting zx = 0, zy = 0, and dsy = 0:

d = (Ix, Iy, Ix Δx + Iy Δy, Isx, It)^T
p = (v dsx + (ux + x0 uz) dt,  (uy + y0 uz) dt,  uz dt,  dsx,  dt)^T   (18)
In a first experiment we compare the influence of the additional estimation of normals on the other parameters in the 1d case (via Eq. 10) using synthetic data. In a second experiment we test the benefit of a 2d grid (Eq. 9) compared to the 1d grid (Eq. 10). Results for real data acquired under controlled lab conditions have already been depicted in Fig. 1.
5.1 Setup and Prerequisites
In [22] properties of this estimation approach have been investigated with respect to filter choices, linearization effects, and noise influence. We use the results here by choosing 5 cameras (i.e. sampling points in sx and sy) for the 1d grid and 5 × 5 cameras for the 2d grid, because the sx- and sy-derivatives can be calculated with low systematic error. Linearization effects due to the approximation |Z0| ≫ |Uz t| spoil estimation results for large |Uz t|, as expected. Thus we select our frame rate ≫ |Uz|/|Z0|. Noise increases the absolute error of the estimated components of the parameter vectors in the structure tensor. Thus image sequences should be acquired such that the estimable signal is maximal. Especially absolute values of displacements, i.e. uncorrected optical flow components (ux + x0 uz) and (uy + y0 uz) as well as disparity v, should be as large as possible. Unfortunately the estimation process using the structure tensor breaks down around and above 0.25 of the dominating spatial wavelength in the image data. Therefore we smooth the acquired data in x- and y-directions in order to suppress the smallest wavelengths. For reasonable camera distances in the grid the disparity is much larger than 1 pixel. Therefore we define a working depth with an integer disparity v0 and shift our data in sx- and sy-directions such that this disparity becomes zero. Then we can estimate disparity reliably in a depth interval around the working depth, where extremal disparities should not exceed ±1 pixel. We select our acquisition configuration accordingly. Please note that this restriction is not inherent to the model but is due to the use of the structure tensor and can be relaxed using warping or multiscale estimators (see e.g. [4, 21]). As synthetic data we calculate sinusoidal patterns with wavenumbers Kx = 2π f/(Z0 λx) and Ky = 2π f/(Z0 λy), focal length f, depth Z0 at the origin, and wavelengths λx and λy in the image plane (the length in pixels depends on the pixel size; we use 7.4 µm). The sinusoidal pattern is then cos(Kx (cα ΔX + sα ΔY)) · cos(Ky (−sα ΔX + cα ΔY)), where ΔX and ΔY are calculated by inverting Eq. 4, and cα = cos(α), sα = sin(α) for a given rotation angle α. Two example images can be found in Fig. 3.
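The test pattern can be generated directly from this formula; a minimal sketch, assuming ΔX and ΔY have already been obtained by inverting Eq. 4 for the desired camera/time configuration (names and defaults are ours):

```python
import numpy as np

def sinusoidal_pattern(dX, dY, lam_x, lam_y, alpha, f=10.0, Z0=100.0):
    """cos(Kx(cα ΔX + sα ΔY)) * cos(Ky(−sα ΔX + cα ΔY)) with
    Kx = 2π f/(Z0 λx), Ky = 2π f/(Z0 λy); λ in the same length units as f."""
    Kx = 2 * np.pi * f / (Z0 * lam_x)
    Ky = 2 * np.pi * f / (Z0 * lam_y)
    ca, sa = np.cos(alpha), np.sin(alpha)
    return np.cos(Kx * (ca * dX + sa * dY)) * np.cos(Ky * (-sa * dX + ca * dY))
```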
5.2 1d Grid with and without Normals

In order to show the influence of the additional estimation of surface normals in the 1d camera grid model, we generate noise-free, synthetic sinusoidal images (see above) with varying Zx. The other parameters were set to λx = λy = λ, α = 0°, Z0 = 100 mm, f = 10 mm, Zy = 0. From these sequences we estimate local depth Z and depth velocity Uz for both 1d grid models, and Zx for the model additionally handling surface normals. The respective plots can be found in Fig. 4 and Fig. 5. There Zx is given in degrees, i.e. Zx[°] = arctan(Zx). Looking at Fig. 5 we observe that left and right plots are almost identical. Thus the model with the normals performs as stably as the simpler model without normals. In the case of noise-free synthetic data, a positive influence of the normals is visible for the estimation of Uz, but only for short wavelengths (cmp. Fig. 5a and b). For all the other cases depicted in Fig. 5 the influence is negligible. Looking at Fig. 4 we see that the absolute error of the estimated Zx is in the range of 0.001° to 0.01° in the noise-free case and in the range of 0.01° to 10° when 2.5% Gaussian noise is added to the data.

Figure 3: Sinusoidal images for the central camera sx = sy = 0, origin (X0, Y0) in the image center, no motion, Z0 = 100 mm, f = 10 mm, Zy = 0, using (a) λx = λy = 16 pixel = 0.1184 mm, α = 0°, Zx = 1.73 (i.e. 60°) and (b) λx = 8 pixel, λy = 80 pixel, α = 30°, Zx = 1 (i.e. 45°). Ground truth depth maps are given below the images.
5.3 1d Grid versus 2d Grid

The benefit of a second camera-shift direction is expected to lie in the reduction of a potential aperture problem. This means, looking at a strongly oriented pattern, a 1d camera grid cannot resolve e.g. depth if the pattern is oriented along the sx-axis. A 2d camera grid should not encounter this problem, as depth can be estimated easily in the camera direction normal to the pattern orientation. To test this behavior we generate noisy (σ = 0.025) and noise-free sinusoidal patterns with λx = 8 pixel, λy = 80 pixel and varying orientation angle α. The other parameters are fixed to Z0 = 100 mm, f = 10 mm, Zy = 1 (i.e. 45°), Zx = 0. From these sequences we estimate local depth Z and surface slope Zy and tabulate their standard deviations (see Tab. 1). Systematic relative errors (noise-free data) are always around 1e-3 for Z0. For Zy they are well below 0.7° and 0.05° in the 1d and the 2d case, respectively. For the standard deviations we observe that both grid models perform comparably in the noise-free case. In the noisy data case the aperture problem becomes visible in the 1d model: estimation of Z0 gets worse for increasing α, and Zy is not estimable. The 2d model still gives Z0 reliably, and Zy has a high but limited variance.
Figure 4: 1d camera grid model with normals. Absolute errors of normals calculated from sinusoidal images. Left: no noise, right: 2.5% Gaussian noise.
Table 1: Comparison 1d and 2d camera grid. Standard deviations calculated from sinusoidal images.

α                         0°      30°     60°     90°     0°      30°     60°     90°
noise                     0       0       0       0       2.5%    2.5%    2.5%    2.5%
Z0: std. dev. [mm]  1d    0.103   0.107   0.117   0.127   0.72    3.29    4.86    6.76
                    2d    0.103   0.103   0.106   0.103   0.715   2.82    2.44    0.53
Zy: std. dev. [°]   1d    0.19    3.11    3.06    0.001   42.5    78.0    80.5    84.3
                    2d    0.17    0.14    0.05    0.01    40.9    37.9    30.8    30.9
Figure 5: Comparison of 1d camera grid models. Relative errors from sinusoidal images. Left: model without normals, right: model with normals. a–d noise free data, e–h 2.5% Gaussian noise added.
6 Summary and Outlook
We described and tested a novel model for simultaneous estimation of 3d motion, depth and surface normals in multi-camera sequences. We demonstrated its use within the structure tensor estimation framework, but due to its BCCE-like form other optical flow estimation methods may be applied as well. This will provide additional benefits, e.g. higher robustness and/or faster performance. Using the structure tensor we have investigated the full model and derived a method to disentangle the mixed motion components. In the experiments we compared the 1d model without normals from [22] to the full model. All parameters can be estimated as well or better with the full model. The additional parameters do not lead to instabilities. In addition the 2d model is more robust with respect to the aperture problem. In future work, we will investigate other estimation schemes instead of the structure tensor. Further we will investigate additional constraints, e.g. enforcing that the slopes of the estimated depth field and the estimated slopes agree.

Acknowledgement. We thank Andres Chavarria for fruitful discussions. This work has partly been funded by DFG SPP1114 (SCHA 927/1-3).
References

[1] J.L. Barron, D.J. Fleet, and S.S. Beauchemin. Performance of optical flow techniques. IJCV, 12(1):43–77, 1994.
[2] J. Bigün and G.H. Granlund. Optimal orientation detection of linear symmetry. In ICCV, pages 433–438, 1987.
[3] M. Black, D. Fleet, and Y. Yacoob. Robustly estimating changes in image appearance. CVIU, 7(1):8–31, 2000.
[4] A. Bruhn, J. Weickert, and C. Schnörr. Lucas/Kanade meets Horn/Schunck: Combining local and global optic flow methods. IJCV, 61(3):211–231, 2005.
[5] R. Carceroni and K. Kutulakos. Multi-view 3d shape and motion recovery on the spatio-temporal curve manifold. In ICCV (1), pages 520–527, 1999.
[6] T.S. Denney Jr. and J.L. Prince. Optimal brightness functions for optical flow estimation of deformable motion. IEEE Trans. Im. Proc., 3(2):178–191, 1994.
[7] H. Farid and E.P. Simoncelli. Optimally rotation-equivariant directional derivative kernels. In 7th Int'l Conf. Computer Analysis of Images and Patterns, Kiel, 1997.
[8] G. Farnebäck. Fast and accurate motion estimation using orientation tensors and parametric motion models. In ICPR, pages 135–139, 2000.
[9] D.J. Fleet, M.J. Black, Y. Yacoob, and A.D. Jepson. Design and use of linear models for image motion analysis. IJCV, 36(3):171–193, 2000.
[10] D.J. Fleet and K. Langley. Recursive filters for optical flow. PAMI, 17(1):61–67, 1995.
[11] H. Haußecker and D.J. Fleet. Computing optical flow with physical models of brightness variation. PAMI, 23(6):661–673, June 2001.
[12] H. Haußecker and H. Spies. Motion. In B. Jähne, H. Haußecker, and P. Geißler, editors, Handbook of Computer Vision and Applications. Academic Press, 1999.
[13] B.K. Horn and B.G. Schunck. Determining optical flow. Artificial Intelligence, 17:185–204, 1981.
[14] B. Jähne, H. Scharr, and S. Körkel. Principles of filter design. In Handbook of Computer Vision and Applications. Academic Press, 1999.
[15] T. Koivunen and A. Nieminen. Motion field restoration using vector median filtering on high definition television sequences. In Proc. Vis. Comm. and Im. Proc. '90, volume 1360, pages 736–742, 1990.
[16] L.H. Matthies, R. Szeliski, and T. Kanade. Kalman filter-based algorithms for estimating depth from image sequences. IJCV, 3:209–236, 1989.
[17] G. Li and S.W. Zucker. Differential geometric consistency extends stereo to curved surfaces. In ECCV 2006, Part III, LNCS 3953, pages 44–57. Springer, 2006.
[18] B. Lucas and T. Kanade. An iterative image registration technique with an application to stereo vision. In DARPA Image Understanding Workshop, pages 121–130, 1981.
[19] Y. Nakamura, T. Matsuura, K. Satoh, and Y. Ohta. Occlusion detectable stereo – occlusion patterns in camera matrix. In CVPR, pages 371–378, 1996.
[20] O. Nestares, D.J. Fleet, and D.J. Heeger. Likelihood functions and confidence bounds for total-least-squares problems. In CVPR, pages 523–530, 2000.
[21] N. Papenberg, A. Bruhn, T. Brox, S. Didas, and J. Weickert. Highly accurate optic flow computation with theoretically justified warping. IJCV, 67(2):141–158, April 2006.
[22] H. Scharr. Towards a multi-camera generalization of brightness constancy. In International Workshop on Complex Motion, Oct. 2004, LNCS 3417. Springer, 2005. In press.
[23] U. Schurr, A. Walter, S. Wilms, H. Spies, N. Kirchgessner, H. Scharr, and R. Küsters. Dynamics of leaf and root growth. In 12th Int. Congress on Photosynthesis, PS 2001, 2001.
[24] N. Slesareva, A. Bruhn, and J. Weickert. Optic flow goes stereo: A variational method for estimating discontinuity-preserving dense disparity maps. In DAGM 2005, LNCS 3663, pages 33–40. Springer, 2005.
[25] H. Spies and B. Jähne. A general framework for image sequence processing. In Fachtagung Informationstechnik, pages 125–132, 2001.
[26] R. Szeliski. A multi-view approach to motion and stereo. In CVPR, 1999.
[27] S. Vedula, S. Baker, P. Rander, R. Collins, and T. Kanade. Three-dimensional scene flow. In ICCV 1999, pages 722–729, 1999.
[28] S. Vedula, S. Baker, S. Seitz, R. Collins, and T. Kanade. Shape and motion carving in 6d. In CVPR 2000, pages 592–598, 2000.