
Statistical Bias in 3D Reconstruction From Monocular Video

Amit K. Roy Chowdhury and Rama Chellappa
Center for Automation Research and Department of Electrical and Computer Engineering
University of Maryland, College Park, MD 20742

The work was partially supported by DARPA/ONR grant N00014-00-1-0908.


Abstract

The present state of the art in computing the error statistics in 3D reconstruction from video concentrates on estimating the error covariance. A different source of error which has not received much attention in the computer vision community, but has been noted by psychophysicists, is the existence of “systematic biases in observers’ magnitude estimation of perceived depth” [1]. In this paper, we prove that the depth estimate, based on the continuous (differentiable) version of structure from motion (SfM), is statistically biased, and we derive a precise expression for the bias. Many SfM algorithms that reconstruct a scene from a video sequence pose the problem in a linear least squares framework Ax = b. It is a well-known fact that the least squares estimate is biased if the system matrix A is noisy. In SfM, the matrix A contains the image coordinates, which are always difficult to obtain precisely; thus it is expected that the structure and motion estimates in such a formulation of the problem would be biased. Though earlier works have analyzed the bias in 3D motion estimation from stereo, to the best of our knowledge, the statistical bias in 3D reconstruction from monocular video has not been studied in the video analysis community. Existing results on the minimum achievable variance of the estimator are extended by deriving a generalized Cramer-Rao lower bound. A detailed analysis of the effect of various camera motion parameters on the bias is presented. We conclude by presenting the effect of bias compensation on real-life 3D face reconstruction problems.

Keywords: Structure from motion; error analysis; bias; least squares; Cramer-Rao lower bound.

I. Introduction

Starting from the seminal work of Longuet-Higgins [2], structure from motion (SfM) has been one of the most active research areas in computer vision for decades, with the result that today numerous algorithms exist which address various aspects of the problem ([3], [4], [5], [6], [7], [8], [9], [10], [11], [12], [13], [14], [15], [16]). Despite the existence of so many competing algorithms, constructing accurate 3D models reliably from images is still a challenging problem.


Many researchers have analyzed the sensitivity and robustness of the existing algorithms. The work of Weng et al. [17] is one of the earliest instances of estimating the standard deviation of the error in reconstruction using first-order perturbations in the input. The Cramer-Rao lower bounds on the estimation error variance of the structure and motion parameters from a sequence of monocular images were derived in [5]. Young and Chellappa derived bounds on the estimation error for structure and motion parameters from two images under perspective projection, as well as from a sequence of stereo images [18]. Similar results were derived by Daniilidis and Nagel in [19], where the coupling of the translation and rotation for a small field of view was also studied. They also proved that many algorithms for three-dimensional motion estimation that work by minimizing an objective function suffer from instabilities, and examined the error sensitivity in terms of translation direction, viewing angle and distance of the moving object from the camera. Zhang's work [8] on determining the uncertainty in estimation of the fundamental matrix is another important contribution in this area. Chiuso et al. [13] and Soatto and Brockett [20] have analyzed SfM in order to obtain provably convergent and optimal algorithms. Oliensis emphasized the need to understand algorithm behavior and the characteristics of the natural phenomenon that is being modeled [15]. Ma, Kosecka and Sastry [21] also addressed the issues of sensitivity and robustness in their motion recovery algorithm. Sun, Ramesh and Tekalp [22] proposed an error characterization of the factorization method for 3D shape and motion recovery from image sequences using matrix perturbation theory. Morris, Kanatani and Kanade [23] analyzed the non-trivial effects of the unknown scale factor, referred to in the literature as gauge freedom, on the covariance calculations in SfM. In [24], [25], we showed that it is possible to analytically compute the error covariance of the 3D reconstruction as a function of the error covariance of the optical flow estimates, using the implicit function theorem [26].

A different source of error is the bias in depth estimation. Some authors, notably [27] and [28], have proved that there exists a bias in the translation and rotation estimates from stereo. In [29], Oliensis mentions correcting for the bias in the depth estimate arising from the statistics of the parameters in their optimization function; however, he neglects the bias resulting from noisy image measurements. Recently, it has been proposed that the bias in the optical flow field can be a possible explanation for many geometrical optical illusions [30]. Extending that analysis to the 3D reconstruction problem, we show that the 3D depth estimate obtained from SfM algorithms is statistically biased and, under many conditions, the bias is numerically significant [31]. We also show that the bias leads to a new lower bound on the minimum variance of an SfM estimate, thus extending the results in [18]. The effect of different camera motion parameters and variations of SfM algorithms on the bias is studied. We present the effect of bias compensation on real-life 3D face reconstruction problems.

II. Bias in Depth Reconstruction

A. Problem Formulation

Given two images, I_1 and I_2, we are interested in computing the camera motion and structure of the scene from which these images were derived. If p(x, y) and q(x, y) are the horizontal and vertical velocity fields of a point (x, y) in the image plane, they are related to the 3D object motion and scene depth (under the infinitesimal motion assumption) by

    p(x, y) = (−v_x + x v_z) g(x, y) + x y ω_x − (1 + x²) ω_y + y ω_z
    q(x, y) = (−v_y + y v_z) g(x, y) + (1 + y²) ω_x − x y ω_y − x ω_z,   (1)

where V = [v_x, v_y, v_z] and Ω = [ω_x, ω_y, ω_z] are the translational and rotational motion vectors respectively, g(x, y) = 1/Z(x, y) is the inverse scene depth, and all linear dimensions are normalized in terms of the focal length f of the camera [32]. The problem is to estimate V, Ω and Z given (p, q). Equation (1) can be rewritten in a more useful form (because of the scale ambiguity [32]) as

    p(x, y) = (x − x_f) h(x, y) + x y ω_x − (1 + x²) ω_y + y ω_z
    q(x, y) = (y − y_f) h(x, y) + (1 + y²) ω_x − x y ω_y − x ω_z,   (2)

where (x_f, y_f) = (v_x/v_z, v_y/v_z) is known as the focus of expansion (FOE) and h(x, y) = v_z/Z(x, y).

For N points, the above equations can be represented using matrix notation, with the subscript indexing the feature point (for a sparse depth map, N denotes the number of feature points, while for a dense depth map it denotes the number of pixels in the image). Let us define

    h = (h_1, h_2, ..., h_N)^T_{N×1}
    u = (p_1, q_1, p_2, q_2, ..., p_N, q_N)^T_{2N×1}
    r_i = (x_i y_i, −(1 + x_i²), y_i)^T_{3×1}
    s_i = (1 + y_i², −x_i y_i, −x_i)^T_{3×1}
    Ω = (ω_x, ω_y, ω_z)^T_{3×1}
    Q = [r_1  s_1  r_2  s_2  ...  r_N  s_N]^T_{2N×3}

    P = [ x_1 − x_f      0        ···      0
          y_1 − y_f      0        ···      0
              0      x_2 − x_f    ···      0
              0      y_2 − y_f    ···      0
              ⋮           ⋮        ⋱        ⋮
              0           0       ···  x_N − x_f
              0           0       ···  y_N − y_f ]_{2N×N}

    B = [P  Q]_{2N×(N+3)},    z = [ h ; Ω ]_{(N+3)×1}.   (3)

Then (2) can be written as

    B z = u.   (4)

We want to compute z from u. Consider the cost function which minimizes the reprojection error (i.e. bundle adjustment)

    C = (1/2) Σ_{i=1}^{N} [(p_i − p̂_i)² + (q_i − q̂_i)²] = (1/2) ||Bz − u||²
      = (1/2) Σ_{i=1}^{n=2N} ( u_i − Σ_{j=1}^{N+3} b_{ij} z_j )² = (1/2) Σ_{i=1}^{N} (C_{p_i}² + C_{q_i}²),   (5)

where b_{ij} is the (i, j)-th element of B, and (p̂_i, q̂_i) are the projections of the depth and motion estimates z onto the image plane, obtained from the right-hand side of (2). In general, the above cost function requires non-linear optimization, and various strategies (e.g. [10]) have been proposed for this purpose.
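For concreteness, the sketch below (our own illustration in Python, not code from the paper; names such as build_B, cost and foe are ours) assembles the system matrix B of (3) for a set of feature points and evaluates the reprojection cost (5) on noise-free synthetic flow, where it vanishes by construction.

```python
import numpy as np

def build_B(pts, foe):
    """Assemble the 2N x (N+3) system matrix B = [P Q] of equation (3).
    pts: (N, 2) normalized image coordinates; foe: (x_f, y_f)."""
    N = len(pts)
    P = np.zeros((2 * N, N))
    Q = np.zeros((2 * N, 3))
    xf, yf = foe
    for i, (x, y) in enumerate(pts):
        P[2 * i, i] = x - xf            # column i of P carries (x_i - x_f, ...)
        P[2 * i + 1, i] = y - yf        # ... and (y_i - y_f)
        Q[2 * i] = [x * y, -(1 + x ** 2), y]        # r_i^T
        Q[2 * i + 1] = [1 + y ** 2, -x * y, -x]     # s_i^T
    return np.hstack([P, Q])

def cost(B, z, u):
    """Reprojection cost C = 0.5 * ||B z - u||^2 of equation (5)."""
    return 0.5 * np.sum((B @ z - u) ** 2)

# synthetic example: 10 points, known motion
rng = np.random.default_rng(0)
pts = rng.uniform(-0.5, 0.5, size=(10, 2))
h_true = rng.uniform(0.5, 1.5, size=10)       # inverse depths (scaled by v_z)
omega = np.array([0.01, 0.02, 0.005])         # rotational velocity
z_true = np.concatenate([h_true, omega])
B = build_B(pts, foe=(0.1, 0.05))
u = B @ z_true                                # noise-free flow (p_1, q_1, ...)
print(cost(B, z_true, u))                     # -> 0.0
```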


B. Computation of Bias

We now proceed to prove the main result of this paper: because feature positions are never tracked perfectly, the 3D reconstruction, in most situations, is statistically biased, and the bias is significant. We give a precise expression for the bias and outline a proof of it in the Appendix. As mentioned earlier, minimizing the cost function (5) involves non-linear optimization, which is extremely difficult unless very good initial conditions are available. One common method is to first estimate the camera motion and then the depth. Another strategy is to update the camera motion and the depth, one at a time, using the previous estimate of the other, until a convergence criterion is reached [14]. For monocular video sequences, it is often possible to first estimate the direction of motion (i.e. the FOE) and then estimate the depth and rotational motion [33]. These deterministic optimization methods will usually guarantee convergence to a local minimum. The point to note is that if we assume an estimate of the camera motion or FOE and solve for the depth, or vice versa, we are essentially solving a linear system of equations. This can be easily seen from the bilinear parameterization of (2). It is a well known fact that the least squares solution to a linear system of the form Ax = b with errors in the system matrix A is biased [34]. When the SfM problem is posed in a least squares framework, the matrix A involves the image coordinates, which almost always have measurement errors. Thus it should be expected that the solution of the SfM problem would also have a bias. To the best of our knowledge, this is the first attempt to explicitly compute and analyze the bias, arising from errors in feature tracking, in depth reconstruction from monocular video. The actual value of the bias will differ across the situations described above. We consider the particular situation where the camera motion is known and derive the expression.
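As a quick illustration of the least squares fact cited above [34], the following minimal sketch (ours; a scalar errors-in-variables toy problem, not the SfM system) shows empirically that zero-mean noise in the system matrix produces a systematically low estimate:

```python
import numpy as np

# Minimal illustration: least squares with a noisy system matrix is biased.
rng = np.random.default_rng(1)
a_true, x_true, sigma = 2.0, 3.0, 0.5
trials, n = 20000, 50
estimates = np.empty(trials)
for t in range(trials):
    a_noisy = a_true + sigma * rng.standard_normal(n)    # error in the "A" matrix
    b = a_true * x_true + 0.1 * rng.standard_normal(n)   # small observation noise
    # LS estimate x = (A^T A)^{-1} A^T b, computed with the noisy A
    estimates[t] = (a_noisy @ b) / (a_noisy @ a_noisy)
print(estimates.mean())   # noticeably below x_true = 3.0 (attenuation bias)
```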


This case is considered first because one of the most common approaches to solving the 3D reconstruction problem is to first estimate the camera motion, and then the depth. Expressions for the other conditions (e.g. simultaneous estimation of depth and rotation) can be similarly derived. For algorithms which estimate the parameters alternately and update based on the previous estimates, the bias will propagate through the reconstruction strategy. For the case when the camera motion is known, (4) can be written as

    b = A h,   (6)

with A ≜ P and b ≜ [p_1 − r_1^T Ω, q_1 − s_1^T Ω, ..., p_N − r_N^T Ω, q_N − s_N^T Ω]^T. We now state the main result in the form of a theorem.

Theorem 1: Consider the LS solution of (6), i.e. ĥ = (A^T A)^{−1} A^T b. For convenience, let us define

    M ≜ A^T A = diag[(x_i − x_f)² + (y_i − y_f)²]_{i=1,...,N} = diag[m_ii]_{i=1,...,N},   (7)

    v ≜ A^T b = [(x_i − x_f) v_{p_i} + (y_i − y_f) v_{q_i}]^T ≜ [v_1, ..., v_N]^T,   (8)

where v_{p_i} = p_i − r_i^T Ω and v_{q_i} = q_i − s_i^T Ω. The bias in the inverse depth estimate ĥ is b(ĥ) = E[ĥ] − h̄, where h̄ is the true value. If σ_i² = E[δx_i²] = E[δy_i²] is the variance in the image coordinate measurements, then under the assumptions of the above formulation, the bias in the inverse depth estimate ĥ_i of the i-th feature point is given by

    [Bias]_i = b(ĥ_i) = (2σ_i²/m_ii²) [(x_i − x_f)² r_{i_x}^T + (y_i − y_f)² s_{i_y}^T] Ω
             + (2σ_i²/m_ii²) [(x_i − x_f)(y_i − y_f)(s_{i_x}^T + r_{i_y}^T)] Ω
             + (σ_i²/m_ii) [(x_i − x_f) ω_y − (y_i − y_f) ω_x − (r_{i_x}^T + s_{i_y}^T) Ω],   (9)

where f_{i_x} represents the partial derivative of a function f_i with respect to x.
Proof: See Appendix A.
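For reference, a direct transcription of (9) might look as follows (a hedged sketch in Python; the function name depth_bias and the argument layout are ours, and the partials r_x, r_y, s_x, s_y are taken from the appendix):

```python
import numpy as np

def depth_bias(pts, foe, omega, sigma2):
    """Evaluate the bias expression (9) for each feature point.

    pts: (N, 2) image coordinates; foe: (x_f, y_f); omega: (w_x, w_y, w_z);
    sigma2: (N,) variances of the feature-position noise. r_x, r_y, s_x, s_y
    below are the partials of r_i and s_i listed in the appendix."""
    omega = np.asarray(omega, dtype=float)
    xf, yf = foe
    wx, wy, wz = omega
    bias = np.empty(len(pts))
    for i, (x, y) in enumerate(pts):
        a, b = x - xf, y - yf
        m = a ** 2 + b ** 2                      # m_ii of (7)
        r_x = np.array([y, -2 * x, 0.0])         # dr_i/dx
        r_y = np.array([x, 0.0, 1.0])            # dr_i/dy
        s_x = np.array([0.0, -y, 1.0])           # ds_i/dx
        s_y = np.array([2 * y, -x, 0.0])         # ds_i/dy
        quad = (a ** 2) * r_x + (b ** 2) * s_y + a * b * (s_x + r_y)
        bias[i] = (2 * sigma2[i] / m ** 2) * (quad @ omega) \
                  + (sigma2[i] / m) * (a * wy - b * wx - (r_x + s_y) @ omega)
    return bias
```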

III. Analysis of Bias

A. Effect of Model Parameters

Effect of Camera Motion: The bias is a function of the camera motion parameters. It is most affected by the rotational motion of the camera. As can be seen from the expression in equation (9), the bias is negligibly small when the angular motion is zero (or very small). A detailed numerical analysis of the effect of various camera motion parameters will be presented in Section V.

Bias Compensation: Once the structure and motion estimates are obtained, the bias can be computed and subtracted out of the estimate. For the biased estimate, E[ĥ] = h̄ + b(ĥ), where h̄ is the true value. If ĥ_c = ĥ − b(ĥ) is the bias compensated estimate, then E[ĥ_c] = E[ĥ] − b(ĥ) = h̄, thus leading to an unbiased estimate. The accuracy resulting from the bias compensation will be shown through numerical simulations in Section V.

Bias and Total Least Squares (TLS): TLS has emerged as an alternative to least squares since it is capable of handling errors in both the observations, b, and the system variables, A, in a linear system Ax = b. However, the TLS estimate is unbiased only if the error in estimating A is equal in variance to the error in estimating b [35]. Such a condition would be very difficult to maintain in (2). Also, estimating the bias of a TLS estimate is extremely cumbersome. Further, as argued in [35], the covariance of an unbiased TLS estimate is larger than that of the LS estimate, in the first-order approximation as well as in simulations. Hence, there is no fundamental gain in choosing TLS over the LS solution. Thus using the TLS criterion cannot be a solution to the problem of bias in the 3D estimate. A very thorough analysis of bias in vision problems and the applicability of different techniques to remove the bias can be found in [30]. A possibility for overcoming the problems of applying TLS lies in the use of the generalized TLS (GTLS) [35]. This would, however, require estimates of the error covariances of A and b. For the SfM problem, it implies that we need to know the error covariances of the feature points, as well as the optical flow estimates (i.e. both spatial and temporal error covariances). Obtaining these statistics accurately is usually not an easy problem. Therefore, it is difficult to eliminate the bias in the depth reconstruction by choosing a different optimization criterion.

Bias in Multi-frame Reconstruction: Suppose now that we have L two-frame reconstructions. Let (ĥ^1, ..., ĥ^L) be the two-frame estimates aligned with respect to a particular frame of reference. Let the true value be h̄ and the bias in (9) be represented by (b(ĥ^1), ..., b(ĥ^L)), i.e. E[ĥ^i] = h̄ + b(ĥ^i), i = 1, ..., L. Assume that the estimates and the true value have the same scale (so that the problem of scale ambiguity does not arise). Then the least squares estimate for the structure over all L observations is ĥ = (1/L) Σ_{i=1}^{L} ĥ^i. Taking expectations on both sides, we see that the bias in the multi-frame estimate is b_L(ĥ) = (1/L) Σ_{i=1}^{L} b(ĥ^i), where b(ĥ^i) is obtained from (9) for the i-th and (i+1)-st frame.
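The compensation and fusion steps described above are algebraically simple; a minimal sketch (ours, assuming the two-frame estimates are already scale-aligned, as stated above) is:

```python
import numpy as np

def compensate(h_hat, bias):
    """Bias-compensated estimate: h_c = h_hat - b(h_hat), so E[h_c] = h_bar."""
    return np.asarray(h_hat) - np.asarray(bias)

def fuse_two_frame(estimates, biases):
    """Average L scale-aligned two-frame estimates (rows of `estimates`) and
    subtract the averaged bias b_L = (1/L) * sum_i b(h^i)."""
    return np.mean(estimates, axis=0) - np.mean(biases, axis=0)
```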

B. Bias in Psychophysical Experiments

No general theory exists which explains the performance of the human visual system. However, it is generally believed that our eyes receive a sequence of images and the data from a number of retinal images is combined to obtain one representation of the physical scene [36]. The early visual processing system extracts local image measurements from this sequence of images for use in further estimation processes. In this paper, these measurements correspond to the positions and gray values of image points. However, they can be derived only within a range of accuracy, i.e. there is noise in the measurements of the positions of the image points. This error in obtaining the positions of the points precisely results in a bias in the solution of the SfM problem.

Such systematic biases in depth perception have also been noted in psychophysical experiments on human subjects. Todd mentions “that it is hard to explain ... the existence of systematic biases in observers’ magnitude estimation of perceived depth” [1]. Although observers should be able to perceive an object's depth up to a homogeneous scaling transformation (because of the scale ambiguity in perspective projection), Todd shows through his experiments that this is not the case. He and other researchers have shown that there is often a systematic expansion or compression of perceived depth. In our work, we do not attempt to model the specifics of the human visual system. However, the mathematical results we obtain show that there is a systematic bias in depth estimation using the differential model for SfM. While we do not claim our results to be an explanation of the experimental observations, the connection is very interesting and worthy of future investigation.

IV. Minimum Variance Bound for SfM

Lower bounds on the variance of an estimator are usually expressed in terms of the Cramer-Rao lower bound (CRLB). The expression for the CRLB that is often used in practice assumes the estimate to be unbiased, because it is often difficult to calculate the bias of an estimator. The general expression for the CRLB after incorporating the bias in the estimate, under the proper regularity assumptions, is [37]

    Σ_θ(g) ≥ b_θ(g) b_θ(g)^T + (I_{p×p} + ∇_θ b_θ(g)) M^{−1}(θ) (I_{p×p} + ∇_θ b_θ(g))^T,   (10)

where g is the estimate of the parameter θ, b_θ(g) is the bias of the estimate, M(θ) is the Fisher information matrix, and ∇_θ is the gradient with respect to θ. Several authors have derived expressions for the variance of the multi-frame SfM reconstruction under Gaussian noise assumptions and then invoked the Gauss-Markov theorem of least squares [37] to obtain the Fisher information (FI) matrix (and thus the CRLB) by inverting the covariance matrix ([4], [5], [38], [18], [39], [40]). This assumes the estimate to be unbiased. In the light of our discussion, this means that the true positions of the features are known, which is hardly ever the case. Since we know the expression for the bias, we can obtain a more accurate expression for the minimum variance that we can expect to obtain. Let ĥ denote the estimate of h̄ (the true value). Let the bias in the multi-frame estimate be denoted by b(ĥ) (Section II). Since the bias does not depend on h̄, ∇_h̄ b(ĥ) = 0. Let the FI matrix for multi-frame reconstruction, under additive Gaussian noise assumptions in p and q in (2), be denoted by F(h̄) (refer to [18] for details). This derivation is based on the assumptions that the feature positions are known exactly and the motion estimates {p_i, q_i} are corrupted by additive white Gaussian noise. Since the estimate in this case is unbiased, the FI matrix can be inverted to obtain the CRLB. Then the variance in the biased estimate ĥ, represented as Σ_h̄(ĥ) = E[(ĥ − h̄)(ĥ − h̄)^T], can be derived easily from (10) by substituting the values of the different terms as

    Σ_h̄(ĥ) ≥ b(ĥ) b^T(ĥ) + F^{−1}(h̄).   (11)
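Computationally, the right-hand side of (11) is straightforward once the bias vector from (9) and the FI matrix of [18] are available; a sketch (ours) follows:

```python
import numpy as np

def generalized_crlb(bias, fisher):
    """Right-hand side of (11): b(h) b(h)^T + F^{-1}(h_bar).

    bias: (N,) bias vector evaluated via (9); fisher: (N, N) Fisher
    information matrix, e.g. computed under the Gaussian-noise model of [18]."""
    return np.outer(bias, bias) + np.linalg.inv(fisher)

# The traces plotted in Figures 3 and 4 correspond to
# np.trace(np.linalg.inv(fisher)) for the unbiased bound and
# np.trace(generalized_crlb(bias, fisher)) for the biased one.
```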

V. Experimental Results

In this section, we describe the results of experiments conducted using both simulated and real data. The simulation experiments give us a numerical idea of the importance of the bias term in the overall reconstruction, as well as an understanding of the effects of the different motion parameters on the bias. The experiments with video sequences of faces show the impact of the bias on real-life 3D face reconstruction problems.


A. Effect of Bias on Reconstruction

In this set of experiments, we plotted the actual reconstruction estimate, with and without bias compensation, against the true depth. A set of 10 random feature points was tracked across a few frames. The depths from each pair of frames were obtained and then fused together using stochastic approximation, as explained in [25]. To fix the scale of the reconstruction, the depth at the first point was used. In these experiments we considered the case of non-zero but constant translational and rotational camera motion. The effect of measurement noise was studied by adding different levels of noise to the feature positions. Figures 1 (a), (b), (c) and (d) are for four different noise variances, σ²_{x1}, σ²_{x2}, σ²_{x3} and σ²_{x4} respectively, where σ²_{x4} > σ²_{x3} > σ²_{x2} > σ²_{x1}.

It can be seen that bias compensation makes the estimate closer to the true value in all the cases (i.e. reduces the bias) and gives significant advantages for some of the points.
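A Monte Carlo sketch in the spirit of this experiment (our own setup, reusing build_B and depth_bias from the earlier sketches; the motion values and noise level are illustrative, not the paper's) compares the empirical bias of the LS depth estimate with the prediction of (9):

```python
import numpy as np

# Motion is known; only the tracked feature positions are noisy, matching
# the perturbation analyzed in the derivation of (9).
rng = np.random.default_rng(2)
foe = (0.1, 0.05)
omega = np.array([0.02, 0.03, 0.01])
pts = rng.uniform(-0.5, 0.5, (10, 2))
h_true = rng.uniform(0.5, 1.5, 10)
u = build_B(pts, foe) @ np.concatenate([h_true, omega])   # noise-free flow
sigma = 0.005                                             # feature-position noise

def ls_depth(p2d, u, foe, omega):
    """Closed-form LS inverse depth of (6): h_i = v_i / m_ii, with M, v of (7)-(8)."""
    xf, yf = foe
    h = np.empty(len(p2d))
    for i, (x, y) in enumerate(p2d):
        r = np.array([x * y, -(1 + x ** 2), y])
        s = np.array([1 + y ** 2, -x * y, -x])
        vp, vq = u[2 * i] - r @ omega, u[2 * i + 1] - s @ omega
        a, b = x - xf, y - yf
        h[i] = (a * vp + b * vq) / (a ** 2 + b ** 2)
    return h

mean_h = np.mean([ls_depth(pts + sigma * rng.standard_normal(pts.shape),
                           u, foe, omega)
                  for _ in range(20000)], axis=0)
print(mean_h - h_true)                                       # empirical bias
print(depth_bias(pts, foe, omega, np.full(10, sigma ** 2)))  # prediction of (9)
# For small sigma the two vectors should agree to second order.
```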

Fig. 1. (a), (b), (c) and (d) are reconstruction plots for noise variances σ²_{x1}, σ²_{x2}, σ²_{x3} and σ²_{x4} respectively in the feature positions, where σ²_{x4} > σ²_{x3} > σ²_{x2} > σ²_{x1}. The plots are for the same set of ten 3D points tracked over 15 frames. The camera is moving with constant, non-zero translation and rotation. The solid lines indicate the true depth values, the dashed lines indicate reconstruction without bias compensation, and the dash-dotted lines indicate reconstruction with bias compensation.

B. Variation of Bias with Individual Camera Motion Parameters

Given the rather complicated form of (9), it is difficult to obtain analytical expressions for the effects of the various camera motion parameters on the reconstruction bias. In this set of experiments, we analyzed the effects of the camera motion through numerical simulations. The focal length of the camera was assumed to be known. Each of the six motion parameters was varied over a certain range of values, keeping all the others fixed. While fixing the range over which to vary the camera motion, it should be borne in mind that the basic SfM equations in (2) are valid only for small camera motion. Thus it does not make sense to study the behavior of the bias term over a large range of camera motion parameters.

The plot of the bias for various values of the camera motion parameters is shown in Figure 2. The bias is plotted on the vertical axis as a percentage of the true depth values. The motion terms which affect the bias the most are (v_x, v_y, ω_x, ω_y). The effect of (v_z, ω_z) on bias is almost negligible.

Fig. 2. Plots of the variation in the bias in inverse depth with the camera motion parameters. The horizontal axes represent the following: (a): v_x ∈ (0, 1) cm/frame, (b): v_y ∈ (0, 1) cm/frame, (c): v_z ∈ (0, 1) cm/frame, (d): ω_x ∈ (0, 10) degrees/frame, (e): ω_y ∈ (0, 10) degrees/frame, (f): ω_z ∈ (0, 10) degrees/frame. The values of the bias on the vertical axis are in percentages of the true inverse depth value. The camera motion is scaled between 0 and 1 on the horizontal axis.

C. The Generalized CRLB

Next we present the effect of the bias term on the CRLB. A set of fifty 3D points was generated in space. Different trajectories were generated by varying the camera motion parameters. The perspective projections of the 3D points onto the image plane were tracked across a sequence of 15 frames. Random noise was added to the positions of the points in the images. We present here the effects of the different motion parameters on the CRLB.

Figures 3 and 4 plot the trajectories of some of the feature points and the trace of the CRLB matrix of the structure estimates. The trajectories of the points tracked across all the frames are shown in the top row, while the trace of the CRLB for the unbiased estimator and the trace of the generalized CRLB for the biased one are plotted in the bottom row as a function of the number of frames. The plots are for various values of the camera motion parameters x_f, y_f, ω_x, ω_y, ω_z. We will now briefly analyze the different cases.

Fig. 3. Plots of the trajectories of feature points (top row) and the CRLB of the inverse depth as a function of the number of frames, for different camera motion parameters (bottom row). (a), (b): x_f = 0, y_f = 10, ω_x = ω_y = 1 degree/frame, ω_z = 0; (c), (d): x_f = 10, y_f = 0, ω_x = ω_y = ω_z = 1 degree/frame; (e), (f): x_f = 10, y_f = 10, ω_x = ω_y = ω_z = 1 degree/frame. The solid line shows the CRLB for the unbiased estimate and the dotted line for the biased one.

1. Constant translation, non-zero rotation: This is the case in all the plots in Figure 3. It can be seen that the effects of different values of the linear velocity term on the bias in the structure estimate are not drastically different. Whether one of the components (v_x or v_y) is zero or both of them are non-zero, the bias is significant. There is a distinct upward shift in the minimum reconstruction error in all these cases.

2. No Rotation: This is the case of particular interest. When the rotational velocity is zero, the bias term is negligible; the difference in the CRLB is too small to represent in the plots of Figures 4(b) and 4(d). This is the case irrespective of whether (i) V is constant but non-zero; (ii) V is constant but either v_x or v_y is zero; (iii) there is an acceleration in the linear component of the velocity. The reason for this can be understood from the expression for the bias in (9). The only term in (9) that contributes to the bias if Ω = 0 is (σ_i²/m_ii)[(x_i − x_f) ω_y − (y_i − y_f) ω_x]. This would be small unless the variances of the errors in the feature positions are very large, in which case the solution itself would be unreliable.

3. Linear Acceleration, Constant Rotation: This is the case in Figure 4(f). Again, we see that the bias is significant once the rotational velocity is non-zero.

Fig. 4. Plots of the trajectories of some of the feature points (top row) and the trace of the CRLB of the inverse depth as a function of the number of frames, for different camera motion parameters (bottom row). (a), (b): x_f = 1, y_f = 1, ω_x = ω_y = ω_z = 0; (c), (d): uniform acceleration, ω_x = ω_y = ω_z = 0; (e), (f): uniform acceleration, ω_x = ω_y = ω_z = 1 degree/frame. The solid line shows the CRLB for the unbiased estimate and the dotted line for the biased one.

The conclusion that can be drawn from this analysis is that the parameters that affect the bias the most are the camera angular motion values. For small rotation, the bias in the estimate is negligible. While this can be understood mathematically from the expression for the bias as derived above, we do not yet have a physical interpretation for it.

D. Bias in 3D Face Modeling

Having analyzed the significance of the bias and the effect of camera motion parameters on it, we now study its effect on a practical 3D reconstruction problem. For this we require a dataset with true depth values (probably obtained from a range scanner) and a video sequence of the same scene. Our interest has been the problem of face reconstruction. For this particular problem, we decided to use the database available on the World Wide Web at http://sampl.eng.ohio-state.edu/~sampl/data/3DDB/RID/minolta/faces-hands.1299/index.html. This database includes the 3D depth model obtained from a range scanner and a frontal image of the face for texture mapping. We used the 3D model and the texture map to create a sequence of images after specifying the camera motion. Given this sequence of images, we estimate the 3D model using a 3D face reconstruction algorithm [41] (not explained in this paper) and the bias in the reconstruction using equation (9).¹

Fig. 5. (a) shows an image from the input video sequence; (b) and (c) are two examples of the 3D face reconstruction.

We present here the results on the first five face models in the above mentioned database. Following the convention on the website, we refer to the five subjects as "frame001" to "frame005". The depth map for all the five subjects is shown in the first row of Figure 6. The details of the face reconstruction technique are available in [41].

The bias is shown in the form of a gray scale image in the second row of the same figure. The figures show that the bias has an average value with some variation around it. In Table I, we show that the peak value of the bias is a significant percentage of the true depth value. This happens only for a few points; however, it has significant impact on the 3D face model because of interpolation techniques which, invariably, are a part of any method to build 3D models.

Fig. 6. The reconstructed 3D depth maps are shown in the upper row. The lower row shows the estimated bias. The first column is the result on "frame001" of the dataset, the second column for "frame002", and so on.

The third and fourth columns in Table I represent the root mean square (RMS) error of the reconstruction, expressed as a percentage of the true depth and calculated before and after bias compensation. The change in the average error after bias compensation is very small. However, by itself, this number is misleading. The average error in reconstruction may be small; however, even one outlier has the potential to create a very poor reconstruction. Hence it is very important to compensate for the bias in problems related to 3D reconstruction from monocular video.

¹ An example of the output of our face reconstruction algorithm (not an example from the above database), described in [41], is shown in Figure 5.


TABLE I
Effect of Bias on 3D Face Reconstruction

Subject Index    Peak % Bias    Avg. % Error (before)    Avg. % Error (after)
1 (frame001)         30                3.8                      3.6
2 (frame002)         34                3.2                      3.1
3 (frame003)         29                3.0                      2.7
4 (frame004)         21                3.9                      3.7
5 (frame005)         26                3.2                      3.0

VI. Conclusion

Traditionally, the analysis of the accuracy of 3D reconstruction has focused on the error covariance of the estimate. In this paper, we have pointed out that there is another source of error in the SfM problem, namely the bias in the estimate. There exist interesting parallels between our mathematical result and certain experimental observations in psychophysics, which we have pointed out. Our derivation of the bias term was based on the fact that the solution of a least squares estimation problem with a noisy system matrix is statistically biased. The system matrix in SfM contains the positions of the features, which can never be obtained exactly. A minimum error bound (i.e. a generalized Cramer-Rao lower bound) for SfM is proposed after incorporating the bias term. Simulations were carried out in order to show the effects of the different camera motion parameters on the bias. It was observed that the bias is negligibly small if the camera angular motion is small. The effect of the bias on a practical 3D face reconstruction problem was also studied.


Appendix

I. Proof of Bias Expression

We give the main steps of the proof of Theorem 1 below.

Let the errors in the different variables be expressed as follows: x_i = x̄_i + δx_i, y_i = ȳ_i + δy_i, x_f = x̄_f + δx_f, y_f = ȳ_f + δy_f, p_i = p̄_i + δp_i, q_i = q̄_i + δq_i, ω_x = ω̄_x + δω_x, ω_y = ω̄_y + δω_y, ω_z = ω̄_z + δω_z, where the over-bars represent the unknown true values of the parameters and the observed (measured) values are represented without the over-bars. For convenience, let us define

    M ≜ A^T A = diag[(x_i − x_f)² + (y_i − y_f)²]_{i=1,...,N} = diag[m_ii]_{i=1,...,N},

    v ≜ A^T b = [(x_1 − x_f)(p_1 − r_1^T Ω) + (y_1 − y_f)(q_1 − s_1^T Ω), ..., (x_N − x_f)(p_N − r_N^T Ω) + (y_N − y_f)(q_N − s_N^T Ω)]^T
      ≜ [(x_1 − x_f) v_{p_1} + (y_1 − y_f) v_{q_1}, ..., (x_N − x_f) v_{p_N} + (y_N − y_f) v_{q_N}]^T ≜ [v_1, ..., v_N]^T.   (12)

Expanding ĥ in a Taylor series around the true value h̄, assuming the mean deviation in that region to be zero (i.e. E[δp_i] = E[δq_i] = E[δx_i] = E[δy_i] = E[δx_f] = E[δy_f] = E[δω_x] = E[δω_y] = E[δω_z] = 0) and all the components δp_i, δq_i, δx_i, δy_i, δx_f, δy_f, δω_x, δω_y, δω_z to be mutually uncorrelated, we can express

    E[ĥ] ≈ h̄ + Σ_{i=1}^{N} ( (∂²ĥ/∂δp_i²) E[δp_i²]/2 + (∂²ĥ/∂δq_i²) E[δq_i²]/2 )
         + (∂²ĥ/∂δω_x²) E[δω_x²]/2 + (∂²ĥ/∂δω_y²) E[δω_y²]/2 + (∂²ĥ/∂δω_z²) E[δω_z²]/2
         + Σ_{i=1}^{N} ( (∂²ĥ/∂δx_i²) E[δx_i²]/2 + (∂²ĥ/∂δy_i²) E[δy_i²]/2 )
         + (∂²ĥ/∂δx_f²) E[δx_f²]/2 + (∂²ĥ/∂δy_f²) E[δy_f²]/2,   (13)

where all the partials are computed at the true value of the depth. In the absence of any measurement noise, the expected value of the estimate obtained from the least squares solution should equal the true value h̄. However, since there exist errors in the measurement model, the estimate is biased, and the sum of the last nine terms on the right-hand side of (13) represents the total bias in the estimate.

In order to calculate the bias, we need to compute the derivatives in (13). We can compute all the derivatives using the fact that for an arbitrary invertible matrix Q [42],

    ∂Q^{−1}/∂x = −Q^{−1} (∂Q/∂x) Q^{−1}.   (14)

Note that ∂M/∂δp_i = ∂M/∂δq_i = ∂M/∂δω_x = ∂M/∂δω_y = ∂M/∂δω_z = 0, since M does not involve these variables. Also, ∂²v/∂δp_i² = ∂²v/∂δq_i² = ∂²v/∂δx_f² = ∂²v/∂δy_f² = ∂²v/∂δω_x² = ∂²v/∂δω_y² = ∂²v/∂δω_z² = 0, since v is a linear function of these variables. Since ĥ = M^{−1} v, we can now compute the following terms:

    ∂ĥ/∂δp_i = −M^{−1} (∂M/∂δp_i) M^{−1} v + M^{−1} (∂v/∂δp_i) = M^{−1} (∂v/∂δp_i),
    ∂²ĥ/∂δp_i² = −M^{−1} (∂M/∂δp_i) M^{−1} (∂v/∂δp_i) + M^{−1} (∂²v/∂δp_i²) = 0,
    ∂ĥ/∂δω_x = −M^{−1} (∂M/∂δω_x) M^{−1} v + M^{−1} (∂v/∂δω_x) = M^{−1} (∂v/∂δω_x),
    ∂²ĥ/∂δω_x² = −M^{−1} (∂M/∂δω_x) M^{−1} (∂v/∂δω_x) + M^{−1} (∂²v/∂δω_x²) = 0.   (15)

Following exactly similar steps, we get ∂²ĥ/∂δω_y² = ∂²ĥ/∂δω_z² = 0. Hence (13) simplifies to

    E[ĥ] ≈ h̄ + Σ_{i=1}^{N} ( (∂²ĥ/∂δx_i²) E[δx_i²]/2 + (∂²ĥ/∂δy_i²) E[δy_i²]/2 ) + (∂²ĥ/∂δx_f²) E[δx_f²]/2 + (∂²ĥ/∂δy_f²) E[δy_f²]/2,   (16)

which is to be expected, since the system matrix A depends only on {x_i, y_i} and (x_f, y_f). Next, we compute the derivatives with respect to deviations from the FOE (δx_f, δy_f):

    ∂ĥ/∂δx_f = −M^{−1} (∂M/∂δx_f) M^{−1} v + M^{−1} (∂v/∂δx_f),
    ∂²ĥ/∂δx_f² = 2M^{−1} (∂M/∂δx_f) M^{−1} (∂M/∂δx_f) M^{−1} v − M^{−1} (∂²M/∂δx_f²) M^{−1} v
               − 2M^{−1} (∂M/∂δx_f) M^{−1} (∂v/∂δx_f) + M^{−1} (∂²v/∂δx_f²),
    ∂²ĥ/∂δy_f² = 2M^{−1} (∂M/∂δy_f) M^{−1} (∂M/∂δy_f) M^{−1} v − M^{−1} (∂²M/∂δy_f²) M^{−1} v
               − 2M^{−1} (∂M/∂δy_f) M^{−1} (∂v/∂δy_f) + M^{−1} (∂²v/∂δy_f²).   (17)

Computing the derivatives of M and v with respect to (δx_f, δy_f), we get

    ∂M/∂δx_f = diag[−2(x_i − x_f)]_{i=1,...,N},    ∂²M/∂δx_f² = 2I_{N×N},
    ∂M/∂δy_f = diag[−2(y_i − y_f)]_{i=1,...,N},    ∂²M/∂δy_f² = 2I_{N×N},
    ∂v/∂δx_f = [−v_{p_1}, ..., −v_{p_N}]^T,    ∂v/∂δy_f = [−v_{q_1}, ..., −v_{q_N}]^T,
    M^{−1} (∂M/∂δx_f) = diag[−2(x_i − x_f)/m_ii]_{i=1,...,N},
    M^{−1} (∂v/∂δx_f) = [−v_{p_1}/m_11, ..., −v_{p_N}/m_NN]^T.   (18)

Using (18), we can now compute some of the terms in the expression for ∂²ĥ/∂δx_f² and ∂²ĥ/∂δy_f² in (17):

    M^{−1} (∂M/∂δx_f) M^{−1} (∂M/∂δx_f) M^{−1} v = [4(x_1 − x_f)² v_1/m_11³, ..., 4(x_N − x_f)² v_N/m_NN³]^T,
    M^{−1} (∂²M/∂δx_f²) M^{−1} v = [2v_1/m_11², ..., 2v_N/m_NN²]^T,
    M^{−1} (∂M/∂δx_f) M^{−1} (∂v/∂δx_f) = [2(x_1 − x_f) v_{p_1}/m_11², ..., 2(x_N − x_f) v_{p_N}/m_NN²]^T.   (19)

Similar expressions can be obtained for the partial derivatives with respect to δy_f. Substituting the above expressions in (17), we get the expression for one of the bias terms in (16) as

    ∂²ĥ/∂δx_f² + ∂²ĥ/∂δy_f² = 0.   (20)

We now need to compute the derivatives with respect to the image coordinates (δx_i, δy_i):

    ∂ĥ/∂δx_i = −M^{−1} (∂M/∂δx_i) M^{−1} v + M^{−1} (∂v/∂δx_i),
    ∂²ĥ/∂δx_i² = 2M^{−1} (∂M/∂δx_i) M^{−1} (∂M/∂δx_i) M^{−1} v − M^{−1} (∂²M/∂δx_i²) M^{−1} v
               − 2M^{−1} (∂M/∂δx_i) M^{−1} (∂v/∂δx_i) + M^{−1} (∂²v/∂δx_i²),
    ∂²ĥ/∂δy_i² = 2M^{−1} (∂M/∂δy_i) M^{−1} (∂M/∂δy_i) M^{−1} v − M^{−1} (∂²M/∂δy_i²) M^{−1} v
               − 2M^{−1} (∂M/∂δy_i) M^{−1} (∂v/∂δy_i) + M^{−1} (∂²v/∂δy_i²).   (21)

The derivatives of M with respect to {δx_i, δy_i} are

    ∂M/∂δx_i = diag[0, ..., 0, 2(x_i − x_f), 0, ..., 0],    ∂²M/∂δx_i² = diag[0, ..., 0, 2, 0, ..., 0].   (22)

Similar expressions can be obtained for the derivatives with respect to δy_i by substituting y for x. Let us now denote the partial of a function f_i = f(x_i, y_i) with respect to x_i by f_{i_x} and the second partials by f_{i_xx}, f_{i_xy}, f_{i_yy}. Thus r_{i_x} = [y_i, −2x_i, 0]^T, r_{i_y} = [x_i, 0, 1]^T, s_{i_x} = [0, −y_i, 1]^T, s_{i_y} = [2y_i, −x_i, 0]^T, r_{i_xx} = [0, −2, 0]^T, r_{i_yy} = [0, 0, 0]^T, s_{i_xx} = [0, 0, 0]^T, s_{i_yy} = [2, 0, 0]^T.² The derivatives of v with respect to (δx_i, δy_i) are

    ∂v_i/∂δx_i = v_{p_i} + (x_i − x_f) (∂v_{p_i}/∂δx_i) + (y_i − y_f) (∂v_{q_i}/∂δx_i)
               = v_{p_i} − (x_i − x_f) r_{i_x}^T Ω − (y_i − y_f) s_{i_x}^T Ω ≜ v_{i_x},
    ∂v_i/∂δy_i = v_{q_i} + (x_i − x_f) (∂v_{p_i}/∂δy_i) + (y_i − y_f) (∂v_{q_i}/∂δy_i)
               = v_{q_i} − (x_i − x_f) r_{i_y}^T Ω − (y_i − y_f) s_{i_y}^T Ω ≜ v_{i_y},
    ∂²v_i/∂δx_i² = −2 r_{i_x}^T Ω + 2(x_i − x_f) ω_y ≜ v_{i_xx},
    ∂²v_i/∂δy_i² = −2 s_{i_y}^T Ω − 2(y_i − y_f) ω_x ≜ v_{i_yy}.   (23)

In the above calculations, we have made the simplifying assumption that (p_i, q_i) do not depend on (δx_i, δy_i). This can be easily relaxed and a more precise expression for the partials computed. Using (21) and (23), we now compute each of the terms in (∂²ĥ/∂δx_i², ∂²ĥ/∂δy_i²) in (16):

    M^{−1} (∂M/∂δx_i) M^{−1} (∂M/∂δx_i) M^{−1} v = [0, ..., 0, 4(x_i − x_f)² v_i/m_ii³, 0, ..., 0]^T,
    M^{−1} (∂²M/∂δx_i²) M^{−1} v = [0, ..., 0, 2v_i/m_ii², 0, ..., 0]^T,
    M^{−1} (∂M/∂δx_i) M^{−1} (∂v/∂δx_i) = [0, ..., 0, 2(x_i − x_f) v_{i_x}/m_ii², 0, ..., 0]^T,
    M^{−1} (∂M/∂δy_i) M^{−1} (∂v/∂δy_i) = [0, ..., 0, 2(y_i − y_f) v_{i_y}/m_ii², 0, ..., 0]^T,
    M^{−1} (∂²v/∂δx_i²) = [0, ..., 0, v_{i_xx}/m_ii, 0, ..., 0]^T,
    M^{−1} (∂²v/∂δy_i²) = [0, ..., 0, v_{i_yy}/m_ii, 0, ..., 0]^T.   (24)

Then the expression for the i-th component of ∂²ĥ/∂δx_i² + ∂²ĥ/∂δy_i² is

    [∂²ĥ/∂δx_i² + ∂²ĥ/∂δy_i²]_i = 8[(x_i − x_f)² + (y_i − y_f)²] v_i/m_ii³ − 4v_i/m_ii²
                                  − 4[(x_i − x_f) v_{i_x} + (y_i − y_f) v_{i_y}]/m_ii² + (v_{i_xx} + v_{i_yy})/m_ii
    = (4/m_ii²) [(x_i − x_f)² r_{i_x} + (y_i − y_f)² s_{i_y}] Ω
    + (4/m_ii²) [(x_i − x_f)(y_i − y_f)(s_{i_x} + r_{i_y})] Ω
    + (2/m_ii) [(x_i − x_f) ω_y − (y_i − y_f) ω_x − (r_{i_x} + s_{i_y}) Ω].   (25)

Substituting (20) and (25) in (16), we obtain the expression of the bias in Theorem 1. We have computed the bias for the inverse depth variables in (2) only. In a similar manner, we can estimate the bias in the camera motion parameters also; however, the calculations would be more cumbersome. Certain simplifying assumptions had to be made in order to keep the algebraic manipulations tractable. However, we believe that the advantage of obtaining approximate analytical expressions outweighs the disadvantages of the approximation.

² We use the subscripts x, y for brevity of notation, but the derivatives are actually with respect to {δx_i, δy_i}.
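As a sanity check on the algebra, the cancellation in (20) can be verified symbolically for a single feature point (a sketch using sympy; since M is diagonal, the scalar case N = 1 suffices):

```python
import sympy as sp

# Symbolic check of (20) for one feature point: the second derivatives of
# h = v/m with respect to the FOE coordinates cancel exactly.
x, y, xf, yf, p, q, wx, wy, wz = sp.symbols('x y x_f y_f p q w_x w_y w_z')
Omega = sp.Matrix([wx, wy, wz])
r = sp.Matrix([x * y, -(1 + x ** 2), y])      # r_i of (3)
s = sp.Matrix([1 + y ** 2, -x * y, -x])       # s_i of (3)
vp = p - (r.T * Omega)[0]                     # v_p_i = p_i - r_i^T Omega
vq = q - (s.T * Omega)[0]                     # v_q_i = q_i - s_i^T Omega
m = (x - xf) ** 2 + (y - yf) ** 2             # m_ii of (7)
h = ((x - xf) * vp + (y - yf) * vq) / m       # h_hat = v / m
print(sp.simplify(sp.diff(h, xf, 2) + sp.diff(h, yf, 2)))   # -> 0
```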

References

[1] J.T. Todd, “Theoretical and biological limitations on the visual perception of three-dimensional structure from motion,” in High Level Motion Processing: Computational, Neurobiological and Psychophysical Perspectives, 1998.
[2] H.C. Longuet-Higgins, “A computer algorithm for reconstructing a scene from two projections,” Nature, vol. 293, pp. 133–135, September 1981.
[3] R.Y. Tsai and T.S. Huang, “Estimating 3-d motion parameters of a rigid planar patch: I,” IEEE Trans. on Acoustics, Speech and Signal Processing, vol. 29, pp. 1147–1152, December 1981.
[4] T.J. Broida and R. Chellappa, “Estimation of object motion parameters from noisy images,” IEEE Trans. on Pattern Analysis and Machine Intelligence, vol. 8, pp. 90–99, 1986.
[5] T.J. Broida and R. Chellappa, “Performance bounds for estimating three-dimensional motion parameters from a sequence of noisy images,” Journal of the Optical Society of America A, vol. 6, pp. 879–889, 1989.
[6] T.J. Broida and R. Chellappa, “Estimating the kinematics and structure of a rigid object from a sequence of monocular images,” IEEE Trans. on Pattern Analysis and Machine Intelligence, vol. 13, pp. 497–513, 1991.
[7] C. Tomasi and T. Kanade, “Shape and motion from image streams under orthography: A factorization method,” International Journal of Computer Vision, vol. 9, pp. 137–154, November 1992.
[8] Z. Zhang and O. Faugeras, 3D Dynamic Scene Analysis, Springer-Verlag, 1992.
[9] O.D. Faugeras, Three-Dimensional Computer Vision: A Geometric Viewpoint, MIT Press, 1993.
[10] R. Szeliski and S.B. Kang, “Recovering 3d shape and motion from image streams using non-linear least squares,” Journal of Visual Computation and Image Representation, vol. 5, pp. 10–28, 1994.
[11] A. Azarbayejani and A. Pentland, “Recursive estimation of motion, structure, and focal length,” IEEE Trans. on Pattern Analysis and Machine Intelligence, vol. 17, pp. 562–575, 1995.
[12] G.P. Stein and A. Shashua, “Direct estimation of motion and extended scene structure from a moving stereo rig,” in Conference on Computer Vision and Pattern Recognition, 1998, pp. 211–218.
[13] A. Chiuso and S. Soatto, “3d motion and structure causally integrated over time: Theory and practice,” Tech. Rep. ESSRL 99-003, Washington University, Saint Louis, 1999.
[14] R.I. Hartley and A. Zisserman, Multiple View Geometry in Computer Vision, Cambridge University Press, 2000.
[15] J. Oliensis, “A critique of structure-from-motion algorithms,” Computer Vision and Image Understanding, vol. 84, no. 3, pp. 407–408, December 2001.
[16] J. Oliensis and Y. Genc, “Fast and accurate algorithms for projective multi-image structure from motion,” IEEE Trans. on Pattern Analysis and Machine Intelligence, vol. 23, no. 6, pp. 546–559, June 2001.
[17] J. Weng, N. Ahuja, and T.S. Huang, “Optimal motion and structure estimation,” IEEE Trans. on Pattern Analysis and Machine Intelligence, vol. 15, pp. 864–884, September 1993.
[18] G.S. Young and R. Chellappa, “Statistical analysis of inherent ambiguities in recovering 3-d motion from a noisy flow field,” IEEE Trans. on Pattern Analysis and Machine Intelligence, vol. 14, pp. 995–1013, October 1992.
[19] K. Daniilidis and H.H. Nagel, “The coupling of rotation and translation in motion estimation of planar surfaces,” in Conference on Computer Vision and Pattern Recognition, 1993, pp. 188–193.
[20] S. Soatto and R. Brockett, “Optimal structure from motion: Local ambiguities and global estimates,” in Conference on Computer Vision and Pattern Recognition, 1998, pp. 282–288.
[21] Y. Ma, J. Kosecka, and S. Sastry, “Linear differential algorithm for motion recovery: A geometric approach,” International Journal of Computer Vision, vol. 36, pp. 71–89, January 2000.
[22] Z. Sun, V. Ramesh, and A.M. Tekalp, “Error characterization of the factorization method,” Computer Vision and Image Understanding, vol. 82, pp. 110–137, May 2001.
[23] D.D. Morris, K. Kanatani, and T. Kanade, “3d model accuracy and gauge fixing,” Tech. Rep., Carnegie Mellon University, Pittsburgh, 2000.
[24] A. Roy Chowdhury, Statistical Analysis of 3D Modeling From Monocular Video Streams, PhD Thesis, University of Maryland, College Park, 2002.
[25] A. Roy Chowdhury and R. Chellappa, “Stochastic approximation and rate-distortion analysis for robust structure and motion estimation,” accepted to International Journal of Computer Vision, 2003.
[26] W. Rudin, Principles of Mathematical Analysis, 3rd Edition, McGraw-Hill, 1976.
[27] K. Daniilidis and M.E. Spetsakis, “Understanding noise sensitivity in structure from motion,” in VisNav93, 1993.
[28] K.I. Kanatani, “Unbiased estimation and statistical analysis of 3-d rigid motion from two views,” IEEE Trans. on Pattern Analysis and Machine Intelligence, vol. 15, no. 1, pp. 37–50, January 1993.
[29] J. Oliensis, “A multi-frame structure-from-motion algorithm under perspective projection,” International Journal of Computer Vision, vol. 34, pp. 1–30, August 1999.
[30] C. Fermuller, D. Shulman, and Y. Aloimonos, “The statistics of optical flow,” Computer Vision and Image Understanding, vol. 82, 2001.
[31] A. Roy Chowdhury and R. Chellappa, “Statistical error propagation in 3d modeling from monocular video,” in CVPR Workshop on Statistical Analysis in Computer Vision, 2003.
[32] V. Nalwa, A Guided Tour of Computer Vision, Addison Wesley, 1993.
[33] S. Srinivasan, “Extracting structure from optical flow using fast error search technique,” International Journal of Computer Vision, vol. 37, pp. 203–230, 2000.
[34] W.A. Fuller, Measurement Error Models, Wiley, 1987.
[35] S. Van Huffel and J. Vandewalle, The Total Least Squares Problem, SIAM Frontiers in Applied Mathematics, 1991.
[36] R.H.S. Carpenter, Movements of the Eye, Pion, London, 1988.
[37] J. Shao, Mathematical Statistics, Springer-Verlag, 1998.
[38] G.S. Young and R. Chellappa, “3-d motion estimation using a sequence of noisy stereo images: Models, estimation, and uniqueness results,” IEEE Trans. on Pattern Analysis and Machine Intelligence, vol. 12, no. 8, pp. 735–759, August 1990.
[39] K. Kanatani, Statistical Optimization for Geometric Computation: Theory and Practice, North-Holland, 1996.
[40] Z.Y. Zhang, “Determining the epipolar geometry and its uncertainty: A review,” International Journal of Computer Vision, vol. 27, pp. 161–195, March 1998.
[41] A. Roy Chowdhury and R. Chellappa, “Face reconstruction from monocular video using uncertainty analysis and a generic model,” accepted to Computer Vision and Image Understanding, 2003.
[42] T. Kailath, Linear Systems, Prentice-Hall, 1980.