International Journal of Computer Vision 31(1), 51–82 (1999)
© 1999 Kluwer Academic Publishers. Manufactured in The Netherlands.
Robust Optical Flow Computation Based on Least-Median-of-Squares Regression

E.P. ONG
The Institute of Microelectronics, 11 Science Park Rd, Science Park II, Singapore 117685
[email protected]
M. SPANN School of Electronic and Electrical Engineering, University of Birmingham, Edgbaston, Birmingham B15 2TT, UK
[email protected]
Received May 20, 1997; Accepted September 28, 1998
Abstract. An optical flow estimation technique is presented which is based on the least-median-of-squares (LMedS) robust regression algorithm, enabling more accurate flow estimates to be computed in the vicinity of motion discontinuities. The flow is computed in a blockwise fashion using an affine model. Through the use of overlapping blocks coupled with a block shifting strategy, redundancy is introduced into the computation of the flow. This eliminates blocking effects common in most other techniques based on blockwise processing and also allows flow to be accurately computed in regions containing three distinct motions. A multiresolution version of the technique is also presented, again based on LMedS regression, which enables image sequences containing large motions to be effectively handled. An extensive set of quantitative comparisons with a wide range of previously published methods is carried out using synthetic, realistic (computer-generated images of natural scenes with known flow) and natural images. Both angular and absolute flow errors are calculated for those sequences with known optical flow. The displaced frame difference error, used extensively in video compression, is used for those natural scenes with unknown flow. In all of the sequences tested, a comparison with those methods that produce a dense flow field (greater than 80% spatial coverage) shows that the LMedS technique produces the least error, irrespective of the error measure used.

Keywords: least-median-of-squares robust regression, optical flow estimation, affine motion model, motion discontinuity detection, image sequence analysis
1. Introduction
The computation of optical flow has been an active field of research for the past two decades. The accurate computation of the optical flow from sequences of images is a fundamental stage prior to many subsequent processing tasks, such as image segmentation, image compression, robot navigation and the computation of structure from motion.
Traditional methods for computing optical flow are based mainly on the brightness constancy assumption (a form of data conservation constraint) and the assumption that the flow field will be smooth (spatial coherence constraint). Methods that are based only on these two assumptions cannot provide good estimates of the flow at motion discontinuities (corresponding to boundaries of moving objects), occluded/dis-occluded
regions, and intensity discontinuities (for example on sharp edges or in highly textured regions) where the assumptions will be violated. In addition, the brightness constancy assumption may not hold at specific points, or for some subset of points within an image region, when the region contains transparency, specular reflections, shadows, or fragmented occlusion (for example, looking through the branches of a tree or through the gaps in a fence). Moreover, in real-world images the velocity fields are neither locally constant nor globally smooth; rather, they are piecewise continuous. The usual global smoothness constraint has an adverse effect on the estimated optical flow field because it blurs the motion discontinuities.

Among these different problems, the most difficult one is that of handling occlusions between different moving objects in a scene. Occlusions generate discontinuities in the optical flow field and give rise to regions in which no valid motion information is available. The usual computational approaches are not able to cope simultaneously with all of these problems because the constraints they introduce on the desired motion field only imperfectly account for the complexity of the real-world scene.

The fundamental problem in the estimation of optical flow is that the computation of the flow involves the pooling of constraints over some spatial neighbourhood. To gain accurate estimates, this region must be sufficiently large to constrain the solution (the standard aperture problem). On the other hand, the larger the region considered for the computation of optical flow, the more likely it is to contain multiple motions or depth boundaries with competing constraints. This has been referred to as the generalised aperture problem. The pooling of constraints over a neighbourhood is usually performed using least-squares estimation. Such least-squares formulations average the multiple motions that may be present in the neighbourhood, resulting in incorrect motion estimates from the data conservation constraint and oversmoothing by the spatial coherence constraint. For example, when two motions are present in a neighbourhood, one set of constraints will be consistent with one of the motions and another set will be consistent with the other motion. When examining one of the motions, the constraints for the other motion will appear as gross errors (usually referred to as "outliers" in the statistical literature). Least-squares estimation is well known for its sensitivity to the presence of outliers and cannot give correct parameter estimates when any outlier is present in the given data. In order to compute optical flow accurately and robustly,
the sensitivity of the recovered optical flow to violations of the single motion assumption must be reduced by detecting and rejecting the outliers.

Early approaches to motion and disparity estimation, although not cast in the current terminology of "robust" statistics, were nevertheless aimed at dealing with situations where more than one object motion or surface was present. These methods can generally be classified as mixtures of clustering algorithms and Hough transform or relaxation-labelling voting schemes designed to achieve robust parameter estimation. Barnard and Thompson (1980) proposed a relaxation labelling technique enforcing local continuity constraints for the computation of stereo disparity. Fennema and Thompson (1979) proposed a clustering scheme in which the Hough transform and the optical flow constraint equation were used for computing the optical flow. This is a global approach, since the Hough transform is applied to cluster all the pixels in the image. The method is immune to the incorrect motion constraints generated along motion boundaries and will not be confused by multiple moving objects in the scene; however, the technique is only suitable for images containing purely translational motions. Schunck (1989) proposed a pixel-based method for computing optical flow based on clustering the motion constraint lines belonging to the pixels in a small local neighbourhood. The flow is obtained by clustering the intersections of the constraint lines of the neighbourhood, and the most dominant group of intersections identifies the optical flow of the central pixel in the neighbourhood. However, this method relies heavily on the accuracy of the constraint line at the central pixel which, if severely corrupted by noise or violating the brightness constancy assumption, results in an erroneous flow.

More recent approaches to the problem of motion discontinuities have been based on the use of Markov random fields and modelling the discontinuities using a line process (Heitz and Bouthemy, 1990; Heitz and Bouthemy, 1993; Konrad and Dubois, 1992; Murray and Buxton, 1987; Zhang and Hanauer, 1995). However, these methods tend to be slow in convergence and are sensitive to the selection of the model parameters. Since the values of these parameters cannot be determined automatically, they have to be obtained experimentally by trial and error in order to get good results, and different parameter values need to be used for distinctly different types of image sequences.

More recent "robust statistical" approaches have been generally aimed at the identification and
elimination of outliers in the data. The outliers are defined as data points that do not conform to the postulated model and result from noise or model discontinuities at object or motion boundaries. One approach is the use of robust re-descending M-estimators (Odobez and Bouthemy, 1994a, 1994b; Black and Anandan, 1996; Black and Rangarajan, 1996; Bober and Kittler, 1994) aimed at the robust computation of optical flow. These methods are generally based on the use of error norms more robust than the squared error, but they are still incapable of effectively handling regions comprising two or more distinct motions due to their relatively low breakdown point. A novel use of the M-estimator is to combine it with a robust genetic partitioning algorithm to iteratively segment and compute the motion parameters from a noisy optical flow field (Huang et al., 1995). This technique computes the 3D motion parameters from the optical flow field using a set of linear homogeneous equations constraining the motion parameters. It is only applicable to sparse flow data because of its exhaustive search strategy during the partitioning stage.

An interesting approach to representing images generated by multiple models, each with its own spatial support, is proposed by Darrell and Pentland (1995). Such an approach could be applied to the robust computation of flow fields containing discontinuities. A global strategy is adopted which iteratively determines how many models make up the data, followed by determining the spatial support of each model. An M-estimation procedure is used to compute the model parameters given the supports. This technique has been applied to the segmentation of tracked feature points using a rigid body motion model but has not been applied to the estimation of dense flow fields. Note also that, for this and the previous example, requiring a segmentation prior to computing the optical flow field over the region of support of each segmented region would not in general produce better results than a purely local flow estimation algorithm that handles motion discontinuities in an unconstrained fashion, quite apart from the significant difficulty of obtaining accurate motion-based segmentations.

Another class of techniques aims to achieve accurate estimation even when up to half of the data points are outlying. This is clearly useful in local optical flow estimation where the neighbourhood overlaps a motion discontinuity. One recent approach (Torr et al., 1994) uses the random sample consensus paradigm
(RANSAC) to perform an initial estimate of the Fundamental Matrix for structure and motion computation using matched features and the epipolar constraint. A random subset of the data is used to calculate the parameters, and this procedure is carried out a large number of times to guarantee that at least one of the sub-sets has a high probability of being outlier free. In the case of the RANSAC algorithm, the number of inliers is maximised. A subsequent stage of the algorithm proposes a segmentation of the data into a set of inliers and outliers based on a maximum-likelihood approach. This results in an adaptive threshold for rejecting points as outliers based on their distance to the epipolar line.

Clearly, computer vision algorithms are considerably more useful if they can perform their functions with little user intervention. Such algorithms are either free of tuning parameters, or they have an automatic way of determining the control parameters to be used when applied to different types of perceived environment. With this goal in mind, such an algorithm for motion estimation has been developed. The algorithm is based on the application of a robust estimator, the least-median-of-squares (LMedS) estimator, and it will be shown how the algorithm is able to overcome the 50% breakdown limitation of this estimator, thus providing accurate flow estimates in the vicinity of motion boundaries containing up to three distinct motions. Also, using a novel sampling strategy, it will be shown how more accurate flow estimates are produced inside regions. It will also be illustrated how the algorithm can be embedded in a multiresolution framework to provide better estimates of large motions.
2. Robust Computation of Optical Flow

2.1. Introduction
A motion estimation technique that is able to provide good estimates of the optical flow inside regions and in the vicinity of motion discontinuities and occlusions will be presented.1 This technique is a local method for computing optical flow and the flow vectors are computed using an affine model of the flow based on the motion constraint equation and LMedS regression. Overlapping blocks are used and each block is allowed to be shifted from its nominal location by a few pixels to minimise the robust scale estimate computed within the block. The optimum motion parameter vector at a
point is selected to be the parameters which correspond to the smallest robust scale estimate of the model-fitting among all the overlapping blocks. Points which are outliers to these overlapping blocks are rejected and no optical flow is computed at these positions. The use of overlapping blocks eliminates the block-effects commonly seen in local blockwise computation of the optical flow. The shifting of the blocks to obtain the smallest robust scale estimate also makes the motion estimates more robust when the block straddles regions of three motions or when it straddles regions of two motions with equal numbers of supporting data points. The technique, which is able to incorporate a multiresolution scheme and coarse-to-fine strategy, also gives reasonably good estimates of the optical flow for objects with large motion.
2.2. Least-Median-Squares (LMedS) Motion Estimator

The computation of the optical flow is based on the motion constraint equation (Horn and Schunck, 1981):

    ∇I(x, t)^T u(x, t) + I_t(x, t) = 0    (1)

where I(x, t) is the brightness function; ∇I(x, t) = (I_x(x, t), I_y(x, t))^T, with I_x(x, t), I_y(x, t) and I_t(x, t) the partial derivatives of I(x, t) with respect to space and time at the point (x, t); u = (u, v)^T is the flow vector with flow velocities u and v in the x and y directions; and x = (x, y)^T.

The flow field u can be modelled by a 2D affine motion model, defined as:

    u(x) = Ax + b    (2)

where A = (a0, a1; a2, a3) and b = (b0, b1)^T. An affine motion model has been shown to be sufficient to describe complex motions such as expansion, rotation and shear (Campani and Verri, 1992). Negahdaripour and Lee (1992) show that an affine motion model, which is a first-order approximation to the optical flow field, can be used for estimating flow from arbitrary scenes and any camera motion. In addition, they show that a method based on an affine model is more robust than other techniques that require knowledge of the full flow, or of information up to and including its second-order approximation.

Let the residual in modelling the motion field at a point x be r(x), where:

    r(x) = ∇I(x, t)^T u(x, t) + I_t(x, t)    (3)

The above equation can be rewritten as:

    r(x, Θ) = a(x)^T Θ − y(x)    (4)

where a(x) = (I_x, x I_x, y I_x, I_y, x I_y, y I_y)^T, Θ = (b0, a0, a1, b1, a2, a3)^T, and y(x) = −I_t(x). The motion parameters are computed using the LMedS robust regression method, which minimises the median of the squared residuals:

    Θ̂ = arg min_Θ med_x r(x, Θ)²    (5)

where x ∈ R, R being the neighbourhood for computing the optical flow. Given a data set of n observations, the algorithm repeatedly draws m sub-samples, each of p different observations, from the data set using a Monte-Carlo type technique, where p is the number of parameters in the model (p = 6 for the affine motion model). For each sub-sample, indexed by J, 1 ≤ J ≤ m, the corresponding parameters (denoted by Θ̂_J) are estimated from the p observations. In addition, the median of the squared residuals, denoted by M_J, is also determined, where:

    M_J = med_x r(x, Θ̂_J)²    (6)

The LMedS solution is the Θ̂ for which the corresponding M_J is the minimum among all the m different M_J's, i.e., Θ̂ = arg min_{Θ̂_J} med_x r(x, Θ̂_J)².

The probability that at least one of the m sub-samples consists of p good observations is given by:

    P = 1 − [1 − (1 − ε)^p]^m    (7)

where ε is the fraction of outliers that may be present in the data. Therefore, the minimum number of sub-samples required for a given P, ε, and p is given by:

    m = log(1 − P) / log[1 − (1 − ε)^p]    (8)
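To make the Monte-Carlo sub-sampling concrete, the following is a minimal Python sketch of the LMedS step of Eqs. (4)–(8), assuming numpy and assuming that the design-matrix rows a(x) and the targets y(x) = −I_t(x) have already been assembled for the n pixels of a block; the function name and interface are illustrative rather than the authors' implementation.

```python
import numpy as np

def lmeds_affine(A, y, P=0.95, eps=0.5, rng=None):
    """LMedS fit of the 6-parameter affine model (sketch of Eqs. (4)-(8)).

    A : (n, 6) rows a(x) = (Ix, x*Ix, y*Ix, Iy, x*Iy, y*Iy)
    y : (n,)   targets y(x) = -It(x)
    """
    rng = rng or np.random.default_rng()
    n, p = A.shape
    # Eq. (8): number of sub-samples so that, with probability P, at least
    # one sub-sample is outlier free (m = 191 for P = 0.95, eps = 0.5, p = 6).
    m = int(np.ceil(np.log(1.0 - P) / np.log(1.0 - (1.0 - eps) ** p)))
    best_theta, best_med = None, np.inf
    for _ in range(m):
        idx = rng.choice(n, size=p, replace=False)   # draw p observations
        try:
            theta = np.linalg.solve(A[idx], y[idx])  # exact fit to the sub-sample
        except np.linalg.LinAlgError:
            continue                                 # degenerate sub-sample
        med = np.median((A @ theta - y) ** 2)        # Eq. (6)
        if med < best_med:                           # keep the minimum-median fit
            best_theta, best_med = theta, med
    return best_theta, best_med
```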
Since the LMedS relative efficiency is poor in the presence of Gaussian noise, a single-step weighted least-squares procedure is performed based on the estimates of the LMedS step. First, an initial scale estimate is computed by (Rousseeuw and Leroy, 1987):

    σ⁰ = 1.4826 (1 + 5/(n − p)) √(med_x r(x, Θ̂)²)    (9)

The small-sample correction factor of 5/(n − p) was found to have little practical significance for this work. The initial scale estimate is used to determine an initial weight w(x) for each observation, where:

    w(x) = 1  if |r(x, Θ̂)/σ⁰| ≤ 2.5
    w(x) = 0  if |r(x, Θ̂)/σ⁰| > 2.5    (10)

These initial weights are then used to compute the final robust scale estimate:

    σ̂ = √[ ( Σ_{x∈R} w(x) r(x, Θ̂)² ) / ( Σ_{x∈R} w(x) − p ) ]    (11)

The final weights are computed using Eq. (10) by replacing σ⁰ by the final scale estimate σ̂. The motion parameters are then estimated using the weighted least-squares procedure:

    Θ̂ = arg min_Θ Σ_x w(x) r(x, Θ)²    (12)

The final parameters Θ̂ are obtained by solving (Draper and Smith, 1981):

    Θ̂ = (X^T W X)^{−1} X^T W y    (13)

where X = (a(x_1), a(x_2), ..., a(x_n))^T, x_i = (x_i, y_i) for i = 1, 2, ..., n, W = diag(w(x_1), w(x_2), ..., w(x_n)), and y = (y(x_1), y(x_2), ..., y(x_n))^T.
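A hedged sketch of this one-step weighted least-squares refinement (Eqs. (9)–(13)), under the same assumptions as the previous fragment:

```python
import numpy as np

def one_step_weighted_ls(A, y, theta, p=6):
    """Refine an LMedS fit: robust scale, 0/1 weights, then weighted LS."""
    n = A.shape[0]
    r = A @ theta - y
    # Eq. (9): initial scale from the median squared residual.
    sigma0 = 1.4826 * (1.0 + 5.0 / (n - p)) * np.sqrt(np.median(r ** 2))
    w = (np.abs(r / sigma0) <= 2.5).astype(float)        # Eq. (10)
    # Eq. (11): final robust scale computed from the inliers.
    sigma = np.sqrt(np.sum(w * r ** 2) / (np.sum(w) - p))
    w = (np.abs(r / sigma) <= 2.5).astype(float)         # final weights
    # Eq. (13): theta = (X^T W X)^{-1} X^T W y.
    WA = A * w[:, None]
    theta_hat = np.linalg.solve(WA.T @ A, WA.T @ y)
    return theta_hat, sigma, w
```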
2.3. Selection of Sub-samples
The above-mentioned LMedS motion estimator is used to compute the optical flow in each local neighbourhood from two successive frames of images. P is chosen to be 0.95, and ε is 0.5 such that 50% of outliers can be tolerated in the data set. For an affine model, p = 6, and hence the minimum number of sub-samples required is 191 in order to satisfy Eq. (7). Different ways of selecting the member points for each sub-sample are
possible. Previously, each sub-sample of 6 points was chosen as follows (Ong and Spann, 1995):

1. Pick a point randomly in order to centre a window for picking the 6 points.
2. Open a window of s × s pixels (6 × 6 is used throughout the experiments) centred on the point initially picked, and then randomly pick 6 points from within the window to compute the affine motion parameters. The 6 points chosen must have spatial gradients whose magnitudes are in the top three-quarters of all of the pixels in the block.

Note that for each sub-sample, 6 points are randomly picked from a smaller support region rather than from the entire block, because the points for computing the optical flow should be close together, yet drawn from a large enough area to tolerate the presence of outliers, so that there is a higher chance of obtaining a sub-sample which contains only good points. In addition, points with higher gradient magnitudes are used because points with zero or low gradient magnitudes convey no motion information.

The above approach has been found to give reasonably good results. However, it has been observed that the use of the magnitude of the spatial gradients may not be ideal for certain types of image sequences where edges are prominent: if points corresponding to these edges are used in the sub-samples, the accuracy of the algorithm is reduced. A sub-sampling strategy based on the use of the local orientation matrix is thus proposed. The orientation matrix A of a pixel x_o, with matrix elements A_(i,j), is computed as follows (Bigun et al., 1991):

    A_(i,j)(x_o) = (1 / 4π²) Σ_{x∈R} w(x) (∂I(x_o + x)/∂x_i) (∂I(x_o + x)/∂x_j)    (14)

where (x_i, x_j) = {(x, x), (x, y), (y, x), (y, y)} for (i, j) = {(1, 1), (1, 2), (2, 1), (2, 2)}, and w(x) is a Gaussian weighting function of the form:

    w(x, y) = (1 / 2πσ²) exp(−(x² + y²) / 2σ²)    (15)

where −w_w/2 ≤ x, y ≤ w_w/2, w_w corresponding to the width of the Gaussian weighting function and R
denoting the 2D space encompassed by the finite-extent Gaussian weighting function.

The distribution of the eigenvalues of the orientation matrix A = (A_(1,1), A_(1,2); A_(2,1), A_(2,2)) allows the degree of ambiguity in the local optical flow estimates to be determined. A is a positive-definite matrix and is the scatter matrix representing the distribution of energy about the two-dimensional plane through the origin of the spatio-temporal (x, y, t) domain (Bigun et al., 1991). A has two non-negative eigenvalues, λ1 and λ0, where λ1 ≥ λ0 ≥ 0, with corresponding eigenvectors u1 and u0. Three possible situations may be detected based on the values of the eigenvalues λ1 and λ0. Firstly, when the greylevel at a pixel location in the 2D image is constant, λ1 = λ0 = 0. Secondly, image regions containing one-dimensional features such as lines and edges will give λ1 ≫ λ0, where λ1 > 0 and λ0 ≥ 0. Finally, image regions containing two-dimensional features such as corners and textured surfaces will give λ1 ≥ λ0 > 0.

It has been observed that the eigenvalues of the orientation matrix provide a very good indication of the underlying greylevel image structure. The group of 6 points chosen for each sub-sample must be points where the smallest eigenvalue of the orientation matrix is non-zero. Regions where the flow is ambiguous generally correspond to uniformly grey regions or regions containing a single orientation, and such regions are generally in the minority within a processing block for scenes containing sufficient texture. Thus those points whose smallest eigenvalue is in the top three-quarters of all the points in the block are selected. This was found to work well in practice.
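As an illustration of this selection rule, the sketch below computes the smallest eigenvalue of the orientation matrix at every pixel, using scipy's gaussian_filter as a stand-in for the finite-extent Gaussian weighting of Eqs. (14)–(15); the smoothing width, and the omission of the constant 1/4π² factor (which does not affect the ranking), are assumptions made here for illustration.

```python
import numpy as np
from scipy.ndimage import gaussian_filter

def smallest_eigenvalue_map(I, sigma=1.5):
    """Smallest eigenvalue of the 2x2 orientation matrix (Eq. (14)) per pixel."""
    Iy, Ix = np.gradient(np.asarray(I, dtype=float))
    Axx = gaussian_filter(Ix * Ix, sigma)   # Gaussian-weighted gradient products
    Axy = gaussian_filter(Ix * Iy, sigma)
    Ayy = gaussian_filter(Iy * Iy, sigma)
    # Closed-form eigenvalues of the symmetric matrix [[Axx, Axy], [Axy, Ayy]].
    tr = 0.5 * (Axx + Ayy)
    disc = np.sqrt(0.25 * (Axx - Ayy) ** 2 + Axy ** 2)
    return tr - disc                        # lambda_0 (lambda_1 = tr + disc)

# Candidate pixels for the sub-samples: smallest eigenvalue in the top
# three-quarters of the block.
# lam0 = smallest_eigenvalue_map(I)
# candidates = lam0 > np.quantile(lam0, 0.25)
```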
2.4. Computation of Optical Flow Field
In the proposed framework, the LMedS motion estimator is applied to a set of overlapping blocks that sub-divides the image, each block overlapping its immediate neighbours by 50%. Let I(x, y) be the greylevel image data of the reference frame in the image sequence and R(x, y) be the point-set of pixel sites within an N_w × N_h block at block position (x, y):

    R(x, y) = {(x − N_w/2, y − N_h/2), (x − N_w/2 + 1, y − N_h/2), ..., (x + N_w/2 − 1, y + N_h/2 − 1)}    (16)
The motion parameters and the robust scale estimates, computed using the LMedS motion estimation method, are then functions of the block position (x, y), defined as Θ̂(R(x, y)) and σ̂(R(x, y)). An important step of the algorithm is to allow the block to be displaced from its nominal location by up to a quarter of the distance between adjacent blocks in both the x and y directions. The displaced block position for position (x, y), (Δ_x(x, y), Δ_y(x, y)), is taken to be that displacement which gives the smallest robust scale estimate σ̂:

    (Δ_x(x, y), Δ_y(x, y)) = arg min_{(x′, y′)} σ̂(R(x + x′, y + y′))    (17)

for {(x′, y′) : −N_w/4 ≤ x′ ≤ N_w/4, −N_h/4 ≤ y′ ≤ N_h/4}. The above procedure will produce, for each position of the block, a subset R_I of the original point-set R defined as follows:

    R_I(x + Δ_x, y + Δ_y) = {(x′, y′) ∈ R(x + Δ_x, y + Δ_y) : |r(x′, y′)| ≤ 2.5 σ̂(R(x + Δ_x, y + Δ_y))}    (18)

where r(x) is the residual at each pixel site (x, y) as defined above. Each set R_I consists of points that are inliers to the block. The final motion parameters Θ̂(R(x + Δ_x, y + Δ_y)) and robust scale estimate σ̂(R(x + Δ_x, y + Δ_y)) for the displaced block correspond to the optimally displaced window (Δ_x(x, y), Δ_y(x, y)) that gives the smallest robust scale estimate. The corresponding optimum motion parameters Θ̂(x′, y′) for each point (x′, y′) in the image can thus be expressed as follows:

    Θ̂(x′, y′) = arg min_{σ̂(R(x_i, y_j))} {Θ̂(R(x_i, y_j)) : (x′, y′) ∈ R_I(x_i, y_j)}    (19)
where (x_i, y_j) are the centre positions of all of the displaced blocks containing (x′, y′). These motion parameters Θ̂(x′, y′) are finally used to compute the flow velocities using the equation u(x) = Ax + b.

The above procedures are important because the overlapping neighbourhood strategy eliminates the block-effects commonly faced in local blockwise methods for computing the optical flow; block-effects prevent accurate estimation of the flow around the boundary regions of moving objects. In addition, the ability to shift the block to obtain the optimally displaced block position and hence the optimum motion
Figure 1. Effect of shifting the location of the neighbourhood for computing the optical flow.
parameters will prevent inaccurate parameter estimates when the block straddles regions containing three motions and none of the motions makes up 50% of the points in the block. Figure 1 shows such an example. Since the region corresponding to the right sphere consists of pixels that make up less than 50% of the total pixels in the straddled block, direct application of the LMedS estimator, which has a breakdown point of 50%, will give erroneous parameter estimates (A1, b1). However, the shifted block location causes more than 50% of the points in the block to correspond to the motion of the right sphere, and hence applying the LMedS estimator to compute the motion parameters in the shifted block gives the correct motion parameter estimates (A2, b2). Thus, the use of the motion parameters (A2, b2) gives the correct motion vector for the point "x". In the same way, the block shifting step can prevent inaccurate parameter estimates when the block straddles regions containing two motions with equal numbers of supporting points for each motion. With the overlapping block strategy, a point in the image may have no associated motion parameters, being an outlier to all of the overlapping blocks that contain it; such a point is one at which the optical flow cannot be computed, normally corresponding to regions of occlusion.
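The per-pixel selection rule of Eq. (19) can be sketched as follows, assuming that the LMedS estimator has already been run on every displaced overlapping block; the candidate-list data structure is purely illustrative.

```python
import numpy as np

def flow_from_block_candidates(shape, candidates):
    """Pick, per pixel, the parameters of the overlapping block with the
    smallest robust scale among blocks to which the pixel is an inlier.

    shape      : (H, W) image shape.
    candidates : iterable of (sigma, theta, inlier_pixels) per displaced block,
                 theta ordered as (b0, a0, a1, b1, a2, a3) as in Eq. (4).
    Pixels that are inliers to no block remain NaN (occlusions).
    """
    best_sigma = np.full(shape, np.inf)
    flow = np.full(shape + (2,), np.nan)
    for sigma, theta, inlier_pixels in candidates:
        b0, a0, a1, b1, a2, a3 = theta
        for (px, py) in inlier_pixels:
            if sigma < best_sigma[py, px]:
                best_sigma[py, px] = sigma
                flow[py, px, 0] = b0 + a0 * px + a1 * py   # u(x) = Ax + b
                flow[py, px, 1] = b1 + a2 * px + a3 * py
    return flow
```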
2.5. Multiresolution Scheme
When the image sequence involves objects moving with large displacements, single-resolution gradient-based motion estimation algorithms tend to yield under-estimates. In order to provide better estimates for objects moving with large displacements, the above-mentioned LMedS motion estimator has been embedded in a multiresolution scheme which utilises a coarse-to-fine strategy. The basic idea of the scheme is to obtain rough estimates of the motion parameters at the lowest resolution and then to refine the estimates using the higher resolution images.

In order to apply the multiresolution scheme, image pyramids must first be constructed from the original images. An image pyramid consists of layers of images stacked on top of each other, with each higher layer being half the size in width and height of the layer immediately below it. The bottom layer (i.e., the base) of the pyramid corresponds to the original image and is usually regarded as level 0 of the pyramid. In the proposed technique, a quadtree pyramid is used due to its simplicity. Let I(x, y, k) be the greylevel of the pixel site (x, y) at resolution level k of the image pyramid. For a quadtree pyramid:

    I(x, y, k) = Σ_{m=0}^{1} Σ_{n=0}^{1} (1/4) I(2x + m, 2y + n, k − 1)    (20)

Thus, level k of the quadtree image pyramid is obtained by averaging over the four corresponding pixels at level k − 1 and then sub-sampling. The motion estimate projected down the pyramid to level k, (Â^k, b̂^k), can be refined at level k according to the following regression equation (Meyer and Bouthemy, 1994):

    ∇I(x^k + 2δx̂^{k+1}, t + δt)^T (ΔA^k x^k + Δb^k) δt + I(x^k + 2δx̂^{k+1}, t + δt) − I(x^k, t) = 0    (21)
where (ΔA^k, Δb^k) is the refinement to (Â^k, b̂^k), x^k is the position in the grid at level k, and δx̂^{k+1} is the displacement at pixel location x^{k+1}, where x^{k+1} is the father of x^k. The above equation is linear with respect to ΔA^k and Δb^k. An over-constrained system of linear equations can be obtained if the above equation is considered for n points x^k within a given region (assuming that the n points in the region move with the same motion), and therefore the motion parameter estimates (ΔÂ^k, Δb̂^k) of (ΔA^k, Δb^k) can be obtained. In order to obtain good parameter estimates and discard the outlying points, LMedS regression is used to compute the initial incremental motion parameters ΔΘ̂_LMedS = {ΔÂ^k_LMedS, Δb̂^k_LMedS} by solving:

    ΔΘ̂_LMedS = arg min_{ΔΘ} med_x r(x, ΔΘ)²    (22)

where x ∈ R, R being the neighbourhood for computing the optical flow, r(x, ΔΘ) = a(x)^T ΔΘ − y(x), and a(x) = (I_x, x I_x, y I_x, I_y, x I_y, y I_y)^T, with:

    I_x = (∂/∂x) I(x^k + 2δx̂^{k+1}, t + δt)
    I_y = (∂/∂y) I(x^k + 2δx̂^{k+1}, t + δt)    (23)
    ΔΘ = (Δb_0^k, Δa_0^k, Δa_1^k, Δb_1^k, Δa_2^k, Δa_3^k)^T
    y(x) = −I(x^k + 2δx̂^{k+1}, t + δt) + I(x^k, t)
A single-step weighted least-squares procedure is then performed based on the estimates of the LMedS step, using a set of equations similar to Eqs. (9)–(13), to obtain the final desired incremental motion estimates ΔΘ̂ = {ΔÂ^k, Δb̂^k}. At the coarsest resolution level K, the estimate (Â^K, b̂^K) of (A^K, b^K) is obtained by solving the following equation using LMedS regression:

    ∇I(x^K, t)^T (A^K x^K + b^K) δt + I(x^K, t + δt) − I(x^K, t) = 0    (24)
At each subsequent finer resolution level, the motion parameter estimates at level k can be obtained as follows:

    Â^k = Σ_{i=k}^{K−1} ΔÂ^i + Â^K
                                                        (25)
    b̂^k = Σ_{i=k}^{K−1} 2^{i−k} Δb̂^i + 2^{K−k} b̂^K
where K is the coarsest resolution level and k is the current resolution level. These update equations are obtained by noting that the constant terms (the b's) are homogeneous to a scale change, whereas the linear terms (the A's) are dimensionless. At the finest resolution, each point in the image may have no associated motion parameters, or it may have more than one set of estimated motion parameters; the motion parameters corresponding to the smallest robust scale estimate are then assigned to each point. With the use of the incremental motion estimation scheme, bilinear interpolations of I and ∇I must be computed for points which do not lie on the discrete grid of the image. This interpolation supplies the greylevels and greylevel gradients used in the regression calculation of Eq. (21) for computing the LMedS estimates of ΔA and Δb, since the spatial co-ordinates in Eq. (21) are not integer lattice points but points projected to by the flow estimates computed at the previous resolution, prior to refinement at the current resolution.
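A minimal sketch of the quadtree pyramid of Eq. (20) and of the level-combination rule of Eq. (25) is given below; the dictionary-based interface for the per-level refinements is an assumption made for illustration.

```python
import numpy as np

def quadtree_pyramid(I, levels=3):
    """Eq. (20): each level averages 2x2 blocks of the level below."""
    pyr = [np.asarray(I, dtype=float)]
    for _ in range(1, levels):
        J = pyr[-1]
        h, w = (J.shape[0] // 2) * 2, (J.shape[1] // 2) * 2
        pyr.append(0.25 * (J[0:h:2, 0:w:2] + J[1:h:2, 0:w:2]
                           + J[0:h:2, 1:w:2] + J[1:h:2, 1:w:2]))
    return pyr   # pyr[0] is level 0 (finest), pyr[-1] the coarsest

def combine_levels(k, K, dA, db, A_K, b_K):
    """Eq. (25): accumulate refinements; only the b terms are rescaled."""
    A_k = sum(dA[i] for i in range(k, K)) + A_K
    b_k = sum(2.0 ** (i - k) * db[i] for i in range(k, K)) + 2.0 ** (K - k) * b_K
    return A_k, b_k
```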
2.6. Confidence Measure
The use of a confidence measure is a way of attempting to improve the motion estimates by rejecting unreliable flow vectors, at the expense of a lower density of the computed optical flow field. A simple way of ensuring that the motion estimates are reliable, in the case of the LMedS motion estimator, is to threshold based on the smallest eigenvalue of the matrix X^T W X. If this matrix is nearly degenerate or ill-conditioned, its smallest eigenvalue will be close to zero; the estimated motion parameters will then not be reliable and should be discarded. The use of the smallest eigenvalue of the matrix X^T W X as a confidence measure is similar to using the condition number in normal least-squares regression, the main difference being that the diagonal weighting matrix W is computed from the LMedS procedure and reflects the inlier/outlier partition of the data. Other, more involved, schemes which use certain confidence measures for thresholding the
optical flow vectors may also be used. For example, the current implementation of the LMedS motion estimator uses only pixels whose smallest orientation-matrix eigenvalue is in the top three-quarters of all the pixels in the region for computing the initial LMedS motion estimates. It is also possible to use only those pixels whose smallest orientation-matrix eigenvalue is above a certain threshold, both for computing the initial LMedS motion estimates and for computing the final motion estimates using re-weighted least squares; pixels whose smallest eigenvalue is below the given threshold are given a weight of zero and are never used in the computation of the motion estimates. Such an approach is likely to produce more reliable motion estimates, since the use of the smallest eigenvalue of the orientation matrix as a confidence measure is similar to the confidence measure proposed in Barron et al. (1994) for thresholding the optical flow vectors produced by Lucas and Kanade's method (Lucas and Kanade, 1981). However, the reliability of the motion estimates is gained at the expense of a much lower density of the optical flow field, which may not be ideal if the motion estimator is to provide the motion estimates for motion segmentation. Since the focus of the proposed motion estimator is its application to motion segmentation, the use of different sorts of confidence measures has not been thoroughly investigated. Thus, only the simple confidence measure, i.e., the smallest eigenvalue of the matrix X^T W X, is occasionally used to illustrate the idea that an appropriate confidence measure does help to provide more accurate optical flow fields.
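A sketch of this simple confidence test, assuming the design matrix and final weights from the weighted least-squares stage are available; the threshold value is illustrative.

```python
import numpy as np

def flow_is_reliable(A, w, threshold=1.0):
    """Reject a block whose weighted normal matrix X^T W X is ill-conditioned."""
    M = (A * w[:, None]).T @ A           # X^T W X with the LMedS inlier weights
    lam_min = np.linalg.eigvalsh(M)[0]   # eigvalsh returns ascending eigenvalues
    return lam_min >= threshold
```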
3. Results

In this section, an extensive set of quantitative as well as qualitative results will be presented for the proposed motion estimation technique. These results will be compared with a range of other published motion estimation techniques. In the case of image sequences with a known motion field, angular and absolute errors will be used to provide a quantitative comparison. In the case of real image sequences with no available known motion field, the displaced frame difference error will be used to provide a means for quantitative comparison.

3.1. Quantitative Error Measures for Image Sequences with Known Optical Flow

Two different error measures, the angular and the absolute error, have been used to provide a means of comparing the different methods. The angular error measure (Barron et al., 1994) is given by:

    ψ = arccos(v̂_c · v̂_e)    (26)

where v̂ ≡ (u, v, 1)^T / √(u² + v² + 1), v̂_c is the true motion vector and v̂_e the estimated motion vector. This measures errors as angular deviations from the correct space-time orientation. It is convenient because it handles both large and small speeds without the amplification inherent in a relative measure of vector differences. However, the problem with this error measure is that differences between relatively large motion vectors correspond to relatively small angular errors; also, symmetrical deviations of estimated motion vectors from the true values result in different angular errors (Otte and Nagel, 1994). To avoid these effects, an alternative error measure, the absolute error, may also be used. The absolute error is given by:

    ψ = ‖v_c − v_e‖    (27)

where v = (u, v)^T. In the following sections, quantitative results will be presented in terms of the average error ψ̄ and the standard deviation σ_ψ, expressed as:

    ψ̄ = (1/N) Σ_{i=1}^{N} ψ_i    (28)

    σ_ψ = √[ Σ_{i=1}^{N} (ψ_i − ψ̄)² / (N − 1) ]    (29)

where N is the number of pixels with reliable flow estimates in the image.
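The two error measures (Eqs. (26)–(29)) can be computed as in the following sketch, with the angular error reported in degrees as is conventional:

```python
import numpy as np

def angular_error_stats(vc, ve):
    """Eqs. (26), (28), (29): mean and std. dev. of the angular error (degrees).

    vc, ve : (N, 2) arrays of true and estimated flow vectors (u, v).
    """
    Vc = np.column_stack([vc, np.ones(len(vc))])
    Ve = np.column_stack([ve, np.ones(len(ve))])
    Vc /= np.linalg.norm(Vc, axis=1, keepdims=True)   # (u, v, 1)/sqrt(u^2+v^2+1)
    Ve /= np.linalg.norm(Ve, axis=1, keepdims=True)
    psi = np.degrees(np.arccos(np.clip(np.sum(Vc * Ve, axis=1), -1.0, 1.0)))
    return psi.mean(), psi.std(ddof=1)                # ddof=1 matches Eq. (29)

def absolute_error_stats(vc, ve):
    """Eq. (27): mean and std. dev. of the absolute error."""
    psi = np.linalg.norm(vc - ve, axis=1)
    return psi.mean(), psi.std(ddof=1)
```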
3.2. Quantitative Error Measures for Image Sequences with No Known True Optical Flow
One problem with comparing the different motion estimation methods when applied to real image sequences is that the true optical flow is usually not available, so that only qualitative comparisons can normally be made. One possible way of providing a quantitative comparison is to use
the displaced frame difference error measure that is commonly used in video compression to evaluate the goodness of the motion estimates. The displaced frame difference (DFD) error is defined as:

    DFD = I(x_i, y_i, t) − I(x_i + d_x(x_i, y_i), y_i + d_y(x_i, y_i), t + 1)    (30)

where I(x_i, y_i, t) is the image intensity at pixel location (x_i, y_i) in frame t, and (d_x(x_i, y_i), d_y(x_i, y_i)) is the displacement of the pixel at location (x_i, y_i) from frame t to frame t + 1. The mean absolute displaced frame difference will therefore be used as an error measure to provide a quantitative comparison, in addition to the qualitative comparison, between the different techniques. For techniques that provide a confidence measure, when thresholding is used only pixels that are considered to provide reliable estimates contribute to the total error, and the density denotes the percentage of pixels in the image that give reliable estimates.

It should be noted that the mean absolute displaced frame difference only provides a measure of the goodness of the pixel correspondence estimates; it does not necessarily provide insight into how well the motion estimates correlate with the true projected 3D motion vectors. Clearly, a greylevel mapping from frame t to t + 1 with infinite degrees of freedom could reduce the DFD to zero without performing any optical flow estimation. However, if the mapping is constrained to pixel-wise displacements, the DFD will in general be positive even with no constraints on the flow field. It is also worth mentioning that the DFD error is the whole basis of the development of motion compensation algorithms for video compression applications and has been found to be a reasonable measure of optical flow error for both synthetic and natural sequences (Lin and Barron, 1994).
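A sketch of the mean absolute DFD of Eq. (30), using bilinear interpolation for the non-integer displaced positions (scipy's map_coordinates is an assumed stand-in for whatever interpolation the original implementations used):

```python
import numpy as np
from scipy.ndimage import map_coordinates

def mean_abs_dfd(I0, I1, flow, valid=None):
    """Mean absolute displaced frame difference between frames t and t+1.

    flow  : (H, W, 2) per-pixel displacements (dx, dy).
    valid : optional boolean mask restricting the average to reliable pixels.
    """
    H, W = I0.shape
    ys, xs = np.mgrid[0:H, 0:W].astype(float)
    # Bilinear sampling of frame t+1 at the displaced positions.
    warped = map_coordinates(np.asarray(I1, dtype=float),
                             [ys + flow[..., 1], xs + flow[..., 0]], order=1)
    dfd = np.abs(np.asarray(I0, dtype=float) - warped)
    return dfd[valid].mean() if valid is not None else dfd.mean()
```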
3.3. Comparison of Techniques
Unless otherwise stated, the results for the techniques compared in the following sections are generated with the parameters given in this section. Other than for the LMedS motion estimator and the technique proposed by Odobez and Bouthemy (1994a, 1994b), the parameters used for the techniques are those suggested in Barron et al. (1994), which have been found to give good results. When the image sequence consists of 15 frames or more, pre-smoothing
is achieved via spatio-temporal smoothing using a Gaussian kernel with a standard deviation of 1.5 pixel-frames, as implemented in Barron et al. (1994). This pre-smoothing is carried out for all methods unless stated otherwise. When fewer frames are available, only 2 frames are employed for the computation of the optical flow and pre-smoothing is achieved via spatial smoothing using a Gaussian kernel with a standard deviation of 1.5 pixels in space.

In the case of the LMedS motion estimator, 16 × 16 pixel blocks are used; each block overlaps its neighbouring blocks by 50% and is allowed a maximum displacement of one-quarter of the block size in both the x- and y-directions, unless specified otherwise. When the multiresolution strategy is applied, the image pyramid used is a quadtree pyramid with three resolution levels and 32 × 32 pixel blocks at the finest resolution.

Odobez and Bouthemy proposed the use of a Tukey biweight re-descending M-estimator and an affine model to compute optical flow in a local square block of the image, in order to overcome the problem of multiple motions that may be present in the scene. To solve the M-estimation problem, an iteratively re-weighted least-squares method is used. The optical flow estimation is embedded in a multiresolution framework in which a Gaussian low-pass image pyramid is built and the flow estimate is refined from coarse to fine resolution. In the implementation, blocks of the same size as those applied to the LMedS motion estimator are used, and a maximum of 20 iterations is used for the iteratively re-weighted least-squares stage. When the multiresolution strategy is applied, a Gaussian pyramid with the same number of resolution levels and the same block size as applied to the LMedS motion estimator is used.

Lucas and Kanade's technique2 is a local approach to optical flow computation which constrains the flow to be a constant vector within each spatial neighbourhood. The flow is calculated using a weighted least-squares technique with a Gaussian weighting function. In the implementation, the image sequence is usually first pre-smoothed with a spatio-temporal Gaussian filter with a standard deviation of 1.5 pixels in space and 1.5 frames in time. Thresholding for this technique, if applied, is based on the smallest eigenvalue of the measurement matrix, and a threshold value of 1.0 is used.

In Horn and Schunck's method (Horn and Schunck, 1981), a global error function based on the greylevel constraint equation and a regularisation term is
minimised in order to compute the flow. The regularisation term imposes smoothness on the flow field and is expressed by the square of the magnitude of the flow gradients. In the implementation, the image sequence is also pre-smoothed with a spatio-temporal Gaussian filter with a standard deviation of 1.5 pixels in space and 1.5 frames in time. In addition, the smoothness term λ was set to 0.5 and a maximum of 100 iterations was used. Thresholding for this method, if applied, was based on the spatial gradient of the image intensity.

Nagel uses a global smoothness constraint to determine the flow field. Since the flow field in the case of Horn and Schunck's algorithm is assumed to vary smoothly over the image support, difficulties will occur at greylevel edges corresponding to occluding edges. To overcome this problem, Nagel introduced an oriented smoothness constraint that depends on the local edge orientation and attenuates the variation of the flow in the direction perpendicular to the edge gradient. In the implementation of Nagel's technique, the image sequence is first pre-smoothed with a spatio-temporal Gaussian kernel with a standard deviation of 3 pixels in space and 1.5 frames in time for the natural images, and a standard deviation of 1.5 pixels in space and 1.5 frames in time for the synthetic image sequences. In addition, the smoothing parameter α is set to 0.5, the parameter δ is set to 1.0, and a maximum of 100 iterations is used. Again, thresholding for this method, if applied, was based on the spatial gradient of the image intensity.

Anandan proposes a correlation matching technique based on a Laplacian pyramid and a coarse-to-fine matching strategy based on the sum of squared differences (SSD) (Anandan, 1989). Sub-pixel displacements are computed by finding the minimum of a quadratic approximation to the SSD surface about the minimum SSD value found with integer displacements. A confidence measure is proposed based on the curvature of the surface near the minimum SSD value. A smoothness constraint is also employed to propagate displacement measurements with high confidence values to neighbouring areas where the confidence is low, in order to "fill in" areas with unreliable flow estimates. In the implementation, a 3-level Laplacian pyramid with 10 Gauss-Seidel iterations is used.

In Fleet and Jepson's technique (Fleet and Jepson, 1992; Barron et al., 1994), the optical flow field is computed from components that are defined in terms of the instantaneous flows normal to the level-phase contours in the output of band-pass velocity-tuned
filters. Given the normal components of the flow from the different filter channels, an affine model is fitted in a least-squares sense to each 5 × 5 pixel neighbourhood using the normal components. In the implementation, a sigma value of 2.5 is used, corresponding to a single scale tuned to a spatio-temporal wavelength of 4.25 pixel-frames; the entire temporal support is thus 21 frames. The same threshold values as those of Fleet and Jepson are used: a threshold τ of 1.25 and a percentage amplitude threshold of 5%. When the image sequence consists of 15 frames or more but fewer than 21 frames, a sigma value of 1.5 is used.

It is worth mentioning that the intended application of a motion estimation technique determines its suitability and usefulness. For the purpose of motion segmentation, a good motion estimation technique should give a very dense (if not 100% dense) optical flow field while at the same time giving a reasonably low quantitative error. As such, motion estimation techniques that provide very good quantitative results at the expense of a very low density of the optical flow field are not suitable for motion segmentation. The proposed technique aims to achieve a very low quantitative error at the cost of only a small decrease in the density of the optical flow field, and without requiring the use of any thresholding.
3.4. Results for Synthetic Images
3.4.1. Image Sequence Containing Small Motions. It is claimed that the effectiveness of the proposed technique results primarily from a combination of four factors:

• LMedS robust parameter estimation.
• A sampling strategy based on the eigenvalues of the orientation matrix.
• An overlapping block strategy leading to multiple flow estimates per pixel.
• A block shifting strategy leading to more accurate flows at 'complex' motion boundaries, comprising three or more independent motions.

The last factor is the least important and indeed can be 'switched off' in order to lessen the computational burden.

A controlled experiment was designed to compare quantitatively the accuracy of the flows produced by the
LMedS-based method and the least-squares (LS) method, using identical combinations of block overlapping and block shifting strategies and an affine flow model in both cases. For computing the motion parameters using the LS method coupled with the overlapping block strategy, the flow vector giving the minimum LS error was chosen for each pixel, exactly as for the LMedS case given in Eq. (19). In both cases, a 16 × 16 processing block was used, with blocks overlapping by 50% in each direction where overlapping blocks were used.3 If block shifting is used, then a block has a maximum shift of 4 pixels in the +x and +y directions.

The experiments were carried out on the computer-generated Squares sequence comprising linearly translating sinusoidal patterns, the reference frame of which is shown in Fig. 2(a). The true flows for the top and bottom squares are (1, 2) and (1, −1) respectively, as shown in Fig. 2(b). Figure 2(c) shows the magnitude of the flow converted into a greylevel image. Figures 2(d)–(i) show the flow images, both in the needle diagram format and as greylevel images representing the flow magnitudes, for all the cases considered using the LS estimator. Figures 2(j), (k), (m) and (n) give the corresponding flow fields, again as needle diagrams and as greylevel images representing flow magnitudes, using the LMedS estimator. For Figs. 2(j) and (k), the block shifting has been switched off. Figures 2(l) and (o) show confidence images for the cases of no block shifting and block shifting respectively. The confidence images indicate the weights of the pixels that have been used in the weighted least-squares procedure for computing the motion parameters: a weight of 1.0 is represented by a greylevel of 255 (white) and a weight of 0.0 by a greylevel of 0 (black). Thus, a greylevel of zero (black) indicates that no optical flow could be computed.

Table 1 gives the angular and absolute flow errors in each case, where it can be seen that there is an order of magnitude improvement in the flow errors for the LMedS estimator. The sharpness of the edges in the flow magnitude images is also apparent for the LMedS estimator. Not quite as obvious from these images is a bias in the magnitude of the flow, amounting to around 5% for the square with the largest flow magnitude, in the cases where the LS estimator has been used; for the LMedS cases the flow magnitude is accurate to the third decimal place. This is a consequence of the novel sampling strategy based on the eigenvalues of the orientation matrix. Finally, Figs. 2(l) and (o) demonstrate the effectiveness of the block shifting strategy: in the case where no block shifting is applied, gaps in the motion boundaries have occurred, particularly at the junction of three motions, whereas when block shifting is applied a continuous motion boundary has been found. The effectiveness of this strategy is borne out by the improvement in the flow error. Note that such a significant improvement in the error when block shifting is employed is not as apparent for the LS estimator, as the outlier/inlier partitioning is not as effective for non-robust estimation.

Table 1. Error statistics for the Squares sequence.

Method                             Density (%)   Angular average error   Angular standard deviation   Absolute average error   Absolute standard deviation
LS—non-overlapping blocks          100           2.99                    9.76                         0.08                     0.26
LS—overlapping blocks, no shift    100           1.79                    7.97                         0.05                     0.22
LS—shifted overlapping blocks      100           1.42                    7.31                         0.04                     0.21
LMedS—no shift                     98.46         0.17                    2.60                         0.007                    0.08
LMedS—shift                        97.99         0.12                    2.08                         0.005                    0.06

Figure 2. (a)–(o) Reference frame, optical flow fields and confidence images for the Squares test sequence using LMedS and LS estimators with and without block shifting.

Figure 3(a) shows the first frame of a computer-generated sequence, the Small Spheres sequence, designed to show comparative performance on simulated 3D curved surfaces where the affine flow model is only an approximation to the true flow. The spheres and the background are textured by modulating a sinusoidally-varying greylevel according to the surface orientation. The left sphere is translating to the left (at a speed of about 1.5 pixels/frame) and the right sphere is diverging outwards (with a maximum speed of about 1.8 pixels/frame). Figure 3(b) shows the needle diagram of the true optical flow field (sub-sampled at an interval of 8 pixels). Figure 3(c) shows the needle diagram (sub-sampled at an interval of 8 pixels) of the optical flow computed using the LMedS motion estimator; an 8 × 8 pixel local neighbourhood is used to compute the optical flow, and only a single resolution is used since the motion in the image is small. Figure 3(d) shows the confidence image of the optical flow computed using the LMedS motion estimator. It can be seen in Fig. 3(c)
that the flow has been computed accurately at the motion boundaries of the two occluding spheres, while Fig. 3(d) shows that the motion discontinuities and occlusion regions (points in black) have been successfully
isolated. Note that at the top and bottom of the regions where the right sphere occludes the left sphere, the local neighbourhood for computing the optical flow encounters situations where it straddles three types
Figure 3. (a)–(o) Reference frame and optical flow fields for the Small Spheres test sequence comparing the LMedS approach with existing approaches. L&K—Lucas and Kanade, H&S—Horn and Schunck, F&J—Fleet and Jepson, O&B—Odobez and Bouthemy.
of motions (one translating to the left, one diverging, and one stationary). However, the algorithm has successfully computed the flow in these neighbourhoods. Figures 3(e) and (f) show the magnitude and direction (represented as greylevels) of the flow computed using the LMedS motion estimator. It can be seen that block-effects, commonly faced in local blockwise methods for computing the optical flow, have been eliminated.
Figures 3(g)–(l) show the needle diagrams of the optical flow computed using Lucas and Kanade's method, Horn and Schunck's method, Nagel's method, Anandan's method, Fleet and Jepson's method and Odobez and Bouthemy's method. As mentioned previously, the same local neighbourhood size of 8 × 8 pixels and the same single image resolution as applied to the LMedS motion estimator are used for
Odobez and Bouthemy’s method. Figure 3(m) shows the confidence image of the Odobez and Bouthemy’s method, while Figs. 3(n) and (o) show the magnitude and direction (represented as greylevel) of the corresponding optical flow. Block-effects can be observed in this case and the method fails to provide good estimates of the flow at motion boundaries. This experiment shows that the LMedS and the Odobez and Bouthemy’s method give the best qualitative results for the computed optical flow field. However, as the image is quite small, it is visually not obvious from the flow field that block-effects and errors around the motion boundaries have occurred using Odobez and Bouthemy’s method. For Lucas and Kanade’s method, obvious errors have occurred at the motion boundaries of the spheres. Anandan’s method gives the worst qualitative result. Interestingly, Fleet and Jepson’s technique, which has been reported to give very good results (Barron et al., 1994) does not give as good results on this image sequence. This may be because the spatio-temporal frequencies of the textured patterns in the image does not match those in the pass-band to which the filters of Fleet and Jepson’s technique are tuned. Since the Small Spheres sequence has a relatively dense structure and is sufficiently textured, thresholding using the confidence measures, as proposed in Barron et al. (1994), for the other techniques does not have much effect on the resulting density of the optical flow field. A quantitative comparision of the various methods for computing the optical flow has also been made. Table 2 shows the angular and absolute errors for the various optical flow methods for the Small Spheres sequence. The results show that the LMedS method gives the lowest angular and absolute errors, followed by Odobez and Bouthemy’s, Lucas and Kanade’s, Horn and Schunck’s, Fleet and Jepson’s, Nagel’s and
Table 2.
65
Figure 4. (a) First frame of the Big Spheres sequence. (b) Needle diagram of the true optical flow field.
Anandan’s method. Note, however, that Fleet and Jepson’s method only gives very sparse optical flow estimates. 3.4.2. Performance under Varying Noise Levels. The Big Spheres sequence, the reference frame of which is shown in Fig. 4(a), was used along with additive Gaussian noise of difference standard deviations. Figure 4(b) shows the true optical flow field with respect to the reference for this image sequence. The left sphere is moving downwards at about 1 pixel/frame, while the right sphere is diverging outwards with a maximum speed of about 0.8 pixel/frame. The background is moving to the left at a speed of 1 pixel/frame. The angular error statistics for the various estimation techniques when applied on the Big Spheres image sequence with Gaussian noise of standard deviation 0, 4, 8, 12, and 16 added are summarised in Fig. 5. The error statistics for the various techniques show some interesting results. Firstly, Fleet and Jepson’s technique, which has been reported to give very good results (Barron et al., 1994) produces very low density and very high errors on this image sequence. This again
Error statistics for the Small Spheres image sequence.
Method                  Density (%)   Angular average error   Angular standard deviation   Absolute average error   Absolute standard deviation
LMedS                   97.13         0.93                    2.41                         0.018                    0.046
Lucas and Kanade        100           1.65                    5.37                         0.03                     0.11
Horn and Schunck        100           2.45                    7.28                         0.05                     0.20
Nagel                   100           3.92                    8.24                         0.08                     0.18
Anandan                 100           8.42                    11.05                        0.16                     0.22
Fleet and Jepson        40.14         2.46                    6.48                         0.05                     0.14
Odobez and Bouthemy     96.90         1.59                    3.94                         0.03                     0.07
Figure 4. (a) First frame of the Big Spheres sequence. (b) Needle diagram of the true optical flow field.

Figure 5. Average angular error as a function of noise standard deviation for the Big Spheres sequence.

3.4.2. Performance under Varying Noise Levels. The Big Spheres sequence, the reference frame of which is shown in Fig. 4(a), was used along with additive Gaussian noise of different standard deviations. Figure 4(b) shows the true optical flow field with respect to the reference frame for this image sequence. The left sphere is moving downwards at about 1 pixel/frame, while the right sphere is diverging outwards with a maximum speed of about 0.8 pixels/frame. The background is moving to the left at a speed of 1 pixel/frame. The angular error statistics for the various estimation techniques when applied to the Big Spheres image sequence, with Gaussian noise of standard deviation 0, 4, 8, 12 and 16 added, are summarised in Fig. 5.

The error statistics for the various techniques show some interesting results. Firstly, Fleet and Jepson's technique, which has been reported to give very good results (Barron et al., 1994), produces very low density and very high errors on this image sequence. This again
may be because the spatio-temporal frequencies of the textured patterns in the image do not match those in the pass-band to which the filters of Fleet and Jepson's technique are tuned. In addition, Anandan's technique, being a matching technique, also does not perform well in this case, especially when Gaussian noise has been added to the image sequence. (The results from Anandan's method are excluded from Fig. 5 because the errors produced are so large.) It can be seen from Fig. 5 that the LMedS technique gives the best results regardless of the amount of noise present, although the error does rise quite steeply as the noise level increases.
3.5. Results for Realistic Images
Realistic images are images with textures that resemble real objects but with simulated motion, so that the true optical flow is available. The first sequence considered is the Translating Tree sequence⁴ as used in Barron et al. (1994) and Fleet and Jepson (1992). This sequence consists of 40 frames and is captured by a camera that moves normal to its line of sight along its X-axis with respect to a textured planar surface (see Fig. 6(a)). Thus, the 2D image velocities are all parallel to the image x-axis, with speeds in the x-direction ranging from 1.728 to 2.250 pixels/frame. Figure 6(b) shows the true optical flow with respect to the reference frame (frame 20) of the image sequence.
Figure 6(c) shows the optical flow obtained using the LMedS motion estimator. To provide a qualitative comparison, Fig. 6(d) shows the optical flow estimated using Lucas and Kanade’s method without thresholding, while Fig. 6(e) shows the optical flow estimated using Lucas and Kanade’s method with thresholding based on the smallest eigenvalue of the measurement matrix (threshold = 1.0). Figures 6(f)–(j) show the optical flow estimated using Horn and Schunck’s method, Nagel’s method, Anandan’s method, Fleet and Jepson’s method, and Odobez and Bouthemy’s method. Table 3 presents the angular error statistics for the different techniques. The results for Uras et al.’s, Singh’s, Waxman et al.’s, and Heeger’s techniques are quoted from Barron et al. (1994), those generated by Weber and Malik’s technique from Weber and Malik (1995), and those generated by Bober and Kittler’s technique from Bober and Kittler (1994). The quantitative results show that Fleet and Jepson’s technique gives the lowest errors at the expense of a much lower density of the optical flow field. Among those techniques that provide a high density (greater than 80%) of optical flow fields, the LMedS technique gives the best result, followed by Weber and Malik’s, Odobez and Bouthemy’s, and Bober and Kittler’s techniques. Uras et al.’s technique also performs very well on this sequence. Lucas and Kanade’s technique with thresholding also performs very well, although at the expense of a much lower density of the optical flow field (as in the case of Fleet and Jepson’s technique). These results are summarised in Fig. 7.

Figure 6. (a)–(j) Reference frame and optical flow fields for the Translating Tree test sequence.

The Diverging Tree sequence, as used by Barron et al. (1994), is also captured by a camera moving with
respect to a textured planar surface and consists of 40 frames (see Fig. 8(a)). In this case, the camera moves along its line of sight and the focus of expansion is at the centre of the image.
Table 3. Summary of the Translating Tree angular error statistics.

Technique                                       Average error   Standard deviation   Density (%)
Uras et al.                                     0.62            0.52                 100
Uras et al. (threshold det(H) ≥ 1.0)            0.46            0.35                 41.8
Singh (Step 1, n = 2, w = 2)                    1.64            2.44                 100
Singh (Step 1, n = 2, w = 2, λ1 ≤ 5.0)          0.72            0.75                 41.4
Singh (Step 2, n = 2, w = 2)                    1.25            3.29                 100
Singh (Step 2, n = 2, w = 2, λ1 ≤ 0.1)          1.11            0.89                 99.6
Heeger (level 0)                                8.10            12.30                77.9
Heeger (level 1)                                4.53            2.41                 57.8
Waxman et al. (σf = 2.0)                        6.66            10.72                1.9
Horn and Schunck                                2.02            2.27                 100
Horn and Schunck (threshold ‖∇I‖₂ ≥ 5.0)        1.89            2.40                 53.2
Lucas and Kanade                                1.13            1.49                 100
Lucas and Kanade (threshold λ2 ≥ 1.0)           0.66            0.67                 39.8
Lucas and Kanade (threshold λ2 ≥ 5.0)           0.56            0.58                 13.1
Nagel                                           2.44            3.06                 100
Nagel (threshold ‖∇I‖₂ ≥ 5.0)                   2.24            3.31                 53.2
Anandan                                         4.54            3.10                 100
Fleet and Jepson (τ = 2.5)                      0.32            0.38                 74.5
Fleet and Jepson (τ = 1.25)                     0.23            0.19                 49.7
Fleet and Jepson (τ = 1.0)                      0.25            0.21                 26.8
Bober and Kittler                               0.54            1.69                 100
Weber and Malik                                 0.49            0.35                 96.8
LMedS                                           0.39            0.29                 95.2
Odobez and Bouthemy                             0.48            0.43                 98.9
Figure 7. Bar chart of average angular errors for motion estimation techniques that provide a dense optical flow (80% or more) when applied on the Translating Tree sequence.
Figure 8. (a)–(j) Reference frame and optical flow fields for the Diverging Tree test sequence.
The image velocities in both the x- and y-directions vary from 1.29 pixels/frame on the left side to 1.86 pixels/frame on the right. Figure 8(b) shows the true optical flow with respect to the reference frame (frame 20) of this image sequence and Figs. 8(c)–(j) show flow fields for all of the other methods implemented.
Table 4 shows the angular error statistics for the different techniques when applied to the Diverging Tree sequence. As in the case of the Translating Tree sequence, the quantitative results for the Diverging Tree sequence show that Fleet and Jepson’s technique gives the lowest errors at the expense of a much lower density of the optical flow field.
Table 4. Summary of the Diverging Tree angular error statistics.

Technique                                       Average error   Standard deviation   Density (%)
Uras et al.                                     4.64            3.48                 100
Uras et al. (threshold det(H) ≥ 1.0)            3.83            2.19                 60.2
Singh (Step 1, n = 2, w = 2, N = 4)             17.66           14.25                100
Singh (Step 1, n = 2, w = 2, N = 4, λ1 ≤ 5.0)   7.09            6.59                 3.3
Singh (Step 2, n = 2, w = 2, N = 4)             8.60            5.60                 100
Singh (Step 2, n = 2, w = 2, N = 4, λ1 ≤ 0.1)   8.40            4.78                 99.0
Heeger                                          4.49            3.10                 74.2
Waxman et al. (σf = 2.0)                        11.23           8.42                 4.9
Horn and Schunck                                2.55            3.67                 100
Horn and Schunck (threshold ‖∇I‖₂ ≥ 5.0)        1.94            3.89                 32.9
Lucas and Kanade                                2.70            2.80                 100
Lucas and Kanade (threshold λ2 ≥ 1.0)           1.94            2.06                 48.2
Lucas and Kanade (threshold λ2 ≥ 5.0)           1.65            1.48                 24.3
Nagel                                           2.94            3.23                 100
Nagel (threshold ‖∇I‖₂ ≥ 5.0)                   3.21            3.43                 53.5
Anandan (frames 19 and 21)                      7.64            4.96                 100
Anandan (frames 20 and 21)                      16.94           12.39                100
Fleet and Jepson (τ = 2.5)                      0.99            0.78                 61.0
Fleet and Jepson (τ = 1.25)                     0.80            0.73                 46.5
Fleet and Jepson (τ = 1.0)                      0.73            0.46                 28.2
Bober and Kittler                               3.69            4.39                 100
Weber and Malik                                 3.18            2.50                 88.6
LMedS                                           1.32            1.16                 95.1
Odobez and Bouthemy                             1.43            1.17                 97.4
Least-squares (block-based) method              1.98            2.81                 100
Among those techniques that provide a high density (greater than 80%) of optical flow fields, the LMedS technique gives the best result, followed by Odobez and Bouthemy’s, Weber and Malik’s, and Bober and Kittler’s techniques. Lucas and Kanade’s technique with thresholding also performs very well, although at the expense of a much lower density of the optical flow field (as in the case of Fleet and Jepson’s technique). Also shown in Table 4 is the error obtained by applying a blockwise LS parameter estimator (non-overlapping blocks of the same size as used in the LMedS implementation). As demonstrated for the experiment described in Section 3.4.1 using the synthetic Squares sequence, even when there is only one coherent motion, the sampling strategy coupled with the use of overlapping blocks leads to a more accurate flow field. Contrary to the case of the Translating Tree sequence, Uras et al.’s technique did not perform well on
this sequence. This is because the second-order constraints used are based on the conservation of the intensity gradient and hence are, strictly speaking, invalid for rotation, dilation, and shear. This is in contrast to the first-order differential techniques, which are based on the first-order gradient constraint equation that is valid for smooth deformations of the input, including affine deformations.
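To make this contrast explicit, differentiating the brightness constancy constraint with respect to x and y yields the second-order equations together with the terms that the gradient-conservation approach drops (standard manipulations in the usual gradient-constraint notation; this derivation is added here for clarity and is not quoted from the original):

\[
\begin{aligned}
I_x u + I_y v + I_t &= 0, \\
I_{xx} u + I_{xy} v + I_{xt} &= -(I_x u_x + I_y v_x), \\
I_{xy} u + I_{yy} v + I_{yt} &= -(I_x u_y + I_y v_y).
\end{aligned}
\]

Setting the right-hand sides to zero, as conservation of ∇I does, is exact only when the flow derivatives vanish, i.e. for locally translational motion, which is why the second-order constraints break down under rotation, dilation, and shear.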
Figure 9 summarises the results.

Figure 9. Bar chart of average angular errors for motion estimation techniques that provide a dense optical flow (80% or more) when applied on the Diverging Tree sequence.

Figure 10(a) shows one frame of the Yosemite sequence, showing a ‘fly-through’ of a valley. This sequence was produced by photo-composition: the valley was rendered by draping an ortho-rectified aerial image onto an elevation grid, whilst the clouds and sky were taken from a real scene and artificially translated across the field of view. In this sequence, the clouds translate to the right with a velocity of about 2 pixels/frame. The motion at the upper right of the foreground is mainly divergent, while speeds at the lower left are about 4 to 5 pixels/frame. This sequence is interesting because of
the wide range of velocities and the presence of occluding edges between the mountains and at the horizon. There is also severe aliasing in the lower portion of the images. Thus, most methods for computing 2D motion are expected to provide poor velocity measurements on this sequence. Table 5 shows the angular error statistics for the different techniques on the Yosemite sequence. Among those techniques that provide a high density (greater than 80%) of optical flow fields, the LMedS technique gives the best results, followed by Odobez and Bouthemy’s technique. The result on the Yosemite sequence for Bober and Kittler’s technique is not available. In contrast to the case of the Translating Tree and Diverging Tree sequences, Fleet and Jepson’s technique does not give a significantly better result than techniques such as Lucas and Kanade’s and Horn and Schunck’s. For the Yosemite sequence, Lucas and Kanade’s technique with thresholding applied (threshold λ2 ≥ 5.0) gives the best results, but with a very low density (8.7%). With a threshold of λ2 ≥ 1.0, Lucas and Kanade’s technique gives the third best results, but with a density of 35.1%. Weber and Malik’s technique gives an optical flow field with a density of 64.2% and also gives better results than
Fleet and Jepson’s technique. Note that thresholding can also be applied to the LMedS technique by using the confidence measure mentioned in the previous section, resulting in the second best result in Table 5 at the expense of an optical flow field with a lower density (58.4%). Figure 11 summarises the results for methods giving a flow field density of 80% or more where, again, the LMedS method gives the lowest error.

3.6. Results for Natural Images
Several natural image sequences were used to demonstrate the effectiveness of the LMedS motion estimation technique. These include the SRI, Rubik Cube, and the Taxi sequences as used by Barron et al. (1994) and the Mobile sequence obtained from Rensselaer Polytechnic Institute (RPI) image database. The Taxi sequence, one frame of which is shown in Fig. 12(a), shows a street scene that contains four moving objects: (1) the taxi turning the corner, (2) a car in the lower left driving from left to right, (3) a van in the lower right driving from right to left, and (4) a pedestrian in the upper right. Image velocities of the four moving objects are approximately 1.0, 3.0, 3.0, and 0.3 pixels/frame respectively.
Figure 10. (a)–(k) Reference frame and optical flow fields for the Yosemite test sequence.
The SRI sequence, one frame of which is shown in Fig. 13(a), is captured by a camera that translates parallel to the ground plane, perpendicular to its line of sight, in front of clusters of trees. This is a challenging sequence because of the relatively poor resolution, the amount of occlusion, and the low contrast. Image velocities are as large as 2 pixels/frame. The Rubik Cube sequence depicts a cube that is rotating counter-clockwise on a turntable. One frame of the
sequence is shown in Fig. 14(a). The motion field induced by the rotation of the cube includes velocities less than 2 pixels/frame (velocities on the turntable range from 1.2 to 1.4 pixels/frame and those on the cube are between 0.2 and 0.5 pixels/frame). The Mobile sequence depicts an indoor scene. Fig. 15(a) shows the reference frame of the Mobile sequence. The background consists of translating patterns moving from left to right, while a toy train traverses from right to left in the foreground and pushes a ball in front of it. The calendar and the textured region directly above the calendar all move with the same motion and are traversing diagonally upwards and to the right. Note that the ball is rotating in an anticlockwise direction as it is being pushed by the train, and the track of the train is curved at the right-hand side of the image, making the head of the train look slightly tilted upwards.
Table 5. Summary of the Yosemite angular error statistics.

Technique                                       Average error   Standard deviation   Density (%)
Uras et al.                                     10.22           16.51                100
Uras et al. (threshold det(H) ≥ 1.0)            7.55            19.64                14.7
Singh (Step 1, n = 2, w = 2)                    15.28           19.61                100
Singh (Step 1, n = 2, w = 2, λ1 ≤ 6.5)          12.01           21.43                11.3
Singh (Step 2, n = 2, w = 2)                    10.44           13.94                100
Singh (Step 2, n = 2, w = 2, λ1 ≤ 0.1)          10.03           13.13                97.7
Heeger (combined)                               15.93           23.16                44.8
Heeger (level 0)                                22.82           35.28                64.2
Heeger (level 1)                                9.87            14.74                15.2
Heeger (level 2)                                12.93           15.36                2.4
Waxman et al. (σf = 2.0)                        20.05           23.23                7.4
Horn and Schunck                                9.78            16.19                100
Horn and Schunck (threshold ‖∇I‖₂ ≥ 5.0)        5.59            11.52                32.9
Lucas and Kanade                                10.08           17.72                100
Lucas and Kanade (threshold λ2 ≥ 1.0)           4.28            11.41                35.1
Lucas and Kanade (threshold λ2 ≥ 5.0)           3.22            8.92                 8.7
Nagel                                           10.22           16.51                100
Nagel (threshold ‖∇I‖₂ ≥ 5.0)                   6.06            12.02                32.9
Anandan                                         13.36           15.64                100
Fleet and Jepson (τ = 2.5)                      4.63            13.42                34.1
Fleet and Jepson (τ = 1.25)                     5.28            14.34                30.6
Weber and Malik                                 4.31            8.66                 64.2
LMedS                                           5.79            12.55                89.9
LMedS (thresholded)                             4.16            10.54                58.4
Odobez and Bouthemy                             6.17            12.61                98.1
Figure 11. Bar chart of average angular errors for motion estimation techniques that provide a dense optical flow (80% or more) when applied on the Yosemite sequence.
Figure 12. (a)–(i) Reference frame and optical flow fields for the Taxi test sequence.
Tables 6–9 present the DFD error statistics for all of the natural image sequences.

Table 6. DFD error statistics of the Taxi sequence.

Technique                                  Average DFD error   Standard deviation   Density (%)
Horn and Schunck                           4.01                3.25                 100
Lucas and Kanade                           3.99                3.23                 100
Lucas and Kanade (threshold λ2 ≥ 1.0)      6.49                5.10                 22.42
Nagel                                      4.21                3.48                 100
Anandan                                    4.35                3.60                 100
Fleet and Jepson (τ = 1.25)                6.20                4.84                 27.84
LMedS                                      3.84                3.08                 95.46
Odobez and Bouthemy                        3.93                3.17                 98.87

Table 7. DFD error statistics of the SRI sequence.

Technique                                  Average DFD error   Standard deviation   Density (%)
Horn and Schunck                           7.50                5.46                 100
Lucas and Kanade                           7.26                5.19                 100
Lucas and Kanade (threshold λ2 ≥ 1.0)      8.30                5.83                 33.18
Nagel                                      8.42                6.22                 100
Anandan                                    7.28                5.16                 100
Fleet and Jepson (τ = 1.25)                6.51                4.44                 72.21
LMedS                                      6.46                4.27                 87.78
Odobez and Bouthemy                        6.89                4.86                 99.10
Table 8. DFD error statistics of the Rubik sequence.

Technique                                  Average DFD error   Standard deviation   Density (%)
Horn and Schunck                           2.07                1.83                 100
Lucas and Kanade                           2.11                1.83                 94.89
Lucas and Kanade (threshold λ2 ≥ 1.0)      3.87                2.83                 20.76
Nagel                                      3.28                3.04                 100
Anandan                                    5.31                4.99                 100
Fleet and Jepson (τ = 1.25)                4.42                3.08                 12.63
LMedS                                      1.97                1.73                 82.07
Odobez and Bouthemy                        2.06                1.77                 89.26

Table 9. DFD error statistics of the Mobile sequence.

Technique                                  Average DFD error   Standard deviation   Density (%)
Horn and Schunck                           8.62                6.63                 100
Lucas and Kanade                           8.54                6.57                 100
Lucas and Kanade (threshold λ2 ≥ 1.0)      11.00               8.18                 37.6
Nagel                                      9.02                7.00                 100
Anandan                                    9.67                7.66                 100
Fleet and Jepson (τ = 1.25)                9.63                7.03                 58.71
LMedS                                      8.08                6.14                 90.30
Odobez and Bouthemy                        8.44                6.47                 98.96

One interesting observation from the results of the real image sequences
is that when thresholding based on the confidence measures proposed in Barron et al. (1994) is applied to the techniques tested here, such as Lucas and Kanade’s and Nagel’s, the DFD error increases. This is in contrast to the use of true optical flow for providing error estimates, where the application of thresholding reduces the angular errors. In the case
of Lucas and Kanade’s method, thresholding is based on the smallest eigenvalue of the measurement matrix and will thus threshold away pixels that have very little greylevel variation. However, pixels with very little greylevel variation tend to give low DFD errors even when the motion estimates are not very accurate, and thus when these pixels have been removed from
consideration, the resultant mean absolute DFD error increases. In the case of Nagel’s technique, the same situation occurs, since the thresholding is based on the L2 norm of the spatial gradients at each pixel. Thus, the mean absolute DFD error is just a means of providing some quantitative comparison and should not be taken as an indication of how well the motion estimates correlate with the true projected 3D motion vectors.
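As an illustration of this confidence measure (a sketch under helper names of our own, not Barron et al.’s implementation), the smallest eigenvalue of the 2 × 2 measurement matrix, formed from products of the spatial gradients accumulated over a local window, has a closed form that can be evaluated per pixel:

import numpy as np
from scipy.ndimage import uniform_filter

def smallest_eigenvalue(Ix, Iy, window=5):
    # Windowed 2x2 gradient matrix [[a, b], [b, c]] (uniform_filter
    # gives the local mean, i.e. the windowed sum up to a constant).
    a = uniform_filter(Ix * Ix, window)
    b = uniform_filter(Ix * Iy, window)
    c = uniform_filter(Iy * Iy, window)
    # Closed-form smallest eigenvalue of a symmetric 2x2 matrix.
    return 0.5 * (a + c - np.sqrt((a - c) ** 2 + 4.0 * b ** 2))

Flow vectors where this eigenvalue falls below the threshold (1.0 in the experiments above) are discarded; since the discarded pixels are precisely those with little greylevel variation, and such pixels tend to have small DFD values, the mean DFD of the surviving pixels rises.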
The experimental results show that the LMedS method gives the lowest DFD errors for most, if not all, of the real image sequences, with a relatively high density of the optical flow field. The density of the optical flow field is not 100% because of the outlier rejection performed by the LMedS estimator, and hence depends on the nature of the image sequence. The method automatically adjusts the output density by filtering away pixels that give gross errors, yielding a relatively dense optical flow field that also provides relatively low DFD errors. In contrast, thresholding the various techniques based on the confidence measures proposed in Barron et al. (1994) increases the DFD errors in most, if not all, cases. In most cases, Odobez and Bouthemy’s method gave the second best results in terms of the DFD errors, although the resultant density of the optical flow field is not 100%. Lucas and Kanade’s method and Horn and Schunck’s method, both without thresholding, give quite close results in most cases, except for the SRI sequence, where Lucas and Kanade’s method performs significantly better than Horn and Schunck’s. Both methods give reasonably low DFD errors in most, if not all, cases compared to the techniques of Nagel, Anandan (except for the SRI sequence, where Anandan’s method gave a lower DFD error than Horn and Schunck’s), and Fleet and Jepson (except for the SRI sequence, where Fleet and Jepson’s method gave the second best results).
Figure 13. (a)–(i) Reference frame and optical flow fields for the SRI test sequence.

Figure 14. (a)–(i) Reference frame and optical flow fields for the Rubik test sequence.

Figure 15. (a)–(i) Reference frame and optical flow fields for the Mobile test sequence.
Contrary to the case where the angular error measure is used with the true optical flow field, Fleet and Jepson’s method does not appear to perform well when DFD errors are used for quantitative comparisons. The one exception is its performance on the SRI image sequence, where it provides the second lowest DFD error with a reasonably dense optical flow field (about 72%), trailing only the LMedS method. This is probably because the method provides a low density of optical flow and hence results in a higher DFD error, just as when techniques such as Lucas and Kanade’s and Nagel’s are thresholded using their respective confidence measures. Indeed, the Fleet and Jepson results presented here have been thresholded based on the thresholds recommended in Barron et al. (1994). It should finally be noted that, in order to increase the density of the flow field provided by Fleet and Jepson’s method, the threshold was varied on a trial-and-error basis, but the attempts did not generate significantly denser flows.
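For concreteness, given a flow field, the DFD figures in Tables 6–9 can be computed along the following lines (Python/NumPy with SciPy; an illustrative sketch of the measure, assuming bilinear interpolation for sub-pixel displacements, rather than the authors’ code):

import numpy as np
from scipy.ndimage import map_coordinates

def mean_abs_dfd(frame_t, frame_t1, u, v, mask):
    # Displaced frame difference: compare frame t with frame t+1
    # sampled at the positions displaced by the estimated flow (u, v).
    h, w = frame_t.shape
    ys, xs = np.mgrid[0:h, 0:w].astype(np.float64)
    warped = map_coordinates(frame_t1, [ys + v, xs + u],
                             order=1, mode='nearest')
    dfd = np.abs(warped - frame_t)
    # Statistics over the pixels where a flow estimate exists.
    return dfd[mask].mean(), dfd[mask].std()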
3.7. Multiresolution Results for Synthetic Image Sequence
Figure 16(a) shows one frame of the synthetic Spheres-Square sequence, which contains objects moving with large motions. This image
sequence consists of only 5 frames. Figure 16(b) shows the true optical flow field. The left sphere is moving to the left at about 2.5 pixels/frame, while the right sphere is diverging outwards with a maximum speed of about 3.4 pixels/frame. The square block is moving vertically downwards at 6 pixels/frame. Figure 16(c) shows the needle diagram of the optical flow (subsampled at an interval of 8 pixels) computed using the multiresolution LMedS method. Figures 16(d)–(f) show needle diagrams of the optical flow computed using Lucas and Kanade’s method, the multiresolution version of Anandan’s method using a 3-level Laplacian pyramid, and the multiresolution version of Odobez and Bouthemy’s method using a 3-level Gaussian pyramid. Since this sequence only consists of 5 frames, Fleet and Jepson’s technique cannot be applied. In addition, temporal smoothing cannot be applied in the case of Lucas and Kanade’s technique.

Figure 16. (a)–(f) Reference frame and optical flow fields for the Spheres-Square test sequence.
It can be seen that the LMedS method gives the best qualitative results for the computed optical flow field. Block effects can be distinctly observed in the flow field produced by Odobez and Bouthemy’s method, and obvious errors have also occurred at the motion boundaries of the spheres in the flow field produced by Lucas and Kanade’s method. Anandan’s method results in no visibly large errors in the estimated flow but has difficulty reproducing the divergent motion in the right-hand sphere. Table 10 shows the angular and absolute errors for the various optical flow methods for the Spheres-Square image sequence. The results show that the multiresolution LMedS method gives the lowest absolute errors, followed by Odobez and Bouthemy’s method, Anandan’s method, and finally Lucas and Kanade’s method. The results show that Lucas and Kanade’s method, which operates on a single resolution of the image,
Table 10. Error statistics for the Spheres-Square sequence.

Method                                   Density (%)   Angular average error   Angular standard deviation   Absolute average error   Absolute standard deviation
LMedS (multiresolution)                  98.31         1.20                    5.78                         0.07                     0.29
Anandan (multiresolution)                100           11.32                   16.62                        0.39                     0.77
Odobez and Bouthemy (multiresolution)    96.83         5.13                    13.16                        0.17                     0.42
Lucas and Kanade                         100           10.42                   22.98                        0.72                     2.25
only works well for image sequences with smaller motions. The results for angular error indicate another property of the various methods. Specifically, they show that the LMedS method has the lowest angular error, followed by Odobez and Bouthemy’s method, Lucas and Kanade’s method, and finally Anandan’s method. Methods which give larger angular errors imply that more pixel estimates are in error, whereas methods which give larger absolute errors only give an indication of the overall errors in the computed flow field: a few very large errors may cause a method to give a very large absolute error even though the estimated motion vectors at all other points may be very accurate. Therefore, a better optical flow method is one which gives both a lower absolute error and a lower angular error. Based on this view, the quantitative results show that the LMedS method is the best among all the methods compared. It should finally be noted that, for most, if not all, of the methods available in a single resolution version, a multiresolution version could be developed by warping the reference frame according to the flow projected from the previous resolution. However, this simple approach is not robust to the presence of motion discontinuities (even if a coarse level estimate could be robustly computed) because the warping process has no knowledge of occlusion or disocclusion, and hence the flow field cannot be accurately refined near motion discontinuities. In contrast, the LMedS method robustly computes the refinements to the flow at each level of the pyramid.
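The coarse-to-fine warping scheme discussed above can be sketched as follows (Python/NumPy with SciPy; a generic illustration under assumed names, where estimate_flow stands for any single-level estimator, such as the blockwise LMedS fit, and dyadic image sizes are assumed so that pyramid levels align exactly — not the authors’ implementation):

import numpy as np
from scipy.ndimage import zoom, map_coordinates

def warp(image, u, v):
    # Sample 'image' at positions displaced by the flow (bilinear).
    h, w = image.shape
    ys, xs = np.mgrid[0:h, 0:w].astype(np.float64)
    return map_coordinates(image, [ys + v, xs + u], order=1, mode='nearest')

def coarse_to_fine(pyr1, pyr2, estimate_flow):
    # pyr1, pyr2: pyramids of the two frames, coarsest level first.
    u = v = np.zeros_like(pyr1[0])
    for im1, im2 in zip(pyr1, pyr2):
        if u.shape != im1.shape:
            # Project the coarser flow to this level: upsample and double.
            u = 2.0 * zoom(u, 2.0, order=1)
            v = 2.0 * zoom(v, 2.0, order=1)
        # Pre-warp the second frame so only a residual motion remains.
        du, dv = estimate_flow(im1, warp(im2, u, v))
        u, v = u + du, v + dv
    return u, v

As the text points out, the warping step itself is oblivious to occlusion, so a robust estimator is still needed at every level to stop boundary errors from propagating down the pyramid.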
3.8. Multiresolution Results for Natural Image Sequence
Figure 17(a) shows the reference frame of the Flower Garden sequence. This image sequence depicts a garden scene captured by a horizontally translating observer. The horizontal motion on the tree is approximately 6 pixels/frame. Figures 17(b) and (c) show the optical flow estimated using the LMedS motion estimator for both the single resolution case and the case that utilises the multiresolution scheme. It can be seen that reasonably good results have been obtained using this technique, and the results produced by the multiresolution scheme seem better than those produced by a single resolution, due to the large foreground motion in this sequence. Figures 17(d) and (e) show the optical flow computed using Lucas and Kanade’s technique without and with
thresholding. For the case without thresholding, many errors occur on the tree, indicating that this technique cannot estimate optical flow reliably when the motion is large. When thresholding has been applied, most of the flow on the tree and the sky section has been filtered out. Figures 17(f)–(i) show the optical flow estimated using Horn and Schunck’s, Nagel’s, Anandan’s (again based on a 3-level Laplacian pyramid), and Fleet and Jepson’s methods respectively. It can be seen that Horn and Schunck’s and Nagel’s methods did not perform well. Anandan’s method seems to give reasonably good results. The optical flow estimated using Fleet and Jepson’s technique appears to be quite sparse, and most of the flow on the tree and a major section of the flower bed in the foreground has been filtered out. Figures 17(j) and (k) show the optical flow estimated using Odobez and Bouthemy’s motion estimation technique for both the single resolution case and the case that utilises the multiresolution scheme. It can be seen that reasonably good results have also been obtained using this technique, and the results produced by the multiresolution scheme also seem better than those produced by a single resolution. Table 11 shows the average DFD errors and the standard deviations for the various optical flow methods for the Flower Garden sequence. The results show that both the LMedS motion estimator and Odobez and Bouthemy’s method produce lower errors with the multiresolution scheme than their single resolution counterparts. In addition, the LMedS motion estimator that utilises the multiresolution scheme produces the lowest error, followed by Fleet and Jepson’s method and Odobez and Bouthemy’s method with the multiresolution scheme.
Table 11. DFD error statistics of the Flower Garden sequence.

Technique                                  Average DFD error   Standard deviation   Density (%)
Horn and Schunck                           16.09               13.64                100
Lucas and Kanade                           15.75               13.30                100
Lucas and Kanade (threshold λ2 ≥ 1.0)      18.00               14.80                35.11
Nagel                                      17.11               14.55                100
Anandan (multiresolution)                  12.25               10.03                100
Fleet and Jepson (τ = 1.25)                9.73                7.73                 36.13
LMedS                                      11.56               9.62                 88.48
Odobez and Bouthemy                        13.21               11.12                97.77
LMedS (multiresolution)                    7.26                5.72                 89.03
Odobez and Bouthemy (multiresolution)      10.19               8.65                 93.41
Figure 17. (a)–(k) Reference frame and optical flow fields for the Flower Garden test sequence.
However, Fleet and Jepson’s method produces a flow field with very low density and, as mentioned before, most of the flow on the tree and a major section of the flower bed in the foreground has been filtered out. As the flower bed section has the most rapid variations in greylevel, filtering it away results in the apparently low DFD error given by Fleet and Jepson’s technique. In contrast, Lucas and Kanade’s method with thresholding filtered
away most of the flow on the tree and the sky section (which has slow variations in greylevel) instead of the flower bed section, causing the resulting DFD error to increase tremendously, although it too gives a flow field with a very low density.

4. Conclusions
This paper has presented a technique to provide good estimates of the optical flow in regions of motion discontinuities and occlusions and also for large motions
from image sequences. The proposed technique is based on the application of a LMedS robust estimator to compute the optical flow using the motion constraint equation and a 2D affine model of the flow. It has been shown that the LMedS method gives better results for both large and small motions than most, if not all, of the techniques studied here. A major shortcoming of the method is its high computational cost. The current implementation takes about an hour to process a 128 × 128 pixel frame pair on a Sparc-10 workstation, whereas Lucas and Kanade’s method takes about 10 s, Anandan’s method about 5 min, and Odobez and Bouthemy’s about 1 min. There are a number of ways of reducing the processing time. The first is to remove the block shifting. As can be seen from the experiment reported in Section 3.4.1, this still produces excellent results, but the accuracy of the motion boundaries indicated by the confidence image is degraded. In this case the speed-up would be of the order of (maximum block shift / block shift step)². A second possibility is to reduce the number of samples taken to compute the LMedS estimate. Again, this produces degraded results. Finally, as reported in Rousseeuw and Leroy (1987), a parallelised version of the estimator is possible, whereby trial LMedS parameter estimates for each sample can be computed by independent processing elements. Such an implementation of the multivariate LMedS estimator has actually been tested at the IBM research centre, as reported in Rousseeuw and Leroy (1987, page 201).
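To indicate the structure being parallelised, a minimal single-block version of the random-sampling LMedS search is sketched below (Python/NumPy; an illustration consistent with the affine model and gradient-based motion constraint used in the paper, not the authors’ optimised implementation). Each pixel contributes one constraint residual, a minimal subset of six pixels determines a candidate affine motion, and the candidate minimising the median of squared residuals is retained; the trial fits are mutually independent, which is what makes the estimator parallelisable.

import numpy as np

def lmeds_affine_flow(Ix, Iy, It, x, y, eps=0.5, prob=0.99, rng=None):
    # Residual at pixel i, for affine flow u = a0 + a1*x + a2*y and
    # v = a3 + a4*x + a5*y:  r_i = Ix_i*u_i + Iy_i*v_i + It_i
    rng = np.random.default_rng() if rng is None else rng
    A = np.column_stack([Ix, Ix * x, Ix * y, Iy, Iy * x, Iy * y])
    b = -It
    p = 6  # affine parameters => minimal subsets of six constraints
    # Number of subsets so that, with probability 'prob', at least one
    # is outlier-free given an outlier fraction 'eps' (Rousseeuw and
    # Leroy, 1987).
    m = int(np.ceil(np.log(1.0 - prob) / np.log(1.0 - (1.0 - eps) ** p)))
    best_theta, best_med = None, np.inf
    for _ in range(m):
        idx = rng.choice(len(b), size=p, replace=False)
        theta, *_ = np.linalg.lstsq(A[idx], b[idx], rcond=None)
        med = np.median((A @ theta - b) ** 2)
        if med < best_med:
            best_theta, best_med = theta, med
    return best_theta, best_med  # parameters and LMedS criterion value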
5. Existing Problems and Future Work
The proposed approach has been able to overcome one of the problems commonly faced in computing optical flow, namely providing better estimates of the optical flow in the vicinity of motion discontinuities and occlusions. However, some of the other problems remain unsolved. For example, the proposed approach marks regions of occlusion and motion discontinuity as outliers without further processing such ambiguous regions. It may be possible to compute the optical flow for the occlusion regions using information explicitly from three or more frames. In addition, it is well known that optical flow estimation based on the intensity constancy assumption does not work well when the image is not sufficiently textured, as is the case for man-made objects that have little texture on them. The integration of feature-based approaches to 2D motion estimation into the dense optical flow
approach may improve the estimation of optical flow in such insufficiently textured regions (provided distinct features can be identified in such regions). Another interesting possibility for improving the performance of the method is to use a block splitting strategy in order to adapt the block size to the local region structure, as is commonly done in block matching algorithms for video compression applications. This would undoubtedly improve the flow estimates in coherently moving regions whose area is less than one half the area of a processing block, because such a region could never be the majority inlying region in the block. These regions are currently declared as outlying regions and no flow values are computed for them. Another point worth noting is that if the boundaries of the independently moving objects can be extracted accurately from the image sequence, then a better estimate of the optical flow can be obtained, since each of the segmented regions then contains data points that are consistent with each other and belong to the same motion model. However, as pointed out in Horn (1986), motion estimation and segmentation is a “chicken and egg” problem that cannot be easily resolved. One possibility is to use intensity information in the images as a guide to motion estimation by delineating regions that belong to the same brightness function model. However, such an approach will not work for images that are finely textured, or even on simple random dot patterns. Thus, future work should look into fusing the intensity and texture information from the reference image, and also into utilising the reliable but sparse features and the motion information that can be computed from more than two frames of the image sequence.

Acknowledgment

The authors would like to extend their thanks to the anonymous referees who, through their detailed suggestions and comments, have improved the quality and presentation of this paper.

Notes

1. Parts of the work have been presented in Ong and Spann (1995, 1996).
2. Lucas and Kanade’s, Horn and Schunck’s, Nagel’s, Anandan’s, and Fleet and Jepson’s techniques were implemented in Barron et al. (1994). The provision of the implementations by J.L. Barron and of the test image sequences is gratefully acknowledged.
3. Obviously it is not appropriate to use a non-overlapping block strategy for LMedS parameter estimation, as inlier/outlier
partitioning would result in many pixels not having flow vectors assigned.
4. This Translating Tree image sequence, together with the Diverging Tree, Yosemite, SRI, Rubik Cube, and Taxi sequences, was provided by J.L. Barron. The Translating and Diverging Tree sequences were created by D. Fleet, the Yosemite sequence by L. Quam, and the Rubik Cube sequence by R. Szeliski. The SRI sequence was originally provided by SRI International, while the Taxi sequence was originally provided by the University of Hamburg.
References

Anandan, P. 1989. A computational framework and an algorithm for the measurement of visual motion. International Journal of Computer Vision, 2:282–310.
Barnard, S.T. and Thompson, W.B. 1980. Disparity analysis of images. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2:333–340.
Barron, J.L., Fleet, D.J., and Beauchemin, S.S. 1994. Performance of optical flow techniques. International Journal of Computer Vision, 12:43–77.
Bigun, J., Granlund, G.H., and Wiklund, J. 1991. Multidimensional orientation estimation with applications in texture analysis and optical flow. IEEE Transactions on Pattern Analysis and Machine Intelligence, 13:775–790.
Black, M.J. and Anandan, P. 1996. The robust estimation of multiple motions: Parametric and piecewise-smooth flow fields. Computer Vision and Image Understanding, 63:75–104.
Black, M.J. and Rangarajan, A. 1996. On the unification of line processes, outlier rejection, and robust statistics with applications in early vision. International Journal of Computer Vision, 19:57–91.
Bober, M. and Kittler, J. 1994. Robust motion analysis. In Proceedings of Computer Vision and Pattern Recognition, pp. 947–952.
Campani, M. and Verri, A. 1992. Motion analysis from first-order properties of optical flow. CVGIP: Image Understanding, 56:90–107.
Darrell, T. and Pentland, A.P. 1995. Cooperative robust estimation using layers of support. IEEE Transactions on Pattern Analysis and Machine Intelligence, 17:474–487.
Draper, N.R. and Smith, H. 1981. Applied Regression Analysis. 2nd edition, John Wiley & Sons: New York.
Fennema, C.L. and Thompson, W.B. 1979. Velocity determination in scenes containing several moving objects. Computer Graphics and Image Processing, 9:301–315.
Fleet, D.J. and Jepson, A.D. 1992. Computation of component image velocity from local phase information. International Journal of Computer Vision, 5:77–104.
Heitz, F. and Bouthemy, P. 1990. Multimodal motion estimation and segmentation using Markov random fields. In Proceedings of 10th International Conference on Computer Vision, pp. 378–382.
Heitz, F. and Bouthemy, P. 1993. Multimodal estimation of discontinuous optical flow using Markov random fields. IEEE Transactions on Pattern Analysis and Machine Intelligence, 15:1217–1232.
Horn, B.K.P. 1986. Robot Vision. MIT Press: Cambridge.
Horn, B.K.P. and Schunck, B.G. 1981. Determining optical flow. Artificial Intelligence, 17:185–203.
Huang, Y., Palaniappan, K., Zhuang, X., and Cavanaugh, J.E. 1995. Optic flow field segmentation and motion estimation using a robust genetic partitioning algorithm. IEEE Transactions on Pattern Analysis and Machine Intelligence, 17:1177–1190.
Konrad, J. and Dubois, E. 1992. Bayesian estimation of motion vector fields. IEEE Transactions on Pattern Analysis and Machine Intelligence, 14:910–927.
Lin, T. and Barron, J.L. 1994. Image reconstruction error for optical flow. In Proc. Vision Interface, pp. 73–80.
Lucas, B. and Kanade, T. 1981. An iterative image registration technique with an application to stereo vision. In Proceedings of 7th International Joint Conference on Artificial Intelligence, pp. 674–679.
Meyer, F.G. and Bouthemy, P. 1994. Region-based tracking using affine motion models in long range image sequences. CVGIP: Image Understanding, 60:119–140.
Murray, D.W. and Buxton, B.F. 1987. Scene segmentation from visual motion using global optimisation. IEEE Transactions on Pattern Analysis and Machine Intelligence, 9:220–228.
Negahdaripour, S. and Lee, S. 1992. Motion recovery from image sequences using only first order optical flow information. International Journal of Computer Vision, 9:163–184.
Odobez, J.M. and Bouthemy, P. 1994a. Robust multiresolution estimation of parametric motion models in complex image sequences. In Proceedings of 7th European Conference on Signal Processing.
Odobez, J.M. and Bouthemy, P. 1994b. Robust multiresolution estimation of parametric motion models applied to complex scenes. IRISA, France, Technical Report No. 788.
Ong, E.P. and Spann, M. 1995. Robust computation of optical flow. In Proc. 5th British Machine Vision Conference, Birmingham, UK, pp. 573–582.
Ong, E.P. and Spann, M. 1996. Robust multiresolution computation of optical flow. In Proc. IEEE International Conference on Acoustics, Speech, and Signal Processing, Atlanta, USA, pp. 1939–1942.
Otte, M. and Nagel, H.H. 1994. Optical flow estimation: Advances and comparisons. In Proc. 3rd European Conference on Computer Vision, pp. 51–60.
Rousseeuw, P.J. and Leroy, A.M. 1987. Robust Regression and Outlier Detection. John Wiley & Sons: New York.
Schunck, B.G. 1989. Image flow segmentation and estimation by constraint line clustering. IEEE Transactions on Pattern Analysis and Machine Intelligence, 11:1010–1027.
Torr, P.H.S., Beardsley, P.A., and Murray, D.W. 1994. Robust vision. In Proc. British Machine Vision Conference, vol. 2, pp. 145–154.
Weber, J. and Malik, J. 1995. Robust computation of optical flow in a multiscale differential framework. International Journal of Computer Vision, 14:67–81.
Zhang, J. and Hanauer, G.G. 1995. The application of mean field theory to image motion estimation. IEEE Transactions on Image Processing, 4:19–33.