2013 IEEE Intelligent Vehicles Symposium (IV) June 23-26, 2013, Gold Coast, Australia
Stereo-based Motion Detection and Tracking from a Moving Platform
Victor Romero-Cano and Juan I. Nieto
Australian Centre for Field Robotics, University of Sydney
E-mail: {varomero,j.nieto}@acfr.usyd.edu.au

Abstract— This paper presents a motion detection approach based on a combination of dense optical flow and 3D stereo reconstruction. Our motion detection is not based on predefined templates, providing a generic framework suitable for a broad range of applications such as situation awareness. The approach estimates the likelihood of pixel motion by fusing dense optical flow with dense depth information estimated from a stereo camera. Temporal consistency is incorporated by tracking moving objects across consecutive images. The proposed algorithm is validated with publicly available datasets. The consistent results across different scenarios demonstrate the robustness of our framework, with an average detection rate of 92%.
INTRODUCTION

Motion detection is of significant importance for advanced driving assistance systems (ADAS) and autonomous navigation. Being able to detect dynamic objects such as vehicles, cyclists or pedestrians and to estimate their positions and dynamics allows drivers or autonomous vehicles to increase their level of awareness. Traditionally, mobile robotics research in navigation has relied on the assumption of static environments [1]. This assumption dramatically limits the number of applications and environments where mobile robots can operate [2]. Efficient handling of dynamic scenes is also important for driver assistance systems. Moving objects provide information about the norms ruling the environment, which can, for example, be used to learn expected behaviours and detect anomalies. Several vision-based motion detection approaches have been proposed over the last decade. The frameworks presented in [3], [4] detect objects of interest by using a classifier. These approaches showed a high detection rate. Nevertheless, their application is limited to environments where the previously defined object templates are valid. Approaches that consider motion cues in the object detection task are more flexible and can be adopted in a wider diversity of applications [5], [6], [7]. Vision-based motion detection from a moving platform can be carried out by comparing the measured and predicted optical flow between two consecutive frames. The former can be obtained using dense optical flow, whereas the latter can be determined using depth information from a stereo camera and the vehicle's dynamics. In [5], a prediction of the optical flow between two consecutive frames is calculated as a function of the current scene depth and egomotion. From the difference between the predicted and measured flow fields, large non-zero blobs are flagged as potential moving objects. Although this motion detection scheme provides dense results, the approach does not consider the noise involved in the perception task.
Fig. 1: The motion detection framework pipeline. From top to bottom: disparity image, optical flow field, motion likelihood and detections.
As a result, the system may be prone to produce a large number of false positives or missed detections. In contrast, the implementations proposed in [8], [6] take into account the uncertainties associated with the observations and their propagation through the system. Unlike our framework, these approaches are based on sparse features, which cannot guarantee the detection of all moving pixels as a dense approach could. This paper presents a probabilistic approach for motion detection that incorporates both the uncertainties associated with the platform motion and the uncertainties introduced during the perception process. Fig. 1 shows a block diagram of the proposed approach. Initially, the predicted optical flow is determined based on a dense disparity map and the vehicle's egomotion. The latter can be obtained either from an on-board navigation system or from visual odometry [9]. The error associated with the observations is then propagated to both the measured and predicted optical flow. Subsequently, a chi-squared test is applied to the Mahalanobis distance between the predicted and measured flow. This process is conducted in the motion detection stage, which provides the likelihood of pixel motion. Finally, temporal consistency is imposed by tracking blobs of moving pixels along consecutive images. The paper is organised as follows. Section I briefly reviews the literature. Section II describes the adopted formulation for the predicted and measured optical flow, followed by a description of the uncertainty modelling and the probabilistic motion detection approach. The experimental results, presented in Section III, are followed by the conclusions in Section IV.
A. Optical Flow: Prediction

Given a pixel position x in image i, and the rotation R and translation t between camera positions i and i + 1, the position of pixel x in image i + 1 can be calculated using Eq. 1 [18].
I. RELATED WORK
\hat{x}' = K R K^{-1} x + K t / Z, \qquad (1)
Vision-based motion detection has been a relevant subject of research in areas like surveillance and robotics since the late seventies. The works presented in [10], [11] and [12] highlighted the importance of motion perception and provided the first techniques to calculate a measure of visual motion named optical flow. Since then, a large amount of research has been devoted to providing efficient and accurate methods to calculate the motion of multiple moving objects from visual cues [13], [14], [15]. The problem of motion detection from video streams recorded with a static camera has very mature solutions [16]. On the other hand, solving the problem when the camera is moving presents additional challenges. Since the camera motion induces intensity changes in the entire visual field, even static objects appear as moving objects. Therefore, a static model of the background cannot be estimated. One way to segment independently moving objects from the background is to compensate for the platform movement. In order to do that, an estimate of the scene depth must be calculated. In [7], the Flow Vector Bound constraint along with the epipolar constraint are used as cues for dense motion detection. For this type of approach, the uncertainty in the depth of a point spans the entire epipolar line, or at least a section of it. In this work, a stereo vision system is used to obtain a dense depth field from which a prediction of the optical flow is estimated. Although moving objects could be estimated from the difference between the predicted and measured optical flow [5], most state-of-the-art moving-object detection schemes rely on templates of the objects expected to be in the environment [3], [4]. As in [5], [6], [7], we consider pixel motion as the main cue for target detection. Hence, any sort of moving object can be detected. We model the uncertainty in the predicted flow by propagating the noise from the platform motion up to the optical flow prediction, and we use the spatial gradient-based covariance proposed in [17] to provide an uncertainty measure for the measured optical flow. An optical-flow-based motion detection approach is also presented in [7]. The authors report high detection rates; however, the whole framework relies on a computationally expensive implementation of dense optical flow (6.5 minutes per frame). In contrast, although our MATLAB implementation is not real-time capable, the probabilistic modelling that we present is based on a simpler and faster dense optical flow technique (0.5 minutes per frame).
where K represents the intrinsic camera calibration matrix and Z the pixel depth in the camera coordinates of image i. The motion between frames as well as the pixel depths possess uncertainties acquired during their estimation. These uncertainties come from sensor noise and from simplifications made in the models. Ignoring them leads to a larger number of false positives, and heuristics are typically used to reduce those. In our framework, the uncertainty in the predicted optical flow is propagated from the sensors to the final estimate using a first-order Gaussian approximation. As shown in Eq. 1, the predicted optical flow is a function of the camera motion, the current pixel location and its depth. A linear approximation of the optical flow covariance can be obtained as

\Sigma_p = J_p P_a J_p^T, \qquad (2)
where J_p represents the Jacobian matrix of the pixel position in the second image with respect to the point coordinates in the first image and the camera motion, and P_a the covariance of the input variables:

J_p = \begin{bmatrix} \frac{\partial x'}{\partial x} & \frac{\partial x'}{\partial y} & \frac{\partial x'}{\partial R} & \frac{\partial x'}{\partial t} & \frac{\partial x'}{\partial Z} \\ \frac{\partial y'}{\partial x} & \frac{\partial y'}{\partial y} & \frac{\partial y'}{\partial R} & \frac{\partial y'}{\partial t} & \frac{\partial y'}{\partial Z} \end{bmatrix},

P_a = \mathrm{diag}(\sigma_x^2, \sigma_y^2, \sigma_R^2, \sigma_t^2, \sigma_Z^2).
The variances σ_x^2 and σ_y^2 describe the quantisation error of the camera, σ_R^2 and σ_t^2 the uncertainty in the camera motion, and σ_Z^2 the uncertainty in the estimated point depth. The latter is in our case estimated from a stereo camera and can therefore be obtained as

\sigma_Z = \frac{Z^2}{B f}\,\sigma_d,

with σ_d representing the uncertainty of the stereo disparity calculation, B the camera baseline and f the focal length.
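To make Eqs. 1 and 2 concrete, the following is a minimal numerical sketch, not the paper's implementation: it predicts the pixel position from depth and egomotion and propagates the input covariance P_a through the Jacobian. The rotation-vector parameterisation of R, the 9-dimensional input vector (so P_a becomes a 9 × 9 diagonal matrix built from the variances above), and the use of finite differences instead of the analytic Jacobian are assumptions made for brevity.

```python
import numpy as np
from scipy.spatial.transform import Rotation


def predict_pixel(a, K):
    """Eq. 1: predicted pixel position in image i+1.
    a = [x, y, rx, ry, rz, tx, ty, tz, Z]; (rx, ry, rz) is a rotation vector
    (an assumed parameterisation of R) and (tx, ty, tz) the translation t."""
    x, y, rx, ry, rz, tx, ty, tz, Z = a
    R = Rotation.from_rotvec([rx, ry, rz]).as_matrix()
    t = np.array([tx, ty, tz])
    p = np.array([x, y, 1.0])
    q = K @ R @ np.linalg.inv(K) @ p + (K @ t) / Z
    return q[:2] / q[2]                      # normalise the homogeneous coordinates


def predicted_flow_covariance(a, K, Pa, eps=1e-6):
    """Eq. 2: Sigma_p = J_p P_a J_p^T, with J_p estimated by finite differences.
    Pa is the (here 9 x 9) diagonal covariance of the input vector a."""
    a = np.asarray(a, dtype=float)
    f0 = predict_pixel(a, K)
    J = np.zeros((2, a.size))
    for i in range(a.size):
        da = np.zeros(a.size)
        da[i] = eps
        J[:, i] = (predict_pixel(a + da, K) - f0) / eps
    return J @ Pa @ J.T


def depth_sigma(Z, B, f, sigma_d):
    """Stereo depth uncertainty: sigma_Z = Z^2 / (B f) * sigma_d."""
    return (Z ** 2) / (B * f) * sigma_d
```

The finite-difference Jacobian simply avoids writing out the analytic derivatives of Eq. 1; an analytic J_p, as used in the paper, would give the same first-order propagation.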
B. Optical Flow: Measurement

Estimating the uncertainty in the optical flow field is in general a difficult problem [19]. In addition, different optical flow techniques provide different performance. To make our system independent of the method selected, the uncertainty in the measured optical flow is obtained from a lower-bound estimate. The basic starting point for any optical flow calculation is the brightness constancy assumption.
II. FRAMEWORK

This section describes the different parts that make up our framework, as shown in Fig. 1.
The covariance matrix is C = Σ_p + Σ_v, and the threshold λ is obtained from the inverse χ² cumulative distribution at a significance level α. In our implementation we use α = 0.1. Moving pixels are therefore detected as false alarms in our model, i.e. measurements for which the validation test fails. After applying the validation gate to all pixels in the image, we obtain blobs representing potential moving objects. These blobs are subsequently fed to a tracking module that imposes temporal consistency and estimates the dynamics of the moving objects.
By considering the tracking of pixels with constant brightness, the image temporal derivative yields the gradient constraint equation (3), from which an estimator of the image velocity field can be derived:

f_s \cdot v + f_t = 0. \qquad (3)
Several sources of error can affect this model, the aperture problem and changes in illumination being the two most common ones. In order to represent these sources of uncertainty explicitly, [17] proposed to describe them with additive Gaussian terms in Eq. 3. We briefly explain next how to derive a conservative estimate of the optical flow uncertainty. Using Bayes' rule, the probability distribution of the image velocities is

P(v \mid f_s, f_t) = \frac{P(f_t \mid v, f_s)\,P(v)}{P(f_t)}. \qquad (4)
D. Imposing Temporal Consistency

In order to prune false positives, temporal consistency is incorporated by tracking detections across consecutive images. A track is initialised for each new detection that has not been associated with any existing track. In our implementation a minimum of three frames is needed to report a detection: the first two frames are needed to detect moving pixels and the third is used to validate the detection. Using a larger number of frames would increase robustness, but it would slow the system's reaction time. Tracking is performed using Kalman filters. A track is initialised for each detected blob and propagated using a constant velocity model. Detections are assigned to tracks by minimising a cost function based on the Mahalanobis distance between the predicted and measured centroids. Munkres' version of the Hungarian algorithm is used to compute the assignment that minimises the total cost.
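As an illustration of this tracking stage, the sketch below implements a constant-velocity Kalman filter over blob centroids and a Mahalanobis-cost assignment solved with the Hungarian method (scipy's linear_sum_assignment). The time step follows the 10 Hz frame rate of the datasets, while the noise magnitudes and initial covariance are illustrative assumptions, not values from the paper.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment


class BlobTrack:
    """Constant-velocity Kalman filter over a blob centroid; state is (x, y, vx, vy)."""

    def __init__(self, centroid, dt=0.1, q=1.0, r=5.0):
        # dt matches the 10 Hz frame rate; q and r are illustrative noise levels.
        self.x = np.array([centroid[0], centroid[1], 0.0, 0.0], dtype=float)
        self.P = np.eye(4) * 10.0
        self.F = np.array([[1, 0, dt, 0],
                           [0, 1, 0, dt],
                           [0, 0, 1, 0],
                           [0, 0, 0, 1]], dtype=float)
        self.H = np.array([[1, 0, 0, 0],
                           [0, 1, 0, 0]], dtype=float)
        self.Q = np.eye(4) * q
        self.R = np.eye(2) * r

    def predict(self):
        """Propagate the state and covariance one frame ahead."""
        self.x = self.F @ self.x
        self.P = self.F @ self.P @ self.F.T + self.Q
        return self.x[:2]                      # predicted centroid

    def update(self, z):
        """Correct the state with a measured centroid z = (x, y)."""
        S = self.H @ self.P @ self.H.T + self.R
        K = self.P @ self.H.T @ np.linalg.inv(S)
        self.x = self.x + K @ (np.asarray(z, dtype=float) - self.H @ self.x)
        self.P = (np.eye(4) - K @ self.H) @ self.P


def associate(tracks, detections):
    """Assign detected centroids to tracks by minimising the total Mahalanobis
    cost between predicted and measured centroids (Hungarian/Munkres method)."""
    cost = np.zeros((len(tracks), len(detections)))
    for i, trk in enumerate(tracks):
        S = trk.H @ trk.P @ trk.H.T + trk.R    # innovation covariance
        for j, det in enumerate(detections):
            r = np.asarray(det, dtype=float) - trk.x[:2]
            cost[i, j] = r @ np.linalg.solve(S, r)
    return linear_sum_assignment(cost)         # (track indices, detection indices)
```

Unassigned detections would spawn new tracks, and tracks without associated detections for several frames would be terminated; those bookkeeping rules are omitted here.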
The selected measurement model provides a relation between the temporal and spatial image derivatives:

f_t = -f_s \cdot v + f_s \cdot n_1 + n_2, \qquad (5)
where n_1 and n_2 are independent additive Gaussian noise terms with covariances R_1 and σ_2 respectively, describing the error in the image velocity and in the gradient constraint. Assuming a prior model that favours slow local flows, and assuming independence between n_1 and n_2 and between the components of the spatial gradient, the likelihood function and the prior model are expressed as

P(f_t \mid v, f_s) = \mathcal{N}(f_t \mid -f_s \cdot v,\; f_s^T R_1 f_s + \sigma_2), \qquad (6)

P(v) = \mathcal{N}(0, R_p). \qquad (7)
E. Regions of Interest Generation

To make better use of the information described in the previous sections, some geometric constraints are considered in order to define regions of interest (ROIs) where valid moving objects can be found. Since the moving objects of interest are not taller than two metres, ROIs are obtained by detecting and removing tall structures from the workspace. To do so, a least-squares estimate of the ground plane is obtained and the orthogonal distance of each point in the point cloud to that plane is calculated. A polar grid is then fitted to the point cloud, and points falling in cells whose distance to the ground plane is greater than 2.5 m are removed, as shown in Fig. 2.
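A minimal sketch of the ground-plane step is given below: it fits a plane z = ax + by + c to the stereo point cloud by least squares and keeps only points whose orthogonal distance to the plane is below the 2.5 m threshold. The polar-grid binning used to group removed points into cells is omitted for brevity.

```python
import numpy as np


def roi_mask(points, max_height=2.5):
    """points: N x 3 array of 3D points. Returns a boolean mask selecting points
    whose orthogonal distance to the least-squares ground plane is at most
    max_height, so that tall static structures are removed (Sec. II-E)."""
    A = np.c_[points[:, 0], points[:, 1], np.ones(len(points))]
    coeffs, *_ = np.linalg.lstsq(A, points[:, 2], rcond=None)   # plane z = a x + b y + c
    a, b, c = coeffs
    normal = np.array([a, b, -1.0])                             # a x + b y - z + c = 0
    dist = np.abs(points @ normal + c) / np.linalg.norm(normal) # orthogonal distance
    return dist <= max_height
```

In practice a robust fit (e.g. RANSAC over near-ground points) would be preferable, since tall structures bias a plain least-squares plane estimate.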
Given the marginal distribution for the prior and the conditional for the likelihood in the form defined by Eq. 6 and Eq. 7, the covariance matrix that describes the posterior over the per-pixel image velocities is

\Sigma_v = \left[ R_p^{-1} + f_s \left( f_s^T R_1 f_s + \sigma_2 \right)^{-1} f_s^T \right]^{-1}. \qquad (8)
Rewriting Eq. 8 with M = f_s f_s^T as the gradient matrix, σ_1 and σ_2 as the variances of the spatial and temporal derivative noise respectively (so that R_1 = σ_1 I and f_s^T R_1 f_s = σ_1 ‖f_s‖²), and ‖f_s‖ as the pixel contrast, the covariance matrix of the measured optical flow can be estimated as

\Sigma_v = \left[ \frac{M}{\sigma_1 \|f_s\|^2 + \sigma_2} + R_p^{-1} \right]^{-1}. \qquad (9)
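A per-pixel version of Eq. 9 can be sketched as follows, under the assumption (consistent with [17]) that M is the outer product of the spatial gradient; the spatial derivatives would be obtained from image gradients and the temporal derivative from frame differencing.

```python
import numpy as np


def measured_flow_covariance(fs, sigma1, sigma2, Rp):
    """Eq. 9: lower-bound covariance of the measured flow at one pixel.
    fs: 2-vector of spatial image derivatives (f_x, f_y); Rp: 2x2 prior covariance."""
    fs = np.asarray(fs, dtype=float)
    M = np.outer(fs, fs)                       # gradient (outer-product) matrix
    denom = sigma1 * (fs @ fs) + sigma2        # sigma1 * ||fs||^2 + sigma2
    return np.linalg.inv(M / denom + np.linalg.inv(Rp))
```

Low-contrast pixels (small ‖f_s‖) yield a covariance close to the prior R_p, which is exactly the conservative behaviour intended by the lower-bound estimate.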
III. EXPERIMENTAL VALIDATION

To validate our approach we used publicly available datasets [20]. Two different scenarios, city and highway, were considered for testing our motion detection technique. The datasets account for different conditions in terms of clutter and illumination as well as platform velocity (25 km/h on average). All sequences were recorded at 10 Hz. Sequences 1 and 2 have image resolutions of 1238×374 and 1242×375 pixels respectively. The baseline of both stereo rigs is approximately 0.54 m, with a focal length of 721.54 pixels.
C. Motion Likelihood Estimation

Let us assume that the measured optical flow at each pixel location is distributed according to a Gaussian centred at the measurement prediction (the predicted optical flow) \hat{x}' with covariance Σ_p. Static pixels are then accepted by an ellipsoidal validation gate between the predicted and measured optical flow:

v(\lambda) = \{ x' : (x' - \hat{x}')^T C^{-1} (x' - \hat{x}') \le \lambda \}. \qquad (10)
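A per-pixel sketch of this validation gate is given below. It assumes a 2-DOF chi-square threshold (the flow residual is two-dimensional), which matches the value of 4.6 quoted for a p-value of 0.1 in Section III.

```python
import numpy as np
from scipy.stats import chi2


def is_static(x_meas, x_pred, Sigma_p, Sigma_v, alpha=0.1):
    """Eq. 10: accept a pixel as static when the Mahalanobis distance between
    the measured and predicted flow falls inside the ellipsoidal gate; pixels
    that fail the test are flagged as potentially moving."""
    C = Sigma_p + Sigma_v
    r = np.asarray(x_meas, dtype=float) - np.asarray(x_pred, dtype=float)
    d2 = float(r @ np.linalg.solve(C, r))
    lam = chi2.ppf(1.0 - alpha, df=2)          # assumed 2-DOF gate
    return d2 <= lam
```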
B. Results

Fig. 3 presents the detection results obtained by our algorithm. Figs. 3(a) and 3(c) show the Mahalanobis distance between the predicted and measured optical flow for the two sequences: the brighter a pixel, the higher its motion likelihood. Detections are obtained by thresholding the motion likelihood image with a χ² value of 4.6, which corresponds to a p-value of 0.1. Figs. 3(b) and 3(d) show the detection of targets moving in several directions, e.g. towards, away from and across the platform.
Fig. 2: Regions of interest generation. (a) One frame of sequence 1 with the estimated ground plane. (b) Image coloured using the orthogonal distance to the ground plane. (c) Image coloured according to each pixel's angular location in a horizontal polar grid. (d) Image coloured according to each pixel's radial location in a horizontal polar grid. (e) Final ROIs.

Fig. 3: Detection results for sequences 1 and 2. Detected moving objects are marked by white bounding boxes. (a) Motion likelihood, sequence 1. (b) Detection output, sequence 1. (c) Motion likelihood, sequence 2. (d) Detection output, sequence 2.
A. Data Processing

The OpenCV implementation of Semi-Global Block Matching [21] is used to calculate dense disparity fields. Subsequently, an implementation based on [22] is used to calculate a dense optical flow field between the current and previous left images. Our method compares the scene motion induced by the vehicle displacement with the measured scene optical flow, which may include changes due to independently moving objects. Therefore, the vehicle's localisation must be estimated in order to compensate for scene changes in the image plane. In our experiments, egomotion is obtained from the GPS/IMU navigation solution provided with the datasets. In places where the navigation solution fails, a stereo-based visual odometry algorithm is used as a backup [23].
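The stereo and flow pre-processing could be sketched with OpenCV as below. The file names and SGBM parameters are placeholders, and Farneback's method stands in for the combined local-global flow of [22] that the paper actually uses.

```python
import cv2
import numpy as np

# Hypothetical file names; in practice the images come from the KITTI sequences [20].
left_prev = cv2.imread("left_000.png", cv2.IMREAD_GRAYSCALE)
left_curr = cv2.imread("left_001.png", cv2.IMREAD_GRAYSCALE)
right_curr = cv2.imread("right_001.png", cv2.IMREAD_GRAYSCALE)

# Dense disparity via Semi-Global Block Matching [21]; parameters are illustrative.
sgbm = cv2.StereoSGBM_create(minDisparity=0, numDisparities=128, blockSize=5)
disparity = sgbm.compute(left_curr, right_curr).astype(np.float32) / 16.0  # fixed point -> pixels

# Depth from disparity using the rig parameters reported in Section III.
B, f = 0.54, 721.54                            # baseline [m], focal length [px]
depth = np.where(disparity > 0, B * f / disparity, np.nan)

# Dense optical flow between the previous and current left images
# (Farneback here, only as a readily available stand-in for [22]).
flow = cv2.calcOpticalFlowFarneback(left_prev, left_curr, None,
                                    0.5, 3, 15, 3, 5, 1.2, 0)
```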
Fig. 4 presents the tracking results. Each detection is associated with its most likely track. Note that the objects traverse cluttered scenes with changing lighting conditions and consistent detections are still obtained. Furthermore, objects that remained in the camera's field of view in the last three images (the first, middle and last images of a set of 115 frames from sequence 1) were correctly associated with the tracks they belonged to. The performance of our detection and tracking scheme is described next.
TABLE I: Confusion matrix for the experimental results obtained. Values include sequences 1 and 2; primed labels denote the detector output, unprimed labels the ground truth.

                    Moving               Static
Moving'      61 True Positives    30 False Positives
Static'      5 False Negatives    n/a
Finally, Fig. 5 presents the number of false and true positives in sequence 1 for three choices of the threshold λ in the χ² test (3.22, 4.6 and 5.99). Since we cannot normalise the number of false positives, Fig. 5 is not a ROC (Receiver Operating Characteristic) curve. Nevertheless, it shows the sensitivity of our algorithm on a dataset with a wide variety of environmental conditions (a video with the experimental results is available at [24]).
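For reference, the three thresholds used in Fig. 5 are consistent with a 2-DOF chi-square gate, which can be checked in a few lines (the degrees of freedom are an assumption, motivated by the two-dimensional flow residual):

```python
from scipy.stats import chi2

# Chi-square thresholds for a 2-DOF residual at p-values 0.2, 0.1 and 0.05.
for p in (0.2, 0.1, 0.05):
    print(f"p = {p}: lambda = {chi2.ppf(1.0 - p, df=2):.2f}")
# -> 3.22, 4.61 (quoted as 4.6) and 5.99
```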
Fig. 4: Images with detection and tracking results. The last three images correspond to a sequence of 115 frames in sequence 1. They show how objects that remain in the scene are correctly associated with their respective tracks (track numbers 239, 243 and 259).
Fig. 5: Number of false positives and true positives for three different χ² values in sequence 1.
The raw detector generates sparse false positives due to reflections and illumination changes. Subsequent tracking reduces the number of false positives, but it lowers the recall, since a larger number of frames is required to initialise a track. Table I shows the number of correct and wrong detections of moving objects in both test sequences. We used two frames for detection and one frame for temporal consistency checking. True positives are moving objects that were detected at least once; false positives are static objects detected as moving; and false negatives are moving objects that were not detected at any point during their appearance in the scene. Since the algorithm detects objects from motion, no direct assessment of static objects is made, and the true-negative count does not apply to our performance analysis. From Table I, the precision and recall of the detection algorithm can be calculated: a precision of 64% and a recall of 92% were obtained.
C. Discussion

We found that most of the false positives are due to reflections on glass surfaces in the environment, an open and challenging problem in the computer vision community. False negatives, in turn, are caused by objects that were not reconstructed by the stereo algorithm. The latter usually happens when objects are closer than 5 m to the camera, so that their disparity blobs are too fragmented to be accepted by the disparity algorithm. Objects moving at low velocities and/or located in poorly illuminated parts of the scene are also unlikely to be detected. Fig. 6 presents one case in which a false positive is caused by a car reflected on the surface of a glass pane (see the bounding box at the right-hand side of Fig. 6), and a false negative corresponding to a person walking at a very low velocity along a poorly illuminated section of the scene (see the bounding circle in Fig. 6). Furthermore, Fig. 7 shows a case in which a moving object was not completely reconstructed by the stereo algorithm and was therefore not detected.
[2] H. Durrant-Whyte, D. Pagac, B. Rogers, M. Stevens, and G. Nelmes, "An Autonomous Straddle Carrier for Movement of Shipping Containers," IEEE Robotics & Automation Magazine, vol. 14, pp. 14–23, Sept. 2007.
[3] D. M. Gavrila and S. Munder, "Multi-cue Pedestrian Detection and Tracking from a Moving Vehicle," International Journal of Computer Vision, vol. 73, no. 1, pp. 41–59, July 2006.
[4] A. Ess, B. Leibe, K. Schindler, and L. Van Gool, "Moving Obstacle Detection in Highly Dynamic Scenes," in IEEE International Conference on Robotics and Automation, 2009, pp. 56–63.
[5] A. Talukder and L. Matthies, "Real-time Detection of Moving Objects from Moving Vehicles using Dense Stereo and Optical Flow," in IEEE/RSJ International Conference on Intelligent Robots and Systems, Sendai, Japan, 2004, pp. 3718–3725.
[6] H. Badino and T. Kanade, "A Head-Wearable Short-Baseline Stereo System for the Simultaneous Estimation of Structure and Motion," in IAPR Conference on Machine Vision Applications (MVA), 2011.
[7] R. K. Namdev, A. Kundu, K. M. Krishna, and C. V. Jawahar, "Motion Segmentation of Multiple Objects from a Freely Moving Monocular Camera," in IEEE International Conference on Robotics and Automation, 2012.
[8] R. Katz, O. Frank, J. Nieto, and E. Nebot, "Dynamic Obstacle Detection based on Probabilistic Moving Feature Recognition," in Field and Service Robotics, Springer Tracts in Advanced Robotics, C. Laugier and R. Siegwart, Eds. Springer Berlin/Heidelberg, 2008, pp. 83–91.
[9] V. Guizilini and F. Ramos, "Semi-parametric models for visual odometry," in IEEE International Conference on Robotics and Automation, May 2012, pp. 3482–3489.
[10] C. L. Fennema and W. B. Thompson, "Velocity determination in scenes containing several moving objects," Computer Graphics and Image Processing, vol. 9, no. 4, pp. 301–315, Apr. 1979.
[11] B. K. Horn and B. G. Schunck, "Determining optical flow," Artificial Intelligence, vol. 17, no. 1-3, pp. 185–203, Aug. 1981.
[12] B. D. Lucas and T. Kanade, "An Iterative Image Registration Technique with an Application to Stereo Vision," in Proceedings of the Imaging Understanding Workshop, 1981, pp. 121–130.
[13] S. Y. Elhabian, K. M. El-Sayed, and S. H. Ahmed, "Moving Object Detection in Spatial Domain using Background Removal Techniques - State-of-Art," Computer, no. 2, pp. 32–54, 2008.
[14] S. Gauglitz, T. Höllerer, and M. Turk, "Evaluation of Interest Point Detectors and Feature Descriptors for Visual Tracking," International Journal of Computer Vision, vol. 94, no. 3, pp. 335–360, Mar. 2011.
[15] S. Baker, D. Scharstein, J. P. Lewis, S. Roth, M. J. Black, and R. Szeliski, "A Database and Evaluation Methodology for Optical Flow," International Journal of Computer Vision, vol. 92, no. 1, pp. 1–31, Nov. 2010.
[16] N. Buch, S. A. Velastin, and J. Orwell, "A Review of Computer Vision Techniques for the Analysis of Urban Traffic," IEEE Transactions on Intelligent Transportation Systems, vol. 12, no. 3, pp. 920–939, Sept. 2011.
[17] E. P. Simoncelli, "Bayesian Multi-Scale Differential Optical Flow," in Handbook of Computer Vision and Applications, 1998, pp. 397–422.
[18] R. Hartley and A. Zisserman, Multiple View Geometry in Computer Vision. Cambridge University Press, 2000.
[19] V. Willert, J. Eggert, J. Adamy, and E. Körner, "Non-Gaussian velocity distributions integrated over space, time, and scales," IEEE Transactions on Systems, Man, and Cybernetics, vol. 36, no. 3, pp. 482–493, July 2006.
[20] A. Geiger, P. Lenz, and R. Urtasun, "Are we ready for Autonomous Driving? The KITTI Vision Benchmark Suite," in IEEE Conference on Computer Vision and Pattern Recognition, Providence, USA, 2012.
[21] H. Hirschmüller, "Stereo processing by semiglobal matching and mutual information," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 30, no. 2, pp. 328–341, Feb. 2008.
[22] A. Bruhn, J. Weickert, and C. Schnörr, "Lucas/Kanade Meets Horn/Schunck: Combining Local and Global Optic Flow Methods," International Journal of Computer Vision, vol. 61, no. 3, pp. 211–231, 2005.
[23] A. Geiger, J. Ziegler, and C. Stiller, "StereoScan: Dense 3D Reconstruction in Real-time," in IEEE Intelligent Vehicles Symposium, 2011.
[24] "Video with results," 2012. [Online]. Available: http://www.youtube.com/watch?v=BzJzg0Q3Kvg
Fig. 6: False positive caused by a reflection (white bounding box) and false negative caused by low illumination (white bounding circle), at the right-hand side of the image.
Fig. 7: A false negative caused by an object that was not reconstructed by the stereo algorithm is marked by a red bounding box.
IV. CONCLUSIONS

A framework for the dense detection of moving objects from stereo vision has been presented. The method combines dense optical flow and stereo cues with the camera motion, and integrates models that explicitly represent the uncertainty in the predicted and measured optical flow. As a result, our framework provides a high detection rate (92%) without the need for object classifiers. This makes the method context-free and suitable for detecting arbitrary moving objects.

V. ACKNOWLEDGMENTS

This work was supported by the Rio Tinto Centre for Mine Automation and the Australian Centre for Field Robotics.

REFERENCES

[1] T. Bailey and H. Durrant-Whyte, "Simultaneous Localisation and Mapping (SLAM): Part II, State of the Art," IEEE Robotics and Automation Magazine, pp. 1–10, 2006.