QUT Digital Repository: http://eprints.qut.edu.au/

This is the accepted version of the following conference paper: Maire, Frederic, Morris, Timothy, & Rakotonirainy, Andry (2011) Segmentation of scenes of mobile objects and demonstrable backgrounds. In: 6th International Symposium on Autonomous Minirobots for Research and Edutainment (AMiRE 2011), 23-25 May 2011, Bielefeld.

© Copyright 2011. Please consult the authors.

Segmentation of Scenes of Mobile Objects and Demonstrable Backgrounds

Frederic Maire, Timothy Morris and Andry Rakotonirainy

Frederic Maire, FaST QUT and NICTA QRL, Brisbane QLD 4001, e-mail: [email protected]
Timothy Morris, FaST QUT, Brisbane QLD 4001, e-mail: [email protected]
Andry Rakotonirainy, CARRS-Q QUT, Brisbane QLD 4001, e-mail: [email protected]

Abstract In this paper we present a real-time foreground-background segmentation algorithm that exploits the following observation, very often satisfied by a static camera positioned high in its environment: if a blob moves onto a pixel p whose colour had not changed significantly for a few frames, then p was probably part of the background while its colour was static. With this information we are able to differentially update pixels believed to be background. This work is relevant to autonomous minirobots, as they often navigate in buildings where smart surveillance cameras could communicate with them wirelessly. A by-product of the proposed system is a mask of the image regions that are demonstrably background. Statistically significant tests show that the proposed method achieves better precision and recall rates than the state-of-the-art foreground/background segmentation algorithm of the OpenCV computer vision library.

1 Introduction

The segmentation of moving objects in a fixed-camera scene is still a developing area of research because of the many conflicting goals of background model maintenance [11]. Distinguishing background from foreground is a fundamental task of many computer vision applications such as the analysis of video streams of road traffic [1, 6], the distributed control of mobile robots in an environment equipped with static cameras [9], or the monitoring of people in public places [3]. The challenges for building an accurate background model include dealing with dynamic lighting and background motion [11], such as swaying trees or water waves.


Segmentation in videos is predominantly achieved by comparing the current frame to some learned background model [4]. If pixels do not fit the statistical model well, they are classified as foreground. The accuracy of the background model directly affects any subsequent image processing step. The most popular foreground-background segmentation algorithms rely on the adaptation of a statistical model, usually a mixture of Gaussians or a collection of bins. The statistical model must try to meet two conflicting objectives: on one hand it must adapt rapidly to react to sudden changes in lighting conditions (the sun hiding behind a cloud, for example); on the other hand it must have enough inertia not to forget what the background looks like behind a slow-moving foreground object. The time scale of the adaptation of the statistical model critically depends on its learning rate. The experiments presented in Section 4 demonstrate that the state-of-the-art foreground-background segmentation algorithm of the OpenCV computer vision library struggles to simultaneously segment slow-moving and fast-moving objects. The method that we introduce in Section 3 aims at addressing this problem.
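The trade-off can be made concrete with the standard running-average update used by many adaptive background models (our illustration, not a formula from this paper):

    B_t = (1 − α) · B_{t−1} + α · F_t

where F_t is the current frame, B_t the background estimate, and α the learning rate: a large α tracks sudden lighting changes but quickly absorbs slow-moving objects into the background, while a small α does the opposite.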

2 Previous Work

Simple background segmentation methods like [12] use a single Gaussian whose parameters are updated recursively. More sophisticated methods accommodate backgrounds exhibiting multi-modal characteristics with a mixture of Gaussians [10]. These algorithms are capable of learning a statistical model for dynamic backgrounds like waves on water or swaying tree branches [1]. A number of variations of this method have been proposed. For example, in [13] the number of Gaussians per pixel can change adaptively, and in [7] a learning procedure that improves the segmentation accuracy and the model convergence rate is proposed. Other improvements include the modeling of each background pixel with a set of code words [5], and the utilization of a histogram of features per pixel [8]. In [2] the neighbourhood of a pixel is modeled using local binary pattern histograms. Figures 6, 7 and 8 were obtained with the OpenCV library implementation of the state-of-the-art background segmentation algorithm introduced in [8]. The top-right images of these figures show how moving objects smear the image associated with the statistical background model. Figure 8 shows how slow-moving or stopping vehicles get integrated into the background model. In particular, slow-moving elongated homogeneous blobs like the bus do not always get their interior properly segmented, because the statistical model gets habituated too quickly to the interior colour of the blob. The method that we introduce in Section 3 addresses these problems.


3 Proposed Method

Our approach to creating a more robust model exploits the following property of the bird's-eye view of most environments: if a pixel p of the image has not changed significantly from time t − ∆ up to time t − 1, and if at time t the pixel p is covered by a blob that we can trace in the short-term video memory (that is, we have observed the blob moving in the last few frames), then pixel p at time t − ∆ was with high probability a background pixel. Indeed, the most likely explanation of the evolution of the colour of pixel p is that at time t − ∆ pixel p corresponded to a patch of the ground and that a mobile object ran over it at time t. Knowing that a pixel is likely to be background allows us to differentiate the way the pixel model is updated. The more confident we are that a pixel is background, the larger its learning rate should be.

The pseudo-code below outlines our algorithm. Our statistical model consists of a matrix of individual normalized colour histograms, one for each pixel. Steps 3 and 4 can be replaced by the computation of the likelihood of the observed pixel values, and thresholding these probabilities. However, we found that the computationally simpler method of retrieving the most likely image from the statistical model works sufficiently well. It is indeed easy to keep track of which bin of a histogram is the most populated during Steps 15 and 16.

Algorithm 1 Proposed method for foreground-background segmentation
1: while a new frame is available do
2:   grab next frame Ft time-stamped t
3:   retrieve most likely image IM from the statistical scene model
4:   threshold the difference ‖Ft − IM‖ into a binary image IB
5:   apply a morphological close operator to IB
6:   replace the blobs of IB by their convex hulls
7:   attempt to merge neighbouring convexified blobs
8:   for each blob B of Ft do
9:     create a set S of feature points belonging to the blob
10:    track the feature points of S in Ft−1
11:    if any blob B′ from Ft−1 contains a feature point of S then
12:      let B inherit some properties of B′ {like the age of the blob}
13:    end if
14:  end for
15:  update statistical scene model
16:  update the provable background model
17: end while
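The listing below is a minimal sketch of the statistical scene model: one normalized colour histogram per pixel, updated with a per-pixel learning rate alpha that is made larger for pixels believed to be background. The class and parameter names are ours; the paper does not fix the number of bins, and we model each colour channel independently for simplicity.

    // Per-pixel colour-histogram scene model (illustrative sketch).
    #include <opencv2/core/core.hpp>
    #include <vector>

    class PixelHistogramModel {
    public:
        PixelHistogramModel(int rows, int cols, int binsPerChannel = 8)
            : rows_(rows), cols_(cols), bins_(binsPerChannel),
              hist_(static_cast<size_t>(rows) * cols * 3 * binsPerChannel,
                    1.0f / binsPerChannel) {}   // start with uniform histograms

        // Update the histograms of pixel (r, c) with the observed colour.
        // Multiplying every bin by (1 - alpha) and adding alpha to the bin of
        // the observation keeps the histogram normalized (exponential forgetting).
        void update(int r, int c, const cv::Vec3b& colour, float alpha) {
            for (int ch = 0; ch < 3; ++ch) {
                float* h = histPtr(r, c, ch);
                for (int b = 0; b < bins_; ++b) h[b] *= (1.0f - alpha);
                h[colour[ch] * bins_ / 256] += alpha;
            }
        }

        // Most likely colour of pixel (r, c): the centre of the most populated
        // bin of each channel (Step 3 of Algorithm 1 builds IM from these values).
        cv::Vec3b mostLikely(int r, int c) const {
            cv::Vec3b out;
            for (int ch = 0; ch < 3; ++ch) {
                const float* h = histPtr(r, c, ch);
                int best = 0;
                for (int b = 1; b < bins_; ++b)
                    if (h[b] > h[best]) best = b;
                out[ch] = static_cast<uchar>((best + 0.5f) * 256.0f / bins_);
            }
            return out;
        }

    private:
        float* histPtr(int r, int c, int ch) {
            return &hist_[((static_cast<size_t>(r) * cols_ + c) * 3 + ch) * bins_];
        }
        const float* histPtr(int r, int c, int ch) const {
            return &hist_[((static_cast<size_t>(r) * cols_ + c) * 3 + ch) * bins_];
        }
        int rows_, cols_, bins_;
        std::vector<float> hist_;
    };

In a full implementation, alpha would be raised for pixels that fall inside the provable background mask maintained in Step 16.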

A short-term video memory in the form of a circular buffer of frames enables straightforward blob tracking. For each frame, a number of attributes are computed and some are recorded: the age of each blob (that is, the number of frames since its first detection), a binary mask of where motion was detected, a list of feature points (corner-like points) detected in the moving blobs, and the contour of each blob. A sketch of such a buffer is given below.
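The following sketch shows one plausible layout for the short-term video memory; the type and field names are ours, but the recorded attributes are those listed above.

    // Short-term video memory: a fixed-capacity circular buffer of per-frame records.
    #include <opencv2/core/core.hpp>
    #include <deque>
    #include <vector>

    struct FrameRecord {
        double timestamp;                               // time stamp t of the frame
        cv::Mat motionMask;                             // binary mask of detected motion
        std::vector<cv::Point2f> featurePoints;         // corner-like points in moving blobs
        std::vector<std::vector<cv::Point> > contours;  // one contour per blob
        std::vector<int> blobAges;                      // frames since each blob's first detection
    };

    class ShortTermMemory {
    public:
        explicit ShortTermMemory(size_t capacity) : capacity_(capacity) {}

        void push(const FrameRecord& rec) {
            if (buffer_.size() == capacity_) buffer_.pop_front();  // overwrite oldest frame
            buffer_.push_back(rec);
        }

        // Record from delta frames in the past (0 = most recent frame).
        const FrameRecord& past(size_t delta) const {
            return buffer_[buffer_.size() - 1 - delta];
        }

    private:
        size_t capacity_;
        std::deque<FrameRecord> buffer_;
    };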


Fig. 1 The short-term video memory collects information about moving blobs. When a foreground blob is detected in frame Ft, features inside this blob are tracked backward in time to determine whether the blob corresponds to a moving object. If the pixels in the corresponding footprint of the blob in frame Ft−∆ did not change significantly from time t − ∆ to time t, then these pixels are assumed to be background. The traces of the moving blobs are accumulated into a driveable mask.

When a new frame is grabbed, a number of processing steps are performed. The first step is the computation of a mask IB of foreground pixels. This segmentation is achieved with a simple thresholded frame difference ‖Ft − IM‖, where Ft denotes the current frame and IM the most likely image according to the adaptive statistical model of the scene. A morphological close operation on IB helps remove image noise.

Next, in lines 6 and 7, we approximate the blobs with their convex hulls. The convex hull of a blob is a better approximation of the blob than its best-fitting rotated rectangle or its best-fitting ellipse, yet it is simpler than the original contour of the blob in terms of the number of contour points. We try to merge neighbouring convex blobs Bi and Bj in the following way: we consider the contour CU of the convex union U of Bi and Bj, then for each pair of diametrically opposite vertices on CU, we scan the line segment joining these two vertices. If all the scanned segments are likely foreground, then Bi and Bj are replaced by the convex blob U.

In order to track the blobs, we search for a set of good features to track in the region of the new frame Ft restricted to IB, and try to match these feature points in previous frames by calculating the optical flow for this sparse feature set using the iterative Lucas-Kanade method with pyramids (implemented in the OpenCV library), as sketched below. Good features are located by examining the minimum eigenvalue of each 2-by-2 gradient matrix, and features are tracked using a Newton-Raphson method of minimizing the difference between the two windows. Multi-resolution tracking allows for relatively large displacements between images. Each blob of the current frame either inherits the attributes (internal identification number and drawing colour) of the matched blob (if any) in the previous frame, or is classified as a new blob.

To finish the main loop, we update the statistical model of the scene. The update also refines the driveable region model, as illustrated in Figure 1.
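The tracking step maps directly onto two OpenCV functions. The sketch below is a plausible rendering of Steps 9 and 10 of Algorithm 1; the numeric parameters (corner count, window size, pyramid depth) are illustrative choices of ours, not values from the paper.

    // Find good features inside the foreground mask of the current frame and
    // track them back into the previous frame with pyramidal Lucas-Kanade.
    #include <opencv2/imgproc/imgproc.hpp>
    #include <opencv2/video/tracking.hpp>
    #include <vector>

    void trackBlobFeatures(const cv::Mat& grayCurr, const cv::Mat& grayPrev,
                           const cv::Mat& foregroundMask,           // the mask IB
                           std::vector<cv::Point2f>& ptsCurr,
                           std::vector<cv::Point2f>& ptsPrev,
                           std::vector<uchar>& status) {
        // Corner-like points (minimum-eigenvalue criterion), restricted to IB.
        cv::goodFeaturesToTrack(grayCurr, ptsCurr,
                                200,     // max corners (illustrative)
                                0.01,    // quality level
                                5,       // min distance between corners
                                foregroundMask);
        if (ptsCurr.empty()) return;

        std::vector<float> err;
        // Iterative Lucas-Kanade with pyramids, tracking from Ft back to Ft-1.
        cv::calcOpticalFlowPyrLK(grayCurr, grayPrev, ptsCurr, ptsPrev,
                                 status, err,
                                 cv::Size(21, 21),  // search window
                                 3);                // pyramid levels
    }

Points with status[i] == 1 were successfully tracked; a blob of Ft−1 that contains such a tracked point is the candidate match from which the blob of Ft inherits its attributes.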


4 Experimental Results

We have implemented the method described in the previous section using the OpenCV computer vision library. The OpenCV library provides optimized functions for many of the image processing tasks our method needs to complete (like the detection of feature points, the computation of convex hulls and the extraction of the contour of a blob). In particular, finding distinctive points that can be tracked by the Lucas-Kanade method does not require much programming. The Lucas-Kanade method is a widely used differential method for optical flow estimation that runs in real time.

We have tested our system on two videos. The system runs in real time (more than 25 frames per second on a laptop). Videos comparing our proposed method to the state-of-the-art background segmentation method available through the OpenCV library, CvFGDStatModel, can be viewed online at the URLs in the table below. Per image, our method was about 5 ms slower than CvFGDStatModel.

Video name    Method used      Video URL
Intersection  Proposed method  http://www.youtube.com/watch?v=UeBr_7Kn2hU
Highway       Proposed method  http://www.youtube.com/watch?v=8X893aZGFy4
Intersection  CvFGDStatModel   http://www.youtube.com/watch?v=c5b52L00xUE
Highway       CvFGDStatModel   http://www.youtube.com/watch?v=jxyY2Rs11FQ

Precision and recall are two widely used metrics for evaluating the correctness of pattern recognition algorithms. In the context of our application, a true positive is a detected blob that corresponds to a foreground object, and a false positive is a detected blob that does not correspond to a foreground object. A false negative is a foreground object that has not been detected. The precision is the number of true positives divided by the sum of the numbers of true positives and false positives. The recall is the number of true positives divided by the sum of the numbers of true positives and false negatives. The performance of the two tested algorithms is summarized in the table below. For our experiments, we labeled the moving objects in the two videos by hand. The blobs corresponding to captions in the videos were ignored, as they could easily be discarded with a mask in a preprocessing step.

Video name    Method used      Precision  Recall
Intersection  Proposed method  1.00       0.83
Highway       Proposed method  1.00       0.97
Intersection  CvFGDStatModel   0.80       0.68
Highway       CvFGDStatModel   0.97       0.71

Running a one-tailed t-test for paired samples on each video with respect to the two methods shows that the proposed method performs better, with statistical significance at a level below 0.01 in all cases.
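In symbols, with TP, FP and FN denoting the numbers of true positives, false positives and false negatives:

    precision = TP / (TP + FP),    recall = TP / (TP + FN)

For example, the Intersection row for CvFGDStatModel (precision 0.80) means that 20% of the blobs it detected did not correspond to a foreground object.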


Figures 2, 3, 4, 6, 7 and 8 are all divided into four subfigures. The top-left subfigure shows the original frame, and the top-right subfigure is the most likely image according to the statistical model of the scene. The bottom-left binary subfigure shows the intermediate segmentation (after pixel classification with the statistical background model and the morphological close). The bottom-right subfigure shows the final segmentation.

Fig. 2 Frame 68 processed by the proposed method. The cyclist has just been detected (bottom-left subfigure), but this blob is not yet old enough to appear in the segmented image. The most likely image (top-right subfigure) is clean.

5 Conclusion

In this paper we have presented a new foreground-background segmentation method that exploits blob motion to learn a more robust statistical model of the environment. We have designed and implemented in C++ (using OpenCV) a prototype scene analysis system. The approach introduced in this paper is applicable to any environment that is intrinsically two-dimensional. Fixed networked smart cameras that look down on the ground could assist autonomous mobile robots in their navigation task. A robust background segmentation algorithm like the one proposed here is highly desirable for such environments.


Fig. 3 Frame 269 processed by the proposed method. The blob corresponding to the cyclist is old enough to appear in the segmented image.

Acknowledgements The first author would like to thank Ms Irina Gordienko for her invaluable help. The algorithms for the experiments were implemented and evaluated using open-source software, including Linux-based operating systems. The authors would like to extend their thanks to all open-source developers for their efforts.

References

1. Sen-Ching S. Cheung and Chandrika Kamath. Robust techniques for background subtraction in urban traffic video. In S. Panchanathan and B. Vasudev, editors, Society of Photo-Optical Instrumentation Engineers (SPIE) Conference Series, volume 5308, pages 881-892, Jan. 2004.
2. M. Heikkilä and M. Pietikäinen. A texture-based method for modeling the background and detecting moving objects. IEEE Transactions on Pattern Analysis and Machine Intelligence, 28(4):657-662, Apr. 2006.
3. S. Huwer and H. Niemann. Adaptive change detection for real-time surveillance applications. In Proceedings of the Third IEEE International Workshop on Visual Surveillance, pages 37-46, 2000.
4. Hansung Kim, Ryuuki Sakamoto, Itaru Kitahara, Tomoji Toriyama, and Kiyoshi Kogure. Robust foreground extraction technique using background subtraction with multiple thresholds. Optical Engineering, volume 46. SPIE, Sep. 2007.
5. Kyungnam Kim, Thanarat H. Chalidabhongse, David Harwood, and Larry Davis. Real-time foreground-background segmentation using codebook model. Real-Time Imaging, 11(3):167-256, 2005. Special Issue on Video Object Processing.


Fig. 4 Frame 204 processed by the proposed method. One of the cars stopped at the traffic light has not been tagged in the segmented image. The colour of the car is too similar to the colour of the road. Although a blob corresponding to the front of this car can be seen in the bottom-left image, it is too recent, as no blob was detected for this car in the previous frame.

6. Anh-Nga Lai, Hyosun Yoon, and Gueesang Lee. Robust background extraction scheme using histogram-wise for real-time tracking in urban traffic video. In 8th IEEE International Conference on Computer and Information Technology (CIT 2008), pages 845-850, Jul. 2008.
7. Dar-Shyang Lee. Effective Gaussian mixture learning for video background subtraction. IEEE Transactions on Pattern Analysis and Machine Intelligence, 27(5):827-832, May 2005.
8. Liyuan Li, Weimin Huang, Irene Y. H. Gu, and Qi Tian. Foreground object detection from videos containing complex background. In MULTIMEDIA '03: Proceedings of the Eleventh ACM International Conference on Multimedia, pages 2-10. ACM Press, 2003.
9. C. Losada, M. Mazo, S. Palazuelos, and F. Redondo. Adaptive threshold for robust segmentation of mobile robots from visual information of their own movement. In IEEE International Symposium on Intelligent Signal Processing (WISP 2009), pages 293-298, Aug. 2009.
10. C. Stauffer and W. Grimson. Adaptive background mixture models for real-time tracking. In Computer Vision and Pattern Recognition (CVPR), volume 2, pages 246-252, 1999.
11. K. Toyama, J. Krumm, B. Brumitt, and B. Meyers. Wallflower: principles and practice of background maintenance. In Proceedings of the Seventh IEEE International Conference on Computer Vision, volume 1, pages 255-261, 1999.
12. C.R. Wren, A. Azarbayejani, T. Darrell, and A.P. Pentland. Pfinder: real-time tracking of the human body. IEEE Transactions on Pattern Analysis and Machine Intelligence, 19(7):780-785, Jul. 1997.
13. Z. Zivkovic. Improved adaptive Gaussian mixture model for background subtraction. In Proceedings of the 17th International Conference on Pattern Recognition (ICPR 2004), volume 2, pages 28-31, Aug. 2004.


Fig. 5 Provable background image quantized to 64 gray levels.

Fig. 6 Frame 68 processed by the OpenCV library method. The cyclist is missing from the binary image. Slow moving objects smear the model image.


Fig. 7 Frame 269 processed by the OpenCV library method. The cyclist is still missing from the binary image. Moreover some cars are completely missed.

Fig. 8 Frame 204 processed by the OpenCV library method. When cars slow down at the traffic light, they become part of the background.
