Evaluation of Object Segmentation to Improve Moving Vehicle Detection in Aerial Videos

Michael Teutsch, Wolfgang Krüger, and Jürgen Beyerer
Fraunhofer IOSB, Karlsruhe, Germany
{michael.teutsch, wolfgang.krueger, juergen.beyerer}@iosb.fraunhofer.de

Abstract

Moving objects play a key role in gaining scene understanding in aerial surveillance tasks. The detection of moving vehicles can be challenging due to high object distance, simultaneous object and camera motion, shadows, or weak contrast. In scenarios where vehicles drive on busy urban streets, this is even more challenging due to possible merged detections. In this paper, a video processing chain is proposed for moving vehicle detection and segmentation. The foundation for detecting motion that is independent of the camera motion is the tracking of local image features such as Harris corners. Independently moving features are clustered. Since motion clusters are prone to merge similarly moving objects, we evaluate various object segmentation approaches based on contour extraction, blob extraction, or machine learning to handle such effects. We propose to use a local sliding window approach with Integral Channel Features (ICF) and an AdaBoost classifier.

1. Introduction

Mobile platforms such as Unmanned Aerial Vehicles (UAVs) equipped with video cameras provide flexible and efficient support for both civil and military security. In order to gain scene understanding, moving objects play a key role and have to be detected and tracked correctly. This can be a challenging task due to high object distance, simultaneous object and camera motion, shadows, or weak contrast. In scenarios where vehicles are driving on busy urban streets, this is even more challenging due to possible merged detections. After image registration and alignment to compensate for camera motion, difference images are usually used to detect motion areas [17, 18, 19, 27]. This method works well for wide area surveillance with high UAV altitude, where the same area is surveilled for at least several seconds and object motion is fast enough to produce prominent motion blobs in the difference image.

Figure 1. Undesired over/under-segmentation for motion clustering (moving features in yellow, motion clusters in cyan).

For lower altitude UAVs, this is more difficult, since each local area is surveilled for only a few seconds or less and slow object motion might not be prominent enough to be detected. Furthermore, due to parallax effects and misalignment during image registration, distinguishing between motion and noise in the difference image becomes even more difficult [22]. Like a few other authors [3, 14, 22], we propose to use clustering of moving local features such as KLT features or Harris corners to determine motion areas. However, under-segmentation for vehicles driving one behind the other and over-segmentation for weakly textured vehicles, as both seen in Fig. 1, are likely to occur. Siam et al. [22] apply object tracking with a Kalman filter to handle such effects, but we aim to improve moving vehicle detection before tracking. We have two main contributions: (1) To handle the mentioned segmentation problems, we implement and evaluate various state-of-the-art object segmentation methods based on contour and blob extraction. (2) Machine learning is introduced by using a local sliding window with Integral Channel Features (ICF) and an AdaBoost classifier to search for vehicles inside the motion clusters. Motion cluster properties help us to reduce the sliding window search space and to reject false positive detections at stationary vehicles. The remainder of the paper is organized as follows: literature is reviewed in Section 2. The evaluated algorithms are described in Section 3. Experimental results are given in Section 4. We conclude in Section 5.

Figure 2. The concept for object detection and segmentation: local feature detection and tracking → image registration → independent motion detection → local feature clustering → object segmentation → outlier and duplicate removal.

2. Related Work

We skip literature about difference images and focus on related work that is relevant for object segmentation and applicable to our motion areas. This mainly concerns the detection of vehicles in single images or of stationary vehicles. Tanaka and Saji [23] propose parallelogram detection with the Hough transform, assuming that vehicles have a rectangular appearance in top-view UAV images. Li et al. [12] detect object pixels based on the deviation of color values from the background pixel values. Zheng et al. [28] propose to use the Black and White Tophat transform in road areas and Otsu thresholding to detect objects. Teutsch and Krueger [24] calculate gradient magnitudes with different approaches and fuse them in a common accumulator image. Objects are detected by thresholding and connected-component labeling. A very common combination of detection and machine learning is the sliding window approach, which has been successfully applied in face detection [26] and human detection [7]. A search window of a certain size is shifted at different scales across the whole image. In each window, features are calculated and the classifier decision is evaluated for object or non-object. Gaszczak et al. [9] detect vehicles using a sliding window with Haar features and cascaded AdaBoost. Since vehicle orientation is variable, four discretized orientations are specified and one classifier is trained for each orientation. Nguyen et al. [16] use sliding windows with Haar features, orientation histograms, and Local Binary Patterns (LBP) as vehicle descriptors and Discrete AdaBoost for classification. Cao et al. [4] propose a boosting light and pyramid sampling histogram of oriented gradients (bLPS-HOG) feature extraction method together with a linear Support Vector Machine (SVM). The sliding window approach is very time-consuming, as the whole image has to be scanned at different scales. Initial detections can shorten the processing time significantly. After the detection of motion areas, Lin et al. [13] use scale normalization, Haar features, and cascaded AdaBoost to classify vehicles inside these areas. Cheng et al. [5] detect and cluster Harris corners and Canny edges in areas of foreground color, followed by an SVM to distinguish between vehicle and non-vehicle colors and a Dynamic Bayesian Network (DBN) to verify object pixels. Shi et al. [21] propose a two-stage SVM using size features in the first and HOG features in the second stage. Gleason et al. [10] apply clustering of densely distributed Harris corners and refine the cluster areas using color segmentation. These detections are classified using Gabor features and Random Forest.

3. Object Detection and Segmentation

The concept of our video processing chain is shown in Fig. 2. The module colors correspond to the bounding box colors used to visualize the results. In the remainder of this section, each module is described in more detail. The focus, however, is on object segmentation.

3.1. Independent Motion Detection and Clustering

Harris corners are detected and tracked [20]. Image registration and alignment is done by homography estimation and image warping. Local features (corners) with significantly different relative velocities compared to the static background features used for homography estimation are considered to move independently of the camera motion. Hence, we assume that they come from moving objects. Moving local features are clustered with respect to spatial proximity and similar motion direction and magnitude. An example for moving local features (yellow) and motion clusters (cyan) is visualized in Fig. 1. The resulting clusters are used as initial vehicle hypotheses and need to be verified by object segmentation.
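The paper contains no code; the following is a minimal sketch of this step under stated assumptions. It uses OpenCV's Shi-Tomasi/Harris corner detector, KLT tracking, and RANSAC homography estimation; the residual-motion threshold, the feature/motion weighting, and the use of DBSCAN as the clustering algorithm are illustrative choices, not the authors' exact implementation.

```python
# Sketch of independent motion detection and clustering (Sec. 3.1).
# Thresholds, the motion-vs-position weight, and DBSCAN are assumptions.
import cv2
import numpy as np
from sklearn.cluster import DBSCAN

def detect_motion_clusters(prev_gray, curr_gray,
                           residual_thresh=1.5, eps=20.0, min_samples=3):
    # Detect Harris corners in the previous frame and track them with KLT.
    pts0 = cv2.goodFeaturesToTrack(prev_gray, maxCorners=2000, qualityLevel=0.01,
                                   minDistance=5, useHarrisDetector=True)
    pts1, status, _ = cv2.calcOpticalFlowPyrLK(prev_gray, curr_gray, pts0, None)
    ok = status.ravel() == 1
    p0, p1 = pts0[ok], pts1[ok]

    # Image registration: homography of the dominant (background) motion via RANSAC.
    H, _ = cv2.findHomography(p0, p1, cv2.RANSAC, 3.0)

    # Features whose motion deviates from the camera-induced motion are assumed
    # to belong to moving objects.
    p0_warped = cv2.perspectiveTransform(p0, H)
    residual = np.linalg.norm((p1 - p0_warped).reshape(-1, 2), axis=1)
    moving = residual > residual_thresh

    # Cluster moving features by spatial proximity and similar motion vector.
    xy = p1.reshape(-1, 2)[moving]
    flow = (p1 - p0).reshape(-1, 2)[moving]
    if len(xy) == 0:
        return []
    labels = DBSCAN(eps=eps, min_samples=min_samples).fit_predict(
        np.hstack([xy, 5.0 * flow]))          # weight motion against position
    clusters = [xy[labels == k] for k in set(labels) if k != -1]
    return [cv2.boundingRect(c.astype(np.float32)) for c in clusters]
```

The returned bounding rectangles correspond to the cyan motion clusters in Fig. 1 and serve as the initial vehicle hypotheses passed to object segmentation.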

3.2. Object Segmentation

Motion is helpful information to improve object segmentation. By assuming that the vehicle orientation corresponds to its motion direction, we can normalize the orientation by rotating the cluster upright. Besides the desired case of one vehicle per motion cluster, a cluster can contain no vehicle (false positive), one part of a vehicle (split detection), or several vehicles (merged detection causing false negatives). To handle split detections, we extend each cluster in motion direction. Three object segmentation approaches are analyzed: contour extraction, blob extraction, and machine learning.

3.2.1 Contour Extraction

Figure 3. ICF with gradient magnitude (lower left), six gradient orientation channels, and some integration areas (green rectangles).

Gradients and edges are used for contour extraction. We implemented and evaluated three methods: In order to fuse gradients [24], gradient magnitudes are calculated with the Sobel operator, the morphological gradient, and LBP. The magnitudes are added up in a common accumulator, and quantile-based thresholding is used to distinguish between object and background pixels. Canny edge detection [5] is applied to cluster edge pixels and moving features. Although Cheng et al. propose to use color classification and a DBN to detect object pixels in areas of motion and edges, we skip this part since we do not have color in our videos. Finally, relative connectivity calculation [25] followed by hysteresis thresholding aims to find and emphasize opposing edge pairs. To each approach we apply morphological closing and connected-component analysis to determine the final detections. Detections that are too small are rejected.
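A rough sketch of the gradient fusion variant [24] described above is given below. The LBP-based magnitude is omitted here, and the quantile, kernel sizes, and minimum area are assumed values rather than the original parameters.

```python
# Sketch of gradient fusion for contour extraction (Sec. 3.2.1).
# Only Sobel and morphological gradients are fused (LBP omitted);
# the 0.8 quantile, kernel size, and min_area are assumed values.
import cv2
import numpy as np

def contour_detections(gray_chip, quantile=0.8, min_area=30):
    gray = gray_chip.astype(np.float32)

    # Gradient magnitude from the Sobel operator.
    gx = cv2.Sobel(gray, cv2.CV_32F, 1, 0, ksize=3)
    gy = cv2.Sobel(gray, cv2.CV_32F, 0, 1, ksize=3)
    sobel_mag = cv2.magnitude(gx, gy)

    # Morphological gradient (dilation minus erosion).
    kernel = cv2.getStructuringElement(cv2.MORPH_RECT, (3, 3))
    morph_mag = cv2.morphologyEx(gray, cv2.MORPH_GRADIENT, kernel)

    # Fuse normalized magnitudes in a common accumulator, threshold by quantile.
    acc = (sobel_mag / (sobel_mag.max() + 1e-6)
           + morph_mag / (morph_mag.max() + 1e-6))
    binary = (acc > np.quantile(acc, quantile)).astype(np.uint8)

    # Morphological closing and connected-component analysis; small blobs rejected.
    closed = cv2.morphologyEx(binary, cv2.MORPH_CLOSE, kernel)
    n, _, stats, _ = cv2.connectedComponentsWithStats(closed)
    return [stats[i, :4] for i in range(1, n)          # (x, y, w, h) per detection
            if stats[i, cv2.CC_STAT_AREA] >= min_area]
```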

Figure 4. Machine learning: sliding window approach (panels: sliding window result, after first NMS, after second NMS).

3.2.2 Blob Extraction

Homogeneous image regions such as vehicle roofs or hoods are emphasized and detected by thresholding. We implemented two approaches: Black and White Tophat Transform followed by Otsu thresholding and morphological closing [28], and Maximally Stable Extremal Regions (MSER) [15] with minimum size and eccentricity constraints. Detections that are too small are rejected.
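A minimal sketch of the tophat-based variant [28] follows; the structuring element size and the minimum area are assumed values, and the input chip is assumed to be an 8-bit grayscale image.

```python
# Sketch of blob extraction via Black and White Tophat transform with Otsu
# thresholding (Sec. 3.2.2). Kernel size and min_area are assumed values;
# gray_chip is assumed to be uint8.
import cv2

def blob_detections(gray_chip, min_area=30):
    kernel = cv2.getStructuringElement(cv2.MORPH_ELLIPSE, (9, 9))

    # White tophat emphasizes bright blobs (e.g. bright roofs),
    # black tophat emphasizes dark blobs; keep the stronger response.
    white = cv2.morphologyEx(gray_chip, cv2.MORPH_TOPHAT, kernel)
    black = cv2.morphologyEx(gray_chip, cv2.MORPH_BLACKHAT, kernel)
    response = cv2.max(white, black)

    # Otsu thresholding followed by morphological closing.
    _, binary = cv2.threshold(response, 0, 255, cv2.THRESH_BINARY + cv2.THRESH_OTSU)
    closed = cv2.morphologyEx(binary, cv2.MORPH_CLOSE, kernel)

    # Connected components; detections that are too small are rejected.
    n, _, stats, _ = cv2.connectedComponentsWithStats(closed)
    return [stats[i, :4] for i in range(1, n)
            if stats[i, cv2.CC_STAT_AREA] >= min_area]
```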

3.2.3 Machine Learning

Machine learning is introduced by using a local sliding window with ICF and an AdaBoost classifier in order to search for vehicles inside the motion clusters. The sliding window has been widely used for object detection in single images [7, 26], but here this method is used together with independent motion detection. Motion cluster properties help us to reduce the sliding window search space, since we know the approximate vehicle size from the pixel ground sampling distance (GSD) and the vehicle orientation from the motion direction. Furthermore, we can reject false positive detections at stationary vehicles (see Section 3.3). We use ICF and an AdaBoost classifier [8] instead of the popular HOGs and SVM [7]. The performance is similar but the runtime is much faster [1]. As seen in Fig. 3, we use six gradient orientation channels in order to calculate first-order ICF features (green rectangles). 2,000 of these features are calculated and concatenated to set up the vehicle descriptor. We use a Gentle AdaBoost classifier [2] with 500 decision trees of depth 2 and train it with 780 negative and 790 positive samples coming from a wide area aerial image. Each sample is normalized to horizontal orientation and scaled to 32×16 pixels. The test dataset consists of 20,000 negative and 664 positive samples coming from a different wide area aerial image. We achieved an Area Under Curve (AUC) value of 0.9984 with ICF and AdaBoost compared to 0.9987 with HOGs and SVM, which is very similar.
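The following compact sketch illustrates how first-order ICF features can be computed from gradient orientation channels via integral images. The orientation binning, the random rectangle sampling, and the assumption that the 32×16 sample is 32 pixels wide and 16 pixels high are illustrative and only approximate the setup of [8]; the resulting 2,000-dimensional descriptors would then be fed to a boosted ensemble of depth-2 trees.

```python
# Sketch of first-order Integral Channel Features (Sec. 3.2.3, Fig. 3):
# gradient magnitude plus six orientation channels, summed over rectangles
# via integral images. Binning and rectangle sampling are assumptions.
import cv2
import numpy as np

def icf_features(sample_16x32, rects, n_orient=6):
    g = sample_16x32.astype(np.float32)          # 16 rows x 32 cols (h x w)
    gx = cv2.Sobel(g, cv2.CV_32F, 1, 0)
    gy = cv2.Sobel(g, cv2.CV_32F, 0, 1)
    mag = cv2.magnitude(gx, gy)
    ang = np.mod(np.arctan2(gy, gx), np.pi)      # orientation in [0, pi)

    # Channels: magnitude plus magnitude-weighted orientation bins.
    channels = [mag]
    bin_idx = np.minimum((ang / np.pi * n_orient).astype(int), n_orient - 1)
    for b in range(n_orient):
        channels.append(np.where(bin_idx == b, mag, 0.0))

    # One integral image per channel; each feature is a rectangle sum.
    integrals = [cv2.integral(c) for c in channels]
    feats = []
    for (c, x, y, w, h) in rects:                # rects: (channel, x, y, w, h)
        ii = integrals[c]
        feats.append(ii[y + h, x + w] - ii[y, x + w] - ii[y + h, x] + ii[y, x])
    return np.array(feats, dtype=np.float32)

# Example: 2,000 random first-order features over a 32x16 sample.
rng = np.random.default_rng(0)
rects = [(int(rng.integers(0, 7)), int(rng.integers(0, 29)), int(rng.integers(0, 13)),
          int(rng.integers(2, 5)), int(rng.integers(2, 5))) for _ in range(2000)]
```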

The sliding window approach is visualized in Fig. 4. With a fixed-size window at different scales, we search for vehicles inside the extended motion cluster in order to detect vehicles of different sizes, from compact cars to trucks. The sliding window result is visualized in red. Each sliding window is represented by its center point. Since we shift the window pixel by pixel in horizontal and vertical direction, a dense red grid of center points appears. Bright red indicates a high decision function value of the AdaBoost classifier and, hence, a high probability for a vehicle. After a first Non-Maximum Suppression (NMS) on these decision function values, we calculate all potential detections (yellow boxes) and apply a second NMS to all overlapping boxes to determine the final result (red boxes) [8].
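A minimal sketch of the local sliding window with the two NMS stages is given below. Here `score_fn` stands in for the trained AdaBoost decision function, and the scales, score threshold, and overlap thresholds are assumed values.

```python
# Sketch of the local sliding window with two NMS stages (Sec. 3.2.3, Fig. 4).
# score_fn(window) stands in for the trained AdaBoost decision function;
# scales and thresholds are assumed values.
def iou(a, b):
    ax, ay, aw, ah = a; bx, by, bw, bh = b
    x1, y1 = max(ax, bx), max(ay, by)
    x2, y2 = min(ax + aw, bx + bw), min(ay + ah, by + bh)
    inter = max(0, x2 - x1) * max(0, y2 - y1)
    return inter / float(aw * ah + bw * bh - inter)

def nms(dets, max_overlap):
    # Greedy NMS: keep high-scoring boxes, drop boxes overlapping a kept one.
    kept = []
    for box, score in sorted(dets, key=lambda d: d[1], reverse=True):
        if all(iou(box, k[0]) <= max_overlap for k in kept):
            kept.append((box, score))
    return kept

def sliding_window_detect(cluster_chip, score_fn,
                          scales=(0.8, 1.0, 1.25), score_thresh=0.0):
    H, W = cluster_chip.shape
    candidates = []
    for s in scales:
        h, w = int(16 * s), int(32 * s)          # 32x16 window (width x height)
        for y in range(0, H - h + 1):            # dense, pixel-by-pixel shift
            for x in range(0, W - w + 1):
                score = score_fn(cluster_chip[y:y + h, x:x + w])
                if score > score_thresh:
                    candidates.append(((x, y, w, h), score))
    # First NMS on the dense decision-function grid, second NMS on the
    # remaining overlapping boxes to obtain the final detections.
    potential = nms(candidates, max_overlap=0.6)
    return nms(potential, max_overlap=0.3)
```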

3.3. Outlier and Duplicate Detections

Outlier detections can come from vehicles in adjacent motion clusters or stationary vehicles parked at the roadside. In order to eliminate them, we check for motion vectors which are inside the detection. If a certain number of vectors is exceeded, the detection is accepted. Duplicate detections occur in case of split motion clusters for one object or split segmentation results. Three different approaches have been analyzed to handle duplicate detections, as sketched after this list:

• Non-Maximum Suppression (NMS): detections sufficiently overlapping each other are sorted by their quality, which is the gray-value variance for contour and blob extraction or the classifier decision function value for the sliding window. The detection with the highest quality is kept while all other detections are eliminated.

• Fusion by overlap (OV): detections sufficiently overlapping each other are fused into one detection.

• Median detections (MED): for all detections sufficiently overlapping each other, a new detection at median position and with median size is initialized and replaces all the overlapping detections.
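The three strategies can be condensed as follows for one group of sufficiently overlapping detections; how the groups are formed and the overlap threshold are assumed to be handled beforehand.

```python
# Sketch of the three duplicate handling strategies (Sec. 3.3) applied to one
# group of sufficiently overlapping detections. Each detection is
# ((x, y, w, h), quality); the grouping itself is assumed to be done already.
import numpy as np

def resolve_duplicates(group, mode="NMS"):
    boxes = np.array([d[0] for d in group], dtype=float)
    if mode == "NMS":    # keep only the detection with the highest quality
        return [max(group, key=lambda d: d[1])[0]]
    if mode == "OV":     # fuse all overlapping detections into one enclosing box
        x1, y1 = boxes[:, 0].min(), boxes[:, 1].min()
        x2 = (boxes[:, 0] + boxes[:, 2]).max()
        y2 = (boxes[:, 1] + boxes[:, 3]).max()
        return [(x1, y1, x2 - x1, y2 - y1)]
    if mode == "MED":    # replace the group by a box at median position and size
        return [tuple(np.median(boxes, axis=0))]
    raise ValueError(mode)
```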

Table 1. f-score evaluation for 3 duplicate removal methods.

dataset  segmentation algorithm   NMS    OV     MED
SEQ 1    Sliding Window           0.962  0.960  0.958
SEQ 1    Gradient [24]            0.939  0.937  0.933
SEQ 1    Tophat [28]              0.918  0.918  0.914
SEQ 2    Sliding Window           0.908  0.908  0.908
SEQ 2    Gradient [24]            0.896  0.903  0.882
SEQ 2    Tophat [28]              0.872  0.876  0.876
VIVID    Sliding Window           0.984  0.985  0.985
VIVID    Gradient [24]            0.987  0.990  0.989
VIVID    Tophat [28]              0.911  0.912  0.911

4. Evaluation and Experimental Results

Three videos have been used for our experiments: two of our own sequences, SEQ 1 and SEQ 2, recorded in top view, and sequence EgTest01 of the VIVID dataset [6]. Our sequences do not provide color information, so we did not evaluate any color-based method. We manually labeled all sequences for moving vehicles and moving pedestrians/bikes/motorcycles. All objects other than vehicles were not considered for the evaluation, no matter whether they were detected or not. We labeled 4,731 vehicles in SEQ 1, 1,373 in SEQ 2, and 6,866 in EgTest01. For quantitative evaluation, we use standard evaluation measures: true positives (TP), false positives (FP), false negatives (FN), precision (prcn), recall (rcll), f-score, Normalized Multiple Object Detection Accuracy (N-MODA), and Normalized Multiple Object Detection Precision (N-MODP) [11]. In the first experiment, we analyze the three removal methods for duplicate detections using all three datasets. As seen in Table 1, we evaluate one contour-based method [24], one blob-based method [28], and our proposed local sliding window approach. For simplicity, we only use the f-score as measure here. The best result for each dataset is highlighted in red color. NMS works best for the sliding window and OV for all other approaches. Thus, we consider these combinations for the next experiment. The second experiment is the quantitative evaluation shown in Table 2. We use all three datasets, all six segmentation methods, and all evaluation measures. We highlighted the best result for each measure (each column) and dataset in red color. The most important measures f-score, N-MODA, and N-MODP are underlined, too. While f-score and N-MODA contain information about the false positive and the false negative rate, N-MODP is the mean overlap of ground truth (GT) rectangles and detection rectangles for all TPs. Motion Clustering is the baseline approach without any additional object segmentation. The high number of FPs for Motion Clustering occurs mainly due to vehicle shadows (SEQ 1 and SEQ 2) and turning vehicles (EgTest01). FNs often appear for under-segmented multiple vehicles driving one behind the other. The results show that our proposed local sliding window approach outperforms all other methods with respect to f-score and N-MODA in two out of three datasets and achieves the best mean overlap (N-MODP) for all datasets. As Siam et al. [22] provide their precision and recall for EgTest01, we can include this in Table 2. Our approach achieves a higher f-score without multi-object tracking. The qualitative evaluation is given in Fig. 5. Cyan boxes depict the detected motion clusters and red boxes the results of object segmentation. While all approaches perform well in the first example (upper row), we see more merged or split detections occurring for the methods based on contour and blob extraction as the examples become more difficult (lower rows). The final segmentation result with our local sliding window approach is visualized in Fig. 6. One full example image is shown for each dataset. Small cyan boxes with no segmentation (red box) inside come from moving pedestrians, bikes, or motorcycles, which we considered to be irrelevant for vehicle detection.
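The standard measures can be recomputed directly from the raw counts reported in Table 2; as a short check, the SEQ 1 row of the local sliding window approach is reproduced below.

```python
# Recomputing precision, recall, and f-score from the counts in Table 2
# (SEQ 1, Local Sliding Window: TP = 4463, FP = 83, FN = 268).
tp, fp, fn = 4463, 83, 268
prcn = tp / (tp + fp)                        # 0.982
rcll = tp / (tp + fn)                        # 0.943
f_score = 2 * prcn * rcll / (prcn + rcll)    # 0.962
print(round(prcn, 3), round(rcll, 3), round(f_score, 3))
```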

5. Conclusions and Future Work

We present an evaluation of six object segmentation methods based on contour extraction, blob extraction, or machine learning to improve moving vehicle detection in aerial videos coming from a moving UAV camera. Simple independent motion clustering is not sufficient to reliably detect moving vehicles in busy urban traffic, as over- and under-segmentation occur regularly. We propose a novel method using a local sliding window with ICF and an AdaBoost classifier inside motion areas in order to find vehicles. In our challenging sequences, this method outperforms the other approaches with respect to f-score and other standard measures. In the less crowded public VIVID sequence, it performs similarly to gradient-based contour extraction.

Table 2. Quantitative evaluation for 3 datasets, 6 segmentation algorithms, and various evaluation measures.

dataset  segmentation algorithm               GT    TP    FP    FN   prcn   rcll  f-score  N-MODA  N-MODP
SEQ 1    Local Sliding Window               4731  4463    83   268  0.982  0.943    0.962   0.925   0.696
SEQ 1    Gradient Fusion [24]               4731  4366   226   365  0.951  0.923    0.937   0.875   0.606
SEQ 1    Canny Edges [5]                    4731  4141   136   590  0.968  0.875    0.919   0.846   0.526
SEQ 1    Rel. Connectivity [25]             4731  4357   581   374  0.882  0.921    0.901   0.798   0.624
SEQ 1    MSER [15]                          4731  3395   113  1336  0.968  0.718    0.824   0.693   0.482
SEQ 1    Tophat Transform [28]              4731  4253   286   478  0.937  0.899    0.918   0.838   0.530
SEQ 1    Motion Clustering                  4731  4070   807   661  0.835  0.860    0.847   0.689   0.481
SEQ 2    Local Sliding Window               1373  1181    47   192  0.961  0.860    0.908   0.825   0.593
SEQ 2    Gradient Fusion [24]               1373  1257   155   116  0.890  0.916    0.903   0.802   0.559
SEQ 2    Canny Edges [5]                    1373  1210    96   163  0.926  0.881    0.903   0.811   0.496
SEQ 2    Rel. Connectivity [25]             1373  1114   364   259  0.754  0.811    0.781   0.546   0.515
SEQ 2    MSER [15]                          1373  1210   110   163  0.917  0.881    0.899   0.801   0.477
SEQ 2    Tophat Transform [28]              1373  1287   277    86  0.823  0.937    0.876   0.735   0.535
SEQ 2    Motion Clustering                  1373  1289   579    84  0.690  0.939    0.795   0.517   0.514
VIVID    Local Sliding Window               6866  6726    73   140  0.989  0.980    0.984   0.968   0.526
VIVID    Gradient Fusion [24]               6866  6812    90    54  0.987  0.992    0.990   0.979   0.511
VIVID    Canny Edges [5]                    6866  6800    85    66  0.988  0.990    0.989   0.978   0.500
VIVID    Rel. Connectivity [25]             6866  6142  1836   724  0.770  0.895    0.828   0.627   0.393
VIVID    MSER [15]                          6866  6703   195   163  0.972  0.976    0.974   0.947   0.497
VIVID    Tophat Transform [28]              6866  6782  1231    84  0.846  0.988    0.912   0.808   0.489
VIVID    Motion Clustering                  6866  6844   487    22  0.934  0.997    0.964   0.925   0.448
VIVID    Motion Clustering + Tracking [22]     -     -     -     -  0.991  0.971    0.980       -       -

Figure 5. Qualitative evaluation for 6 different object segmentation algorithms (columns: Local Sliding Window, Gradient Fusion [24], Canny Edge [5], Relative Connectivity [25], MSER [15], Tophat Transform [28]).

Figure 6. Results for the local sliding window approach in each sequence (SEQ 1, SEQ 2, VIVID EgTest01; motion clusters in cyan, segmentation in red). Since the classifier is trained for vehicles only, motion clusters without segmentation contain other moving objects such as pedestrians, bikes, or motorcycles.

References

[1] R. Benenson, M. Mathias, R. Timofte, and L. Van Gool. Pedestrian detection at 100 frames per second. In CVPR, 2012.
[2] G. Bradski. The OpenCV Library. Dr. Dobb's Journal of Software Tools, 2000.
[3] X. Cao, J. Lan, P. Yan, and X. Li. KLT Feature Based Vehicle Detection and Tracking in Airborne Videos. In Proc. of the International Conf. on Image and Graphics (ICIG), 2011.
[4] X. Cao, C. Wu, J. Lan, P. Yan, and X. Li. Vehicle Detection and Motion Analysis in Low-Altitude Airborne Video Under Urban Environment. IEEE Transactions on Circuits and Systems for Video Technology, 21(10):1522–1533, 2011.
[5] H.-Y. Cheng, C.-C. Weng, and Y.-Y. Chen. Vehicle Detection in Aerial Surveillance Using Dynamic Bayesian Networks. IEEE TIP, 21(4):2152–2159, Apr. 2012.
[6] R. T. Collins, X. Zhou, and S. K. Teh. An Open Source Tracking Testbed and Evaluation Web Site. In Proceedings of the IEEE International Workshop on PETS, 2005.
[7] N. Dalal and B. Triggs. Histograms of Oriented Gradients for Human Detection. In CVPR, 2005.
[8] P. Dollar, Z. Tu, P. Perona, and S. Belongie. Integral Channel Features. In BMVC, 2009.
[9] A. Gaszczak, T. P. Breckon, and J. Han. Real-time people and vehicle detection from UAV imagery. In Proceedings of SPIE Vol. 7878, 2011.
[10] J. Gleason, A. V. Nefian, X. Bouyssounousse, T. Fong, and G. Bebis. Vehicle Detection from Aerial Imagery. In IEEE ICRA, 2011.
[11] R. Kasturi, D. Goldgof, P. Soundararajan, V. Manohar, J. Garofolo, M. Boonstra, V. Korzhova, and J. Zhang. Framework for Performance Evaluation of Face, Text, and Vehicle Detection and Tracking in Video: Data, Metrics, and Protocol. IEEE TPAMI, 31(2):319–336, 2009.
[12] Q. Li, B. Lei, Y. Yu, and R. Hou. Real-time Highway Traffic Information Extraction Based on Airborne Video. In Proc. of IEEE ITSC, 2009.
[13] R. Lin, X. Cao, Y. Xu, C. Wu, and H. Qiao. Airborne moving vehicle detection for video surveillance of urban traffic. In Proc. of the IEEE Intelligent Vehicles Symposium (IV), 2009.
[14] P. Luo, F. Liu, X. Liu, and Y. Yang. Stationary Vehicle Detection in Aerial Surveillance with a UAV. In Proceedings of the International Conference on Information Science and Digital Content Technology (ICIDT), 2012.
[15] J. Matas, O. Chum, M. Urban, and T. Pajdla. Robust wide baseline stereo from maximally stable extremal regions. In BMVC, 2002.
[16] T. T. Nguyen, H. Grabner, H. Bischof, and B. Gruber. Online boosting for car detection from aerial images. In Proceedings of the IEEE International Conference on Research, Innovation and Vision for the Future (RIVF), 2007.
[17] A. G. A. Perera, C. Srinivas, A. Hoogs, G. Brooksby, and W. Hu. Multi-Object Tracking Through Simultaneous Long Occlusions and Split-Merge Conditions. In CVPR, 2006.
[18] V. Reilly, H. Idrees, and M. Shah. Detection and Tracking of Large Number of Targets in Wide Area Surveillance. In ECCV, 2010.
[19] I. Saleemi and M. Shah. Multiframe Many-Many Point Correspondence for Vehicle Tracking in High Density Wide Area Aerial Videos. IJCV, 104(2):198–219, Sept. 2013.
[20] J. Shi and C. Tomasi. Good features to track. In CVPR, 1994.
[21] X. Shi, H. Ling, E. Blasch, and W. Hu. Context-Driven Moving Vehicle Detection in Wide Area Motion Imagery. In ICPR, 2012.
[22] M. Siam, R. ElSayed, and M. ElHelw. On-board Multiple Target Detection and Tracking on Camera-Equipped Aerial Vehicles. In Proceedings of the IEEE International Conference on Robotics and Biomimetics (ROBIO), 2012.
[23] K. Tanaka and H. Saji. Detection of parallelograms using Hough transform. The Transactions of the Institute of Electronics, Information and Communication Engineers D, J89-D(3):606–612, 2006.
[24] M. Teutsch and W. Krüger. Detection, Segmentation, and Tracking of Moving Objects in UAV Videos. In AVSS, 2012.
[25] M. Teutsch and T. Schamm. Fast Line and Object Segmentation in Noisy and Cluttered Environments Using Relative Connectivity. In IPCV, 2011.
[26] P. Viola and M. Jones. Robust Real-time Face Detection. International Journal of Computer Vision, 57(2):137–154, 2004.
[27] J. Xiao, H. Cheng, H. Sawhney, and F. Han. Vehicle Detection and Tracking in Wide Field-of-View Aerial Video. In CVPR, 2010.
[28] Z. Zheng, G. Zhou, Y. Wang, Y. Liu, X. Li, X. Wang, and L. Jiang. A Novel Vehicle Detection Method With High Resolution Highway Aerial Image. IEEE JSTARS, 6(6):2338–2343, Dec. 2013.
