Pattern Recognition Letters journal homepage: www.elsevier.com

Multimodal vehicle detection: fusing 3D-LIDAR and color camera data

Alireza Asvadi∗∗, Luis Garrote, Cristiano Premebida, Paulo Peixoto, Urbano J. Nunes
Institute of Systems and Robotics (ISR-UC), Dept. of Electrical and Computer Engineering (DEEC), University of Coimbra, Coimbra, Portugal

ABSTRACT

Most of the current successful object detection approaches are based on a class of deep learning models called Convolutional Neural Networks (ConvNets). While most existing object detection research is focused on using ConvNets with color image data, emerging fields of application such as Autonomous Vehicles (AVs), which integrate a diverse set of sensors, require the processing of multisensor and multimodal information to provide a more comprehensive understanding of the real-world environment. This paper proposes a multimodal vehicle detection system integrating data from a 3D-LIDAR and a color camera. Data from the LIDAR and camera, in the form of three modalities, are the inputs of ConvNet-based detectors which are later combined to improve vehicle detection. The modalities are: (i) an up-sampled representation of the sparse LIDAR range data, called dense-Depth Map (DM), (ii) a high-resolution map computed from the LIDAR reflectance data, hereinafter called Reflectance Map (RM), and (iii) the RGB image from a monocular color camera calibrated with respect to the LIDAR. Bounding Box (BB) detections in each of these modalities are jointly learned and fused by an Artificial Neural Network (ANN) late-fusion strategy to improve the detection performance of each modality. The contribution of this paper is two-fold: 1) probing and evaluating 3D-LIDAR modalities for vehicle detection (specifically the depth and reflectance map modalities), and 2) joint learning and fusion of the independent ConvNet-based vehicle detectors (in each modality) using an ANN to obtain more accurate vehicle detection. The obtained results demonstrate that 1) DM and RM are very promising modalities for vehicle detection, and 2) the proposed fusion strategy achieves higher accuracy than each modality alone at all levels of difficulty (easy, moderate, hard) of the KITTI object detection dataset.
© 2018 Elsevier Ltd. All rights reserved.

1. Introduction
Object detection is one of the fundamental topics in computer vision, and it is one of the essential components of the perception systems designed for Autonomous Vehicles (AVs). Despite the impressive progress already accomplished in object detection, incorporating multimodal data and designing a robust and accurate detection system is still a very challenging task. In most cases, AVs are equipped with a varied set of sensors, e.g., mono and stereo cameras, thermal and night-vision cameras, sonar, LIDAR and RADAR, to obtain a robust multimodal perception of the driving environment, as described by Urmson et al. [1] and Ziegler et al. [2]. Among the above-named sensors, 3D-LIDARs are a pivotal sensing solution for ensuring the high level of reliability and safety demanded by advanced driver assistance systems (ADAS) and autonomous driving systems. The main characteristics of LIDAR sensors are their wide field of view (FOV), very precise distance measurement (cm accuracy), object recognition at long range and night-vision capability.

∗∗ Corresponding author. E-mail: [email protected] (Alireza Asvadi)

The main disadvantages are the cost, the presence of mechanical parts, the large size and the high power requirements, although these issues tend to become less significant with the emergence of Solid-State 3D-LIDAR sensors (e.g., Quanergy's S3 and Velodyne Velarray sensors), which are compact, efficient and have no moving parts. Recently, 3D-LIDAR sensors, driven by a reduction in their cost and by an increase in their resolution and range, started to become a valid option for object detection, tracking, and scene understanding in Intelligent Vehicle (IV) and Intelligent Transportation Systems (ITS) contexts.

1.1. Contributions
In this paper, we propose a real-time multisensor (color camera and 3D-LIDAR) and multimodal (color image, 3D-LIDAR range and reflectance data) vehicle detection system. Three modalities, the color image and dense (up-sampled) representations of the sparse 3D-LIDAR range and reflectance data, hereinafter referred to as dense-Depth Map (DM) and dense-Reflectance Map (RM), are used as three individual inputs to a deep ConvNet-based object detection framework (the YOLO real-time object detection framework [3, 4]), referred to as YOLO-C, YOLO-D and YOLO-R respectively, to achieve vehicle detection in each modality in the form of Bounding Boxes (BBs), followed by a decision-level fusion approach.

Fig. 1. The conceptual model of the proposed multimodal vehicle detection: LIDAR and vision data are converted into three modalities (DM, RM and the color image), each processed by an individual detector (YOLO-D, YOLO-R and YOLO-C), and the individual detections are combined by the fusion module to produce the fused detection.

The conceptual model of the proposed pipeline and the contributions (highlighted in red) are summarized in Figure 1. Specifically, the main contributions of this paper are as follows:

– In addition to the color image, DM and RM are considered for vehicle detection. The less explored LIDAR reflectance modality is investigated for vehicle detection and is shown to be beneficial for this purpose.

– A decision-level fusion approach is proposed for the high-level integration of detections from the color, DM and RM modalities. The proposed method extracts a rich set of features (e.g., detection confidence, width, height, center and so forth) from the detection hypotheses in each modality. The target output is defined as the overlaps of the detected windows (or bounding boxes: BBs) in each modality with the ground-truth. A Multi-Layer Perceptron (MLP) Neural Network is trained to learn and model the nonlinear relationships among modalities, and to deal with the detection limitations of each modality.

The rest of the paper is organized as follows. Some related work on object detection and multimodal fusion is reviewed in Section 2. The proposed approach for multimodal vehicle detection is described in detail in Section 3. Experimental results are presented and discussed in Section 4, and Section 5 presents some concluding remarks.

2. Related Work
This section gives a concise overview of object detection, the related fusion methods and recent advancements. The state of the art in object detection is primarily concentrated on processing color images as the perceived sensory data.

2.1. Object Detection
Object detection approaches can be divided into those preceding and those following the arrival of deep learning. This section reviews some of the major contributions in this field.

2.1.1. Non-ConvNet Approaches
Before the recent advances in Deep Learning (specifically ConvNets) that revolutionized object classification and, consequently, the object detection field, the literature was

mainly focused on using hand-crafted features and traditional classification techniques (e.g., SVM, AdaBoost, Random Forest). Some of the major contributions in the object detection field, before the Deep Learning era, are listed below.

Cascade of weak classifiers: Viola and Jones [5] proposed one of the early works on object detection. They used Haar features and performed object detection by applying AdaBoost training and cascade classifiers based on the sliding-window principle.

Histogram of Oriented Gradients (HOG): Dalal and Triggs [6] introduced efficient HOG features based on edge directions in the image. They performed linear SVM classification on sub-windows extracted from the image using a sliding-window mechanism.

Deformable Parts Model (DPM): Proposed by Felzenszwalb et al. [7], DPM is a graphical model designed to cope with object deformations in the image. DPM assumes that an object is constructed from its parts. It uses HOG and a linear SVM, again with a sliding-window mechanism. Regionlets [8] improve DPM by considering the possible locations of parts. Pepik et al. [9] extended DPM to 3D object models.

Selective Search (SS): Uijlings et al. [10] proposed SS to generate a set of data-driven, class-independent object proposals and avoid the conventional exhaustive sliding-window search. SS works based on hierarchical segmentation using a diverse set of cues. They used SS to create a Bag-of-Words-based localization and recognition system.

2.1.2. ConvNet-based Approaches
The remarkable success of ConvNets as an optimal feature extractor for image classification/recognition made a huge impact on the object detection field, as demonstrated by LeCun et al. [11] and, more recently, by Krizhevsky et al. [12]. Currently, the best-performing object detectors use ConvNets; they are summarized below.

Sliding-window ConvNet: Following the traditional object detection paradigm, ConvNets were initially employed using the sliding-window mechanism, as in the OverFeat framework proposed by Sermanet et al. [13].

Region-based ConvNets: In R-CNN [14], SS [10] is used for object proposal generation, a ConvNet pre-trained on ImageNet (fine-tuned on the PASCAL VOC dataset) for feature extraction, and a linear SVM for object classification and detection. Instead of running ConvNet-based classification for thousands of SS-generated object proposals, which is slow, Fast R-CNN [15] uses Spatial Pyramid Pooling networks (SPPnets) [16] to pass the image through the convolutional layers only once, followed by end-to-end training. In Faster R-CNN [17, 18], a Region Proposal Network (RPN), a type of Fully-Convolutional Network (FCN) [19], is introduced for region proposal generation. It increases the run-time efficiency and accuracy of the object detection system.

Single-Shot Object Detectors: YOLO (You Only Look Once) [3, 4] and SSD (Single Shot Detector) [20] model object detection as a regression problem and eliminate the object proposal generation step. These approaches are based on a single ConvNet followed by a non-maximum suppression step. In these methods, the input image is divided into a grid (7 × 7

for YOLO and 9 × 9 for SSD), where each grid cell is responsible for predicting a pre-determined number of object BBs. The main advantages are that the resulting detector is faster and that seeing the whole image during training increases the detection accuracy. In the SSD approach, hard negative mining is performed and the samples with the highest confidence loss are selected. Two main disadvantages of this class of methods are that i) they impose hard constraints on the bounding box prediction (e.g., in YOLO each grid cell can predict only two BBs) and ii) the detection of small objects can be very challenging. The SSD approach tries to address the second problem with the help of additional data augmentation for smaller objects.

A number of other ConvNet-based approaches have been presented for object detection in the literature. Xiang et al. [21] modify Fast R-CNN [15] by injecting 3D Voxel Patterns (3DVP) [22] subcategory information into the network to improve the object proposal generation step. Cai et al. [23] perform detection at multiple intermediate network layers to deal with objects of different sizes. Chen et al. [24] generate 3D proposals by assuming a prior on the ground plane (using calibration data). Proposals are initially scored based on contextual and segmentation features, followed by re-scoring using a version of Fast R-CNN [15] for 3D object detection. The approach of Yang et al. [25] is based on the rejection of negative object proposals using convolutional features and cascaded classifiers. Next, the surviving proposals are evaluated using scale-dependent object classifiers.

2.2. 3D-LIDAR and Camera Fusion
Although there is a rich literature on multisensor data fusion, as recently surveyed by Durrant-Whyte and Henderson [26], only a small number of works address multimodal and multisensor data fusion for object detection. Fusion-based object detection approaches can be divided according to the abstraction level at which the fusion takes place, namely i) low-level (early) fusion, which combines sensor data to create a new set of data, ii) mid-level fusion, which integrates features, and iii) high-level (late or decision-level) fusion, which combines the classified outputs [27]. This section surveys state-of-the-art fusion techniques using vision and 3D-LIDAR in the multimodal object detection context.

Premebida et al. [28] combine Velodyne LIDAR and color data for pedestrian detection. A dense depth map is computed by up-sampling the LIDAR points. Two DPMs are trained, on depth maps and on color images. The DPM detections on depth maps and color images are fused using a late re-scoring strategy (by applying an SVM to features such as the BBs' sizes, positions, scores and so forth) to achieve the best performance. Gonzalez et al. [29] use color images and 3D-LIDAR-based depth maps as inputs, and extract HOG and Local Binary Patterns (LBP) features. They split the training set samples into different views to take into account the different poses of objects (frontal, lateral, etc.) and train a separate random forest of local experts for each view. They investigated feature-level and late fusion approaches. They combine the color and depth modalities at the feature level by concatenating the HOG and LBP descriptors, and they train individual detectors on each modality and use an ensemble of detectors for the late fusion of the detections from the different views.

Table 1. Some recent related work on 3D-LIDAR and camera fusion. The fusion level (early, mid or late) of each method is indicated; where a method experimented with more than one fusion strategy, the best-performing solution for that method is noted.

Reference              Fusion level                               Technique
Premebida et al. [28]  Late                                       SVM re-scoring
Gonzalez et al. [29]   Feature-level (mid) and late; mid best     Ensemble voting
Schlosser et al. [30]  Multiple levels explored; late best        ConvNet
Chen et al. [32]       Mid (region-based feature fusion)          ConvNet
Oh and Kang [33]       Late (decision level)                      ConvNet/SVM

They achieved better performance with the feature-level fusion scheme. Schlosser et al. [30] explore the ConvNet-based fusion of 3D-LIDAR and color image data at different levels of representation for pedestrian detection. They compute HHA (horizontal disparity, height, angle) data channels [31] from the LIDAR-based depth map. They show that the late fusion of HHA features and color images achieves better results. Chen et al. [32] proposed a multi-view object detection network based on deep learning. They use 3D-LIDAR top and front views and color image data as inputs. The top-view LIDAR data is used to generate 3D object proposals, and the 3D proposals are projected to the three views to obtain region-wise features. A region-based feature fusion scheme is used for classification and orientation estimation. This approach enables interactions between the intermediate layers of the different views. Oh and Kang [33] use segmentation-based methods for object proposal generation from the LIDAR point cloud data and a color image. They use two independent ConvNet-based classifiers to classify object candidates in the color image and the LIDAR-based depth map, and combine the classification outputs at the decision level using convolutional feature maps, category probabilities and SVMs. Table 1 provides an overview of fusion approaches based on 3D-LIDAR and camera data.

3. Multimodal Vehicle Detection System
In this section, the proposed multimodal vehicle detection system is described. We start by describing the multimodal data generation. Next, we briefly introduce the YOLO framework, which is the ConvNet-based vehicle detector considered for each modality, and finally, we explain the proposed multimodal fusion scheme.

3.1. System Overview
The architecture of the proposed multimodal vehicle detection system is shown in Fig. 2. Three modalities, DM, RM (both generated from 3D-LIDAR data) and the color image, are used as inputs. Three YOLO-based object detectors are run individually on each modality to detect the 2D object BBs in the color image, DM and RM. The 2D BBs obtained in each of the three modalities are fused by a learning-based re-scoring function followed by non-maximum suppression. The purpose of the multimodal detection fusion is to reduce the misdetection rate of each modality, which leads to more accurate detection.

Fig. 2. The pipeline of the proposed multimodal vehicle detection algorithm: the 3D-LIDAR point cloud (PCD) and the camera image are converted into the DM, RM and color image modalities; the vehicle detectors YOLO-D, YOLO-R and YOLO-C produce the per-modality detections (BBD, sD), (BBR, sR) and (BBC, sC); feature extraction, MLP-based joint re-scoring (yielding s′D, s′R, s′C) and NMS then produce the fused detections (BBF, sF). For details, please refer to the text.

3.2. Multimodal Data Generation
The color image is readily available from the color camera. However, 3D-LIDAR-based dense maps are not directly available and need to be computed. Assuming that the LIDAR and the camera are calibrated with respect to each other, the projection of the LIDAR points onto the image plane is much sparser than the associated image. Such limited spatial resolution of the LIDAR makes object detection from raw (sparse) LIDAR data very challenging. Therefore, in this paper, we propose to generate high-resolution (dense) map representations from the LIDAR data to (i) perform deep-learning-based vehicle detection on LIDAR dense maps and (ii) carry out a decision-level fusion strategy. To generate a dense (up-sampled) map from LIDAR data, a number of techniques can be used, as described by Premebida et al. [34]. Here, we adopted Delaunay Triangulation (DT) as the technique to obtain high-resolution maps. DT generates a mesh from the sparse depth points projected onto the camera coordinate system. The nearest neighbors were used to interpolate the unsampled locations of the map. DT is effective in obtaining dense maps with close to 100% density because this method interpolates all locations in the map regardless of the positions of the input (raw) points. In this work, the dense maps are obtained solely from LIDAR data; thus, data (color or texture) from the camera is not used in the maps. Besides the depth map (DM), a dense reflectance map (RM) is also considered in the vehicle detection system. In the case of the DM, the variable to be interpolated is the range (distance), while for the RM the reflectance value (reflection return) is the variable to be interpolated. The reflectivity attribute is related to the type of surface from which the LIDAR reflection is obtained. Figure 3 shows an example color image followed by the dense maps (DM and RM) obtained using DT and nearest-neighbor interpolation. The image and the LIDAR data used to obtain the dense maps are taken from the KITTI dataset.
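A minimal sketch of this up-sampling step is given below; it is not the authors' implementation. It assumes the LIDAR points have already been projected onto the image plane using the LIDAR-camera calibration, and uses scipy's nearest-neighbor griddata interpolation as a simple stand-in for the DT-based interpolation described above. Function and variable names are illustrative.

```python
import numpy as np
from scipy.interpolate import griddata

def dense_map(u, v, values, height, width):
    """Up-sample sparse per-pixel LIDAR values (range or reflectance)
    into a dense map by nearest-neighbor interpolation over the
    projected points.

    u, v          : 1D arrays of projected point coordinates (pixels)
    values        : 1D array of the quantity to interpolate
                    (range for the DM, reflectance for the RM)
    height, width : size of the output map (camera image size)
    """
    points = np.stack([u, v], axis=1)
    grid_u, grid_v = np.meshgrid(np.arange(width), np.arange(height))
    dense = griddata(points, values, (grid_u, grid_v), method='nearest')
    return dense.astype(np.float32)

# Hypothetical usage: pts_img is an Nx2 array of LIDAR points projected
# inside the image; rng and refl are the corresponding range and
# reflectance values (within 80 m, as in the paper).
# dm = dense_map(pts_img[:, 0], pts_img[:, 1], rng, 375, 1242)
# rm = dense_map(pts_img[:, 0], pts_img[:, 1], refl, 375, 1242)
```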

Fig. 3. An example of a color image taken from the KITTI dataset with superimposed LIDAR data, followed by the corresponding depth (DM) and reflectance dense maps (RM).

3.3. YOLO Vehicle Detection using Color, DM, and RM Data
You Only Look Once (YOLO) [3, 4] is a state-of-the-art, real-time object detection system. In this paper, YOLOv2 [4] is trained individually on each of the three training sets (color, DM and RM). The result is three trained YOLOv2 models, one per modality.

3.3.1. YOLO: Real-Time Object Detection Framework
In YOLO, object detection is defined as a regression problem and the object BBs and detection scores are directly estimated from image pixels. This approach eliminates the need for an object proposal step. First, the input image is resized to a resolution of 416×416 pixels. Next, the image is divided into 7 × 7 grid regions, and two BB centers are assumed in each grid cell. Therefore, each grid cell predicts two BBs with their associated confidence scores (this means a prediction of at most 98 bounding boxes per image). A single convolutional network runs once on the image to predict the object BBs. The network is composed of 24 convolutional layers followed by 2 fully connected layers which connect to a set of bounding box outputs. Finally, non-maximum suppression is applied to suppress duplicated detections. YOLO looks at the whole image during training and at test time; therefore, in addition to object appearance, its predictions are informed by contextual information in the image.

3.4. Multimodal Detection Fusion
This section presents a multimodal detection fusion system that uses the associated confidences of the individual detections (detection scores) and the characteristics of the detected BBs in each modality to learn a fusion model and to deal with the detection limitations of each modality.
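As a concrete illustration of the inputs to this fusion stage, the sketch below (my own notation, not the authors' code) groups the per-modality detections into the overlapping combinations that Section 3.4.1 calls tri-BBs: boxes from different modalities that overlap are associated with the same object, and a modality that did not fire is left empty.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional

@dataclass
class Detection:
    x: float      # top-left x (pixels)
    y: float      # top-left y (pixels)
    w: float      # width (pixels)
    h: float      # height (pixels)
    score: float  # detection confidence

def iou(a: Detection, b: Detection) -> float:
    """Intersection-over-union of two boxes."""
    iw = max(0.0, min(a.x + a.w, b.x + b.w) - max(a.x, b.x))
    ih = max(0.0, min(a.y + a.h, b.y + b.h) - max(a.y, b.y))
    inter = iw * ih
    union = a.w * a.h + b.w * b.h - inter
    return inter / union if union > 0 else 0.0

def group_detections(dets: Dict[str, List[Detection]]) -> List[Dict[str, Optional[Detection]]]:
    """Greedily associate detections from the 'C', 'D' and 'R' modalities:
    boxes that overlap across modalities are assumed to belong to the same
    object; a modality without a detection is represented by None."""
    groups: List[Dict[str, Optional[Detection]]] = []
    for mod in ('C', 'D', 'R'):
        for det in dets.get(mod, []):
            placed = False
            for g in groups:
                overlaps = any(b is not None and iou(det, b) > 0.0 for b in g.values())
                if overlaps and g[mod] is None:   # at most one box per modality per group
                    g[mod] = det
                    placed = True
                    break
            if not placed:
                g = {'C': None, 'D': None, 'R': None}
                g[mod] = det
                groups.append(g)
    return groups
```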

Fig. 4. Feature extraction and the joint re-scoring training strategy. Some of the different situations that may happen in tri-BB generation are depicted in the 'image plane' (on the left). The hypothetical detections from YOLO-C, YOLO-D, YOLO-R and the ground-truth are depicted, in the image plane, with red, green, blue and dashed-magenta BBs, respectively. The extracted features and the target are represented by matrices in which each column corresponds to a feature and each row to a combination of detections (in the middle). Each matrix cell contains the colors corresponding to the detections contributing to that feature's value, or a dash on a gray background if the value is missing (zero).

3.4.1. Joint Re-Scoring using an MLP Network
The detections from the modalities are in the form of a set of BBs {BBC, BBD, BBR} with their associated confidence scores {sC, sD, sR}. The overlap between the BBs {BBC, BBD, BBR} is computed, and boxes that overlap are considered to be detecting the same object. Then, a set of overlapping BBs is extracted and, for each detector present in the set, all combinations are extracted. The ideal result is a set of three BBs (henceforth called tri-BBs), one from each modality. If a given modality is not present in the set, the corresponding detector BB is considered to be empty (BB = ∅). A Multi-Layer Perceptron Neural Network is used as a fitting function and is applied over a set of attributes extracted from the tri-BBs to learn the multidimensional nonlinear mapping between the BBs from the modalities and the ground-truth BBs. For each combination of BBs, the extracted attributes (F) are:

F = (sC, sD, sR, BBC, BBD, BBR, µx, µy, σx, σy, BBM)    (1)

where sC, sD, sR are the detection confidence scores and BBC, BBD, BBR are the BBs corresponding to the color, DM and RM detectors. Every BB is defined by four properties {w, h, cx, cy}, which indicate the width, the height and the geometrical center in x and y (all normalized with respect to the image's width and height), respectively. The µx, µy, σx and σy correspond to the average of the geometrical centers of all available BBs and their standard deviations. BBM corresponds to the minimum bounding box that contains all non-empty bounding boxes in the combination. In cases where combinations do not contain one or two detectors, the scores and BBs for those detectors are set to zero and are not included in the computation of the average, standard deviation and minimum containing bounding box (see Figure 4). This results in a feature vector of size 23. The associated set of target data (T), defining the desired output, is determined as a set of three intersection-over-union (IOU) metrics:

T = (IOUC, IOUD, IOUR)    (2)

IOUi = Area(BBi ∩ BBg) / Area(BBi ∪ BBg)    (3)

where i denotes each modality {C, D, R} and g denotes the ground-truth BB. Once the MLP has fit the data, it forms a generalization of the 'extracted features from tri-BBs' and their 'intersection-over-union overlap with the ground-truth'. The trained MLP learns to estimate the overlap of tri-BBs with the ground-truth and, based on that, re-scores the tri-BBs. A simple averaging rule between scores is applied when there are multiple scores for the same BB. The re-scoring function generates, per frame, the same set of detection BBs from the different modalities, i.e., {BBC, BBD, BBR}, with the re-scored detection confidences {s′C, s′D, s′R} (see Figure 2).
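A minimal sketch of the feature and target construction in Eqs. (1)-(3) follows; it is my own formulation of the description above, with illustrative function and variable names. Boxes are given in the normalized (w, h, cx, cy) form; a missing modality contributes zeros and is excluded from the mean, standard deviation and enclosing box, which yields the 23-dimensional feature vector.

```python
import numpy as np

MODS = ('C', 'D', 'R')

def bb_to_vec(bb):
    """bb = (w, h, cx, cy), normalized by image width/height; None -> zeros."""
    return np.zeros(4) if bb is None else np.asarray(bb, dtype=float)

def features(scores, boxes):
    """Eq. (1): F = (sC, sD, sR, BBC, BBD, BBR, mu_x, mu_y, sig_x, sig_y, BBM).
    scores: dict modality -> confidence or None; boxes: dict modality -> (w, h, cx, cy) or None.
    Assumes at least one modality is present (true by construction of a tri-BB group)."""
    s = np.array([scores[m] if scores[m] is not None else 0.0 for m in MODS])
    bbs = np.concatenate([bb_to_vec(boxes[m]) for m in MODS])
    present = [np.asarray(boxes[m], dtype=float) for m in MODS if boxes[m] is not None]
    centers = np.array([[b[2], b[3]] for b in present])     # (cx, cy) of available BBs
    mu, sigma = centers.mean(axis=0), centers.std(axis=0)
    # minimum enclosing box of the available BBs, returned in (w, h, cx, cy) form
    x1 = min(b[2] - b[0] / 2 for b in present); y1 = min(b[3] - b[1] / 2 for b in present)
    x2 = max(b[2] + b[0] / 2 for b in present); y2 = max(b[3] + b[1] / 2 for b in present)
    bbm = np.array([x2 - x1, y2 - y1, (x1 + x2) / 2, (y1 + y2) / 2])
    return np.concatenate([s, bbs, mu, sigma, bbm])         # 3 + 12 + 2 + 2 + 4 = 23

def iou(box, gt):
    """Eq. (3): IOU of a (w, h, cx, cy) box with the ground-truth box."""
    ax1, ay1 = box[2] - box[0] / 2, box[3] - box[1] / 2
    ax2, ay2 = box[2] + box[0] / 2, box[3] + box[1] / 2
    bx1, by1 = gt[2] - gt[0] / 2, gt[3] - gt[1] / 2
    bx2, by2 = gt[2] + gt[0] / 2, gt[3] + gt[1] / 2
    iw = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    ih = max(0.0, min(ay2, by2) - max(ay1, by1))
    inter = iw * ih
    union = box[0] * box[1] + gt[0] * gt[1] - inter
    return inter / union if union > 0 else 0.0

def targets(boxes, gt):
    """Eq. (2): T = (IOU_C, IOU_D, IOU_R); zero for a missing modality."""
    return np.array([iou(boxes[m], gt) if boxes[m] is not None else 0.0 for m in MODS])
```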

3.4.2. Non-Maximum Suppression
The input to the Non-Maximum Suppression (NMS) module is a set of BBs in the same 'neighborhood' area, which is a consequence of having a multimodal detection system. This could degrade the performance of the detection algorithm, as can be seen later in the experimental results section. To solve this, NMS is used to discard multiple detected occurrences around close locations, i.e., to retain the locally most confident detection. The ratio ϒ between the intersection and the union area of the overlapping detection windows is calculated and, for ϒ greater than a threshold τ = 0.5 (value obtained experimentally), the detection window with the greatest confidence score is retained and the remaining detections are suppressed. Further strategies to perform NMS are addressed by Franzel et al. [35]. An example of the fusion detection process is shown in Figure 5.
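This step can be sketched as a standard greedy non-maximum suppression; the sketch below is generic (not the authors' MATLAB implementation) and assumes boxes given as (x1, y1, x2, y2) corners with their re-scored confidences, and τ = 0.5 as stated above.

```python
import numpy as np

def nms(boxes, scores, tau=0.5):
    """Greedy NMS: keep the locally most confident box and suppress
    overlapping boxes whose IoU with it exceeds tau.
    boxes: (N, 4) array of (x1, y1, x2, y2); scores: (N,) confidences."""
    boxes, scores = np.asarray(boxes, float), np.asarray(scores, float)
    order = scores.argsort()[::-1]                 # most confident first
    keep = []
    while order.size > 0:
        i = order[0]
        keep.append(int(i))
        xx1 = np.maximum(boxes[i, 0], boxes[order[1:], 0])
        yy1 = np.maximum(boxes[i, 1], boxes[order[1:], 1])
        xx2 = np.minimum(boxes[i, 2], boxes[order[1:], 2])
        yy2 = np.minimum(boxes[i, 3], boxes[order[1:], 3])
        inter = np.maximum(0.0, xx2 - xx1) * np.maximum(0.0, yy2 - yy1)
        area_i = (boxes[i, 2] - boxes[i, 0]) * (boxes[i, 3] - boxes[i, 1])
        area_o = (boxes[order[1:], 2] - boxes[order[1:], 0]) * (boxes[order[1:], 3] - boxes[order[1:], 1])
        iou = inter / (area_i + area_o - inter)
        order = order[1:][iou <= tau]              # drop boxes overlapping more than tau
    return keep
```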

4. Experimental Setup and Evaluation
For evaluation purposes, quantitative and qualitative experiments using the KITTI object detection dataset [36] were performed to validate the performance of the proposed multimodal vehicle detection system. In comparison with other datasets, in KITTI the object size and pose undergo severe changes, including occlusion, which occur very often in real-world autonomous driving scenarios [37].

4.1. KITTI Dataset for Vehicle Detection

The KITTI dataset was captured in urban areas using an ego-vehicle equipped with different sensors, including a 1.4 Megapixel color camera and a Velodyne LIDAR. The camera image was cropped to 1382×512 pixels and rectified, which results in a smaller image size of about 1242×375 pixels.

Fig. 6. The parallel architecture for real-time implementation, the processing time (in milliseconds) and the implementation environment of the different steps of the proposed detection system (DM/RM generation, YOLO-D/R/C, feature extraction, MLP re-scoring and NMS; per-step times of 34 ms, 15 ms, 2 ms, 11 ms and 1 ms, implemented in C++, C and MATLAB).


Fig. 5. Illustration of the fusion detection process. Top to bottom: detections from YOLO-C (red), YOLO-D (green) and YOLO-R (blue) with associated confidence scores. The fourth image represents the merged detections. The last image shows the fused vehicle detection results (cyan) after re-scoring and NMS compared to the ground-truth (dashed magenta). A dashed cyan BB indicates detections with confidence less than 0.2, which can be discarded by simple post-processing.

The camera was synchronized with the 10 Hz spinning Velodyne HDL-64E. The Velodyne has 64 vertical layers, 0.09° angular resolution, 2 cm distance accuracy, and captures approximately 100k points per cycle. Only LIDAR data within a range of 80 m is considered for dense map generation. The KITTI object detection 'training dataset' contains 7,481 frames with 51,867 labels for 9 different categories: Pedestrian, Car, Cyclist, Van, Truck, Person sitting, Tram, Misc, and Don't care. During the experiments, only the 'Car' label was considered for evaluation. The dataset was partitioned into three subsets: 60% as the training set (4,489 observations), 20% as the validation set (1,496 observations) and 20% as the testing set (1,496 observations).

4.2. Implementation Details and Computational Analysis
The experiments were carried out using a hexa-core 3.5 GHz processor, powered with a GTX 1080 GPU and 64 GB of RAM. The YOLOv2 detection framework [4] (416×416 input resolution; https://pjreddie.com/darknet/yolo/) was used in the experiments. The YOLOv2 detector in each of the color, DM and RM modalities (referred to as YOLO-C, YOLO-D and YOLO-R, respectively) and the proposed learning-based fusion scheme were optimized using the training and validation sets, and evaluated on the testing set. Each individual YOLO-C/D/R was trained for 80,200 iterations. MLPs with one and two hidden layers were experimented with for function fitting. The MLP fitting function was trained using the Levenberg-Marquardt back-propagation algorithm. The implementation environment and the computational load of the different steps of the proposed algorithm are reported in Figure 6. The modality generation and feature extraction steps are implemented in C++, YOLO-C/D/R in C, and the re-scoring and NMS in MATLAB (MEX enabled). The average time for processing each frame is 63 milliseconds (about 16 frames per second). Considering that the synchronized camera and Velodyne LIDAR operate at about 10 Hz, real-time processing can be achieved by the proposed architecture.
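For illustration only, a rough Python analogue of the MLP fitting step is sketched below. The paper trains the MLP with Levenberg-Marquardt back-propagation in MATLAB, for which scikit-learn offers no direct counterpart, so an L-BFGS-trained MLPRegressor is used here as a stand-in; X and Y are placeholders for the 23-dimensional feature matrix and the 3-dimensional IOU targets of Section 3.4.1.

```python
import numpy as np
from sklearn.neural_network import MLPRegressor

# Placeholder data standing in for the real (features, IOU-target) pairs
# built as described in Section 3.4.1.
rng = np.random.default_rng(0)
X = rng.random((1000, 23))
Y = rng.random((1000, 3))

# One hidden layer with 23 neurons (the best single-hidden-layer setting
# reported in Section 4.4.1); for two hidden layers use (15, 7).
mlp = MLPRegressor(hidden_layer_sizes=(23,), activation='tanh',
                   solver='lbfgs', max_iter=2000, random_state=0)
mlp.fit(X, Y)

# Re-scoring: the predicted IOUs act as new confidences (s'_C, s'_D, s'_R)
# for the color, DM and RM boxes of each tri-BB combination.
pred = mlp.predict(X[:5])
print(pred.shape)   # (5, 3)
```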

4.3. Baselines and Metrics
Following KITTI's assessment methodology, the PASCAL VOC intersection-over-union (IOU) metric on three difficulty levels was used as the evaluation criterion, with a required overlap of 70% for car detection. The precision-recall curve and the Average Precision (AP), which corresponds to the area under the precision-recall curve, were computed and reported over the easy, moderate and hard data categories to measure the vehicle detection performance. The difficulty levels are defined as (i) 'Easy', which represents fully visible cars with a minimum BB height of 40 pixels, (ii) 'Moderate', which includes partial occlusions with a minimum BB height of 25 pixels, and (iii) 'Hard', which combines the same minimum BB height with higher occlusion levels.

4.4. Quantitative Evaluation Results
To assess the proposed learning-based fusion detection method, the performance of the fusion model with two sets of features (using the confidence score feature subset, and using the entire feature set) was evaluated on our offline testing set. In addition, we present results in comparison with state-of-the-art methods on the KITTI online benchmark.
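A simplified sketch of the evaluation criterion of Section 4.3 is given below; it is not KITTI's official evaluation code, which additionally handles difficulty filtering and don't-care regions. Detections are greedily matched to the ground truth at IoU ≥ 0.7 and the AP is computed as the area under the precision-recall curve.

```python
import numpy as np

def voc_iou(a, b):
    """IoU of two (x1, y1, x2, y2) boxes."""
    iw = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    ih = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = iw * ih
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0

def average_precision(dets, gts, iou_thr=0.7):
    """dets: list of (frame_id, box, score); gts: dict frame_id -> list of boxes.
    Greedy matching at IoU >= iou_thr, then AP as the area under the PR curve."""
    n_gt = sum(len(v) for v in gts.values())
    dets = sorted(dets, key=lambda d: -d[2])             # by decreasing confidence
    matched = {f: [False] * len(b) for f, b in gts.items()}
    tp, fp = np.zeros(len(dets)), np.zeros(len(dets))
    for k, (f, box, _) in enumerate(dets):
        ious = [voc_iou(box, g) for g in gts.get(f, [])]
        j = int(np.argmax(ious)) if ious else -1
        if j >= 0 and ious[j] >= iou_thr and not matched[f][j]:
            tp[k], matched[f][j] = 1, True
        else:
            fp[k] = 1
    recall = np.cumsum(tp) / max(n_gt, 1)
    precision = np.cumsum(tp) / np.maximum(np.cumsum(tp) + np.cumsum(fp), 1e-9)
    return float(np.trapz(precision, recall))            # area under the PR curve
```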


Fig. 7. The vehicle detection performance (precision vs. recall) in the color, DM and RM modalities. From left to right: YOLO-C (AP: 73.93, 61.69, 54.00), YOLO-D (AP: 68.19, 54.59, 47.61) and YOLO-R (AP: 68.36, 52.23, 45.22) for the easy, moderate and hard levels, respectively.

Fig. 8. The joint re-scoring function learned from the confidence score-only features. The color-coded value is the predicted overlap in the range of [0, 1]. The value ‘0’ indicates the prediction of no-overlap and value ‘1’ is the prediction for 100% overlap between the corresponding detector’s BB and the ground-truth BB.

4.4.1. Performance Evaluation on Validation and Test Sets
The YOLO vehicle detection performance for each modality (color, DM and RM data) is presented in Figure 7. As can be seen by comparing the precision-recall curves, in addition to the color data, the DM and RM modalities, when used individually, show very promising results. Two sets of experiments were conducted to evaluate the performance of the fusion vehicle detection system. The first set of experiments demonstrates the improvement gained using the confidence score feature subset. In the second experiment, the entire feature set is employed for learning the joint re-scoring function.

Experiment using the Confidence Score Feature Subset: The re-scoring function can be interpreted as a three-class function-fitting MLP. To visualize the performance of the fitting function, in the first experiment a 3-layer MLP was trained using a subset of the features (the three detection confidence scores {sC, sD, sR}). All combinations of confidence scores were generated and fed to the trained MLP, and the estimated intersection-over-union overlaps were computed and are shown in Figure 8. This figure illustrates how the modality detectors are related, and shows the learned detector behaviors in each modality based on the detection scores.

In fact, it shows for which combinations of scores each detector performs best. The 3-layer MLP reached a minimum Mean Squared Error (MSE) of 0.0179 with 41 hidden neurons. The Average Precision (AP) of the first experiment on the test set is reported in Table 2. The results show that the fusion method achieves improved performance even when learning only from the detection scores.

Experiment using the Entire Feature Set (Augmented Features): In the second experiment, the full set of features was considered. Experiments with one and two hidden layers were conducted. Figure 9 plots the validation performance of the MLPs as the number of hidden neurons increases. On the training set, as the number of neurons increases, the error decreases. For the 3-layer MLP (one hidden layer), the validation performance reached a minimum Mean Squared Error (MSE) of 0.0156 with 23 hidden-layer neurons. The two-hidden-layer MLP reached the lowest MSE of 0.0155 with 15 and 7 neurons in the first and second hidden layers, respectively. The precision-recall curves of the multimodal fusion vehicle detection after merging, re-scoring and non-maximum suppression are shown in Figure 10. The Average Precision (AP) score is computed on the test set for each independent detector and for the learned fusion models, and is reported in Table 2.

Fig. 9. Effect of the number of layers / hidden neurons on the MLP performance (Mean Squared Error). Left: the training and validation performance of the 3-layer MLP (i.e., one hidden layer) as the number of hidden neurons increases. Middle and right: the two-hidden-layer MLP performance on the training and validation sets, respectively, as a function of the number of neurons in the first and second hidden layers.

Table 2. Performance evaluation of the studied vehicle detectors on the KITTI dataset. The YOLO-Color, YOLO-Depth and YOLO-Reflectance modalities and the late-fusion vehicle detection strategy are compared ('Fusion−' denotes the result using only the confidence score feature subset). The figures denote the Average Precision (AP) measured at the different difficulty levels. The best result in each difficulty level is achieved by the Fusion model.

Modality   Easy      Moderate   Hard
Color      73.93 %   61.69 %    54.00 %
Depth      68.19 %   54.59 %    47.61 %
Reflec.    68.36 %   52.23 %    45.22 %
Fusion−    74.21 %   62.18 %    54.06 %
Fusion     75.13 %   62.74 %    55.10 %

Fig. 10. Multimodal fusion vehicle detection performance (precision vs. recall; AP: 75.13, 62.74, 53.60 for easy, moderate, hard). The merged detections before (dotted line) and after re-scoring (dashed line), and the vehicle detection performance after re-scoring and non-maximum suppression (solid line).

The proposed fusion scheme boosts the vehicle detection performance in each of the easy, moderate and hard categories by at least 1.05 percentage points (in the 'Easy' category the gain reaches 1.2 percentage points).

4.4.2. Performance Evaluation on KITTI Online Benchmark
To compare with the state of the art, the proposed method was evaluated on the KITTI online object detection benchmark against methods that benefit from LIDAR data. Results are reported in Table 3 and Figure 11. As can be noted from the table, the proposed method surpasses some of the approaches in the KITTI benchmark while having the shortest running time.

Table 3. Fusion detection performance on the KITTI online benchmark (run time in seconds).

Approach          Easy      Moderate   Hard      Run Time (s)
MV3D [32]         90.53 %   89.17 %    80.16 %   0.36
3D FCN [38]       85.54 %   75.83 %    68.30 %   5
MV-RGBD-RF [29]   76.49 %   69.92 %    57.47 %   4
VeloFCN [39]      70.68 %   53.45 %    46.90 %   1
Proposed Method   64.77 %   46.77 %    39.38 %   0.063
Vote3D [40]       56.66 %   48.05 %    42.64 %   0.5
CSoR [41]         35.24 %   26.13 %    22.69 %   3.5
mBoW [42]         37.63 %   23.76 %    18.44 %   10


4.5. Qualitative Results
Figure 12 shows some of the most representative qualitative results obtained using the entire feature set. As can be seen, in all cases the proposed multimodal vehicle fusion system cleverly combines the detection confidences of YOLO-C, YOLO-D and YOLO-R and outperforms each individual detector.

Fig. 11. Precision-recall curves on the KITTI online benchmark (Car class).


Fig. 12. Sample screenshots of the fusion detection system results. Left column shows the detection results from YOLO-C (red), YOLO-D (green) and YOLO-R (blue) with associated confidence scores. Right column shows the fusion vehicle detection results (cyan) after re-scoring and NMS compared to ground-truth (dashed-magenta).

5. Summary, Concluding Remarks and Future Directions
Two sensors (a color camera and a 3D-LIDAR) in three modalities (color image, 3D-LIDAR range and reflectance data) were used as inputs to the proposed multimodal vehicle detection system. An individual ConvNet-based vehicle detector is applied to each modality independently, and a decision-level fusion solution based on joint re-scoring and non-maximum suppression integrates the detections according to the BB characteristics, detection confidence scores and so forth. It has been shown that the dense-Depth Map (DM) and dense-Reflectance Map (RM) are very promising for 3D-LIDAR-based vehicle detection. Also, the decision-level fusion outperforms the individual modality detectors and improves the vehicle detection performance by at least 1.05 percentage points (in some cases the gain reaches 1.2 percentage points). As future work, other 3D-LIDAR views (e.g., the top view) can be explored for vehicle detection and incorporated into the fusion framework.

Acknowledgments
This work has been supported by "AUTOCITS - Regulation Study for Interoperability in the Adoption of Autonomous Driving in European Urban Nodes" - Action number 2015-EU-TM-0243-S, co-financed by the European Union (INEA-CEF); and FEDER through the COMPETE 2020 program under grants UID/EEA/00048 and RECI/EEI-AUT/0181/2012 (AMS-HMI12).

References
[1] C. Urmson, J. Anhalt, D. Bagnell, C. Baker, R. Bittner, M. Clark, J. Dolan, D. Duggins, T. Galatali, C. Geyer, et al., Autonomous driving in urban environments: Boss and the urban challenge, Journal of Field Robotics 25 (8) (2008) 425–466.
[2] J. Ziegler, P. Bender, M. Schreiber, H. Lategahn, T. Strauss, C. Stiller, T. Dang, U. Franke, N. Appenrodt, C. G. Keller, et al., Making Bertha drive – an autonomous journey on a historic route, IEEE Intelligent Transportation Systems Magazine 6 (2) (2014) 8–20.
[3] J. Redmon, S. Divvala, R. Girshick, A. Farhadi, You only look once: Unified, real-time object detection, in: CVPR, 2016, pp. 779–788.
[4] J. Redmon, A. Farhadi, YOLO9000: Better, faster, stronger, in: CVPR, 2017.
[5] P. Viola, M. Jones, Rapid object detection using a boosted cascade of simple features, in: CVPR, Vol. 1, 2001, pp. I–I.
[6] N. Dalal, B. Triggs, Histograms of oriented gradients for human detection, in: CVPR, 2005, pp. 886–893.
[7] P. F. Felzenszwalb, R. B. Girshick, D. McAllester, D. Ramanan, Object detection with discriminatively trained part-based models, IEEE PAMI 32 (9) (2010) 1627–1645.
[8] X. Wang, M. Yang, S. Zhu, Y. Lin, Regionlets for generic object detection, IEEE PAMI 37 (10) (2015) 2071–2084.
[9] B. Pepik, M. Stark, P. Gehler, B. Schiele, Multi-view and 3D deformable part models, IEEE PAMI 37 (11) (2015) 2232–2245.
[10] J. R. Uijlings, K. E. Van De Sande, T. Gevers, A. W. Smeulders, Selective search for object recognition, IJCV 104 (2) (2013) 154–171.
[11] Y. LeCun, B. Boser, J. S. Denker, D. Henderson, R. E. Howard, W. Hubbard, L. D. Jackel, Backpropagation applied to handwritten zip code recognition, Neural Computation 1 (4) (1989) 541–551.
[12] A. Krizhevsky, I. Sutskever, G. E. Hinton, ImageNet classification with deep convolutional neural networks, in: NIPS, 2012, pp. 1097–1105.
[13] P. Sermanet, D. Eigen, X. Zhang, M. Mathieu, R. Fergus, Y. LeCun, OverFeat: Integrated recognition, localization and detection using convolutional networks, in: ICLR, 2014.

[14] R. Girshick, J. Donahue, T. Darrell, J. Malik, Rich feature hierarchies for accurate object detection and semantic segmentation, in: CVPR, 2014, pp. 580–587.
[15] R. Girshick, Fast R-CNN, in: ICCV, 2015, pp. 1440–1448.
[16] K. He, X. Zhang, S. Ren, J. Sun, Spatial pyramid pooling in deep convolutional networks for visual recognition, IEEE PAMI 37 (9) (2015) 1904–1916.
[17] S. Ren, K. He, R. Girshick, J. Sun, Faster R-CNN: Towards real-time object detection with region proposal networks, in: NIPS, 2015, pp. 91–99.
[18] R. Girshick, J. Donahue, T. Darrell, J. Malik, Region-based convolutional networks for accurate object detection and segmentation, IEEE PAMI 38 (1) (2016) 142–158.
[19] E. Shelhamer, J. Long, T. Darrell, Fully convolutional networks for semantic segmentation, IEEE PAMI.
[20] W. Liu, D. Anguelov, D. Erhan, C. Szegedy, S. Reed, C.-Y. Fu, A. C. Berg, SSD: Single shot multibox detector, in: ECCV, 2016, pp. 21–37.
[21] Y. Xiang, W. Choi, Y. Lin, S. Savarese, Subcategory-aware convolutional neural networks for object proposals and detection, in: WACV, 2017, pp. 924–933.
[22] Y. Xiang, W. Choi, Y. Lin, S. Savarese, Data-driven 3D voxel patterns for object category recognition, in: CVPR, 2015, pp. 1903–1911.
[23] Z. Cai, Q. Fan, R. S. Feris, N. Vasconcelos, A unified multi-scale deep convolutional neural network for fast object detection, in: ECCV, 2016, pp. 354–370.
[24] X. Chen, K. Kundu, Z. Zhang, H. Ma, S. Fidler, R. Urtasun, Monocular 3D object detection for autonomous driving, in: CVPR, 2016, pp. 2147–2156.
[25] F. Yang, W. Choi, Y. Lin, Exploit all the layers: Fast and accurate CNN object detector with scale dependent pooling and cascaded rejection classifiers, in: CVPR, 2016, pp. 2129–2137.
[26] H. Durrant-Whyte, T. C. Henderson, Multisensor data fusion, in: Springer Handbook of Robotics, 2016, pp. 867–896.
[27] F. Garcia, D. Martin, A. de la Escalera, J. M. Armingol, Sensor fusion methodology for vehicle detection, IEEE Intelligent Transportation Systems Magazine 9 (1) (2017) 123–133.
[28] C. Premebida, J. Carreira, J. Batista, U. Nunes, Pedestrian detection combining RGB and dense LIDAR data, in: IROS, 2014, pp. 4112–4117.
[29] A. González, D. Vázquez, A. M. López, J. Amores, On-board object detection: Multicue, multimodal, and multiview random forest of local experts, IEEE Transactions on Cybernetics.
[30] J. Schlosser, C. K. Chow, Z. Kira, Fusing LIDAR and images for pedestrian detection using convolutional neural networks, in: ICRA, 2016, pp. 2198–2205.
[31] S. Gupta, R. Girshick, P. Arbeláez, J. Malik, Learning rich features from RGB-D images for object detection and segmentation, in: ECCV, 2014, pp. 345–360.
[32] X. Chen, H. Ma, J. Wan, B. Li, T. Xia, Multi-view 3D object detection network for autonomous driving, in: CVPR, 2017.
[33] S.-I. Oh, H.-B. Kang, Object detection and classification by decision-level fusion for intelligent vehicle systems, Sensors 17 (1) (2017) 207.
[34] C. Premebida, L. Garrote, A. Asvadi, A. P. Ribeiro, U. Nunes, High-resolution LIDAR-based depth mapping using bilateral filter, in: ITSC, 2016, pp. 2469–2474.
[35] T. Franzel, U. Schmidt, S. Roth, Object detection in multi-view X-ray images, Springer Berlin Heidelberg, 2012, pp. 144–154.
[36] A. Geiger, P. Lenz, R. Urtasun, Are we ready for autonomous driving? The KITTI vision benchmark suite, in: CVPR, 2012.
[37] J. Janai, F. Güney, A. Behl, A. Geiger, Computer vision for autonomous vehicles: Problems, datasets and state-of-the-art, arXiv.
[38] B. Li, 3D fully convolutional network for vehicle detection in point cloud, in: IROS, 2017.
[39] B. Li, T. Zhang, T. Xia, Vehicle detection from 3D LIDAR using fully convolutional network, in: RSS, 2016.
[40] D. Z. Wang, I. Posner, Voting for voting in online point cloud object detection, in: RSS, 2015.
[41] L. Plotkin, PyDriver: Entwicklung eines Frameworks für räumliche Detektion und Klassifikation von Objekten in Fahrzeugumgebung, 2015.
[42] J. Behley, V. Steinhage, A. B. Cremers, Laser-based segment classification using a mixture of bag-of-words, in: IROS, 2013.
