Alireza Asvadi
Multi-Sensor Object Detection for Autonomous Driving
Tese de Doutoramento em Engenharia Electrotécnica e de Computadores, ramo de especialização em Computadores e Electrónica, orientada pelo Professor Urbano José Carreira Nunes e Professor Paulo José Monteiro Peixoto e apresentada ao Departamento de Engenharia Electrotécnica e de Computadores da Faculdade de Ciências e Tecnologia da Universidade de Coimbra
Março de 2018
Faculty of Science and Technology Department of Electrical and Computer Engineering
Multi-Sensor Object Detection for Autonomous Driving
Alireza Asvadi
Thesis submitted to the Department of Electrical and Computer Engineering of the Faculty of Science and Technology of the University of Coimbra in partial fulfillment of the requirements for the Degree of Doctor of Philosophy
Principal supervisor:
Prof. Urbano José Carreira Nunes
Co-supervisor:
Prof. Paulo José Monteiro Peixoto
Coimbra March 2018
Abstract

In this thesis, we propose on-board multisensor obstacle and object detection systems using a 3D-LIDAR, a monocular color camera and GPS-aided Inertial Navigation System (INS) positioning data, with application in self-driving road vehicles. Firstly, an obstacle detection system is proposed that incorporates 4D data (3D spatial data and time) and is composed of two main modules: (i) a ground surface estimation using piecewise planes, and (ii) a voxel grid model for static and moving obstacle detection using ego-motion information. An extension of the proposed obstacle detection system to a Detection and Tracking of Moving Objects (DATMO) system is proposed to achieve an object-level perception of dynamic scenes, followed by the fusion of 3D-LIDAR with camera data to improve the tracking function of the DATMO system. The proposed obstacle detection is designed to effectively model the dynamic driving environment. The proposed DATMO method is able to deal with the localization error of the position sensing system when computing the motion. The proposed fusion tracking module integrates multiple sensors to improve object tracking. Secondly, an object detection system based on the hypothesis generation and verification paradigms is proposed using 3D-LIDAR data and Convolutional Neural Networks (ConvNets). Hypothesis generation is performed by applying clustering to point cloud data. In the hypothesis verification phase, a depth map is generated from 3D-LIDAR data, and the depth map values are fed to a ConvNet for object detection. Finally, a multimodal object detection is proposed using a hybrid neural network, composed of deep ConvNets and a Multi-Layer Perceptron (MLP) neural network. Three modalities, depth and reflectance maps (both generated from 3D-LIDAR data) and a color image, are used as inputs. Three deep ConvNet-based object detectors run individually on each modality to detect object bounding boxes. Detections in each one of the modalities are jointly learned and fused by an MLP-based late-fusion strategy. The purpose of the multimodal detection fusion is to reduce the misdetection rate of each modality, which leads to a more accurate detection. Quantitative and qualitative evaluations were performed using the ‘Object Detection Evaluation’ dataset and datasets derived from the ‘Object Tracking Evaluation’ of the KITTI Vision Benchmark Suite. The reported results demonstrate the applicability and efficiency of the proposed obstacle and object detection approaches in urban scenarios.
Keywords

Autonomous Vehicles; Robotic Perception; Detection and Tracking of Moving Objects (DATMO); Supervised Learning Based Object Detection
Resumo

Nesta tese é proposto um novo sistema multissensorial de deteção de obstáculos e objetos usando um LIDAR-3D, uma câmara monocular a cores e um sistema de posicionamento baseado em sensores inerciais e GPS, com aplicação a sistemas de condução autónoma. Em primeiro lugar, propõe-se a criação de um sistema de deteção de obstáculos, que incorpora dados 4D (3D espacial + tempo) e é composto por dois módulos principais: (i) uma estimativa do perfil do chão através de uma aproximação planar por partes e (ii) um modelo baseado numa grelha de voxels para a deteção de obstáculos estáticos e dinâmicos recorrendo à informação do próprio movimento do veículo. As funcionalidades do sistema foram posteriormente aumentadas para permitir a Deteção e Seguimento de Objetos Móveis (DATMO), permitindo a percepção ao nível do objeto em cenas dinâmicas. De seguida procede-se à fusão dos dados obtidos pelo LIDAR-3D com os dados obtidos por uma câmara para melhorar o desempenho da função de seguimento do sistema DATMO. Em segundo lugar, é proposto um sistema de deteção de objetos baseado nos paradigmas de geração e verificação de hipóteses, usando dados obtidos pelo LIDAR-3D e recorrendo à utilização de redes neuronais convolucionais (ConvNets). A geração de hipóteses é realizada aplicando um agrupamento de dados ao nível da nuvem de pontos. Na fase de verificação de hipóteses, é gerado um mapa de profundidade a partir dos dados do LIDAR-3D, sendo que esse mapa é inserido numa ConvNet para a deteção de objetos. Finalmente, é proposta uma deteção multimodal de objetos usando uma rede neuronal híbrida, composta por deep ConvNets e uma rede neuronal do tipo Multi-Layer Perceptron (MLP). As modalidades sensoriais consideradas são: mapas de profundidade e mapas de reflectância gerados a partir do LIDAR-3D, e imagens a cores. São definidos três detetores de objetos que, individualmente, em cada modalidade e recorrendo a uma ConvNet, detetam as bounding boxes do objeto. As deteções em cada uma das modalidades são depois consideradas em conjunto e fundidas por uma estratégia de fusão baseada em MLP. O propósito desta fusão é reduzir a taxa de erro na deteção de cada modalidade, o que leva a uma deteção mais precisa. Foram realizadas avaliações quantitativas e qualitativas dos métodos propostos, utilizando conjuntos de dados obtidos a partir dos datasets “Avaliação de Deteção de Objetos” e “Avaliação de Rastreamento de Objetos” do KITTI Vision Benchmark Suite. Os resultados obtidos demonstram a aplicabilidade e a eficiência da abordagem proposta para a deteção de obstáculos e objetos em cenários urbanos.
Palavras chave

Veículos Autónomos; Percepção Robótica; Detecção e Seguimento de Objectos Móveis; Detecção de Objectos Baseada em Aprendizagem Supervisionada
Acknowledgment

Foremost, I would like to express my gratitude to my supervisor Prof. Urbano J. Nunes for his continuous support during my Ph.D. study, and to my co-supervisor Dr. Paulo Peixoto for giving me the motivation to achieve more. I also wish to thank Dr. Cristiano Premebida for his help and support. I would like to thank the co-authors of some papers, especially Pedro Girão, Luis Garrote and João Paulo, for the discussions and contributions that enriched the quality of my research. I would also like to thank my lab colleagues for creating a friendly environment and my Iranian friends who helped me get a fast start in Coimbra. Finally, and on a more personal level, I sincerely thank my family (especially my wife Elham and my parents) for encouraging and supporting me during difficult periods. I acknowledge the Institute for Systems and Robotics – Coimbra for supporting my research. This work has been supported by “AUTOCITS - Regulation Study for Interoperability in the Adoption of Autonomous Driving in European Urban Nodes” - Action number 2015-EU-TM-0243-S, co-financed by the European Union (INEA/CEF); and FEDER through the COMPETE 2020 program under grants UID/EEA/00048 and RECI/EEI-AUT/0181/2012 (AMS-HMI12).
Alireza Asvadi
Coimbra, March 2018
© Copyright by Alireza Asvadi 2018
All rights reserved
Contents

1 Introduction
  1.1 Context and Motivation
  1.2 Specific Research Questions and Key Contributions
      1.2.1 Defining the Key Terms
      1.2.2 Challenges of Perception for Autonomous Driving
      1.2.3 Summary of Contributions
  1.3 Thesis Outline
      1.3.1 Guidelines for Reading the Thesis
  1.4 Publications and Technical Contributions
      1.4.1 Publications
      1.4.2 Software Contributions
      1.4.3 Collaborations

I BACKGROUND

2 Basic Theory and Concepts
  2.1 Robot Vision Basics
      2.1.1 Sensors for Environment Perception
      2.1.2 Sensor Data Representations
            Transformation in 3D Space
      2.1.3 Multisensor Data Fusion
            Sensor Configuration
            Fusion Level
  2.2 Machine Learning Basics
      2.2.1 Supervised Learning
            Multi-Layer Neural Network
            Convolutional Neural Network
            Optimization
      2.2.2 Unsupervised Learning
            DBSCAN

3 Test Bed Setup and Tools
  3.1 The KITTI Vision Benchmark Suite
      3.1.1 Sensor Setup
      3.1.2 Object Detection and Tracking Datasets
      3.1.3 ‘Object Tracking Evaluation’ Based Derived Datasets
  3.2 Evaluation Metrics
      3.2.1 Average Precision and Precision-Recall Curve
      3.2.2 Metrics for Obstacle Detection Evaluation
  3.3 Packages and Toolkits
      3.3.1 YOLOv2

4 Obstacle and Object Detection: A Survey
  4.1 Obstacle Detection
      4.1.1 Environment Representation
      4.1.2 Grid-based Obstacle Detection
            Ground Surface Estimation
            Generic Object Tracking
            Obstacle Detection and DATMO
  4.2 Object Detection
      4.2.1 Recent Developments in Object Detection
            Non-ConvNet Approaches
            ConvNet based Approaches
      4.2.2 Object Detection in ADAS Domain
            Vision-based Object Detection
            3D-LIDAR-based Object Detection
            3D-LIDAR and Camera Fusion

II METHODS AND RESULTS

5 Obstacle Detection
  5.1 Static and Moving Obstacle Detection
      5.1.1 Static and Moving Obstacle Detection Overview
      5.1.2 Piecewise Ground Surface Estimation
            Dense Point Cloud Generation
            Piecewise Plane Fitting
      5.1.3 Stationary – Moving Obstacle Detection
            Ground – Obstacle Segmentation
            Discriminative Stationary – Moving Obstacle Segmentation
  5.2 Extension of Motion Grids to DATMO
      5.2.1 2.5D Grid-based DATMO Overview
      5.2.2 From Motion Grids to DATMO
            2.5D Motion Grid Detection
            Moving Object Detection and Tracking
  5.3 Fusion at Tracking-Level
      5.3.1 Fusion Tracking Overview
      5.3.2 3D Object Localization in PCD
      5.3.3 2D Object Localization in Image
      5.3.4 KF-based 2D/3D Fusion and Tracking

6 Object Detection
  6.1 3D-LIDAR-based Object Detection
      6.1.1 DepthCN Overview
      6.1.2 HG Using 3D Point Cloud Data
            Grid-based Ground Removal
            Obstacle Segmentation for HG
      6.1.3 HV Using DM and ConvNet
            DM Generation
            ConvNet for Hypothesis Verification (HV)
      6.1.4 DepthCN Optimization
            HG Optimization
            ConvNet Training using Augmented DM Data
  6.2 Multimodal Object Detection
      6.2.1 Fusion Detection Overview
      6.2.2 Multimodal Data Generation
      6.2.3 Vehicle Detection in Modalities
      6.2.4 Multimodal Detection Fusion
            Joint Re-Scoring using MLP Network
            Non-Maximum Suppression

7 Experimental Results and Discussion
  7.1 Obstacle Detection Evaluation
      7.1.1 Static and Moving Obstacle Detection
            Evaluation of Ground Estimation
            Evaluation of Stationary – Moving Obstacle Detection
            Computational Analysis
            Qualitative Results
            Extension to DATMO
      7.1.2 Multisensor Generic Object Tracking
            Evaluation of Position Estimation
            Evaluation of Orientation Estimation
            Computational Analysis
            Qualitative Results
  7.2 Object Detection Evaluation
      7.2.1 3D-LIDAR-based Detection
            Evaluation of Recognition
            Evaluation of Detection
            Computational Analysis
            Qualitative Results
      7.2.2 Multimodal Detection Fusion
            Evaluation on Validation Set
            Evaluation on KITTI Online Benchmark
            Computational Analysis
            Qualitative Results

III CONCLUSIONS

8 Concluding Remarks and Future Directions
  8.1 Summary of Thesis Contributions
      8.1.1 Obstacle Detection
      8.1.2 Object Detection
  8.2 Discussions and Future Perspectives

Appendices

Appendix A 3D Multisensor Single-Object Tracking Benchmark
  A.1 Baseline 3D Object Tracking Algorithms
  A.2 Quantitative Evaluation Methodology
  A.3 Evaluation Results and Analysis of Metrics
      A.3.1 A Comparison of Baseline Trackers with the State-of-the-art Computer Vision based Object Trackers

Appendix B Object Detection Using Reflection Data
  B.1 Computational Complexity and Run-Time
  B.2 Quantitative Results
      B.2.1 Sparse Reflectance Map vs RM
      B.2.2 RM Generation Using Nearest Neighbor, Linear and Natural Interpolations
  B.3 Qualitative Results

Bibliography
List of Figures

1.1 The summary of contributions.
1.2 An illustrative block diagram of the contents of Chapters 5 and 6.
2.1 A summary of advancement of 3D-LIDAR technologies.
2.2 Employed data representations.
2.3 Coordinate frames and relative poses.
2.4 A symbolic representation of Durrant-Whyte’s data fusion schemes.
2.5 An example of single hidden layer MLP.
2.6 ConvNet layers.
2.7 Basic architecture of ConvNet.
2.8 Main concepts in DBSCAN.
3.1 Sensors setup on AnnieWAY.
3.2 The top view of the multisensor configuration.
3.3 An example of the stationary and moving obstacle detection’s GT data.
4.1 Some approaches for the appearance modeling of a target object.
5.1 Architecture of the proposed obstacle detection system.
5.2 An example of the generated dense PCD.
5.3 Illustration of the variable-size ground slicing.
5.4 An example of the application of the gating strategy on a dense PCD.
5.5 The piecewise RANSAC plane fitting process.
5.6 Binary mask generation for the stationary and moving voxels.
5.7 An example result of the proposed obstacle detection system.
5.8 The architecture of the 2.5D grid-based DATMO algorithm.
5.9 The motion computation process.
5.10 The 2.5D grid-based motion detection process.
5.11 A sample screenshot of the proposed 2.5D DATMO result.
5.12 Pipeline of the proposed fusion-based object tracking algorithm.
5.13 Proposed object tracking method results.
5.14 The ground removal process.
5.15 The MS procedure in the PCD.
5.16 The MS computation in the image.
6.1 The proposed 3D-LIDAR-based vehicle detection algorithm (DepthCN).
6.2 HG using DBSCAN in a given point cloud.
6.3 The generated dense-Depth Map (DM) with the projected hypotheses.
6.4 Illustration of the DM generation process.
6.5 The ConvNet architecture.
6.6 The pipeline of the proposed multimodal vehicle detection algorithm.
6.7 Feature extraction and the joint re-scoring training strategy.
6.8 Illustration of the fusion detection process.
7.1 Evaluation of the proposed ground estimation algorithm.
7.2 An example of the obstacle detection evaluation.
7.3 Computational analysis of the proposed obstacle detection method.
7.4 A few frames of obstacle detection results (sequences 1 to 4).
7.5 A few frames of obstacle detection results (sequences 5 to 8).
7.6 2.5D grid-based DATMO results of 3 typical sequences.
7.7 Object tracking results (sequences 1 to 4).
7.8 Object tracking results (sequences 5 to 8).
7.9 The precision-recall of the proposed DepthCN method on KITTI.
7.10 Few examples of DepthCN detection results.
7.11 The vehicle detection performance in color, DM and RM modalities.
7.12 The joint re-scoring function learned from the confidence score.
7.13 Influence of the number of layers / hidden neurons on MLP performance.
7.14 Multimodal fusion vehicle detection performance.
7.15 The precision-recall of the multimodal fusion detection on KITTI.
7.16 Fusion detection system results.
7.17 The parallelism architecture for real-time implementation.
A.1 The precision plot of 3D overlap rate.
A.2 The precision plot of orientation error.
A.3 The precision plot of location error.
B.1 Precision-Recall using sparse RM vs RM.
B.2 Precision-Recall using RM with different interpolation methods.
B.3 An example of color image, RMnearest, RMnatural and RMlinear.
B.4 Examples of RefCN results.
List of Tables

3.1 Detailed information about each sequence used for the stationary and moving obstacle detection evaluation.
3.2 Detailed information about each sequence used for multisensor 3D single-object tracking evaluation.
4.1 Comparison of some major grid-based environment representations.
4.2 Some of the recent obstacle detection and tracking methods for autonomous driving applications.
4.3 Related work on 3D-LIDAR-based object detection.
4.4 Some recent related work on 3D-LIDAR and camera fusion.
7.1 Values considered for the main parameters used in the proposed obstacle detection algorithm.
7.2 Results of the evaluation of the proposed obstacle detection algorithm.
7.3 Percentages of the computational load of the different steps of the proposed system.
7.4 Values considered for the main parameters used in the proposed 2.5D grid-based DATMO algorithm.
7.5 Values considered for the main parameters used in the proposed 3D fusion tracking algorithm.
7.6 Average object’s center position errors in 2D (pixels) and 3D (meters).
7.7 Orientation estimation evaluation (in radians).
7.8 The ConvNet’s vehicle recognition accuracy with (W) and without (WO) applying data augmentation (DA).
7.9 DepthCN vehicle detection evaluation (given in terms of average precision) on KITTI test-set.
7.10 Evaluation of the studied vehicle detectors on the KITTI Dataset.
7.11 Fusion Detection Performance on KITTI Online Benchmark.
A.1 Detailed information and challenging factors for each sequence.
B.1 The RefCN processing time (in milliseconds).
B.2 Detection accuracy with sparse RM vs RM on validation-set.
B.3 Detection accuracy using RM with different interpolation methods on validation-set.
List of Algorithms

1 The DBSCAN algorithm [1].
2 Dense Point Cloud Generation.
3 Piecewise Ground Surface Estimation.
4 Short-Term Map Update.
List of Abbreviations

1D  One-Dimensional; One-Dimension
2D  Two-Dimensional; Two-Dimensions
3D  Three-Dimensional; Three-Dimensions
4D  Four-Dimensional; Four-Dimensions
ADAS  Advanced Driver Assistance Systems
AI  Artificial Intelligence
ANN  Artificial Neural Network
AP  Average Precision
AV  Autonomous Vehicle
BB  Bounding Box
BOF  Bayesian Occupancy Filter
BoW  Bag of Words
BP  Back Propagation
CAD  Computer Aided Design
CA-KF  Constant Acceleration Kalman Filter
CNN  Convolutional Neural Network
ConvNet  Convolutional Neural Network
CPU  Central Processing Unit
CV-KF  Constant Velocity Kalman Filter
CUDA  Compute Unified Device Architecture
DATMO  Detection and Tracking of Moving Objects
DBSCAN  Density-Based Spatial Clustering of Applications with Noise
DEM  Digital Elevation Map
DM  (dense) Depth Map
DoF  Degrees of Freedom
DPM  Deformable Part Model
DT  Delaunay Triangulation
EKF  Extended Kalman Filter
FC  Fully-Connected
FCN  Fully Convolutional Network
FCTA  Fast Clustering and Tracking Algorithm
FoV  Field of View
fps  frames per second
GD  Gradient Descent
GNN  Global Nearest Neighborhood
GNSS  Global Navigation Satellite System
GPS  Global Positioning System
GPU  Graphical Processing Unit
GT  Ground-Truth
HG  Hypothesis Generation
HHA  Horizontal disparity, Height and Angle feature maps
HOG  Histogram of Oriented Gradients
HV  Hypothesis Verification
ICP  Iterative Closest Point
IMU  Inertial Measurement Unit
INS  Inertial Navigation System
IOU  Intersection Over Union
ITS  Intelligent Transportation System
IV  Intelligent Vehicle
KDE  Kernel Density Estimation
KF  Kalman Filter
LBP  Local Binary Patterns
LIDAR  LIght Detection And Ranging
LLR  Log-Likelihood Ratio
LS  Least Square
MB-GD  Mini-Batch Gradient Descent
mDE  mean of Displacement Errors
MHT  Multiple Hypothesis Tracking
MLP  Multi-Layer Perceptron
MLS  Multi-Level Surface map
MS  Mean-Shift
MSE  Mean Squared Error
NMS  Non-Maximal Suppression
NN  Nearest Neighbor; Neural Network
PASCAL VOC  The PASCAL Visual Object Classes project
PCD  Point Cloud Data
PF  Particle Filter
PDF  Probability Density Function
PR  Precision-Recall
R-CNN  Region-based ConvNet
RADAR  RAdio Detection and Ranging
RANSAC  RANdom SAmple Consensus
ReLU  Rectified Linear Unit
RGB  Red Green Blue
RGB-D  Red Green Blue and Depth
RM  (dense) Reflection Map
ROI  Region Of Interest
RPN  Region Proposal Network
RTK  Real Time Kinematic
sDM  sparse Depth Map
SGD  Stochastic Gradient Descent
Sig.  Sigmoid
SIFT  Scale-Invariant Feature Transform
SLAM  Simultaneous Localization And Mapping
SPPnets  Spatial Pyramid Pooling networks
sRM  sparse Range Map; sparse Reflectance Map
SS  Selective Search
SSD  Single Shot Detector
SVD  Singular Value Decomposition
SVM  Support Vector Machine
YOLO  You Only Look Once real-time object detection system
List of Symbols and Notations

Symbols
↦  maps to
←  assignment
{.}  set
∅  empty set
⊂  subset
∈  belonging to a set
≈  approximation
‖.‖  Euclidean norm
∩  intersection of sets
∪  union of sets
∧  logical and
∨  logical or
∂  partial derivative
∆  difference between two variables
⌊.⌋  floor function, truncation operation
⊕  dilation (morphology) operation
ℝ  set of real numbers

General Notation and Indexing
i  first-order index, i ∈ {1, . . . , N}
j  second-order index, j ∈ {1, . . . , M}
a  scalars, vectors
ai  i-th element of vector a
A  matrices
Aᵀ  transpose of matrix A
A⁻¹  inverse of matrix A
|A|  determinant of matrix A
𝒜  sets

Sensor Data
P  a LIDAR PCD
x, y, z, r  LIDAR data (3D real-value spatial data and 8-bit reflection intensity)
µ  height value in a grid cell
E  Elevation grid
S  local (static) short-term map
M  2.5D motion grid
c  occupancy value
υ  voxel size
u, v  size parameters of an image
d  8-bit depth-value in a depth map

Transformations and Projections
R  rotation matrix
t  translation vector
T  transformation matrix
𝒯  transformation in homogeneous coordinates
PC2I  camera coordinate system to image plane projection matrix
R0  rectification matrix
PL2C  LIDAR to camera coordinate system projection matrix
ϕ, θ, ψ  Euler angles roll, pitch and yaw

Machine Learning
𝒟  unknown data distribution
D  independently and identically distributed (i.i.d.) data
x(i), y(i)  i-th pair of labeled training example
h  learned mapping
L  loss function
R  regularization function
J  loss function with regularization
θ  parameters to learn
W, b  weight matrix and bias parameters of a neural network
F(·)  activation function
Li  i-th layer of MLP neural network
nl  number of layers in MLP
W  a 2D weighting kernel
S  feature map
k  number of kernels
w, h  kernel width and height
z  zero padding
s  stride
α  learning rate
c  cluster label

Miscellaneous Notation
∆α  angle between LIDAR scans (in elevation direction)
η  a constant that determines the number of intervals to compute slice sizes
λi  edge of i-th slice
ai, bi, ci, di  parameters of i-th plane
n̂i  unit normal vector of i-th plane
δZ  distance between two consecutive planes
δψ  angle between two consecutive planes
Cs, Cd  static and dynamic counters
Bs, Bd  static and dynamic binary masks
R  LLR
Kσ(.)  Gaussian kernel with width σ
Θ  set of 1D angle values
χ  center of 3D-BB
µ(.)  mean function
ℜ  color model of an object
Ω  2D convex-hull
f  confidence map
Chapter 1

Introduction

Contents
  1.1 Context and Motivation
  1.2 Specific Research Questions and Key Contributions
      1.2.1 Defining the Key Terms
      1.2.2 Challenges of Perception for Autonomous Driving
      1.2.3 Summary of Contributions
  1.3 Thesis Outline
      1.3.1 Guidelines for Reading the Thesis
  1.4 Publications and Technical Contributions
      1.4.1 Publications
      1.4.2 Software Contributions
      1.4.3 Collaborations
Science is nothing but perception.
– Plato
This chapter describes the context, motivation and aims of the study, and summarizes the study rationale. Then, the organization of the thesis and an overview of the chapters are presented. The last section of this chapter describes the dissemination, software components and collaborations.
1.1 Context and Motivation
Injuries caused by motorized road transport were the eighth-leading cause of death worldwide, with over 1.3 million deaths in 2010 [2]. By 2020, it is predicted that road accidents will take the lives of around 1.9 million people [3]. Studies show that human error accounts for more than 94% of accidents [4]. In order to reduce the unacceptably high number of deaths and road injuries resulting from human error, researchers are trying to shift the paradigm of the transportation system, in which the task of the driver changes from driving to supervising the vehicle. Autonomous Vehicles (AVs), which are able to perceive the environment and take appropriate actions, can be expected to be a viable solution to the aforementioned problem. In the last couple of decades, autonomous driving and Advanced Driver Assistance Systems (ADAS) have made remarkable progress. The perception systems of ADASs and AVs, which are of interest here, perceive the environment and build an internal model of the environment using sensor data. In most cases, AVs [5, 6] are equipped with a varied set of on-board sensors, e.g., mono and stereo cameras, LIDAR and RADAR, to have a multimodal, redundant and robust perception of the driving scene. Among the above-named sensors, 3D-LIDARs are the pivotal sensing solution for ensuring the high level of reliability and safety demanded for autonomous driving systems. In this thesis, we propose a framework for driving environment perception with the fusion of 3D-LIDAR, monocular color camera and GPS-aided Inertial Navigation System (INS) data. The specific research questions and the main contributions of this thesis are addressed in the next sections.
1.2 Specific Research Questions and Key Contributions
Despite the impressive progress already accomplished in modeling and perceiving the surrounding environment of AVs, incorporating multisensor multimodal data and designing a robust and accurate obstacle and object detection system is still a very challenging task, which we are trying to address in this thesis. Some of the commonly used terms in the proposed approach, key issues for driving scene understanding, and a summary of the contributions are provided in the following sections.
1.2.1 Defining the Key Terms
In this section we define the key terms that refer to the concepts at the core of this thesis.

• Obstacle detection. Throughout this dissertation, the terms ‘Obstacle’ and ‘generic object’ are used interchangeably to refer to anything that stands on the ground (usually on the road in the vicinity of the AV), which can potentially lead to a
collision and obstructs the AV’s mission. Examples of obstacles are items such as traffic signal poles, street trees, fireplugs, objects (e.g., pedestrians, cars and cyclists), animals, rocks, etc. Obstacles can also be items that are foreign to the usual driving environment, such as auto parts or waste/trash which may have been deliberately or accidentally left on the road. The term ‘obstacle detection’ will refer to using sensor data to detect and localize all entities (obstacles) above the ground. From a practical point of view, obstacle detection is closely coupled with free-space detection and has a direct application in safe driving systems such as collision detection and avoidance systems.

• (Class-specific) object detection. The term ‘object detection’ will be used in this thesis to refer to the discovery of specific categories of objects (e.g., pedestrians, cars and cyclists) within a driving scene from sensor data. Class-specific object detection closely corresponds to the supervised learning paradigm, where Ground-Truth (GT) training labels are available (please refer to Section 2.2.1 for more details). The term ‘3D object detection’ will be used in this study to describe the identification of the volume of a specific class of objects from the sensor data or a ‘representation’ of the sensor data in ℝ³. The term ‘representation’ is used to describe the encoding of sensor data into a form suitable for computational processing. For more about sensor data representations, please refer to Section 2.1.2. It should be noted that the detection of the angular position, orientation or pose of the object is not addressed in this thesis.
1.2.2 Challenges of Perception for Autonomous Driving
Autonomous driving has seen a lot of progress recently; however, to increase the capability of AVs to perform with high reliability in dynamic (real-world) driving conditions, the perception system of AVs needs to be empowered with a stronger representation and a greater understanding of their surroundings. In summary, the following key problems for perceiving the dynamic road scene were identified.

• Question 1. What is an effective representation of the dynamic driving scene? How to efficiently and effectively segment the static and moving parts (obstacles) of the driving scene? How to detect generic moving objects and deal with the localization error in the AV’s position sensing?

• Question 2. On what level should multisensor fusion act? How to integrate multiple sources of sensor data to reliably perform object tracking? How to build a real-time multisensor multimodal object detection system, and how to overcome limitations in each sensor/modality?
1.2.3 Summary of Contributions
The key contributions of this thesis are novel approaches for multisensor obstacle and object detection. To summarize, the following specific contributions were proposed:

• A Stationary and Moving Obstacle Segmentation System. To address the questions “what is an effective representation of the dynamic driving scene?” and “how to efficiently segment the static and moving parts (obstacles) of the driving scene?”, we proposed an approach for modeling the 3D dynamic driving scene using a set of variable-size planes and arrays of voxels. 3D-LIDAR and GPS-aided Inertial Navigation System (INS) data are used as inputs. The set of variable-size planes is used to estimate and model non-planar grounds, such as undulated roads and curved uphill - downhill ground surfaces. The voxel pattern representation is used to effectively model obstacles, which are further segmented into static obstacles and moving obstacles (see Fig. 1.1 (a)).

• A Generic Moving Object Detection System. In an attempt to address the problem of “how to detect generic moving objects and deal with the localization error in the AV’s position sensing?”, we proposed a motion detection mechanism that can handle localization errors and suppress false detections using spatial reasoning. The proposed method extracts an object-level representation from motion grids, in the absence of a priori assumptions on the shape of objects, which makes it suitable for a wide range of objects (see Fig. 1.1 (b)).

• A Multisensor Generic Object Tracking System. In an attempt to answer the question “on what level should multisensor fusion act to reliably perform object tracking?”, we proposed a multisensor 3D object tracking system which is designed to maximize the benefits of using dense color images and sparse 3D-LIDAR point clouds in combination with INS localization data. Two parallel mean-shift algorithms are applied for object detection and localization in the 2D image and the 3D point cloud, followed by a robust 2D/3D Kalman Filter (KF) based fusion and tracking (see Fig. 1.1 (c)).

• A Multimodal Object Detection System. Ultimately, with the aim of answering the question “how to build a real-time multisensor multimodal object detection system, and how to overcome limitations in each sensor/modality?”, we proposed an approach using a hybrid neural network, composed of a ConvNet and a Multi-Layer Perceptron (MLP), to combine front-view dense maps generated from the range and reflection intensity modalities of the 3D-LIDAR with the color camera, in a decision-level fusion framework. The proposed approach is trained to learn and model the nonlinear relationships among modalities, and to deal with detection limitations in each modality (see Fig. 1.1 (d)).
Figure 1.1: The summary of contributions. (a) a stationary and moving obstacle segmentation system using 3D-LIDAR and INS data, where the ground surface is shown in blue, static obstacles are shown in red, and moving obstacles are depicted in green, (b) a generic moving object detection system using 3D-LIDAR and INS data, where the detected generic moving objects are indicated in green bounding boxes, (c) a multisensor generic object tracking system using 3D-LIDAR, camera and INS data, where the tracked generic object is shown in green, and (d) a multimodal object detection system, where the detected object categories are demonstrated in different colors.
1.3 Thesis Outline
This introductory chapter gave the general context of this thesis, its contributions and its structure. The outline of the remaining chapters is as follows.

Part I – BACKGROUND

Chapter 2 – Basic Theory and Concepts. In this chapter we describe the related concepts and the theoretical and mathematical background required for developing the proposed approaches. Some ideas of Robot Vision, such as sensor data representation and multisensor data fusion, are described. Then, a brief introduction to Machine Learning with a focus on the supervised and unsupervised learning paradigms is presented.

Chapter 3 – Test Bed Setup and Tools. This chapter introduces the reader to the experimental setup, the dataset and the evaluation metrics. We also discuss the packages and libraries that we have used to develop our approach.

Chapter 4 – Obstacle and Object Detection: A Survey. In this chapter, we give an overview of approaches for obstacle and object detection. More specifically, the survey focuses on the current state-of-the-art for dynamic driving environment representation, obstacle detection and recent developments in object detection in the ADAS domain.
Part II – METHODS AND RESULTS

Chapter 5 – Obstacle Detection. This chapter addresses the problem of obstacle detection in dynamic urban environments. First we propose a complete framework for 3D static and moving obstacle segmentation and ground surface estimation using a voxel pattern representation and piecewise planes. Next, we discuss generic moving object detection while considering errors in the positioning data. Finally, we propose a multisensor system architecture for the fusion of 3D-LIDAR, color camera and Inertial Navigation System (INS) data at tracking-level.

Chapter 6 – Object Detection. This chapter addresses the problem of object detection. We start by introducing a 3D object detection system based on the Hypothesis Generation (HG) and Verification (HV) paradigms using only 3D-LIDAR data. Next, we propose an extended approach for real-time multisensor and multimodal object detection with fusion of 3D-LIDAR and camera data. Three modalities, the RGB image from the color camera and front-view dense (up-sampled) representations of the 3D-LIDAR’s range and reflectance data, are used as inputs to a hybrid neural network, which consists of a ConvNet and a Multi-Layer Perceptron (MLP), to achieve object detection.

Chapter 7 – Results and Discussion. In this chapter, we analyze the results of our work. The experiments focus on the evaluation of the proposed obstacle detection and object detection approaches. A comparison with state-of-the-art methods and a discussion of the obtained results are provided.

Part III – CONCLUSIONS

Chapter 8 – Concluding Remarks and Future Directions. This chapter concludes the thesis with discussions on the thesis novelty, contributions and achieved objectives, and also suggestions for future work.
1.3.1 Guidelines for Reading the Thesis
This section provides guidelines for reading this thesis, explains how the different parts of the thesis are related to each other, and gives suggestions for the reading order. As introduced in this chapter, the main novelty of this research is a multisensor framework for obstacle detection and object detection for autonomous driving. Familiarity with the basic concepts and related work is helpful, though we present the relevant preliminaries in the first part of the thesis. Readers that are familiar with those topics may wish to skip Part I. The second part is mainly based on the 9 published papers (see Section
1.4.1), which comprises the main body of this thesis and presents the contributions and results. For the sake of clarity and in order to improve the thesis readability, Part II mostly follows the course of the developed work itself (chronological order). Chapters 5 and 6, which describe our obstacle and object detection approaches, can be read separately (see Fig. 1.2). The reader interested in the processing of temporal sequences of multisensor data may focus particularly on Chapter 5. In this chapter we describe the proposed stationary and moving obstacle segmentation, and generic moving object detection and tracking. A reader who is mostly interested in multisensor category-based object detection from a single frame of sensor data, which is related to the supervised learning paradigm, may want to focus on Chapter 6. In this chapter we describe the 3D-LIDAR-based and multimodal object detection systems. Chapter 7 presents the experimental results and analysis. Finally, in Part III, the reader can find conclusions and our suggestions for future directions.
Figure 1.2: An illustrative block diagram of the contents of Chapters 5 and 6. The sensors used for each algorithm are shown in the bottom part of the figure.
1.4 Publications and Technical Contributions
Some preliminary reports of the findings and intermediate results of this thesis have been published in 9 papers. Moreover, the implementation of the proposed approaches led to the development of a set of software modules/toolboxes, some of them in collaboration with other colleagues.
1.4.1 Publications
The core parts of this thesis are based on the following peer-reviewed publications of the author.

Journal Publications

• A. Asvadi, L. Garrote, C. Premebida, P. Peixoto, U. Nunes, Multimodal Vehicle Detection: Fusing 3D-LIDAR and color camera data, Pattern Recognition Letters, Elsevier, 2017. DOI: 10.1016/j.patrec.2017.09.038

• A. Asvadi, C. Premebida, P. Peixoto, and U. Nunes, 3D-LIDAR-based Static and Moving Obstacle Detection in Driving Environments: An approach based on voxels and multi-region ground planes, Robotics and Autonomous Systems, Elsevier, vol. 83, pp. 299-311, 2016. DOI: 10.1016/j.robot.2016.06.007

Conference Proceedings and Workshops

• A. Asvadi, L. Garrote, C. Premebida, P. Peixoto, U. Nunes, Real-Time Deep ConvNet-based Vehicle Detection Using 3D-LIDAR Reflection Intensity Data, Robot 2017: Third Iberian Robotics Conference, Advances in Intelligent Systems and Computing 694, Springer, vol. 2, 2018. DOI: 10.1007/978-3-319-70836-2_39

• A. Asvadi, L. Garrote, C. Premebida, P. Peixoto, and U. Nunes, DepthCN: Vehicle Detection Using 3D-LIDAR and ConvNet, IEEE 20th International Conference on Intelligent Transportation Systems (ITSC 2017), 2017. DOI: 10.1109/ITSC.2017.8317880

• A. Asvadi, P. Girao, P. Peixoto, and U. Nunes, 3D Object Tracking Using RGB and LIDAR Data, IEEE 19th International Conference on Intelligent Transportation Systems (ITSC 2016), 2016. DOI: 10.1109/ITSC.2016.7795718

• P. Girao, A. Asvadi, P. Peixoto, and U. Nunes, 3D Object Tracking in Driving Environment: A short review and a benchmark dataset, PPNIV16 Workshop, IEEE 19th International Conference on Intelligent Transportation Systems (ITSC 2016), 2016. DOI: 10.1109/ITSC.2016.7795523

• C. Premebida, L. Garrote, A. Asvadi, A. P. Ribeiro, and U. Nunes, High-resolution LIDAR-based Depth Mapping Using Bilateral Filter, IEEE 19th International Conference on Intelligent Transportation Systems (ITSC 2016), 2016. DOI: 10.1109/ITSC.2016.7795953

• A. Asvadi, P. Peixoto, and U. Nunes, Two-Stage Static/Dynamic Environment Modeling Using Voxel Representation, Robot 2015: Second Iberian Robotics Conference, Advances in Intelligent Systems and Computing 417, Springer, vol. 1, pp. 465-476, 2016. DOI: 10.1007/978-3-319-27146-0_36

• A. Asvadi, P. Peixoto, and U. Nunes, Detection and Tracking of Moving Objects Using 2.5D Motion Grids, IEEE 18th International Conference on Intelligent Transportation Systems (ITSC 2015), 2015. DOI: 10.1109/ITSC.2015.133
1.4.2 Software Contributions
This thesis also has several MATLAB / C++ technical contributions, which are available at the author's GitHub page¹. The high-level MATLAB programming language was used to enable rapid prototype development. The main software contributions of this thesis are three-fold, as follows.

• A MATLAB implementation of ground surface estimation, and static and moving obstacle segmentation.

• A MATLAB implementation of on-board multisensor generic 3D object tracking.

• A C++ / MATLAB implementation of multisensor and multimodal object detection.
1.4.3 Collaborations
Parts of this thesis were the outcome of collaborative work with other researchers, which led to joint publications afterwards. While working on my PhD thesis, I co-supervised the Master thesis of Pedro Girão. The multisensor 3D object tracking framework, described in Section 5.3, was jointly developed with Pedro Girão. In Section 6.2, we propose our approach for multimodal object detection, which was a joint work with Luis Garrote and Cristiano Premebida. In particular, the C++ version of the 3D-LIDAR-based dense maps and feature extraction presented in Chapter 6 was joint work with Luis Garrote.
1 https://github.com/alirezaasvadi
Part I

BACKGROUND
Chapter 2

Basic Theory and Concepts

Contents
  2.1 Robot Vision Basics
      2.1.1 Sensors for Environment Perception
      2.1.2 Sensor Data Representations
      2.1.3 Multisensor Data Fusion
  2.2 Machine Learning Basics
      2.2.1 Supervised Learning
      2.2.2 Unsupervised Learning
If I have seen further than others, it is by standing upon the shoulders of giants.
– Isaac Newton
In this chapter we describe the basics of Robot Vision and Machine Learning. Some relevant ideas of sensor data representation and fusion are described. Then, a brief introduction to the supervised and unsupervised learning paradigms is presented.
2.1 Robot Vision Basics
An AV (also known as a robotic car) is a robotic platform that uses a combination of sensors and algorithms to sense its environment, process the sensed data and react appropriately. In this dissertation, the term ‘Robot Vision’ refers to the processing of robotic sensors (such as vision, range and other related sensors), sensor data representations, data fusion and the understanding of sensory data for robotic perception tasks (compared
with the term ‘Computer Vision’, which is mainly focused on extracting information by processing images from cameras). In the following, we present a basic overview of sensors, sensor data representation formats and sensor fusion strategies.
2.1.1 Sensors for Environment Perception
Sensors are the foundation of an AV's perception. An AV usually uses a combination of sensor technologies to have a redundant and robust sensory perception. In the following, the capabilities and limitations of common perception sensors in the Intelligent Vehicle (IV) and Intelligent Transportation Systems (ITS) contexts are discussed, with a focus on 3D-LIDAR sensors.

• Monocular Camera. Monocular cameras have been the most common sensor technology for perceiving the driving environment. Specifically, high-resolution color cameras are the primary choice to detect traffic signs, license plates, pedestrians, cars, and so on. Monocular vision limitations include illumination variations in the image and difficulties in direct depth perception and vision through the night, which restrict their use in realistic driving scenarios and confine the reliability of safe driving.

• Stereo Camera. Binocular vision is a passive solution for depth perception. Although affordable and having no moving parts, the major stereo vision limitations include a poor performance in texture-less environments (e.g., night driving scenarios, snow-covered environments, heavy rain and intense lighting conditions) and the dependency on calibration quality.

• RADAR. RADAR measures distance by emitting and receiving electromagnetic waves. RADAR is able to work efficiently in extreme weather conditions but suffers from a narrow Field of View (FoV) and low resolution, which limits its application in object perception tasks (e.g., object detection and recognition).

• LIDAR. The main characteristics of LIDAR sensors are their wide FoV, very precise distance measurement (cm-accuracy), object recognition at long range (with perception range exceeding 300 m) and night-vision capability. 3D-LIDARs (such as the conventional mechanical Velodyne and Valeo devices) are able to acquire 3D spatial information, are less sensitive to weather conditions in comparison with cameras, and can work under poor illumination conditions. The main disadvantages are the cost, having mechanical parts, being large, having a high power requirement and not acquiring color data, although these issues tend to have less significance with the emergence of Solid-State 3D-LIDAR sensors (e.g., Quanergy's
S3¹, Velodyne Velarray² and Continental AG's 3D Flash LIDAR³ sensors), which are compact, efficient and have no moving parts. Recently, 3D-LIDAR sensors, driven by a reduction in their cost and by an increase in their resolution and range, have started to become a valid option for object detection, classification, tracking, and driving scene understanding. Some of the recent technologies and trends of 3D-LIDARs are shown in Fig. 2.1. In this thesis we restrict ourselves to the 3D-LIDAR and its fusion and integration with a monocular camera (the high-resolution color data provided by the RGB camera can be used as a complement to the 3D-LIDAR data) and GPS-aided Inertial Navigation System (INS) position sensing.

¹ http://quanergy.com/s3/ (accessed October 1, 2017).
² http://velodynelidar.com/news.php#254 (accessed October 1, 2017).
³ https://www.continental-automotive.com/en-gl/Landing-Pages/CAD/AutomatedDriving/Enablers/3D-Flash-Lidar (accessed December 1, 2017).

Figure 2.1: A summary of advancement of 3D-LIDAR technologies: (a) Examples of conventional 3D-LIDARs (LIDARs with moving parts), among which the VLS-128ᵃ is the most recent one; (b) Examples of cost-effective Solid-State 3D-LIDAR sensors (e.g., the S3, coming in at about $250 with a maximum range upward of 150 m) and Koito's headlamp with an integrated S3 LIDARᵇ; and (c) Example of a VLS-128 captured PCD (top) in comparison with an HDL-64E PCD (bottom). Although both the VLS-128 and the HDL-64E are conventional 3D-LIDARs, the VLS-128 is a third of the size and weight, provides considerably denser PCD than the HDL-64E, and can measure up to 300 m (in comparison with the 120 m maximum acquisition range of the HDL-64E).

ᵃ http://velodynelidar.com/blog/128-lasers-car-go-round-round-david-hall-velodynes-new-sensor/ (accessed December 1, 2017).
ᵇ https://twitter.com/quanergy/status/817334630676242433 (accessed December 1, 2017).
Figure 2.2: Employed data representations: (a) (a cropped part of) an RGB image containing a cyclist; (b) the corresponding PCD; (c) the corresponding Voxel Grid representation projected into the image plane; (d) Elevation Grid of a car projected into the image plane; and (e) Top: (a cropped part of) an RGB image with superimposed projected LIDAR points, and bottom: the corresponding generated depth map.
2.1.2 Sensor Data Representations
Representation of 2D and 3D sensor data is a key task for processing, interpreting and understanding data. Examples of different representations include: 2D image, multi-view RGB(D) images, polygonal mesh, point cloud, primitive-based CAD model, depth map, and 2D, 2.5D (Elevation) and 3D (Voxel) grid representations, where each type of representation format has its own characteristics. This section gives the basics of the sensor data representation formats (see Fig. 2.2) and transformation tools that were used for developing the proposed perception algorithms.

• RGB Image. A 2D grayscale image is a u × v grid of pixels, each pixel containing a gray level, that provides a depiction of a scene. An RGB image (which is readily available from a color camera) is an extension of the 2D (grayscale) image, and is defined as a u × v × 3 data array to account for the Red, Green, and Blue color components. Assuming 8 bits for each of the R, G and B elements (each element with an unsigned 8-bit integer has a value between 0 and 2⁸ − 1), a pixel is encoded with 24 bits.

• 3D Point Cloud Data (PCD). A point in 3D Euclidean space can be defined as a position in x-, y- and z-coordinates. PCD is a set of such data points, and can be used to represent a volumetric model of urban environments. In our case, PCD is captured by a 3D-LIDAR and contains additional reflection information. The LIDAR reflection measures the ratio of the received beam sent to a surface, which depends upon the distance, the material, and the angle between the surface normal and the ray. Assuming a 4D LIDAR point (3D real-value spatial data and 8-bit reflection intensity) is denoted by p = [x, y, z, r], the set of data points (PCD) can be described as P = {p1, . . . , pN}, where N is the number of captured points.

• Elevation Grid. An Elevation grid (also called Height map) is a 2.5D grid repre-
2.1. ROBOT VISION BASICS
17
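The following is a minimal NumPy sketch of the voxelization and elevation-grid computations described above; the point values, voxel size and cell size are made up purely for illustration and are not the parameters used in the thesis.

import numpy as np

# Toy PCD: N x 4 array of [x, y, z, reflectance]; the coordinates are arbitrary.
P = np.array([[10.2, 3.1, 0.4, 0.2],
              [10.3, 3.2, 0.5, 0.3],
              [22.7, -4.0, 1.8, 0.9]])

voxel_size = 0.5  # upsilon, in meters

# Step 1: quantize each end-point to the corner of its voxel: floor(p / v) * v.
P_hat = np.floor(P[:, :3] / voxel_size) * voxel_size

# Step 2: unique quantized positions give the occupied voxels; the per-voxel
# counts give the occupancy value c (number of points falling in each voxel).
occupied, counts = np.unique(P_hat, axis=0, return_counts=True)
print(occupied)  # occupied voxel locations U
print(counts)    # occupancy value c per voxel

# Elevation grid cell value: average z of the points mapped into each 2D cell.
cell_size = 0.5
cells = np.floor(P[:, :2] / cell_size).astype(int)
uniq_cells, inv = np.unique(cells, axis=0, return_inverse=True)
heights = np.array([P[inv == k, 2].mean() for k in range(len(uniq_cells))])
print(heights)   # mean height per occupied elevation-grid cell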
Transformation in 3D Space

Transforms are one of the principal tools when working with the 3D representation formats (e.g., PCD, Elevation and Voxel grids). In this section we describe the 3D rigid transformation, the 3D relative pose and the ICP algorithm that were used in this thesis. For further reading we refer to [7].

• 3D Rigid Transformation. Assuming the x-, y- and z-axes form a right-handed coordinate system, the 3D rigid transformation is described by

\[ y = Rx + t \]  (2.1)
where R is a rotation matrix and t is a 3 × 1 translation vector. The rotation matrix R is orthogonal and has the following properties: R^{-1} = R^T and |R| = 1. The rotation matrix R can be decomposed into basic rotations about the x-, y- and z-axes as follows:

\[ R = R_z(\psi)\, R_y(\theta)\, R_x(\varphi) \]  (2.2)
where

\[ R_x(\varphi) = \begin{bmatrix} 1 & 0 & 0 \\ 0 & \cos\varphi & -\sin\varphi \\ 0 & \sin\varphi & \cos\varphi \end{bmatrix}, \quad
R_y(\theta) = \begin{bmatrix} \cos\theta & 0 & \sin\theta \\ 0 & 1 & 0 \\ -\sin\theta & 0 & \cos\theta \end{bmatrix}, \quad
R_z(\psi) = \begin{bmatrix} \cos\psi & -\sin\psi & 0 \\ \sin\psi & \cos\psi & 0 \\ 0 & 0 & 1 \end{bmatrix} \]  (2.3)
where ϕ, θ and ψ are called the Euler angles. Therefore, a 3D rigid transform is described by 6 free parameters (6 DOF): (ϕ, θ, ψ, t_x, t_y, t_z). It is worth mentioning that in AV applications the x-axis usually points in the direction of movement and the z-axis points up. In homogeneous coordinates, (2.1) can be written as

\[ \tilde{y} = T\tilde{x} \]  (2.4)

where the tilde indicates quantities in homogeneous coordinates and

\[ T = \begin{bmatrix} R & t \\ 0 & 1 \end{bmatrix}. \]  (2.5)
• 3D Relative Pose. The relative pose (also known as rigid body motion) can be used to describe transformations between coordinate frames. Fig. 2.3 shows an example of the computation of the object pose in the world coordinate frame by composing two relative poses: from the world coordinate frame {O} to the 3D-LIDAR (mounted on the AV) {L}, and from the 3D-LIDAR {L} to the object coordinate frame {K}, which can be computed by matrix multiplication in homogeneous coordinates (a small numerical sketch of this composition follows):

\[ T_1 T_2 = \begin{bmatrix} R_1 & t_1 \\ 0_{1\times3} & 1 \end{bmatrix} \begin{bmatrix} R_2 & t_2 \\ 0_{1\times3} & 1 \end{bmatrix} = \begin{bmatrix} R_1 R_2 & t_1 + R_1 t_2 \\ 0_{1\times3} & 1 \end{bmatrix}. \]  (2.6)
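To make (2.2)-(2.6) concrete, the following NumPy sketch builds a homogeneous transform from Euler angles and a translation and chains two poses. The angle and translation values are arbitrary illustrations, not values taken from the thesis.

import numpy as np

def rot_zyx(phi, theta, psi):
    """R = Rz(psi) @ Ry(theta) @ Rx(phi), as in (2.2)-(2.3)."""
    cx, sx = np.cos(phi), np.sin(phi)
    cy, sy = np.cos(theta), np.sin(theta)
    cz, sz = np.cos(psi), np.sin(psi)
    Rx = np.array([[1, 0, 0], [0, cx, -sx], [0, sx, cx]])
    Ry = np.array([[cy, 0, sy], [0, 1, 0], [-sy, 0, cy]])
    Rz = np.array([[cz, -sz, 0], [sz, cz, 0], [0, 0, 1]])
    return Rz @ Ry @ Rx

def homogeneous(R, t):
    """Build the 4x4 transform T of (2.5) from R and t."""
    T = np.eye(4)
    T[:3, :3] = R
    T[:3, 3] = t
    return T

# T1: world -> LIDAR pose; T2: LIDAR -> object pose (made-up values).
T1 = homogeneous(rot_zyx(0.0, 0.0, 0.3), np.array([5.0, 2.0, 1.7]))
T2 = homogeneous(rot_zyx(0.0, 0.1, 0.0), np.array([10.0, -1.0, 0.0]))

# Composition as in (2.6): the object pose expressed in the world frame.
T_world_object = T1 @ T2
p_object = np.array([0.0, 0.0, 0.0, 1.0])   # object origin, homogeneous coordinates
print(T_world_object @ p_object)            # its world coordinates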
Figure 2.3: Coordinate frames and relative poses.

• Iterative Closest Point (ICP). The ICP algorithm was first proposed by Besl and McKay [8] for the registration of 3D shapes. Consider two PCDs captured under different pose conditions from the same scene; the objective is to determine the transformation between the PCDs by matching them. More formally, given an observation O and a reference model M, the aim is to determine the rigid transformation from O to M by minimizing the error over the PCD pairs:

\[ \underset{R,t}{\arg\min} \sum_i \big\| M(i) - (R\,O(i) + t) \big\|^2. \]  (2.7)
In the first step, the centroids of the PCDs are computed to estimate the translation by

\[ t = \mu_O - \mu_M \]  (2.8)

where µ_O and µ_M are the means (centroids) of the respective PCDs. Then the correspondences from the observation to the reference model (usually only a subset of points is associated) have to be calculated, e.g., using the KD-tree search algorithm [9]. Finally, a cross-covariance matrix is computed by

\[ C = \sum_i (M(i) - \mu_M)(O(i) - \mu_O)^\top, \]  (2.9)

and by using the Singular Value Decomposition (SVD)

\[ C = V \Sigma U^\top, \]  (2.10)

the rotation matrix is determined by

\[ R = V U^\top. \]  (2.11)
Figure 2.4: A symbolic representation of Durrant-Whyte's data fusion schemes (from [13]).

The estimated transformation (R, t) is used to further adjust the PCDs, and the process is repeated until it converges. A good initial estimate of the transformation reduces the computational cost and speeds up ICP convergence; in our case this initial transformation can come from the AV's positioning data. The ICP algorithm can also be applied to the centers of the voxels in Voxel Grids, or to the heights at the centers of the cells in Elevation Grids.
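Below is a compact sketch of the ICP loop in the spirit of (2.7)-(2.11), using brute-force nearest-neighbor association and the SVD step; the synthetic point sets, iteration count and the way the translation is recovered from the centroids are illustrative assumptions, not the thesis implementation.

import numpy as np

def icp_step(O, M):
    """One ICP iteration: associate, center, SVD, return (R, t) mapping O onto M."""
    # Nearest-neighbor correspondences (brute force, for the sketch only).
    d = np.linalg.norm(O[:, None, :] - M[None, :, :], axis=2)
    M_corr = M[d.argmin(axis=1)]
    mu_O, mu_M = O.mean(axis=0), M_corr.mean(axis=0)
    # Cross-covariance of the centered pairs, then SVD, as in (2.9)-(2.11).
    C = (M_corr - mu_M).T @ (O - mu_O)
    V, _, Ut = np.linalg.svd(C)
    R = V @ Ut
    t = mu_M - R @ mu_O        # translation that aligns the centroids after rotation
    return R, t

# Synthetic example: M is a rotated and shifted copy of O.
rng = np.random.default_rng(0)
O = rng.uniform(-5, 5, size=(200, 3))
angle = 0.1
R_true = np.array([[np.cos(angle), -np.sin(angle), 0],
                   [np.sin(angle),  np.cos(angle), 0],
                   [0, 0, 1]])
M = O @ R_true.T + np.array([0.5, -0.2, 0.0])

for _ in range(20):                   # iterate until (near) convergence
    R, t = icp_step(O, M)
    O = O @ R.T + t

res = np.linalg.norm(O[:, None, :] - M[None, :, :], axis=2).min(axis=1).mean()
print(res)   # mean nearest-neighbor distance after alignment (should shrink toward zero)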
2.1.3 Multisensor Data Fusion
Multisensor fusion is a key aspect of robot vision. Sensor data fusion, in simple terms, can be defined as the process of merging information from more than one source to achieve a more specific inference that cannot be obtained using a single sensor. Several taxonomies for sensor data fusion have been developed based on how the data is handled (e.g., the relations among data sources [10], the input/output data types [11] and the fusion types [12]). In this section we discuss data fusion according to the sensor configuration and the level of data abstraction used for fusion, which we found more suitable for describing our methods. It should be mentioned that, although the terms data fusion and information fusion are usually used interchangeably, the term information refers to processed data with some semantic content.

Sensor Configuration

Durrant-Whyte [10, 13] categorized multisensor fusion, on the basis of the information interaction of the sources, into complementary, competitive and cooperative sub-groups (see Fig. 2.4).
• Complementary. In the complementary sensor configuration, sensors are independent, each with partial information about a scene. In this case, sensor data can be integrated to give a more complete observation of the scene. The aim of complementary fusion is to address the problem of incompleteness. An example is the employment of multiple sensors (e.g., cameras, LIDARs or RADARs), each observing disjoint parts, to cover the entire surroundings of an AV.

• Competitive. A sensor configuration is called competitive if the sensors supply independent information about the same measurement area. The purpose of the competitive fusion strategy is to provide redundant information, to increase robustness and to reduce the effect of erroneous measurements. Examples are object detection and tracking using LIDAR and vision sensors (observing the same FoV).

• Cooperative. A cooperative sensor configuration combines information from independent sensors to obtain information that would not be available from the individual sensors (i.e., a sensor relies on the observations of another sensor to derive new, usually more complex, information). An example is perceiving motion by integrating (series of) 3D-LIDAR and GPS localization inputs. Another example is depth perception using images from two cameras at different viewpoints (stereo vision).

Fusion Level

Data fusion approaches can be classified according to the abstraction level into low-, mid-, high-, and multi-level fusion [13]. In the following, we take object detection (using LIDAR and vision sensors) as an example to discuss the idea of fusion at different levels.

• Low-level. Low-level (also known as signal-level or early) fusion directly combines raw sensor data from multiple sensors to provide merged data for subsequent tasks. An example is combining a 3D-LIDAR-based depth map and color camera data into the RGB-D format and then processing the RGB-D data using an end-to-end object detection framework.

• Mid-level. In mid-level (also known as medium-level or intermediate-level) fusion, features extracted from multiple sensor observations are combined into a concatenated feature vector, which is taken as the input for further processing. For instance, extracting features (e.g., HOG features) from the RGB image and the depth map separately, then concatenating the features and presenting them to a Deformable Parts Model (DPM) detector.
• High-level. High-level (also known as decision-level, symbol-level or late) fusion combines the local semantic representations of each data source to determine the final decision. An example is running an object detection algorithm on the RGB image and the depth map independently to identify object bounding boxes, followed by a combination of the detected bounding boxes (e.g., using a voting method) to obtain the final detections.

• Multi-level. Multi-level (also known as hybrid) fusion addresses the integration of data at different levels of abstraction. For instance, exploiting multiple feature-map layers of an end-to-end ConvNet-based object detection framework, applied to RGB-D data, to obtain more accurate object detection results.
2.2 Machine Learning Basics
Artificial Intelligence (AI) can be defined as the simulation of human intelligence on a machine. Machine Learning (ML), as a particular approach to achieve AI, gives computers the ability to learn (from data) without being explicitly programmed. Broadly, ML can be split into four major types based on the learning style: supervised, unsupervised, semi-supervised and reinforcement learning. In supervised learning, the aim is to learn a mapping from inputs to outputs given labeled input-output pairs, whereas in unsupervised learning the outputs are unknown and the objective is to learn structure from unlabeled data. Semi-supervised learning, as a middle ground between the supervised and unsupervised cases, aims to learn from data given labels for only a subset of the instances. The aim of reinforcement learning is to learn from sequential feedback (reward and punishment) in the absence of training data. The following sections provide the necessary technical background on the supervised and unsupervised learning paradigms (4). For a more thorough introduction we recommend [14] and [15].
2.2.1 Supervised Learning
In the statistical learning framework, it is assumed that the training set D is sampled independently and identically distributed (i.i.d.) from an unknown distribution \(\mathcal{D}\). Given the set of training data D = {(x^(i), y^(i)); i = 1, ..., N}, where each pair (x^(i), y^(i)) is a labeled training example and N is the number of training examples, the aim is to learn a mapping (called a hypothesis) from the input to the output space, h : X → Y, such that even when given a novel input x, h(x) provides an accurate prediction of y. Formally, the goal is to find the h* that minimizes the expected loss over the unknown data distribution \(\mathcal{D}\) (i.e., the average loss over all possible data):

\[ h^* = \underset{h}{\arg\min}\; \mathbb{E}_{(x,y)\sim\mathcal{D}}\big[ \mathcal{L}(h(x), y) \big]. \]  (2.12)

Since accessing \(\mathcal{D}\) is not possible, the above equation is not directly solvable except through indirect optimization over the set of available training data:

\[ h^* = \underset{h}{\arg\min}\; \frac{1}{N} \sum_{i=1}^{N} \mathcal{L}(h(x^{(i)}), y^{(i)}). \]  (2.13)

(4) We mostly followed the notation of Andrew Ng et al., Unsupervised Feature Learning and Deep Learning (UFLDL) Tutorial, retrieved from http://ufldl.stanford.edu/
This learning paradigm is called Empirical Risk Minimization (ERM). However, this inevitable oversimplification of the loss function (i.e., optimizing Equation 2.13 instead of Equation 2.12) can lead to the overfitting problem. This situation happens when the hypothesis obtains high accuracy on the training data D but cannot generalize to data points under the distribution \(\mathcal{D}\), (x, y) ∼ \(\mathcal{D}\), that are not present in the training set. Another issue that may arise is that, if more than one solution to (2.13) exists, which one should be selected? The Structural Risk Minimization (SRM) paradigm refines (2.13) with the introduction of a regularization term R(h) that incorporates the complexity of the hypothesis:

\[ h^* = \underset{h}{\arg\min}\; \frac{1}{N} \sum_{i=1}^{N} \mathcal{L}(h(x^{(i)}), y^{(i)}) + \mathcal{R}(h). \]  (2.14)

An example of a regularization function is R(h) = λ‖h‖², which is called Tikhonov regularization, where λ is a positive scalar and the norm is the ℓ2 norm. SRM balances ERM against hypothesis complexity, and is linked to Occam's Razor, the principle that, given two solutions to the same problem, the simpler solution is preferable [14]. This idea can be traced back to Aristotle's view that "nature always chooses the shortest path". To conclude, it can be noted that (2.14) rectifies the divergence between (2.12) and (2.13). Supervised learning can be categorized into regression and classification problems.

• Regression. In the regression problem, the output y takes continuous real values. Let us consider linear regression with the hypothesis

\[ h_\theta(x^{(i)}) = \sum_{j=0}^{M} \theta_j x_j^{(i)} = \theta^\top x^{(i)}, \]  (2.15)

where the θ_j's are the parameters and the x_j's of x represent features (the intercept term is denoted by x_0 = 1). Considering the squared-error loss function, the objective is to find the θ that minimizes

\[ J(\theta) = \underbrace{\frac{1}{2N} \sum_{i=1}^{N} \big( \theta^\top x^{(i)} - y^{(i)} \big)^2}_{\text{data loss}} + \underbrace{\frac{\lambda}{2N} \sum_{j=1}^{M} \theta_j^2}_{\text{regularization}}. \]  (2.16)

It should be noted that in the computation of the regularization term, the bias term θ_0 is excluded.
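As an illustration of (2.15)-(2.16), the following NumPy sketch evaluates the regularized squared-error objective and its closed-form ridge solution on synthetic data; the data, the noise level and the λ value are invented for the example.

import numpy as np

rng = np.random.default_rng(1)
N, M = 100, 3
X = np.hstack([np.ones((N, 1)), rng.normal(size=(N, M))])   # x0 = 1 intercept column
theta_true = np.array([0.5, 2.0, -1.0, 0.3])
y = X @ theta_true + 0.1 * rng.normal(size=N)

lam = 0.1

def J(theta):
    """Regularized squared-error objective of (2.16); theta_0 is not penalized."""
    residual = X @ theta - y
    data_loss = (residual ** 2).sum() / (2 * N)
    reg = lam * (theta[1:] ** 2).sum() / (2 * N)
    return data_loss + reg

# Closed-form ridge solution (bias excluded from the penalty).
P = lam * np.eye(M + 1)
P[0, 0] = 0.0
theta_hat = np.linalg.solve(X.T @ X + P, X.T @ y)
print(theta_hat, J(theta_hat))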
• Classification. In the classification problem, the output y takes discrete category values. Classification can be either binary (meaning there are two classes to predict) or multi-class. Logistic regression is an example of a binary classification algorithm; the aim is to predict labels y^(i) ∈ {0, 1} using the logistic (sigmoid) function

\[ h_\theta(x) = \frac{1}{1 + \exp(-\theta^\top x)}. \]  (2.17)

The sigmoid function takes any real value as input and outputs a value between 0 and 1 (see Fig. 2.5 (a)). The objective is to minimize the following cross-entropy loss, which is a convex function (i.e., it converges to the global minimum):

\[ \mathcal{L}(h_\theta(x^{(i)}), y^{(i)}) = \begin{cases} -\log(h_\theta(x^{(i)})), & \text{if } y^{(i)} = 1 \\ -\log(1 - h_\theta(x^{(i)})), & \text{if } y^{(i)} = 0 \end{cases} \]  (2.18)

Rewriting (2.18) more compactly and adding the regularization term, the optimization problem becomes
\[ J(\theta) = -\underbrace{\frac{1}{N} \sum_{i=1}^{N} \Big[ y^{(i)} \log(h_\theta(x^{(i)})) + (1 - y^{(i)}) \log(1 - h_\theta(x^{(i)})) \Big]}_{\text{data loss}} + \underbrace{\frac{\lambda}{2N} \sum_{j=1}^{M} \theta_j^2}_{\text{regularization}} \]  (2.19)
In the next subsections we introduce the supervised learning methods that were used in this thesis.

Multi-Layer Neural Network

A basic form of a neural network, comprised of a single neuron (see Fig. 2.5 (a)), takes x as input and outputs

\[ h_{W,b}(x) = \mathcal{F}(W^\top x + b) = \mathcal{F}\Big( \sum_{j=1}^{M} W_j x_j + b \Big), \]  (2.20)

where W and b are the parameters and F(·) is the activation function. While the weights W can be considered as the steepness-changing parameters of the sigmoid function, the bias value b shifts the sigmoid activation function. By changing the notation as (W, b) → θ and choosing the activation in the form of the sigmoid (logistic) function, F(z) = 1/(1 + exp(−z)), it can be seen that (2.20) is equivalent to the hypothesis of logistic regression (2.17). A Multi-Layer Perceptron (MLP) can be seen as an aggregation of such logistic regressions. The main idea behind the MLP, as promised by the universal approximation theorem [16], is that a single hidden layer MLP with a sufficient number of hidden neurons is capable of approximating any function to any desired accuracy.
Figure 2.5: (a) A single neuron, and (b) an example of a single hidden layer MLP. The details of the weights and biases (that should appear on the edges) are omitted to improve readability.

An MLP consists of an input layer, one or more hidden layers and an output layer (see Fig. 2.5 (b)). The role of the input layer (L1) is to pass the input data into the network. The hidden layer (L2) takes the input features x = [x_1, ..., x_M]^T and the bias unit (+1), together with the associated weights and biases (on the edges), to compute the hidden neurons' outputs. The output layer (L3) takes the inputs from the hidden neurons, the bias unit (+1), the weights and the bias, and determines the output of the MLP:

\[ h_{W,b}(x) = \mathcal{F}\big( W^{(2)} \mathcal{F}( W^{(1)} x + b^{(1)} ) + b^{(2)} \big), \]  (2.21)

where W^(1) ∈ R^{s2×s1} and b^(1) ∈ R^{s2×1} are the weight matrix and the biases associated with the connections between L1 and L2, respectively; W^(2) ∈ R^{s3×s2} and b^(2) ∈ R^{s3×1} are the weight matrix and the bias between L2 and L3, respectively; and s_l denotes the number of nodes (bias not included) in layer L_l (i.e., in our example s1 = M and s3 = 1). To show the linkage between neurons, (2.21) can be expanded into the following expression:

\[ h_{W,b}(x) = \mathcal{F}\big( W^{(2)}_{11} a^{(2)}_1 + W^{(2)}_{12} a^{(2)}_2 + \cdots + W^{(2)}_{1 s_2} a^{(2)}_{s_2} + b^{(2)}_1 \big), \]  (2.22)

where

\[ \begin{aligned}
a^{(2)}_1 &= \mathcal{F}\big( W^{(1)}_{11} x_1 + W^{(1)}_{12} x_2 + \cdots + W^{(1)}_{1 s_1} x_M + b^{(1)}_1 \big) \\
a^{(2)}_2 &= \mathcal{F}\big( W^{(1)}_{21} x_1 + W^{(1)}_{22} x_2 + \cdots + W^{(1)}_{2 s_1} x_M + b^{(1)}_2 \big) \\
&\;\;\vdots \\
a^{(2)}_{s_2} &= \mathcal{F}\big( W^{(1)}_{s_2 1} x_1 + W^{(1)}_{s_2 2} x_2 + \cdots + W^{(1)}_{s_2 s_1} x_M + b^{(1)}_{s_2} \big),
\end{aligned} \]  (2.23)

where W^(l)_{ij} denotes the weight between neuron j in L_l and neuron i in L_{l+1}, and a^(l)_i denotes the output of neuron i in L_l. For a single training example (x^(i), y^(i)), based on the ERM paradigm and using the squared-error loss function (other loss functions, e.g., cross-entropy, can be used as well), the objective is to minimize

\[ \mathcal{L}(W, b; x^{(i)}, y^{(i)}) = \frac{1}{2} \big\| h_{W,b}(x^{(i)}) - y^{(i)} \big\|^2. \]  (2.24)
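The following is a minimal NumPy sketch of the single-hidden-layer forward pass (2.21)-(2.23) and the per-example loss (2.24); the layer sizes and the random parameter values are illustrative assumptions.

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def mlp_forward(x, W1, b1, W2, b2):
    """Single-hidden-layer MLP of (2.21): F(W2 F(W1 x + b1) + b2)."""
    a2 = sigmoid(W1 @ x + b1)      # hidden activations a^(2), as in (2.23)
    return sigmoid(W2 @ a2 + b2)   # network output h_{W,b}(x)

rng = np.random.default_rng(0)
M, s2 = 4, 8                        # input size and number of hidden neurons
W1, b1 = rng.normal(size=(s2, M)), np.zeros(s2)
W2, b2 = rng.normal(size=(1, s2)), np.zeros(1)

x, y = rng.normal(size=M), np.array([1.0])
h = mlp_forward(x, W1, b1, W2, b2)
loss = 0.5 * np.sum((h - y) ** 2)   # squared-error loss of (2.24)
print(h, loss)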
If a regularization term is considered, the loss function of an n_l-layered MLP can be defined as (2.25), which can be used for both classification and regression problems:

\[ J(W, b) = \underbrace{\frac{1}{2N} \sum_{i=1}^{N} \big\| h_{W,b}(x^{(i)}) - y^{(i)} \big\|^2}_{\text{data loss}} + \underbrace{\frac{\lambda}{2} \sum_{l=1}^{n_l - 1} \sum_{i=1}^{s_l} \sum_{j=1}^{s_{l+1}} \big( W^{(l)}_{ji} \big)^2}_{\text{regularization}} \]  (2.25)
where the first and second terms are the average sum-of-squared error and the regularization term (controlled by λ ), respectively. By minimizing the above objective function, the network weights can be trained. In addition to weight parameters, an MLP has a set of hyperparameters (the type of activation function, the number of hidden layers and hidden neurons) to optimize, which are usually determined experimentally. A major problem with MLP is that it does not consider the spatial structure of the input image, as for example u × v image data has to be first converted to (u × v) × 1 vector. Convolutional Neural Networks (CNNs, or ConvNets) which explore the spatial image data structure, are described in the next subsection. Convolutional Neural Network Deep Learning (DL), a technique for implementing ML, learns representations of data with multiple levels of abstraction. A ConvNet is a class of DL that uses convolution instead of matrix multiplication in (at least one of) the layers [15]. A ConvNet in its simplest form, is comprised of a series of convolutional layers, non-linearity functions, max pooling layers, and Fully-Connected (FC) layers. To train a ConvNet, millions (or even billions) of parameters need to be tuned (which require large volumes of data). The
parameters are located in the convolutional and FC layers. The non-linearity and pooling layers are parameter-free and perform fixed operations. Each component of a ConvNet is described briefly in the following.

• Convolutional Layer. Assuming a 2D image I as the input and a 2D (weighting) kernel W, the 2D discrete convolution is given by

\[ (I * W)(i, j) = \sum_m \sum_n I(m, n)\, W(i - m, j - n). \]  (2.26)
The convolution is commutative (i.e., I ∗ W = W ∗ I), which comes from the fact that in the above formulation the kernel is flipped (in both rows and columns) relative to the image. However, in practical implementations the commutative property is usually not important, and a similar operation called cross-correlation is used, which is given by

\[ (I * W)(i, j) = \sum_m \sum_n I(i + m, j + n)\, W(m, n). \]  (2.27)
The only difference between convolution and cross-correlation is that in cross-correlation the kernel is applied without flipping (hereinafter both are referred to as convolution). Convolution of a kernel with an image results in a feature map. Concretely, assuming a kernel W ∈ R^{w×h}, the (i, j)-th element of the feature map S is given by

\[ S(i, j) = \mathcal{F}\big( (I * W)(i, j) + b \big). \]  (2.28)

The expansion yields

\[ S(i, j) = \mathcal{F}\Big( \sum_{m=-p}^{p} \sum_{n=-q}^{q} I(i + m, j + n)\, W(m, n) + b \Big), \]  (2.29)

where

\[ W = \begin{bmatrix} W(-p, -q) & \cdots & W(-p, +q) \\ \vdots & W(0, 0) & \vdots \\ W(+p, -q) & \cdots & W(+p, +q) \end{bmatrix} \]  (2.30)
where w = 2p + 1 and h = 2q + 1 are the width and height of the kernel W, F is the Rectified Linear Unit (ReLU) activation function, and b is the bias. By dropping the (i, j) indexes in (2.28) and swapping the positions of I and W (assuming commutativity), the computation of two (hypothetical) consecutive convolutional layers can be formulated as

\[ \mathcal{F}\big( W^{(2)} * \mathcal{F}( W^{(1)} * I + b^{(1)} ) + b^{(2)} \big), \]  (2.31)

which is reminiscent of (2.21). In practice, for each convolutional layer, k kernels (trained in a supervised manner by backpropagation) of size w × h are applied across the u × v input image and, considering zero padding z and stride s as hyperparameters, the result is a stack of k feature maps of size ((u − w + 2z)/s + 1) × ((v − h + 2z)/s + 1).
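A quick numeric check of the feature-map size formula just stated; the input size, kernel size, padding, stride and kernel count are arbitrary example values.

def conv_output_size(u, v, w, h, z, s, k):
    """Spatial size and depth of a conv layer with k kernels of size w x h,
    zero padding z and stride s, applied to a u x v input (formula from the text)."""
    out_u = (u - w + 2 * z) // s + 1
    out_v = (v - h + 2 * z) // s + 1
    return out_u, out_v, k

# e.g., a 375 x 1242 input, 3x3 kernels, padding 1, stride 1, 64 kernels
print(conv_output_size(375, 1242, 3, 3, 1, 1, 64))   # -> (375, 1242, 64)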
Figure 2.6: ConvNet layers: (a) Convolutional layer (without kernel flipping); (b) Pooling layer, and (c) FC layers (with minor modification from [17]).

The primary purpose of the convolutional layer is to extract features; the more layers the network has, the higher-level the features that can be derived. Two main advantages of using convolutions are: 1) parameter sharing, the same kernel parameters are shared across different locations of the input image; and 2) sparse connectivity, kernels operate on small local regions of the input, which reduces the number of parameters (see Fig. 2.6 (a)).

• Non-Linearity. ReLU is given by F(x) = max(0, x), which replaces the negative values in the feature map by zero. ReLU applies an elementwise nonlinear activation (without affecting the receptive fields of the feature maps), which increases the nonlinear properties of the decision function and of the overall network.

• Pooling Layer. The max pooling operator is a form of down-sampling that reduces the spatial size of the input to lower the computational cost and to improve translation invariance. More specifically, max pooling has two hyperparameters (the spatial extent e and the stride s); it partitions the input into a set of rectangles and, for each such subregion, outputs the most important information, i.e., the maximum value (see Fig. 2.6 (b)).

• FC Layer. An FC layer has full connections to all activations in the previous layer (as in a standard MLP neural network). The input to the FC layer is the set of feature maps computed at the previous layer. This stage aims to use the feature maps to classify the input image into classes (see Fig. 2.6 (c)). FC layers usually hold a significant share of the parameters in a ConvNet and hence make it prone to overfitting. The dropout regularization technique [18], which consists of randomly dropping neurons and their connections during training, can be employed to prevent the overfitting problem.
Figure 2.7: Basic architecture of a ConvNet (LeNet-5 [19]).

Fig. 2.7 shows a ConvNet in the form [INPUT, [CONV + RELU, POOL] × 2, FC × 2].
Optimization

Up to this point it has been shown how some problems can be modeled using supervised learning techniques. A question that remains is how to learn the model parameters (i.e., how to solve θ* = argmin_θ J(θ)). A basic approach is to set up a system of equations and solve for the parameters (analytical calculation), or to perform a brute-force search to obtain the parameters. However, modern ML techniques (e.g., ConvNets) have millions of parameters to adjust, which makes direct optimization infeasible. In fact, even using state-of-the-art methods, it is very common to spend months on several machines to optimize a neural network. The basic tools to tackle this sort of problem are explained in the following.

• Gradient Descent (GD). Assuming the loss function J(θ) is differentiable, the slope of J(θ) at the 1D point θ is given by J′(θ). Based on the Taylor series expansion, J(θ + α) ≈ J(θ) + α J′(θ), which can be used to determine the direction in which θ should be moved to decrease J(θ) (e.g., for θ′ = θ − α sign(J′(θ)), J(θ′) < J(θ)). This method (applied iteratively) is called gradient descent. For a multi-dimensional θ, the new point in the direction of steepest descent can be computed as

\[ \theta \leftarrow \theta - \alpha \nabla_\theta J(\theta), \]  (2.32)

where ∇_θ J(θ) is the vector of partial derivatives whose j-th element is ∂J(θ)/∂θ_j, and α is called the learning rate, which defines the step size (along the path of steepest descent) and is usually set to a small value. In the original gradient descent (GD) algorithm (2.32), also known as batch gradient descent, the whole dataset is used to determine the trajectory of steepest descent. GD can be extended to Stochastic Gradient Descent (SGD) and Mini-Batch Gradient Descent (MB-GD), as briefly described in the following.
Stochastic Gradient Descent (SGD). While it is true that a large dataset can improve generalization and alleviate the problem of overfitting, using the whole dataset for every update is sometimes computationally infeasible. In the SGD algorithm, at each step the gradient is estimated from a single randomly picked example.

Mini-Batch Gradient Descent (MB-GD). In this method, which can be considered a compromise between GD and SGD, a subsample of the dataset (on the order of a few hundred examples), drawn uniformly at random, is used to estimate the gradient. MB-GD is sometimes referred to as SGD with mini-batches (a minimal mini-batch update loop is sketched after the backpropagation steps below).

• The Backpropagation (BP) Algorithm. The backpropagation algorithm applies the chain rule to recursively compute the partial derivatives (and consequently the gradients) of the loss function with respect to the weights and biases in the network. The algorithm can be described as follows:

1. To train the network, the parameters are initialized with random values. The input (in the case of ConvNets, the input is an image) propagates through the network and the output is computed. This step is called forward propagation.

2. The output is compared with the Ground-Truth (GT) labels (i.e., the desired output) and the error is computed and stored. If the error exceeds a predefined threshold, the next step is performed; if not, the training of the network is considered complete.

3. The error is back-propagated from the output layer to the input layer to update the parameter values (using gradient descent). This process is repeated and the parameters are adjusted until the difference between the network output and the GT labels reaches the desired error (the predefined threshold).
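Below is a minimal sketch of the mini-batch update rule (2.32) on a synthetic linear regression problem; the data, learning rate, batch size and epoch count are illustrative assumptions.

import numpy as np

rng = np.random.default_rng(0)
N, M = 1000, 5
X = rng.normal(size=(N, M))
theta_true = rng.normal(size=M)
y = X @ theta_true + 0.05 * rng.normal(size=N)

theta = np.zeros(M)
alpha, batch_size = 0.1, 64            # learning rate and mini-batch size

for epoch in range(50):
    for _ in range(N // batch_size):
        idx = rng.choice(N, size=batch_size, replace=False)   # random mini-batch
        Xb, yb = X[idx], y[idx]
        grad = Xb.T @ (Xb @ theta - yb) / batch_size          # gradient of the mini-batch loss
        theta -= alpha * grad                                  # update rule (2.32)

print(np.abs(theta - theta_true).max())    # close to zero after training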
2.2.2 Unsupervised Learning
Given a set of unlabeled data D = {x^(i); i = 1, ..., N}, the aim of unsupervised learning is to discover a compact description of the data. Examples are clustering and dimensionality reduction, which reduce the number of instances and the number of dimensions, respectively. In the following, clustering is described in further detail.

• Clustering. Assuming each data point is represented by x^(i), clustering can be defined as predicting a label c^(i) for each data point, which partitions the data points D into a set of groups called the clustering of D. Clustering aims to group the data such that the groups agree with a 'human interpretation' (which is difficult to define) of the data. In the following subsection DBSCAN, which was used in this thesis, is described.
Figure 2.8: Main concepts in DBSCAN: (a) core, border and outlier points (minPts = 3); (b) directly density-reachable; (c) density-reachable, and (d) density-connected.

DBSCAN

Put simply, Density-Based Spatial Clustering of Applications with Noise (DBSCAN) [1] considers clusters as dense areas segregated from each other by sparse (low-density) areas. DBSCAN has two parameters, the ε-neighborhood and minPts, which define the concept of density: a smaller ε-neighborhood or a higher minPts requires a higher density to create clusters. The formal definition is given below. In DBSCAN, given a set of data points D, the ε-neighborhood of a point p ∈ D is defined as N_ε(p) = {q | d(p, q) ≤ ε}, where d is a distance function (e.g., the Euclidean distance). Based on |N_ε(p)| and a threshold minPts, which is the minimum number of points in an ε-neighborhood, each data point is categorized as a core, border or outlier point, as shown in Fig. 2.8 (a). A point p is a core point if |N_ε(p)| ≥ minPts; it is a border point if it is not a core point but lies in the ε-neighborhood of a core point; otherwise it is considered an outlier. The following definitions are used for the purpose of clustering.

• Directly density-reachable. A point q is directly density-reachable from a point p if q ∈ N_ε(p) and |N_ε(p)| ≥ minPts, which means p is a core point and q is in its ε-neighborhood (see Fig. 2.8 (b)).

• Density-reachable. A point p is density-reachable from a point q if there is a sequence of points p_1, ..., p_n, with p_1 = q and p_n = p, such that p_{i+1} is directly density-reachable from p_i for all i ∈ {1, 2, ..., n − 1} (see Fig. 2.8 (c)).

• Density-connected. A point p is density-connected to a point q if there is a point o such that both p and q are density-reachable from o (see Fig. 2.8 (d)).

A cluster c is defined as a non-empty subset of D satisfying the Maximality and Connectivity conditions.

• Maximality. For all p, q: if p ∈ c and q is density-reachable from p, then q ∈ c.

• Connectivity. For all p, q ∈ c: p is density-connected to q.
Algorithm 1 The DBSCAN algorithm [1].
1: Inputs: Points D, and Parameters: ε-neighborhood, minPts
2: Output: Clusters of points
3: ClusterId ← 1
4: for all core points do
5:   if the core point has no ClusterId then
6:     ClusterId ← ClusterId + 1
7:     Label the core point with ClusterId
8:   end if
9:   for all points in the ε-neighborhood, except the point itself do
10:    if the point has no ClusterId then
11:      Label the point with ClusterId
12:    end if
13:  end for
14: end for

The DBSCAN algorithm begins with an arbitrary selection of a point p. If p is a core point, |N_ε(p)| ≥ minPts, a cluster is initialized and expanded to its density-reachable points. If an additional core point is detected among the density-reachable points, the cluster is further expanded to include all points in the new core point's neighborhood. The cluster is complete when no more core points are left in the expanded neighborhood. This process is repeated with new unvisited points to discover the remaining clusters. After processing all points, the remaining unlabeled points are considered noise (outliers). This procedure is summarized in Algorithm 1.
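The following is a compact, illustrative implementation of the clustering loop of Algorithm 1; the toy 2D points and the parameter values are invented for the example, whereas the thesis applies DBSCAN to 3D-LIDAR point clusters.

import numpy as np

def dbscan(points, eps, min_pts):
    """Minimal DBSCAN: returns a label per point (-1 = noise), following the
    core/border logic of Algorithm 1 with explicit cluster expansion."""
    n = len(points)
    labels = np.full(n, -1)
    dists = np.linalg.norm(points[:, None, :] - points[None, :, :], axis=2)
    neighbors = [np.flatnonzero(dists[i] <= eps) for i in range(n)]
    cluster_id = 0
    for i in range(n):
        if labels[i] != -1 or len(neighbors[i]) < min_pts:
            continue                      # already labeled, or not a core point
        cluster_id += 1
        labels[i] = cluster_id
        frontier = list(neighbors[i])
        while frontier:                   # expand through density-reachable points
            j = frontier.pop()
            if labels[j] == -1:
                labels[j] = cluster_id
                if len(neighbors[j]) >= min_pts:     # j is itself a core point
                    frontier.extend(neighbors[j])
            # border points get the cluster label but are not expanded further
    return labels

# Two dense blobs plus one isolated point (noise).
rng = np.random.default_rng(0)
pts = np.vstack([rng.normal([0, 0], 0.2, (20, 2)),
                 rng.normal([5, 5], 0.2, (20, 2)),
                 [[10.0, 10.0]]])
print(dbscan(pts, eps=0.8, min_pts=3))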
Chapter 3
Test Bed Setup and Tools

Contents
3.1 The KITTI Vision Benchmark Suite
  3.1.1 Sensor Setup
  3.1.2 Object Detection and Tracking Datasets
  3.1.3 'Object Tracking Evaluation' Based Derived Datasets
3.2 Evaluation Metrics
  3.2.1 Average Precision and Precision-Recall Curve
  3.2.2 Metrics for Obstacle Detection Evaluation
3.3 Packages and Toolkits
  3.3.1 YOLOv2

The unexamined life is not worth living.
Socrates
This chapter introduces the working principles of the sensors used in our approach, the experimental dataset (The KITTI Vision Benchmark Suite [20]), the evaluation metrics, and the packages and tools we used to develop our method.
3.1 The KITTI Vision Benchmark Suite

The KITTI dataset (1) is the largest publicly available and most widely used dataset for AV perception applications, with realistic multisensor data and accurate Ground-Truth (GT). Specifically, in comparison with other datasets, in KITTI the object size and pose undergo severe changes, including occlusion, which occur very often in real-world autonomous driving scenarios [21].
(1) http://www.cvlibs.net/datasets/kitti/index.php
Figure 3.1: Sensor setup on AnnieWAY, the recording platform of the KITTI dataset: (a) Karlsruhe, Germany; (b) the experimental vehicle AnnieWAY (MRT/KIT) (courtesy of [20]).
The KITTI dataset is used for the assessment of the proposed algorithms, and for providing the experimental results shown in Chapter 7. In this section we describe the basic characteristics of the KITTI dataset, the instrumented recording platform, the sensor setup and the transformations between sensors. For the detailed specification, please refer to [20].
3.1.1 Sensor Setup
The sensor setup mounted on a Volkswagen Passat B6 (the KITTI recording platform) can be seen in Fig. 3.1. The platform is equipped with one 3D-LIDAR, two color cameras, two gray-scale cameras, and an Inertial Navigation System. In the context of this thesis, the range measurements in the form of PCDs from the 3D-LIDAR, the RGB images from the left color camera, and the positioning data from the GPS-aided Inertial Navigation System (INS) were used for developing the proposed algorithms, with the following characteristics.
• Velodyne HDL-64E. The 3D-LIDAR spins at 10 Hz counter-clockwise with 64 vertical layers (approximately 0.4° equally spaced angular subdivisions), a 26.8° vertical field of view (+2° up / −24.8° down), 0.09° angular resolution and 2 cm distance accuracy, and captures approximately 100k points per cycle. The LIDAR's PCD is compensated for the vehicle ego-motion. The sensor's maximum recording range is 120 m.
Figure 3.2: The top view of the multisensor configuration composed of 4 cameras, a 3D-LIDAR and a GPS-aided INS (courtesy of [20]).
• Point Grey Flea 2 Camera (FL2-14S3C-C). The color camera, facing forward, has a resolution of 1.4 megapixels. The camera image was cropped to 1382×512 pixels, and after image rectification (that is, projecting the stereo images onto a common image plane) the image becomes smaller (about 1242×375 pixels). The camera shutter was synchronized with the 10 Hz spinning Velodyne LIDAR.
• GPS-aided INS (OXTS RT 3003). The GPS-aided INS is a high-precision integrated GPS/IMU inertial navigation system with a 100 Hz sampling rate and a resolution of 0.02 m / 0.1°. The localization data is provided with an accuracy of less than 10 cm when Real Time Kinematic (RTK) float/integer corrections are enabled. RTK is a technique that improves the accuracy of position data obtained from satellite-based positioning systems.
Figure 3.2 shows the sensor configuration (top view) on the KITTI recording platform. The projection of a 3D-LIDAR's PCD P into the left camera's image plane (i.e., Cam 2 (color) coordinates in Fig. 3.2) can be performed as follows:
\[ P^{*} = \overbrace{P_{C2I} \times R_{0} \times P_{L2C}}^{\text{Projection Matrix}} \times P, \]  (3.1)
where P_{C2I} is the projection matrix from the camera coordinate system to the image plane, R_0 is the rectification matrix, and P_{L2C} is the LIDAR-to-camera coordinate system projection matrix.
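The following is a small sketch of the projection chain in (3.1). In practice P_C2I, R_0 and P_L2C come from the KITTI calibration files; here they are replaced by assumed placeholder values (a rough pinhole intrinsic matrix and identity extrinsics), so the LIDAR frame is treated as if it coincided with the camera frame (z forward). The point coordinates are made up.

import numpy as np

K = np.array([[700.0, 0.0, 620.0],     # assumed focal length and principal point
              [0.0, 700.0, 187.0],
              [0.0, 0.0, 1.0]])
P_C2I = np.hstack([K, np.zeros((3, 1))])   # camera-to-image projection (3x4)
R_0 = np.eye(4)                            # rectification (placeholder)
P_L2C = np.eye(4)                          # LIDAR-to-camera transform (placeholder)

def project(points_lidar):
    """Project N x 3 LIDAR points into pixel coordinates following (3.1)."""
    homo = np.hstack([points_lidar, np.ones((len(points_lidar), 1))])   # N x 4
    cam = (P_C2I @ R_0 @ P_L2C @ homo.T).T                              # N x 3
    return cam[:, :2] / cam[:, 2:3]                                     # divide by depth

pts = np.array([[1.0, -0.5, 10.0], [-2.0, 0.3, 20.0]])   # made-up points, z forward
print(project(pts))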
3.1.2 Object Detection and Tracking Datasets
The KITTI dataset was captured in urban areas using an ego-vehicle equipped with multiple sensors. The data used to validate the performance of the proposed algorithms is taken from the 'Object Detection Evaluation' and the 'Object Tracking Evaluation' of the KITTI Vision Benchmark Suite, as follows:

• Object Detection Evaluation. The KITTI Object Detection Evaluation is partitioned into two subsets: a training and a testing set. The 'training dataset' contains 7,481 frames of images and PCDs with 51,867 labels for nine different categories: Pedestrian, Car, Cyclist, Van, Truck, Person sitting, Tram, Misc, and Don't care. The 'test dataset' consists of 7,518 frames (the class labels for the test dataset are not accessible to users). Although eight object classes are labeled (excluding the Don't care label), only Pedestrian, Car and Cyclist are evaluated in the online benchmark (test dataset). It should be noted that the dataset contains objects with different levels of occlusion.

• Object Tracking Evaluation. The KITTI Object Tracking Evaluation is composed of 21 training and 29 test sequences (of different lengths). In the KITTI dataset, objects are annotated with their tracklets, and generally, the dataset is more focused on the evaluation of the data association problem in discriminative approaches.

The original KITTI Object Detection Evaluation is used for the evaluation of the proposed object detection algorithms in Chapter 6. During our experiments, only the 'Car' label was considered for evaluation. To the best of our knowledge, there is no publicly available dataset of sequences of images, PCDs and positioning data for evaluating stationary – moving obstacle detection and multisensor generic 3D single-object tracking in driving environments (which are the main modules of the proposed obstacle detection system, and are discussed in Chapter 5). Therefore, two datasets were built out of the KITTI Object Tracking Evaluation to validate the performance of the proposed obstacle detection system, as detailed in the next subsection.
3.1.3 'Object Tracking Evaluation' Based Derived Datasets
In this section we describe the datasets derived from the 'Object Tracking Evaluation' set that are used to evaluate the proposed obstacle detection system, namely the stationary – moving obstacle detection evaluation and the multisensor 3D single-object tracking evaluation.

• Stationary – Moving Obstacle Detection Evaluation. For the obstacle detection evaluation task (which comprises the ground surface estimation and the stationary – moving obstacle detection and segmentation evaluations), eight sequences
Table 3.1: Detailed information about each sequence used for the stationary and moving obstacle detection evaluation.

Seq. | # of Frames | Ego-vehicle Situation | Scene Condition | Object Type | Stationary Objects | Moving Objects
(1)  | 154 | Moving     | Urban    | C.Y.P | 11 | 5
(2)  | 447 | Moving     | Urban    | C.P   | 67 | 6
(3)  | 373 | Hybrid     | Urban    | C     | 25 | 14
(4)  | 340 | Moving     | Downtown | Y.P   | 27 | 25
(5)  | 376 | Hybrid     | Hybrid   | C.Y.P | 1  | 14
(6)  | 209 | Stationary | Downtown | Y.P   | 0  | 17
(7)  | 145 | Stationary | Downtown | P     | 0  | 10
(8)  | 339 | Moving     | Urban    | C     | 0  | 18
from the 'Object Tracking Evaluation' set are used. The 3D Bounding Boxes (3D-BBs) of objects are available in the KITTI dataset. In our dataset, the 3D-BBs of stationary and moving objects were manually discriminated and labeled as stationary or moving by an annotator. An example of the Ground-Truth (GT) data is shown in Fig. 3.3. The characteristics of each sequence are summarized in Table 3.1. Two of the sequences (6 and 7) were taken by a stationary vehicle and four of them (1, 2, 4 and 8) by a moving vehicle. In the remaining sequences the vehicle went through both stationary and moving situations. The dataset covers two scene conditions: highways and roads in urban areas (Urban), and alleys and avenues in downtown areas (Downtown). Different types of objects such as Car (C), Pedestrian (P) and cYclist (Y) are present in the scenes. The total number of objects (stationary and moving) visible in the perception field of the vehicle is also reported per sequence.

• 3D Single-Object Tracking Evaluation. In order to evaluate the performance of the object tracking module, eight challenging sequences were generated out of the 'Object Tracking Evaluation' set. In the rearranged dataset, each sequence corresponds to the full trajectory of a single target object. That is, in comparison to the original tracklets, in the composed dataset the full track of an individual object is extracted (i.e., if one scenario includes two target objects, it is considered as two sequences). The details of each sequence and the challenging factors are reported in Table 3.2. Specifically, this table is divided into three main parts, as described in the following.

General Specifications. This is the description of each sequence, including the number of frames; the scene condition: Urban (U) and Downtown (D);
Figure 3.3: An example of the stationary and moving obstacle detection GT data. The top image shows a screenshot from the KITTI 'Object Tracking Evaluation' with 3D-BBs used to represent objects in the scene. The bottom figure shows the corresponding 3D-BBs in Euclidean space. Green and red BBs indicate moving and stationary objects, respectively. The black dots represent the bases of the 3D-BBs and are used to evaluate the ground surface estimation.

the Ego-vehicle situation: Moving (M) and Stationary (S); and the object type, where C, P, and Y are abbreviations for Car, Pedestrian, and cYclist, respectively.

RGB Camera's Challenging Factors. This describes each sequence in terms of the following challenges: occlusions, with No (N), Partial (P) and Full (F) occlusions; illumination variations; object pose variations; and changes in the object's size, where Y and N are abbreviations for Yes and No, respectively.

3D-LIDAR's Challenging Factors. This describes the main challenges for each PCD sequence in terms of the number of object points, where L and H are abbreviations for Low and High, respectively; the distance to the object, where N, M, and F are abbreviations for Near, Medium, and Far, respectively; and the velocity variations, where Y and N are abbreviations for Yes and No, respectively.
Table 3.2: Detailed information about each sequence used for the multisensor 3D single-object tracking evaluation.

General Specifications:
Seq. | # of Frames | Scene Cond. | Ego-veh. Situation | Obj. Type
(1) | 154 | U | M   | Y
(2) | 154 | U | M   | C
(3) | 373 | U | S-M | C
(4) | 41  | U | S   | Y
(5) | 149 | D | S   | P
(6) | 45  | D | S   | P
(7) | 71  | D | S-M | P
(8) | 188 | D | M-S | P

RGB Camera's Challenging Factors:
Seq. | Occlu. | Illum. Variat. | Obj. Pose Variat. | Obj. Size Variat.
(1) | N   | Y | Y | Y
(2) | P   | Y | Y | Y
(3) | N   | Y | N | Y
(4) | P   | N | N | N
(5) | P   | N | N | Y
(6) | F-P | Y | N | Y
(7) | P-F | Y | N | Y
(8) | N   | Y | Y | Y

3D-LIDAR's Challenging Factors:
Seq. | # of Obj. Points | Distance to the Obj. | Velocity Variat.
(1) | H   | N-M   | N
(2) | H-L | M-F   | Y
(3) | H-L | N-M-F | Y
(4) | H   | N     | N
(5) | H   | N     | N
(6) | H-L | N     | N
(7) | L-H | N     | N
(8) | L-H | M-N   | Y
When there are multiple entries, the order corresponds to the temporal occurrence. For example, in the case of the distance to the object (in the 3D-LIDAR's challenging factors columns), the entry N-M-F denotes that the object was first close to the ego-vehicle, then moved to the middle range, and finally went far. An extended version of this dataset is presented in Appendix A.
3.2 Evaluation Metrics
In this section, we describe the evaluation metrics that were considered as the most relevant for the evaluation of our obstacle and object detection algorithms.
3.2.1 Average Precision and Precision-Recall Curve
For object detection evaluation, KITTI uses as the evaluation criterion the PASCAL VOC (2) Intersection over Union (IoU) metric on three difficulty levels. The overlap rate in 2D is given by

\[ \text{IOU} = \frac{\text{area}(\text{2D-BB} \cap \text{2D-BB}_G)}{\text{area}(\text{2D-BB} \cup \text{2D-BB}_G)}, \]  (3.2)

where 2D-BB denotes the bounding box of a detected object and 2D-BB_G denotes the GT BB. The difficulty levels are defined as (i) 'Easy', which represents fully visible cars with a minimum BB height of 40 pixels; (ii) 'Moderate', which includes partial occlusions with a minimum BB height of 25 pixels; and (iii) 'Hard', which keeps the same minimum BB height but allows higher occlusion levels.
(2) http://host.robots.ox.ac.uk/pascal/VOC/
The precision-recall curve and the average precision (which corresponds to the area under the precision-recall curve) were computed and reported over the easy, moderate and hard data categories (with an overlap threshold of 70% for 'Car' detection) to measure the detection performance. For more details, please refer to [22].
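A minimal sketch of the 2D IoU computation of (3.2); the box coordinates are made-up example values.

def iou_2d(box_a, box_b):
    """IoU of two axis-aligned boxes given as (x1, y1, x2, y2), as in (3.2)."""
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return inter / (area_a + area_b - inter)

# A detection overlapping a ground-truth box by more than the 70% 'Car' threshold.
print(iou_2d((100, 100, 200, 180), (110, 105, 205, 185)))   # about 0.76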
3.2.2 Metrics for Obstacle Detection Evaluation
To the extent of our knowledge, there is no standard and well-established assessment methodology for obstacle detection evaluation. Therefore, we defined several key performance metrics (most of which are defined in 3D space) to evaluate the different components of the proposed obstacle detection system. These metrics, which are further described in Chapter 7, include the following:

• Mean of Displacement Errors (mDE). Obstacle detection algorithms need some assumptions about the ground surface to discriminate between the ground and obstacles. For the evaluation of the ground-surface estimation process, the average distance from the base of the GT 3D-BBs (of labeled objects) to the estimated ground surface is computed as a measure of error (see Fig. 3.3). This is based on the assumption that all objects lie on the true ground surface.

• The number of 'missed' and 'false' detections. As explained in the introduction chapter, the term 'obstacle' refers to all kinds of objects, and GT labels for such a large and diversified group of items are not usually available. The obstacle detection evaluation is therefore performed as follows: 1) a random subsample of frames (of PCDs) is selected from the dataset (3); 2) the detections in those frames are projected into the image plane and a human observer performs a visual analysis in terms of missed and false obstacles; and 3) in order to consider moving obstacles, a similar evaluation procedure is carried out for moving items. The discriminated labels (stationary and moving) in the dataset, although covering only a subset of all obstacles, were used to help the human observer perform the visual analyses.

• The position and orientation errors. To assess the accuracy of the generic 3D object tracking, the position and orientation errors are considered. The average center position errors in 2D and 3D are calculated as the Euclidean distance between the center of the computed 2D-BB / 3D-BB and the corresponding GT BB (the GT 2D- and 3D-BBs are available in the KITTI dataset). In the KITTI dataset, the GT orientation of the object is only given in terms of the yaw angle, which describes the object's heading (for more details, please refer to [23]). The orientation error in 3D is given by the angle between the object pose and the GT pose in the x-y plane (a small sketch of these two errors follows).

(3) Due to the great effort required to analyze the full dataset, a subsample of the dataset is used.
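Below is a small sketch of the 3D center position error and the yaw orientation error described above; the box centers and yaw values are made-up examples.

import numpy as np

def center_error_3d(center_est, center_gt):
    """Euclidean distance between the estimated and GT 3D-BB centers."""
    return float(np.linalg.norm(np.asarray(center_est) - np.asarray(center_gt)))

def yaw_error(yaw_est, yaw_gt):
    """Smallest angle (rad) between the estimated and GT headings in the x-y plane."""
    d = (yaw_est - yaw_gt + np.pi) % (2 * np.pi) - np.pi
    return abs(d)

print(center_error_3d([10.2, 3.1, 0.9], [10.0, 3.0, 1.0]))   # about 0.24 m
print(yaw_error(0.1, -0.05))                                  # 0.15 rad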
3.3 Packages and Toolkits
In this section, we describe the You Only Look Once (YOLO) [24, 25] package. YOLO is a state-of-the-art, real-time object detection system based on Darknet (4), an open-source Neural Network (NN) framework written in C and CUDA.
3.3.1 YOLOv2
In the following, we describe YOLO and its most recent version, YOLOv2, which is used within this thesis (specifically, in Section 6.2). In YOLO, object detection is defined as a regression problem, and object BBs and detection scores are directly estimated from image pixels. Taking advantage of a grid-based structure, YOLO eliminates the need for an object proposal generation step. The YOLO network is composed of 24 convolutional layers followed by 2 Fully-Connected (FC) layers which connect to a set of BB outputs. In YOLO, the image is divided into S × S grid regions, and the output prediction is in the form of an S × S × (B × 5 + C) tensor, where B is the number of BBs assumed in each cell, C is the number of class probabilities, and the coefficient '5' accounts for the 2D spatial position, width, height and confidence score of each BB. In YOLO [24], the input image is divided into 7 × 7 grid regions (i.e., S = 7), and two BB centers are assumed in each grid cell (i.e., each grid cell predicts two BBs and one class with their associated confidence scores, which means a prediction of at most 98 BBs per image). Prediction of 20 classes is considered in YOLO (i.e., C = 20); therefore the output has the shape of a 7 × 7 × 30 tensor. YOLO looks at the whole image during training and test time; therefore, in addition to object appearance, its predictions are informed by contextual information in the image. The most recent version of YOLO, which is used in this thesis (denoted YOLOv2 [25]), and its main differences with respect to the original YOLO are described next. In YOLO, the constraint of predicting only two BBs and one class limits the detection performance for small and nearby objects. In YOLOv2 [25], the image is divided into 13 × 13 grid regions, where each grid cell is responsible for predicting five object BB centers (i.e., 845 BB detections per image). In addition, instead of directly predicting BBs from FC layers (as in the original YOLO), in YOLOv2 the FC layers are removed and the BBs are computed by predicting corrections (or offsets) on five predefined anchor boxes. The network in YOLOv2 is composed of 19 convolutional layers and 5 max-pooling layers, and (similar to YOLO) runs once on the image to predict object BBs. Some additional improvements of YOLOv2 (in comparison with YOLO) are: batch normalization (to speed up learning and also as a form of regularization), a high-resolution classifier and detector, and multi-scale training. For more details please refer to [25].
(4) http://pjreddie.com/darknet
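A quick numeric check of the output encoding just described, using the grid sizes and class counts stated in the text.

def yolo_output_shape(S, B, C):
    """Shape of the YOLO prediction tensor: S x S x (B*5 + C)."""
    return (S, S, B * 5 + C)

print(yolo_output_shape(S=7, B=2, C=20))    # original YOLO: (7, 7, 30)
print(7 * 7 * 2)                             # up to 98 predicted boxes per image
print(13 * 13 * 5)                           # YOLOv2 grid: 845 anchor-based boxes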
Chapter 4
Obstacle and Object Detection: A Survey

Contents
4.1 Obstacle Detection
  4.1.1 Environment Representation
  4.1.2 Grid-based Obstacle Detection
4.2 Object Detection
  4.2.1 Recent Developments in Object Detection
  4.2.2 Object Detection in ADAS Domain

The greatest challenge to any thinker is stating the problem in a way that will allow a solution.
Bertrand Russell
In this chapter, we present the state-of-the-art approaches on environment perception for intelligent and autonomous vehicles, focusing on the problems of obstacle and object detection. Specifically, obstacle detection is the problem of identifying obstacles in the environment. A related term, Detection And Tracking Moving Objects (DATMO), is more focused on modeling dynamic generic objects (moving obstacles) in urban environments. Object detection relates to the problems of recognizing and locating class-specific objects in a scene.
4.1 Obstacle Detection
Robotic perception, in the context of autonomous driving, is the process by which an intelligent system translates sensory data into an efficient model of the environment surrounding a vehicle. Environment representation is the basis for perception tasks. In the following, we survey common approaches for environment representation with a focus on grid-based methods, and next, we review studies related to grid-based obstacle detection methods.
4.1.1 Environment Representation
Given sensory data, a perception system needs to process it in order to obtain a consistent and meaningful representation of the environment surrounding the vehicle. Three main types of data representations are commonly used: 1) point cloud; 2) feature-based, and 3) grid-based. Point cloud-based approaches directly use raw sensor data, with minimum preprocessing and the highest level of detail, for environment representation [26]. This approach generates an accurate representation; however, it requires large memory and high computational power. Feature-based methods use locally distinguishable features (e.g., lines [27], surfaces [28], superquadrics [29]) to represent the sensor information. Feature-based approaches are concise and sparse representation models with no direct representation of free and unknown areas. Grid-based methods discretize the space into small grid elements, called cells, where each cell contains information regarding the sensory space it covers. Grid-based solutions are memory-efficient, simple to implement, have no dependency on predefined features, and have the ability to represent free and unknown space, which makes them an efficient technique for sensor data representation in robotic applications. Several approaches have been proposed to model sensory data using grids. Moravec and Elfes [30] presented early work on 2D grid mapping. Hebert et al. [31] proposed a 2.5D grid model (called elevation maps) that stores in each cell the estimated height of objects above the ground level. Pfaff and Burgard [32] proposed an extended elevation map to deal with vertical and overhanging objects. Triebel et al. [33] proposed a Multi-Level Surface (MLS) map that considers multiple levels for each 2D cell. These methods, however, do not represent the environment in a fully volumetric (3D) way. Roth-Tabak and Jain [34] and Moravec [35] proposed a 3D occupancy grid composed of equally-sized cubic volumes (called voxels). However, it requires large amounts of memory, since voxels are defined for the whole space even if there are only a few measured points in the environment. Specifically, LIDAR-based 3D occupancy grids can represent free and unknown areas at the cost of the higher computational expense of the ray casting algorithms used to update the grid cells. 3D grid maps can be built faster by discarding ray casting and considering only the end-points of the beams [36]; however, by ignoring ray casting, information about free and unknown space is lost.
Table 4.1: Comparison of some of the major grid-based environment representations. L, M, F, and D are abbreviations for Level of detail, Memory, Free and unknown space representation ability, and the Dimension of representation, respectively.

Representation        L     M     F   D
2D Occ. Grid [30]     +     +     +   2
Elevation Grid [31]   ++    +     -   2.5
Extended Elev. [32]   ++    +     -   2.5
MLS Grid [33]         ++    ++    -   2.5
3D Occ. Grid [34]     +++   +++   +   3
Voxel Grid [37]       +++   ++    -   3
Octree [39]           +++   ++    +   3
A related approach is proposed by Ryde and Hu [37], in which they store a list of occupied voxels over each cell of a 2D grid map. Douillard et al. [38] used a combination of a coarse elevation map for background representation and a fine resolution voxel map for object representation. To reduce memory usage of fully 3D maps, Meagher [39] proposed octrees for 3D mapping. An octree is a hierarchical data structure for spatial subdivision in 3D. OctoMap [40] is a mature version of octree-based 3D mapping. However, the tree structure of octrees causes a more complex data access in comparison with a traditional 3D grid. In another attempt, Dryanovski et al. [41] proposed the multi-volume occupancy grid, where observations are grouped into continuous vertical volumes (height volumes) for each map cell. Table 4.1 provides an overview of grid-based environment models.
4.1.2 Grid-based Obstacle Detection
Obstacle detection, which is usually built on top of a grid-based representation, is one of the main components of perception in intelligent and autonomous vehicles [42]. In recent years, with the increasing availability of 3D sensors such as stereo cameras and 3D-LIDARs, most obstacle detection techniques have been revisited to take advantage of 3D sensing technologies [43]. In particular, the perception of a 3D dynamic environment surrounding a moving ego-vehicle requires an ego-motion estimation mechanism (in addition to a 3D sensor). A perception system with the ability to detect stationary and moving obstacles in dynamic 3D urban scenarios has a direct application in safety systems such as collision warning, adaptive cruise control, vulnerable road user detection and collision mitigation braking. Obstacle detection systems can be extended to include higher-level perception functionalities, including Detection And Tracking Moving Objects (DATMO) [44], and object detection, recognition and behavior analysis [45]. Obstacle detection algorithms need some assumptions about the ground surface in order to discriminate between the ground and the obstacles [46].
Ground Surface Estimation

Incoming data from a 3D sensor first need to be processed for ground surface estimation and subsequently for obstacle detection. Ground surface estimation and obstacle detection have a strong degree of dependency, because the obstacles (e.g., trees, walls, poles, fireplugs, vehicles, pedestrians, and cyclists) are all located on the surface that represents the roadway and the roadside. Many methods assume that the ground is flat and that everything standing up from the ground is an obstacle [47], [48], [49]. However, this simple assumption is violated in many practical scenarios. In [28] the ground surface is detected by fitting a plane, using RANSAC, to the point cloud from the current time instance. This method only works well when the ground is planar. Non-planar grounds, such as undulated roads, curved uphill and downhill ground surfaces, sloped terrains, or situations with large roll and pitch angles of the road, remain unsolved. The 'V-disparity' approach [50] is widely used to detect the road surface from the disparity map of stereo cameras. However, disparity is not a natural way to represent 3D Euclidean data, and it can be sensitive to roll angle changes. A comparison between 'V-disparity' and Euclidean space approaches is given in [51]. In [52] a combination of RANSAC [53], region growing and Least Squares (LS) fitting is used for the computation of a quadratic road surface. Though effective, it is limited to the specific cases of planar or quadratic surfaces. Petrovskaya [54] proposed an approach that determines ground readings by comparing angles between consecutive readings from Velodyne LIDAR scans. Assuming A, B, and C are three consecutive readings, the slope between AB and BC should be near zero if all three points lie on the ground. A similar method was independently developed in [55]. In [56] the ground points are detected by comparing adjacent beams, as the difference between adjacent beams is lower at objects and higher at the ground. Mertz et al. [57] build an Elevation grid by subtracting the standard deviation from the average height of the points within each cell; the cells with an Elevation value lower than a certain threshold are considered ground cells. In [58], all objects of interest are assumed to reside on a common ground plane. The bounding boxes of objects, from the object detection module, are combined with stereo depth measurements for the estimation of the ground plane model.

Generic Object Tracking

Generic (or model-free) object tracking is an essential component of the obstacle detection pipeline. Using tracking, an ego-vehicle can predict the locations and behaviors of its surrounding objects and, based on that, make proper decisions and plan its next actions. This section gives a brief overview of object tracking algorithms using multimodal perception systems of autonomous vehicles. Object tracking algorithms can be divided into tracking-by-detection and model-free categories, as detailed in the sequel.
• Tracking by Detection. Discriminative object trackers localize the object using a pre-trained (supervised) detector (e.g., DPM [66]) that learns a decision boundary between the appearance of the target object and other objects and obstacles, and then link up the detected positions over time. Many approaches [67, 68, 69, 70, 71, 72] have been proposed for discriminative object tracking based on monocular cameras, with most of them focused on the data association problem. An overview of such approaches is given in the 'KITTI Object Tracking Evaluation Benchmark' (http://www.cvlibs.net/datasets/kitti/eval_tracking.php), MOT15 [73] and MOT16 [74]. However, the requirement that all object categories be previously known and trained limits the application of discriminative approaches.

• Model-Free. To have a reliable perception system for autonomous cars in real-world driving scenarios, a generic object tracker [63, 75, 76, 77] is also required. A generic tracker should be able to track all kinds of objects, even if their existence is not previously predicted or trained. It is generally assumed that the initial position of the object is given (e.g., using a motion detection mechanism). Generative methods build a model to describe the appearance of the object (in the initial frame) and then look for its next occurrence by searching for the region most similar to the model. To handle the appearance variations of the target object, the object model is often updated online. The simplest representation of a target object considers the centroid of the object points, the so-called point model. The point model is feasible even with a small number of object points. However, richer appearance modeling can be exploited to capture objects' physical properties (see Fig. 4.1). Generic object tracking, integrated in the obstacle detection system, is further discussed in the following.

Figure 4.1: Some approaches for the appearance modeling of a target object. (a) a scan of a vehicle which is split up by an occlusion, seen from a top view [59]; (b) the centroid (point model) representation; (c) 2D rectangular [60] or 2.5D box [56] shape-based representations; (d) 2.5D grid, 3D voxel grid [48, 61] or octree data structure-based representation [49, 62]; (e) object delimiter-based representation [63], and (f) 3D reconstruction of the shape of the target object [64, 65].
Obstacle Detection and DATMO

This section briefly reviews grid-based obstacle detection in dynamic environments. Some approaches [48, 60, 61] detect and track generic dynamic objects based on their motion. This group of methods is the most widely used and is closely related to the Detection and Tracking of Moving Objects (DATMO) approaches [44]. The Bayesian Occupancy Filter (BOF) is a well-known grid-based DATMO approach. In the BOF, Bayesian filtering is adapted to occupancy grids to infer the dynamic cells, followed by segmentation and tracking (using the Fast Clustering and Tracking Algorithm (FCTA) [78]) to provide an object-level representation of the scene [79]. Motion detection can be achieved by detecting changes that occur between two or three consecutive observations (which can be interpreted as 'frame differencing') [60]. Detection of motion can also be achieved by building a consistent static model of the scene, called the background model, and then finding deviations from the model in each incoming frame [49]. This process is referred to as 'background modeling and subtraction'. The background model is usually a short-term map of the surrounding environment of the ego-vehicle. Generally, the static background model is built by combining the ego-vehicle localization data and a representation of the 3D sensor inputs such as: PCD [60], 2.5D Elevation grid [63], Stixels (sets of thin and vertically oriented rectangles) [47, 80], voxel grid [48, 61] or octree data structure-based representation [49, 62]. Ego-motion estimation is usually achieved using Visual Odometry [48], INS [63], variants of the ICP scan matching algorithm [61], or a combination of them [49]. In the approach of Broggi et al. [48], voxels are used to represent the 3D space, and the ego-motion is estimated using Visual Odometry. A color-space segmentation is performed on the voxels, and voxels with similar features are grouped to form obstacles. The ego-motion is used to distinguish between stationary and moving obstacles. Finally, the geometric center of each obstacle is computed, and a KF is applied to estimate its velocity and position. Azim and Aycard [49] proposed an approach based on the inconsistencies between the observation and local grid maps represented by an OctoMap (a 3D occupancy grid with an octree structure) [40]. Obstacles are segmented using DBSCAN, followed by a KF and Global Nearest Neighbor (GNN) data association for tracking. Next, an AdaBoost classifier is used for object classification. Moosmann and Stiller [65] used a local convexity-based segmentation method for object hypothesis detection. A combination of KF and ICP is used for tracking generic moving objects, and a classification method for managing tracks. Their method includes the 3D reconstruction of the shape of the moving objects. Hosseinyalamdary et al. [59] used a prior Geospatial Information System (GIS) map to reject outliers. They tracked moving objects in a scene using a KF with a Constant Velocity process model (CV-KF) and used ICP for pose estimation. Dewan et al. [81] detect motions between consecutive scans using RANSAC and use a Bayesian approach to segment and track multiple objects in 3D-LIDAR data.
Table 4.2: Some of the recent obstacle detection and tracking methods for autonomous driving applications. Cam., SV, mL, 2L, and 3L are abbreviations for Camera, Stereo Vision, multilayer LIDAR, 2D-LIDAR, and 3D-LIDAR, respectively. Col., Vel., Mot., and Spat. are abbreviations for Color, Velocity, Motion, and Spatial, respectively.

Ref.   3D Sens.   Ego-motion Estimation   Motion Det. & Segmentation   Object Represent.   Obj. Search Mechanism   Obj. Model Update
[79]   SV, 2L     Odometry                FCTA                         2D Occ. Vel.        Bayesian                –
[63]   SV         GNSS, INS               Obj. Delimit.                DEM                 PF                      KF
[61]   mL         ICP                     Motion                       Voxel               EKF                     –
[59]   3L, GIS    INS                     –                            PCD                 CV-KF                   –
[75]   SV         V-Odometry              Multi Scale                  Voxel               KF, MHT                 Weighted ICP
[65]   3L         –                       Local Convexity              PCD                 CV-KF                   ICP, Accum.
[47]   SV         V-Odometry              Mot. Spat. Shape             Stixel              6D-KF                   –
[64]   3L, Cam.   INS                     –                            Col. PCD            CV-KF                   ICP, Accum.
[81]   3L         DGPS/IMU                Motion                       PCD                 Bayesian                –
[48]   SV         V-Odometry              Color Space                  Voxel               KF                      –
[49]   3L         INS, ICP                DBSCAN                       Octree              KF, GNN                 –
[60]   3L         INS                     Motion                       2D Rect.            PF                      CV-KF
The majority of these approaches have only been developed for the detection and tracking of generic moving objects. However, in real-world applications, static obstacles should also be taken into account. Segmentation-based approaches have been proposed to partition the PCD into perceptually meaningful regions that can be used for obstacle detection. Ošep et al. [75] used the PCD generated from a disparity map (obtained from a stereo camera pair) to find and track generic objects. They suggested a two-stage segmentation approach for multiscale object proposal generation, followed by Multi Hypothesis Tracking (MHT) at the level of the object proposals. Vatavu et al. [63] built a Digital Elevation Map (DEM) from the PCD obtained from a stereo vision system. They segmented obstacles by extracting free-form object delimiters. The object delimiters are represented by their positions and geometries, and then tracked using Particle Filters (PFs). KFs are used for adapting the object delimiter models. Pfeiffer and Franke [47] used a stereo vision system for acquiring the 3D PCD and Visual Odometry for ego-motion estimation. They used Stixels for the environment representation. Dynamic Stixels are segmented based on motion, spatial and shape constraints, and are tracked by a so-called 6D-vision KF [82], which is a framework for the simultaneous estimation of 3D position and 3D motion. In another approach, focused on the problem of generic object tracking, Held et al. [64] combined a PCD with a 2D camera image to construct an up-sampled colored PCD. They used a color-augmented search algorithm to align the colored PCDs from successive time frames. Assuming a known initial position of the object, they utilized 3D shape, color data and motion cues in a probabilistic framework to perform joint 3D reconstruction and tracking. They showed that the accumulated dense model of the object leads to a better object velocity estimate. A summary of the most representative obstacle detection and tracking approaches is provided in Table 4.2.
4.2 Object Detection
This section gives an overview of object detection, the related fusion methods and recent advancements applied in ADAS and ITS domains.
4.2.1 Recent Developments in Object Detection
The state-of-the-art in object detection is primarily concentrated on processing color images. Object detection can be divided into pre- and post-deep-learning approaches.

Non-ConvNet Approaches

Before the recent advancement of Deep Learning (specifically ConvNets), which revolutionized object classification and, consequently, the object detection field, the literature was mainly focused on hand-crafted features and traditional classification techniques (e.g., SVM, AdaBoost, Random Forest). Some of the major contributions in the object detection field before the Deep Learning era are listed below:

• Cascade of weak classifiers. Viola and Jones [83] proposed one of the early works on object detection. They used Haar features and performed object detection by applying AdaBoost training and cascade classifiers based on the sliding-window principle.

• Histogram of Oriented Gradients (HOG). Dalal and Triggs [84] introduced efficient HOG features based on edge directions in the image. They performed linear SVM classification on sub-windows extracted from the image using a sliding-window mechanism.

• Deformable Parts Model (DPM). Proposed by Felzenszwalb et al. [85], DPM is a graphical model designed to cope with object deformations in the image. DPM assumes that an object is constructed by its parts. It uses HOG and a linear SVM, again with a sliding-window mechanism.

• Selective Search (SS). Uijlings et al. [86] proposed SS to generate a set of data-driven, class-independent object proposals and avoid the conventional sliding-window exhaustive search. SS works based on hierarchical segmentation using a diverse set of cues. They used SS to create a Bag-of-Words-based localization and recognition system.

ConvNet-based Approaches

The remarkable success of ConvNets as powerful feature extractors for image classification/recognition has made a huge impact on the object detection field, as demonstrated
by LeCun et al. [87] and, more recently, by Krizhevsky et al. [88]. Currently, the best-performing object detectors use ConvNets and are summarized below.
• Sliding-window ConvNet. Following the traditional object detection paradigm, ConvNets were initially employed using the sliding-window mechanism (but in a more efficient way), as in the OverFeat framework proposed by Sermanet et al. [89].

• Region-based ConvNets. In R-CNN [90], SS [86] is used for object proposal generation, a ConvNet pre-trained on ImageNet (fine-tuned on the PASCAL VOC dataset) for feature extraction, and a linear SVM for object classification and detection. Instead of performing ConvNet-based classification for thousands of SS-generated object proposals, which is slow, Fast R-CNN [91] builds on Spatial Pyramid Pooling networks (SPPnets) [92] to pass the image through the convolutional layers only once, followed by end-to-end training. In Faster R-CNN [93, 94], a Region Proposal Network (RPN), a type of Fully-Convolutional Network (FCN) [95], is introduced for region proposal generation. It increases the run-time efficiency and accuracy of the object detection system.

• Single-Shot Object Detectors. YOLO (You Only Look Once) [24, 25] and SSD (Single Shot Detector) [96] model object detection as a regression problem and try to eliminate the object proposal generation step. These approaches are based on a single ConvNet followed by a non-maximum suppression step. In these methods, the input image is divided into a grid (a 7 × 7 grid for YOLO and 9 × 9 for SSD), where each grid cell is responsible for predicting a pre-determined number of object BBs. In the SSD approach, hard negative mining is performed, and the samples with the highest confidence loss are selected. Two main disadvantages of this class of methods are: i) they impose hard constraints on the bounding box prediction (e.g., in YOLO each grid cell can predict only two BBs), and ii) the detection of small objects can be very challenging. The SSD approach tries to alleviate the second problem with the help of additional data augmentation for smaller objects.
4.2.2 Object Detection in ADAS Domain
Object detection is a crucial component of sensor-based perception systems for advanced driver assistance systems (ADAS) and for autonomous driving. This section gives an overview of object detection and the related fusion methods in IV and ITS domains.
Vision-based Object Detection

Despite remarkable advancements in object detection, designing an object detection system for real-world driving applications is still a very challenging problem. Yebes et al. [97] modified DPM [85] to incorporate 3D-aware HOG-based features extracted from color images and disparity maps. The disparity maps are computed from each pair of left-right images of stereo cameras employing the Semi-Global Matching (SGM) [98] method. The DPM object detector is trained on the 3D-aware features. Xiang et al. [99] introduced a ConvNet-based region proposal network that uses subcategory information to guide the proposal generation process. In their approach, Fast R-CNN [91] is modified by injecting subcategory information (using 3D Voxel Patterns [100] as subcategories) into the network for joint detection and subcategory classification. Cai et al. [101] proposed a multi-scale object detection based on the concept of rescaling the image multiple times so that the classifier can match all possible object sizes. Their approach consists of two ConvNet-based sub-networks, a proposal sub-network and a detection sub-network, learned end-to-end. Chabot et al. [102] introduced Deep MANTA, a framework for 2D and 3D vehicle detection in monocular images. In their method, inspired by the Region Proposal Network (RPN) [93], vehicle proposals are computed and then refined to detect vehicles. They optimized the ConvNet for six tasks: region proposal, detection, 2D box regression, part localization, part visibility and 3D template prediction. Chen et al. [103] generate 3D proposals by assuming a prior on the ground plane (using calibration data). Proposals are initially scored based on contextual and segmentation features, followed by rescoring using a version of Fast R-CNN [91] for 3D object detection. The approach of Yang et al. [104] is based on the rejection of negative object proposals using convolutional features and cascaded classifiers.

3D-LIDAR-based Object Detection

Recently, 3D-LIDARs have started to be used for high-level perception tasks such as object detection. This subsection gives a concise overview of vehicle detection approaches using 3D-LIDARs. Behley et al. [105] propose a segmentation-based object detection using LIDAR range data. A hierarchical segmentation is used to reduce over- and under-segmentation effects. A mixture of multiple bag-of-words (mBoW) classifiers is applied to classify the extracted segments. Finally, a non-maximum suppression is used, considering the hierarchy of segments. In the Wang and Posner [106] approach, LIDAR points, together with their reflectance values, are discretized into a coarse 3D voxel grid. A 3D sliding-window detection approach is used to generate the feature grid. At each window location, the feature vectors contained within its bounds are stacked up into a single long vector and passed to a classifier. A linear SVM classifier scores each window location and returns a detection score.
Table 4.3: Related work on 3D-LIDAR-based object detection.

Ref.    Modality                      Representation      Detection Technique
[105]   Range                         PCD                 Hierarchical Seg. + BoW
[108]   Range                         PCD                 3D-FCN
[107]   Range                         Top view            2D-FCN
[106]   Range + Reflectance           Voxel               Sliding-window + SVM
[109]   Range + Reflectance           Voxel               Feat. Learning + Conv. Layers + RPN
[110]   Color + Range                 Front view          Sliding-window + DPM/SVM
[111]   Color + Range                 Front view          Sliding-window + RF
[112]   Color + Range                 Front view          Seg.-based Proposals + ConvNet
[113]   Color + Range + Reflectance   Front + Top views   Top view 3D Proposals + ConvNet
Li et al. [107] used a 2D Fully-Convolutional Network (FCN) on a 2D point map (a top-view projection of the 3D-LIDAR range data) and trained it end-to-end to build a vehicle detection system based only on 3D-LIDAR range data. Li [108] extended it to a 3D Fully-Convolutional Network (FCN) to detect and localize objects as 3D boxes from LIDAR point cloud data. In a similar approach, Zhou and Tuzel [109] proposed a method that generates 3D detections directly from the PCD. Their method divides the space into voxels, encodes the points within each voxel into a feature vector, and then applies an RPN to provide 3D detections.

3D-LIDAR and Camera Fusion

Although there is a rich literature on multisensor data fusion, as recently surveyed by Durrant-Whyte and Henderson [114], only a small number of works address multimodal and multisensor data fusion in object detection. Fusion-based object detection approaches can be divided based on the abstraction level where the fusion takes place, namely: i) low-level (early) fusion, which combines sensor data to create a new set of data; ii) mid-level fusion, which integrates features; iii) high-level (late or decision-level) fusion, which combines the classified outputs; and iv) multi-level fusion, which integrates different levels of data abstraction (also see Section 2.1.3). This subsection surveys the state-of-the-art fusion techniques using vision and 3D-LIDAR in the multimodal object detection context. Premebida et al. [110] combine Velodyne LIDAR and color data for pedestrian detection. A dense depth map is computed by up-sampling the LIDAR points. Two DPMs are trained, one on depth maps and one on color images. The DPM detections on depth maps and color images are fused using a late re-scoring strategy (by applying an SVM to features such as the BBs' sizes, positions, scores and so forth) to achieve the best performance. Gonzalez et al. [111] use color images and 3D-LIDAR-based depth maps as inputs, and extract HOG and Local Binary Pattern (LBP) features. They split the training set samples into different views to take into account different poses of the objects (frontal, lateral, etc.) and train a separate random forest of local experts for each view.
Table 4.4: Some recent related work on 3D-LIDAR and camera fusion. When more than one fusion strategy is experimented with in a method, the best-performing solution for that method is marked with an asterisk (*).

Reference                Fusion Level(s)       Technique
Premebida et al. [110]   Late                  SVM Re-scoring
Gonzalez et al. [111]    Mid*, Late            Ensemble Voting
Schlosser et al. [115]   Early, Mid, Late*     ConvNet
Chen et al. [113]        Mid                   ConvNet
Oh and Kang [112]        Late                  ConvNet/SVM
They investigated feature-level and late fusion approaches. They combine the color and depth modalities at the feature level by concatenating HOG and LBP descriptors. They train individual detectors on each modality and use an ensemble of detectors for the late fusion of the different view detections. They achieved a better performance with the feature-level fusion scheme. Schlosser et al. [115] explore the ConvNet-based fusion of 3D-LIDAR and color image data at different levels of representation for pedestrian detection. They compute HHA (horizontal disparity, height, angle) data channels [116] from the LIDAR data. They show that the late fusion of HHA features and color images achieves better results. Chen et al. [113] proposed a multi-view object detection approach using deep learning. They used 3D-LIDAR top and front views and color image data as inputs. The top-view LIDAR data is used to generate 3D object proposals. The 3D proposals are projected into the three views for obtaining region-wise features. A region-based feature fusion scheme is used for the classification and orientation estimation. This approach enables interactions of different intermediate layers from different views. Oh and Kang [112] use segmentation-based methods for object proposal generation from the LIDAR's point cloud data and a color image. They use two independent ConvNet-based classifiers to classify object candidates in the color image and the LIDAR-based depth map, and combine the classification outputs at the decision level using convolutional feature maps, category probabilities, and SVMs. Table 4.3 provides a review of object detection approaches that incorporate 3D-LIDAR data, in terms of detection techniques. Table 4.4 provides an overview of the architecture of fusion approaches that use 3D-LIDAR and camera data.

In this chapter, the state-of-the-art approaches to the problems of obstacle and object detection were surveyed. The objective of this dissertation is to push forward the state-of-the-art in the multisensor, multimodal obstacle and object detection domain, which constitutes the core of the perception system for autonomous driving.
Part II
METHODS AND RESULTS
Chapter 5
Obstacle Detection

Contents
5.1 Static and Moving Obstacle Detection
    5.1.1 Static and Moving Obstacle Detection Overview
    5.1.2 Piecewise Ground Surface Estimation
    5.1.3 Stationary – Moving Obstacle Detection
5.2 Extension of Motion Grids to DATMO
    5.2.1 2.5D Grid-based DATMO Overview
    5.2.2 From Motion Grids to DATMO
5.3 Fusion at Tracking-Level
    5.3.1 Fusion Tracking Overview
    5.3.2 3D Object Localization in PCD
    5.3.3 2D Object Localization in Image
    5.3.4 KF-based 2D/3D Fusion and Tracking
In this chapter, we present the proposed obstacle detection approach to continuously estimate the ground surface and segment stationary and moving obstacles, followed by an extension of the proposed obstacle segmentation approach to DATMO, and the fusion of the 3D-LIDAR's PCD with color camera data for the object tracking function of DATMO. Parts of this chapter have been published in one journal article and four book chapter, conference, and workshop papers: the Journal of Robotics and Autonomous Systems [117], the Second Iberian Robotics Conference [118], the IEEE Intelligent Transportation Systems Conferences [119, 120], and the Workshop on Planning, Perception and Navigation for Intelligent Vehicles [121].
5.1 Static and Moving Obstacle Detection
In this section, considering data from a 3D-LIDAR and a GPS-aided INS mounted onboard an instrumented vehicle, a 4D approach (utilizing both 3D spatial and time data) is proposed for ground surface modeling and obstacle detection in dynamic urban environments. The system is composed of two main modules: 1) a ground surface estimation based on piecewise plane fitting, and 2) a voxel grid model for static and moving obstacle detection and segmentation.
5.1.1 Static and Moving Obstacle Detection Overview
Fig. 5.1 presents the architecture of the proposed method, which comprises two phases: 1) ground surface estimation: a temporal sequence of 3D-LIDAR data and GPS-aided INS positioning data is integrated to form a dense model of the scene; a piecewise surface estimation algorithm, based on a 'multi-region' strategy and on the behavior of the Velodyne LIDAR scans, is applied to fit a finite set of planes (covering the road and its vicinity) using the RANSAC method; and 2) static and moving obstacle segmentation: the estimated ground model is used to separate the ground from the non-ground 3D-LIDAR points (which represent obstacles standing on the ground). A voxel representation is employed to quantize the 3D-LIDAR data for efficient further processing. The proposed approach detects moving obstacles using a discriminative analysis and ego-motion information, by integrating and processing information from previous measurements.
5.1.2 Piecewise Ground Surface Estimation
This section starts by presenting the process of dense PCD generation, which will be used for the ground surface estimation.

Dense Point Cloud Generation

The dense PCD construction begins by transforming the PCDs from the ego-vehicle coordinates to the world coordinate system using the INS positioning data. This transformation is further refined by PCD down-sampling using a Box Grid Filter (BGF; the MATLAB pcdownsample function is used in our implementation), followed by PCD alignment using the ICP algorithm [122]. This process, detailed in the following, is summarized in Algorithm 2.
Figure 5.1: Architecture of the proposed obstacle detection system.

Algorithm 2 Dense Point Cloud Generation.
1: Inputs: PCDs: P and Ego-vehicle Poses: T
2: Output: Dense PCD: D
3: for scan k = i − m to i do
4:    Ťk ← ICP(BGF(GI(Pi, Ti)), BGF(GI(Pk, Tk)))    ▷ updated transformation
5:    D ← Merge(Pk, Ti, Ťk)
6: end for
Let Pi denote a 3D PCD at the current time i, and let P = {Pi−m, · · · , Pi−1, Pi} be the set composed of the current and the m previous PCDs. Using a similar notation, let T = {Ti−m, · · · , Ti−1, Ti} be the set of vehicle pose parameters, each a 6 DOF pose in Euclidean space, given by a high-precision GPS-aided INS positioning system. The transformation Tk = (Rk | tk) consists of a 3 × 3 rotation matrix Rk and a 3 × 1 translation vector tk, where k ranges from i − m to i. The function GI(Pk, Tk) denotes the transformation of a PCD Pk from the ego-vehicle to the world coordinate system using Rk × Pk + tk. A Box Grid Filter is used for down-sampling the PCDs. The BGF partitions the space into voxels and averages the (x, y, z) values of the points within each voxel. This step makes the PCD registration faster, while keeping the result accurate. ICP is applied to minimize the difference between every PCD and the considered reference PCD. The down-sampled version of the current PCD Pi is used as the reference ('the fixed PCD'), and the 3D rigid transformations for aligning the other down-sampled PCDs Pk ('moving PCDs') with the fixed PCD are then estimated. Denoting by Ťk the updated GPS-aided INS transformation (i.e., after applying ICP), the so-called dense PCD Di is obtained using the 'Merge' function, by transforming the PCDs P into the current coordinate system of the ego-vehicle using the parameters of Ťk = (Řk | ťk) and Ti = (Ri | ti):

Di = ⋃_{k=i−m}^{i} Ri⁻¹ × ((Řk × Pk + ťk) − ti)        (5.1)

where ⋃ denotes the union operation.
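As an illustration of the 'Merge' step of Eq. (5.1), the sketch below (a simplified Python/NumPy rendering; the BGF down-sampling and ICP refinement are assumed to have been applied beforehand, and all names are ours) transforms each scan with its refined pose and expresses the union of the scans in the current ego-vehicle frame.

import numpy as np

def merge_point_clouds(pcds, poses_refined, pose_current):
    """Merge of Eq. (5.1): map each scan P_k to the world frame with its
    refined pose (R_k_hat, t_k_hat), bring it into the current ego frame
    (R_i, t_i), and stack the union of all scans.

    pcds          : list of (N_k x 3) arrays, scans P_{i-m} ... P_i
    poses_refined : list of (R_k_hat, t_k_hat) tuples (3x3 rotation, 3-vector)
    pose_current  : (R_i, t_i) of the current scan
    """
    R_i, t_i = pose_current
    merged = []
    for P_k, (R_k, t_k) in zip(pcds, poses_refined):
        world = P_k @ R_k.T + t_k        # ego_k frame -> world frame
        local = (world - t_i) @ R_i      # world -> current ego frame (R_i^-1 = R_i^T)
        merged.append(local)
    return np.vstack(merged)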
The integrated PCD D is cropped to the local grid: D ← Crop(D). Note that the subscript i has been omitted to simplify the notation. An example of a dense PCD generated using the Ťk transformations is shown in Fig. 5.2.

Figure 5.2: The generated dense PCD of a traffic pole before and after applying the Box Grid Filter and the ICP algorithm. The red rectangle in the top image shows the traffic pole. Bottom left shows the dense PCD generated using only the GPS-aided INS localization data. Bottom right shows the result obtained after applying the BGF and ICP steps to further align consecutive PCDs and to reduce the localization error. The sparse points on the right side of the pole (bottom right image) correspond to the chain that exists between poles. Different colors encode distinct LIDAR scans. The dense PCDs were rotated with respect to their original position in the image above to better evidence the difference.

Piecewise Plane Fitting

A piecewise plane fitting algorithm is then applied to D in order to estimate the ground geometry. Existing methods in the literature are mainly developed to estimate specific types of ground surface (e.g., planar or quadratic surfaces). In comparison to previous methods, we contribute a piecewise plane fitting that is able to estimate an arbitrary ground surface (e.g., a ground with a curved profile). The proposed algorithm is composed of four steps: 1) Slicing, 2) Gating, 3) Plane Fitting, and 4) Validation. First, a finite set of regions on the ground is generated in accordance with the vehicle orientation.
These regions (hereafter called 'slices') are variable in size and follow the geometrical model that governs the Velodyne LIDAR scans. Second, a gating strategy is applied to the points in each slice using an interquartile range method to reject outliers. Then, a RANSAC algorithm is used to robustly fit a plane to the inlier set of 3D data points in each slice. At last, every plane's parameters are checked for acceptance based on a validation process that starts from the closest plane and proceeds to the farthest plane.

Figure 5.3: Illustration of the variable-size ground slicing for η = 2. Velodyne LIDAR scans are shown as dashed green lines.

• Slicing. This process starts from the initial region, defined by the slice S0, centered at the vehicle coordinates {V} with a radius of λ0 = 5 m, as illustrated in Fig. 5.3. This is the closest region to the host vehicle, with the densest set of points and the smallest localization errors. It is reasonable to assume that the plane fitted to the points belonging to this region is estimated with more confidence and provides the best fit among all the remaining slices; hence, this plane can be considered as the 'reference plane' for the validation task. The remaining regions (in the area between λ0 and λN), having increasing sizes, are obtained by a strategy that takes into account the LIDAR-scan behavior, assumed to follow a tangent function law. Each slice (or region) begins at the far edge of the previous slice in the vehicle movement direction. According to the model illustrated in Fig. 5.3, the edge of the slice Sk is given by the following tangent function:

λk = h · tan(α0 + k · η · ∆α),    {k : 1, ..., N}        (5.2)
where α0 = arctan(λ0/h) and h is the elevation of the Velodyne LIDAR above the ground (h ≈ 1.73 m, provided in the dataset, see Fig. 3.2); N is the total number of slices, given by N = ⌊(αN − α0)/(η · ∆α)⌋; ∆α is the angle between scans in the elevation direction (∆α ≈ 0.4°), and ⌊·⌋ denotes the floor function. Here, η is a constant that determines the number of ∆α intervals used to compute the slice sizes (it is related to the number of LIDAR readings in each slice). For η = 2, as represented in Fig. 5.3, at least two ground readings of a (single) Velodyne scan fall into each slice, which is enough for fitting a plane.
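The slice edges of Eq. (5.2) can be computed in a few lines. The sketch below (Python/NumPy; the farthest elevation angle αN is an assumed value used only to bound the example, since in practice it is tied to the local grid extent) reproduces the tangent-law spacing.

import numpy as np

def slice_edges(h=1.73, lambda0=5.0, delta_alpha_deg=0.4, eta=2, alpha_N_deg=88.0):
    """Compute the slice edges lambda_k of Eq. (5.2).

    h is the LIDAR height above the ground, lambda0 the radius of the first
    slice S0, delta_alpha the angular step between scan rings, and eta the
    number of delta_alpha intervals per slice.
    """
    alpha0 = np.arctan(lambda0 / h)
    d_alpha = np.radians(delta_alpha_deg)
    alpha_N = np.radians(alpha_N_deg)
    N = int((alpha_N - alpha0) / (eta * d_alpha))     # floor, as in Eq. (5.2)
    k = np.arange(1, N + 1)
    return h * np.tan(alpha0 + k * eta * d_alpha)     # lambda_k, k = 1..N

edges = slice_edges()
print(edges[:5])   # first few slice edges, in meters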
Figure 5.4: An example of the application of the gating strategy on a dense PCD (bottom shows the lateral view). The black edged boxes indicate the gates. Inlier points in different gates are shown in different colors, and red points show outliers.
For example, a data point p = [x, y, z], p ∈ D, with λk−1 < x < λk falls into the k-th slice: Sk ← Slice(D). In order to simplify the notation, we use Sk to denote both the slice and the points in that slice, and clarify whenever required.

• Gating. A gating strategy using the interquartile range (IQR) method is applied to the points in Sk to detect and reject outliers that may occur among the LIDAR measurement points. First, we compute the median of the height data, which divides the samples into two halves. The lower quartile value Q25% is the median of the lower half of the data, and the upper quartile value Q75% is the median of the upper half of the data. The range between the quartile values is called the interquartile range: IQR = Q75% − Q25%. The lower and upper gate limits were learned empirically and chosen as Qmin = Q25% − 0.5 · IQR and Qmax = Q75%, respectively (which are stricter ranges than the standard IQR rules of Q25% − 1.5 · IQR and Q75% + 1.5 · IQR). The 'Gate' function (see Algorithm 3), denoted by Gate(·), is applied to the points in Sk and outputs the gated Sk (e.g., a data point p = [x, y, z] with Qmin < z < Qmax is considered an inlier and is included in the output of the 'Gate' function).
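A minimal rendering of the 'Gate' function is sketched below (Python/NumPy; np.percentile is used here as an approximation of the half-median quartiles, and the function name is ours).

import numpy as np

def gate_heights(points, low_factor=0.5):
    """IQR gate applied to the z (height) values of the points in a slice S_k.

    Points with Q25 - low_factor*IQR < z < Q75 are kept as inliers, matching
    the stricter limits used in the text (Qmin = Q25 - 0.5*IQR, Qmax = Q75).
    """
    z = points[:, 2]
    q25, q75 = np.percentile(z, [25, 75])
    iqr = q75 - q25
    q_min, q_max = q25 - low_factor * iqr, q75
    mask = (z > q_min) & (z < q_max)
    return points[mask]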
Figure 5.5: The piecewise RANSAC plane fitting process. Dashed orange lines show the lower and upper gate limits. Dashed black rectangles show the gate computed for the outlier rejection task. Solid green lines show the estimated plane, using RANSAC, in a lateral view. The dashed green line shows the continuation of Sk's fitted plane in slice Sk+1. The distance (δZk+1) and angle (δψk+1) between two consecutive planes are shown in red. Dashed magenta lines show the threshold that is used for the ground–obstacle segmentation task; points under the dashed magenta lines are considered ground points. The original PCD is represented using filled gray circles.

• RANSAC Plane Fitting. The RANSAC method [53] robustly fits a mathematical model to a dataset containing outliers. Differently from the Least Squares (LS) method, which directly fits a model to the whole dataset (when outliers occur, LS will not be accurate), RANSAC estimates the parameters of a model using different observations from data subsets. In order to perform this stage efficiently, a subsample of the filtered PCD in Sk is selected and a plane is fitted to it using the 3-point RANSAC algorithm. In each iteration, the RANSAC approach randomly selects three points from the dataset. A plane model is fitted to the three points, and a score is computed as the number of inlier points whose distance to the plane model is below a given threshold. The plane having the highest score is chosen as the best fit to the considered data. A given plane, fitted to the road and its vicinity pavement, is denoted as akx + bky + ckz + dk = 0, and stored as Gk ← [ak, bk, ck, dk].

• Validation of Piecewise Planes. Due to the broader area and denser points (with smaller errors in the LIDAR measurements) in the immediate slice S0, the plane computed from this region's points has the best fit among all slices, and hence is considered as the 'reference plane' G0. According to the tangent-based slicing (5.2), the number of the LIDAR's ground readings in the other slices should be almost equal (see Fig. 5.3). The validation process starts from the closest plane G1 and proceeds to the farthest plane GN. For the validation of the piecewise planes, two features are considered:

1. The angle between two consecutive planes Gk and Gk−1, computed as δψk = arctan(|n̂k−1 × n̂k| / (n̂k−1 · n̂k)), where n̂k and n̂k−1 are the unit normal vectors of the planes Gk and Gk−1, respectively.

2. The (elevation) distance between the Gk−1 and Gk planes, computed as δZk = |Zk − Zk−1|, where Zk and Zk−1 are the z values of Gk and Gk−1 at the slice edge (x, y) = (λk, 0). The z value for Gk can be computed by reformulating the plane equation as z = −(ak/ck)x − (bk/ck)y − (dk/ck).

If the angle δψk between the two normals is less than τ° and the distance δZk between the planes is less than ℓ (τ° and ℓ are given thresholds), the current plane is assumed valid. Otherwise, the parameters of the previous plane Gk−1 are propagated to the current plane Gk, and the two planes are considered to be part of the same ground plane: Gk ← Gk−1. This procedure is summarized in Algorithm 3. The output of this algorithm is the ground model defined by the set G = {G1, · · · , GN}. The validation process of the piecewise RANSAC plane fitting is illustrated in Fig. 5.5.

Algorithm 3 Piecewise Ground Surface Estimation.
1: Input: Dense PCD D
2: Output: Ground Model G = {G1, · · · , GN}
3: for slice k = 1 to N do    ▷ plane fitting in each slice
4:    Sk ← Slice(D)
5:    Sk ← Gate(Sk)
6:    Gk ← RANSAC(Sk)
7: end for
8: for slice k = 1 to N do    ▷ the validation process
9:    if ¬((δψk < τ°) ∧ (δZk < ℓ)) then
10:      Gk ← Gk−1
11:   end if
12: end for
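The following sketch (Python/NumPy) illustrates Algorithm 3 under assumptions: the RANSAC iteration count, inlier threshold, and the validation thresholds tau_deg and ell are example values, and the shared slice edge is indexed with a 0-based convention that is ours, not the thesis's.

import numpy as np

def fit_plane_ransac(pts, n_iters=200, dist_thresh=0.05, seed=0):
    """3-point RANSAC plane fit; returns [a, b, c, d] with unit normal (a, b, c)."""
    rng = np.random.default_rng(seed)
    best, best_score = None, -1
    for _ in range(n_iters):
        p1, p2, p3 = pts[rng.choice(len(pts), 3, replace=False)]
        n = np.cross(p2 - p1, p3 - p1)
        norm = np.linalg.norm(n)
        if norm < 1e-9:
            continue                                   # degenerate (collinear) sample
        n = n / norm
        d = -n @ p1
        score = np.sum(np.abs(pts @ n + d) < dist_thresh)
        if score > best_score:
            best, best_score = np.append(n, d), score
    return best

def plane_z(plane, x, y=0.0):
    a, b, c, d = plane
    return -(a * x + b * y + d) / c

def piecewise_ground(slices, edges, tau_deg=5.0, ell=0.3):
    """Fit one plane per slice, then validate consecutive planes by the angle
    delta_psi and the elevation gap delta_Z at the shared slice edge; if the
    test fails, propagate the previous plane (Gk <- Gk-1)."""
    planes = [fit_plane_ransac(s) for s in slices]
    for k in range(1, len(planes)):
        n_prev, n_cur = planes[k - 1][:3], planes[k][:3]
        d_psi = np.degrees(np.arctan2(np.linalg.norm(np.cross(n_prev, n_cur)),
                                      abs(n_prev @ n_cur)))
        d_z = abs(plane_z(planes[k], edges[k - 1]) - plane_z(planes[k - 1], edges[k - 1]))
        if not (d_psi < tau_deg and d_z < ell):
            planes[k] = planes[k - 1]
    return planes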
5.1.3 Stationary – Moving Obstacle Detection
The estimated ground surface is used to separate the ground and the obstacles. A voxel-based representation of the obstacles above the estimated ground is presented. A simple yet efficient method is proposed to discriminate the moving parts from the static map of the environment by aggregating temporal data and reasoning over it using a discriminative analysis.
Ground – Obstacle Segmentation

The multi-region ground model G is used for the ground and obstacle separation task. It is performed based on the distance between the points inside each slice region Sk and the corresponding surface plane Gk. Given an arbitrary point p◦ on the surface plane Gk (e.g., p◦ = [0, 0, −dk/ck]), the distance from a point p ∈ Sk to the plane Gk can be computed by the dot product d = (p − p◦) · n̂k, where n̂k is the unit normal vector of the Gk plane. The points under a certain reference height dmin are considered part of the ground plane and are removed (see Fig. 5.5). The remaining points represent the obstacles' points. This process is applied to the last m previous scans P (after applying the updated transformations Ťk) and to the dense PCD D, segmenting them into P = {PG, PO} and D = {DG, DO}, respectively, where the superscripts G and O denote ground and obstacle points. Urban scenarios, especially those in downtown areas, are complex 3D environments with a great diversity of objects and obstacles. Voxel grids are dense 3D structures with no dependency on predefined features, which allows them to provide a detailed representation of such complex environments. The voxelization of the obstacle points is performed using the process mentioned in Section 2.1.2, which outputs lists of voxels with their corresponding occupancy values. Voxelization is applied to the obstacle point set PO and to the dense PCD DO, which results in P and D, respectively.

Discriminative Stationary – Moving Obstacle Segmentation

The obstacle voxel grids P = {Pi−m, · · · , Pi−1, Pi} and the integrated voxel grid D are used for the stationary and moving obstacle segmentation. The main idea is that a moving object occupies different voxels along time, while a stationary object will be mapped into the same voxels in consecutive scans. Therefore, the occupancy value of the voxels corresponding to the static parts of the environment is greater in D. To materialize this concept, first a rough approximation of the stationary and moving voxels is obtained using a simple subtraction mechanism. Next, the results are further refined using a discriminative analysis based on '2D counters' built in the xy plane. The Log-Likelihood Ratio (LLR) of the 2D counters is computed to determine the binary masks for the stationary and moving voxels. This is described in the following.

• Preprocessing. A subtraction mechanism is used as a preprocessing step. The cells belonging to static obstacles in D capture a larger amount of data points and therefore have a greater occupancy value in comparison with each of the obstacle voxel grids in P (see Fig. 5.6 (a)). On the other hand, since moving obstacles occupy different voxels (in different time instances) in the grid, it may be possible that for those voxels some elements of D and Pk will have the same occupancy values. Having this in mind, D is initialized as the stationary model.
The voxels in D are then compared with the corresponding voxels in each of the obstacle voxel grids Pk ∈ P. Those voxels in D that have the same value as the corresponding voxels of Pk are considered moving voxels and are filtered out. Next, the filtered D is used to remove the stationary voxels from the current obstacle voxel grid Pi. The filtered D and Pi are outputted. To keep the notation simple, we keep the variable names the same as before preprocessing and drop the subscript of Pi.

Figure 5.6: The process used for the generation of the binary masks of the stationary and moving voxels. (a) shows a moving pedestrian and a stationary obstacle. The car on the left represents the ego-vehicle. The black, orange, blue and green points are hypothetical LIDAR hitting points that occur at different time instances. As can be seen, since the stationary obstacle captures multiple scans, it will evidence a higher occupancy value in comparison with the moving obstacle that occupies different locations; (b) and (c) show the Cs counters computed from D before and after preprocessing, respectively; (d) shows the Cd counter computed from P, and (e) shows the output of the log-likelihood ratio of (c) and (d). Ts and Td are the thresholds used for the computation of the binary masks.

• 2D Counters. A voxel can be characterized by a triplet of indexes (i, j, k), which defines the position of the voxel within the voxel grid and corresponds to the
x-, y- and z-axes. We assume that all voxels with the same (i, j) index values have the same state (i.e., each vertical bar located in the xy plane is either stationary or moving). Based on this assumption, two 2D counters (Cs and Cd) are constructed from the occupancy values of the voxels of the D and P voxel grids using a summation operation, as expressed by

Cs(i, j) = Σ_{k=1}^{p(i, j)} D(i, j, k)
Cd(i, j) = Σ_{k=1}^{q(i, j)} P(i, j, k)        (5.3)
where (i, j, k) is the position of a voxel in the voxel grid; Cs and Cd are the computed static and dynamic counters, and p(i, j) and q(i, j) indicate the number of voxels in the (i, j)-th column/bar of D and P, respectively. See Fig. 5.6 (b), (c) and (d) for an illustration of this process.

• Log-Likelihood Ratio. The Log-Likelihood Ratio (LLR) expresses how many times more likely the data is under one model than another. The LLR of the 2D counters Cs and Cd is used to determine the binary masks for the stationary and dynamic voxels, and is given by

R(i, j) = log( max{Cd(i, j), ε} / max{Cs(i, j), ε} )        (5.4)
where ε is a small value (we set it to 1) that prevents division by zero or taking the logarithm of zero. The counter cells belonging to moving parts have higher values in the computed LLR, static parts have negative values, and cells that are shared by both static and moving obstacles tend to zero. By applying a threshold on R(i, j), the 2D binary masks of the stationary and moving voxels (see Fig. 5.6 (e)) can be obtained using the following expressions:

Bs(i, j) = 1 if R(i, j) < Ts, and 0 otherwise
Bd(i, j) = 1 if R(i, j) > Td, and 0 otherwise        (5.5)

where Ts and Td are the thresholds used to compute the 2D binary masks for detecting the most reliable stationary and moving voxels; Bs and Bd are the static and dynamic binary 2D masks, which are applied to all levels of the D and P voxel grids to generate voxels labeled as stationary or moving. Fig. 5.7 shows the outputted static and dynamic voxels and the estimated ground surface.
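The counter and mask computation of Eqs. (5.3)–(5.5) can be sketched compactly as follows (Python/NumPy; the threshold values and the toy grids are assumed examples, and dense arrays stand in for the voxel lists).

import numpy as np

def stationary_moving_masks(D, P, Ts=-0.5, Td=0.5, eps=1.0):
    """Build the 2D counters by summing occupancy values along z, compute the
    log-likelihood ratio, and threshold it into stationary / moving masks.
    D and P are dense 3D occupancy arrays indexed (x, y, z)."""
    Cs = D.sum(axis=2)                                       # Eq. (5.3), static counter
    Cd = P.sum(axis=2)                                       # Eq. (5.3), dynamic counter
    R = np.log(np.maximum(Cd, eps) / np.maximum(Cs, eps))    # Eq. (5.4)
    Bs = R < Ts                                              # Eq. (5.5), stationary mask
    Bd = R > Td                                              # Eq. (5.5), moving mask
    return Bs, Bd

# Toy example: a 20 x 20 x 10 voxel grid
D = np.random.poisson(2.0, (20, 20, 10)).astype(float)
P = np.random.poisson(0.5, (20, 20, 10)).astype(float)
Bs, Bd = stationary_moving_masks(D, P)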
Figure 5.7: The top image shows the projection of the result of the proposed estimated ground surface, and static and moving obstacle detection system on a given frame from the KITTI dataset. The piecewise plane estimation of the ground surface is shown in blue, the detected static obstacles are shown by red voxels, and the generic moving objects are depicted by green voxels. Bottom image shows the corresponding piecewise ground planes and dynamic and static voxels, represented in three dimensions.
In the proposed method, the localization error of the GPS-aided INS positioning system is corrected by applying the ICP algorithm. The proposed algorithm outputs the estimated ground surface (using piecewise planes) and the detected obstacles, using a voxel representation, which are subsequently segmented into static and moving parts. The moving parts of the environment can be further processed to obtain object-level segments, and then to track the generic moving object segments over time. Next, we address this problem (also known as DATMO, Detection And Tracking of Moving Objects) on a 2.5D (Elevation) grid basis. In addition, a new approach is developed to address the localization error of the GPS-aided INS position sensing.
Figure 5.8: The architecture of the proposed algorithm for 2.5D grid-based DATMO.
5.2 Extension of Motion Grids to DATMO
In this section, a DATMO approach is proposed based on motion grids. The motion grid is obtained by building a short-term static model of the scene, followed by a properly designed subtraction mechanism that computes the motion and rectifies the localization error of the GPS-aided INS positioning system. For the generic moving object extraction from the motion grids, a morphology-based clustering method is used. The extracted (detected) moving objects are finally tracked using KFs.
5.2.1 2.5D Grid-based DATMO Overview
In this section, we present the proposed 2.5D grid-based DATMO (see architecture in Fig. 5.8). At every time step, a local Elevation grid is built using the 3D-LIDAR data. The generated Elevation grids and localization data are integrated into a temporary environment model called ‘local (static) short-term map’. In every frame, the last Elevation grid is compared with an updated ‘local short-term map’ to compute a 2.5D motion grid. A mechanism based on spatial properties is presented to suppress false detections that are due to small localization errors. Next, the 2.5D motion grid is post-processed to provide an object level representation of the scene. The multiple detected moving objects are tracked over time by applying KFs with Gating and Nearest Neighbor (NN) association strategies. The proposed 2.5D DATMO outputs the track list of objects’ 3D-BBs.
5.2.2 From Motion Grids to DATMO
This section describes the 2.5D motion grid detection (as an alternative to the voxel-based motion detection and to the computationally costly ICP algorithm for correcting the localization error, presented in the previous section). Next, the generic moving object detection and tracking algorithm is explained.

2.5D Motion Grid Detection

This subsection briefly describes the motion detection algorithm, which comprises the following three processing steps:

• A Single Instance of the Local Elevation Grid. In the present work, the Elevation grid (see Section 2.1.2) is built to cover a local region (10 m behind, 30 m ahead, and ±10 m on the left and right sides) surrounding the ego-vehicle. The ground cells, with a variance and height lower than certain given thresholds, are discarded when building the Elevation grid, as shown in the following equation:

E(j) = 0 if (σj² < Tσ) ∧ (µj < Tµ), and E(j) = µj otherwise        (5.6)

where µj and σj² are the average height and the variance in the j-th cell, and the thresholds Tσ and Tµ are learned empirically.

• Local (Static) Short-Term Map. This step consists of the integration of consecutive Elevation grids and GPS-aided INS positioning data to build a local static short-term map of the surrounding environment. The short-term map Si is updated on every input Elevation grid Ei obtained from the last 3D-LIDAR data. To build the short-term map, initially a queue-like data structure E = {Ei−n, · · · , Ei−1, Ei} is defined using a First In First Out (FIFO) approach to store the last n sequential Elevation grids (which are permanently being transformed according to the current pose of the vehicle). Next, the short-term map is calculated based on E, by taking the mean of the m last valid values of each cell's history, with the constraint that the cell should have been observed a minimum of k times. The short-term map update procedure is summarized in Algorithm 4.

Algorithm 4 Short-Term Map Update.
1: Inputs: Previous Elevation grids: E = {Ei−n−1, · · · , Ei−2, Ei−1} and the newly computed Elevation grid: Ei (all transformed to the current pose of the vehicle)
2: Output: Short-term map: Si
3: Remove Ei−n−1
4: for grid k = i − n − 1 to i − 1 do    ▷ move Elevation grids downwards in E
5:    Ek ← Ek+1
6: end for
7: Si ← Mean(E)    ▷ on the m most recent observations of each cell
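A possible rendering of Algorithm 4 is sketched below (Python/NumPy; n, m and k_min are assumed example values, NaN marks unobserved cells, and the deque-based bookkeeping is our implementation choice, not the thesis's).

from collections import deque
import numpy as np

def update_short_term_map(history, new_grid, m=5, k_min=2):
    """'history' is a deque(maxlen=n) holding the last n Elevation grids,
    already transformed to the current vehicle pose. Appending drops the
    oldest grid; the short-term map is the mean of the m most recent valid
    observations of each cell, kept only if observed at least k_min times."""
    history.append(new_grid)
    stack = np.stack(list(history), axis=0)              # (n_grids, H, W)
    valid = ~np.isnan(stack)
    recent_rank = np.cumsum(valid[::-1], axis=0)[::-1]   # 1 = most recent valid obs.
    use = valid & (recent_rank <= m)
    count = use.sum(axis=0)
    summed = np.where(use, stack, 0.0).sum(axis=0)
    return np.where(count >= k_min, summed / np.maximum(count, 1), np.nan)

history = deque(maxlen=10)                               # keep the last n = 10 grids
S = update_short_term_map(history, np.full((40, 20), np.nan))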
Figure 5.9: The motion computation process for the j-th cell (shown in red in the Elevation grid). The set of cells J, in the ε-neighborhood of the j-th cell, in the short-term map is shown in green. In the figure, the radius ε is considered as being 1 cell.

• 2.5D Motion Grid Computation. Ideally, motion detection could be performed by subtracting the last Elevation grid from the short-term map. However, in practice the ego-vehicle suffers from poor localization accuracy, and using a simple subtraction can result in many false detections. Specifically, false detections due to small localization errors usually take the form of spatially clustered regions in the Elevation grid (see Fig. 5.10 (a)). To reduce the occurrence of such false detections, a spatial reasoning is employed and integrated into the motion detection process. The j-th cell of the motion grid M (the temporal subscript i is omitted for notational simplicity) can be obtained from the last Elevation grid E and the short-term map S by

M(j) = E(j) if ‖E(j) − S(k*)‖ > Te, and M(j) = 0 otherwise        (5.7)

where

k* = argmin_{k ∈ J} ‖E(j) − S(k)‖        (5.8)
where J indicates the set containing the indexes of the cells in the ε-neighborhood of the j-th cell. To summarize, if a cell in the Elevation grid has a value close to that of the neighboring cells (of the corresponding cell) in the short-term map, it is considered a false detection and suppressed; otherwise, it is part of the motion (see Fig. 5.9). The radius of the neighborhood ε depends on the localization error and on the number of scans considered for constructing the short-term map. The threshold Te is the maximum acceptable distance between cell values, and is set to α × E(j). The coefficient α is learned empirically. Using the proposed approach, most of the false detections are eliminated. Some sparse false detections can still remain, which can be removed by applying a simple post-processing (see Fig. 5.10 (b)).
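The neighborhood test of Eqs. (5.7)–(5.8) can be written directly as a double loop over the grid, as in the sketch below (Python/NumPy; alpha and eps are assumed example values, and NaN marks unobserved cells).

import numpy as np

def motion_grid(E, S, eps=1, alpha=0.3):
    """A cell of the last Elevation grid E is kept as motion only if its value
    differs from every cell of the short-term map S within an eps-cell
    neighborhood by more than Te = alpha * E(j)."""
    H, W = E.shape
    M = np.zeros_like(E)
    for r in range(H):
        for c in range(W):
            e = E[r, c]
            if np.isnan(e) or e == 0:
                continue
            r0, r1 = max(0, r - eps), min(H, r + eps + 1)
            c0, c1 = max(0, c - eps), min(W, c + eps + 1)
            neigh = S[r0:r1, c0:c1]
            neigh = neigh[~np.isnan(neigh)]
            if neigh.size == 0 or np.min(np.abs(e - neigh)) > alpha * e:
                M[r, c] = e                  # Eq. (5.7): part of the motion
    return M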
Figure 5.10: From top to bottom: (a) 2.5D motion grid obtained by simple subtraction of the last Elevation grid from the short-term map; (b) after false detection suppression, and (c) after morphology, post-processing, and labeling connected components. For a better visualization, the grids were projected and displayed onto the RGB image.

Moving Object Detection and Tracking

In this section, we present a motion grouping mechanism to extract an object level representation of the scene, followed by the description of the tracking module.
• Moving Object Detection. A mathematical-morphology based approach is employed for generic moving object extraction from the motion grid:

O = (M ⊕ s_x) ∧ (M ⊕ s_y),    (5.9)
where the dilation (morphology) operation is represented by ⊕; s_x and s_y are the rectangular structuring elements applied in the x- and y-directions to compensate for the gap between 3D-LIDAR scans in the vehicle movement direction (which may cause a detected object to be split into different sections), and to fill the small holes inside the object motion. The results of the dilations in the x- and y-directions are multiplied together to keep false detections small. Next, some post-processing is performed to remove very small and unusually sized regions and to label connected components. The labeled connected components, which correspond to generic moving objects, are inputted to the tracking module. At this stage, the fitted 3D-BB of each moving object (without considering the object orientation) can be computed using the x-y size and the maximum height of each connected component. Fig. 5.10 shows the different steps involved in the motion detection module.
• Tracking Module. The tracking module is composed of three submodules, as follows:
Figure 5.11: A sample screenshot of the proposed 2.5D DATMO result, demonstrated in 2D (top) and 3D (bottom) views.
Kalman Filter (KF). The centroid of a labeled segment of the motion grid (see Fig. 5.10 (c)) is considered as a detected generic moving object (also known as the point model representation). A KF [123] with a Constant Velocity model (CV-KF) is used for the prediction of each object's location in the next frames. An individual 2D KF is associated with every newly detected moving object.
Data Association (DA). Gating and Nearest Neighbor (NN) strategies are used to determine which detected object goes with which track. Initially, for each track, gating is applied to prune the candidates (the detected objects). If there is more than one candidate, the nearest one is associated with the track; if there is no candidate, it is assumed that a missed detection has occurred, the KF prediction from the previous time step is used, and a flag is sent to the track management for further actions (a minimal code sketch of this per-track CV-KF with gating and NN association is given below).
Track Management. The main objectives of the track management are track initialization for new detections and removal of the tracks that have left the local grid. When there is a detection that is not associated with any existing track, a new track is initialized, but the track management unit waits for the next frame for confirmation. If in the next frame a detection gets associated with that track, it is confirmed as a new track; otherwise it is considered a false detection. Fig. 5.11 shows the result of the proposed 2.5D DATMO.
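As a complement to the description of the tracking submodules, the following is a minimal Python sketch of a single track handled by a constant-velocity KF with gating and nearest-neighbor association. The class name, matrices, noise values and gate radius are illustrative assumptions, not the thesis implementation.

```python
import numpy as np

class CVTrack:
    """Constant-velocity Kalman filter for one 2D point-model track (sketch)."""
    def __init__(self, xy, dt=0.1, gate=2.0):
        self.x = np.array([xy[0], xy[1], 0.0, 0.0])        # [x, y, vx, vy]
        self.P = np.eye(4)
        self.F = np.array([[1, 0, dt, 0], [0, 1, 0, dt],
                           [0, 0, 1, 0], [0, 0, 0, 1]], float)
        self.H = np.array([[1, 0, 0, 0], [0, 1, 0, 0]], float)
        self.Q = 0.01 * np.eye(4)
        self.R = 0.10 * np.eye(2)
        self.gate = gate
        self.missed = 0

    def predict(self):
        self.x = self.F @ self.x
        self.P = self.F @ self.P @ self.F.T + self.Q
        return self.x[:2]

    def update(self, detections):
        """Gating + nearest-neighbor data association, then KF correction.
        On a missed detection the prediction is kept, as described above."""
        pred = self.predict()
        if len(detections):
            d = np.linalg.norm(np.asarray(detections) - pred, axis=1)
            if d.min() < self.gate:                          # gate first, then nearest
                z = np.asarray(detections)[d.argmin()]
                S = self.H @ self.P @ self.H.T + self.R
                K = self.P @ self.H.T @ np.linalg.inv(S)
                self.x = self.x + K @ (z - self.H @ self.x)
                self.P = (np.eye(4) - K @ self.H) @ self.P
                self.missed = 0
                return self.x[:2]
        self.missed += 1                                     # flag for track management
        return pred

track = CVTrack(xy=(0.0, 0.0))
for t in range(1, 5):
    print(track.update([(0.5 * t, 0.2 * t), (30.0, 30.0)]))  # second blob lies outside the gate
```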
5.3 Fusion at Tracking-Level
Object tracking is one of the key components of a DATMO system. Although most approaches work only on image or LIDAR sequences, this section proposes an object tracking method using fusion of 3D-LIDAR with RGB camera data to improve the tracking function of a multisensor DATMO system.
5.3.1 Fusion Tracking Overview
Considering sensory inputs from a camera, a 3D-LIDAR and an INS mounted on-board the ego-vehicle, 3D single-object tracking is defined as follows: given the input data (RGB image, PCD, and the ego-vehicle pose) and the initial object's 3D-BB in the first PCD, estimate the trajectory of the object in the 3D world coordinate system as both the ego-vehicle and the object move around the scene. The conceptual model of the proposed multisensor 3D object tracker is shown in Fig. 5.12. Object tracking starts with the known 3D-BB in the first scan. Next, the ground plane is estimated, and ground points are eliminated from the PCD. The remaining object points P are projected into the image plane and the 2D convex-hull (Ω) of the projected point set, P∗, is computed. The convex-hull Ω accurately segments object pixels from other pixels.
Figure 5.12: The diagram of the major pipeline of the proposed 2D/3D fusion-based 3D object tracking algorithm.

The 3D-BB and its corresponding Ω are used to initialize tracking in the PCD and in the image. In the next time-step, two MeanShift (MS) [124] based localizers are run individually to estimate the 2D and 3D object positions in the new image and PCD, respectively. An adaptive color-based MS is used to localize the object in the image, while the localization in the PCD is performed using the MS gradient estimation of the object points in the 3D-BB. The 2D position of the object in the image is projected back to 3D. KFs with a Constant Acceleration model (CA-KF) are used for the fusion and tracking of the object locations in the image and PCD. The predicted object position and orientation are used for the initialization of the 3D-BB in the next PCD. Fig. 5.13 shows the result of the proposed algorithm. In the next subsections, the object localization components, in the 3D PCD and in the 2D RGB image, are described, followed by the explanation of the proposed fusion and tracking approaches.
5.3.2 3D Object Localization in PCD
Incoming PCDs need to be processed to remove ground points in the 3D-BB, avoiding object model degradation. • Removing the Ground Points. Ground points typically constitute a large portion of a 3D-LIDAR’s PCD. If an appropriate feature is selected and the corresponding Probability Density Function (PDF) computed, then the peak-value in the PDF could be used to indicate ground points. Leveraging this fact, a Kernel Density Estimation (KDE) is used to estimate the PDF of angles (the considered feature) between the x-y plane and the set of lines passing through the center
of the ego-vehicle (the origin point) and every point belonging to the PCD. Let Θ = {θ_1, · · · , θ_N} denote the set of 1D angle values (measured in the x-z plane) for a PCD, where θ_i = arctan(z_i / x_i). The univariate KDE is obtained by

P(θ) = (1/N) ∑_{i=1}^{N} K_σ(θ − θ_i),    (5.10)

where K_σ(·) is a Gaussian kernel with width σ, and N is the number of points.

Figure 5.13: Proposed object tracking method results. The bottom figure shows the result in the 3D PCD, where the 3D-BB of the tracked object is shown in blue, the object trajectory is represented as a yellow curve, and the current estimated speed of the object is shown inside a text-box (27.3 km/h). The ego-vehicle trajectory, given by an INS system, is represented by a magenta curve. Parts of the 3D-LIDAR PCD in the field of view of the camera are shown in white (obstacle points) and red (detected ground points). The top figure represents the tracking result in the 2D RGB image, where the detected object region and its surrounding area are shown in blue and red, respectively.
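To make the KDE-based procedure concrete, the following is a minimal NumPy sketch of estimating the ground pitch as the peak of P(θ) and flagging low-lying points as ground; the kernel width, grid resolution and synthetic point cloud are illustrative assumptions, not the thesis parameters.

```python
import numpy as np

def ground_pitch(points, sigma=0.5, dmin=0.20):
    """Sketch of the KDE-based ground removal: estimate the pitch angle as the
    peak of the KDE of point angles (Eq. 5.10, up to a constant factor), then
    flag points lying within dmin meters of the resulting plane as ground."""
    x, y, z = points[:, 0], points[:, 1], points[:, 2]
    theta = np.degrees(np.arctan2(z, x))                 # angle w.r.t. the x-y plane
    grid = np.linspace(theta.min(), theta.max(), 721)
    kde = np.exp(-0.5 * ((grid[:, None] - theta[None, :]) / sigma) ** 2).mean(axis=1)
    pitch = np.radians(grid[kde.argmax()])               # KDE peak -> ground pitch
    # signed height of every point above the pitched ground plane through the origin
    height = z * np.cos(pitch) - x * np.sin(pitch)
    return pitch, height < dmin                          # pitch angle, ground-point mask

rng = np.random.default_rng(0)
ground = np.c_[rng.uniform(2, 30, 500), rng.uniform(-5, 5, 500), rng.normal(0, 0.03, 500)]
car = np.c_[rng.uniform(9, 12, 100), rng.uniform(-1, 1, 100), rng.uniform(0.3, 1.5, 100)]
pitch, is_ground = ground_pitch(np.vstack([ground, car]))
print(np.degrees(pitch), is_ground[:500].mean(), is_ground[500:].mean())
```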
Figure 5.14: The ground removal process. (a) The angle value θ_i for a point i; (b) KDE of the set Θ and the detected pitch angle θ_ρ, and (c) the ground removal result. Red points denote detected ground points. The green ellipse shows the PCD of the car and the detected points. The corresponding car in the image is shown with a red ellipse.

The ground is assumed to be planar, and the pitch angle of the ground plane is identified as the KDE's peak value θ_ρ. The points below a certain height d_min from the estimated ground plane are considered as being ground points (Fig. 5.14). In order to increase the robustness of the ground removal process, a KF with a CA model is used for the estimation of the ground plane's pitch angle. To eliminate outliers, the angle search area is limited to a gate in the vicinity of the KF value predicted from the previous step. If no measurement is available inside the gate, the predicted KF value is used.
• MS-based Localization in PCD. Object localization in the PCD is performed as follows:
1. Computing the shift-vector. Given the center of the 3D-BB as χ, the shift-vector between χ and the point set P′ inside the 3D-BB is computed using

m_k = µ(P′) − χ_k,    (5.11)

where µ(·) indicates the mean function, and k is the iteration index.
2. Translating the 3D-BB. The 3D-BB is translated using the shift-vector,

χ_{k+1} = χ_k + m_k.    (5.12)

The shift-vector always points in the direction of the maximum increase in density.
3. Iterating steps 1 and 2 until convergence. The MS iteratively shifts the 3D-BB until the object is placed entirely within the 3D-BB. A centroid movement |m_k| of less than 5 cm, or a maximum number of iterations equal to 5, is considered as the MS convergence criterion. The MS process in the PCD is shown in Fig. 5.15.

Figure 5.15: The MS procedure in the PCD. Left: bird's-eye view. Right: top view. The brighter blue color of the 3D-BB shows the most recent iteration.

The object position is represented by the centroid of the object points (point model) inside the 3D-BB. The point model is feasible even with only a few object points, which is the case in a sparse 3D-LIDAR PCD (especially for far objects). The centroid after convergence is denoted by C_pcd and outputted to the fusion module.
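The MS localization of the 3D-BB can be sketched in a few lines. The following NumPy illustration shifts a fixed-size box to the centroid of the points it contains until convergence, following (5.11)–(5.12); the function name, box size and synthetic data are illustrative assumptions, not the thesis implementation.

```python
import numpy as np

def ms_localize_3dbb(points, center, size, max_iter=5, tol=0.05):
    """Sketch of the MS localization in the PCD (Eqs. 5.11-5.12): repeatedly
    shift a fixed-size 3D-BB to the centroid of the points it contains, until
    the shift is below tol (5 cm) or max_iter iterations are reached."""
    center = np.asarray(center, float)
    half = np.asarray(size, float) / 2.0
    for _ in range(max_iter):
        inside = np.all(np.abs(points - center) <= half, axis=1)
        if not inside.any():
            break                                          # no support: keep the prediction
        shift = points[inside].mean(axis=0) - center       # m_k = mu(P') - chi_k
        center = center + shift                            # chi_{k+1} = chi_k + m_k
        if np.linalg.norm(shift) < tol:
            break
    return center                                          # C_pcd (point model)

rng = np.random.default_rng(1)
car_points = rng.normal([10.0, 2.0, 0.8], [1.0, 0.6, 0.4], size=(200, 3))
print(ms_localize_3dbb(car_points, center=[9.0, 1.2, 0.8], size=[4.0, 2.0, 1.8]))
```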
5.3.3 2D Object Localization in Image
Object points P (after ground removal) are projected onto the image and the 2D convex-hull (also called the Ω region) of the projected point set P∗ = {p∗_1, · · · , p∗_n} is computed. The
2D convex-hull is the smallest 2D convex polygon that encloses P∗. The computed Ω segments the object from the background more accurately than the traditional rectangular 2D-BB. The surrounding area Ω† is computed automatically by expanding Ω by a factor equal to √2 with respect to its centroid, so that the number of pixels in Ω− = Ω† − Ω (the region between Ω and Ω†) is approximately equal to the number of pixels within Ω.

Figure 5.16: The MS computation in the image. (a) The schematic diagram of the MS computation work flow. (b) Left: Ω and Ω† in blue and red, respectively. Middle: the computed ℜ and f. Each non-empty bin in ℜ is represented by a circle; each bin value is represented by the area of a circle, and each circle's location represents a color in the histogram (the same as its face color). Right: the MS localization procedure. The brighter blue Ω indicates the most recent MS iteration.

• Color Model of the Object. Two joint RGB histograms are calculated from the pixels within the Ω and Ω− regions. The Log-Likelihood Ratio (LLR) of the RGB histograms expresses how much more likely each bin is under the Ω color model than under the Ω− one. In the LLR, positive bins more likely belong to Ω, bins with a negative value to Ω−, and bins shared by both Ω and Ω− tend to zero. The positive part of the LLR is used to represent the discriminant object color model,

ℜ(i) = max{ log( max{H_Ω(i), ε} / max{H_Ω−(i), ε} ), 0 },    (5.13)
where H_Ω and H_Ω− are the histograms computed from the Ω and Ω− regions, respectively; ε is a small value that prevents dividing by, or taking the log of, zero, and the variable i ranges from 1 to the number of histogram bins. The color model of the object (ℜ) is normalized and used for localizing the object in the next frame.
• MS-based Localization in Image. MS-based object localization for the next frame starts at the centroid of the confidence map (f) of the Ω region in the current frame. This confidence map is computed by replacing the color value of each pixel in the Ω region by its corresponding bin value in ℜ. In each iteration, the center of Ω, from the previous step, is shifted to the centroid of f (the current confidence map), computed as follows:

C_new = (1 / ∑_{i=1}^{m} f_i) ∑_{i=1}^{m} f_i C_i,    (5.14)
where C_i = {r_i, c_i} denotes the pixel positions in Ω, and m is the total number of pixels in Ω. The maximum number of MS iterations needed to achieve convergence was empirically limited to 4, unless the centroid movement is smaller than a pixel (see Fig. 5.16). The computed 2D object centroid after convergence is denoted by C_rgb and outputted to the fusion module.
• Adaptive Updating of ℜ Bins. RGB images obtained from cameras are very informative, but they are very sensitive to variations in illumination conditions. To adapt the object color model and overcome changes in the object color appearance during tracking, a bank of 1D KFs with a CA model is applied. The KFs estimate and predict the ℜ bin values for the next frames. A new 1D KF is initialized and associated with every newly observed color bin. When a bin value becomes zero or negative, the corresponding KF is removed. Based on a series of tests where 8 × 8 × 8 histograms (512 bins) were considered, the average number of utilized KFs in each frame was about 70 (∼ 14% of the total number of bins).
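The color model of (5.13) and one MS iteration of (5.14) can be sketched as follows. This NumPy illustration builds the positive LLR histogram and back-projects it into a confidence map whose weighted centroid gives the new center; the bin count, synthetic image and function names are illustrative assumptions, not the thesis implementation.

```python
import numpy as np

def llr_color_model(obj_pixels, bg_pixels, bins=8, eps=1e-6):
    """Sketch of Eq. (5.13): positive part of the log-likelihood ratio between
    the joint RGB histograms of the object (Omega) and surrounding (Omega-)."""
    edges = np.linspace(0, 256, bins + 1)
    h_obj, _ = np.histogramdd(obj_pixels, bins=(edges,) * 3, density=True)
    h_bg, _ = np.histogramdd(bg_pixels, bins=(edges,) * 3, density=True)
    R = np.maximum(np.log(np.maximum(h_obj, eps) / np.maximum(h_bg, eps)), 0.0)
    return R / max(R.sum(), eps)                           # normalized color model

def confidence_centroid(image, region_mask, R, bins=8):
    """Back-project R onto the pixels of the current region and return the
    weighted centroid of the resulting confidence map f (one MS iteration)."""
    idx = (image[region_mask] // (256 // bins)).astype(int)         # bin index per pixel
    f = R[idx[:, 0], idx[:, 1], idx[:, 2]]
    rc = np.argwhere(region_mask)                                    # pixel coordinates C_i
    return (f[:, None] * rc).sum(axis=0) / max(f.sum(), 1e-6)

rng = np.random.default_rng(2)
img = rng.integers(0, 256, (60, 80, 3)).astype(np.uint8)
img[20:40, 30:50] = (200, 40, 40)                                    # a red "object"
mask = np.zeros((60, 80), bool); mask[18:42, 28:52] = True           # current Omega region
R = llr_color_model(img[20:40, 30:50].reshape(-1, 3),
                    img[:10, :10].reshape(-1, 3))
print(confidence_centroid(img, mask, R))                             # near (29.5, 39.5)
```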
5.3.4 KF-based 2D/3D Fusion and Tracking
KFs are used for the fusion and tracking of the object centroids obtained from the image and the PCD:
• 2D/3D Fusion for Improved Localization. The computed 2D location of the object (C_rgb) is projected back to 3D (C′_rgb) using a method described in [64]. Although originally used for up-sampling a PCD, we employ it for projecting the object centroid in the image back to the 3D-LIDAR space. The PCD is projected into the image and the nearest points in each of the 4 quadrants (upper left/right and lower left/right) surrounding C_rgb are found. The bilinear interpolation on the corresponding 4 points in the PCD (before the projection to the image) is computed to estimate C′_rgb.
A KF-based fusion (the measurement fusion model [125]) is applied to integrate C′_rgb with C_pcd and estimate the fused centroid C_3D. The idea is to give more trust to the method that performs better, thus providing a more accurate estimate than each method individually. The dynamics of the object and the fused measurement model of the object localizers in the PCD and image are given by

x_t = A_F · x_{t−1} + w_t
z_t = H_F · x_t + v_t,    (5.15)

where w_t and v_t represent the process and measurement noise, A_F is the fusion state transition matrix, and H_F is the fusion transformation matrix. The augmented measurement vector z_t is given by

z_t = [ (C_pcd)^T  (C′_rgb)^T ]^T.    (5.16)
• 3D Tracking. A 3D CA-KF is used for the robust tracking of the fused centroid C_3D. Let the state of the filter be x = [x, y, z, ẋ, ẏ, ż, ẍ, ÿ, z̈]^T, where ẋ, ẏ, ż and ẍ, ÿ, z̈ define the velocity and acceleration corresponding to the x, y, z location. The discrete-time process and measurement models of the system are given by

x_t = A_T · x_{t−1} + w_t
z_t = H_T · x_t + v_t,    (5.17)

where A_T and H_T are the state transition matrix and the transformation matrix for object tracking. To eliminate outliers and increase the robustness, the search area is limited to a gate in the vicinity of the predicted KF location (given by x_t = A_T · x_{t−1}). If no measurement is available inside the gate area, the KF prediction is used. The result of the proposed algorithm is the estimated trajectory of the object in 3D world coordinates, its velocity, and the predicted object location in the next time-step. The object orientation is obtained from the difference between its current and previous locations.
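The measurement-fusion step of (5.15)–(5.16) can be illustrated with a compact sketch. The following Python example stacks C_pcd and C′_rgb into one augmented measurement and weights them through their noise covariances; for brevity a constant-velocity state is used instead of the constant-acceleration model of the thesis, and all matrices and noise values are illustrative assumptions.

```python
import numpy as np

def fused_kf_update(x, P, c_pcd, c_rgb, dt=0.1, r_pcd=0.05, r_rgb=0.25):
    """Sketch of the measurement-fusion KF step (Eqs. 5.15-5.16): stack the
    LIDAR centroid C_pcd and the back-projected image centroid C'_rgb into one
    augmented measurement z_t and weight them by their noise covariances."""
    A = np.eye(6); A[:3, 3:] = dt * np.eye(3)                        # state transition A_F
    H = np.vstack([np.hstack([np.eye(3), np.zeros((3, 3))])] * 2)    # stacked H_F
    Q = 1e-3 * np.eye(6)
    R = np.diag([r_pcd] * 3 + [r_rgb] * 3)                           # trust the better sensor more
    x = A @ x
    P = A @ P @ A.T + Q
    z = np.concatenate([c_pcd, c_rgb])                               # augmented measurement z_t
    S = H @ P @ H.T + R
    K = P @ H.T @ np.linalg.inv(S)
    x = x + K @ (z - H @ x)
    P = (np.eye(6) - K @ H) @ P
    return x, P                                                      # fused centroid C_3D = x[:3]

x0, P0 = np.zeros(6), np.eye(6)
x1, _ = fused_kf_update(x0, P0, c_pcd=np.array([10.0, 2.0, 0.8]),
                        c_rgb=np.array([10.4, 1.8, 0.9]))
print(x1[:3])   # lies between the two measurements, closer to the LIDAR one
```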
Chapter 6

Object Detection

Contents
6.1 3D-LIDAR-based Object Detection
    6.1.1 DepthCN Overview
    6.1.2 HG Using 3D Point Cloud Data
    6.1.3 HV Using DM and ConvNet
    6.1.4 DepthCN Optimization
6.2 Multimodal Object Detection
    6.2.1 Fusion Detection Overview
    6.2.2 Multimodal Data Generation
    6.2.3 Vehicle Detection in Modalities
    6.2.4 Multimodal Detection Fusion
In the last chapter, we described the proposed methods for generic object detection by processing a temporal sequence of sensors' data. This chapter addresses the problem of class-specific object detection from a single frame of multisensor data. The chapter starts by detailing 3D object detection using 3D-LIDAR data and then proceeds to describe multisensor and multimodal fusion for the object detection task. In this chapter we use the 'Car' class as the example of a class-specific object. Parts of this chapter have been published in one journal article and three conference papers: Pattern Recognition Letters [126], the IEEE Intelligent Transportation Systems Conference [127, 128], and the Third Iberian Robotics Conference [129].
6.1 3D-LIDAR-based Object Detection
The application of an unsupervised learning technique to support (class-specific) supervised-learning-based object detection is the main purpose of this section. A vehicle detection system (herein called DepthCN), based on the Hypothesis Generation (HG) and Verification (HV) paradigms, is proposed. The data inputted to the system is a point cloud obtained from a 3D-LIDAR mounted on board an instrumented vehicle, which is then transformed to a dense-Depth Map (DM). Specifically, DBSCAN clustering is used to extract structures (i.e., to segment individual obstacles that stand on the ground) from the 3D-LIDAR data to form class-agnostic object hypotheses, followed by a (class-specific) ConvNet-based classification of such hypotheses (in the form of a DM). The term 'object hypotheses', which we use interchangeably with 'object proposals', refers to the projection of the segmented 'obstacles' to the camera coordinate system.
6.1.1 DepthCN Overview
The architecture of DepthCN is presented in Fig. 6.1. The approach comprises two stages: 1) the offline learning stage, to optimize the HG and HV steps, and 2) the online vehicle detection stage. After offline optimization, the optimized parameters (highlighted in Fig. 6.1) are passed to the online stage. The online detection starts with removing ground points, clustering the LIDAR's point cloud to form segmented obstacles, and then projecting the obstacles onto the 3D-LIDAR-based dense-Depth Map (DM). Bounding boxes are fitted to the individual projected obstacles as object hypotheses (the HG step). Finally, the bounding boxes are used as inputs to a ConvNet to classify/verify the hypotheses as belonging to the category 'vehicle' (the HV step). In the following, the proposed Hypothesis Generation (HG) using 3D-LIDAR data and Hypothesis Verification (HV) using the DM and a ConvNet are described, and then the offline DepthCN optimization process is explained.
6.1.2 HG Using 3D Point Cloud Data
Objects in the driving environment may appear in different sizes and locations. State-of-the-art approaches speed up the detection process using a set of object proposals instead of an exhaustive sliding-window search. In this section, vehicle proposals are generated solely from 3D-LIDAR data.

Grid-based Ground Removal

To increase the quality of the object proposals and to reduce unnecessary computations, points that belong to the ground first need to be removed. In a grid-based framework,
ground points are eliminated by rejecting cells containing points with a low variance in the z-dimension.

Figure 6.1: The proposed 3D-LIDAR-based vehicle detection algorithm (DepthCN).

Obstacle Segmentation for HG

3D-LIDARs have previously shown promising performance for obstacle detection (see Chapter 5). Taking this into account, we explore an HG technique using data from a 3D-LIDAR. After removing ground points, by applying DBSCAN [1] on the top-view x-y values of the remaining points, the 3D-LIDAR points are segmented into distinctive clusters, where each cluster approximately corresponds to an individual obstacle in the environment. The segmented obstacles (i.e., clusters) are then projected onto the camera coordinate system (using the LIDAR-to-camera calibration matrices), and the fitted 2D-BB of each cluster is taken as an object hypothesis (see Fig. 6.2 and the sketch below).
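The HG step described above (grid-based ground removal followed by DBSCAN clustering of the bird's-eye-view points) can be illustrated as follows; cell size, thresholds and the synthetic scene are illustrative assumptions, not the optimized values obtained by the exhaustive search described in Section 6.1.4.

```python
import numpy as np
from sklearn.cluster import DBSCAN

def generate_hypotheses(points, cell=0.3, var_thresh=0.02, eps=0.7, min_pts=10):
    """Sketch of the HG step: grid-based ground removal (reject cells with low
    z-variance) followed by DBSCAN on the top-view x-y coordinates."""
    # --- grid-based ground removal --------------------------------------
    cells = np.floor(points[:, :2] / cell).astype(int)
    keep = np.zeros(len(points), bool)
    for key in np.unique(cells, axis=0):
        idx = np.all(cells == key, axis=1)
        if points[idx, 2].var() >= var_thresh:            # obstacle cell
            keep |= idx
    obstacles = points[keep]
    # --- DBSCAN clustering of the remaining points (bird's-eye view) -----
    labels = DBSCAN(eps=eps, min_samples=min_pts).fit_predict(obstacles[:, :2])
    clusters = [obstacles[labels == k] for k in set(labels) if k != -1]
    # each cluster would then be projected onto the image and boxed (2D-BB)
    return clusters

rng = np.random.default_rng(3)
ground = np.c_[rng.uniform(0, 30, 2000), rng.uniform(-10, 10, 2000), rng.normal(0, 0.02, 2000)]
car = rng.normal([12, 3, 0.8], [1.0, 0.6, 0.4], (300, 3))
print(len(generate_hypotheses(np.vstack([ground, car]))))   # typically 1 cluster (the car)
```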
Figure 6.2: HG using DBSCAN in a given point cloud. Top shows the point cloud where the detected ground points are denoted with green and LIDAR points that are out of the field of view of the camera are shown in red. The segmented obstacles are shown with different colors. The bottom image shows the projected clusters and HG results in the form of 2D-BBs (i.e., object proposals). The image frame here is used only for visualization purpose. The right-side shows the zoomed view, and the vertical orange arrows indicate corresponding obstacles. The dashed-blue BBs indicate two vehicles marked by KITTI Ground-Truth (GT).
6.1.3 HV Using DM and ConvNet
The ConvNet classifier focuses on identifying vehicles from the set of object hypotheses projected onto the 3D-LIDAR-based DM (as illustrated in Fig. 6.3). At this stage, the system encompasses two steps: 1) DM generation from 3D-LIDAR data, and 2) a DM-based ConvNet for vehicle Hypothesis Verification (HV).

DM Generation

To generate a dense (up-sampled) map from the LIDAR, a number of techniques can be used, as described in [128, 130, 131]. Here, the LIDAR dense-Depth Map (DM) generation is made by projecting the sparse 3D-LIDAR point cloud onto the camera coordinate system, performing interpolation and encoding as described in the sequel (see Fig. 6.4).
• 3D-LIDAR – Image Projection. The 3D-LIDAR point cloud data P = {X, Y, Z} is filtered to the camera's field of view and projected onto the 2D image plane using the projection matrix:

P∗ = P_C2I × R_0 × P_L2C × P,    (6.1)
where P_C2I is the projection matrix from the camera coordinate system to the image plane, R_0 is the rectification matrix, and P_L2C is the LIDAR-to-camera coordinate system projection matrix. Considering P∗ = {X∗, Y∗, Z∗}, using the row and column pixel values {X∗, Y∗} accompanied by the range data Z∗, a compact sparse Range Map (sRM) is computed, which has a lower density than the image resolution.

Figure 6.3: The generated dense-Depth Map (DM) with the projected hypotheses (41 object proposals are depicted with red rectangles). For viewing the corresponding RGB image and 3D-LIDAR data please refer to Fig. 6.2.

• sRM Depth Encoding. The sRM is converted to an 8-bit integer gray-scale image format using the Range Inverse method, which dedicates more bits to closer depth values. Let ζ ∈ Z∗ be the projected real-range values, and ζ_min and ζ_max the minimum and maximum range values considered. The 8-bit quantized depth-value of a pixel (ζ_8bit) is attained by

ζ_8bit = ⌊ (ζ_max × (ζ − ζ_min)) / (ζ × (ζ_max − ζ_min)) × 255 ⌋,    (6.2)

where ⌊·⌋ denotes the floor function. This process converts the original range values in the sRM to the 8-bit quantized sparse Depth Map (sDM).
• Delaunay Triangulation (DT). We adopted the Delaunay Triangulation (DT) as a technique to obtain high-resolution maps. DT is effective in obtaining dense maps with close to 100% density because this method interpolates all locations in the map regardless of the positions of the input (raw) points. The DT is used for mesh generation from the row and column values {X∗, Y∗} of the projected 3D-LIDAR points P∗. The DT produces a set of isolated triangles ∆ = {δ_1, · · · , δ_n}, each triangle δ composed of three vertices n_k, {k : 1, 2, 3}, useful for building the interpolating function F(·) to perform an interpolation over the sDM depth values.
Figure 6.4: Illustration of the DM generation process: (a) a color image with superimposed projected LIDAR points; (b) the generated 2D triangulation; (c) the zoomed area within the red box of the above image, and (d) the constructed DM.

• Interpolation of the sparse Depth Map (sDM). The unsampled (missing) intensity value i of a pixel location P, which lies within a triangle δ, is estimated by interpolating
depth values of the surrounding triangle vertices n_k, {k : 1, 2, 3}, using the Nearest Neighbor interpolation function F (which means selecting the value of the closest vertex), applying (6.3) and ending up in a DM (Fig. 6.4):

i = F( argmin_{n_k} ‖P − n_k‖ ),  {k : 1, 2, 3}.    (6.3)

Figure 6.5: The ConvNet architecture (details of the second convolutional and pooling layers are omitted to improve readability).
ConvNet for Hypothesis Verification (HV)

A ConvNet is used as the HV core in DepthCN. The ConvNet input size is set to 66 × 112, where 66 and 112 are the average Ground-Truth (GT) vehicle height and width (in pixels) in the training dataset. An object proposal BB in the DM is extracted as a vehicle candidate (see 'Candidate Extraction' in Fig. 6.1), resized to 66 × 112 and inputted to the ConvNet for classification. The ConvNet employed in DepthCN is composed of 2 convolutional layers, 3 Rectified Linear Units (ReLUs), 2 pooling layers, 2 Fully Connected (FC) layers, a Softmax layer, and a Dropout layer for regularization (as illustrated in Fig. 6.5). Each component of the ConvNet architecture is described briefly in the following; a code sketch of the full architecture is given after the list.
• Convolutional Layers. By applying convolution filters across the input data, feature maps are computed. The first and the second convolutional layers contain 32 filters of 5 × 5 × 1 and 64 filters of 5 × 5 × 32, respectively.
• ReLUs. ReLUs use the non-saturating activation function F(x) = max(0, x), element-wise, to increase the nonlinear properties of the network. ReLUs are used in the first and the second convolutional layers, and after the first FC layer.
• Max-Pooling. Max-pooling (with stride 2 and padding 0) is used to partition the input feature maps into sets of 3 × 3 rectangular sub-regions. It outputs the maximum value for each sub-region.
• Fully-Connected (FC) Layers. FC layers have full connections to all activations in the previous layer. Two FC layers, with 64 and 2 neurons, are used to provide the classification output.
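The following is a PyTorch sketch of the architecture described above; it is not the authors' implementation (which was developed in MATLAB), and the flattened feature size (64 × 15 × 27) is inferred from the stated 66 × 112 input and layer parameters. The softmax is omitted, as it is usually folded into the training loss.

```python
import torch
import torch.nn as nn

class DepthCNNet(nn.Module):
    """PyTorch sketch of the HV ConvNet (2 conv + 2 max-pool + 2 FC layers,
    with ReLU activations and dropout), following the component list above."""
    def __init__(self, dropout=0.5):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(1, 32, kernel_size=5, stride=1, padding=2), nn.ReLU(),
            nn.MaxPool2d(kernel_size=3, stride=2, padding=0),
            nn.Conv2d(32, 64, kernel_size=5, stride=1, padding=2), nn.ReLU(),
            nn.MaxPool2d(kernel_size=3, stride=2, padding=0),
        )
        self.classifier = nn.Sequential(
            nn.Flatten(),
            nn.Linear(64 * 15 * 27, 64), nn.ReLU(),
            nn.Dropout(dropout),
            nn.Linear(64, 2),              # 'vehicle' vs. 'background' logits
        )

    def forward(self, x):                  # x: (batch, 1, 66, 112) depth-map crops
        return self.classifier(self.features(x))

net = DepthCNNet()
logits = net(torch.randn(4, 1, 66, 112))   # four resized DM proposals
print(logits.shape)                         # torch.Size([4, 2])
```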
6.1.4 DepthCN Optimization
DepthCN's online phase is composed of the HG and HV modules (see Fig. 6.1). The optimization of these modules is performed offline as follows.

HG Optimization

In the grid-based ground removal, the parameters are the grid cell size (υ) and the variance threshold (δ). The minimum number of points (η) and the distance threshold (ε) are related to DBSCAN. The optimal parameter values for ground removal and clustering were optimized jointly, using exhaustive search, by maximizing the overlap of the generated hypotheses with the ground-truth BBs (minimum overlap of 70%).

ConvNet Training using Augmented DM Data

The ConvNet was trained on the augmented 3D-LIDAR-based DMs. Data augmentation is the process of generating a large training dataset from a small dataset using different types of transformations, in a way that a balanced distribution is reached while the new dataset still resembles the distribution that occurs in practice (i.e., increasing the training data such that it still resembles what might happen under real-world conditions). In the proposed approach, a set of augmentation operations is performed (a brief sketch is given below): scaling and depth-value augmentations, to resemble closer and farther objects; flipping, to simulate the effect of driving in the opposite direction; jittering and aspect-ratio augmentations, to simulate the effect of potential calibration errors and inaccurate GT labeling; cropping, to resemble occlusions that may occur in practice; rotation, to resemble objects being at different positions on the road, and shifting each line with different small random biases, to resemble noise and depth-map generation errors. This process is performed to aggregate and to balance the training dataset, with two major goals and benefits: i) balancing the data among classes, reducing the bias of the ConvNet; ii) increasing the training data, helping the ConvNet to tune the large number of parameters in the network. The ConvNet training is performed after the DM-based augmentation, using Stochastic Gradient Descent (SGD) with dropout and ℓ2 regularization.
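A few of the augmentation operations listed above can be sketched as follows; only flipping, jittering and per-row noise are shown, and the magnitudes and function name are illustrative assumptions rather than the values used in the thesis.

```python
import numpy as np

def augment_dm(dm, rng):
    """Sketch of a few DM augmentation operations (horizontal flip, window
    jitter and small per-row biases); scaling, rotation and depth-value
    changes would be added in the same spirit."""
    out = dm.copy().astype(np.float32)
    if rng.random() < 0.5:                                   # horizontal flip
        out = out[:, ::-1]
    dy, dx = rng.integers(-3, 4, size=2)                     # jitter the crop window
    out = np.roll(out, (dy, dx), axis=(0, 1))
    out += rng.normal(0, 1.0, (out.shape[0], 1))             # small per-row bias (noise)
    return np.clip(out, 0, 255).astype(np.uint8)

rng = np.random.default_rng(4)
proposal_dm = rng.integers(0, 256, (66, 112)).astype(np.uint8)
augmented = [augment_dm(proposal_dm, rng) for _ in range(8)]
print(len(augmented), augmented[0].shape)
```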
6.2 Multimodal Object Detection
Most of the current successful object detection approaches are based on a class of deep learning models called Convolutional Neural Networks (ConvNets). While most existing
object detection research is focused on using ConvNets with color image data, emerging fields of application such as Autonomous Vehicles (AVs), which integrate a diverse set of sensors, require the processing of multisensor and multimodal information to provide a more comprehensive understanding of the real-world environment. This section proposes a multimodal vehicle detection system integrating data from a 3D-LIDAR and a color camera. Data from the LIDAR and the camera, in the form of three modalities, are the inputs of ConvNet-based detectors, which are later combined to improve vehicle detection. The modalities are: (i) an up-sampled representation of the sparse LIDAR range data, called dense-Depth Map (DM); (ii) a high-resolution map from the LIDAR reflectance data, hereinafter called Reflectance Map (RM), and (iii) the RGB image from a monocular color camera calibrated wrt the LIDAR. Bounding Box (BB) detections in each one of these modalities are jointly learned and fused by an Artificial Neural Network (ANN) late-fusion strategy to improve the detection performance of each modality. The contribution of the proposed approach is two-fold: 1) probing and evaluating 3D-LIDAR modalities for vehicle detection (specifically the depth and reflectance map modalities), and 2) joint learning and fusion of the independent ConvNet-based vehicle detectors (in each modality) using an ANN to obtain a more accurate vehicle detection.

Figure 6.6: The pipeline of the proposed multimodal vehicle detection algorithm. Cam. and 3L are abbreviations for Camera and 3D-LIDAR, respectively.
6.2.1 Fusion Detection Overview
The architecture of the proposed multimodal vehicle detection system is shown in Fig. 6.6. Three modalities, DM, RM (both generated from 3D-LIDAR data) and color image are used as inputs. Three YOLOv2-based object detectors are run individually on each modality to detect the 2D object BBs in the color image, DM and RM. 2D-BBs obtained in each of the three modalities are fused by a re-scoring function followed by a non-maximum suppression. The purpose of the multimodal detection fusion is to reduce the misdetection rate from each modality which leads to a more accurate detection. In the following, we start by describing the multimodal data generation. Next, we explain
the proposed multimodal fusion scheme including a brief introduction of the YOLOv2 framework, which is the ConvNet-based vehicle detector considered for each modality.
6.2.2 Multimodal Data Generation
The color image is readily available from the color camera. However, the 3D-LIDAR-based dense maps are not directly available and need to be computed. Assuming that the LIDAR and the camera are calibrated with respect to each other, the projection of the LIDAR points into the image plane is much sparser than the associated image. Such limited spatial resolution of the LIDAR makes object detection from sparse LIDAR data challenging. Therefore, in the proposed method, we propose to generate high-resolution (dense) map representations using LIDAR data to (i) perform deep-learning-based vehicle detection in the LIDAR dense maps and (ii) carry out a decision-level fusion strategy. Besides the depth map (DM), a dense reflectance map (RM) is also considered in the vehicle detection system. In the case of the DM, the variable to be interpolated is the range (distance), while the reflectance value (the 8-bit reflection return) is the variable to be interpolated to generate the RM. The LIDAR reflectance attribute is related to the ratio of the received beam power to that sent to a surface, which depends upon the distance, the material, and the angle between the surface normal and the ray. Fig. 6.8 shows an example color image followed by the dense maps (DM and RM) obtained using DT and nearest neighbor interpolation. The image and the LIDAR data used to obtain the dense maps are taken from the KITTI dataset.
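The dense-map generation (Delaunay triangulation of the projected points followed by nearest-vertex interpolation, as in Section 6.1.3) can be sketched with SciPy as follows; the synthetic projected points and map size are illustrative assumptions, and the 8-bit encoding step is omitted. The same routine serves for the DM (range values) or the RM (reflectance values).

```python
import numpy as np
from scipy.spatial import Delaunay

def dense_map(u, v, values, height, width):
    """Sketch of the dense-map generation: triangulate the projected LIDAR
    pixels (u, v), then fill every pixel inside a triangle with the value of
    that triangle's nearest vertex (Eq. 6.3)."""
    pts = np.column_stack([u, v])
    tri = Delaunay(pts)                                   # DT of the projected points
    gu, gv = np.meshgrid(np.arange(width), np.arange(height))
    query = np.column_stack([gu.ravel(), gv.ravel()]).astype(float)
    simplex = tri.find_simplex(query)                     # containing triangle (-1 = outside)
    out = np.zeros(height * width, dtype=values.dtype)
    inside = simplex >= 0
    verts = tri.simplices[simplex[inside]]                # 3 vertex indices per pixel
    d = np.linalg.norm(pts[verts] - query[inside, None, :], axis=2)
    nearest = verts[np.arange(len(verts)), d.argmin(axis=1)]
    out[inside] = values[nearest]                         # nearest-vertex interpolation
    return out.reshape(height, width)

rng = np.random.default_rng(5)
u = rng.uniform(0, 111, 400); v = rng.uniform(0, 65, 400)
depth = rng.uniform(2, 80, 400)                           # sparse projected ranges
print(dense_map(u, v, depth, height=66, width=112).shape)
```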
6.2.3 Vehicle Detection in Modalities
You Only Look Once (YOLO) [24, 25] is a real-time object detection system. In YOLO, object detection is defined as a regression problem and, taking advantage of a grid-based structure, object BBs and detection scores are obtained directly (i.e., without the need for an object proposal step). In this work, the most recent version of YOLO, denoted as YOLOv2 [25], is used. The YOLOv2 network is composed of 19 convolutional layers and 5 max-pooling layers. The input image (after resizing to 416 × 416 pixels) is divided into 13 × 13 grid regions, and five BB priors (anchors) are assumed in each grid cell. A non-maximum suppression is applied to suppress duplicated detections (see Section 3.3 for more details). YOLOv2 is trained individually on each of the three training sets (color, DM and RM). The result is three trained YOLOv2 models, one per modality.
6.2.4 Multimodal Detection Fusion
This section presents a multimodal detection fusion system that tries to use the associated confidence of individual detections (detection scores) and the detected BBs’ characteristics in each modality to learn a fusion model and deal with detection limitations in each modality.
Figure 6.7: Feature extraction and the joint re-scoring training strategy. Some of the different situations that may happen in tri-BB generation are depicted in the 'image plane' (on the left). The detections from YOLOv2-C, YOLOv2-D, YOLOv2-R and the ground-truth are depicted, in the image plane, with red, green, blue and dashed-magenta BBs, respectively. The feature extraction and the target are represented by matrices where each column corresponds to a feature and each row to a combination of detections (in the middle). Each matrix cell contains the colors corresponding to the detections contributing to the feature's value, or a dash on a gray background for an empty cell (zero).

Joint Re-Scoring using an MLP Network

The detections from the modalities are in the form of a set of BBs {BB_C, BB_D, BB_R} with their associated confidence scores {s_C, s_D, s_R}. The overlap between the BBs {BB_C, BB_D, BB_R} is computed, and boxes that overlap are considered to be detecting the same object. Then, a set of overlapping BBs is extracted and, for each detector present in the set, all combinations are extracted (see Fig. 6.7). The ideal result is a set of three BBs (henceforth called tri-BBs), each from one modality. If a given modality is not present in the set, the corresponding detector BB is considered to be empty (BB = ∅). A Multi-Layer Perceptron (MLP) neural network is used as a fitting function and applied over a set of attributes extracted from the tri-BBs, to learn the multi-dimensional nonlinear mapping between the BBs from the modalities and the ground-truth BBs. For each combination of BBs, the
attributes extracted (F) are as follows:

F = (s_C, s_D, s_R, BB_C, BB_D, BB_R, µ_x, µ_y, σ_x, σ_y, BB_M),    (6.4)

where s_C, s_D, s_R are the detection confidence scores and BB_C, BB_D, BB_R are the BBs corresponding to the color, DM and RM detectors. Every BB is defined by four properties {w, h, c_x, c_y}, which indicate the width, the height and the geometrical center in x and y (all normalized with respect to the image's width and height), respectively. The µ_x, µ_y, σ_x, and σ_y correspond to the average of all available BBs' geometrical centers and their standard deviations. BB_M corresponds to the minimum bounding box that contains all non-empty bounding boxes in the combination. In cases where combinations do not contain one or two detectors, the scores and BBs for those detectors are set to zero and are not included in the computation of the average, standard deviation and minimum containing bounding box (see Fig. 6.7). This results in a feature vector of size 23. The associated set of target data (T), defining the desired output, is determined as a set of three intersection-over-union (IOU) metrics:

T = (IOU_C, IOU_D, IOU_R),    (6.5)

IOU_i = Area(BB_i ∩ BB^G) / Area(BB_i ∪ BB^G),    (6.6)

where i denotes each modality {C, D, R} and the superscript G denotes the ground-truth BB. Once the MLP has fit the data, it forms a generalization of the 'extracted features from tri-BBs' and their 'intersection-over-union overlap with the ground-truth'. The trained MLP learns to estimate the overlap of the tri-BBs with the ground-truth and re-scores the tri-BBs based on those estimates. A simple average rule of the scores is applied when there are multiple scores for the same BB. The re-scoring function generates, per frame, the same set of detection BBs from the different modalities, i.e., {BB_C, BB_D, BB_R}, with the re-scored detection confidences {s′_C, s′_D, s′_R} (see Fig. 6.6).
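The feature extraction of (6.4) for one combination of detections can be sketched as follows; the dictionary-based input format and function name are illustrative assumptions, and the MLP training itself (fitting the features to the IOU targets of (6.5)–(6.6)) is omitted.

```python
import numpy as np

def fusion_features(dets):
    """Sketch of Eq. (6.4): build the 23-D feature vector for one combination
    of detections. `dets` maps a modality ('C', 'D', 'R') to (score, bb) with
    bb = (w, h, cx, cy) normalized by the image size; missing modalities are
    zeroed and excluded from the mean/std/enclosing-box statistics."""
    scores, boxes, centers, corners = [], [], [], []
    for m in ('C', 'D', 'R'):
        if m in dets:
            s, (w, h, cx, cy) = dets[m]
            scores.append(s); boxes.extend([w, h, cx, cy]); centers.append((cx, cy))
            corners.append((cx - w / 2, cy - h / 2, cx + w / 2, cy + h / 2))
        else:
            scores.append(0.0); boxes.extend([0.0] * 4)          # empty detector slot
    centers = np.array(centers)
    mu = centers.mean(axis=0)
    sigma = centers.std(axis=0)
    x0, y0 = np.min(np.array(corners)[:, :2], axis=0)            # enclosing box BB_M
    x1, y1 = np.max(np.array(corners)[:, 2:], axis=0)
    bbm = [x1 - x0, y1 - y0, (x0 + x1) / 2, (y0 + y1) / 2]
    return np.array(scores + boxes + [mu[0], mu[1], sigma[0], sigma[1]] + bbm)

tri = {'C': (0.78, (0.10, 0.08, 0.40, 0.55)),
       'D': (0.47, (0.11, 0.08, 0.41, 0.56))}                    # RM detector missing here
f = fusion_features(tri)
print(f.shape)    # (23,) -- fed to the MLP, whose targets are the IOUs of Eq. (6.6)
```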
Non-Maximum Suppression

The input to the Non-Maximum Suppression (NMS) module is a set of BBs in the same 'neighborhood' area, which is a consequence of having a multimodal detection system. This could degrade the performance of the detection algorithm, as can be seen later on in the experimental results section. To solve this, NMS is used to discard multiple detected occurrences around close locations, i.e., to retain the locally most-confident detection. The ratio Υ between the intersection and the union area of the overlapping detection windows is calculated and, for Υ > 0.5 (value obtained experimentally), the detection window with the greatest confidence score is retained and the remaining detections are suppressed. Further strategies to perform NMS are addressed by Franzel et al. [132]. An example of the fusion detection process is shown in Fig. 6.8.
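A greedy IoU-based NMS, of the kind described above, can be sketched as follows; the corner-based box format and the example boxes are illustrative assumptions.

```python
import numpy as np

def nms(boxes, scores, thresh=0.5):
    """Sketch of the NMS stage: greedily keep the locally most-confident BB and
    suppress any remaining BB whose IoU with it exceeds thresh (0.5 in the text).
    Boxes are (x0, y0, x1, y1); detections here come from all three modalities."""
    boxes, scores = np.asarray(boxes, float), np.asarray(scores, float)
    order = scores.argsort()[::-1]
    keep = []
    while order.size:
        i = order[0]
        keep.append(i)
        x0 = np.maximum(boxes[i, 0], boxes[order[1:], 0])
        y0 = np.maximum(boxes[i, 1], boxes[order[1:], 1])
        x1 = np.minimum(boxes[i, 2], boxes[order[1:], 2])
        y1 = np.minimum(boxes[i, 3], boxes[order[1:], 3])
        inter = np.maximum(x1 - x0, 0) * np.maximum(y1 - y0, 0)
        area_i = (boxes[i, 2] - boxes[i, 0]) * (boxes[i, 3] - boxes[i, 1])
        area_o = (boxes[order[1:], 2] - boxes[order[1:], 0]) * (boxes[order[1:], 3] - boxes[order[1:], 1])
        iou = inter / (area_i + area_o - inter)
        order = order[1:][iou <= thresh]                 # drop overlapping detections
    return keep

boxes = [(10, 10, 60, 50), (12, 11, 62, 52), (100, 30, 140, 70)]
print(nms(boxes, scores=[0.72, 0.55, 0.57]))             # -> [0, 2]
```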
Figure 6.8: Illustration of the fusion detection process: (a) detections from YOLOv2-C (red); (b) YOLOv2-D (green), and (c) YOLOv2-R (blue) with associated confidence scores; (d) represents the merged detections, and (e) shows the fusion vehicle detection results (cyan) after re-scoring and NMS compared to ground-truth (dashed-magenta). A dashed cyan BB indicates detections with confidence less than 0.2 that can be discarded by simple post-processing.
Chapter 7

Experimental Results and Discussion

Contents
7.1 Obstacle Detection Evaluation
    7.1.1 Static and Moving Obstacle Detection
    7.1.2 Multisensor Generic Object Tracking
7.2 Object Detection Evaluation
    7.2.1 3D-LIDAR-based Detection
    7.2.2 Multimodal Detection Fusion
In this chapter we describe the experiments carried out to evaluate the proposed obstacle and object detection algorithms using the KITTI dataset. Comparative studies with other algorithms were performed whenever it was possible.
7.1 Obstacle Detection Evaluation
The obstacle detection performance assessment consists of the evaluation of the proposed ground estimation; the stationary and moving obstacle detection and segmentation; the 2.5D grid-based DATMO, and the proposed fusion tracking algorithms. These algorithms were previously described in Chapter 5.
7.1.1 Static and Moving Obstacle Detection
In this subsection we describe the evaluation of the proposed ground surface estimation and the stationary – moving obstacle detection methods. The parameter values used in the implementation of the proposed algorithm are reported in Table 7.1.
Table 7.1: Values considered for the main parameters used in the proposed obstacle detection algorithm.

Parameter   m   η   ℓ    τ°   d_min   T_d   υ    T_s
Value       6   6   10   10   20      5     10   50
The first parameter m is a general parameter indicating the number of merged scans. The next four parameters (η, τ°, ℓ and d_min) are related to the ground surface estimation: η is the number of ∆α intervals used to compute each slice's limits; τ° and ℓ are, respectively, the maximum acceptable angle and distance between two planes, applied in the validation phase of the piecewise plane fitting. The parameter d_min is a threshold in centimeters: points with heights lower than d_min from the piecewise planes are considered as part of the ground plane. The last three parameters (υ, T_d and T_s) are used to configure the obstacle detection algorithm: υ is the voxel size in centimeters, and T_d and T_s are thresholds for computing the binary mask of stationary and moving voxels. The proposed approach detects obstacles in an area covering 25 meters ahead of the vehicle, 5 meters behind it and 10 meters on the left and right sides of the vehicle, with 2 meters in height. Parameters m and η were selected experimentally as described in the next subsection.
Figure 7.1: Evaluation of the proposed ground estimation algorithm in terms of mDE, obtained by varying the number of integrated scans m and the parameter η (related to the slice sizes).
Evaluation of Ground Estimation

For the evaluation of the ground estimation process, inspired by [58], we assume that all objects are placed on the same surface as the vehicle and that the base points of the GT 3D-BBs (available in the KITTI dataset) are located on the real ground surface (see Section 3.2.2 and Fig. 3.3). The ground estimation error is calculated by taking the average distance from the base points of the GT 3D-BBs to the estimated ground surface. Concretely, the mean of Displacement Errors (mDE) in the i-th frame is defined by

mDE(i) = (1/M) ∑_{k=1}^{M} |(p_k^G − p) · n̂|,    (7.1)

where p_k^G denotes the base of the GT 3D-BB of the k-th object; M is the total number of objects in the i-th frame, and the variables p and n̂ are the point and the unit normal vector that define the corresponding surface plane, respectively. The mDE for all sequences is computed by

mDE = (1/N) ∑_{i=1}^{N} mDE(i),    (7.2)

where i ranges from 1 to the total number N of frames in all 8 sequences of Table 3.1. The mDE was computed for different numbers of integrated frames m and values of η. The results are reported in Fig. 7.1. The minimum mDE = 0.086 is achieved by the combination of m = 6 and η = 6.

Evaluation of Stationary – Moving Obstacle Detection

The obstacle detection evaluation using a 3D-LIDAR is a challenging task. To the best of our knowledge, there is no available dataset with Ground-Truth (GT) for ground estimation or general obstacle detection evaluations (the benchmarks usually provide specific object classes, e.g., pedestrians and vehicles). The closest work to ours is presented in [52], where the evaluation for obstacle detection is carried out according to the number of missed and false obstacles, determined by a human observer. We followed a similar approach for evaluating the proposed obstacle detection system (also see Section 3.2.2). A number of 200 scans (25 scans for each sequence) were selected randomly out of the more than 2300 scans available on the dataset (see Table 3.1). An evaluation was performed for the general obstacle detection and another one for the moving obstacle detection (see Fig. 7.2).
• General Obstacle Detection Evaluation. For evaluating the general obstacle detection, the voxel grids of the stationary and moving obstacles are projected into the corresponding image, and a human observer performs a visual analysis in terms of 'missed' and 'false' obstacles. It should be noticed that every distinct element identified above the terrain level is considered as an obstacle (e.g., pole, tree, wall, car and pedestrian). The total number of missed obstacles is 186 out of 3,011.
Figure 7.2: An example of the obstacle detection evaluation. Red and green voxels show results of the proposed method. The 3D-BBs of stationary and moving obstacles are shown in red and green, respectively. Only green boxes are considered for the evaluation of the moving obstacle detection algorithm performance. Blue arrows show two missed obstacles (thin and small poles).

Table 7.2: Results of the evaluation of the proposed obstacle detection algorithm.

Seq.    # of Obst. (Obst. / Mov.)   # of Missed (Obst. / Mov.)   # of False (Obst. / Mov.)
(1)     501 / 59                    83 / 0                       0 / 4
(2)     288 / 28                    56 / 0                       0 / 7
(3)     281 / 61                    24 / 1                       0 / 9
(4)     381 / 94                    10 / 2                       0 / 0
(5)     254 / 83                    1 / 0                        7 / 0
(6)     791 / 551                   9 / 0                        37 / 8
(7)     336 / 215                   1 / 0                        46 / 2
(8)     179 / 110                   2 / 0                        0 / 1
Total   3011 / 1201                 186 / 3                      90 / 31
The total number of falsely detected obstacles is 90. Table 7.2 reports the details of the obstacle detection results for each sequence. The highest number of missed obstacles occurs in sequences (1) and (2), which contain many thin and small poles. Most of the false detections happen in sequences (6) and (7), which contain slowly moving objects. Some parts of very slowly moving objects may have been seen several times in the same voxels and may therefore wrongly be integrated into the static model of the environment. The shadow of the wrongly modeled stationary obstacle stays for a few scans and causes the false detection.
• Moving Obstacle Detection Evaluation. The proposed obstacle detection method is able to discriminate moving parts from the static map of the environment. Therefore, we performed an additional evaluation for measuring the performance of the moving obstacle detection. Among the 1201 moving obstacles present in the considered scans, only 3 were missed.
Table 7.3: Percentages of the computational load of the different steps of the proposed system: (a) dense PCD generation, (b) piecewise ground surface estimation, (c) ground - obstacle separation and voxelization, and (d) stationary - moving obstacle segmentation.

(a) 83.2%   (b) 7.1%   (c) 7.7%   (d) 2%
A number of 31 obstacles were wrongly labeled as moving, mainly due to localization errors. Localization errors cause thin poles, observed in different locations by the ego-vehicle's perception system, to be wrongly considered as moving obstacles. The result for each sequence is also shown in Table 7.2.

Computational Analysis

The experiments reported in this section were conducted on the first sequence of Table 3.1, using a quad-core 3.4 GHz processor with 8 GB RAM under MATLAB R2015a.
• Processing load of the different steps. In order to evaluate which steps of the algorithm are more time consuming, the percentages of the processing loads of the different phases are reported in Table 7.3. The first stage is the most computationally demanding part of the algorithm, mostly because of the ICP algorithm (consuming 83.2% of the computational time). The piecewise ground surface estimation and ground - obstacle separation modules account for 14.8% of the total computational time.
• Main factors affecting the computational cost. The computational cost of the proposed method depends on the size of the local grid, the size of a voxel, the number of integrated scans, and the number of non-empty voxels (this is because only non-empty voxels are indexed and processed). The considered evaluation scenario has on average nearly 1% non-empty voxels. The size of a voxel and the number of integrated PCDs are two key parameters that correspond to the spatial and temporal properties of the proposed algorithm, and have a direct impact on the computational cost of the method. The average speed of the proposed algorithm (in frames per second) wrt the voxel size and the number of integrated PCDs is reported in Fig. 7.3. As can be seen, the number of integrated scans has the greatest impact on the computational cost of the proposed method. The proposed method, configured with the parameters listed in Table 7.1, works at about 0.3 fps.
• The accuracy and the computational cost. There is a compromise between the computational cost and the detection performance of the proposed method.
Figure 7.3: Computational analysis of the proposed method as a function of the number of integrated scans m and the voxel size υ (the voxel volume is given by υ × υ × υ).

Clearly, as the number of integrated scans increases, the performance in terms of stationary and moving object detection improves. However, it adds an additional computational cost and makes the method slower. On the other hand, fewer integrated scans make the environment model weaker. Overall, the proposed approach presents satisfactory results when the number of integrated scans is greater than 4 (the considered parameter value m = 6 meets this condition, see Fig. 7.1).

Qualitative Results

In order to qualitatively evaluate the performance of the proposed algorithm, 8 sequences were used (see Table 3.1). The most representative results are summarized in Fig. 7.4 and Fig. 7.5, in which each row corresponds to one sequence. The proposed method detects and segments the stationary and moving obstacles' voxels around the ego-vehicle when they get into the AV's local perception field. In the first sequence, our method detects a cyclist and a car as moving obstacles while they are in the perception field, and models the walls and stopped cars as part of the static model of the environment. In sequences (2) and (3), the ego-vehicle is moving on urban roads. The proposed method models trees, poles and stopped cars as part of the stationary environment, and moving cars and pedestrians as dynamic obstacles. Sequence number (4) shows a downtown area, where the proposed method successfully modeled moving pedestrians and cyclists as part of the dynamic portion of the environment. Pedestrians that are not moving correctly become part of the sta-
Figure 7.4: A few frames of the obstacle detection results obtained for sequences 1 to 4 as listed in Table 3.1, and their corresponding representation in three dimensions. Piecewise ground planes are shown in blue. Stationary and moving voxels are shown in green and red, respectively. Each row represents one sequence. From left to right we see the results obtained at different time instants.
tionary model of the environment. Sequence number (5) shows a crosswalk scenario. Our method models passing pedestrians as moving objects, represented in the image by green voxels. In sequences number (6) and (7), the vehicle is not moving. Most of the moving objects are pedestrians which our method successfully detects. In particular,
Figure 7.5: A few frames of obstacle detection results obtained for sequences 5 to 8 as listed in Table 3.1 and its corresponding representation in three dimensions. Piecewise ground planes are shown in blue. Stationary and moving voxels are shown in green and red respectively.
notice the last image of sequence number (6) and the first image of sequence number (7), which represent very slowly moving pedestrians that may temporarily be modeled as stationary obstacles; this will not be critical in practical applications. Notice also the curvature of the ground surface in sequence number (7), which cannot be modeled using just one plane. Sequence number (8) shows a road with moving vehicles.
Table 7.4: Values considered for the main parameters used in the proposed 2.5D grid-based DATMO algorithm.

Parameter   υ    T_σ   T_µ   n    m    k   ε   α
Value       20   2     30    50   30   3   5   0.2
The proposed method performs well on most of the moving vehicles. When vehicles are stopped in traffic, they gradually become part of the static model of the environment.

Extension to DATMO

In this subsection, a qualitative analysis of the grid-based DATMO system is performed. The main parameter values used in the implementation of the 2.5D DATMO algorithm are reported in Table 7.4. The parameters υ, T_σ and T_µ are related to the Elevation grid generation: the grid resolution υ (in the x-y dimensions) is chosen to be equal to 20 cm, and T_σ and T_µ are the variance and height thresholds for an Elevation grid cell, learned empirically. The n, m, and k are the parameters for the short-term map generation. The spatial ε and height α parameters are linked to the motion detection module. More specifically, the set of all cells that lie at a (spatial) distance of ε cells from the j-th cell is considered for motion detection in the j-th cell (see Fig. 5.9). The α threshold is used to calculate T_e(j) = α × E(j), which is the maximum acceptable difference of cell height values. Notice that E(j) is the height value in the j-th cell. In this work, the radius ε was considered as being 5 cells, which is a sufficient number of cells to compensate for a maximum localization error of 1 m. The coefficient α can take a range of values from 0.2 to 0.5. To the best of our knowledge, there is no standard dataset available to evaluate a DATMO approach, which is why in this section a qualitative evaluation is performed. A variety of challenging sequences were used. The most representative sequences are summarized in Fig. 7.6. This figure is composed of two kinds of representations: the RGB image of the scene and the grid representation of the scene. The 2.5D motion grid, the 3D-BBs and the tracks of the moving objects are shown in the grid representation. The blue dots correspond to the 3D-LIDAR PCDs, and the vectors in the center of the local grid show the pose of the vehicle. Only the 3D-BBs of the detected moving objects are shown in the RGB image. The selected sequences are: (1) vehicles circulating on a highway; (2) a road junction scenario, and (3) a crossing scenario. In the first scenario, the proposed DATMO system detects and tracks all the moving vehicles when they get into the local perception field. In the road junction scenario, in the early frames a vehicle comes from a different lane and in the next frames two vehicles join the road. Our method successfully detects all moving objects. In the crossing scenario, the proposed DATMO system successfully detected the vehicles passing by.
Figure 7.6: 2.5D grid-based DATMO results of 3 typical sequences. Each row represents one sequence. From top to bottom, results for: (1) vehicles circulating on a highway; (2) a road junction scenario, and (3) a crossing scenario. Left to right we see the results obtained in different time instants.
7.1.2 Multisensor Generic Object Tracking
In this section, we present the evaluation of the proposed multisensor 3D single-object tracking approach. The parameter values used in the proposed fusion tracking implementation are reported in Table 7.5, where η and η′ are the maximum numbers of Mean Shift (MS) iterations in the PCD and image domains, respectively; a displacement δℓ < 5 cm in the PCD and δℓ′ < 1 pixel in the image are considered for MS convergence; dmin is the threshold (in cm) for the ground removal process, and b = 8 is the number of histogram bins for each color channel.

Table 7.5: Values considered for the main parameters used in the proposed 3D fusion tracking algorithm.

Parameter   η    η′   δℓ   δℓ′   dmin   b
Value       5    4    5    1     20     8

The proposed high-level fusion method (H-Fus.) was evaluated against five tracking methods on our KITTI-based derived dataset (see Section 3.1.3 and Table 3.2). The selected methods operate on the image, the PCD, or their fusion. Two image-based MS variants: (1) the original MS [133], and (2) MS with Corrected Background Weighted Histogram (CBWH) [134]. Three PCD-based methods: (3) a baseline KF-based tracker
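As a small illustration of the color model implied by b = 8 bins per channel, the sketch below builds an RGB histogram in Python. Whether the thesis uses a joint b³-bin histogram or independent per-channel histograms is not stated here, so the independent-channel variant is only an assumption; the function name is illustrative.

```python
import numpy as np

def color_histogram(patch, bins=8):
    """Concatenated per-channel RGB histogram (b bins per channel),
    L1-normalized; 'patch' is a uint8 image region of shape (H, W, 3)."""
    hist = []
    for ch in range(3):
        h, _ = np.histogram(patch[..., ch], bins=bins, range=(0, 256))
        hist.append(h)
    hist = np.concatenate(hist).astype(float)
    return hist / max(hist.sum(), 1.0)

# usage: model = color_histogram(image[y0:y1, x0:x1])
```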
that uses the 'point model' and a 3D CA-KF with Gating Data Association (DA); (4) MS-based object detection and localization in the 3D-PCD, and (5) a low-level fusion approach (L-Fus.) that uses MS on the colored PCD (obtained by combining PCD and RGB data) and a CA-KF. MS-I, CBWH, KF, MS and L-Fus. are the abbreviations used for methods (1) to (5), respectively, in Table 7.6 and Table 7.7. The KF and MS methods are further described in Appendix A. To assess the proposal's performance, the object's center position errors in 2D and 3D and the object orientation error in the x-y plane were evaluated.

Evaluation of Position Estimation

The Euclidean distances of the center of the computed 2D-BB or 3D-BB from the 2D/3D Ground-Truth (GT) (extracted from the KITTI dataset) are given by

E_{2D} = \frac{1}{N}\sum_{i=1}^{N}\sqrt{(r_i - r_i^{G})^2 + (c_i - c_i^{G})^2}, \qquad
E_{3D} = \frac{1}{N}\sum_{i=1}^{N}\sqrt{(x_i - x_i^{G})^2 + (y_i - y_i^{G})^2 + (z_i - z_i^{G})^2}    (7.3)

where r_i, c_i denote the detected object position (the 2D-BB center); x_i, y_i, z_i indicate the center of the 3D-BB in the PCD; r_i^{G}, c_i^{G} and x_i^{G}, y_i^{G}, z_i^{G} denote the GT, and N is the total number of scans.

Table 7.6 summarizes the evaluation results. A dash entry (–) represents a failure of the given algorithm to track the object. The MS provides the smallest error when it does not fail; however, it is prone to errors, mostly because it starts diverging to nearby objects or obstacles in cluttered environments. The proposed method is the only one with stable results while keeping the center position error low. The MS-I and CBWH methods are very fragile, as essentially they are not designed to overcome the challenging factors of real-world driving environments (see Table 3.2).

Evaluation of Orientation Estimation

The GT orientation of each object is given only in terms of the Yaw angle, which describes the object's heading (i.e., the rotation around the y-axis in camera coordinates in the KITTI dataset [135]). For the 3D approach, the orientation error was computed by

E_{\varphi} = \frac{1}{N}\sum_{i=1}^{N}\arctan\frac{\lvert \vec{\varphi}_i \times \vec{\varphi}_i^{G} \rvert}{\lvert \vec{\varphi}_i \cdot \vec{\varphi}_i^{G} \rvert}    (7.4)

where $\vec{\varphi}_i^{G}$ is the object's GT orientation.
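A minimal Python sketch of the error measures of Eq. (7.3) and Eq. (7.4) is given below; arctan2 is used as the numerically robust form of the arctangent ratio, and the heading vectors are assumed to be given as 2D vectors in the x-y plane.

```python
import numpy as np

def center_errors(est_2d, gt_2d, est_3d, gt_3d):
    """E_2D and E_3D as mean Euclidean distances over N scans.
    est_2d, gt_2d: (N, 2) arrays of (r, c); est_3d, gt_3d: (N, 3) arrays of (x, y, z)."""
    e2d = np.linalg.norm(est_2d - gt_2d, axis=1).mean()
    e3d = np.linalg.norm(est_3d - gt_3d, axis=1).mean()
    return e2d, e3d

def orientation_error(phi, phi_gt):
    """E_phi: mean angle between estimated and GT heading vectors in the x-y plane.
    phi, phi_gt: (N, 2) arrays of 2D direction vectors."""
    cross = phi[:, 0] * phi_gt[:, 1] - phi[:, 1] * phi_gt[:, 0]   # scalar cross product
    dot = np.sum(phi * phi_gt, axis=1)
    return np.mean(np.arctan2(np.abs(cross), np.abs(dot)))
```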
Table 7.6: Average object's center position errors in 3D (meters) and 2D (pixels).

        ------ 3D errors (m) ------    ----------------- 2D errors (pixels) -----------------
Seq.    H-Fus.  KF      MS     L-Fus.   H-Fus.  KF      MS     L-Fus.   MS-I    CBWH
(1)     0.30    –       0.21   0.25     9.2     –       8.4    11.7     263.6   298.1
(2)     1.98    17.69   1.84   –        12.0    407.1   15.1   –        208.3   306.1
(3)     1.67    –       1.54   1.62     3.9     –       7.2    8.6      16.3    37.5
(4)     0.39    –       –      –        5.7     –       –      –        217.8   333.1
(5)     0.22    2.90    0.18   1.44     12.4    157.0   12.1   53.6     279.0   128.9
(6)     0.19    –       0.11   0.15     13.6    –       10.2   15.8     420.6   418.7
(7)     0.26    2.30    0.19   0.18     19.5    186.9   16.3   14.5     118.8   192.6
(8)     0.17    0.82    0.15   0.20     22.8    51.2    17.1   25.8     225.0   162.1
Table 7.7: Orientation estimation evaluation (in radians).

Seq.    H-Fus.   KF      MS      L-Fus.
(1)     0.41     –       0.39    0.39
(2)     0.41     1.25    0.42    –
(3)     0.11     –       0.13    0.14
(4)     0.10     –       –       –
(5)     0.20     0.56    0.13    0.24
(6)     0.20     –       0.14    0.16
(7)     0.26     0.92    0.15    0.16
(8)     0.15     0.29    0.14    0.15
As can be seen from Table 7.7, the most stable results are provided by the proposed H-Fus. method. Similarly to the position estimation evaluation, the MS provides the smallest error; however, as stated before, MS is error-prone (i.e., it suffers from diverging to nearby clutter). The proposed H-Fus. approach compensates for this type of error by using KFs in the fusion and tracking framework. A further analysis in terms of object pose variations and other challenges is presented in the qualitative results section.

Computational Analysis

The experiments were performed using a quad-core 3.4 GHz processor with 8 GB RAM under MATLAB R2015a. The non-optimized implementation of the proposed method runs at about 4 fps. The major computational cost is due to the bank of 1D KFs which keeps track of the object color histogram (see Subsection 5.3.3 for details).
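The bank of 1D KFs mentioned above smooths the object color histogram over time (Subsection 5.3.3). The filter model is not given here, so the following is a minimal, hedged Python sketch using a random-walk scalar KF per histogram bin; the class name and noise variances q and r are assumptions.

```python
import numpy as np

class HistogramKFBank:
    """One scalar (random-walk) Kalman filter per histogram bin."""
    def __init__(self, n_bins, q=1e-3, r=1e-2):
        self.x = np.zeros(n_bins)        # filtered bin values
        self.p = np.ones(n_bins)         # per-bin variances
        self.q, self.r = q, r            # process / measurement noise (assumed)

    def update(self, measured_hist):
        self.p = self.p + self.q                     # predict
        k = self.p / (self.p + self.r)               # Kalman gain per bin
        self.x = self.x + k * (measured_hist - self.x)
        self.p = (1.0 - k) * self.p
        return self.x
```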
Qualitative Results

Results obtained by the proposed algorithm are shown in Fig. 7.7 and Fig. 7.8. The proposed method successfully tracks the objects throughout the considered sequences.

[Figure 7.7 panels, one row per sequence (1)–(4): frames #28, #73, #153 of 154; #9, #84, #126 of 154; #6, #87, #217 of 373; #6, #20, #35 of 41.]
Figure 7.7: Object tracking results obtained for sequences 1 to 4 as listed in Table 3.2, and their corresponding representations in 3D space. In the image, blue and red polygons denote the detected object region and its surrounding area. In the PCD, the object trajectory is represented by a yellow curve, the 3D-BB is shown in blue and the GT 3D-BB in red. The detected ground points are shown in red. Each row represents one sequence. From left to right, the results obtained at different time instants are shown.
[Figure 7.8 panels, one row per sequence (5)–(8): frames #38, #55, #153 of 170; #26, #30, #54 of 63; #2, #20, #40 of 71; #221, #291, #383 of 387.]
Figure 7.8: Object tracking results obtained for sequences 5 to 8 as listed in Table 3.2, and their corresponding representations in 3D space. In the image, blue and red polygons denote the detected object region and its surrounding area. In the PCD, the object trajectory is represented by a yellow curve, the 3D-BB is shown in blue and the GT 3D-BB in red. The detected ground points are shown in red. Each row represents one sequence. From left to right, the results obtained at different time instants are shown.
Next, qualitative evaluations are presented in terms of occlusion, illumination, velocity, pose and size variations.

• Occlusion. In sequence (2) the tracked car is occluded by a parked van. In (5) and (6) the pedestrians are occluded by other pedestrians. In (7) the pedestrian is occluded by bushes for a number of frames.

• Illumination Variations. In (1)-(3) and (6)-(8) the tracked objects go through illumination changes.

• Velocity Variations. In sequences (2), (3) and (8) the object of interest moves with an unsteady velocity throughout the sequence. In sequence (3) the red car accelerates and then stops at a crossroad and a crosswalk.

• Pose and Size Variations. In (2), (3) and (8) large variations in the object size occur, mostly because the distance to the ego-vehicle is changing. The object pose also varies in almost all sequences.
7.2 Object Detection Evaluation
Quantitative and qualitative experiments using the 'Object Detection Evaluation' dataset from the KITTI Vision Benchmark Suite [23] were performed to validate the performance of the proposed DepthCN and multimodal object detection systems. Please refer to Sections 3.1.2 and 3.2.2 for details of the 'Object Detection Evaluation' dataset and the evaluation metrics, respectively. During the experiments, only the 'Car' label was considered for evaluation.
7.2.1 3D-LIDAR-based Detection
In this section we present the evaluation of DepthCN. The proposed DepthCN relies on the Velodyne LIDAR (range data only). The maximum range in the DepthCN algorithm is limited to 80 m (specifically, this value is used for generating the DM). The original KITTI training dataset was divided into two sets, training (80%) and validation (20%), and DepthCN was optimized using these training and validation data. We considered the depth map as a grayscale image and employed LeNet-5 [19] (which was designed for character recognition from grayscale images), with some slight modifications, as the ConvNet architecture. DepthCN was evaluated in terms of classification and detection accuracy and computational cost. Results are provided in the next subsections.
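For illustration, a LeNet-5-style classifier for 66 × 112 grayscale depth maps could look as follows. This is a PyTorch sketch under stated assumptions: the thesis implementation is in MATLAB, and the exact filter counts and layer sizes of its "slight modifications" are not given here, so the ones below are illustrative.

```python
import torch
import torch.nn as nn

class DepthLeNet(nn.Module):
    """LeNet-5-style classifier for 66x112 depth-map hypotheses
    (vehicle / non-vehicle). Layer sizes are illustrative assumptions."""
    def __init__(self, n_classes=2):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(1, 6, kernel_size=5), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(6, 16, kernel_size=5), nn.ReLU(), nn.MaxPool2d(2),
        )
        self.classifier = nn.Sequential(
            nn.Flatten(),
            nn.Linear(16 * 13 * 25, 120), nn.ReLU(), nn.Dropout(0.5),
            nn.Linear(120, 84), nn.ReLU(),
            nn.Linear(84, n_classes),
        )

    def forward(self, x):                      # x: (batch, 1, 66, 112)
        return self.classifier(self.features(x))

# sanity check of the assumed input size
out = DepthLeNet()(torch.zeros(1, 1, 66, 112))   # -> shape (1, 2)
```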
Table 7.8: The ConvNet's vehicle recognition accuracy with (W) and without (WO) applying data augmentation (DA).

Dataset          WO-DA     W-DA
Training set     92.83%    96.02%
Validation set   86.69%    91.93%
Table 7.9: DepthCN vehicle detection evaluation (given in terms of average precision) on the KITTI test set.

Approach     Easy      Moderate   Hard
DepthCN      37.59 %   23.21 %    18.01 %
mBoW [105]   36.02 %   23.76 %    18.44 %
Evaluation of Recognition

The ConvNet training is performed after the 3D-LIDAR DM-based data augmentation (see Subsection 6.1.4). Stochastic Gradient Descent (SGD) with a mini-batch size of 128, a momentum of 0.9, a maximum of 40 epochs, 50% dropout and ℓ2 regularization was employed for training the ConvNet. Considering an input DM of size 66 × 112, the accuracy of the implemented ConvNet for vehicle classification with and without data augmentation is reported in Table 7.8. The data augmentation improved the accuracy by more than 5 percentage points.

Evaluation of Detection

DepthCN was evaluated against mBoW [105], which is one of the most relevant methods and, like ours, operates directly on the 3D-LIDAR's range data. Both methods address the (class-specific) object detection problem by assuming an intermediate cluster representation that approximately corresponds to the individual obstacles standing on the ground (in practice, the segmented obstacles in the LIDAR data can be used for free-space computation to avoid collisions). mBoW uses hierarchical segmentation with bag-of-words classifiers, whereas DepthCN uses DBSCAN with a ConvNet classifier (a minimal clustering sketch is given at the end of this subsection). Results for vehicle detection are given in terms of Average Precision (AP) in Table 7.9. As can be noted from the table, DepthCN surpasses mBoW by about 1.5 percentage points in the Easy difficulty level, while slightly underperforming in the Moderate and Hard levels. The Precision-Recall curves are shown in Fig. 7.9.

Computational Analysis

The experiments with DepthCN were performed using a hexa-core 3.5 GHz processor powered with a GTX 1080 GPU and 64 GB RAM under MATLAB R2017a. The run-time of DepthCN (unoptimized implementation) for processing a point cloud is about 2.3 seconds, in comparison with the 10-second processing time of mBoW (implemented in C/C++ on a single 2.5 GHz core).
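As a sketch of the hypothesis generation step (clustering of ground-removed LIDAR points with DBSCAN [1]), the following Python snippet uses scikit-learn; the eps and min_samples values are assumptions, not the thesis settings.

```python
import numpy as np
from sklearn.cluster import DBSCAN

def generate_hypotheses(points_above_ground, eps=0.6, min_samples=20):
    """Cluster ground-removed LIDAR points into object hypotheses with DBSCAN.
    Returns one (xmin, ymin, zmin, xmax, ymax, zmax) box per cluster."""
    labels = DBSCAN(eps=eps, min_samples=min_samples).fit_predict(points_above_ground)
    boxes = []
    for lab in set(labels) - {-1}:              # -1 marks noise points
        cluster = points_above_ground[labels == lab]
        boxes.append(np.concatenate([cluster.min(axis=0), cluster.max(axis=0)]))
    return boxes
```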
Figure 7.9: Precision-Recall on the KITTI testing dataset for the easy, moderate and hard Car detection difficulty levels.

Qualitative Results

Qualitative results are provided in Fig. 7.10. The proposed method detects all obstacles (generic objects) in the environment (in the form of object proposals, as shown by the red rectangles in Fig. 7.10), and then classifies the target class of object (i.e., the Car class). As can be seen, the 3D PCD of each generic object is retrievable from the 3D-LIDAR's PCD as well. It is observed that the proposed method performs better for closer objects.
7.2.2 Multimodal Detection Fusion
The dataset was partitioned into three subsets: 60% as training set (4489 observations), 20% as validation set (1496 observations), and 20% as testing set (1496 observations). The experiments were carried out using a hexa-core 3.5 GHz processor powered with a GTX 1080 GPU and 64 GB RAM. The YOLOv2 416×416 detection framework² [25] was used in the experiments.

² https://pjreddie.com/darknet/yolo/
Figure 7.10: A few examples of DepthCN detection results (four pairs of DM and color images with corresponding PCDs). The generated hypotheses and the detection results are shown, in both the DM and color images, as red and dashed-green BBs, respectively. The bottom figures show the result in the PCD, where the detected vehicles' clusters are shown in different colors and the remaining LIDAR points are shown in green. Notice that the color images are presented only to improve visualization and ease the understanding of the results.
The YOLOv2 detector in each of the color, DM and RM modalities (referred to as YOLOv2-C, YOLOv2-D and YOLOv2-R, respectively) and the proposed learning-based fusion scheme were optimized using the training and validation sets, and evaluated on the testing set.
Pre-trained ConvNet convolutional weights, computed on the ImageNet dataset³, were used as initial weights for training. Each individual YOLOv2-C/D/R was fine-tuned for 80,200 iterations using SGD with a learning rate of 0.001, a batch size of 64, a weight decay of 0.0005 and a momentum of 0.9. MLPs with one and two hidden layers were experimented with for function fitting. The MLP fitting function was trained using the Levenberg-Marquardt back-propagation algorithm. To evaluate the proposed learning-based fusion detection method, the performance of the fusion model was evaluated on our offline testing set with two sets of features: (1) the confidence score feature subset, and (2) the entire feature set. In addition, we present results in comparison with state-of-the-art methods on the KITTI online benchmark.

³ www.image-net.org

Evaluation on the Validation Set

The YOLOv2 vehicle detection performance for each modality (color, DM and RM data) is presented in Fig. 7.11. To have a fair comparison among modalities, the images (DM, RM and color image) were converted to JPEG files with 75% compression quality. The average file size for the DM, RM and color modalities is approximately 28 KB, 44 KB and 82 KB, respectively. As can be seen by comparing the precision-recall curves, in addition to the color data, the DM and RM modalities, when used individually, show very promising results (a further analysis on the RM is presented in Appendix B). Two sets of experiments were conducted to evaluate the performance of the fusion vehicle detection system. The first experiment demonstrates the improvement gained using the confidence score feature subset. In the second experiment, the entire feature set is employed for learning the joint re-scoring function.

• Experiment using the confidence score feature subset. The re-scoring function can be interpreted as a three-input function-fitting MLP. To visualize the performance of the fitting function, in the first experiment a 3-layer MLP is trained using a subset of features (the three detection confidence scores {sC, sD, sR}). All combinations of confidence scores are generated and inputted to the trained MLP, and the estimated intersection-over-union overlaps are computed as shown in Fig. 7.12. This figure illustrates how the modality detectors are related, and shows the learned object detector behaviors in the modalities based on the detection scores; in fact, it shows for which combinations of scores each detector contributes more to the final decision. The 3-layer MLP reached the minimum Mean Squared Error (MSE) of 0.0179 with 41 hidden neurons. The Average Precision (AP) of the first experiment on the test set is reported in Table 7.10. The results show that the fusion method achieves an improved performance even when just the detection scores are considered. A minimal sketch of this score-only re-scoring step is given below.
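The following Python sketch illustrates the score-only re-scoring experiment. The thesis trains a MATLAB MLP with Levenberg-Marquardt; scikit-learn has no LM solver, so an MLPRegressor with L-BFGS is used here as a stand-in, and the 41 hidden neurons follow the value reported above.

```python
import numpy as np
from sklearn.neural_network import MLPRegressor

def fit_rescoring_mlp(scores, iou_gt, hidden=41):
    """Fit the joint re-scoring function IoU ~= f(s_C, s_D, s_R).

    scores : (N, 3) array of detection confidences [s_C, s_D, s_R].
    iou_gt : (N,) array of overlaps of the detections with the ground truth.
    """
    mlp = MLPRegressor(hidden_layer_sizes=(hidden,), activation='tanh',
                       solver='lbfgs', max_iter=2000, random_state=0)
    mlp.fit(scores, iou_gt)
    return mlp

# usage (illustrative): rescored = mlp.predict(np.array([[0.8, 0.4, 0.6]]))
```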
[Figure 7.11 panels: (a) YOLOv2-C, AP = 73.93 / 61.69 / 54.00; (b) YOLOv2-D, AP = 68.19 / 54.59 / 47.61; (c) YOLOv2-R, AP = 68.36 / 52.23 / 45.22 (Easy / Moderate / Hard). Each panel shows a Precision-Recall curve.]
Figure 7.11: The vehicle detection performance in color, DM and RM modalities: (a) YOLOv2-C; (b) YOLOv2-D, and (c) YOLOv2-R.
Figure 7.12: The joint re-scoring function learned from the confidence-score-only features. The color-coded value is the predicted overlap in the range [0, 1]. The value '1' is the prediction for a 100% overlap between the corresponding detector's BB and the ground-truth BB, and the value '0' indicates the prediction of no overlap.

• Experiment using the entire feature set (augmented features). In the second experiment, the full set of features was considered. Experiments with one and two hidden layers were conducted. Fig. 7.13 plots the validation performance of the MLPs as the number of hidden neurons increases. On the training set, as the number of neurons increases, the error decreases. For the 3-layer MLP (one hidden layer), the validation performance reached the minimum Mean Squared Error (MSE) of 0.0156 with 23 hidden-layer neurons. The MLP with two hidden layers reached the lowest MSE of 0.0155 with 15 and 7 neurons in the first and second hidden layers, respectively. The precision-recall curves of the multimodal fusion vehicle detection after merging, re-scoring and non-maximum suppression are shown in Fig. 7.14. The Average Precision (AP) score is computed on the test set for each independent detector and for the learned fusion models, and is reported in Table 7.10. The proposed fusion scheme boosts the vehicle detection performance in each of the easy, moderate and hard difficulty-level categories of KITTI by at least 1.05 percentage points (in the 'Easy' category the gain reaches 1.2 percentage points). The merit of the proposed fusion method is demonstrated by its higher performance, compared with each of the individual detectors, on the validation set. In addition, the fusion strategy in the proposed method is very flexible in the sense that it can be used to combine different types of object detectors. The proposed fusion scheme is focused on jointly learning the bounding-box characteristics and their associated scores in the modalities. An MLP-based fusion model is learned to deal with the detection limitations in each modality. A sketch of the final suppression step is given after this paragraph.
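After re-scoring, the detections from the three modalities are merged and filtered with non-maximum suppression. A minimal axis-aligned NMS sketch in Python is shown below; the IoU threshold is an assumed value, not the thesis' exact setting.

```python
import numpy as np

def nms(boxes, scores, iou_thr=0.5):
    """Greedy non-maximum suppression over the merged, re-scored detections.
    boxes: (N, 4) as (x1, y1, x2, y2); returns the indices of kept detections."""
    order = np.argsort(scores)[::-1]
    keep = []
    while order.size > 0:
        i = order[0]
        keep.append(i)
        # IoU of the best box with the remaining ones
        xx1 = np.maximum(boxes[i, 0], boxes[order[1:], 0])
        yy1 = np.maximum(boxes[i, 1], boxes[order[1:], 1])
        xx2 = np.minimum(boxes[i, 2], boxes[order[1:], 2])
        yy2 = np.minimum(boxes[i, 3], boxes[order[1:], 3])
        inter = np.clip(xx2 - xx1, 0, None) * np.clip(yy2 - yy1, 0, None)
        area_i = (boxes[i, 2] - boxes[i, 0]) * (boxes[i, 3] - boxes[i, 1])
        areas = (boxes[order[1:], 2] - boxes[order[1:], 0]) * \
                (boxes[order[1:], 3] - boxes[order[1:], 1])
        iou = inter / (area_i + areas - inter)
        order = order[1:][iou <= iou_thr]
    return keep
```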
Figure 7.13: Influence of the number of layers / hidden neurons on the MLP performance. (a) shows the training and validation performance of the 3-layer MLP (i.e., one hidden layer) as the number of hidden neurons increases; (b) and (c) show the performance of the MLP with two hidden layers on the training and validation sets, respectively.
[Figure 7.14 panel title: AP 75.13 / 62.74 / 53.60 (Easy / Moderate / Hard).]
Figure 7.14: Multimodal fusion vehicle detection performance. The merged detections before (dotted line) and after re-scoring (dashed line), and the vehicle detection performance after re-scoring and non-maximum suppression (solid line).

Table 7.10: Performance evaluation of the studied vehicle detectors on the KITTI dataset. The YOLOv2-Color, YOLOv2-Depth and YOLOv2-Reflectance modalities and the late-fusion detection strategy are compared ('Fusion−' denotes the result using only the confidence score feature subset). The figures denote the Average Precision (AP) measured at different difficulty levels.

Modality   Easy      Moderate   Hard
Color      73.93 %   61.69 %    54.00 %
Depth      68.19 %   54.59 %    47.61 %
Reflec.    68.36 %   52.23 %    45.22 %
Fusion−    74.21 %   62.18 %    54.06 %
Fusion     75.13 %   62.74 %    55.10 %
Evaluation on the KITTI Online Benchmark

To compare with the state-of-the-art, the proposed method was evaluated on the KITTI online object detection benchmark against methods that also consider LIDAR data. Results are reported in Table 7.11 and Fig. 7.15. As can be noted from the table, the proposed method surpasses some of the approaches on the KITTI benchmark while having the shortest running time. In the current version of the proposed fusion detection, the input size is set to the default 416 × 416 pixels. The proposed method can achieve a higher detection rate by increasing the input image size, at the price of a slightly higher computational cost.
Table 7.11: Fusion detection performance on the KITTI Online Benchmark.

Approach           Easy      Moderate   Hard      Run Time (s)
MV3D [113]         90.53 %   89.17 %    80.16 %   0.36
3D FCN [108]       85.54 %   75.83 %    68.30 %   5
MV-RGBD-RF [111]   76.49 %   69.92 %    57.47 %   4
VeloFCN [107]      70.68 %   53.45 %    46.90 %   1
Proposed Method    64.77 %   46.77 %    39.38 %   0.063
Vote3D [106]       56.66 %   48.05 %    42.64 %   0.5
CSoR [136]         35.24 %   26.13 %    22.69 %   3.5
mBoW [105]         37.63 %   23.76 %    18.44 %   10
Figure 7.15: Precision-Recall on the KITTI Online Benchmark (Car class).

Computational Analysis

The proposed fusion detection method is based on a single-shot object detector (YOLOv2), which eliminates the object proposal generation step. The adopted YOLOv2, together with the efficient design and implementation of the DM, the RM and the fusion architecture, makes the proposed method high-performing yet cost-effective and capable of working in real time. The implementation environment and the computational load of the different steps of the proposed algorithm are reported in Fig. 7.17. The modality generation and feature extraction steps are implemented in C++, YOLOv2-C/D/R in C, and the re-scoring and NMS in MATLAB (MEX enabled). The average time for processing each frame is 63 milliseconds (about 16 frames per second). Considering that the synchronized camera and Velodyne LIDAR operate at about 10 Hz, real-time processing can be achieved by the proposed architecture.
Figure 7.16: Fusion detection system results. The left column shows the detection results from YOLOv2-C (red), YOLOv2-D (green) and YOLOv2-R (blue) with their associated confidence scores. The right column shows the fusion vehicle detection results (cyan) after re-scoring and NMS, compared to the ground truth (dashed magenta).
[Figure 7.17 pipeline stages and per-frame processing times: DM and RM generation, 34 ms (C++); YOLOv2-C/D/R detection, 15 ms (C); feature extraction, 2 ms; MLP re-scoring, 11 ms; NMS, 1 ms (MATLAB).]
Figure 7.17: The proposed parallel processing architecture for real-time implementation, the processing time (in milliseconds), and the implementation environment of the different steps of the proposed detection system.

Qualitative Results

Fig. 7.16 shows some of the most representative qualitative results using the entire feature set. As can be seen, for most of the cases the proposed multimodal vehicle fusion system effectively combines the detection confidences of YOLOv2-C, YOLOv2-D and YOLOv2-R and outperforms each individual detector.
Part III CONCLUSIONS
Chapter 8

Concluding Remarks and Future Directions

Contents
8.1 Summary of Thesis Contributions
    8.1.1 Obstacle Detection
    8.1.2 Object Detection
8.2 Discussions and Future Perspectives
Every new beginning comes from some other beginning’s end. Seneca
8.1 Summary of Thesis Contributions
This thesis has developed multisensor object detection algorithms for autonomous driving considering two different paradigms: model-free (generic or class-agnostic) object detection based on motion cues, and supervised-learning-based (class-specific) object detection.
8.1.1 Obstacle Detection
The term 'obstacle' was used to refer to generic objects that stand on the ground. The proposed obstacle detection approach (as described in Chapter 5) takes as input sequential color images, 3D-PCDs, and the ego-vehicle's localization data. The method
consists of segmenting obstacles into stationary and moving parts, DATMO, and an approach, based on the fusion of 3D-PCDs and color images, for tracking individual moving objects.

• Static and moving obstacle detection. This section introduced the proposed 4D obstacle detection algorithm (utilizing both 3D spatial and temporal data) and was divided into two parts: ground surface modeling using piecewise RANSAC plane fitting, and a voxel-based representation of the obstacles above the estimated ground. The voxel-grid model of the environment was further segmented into static and moving obstacles using discriminative analysis and ego-motion information. The key contributions of this section were a novel ground surface estimation algorithm (which is able to model arbitrarily curved ground profiles) and a simple yet efficient method for segmenting the static and moving parts of the environment.

• Motion grid-based DATMO. In the previous section, we introduced a voxel-representation-based approach to segment moving obstacles. In this section, DATMO was addressed on a motion-grid basis. The motion grids are built using a short-term static model of the scene (using Elevation grids), followed by a properly designed subtraction mechanism to compute the motion and to rectify the localization error of the GPS-aided INS positioning system. For the object-level representation (i.e., generic moving object extraction from the motion grids), a morphology-based clustering was used. The detected generic moving objects were tracked over time using KFs with Gating and Nearest Neighbor association strategies.

• Multisensor fusion at tracking level. As part of the overall proposed obstacle detection pipeline, we presented a multisensor 3D single-object tracking method to improve the tracking function of the DATMO system (i.e., the proposed fusion tracking can be used instead of a simple KF). In the proposed fusion tracking method, two parallel mean-shift algorithms were run individually for object localization in the color image and the 3D-PCD, followed by a 2D/3D KF-based fusion and tracking. The proposed approach analyzes a sequential 2D-RGB image, 3D-PCD, and the ego-vehicle's positioning data, and outputs the object's trajectory, its current velocity estimate, and its predicted pose in the world coordinate system for the next time step.
8.1.2 Object Detection
In Chapter 6 we described the proposed methods for (class-specific) supervised-learning-based object detection. The dataset for class-specific object detection evaluation is composed of a set of random images (and other sensors' data), and the task is to detect and localize objects based on processing a single instance (frame) of the sensor data.
• 3D-LIDAR-based object detection. In this section, an unsupervised learning technique was used to support (class-specific) supervised-learning-based object detection. A vehicle detection system based on (unsupervised) hypothesis generation and (supervised) hypothesis verification using 3D-LIDAR data, DBSCAN clustering and a ConvNet was proposed. Specifically, hypothesis generation was performed by applying DBSCAN clustering to the PCD data to discover structures in the data and to form a set of hypotheses. The produced hypotheses (nearly) correspond to distinctive obstacles over the ground and, in practice, can be used for free-space computation to avoid collisions. Hypothesis verification was performed using a ConvNet applied to the generated hypotheses (in the form of a depth map).

• Multimodal object detection. This section was an extension of the previous one, and presented a multimodal fusion approach that benefits from three modalities, front-view dense-depth and dense-reflection maps (generated from sparse 3D-LIDAR data) and a color image, for object detection. The proposed method is composed of deep ConvNets and a Multi-Layer Perceptron (MLP) neural network. The deep ConvNets were run individually on the three modalities to obtain detections in each modality. The proposed method extracts a rich set of features (e.g., detection confidence, width, height, center and so forth) from the detections in each modality. The desired target output of the fusion approach is defined as the overlap of the detected bounding boxes in each modality with the ground truth. The MLP was trained to learn and model the nonlinear relationships among modalities and to deal with the detection limitations in each modality.
8.2 Discussions and Future Perspectives
The present study extends our knowledge of multisensor motion-based and supervised-learning-based object detection. The main limitations that need to be considered and some recommendations for future research are presented in the following paragraphs.

• Computation time. The major part of the algorithms was implemented in MATLAB for rapid prototyping; some parts were implemented in C/C++. The multimodal fusion detection (mostly in C/C++) works at 16 fps (i.e., real-time processing, considering the proposed processing architecture). The proposed stationary and moving obstacle detection and the multisensor single-object tracking algorithms (written mostly in MATLAB) run at about 0.3 fps and 4 fps, respectively. These algorithms can be expected to achieve real-time processing after implementation in a more efficient programming language (e.g., C/C++) and by exploiting pipeline parallelism.
• Sensor fusion. We studied sensor/modality fusion for object detection and tracking tasks. Low-level and high-level (KF-based) multisensor fusion methods for object tracking were developed and analyzed in this thesis. The low-level fusion, denoted 'L-Fus.', was presented in Chapter 7 as a comparative method for assessing the proposed 'H-Fus.' fusion tracking approach. We showed that the high-level fusion offers higher performance than the low-level method in the proposed tracking pipeline. For the purpose of multisensor multimodal object detection, a high-level (learning-based) multimodal fusion, based on two sensors (color camera and 3D-LIDAR) and three modalities (color image, 3D-LIDAR range and reflectance data), was studied in this thesis. The proposed fusion detection method, besides being able to work in real time (considering the proposed parallel processing architecture), learns the nonlinear relationships among modalities and deals with the detection limitations of each modality. Incorporating multiple levels of data abstraction into the fusion framework can be explored as future work (e.g., integrating multiple feature-map layers of a ConvNet-based object detector into the fusion framework to obtain a more accurate object detection).

• Multi-view detection. In Chapter 6, the supervised-learning-based object detection was described. Specifically, in the second part of Chapter 6, front-view dense multimodal maps from the 3D-LIDAR (i.e., range and reflectance maps) were explored for class-specific object detection. We suggest that future research could investigate the incorporation of other 3D-LIDAR views (e.g., the top view, in the form of an Elevation grid representation) into the fusion framework for object detection.

• Integrating temporal data to improve the object detection performance. One of the less researched issues in the state-of-the-art is how much temporal data (e.g., in the form of moving object detection) can improve per-frame detection results. We anticipate that integrating temporal data into the object detection process would increase its performance. Taking into account this thesis' findings (on motion-based and supervised-learning-based object detection), we suggest that future research should look into exploiting temporal data to enhance the performance of class-specific object detection.

• Benchmarking. The lack of annotated datasets for obstacle (or generic object) detection evaluation was one of the main challenges in developing this thesis. We introduced some benchmarks (extracted from the KITTI dataset) for the evaluation of ground surface estimation (in three dimensions), stationary and moving obstacle detection, and 3D single-object tracking performance. Obstacle detection benchmarking, although very challenging, could be a future direction to accelerate the research toward real-world autonomous driving.
Appendices
Appendix A

3D Multisensor Single-Object Tracking Benchmark
Previous attempts to propose object tracking benchmarks for automotive applications were mostly based on monocular cameras [73, 74], or were focused only on the data association problem [137]. A benchmark dataset, called 3D Object Tracking in Driving Environment (3D-OTD), is proposed (based on the 'KITTI Object Tracking Evaluation') to facilitate the evaluation of appearance modeling in single-object tracking using the multimodal perception system of autonomous vehicles. Therefore, instead of tracklets, the full track of each object is extracted. A benchmark dataset with 50 annotated sequences is constructed out of the 'KITTI Object Tracking Evaluation' to facilitate the performance evaluation. In the constructed benchmark dataset, each sequence denotes the trajectory of only one target object (i.e., if one scenario includes two target objects, it is considered as two sequences). The specifications of each sequence and the most challenging factors are extracted and reported in Table A.1. The table contains the description of the scene, sequence and objects, including the number of frames of each sequence; the object type: car 'C', pedestrian 'P' and cyclist 'Y'; the object and ego-vehicle situations: moving 'M' or stationary 'S'; and the scene condition: roads in an urban environment 'U' or alleys in downtown 'D'. The object width (Im-W) and height (Im-H) in the first frame (in pixels), and the width (PCD-W), height (PCD-H) and length (PCD-L) in the first PCD (in meters) of each sequence are also reported. Each sequence is categorized according to the following challenges: occlusion (OCC), object pose (POS) and distance (DIS) variations with respect to the ego-vehicle, and changes in the relative velocity (RVL) of the object to the ego-vehicle.
A.1 Baseline 3D Object Tracking Algorithms
As a starting point for the benchmark, two generative 3D-LIDAR-based methods were implemented as baselines for evaluation purposes. The baseline methods take LIDAR PCDs as input (after a ground removal process). The initial position of the object's 3D Bounding Box (3D-OBB) is known, the size of the 3D-BB is assumed fixed during tracking, and the 'point model' is used for the object representation.

• Baseline KF 3D Object Tracker (3D-KF). A 3D Constant Acceleration (CA) KF with Gating Data Association (DA) is used for the robust tracking of the object centroid in the consecutive PCDs. The state of the filter is $\mathbf{x} = [x, \dot{x}, \ddot{x}, y, \dot{y}, \ddot{y}, z, \dot{z}, \ddot{z}]^{\top}$, where $\dot{x}, \dot{y}, \dot{z}$ and $\ddot{x}, \ddot{y}, \ddot{z}$ are the velocity and acceleration in the x, y and z directions, respectively. To eliminate outliers and increase the robustness of the process, the search area is limited to a gate in the vicinity of the KF location predicted from the previous step. If no measurement is available inside the gate area, the predicted KF value is used. Experiments with different gate sizes (1 × 3D-OBB, 1.5 × 3D-OBB and 2 × 3D-OBB) were performed, leading to the conclusion that a gate size of 1 × 3D-OBB provides the best result. A minimal sketch of this filter is given after this item.
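The Python sketch below illustrates the 3D constant-acceleration KF with gating; the process/measurement noise covariances Q and R and the gate vector (e.g., half the 3D-OBB extents) are assumptions, not the benchmark's exact settings.

```python
import numpy as np

def ca_kf_matrices(dt):
    """State-transition and measurement matrices of the 3D constant-acceleration
    KF with state [x, x', x'', y, y', y'', z, z', z'']^T."""
    f1 = np.array([[1, dt, 0.5 * dt**2],
                   [0, 1, dt],
                   [0, 0, 1]])
    F = np.kron(np.eye(3), f1)                 # block-diagonal, one block per axis
    H = np.zeros((3, 9))
    H[0, 0] = H[1, 3] = H[2, 6] = 1            # only the centroid is measured
    return F, H

def gated_update(x, P, z, F, H, Q, R, gate):
    """One predict/update step; the measurement z (cluster centroid) is used only
    if it falls inside the gate around the prediction, otherwise the prediction
    is kept (as in the baseline 3D-KF)."""
    x = F @ x
    P = F @ P @ F.T + Q
    if z is not None and np.all(np.abs(z - H @ x) <= gate):
        S = H @ P @ H.T + R
        K = P @ H.T @ np.linalg.inv(S)
        x = x + K @ (z - H @ x)
        P = (np.eye(9) - K @ H) @ P
    return x, P
```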
Table A.1: Detailed information and challenging factors for each sequence. [For each of the 50 sequences, the table lists: the sequence ID; the number of frames (# Frm.); the object type (Obj.: C, P or Y); the object and ego-vehicle status (M or S); the scene condition (U or D); the object width and height in the first image (Im-W, Im-H, in pixels); the object height, width and length in the first PCD (PCD-H, PCD-W, PCD-L, in meters); and the challenge flags OCC, POS, DIS and RVL.]
• Baseline MS 3D Object Tracker (3D-MS). In the 3D-MS approach, the Mean Shift (MS) iterative procedure is used to locate the object, as follows:

1. The shift vector between the center of the 3D-BB and the centroid of the point set P inside the 3D-BB is computed.
2. The 3D-BB is translated using the shift vector.
3. Steps 1 and 2 are iterated until convergence: the MS iteratively shifts the 3D-BB until the object is placed entirely within the 3D-BB. MS is considered converged when the centroid movement satisfies |m_k| < 0.5 m or the maximum number of iterations is reached.

We conducted an experiment with different maximum numbers of iterations (3, 5 and 10) and observed that a maximum of 3 iterations provides the best result. The object orientation is obtained by subtracting the previous estimated location of the object from the current one. A minimal sketch of this procedure is given below.
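The following Python sketch illustrates the shift-and-recenter loop; the 3D-BB is treated as axis-aligned here, which is a simplification of the oriented boxes used in the benchmark.

```python
import numpy as np

def mean_shift_3d(points, center, size, max_iter=3, tol=0.5):
    """Shift a fixed-size 3D-BB (given by its center and size) towards the
    centroid of the points it contains, as in the 3D-MS baseline."""
    for _ in range(max_iter):
        half = size / 2.0
        inside = np.all(np.abs(points - center) <= half, axis=1)
        if not np.any(inside):
            break                                   # no support, keep last position
        shift = points[inside].mean(axis=0) - center
        center = center + shift
        if np.linalg.norm(shift) < tol:             # |m_k| < 0.5 m
            break
    return center
```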
A.2 Quantitative Evaluation Methodology
Different metrics have been proposed for the evaluation of object tracking methods [138, 139, 140]. For the quantitative evaluation, the following two assessment criteria are used (a minimal sketch of both metrics is given after this list):

• The precision plot of overlap success. The overlap rate (the intersection-over-union metric) in 3D is given by

O_{3D} = \frac{\mathrm{volume}(\text{3D-BB} \cap \text{3D-BB}^{G})}{\mathrm{volume}(\text{3D-BB} \cup \text{3D-BB}^{G})}    (A.1)

where 3D-BB^{G} is the Ground-Truth (GT) 3D-BB available in the KITTI dataset. The overlap rate ranges from 0 to 1. To be considered a success, the overlap ratio O_{3D} must exceed 0.25, which is a standard threshold. The percentage of frames with a successful occurrence is used as the metric to measure tracking performance.

• The precision plot of orientation success. The GT for the orientation of the object in the KITTI dataset is given by the Yaw angle (the Yaw angle describes the heading of the object and corresponds to the rotation around the z-axis). The orientation error can be computed by

E_{\theta} = \lvert \vec{\theta} - \vec{\theta}^{G} \rvert    (A.2)

where $\vec{\theta}^{G}$ is the GT orientation of the object. The precision plot of orientation is given by the percentage of frames with E_{\theta} less than a certain threshold (this value is empirically set to 10 degrees).
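The Python sketch below illustrates both metrics. For simplicity the overlap is computed for axis-aligned boxes, whereas the benchmark's oriented 3D-BBs would additionally require handling the yaw angle; the box encoding and thresholds follow the definitions above.

```python
import numpy as np

def iou_3d_axis_aligned(box_a, box_b):
    """Overlap rate O_3D for axis-aligned boxes given as
    (xmin, ymin, zmin, xmax, ymax, zmax)."""
    lo = np.maximum(box_a[:3], box_b[:3])
    hi = np.minimum(box_a[3:], box_b[3:])
    inter = np.prod(np.clip(hi - lo, 0, None))
    vol_a = np.prod(box_a[3:] - box_a[:3])
    vol_b = np.prod(box_b[3:] - box_b[:3])
    return inter / (vol_a + vol_b - inter)

def success_rates(est_boxes, gt_boxes, est_yaw, gt_yaw,
                  iou_thr=0.25, yaw_thr=np.deg2rad(10)):
    """Fraction of frames meeting the overlap and orientation criteria."""
    ious = np.array([iou_3d_axis_aligned(a, b) for a, b in zip(est_boxes, gt_boxes)])
    # wrapped angular difference between estimated and GT yaw
    yaw_err = np.abs(np.arctan2(np.sin(est_yaw - gt_yaw), np.cos(est_yaw - gt_yaw)))
    return np.mean(ious > iou_thr), np.mean(yaw_err < yaw_thr)
```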
Figure A.1: The precision plot of 3D overlap rate based on OCC, POS, DIS and RVL challenges.
A.3 Evaluation Results and Analysis of Metrics
The metrics for the two baseline trackers (3D-MS and 3D-KF) are computed based on the OCC, POS, DIS and RVL challenges and plotted in Fig. A.1 and Fig. A.2, where the x-axis denotes the normalized number of frames over all the sequences and the y-axis shows the normalized cumulative sum of the successful cases (i.e., each frame in which the 3D-BB overlap or orientation error condition is met is added to the cumulative sum). The 3D-KF achieves a higher success rate because the 3D-MS tracker may diverge to a denser nearby object (a local minimum) instead of tracking the target object. Interestingly, the 3D-KF performs much better in the RVL challenge because of a more accurate estimation of the object dynamics. However, the 3D-MS tracker has a higher precision in the orientation estimation. The average processing rate of the baseline trackers is about 15 fps. The experiment was carried out using a quad-core 3.4 GHz processor with 8 GB RAM under MATLAB R2015a.
Figure A.2: The precision plot of orientation error based on OCC, POS, DIS and RVL challenges.
A.3.1 A Comparison of Baseline Trackers with State-of-the-art Computer-Vision-based Object Trackers
3D-LIDAR sensors are opening their way into high-level perception tasks in computer vision, such as object tracking, object recognition and scene understanding. We found it interesting to compare our baseline trackers (3D-MS and 3D-KF) with two high-ranking state-of-the-art computer-vision-based object trackers (SCM [141] and ASLA [142]) from the Object Tracking Benchmark [138]. SCM and ASLA run at about 1 fps and 6.5 fps, respectively. The precision plot is given by the percentage of successful occurrences (localization error less than 20 pixels [138]), and is presented in Fig. A.3.
Figure A.3: The precision plot of location error.

We found that our baseline trackers, benefiting from highly reliable 3D-LIDAR data, have superior performance over the state-of-the-art approaches from the computer vision field. This is because, in autonomous driving scenarios, the ego-vehicle and the objects are often moving; therefore, the object size and pose undergo severe changes (in the RGB image), which can easily mislead visual object trackers.
Appendix B

Object Detection Using Reflection Data
Table B.1: The RefCN processing time (in milliseconds).

Impl. Details       Proc. Time   Environment
RM Generation       34           C++
YOLOv2 Detection    15           C
In this appendix, an object detection method using 3D-LIDAR reflection intensity data and the YOLOv2 416×416 object detection framework is presented (herein called RefCN, which stands for 'Reflectance ConvNet'). The front-view dense Reflection Map (RM) runs through the trained RM-based YOLOv2 pipeline to achieve object detection. For this analysis, the KITTI object detection 'training dataset' (containing 7481 frames) was partitioned into two subsets: 80% as training set (5985 frames) and 20% as validation set (1496 frames). The 'Car' label was considered for the evaluation.
B.1 Computational Complexity and Run-Time
The experiments were run on a computer with a hexa-core 3.5 GHz processor, powered with a GTX 1080 GPU and 64 GB RAM, under Linux. Two versions of the RM generation were implemented: a version using the MATLAB scatteredInterpolant function and a much faster reimplementation in C++. The RM generation in MATLAB takes about 1.4 seconds, while in C++ it takes 34 ms. The implementation details and the computational load of the RM generation and YOLOv2 detection steps are reported in Table B.1. The overall time for processing each frame using the C++ implementation is 49 milliseconds (more than 20 frames per second). Considering that the KITTI dataset was captured using a 10 Hz spinning Velodyne HDL-64E, it can be concluded that RefCN can run in real time.
B.2 Quantitative Results
Quantitative experiments were conducted to assess the performance of the RefCN: (i) Sparse Reflectance Map versus RM; (ii) comparison of RMs with different interpolation methods; (iii) RM versus color and range data modalities; and (iv) RefCN versus state-of-the-art methods.
B.2.1 Sparse Reflectance Map vs RM
The RefCN was trained on the training set and evaluated on the validation set. As can be seen from Fig. B.1 and Table B.2, the results show that the RM (with the default input size of 416×416 and the Nearest Neighbor interpolation) considerably improves the detection performance in comparison with the sparse Reflectance Map.
Figure B.1: Precision-Recall using the sparse RM (dashed lines) versus the RM (solid lines) on the KITTI validation set (Car class).

Table B.2: Detection accuracy with the sparse RM vs the RM on the validation set.

Input Data   Easy      Moderate   Hard
Sparse RM    23.45 %   17.57 %    15.57 %
RM           67.69 %   51.91 %    44.98 %
RM*          72.67 %   62.65 %    54.89 %
In Table B.2, RM* denotes the results for an increased input size of 1216×352. For the rest of this document, the analyses were performed for the default input size of 416×416.
B.2.2 RM Generation Using Nearest Neighbor, Linear and Natural Interpolations
The result of the previous experiment shows that the use of a dense up-sampled representation considerably improves the detection rate. A question that can be raised is which interpolation method gives the best performance. In this experiment, we evaluated three interpolation methods: Nearest Neighbor (RMnearest), Natural Neighbor (RMnatural) and Linear (RMlinear) interpolation.
Figure B.2: Precision-Recall using RMnearest (solid lines), RMnatural (dashed lines) and RMlinear (dotted lines) on the KITTI validation set (Car class).

Table B.3: Detection accuracy using RMs with different interpolation methods on the validation set.

Input Data        Easy      Moderate   Hard
RM (RMnearest)    67.69 %   51.91 %    44.98 %
RMlinear          60.60 %   45.71 %    40.79 %
RMnatural         65.25 %   50.07 %    44.76 %
RMnatural is based on a Voronoi tessellation of the projected LIDAR points, which results in a continuous surface except at the projected points. RMlinear is based on linear interpolation between sets of three projected LIDAR points over surfaces in Delaunay Triangulation (DT) format. Fig. B.3 shows an example of a color image and the corresponding generated RMs. The detection performance for each interpolation method is reported in Fig. B.2 and Table B.3. The best performance was attained, for all categories, with RMnearest.
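A sketch of the dense-map generation is given below using SciPy's griddata; it covers the nearest-neighbor and linear variants (SciPy provides no natural-neighbor interpolant, so RMnatural is not reproduced), and the projection of the LIDAR points to pixel coordinates is assumed to be done beforehand.

```python
import numpy as np
from scipy.interpolate import griddata

def dense_reflectance_map(uv, reflectance, height, width, method='nearest'):
    """Up-sample sparse projected LIDAR reflectance values into a dense map.

    uv          : (N, 2) pixel coordinates (u, v) of the projected points.
    reflectance : (N,) reflectance values of those points.
    method      : 'nearest' or 'linear'.
    """
    grid_u, grid_v = np.meshgrid(np.arange(width), np.arange(height))
    rm = griddata(uv, reflectance, (grid_u, grid_v), method=method, fill_value=0)
    return rm.astype(np.float32)
```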
B.3 Qualitative Results
Figure B.4 shows some of the representative qualitative results with many cars in the scene. As can be seen, for most cases, the RefCN correctly detects target vehicles.
Figure B.3: From top to bottom: (a) an example of a color image from the KITTI dataset, and the corresponding (b) RMnearest, (c) RMnatural and (d) RMlinear.
Figure B.4: Examples of RefCN results. Detections are shown as green BBs in the color images (top) and RMs (bottom), compared to the ground truth (dashed magenta). Notice that the depicted color images are shown for visualization purposes only.
Bibliography

[1] Martin Ester, Hans-Peter Kriegel, Jörg Sander, Xiaowei Xu, et al. A density-based algorithm for discovering clusters in large spatial databases with noise. In KDD, volume 96, pages 226–231, 1996.

[2] K Bhalla, M Shotten, A Cohen, M Brauer, S Shahraz, R Burnett, K Leach-Kemon, G Freedman, and CJ Murray. Transport for health: the global burden of disease from motorized road transport. World Bank Group: Washington, DC, 2014.

[3] Etienne Krug. Decade of action for road safety 2011–2020. Injury, 43(1):6–7, 2012.

[4] Santokh Singh. Critical reasons for crashes investigated in the national motor vehicle crash causation survey. Technical report, Traffic Safety Facts Crash Stats, National Highway Traffic Safety Administration, Washington, DC, 2015.

[5] Chris Urmson, Joshua Anhalt, Drew Bagnell, Christopher Baker, Robert Bittner, M. N. Clark, John Dolan, Dave Duggins, Tugrul Galatali, Chris Geyer, Michele Gittleman, Sam Harbaugh, Martial Hebert, Thomas M. Howard, Sascha Kolski, Alonzo Kelly, Maxim Likhachev, Matt McNaughton, Nick Miller, Kevin Peterson, Brian Pilnick, Raj Rajkumar, Paul Rybski, Bryan Salesky, Young-Woo Seo, Sanjiv Singh, Jarrod Snider, Anthony Stentz, William Red Whittaker, Ziv Wolkowicki, Jason Ziglar, Hong Bae, Thomas Brown, Daniel Demitrish, Bakhtiar Litkouhi, Jim Nickolaou, Varsha Sadekar, Wende Zhang, Joshua Struble, Michael Taylor, Michael Darms, and Dave Ferguson. Autonomous driving in urban environments: Boss and the urban challenge. Journal of Field Robotics, 25(8):425–466, 2008.

[6] Michael Montemerlo, Jan Becker, Suhrid Bhat, Hendrik Dahlkamp, Dmitri Dolgov, Scott Ettinger, Dirk Haehnel, Tim Hilden, Gabe Hoffmann, Burkhard Huhnke, et al. Junior: The Stanford entry in the urban challenge. Journal of Field Robotics, 25(9):569–597, 2008.
[7] Peter Corke. Robotics, Vision and Control: Fundamental Algorithms In MATLAB® Second, Completely Revised, volume 118. Springer, 2017. [8] Paul J Besl, Neil D McKay, et al. A method for registration of 3-d shapes. IEEE Transactions on pattern analysis and machine intelligence, 14(2):239–256, 1992. [9] Jon Louis Bentley. Multidimensional binary search trees used for associative searching. Communications of the ACM, 18(9):509–517, 1975. [10] Hugh F Durrant-Whyte. Sensor models and multisensor integration. The international journal of robotics research, 7(6):97–113, 1988. [11] Belur V Dasarathy. Decision fusion, volume 1994. IEEE Computer Society Press Los Alamitos, CA, 1994. [12] R Boudjemaa and AB Forbes. Parameter estimation methods for data fusion, national physical laboratory report no. CMSC 38, 4, 2004. [13] Federico Castanedo. A review of data fusion techniques. The Scientific World Journal, 2013, 2013. [14] Shai Shalev-Shwartz and Shai Ben-David. Understanding machine learning: From theory to algorithms. Cambridge university press, 2014. [15] Ian Goodfellow, Yoshua Bengio, Aaron Courville, and Yoshua Bengio. Deep learning, volume 1. MIT press Cambridge, 2016. [16] Kurt Hornik, Maxwell Stinchcombe, and Halbert White. Multilayer feedforward networks are universal approximators. Neural networks, 2(5):359–366, 1989. [17] Yanming Guo, Yu Liu, Ard Oerlemans, Songyang Lao, Song Wu, and Michael S Lew. Deep learning for visual understanding: A review. Neurocomputing, 187:27–48, 2016. [18] Nitish Srivastava, Geoffrey E Hinton, Alex Krizhevsky, Ilya Sutskever, and Ruslan Salakhutdinov. Dropout: a simple way to prevent neural networks from overfitting. Journal of machine learning research, 15(1):1929–1958, 2014. [19] Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner. Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11):2278–2324, 1998. [20] A Geiger, P Lenz, C Stiller, and R Urtasun. Vision meets robotics: The KITTI dataset. The International Journal of Robotics Research, 32(11):1231–1237, 2013.
[21] Joel Janai, Fatma G¨uney, Aseem Behl, and Andreas Geiger. Computer vision for autonomous vehicles: Problems, datasets and state-of-the-art. ARXIV, 2017. [22] Mark Everingham, Luc Van Gool, Christopher KI Williams, John Winn, and Andrew Zisserman. The pascal visual object classes (voc) challenge. International journal of computer vision, 88(2):303–338, 2010. [23] Andreas Geiger, Philip Lenz, and Raquel Urtasun. Are we ready for autonomous driving? the kitti vision benchmark suite. In CVPR, 2012. [24] Joseph Redmon, Santosh Divvala, Ross Girshick, and Ali Farhadi. You only look once: Unified, real-time object detection. In CVPR, pages 779–788, 2016. [25] Joseph Redmon and Ali Farhadi. Yolo9000: Better, faster, stronger. In CVPR, 2017. [26] Andreas N¨uchter, Kai Lingemann, Joachim Hertzberg, and Hartmut Surmann. 6d slam with approximate data association. In Advanced Robotics, 2005. ICAR’05. Proceedings., 12th International Conference on, pages 242–249. IEEE, 2005. [27] Daniel Sack and Wolfram Burgard. A comparison of methods for line extraction from range data. In Proc. of the 5th IFAC symposium on intelligent autonomous vehicles (IAV), 2004. [28] Miguel Oliveira, Victor Santos, Angel Sappa, and P.Dias. Scene representations for autonomous driving: an approach based on polygonal primitives. In 2nd Iberian Robotics Conference, 2015. [29] Ricardo Pascoal, Vitor Santos, Cristiano Premebida, and Urbano Nunes. Simultaneous segmentation and superquadrics fitting in laser-range data. Vehicular Technology, IEEE Transactions on, 64(2):441–452, 2015. [30] Hans P Moravec and Alberto Elfes. High resolution maps from wide angle sonar. In Robotics and Automation. Proceedings. 1985 IEEE International Conference on, volume 2, pages 116–121. IEEE, 1985. [31] M Herbert, C Caillas, Eric Krotkov, In So Kweon, and Takeo Kanade. Terrain mapping for a roving planetary explorer. In Robotics and Automation, 1989. Proceedings., 1989 IEEE International Conference on, pages 997–1002. IEEE, 1989. [32] Patrick Pfaff, Rudolph Triebel, and Wolfram Burgard. An efficient extension to elevation maps for outdoor terrain mapping and loop closing. The International Journal of Robotics Research, 26(2):217–230, 2007.
[33] Rudolph Triebel, Patrick Pfaff, and Wolfram Burgard. Multi-level surface maps for outdoor terrain mapping and loop closing. In 2006 IEEE/RSJ International Conference on Intelligent Robots and Systems, pages 2276–2282. IEEE, 2006.
[34] Yuval Roth-Tabak and Ramesh Jain. Building an environment model using depth information. Computer, 22(6):85–90, 1989.
[35] H. Moravec. Robot spatial perception by stereoscopic vision and 3D evidence grids. Perception, (September), 1996.
[36] D. Haehnel. Mapping with Mobile Robots. PhD thesis, University of Freiburg, Department of Computer Science, December 2004.
[37] Julian Ryde and Huosheng Hu. 3D mapping with multi-resolution occupied voxel lists. Autonomous Robots, 28(2):169–185, 2010.
[38] Bertrand Douillard, J. Underwood, Narek Melkumyan, S. Singh, Shrihari Vasudevan, C. Brunner, and A. Quadros. Hybrid elevation maps: 3D surface models for segmentation. In 2010 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), pages 1532–1538. IEEE, 2010.
[39] Donald Meagher. Geometric modeling using octree encoding. Computer Graphics and Image Processing, 19(2):129–147, 1982.
[40] Armin Hornung, Kai M. Wurm, Maren Bennewitz, Cyrill Stachniss, and Wolfram Burgard. OctoMap: an efficient probabilistic 3D mapping framework based on octrees. Autonomous Robots, 34(3):189–206, 2013.
[41] Ivan Dryanovski, William Morris, and Jizhong Xiao. Multi-volume occupancy grids: An efficient probabilistic 3D mapping model for micro aerial vehicles. In 2010 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), pages 1553–1559. IEEE, 2010.
[42] Anca Discant, Alexandrina Rogozan, Calin Rusu, and Abdelaziz Bensrhair. Sensors for obstacle detection - a survey. In 30th International Spring Seminar on Electronics Technology, pages 100–105. IEEE, 2007.
[43] Nicola Bernini, Massimo Bertozzi, Luca Castangia, Marco Patander, and Mario Sabbatelli. Real-time obstacle detection using stereo vision for autonomous ground vehicles: A survey. In 2014 IEEE 17th International Conference on Intelligent Transportation Systems (ITSC), pages 873–878. IEEE, 2014.
[44] Anna Petrovskaya, Mathias Perrollaz, Luciano Oliveira, Luciano Spinello, Rudolph Triebel, Alexandros Makris, John-David Yoder, Christian Laugier, Urbano Nunes, and Pierre Bessiere. Awareness of road scene participants for autonomous driving. In Handbook of Intelligent Vehicles, pages 1383–1432. Springer, 2012.
[45] Sayanan Sivaraman and Mohan Manubhai Trivedi. Looking at vehicles on the road: A survey of vision-based vehicle detection, tracking, and behavior analysis. IEEE Transactions on Intelligent Transportation Systems, 14(4):1773–1795, 2013.
[46] Zhongfei Zhang, Richard Weiss, and Allen R. Hanson. Obstacle detection based on qualitative and quantitative 3D reconstruction. IEEE Transactions on Pattern Analysis and Machine Intelligence, 19(1):15–26, 1997.
[47] D. Pfeiffer and U. Franke. Efficient representation of traffic scenes by means of dynamic stixels. In 2010 IEEE Intelligent Vehicles Symposium (IV), pages 217–224, June 2010.
[48] A. Broggi, S. Cattani, M. Patander, M. Sabbatelli, and P. Zani. A full-3D voxel-based dynamic obstacle detection for urban scenario using stereo vision. In 2013 16th International IEEE Conference on Intelligent Transportation Systems (ITSC), pages 71–76, October 2013.
[49] A. Azim and O. Aycard. Layer-based supervised classification of moving objects in outdoor dynamic environment using 3D laser scanner. In 2014 IEEE Intelligent Vehicles Symposium Proceedings, pages 1408–1414, June 2014.
[50] Raphael Labayrade, Didier Aubert, and Jean-Philippe Tarel. Real time obstacle detection in stereovision on non flat road geometry through "v-disparity" representation. In IEEE Intelligent Vehicle Symposium, 2002, volume 2, pages 646–651. IEEE, 2002.
[51] Angel D. Sappa, Rosa Herrero, Fadi Dornaika, David Gerónimo, and Antonio López. Road approximation in Euclidean and v-disparity space: a comparative study. In Computer Aided Systems Theory – EUROCAST 2007, pages 1105–1112. Springer, 2007.
[52] Florin Oniga and Sergiu Nedevschi. Processing dense stereo data using elevation maps: Road surface, traffic isle, and obstacle detection. IEEE Transactions on Vehicular Technology, 59(3):1172–1182, 2010.
[53] Martin A. Fischler and Robert C. Bolles. Random sample consensus: a paradigm for model fitting with applications to image analysis and automated cartography. Communications of the ACM, 24(6):381–395, 1981.
[54] Anna V. Petrovskaya. Towards dependable robotic perception. Stanford University, 2011.
[55] John Leonard, Jonathan How, Seth Teller, Mitch Berger, Stefan Campbell, Gaston Fiore, Luke Fletcher, Emilio Frazzoli, Albert Huang, Sertac Karaman, et al. A perception-driven autonomous urban vehicle. Journal of Field Robotics, 25(10):727–774, 2008.
[56] Jaebum Choi, Simon Ulbrich, Bernd Lichte, and Markus Maurer. Multi-target tracking using a 3D-LIDAR sensor for autonomous vehicles. In 16th International IEEE Conference on Intelligent Transportation Systems (ITSC 2013), pages 881–886. IEEE, 2013.
[57] Christoph Mertz, Luis E. Navarro-Serment, Robert MacLachlan, Paul Rybski, Aaron Steinfeld, Arne Suppe, Christopher Urmson, Nicolas Vandapel, Martial Hebert, Chuck Thorpe, et al. Moving object detection with laser scanners. Journal of Field Robotics, 30(1):17–43, 2013.
[58] Andreas Ess, Konrad Schindler, Bastian Leibe, and Luc Van Gool. Object detection and tracking for autonomous navigation in dynamic environments. The International Journal of Robotics Research, 29(14):1707–1725, 2010.
[59] Siavash Hosseinyalamdary, Yashar Balazadegan, and Charles Toth. Tracking 3D moving objects based on GPS/IMU navigation solution, laser scanner point cloud and GIS data. ISPRS International Journal of Geo-Information, 4(3):1301–1316, 2015.
[60] Anna Petrovskaya and Sebastian Thrun. Model based vehicle detection and tracking for autonomous urban driving. Autonomous Robots, 26(2-3):123–139, 2009.
[61] Takeo Miyasaka, Yoshihiro Ohama, and Yoshiki Ninomiya. Ego-motion estimation and moving object tracking using multi-layer LIDAR. In 2009 IEEE Intelligent Vehicles Symposium, pages 151–156. IEEE, 2009.
[62] Josip Ćesić, Ivan Marković, Srećko Jurić-Kavelj, and Ivan Petrović. Short-term map based detection and tracking of moving objects with 3D laser on a vehicle. In Informatics in Control, Automation and Robotics, pages 205–222. Springer, 2016.
[63] Andrei Vatavu, Radu Danescu, and Sergiu Nedevschi. Stereovision-based multiple object tracking in traffic scenarios using free-form obstacle delimiters and particle filters. IEEE Transactions on Intelligent Transportation Systems, 16(1):498–511, 2015.
[64] David Held, Jesse Levinson, and Sebastian Thrun. Precision tracking with sparse 3D and dense color 2D data. In 2013 IEEE International Conference on Robotics and Automation (ICRA), pages 1138–1145. IEEE, 2013.
[65] Frank Moosmann and Christoph Stiller. Joint self-localization and tracking of generic objects in 3D range data. In 2013 IEEE International Conference on Robotics and Automation (ICRA), pages 1146–1152. IEEE, 2013.
[66] Pedro F. Felzenszwalb, Ross B. Girshick, David McAllester, and Deva Ramanan. Object detection with discriminatively trained part-based models. IEEE Transactions on Pattern Analysis and Machine Intelligence, 32(9):1627–1645, 2010.
[67] Andreas Geiger, Martin Lauer, Christian Wojek, Christoph Stiller, and Raquel Urtasun. 3D traffic scene understanding from movable platforms. IEEE Transactions on Pattern Analysis and Machine Intelligence (PAMI), 2014.
[68] Li Zhang, Yuan Li, and Ramakant Nevatia. Global data association for multi-object tracking using network flows. In 2008 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 1–8. IEEE, 2008.
[69] Hamed Pirsiavash, Deva Ramanan, and Charless C. Fowlkes. Globally-optimal greedy algorithms for tracking a variable number of objects. In 2011 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 1201–1208. IEEE, 2011.
[70] Anton Milan, Stefan Roth, and Konrad Schindler. Continuous energy minimization for multitarget tracking. IEEE Transactions on Pattern Analysis and Machine Intelligence, 36(1):58–72, 2014.
[71] Philip Lenz, Andreas Geiger, and Raquel Urtasun. FollowMe: Efficient online min-cost flow tracking with bounded memory and computation. In Proceedings of the IEEE International Conference on Computer Vision, pages 4364–4372, 2015.
[72] Ju Hong Yoon, Chang-Ryeol Lee, Ming-Hsuan Yang, and Kuk-Jin Yoon. Online multi-object tracking via structural constraint event aggregation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 1392–1400, 2016.
[73] Laura Leal-Taixé, Anton Milan, Ian Reid, Stefan Roth, and Konrad Schindler. MOTChallenge 2015: Towards a benchmark for multi-target tracking. arXiv preprint arXiv:1504.01942, 2015.
[74] Anton Milan, Laura Leal-Taixé, Ian Reid, Stefan Roth, and Konrad Schindler. MOT16: A benchmark for multi-object tracking. arXiv preprint arXiv:1603.00831, 2016.
[75] Aljoša Ošep, Alexander Hermans, Francis Engelmann, Dirk Klostermann, Markus Mathias, and Bastian Leibe. Multi-scale object candidates for generic object tracking in street scenes. In ICRA, 2016.
[76] Alex Teichman and Sebastian Thrun. Tracking-based semi-supervised learning. The International Journal of Robotics Research, 31(7):804–818, 2012.
[77] Ralf Kaestner, Jérôme Maye, Yves Pilat, and Roland Siegwart. Generative object detection and tracking in 3D range data. In 2012 IEEE International Conference on Robotics and Automation (ICRA), pages 3075–3081. IEEE, 2012.
[78] Kamel Mekhnacha, Yong Mao, David Raulo, and Christian Laugier. Bayesian occupancy filter based "fast clustering-tracking" algorithm. In IROS, 2008.
[79] Qadeer Baig, Mathias Perrollaz, and Christian Laugier. A robust motion detection technique for dynamic environment monitoring: A framework for grid-based monitoring of the dynamic environment. IEEE Robotics & Automation Magazine, 21(1):40–48, 2014.
[80] Hernán Badino, Uwe Franke, and David Pfeiffer. The stixel world - a compact medium level representation of the 3D-world. In Pattern Recognition, pages 51–60. Springer, 2009.
[81] Ayush Dewan, Tim Caselitz, Gian Diego Tipaldi, and Wolfram Burgard. Motion-based detection and tracking in 3D LIDAR scans. In Proc. of the IEEE Int. Conf. on Robotics & Automation (ICRA), Stockholm, Sweden, 2016.
[82] Uwe Franke, Clemens Rabe, Hernán Badino, and Stefan Gehrig. 6D-vision: Fusion of stereo and motion for robust environment perception. In Pattern Recognition, pages 216–223. Springer, 2005.
[83] Paul Viola and Michael Jones. Rapid object detection using a boosted cascade of simple features. In CVPR, volume 1, 2001.
[84] Navneet Dalal and Bill Triggs. Histograms of oriented gradients for human detection. In 2005 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR), volume 1, pages 886–893. IEEE, 2005.
[85] Pedro F. Felzenszwalb, Ross B. Girshick, David McAllester, and Deva Ramanan. Object detection with discriminatively trained part-based models. IEEE TPAMI, 32(9):1627–1645, 2010.
[86] Jasper R. R. Uijlings, Koen E. A. Van De Sande, Theo Gevers, and Arnold W. M. Smeulders. Selective search for object recognition. IJCV, 104(2):154–171, 2013.
[87] Yann LeCun, Bernhard Boser, John S. Denker, Donnie Henderson, Richard E. Howard, Wayne Hubbard, and Lawrence D. Jackel. Backpropagation applied to handwritten zip code recognition. Neural Computation, 1(4):541–551, 1989.
[88] Alex Krizhevsky, Ilya Sutskever, and Geoffrey E. Hinton. ImageNet classification with deep convolutional neural networks. In NIPS, pages 1097–1105, 2012.
[89] Pierre Sermanet, David Eigen, Xiang Zhang, Michaël Mathieu, Rob Fergus, and Yann LeCun. OverFeat: Integrated recognition, localization and detection using convolutional networks. In ICLR, 2014.
[90] Ross Girshick, Jeff Donahue, Trevor Darrell, and Jitendra Malik. Rich feature hierarchies for accurate object detection and semantic segmentation. In CVPR, pages 580–587, 2014.
[91] Ross Girshick. Fast R-CNN. In ICCV, 2015.
[92] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Spatial pyramid pooling in deep convolutional networks for visual recognition. IEEE PAMI, 37(9):1904–1916, 2015.
[93] Shaoqing Ren, Kaiming He, Ross Girshick, and Jian Sun. Faster R-CNN: Towards real-time object detection with region proposal networks. In NIPS, 2015.
[94] Ross Girshick, Jeff Donahue, Trevor Darrell, and Jitendra Malik. Region-based convolutional networks for accurate object detection and segmentation. IEEE PAMI, 38(1):142–158, 2016.
[95] Evan Shelhamer, Jonathon Long, and Trevor Darrell. Fully convolutional networks for semantic segmentation. IEEE PAMI, 2016.
[96] Wei Liu, Dragomir Anguelov, Dumitru Erhan, Christian Szegedy, Scott Reed, Cheng-Yang Fu, and Alexander C. Berg. SSD: Single shot multibox detector. In ECCV, pages 21–37, 2016.
[97] J. Javier Yebes, Luis M. Bergasa, and Miguel García-Garrido. Visual object recognition with 3D-aware features in KITTI urban scenes. Sensors, 15(4):9228–9250, 2015.
[98] Heiko Hirschmuller. Stereo processing by semiglobal matching and mutual information. IEEE TPAMI, 30(2):328–341, 2008.
[99] Yu Xiang, Wongun Choi, Yuanqing Lin, and Silvio Savarese. Subcategory-aware convolutional neural networks for object proposals and detection. In WACV, pages 924–933, 2017.
[100] Yu Xiang, Wongun Choi, Yuanqing Lin, and Silvio Savarese. Data-driven 3D voxel patterns for object category recognition. In CVPR, pages 1903–1911, 2015.
[101] Zhaowei Cai, Quanfu Fan, Rogerio S. Feris, and Nuno Vasconcelos. A unified multi-scale deep convolutional neural network for fast object detection. In ECCV, 2016.
[102] Florian Chabot, Mohamed Chaouch, Jaonary Rabarisoa, Céline Teulière, and Thierry Chateau. Deep MANTA: A coarse-to-fine many-task network for joint 2D and 3D vehicle analysis from monocular image. In CVPR, 2017.
[103] Xiaozhi Chen, Kaustav Kundu, Ziyu Zhang, Huimin Ma, Sanja Fidler, and Raquel Urtasun. Monocular 3D object detection for autonomous driving. In CVPR, pages 2147–2156, 2016.
[104] Fan Yang, Wongun Choi, and Yuanqing Lin. Exploit all the layers: Fast and accurate CNN object detector with scale dependent pooling and cascaded rejection classifiers. In CVPR, pages 2129–2137, 2016.
[105] Jens Behley, Volker Steinhage, and Armin B. Cremers. Laser-based segment classification using a mixture of bag-of-words. In IROS, 2013.
[106] Dominic Zeng Wang and Ingmar Posner. Voting for voting in online point cloud object detection. In Robotics: Science and Systems, 2015.
[107] Bo Li, Tianlei Zhang, and Tian Xia. Vehicle detection from 3D LIDAR using fully convolutional network. In RSS, 2016.
[108] Bo Li. 3D fully convolutional network for vehicle detection in point cloud. In IROS, 2017.
[109] Yin Zhou and Oncel Tuzel. VoxelNet: End-to-end learning for point cloud based 3D object detection. arXiv preprint arXiv:1711.06396, 2017.
[110] Cristiano Premebida, Joao Carreira, Jorge Batista, and Urbano Nunes. Pedestrian detection combining RGB and dense LIDAR data. In IROS, 2014.
[111] Alejandro González, David Vázquez, Antonio M. López, and Jaume Amores. On-board object detection: Multicue, multimodal, and multiview random forest of local experts. IEEE Transactions on Cybernetics, 2016.
[112] Sang-Il Oh and Hang-Bong Kang. Object detection and classification by decision-level fusion for intelligent vehicle systems. Sensors, 17(1):207, 2017.
[113] Xiaozhi Chen, Huimin Ma, Ji Wan, Bo Li, and Tian Xia. Multi-view 3D object detection network for autonomous driving. In CVPR, 2017.
[114] Hugh Durrant-Whyte and Thomas C. Henderson. Multisensor data fusion. In Springer Handbook of Robotics, pages 867–896. Springer, 2016.
[115] Joel Schlosser, Christopher K. Chow, and Zsolt Kira. Fusing LIDAR and images for pedestrian detection using convolutional neural networks. In ICRA, pages 2198–2205, 2016.
[116] Saurabh Gupta, Ross Girshick, Pablo Arbeláez, and Jitendra Malik. Learning rich features from RGB-D images for object detection and segmentation. In ECCV, pages 345–360, 2014.
[117] Alireza Asvadi, Cristiano Premebida, Paulo Peixoto, and Urbano Nunes. 3D LIDAR-based static and moving obstacle detection in driving environments: an approach based on voxels and multi-region ground planes. Robotics and Autonomous Systems, 83:299–311, 2016.
[118] Alireza Asvadi, Paulo Peixoto, and Urbano Nunes. Two-stage static/dynamic environment modeling using voxel representation. In Robot 2015: Second Iberian Robotics Conference, pages 465–476. Springer, 2016.
[119] Alireza Asvadi, Paulo Peixoto, and Urbano Nunes. Detection and tracking of moving objects using 2.5D motion grids. In 2015 IEEE 18th International Conference on Intelligent Transportation Systems (ITSC), pages 788–793. IEEE, 2015.
[120] Alireza Asvadi, Pedro Girão, Paulo Peixoto, and Urbano Nunes. 3D object tracking using RGB and LIDAR data. In ITSC, 2016.
[121] Pedro Girão, Alireza Asvadi, Paulo Peixoto, and Urbano Nunes. 3D object tracking in driving environment: A short review and a benchmark dataset. In 2016 IEEE 19th International Conference on Intelligent Transportation Systems (ITSC), pages 7–12. IEEE, 2016.
[122] Paul J. Besl and Neil D. McKay. Method for registration of 3-D shapes. In Robotics-DL Tentative, pages 586–606. International Society for Optics and Photonics, 1992.
[123] Rudolph Emil Kalman. A new approach to linear filtering and prediction problems. Journal of Basic Engineering, 82(1):35–45, 1960.
[124] Yizong Cheng. Mean shift, mode seeking, and clustering. IEEE Transactions on Pattern Analysis and Machine Intelligence, 17(8):790–799, 1995.
[125] J. B. Gao and Chris J. Harris. Some remarks on Kalman filters for the multisensor fusion. Information Fusion, 3(3):191–201, 2002.
[126] Alireza Asvadi, Luis Garrote, Cristiano Premebida, Paulo Peixoto, and Urbano J. Nunes. Multimodal vehicle detection: fusing 3D-LIDAR and color camera data. Pattern Recognition Letters, 2017.
[127] Alireza Asvadi, Luis Garrote, Cristiano Premebida, Paulo Peixoto, and Urbano J. Nunes. DepthCN: Vehicle detection using 3D-LIDAR and ConvNet. In ITSC, 2017.
[128] Cristiano Premebida, Luis Garrote, Alireza Asvadi, A. Pedro Ribeiro, and Urbano Nunes. High-resolution LIDAR-based depth mapping using bilateral filter. In ITSC, pages 2469–2474, 2016.
[129] Alireza Asvadi, Luis Garrote, Cristiano Premebida, Paulo Peixoto, and Urbano J. Nunes. Real-time deep ConvNet-based vehicle detection using 3D-LIDAR reflection intensity data. In Robot 2017: Third Iberian Robotics Conference, 2017.
[130] Isaac Amidror. Scattered data interpolation methods for electronic imaging systems: a survey. Journal of Electronic Imaging, 11(2):157–176, 2002.
[131] Imran Ashraf, Soojung Hur, and Yongwan Park. An investigation of interpolation techniques to generate 2D intensity images from LIDAR data. IEEE Access, 2017.
[132] Thorsten Franzel, Uwe Schmidt, and Stefan Roth. Object detection in multi-view X-ray images, pages 144–154. Springer Berlin Heidelberg, 2012.
[133] Dorin Comaniciu, Visvanathan Ramesh, and Peter Meer. Kernel-based object tracking. IEEE Transactions on Pattern Analysis and Machine Intelligence, 25(5):564–577, 2003.
[134] Jicai Ning, Leiqi Zhang, Dejing Zhang, and Chunlin Wu. Robust mean-shift tracking with corrected background-weighted histogram. IET Computer Vision, 6(1):62–69, 2012.
[135] Andreas Geiger, Philip Lenz, and Raquel Urtasun. Are we ready for autonomous driving? The KITTI vision benchmark suite. In Conference on Computer Vision and Pattern Recognition (CVPR), 2012.
[136] Leonard Plotkin. PyDriver: Entwicklung eines Frameworks für räumliche Detektion und Klassifikation von Objekten in Fahrzeugumgebung. Bachelor's thesis, Karlsruhe Institute of Technology, Germany, 2015.
[137] Andreas Geiger, Martin Lauer, Christian Wojek, Christoph Stiller, and Raquel Urtasun. 3D traffic scene understanding from movable platforms. IEEE Transactions on Pattern Analysis and Machine Intelligence (PAMI), 2014.
[138] Yi Wu, Jongwoo Lim, and Ming-Hsuan Yang. Object tracking benchmark. IEEE Transactions on Pattern Analysis and Machine Intelligence, 37(9):1834–1848, 2015.
[139] Luka Čehovin, Aleš Leonardis, and Matej Kristan. Visual object tracking performance measures revisited. IEEE Transactions on Image Processing, 25(3):1261–1274, 2016.
[140] Matej Kristan, Jiri Matas, Aleš Leonardis, Michael Felsberg, Luka Čehovin, Gustavo Fernandez, Tomas Vojir, Gustav Hager, Georg Nebehay, and Roman Pflugfelder. The visual object tracking VOT2015 challenge results. In Proceedings of the IEEE International Conference on Computer Vision Workshops, pages 1–23, 2015.
[141] Wei Zhong, Huchuan Lu, and Ming-Hsuan Yang. Robust object tracking via sparse collaborative appearance model. IEEE Transactions on Image Processing, 23(5):2356–2368, 2014.
[142] Xu Jia, Huchuan Lu, and Ming-Hsuan Yang. Visual tracking via adaptive structural local sparse appearance model. In 2012 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 1822–1829. IEEE, 2012.