Two-Stage Static/Dynamic Environment Modeling Using Voxel Representation

Alireza Asvadi, Paulo Peixoto, and Urbano Nunes

Institute of Systems and Robotics, Department of Electrical and Computer Engineering, University of Coimbra, Coimbra, Portugal
{asvadi,peixoto,urbano}@isr.uc.pt

Abstract. Perception is the process by which an intelligent system translates sensory data into an understanding of the world around it. Perception of dynamic environments is one of the key components for intelligent vehicles to operate in real-world environments. This paper proposes a method for static/dynamic modeling of the environment surrounding a vehicle. The proposed system comprises two main modules: (i) a module that estimates the ground surface using a piecewise surface fitting algorithm, and (ii) a voxel-based static/dynamic model of the vehicle's surrounding environment obtained using discriminative analysis. The proposed method is evaluated on the KITTI dataset. Experimental results demonstrate its applicability.

Keywords: Velodyne perception, motion detection, dynamic environment, piecewise surface, voxel representation

1 Introduction

Intelligent vehicles have seen a lot of progress recently. An intelligent vehicle is generally composed of three main modules: perception, planning and control. The perception module builds an internal model of the environment using sensor data, while the planning module performs reasoning and makes decisions for future actions based on the current model of the environment. Finally, the control module is responsible for translating actions into commands to the vehicle's actuators. Today, the perception system of an intelligent vehicle is able to sense and interpret the surrounding environment in 3D using sensors such as stereo vision [1], [2] and 3D laser scanners [3], [4]. The data acquired by 3D sensors needs to be processed to build a 3D internal representation of the environment surrounding the vehicle. Awareness of moving objects is one of the key components for the perception of the environment. By detecting, tracking and analyzing the moving objects present in the scene, an intelligent vehicle can predict objects' locations and behaviors and plan its next actions.

In this paper, we propose a voxel-based representation of the dynamic/static three-dimensional environment surrounding a moving vehicle equipped with a Velodyne Lidar and an Inertial Navigation System (GPS/IMU). The main contributions of the present work are: (1) a piecewise surface fitting method for ground estimation and object/ground separation; and (2) a voxel-based discriminative static/dynamic environment modeling. The output of the proposed method is a voxel representation of the environment surrounding a moving vehicle where voxels are classified as being static or dynamic (see Fig. 1).

Fig. 1. The top picture shows an image of a given frame from the KITTI dataset with the result of the proposed method projected onto it. Static/dynamic voxels are shown in green and red, respectively. The bottom picture shows the corresponding static and dynamic cells in the surrounding environment of the intelligent vehicle in 3D.

The remainder of this paper is organized as follows. Section 2 describes the related state of the art. Section 3 describes the proposed voxel-based static/dynamic environment modeling and introduces a piecewise surface fitting algorithm. Experimental results are presented in Section 4, and Section 5 presents some concluding remarks.

2 Related Work

The perception of a 3D dynamic environment captured by a moving vehicle requires a 3D sensor and an ego-motion estimation mechanism. The representation of the environment is another important issue for 3D perception of dynamic environments. Pfeiffer and Franke [5] used a stereo vision system for acquiring 3D data and visual odometry for ego-motion estimation. They proposed the Stixel representation, consisting of sets of thin, vertically oriented rectangles used for the representation of the environment. Stixels are segmented based on motion, spatial and shape constraints, and tracked using a 6D-vision Kalman filter framework, which is a framework for the simultaneous estimation of 3D position and 3D motion.

Asvadi et al. [6] used 2.5D elevation grids to build the environment representation, using as input data from a Velodyne laser scanner and a GPS/IMU localization system. They combined 2.5D grids with localization data to build a local 2.5D map. In every frame, using robust spatial reasoning, the last generated 2.5D grid was compared with the local 2.5D map to detect the 2.5D motion grid. Motion grids are grouped to provide an object-level representation of the scene. Next, they applied data association and Kalman filtering for tracking the grouped motion grids.

Broggi et al. [7] used stereo vision as a 3D sensor. They estimated ego-motion using visual odometry and used it to distinguish between stationary and moving objects. A color-space segmentation of the voxels above the ground plane is also performed, and voxels with similar features are grouped together. Next, the center of mass of each cluster is computed and Kalman filtering is applied to estimate their velocity and position.

Azim and Aycard [8] proposed a method based on the inconsistencies between the observations and local grid maps represented by an OctoMap [9], a 3D occupancy grid with an octree structure. Next, they segmented objects using density-based spatial clustering. Finally, Global Nearest Neighbor (GNN) data association, a Kalman filter for tracking, and an Adaboost classifier for object classification are used. They used data from a Velodyne laser scanner and estimated ego-motion using odometry and scan matching.

A summary of the aforementioned methods is given in Table 1.

Table 1. Some recent related work on 2.5D/3D perception of dynamic environments surrounding a vehicle.

Reference                      | Representation                             | Approach for perception of dynamics
Pfeiffer and Franke, 2010 [5]  | Stixel (2.5D vertical bars in depth image) | Segmentation of stixels based on motion, spatial and shape constraints using a graph cut algorithm
Asvadi et al., 2015 [6]        | 2.5D elevation grid                        | Build a local 2.5D map and compare the last generated 2.5D grid with the local map
Broggi et al., 2013 [7]        | 3D voxel grid                              | Distinguish stationary/moving objects using ego-motion estimation and color-space segmentation of voxels
Azim and Aycard, 2014 [8]      | Octomap                                    | Inconsistencies between map and observations, and density-based spatial clustering

In comparison with these methods, the work presented in this paper contributes a piecewise surface fitting method and a discriminative voxel-based static/dynamic environment modeling, with the voxels around the vehicle being classified as static or dynamic.

3 Proposed Method

In this section, we present an algorithm for a voxel-based representation of the static/dynamic environment surrounding a vehicle equipped with a Velodyne Lidar and an Inertial Navigation System (GPS/IMU). First, at every time step, the point clouds from the m last sensor measurements are integrated. Next, a piecewise surface fitting algorithm is applied to the integrated points to estimate the ground parameters (see Fig. 4-ii). The piecewise surface parameters are used for removing ground points from a point cloud. The voxelization process is performed by quantizing the points and counting the total number of points that fall into each cell. A voxel-based representation of the integrated point clouds and of the last frame is used to build the static/dynamic model of the vehicle's surrounding environment.

3.1 Point Clouds Integration and Ground Estimation

In this section, we present an algorithm for fitting a piecewise surface model to a set of registered and integrated point clouds to estimate the ground geometry. Fig. 2 shows the architecture of the algorithm. In every frame, going over the m previous scans (m is the number of scans to integrate and n is the index of the current scan), the point clouds Pts are loaded and transformed into the current coordinate frame of the vehicle. The integrated point clouds IPts are cropped to a region inside the local grid, which is an area covering 5 to 15 meters ahead of the vehicle, 10 meters on the left and right sides of the vehicle, and 2 meters in height. This procedure is summarized in Algorithm 1. Point clouds are integrated for two purposes: (1) a robust estimation of the ground parameters; and (2) the extraction of the static model of the environment, as explained in the next section.
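As a concrete illustration of this integration step (summarized in Algorithm 1), the following Python/NumPy sketch transforms the last m scans into the coordinate frame of scan n and crops them to the local grid. It is not the authors' implementation: it assumes each pose LoP(i) is given as a 4x4 homogeneous vehicle-to-world transform and that x points forward, y left and z up, none of which is specified in the paper.

import numpy as np

def integrate_point_clouds(scans, poses, n, m):
    """Sketch of Algorithm 1: express scans n-m..n in the frame of scan n
    and crop to the local grid (5-15 m ahead, +/-10 m laterally, 2 m high)."""
    world_to_n = np.linalg.inv(poses[n])                # world -> frame of scan n
    parts = []
    for i in range(n - m, n + 1):
        pts = scans[i]                                  # (N_i, 3) array of points
        hom = np.hstack([pts, np.ones((len(pts), 1))])  # homogeneous coordinates
        parts.append((world_to_n @ poses[i] @ hom.T).T[:, :3])
    ipts = np.vstack(parts)
    x, y, z = ipts[:, 0], ipts[:, 1], ipts[:, 2]
    inside = (x > 5.0) & (x < 15.0) & (np.abs(y) < 10.0) & (z < 2.0)
    return ipts[inside]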

[Fig. 2 block diagram: GPS/IMU localization data and the point cloud enter Algorithm 1 (temporal integration of point clouds); the integrated point clouds enter Algorithm 2 (piecewise ground estimation), which outputs the ground parameters.]

Fig. 2. Point clouds integration and ground estimation.

Algorithm 1 Temporal integration of point clouds.
Inputs: Point cloud of a single scan Pts(i); localization and pose data of the vehicle in Euclidean space given by GPS/IMU measurements LoP(i); number of integrated point clouds m.
Output: Integrated point clouds IPts.
start
  for i : n − m to n
    IPts ← Transform Pts(i) using LoP(i) and LoP(n)
  end
  IPts ← Remove outliers from IPts
end

Points that belong to the ground can cause false detections on undulated roads and slow down the process of building the voxel grid. To address this problem, the ground is cut into stripes according to the car orientation, and a piecewise surface fitting method is proposed to estimate the stripes' ground parameters. The ground estimation process starts with the stripe closest to the host vehicle. Because point clouds in closer regions are denser and measured with smaller localization errors, these stripes are estimated with more confidence. As a pre-processing step, points with a height greater than 1 meter in the first stripe (the closest stripe to the host vehicle) are considered as outliers and rejected. The result is a filtered point cloud in the stripe region, IPtsf. Next, piecewise surface fitting is performed on the inlier points of the stripe's region by fitting a plane using a least-squares method. The estimation of the next stripe starts from the endmost edge of the first stripe in the vehicle's movement direction. A similar approach is used for computing the following stripes (see Fig. 3).


Fig. 3. Piecewise ground estimation using stripes. The vectors show the pose of the vehicle. The initial point for the estimation of the second stripe is shown in red.

Every stripe's parameters are checked for acceptance. The validation process proceeds from the closest stripe to the farthest one. If the slope difference between the previous stripe's plane and the current stripe's plane is less than 15 degrees, the estimate is considered valid and a new plane is initialized; otherwise, the two stripes are considered to belong to the same ground plane and the parameters from the previous stripe are used instead. Each stripe's length is set to 5 meters in the vehicle's movement direction. The process is shown in Algorithm 2. These parameters are used in the next step to exclude ground points while building the voxel grid.

Algorithm 2 Piecewise ground estimation.
Inputs: Integrated point clouds IPts; number of stripes k.
Output: Surface stripes parameter matrix Prm.
start
  for surface stripe's number i : 1 to k
    IPtsf(i) ← Reject outliers from IPts
    Prm(i) ← Least-squares fit on IPtsf(i)
    if ∆(Prm(i), Prm(i − 1)) > threshold
      Prm(i) ← Prm(i − 1)
    end if
  end for
end
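The per-stripe least-squares plane fit and the slope check of Algorithm 2 could be sketched as follows. This is a minimal illustration under stated assumptions: the plane is modeled as z = ax + by + c, stripes are 5 m slices along the driving direction x, and the 1 m outlier rule (stated in the text for the first stripe) is applied to every stripe for simplicity; the 15-degree threshold follows the text.

import numpy as np

def fit_plane(points):
    """Least-squares fit of z = a*x + b*y + c to an (N, 3) point array."""
    A = np.column_stack([points[:, 0], points[:, 1], np.ones(len(points))])
    coeffs, *_ = np.linalg.lstsq(A, points[:, 2], rcond=None)
    return coeffs                                      # (a, b, c)

def slope_diff_deg(p, q):
    """Angle in degrees between the normals of two planes (a, b, c)."""
    n1, n2 = np.array([-p[0], -p[1], 1.0]), np.array([-q[0], -q[1], 1.0])
    c = np.dot(n1, n2) / (np.linalg.norm(n1) * np.linalg.norm(n2))
    return np.degrees(np.arccos(np.clip(c, -1.0, 1.0)))

def piecewise_ground(ipts, k, x0=5.0, stripe_len=5.0, max_diff_deg=15.0):
    """Sketch of Algorithm 2: one plane per stripe, validated against the
    previous stripe; invalid estimates reuse the previous parameters."""
    prm = []
    for i in range(k):
        lo, hi = x0 + i * stripe_len, x0 + (i + 1) * stripe_len
        stripe = ipts[(ipts[:, 0] >= lo) & (ipts[:, 0] < hi)]
        stripe = stripe[stripe[:, 2] < 1.0]            # reject points above 1 m
        if len(stripe) < 3:                            # too few points: reuse previous
            prm.append(prm[-1] if prm else np.zeros(3))
            continue
        plane = fit_plane(stripe)
        if prm and slope_diff_deg(plane, prm[-1]) > max_diff_deg:
            plane = prm[-1]                            # same ground plane as before
        prm.append(plane)
    return prm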

3.2 Voxel-based Static/Dynamic Modeling of the Environment

The surface parameters from the previous step are used for removing the ground points from two groups of point clouds: (1) the integrated point clouds; and (2) the point clouds from single scans. To remove the ground points, the distance between each point and the piecewise surfaces is computed, and all points under the surface are removed. To make the approach more robust against undulated roads, points with heights of less than 20 cm above the surface are also rejected. Next, the point cloud is voxelized and sent to the next module for building the static/dynamic model of the environment.
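Given the stripe parameters, ground removal reduces to computing each point's height above its stripe's plane and keeping only points more than 20 cm above it. The sketch below reuses the stripe layout assumed in the previous listing (stripe index taken from the forward coordinate x); only the 20 cm margin is taken from the text.

import numpy as np

def remove_ground(points, prm, x0=5.0, stripe_len=5.0, margin=0.20):
    """Keep points lying more than `margin` metres above their stripe's plane."""
    prm = np.asarray(prm)                                  # (k, 3) plane parameters
    idx = ((points[:, 0] - x0) // stripe_len).astype(int)
    idx = np.clip(idx, 0, len(prm) - 1)                    # stripe index per point
    a, b, c = prm[idx, 0], prm[idx, 1], prm[idx, 2]
    ground_z = a * points[:, 0] + b * points[:, 1] + c     # plane height at (x, y)
    return points[points[:, 2] - ground_z > margin]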

Voxelization

Voxel grids discretize the space into equally-sized cubic volumes, or voxels, where each voxel contains information about the space it represents. Grids are a memory-efficient dense representation with no dependency on predefined features, which allows them to provide a detailed representation of complex environments. These attributes make the voxel grid an efficient tool for integrating temporal data and representing the surrounding environment of intelligent vehicles in 3D. Here, the voxelization process is performed by quantizing the endpoints of the beams, counting the total number of points that fall into each grid cell, and storing a list of occupied voxels. This simplified model drastically speeds up the process at the cost of discarding information about free and unknown spaces. The size of a voxel in each dimension was chosen to be equal to 10 cm. The selected size provides enough voxels to represent objects and keeps the discretization errors low. Fig. 4 shows a voxel grid with the estimated piecewise surfaces. The two previous groups of point clouds (i.e., the integrated point clouds with removed ground points and the point clouds from single scans with removed ground points) are voxelized and used for static/dynamic modeling of the environment. The voxelization results are the integrated voxel grid IGrd, computed from the integrated point clouds with rejected ground points, and the voxel grids from single scans with removed ground points, Grd[i].
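Since only occupied voxels are stored, the voxelization can be as simple as quantizing the coordinates to 10 cm cells and counting points per cell. A minimal sketch, keyed by integer voxel indices (the authors' actual data structure is not specified):

from collections import Counter
import numpy as np

def voxelize(points, voxel_size=0.10):
    """Quantize points to voxel_size cells and count points per occupied voxel.
    Free and unknown space is not represented."""
    idx = np.floor(points / voxel_size).astype(int)
    return Counter(map(tuple, idx))                    # {(ix, iy, iz): point count}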


Fig. 4. From top to bottom: i) RGB image of the scene, ii) piecewise surface fitting and voxel representation of objects. Notice the curvature of the ground that makes it impossible to model using only one surface. Red lines show corresponding locations on images.

Static/Dynamic Modeling

The main idea behind this section is that, since a moving object occupies different voxels over time, the Velodyne Lidar captures and stores more data in the static voxels than in the voxels belonging to moving objects. Therefore, the voxel values for the static parts of the environment are greater. To exploit this observation, a two-stage process is proposed (Fig. 5). The first stage provides a rough estimation of static/dynamic voxels using a simple subtraction mechanism. The second stage further refines the results using a discriminative analysis of the 2D histograms computed from the output of the first stage.

[Fig. 5 flowchart: Stage 1 (rough approximation of static/dynamic cells) uses Algorithm 3 to remove dynamic voxels from the voxel representation of the integrated point clouds and then removes static voxels from the voxel representation of the last scan; Stage 2 (discriminative analysis of static/dynamic cells) builds 2D histograms of the static and dynamic cells in the X-Y plane, computes the log-likelihood ratio of the histograms using Eq. (1), and produces binary masks of static and dynamic voxels that yield the static and dynamic parts of the environment.]

Fig. 5. Two-stage discriminative static/dynamic environment modeling.

In the first stage, the integrated voxel grid IGrd is compared with each of the last m grids Grd[i], {i : n − m, .., n}, to remove the dynamic voxels of the integrated voxel grid IGrd. In each comparison of IGrd with Grd[i], {i : n − m, .., n}, those voxels of the integrated grid that have the same value as the corresponding voxels in Grd[i] are removed: such voxels of IGrd have been seen only once and therefore more likely belong to a moving object (dynamic cells). The process is shown in Algorithm 3.

Algorithm 3 Procedure of removing the dynamic voxels.
Inputs: Integrated voxel grid IGrd; point cloud of a single scan Pts(i); localization and pose data given by GPS/IMU LoP(i).
Output: Integrated voxel grid with removed dynamic cells.
start
  for i : n − m to n
    Pts(i) ← Transform Pts(i) using LoP(i) and LoP(n)
    Pts(i) ← Remove outliers from Pts(i)
    Grd[i] ← Voxelization of Pts(i)
    Compare IGrd with Grd[i]; remove dynamic cells of IGrd
  end
end

Using this approach, we filter out dynamic voxels and build a rough stationary model of the environment.
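With voxel grids stored as dictionaries of per-voxel point counts (as in the voxelization sketch above), the stage 1 operations can be illustrated as follows. The first function mirrors Algorithm 3's comparison: an integrated-grid voxel whose count equals that of the corresponding voxel in a single-scan grid has been observed only once and is discarded as dynamic. The second function is the subtraction described in the next paragraph, splitting the last scan into rough static and dynamic grids. Both are sketches, not the authors' code.

def remove_dynamic_voxels(igrd, scan_grids):
    """Sketch of Algorithm 3's comparison step: build the rough static model
    by dropping IGrd voxels whose count equals that of a single-scan grid."""
    static_model = dict(igrd)
    for grd in scan_grids:
        for voxel, count in grd.items():
            if static_model.get(voxel) == count:       # seen in only one scan
                static_model.pop(voxel)
    return static_model

def split_last_scan(last_grd, static_model):
    """Subtract the static model from the last scan's voxel grid."""
    static = {v: c for v, c in last_grd.items() if v in static_model}
    dynamic = {v: c for v, c in last_grd.items() if v not in static_model}
    return static, dynamic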

Next, the stationary model is used to remove static voxels from the voxel representation of the last scan using a simple subtraction operation. The outputs of the first stage are roughly approximated static and dynamic grids that are input to the second stage. The output of stage 1 is inaccurate because some parts of very slowly moving objects may have been seen more than once in the same voxels and may therefore wrongly be inserted into the static model of the environment (see Fig. 6-i). To remove such false detections, in the second stage we assume that all voxels in every column of the X-Y plane have the same state (static or dynamic). Based on this assumption, we propose to build 2D histograms of the cells in the X-Y plane. Histograms are built for both the approximated static and dynamic cells from stage 1. The log-likelihood ratio of the 2D histograms of the approximated dynamic and static cells is employed to determine the binary mask of the dynamic voxels. It is computed by:

L_i = \log \frac{\max\{h_d(i),\ \delta\}}{\max\{h_s(i),\ \delta\}}    (1)

where δ is a small value (we set it to 1) that prevents dividing by zero or taking the log of zero, h_d is the 2D histogram computed from the approximated dynamic grid of stage 1, and h_s is the 2D histogram computed from the approximated static grid of stage 1. Cells belonging to the dynamic part have higher values in the computed log-likelihood ratio, static cells have negative values, and cells that are shared by both the dynamic and static parts tend towards zero (see Fig. 6-ii). By applying a thresholding operation to L_i, a 2D binary mask of the dynamic cells is obtained. The 2D mask is applied to all levels of the approximated dynamic grid from stage 1 to generate the final output. A similar approach is used for computing the binary mask of the static voxels. The outputs of stage 2 are voxels labeled as static or dynamic (see Fig. 6-iii).
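A sketch of the stage 2 refinement: 2D histograms over (x, y) voxel columns are built for the rough static and dynamic grids, the per-column log-likelihood ratio of Eq. (1) is thresholded into a binary mask, and the mask is applied to every level of the column. Whether the histograms count occupied voxels or points, and the exact threshold value, are not stated in the paper, so both are assumptions here.

import numpy as np
from collections import defaultdict

def column_histogram(grid):
    """2D histogram over (ix, iy) columns: occupied voxels per column (assumed)."""
    hist = defaultdict(int)
    for (ix, iy, iz) in grid:
        hist[(ix, iy)] += 1
    return hist

def dynamic_column_mask(static_grid, dynamic_grid, delta=1.0, threshold=0.0):
    """Per-column L_i = log(max(h_d, delta) / max(h_s, delta)), Eq. (1),
    thresholded into the set of dynamic columns."""
    hs, hd = column_histogram(static_grid), column_histogram(dynamic_grid)
    mask = set()
    for col in set(hs) | set(hd):
        ratio = np.log(max(hd.get(col, 0), delta) / max(hs.get(col, 0), delta))
        if ratio > threshold:
            mask.add(col)
    return mask

def apply_column_mask(grid, dynamic_columns):
    """Label every voxel of a grid by the state of its (x, y) column."""
    dynamic = {v: c for v, c in grid.items() if v[:2] in dynamic_columns}
    static = {v: c for v, c in grid.items() if v[:2] not in dynamic_columns}
    return static, dynamic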

Fig. 6. From top to bottom: i) static and dynamic voxels output by stage 1, ii) the process of computing the binary mask of the dynamic voxels, iii) static and dynamic voxels output by stage 2 after discriminative analysis. For a better visualization, the voxel grids output by stages 1 and 2 were projected and displayed onto the RGB image. Static/dynamic voxels are shown in green and red, respectively.

4 Experimental Results

The presented algorithm was tested on the KITTI dataset [10]. The proposed method is currently implemented in MATLAB. Ground truth for the task of static/dynamic environment modeling or motion detection is not yet available; therefore, we performed a qualitative evaluation. In the following sections, we describe the dataset and present experimental results.

4.1 Dataset

The KITTI dataset was captured using a Velodyne 3D laser scanner and a high-precision GPS/IMU inertial navigation system. The Velodyne HDL-64E spins at 10 frames per second, with a vertical resolution of 64 layers and an angular resolution of 0.09 degree. The maximum recording range is 120 m. The inertial navigation system is an OXTS RT3003 inertial and GPS system, recording at 100 Hz with a resolution of 0.02 m / 0.1 degree.

4.2 Evaluation

In order to evaluate the performance of the proposed algorithm, a variety of challenging sequences were used. The eight most representative sequences are summarized in Fig. 7, in which each row corresponds to one sequence. The proposed method detects and classifies dynamic/static voxels around moving vehicles when they enter the local perception field.

Fig. 7. Sample screenshots of the results obtained in the considered sequences. Static/dynamic voxels are shown in green and red, respectively. Each row represents one sequence; from left to right, the results obtained at different time instants are shown.

In the first sequence, our method detects a cyclist and a car (while they are in the perception field) as dynamic objects within the scene and models the walls and stopped cars as part of the static model of the environment. The second sequence shows a downtown area, where the proposed method successfully models moving pedestrians and cyclists as the dynamic part of the environment. The third sequence shows a crossing scenario, in which our method models almost all passing pedestrians as dynamic voxels. Sequence number 4 shows another challenging downtown scenario, with moving objects ranging from pedestrians and groups of pedestrians to cyclists. The proposed method successfully models the moving objects as a dynamic part of the environment.

4.3 Computational Analysis


There is a compromise between the computational cost and the performance of the static/dynamic modeling of the proposed method. Increasing the number of integrated scans leads to a better removal of the dynamic cells and generates a stronger static/dynamic model; however, it adds computational cost and makes the method slower. On the other hand, fewer integrated scans make the environment model weaker. The proposed method produces stable results when the number of integrated frames is more than 10; the number of integrated scans of the proposed algorithm is set to 20. The computational cost of the proposed method depends on the size of the local grid, the size of a voxel, the number of integrated frames, and the number of non-empty voxels, since only non-empty voxels are indexed and processed. We performed the experiment on the first scenario of the previous section with a fixed-size local grid. The scenario has, on average, nearly 1% non-empty voxels. The size of a voxel and the number of integrated point clouds are two key parameters that correspond to the spatial and temporal properties of the proposed algorithm, respectively, and directly impact the computational cost of the method. The experiment was carried out on a quad-core 3.4 GHz processor with 8 GB RAM under MATLAB R2013a. The average speed of the proposed algorithm (frames per second) as a function of each parameter (voxel size and number of integrated frames) is reported in Fig. 8. As can be seen, the number of integrated frames has the greatest impact on the computational cost of the proposed method. The proposed method configured with the default parameters runs at 1.05 fps.

[Fig. 8 plots: average speed (fps) versus the number of integrated frames (5 to 35) and versus the voxel size (0.1 to 0.7 m).]

Fig. 8. Computational analysis of the proposed method.

5 Concluding Remarks and Future Work

The 3D perception of dynamic environments is one of the key components for intelligent vehicles to operate in real-world environments. In this paper, we propose a voxel-based representation of the dynamic/static three-dimensional environment surrounding a moving vehicle equipped with a Velodyne Lidar and an Inertial Navigation System (GPS/IMU). A piecewise surface fitting algorithm is proposed to estimate the ground surface and remove the ground points from the point clouds. A discriminative voxel-based static/dynamic environment modeling is proposed, with voxels classified into static and dynamic classes. The proposed method was evaluated on the KITTI dataset, and the experimental results demonstrate its applicability. We propose two directions for future work. First, color information from the image can be incorporated to provide a more robust static/dynamic environment modeling. Second, the dynamic model of the environment can be investigated for object detection and tracking purposes.

Acknowledgments. This work has been supported by the FCT project "AMSHMI2012 - RECI/EEIAUT/0181/2012" and the project "ProjB-Diagnosis and Assisted Mobility - Centro-07-ST24-FEDER-002028" with FEDER funding, programs QREN and COMPETE.

References

1. C. Laugier, I.E. Paromtchik, M. Perrollaz, M.Y. Yong, J-D. Yoder, C. Tay, K. Mekhnacha, and A. Negre, Probabilistic analysis of dynamic scenes and collision risks assessment to improve driving safety, IEEE Intelligent Transportation Systems Magazine, vol. 3, no. 4, pp. 4-19, 2011.
2. J. Ziegler, P. Bender, M. Schreiber, H. Lategahn et al., Making Bertha drive - an autonomous journey on a historic route, IEEE Intelligent Transportation Systems Magazine, vol. 6, no. 2, pp. 8-20, 2014.
3. C. Urmson, J. Anhalt, D. Bagnell, C. Baker, R. Bittner, M. N. Clark et al., Autonomous driving in urban environments: Boss and the urban challenge, Journal of Field Robotics, vol. 25, no. 8, pp. 425-466, 2008.
4. M. Montemerlo, J. Becker, S. Bhat, H. Dahlkamp, D. Dolgov, S. Ettinger, D. Haehnel et al., Junior: The Stanford entry in the urban challenge, Journal of Field Robotics, vol. 25, no. 9, pp. 569-597, 2008.
5. D. Pfeiffer and U. Franke, Efficient representation of traffic scenes by means of dynamic stixels, in IEEE Intelligent Vehicles Symposium (IV), pp. 217-224, 2010.
6. A. Asvadi, P. Peixoto, and U. Nunes, Detection and tracking of moving objects using 2.5D motion grids, in 18th International IEEE Conference on Intelligent Transportation Systems (ITSC), 2015.
7. A. Broggi, S. Cattani, M. Patander, M. Sabbatelli, and P. Zani, A full-3D voxel-based dynamic obstacle detection for urban scenario using stereo vision, in 16th International IEEE Conference on Intelligent Transportation Systems (ITSC), 2013.
8. A. Azim and O. Aycard, Layer-based supervised classification of moving objects in outdoor dynamic environment using 3D laser scanner, in IEEE Intelligent Vehicles Symposium (IV), pp. 1408-1414, 2014.
9. A. Hornung, K. M. Wurm, M. Bennewitz, C. Stachniss, and W. Burgard, OctoMap: An efficient probabilistic 3D mapping framework based on octrees, Autonomous Robots, vol. 34, no. 3, pp. 189-206, 2013.
10. A. Geiger, P. Lenz, C. Stiller, and R. Urtasun, Vision meets robotics: The KITTI dataset, The International Journal of Robotics Research, vol. 32, no. 11, pp. 1231-1237, 2013.
