Intelligent Systems Conference 2018 6-7 September 2018 | London, UK

Convolution Neural Network Application for Road Asset Detection and Classification in LiDAR Point Cloud

George E. Sakr, Lara Eido, Charles Maarawi
Faculty of Engineering (ESIB), St. Joseph University of Beirut
Email: [email protected]

Abstract—Self-driving cars (or autonomous cars) can sense and navigate through an environment without any driver intervention. To achieve this task, they rely on vision sensors working in tandem with accurate algorithms to detect movable and non-movable objects around them. These vision sensors typically include cameras to identify static and non-static objects, Radio Detection and Ranging (RADAR) to detect the speed of moving objects using the Doppler effect, and Light Detection and Ranging (LiDAR) to measure the distance to objects. In this paper, we explore a new use of LiDAR data to classify static objects on the road. We present a pipeline to classify point cloud data grouped in volumetric pixels (voxels), and introduce a novel approach to point cloud data representation for processing within Convolution Neural Networks (CNN). Results show an accuracy exceeding 90% in the detection and classification of road edges, solid and broken lane markings, bike lanes, and lane center lines. Our data pipeline is capable of processing up to 20,000 points per 900 ms on a server equipped with two 8-core Intel Xeon processors with Hyper-Threading (32 threads in total) and two NVIDIA Tesla K40 GPUs. Our model outperforms ResNet applied to camera images of the same road by 2%.

Keywords—Self Driving Car; Convolution Neural Networks; LiDAR Point Cloud

I. INTRODUCTION

As the industrial age continues to unfold, self-driving cars are seen as the next logical step in automation [1]. It is estimated that self-driving cars will lead to many environmental benefits with regard to fuel economy [2], [3] through the optimization of highway traffic flow [4], [5], the reduction of the vehicle fleet to only 15% of its current size by leveraging the sharing economy [6], and the enabling of platoon driving, which is projected to save 20-30% of fuel consumption [7]. In addition, self-driving cars are expected to cause a decline in accident rates, which constitute the eighth highest cause of death worldwide according to the World Health Organization [8]. Other consequences of handing the reins over to autonomous systems are stress reduction [9] and an increase in available parking space by 75% of the current capacity [10]. Leading technology and automotive companies in this space are targeting the end of the next decade as the start of the self-driving car era, in which the number of self-driving cars will surpass the number of regular cars.

For cars to drive themselves safely, they need algorithms which are very accurate in detecting and tracking the position of both moving and non-moving objects. Generally, in machine vision, this can be achieved using two different approaches: the more traditional image processing techniques with hand-coded features and labels, or a more automated machine learning approach, where Artificial Neural Networks learn features within the data. Both approaches are well established in this domain, and there is no agreed-upon answer as to which is better, as both are common in the industry. In this study, we evaluate the latter by employing CNN for the detection and classification of pseudo-static roadway features in LiDAR point cloud data. The novelty of this research is summarized in three points:

1) Using LiDAR point cloud to detect and classify static road assets, whereas the usual use of LiDAR is to calculate the distance between the car and other objects.
2) A new data pipeline that transforms the 3D point cloud into a 2D hyper-image that is compatible with CNN.
3) A new CNN-based deep architecture for high accuracy in detection and classification of static road assets.

The rest of the paper is organized as follows: Section II discusses previous techniques of road asset detection. Section III describes the experimental setup, the data collection procedure and the LiDAR data format. Section IV introduces the network of choice in our study (CNN) and discusses its advantages. Section V introduces a novel 3D data representation method that allows the data to fit the CNN model, and Section VI describes the CNN architecture. The experimental results and the comparison with CNN on camera images are presented in Section VII, and finally the execution time of the pipeline is discussed in Section VIII.

II. RELATED WORK

In this section we present some of the previous research on road asset detection. Low et al. combine Canny edge detection and the Hough transform for lane marker detection. The system takes as input images captured by a forward-facing vision sensor placed behind the windscreen. Canny edge detection performs feature extraction, which is followed by a Hough transform for lane generation [11]. This method requires empirical tuning of multiple parameters and manual selection of the features to detect. Aly presented a real-time approach to lane marker detection in urban streets by generating top-view images and using selective oriented Gaussian filters and a fast RANSAC algorithm to fit Bezier splines through the lane markings; this method achieved results comparable to previous techniques [12]. Weng et al. use LiDAR point clouds for traffic sign recognition. Their algorithm starts by using the intensity of the returned point cloud to detect traffic signs, which are known to be painted with highly reflective materials. Then, they make use of the geometric shape and the pairwise 3D shape context to make the final classification. The results show that the proposed method is efficient in detecting and classifying traffic signs from LiDAR point clouds [13]. The main issue with the aforementioned research is the need to manually handcraft the features that the algorithm should monitor in order to perform detection. In the deep learning approach, the first layers of the neural network automatically learn important features and feed them to the deeper part of the network for detection. Caltagirone et al. [14] took advantage of CNN for road detection using only LiDAR data. The LiDAR point cloud was used to create top-down images encoding several basic statistics such as elevation and density. Those top-view images are 2D images and can be fed to a CNN in the traditional way. The shortcoming of this method is that it removes an entire dimension from the data, which could contain valuable information. Maturana et al. [15] coupled a volumetric occupancy map with a 3D CNN to detect small and potentially obscured obstacles in vegetated terrain. However, their work did not cover road assets such as lines and road edges; instead, they were interested in differentiating low vegetation that is safe to land on from solid objects that might be hazardous for landing.

This paper specifically aims to present a pipeline that takes incoming LiDAR pulses and detects and classifies different static road assets. This research focuses on the detection and classification of road edges, yellow single solid lines, yellow single broken lines, bike lanes and lane center lines using deep neural networks. To achieve this aim, a new method is introduced at the start of the pipeline to transform the 3D LiDAR point cloud into hyper-images compatible with CNN. The pipeline terminates with a new CNN architecture that detects and classifies static road assets from the transformed hyper-images. The realization of this pipeline requires an annotated dataset of static road assets; the data collection and annotation are described in the next section.

III. DATA COLLECTION AND FORMATTING

Training a classifier to detect and classify road assets requires the collection of annotated, structured data. This section describes the data collection phase as well as the format of the acquired data.

A. Data Collection

In this study we evaluate our extraction and classification methodology on LiDAR point cloud data collected with a mobile mapping setup. A survey-grade Inertial Navigation System (INS), a Novatel IGM-S1 STIM, was used in combination with a laser scanner, a Velodyne HDL-32 LiDAR mounted tilted backwards at a 45-degree pitch angle, a mounting configuration appropriate for surveying and mapping applications. The sensors were time-synchronized using a pulse-per-second (PPS) signal originating from the GPS receiver. The INS was used to measure the vehicle trajectory through the environment, and the LiDAR was used to extract points in the vicinity of the vehicle. Data was collected from a single trip on Marin Ave in Berkeley, CA, and registered to the UTM (Zone 10N) frame of reference to create a point cloud. Relative fit and loop closure methods were employed to ensure a point cloud free of artifacts. To create a ground truth dataset from the input data, a map of 3D vector splines was generated by manual annotation using custom-made tools.

B. Data Format and Annotation

The LiDAR outputs a point cloud, which is, at its most basic, a collection of points in three dimensions, each point having x, y and z coordinates defined in a coordinate frame of reference. In addition to geometry, each point has an intensity value, which corresponds to the amount of light reflected back to the laser scanner. However, the returned points are not annotated and the reflecting object cannot be identified directly, hence the need for the manually annotated ground truth map. In its simplest form, the ground truth map is a JSON file that contains a summary of the static objects found in the point cloud. A triangular road sign is represented by its 3 vertices, and any point from the point cloud that falls within these 3 vertices is annotated as a road sign. Similarly, a road edge is represented by a spline, which is a sequence of continuous line segments; if a point falls on any road-edge spline it is annotated as a road edge. The same process applies to bike lanes and to solid and broken lines. In general, we examined the raw points that fall within a shape defined in the map and applied the appropriate annotation. Points that do not fall within any predefined shape receive the NULL label. The NULL label is extremely important, as it is used to test whether the final model detects an asset or misses it and labels it as NULL. The program responsible for labeling the points was written in C++ and used 32 threads to accelerate the process, since the server can run up to 32 threads simultaneously. The described method generates an annotated point cloud, which is used as the input to our classifier. We must therefore define which classifier to use and how the data should be formatted to be compatible with it; the next section addresses both questions.
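The labeling program itself is not listed in the paper (the original implementation was multithreaded C++ with JSON libraries). For illustration only, a minimal Python sketch of the point-labeling logic described above is shown below; the ground-truth schema, field names and the nearest-vertex spline test are assumptions, and the spline check is a crude approximation rather than a true point-to-segment distance.

```python
import json
import numpy as np

def point_in_triangle(p, a, b, c):
    # Barycentric sign test in the x-y plane.
    def sign(p1, p2, p3):
        return (p1[0] - p3[0]) * (p2[1] - p3[1]) - (p2[0] - p3[0]) * (p1[1] - p3[1])
    d1, d2, d3 = sign(p, a, b), sign(p, b, c), sign(p, c, a)
    return not (((d1 < 0) or (d2 < 0) or (d3 < 0)) and ((d1 > 0) or (d2 > 0) or (d3 > 0)))

def label_points(points, ground_truth_path, spline_tol=0.15):
    """Assign a class label to every (x, y, z, r) point; unmatched points get 'NULL'."""
    with open(ground_truth_path) as f:
        gt = json.load(f)  # assumed schema: {"signs": [{"vertices": [...], "class": ...}],
                           #                  "splines": [{"points": [...], "class": ...}]}
    labels = []
    for x, y, z, r in points:
        label = "NULL"
        for sign_def in gt.get("signs", []):
            if point_in_triangle((x, y), *sign_def["vertices"]):
                label = sign_def["class"]
                break
        if label == "NULL":
            for spline in gt.get("splines", []):
                verts = np.asarray(spline["points"], dtype=float)[:, :2]
                # Crude test: the point lies close to some vertex of the spline polyline.
                if np.min(np.linalg.norm(verts - np.array([x, y]), axis=1)) < spline_tol:
                    label = spline["class"]
                    break
        labels.append(label)
    return labels
```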

IV. CNN

Pattern recognition and image captioning were greatly impacted by Convolution Neural Networks [16], [17]. The challenging part before the CNN era was handcrafting the features and tuning the classifier. CNN solved this challenge by giving us the ability to learn the features directly from the training data. CNN also has an architecture that makes it particularly efficient in image recognition, by implementing the convolution operation that captures the 2D nature of an image. CNN made only modest progress in commercial use over the past twenty years [18]; however, its adoption has grown exponentially in the last seven years for two main reasons: first, the publication of large, labeled data sets such as the Large Scale Visual Recognition Challenge (ILSVRC) [19], and second, the development of the massively parallel graphics processing unit (GPU), which is used to accelerate the weight optimization process for CNN.

One drawback of a regular neural network is that it treats all pixels on equal terms: every neuron is connected to all the pixels of the image, so far-away pixels are treated the same way as close pixels, which destroys, from the neuron's perspective, the spatial structure of the image. In contrast, CNN takes advantage of the spatial structure in an image through its special architecture, which is greatly suited to image recognition. A deep neural network is usually made up of a succession of convolution layers, pooling layers and dense layers. There are of course other types of deep networks, but in this research we used the traditional structure of a deep network.

The convolution layer is based on a small local receptive field, also called a kernel. The size of this receptive field is a hyper-parameter that must be tuned using validation data. For example, if the kernel size is 3 x 3, then every neuron in the convolution layer is connected to 9 pixels. The kernel slides horizontally and vertically by a certain number of pixels called the stride, which is also a hyper-parameter that should be tuned. Whenever the kernel slides, it covers a new part of the image, which is itself connected to a different neuron, so every receptive field is connected to a different neuron. The second important principle in a convolution layer is weight sharing: all the neurons in the same layer share the same weights. This gives the network the ability to learn the same pattern across the image; hence, if the layer learns to detect horizontal edges in one part of an image, it is also able to detect them anywhere in the image. This technique makes the convolution layer shift-invariant, but it also restricts the layer to learning one thing in an image. This is why a full convolution layer is made by stacking many feature maps on top of each other, so that each one learns to detect a different pattern across the whole image. To reduce overfitting, a dropout layer can be used after the convolution stage; a dropout layer simply drops some of the neurons with a probability p that is also tuned on the validation set. Finally, the output of the convolution layers is flattened into one vector, which forms the input of the dense layer. The dense layer is a regular fully connected layer that brings together all the learnt features from the convolution part and uses a softmax to output the decision.
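To make the hyper-parameters above concrete, here is a minimal Keras sketch (not taken from the paper) of one convolution layer with a 3 x 3 kernel and stride 1, followed by dropout and a softmax dense layer; the toy input shape and layer sizes are illustrative assumptions.

```python
from tensorflow import keras
from tensorflow.keras import layers

# 32 shared 3 x 3 kernels sliding with stride 1: each neuron sees a 3 x 3
# receptive field, and all neurons of a feature map share the same weights,
# which is what makes the layer shift-invariant.
inputs = keras.Input(shape=(32, 32, 3))               # toy RGB image
x = layers.Conv2D(filters=32, kernel_size=(3, 3), strides=(1, 1), activation="relu")(inputs)
x = layers.Dropout(0.2)(x)                            # drop neurons with probability p = 0.2
x = layers.Flatten()(x)                               # flatten the learnt feature maps
outputs = layers.Dense(10, activation="softmax")(x)   # dense layer + softmax decision
model = keras.Model(inputs, outputs)
model.summary()
```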

V. DATA REPRESENTATION AND LABELING

As described in the previous section, CNN expects a list of 2D images as training input. However, the point cloud is just a list of annotated quadruples (x, y, z, r). In this section we present the algorithm that creates hyper-images from the point cloud; hyper-images are designed to be compatible with CNN. This section also presents the labeling method used to give every created hyper-image a corresponding label.

A. Transformation Algorithm

The algorithm for transforming the point cloud into a compatible hyper-image is defined as follows:

1) Find the minimum and maximum values of x, y and z in the point cloud.
2) Create a 1 m³ voxel starting at xmin, ymin and zmin.
3) Split the voxel into 1000 sub-voxels of 10 × 10 × 10 cm³ (Figure 1).
4) Find all the points from the point cloud that fall within every sub-voxel.
5) Give every sub-voxel a value equal to the average reflectivity of the points that fall inside it.
6) Slide the voxel by a stride of 25 cm in each direction.
7) Repeat from step 3 until xmax, ymax and zmax are reached.

Fig. 1. A 1 m³ voxel divided into 1000 sub-voxels of 10 × 10 × 10 cm³.

Fig. 2. Flattening of the bottom-right vector of the 3D voxel: (a) before, the ten sub-voxels along the z-axis; (b) after, a single pixel with 10 components, e.g. [9,8,7,6,5,4,3,2,1,0].

The above steps transform the point cloud into a group of 3D-like images. The final step is to create the hyper-images, which are 2D-like images. Normally a 2D image is formed by a group of pixels, each having R, G and B components. We decided to take every sub-voxel of the x-y surface and give it 10 components instead of the 3 (RGB) components. Those 10 components are the values of the sub-voxels lying along the z-axis above the surface sub-voxel. Figure 2a shows the bottom-right vector along the z-axis before the flattening procedure and Figure 2b shows the single pixel with its 10 components that replaces the original vector. Hence a big voxel is now a hyper-image with a width and height of 10 pixels, where each pixel has 10 components instead of the regular 3 (RGB) components. In summary, every hyper-image has shape (10, 10, 10). Finally, all N hyper-images are generated and grouped into one array, making the training set a 4D array of shape (N, 10, 10, 10), which is perfectly compatible with the input of a CNN. This transformation also allows the localization of a detected asset to within 1 m³, because it scans the input point cloud, creates 1 m³ hyper-images and classifies them. The result of this transformation is a set of unlabeled hyper-images.
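The paper does not list code for this transformation; the following is a NumPy sketch of steps 1-7 plus the flattening described above. The function name, exact boundary handling and stride loop are illustrative assumptions.

```python
import numpy as np

def point_cloud_to_hyper_images(points, voxel=1.0, sub=0.1, stride=0.25):
    """points: (M, 4) array of (x, y, z, reflectivity).
    Returns an (N, 10, 10, 10) array of hyper-images, one per 1 m^3 voxel position."""
    xyz, refl = points[:, :3], points[:, 3]
    mins, maxs = xyz.min(axis=0), xyz.max(axis=0)          # step 1
    n = int(round(voxel / sub))                            # 10 sub-voxels per axis
    hyper_images = []
    x0 = mins[0]
    while x0 <= maxs[0]:                                   # steps 2, 6, 7: slide the voxel
        y0 = mins[1]
        while y0 <= maxs[1]:
            z0 = mins[2]
            while z0 <= maxs[2]:
                origin = np.array([x0, y0, z0])
                inside = np.all((xyz >= origin) & (xyz < origin + voxel), axis=1)
                grid = np.zeros((n, n, n))
                if inside.any():
                    # steps 3-5: average reflectivity per 10 cm sub-voxel
                    idx = np.floor((xyz[inside] - origin) / sub).astype(int).clip(0, n - 1)
                    sums = np.zeros((n, n, n))
                    counts = np.zeros((n, n, n))
                    np.add.at(sums, tuple(idx.T), refl[inside])
                    np.add.at(counts, tuple(idx.T), 1.0)
                    grid = np.divide(sums, counts, out=grid, where=counts > 0)
                # The z-axis (last dimension) plays the role of the 10 "channels",
                # i.e. the flattening of Fig. 2: each x-y pixel has 10 components.
                hyper_images.append(grid)
                z0 += stride
            y0 += stride
        x0 += stride
    return np.stack(hyper_images) if hyper_images else np.empty((0, n, n, n))
```

A call such as `x = point_cloud_to_hyper_images(points)` would then yield the (N, 10, 10, 10) array described above, ready to be fed to a CNN.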


B. Labeling the Hyper-Images

The hyper-images are formed by flattened sub-voxels, and every sub-voxel contains a number of annotated points generated from the JSON file. We define the dominant sub-label of a sub-voxel as the most frequent asset among all the points that fell inside it. The labeling rules are as follows (a sketch is given after the list):

1) Every sub-voxel is sub-labeled by its dominant sub-label.
2) The dominant label of a hyper-image is the most frequent sub-label among all of its sub-voxels.
3) The hyper-image is labeled by its dominant label.
4) If all sub-voxels are sub-labeled NULL, then the hyper-image receives the NULL label.
5) If the dominant label is NULL but other sub-voxels are not sub-labeled NULL, then the second most frequent sub-label is given to the hyper-image.
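A minimal Python sketch of these five rules, assuming the dominant sub-labels have already been computed per sub-voxel (the helper name is illustrative):

```python
from collections import Counter

def label_hyper_image(sub_labels):
    """sub_labels: the dominant sub-labels of a hyper-image's sub-voxels (one per sub-voxel)."""
    counts = Counter(sub_labels)
    if set(counts) == {"NULL"}:
        return "NULL"                    # rule 4: every sub-voxel is NULL
    ranked = [label for label, _ in counts.most_common()]
    if ranked[0] != "NULL":
        return ranked[0]                 # rules 1-3: dominant non-NULL label wins
    return ranked[1]                     # rule 5: fall back to the second most frequent label
```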

The output of the above algorithm is a dataset of labeled hyper-images, which is fed into the CNN-based deep architecture introduced in the next section.

VI. CLASSIFIER ARCHITECTURE

The classifier is designed to minimize the categorical cross-entropy error defined by

L = -\sum_{i=1}^{N} \sum_{c=1}^{C} t_{ic} \log(y_{ic})

where N is the total number of samples in the batch, C is the total number of classes, t_{ic} = 1 if and only if sample i belongs to class c, and y_{ic} is the predicted probability that sample i belongs to class c.

The training set of hyper-images is fed into the classifier, which computes a proposed class for each image. The proposed class is compared to the desired class for that hyper-image and the weights are adjusted to minimize the error; in minimizing the error, the network brings its output closer to the desired output. The weight adjustment is accomplished using back-propagation as implemented in the Keras machine learning package. The weights are updated after each batch passes through the network: the error on that batch propagates backwards and every weight is adjusted by a small amount proportional to the error, the proportionality factor being the learning rate α. An epoch is completed when all the batches have passed through the network, and the weights are saved after every epoch. Early stopping was used with a patience of 3 epochs, which stops the optimization process after 3 consecutive epochs in which the validation error does not decrease; the weights that yielded the smallest error are kept. A block diagram of our training system is shown in Figure 3. Once trained, the network classifies hyper-images as shown in Figure 4.

Fig. 3. Training phase: point cloud → transformation algorithm → hyper-images → classifier → network output, compared with the correct class of the hyper-image to compute the error used for back-propagation weight adjustment.

Fig. 4. Testing phase: LiDAR measurements → transformation algorithm → hyper-image → classifier → predicted class.

The classifier block is a deep neural network made up of convolution layers, dropout layers and fully connected dense layers. The number of convolution layers and dense layers, as well as the dropout probability, are varied to obtain a better accuracy. The kernel size of every convolution layer and the stride at which it slides over the image are also parameters to tune. Finally, the batch size is usually set based on the amount of memory available. All of the above-mentioned parameters and their values are discussed in the next section.

VII. RESULTS AND DISCUSSION

This section describes the architectures used to detect and classify 2, 4 and 6 different road assets. It also discusses the tuning of the different network parameters used to achieve this accuracy. The accuracy of our model is compared to ResNet's [20] accuracy on camera images; the ResNet was tuned on images corresponding to the same LiDAR point cloud.

A. Detecting Bike Lanes

The first models were evaluated by classifying two classes: Bike Lane and Null (images that do not correspond to any road asset). The data set was balanced between the two classes as well as between training and testing: 51,566 images for training, of which 5,156 were used for validation, and 58,014 images for testing, which were never used during the training phase. The testing and training data each belong to a different part of the road, which allows us to demonstrate the ability of our classifier to generalize to unseen scenarios. Many different architectures were evaluated and the one that yielded the highest validation accuracy was used on the testing set. Table I shows the results for the different architectures on the validation set. Note that the kernel size of the convolution layers and the dropout probability were also varied, and the reported accuracy is the one for the configuration that yielded the highest accuracy. The dropout layer (if used) is placed after all the convolution layers and before the dense layers; we also tried placing it between the convolution layers but did not obtain a better accuracy. The reference ResNet accuracy on camera images for this model is 96%.

TABLE I. 2-CLASS VALIDATION ACCURACY

Model (Convolutional, Dropout, Dense)    Accuracy (%)
7 Convolutional, 1 Dropout, 4 Dense      97
7 Convolutional, 1 Dropout, 3 Dense      98.3
7 Convolutional, 1 Dropout, 2 Dense      98.4
7 Convolutional, 0 Dropout, 3 Dense      97.8
6 Convolutional, 1 Dropout, 2 Dense      97.7
5 Convolutional, 1 Dropout, 2 Dense      97.6
4 Convolutional, 1 Dropout, 2 Dense      98
4 Convolutional, 0 Dropout, 2 Dense      96.8
3 Convolutional, 1 Dropout, 2 Dense      97.3


As can be seen in Table I, the architecture that yields the highest accuracy on the validation set is the one with 7 convolution layers, 1 dropout layer and 2 dense layers. All convolution layers used the ReLU activation function and zero padding. The first convolution layer used 16 kernels of size 3 x 3, the second 32 kernels of size 3 x 3, the third 64 kernels of size 3 x 3, and the last four 128 kernels of size 3 x 3. The convolution layers were followed by one dropout layer with a dropout probability of 0.2. Two dense layers were used after the dropout layer: the first had a size of 512 with a ReLU activation function and was followed by another dense layer of size 2 with a softmax activation function. The batch size for training was 128. The Adam optimizer was used with an adaptive learning rate to minimize the categorical cross-entropy loss. This model gave an accuracy of 98.1% over the testing set, a high accuracy given that the testing set represents a part of the road that the network had never seen. This shows that our classifier is able to generalize to unseen scenarios. It also shows that LiDAR-CNN outperforms ResNet by more than 2%.
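For concreteness, here is a hedged Keras sketch of this best-performing 2-class architecture and its training setup, following the description above and Section VI. "Zero padding" is interpreted as Keras padding="same" (seven 3 x 3 valid convolutions would shrink a 10 x 10 input to nothing), and the epoch count and checkpoint file names are illustrative assumptions, not the authors' code.

```python
from tensorflow import keras
from tensorflow.keras import layers

def build_two_class_model():
    # 7 convolution layers (16, 32, 64, then four layers of 128 kernels),
    # all 3 x 3 with ReLU and zero padding.
    model = keras.Sequential([
        keras.Input(shape=(10, 10, 10)),
        layers.Conv2D(16, 3, padding="same", activation="relu"),
        layers.Conv2D(32, 3, padding="same", activation="relu"),
        layers.Conv2D(64, 3, padding="same", activation="relu"),
        layers.Conv2D(128, 3, padding="same", activation="relu"),
        layers.Conv2D(128, 3, padding="same", activation="relu"),
        layers.Conv2D(128, 3, padding="same", activation="relu"),
        layers.Conv2D(128, 3, padding="same", activation="relu"),
        layers.Dropout(0.2),
        layers.Flatten(),
        layers.Dense(512, activation="relu"),
        layers.Dense(2, activation="softmax"),
    ])
    model.compile(optimizer="adam",
                  loss="categorical_crossentropy",
                  metrics=["accuracy"])
    return model

def train(model, x_train, y_train, x_val, y_val):
    """x_*: (N, 10, 10, 10) hyper-images, y_*: one-hot labels."""
    callbacks = [
        # Early stopping with a patience of 3 epochs, as in Section VI.
        keras.callbacks.EarlyStopping(monitor="val_loss", patience=3,
                                      restore_best_weights=True),
        # Save the weights after every epoch (file name pattern is illustrative).
        keras.callbacks.ModelCheckpoint("weights_{epoch:02d}.h5",
                                        save_weights_only=True),
    ]
    return model.fit(x_train, y_train, batch_size=128, epochs=100,
                     validation_data=(x_val, y_val), callbacks=callbacks)
```

The 4-class and 6-class models described below differ only in the number of convolution layers, the presence of dropout, and the size of the final softmax layer.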

B. Four Assets

In this part we present models evaluated to classify four classes: Bike Lane, Lane Centerline, Road Edge and Null. The data set was balanced between the four classes (21,234 images of each class). The images were shuffled, and 61,153 images were used for training, 6,795 for validation and 16,988 for testing (20% of the total number of images). Many different architectures were evaluated and the one that yielded the highest validation accuracy was used on the testing set. Table II shows the results for the different architectures on the validation set. Note that the kernel size of the convolution layers and the dropout probability were also varied, and the reported accuracy is the one for the configuration that yielded the highest accuracy. The reference ResNet accuracy for this experiment is 92%, obtained using the same procedure as in the 2-class experiment.

TABLE II. 4-CLASS VALIDATION ACCURACY

Model (Convolutional, Dropout, Dense)    Accuracy (%)
7 Convolutional, 1 Dropout, 2 Dense      93.4
6 Convolutional, 1 Dropout, 3 Dense      93.2
6 Convolutional, 1 Dropout, 2 Dense      93.6
5 Convolutional, 1 Dropout, 2 Dense      94.5
4 Convolutional, 1 Dropout, 2 Dense      94.6
4 Convolutional, 0 Dropout, 2 Dense      94.1
3 Convolutional, 1 Dropout, 2 Dense      94

As can be seen in Table II, the architecture that yields the highest accuracy on the validation set is the one with 4 convolution layers, 1 dropout layer and 2 dense layers. All convolution layers used the ReLU activation function and zero padding. The first convolution layer used 16 kernels of size 3 x 3, the second 32 kernels of size 3 x 3, the third 64 kernels of size 3 x 3, and the last one 128 kernels of size 3 x 3. The convolution layers were followed by one dropout layer with a dropout probability of 0.2. Two dense layers were used after the dropout layer: the first had a size of 512 with a ReLU activation function and was followed by another dense layer of size 4 with a softmax activation function. The batch size for training was 128. The Adam optimizer was used with an adaptive learning rate to minimize the categorical cross-entropy loss. This model gave an accuracy of 94.3% over the testing set, a very good accuracy given that the testing set represents a part of the road that the network had never seen. This shows that the model can be trained on a small part of the road and extrapolate to unseen territories, i.e. that our classifier generalizes to unseen scenarios. We also note that LiDAR-CNN outperforms ResNet by around 2.6% for the best model obtained.

C. Six Assets

In this part we present models evaluated to classify six classes: Bike Lane, Lane Centerline, Road Edge, Yellow Single Solid Line, Yellow Single Broken Line and Null. The data set was balanced between the six classes (9,246 images of each class). The images were shuffled, and 55,473 images were used for training, 5,547 for validation and 13,869 for testing (25% of the total number of images). Many different architectures were evaluated and the one that yielded the highest validation accuracy was used on the testing set. Table III shows the results for the different architectures on the validation set. Note that the kernel size of the convolution layers and the dropout probability were also varied, and the reported accuracy is the one for the configuration that yielded the highest accuracy. The reference ResNet accuracy for this experiment is 87.5%.

TABLE III. 6-CLASS VALIDATION ACCURACY

Model (Convolutional, Dropout, Dense)    Accuracy (%)
7 Convolutional, 1 Dropout, 2 Dense      89.5
6 Convolutional, 1 Dropout, 2 Dense      90.2
5 Convolutional, 1 Dropout, 2 Dense      87.6
5 Convolutional, 0 Dropout, 2 Dense      89.7
4 Convolutional, 2 Dropout, 2 Dense      88.1
4 Convolutional, 1 Dropout, 3 Dense      89.5
4 Convolutional, 1 Dropout, 2 Dense      89.5
4 Convolutional, 0 Dropout, 3 Dense      90
4 Convolutional, 0 Dropout, 2 Dense      90.6
3 Convolutional, 1 Dropout, 2 Dense      89.4
3 Convolutional, 0 Dropout, 2 Dense      88.7

As can be seen in Table III, the architecture that yields the highest accuracy on the validation set is the one with 4 convolution layers, no dropout layer and 2 dense layers. All convolution layers used the ReLU activation function and zero padding. The first convolution layer used 16 kernels of size 3 x 3, the second 32 kernels of size 3 x 3, the third 64 kernels of size 3 x 3, and the last one 128 kernels of size 3 x 3. Two dense layers were then used: the first had a size of 512 with a ReLU activation function and was followed by another dense layer of size 6 with a softmax activation function. The batch size for training was 128. The Adam optimizer was used with an adaptive learning rate to minimize the categorical cross-entropy loss. This model gave an accuracy of 90.3% over the testing set, a good accuracy given that the testing set represents a part of the road that the network had never seen. This shows that our classifier is able to generalize to unseen scenarios and proves the ability to train the model on a small part of the road and extrapolate to unseen territories. LiDAR-CNN also outperforms ResNet by around 3%.
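A small sketch of how such test accuracies can be computed from a trained model and a held-out set of hyper-images (array names are illustrative, not the authors' code):

```python
import numpy as np

def test_accuracy(model, x_test, y_test):
    """x_test: (N, 10, 10, 10) hyper-images from an unseen road section,
    y_test: one-hot labels over the asset classes."""
    probs = model.predict(x_test, batch_size=128)
    predictions = np.argmax(probs, axis=1)
    return float(np.mean(predictions == np.argmax(y_test, axis=1)))
```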

VIII. EXECUTION TIME

This section presents the latency introduced by every part of the pipeline. Since the latency depends on the hardware and software used, these are described first, followed by the execution time of every stage of the pipeline.


A. Hardware

Training a deep network like the ones presented above with a large training set is challenging on a regular computer due to memory and processing-speed limitations. It has to be done offline on a high-performance computer, and the resulting model is then deployed on a regular computer. To train the classifier we used a server equipped with two 8-core Intel Xeon processors with Hyper-Threading, for a total of 32 threads, and two NVIDIA Tesla K40 GPUs with 12 GB of dedicated memory each. 128 GB of RAM were available for the Xeon CPUs and 2 TB of SSD storage were used to store the data locally. The two GPUs were of absolute importance for training the models in a reasonable amount of time: the training phase took between 5 and 7 minutes per model.

B. Software

The Keras API was installed on an Ubuntu server to easily work with CNN libraries. Keras is an open-source API capable of running on top of TensorFlow [21], an open-source machine learning library developed by Google. For data labeling, C++ was used with JSON libraries to access the JSON ground truth file.

C. Pipeline Execution Time

The execution time of all phases of the pipeline was evaluated:

• generating 3D voxels from points received by the LiDAR;
• formatting the voxels into hyper-images;
• analyzing the hyper-images with the CNN model.

In order to simulate the reception of data points by the LiDAR, a file containing 20,000 points in a 3 x 3 x 3 m space was created. 35 hyper-images were generated in the process and fed to the CNN model. The execution time of each stage of the pipeline was measured several times and the average value was calculated. The results are presented in Table IV.

TABLE IV. EXECUTION TIME

Stage                               Mean Time (ms)    Std (ms)
Read File Into RAM                  21.36             2.14
Calculate Space Boundaries          0.05              0.004
Generate Voxels                     390.86            51.96
Generate Hyper-images               20.975            5.57
Write Images to Disk (Pickle)       1.92              0.19
Load Images from Disk (unpickle)    3.26              0.04
Load CNN Model from Disk            7872.9            53.53
Classify 35 Hyper-Images            411.22            30.41

Table IV shows the execution time of the different stages of the pipeline. Note that at runtime the CNN model is loaded only once, so the roughly 8 seconds of model loading should not be counted, nor should writing to and reading from disk. In total, 20,000 LiDAR points in a 3 x 3 x 3 m environment take about 825 ms to classify. We presume that the execution time could be further reduced if the CNN were implemented in C++.

IX. CONCLUSION AND FUTURE WORK

In this research we presented a new application of LiDAR combined with deep learning for road asset detection. We also presented a new method of LiDAR data representation that creates hyper-images fitting the input of a convolution neural network. The network was able to differentiate between 6 different assets with an accuracy above 90%. One limitation of our model is its inability to locate the detected asset within the 1-meter-cube voxel. Another limitation is the execution time, which is around 825 ms to process 20,000 LiDAR points; this must be reduced to less than 100 ms to accommodate the amount of data generated by the LiDAR. As future work we will consider locating assets inside the voxel, and we will implement the model in C++ to further accelerate processing.

ACKNOWLEDGMENT

This project has been funded with the joint support of the National Council for Scientific Research in Lebanon and the St. Joseph University of Beirut. The authors would like to thank Dean Fadi Geara of the Faculty of Engineering at St. Joseph University for providing the material needed for this study, namely the new deep learning server with multiple NVIDIA Tesla GPUs, in collaboration with Murex. We would also like to thank Civil Maps for providing the LiDAR data and labels used in this research, and Dr. Fabien Chraim from Civil Maps for his innovative ideas and for reviewing the paper.

REFERENCES


[1] J. Rosenzweig and M. Bartl, "A review and analysis of literature on autonomous driving," E-Journal Making-of Innovation, 2015.
[2] W. Payre, J. Cestac, and P. Delhomme, "Intention to use a fully automated car: Attitudes and a priori acceptability," Transportation Research Part F: Traffic Psychology and Behaviour, vol. 27, pp. 252-263, 2014.
[3] T. Luettel, M. Himmelsbach, and H.-J. Wuensche, "Autonomous ground vehicles: concepts and a path to the future," Proceedings of the IEEE, vol. 100, no. Special Centennial Issue, pp. 1831-1839, 2012.
[4] S. Le Vine, A. Zolfaghari, and J. Polak, "Autonomous cars: The tension between occupant experience and intersection capacity," Transportation Research Part C: Emerging Technologies, vol. 52, pp. 1-14, 2015.
[5] A. H. Jamson, N. Merat, O. M. Carsten, and F. C. Lai, "Behavioural changes in drivers experiencing highly-automated vehicle control in varying traffic conditions," Transportation Research Part C: Emerging Technologies, vol. 30, pp. 116-125, 2013.
[6] P. E. Ross, "Robot, you can drive my car," IEEE Spectrum, vol. 51, no. 6, pp. 60-90, 2014.
[7] J. Weyer, R. D. Fink, and F. Adelt, "Human-machine cooperation in smart cars: An empirical investigation of the loss-of-control thesis," Safety Science, vol. 72, pp. 199-208, 2015.
[8] World Health Organization, Global status report on road safety 2013: supporting a decade of action. World Health Organization, 2013.
[9] C. M. Rudin-Brown, H. A. Parker, and A. R. Malisia, "Behavioral adaptation to adaptive cruise control," in Proceedings of the Human Factors and Ergonomics Society Annual Meeting, vol. 47, no. 16, SAGE Publications, Los Angeles, CA, 2003, pp. 1850-1854.
[10] A. Alessandrini, A. Campagna, P. Delle Site, F. Filippi, and L. Persia, "Automated vehicles and the rethinking of mobility and cities," Transportation Research Procedia, vol. 5, pp. 145-160, 2015.
[11] C. Y. Low, H. Zamzuri, and S. A. Mazlan, "Simple robust road lane detection algorithm," in 2014 5th International Conference on Intelligent and Advanced Systems (ICIAS), IEEE, 2014, pp. 1-4.


[12] M. Aly, "Real time detection of lane markers in urban streets," in 2008 IEEE Intelligent Vehicles Symposium, IEEE, 2008, pp. 7-12.
[13] S. Weng, J. Li, Y. Chen, and C. Wang, "Road traffic sign detection and classification from mobile lidar point clouds," in 2015 ISPRS International Conference on Computer Vision in Remote Sensing, International Society for Optics and Photonics, 2016, p. 99010A.
[14] L. Caltagirone, S. Scheidegger, L. Svensson, and M. Wahde, "Fast lidar-based road detection using convolutional neural networks," arXiv preprint arXiv:1703.03613, 2017.
[15] D. Maturana and S. Scherer, "3D convolutional neural networks for landing zone detection from lidar," in 2015 IEEE International Conference on Robotics and Automation (ICRA), IEEE, 2015, pp. 3471-3478.
[16] Y. LeCun, B. Boser, J. S. Denker, D. Henderson, R. E. Howard, W. Hubbard, and L. D. Jackel, "Backpropagation applied to handwritten zip code recognition," Neural Computation, vol. 1, no. 4, pp. 541-551, 1989.
[17] A. Krizhevsky, I. Sutskever, and G. E. Hinton, "Imagenet classification with deep convolutional neural networks," in Advances in Neural Information Processing Systems, 2012, pp. 1097-1105.
[18] L. D. Jackel, D. Sharman, C. E. Stenard, B. I. Strom, and D. Zuckert, "Optical character recognition for self-service banking," AT&T Technical Journal, vol. 74, no. 4, pp. 16-24, 1995.
[19] A. Berg, J. Deng, and L. Fei-Fei, "Large scale visual recognition challenge (ILSVRC), 2010," URL http://www.image-net.org/challenges/LSVRC, 2010.
[20] K. He, X. Zhang, S. Ren, and J. Sun, "Deep residual learning for image recognition," arXiv preprint arXiv:1512.03385, 2015.
[21] M. Abadi, A. Agarwal, P. Barham, E. Brevdo, Z. Chen, C. Citro, G. S. Corrado, A. Davis, J. Dean, M. Devin et al., "Tensorflow: Large-scale machine learning on heterogeneous distributed systems," arXiv preprint arXiv:1603.04467, 2016.
