Deep Learning Traffic Sign Detection, Recognition and Augmentation

Lotfi Abdi
National Engineering School of Tunis, University of Tunis El Manar, Tunisia
Networked Objects Control and Communication Systems Laboratory
[email protected]

Aref Meddeb
National Engineering School of Sousse, University of Sousse, Tunisia
Networked Objects Control and Communication Systems Laboratory
[email protected]
ABSTRACT

Driving is a complex, continuous, and multitask process that involves the driver's cognition, perception, and motor movements. The way road traffic signs and vehicle information are displayed strongly impacts the driver's attention, with increased mental workload leading to safety concerns. Drivers must keep their eyes on the road, but can always use some assistance in maintaining their awareness and directing their attention to potential emerging hazards. Research in perceptual and human factors assessment is needed for relevant and correct display of this information, for maximal road traffic safety as well as optimal driver comfort. In-vehicle contextual Augmented Reality (AR) has the potential to provide novel visual feedback to drivers for an enhanced driving experience. In this paper, we present a new real-time framework for fast and accurate traffic sign recognition, based on cascaded deep learning and AR, which superimposes augmented virtual objects onto a real scene under all types of driving situations, including unfavorable weather conditions. Experimental results show that combining the Haar cascade with deep convolutional neural networks greatly enhances detection capability while retaining real-time performance.

CCS Concepts •Human-centered computing → Human computer interaction (HCI); Mixed / augmented reality; Displays and imagers;

Keywords Deep Convolutional Neural Networks; Augmented Reality; Haar Cascade; Traffic Signs

1. INTRODUCTION


Automotive active safety systems have become increasingly common in road vehicles, since they provide an opportunity to significantly reduce traffic fatalities through active vehicle control. Traffic signs play a vital role in safe driving and in avoiding accidents by informing the driver of speed limits or possible hazards. The visibility of traffic signs is crucial for drivers' safety. In some cases, it may be very difficult to recognize traffic signs timely and accurately, because visibility may be reduced by environmental factors. Guiding the driver's attention to an imminent danger somewhere around the car is a potential application. In cars, Augmented Reality (AR) is becoming an interesting means to enhance active safety in the driving task. To acquire traffic information while keeping the driver's eyes on the road, there has been a drastic increase in the development of AR applications on Head-Up Displays (HUD). After recognizing the traffic signs, a driver may be notified of the recognized signs through audio or visual information.

Traffic signs or road signs are an important part of the road environment, as they provide visual messages not only for drivers but for all road users. Traffic Sign Recognition (TSR) includes traffic sign detection and classification. Several detection algorithms are based on edge detection, making them more robust to changes in illumination. Some research has been conducted to evaluate methods for directing the driver's attention using AR cues [7], [14]. Furthermore, several approaches and techniques for road sign detection and recognition have been introduced [9], [2]. The most common approach consists of two main stages: detection and recognition. The considered baseline algorithms represent some of the most popular detection approaches, such as the Viola-Jones detector based on Haar features [5], and the linear classifier, which relies on HOG descriptors.

Recently, deep learning, a type of machine learning method, has drawn a lot of academic and industrial interest [1]. A significant amount of work has set the state of the art in object detection by using deep learning descriptors generated with Convolutional Neural Networks (CNN). In object detection, methods such as R-CNN have reached excellent results by integrating CNNs with region proposal generation algorithms, such as selective search [8]. CNNs have also demonstrated excellent performance on a number of visual recognition tasks, including classification of entire images [4], predicting the presence or absence of objects in cluttered scenes, and localizing objects by ROIs. Moreover, CNNs have been adopted in object recognition for their high accuracy. In [12], an approach based on the combination of color transformation and multi-layer CNNs was proposed, using a combination of supervised and unsupervised learning. This model learns multiple stages of invariant image features with a filter bank layer, a non-linear transform layer, and a spatial feature pooling layer. In [10], the authors applied Convolutional Networks (ConvNets) to the task of traffic sign classification. The ConvNets are biologically-inspired multi-stage architectures that automatically learn hierarchies of invariant features. The CNNs consist of a multi-stage processing of an input image to extract hierarchical, high-level feature representations. In [3], a real-time system for traffic signs was put forward, which used a sliding window method combining various DNNs trained on differently preprocessed data into a Multi-Column DNN (MCDNN). The latter further boosted recognition performance; the method transformed the original image into a gray-scale image by using support vector machines, and then used CNNs with fixed and learnable layers for detection and recognition.

Correctly recognizing road traffic signs at the right time and place is very important for any vehicle driver to ensure a safe journey for themselves and their passengers. Although traffic signs are designed to be clearly visible, they can be missed due to driver distraction or sign masking. In this paper, a novel method is proposed for TSR, based on a Haar cascade to reduce the computational region generation, combined with a deep CNN approach for verification. The AR-TSR may make driving even more comfortable and may help drivers receive important information regarding the signs, even before their eyes can actually see the sign, in an easy and comprehensible way.

2. PROPOSED METHOD

2.1 Overview of Our Approach

The vision algorithms for driver assistance systems usually need to fulfill strong real-time constraints. Hence, we draw a particular focus on the real-time capability of the algorithms evaluated here. Another important issue was the study of conventional traffic signs, in terms of rules for placement and visibility, types of traffic signs, and the migration of these to an in-vehicle display. Such a vision-based safety system has remained elusive in cars because computers typically face a tradeoff between analyzing video images quickly and drawing the right conclusions. On the one hand, a simple Haar cascade detection algorithm can quickly detect many traffic signs in certain images, but lacks the sophistication to distinguish between traffic signs and similar-looking objects in the toughest cases. On the other hand, machine learning algorithms called deep neural networks can handle such complex pattern recognition, but work too slowly for real-time traffic sign detection. In order to achieve robust and fast traffic sign detection, the new algorithm begins with the simpler Haar cascade detection algorithms in the early stages of analysis, to help filter out obvious non-traffic-sign parts of an image, and brings in the more sophisticated deep learning of neural networks only in the final stages. In these later stages, the algorithm combines the simpler and more sophisticated algorithms in a way that balances detection accuracy against complexity.

Our method uses three steps: (A) Hypothesis Generation, (B) Verification, and (C) Augmentation, as sketched below. In the first step, the Region Of Interest (ROI) is extracted using a scanning window with a Haar cascade detector to reduce the computational region in the hypothesis generation step. Next, we use deep learning classification for verification. Finally, a multiclass sign classifier takes the positive ROIs and assigns a 3D traffic sign to each one, using a linear SVM. An overview of our proposed TSR framework is shown in Figure 1. In both stages, we assume that the intrinsic and distortion parameters of the camera are known and do not change.
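To make the three-stage flow concrete, the following minimal C++ sketch shows how the stages could be chained per video frame. It is a sketch under our own naming: `verifyWithCNN`, `classifySign`, and `renderAugmentation` are hypothetical helpers standing in for the components described in Sections 2.2-2.4, not functions from the actual implementation.

```cpp
#include <opencv2/opencv.hpp>
#include <vector>

// Hypothetical per-frame pipeline: (A) Haar cascade hypotheses ->
// (B) CNN verification -> (C) SVM labeling and AR rendering.
void processFrame(const cv::Mat& frame, cv::CascadeClassifier& cascade,
                  cv::Mat& augmented) {
    augmented = frame.clone();

    // (A) Hypothesis generation: cheap, high-recall candidate ROIs.
    cv::Mat gray;
    cv::cvtColor(frame, gray, cv::COLOR_BGR2GRAY);
    std::vector<cv::Rect> candidates;
    cascade.detectMultiScale(gray, candidates);

    for (const cv::Rect& roi : candidates) {
        (void)roi;  // helpers below are assumed, see Sections 2.2-2.4
        // (B) Verification: the deep CNN of Section 2.3 rejects
        // false positives (assumed helper).
        // if (!verifyWithCNN(frame(roi))) continue;

        // (C) A linear SVM assigns the sign class, then the matching
        // 3D sign is superimposed (Section 2.4, assumed helpers).
        // int label = classifySign(frame(roi));
        // renderAugmentation(augmented, roi, label);
    }
}
```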

2.2 Generation of Bounding Boxes

The initial detection phase of a traffic sign recognition system incurs a high computational cost, because ROIs over a large range of scales have to be searched in the complete image. In order to reduce the search space, the adopted solution is to combine a cascade with fewer stages with other methods that eliminate false positives. In order to speed up the computation time, a cascaded classifier architecture is adopted. We trained the boosting cascade on local patches from a large-scale dataset. During the detection phase, the system scans each window of the input image and extracts the Haar-like features of that particular window, which are then fed to the cascade classifier. Finally, only the few sub-windows accepted by all stages of the detector are regarded as objects. The detection process takes an image as input and outputs the regions that contain the ROI. The false alarm rate of the Haar cascade detector without hypothesis verification is higher, but it eliminates most of the non-object regions.

The Haar-like features were originally proposed in the framework of object detection in the face detection approach. An AdaBoost cascade using Haar-like features is trained offline, and a boosting algorithm is used to train a classifier with the Haar-like features of positive and negative samples. The AdaBoost algorithm iteratively trains a strong classifier, which is the sum of several weak classifiers. An object is classified positively only if it is positively classified in each cascade stage. In fact, from an integral image, in a classifier produced by AdaBoost, voting is done as a summation of weighted classifiers. On average, only a small subset of classifiers vote positively, because of cascading.

In this work, in the first stage, a fast search mechanism based on simple features is applied to detect ROIs. Next, in the verification phase, to confirm whether each ROI contains a traffic sign or not, a second stage is needed to eliminate some false positives. The second stage uses a computationally more expensive, but also more accurate, set of features on these regions to classify them into traffic sign and non-traffic sign, as illustrated by the sketch below.
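Concretely, the scanning step can be expressed with OpenCV's cascade detector. This is a minimal sketch assuming a cascade XML file trained offline; the file name and scan parameters are placeholders, while the 24x24 minimum window matches the training sample size given in Section 3.1.

```cpp
#include <opencv2/objdetect.hpp>
#include <opencv2/imgproc.hpp>
#include <vector>

std::vector<cv::Rect> detectSignCandidates(const cv::Mat& frame,
                                           cv::CascadeClassifier& cascade) {
    cv::Mat gray;
    cv::cvtColor(frame, gray, cv::COLOR_BGR2GRAY);
    cv::equalizeHist(gray, gray);  // reduce illumination variation

    std::vector<cv::Rect> rois;
    // Scale step and neighbor-vote threshold are tuning parameters,
    // not values reported in the paper; size bounds follow the
    // 15x15-250x250 range of training examples (Section 3.1).
    cascade.detectMultiScale(gray, rois,
                             1.1,               // scale factor
                             3,                 // minNeighbors
                             0,                 // flags (unused)
                             cv::Size(24, 24),  // min size = training window
                             cv::Size(250, 250));
    return rois;
}

// Usage (hypothetical file name):
//   cv::CascadeClassifier cascade("sign_cascade.xml");
//   std::vector<cv::Rect> rois = detectSignCandidates(frame, cascade);
```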

Figure 1: Overview of the DeepAR-TSR application

2.3 Architecture of Deep CNN

Accuracy and short processing time are extremely important for TSR. In both instances, the ability to recognize signs and their underlying information is highly desirable. This information can be used to warn the human driver of an oncoming change or, in more intelligent vehicle systems, to actually control the speed and/or steering of the vehicle. It is therefore necessary to classify the characteristics of the given information and find a way to represent the information according to these characteristics.

In this work, after traffic sign detection using the proposed method, the output ROIs need to be resized to 64x64. We trained a deep learning classifier in the form of convolutional neural networks with approximately 10,000 identities, with 160 hidden identity features in the top layer. More specifically, with the recent developments of CNN schemes in the computer vision domain, some well-known CNN models have emerged. In the first stage, the input is a pre-processed image of size 64x64 pixels, which is given to a convolutional layer. The network contains 4 convolutional layers, 1 fully-connected layer, and 1 output layer. The numbers of filters in the 4 convolutional layers are 20, 40, 60, and 80, respectively, and the kernel sizes are 4x4, 3x3, 2x2, and 2x2. Each of the first 3 convolutional layers is followed by one max-pooling layer. The last convolutional layer is a locally-connected layer, in which convolutional weights and biases are not shared across positions. The dimension of the fully-connected layer is 160. The output layer is a soft-max layer, and its dimension is equal to the number of class labels in the task.

We use a deep CNN which outputs a fixed number of ROIs. In addition, it outputs a score for each box, expressing the network's confidence that this box contains an object (traffic sign), as sketched below. The image classification scores are used as contextual information to refine the classification scores of the ROIs. No previous algorithms have been capable of optimizing the trade-off between detection accuracy and speed for cascades with stages of such different complexities. The output of the detection stage is a list of ROI objects that could be probable road signs. The results we obtain with this new algorithm are substantially better for real-time, accurate traffic sign detection. If the recognition is complete, a multiclass sign classifier takes the positive ROIs and assigns a 3D traffic sign to each one. Once the ROI is classified, the next step is to estimate the camera motion parameters over time, to enhance the image view by associating extra contextual information and enriching the visual content as well.
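As the implementation builds on Caffe (Section 3), the verification pass over one ROI could look like the following sketch. The deploy/weights file names are placeholders, and the network definition (the four convolutional layers described above) is assumed to live in the deploy file.

```cpp
#include <caffe/caffe.hpp>
#include <opencv2/imgproc.hpp>
#include <vector>

// Run the verification CNN on one detected ROI and return the
// soft-max scores (one entry per class).
std::vector<float> scoreROI(caffe::Net<float>& net, const cv::Mat& roiBGR) {
    cv::Mat in;
    cv::resize(roiBGR, in, cv::Size(64, 64));  // network input is 64x64
    in.convertTo(in, CV_32FC3);

    caffe::Blob<float>* input = net.input_blobs()[0];
    input->Reshape(1, 3, 64, 64);
    net.Reshape();

    // Copy OpenCV's HWC layout into Caffe's NCHW blob.
    float* dst = input->mutable_cpu_data();
    for (int c = 0; c < 3; ++c)
        for (int y = 0; y < 64; ++y)
            for (int x = 0; x < 64; ++x)
                dst[(c * 64 + y) * 64 + x] = in.at<cv::Vec3f>(y, x)[c];

    net.Forward();
    const caffe::Blob<float>* out = net.output_blobs()[0];
    return std::vector<float>(out->cpu_data(),
                              out->cpu_data() + out->count());
}

// Setup (hypothetical file names):
//   caffe::Net<float> net("deploy.prototxt", caffe::TEST);
//   net.CopyTrainedLayersFrom("tsr.caffemodel");
```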

2.4 Pose Estimation and Augmentation

The key to realizing AR 3D registration is to obtain a camera projection matrix, which represents the relationship between the 2D points in the image and the 3D points in the model. From the planar homography, we can easily compute the camera position and rotation, which provides the motion estimates. The mathematical model used is the projective transformation, expressed by equation (1), where $\lambda$ is a homogeneous scale factor unknown a priori, $P$ is a $3 \times 4$ projection matrix, $x = (x, y)$ are the homogeneous coordinates of the image features, $X = (X, Y, Z)$ are the homogeneous coordinates of the feature points in world coordinates, $K \in \mathbb{R}^{3 \times 3}$ is the matrix of camera intrinsic parameters, also known as the camera matrix, the joint rotation-translation matrix $[R \mid t]$ is the matrix of extrinsic parameters, $R = [r_1 \; r_2 \; r_3]$ is the $3 \times 3$ rotation matrix, and $t$ is the translation of the camera:

$$x = \lambda P X = K [R \mid t] X \qquad (1)$$

The projection matrix $P$ is the key to creating a realistic augmented scene, using the intrinsic parameters of the camera, the dimensions of the video frame, and the distances of the near and far clipping planes from the projection center. In our method, we assume that the intrinsic parameters are known in advance and do not change, which is reasonable in most cases.

The projection matrix decomposes into the product of the intrinsic and extrinsic matrices:

$$
P = \underbrace{\underbrace{\begin{pmatrix} 1 & 0 & x_0 \\ 0 & 1 & y_0 \\ 0 & 0 & 1 \end{pmatrix}}_{\text{2D translation}}
\underbrace{\begin{pmatrix} f_x & 0 & 0 \\ 0 & f_y & 0 \\ 0 & 0 & 1 \end{pmatrix}}_{\text{2D scaling}}
\underbrace{\begin{pmatrix} 1 & s/f & 0 \\ 0 & 1 & 0 \\ 0 & 0 & 1 \end{pmatrix}}_{\text{2D shear}}}_{\text{intrinsic matrix } K}
\;
\underbrace{\underbrace{(I \mid t)}_{\text{3D translation}}
\underbrace{\begin{pmatrix} R & 0 \\ 0 & 1 \end{pmatrix}}_{\text{3D rotation}}}_{\text{extrinsic matrix } [R \mid t]}
\qquad (2)
$$

Once $K$ is known, the extrinsic parameters for each image are readily computed. From equation (1), we have

$$
r_1 = \lambda K^{-1} h_1, \qquad
r_2 = \lambda K^{-1} h_2, \qquad
r_3 = r_1 \times r_2, \qquad
t = \lambda K^{-1} h_3,
\qquad (3)
$$

where $h_i = [h_{1i} \; h_{2i} \; h_{3i}]^T$ and $r_i = [r_{1i} \; r_{2i} \; r_{3i}]^T$ denote the columns of the homography $H = [h_1 \; h_2 \; h_3]$ and of the rotation matrix $R$, respectively, and the scale factor is $\lambda = 1 / \lVert K^{-1} h_1 \rVert$ (see the sketch at the end of this section).

To correctly model the perspective projection of the camera, we must mimic the intrinsic camera parameters in the virtual environment. Once the camera is calibrated in a frame, we can synchronize the real camera with a virtual camera and project the virtual objects onto the real image using OpenGL. As indicated, most current marker-less tracking approaches require a 3D model of the environment for matching 2D features to those lying on the model. In addition to the complexity of building such a model, this strategy leads to performance problems when the model is very complex or the environment is dynamic. In contrast, our approach does not need to perform 3D engineering of the environment. Also, we use a simple virtual 3D model, with a known size, to define a reference coordinate system. For robust tracking of the camera pose, we have developed a new marker-less approach that combines information from both the real and virtual worlds.

Table 1: Recall and precision results for traffic sign detection

Traffic Signs              Test    Recall    Precision
Speed limit                127     99.21%    99.13%
Danger signs               119     99.16%    99.17%
Unique signs               50      98.89%    99.04%
Mandatory signs            135     98.84%    98.95%
Derestriction signs        120     98.82%    98.92%
Derestriction signs        140     99.12%    99.06%
Other prohibitory signs    115     99.15%    99.08%

Figure 3: Precision-recall curve for traffic sign detection (recall vs. precision, both axes from 0 to 1).
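Equation (3) maps directly onto a few lines of OpenCV. The following is a minimal sketch under our own naming, assuming the four corners of the detected sign are matched against a planar model of known size and that $K$ comes from an offline calibration:

```cpp
#include <opencv2/calib3d.hpp>
#include <opencv2/core.hpp>
#include <vector>

// Recover [R|t] from the plane-induced homography, following eq. (3).
// K (and therefore Kinv) is assumed CV_64F, matching findHomography.
void poseFromHomography(const std::vector<cv::Point2f>& modelPts, // sign corners on plane Z=0
                        const std::vector<cv::Point2f>& imagePts, // matched image corners
                        const cv::Mat& K, cv::Mat& R, cv::Mat& t) {
    cv::Mat H = cv::findHomography(modelPts, imagePts);
    cv::Mat Kinv = K.inv();

    cv::Mat h1 = H.col(0), h2 = H.col(1), h3 = H.col(2);
    double lambda = 1.0 / cv::norm(Kinv * h1);   // lambda = 1 / ||K^-1 h1||

    cv::Mat r1 = lambda * (Kinv * h1);           // eq. (3)
    cv::Mat r2 = lambda * (Kinv * h2);
    cv::Mat r3 = r1.cross(r2);                   // r3 = r1 x r2
    t = lambda * (Kinv * h3);

    cv::hconcat(std::vector<cv::Mat>{r1, r2, r3}, R);
    // In practice R should be re-orthonormalized (e.g., via SVD),
    // since noise makes it only approximately a rotation matrix.
}
```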

3. EXPERIMENTAL RESULTS

To evaluate the performance of the proposed algorithm, our implementation is based on the open-source Caffe deep learning library [6]. We conduct comprehensive experiments to demonstrate the performance of the proposed method and also present a comparison with previous methods. We implement the suggested method in C++ and test the real-time performance on the German Traffic Sign Recognition Benchmark (GTSRB) dataset [11]. For training and testing, the GTSRB dataset contains 51,839 images in 43 classes; we selected 39,209 images for training and the rest for testing.

3.1 Detection Performance

The database used to train the detectors has been collected from the GTSRB dataset and the Belgian Traffic Signs dataset (BelgiumTS) [13]. Our training data set consists of 4,500 traffic signs of interest and 6,000 non-traffic signs. The sizes of the traffic sign examples range from 15x15 to 250x250 pixels. In order to obtain a highly efficient classifier, we set the number of cascade stages to 7, the minimum hit rate to 0.995, and the maximum false positive rate per stage to 0.5. The sample width and height were both set to 24 pixels. The evaluation is based on the recall and precision values, which are summarized in Table 1 together with the number of test images. The experimental results of Table 1 demonstrate the excellent performance of our system: the proposed algorithm attains an average precision rate of 98.81% and an average recall rate of 98.22%.

3.2 Classification Performance

To evaluate the performance of our classification module, we first evaluate the classification task on the testing images of GTSRB. There are 51,839 images in GTSRB (39,209 for training and 12,630 for testing), with sizes varying between 15x15 and 222x193 pixels. The GTSRB data set contains 43 classes with unbalanced class frequencies. Experiments have been conducted on the GTSRB to measure this performance; Table 2 shows the classification rates. A key idea of our method is to project the 3D object sign using the corresponding sparse dictionary and then to classify the projected vector with the SVM, as sketched after the tables below. Furthermore, we evaluate the classification task on the detected signs returned by the previous detection module. As shown in Table 2, the overall classification accuracy is 99.36%. If the recognition is complete, a multiclass sign classifier takes the positive ROIs and assigns a 3D traffic sign to each one.

Table 2: Accuracy results for traffic sign classification

Traffic Signs              Test    Accuracy
Speed limit                1500    99.58%
Danger signs               1900    99.17%
Unique signs               1000    99.42%
Mandatory signs            1700    99.45%
Derestriction signs        1350    99.32%
Other prohibitory signs    1850    99.27%

Table 3: Performance comparison with other traffic sign recognition methods (recognition accuracy)

Traffic Signs              [3]       [12]      [10]      [15]      Ours
Speed limit                99.47%    97.63%    98.61%    95.95%    99.58%
Danger signs               99.07%    98.67%    98.03%    92.08%    99.17%
Unique signs               99.22%    100%      98.63%    98.73%    99.42%
Mandatory signs            99.89%    99.72%    97.18%    99.27%    99.45%
Derestriction signs        99.72%    98.89%    94.44%    87.50%    99.32%
Other prohibitory signs    99.93%    99.93%    99.87%    99.13%    99.27%

Figure 2: Insertion of a virtual 3D object under different lighting conditions.
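The final multiclass stage uses a linear SVM (Section 2.1). A minimal OpenCV sketch follows, assuming the projected feature vectors described above have already been computed; the feature extraction itself is not shown, and all names are our placeholders.

```cpp
#include <opencv2/ml.hpp>

// Train a linear multiclass SVM on projected feature vectors
// (one row per sample) and use it to label verified ROIs.
cv::Ptr<cv::ml::SVM> trainSignSVM(const cv::Mat& features,  // CV_32F, N x D
                                  const cv::Mat& labels) {  // CV_32S, N x 1
    cv::Ptr<cv::ml::SVM> svm = cv::ml::SVM::create();
    svm->setType(cv::ml::SVM::C_SVC);     // multiclass (one-vs-one internally)
    svm->setKernel(cv::ml::SVM::LINEAR);  // linear SVM, as in the paper
    svm->train(features, cv::ml::ROW_SAMPLE, labels);
    return svm;
}

// Usage: int signClass = (int)svm->predict(featureRow);
```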

3.3 Augmented Reality Tracking

In this section, the results obtained during real-time tests, performed with a fully equipped vehicle, are presented. We started the evaluation of AR tracking by superimposing 3D graphics on target images. To provide driving safety information using the proposed AR-TSR, various sensors and devices were attached to the experimental test vehicle. The system has been empirically tested under different lighting conditions, on sunny and cloudy days, in the rain, and at night, as shown in Figure 2. The experimental results show that the proposed method significantly reduces the computational cost and also stabilizes the camera pose estimation process. A virtual object is attached to a real object for augmentation purposes, and the camera pose is used to superimpose virtual objects onto the real environment. Using the calculated pose matrix, a virtual 3D object is projected into the real-world scene, as shown in Figure 2. The experiments have confirmed that the system can accurately superimpose virtual textures or 3D objects on a user-selected planar part of a natural scene in real time, under general motion conditions, without the need for markers or other artificial beacons. A sketch of how the calibrated intrinsics map to the virtual camera is given below.
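To mimic the intrinsics in the virtual camera (Section 2.4), a standard construction turns $K$, the frame size, and the clipping planes into an OpenGL projection matrix. This sketch uses one common sign convention (conventions vary with the chosen image origin) and our own naming; it is not code from the paper.

```cpp
#include <GL/gl.h>

// Build a column-major 4x4 OpenGL projection matrix from the camera
// intrinsics (fx, fy, principal point x0/y0), the frame size, and the
// near/far clipping planes.
void projectionFromIntrinsics(double fx, double fy, double x0, double y0,
                              double w, double h,
                              double zNear, double zFar, double m[16]) {
    for (int i = 0; i < 16; ++i) m[i] = 0.0;
    m[0]  = 2.0 * fx / w;                        // focal scaling in x
    m[5]  = 2.0 * fy / h;                        // focal scaling in y
    m[8]  = 1.0 - 2.0 * x0 / w;                  // principal point offset x
    m[9]  = 2.0 * y0 / h - 1.0;                  // principal point offset y
    m[10] = -(zFar + zNear) / (zFar - zNear);    // depth remapping
    m[11] = -1.0;                                // perspective divide by -z
    m[14] = -2.0 * zFar * zNear / (zFar - zNear);
}

// Usage: load it before drawing the virtual sign over the video frame.
//   double m[16];
//   projectionFromIntrinsics(fx, fy, x0, y0, 640.0, 480.0, 0.1, 100.0, m);
//   glMatrixMode(GL_PROJECTION);
//   glLoadMatrixd(m);
```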

3.4 Comparisons with Other State-of-the-Art Methods

In order to verify the discrimination performance and computational efficiency of the proposed feature for traffic sign detection, experiments on the publicly available traffic sign data set were carried out. The performance results of the machine learning algorithms are all significantly different from each other. We compared the suggested method with other state-of-the-art algorithms, such as the committee of CNNs [3], human performance [12], multi-scale CNNs [10], and random forests [15]. The performance is analyzed in terms of recognition accuracy in Table 3. According to the results for the GTSRB data set, shown in Table 3, this work achieves a 99.36% recognition accuracy, which is 0.09% less than the work in [3], 0.52% higher than the work in [12], and 1.05% higher than the work in [10]. The accuracy on unique signs reaches 99.42%, which is comparable with the best reported, and this work achieves a recognition speed of 35 frames per second.

Compared with other methods, this paper also presents an overview of studies related to drivers' perception and cognition when information is displayed on a windshield HUD, as this can be a solution to reduce the duration and frequency of drivers looking away from the traffic scene, which is very important for safe driving assistance systems. Compared with other state-of-the-art methods, our approach offers the best results in certain categories, such as Other prohibitory and Mandatory signs, and provides results very close to the best ones reported in the other categories. This work greatly reduces the recognition time, which makes our method a good choice for real-world applications such as DAS.

4. CONCLUSIONS

In-vehicle contextual AR has the potential to provide novel visual feedback to drivers for an enhanced driving experience. This paper has presented a fast and robust traffic sign recognition technique. A new AR-TSR approach to create real-time interactive traffic animations was introduced, in terms of rules for placement and visibility, types of traffic signs, and the migration of these to an in-vehicle display. The AR-TSR supplements the exterior view of the traffic conditions in front of the vehicle with virtual information for the driver. The approach, based on cascaded deep learning, superimposes augmented virtual objects onto a real scene under all types of driving situations, including unfavorable weather conditions. We employed this approach to improve the accuracy of the traffic sign detector in order to assist the driver in various driving situations, increase driving comfort, and reduce traffic accident risks. Experimental results show that the suggested method reaches performance comparable to state-of-the-art approaches, but with less computational complexity and shorter training time. We have also noticed that AR impacts the allocation of visual attention more strongly during the decision-making phase.

5. REFERENCES

[1] Y. Bengio. Learning deep architectures for AI. Foundations and Trends in Machine Learning, 2(1):1-127, 2009.
[2] L. Chuan, P. Shenghui, Z. Fan, L. Menghe, and K. Baozhong. A method of traffic sign detecting based on color similarity. In Measuring Technology and Mechatronics Automation (ICMTMA), 2011 Third International Conference on, volume 1, pages 123-126. IEEE, 2011.
[3] D. Cireşan, U. Meier, J. Masci, and J. Schmidhuber. Multi-column deep neural network for traffic sign classification. Neural Networks, 32:333-338, 2012.
[4] J. Donahue, Y. Jia, O. Vinyals, J. Hoffman, N. Zhang, E. Tzeng, and T. Darrell. DeCAF: A deep convolutional activation feature for generic visual recognition. arXiv preprint arXiv:1310.1531, 2013.
[5] W. J. Jeon, G. A. R. Sanchez, T. Lee, Y. Choi, B. Woo, K. Lim, and H. Byun. Real-time detection of speed-limit traffic signs on the real road using Haar-like features and boosted cascade. In Proceedings of the 8th International Conference on Ubiquitous Information Management and Communication, page 93. ACM, 2014.
[6] Y. Jia, E. Shelhamer, J. Donahue, S. Karayev, J. Long, R. Girshick, S. Guadarrama, and T. Darrell. Caffe: Convolutional architecture for fast feature embedding. In Proceedings of the ACM International Conference on Multimedia, pages 675-678. ACM, 2014.
[7] S. Kim and A. K. Dey. Simulated augmented reality windshield display as a cognitive mapping aid for elder driver navigation. In Proceedings of the SIGCHI Conference on Human Factors in Computing Systems, pages 133-142. ACM, 2009.
[8] K. Lenc and A. Vedaldi. R-CNN minus R. arXiv preprint arXiv:1506.06981, 2015.
[9] A. Mogelmose, M. M. Trivedi, and T. B. Moeslund. Vision-based traffic sign detection and analysis for intelligent driver assistance systems: Perspectives and survey. IEEE Transactions on Intelligent Transportation Systems, 13(4):1484-1497, 2012.
[10] P. Sermanet and Y. LeCun. Traffic sign recognition with multi-scale convolutional networks. In Neural Networks (IJCNN), The 2011 International Joint Conference on, pages 2809-2813. IEEE, 2011.
[11] J. Stallkamp, M. Schlipsing, J. Salmen, and C. Igel. The German Traffic Sign Recognition Benchmark: A multi-class classification competition. In Neural Networks (IJCNN), The 2011 International Joint Conference on, pages 1453-1460. IEEE, 2011.
[12] J. Stallkamp, M. Schlipsing, J. Salmen, and C. Igel. Man vs. computer: Benchmarking machine learning algorithms for traffic sign recognition. Neural Networks, 32:323-332, 2012.
[13] R. Timofte. KUL Belgium traffic signs and classification benchmark datasets. http://btsd.ethz.ch/shareddata.
[14] M. Tonnis, C. Sandor, C. Lange, and H. Bubb. Experimental evaluation of an augmented reality visualization for directing a car driver's attention. In Proceedings of the 4th IEEE/ACM International Symposium on Mixed and Augmented Reality, pages 56-59. IEEE Computer Society, 2005.
[15] F. Zaklouta, B. Stanciulescu, and O. Hamdoun. Traffic sign classification using K-d trees and random forests. In Neural Networks (IJCNN), The 2011 International Joint Conference on, pages 2151-2155. IEEE, 2011.
