2015 IEEE Intelligent Vehicles Symposium (IV) June 28 - July 1, 2015. COEX, Seoul, Korea
Integrating Visual Selective Attention Model with HOG Features for Traffic Light Detection and Recognition Yang Ji, Ming Yang, Zhengchen Lu, and Chunxiang Wang
Abstract—Traffic light detection and recognition play an increasingly important role in Advanced Driver Assistance Systems and driverless cars. This paper presents a method that integrates a Visual Selective Attention (VSA) model with HOG features to detect and recognize traffic lights in complex urban environments. First, the VSA model is used to obtain candidate regions for traffic lights. Then, HOG features and an SVM classifier are applied within these candidate regions to locate traffic lights precisely. Within the resulting regions, the color of each traffic light is recognized from the gray-scale image of the A channel. Experimental results show that the proposed method is robust and highly accurate.

Keywords: traffic light, VSA, spectral residual, HOG, SVM

I. INTRODUCTION

To complete the task of driving in urban environments, it is important for a driverless car to detect and recognize traffic lights, and the same capability is of great significance for Advanced Driver Assistance Systems (ADAS). Researchers have made many efforts to solve this problem.

Many researchers try to make full use of the color, shape, luminance and structural information of traffic lights. Masako Omachi et al. propose a method using the color and edge information of traffic lights [1]. Youn K. Kim et al. detect traffic lights mainly with color filtering and thresholding [2]. These methods use only a small part of the available traffic light features. To address this, other researchers combine color, shape and structural information as much as possible. ZHANG et al. use a multi-feature fusion algorithm that scores candidate regions based on blob scores and box scores [3]. The main weakness of these methods is their vulnerability to changes in the environment. If the luminance of the light is too high or too low, the color is hard to determine because of saturation. What's more, the shape of distant lights cannot be clearly distinguished.

Many researchers also use machine learning to solve the problem. The intelligent vehicle BIT in the 2009 Future Challenge of China recognized traffic lights with classifiers trained using Haar features and the AdaBoost algorithm [4]. Chulhoon Jang et al. propose a multiple-exposure technique that integrates both low- and normal-exposure images to select candidate regions, and then applies a support vector machine (SVM) with a histogram of oriented gradients (HOG) to detect traffic lights [5]. SVM classifiers and HOG features have also been widely applied to detect and recognize traffic signs [6-8]. Robert Kastner et al. combine attention-based detection with an array of weak classifiers for traffic sign detection and recognition [9]. Since machine learning methods are time-consuming, it is necessary to first obtain candidate regions of traffic lights. Existing methods derive candidate regions from color or brightness features, but it is difficult to get ideal results because many other objects share similar color and brightness properties, which causes much interference.

Another line of research detects traffic lights with prior knowledge, such as maps and GIS (Geographic Information System). V. John et al. propose using GPS sensors and a traffic light location database to identify an ROI (Region of Interest) in the image captured by the camera, and then applying a CNN (convolutional neural network) to detect traffic lights within that ROI [10]. Google did similar research earlier, and its autonomous car uses such a method to detect traffic lights [11]. The disadvantage of this approach is that much pre-work must be done and the map information must be accurate and up to date, which leads to a high cost.
The VSA model proposed in [12, 13] can focus on regions of interest quickly and effectively. In this paper, we propose a method that uses the VSA model to obtain candidate regions and then applies an SVM classifier to the HOG features of the image within a detection window to determine whether it contains a traffic light. The detection window searches only within the candidate regions. In this way, a precise region for each traffic light can be obtained. Finally, the state of the light is recognized from the gray-scale image of the A channel. This avoids a large amount of computation while achieving high detection accuracy.
*Research supported by the National Natural Science Foundation of China (91420101/51178268/61174178). Yang Ji, Ming Yang, and Zhengchen Lu are with the Department of Automation, Shanghai Jiao Tong University, Shanghai Key Lab of Navigation and Location Services, 200240, China (phone: +86-21-34204553; e-mail:
[email protected]). Chunxiang Wang is with Research Institute of Robotics, Shanghai Jiao Tong University, Shanghai 200240, China. 978-1-4673-7266-4/15/$31.00 ©2015 IEEE
The rest of this paper is organized as follows. Section II introduces the method of obtaining candidate regions using the VSA model and the training of the traffic light classifier, as well as the details of using HOG features and the SVM classifier to obtain precise regions of traffic lights and recognize their state. Section III demonstrates the effectiveness of the proposed method with experimental results. Section IV presents conclusions and future work.
However, in many cases the brightness of the traffic light is not strong enough to be extracted, so processing only the gray-scale image causes many omissions in complex environments.
II. PROPOSED METHOD
First of all, the source image is resized to 120×160 for real-time performance. Then, a Gaussian filter is applied to the resized image; in this way we avoid filtering separately when computing S_A and S_S, saving computation time. Considering that color is the most salient feature for traffic light detection, we choose the LAB and HSI color spaces to extract color information instead of the RGB color space. Since the change from negative to positive values in the A channel of the LAB color space corresponds to a color change from green to red, it is well suited to detecting and recognizing traffic lights.
Converting the image to gray-scale does not make the contrast between lights and background obvious enough, so the resulting candidate regions are poor. In this paper, significant candidate regions of traffic lights are obtained using a VSA model based on spectral residual [15].
A. System Overview
The framework of the system is presented in Figure 1. For robust detection of traffic lights in complex urban environments, HOG features, which carry abundant structural information about traffic lights, are used. Candidate regions are identified first to reduce the time spent computing HOG features and running the SVM classifier. Precise region detection is then based on the candidate regions obtained from the VSA model: the detection window moves within the candidate regions, and the SVM classifier determines whether the image inside the window is a traffic light. After this step, the bounding boxes are obtained. Mapping the bounding boxes proportionally into the gray-scale image of the A channel, the state of the lights is recognized from the pixel values in the A channel.
Normally, a color filter sets the value of pixels that meet the requirements of a certain color to 255, while all other gray values are set to 0. This kind of filtering introduces much interference, since many pixels are similar to the object. To solve this problem, we use band-pass filtering based on a Gaussian function, written as:
P_A' = (1 / (√(2π) σ_A)) · e^(−(P_A − μ_A)^2 / (2 σ_A^2))   (1)

where σ_A is the steepness of the function curve, μ_A is the center of the pass band, and P_A' represents the filtered value of the pixels in the A channel. Similarly, P_S' can be obtained from P_S. Then, the spectral residual approach is applied to generate the ASM (A channel saliency map), which can be obtained according to the formulas below.

Figure 1. System overview: the source image passes through the VSA stage (A and S channels → ASM and SSM → CSM → candidate regions), then HOG features and the SVM classifier generate the precise regions, followed by proportional mapping into the A channel for state recognition.
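The band-pass filtering of Eq. (1) can be sketched in NumPy as below. The values of μ_A and σ_A are illustrative assumptions (the paper does not report its settings), and the response is rescaled to [0, 1] for later thresholding.

```python
import numpy as np

def band_pass(channel, mu=170.0, sigma=25.0):
    """Gaussian band-pass of Eq. (1): pixels near mu get high responses."""
    x = channel.astype(np.float64)
    g = np.exp(-((x - mu) ** 2) / (2.0 * sigma ** 2)) / (np.sqrt(2 * np.pi) * sigma)
    return g / g.max()  # rescale to [0, 1]
```

Unlike a hard 0/255 color mask, this keeps a graded response, so pixels merely close to the target color are attenuated rather than kept outright.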
B. Getting candidate regions
It is difficult to get precise regions of traffic lights without missed or false detections. This part aims to miss as few traffic lights as possible while obtaining rough candidate regions in the captured image. As seen in much of the literature, this is usually implemented by color segmentation and shape filtering. Raoul de Charette et al. use the gray-scale image to extract bright spots, considering that all traffic lights share the common property of emitting light [14].
A(f) = R(F(P_A'))   (2)

P(f) = I(F(P_A'))   (3)

L(f) = log(A(f))   (4)

R(f) = L(f) − h_n(f) * L(f)   (5)

S_A = (F^(−1)(exp(R(f) + P(f))))^2   (6)

h_n = (1/n^2) · [1 1 1; 1 1 1; 1 1 1]   (n = 3)   (7)

where F and F^(−1) represent the Fourier transform and inverse Fourier transform, A(f) denotes the real part of F(P_A'), and P(f) denotes the imaginary part of F(P_A'). h_n is a mean filter kernel; in this paper its size is set to 3×3. According to the theory of spectral residual, the information in the image can be separated into two parts: the innovation part and the redundancy part. Once the redundancy part is excluded, the information useful for detection is retained. S_A is the ASM that we want.

Based on the CSM (defined in (8) below), the candidate searching area of the HOG descriptor can be obtained by the following steps:

Step 1: Set the size of the detection window to 5×5 and slide it over the CSM with a step size of 5×5;

Step 2: Count the number of pixels in the window whose gray value is larger than a given threshold, and denote this quantity N;

Step 3: If N is larger than a second threshold, mark the region within the detection window as an ROI (Region of Interest); then go to Step 1.
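The ASM computation of Eqs. (2)-(6) can be sketched as follows. This follows the standard spectral residual formulation (amplitude and phase of the FFT, a 3×3 mean filter on the log spectrum); the small epsilon added before the logarithm is an implementation assumption to avoid log(0).

```python
import numpy as np

def spectral_residual_saliency(channel, kernel=3):
    """Saliency map via the spectral residual approach (Eqs. 2-6)."""
    f = np.fft.fft2(channel)
    amplitude = np.abs(f)               # A(f)
    phase = np.angle(f)                 # P(f)
    log_amp = np.log(amplitude + 1e-8)  # L(f)
    # h_n(f) * L(f): 3x3 mean filter of the log spectrum, via shifted sums
    pad = kernel // 2
    padded = np.pad(log_amp, pad, mode='edge')
    avg = np.zeros_like(log_amp)
    for dy in range(kernel):
        for dx in range(kernel):
            avg += padded[dy:dy + log_amp.shape[0], dx:dx + log_amp.shape[1]]
    avg /= kernel ** 2
    residual = log_amp - avg            # R(f)
    saliency = np.abs(np.fft.ifft2(np.exp(residual + 1j * phase))) ** 2  # S_A
    return saliency / saliency.max()    # normalize to [0, 1]
```

Applying the same function to the filtered S channel yields the SSM.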
After that, the ROIs in the source image are calculated proportionally from the resized image, and the HOG detection window moves within these ROIs.

C. Training HOG of traffic light
To obtain more precise regions of traffic lights in the image, the features of traffic lights should be exploited as much as possible. If we consider only the lights and ignore the structural information of traffic lights, other kinds of lights, such as tail lights and streetlights, may be mistaken for traffic lights. For this reason, this paper selects HOG features for detection.
In the same way, the SSM (Saturation Saliency Map) and ISM (Intensity Saliency Map) can be obtained. Experiments show that the SSM produces relatively concentrated attention on traffic lights while the ISM produces relatively scattered attention. Thus, we select the ASM and SSM and integrate them into a CSM (Chief Saliency Map) through a weighted sum:

P_CSM = a_1 · P_ASM + a_2 · P_SSM   (8)

Figure 2 shows the resulting concentrated attention regions.

HOG is a good feature descriptor for object detection; it counts occurrences of gradient orientations in localized portions of an image. Since the edge of the traffic light backplane is an important feature containing a large amount of gradient information, the HOG descriptor is well suited to detecting traffic lights.
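The CSM fusion of Eq. (8) and the window-counting steps for ROI selection can be sketched together as below. The weights a_1, a_2 and both thresholds are illustrative assumptions; the paper does not state its values.

```python
import numpy as np

def csm_rois(asm, ssm, a1=0.5, a2=0.5, win=5, thresh=0.5, min_count=3):
    """Fuse ASM and SSM into the CSM (Eq. 8), then mark 5x5 windows whose
    bright-pixel count exceeds min_count as ROIs (Steps 1-3)."""
    csm = a1 * asm + a2 * ssm
    h, w = csm.shape
    rois = []
    for y in range(0, h - win + 1, win):
        for x in range(0, w - win + 1, win):
            n = np.count_nonzero(csm[y:y + win, x:x + win] > thresh)
            if n > min_count:
                rois.append((x, y, win, win))
    return rois
```

Counting bright pixels per window, rather than thresholding the whole map, keeps isolated bright noise pixels from producing ROIs.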
In this paper, we take the whole traffic light with its backplane as positive training examples, and the negative examples are selected from urban background scenes that do not contain any traffic lights. We manually cut 785 traffic lights from the training data as positive examples; the number of negative examples is 2768. Figure 3 shows some of the training samples.
Figure 3. Some positive (top row) and negative (bottom rows) samples used for SVM classifier training
Figure 2. Concentrated attention regions based on visual selective attention (VSA). (a) source image, (b) ASM (A channel saliency map), (c) SSM (Saturation Saliency Map), (d) CSM (Chief Saliency Map).
The aim of using HOG features and the SVM classifier is to detect the traffic light region rather than to recognize the light's state (which is done after detection), so both red and green lights are treated as positive samples. Before training the HOG descriptor, all positive and negative samples are resized to the same size; in this paper, the size is 15×30.
Based on the CSM, the candidate searching area of the HOG descriptor is thus obtained by the sliding-window counting steps described above.
If the overlap ratio of the detection bounding box and the manually labeled ground truth bounding box is larger than 50%, the detection result is considered a correct detection.
D. HOG and SVM detection based on candidate regions
In part B, the candidate regions of traffic lights were obtained. The detection window slides within these regions, and the HOG features of the image inside the window are computed; these features are the samples classified by the previously trained SVM, which judges whether there is a traffic light in the detection window.
The test sequence has a frame size of 800×600 and contains 4537 images. The proposed method is implemented in C++ with OpenCV and runs on a computer with an Intel Core i3 M330 CPU (2.13 GHz) and 2 GB RAM.
For many applications, saliency detection decreases the overall computation time even though the algorithm in part B itself costs time, because computing HOG features over the whole image is much more expensive than the computation performed in part B.
B. Baseline Algorithm
To make comparisons and validate the performance of the proposed method, we implemented the image processing algorithm described in [14]. First, Spot Light Detection (SLD) is applied to the whole gray-scale image. Next, a seeded region growing algorithm selects connected regions, and shape filter rules (constraints on area and on the width-height ratio) keep the connected regions whose size and width-height ratio fall into a reasonable range. Finally, an Adaptive Template Matcher (ATM) evaluates the matching confidence of the candidate regions and detects traffic lights. This is denoted BA I (baseline algorithm I) in the following.
In this paper, the detection window size is 15×30, the block size is 5×10, and the block stride is 5×5.

E. Traffic lights recognition
In part D, the precise regions of traffic lights are obtained. Within these regions, the traffic light state, i.e. its color, is determined using the A channel of the LAB color space. In the gray-scale image of the A channel there are many bright and dark regions. Pixels whose value is lower than a low threshold are regarded as dark pixels; similarly, pixels whose value is higher than a high threshold are regarded as bright pixels. Since part D determines the bounding boxes of traffic lights, the precise position and extent of each region is mapped proportionally into the gray-scale image of the A channel. Within each bounding box, the numbers of dark and bright pixels are counted and compared, and the proportion of each in the total number of pixels in the box is computed. If there are more bright pixels than dark ones and the proportion of bright pixels in the bounding box exceeds a threshold T, the traffic light is recognized as red. Similarly, if the dark pixels satisfy the two corresponding conditions, the traffic light is recognized as green.

III. EXPERIMENTS

A. Instruction of the experiment
The proposed method is validated on test data sampled by the in-vehicle camera. The test data was acquired on a sunny afternoon and covers a variety of conditions, such as strong backlight and complex urban scenes containing many objects similar to traffic lights. Reaching high accuracy under these conditions is a great challenge, which highlights the performance of the proposed method in complex situations.
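The bright/dark counting of part E can be sketched as below. All three thresholds are illustrative assumptions (the paper leaves its values unstated); the logic relies on red appearing bright and green appearing dark in the A channel.

```python
import numpy as np

def recognize_state(a_channel, box, t_bright=160, t_dark=100, t_ratio=0.1):
    """Classify a detected light as red/green from the A channel (part E)."""
    x, y, w, h = box
    roi = a_channel[y:y + h, x:x + w]
    total = roi.size
    n_bright = np.count_nonzero(roi > t_bright)  # red-ish pixels
    n_dark = np.count_nonzero(roi < t_dark)      # green-ish pixels
    if n_bright > n_dark and n_bright / total > t_ratio:
        return "red"
    if n_dark > n_bright and n_dark / total > t_ratio:
        return "green"
    return "unknown"
```

The ratio test keeps a box from being classified on a handful of outlier pixels when the light itself occupies little of the bounding box.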
To evaluate the detection results, we labeled the ground truth bounding boxes manually.
Meanwhile, to measure the acceleration obtained by applying the VSA module before HOG and SVM detection, we also ran SVM and HOG without VSA on the whole image; this is denoted BA II (baseline algorithm II) in the following. The average per-frame processing time of the methods is listed in Table II.

C. Experiment results and analysis
The proposed method has been tested on the test data, and Figure 4 shows some of the experimental results. As can be seen in Figure 4, the proposed method achieves good results in different urban scenes. Even though the quality of the captured images is very poor, except for (d) and (f), the traffic lights are recognized correctly. Meanwhile, Figure 5 shows some results of baseline algorithm I on the same test data. As can be clearly seen in Table I, under the same environment the false detections of baseline algorithm I increase remarkably. Since the color and saturation of the background are quite similar to those of the lights, and the geometric and structural information used by the ATM is much less than that of HOG features, detecting traffic lights against a complex background is difficult for BA I. The SLD operation in BA I also loses candidate regions at long detection distances.
Figure 4. Detection results in complex urban scenes, panels (a)-(f)
Figure 6. Some wrong detections that mistake background for traffic lights with the proposed method; the pictures at the bottom are magnified false detections cut from the top pictures
Figure 5. Detection results with baseline algorithm I in the same urban scenes
TABLE I. ACCURACY OF TWO METHODS

Method                  Color   TP    FN    FP    Precision   Recall
Proposed method         Red     983   35    23    97.71%      96.56%
Proposed method         Green   438   23    11    97.55%      95.01%
Baseline algorithm I    Red     911   107   114   88.87%      89.49%
Baseline algorithm I    Green   408   53    63    86.62%      88.50%

By fusing the A channel and the S channel to create the CSM, the proposed method avoids this disadvantage of baseline algorithm I.

With 32 levels of multi-scale detection each time, the detection distance of the proposed method ranges from 15 m to 100 m. In Figure 4(f), the distance is about 20 m. In Figure 4(a), (b), (c) and (e), the detection distance is about 100 m and the traffic lights are quite small in the captured image. Traditional image processing methods miss these lights; by contrast, the proposed method detects the traffic lights that baseline algorithm I misses. Figure 4(e) shows a good detection result under strong backlight distortion; by contrast, baseline algorithm I mistakes the background for a red traffic light and can barely perform well in such conditions.

TABLE II. AVERAGE PROCESSING TIME

                            Proposed method   BA I        BA II
Processing time per frame   142.54 ms         117.63 ms   413.67 ms
In Table I, False Positive (FP) counts errors that mistake similar background objects for traffic lights. False Negative (FN) counts traffic lights present in the captured image that are missed. True Positive (TP) counts correct detections of traffic lights. The Precision and Recall columns are computed as follows:

precision = TP / (TP + FP)   (9)

recall = TP / (TP + FN)   (10)

However, the proposed method still produces some unsatisfactory detections. Figure 6 shows examples that mistake the background for traffic lights; as can be seen, the structure of these erroneous objects is very similar to that of traffic lights.
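As a sanity check, the first row of Table I can be reproduced from Eqs. (9) and (10):

```python
# Recomputing Table I's precision/recall from its TP/FN/FP counts.
def precision(tp, fp):
    return tp / (tp + fp)

def recall(tp, fn):
    return tp / (tp + fn)

# Proposed method, red lights: TP=983, FN=35, FP=23
p = precision(983, 23)  # ~0.9771
r = recall(983, 35)     # ~0.9656
```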
From the experimental results above, it is clear that the proposed method performs better in accuracy. To test real-time performance, a comparison of baseline algorithms I and II and the proposed method has been made. As can be seen in Table II, the proposed method costs about 25 ms more than baseline algorithm I on average; the main time-consuming process is the computation of HOG features and the SVM classification.
To compare the detection accuracy of the proposed method and baseline algorithm I, both methods were tested on the same video sequences sampled by the in-vehicle camera.
However, the comparison between BA II and the proposed method demonstrates that, with the help of the VSA model, the processing time per frame decreases remarkably.
IV. CONCLUSION

A. Conclusion
This paper presents an efficient framework for detecting and recognizing traffic lights using HOG features and an SVM classifier. To reduce the amount of computation, we propose using a VSA model to obtain candidate regions of traffic lights, so that the detector moves only within these candidate regions. Experiments have been conducted to compare baseline algorithms I and II with the proposed method. The results support the conclusion that the proposed method is more robust and achieves higher accuracy than baseline algorithm I. Because the detection window moves locally within the candidate regions instead of globally over the whole image, the proposed method can meet real-time requirements, even though its processing time per frame is slightly higher than that of baseline algorithm I.

B. Future work
The algorithm does not consider lane information, and all detection results are output. However, while an intelligent vehicle is running, the traffic light signals for different lanes may differ, which requires combining the vehicle's current lane information with traffic light detection to obtain the corresponding result. Further research will also consider using a GPU to improve real-time performance.

REFERENCES
[1] M. Omachi and S. Omachi, "Detection of traffic light using structural information," in Signal Processing (ICSP), 2010 IEEE 10th International Conference on, Beijing, 2010.
[2] Y. K. Kim, K. W. Kim, and Y. Xiaoli, "Real time traffic light recognition system for color vision deficiencies," in Mechatronics and Automation (ICMA), 2007 International Conference on, Harbin, 2007.
[3] Z. Yue et al., "A multi-feature fusion based traffic light recognition algorithm for intelligent vehicles," in Control Conference (CCC), 2014 33rd Chinese, Nanjing, 2014.
[4] G. Xiong et al., "Autonomous driving of intelligent vehicle BIT in 2009 future challenge of China," in Intelligent Vehicles Symposium (IV), 2010 IEEE, 2010.
[5] C. Jang et al., "Multiple exposure images based traffic light recognition," in Intelligent Vehicles Symposium Proceedings, 2014 IEEE, 2014.
[6] Y. Xie et al., "Unifying visual saliency with HOG feature learning for traffic sign detection," in Intelligent Vehicles Symposium, 2009 IEEE, 2009.
[7] F. Zaklouta and B. Stanciulescu, "Segmentation masks for real-time traffic sign recognition using weighted HOG-based trees," in Intelligent Transportation Systems (ITSC), 2011 14th International IEEE Conference on, 2011.
[8] I. M. Creusen et al., "Color exploitation in HOG-based traffic sign detection," in Image Processing (ICIP), 2010 17th IEEE International Conference on, Hong Kong, 2010.
[9] R. Kastner et al., "Attention-based traffic sign recognition with an array of weak classifiers," in Intelligent Vehicles Symposium (IV), 2010 IEEE, 2010.
[10] V. John et al., "Traffic light recognition in varying illumination using deep learning and saliency map," in Intelligent Transportation Systems (ITSC), 2014 IEEE 17th International Conference on, 2014.
[11] N. Fairfield and C. Urmson, "Traffic light mapping and detection," in Robotics and Automation (ICRA), 2011 IEEE International Conference on, 2011.
[12] Q. Zhang, G. Gu, and H. Xiao, "Computational model of visual selective attention," Robot, vol. 31, no. 6, pp. 574-580, 2009.
[13] M. Yang, Z. Lu, L. Guo, et al., "Vision-based environmental perception and navigation of micro-intelligent vehicles," in Foundations and Applications of Intelligent Systems, Springer Berlin Heidelberg, 2014, pp. 653-665.
[14] R. de Charette and F. Nashashibi, "Real time visual traffic lights recognition based on spot light detection and adaptive traffic lights templates," in Intelligent Vehicles Symposium, 2009 IEEE, Xi'an, 2009.
[15] X. Hou and L. Zhang, "Saliency detection: A spectral residual approach," in Computer Vision and Pattern Recognition (CVPR), 2007 IEEE Conference on, Minneapolis, MN, 2007.