Comparison of Machine Learning and Deep Learning Classifiers for Traffic Sign Image Classification

Junaid Ahmed

Siyu Huang

Kaiyuan Li

Electrical and Computer Engineering, North Carolina State University



Abstract: This report studies the traffic sign image classification problem on the GTSRB dataset. The data is first pre-processed, and different image features are then selected for classification. We consider hand-crafted features, namely pixel intensity values, hue histograms (HH), and Histograms of Oriented Gradients (HOG), and also present a procedure for obtaining learned features from a pretrained Convolutional Neural Network (CNN). For classification, k-nearest neighbors (kNN), Linear Discriminant Analysis (LDA), multi-class Support Vector Machine (SVM), and CNN classifiers are applied, and their performance is compared in terms of accuracy and time complexity. On the chosen dataset, LDA and CNN performed best, with accuracies of 95.66% and 97.30% respectively; LDA was the fastest classifier and CNN the slowest. This work was carried out as a project for the ECE-592 course, Fall 2017 semester.

Keywords: Hue Histogram, HOG, kNN, LDA, SVM, CNN

I. INTRODUCTION

Traffic sign recognition is a multi-category classification problem with unbalanced class frequencies. It provides drivers with safety and precautionary information, and it also has applications in the automotive industry, sign monitoring, and related areas. The problem is easy for humans because signs have easily recognizable shapes, colors, icons, and text, but it is difficult for computers because of varying lighting conditions and viewpoints. Many approaches have been proposed to solve this problem; our focus is to compare several of them and evaluate their performance.

The paper is organized as follows: Section II gives the motivation behind the work, Section III gives our problem formulation, and Section IV outlines some recent results. Section V discusses the composition of the dataset and its analysis. Section VI provides the details of the experiments, including feature selection, pre-processing, and the structure of the classifiers. Section VII analyzes the results obtained. Finally, conclusions are drawn in Section VIII.

II. MOTIVATION

Traditionally, traffic sign recognition has been studied as a computer vision problem, as in [1], but with the rise of machine learning and deep CNNs the problem is now divided into detection and classification parts. Detection focuses on finding the traffic sign in an image and is still mainly done using computer vision approaches, while classification can use any machine learning classifier, as in [2]. In the classification area, there has been plenty of research on this topic, using various datasets and different classification approaches. Since many datasets are used to study the problem, and many of them do not cover real-world situations comprehensively, it is difficult to compare the performance of classifiers evaluated on different datasets. It is therefore necessary to apply different classification methods to a single extensive dataset in order to see the differences in processing time and accuracy. Here we compare machine learning algorithms (kNN, SVM, and LDA) and deep learning approaches based on CNNs in terms of classification performance on a specific dataset.

III. PROBLEM STATEMENT

The problem studied here is the performance comparison of classification algorithms on a specific dataset. An extensive dataset is selected to give an unbiased comparison of the different approaches. After pre-processing and feature extraction, kNN, LDA, SVM, and CNN classifiers are implemented, and their performance is compared in terms of training time, classification time, validation accuracy, and test accuracy.

IV. RELATED WORK

J. Stallkamp et al. [3] presented the method of a committee of convolutional neural networks on the GTSRB dataset. The classifier achieves a correct classification rate of 99.46%. This method is able to learn task-specific features from raw data. Safat B. Wali et al. [4] developed an efficient traffic sign detection and recognition system that includes an enriched dataset of Malaysian traffic signs. Using RGB colour segmentation and shape matching followed by a support vector machine (SVM) classifier, the system achieved promising results: an accuracy of 95.71%, a false positive rate of 0.9%, and a processing time of 0.43 s. Maldonado Bascón et al. [5] reported a classification accuracy of 95.5% using support vector machines. Their dataset comprises 36,000 Spanish traffic sign samples from 193 sign classes.

V. DATASET

For our study, we selected the German Traffic Sign Recognition Benchmark (GTSRB) dataset [6]. GTSRB provides 39,209 training images and 12,630 test images in 43 classes. Although the number of classes is limited, GTSRB covers many practical scenarios, such as motion blur, sun glare, physical damage, and viewpoint variations, which make the classification problem interesting and challenging. Fig. 1 shows a few images from the dataset.

Fig. 1. Some challenging images from the dataset

The dataset is organized into training and test partitions. Training images are organized as video sequences, whereas the test images are drawn at random from different scenes, so that algorithms are tested without being able to accumulate information over consecutive frames. Although the dataset provides a large number of training images, the number of images per class is not uniform. Fig. 2 shows the distribution of the training data. The image size is not fixed and varies from 15×15 to 225×225. The number of border pixels around the sign also varies from image to image (at least 5 pixels).

Fig. 2. Distribution of training data over all classes in GTSRB dataset

VI. EXPERIMENTS

This section gives the details of pre-processing, feature selection, the classifiers used, and their parameter selection. All experiments were performed in the Matlab environment. For kNN and SVM, Matlab-provided functions were used, while LDA was implemented by the authors. The CNN was constructed using the Neural Network Toolbox of Matlab.

A. Pre-processing

As mentioned in Section V, the image size in the training and test datasets is not fixed. We therefore wrote Matlab code to upscale or downscale each image in the dataset to a fixed size of 30×30. The size 30×30 was selected empirically to give good accuracy at low processing time. To handle different lighting conditions, each image was also normalized to zero mean and unit standard deviation (i.e., zero-center normalization). Other pre-processing tasks were also tried, such as converting RGB images to grayscale and cropping image borders, but since none of these yielded any performance improvement, the results presented here are for processing either color images or image features (discussed in Section VI-B).

B. Feature Selection

We used two approaches to obtain a feature vector for each image in the dataset. One approach was to use hand-crafted features of an image, and the other was to extract learned features using a pre-trained CNN (AlexNet in our case), called feature extraction/learning. The feature vectors used are briefly described below:

1) Image Intensity Values: The intensity values of the image stored in a vector. This is a naive way to obtain a feature vector and does not provide much information, so we expect poor performance from it.

2) Hue Histograms: A hue histogram is a 256-length histogram vector of the hue values in the HSV color space representation of the image.

3) HOG: HOG is a gradient-based feature descriptor, originally developed for human detection in images [7]. It involves dividing the image into fixed-size patches (two variants were considered: HOG1 = 5×5 and HOG2 = 2×2) and computing gradients and their histograms to obtain a feature vector of size 1568.

4) Feature Extraction using AlexNet [8]: In order to get abstract features, we used a pre-trained CNN (AlexNet) to obtain the feature vector for both the training and test datasets. We used the fully connected layer 'fc7' of AlexNet to get the learned feature vector of size 4096. Fig. 3 outlines the feature extraction procedure using the CNN.

Fig. 3. Feature extraction using pre-trained CNN
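The pre-processing and hue-histogram steps described in Section VI can be sketched in a few lines of Python with NumPy. This is only an illustrative sketch, not the Matlab code used for the experiments: the function names are hypothetical, nearest-neighbor resizing is an assumption (the interpolation method is not stated), and the normalization is computed over the whole image rather than per channel.

```python
import numpy as np

def resize_nearest(img, size=30):
    """Resize an H x W x C image to size x size (nearest-neighbor, an assumed method)."""
    h, w = img.shape[:2]
    rows = np.arange(size) * h // size   # source row for every output row
    cols = np.arange(size) * w // size   # source column for every output column
    return img[rows][:, cols]

def zero_center_normalize(img):
    """Shift and scale an image to zero mean and unit standard deviation."""
    img = img.astype(np.float64)
    std = img.std()
    return (img - img.mean()) / (std if std > 0 else 1.0)

def hue_histogram(rgb, bins=256):
    """Histogram of HSV hue values (hue scaled to [0, 1)) for a uint8 RGB image."""
    rgb = rgb.astype(np.float64) / 255.0
    r, g, b = rgb[..., 0], rgb[..., 1], rgb[..., 2]
    mx = rgb.max(axis=-1)
    delta = mx - rgb.min(axis=-1)
    hue = np.zeros_like(mx)              # gray pixels (delta == 0) keep hue 0
    mask = delta > 0
    rmax = mask & (mx == r)
    gmax = mask & (mx == g) & ~rmax
    bmax = mask & ~rmax & ~gmax
    hue[rmax] = ((g - b)[rmax] / delta[rmax]) % 6
    hue[gmax] = (b - r)[gmax] / delta[gmax] + 2
    hue[bmax] = (r - g)[bmax] / delta[bmax] + 4
    hue /= 6.0
    hist, _ = np.histogram(hue, bins=bins, range=(0.0, 1.0))
    return hist

# Random stand-in for one GTSRB image (the real images are not bundled here).
rng = np.random.default_rng(0)
sign = rng.integers(0, 256, size=(47, 52, 3), dtype=np.uint8)
small = resize_nearest(sign)
print(small.shape, hue_histogram(small).shape, zero_center_normalize(small).std())
```

The hue histogram is computed on the resized but unnormalized image here, since hue is defined on the original color values; whether the experiments did the same is not stated.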

C. Classification Methods

1) KNN: k-nearest neighbors (kNN) is a classification method that uses the vector space model to classify data points [9]. It relies on the fact that members of the same class have high similarity, so the class of an unknown sample can be estimated by calculating its similarity to samples of known class. Since the performance varies with the number of neighbors (k), we examined multiple values of k and chose the one that maximized test accuracy and minimized overfitting.

2) LDA: Linear discriminant analysis (LDA) is a generalization of Fisher’s linear discriminant [10]. It assumes that the class densities are multivariate Gaussian distributions with a common covariance matrix, and it classifies using the maximum a posteriori estimate of class membership. LDA works effectively when each of the independent variables is continuous. We first obtained the mean and variance of each class from the training set, then applied the LDA equation to each data point of the test set. The class with the highest probability was selected as the class of the test point.

3) SVM: Support Vector Machines (SVM) are supervised learning models used in classification and regression analysis [11]. In the basic formulation, each training instance is labeled as belonging to one of two categories. The basic SVM model is therefore defined only for two-class problems, but since we have 43 classes we used the Matlab-provided multi-class SVM classifier based on [12].
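The LDA procedure described in item 2 (class means, a common covariance matrix, and selection of the most probable class) can be sketched as follows. This is a minimal NumPy sketch under stated assumptions, not the authors' Matlab implementation: the function names are hypothetical, the common covariance is estimated as the pooled within-class covariance, and the small ridge term `reg` is an addition for numerical stability.

```python
import numpy as np

def lda_fit(X, y, reg=1e-6):
    """Estimate class means, priors, and a pooled covariance; return linear scoring terms."""
    classes = np.unique(y)
    n, d = X.shape
    means = np.stack([X[y == c].mean(axis=0) for c in classes])   # K x d
    priors = np.array([(y == c).mean() for c in classes])         # K
    centered = X - means[np.searchsorted(classes, y)]             # subtract own class mean
    cov = centered.T @ centered / (n - len(classes)) + reg * np.eye(d)
    weights = np.linalg.solve(cov, means.T)                       # d x K, Sigma^-1 mu_k
    bias = -0.5 * np.sum(means.T * weights, axis=0) + np.log(priors)
    return classes, weights, bias

def lda_predict(model, X):
    """Pick the class with the largest discriminant score (highest posterior)."""
    classes, weights, bias = model
    return classes[np.argmax(X @ weights + bias, axis=1)]

# Two synthetic Gaussian classes stand in for real feature vectors.
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0.0, 0.5, size=(100, 2)), rng.normal(5.0, 0.5, size=(100, 2))])
y = np.array([0] * 100 + [1] * 100)
model = lda_fit(X, y)
print((lda_predict(model, X) == y).mean())
```

Because the covariance is shared, the score for each class is linear in the input, so classification reduces to one matrix product per batch, which is consistent with LDA being cheap to train and evaluate.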

Fig. 4. kNN performance with different choices of k
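A sweep over k of the kind summarized in Fig. 4 can be sketched as below. The code is a hypothetical NumPy illustration, not the Matlab function used in the experiments: it assumes Euclidean distance and majority voting, and it scores each k on a held-out split rather than on the test set, which is a choice made for this example.

```python
import numpy as np

def knn_predict(X_train, y_train, X_query, k):
    """Classify each query point by majority vote among its k nearest training points."""
    # Squared Euclidean distances, shape (n_query, n_train); fine for small data only.
    d2 = ((X_query[:, None, :] - X_train[None, :, :]) ** 2).sum(axis=2)
    nearest = np.argsort(d2, axis=1)[:, :k]
    votes = y_train[nearest]                      # labels must be non-negative integers
    return np.array([np.bincount(v).argmax() for v in votes])

def sweep_k(X_train, y_train, X_val, y_val, ks):
    """Return the held-out accuracy for every candidate k."""
    return {k: float((knn_predict(X_train, y_train, X_val, k) == y_val).mean()) for k in ks}

# Synthetic two-class data stands in for real feature vectors.
rng = np.random.default_rng(2)
X = np.vstack([rng.normal(0.0, 1.0, size=(80, 2)), rng.normal(2.0, 1.0, size=(80, 2))])
y = np.array([0] * 80 + [1] * 80)
scores = sweep_k(X[::2], y[::2], X[1::2], y[1::2], ks=[1, 3, 5, 7, 9])
best_k = max(scores, key=scores.get)
print(scores, best_k)
```

Comparing training accuracy with held-out accuracy for each k is one simple way to see overfitting: at k = 1 the training accuracy is trivially perfect, so a large gap to the held-out score indicates that a larger k may generalize better.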

4) CNN: The Convolutional Neural Network (CNN) is a feedforward neural network that performs very well in large-scale image processing applications. It is inspired by biological processes and is able to learn task-specific invariant features hierarchically. A convolutional neural network consists of one or more convolutional layers and a top fully connected layer with associated weights, along with ReLU and pooling layers. This structure allows the network to take advantage of the two-dimensional structure of the input data. For our dataset, we tried different numbers of layers, but a single convolutional layer with the following configuration performed best:

• Input Layer: [30 × 30 × 3 input image layer]
• Convolution Layer: [size 3×3 and depth of 20]
• Max Pooling Layer: [stride = 2]
• ReLU Layer
• Fully connected layer: [size = 43]
• Softmax Layer

We also processed the data using the LeNet model architecture [13] for the CNN. The LeNet architecture, first introduced by LeCun et al., was used for character recognition in documents. It consists of 13 layers, with two [5 × 5] convolutional layers and two fully connected layers. LeNet was implemented as follows:

• Input Layer: [30 × 30 × 3 input image layer]
• Convolution Layer: [size 5×5 and depth of 20]
• ReLU Layer
• Max Pooling Layer: [stride = 2]
• Convolution Layer: [size 5×5 and depth of 20]
• ReLU Layer
• Max Pooling Layer: [stride = 2]
• Flatten layers from numbers 8 [1x1x400 → 400] and 6 [5x5x16 → 400]
• Fully connected layer: [size = 120 → 84]
• ReLU Layer
• Fully connected layer: [size = 84 → 43]
• Softmax Layer
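To make the layer sizes of the single-layer network concrete, the short Python sketch below traces the spatial dimensions through the convolution and pooling stages. Several details are assumptions because they are not given in the configuration: the convolution is taken to use stride 1 with no padding, and the max pooling window is taken to be 2×2. The helper name `conv_output_size` is hypothetical.

```python
def conv_output_size(n, kernel, stride=1, padding=0):
    """Output side length of a convolution or pooling window sliding over n pixels."""
    return (n + 2 * padding - kernel) // stride + 1

depth = 20                                            # number of 3x3 filters
side = conv_output_size(30, kernel=3)                 # convolution: assumed stride 1, no padding
side = conv_output_size(side, kernel=2, stride=2)     # max pooling: assumed 2x2 window, stride 2
flat = side * side * depth                            # inputs to the fully connected layer

conv_params = (3 * 3 * 3) * depth + depth             # weights + biases of the convolution layer
fc_params = flat * 43 + 43                            # weights + biases of the 43-way output layer
print(side, flat, conv_params, fc_params)
```

Under these assumptions the fully connected layer receives a 14×14×20 map (3920 values). Each further pooling stage with stride 2 would roughly halve the side length again, so a 30×30 input leaves little spatial resolution for additional layers.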

VII. RESULTS AND ANALYSIS

We report here the best performance of the kNN, LDA, SVM, and CNN classifiers for the combinations of features discussed. Table I summarizes the results. Results are given in terms of training and classification time, i.e., the time a classifier takes to train on the training data and to classify the test data, respectively. Validation accuracy is obtained by classifying a subset of the training data (10,000 images). Test accuracy is the accuracy of the classifier in predicting the test data. The results are analyzed below:

1) kNN: The kNN model was trained on pixel intensity values, hue histograms, HOG1, and HOG2 features separately and evaluated on the test data. Although kNN was among the fastest classifiers (comparable to LDA), the highest accuracy we could reach was 81.36%. The value of k was selected to minimize overfitting and maximize test accuracy. Fig. 4 shows an example of selecting k for hue histogram features.

2) LDA: LDA was trained on pixel intensity values, hue histograms, HOG1, and HOG2 features separately, and surprisingly we obtained a validation accuracy of 98.31% and a test accuracy of 95.56%. LDA with HOG2 gave the highest accuracy (see Fig. 5), and the performance was comparable (ours is 0.19% higher) to that achieved in the GTSRB competition results [3]. The LDA classifier was also the fastest of all in terms of processing time for training and testing.

3) Multi-class SVM using AlexNet Features: With AlexNet features we trained kNN, LDA, and multi-class SVM classifiers. Reported here are the best results, which were achieved using the multi-class SVM (see Table I). Although the performance was comparable to that achieved using HOG with kNN, the feature vector generated by AlexNet (size 4096) is larger than the others, which increased the resource requirements and hence the processing time, even for simple classifiers like kNN and SVM.

4) CNN: After pre-processing the images, two CNNs were tested, as described in Section VI-C. The single-layer CNN resulted in a test accuracy of 85.48%. As we increased the number of layers, the performance decreased, mainly because our image size [30 × 30] was small, so adding more layers and max pooling reduced the information available to deeper layers and hence degraded the performance of the softmax layer. We also tested LeNet on the dataset; although it gave the highest test accuracy of all, it was also the slowest, taking almost an hour to train and 4 min to classify the entire test dataset.

Fig. 5. Accuracy Comparison

TABLE I. COMPARISON OF PERFORMANCE OF DIFFERENT CLASSIFICATION METHODS USING SPECIFIC FEATURE VECTORS AND CNN

VIII. CONCLUSION AND FUTURE WORK

We compared the performance of kNN, LDA, multi-class SVM, and CNN classifiers on the traffic sign image classification problem. In general, kNN performed less well than the other classifiers on this dataset. Overall, the HOG2 feature was the most efficient and accurate for the kNN and LDA methods. We also conclude that, in contrast to traditional computer vision, where hand-crafted features are designed through experience, convolutional neural networks are able to learn task-specific features from raw data. From the application point of view, where processing time, memory, and resource requirements play an essential role in choosing a classifier, we found LDA with HOG features to be a computationally cheap and simple classifier that gives results comparable to the computationally demanding CNN. Thus, for applications where processing time and resource requirements are important, hardware capabilities should also be taken into account when choosing a classifier.

Future work should explore and implement more CNN model architectures and compare their performance. One interesting approach would be to feed the outputs of multiple layers into the fully connected layer, increasing the feature vector size, to mitigate the performance reduction that we observed when adding more layers to the CNN. One pre-processing aspect to study would be to vary the image size and see how it affects performance. Other classifiers (e.g., Random Forest) and different feature vectors (e.g., Hough Transform) can also be tried on the same or a different dataset to verify the conclusions.

REFERENCES

[1] J. Torresen, J. W. Bakke, and L. Sekanina. Efficient recognition of speed limit signs. In Proceedings. The 7th International IEEE Conference on Intelligent Transportation Systems (IEEE Cat. No.04TH8749), pages 652–656, Oct 2004.
[2] L. Chen, Q. Li, M. Li, and Q. Mao. Traffic sign detection and recognition for intelligent vehicle.
In 2011 IEEE Intelligent Vehicles Symposium (IV), pages 908–913, June 2011.
[3] J. Stallkamp, M. Schlipsing, J. Salmen, and C. Igel. Man vs. computer: Benchmarking machine learning algorithms for traffic sign recognition. Neural Networks, 32 (Supplement C):323–332, 2012. Selected Papers from IJCNN 2011.
[4] Safat B. Wali, Mahammad A. Hannan, Aini Hussain, and Salina A. Samad. An automatic traffic sign detection and recognition system based on colour segmentation, shape matching, and SVM. Mathematical Problems in Engineering, 2015, 2015. doi: 10.1155/2015/250461.
[5] S. Maldonado Bascón, J. Acevedo Rodríguez, S. Lafuente Arroyo, A. Fernández Caballero, and F. López-Ferreras. An optimization on pictogram identification for the road-sign recognition task using SVMs. Comput. Vis. Image Underst., 114(3):373–383, March 2010. ISSN 1077-3142.
[6] J. Stallkamp, M. Schlipsing, J. Salmen, and C. Igel. The German traffic sign recognition benchmark: A multi-class classification competition. In The 2011 International Joint Conference on Neural Networks, pages 1453–1460, July 2011. URL http://benchmark.ini.rub.de/.
[7] N. Dalal and B. Triggs. Histograms of oriented gradients for human detection. In 2005 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR’05), volume 1, pages 886–893, June 2005.
[8] Alex Krizhevsky, Ilya Sutskever, and Geoffrey E. Hinton. ImageNet classification with deep convolutional neural networks. In F. Pereira, C. J. C. Burges, L. Bottou, and K. Q. Weinberger, editors, Advances in Neural Information Processing Systems 25, pages 1097–1105. Curran Associates, Inc., 2012.
[9] N. S. Altman. An introduction to kernel and nearest-neighbor nonparametric regression. The American Statistician, 46(3):175–185, 1992.
[10] R. A. Fisher. The use of multiple measurements in taxonomic problems. Annals of Eugenics, 7(2):179–188, 1936.
[11] Corinna Cortes and Vladimir Vapnik. Support-vector networks. Machine Learning, 20(3):273–297, Sep 1995.
[12] Sergio Escalera, Oriol Pujol, and Petia Radeva. Separability of ternary codes for sparse designs of error-correcting output codes. Pattern Recognition Letters, 30(3):285–297, 2009.
[13] Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner. Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11):2278–2324, Nov 1998.