Convolutional Neural Networks for Aerial Vehicle Detection and Recognition

Amir Soleimani∗, Nasser M. Nasrabadi∗, Elias Griffith†, Jason Ralph†, Simon Maskell†

∗Lane Department of Computer Science and Electrical Engineering, West Virginia University
†Department of Electrical Engineering and Electronics, University of Liverpool

Email: [email protected], [email protected], [email protected], [email protected], [email protected]

Abstract—This paper investigates the problem of aerial vehicle recognition using a two-class deep convolutional neural network classifier. The network receives an image and a desired class, and outputs a yes or no decision indicating whether the image matches the desired class. This strategy makes it possible to consider more classes at test time than were seen during training.


I. INTRODUCTION

Aerial imagery captured by drones or Unmanned Aerial Vehicles (UAVs) is a great tool for surveillance because of its wide field of view and the ability of drones to access places that would otherwise be difficult to visit. This advantage necessarily results in objects of interest occupying a small number of pixels in each image. Meanwhile, providing a comprehensive dataset that covers all probable object variations and classes is impossible.

In order to alleviate the challenge of objects occupying a small number of pixels, we split the problem into two sub-problems [1]. First, we assume that a deep detector like SSD [2] extracts objects or areas of interest; second, we use a deep convolutional network to recognize which of the extracted objects of interest are also the vehicles we wish to detect.

In this paper, we propose a framework that can handle the problem of open-ended classification. A classical image classification system receives an image and produces an output label. In contrast, the architecture used in this paper receives an image and a desired class as a code vector in its input, and makes a yes or no decision about the correctness of the input label. In other words, it decides whether the input image has the desired class label or not (see Figure 1).

Fig. 1. (a) A classical classifier that receives an image and predicts a label code for the image class. (b) The proposed architecture, which can consider classes that have not been seen during training.

Fig. 2. (a) An aerial image. (b) Some example objects of interest.

II. RELATED WORKS

The combination of a good hand-engineered image feature descriptor like the histogram of oriented gradients (HOG) [3] or the scale-invariant feature transform (SIFT) [4] with a classifier like the support vector machine (SVM) [5] or the multilayer perceptron (MLP) was the main focus of image classification research for several years. In recent years, however, deep convolutional neural networks, which have outperformed other methods on datasets like VOC [6] and ImageNet [7], have attracted growing interest in the field. Deep convolutional neural networks like LeNet [8] and AlexNet [9] have confirmed their effectiveness as classifiers that receive only the images, without any image feature descriptors.

Visual Question Answering (VQA) systems take an image and an open-ended textual question about the given image, and try to provide an answer to the question in textual format [10]. The core idea behind the VQA task is to answer unseen questions, which can be thought of here as classes. Agrawal et al., using an LSTM and MLP structure, achieve acceptable results on their large dataset consisting of about 0.25M images, 0.76M questions, and 10M answers (http://www.visualqa.org/).
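To make the two-stage pipeline of Section I concrete, the following sketch shows how a detector and the yes/no matcher might be composed at test time. This is a minimal illustration, not the paper's implementation: `detector` and `matcher` are hypothetical stand-ins for the SSD front end and the trained yes/no network.

```python
# Minimal sketch of the detect-then-recognize pipeline (hypothetical helpers).
def detect_and_recognize(detector, matcher, frame, desired_class_code):
    """Return the boxes of detected objects that match the desired class."""
    hits = []
    for crop, box in detector(frame):          # candidate objects of interest
        if matcher(crop, desired_class_code):  # yes/no decision per candidate
            hits.append(box)
    return hits
```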

[Fig. 3 diagram: the image passes through convolutional + ReLU layers with max pooling (64, 128, 256, 512, 512 feature maps); the desired class is encoded as a bag-of-words representation; both branches pass through fully-connected layers into a common space with a yes/no output.]

Fig. 3. Proposed method for the task of aerial vehicle detection. Detected objects of interest are described using the features extracted by the convolutional layers. Desired classes are represented using bag of words, and fully-connected layers then build a common latent space in which the yes or no decision is made.
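The bag-of-words class code in Fig. 3 can be illustrated as follows. This is a sketch assuming a binary encoding over a nine-word vocabulary (the seven colors plus the two vehicle types); the paper does not spell out the exact encoding.

```python
# Sketch of a binary bag-of-words code for a desired class such as "red truck".
# The vocabulary and the binary encoding are assumptions for illustration.
VOCAB = ["black", "white", "gray", "yellow", "green", "blue", "red",
         "car", "truck"]

def encode_class(desired_class):
    """Return a binary bag-of-words vector over VOCAB."""
    words = desired_class.lower().split()
    return [1.0 if w in words else 0.0 for w in VOCAB]

print(encode_class("red truck"))
# [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 1.0, 0.0, 1.0]
```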

TABLE I
RESULTS ON ALL THE CLASSES

Accuracy    True Positive    True Negative
97.13%      96.43%           98.26%

III. PROPOSED METHOD

We propose a framework in which, first, SSD [2], which has shown promising performance in aerial image object detection [1], [11], proposes objects of interest. Second, we use VGG-16 [12] with only one fully-connected layer to extract visual descriptors. Meanwhile, desired classes are coded using a bag-of-words representation; the visual features extracted by the VGG network and the desired class information are then transformed into a common latent feature subspace using a fully-connected layer (see Figure 3). Finally, a two-class softmax classifier predicts whether the image has the desired class or not.
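A minimal PyTorch sketch of this architecture is given below. The layer sizes and the fusion by concatenation are assumptions; the paper only states that VGG-16 features obtained through a single fully-connected layer and a bag-of-words class code are mapped into a common latent subspace, followed by a two-class softmax.

```python
# Sketch of the yes/no matching network (sizes and fusion are assumptions).
import torch
import torch.nn as nn
from torchvision.models import vgg16


class YesNoMatcher(nn.Module):
    def __init__(self, vocab_size=9, common_dim=256):
        super().__init__()
        self.features = vgg16(weights=None).features      # VGG-16 conv trunk
        self.pool = nn.AdaptiveAvgPool2d((7, 7))
        self.img_fc = nn.Linear(512 * 7 * 7, common_dim)  # single FC on image
        self.cls_fc = nn.Linear(vocab_size, common_dim)   # FC on class code
        self.classifier = nn.Linear(common_dim * 2, 2)    # two-class output

    def forward(self, image, class_code):
        v = self.pool(self.features(image)).flatten(1)
        v = torch.relu(self.img_fc(v))
        c = torch.relu(self.cls_fc(class_code))
        # Concatenation is one plausible fusion; the paper does not specify.
        return self.classifier(torch.cat([v, c], dim=1))
```

At training time the two output logits would be fed to a cross-entropy loss, which applies the softmax implicitly.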

IV. RESULTS

In order to evaluate our proposed framework, we use our dataset, which contains real aerial images with synthetic cars and trucks placed on the streets (see Figure 2). Vehicles can have seven different colors: black, white, gray, yellow, green, blue, and red. The two vehicle types in conjunction with these seven colors describe a 14-class recognition problem.

[Fig. 4 bar chart: accuracy (%) on the vertical axis against the 14 classes (black, white, yellow, gray, green, blue, and red cars and trucks) on the horizontal axis.]

Fig. 4. Accuracy of the system for each unseen class.

Table I shows the performance of the proposed method on our dataset; note that here we trained on all 14 classes. Regarding the quality and the size of objects in our aerial images, this performance is promising. In order to check the ability of the proposed method on unseen classes (the open-ended setup), we repeat the experiment so that we train on only 13 classes and set one class aside for testing. This experimental setup is repeated 14 times, once for each of the 14 classes. Figure 4 shows the accuracy of the system for the unseen classes. Based on this experiment, we can see that the proposed method is capable of recognizing unseen vehicles belonging to similar classes.
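The leave-one-class-out experiment just described can be summarized in a few lines; `train_model` and `evaluate` are hypothetical helpers standing in for the actual training and testing routines.

```python
# Sketch of the open-ended evaluation: train on 13 classes, test on the
# held-out one, repeated for all 14 classes (cf. Fig. 4).
CLASSES = [f"{color} {vehicle}"
           for vehicle in ("car", "truck")
           for color in ("black", "white", "gray", "yellow",
                         "green", "blue", "red")]

def open_ended_protocol(train_model, evaluate, dataset):
    scores = {}
    for held_out in CLASSES:
        seen = [c for c in CLASSES if c != held_out]
        model = train_model(dataset, classes=seen)       # 13 training classes
        scores[held_out] = evaluate(model, dataset, held_out)
    return scores  # per-class accuracy on the unseen class
```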


V. CONCLUSION

In this paper, we investigated the problem of aerial vehicle detection. We used a deep detector to generate objects of interest, and proposed a yes or no classification framework in which images and desired classes are received as the input, and the output is predicted from the feature vector generated in the common latent feature subspace. Results show that, in addition to promising performance when recognizing seen classes, our framework can recognize unseen classes as well. This is the advantage of the open-ended framework.

REFERENCES

[1] A. Soleimani and N. Nasrabadi, "Convolutional neural networks for aerial multi-label pedestrian detection," in 21st International Conference on Information Fusion (accepted), 2018.
[2] W. Liu, D. Anguelov, D. Erhan, C. Szegedy, S. Reed, C.-Y. Fu, and A. C. Berg, "SSD: Single shot multibox detector," in European Conference on Computer Vision. Springer, 2016, pp. 21–37.
[3] N. Dalal and B. Triggs, "Histograms of oriented gradients for human detection," in IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR), vol. 1, 2005, pp. 886–893.
[4] D. G. Lowe, "Distinctive image features from scale-invariant keypoints," International Journal of Computer Vision, vol. 60, no. 2, pp. 91–110, 2004.
[5] C. Cortes and V. Vapnik, "Support-vector networks," Machine Learning, vol. 20, no. 3, pp. 273–297, 1995.
[6] M. Everingham, L. Van Gool, C. K. Williams, J. Winn, and A. Zisserman, "The PASCAL visual object classes (VOC) challenge," International Journal of Computer Vision, vol. 88, no. 2, pp. 303–338, 2010.
[7] J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei, "ImageNet: A large-scale hierarchical image database," in IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2009, pp. 248–255.
[8] Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner, "Gradient-based learning applied to document recognition," Proceedings of the IEEE, vol. 86, no. 11, pp. 2278–2324, 1998.
[9] A. Krizhevsky, I. Sutskever, and G. E. Hinton, "ImageNet classification with deep convolutional neural networks," in Advances in Neural Information Processing Systems, 2012, pp. 1097–1105.
[10] S. Antol, A. Agrawal, J. Lu, M. Mitchell, D. Batra, C. Lawrence Zitnick, and D. Parikh, "VQA: Visual question answering," in Proceedings of the IEEE International Conference on Computer Vision, 2015, pp. 2425–2433.
[11] M. Barekatain, M. Martí, H.-F. Shih, S. Murray, K. Nakayama, Y. Matsuo, and H. Prendinger, "Okutama-Action: An aerial view video dataset for concurrent human action detection," in 1st Joint BMTT-PETS Workshop on Tracking and Surveillance, CVPR, 2017, pp. 1–8.
[12] K. Simonyan and A. Zisserman, "Very deep convolutional networks for large-scale image recognition," arXiv preprint arXiv:1409.1556, 2014.
