Integrating Multi-Layer Features of Convolutional Neural Networks for Remote Sensing Scene Classification

Erzhu Li, Junshi Xia, Member, IEEE, Peijun Du, Senior Member, IEEE, Cong Lin, and Alim Samat, Member, IEEE

This work was supported in part by the National Natural Science Foundation of China (NSFC) under Grant No. 41631176, the Research Projects of China Geological Survey under Grant No. 12120113007500, the Priority Academic Program Development of Jiangsu Higher Education Institutions (PAPD), and the Fundamental Research Funds for the Central Universities (Corresponding author: Peijun Du). E. Li, P. Du and C. Lin are with the Department of Geographical Information Science, and the Key Laboratory for Satellite Mapping Technology and Applications of the State Administration of Surveying, Mapping and Geoinformation of China, Nanjing University, and with the Jiangsu Center for Collaborative Innovation in Geographical Information Resource Development and Application, Nanjing 210023, China (e-mail: [email protected]; [email protected]; [email protected]). J. Xia is with the Research Center for Advanced Science and Technology, The University of Tokyo, 153-0094 Tokyo, Japan (e-mail: [email protected]). A. Samat is with the State Key Laboratory of Desert and Oasis Ecology, Xinjiang Institute of Ecology and Geography, CAS, Urumqi 830011, China (e-mail: [email protected]).

Abstract—Scene classification from remote sensing images opens new possibilities for the application of high spatial resolution imagery. How to efficiently implement scene recognition from high spatial resolution imagery remains a significant challenge in the remote sensing domain. Recently, Convolutional Neural Networks (CNNs) have attracted tremendous attention because of their excellent performance in different fields. However, most works focus on fully training a new deep CNN model for the target problem, without considering the limited training data and the time-consuming training process. To alleviate these drawbacks, some works have used pre-trained CNN models as feature extractors to build feature representations of scene images for classification, with successful applications including remote sensing scene classification. However, existing works pay little attention to exploring the benefits of multi-layer features for improving scene classification. As a matter of fact, the information hidden in different layers has great potential for improving feature discrimination capacity. Therefore, this paper presents a fusion strategy for integrating multi-layer features of a pre-trained CNN model for scene classification. Specifically, the pre-trained CNN model is used as a feature extractor to extract deep features from different convolutional and fully-connected layers; then a multiscale improved Fisher kernel (MIFK) coding method is proposed to build mid-level feature representations of the convolutional deep features. Finally, the mid-level features extracted from convolutional layers and the features of fully-connected layers are fused by a PCA/SRKDA method for classification. For validation and comparison purposes, the proposed approach is evaluated via experiments with two challenging high-resolution remote sensing datasets, and shows competitive performance compared with fully trained CNN models, fine-tuned CNN models, and other related works.

Index Terms—Convolutional Neural Networks, improved Fisher kernel, spectral regression kernel discriminant analysis, feature fusion, scene classification.

I. INTRODUCTION

With the rapid development of satellite imaging techniques and a series of earth observation programs, many choices are now available for collecting high resolution satellite imagery in practical applications. This provides the possibility of extracting high-level Land-Use/Land-Cover (LULC) information for land surface information updating and spatial pattern analysis [1]-[4]. In comparison with LULC information in the pixel domain, high-level information is the semantic abstraction of sub-images, such as scene categories. Therefore, scene classification, as a recognition technique, attempts to extract scene-level semantic information from high spatial resolution (HSR) imagery. However, traditional pixel-based or object-based classification methods cannot accomplish this task because of the semantic gap between low-level features and scene semantics. Indeed, remotely sensed scene classification is still considered an open and challenging task.
Constructing a discriminative feature representation of raw visual data is one of the most necessary steps to bridge the huge gap between raw visual data and its semantic category in almost any computer vision recognition problem, including remotely sensed scene classification. Over the past few years, substantial efforts have been dedicated to developing robust feature representation methods in different domains [5]-[7]. In the early years, most methods were developed directly from raw color or texture features, such as color histograms and histogram moments [5], [8]. Then, mid-level feature coding methods based on visual dictionaries, i.e., Bag-of-Words (BOW) [9], probabilistic latent semantic analysis (PLSA), latent Dirichlet allocation (LDA) [10], [11], and the Fisher kernel vector [12], attracted much attention for many years. These methods all tend to represent images based on a particular number of visual words, for example by counting the occurrence frequency or approximating the probability distribution of the visual words.



Moreover, these methods are applied not only to raw data, but also to higher-level features including SIFT features, sparse coding features, and other spatially dominant features [11], [13], [14]. However, only small improvements over these methods have been achieved in recent years for remote sensing images, due to the specific characteristics of remotely sensed data. In other words, scene classification methods based on low-level or mid-level features perform well on remotely sensed scenes with specific colors or homogeneous structures, but they struggle with scene images that have more complex structures and spatial layouts.
Recently, deep learning has become the new state-of-the-art solution for visual recognition [15], [16]. Many successful applications have been reported for classic recognition problems. Methods based on deep learning have obtained dramatic improvements beyond the previous state-of-the-art records in different domains, such as object detection [17], face and speech recognition [18], [19], action recognition [20], semantic segmentation [21], natural image classification [22], and remote sensing scene classification [23]. In particular, deep Convolutional Neural Networks (CNNs) are acknowledged as the most popular method due to their ability to learn hierarchical abstractions of the input data by encoding it at different layers. Compared with traditional scene classification methods, deep learning methods have achieved far better classification performance in the remote sensing domain [24]-[26]. In early works, fully training a new CNN was preferred for various problems [24], [25]. However, these state-of-the-art results came with several challenges, such as a high computational burden, proneness to overfitting, and the need for a considerable amount of training data [26]. Recent works have demonstrated that existing CNN models pre-trained on large datasets such as ImageNet [27] can be transferred to other recognition tasks [26], [28]. In [26], the authors evaluated and analyzed three strategies for exploiting CNNs, including fully training a new CNN on the target dataset, fine-tuning a pre-trained CNN, and employing a pre-trained CNN as a feature extractor. They demonstrated that pre-trained CNNs used as feature extractors achieved better performance than fully training a new CNN on three remotely sensed scene datasets. Besides, other works [29], [30] aiming at training a new CNN have not obtained better performance than the works [28], [31] employing existing pre-trained CNNs for remotely sensed scene classification. These works also conclude that it is difficult to fully train a new deep CNN model with a few hundred or thousand training samples. According to these previous works and conclusions, using a pre-trained CNN as a feature extractor to construct the feature representation is preferred for remotely sensed scene classification when sufficient training data are not available. This transfer strategy avoids most of the drawbacks of training a new CNN and may be more practical in applications, especially for problems with limited training data. Fortunately, the transfer strategy is applicable to remote sensing images, and several works have achieved competitive results in comparison with the early state-of-the-art results [28]. However, recent works have only explored the features of one or two layers for classification; the remaining layers, with lower-level semantic information, have not been explored.

Therefore, in this work we use a pre-trained CNN as a feature extractor and pay more attention to exploiting the power of mid-level information from more layers in the remote sensing domain. We assume that the mid-level information hidden in each layer is more or less helpful for improving the discrimination of the feature representation. The main contribution of this work is to provide a solution for exploiting more of the benefits of a pre-trained CNN for remotely sensed scene classification. To this end, the main novelty lies in: 1) we present a multiscale improved Fisher kernel coding method for representing the mid-level information from convolutional layers; 2) we propose a feature fusion framework that integrates multi-layer features for classification through a step-by-step dimensionality reduction strategy.
The remainder of this paper is organized as follows. In Section II, related studies on CNN-based methods are introduced and summarized. Section III presents the proposed approach. Details of the experimental results and analysis are presented in Section IV. Finally, conclusions are summarized in Section V.

II. RELATED WORKS

Convolutional Neural Networks (CNNs) have become the most popular deep learning networks for learning visual features in several distinct tasks across different domains, including remote sensing scene classification [23], [24], [32]. In contrast with the classical visual recognition framework, which commonly has two separate steps of feature learning and classification, a CNN can learn features and classifier at once through multi-layer neural networks. Moreover, a multi-layer CNN model can obtain different levels of abstraction of the visual data, ranging from low-level information in the first layers, to more semantic information in intermediate layers, and to high-level information directly connected with categorical attributes in the final layers. Owing to the abundant hierarchical information extracted in this process, CNNs have achieved state-of-the-art performance in many different applications [33]-[35].
A deep CNN is composed of multiple computational blocks, and a block commonly (but not always) consists of three layers: 1) a convolutional layer; 2) an activation layer; 3) a pooling layer. Typically, a complete CNN is composed of one or more such blocks, followed by one or more fully-connected layers and a final classifier layer. Given an input X, which may be a 2-D or 3-D array, and a CNN with n computational blocks, the output of the convolutional layer in the L-th block can be produced by [26]

f_L(X) = \sum_{i=1}^{L} X_i * W_i + b_i   (1)

where X_i, W_i, and b_i represent the input, weights, and bias, respectively, with X_1 = X, and * is the convolution operation. The output of a convolutional layer is regarded as the input of the next computational block after passing through the corresponding activation and pooling layers. The output of the final convolutional layer is fed into the following fully-connected layers and the classifier layer in order.
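As a purely illustrative sketch of one such computational block (a single-channel convolution as in Eq. (1), followed by a ReLU activation and 2x2 max pooling), written in Python/NumPy; the function name and the single-channel setting are simplifications of ours, not part of the original formulation:

```python
import numpy as np
from scipy.signal import convolve2d

def conv_block(x, w, b):
    """One computational block: convolution (Eq. 1), ReLU activation, 2x2 max pooling."""
    z = convolve2d(x, w, mode="valid") + b              # convolutional layer
    a = np.maximum(z, 0.0)                              # activation layer (ReLU)
    h, wd = (a.shape[0] // 2) * 2, (a.shape[1] // 2) * 2
    a = a[:h, :wd].reshape(h // 2, 2, wd // 2, 2)       # non-overlapping 2x2 windows
    return a.max(axis=(1, 3))                           # pooling layer

# toy usage: an 8x8 input map and a 3x3 filter
feature_map = conv_block(np.random.rand(8, 8), np.random.randn(3, 3), b=0.1)
```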

The entire network is trained with back-propagation of a supervised loss function, such as the soft-max loss function [26]:

J(W,b) = -\frac{1}{N} \sum_{i=1}^{N} \left[ y^{(i)} \log h_{W,b}(X_i) + \left(1 - y^{(i)}\right) \log\left(1 - h_{W,b}(X_i)\right) \right]   (2)

where h_{W,b} denotes the entire network; X_i is an input sample and y^{(i)} is its corresponding label; N is the total number of training samples. With the defined cost function, the CNN can be trained to minimize the loss by Stochastic Gradient Descent (SGD) [36] or other optimization algorithms. CNNs provide an end-to-end learning framework that bridges the gap between raw data (e.g., scene images, image pixels, and other signals) and semantic labels, which is a competitive advantage in comparison with previous state-of-the-art methods [37]. Therefore, many CNNs have been designed and trained for different problems, usually with different deep architectures, such as the number and types of layers and their organization [33], [38]. Based on this idea, training a new network is commonly preferred, since it tends to give a specific solution for a given problem. For remote sensing applications, some efforts have focused on vehicle detection in high resolution remote sensing imagery by considering contextual and structural information in the spatial domain extracted by CNNs [24], [39], [40]. Furthermore, more works have applied CNNs to extract spatial or structural information for spectral-spatial classification of hyperspectral images [32], [35], [41], [42]. The other important application is image or scene classification by fully training a deep CNN [23], [29], [30]. In recent years, some famous CNN architectures have been trained on large amounts of data like the ImageNet dataset for computer vision applications, such as AlexNet [33] and GoogLeNet [25]. Training a new network for a specific task gives full control of the architecture and parameters and may yield a more robust network. However, this strategy not only requires designing a new network architecture, but also requires a large amount of data to learn the parameters. Inevitably, drawbacks such as high computational cost and proneness to overfitting appear during the training process.
To alleviate these drawbacks, some strategies attempt to exploit existing deep convolutional networks, including fine-tuning a pre-trained CNN and using a pre-trained CNN as a feature extractor [26], [28]. For instance, Nogueira et al. [26] evaluated different strategies for exploiting the power of existing CNNs and found that the fine-tuning strategy tends to be the best one. Many other works fine-tuning famous pre-trained CNNs have achieved state-of-the-art performance in different applications [33], [34], [43]. When used as a feature extractor, a pre-trained CNN is directly applied to a given problem without tuning the architecture or parameters of the network [26], [28]. In [26], the authors utilized the features of the last fully-connected layer as input for classification and achieved higher accuracy than most previous methods. Hu et al. [28] exploited the features of the final convolutional layer with feature coding methods, used them together with the features of a fully-connected layer for classification, and achieved performance similar to that of the fine-tuned networks in [26]. However, few works have paid attention to exploiting the features of more layers for classification. As a matter of fact, the features of different layers contain abundant information due to their various levels of abstraction of the input data. Specifically, the level of information extracted ranges from low in the shallow layers to high in the deeper layers. In general, the lower-level information contains abundant local structure properties, whereas the higher-level information provides the global layout properties. For scene images, both local structure properties and global layout properties play important roles in building discriminative features. Especially for easily confused classes, the local structure properties of the scene images are more important. Therefore, integrating multi-layer features can fuse information at different levels to better accomplish the classification task.

III. PROPOSED APPROACH

A CNN commonly contains convolutional layers and fully-connected layers. Therefore, a complete CNN can extract hierarchical features at different layers when it is used as a feature extractor. In this section, a feature fusion framework is presented to integrate multi-layer features for scene classification. For this purpose, a multiscale improved Fisher vector coding method is proposed to build mid-level features from convolutional layers, and a step-by-step dimensionality reduction algorithm, PCA/SRKDA, is constructed for multi-layer feature fusion. We first introduce the feature fusion framework proposed in this paper, and then present the particular methods, including multiscale improved Fisher kernel coding and PCA/SRKDA, in turn.

Fig. 1. Proposed scene classification approach using multi-layer features of the pre-trained CNN model. In this figure, conv1-L represent the convolutional layers, MIFV1-L are the features extracted from corresponding convolutional layers by multiscale improved Fisher kernel coding method; FC1-n represent the fully-connected layers. RMIFV1-L and RFC1-n represent the reduced features of MIFV1-L and FC1-n, respectively.

A. Multi-Layer Features Fusion

In order to utilize the multi-layer features of the pre-trained model, we must code the convolutional features and generate linear vector representations of them, like the features output by the fully-connected layers. In this paper, we propose a multiscale improved Fisher kernel coding method to code the convolutional features. Therefore, there are two kinds of high dimensional feature vectors: the multiscale improved Fisher vectors generated from the different convolutional layers and the feature vectors output by the different fully-connected layers. To fuse the different high dimensional feature vectors into a vector of low dimensionality, a progressive reduction strategy, namely PCA/SRKDA, is first applied to map each high dimensional feature vector into a low dimensional space. Then, all reduced features are combined into one feature vector as the mid-level feature representation of a scene image. Finally, a linear SVM classifier is employed for classification. The detailed classification process is illustrated in Fig. 1.
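The overall fusion step can be summarized with the following sketch (Python); the reducer objects and the layer-wise feature matrices are placeholders for the components described in Sections III-B and III-C, and all names are ours:

```python
import numpy as np
from sklearn.svm import LinearSVC

def fuse_multilayer_features(per_layer_features, reducers):
    """per_layer_features: list of (n_images, d_layer) arrays, one per selected layer
    (MIFK vectors for convolutional layers, L2-normalized vectors for FC layers).
    reducers: one fitted PCA/SRKDA reducer per layer (see Section III-C)."""
    reduced = [reduce(feats) for reduce, feats in zip(reducers, per_layer_features)]
    return np.hstack(reduced)            # fused representation, one row per image

# X_train = fuse_multilayer_features(train_layer_feats, fitted_reducers)
# clf = LinearSVC(C=100).fit(X_train, y_train)   # final linear SVM classifier
```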

B. Multiscale Improved Fisher Kernel Coding

The Fisher kernel is a generic framework that combines the benefits of generative and discriminative approaches [44]. In this framework, an image or other matrix data can be represented by a Fisher vector, which is regarded as a mid-level feature representation of the image. Generally, a CNN captures the spatial layout profile of a scene image at a fixed size, without paying attention to different observation scales, whereas for the Fisher kernel coding method, as a global and orderless technique, scale greatly influences the density distribution of a given scene image. To obtain a multiscale Fisher vector of a scene image, a series of images at different observation scales is produced by a pyramid algorithm and fed into the CNN to extract multiscale convolutional features, which are then stacked and encoded by the Fisher kernel.
To generate the multiscale images, the Gaussian pyramid method is employed. An image is repeatedly smoothed with a Gaussian filter and subsampled into a series of images at different scales. Given a scene image set I = {I_1, I_2, ..., I_M} ∈ ℝ^{m×m×d}, M is the number of images, m is the image size, and d is the number of channels. The multiscale representation of an image I_i is denoted as {I_i^l}_{l=0}^{n}, with I_i^0 = I_i, where l is the scale level. The Gaussian smoothing and subsampling processes between two adjacent levels are defined as [45]

I_i^{l+1} = G * I_i^l   (3)

I_i^{l+1} = \mathrm{REDUCE}(I_i^{l+1})   (4)

where G denotes the Gaussian filter coefficients, * is the convolution operation, and REDUCE(·) is the subsampling operation. After obtaining the multiscale images, they are fed into the CNN to extract the corresponding convolutional features at the various scales. The convolutional features are extracted by the pre-trained CNN as

Z_L^{l,i} = f_L(I_i^l)   (5)

where Z_L^{l,i} is the convolutional output of I_i^l at the L-th layer; f(·) is the pre-trained CNN model and f_L(·) contains the first L convolutional layers of f(·), i.e., f_L(·) = f(·; W_1, W_2, ..., W_L). Naturally, the multiscale convolutional features at the L-th layer of a given scene image I_i can be represented as {Z_L^{l,i}}_{l=0}^{n}.
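As an illustration of Eqs. (3)-(5), the sketch below builds a three-level Gaussian pyramid and pushes each level through the convolutional part of a pre-trained network. It uses torchvision's VGG-16 purely as a stand-in for the MatConvNet models used in this paper, and the truncation index `upto_layer` is an arbitrary example value:

```python
import numpy as np
import torch
import torchvision.models as models
from scipy.ndimage import gaussian_filter

def gaussian_pyramid(img, levels=3, sigma=1.0):
    """img: HxWx3 float array; returns [I^0, I^1, ...] following Eqs. (3)-(4)."""
    pyramid = [img]
    for _ in range(levels - 1):
        blurred = gaussian_filter(pyramid[-1], sigma=(sigma, sigma, 0))  # G * I^l
        pyramid.append(blurred[::2, ::2, :])                             # REDUCE(.)
    return pyramid

cnn = models.vgg16(weights="IMAGENET1K_V1").features.eval()  # convolutional part, f_L(.)

def multiscale_conv_features(img, upto_layer=23):
    """Z_L^{l,i} = f_L(I_i^l) for every pyramid level l (Eq. 5), returned as a stack
    of D-dimensional local descriptors (one per spatial position and scale)."""
    descriptors = []
    with torch.no_grad():
        for level in gaussian_pyramid(img):
            x = torch.from_numpy(np.ascontiguousarray(level.transpose(2, 0, 1)))
            z = cnn[:upto_layer](x.float().unsqueeze(0))   # output of the chosen conv layer
            descriptors.append(z.squeeze(0).reshape(z.shape[1], -1).T.numpy())
    return np.vstack(descriptors)
```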

Let X = {Z_L^{l,i}}_{l=0}^{n} = {x_1, x_2, ..., x_T} ∈ ℝ^D be the set of T D-dimensional multiscale convolutional features, and let λ = {w_i, μ_i, Σ_i, i = 1, 2, ..., k} be the parameters of a Gaussian mixture model (GMM) fitting the distribution of the features from the multiscale convolutional feature sets {Z_L^{l,i}}_{l=0}^{n}, i = 1, 2, ..., M, where w_i, μ_i, and Σ_i are respectively the mixture weight, mean vector, and covariance matrix of each Gaussian component, and k is the number of components of the GMM. It is assumed that the generation process of X can be modeled by a probability density function p(x|λ) with parameters λ. In this case, X can be described by the gradient vector [12], [44]:

G_λ^X = \frac{1}{T} \sum_{t=1}^{T} \nabla_λ \log p(x_t | λ)   (6)

p(x_t | λ) = \sum_{i=1}^{k} w_i p_i(x_t | λ)   (7)

p_i(x_t | λ) = \frac{\exp\{-\frac{1}{2}(x_t - μ_i)^T Σ_i^{-1}(x_t - μ_i)\}}{(2π)^{D/2} |Σ_i|^{1/2}}   (8)

Accordingly, the Fisher vector ɡ_λ^X of X is defined as

ɡ_λ^X = F_λ^{-1/2} G_λ^X   (9)

where F_λ is the Fisher information matrix of the GMM λ. Following the mathematical derivations in [44], the D-dimensional gradients with respect to the mean μ_i and standard deviation σ_i are given by

ɡ_{μ,i}^X = \frac{1}{T\sqrt{w_i}} \sum_{t=1}^{T} γ_t(i) \left(\frac{x_t - μ_i}{σ_i}\right)   (10)

ɡ_{σ,i}^X = \frac{1}{T\sqrt{2w_i}} \sum_{t=1}^{T} γ_t(i) \left[\left(\frac{x_t - μ_i}{σ_i}\right)^2 - 1\right]   (11)

γ_t(i) = \frac{w_i p_i(x_t | λ)}{\sum_{j=1}^{k} w_j p_j(x_t | λ)}   (12)

where γ_t(i) is the occupancy probability of x_t with respect to the i-th Gaussian component, σ_i is the standard deviation, and σ_i^2 = diag(Σ_i). The Fisher vector contains three gradients in theory; the one with respect to the mixture weight w_i is discarded because it contains little information useful for the recognition task. Therefore, only the two gradients above are used to build the Fisher vector representation. As a result, the linear vector representation of a Fisher vector can be denoted as Φ(X) = {ɡ_{μ,1}^X, ɡ_{σ,1}^X, ..., ɡ_{μ,k}^X, ɡ_{σ,k}^X}^T, where k is the component number of the GMM. Moreover, two ideas are used to improve the performance of the representation for the classification task [44]. In this paper, the first one, power normalization, is applied to each dimension of Φ(X) through the following function:

f(z) = \mathrm{sign}(z) |z|^α   (13)

where 0 ≤ α ≤ 1 is the normalization parameter, commonly set to 0.5. Then the vector Φ(X) is L2-normalized.
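A compact sketch of this improved Fisher kernel coding (Eqs. (6)-(13)), using scikit-learn's GaussianMixture with diagonal covariances; the function names and the per-image calling convention are illustrative choices of ours:

```python
import numpy as np
from sklearn.mixture import GaussianMixture

def fit_gmm(descriptor_stack, k=100):
    """Fit lambda = {w_i, mu_i, Sigma_i} on multiscale conv features of training images."""
    return GaussianMixture(n_components=k, covariance_type="diag").fit(descriptor_stack)

def improved_fisher_vector(X, gmm, alpha=0.5):
    """X: (T, D) multiscale conv features of one image; returns Phi(X) of length 2*k*D."""
    T = X.shape[0]
    w, mu = gmm.weights_, gmm.means_                   # (k,), (k, D)
    sigma = np.sqrt(gmm.covariances_)                  # (k, D) standard deviations
    gamma = gmm.predict_proba(X)                       # occupancy probabilities, Eq. (12)
    diff = (X[:, None, :] - mu[None, :, :]) / sigma    # (T, k, D)
    g_mu = (gamma[:, :, None] * diff).sum(0) / (T * np.sqrt(w))[:, None]                 # Eq. (10)
    g_sigma = (gamma[:, :, None] * (diff**2 - 1)).sum(0) / (T * np.sqrt(2 * w))[:, None]  # Eq. (11)
    phi = np.concatenate([np.concatenate([g_mu[i], g_sigma[i]]) for i in range(len(w))])
    phi = np.sign(phi) * np.abs(phi) ** alpha          # power normalization, Eq. (13)
    return phi / (np.linalg.norm(phi) + 1e-12)         # L2 normalization
```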

C. PCA/SRKDA Dimensionality Reduction

The feature representations extracted by the CNN are always characterized by sparse feature vectors of very high dimensionality. However, stacking the above features may

yield redundant information, making it difficult to select an optimal combination of features. Furthermore, the high dimensionality of the stacked features provokes the curse of dimensionality or overfitting, thus resulting in lower classification accuracy. We therefore use dimensionality reduction techniques to alleviate the curse of dimensionality. An unsupervised or a supervised method could be employed to map the high dimensional data into a low dimensional space in one step. However, an unsupervised method often performs poorly in preserving the discriminative structure of the data, while a supervised method is more prone to overfitting when the dimensionality of the mapped data is much greater than the number of classes. Therefore, a step-by-step strategy is proposed to alleviate the drawbacks of using only an unsupervised or only a supervised method. The dimensionality reduction process consists of two progressive reduction operations, as shown in Fig. 2. Here, principal component analysis (PCA) and a supervised method called spectral regression kernel discriminant analysis (SRKDA) [46] are used.

Fig. 2. The process of feature dimensionality reduction by the PCA/SRKDA method. X0 is the input data; Θ1 and Θ2 represent the unsupervised and supervised dimensionality reduction models, respectively; X1 are the mapped features of X0 under Θ1; X2 are the mapped features of X1 under Θ2; Label represents the class labels of all samples in X1.

PCA: PCA is widely used as an unsupervised linear dimensionality reduction method. It is an efficient method for preprocessing very high dimensional data without any preset parameter. In this paper, the improved Fisher vector generated from a convolutional layer, or the feature vector output directly by a fully-connected layer, always has high dimensionality. Here, we apply PCA in the first reduction step not only for its efficiency but also to prevent overfitting. In this step, a certain number of principal components are selected as the reduced features of the original input features. Given a feature set X_0 consisting of Fisher vectors or fully-connected feature vectors, the PCA dimensionality reduction is

X_1 = Θ_1 X_0   (14)

where Θ_1 is the projection coefficient matrix of the first n principal components and X_1 is the reduced feature set of dimensionality n.

SRKDA: Linear Discriminant Analysis (LDA) is a popular method for extracting low-dimensional features that preserve class separability. Kernel Discriminant Analysis (KDA) is the kernel version of LDA, aimed at highly nonlinear distributed data. SRKDA is an improved version of KDA that obtains the KDA projective functions via the spectral regression technique [47]. It has been demonstrated that SRKDA is more competitive than the original KDA in terms of computational cost and performance [46]. Suppose we have a feature set of m n-dimensional samples X_1 = {x_1, x_2, ..., x_m} ∈ ℝ^n, belonging to C classes. The objective function of KDA is as follows [48]:

α_{opt} = \arg\max_{α} \frac{α^T K W K α}{α^T K K α}   (15)

where α = [α_1, ..., α_m]^T is the eigenvector corresponding to the following eigen-problem:

K W K α = λ K K α   (16)

where K is the kernel matrix (K_{ij} = κ(x_i, x_j)) and κ(·) is the kernel function; W is an m × m matrix with W_{ij} = 1/m_c if x_i and x_j both belong to the c-th class and W_{ij} = 0 otherwise; m_c is the total number of samples in the c-th class. To solve the above eigen-problem, eigen-decomposition or regularization based methods are commonly used to obtain the optimal eigenvectors α for the original KDA [48], [49]. For SRKDA, however, the eigen-problem can be solved by solving the following problems [46], [47]:

W y = λ y   (17)

K α = y   (18)

K W K α = K W y = K λ y = λ K y = λ K K α   (19)

Thus, y is an eigenvector of the eigen-problem in Eq. (17). It has been proved in [47] that this eigen-problem is trivial and the eigenvectors can be obtained directly instead of solving the eigen-problem routinely. Specifically, y = {y_1, ..., y_{C-1} | y_c^T e = 0, y_i^T y_j = 0, i ≠ j}, with e = [1, 1, ..., 1]^T. The eigenvectors y can be obtained by the Gram-Schmidt process [47]. After obtaining the eigenvectors y, the objective α can be solved in a linear manner. When K is non-singular, the unique α can be obtained by

α = K^{-1} y   (20)

When K is singular, α is approximated by solving the following linear system:

(K + δI) α = y   (21)

where I is the identity matrix and δ ≥ 0 is the regularization parameter. Since y includes C-1 eigenvectors, the α solved by SRKDA is an m × (C-1) matrix, α = [α_1, ..., α_{C-1}]. Therefore, letting Θ_2 = α be the mapping coefficient matrix, a sample x_i can be embedded into the (C-1)-dimensional subspace by

X_1 → X_2 : Θ_2^T K(:, x_i)   (22)

where K(:, x_i) = [κ(x_1, x_i), ..., κ(x_m, x_i)]^T.
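A minimal sketch of this two-step reduction (PCA of Eq. (14), then SRKDA via Eqs. (17)-(22)) in Python/scikit-learn. The spectral-regression step is only loosely reproduced here (QR orthogonalization of centered class-indicator vectors in place of the Gram-Schmidt procedure of [47]), and all function and parameter names are ours:

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.metrics.pairwise import rbf_kernel

def srkda_fit(X, labels, delta=5e-4, gamma=None):
    """Step 2 of Fig. 2: fit SRKDA on PCA-reduced training features X (m x n)."""
    m = X.shape[0]
    classes = np.unique(labels)
    # response vectors: centered class indicators, orthogonalized (spectral regression step)
    Y = np.stack([(labels == c).astype(float) for c in classes], axis=1)
    Y -= Y.mean(axis=0, keepdims=True)
    Q, _ = np.linalg.qr(Y)
    y = Q[:, :len(classes) - 1]                        # C-1 response vectors
    K = rbf_kernel(X, X, gamma=gamma)                  # kernel matrix K_ij = kappa(x_i, x_j)
    alpha = np.linalg.solve(K + delta * np.eye(m), y)  # Eq. (21): (K + delta*I) alpha = y
    return alpha

def srkda_transform(alpha, X_train, X_new, gamma=None):
    """Embed samples into the (C-1)-dimensional subspace, Eq. (22)."""
    return rbf_kernel(X_new, X_train, gamma=gamma) @ alpha

# step 1 of Fig. 2: PCA to 200 dimensions, fitted on the training features only
# pca = PCA(n_components=200).fit(F_train)
# alpha = srkda_fit(pca.transform(F_train), labels_train, delta=0.0005)
# X2_test = srkda_transform(alpha, pca.transform(F_train), pca.transform(F_test))
```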

IV. EXPERIMENTS AND ANALYSIS

In this section, the two real very high spatial resolution scene image datasets used to evaluate the proposed approach are first introduced. Then, the seven pre-trained CNN models and the parameter settings are described briefly. Finally, the scene classification results are discussed.

A. Experimental Scene Image Datasets

The first dataset, the UC Merced (UCM) land use dataset, sometimes also referred to as the Land-Use/Land-Cover (LULC) dataset in the literature, was manually extracted from large aerial orthoimagery covering various urban areas around the USA, with a pixel resolution of one foot [9]. The scene images are divided into 21 land use classes, and each class is composed of 100 images of size 256×256×3. This dataset is widely used for evaluating scene classification or object detection methods in the remote sensing community. In this dataset, some categories contain multi-label scene images, such as freeway, runway, buildings, and sparse residential. Moreover, some categories share similar structural information, such as buildings and storage tanks. Example images of all classes are shown in Fig. 3.


Fig. 3. Example images of the 21 land-use categories for UCM dataset. An example image of each category is selected to display. (a) Agricultural, (b) Airplane, (c) Baseball Diamond, (d) Beach, (e) Buildings, (f) Chaparral, (g) Dense Residential, (h) Forest, (i) Freeway, (j) Golf Course, (k) Harbor, (l) Intersection, (m) Medium Density Residential, (n) Mobile Home Park, (o) Overpass, (p) Parking Lot, (q) River, (r) Runway, (s) Sparse Residential, (t) Storage Tanks, (u) Tennis Courts.

The second public dataset, the WHU_RS dataset, was collected from Google Earth satellite imagery [50]. This dataset contains 1005 scene images in 19 classes: Airport, Beach, Bridge, Commercial, Desert, Farmland, Football Field, Forest, Industrial, Meadow, Mountain, Park, Parking Lot, Pond, Port, Railway Station, Residential, River, and Viaduct. Each class has approximately 50 images of 600×600 pixels with 3 channels. In this dataset, many scene images vary greatly in resolution, scale, and orientation. Moreover, some classes have complex layouts, such as port, railway station, and viaduct, as displayed in Fig. 4. It is therefore a more challenging benchmark for evaluating remote sensing scene classification.

Fig. 4. Example images of the WHU_RS dataset: (a) Port, (b) Railway Station, (c) Viaduct.

B. Pre-trained CNN Models and Experimental Setup

The strategy of using a pre-trained CNN model as a feature extractor is straightforward and convenient, since no retraining or tuning is necessary. Moreover, one only needs to select a pre-trained model and the layers to be used, and then extract the deep features of the selected layers to construct the discriminative features for classification. In this paper, seven pre-trained CNN models, AlexNet, CaffeNet, VGG-M, VGG-S, VGG-F, and the much deeper networks VGG-VD16 and VGG-VD19, are applied as feature extractors. These models were all trained for classification on the ImageNet Large Scale Visual Recognition Challenge (ILSVRC) data. AlexNet, developed by Krizhevsky et al. [33], was the winner of the ILSVRC-2012 competition. The AlexNet model consists of five convolutional layers and two fully-connected layers; some of the convolutional layers are followed by max-pooling layers, and the final layer of the network is a soft-max classifier, which can also be considered a fully-connected layer. Therefore, the AlexNet model contains three fully-connected layers in total. CaffeNet has a similar architecture to AlexNet, except for a different order of the pooling and normalization layers. CaffeNet [51] is also trained with the open-source deep learning framework Caffe on the ILSVRC-2012 training dataset, and obtains results close to the performance of AlexNet. VGG-M, VGG-S, and VGG-F [52] were all developed on the open-source Caffe framework and have a similar architecture with five convolutional layers and three fully-connected layers (including the soft-max layer); the differences lie in the stride and pooling window settings of some convolutional layers. VGG-VD16 and VGG-VD19, introduced in [38], were the winners of the ILSVRC-2014 competition in the localization and classification tracks. In contrast to the aforementioned models, VGG-VD16 and VGG-VD19 have more convolutional layers, consisting of thirteen and sixteen convolutional layers respectively, plus three fully-connected layers (counting the soft-max layer).
According to the proposed approach, several parameters have to be assigned during the whole process. In the experiments, all convolutional and fully-connected layers are selected for feature extraction for the AlexNet, CaffeNet, VGG-M, VGG-S, and VGG-F models, because they have only five convolutional layers; for VGG-VD16 and VGG-VD19, the last five convolutional layers and all three fully-connected layers are selected. Then, three scales of each image are generated by the Gaussian pyramid method for building the multiscale improved Fisher vector, and the number of Gaussian components in the GMM is empirically set to 100 [28]. In the third step, each high dimensional feature vector is reduced to 200 dimensions by PCA, the regularization parameter δ in the SRKDA method is set to 0.0005, and the RBF kernel function is applied to build the kernel feature space. At the final classification stage, all samples of the input dataset are randomly divided into five equal parts, with four parts used as training samples and the remaining part as the testing set each time, following common practice in previous studies. The classification experiment is repeated five times, and the mean accuracy is reported as the final accuracy assessment. The overall accuracy (OA) is employed as the accuracy measure in each repetition. Because of the very high dimensionality of the scene image feature representations, the public LIBLINEAR library [53] is used to train the linear SVM classifier. The L2-regularized, L2-loss solver is

selected for training, and the penalty parameter C is empirically set to 100. All experiments are performed on a 64-bit Intel Xeon E5-2630 machine with a 2.4 GHz clock and 64 GB of RAM, and the proposed method is implemented in MATLAB 2016a. The open source library VLFeat [54] is employed to implement the feature coding methods, and the MATLAB toolbox MatConvNet [55] is used to extract the CNN features. The pre-trained CNN models are available on the home page of MatConvNet (www.vlfeat.org/matconvnet/).
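The evaluation protocol described above (five repeated random 80/20 splits, an L2-regularized, L2-loss linear SVM with C = 100, and mean overall accuracy) can be sketched as follows; scikit-learn's LinearSVC is used here only as a stand-in for LIBLINEAR:

```python
import numpy as np
from sklearn.model_selection import StratifiedShuffleSplit
from sklearn.svm import LinearSVC

def repeated_overall_accuracy(features, labels, train_fraction=0.8, repeats=5, C=100):
    """Mean and standard deviation of overall accuracy over repeated random splits."""
    splitter = StratifiedShuffleSplit(n_splits=repeats, train_size=train_fraction)
    scores = []
    for train_idx, test_idx in splitter.split(features, labels):
        clf = LinearSVC(C=C, penalty="l2", loss="squared_hinge")   # L2-regularized, L2-loss
        clf.fit(features[train_idx], labels[train_idx])
        scores.append(clf.score(features[test_idx], labels[test_idx]))
    return float(np.mean(scores)), float(np.std(scores))
```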


C. Experimental Results and Analysis

In this subsection, we show the results obtained when the proposed approach is applied to the UCM dataset. For comparative analysis, seven pre-trained CNN models are used in the proposed approach to implement scene classification. In addition, we evaluate the performance of the features from each convolutional or fully-connected layer in scene classification. The deep features of the convolutional layers are represented by the multiscale improved Fisher kernel coding method described above, and the deep features of the fully-connected layers are L2-normalized. Then, the feature vectors are fed directly into the linear SVM classifier without dimensionality reduction.

TABLE I
RESULTS OF PROPOSED METHOD USING VARIOUS PRE-TRAINED CNN MODELS FOR UCM DATASET

Pre-trained CNN model    Classification accuracy (%)    Running time (s)
AlexNet                  98.29 ± 0.43                   5247
CaffeNet                 98.05 ± 0.46                   5209
VGG-M                    98.43 ± 0.73                   8382
VGG-S                    98.33 ± 0.61                   8138
VGG-F                    98.05 ± 0.72                   4435
VGG-VD16                 98.57 ± 0.34                   14762
VGG-VD19                 97.67 ± 0.62                   15794

Table I presents the experimental results on the UCM dataset obtained by the proposed approach. Among the different pre-trained CNN models, VGG-VD16 achieves the highest accuracy. For the other six models, AlexNet, VGG-M, and VGG-S obtain very similar results, CaffeNet and VGG-F achieve the same average accuracy, and VGG-VD19 performs worst among these models. In terms of accuracy stability, VGG-M, VGG-S, VGG-F, and VGG-VD19 have slightly worse performance. Overall, the proposed method achieves competitive accuracies (more than 97%) whichever pre-trained model is employed.
We also report the running times of our proposed method with the different CNN models in Table I. It can be seen that the number of layers of the CNN model has a significant influence on the running time of the method, because the most time-consuming processes are the extraction and construction of mid-level features from the convolutional layers. The computational time of the fusion process by PCA/SRKDA and of the classification is short. In general, completing the scene classification task within a few hours is an acceptable efficiency for a CNN based method.
For further analysis, we show the classification performance using features from each convolutional or fully-connected layer, and compare it with the results of our fusion strategy in Fig. 5. We can notice that the performance using the representation features from only one convolutional layer or fully-connected layer is worse than the results produced by the fused features of multiple layers. It is also important to highlight that the lower convolutional layers generally perform worse than the deeper convolutional layers; for VGG-VD16 and VGG-VD19, whose selected convolutional layers are the last five, the performances differ little. Another aspect to highlight is that the deeper convolutional layers (e.g., the last convolutional layer) perform better than the fully-connected layers. This indicates that the convolutional layers have great potential for building more discriminative features for classification, owing to their ability to take into account the rotation and scale of scene images.

Fig. 5. Classification performances of different layers and multi-layer fusion for the UCM dataset. Labels conv1-5 represent the convolutional layers in order for AlexNet, CaffeNet, VGG-M, VGG-S and VGG-F, and the last five convolutional layers in order for VGG-VD16 and VGG-VD19; labels fc1-3 represent the fully-connected layers in order; label fused represents the multi-layer fusion.

As a matter of fact, all used layers achieve good results in terms of accuracy (more than 80%), but large differences also exist among them; this is a precondition for fusing the features of all layers to improve the performance. Naturally, the features fused by our proposed method achieve improved results for all seven pre-trained models.



Fig. 6. Classification performances under 30%, 40% and 50% of all samples used as training samples for UCM dataset.

The number of training samples is always a key factor to be considered when evaluating a classification approach. A good classification approach should achieve competitive performance even with fewer training samples. Based on this idea, we also assess the performance of our proposed method with a decreased number of training samples. For this purpose, we randomly select 30%, 40%, and 50% of all samples as training samples, respectively, and use the rest as the testing set. The final accuracy is the average of five repeated experiments. From the results displayed in Fig. 6, a high classification accuracy (more than 93%) is obtained even with fewer training samples, and the accuracy increases with a growing number of training samples, exceeding 97% when 50% of all samples are used for training. This is a substantial improvement for the UCM dataset compared with previously reported works. Also, the classification performance across the repeated trials is stable.
The same evaluation scheme is executed for the WHU_RS dataset. Similarly, the average results of five repeated experiments are shown in Table II. Surprisingly, better results are obtained even for this more challenging dataset. The highest accuracy exceeds 99% when the VGG-VD16 and VGG-VD19 models are applied to build the fused features. Moreover, the other five models also achieve competitive accuracies. These results demonstrate that the proposed multi-layer feature fusion framework is suitable for remotely sensed scene classification.

Fig. 7. Classification performances of different layers and multi-layer fusion for WHU_RS dataset.


TABLE II
RESULTS OF PROPOSED METHOD USING VARIOUS PRE-TRAINED CNN MODELS FOR WHU_RS DATASET

Pre-trained CNN model    Classification accuracy (%)    Running time (s)
AlexNet                  98.47 ± 1.02                   3385
CaffeNet                 98.47 ± 0.72                   3091
VGG-M                    98.67 ± 0.58                   8259
VGG-S                    98.78 ± 0.46                   7500
VGG-F                    98.27 ± 0.77                   2614
VGG-VD16                 99.08 ± 0.43                   17139
VGG-VD19                 99.18 ± 0.28                   18718

Comparing these results with the performance obtained using only the features of a single convolutional or fully-connected layer, a slight difference arises in this experiment. From Fig. 7, we can see that the results produced by the features of the deeper convolutional layers (e.g., conv4 and conv5) are very close to the results of the fused features, which indicates that the fusion strategy does not bring a further improvement in classification accuracy here. This result may be caused by the imbalance between training and testing samples: only a small number of testing samples remain after selecting 80% of the samples as training data for the WHU_RS dataset. Nevertheless, this result also demonstrates that the convolutional layers play an important role in our proposed method; the features extracted from the convolutional layers are more discriminative than the features of the fully-connected layers when the pre-trained CNN models are used as feature descriptors.

Fig. 8. Classification performances of different layers and multi-layer fusion under 50% of all samples used as training samples for WHU_RS dataset.

For further comparison, we evaluate the performance of each layer and of the fusion strategy when using fewer training samples. In this experiment, the samples of the WHU_RS dataset are divided into two equal parts: one part is used for training and the other for testing. As displayed in Fig. 8, curves similar to those of the UCM dataset are produced when the number of training samples is decreased. There is a noticeable improvement when using the fused features for most of the pre-trained models. Besides, an impressive accuracy (98.9%) is still achieved by the VGG-VD19 model even with fewer training samples. This experiment shows that the fusion strategy proposed in this paper performs better when the training samples are limited.


D. Comparison with Other Feature Coding Methods

A multiscale improved Fisher kernel coding (MIFK) method is proposed in this paper to build mid-level feature representations from the convolutional layers. To evaluate this coding method, three other coding methods, BOW, locality-constrained linear coding (LLC) [56], and the vector of locally aggregated descriptors (VLAD) [57], are employed for comparison with the MIFK method. They are evaluated by building mid-level feature representations from the different convolutional layers and conducting classification with these mid-level features. During implementation, the codebooks of BOW, LLC, and VLAD are all learned by K-means clustering, with codebook sizes empirically set to 1000, 1000, and 100 for the respective methods. The same classification scheme is applied to the four methods: 80% of the samples are selected as training samples and the rest are used for testing. The final accuracy is the average of five repeated experiments.
As shown in Fig. 9, the MIFK method performs best across all classification schemes for the UCM dataset. It has obvious advantages in building mid-level feature representations, especially for the AlexNet, CaffeNet, VGG-M, VGG-S, and VGG-F models, in comparison with the other feature coding methods. Among these methods, BOW obtains the worst results for each convolutional layer, while LLC and VLAD achieve close classification accuracies in the last four convolutional layers for all CNN models. Another noticeable characteristic is that the performances of these methods differ markedly in the earlier convolutional layers. For VGG-VD16 and VGG-VD19 this characteristic is not obvious, because only the last five convolutional layers are selected to extract features in this work. Overall, the MIFK method proposed in this paper is a better choice for feature representation of convolutional layers according to the comparisons with the other feature coding methods.
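For reference, a bare-bones version of the VLAD baseline compared against here (K-means codebook, hard assignment, residual aggregation, L2 normalization); this follows the generic formulation of [57] rather than the exact implementation used in these experiments:

```python
import numpy as np
from sklearn.cluster import KMeans

def vlad_encode(descriptors, kmeans):
    """descriptors: (T, D) local conv features of one image; kmeans: fitted codebook."""
    centers = kmeans.cluster_centers_                   # (K, D) visual words
    assignment = kmeans.predict(descriptors)            # hard assignment to nearest word
    v = np.zeros_like(centers)
    for k in range(centers.shape[0]):
        members = descriptors[assignment == k]
        if len(members):
            v[k] = (members - centers[k]).sum(axis=0)   # aggregate residuals per word
    v = v.ravel()
    return v / (np.linalg.norm(v) + 1e-12)              # L2 normalization

# codebook = KMeans(n_clusters=100).fit(training_descriptors)
# vlad_vector = vlad_encode(conv_descriptors_of_one_image, codebook)
```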

The same comparison experiment is also conducted on the WHU_RS dataset, and the performances are shown in Fig. 10. The findings are similar to those of the previous experiment: the MIFK method achieves the best results and the BOW method performs worst for the seven CNN models. However, the highest accuracies obtained by the MIFK, LLC, and VLAD methods are very close. This phenomenon may again be caused by the imbalance between training and testing samples for the WHU_RS dataset; all methods obtain a high classification accuracy when only a few samples are used to test their performance.

Fig. 10. Classification performances of different feature coding methods for WHU_RS dataset.

E. Influence of the Fully-Connected Layers and of Combining Features from Different CNN Models

There are three fully-connected layers in each of the seven pre-trained CNN models applied in this study. Most previous works select the first fully-connected layer to extract features of scene images, whereas all three fully-connected layers are employed in our proposed method. Therefore, it is necessary to assess the influence of selecting different fully-connected layers. Table III shows the performance when the features extracted from the five convolutional layers and a single fully-connected layer are fused by our proposed method. From the accuracy reports, there is little change among the three fully-connected layers under our proposed fusion strategy. This indicates that no single fully-connected layer is clearly the best choice when only one fully-connected layer is employed in our fusion framework. Comparing with the results obtained when all three fully-connected layers are employed, as shown in Table I and Table II, the accuracies improve when three fully-connected layers are selected for most pre-trained CNN models. However, VGG-VD19 performs better when only one fully-connected layer is utilized for the UCM dataset, and the same holds for VGG-VD16 on the WHU_RS dataset. Overall, the results demonstrate that each fully-connected layer has the potential to improve the discrimination of the feature representation for most pre-trained CNN models used in this work.
The main focus of this work is exploring the potential of multiple layers of a single CNN model.

Fig. 9. Classification performances of different feature coding methods for UCM dataset.


However, the cooperation of different CNN models also has the ability to improve the discrimination of the feature representation, owing to their different network architectures. Here, we only consider the combination of two different CNN models. With this idea, we concatenate the fused features from two different CNN models as the final feature representation and then perform classification based on these combined features. The experimental results are displayed in Fig. 11 and Fig. 12. In these figures, the results obtained using only one CNN model are displayed on the diagonal and highlighted in white; results highlighted in green indicate that the classification accuracy is not improved by combining the features extracted from two CNN models, and the improved results are highlighted in yellow.
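The two-model combination evaluated here is a simple concatenation of the fused representations produced by each network before the linear SVM; a trivial sketch with illustrative names:

```python
import numpy as np
from sklearn.svm import LinearSVC

def combine_two_models(fused_a, fused_b):
    """fused_a, fused_b: (n_images, d_a) and (n_images, d_b) fused features of two CNNs."""
    return np.hstack([fused_a, fused_b])

# X_train = combine_two_models(alexnet_fused_train, vggvd16_fused_train)
# clf = LinearSVC(C=100).fit(X_train, y_train)
```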

From Fig. 11 and Fig. 12, it can be seen that the classification accuracy is improved in most experiments when features from two CNN models are combined. The highest accuracy (98.81%) for the UCM dataset is obtained by combining the features extracted by the AlexNet and VGG-VD16 models. For the WHU_RS dataset, the highest accuracy (99.49%) is also achieved by the same combination. It is also important to highlight that a CNN model with few convolutional layers, such as AlexNet, CaffeNet, VGG-M, VGG-S, or VGG-F, can achieve better performance when it is combined with a deeper CNN model like VGG-VD16 or VGG-VD19. However, the improvement from this simple combination strategy is less than 1% in terms of accuracy. Therefore, a more efficient fusion method is worth investigating in future work.

TABLE III
RESULTS OF PROPOSED METHOD BY FUSING FIVE CONVOLUTIONAL LAYERS AND A FULLY-CONNECTED LAYER

                        UCM                                        WHU_RS
Pre-trained CNN model   conv1-5+fc1  conv1-5+fc2  conv1-5+fc3      conv1-5+fc1  conv1-5+fc2  conv1-5+fc3
AlexNet                 97.95±0.43   97.95±0.43   97.71±0.46       98.16±1.33   98.27±1.12   98.06±1.11
CaffeNet                97.76±0.46   97.81±0.31   97.76±0.36       97.86±1.82   97.76±1.72   98.06±1.59
VGG-M                   98.05±0.59   98.14±0.57   97.90±0.64       98.57±0.67   98.57±0.67   98.47±0.81
VGG-S                   97.90±0.59   97.71±0.82   97.76±0.82       98.16±0.77   98.06±0.67   98.96±0.88
VGG-F                   97.86±0.56   97.90±0.62   97.90±0.57       98.57±0.67   98.57±0.67   98.47±0.81
VGG-VD16                98.10±0.61   98.29±0.52   98.29±0.52       99.39±0.43   99.29±0.58   99.49±0.36
VGG-VD19                98.19±0.57   97.90±0.66   98.00±0.49       98.16±0.77   98.06±0.67   97.96±0.88

Fig. 11. Classification performances of combining features from two CNN models for the UCM dataset. A classification accuracy highlighted in green is not improved by combining features from two CNN models; the improved results are highlighted in yellow.

Fig. 12. Classification performances of combining features from two CNN models for the WHU_RS dataset. A classification accuracy highlighted in green is not improved by combining features from two CNN models; the improved results are highlighted in yellow.


F. Comparison with the State-of-the-Art Methods

Many works have summarized the performance of previous methods on the UCM dataset [26], [28]. Here, some of them are summarized for a comparative analysis. In particular, these studies use the same or a similar experimental protocol, i.e., 80% of the samples are used as training samples and the rest as testing data. As shown in Table IV, all CNN based methods achieve more competitive results (more than 90%) than the traditional methods (i.e., BOW, SC, SPM, and PSR). In [29], the CNN models were fully trained on the UCM dataset, and a gradient boosting machine was employed to combine multiple CNN models to improve the classification performance. In [58], the pre-trained CaffeNet and GoogLeNet were used as feature descriptors for scene classification. Besides, the work reported in [59] attempted to transfer pre-trained CNN models by fine-tuning their parameters on the target dataset, and two fine-tuned pre-trained CNN models were applied to the UCM dataset. It should be noted that the aforementioned works all

used the features of a fully-connected layer for classification. Furthermore, Hu et al. [28] paid more attention to exploiting the features from convolutional layers through feature coding methods and achieved a more competitive result on the UCM dataset by combining two pre-trained models. In contrast, our work focuses on fusing multi-layer features of a single CNN model for scene classification. From the results, a better result is obtained by our proposed method in comparison with the fully trained models and the fine-tuned pre-trained models. Moreover, an even higher accuracy is achieved by combining features from two CNN models in our experiments.
For the WHU_RS dataset, previous works have not evaluated their methods with the same training and testing protocol, because it has fewer samples per class than the UCM dataset. For instance, Hu and Zou randomly selected 60% of the samples as training samples and used the remainder for testing [28], [60]. Nogueira et al. [26] followed the same experimental setup as ours, with 80% of the samples randomly selected for training and the remainder for testing. In [28], the authors used VGG-VD16 as the feature extractor and applied the vector of locally aggregated descriptors to build the linear vector representation of the last convolutional features, producing their best result of 98.64%. In [60], local and global features were fused to build the image representation for scene classification, achieving a best accuracy of 95.26%. In [26], pre-trained CNN models were used for remote sensing scene classification with a fine-tuning strategy, and the best accuracy exceeded 98%. Compared with these performances, our proposed method achieves the best accuracy (99.18%) using 80% of the samples for training and obtains an impressive accuracy (98.8%) even using only 50% of the samples for training. This shows that the performance of our proposed method is comparable to the state-of-the-art methods previously proposed.

TABLE IV
PERFORMANCE COMPARISON OF THE STATE-OF-THE-ART METHODS ON THE UCM DATASET

Methods                          Classification accuracy (%)
Spatial BOW [9]                  81.19
SIFT+BOW [28]                    75.11
SC+Pooling [13]                  81.67 ± 1.23
SPM [6]                          86.8
PSR [6]                          89.1
GBRCN [29]                       94.53
CaffeNet [58]                    95.02 ± 0.81
GoogLeNet [58]                   94.31 ± 0.89
CaffeNet + fine-tuning [59]      95.48
GoogLeNet + fine-tuning [59]     97.10
VGG-M+IFK [28]                   96.90 ± 0.77
VGG-S+VGG-VD16 [28]              98.49
Ours (VGG-VD16)                  98.57 ± 0.34
Ours (VGG-VD16+AlexNet)          98.81 ± 0.38

V. CONCLUSION

Deep Convolutional Neural Network technology provides a powerful solution for scene image classification. However,

fully training a CNN model remains challenging for remotely sensed scene images, considering the limited training data and the time-consuming training process. Therefore, this study focuses on exploiting the potential of pre-trained CNN models for remote sensing scene classification, and presents a multi-layer feature fusion strategy based on multiscale improved Fisher vector coding and the PCA/SRKDA method. In comparison with related CNN-based scene classification methods, the novelties lie in integrating the multi-layer features of a CNN model to build a more discriminative feature representation for classification, and in proposing an efficient feature fusion strategy to fuse multiple very high dimensional features. Through experiments on two commonly used test datasets, the results demonstrate that the approach offers state-of-the-art classification performance on both datasets. In addition, they demonstrate that deeper convolutional features are more discriminative than fully-connected features when a pre-trained CNN model is used as a feature extractor.
Although the proposed method achieves better performance by integrating multi-layer convolutional features and considering the multiscale information of scene images than classification using the limited information of one or two convolutional layers, it inevitably increases the computational burden of the classification scheme. In addition, the supervised method used in the feature fusion process reduces the diversity of the feature representations constructed by the different pre-trained CNN models and has a negative effect on fusing multiple CNN models for further improving the classification performance.

REFERENCES

[1] M. Voltersen, C. Berger, S. Hese, and C. Schmullius, “Expanding an urban structure type mapping approach from a subarea to the entire city of Berlin,” in Jt. Urban Remote Sens. Event (JURSE), Lausanne, Switzerland, 2015, pp. 1-4.
[2] X. Zhang and S. Du, “A Linear Dirichlet Mixture Model for decomposing scenes: Application to analyzing urban functional zonings,” Remote Sens. Environ., vol. 169, pp. 37-49, Nov. 2015.
[3] L. Zhang, L. Zhang, D. Tao, and X. Huang, “On combining multiple features for hyperspectral remote sensing image classification,” IEEE Trans. Geosci. Remote Sens., vol. 50, no. 3, pp. 879-893, Aug. 2012.
[4] L. Zhang, Q. Zhang, L. Zhang, D. Tao, X. Huang, and B. Du, “Ensemble manifold regularized sparse low-rank approximation for multiview feature embedding,” Pattern Recogn., vol. 48, no. 10, pp. 3102-3112, Oct. 2015.
[5] K. V. D. Sande, T. Gevers, and C. Snoek, “Evaluating color descriptors for object and scene recognition,” IEEE Trans. Pattern Anal. Mach. Intell., vol. 32, no. 9, pp. 1582-1596, Sep. 2010.
[6] S. Chen and Y. Tian, “Pyramid of spatial relatons for scene-level land use classification,” IEEE Trans. Geosci. Remote Sens., vol. 53, no. 4, pp. 1947-1957, Apr. 2015.
[7] R. Raja, S. M. M. Roomi, and D. Dharmalakshmi, “Outdoor scene classification using invariant features,” in 2013 4th Nat. Conf. Comput. Vis., Pattern Recogn., Image Process. Graph. (NCVPRIPG 2013), New York, USA, 2013, pp. 1-4.
[8] Y. Jiang, R. Wang, and P. Zhang, “Texture description based on multiresolution moments of image histograms,” Opt. Eng., vol. 47, no. 3, Mar. 2008.
[9] Y. Yang and S. Newsam, “Bag-of-visual-words and spatial extensions for land-use classification,” in 18th ACM SIGSPATIAL Int. Conf. Adv. Geogr. Inform. Syst. (ACM SIGSPATIAL GIS 2010), San Jose, CA, USA, 2010, pp. 270-279.
[10] G. Cheng, L. Guo, T. Zhao, J. Han, H. Li, and J. Fang, “Automatic landslide detection from remote-sensing imagery using a scene classification method based on BoVW and pLSA,” Int. J. Remote Sens., vol. 34, no. 1, pp. 45-59, Jan. 2013.
[11] B. Zhao, Y. Zhong, G. S. Xia, and L. Zhang, “Dirichlet-derived multiple topic scene classification model for high spatial resolution remote sensing imagery,” IEEE Trans. Geosci. Remote Sens., vol. 54, no. 4, pp. 2108-2123, Apr. 2016.
[12] F. Perronnin and C. Dance, “Fisher kernels on visual vocabularies for image categorization,” in 2007 IEEE Comput. Soc. Conf. Comput. Vis. Pattern Recognit. (CVPR'07), Minneapolis, MN, USA, 2007, pp. 1-8.
[13] A. M. Cheriyadat, “Unsupervised feature learning for aerial scene classification,” IEEE Trans. Geosci. Remote Sens., vol. 52, no. 1, pp. 439-451, Jan. 2014.
[14] F. Hu, G. S. Xia, Z. Wang, X. Huang, L. Zhang, and H. Sun, “Unsupervised feature learning via spectral clustering of multidimensional patches for remotely sensed scene classification,” IEEE J. Sel. Topics Appl. Earth Observ. Remote Sens., vol. 8, no. 5, pp. 2015-2030, May 2015.
[15] J. Schmidhuber, “Deep learning in neural networks: An overview,” Neural Netw., vol. 61, pp. 85-117, Jan. 2015.
[16] L. Zhang, L. Zhang, and B. Du, “Deep learning for remote sensing data: A technical tutorial on the state of the art,” IEEE Geosci. Remote Sens. Mag., vol. 4, no. 2, pp. 22-40, June 2016.
[17] R. Girshick, J. Donahue, T. Darrell, and J. Malik, “Region-based convolutional networks for accurate object detection and segmentation,” IEEE Trans. Pattern Anal. Mach. Intell., vol. 38, no. 1, pp. 142-158, Jan. 2016.
[18] S. Balaban, “Deep learning and face recognition: The state of the art,” in Proc. SPIE Int. Soc. Opt. Eng., Baltimore, MD, USA, 2015, Art. no. 94570B.
[19] O. Abdel-Hamid, A. R. Mohamed, H. Jiang, and G. Penn, “Applying Convolutional Neural Networks concepts to hybrid NN-HMM model for speech recognition,” in 2012 IEEE Int. Conf. Acoust. Speech Signal Process. (ICASSP 2012), Kyoto, Japan, 2012, pp. 4277-4280.
[20] A. Karpathy, G. Toderici, S. Shetty, T. Leung, R. Sukthankar, and F.-F. Li, “Large-scale video classification with convolutional neural networks,” in 27th IEEE Conf. Comput. Vis. Pattern Recognit. (CVPR 2014), Columbus, OH, USA, 2014, pp. 1725-1732.
[21] S. Zheng, S. Jayasumana, B. Romera-Paredes, V. Vineet, Z. Su, D. Du, C. Huang, and P. H. S. Torr, “Conditional random fields as recurrent neural networks,” in 15th IEEE Int. Conf. Comput. Vis. (ICCV 2015), Santiago, Chile, 2016, pp. 1529-1537.
[22] K. He, X. Zhang, S. Ren, and J. Sun, “Deep residual learning for image recognition,” in 29th IEEE Conf. Comput. Vis. Pattern Recognit. (CVPR 2016), Las Vegas, NV, USA, 2016, pp. 770-778.
[23] O. A. B. Penatti, K. Nogueira, and J. A. d. Santos, “Do deep features generalize from everyday objects to remote sensing and aerial scenes domains?,” in IEEE Comput. Soc. Conf. Comput. Vis. Pattern Recognit. Workshops (CVPRW 2015), Boston, MA, USA, 2015, pp. 44-51.
[24] M. Vakalopoulou, K. Karantzalos, N. Komodakis, and N. Paragios, “Building detection in very high resolution multispectral data with deep learning features,” in IEEE Int. Geosci. Remote Sens. Symp. (IGARSS 2015), Milan, Italy, 2015, pp. 1873-1876.
[25] C. Szegedy, W. Liu, Y.

REFERENCES

[1] M. Voltersen, C. Berger, S. Hese, and C. Schmullius, “Expanding an urban structure type mapping approach from a subarea to the entire city of Berlin,” In Jt. Urban Remote Sens. Event, JURSE, Lausanne, Switzerland, 2015, pp. 1-4.
[2] X. Zhang, and S. Du, “A Linear Dirichlet Mixture Model for decomposing scenes: Application to analyzing urban functional zonings,” Remote Sens. Environ., vol. 169, pp. 37-49, Nov. 2015.
[3] L. Zhang, L. Zhang, D. Tao, and X. Huang, “On combining multiple features for hyperspectral remote sensing image classification,” IEEE Trans. Geosci. Remote Sens., vol. 50, no. 3, pp. 879-893, Aug. 2012.
[4] L. Zhang, Q. Zhang, L. Zhang, D. Tao, X. Huang, and B. Du, “Ensemble manifold regularized sparse low-rank approximation for multiview feature embedding,” Pattern Recogn., vol. 48, no. 10, pp. 3102-3112, Oct. 2015.
[5] K. V. D. Sande, T. Gevers, and C. Snoek, “Evaluating Color Descriptors for Object and Scene Recognition,” IEEE Trans. Pattern Anal. Mach. Intell., vol. 32, no. 9, pp. 1582-1596, Sep. 2010.
[6] S. Chen, and Y. Tian, “Pyramid of Spatial Relatons for Scene-Level Land Use Classification,” IEEE Trans. Geosci. Remote Sens., vol. 53, no. 4, pp. 1947-1957, Apr. 2015.
[7] R. Raja, S. M. M. Roomi, and D. Dharmalakshmi, “Outdoor Scene classification using invariant features,” In 2013 4th Nat. Conf. Comput. Vis., Pattern Recogn., Image Process. Graph., NCVPRIPG 2013, New York, USA, 2013, pp. 1-4.
[8] Y. Jiang, R. Wang, and P. Zhang, “Texture description based on multiresolution moments of image histograms,” Opt. Eng., vol. 47, no. 3, Mar. 2008.
[9] Y. Yang, and S. Newsam, “Bag-of-visual-words and spatial extensions for land-use classification,” In 18th ACM SIGSPATIAL Int. Conf. Adv. Geogr. Inform. Syst., ACM SIGSPATIAL GIS 2010, San Jose, California, USA, 2010, pp. 270-279.
[10] G. Cheng, L. Guo, T. Zhao, J. Han, H. Li, and J. Fang, “Automatic landslide detection from remote-sensing imagery using a scene classification method based on BoVW and pLSA,” Int. J. Remote Sens., vol. 34, no. 1, pp. 45-59, Jan. 2013.
[11] B. Zhao, Y. Zhong, G. S. Xia, and L. Zhang, “Dirichlet-Derived Multiple Topic Scene Classification Model for High Spatial Resolution Remote Sensing Imagery,” IEEE Trans. Geosci. Remote Sens., vol. 54, no. 4, pp. 2108-2123, Apr. 2016.
[12] F. Perronnin, and C. Dance, “Fisher kernels on visual vocabularies for image categorization,” In 2007 IEEE Comput. Soc. Conf. Comput. Vis. Pattern Recognit., CVPR'07, Minneapolis, MN, USA, 2007, pp. 1-8.
[13] A. M. Cheriyadat, “Unsupervised Feature Learning for Aerial Scene Classification,” IEEE Trans. Geosci. Remote Sens., vol. 52, no. 1, pp. 439-451, Jan. 2014.
[14] F. Hu, G. S. Xia, Z. Wang, X. Huang, L. Zhang, and H. Sun, “Unsupervised Feature Learning Via Spectral Clustering of Multidimensional Patches for Remotely Sensed Scene Classification,” IEEE J. Sel. Topics Appl. Earth Observ. Remote Sens., vol. 8, no. 5, pp. 2015-2030, May 2015.
[15] J. Schmidhuber, “Deep learning in neural networks: An overview,” Neural Netw., vol. 61, pp. 85-117, Jan. 2015.
[16] L. Zhang, L. Zhang, and B. Du, “Deep Learning for Remote Sensing Data: A Technical Tutorial on the State of the Art,” IEEE Geosci. Remote Sens. Mag., vol. 4, no. 2, pp. 22-40, June 2016.
[17] R. Girshick, J. Donahue, T. Darrell, and J. Malik, “Region-Based Convolutional Networks for Accurate Object Detection and Segmentation,” IEEE Trans. Pattern Anal. Mach. Intell., vol. 38, no. 1, pp. 142-158, Jan. 2016.
[18] S. Balaban, “Deep learning and face recognition: The state of the art,” In Proc. SPIE Int. Soc. Opt. Eng., Baltimore, MD, USA, 2015, Art. no. 94570B.
[19] O. Abdel-Hamid, A.-r. Mohamed, H. Jiang, and G. Penn, “Applying Convolutional Neural Networks concepts to hybrid NN-HMM model for speech recognition,” In 2012 IEEE Int. Conf. Acoust. Speech Signal Proc., ICASSP 2012, Kyoto, Japan, 2012, pp. 4277-4280.
[20] A. Karpathy, G. Toderici, S. Shetty, T. Leung, R. Sukthankar, and F.-F. Li, “Large-scale video classification with convolutional neural networks,” In 27th IEEE Conf. Comput. Vis. Pattern Recognit., CVPR 2014, Columbus, OH, USA, 2014, pp. 1725-1732.
[21] S. Zheng, S. Jayasumana, B. Romera-Paredes, V. Vineet, Z. Su, D. Du, C. Huang, and P. H. S. Torr, “Conditional random fields as recurrent neural networks,” In 15th IEEE Int. Conf. Comput. Vis., ICCV 2015, Santiago, Chile, 2015, pp. 1529-1537.
[22] K. He, X. Zhang, S. Ren, and J. Sun, “Deep residual learning for image recognition,” In 29th IEEE Conf. Comput. Vis. Pattern Recognit., CVPR 2016, Las Vegas, NV, USA, 2016, pp. 770-778.
[23] O. A. B. Penatti, K. Nogueira, and J. A. d. Santos, “Do deep features generalize from everyday objects to remote sensing and aerial scenes domains?,” In IEEE Comput. Soc. Conf. Comput. Vis. Pattern Recognit. Workshops, CVPRW 2015, Boston, MA, USA, 2015, pp. 44-51.
[24] M. Vakalopoulou, K. Karantzalos, N. Komodakis, and N. Paragios, “Building detection in very high resolution multispectral data with deep learning features,” In IEEE Int. Geosci. Remote Sens. Symp., IGARSS 2015, Milan, Italy, 2015, pp. 1873-1876.
[25] C. Szegedy, W. Liu, Y. Jia, P. Sermanet, S. Reed, D. Anguelov, D. Erhan, V. Vanhoucke, and A. Rabinovich, “Going deeper with convolutions,” In IEEE Conf. Comput. Vis. Pattern Recognit., CVPR 2015, Boston, MA, USA, 2015, pp. 1-9.
[26] K. Nogueira, O. A. B. Penatti, and J. A. dos Santos, “Towards better exploiting convolutional neural networks for remote sensing scene classification,” Pattern Recognit., vol. 61, pp. 539-556, Jan. 2017.
[27] J. Deng, W. Dong, R. Socher, L. J. Li, L. Kai, and F.-F. Li, “ImageNet: A large-scale hierarchical image database,” In IEEE Comput. Soc. Conf. Comput. Vis. Pattern Recognit., CVPR 2009, Miami, Florida, USA, 2009, pp. 248-255.
[28] F. Hu, G. S. Xia, J. Hu, and L. Zhang, “Transferring Deep Convolutional Neural Networks for the Scene Classification of High-Resolution Remote Sensing Imagery,” Remote Sens., vol. 7, no. 11, pp. 14680-14707, Nov. 2015.
[29] F. Zhang, B. Du, and L. Zhang, “Scene Classification via a Gradient Boosting Random Convolutional Network Framework,” IEEE Trans. Geosci. Remote Sens., vol. 54, no. 3, pp. 1793-1802, Mar. 2016.
[30] W. Zhao, and S. Du, “Scene classification using multi-scale deeply described visual words,” Int. J. Remote Sens., vol. 37, no. 17, pp. 4119-4131, Sep. 2016.
[31] E. Othman, Y. Bazi, N. Alajlan, H. Alhichri, and F. Melgani, “Using convolutional features and a sparse autoencoder for land-use scene classification,” Int. J. Remote Sens., vol. 37, no. 10, pp. 1977-1995, May 2016.
[32] J. Yue, W. Zhao, S. Mao, and H. Liu, “Spectral–spatial classification of hyperspectral images using deep convolutional neural networks,” Remote Sens. Lett., vol. 6, no. 6, pp. 468-477, Jun. 2015.
[33] A. Krizhevsky, I. Sutskever, and G. E. Hinton, “ImageNet classification with deep convolutional neural networks,” In 26th Annu. Conf. Neural Inf. Process. Syst. 2012, NIPS 2012, Lake Tahoe, NV, USA, 2012, pp. 1097-1105.
[34] J. Long, E. Shelhamer, and T. Darrell, “Fully Convolutional Networks for Semantic Segmentation,” In IEEE Conf. Comput. Vis. Pattern Recognit., CVPR 2015, Boston, MA, USA, 2015, pp. 3431-3440.
[35] W. Zhao, and S. Du, “Learning multiscale and deep representations for classifying remotely sensed imagery,” ISPRS J. Photogramm. Remote Sens., vol. 113, pp. 155-165, Mar. 2016.
[36] L. Bottou, “Large-scale machine learning with stochastic gradient descent,” In 19th Int. Conf. Comput. Stat., COMPSTAT 2010, Paris, France, 2010, pp. 177-186.
[37] Y. LeCun, Y. Bengio, and G. Hinton, “Deep learning,” Nature, vol. 521, no. 7553, pp. 436-444, May 2015.
[38] K. Simonyan, and A. Zisserman, “Very deep convolutional networks for large-scale image recognition,” CoRR, vol. abs/1409.1556, pp. 1-14, Apr. 2014.
[39] X. Chen, S. Xiang, C. L. Liu, and C. H. Pan, “Vehicle Detection in Satellite Images by Hybrid Deep Convolutional Neural Networks,” IEEE Geosci. Remote Sens. Lett., vol. 11, no. 10, pp. 1797-1801, Oct. 2014.
[40] J. Tang, C. Deng, G. B. Huang, and B. Zhao, “Compressed-Domain Ship Detection on Spaceborne Optical Image Using Deep Neural Network and Extreme Learning Machine,” IEEE Trans. Geosci. Remote Sens., vol. 53, no. 3, pp. 1174-1185, Mar. 2015.
[41] K. Makantasis, K. Karantzalos, A. Doulamis, and N. Doulamis, “Deep supervised learning for hyperspectral data classification through convolutional neural networks,” In IEEE Int. Geosci. Remote Sens. Symp., IGARSS 2015, Milan, Italy, 2015, pp. 4959-4962.
[42] W. Zhao, and S. Du, “Spatial Feature Extraction for Hyperspectral Image Classification: A Dimension Reduction and Deep Learning Approach,” IEEE Trans. Geosci. Remote Sens., vol. 54, no. 8, pp. 4544-4554, Aug. 2016.

[43] G. Gando, T. Yamada, H. Sato, S. Oyama, and M. Kurihara, “Fine-tuning deep convolutional neural networks for distinguishing illustrations from photographs,” Expert Sys. Appl., vol. 66, pp. 295-301, Dec. 2016.
[44] F. Perronnin, J. Sánchez, and T. Mensink, “Improving the Fisher Kernel for Large-Scale Image Classification,” In 11th Eur. Conf. Comput. Vis., ECCV 2010, Heraklion, Crete, Greece, 2010, pp. 143-156.
[45] S. Baker, and T. Kanade, “Hallucinating faces,” In Int. Conf. Autom. Face Gesture Recognit., FG 2000, Grenoble, France, 2000, pp. 83-88.
[46] D. Cai, X. He, and J. Han, “SRDA: An Efficient Algorithm for Large-Scale Discriminant Analysis,” IEEE Trans. Knowl. Data Eng., vol. 20, no. 1, pp. 1-12, Jan. 2008.
[47] D. Cai, X. He, and J. Han, “Speed up kernel discriminant analysis,” VLDB Journal, vol. 20, no. 1, pp. 21-33, 2011.
[48] G. Baudat, and F. Anouar, “Generalized Discriminant Analysis Using a Kernel Approach,” Neural Comp., vol. 12, no. 10, pp. 2385-2404, Oct. 2000.
[49] S. Mika, G. Ratsch, J. Weston, B. Scholkopf, and K. R. Mullers, “Fisher discriminant analysis with kernels,” In Proc. 1999 9th IEEE Workshop Neural Netw. Signal Process. (NNSP'99), Madison, WI, USA, 1999, pp. 41-48.
[50] G.-S. Xia, W. Yang, J. Delon, Y. Gousseau, H. Sun, and H. Maître, “Structural high-resolution satellite image indexing,” In ISPRS Int. Arch. Photogramm., Remote Sens. Spat. Inf. Sci., Vienna, Austria, 2010, pp. 298-303.
[51] Y. Jia, E. Shelhamer, J. Donahue, S. Karayev, J. Long, R. Girshick, S. Guadarrama, and T. Darrell, “Caffe: Convolutional architecture for fast feature embedding,” In 2014 ACM Conf. Multimedia, MM 2014, Orlando, FL, USA, 2014, pp. 675-678.
[52] K. Chatfield, K. Simonyan, A. Vedaldi, and A. Zisserman, “Return of the devil in the details: Delving deep into convolutional nets,” CoRR, vol. abs/1405.3531, pp. 1-11, Nov. 2014.
[53] R.-E. Fan, K.-W. Chang, C.-J. Hsieh, X.-R. Wang, and C.-J. Lin, “LIBLINEAR: A library for large linear classification,” J. Mach. Learn. Res., vol. 9, pp. 1871-1874, Aug. 2008.
[54] A. Vedaldi, and B. Fulkerson, “Vlfeat - An open and portable library of computer vision algorithms,” In Proc. 18th ACM Int. Conf. Multimedia, Firenze, Italy, 2010, pp. 1469-1472.
[55] A. Vedaldi, and K. Lenc, “MatConvNet: Convolutional neural networks for MATLAB,” In 23rd ACM Int. Conf. Multimedia, MM 2015, Brisbane, QLD, Australia, 2015, pp. 689-692.
[56] J. Wang, J. Yang, K. Yu, F. Lv, T. Huang, and Y. Gong, “Locality-constrained linear coding for image classification,” In 2010 IEEE Comput. Soc. Conf. Comput. Vis. Pattern Recognit., CVPR 2010, San Francisco, CA, USA, 2010, pp. 3360-3367.
[57] H. Jegou, M. Douze, C. Schmid, and P. Perez, “Aggregating local descriptors into a compact image representation,” In 2010 IEEE Comput. Soc. Conf. Comput. Vis. Pattern Recognit., CVPR 2010, San Francisco, CA, USA, 2010, pp. 3304-3311.
[58] G.-S. Xia, J. Hu, F. Hu, B. Shi, X. Bai, Y. Zhong, and L. Zhang, “AID: A Benchmark Data Set for Performance Evaluation of Aerial Scene Classification,” CoRR, vol. abs/1608.05167, pp. 1-25, Sep. 2017.
[59] M. Castelluccio, G. Poggi, C. Sansone, and L. Verdoliva, “Land Use Classification in Remote Sensing Images by Convolutional Neural Networks,” CoRR, vol. abs/1508.00092, pp. 1-11, Aug. 2015.
[60] J. Zou, W. Li, C. Chen, and Q. Du, “Scene classification using local and global features with collaborative representation fusion,” Inf. Sci., vol. 348, pp. 209-226, Jun. 2016.
