International Journal of Remote Sensing
ISSN: 0143-1161 (Print) 1366-5901 (Online) Journal homepage: http://www.tandfonline.com/loi/tres20

Using convolutional features and a sparse autoencoder for land-use scene classification
Esam Othman, Yakoub Bazi, Naif Alajlan, Haikel Alhichri & Farid Melgani

To cite this article: Esam Othman, Yakoub Bazi, Naif Alajlan, Haikel Alhichri & Farid Melgani (2016) Using convolutional features and a sparse autoencoder for land-use scene classification, International Journal of Remote Sensing, 37:10, 1977-1995, DOI: 10.1080/01431161.2016.1171928

To link to this article: http://dx.doi.org/10.1080/01431161.2016.1171928
Published online: 21 Apr 2016.
INTERNATIONAL JOURNAL OF REMOTE SENSING, 2016 VOL. 37, NO. 10, 1977–1995 http://dx.doi.org/10.1080/01431161.2016.1171928
Using convolutional features and a sparse autoencoder for land-use scene classification

Esam Othman (a), Yakoub Bazi (a), Naif Alajlan (a), Haikel Alhichri (a) and Farid Melgani (b)

(a) Department of Computer Engineering, College of Computer and Information Sciences, King Saud University, Riyadh, Saudi Arabia; (b) Department of Information Engineering and Computer Science, University of Trento, Trento, Italy

ABSTRACT
In this article, we propose a novel approach based on convolutional features and a sparse autoencoder (AE) for scene-level land-use (LU) classification. This approach starts by generating an initial feature representation of the scenes under analysis from a deep convolutional neural network (CNN) pre-learned on a large amount of labelled data from an auxiliary domain. Then these convolutional features are fed as input to a sparse AE for learning a new suitable representation in an unsupervised manner. After this pre-training phase, we propose two different scenarios for building the classification system. In the first scenario, we add a softmax layer on the top of the AE encoding layer and then fine-tune the resulting network in a supervised manner using the target training images available at hand. Then we classify the test images based on the posterior probabilities provided by the softmax layer. In the second scenario, we view the classification problem from a reconstruction perspective. To this end, we train several class-specific AEs (i.e. one AE per class) and then classify the test images based on the reconstruction error. Experimental results conducted on the University of California (UC) Merced and Banja-Luka LU public data sets confirm the superiority of the proposed approach compared to state-of-the-art methods.
ARTICLE HISTORY: Received 20 November 2015; Accepted 15 March 2016
1. Introduction

Land-use (LU) classification from remote-sensing images is an important task as it reflects the social and economic activities of a given territory. At the same time, it is considered by the remote-sensing community as a challenging problem since it requires a high-level semantic description, which cannot be captured by traditional low-level representations. In the remote-sensing literature, several methods have been introduced recently to address the issue of LU classification. In Yang and Newsam (2010), the authors proposed three different bag-of-visual-words (BOVW)-based schemes. The first one is a standard non-spatial representation in which the frequencies of quantized image features are used to distinguish between classes. The second scheme is a spatial pyramid match kernel, which considers the absolute spatial arrangement of the image features. The last scheme is based
on the spatial co-occurrence kernel, which considers the relative arrangement of the image features. In the BOVW context, Yang and Newsam (2011) proposed another method, known as spatial pyramid co-occurrence, where a BOVW histogram involves both the absolute and relative spatial arrangements of the visual words, whose co-occurrence is counted over a spatial partitioning of the target image. This scheme achieved an improvement over the non-spatial BOVW. Another model in Chen and Tian (2015), termed pyramid of spatial relations (PSR), was introduced to overcome the fact that the standard BOVW discards the spatial information of the image features. The PSR model employs a concept of spatial relation to describe the relative spatial relationship of a group of local features. In addition, the PSR also optimizes the storage cost of the visual word codebook; it has also shown robustness to rotation and translation. Zhao, Tang, and Huo (2014) presented a concentric circle-based spatial-rotation-invariant representation strategy for describing the spatial information of visual words and proposed a concentric circle-structured multi-scale BOVW method using multiple features. Specifically, their methodology is performed in six steps. First, multiple local features are extracted from multiple resolution images (constructed starting from the original image). Next, different visual vocabularies are generated with a k-means clustering algorithm for the different resolutions. Afterwards, each resolution image is partitioned into a number of annular subregions by a set of concentric circles and separately represented as histograms of visual word occurrences by mapping the local features of the annular subregions of this resolution image to the corresponding learned visual vocabulary. Subsequently, for each resolution image, all generated histograms are concatenated together. The final signature of the target image is then shaped by a further concatenation of the previously created resolution histograms. The latter image representations are finally fed into a support vector machine (SVM) classifier to finalize the decision-making. Huang, Lu, and Zhang (2014) proposed a multi-index learning method, where a set of low-dimensional information indices is used to represent the complex geospatial LUs in high-resolution images, and it proved as efficient as the alternative common high-dimensional feature spaces. Multi-feature fusion also allows one to take advantage of all the latent information within the remotely sensed images. In particular, in Huang, Lu, and Zhang (2014), both spectral and spatial features were combined into one single model, which demonstrated good performance. Cheng et al. (2014) introduced a rotation-invariant approach for multi-class geospatial object detection and geographic image classification based on a collection of part detectors, where each detector is a linear SVM. The part detectors are used for the detection of either objects or recurring spatial patterns within a certain range of orientation, and thereafter they provide a solution for rotation-invariant detection of multi-class objects. For geographic image classification, they adopted a large number of pre-trained part detectors to detect special visual parts from images and used them as attributes to represent the images. Cheng et al. (2015) proposed a part-based method to represent images by applying a large number of part detectors to them. To minimize the computational cost, which may occur when the number of part detectors increases, they adopted a single-hidden-layer autoencoder (AE) and a single-hidden-layer neural network with an L0-norm sparsity constraint, respectively, to train coarse-to-fine shared intermediate representations, called sparselets. Finally, Mekhalfi et al. (2015) formulated the LU scene classification problem within a compressive sensing fusion framework. They adopted three feature extraction techniques, namely co-occurrence of adjacent local binary patterns, gradient local autocorrelations, and histograms of oriented gradients, for feature generation.
Then two multi-feature ensemble techniques were employed to achieve high classification accuracy results. This article proposes a novel approach for LU scene classification based on deep learning (Chen, Xiang, et al. 2014; Schmidhuber 2015). The idea of deep learning, also known as hierarchical learning (proposed for the first time by Hinton, Osindero, and Teh 2006), is about learning a good feature representation automatically from the input data. Typical deep learning architectures include deep belief networks (DBNs) (Hinton, Osindero, and Teh 2006), stacked autoencoder (SAE) (Vincent et al. 2008), and convolutional neural networks (CNNs) (Farabet et al. 2013). Recently, compared to shallow architectures (i.e. handcrafted features fed as input to a kernel classifier), deep learning has shown outstanding results in many applications such as image classification (Hayat, Bennamoun, and An 2015), object recognition (Bai et al. 2015), face recognition (Gao et al. 2015), medical image analysis (Brosch and Tam 2015), speech recognition (Hinton et al. 2012), and traffic flow prediction (Lv et al. 2015). In the context of remote sensing, the concept of deep learning was introduced into hyperspectral data classification for the first time by Chen, Lin, et al. (2014). The authors applied an AE in an unsupervised way to learn the deep features of hyperspectral data. Particularly, they used single-layer AE and SAEs to learn the shallow and deep features of hyperspectral data, respectively. Chen, Zhao, and Jia (2015) proposed a classification framework based on DBNs to analyse hyperspectral data. The proposed method uses a single-layer restricted Boltzmann machine (RBM) and multilayer deep network to learn the shallow and deep features, respectively. In another remote-sensing context, Tang et al. (2015) proposed a compressed-domain framework for fast ship detection on space-borne optical imagery. Here the authors used a deep neural network (DNN) for hierarchical ship feature extraction in the wavelet domain. The learned features are more robust under certain conditions compared with other existing feature descriptors. Finally, the authors adopted the extreme learning machine (ELM) for feature fusion and classification. Han et al. (2015) proposed a weakly supervised learning (WSL) framework for object detection from optical images. For the learning phase, they used a deep Boltzmann machine (DBM) to learn high-level feature representation for various geospatial objects. Huang et al. (2015) proposed a DNN-based method for pan-sharpening. They particularly learned the relationship between the high- and low-resolution image patch pairs. Chen, Xiang, et al. (2014) proposed a method to detect small objects such as vehicles in satellite images. They made use of CNNs to extract and learn rich features from the training data. Finally, Han et al. (2015) proposed a geospatial object detection approach by combining WSL and high-level feature learning. They used DBM to infer the spatial and structural information encoded in the low-level and middle-level features to effectively describe objects in images. It appears clearly from the above works that deep learning is becoming very attractive for solving various remote-sensing applications. In this work, we propose exploiting it to develop a transfer learning approach to address the issues of LU classification. 
The approach is inspired by the idea of using pre-learned classifiers (trained from auxiliary data) to learn a target classifier (Yang, Yan, and Hauptmann 2007; Schweikert et al. 2008; Duan, Tsang, and Xu 2012; Duan et al. 2012; Duan, Xu, and Tsang 2012). This concept, mainly proposed for shallow architectures, aims to adapt the decision function of the classifier in the target domain from another classifier trained on an auxiliary domain. In contrast, in this work the
adaptation is carried out at the feature level thanks to deep learning. Specifically, the approach starts by generating an initial feature representation of the images under analysis from a deep CNN pre-learned on a large amount of labelled images from an auxiliary domain. Then these convolutional features are fed as input to a sparse AE for learning a new suitable representation of the data in an unsupervised manner. After this pre-training phase, we propose two different scenarios for building the classification system. In the first scenario, we add a softmax layer on the top of the encoding layer of the AE and then fine-tune the resulting network in a supervised way using the target training images. Then we classify the test images based on the posterior probabilities provided by the softmax layer. In the second scenario, we train class-specific AEs (i.e. one AE per class) and then classify the test images based on the reconstruction error. In the experiments, we evaluate the method on two public benchmark LU data sets to confirm the promising capabilities of the proposed transfer learning approach.
2. Description of the proposed method

2.1. Problem formulation

Let $\{I_i, y_i\}_{i=1}^{n}$ be a training set, where $I_i \in \mathbb{R}^{w \times h}$ is a remote-sensing image of size $w \times h$ and $y_i \in \{1, 2, \ldots, c\}$ is its corresponding class label. Let us also consider $f_j^{CNN}$, $j = 1, \ldots, L$, the output of the $j$th layer of a CNN model with $L$ layers pre-learned on a large amount of labelled auxiliary images from a different domain. Our aim is to develop a classification system that allows classifying the $m$ test images $\{I_j\}_{j=n+1}^{n+m}$ based on the available training set as well as on the pre-learned CNN model. Detailed descriptions of the proposed approach are provided in the next subsections.
2.2. Convolutional feature generation from a pre-learned CNN

Deep CNNs are composed of several layers of processing, each comprising both linear and non-linear operators, which are learnt jointly, in an end-to-end manner, to solve specific tasks (Farabet et al. 2013; Brosch and Tam 2015). Specifically, deep CNNs are commonly made up of four types of layer: 1) the convolutional layer, 2) the normalization layer, 3) the pooling layer, and 4) the fully connected layer. The convolutional layer is the core building block of the CNN and its parameters consist of a set of learnable filters. Every filter is small spatially (along width and height), but extends through the full depth of the input image. The feature maps produced by convolving these filters across the input image are fed to a non-linear gating function such as the Rectified Linear Unit (ReLU) (Krizhevsky, Sutskever, and Hinton 2012). Then the output of this activation function can further be subjected to normalization (i.e. local response normalization) to help generalization. The pooling layer takes small rectangular blocks from the convolutional layer and subsamples each block to produce a single output. There are several ways to perform pooling, such as taking the average or the maximum, or a learned linear combination of the values in the block. The two main reasons for using this layer are, first, to control overfitting and, second, to reduce the amount of parameters and computation in the network. After several convolutional and pooling layers, the high-level reasoning in the neural network is performed via fully
Figure 1. Example of generating a convolutional feature from a CNN pre-learned on an auxiliary domain. The image is fed to the CNN through different layers (convolution, non-linear activation function, pooling, and normalization). The output of the last fully connected layer represents the convolutional feature.
connected layers. A fully connected layer takes all neurons in the previous layer and connects each of them to every one of its own neurons. In the case of classification, a softmax layer is added at the end of this network and the weights of the CNN are learned using back-propagation. Given the deep nature of the CNN, one can extract features at different layers. In this work, we take the output of the hidden fully connected layers (before the softmax layer) to represent the training and test images, as shown in Figure 1. That is, we feed each image $I_i$ as input to the CNN and generate a feature representation vector $x_i \in \mathbb{R}^{D}$ of dimension $D$ at layer $k$, with $k = L - 1$:

$$x_i = f_k^{CNN}\big(\ldots f_2^{CNN}\big(f_1^{CNN}(I_i)\big)\big), \quad i = 1, \ldots, n + m. \qquad (1)$$
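To make this feature-generation step concrete, the sketch below extracts a 4096-dimensional vector from the penultimate fully connected layer of an ImageNet pre-trained CNN. It uses the VGG-16 model from torchvision purely as a stand-in for the Chatfield et al. (2014) network employed in this work (which the authors access through MatConvNet), so the exact model, layer indexing, and preprocessing values are illustrative assumptions rather than the original implementation.

```python
# Illustrative sketch (not the authors' MatConvNet pipeline): extract a 4096-D
# convolutional feature x_i from the penultimate fully connected layer of an
# ImageNet pre-trained CNN, in the spirit of Equation (1).
import torch
from torchvision import models, transforms
from PIL import Image

model = models.vgg16(weights=models.VGG16_Weights.IMAGENET1K_V1)
# Drop the final 1000-way classification layer so the forward pass stops at the
# last hidden fully connected layer (4096-D output).
model.classifier = torch.nn.Sequential(*list(model.classifier.children())[:-1])
model.eval()

preprocess = transforms.Compose([
    transforms.Resize((224, 224)),   # resize scene images to the CNN input size
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406],
                         std=[0.229, 0.224, 0.225]),
])

def cnn_feature(image_path: str) -> torch.Tensor:
    """Return the 4096-D convolutional feature vector of one scene image."""
    img = Image.open(image_path).convert("RGB")
    with torch.no_grad():
        x = model(preprocess(img).unsqueeze(0))   # shape (1, 4096)
    return x.squeeze(0)
```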
2.3. Unsupervised feature learning using sparse AE

In this phase, we identify a suitable model to learn the underlying structure of the CNN feature vectors $\{x_i\}_{i=1}^{n+m}$ of the images $\{I_i\}_{i=1}^{n+m}$. To this end, we use a sparse AE, which is a symmetrical neural network mainly used for learning the features of a data set in an unsupervised manner. Typically, the sparse AE is made up of encoding and decoding parts, respectively. In the encoding part, the input $x_i$ is mapped to the hidden representation $h_i \in \mathbb{R}^{S}$ of dimension $S$ through the non-linear activation function $g$ as follows:

$$h_i = g\big(W^{(e)} x_i + b^{(e)}\big), \qquad (2)$$

where $W^{(e)} \in \mathbb{R}^{S \times D}$ is the encoder weight matrix and $b^{(e)} \in \mathbb{R}^{S}$ is the encoder bias vector. In the decoding part, the hidden representation is mapped back to a reconstruction $\hat{x}_i$ of the input through the decoder weights $W^{(d)}$ and biases $b^{(d)}$. The parameter vector $\theta_{AE} = \{W^{(e)}, W^{(d)}, b^{(e)}, b^{(d)}\}$ is learned by minimizing a cost function (4) that combines the reconstruction error over the training feature vectors with a weight-decay regularization term (weighted by $\lambda$) and a sparsity penalty constraining the average activation of the hidden units to a small target value $\rho$. To combat overfitting, the dropout technique (Srivastava et al. 2014) is applied to the hidden layer during training:

$$\begin{cases} r = \mathrm{Bernoulli}(p) \\ \tilde{h}_i = h_i \odot r \\ \hat{x}_i = g\big(W^{(d)} \tilde{h}_i + b^{(d)}\big), \end{cases} \qquad (6)$$

with $\odot$ denoting an element-wise product and $r \in \mathbb{R}^{S}$ a vector of independent Bernoulli random variables. $\mathrm{Bernoulli}(p)$ is a probability distribution that takes the value of 1 with a success probability of $p$ (usually set to 0.5) and 0 otherwise. At test time, the weights are set as $pW^{(d)}$ and the network is used without dropout.

To optimize the cost function in (4), we first initialize the parameter vector $\theta_{AE}$ to small values near zero (e.g. [−0.005, 0.005]). Then one can use a mini-batch gradient-based technique or a second-order optimization method called L-BFGS (Liu and Nocedal 1989), which is a quasi-Newton method based on the BFGS (Nocedal 1980) update procedure. To reduce the number of parameters, the weights learned for the coding layer are simply tied to the decoding layer, i.e. $W^{(d)} = W^{(e)T}$ (the superscript $T$ denotes a matrix transpose operation), during the optimization process.

It is worth noting that one can build a deep learning architecture with several hidden layers. Here, several AEs are trained in greedy layer-wise unsupervised mode, leading to what we call stacked AEs (SAEs). The main idea is to learn a hierarchy of features one level at a time. Specifically, the learning process starts by training the first AE in an unsupervised
manner by optimizing (4) with the original input data to obtain the first hidden representation layer. Then the reconstruction layer of this AE is removed and the obtained hidden layer is used as the input data for training the next AE to generate higher-level representations, and so on. Finally, the learned feature representations can be fed as input to a linear classifier such as SVM for performing classification. Alternatively, one can add on the top of the resulting hidden representation layers a logistic/softmax regression layer to perform binary/multi-class classification. In the next sections, we present this scheme with one hidden layer, in addition to another one based on the AE reconstruction error.
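As a rough illustration of this unsupervised pre-training stage, the sketch below implements a single-hidden-layer AE in PyTorch with tied weights, dropout on the hidden layer, and a KL-divergence sparsity penalty that drives the mean hidden activation towards a small target ρ. The exact cost in (4) is not reproduced here, so the sparsity weight `beta`, the optimizer choice, and the use of inverted dropout (PyTorch rescales activations at training time rather than scaling weights by p at test time) are assumptions made for demonstration.

```python
# Minimal sparse autoencoder sketch with tied weights and hidden-layer dropout.
# beta (sparsity weight) and the Adam optimizer are illustrative assumptions.
import torch
import torch.nn.functional as F

class SparseAE(torch.nn.Module):
    def __init__(self, D=4096, S=256, p_drop=0.5):
        super().__init__()
        self.W = torch.nn.Parameter(0.005 * torch.randn(S, D))  # encoder weights (tied)
        self.b_e = torch.nn.Parameter(torch.zeros(S))
        self.b_d = torch.nn.Parameter(torch.zeros(D))
        self.drop = torch.nn.Dropout(p_drop)                    # dropout on the hidden layer

    def forward(self, x):
        h = torch.sigmoid(F.linear(x, self.W, self.b_e))                     # Equation (2)
        x_hat = torch.sigmoid(F.linear(self.drop(h), self.W.t(), self.b_d))  # tied decoder
        return x_hat, h

def sparse_ae_loss(x, x_hat, h, rho=0.001, beta=3.0, lam=0.01, W=None):
    recon = F.mse_loss(x_hat, x)                                 # reconstruction error
    rho_hat = h.mean(dim=0).clamp(1e-6, 1 - 1e-6)                # mean hidden activation
    kl = (rho * torch.log(rho / rho_hat)
          + (1 - rho) * torch.log((1 - rho) / (1 - rho_hat))).sum()
    decay = lam * (W ** 2).sum() if W is not None else 0.0       # weight decay
    return recon + beta * kl + decay

# Usage sketch: features is an (n+m, 4096) tensor of CNN feature vectors in [0, 1].
# ae = SparseAE(); opt = torch.optim.Adam(ae.parameters(), lr=1e-3)
# x_hat, h = ae(features); loss = sparse_ae_loss(features, x_hat, h, W=ae.W)
# opt.zero_grad(); loss.backward(); opt.step()
```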
2.4. Discriminative classification

After the pre-training phase, we add on the top of the hidden representation layer of the AE a softmax regression layer to perform multi-class classification, yielding an NN with one hidden layer tailored to task-specific supervised learning, as shown in Figure 2. We term this method CNN-NN (Neural Network). Then we fine-tune the resulting network using back-propagation by minimizing the following cost function (Vincent et al. 2008):

$$L_2(\theta_{NN}) = -\frac{1}{n} \sum_{i=1}^{n} \sum_{j=1}^{c} 1(y_i = j)\,\ln\!\left( \frac{\exp\big(g_{\theta_{NN}}^{j}(x_i)\big)}{\sum_{k=1}^{c} \exp\big(g_{\theta_{NN}}^{k}(x_i)\big)} \right), \qquad (7)$$

where $\theta_{NN} = \{W^{(e)}, b^{(e)}, W^{softmax}, b^{softmax}\}$ represents the parameters of the discriminative NN. $W^{softmax}$ and $b^{softmax}$ represent the weights and biases of the softmax layer, respectively. Equation (7) is the cross-entropy loss for the softmax layer, $1(\cdot)$ is an indicator function that takes 1 if the statement is true and 0 otherwise, and $g_{\theta_{NN}}^{j}(x_i)$ is the $j$th output of the NN for an input $x_i$. The estimation of the vector of parameters $\theta_{NN}$ of the NN starts by initializing the weights of the softmax layer to small random values, whereas the weights of the hidden layer are initialized with the encoding weights obtained in the pre-training phase. Then the cost (7) is minimized with a mini-batch gradient descent algorithm. We recall that we use the dropout technique as described previously for combatting overfitting problems and increasing the generalization ability of the network. At test time, we feed each test CNN feature vector to this network and assign it to the class yielding the maximum posterior probability.
Figure 2. The discriminative classification approach showing the combination of the encoding layer of the AE with the softmax classification layer.
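A compact sketch of this discriminative scenario follows: the pre-trained encoder from the sparse AE sketch above is stacked with a randomly initialized softmax layer, and the whole network is fine-tuned with the cross-entropy loss of Equation (7). The `SparseAE` class and the hyperparameter values reuse the earlier sketch and the settings reported in Section 3.2; the SGD call and initialization scale are assumptions.

```python
# CNN-NN sketch: encoder initialized from the pre-trained sparse AE, softmax on top,
# fine-tuned end-to-end with cross-entropy (Equation (7)).
import torch

class CNN_NN(torch.nn.Module):
    def __init__(self, pretrained_ae, n_classes=21, p_drop=0.5):
        super().__init__()
        S, D = pretrained_ae.W.shape
        self.encoder = torch.nn.Linear(D, S)
        self.encoder.weight.data.copy_(pretrained_ae.W.data)   # init from pre-training
        self.encoder.bias.data.copy_(pretrained_ae.b_e.data)
        self.drop = torch.nn.Dropout(p_drop)
        self.softmax = torch.nn.Linear(S, n_classes)            # softmax layer
        torch.nn.init.normal_(self.softmax.weight, std=0.005)   # small random init

    def forward(self, x):
        h = self.drop(torch.sigmoid(self.encoder(x)))
        return self.softmax(h)                                  # class scores (logits)

# Fine-tuning sketch with mini-batch SGD (learning rate 0.01, momentum 0.5).
# net = CNN_NN(ae); opt = torch.optim.SGD(net.parameters(), lr=0.01, momentum=0.5)
# loss = torch.nn.functional.cross_entropy(net(x_batch), y_batch)   # Equation (7)
# opt.zero_grad(); loss.backward(); opt.step()
# At test time, the predicted class is net(x_test).argmax(dim=1).
```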
Figure 3. The classification by the reconstruction approach showing the utilization of several class-specific AEs to generate a reconstruction error vector. The index of the minimum value in the reconstruction error vector represents the class label.
2.5. Classification by reconstruction

In this approach, termed CNN-AE, we view the classification problem from a reconstruction point of view, in a similar way to the compressive sensing approach proposed in Mekhalfi et al. (2015). To this end, we train $c$ class-specific AEs (see Figure 3) using the same procedure as in the pre-training phase. However, in this case, for each AE we use only the training patterns of class $j = 1, \ldots, c$. During the optimization process, the weights $\{W_j^{(e)}, W_j^{(d)}, b_j^{(e)}, b_j^{(d)}\}$ of the $j$th AE are initialized by the set of encoding and decoding weights of the AE obtained in the pre-training phase. At test time, we feed each CNN test feature vector to these $c$ class-specific AEs to generate a reconstruction error vector of dimension $c$. Then we assign the CNN test feature vector to the class equal to the index of the minimum value in this reconstruction error vector.
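The decision rule of CNN-AE can be summarized in a few lines: each class-specific AE reconstructs the test feature vector, and the predicted label is the index of the smallest reconstruction error. The sketch below assumes the `SparseAE` class and `sparse_ae_loss` function introduced earlier, takes torch tensors as inputs, and uses the mean squared error as the reconstruction measure; the number of epochs and the full-batch update are placeholders, not values from the paper.

```python
# CNN-AE sketch: one sparse AE per class; classify by minimum reconstruction error.
import torch

def train_class_specific_aes(features, labels, n_classes, pretrained_ae, n_epochs=50):
    """Train one AE per class on that class's CNN feature vectors (Figure 3)."""
    aes = []
    for j in range(n_classes):
        ae_j = SparseAE(D=features.shape[1], S=256)
        ae_j.load_state_dict(pretrained_ae.state_dict())   # init from pre-training
        opt = torch.optim.Adam(ae_j.parameters(), lr=1e-3)
        x_j = features[labels == j]                        # training patterns of class j
        for _ in range(n_epochs):
            x_hat, h = ae_j(x_j)
            loss = sparse_ae_loss(x_j, x_hat, h, W=ae_j.W)
            opt.zero_grad(); loss.backward(); opt.step()
        aes.append(ae_j.eval())
    return aes

def classify_by_reconstruction(x_test, aes):
    """Return, for each test vector, the class whose AE reconstructs it best."""
    with torch.no_grad():
        errors = torch.stack([((ae(x_test)[0] - x_test) ** 2).mean(dim=1) for ae in aes])
    return errors.argmin(dim=0)   # index of the minimum reconstruction error
```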
3. Experiments

3.1. Data set description

(1) The University of California (UC) Merced LU data set: The data set was manually derived by Yang and Newsam (2010, 2011) from another data set of large aerial orthoimagery with a pixel resolution of 30 cm. It was downloaded from the US Geological Survey national map of the following US regions: Birmingham, Boston, Buffalo, Columbus, Dallas, Harrisburg, Houston, Jacksonville, Las Vegas, Los Angeles, Miami, Napa, New York, Reno, San Diego, Santa Barbara, Seattle, Tampa, Tucson, and Ventura. It consists of 2100 RGB images of size (w × h) = (256 × 256) pixels each,
Figure 4. Image samples from the UC Merced data set.
categorized into 21 classes (100 images per class). The class labels are as follows: agricultural, airplane, baseball diamond, beach, buildings, chaparral, dense residential, forest, freeway, golf course, harbour, intersection, medium-density residential, mobile home park, overpass, parking lot, river, runway, sparse residential, storage tanks, and tennis court. Sample images of this LU database are shown in Figure 4.

(2) Banja-Luka data set: The database consists of 606 RGB aerial images of size (w × h) = (128 × 128) pixels. This database was constructed from a part of Banja Luka city, Bosnia and Herzegovina (Risojević, Momić, and Babić 2011). It is composed of six classes, which are: houses (143 images), cemetery (28 images), industry (75 images), field (178 images), river (77 images), and trees (105 images). As for UC Merced, sample images of this database are shown in Figure 5.
Figure 5. Image samples from the Banja-Luka data set: houses, cemetery, industry, field, river, and trees.
3.2. Experiment setup

For generating the convolutional features, we use the pre-learned deep CNN model of Chatfield et al. (2014), composed of eight layers. Specifically, this deep CNN uses five convolutional layers with filters of dimensions (number of filters × filter height × filter width) 96 × 7 × 7, 256 × 5 × 5, 512 × 3 × 3, 512 × 3 × 3, and 512 × 3 × 3, and three fully connected layers with the following numbers of hidden nodes (fc1: 4096; fc2: 4096; and softmax: 1000). This model was pre-learned on the ILSVRC-12 challenge data set (Deng et al. 2009). It was trained on 1.2 million RGB images of size 224 × 224 pixels belonging to 1000 classes. These classes describe general images such as beaches, dogs, cats, cars, shopping carts, and minivans. As can be seen, this auxiliary domain is completely different from the remote-sensing data sets used in the experiments. Thanks to the deep architecture of this CNN, one can extract features at different representation levels. In this work, we resize the images of both data sets to 224 × 224, feed them to this deep CNN, and consider the output of the fully connected layer, which produces a feature vector of dimension D = 4096.

For CNN-NN, we set the target activation ρ to 0.001, the dropout probability p to 0.5, and the regularization parameter λ to 0.01. Then we normalize each CNN feature independently in the range [0, 1] and use a sigmoid activation function for the hidden layer. For the back-propagation algorithm, we use a mini-batch gradient optimization method (learning rate, 0.01; momentum, 0.5; and mini-batch size, 200). For CNN-AE, we experimentally found that it is better to apply PCA whitening to the normalized features and then consider a tanh activation function for the hidden layer. In addition, as we are training one AE per class, we set the mini-batch size to 10. For both schemes, we set the dimension of the hidden layer to S = 256. However, in the experiments we also assess the effect of this parameter on the performances of CNN-NN and CNN-AE.
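Since the two schemes differ in their preprocessing (per-feature [0, 1] normalization for CNN-NN versus additional PCA whitening for CNN-AE), a small numpy sketch of these two steps may help. The regularization constants `eps` are assumed values for numerical stability, not taken from the paper.

```python
# Preprocessing sketch: per-feature min-max scaling to [0, 1] (CNN-NN),
# optionally followed by PCA whitening (CNN-AE). eps values are assumptions.
import numpy as np

def minmax_normalize(X, eps=1e-8):
    """Scale each of the D feature dimensions independently to [0, 1]."""
    lo, hi = X.min(axis=0), X.max(axis=0)
    return (X - lo) / (hi - lo + eps)

def pca_whiten(X, eps=1e-5):
    """Decorrelate the features and give them approximately unit variance."""
    Xc = X - X.mean(axis=0)
    cov = np.cov(Xc, rowvar=False)
    eigvals, eigvecs = np.linalg.eigh(cov)
    return Xc @ eigvecs / np.sqrt(eigvals + eps)

# X = np.stack([...])            # (n+m, 4096) CNN feature vectors
# X_nn = minmax_normalize(X)     # input to CNN-NN
# X_ae = pca_whiten(X_nn)        # input to CNN-AE
```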
In the experiments, we compare our results to state-of-the-art methods in terms of classification accuracy. For the UC Merced data set, we use two types of cross-validation. The first is a fivefold validation scheme, where the database is randomly divided into five folds, each containing 20 images per class; one subset is adopted as a test set and the remaining four subsets are used as a training set. The second is a twofold validation scheme, where the database is randomly split into two equal subsets; one subset is used as a training set and the other is considered as a test set. For the Banja-Luka data set, for the sake of consistency we use tenfold cross-validation as performed in Risojević, Momić, and Babić (2011). For both data sets, we perform these operations five times and then consider the average classification results.
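For reference, one way to script this repeated, stratified protocol (five folds of 20 images per class, each fold used once as the test set, the whole procedure repeated five times and the accuracies averaged) is sketched below. `train_and_evaluate` stands for any of the classifiers discussed in this article and is a placeholder, not a function from the paper; the exact repetition scheme used by the authors may differ.

```python
# Cross-validation sketch for the UC Merced fivefold protocol, repeated five times.
import numpy as np
from sklearn.model_selection import StratifiedKFold

def repeated_kfold_accuracy(X, y, train_and_evaluate, n_folds=5, n_repeats=5):
    accuracies = []
    for repeat in range(n_repeats):
        skf = StratifiedKFold(n_splits=n_folds, shuffle=True, random_state=repeat)
        for train_idx, test_idx in skf.split(X, y):
            acc = train_and_evaluate(X[train_idx], y[train_idx],
                                     X[test_idx], y[test_idx])
            accuracies.append(acc)
    return float(np.mean(accuracies))   # average classification accuracy
```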
3.3. Results

In the following paragraphs, we report and discuss the results obtained by CNN-NN and CNN-AE. We recall that all experiments were conducted on an HP workstation with an Intel Xeon 2.40 GHz processor and 20.00 GB of RAM.

(1) UC Merced data set: Figure 6 shows the CNN features obtained from the pre-learned model as well as the features learned by the AE. A preliminary inspection shows that the learned features are different for these sample images belonging to different classes. In terms of classification accuracy and using fivefold validation, CNN-NN yields an overall accuracy of 97.19%, whereas CNN-AE yields 95.05%. The confusion matrices (see Figure 7) show that CNN-NN confuses mainly the 'dense residential' and 'medium residential' areas. Regarding CNN-AE, besides dense residential, it does not perform well for the 'tennis court' and 'building' classes.
Figure 6. Feature vectors generated by CNN and AE for the UC Merced data set: (a) input images (agriculture, building, baseball diamond, and beach), (b) features generated by CNN, and (c) features generated by AE.
Figure 7. Confusion matrices obtained by: (a) CNN-NN and (b) CNN-AE for the UC Merced data set. Darker grey-scales indicate accurate classification whereas lighter grey-scales indicate low accuracy.
Table 1. Classification results obtained for the UC Merced data set.

Method                          Description                        OA (%)   Validation
Yang and Newsam 2011            Spatial BOVW                       81.19    5-Fold
Cheng et al. 2014               Collection of part detectors       91.33
Chen and Tian 2015              Pyramid of spatial relations       89.10
Mekhalfi et al. 2015            CS multi-feature fusion            94.33
Cheng et al. 2015               Partlets-based method              91.33
Zhang, Du, and Zhang 2015       Gradient boosting RCN              94.53
Luus et al. 2015                Multi-view deep learning           93.48
Marmanis et al. 2016            CNN+CNN                            92.40
Pre-trained CNN models          CNN+SVM                            95.09
                                CNN+One fully connected layer      95.00
                                CNN+Two fully connected layers     94.72
The proposed method             CNN-AE                             95.05
                                CNN-NN                             97.19
Zhao, Tang, and Huo 2014        Concentric multi-scale BOVW        86.64    2-Fold
Mekhalfi et al. 2015            CS multi-feature fusion            91.10
Cheng et al. 2015               Partlets-based method              88.76
Pre-trained CNN models          CNN-SVM                            92.83
                                CNN+One fully connected layer      92.99
                                CNN+Two fully connected layers     92.88
The proposed method             CNN-AE                             93.90
                                CNN-NN                             95.10
The accuracies for these two classes are 73% and 83%, respectively. In Table 1, we provide the classification results using the twofold validation scheme used by Zhao, Tang, and Huo (2014) and Mekhalfi et al. (2015). Here the overall accuracies are 95.1% and 93.9% for CNN-NN and CNN-AE, respectively. In Table 1, we also compare the classification results obtained by the proposed method against several state-of-the-art methods. Furthermore, we compare our results against two other transfer learning schemes based on CNN. In the first scheme, we add on the top of the last fully connected layer of the CNN one/two fully connected layers; then we fine-tune these layers using the available target training data (Oquab et al. 2014). In the second scheme, we train a linear SVM classifier on the features extracted from the last fully connected layer. The results reported in Table 1 for both validation schemes show clearly that our proposed approach yields promising results compared to state-of-the-art methods.

(2) Banja-Luka data set: As for UC Merced, we show in Figure 8 the features learned by the AE for sample images of this data set. In Table 2, we report the classification results in terms of overall as well as average accuracies, as the number of images is not the same for all classes. The (overall and average) accuracies obtained by CNN-NN and CNN-AE are equal to (98.67% and 98.44%) and (96.33% and 93.9%), respectively. Figure 9 depicts the confusion matrices obtained for both methods. As can be seen, for CNN-NN the accuracies of all classes are above 95%. On the other hand, CNN-AE does not perform well on the 'cemetery' class, as its accuracy is equal to 80%. Here again, both methods yield better results compared with state-of-the-art methods based on handcrafted features. Comparing our results to the transfer methods based on CNN, we observe that CNN-NN performs well, unlike CNN-AE, which produces a less competitive average accuracy. However, in the next experiment we will show that the performance of CNN-AE can be boosted by increasing the number of nodes in the hidden layer.

(3) Sensitivity analysis with respect to the number of hidden nodes: To analyse the proposed approach further, we repeat the above experiments but with a different
Figure 8. Feature vectors generated by CNN and AE for the Banja-Luka data set: (a) input images (cemetery, field, houses, and river); (b) features generated by CNN; and (c) features generated by AE.

Table 2. Classification results obtained for the Banja-Luka data set.

Method                              Description                       (OA, AA) (%)     Validation
Risojević, Momić, and Babić 2011    Multispectral Gabor               (–, 88.00)       10-Fold
                                    Multispectral Gist                (–, 89.30)
Pre-trained CNN models              CNN-SVM                           (95.33, 93.17)
                                    CNN+One fully connected layer     (96.33, 96.32)
                                    CNN+Two fully connected layers    (96.00, 94.33)
The proposed method                 CNN-AE                            (96.33, 93.68)
                                    CNN-NN                            (98.67, 98.44)
number of hidden nodes (i.e. 64, 128, 512, and 1024 nodes). Figures 10 and 11 show the results obtained for CNN-NN and CNN-AE, respectively. From Figure 10, we observe that changing the number of nodes does not greatly affect the classification accuracy of CNN-NN. For both data sets, it appears that setting S to 256 is a good compromise between accuracy and computational cost. Regarding CNN-AE, the situation is different, as it performs better when increasing the number of hidden nodes. It appears that S = 512 is a good choice, since for the UC Merced data set the overall accuracy reaches 96.29% for S = 512, corresponding to an increase of around 1% compared to S = 256. For Banja Luka, the average accuracy reaches 96.51%, which represents an increase of about 3% compared with S = 256.
Figure 9. Confusion matrices obtained by: (a) CNN-NN and (b) CNN-AE for the Banja-Luka data set. Darker grey-scales indicate accurate classification whereas lighter grey-scales indicate low accuracy.
Figure 10. Sensitivity analysis of CNN-NN with respect to the number of hidden nodes: (a) UC Merced, and (b) Banja-Luka data sets.
4. Conclusions

In this article we have proposed a novel transfer learning approach for LU scene classification. This approach exhibits the following interesting properties.
Figure 11. Sensitivity analysis of CNN-AE with respect to the number of hidden nodes: (a) UC Merced, and (b) Banja-Luka data sets.
It relies on a CNN pre-trained on an auxiliary domain to convert the images under analysis into convolutional features. Unlike the methods proposed in the literature based on handcrafted features, it adapts the features of the auxiliary domain using an AE rather than updating the decision function of the classifier. To this end, it tailors the AE to view the classification problem from discriminative and reconstruction perspectives. Specifically, the proposed solutions are termed CNN-NN and CNN-AE, respectively. The experimental results obtained on LU data sets allow us to draw the following conclusions: 1) both CNN-NN and CNN-AE clearly perform better than state-of-the-art methods based on handcrafted features, confirming the power of deep learning; 2) for both data sets, CNN-NN exhibits a better behaviour compared with CNN-AE; 3) however, CNN-AE is interesting from a practical point of view as, compared with CNN-NN, it allows adding new classes without retraining the entire system. Finally, for future developments we suggest: 1) exploring configurations based on several SAEs; 2) developing multi-feature configurations as done in Mekhalfi et al. (2015); and 3) improving the capabilities of CNN-AE by introducing the energy concept (Kamyshanska and Memisevic 2015) as an alternative solution for comparing the scores of the different class-specific AEs.
Acknowledgement The authors would like to thank A. Vedaldi and K. Lenc (Vedaldi and Lenc 2014) for making available the software MatConvNet used in the context of this work. The authors would like to acknowledge the support from the Distinguished Scientist Fellowship Program at King Saud University.
Disclosure statement No potential conflict of interest was reported by the authors.
Funding This work was supported by the Deanship of Scientific Research of the King Saud University through the International Research Group [Project IRG15-20].
ORCID Yakoub Bazi http://orcid.org/0000-0001-9287-0596 Farid Melgani http://orcid.org/0000-0001-9745-3732
References

Bai, J., Y. Wu, J. Zhang, and F. Chen. 2015. "Subset Based Deep Learning for RGB-D Object Recognition." Neurocomputing 165: 280–292. doi:10.1016/j.neucom.2015.03.017.
Brosch, T., and R. Tam. 2015. "Efficient Training of Convolutional Deep Belief Networks in the Frequency Domain for Application to High-Resolution 2D and 3D Images." Neural Computation 27: 211–227. doi:10.1162/NECO_a_00682.
Chatfield, K., K. Simonyan, A. Vedaldi, and A. Zisserman. 2014. "Return of the Devil in the Details: Delving Deep into Convolutional Nets." Proceedings BMVC, Nottingham, September.
Chen, S., and Y. Tian. 2015. "Pyramid of Spatial Relations for Scene-Level Land Use Classification." IEEE Transactions on Geoscience and Remote Sensing 53 (4): 1947–1957. doi:10.1109/TGRS.2014.2351395.
Chen, X., S. Xiang, C.-L. Liu, and C.-H. Pan. 2014. "Vehicle Detection in Satellite Images by Hybrid Deep Convolutional Neural Networks." IEEE Geoscience and Remote Sensing Letters 11 (10): 1797–1801. doi:10.1109/LGRS.2014.2309695.
Chen, Y., Z. Lin, X. Zhao, G. Wang, and Y. Gu. 2014. "Deep Learning-Based Classification of Hyperspectral Data." IEEE Journal of Selected Topics in Applied Earth Observations and Remote Sensing 7 (6): 2094–2107. doi:10.1109/JSTARS.2014.2329330.
Chen, Y., X. Zhao, and X. Jia. 2015. "Spectral–Spatial Classification of Hyperspectral Data Based on Deep Belief Network." IEEE Journal of Selected Topics in Applied Earth Observations and Remote Sensing 8 (6): 2381–2392. doi:10.1109/JSTARS.2015.2388577.
Cheng, G., J. Han, L. Guo, Z. Liu, S. Bu, and J. Ren. 2015. "Effective and Efficient Midlevel Visual Elements-Oriented Land-Use Classification Using VHR Remote Sensing Images." IEEE Transactions on Geoscience and Remote Sensing 53 (8): 4238–4249. doi:10.1109/TGRS.2015.2393857.
Cheng, G., J. Han, P. Zhou, and L. Guo. 2014. "Multi-Class Geospatial Object Detection and Geographic Image Classification Based on Collection of Part Detectors." ISPRS Journal of Photogrammetry and Remote Sensing 98: 119–132. doi:10.1016/j.isprsjprs.2014.10.002.
Deng, J., W. Dong, R. Socher, L. J. Li, K. Li, and L. Fei-Fei. 2009. "Imagenet: A Large-Scale Hierarchical Image Database." Proceedings CVPR, Miami, FL, June 20–25.
Duan, L., I. W. Tsang, and D. Xu. 2012. "Domain Transfer Multiple Kernel Learning." IEEE Transactions on Pattern Analysis and Machine Intelligence 34 (3): 465–479. doi:10.1109/TPAMI.2011.114.
Duan, L., D. Xu, I. H. Tsang, and J. Luo. 2012. "Visual Event Recognition in Videos by Learning from Web Data." IEEE Transactions on Pattern Analysis and Machine Intelligence 34 (9): 1667–1680. doi:10.1109/TPAMI.2011.265.
Duan, L., D. Xu, and W. Tsang. 2012. "Domain Adaptation from Multiple Sources: A Domain-Dependent Regularization Approach." IEEE Transactions on Neural Networks and Learning Systems 23 (3): 504–518. doi:10.1109/TNNLS.2011.2178556.
Farabet, C., C. Couprie, L. Najman, and Y. LeCun. 2013. "Learning Hierarchical Features for Scene Labeling." IEEE Transactions on Pattern Analysis and Machine Intelligence 35 (8): 1915–1929. doi:10.1109/TPAMI.2012.231.
Gao, S., Y. Zhang, K. Jia, J. Lu, and Y. Zhang. 2015. "Single Sample Face Recognition via Learning Deep Supervised Autoencoders." IEEE Transactions on Information Forensics and Security 10 (10): 2108–2118. doi:10.1109/TIFS.2015.2446438.
Han, J., D. Zhang, G. Cheng, L. Guo, and J. Ren. 2015. "Object Detection in Optical Remote Sensing Images Based on Weakly Supervised Learning and High-Level Feature Learning." IEEE Transactions on Geoscience and Remote Sensing 53 (6): 3325–3337. doi:10.1109/TGRS.2014.2374218.
Hayat, M., M. Bennamoun, and S. An. 2015. "Deep Reconstruction Models for Image Set Classification." IEEE Transactions on Pattern Analysis and Machine Intelligence 37 (4): 713–727. doi:10.1109/TPAMI.2014.2353635.
Hinton, G. E., L. Deng, D. Yu, G. E. Dahl, A. Mohamed, N. Jaitly, A. Senior, V. Vanhoucke, P. Nguyen, T. Sainath, and B. Kingsbury. 2012. "Deep Neural Networks for Acoustic Modeling in Speech Recognition: The Shared Views of Four Research Groups." IEEE Signal Processing Magazine 29 (6): 82–97. doi:10.1109/MSP.2012.2205597.
Hinton, G. E., S. Osindero, and Y. Teh. 2006. "A Fast Learning Algorithm for Deep Belief Nets." Neural Computation 18 (7): 1527–1554. doi:10.1162/neco.2006.18.7.1527.
Huang, X., Q. Lu, and L. Zhang. 2014. "A Multi-Index Learning Approach for Classification of High-Resolution Remotely Sensed Images over Urban Areas." ISPRS Journal of Photogrammetry and Remote Sensing 90: 36–48. doi:10.1016/j.isprsjprs.2014.01.008.
Huang, W., L. Xiao, Z. Wei, H. Liu, and S. Tang. 2015. "A New Pan-Sharpening Method with Deep Neural Networks." IEEE Geoscience and Remote Sensing Letters 12 (5): 1037–1041. doi:10.1109/LGRS.2014.2376034.
Kamyshanska, H., and R. Memisevic. 2015. "The Potential Energy of an Autoencoder." IEEE Transactions on Pattern Analysis and Machine Intelligence 37 (6): 1261–1273. doi:10.1109/TPAMI.2014.2362140.
Krizhevsky, A., I. Sutskever, and G. E. Hinton. 2012. "Imagenet Classification with Deep Convolutional Neural Networks." NIPS, 1106–1114.
Liu, D. C., and J. Nocedal. 1989. "On the Limited Memory BFGS Method for Large Scale Optimization." Mathematical Programming 45: 503–528. doi:10.1007/BF01589116.
Luus, F. P. S., B. P. Salmon, F. Van Den Bergh, and B. T. J. Maharaj. 2015. "Multiview Deep Learning for Land-Use Classification." IEEE Geoscience and Remote Sensing Letters 12 (12): 2448–2452. doi:10.1109/LGRS.2015.2483680.
Lv, Y., Y. Duan, W. Kang, Z. Li, and F. Y. Wang. 2015. "Traffic Flow Prediction with Big Data: A Deep Learning Approach." IEEE Transactions on Intelligent Transportation Systems 16 (2): 865–873.
Marmanis, D., M. Datcu, T. Esch, and U. Stilla. 2016. "Deep Learning Earth Observation Classification Using ImageNet Pretrained Networks." IEEE Geoscience and Remote Sensing Letters 13 (1): 105–109. doi:10.1109/LGRS.2015.2499239.
Mekhalfi, M. L., F. Melgani, Y. Bazi, and N. Alajlan. 2015. "Land-Use Classification with Compressive Sensing Multifeature Fusion." IEEE Geoscience and Remote Sensing Letters 12 (10): 2155–2159. doi:10.1109/LGRS.2015.2453130.
Nocedal, J. 1980. "Updating Quasi-Newton Matrices with Limited Storage." Mathematics of Computation 35: 773–782. doi:10.1090/S0025-5718-1980-0572855-7.
Oquab, M., L. Bottou, I. Laptev, and J. Sivic. 2014. "Learning and Transferring Mid-Level Image Representations Using Convolutional Neural Networks." IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Columbus, OH, June 23–28, 1717–1724.
Risojević, V., S. Momić, and Z. Babić. 2011. "Gabor Descriptors for Aerial Image Classification." Adaptive and Natural Computing Algorithms 6594: 51–60. Berlin: Springer.
Schmidhuber, J. 2015. "Deep Learning in Neural Networks: An Overview." Neural Networks 61: 85–117. doi:10.1016/j.neunet.2014.09.003.
Schweikert, G., C. Widmer, B. Scholkopf, and G. Ratsch. 2008. "An Empirical Analysis of Domain Adaptation Algorithms for Genomic Sequence Analysis." Proceedings of Advances in Neural Information Processing Systems, 1433–1440.
Srivastava, N., G. Hinton, A. Krizhevsky, I. Sutskever, and R. Salakhutdinov. 2014. "Dropout: A Simple Way to Prevent Neural Networks from Overfitting." The Journal of Machine Learning Research 15 (1): 1929–1958.
Tang, J., C. Deng, G.-B. Huang, and B. Zhao. 2015. "Compressed-Domain Ship Detection on Spaceborne Optical Image Using Deep Neural Network and Extreme Learning Machine." IEEE Transactions on Geoscience and Remote Sensing 53 (3): 1174–1185. doi:10.1109/TGRS.2014.2335751.
Vedaldi, A., and K. Lenc. 2014. "MatConvNet – Convolutional Neural Networks for MATLAB." Proceedings of the ACM International Conference on Multimedia.
Vincent, P., H. Larochelle, Y. Bengio, and P. A. Manzagol. 2008. "Extracting and Composing Robust Features with Denoising Autoencoders." Proceedings of the 25th ACM International Conference on Machine Learning, Helsinki, July 5–9, 1096–1103.
Yang, J., R. Yan, and A. G. Hauptmann. 2007. "Cross-Domain Video Concept Detection Using Adaptive SVMs." Proceedings of the ACM International Conference on Multimedia, 188–197.
Yang, Y., and S. Newsam. 2010. "Bag-Of-Visual-Words and Spatial Extensions for Land-Use Classification." Proceedings of the 18th ACM SIGSPATIAL International Conference on Advances in Geographic Information Systems, San Jose, CA, November 2–5, 270–279.
Yang, Y., and S. Newsam. 2011. "Spatial Pyramid Co-Occurrence for Image Classification." Proceedings of the IEEE International Conference on Computer Vision, Barcelona, November 6–13, 1465–1472.
Zhang, F., B. Du, and L. Zhang. 2015. "Scene Classification via a Gradient Boosting Random Convolutional Network Framework." IEEE Transactions on Geoscience and Remote Sensing PP (99): 1–10.
Zhao, L., P. Tang, and L. Huo. 2014. "Land-Use Scene Classification Using a Concentric Circle-Structured Multiscale Bag-Of-Visual-Words Model." IEEE Journal of Selected Topics in Applied Earth Observations and Remote Sensing 7 (12): 4620–4631. doi:10.1109/JSTARS.2014.2339842.