International Journal of Geo-Information

Article

Joint Alternate Small Convolution and Feature Reuse for Hyperspectral Image Classification

Hongmin Gao †, Yao Yang †, Chenming Li *, Hui Zhou and Xiaoyu Qu
College of Computer and Information, Hohai University, Nanjing 211100, China; [email protected] (H.G.); [email protected] (Y.Y.); [email protected] (H.Z.); [email protected] (X.Q.)
* Correspondence: [email protected]; Tel.: +86-25-5809-9136
† These authors contributed equally to this work and should be considered co-first authors.
Received: 15 July 2018; Accepted: 24 August 2018; Published: 26 August 2018


Abstract: A hyperspectral image (HSI) contains fine and rich spectral and spatial information of ground objects, which gives it great application potential; it is widely used in precision agriculture, marine monitoring, military reconnaissance and many other fields. In recent years, convolutional neural networks (CNNs) have been successfully used in HSI classification and have markedly improved classification performance. To overcome the constraint of strong correlation among bands in HSI classification, an effective CNN architecture is proposed in this work. The proposed CNN architecture has several distinct advantages. First, each 1D spectral vector that corresponds to a pixel in an HSI is transformed into a 2D spectral feature matrix, thereby emphasizing the difference among samples. This transformation not only weakens the influence of strong correlation among bands on classification, but also fully utilizes the spectral information of hyperspectral data. Furthermore, a 1 × 1 convolutional layer is adopted to better deal with HSI information. All the convolutional layers in the proposed CNN architecture are composed of small convolutional kernels. Moreover, cascaded composite layers of the architecture consist of 1 × 1 and 3 × 3 convolutional layers. The inputs and outputs of each composite layer are stitched as the inputs of the next composite layer, thereby accomplishing feature reuse. This special module with joint alternate small convolution and feature reuse can extract high-level features from hyperspectral data meticulously and comprehensively, and alleviates the overfitting problem to an extent, in order to obtain a considerable classification effect. Finally, global average pooling is used at the end of the architecture to replace the traditional fully connected layer, reducing the model parameters while extracting high-dimensional features from the hyperspectral data.
Experimental results on three benchmark HSI datasets show the high classification accuracy and effectiveness of the proposed method.

Keywords: hyperspectral image classification; strong correlation among bands; convolutional neural network; alternate small convolutions; feature reuse

ISPRS Int. J. Geo-Inf. 2018, 7, 349; doi:10.3390/ijgi7090349

1. Introduction

Hyperspectral remote-sensing technology has been an important part of comprehensive Earth observation research since the 1980s. It is also a key point in the competition for international Earth observation technology. In addition, the applications of this technology have been gradually extended to many fields, such as environmental monitoring, land use, resource investigation, and atmospheric research. Hyperspectral image (HSI) data have the characteristic of combining images and spectra; that is, each pixel on an HSI corresponds to a spectral curve. Because of this characteristic, the class of ground truth can be identified on the basis of its spectral reflectance. Each pixel in an HSI corresponds to hundreds or even thousands of bands. Moreover, the spectral resolution of an HSI reaches the nanoscale and contains abundant and detailed ground-truth information. However, traditional multispectral data contain only a few spectral bands and have low spectral resolution. Therefore, hyperspectral data are more advantageous than multispectral data for identifying and classifying ground truth. With hyperspectral data, a substantial variety of fine ground truth can be identified, and various samples and band combinations can be selected flexibly to obtain different features and fulfill varying task demands during ground-truth recognition and classification. Furthermore, with the development of big data technology, HSI will have a bright future in remote-sensing big data research because of its considerable potential application value.

HSIs bring substantial opportunities to ground-truth recognition and classification because of their rich information, but remain a challenge to traditional remote-sensing image classification methods due to the large number of bands and insufficient training samples of HSI data. On the one hand, as the number of bands increases, the classification accuracy obtained by directly using the information of all the bands is likely to decrease, the so-called Hughes phenomenon [1]. On the other hand, classification speed also restricts the application of HSIs as the number of bands increases. Many research institutions and scholars have explored HSI classification in theory and application and proposed many mature and classic HSI classification methods to fully utilize the advantages of hyperspectral remote-sensing technology. From the perspective of HSI description space, HSI classification methods can be divided into two categories [2]: methods based on spectral space and those based on feature space. The former utilizes the spectral curve, which reflects the spectral feature of ground truth, to identify the ground truth.
This kind of method includes minimum-distance spectrum matching and spectral angle matching. The latter utilizes the statistical characteristics of ground truth in the feature space to build classification models; representative methods include the decision tree [3], artificial neural network (ANN) method [4], and support vector machine (SVM) classification [5]. The ground truth that corresponds to the same pixel is not unique, and many mixed pixels exist due to the "same objects with different spectra, different objects with the same spectrum" phenomenon and the low spatial resolution of HSIs. Consequently, the classification accuracy of methods based on spectral space is seriously affected. By contrast, classification methods based on feature space are not constrained by these factors and are hence favored by scholars. For instance, Sun et al. [5] proposed a band-weighted SVM (BWSVM) method to classify HSIs. Li et al. [6] adopted a radial basis function to implement a kernel extreme learning machine (KELM), which provided better classification performance on HSIs than kernel SVM. Wei et al. [7] functionalized hyperspectral pixels, adopted functional principal component analysis (FPCA) to reduce the dimensionality of the functional data, and classified the reduced-dimensionality data by KELM. Although scholars have attained many excellent research achievements in HSI classification, the characteristics of hyperspectral data, such as high dimensionality, large computation, and strong correlation among bands, still restrict traditional methods from improving classification accuracy. Furthermore, due to their shallow structure, the above methods have limited feature-extraction ability. Therefore, their classification performance for HSIs easily encounters bottlenecks. The convolutional neural network (CNN), with its deep structure and end-to-end learning mode, has strong feature-learning capability.
In such a learning mode, low-level features are abstracted layer by layer into the high-level features required by relevant tasks. In particular, CNNs have demonstrated remarkable performance in image classification and target detection [8–10], which bodes well for effectively improving the classification performance of HSIs. A CNN is a special feedforward ANN. About 20 years ago, LeCun et al. [11] trained a CNN using the back-propagation algorithm and the gradient learning technique and demonstrated its effectiveness on a handwritten digit recognition task. However, the development of CNNs subsequently declined due to the constraints of computing power and difficulties in the theoretical analysis of neural networks. In 2012, Hinton et al. [12] succeeded in the ImageNet image classification challenge with AlexNet, whose accuracy was approximately 12% higher than that of the immediate runner-up. Since then, CNNs such as NIN (Network in Network) [13], GoogLeNet [14], DeepID2+ [15],
and ResNet [16] have made great historical breakthroughs and drawn a resurgence of attention to various visual recognition tasks. CNNs are being increasingly applied to HSI classification. For example, Chen et al. [17] exploited a CNN to extract the deep features of hyperspectral data and aimed to solve the problems of high dimensionality and limited samples in hyperspectral data. Yue et al. [18] proposed a new classification framework composed of an exponential-momentum deep CNN and an SVM. This framework can construct high-level spectral–spatial features hierarchically in an automated manner. Hu et al. [19] proposed a CNN model to classify HSIs directly in the spectral domain; its architecture contains only one convolutional layer and one fully connected layer, which may make it hard to extract robust spectral features when the number of training samples per class is small. The 3D-CNN was also applied to hyperspectral classification [20–22]; it can extract spectral and spatial features simultaneously and achieved excellent classification performance. Makantasis et al. [23] encoded the spatial and spectral information of hyperspectral images with a CNN and adopted randomized PCA along the spectral dimension to reduce the dimensionality of the input raw data, which is a good idea. The use of CNNs effectively improves HSI classification. However, the strong correlation among bands in hyperspectral data, which is also an important factor that restricts the improvement of HSI classification accuracy, has rarely been studied. Furthermore, CNN application technology in HSI classification is not mature enough, and shortcomings such as weak generalization capability and easy overfitting remain. A new deep CNN architecture that realizes the classification task by learning the hyperspectral data features layer by layer is proposed in this research to solve the aforementioned problems. The major contributions of this work can be summarized as follows.
1. Unlike existing HSI classification methods, this work transforms the 1D spectral vectors of hyperspectral data into 2D spectral feature matrices. The spectral features are mapped from 1D to 2D space, and the variations among different samples, especially those among samples from various classes, are highlighted. This enables the CNN to fully use the spectral information from each band and extract the spectral features of the hyperspectral data accurately. Meanwhile, the interference of highly correlated bands in HSI classification can be weakened.

2. The entire network architecture adopts small convolution kernels with a size of 3 × 3 or 1 × 1 to form convolutional layers. The conversion of the 1D spectral vector to a 2D spectral feature matrix can weaken the interference of highly correlated bands in HSI classification, but cannot eliminate the correlation among bands. Adopting convolutional kernels with different sizes allows the acquisition of local receptive fields with varying sizes. After multilayer abstraction, the correlation among different spectral bands is gradually weakened, and the entire network can learn the features of HSIs meticulously and robustly. Furthermore, cascaded 1 × 1 convolutional layers can increase the non-linearity of the network and make the spectral features of the hyperspectral data increasingly abstract. Simultaneously, the correlation among bands in the hyperspectral data can be weakened and the features of hyperspectral data can be learned effectively.

3. 1 × 1 and 3 × 3 convolutional layers are cascaded to form a special composite layer. The 1 × 1 convolutional layer can integrate the high-level spectral features output by the preceding layer from a global perspective and increase the compactness of the proposed CNN architecture. The 3 × 3 convolutional layer can deeply learn the features integrated by the 1 × 1 convolutional layer in detail from multiple local perspectives.
Multiple composite layers are cascaded so that 1 × 1 and 3 × 3 convolutional layers are stacked in the network alternately. In a cross-layer connection, the input and output of each composite layer are spliced into new features along the feature dimension and passed to the next composite layer, thus accomplishing feature reuse. This combination of alternating small convolutions and feature reuse is called the ASC–FR module. When extracting the features of hyperspectral data, the ASC–FR module can constantly switch the perspective of extraction between the global and local perspectives. Therefore, this module ensures that the spectral features can be fully utilized after multilayer abstraction, the deep features of hyperspectral data are extracted comprehensively and meticulously, and the adverse
effects of strong correlation among bands on classification are weakened. To a certain extent, the overfitting and gradient disappearance in the proposed CNN architecture are alleviated. Thus, this module can improve the accuracy of HSI classification effectively.

The remainder of this paper is organized as follows. Section 2 briefly introduces the advantages of the small convolution kernel and describes some relevant work on CNN-based HSI classification. Section 3 describes the overall design of the proposed method. Section 4 evaluates the classification performance of this work through comparative experiments and analyses. Section 5 concludes the paper.

2. Related Work

2.1. Small Convolution

Generally, a CNN contains an input layer, convolutional layers, pooling layers, fully connected layers and an output layer. Each convolutional layer comprises several convolutional kernels, which extract local features from the input feature maps. Small convolutional kernels (3 × 3 or 5 × 5) have more advantages than large ones. Through multilayer stacking, small convolutional kernels can provide a receptive field of the same size as that provided by large convolutional kernels. Moreover, the use of small convolutional kernels brings two advantages. First, stacking multiple layers that consist of small convolutional kernels increases the network depth, thereby enhancing the capacity and complexity of the network. Second, the model parameters can be reduced effectively by using small convolutions in the model. In VGG-Net [24], proposed by the Visual Geometry Group at the University of Oxford, all convolutional layers adopt 3 × 3 convolutional kernels, thereby reducing network parameters effectively and enhancing the fitting capability of the network. The potential of an even smaller kernel, the 1 × 1 convolutional kernel, was first demonstrated in NIN [13] in 2014.
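The receptive-field and parameter arithmetic behind this preference for small kernels can be sketched as follows (our own illustration, not code from the paper):

```python
# Compare one 5 x 5 convolution with a stack of two 3 x 3 convolutions,
# both mapping C input channels to C output channels.

def conv_params(k, c_in, c_out, bias=True):
    """Weight count of one k x k convolutional layer."""
    return k * k * c_in * c_out + (c_out if bias else 0)

def stacked_receptive_field(kernel_sizes):
    """Receptive field of stacked stride-1 convolutions."""
    rf = 1
    for k in kernel_sizes:
        rf += k - 1
    return rf

C = 64
print(stacked_receptive_field([3, 3]))     # 5 -> same field as one 5 x 5
print(stacked_receptive_field([3, 3, 3]))  # 7 -> same field as one 7 x 7
print(2 * conv_params(3, C, C) < conv_params(5, C, C))  # True: fewer weights
```

Two stacked 3 × 3 layers thus cover the same 5 × 5 neighborhood with fewer parameters while adding one extra non-linearity.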
NIN has a multilayer perceptron convolutional (mlpconv) layer, which is cascaded cross-channel parametric pooling (CCCP) on top of a normal convolutional layer. This CCCP structure allows the network to learn complex and useful cross-channel integration features. The CCCP layer is equivalent to a convolutional layer with 1 × 1 convolutional kernels. After the appearance of NIN, GoogLeNet, ResNet and their families also widely adopted small convolutional kernels and demonstrated that such kernels can greatly improve the feature-learning capability of networks.

2.2. Convolutional Neural Network (CNN)-Based Classification for Hyperspectral Images (HSIs)

CNNs have demonstrated remarkable performance in HSI classification, denoising [25], segmentation [26] and so on. According to the way of implementation, CNN-based HSI classification methods can be divided into 1D-CNN-based, 2D-CNN-based and 3D-CNN-based methods, and methods combining CNN with other approaches. Among them, the methods based on 1D-CNN [17,19] usually assume that each pixel in an HSI corresponds to only one class, and directly use 1D spectral information for hyperspectral classification. Although this kind of method is easy to implement, it is seriously affected by mixed pixels and strong correlation among bands, so the model cannot effectively learn the spectral features of each pixel and has poor adaptability. The methods based on 2D-CNN [27,28] usually adopt some approaches (such as principal component analysis or an autoencoder) to extract the main components of the spectral information in advance, and then extract neighborhood pixel blocks from the image as samples. The 3D-CNN-based methods utilize pixel cubes to implement hyperspectral classification [20–22].
Both of the latter two kinds of methods introduce spatial information, which enables the model to learn more useful features and significantly improves the classification effect. However, these two kinds of methods require a large number of labeled pixels to generate sufficient samples for training the CNN model; coupled with the limited number of labeled pixels in HSIs, this aggravates the shortage of training samples and makes the model easy to overfit.
The two kinds of methods usually assume that the pixels in the same spatial neighborhood have spectral characteristics similar to those of the central pixel, and that those pixels belong to the same class as the central pixel. This ignores the fact that pixels in the same neighborhood may represent a different class of ground truth; that is, it is impossible to ensure that only one class of objects is included in the same neighborhood. Therefore, if the spatial size of the neighborhood pixel blocks is large, the two kinds of methods will be disturbed by heterogeneous noise (pixels in a neighborhood pixel block that belong to a different class from the central pixel). The last kind of method [29,30] usually combines other approaches with CNNs, using these approaches to make up for some shortcomings of CNNs in HSI classification, so as to improve the classification performance of the CNN model. This kind of method requires fully exploiting the characteristics of the CNN and requires deep theoretical study. The implementation process is relatively tedious, but the performance is often excellent.

3. HSI Classification Method Based on Alternating Small Convolutions and Feature Reuse (ASC–FR)

3.1. Data Pre-Processing

Hyperspectral data contain a large number of bands and rich spectral values. Considering these characteristics, the 1D spectral vector is converted into a 2D feature matrix. Then, each pixel in the original HSI corresponds to a spectral feature map, and the difference among samples becomes increasingly remarkable. Inputting the spectral feature map into CNNs not only makes the network learn the spectral features of hyperspectral data effectively, but also reduces the influence of highly correlated bands on HSI classification. Moreover, the constraint of the high-dimension characteristic of hyperspectral data in classification can be removed. Therefore, this data pre-processing technique can improve the classification accuracy effectively. The procedure of data pre-processing is displayed in Figure 1. It is noted that a few bands should be removed before transforming the spectral vector into the feature map. Removing a few bands can eliminate redundant information, weaken the correlation among bands of the data, and enhance the inter-class separability. It also benefits the improvement of HSI classification accuracy. However, if too many bands are removed, some useful information will be lost, and the classification performance of the model will be degraded. Therefore, it is necessary to reasonably control the number of removed bands to make the advantages outweigh the disadvantages. Moreover, in order to convert the 1D spectral vector into the spectral feature map, the number of reserved bands should be a square number.

Figure 1. The diagram of data pre-processing procedure.
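The pre-processing above can be sketched as follows; the helper name and the 196-band example are our own illustration, not the authors' code:

```python
import numpy as np

# Keep a square number of bands, reshape each pixel's 1D spectral
# vector into a 2D spectral feature map, then zero-mean it.

def spectral_vector_to_map(spectrum, side):
    """Reshape a 1D spectrum with side*side bands into a side x side map."""
    spectrum = np.asarray(spectrum, dtype=np.float64)
    assert spectrum.size == side * side, "reserved band count must be a square"
    fmap = spectrum.reshape(side, side)
    return fmap - fmap.mean()          # zero-mean treatment

pixel = np.arange(196, dtype=np.float64)   # a pixel with 196 reserved bands
fmap = spectral_vector_to_map(pixel, 14)
print(fmap.shape)    # (14, 14)
```

Each labeled pixel thus becomes one small single-channel image that the CNN can consume directly.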

3.2. Proposed CNN Architecture

In the process of designing the network structure in this work, improving the HSI classification accuracy is considered as the goal. Meanwhile, deepening the abstraction of hyperspectral data features and avoiding gradient disappearance and overfitting of the network as far as possible are considered as the main principles. The proposed CNN architecture is shown in Figure 2. At the input end, three 3 × 3 convolutional layers are stacked, followed by two cascaded 1 × 1 convolution layers, that is, the mlpconv layer. This layer can enhance the complexity of the network and make the features of hyperspectral data increasingly abstract. Then, the output features are integrated by overlapping max pooling and forwarded to the two cascaded composite layers. The obtained features are then transmitted to a 1 × 1 convolution layer and an average pooling layer. Finally, global average pooling (GAP) is adopted to deal with the whole feature map, and classification results are output by the Softmax layer. The outputs of each convolutional layer are treated with batch normalization (BN) and rectified linear unit (ReLU) in turn, that is, Conv → BN → ReLU.

Width expansion rate. In this work, the width (channel number) of each convolution layer is set as a multiple of g. The width of the network increases with g, which is thus called the width expansion rate. Different width expansion rates indicate that the amount of new feature information received by each layer varies. Therefore, the width of the network can be adjusted and the efficiency of the network can be improved by adjusting the width expansion rate. In the experiments, the output dimension of nearly all the convolution layers, except the composite layers, is set as 2g to adjust the width and depth of the network conveniently.

ASC–FR module. The composite layer consists of 1 × 1 and 3 × 3 convolution layers, where the output dimensions of both the 1 × 1 and 3 × 3 convolution layers are set as 4g. In this manner, the output dimension of each composite layer is maintained and the expansion of the network is facilitated. Through a stitching operation, the input and output features of each composite layer are combined as new features that form the input of the next composite layer. This operation leads to feature reuse, which is helpful in preventing overfitting. The aforementioned information is the detail of the ASC–FR module. When the feature dimension is high after splicing, the composite layer can reduce the dimension of features, thereby increasing the compactness of the network structure and reducing the computation.

Figure 2. Proposed convolutional neural network (CNN) architecture.
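The role of global average pooling at the end of the architecture can be illustrated with a short sketch (shapes and names are our own assumptions, not taken from the paper's implementation):

```python
import numpy as np

# GAP replaces the fully connected layer: each final feature map is
# averaged to a single value, giving one score per channel with no
# extra trainable weights.

rng = np.random.default_rng(1)
n_classes = 16                           # e.g. 16 ground-truth classes
feature_maps = rng.standard_normal((4, 4, n_classes))  # last conv output

gap = feature_maps.mean(axis=(0, 1))     # global average pooling
scores = np.exp(gap) / np.exp(gap).sum() # softmax over class scores

print(gap.shape)    # (16,)
```

Because GAP has no parameters, it both shrinks the model and discourages overfitting relative to a dense classification head.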

4. Experiments and Analysis

The proposed CNN architecture is implemented via TensorFlow 1.4.2. The experiments are conducted on a desktop PC with Windows 7 64-bit OS, an Intel(R) Core(TM) i5-4460 CPU, 8 GB RAM and an NVIDIA GeForce GTX 1070 8G GPU. Three benchmark datasets are used to evaluate the classification performance of the proposed method. This section introduces the datasets, provides details about the experimental design, and conducts analyses according to the experimental results.


4.1. Datasets and Data Pre-Processing

4.1.1. Indian Pines Dataset

The Indian Pines image was captured by the AVIRIS sensor with a spatial resolution of 20 m over the Indian Pines test site in north-western Indiana. This image contains 145 × 145 pixels and 224 spectral reflectance bands, whose wavelengths range from 0.4 µm to 2.5 µm. After discarding four zero bands, 220 bands are reserved. There are 16 ground-truth classes and 10,249 labeled pixels. Figure 3 displays the details of the Indian Pines image. Table 1 illustrates the number of samples per class in the Indian Pines dataset.

Figure 3. The Indian Pines image: (a) the 21st band image; (b) the ground truth of Indian Pines, where the white area represents the unlabeled pixels.

Table 1. Indian Pines image: the number of samples per class.

Class   Name                            Number   Train   Test
C1      Alfalfa                         46       10      36
C2      Corn-notill                     1428     345     1083
C3      Corn-mintill                    830      219     611
C4      Corn                            237      64      173
C5      Grass-pasture                   483      133     350
C6      Grass-trees                     730      188     542
C7      Grass-pasture-mowed             28       7       21
C8      Hay-windrowed                   478      115     363
C9      Oats                            20       8       12
C10     Soybean-notill                  972      243     729
C11     Soybean-mintill                 2455     626     1829
C12     Soybean-clean                   593      136     457
C13     Wheat                           205      46      159
C14     Woods                           1265     311     954
C15     Buildings-Grass-Trees-Drives    386      85      301
C16     Stone-Steel-Towers              93       26      67
Total                                   10,249   2562    7687
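Table 1 holds out roughly a quarter of the labeled pixels of each class for training. A per-class split of that kind can be sketched as follows (the exact 25% fraction is inferred from the table, not stated as a rule, and the class sizes below are generic placeholders):

```python
import random

def split_per_class(labels, train_fraction=0.25, seed=0):
    """Return (train_idx, test_idx), sampling each class independently."""
    rng = random.Random(seed)
    by_class = {}
    for idx, lab in enumerate(labels):
        by_class.setdefault(lab, []).append(idx)
    train, test = [], []
    for lab, idxs in by_class.items():
        rng.shuffle(idxs)
        k = max(1, round(train_fraction * len(idxs)))
        train += idxs[:k]
        test += idxs[k:]
    return sorted(train), sorted(test)

labels = ["a"] * 100 + ["b"] * 40
train, test = split_per_class(labels)
print(len(train), len(test))   # 35 105
```

Splitting class by class keeps rare classes (such as Oats with only 20 pixels) represented in both sets.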

For the Indian Pines dataset, 20 bands that cover the region of water absorption and four bands with low signal-to-noise ratio (SNR), including bands [104–108, 150–165, 218–220], are removed. The 196 preserved bands are used for classification. Figure 4 displays the mean spectral signatures of the 16 ground-truth classes.
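The arithmetic behind the 196 preserved bands can be checked with a short sketch (assuming the listed indices are 1-based positions within the 220 retained bands; the variable names are ours):

```python
# Bands removed from the 220-band Indian Pines cube, per the intervals above.
removed = (list(range(104, 109))      # bands 104-108 (5 bands)
           + list(range(150, 166))    # bands 150-165 (16 bands)
           + list(range(218, 221)))   # bands 218-220 (3 bands)

# Remaining band numbers after removal.
kept = [b for b in range(1, 221) if b not in set(removed)]

print(len(removed), len(kept))  # 24 196
```

The 24 removed bands split into the 20 water-absorption bands and 4 low-SNR bands mentioned in the text, leaving 220 − 24 = 196 bands.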

ISPRS Int. J. Geo-Inf. 2018, 7, 349


The 1D spectral vector that corresponds to each sample is converted into a spectral feature map that contains 14 × 14 pixels. Then, the data are processed by zero mean treatment. Figure 5 shows the spectral feature maps of the 16 ground-truth classes after quantizing spectral values at 20 levels. Figure 4 shows that the differences between different curves are noticeable in most bands, while being remarkably small in the other bands. A few spectral curves nearly coincide even in all the bands, and it is hard to distinguish them. In the process of classification, this situation will easily lead to the misclassification of some classes, which is not conducive to the improvement of overall classification accuracy. Therefore, training the network using 1D spectral vectors directly is not conducive to the extraction of features from the original data, thereby leading to an overfitting problem and poor overall classification performance of the proposed method. After converting the samples into 2D forms, spectral feature maps from different classes are easy to distinguish, as displayed in Figure 5. This illustrates that the data processing method is conducive to the network learning of the hyperspectral data features, thus improving the classification performance of the network.
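As a minimal sketch of this preprocessing step (the function names are ours; the 14 × 14 map size, zero-mean treatment and 20-level quantization follow the description above, with quantization used here only for visualization as in Figure 5):

```python
import numpy as np

def vector_to_feature_map(spectrum, size=14):
    """Reshape a 1D spectral vector (196 preserved bands) into a
    size x size 2D spectral feature map, then apply zero-mean treatment."""
    fmap = np.asarray(spectrum, dtype=np.float64).reshape(size, size)
    return fmap - fmap.mean()  # zero-mean treatment

def quantize(fmap, levels=20):
    """Quantize the spectral values of a feature map to a fixed
    number of gray levels for visualization."""
    lo, hi = fmap.min(), fmap.max()
    return np.floor((fmap - lo) / (hi - lo) * (levels - 1) + 0.5).astype(int)

# toy example with a synthetic 196-band spectrum
spectrum = np.linspace(0.0, 1.0, 196)
fmap = vector_to_feature_map(spectrum)
print(fmap.shape)        # (14, 14)
q = quantize(fmap)
print(q.min(), q.max())  # 0 19
```

The 14 × 14 shape works because 14 × 14 = 196, the number of preserved bands.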


Figure 4. Mean spectral signatures of 16 classes in the Indian Pines dataset.

Figure 5. 2D spectral feature maps of 16 classes in the Indian Pines dataset.


4.1.2. Salinas Dataset

The Salinas image was collected by the AVIRIS (Airborne Visible Infrared Imaging Spectrometer) sensor with a spatial resolution of 3.7 m per pixel over Salinas Valley, California. This image contains 512 × 217 pixels, 224 spectral bands, and 16 ground truth classes. The details of the Salinas image are shown in Figure 6. Table 2 lists the number of samples per class.


Figure 6. The Salinas image: (a) the 21st band image; (b) the ground truth of Salinas, where the white area represents the unlabeled pixels.

Table 2. Salinas image: the number of samples per class.

Class  Name                  Number  Train   Test
C1     green_weeds_1           2009    524   1485
C2     green_weeds_2           3726    933   2793
C3     Fallow                  1976    514   1462
C4     Fallow rough plow       1394    343   1051
C5     Fallow smooth           2678    671   2007
C6     Stubble                 3959    977   2982
C7     Celery                  3579    931   2648
C8     Grapes untrained      11,271   2826   8445
C9     Soil vinyard            6203   1536   4667
C10    Corn senesced           3278    813   2465
C11    Lettuce_romaine_4wk     1068    263    805
C12    Lettuce_romaine_5wk     1927    493   1434
C13    Lettuce_romaine_6wk      916    211    705
C14    Lettuce_romaine_7wk     1070    238    832
C15    Vinyard_untrained       7268   1806   5462
C16    Vinyard_vertical        1807    453   1354
Total                        54,129  13,532 40,597

For the Salinas dataset, 20 bands that cover the region of water absorption and 8 bands with low SNR, including bands [107–113, 153–168, 220–224], are removed. The 196 remaining bands are used for classification. The spectral vectors of the preserved data are transformed into feature maps with 14 × 14 pixels. The data are processed by zero mean treatment.
Accordingly, the mean spectral signatures of 16 classes in the Salinas dataset are shown in Figure 7, and the spectral feature maps after quantizing spectral values at 20 levels are shown in Figure 8. Figures 7 and 8 demonstrate that transforming the samples from 1D to 2D emphasizes the difference among the samples from various ground-truth classes for the Salinas dataset.

Figure 7. Mean spectral signatures of 16 classes in the Salinas dataset.

Figure 8. 2D spectral feature maps of 16 classes in the Salinas dataset.

4.1.3. Pavia University Dataset

This HSI was captured by the ROSIS (Reflective Off-axis System Imaging Spectrometer) sensor with a spatial resolution of 1.3 m over the Pavia University in northern Italy. It contains 103 spectral bands, 610 × 340 pixels and 9 ground-truth classes. Its details are shown in Figure 9. Table 3 lists the number of samples per class in this image.


Figure 9. The Pavia University image: (a) the 21st band image; (b) the ground truth of Pavia University, where the white area represents the unlabeled pixels.

Table 3. Pavia University image: the number of samples per class.

Class  Name                  Number  Train   Test
C1     Asphalt                 6631   1668   4963
C2     Meadows               18,649   4583  14,066
C3     Gravel                  2099    512   1587
C4     Trees                   3064    831   2233
C5     Painted metal sheets    1345    331   1014
C6     Bare Soil               5029   1270   3759
C7     Bitumen                 1330    334    996
C8     Self-Blocking Bricks    3682    936   2746
C9     Shadows                  947    229    718
Total                        42,776  10,694 32,082

For the Pavia University image, only three bands were removed and 100 bands were retained to convert spectral vectors into feature maps, because there exist obvious differences among different mean spectral curves (Figure 10). The corresponding spectral feature maps are shown in Figure 11 below.


Figure 10. Mean spectral signatures of 9 classes in the Pavia University dataset.


Figure 11. 2D spectral feature maps of 9 classes in the Pavia University dataset.

4.2. Experimental Design

In this section, several comparative experiments are designed to evaluate the classification performance of the proposed method in HSI. The details of the experiments are as follows.

(1) Comparison of classification performances of the proposed method under different parameter settings. Two comparisons are needed because different width expansion rates (g) and network depths lead to varying classification performances of the proposed method. ① The number of composite layers is denoted as nc_layer and set as 2, and the classification performances of the proposed method when the value of g is 8/20/32 are compared. ② g is set as 20, and the classification performances of the proposed method when the value of nc_layer is 2/4/8 are compared.

(2) Comparison with other methods. The classification performance of the proposed method is compared with that of the deep learning method and the non-deep learning method on the HSI.

In the training process, the training samples are divided into batches, and the number of samples per batch in our experiment is 64. For each epoch, the whole training set is learned by the proposed CNN. The total number of epochs is 200 in each experiment. The Adam optimizer is applied to train the proposed CNN, and the MSRA initialization method [31] is used for weight initialization. The parameter of the Adam optimizer, epsilon, is set as 1 × 10−8. The initial learning rate for the Indian Pines dataset is 1 × 10−2, which is reduced 10/100/200 times when epoch = 20/60/100, respectively. For the Salinas dataset and the Pavia University dataset, the initial learning rate is 1 × 10−3, which is reduced 10/100/200 times when epoch = 40/100/150, respectively. If there is no special illustration, then all the parameters are set according to the aforementioned settings. These parameters may not be optimal, but they are at least effective for the proposed CNN.
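The stepwise learning-rate schedule for the Indian Pines runs can be sketched as follows (the function name and the boundary handling are our assumptions; "reduced 10/100/200 times" is read as dividing the initial rate by those factors):

```python
def learning_rate(epoch, base_lr=1e-2,
                  milestones=(20, 60, 100), factors=(10, 100, 200)):
    """Step schedule: the initial rate is divided by 10/100/200 once
    the epoch reaches 20/60/100 (Indian Pines settings from the text)."""
    lr = base_lr
    for m, f in zip(milestones, factors):
        if epoch >= m:
            lr = base_lr / f
    return lr

print(f"{learning_rate(0):.6f}")    # 0.010000
print(f"{learning_rate(30):.6f}")   # 0.001000
print(f"{learning_rate(120):.6f}")  # 0.000050
```

For the Salinas and Pavia University runs, the same sketch applies with base_lr = 1e-3 and milestones (40, 100, 150).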
Generally, Generally, insufficient will lead Division insufficient training training samples samples will lead to to serious serious overfitting of deep CNN models. However, the deep CNN model is equipped with strong overfitting of deep CNN models. However, the deep CNN model is equipped with strong feature feature extraction capacity capacity due due to to its Moreover, BN BN and and MSRA method are are extraction its deep deep structure. structure. Moreover, MSRA initialization initialization method adopted for the generalization performance of adopted for training training the theproposed proposedCNN CNNeffectively, effectively,which whichimprove improve the generalization performance of proposed CNN. Therefore, proposed method can extract deep features from small training samples set effectively. The limited available labeled pixels of HSI leads to insufficient training samples, which

ISPRS Int. J. Geo-Inf. 2018, 7, 349

13 of 21

proposed method can extract deep features from small training samples ISPRS Int. J.CNN. Geo-Inf. Therefore, 2018, 7, x FORproposed PEER REVIEW 13 of 20 set effectively. The limited available labeled pixels of HSI leads to insufficient training samples, usually make make it a challenge to improve the classification accuracy of the HSI. Considering this, to which usually it a challenge to improve the classification accuracy of the HSI. Considering this, evaluate the we randomly randomly select select to evaluate theeffectiveness effectivenessofofproposed proposedmethod methodunder underlimited limited training training samples, samples, we 25%of ofsamples samplesin ineach eachHSI HSIdataset datasetfor forthe thetraining trainingset, set,and andthe therest restfor forthe thetest testset. set. 25% 4.3. 4.3. Experimental Experimental Results Resultsand andAnalyses Analyses In Inthis thiswork, work,overall overallaccuracy accuracy(OA), (OA),average average accuracy accuracy (AA), (AA), and and kappa kappa coefficient coefficient(Kappa) (Kappa) are are adopted adoptedto toevaluate evaluatethe theclassification classificationperformance performanceof ofthe theproposed proposedmethod methodin inHSI HSIdata. data. Among Among them, them, OA OA refers refers to to the the ratio ratioof ofthe thenumber number of ofpixels pixelscorrectly correctly classified classified to tothe thetotal totalnumber number of ofall alllabeled labeled pixels. AA is the average of classification accuracy of all classes. The Kappa coefficient is used to assess pixels. AA is the average of classification accuracy of all classes. The Kappa coefficient is used to the agreement of classification for all the The greater kappa better overall assess the agreement of classification forclasses. all the classes. The the greater thevalue, kappathe value, thethe better the classification effect. All the following data are the 10 experimental results under the under same overall classification effect. 
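The three metrics can be computed from a confusion matrix; a minimal sketch (the helper is ours, with the matrix oriented so that entry [i, j] counts samples of actual class j predicted as class i, as in the tables later in this section):

```python
import numpy as np

def oa_aa_kappa(conf):
    """OA = correctly classified pixels / all labeled pixels,
    AA = mean of per-class accuracies,
    Kappa = (OA - chance agreement) / (1 - chance agreement)."""
    conf = np.asarray(conf, dtype=np.float64)
    total = conf.sum()
    diag = np.diag(conf)
    oa = diag.sum() / total
    aa = (diag / conf.sum(axis=0)).mean()  # column sums = actual class sizes
    pe = (conf.sum(axis=0) * conf.sum(axis=1)).sum() / total ** 2
    kappa = (oa - pe) / (1.0 - pe)
    return oa, aa, kappa

# toy 2-class confusion matrix
oa, aa, kappa = oa_aa_kappa([[45, 5],
                             [5, 45]])
print(oa, aa, round(kappa, 4))  # 0.9 0.9 0.8
```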
(1) Experimental Results on the Indian Pines Dataset

Classification performance of the proposed method under different width expansion rates: Figure 12 displays the bar chart of the classification results when nc_layer = 2 and g = 8/20/32. According to Figure 12, when the network depth is the same, the classification performance of the proposed method improves gradually as the width expansion rate (g) increases. However, this improvement gradually saturates. OA/AA/Kappa are increased by 3.05%/5.08%/0.0346, respectively, when g is increased from 8 to 20. When g is increased from 20 to 32, OA and Kappa are almost invariable and AA is slightly reduced. Within a certain range, increasing the network width can improve the feature extraction of hyperspectral data, thus enhancing classification accuracy. However, this improvement has an upper limit. Widening the network will certainly increase the computation of the network, thereby increasing the time consumed by classification. Therefore, the width of the network should be controlled when designing the HSI classification method.

Figure 12. Classification result of proposed method when nc_layer = 2, g = 8/20/32: (a) shows the classification accuracy per class; (b) shows overall accuracy/average accuracy/kappa coefficient (OA/AA/Kappa).

Classification performance of the proposed method under different numbers of composite layers: Figure 13 displays the bar chart of the classification results when g = 20 and nc_layer = 2/4/8. Increasing the number of composite layers cannot increase the classification accuracy of the proposed method when the width expansion rate remains the same. An increase in network depth will greatly increase the computation of the network and prolong the time for network training. Considering the large computation of hyperspectral data, the network depth should be controlled on the premise of ensuring high classification accuracy. In summary, in terms of classification accuracy and speed, setting g = 20 and nc_layer = 2 is suitable.
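How g and nc_layer drive the model size can be illustrated with simple channel bookkeeping. This is our reading of the feature-reuse design described earlier, where each composite layer's input and output are concatenated before the next layer, so the channel count grows by g per composite layer; the function is illustrative, not the authors' code:

```python
def composite_stack_channels(c_in, g, nc_layer):
    """Track the number of feature maps entering each composite layer:
    every layer emits g maps (width expansion rate) and concatenates
    them with its input, so channels grow by g per layer."""
    channels = [c_in]
    for _ in range(nc_layer):
        channels.append(channels[-1] + g)
    return channels

# one input feature map, g = 20, two composite layers
print(composite_stack_channels(1, 20, 2))  # [1, 21, 41]
```

Under this accounting, both a larger g and a deeper stack inflate the channel counts, which matches the observation that widening or deepening the network increases computation.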


Figure 13. Classification results of proposed method when g = 20, nc_layer = 2/4/8: (a) shows the classification accuracy per class; (b) shows the OA/AA/Kappa.

Impact of band number: As can be seen from Figure 4, when the number of reserved bands is 196, many spectral curves still have serious aliasing at some bands, which is unfavorable for classification. In order to obtain the optimal number of reserved bands, the classification results of 100, 144 and 196 bands under nc_layer = 2, g = 20 are compared (Table 4). According to Figures 4 and 14, after removing more bands, the aliasing of spectral curves is much less, which enhances the inter-class separability of the Indian Pines data effectively. Unfortunately, it simultaneously also leads to the loss of much useful information. The enhancement of inter-class separability brings benefits to the improvement of classification accuracy, while the loss of useful information causes damage to it. Because the disadvantage outweighs the advantage, compared with the 196-band OA, the 144-band OA decreases by only 0.12%, and the 100-band OA decreases by 2.74% (see Table 4). As a result, it is the best choice to reserve 196 bands.


Figure 14. Spectral curves of 100 bands and 144 bands in the Indian Pines dataset.

Table 4. Classification results of different number of reserved bands in the Indian Pines dataset.

Band Number  OA      AA      Kappa
100          86.94%  85.73%  0.8510
144          89.76%  88.37%  0.8838
196          89.88%  90.92%  0.8845


(2) Experimental Results on the Salinas Dataset

The classification performance of the proposed method for the Salinas dataset under different width expansion rates or network depths: Table 5 shows the classification results of the proposed method on the Salinas dataset when nc_layer = 2 and g = 8/20/32 and when g = 20 and nc_layer = 2/4/8. When g is increased from 8 to 20 under the same network depth, OA/AA/Kappa increase by 1.19%/0.48%/0.0132, respectively. When g is increased from 20 to 32, OA/AA/Kappa are nearly unchanged. The increase of network width can improve the classification performance of the proposed method in a certain range. If the width is sufficient, widening the network will not affect the classification capability of the proposed method. When the width expansion rate is the same, the classification accuracy is nearly unchanged with the increase of network depth. This phenomenon may be caused by the sufficiently large number of samples in the Salinas dataset, which is remarkably close to or may even reach the maximum capacity of the proposed method.

Table 5. Classification results under different width expansion rate or network depth in the Salinas dataset.

               nc_layer = 2                          g = 20
         g = 8     g = 20    g = 32    nc_layer = 2  nc_layer = 4  nc_layer = 8
C1      100.00%    99.93%   100.00%       99.93%       100.00%       100.00%
C2       99.61%    99.82%    99.78%       99.82%        99.78%        99.80%
C3       99.86%    99.73%    99.86%       99.73%       100.00%       100.00%
C4       99.33%    99.81%    99.62%       99.81%        99.71%        99.92%
C5       99.20%    99.45%    99.60%       99.45%        98.96%        99.20%
C6       99.90%    99.93%    99.98%       99.93%        99.90%       100.00%
C7       99.85%    99.96%   100.00%       99.96%        99.96%       100.00%
C8       87.65%    90.91%    89.75%       90.91%        89.60%        89.95%
C9       99.59%    99.57%    99.42%       99.57%        99.59%        99.72%
C10      98.35%    98.18%    98.45%       98.18%        98.96%        99.06%
C11      98.76%    99.25%    99.01%       99.25%        99.14%        98.68%
C12      99.17%    99.30%    99.79%       99.30%        99.37%        99.24%
C13      99.15%   100.00%   100.00%      100.00%        99.86%       100.00%
C14      97.16%    97.29%    96.96%       97.29%        99.28%        98.26%
C15      83.38%    86.75%    86.92%       86.75%        84.49%        84.52%
C16      99.48%    99.78%    99.92%       99.78%        99.63%        99.96%
OA       94.82%    96.01%    95.81%       96.01%        95.49%        95.68%
AA       97.53%    98.11%    98.07%       98.11%        98.02%        98.12%
Kappa    0.9423    0.9555    0.9533       0.9555        0.9497        0.9493

(3) Experimental Results on the Pavia University Dataset

To demonstrate the relationship between classification accuracy and inter-class separability, Tables 6 and 7 display the details of the classification results for the Pavia University data at nc_layer = 2 and g = 20. In Table 6, the value in the ith row and jth column is the number of samples of the jth class that are classified into the ith class. Each row represents the samples in a predicted class, and each column represents the samples in an actual class. The values on the diagonal are the numbers of correctly classified samples; the off-diagonal values are false positives or false negatives. For example, in the first row, 4790 samples of C1 are classified correctly, while 31 samples of C3 are misclassified into C1, which are false positives. In the seventh column, 887 samples of C7 are classified correctly, while 108 samples are misclassified into C1, which are false negatives. In Table 7, the value in the ith row and jth column is the percentage of the samples classified into the ith class that come from the jth class. The values on the diagonal are the classification accuracies of the corresponding classes. For example, 93.37% of C7 samples are classified correctly, but 6.32% of the samples classified into C7 are from C1 and are thus misclassified. The smaller the difference between mean spectral curves, the smaller the difference among spectral feature maps of different classes, the worse the inter-class separability, and the easier it is to cause misclassification. As shown in Figure 12, the curves


of C3 and C8 are very similar; that is, the difference between C3 and C8 is small. As a result, many samples (212) of C3 are misclassified into C8 and, similarly, many samples (179) of C8 are misclassified into C3. In this way, it is not difficult to explain why the OA of the proposed method is only 89.95% on the Indian Pines dataset but 96.01% on the Salinas dataset and 96.15% on the Pavia University dataset: at many bands, there is serious aliasing in the mean spectral curves of the Indian Pines dataset (Figure 4), which means poor inter-class separability. Therefore, there is serious misclassification in some classes, which lowers the overall classification accuracy for the Indian Pines dataset.

Table 6. The details of classification results for Pavia University data.

        C1      C2     C3     C4     C5     C6    C7     C8    C9
C1    4790       0     31      0      2      1   108     70     2
C2       6  13,871      2     58      0    169     0      7     0
C3      29       1   1341      0      0      0     1    179     0
C4       1      42      0   2172      0      2     0      0     0
C5       1       0      0      0   1011      0     0      0     0
C6      12     150      0      3      1   3580     0     10     0
C7      60       0      1      0      0      0   887      2     0
C8      64       2    212      0      0      7     0   2478     0
C9       0       0      0      0      0      0     0      0   716
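As a check on the bookkeeping, the summary metrics reported for the Pavia University data can be recomputed directly from the confusion matrix in Table 6. The sketch below is illustrative rather than taken from the authors' code; per-class accuracies follow Table 7's convention of dividing each diagonal entry by its row total:

```python
# Confusion matrix from Table 6 (row i = samples predicted as class Ci,
# column j = samples whose actual class is Cj).
M = [
    [4790,     0,   31,    0,    2,    1,  108,   70,   2],
    [   6, 13871,    2,   58,    0,  169,    0,    7,   0],
    [  29,     1, 1341,    0,    0,    0,    1,  179,   0],
    [   1,    42,    0, 2172,    0,    2,    0,    0,   0],
    [   1,     0,    0,    0, 1011,    0,    0,    0,   0],
    [  12,   150,    0,    3,    1, 3580,    0,   10,   0],
    [  60,     0,    1,    0,    0,    0,  887,    2,   0],
    [  64,     2,  212,    0,    0,    7,    0, 2478,   0],
    [   0,     0,    0,    0,    0,    0,    0,    0, 716],
]

n = sum(sum(row) for row in M)                       # total number of test samples
diag = [M[i][i] for i in range(9)]
row_tot = [sum(row) for row in M]
col_tot = [sum(M[i][j] for i in range(9)) for j in range(9)]

oa = sum(diag) / n                                   # overall accuracy
aa = sum(d / r for d, r in zip(diag, row_tot)) / 9   # average accuracy (Table 7 convention)
pe = sum(r * c for r, c in zip(row_tot, col_tot)) / n ** 2
kappa = (oa - pe) / (1 - pe)                         # Cohen's kappa coefficient

print(round(oa, 4), round(aa, 4), round(kappa, 4))   # matches the reported 96.15%/95.19%/0.9488
```

Running this reproduces the Pavia University results in Table 9 (OA = 96.15%, AA = 95.19%, Kappa = 0.9488), confirming the internal consistency of the tables.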

Table 7. The detailed classification accuracy of all the classes for Pavia University data.

        C1       C2       C3       C4       C5       C6       C7       C8       C9
C1    95.72%    0.00%    0.62%    0.00%    0.04%    0.02%    2.16%    1.40%    0.04%
C2     0.04%   98.29%    0.01%    0.41%    0.00%    1.20%    0.00%    0.05%    0.00%
C3     1.87%    0.06%   86.46%    0.00%    0.00%    0.00%    0.06%   11.54%    0.00%
C4     0.05%    1.89%    0.00%   97.97%    0.00%    0.09%    0.00%    0.00%    0.00%
C5     0.10%    0.00%    0.00%    0.00%   99.90%    0.00%    0.00%    0.00%    0.00%
C6     0.32%    3.99%    0.00%    0.08%    0.03%   95.31%    0.00%    0.27%    0.00%
C7     6.32%    0.00%    0.11%    0.00%    0.00%    0.00%   93.37%    0.21%    0.00%
C8     2.32%    0.07%    7.67%    0.00%    0.00%    0.25%    0.00%   89.69%    0.00%
C9     0.00%    0.00%    0.00%    0.00%    0.00%    0.00%    0.00%    0.00%  100.00%
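The separability argument around Figure 12 can be made quantitative: a simple proxy for inter-class separability is the correlation between class-mean spectral curves, with near-identical curves (such as C3 and C8) scoring close to 1. The curves below are made-up toy spectra, not the actual dataset means:

```python
import math

def correlation(x, y):
    """Pearson correlation between two mean spectral curves."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

# Toy mean spectra (reflectance vs. band index); curve_b is a slightly
# perturbed copy of curve_a, mimicking two classes with poor separability.
curve_a = [0.10, 0.25, 0.40, 0.55, 0.52, 0.38, 0.30, 0.28]
curve_b = [0.12, 0.26, 0.41, 0.54, 0.50, 0.39, 0.31, 0.27]
curve_c = [0.60, 0.55, 0.30, 0.20, 0.25, 0.45, 0.58, 0.62]  # a dissimilar class

print(correlation(curve_a, curve_b) > 0.99)  # True: hard to separate
print(correlation(curve_a, curve_c) < 0.0)   # True: well separated
```

Classes whose mean curves correlate near 1 across many bands, as on the Indian Pines dataset, are exactly the ones that drive the OA down.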

(4) Comparison with Other Methods

To further evaluate the effectiveness of the proposed method, we implement two classical CNN architectures, NIN and LeNet-5, taking nc_layer = 2 and g = 20; the architectures are shown in Table 8. The classification results on the three datasets are shown in Table 9, and Table 10 displays the total run times for training and testing on the Indian Pines dataset. Figures 15–17 display the corresponding classification maps. The classification performance of the proposed method is better than that of all the comparison methods. Furthermore, as can be seen from Tables 8 and 10, the FLOPS (floating-point operations) of the proposed CNN is roughly 10^6 less than that of NIN, and the training and testing times for Indian Pines classification are also significantly less than those of NIN. The OA of the proposed CNN is 0.98% higher than that of NIN, which demonstrates the effectiveness of the ASC–FR module. However, the FLOPS and computing consumption of LeNet-5 are much less than those of the proposed CNN and NIN; its shallow structure, together with the insufficient training samples of HSI, severely restricts its classification performance. As a result, the OA of LeNet-5 is 3.19% less than that of NIN and 4.17% less than that of the proposed CNN.
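The FLOPS figures compared above follow the common convention of counting one multiply–accumulate per kernel weight per output position. The helper below is a generic sketch of that convention; it does not attempt to reproduce Table 8's exact totals, which depend on the full layer configuration:

```python
def conv2d_flops(out_h, out_w, k, c_in, c_out):
    """Multiply-accumulate count for one convolutional layer:
    every output position applies a k x k x c_in kernel for each
    of the c_out output channels."""
    return out_h * out_w * k * k * c_in * c_out

# Example: a 3 x 3 convolution producing a 14 x 14 x 40 output
# from a 40-channel input.
print(conv2d_flops(14, 14, 3, 40, 40))  # 2822400
```

Summing this quantity over every convolutional layer (plus the fully connected layers, counted as in_features × out_features) yields architecture-level totals like those in Table 8.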


Table 8. The architecture of proposed CNN/NIN/LeNet-5.

Layer           Output Size   Proposed CNN                   NIN                                 LeNet-5
Conv            14 × 14       [3 × 3, 40; 1 × 1, 40] × 2     [3 × 3, 40; 1 × 1, 40; 1 × 1, 40]   5 × 5, 6
Pool            7 × 7         3 × 3 max pool, stride 2       3 × 3 max pool, stride 2            2 × 2 average pool, stride 2
Conv            7 × 7         [3 × 3, 80; 1 × 1, 80] × 2     [3 × 3, 80; 1 × 1, 80; 1 × 1, 80]   5 × 5, 16
Pool            3 × 3         3 × 3 max pool, stride 2       3 × 3 max pool, stride 2            2 × 2 average pool, stride 2
Classification  1 × 1         global average pool, Softmax   global average pool, Softmax        120-D FC, 84-D FC, 16-D FC, Softmax
FLOPS                         12,693,600                     13,869,600                          175,704

Table 9. Classification results of different CNN architectures on three HSI datasets.

Dataset            Index    LeNet-5   NIN       Proposed CNN
Indian Pines       OA       85.78%    88.97%    89.95%
                   AA       85.33%    86.36%    90.92%
                   Kappa    0.8379    0.8742    0.8845
Salinas            OA       94.28%    95.09%    96.01%
                   AA       97.38%    97.94%    98.11%
                   Kappa    0.9363    0.9453    0.9555
Pavia University   OA       93.37%    95.24%    96.15%
                   AA       92.12%    94.58%    95.19%
                   Kappa    0.9120    0.9359    0.9488

Table 10. Computing consumption of different CNN architectures on the Indian Pines dataset.

Methods        Training Time   Testing Time   OA
LeNet-5        74 s            5.37 s         85.78%
NIN            174.36 s        27.68 s        88.97%
Proposed CNN   131.53 s        24.39 s        89.95%

Figure 15. Classification maps of different CNN architectures on the Indian Pines dataset: (a) training set; (b) test set; (c) LeNet-5 (85.78%); (d) NIN (88.97%); (e) Proposed CNN (89.95%).


Figure 16. Classification maps of different CNN architectures on the Salinas dataset: (a) training set; (b) test set; (c) LeNet-5 (94.28%); (d) NIN (95.09%); (e) Proposed CNN (96.01%).

Figure 17. Classification maps of different CNN architectures on the Pavia University dataset: (a) training set; (b) test set; (c) LeNet-5 (93.37%); (d) NIN (95.24%); (e) Proposed CNN (96.15%).

Finally, the classification performance of the proposed method is compared with some other HSI classification methods, as shown in Table 11. In this table, the accuracies outside brackets are taken directly from the corresponding references, and the accuracies in brackets are obtained by the proposed method. It should be noted that we obtain each classification accuracy by dividing the training set and the test set according to the corresponding reference. The table demonstrates that the classification performance of the proposed method outperforms all the comparison methods. In Table 11, DBN denotes deep belief network and DAE denotes denoising autoencoder.

Table 11. Comparison with other methods for the Indian Pines dataset.

References         Baseline    Feature           Training Set                       Accuracy
Hu et al. [19]     CNN         Spectral          8 classes, 200 samples per class   90.16% (90.74%)
Chen et al. [17]   CNN         Spectral          150 samples per class              87.81% (88.16%)
Chen et al. [32]   DBN         Spectral–Spatial  50%                                91.34% (92.58%)
Fu et al. [33]     DAE         Spectral          30%                                89.82% (90.18%)
Sun et al. [5]     BWSVM       Spectral          25%                                88% (89.95%)
Li et al. [6]      KELM        Spectral          10%                                80.37% (84.53%)
Li et al. [6]      KSVM        Spectral          10%                                79.17% (84.53%)
Wei et al. [7]     FPCA+KELM   Spectral          15%                                87.62% (87.92%)
Hu et al. [19]     RBF-SVM     Spectral          8 classes, 200 samples per class   87.60% (90.74%)


5. Conclusions

In this work, an HSI classification method based on ASC–FR is proposed. In data pre-processing, each 1D spectral vector that corresponds to a labeled pixel is transformed into a 2D spectral feature map, thereby highlighting the differences among samples and weakening the influence of the strong correlation among bands on HSI classification. In the CNN design, 1 × 1 convolution layers are adopted to reduce the network parameters and increase network complexity, thus extracting increasingly accurate hyperspectral data features. Through the ASC–FR module, the utilization rate of the high-dimensional features in the network is improved, the features of hyperspectral data are extracted meticulously and comprehensively, overfitting is prevented to a certain extent, and classification accuracy is improved. Overlapping pooling and GAP are used to integrate the data features, thereby greatly enhancing the learning capability of the network for spectral features and improving the generalization capability of the CNN. Experimental results show that when only 25% of the samples are selected for the training set, the classification accuracy of the proposed method reaches 89.95% on the Indian Pines dataset, 96.01% on the Salinas dataset, and 96.15% on the Pavia University dataset. Comparative experiments on three benchmark HSI datasets demonstrate that the proposed ASC–FR module can effectively improve the classification accuracy of CNNs for HSIs and that the proposed classification method has excellent classification performance, outperforming all the comparison methods. However, the proposed method configures the number of channels in each convolution layer only in a simple way, so there is still much room for improvement. Furthermore, the spatial information of hyperspectral images cannot be effectively utilized by the proposed method.
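To make the last design point concrete: global average pooling collapses each feature map to its spatial mean and is parameter-free, whereas a fully connected head over the flattened features adds weights proportional to the feature size. The following is a minimal sketch; the shapes and channel counts are illustrative assumptions, not the exact configuration of the proposed network:

```python
def global_average_pool(feature_maps):
    """Replace each H x W feature map with its spatial mean,
    yielding one value per channel."""
    return [sum(sum(row) for row in fmap) / (len(fmap) * len(fmap[0]))
            for fmap in feature_maps]

# Two illustrative 2 x 2 feature maps (channels).
maps = [
    [[1.0, 3.0],
     [5.0, 7.0]],
    [[2.0, 2.0],
     [2.0, 2.0]],
]
print(global_average_pool(maps))  # [4.0, 2.0]

# Parameter comparison for a classification head over C channels of H x W maps:
C, H, W, classes = 80, 3, 3, 16
fc_params = C * H * W * classes + classes   # flatten -> FC layer (weights + biases)
gap_params = 0                              # GAP itself has no trainable weights
print(fc_params, gap_params)                # 11536 0
```

Dropping those head parameters is one reason GAP reduces overfitting under the small training sets typical of HSI classification.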
In future work, we will optimize the combination of convolution-layer channels through extensive experiments, adopt approaches to augment the number of training samples, and, more importantly, combine spectral and spatial information for hyperspectral classification. Finally, we plan to optimize the network structure of the proposed method by referring to the latest progress in CNN research. In those ways, we believe the performance of CNN-based HSI classification under a small training set can be improved effectively and significantly.

Author Contributions: H.G. and Y.Y. conceived and designed the experiments; H.Z. and X.Q. provided tools and carried out the data analysis; H.G. and Y.Y. wrote the paper; C.L. revised the paper.

Funding: This work was supported by the National Natural Science Foundation of China (No. 61701166), the China Postdoctoral Science Foundation (No. 2018M632215), the Fundamental Research Funds for the Central Universities (No. 2018B16314), and Projects in the National Science and Technology Pillar Program during the Twelfth Five-Year Plan Period (No. 2015BAB07B01).

Conflicts of Interest: The authors declare no conflict of interest.

References

1. Hughes, G. On the mean accuracy of statistical pattern recognizers. IEEE Trans. Inf. Theory 1968, 14, 55–63. [CrossRef]
2. Tong, Q.; Zhang, B.; Zheng, L. Hyperspectral Remote Sensing—Principle, Technology and Application; Higher Education Press: Beijing, China, 2006.
3. Wang, M.; Gao, K.; Wang, L.J.; Miu, X.H. A Novel Hyperspectral Classification Method Based on C5.0 Decision Tree of Multiple Combined Classifiers. In Proceedings of the Fourth International Conference on Computational and Information Sciences, Chongqing, China, 17–19 August 2012; pp. 373–376.
4. Rojas-Moraleda, R.; Valous, N.A.; Gowen, A.; Esquerre, C.; Härtel, S.; Salinas, L.; O'Donnell, C. A frame-based ANN for classification of hyperspectral images: Assessment of mechanical damage in mushrooms. Neural Comput. Appl. 2017, 28, 969–981. [CrossRef]
5. Sun, W.; Liu, C.; Xu, Y.; Tian, L.; Li, W.Y. A Band-Weighted Support Vector Machine Method for Hyperspectral Imagery Classification. IEEE Geosci. Remote Sens. Lett. 2017, 14, 1710–1714. [CrossRef]
6. Li, J.; Du, Q.; Li, W.; Li, Y.S. Optimizing extreme learning machine for hyperspectral image classification. J. Appl. Remote Sens. 2015, 9, 097296. [CrossRef]
7. Wei, Y.; Xiao, G.; Deng, H.; Chen, H.; Tong, M.G.; Zhao, G.; Liu, Q.T. Hyperspectral image classification using FPCA-based kernel extreme learning machine. Optik Int. J. Light Electron Opt. 2015, 126, 3942–3948. [CrossRef]
8. Zhang, X.; Zhou, X.; Lin, M.; Sun, J. ShuffleNet: An Extremely Efficient Convolutional Neural Network for Mobile Devices. arXiv 2017, arXiv:1707.01083.
9. Shrivastava, A.; Sukthankar, R.; Malik, J.; Gupta, A. Beyond Skip Connections: Top-Down Modulation for Object Detection. arXiv 2016, arXiv:1612.06851.
10. Dai, J.F.; Qi, H.Z.; Xiong, Y.W.; Li, Y.; Zhang, G.D.; Hu, H.; Wei, Y.C. Deformable Convolutional Networks. In Proceedings of the IEEE International Conference on Computer Vision, Venice, Italy, 22–29 October 2017; pp. 764–773.
11. LeCun, Y.; Bottou, L.; Bengio, Y.; Haffner, P. Gradient-based learning applied to document recognition. Proc. IEEE 1998, 86, 2278–2324.
12. Krizhevsky, A.; Sutskever, I.; Hinton, G.E. ImageNet classification with deep convolutional neural networks. In Proceedings of the International Conference on Neural Information Processing Systems, Lake Tahoe, NV, USA, 3–6 December 2012; pp. 1097–1105.
13. Lin, M.; Chen, Q.; Yan, S. Network in Network. arXiv 2014, arXiv:1312.4400v3.
14. Szegedy, C.; Liu, W.; Jia, Y.Q.; Sermanet, P.; Reed, S.; Anguelov, D.; Erhan, D.; Vanhoucke, V.; Rabinovich, A. Going Deeper with Convolutions. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Boston, MA, USA, 7–12 June 2015; pp. 1–9.
15. Sun, Y.; Wang, X.; Tang, X. Deeply learned face representations are sparse, selective, and robust. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Boston, MA, USA, 7–12 June 2015; pp. 2892–2900.
16. He, K.M.; Zhang, X.Y.; Ren, S.Q.; Sun, J. Deep Residual Learning for Image Recognition. In Proceedings of the Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 27–30 June 2016; pp. 770–778.
17. Chen, Y.S.; Jiang, H.L.; Li, C.Y.; Jia, X.P. Deep Feature Extraction and Classification of Hyperspectral Images Based on Convolutional Neural Networks. IEEE Trans. Geosci. Remote Sens. 2016, 54, 6232–6251. [CrossRef]
18. Yue, Q.; Ma, C. Deep Learning for Hyperspectral Data Classification through Exponential Momentum Deep Convolution Neural Networks. J. Sens. 2016, 2016, 3150632. [CrossRef]
19. Hu, W.; Huang, Y.Y.; Wei, L.; Zhang, F.; Li, H.C. Deep Convolutional Neural Networks for Hyperspectral Image Classification. J. Sens. 2015, 2015, 258619. [CrossRef]
20. Lee, H.; Kwon, H. Going Deeper with Contextual CNN for Hyperspectral Image Classification. IEEE Trans. Image Process. 2017, 26, 4843–4855. [CrossRef] [PubMed]
21. Lee, H.; Kwon, H. Contextual deep CNN based hyperspectral classification. In Proceedings of the Geoscience and Remote Sensing Symposium, Beijing, China, 10–15 July 2016.
22. Li, Y.; Zhang, H.; Shen, Q. Spectral-Spatial Classification of Hyperspectral Imagery with 3D Convolutional Neural Network. Remote Sens. 2017, 9, 67. [CrossRef]
23. Makantasis, K.; Karantzalos, K.; Doulamis, A.; Doulamis, N. Deep supervised learning for hyperspectral data classification through convolutional neural networks. In Proceedings of the 2015 IEEE International Geoscience and Remote Sensing Symposium (IGARSS), Milan, Italy, 26–31 July 2015; pp. 4959–4962.
24. Simonyan, K.; Zisserman, A. Very Deep Convolutional Networks for Large-Scale Image Recognition. arXiv 2014, arXiv:1409.1556.
25. Alam, F.I.; Zhou, J.; Liew, W.C.; Jia, X.P. CRF learning with CNN features for hyperspectral image segmentation. In Proceedings of the Geoscience and Remote Sensing Symposium, Beijing, China, 10–15 July 2016; pp. 6890–6893.
26. Yuan, Q.Q.; Zhang, Q.; Li, J.; Shen, H.F.; Zhang, L.P. Hyperspectral Image Denoising Employing a Spatial-Spectral Deep Residual Convolutional Neural Network. arXiv 2018, arXiv:1806.00183.
27. Liu, Q.S.; Hang, R.L.; Song, H.H.; Zhu, F.P.; Plaza, J.; Plaza, A. Adaptive Deep Pyramid Matching for Remote Sensing Scene Classification. arXiv 2016, arXiv:1611.03589v1.
28. Yu, S.; Jia, S.; Xu, C. Convolutional neural networks for hyperspectral image classification. Neurocomputing 2016, 219, 88–98. [CrossRef]
29. Makantasis, K.; Doulamis, A.D.; Doulamis, N.D.; Nikitakis, A. Tensor-Based Classification Models for Hyperspectral Data Analysis. IEEE Trans. Geosci. Remote Sens. 2018, 99, 1–15. [CrossRef]
30. Chen, Y.S.; Zhu, L.; Ghamisi, P.; Jia, X.P.; Li, G.Y.; Tang, L. Hyperspectral Images Classification with Gabor Filtering and Convolutional Neural Network. IEEE Geosci. Remote Sens. Lett. 2017, 14, 2355–2359. [CrossRef]
31. He, K.M.; Zhang, X.Y.; Ren, S.Q.; Sun, J. Delving Deep into Rectifiers: Surpassing Human-Level Performance on ImageNet Classification. In Proceedings of the IEEE International Conference on Computer Vision, Santiago, Chile, 7–13 December 2015; pp. 1026–1034.
32. Chen, Y.; Zhao, X.; Jia, X. Spectral–Spatial Classification of Hyperspectral Data Based on Deep Belief Network. IEEE J. Sel. Top. Appl. Earth Obs. Remote Sens. 2015, 8, 2381–2392. [CrossRef]
33. Fu, Q.Y.; Yu, X.C.; Tan, X.; Wei, X.P.; Zhao, J.L. Classification of Hyperspectral Imagery Based on Denoising Autoencoders. J. Geomat. Sci. Technol. 2016, 33, 485–489.

© 2018 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (http://creativecommons.org/licenses/by/4.0/).
