Article

Hyperspectral Image Classification with Capsule Network Using Limited Training Samples

Fei Deng 1, Shengliang Pu 1, Xuehong Chen 2, Yusheng Shi 3, Ting Yuan 1 and Shengyan Pu 4,*

1 School of Geodesy and Geomatics, Wuhan University, Wuhan 430079, China; [email protected] (F.D.); [email protected] (S.P.); [email protected] (T.Y.)
2 State Key Laboratory of Earth Surface Processes and Resource Ecology, Beijing Normal University, Beijing 100875, China; [email protected]
3 State Environmental Protection Key Laboratory of Satellites Remote Sensing, Institute of Remote Sensing and Digital Earth, Chinese Academy of Sciences, Beijing 100101, China; [email protected]
4 State Key Laboratory of Geohazard Prevention and Geoenvironment Protection, Chengdu University of Technology, Chengdu 610059, China
* Correspondence: [email protected]; Tel.: +86-028-8407-3465

Sensors 2018, 18, 3153; doi:10.3390/s18093153; www.mdpi.com/journal/sensors
Received: 21 August 2018; Accepted: 17 September 2018; Published: 18 September 2018
Abstract: Deep learning techniques have boosted the performance of hyperspectral image (HSI) classification. In particular, convolutional neural networks (CNNs) have shown performance superior to that of conventional machine learning algorithms. Recently, a novel type of neural network called the capsule network (CapsNet) was presented to improve the most advanced CNNs. In this paper, we present a modified two-layer CapsNet with limited training samples for HSI classification, which is inspired by the comparability and simplicity of shallower deep learning models. The presented CapsNet is trained using two real HSI datasets, i.e., the PaviaU (PU) and SalinasA datasets, representing complex and simple datasets, respectively, which are used to investigate the robustness or representation of every model or classifier. In addition, a comparable paradigm of network architecture design has been proposed for the comparison of CNN and CapsNet. Experiments demonstrate that CapsNet shows better accuracy and convergence behavior for the complex data than the state-of-the-art CNN. For CapsNet using the PU dataset, the Kappa coefficient, overall accuracy, and average accuracy are 0.9456, 95.90%, and 96.27%, respectively, compared to the corresponding values yielded by CNN of 0.9345, 95.11%, and 95.63%. Moreover, we observed that CapsNet has much higher confidence for the predicted probabilities. Subsequently, this finding was analyzed and discussed with probability maps and uncertainty analysis. In terms of the existing literature, CapsNet provides promising results and explicit merits in comparison with CNN and two baseline classifiers, i.e., random forests (RFs) and support vector machines (SVMs).

Keywords: capsule network; hyperspectral; image classification; deep learning; probability density
1. Introduction

Hyperspectral imaging plays an important role in the accurate measurement, analysis, and interpretation of land scene spectra [1–3]. A hyperspectral image (HSI) [4] often consists of hundreds of spectral bands which provide valuable information for many applications, e.g., land cover classification [5], target detection [6], precision agriculture [7], and water resource management [8]. Image classification is the most important technique for labeling the category of each pixel based on spatial-spectral information [9–11]. HSI classification has become a very active area of research in recent years [4,9]. Over the past decades, various classification methods, such as random forests (RFs) [12,13]
and support vector machines (SVMs) [14,15], have been employed in HSI classification tasks [16,17]. However, most of these methods suffer from various challenges, such as the curse of dimensionality and the spatial variability of spectral signatures, or they focus only on spectral variability. Recently, deep learning techniques, such as stacked autoencoders (SAEs) [18] and convolutional neural networks (CNNs) [19], have shown considerable potential for high-order data classification [20] such as 3-D HSI classification [21]. The strong capacity for learning effective representations with multiple levels of abstraction for hyperspectral data allows these techniques to provide a special perspective for feature learning and reasoning [22–24].

The recent literature has shown that CNNs deliver state-of-the-art performance for HSI classification [25–28]. The most important advantage of CNNs is that distributed features can be extracted with weight-shared convolutional kernels through processing, i.e., convolution, spatial pooling, and full connection, at different levels [11]. Hu et al. [21] proposed a five-layer CNN for 3-D HSI data using spectral information without considering the spatial context [29], and achieved a better classification performance than several traditional methods. They found that CNN could be used effectively for HSI classification after building an appropriate network architecture. However, the abovementioned frameworks only consider the spectral signature, while the information in the spatial domain is neglected [11].

Although CNNs have achieved state-of-the-art performance, they have two conspicuous flaws [30,31]: (1) their failure to consider spatial hierarchies between features; and (2) their lack of rotational invariance. Thus, Sabour et al. [30] presented a novel type of neural network called the capsule network (CapsNet). Via the dynamic routing process (i.e., improving or replacing backpropagation) and vectorizing the result (i.e., substituting for the scalar output), CapsNets may overcome the abovementioned shortcomings. Luo et al. [11] first tried to apply CapsNet to adapt to 3-D HSI data. They compared a CNN-based approach with the CapsNet-based model and concluded that the CapsNet-based method did not provide the expected benefits, and no deep insight was revealed, e.g., whether the two networks are comparable. In general, a comparison of methods depends on the comparison of classification maps or accuracies. We tried to make CNN and CapsNet as comparable as possible to facilitate a more profound exploration or improvement, which was the purpose behind wrapping them together and designing a comparable paradigm of network architecture. Thus, the objectives of this work were to introduce the state-of-the-art CapsNet into the HSI classification task and conduct a fair comparison between CapsNet and CNN.

The existing research on HSI classification often concentrates on the information in the spectral and spatial domains (e.g., the 3-D CNNs proposed in [29,32]), and considers the spectral and spatial variabilities; however, the spatial hierarchies between features have seldom been explored. Spatial hierarchies may consist of features, sizes, places, contexts, pyramids, or even ultrametrics based on spatially perceptive relationships.
The merits of our paper are that: (1) a modified two-layer CapsNet with limited training samples has been presented for HSI classification; (2) a comparable paradigm of network architecture design has been proposed for the comparison of CNN and CapsNet; and (3) CapsNet has much higher confidence for the predicted probabilities, which has been confirmed by probability maps and uncertainty analysis. CNNs have shown remarkable performance for HSI classification tasks; however, they typically require a substantial amount of training samples [31]. Reducing the training data may therefore be a favorable change for deep learning models. Makantasis et al. [33] presented a tensor-based scheme for HSI classification and analysis instead of a CNN for the case when few training samples are available. In order to better clarify the intrinsic logic of this study, we provide a conceptual overview of HSI classification with CapsNet (see Figure 1).
Figure 1. Overview of HSI classification with CapsNet.
The rest of this paper is structured as follows. The necessary background is introduced in Section 2, and the proposed CapsNet and a comparable paradigm of network architecture design are illustrated in Section 3. Experiments and analysis for the two HSI datasets are presented and discussed in Sections 4 and 5, respectively. Finally, the conclusions of this work are summarized in Section 6.

2. Related Work

In this section, we provide a brief introduction to CNNs and CapsNets.

2.1. CNNs

CNNs are a type of deep neural network architecture based on shared weights (i.e., parameter sharing) [24]. Additionally, CNNs have been the most popular framework in recent years. The existing literature has shown that CNNs present better performance for HSI classification than conventional machine learning algorithms [19,26]. A typical CNN (see Figure 2) is composed of an input layer and an output layer as well as many hidden layers that are alternately stacked convolution, normalization, spatial pooling (i.e., downsampling or subsampling), and fully connected (i.e., dense) layers [21,24,27]. The convolutional layer extracts feature maps by kernel-size-specified linear convolutional filters followed by nonlinear activation functions (e.g., rectified linear units (ReLU)). The normalization layers often normalize the input data to zero mean and unit variance, and then apply a linear transformation. The spatial pooling layers group local features together from spatially adjacent pixels to improve the robustness of the neural network to slight deformations of objects. After several convolutional layers and pooling layers, the fully connected layers can perform high-level reasoning such as classification.
Figure 2. A typical CNN which is composed of an input layer, an output layer, the convolution, normalization, spatial pooling, and fully connected layers. A typical CapsNet consists of an input layer; an output layer; the Conv1d, PrimaryCaps, and DigitCaps layers; a 3-D hyperspectral cube with 200 bands; and a 3-D input patch with size 7 × 7. The 3-D HSI patches are preprocessed for the deep learning models. The pixel cubes are fed into the conventional classifiers, i.e., RFs and SVMs.
2.2. CapsNets

CapsNets (or CAPs) represent a completely novel type of deep learning architecture that attempts to overcome the limits and drawbacks of CNNs [30], such as lacking the explicit notion of an entity and losing valuable information during max-pooling. A typical CapsNet (see Figure 2) is shallower, with three layers, i.e., the Conv1d, PrimaryCaps, and DigitCaps layers. Sabour et al. [30] presented a capsule-based representation of a group of hidden neurons, in which not only the likelihood but also the properties of the hidden features could be captured. In such a scenario, CapsNet was robust to affine transformation and required fewer training data. In addition, CapsNet has resulted in some special breakthroughs related to spatial hierarchies between features. A capsule represents a group of neurons [30]. The activities of the neurons enclosed in an active capsule represent the various properties of a particular entity. In addition, the overall length of a capsule represents the existence of an entity, and the orientations of a capsule indicate its properties. A CapsNet is a discriminatively trained multi-layer capsule system [30]. Because the length of the output vector represents the probability of existence, an output capsule is computed using a nonlinear squashing function:
v_j = (||s_j||^2 / (ε + ||s_j||^2)) · (s_j / ||s_j||),  (1)

where v_j is the vector output of capsule j and s_j is its total input. The nonlinear squashing function is an activation function to ensure that short vectors get shrunk to almost zero length and long vectors get shrunk to a specific length with respect to ε.

s_j = Σ_i c_ij W_ij u_i.  (2)

The total input to a capsule s_j is obtained by multiplying the output u_i of a capsule by a weight matrix W_ij, and represents a weighted sum over all predicted vectors from the capsules in the layer below. Here, c_ij denotes a coupling coefficient that is determined by the iterative dynamic routing process:
c_ij = exp(b_ij) / Σ_k exp(b_ik),  (3)
where b_ij and b_ik are the log prior probabilities between two coupled capsules. In every capsule layer, each capsule outputs a local grid of vectors to each type of capsule in the layer above, and then different transformation matrices for each grid member and each capsule type are used to obtain an equal number of classes. The dynamic routing [30] is implemented as an iterative process. A lower-level capsule sends its output to a higher-level capsule whose vectors have a large scalar product with the prediction coming from the lower-level capsule. The overall length of the output vector represents the predicted probability [30]. A separate margin loss L_k for each capsule k can be given as

L_k = T_k max(0, m+ − ||v_k||)^2 + λ(1 − T_k) max(0, ||v_k|| − m−)^2,  (4)

where T_k = 1, m+ = 0.9, and m− = 0.1 are three free parameters by default. λ enables down-weighting of the loss and helps ensure final convergence.
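To make Equations (1)–(4) concrete, the following minimal NumPy sketch illustrates the squashing function, the routing softmax over the coupling coefficients, a dynamic routing loop, and the margin loss. It is a simplified illustration of the formulas above rather than the exact implementation used in our experiments; the array shapes, variable names, and the down-weighting value λ = 0.5 are assumptions introduced only for this example.

import numpy as np

def squash(s, eps=1e-7):
    # Equation (1): shrink short vectors toward zero and long vectors toward a fixed length.
    sq_norm = np.sum(s ** 2, axis=-1, keepdims=True)
    return (sq_norm / (eps + sq_norm)) * (s / np.sqrt(sq_norm + eps))

def routing_softmax(b):
    # Equation (3): coupling coefficients c_ij from the log priors b_ij.
    e = np.exp(b - b.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def dynamic_routing(u_hat, n_iter=3):
    # u_hat holds the predicted vectors W_ij u_i, shape (n_in, n_out, dim).
    n_in, n_out, _ = u_hat.shape
    b = np.zeros((n_in, n_out))                       # log priors, initialized to zero
    for _ in range(n_iter):
        c = routing_softmax(b)                        # Equation (3)
        s = (c[..., None] * u_hat).sum(axis=0)        # Equation (2): s_j = sum_i c_ij W_ij u_i
        v = squash(s)                                 # Equation (1)
        b = b + (u_hat * v[None, ...]).sum(axis=-1)   # agreement update between levels
    return v

def margin_loss(v, T, m_pos=0.9, m_neg=0.1, lam=0.5):
    # Equation (4); T is a one-hot class vector, lam = 0.5 is an assumed default.
    length = np.linalg.norm(v, axis=-1)
    return np.sum(T * np.maximum(0.0, m_pos - length) ** 2
                  + lam * (1 - T) * np.maximum(0.0, length - m_neg) ** 2)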
Because CapsNets are recently proposed neural network architectures, only a few studies have explored their applications. Xi et al. [31] attempted to find the best set of configurations that yielded the optimal test error on complex data. Afshar et al. [34] adopted and incorporated a CapsNet for brain tumor classification and proved that it could overcome the defects of CNNs successfully. Jaiswal et al. [35] presented a generative adversarial capsule network, which was a framework that used a CapsNet instead of the standard CNN as the discriminator. Kumar et al. [36] proposed a novel method for traffic sign detection using a CapsNet that achieved outstanding performance. LaLonde and Bagci [37] expanded the use of CapsNets to the task of object segmentation for the first time and achieved a promising segmentation accuracy. Li et al. [38] built a CapsNet to recognize rice composites from UAV images. Qiao et al. [39] captured high-level features using a CapsNet to reconstruct image stimuli from human fMRI, achieving higher accuracy than all existing state-of-the-art methods. Wang et al. [40] proposed a CapsNet structure based on a recurrent neural network for sentiment analysis. Zhao et al. [41] were the first to investigate CapsNets empirically for text modeling. Therefore, CapsNets have potential in most research fields and deserve attention. Concerning HSI classification, Luo et al. [11] first modified CapsNet and attempted to apply it to adapt to 3-D HSI data. However, they did not find benefits of CapsNet in comparison with CNN. Actually, CapsNets are still in their infancy [42]. Hence, there have been few documented studies on their applications at present.

3. Proposed Approach

In this section, we describe the details of the proposed approach. When making a fair comparison between deep learning models, it is very important to keep their structures, sizes, and settings as similar as possible. The design principles of the network architecture and the CapsNet with its parameter settings are described in the following subsections.

3.1. Network Design

The proposed paradigm of network architecture design is illustrated in Figure 3. The main objective of the designed network architecture is to create a comparable paradigm to suppress any diversity, however small, and to have comparable aspects between CNN and CapsNet. Moreover, we tried to improve CapsNet to reap the expected benefits compared to CNNs.

Given that neural networks with deeper layers may not always have better classification results for HSI classification [43], both CNN and CapsNet are relatively small, shallow neural networks. As shown in Figure 3, there is only one convolutional layer (i.e., a feature extraction layer) in the network architecture, and all models are two-layer network structures. The differences between CNN1 (or CAP1) and CNN2 (or CAP2) are that the former contains a composite block "conv-bn-relu", whereas the latter
incorporates an equal number of stacked layers with the batch normalization (BN) layer and ReLU layer reversed. First, the input data (i.e., 3-D patches) are sent to the convolutional layer, and the data size is (P_Row, P_Col, Bands), where Bands is the number of channels of a 3-D hyperspectral cube. The kernels are of size 4 × 4, the number of filters is 64, and the stride is taken as 1 by default. Then, the convolutional layer (i.e., Conv) is followed by a normalization layer. As mentioned before, CNNs can extract features in the lower layers correctly. Therefore, we keep only one convolutional layer in our network architecture. The following BN layer is expected to speed up the subsequent training and reduce the sensitivity to network initialization. The pooling operation is a very primitive form of routing [30]. Its good performance and surprising effectiveness have made spatial pooling the chosen processing. The composite capsule layer is actually regarded as having the same functionality as a dense layer, which is expected to decode the final output into a specific mapping with the identified classes. Here, the number of capsules and the hidden units of the dense layer are equal to the number of classes. The dimension of the capsules is set to 64, and there are three routings. The final output of the network is a (1, classes) vector. If the ith element (i.e., the predicted probability) in the vector has the maximum value, then the ith label is the predicted label for the input sample. Specifically, CNN and CapsNet are both nominally two-parameter-layer structures. Meanwhile, the number of layers is the total number of convolutional layers and dense layers (or capsule layers).

Figure 3. Proposed CNN and CapsNet. Here, Si {i = 1, 2} denotes the key sub-flows, and Pj {j = 1, 2, 3, 4} represents a set of parameter settings. CNN1 paired with CNN2 and CAP1 paired with CAP2 are intertwined relationships with few differences. Note that CNN and CapsNet are completely independent networks herein.
3.2. Parameter Learning

In our experiments, each pixel and its neighbors in a 7 × 7 patch are taken as a single sample. Thus, the data size of each sample is 7 × 7 × Bands. In general, the size of the convolutional filters in CNN or CapsNet is W, where W can be 3, 5 or higher. The 1 × 1 convolutional kernel has been excluded because it can extract features only from the different bands and not in the spatial domain. Additionally, only one convolutional layer is employed and is followed by a max-pooling layer. Thus, we set the convolutional kernel size to 4 × 4. A dropout layer can improve the performance by mitigating the overfitting problem via discarding some co-adaptation of hidden units. Using the dropout operation, the network can learn more robust features and reduce the effect of noise. Therefore, there is one
dropout layer in CNN, and its probability is set to 0.6. In the training and inference procedure, the training samples are initially randomly divided into a few size-specific batches. The batch size is set to 64 in our experiments. CNN and CapsNet are each trained using a first-order gradient-based optimization of stochastic objective functions, referred to as the Adam algorithm. For each of the 200 epochs, only one batch is sent to the sequential models for training at a time. The training procedure does not stop until it reaches the predetermined maximum number of iterations, except in the case of an early stop option. In the test procedure, the test samples are also fed into the sequential models, and the final predicted labels can be obtained by finding the maximum value in the output vector.

3.3. Sampling Strategy

As done in previously published studies, the spectral values can be taken in a vector format (i.e., 1-D form) as input to CNN or CapsNet to conduct HSI classification [44]. However, using the 1-D form of hyperspectral data does not take full advantage of the ability of neural networks to extract spatial hierarchical characteristics [45,46]. Therefore, one single patch (i.e., the pixel's neighborhood or a square patch around the central pixel) with all spectral bands is labeled as a category's sample for the subsequent training and testing. To make both network structures as comparable as possible, all designed ground truth samples are split at a fixed size. Because small datasets are used for training with the proposed structures, only limited samples are chosen for training. After the training set is determined by randomly selecting 60 samples per class, the validation set is extracted with the same size as the training set. The rest of the samples outside the training and validation sets are taken as the test set. Parameter analysis of the impact of the training size demonstrates that the determined size of the training set is suitable in this study. Due to the small size of the training set, the performance may depend on the selected training samples, i.e., a training set with good representation is likely to provide a better performance (and vice versa). Therefore, the classification accuracies may vary according to the selected training samples [27].
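The per-class split described above (60 training and 60 validation samples per class, the remainder for testing) and the 7 × 7 patch extraction can be sketched as follows. This is a minimal illustration assuming a 2-D ground truth array gt with zeros for unlabeled pixels; the function and variable names are hypothetical and it is not the exact sampling code used in this study.

import numpy as np

def split_per_class(gt, n_train=60, n_val=60, seed=0):
    """Randomly pick n_train + n_val labeled pixels per class; the rest become test pixels."""
    rng = np.random.default_rng(seed)
    train_idx, val_idx, test_idx = [], [], []
    for c in np.unique(gt):
        if c == 0:                      # class 0 encodes non-ground-truth pixels
            continue
        rows, cols = np.nonzero(gt == c)
        order = rng.permutation(len(rows))
        idx = list(zip(rows[order], cols[order]))
        train_idx += idx[:n_train]
        val_idx += idx[n_train:n_train + n_val]
        test_idx += idx[n_train + n_val:]
    return train_idx, val_idx, test_idx

def extract_patch(cube, row, col, size=7):
    """Cut a size x size x Bands patch around the central pixel (cube padded at the borders)."""
    r = size // 2
    padded = np.pad(cube, ((r, r), (r, r), (0, 0)), mode='reflect')
    return padded[row:row + size, col:col + size, :]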
4. Results and Analysis

In this section, we compare CapsNet with CNN using accuracy metrics, i.e., overall accuracy (OA), average accuracy (AA) and the Kappa coefficient (Kappa or K). The experimental results and analysis are reported in the following subsections.

4.1. Dataset Description

To evaluate CNN and CapsNet, two public real hyperspectral datasets (see Figure 4) with different spatial resolutions are used for the experiments. The PaviaU (PU) and SalinasA (SA) datasets are openly accessible online (http://www.ehu.eus/ccwintco/index.php?title=Hyperspectral_Remote_Sensing_Scenes). The PU dataset (see Figure 4a,b) was collected over an urban area, and the SA dataset was collected in a natural area. The PU dataset was acquired by the ROSIS sensor during a flight campaign over Pavia University, northern Italy. The number of spectral bands is 103 after the thirteen noisiest bands were discarded. The hyperspectral dataset has a size of 610 × 340. Its spatial resolution is 1.3 m per pixel (m/p). There are nine classes included in the ground truth map and 42,776 labeled samples (Table 1). The PU dataset was provided by Prof. Paolo Gamba from the Telecommunications and Remote Sensing Laboratory, University of Pavia (Italy).

The SA dataset (see Figure 4c,d) was acquired by the 224-band AVIRIS sensor over Salinas Valley, California, USA, and is characterized by a high spatial resolution of 3.7 m per pixel. It comprises 83 × 86 pixels located within the Salinas scene. After the 20 water absorption bands were removed, 204 out of 224 bands were retained. The ground truth map is composed of 5348 pixels (see Table 1) and includes six land cover classes.

As listed in Table 1, there are 42,776 labeled samples covering nine categories in the PU dataset, and 164,624 unlabeled samples are coded as class C0. There are 5348 labeled samples covering six categories in the SA dataset, and 1790 unlabeled samples are coded as class C0. The size of the training samples of all classes is set to 60, the same as the size of the validation samples. Except for the samples included in
the training and validation sets, all other samples are taken as the test set. Apart from the importance of the selected training samples for determining the final outputs and accuracies, the type of dataset is actually another non-negligible factor. Moreover, the SA dataset has a large proportion of ground truth samples relative to the entire scene and a lower intra-class variability. These differences are crucial for investigating the robustness or representation of every model or classifier on simple or complex data. This is why we used these two datasets.
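For reference, the PU and SA scenes are distributed as MATLAB files on the web page cited above, and loading them into NumPy arrays can be sketched as follows. The file and variable names ('PaviaU.mat', 'paviaU', and so on) are the ones commonly used for these downloads and are assumptions to be checked against the actual files.

from scipy.io import loadmat

# Hypothetical file and variable names; verify against the downloaded .mat files.
pu_cube = loadmat('PaviaU.mat')['paviaU']          # (610, 340, 103) reflectance cube
pu_gt   = loadmat('PaviaU_gt.mat')['paviaU_gt']    # (610, 340) labels, 0 = non-ground truth

sa_cube = loadmat('SalinasA_corrected.mat')['salinasA_corrected']  # (83, 86, 204)
sa_gt   = loadmat('SalinasA_gt.mat')['salinasA_gt']                # (83, 86)

print(pu_cube.shape, (pu_gt > 0).sum())   # expect 42,776 labeled PU pixels
print(sa_cube.shape, (sa_gt > 0).sum())   # expect 5348 labeled SA pixels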
Table 1. Definitions of ground truth classes and samples for the PU and SA datasets.

Dt  Cn  Class                       Samples  Train  Test    Val
PU  C0  Non-ground truth            164,624  0      0       0
    C1  Asphalt                     6631     60     6511    60
    C2  Meadows                     18,649   60     18,529  60
    C3  Gravel                      2099     60     1979    60
    C4  Trees                       3064     60     2944    60
    C5  Painted metal sheets        1345     60     1225    60
    C6  Bare Soil                   5029     60     4909    60
    C7  Bitumen                     1330     60     1210    60
    C8  Self-Blocking Bricks        3682     60     3562    60
    C9  Shadows                     947      60     827     60
SA  C0  Non-ground truth            1790     0      0       0
    C1  Brocoli_green_weeds_1       391      60     271     60
    C2  Corn_senesced_green_weeds   1343     60     1223    60
    C3  Lettuce_romaine_4wk         616      60     496     60
    C4  Lettuce_romaine_5wk         1525     60     1405    60
    C5  Lettuce_romaine_6wk         674      60     554     60
    C6  Lettuce_romaine_7wk         799      60     679     60
Figure 4. Real hyperspectral datasets: (a) pseudo-color image of the PU dataset; (b) ground truth samples of the PU dataset; (c) pseudo-color image of the SA dataset; and (d) ground truth samples of the SA dataset.
4.2. Experimental Setup

Our experimental platform is a laptop equipped with an Intel Core i7-4810MQ 8-core 2.80 GHz processor, 16 GB of memory, and a 4 GB NVIDIA GeForce GTX 960M graphics card. The training
procedures run on the GPU to achieve high computational speeds. Because we train our networks with small training data, the training of both datasets can be completed within a few minutes, with five independent runs and 200 epochs per run. This is rather fast and shows the high efficiency of the shallower deep learning models.

For the proposed CapsNet, the additive convolutional layer has convolutional kernels with a size of 4 × 4, 64 filters, a stride of 1, and a "ReLU" activation. The later layers are a batch normalization layer and a max-pooling layer, which are included for consistency with CNN. Note that the kernel size and number of filters should be adjusted to ensure that the capsule layer has the correct input dimension and acquires the qualified vectors. The final layer contains the 16-D capsules per class, and a capsule receives input from all capsules in the layer below. To ensure a complete comparison with CNN and to improve CapsNet, we kept the parameter settings of every model as similar as possible. We ran the experiments five times with every model or classifier for each dataset, and kept the sizes of the training and validation sets identical. These sets were randomly shuffled to reduce the influence of random effects, and finally the statistical accuracies were recorded.

4.3. Training Details

The training procedure for the PU dataset involves the determination of training samples and recording the accuracy and loss of the training and validation procedures. The sampling strategy is critical to obtain a good performance for every model using the limited known samples.
The accuracy and loss of the training and validation procedures may depend on many factors, and show whether a model is qualified enough and whether its parameter configuration is ready for the subsequent parameter learning. Additionally, the spatial distribution of the training and validation samples and the accuracy and loss of the training and validation procedures with increasing iterations are plotted in this subsection. The SA dataset covers a natural area, which has many ground truth samples relative to non-ground truth pixels, unlike the PU dataset. Moreover, the intra-class variability of every class in the SA dataset is relatively low. Consequently, all models and classifiers show a high performance.

Figures 5 and 6 illustrate the training, test and validation sets (i.e., the central pixels) with 60 randomly selected samples per class for the PU and SA datasets. For the nine classes (Class 1–Class 9), the scattered samples of the different classes have exactly the same color as the ground truth classes. These sparse samples are fed into the different models and classifiers to determine the best model weights and parameters for the subsequent inference and label assignment. The spatial distribution of the designed sets illustrated here is the first of five independent runs. For the four other independent runs, the training samples are randomly selected at a fixed size as well. For each run, 60 samples per class are randomly selected for training and an additional 60 samples for validation, while the rest of the samples are available for testing.
Figure 5. Spatial distribution of training, test and validation samples for the PU dataset: (a) training samples for fitting CNN and CAP, as well as RF and SVM classifiers; (b) test samples for predicting CNN and CAP; (c) validation samples for evaluating CNN and CAP; and (d) test samples for evaluating RF and SVM classifiers. "PU-Trn", "PU-Tst-1" and "PU-Val-1" are the selected samples for training, testing and verifying CNN and CAP, respectively, while "PU-Trn" and "PU-Tst-2" represent the selected samples for training and testing RF and SVM classifiers, respectively.
Figure 6. Spatial distribution of training, test and validation samples for the SA dataset: (a) training samples for fitting CNN and CAP, as well as RF and SVM classifiers; (b) test samples for predicting CNN and CAP; (c) validation samples for evaluating CNN and CAP; and (d) test samples for evaluating RF and SVM classifiers. "SA-Trn", "SA-Tst-1" and "SA-Val-1" are the selected samples for training, testing and verifying CNN and CAP, respectively, while "SA-Trn" and "SA-Tst-2" represent the selected samples for training and testing RF and SVM classifiers, respectively.
According to Figure 7, for the PU dataset, the CNN- and CAP-based models appear to converge either locally or globally, and show good convergence behavior. The CAP-based models stabilize much faster, at ~50 epochs, while the CNN-based models attain stability at ~100 epochs. In addition, the CNN-based models show a better representation of the local convergence than the CAP-based models. Additionally, all independent runs show good convergence. As for the SA dataset, all models have some difficulties in handling the local portions because the SA dataset is a relatively simple dataset. Therefore, some problems may arise, such as vanishing gradients, for every independent run. Note that an unnecessary PCA transformation or layer normalization should be avoided for a simple dataset, considering possible problems such as exploding gradients.
Figure 7. Accuracy and loss of: (a) CAP1; (b) CAP2; (c) CNN1; and (d) CNN2 for the PU dataset with increasing iterations, where the horizontal axis represents 200 epochs per independent run.
Fine-tuning the hyperparameters of the classifiers allows us to obtain a set of optimal parameters. The parameters of the RF and SVM classifiers can be optimized by a cross-validated grid search over a parameter grid to obtain a robust baseline classifier. In this study, we fine-tuned five parameters of the RF classifier, i.e., the quality criterion of a split, the maximum depth of the trees, the number of features, the minimum number of samples required to split, and the number of trees in the forest. Two parameters, i.e., the penalty parameter of the error term and the kernel coefficient for "rbf", were fine-tuned for the SVM classifier. The implementation of the SVM classifier was based on "libsvm" with a one-vs.-one scheme. Moreover, all grid searches were calculated using five-fold cross-validation.
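A minimal scikit-learn sketch of this cross-validated grid search is given below. The candidate values in the parameter grids are small illustrative ranges (assumptions), not the exact grids searched in this study, and x_train_flat / y_train are hypothetical flattened training arrays.

from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

# Five RF hyperparameters tuned in this study (illustrative candidate values).
rf_grid = {'criterion': ['gini', 'entropy'],
           'max_depth': [4, 8, 16],
           'max_features': ['sqrt'],
           'min_samples_split': [2, 4],
           'n_estimators': [16, 32, 64]}
rf_search = GridSearchCV(RandomForestClassifier(), rf_grid, cv=5)

# Two SVM hyperparameters: penalty C and RBF kernel coefficient gamma (libsvm-based SVC, one-vs.-one).
svm_grid = {'C': [1e2, 1e3, 1e4], 'gamma': [0.01, 0.1, 1.0]}
svm_search = GridSearchCV(SVC(kernel='rbf', decision_function_shape='ovo'), svm_grid, cv=5)

# rf_search.fit(x_train_flat, y_train);  print(rf_search.best_params_, rf_search.best_score_)
# svm_search.fit(x_train_flat, y_train); print(svm_search.best_params_, svm_search.best_score_)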
For the PU and SA datasets with the RF classifier, the best parameters obtained in the first independent run are as follows: (1) the Gini impurity, which is taken as the quality criterion; (2) the maximum depth, determined as 8, which is the value until all leaves contain less than the minimum number of samples; (3) the maximum number of features, which is calculated by a square root operation on the number of features; (4) the minimum number of samples required to split an internal node, considered to be 2; and (5) the number of trees, which is optimized to 32. Finally, the best score is 0.7593 and 0.9639, respectively. For the PU and SA datasets with the SVM classifier, the best parameters obtained in the first independent run are the penalty parameter, which is fixed at 10,000.0 and 1000.0, respectively, and the kernel coefficient for "rbf", which is determined to be 0.01 and 0.1, respectively. Finally, the best score is 0.8074 and 0.9833, respectively.

4.4. Classification Maps

When the training procedure is successfully completed, the next step is to have the unlabeled samples categorized into the proper classes. In such a case, two kinds of scenes can be obtained. One is the whole scene covering the raw hyperspectral image, and the other covers all ground truth samples, referred to as the reference scene. Classification maps for the PU dataset have distinct features owing to it being an urban area, the wider coverage, and more categories.

Figure 8 illustrates the different classification maps obtained by the CNN- and CAP-based models for the PU dataset with 60 randomly selected training samples per class. Misclassifications are mainly caused by some fragmental structures or intra-class variability. The results of the subsequent accuracy assessment and the probability maps also support such a conclusion. Probability maps are utilized to observe the probability density and find the weak predictions. The CNN- and CAP-based models have promising outputs compared to the two baseline classifiers, i.e., the RF and SVM classifiers. Furthermore, most errors of commission and omission occur in the non-homogeneous areas involving complex landscape structures or materials. Such results are commonly seen in the different random runs. Classification maps of every model or classifier for the SA dataset are rather satisfactory except for the bottom-right corner and the cross-area (i.e., the transitional buffer) spanning the different sampling regions. Good performance is observed in Figure 9. Misclassifications are mainly caused by some inherent uncertainties between classes. Because the SA dataset is a relatively simple dataset, all models obtain fairly good results and accuracies, surpassing the conventional baseline classifiers, i.e., the RF and SVM classifiers.
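The two kinds of scenes can be produced by running a trained model over either every pixel of the raw image (whole scene) or only the labeled pixels (reference scene) and reshaping the predicted labels back to the image grid. A minimal sketch is shown below; it assumes a fitted Keras-style model and the hypothetical extract_patch helper sketched in Section 3.3, and is not the exact mapping code used here.

import numpy as np

def predict_scene(model, cube, gt=None, patch_size=7, batch_size=64):
    """Predict a label map; if gt is given, only labeled (reference-scene) pixels are predicted."""
    rows, cols, _ = cube.shape
    label_map = np.zeros((rows, cols), dtype=np.int32)
    coords = [(r, c) for r in range(rows) for c in range(cols)
              if gt is None or gt[r, c] != 0]
    for start in range(0, len(coords), batch_size):
        chunk = coords[start:start + batch_size]
        patches = np.stack([extract_patch(cube, r, c, patch_size) for r, c in chunk])
        probs = model.predict(patches, verbose=0)
        for (r, c), p in zip(chunk, probs):
            label_map[r, c] = np.argmax(p) + 1   # assumes 1-based class labels; 0 stays non-ground truth
    return label_map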
Figure 8. Classification maps (reference scene) of: (a) CAP1; (b) CAP2; (c) CNN1; (d) CNN2; (e) RF; and (f) SVM for the PU dataset; and (g) ground truth samples of the PU dataset.
Figure 9. Classification maps (reference scene) of: (a) CAP1; (b) CAP2; (c) CNN1; (d) CNN2; (e) RF; and (f) SVM for the SA dataset; and (g) ground truth samples of the SA dataset.
4.5. Classification Accuracies

Several widely used accuracy metrics, i.e., the OA, AA, K, and the accuracy of each class, are used to assess the final classification results. These metrics are derived from the site-specific confusion matrix. For the PU dataset, regarding CAP1, many samples of Class 1 (Asphalt) are wrongly classified as Class 7 (Bitumen), and several samples of Class 6 (Bare Soil) are classified as Class 2 (Meadows). Concerning CAP2, many samples of Class 1 are wrongly classified as Class 7, and many samples of Class 2 are classified as Class 6. The apparent omission errors appear in Class 1, and Class 8 (Self-Blocking Bricks) is meant to have a considerable intra-class variability. With respect to CNN1, the apparent errors occur between Class 1 and Class 7, Class 2 and Class 6, and Class 3 (Gravel) and Class 8. As for the two baseline classifiers, except for Class 5 (Painted metal sheets) and Class 9 (Shadows), the other classes have obvious omission and commission errors. For Class 4 (Trees) with the RF and SVM classifiers, many samples are wrongly classified as Class 2, and more samples of Class 2 are incorrectly classified as Class 4. The apparent commission and omission errors between the class pairs demonstrate the low inter-class variability. The classification accuracies obtained for the SA dataset are fairly good. For all models and classifiers, there are no obvious commission and omission errors. Such results indicate the superior performance (i.e., the saturated accuracies) when using a relatively simple dataset.
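Since the OA, AA, and Kappa coefficient reported in Tables 2 and 3 are all derived from the confusion matrix, the following short sketch shows how they can be computed; it is a generic illustration assuming integer label arrays indexed from 0, not the exact evaluation code used here.

import numpy as np

def accuracy_metrics(y_true, y_pred, n_classes):
    # Build the confusion matrix: rows = reference classes, columns = predicted classes.
    cm = np.zeros((n_classes, n_classes), dtype=np.int64)
    for t, p in zip(y_true, y_pred):
        cm[t, p] += 1
    total = cm.sum()
    oa = np.trace(cm) / total                        # overall accuracy
    aa = np.mean(np.diag(cm) / cm.sum(axis=1))       # mean of per-class accuracies
    pe = np.sum(cm.sum(axis=0) * cm.sum(axis=1)) / total ** 2
    kappa = (oa - pe) / (1 - pe)                     # Kappa coefficient
    return oa, aa, kappa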
Based on Table 2, the classification accuracies can be summarized as follows: (1) both network structures show promising performance; (2) the CAP-based models are slightly superior to the CNN-based models in terms of the Kappa, OA, and AA; and (3) most of the commission and omission errors occur in Class 1 (Asphalt) and Class 7 (Bitumen), Class 2 (Meadows) and Class 6 (Bare Soil), and Class 3 (Gravel) and Class 8 (Self-Blocking Bricks). In addition, compared to the classification accuracies of the CNN presented by Yu et al. [27], our CNN-based models achieve better performance. Table 3 shows that: (1) all models and classifiers reach nearly saturated performance; (2) most of the commission and omission errors are caused by Class 2 (Corn_senesced_green_weeds) and Class 3 (Lettuce_romaine_4wk); and (3) CAP and CNN have similar accuracies for all classes.
Table 2. Classification accuracies for the PU dataset with five runs.

         CAP1-PU           CAP2-PU           CNN1-PU           CNN2-PU           RF-PU             SVM-PU
K        0.9324 ± 0.0242   0.9456 ± 0.0181   0.9345 ± 0.0130   0.9332 ± 0.0221   0.6450 ± 0.0147   0.7070 ± 0.0094
OA       0.9490 ± 0.0185   0.9590 ± 0.0138   0.9511 ± 0.0097   0.9496 ± 0.0169   0.7189 ± 0.0124   0.7703 ± 0.0071
AA       0.9542 ± 0.0112   0.9627 ± 0.0138   0.9367 ± 0.0174   0.9563 ± 0.0089   0.7869 ± 0.0098   0.8273 ± 0.0096
C1       0.8943 ± 0.0545   0.9277 ± 0.0264   0.9468 ± 0.0279   0.9382 ± 0.0407   0.6604 ± 0.0143   0.7629 ± 0.0248
C2       0.9686 ± 0.0278   0.9659 ± 0.0210   0.9807 ± 0.0070   0.9507 ± 0.0272   0.6892 ± 0.0237   0.7374 ± 0.0082
C3       0.9229 ± 0.0312   0.8882 ± 0.1190   0.8518 ± 0.0782   0.8851 ± 0.0404   0.6364 ± 0.0284   0.7171 ± 0.0438
C4       0.9797 ± 0.0096   0.9761 ± 0.0121   0.9724 ± 0.0178   0.9720 ± 0.0124   0.9142 ± 0.0250   0.9042 ± 0.0259
C5       1.0000 ± 0.0000   1.0000 ± 0.0000   1.0000 ± 0.0000   1.0000 ± 0.0000   0.9928 ± 0.0052   0.9969 ± 0.0016
C6       0.9609 ± 0.0338   0.9781 ± 0.0284   0.9212 ± 0.0241   0.9483 ± 0.0321   0.7191 ± 0.0334   0.7738 ± 0.0324
C7       0.9831 ± 0.0137   0.9800 ± 0.0085   0.8783 ± 0.1855   0.9620 ± 0.0125   0.8296 ± 0.0205   0.8537 ± 0.0288
C8       0.8796 ± 0.0806   0.9483 ± 0.0224   0.8799 ± 0.0649   0.9504 ± 0.0196   0.6612 ± 0.0332   0.7015 ± 0.0380
C9       0.9985 ± 0.0009   0.9998 ± 0.0005   0.9993 ± 0.0006   1.0000 ± 0.0000   0.9793 ± 0.0072   0.9986 ± 0.0013

Table 3. Classification accuracies for the SA dataset with five runs.

         CAP1-SA           CAP2-SA           CNN1-SA           CNN2-SA           RF-SA             SVM-SA
K        0.9992 ± 0.0007   0.9991 ± 0.0004   0.9994 ± 0.0004   0.9991 ± 0.0003   0.9647 ± 0.0073   0.9845 ± 0.0035
OA       0.9994 ± 0.0005   0.9993 ± 0.0003   0.9995 ± 0.0003   0.9993 ± 0.0002   0.9720 ± 0.0058   0.9877 ± 0.0028
AA       0.9995 ± 0.0002   0.9992 ± 0.0003   0.9993 ± 0.0004   0.9989 ± 0.0003   0.9704 ± 0.0051   0.9883 ± 0.0025
C1       1.0000 ± 0.0000   1.0000 ± 0.0000   1.0000 ± 0.0000   1.0000 ± 0.0000   0.9940 ± 0.0019   0.9952 ± 0.0024
C2       0.9990 ± 0.0020   0.9987 ± 0.0019   1.0000 ± 0.0000   0.9997 ± 0.0007   0.9574 ± 0.0170   0.9782 ± 0.0056
C3       0.9984 ± 0.0008   0.9968 ± 0.0027   0.9956 ± 0.0027   0.9940 ± 0.0022   0.9266 ± 0.0070   0.9763 ± 0.0082
C4       0.9994 ± 0.0008   1.0000 ± 0.0000   1.0000 ± 0.0000   1.0000 ± 0.0000   0.9974 ± 0.0008   0.9944 ± 0.0041
C5       1.0000 ± 0.0000   1.0000 ± 0.0000   1.0000 ± 0.0000   1.0000 ± 0.0000   0.9954 ± 0.0052   0.9971 ± 0.0030
C6       1.0000 ± 0.0000   1.0000 ± 0.0000   1.0000 ± 0.0000   1.0000 ± 0.0000   0.9518 ± 0.0199   0.9886 ± 0.0055
4.6. Probability Maps

Additionally, to show that CAP has clear advantages over CNN, we graph the probability maps of every model or classifier. An apparent distinction is reflected in the probability maps of CNN and CAP. This result indicates that the differences between the simple and complex datasets have a significant impact on the measured performance. Figure 10 shows noticeable differences between the CNN- and CAP-based models, such as the weak predictions over the whole scene in the case of the CNN-based models. Thus, the maximum predicted probabilities of many samples are relatively low. Weak predictions occur in the cross-area and the areas covered by non-ground-truth samples in the case of the CNN-based models. Furthermore, the CAP-based models for the SA dataset are expected to be consistent with the results of the PU dataset.
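A probability map of this kind is obtained by taking, for each pixel, the maximum predicted class probability and reshaping the result to the scene grid. The sketch below is a minimal illustration; the array names, the placeholder predictions, and the plotting choices are assumptions rather than the exact code used in our experiments.

```python
import numpy as np
import matplotlib.pyplot as plt

# Assumed inputs: `probs` holds one class-probability vector per pixel
# (e.g., from model.predict()); `rows`/`cols` give the scene size.
rows, cols, n_classes = 83, 86, 6          # SalinasA-like dimensions (see Table 4)
probs = np.random.dirichlet(np.ones(n_classes), size=rows * cols)  # placeholder predictions

max_prob = probs.max(axis=1).reshape(rows, cols)   # per-pixel prediction confidence

plt.imshow(max_prob, cmap="viridis", vmin=0.0, vmax=1.0)
plt.colorbar(label="Maximum predicted probability")
plt.title("Probability map (per-pixel prediction confidence)")
plt.savefig("probability_map.png", dpi=200)
```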
Figure 10. Probability maps of: (a) CAP1; (b) CAP2; (c) CNN1; (d) CNN2; (e) RF; and (f) SVM for the SA dataset.
5. Discussion

5.1. Parameters Determination

The performance of HSI classification with supervised neural networks [43] usually depends on: (1) the representative ability of the designed structures; (2) the number of training samples; (3) the size of the input patch; and (4) the parameter configurations (or the experimental setups). We take 7 × 7 as the patch size based on the information given in the literature [29,43,47]. The batch size (i.e., 64) of the input data is of particular importance for obtaining a proper representation of the accuracy and loss of the training and validation procedures. Regarding the influence of model size, deep learning models with deeper layers appear to yield worse results for HSI classification [43]. Meanwhile, shallower network structures are apt to be comparable to each other. Previous work has indicated that a 3 × 3 kernel size encounters a serious overfitting problem, while a 1 × 1 convolutional kernel can provide better performance [27]. We found that a 4 × 4 kernel size and 64 filters for one convolutional layer are a viable option. Additionally, Zhong et al. [48] also found that deep learning models with regularization methods (i.e., BN) could obtain higher classification accuracy than those without. The spatial pooling operation has been retained due to its superior performance. Yu et al. [27] suggested that the dropout operation was necessary and provided better performance. Additionally, for the capsule layer, the dimension of the capsules is set to 64, and three routings are used by default.
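To make this configuration concrete, the following minimal Keras sketch assembles the CNN counterpart with the stated hyperparameters (7 × 7 input patches, one convolutional layer with 64 filters of size 4 × 4, BN, spatial pooling, and dropout); the activation, optimizer, dropout rate, and loss are illustrative assumptions rather than settings reported above. In the CapsNet variant, the final dense layer would be replaced by a capsule layer with 64-D capsules and three routing iterations.

```python
import tensorflow as tf

def build_cnn(n_bands, n_classes, patch=7):
    """Minimal CNN counterpart: one convolutional layer + one classification layer."""
    return tf.keras.Sequential([
        tf.keras.layers.Conv2D(64, kernel_size=(4, 4), activation="relu",
                               input_shape=(patch, patch, n_bands)),  # 4x4 kernel, 64 filters
        tf.keras.layers.BatchNormalization(),            # BN regularization [48]
        tf.keras.layers.MaxPooling2D(pool_size=(2, 2)),  # spatial pooling
        tf.keras.layers.Dropout(0.5),                    # dropout, as suggested in [27]
        tf.keras.layers.Flatten(),
        tf.keras.layers.Dense(n_classes, activation="softmax"),
    ])

model = build_cnn(n_bands=103, n_classes=9)              # PU-like setting: 103 bands, 9 classes
model.compile(optimizer="adam", loss="categorical_crossentropy", metrics=["accuracy"])
# model.fit(x_train, y_train, batch_size=64, validation_data=(x_val, y_val))
```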
Random runs (or Monte Carlo runs) are especially useful for obtaining credible results through repeated random sampling. Hence, we applied every model and classifier to each dataset five times. We illustrate the first run and discuss the others below. For the five independent runs, 60 samples per class were randomly selected for training and an additional 60 samples for validation, while the remaining samples were available for testing. Every run shows a slight difference because there is an inherent randomness in neural networks. The accuracies are recorded to calculate the final statistical mean and standard error. Because the training samples for each run are randomly selected from the same category, these runs are expected to give nearly consistent results regardless of possible noise or contamination. Different from simply repeating the experiment many times, the random runs use different training sets every time, which can achieve more reliable classification results at the risk of a possible decrease in the final accuracy statistics.
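This sampling scheme can be sketched as follows, assuming labels is a 1-D array of ground-truth class indices; the helper name and the commented usage (including the hypothetical train_and_evaluate routine) are ours, not part of the released code.

```python
import numpy as np

def random_split(labels, n_train=60, n_val=60, rng=None):
    """Per-class random split of labeled samples into train/validation/test indices."""
    if rng is None:
        rng = np.random.default_rng()
    train, val, test = [], [], []
    for c in np.unique(labels):
        idx = rng.permutation(np.where(labels == c)[0])  # shuffle samples of class c
        train.extend(idx[:n_train])
        val.extend(idx[n_train:n_train + n_val])
        test.extend(idx[n_train + n_val:])
    return np.array(train), np.array(val), np.array(test)

# Five independent runs: draw a fresh split each time, train and evaluate the model
# (train_and_evaluate is a hypothetical user-supplied routine), then report the mean
# and standard deviation of the resulting accuracies.
# accuracies = [train_and_evaluate(*random_split(labels)) for _ in range(5)]
# print(np.mean(accuracies), np.std(accuracies))
```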
5.2. Impact of Training Size

Generally, if raw HSI data are formatted as an n-D input, then a tensor-based learning approach can be applied. Tensor-based learning [20] is a promising technique for high-order data classification, where an HSI cube is the typical 3-D high-order data. The exploitation of high-order complex data raises new research challenges due to the high dimensionality and the limited number of ground-truth samples [49]. The number of training samples is a critical factor in determining the performance of a model or classifier. Deep learning models may not extract effective features unless abundant training samples are available [47]. For HSI data, obtaining a sufficient number of training samples may be difficult; hence, determining the appropriate number of training samples for HSI classification is critical. Zhong et al. [43] reported that most models obtained better performance for larger training sets. Therefore, we chose 60 randomly selected samples per class.

As shown in Figure 11, we set the number of training samples to 15, 30, 60, 90, and 180 (i.e., 0.25, 0.5, 1.5, and 3 times the base case of 60). The optimal number of training samples was determined to be 60 according to the experimental analysis. However, as the number of training samples increases, the classification accuracies are continuously improved and become somewhat stable. For CAP1 in the case of size 15, K, OA, and AA are 0.7609, 81.56%, and 84.96%, respectively, compared to the corresponding values yielded by CNN1 of 0.6735, 74.82%, and 73.93%. Similarly, for CAP2, K, OA, and AA are 0.7086, 77.95%, and 78.27%, respectively, while they are 0.4215, 48.90%, and 75.46% for CNN2. Here, the CAP-based models can achieve better performance with much less training data. Regarding the two baseline classifiers, i.e., the RF and SVM classifiers, the classification accuracies increase smoothly as the number of training samples increases. This parameter analysis of the impact of the training size supports the determination of the number of training samples in this study.
Figure 11. Accuracies of: (a) CAP1; (b) CAP2; (c) CNN1; (d) CNN2; (e) RF; and (f) SVM for the different training sizes of the PU dataset.
5.3. Uncertainty Analysis

Probability density is not necessarily consistent with the final reliability of the classification output. However, it is still expected to somehow indicate the confidence of the classification output, because the predicted probabilities are estimated based on the distribution of the training samples. Actually, the output probabilities of a classifier or deep learning model have been treated as the uncertainty of the classification result for subsequent analysis in many previous studies [50–53]. Therefore, the information carried by the output probabilities is very helpful. Related work on uncertainty analysis, or the statistics of the predicted probabilities, has seldom been considered in the HSI classification task before. This work provides a new insight into the intrinsic differences between CNN and CAP, when putting the concerns of performance measures aside.

Label assignment relies on credible predictions, and the maximum predicted probabilities determine the final outputs. As shown in Figure 12, the CAP-based models show a high prediction confidence, where most of the predicted maximum probabilities are fairly good. The performance of the CNN-based models may be sufficient in terms of the final accuracies; however, the predicted maximum probabilities of most samples are relatively low. For the RF and SVM classifiers, their representations appear normal. We speculate that such variance may depend on the complexity of the experimental data. Based on the experimental results, both CNN and CAP produced classification results with very high accuracy. However, the predicted probabilities of CNN are much smaller than the true reliability of its classification output. It is apparent that the presented CAP obtains a maximum probability for the category prediction and label assignment. Consequently, CAP shows much higher confidence for the predicted probabilities, which has been analyzed by the probability maps and the uncertainty analysis (i.e., the statistical analysis). The probability map illustrates the observed spatial distribution (i.e., probability density) of the weak predictions, and the statistical analysis indicates the statistical distribution of the predicted probabilities. The weak predictions occur in the cross-area (i.e., the transitional buffer) and the areas covered by non-ground-truth samples. Concerning the statistical plots, the following speculations can be given: (1) if most predictions have a high probability, then the low probabilities will be outliers, and vice versa; (2) if most predictions are quite approximate, then no outliers can be identified; and (3) the floated box, with its properties such as height and size, may indicate a specific statistical distribution. The output probabilities may be shaped by operations such as activation (or squashing) functions, which limit the output signal to a finite value. However, the statistical distribution is supposed to be consistent throughout. For the scope of this study, CAP shows preferable and exceptional characteristics.
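The statistical analysis in Figure 12 amounts to collecting, for each model, the maximum predicted probability of every test sample and comparing the distributions with box plots. The sketch below uses placeholder probability arrays purely for illustration; in practice they would come from the trained models.

```python
import numpy as np
import matplotlib.pyplot as plt

# Assumed input: a dict mapping model name -> (n_samples, n_classes) probability array.
rng = np.random.default_rng(0)
predictions = {
    "CAP": rng.dirichlet(np.ones(6) * 0.2, size=1000),  # placeholder: peaked, confident predictions
    "CNN": rng.dirichlet(np.ones(6) * 2.0, size=1000),  # placeholder: flatter, less confident predictions
}

max_probs = [p.max(axis=1) for p in predictions.values()]   # per-sample prediction confidence
plt.boxplot(max_probs, labels=list(predictions.keys()))
plt.ylabel("Maximum predicted probability")
plt.title("Prediction confidence per model")
plt.savefig("uncertainty_boxplot.png", dpi=200)
```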
Figure 12. Box plots of the predicted probabilities for: (a) CAP1; (b) CAP2; (c) CNN1; (d) CNN2; (e) RF; and (f) SVM using the SA dataset.
5.4. Time Consumption

The training time provides a direct measure of computational efficiency [43], and it is also an important performance indicator for the different network structures. Network parameters play an important role in the computational complexity of deep learning models. Specifically, once the network parameters are fixed, the efficiency of every model can be determined approximately. In general, a designed network structure with a specific dataset determines the total number of network parameters. Accordingly, changes in the magnitudes of the network parameters would alter the training and test times. Differences in the network parameters between CNN and CAP depend primarily on the differences between the capsule layer and the dense layer in terms of a comparable paradigm of network architecture design. Note that we attempted to make both network structures as comparable as possible, thus facilitating further improvement.

With respect to the time consumption (see Table 4), CAP costs more time than CNN. From another perspective, both network structures have approximately the same network complexity, as we expected. CAP requires approximately 1.8 times as much time as CNN for the PU dataset. Furthermore, it can be speculated that, as the data complexity increases, more time would be taken by CAP and SVM. This means that the time consumption of CAP and SVM may depend on the data complexity. Furthermore, the conventional SVM classifier exhibits a promising efficiency with little time consumption, whereas the RF classifier is much more computationally intensive. Yu et al. [27] trained their networks with small training sets and the training was completed in several hours, while some other applications presented by Krizhevsky et al. [54] required days or weeks. We have tried our best to improve the efficiency of CNN and CAP in our study, e.g., running on the GPU, using small training data, making a proper division of the training samples, and designing a shallower deep learning architecture (i.e., a two-parameter-layer structure). Therefore, CNN or CAP can be trained within a minute. On the other hand, CNN and CAP have different numbers of network parameters, which causes inevitable but reasonable differences in time consumption. Additionally, the training time may depend on many possible factors, e.g., the randomness in neural networks, the influence of memory storage, and the difference of the computational environment.
Table 4. Network parameters and time consumption (i.e., the average time of five runs).

              PU (610 × 340 × 103)                      SA (83 × 86 × 204)
Models        Total Parameters   Training Time (s)      Total Parameters   Training Time (s)
CAP1          1.42 × 10^5        35                     2.33 × 10^5        27
CAP2          1.42 × 10^5        37                     2.33 × 10^5        27
CNN1          1.08 × 10^5        21                     2.10 × 10^5        20
CNN2          1.08 × 10^5        21                     2.10 × 10^5        20
RF            -                  51                     -                  52
SVM           -                  13                     -                  5
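For completeness, figures of the kind reported in Table 4 can be obtained by counting the trainable parameters of a compiled Keras model and timing the fitting call. The sketch below uses a simplified stand-in model and random placeholder data; it is illustrative only and does not reproduce the reported numbers.

```python
import time
import numpy as np
import tensorflow as tf

# Minimal stand-in model; in practice this would be the CNN or CapsNet under study.
model = tf.keras.Sequential([
    tf.keras.layers.Conv2D(64, (4, 4), activation="relu", input_shape=(7, 7, 103)),
    tf.keras.layers.Flatten(),
    tf.keras.layers.Dense(9, activation="softmax"),
])
model.compile(optimizer="adam", loss="categorical_crossentropy")

print("Total parameters:", model.count_params())

# Time one fit on placeholder data (real experiments use the sampled HSI patches).
x = np.random.rand(540, 7, 7, 103).astype("float32")    # e.g., 60 samples x 9 classes
y = tf.keras.utils.to_categorical(np.random.randint(0, 9, 540), 9)
start = time.perf_counter()
model.fit(x, y, batch_size=64, epochs=1, verbose=0)
print("Training time (s):", time.perf_counter() - start)
```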
6. Conclusions

CapsNets are a recently proposed type of deep learning architecture, and only a limited number of studies have explored their applications. Motivated by the innovativeness of CapsNets, we attempted to introduce CapsNets into the HSI classification task. In this paper, we propose a modified two-layer CapsNet for HSI classification. Two real HSI datasets, the PaviaU and SalinasA datasets, are taken as the benchmark datasets. Compared with the state-of-the-art CNN and two baseline classifiers, i.e., the RF and SVM classifiers, the proposed CapsNet achieves better classification results for the complex data, even with limited training samples. In addition, a comparable paradigm of network architecture design has been proposed to compare CNN and CapsNet. Additionally, we observed that CapsNet has much higher confidence for its predictions. We also observed weak predictions with regard to the probability density, which are confirmed by the probability maps and the uncertainty analysis.

Since CNNs may fail to retain the spatial and spectral coherency of samples, recently emerged approaches, such as tensor-based learning algorithms and CapsNets, may be appealing alternatives that use less training data while achieving relatively good performance. To our knowledge, CNNs have several complex extensions of varied depth and width, with composite functions and specific tricks applied to the input, hidden, and output layers. The complexity of CapsNets has not been well explored until now. Since further efforts should be devoted to improving CapsNets, the comparison between more complex CNNs and CapsNets may be an open field to be exploited hereafter. Meanwhile, classification maps and accuracies may not always be sufficient for evaluating a classifier or deep learning model. In light of this study, we believe that CapsNets have bright prospects in HSI classification, and further efforts should be devoted to improving CapsNets in the near future.

Author Contributions: F.D. and S.P. conceived and designed the experiments; S.P. performed the experiments and analyzed the data; and all relevant co-authors participated in writing and editing the paper.

Funding: Research Fund of State Key Laboratory of Geohazard Prevention and Geoenvironment Protection (SKLGP2018Z006).

Acknowledgments: The authors thank the editors and the anonymous reviewers for their insightful comments and helpful suggestions, which highly improved the quality of the manuscript.

Conflicts of Interest: The authors declare no conflict of interest.
References

1. Du, Q.; Zhang, L.; Zhang, B.; Tong, X.; Du, P.; Chanussot, J. Foreword to the special issue on hyperspectral remote sensing: Theory, methods, and applications. IEEE J. Sel. Top. Appl. Earth Obs. Remote Sens. 2013, 6, 459–465. [CrossRef]
2. Plaza, A.; Benediktsson, J.A.; Boardman, J.W.; Brazile, J.; Bruzzone, L.; Camps-Valls, G.; Chanussot, J.; Fauvel, M.; Gamba, P.; Gualtieri, A. Recent advances in techniques for hyperspectral image processing. Remote Sens. Environ. 2009, 113, S110–S122. [CrossRef]
3. Zhang, L.; Du, B. Recent advances in hyperspectral image processing. Geo-Spat. Inf. Sci. 2012, 15, 143–156. [CrossRef]
4. Landgrebe, D. Hyperspectral image data analysis. IEEE Signal Process. Mag. 2002, 19, 17–28. [CrossRef]
5. Yan, W.Y.; Shaker, A.; El-Ashmawy, N. Urban land cover classification using airborne LiDAR data: A review. Remote Sens. Environ. 2015, 158, 295–310. [CrossRef]
6. Eslami, M.; Mohammadzadeh, A. Developing a spectral-based strategy for urban object detection from airborne hyperspectral TIR and visible data. IEEE J. Sel. Top. Appl. Earth Obs. Remote Sens. 2016, 9, 1808–1816. [CrossRef]
7. Gevaert, C.M.; Suomalainen, J.; Tang, J.; Kooistra, L. Generation of Spectral–Temporal Response Surfaces by Combining Multispectral Satellite and Hyperspectral UAV Imagery for Precision Agriculture Applications. IEEE J. Sel. Top. Appl. Earth Obs. Remote Sens. 2015, 8, 3140–3146. [CrossRef]
8. Govender, M.; Chetty, K.; Bulcock, H. A review of hyperspectral remote sensing and its application in vegetation and water resource studies. Water SA 2007, 33, 145–151. [CrossRef]
9. He, L.; Li, J.; Liu, C.; Li, S. Recent Advances on Spectral-Spatial Hyperspectral Image Classification: An Overview and New Guidelines. IEEE Trans. Geosci. Remote Sens. 2018, 56, 1579–1597. [CrossRef]
10. Lein, J.K. Environmental Sensing: Analytical Techniques for Earth Observation; Springer: New York, NY, USA, 2011.
11. Luo, Y.; Zou, J.; Yao, C.; Li, T.; Bai, G. HSI-CNN: A Novel Convolution Neural Network for Hyperspectral Image. arXiv 2018, arXiv:1802.10478.
12. Ham, J.; Chen, Y.; Crawford, M.M.; Ghosh, J. Investigation of the random forest framework for classification of hyperspectral data. IEEE Trans. Geosci. Remote Sens. 2005, 43, 492–501. [CrossRef]
13. Ho, T.K. The Random Subspace Method for Constructing Decision Forests. IEEE Trans. Pattern Anal. Mach. Intell. 1998, 20, 832–844.
14. Gualtieri, J.A.; Chettri, S. Support vector machines for classification of hyperspectral data. In Proceedings of the Geoscience and Remote Sensing Symposium (IGARSS 2000), Honolulu, HI, USA, 24–28 July 2000; pp. 813–815.
15. Melgani, F.; Bruzzone, L. Classification of hyperspectral remote sensing images with support vector machines. IEEE Trans. Geosci. Remote Sens. 2004, 42, 1778–1790. [CrossRef]
16. Bioucas-Dias, J.M.; Plaza, A.; Camps-Valls, G.; Scheunders, P.; Nasrabadi, N.; Chanussot, J. Hyperspectral Remote Sensing Data Analysis and Future Challenges. IEEE Geosci. Remote Sens. Mag. 2013, 1, 6–36. [CrossRef]
17. Chutia, D.; Bhattacharyya, D.K.; Sarma, K.K.; Kalita, R.; Sudhakar, S. Hyperspectral Remote Sensing Classifications: A Perspective Survey. Trans. GIS 2016, 20, 463–490. [CrossRef]
18. Chen, Y.; Lin, Z.; Zhao, X.; Wang, G.; Gu, Y. Deep Learning-Based Classification of Hyperspectral Data. IEEE J. Sel. Top. Appl. Earth Obs. Remote Sens. 2017, 7, 2094–2107. [CrossRef]
19. Makantasis, K.; Karantzalos, K.; Doulamis, A.; Doulamis, N. Deep supervised learning for hyperspectral data classification through convolutional neural networks. In Proceedings of the IEEE International Geoscience and Remote Sensing Symposium (IGARSS), Milan, Italy, 26–31 July 2015; pp. 4959–4962.
20. Makantasis, K.; Doulamis, A.; Doulamis, N.; Nikitakis, A.; Voulodimos, A. Tensor-based Nonlinear Classifier for High-Order Data Analysis. arXiv 2018, arXiv:1802.05981.
21. Hu, W.; Huang, Y.; Wei, L.; Zhang, F.; Li, H. Deep convolutional neural networks for hyperspectral image classification. J. Sens. 2015, 2015, 258619. [CrossRef]
22. Goodfellow, I.; Bengio, Y.; Courville, A.; Bengio, Y. Deep Learning; MIT Press: Cambridge, MA, USA, 2016; p. 1.
23. LeCun, Y.; Bengio, Y.; Hinton, G. Deep learning. Nature 2015, 521, 436. [CrossRef] [PubMed]
24. LeCun, Y. LeNet-5, Convolutional Neural Networks. 2015, p. 20. Available online: http://yann.lecun.com/exdb/lenet (accessed on 17 August 2018).
25. Chen, Y.; Jiang, H.; Li, C.; Jia, X.; Ghamisi, P. Deep feature extraction and classification of hyperspectral images based on convolutional neural networks. IEEE Trans. Geosci. Remote Sens. 2016, 54, 6232–6251. [CrossRef]
26. Li, W.; Wu, G.; Zhang, F.; Du, Q. Hyperspectral image classification using deep pixel-pair features. IEEE Trans. Geosci. Remote Sens. 2017, 55, 844–853. [CrossRef]
27. Yu, S.; Jia, S.; Xu, C. Convolutional neural networks for hyperspectral image classification. Neurocomputing 2017, 219, 88–98. [CrossRef]
28. Zhong, Z.; Fan, B.; Duan, J.; Wang, L.; Ding, K.; Xiang, S.; Pan, C. Discriminant tensor spectral-spatial feature extraction for hyperspectral image classification. IEEE Geosci. Remote Sens. Lett. 2015, 12, 1028–1032. [CrossRef]
29. Li, Y.; Zhang, H.; Shen, Q. Spectral-spatial classification of hyperspectral imagery with 3D convolutional neural network. Remote Sens. 2017, 9, 67. [CrossRef]
30. Sabour, S.; Frosst, N.; Hinton, G.E. Dynamic routing between capsules. In Proceedings of the Advances in Neural Information Processing Systems, Long Beach, CA, USA, 4–9 December 2017; pp. 3859–3869.
31. Xi, E.; Bing, S.; Jin, Y. Capsule Network Performance on Complex Data. arXiv 2017, arXiv:1712.03480.
32. Wei, W.; Zhang, J.; Zhang, L.; Tian, C.; Zhang, Y. Deep Cube-Pair Network for Hyperspectral Imagery Classification. Remote Sens. 2018, 10, 783. [CrossRef]
33. Makantasis, K.; Doulamis, A.D.; Doulamis, N.D.; Nikitakis, A. Tensor-Based Classification Models for Hyperspectral Data Analysis. IEEE Trans. Geosci. Remote Sens. 2018, 99, 1–15. [CrossRef]
34. Afshar, P.; Mohammadi, A.; Plataniotis, K.N. Brain Tumor Type Classification via Capsule Networks. arXiv 2018, arXiv:1802.10200.
35. Jaiswal, A.; AbdAlmageed, W.; Natarajan, P. CapsuleGAN: Generative Adversarial Capsule Network. arXiv 2018, arXiv:1802.06167.
36. Kumar, A.D. Novel Deep Learning Model for Traffic Sign Detection Using Capsule Networks. arXiv 2018, arXiv:1805.04424.
37. LaLonde, R.; Bagci, U. Capsules for Object Segmentation. arXiv 2018, arXiv:1804.04241.
38. Li, Y.; Qian, M.; Liu, P.; Cai, Q.; Li, X.; Guo, J.; Yan, H.; Yu, F.; Yuan, K.; Yu, J.; et al. The recognition of rice images by UAV based on capsule network. Clust. Comput. 2018, 1–10. [CrossRef]
39. Qiao, K.; Zhang, C.; Wang, L.; Yan, B.; Chen, J.; Zeng, L.; Tong, L. Accurate reconstruction of image stimuli from human fMRI based on the decoding model with capsule network architecture. arXiv 2018, arXiv:1801.00602.
40. Wang, Y.; Sun, A.; Han, J.; Liu, Y.; Zhu, X. Sentiment Analysis by Capsules. In Proceedings of the 2018 World Wide Web Conference on World Wide Web, Lyon, France, 23–27 April 2018; pp. 1165–1174.
41. Zhao, W.; Ye, J.; Yang, M.; Lei, Z.; Zhang, S.; Zhao, Z. Investigating Capsule Networks with Dynamic Routing for Text Classification. arXiv 2018, arXiv:1804.00538.
42. Andersen, P. Deep Reinforcement Learning using Capsules in Advanced Game Environments. arXiv 2018, arXiv:1801.09597.
43. Zhong, Z.; Li, J.; Luo, Z.; Chapman, M. Spectral-Spatial Residual Network for Hyperspectral Image Classification: A 3-D Deep Learning Framework. IEEE Trans. Geosci. Remote Sens. 2018, 56, 847–858. [CrossRef]
44. Guidici, D.; Clark, M.L. One-Dimensional convolutional neural network land-cover classification of multi-seasonal hyperspectral imagery in the San Francisco Bay Area, California. Remote Sens. 2017, 9, 629. [CrossRef]
45. Hu, F.; Xia, G.; Hu, J.; Zhang, L. Transferring deep convolutional neural networks for the scene classification of high-resolution remote sensing imagery. Remote Sens. 2015, 7, 14680–14707. [CrossRef]
46. Kussul, N.; Lavreniuk, M.; Skakun, S.; Shelestov, A. Deep learning classification of land cover and crop types using remote sensing data. IEEE Geosci. Remote Sens. Lett. 2017, 14, 778–782. [CrossRef]
47. Gao, Q.; Lim, S.; Jia, X. Hyperspectral Image Classification Using Convolutional Neural Networks and Multiple Feature Learning. Remote Sens. 2018, 10, 299. [CrossRef]
48. Zhong, Z.; Li, J.; Ma, L.; Jiang, H.; Zhao, H. Deep Residual Networks for Hyperspectral Image Classification. In Proceedings of the Geoscience and Remote Sensing Symposium, Fort Worth, TX, USA, 23–28 July 2017.
49. Camps-Valls, G.; Bruzzone, L. Kernel-based methods for hyperspectral image classification. IEEE Trans. Geosci. Remote Sens. 2005, 43, 1351–1362. [CrossRef]
50. Chen, J.; Chen, X.; Cui, X.; Chen, J. Change vector analysis in posterior probability space: A new method for land cover change detection. IEEE Geosci. Remote Sens. Lett. 2011, 8, 317–321. [CrossRef]
51. Schneider, A.; Friedl, M.A.; Potere, D. Mapping global urban areas using MODIS 500-m data: New methods and datasets based on 'urban ecoregions'. Remote Sens. Environ. 2010, 114, 1733–1746. [CrossRef]
52. Sexton, J.O.; Noojipady, P.; Anand, A.; Song, X.; McMahon, S.; Huang, C.; Feng, M.; Channan, S.; Townshend, J.R. A model for the propagation of uncertainty from continuous estimates of tree cover to categorical forest cover and change. Remote Sens. Environ. 2015, 156, 418–425. [CrossRef]
53. Yang, D.; Chen, X.; Chen, J.; Cao, X. Multiscale integration approach for land cover classification based on minimal entropy of posterior probability. IEEE J. Sel. Top. Appl. Earth Obs. Remote Sens. 2017, 10, 1105–1116. [CrossRef]
54. Krizhevsky, A.; Sutskever, I.; Hinton, G.E. Imagenet classification with deep convolutional neural networks. In Proceedings of the Advances in Neural Information Processing Systems (NIPS), Lake Tahoe, NV, USA, 3–6 December 2012; pp. 1097–1105.

© 2018 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (http://creativecommons.org/licenses/by/4.0/).