Article

Deep Convolutional Capsule Network for Hyperspectral Image Spectral and Spectral-Spatial Classification

Kaiqiang Zhu 1, Yushi Chen 1,*, Pedram Ghamisi 2, Xiuping Jia 3 and Jón Atli Benediktsson 4

1 School of Electronics and Information Engineering, Harbin Institute of Technology, Harbin 150001, China; [email protected]
2 Helmholtz-Zentrum Dresden-Rossendorf, Helmholtz Institute Freiberg for Resource Technology, 09599 Freiberg, Germany; [email protected]
3 School of Engineering and Information Technology, The University of New South Wales, Canberra, ACT 2600, Australia; [email protected]
4 Faculty of Electrical and Computer Engineering, University of Iceland, IS-107 Reykjavik, Iceland; [email protected]
* Correspondence: [email protected]

Received: 20 December 2018; Accepted: 16 January 2019; Published: 22 January 2019

Abstract: Capsule networks can be considered the next era of deep learning and have recently shown their advantages in supervised classification. Instead of using scalar values to represent features, capsule networks use vectors, which enriches the feature representation capability. This paper introduces a deep capsule network for hyperspectral image (HSI) classification to improve on the performance of conventional convolutional neural networks (CNNs). Furthermore, a modification of the capsule network named Conv-Capsule is proposed. Instead of full connections, local connections and shared transform matrices, which are the core ideas of CNNs, are used in the Conv-Capsule network architecture. In Conv-Capsule, the number of trainable parameters is reduced compared to the original capsule network, which potentially mitigates the overfitting issue when the number of available training samples is limited. Specifically, we propose two schemes: (1) A 1D deep capsule network is designed for spectral classification, as a combination of principal component analysis, a CNN, and the Conv-Capsule network, and (2) a 3D deep capsule network is designed for spectral-spatial classification, as a combination of extended multi-attribute profiles, a CNN, and the Conv-Capsule network. The proposed classifiers are tested on three widely-used hyperspectral data sets. The obtained results reveal that the proposed models provide competitive results compared to state-of-the-art methods, including kernel support vector machines, CNNs, and recurrent neural networks.

Keywords: convolutional neural network (CNN); deep learning; capsule network; hyperspectral image classification

1. Introduction

The task of classification, when it relates to hyperspectral images (HSIs), generally refers to assigning a label to each pixel vector in the scene [1]. HSI classification is a crucial step for a plethora of applications, including urban development [2–4], land change monitoring [5–7], scene interpretation [8,9], and resource management [10,11]. Due to the fundamental importance of this step in various applications, classification of HSI is one of the hottest topics in the remote sensing community. However, the classification of HSI is still challenging due to several factors, such as high dimensionality, a limited number of training samples, and complex imaging situations [1].

Remote Sens. 2019, 11, 223; doi:10.3390/rs11030223

www.mdpi.com/journal/remotesensing


During the last few decades, a huge number of methods have been proposed for HSI classification [12–14]. Due to the availability of abundant spectral information in HSIs, many spectral classifiers have been proposed for HSI classification, including k-nearest-neighbors, maximum likelihood, neural networks, logistic regression, and support vector machines (SVMs) [1,15,16]. Hyperspectral sensors provide rich spatial information as well, and the spatial resolution is becoming ever finer along with the development of sensor technologies. With the help of spatial information, classification performance can be greatly improved [17]. Among the spectral-spatial classification techniques, the generation of a morphological profile is a widely-used approach, which is usually followed by either an SVM or a random forest classifier to obtain the final classification result [18–21]. As an extension of the SVM, multiple kernel learning is another mainstream approach to spectral-spatial HSI classification, with a powerful capability to handle the heterogeneous features obtained from spectral-spatial hyperspectral images [22].

Due to complex atmospheric conditions, scattering from neighboring objects, intra-class variability, and varying sunlight intensity, it is very important to extract invariant and robust features from HSIs for accurate classification. Deep learning uses hierarchical models to extract invariant and discriminative features from HSIs in an effective manner and usually leads to accurate classification. During the past few years, many deep learning methods have been proposed for HSI classification. Deep learning includes a broad family of models, including the stacked auto-encoder, the deep belief network, the deep convolutional neural network (CNN), and the deep recurrent neural network. All of the aforementioned deep models have been used for HSI classification [23,24].
The stacked auto-encoder was the first deep model to be investigated for HSI feature extraction and classification [25]. In [25], two stacked auto-encoders were used to hierarchically extract spectral and spatial features. The extracted invariant and discriminant features led to a better classification performance. Furthermore, recently, the deep belief network was introduced for HSI feature extraction and classification [26,27]. Because of the unique and useful model architectures of CNNs (e.g., local connections and shared weights), such networks usually outperform other deep models in terms of classification accuracy. In [28], a well-designed CNN with five layers was proposed to extract spectral features for accurate classification. In [29], a CNN-based spectral classifier that elaborately uses pixel-pair information was proposed, and it was shown to obtain good classification performance under the condition of a limited number of training samples. Most of the existing CNN-based HSI classification methods have been generalized to consider both spectral and spatial information in a single classification framework. The first spectral-spatial classifier based on CNN was introduced in [30], which was a combination of principal component analysis (PCA), deep CNN, and logistic regression. Due to the fact that the inputs of deep learning models are usually 3D data, it is reasonable to design 3D CNNs for HSI spectral-spatial classification [31,32]. Furthermore, CNN can be combined with other powerful techniques to improve the classification performance. In [33], CNN was combined with sparse representation to refine the learnt features. CNNs can be connected with other spatial feature extraction methods, such as morphological profiles and Gabor filtering, to further improve the classification performance [34,35]. The pixel vector of HSIs can be inherently considered to be sequential. Recurrent neural networks have the capability of characterizing sequential data. 
Therefore, in [27], a deep recurrent neural network that can analyze hyperspectral pixel vectors as sequential data and then determine information categories via network reasoning was proposed. Although deep learning models have shown their capabilities for HSI classification, some disadvantages exist which downgrade the performance of such techniques. In general, deep models require a huge number of training samples to reliably train a large number of parameters in their networks. On the other hand, having insufficient training samples is a frequent problem in remotely sensed image classification. In 2017, Sabour et al. proposed a new idea based on capsules, which showed its advantages in coping with a limited number of training samples [36]. Furthermore, traditional CNNs usually use a pooling layer to obtain invariant features from the input data, but the pooling operation loses the precise positional relationship of features. In hyperspectral remote
sensing, abundant spectral information and the positional relationships within a pixel vector are crucial factors for accurate spectral classification. Therefore, it is important to maintain the precise positional relationships in the feature extraction stage. In addition, when it comes to extracting spectral-spatial features from HSI, it is also important to preserve the positional relationships of spectral-spatial features. Moreover, most of the existing deep methods use a scalar value to represent the intensity of a feature. In contrast, capsule networks use vectors to represent features. The usage of vectors enriches the feature representation and is a far more promising approach to feature learning than scalar representation [36,44]. These properties of the capsule network align well with the goals of this study and the current demands of the hyperspectral community.

Deep learning-based methods, including deep capsule networks, have a powerful feature extraction capability when the number of training samples is sufficient. Unfortunately, the availability of only a limited number of training samples is a common bottleneck in HSI classification. Deep models are often over-trained with a limited number of training samples, which downgrades the classification accuracy on test samples. In order to mitigate the overfitting problem and lessen the feature extraction workload of deep models, the idea of a local connection-based capsule network is proposed in this study. The proposed Conv-Capsule network uses local connections and shared transform matrices to reduce the number of trainable parameters compared to the original capsule network, which potentially mitigates the overfitting issue when the number of available training samples is limited. In the current study, the idea of the capsule network is modified for HSI classification.
Two deep capsule classification frameworks, 1D-Capsule and 3D-Capsule, are proposed as spectral and spectral-spatial classifiers, respectively. Furthermore, two modified capsule networks, i.e., 1D-Conv-Capsule and 3D-Conv-Capsule, are proposed to further improve the classification accuracy. The main contributions of the paper are briefly summarized as follows.
(1) A modification of the capsule network named Conv-Capsule is proposed. The Conv-Capsule uses local connections and shared transform matrices in the network, which reduces the number of trainable parameters and mitigates the overfitting issue in classification.
(2) Two frameworks, called 1D-Capsule and 3D-Capsule, based on the capsule network are proposed for HSI classification.
(3) To further improve the HSI classification performance, two frameworks, called 1D-Conv-Capsule and 3D-Conv-Capsule, are proposed.
(4) The proposed methods are tested on three well-known hyperspectral data sets under the condition of having a limited number of training samples.
The rest of the paper is organized as follows. Section 2 presents the background of deep learning and the capsule network. Sections 3 and 4 are dedicated to the details of the proposed deep capsule network frameworks, including spectral and spectral-spatial architectures for HSI classification. The experimental results are reported in Section 5. In Section 6, the conclusions and discussion are presented.

2. Background

2.1. Convolutional Neural Networks

In general, a CNN is a special case of a deep neural network, loosely inspired by the biological visual system [37]. Compared with other deep learning methods, there are two unique factors in the architecture of the CNN, i.e., local connections and shared weights. Since each neuron only responds to a small region known as the receptive field, the CNN efficiently exploits structural correlation. Furthermore, the CNN replicates its weights and biases across the entire layer, which significantly reduces the number of parameters in the network. By using specific architectural features like local connections and shared weights, CNNs tend to provide better generalization for a wide variety of applications.
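To make the ideas of local connections and shared weights concrete, the following sketch implements a 1D convolutional layer followed by a ReLU. All sizes and weights here are illustrative placeholders, not the paper's configuration; the point is only that the same small kernel slides over every position.

```python
import numpy as np

def conv1d_relu(x, kernels, biases):
    """Minimal 1D convolutional layer (valid padding) with ReLU.
    x: (M, L) input feature maps; kernels: (J, M, K); biases: (J,).
    The same kernel (shared weights) is applied at every position, and each
    output only depends on a local K-wide window (local connections)."""
    M, L = x.shape
    J, _, K = kernels.shape
    out = np.zeros((J, L - K + 1))
    for j in range(J):
        for t in range(L - K + 1):
            out[j, t] = np.sum(kernels[j] * x[:, t:t + K]) + biases[j]
    return np.maximum(out, 0.0)  # ReLU nonlinearity

rng = np.random.default_rng(0)
y = conv1d_relu(rng.normal(size=(3, 32)),   # 3 input maps of length 32
                rng.normal(size=(5, 3, 4)), # 5 kernels of size 4
                np.zeros(5))                # biases initialized to zero
```

Because the five kernels are reused at all 29 valid positions, the layer has only 5 x 3 x 4 + 5 = 65 parameters, regardless of the input length.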


There are three main building blocks in CNNs: a convolution layer, a nonlinear transformation, and a pooling operation. By stacking several convolution layers with nonlinear operations and several pooling layers, a deep CNN can be established [45]. A convolutional layer can be defined as follows:

x_j^l = f( Σ_{i=1}^{M} x_i^{l−1} ∗ k_{ij}^l + b_j^l ),  (1)

f(x) = max(0, x),  (2)

where the matrix x_i^{l−1} is the i-th feature map of the previous (l−1)-th layer, x_j^l is the j-th feature map of the current l-th layer, and M is the number of input feature maps. The kernels k_{ij}^l are randomly initialized, and the biases b_j^l are set to zero. Furthermore, f(∙) is a nonlinear function known as the rectified linear unit (ReLU), and ∗ is the convolution operation [46].

The pooling operation offers invariance by reducing the resolution of the feature maps. Each neuron in the pooling layer combines a small N×N (e.g., N = 2) patch of the convolution layer. The most common pooling operation is max pooling. All parameters in the deep CNN model are trained using the back-propagation algorithm. In this study, the CNN is adopted as the feature extraction method, and the extracted features are fed to the deep capsule network for further processing.

2.2. Capsule Network

The capsule network is a modification of the traditional neural network, which uses a group of neurons to obtain the vector representation of a specific type of entity. In [36], the input s_j to a capsule is a weighted sum of the prediction vectors û_{j|i} from the previous layer. û_{j|i} is obtained by multiplying the output u_i of the previous capsule by a transform matrix W_ij:

s_j = Σ_i c_ij û_{j|i},  (3)

û_{j|i} = W_ij u_i,  (4)

where c_ij represents the coupling coefficients determined by a process called dynamic routing [36]. The capsule uses the length of the output vector to obtain the probability of the entity, and a nonlinear function, which we call the squash function, is used to squash the vector:

v_j = (‖s_j‖² / (1 + ‖s_j‖²)) (s_j / ‖s_j‖),  (5)

where v_j is the output of capsule j, which is a vector representation of the input, whereas the traditional neural network uses a scalar value to give the final probability of the entity. There are some advantages to using a vector representation instead of a scalar value: the length of the activity vector gives the probability of the entity, and the vector also encodes the orientation of the entity. In traditional CNNs, a pooling layer is used to make the network invariant to small changes in the inputs, but its effectiveness is limited [44]. CNNs are not robust to translation, rotation, and scale, which usually downgrades their classification performance. In the capsule network, the output of a capsule is a vector representation of a type of entity [36]. When changes occur in the entity, the length of the corresponding output vector of the capsule may not change greatly. Through the capsule network, we can obtain a more robust representation of the input.

2.3. Capsule Network for HSI Classification

The capsule network can be combined with a traditional neural network (e.g., a CNN) to formulate a classification system for a specific task (e.g., HSI classification). In the remote sensing community, two works have already adopted capsule networks for HSI classification. Paoletti et al. [47] and Deng et al. [48] adopted the capsule network for HSI classification and achieved good classification performance. In this context, Paoletti et al. proposed a spectral-spatial capsule network to capture high abstract-level features for HSI classification while reducing the network design complexity. The classification result in [47] demonstrates that the proposed method can extract more relevant and
complete information about HSI data cubes. Deng et al. presented a modified two-layer capsule network capable of handling a limited number of training samples for HSI classification.

Previous capsule networks contained a fully-connected capsule layer, which leads to a large number of trainable parameters. A large number of parameters may cause overfitting when the number of training samples is limited. In this study, an improved capsule network named Conv-Capsule, which uses local connections and shared transform matrices, is proposed. Conv-Capsule dramatically reduces the number of trainable parameters and mitigates the overfitting issue in HSI classification. Furthermore, the previous capsule networks for HSI classification are spectral-spatial classifiers. In this study, a 1D capsule network is also proposed as a spectral classifier to enrich the classification techniques of HSI. The details of our proposed methods are explicitly explained in Sections 3 and 4.

3. One-Dimensional Deep Capsule Network as a Spectral Classifier

3.1. One-Dimensional Convolutional Capsule

Deep learning models use multilayer neural networks to hierarchically extract the features of input data, which is the key factor in the effectiveness of deep learning-based methods. The traditional capsule network does not contain multiple capsule layers. Therefore, it is necessary to build a multilayer capsule network. Simply stacking capsule layers can produce a deep capsule network. However, the traditional capsule layer is fully connected and contains a huge number of trainable parameters. The problem is even worse when the number of training samples is limited. Inspired by the CNN, local connections and shared transform matrices, which are the core ideas of CNNs, are combined with the dynamic routing algorithm in the capsule layer, and we call the result the convolutional capsule (Conv-Capsule) layer.
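A back-of-the-envelope comparison illustrates the parameter saving that motivates the Conv-Capsule layer. All layer sizes below are hypothetical, chosen only to show the scaling, and are not taken from the paper's configuration:

```python
# Hypothetical sizes: I input capsule channels of d1-dimensional capsules,
# X positions (e.g., spectral locations), J output channels of d2-dimensional
# capsules, and Conv-Capsule kernel size P.
I, d1 = 4, 8
X = 64
J, d2 = 4, 8
P = 3

# Fully-connected capsule layer: every lower capsule (I * X of them) needs its
# own d2 x d1 transform matrix to every upper capsule (J * X of them).
fc_params = (I * X) * (J * X) * d2 * d1

# Conv-Capsule layer: transform matrices are shared across positions, so only
# I * J * P matrices exist, regardless of the spectral length X.
conv_params = I * J * P * d2 * d1

print(fc_params, conv_params)
```

With these illustrative sizes the fully-connected layer needs over four million transform-matrix entries while the Conv-Capsule layer needs a few thousand, and the gap grows quadratically with the number of positions X.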
In the Conv-Capsule layer, each capsule in the current layer only connects with capsules within its local receptive field in the last capsule layer. The transform matrices in the local connections are shared across the entire layer.

In HSI classification, spectral classification is an important research direction. To develop a 1D capsule network for HSI classification, a 1D Conv-Capsule layer needs to be utilized. A description of the 1D Conv-Capsule layer, which we use here to shape the spectral classifier, follows.

The input s_j^x of a capsule in a 1D Conv-Capsule layer is a weighted sum of the "prediction vectors" û_{j|i}^{x+p} from all channels of the capsules within its receptive field in the last capsule layer. Furthermore, û_{j|i}^{x+p} is obtained by multiplying the output u_i^{x+p} of the capsule in the last layer by the corresponding transform matrix W_{ij}^p, which is shared across the last capsule layer. By using a squash function, the output v_j^x of the capsule can be obtained from the input s_j^x. The equations used here are listed as follows:

û_{j|i}^{x+p} = W_{ij}^p u_i^{x+p},  (6)

s_j^x = Σ_{i=1}^{I} Σ_{p=1}^{P} c_{ij}^p û_{j|i}^{x+p},  (7)

v_j^x = squash(s_j^x),  (8)

where I and P are the number of capsule channels in the last capsule layer and the kernel size in the current 1D Conv-Capsule layer, respectively. s_j^x is the input of the j-th channel capsule at position x in the current 1D Conv-Capsule layer, and v_j^x is the corresponding output. u_i^{x+p} is the output of the i-th channel capsule at position (x+p) in the last capsule layer. W_{ij}^p is the transform matrix between u_i^{x+p} and û_{j|i}^{x+p}. c_{ij}^p represents the coupling coefficients determined by the dynamic routing algorithm. An illustration of the 1D Conv-Capsule layer is shown in Figure 1.


Figure 1. One-dimensional Conv-Capsule layer.
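The forward computation of Equations (6)–(8) can be sketched as follows. All layer sizes are hypothetical, and the coupling coefficients are fixed to uniform values here; the dynamic routing algorithm that actually refines them is described next.

```python
import numpy as np

def squash(s, eps=1e-8):
    """Squash nonlinearity of Eq. (5): preserves direction while mapping
    the length into [0, 1)."""
    sq = np.sum(s ** 2, axis=-1, keepdims=True)
    return sq / (1.0 + sq) * s / np.sqrt(sq + eps)

# Hypothetical sizes: X positions, I input channels of d1-D capsules,
# J output channels of d2-D capsules, kernel size P (valid "convolution").
X, I, d1 = 20, 4, 8
J, d2, P = 6, 8, 3
rng = np.random.default_rng(1)
u = rng.normal(size=(X, I, d1))          # u_i^{x+p}: last-layer capsule outputs
W = rng.normal(size=(I, J, P, d2, d1))   # W_{ij}^p: shared across positions x

X_out = X - P + 1
v = np.zeros((X_out, J, d2))
for x in range(X_out):
    # Eq. (6): prediction vectors u_hat_{j|i}^{x+p} = W_{ij}^p u_i^{x+p}
    u_hat = np.einsum('ijpkd,pid->ijpk', W, u[x:x + P])  # (I, J, P, d2)
    # Uniform stand-in for the routed couplings c_{ij}^p
    c = np.full((I, 1, P, 1), 1.0 / (I * P))
    s = np.sum(c * u_hat, axis=(0, 2))                   # Eq. (7): (J, d2)
    v[x] = squash(s)                                     # Eq. (8)
```

Note that, as in a 1D convolution, the same I × J × P transform matrices are reused at every position x, which is exactly where the parameter saving comes from.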

3.2. Dynamic Routing Algorithm in the 1D Conv-Capsule Layer

Between two consecutive capsule layers, we use the dynamic routing algorithm to iteratively update the coupling coefficients. The procedure of the dynamic routing algorithm in the 1D Conv-Capsule layer is described as follows.

From the description of the 1D Conv-Capsule in the last subsection, we know that each capsule in the current 1D Conv-Capsule layer receives "prediction vectors" from the capsules within its receptive field in the last capsule layer. The weight of each "prediction vector" is represented by the coupling coefficients. The coupling coefficients between capsule i in the last capsule layer and the capsules of all channels at the same position in the 1D Conv-Capsule layer sum to 1 and can be obtained by a softmax function:

c_{ij}^p = exp(b_{ij}^p) / Σ_j exp(b_{ij}^p),  (9)

where b_{ij}^p is initialized to 0 before the training begins and is determined by the dynamic routing algorithm. In the dynamic routing algorithm, the coefficient c_{ij}^p is iteratively refined by measuring the agreement between the "prediction vector" û_{j|i}^{x+p} and the output v_j^x. If a high degree of agreement is reached, capsule i makes a good prediction for capsule j, and the coefficient c_{ij}^p is significantly increased. In our network, the agreement is quantified as the inner product of the two vectors û_{j|i}^{x+p} and v_j^x. This agreement is added to b_{ij}^p:

a_{ij}^p = v_j^x ∙ û_{j|i}^{x+p},  (10)

b_{ij}^p ← b_{ij}^p + v_j^x ∙ û_{j|i}^{x+p}.  (11)

The pseudocode of the dynamic routing algorithm in the 1D Conv-Capsule layer is shown in Table 1.

Table 1. Dynamic Routing Algorithm in the 1D Conv-Capsule layer.

Algorithm 1. Dynamic Routing Algorithm in the 1D Conv-Capsule layer
1. begin
2.   for x in spectral dimension in current capsule layer
3.     for j-th channel capsule in current capsule layer
4.       for p across kernel size
5.         for i-th channel capsule in last capsule layer
6.           initialize coupling coefficient logits b_{ij}^p ← 0
7.   for r iterations
8.     c_{ij}^p = softmax(b_{ij}^p) across dimension j
9.     for j-th channel capsule in current capsule layer
10.      s_j^x = Σ_i Σ_p c_{ij}^p û_{j|i}^{x+p}
11.      v_j^x = squash(s_j^x)
12.      b_{ij}^p ← b_{ij}^p + v_j^x ∙ û_{j|i}^{x+p}
13.  return v_j^x
14. end
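A compact sketch of Algorithm 1 for a single position x is given below. Shapes are illustrative, and a real implementation would vectorize over all positions; this version just mirrors Equations (9)–(11) directly.

```python
import numpy as np

def squash(s, eps=1e-8):
    """Squash nonlinearity of Eq. (5)."""
    sq = np.sum(s ** 2, axis=-1, keepdims=True)
    return sq / (1.0 + sq) * s / np.sqrt(sq + eps)

def dynamic_routing(u_hat, r=3):
    """Dynamic routing for one position x of the 1D Conv-Capsule layer.
    u_hat: prediction vectors u_hat_{j|i}^{x+p}, shape (I, P, J, d2).
    Couplings c_{ij}^p are softmax-normalized over the J output channels
    (Eq. (9)), and the logits b_{ij}^p are refined by the agreement
    v_j^x . u_hat (Eqs. (10)-(11))."""
    I, P, J, d2 = u_hat.shape
    b = np.zeros((I, P, J))                         # logits, initialized to 0
    for _ in range(r):
        c = np.exp(b) / np.exp(b).sum(axis=-1, keepdims=True)  # Eq. (9)
        s = np.einsum('ipj,ipjd->jd', c, u_hat)     # Eq. (7): weighted sum
        v = squash(s)                               # Eq. (8)
        b += np.einsum('jd,ipjd->ipj', v, u_hat)    # Eq. (11): agreement update
    return v

rng = np.random.default_rng(2)
v = dynamic_routing(rng.normal(size=(4, 3, 6, 8)))  # I=4, P=3, J=6, d2=8
```

After the r iterations, prediction vectors that agree with the emerging output v_j^x have had their logits, and hence their couplings, increased, so the output is dominated by the lower capsules that "vote" consistently.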

3.3. One-Dimensional Capsule Framework for HSI Classification

The main framework of the 1D-Conv-Capsule network, which is based on the integration of principal component analysis (PCA), a convolutional neural network, and the capsule network, is shown in Figure 2. We build this framework based on HSI spectral features and only use the spectral vectors of the training data to train the model.


Figure 2. The framework of the 1D-Conv-Capsule network for hyperspectral image (HSI) classification.

As illustrated in Figure 2, PCA is first used to reduce the dimensionality of the input data [38], which leads to fewer trainable parameters in the network. Then, the first m principal components of each pixel are chosen as the inputs to the network. Through the capsule network, the predicted label of each pixel can be obtained.

The proposed 1D-Conv-Capsule network contains six layers. The first layer is an input layer which holds the m principal components of each pixel. The second and third layers are convolutional layers, which are the same as traditional convolutional layers in a CNN. The fourth layer is the first capsule layer, with I channels of convolutional d1-dimensional capsules, which means that each capsule contains d1 convolutional units. The fifth layer is a 1D Conv-Capsule layer which outputs J channels of d2-dimensional capsules. The last layer is a fully connected capsule layer that has n_class (where n_class is the number of classes) d3-dimensional capsules. Each capsule in the last layer represents one class, and it is called the ClassCaps layer for short. The length of the vector output of each capsule represents the probability of the input spectral vector belonging to that class. ‖L2‖ in Figure 2 denotes the Euclidean norm of a vector (i.e., the length of the vector).

Some details about the network are given below. In the first two convolutional layers, which are no different from traditional convolution layers, we use a leaky rectified linear unit (LeakyReLU) to obtain a nonlinear mapping [39]:

f(x) = x, if x ≥ 0; f(x) = αx, if x < 0,  (12)
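As a minimal sketch of this activation (the slope α = 0.01 is an assumed default for illustration, not a value stated in this excerpt):

```python
import numpy as np

def leaky_relu(x, alpha=0.01):
    """LeakyReLU of Eq. (12): identity for non-negative inputs and a small
    slope alpha for negative inputs, so negative activations still propagate
    a gradient during back-propagation."""
    return np.where(x >= 0, x, alpha * x)

y = leaky_relu(np.array([-2.0, 0.0, 3.0]))
```

Unlike the plain ReLU of Equation (2), negative inputs are scaled rather than zeroed, which helps avoid "dead" units in the early convolutional layers.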
