Deep Model for Classification of Hyperspectral Image using Restricted Boltzmann Machine

Midhun E M, Sarath R Nair, Nidhin Prabhakar T V, Sachin Kumar S
Centre for Excellence in Computational Engineering & Networking, Amrita Vishwa Vidyapeetham, Coimbatore, India - 641112
[email protected] ,
[email protected] ,
[email protected] ,
[email protected] ABSTRACT This paper presents an improved classification of hyperspectral images using deep learning, by extracting meaningful representations at higher levels. Deep learning is a set of algorithm in machine learning that attempt to model high level abstraction of data by using architectures composed of multiple non-linear transformation. It allows artificial systems to discover re-usable features that capture structure in an environment. The ability of undirected graphical models like Restricted Boltzmann Machine, to capture distribution among pixels at the hidden level is utilized here to extract features for each band in the hyperspectral image. To enhance the quality of image a band-byband non-linear diffusion is introduced as a preprocessing step which ensures increased class separability and noise reduction. After preprocessing, a powerful regenerative model Restricted Boltzmann Machine (RBM) is used for the feature extraction. The generated feature vectors is feed as input to different classifiers for the classification. A statistical comparison of accuracies, obtained with RBM under different conditions illustrates the effectiveness of proposed method. Hyperspectral dataset acquired by Airborne Visible/Infrared imaging Spectrometer is used for experimentation.
Keywords
Hyperspectral imagery, Diffusion, Image classification, Restricted Boltzmann Machine, Feature Generation, Deep learning.
1. INTRODUCTION
Advances in hyperspectral sensing technologies and computational mathematics have opened up new opportunities in earth remote sensing and related applications. Land cover mapping, target recognition, material identification, precision agriculture and abundance estimation are a few applications that utilize the large wealth of information obtained through earth remote sensing [1-3]. The data obtained from hyperspectral images need to be converted into a tractable form of information for easy storage, data handling and interpretation.
Classification is one such way of extracting information in data mining: each constituent pixel in the image is identified and a corresponding thematic map is generated. The quality of further analysis and decision making for an application depends on the accuracy of the classification performed. Hyperspectral images are inherently prone to noise, which may hinder proper classification, so a preprocessing step becomes necessary to reduce the noise in the image. It is evident from this work that band-by-band non-linear diffusion not only removes noise but also preserves edge information. This preprocessing ensures that noise in the data is not propagated along a cascaded processing scheme. Unlike plain denoising, diffusion smooths the data while preserving edge information, which is a crucial factor for feature extraction.

The most significant "breakthrough in remote sensing has been the development of hyperspectral sensors and software to analyze the resulting image data. Over the past decade hyperspectral image analysis has matured into one of the most powerful and fastest growing technologies in the field of remote sensing. The 'hyper' in hyperspectral means 'over' as in 'too many' and refers to the large number of measured wavelength bands. Hyperspectral images are spectrally over-determined, which means that they provide ample spectral information to identify and distinguish spectrally unique materials. Hyperspectral imagery provides the potential for more accurate and detailed information extraction than possible with any other type of remotely sensed data" [26].

Deep learning has proved to be the state-of-the-art technique for image classification, on datasets ranging from the simple MNIST to the complex 3D NORB [4, 5]. The ability of the RBM to extract features at higher levels is exploited here for extracting features [6] from hyperspectral images. The RBM, which can be categorized as a mixture model, is composed of a number of separately parameterized density models, each of which has two important properties: (a) there is an efficient way to compute the probability density of a datapoint under each model; (b) there is an efficient way to change the parameters of each model so as to maximize or increase the sum of the log probabilities it assigns to a set of datapoints [7]. This ability of the RBM is exploited here to extract meaningful representations, which can be used as input to a supervised predictor. The role of pre-training using RBMs is indispensable, especially when modelling a multi-layer framework as in deep learning.
2. DEEP LEARNING FOR FEATURE REPRESENTATION
Machine learning has made great strides in computer vision, especially in image classification. It is evident from previous work that classification accuracy relies on how good the features are, and deep learning can be viewed as a successful framework for feature learning and feature representation. The performance of machine learning methods relies heavily on the data representation (or features) to which they are applied, so most of the actual effort in implementing machine learning algorithms goes into the design of preprocessing and data transformations that yield a representation able to support effective machine learning [8]. Representation learning can be defined as learning representations of the data that make it easier to extract useful information when building classifiers or other predictors [8]. A probabilistic model like the RBM captures such information: a good representation is often one that captures the posterior distribution of the underlying explanatory factors for the observed input. A good representation, normally obtained by unsupervised methods, should be a useful input to the supervised predictor. Deep learning methods are those formed by "the composition of multiple non-linear transformations, with the goal of yielding more abstract, and ultimately more useful, representations" [8].
Deep learning's modern resurgence began in 2006 with Hinton's work on the MNIST dataset, which broke the supremacy of SVMs [12]. Since then, deep learning has been successfully applied in NLP [9], speech processing [10] and other areas of machine learning. Its main advantage is a multi-layer framework in which each layer extracts progressively more compact features: in the case of images, if the first layer extracts edges, subsequent layers extract combinations of edges that form meaningful representations. Even though it is computationally expensive, with the advent of GPU computing [11] and parallel programming, deep learning has been widely used and critically acclaimed in machine learning.

2.1 Restricted Boltzmann Machine
Restricted Boltzmann machines (RBMs) are probabilistic graphical models that can be interpreted as stochastic neural networks; they generally fall under the class of unsupervised learning. The increase in computational power and the development of faster learning algorithms have made them applicable to relevant machine learning problems [4]. For years, artificial neural networks were overshadowed by other machine learning algorithms because of the heavy computation ANNs require; the resurgence of ANNs was triggered by Hinton's successful implementation of RBMs in 2006 [12]. Since then, a wide range of RBM variants has been developed and successfully deployed at large scale. The RBM is a variant of the Boltzmann machine in which there are no connections among the hidden units or among the visible units. This restriction is the key idea of the RBM: the independence between units within a layer makes the conditional probability distribution of the visible units given the hidden units (and vice versa) easy to calculate. An RBM consists of two types of units, visible and hidden. The visible units constitute the first layer and correspond to the components of an observation (e.g., one visible unit for each pixel of a digital input image). The hidden units model dependencies between the components of observations (e.g., dependencies between pixels in images) and can be viewed as non-linear feature detectors [13]. A simplified way to understand the training of an RBM is: given some observations (input pixels), find a meaningful distribution over the hidden units such that the parameters (usually weights and biases) maximize the likelihood of the data. Training an RBM therefore amounts to finding these parameters by maximum likelihood estimation (MLE) [17]. The visible variables correspond to the components of an observation, while the latent variables introduce dependencies between the input units (pixels of an image). This behaviour of the RBM is the key reason it serves as a good mechanism for feature generation from images. A further key characteristic of the RBM is that its hidden units are conditionally independent given the observed data; this makes each hidden unit an independent expert at detecting a specific feature [14]. The architecture of the RBM used for experimentation, for one image vector, is shown in Figure 1.

Figure 1: Architecture of RBM
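To make this conditional independence concrete, the following minimal NumPy sketch (our own illustration, not from the paper) computes the hidden activation probabilities for a single pixel vector. The 200-to-60 dimensions anticipate the experimental setup of Section 3, and the random weights stand in for trained parameters.

```python
import numpy as np

def sigmoid(z):
    """Logistic function, maps activations into (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

def hidden_features(x, W, b):
    """Hidden-unit activation probabilities for one visible vector x.

    Because the hidden units are conditionally independent given the
    visible units, p(h|x) factorizes and each unit is computed
    independently as sigmoid(b_j + W_j . x).
    """
    return sigmoid(b + W @ x)

# Toy usage: one 200-band pixel vector mapped to 60 hidden feature detectors.
rng = np.random.default_rng(0)
x = rng.random(200)                         # one pixel's spectral signature
W = 0.01 * rng.standard_normal((60, 200))   # untrained weights, illustration only
b = np.zeros(60)
features = hidden_features(x, W, b)         # 60-dimensional feature vector
```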
2.2 Training of RBM
Restricted Boltzmann machines are used as generative models for different types of data, including labelled and unlabelled images (Hinton et al., 2006a). RBMs are usually trained using the contrastive divergence learning procedure [Hinton, 2002]. This requires "a certain amount of practical experience to decide how to set the values of numerical meta-parameters such as the learning rate, the momentum, the weight cost, the sparsity target, the initial values of the weights, the number of hidden units and the size of each mini-batch" [15]. The decisions about what types of units to use, whether to update their states stochastically or deterministically, how many times to update the states of the hidden units for each training case, and whether to start each sequence of state updates at a data vector likewise require a great deal of hands-on experience [16].

An RBM consists of visible units, which act as the input to the first layer and can be binary or real-valued, and latent units in the hidden layer, which capture the dependencies between the components of the input vector (pixels). The visible and hidden layers each have bias units, which are generally initialized to zero. The input vector is denoted by x and the hidden layer by h; the biases of the hidden and visible units are represented by b and c respectively. W is the matrix of connections between visible and hidden units, called weights in the neural-network setting, which is used to generate the model. One common use of an RBM is to initialize weights close to a good solution when classification is subsequently performed through back-propagation; here, instead of back-propagation, other classifiers are used. The joint distribution of x and h is defined through the energy function E(x, h) as follows:

E(\mathbf{x}, \mathbf{h}) = -\sum_{j,k} W_{j,k} h_j x_k - \sum_k c_k x_k - \sum_j b_j h_j   (1)

The distribution is obtained from the probability function p(x, h):

p(\mathbf{x}, \mathbf{h}) = \exp(-E(\mathbf{x}, \mathbf{h})) / Z   (2)

where Z is the partition function (normalization constant), which unfortunately is intractable for the RBM [4]. Bold h denotes a vector, whereas italic h_j denotes a scalar component of that vector. As there are no connections among the hidden units or among the visible units, conditional inference is easy in both directions [6]; this is the key property that allows features to be extracted from subsequent layers of an RBM. The conditional distribution of the hidden units given x factorizes:

p(\mathbf{h} \mid \mathbf{x}) = \prod_j p(h_j \mid \mathbf{x})   (3)

As the hidden units are Bernoulli, where 1 represents the presence and 0 the absence of a feature, the probability can be written as

p(h_j = 1 \mid \mathbf{x}) = \mathrm{sigm}(b_j + W_j \mathbf{x})   (4)

where W_j denotes the j-th row of W. Equation (4) gives the probability of a hidden unit being 1 given the visible units (pixels). The sigmoid function is the logistic function

\mathrm{sigm}(z) = \frac{1}{1 + \exp(-z)}   (5)

which transforms the data non-linearly into the range (0, 1). Similarly, the conditional distribution of the visible units given the hidden units is

p(x_k = 1 \mid \mathbf{h}) = \mathrm{sigm}(c_k + \mathbf{h}^T W_k)   (6)

where W_k denotes the k-th column of W.

To train an RBM, we minimize the average negative log-likelihood (NLL):

\frac{1}{T} \sum_t l(f(x^{(t)})) = \frac{1}{T} \sum_t -\log p(x^{(t)})   (7)

Here T denotes the number of training images and x^{(t)} one image of the training data. The aim is to find the parameters {W, b, c} that minimize this loss. As in neural networks, we follow the stochastic gradient descent algorithm, where the gradient decomposes as

\frac{\partial\, (-\log p(x^{(t)}))}{\partial \theta} = E_{\mathbf{h}}\left[ \frac{\partial E(x^{(t)}, \mathbf{h})}{\partial \theta} \,\middle|\, x^{(t)} \right] - E_{\mathbf{x},\mathbf{h}}\left[ \frac{\partial E(\mathbf{x}, \mathbf{h})}{\partial \theta} \right]   (8)

The first term on the right-hand side is called the positive phase and the second the negative phase. Because the negative phase involves an expectation under the model, it is hard to compute and generally intractable. To tame this, Hinton [2002, Neural Computation] proposed a method called contrastive divergence (CD), which works well in practice [18]. The idea is to replace the expectation in the negative phase by a point estimate \tilde{x} obtained by Gibbs sampling. A full discussion of the CD algorithm is beyond the scope of this paper; it is explained nicely in [4]. The parameter updates for the weights and biases are driven by the following equations:

W := W + \eta \left( h(x^{(t)})\, x^{(t)T} - h(\tilde{x})\, \tilde{x}^T \right)   (9)

b := b + \eta \left( h(x^{(t)}) - h(\tilde{x}) \right)   (10)

c := c + \eta \left( x^{(t)} - \tilde{x} \right)   (11)

Here \eta is the learning rate and h(x) is the vector of sigmoid activations defined above.

Figure 2. Architecture of RBM
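The update rules above translate almost directly into code. The following NumPy sketch of one CD-1 mini-batch update is our own illustration of Eqs. (4), (6) and (9)-(11), assuming binary visible units; the learning rate is left as a free parameter, in keeping with the meta-parameter caveats quoted above.

```python
import numpy as np

def sigmoid(z):
    """Logistic function of Eq. (5)."""
    return 1.0 / (1.0 + np.exp(-z))

def cd1_update(X, W, b, c, lr, rng):
    """One CD-1 parameter update over a mini-batch X (one visible vector
    per row), implementing Eqs. (4), (6) and (9)-(11). A single Gibbs
    step supplies the point estimate x_tilde that replaces the
    intractable negative-phase expectation."""
    ph = sigmoid(X @ W.T + b)                        # p(h=1|x), Eq. (4)
    h = (rng.random(ph.shape) < ph).astype(float)    # sample hidden states
    x_t = sigmoid(h @ W + c)                         # p(x=1|h), Eq. (6)
    ph_t = sigmoid(x_t @ W.T + b)                    # h(x_tilde)
    n = X.shape[0]
    W += lr * (ph.T @ X - ph_t.T @ x_t) / n          # Eq. (9)
    b += lr * (ph - ph_t).mean(axis=0)               # Eq. (10)
    c += lr * (X - x_t).mean(axis=0)                 # Eq. (11)
    return W, b, c
```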
3. PROPOSED METHOD
Hyperspectral images are inherently noisy due to photon effects, sensor distortion, calibration errors, etc. In order to improve image quality, diffusion is applied as a pre-processing step; diffusion not only smooths the data but also retains edge information. The diffusion used here is Perona-Malik diffusion [20, 21]. The diffused image is converted band by band into vector format and fed as input to the RBM. The first-layer input of the RBM consists of these vectors (pixels), initially of size 200. An ensemble of these vectors is used to model a three-layer RBM stack. The input vector is mapped to a hidden layer of 60 units through symmetrically weighted connections, which constitutes the first RBM layer. This layer is trained with the CD algorithm and the corresponding weights and biases are stored, so they can be reused to extract features for other images. The output of the first RBM layer, the hidden layer, acts as the input to the second RBM layer, which is again mapped to 60 units (hidden layer 2). The same training procedure is repeated, and the final layer (RBM layer 3) has an output of 200 units. The whole model can thus be visualized as a stack of RBMs acting as a feature generator. For classification, the generated 200-dimensional feature vectors are fed as input to supervised predictors such as Orthogonal Matching Pursuit (OMP) and Subspace Pursuit (SP) [22, 23], which are widely used classification algorithms for hyperspectral images [24, 25]. Figure 3 shows the flow diagram of the proposed method, and a small code sketch of the stacked feature generator follows.

[Flow diagram: Load hyperspectral image data -> Band-by-band Perona-Malik diffusion -> Hypercube formation -> Feature vector formation using RBM -> Classification using OMP and SP -> Land cover map]
Figure 3. Flow graph of proposed method
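As an illustration of the layer-wise scheme just described, the sketch below trains the 200-60-60-200 stack greedily. It assumes the hypothetical sigmoid and cd1_update helpers from the Section 2.2 sketch; the epoch count and learning rate are placeholders, since the paper selects hyper-parameters by trial and error.

```python
import numpy as np
# Assumes the sigmoid() and cd1_update() helpers from the Section 2.2 sketch.

def train_rbm(X, n_hidden, epochs=10, lr=0.05, seed=0):
    """Train one RBM layer with CD-1 and return its weights and hidden bias."""
    rng = np.random.default_rng(seed)
    W = 0.01 * rng.standard_normal((n_hidden, X.shape[1]))
    b, c = np.zeros(n_hidden), np.zeros(X.shape[1])
    for _ in range(epochs):
        W, b, c = cd1_update(X, W, b, c, lr, rng)
    return W, b

def stacked_features(X, layer_sizes=(60, 60, 200)):
    """Greedy layer-wise feature generation for the 200 -> 60 -> 60 -> 200
    stack described above; each layer's hidden activations feed the next."""
    feats, params = X, []
    for n_hidden in layer_sizes:
        W, b = train_rbm(feats, n_hidden)
        params.append((W, b))              # stored for reuse on other images
        feats = sigmoid(feats @ W.T + b)   # hidden layer -> next layer's input
    return feats, params

# Usage: rows of X are diffused 200-band pixel vectors; the resulting
# 200-dimensional features are then classified with OMP or SP.
```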
4. EXPERIMENTAL RESULTS
The experiments are conducted on the Indian Pines dataset. The hyperspectral cube has dimensions 145 x 145 x 220; for experimental purposes it is reshaped into a matrix of size 220 x 21025. The 20 noisy and water-absorption bands are removed from the data before classification, yielding a 200 x 21025 matrix [19]. The hyper-parameters were selected by trial and error; as there is no closed-form rule for their selection, commonly accepted values are used.

4.1 Dataset Description
The data used here were collected by the AVIRIS system, operated by the NASA Jet Propulsion Laboratory, over north-western Indiana in 1992.
Figure 4. (a) AVIRIS data scene; (b) ground truth image with class descriptors

The Indian Pines scene comprises 220 spectral channels (bands) in the wavelength range 400-2500 nm, with a nominal spectral resolution of 10 nm and a spatial resolution of 20 m per pixel. The image has 145 x 145 pixels per band. Figure 4(a) shows the AVIRIS Indian Pines data scene and Figure 4(b) the labelled ground truth of the 16 mutually exclusive classes, of which ten correspond to different crops, five represent vegetation types and one represents buildings. The background is shown in white and is not considered during classification.
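A minimal sketch of the reshaping and band-removal step described above. The file name is hypothetical, and the listed band indices are the commonly cited noisy/water-absorption set for Indian Pines, not a list taken from the paper.

```python
import numpy as np

# Hypothetical loading step: assume the Indian Pines cube is available as a
# 145 x 145 x 220 NumPy array (the paper obtains the data from AVIRIS [19]).
cube = np.load("indian_pines.npy")          # assumed file name, shape (145, 145, 220)

# Reshape to bands x pixels, as described in Section 4: 220 x 21025.
X = cube.reshape(-1, cube.shape[2]).T       # (220, 21025)

# Remove 20 noisy / water-absorption bands to obtain 200 x 21025.
# These indices are the commonly cited set (bands 104-108, 150-163, 220,
# 1-based) and are illustrative, not quoted from the paper.
bad = np.array([104, 105, 106, 107, 108,
                150, 151, 152, 153, 154, 155, 156, 157,
                158, 159, 160, 161, 162, 163, 220]) - 1  # to 0-based indices
X = np.delete(X, bad, axis=0)               # (200, 21025)
```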
4.2 Accuracy Assessment Measures
The different accuracy measures used in this paper are defined below:

Overall Accuracy = (total number of correctly classified pixels) / (total number of pixels)

Average Accuracy = (sum of the accuracies of each class) / (total number of classes)

Class-wise Accuracy = (correctly classified pixels in each class) / (total number of pixels in each class)

Together, these measures give a clear and detailed statistical picture that helps assess the effectiveness of the proposed method.
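These definitions translate directly into code; the following short NumPy sketch is our own rendering of the three measures from predicted and true label arrays.

```python
import numpy as np

def accuracy_measures(y_true, y_pred):
    """Overall, average, and class-wise accuracy as defined above."""
    classes = np.unique(y_true)
    # Class-wise accuracy: fraction of each class's pixels labelled correctly.
    classwise = np.array([np.mean(y_pred[y_true == c] == c) for c in classes])
    overall = np.mean(y_pred == y_true)   # correctly classified / total pixels
    average = classwise.mean()            # mean of the per-class accuracies
    return overall, average, dict(zip(classes.tolist(), classwise.tolist()))
```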
4.3 Results and Discussions
The pre-processing step plays a crucial role in feature extraction. As hyperspectral images are highly susceptible to noise, diffusion is indispensable, as Figure 5 shows: Figure 5(a) is the original noisy hyperspectral band and Figure 5(b) the diffused band, where the smoothing effect of diffusion is clearly visible.

Figure 5. Indian Pines band 100: (a) before diffusion; (b) after diffusion
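For reference, here is a minimal sketch of band-by-band Perona-Malik diffusion [20] with the exponential conduction function. The iteration count, kappa and step size are illustrative, not the paper's settings, and image borders wrap in this simplified version.

```python
import numpy as np

def perona_malik(band, n_iter=15, kappa=30.0, lam=0.2):
    """Anisotropic diffusion of one band: smooths homogeneous regions while
    the conduction coefficient g suppresses diffusion across edges.
    lam <= 0.25 keeps the 4-neighbour explicit scheme stable."""
    img = band.astype(float).copy()
    g = lambda d: np.exp(-(d / kappa) ** 2)   # conduction: g(|grad|)
    for _ in range(n_iter):
        # Finite differences to the four neighbours (borders wrap via roll).
        dN = np.roll(img, -1, axis=0) - img
        dS = np.roll(img, 1, axis=0) - img
        dE = np.roll(img, -1, axis=1) - img
        dW = np.roll(img, 1, axis=1) - img
        img += lam * (g(dN) * dN + g(dS) * dS + g(dE) * dE + g(dW) * dW)
    return img

# Band-by-band preprocessing of the whole cube:
# diffused = np.stack([perona_malik(cube[:, :, k])
#                      for k in range(cube.shape[2])], axis=2)
```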
The classification results for the Indian Pines dataset are given below.

Figure 6. For the Indian Pines image: (a) training set and (b) test set. Classification maps obtained by (c) OMP with RBM features, (d) OMP with diffusion, (e) OMP with RBM after diffusion, (f) SP with RBM features, (g) SP with diffusion, (h) SP with RBM after diffusion.
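The paper does not detail its OMP/SP implementation; the following hedged sketch shows one common way to use OMP [24] for sparse-representation classification of the RBM feature vectors, via scikit-learn's OrthogonalMatchingPursuit. The residual-per-class decision rule and the n_nonzero_coefs value are illustrative assumptions, not the authors' settings.

```python
import numpy as np
from sklearn.linear_model import OrthogonalMatchingPursuit

def omp_classify(D, labels, x, n_nonzero_coefs=10):
    """Sparse-representation classification with OMP: approximate the test
    feature vector x over the dictionary D (one training feature vector per
    column, labels aligned with columns), then assign the class whose atoms
    alone give the smallest reconstruction residual."""
    omp = OrthogonalMatchingPursuit(n_nonzero_coefs=n_nonzero_coefs,
                                    fit_intercept=False)
    omp.fit(D, x)                    # solve x ~ D @ coef with a sparse coef
    coef = omp.coef_
    residuals = {}
    for c in np.unique(labels):
        coef_c = np.where(labels == c, coef, 0.0)   # keep class-c coefficients
        residuals[c] = np.linalg.norm(x - D @ coef_c)
    return min(residuals, key=residuals.get)
```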
Table 1: Classification results (%) of Indian Pines for each class

Class | OMP with diffusion | OMP with RBM | OMP with RBM after diffusion | SP with diffusion | SP with RBM | SP with RBM after diffusion
1  | 84.78  | 89.13  | 91.30  | 76.09  | 73.91  | 89.13
2  | 58.05  | 51.05  | 74.02  | 57.77  | 50.49  | 70.73
3  | 61.45  | 59.98  | 68.19  | 57.11  | 56.02  | 66.99
4  | 56.96  | 54.74  | 70.89  | 51.48  | 57.38  | 59.07
5  | 86.75  | 80.12  | 86.75  | 89.44  | 81.78  | 88.41
6  | 94.38  | 84.25  | 97.53  | 91.92  | 87.40  | 94.11
7  | 100.00 | 96.43  | 96.43  | 85.71  | 96.43  | 92.86
8  | 91.63  | 93.31  | 90.38  | 90.38  | 94.35  | 92.47
9  | 100.00 | 100.00 | 100.00 | 100.00 | 100.00 | 100.00
10 | 77.78  | 65.02  | 80.76  | 74.18  | 70.16  | 76.34
11 | 71.36  | 69.98  | 79.19  | 72.83  | 64.32  | 80.53
12 | 48.90  | 48.23  | 63.74  | 48.57  | 46.37  | 60.03
13 | 94.63  | 94.15  | 96.59  | 95.61  | 90.73  | 91.71
14 | 88.85  | 88.06  | 95.18  | 89.09  | 90.12  | 92.81
15 | 49.22  | 44.82  | 74.35  | 44.04  | 41.97  | 68.65
16 | 93.55  | 94.62  | 100.00 | 97.85  | 94.62  | 87.10
Overall Accuracy | 73.18 | 71.07 | 81.30 | 72.36 | 68.29 | 79.34
Average Accuracy | 78.64 | 77.24 | 85.33 | 76.38 | 74.75 | 81.93
5. CONCLUSION
Feature representations are the backbone of image classification: more meaningful features yield greater accuracy than less expressive ones. The ability of the RBM to extract multi-layer features is fully exploited here. The statistical comparison above shows that feature extraction using RBMs produces rich, meaningful representations and hence better accuracy than the other methods considered. Future work includes adding further preprocessing techniques, such as the data whitening commonly used in deep learning methods.
6. REFERENCES
[1] Muhammad Ahmad, Sungyoung Lee, Ishan Ul Haq and Qaisar Mustaq, "Hyperspectral remote sensing: Dimensional reduction and endmember extraction," International Journal of Soft Computing and Engineering, vol. 2, issue 2, May 2012.
[2] Heesung Kwon, Xiaofei Hu, James Theiler, Alina Zare and Pruthvi Gurram, "Algorithms for multispectral and hyperspectral image analysis," Journal of Electrical and Computer Engineering, vol. 2013, Article ID 908906, 2013.
[3] G. Hughes, "On the mean accuracy of statistical pattern recognizers," IEEE Transactions on Information Theory, vol. 14, Jan. 1968.
[4] Vinod Nair and Geoffrey E. Hinton, "3D object recognition with deep belief nets," in Advances in Neural Information Processing Systems 22, 2009, pp. 1339-1347.
[5] G. Hinton and R. R. Salakhutdinov, "Reducing the dimensionality of data with neural networks," Science, vol. 313, no. 5786, pp. 504-507.
[6] A. Stuhlsatz, J. Lippel and T. Zielke, "Discriminative feature extraction with deep neural networks," in The 2010 International Joint Conference on Neural Networks (IJCNN), 2010, pp. 1-8.
[7] Vinod Nair and Geoffrey Hinton, "Implicit mixtures of restricted Boltzmann machines."
[8] Yoshua Bengio, Aaron Courville and Pascal Vincent, "Representation learning: A review and new perspectives," IEEE Transactions on Pattern Analysis and Machine Intelligence, 2012, pp. 1798-1828.
[9] George E. Dahl, Ryan P. Adams and Hugo Larochelle, "Training restricted Boltzmann machines on word observations," in Proceedings of the 29th International Conference on Machine Learning (ICML-12), 2012.
[10] George E. Dahl, Abdel-rahman Mohamed and Geoffrey Hinton, "Deep belief networks using discriminative features for phone recognition," in Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), 2011, pp. 5060-5063.
[11] Sean Patrick Parker, "GPU implementation of a deep learning network for image recognition tasks," thesis, 2012.
[12] G. E. Hinton and R. R. Salakhutdinov, "Reducing the dimensionality of data with neural networks," Science, vol. 313, no. 5786, pp. 504-507, 28 July 2006.
[13] Geoffrey E. Hinton, "Boltzmann machine," Scholarpedia, 2(5):1668, 2007.
[14] Mohammad Norouzi, "Convolutional restricted Boltzmann machines for feature learning," thesis.
[15] Ilya Sutskever, James Martens, George E. Dahl and Geoffrey E. Hinton, "On the importance of initialization and momentum in deep learning," in Proceedings of the 30th International Conference on Machine Learning (ICML-13), 2013, pp. 1139-1147.
[16] Hugo Larochelle, Michael Mandel, Razvan Pascanu and Yoshua Bengio, "Learning algorithms for the classification restricted Boltzmann machine," Journal of Machine Learning Research, 2012, pp. 643-669.
[17] T. Tieleman, "Training restricted Boltzmann machines using approximations to the likelihood gradient," in Proceedings of the 25th International Conference on Machine Learning, 2008, pp. 1064-1071.
[18] G. Hinton, "A practical guide to training restricted Boltzmann machines," UTML TR 2010-003.
[19] http://aviris.jpl.nasa.gov
[20] P. Perona and J. Malik, "Scale-space and edge detection using anisotropic diffusion," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 12, no. 7, Jul. 1990, pp. 629-639.
[21] B. Demir and S. Erturk, "Spectral magnitude and spectral derivative feature fusion for improved classification of hyperspectral images," IGARSS, 2008, pp. 1020-1023.
[22] F. Melgani and L. Bruzzone, "Classification of hyperspectral remote sensing images with support vector machines," IEEE Transactions on Geoscience and Remote Sensing, vol. 42, 2004, pp. 1778-1790.
[23] Benqin Song, Jun Li and Mauro Dalla Mura, "Remotely sensed image classification using sparse representations of morphological attribute profiles," IEEE Transactions on Geoscience and Remote Sensing, vol. 52, no. 8, Aug. 2014.
[24] J. Tropp and A. Gilbert, "Signal recovery from random measurements via orthogonal matching pursuit," IEEE Transactions on Information Theory, vol. 53, no. 12, 2007, pp. 4655-4666.
[25] Wei Dai and Olgica Milenkovic, "Subspace pursuit for compressive sensing signal reconstruction," Jan. 2009.
[26] Peg Shippert, "Introduction to hyperspectral image analysis."