IEEE JOURNAL OF SELECTED TOPICS IN APPLIED EARTH OBSERVATIONS AND REMOTE SENSING, VOL. 5, NO. 2, APRIL 2012


Unsupervised Band Selection for Hyperspectral Imagery Classification Without Manual Band Removal

Sen Jia, Zhen Ji, Yuntao Qian, and Linlin Shen

Abstract—The rich information available in hyperspectral imagery has provided significant opportunities for material classification and identification. Due to the "curse of dimensionality" (the Hughes phenomenon) posed by the large number of spectral channels combined with small amounts of labeled training samples, dimensionality reduction is a necessary preprocessing step for hyperspectral data. Generally, in order to improve the classification accuracy, noise bands generated by various sources (primarily the sensor and the atmosphere) are manually removed in advance. However, the removal of these bands may discard important discriminative information, eventually degrading the classification accuracy. In this paper, we propose a new strategy to automatically select bands without manual band removal. First, wavelet shrinkage is applied to denoise the spatial images of the whole data cube. Then affinity propagation, a recently proposed exemplar-based clustering approach, is used to choose representative bands from the noise-reduced data. Experimental results on three real hyperspectral data sets collected by two different sensors demonstrate that the bands selected by our approach on the whole data (containing noise bands) achieve higher overall classification accuracies than those selected by other state-of-the-art feature selection techniques on the manual-band-removal (MBR) data, and are even better than the bands identified by the proposed approach on the MBR data. This indicates that the removed "noise" bands are valuable for hyperspectral classification and should not be eliminated.

Index Terms—Affinity propagation, band selection, hyperspectral imagery classification, wavelet shrinkage.

I. INTRODUCTION

THE development of image sensor technology has made it possible to capture image data in hundreds of bands covering a broad wavelength range (0.4–2.5 μm) [1]. The high dimensionality of hyperspectral data increases the ability to classify and recognize materials.

Manuscript received October 10, 2011; revised January 19, 2012; accepted January 31, 2012. Date of publication March 06, 2012; date of current version May 23, 2012. This work was supported by the National Natural Science Foundation of China (60902070, 61171125, 61171151), the Doctor Starting Project of the Natural Science Foundation of Guangdong Province (9451806001002287), and the Open Research Fund of the State Key Laboratory of Information Engineering in Surveying, Mapping and Remote Sensing (11R02). S. Jia, Z. Ji, and L. Shen are with the Shenzhen City Key Laboratory of Embedded System Design, College of Computer Science and Software Engineering, Shenzhen University, Shenzhen, China (e-mail: [email protected], [email protected], [email protected]). Y. Qian is with the Institute of Artificial Intelligence, College of Computer Science, Zhejiang University, Hangzhou, China (e-mail: [email protected]). Color versions of one or more of the figures in this paper are available online at http://ieeexplore.ieee.org. Digital Object Identifier 10.1109/JSTARS.2012.2187434

Fig. 1. Illustration of the manual band removal procedure.

However, the number of training samples is limited in most hyperspectral applications. Meanwhile, the high dimensionality of hyperspectral data leads to a large increase in computational time, and the highly correlated bands contain a degree of redundancy that may have a negative impact on classification accuracy [2]. Hence, the main challenge for hyperspectral image classification is to reduce the computational complexity without degrading the classification accuracy, and dimensionality reduction (DR) is often adopted to reduce computational cost and improve knowledge discovery.

The research on dimensionality reduction can be divided into two main streams. One is feature extraction [3], [4], which transforms the original data onto a destination feature space through projections such as projection pursuit (PP) [5], [6], principal component analysis (PCA) [7]–[9], and independent component analysis (ICA) [10]. The other is band (or feature) selection [11], which identifies a subset of the original spectral bands that contains most of the characteristics. The former generally achieves higher classification accuracy, while the latter preserves the relevant original information of the spectral bands [12].

For hyperspectral imagery, spectra observed in the natural world are affected by various kinds of noise due to atmospheric conditions, sensor noise, material location, and other factors [13]. In traditional hyperspectral image processing, the noisy bands are considered useless and removed in advance [14]–[18]. As shown in Fig. 1, the original spectrum of woods is chosen from the Indian Pines hyperspectral database, which was obtained by the Airborne Visible/Infrared Imaging Spectrometer (AVIRIS) hyperspectral sensor [19]. Although the AVIRIS sensor initially acquires the Indian Pines data in 224 bands, four of these contain only zeros and are discarded, leaving 220 bands in the public data set. Through visual inspection, there are 20 water absorption channels (numbered 104–108, 150–163, and 220) and 15 noisy channels (numbered 1–3, 103, 109–112, 148–149, 164–165, and 217–219) [20]. These bands are often removed manually, resulting in a total of 185 reserved spectral bands; a minimal sketch of this manual masking step is given below.
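To make the band bookkeeping above concrete, the following minimal Python sketch (an illustration only, not part of the original paper) builds the 185-band index set that manual band removal would retain for the Indian Pines data:

```python
import numpy as np

# 1-based channel indices, as listed in the text
water_absorption = list(range(104, 109)) + list(range(150, 164)) + [220]
noisy = ([1, 2, 3, 103] + list(range(109, 113))
         + [148, 149, 164, 165] + list(range(217, 220)))

# bands kept after manual band removal (MBR)
keep = np.setdiff1d(np.arange(1, 221), water_absorption + noisy)
assert keep.size == 185  # the 185 reserved bands mentioned above
```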


However, it is hard to pre-determine which bands should be removed, because the result depends on the land cover properties and on differences between sensors. Therefore, great manual effort is required. Moreover, although the removal of the noisy bands generally results in improved classification accuracy, some bands containing important discriminative information may also be discarded, degrading the subsequent classification procedures. In addition, the intrinsic noise caused by the instrument, calibration, and atmospheric correction errors cannot be eliminated by the manual-band-removal process.

Alternatively, several techniques have been proposed to alleviate the interference of noise instead of discarding the "noise" bands directly [21]. A recent and efficient technique is the wavelet transform. In order to denoise the data, wavelet shrinkage (WS) suppresses the noise while preserving the actual image discontinuities [22], [23]. In this case, noise reduction is obtained by shrinking the noisy coefficients in the wavelet domain: noisy wavelet coefficients are removed, while noise-free coefficients are reduced considerably less or kept unchanged. The correlation between bands has been used to denoise hyperspectral images by enforcing simultaneous sparsity on their wavelet representations [24]. Wavelet shrinkage was applied to denoise the first intrinsic mode functions of each band obtained by 2-D empirical mode decomposition [25]. A hybrid of spatial and spectral wavelet shrinkage working in the spectral derivative domain has been presented for the noise reduction of hyperspectral imagery [26]. In [27], Sendur and Selesnick's bivariate wavelet shrinkage function [28], [29] was extended from two-dimensional image denoising to hyperspectral imagery denoising. In [30], [31], the hyperspectral imagery was denoised by combining bivariate wavelet shrinkage, wavelet packets, and PCA, which also reduced the dimensionality of the original hyperspectral data cube. In [32], PCA was first used to decorrelate the fine features of the data cube from the noise, then bivariate wavelet shrinkage was adopted to denoise the low-energy PCA output channels; dual-tree complex wavelet transform was also used to remove the spectral noise of each pixel in the data cube. In order to decrease the computational load, in this paper, the wavelet shrinkage-based denoising procedure is applied to the spatial image of each band instead of the spectrum of each pixel.

After the data have been spatially denoised, representative features should be chosen for classification. Because the features obtained by feature extraction methods cannot be related to the original wavelengths, band selection is preferable. Meanwhile, because a training set is often unavailable in practice, or available only in very small numbers, supervised band selection techniques, which select a band subset based on a class separability measure of training samples, are not suitable for hyperspectral band selection. On the contrary, the importance of a band in unsupervised band selection methods is evaluated by various statistical measures or clustering quality assessments [33], which are unrelated to the labeled sample size.

Most unsupervised band selection methods are based on band ranking, which constructs and evaluates an objective matrix according to various criteria; the spectral band-related vectors are then ranked and selected. For example, information divergence (ID) [12], maximum-variance principal component analysis (MVPCA) [34], and mutual information (MI) [35] are widely used. The divergence criterion is used in ID to examine the non-Gaussian property of each band; a band image with large divergence has higher priority. By constructing a loading factor matrix from the eigenvalues and eigenvectors obtained by PCA of the data, MVPCA decides the priority of the bands by sorting the variances of the associated loading factors from high to low. The main disadvantage of band ranking methods is that a band containing important information may be eliminated due to a low value of the associated criterion. In addition, the mutual information method selects a given number of bands with maximal mutual information, which is a time-consuming optimization problem.

In fact, unsupervised band selection can be considered a data clustering problem, i.e., a process of partitioning the data set into groups of similar objects (clusters) without any class label information [36]. In clustering-based band selection approaches, each band is considered a data point, the pairwise similarity or correlation between two bands is measured, and the bands are grouped into several clusters. Band decorrelation is accomplished by selecting one band in each cluster to represent all the bands in that cluster. The performance of clustering-based methods is slightly better than that of band ranking-based methods using the same correlation criterion, while the computational cost can be largely reduced [37]. There exist a large number of clustering algorithms, of which hierarchical clustering and k-center have been used in clustering-based band selection. Recently, we have introduced a new clustering algorithm, based on affinity propagation (AP), for hyperspectral band selection [38]. AP, proposed by Frey and Dueck [39], [40], initially considers all data points as potential cluster exemplars, and then exchanges messages between data points until a stable state is reached. Clusters are formed by assigning each data point to its most similar exemplar. AP has been applied to several hard problems, such as error correction and computer vision, with good results. Compared to the k-center and hierarchical clustering methods, the main advantage of AP is that it uses a "preference" parameter to control the number of clusters, instead of specifying the number in advance. Besides, AP is not sensitive to the initial selection of exemplars because it considers all data points as potential exemplars. Experiments have shown that AP is a good choice for band selection of hyperspectral imagery.

In this paper, an unsupervised band selection methodology is introduced to acquire the representative bands of hyperspectral data without manual band removal preprocessing. First, wavelet shrinkage is used to denoise the hyperspectral imagery in the spatial domain. It is worthwhile to point out that both the spectral and spatial dimensions of the cube are kept unchanged throughout the denoising procedure, making the subsequently selected bands meaningful. Second, affinity propagation is applied to choose the exemplars from the denoised data.


Fig. 2. System block diagram of the unsupervised band selection system without manual band removal for hyperspectral classification.

Based on these feature vectors, two classification algorithms, K-nearest neighbor (KNN) [41] and support vector machine (SVM) [42], are adopted to examine the separability of the chosen exemplars. Fig. 2 illustrates the system block diagram of the unsupervised band selection system without manual band removal for hyperspectral classification. Experiments on three real hyperspectral data sets acquired by two sensors verify the importance of the omitted "noise" bands in terms of classification.

The rest of this paper is organized as follows. Section II presents a brief introduction to noise reduction using wavelet shrinkage. Affinity propagation for band selection is given in Section III. In Section IV, the experimental results on three real hyperspectral data sets are evaluated to demonstrate the value of the "noise" bands. Finally, Section V makes concluding remarks.

II. WAVELET SHRINKAGE-BASED SPATIAL NOISE REDUCTION

Wavelet transforms are the basis of many powerful tools that are now being used in remote sensing applications, e.g., compression, registration, fusion, and classification. An intrinsic property of the wavelet transform is that it preserves both high- and low-frequency features during the signal decomposition. The discrete wavelet transform (DWT) of a signal is defined as the inner products of the signal with wavelet bases. The fine-scale and large-scale information of the signal can be simultaneously investigated by projecting it onto a set of wavelet bases with various scales. Using the Mallat algorithm [43], the DWT can be computed very efficiently. The mother wavelet can be represented as a set of high-pass and low-pass filters. After the filtering, the outputs of the high- and low-pass branches are called the wavelet detail and approximation coefficients, respectively; a small decomposition example is shown below.
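As a quick illustration of the decomposition just described (not from the original paper), a single-level 2-D DWT with the PyWavelets library splits a band image into one approximation subband and three detail subbands; the db4 wavelet here is an arbitrary choice:

```python
import numpy as np
import pywt

img = np.random.rand(145, 145)            # stand-in for one band image
# one level of 2-D DWT: low-pass approximation plus
# horizontal, vertical, and diagonal detail coefficients
cA, (cH, cV, cD) = pywt.dwt2(img, "db4")
```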

A. Wavelet Shrinkage

The key idea of wavelet shrinkage is that the wavelet representation can separate the signal and the noise. The DWT compacts the energy of the signal into a small number of DWT coefficients with large amplitudes and spreads the energy of the noise over a large number of DWT coefficients with small (or zero) amplitudes. Hence, wavelet shrinkage algorithms eliminate most of the noise contribution to the signal in the wavelet domain by removing the small coefficients and shrinking the large coefficients, which can be accomplished by the soft thresholding operator [44]. Precisely, let $y$ be the noisy signal that is composed of the pure signal $x$ and noise $n$:

$$y = x + n \quad (1)$$

The denoising goal is to suppress the noise part $n$ and to recover $x$. The estimate $\hat{x}$ of the signal by wavelet shrinkage can be computed as

$$\hat{x} = \mathcal{W}^{-1} T_{\lambda}(\mathcal{W} y) \quad (2)$$

where the operators $\mathcal{W}$ and $\mathcal{W}^{-1}$ stand for the forward and inverse DWT, respectively, and $T_{\lambda}(\cdot)$ is the wavelet-domain soft thresholding operator with a threshold $\lambda$, defined as

$$T_{\lambda}(w) = \operatorname{sign}(w)\,\max(|w| - \lambda,\, 0) \quad (3)$$

where $w$ is a given wavelet coefficient. Since soft thresholding is a nonlinear operator, wavelet shrinkage is a kind of nonlinear model.

Clearly, the most important problem in WS denoising systems is the threshold determination. Several algorithms have been introduced to estimate threshold values that are optimal in different senses. In this paper, we implement a data-driven threshold, called Stein's unbiased risk estimator (SURE), to accomplish the spatial denoising of hyperspectral imagery [45]. SureShrink minimizes the Stein unbiased estimate of risk over candidate thresholds:

$$\lambda^{*} = \arg\min_{\lambda} \mathrm{SURE}(\lambda; \mathbf{w}) \quad (4)$$

where $\mathbf{w} = \{w_1, \ldots, w_N\}$ is the set of wavelet coefficients of the noisy signal, $N$ is the number of wavelet coefficients, and $\mathrm{SURE}(\lambda; \mathbf{w})$ is the SURE risk for a threshold $\lambda$, given by

$$\mathrm{SURE}(\lambda; \mathbf{w}) = N - 2\,\#\{i : |w_i| \le \lambda\} + \sum_{i=1}^{N} \min(|w_i|, \lambda)^2 \quad (5)$$

where $i$ indexes the wavelet coefficients and $\#\{\cdot\}$ denotes the number of elements in a set.
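A compact sketch of the soft thresholding operator (3) and the SURE threshold selection of (4)–(5) may help make the procedure concrete. This is an illustration under the usual SureShrink assumption of unit noise variance (so coefficients are expected to be pre-scaled by the noise level); it is not the authors' code:

```python
import numpy as np

def soft_threshold(w, lam):
    # Eq. (3): shrink each coefficient toward zero by lam
    return np.sign(w) * np.maximum(np.abs(w) - lam, 0.0)

def sure_risk(w, lam):
    # Eq. (5): SURE risk of threshold lam (unit noise variance assumed)
    return (w.size
            - 2.0 * np.count_nonzero(np.abs(w) <= lam)
            + np.sum(np.minimum(np.abs(w), lam) ** 2))

def sure_threshold(w):
    # Eq. (4): minimize the SURE risk; it is enough to test the
    # coefficient magnitudes themselves as candidate thresholds
    candidates = np.sort(np.abs(w.ravel()))
    risks = [sure_risk(w, t) for t in candidates]
    return candidates[int(np.argmin(risks))]
```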


Fig. 4. Illustration of the automatic noise reduction procedure on the spectrum of woods.

Fig. 3. Illustration of the automatic noise reduction procedure on spatial images. (a) Band 20. (b) Band 164.


Fig. 5. Two kinds of message exchanging procedures. The dotted lines show the process of sending responsibilities, and the solid lines show that of sending availabilities.

B. Wavelet Shrinkage-Based Spatial Noise Reduction

A hyperspectral image is a three-dimensional array whose width and length correspond to the spatial dimensions and whose third dimension corresponds to the spectral bands; denote these sizes by $W$, $H$, and $B$, respectively, so that $\mathbf{X} \in \mathbb{R}^{W \times H \times B}$ is the image cube with each band $\mathbf{X}_b$ ($b = 1, \ldots, B$) being a $W \times H$ image matrix. Instead of removing the noisy bands manually, wavelet shrinkage with the SureShrink threshold is applied directly to each band image of the original hyperspectral data, fulfilling the spatial noise reduction task. For hyperspectral data, because the noise level varies with the signal level from band to band, the data-driven SureShrink threshold can eliminate most of the noise contribution to the signal; a sketch of this per-band procedure is given at the end of this subsection.

Fig. 3 illustrates the before-and-after results of the automatic noise reduction procedure on two images chosen from the Indian Pines hyperspectral data set, i.e., band 20 and band 164 (identified as a noise band). From the figures, we can see that both spatially noise-reduced images are smoother than the original ones, which accords with the fact that the signal in the spatial domain (which can be seen as a normal "real-life image") usually carries a considerable degree of regularity. Meanwhile, the spectrum of woods is also chosen to illustrate the effect of the automatic spatial noise reduction on the spectral dimension, as depicted in Fig. 4. Note that, although the WS-based noise reduction procedure is only applied in the spatial domain, the resulting signature is smoother than the original one.
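The following sketch applies the per-band spatial denoising described above, reusing the hypothetical soft_threshold and sure_threshold helpers from Section II-A; the wavelet family, decomposition level, and the median-based noise estimate are illustrative assumptions, not specifics stated in the paper:

```python
import numpy as np
import pywt

def denoise_band(img, wavelet="db4", level=2):
    # 2-D DWT, soft-threshold the detail subbands with a SURE
    # threshold, inverse DWT; the spatial size is preserved
    coeffs = pywt.wavedec2(img, wavelet, level=level)
    approx, details = coeffs[0], coeffs[1:]
    # common robust noise estimate from the finest diagonal subband
    sigma = np.median(np.abs(details[-1][2])) / 0.6745
    shrunk = []
    for cH, cV, cD in details:
        bands = []
        for c in (cH, cV, cD):
            lam = sigma * sure_threshold(c / sigma)   # per-subband threshold
            bands.append(soft_threshold(c, lam))
        shrunk.append(tuple(bands))
    rec = pywt.waverec2([approx] + shrunk, wavelet)
    return rec[: img.shape[0], : img.shape[1]]        # crop any padding

def denoise_cube(cube):
    # cube: (W, H, B); each band image is denoised independently
    return np.stack([denoise_band(cube[:, :, b])
                     for b in range(cube.shape[2])], axis=2)
```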

III. AFFINITY PROPAGATION-BASED BAND SELECTION

Affinity propagation determines the exemplars by exchanging real-valued messages between all data points. Two kinds of message-passing procedures, termed "responsibility" and "availability", are used to exchange messages between each point $i$ and each candidate exemplar $k$. The responsibility $r(i, k)$ is the message sent from cluster member $i$ to candidate exemplar $k$, indicating how well-suited point $k$ would be as the exemplar for point $i$. Conversely, the availability $a(i, k)$ is the message sent from candidate exemplar $k$ to potential cluster member $i$, indicating how appropriate it would be for point $i$ to choose candidate $k$ as its exemplar. Fig. 5 shows the two message exchanging procedures. The same point can be regarded as a competing candidate exemplar in the responsibility sending process or as a supporting data point in the availability sending process.

Initially, the availabilities are set to zero. Then the responsibilities and availabilities are computed using the following rules:

$$r(i, k) \leftarrow s(i, k) - \max_{k' \ne k} \{a(i, k') + s(i, k')\} \quad (6)$$

$$a(i, k) \leftarrow \min\Big\{0,\; r(k, k) + \sum_{i' \notin \{i, k\}} \max\{0, r(i', k)\}\Big\} \quad (7)$$

where $s(i, k)$ represents the similarity of points $i$ and $k$. In addition, the self-availability is updated differently:

$$a(k, k) \leftarrow \sum_{i' \ne k} \max\{0, r(i', k)\} \quad (8)$$

The above updating rules are derived from belief propagation in a factor graph [40]. After the iterative procedure converges to a stable state, the exemplar $k_i^{*}$ of each point $i$ can be determined by

$$k_i^{*} = \arg\max_{k} \{a(i, k) + r(i, k)\} \quad (9)$$

A compact sketch of this message-passing loop is given below.
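The following is a minimal NumPy sketch of the update rules (6)–(9), including the damping of (10) discussed next; the damping value 0.9 is an illustrative choice, since the paper does not restate its exact setting here:

```python
import numpy as np

def affinity_propagation(S, damping=0.9, max_iter=500):
    # S: (B, B) similarity matrix; its diagonal holds the preferences s(k, k).
    # Returns, for every point i, the index of its exemplar, per Eq. (9).
    B = S.shape[0]
    R = np.zeros_like(S)          # responsibilities r(i, k)
    A = np.zeros_like(S)          # availabilities a(i, k), start at zero
    rows = np.arange(B)
    for _ in range(max_iter):
        # Eq. (6): subtract the best competing value a(i, k') + s(i, k')
        AS = A + S
        k1 = AS.argmax(axis=1)
        max1 = AS[rows, k1]
        AS[rows, k1] = -np.inf
        max2 = AS.max(axis=1)
        Rnew = S - max1[:, None]
        Rnew[rows, k1] = S[rows, k1] - max2
        # Eqs. (7)-(8): availabilities from the positive responsibilities
        Rp = np.maximum(Rnew, 0)
        Rp[rows, rows] = Rnew[rows, rows]        # keep r(k, k) unclipped
        col = Rp.sum(axis=0)
        Anew = np.minimum(0, col[None, :] - Rp)
        Anew[rows, rows] = col - Rp[rows, rows]  # Eq. (8)
        # Eq. (10): damp both messages to avoid oscillation
        R = damping * R + (1 - damping) * Rnew
        A = damping * A + (1 - damping) * Anew
    return (A + R).argmax(axis=1)                # Eq. (9)
```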


In reality, the simple update rules in (6) and (7) often lead to oscillations caused by "overshooting" the solution, so the responsibility and availability messages are damped as follows:

$$m_t = \lambda_d\, m_{t-1} + (1 - \lambda_d)\, \hat{m}_t \quad (10)$$

where $m_{t-1}$ and $\hat{m}_t$ are the message values from the previous and current iteration, respectively, and the damping factor $\lambda_d$ is between 0 and 1. In all of our experiments, the same default damping factor is used, considering both speed and stability of convergence.

After denoising the original hyperspectral data by wavelet shrinkage, the AP algorithm is applied to select the representative bands. AP starts with the construction of a similarity matrix $\mathbf{S} \in \mathbb{R}^{B \times B}$ ($B$ is the number of spectral bands), in which the element $s(i, k)$ measures how well band $k$ represents band $i$. A common choice for the similarity is the negative squared Euclidean distance:

$$s(i, k) = -\|\mathbf{x}_i - \mathbf{x}_k\|^2, \quad i \ne k \quad (11)$$

where $\mathbf{x}_i$ denotes the $i$th band image arranged as a vector. As for $s(k, k)$, different from the self-similarity in other clustering algorithms, it has a particular meaning in AP: it reflects the a priori suitability of band $k$ to serve as an exemplar, and is referred to as the "preference". Instead of fixing the number of exemplars in advance, the preferences are used to control how many bands are selected as exemplars. Concretely, low preferences lead to small numbers of clusters, while high preferences lead to large numbers of clusters. In most cases, because no prior inclination is available for particular bands to be the exemplars, the preferences of all bands are set to the same value.

Fortunately, the "preference" of a band can be approximately estimated. From the classification viewpoint, the prior inclination for a particular band to be an exemplar represents the effect of that band as an individual one in classification, i.e., its discriminative capability. In the unsupervised learning case, the band with greater deviation from its associated Gaussian distribution has stronger discriminative capability. This is a quantitative measure of the nongaussianity of the band, which can be estimated by kurtosis, a classic measure of nongaussianity [46]:

$$\mathrm{kurt}(y) = E\{y^4\} - 3\,(E\{y^2\})^2 \quad (12)$$

Kurtosis is zero for a Gaussian random variable. For most (but not quite all) nongaussian random variables, kurtosis is nonzero and can be either positive or negative. Therefore, the square of the kurtosis is used to compute the preference

$$s(k, k) = \rho \cdot \mathrm{kurt}^2(\mathbf{x}_k) \quad (13)$$

where $\rho$ is a negative value for controlling the number of exemplars. In short, AP not only considers the discriminative capability of each individual band through $s(k, k)$, but also the correlation/similarity among bands through $s(i, k)$, so that the exemplars are those bands that have both higher discriminative capability and less correlation.

The proposed unsupervised band selection strategy without manual band removal is summarized as follows (a sketch of the similarity construction in step 2 is given after the complexity discussion below).

1) Use wavelet shrinkage to remove the spatial noise of the hyperspectral data automatically.


2) Use (11) and (13) to compute the off-diagonal elements $s(i, k)$ and the preferences $s(k, k)$ of the similarity matrix $\mathbf{S}$, respectively. The number of exemplars is controlled by the value of $\rho$.
3) Apply the AP algorithm to find the exemplars, i.e., the representative bands of the denoised data.
4) Use classification algorithms to evaluate the effectiveness of the chosen bands.
5) If a different number of exemplars is needed, give a new value to the preference parameter $\rho$ and repeat steps 2)–4).

Concerning the parameter $\rho$, we first let $\rho = 1$. Then the mean of the similarities between bands and the mean of the preferences are computed, and the initial $\rho$ is set from the ratio of these two means, which produces a moderate number of exemplars. By gradually increasing or decreasing the value of $\rho$, the number of exemplars correspondingly increases or decreases, producing all the numbers needed in the comparative experiments.

Finally, we analyze the computational complexity of the proposed unsupervised band selection methodology without manual band removal. The cost of calculation can be divided into two parts. One part is consumed by denoising the band images of the whole data; since the complexity of DWT-based wavelet shrinkage is linear in the number of pixels of a $W \times H$ image [47], the cost of this part is $O(WHB)$. The other part is related to the AP-based band selection. According to the analysis in [33], the costs of the similarity matrix computation and the message-passing procedure are $O(WHB^2)$ and $O(TB^2)$, respectively, where $T$ is the maximum iteration number; in our experiments, $T$ is set to 500. Apparently, for hyperspectral imagery of large size, the complexity of the iteration procedure is much lower than that of the preclustering part (the denoising and similarity matrix computation). Note that the preclustering part only needs to be computed once, and the clustering procedure is independent of the number of pixels in the scene, so the algorithm can be applied to hyperspectral imagery of large sizes.
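As referenced in step 2 above, a small sketch of how the similarity matrix and kurtosis-based preferences of (11)–(13) might be assembled is given here; the use of SciPy's excess-kurtosis estimator in place of the moment expression in (12) is an assumption for illustration:

```python
import numpy as np
from scipy.stats import kurtosis

def band_similarity_matrix(cube, rho):
    # cube: (W, H, B) denoised data; each band becomes one vector
    X = cube.reshape(-1, cube.shape[2]).T          # (B, W*H)
    # Eq. (11): negative squared Euclidean distance between bands
    sq = np.sum(X ** 2, axis=1)
    S = -(sq[:, None] + sq[None, :] - 2.0 * X @ X.T)
    # Eqs. (12)-(13): preferences from squared (excess) kurtosis,
    # scaled by a negative rho to control the number of exemplars
    k = kurtosis(X, axis=1, fisher=True)           # zero for Gaussian data
    np.fill_diagonal(S, rho * k ** 2)
    return S
```

Making rho more negative shrinks the preferences and hence the number of selected exemplars, matching step 5 above.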


Fig. 6. Indian Pines AVIRIS dataset, band 25.

TABLE I LAND COVER CLASSES WITH NUMBER OF TRAINING AND TEST SET SAMPLES FOR THE INDIAN PINES DATA

IV. EXPERIMENTAL RESULTS

Having presented our method in the previous sections, we now turn our attention to demonstrating its utility for classification. We employ three real-world data sets acquired by two different sensors, the AVIRIS and Hyperion hyperspectral instruments, and evaluate the performance of the algorithms using the K-nearest neighbor (KNN) and support vector machine (SVM) classifiers, which are summarized as follows (an evaluation sketch follows this list):

1) The KNN classifier is one of the most fundamental and simple classification methods and should be one of the first choices when there is little or no prior knowledge about the distribution of the data. The KNN classifier achieves high performance when a large number of training samples are available. The class of a new sample is determined by the labels of the training data points that are nearest to this sample. In our experiments, the number of neighbors in KNN is set to 3.

2) SVM has shown great potential in classification and regression problems, and a good performance against other classifiers in the case of nonlinearly separable data and small training sample sets. The optimal decision surface of SVM is constructed from its support vectors with maximum margin, which are conventionally determined by solving a quadratic programming problem. In our experiments, the radial basis function (RBF) kernel and the one-against-all scheme are used for multi-class classification.
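A hedged scikit-learn sketch of this evaluation protocol (KNN with k = 3, an RBF SVM in a one-against-all wrapper, and a small stratified training fraction) is shown below; the helper name and the exact split mechanics are illustrative, not taken from the paper:

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.multiclass import OneVsRestClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC

def evaluate_selected_bands(cube, labels, band_idx, train_ratio=0.1, seed=0):
    # cube: (W, H, B); labels: (W, H) with 0 = unlabeled background
    mask = labels.ravel() > 0
    X = cube[:, :, band_idx].reshape(-1, len(band_idx))[mask]
    y = labels.ravel()[mask]
    Xtr, Xte, ytr, yte = train_test_split(
        X, y, train_size=train_ratio, stratify=y, random_state=seed)
    knn = KNeighborsClassifier(n_neighbors=3).fit(Xtr, ytr)
    svm = OneVsRestClassifier(SVC(kernel="rbf")).fit(Xtr, ytr)
    return knn.score(Xte, yte), svm.score(Xte, yte)   # overall accuracies
```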

Fig. 7. Performance versus the selected bands of different methods on the Indian Pines AVIRIS data. (a) KNN, (b) SVM.

A. Indian Pines AVIRIS Data

The first real-world data set was acquired by the AVIRIS sensor over northwest Indiana's Indian Pines in 1992 [48]. Fig. 6 displays the 25th band of the original data, which is of size 145 × 145. As mentioned in Section I, the public data set is composed of 220 spectral channels, and 35 bands (numbered 1–3, 103–112, 148–165, and 217–220) are identified as noise that would typically be removed manually in advance [49]. In this experiment, they are purposefully not removed and act as a natural test for our proposed unsupervised band selection strategy. For all 16 land-cover classes, only 10% of the samples of each class are randomly chosen as training samples, and the remaining ones are used as test samples, in order to evaluate the effectiveness of the proposed methodology on small sample sizes (see Table I). Additionally, it is worthwhile to mention that the training and test data sets are randomly drawn from the data ten times, and both the mean and the standard deviation are computed to assess the statistical significance of the results.

Fig. 7 illustrates the classification results using KNN and SVM, where both mean and standard deviation are included. To investigate the impact of the number of selected bands on the classification accuracy, the number ranges from 3 to 50, as represented by the $x$-axis. The results on all bands and on the manual-band-removal (MBR) data are given for comparison, with the legends "AllBands" and "MBR_AllBands", respectively. Clearly, the removal of noisy bands improves the classification accuracy. Hence, the band selection methods are investigated only on the MBR data, i.e., "MBR_ID" (information divergence) [12], "MBR_MVPCA" (maximum-variance principal component analysis) [34], "MBR_MI" (mutual information) [35], and "MBR_AP" (affinity propagation) [39].


TABLE II COMPARISON BETWEEN SELECTED BANDS USING DIFFERENT BAND SELECTION METHODS (NUMBERS IN BOLD REPRESENT “NOISE” BANDS)

Concerning "MBR_MI", it is worthwhile to point out that a hierarchical clustering algorithm is used to group the bands based on the mutual information criterion [37]. In contrast, in order to illustrate the importance of the "noise" bands, the proposed automatic-noise-reduction (abbreviated "ANR") strategy is applied to the whole hyperspectral imagery. Due to the effectiveness of AP for band selection [33], only AP is used on the ANR data, with the legend "ANR_AP". From the figure, we can see that ANR_AP delivers more stable and accurate results than the other four MBR-based methods. Besides, the proposed methodology is also tested on the MBR data, with the legend "MBR_ANR_AP". Meanwhile, the results without band selection on the ANR data and the MBR_ANR data are also provided, with the legends "ANR_AllBands" and "MBR_ANR_AllBands", respectively. It is encouraging that the results obtained by ANR_AP and ANR_AllBands are both better than those by MBR_ANR_AP and MBR_ANR_AllBands, indicating that the preservation of the "noise" bands benefits classification.

We now examine the effect of the "noise" bands on our method and the alternatives in more detail when the number of selected bands is 15. Table II lists the selected bands of the six methods, i.e., MBR_ID, MBR_MVPCA, MBR_MI, MBR_AP, MBR_ANR_AP, and ANR_AP. Three "noise" bands, i.e., 107, 157, and 164, are chosen as representative bands by ANR_AP. Additionally, the per-class mean and standard deviation and the overall percentage accuracy of the AP-based methods are displayed in Table III. Because the results of the MBR-based methods are comparable, only MBR_AP is given in the table. From the table, we can see that the results obtained by MBR_ANR_AP and ANR_AP are much better than those of MBR_AP. Further, considering the two ANR-based methods, it can be noticed from Table II that most of the bands selected by the two methods are the same, except for the three "noise" bands. Meanwhile, the classification accuracies of various classes and the overall accuracy achieved by ANR_AP exceed the values of MBR_ANR_AP, especially for small training sets. In particular, for class 6 (Grass/Pasture-mowed), the number of training samples is 2, which is the smallest among all classes.

Fig. 8. Classification accuracy versus different ratios of training samples on the Indian Pines AVIRIS data. (a) KNN, (b) SVM.

We notice that the classification accuracies achieved by ANR_AP on this class, 67.50% by KNN and 77.92% by SVM, are much better than those of MBR_ANR_AP, implying that retaining the "noise" bands improves the separability of the selected bands. Hence, in terms of classification, the "noise" bands contain discriminative information as important as that of the other "reserved" bands and should not be removed.

Finally, the classification accuracies are examined as a function of the ratio of training samples, over the interval from 5% to 40% in steps of 5%, for four methods: MBR_ANR_AP and ANR_AP with 15 selected bands, and MBR_ANR_AllBands and ANR_AllBands, as shown in Fig. 8. As expected, increasing the ratio of training samples improves the performance of all four algorithms. From the figure, we can see that the performance of the ANR-based methods is better than that of the MBR_ANR-based ones, verifying the efficiency of the proposed approach and the importance of the "noise" bands.


TABLE III CLASSIFICATION ACCURACY AND STANDARD DEVIATION (IN PERCENT) RESULTS ON THE 15 SELECTED BANDS FROM INDIAN PINES AVIRIS DATA BY MBR_AP, MBR_ANR_AP AND ANR_AP

TABLE IV LAND COVER CLASSES WITH NUMBER OF TRAINING AND TEST SET SAMPLES FOR THE KENNEDY SPACE CENTER DATA

Fig. 9. Kennedy Space Center dataset, band 100.

B. Kennedy Space Center AVIRIS Data

The second real-world data set was acquired by the same AVIRIS sensor over the Kennedy Space Center (KSC), Florida, on March 23, 1996 [50]. The image depicts the scene in Fig. 9 and is of size 614 × 512. Of the original 224 bands, 48 bands are identified as water absorption and low-SNR bands (numbered 1–4, 102–116, 151–172, and 218–224). Note that both the number and the positions of the "noise" bands differ from those in the Indian Pines AVIRIS data set, indicating that they may vary considerably even for data collected by the same sensor, which makes them laborious to identify; it is therefore unrealistic to manually locate all the "noise" bands for each hyperspectral data set. As before, these noisy bands are retained for our automatic noise reduction process. For classification purposes, 13 classes representing the various land cover types that occur in this environment were defined for the site (Table IV). Classes 2 and 7 represent mixed classes, making the discrimination of land cover more difficult.

Fig. 10 displays the classification accuracies of KNN and SVM using the selected bands obtained by MBR_ID, MBR_MVPCA, MBR_MI, MBR_AP, MBR_ANR_AP, and ANR_AP. Meanwhile, the results without the band selection procedure, i.e., AllBands, MBR_AllBands, MBR_ANR_AllBands, and ANR_AllBands, are also given for comparison. As before, the two ANR-based methods, MBR_ANR_AP and ANR_AP, are much better than the other four MBR-based methods. The classification accuracies of ANR_AP and ANR_AllBands also outperform those of MBR_ANR_AP and MBR_ANR_AllBands, owing to the retention of the "noise" bands. As in the first experiment, the fifteen selected bands acquired by MBR_ID, MBR_MVPCA, MBR_MI, MBR_AP, MBR_ANR_AP, and ANR_AP are listed in Table V. Four "noise" bands (3, 109, 162, and 223) are identified by ANR_AP as representative ones.


TABLE V COMPARISON BETWEEN SELECTED BANDS USING DIFFERENT BAND SELECTION METHODS (NUMBERS IN BOLD REPRESENT THE “NOISE” BANDS)

Fig. 10. Performance versus the selected bands of different methods on the KSC AVIRIS data. (a) KNN, (b) SVM.

The statistical results of the accuracies, including the mean and standard deviation for the various classes obtained by the three AP-based methods on these 15 selected bands, are reported in Table VI. From the table, we notice that the accuracies of most classes and the overall accuracy achieved by ANR_AP are better than those delivered by the other two methods.

Finally, the effect of varying the ratio of the training samples on our method and the alternatives is examined. As illustrated in Fig. 11, the classification accuracies improve as the ratio of training samples increases. The accuracies of ANR_AP and ANR_AllBands are both better than those of MBR_ANR_AP and MBR_ANR_AllBands, verifying the importance of the "noise" bands and the effectiveness of our unsupervised band selection strategy for hyperspectral data.

Fig. 11. Classification accuracy versus different ratios of training samples on the KSC AVIRIS data. (a) KNN, (b) SVM.

C. Okavango Delta, Botswana Hyperion Data

To demonstrate the performance of the proposed methodology on data collected by another sensor, the third data set was acquired by the Hyperion instrument on the National

Aeronautics and Space Administration (NASA) Earth Observation 1 (EO-1) satellite over the Okavango Delta, Botswana in 2001 [51], and its 145th band image is shown in Fig. 12.


TABLE VI CLASSIFICATION ACCURACY AND STANDARD DEVIATION (IN PERCENT) RESULTS ON THE 15 SELECTED BANDS FROM KSC AVIRIS DATA BY MBR_AP, MBR_ANR_AP AND ANR_AP

TABLE VIII COMPARISON BETWEEN SELECTED BANDS USING DIFFERENT BAND SELECTION METHODS (THE NUMBERS IN BOLD REPRESENT THE “NOISE” BANDS)

Fig. 12. Okavango Delta dataset, band 145.

TABLE VII LAND COVER CLASSES WITH NUMBER OF TRAINING AND TEST SET SAMPLES FOR THE OKAVANGO DELTA DATA

It is composed of 242 spectral channels with a spectral resolution of 10 nm covering the 400–2500 nm region. After the noisy bands were removed, 145 bands were left: 10–55, 82–97, 102–119, 134–164, and 187–220. As before, all the discarded bands are retained for the proposed methodology. Fourteen classes were chosen to reflect the impact of flooding on vegetation in the study area, and the detailed information on the ground truth observations used in the experiments is listed in Table VII.

However, several classes are similar, such as classes 3–4 and 9–11, increasing the difficulty of classifying the data.

Fig. 13 shows the classification accuracies of KNN and SVM using the selected bands obtained by the four MBR-based methods, MBR_ID, MBR_MVPCA, MBR_MI, and MBR_AP, and the two ANR-based methods, MBR_ANR_AP and ANR_AP. Besides, the results using all bands, including AllBands, MBR_AllBands, MBR_ANR_AllBands, and ANR_AllBands, are also provided. Obviously, the ANR-based methods are much better than the MBR-based methods. Most of the accuracies and the precisions (i.e., the standard deviations) acquired by ANR_AP are better than those by MBR_ANR_AP, and the performance of ANR_AllBands is also higher than that of MBR_ANR_AllBands, implying the importance of the "noise" bands.

Next, when the number of selected bands equals 15, the selected bands and the classification accuracies for each class are summarized in Tables VIII and IX, respectively.


Fig. 13. Classification accuracy versus the selected bands using different methods on the Okavango Delta Hyperion data. (a) KNN, (b) SVM.

Fig. 14. Classification accuracy versus different ratios of training samples on the Okavango Delta Hyperion data. (a) KNN, (b) SVM.

As can be seen from Table VIII, seven "noise" bands (80, 99, 121, 168, 169, 176, and 224) are selected by ANR_AP as representative bands, almost half of the identified ones. In addition, the average accuracies and standard deviations of most classes and the overall performance presented in Table IX show that ANR_AP is better than the other two AP-based methods, validating the value of the "noise" bands.

Likewise, the classification performance of MBR_ANR_AP and ANR_AP, which yield the best results in the above experiments, is examined with different ratios of training samples, as displayed in Fig. 14. The results of MBR_ANR_AllBands and ANR_AllBands are also given for comparison. Clearly, as the ratio increases, the accuracies improve. Again, ANR_AP and ANR_AllBands show higher performance than MBR_ANR_AP and MBR_ANR_AllBands, respectively, justifying the efficiency of the proposed methodology.

Finally, the classification accuracies of the four methods without band selection are summarized in Table X. Clearly, the results of MBR_AllBands are higher than those of AllBands. But when the automatic noise reduction preprocessing is applied,

ANR_AllBands consistently produces better and more stable classification performance than MBR_ANR_AllBands with either classifier and on every hyperspectral data set, indicating the indispensability of the noisy bands.

V. CONCLUSIONS

In this paper, we have presented an unsupervised band selection approach without manual band removal for hyperspectral imagery classification. First, wavelet shrinkage is used to automatically reduce the spatial noise interferences in hyperspectral data, instead of removing the noisy bands manually. Subsequently, a message-passing-based clustering method, affinity propagation, is adopted to accomplish the unsupervised band selection task. A comparison with three popular band selection methods, ID, MVPCA, and MI, is conducted, and KNN and SVM classifiers are utilized to evaluate the performance of the selected bands. The statistical experimental results on three real data sets collected by two different sensors consistently show that our proposed ANR-based band selection strategy exhibits better performance


TABLE IX CLASSIFICATION ACCURACY AND STANDARD DEVIATION (IN PERCENT) RESULTS ON THE 15 SELECTED BANDS FROM OKAVANGO DELTA EO-1 DATA BY MBR_AP, MBR_ANR_AP AND ANR_AP

TABLE X ACCURACY AND EFFICIENCY OF DIFFERENT METHODS (NUMBERS IN BOLD REPRESENT THE BEST CLASSIFICATION PERFORMANCE)

than the MBR-based ones from the pixel classification standpoint, and even exceeds the results of the proposed methodology on the MBR data, indicating the value of the "noise" bands and the efficiency of the proposed automatic dimensionality reduction approach for hyperspectral classification.

REFERENCES

[1] D. Landgrebe, "Hyperspectral image data analysis," IEEE Signal Process. Mag., vol. 19, no. 1, pp. 17–28, 2002.
[2] P. Zhong, P. Zhang, and R. Wang, "Dynamic learning of SMLR for feature selection and classification of hyperspectral data," IEEE Geosci. Remote Sens. Lett., vol. 5, no. 2, pp. 280–284, 2008.
[3] S. Kumar, J. Ghosh, and M. M. Crawford, "Best-bases feature extraction algorithms for classification of hyperspectral data," IEEE Trans. Geosci. Remote Sens., vol. 39, no. 7, pp. 1368–1379, Jul. 2001.
[4] L. O. Jimenez-Rodríguez, E. Arzuaga-Cruz, and M. Velez-Reyes, "Unsupervised linear feature-extraction methods and their effects in the classification of high-dimensional data," IEEE Trans. Geosci. Remote Sens., vol. 45, no. 2, pp. 469–483, Feb. 2007.
[5] M. Jones and R. Sibson, "What is projection pursuit?," J. R. Stat. Soc. A, vol. 150, no. 1, pp. 1–36, 1987.
[6] A. Ifarraguerri and C.-I Chang, "Unsupervised hyperspectral image analysis with projection pursuit," IEEE Trans. Geosci. Remote Sens., vol. 38, no. 6, pp. 2529–2538, 2000.
[7] S. Kaewpijit, J. L. Moigne, and T. El-Ghazawi, "Hyperspectral imagery dimension reduction using principal component analysis on the HIVE," in Science Data Processing Workshop, 2002.
[8] A. Agarwal, T. El-Ghazawi, H. El-Askary, and J. Le-Moigne, "Efficient hierarchical-PCA dimension reduction for hyperspectral imagery," in IEEE Int. Symp. Signal Processing and Information Technology, 2007.

[9] M. Fauvel, J. Chanussot, and J. A. Benediktsson, "Kernel principal component analysis for the classification of hyperspectral remote sensing data over urban areas," EURASIP J. Advances in Signal Process., vol. 2009, pp. 1–14, 2009.
[10] J. Wang and C.-I Chang, "Independent component analysis-based dimensionality reduction with applications in hyperspectral image analysis," IEEE Trans. Geosci. Remote Sens., vol. 44, no. 6, pp. 1586–1600, 2006.
[11] I. Guyon and A. Elisseeff, "An introduction to variable and feature selection," J. Machine Learning Research, vol. 3, pp. 1157–1182, 2003.
[12] C.-I Chang and S. Wang, "Constrained band selection for hyperspectral imagery," IEEE Trans. Geosci. Remote Sens., vol. 44, no. 6, pp. 1575–1585, Jun. 2006.
[13] D. Manolakis and G. A. Shaw, "Detection algorithms for hyperspectral imaging applications," IEEE Signal Process. Mag., vol. 19, no. 1, pp. 29–43, Jan. 2002.
[14] A. Plaza, P. Martinez, J. Plaza, and R. Perez, "Dimensionality reduction and classification of hyperspectral image data using sequences of extended morphological transformations," IEEE Trans. Geosci. Remote Sens., vol. 43, no. 3, pp. 466–479, 2005.
[15] B. Mojaradi, H. Abrishami-Moghaddam, M. J. V. Zoej, and R. P. W. Duin, "Dimensionality reduction of hyperspectral data via spectral feature extraction," IEEE Trans. Geosci. Remote Sens., vol. 47, no. 7, pp. 2091–2105, 2009.
[16] L. Bruzzone and C. Persello, "A novel approach to the selection of spatially invariant features for the classification of hyperspectral images with improved generalization capability," IEEE Trans. Geosci. Remote Sens., vol. 47, no. 9, pp. 3180–3191, Sep. 2009.
[17] N. Renard and S. Bourennane, "Dimensionality reduction based on tensor modeling for classification methods," IEEE Trans. Geosci. Remote Sens., vol. 47, no. 4, pp. 1123–1131, 2009.
[18] M. Pal and G. M. Foody, "Feature selection for classification of hyperspectral data by SVM," IEEE Trans. Geosci. Remote Sens., vol. 48, no. 5, pp. 2297–2307, 2010.


[19] AVIRIS NW Indiana's Indian Pines 1992 Data Set. [Online]. Available: https://engineering.purdue.edu/biehl/MultiSpec/hyperspectral.html
[20] S. Bourennane, C. Fossati, and A. Cailly, "Improvement of classification for hyperspectral images based on tensor modeling," IEEE Geosci. Remote Sens. Lett., vol. 7, no. 4, pp. 801–805, 2010.
[21] P. Toivanen, A. Kaarna, J. Mielikainen, and M. Laukkanen, "Noise reduction methods for hyperspectral images," in Proc. 9th Int. Symp. Remote Sensing, 2003, vol. 4885, pp. 307–313.
[22] D. L. Donoho and I. M. Johnstone, "Ideal spatial adaptation by wavelet shrinkage," Biometrika, vol. 81, pp. 425–455, 1994.
[23] D. L. Donoho, I. M. Johnstone, G. Kerkyacharian, and D. Picard, "Wavelet shrinkage: Asymptopia?," J. R. Stat. Soc. Ser. B, vol. 57, pp. 301–369, 1995.
[24] A. C. Zelinski and V. K. Goyal, "Denoising hyperspectral imagery and recovering junk bands using wavelets and sparse approximation," in Proc. IEEE Int. Conf. Geoscience and Remote Sensing Symp., IGARSS 2006, Jul. 2006, pp. 387–390.
[25] B. Demir, S. Erturk, and M. K. Gullu, "Hyperspectral image classification using denoising of intrinsic mode functions," IEEE Geosci. Remote Sens. Lett., vol. 8, no. 2, pp. 220–224, 2011.
[26] H. Othman and S.-E. Qian, "Noise reduction of hyperspectral imagery using hybrid spatial-spectral derivative-domain wavelet shrinkage," IEEE Trans. Geosci. Remote Sens., vol. 44, no. 2, pp. 397–408, Feb. 2006.
[27] G. Y. Chen, T. D. Bui, and A. Krzyzak, "Denoising of three dimensional data cube using bivariate wavelet shrinking," in Proc. Int. Conf. Image Analysis and Recognition (ICIAR), 2010, pp. 45–51.
[28] L. Sendur and I. W. Selesnick, "Bivariate shrinkage with local variance estimation," IEEE Signal Process. Lett., vol. 9, no. 12, pp. 438–441, 2002.
[29] L. Sendur and I. W. Selesnick, "Bivariate shrinkage functions for wavelet-based denoising exploiting interscale dependency," IEEE Trans. Signal Process., vol. 50, no. 11, pp. 2744–2756, 2002.
[30] G. Y. Chen and S. E. Qian, "Simultaneous dimensionality reduction and denoising of hyperspectral imagery using bivariate wavelet shrinking and principal component analysis," Can. J. Remote Sens., vol. 34, no. 5, pp. 447–454, 2008.
[31] J. Chen, X. Jia, W. Yang, and B. Matsushita, "Generalization of subpixel analysis for hyperspectral data with flexibility in spectral similarity measures," IEEE Trans. Geosci. Remote Sens., vol. 47, no. 7, pp. 2165–2171, 2009.
[32] G. Y. Chen and S. E. Qian, "Denoising of hyperspectral imagery using principal component analysis and wavelet shrinkage," IEEE Trans. Geosci. Remote Sens., vol. 49, no. 3, pp. 973–980, 2011.
[33] Y. Qian, F. Yao, and S. Jia, "Band selection for hyperspectral imagery using affinity propagation," IET Computer Vision, vol. 3, no. 4, pp. 213–222, 2009.
[34] C.-I Chang, Q. Du, T. L. Sun, and M. L. G. Althouse, "A joint band prioritization and band-decorrelation approach to band selection for hyperspectral image classification," IEEE Trans. Geosci. Remote Sens., vol. 37, no. 6, pp. 2631–2641, Nov. 1999.
[35] B. Guo, S. R. Gunn, R. I. Damper, and J. D. B. Nelson, "Band selection for hyperspectral image classification using mutual information," IEEE Geosci. Remote Sens. Lett., vol. 3, no. 4, pp. 522–526, 2006.
[36] P. Mitra, C. A. Murthy, and S. K. Pal, "Unsupervised feature selection using feature similarity," IEEE Trans. Pattern Anal. Mach. Intell., vol. 24, no. 3, pp. 301–312, Mar. 2002.
[37] A. Martínez-Usó, F. Pla, J. M. Sotoca, and P. García-Sevilla, "Clustering-based hyperspectral band selection using information measures," IEEE Trans. Geosci. Remote Sens., vol. 45, no. 12, pp. 4158–4171, Dec. 2007.
[38] S. Jia, Y. Qian, and Z. Ji, "Band selection for hyperspectral imagery using affinity propagation," in Proc. Digital Image Computing: Techniques and Applications (DICTA '08), 2008, pp. 137–141.
[39] B. J. Frey and D. Dueck, "Clustering by passing messages between data points," Science, vol. 315, pp. 972–976, Feb. 2007.
[40] D. Dueck, "Affinity Propagation: Clustering Data by Passing Messages," Ph.D. dissertation, Univ. Toronto, Toronto, ON, Canada, 2009.
[41] T. Cover and P. Hart, "Nearest neighbor pattern classification," IEEE Trans. Inf. Theory, vol. 13, no. 1, pp. 21–27, Jan. 1967.
[42] F. Melgani and L. Bruzzone, "Classification of hyperspectral remote sensing images with support vector machines," IEEE Trans. Geosci. Remote Sens., vol. 42, no. 8, pp. 1778–1790, Aug. 2004.
[43] S. G. Mallat, A Wavelet Tour of Signal Processing. Orlando, FL: Academic Press, Jul. 1999.
[44] D. L. Donoho, "De-noising by soft-thresholding," IEEE Trans. Inf. Theory, vol. 41, no. 3, pp. 613–627, May 1995.


[45] D. L. Donoho and I. M. Johnstone, "Adapting to unknown smoothness via wavelet shrinkage," J. Amer. Statist. Assoc., vol. 90, pp. 1200–1224, 1995.
[46] J.-F. Cardoso, "Dependence, correlation and gaussianity in independent component analysis," J. Machine Learning Research, vol. 4, pp. 1177–1203, 2003.
[47] A. C. Bovik, Handbook of Image and Video Processing, 2nd ed. Orlando, FL: Academic Press, 2005.
[48] D. A. Landgrebe, Signal Theory Methods in Multispectral Remote Sensing. New York: Wiley, 2003.
[49] R. Archibald and G. Fann, "Feature selection and classification of hyperspectral images with support vector machines," IEEE Geosci. Remote Sens. Lett., vol. 4, no. 4, pp. 674–677, Oct. 2007.
[50] J. Ham, Y. Chen, M. M. Crawford, and J. Ghosh, "Investigation of the random forest framework for classification of hyperspectral data," IEEE Trans. Geosci. Remote Sens., vol. 43, no. 3, pp. 492–501, 2005.
[51] A. Neuenschwander, M. M. Crawford, and S. Ringrose, "Results from the EO-1 experiment—A comparative study of Earth Observing-1 Advanced Land Imager (ALI) and LANDSAT ETM+ data for land cover mapping in the Okavango Delta, Botswana," Int. J. Remote Sens., vol. 26, no. 19, pp. 4321–4337, 2005.

Sen Jia received the B.E. and Ph.D. degrees from the College of Computer Science, Zhejiang University, China, in 2002 and 2007, respectively. Since 2008, he has been with Shenzhen University, where he is currently an Associate Professor in the College of Computer Science and Software Engineering. His research interests include hyperspectral image processing, signal and image processing, pattern recognition, and machine learning.

Zhen Ji received the B.E. and Ph.D. degrees from Xi'an Jiaotong University, Xi'an, China, in 1994 and 1999, respectively. He is currently a Professor with the Department of Computer Science, College of Computer Science and Software Engineering, Shenzhen University, Shenzhen, China. In 2001, 2003, and 2004, he was an Academic Visitor with the Department of Electrical Engineering and Electronics, University of Liverpool, Liverpool, U.K. Since 2002, he has been the Director of the Texas Instruments DSPs Laboratory, Shenzhen University. His current research interests include digital image processing, computational intelligence, bioinformatics, and digital signal processors.

Yuntao Qian received the B.E. and M.E. degrees in automatic control from Xi'an Jiaotong University, China, in 1989 and 1992, respectively, and the Ph.D. degree in signal processing from Xidian University, China, in 1996. During 1996–1998, he was a postdoctoral fellow at Northwestern Polytechnical University. Since 1998, he has been with Zhejiang University, China, where he is currently a Professor in computer science. He was a visiting professor at Concordia University, Hong Kong Baptist University, Carnegie Mellon University, and the Canberra Research Laboratory of NICTA during 1999–2001, 2006, and 2010. His research interests include machine learning, signal and image processing, and pattern recognition.

Linlin Shen received the Ph.D. degree from the University of Nottingham, U.K., in 2005. Before joining Shenzhen University, China, he was a Research Fellow working on MRI brain image processing at the Medical School of the University of Nottingham. His research interests cover Gabor wavelets, face recognition, pattern recognition, and biometrics.