
Exploring Locally Adaptive Dimensionality Reduction for Hyperspectral Image Classification: A Maximum Margin Metric Learning Aspect

Yanni Dong, Student Member, IEEE, Bo Du, Senior Member, IEEE, Liangpei Zhang, Senior Member, IEEE, and Lefei Zhang, Member, IEEE

Abstract—The high-dimensional data space generated by hyperspectral sensors introduces challenges for the conventional data analysis techniques. Popular dimensionality reduction techniques usually assume a Gaussian distribution, which may not be in accordance with real life. Metric learning methods, which explore the global data structure of the labeled training samples, have proved to be very efficient in hyperspectral fields. However, we can go further by utilizing locally adaptive decision constraints for the labeled training samples per class to obtain an even better performance. In this paper, we present the locally adaptive dimensionality reduction metric learning (LADRml) method for hyperspectral image classification. The aims of the presented method are: 1) to utilize the limited training samples to reduce the dimensionality of the data without a certain distribution hypothesis; and 2) to better handle data with complex distributions by the use of locally adaptive decision constraints, which can assess the similarity between a pair of samples based on the distance changes before and after metric learning. The experimental results obtained with a number of challenging hyperspectral image datasets demonstrate that the proposed LADRml algorithm outperforms the state-of-the-art dimensionality reduction and metric learning methods.

Index Terms—Dimensionality reduction, hyperspectral image classification, locally adaptive decision constraints, metric learning.

I. INTRODUCTION

HYPERSPECTRAL sensors, which can gather hyperspectral imagery with hundreds of spectral bands, have been widely used for discriminating the subtle differences in ground objects. The rich bands contain much helpful information for image classification and have driven the development of advanced image classification techniques [1]–[3]. Hyperspectral image classification, which is aimed at determining a unique label for each pixel to generate a thematic land-cover map, is

Manuscript received May 09, 2016; revised June 23, 2016; accepted June 29, 2016. This work was supported in part by the National Basic Research Program of China (973 Program) under Grant 2012CB719905, the National Natural Science Foundation of China under Grants 41431175, 61471274, 61401317, and 61302111, the Natural Science Foundation of Hubei Province under Grant 2014CFB193, and the Fundamental Research Funds for the Central Universities. (Corresponding Author: Liangpei Zhang.) Y. Dong and L. Zhang are with the State Key Laboratory of Information Engineering in Surveying, Mapping, and Remote Sensing, Wuhan University 430072, China (e-mail: [email protected]; [email protected]). B. Du is with the School of Computer, Wuhan University, Wuhan 430072, China (e-mail: [email protected]). L. Zhang is with the Department of Computing, The Hong Kong Polytechnic University, Kowloon, Hong Kong (e-mail: [email protected]). Color versions of one or more of the figures in this paper are available online at http://ieeexplore.ieee.org. Digital Object Identifier 10.1109/JSTARS.2016.2587747

one of the most common tasks in hyperspectral image analysis [4]–[8]. However, several critical problems still remain. First, the huge number of spectral bands may result in the Hughes phenomenon (a decrease in the classification accuracy when increasing the number of spectral bands, due to the low ratio between the number of training samples and the spectral channels) [9], [10]. That is, the high-dimensional data space generated by the hyperspectral sensors creates a new challenge, which is the significant computational cost caused by the data complexity. Thus, hyperspectral image classification usually follows dimensionality reduction, which is aimed at reducing the dimensionality of the feature space to decrease the computational complexity and discard the redundant features. Dimensionality reduction technology allows for the separation of classes without sacrificing significant information [11]. Second, the training samples are limited, which may result in overfitting of the training data [12], [13]. Researchers have implied that high-dimensional data spaces are mostly empty, which indicates that the useful data exist primarily in a subspace. From this aspect, dimensionality reduction also needs to be considered [14]. In the past decades, a large number of dimensionality reduction methods have been proposed to transform data from a high dimension to a low one. In general, dimensionality reduction techniques can be categorized into unsupervised and supervised approaches. The most representative unsupervised dimensionality reduction methods are principal component analysis (PCA) [15], [16], independent component analysis (ICA) [17], and locality preserving projections (LPP) [18]. PCA maximizes the amount of data variance in the projected linear subspace, while ICA uses higher order statistics. LPP keeps the local geometric structure of the original feature space to reduce the dimension. The typical representative supervised approaches include Fisher’s linear discriminant analysis (LDA) and its variants, e. g., local Fisher’s discriminant analysis (LFDA) [11] and sparse discriminant analysis (SDA) [19]. LFDA preserves the underlying structure of the multimodal non-Gaussian class distribution in the projection, while SDA regularizes the usual LDA loss function by adding an L1 constraint to the weights. Discriminative locality alignment (DLA) also belongs to the supervised approaches, and it involves selecting both intraclass and interclass neighbors for a local patch to enlarge the margin between different classes [20]. As to hyperspectral image classification, many different approaches have been proposed in recent years, e.g., traditional dimensionality reduction [2], [21], sparse representation



Fig. 1. Illustration of exploiting the locally adaptive decision constraints.

[22]–[24], semisupervised learning (SSL) [25]–[27], and transfer learning [28]–[33]. In essence, these methods assess the similarity between spectral signatures. That is, the above methods can be boiled down to new insights into a “metric.” For example: 1) traditional dimensionality reduction seeks a lowdimensional representation from a high-dimensional space by assuming a Gaussian distribution to discriminate the betweenclass and within-class distances in a new feature space; 2) sparse representation, which involves studying the relationship between samples in the sparse feature space, can be compactly represented by a few coefficients in a certain dictionary; and 3) SSL considers both labeled and unlabeled samples to obtain a better dividing hyperplane which can measure the distances between samples. That is to say, the existing methods for solving high-dimensional problems can be summarized as learning a more stable and credible distance metric. In fact, metric learning methods have proved to be a more straightforward and effective way to obtain such a distance metric [34], [35]. A number of metric learning methods have been developed, such as the relevant component analysis (RCA) method [36], which is a simple and efficient algorithm proposed for learning a global linear transformation by exploiting only the equivalence to reduce the irrelevant variability of the data. The neighborhood component analysis (NCA) method defines the probability of selecting the same class instances as the neighbors for each instance. It also encourages instances from the same class to be close, which can maximize the stochastic variance of the leave-one-out k-nearest neighbor (KNN) score on the training samples [37]. The large margin nearest neighbor (LMNN) method [38] aims to find a distance metric such that the instances from different classes are effectively separated by a large margin within the neighborhood, where the margin is defined as the difference between the between-class and within-class distances. Furthermore, the information-theoretic metric learning (ITML) method was proposed to express the weakly supervised metric learning problem as a Bregman optimization problem. ITML can handle a variety of constraints and incorporate a prior on the distance function [39]. By metric learning, we can find a distance metric which can transform the high dimension to a low one to classify the images by effectively maximizing the between-class distance while minimizing the within-class distance. Due to these characteristics, metric learning algorithms

can also be used for dimensionality reduction, and they have been used to solve a range of problems in hyperspectral image analysis, such as feature extraction [40], image segmentation [41], and target detection [42], [43]. Among them, our previous work [maximum margin metric learning (MMML)] [43] is very different from the method presented in this paper (see Section II for a summary of the differences between the two methods). However, the current metric learning methods still have some obstacles that need to be addressed, e.g., the RCA algorithm lacks negative (dissimilarity) constraints, which can be informative, and it cannot capture complex nonlinear relationships between data instances; the NCA algorithm cannot obtain the optimal value of the objective function if the initial point is not selected appropriately, and it has a relatively high computational complexity; and the LMNN method has problems when dealing with high-dimensional data and adjusting parameters [44], [45]. Moreover, the state-of-the-art metric learning-based algorithms aim to ensure that samples from the same class are closer to each other than those from different classes, and make decisions based on comparing their Mahalanobis distance d and a fixed threshold b. In other words, the metric learning methods only lead to an absolute decision rule with a fixed threshold b, and they need relative constraints between the between-class and within-class pairs. To deal with the problem of high-dimensional data and to reduce the computational burden, a mapping function can be adopted from the high-dimensional input space into a low-dimensional embedding. Furthermore, a smoothness constraint can be represented by a regularization term, while a cutting plane algorithm can be used to optimize the algorithm in a constant number of iterations by taking the form of a large number of pairwise constraints with “similar” or “dissimilar” labels. However, the existing metric learning methods usually look for a distance measure and make a decision based on a fixed threshold, which is insufficient and suboptimal. Moreover, a further problem in this situation is that the between-class and withinclass variations are complex, meaning that these methods may not learn effective metrics for data with complex distributions. The processing is also difficult, because of the complex data. Consequently, the classification results are not accurate enough [46]–[49]. Aiming at both the above problems, how can we go further by elaborately constructing a more promising metric to obtain a better classification performance than the state-of-theart dimensionality reduction and metric learning methods? In this paper, the locally adaptive dimensionality reduction metric learning (LADRml) method is proposed for adaptively classifying hyperspectral images to further improve the classification accuracy. Locally adaptive decision constraints are applied to relax the fixed threshold, which allows us to make a decision by considering both the threshold b and the changes between the distances before and after metric learning. We can go further by utilizing the local information of the labeled training samples per class to obtain a better performance. The contributions of this paper can be summarized as follows. 1) For the first time, the proposed algorithm applies locally adaptive decision constraints to enhance the separability between different classes. It can guarantee the distance between different classes to the greatest degree by


Fig. 2. Classification OA of the proposed method with respect to parameter C for the three datasets. (a) AVIRIS Indian Pines dataset. (b) Washington DC Mall dataset. (c) ROSIS Pavia University dataset.

Fig. 3. (a) False-color image of the AVIRIS Indian Pines scene. (b) Ground-truth map containing 16 mutually exclusive classes. (c) Training samples used in the experiment.

TABLE I
NUMBERS OF SAMPLES OF THE AVIRIS INDIAN PINES DATASET USED FOR THE EXPERIMENT

considering the distance between a pair of samples based on a threshold and the changes between the distances before and after metric learning.
2) In addition, the proposed method can effectively and efficiently encode the discriminative information from limited training data by considering the “locality of data distribution,” which considers neighboring constraints and avoids adopting those conflicting constraints.
3) Thus, we combine a global metric and the locally adaptive decision constraints using a joint MMML model, for which the number of parameters is less than in other MMML methods. As a result, the proposed algorithm makes fewer assumptions about the data and has strong generalization ability for dimensionality reduction.

The remainder of this paper is organized as follows. Section II reviews the conventional regularized distance metric learning theory and details the proposed method with a step-by-step formula derivation. The experimental results of three challenging real-world hyperspectral datasets are presented to conduct a comparison between the different algorithms in Section III, followed by the conclusions in Section IV.

II. LOCALLY ADAPTIVE DIMENSIONALITY REDUCTION METRIC LEARNING METHOD

In this section, we first state the general problem and the important concept of metric learning. After this, the formulation and characteristics of the locally adaptive decision constraints are detailed. Finally, the LADRml method is proposed. We consider the following metric learning setting: a set of labeled training samples S = {(x_1, y_1), (x_2, y_2), ..., (x_n, y_n)} is taken as the input, where [x_1, ..., x_n] ∈ R^{L×n}, x_i is the ith input data point, and y_i is its corresponding discrete label. L is the number of features and n denotes the number of samples. We then have the following two pairwise constraint sets: a set of similar constraints Λ and a set of dissimilar constraints Ω [50]–[52]:

\Lambda: \forall (x_i, x_j) \in \Lambda, \; x_i, x_j \in \text{same class}
\Omega: \forall (x_i, x_j) \in \Omega, \; x_i, x_j \in \text{different classes}.    (1)
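To make the two constraint sets concrete, the following is a minimal sketch (illustrative only, not the authors' code; the function name and toy labels are hypothetical) of how Λ and Ω can be enumerated from a label vector:

```python
import numpy as np
from itertools import combinations

def build_pairwise_constraints(y):
    """Enumerate index pairs for the similar set (same label) and the
    dissimilar set (different labels) from a label vector y."""
    similar, dissimilar = [], []
    for i, j in combinations(range(len(y)), 2):
        if y[i] == y[j]:
            similar.append((i, j))      # (x_i, x_j) in Lambda
        else:
            dissimilar.append((i, j))   # (x_i, x_j) in Omega
    return similar, dissimilar

# toy example: 5 samples, 2 classes
y = np.array([0, 0, 1, 1, 0])
Lam, Omega = build_pairwise_constraints(y)
print(len(Lam), len(Omega))  # 4 similar pairs, 6 dissimilar pairs
```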


TABLE II
INDIVIDUAL CLASS ACCURACIES, OAS, KAPPA STATISTIC VALUES, AND RUNNING TIMES (IN S) OF THE AVIRIS INDIAN PINES DATASET OBTAINED BY THE DIFFERENT CLASSIFICATION METHODS (dim = 60)

| Class | Original | MNF-SVM | SDA | DLA | LDA | LFDA | RCA | NCA | LMNN | ITML | MMML | LADRml |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Alfalfa | 0 | 71.40 | 11.90 | 35.71 | 16.67 | 50.00 | 19.05 | 71.43 | 50.00 | 26.19 | 73.81 | 61.90 |
| Corn-no till | 37.73 | 73.56 | 56.84 | 60.89 | 44.17 | 58.32 | 36.16 | 60.34 | 50.62 | 65.40 | 54.90 | 71.93 |
| Corn-min till | 41.06 | 54.89 | 38.55 | 51.81 | 53.41 | 56.49 | 27.84 | 47.79 | 37.75 | 42.84 | 48.73 | 59.17 |
| Corn | 31.42 | 30.37 | 19.63 | 41.12 | 69.63 | 65.89 | 21.96 | 32.24 | 33.64 | 20.56 | 33.18 | 50.93 |
| Grass/pasture | 79.08 | 79.54 | 79.31 | 88.51 | 91.95 | 89.43 | 64.37 | 82.07 | 82.99 | 72.64 | 85.29 | 83.45 |
| Grass/tree | 95.24 | 91.93 | 94.06 | 93.61 | 87.37 | 94.22 | 73.97 | 94.06 | 93.30 | 86.91 | 93.30 | 90.72 |
| Grass/pasture-mowed | 3.70 | 42.31 | 34.62 | 76.92 | 57.69 | 53.85 | 11.54 | 80.77 | 76.92 | 34.62 | 84.62 | 80.77 |
| Hay-windrowed | 99.34 | 88.86 | 95.13 | 93.74 | 99.30 | 99.77 | 87.47 | 93.50 | 95.13 | 90.26 | 94.20 | 94.66 |
| Oats | 0 | 0 | 11.11 | 16.67 | 55.60 | 27.78 | 0 | 5.56 | 11.11 | 0 | 11.11 | 38.89 |
| Soybeans-no till | 61.04 | 60.91 | 54.29 | 73.03 | 69.14 | 60.69 | 35.89 | 66.74 | 60.91 | 50.17 | 65.83 | 68.23 |
| Soybeans-min till | 76.60 | 60.41 | 74.03 | 73.21 | 27.15 | 65.52 | 54.07 | 68.82 | 66.74 | 62.53 | 70.36 | 78.51 |
| Soybeans-clean till | 25.71 | 42.13 | 40.26 | 54.68 | 72.85 | 67.98 | 23.41 | 36.89 | 39.14 | 20.79 | 45.51 | 55.06 |
| Wheat | 90.26 | 68.11 | 91.89 | 97.30 | 98.92 | 97.30 | 65.95 | 95.14 | 92.43 | 91.35 | 92.97 | 93.51 |
| Woods | 91.76 | 88.76 | 94.29 | 92.63 | 65.32 | 88.85 | 80.16 | 92.36 | 89.46 | 90.78 | 89.99 | 95.82 |
| Bldg-grass-tree-drives | 10.35 | 45.69 | 27.87 | 44.04 | 79.31 | 61.49 | 31.32 | 28.16 | 43.10 | 26.44 | 43.10 | 44.25 |
| Stone-steel towers | 74.16 | 82.14 | 61.90 | 90.48 | 88.10 | 75.00 | 72.62 | 84.52 | 92.86 | 77.38 | 89.29 | 89.29 |
| OA | 64.93 ± 0.83 | 67.45 ± 0.85 | 66.83 ± 0.65 | 72.74 ± 0.40 | 58.63 ± 0.56 | 71.52 ± 0.86 | 51.15 ± 0.46 | 68.58 ± 0.84 | 65.23 ± 0.81 | 63.30 ± 0.45 | 69.14 ± 0.80 | 75.80 ± 0.55 |
| Kappa | 0.596 ± 0.0092 | 0.630 ± 0.0097 | 0.618 ± 0.0076 | 0.690 ± 0.0044 | 0.545 ± 0.0061 | 0.677 ± 0.0094 | 0.441 ± 0.0055 | 0.641 ± 0.0095 | 0.609 ± 0.0093 | 0.580 ± 0.0054 | 0.648 ± 0.0091 | 0.723 ± 0.0063 |
| Time | – | 1.32 | 10.76 | 0.44 | 0.48 | 3.16 | 5.94 | 27105.56 | 394.16 | 14574.06 | 756.78 | 50.40 |
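The tables report the overall accuracy (OA) and the kappa statistic. For reference, a minimal sketch of the standard way these two scores are computed from a confusion matrix (an illustrative helper, not taken from the paper):

```python
import numpy as np

def oa_and_kappa(confusion):
    """Overall accuracy and Cohen's kappa from a square confusion matrix
    whose entry (i, j) counts reference class i predicted as class j."""
    confusion = np.asarray(confusion, dtype=float)
    total = confusion.sum()
    oa = np.trace(confusion) / total
    expected = (confusion.sum(axis=0) @ confusion.sum(axis=1)) / total**2
    kappa = (oa - expected) / (1.0 - expected)
    return oa, kappa

print(oa_and_kappa([[50, 10], [5, 35]]))  # e.g., OA = 0.85
```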

Fig. 4. Classification maps obtained from the AVIRIS Indian Pines dataset, along with the training set selected by each method for: (a) Original, (b) MNF-SVM, (c) SDA, (d) DLA, (e) LDA, (f) LFDA, (g) RCA, (h) NCA, (i) LMNN, (j) ITML, (k) MMML, and (l) LADRml.


Fig. 5. Classification OA with regard to reduced dimensionality in the AVIRIS Indian Pines dataset.

The goal of metric learning is to learn a positive semidefinite (PSD) matrix M, which specifies the Mahalanobis distance d_M(x_i, x_j), such that the distance between x_i and x_j can be computed as

d_M(x_i, x_j) = \sqrt{(x_i - x_j)^T M (x_i - x_j)} = \sqrt{\langle M, (x_i - x_j)(x_i - x_j)^T \rangle_F}    (2)

where \langle \cdot, \cdot \rangle_F is the Frobenius inner product and M is an L × L square matrix. To ensure that d_M(x_i, x_j) is a metric, the learned matrix M must be symmetric and PSD, so that d_M(x_i, x_j) satisfies the symmetry, nonnegativity, and triangle inequality. However, the data may lie in a very high-dimensional space, which leads to a significant computational burden when solving for M. To solve this problem of the matrix M ∈ R^{L×L}, we can find a nonsquare matrix W ∈ R^{L×D} (D ≪ L) which defines a mapping function from the high-dimensional input space into a low-dimensional embedding and presents an alternative way to jointly perform dimensionality reduction and metric learning. Accordingly, with M = WW^T, we can rewrite d_M(x_i, x_j) [53]–[56] to undertake the problem of dimensionality reduction as

d_M(x_i, x_j) = \sqrt{(x_i - x_j)^T W W^T (x_i - x_j)} = \sqrt{(W^T x_i - W^T x_j)^T (W^T x_i - W^T x_j)} = \| W^T x_i - W^T x_j \|.    (3)
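As a quick numerical check of (2) and (3), the following sketch (illustrative; the random W is only a stand-in for a learned projection) confirms that the Mahalanobis distance under M = WW^T equals the Euclidean distance after projection by W^T:

```python
import numpy as np

rng = np.random.default_rng(0)
L, D = 200, 20                      # input bands, reduced dimension (D << L)
W = rng.standard_normal((L, D))     # nonsquare projection matrix
M = W @ W.T                         # induced PSD metric matrix

xi, xj = rng.standard_normal(L), rng.standard_normal(L)
diff = xi - xj

d_mahalanobis = np.sqrt(diff @ M @ diff)            # Eq. (2) with M = W W^T
d_projected = np.linalg.norm(W.T @ xi - W.T @ xj)   # Eq. (3)
print(np.isclose(d_mahalanobis, d_projected))       # True
```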

The ultimate objective is to find an appropriate metric matrix W under the supervision of Λ and Ω, and the corresponding distance threshold b, such that the distance for (x_i, x_j) ∈ Λ is smaller than the decision threshold b, and the distance for (x_i, x_j) ∈ Ω is greater than the decision threshold b, so that the data points can be accurately classified. That is, the metric learning methods only lead to an absolute decision rule with a fixed threshold b. We can obtain the decision function as

f_{\text{fixed}}(d_W) = d_W(x_i, x_j) - b.    (4)

With (1), we can rewrite the above formulation as

d_W(x_i, x_j) \le b, \quad (x_i, x_j) \in \Lambda
d_W(x_i, x_j) \ge b, \quad (x_i, x_j) \in \Omega.    (5)

A. Locally Adaptive Decision Constraints

After the metric learning, we still need to make a decision for the sake of the classification. However, for data with complex distributions, the fixed threshold will be suboptimal even if the associated metric is correct. Thus, we adjust the decision constraints locally to relax the fixed threshold. By introducing the locally adaptive constraints, the local structures of the data can be used, which is the key to achieving a good classification performance. We use a locally adaptive decision function f(d_{ij}) to relax the fixed threshold, where d_{ij} is the distance between a pair (x_i, x_j) and is the reference used to guide the changes. By using the locally adaptive constraints, we can guarantee that the greater the distance between similar pairs, the more f(d_{ij}) should shrink, while the smaller the distance between dissimilar pairs, the more f(d_{ij}) should expand. As a result, it allows us to make a decision by considering both the threshold b and the changes between the distances before and after metric learning. Based on this principle, we can form the locally adaptive decision constraints to compute the adaptive upper/lower bounds for (x_i, x_j) as

f_\Lambda(d_{ij}) = d_{ij} - d_{ij}^{1/N_\Lambda} / d_c, \quad (x_i, x_j) \in \Lambda
f_\Omega(d_{ij}) = d_{ij} + d_c / d_{ij}^{1/N_\Omega}, \quad (x_i, x_j) \in \Omega    (6)

where the constant d_c ≥ 1, and N_Λ ≥ 1 and N_Ω ≥ 1 are the scale factors that separately control the level of shrinkage and expansion. In this paper, we set d_c = d_max (where d_max is the maximum distance between all the pairs). N_Λ can be used to ensure that f_Λ(d_{ij}) shrinks as fast as possible, which means that the smaller the value of N_Λ, the faster f_Λ(d_{ij}) will shrink. Meanwhile, N_Ω guarantees the rapid expansion of f_Ω(d_{ij}), which means that the larger the value of N_Ω, the faster f_Ω(d_{ij}) will expand. Clearly, we want to maximize the shrinkage and expansion of f(d_{ij}). Considering that N_Λ < 1 cannot guarantee that the constraints are positive, we define N_Λ = 1 to ensure that the locally adaptive constraints can shrink rapidly. Meanwhile, we define N_Ω = 1 / \log(d_c / (d_c - 2)) to ensure faster expansion, which allows us to distinguish the similar and dissimilar pairs by comparing them with d_c. Finally, we obtain the locally adaptive decision constraints as follows:

f_\Lambda(d_{ij}) = d_{ij} - d_{ij} / d_{\max}, \quad (x_i, x_j) \in \Lambda
f_\Omega(d_{ij}) = d_{ij} + d_{\max} / \sqrt[N_\Omega]{d_{ij}}, \quad (x_i, x_j) \in \Omega.    (7)

An illustration of the proposed locally adaptive decision constraints is shown in Fig. 1, which shows two pairs of samples from the hyperspectral image dataset. After the conventional metric learning, the Mahalanobis distance of a similar pair (the red points) is 80, which is higher than the threshold b = 60, while the distance of a dissimilar pair (the green point and red point) is 40, which is lower than the threshold. With this decision paradigm, the two pairs may be misclassified. However, we can classify the pairs according to the locally adaptive decision constraints, with which the distance of the similar pair shrinks from 80 to 65, while the distance of the dissimilar pair expands from 40 to 55. Thus, it allows us to make a correct decision by considering both the threshold b and the changes between the distances before and after metric learning.
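A minimal sketch of the adaptive bounds in (7), assuming N_Λ = 1, d_c = d_max, and N_Ω = 1/log(d_c/(d_c − 2)) as defined above (function and variable names are illustrative, not the authors' implementation):

```python
import numpy as np

def adaptive_bounds(d_ij, d_max):
    """Shrunken upper bound for similar pairs and expanded lower bound for
    dissimilar pairs, following Eq. (7); d_ij is the pre-learning distance."""
    n_omega = 1.0 / np.log(d_max / (d_max - 2.0))     # requires d_max > 2
    f_sim = d_ij - d_ij / d_max                       # bound for (x_i, x_j) in Lambda
    f_dis = d_ij + d_max / (d_ij ** (1.0 / n_omega))  # bound for (x_i, x_j) in Omega
    return f_sim, f_dis

# toy distances: a similar pair at 80 and a dissimilar pair at 40, with d_max = 100
print(adaptive_bounds(np.array([80.0, 40.0]), 100.0))
```

In practice, f_Λ is only applied to pairs in Λ and f_Ω to pairs in Ω; the sketch simply evaluates both bounds for the given pre-learning distances.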


Fig. 6. (a) Original scene of the Washington DC Mall image (channels 30, 90, and 150 for RGB), covering an area of 1280 × 307 pixels. (b) References of the different classes. (c) Test area selected from the original data with a complex distribution.

TABLE III
NUMBERS OF SAMPLES OF THE WASHINGTON DC MALL DATASET USED FOR THE EXPERIMENT

B. Combining the Distance Metric and the Locally Adaptive Decision Constraints

To acquire the metric matrix M (or W), the metric learning framework with a fixed threshold is as follows:

\min_{M, \xi} \; \Psi(M) + C \cdot L(\xi)
\text{s.t.} \; d_M(x_i, x_j) \le b, \; (x_i, x_j) \in \Lambda
\quad\quad d_M(x_i, x_j) \ge b, \; (x_i, x_j) \in \Omega
\quad\quad d_M(x_i, x_j) = d_M(x_j, x_i), \; \xi \ge 0, \; M \succeq 0    (8)

where Ψ is the regularizer on M, and L is the loss term of ξ, which is used as the slack variable on the pairwise inequality constraints. With the help of the locally adaptive decision constraints, the proposed algorithm can adaptively determine the pairwise constraints, and can then effectively distinguish between the similar pairs and dissimilar pairs. Based on this rule, we can reformulate the metric learning framework (8) as

\min_{M, \xi} \; \Psi(M) + C \cdot L(\xi)
\text{s.t.} \; d_M(x_i, x_j) \le f_\Lambda(d_{ij}) + \xi_{ij}, \; (x_i, x_j) \in \Lambda
\quad\quad d_M(x_i, x_j) \ge f_\Omega(d_{ij}) - \xi_{ij}, \; (x_i, x_j) \in \Omega
\quad\quad d_M(x_i, x_j) = d_M(x_j, x_i), \; \xi_{ij} \ge 0, \; M \succeq 0.    (9)

For (x_i, x_j) ∈ Λ, we set y_{ij} = 1 and f(d_{ij}, y_{ij}) = f_Λ(d_{ij}); for (x_i, x_j) ∈ Ω, we set y_{ij} = −1 and f(d_{ij}, y_{ij}) = −f_Ω(d_{ij}). We can then simplify (9) as follows:

\min_{M, \xi} \; \Psi(M) + C \cdot L(\xi)
\text{s.t.} \; y_{ij} \, d_M(x_i, x_j) \le f(d_{ij}, y_{ij}) + \xi_{ij}, \; \xi_{ij} \ge 0, \; M \succeq 0.    (10)
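To illustrate how the constraints in (10) translate into per-pair slacks, a small sketch (hypothetical helper names; not the authors' implementation) that measures how much each pair violates its adaptive bound under a given M:

```python
import numpy as np

def pair_slacks(X, pairs, y_pair, M, bounds):
    """Slack xi_ij = max(0, y_ij * d_M(x_i, x_j) - f(d_ij, y_ij)) needed to
    satisfy the constraints in Eq. (10); y_pair is +1 for similar pairs,
    -1 for dissimilar pairs, and bounds[k] = f(d_ij, y_ij) for pair k."""
    slacks = []
    for (i, j), y_ij, f in zip(pairs, y_pair, bounds):
        diff = X[i] - X[j]
        d_M = np.sqrt(diff @ M @ diff)
        slacks.append(max(0.0, y_ij * d_M - f))
    return np.array(slacks)

# toy usage with the identity metric
rng = np.random.default_rng(1)
X = rng.standard_normal((4, 10))
pairs = [(0, 1), (2, 3)]
y_pair = np.array([1, -1])          # first pair similar, second dissimilar
bounds = np.array([1.5, -2.0])      # f_Lambda(d_01) and -f_Omega(d_23)
print(pair_slacks(X, pairs, y_pair, np.eye(10), bounds))
```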

Some studies have shown the effectiveness of the squared F-norm regularizer and the good generalization performance of the hinge loss [57], [58]. Thus, we adopt the squared F-norm regularizer and hinge loss in the proposed framework. Here, we generalize the above metric learning formulation to be a structured problem, which leads to a structured output instead of a simple label output. That is, a semidefinite programming problem needs to be solved because of the PSD constraint on the metric matrix M. To further solve the problem efficiently, we define a set of points U = {u_{ij}} with u_{ij} = (1; -(x_i - x_j)(x_i - x_j)^T). By reforming (10) into an


TABLE IV
INDIVIDUAL CLASS ACCURACIES, OAS, KAPPA STATISTIC VALUES, AND RUNNING TIMES (IN S) OF THE WASHINGTON DC MALL DATASET OBTAINED BY THE DIFFERENT CLASSIFICATION METHODS (dim = 60)

| Class | Original | MNF-SVM | SDA | DLA | LDA | LFDA | RCA | NCA | LMNN | ITML | MMML | LADRml |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Road | 97.24 | 97.63 | 97.25 | 96.40 | 98.20 | 98.29 | 97.16 | 97.54 | 97.25 | 97.63 | 97.91 | 100.00 |
| Grass | 99.29 | 99.22 | 99.74 | 99.74 | 99.74 | 100.00 | 99.48 | 99.74 | 100.0 | 99.74 | 99.48 | 100.00 |
| Water | 100.00 | 100.00 | 99.65 | 100.00 | 98.23 | 98.58 | 98.40 | 98.40 | 98.23 | 100.00 | 99.11 | 100.00 |
| Trails | 0 | 94.54 | 90.16 | 93.44 | 72.68 | 98.36 | 93.44 | 92.35 | 78.14 | 92.90 | 93.44 | 98.91 |
| Trees | 98.75 | 99.37 | 98.10 | 99.37 | 99.58 | 99.79 | 99.26 | 99.37 | 99.37 | 99.58 | 99.05 | 99.79 |
| Shadow | 96.09 | 92.47 | 97.13 | 97.13 | 97.13 | 94.62 | 94.98 | 97.49 | 95.70 | 96.42 | 97.85 | 100.00 |
| Roof | 98.10 | 99.12 | 98.61 | 98.17 | 95.95 | 99.37 | 98.74 | 98.36 | 97.41 | 98.93 | 98.67 | 99.37 |
| OA | 94.97 ± 0.15 | 98.47 ± 0.34 | 97.97 ± 0.21 | 98.23 ± 0.23 | 97.08 ± 0.19 | 98.66 ± 0.10 | 98.13 ± 0.24 | 98.32 ± 0.21 | 97.43 ± 0.49 | 98.53 ± 0.22 | 98.53 ± 0.24 | 99.74 ± 0.25 |
| Kappa | 0.937 ± 0.0026 | 0.981 ± 0.0045 | 0.975 ± 0.0026 | 0.978 ± 0.0029 | 0.964 ± 0.0024 | 0.987 ± 0.0013 | 0.977 ± 0.0030 | 0.979 ± 0.0026 | 0.968 ± 0.0060 | 0.982 ± 0.0028 | 0.981 ± 0.0030 | 0.997 ± 0.0021 |
| Time | – | 18.12 | 42.57 | 0.35 | 0.16 | 7.73 | 2.57 | 3037.77 | 138.78 | 15467.33 | 2506.35 | 32.13 |

unbiased form, U forms the basis of the new input dataset points. Furthermore, we define ω = [f(d_{ij}, y_{ij}); M], and the formulation can then be rewritten as

\min_{M, \xi} \; \frac{1}{2} \| \omega \|_F^2 + C \xi
\text{s.t.} \; \omega^T \frac{1}{n} \sum_{ij} c_{ij} y_{ij} u_{ij} \ge \frac{1}{n} \sum_{ik} c_{ik} - \xi, \quad c \in \{0, 1\}^n, \; \xi \ge 0, \; M \succeq 0.    (11)

With the above formulation, we obtain a classifier which can be learned in an iterative manner in linear time, without exploiting more complex outputs. By using the metric learning framework, we can use both the global metric matrix M and the locally adaptive decision constraints to better distinguish between similar pairs and dissimilar pairs. However, the structured algorithm with the PSD constraint demands exponential space and time, so it is a challenging problem to train the structured formulation on real large-scale data. Thus, we use the cutting plane algorithm to improve the objective function iteratively by using linear inequalities. We start with an empty set of constraints, and we iteratively construct the locally adaptive decision constraints to optimize the solution. In each iteration, we calculate the unsatisfied locally adaptive decision constraints and combine them into a new compound constraint, and we then compute the optimum over the current metric matrix, subject to the current compound constraints. A primal subgradient method is used to solve the optimization problem, which is reformulated into the following primal form:

\omega^{*} = \arg\min_{\omega} g(\omega) = \frac{1}{2} \| \omega \|_F^2 + C \xi
\text{s.t.} \; \frac{1}{n} \sum_{ij} c_{ij} y_{ij} (\omega^T u_{ij}) \ge \frac{1}{n} \sum_{ik} c_{ik} - \xi, \quad M \succeq 0, \; \omega = [f(d_{ij}, y_{ij}); M].    (12)

We can then obtain the gradient of g(ω) with the correlative Lagrangian multipliers via

\frac{\partial g}{\partial \omega} = \omega - C \cdot \frac{1}{n} \sum_{ij} c_{ij} y_{ij} u_{ij}.    (13)
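A toy sketch of one subgradient update following (12) and (13), with ω treated as a flat vector and the compound constraint built from the currently selected pairs (the parameterization, names, and step size are illustrative assumptions, not the authors' code):

```python
import numpy as np

def subgradient_step(omega, U, c, y, C, step):
    """One update of omega for the primal in Eq. (12):
    g(omega) = 0.5*||omega||^2 + C*xi, with the compound (cutting-plane)
    constraint built from the selected pairs c. U[k] is the vectorized u_ij
    for pair k and y[k] its +/-1 label."""
    n = len(U)
    margin = (1.0 / n) * sum(c[k] * y[k] * (omega @ U[k]) for k in range(n))
    target = (1.0 / n) * c.sum()
    if margin < target:     # constraint violated -> hinge term is active, Eq. (13)
        grad = omega - C * (1.0 / n) * sum(c[k] * y[k] * U[k] for k in range(n))
    else:
        grad = omega        # only the regularizer contributes
    return omega - step * grad

# toy usage with random vectorized constraints
rng = np.random.default_rng(2)
U = rng.standard_normal((6, 15))
omega = np.zeros(15)
omega = subgradient_step(omega, U, c=np.ones(6), y=rng.choice([-1, 1], 6), C=1.0, step=0.1)
```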

After solving the gradient step, we obtain a new ω and metric matrix M, which are defined as ω = [f(d_{ij}, y_{ij}); M]. We then adopt spectral decomposition to eigen-decompose the metric matrix M, projecting it into the PSD cone to guarantee that it is a PSD matrix. We start by setting the negative eigenvalues λ to zero, and we can then rebuild M from the eigen-decomposition with the corresponding eigenvectors ν as

M = \sum_{i=1}^{m} \lambda_i \nu_i \nu_i^T.    (14)

We vectorize the new metric matrix M and place it in ω, and optimize until ω converges. We can then construct a linear projection matrix W ∈ R^{L×D} for the dimensionality reduction. For each test pixel vector x_i ∈ R^L, we can calculate the reduced feature representation as

\tilde{x}_i = W^T x_i, \quad \text{with} \; M = W W^T    (15)

which means that the original data x can be transformed into the low-dimensional Mahalanobis space. By using the new metric feature space, we can achieve the classification.
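To illustrate the PSD projection in (14) and the reduction step in (15), a short sketch (illustrative; it assumes a learned symmetric M and uses hypothetical helper names) that clips negative eigenvalues and projects pixels with W^T:

```python
import numpy as np

def project_psd(M):
    """Clip negative eigenvalues to zero and rebuild M = sum_i lambda_i v_i v_i^T."""
    M = 0.5 * (M + M.T)                       # symmetrize before eigendecomposition
    lam, V = np.linalg.eigh(M)
    lam = np.clip(lam, 0.0, None)
    return (V * lam) @ V.T

def reduce_features(M, X, dim):
    """Build W from the leading eigenvectors of the PSD matrix M (so that
    M ~ W W^T) and return the projected features W^T x for each row of X."""
    lam, V = np.linalg.eigh(M)
    order = np.argsort(lam)[::-1][:dim]
    W = V[:, order] * np.sqrt(np.clip(lam[order], 0.0, None))
    return X @ W                              # each row is the reduced representation

# toy usage: project a random symmetric matrix and reduce 5 pixels to 3 features
rng = np.random.default_rng(3)
A = rng.standard_normal((8, 8))
M = project_psd(A + A.T)
X_reduced = reduce_features(M, rng.standard_normal((5, 8)), dim=3)
print(X_reduced.shape)  # (5, 3)
```

A 1-NN classifier, as used in the experiments, can then be applied directly to the reduced features.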

It is important to note that the proposed method is essentially different from our previous work (MMML) [42]. 1) First, the LADRml algorithm is proposed on the basis of data and classes with complex distributions in the hyperspectral image classification and dimensionality reduction problem, while the method proposed in [42] is aimed at hyperspectral target detection, considering that the target sample number is very low or the targets are difficult to detect compared with the huge background. 2) The LADRml algorithm adopts locally adaptive decision constraints to make a decision by considering both the decision threshold and the changes between the distances before and after metric learning. As a result, the local structures of the data can be used, which can solve the problem of data with complex distributions. However, the method proposed in [42] looks for a distance measure and makes a decision based on a fixed threshold, which is insufficient and suboptimal. When the between-class and within-class variations are complex, the method proposed in [42] cannot learn an effective metric for data with complex distributions. 3) The LADRml algorithm is a


Fig. 7. Classification maps for all the methods with the Washington DC Mall dataset. (a) Original, (b) MNF-SVM, (c) SDA, (d) DLA, (e) LDA, (f) LFDA, (g) RCA, (h) NCA, (i) LMNN, (j) ITML, (k) MMML, and (l) LADRml.

local distance metric learning method that works by considering the “locality of data distribution,” which considers neighboring constraints and avoids adopting those conflicting constraints to further enhance the separability between different classes. In contrast, MMML is a global metric learning method, and it makes a decision based on a fixed threshold with pairwise constraints. 4) Both methods have strong generalization ability, but they differ with regard to parameters. The method proposed in [42] requires an extra parameter, i.e., the number of pairwise constraints for each class. Although the proposed algorithm has the same goal of learning a low-dimensional embedding of the data to improve the KNN classification, it should be noted that the LADRml algorithm is very different from LMNN in the following aspects. 1) LADRml and LMNN differ in the definition of the objective

function. The proposed method adopts a structured formulation, which contains the locally adaptive decision constraints, instead of a simple label output. 2) From the constraint aspect, the proposed method learns a metric with pairwise constraints, while LMNN learns a metric with triplet constraints, which means that for each triplet (x_i, x_j, x_k) (where the class label of x_i is the same as that of x_j but different from that of x_k), d_M(x_i, x_j) should be smaller than the distance d_M(x_i, x_k). Thus, the proposed method can work in more general cases by using the Euclidean distance of a pair to adaptively set the bounds. Moreover, LMNN adopts fixed bound-based constraints, while the proposed method adopts locally adaptive decision constraints, which makes LADRml more effective when the class variations are complex. 3) In terms of the optimization, the proposed method uses the cutting plane algorithm to


TABLE V NUMBERS OF SAMPLES OF THE ROSIS UNIVERSITY OF PAVIA DATASET USED FOR THE EXPERIMENT

Fig. 8. Classification OA with regard to reduced dimensionality in the Washington DC Mall dataset.

improve the objective function iteratively. 4) Finally, LMNN may have problems dealing with high-dimensional data, while the proposed method can achieve a stable performance.

III. EXPERIMENTS AND DISCUSSION

The proposed LADRml algorithm was evaluated on several popular hyperspectral imagery datasets, and we present the experimental results demonstrating the benefits of the LADRml algorithm with the KNN classifier [59], [60], which is a classical classification method that is widely used in real-world applications. In order to provide a fair comparison, all the metric learning algorithms employed KNN as the classifier in the experiments. We compared the LADRml algorithm with four dimensionality reduction methods (SDA, DLA, LDA, and LFDA) and five representative metric learning methods (LMNN, NCA, RCA, ITML, and MMML), which can also be applied to dimensionality reduction. We used a 1-NN classifier [61]–[63] for all the datasets. In addition, SVM with minimum noise fraction transform (MNF-SVM) [64] was also used as a comparison algorithm. The original dataset without metric learning (denoted as “Original”) was also used and input into the 1-NN classifier. Similarly, for a fair comparison, all the methods used the same ground-truth data for all the experiments. For each dataset, we randomly selected 10% for each class as the training samples, and the rest were used as the test samples from the reference data to validate the performances.

Fig. 9. (a) False-color composite image. (b) Ground-truth map. (c) Reference for the different classes from the ROSIS University of Pavia dataset.

A. Hyperspectral Dataset Descriptions

1) The Indian Pines dataset, covering the Indian Pines region in Northwestern Indiana in 1992, was collected by the National Aeronautics and Space Administration's Airborne Visible/Infrared Imaging Spectrometer (AVIRIS) sensor. This scene comprises 220 spectral bands in the wavelength range from 0.4 to 2.5 μm, with a size of 145 × 145 pixels. The corresponding nominal spectral resolution is 10 nm, and the spatial resolution is approximately 20 m. In the experiment, we used a total of 200 radiance bands after removing some of the spectral bands affected by noise and water absorption. Due to the unbalanced number of available labeled pixels for each class, and the large number of mixed pixels in all the classes, it is a challenging image to classify.
2) The second airborne hyperspectral image is the Hyperspectral Digital Imagery Collection Experiment (HYDICE) Washington DC Mall dataset. This dataset contains 1280 lines, and each line has 307 pixels, including 210 bands within the 0.4- to 2.4-μm wavelength range of the visible and infrared spectra. After discarding the water absorption channels, 191 channels remained. This dataset has significant variations between the spectra of the different classes, and is also a challenging image for classification.


TABLE VI
INDIVIDUAL CLASS ACCURACIES, OAS, KAPPA STATISTIC VALUES, AND RUNNING TIMES (IN S) OF THE ROSIS UNIVERSITY OF PAVIA DATASET OBTAINED BY THE DIFFERENT CLASSIFICATION METHODS (dim = 70)

| Class | Original | MNF-SVM | SDA | DLA | LDA | LFDA | RCA | NCA | LMNN | ITML | MMML | LADRml |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Asphalt | 84.00 | 92.88 | 84.10 | 85.77 | 47.52 | 75.89 | 57.07 | 84.57 | 83.80 | 78.38 | 90.18 | 93.23 |
| Meadows | 97.36 | 95.86 | 90.98 | 94.54 | 87.39 | 95.95 | 78.03 | 97.96 | 90.49 | 89.04 | 94.91 | 97.78 |
| Gravel | 42.95 | 70.21 | 58.89 | 64.97 | 53.33 | 67.41 | 3.49 | 74.20 | 58.31 | 36.46 | 74.23 | 77.62 |
| Trees | 75.56 | 82.99 | 76.40 | 83.61 | 76.32 | 88.65 | 31.98 | 72.46 | 75.74 | 76.87 | 78.57 | 91.19 |
| Metal sheets | 98.86 | 99.67 | 99.09 | 99.67 | 99.75 | 99.67 | 7.35 | 99.24 | 99.09 | 94.96 | 98.84 | 99.26 |
| Bare soils | 36.78 | 70.91 | 56.26 | 62.76 | 43.23 | 69.87 | 9.56 | 54.88 | 54.14 | 52.99 | 47.14 | 80.45 |
| Bitumen | 73.93 | 40.02 | 74.19 | 84.13 | 75.44 | 95.82 | 2.51 | 87.04 | 73.93 | 42.27 | 81.70 | 85.38 |
| Bricks | 86.87 | 79.93 | 71.88 | 82.38 | 36.24 | 78.03 | 4.50 | 84.40 | 71.48 | 63.10 | 83.52 | 89.23 |
| Shadows | 99.68 | 99.88 | 99.65 | 99.65 | 99.53 | 100.00 | 36.93 | 99.14 | 99.30 | 93.79 | 99.88 | 99.88 |
| OA | 82.40 ± 0.25 | 87.39 ± 0.56 | 81.49 ± 0.18 | 86.24 ± 0.19 | 69.44 ± 0.74 | 86.51 ± 0.44 | 47.97 ± 0.24 | 86.39 ± 0.17 | 80.86 ± 0.16 | 76.30 ± 0.25 | 85.22 ± 0.23 | 92.55 ± 0.26 |
| Kappa | 0.760 ± 0.0033 | 0.830 ± 0.0079 | 0.752 ± 0.0023 | 0.816 ± 0.0025 | 0.596 ± 0.0089 | 0.820 ± 0.0058 | 0.242 ± 0.0017 | 0.815 ± 0.0024 | 0.744 ± 0.0022 | 0.717 ± 0.0033 | 0.799 ± 0.0030 | 0.901 ± 0.0034 |
| Time | – | 7.74 | 167.78 | 3.22 | 0.15 | 5.12 | 8.23 | 31854.75 | 1799.83 | 2491.79 | 357.82 | 73.31 |

3) The third hyperspectral dataset was acquired by the airborne Reflective Optics System Imaging Spectrometer (ROSIS) sensor over the University of Pavia, Northern Italy. After removing 12 bands due to noise and water absorption, it comprises 103 spectral bands in the range from 0.41 to 0.82 μm. The image size in pixels is 610 × 340, with a high spatial resolution of 1.3 m per pixel, comprising complex urban, soil, vegetation, and other areas.

The proposed algorithm has two parameters: the convergence value and the tradeoff parameter C. Based on experience, the convergence value was set as 0.001. The overall accuracy (OA) values of LADRml on the three datasets with respect to parameter C are presented in Fig. 2. From the figures, it can be observed that the results are relatively stable with regard to the values of C. This inspired us to set C = 1 in all the experiments. The classification accuracies were averaged over 10 runs for each method to reduce the possible bias induced by the random sampling.

B. Experiment With the AVIRIS Indian Pines Dataset

We first conducted an evaluation of the proposed algorithm with regard to the AVIRIS Indian Pines dataset. Fig. 3(a) shows the false-color image of the scene, while Fig. 3(b) shows the ground-truth classes of interest, which consist of 16 mutually exclusive classes with a total of 10 249 samples. Fig. 3(c) shows the 1018 training samples, i.e., 10% of each class in the ground-truth image. The number of samples in each class is shown in Table I. This dataset is available online (http://dynamo.ecn.purdue.edu/biehl/MultiSpec) for testing the accuracies of hyperspectral image classification algorithms.

Table II shows the individual class accuracies, the average OAs, and the average kappa statistic values with their standard deviations, obtained using the 1018 training samples shown in Table I. From Table II, we can see that the LADRml algorithm performs significantly better than the RCA algorithm and the ITML algorithm in class accuracies, OA, and kappa statistic values. The Original, SDA, and LDA methods show a poorer discriminative ability than the LADRml algorithm. That is, the

LADRml algorithm has strong generalization ability with limited training samples and can enhance the separability between different classes. The average running times for obtaining the distance metric in the scene are shown in the bottom line of the quantitative evaluation tables. For the running time comparison, it can be seen that DLA and LDA are the fastest methods, the proposed LADRml algorithm is comparable but slightly slower, and the other classical metric learning methods are the slowest. Overall, it is concluded that the proposed LADRml is more efficient than the other representation-based methods and obtains superior classification accuracy. In order to further show the good performance of the proposed LADRml algorithm, Fig. 4(a)–(i) shows the classification maps along with the training samples, in which it can be observed that the proposed LADRml algorithm achieves the best classification result for most land-cover classes. To demonstrate the benefits of LADRml as a powerful dimensionality reduction tool for hyperspectral classification, we calculated the classification OA with regard to the reduced dimensionality (see Fig. 5). As shown in this figure, the LADRml algorithm obtains the best classification OA when dim (dimension) is larger than 15 and achieves a relatively stable result when dim is increased to a larger value, whereas DLA, MNFSVM, and LFDA perform unsatisfactorily when dim reaches a large value. C. Experiment With the Washington DC Mall Dataset In this experiment, we adopted another challenging classification image which has significant spectral variation between different classes [65]. In order to show the classification results more efficiently, we used a subimage, selected from the original image shown in Fig. 6(a), as the experimental area (as shown in Fig. 6(c)), with a size of 307 × 401 pixels. Fig. 6(b) gives the spectral variation of the different classes, and the colors of the lines represent the corresponding classes. In total, 5966 pixels from seven classes were used in the experiment. Table III details the samples of the Washington DC Mall dataset used in the experiment. As shown in Table III, we can see that the samples are


Fig. 10. Classification maps obtained from the ROSIS Pavia University dataset for: (a) Original, (b) MNF-SVM, (c) SDA, (d) DLA, (e) LDA, (f) LFDA, (g) RCA, (h) NCA, (i) LMNN, (j) ITML, (k) MMML, and (l) LADRml.

Fig. 11. Two-dimensional representation of features for the different algorithms. (a) MNF-SVM, (b) SDA, (c) DLA, (d) LDA, (e) LFDA, (f) RCA, (g) NCA, (h) LMNN, (i) ITML, (j) MMML, and (k) LADRml.


Fig. 12. Classification OA with regard to reduced dimensionality in the ROSIS Pavia University dataset.

limited, which makes the image difficult to classify. Similarly, all seven classes of test samples in the classification results were calculated by the ground truth with the same training and test sets for each round, producing the average results of ten independent rounds of classification reported in Table IV. As can be seen in Table IV, the proposed LADRml algorithm provides results which are comparable to those obtained by the other methods. Furthermore, the LADRml algorithm obtains the best OA and kappa in this dataset, and has more obvious advantages in most individual classes than the other methods. Fig. 7(a)–(l) illustrates the classification maps of Original, MNF-SVM, SDA, DLA, LDA, LFDA, RCA, NCA, LMNN, ITML, MMML, and LADRml, respectively. Here, it can again be seen that the LADRml algorithm achieves the best classification performance for most classes. Furthermore, we enlarged the “water” class to better observe the classification performance, where it can be seen that the LADRml algorithm obtains fewer classification errors. This dataset has a complex distribution, but the proposed LADRml algorithm still achieves the best results. The locally adaptive decision constraints, added in the similarity measurement, allow for a more accurate measure of the relationship between the samples. Similarly, Fig. 8 indicates that the proposed method outperforms the other methods when dim is larger than 5 and achieves stability when the dimension increases to 10. D. Experiment With the ROSIS University of Pavia Dataset In this experiment, we adopted a classification image with a more complicated distribution, which also has larger areas. Fig. 9(a) shows the false-color composite of the image. Fig. 9(b) shows the nine ground-truth classes of interest with a total of 42 776 samples, while 4273 samples were used for the training and 38 503 samples for the testing. This image is also widely used in the hyperspectral image classification community [66], [67]. The samples used in the experiment are listed in Table V. As can be observed from Table V, the samples are huge and the problem of unbalance is more significant. Moreover, from the spectral variation of Fig. 9(c), we can see that most of the

Fig. 13. Classification OAs of the different methods with different numbers of training samples in the three datasets. (a) AVIRIS Indian Pines dataset. (b) Washington DC Mall dataset. (c) ROSIS Pavia University dataset.

spectra are quite similar, there are many different land-cover types, and the experimental image has a complex distribution. This results in the University of Pavia image being a difficult image for classification. Similarly, all nine classes of test samples were calculated by the ground truth in the classification results. The individual class


accuracies, average OAs, and average kappa statistic values with their standard deviations obtained in the algorithm comparison are listed in Table VI. The results reported in the table involved exactly the same number of training and test sets as those listed in Table V, allowing a fair comparison of the methods. As can be seen in Table VI, the proposed LADRml algorithm obtains the best class accuracy for most of the classes, and it also performs the best for the OA and kappa statistic values. Thus, we can conclude that the LADRml algorithm achieves the best classification result. As mentioned above, this image scene comprises data with a very complex distribution. In order to further show the good performance of the proposed algorithm, Fig. 10 shows the obtained classification maps along with the training samples. Here, it can be seen that the proposed LADRml algorithm outperforms the other algorithms, especially for bare soil and shadows, which have fewer training samples. The reason for this improvement may be that the LADRml algorithm can further enhance the separability between the different classes by applying the locally adaptive decision constraints. Fig. 11(a)–(k) shows the sample distribution after application of the different methods, where it can be seen that RCA has the poorest discriminative ability. The reason for RCA’s poor performance may be the assumption that it learns a globally linear transformation which may not be suitable for this data. LMNN and ITML have a weaker ability to separate the different classes. MNF-SVM, LDA, LFDA, NCA, SDA, and MMML can separate some classes well, while DLA results in some overlapping. The proposed LADRml algorithm can separate all of the classes well. In addition, the classification OAs (for all the methods) with an increase of the feature number dim are compared in Fig. 12. As can be seen in Fig. 12, the proposed algorithm achieves the best OA when dim is larger than 10, while the optimal OA stabilizes when the dimensionality increases to 15. The results allow a similar conclusion, i.e., LADRml outperforms the traditional dimensionality reduction methods and metric learning methods for dimensionality reduction in hyperspectral image classification. We also conducted experiments to investigate the influence of the number of training samples. We varied the amount of training data and studied the sensitivity of the proposed method relative to the conventional methods. Note that the training samples were the same in all cases. Fig. 13 shows the OA with the different numbers of training samples. For all the datasets, the training size was changed from 1% to 10% (note that 1% is the ratio of the number of training samples to the total labeled data). It is clear that the classification performance of LADRml is much better than the other methods. When the ratio of training samples becomes lower and lower, the performance of LADRml also decreases, but to a lesser degree than the other methods. This further confirms that the proposed LADRml is a competitive dimensionality reduction method, even with limited labeled data. IV. CONCLUSION In this paper, we have proposed the LADRml method, which combines global metric and locally adaptive decision constraints using a joint MMML model for hyperspectral image


classification. The proposed LADRml algorithm can utilize the limited training samples and transfer the problem of dimensionality reduction without a certain distribution hypothesis into a MMML problem. Furthermore, the proposed LADRml algorithm adopts locally adaptive decision constraints to determine whether the pairs are similar or not by considering both the decision threshold b and the changes between the distances before and after metric learning. The experimental results show that the proposed LADRml algorithm performs better than the other state-of-the-art dimensionality reduction methods and metric learning methods on challenging hyperspectral datasets, i.e., the AVIRIS Indian Pines dataset, the HYDICE Washington DC Mall dataset, and the ROSIS University of Pavia dataset. For all these challenging datasets, the proposed LADRml method presents the highest accuracy. However, the processing time of the proposed algorithm is greater than the classical dimensionality reduction methods, although it is much less than the classical metric learning methods. In our future work, we will consider improved optimization methods and a flexible scheme to ensure the computational efficiency when learning optimal parameters. ACKNOWLEDGMENT The authors would like to sincerely thank Prof. D. Landgrebe for making the AVIRIS Indian Pines hyperspectral dataset available to the community, and Prof. P. Gamba for providing the ROSIS data from Pavia, Italy. They would also gratefully like to thank the handling editor and anonymous reviewers for their careful reading and helpful remarks, which significantly helped us to improve the technical quality and presentation of this paper. REFERENCES [1] F. Melgani and L. Bruzzone, “Classification of hyperspectral remote sensing images with support vector machines,” IEEE Trans. Geosci. Remote Sens., vol. 42, no. 8, pp. 1778–1790, Aug. 2004. [2] A. Plaza, P. Martinez, J. Plaza, and R. Perez, “Dimensionality reduction and classification of hyperspectral image data using sequences of extended morphological transformations,” IEEE Trans. Geosci. Remote Sens., vol. 43, no. 3, pp. 466–479, Mar. 2005. [3] R. Ji, Y. Gao, R. Hong, Q. Liu, D. Tao, and X. Li, “Spectral-spatial constraint hyperspectral image classification,” IEEE Trans. Geosci. Remote Sens., vol. 52, no. 3, pp. 1811–1824, Mar. 2014. [4] J. Li, J. M. Bioucas-Dias, and A. Plaza, “Spectral–spatial classification of hyperspectral data using loopy belief propagation and active learning,” IEEE Trans. Geosci. Remote Sens., vol. 51, no. 2, pp. 844–856, Feb. 2013. [5] J. C. Harsanyi and C.-I. Chang, “Hyperspectral image classification and dimensionality reduction: An orthogonal subspace projection approach,” IEEE Trans. Geosci. Remote Sens., vol. 32, pp. 779–785, Jul. 1994. [6] R. O. Green et al., “Imaging spectroscopy and the airborne visible/infrared imaging spectrometer (AVIRIS),” Remote Sens. Environ., vol. 65, no. 3, pp. 227–248, Sep. 1998. [7] C.-I. Chang, Hyperspectral Imaging: Techniques for Spectral Detection and Classification. New York, NY, USA: Kluwer, 2003. [8] D. Landgrebe, “Hyperspectral image data analysis,” IEEE Signal Process. Mag., vol. 19, no. 1, pp. 17–28, Jan. 2002. [9] G. Hughes, “On the mean accuracy of statistical pattern recognizers,” IEEE Trans. Inf. Theory, vol. IT-14, no. 1, pp. 55–63, Jan. 1968. [10] D. Landgrebe, “Hyperspectral image data analysis as a high dimensional signal processing problems,” IEEE Signal Process. Mag., vol. 19, no. 1, pp. 17–28, Jan. 2002. [11] W. Li, S. Prasad, J. E. 
Fowler, and L. M. Bruce, “Locality-preserving dimensionality reduction and classification for hyperspectral image analysis,” IEEE Trans. Geosci. Remote Sens., vol. 50, no. 4, pp. 1185–1198, Apr. 2012.


[12] Y. Zhou, J. Peng, and C. L. Chen, "Dimension reduction using spatial and spectral regularized local discriminant embedding for hyperspectral image classification," IEEE Trans. Geosci. Remote Sens., vol. 53, no. 2, pp. 1082–1095, Feb. 2015.
[13] G. F. Hughes, "On the mean accuracy of statistical pattern recognizers," IEEE Trans. Inf. Theory, vol. IT-14, no. 1, pp. 55–63, Jan. 1968.
[14] L. M. Bruce, C. H. Koger, and J. Li, "Dimensionality reduction of hyperspectral data using discrete wavelet transform feature extraction," IEEE Trans. Geosci. Remote Sens., vol. 40, no. 10, pp. 2331–2338, Oct. 2002.
[15] I. T. Jolliffe, Principal Component Analysis. New York, NY, USA: Springer, 2002.
[16] S. T. Roweis and L. K. Saul, "Nonlinear dimensionality reduction by locally linear embedding," Science, vol. 290, no. 22, pp. 2323–2326, 2000.
[17] C. Jutten and J. Hérault, "Blind separation of sources, part I: An adaptive algorithm based on neuromimetic architecture," Signal Process., vol. 24, pp. 1–10, 1991.
[18] X. He and P. Niyogi, "Locality preserving projections," in Proc. Adv. Neural Inf. Process. Syst. Conf., 2003, pp. 153–160.
[19] L. Clemmensen, T. Hastie, D. Witten, and B. Ersboll, "Sparse discriminant analysis," Technometrics, vol. 53, no. 4, pp. 406–413, Jan. 2012.
[20] T. Zhang, D. Tao, and J. Yang, "Discriminative locality alignment," in Proc. Eur. Conf. Comput. Vis., 2008, pp. 725–738.
[21] M. D. Farrell and R. M. Mersereau, "On the impact of PCA dimension reduction for hyperspectral detection of difficult targets," IEEE Geosci. Remote Sens. Lett., vol. 2, no. 2, pp. 192–195, Apr. 2005.
[22] J. Mairal, M. Elad, and G. Sapiro, "Sparse representation for color image restoration," IEEE Trans. Image Process., vol. 17, no. 1, pp. 53–69, Jan. 2008.
[23] J. Wright, Y. Ma, J. Mairal, G. Sapiro, T. Huang, and S. Yan, "Sparse representation for computer vision and pattern recognition," Proc. IEEE, vol. 98, no. 6, pp. 1031–1044, Jun. 2010.
[24] Y. Chen, N. M. Nasrabadi, and T. D. Tran, "Hyperspectral image classification using dictionary-based sparse representation," IEEE Trans. Geosci. Remote Sens., vol. 49, no. 10, pp. 3973–3985, Oct. 2011.
[25] O. Chapelle and A. Zien, "Semi-supervised classification by low density separation," in Proc. 10th Int. Workshop Artif. Intell. Stat., 2005, pp. 57–64.
[26] L. Bruzzone, M. Chi, and M. Marconcini, "A novel transductive SVM for semisupervised classification of remote-sensing images," IEEE Trans. Geosci. Remote Sens., vol. 44, no. 11, pp. 3363–3373, Nov. 2006.
[27] M. Chi and L. Bruzzone, "Semisupervised classification of hyperspectral images by SVMs optimized in the primal," IEEE Trans. Geosci. Remote Sens., vol. 42, no. 6, pp. 1870–1880, Jun. 2007.
[28] S. Rajan, J. Ghosh, and M. M. Crawford, "An active learning approach to hyperspectral data classification," IEEE Trans. Geosci. Remote Sens., vol. 46, no. 4, pp. 1231–1242, Apr. 2008.
[29] D. Tuia, F. Ratle, F. Pacifici, M. F. Kanevski, and W. J. Emery, "Active learning methods for remote sensing image classification," IEEE Trans. Geosci. Remote Sens., vol. 47, no. 7, pp. 2218–2232, Jul. 2009.
[30] S. J. Pan and Q. Yang, "A survey on transfer learning," IEEE Trans. Knowl. Data Eng., vol. 22, no. 10, pp. 1345–1359, Oct. 2010.
[31] C. Cortes and V. N. Vapnik, "Support-vector networks," Mach. Learn., vol. 20, no. 3, pp. 273–297, Sep. 1995.
[32] J. A. Gualtieri and R. F. Cromp, "Support vector machines for hyperspectral remote sensing classification," Proc. SPIE, vol. 3584, pp. 221–232, Oct. 1998.
[33] F. Melgani and L. Bruzzone, "Classification of hyperspectral remote sensing images with support vector machines," IEEE Trans. Geosci. Remote Sens., vol. 42, no. 8, pp. 1778–1790, Aug. 2004.
[34] E. P. Xing, A. Y. Ng, M. I. Jordan, and S. Russell, "Distance metric learning, with application to clustering with side-information," in Proc. Adv. Neural Inf. Process. Syst. Conf., 2002, pp. 505–512.
[35] Y. Dong, B. Du, and L. Zhang, "Target detection based on random forest metric learning," IEEE J. Sel. Topics Appl. Earth Obs. Remote Sens., vol. 8, no. 4, pp. 1830–1838, Apr. 2015.
[36] N. Shental, T. Hertz, D. Weinshall, and M. Pavel, "Adjustment learning and relevant component analysis," in Proc. Eur. Conf. Comput. Vis., 2002, vol. 4, pp. 776–790.
[37] J. Goldberger, S. Roweis, G. Hinton, and R. Salakhutdinov, "Neighbourhood components analysis," in Proc. Adv. Neural Inf. Process. Syst. Conf., Dec. 2004, pp. 571–577.
[38] K. Q. Weinberger, J. Blitzer, and L. K. Saul, "Distance metric learning for large margin nearest neighbor classification," in Proc. 19th Annu. Conf. Neural Inf. Process. Syst., 2005, pp. 1473–1480.

[39] J. V. Davis, B. Kulis, P. Jain, S. Sra, and I. S. Dhillon, "Information-theoretic metric learning," in Proc. 24th Int. Conf. Mach. Learn., Jun. 2007, pp. 209–216.
[40] Q. Zhang, L. Zhang, Y. Yang, Y. Tian, and L. Weng, "Local patch discriminative metric learning for hyperspectral image feature extraction," IEEE Geosci. Remote Sens. Lett., vol. 11, no. 3, pp. 612–616, Mar. 2014.
[41] B. Bue, D. Thompson, M. Gilmore, and R. Castaño, "Metric learning for hyperspectral image segmentation," in Proc. 3rd IEEE Workshop Hyperspectral Image Signal Process., Evol. Remote Sens., Jun. 2011, pp. 1–4.
[42] Y. Dong, L. Zhang, L. Zhang, and B. Du, "Maximum margin metric learning based target detection for hyperspectral images," ISPRS J. Photogramm. Remote Sens., vol. 108, pp. 138–150, Oct. 2015.
[43] L. Zhang, L. Zhang, D. Tao, X. Huang, and B. Du, "Hyperspectral remote sensing image subpixel target detection based on supervised metric learning," IEEE Trans. Geosci. Remote Sens., vol. 52, no. 8, pp. 4955–4965, Aug. 2014.
[44] Z. Xu, K. Q. Weinberger, and O. Chapelle, "Distance metric learning for kernel machines," arXiv preprint arXiv:1208.3422, 2012.
[45] C. Shen, J. Kim, and L. Wang, "Scalable large-margin Mahalanobis distance metric learning," IEEE Trans. Neural Netw., vol. 21, no. 9, pp. 1524–1530, Sep. 2010.
[46] C. Xiong, D. M. Johnson, and J. J. Corso, "Efficient max-margin metric learning," in Proc. 6th Int. Workshop Evol. Change Data Manage., 2012, pp. 1–9.
[47] S. Shalev-Shwartz, Y. Singer, and N. Srebro, "Pegasos: Primal estimated sub-gradient solver for SVM," in Proc. 24th Int. Conf. Mach. Learn., 2007, pp. 807–814.
[48] V. Franc and S. Sonnenburg, "Optimized cutting plane algorithm for support vector machines," in Proc. 25th Int. Conf. Mach. Learn., 2008, pp. 320–327.
[49] T. Joachims, "Training linear SVMs in linear time," in Proc. 12th ACM SIGKDD Int. Conf. Knowl. Disc. Data Min., 2006, pp. 217–226.
[50] J. Lu, J. Hu, X. Zhou, and Y. Shang, "Activity-based human identification using sparse coding and discriminative metric learning," in Proc. ACM Multimedia Conf., 2012, pp. 1061–1064.
[51] J. Yu, D. Tao, J. Li, and J. Cheng, "Semantic preserving distance metric learning and applications," Inf. Sci., vol. 281, pp. 674–686, 2014.
[52] F. Wang, "Semisupervised metric learning by maximizing constraint margin," IEEE Trans. Syst., Man, Cybern. B, Cybern., vol. 41, no. 4, pp. 931–939, Aug. 2011.
[53] F. Wang and C. Zhang, "On discriminative semi-supervised classification," in Proc. Conf. Artif. Intell., 2008, pp. 720–725.
[54] C. C. Chang, "A boosting approach for supervised Mahalanobis distance metric learning," Pattern Recog., vol. 45, pp. 844–862, 2012.
[55] Z. Li, S. Chang, F. Liang, T. S. Huang, L. Cao, and J. R. Smith, "Learning locally-adaptive decision functions for person verification," in Proc. IEEE Comput. Vis. Pattern Recog., 2013, pp. 3610–3617.
[56] Q. Wang, W. Zuo, L. Zhang, and P. Li, "Shrinkage expansion adaptive metric learning," in Proc. Eur. Conf. Comput. Vis., 2014, pp. 456–471.
[57] B. Kulis, K. Saenko, and T. Darrell, "What you saw is not what you get: Domain adaptation using asymmetric kernel transforms," in Proc. IEEE Comput. Vis. Pattern Recog., 2011, pp. 1785–1792.
[58] Y. Luo, D. Tao, C. Xu, D. Li, and C. Xu, "Vector-valued multi-view semisupervised learning for multi-label image classification," in Proc. 27th Conf. Artif. Intell., 2013, pp. 647–653.
[59] T. M. Cover and P. E. Hart, "Nearest neighbor pattern classification," IEEE Trans. Inf. Theory, vol. 13, no. 1, pp. 21–27, Jan. 1967.
[60] H. He, S. Hawkins, W. Graco, and X. Yao, "Application of genetic algorithm and k-nearest neighbour method in real world medical fraud detection problem," J. Adv. Comput. Intell. Intell. Informat., vol. 4, no. 2, pp. 130–137, 2000.
[61] B. Du, L. Zhang, L. Zhang, T. Chen, and K. Wu, "A discriminative manifold learning based dimension reduction method," Int. J. Fuzzy Syst., vol. 14, pp. 272–277, Jun. 2012.
[62] Q. Shi, L. Zhang, and B. Du, "Semisupervised discriminative locally enhanced alignment for hyperspectral image classification," IEEE Trans. Geosci. Remote Sens., vol. 51, no. 9, pp. 4800–4815, Sep. 2013.
[63] B. D. Bue, "An evaluation of low-rank Mahalanobis metric learning techniques for hyperspectral image classification," IEEE J. Sel. Topics Appl. Earth Obs. Remote Sens., vol. 7, no. 4, pp. 1079–1088, Apr. 2014.
[64] A. Green, M. Berman, P. Switzer, and M. Craig, "A transformation for ordering multispectral data in terms of image quality with implications for noise removal," IEEE Trans. Geosci. Remote Sens., vol. 26, no. 1, pp. 65–74, Jan. 1988.


[65] L. Zhang, Q. Zhang, L. Zhang, D. Tao, X. Huang, and B. Du, "Ensemble manifold regularized sparse low-rank approximation for multiview feature embedding," Pattern Recog., vol. 48, no. 10, pp. 3102–3112, Oct. 2015.
[66] Y. Gu and K. Feng, "Optimized Laplacian SVM with distance metric learning for hyperspectral image classification," IEEE J. Sel. Topics Appl. Earth Obs. Remote Sens., vol. 6, no. 3, pp. 1109–1117, Jun. 2013.
[67] W. Li and Q. Du, "Joint within-class collaborative representation for hyperspectral image classification," IEEE J. Sel. Topics Appl. Earth Obs. Remote Sens., vol. 52, no. 6, pp. 3399–3411, Jun. 2014.

Yanni Dong (S’14) received the B.S. degree in sciences and techniques of remote sensing from Wuhan University, Wuhan, China, in 2012, where she is currently working toward the Ph.D. degree at the State Key Laboratory of Information Engineering in Surveying, Mapping, and Remote Sensing. Her current research interests include pattern recognition in remote sensing images, hyperspectral image processing, and metric learning.

Bo Du (M’10–SM’15) received the B.S. degree from Wuhan University, Wuhan, China, in 2005, and the Ph.D. degree in photogrammetry and remote sensing from the State Key Laboratory of Information Engineering in Surveying, Mapping, and Remote Sensing, Wuhan University, in 2010. He is currently a Professor with the School of Computer, Wuhan University. His major research interests include pattern recognition, hyperspectral image processing, and signal processing.


Liangpei Zhang (M’06–SM’08) received the B.S. degree in physics from Hunan Normal University, Changsha, China, in 1982, the M.S. degree in optics from the Chinese Academy of Sciences, Xi’an, China, in 1988, and the Ph.D. degree in photogrammetry and remote sensing from Wuhan University, Wuhan, China, in 1998. He is currently the Head of the Remote Sensing Division, State Key Laboratory of Information Engineering in Surveying, Mapping, and Remote Sensing, Wuhan University. He is also a Chang-Jiang Scholar Chair Professor appointed by the Ministry of Education of China, and a Principal Scientist for the China State Key Basic Research Project (2011–2016) appointed by the Ministry of National Science and Technology of China to lead the remote sensing program in China. He is the holder of 15 patents and has published more than 360 research papers. His research interests include hyperspectral remote sensing, high-resolution remote sensing, image processing, and artificial intelligence. Dr. Zhang is a Fellow of the Institution of Engineering and Technology, an Executive Member (Board of Governors) of the China National Committee of the International Geosphere-Biosphere Programme, and an Executive Member of the China Society of Image and Graphics. He regularly serves as a Co-chair of the series SPIE Conferences on Multispectral Image Processing and Pattern Recognition, the Conference on Asia Remote Sensing, and many other conferences. He has edited several conference proceedings and special issues and organized geoinformatics symposia. He also serves as an Associate Editor of the International Journal of Ambient Computing and Intelligence, the International Journal of Image and Graphics, the International Journal of Digital Multimedia Broadcasting, the Journal of Geo-spatial Information Science, the Journal of Remote Sensing, and the IEEE TRANSACTIONS ON GEOSCIENCE AND REMOTE SENSING.

Lefei Zhang (S’11–M’14) received the B.S. and Ph.D. degrees from Wuhan University, Wuhan, China, in 2008 and 2013, respectively. From August 2013 to July 2015, he was a Postdoctoral Researcher with the School of Computer, Wuhan University. He was a Visiting Scholar with the CAD & CG Lab, Zhejiang University, in 2015, and a Big Data Institute Visitor in the Department of Statistical Science, University College London, in 2016. He is currently a Lecturer with the School of Computer, Wuhan University, and also a Hong Kong Scholar in the Department of Computing, The Hong Kong Polytechnic University, Kowloon, Hong Kong. His research interests include pattern recognition, image processing, and remote sensing. Dr. Zhang is a Reviewer for more than 20 international journals, including the IEEE TIP, TNNLS, TMM, and TGRS.
