This article has been accepted for inclusion in a future issue of this journal. Content is final as presented, with the exception of pagination. IEEE GEOSCIENCE AND REMOTE SENSING LETTERS
Multimetric Active Learning for Classification of Remote Sensing Data

Zhou Zhang, Student Member, IEEE, Edoardo Pasolli, Member, IEEE, Hsiuhan Lexie Yang, Member, IEEE, and Melba M. Crawford, Fellow, IEEE
Abstract—The classification of hyperspectral and multimodal remote sensing data is affected by two key problems: the high dimensionality of the input data and the limited number of labeled samples. In this letter, a multimetric learning approach that combines feature extraction and active learning (AL) is introduced to deal with these two issues simultaneously. In particular, distinct metrics are assigned to different types of features and then learned jointly. In this way, multiple features are projected into a common feature space, in which AL is then performed in conjunction with k-nearest neighbor classification to enrich the set of labeled samples. Experiments on two sets of remote sensing data illustrate the effectiveness of the proposed framework in terms of both classification accuracy and computational requirements.

Index Terms—Active learning (AL), classification, feature extraction, metric learning, remote sensing data.
I. INTRODUCTION

Remote sensing data analysis has gained attention in recent years due to the development of a wide range of sensing technologies and the greater availability of data acquired by these sensors. Among the different types of analysis, particular attention has been given to land cover classification. Classification approaches that rely on a first step of feature extraction can increase the separability among classes and improve prediction accuracies. Disparate features from either the same sensor (e.g., spectral and spatial information [1]) or different sensors (e.g., hyperspectral sensors and LiDAR systems [2]) are often combined, as they can provide complementary information. However, the resulting high-dimensional feature space often imposes significant challenges in developing robust supervised classifiers, particularly when the quantity of labeled samples is limited. This is common in practical scenarios, since training samples are typically collected from a limited portion of an image and therefore may not properly model the underlying distribution of the data.

Most research has focused on addressing the two aforementioned problems independently. The high dimensionality problem is addressed
Manuscript received December 7, 2015; revised March 30, 2016; accepted April 19, 2016. This work was supported in part by the National Aeronautics and Space Administration (NASA) under Advanced Information Systems Technology (AIST) Grant 11-0077 and in part by the Advanced Research Projects Agency-Energy, U.S. Department of Energy, under Grant DE-AR0000593.
Z. Zhang, H. L. Yang, and M. M. Crawford are with the School of Civil Engineering, Purdue University, West Lafayette, IN 47906 USA (e-mail: [email protected]).
E. Pasolli is with the Centre for Integrative Biology, University of Trento, 38123 Trento, Italy.
Color versions of one or more of the figures in this paper are available online at http://ieeexplore.ieee.org.
Digital Object Identifier 10.1109/LGRS.2016.2560623
through feature reduction strategies, which project the original data into a lower dimensional feature space prior to classification. Commonly used feature extraction techniques include both linear [3] and nonlinear [4] approaches. Active learning (AL) has also been demonstrated to be an effective approach for dealing with the limited availability of labeled samples [5]. Common AL strategies work in a "fixed" feature space, represented by the original data or the feature space obtained after feature reduction. In this scenario, the feature space is not updated as AL proceeds, resulting in potentially suboptimal performance.

Recently, AL has been applied in a more adaptive way [6]–[8]. In particular, we recently proposed to integrate the feature extraction and AL steps into a unique framework [7] in which 1) the reduced feature space is learned and updated at each iteration of the AL process and 2) a committee-based AL criterion is applied in the resulting feature space in conjunction with k-nearest neighbor (kNN) classification. Feature space reduction is achieved through large margin nearest neighbor (LMNN) [9], a metric learning strategy that naturally takes advantage of the increasing amount of labeled information. However, the method can only handle single input features (i.e., pure spectral features) and does not specifically accommodate multiple-feature scenarios. This problem could be overcome by concatenating the different types of features into a single feature vector before applying the proposed strategy, but learning the distance metric in a very high dimensional feature space is computationally intensive and therefore has limited appeal in an AL scenario.

Although various metric learning methods have been proposed in the machine learning community [9]–[11], most can only deal with a single feature type. Very few studies have focused on multiple feature types, and some can only handle two types of features [12], [13].
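As a side note for readers new to metric learning, the equivalence between a learned positive semidefinite metric M and a rectangular projection L (with M = L^T L) that underlies LMNN-style dimensionality reduction can be checked numerically. The numpy sketch below is illustrative only; all names and dimensions are hypothetical, and it is not the authors' implementation.

```python
import numpy as np

rng = np.random.default_rng(0)
d, r = 6, 2                          # hypothetical input / reduced dimensionality
L = rng.standard_normal((r, d))      # rectangular projection matrix
M = L.T @ L                          # induced metric, positive semidefinite

xi, xj = rng.standard_normal(d), rng.standard_normal(d)
diff = xi - xj

d_metric = diff @ M @ diff                 # (x_i - x_j)^T M (x_i - x_j)
d_projected = np.sum((L @ diff) ** 2)      # ||L (x_i - x_j)||^2
assert np.isclose(d_metric, d_projected)   # the two forms coincide

# eigenvalues of M are nonnegative, confirming M is positive semidefinite
assert np.all(np.linalg.eigvalsh(M) >= -1e-10)
```

Constraining L to have fewer rows than columns is what makes the learned distance double as a dimensionality reduction.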
Although the approach proposed in [14] can be generalized to a setting with multiple types of features, it is more suitable for image matching across different domains. Aiming at the classification of multiple types of features, heterogeneous multimetric learning (HMML) was proposed in [15], which extends the LMNN strategy to a multiple-feature-type setting. HMML learns multiple metrics jointly, where each metric is devoted to a specific feature type. To the best of our knowledge, HMML has not been applied in the remote sensing field, nor has it been integrated within an AL framework.

In this letter, we propose to combine HMML and AL into a unique framework to improve the classification of remote sensing images when multiple feature types are processed. First, a reduced feature space is obtained for each feature type by adopting a modified version of the HMML algorithm. Then, AL is applied in the resulting single feature space in conjunction with kNN classification.
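To make the idea of projecting multiple feature types into a common feature space concrete, the sketch below (hypothetical dimensions; not the authors' code) projects two feature types with separate matrices and verifies that the sum of per-type squared distances equals a single squared distance under a block diagonal projection of the stacked features.

```python
import numpy as np

rng = np.random.default_rng(0)
dims, r = [4, 3], 2                                 # hypothetical d_q per type, output dim
Ls = [rng.standard_normal((r, d)) for d in dims]    # one projection L^q per type
xi = [rng.standard_normal(d) for d in dims]         # sample i, per-type parts x_i^q
xj = [rng.standard_normal(d) for d in dims]         # sample j, per-type parts x_j^q

# common feature space: concatenate the per-type projections
zi = np.concatenate([L @ x for L, x in zip(Ls, xi)])
zj = np.concatenate([L @ x for L, x in zip(Ls, xj)])

# block diagonal view of the same mapping
L_block = np.zeros((r * len(dims), sum(dims)))
row = col = 0
for L in Ls:
    L_block[row:row + r, col:col + L.shape[1]] = L
    row, col = row + r, col + L.shape[1]

# sum of per-type squared distances == squared distance in the common space
d_sum = sum(np.sum((L @ (a - b)) ** 2) for L, a, b in zip(Ls, xi, xj))
d_common = np.sum((zi - zj) ** 2)
assert np.isclose(d_sum, d_common)
assert np.allclose(L_block @ np.concatenate(xi), zi)
```

This is why, once the per-type projections are learned, ordinary kNN in the concatenated (common) space reproduces the multimetric distance.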
1545-598X © 2016 IEEE. Personal use is permitted, but republication/redistribution requires IEEE permission. See http://www.ieee.org/publications_standards/publications/rights/index.html for more information.
The rest of this letter is arranged as follows. In Section II, the proposed framework that integrates HMML and AL is described. Experimental analysis is presented in Section III. Finally, conclusions are summarized in Section IV.
II. PROPOSED METHOD

Given N training samples {x_i, y_i}_{i=1}^N, where x_i ∈ R^d is the input vector with d features and y_i ∈ {1, 2, ..., C} is its corresponding label, distance metric learning aims to learn a distance d(x_i, x_j) between two samples such that similar samples (i.e., samples from the same class) are close to each other, while dissimilar samples (i.e., samples from different classes) are farther apart. The learned distance has the following form:

    d(x_i, x_j) = (x_i - x_j)^T M (x_i - x_j)    (1)

where M ∈ R^{d×d} is a positive semidefinite matrix. Note that M can also be expressed as M = L^T L, with L ∈ R^{r×d} (r ≤ d), so (1) can be rewritten as

    d(x_i, x_j) = (x_i - x_j)^T L^T L (x_i - x_j) = ‖L(x_i - x_j)‖^2.    (2)

The equivalence of (1) and (2) suggests that the distance can be estimated either as a function of the matrix M, enforcing the constraint that M is positive semidefinite, or by optimizing a linear projection matrix L, which projects the input feature space into a lower dimensional space when L is constrained to be rectangular.

A. LMNN

Among the various metric learning methods, LMNN is one of the most widely used strategies. For a generic labeled sample (x_i, y_i), we denote (x_j, y_j) as one of the k nearest neighbors of x_i with label y_j = y_i, and (x_l, y_l) as any sample with label y_l ≠ y_i. The loss function consists of two terms and is formulated as

    ε = (1 - μ) ε_pull + μ ε_push    (3)

where

    ε_pull = Σ_{i,j} d(x_i, x_j)
    ε_push = Σ_{i,j,l} [1 + d(x_i, x_j) - d(x_i, x_l)]_+    (4)

and [·]_+ = max(·, 0) is the hinge loss. The ε_pull term acts to pull neighboring samples with the same label closer, while the ε_push term pushes differently labeled samples farther apart. The two terms are combined via a weighting parameter μ. To reduce the number of features, (3) is minimized with respect to L, with r set to the output dimensionality. Let C_{ij} = (x_i - x_j)(x_i - x_j)^T and d_t(x_i, x_j) = ‖L_t(x_i - x_j)‖^2 at the t-th iteration. The gradient of (3) with respect to L_t is

    G_t = ∂ε/∂L_t = 2(1 - μ) L_t Σ_{i,j} C_{ij} + 2μ L_t Σ_{(i,j,l)∈N_t} (C_{ij} - C_{il})    (5)

where N_t is the set of triples (i, j, l) that trigger the hinge loss in (4) at iteration t. Finally, a sample x_i can be represented in the lower dimensional feature space as L x_i.

B. HMML

The HMML method [15] extends LMNN, which is a single-input-feature strategy, to a setting with multiple feature types. Denote N training samples from Q different feature types as {{x_i^q}_{q=1}^Q, y_i}_{i=1}^N, where x_i^q represents the i-th training sample from the q-th feature type. To account for all the features, the distance between two samples is formulated as

    d(x_i, x_j) = Σ_{q=1}^Q (x_i^q - x_j^q)^T M^q (x_i^q - x_j^q) = Σ_{q=1}^Q ‖L^q (x_i^q - x_j^q)‖^2    (6)

where M^q (or L^q) is the transformation matrix for the q-th feature type. Instead of learning a single metric M, HMML aims to learn a joint metric set {M^q}_{q=1}^Q. The loss function of HMML is similar to (3), but with ε_pull and ε_push redefined as

    ε_pull = Σ_{i,j} Σ_{q=1}^Q d(x_i^q, x_j^q)
    ε_push = Σ_{i,j,l} [1 + Σ_{q=1}^Q d(x_i^q, x_j^q) - Σ_{q=1}^Q d(x_i^q, x_l^q)]_+    (7)

where d(x_i^q, x_j^q) = (x_i^q - x_j^q)^T M^q (x_i^q - x_j^q). In (7), the multiple metrics are coupled via the hinge loss [·]_+ and learned jointly from the training data. This allows the information from multiple features to be fused and the learned metrics to be adapted to each feature type. To solve (3) and (7), the metric set {M^q}_{q=1}^Q is learned via optimization by semidefinite programming [15].

Fig. 1. Indian Pine data set. (a) True-color composite of the hyperspectral data (bands R: 26, G: 14, B: 8). (b) Class details.

C. Combining HMML With AL

HMML is able to find a transformation set for input data with multiple feature types, but it cannot deal directly with high-dimensional feature spaces. To overcome this problem, we propose to do the following: 1) differentiate (3) and (7) with respect to {L^q}_{q=1}^Q instead of learning {M^q}_{q=1}^Q and 2) constrain the projection matrix L^q for the q-th feature type to be rectangular of size r_q × d_q, where d_q and r_q are the input and output dimensionality of the q-th feature type, respectively, with r_q ≤ d_q. The objective function (3) can then be rewritten in terms of {L^q}_{q=1}^Q as

    ε = (1 - μ) Σ_{i,j} Σ_{q=1}^Q ‖L^q (x_i^q - x_j^q)‖^2
        + μ Σ_{i,j,l} [1 + Σ_{q=1}^Q ‖L^q (x_i^q - x_j^q)‖^2 - Σ_{q=1}^Q ‖L^q (x_i^q - x_l^q)‖^2]_+.    (8)

The gradient with respect to L^q at the t-th iteration is

    G_t^q = ∂ε/∂L_t^q = 2(1 - μ) L_t^q Σ_{i,j} C_{ij}^q + 2μ L_t^q Σ_{(i,j,l)∈N_t} (C_{ij}^q - C_{il}^q)    (9)

where C_{ij}^q = (x_i^q - x_j^q)(x_i^q - x_j^q)^T. After learning the projection matrix set {L^q}_{q=1}^Q, kNN classification is performed using the distance measure formulated in (6). As in LMNN, a sample x_i = [x_i^1, x_i^2, ..., x_i^Q] comprising all feature types can be represented in a lower dimensional feature space as L x_i, where L is a block diagonal matrix with {L^q}_{q=1}^Q as the block entries, so that the common feature space representation is [L^1 x_i^1, L^2 x_i^2, ..., L^Q x_i^Q]. Finally, the AL query strategy proposed in [7] is applied in the derived feature space to select the most informative samples. In particular, uncertainty is incorporated by adopting an ensemble classifier approach based on kNN classification, in which each member is characterized by a different number of nearest neighbors, k.
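As a rough illustration of the per-type objective and gradient, the following numpy sketch builds toy same-label pairs and impostor triples and takes one small gradient step per feature type. It is a simplification under stated assumptions (all same-label pairs act as target neighbors, one fixed active set, hypothetical names and data), not the authors' solver, which would restrict target neighbors to the k nearest and iterate to convergence.

```python
import numpy as np

rng = np.random.default_rng(0)
dims, r_q, mu = [4, 3], 2, 0.5                            # d_q per type, output dim, weight
N = 12
X = [rng.standard_normal((N, d)) for d in dims]           # per-type features x_i^q
y = rng.integers(0, 2, N)                                 # toy labels
Ls = [0.1 * rng.standard_normal((r_q, d)) for d in dims]  # projections L^q

def dist(Ls, i, j):
    """Multimetric distance: sum over types q of ||L^q (x_i^q - x_j^q)||^2."""
    return sum(float(np.sum((L @ (Xq[i] - Xq[j])) ** 2)) for L, Xq in zip(Ls, X))

# same-label pairs (stand-ins for target neighbors) and impostor triples
pairs = [(i, j) for i in range(N) for j in range(N) if i != j and y[i] == y[j]]
triples = [(i, j, l) for i, j in pairs for l in range(N) if y[l] != y[i]]

def loss(Ls):
    """Weighted pull term plus hinged push term."""
    pull = sum(dist(Ls, i, j) for i, j in pairs)
    push = sum(max(0.0, 1 + dist(Ls, i, j) - dist(Ls, i, l))
               for i, j, l in triples)
    return (1 - mu) * pull + mu * push

def grad(Ls, q):
    """Gradient w.r.t. L^q; the push sum keeps only active hinges."""
    Xq = X[q]
    C = lambda a, b: np.outer(Xq[a] - Xq[b], Xq[a] - Xq[b])
    G = 2 * (1 - mu) * sum(C(i, j) for i, j in pairs)
    G = G + 2 * mu * sum(((C(i, j) - C(i, l)) for i, j, l in triples
                          if 1 + dist(Ls, i, j) - dist(Ls, i, l) > 0),
                         np.zeros((dims[q], dims[q])))
    return Ls[q] @ G

# one small gradient step per feature type (a full solver would iterate,
# refreshing the active set after each step)
step = 1e-5
Ls = [L - step * grad(Ls, q) for q, L in enumerate(Ls)]
```

Note how the per-type gradient touches only the d_q × d_q outer products of that type, which is the source of the computational advantage over learning one metric on the stacked features.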
III. EXPERIMENTS

A. Data Sets and Extracted Features

The proposed method was evaluated experimentally on two data sets. The first experiment was devoted to the fusion of spectral and spatial information in a hyperspectral scenario. We considered the well-known 145 × 145 Indian Pine data set, where hyperspectral data [see Fig. 1(a)] were acquired over the Indian Pine agricultural site in northwestern Indiana in June 1992, at 20-m spatial resolution and 10-nm spectral resolution over the range of 400–2500 nm. Twenty noisy and water absorption bands (104–108, 150–163, and 220) were removed, resulting in a 200-band image. Ground reference data were composed of 10 366 labeled samples assigned to 16 classes [see Fig. 1(b)]. From the hyperspectral image, we extracted window-based texture features (mean and standard deviation of each band over a 3 × 3 window) and extended multiattribute profiles (EMAPs) [16], which have been demonstrated to be effective spatial features for remote sensing image classification. EMAPs were extracted from the first four principal components (PCs), which contain 99% of the total variance of the original data. Following [16], four attribute profiles were considered with the following parameters: 1) region area, λ_a = [100, 500, 1000, 5000]; 2) bounding box diagonal length, λ_d = [10, 25, 50, 100]; 3) moment of inertia, λ_i = [0.2, 0.3, 0.4, 0.5]; and 4) standard deviation of the gray-level values of the pixels in the regions, λ_s = [20, 30, 40, 50].

The second experiment represented a multisensor fusion scenario. The data set consisted of a hyperspectral image and discrete return LiDAR data acquired in June 2012 over the University of Houston (UH) campus and the neighboring urban area. The hyperspectral image [see Fig. 2(a)] consisted of 144 spectral bands covering the range of [380, 1050] nm. The pseudowaveforms (PWs) [see Fig. 2(b)] were generated from the original LiDAR point cloud data as described in [17]. The PW data were composed of 80 features, corresponding to the LiDAR aggregated discrete returns at elevations in the range of [−9, 70] m above ground level. In Fig. 2(b), green is associated with objects close to the ground, midelevation objects are in blue, and taller objects are in red. Both the hyperspectral and the gridded LiDAR data were 349 × 1340 pixels with 2.5-m spatial resolution. The ground reference data were composed of 15 029 samples subdivided into 15 classes, with the number of samples per class summarized in Fig. 2(c). EMAPs were extracted from the first four PCs of both the hyperspectral image and the LiDAR PW data (window-based texture features were not considered in this case since using EMAPs alone provided a sufficient number of different sources). The source types with the corresponding number of features are summarized for both data sets in Table I.

B. Experimental Setting

The proposed method (LMNN_Couple) was compared to three AL baseline strategies: 1) SVM applied to the set of features stacked into a single data structure (SVM_Stack); 2) LMNN applied to the same stacked features (LMNN_Stack); and 3) LMNN applied to each feature type separately, with the final distance computed according to (6) (LMNN_Ind). Breaking ties [18], [19] was used as the AL query criterion for SVM_Stack.
For all the other methods, the one-nearest neighbor (1NN) classifier was adopted as the back-end classifier to generate the final classification map (no significant changes in accuracy were observed in a simple sensitivity investigation with k = {1, 3, 5}). The output dimensionality was fixed to 7 for each feature type in LMNN_Ind and LMNN_Couple [7], resulting in a final feature space of 28 features. For a fair comparison, the output dimensionality was fixed to 28 in LMNN_Stack. We also tested other output dimensionalities, which did not change the accuracies significantly, as reported in [7]. In addition, the results did not depend strongly on the weighting parameter μ, which was fixed to 0.5 in all the LMNN-related methods [9].

Experiments were conducted by subdividing the available labeled samples into disjoint learning and testing sets. Five and ten samples per class were randomly selected as the initial training set for Indian Pine and UH, respectively. The AL algorithm was run until 400 samples were added to the training set, with a batch size of five samples. The entire AL process was repeated ten times, each time initializing the training set randomly. SVM classification was performed with a Gaussian kernel, optimizing its parameters C and γ in the ranges [2^−5, 2^15] and [2^−15, 2^3], respectively, every 20 iterations of the process by fivefold cross-validation.
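The committee-based query step used by the kNN methods can be approximated by a vote-entropy heuristic: train one kNN member per value of k, then query the pool samples on which the members disagree most. The sketch below is a simplified stand-in for the criterion of [7]; function names and data are hypothetical.

```python
import numpy as np

def knn_predict(X_train, y_train, X, k):
    """Plain kNN: majority label among the k nearest training samples."""
    d2 = ((X[:, None, :] - X_train[None, :, :]) ** 2).sum(-1)
    nearest = np.argsort(d2, axis=1)[:, :k]
    return np.array([np.bincount(y_train[rows]).argmax() for rows in nearest])

def query_batch(X_train, y_train, X_pool, ks=(1, 3, 5), batch=5):
    """Rank pool samples by vote entropy of a kNN committee (one member
    per value of k) and return the indices of the `batch` most uncertain."""
    votes = np.stack([knn_predict(X_train, y_train, X_pool, k) for k in ks])
    entropy = np.zeros(X_pool.shape[0])
    for c in np.unique(y_train):
        p = (votes == c).mean(axis=0)          # committee vote share for class c
        entropy -= p * np.log(np.clip(p, 1e-12, None))
    return np.argsort(-entropy)[:batch]

rng = np.random.default_rng(0)
X_train = rng.standard_normal((30, 4))
y_train = rng.integers(0, 3, 30)
X_pool = rng.standard_normal((100, 4))
selected = query_batch(X_train, y_train, X_pool)   # five indices to be labeled
```

In the full framework, the selected samples would be labeled, added to the training set, and the per-type projections relearned before the next query round.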
Fig. 2. UH data set. (a) True-color composite of the hyperspectral data (wavelength R: 670.7 nm, G: 550.2 nm, B: 459.6 nm). (b) False-color composite of the PW data (elevation R: 10 m, G: 0 m, B: 5 m). (c) Class details.

TABLE I. SOURCE TYPES WITH CORRESPONDING NUMBER OF FEATURES
Fig. 3. Kappa statistic achieved on the Indian Pine data set.
C. Experimental Results

1) Indian Pine Data Set: Results obtained with the various methods are reported in Fig. 3. The proposed solution performed effectively on this data set in terms of both early- and long-term learning, yielding higher accuracies throughout the entire AL process. For example, SVM_Stack achieved a Kappa statistic of around 95% at the last iteration, whereas the proposed method reached this accuracy after selecting only 200 labeled samples. Focusing on the three LMNN-based methods, LMNN_Couple and LMNN_Stack achieved higher accuracies than LMNN_Ind. This is reasonable since LMNN_Couple and LMNN_Stack optimize the different feature types jointly. The higher accuracies of the proposed strategy with respect to LMNN_Stack confirmed the effectiveness of the proposed
TABLE II. AVERAGE ACCURACIES (STANDARD DEVIATION) ACHIEVED ON THE INDIAN PINE DATA SET
methodology, where a distinct metric was dedicated to each type of feature instead of stacking all the features into a single vector. The overall accuracy (OA), average accuracy (AA), Kappa statistic, and class-specific accuracies averaged over the entire AL process are listed in Table II. Compared to the other methods, the proposed strategy achieved higher accuracies for most of the classes, as well as similar or even smaller standard deviations. Particular improvements were observed for classes 3, 4, and 6.

2) UH Data Set: The results obtained on the UH data set followed the same general trends observed for the Indian Pine data set. The proposed framework yielded higher classification accuracies throughout the entire AL process [see Fig. 4(a) and Table III]. Also in this case, LMNN_Couple and LMNN_Stack outperformed LMNN_Ind in terms of average accuracies, although the accuracy differences were smaller than for the previous data set. These results are likely due to the fact that HSI signatures and LiDAR data are quite different in this multisensor scenario, leaving a minor advantage in optimizing the different feature types simultaneously.

Fig. 4. Classification performance on the UH data set. (a) Kappa statistic. (b) Computational time, including both training and AL times.

TABLE III. AVERAGE ACCURACIES (STANDARD DEVIATION) ACHIEVED ON THE UH DATA SET

The UH data set is composed of a relatively large number of samples, which may compromise its processing in a real AL scenario. The computational overhead for this data set is reported for the four methods in Fig. 4(b). The proposed solution required a computational time similar to that of LMNN_Ind. This is expected, since both are characterized by the same source-specific gradient representations. Additionally, the proposed method consistently outperformed LMNN_Stack and SVM_Stack: LMNN_Stack needs a longer training time due to the much higher dimensionality of the feature space, while SVM_Stack is time consuming at prediction time due to the computation of the kernel distances between each unlabeled sample and each support vector.

IV. CONCLUSION

In this letter, we have proposed a new multimetric AL strategy for the classification of remote sensing data, in which dimensionality reduction for multiple types of features and AL are integrated into a unique framework. This is obtained by jointly optimizing a set of metrics in order to exploit the relationship among different features. Experiments conducted on two real remote sensing data sets demonstrated the effectiveness of the proposed approach in terms of classification accuracies and computational time, making the proposed method promising for processing large images. While the proposed strategy may be affected by overfitting issues at early AL stages when the number of labeled samples is limited, additional improvements may be achieved by incorporating unlabeled information through semisupervised approaches.

REFERENCES
[1] P. Ghamisi, J. A. Benediktsson, and M. O. Ulfarsson, "Spectral-spatial classification of hyperspectral images based on hidden Markov random fields," IEEE Trans. Geosci. Remote Sens., vol. 52, no. 5, pp. 2565–2574, May 2014.
[2] M. Dalponte, L. Bruzzone, and D. Gianelle, "Fusion of hyperspectral and LIDAR remote sensing data for classification of complex forest areas," IEEE Trans. Geosci. Remote Sens., vol. 46, no. 5, pp. 1416–1427, May 2008.
[3] C. Rodarmel and J. Shan, "Principal component analysis for hyperspectral image classification," Surv. Land Inf. Sci., vol. 62, no. 2, pp. 115–122, Jun. 2002.
[4] D. Lunga, S. Prasad, M. M. Crawford, and O. Ersoy, "Manifold-learning-based feature extraction for classification of hyperspectral data: A review of advances in manifold learning," IEEE Signal Process. Mag., vol. 31, no. 1, pp. 55–66, Jan. 2014.
[5] D. Tuia, M. Volpi, L. Copa, M. Kanevski, and J. Muñoz-Marí, "A survey of active learning algorithms for supervised remote sensing image classification," IEEE J. Sel. Topics Signal Process., vol. 5, no. 3, pp. 606–617, Jun. 2011.
[6] Q. Shi, B. Du, and L. Zhang, "Spatial coherence based batch-mode active learning for remote sensing image classification," IEEE Trans. Image Process., vol. 24, no. 7, pp. 2037–2050, Jul. 2015.
[7] E. Pasolli, H. L. Yang, and M. M. Crawford, "Active-metric learning for classification of remotely sensed hyperspectral images," IEEE Trans. Geosci. Remote Sens., vol. 54, no. 4, pp. 1925–1939, Apr. 2016.
[8] Z. Zhang, E. Pasolli, M. M. Crawford, and J. C. Tilton, "An active learning framework for hyperspectral image classification using hierarchical segmentation," IEEE J. Sel. Topics Appl. Earth Observ. Remote Sens., vol. 9, no. 2, pp. 640–654, Feb. 2016.
[9] K. Q. Weinberger and L. K. Saul, "Distance metric learning for large margin nearest neighbor classification," J. Mach. Learn. Res., vol. 10, pp. 207–244, Feb. 2009.
[10] J. Goldberger, S. Roweis, G. Hinton, and R. Salakhutdinov, "Neighbourhood components analysis," in Proc. NIPS, 2004, pp. 1–8.
[11] J. V. Davis, B. Kulis, P. Jain, S. Sra, and I. S. Dhillon, "Information-theoretic metric learning," in Proc. ICML, 2007, pp. 209–216.
[12] H. Zheng, M. Wang, and Z. Li, "Audio-visual speaker identification with multi-view distance metric learning," in Proc. ICIP, 2010, pp. 4561–4564.
[13] B. Li, H. Chang, S. Shan, and X. Chen, "Coupled metric learning for face recognition with degraded images," in Proc. ACML, 2009, pp. 220–233.
[14] D. Zhai, H. Chang, S. Shan, X. Chen, and W. Gao, "Multiview metric learning with global consistency and local smoothness," ACM Trans. Intell. Syst. Technol., vol. 3, no. 3, pp. 1–22, May 2012.
[15] H. Zhang, N. M. Nasrabadi, T. S. Huang, and Y. Zhang, "Heterogeneous multi-metric learning for multi-sensor fusion," in Proc. ICIF, 2011, pp. 1–8.
[16] M. Dalla Mura, J. A. Benediktsson, B. Waske, and L. Bruzzone, "Extended profiles with morphological attribute filters for the analysis of hyperspectral data," Int. J. Remote Sens., vol. 31, no. 22, pp. 5975–5991, 2010.
[17] J. Jung, E. Pasolli, S. Prasad, J. C. Tilton, and M. M. Crawford, "A framework for land cover classification using discrete return LiDAR data: Adopting pseudo-waveform and hierarchical segmentation," IEEE J. Sel. Topics Appl. Earth Observ. Remote Sens., vol. 7, no. 2, pp. 491–502, Feb. 2014.
[18] T. Luo et al., "Active learning to recognize multiple types of plankton," J. Mach. Learn. Res., vol. 6, no. 4, pp. 589–613, 2005.
[19] E. Pasolli, F. Melgani, D. Tuia, F. Pacifici, and W. J. Emery, "SVM active learning approach for image classification using spatial information," IEEE Trans. Geosci. Remote Sens., vol. 52, no. 4, pp. 2217–2233, Apr. 2014.