Hyperspectral Image Classification with Mahalanobis Relevance Vector Machines

Gustavo Camps-Valls, Antonio Rodrigo-González, Jordi Muñoz-Marí, Luis Gómez-Chova, and Javier Calpe-Maravilla
Dept. Enginyeria Electrònica, Universitat de València. C/ Dr. Moliner, 50. 46100 Burjassot, València, Spain.
[email protected], http://www.uv.es/~gcamps

Abstract— This paper introduces the use of Relevance Vector Machines (RVMs) for remote sensing hyperspectral image classification. We also include the Mahalanobis kernel in the RVM formulation to take into account the covariance of the features in the classification process. Experimental results in different scenarios confirm the accuracy and robustness of the proposed method, as well as the ease of tuning its free parameters.
I. INTRODUCTION

In recent years, support vector machines (SVMs) [1], [2] have been successfully used in many application domains [3], and their applicability has also been demonstrated for hyperspectral image classification [4]–[8]. The properties of SVMs make them well suited to this problem, since they handle large input spaces efficiently and deal with noisy samples robustly [8], [9]. The method maps samples into a high-dimensional feature space, where a maximum-margin linear classifier is applied, yielding a decision function that is non-linear in the input space.

An alternative approach to obtaining non-linear kernel-based models is the Bayesian framework, which has produced remarkable successes such as Gaussian Processes and Relevance Vector Machines (RVMs) [10], [11]. In particular, the RVM constitutes a Bayesian approach to solving generalized linear classification and regression models [12]. The method not only provides accurate predictions but also enforces sparsity (simplicity) of the solution, and can produce confidence intervals for its predictions. Good trade-offs between accuracy and sparseness have been observed in many application domains [13]–[16]. In remote sensing, RVMs have recently been introduced for the prediction of biophysical parameters [17], [18], but their capabilities for remote sensing data classification have not been evaluated so far.

Being a kernel-based method, the key to obtaining good RVM classifiers is the definition of a suitable kernel function that properly represents the relations (similarities) among samples (pixels). Common choices are available, such as the radial basis function (RBF) kernel or the polynomial kernel. However, restricting ourselves to traditional kernels is unnecessarily limiting since, unlike with SVMs, the RVM does not require the kernel function to fulfill Mercer's conditions [19]. In addition, by using
traditional kernels, a second problem arises: there is no adaptation to the different intrinsic relevance of the spectral channels.

In this paper, we introduce the use of the RVM for the classification of hyperspectral images and evaluate its capabilities in ill-posed situations, i.e., with a low number of high-dimensional labeled pixels. In addition, we introduce the Mahalanobis distance kernel in the RVM formulation to take into account the covariance of the features in the classification process, and show that the proposed method allows easier tuning of the free parameters.

The rest of the paper is organized as follows. Section II reviews the standard formulation of the RVM for classification and introduces the Mahalanobis kernel. Section III presents the experimental results. Finally, Section IV gives some conclusions and outlines further work.

II. PROPOSED METHODOLOGY

The proposed method combines the non-linear RVM classifier with the Mahalanobis kernel. Both ingredients are analyzed in detail in the following subsections.

A. Notation and Preliminaries

Let us assume a labeled dataset of $n$ pixels $\{(\mathbf{x}_i, y_i)\}_{i=1}^{n}$, where $\mathbf{x}_i \in \mathbb{R}^N$, $N$ being the number of spectral channels, and $y_i \in \{0, 1\}$ are the corresponding output labels. The RVM classifier has the following structure:
$$ y = f(\mathbf{x}) = \sum_{i=1}^{n} w_i K(\mathbf{x}_i, \mathbf{x}) + w_0, \qquad (1) $$
where $K(\cdot,\cdot)$ is a kernel (similarity) function evaluated between the training samples and the test pixel, and $w_0$ is the bias of the decision function. From (1) it is clear that, in order to obtain a model, first the form of the kernel function $K$ must be defined and, second, the weights of the model, $\mathbf{w} = [w_0, w_1, \ldots, w_n]^{\top}$, need either to be estimated point-wise or to have a posterior distribution inferred over them. The following sections address these two issues.
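To make the structure of (1) concrete, the following minimal NumPy sketch (our own illustration, not code from the paper) evaluates the decision function for a batch of test pixels once the kernel values against the training samples are available:

```python
import numpy as np

def rvm_decision(K_test, w, w0):
    """Evaluate Eq. (1) for m test pixels at once.

    K_test: (m, n) array with entries K(x_i, x) between the n training
            samples and each test pixel x.
    w     : (n,) model weights w_1, ..., w_n.
    w0    : scalar bias.
    """
    return K_test @ w + w0  # y = sum_i w_i K(x_i, x) + w_0
```

Class probabilities are then obtained by passing the output $y$ through the sigmoid link introduced below.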
B. Relevance Vector Machine Formulation

Under the Bayesian probabilistic framework, a distribution over the weights is inferred rather than a point estimate [20]. Specifically, Bayes' rule states that the posterior probability of $\mathbf{w}$ is
$$ p(\mathbf{w}\,|\,\mathbf{y}, \boldsymbol{\alpha}) = \frac{p(\mathbf{y}\,|\,\mathbf{w}, \boldsymbol{\alpha})\, p(\mathbf{w}\,|\,\boldsymbol{\alpha})}{p(\mathbf{y}\,|\,\boldsymbol{\alpha})}, \qquad (2) $$
where $p(\mathbf{y}\,|\,\mathbf{w}, \boldsymbol{\alpha})$ is the likelihood, $p(\mathbf{w}\,|\,\boldsymbol{\alpha})$ is the prior governed by the hyperparameters $\boldsymbol{\alpha} = [\alpha_1, \ldots, \alpha_n]^{\top}$, and $p(\mathbf{y}\,|\,\boldsymbol{\alpha})$ is the evidence. Adopting a Bernoulli distribution for $p(y|\mathbf{x})$, we write the likelihood as
$$ p(\mathbf{y}\,|\,\mathbf{w}, \boldsymbol{\alpha}) = \prod_{i=1}^{n} f(y(\mathbf{x}_i; \mathbf{w}))^{y_i} \left[ 1 - f(y(\mathbf{x}_i; \mathbf{w})) \right]^{1 - y_i}, \qquad (3) $$
where a sigmoid link function, $f(y) = 1/(1 + e^{-y})$, is applied to the output $y(\mathbf{x})$ in order to obtain probabilistic outputs. The prior is chosen to be Gaussian:
$$ p(\mathbf{w}\,|\,\boldsymbol{\alpha}) = \prod_{i=1}^{n} \sqrt{\frac{\alpha_i}{2\pi}} \exp\!\left( -\frac{\alpha_i w_i^2}{2} \right). \qquad (4) $$

The marginal likelihood $p(\mathbf{y}\,|\,\boldsymbol{\alpha})$ cannot be obtained analytically by integrating out the weights from (3), so an iterative procedure is needed [13]. For a fixed $\boldsymbol{\alpha}$, the maximum a posteriori (MAP) solution $\mathbf{w}_{\text{MAP}}$ is obtained by maximizing $\log p(\mathbf{w}\,|\,\mathbf{y}, \boldsymbol{\alpha})$ or, equivalently, by minimizing the cost function
$$ L(\mathbf{w}\,|\,\mathbf{y}, \boldsymbol{\alpha}) = \sum_{i=1}^{n} \left[ \frac{\alpha_i w_i^2}{2} - y_i \log f(y(\mathbf{x}_i)) - (1 - y_i) \log\big(1 - f(y(\mathbf{x}_i))\big) \right]. \qquad (5) $$
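As a hedged sketch (assuming a design matrix Φ whose first column is a constant one for the bias w0, and one hyperparameter per weight including the bias, a common implementation choice not spelled out in the paper), the cost (5) can be coded directly:

```python
import numpy as np

def sigmoid(y):
    # Link function f(y) = 1 / (1 + exp(-y)).
    return 1.0 / (1.0 + np.exp(-y))

def cost(w, Phi, y, alpha):
    """Penalized negative log-likelihood of Eq. (5): Gaussian prior term
    plus Bernoulli cross-entropy. Phi: (n, n+1) design matrix (ones
    column for the bias, then phi_ij = K(x_i, x_j)); y in {0, 1}."""
    f = sigmoid(Phi @ w)
    eps = 1e-12  # guard against log(0)
    prior = 0.5 * np.sum(alpha * w**2)
    loglik = np.sum(y * np.log(f + eps) + (1 - y) * np.log(1 - f + eps))
    return prior - loglik
```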
The gradient of $L$ with respect to $\mathbf{w}$ is then
$$ \nabla L = \mathbf{A}\mathbf{w} + \boldsymbol{\Phi}^{\top} (\mathbf{f} - \mathbf{y}), \qquad (6) $$
where $\mathbf{A} = \mathrm{diag}\{\alpha_1, \ldots, \alpha_n\}$, $\mathbf{f} = [f(y(\mathbf{x}_1)), f(y(\mathbf{x}_2)), \ldots, f(y(\mathbf{x}_n))]^{\top}$, and the matrix $\boldsymbol{\Phi}$ has elements $\phi_{ij} = K(\mathbf{x}_i, \mathbf{x}_j)$. The Hessian of $L$ is
$$ \mathbf{H} = \nabla^2 L = \boldsymbol{\Phi}^{\top} \mathbf{B} \boldsymbol{\Phi} + \mathbf{A}, \qquad (7) $$
where $\mathbf{B} = \mathrm{diag}\{\beta_1, \ldots, \beta_n\}$ with $\beta_i = f(y(\mathbf{x}_i))\left[1 - f(y(\mathbf{x}_i))\right]$. The posterior is approximated around $\mathbf{w}_{\text{MAP}}$ by a Gaussian with covariance
$$ \boldsymbol{\Lambda} = \left( \mathbf{H} \big|_{\mathbf{w}_{\text{MAP}}} \right)^{-1} \qquad (8) $$
and mean
$$ \boldsymbol{\mu} = [\mu_1, \ldots, \mu_n]^{\top} = \boldsymbol{\Lambda}\, \boldsymbol{\Phi}^{\top} \mathbf{B} \mathbf{y} \big|_{\mathbf{w}_{\text{MAP}}}. \qquad (9) $$
The hyperparameters $\boldsymbol{\alpha}$ are iteratively updated using [13]
$$ \alpha_i^{\text{new}} = \frac{1 - \alpha_i^{\text{old}} \lambda_{ii}}{\mu_i^2}, \qquad (10) $$
where $\lambda_{ii}$ are the diagonal elements of $\boldsymbol{\Lambda}$.
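Putting Eqs. (5)–(10) together, training alternates a Newton (IRLS) search for w_MAP with the hyperparameter re-estimation. The following NumPy sketch is our own reading of that procedure, not the authors' implementation; production RVM code prunes the basis functions whose α_i diverge rather than merely capping them, as done here for brevity:

```python
import numpy as np

def rvm_train(Phi, y, n_iter=100, tol=1e-3, alpha_max=1e9):
    """Sketch of sparse Bayesian classification (Tipping, 2001).

    Phi: (n, m) design matrix (bias column plus kernel columns);
    y  : (n,) labels in {0, 1}.
    """
    n, m = Phi.shape
    alpha = np.ones(m)                               # initial hyperparameters
    w = np.zeros(m)
    for _ in range(n_iter):
        # Inner Newton loop: minimize Eq. (5) for the current alpha.
        for _ in range(25):
            f = 1.0 / (1.0 + np.exp(-(Phi @ w)))     # sigmoid link
            grad = alpha * w + Phi.T @ (f - y)       # gradient, Eq. (6)
            B = f * (1.0 - f)                        # Bernoulli variances
            H = (Phi.T * B) @ Phi + np.diag(alpha)   # Hessian, Eq. (7)
            step = np.linalg.solve(H, grad)
            w = w - step
            if np.max(np.abs(step)) < tol:
                break
        Lam = np.linalg.inv(H)                       # posterior covariance, Eq. (8)
        # Hyperparameter re-estimation, Eq. (10), with numerical guards.
        alpha_new = (1.0 - alpha * np.diag(Lam)) / np.maximum(w**2, 1e-12)
        alpha_new = np.clip(alpha_new, 1e-12, alpha_max)
        if np.max(np.abs(np.log(alpha_new / alpha))) < tol:
            alpha = alpha_new
            break
        alpha = alpha_new
    relevant = alpha < alpha_max                     # samples surviving as RVs
    return w, alpha, relevant
```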
Fig. 1. Classification maps obtained for a two-class problem with ellipsoidal, parallel class distributions. RVM models similar in accuracy and sparsity are depicted using (left) the RBF kernel (15.7% error rate, 18 RVs) and (right) the Mahalanobis kernel (14.9% error rate, 19 RVs). Red and black dots indicate the classes {0,1}, cyan dots mark the relevance vectors (RVs), the green line is the classification boundary, and the grey lines are the confidence intervals at p = 0.25 and p = 0.75.
The introduction of an individual hyperparameter for every weight in model (1) is the key feature of the RVM, and is ultimately responsible for its sparsity properties [13]. During the optimization process, many $\alpha_i$ tend to infinity, so the associated weights $w_i$ are effectively pruned, leading to a sparse solution. The (typically very few) remaining examples are called the Relevance Vectors (RVs), analogous to the SVs in the SVM framework.

C. The Mahalanobis kernel

The second key issue in any kernel-based algorithm is the selection of the kernel function. Many kernel functions are available in the literature, but the most common choice is the RBF kernel, defined as
$$ K(\mathbf{x}_i, \mathbf{x}_j) = \exp\!\left( -\frac{\|\mathbf{x}_i - \mathbf{x}_j\|^2}{2\sigma^2} \right), \qquad (11) $$
where $\sigma^2 \in \mathbb{R}^{+}$ represents the variance (length scale or width) and constitutes the free parameter to be tuned. The solution induced by this kernel can be cast as a density estimator in the input space, since it places a Gaussian on each sample $\mathbf{x}_i$ and weights its contribution to the final density by $w_i$ (cf. Eq. (1)). Thanks to the Representer theorem [1], the weighting coefficients $w_i$ correspond directly to the expansion coefficients of a weight vector in a classical linear model. Despite the good characteristics of this kernel [21], note that no explicit weight is assigned to the (in principle) different relevance of each band (feature). A possible remedy is to tune a different Gaussian width per feature, but this is overly heuristic and prohibitive in the hyperspectral scenario.¹

¹See [18] for an RVM regression algorithm in which a bio-optical heuristic is introduced to tune the σ parameter.

In this paper, we introduce the use of the Mahalanobis kernel (MK) [22], which is defined as
$$ K(\mathbf{x}_i, \mathbf{x}_j) = \exp\!\left( -\frac{1}{2\sigma^2} (\mathbf{x}_i - \mathbf{x}_j)^{\top} \mathbf{Q}^{-1} (\mathbf{x}_i - \mathbf{x}_j) \right), \qquad (12) $$
where $\mathbf{Q}$ is the covariance matrix estimated from the available training data. Note that this constitutes a non-linear generalization of the classical Mahalanobis metric through the kernel framework. The MK differs from the standard RBF kernel in that each axis of the input space has its own smoothing scale, i.e., a different scale at which differences are observed along each axis, and it thus easily accommodates non-spherical density estimates. See Fig. 1 for an illustrative toy example comparing RVMs with RBF and Mahalanobis kernels: both models are similar in terms of accuracy and complexity, but the boundary obtained with the MK-RVM is smoother and more reasonable.
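For illustration, Eq. (12) can be computed efficiently by whitening the data with a Cholesky factor of Q⁻¹, after which the Mahalanobis distance reduces to a Euclidean one. This is a hedged sketch; the small ridge added to Q for numerical stability is our own assumption, sensible in the high-dimensional hyperspectral setting:

```python
import numpy as np

def mahalanobis_kernel(X1, X2, Q_inv, sigma):
    """Eq. (12): K(x_i, x_j) = exp(-(x_i-x_j)^T Q^{-1} (x_i-x_j) / (2 sigma^2)).

    X1: (m, N), X2: (n, N) pixel matrices; Q_inv: inverse of the feature
    covariance estimated on the training data; sigma: kernel width.
    """
    # Whiten the data so the Mahalanobis distance becomes Euclidean.
    L = np.linalg.cholesky(Q_inv)        # Q_inv = L L^T
    Z1, Z2 = X1 @ L, X2 @ L
    sq = (np.sum(Z1**2, 1)[:, None] + np.sum(Z2**2, 1)[None, :]
          - 2.0 * Z1 @ Z2.T)             # pairwise squared distances
    return np.exp(-np.maximum(sq, 0.0) / (2.0 * sigma**2))

# Q would be estimated from the training pixels, with a small ridge for
# stability in the high-dimensional setting (our assumption):
# Q = np.cov(X_train.T) + 1e-6 * np.eye(X_train.shape[1])
# K = mahalanobis_kernel(X_train, X_train, np.linalg.inv(Q), sigma=1.0)
```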
III. EXPERIMENTAL RESULTS

This section analyzes the performance of the proposed method in several hyperspectral classification problems. We focus on accuracy and robustness in ill-posed scenarios.

A. Classification of crop covers

For our experiments, we used six hyperspectral images (700×670 pixels) acquired with the 128-band HyMap airborne spectrometer during the DAISEX-99 campaign [6]. This instrument provides 128 bands across the reflective solar wavelength region 0.4µm–2.5µm with contiguous spectral coverage (except in the atmospheric water-vapour absorption bands), bandwidths around 16 nm, very high signal-to-noise ratio, and a spatial resolution of 5 m. Training and validation sets were formed by 150 samples per class, and the best classifiers were selected by cross-validation. We compared the performance of the presented MK-RVM with an RBF-RVM, an RBF-SVM, and an RBF neural network. Table I shows the validation accuracies: the RVMs yield the best results, and a further gain is obtained by using the MK.

TABLE I
VALIDATION ACCURACIES FOR DIFFERENT KERNEL-BASED CLASSIFIERS. RESULTS ARE AVERAGED OVER THE SIX DATASETS.

Classifier | Parameters         | Accuracy [%]
RBF-NN     | λ = 10⁻², 6×36×6   | 83.00
RBF-SVM    | C = 18.29, σ = 707 | 97.78
RBF-RVM    | σ = 90             | 98.00
MK-RVM     | σ = 88             | 98.78
The presented classifier is also evaluated in ill-posed situations, in which only a percentage of the training samples is available. Results are shown in Fig. 2. The presented MK performs better than the standard RBF kernel, the difference being especially significant when few labeled samples are available (< 10%, ∼60 training samples). This performance is obtained, however, at the expense of a higher rate of relevance vectors.

A further analysis of the proposed method was conducted. Figure 3 shows the overall accuracy on the validation set as a function of the number of training samples and the tuned kernel width σ, for both the RBF kernel and the MK. It is worth noting that high accuracy rates in the error surface are more likely to be reached by the proposed method, suggesting that tuning the free parameter is easier.

Fig. 2. Results for the HyMAP image classification problem: overall accuracy (left), kappa statistic (middle), and rate of relevance vectors (right) as a function of the rate of training samples used to build the RBF-RVM (black) and MK-RVM (red).

Fig. 3. Overall accuracy on the validation set as a function of the number of training samples and the free parameter σ for the (left) RBF and (right) Mahalanobis kernels.
B. Classification of urban areas

In this section, the proposed MK-RVM classifier is tested on a high-resolution hyperspectral urban image of the area of Pavia, Italy, acquired by the DAIS 7915 airborne imaging spectrometer of DLR [23]. This is a challenging urban classification problem dominated by directional features at a relatively high spatial resolution (5-m pixels). The image has a size of 400×400 pixels (2000×2000 m), 40 bands, and 9 labeled classes (Fig. 4). Class labels and their numbers of samples are given in Table II.
Fig. 4. (a) RGB composition of the Pavia image (bands 14, 20 and 33), and (b) the true classification map.
Different values of the kernel width were tried on an exponential grid, σ ∈ {10⁻², …, 10²}, and we varied the ratio of training samples (drawn from an intentionally selected, very small subset of the whole image, of size 30×60 pixels). To test the classifiers, samples of all classes in the image were used.
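The protocol just described can be sketched as follows, reusing the mahalanobis_kernel and rvm_train sketches from Section II. X_tr, y_tr, X_val, y_val stand for hypothetical training and validation splits, and the grid spacing (nine half-decade steps) is our own assumption:

```python
import numpy as np

sigmas = np.logspace(-2, 2, 9)           # sigma in {10^-2, ..., 10^2}
best_oa, best_sigma = -np.inf, None

# Feature covariance from the training pixels (ridge: our assumption).
Q = np.cov(X_tr.T) + 1e-6 * np.eye(X_tr.shape[1])
Q_inv = np.linalg.inv(Q)

for sigma in sigmas:
    K_tr = mahalanobis_kernel(X_tr, X_tr, Q_inv, sigma)
    Phi = np.hstack([np.ones((len(y_tr), 1)), K_tr])    # bias + kernel columns
    w, alpha, keep = rvm_train(Phi, y_tr)
    rv_rate = 100.0 * np.mean(keep[1:])                 # rate of RVs [%]
    K_val = mahalanobis_kernel(X_val, X_tr, Q_inv, sigma)
    p = 1.0 / (1.0 + np.exp(-(K_val @ w[1:] + w[0])))   # Eq. (1) + sigmoid
    oa = 100.0 * np.mean((p > 0.5) == y_val)            # overall accuracy [%]
    if oa > best_oa:
        best_oa, best_sigma = oa, sigma
```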
TABLE II
DAIS 7915 PAVIA IMAGE CLASSES, LABELS AND NUMBER OF SAMPLES.

Class # | Name        | # of samples
1       | Water       | 4281
2       | Trees       | 2424
3       | Brick roofs | 2237
4       | Asphalt     | 1699
5       | Bare soil   | 1475
6       | Meadows     | 1245
7       | Bitumen     | 685
8       | Parking lot | 287
9       | Shadows     | 241
Figure 5 shows the evolution of the overall accuracy (OA[%]) and the rate of relevance vectors (RVs[%]) as a function of the ratio of training samples and the kernel parameter σ. The best overall accuracy using the proposed MK was 81.67%, obtained with σ = 1.29. These results are similar to those of an RBF-RVM (80.33%) and slightly inferior to those of an RBF-SVM (83.77%). Note, however, that the SVM needed 47.3% of the training samples as support vectors to achieve this result, whereas the MK-RVM yielded a very simple machine with only 11.3% of the training samples selected as relevance vectors. These results show that a remarkable compromise between accuracy and complexity is obtained by using the RVM. Last but not least, we should point out that a well-known shortcoming of the RVM is its computational cost, which scales almost cubically with the number of samples. Faster versions of the RVM will be needed in the future to make it operational in the remote sensing environment.
Fig. 5. (a) Overall accuracy, OA[%], and (b) rate of relevance vectors, RVs[%], as a function of the σ parameter and the rate of training samples used.
IV. CONCLUSIONS

This paper presented an evaluation of the RVM for the classification of hyperspectral images, paying special attention to ill-posed situations. We introduced the use of the Mahalanobis kernel in this context, leading to a more versatile, feature-adapted kernel. Good results have been obtained in a wide range of situations. Future work will address the design of other kernel functions that automatically adapt to the nature of the data distribution, and faster optimizations of the marginal likelihood through sequential addition and deletion of candidate basis functions.

ACKNOWLEDGMENTS

The authors would like to thank Prof. Paolo Gamba (University of Pavia, Italy) for providing the reference data for the Pavia dataset. This paper has been partly supported by the Spanish Ministry for Education and Science under project DATASAT ESP2005-07724-C05-03.
REFERENCES
[1] B. Schölkopf and A. Smola, Learning with Kernels: Support Vector Machines, Regularization, Optimization and Beyond, MIT Press, 2002.
[2] J. Shawe-Taylor and N. Cristianini, Kernel Methods for Pattern Analysis, Cambridge University Press, 2004.
[3] G. Camps-Valls, J. L. Rojo-Álvarez, and M. Martínez-Ramón, Eds., Kernel Methods in Bioengineering, Signal and Image Processing, Idea Group Publishing, Hershey, PA (USA), Jan 2007.
[4] J. A. Gualtieri and R. F. Cromp, "Support vector machines for hyperspectral remote sensing classification," in Proceedings of the SPIE, 27th AIPR Workshop, Feb. 1998, pp. 221–232.
[5] C. Huang, L. S. Davis, and J. R. G. Townshend, "An assessment of support vector machines for land cover classification," International Journal of Remote Sensing, vol. 23, no. 4, pp. 725–749, 2002.
[6] G. Camps-Valls, L. Gómez-Chova, J. Calpe, E. Soria, J. D. Martín, L. Alonso, and J. Moreno, "Robust support vector method for hyperspectral data classification and knowledge discovery," IEEE Transactions on Geoscience and Remote Sensing, vol. 42, no. 7, pp. 1530–1542, July 2004.
[7] F. Melgani and L. Bruzzone, "Classification of hyperspectral remote-sensing images with support vector machines," IEEE Transactions on Geoscience and Remote Sensing, vol. 42, no. 8, pp. 1778–1790, Aug 2004.
[8] G. Camps-Valls and L. Bruzzone, "Kernel-based methods for hyperspectral image classification," IEEE Transactions on Geoscience and Remote Sensing, vol. 43, no. 6, June 2005.
[9] G. Camps-Valls, L. Gómez-Chova, J. Muñoz-Marí, J. Vila-Francés, and J. Calpe-Maravilla, "Composite kernels for hyperspectral image classification," IEEE Geoscience and Remote Sensing Letters, vol. 3, no. 1, pp. 93–97, Jan 2006.
[10] M. Seeger, "Gaussian processes for machine learning," International Journal of Neural Systems, vol. 14, no. 2, pp. 69–106, 2004.
[11] C. E. Rasmussen and C. K. I. Williams, Gaussian Processes for Machine Learning, The MIT Press, 2006.
[12] M. E. Tipping, "The Relevance Vector Machine," in Advances in Neural Information Processing Systems 12, S. A. Solla, T. K. Leen, and K.-R. Müller, Eds., Cambridge, MA: MIT Press, 2000.
[13] M. E. Tipping, "Sparse Bayesian learning and the Relevance Vector Machine," Journal of Machine Learning Research, vol. 1, pp. 211–244, 2001.
[14] N. Nikolaev and P. Tino, "Sequential relevance vector machine learning from time series," in Proceedings of the International Joint Conference on Neural Networks, Montreal, Canada, Aug 2005, pp. 468–473.
[15] J. Quiñonero-Candela, Learning with Uncertainty: Gaussian Processes and Relevance Vector Machines, Ph.D. thesis, Technical University of Denmark, Informatics and Mathematical Modelling, Kongens Lyngby (Denmark), November 2004.
[16] G. Camps-Valls, M. Martínez-Ramón, J. L. Rojo-Álvarez, and J. Muñoz-Marí, "Nonlinear system identification with composite relevance vector machines," IEEE Signal Processing Letters, vol. 14, no. 4, pp. 279–282, April 2007.
[17] G. Camps-Valls, L. Gómez-Chova, J. Vila-Francés, J. Amorós-López, J. Muñoz-Marí, and J. Calpe-Maravilla, "Relevance vector machines for sparse learning of biophysical parameters," in SPIE International Symposium Remote Sensing XI, Bruges, Belgium, Sep 2005, vol. 5982.
[18] G. Camps-Valls, L. Gómez-Chova, J. Vila-Francés, J. Amorós-López, J. Muñoz-Marí, and J. Calpe-Maravilla, "Retrieval of oceanic chlorophyll concentration with relevance vector machines," Remote Sensing of Environment, vol. 105, no. 1, pp. 23–33, Nov 2006.
[19] J. Mercer, "Functions of positive and negative type and their connection with the theory of integral equations," Philosophical Transactions of the Royal Society of London, Series A, vol. CCIX, no. A456, pp. 215–228, May 1905.
[20] A. O'Hagan, Bayesian Inference, vol. 2B of Kendall's Advanced Theory of Statistics, Arnold, London, United Kingdom, 1994.
[21] S. S. Keerthi and C.-J. Lin, "Asymptotic behaviors of support vector machines with Gaussian kernel," Neural Computation, vol. 15, no. 7, pp. 1667–1689, 2003.
[22] R. Herbrich, Learning Kernel Classifiers: Theory and Algorithms, The MIT Press, 2002.
[23] German Aerospace Center (DLR), "DLR airborne image spectroscopy, DAIS 7915," 2001, http://www.op.dlr.de/DAIS/.