Bayesian Learning with Gaussian Processes for Supervised Classification of Hyperspectral Data

Kaiguang Zhao, Sorin Popescu, and Xuesong Zhang

Abstract
Recent advances in kernel machines promote the novel use of Gaussian processes (GP) for Bayesian learning. Our purpose is to introduce GP models into the remote sensing community for supervised learning, as exemplified in this study by classifying hyperspectral images. We first provided the mathematical formulation of GP models concerning both regression and classification; described several GP classifiers (GPCLs) and the automatic learning of kernel parameters; and then examined the effectiveness of GPCLs compared with K-nearest neighbor (KNN) and Support Vector Machines (SVM). Experimental results on an Airborne Visible/Infrared Imaging Spectrometer (AVIRIS) image indicate that the GPCLs outperform KNN and yield classification accuracies comparable to or even better than those of SVMs. This study shows that GP models, though with a larger computational scaling than SVMs, provide a competitive tool for remote sensing applications related to classification, and possibly regression, particularly with small or moderate sizes of training datasets.

Introduction
Supervised classification remains a crucial task in extracting thematic information from remotely sensed imagery. Along with the increasing availability of hyperspectral images, which have greatly enhanced the discrimination capability of remote sensing for more detailed mapping of land covers (Cochrane, 2000; Chang, 2002; Shippert, 2004), a concomitant need is arising for novel and effective techniques to analyze and classify these high-dimensional data, since most traditional classifiers, though effective for multispectral images, often show unsatisfactory performance for hyperspectral images due to the Hughes phenomenon, i.e., the curse of dimensionality (Landgrebe, 2002; Zhong et al., 2007). Among traditional image classification algorithms, the maximum likelihood classifier (MLC) is possibly the most notable method; it assumes a Gaussian distribution for each spectral class (Richards and Jia, 1999). Therefore, non-Gaussianity of the data could severely compromise the performance of the MLC, and the distribution parameters cannot be reliably estimated without enough training data (Hoffbeck and Landgrebe, 1996). Given adequate training samples, a remedy for non-Gaussianity is to use a mixture of Gaussians or nonparametric estimators (e.g., the Parzen window estimator) for inferring class-conditional distributions

(van der Heijden et al., 2004). Besides the MLC, other traditional classifiers, such as k-nearest neighbor (KNN), decision trees, and the Mahalanobis distance classifier, have also received wide attention for remote sensing classification (Duda et al., 2001). In the late 1980s, a renewed interest in neural networks (NN) fostered a substantial body of literature on NN classifiers for remote sensing data (Atkinson and Tatnall, 1997). Although most research indicated that NNs outperform traditional classifiers, some researchers reported applications where NNs produced worse results, and also recognized drawbacks of NNs such as slow convergence during training, a high risk of being trapped in local minima, and overfitting (Paola and Schowengerdt, 1995; Schowengerdt, 2006; Xie et al., 2007). More recently, the machine learning community has been dedicating substantial effort to kernel machines for supervised learning (Neal, 1998; Genton, 2001; Seeger, 2004; Rasmussen and Williams, 2006). Of all the relevant techniques, Support Vector Machines (SVM) represent a superior kernel-based learning algorithm that has already been introduced to classify remote sensing data. Both theoretical and empirical evidence suggests that SVMs remarkably improve classification accuracies over conventional approaches as well as NNs, especially when applied to hyperspectral data (Gualtieri and Cromp, 1998; Huang et al., 2002). Conceptually, SVM attempts to find separating hyperplanes in a kernel-induced feature space by locating the training points close to class boundaries; the search criterion is a trade-off between maximizing the separating margin and minimizing the errors of misclassification (Vapnik, 1998). In practice, the use of SVMs requires specifying a kernel as well as the kernel and trade-off parameters, which may be determined a priori or by a grid-searching approach. Such inconvenience in parameter tuning constitutes a notable disadvantage that discourages practitioners from using SVMs (Hsu et al., 2003). On the other hand, Gaussian Processes (GPs) represent another important class of kernel-based learning algorithms. Unlike SVMs, GPs bear a full Bayesian formulation, which enables explicit probabilistic modeling and makes results easily interpretable. More importantly, the Bayesian learning of GPs provides a paradigm to infer knowledge on kernel parameters through the update of prior beliefs on them to posterior distributions

Photogrammetric Engineering & Remote Sensing, Vol. 74, No. 10, October 2008, pp. 1223–1234.

Spatial Sciences Lab, Department of Ecosystem Science and Management, Texas A&M University, College Station 77843 ([email protected]).

© 2008 American Society for Photogrammetry and Remote Sensing


in light of training data, as opposed to the resistance of classical SVMs to an appropriate Bayesian framework (Chu et al., 2003). Successful applications of GPs have recently been reported in various fields for both regression and classification tasks (Bermak and Belhouari, 2006; Ferris et al., 2007). Although the attraction of GPs to the machine learning community is of only recent origin, as initiated by the observation that Bayesian regression using a NN with infinitely many hidden units boils down to a GP model (MacKay, 1997), GPs by themselves are far from new, as witnessed by the well-known "Kriging" method in geostatistics (Cressie, 1993). GPs are also called the Bayesian version of SVMs because the two approaches bear great resemblance to each other (Chu et al., 2003). Despite all these advantages, GPs do suffer from a major drawback: the training cost scales as O(n³) for a training set of n samples, which somewhat precludes their practical use for applications with large-scale training data. In fact, the same computational issue originally existed for training an SVM, but was later alleviated by the proposal of fast training algorithms (Csató and Opper, 2002). To speed up the training of GPs, a variety of efficient schemes have been proposed on the basis of various considerations (Rasmussen and Williams, 2006). The objective of this study is to introduce the Bayesian learning of GP models for remote sensing applications, specifically for classifying hyperspectral data. For this purpose, the rest of the paper is organized broadly into two parts: the first part presents an overview of general Bayesian GP modeling for supervised learning, including both regression and classification, and also deals with several practical aspects of the use of GP models, such as the learning of optimal kernel parameters, sparse GP models, and implementation issues; the second part conducts three experiments to evaluate the performance of GP classifiers (GPCLs) as applied to an Airborne Visible/Infrared Imaging Spectrometer (AVIRIS) image, and compares the classification results of GPCLs with those of KNN and SVMs in terms of several criteria such as overall accuracies and computational complexities.

Formulation of Bayesian GP Models for Supervised Learning
Supervised learning tends to uncover a functional relationship f(x) from a training dataset $\mathcal{D} = \{(\mathbf{x}_i, y_i)\}_{i=1}^{n}$ of n pairs of observations, where $\mathbf{x}_i \in R^d$ is the input vector and $y_i$ the associated target value. In this study, $R^d$ is the d-dimensional feature space of hyperspectral bands. To make inference on f(·) within a Bayesian framework, we must first elicit prior knowledge about the possible forms of f(·) in terms of a probability distribution over a function space. In Bayesian GP models, such a prior distribution is chosen as a GP,

$$f \sim p(f) = \mathrm{GP}[m(\mathbf{x}), k(\mathbf{x}, \mathbf{x}')] \qquad (1)$$

where a GP is fully specified by a mean function m(x) and a covariance function k(x, x'), i.e., E[f(x)] = m(x) and Cov[f(x), f(x')] = k(x, x') (Cressie, 1993). Moreover, a finite collection of function values indexed by any x's also assumes a multivariate Gaussian distribution with its mean and covariance directly calculated from m(·) and k(·, ·), respectively. For example, the function values f(X) = [f(x₁), f(x₂), . . . , f(xₙ)]ᵀ at the observed inputs X = [x₁, x₂, . . . , xₙ]ᵀ have a Gaussian distribution p(f|X) with a mean vector m(X) = [m(x₁), m(x₂), . . . , m(xₙ)]ᵀ and an n × n covariance matrix K(X, X) with Kᵢⱼ = k(xᵢ, xⱼ). Of note is that in the machine learning community, the covariance function k(·, ·) is often called a kernel function or a kernel. A kernel is often parameterized by a set of parameters θ, i.e., k(x, x'; θ). Without loss of generality, hereafter we assume m(x) = 0. Thus, to specify a GP prior of Equation 1, we only need to select a type of kernel function and specify the kernel parameters θ. Intuitively, a GP prior means that before looking at the training data, we believe that possible forms of f(x) are random realizations drawn from the prior p(f). An example for a 1-D f(x) is illustrated in Figure 1a, where the three curves are random draws from a GP prior p(f) that uses the Gaussian kernel of Equation 11 with parameters (σ₀, l) being (0.5, 1.0). Before taking into account the training data (the circles), the three curves are equally probable a priori as candidates for f.

Figure 1. Demonstration of GP prior and Bayesian Gaussian Processes regression (GPR): (a) Three random realizations (curves), drawn from a GP prior with the Gaussian covariance function or kernel, serve as possible candidates of the underlying function that needs to be inferred from the training data (the circles), and (b) the posterior mean curve is predicted in light of the observed data (the circles), with the shaded area representing the pointwise 95 percent prediction interval. Note that the predicted mean curve may not pass through the data points because the training data are subject to noises. Also, note that predictions at locations further away from training points are more uncertain, as indicated by the larger width of the shaded area, a natural phenomenon for almost all supervised learning algorithms.

In light of the training data D, the GP prior of f can be updated to a posterior distribution, i.e., p(f|D). However, the observations $\mathbf{y} = [y_i]_{i=1}^{n}$ generally do not equal the true function values f(X), either due to measurement errors or the fact that in classification problems y takes discrete labels while f(x) takes continuous values; therefore, an observation or "noise" model p(y|f) needs to be specified to define the probability of observing y when the true value is f. Then, according to the Bayes theorem, p(y|f) can update the GP prior p(f|X) to the posterior distribution as,

$$p(\mathbf{f} \mid \mathcal{D}, \theta) = \frac{p(\mathbf{y} \mid \mathbf{f})\, p(\mathbf{f} \mid X, \theta)}{p(\mathcal{D} \mid \theta)} \qquad (2)$$

where θ, the set of kernel parameters, is made explicit to show the inference's dependence on it; p(y|f) is also known as the likelihood in that it is a function of the unknown f for a fixed set of observed y, and p(D|θ) is called the marginal likelihood since it is a function of θ given D. In the case that a confident value of θ cannot be obtained a priori, a sensible way is to learn an optimal value θ̂ in favor of the training data D by maximizing p(D|θ) with respect to θ. Be aware, however, that the desideratum of learning an informative value of θ rather than an unreliable guess is achieved at the expense of extra computation for the involved optimization. Once the posterior p(f|D, θ̂) is available, prediction at a new input x* can be made by (Rasmussen and Williams, 2006):

$$p(f_* \mid \mathbf{x}_*, \mathcal{D}, \hat{\theta}) = \int p(f_* \mid \mathbf{f}, \mathbf{x}_*, \hat{\theta})\, p(\mathbf{f} \mid \mathcal{D}, \hat{\theta})\, d\mathbf{f} \qquad (3)$$

which, in conjunction with the "noise" model, gives the predictive distribution of y*:

$$p(y_* \mid \mathbf{x}_*, \mathcal{D}, \hat{\theta}) = \int p(y_* \mid f_*)\, p(f_* \mid \mathbf{x}_*, \mathcal{D}, \hat{\theta})\, df_* \qquad (4)$$
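To make the notion of a GP prior concrete, the short sketch below (an illustrative Python/NumPy example, not part of the original study; the inputs, random seed, and parameter values are arbitrary) draws random realizations of a 1-D function from a zero-mean GP prior with the Gaussian kernel of Equation 11, in the spirit of Figure 1a.

```python
import numpy as np

def gaussian_kernel(X1, X2, sigma0=0.5, length=1.0):
    """Gaussian (RBF) covariance of Equation 11: sigma0^2 * exp(-||x - x'||^2 / l^2)."""
    d2 = ((X1[:, None, :] - X2[None, :, :]) ** 2).sum(axis=-1)  # pairwise squared distances
    return sigma0 ** 2 * np.exp(-d2 / length ** 2)

# 1-D inputs at which the random functions are evaluated.
x = np.linspace(0.0, 5.0, 100).reshape(-1, 1)
K = gaussian_kernel(x, x)

# Three random realizations f ~ N(0, K), i.e., three prior candidate functions.
rng = np.random.default_rng(0)
samples = rng.multivariate_normal(np.zeros(len(x)), K + 1e-10 * np.eye(len(x)), size=3)
print(samples.shape)  # (3, 100)
```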

The general formulation of Bayesian GP models in Equations 1 through 4 can be used to deal with both classification and regression problems, depending on the choice of the likelihood model p(y|f). Before proceeding to GP classification, it is beneficial to first consider GP regression (GPR) problems because GPR could either be employed directly as a least squares classifier or be extended for classification by transforming regression outputs to class probabilities through a sigmoid response function, as shall be seen soon.

GP Regression (GPR) and Least Squares Classifiers
Bayesian GPR assumes the Gaussian (normal) noises for the likelihood model, i.e., p(y|f) = N(f, σ²), with σ² denoting the variance of the independent and identically-distributed noises. This very Gaussianity of p(y|f) makes the relevant posterior inference analytically tractable. As such, relevant distributions, such as the posterior p(f|D, θ) in Equation 2 and the predictive distributions of f* and y* in Equations 3 and 4, will all boil down to Gaussian ones. Specifically, the posterior mean and variance for f* of Equation 3 can be explicitly expressed in terms of D and k(·, ·) as follows,

$$m_{f_*} = k(\mathbf{x}_*, X)\,[K(X, X) + \sigma^2 I]^{-1}\,\mathbf{y} \qquad (5)$$

$$\mathrm{Var}(f_*) = k(\mathbf{x}_*, \mathbf{x}_*) - k(\mathbf{x}_*, X)\,[K(X, X) + \sigma^2 I]^{-1}\,k(X, \mathbf{x}_*) \qquad (6)$$

where I is the n × n identity matrix, k(x*, X) is a 1 × n vector with its i-th element being k(x*, xᵢ), and we simply have k(X, x*) = k(x*, X)ᵀ (MacKay, 1997). To further illustrate GP posterior inference, Figure 1a shows that although the three curves are equally probable a priori, the dotted-dash curve (prior curve #1) favors the data points more than the other two and thus possesses a higher posterior probability of being the true function. Figure 1b depicts the posterior mean and variance that have been predicted with Equations 5 and 6 from the training data.
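As a minimal illustration of Equations 5 and 6 (an assumed sketch rather than the authors' implementation; the kernel choice and noise variance are placeholders), the posterior mean and variance at new inputs can be computed as follows, reusing the gaussian_kernel function from the previous sketch.

```python
import numpy as np

def gpr_predict(X_train, y_train, X_new, kernel, noise_var=0.1):
    """Posterior mean (Equation 5) and variance (Equation 6) of zero-mean GP regression."""
    K = kernel(X_train, X_train) + noise_var * np.eye(len(X_train))   # K(X, X) + sigma^2 I
    K_inv = np.linalg.inv(K)          # explicit inverse for clarity; a Cholesky solve is preferable
    K_s = kernel(X_new, X_train)      # rows are k(x*, X)
    mean = K_s @ K_inv @ y_train      # Equation 5
    var = kernel(X_new, X_new).diagonal() - np.einsum('ij,jk,ik->i', K_s, K_inv, K_s)  # Equation 6
    return mean, var
```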

GPR models of Equations 5 and 6 can be used directly as least squares classifiers to solve classification problems in the same spirit as the early work of NNs (Richards and Jia, 1999). To illustrate, consider a binary classification problem with two classes {+, −}. First, the class labels are transformed to numerical values by assigning two real numbers y₊ and y₋ to the "+" class and the "−" class, respectively; one such choice is y₊ = 1.0 and y₋ = −1.0, which was employed for the subsequent experiments in this study. Then, after fitting a GPR model to the transformed values of ±1.0, the class label of a new input x*, C(x*), can be determined by thresholding f*, e.g., C(x*) = sgn[f*]. Of note are that the predicted value f* may fall outside the interval [y₋, y₊], and also that such an approach lacks a probabilistic interpretation. A remedy is to employ a sigmoid function to squash the GPR outputs into [0, 1] to get class probabilities, as also used for SVMs (Platt, 2000).
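A least-squares GP classifier then follows directly from the regression sketch above: the two classes are coded as +1.0 and −1.0, a GPR model is fit to these codes, and a new pixel is labeled by the sign of the predictive mean (illustrative only; the function and argument names are hypothetical).

```python
import numpy as np

def gp_ls_classify(X_train, labels, X_new, kernel, noise_var=0.1):
    """Binary least-squares GP classifier: regress on +/-1 targets, then threshold at zero."""
    y = np.where(labels == labels.max(), 1.0, -1.0)              # encode the two classes as +1.0 / -1.0
    mean, _ = gpr_predict(X_train, y, X_new, kernel, noise_var)  # from the previous sketch
    return np.sign(mean)                                         # C(x*) = sgn[f*]
```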

GP Classification (GPC)
GPC refers to GP models that directly formulate classification problems, and it often needs a sigmoid function as the likelihood model in Equations 1 through 4 for explicitly modeling the class probabilities p(y|f). The role of sigmoid functions is to squash continuous values of the latent function f(x) into the unit interval [0, 1], and the common choices for such sigmoid functions are the logistic and probit functions. The logistic function is written as:

$$\tau_{\mathrm{logit}}(f) = 1/[1 + \exp(-f)], \qquad (7)$$

and the probit function τ_probit(f) is simply the standard normal cumulative distribution function. For binary classification, the likelihood term becomes p(y|f) = τ(y·f), where y ∈ {−1, +1}. The GPC posterior p(f|D, θ), however, is no longer a GP due to the non-Gaussianity of τ(y·f), thus making exact analytical posterior inference impossible for GPC. In such a case, Markov Chain Monte Carlo (MCMC) sampling is a standard procedure for posterior inference, but it is computationally daunting. Instead, we employ two widely used numerical approximation algorithms, i.e., Laplace's method (LP) and the Expectation-Propagation (EP) algorithm, for obtaining an approximating GP posterior to the true non-GP posterior p(f|D, θ). The approximated posterior q(f|D, θ), due to its Gaussianity, can then be incorporated into Equations 3 and 4 for analytical posterior inference. Next, we provide brief descriptions of the two numerical methods. The LP method expands p(f|D, θ) around its mode f̂ using a second-order Taylor expansion to obtain an approximating Gaussian q(f|D, θ), i.e.,

$$p(\mathbf{f} \mid \mathcal{D}, \theta) \approx q(\mathbf{f} \mid \mathcal{D}, \theta) = N(\mathbf{f} \mid \hat{\mathbf{f}}, H^{-1}) \propto \exp\!\left(-\tfrac{1}{2}(\mathbf{f} - \hat{\mathbf{f}})^{T} H (\mathbf{f} - \hat{\mathbf{f}})\right) \qquad (8)$$

where $\hat{\mathbf{f}} = \arg\max_{\mathbf{f}}\, p(\mathbf{f} \mid \mathcal{D}, \theta)$ and $H = -\nabla\nabla_{\mathbf{f}} \log p(\mathbf{f} \mid \mathcal{D}, \theta)\big|_{\mathbf{f}=\hat{\mathbf{f}}}$ is the Hessian matrix of the negative log posterior at the mode (Williams and Barber, 1998). The mode f̂ can be found by a gradient-based optimizer. In contrast to the LP method, the EP algorithm is more complicated in that it attempts to iteratively find a GP approximation to the true posterior p(f|D, θ) based on sequential moment matching between approximated marginal distributions. Interested readers are referred to Rasmussen and Williams (2006) for more details on the implementation of the two methods. In general, both theoretical and empirical evidence has shown that the EP algorithm performs better than the LP method in terms of predictive distributions and marginal likelihood estimates, but at the expense of more computation (Kuss and Rasmussen, 2005). However, the LP and EP methods are found to be comparable if only the prediction of class labels is of concern (Kuss and Rasmussen, 2005).
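To make the Laplace approximation more tangible, the following sketch implements the standard Newton iteration for locating the posterior mode f̂ of Equation 8 under the logistic likelihood of Equation 7, following the numerically stable update described in Rasmussen and Williams (2006); it is an illustrative NumPy version, not the authors' code, and K, y, and the iteration count are assumed inputs.

```python
import numpy as np

def laplace_mode(K, y, n_iter=20):
    """Newton iteration for the mode f_hat of a binary GPC posterior with the
    logistic likelihood; also returns the diagonal W = -d^2 log p(y|f)/df^2
    used in the Gaussian approximation of Equation 8."""
    n = len(y)
    f = np.zeros(n)
    t = (y + 1) / 2.0                      # map labels from {-1, +1} to {0, 1}
    for _ in range(n_iter):
        pi = 1.0 / (1.0 + np.exp(-f))      # sigmoid of the current latent values
        W = pi * (1.0 - pi)                # negative Hessian of the log likelihood (diagonal)
        grad = t - pi                      # gradient of the log likelihood
        b = W * f + grad
        # Stable Newton step: f <- K (b - W^1/2 B^-1 W^1/2 K b), with B = I + W^1/2 K W^1/2.
        sw = np.sqrt(W)
        B = np.eye(n) + sw[:, None] * K * sw[None, :]
        a = b - sw * np.linalg.solve(B, sw * (K @ b))
        f = K @ a
    return f, W
```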

Sparse GP Models
A typical GP model, either GPR or GPC, involves all training data for inference, as indicated in Equation 4. Inspired by the sparsity of SVMs, it is desirable to obtain a sparse GP model as well, which uses only part of the training data, i.e., the support vectors (SVs), to speed up model training without significant losses of inference accuracy. To help illustrate, let us first turn to the regularization-based formulation of kernel machines: learning a kernel-based model of SVMs or GPs is equivalent to minimizing the functional

$$J(f) = \frac{C}{2}\,\|f\|_{H}^{2} + \sum_{i=1}^{n} L(y_i, f_i) \qquad (9)$$

where $\|f\|_{H}^{2} = \mathbf{f}^{T} K^{-1} \mathbf{f}$ is a regularization term, L(y, f) is the loss function, and C is a scalar parameter that controls the trade-off between the two terms (Evgeniou et al., 2000). For the loss term L(y, f), GPs use the negative log-likelihood function −log p(yᵢ|fᵢ), while SVMs use the hinge loss function (1 − yᵢ·fᵢ)₊. As depicted in Figure 2, the hinge loss function enjoys the sparsity property in that the training samples with zero loss, i.e., those with yᵢ·fᵢ ≥ 1, can be discarded without influencing the training of SVMs; however, this function resists a probabilistic interpretation and thus cannot be used as a likelihood term in GPC. Conversely, the probit and logistic functions, though being likelihood terms in GPC, gain no sparsity due to their long, diminishing yet non-zero tails. As a remedy, the trigonometric loss function proposed by Chu et al. (2003) has the merit of being sparse and being usable in GPC at the same time, and its associated likelihood function is expressed as:

$$p_{TL}(y \mid f) = \begin{cases} 0 & y \cdot f \in (-\infty, -1] \\ \cos^{2}\!\big(\pi (1 - y \cdot f)/4\big) & y \cdot f \in (-1, 1) \\ 1 & y \cdot f \in [1, +\infty) \end{cases} \qquad (10)$$

which gives rise to a flat non-penalizing or zero-loss zone that overlaps with the hinge function of SVMs, i.e., −log(p_TL(y|f)) = 0 for y·f ≥ 1. For GPC models with the trigonometric loss function, the minimization of Equation 9 becomes a convex programming problem, which enables us to learn a sparse GPC model using the sequential minimal optimization (SMO) algorithm that has been widely used for the fast training of SVMs. The basic idea of SMO is to decompose the original convex programming problem into a series of sub-optimizations in an iterative scheme, and more details on SMO can be found in Platt (1999).
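For reference, the trigonometric likelihood of Equation 10 and its negative log loss can be written down in a few lines (an illustrative sketch; the function names are ours). The flat zero-loss region for y·f ≥ 1 is what yields the sparsity discussed above.

```python
import numpy as np

def trig_likelihood(y, f):
    """Trigonometric likelihood p_TL(y|f) of Equation 10, for labels y in {-1, +1}."""
    z = y * f
    return np.where(z <= -1.0, 0.0,
           np.where(z >= 1.0, 1.0, np.cos(np.pi * (1.0 - z) / 4.0) ** 2))

def trig_loss(y, f):
    """Negative log likelihood: zero whenever y*f >= 1, infinite for y*f <= -1."""
    with np.errstate(divide='ignore'):
        return -np.log(trig_likelihood(y, f))
```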

Model Selection of GPs
As stated above, a GP prior is fully specified by choosing a kernel function and its parameters θ. Therefore, given a kernel type, the selection of an appropriate GP model boils down to determining sensible values of θ. GPs offer a natural Bayesian paradigm to learn the kernel parameters θ from the training data D, as opposed to SVMs, for which the kernel and trade-off parameters are usually determined empirically or by a cross-validation approach.

Kernels and the Automatic Relevance Determination (ARD)
Several popular kernels for machine learning are the linear, polynomial, Gaussian, and sigmoid kernels. In GPs, a kernel is parameterized by a set of hyperparameters θ, as symbolized in Equation 2; for example, the Gaussian kernel has two hyperparameters θ = [σ₀, l], i.e.,

$$k_{\mathrm{gaussian}}(\mathbf{x}, \mathbf{x}'; \sigma_0, l) = \sigma_0^{2} \exp\!\left(-\frac{\|\mathbf{x} - \mathbf{x}'\|^{2}}{l^{2}}\right) \qquad (11)$$

which is often called the radial basis function in SVMs. In GPs, a kernel is a covariance function, thus rendering θ more interpretable. For example, l in Equation 11 is the characteristic length that determines a distance in the input space beyond which function values become less relevant. Moreover, if the characteristic length varies with input dimensions (i.e., input bands), we arrive at an interesting kernel known as the Automatic Relevance Determination (ARD):

Figure 2. Four different loss functions in the regularization formulation of kernel machines for classification problems: The logistic, probit, and trigonometric loss functions are used in GPC, and the hinge one is used by SVMs. Notice that the hinge and trigonometric loss functions have no penalty (zero loss) over y·f ≥ 1, which leads to a sparse model involving only support vectors, i.e., those data points associated with non-zero loss, in the inference, whereas the other two loss functions have non-zero decreasing penalties that entail all training data in a model.


$$k_{\mathrm{ARD}}(\mathbf{x}, \mathbf{x}'; \sigma_0, l_1, l_2, \ldots, l_d) = \sigma_0^{2} \exp\!\left(-\sum_{i=1,\ldots,d} \frac{(x_i - x'_i)^{2}}{l_i^{2}}\right) \qquad (12)$$

where the magnitude of lᵢ indicates the inference ability of the i-th input dimension, and large values of the lᵢ's can effectively downplay or eliminate the associated irrelevant input dimensions (Neal, 1996). Therefore, the ARD provides a parameterization scheme for automatic feature reduction, especially when tackling high-dimensional problems such as classification of hyperspectral data. Another kernel that gains special interest is the Neural Network (NN) covariance function. Its typical form as derived by Neal (1996) is:

$$k_{\mathrm{NN}}(\mathbf{x}, \mathbf{x}'; \sigma_0, \tilde{\Sigma}) = \sigma_0^{2} \sin^{-1}\!\left(\frac{2\,\tilde{\mathbf{x}}^{T} \tilde{\Sigma}\, \tilde{\mathbf{x}}'}{\sqrt{\big(1 + 2\,\tilde{\mathbf{x}}^{T} \tilde{\Sigma}\, \tilde{\mathbf{x}}\big)\big(1 + 2\,\tilde{\mathbf{x}}'^{T} \tilde{\Sigma}\, \tilde{\mathbf{x}}'\big)}}\right) \qquad (13)$$

where $\tilde{\mathbf{x}} = [1, x_1, x_2, \ldots, x_d]^{T}$, and $\tilde{\Sigma}$ denotes a covariance matrix that may take a structural parameterization. The importance of the NN kernel lies in that a GP model with this kernel is equivalent to an infinite Bayesian NN with the probit transfer function.
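The ARD kernel of Equation 12 is straightforward to evaluate; the sketch below (illustrative, with assumed argument names) makes explicit how a large characteristic length lᵢ suppresses the contribution of band i, and the isotropic Gaussian kernel of Equation 11 is recovered when all length scales are equal.

```python
import numpy as np

def ard_kernel(X1, X2, sigma0, lengths):
    """ARD covariance of Equation 12: one characteristic length per input band."""
    lengths = np.asarray(lengths)                        # shape (d,), one l_i per dimension
    diff = (X1[:, None, :] - X2[None, :, :]) / lengths   # per-band scaled differences
    # A band with a very large l_i contributes almost nothing to the exponent,
    # so that band is effectively downplayed (automatic feature reduction).
    return sigma0 ** 2 * np.exp(-(diff ** 2).sum(axis=-1))
```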

Learning Kernel Parameters
In GP models, sensible values of the kernel parameters or hyperparameters θ can be learned using Bayesian approaches. According to the Bayes theory, the posterior of θ in light of the training data D is modeled by:

$$p(\theta \mid \mathcal{D}) = p(\mathcal{D} \mid \theta)\, p(\theta)\, /\, p(\mathcal{D}) \qquad (14)$$

where the marginal likelihood p(D|θ) is given by:

$$p(\mathcal{D} \mid \theta) = \int p(\mathbf{y} \mid \mathbf{f})\, p(\mathbf{f} \mid \theta)\, d\mathbf{f}. \qquad (15)$$

The maximum a posteriori (MAP) estimate of p(θ|D), i.e., θ̂ = arg maxθ p(θ|D), can be used as the optimal value of θ. Because p(θ) is often assumed to be flat due to the lack of prior information, the maximization of p(θ|D) reduces to that of the marginal likelihood p(D|θ). This procedure is called type II maximum likelihood (ML-II) (Rasmussen and Williams, 2006). In GPR models, p(D|θ) is analytically available due to the Gaussianity of p(y|f), whereas in GPC models it is approximated by using the Laplace or EP algorithm. In practice, the maximization of p(D|θ) for θ̂ can be carried out with a general gradient-based optimizer. However, such a global optimization process may be trapped in local maxima, especially if there is a large number of hyperparameters (e.g., when using the ARD for feature selection over high-dimensional inputs), and a common practice for finding a better optimum is to perform the optimization several times with different random initial values and to select the trial with the highest marginal likelihood (Chu et al., 2003).
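As a sketch of the ML-II procedure for the analytically tractable GPR case (illustrative only; in the GPC experiments of this paper the marginal likelihood is instead approximated by the LP or EP algorithm), the log marginal likelihood of Equation 15 under a Gaussian kernel and Gaussian noise can be evaluated and maximized with a general-purpose gradient-based optimizer. Working with log-transformed hyperparameters keeps them positive.

```python
import numpy as np
from scipy.optimize import minimize

def log_marginal_likelihood(theta, X, y):
    """GPR log marginal likelihood log p(D|theta) for the Gaussian kernel of Equation 11;
    theta = [log sigma0, log length, log noise_std]."""
    sigma0, length, noise = np.exp(theta)
    d2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=-1)
    K = sigma0 ** 2 * np.exp(-d2 / length ** 2) + noise ** 2 * np.eye(len(X))
    L = np.linalg.cholesky(K)
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, y))   # (K + noise^2 I)^-1 y
    return (-0.5 * y @ alpha
            - np.log(np.diag(L)).sum()                    # -0.5 * log|K|
            - 0.5 * len(X) * np.log(2.0 * np.pi))

def fit_hyperparameters(X, y, theta0=(0.0, 0.0, 0.0)):
    """ML-II: maximize the marginal likelihood (minimize its negative) over theta."""
    res = minimize(lambda th: -log_marginal_likelihood(th, X, y),
                   np.asarray(theta0), method='L-BFGS-B')
    return np.exp(res.x)   # sigma0, length, noise_std at the optimum
```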

Implementation Issues
The direct implementation of GPs requires O(n³) computation for training due to the inversion of the n × n Gram matrix K, which makes GPs less suitable for large-scale training data (e.g., n > 10,000) (Rasmussen and Williams, 2006). So far, there have been many proposals for fast and efficient GP learning, such as the sparse model with the trigonometric loss function discussed above, sparse online learning, and the informative vector machine, among others (Csató and Opper, 2002; Lawrence et al., 2003; Kuss and Rasmussen, 2005). The typical strategy for fast and efficient GP learning is to select a small representative subset according to certain selection criteria or, in the simplest case, to make a random selection. In this study, except for the sparse GPC model with the trigonometric loss function, we made no attempt to evaluate any complicated strategies for efficient GPs. The major overhead in using a GP or SVM is the computational cost of learning hyperparameters (Chu et al., 2003), as illustrated later by the experimental results. To reduce this cost, it is practical and effective to first search for optimal hyperparameters with more iterations on a small randomly selected subset to obtain good initial values for the next learning phase, which uses fewer iterations on a larger random subset; this procedure continues until the entire dataset is used, and the final optimum learned from the full dataset is then used for posterior inference and predictions. The aforementioned binary classification framework can be theoretically extended to multi-class problems; in practice, however, strategies exist for reducing a multi-class problem into a series of binary sub-problems that, when combined together, reconstruct the original problem (Hsu et al., 2002). Two such schemes are the one-against-all (OAA) and one-against-one (OAO) approaches. Specifically, consider an N-class problem.

1. The OAA approach constructs each binary problem by labeling one of the N classes as +1 and all the remaining classes as −1, thus resulting in N binary classifiers in total. In prediction, the class label of a new data point is assigned to the "+1" class of the binary classifier with the highest "+1" probability.
2. The OAO approach builds N(N − 1)/2 binary classifiers in total, one for each combinatorial pair of the N classes. In prediction, each binary classifier gives a vote to the winning class, and the class label is chosen as the one with the highest accumulated votes.

Both the OAA and OAO approaches employ the "winner-takes-all" strategy. In case of a tie, i.e., when two or more classes have the same score in terms of probability or votes, the class with the highest prior probability can be used for class label assignment, and the prior probability can be estimated simply as the ratio of the sample size of a class to the entire sample size if training samples are collected randomly and independently (Melgani and Bruzzone, 2004). Although the OAO comprises a larger ensemble of classifiers than the OAA, training with the OAO is usually faster because the training sample of an OAO binary classifier is usually smaller than that of an OAA classifier. It was also reported previously that, for classification using SVMs, the OAO yielded the best accuracy among several multi-class schemes including the OAA (Melgani and Bruzzone, 2004). Hence, this study adopted the OAO scheme for both GP and SVM classification in the following experiments.
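The OAO "winner-takes-all" rule with the prior-based tie breaking described above can be summarized in a short sketch (illustrative; the data structures and names are our own assumptions, not the paper's implementation).

```python
import numpy as np

def oao_predict(pair_probs, class_priors):
    """One-against-one voting for an N-class problem.

    pair_probs: dict mapping a class pair (i, j) to the probability that the
    pixel belongs to class i rather than class j (one entry per pair).
    class_priors: prior probability of each class, used only to break ties.
    """
    n_classes = len(class_priors)
    votes = np.zeros(n_classes)
    for (i, j), p_i in pair_probs.items():
        votes[i if p_i >= 0.5 else j] += 1           # each binary classifier casts one vote
    winners = np.flatnonzero(votes == votes.max())   # classes with the highest vote count
    if len(winners) == 1:
        return int(winners[0])
    # Tie: fall back to the candidate class with the highest prior probability.
    return int(winners[np.argmax(np.asarray(class_priors)[winners])])
```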

Experiments
The hyperspectral dataset used for assessment of GPC performance is a 145 pixel × 145 pixel scene of an AVIRIS image with 220 bands, 20 of which are badly contaminated by atmospheric absorption (Figure 3). This image was acquired in June 1992 over the Indian Pines area of northern Indiana, and is available online on the FTP site of the Laboratory for Applications of Remote Sensing (LARS) at Purdue University. This image has also been investigated for testing algorithms in several studies (Gualtieri and Cromp, 1998; Melgani and Bruzzone, 2004). The scene mainly comprises forests and agricultural fields with several different crops. The companion ground reference data include 16 classes in total, but seven of them have very limited numbers of pixels; therefore, following Melgani and Bruzzone (2004), only the remaining nine classes are used for the subsequent experiments. A summary of the retained classes and their sample sizes is

Figure 3. The AVIRIS hyperspectral image used for our experiments: (a) Band No. 100 and (b) Band No. 1 that appears noisy.


TABLE 1. LABELS AND SIZES OF NINE CLASSES IN THE CLASSIFICATION SCHEME

Class   Class Label          Size
C1      Corn-no till         1434
C2      Corn-min till         834
C3      Grass/Pasture         497
C4      Grass/Tree            747
C5      Hay-windrowed         489
C6      Soybean-no till       968
C7      Soybean-min till     2468
C8      Soybean-clean till    614
C9      Woods                1294

listed in Table 1. To test the GPCLs, we designed three experiments, all of which were conducted on a Lenovo ThinkPad® T60 laptop equipped with a 2.0 GHz CPU and 1.0 GB of RAM. Of particular note, the data were first centered with respect to the band means and then normalized by the band standard deviations for numerical benefits. The classifiers we tested include a variety of GPCLs as well as the KNN and the SVM. For notational convenience, we used the following conventions to refer to a specific classifier (e.g., GP_LS_NN, GP_LP_LG_ARD, SVM_GA): a GPCL begins with "GP", and the SVM with "SVM". Among GPCLs, a least-squares classifier is denoted by "LS", and a GPC classifier with sigmoid response functions is denoted by "LG" and "PB", respectively, for the logistic and probit functions. As to the approximation schemes of a GPC classifier, we used "LP" and "EP" to stand for the Laplace method and the Expectation-Propagation algorithm, respectively. As to kernels, we used "GA", "NN", and "ARD" to indicate the Gaussian, the Neural Network, and the ARD kernels, respectively. We use "GP_TL_SMO", specifically, to denote the sparse GP model with the Gaussian kernel and the trigonometric loss function, because it employs the SMO algorithm for training. As to SVMs, we always used the Gaussian kernel due to its superior performance over other kernels as reported previously (Huang et al., 2002; Melgani and Bruzzone, 2004). In addition, we used "KNN-1" and "KNN-5" to represent the KNNs with one and five nearest neighbors, respectively. For ease of reference, the relevant acronyms for the kernel-based classifiers are summarized in Table 2. In all the experiments of this study, kernel parameters of GP models, including "GP_TL_SMO", were learned by the ML-II method as described above, while kernel and trade-off parameters of SVMs were learned by referring to a five-fold cross-validation with the nested grid-searching method as detailed in Hsu et al. (2003) (see Figure 4).

TABLE 2. ACRONYMS FOR THE KERNEL CLASSIFIERS EVALUATED IN THE EXPERIMENTS

Kernel machine    SVM    Support Vector Machine
                  GP     Gaussian Process
Kernel            GA     the Gaussian
                  NN     the Neural Network
                  ARD    the Automatic Relevance Determination
GP classifiers    LS     the least-squares
                  LP     the Laplace method
                  EP     the Expectation-Propagation algorithm
Likelihood        LG     the logistic
                  PB     the probit
                  TL     the trigonometric loss function
Optimization      SMO    Sequential Minimal Optimization for GP_TL and SVM

Figure 4. A comparison of the number of support vectors between the SVM and GP_TL_SMO in the 36 binary classifiers for one of ten trials using a training data size of 20 percent, with points above Y = X indicating a larger number of SVs for GP_TL_SMO.

Experiment 1
The first experiment aims to compare several typical GP classifiers (GPCLs) with the KNN and the SVM in terms of classification accuracy and computational load. The KNN was chosen as a reference because it is a popular standard nonparametric classifier with fair performance, and the SVM was chosen because it has reportedly achieved performance superior to most other classifiers, especially on high-dimensional data. In this experiment, following Gualtieri and Cromp (1998), we abandoned the 20 contaminated bands as a rudimentary form of coarse feature selection, thus leaving 200 bands as inputs. Besides KNN and SVM, we assessed six GPCLs, including GP_LS_GA, GP_LS_NN, GP_LP_LG_GA, GP_LP_PB_GA, GP_EP_PB_GA, and GP_TL_SMO. For comparisons of computational costs, we divided the training stage into two phases: one concerning the tuning or learning of parameters, and the other concerning the training with the optimal parameters learned in the tuning phase. We timed the two phases separately for all the classifiers except KNN. In addition, the original data were randomly split into training and testing subsets. The size of the training subset varied from 10 percent to 50 percent of the entire data with a 10 percent increment, and all the remaining pixels were used for testing; moreover, for each of the five training data sizes, we performed ten random trials (splittings), so that there are 50 runs for each classifier.


Experiment 2
The purpose of this experiment is to examine the effectiveness of ARD kernels in GPCLs as a means of feature selection, with the results evaluated and compared against those of the SVM. Unlike the first experiment, which discarded the 20 noisy bands, the second experiment made use of all 220 bands by presuming that we have no prior information on the fact that the 20 bands are contaminated. Large characteristic lengths of the ARD kernel could downplay or even eliminate these irrelevant bands, thus allowing for discrimination between the effective and ineffective (e.g., noisy) bands. The 20 noisy bands are expected to have larger characteristic lengths because they contribute little or no discrimination ability. The GPCLs considered in this experiment consist of GP_LS_ARD, GP_LP_LG_ARD, GP_LP_PB_ARD, and GP_EP_PB_ARD. These four GPCLs used the ARD kernel with 220 characteristic length parameters that were likewise optimized using the ML-II procedure. Meanwhile, SVM_GA, i.e., the SVM with the Gaussian kernel, was also considered for comparison purposes. As to the scheme of splitting the data, we used five sizes of training subsets randomly selected from the entire dataset as in the first experiment, with the percentages again ranging from 10 percent to 50 percent in steps of 10 percent and the remaining pixels used for testing. For each percentage, we ran only a single trial.

Experiment 3
The third experiment deals with a binary classification problem to assess the sensitivity of GPCLs to outliers present in the training data, as compared to the SVM. The two classes involved are Soybean-no till (C6) and Soybean-min till (C7); we chose these two classes because there is considerable spectral overlap between them, which usually causes errors in collecting training samples, especially when relying on an on-screen examination procedure for identifying class pixels. To control the amount of outliers, we purposefully introduced outlier pixels into each of the two classes by borrowing some percentage of pixels from one of the two classes and exchanging them into the other class; the outlier percentage ranged from 1 percent to 10 percent, incremented by 1 percent. In each case of the ten outlier percentages, the remaining pixels after selecting out the outlier pixels were split into a 20 percent training subset and an 80 percent testing subset. The outliers were then combined into the 20 percent training subset to generate a mixed new training dataset that was used to train the classifiers. Note that the final training data contain outliers whereas the test data do not. In this experiment, we also discarded the 20 noisy bands. Along with the SVM, we tested seven GPCLs, including GP_LS_GA, GP_LS_ARD, GP_LS_NN, GP_LP_LG_GA, GP_LP_LG_ARD, GP_EP_GA, and GP_EP_ARD.

Results and Discussion
The KNN and all the GPCLs were implemented in Matlab (Mathworks, Inc.), except that GP_TL_GA_SMO was implemented in C/C++. The SVM we used is the LIBSVM implementation, which was also coded in C/C++ (Chang and Lin, 2001). The results of the above experiments are reported and discussed below in turn.

Experiment 1
For each training data size, Table 3 summarizes the mean overall accuracies (OA) averaged over the ten random trials of that training size, together with the corresponding variation of OA among the ten trials as represented by the standard deviation. The two KNN classifiers produced accuracies that were significantly lower than those of the kernel classifiers, with KNN-5 slightly better than KNN-1 across all training sizes. The poor performance of KNN can be attributed both to the ineffectiveness of the Euclidean distance as a metric for measuring the proximity between points in the manifold of the high-dimensional training data, and to the inadequate number of training points relative to the high dimension of the feature space; these two factors make it difficult for KNN to estimate local properties of the feature space because only a sparse set of data points is found locally. On the other hand, the kernel classifiers, including GPs and SVM, resort to the kernel trick, which somewhat mitigates this curse of dimensionality. Among all the kernel classifiers we applied, GP_LS_NN consistently yielded the highest mean OA across all training data sizes, followed closely by GP_LS_GA, which was consistently better than the remaining classifiers. No kernel classifier always performed worst, although GP_TL_SMO produced the worst mean OA in three out of five training sizes (20 percent, 30 percent, and 40 percent). In particular, two points are worth mentioning. First, although GP_LS_NN had the highest mean OA averaged over ten trials, it might have a lower OA than others in a single trial (e.g., 92.36 percent for GP_LS_NN versus 92.69 percent for SVM in one random trial using a training data size of 50 percent). Second, differences in OA between two classifiers may be statistically insignificant; for example, at the training size of 50 percent, the gain of 0.61 percent by GP_LS_NN (93.72 percent) over GP_LS_GA (93.11 percent) was only marginally significant (p-value = 0.082), while a statistically significant gain is observed between GP_LS_NN (93.72 percent) and SVM (92.53 percent) with a p-value of 0.0008, both using test statistics derived by assuming a binomial distribution for the number of correctly classified pixels (van der Heijden et al., 2004).

It is interesting to notice that the least-squares GP classifier (GP_LS_NN) yielded the best results, although it is more intuitive or natural to develop a classifier with the probit or logistic likelihood model as in GP_LP_LG or GP_LP_PB. In fact, previous studies reported that regularized least-squares classifiers, including the least-squares GPCLs of this study (e.g., GP_LS_NN), have performances comparable to SVMs (Rifkin and Klautau, 2004), and they are conceptually simpler and more straightforward to implement. Theoretical evidence also suggests that as long as the regression model of a least-squares classifier is flexible enough, the fitted model f(x) will converge to the class probability p(y|x) (or a linear scaling of it) when the number of training data is sufficiently large (Rasmussen and Williams, 2006). In addition, the NN kernel (covariance function) allows for good fitting of functions that saturate at two constant values in opposite directions (e.g., a step function in the 1-D case) (Neal, 1996), thus providing a possible explanation of why GP_LS_NN tended to produce better results than GP_LS_GA, since the data to be fit in this experiment had constant values (i.e., +1.0 and −1.0) on either side of the class boundary. Overall, for this experiment, we can safely conclude that all the GPCLs yielded accuracies better than or at least comparable to that of the SVM.

TABLE 3. MEAN OVERALL ACCURACIES (OA IN %), TOGETHER WITH THE STANDARD DEVIATION (SD) OF OA CALCULATED OVER TEN TRIALS, FOR EACH CLASSIFIER ON DIFFERENT SIZES OF TRAINING DATA (10% TO 50%) IN EXPERIMENT 1

                10%            20%            30%            40%            50%
              Mean    SD     Mean    SD     Mean    SD     Mean    SD     Mean    SD
KNN-1         73.86  0.75    76.43  0.77    78.21  0.05    78.96  0.21    79.83  0.09
KNN-5         74.31  0.74    77.43  0.31    79.32  0.16    80.29  0.51    81.37  0.52
GP_LS_GA      85.48  0.52    89.78  0.47    91.27  0.24    92.82  0.36    93.11  0.48
GP_LS_NN      86.31  0.46    90.48  0.58    92.02  0.13    93.15  0.23    93.72  0.57
GP_LP_LG_GA   84.72  0.29    89.61  0.59    90.85  1.68    91.59  0.41    92.71  0.51
GP_LP_PB_GA   83.86  0.26    89.42  0.57    91.08  0.19    91.56  0.42    91.75  1.33
GP_EP_PB_GA   85.14  0.37    89.49  0.58    91.13  0.14    91.69  0.28    93.03  0.43
SVM_GA        84.18  1.17    88.91  0.25    91.12  0.12    91.83  0.31    93.02  0.28
GP_TL_SMO     84.36  0.13    89.19  0.69    90.57  0.34    91.26  0.28    92.53  0.24

The computational expenses for each classifier are detailed in Table 4, where the "train" column indicates the CPU time consumed for training the classifiers with the "optimal" parameters learned in the "tuning" phase. Because learning parameters iteratively can take as much time as desired, in order to make fair comparisons we attempted to keep the time budgets spent on learning parameters within the same order of magnitude as much as possible, except for GP_LS_NN, which indeed took less time than the others. As shown in Table 4, there is no difference in computational cost between the probit and logistic likelihood models in the Laplace approximation of the GPC. The EP algorithm was the most time-consuming, taking 5,343 seconds for training with a training data size of 50 percent. Except for GP_TL_SMO, all the GP classifiers have a computational cost that scales as O(n³), which results from inverting the covariance matrix (Kuss and Rasmussen, 2005). GP_LS_NN ran fastest among all GP classifiers. A possible reason is that the evaluation of the NN kernel for high-dimensional inputs was faster than that of the Gaussian kernel, since in this study the hyperparameter Σ̃ of the NN kernel was specified by only a single parameter for the first diagonal element, with all the other diagonal elements being ones and the off-diagonal elements being zeros. The faster evaluation of the NN kernel is also revealed by the shorter time for per-pixel prediction of GP_LS_NN compared to the other four GPCLs with the Gaussian kernel. However, the computational scaling of GP_LS_NN is also approximately O(n³), as indicated by the power factor of 2.3 obtained from fitting a power function to the "train" time with respect to the size of the training data.

The GPCLs and SVM both relied on iterative procedures to search for optimal kernel parameters, and the learning time depends on both the cost per iteration and the number of iterations (Chu et al., 2003). The per-iteration cost is approximately proportional to the "train" time of Table 4, and the SVM spent less time per iteration than most others. For the iteration number, the SVM usually iterated at least hundreds of times to obtain sensible values of parameters using the cross-validation-based grid-searching approach; however, the GPCLs with the type II ML approach often took only dozens of iterations before arriving at sensible values of hyperparameters. This difference in iteration number partly explains why GP_TL_SMO, though with a high per-iteration cost, spent less time on tuning parameters in comparison to the SVM for the first four training datasets. Although the SVM and GP_TL_SMO both used the SMO training algorithm, the SVM was faster per iteration, as revealed by the "train" column of Table 4, because the SVM possesses analytical solutions to each SMO sub-problem while GP_TL_SMO has to solve it numerically. However, compared with the other GPCLs, GP_TL_SMO demanded less computation due to its sparsity, and it has a computational scaling of O(m³) with m being the number of SVs. Previous studies reported that the SVM had fewer SVs than GP_TL_SMO (Chu et al., 2003), but in this experiment we found that in some cases GP_TL_SMO selected fewer SVs. For example, in one trial with a training data size of 20 percent, nineteen out of the 36 binary classifiers indicated more SVs for the SVM than for GP_TL_SMO. Moreover, the SVM still possessed a greater total number of SVs than GP_TL_SMO when averaged over all the 36 classifiers. This difference in SV numbers explains why the per-pixel prediction time was slightly longer for the SVM than for GP_TL_SMO, because the prediction cost per pixel is proportional to the SV number after the relevant model coefficients are calculated during the training stage. Of particular note is that we avoided comparisons that could be dominated by the efficiency difference of the two programming languages, since GP_TL_SMO and the SVM were implemented in C/C++ and all the others in Matlab.

TABLE 4. COMPUTATIONAL COSTS AVERAGED OVER TEN TRIALS IN EXPERIMENT 1 FOR EACH CLASSIFIER UNDER ALL DIFFERENT SIZES OF TRAINING DATA: "TUNE", "TRAIN", AND "PREDICT" REPRESENT THE CPU TIME IN SECONDS CONSUMED FOR LEARNING PARAMETERS, FOR TRAINING WITH THE LEARNED OPTIMAL PARAMETERS, AND FOR MAKING PREDICTION OF A NEW PIXEL WITH THE TRAINED MODEL, RESPECTIVELY. NOTE THAT ALL VALUES ARE THE AVERAGE OF TEN RANDOM TRIALS

Training Data Size:  10%                     20%                     30%                      40%                      50%
             tune   train  predict   tune   train  predict   tune    train  predict   tune    train  predict   tune    train  predict
KNN-1        --     --     5.6E-04   --     --     1.4E-03   --      --     2.5E-03   --      --     3.9E-03   --      --     5.5E-03
KNN-5        --     --     4.5E-04   --     --     1.2E-03   --      --     2.2E-03   --      --     3.5E-03   --      --     5.1E-03
GP_LS_GA     731    18     1.1E-01   3622   95     2.3E-01   10040   240    3.5E-01   11736   437    4.7E-01   24995   694    5.8E-01
GP_LS_NN     29     0.98   4.3E-03   101    3.4    5.9E-03   252     8      8.6E-03   339     16     1.5E-02   176     26     1.9E-02
GP_LP_LG_GA  747    19     1.1E-01   3657   103    2.3E-01   10089   262    3.5E-01   11941   448    4.7E-01   24824   763    5.8E-01
GP_LP_PB_GA  747    19     1.1E-01   3653   103    2.3E-01   10081   262    3.5E-01   11944   447    4.7E-01   24808   760    5.8E-01
GP_EP_GA     951    50     1.1E-01   4407   361    2.1E-01   12648   1310   2.4E-01   18331   1629   4.0E-01   37799   5343   5.6E-01
GP_TL_SMO    429    15     2.3E-03   1769   66     3.7E-03   4855    142    5.2E-03   7575    295    5.5E-03   20778   538    5.7E-03
SVM_GA       744    3.5    2.9E-03   2698   10     3.9E-03   5863    21     5.9E-03   9726    31     6.7E-03   14505   43     6.9E-03


Experiment 2
The GPCLs with ARD kernels improved the OAs over the SVM with the Gaussian kernel for all five training data sizes, and among the GPCLs, GP_EP_ARD consistently provided the best OA. Also, all five classifiers, including the SVM, improved their OAs with increasing sizes of training data (Table 5). Compared to Experiment 1, the OAs for the SVM decreased considerably, e.g., mean OAs of 84.18 percent and 73.89 percent for Experiments 1 and 2, respectively, when using the training data of size 10 percent. Such degradation was caused by the addition of the 20 atmospherically contaminated bands. These 20 bands contain almost no information about the spectral signatures of ground surfaces, therefore somewhat confounding the training of SVMs. On the other hand, the GP classifiers used the ARD kernel to effectively identify these non-informative bands by referring to the magnitude of the characteristic length parameters learned in light of the training data (Neal, 1996). However, the use of the ARD kernel for the SVM remains impractical, mainly due to the difficulty of the cross-validation procedure in pinpointing optimal parameters in a high-dimensional space (e.g., 221 in this case: one magnitude of variance of the kernel plus 220 characteristic lengths), unless knowledge about the kernel parameters is available a priori. As an example, Figure 5 depicts the learned ARD characteristic length (scale) parameters for the binary GP_LP_LG_ARD classifier discriminating between C1 and C6 as applied to the training data of size 20 percent. After 15 iterations, the bands (x-axis) that protrude in the solid curve correspond to the 20 atmospheric-absorption bands [104–108, 150–163, 220] as well as the first band, which is also quite noisy, as displayed in Figure 3b. After 30 iterations, a more detailed curve of characteristic lengths was learned that favors bands at valleys while downplaying those at peaks. Afterwards, the convergence of the optimizer became slow, as indicated by the small changes in the curves from 30 to 45 iterations (Figure 5). This observation confirms the earlier claim that dozens of iterations are often sufficient to arrive at sensible values of kernel parameters in GP models when they are learned by a gradient-based optimizer.

TABLE 5. OVERALL ACCURACY (%) FOR DIFFERENT SIZES OF TRAINING DATA IN EXPERIMENT 2

Training Data Size    10%     20%    30%     40%     50%
SVM_GA                73.89   84.8   86.43   88.40   90.12
GP_LS_ARD             78.75   87.8   89.33   90.31   91.19
GP_LP_LG_ARD          85.32   89.8   91.60   91.93   93.56
GP_LP_PB_ARD          87.63   90.4   91.18   91.93   93.82
GP_EP_ARD             89.94   90.8   91.96   92.59   93.95

Figure 5. The 220 characteristic length parameters of the ARD kernel learned at three iteration phases, plotted as a curve (y-axis in the logarithmic scale) against band numbers, for the binary classifier GP_LP_LG_ARD that discriminates C1 and C6. Note that a large characteristic length will downplay the associated input band.

Experiment 3
In this experiment, no single classifier outperformed the others across all the trials, but the GPCLs with the ARD kernels consistently yielded better results than their counterparts with the Gaussian kernels, as exemplified by the comparison between GP_LS_GA and GP_LS_ARD (Figure 6). Overall, though not always, the SVM showed poorer accuracies than the GPCLs, as partially demonstrated in Figure 6, where the differences in OA between the three GP least-squares classifiers and the SVM were all significant except in the case with 9 percent outliers. More interestingly, for all the classifiers considered, the OA exhibited no overall decreasing trend with an increasing amount of outliers added to the training data. On the contrary, a slight overall trend toward improving OA was observed, as exhibited by the four classifiers in Figure 6, but none of these improvements were statistically significant, e.g., a mean OA increase of 0.2 percent for GP_LS_GA (p-value = 0.45). The results suggest that, in this experiment, all the kernel classifiers had low sensitivity to outliers in the training data. We also evaluated the GPC classifiers and the SVM with recourse to error-reject curves, which can better characterize the performance of a classifier (MacKay, 2004). In this evaluation, the three GPR least-squares classifiers, i.e., GP_LS_GA, GP_LS_ARD, and GP_LS_NN, are excluded because of their non-probabilistic prediction outputs, although ad hoc methods exist that may use a sigmoid function to squash the least-squares output into the unit interval [0, 1] to obtain class probability outputs (Platt, 2000). It is found that GP_EP_PB_ARD exhibited better error-reject characteristics across all ten trials, and this result is consistent with those of earlier studies (Rasmussen and Williams, 2006). The performance of the SVM ranked consistently low, though not the worst, especially at small rejection rates (Figure 7b). Other than these, no other patterns in the error-reject curves are persistent across all trials. Nevertheless, this experiment confirms again that the GPCLs had performances comparable to the SVM in terms of both OA and error-reject characteristics.

Figure 6. The typical results of OAs for Experiment 3 at ten trials with different amounts (1 percent to 10 percent) of outliers in the training data. For displaying convenience, only 4 out of 7 classifiers are shown.

Figure 7. Comparisons of error-reject curves for five classifiers that have class probability outputs in Experiment 3: (a) with 1 percent outliers and (b) 10 percent outliers, respectively.
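An error-reject curve of the kind compared in Figure 7 can be traced by withholding the least confident predictions and recomputing the error rate on the retained pixels; the sketch below is one common way to do this from per-pixel class probabilities and is an assumption on our part, since the exact rejection rule used in the evaluation is not spelled out in the text.

```python
import numpy as np

def error_reject_curve(probs, labels, thresholds=np.linspace(0.5, 1.0, 21)):
    """Error rate of the retained pixels versus rejection rate, given an array of
    per-pixel class probabilities (n_pixels x n_classes) and the true labels."""
    confidence = probs.max(axis=1)      # probability of the predicted (winning) class
    predicted = probs.argmax(axis=1)
    curve = []
    for t in thresholds:
        keep = confidence >= t          # pixels confident enough to classify
        reject_rate = 1.0 - keep.mean()
        error = np.nan if not keep.any() else float((predicted[keep] != labels[keep]).mean())
        curve.append((reject_rate, error))
    return curve
```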


Summary and Conclusion
In this paper, we proposed the use of GP models for classifying remotely sensed data, and evaluated the effectiveness of a variety of GPCLs as applied to AVIRIS data. First, we provided a theoretical treatment of the Bayesian formulation of GPR and GPC; then briefly described the Laplace and Expectation-Propagation algorithms used to approximate the analytically intractable GPC models; and next presented the typical Bayesian approach to learning kernel hyperparameters of GP models. Some practical implementation issues of GPCLs were also discussed. Our experiments on the AVIRIS data demonstrated that GPCLs improve classification accuracies significantly over the conventional KNN classifier, and yield results comparable to or even better than those of the SVM. GPs provide a useful kernel machine for supervised learning tasks such as regression and classification problems. One advantage of GP models over other kernel machines, such as SVMs, is their natural Bayesian formulation, by which the underlying latent function to be inferred is assigned a GP prior, thus making model outputs more probabilistically interpretable. More importantly, the Bayesian nature of GPs enables learning sensible kernel parameters by maximizing the marginal likelihood with general gradient-based optimizers, as opposed to the cross-validation procedures of SVMs that often prove difficult for optimizing parameters over high-dimensional spaces. Meanwhile, GP models, either GPR or GPC, somewhat circumvent the curse of dimensionality through the use of the kernel trick, but their model elicitation is based on statistical considerations rather than the geometrical consideration taken in the classical formulation of SVMs, which attempts to maximize the margin between separating hyperplanes in a kernel-induced space. Moreover, GP models allow for the use of the ARD kernel to automatically determine the relative importance of input features (bands) so as to provide a feature selection mechanism; however, the ARD is less practical to incorporate into SVMs due to the daunting computation involved in searching for optimal parameters over high-dimensional spaces. The superior performance of GPs with the ARD kernel was confirmed in our second experiment. Despite the aforementioned advantages, the training of GP models requires large computation, especially when dealing with huge training datasets, and its naïve implementation scales as O(n³), which is slower than the fast training of SVMs. However, learning the parameters of GP models usually


takes fewer iterations than that of SVMs, thus somewhat mitigating the high computational cost. Sparse GP models, such as those with the trigonometric loss function, could be used to reduce the computational cost because the training cost of these models is largely determined by the number of SVs, in the same spirit as the fast training of SVMs. Indeed, the efficient and effective training of GP models currently remains an open area that attracts active research efforts, and there have been advances such as the informative vector machine (Lawrence et al., 2003) and the sparse online algorithm (Csató and Opper, 2002). Future studies may investigate the applicability of these sparse and efficient GP models for classification of remotely sensed data. This study exploited the potential of GP models only for supervised classification of hyperspectral data. We envision that GPs should also provide a useful and attractive method for regression-related remote sensing applications, particularly due to such advantages as Bayesian model adaptation, the mitigation of the curse of dimensionality, automatic feature selection with ARD kernels, and the determination of error bars (prediction uncertainty) in making predictions. Hence, it is highly recommended that future studies examine the utility of GPs as a regression tool for retrieving surface biophysical variables from remote sensing measurements, e.g., estimating chlorophyll concentration from hyperspectral data.

Acknowledgments

We thank Dr. David Landgrebe for providing the AVIRIS data and Drs. Wei Chu, Carl Edward Rasmussen, and Chris Williams for making their codes available; we also thank Mr. Jared Stukey for his patient proofreading. Our special thanks are due to the two guest editors for their constant help. Kaiguang Zhao gratefully acknowledges the help and support of the ERDAS Imagine® team during a summer internship in 2007 subsidized by the ASPRS Leica Geosystems Internship Scholarship.

References

Atkinson, P., and A. Tatnall, 1997. Neural networks in remote sensing, International Journal of Remote Sensing, 18(4):699–709.
AVIRIS NW Indiana's Indian Pines 1992 data set, URL: ftp://ftp.ecn.purdue.edu/biehl/MultiSpec/92AV3C (original files) and ftp://ftp.ecn.purdue.edu/biehl/PC_MultiSpec/ThyFiles.zip (ground truth) (last date accessed: 16 June 2008).
Bermak, A., and S. B-Belhouari, 2006. Bayesian learning using Gaussian process for gas identification, IEEE Transactions on Instrumentation and Measurement, 55(3):787–792.
Chang, C.-C., and C.-J. Lin, 2001. LIBSVM: A library for support vector machines, Software, URL: http://www.csie.ntu.edu.tw/cjlin/libsvm (last date accessed: 16 June 2008).
Chang, C.-I., 2003. Hyperspectral Imaging: Techniques for Spectral Detection and Classification, Kluwer Academic/Plenum Publishers, New York, 300 p.
Chu, W., S.S. Keerthi, and C.J. Ong, 2003. Bayesian trigonometric support vector classifier, Neural Computation, 15(9):2227–2254.
Cochrane, M.A., 2000. Using vegetation reflectance variability for species level classification of hyperspectral data, International Journal of Remote Sensing, 21(10):2075–2087.
Cressie, N.A.C., 1993. Statistics for Spatial Data, Wiley, New York, 928 p.
Csató, L., and M. Opper, 2002. Sparse online Gaussian processes, Neural Computation, 14(2):641–669.
Duda, R., P. Hart, and D. Stork, 2001. Pattern Classification, Second edition, John Wiley & Sons, New York, 680 p.
Evgeniou, T., M. Pontil, and T. Poggio, 2000. Regularization networks and support vector machines, Advances in Computational Mathematics, 13(1):1–50.
Ferris, B.D., D. Fox, and N.D. Lawrence, 2007. WiFi-SLAM using Gaussian process latent variable models, Proceedings of the 20th International Joint Conference on Artificial Intelligence (IJCAI 2007), pp. 2480–2485.
Genton, M.G., 2001. Classes of kernels for machine learning: A statistics perspective, Journal of Machine Learning Research, 2:299–312.
Gualtieri, J.A., and R.F. Cromp, 1998. Support vector machines for hyperspectral remote sensing classification, Proceedings of SPIE, Vol. 3584, pp. 221–232.
Hoffbeck, J.P., and D. Landgrebe, 1996. Covariance matrix estimation and classification with limited training data, IEEE Transactions on Pattern Analysis and Machine Intelligence, 18(7):763–767.
Hsu, C.-W., and C.-J. Lin, 2002. A simple decomposition method for support vector machines, Machine Learning, 46:291–314.
Hsu, C.-W., C.-C. Chang, and C.-J. Lin, 2003. A Practical Guide to Support Vector Classification, URL: http://www.csie.ntu.edu.tw/ cjlin/libsvm/index.html (last date accessed: 16 June 2008).
Huang, C., L.S. Davis, and J.R.G. Townshend, 2002. An assessment of support vector machines for land cover classification, International Journal of Remote Sensing, 23:725–749.
Kuss, M., and C. Rasmussen, 2005. Assessing approximate inference for binary Gaussian process classification, Journal of Machine Learning Research, 6(10):1679–1704.
Landgrebe, D., 2002. Hyperspectral image data analysis as a high dimensional signal processing problem, Special Issue of the IEEE Signal Processing Magazine, 19(1):17–28.
Lawrence, N., M. Seeger, and R. Herbrich, 2003. Fast sparse Gaussian process methods: The informative vector machine, Advances in Neural Information Processing Systems 15 (S. Becker, S. Thrun, and K. Obermayer, editors), The MIT Press, Cambridge, Massachusetts, pp. 609–616.
MacKay, D.J.C., 1997. Gaussian processes - A replacement for supervised neural networks?, Tutorial lecture notes for NIPS 1997, URL: http://www.inference.phy.cam.ac.uk/mackay/abstracts/gp.html (last date accessed: 16 June 2008).
MacKay, D.J.C., 2004. Information Theory, Inference, and Learning Algorithms, Cambridge University Press, Cambridge, UK, 640 p.
Melgani, F., and L. Bruzzone, 2004. Classification of hyperspectral remote-sensing images with support vector machines, IEEE Transactions on Geoscience and Remote Sensing, 42(8):1778–1790.
Neal, R.M., 1996. Bayesian Learning for Neural Networks, Springer, New York, 206 p.
Neal, R.M., 1998. Regression and classification using Gaussian process priors, Bayesian Statistics 6 (J.M. Bernardo, J.O. Berger, A.P. Dawid, and A.F.M. Smith, editors), Oxford University Press, pp. 475–501.
Paola, J.D., and R.A. Schowengerdt, 1995. A review and analysis of back-propagation neural networks for classification of remotely-sensed multi-spectral imagery, International Journal of Remote Sensing, 16:3033–3058.
Platt, J., 1999. Sequential minimal optimization: A fast algorithm for training support vector machines, Advances in Kernel Methods - Support Vector Learning (B. Schölkopf, C. Burges, and A. Smola, editors), MIT Press, Cambridge, Massachusetts, pp. 185–208.
Platt, J., 2000. Probabilities for SV machines, Advances in Large Margin Classifiers (A. Smola, P. Bartlett, B. Schölkopf, and D. Schuurmans, editors), MIT Press, pp. 61–74.
Rasmussen, C.E., and C.K.I. Williams, 2006. Gaussian Processes for Machine Learning, The MIT Press, Cambridge, Massachusetts, 248 p.
Richards, J.A., and X. Jia, 1999. Remote Sensing Digital Image Analysis, Third edition, Springer, New York, 439 p.
Rifkin, R., and A. Klautau, 2004. In defense of one-vs-all classification, Journal of Machine Learning Research, 5:101–141.
Schowengerdt, R.A., 2006. Remote Sensing Models and Methods for Image Processing, Third edition, Academic Press, San Diego, California, 560 p.
Seeger, M., 2004. Gaussian processes for machine learning, International Journal of Neural Systems, 14(2):69–106.
Shippert, P., 2004. Why use hyperspectral imagery?, Photogrammetric Engineering & Remote Sensing, 70(4):377–380.
van der Heijden, F., R.P.W. Duin, D. de Ridder, and D.M.J. Tax, 2004. Classification, Parameter Estimation and State Estimation: An Engineering Approach Using MATLAB, Wiley, New York, 440 p.
Vapnik, V.N., 1998. Statistical Learning Theory, Wiley, New York, 768 p.
Williams, C.K.I., and D. Barber, 1998. Bayesian classification with Gaussian processes, IEEE Transactions on Pattern Analysis and Machine Intelligence, 20(12):1342–1351.
Xie, Y., D. Lord, and Y. Zhang, 2007. Predicting motor vehicle collisions using Bayesian neural network models: An empirical analysis, Accident Analysis & Prevention, 39(5):922–933.
Zhong, Y., L. Zhang, J. Gong, and P. Li, 2007. A supervised artificial immune classifier for remote-sensing imagery, IEEE Transactions on Geoscience and Remote Sensing, 45(12):3957–3966.
