
IEEE TRANSACTIONS ON GEOSCIENCE AND REMOTE SENSING, VOL. 48, NO. 1, JANUARY 2010

Gaussian Process Approach to Remote Sensing Image Classification

Yakoub Bazi, Member, IEEE, and Farid Melgani, Senior Member, IEEE

Abstract—Gaussian processes (GPs) represent a powerful and interesting theoretical framework for Bayesian classification. Despite having gained prominence in recent years, they remain an approach whose potentialities are not yet sufficiently known. In this paper, we propose a thorough investigation of the GP approach for classifying multisource and hyperspectral remote sensing images. To this end, we explore two analytical approximation methods for GP classification, namely, the Laplace and expectation-propagation methods, which are implemented with two different covariance functions, i.e., the squared exponential and neural-network covariance functions. Moreover, we analyze how the computational burden of GP classifiers (GPCs) can be drastically reduced without significant losses in terms of discrimination power through a fast sparse-approximation method like the informative vector machine. Experiments were designed aiming also at testing the sensitivity of GPCs to the number of training samples and to the curse of dimensionality. In general, the obtained classification results show clearly that the GPC can compete seriously with the state-of-the-art support vector machine classifier.

Index Terms—Expectation-propagation (EP) method, Gaussian process (GP), hyperspectral imagery, Laplace approximation, sparse classification, support vector machine (SVM).

Manuscript received December 26, 2008; revised March 11, 2009 and April 28, 2009. First published August 18, 2009; current version published December 23, 2009. Y. Bazi is with the College of Engineering, Al-Jouf University, Al-Jouf 2014, Saudi Arabia (e-mail: [email protected]). F. Melgani is with the Department of Information Engineering and Computer Science, University of Trento, 38050 Trento, Italy (e-mail: [email protected]). Color versions of one or more of the figures in this paper are available online at http://ieeexplore.ieee.org. Digital Object Identifier 10.1109/TGRS.2009.2023983

I. INTRODUCTION

SUPERVISED classification of remote sensing images has received great attention from the remote sensing community for several decades. For such a purpose, many simple and sophisticated techniques have been considered, such as the statistical classifier, the k-nearest neighbor classifier, the artificial neural-network (NN) classifier, and, more recently, the kernel-based classifier [1], [2]. Among the most popular kernel-based classifiers available in the literature, one can find support vector machine (SVM) classifiers [3]. They are based on the margin maximization principle, which aims at providing them with a good generalization capability. SVM classifiers have been used extensively and proved to be successful in dealing with remote sensing data [4]–[14].

Another potentially interesting kernel-based classification approach is the one based on Gaussian processes (GPs) [15]–[20]. In contrast to SVM classifiers, GP classifiers (GPCs) have not yet received sufficient attention from the remote sensing community, despite being theoretically attractive statistical models that permit a fully Bayesian treatment of the considered classification problem. Compared to SVM classifiers, they have the advantage of providing probabilistic outputs rather than discriminant function values. Moreover, they can use evidence for solving the model selection issue in a completely automatic way.

The main idea of GPCs is to assume that the probability of belonging to a class label for an input sample is monotonically related to the value of some latent function at that sample. Such a monotonic relationship is defined according to a so-called squashing function. A GP prior characterized by a covariance matrix embedding a set of hyperparameters is placed on this latent function. Inference is made by integrating over the latent function. Since such an integral is analytically intractable, solutions based on Monte Carlo sampling or analytical approximation methods are adopted. The two key analytical approximation algorithms are the Laplace and expectation-propagation (EP) algorithms. Both approximate the non-Gaussian joint posterior over the latent variables with a Gaussian one. In the Laplace approximation, the Gaussian model is defined with mean and covariance matrix given by the maximum point of the posterior and the inverse of the negative Hessian matrix at that point, respectively. The identification of this maximum is carried out according to the iterative Newton method. The EP algorithm is a more sophisticated approximation technique that tries in some way to minimize locally the Kullback–Leibler divergence measure between the true posterior and the approximated one. This is done sequentially through the so-called cavity distribution. In the prediction phase, the approximate predictive mean and variance for the (approximated) Gaussian posterior over the latent variable of the considered sample are computed first. Then, the class posterior probability for the sample target is derived either analytically or by approximation, depending on the adopted squashing function. The multiclass implementation of GPCs is obtained through an intrinsic multiclass formulation, which can be complex [21], or simply by decomposition into binary classification problems.

In this paper, we propose a thorough investigation of the GPC effectiveness for classifying multisource and hyperspectral remote sensing images. To this end, we designed several experiments aiming also at testing the sensitivity of GPCs to the number of training samples and to the curse of dimensionality. In general, the obtained classification results show clearly that the GPC can compete seriously (providing sometimes better accuracies) with the state-of-the-art SVM classifier.

The remainder of this paper is organized as follows. In Section II, we introduce the basics of the GPC. The Laplace and EP approximation algorithms are presented in Section III. The model selection issue in GPC is addressed in Section IV. Section V describes a sparse-approximation method for GPCs called the informative vector machine (IVM). The experimental part is reported in Sections VI and VII. Finally, conclusions are drawn in Section VIII.


Fig. 1. Graphical model of a GP.

II. GP CLASSIFICATION

Let us consider a training set D = (X, y) consisting of a matrix of training data X = [x_1 x_2 \cdots x_N]^T, accompanied with a target vector y = [y_1 y_2 \cdots y_N]^T, where N is the number of samples. To each vector x_i \in \mathbb{R}^d (i = 1, 2, \ldots, N), a target y_i \in \{-1, +1\} is associated. Given this training set D, we aim at predicting the label of a new sample x_* (whose true label is unknown) through the computation of the class posterior probability p(y_*|D, x_*).

In GPC (see Fig. 1), the probability of belonging to a class label y_i = +1 for an input sample x_i is monotonically related to the value f_i of some latent function f. Such a monotonic relationship is defined according to a squashing function, which can take several forms such as the logistic and probit functions

p(y_i = +1 \mid f_i) = \begin{cases} \dfrac{1}{1 + \exp(-y_i f_i)}, & \text{logistic function} \\ \Phi(y_i f_i), & \text{probit function} \end{cases}    (1)

where \Phi is the Gaussian cumulative distribution function, i.e., \Phi(z) = \int_{-\infty}^{z} \frac{1}{\sqrt{2\pi}} \exp(-x^2/2) \, dx. The prediction of the output probability for the sample x_* is obtained by integration over the latent function f_* as follows:

p(y_* = +1 \mid D, x_*) = \int p(y_* \mid f_*) \, p(f_* \mid D, x_*) \, df_*    (2)

The second term of the integral (2) represents the distribution of the latent variable corresponding to the sample x_*. It is obtained by marginalization over f = [f_1 f_2 \cdots f_N]

p(f_* \mid D, x_*) = \int p(f_* \mid X, x_*, f) \, p(f \mid D) \, df    (3)

where p(f|D) is the posterior over the latent variables. This last can be reformulated through the Bayes' rule as

p(f \mid D) = p(y \mid f) \, p(f \mid X) / p(y \mid X) = \left[ \prod_{i=1}^{N} p(y_i \mid f_i) \right] p(f \mid X) / p(y \mid X)    (4)

p(y|f) is the likelihood function. It can be expressed by using one of the forms of the squashing functions given in (1). p(y|X) is the marginal likelihood, and p(f|X) is the GP prior over the latent variables. The GP prior is typically characterized by a zero mean and a covariance matrix embedding a set of hyperparameters, i.e.,

p(f \mid X) = \frac{1}{(2\pi)^{N/2} |K|^{1/2}} \exp\left( -\frac{1}{2} f^T K^{-1} f \right)    (5)

where each term of the covariance matrix K is a function of x_i and x_j. By analogy, K can be seen as the Gram matrix in SVMs. The covariance function encapsulates the prior knowledge about the function smoothness. In this paper, we shall consider two different covariance functions. The first one, called squared exponential [or Gaussian radial basis function (RBF)], takes the following relationship:

k(x_i, x_j) = \theta_0 \exp\left( -\frac{\|x_i - x_j\|^2}{2 l^2} \right)    (6)

where \theta_0 is the process variance. It controls the overall vertical scale of variation of the latent function. l denotes the length scale.


It governs the latent function smoothness. The second one, termed as NN covariance function, is defined as

k(x_i, x_j) = \theta_0 \sin^{-1}\left( \frac{2 \tilde{x}_i^T \Sigma \tilde{x}_j}{\sqrt{\left(1 + 2 \tilde{x}_i^T \Sigma \tilde{x}_i\right)\left(1 + 2 \tilde{x}_j^T \Sigma \tilde{x}_j\right)}} \right)    (7)

where \tilde{x} = (1, x_1, \ldots, x_d) is the input vector augmented with a bias term and \Sigma \in \mathbb{R}^{(d+1)\times(d+1)} represents a diagonal matrix whose weights are equal to 1/l^2. This covariance function can be obtained from an NN of sigmoid functions in the limit of an infinite number of hidden units [22]. For both covariance functions, the hyperparameter vector is given by Θ = [l, θ0].

Since the integrals in (2) and (3) are not analytically tractable because of the non-Gaussian nature of the likelihood terms, analytical approximation or Monte Carlo methods have to be adopted. Monte Carlo methods are nondeterministic methods that compute numerically the integral by simply making use of the law of large numbers. Therefore, they are particularly time consuming because of the typically huge number of samples that they require to perform the numerical integration. In the next section, we will describe two alternative analytical approximation methods, which are the Laplace and EP approximations.
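To make the two covariance functions concrete, the following sketch builds the Gram matrix K used in (5) from either (6) or (7). It is a minimal illustration (NumPy assumed), not the implementation used by the authors; function and parameter names are ours.

```python
import numpy as np

def squared_exponential(X1, X2, length_scale=1.0, theta0=1.0):
    """Squared-exponential (RBF) covariance of (6): theta0 * exp(-||xi - xj||^2 / (2 l^2))."""
    d2 = (np.sum(X1**2, axis=1)[:, None] + np.sum(X2**2, axis=1)[None, :]
          - 2.0 * X1 @ X2.T)
    return theta0 * np.exp(-np.maximum(d2, 0.0) / (2.0 * length_scale**2))

def neural_network(X1, X2, length_scale=1.0, theta0=1.0):
    """NN covariance of (7), with inputs augmented by a bias term and
    Sigma = diag(1/l^2) acting on the augmented vectors."""
    A1 = np.hstack([np.ones((X1.shape[0], 1)), X1])   # x~ = (1, x1, ..., xd)
    A2 = np.hstack([np.ones((X2.shape[0], 1)), X2])
    s = 1.0 / length_scale**2                          # diagonal weight of Sigma
    cross = 2.0 * s * (A1 @ A2.T)
    d1 = 1.0 + 2.0 * s * np.sum(A1**2, axis=1)
    d2 = 1.0 + 2.0 * s * np.sum(A2**2, axis=1)
    arg = cross / np.sqrt(d1[:, None] * d2[None, :])
    return theta0 * np.arcsin(np.clip(arg, -1.0, 1.0))

# Example: N x N Gram matrix K of (5) over a training matrix X (N samples, d features).
X = np.random.randn(6, 4)
K = squared_exponential(X, X, length_scale=1.0, theta0=1.0)
```

Both functions share the hyperparameter vector Θ = [l, θ0], which is what the model selection procedures of Section IV estimate.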

III. LAPLACE AND EP APPROXIMATIONS

A. Laplace Approximation

The Laplace technique substitutes the non-Gaussian posterior in the integral (3) by a Gaussian approximation q(f|D) derived from a second-order Taylor expansion of \log p(f|D) around the maximum of the posterior

p(f \mid D) \simeq q(f \mid D) = \mathcal{N}(f \mid \hat{f}, A) \propto \exp\left( -\frac{1}{2} (f - \hat{f})^T A^{-1} (f - \hat{f}) \right)    (8)

where \hat{f} and A are the mean and the covariance matrix, respectively. They are given by

\hat{f} = \arg\max_{f} p(f \mid D)    (9)

A = -\left( \nabla\nabla \log p(f \mid D)\big|_{f = \hat{f}} \right)^{-1}    (10)

The covariance matrix thus corresponds to the inverse of the Hessian of the negative log posterior at the maximum point (i.e., the peak of the non-Gaussian posterior). The calculation of \hat{f} and A starts from the formulation of p(f|D) as in (4). In particular, by using the logarithm of this posterior and (5), one can obtain the following:

\Psi(f) = \log p(f \mid D) = \log p(y \mid f) + \log p(f \mid X) - \log p(y \mid X)
        = \log p(y \mid f) - \frac{1}{2} f^T K^{-1} f - \frac{1}{2} \log |K| - \frac{N}{2} \log 2\pi - \log p(y \mid X)    (11)

Differentiating (11) twice with respect to f leads to

\nabla \Psi(f) = \nabla \log p(y \mid f) - K^{-1} f
\nabla\nabla \Psi(f) = \nabla\nabla \log p(y \mid f) - K^{-1}    (12)

At the maximum of \Psi(f), one gets

\hat{f} = K \left( \nabla \log p(y \mid \hat{f}) \right)    (13)

and the covariance matrix is approximated by the shape of \Psi(f)

A = -\left( \nabla\nabla \Psi(\hat{f}) \right)^{-1} = (K^{-1} + W)^{-1}    (14)

where

W = -\nabla\nabla \log p(y \mid \hat{f})    (15)

Since (13) is nonlinear, the computation of \hat{f} is achieved by numerical methods, such as the Newton method. Once the computations of \hat{f} and A are done, the Laplace approximation to the posterior is completely defined by

q(f \mid D) = \mathcal{N}\left( \hat{f}, (K^{-1} + W)^{-1} \right)    (16)

The prediction for the point x_* is evaluated by exploiting the Gaussian approximation in (2)

p(y_* = +1 \mid D, x_*) \simeq q(y_* = +1 \mid D, x_*) = \int p(y_* \mid f_*) \, q(f_* \mid D, x_*) \, df_*    (17)

where q(f_* | D, x_*) is a Gaussian with mean and variance given as follows:

\mu_* = k(x_*)^T K^{-1} \hat{f}
\sigma_*^2 = k(x_*, x_*) - k(x_*)^T (K + W^{-1})^{-1} k(x_*)    (18)

and k(x_*)^T = [k(x_1, x_*) \; k(x_2, x_*) \; \cdots \; k(x_N, x_*)] is a vector of kernel distances (covariances) between x_* and all the training samples. If the probit form is adopted for the squashing function, then prediction (17) can be evaluated analytically as

q(y_* = +1 \mid D, x_*) = \Phi\left( \frac{\mu_*}{\sqrt{1 + \sigma_*^2}} \right)    (19)

In the case of the logistic function, the computation is done numerically by generating a set of random realizations from q(f_* | D, x_*), squashing, and then averaging them. The larger the number of realizations, the better the estimate of the class posterior probability q(y_* = +1 | D, x_*). Note that if the class posterior probability value is not desired but just the estimate of the label of the sample x_*, one can adopt the following labeling rule: assign x_* to label "+1" if \mu_* \geq 0; otherwise, assign it to label "-1".
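As a concrete illustration of the Newton-based mode finding behind (13) and of the predictive equations (18) and (19), the sketch below reimplements the Laplace steps for the probit likelihood. It is a simplified, assumption-laden sketch (NumPy/SciPy), not the GPML code used in the experiments; a numerically robust implementation would work with B = I + W^{1/2} K W^{1/2} instead of inverting K directly.

```python
import numpy as np
from scipy.stats import norm

def laplace_mode(K, y, n_iter=100, tol=1e-6):
    """Newton iterations for the posterior mode f_hat of (13), assuming the
    probit likelihood p(y_i | f_i) = Phi(y_i f_i); returns f_hat and the diagonal of W in (15)."""
    f = np.zeros(len(y))
    for _ in range(n_iter):
        z = y * f
        ratio = norm.pdf(z) / norm.cdf(z)   # may need stabilization for very negative z
        grad = y * ratio                    # d log p(y|f) / df
        W = ratio**2 + z * ratio            # -d^2 log p(y|f) / df^2 (diagonal of W)
        f_new = np.linalg.solve(np.linalg.inv(K) + np.diag(W), W * f + grad)
        if np.max(np.abs(f_new - f)) < tol:
            return f_new, W
        f = f_new
    return f, W

def laplace_predict(K, k_star, k_ss, f_hat, W):
    """Predictive mean/variance of (18) and probit class probability of (19)."""
    mu = k_star @ np.linalg.solve(K, f_hat)
    var = k_ss - k_star @ np.linalg.solve(K + np.diag(1.0 / W), k_star)
    return norm.cdf(mu / np.sqrt(1.0 + var)), mu, var
```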


B. EP Approximation

As mentioned previously, the EP approximation aims at minimizing locally the Kullback–Leibler divergence measure between the true posterior and its Gaussian approximation. By using the Bayes' rule, the posterior p(f|D) can be rewritten as follows:

p(f \mid D) = \frac{1}{Z} p(f \mid X) \prod_{i=1}^{N} p(y_i \mid f_i)    (20)

where the normalizing term Z is the marginal likelihood p(y|X). The key idea of the EP approximation is to approximate the likelihood by a product of local likelihoods estimated through unnormalized functions in the latent variables f_i's

p(y_i \mid f_i) \simeq t_i(f_i \mid \tilde{Z}_i, \tilde{\mu}_i, \tilde{\sigma}_i^2) = \tilde{Z}_i \, \mathcal{N}(f_i \mid \tilde{\mu}_i, \tilde{\sigma}_i^2)    (21)

where \tilde{Z}_i, \tilde{\mu}_i, and \tilde{\sigma}_i^2 are termed site parameters. Accordingly, the approximated likelihood is equal to

\prod_{i=1}^{N} t_i(f_i \mid \tilde{Z}_i, \tilde{\mu}_i, \tilde{\sigma}_i^2) = \mathcal{N}(\tilde{\mu}, \tilde{\Sigma}) \prod_{i=1}^{N} \tilde{Z}_i    (22)

where \tilde{\mu} = [\tilde{\mu}_1, \tilde{\mu}_2, \ldots, \tilde{\mu}_N]^T and \tilde{\Sigma} = \mathrm{diag}[\tilde{\sigma}_1^2, \tilde{\sigma}_2^2, \ldots, \tilde{\sigma}_N^2]. The Gaussian approximation q(f|D) to the posterior becomes

q(f \mid D) = \frac{1}{Z_{EP}} p(f \mid X) \prod_{i=1}^{N} t_i(f_i \mid \tilde{Z}_i, \tilde{\mu}_i, \tilde{\sigma}_i^2) = \mathcal{N}(\mu, \Sigma)    (23)

with

\mu = \Sigma \tilde{\Sigma}^{-1} \tilde{\mu}, \qquad \Sigma = \left( K^{-1} + \tilde{\Sigma}^{-1} \right)^{-1}    (24)

and Z_{EP} = p(y|X) is the EP approximation to the normalizing term Z. The EP approximation estimates the individual t_i approximations iteratively based on the so-called cavity distribution

q_{-i}(f_i) \propto \int p(f \mid X) \prod_{j \neq i} t_j(f_j \mid \tilde{Z}_j, \tilde{\mu}_j, \tilde{\sigma}_j^2) \, df_j    (25)

where the subscript -i indicates that the case i is removed. Conceptually, the estimation process is done by repeating four phases on each training sample until convergence. First, the cavity distribution is computed according to the following:

q_{-i}(f_i \mid x_i, y_i) = \mathcal{N}\left( f_i \mid \mu_{-i}, \sigma_{-i}^2 \right)    (26)

where

\mu_{-i} = \sigma_{-i}^2 \left( \sigma_i^{-2} \mu_i - \tilde{\sigma}_i^{-2} \tilde{\mu}_i \right), \qquad \sigma_{-i}^2 = \left( \sigma_i^{-2} - \tilde{\sigma}_i^{-2} \right)^{-1}    (27)

Second, the cavity distribution is combined with the true likelihood to get the desired non-Gaussian marginal

r(f_i) = q_{-i}(f_i \mid x_i, y_i) \, p(y_i \mid f_i)    (28)

Third, the Gaussian approximation to the non-Gaussian marginal is computed as

\hat{q}_i(f_i) = \hat{Z}_i \, \mathcal{N}\left( \hat{\mu}_i, \hat{\sigma}_i^2 \right) \simeq r(f_i \mid x_i, y_i)    (29)

where

\hat{Z}_i = \Phi(z_i)
\hat{\mu}_i = \mu_{-i} + \frac{y_i \sigma_{-i}^2 \, \mathcal{N}(z_i)}{\Phi(z_i) \sqrt{1 + \sigma_{-i}^2}}
\hat{\sigma}_i^2 = \sigma_{-i}^2 - \frac{\sigma_{-i}^4 \, \mathcal{N}(z_i)}{\left(1 + \sigma_{-i}^2\right) \Phi(z_i)} \left( z_i + \frac{\mathcal{N}(z_i)}{\Phi(z_i)} \right)    (30)

with

z_i = \frac{y_i \mu_{-i}}{\sqrt{1 + \sigma_{-i}^2}}    (31)

Finally, in the last step, the parameters of the approximation t_i are updated in order to make the posterior have the desired marginal, leading to the following:

\tilde{\mu}_i = \tilde{\sigma}_i^2 \left( \hat{\sigma}_i^{-2} \hat{\mu}_i - \sigma_{-i}^{-2} \mu_{-i} \right)
\tilde{\sigma}_i^2 = \left( \hat{\sigma}_i^{-2} - \sigma_{-i}^{-2} \right)^{-1}
\tilde{Z}_i = \hat{Z}_i \sqrt{2\pi} \sqrt{\sigma_{-i}^2 + \tilde{\sigma}_i^2} \, \exp\left( \frac{(\mu_{-i} - \tilde{\mu}_i)^2}{2\left(\sigma_{-i}^2 + \tilde{\sigma}_i^2\right)} \right)    (32)

The procedure for making predictions is similar to that of the Laplace approximation, with the difference that \mu_* and \sigma_*^2 are given by

\mu_* = k(x_*)^T (K + \tilde{\Sigma})^{-1} \tilde{\mu}
\sigma_*^2 = k(x_*, x_*) - k(x_*)^T (K + \tilde{\Sigma})^{-1} k(x_*)    (33)
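The following sketch illustrates how the four EP phases (26)–(32) can be swept over the training samples for the probit likelihood. It is a didactic, unoptimized version with hypothetical function names (the authors rely on existing GP software); practical implementations add damping and handle possible negative site variances.

```python
import numpy as np
from scipy.stats import norm

def ep_sweeps(K, y, n_sweeps=10):
    """Repeated EP sweeps over all sites for the probit likelihood.
    Site parameters start at mu_tilde = 0 and sigma2_tilde = +inf (no site information)."""
    N = len(y)
    mu_tilde = np.zeros(N)
    sigma2_tilde = np.full(N, np.inf)
    for _ in range(n_sweeps):
        # Posterior parameters of (24) for the current site parameters.
        Sigma = np.linalg.inv(np.linalg.inv(K) + np.diag(1.0 / sigma2_tilde))
        mu = Sigma @ (mu_tilde / sigma2_tilde)
        for i in range(N):
            # (27): cavity parameters obtained by removing site i.
            s2_cav = 1.0 / (1.0 / Sigma[i, i] - 1.0 / sigma2_tilde[i])
            m_cav = s2_cav * (mu[i] / Sigma[i, i] - mu_tilde[i] / sigma2_tilde[i])
            # (30)-(31): moments of the tilted marginal r(f_i) of (28)-(29).
            z = y[i] * m_cav / np.sqrt(1.0 + s2_cav)
            ratio = norm.pdf(z) / norm.cdf(z)
            m_hat = m_cav + y[i] * s2_cav * ratio / np.sqrt(1.0 + s2_cav)
            s2_hat = s2_cav - s2_cav**2 * ratio * (z + ratio) / (1.0 + s2_cav)
            # (32): site updates (negative site variances can occur and would need
            # special handling in a robust implementation).
            sigma2_tilde[i] = 1.0 / (1.0 / s2_hat - 1.0 / s2_cav)
            mu_tilde[i] = sigma2_tilde[i] * (m_hat / s2_hat - m_cav / s2_cav)
    return mu_tilde, sigma2_tilde
```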

IV. MODEL SELECTION FOR GPCS

The choice of the hyperparameter vector Θ, which governs the covariance function, is important as it has a direct impact on the accuracy of the GPC. This problem is known as the model selection issue. Similarly to traditional classifiers, this hyperparameter vector can be determined empirically by cross-validation or by using an independent set of labeled samples called validation samples. As an alternative, the intrinsic nature of GPCs allows a Bayesian treatment for the estimation of Θ. For such a purpose, one may resort to the type-II maximum-likelihood estimation procedure. It consists in the maximization of the marginal likelihood with respect to Θ, which is the integral of the likelihood times the prior

p(y \mid X) = p(y \mid X, \Theta) = \int p(y \mid f) \, p(f \mid X, \Theta) \, df    (34)

Both the Laplace and EP approximations can provide an approximation to the marginal likelihood in (34) and, consequently, an estimation Θ* of the hyperparameter vector Θ.


A. Model Selection With Laplace Approximation

The Laplace approximation of the log marginal likelihood is given as follows [18]:

\log p(y \mid X, \Theta) \simeq \log q(y \mid X, \Theta) = -\frac{1}{2} \hat{f}^T K^{-1} \hat{f} + \log p(y \mid \hat{f}, \Theta) - \frac{1}{2} \log |B|    (35)

where |B| = |I + W^{1/2} K W^{1/2}|. The computation of Θ* is made by maximizing (35) with respect to Θ

\Theta^* = \arg\max_{\Theta} \log q(y \mid X, \Theta)    (36)

The dependence of the approximate marginal likelihood on Θ is twofold. Indeed, the partial derivatives \partial \log q(y \mid X, \Theta) / \partial \Theta_j include two terms, namely, an explicit dependence via the terms involving the covariance matrix K and an implicit dependence on the posterior \hat{f}

\frac{\partial \log q(y \mid X, \Theta)}{\partial \Theta_j} = \left. \frac{\partial \log q(y \mid X, \Theta)}{\partial \Theta_j} \right|_{\text{explicit}} + \sum_{i=1}^{N} \frac{\partial \log q(y \mid X, \Theta)}{\partial \hat{f}_i} \frac{\partial \hat{f}_i}{\partial \Theta_j}    (37)

where

\left. \frac{\partial \log q(y \mid X, \Theta)}{\partial \Theta_j} \right|_{\text{explicit}} = \frac{1}{2} \hat{f}^T K^{-1} \frac{\partial K}{\partial \Theta_j} K^{-1} \hat{f} - \frac{1}{2} \mathrm{tr}\left[ (W^{-1} + K)^{-1} \frac{\partial K}{\partial \Theta_j} \right]    (38)

\frac{\partial \log q(y \mid X, \Theta)}{\partial \hat{f}_i} = -\frac{1}{2} \mathrm{tr}\left[ (K^{-1} + W)^{-1} \frac{\partial W}{\partial \hat{f}_i} \right]    (39)

The derivative \partial \hat{f}_i / \partial \Theta_j is evaluated by differentiating (13), yielding

\frac{\partial \hat{f}}{\partial \Theta_j} = \frac{\partial K}{\partial \Theta_j} \nabla \log p(y \mid \hat{f}) + K \frac{\partial \nabla \log p(y \mid \hat{f})}{\partial \Theta_j} = (I + KW)^{-1} \frac{\partial K}{\partial \Theta_j} \nabla \log p(y \mid \hat{f})    (40)

By substituting (38)–(40) into (37), we obtain \partial \log q(y \mid X, \Theta) / \partial \Theta_j, which can then be exploited to feed a conjugate-gradient optimization method for solving (36).

B. Model Selection With EP Approximation

In this case, the approximate marginal likelihood can be determined from the normalization factor Z_{EP}

Z_{EP} = q(y \mid X, \Theta) = \int p(f \mid X, \Theta) \prod_{i=1}^{N} t_i(f_i \mid \tilde{Z}_i, \tilde{\mu}_i, \tilde{\sigma}_i^2) \, df    (41)

After some algebraic manipulations, one can obtain [20]

\log Z_{EP} = -\frac{1}{2} \log |K + \tilde{\Sigma}| - \frac{1}{2} \tilde{\mu}^T (K + \tilde{\Sigma})^{-1} \tilde{\mu} + \sum_{i=1}^{N} \log \Phi\left( \frac{y_i \mu_{-i}}{\sqrt{1 + \sigma_{-i}^2}} \right) + \frac{1}{2} \sum_{i=1}^{N} \log\left( \sigma_{-i}^2 + \tilde{\sigma}_i^2 \right) + \sum_{i=1}^{N} \frac{(\mu_{-i} - \tilde{\mu}_i)^2}{2\left( \sigma_{-i}^2 + \tilde{\sigma}_i^2 \right)}    (42)

Similarly to the Laplace approximation, the hyperparameter vector Θ* is obtained by maximizing the marginal log likelihood through its partial derivative, which is expressed as

\frac{\partial \log Z_{EP}}{\partial \Theta_j} = -\frac{1}{2} \mathrm{tr}\left[ \left( (K + \tilde{\Sigma})^{-1} - (K + \tilde{\Sigma})^{-1} \tilde{\mu} \tilde{\mu}^T (K + \tilde{\Sigma})^{-1} \right) \frac{\partial K}{\partial \Theta_j} \right]    (43)

V. SPARSE GPS

A major drawback of GPCs is their high computational burden as the algorithm complexity is O(N^3) and the memory requirement is O(N^2), which makes them impractical for large data sets. To overcome this problem, several sparse-approximation methods have been proposed in the literature [23]–[25]. In general, they are based on the idea of selecting a subset of d (d < N) training points, called active set, from the whole set by optimizing a predefined criterion. In this paper, we will use the simple and fast sparse-approximation method, called IVM, described in [24].

In brief, in the IVM, the selection of the active set I of size d aims at keeping the decision boundary between the different classes as close as possible to the boundary yielded by the whole training set J. Since an exhaustive search is intractable, the IVM performs the selection process by including one sample at a time in the set I, the one which results in the largest differential entropy score related to the posterior p(f|D). In other words, the ith sample is selected if

i = \arg\max_{k \in J \setminus I} \{\Delta H_k\}, \qquad \text{where } \Delta H_k = H\left( q(I \cup \{k\}) \right) - H\left( q(I) \right)    (44)

where q(I) is a Gaussian approximation to the posterior p(f|D) computed on the samples of I. As a result, the computational burden and the memory requirement are reduced to O(d^2 N) and O(dN), respectively. For more details about the IVM, we refer the reader to [24] and [26].
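A minimal sketch of the greedy selection rule (44) is given below. It scores each candidate by the differential entropy reduction of the posterior Gaussian and grows the active set one sample at a time; for simplicity it assumes a Gaussian-noise (regression-style) posterior update, whereas the actual IVM of [24] uses likelihood-specific assumed-density-filtering updates.

```python
import numpy as np

def ivm_select(K, d, noise_var=1.0):
    """Greedy active-set selection in the spirit of (44): at each step pick the
    candidate whose inclusion yields the largest differential entropy reduction.
    A Gaussian-noise posterior update is assumed here for simplicity."""
    Sigma = K.copy()                      # current posterior covariance of the latent values
    active = []
    remaining = list(range(K.shape[0]))
    for _ in range(d):
        # Entropy reduction for including candidate k: 0.5 * log(1 + Sigma_kk / noise_var).
        k_best = max(remaining, key=lambda k: 0.5 * np.log1p(Sigma[k, k] / noise_var))
        # Rank-one update of the posterior covariance after including k_best.
        s = Sigma[:, k_best]
        Sigma = Sigma - np.outer(s, s) / (Sigma[k_best, k_best] + noise_var)
        active.append(k_best)
        remaining.remove(k_best)
    return active

# Example: select an active set of half the training samples from the Gram matrix K.
# active_idx = ivm_select(K, d=K.shape[0] // 2)
```

The sketch keeps the full N x N matrix for clarity; the O(d^2 N) and O(dN) figures quoted above are obtained by storing only the d active columns of the covariance.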

VI. EXPERIMENTAL DESIGN

A. Data-Set Description

In order to evaluate the performances of GPCs implemented with both the Laplace and EP approximations, three data sets


TABLE I NUMBER OF TRAINING AND TEST SAMPLES FOR THE AVIRIS DATA SET

were used to conduct the experiments. They are of the multispectral, multisensor, and hyperspectral types, respectively. As the classification of these data sets involves the simultaneous discrimination of several information classes, in this paper, we will adopt the simple multiclass classification strategy based on the decomposition into binary classification problems. In particular, we shall consider the well-known one-against-one strategy [6]. The first data set represents a pan-sharpened multispectral image of 800 × 800 pixel size acquired by the QuickBird satellite in August 2004 in the region of El Tarf in northeastern Algeria. It consists of five main land covers, which are bare soil (ω1 ), forest (ω2 ), residential area (ω3 ), dirt road (ω4 ), and asphalt (ω5 ). For each class, the number of samples used for training was 100 samples, whereas it was 100 samples for the test. The second data set, which is a 494 × 882 pixel-size multisensor image acquired by the Landsat TM and ERS-1 SAR sensors in May 1994, represents an agricultural area located in the basin of the Po river in northern Italy. It consists of 11 features, i.e., six intensity features from the Landsat TM image, one intensity feature from the ERS-1 SAR image, and four texture features (entropy, difference entropy, correlation, and sum variance) extracted from the ERS-1 SAR image by computing the gray-level co-occurrence matrix. The main land covers in the scene are five: dry and wet rice fields (ω1 and ω2 ), cereals and corn fields (ω3 and ω4 ), and woods (ω5 ). The numbers of training and test samples for each class are 100 and 120 samples, respectively. The hyperspectral data set represents a section of 145 × 145 pixels of a scene taken over NW Indiana’s Indian Pines by the AVIRIS sensor in 1992 [27]. Of the 220 spectral channels acquired by the AVIRIS sensor, 20 were discarded because they were affected by atmospheric problems. The ground truth consists of nine land-cover classes and is represented by a set of 1800 training and 4588 test samples (see Table I). In order to illustrate the complexity of the three data sets, 2-D distributions of the land-cover classes characterizing each of them are shown in Fig. 2. B. Experimental Scheme Four main sets of experiments were conducted on the considered data sets. In the first set of experiments, we performed classification by means of GPCs on the original training sets. GPCs
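Since each binary GPC returns a class probability, the one-against-one strategy mentioned above reduces to a simple pairwise vote at prediction time. The sketch below shows one possible aggregation rule (an illustration with hypothetical inputs, not necessarily the exact tie-breaking used in the experiments).

```python
import numpy as np

def one_against_one_predict(pairwise_probs, n_classes):
    """Pairwise (one-against-one) voting for a single test sample.
    pairwise_probs maps a class pair (a, b) to the probability of class a
    returned by the binary GPC trained on that pair."""
    votes = np.zeros(n_classes, dtype=int)
    for (a, b), p_a in pairwise_probs.items():
        votes[a if p_a >= 0.5 else b] += 1
    return int(np.argmax(votes))

# Hypothetical example with three classes: class 1 collects two votes and wins.
probs = {(0, 1): 0.2, (0, 2): 0.6, (1, 2): 0.7}
print(one_against_one_predict(probs, n_classes=3))
```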

Fig. 2. Two-dimensional distribution of the land-cover classes characterizing the three data sets used in the experiments. (a) El Tarf (red and NIR features). (b) Po (ERS-1 SAR features). (c) AVIRIS. For better visualization, just 25 samples were randomly selected for each class from the training set. In the case of the AVIRIS data set, the best couple of discriminative features obtained with the SA algorithm are used.


TABLE II PERFORMANCE IN TERMS OF OA, AA, AND CLASS PERCENTAGE ACCURACIES OBTAINED ON THE TEST SAMPLES BY THE GPC-LP-RBF, GPC-LP-NN, GPC-EP-RBF, GPC-EP-NN, AND SVM CLASSIFIERS WHEN TRAINED ON THE ORIGINAL TRAINING SET OF (a) EL TARF, (b) PO, AND (c) AVIRIS DATA SETS. IN ADDITION, THE BEST RESULTS OBTAINED BY BOTH IVM-RBF AND IVM-NN ON THESE DATA SETS ARE ALSO REPORTED

were implemented through both the Laplace and EP approximations and by adopting the RBF and NN covariance functions introduced in Section II. The corresponding classifiers are termed GPC-LP-RBF, GPC-LP-NN, GPC-EP-RBF, and GPC-EP-NN, respectively. In the second set of experiments, we assessed the sensitivity of these classifiers with respect to the number of training samples. This was useful to understand the generalization capabilities of GPCs when subject to a problem of limited training-sample availability. In the third set of experiments, which were devoted to the hyperspectral data set, we performed classification in feature subspaces of various dimensionalities to test the different GPCs against the curse of dimensionality. The feature subspaces were determined according to the steepest ascent (SA) feature selection algorithm [28]. Finally, in the fourth set of experiments, we evaluated the pros and cons of simplifying the GPCs by means of a sparse implementation like the IVM classifier described in the previous section.

The classification performances were obtained on a personal computer with a 1.86-GHz processor and are expressed in terms of class-by-class, overall (OA), and average (AA) accuracies in addition to the computation time. Furthermore, the statistical significance of differences between the accuracies achieved by the different classifiers was assessed using McNemar's test, which is based on the standardized normal test statistic [29], [30]

Z_{ij} = \frac{f_{ij} - f_{ji}}{\sqrt{f_{ij} + f_{ji}}}    (45)

Fig. 3. WCAs achieved by the investigated classifiers on the test samples of the three data sets.

where Z_{ij} measures the pairwise statistical significance of the difference between the accuracies of the ith and jth classifiers. f_{ij} stands for the number of samples classified correctly by the ith classifier and wrongly by the jth one; accordingly, f_{ij} and f_{ji} are the counts of samples on which the two classifiers disagree. At the commonly used 5% level of significance, the difference of accuracies between the ith and jth classifiers is said to be statistically significant if |Z_{ij}| > 1.96.

For comparison purposes, the results obtained by GPCs were contrasted against those yielded by the state-of-the-art SVM classifier implemented with the Gaussian kernel and the same multiclass strategy adopted for GPCs. In all the experiments and for both the RBF and NN covariance functions, the hyperparameter vector Θ was initially set to Θ0 = [1, 1].
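For reference, the McNemar statistic of (45) only needs the two disagreement counts, as in the following small sketch (NumPy assumed; array names are ours).

```python
import numpy as np

def mcnemar_z(pred_i, pred_j, y_true):
    """Standardized McNemar statistic Z_ij of (45) between classifiers i and j."""
    correct_i = pred_i == y_true
    correct_j = pred_j == y_true
    f_ij = np.sum(correct_i & ~correct_j)   # samples i classifies correctly and j wrongly
    f_ji = np.sum(~correct_i & correct_j)   # samples j classifies correctly and i wrongly
    return (f_ij - f_ji) / np.sqrt(f_ij + f_ji)

# |Z_ij| > 1.96 indicates a statistically significant difference at the 5% level.
```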


TABLE III STATISTICAL SIGNIFICANCE OF DIFFERENCES IN CLASSIFICATION ACCURACY BETWEEN THE INVESTIGATED CLASSIFIERS EXPRESSED BY MEANS OF MCNEMAR’S TEST. STATISTICALLY SIGNIFICANT DIFFERENCES AT THE 5% LEVEL OF SIGNIFICANCE (|Zij | ≥ 1.96) ARE HIGHLIGHTED IN BOLDFACE. POSITIVE VALUES INDICATE THAT THE ith CLASSIFIER HAS A CLASSIFICATION ACCURACY THAT IS HIGHER THAN THAT OF THE jth ONE

Concerning the SVM classifier, the regularization and kernel width parameters were varied in the ranges of [10^{-3}, 200] and [10^{-3}, 2], respectively.

VII. EXPERIMENTAL RESULTS

A. Experiment Set 1: Classification With the Original Training Sets

In this set of experiments, for each data set, we trained separately the five classifiers GPC-LP-RBF, GPC-LP-NN, GPC-EP-RBF, GPC-EP-NN, and SVM on the original training set. The tuning of the hyperparameter vector Θ for both GPC-LP and GPC-EP was performed according to the Bayesian estimation procedures described in Section IV. Since we adopted the one-against-one strategy for multiclass classification, for each pair of classes, the optimal hyperparameter vector Θ* was computed in order to maximize the corresponding marginal log likelihood. Concerning the SVM classifier, the regularization and kernel width parameters were obtained according to a threefold cross-validation procedure.

Table II reports the classification results achieved on the test set for the three data sets. In detail, for the El Tarf data set, the OAs yielded by the GPC-LP-RBF, GPC-LP-NN, GPC-EP-RBF, GPC-EP-NN, and SVM are 94%, 94.2%, 95.2%, 96%, and 93.80%, respectively. For the Po data set, the OAs are 93%, 94.33%, 92.66%, 93.83%, and 92.83%, respectively. For the AVIRIS data set, the OAs (and AAs) achieved by the five classifiers are 88.23% (and 91.52%), 87.40% (91.10%), 87.81% (90.93%), 87.53% (90.94%), and 87.66% (91.05%), respectively. Concerning the class accuracies, we show in Fig. 3 the worst class accuracies (WCAs) obtained by the five investigated

classifiers on the three data sets. As can be seen, the GPCs provided, in all but two cases, slightly better or close WCAs compared to what was achieved by the SVM. From a statistical point of view, McNemar's test (see Table III) shows that both GPC-EP-NN and GPC-LP-NN provide statistically significant differences in the classification accuracy with respect to the SVM for the El Tarf and Po data sets, respectively. Concerning the AVIRIS data set, the difference in accuracy between the four GPCs and the SVM is not statistically significant as their accuracies were very close. From this first set of experiments, one can conclude that both GPC-LP and GPC-EP prove effective in classifying remote sensing data as they compete with the SVM in terms of classification accuracy. Regarding computation time, the GPC-EP is the most demanding, while the SVM is the most efficient.

B. Experiment Set 2: Sensitivity to the Number of Training Samples

This set of experiments was aimed at assessing the sensitivity of the five classifiers with respect to the number of training samples. To this end, we considered for each data set three experimental scenarios similar to the first experiments but characterized by different reduced training-set sizes. For the El Tarf and Po data sets, the training-set size was reduced to 75, 50, and 25 samples, whereas it was decreased to 100, 50, and 25 samples for the AVIRIS data set. For each scenario, in order to obtain a reliable assessment of the classification accuracies, we carried out five different trials, each with a new set of randomly selected training samples, while the test set was kept unchanged. The results of these five trials were thus averaged.


Fig. 5. OAs achieved on the test samples of the AVIRIS data set by the GPC-LP-RBF, GPC-LP-NN, GPC-EP-RBF, GPC-EP-NN, and SVM classifiers versus the number of selected features.

89.41%; whereas they are 76.89%, 73.68%, 77.04%, 74.06%, and 77.33% for the AVIRIS data set, respectively. From these results, one can conclude the following: 1) both GPC-LP-RBF and GPC-EP-RBF can provide accuracies that are very close to those of the SVM; 2) the GPC-LP-NN and GPC-EP-NN classifiers are more sensitive to the problem of limited training-sample availability; and 3) the GPC-LP and GPC-EP classifiers exhibit very similar accuracies, but the former is less computationally demanding.

C. Experiment Set 3: Sensitivity to the Curse of Dimensionality

Fig. 4. OAs achieved on the test set by the GPC-LP-RBF, GPC-LP-NN, GPC-EP-RBF, GPC-EP-NN, and SVM classifiers versus the number of training samples for the (a) El Tarf, (b) Po, and (c) AVIRIS data sets.

The model selection issue was handled using the same procedures adopted in the first experiments. Fig. 4 shows the trends of the OA obtained on the test set versus the number of training samples for the five classifiers. As expected, in general, the OA exhibits a decreasing behavior for a decreasing training-set size. In general, GPCs based on the RBF covariance function show interesting behaviors as they yielded classification results that are statistically better (according to McNemar’s test) or very close to those obtained by the SVM, whereas those based on the NN covariance function are less competitive, particularly for the Po and AVIRIS data sets. In greater detail, for the El Tarf data set, the OAs averaged over the four experimental scenarios and yielded by the GPC-LP-RBF, GPC-LP-NN, GPC-EP-RBF, GPC-EP-NN, and SVM for the three scenarios are equal to 93.11%, 93.39%, 93.05%, 94.44%, and 92.21%, respectively. Concerning the Po data set, the averaged OAs yielded by these classifiers are 89.47%, 84.76%, 89.47%, 84.72%, and

In this set of experiments, we compared the five investigated classifiers with respect to an important issue that is intrinsic to hyperspectral images, which is the curse of dimensionality. For such a purpose, we trained these classifiers on the AVIRIS data set in feature subspaces of various dimensionalities. The desired number of features was varied from 10 to 100 with a step of 10. In general, as can be seen from Fig. 5, the four GPCs exhibited behaviors that are similar to that of the SVM for the different feature subspace dimensionalities. Among these classifiers, the GPC-EP-NN yields slightly better classification accuracies. In greater detail, the GPC-LP-RBF, GPC-EP-RBF, and GPC-LP-NN exhibit the maximum classification accuracy when using the 40 most discriminative features. Their OAs (and AAs) are equal to 89.79% (and 92.43%), 89.77% (92.31%), and 89.78% (92.57%), respectively. For the GPC-EP-NN classifier, the optimal number of features is 50, corresponding to an OA (and AA) of 90.4% (and 92.83%), whereas for the SVM classifier, it was equal to 60, corresponding to an OA (and AA) of 89.86% (and 92.75%). Comparing these results with those yielded in the original hyperdimensional feature space, one can see that the GPC-LP-RBF and GPC-EP-RBF classifiers gained 1.56% (and 0.91%) and 1.96% (1.38%) in terms of OA (and AA), respectively. The GPC-LP-NN and GPC-EP-NN classifiers gained 2.37% (1.21%) and 2.87% (1.89%), respectively, whereas the SVM gained 2.6% (1.7%). This points out that the classifier that is the least sensitive to the curse of dimensionality is the GPC-LP-RBF, while the most sensitive one is the GPC-EP-NN. Note that, for all these classifiers, the WCA was obtained for the


Fig. 6. OAs achieved on the test samples by the IVM-RBF and IVM-NN classifiers versus the active-set size for the (a) El Tarf, (b) Po, and (c) AVIRIS data sets.

"soybean-min till" (ω7) class. It is equal to 79.64%, 78.98%, 79.96%, 81.43%, and 79.23% for the GPC-LP-RBF, GPC-LP-NN, GPC-EP-RBF, GPC-EP-NN, and SVM, respectively. These accuracies are, however, clearly better than those achieved without feature reduction [see Table II(c)].


Fig. 7. Computation time of the IVM-RBF and IVM-NN classifiers versus the active-set size for the (a) El Tarf, (b) Po, and (c) AVIRIS data sets.

At first glance, by selecting just 50% (and even less) of samples from the original training set, the IVM classifier is able to provide accuracies that are very close to those yielded by the full GPCs, while sharply reducing the computation time (see Fig. 7). In more detail, for the El Tarf data set, the highest accuracy yielded by the IVM-RBF and IVM-NN classifiers is equal to 95.4%, using 50% and 12.5% of the training samples, respectively. For the Po data set, the best accuracy is obtained for active-set sizes of 62.5% and 50%, and the corresponding OAs are 93.33% and 93%, respectively. Concerning the AVIRIS data set, the OAs are 87.55% and 88.05%, corresponding to active-set sizes of 62.5% and 50% of the samples, respectively. The detailed classification results in terms of class-by-class accuracies and computation time obtained on these data sets are also reported in Table II. In some cases, these


classifiers proved capable of yielding slightly better accuracies than those provided by the full GPCs. According to McNemar's statistical test (see Table III), we observe that the difference in accuracy between the IVM-RBF/IVM-NN and the full GPCs in almost all cases is not significant. The same holds true when making the comparison with the SVM for the Po and AVIRIS data sets. For the El Tarf data set, the IVM-RBF and IVM-NN yielded statistically significant differences in accuracy with respect to the SVM. The main problem with these sparse GPC implementations is, however, how to guess the best size for the active set. Regarding the computation time, the IVM-RBF and IVM-NN classifiers consumed 67 and 31 s for the El Tarf data set, 89 and 152 s for the Po data set, and 0.62 and 0.85 h for the AVIRIS data set, respectively. However, these sharp reductions in computation time with respect to the full GPCs are still not sufficient to make them more efficient than the SVM. Nonetheless, they offer the advantages that, unlike the SVM, the model selection procedure is fully automatic and that they provide probabilistic outputs.

VIII. CONCLUSION

This paper has presented a thorough experimental analysis of the capabilities of GPCs for classifying remote sensing images. Compared to the state-of-the-art SVM classification approach, GPCs exhibit the following interesting methodological properties: 1) based on a Bayesian classification framework, they provide not only probabilistic outputs but also discriminant function outputs, thus making them well adapted to particular classification problems, such as those involving spatiotemporal contextual information; 2) other than the class posterior probability estimate, they yield a variance estimate that can be exploited as a confidence value; and 3) their Bayesian nature allows the following: i) to integrate any kind of prior information in the classification process through the prior distribution; ii) to use evidence for fully automatic estimation of the hyperparameters of the covariance function; and iii) to perform feature selection by using automatic relevance determination kernels [31].

The experimental results obtained on multispectral, multisensor, and hyperspectral data sets show that GPCs based on the Laplace and EP approximations can compete with the SVM classifier in terms of classification accuracy. In addition, similar to the SVM, they exhibit a low sensitivity to the problems of training-sample availability and curse of dimensionality. By comparing the Laplace and EP approximations, the former appears to be the most attractive one when putting in the balance both classification accuracy and computation time. The RBF covariance function seems better adapted than the NN covariance function, in particular when the classification problem is subject to a strong limitation of training-sample availability. The main drawback of GPCs is undoubtedly the amount of computation time required (particularly in the case of the EP approximation) because of the matrix inversions performed during the learning phase. However, to tackle this issue, we have seen that a sparse-approximation method like the IVM

can be adopted successfully due to its capability of sharply simplifying the classification model without (significant) losses in terms of discrimination power. In this context, however, an open problem is finding a way to estimate the size of the active set that provides the best tradeoff between classification accuracy and computation time.

ACKNOWLEDGMENT

The authors would like to thank Prof. D. Landgrebe for providing the AVIRIS data [27], Dr. C.-C. Chang and Dr. C.-J. Lin for supplying the LIBSVM software [32], Dr. C. E. Rasmussen and Dr. K. I. Williams for providing the GPC software [33], and Dr. N. Lawrence for making available the IVM software [34] used in the context of this paper.

R EFERENCES [1] R. Duda, S. Hart, and D. G. Stork, Pattern Classification, 2nd ed. New York: Wiley, 2001. [2] J. S. Taylor and N. Cristianini, Kernel Methods for Pattern Analysis. Cambridge, U.K.: Cambridge Univ. Press, 2004. [3] V. Vapnik, Statistical Learning Theory. New York: Wiley, 1998. [4] G. M. Foody and A. Mathur, “A relative evaluation of multiclass image classification by support vector machines,” IEEE Trans. Geosci. Remote Sens., vol. 42, no. 6, pp. 1335–1343, Jun. 2004. [5] G. Camps-Valls, L. Gómez-Chova, J. Calpe-Maravilla, E. Soria-Olivas, J. D. Martín-Guerrero, L. Alonso-Chordá, and J. Moreno, “Robust support vector method for hyperspectral data classification and knowledge discovery,” IEEE Trans. Geosci. Remote Sens., vol. 42, no. 7, pp. 1530– 1542, Jul. 2004. [6] F. Melgani and L. Bruzzone, “Classification of hyperspectral remote sensing images with support vector machine,” IEEE Trans. Geosci. Remote Sens., vol. 42, no. 8, pp. 1778–1790, Aug. 2004. [7] B. Wohlberg, D. M. Tartakovsky, and A. Guadagnini, “Subsurface characterization with support vector machines,” IEEE Trans. Geosci. Remote Sens., vol. 44, no. 1, pp. 47–57, Jan. 2006. [8] L. Zhang, X. Huang, B. Huang, and P. Li, “A pixel shape index coupled with spectral information for classification of high spatial resolution remotely sensed imagery,” IEEE Trans. Geosci. Remote Sens., vol. 44, no. 10, pp. 2950–2961, Oct. 2006. [9] Y. Bazi and F. Melgani, “Toward an optimal SVM classification system for hyperspectral remote sensing images,” IEEE Trans. Geosci. Remote Sens., vol. 44, no. 11, pp. 3374–3385, Nov. 2006. [10] B. Waske and J. A. Benediktsson, “Fusion of support vector machines for classification of multisensor data,” IEEE Trans. Geosci. Remote Sens., vol. 45, no. 12, pp. 3858–3866, Dec. 2007. [11] B. Waske and S. Van der Linden, “Classifying multilevel imagery from SAR and optical sensors by decision fusion,” IEEE Trans. Geosci. Remote Sens., vol. 46, no. 5, pp. 1457–1466, May 2008. [12] E. Blanzieri and F. Melgani, “Nearest neighbor classification of remote sensing images with the maximum marginal principle,” IEEE Trans. Geosci. Remote Sens., vol. 46, no. 6, pp. 1804–1811, Jun. 2008. [13] A. M. Filippi and R. Archibald, “Support vector machine-based endmember extraction,” IEEE Trans. Geosci. Remote Sens., vol. 47, no. 3, pp. 771–791, Mar. 2009. [14] N. Ghoggali, F. Melgani, and Y. Bazi, “A multiobjective genetic SVM approach for classification problems with limited training samples,” IEEE Trans. Geosci. Remote Sens., vol. 47, no. 6, pp. 1707–1718, Jun. 2009. [15] C. K. I. Williams and D. Barber, “Bayesian classification with Gaussian processes,” IEEE Trans. Pattern Anal. Mach. Intell., vol. 20, no. 12, pp. 1342–1351, Dec. 1998. [16] M. Gibbs and D. J. C. MacKay, “Variational Gaussian process classifiers,” IEEE Trans. Neural Netw., vol. 11, no. 6, pp. 1458–1464, Nov. 2000. [17] T. P. Minka, “A family of algorithm for approximate Bayesian inference,” Ph.D. dissertation, MIT, Cambridge, MA, 2001. [18] M. Kuss and C. Rasmussen, “Assessing approximate inference for binary Gaussian process classification,” J. Mach. Learn. Res., vol. 6, pp. 1679–1704, Dec. 2005.


[19] H. C. Kim and Z. Ghahramani, “Bayesian Gaussian process classification with the EM-EP algorithm,” IEEE Trans. Pattern Anal. Mach. Intell., vol. 28, no. 12, pp. 1948–1959, Dec. 2006. [20] C. Rasmussen and C. K. I. Williams, Gaussian Process for Machine Learning. Cambridge, MA: MIT Press, 2006. [21] M. Girolami and S. Rogers, “Variational Bayesian multinomial probit regression with Gaussian process priors,” Neural Comput., vol. 18, no. 8, pp. 1790–1817, Aug. 2006. [22] C. K. I. Williams, “Computation with infinite neural networks,” Neural Comput., vol. 10, no. 5, pp. 1203–1216, Jul. 1998. [23] L. Csato and M. Opper, “Sparse representation for Gaussian process models,” in Proc. NIPS, 2000, vol. 13, pp. 444–450. [24] N. D. Lawrence, J. C. Platt, and M. I. Jordan, “Extensions of the informative vector machine,” in Proc. Deterministic Stat. Methods Mach. Learn., 2005, vol. 3635, pp. 56–87. [25] E. Snelson and Z. Ghahramani, “Sparse Gaussian processes using pseudo inputs,” in Advances in Neural Information Processing Systems, vol. 18. Cambridge, MA: MIT Press, 2006, pp. 1257–1264. [26] M. Seeger, “Bayesian Gaussian process models: PAC-Bayesian generalisation error bounds and sparse approximations,” Ph.D. dissertation, Univ. Edinburgh, Edinburgh, U.K., 2003. [27] AVIRIS NW Indiana’s Indian Pines 1992 Data Set. [Online]. Available: http://cobweb.ecn.purdue.edu/~landgreb/ [28] S. B. Serpico and L. Bruzzone, “A new search algorithm for feature selection in hyperspectral remote sensing images,” IEEE Trans. Geosci. Remote Sens., vol. 39, no. 7, pp. 1360–1367, Jul. 2001. [29] A. Agresti, Categorical Data Analysis, 2nd ed. New York: Wiley, 2002. [30] G. M. Foody, “Thematic map comparison: Evaluating the statistical significance of differences in classification accuracy,” Photogramm. Eng. Remote Sens., vol. 70, no. 5, pp. 627–633, 2004. [31] R. M. Neal, Bayesian Learning for Neural Networks, vol. 118. New York: Springer-Verlag, 1996. [32] C.-C. Chang and C.-J. Lin, LIBSVM—A Library for Support Vector Machines. [Online]. Available: http://www.csie.ntu.edu.tw/~cjlin/libsvm [33] C. E. Rasmussen and K. I. Williams, Gaussian Process Software. [Online]. Available: http://www.Gaussianprocess.org/gpml/code/ matlab/doc/ [34] N. Lawrence, IVM Software. [Online]. Available: http://www.cs.man. ac.uk/~neill/gpsoftware.html


Yakoub Bazi (S’05–M’07) received the State Engineer and M.Sc. degrees in electronics from the University of Batna, Batna, Algeria, in 1994 and 2000, respectively, and the Ph.D. degree in information and communication technology from the University of Trento, Trento, Italy, in 2005. From 2000 to 2002, he was a Lecturer with the University of M’sila, M’sila, Algeria. From January to June 2006, he was a Postdoctoral Researcher with the University of Trento, Trento. He is currently an Assistant Professor with the College of Engineering, Al-Jouf University, Al-Jouf, Saudi Arabia. His research interests include pattern recognition and evolutionary computation methodologies applied to remote sensing images and biomedical signal/images (change detection, classification, and semisupervised learning). He is a Referee for several international journals. Farid Melgani (M’04–SM’06) received the State Engineer degree in electronics from the University of Batna, Batna, Algeria, in 1994, the M.Sc. degree in electrical engineering from the University of Baghdad, Baghdad, Iraq, in 1999, and the Ph.D. degree in electronic and computer engineering from the University of Genoa, Genoa, Italy, in 2003. From 1999 to 2002, he cooperated with the Signal Processing and Telecommunications Group, Department of Biophysical and Electronic Engineering, University of Genoa. Since 2002, he has been an Assistant Professor of telecommunications with the University of Trento, Trento, Italy, where he has taught pattern recognition, machine learning, radar remote sensing systems, and digital transmission and is currently the Head of the Intelligent Information Processing Laboratory, Department of Information Engineering and Computer Science. His research interests include processing, pattern recognition, and machine learning techniques applied to remote sensing and biomedical signals/images (classification, regression, multitemporal analysis, and data fusion). He is a coauthor of more than 80 scientific publications and is a Referee for several international journals. Dr. Melgani has served on the scientific committees of several international conferences and is an Associate Editor for the IEEE GEOSCIENCE AND REMOTE SENSING LETTERS.
