CONTEXTUAL CLASSIFICATION OF HYPERSPECTRAL IMAGES BY SUPPORT VECTOR MACHINES AND MARKOV RANDOM FIELDS

Gabriele Moser, Sebastiano B. Serpico, and Michaela De Martino

University of Genoa, Department of Biophysical and Electronic Engineering, Via Opera Pia 11a, I-16145 Genoa, Italy; e-mail: [email protected]

ABSTRACT

In the context of hyperspectral-image classification, a key problem is represented by the Hughes phenomenon, which makes many supervised classifiers ineffective when applied to high-dimensional feature spaces. Furthermore, most traditional hyperspectral-image classifiers are noncontextual, i.e., they label each pixel based on its spectral signature while neglecting all interpixel correlations. In this paper, we propose a novel supervised classification method for hyperspectral images based on the integration of the support vector machine (SVM) and Markov random field (MRF) approaches. The method represents a rigorous contextual generalization of SVMs and relies on a reformulation of the Markovian minimum-energy rule as the application of an SVM in a suitably transformed space. The internal parameters of the method are automatically optimized by extending recently developed techniques based on the Ho-Kashyap and Powell numerical procedures. Experiments are carried out with a well-known hyperspectral data set, also combining the proposed classifier with the recently developed band-extraction approach to feature reduction.

Key words: hyperspectral image classification, support vector machines, Markov random fields, band extraction.

1. INTRODUCTION

Spectral signatures are a very important source of information that may allow land-cover classes and ground materials to be accurately separated by pattern-classification methods [16]. This is one of the main motivations for the interest in hyperspectral sensors, which provide a precise sampling of such signatures. This potential is strongly reinforced by the PRISMA and EnMAP satellite missions, which will offer hyperspectral data on a repetitive basis, thus representing crucial novel tools for time-monitoring applications. However, effectively exploiting this potential requires the development of novel methods to generate accurate classification maps from the resulting data.

In the framework of hyperspectral-image classification, a key problem is that classifying with hundreds of channels may be critical due to computational burden, large memory occupation, and, especially, small sample size. With regard to the last issue, the Hughes phenomenon may occur when the training set is not large enough to ensure a reliable estimation of the classifier parameters [16]. A further limitation is that most traditional hyperspectral-image classifiers are noncontextual, i.e., they classify each pixel based on its spectral bands while neglecting the spatial correlations among distinct pixels.

Here, both criticalities are jointly addressed by proposing a novel method based on integrating the support vector machine (SVM) [21] and Markov random field (MRF) [7] approaches to classification, thus developing a rigorous contextual generalization of SVMs (which are intrinsically noncontextual). SVMs are kernel-based learning methods aimed at optimizing a measure of the generalization capability [21] and have often been found quite robust to dimensionality issues [12]. MRFs are a general family of stochastic models that allow spatial information to be formalized in Bayesian image-analysis schemes in terms of the minimization of suitable "energy functions" [7]. Integrating these two frameworks is not trivial, due to the intrinsically non-Bayesian nature of SVMs.

Previous techniques aiming at incorporating spatial information into SVMs were mainly based on two approaches. In the first, Bayesian-like processing results were derived from SVM-based discriminant functions and combined with MRF models for the contextual information [2]. Semiheuristic methods to derive approximate posterior probabilities from an SVM discriminant function were proposed in [23] and used in [20, 11], and SVMs were directly used as probability-density estimators in [8]. The second approach applied noncontextual SVMs after contextual (e.g., textural) feature extraction, possibly fusing spectral and spatial features through dedicated composite-kernel formulations [4, 10, 3]. In this paper, a novel reformulation of the MRF minimum-energy decision rule is introduced and is analytically proven to be equivalent to the application of an SVM in a suitably transformed kernel-induced space.

This integrated SVM-MRF formulation allows the robustness of SVMs to overfitting and dimensionality issues and the contextual modeling capability of MRFs to be combined, in order to jointly exploit both the spectral and the spatial information associated with a hyperspectral image. The internal parameters of the method are automatically optimized by extending the techniques developed in [17, 14]; this makes the proposed method fully automatic. Experiments on a well-known hyperspectral data set are shown. In order to further improve robustness to the Hughes phenomenon, the method is also tested in combination with the recently proposed band-extraction (BE) approach to feature reduction from hyperspectral images [18].

2. METHODOLOGY

2.1. Noncontextual SVM-based classification

Given a hyperspectral image composed of d bands and including N pixels, we denote by Ω the (finite) set of classes present in the imaged scene and by x_h and ℓ_h (x_h ∈ R^d; ℓ_h ∈ Ω) the spectral feature vector and the class label of the h-th pixel, respectively (h = 1, 2, ..., N). Focusing first on binary classification (i.e., Ω = {ω_1, ω_2}), the SVM approach computes a linear discriminant function f_12 between ω_1 and ω_2 in a nonlinearly transformed space by minimizing an upper bound on the generalization error of the classifier and by adopting a kernel-based formulation [21]. Let K(·,·) be a kernel function, i.e., let there exist a Hilbert space H (endowed, as usual, with an inner product ⟨·,·⟩) and a mapping Φ: R^d → H such that K(x, x̄) = ⟨Φ(x), Φ(x̄)⟩ for all x, x̄ ∈ R^d [21]. The SVM-based discriminant function can be equivalently expressed as a linear function in H, i.e. (x ∈ R^d):

$$ f_{12}(x) = \langle w, \Phi(x) \rangle + b, \quad (1) $$

or as a weighted sum of nonlinear kernels centered on a subset of the training samples, i.e.:

$$ f_{12}(x) = \sum_{s \in S} \alpha_s y_s K(x_s, x) + b, \quad (2) $$

where w and b (w ∈ H, b ∈ R) are the weight and bias parameters of the linear discriminant function in H, S is a subset of the training samples (whose elements are named support vectors), y_s is a coefficient attached to the s-th training pixel such that y_s = 1 or y_s = -1 if the pixel belongs to the training set of ω_1 or ω_2, respectively, and both the selection of the samples in S and the computation of the coefficients α_s (s ∈ S) and b in Eq. (2) are obtained by solving a suitable quadratic programming (QP) problem [21]. Necessary and sufficient conditions (namely, Mercer's conditions) are known for a given function K to be a kernel; see [21] for a detailed discussion.
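To make Eq. (2) concrete, the following minimal sketch (our own illustration, not the authors' code; scikit-learn, a Gaussian kernel, and toy placeholder data are assumed) recovers the discriminant function from the dual coefficients of a trained SVM:

```python
import numpy as np
from sklearn.svm import SVC
from sklearn.metrics.pairwise import rbf_kernel

# Toy two-class data standing in for pixels of omega_1 (y = +1) and omega_2 (y = -1).
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, (50, 4)), rng.normal(2, 1, (50, 4))])
y = np.hstack([np.ones(50), -np.ones(50)])

gamma = 0.5  # Gaussian kernel: K(x, x') = exp(-gamma * ||x - x'||^2)
clf = SVC(kernel="rbf", gamma=gamma, C=10.0).fit(X, y)

# Eq. (2): f12(x) = sum_{s in S} alpha_s * y_s * K(x_s, x) + b.
# In scikit-learn, dual_coef_ stores alpha_s * y_s for the support vectors.
f12 = clf.dual_coef_ @ rbf_kernel(clf.support_vectors_, X, gamma=gamma) + clf.intercept_
assert np.allclose(f12.ravel(), clf.decision_function(X))
```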

2.2. MRF minimum-energy rule and SVMs

Assuming the collection {ℓ_h} (h = 1, ..., N) of the binary class labels to be an MRF with respect to a given neighborhood system [7], we integrate the MRF model with the SVM framework of Sec. 2.1 by proving that, under proper assumptions, the Markovian minimum-energy rule can be expressed as an SVM discriminant function associated with a suitable kernel. Specifically, let us assume here that the dimension dim H = D is finite and that the transformed feature vector Φ(x_h) of each h-th pixel, when conditioned to ℓ_h = ω_i, is Gaussian with mean µ_i and the same covariance matrix Σ for both classes (h = 1, 2, ..., N; i = 1, 2). The finite-dimension assumption is adopted as a working hypothesis: it operatively allows H to be identified (up to a suitable isometry) with R^D, the inner product ⟨·,·⟩ in H to be identified with the dot product in R^D, Φ(x_h) and µ_i to be interpreted as D-dimensional vectors, and Σ to be considered a D × D matrix. This simplifying assumption is removed in Sec. 2.4 through a more involved argument based on continuous- and discrete-time Gaussian processes. The equal-covariance assumption is in itself a restriction but, as also discussed in [15], a rather mild one, thanks to the extreme flexibility granted by the mapping Φ into a possibly very-high-dimensional space H. Under these assumptions, the distribution of the label of each pixel, when conditioned to the related feature vector and to the labels of all the other pixels, is given by (h = 1, 2, ..., N; i = 1, 2):

$$ P\{\ell_h = \omega_i \mid x_h, \ell_k, k \neq h\} = \frac{e^{-U(\omega_i|x_h,C_h)}}{\sum_{j=1}^{2} e^{-U(\omega_j|x_h,C_h)}}, \quad (3) $$

where

$$ U(\omega_i|x_h, C_h) = \lambda E(\omega_i|C_h) + \left\langle \Phi(x_h) - \mu_i, \, \Sigma^{-1}[\Phi(x_h) - \mu_i] \right\rangle \quad (4) $$

is the corresponding energy function, C_h is the set of the labels of the neighbors (i.e., the "spatial context") of the h-th pixel, E(·|·) is the spatial energy contribution associated with the adopted MRF model, and λ is a positive parameter tuning the relative weights of the pixelwise and contextual energy terms in Eq. (4) [7]. According to the Markovian minimum-energy decision rule, the local discriminant function corresponding to this energy function is:

$$ f_{12}(x_h|C_h) = U(\omega_2|x_h, C_h) - U(\omega_1|x_h, C_h). \quad (5) $$

Thanks to the equal-covariance assumption, this turns out to be a linear function of the transformed feature vector, i.e.:

$$ f_{12}(x_h|C_h) = \langle w, \Phi(x_h) \rangle + b + \lambda \varepsilon_h, \quad (6) $$

where w and b (w ∈ H, b ∈ R) are suitable algebraic combinations of µ_1, µ_2, and Σ, and:

$$ \varepsilon_h = E(\omega_2|C_h) - E(\omega_1|C_h). \quad (7) $$

If H′ = H ⊕ R is the direct sum of the space H and of the set R of real numbers, i.e., the (D + 1)-dimensional space of all pairs (u, r) (u ∈ H, r ∈ R), endowed with the inner product ⟨(u, r), (ū, r̄)⟩′ = ⟨u, ū⟩ + r r̄ (u, ū ∈ H; r, r̄ ∈ R) [5], then:

$$ f_{12}(x_h|C_h) = \langle w', \Phi'(x_h, \varepsilon_h) \rangle' + b, \quad (8) $$

where Φ′(x, r) = (Φ(x), λ^{1/2} r) (x ∈ R^d, r ∈ R) and w′ = (w, λ^{1/2}). Therefore, the Markovian minimum-energy rule turns out to be equivalent to a linear discriminant function in the transformed space H′ and can be obtained by augmenting the original d-dimensional feature space with the spatial feature ε_h and by mapping the resulting (d + 1)-dimensional space into H′ through Φ′. As in Sec. 2.1, this discriminant function can be equivalently computed as a kernel expansion, i.e.:

$$ f_{12}(x_h|C_h) = \sum_{s \in S} \alpha_s y_s K'(x_s, \varepsilon_s; x_h, \varepsilon_h) + b, \quad (9) $$

where the kernel is given by (x, x̄ ∈ R^d; r, r̄ ∈ R):

$$ K'(x, r; \bar{x}, \bar{r}) = \langle \Phi'(x, r), \Phi'(\bar{x}, \bar{r}) \rangle' = K(x, \bar{x}) + \lambda r \bar{r}, \quad (10) $$

and where the set S of the support vectors and the coefficients α_s (s ∈ S) and b can again be obtained by solving a QP problem with respect to the kernel in Eq. (10). This formulation also avoids the need to estimate the parameters µ_1, µ_2, and Σ of the Gaussian distributions, which would be a very critical (if not unfeasible) task since it would explicitly involve transformed samples in H. The kernel in Eq. (10) is a linear combination of two contributions, related to the pixelwise spectral information (through K) and to the spatial information (through the additional feature ε_h resulting from E); λ acts as a spatial smoothing parameter that tunes the tradeoff between these two contributions. Note that a similar kernel (namely, the "direct summation kernel") was introduced in [4], although without the present MRF interpretation, and that the adopted MRF model is intrinsically stationary (since µ_1, µ_2, Σ, and E are independent of the pixel location h). The latter is accepted as a simplifying analytical assumption.
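The kernel in Eq. (10) is straightforward to realize as a precomputed Gram matrix, which leaves the standard QP solver untouched: the contextual information enters only through the kernel, exactly as in Eq. (9). A minimal sketch (ours; a Gaussian kernel for K, and the names contextual_gram, lam, and eps are our placeholders):

```python
import numpy as np
from sklearn.svm import SVC
from sklearn.metrics.pairwise import rbf_kernel

def contextual_gram(X1, e1, X2, e2, gamma=0.5, lam=1.0):
    """Gram matrix of Eq. (10): K'(x, r; x_bar, r_bar) = K(x, x_bar) + lambda * r * r_bar."""
    return rbf_kernel(X1, X2, gamma=gamma) + lam * np.outer(e1, e2)

# Toy stand-ins: X holds spectral vectors, eps the spatial feature of Eq. (7).
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 4))
eps = rng.normal(size=100)
y = np.where(X[:, 0] + 0.5 * eps > 0, 1, -1)

svc = SVC(kernel="precomputed", C=10.0).fit(contextual_gram(X, eps, X, eps), y)
# Prediction needs the cross-Gram between the new pixels and the training pixels.
pred = svc.predict(contextual_gram(X, eps, X, eps))
```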

2.3. The proposed classification method

The proposed contextual hyperspectral-image classifier exploits the joint SVM-MRF framework introduced in Sec. 2.2 by adopting the iterated conditional mode (ICM) approach. ICM is an iterative MRF energy-minimization method that converges to (at least) a local minimum and usually represents a good tradeoff between classification accuracy and computational burden [7]. Let a spatial MRF energy E and a kernel K be chosen. The method is initialized with a preliminary classification map generated by a classical noncontextual SVM. Then, the following steps are iterated until convergence (a minimal sketch of this loop is given after the list):

1. update ε_h for each h-th pixel (h = 1, 2, ..., N), according to the current classification map;

2. compute the discriminant function in Eq. (9) by training an SVM (i.e., by solving the related QP problem) with the kernel in Eq. (10), the original spectral features, and the contextual feature updated in step 1;

3. update the class labels of all image pixels by applying the discriminant function resulting from step 2.
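One possible rendering of this loop (ours, not the authors' implementation): a Potts model on the 4-neighborhood is assumed for E, the contextual_gram helper from the previous sketch is reused, and the convergence test is replaced by a fixed iteration count.

```python
import numpy as np
from sklearn.svm import SVC

def potts_eps(labels):
    """Spatial feature of Eq. (7) for a Potts model on the 4-neighborhood, with
    labels coded as +1 (omega_1) and -1 (omega_2). Taking E(omega_i|C_h) =
    -(number of neighbors labeled omega_i), the sum of the four neighbor labels
    equals n1 - n2 = E(omega_2|C_h) - E(omega_1|C_h)."""
    lab = np.pad(labels.astype(float), 1, mode="edge")
    return lab[:-2, 1:-1] + lab[2:, 1:-1] + lab[1:-1, :-2] + lab[1:-1, 2:]

def icm_ma_svc(img, train_mask, train_labels, n_iter=5, gamma=0.5, lam=1.0, C=10.0):
    """Sketch of the ICM loop of Sec. 2.3; all parameter values are placeholders."""
    H, W, d = img.shape
    X_all = img.reshape(-1, d)
    X_tr = X_all[train_mask.ravel()]
    # Initialization: preliminary map from a noncontextual SVM.
    svc0 = SVC(kernel="rbf", gamma=gamma, C=C).fit(X_tr, train_labels)
    labels = svc0.predict(X_all).reshape(H, W)
    for _ in range(n_iter):
        eps = potts_eps(labels)                                       # step 1
        e_tr = eps.ravel()[train_mask.ravel()]
        gram = contextual_gram(X_tr, e_tr, X_tr, e_tr, gamma, lam)
        svc = SVC(kernel="precomputed", C=C).fit(gram, train_labels)  # step 2
        cross = contextual_gram(X_all, eps.ravel(), X_tr, e_tr, gamma, lam)
        labels = svc.predict(cross).reshape(H, W)                     # step 3
    return labels
```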

Note that there is no restriction on E, i.e., the MRF model used in the proposed classifier can be chosen arbitrarily. The method has been described for binary classification: the multiclass extension through the well-known one-against-one strategy is straightforward [21]. The method also involves several input parameters, i.e., λ, the usual regularization parameter C involved in the QP problems of SVMs [21], and possible further parameters inside the kernel K (e.g., the variance of a Gaussian kernel). In order to automatically optimize the SVM regularization and kernel parameters, the algorithm proposed in [14] for SVM-based regression is extended here to classification. This algorithm numerically minimizes, through Powell's procedure, an analytical upper bound on the leave-one-out error (a minimal sketch of this step is given at the end of this subsection). Such an upper bound can be defined for SVM regression as well as for classification [14]; this allows the algorithm in [14] to be generalized to the proposed classifier. The MRF parameter-optimization technique in [17] is applied to optimize λ. It is feasible for MRF models whose energy functions depend linearly on the unknown parameters (an assumption that holds for the relationship between the energy and λ; see Eq. (4)) and expresses the parameter optimization as the solution of a system of linear inequalities, addressed by the Ho-Kashyap procedure. Both parameter-optimization techniques are applied together with the initialization of the classifier. The proposed method will be denoted as "Markovian automatic support vector classifier" (MA-SVC) in the following.

SVM-based classifiers have often been found less severely affected by the Hughes phenomenon than Bayesian classifiers [12]. However, in order to further improve robustness to dimensionality, the combination of MA-SVC with feature reduction can be considered. In particular, the BE approach to feature reduction extracts, from a given set of hyperspectral channels, a set of synthetic multispectral channels that are optimized to maximize the accuracy in a given supervised classification problem; locally optimal discrete maximization strategies have been developed to this end [18]. Here, the "sequential forward BE" (SFBE) algorithm is used to optionally perform a preliminary reduction in the number of features employed for classification by MA-SVC (see [18] for details on SFBE).
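As an illustration of the Powell-based optimization step, the sketch below (ours) minimizes a classical and much simpler leave-one-out upper bound, the fraction of support vectors, as a stand-in for the tighter analytical bound of [14]; all names and parameter values are placeholders:

```python
import numpy as np
from scipy.optimize import minimize
from sklearn.svm import SVC

def loo_upper_bound(log_params, X, y):
    """A classical LOO-error upper bound for SVMs: (#support vectors) / (#samples).
    Stand-in for the analytical bound used in [14]."""
    C, gamma = np.exp(log_params)  # optimize in log-space to keep C, gamma > 0
    svc = SVC(kernel="rbf", C=C, gamma=gamma).fit(X, y)
    return svc.n_support_.sum() / len(y)

# Powell's procedure is derivative-free, which suits this piecewise-constant bound.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, (40, 4)), rng.normal(1.5, 1, (40, 4))])
y = np.hstack([np.ones(40), -np.ones(40)])
res = minimize(loo_upper_bound, x0=np.log([1.0, 0.5]), args=(X, y), method="Powell")
C_opt, gamma_opt = np.exp(res.x)
```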

2.4. Comments on the extension to infinite-dimensional kernel-induced spaces

Here, we briefly discuss the main ideas allowing the approach presented in Sec. 2.2 to be generalized to the case dim H = +∞ (analytical details are omitted for brevity). With the same notations as in Sec. 2.2, the Hilbert space H implicitly defined by the Mercer kernel K can always be assumed separable [21]. Therefore, up to an isometry, H can be identified with the space L²[0,1] of the Lebesgue square-integrable functions on the interval [0,1] [6], and Φ(x) (x ∈ R^d) can consequently be envisaged as an element of L²[0,1]. Similar to Sec. 2.2, we assume that, for all x ∈ R^d, Φ(x), when conditioned to each class, is a Gaussian random process defined on [0,1] with the same autocovariance function for both classes. In particular, this autocovariance is assumed to be continuous, which allows the Karhunen-Loeve expansion to be applied to the continuous-time random process Φ(x) [1]. Therefore, thanks to the related Karhunen-Loeve and whitening transformations, one may isometrically map L²[0,1] to a suitable Hilbert space H̄ such that: (i) the elements of H̄ are real-valued sequences, so that a random sequence {u(n)} (n = 0, 1, 2, ...) in H̄ is associated with each feature vector x ∈ R^d; (ii) {u(n)}, when conditioned to each class, is a discrete-time nonzero-mean white Gaussian random process [1]. Denoting by µ_i(n) = E{u(n)|ω_i} (n = 0, 1, 2, ...) the mean of the process {u(n)} conditioned to ω_i and introducing a zero-mean white Gaussian discrete-time random process {ν(n)}, one may consequently decompose u(n) = µ_i(n) + ν(n) (n = 0, 1, 2, ...) whenever x is drawn from ω_i (i = 1, 2).

Therefore, each feature vector is associated with a discrete-time random process that can be expressed as the related class-conditional mean corrupted by additive white Gaussian noise (AWGN). Thus, the methodological formulation of Bayesian decision theory for binary optimum receiver design under AWGN conditions can be extended to the present case. As proved in [22], the problem of optimally discriminating the two classes with respect to the MAP criterion, given the input random process {u(n)}, can be formalized as a simple classification problem involving two Gaussian distributions in a one- or two-dimensional transformed space. This operatively allows the finite-dimension formulation discussed in Sec. 2.2 to hold in the current case as well, thus enabling the application of the proposed approach in conjunction with kernel functions related to infinite-dimensional Hilbert spaces.
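In symbols (a compact sketch in our own notation, where φ_n and κ_n denote the eigenfunctions and eigenvalues of the common continuous autocovariance and z(n) the Karhunen-Loeve coefficients):

```latex
% Karhunen-Loeve expansion of the class-conditional Gaussian process Phi(x),
% followed by whitening of the expansion coefficients (notation is ours).
\begin{align}
  \Phi(x)(t) &= \sum_{n=0}^{\infty} z(n)\,\varphi_n(t),
  \qquad z(n) = \int_0^1 \Phi(x)(t)\,\varphi_n(t)\,dt, \\
  u(n) &= z(n)/\sqrt{\kappa_n}
  \;\Longrightarrow\;
  u(n) = \mu_i(n) + \nu(n), \qquad \nu(n) \sim \mathcal{N}(0,1)\ \text{i.i.d.}
\end{align}
```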

3. EXPERIMENTAL RESULTS

3.1. Data set and experimental set-up

Experiments were carried out with a well-known hyperspectral data set, consisting of a 202-band AVIRIS image (145 × 145 pixels; see Fig. 1(a)) acquired over NW Indian Pine in 1992 and presenting 9 main classes (Table 1) endowed with spatially disjoint training and test fields (details on data preparation can be found in [18]). Specifically, a Gaussian kernel and a Potts MRF model [7] were chosen for K and E, respectively.

Focusing first on the results of MA-SVC in the original 202-dimensional space (i.e., without feature reduction), experimental comparisons were performed with: (i) a noncontextual SVM applied in the original feature space; (ii) two classical Markovian classifiers (named MRF-Gauss and MRF-k-NN, respectively), based on ICM, on the Potts model, and on pixelwise posterior-probability estimates generated according to a Gaussian model for the class-conditional statistics or to the k-nearest-neighbor (k-NN) algorithm [16], respectively. The regularization and kernel parameters of the noncontextual SVM were again optimized by the technique in [14]. The parameter k of k-NN was optimized by cross-validation. The smoothing parameter of the Potts model involved in both MRF-Gauss and MRF-k-NN was optimized by the method in [17]. Due to the impact of the Hughes phenomenon on Gaussian Bayesian and k-NN classifiers, both techniques were applied after SFBE (as discussed in [18], the optimum reduced number of features for SFBE in the Gaussian case was 26; the same number was used for k-NN as well). Table 1 reports the resulting classification accuracies on the test set. The classification maps generated by MA-SVC, MRF-Gauss, and the noncontextual SVM are shown in Figs. 1(b)-1(d). Fig. 1(e) shows the map obtained from the MA-SVC result by merging the "corn," "grass," and "soybean" classes into three higher-level macroclasses.

With regard to the combination of MA-SVC with SFBE, the proposed method was applied after reducing the number of features from 202 to m, with 2 ≤ m ≤ 60, and the behavior of the resulting accuracy was assessed as a function of m. In order to further investigate robustness to the Hughes phenomenon, ten additional experiments were performed by progressively removing, at each run, a randomly selected 20% of the current training set, and by applying MA-SVC both with all 202 bands and after SFBE-based reduction to m features (2 ≤ m ≤ 60).

3.2. Behavior of MA-SVC without feature reduction

When applied directly in the original 202-dimensional space, MA-SVC obtained a very accurate result, despite the presence of several strongly overlapping classes: both the overall accuracy (OA) and the average accuracy (AA) were around 93%, and the accuracy was above 90% for most classes (Table 1). In particular, a 7.1% increase in OA and a 6.4% increase in AA were obtained as compared to a noncontextual SVM. This suggests a strong effectiveness of MA-SVC in exploiting spatial-contextual information. Specifically, a visual comparison of the resulting classification maps (see Figs. 1(b) and 1(d)) confirms the improvement in spatial regularity granted by MA-SVC as compared to the noncontextual SVM. This accurate result was obtained by dealing directly with the original hyperdimensional feature space, which also suggests quite a limited sensitivity to the Hughes phenomenon. A large accuracy increase was also observed as compared to both MRF-Gauss and MRF-k-NN (see also Fig. 1(c)). This confirms the improved capability of the proposed technique to jointly exploit spectral and spatial information to classify hyperspectral data.

Table 1. Experimental results: classification accuracies of MA-SVC, of a noncontextual SVM, and of MRF-Gauss and MRF-k-NN on the test set.

| class              | MA-SVC | SVM    | MRF-Gauss | MRF-k-NN |
|--------------------|--------|--------|-----------|----------|
| corn no-till       | 89.91% | 80.17% | 90.78%    | 74.96%   |
| corn-min           | 95.40% | 73.93% | 69.33%    | 54.29%   |
| grass/pasture      | 92.00% | 90.67% | 90.22%    | 88.89%   |
| grass/trees        | 98.59% | 98.59% | 100%      | 100%     |
| hay-windrowed      | 100%   | 100%   | 100%      | 100%     |
| soybean-no till    | 92.10% | 79.23% | 44.24%    | 80.36%   |
| soybean-min        | 90.71% | 83.65% | 95.30%    | 89.32%   |
| soybean-clean till | 82.30% | 77.88% | 86.28%    | 63.27%   |
| woods              | 97.95% | 97.54% | 97.95%    | 95.28%   |
| overall accuracy   | 92.84% | 85.76% | 86.40%    | 83.61%   |
| average accuracy   | 93.22% | 86.85% | 86.01%    | 82.93%   |
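For reference, the two summary rows of Table 1 follow the standard definitions, which can be computed from a confusion matrix as in this minimal sketch (ours; scikit-learn assumed):

```python
import numpy as np
from sklearn.metrics import confusion_matrix

def oa_aa(y_true, y_pred):
    """Overall accuracy = fraction of correctly labeled test pixels;
    average accuracy = mean of the per-class accuracies."""
    cm = confusion_matrix(y_true, y_pred)
    oa = np.trace(cm) / cm.sum()
    aa = np.mean(np.diag(cm) / cm.sum(axis=1))
    return oa, aa
```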

3.3. Behavior of MA-SVC after band extraction

As compared to the results achieved by MA-SVC with all 202 bands, no accuracy improvement was obtained by performing preliminary feature reduction when dealing with the whole training set. This confirms the robustness of MA-SVC to dimensionality issues. When running MA-SVC after reducing the number of training samples, OA was always above 88% in all runs, thus confirming the effectiveness of MA-SVC in small-sample-size cases as well. Specifically, no relevant accuracy improvement was obtained by applying SFBE, as compared to the use of all 202 bands, when up to 60% of the training set was removed. On the other hand, applying SFBE before MA-SVC resulted in a 1–2% gain in OA and AA, as compared to applying MA-SVC with the original 202 bands, when 60–90% of the training set was removed. This encourages the use of the proposed method also when only a few training samples are available, but it also points out that, in cases of extremely small sample sizes, a preliminary feature reduction may still slightly improve the resulting accuracies.

4. CONCLUSIONS

The problem of hyperspectral-image classification was addressed by developing a novel, contextual, fully automatic technique based on the integration of the MRF and SVM frameworks. The method relies on a reformulation of the Markovian minimum-energy decision rule in terms of a suitable SVM-based discriminant function involving a contextual kernel formulation. When applied to a well-known real hyperspectral data set, the method outperformed both a noncontextual SVM and previous Markovian classifiers based on parametric and nonparametric approaches to class-statistics modeling. This suggests a strong effectiveness in combining the robustness of SVMs to dimensionality and the spatial modeling capability of MRFs for hyperspectral-image classification purposes. The experimental results also pointed out a remarkable robustness of the method to the Hughes phenomenon, essentially suggesting a possible need for preliminary feature reduction only when very few training samples are available.

The proposed method is fully nonparametric, thanks to the SVM approach, and its formulation involves no restriction on the adopted MRF energy function. Hence, the method represents a feasible classification tool not only for hyperspectral data but also for other types of remote-sensing images: experimental validations with more sophisticated MRF models (e.g., incorporating anisotropy or edge information [9]) and with optical very-high-resolution or synthetic aperture radar images would be important extensions of this work. In particular, multitemporal MRF models (e.g., [13]) can be integrated in the method to exploit the repetitive acquisition capability granted by the PRISMA and EnMAP satellite missions, in order to generate accurate maps of the time evolution of land-cover thematic classes. Furthermore, multiscale MRF models (e.g., [19]) may be integrated in the proposed framework to exploit the multiresolution acquisition capability of PRISMA, thus combining the spectral information associated with the 30-m resolution PRISMA hyperspectral channels and the spatial information conveyed by the 5-m PRISMA panchromatic channel, in order to generate accurate land-cover maps at 5-m spatial resolution.

[Figure 1 appears here; panels (a)–(e) are not reproduced.]

Figure 1. Experimental results: false-color composition of three channels of the AVIRIS image used for experiments (a) and classification maps generated by MA-SVC (b), MRF-Gauss (c), and the noncontextual SVM (d). The classification map obtained from the MA-SVC result in (b) by merging the "corn," "grass," and "soybean" classes into three higher-level macroclasses is shown in (e). Color legend for (b)-(d): yellow = "corn no-till," orange = "corn-min," blue = "grass/pasture," cyan = "grass/trees," red = "hay-windrowed," light green = "soybean-no till," dark green = "soybean-min," sea green = "soybean-clean till," magenta = "woods." Color legend for (e): yellow = "corn," blue = "grass," red = "hay-windrowed," green = "soybean," magenta = "woods."

REFERENCES

[1] R. Ash. Information Theory. Dover, 1965.

[2] F. Bovolo and L. Bruzzone. A context-sensitive technique based on support vector machines for image classification. In Proc. of PReMI 2005, LNCS 3776, pages 260–265, 2005.

[3] L. Bruzzone and C. Persello. A novel context-sensitive semisupervised SVM classifier robust to mislabeled training samples. IEEE Trans. Geosci. Remote Sensing, 47:2142–2154, 2009.

[4] G. Camps-Valls, L. Gomez-Chova, J. Munoz-Mari, J. Vila-Frances, and J. Calpe-Maravilla. Composite kernels for hyperspectral image classification. IEEE Geosci. Remote Sensing Letters, 3:93–97, 2006.

[5] J. Conway. A Course in Functional Analysis. Springer, 1985.

[6] L. Debnath and P. Mikusinski. Introduction to Hilbert Spaces. Academic Press, 1990.

[7] R. C. Dubes and A. K. Jain. Random field models in image analysis. J. Appl. Stat., 16:131–163, 1989.

[8] A. A. Farag, R. M. Mohamed, and A. El-Baz. A unified framework for MAP estimation in remote sensing image segmentation. IEEE Trans. Geosci. Remote Sensing, 43:1617–1634, 2005.

[9] S. Geman and D. Geman. Stochastic relaxation, Gibbs distributions, and the Bayesian restoration of images. IEEE Trans. Pattern Anal. Machine Intell., 6:721–741, 1984.

[10] F. Lafarge, X. Descombes, and J. Zerubia. Textural kernel for SVM classification in remote sensing: application to forest fire detection and urban area extraction. In Proc. of ICIP 2005, Genoa, Italy, 2005.

[11] D. Liu, M. Kelly, and P. Gong. A spatial-temporal approach to monitoring forest disease spread using multi-temporal high spatial resolution imagery. Remote Sens. Environ., 101:167–180, 2006.

[12] F. Melgani and L. Bruzzone. Classification of hyperspectral remote sensing images with support vector machines. IEEE Trans. Geosci. Remote Sensing, 42:1778–1790, 2004.

[13] F. Melgani and S. B. Serpico. A Markov random field approach to spatio-temporal contextual image classification. IEEE Trans. Geosci. Remote Sensing, 41(11):2478–2487, 2003.

[14] G. Moser and S. B. Serpico. Automatic parameter optimization for support vector regression for land and sea surface temperature estimation from remote-sensing data. IEEE Trans. Geosci. Remote Sensing, 47:909–921, 2009.

[15] M. Murat-Dundar and D. Landgrebe. Toward an optimal supervised classifier for the analysis of hyperspectral data. IEEE Trans. Geosci. Remote Sensing, 42:271–277, 2004.

[16] J. A. Richards and X. Jia. Remote Sensing Digital Image Analysis. Springer, 2006.

[17] S. B. Serpico and G. Moser. Weight parameter optimization by the Ho-Kashyap algorithm in MRF models for supervised image classification. IEEE Trans. Geosci. Remote Sensing, 44:3695–3705, 2006.

[18] S. B. Serpico and G. Moser. Extraction of spectral channels from hyperspectral images for classification purposes. IEEE Trans. Geosci. Remote Sensing, 45:484–495, 2007.

[19] G. Storvik, R. Fjortoft, and A. H. S. Solberg. A Bayesian approach to classification of multiresolution remote sensing data. IEEE Trans. Geosci. Remote Sensing, 43(3):539–547, 2005.

[20] Y. Tarabalka, J. Chanussot, and J. A. Benediktsson. Segmentation and classification of hyperspectral images using minimum spanning forest grown from automatically selected markers. IEEE Trans. Syst. Man Cybern. B (in press), 2010.

[21] V. N. Vapnik. Statistical Learning Theory. Wiley, 1998.

[22] J. M. Wozencraft and I. M. Jacobs. Principles of Communication Engineering. John Wiley and Sons, 1965.

[23] T.-F. Wu, C.-J. Lin, and R. C. Weng. Probability estimates for multi-class classification by pairwise coupling. J. Mach. Learn. Res., 5:975–1005, 2004.