Ensemble Classification Algorithm for Hyperspectral Remote Sensing Data


IEEE GEOSCIENCE AND REMOTE SENSING LETTERS, VOL. 6, NO. 4, OCTOBER 2009

Mingmin Chi, Member, IEEE, Qian Kun, Jón Atli Benediktsson, Fellow, IEEE, and Rui Feng

Abstract—In real applications, it is difficult to obtain a sufficient number of training samples for the supervised classification of hyperspectral remote sensing images. Furthermore, the training samples may not represent the real distribution of the whole space. To address these problems, an ensemble algorithm that combines generative (mixture of Gaussians) and discriminative (support cluster machine) models for classification is proposed. Experimental results on a hyperspectral data set collected by the Reflective Optics System Imaging Spectrometer sensor validate the effectiveness of the proposed approach.

Index Terms—Ensemble classification, hyperspectral remote sensing images, mixture of Gaussians (MoGs), support cluster machine (SCM).

I. INTRODUCTION

Hyperspectral remote sensing images are very important for the discrimination of spectrally similar land-cover classes. In order to obtain a reliable classifier, a larger amount of representative training samples is necessary for hyperspectral data than for multispectral remote sensing data. In real applications, it is difficult to obtain a sufficient number of training samples for supervised learning. Furthermore, the training samples may not represent the real distribution of the whole space. This results in a quantity problem for training samples in the design of a robust supervised classifier.

In recent years, semisupervised learning (SSL) methods [1]-[3] have been exploited to overcome the problems with small numbers of labeled samples in the classification of hyperspectral remote sensing images, such as self-labeling approaches [1], low-density-separation SSL approaches [2], and label-propagation SSL approaches [3]. These methods usually exploit either generative or discriminative approaches, where an estimation criterion is used for adjusting the parameters and/or the structure of the classifier. There is little literature on the use of both generative and discriminative models for the quantity problem. In [4], the authors worked on a generative model and adopted a discriminative model to correct the bias of the generative classifier learned from small-size training samples.

In this letter, we propose an ensemble algorithm that combines the advantages of both generative and discriminative models to deal with the quantity problem in the classification of hyperspectral remote sensing images. In particular, both labeled and unlabeled data are represented with a generative model [i.e., a mixture of Gaussians (MoG)]. Then, the estimated model is used for discriminative learning. This is motivated by the recently proposed discriminative classification approach, the support cluster machine (SCM) [5]. The SCM was originally used to address large-scale supervised learning problems. The main idea of the SCM is that the labeled data are first modeled using a generative model. Then, the kernel, i.e., the similarity measure between Gaussians, is defined by the probability product kernel (PPK) [6]. In other words, the obtained PPK kernel is used to train support vector machines (SVMs) whose learned models contain support clusters rather than support vectors (hence the name SCM). In the SCM, the number of clusters is important for obtaining the best classification results. If the selected number of Gaussians (the mixture is not limited to Gaussians) does not fit the data well, the classification accuracy can decrease. For a small-size training set, a mixture model estimated from only the labeled samples cannot represent the distribution of the whole data.

To attack the aforementioned problem, it is proposed here to first use both labeled and unlabeled samples to estimate an MoG. Then, different sets of MoGs are generated by going from few (coarse representation) to many (fine representation) clusters. Finally, the output classification result is obtained by an ensemble technique that integrates the results of the individual SCMs learned from the different sets of MoGs. For each estimated MoG, the corresponding PPK kernel matrix is computed and used as the input to a standard SVM for training.

The accuracy and reliability of the proposed algorithm have been evaluated on Reflective Optics System Imaging Spectrometer (ROSIS) hyperspectral remote sensing data collected over the University of Pavia, Italy. The results are promising when compared to state-of-the-art classifiers.

The rest of this letter is organized as follows. The next section describes the proposed ensemble algorithm with generative/discriminative models. Section III describes the data used in the experiments and reports and discusses the results provided by the different algorithms. Finally, conclusions and discussion are given in Section IV.

Manuscript received December 22, 2008; revised April 10, 2009. First published July 28, 2009; current version published October 14, 2009. This work was supported in part by the Natural Science Foundation of China under Contract 60705008, by the Ph.D. Programs Foundation of the Ministry of Education of China under Contract 20070246132, and by the Research Fund of the University of Iceland. M. Chi, Q. Kun, and R. Feng are with the School of Computer Science, Fudan University, Shanghai 200433, China (e-mail: [email protected]; [email protected]; [email protected]). J. A. Benediktsson is with the Faculty of Electrical and Computer Engineering, University of Iceland, 107 Reykjavik, Iceland (e-mail: [email protected]). Color versions of one or more of the figures in this paper are available online at http://ieeexplore.ieee.org. Digital Object Identifier 10.1109/LGRS.2009.2024624


II. ENSEMBLE ALGORITHM WITH GENERATIVE/DISCRIMINATIVE MODELS

Let the given labeled data set $X_l = \{(\mathbf{x}_i, y_i)\}_{i=1}^{n}$, $X_l \in \mathbb{R}^{D \times n}$, be made up of $n$ labeled samples in a $D$-dimensional input space. We work on a binary classification problem, i.e., $y_i = +1$ if $\mathbf{x}_i$ is labeled as the positive class and $y_i = -1$ otherwise. Let the unlabeled data set $X_u = \{\mathbf{x}_i\}_{i=n+1}^{n+m}$, $X_u \in \mathbb{R}^{D \times m}$, consist of $m$ unlabeled samples.

To alleviate the quantity problem, we use a generative model to extract as much statistical information as possible from a large amount of unlabeled samples together with the small-size labeled set. Namely, a large amount of unlabeled samples is used for a better estimation of the data distribution. In our framework, an SCM is used for discriminative learning based on the mixtures. However, it is difficult to evaluate the influence of the number of mixture components on the classification results [5]. Therefore, different numbers of components are modeled and used as inputs to the base classifiers, i.e., the SCMs. Finally, we propose to integrate the results in order to improve the classification accuracy and stability. Note that in a multiclass case, the classes are usually highly overlapped; after clustering, the proposed algorithm based on MoG estimation cannot work better than the SVM. However, the proposed approach can be used for absolute classification [7] in remote sensing applications.

A. Generative Model: MoG

To obtain information from unlabeled data, the corresponding statistical information is used in this letter. In detail, a large amount of unlabeled samples is used to better estimate the data distribution, e.g., using an MoG (not limited to Gaussians). Let us assume that the data set $X = \{X_l, X_u\}$ is drawn independently from an MoG model $\Theta$. The log-likelihood function for the independent identically distributed data can be written as

$$ l(X) = \ln p(X\,|\,\Theta) = \sum_{i=1}^{n+m} \ln \sum_{k=1}^{K} \pi_k \, \mathcal{N}(\mathbf{x}_i\,|\,\boldsymbol{\theta}_k) $$

where the MoG model contains $K$ components, i.e., $\Theta = \{\boldsymbol{\theta}_k\} = \{(\boldsymbol{\mu}_k, \boldsymbol{\Sigma}_k)\}$, $k = 1, \ldots, K$, $\boldsymbol{\mu}_k$ denotes the mean vector, $\boldsymbol{\Sigma}_k$ the covariance matrix, and $\pi_k$ the mixing coefficient of the $k$th component. In this letter, the expectation-maximization (EM) algorithm [8] is adopted to estimate the parameters.

Since the estimation of the mixture model does not take class labels into account, a deterministic label needs to be assigned to each component. If only a very small labeled set is available, some components may contain only unlabeled samples; in that case, such components are discarded. Components containing samples with different labels are divided until no component contains samples with different labels. Accordingly, we obtain the inputs $\{\Theta, \mathbf{y}\}$ to an SCM for learning, where $\mathbf{y} \in \mathbb{R}^{K}$ is the label vector for all the components.

B. Discriminative Model: SCM

After obtaining the inputs from the MoGs, the similarity measure between Gaussians is defined, and an SVM-like learning framework is adopted for discriminative learning. After that, the kernel between a Gaussian and a vector is also defined for prediction.

1) PPK With MoG: After the data are represented by an MoG, the similarity between the components can be calculated by the PPK [6] in the form

$$ \kappa_{kk'} = \kappa(\boldsymbol{\theta}_k, \boldsymbol{\theta}_{k'}) = (\pi_k \pi_{k'})^{\rho} \int_{\mathbb{R}^D} \mathcal{N}^{\rho}(\mathbf{x}\,|\,\boldsymbol{\mu}_k, \boldsymbol{\Sigma}_k)\, \mathcal{N}^{\rho}(\mathbf{x}\,|\,\boldsymbol{\mu}_{k'}, \boldsymbol{\Sigma}_{k'})\, d\mathbf{x} = (\pi_k \pi_{k'})^{\rho}\, \rho^{-D/2}\, (2\pi)^{\frac{(1-2\rho)D}{2}}\, |\tilde{\boldsymbol{\Sigma}}|^{-\frac{1}{2}}\, |\boldsymbol{\Sigma}_k|^{-\frac{\rho}{2}}\, |\boldsymbol{\Sigma}_{k'}|^{-\frac{\rho}{2}} \exp\!\left( -\frac{\rho}{2} \left( \boldsymbol{\mu}_k^{\top} \boldsymbol{\Sigma}_k^{-1} \boldsymbol{\mu}_k + \boldsymbol{\mu}_{k'}^{\top} \boldsymbol{\Sigma}_{k'}^{-1} \boldsymbol{\mu}_{k'} - \tilde{\boldsymbol{\mu}}^{\top} \tilde{\boldsymbol{\Sigma}}^{-1} \tilde{\boldsymbol{\mu}} \right) \right) \quad (1) $$

where $\rho$ is a constant, $\tilde{\boldsymbol{\Sigma}} = \boldsymbol{\Sigma}_k^{-1} + \boldsymbol{\Sigma}_{k'}^{-1}$, and $\tilde{\boldsymbol{\mu}} = \boldsymbol{\Sigma}_k^{-1} \boldsymbol{\mu}_k + \boldsymbol{\Sigma}_{k'}^{-1} \boldsymbol{\mu}_{k'}$. To reduce the computational cost, it is assumed that the features are statistically independent. Hence, a diagonal covariance matrix is used, i.e., $\boldsymbol{\Sigma}_k = \mathrm{diag}\big((\sigma_k^{(1)})^2, \ldots, (\sigma_k^{(D)})^2\big)$.

2) Discriminative Learning: After obtaining the kernel matrix $\mathbf{K} = (\kappa_{kk'})_{k,k'=1}^{K}$, we can use an SVM-like classifier for training, i.e., an SCM [5]. Here, the SCM maximizes the margin between the positive and negative clusters, rather than between data vectors, i.e.,

$$ \min_{\mathbf{w}, b, \boldsymbol{\xi}} \; \frac{1}{2} \mathbf{w}^{\top} \mathbf{w} + C \sum_{k=1}^{K} \pi_k \xi_k \quad (2) $$

with the constraints

$$ y_k \left( \mathbf{w}^{\top} \phi(\Theta_k) + b \right) \geq 1 - \xi_k, \qquad k = 1, \ldots, K \quad (3) $$

where $\phi(\cdot)$ is a mapping function whose argument is a generative distribution (of Gaussian form in our case), and the slack variable $\xi_k$ is multiplied by the weight $\pi_k$ (the prior of the $k$th cluster in the MoG) such that a misclassified cluster with more samples is given a heavier penalty [5]. Incorporating the constraints in (3) and $\xi_k \geq 0$, $k = 1, \ldots, K$, into the cost function in (2), and using the Lagrangian theorem, the constrained optimization problem can be transformed into a dual problem following the same steps as for the SVM [9]. Thus, the dual representation of the SCM is given as

$$ \max_{\boldsymbol{\alpha}} \; \sum_{k=1}^{K} \alpha_k - \frac{1}{2} \sum_{k=1}^{K} \sum_{k'=1}^{K} y_k y_{k'} \alpha_k \alpha_{k'} \kappa(\boldsymbol{\theta}_k, \boldsymbol{\theta}_{k'}) $$

$$ \text{s.t.} \quad 0 \leq \alpha_k \leq \pi_k C, \quad k = 1, \ldots, K, \qquad \sum_{k=1}^{K} \alpha_k y_k = 0. \quad (4) $$

The SCM has the same optimization formulation as the SVM except that in the SCM the Lagrange multipliers αk are bounded by C multiplied by the weight πk .
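As a concrete illustration of the generative step and the PPK computation, the following is a minimal Python sketch. It assumes diagonal-covariance Gaussians and $\rho = 1$ (the value later used for the prediction kernel), under which (1) reduces to $\kappa(\boldsymbol{\theta}_k, \boldsymbol{\theta}_{k'}) = \pi_k \pi_{k'} \mathcal{N}(\boldsymbol{\mu}_k\,|\,\boldsymbol{\mu}_{k'}, \boldsymbol{\Sigma}_k + \boldsymbol{\Sigma}_{k'})$. Scikit-learn's GaussianMixture stands in for the EM step; the component-labeling step and the dual solver with the per-cluster bounds $\pi_k C$ are not shown, and all names are illustrative rather than taken from the authors' implementation.

```python
import numpy as np
from sklearn.mixture import GaussianMixture


def fit_mog(X_all, n_components, seed=0):
    """EM estimation of a diagonal-covariance MoG on labeled + unlabeled pixels."""
    gmm = GaussianMixture(n_components=n_components, covariance_type="diag",
                          random_state=seed).fit(X_all)
    # weights_: pi_k, means_: mu_k, covariances_: the diagonals of Sigma_k
    return gmm.weights_, gmm.means_, gmm.covariances_


def ppk(pi_a, mu_a, var_a, pi_b, mu_b, var_b):
    """PPK with rho = 1 between two diagonal Gaussians:
    kappa = pi_a * pi_b * N(mu_a | mu_b, Sigma_a + Sigma_b)."""
    var = var_a + var_b
    diff = mu_a - mu_b
    log_val = -0.5 * (np.sum(np.log(2.0 * np.pi * var)) + np.sum(diff ** 2 / var))
    return pi_a * pi_b * np.exp(log_val)


def ppk_matrix(pis, mus, vars_):
    """K x K kernel matrix between all mixture components, used as input to the SCM."""
    K = len(pis)
    kern = np.empty((K, K))
    for i in range(K):
        for j in range(K):
            kern[i, j] = ppk(pis[i], mus[i], vars_[i], pis[j], mus[j], vars_[j])
    return kern
```

With this kernel matrix and the per-component labels, (4) is a standard box-constrained quadratic program; the only change with respect to the usual SVM dual is the upper bound $\pi_k C$ on each multiplier.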


3) Prediction (Classification of Unlabeled Samples): A test sample $\mathbf{x}$ can be treated as an extreme case of a Gaussian $\boldsymbol{\theta}_x$ whose covariance matrix vanishes, i.e., $\boldsymbol{\theta}_x = (\pi_x = 1, \boldsymbol{\mu}_x = \mathbf{x}, \boldsymbol{\Sigma}_x = \sigma_x^2 \mathbf{I}, \sigma_x \to 0)$. Given two Gaussians $\boldsymbol{\theta}_k$ and $\boldsymbol{\theta}_x$, the kernel function (1) can be used to compute the similarity between a Gaussian and the test vector. Setting $\rho = 1$ and putting $\boldsymbol{\theta}_x = (1, \mathbf{x}, \sigma_x^2 \mathbf{I})$ into (1), we get the kernel value for the SCM prediction as

$$ \kappa(\boldsymbol{\theta}_k, \boldsymbol{\theta}_x) = \pi_k \frac{1}{(2\pi)^{D/2} \sqrt{\det \boldsymbol{\Sigma}_k}} \exp\!\left( -\frac{1}{2} \sum_{d=1}^{D} \frac{\big(\mu_k^{(d)} - x^{(d)}\big)^2}{\big(\sigma_k^{(d)}\big)^2} \right) = \pi_k \, \mathcal{N}(\mathbf{x}\,|\,\boldsymbol{\mu}_k, \boldsymbol{\Sigma}_k) \quad (5) $$

which is the posterior probability of $\mathbf{x}$ given $\boldsymbol{\theta}_k = (\boldsymbol{\mu}_k, \boldsymbol{\Sigma}_k)$. Similar to the SVM, the prediction function of the SCM is a linear combination of the kernels, but computed between the trained mixture components and the test pattern $\boldsymbol{\theta}_x = (1, \mathbf{x}, \sigma_x^2 \mathbf{I})$:

$$ f(\mathbf{x}) = \sum_{k=1}^{K} \alpha_k y_k \kappa(\boldsymbol{\theta}_k, \boldsymbol{\theta}_x) + b. \quad (6) $$

Accordingly, a class label is assigned to a test pattern by

$$ \mathbf{x} \in \begin{cases} +1, & \text{if } f(\mathbf{x}) \geq 0 \\ -1, & \text{otherwise} \end{cases} \; = \; \mathrm{sgn}\big(f(\mathbf{x})\big). \quad (7) $$
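The prediction rule in (5)-(7) is straightforward once the dual variables are available. The sketch below evaluates the decision values for a batch of test pixels with diagonal-covariance components; it assumes the multipliers, component labels, and bias have already been obtained from a training routine for (4), which is not shown, and the function names are illustrative.

```python
import numpy as np


def scm_decision(X_test, alphas, comp_labels, b, pis, mus, vars_):
    """Decision values f(x) from (5)-(6):
    kappa(theta_k, theta_x) = pi_k * N(x | mu_k, Sigma_k) with diagonal Sigma_k."""
    scores = np.full(X_test.shape[0], b, dtype=float)
    for a_k, y_k, pi_k, mu_k, var_k in zip(alphas, comp_labels, pis, mus, vars_):
        diff = X_test - mu_k
        log_pdf = -0.5 * (np.sum(np.log(2.0 * np.pi * var_k))
                          + np.sum(diff ** 2 / var_k, axis=1))
        scores += a_k * y_k * pi_k * np.exp(log_pdf)
    return scores


def scm_predict(X_test, *scm_params):
    """Class labels in {-1, +1} from the sign rule in (7)."""
    return np.where(scm_decision(X_test, *scm_params) >= 0.0, 1, -1)
```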

C. Ensemble Strategy

In the SCM, the data are represented by a mixture model whose number of components is usually fixed in advance, and in real applications it is difficult to evaluate which number is best for the problem at hand. Here, it is proposed to use an ensemble technique to overcome this problem. The number of mixture components goes from coarse to fine to generate different sets of MoGs. Accordingly, the input to the different SCMs is $\{\boldsymbol{\theta}_k^g, y_k^g\}$, $g = 1, \ldots, G$, where $G$ is the number of base classifiers. The prediction function of each classifier $f^g$ is a linear combination of the kernels computed between the trained mixture components and a test pattern $\boldsymbol{\theta}_x$:

$$ f^g(\mathbf{x}) = \sum_{k=1}^{K^g} \alpha_k^g y_k^g \kappa\big(\boldsymbol{\theta}_k^g, \boldsymbol{\theta}_x\big) + b^g. \quad (8) $$

Then, for the $g$th base classifier, a class label is assigned to the test pattern, i.e., $\mathbf{x} \in \mathrm{sgn}(f^g(\mathbf{x}))$. Finally, the winner-takes-all combination strategy is used to make the final decision [10], i.e.,

$$ \mathbf{x} \in y_m \quad \text{if } y_m = \arg\max_{c} N_c, \qquad \sum_{c} N_c = G \quad (9) $$

where $N_c$ is the accumulated number of base classifiers that assign the label $y_c$ to the test pattern.

The training phase of the proposed approach is summarized in Algorithm 1.

Algorithm 1: Training phase of the proposed algorithm
Require: $(\mathbf{x}_i)_{i=1}^{n+m}$, $(y_i)_{i=1}^{n}$, $G$, $C$, $\rho$
1: for $g = 1, \ldots, G$ do
2:   Estimate the MoG model $\Theta^g$ based on $(\mathbf{x}_i)_{i=1}^{n+m}$ with $K^g$ components.
3:   Assign labels to the mixture components to obtain the inputs to the SCM, i.e., $\{\Theta^g, \mathbf{y}^g\}$.
4:   Train the SCM with (4) to obtain $\boldsymbol{\alpha}^g$ and $b^g$.
5: end for
6: return $\{\Theta^g, \mathbf{y}^g, \boldsymbol{\alpha}^g, b^g\}$, $g = 1, \ldots, G$.
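Read literally, Algorithm 1 is a loop over component counts, and the winner-takes-all rule in (9) combines the resulting base decisions. The following sketch wires together the helpers from the earlier snippets; `label_components` and `train_scm` are hypothetical placeholders for the component-labeling step and for a solver of the dual (4), neither of which the letter specifies in code form.

```python
import numpy as np


def train_ensemble(X_all, X_lab, y_lab, component_range=range(2, 20)):
    """Algorithm 1 (sketch): one MoG + SCM per component count, coarse to fine.
    `fit_mog` and `ppk_matrix` are from the earlier sketches; `label_components`
    and `train_scm` are hypothetical helpers (component labeling, dual solver)."""
    models = []
    for K in component_range:
        pis, mus, vars_ = fit_mog(X_all, K)
        comp_labels = label_components(pis, mus, vars_, X_lab, y_lab)        # +1/-1 per component
        alphas, b = train_scm(ppk_matrix(pis, mus, vars_), comp_labels, pis)  # solves (4)
        models.append((alphas, comp_labels, b, pis, mus, vars_))
    return models


def ensemble_predict(X_test, models):
    """Winner-takes-all combination of the base SCM decisions, as in (9)."""
    votes = np.zeros(X_test.shape[0], dtype=int)       # net vote for class +1
    for params in models:
        votes += scm_predict(X_test, *params)          # each base vote is +1 or -1
    return np.where(votes >= 0, 1, -1)
```

For the binary case, accumulating the signed votes and taking the sign of the sum is equivalent to counting $N_c$ per class and picking the maximum.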

III. EXPERIMENTAL RESULTS

A. Data Set Description

The data used in the experiments were collected with the optical sensor ROSIS 03 over the campus of the University of Pavia, Italy. Originally, the ROSIS 03 sensor provided 115 bands covering the range from 0.43 to 0.86 μm, and the image size is 610 × 340 pixels. Some channels were removed due to noise, so the data contain 103 features. The task is to discriminate among nine classes, i.e., Asphalt, Meadows, Gravel, Trees, Metal sheets (Metal), Bare soil (Soil), Bitumen, Bricks, and Shadow. Some training data were removed due to zero-valued features; hence, the full data set contains 3895 training samples and 42 598 test samples. The detailed distribution can be found in Table I.

TABLE I. Distribution of original training and test samples in the ROSIS University data set.

From Table I, one can see that the number of original training samples is quite balanced across the classes. However, that is not the case for the test patterns. In particular, the number of test patterns for class 2 (Meadows) is 18 563, while those of the remaining classes vary from 897 to 6631. This means that the data distribution estimated even from all of the labeled training samples, without prior information, cannot represent the distribution over the whole region. Therefore, in this letter, we mainly focus the classification on these unbalanced class pairs, e.g., class 2 versus class 4 and class 2 versus class 6. In order to investigate the impact of the number of labeled samples on classifier performance, the original training data were subsampled to obtain ten splits, each made up of around 2% of the original labeled data (i.e., ten samples per class).
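The letter does not spell out how the ten small-sample splits were drawn; a plausible stratified random draw of ten labeled pixels per class might look like the following sketch (illustrative only, not the authors' procedure).

```python
import numpy as np


def draw_split(y_train, per_class=10, seed=0):
    """One small-sample split: `per_class` training pixels per class,
    roughly 2% of the original ROSIS training set (assumed stratified random draw)."""
    rng = np.random.default_rng(seed)
    chosen = []
    for c in np.unique(y_train):
        cls_idx = np.flatnonzero(y_train == c)
        chosen.extend(rng.choice(cls_idx, size=per_class, replace=False))
    return np.sort(np.array(chosen))


# Ten splits could then be generated as, e.g.,
# splits = [draw_split(y_train, seed=s) for s in range(10)]
```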


TABLE II. Classification accuracies using the SVM, SVMLight, and the proposed algorithm with the ten subsampled training data sets, i.e., classes 2 versus 4 and classes 2 versus 6.

B. Experimental Setup

For the SVM with a Gaussian kernel, the kernel parameter $\sigma$ has to be decided by model selection. In this letter, we used a grid search with $C \in \{10^{-3}, \ldots, 10^{3}\}$ and $\sigma \in \{2^{-3}, \ldots, 2^{3}\}$, and fivefold cross-validation was used to select the "best" model for prediction. In addition, we conducted experiments with the semisupervised SVM, i.e., SVMLight (available at http://svmlight.joachims.org), for comparison. In the SCM, the parameters $\sigma_k^{(d)}$ can be computed directly from the data. In addition, the variance differs across directions, which makes the model better and more flexible at capturing the structure of the data, such as cigar-shaped clusters. Finally, only one parameter, the penalization parameter $C$, has to be decided for the SCM. In our experiments, it was observed that the choice of $C$ does not significantly affect the results, so we fix it at $C = 100$ in all the following experiments. The range of $K$ for the EM algorithm is set to $\{2, 3, \ldots, 19\}$ to construct 18 base classifiers. Note that, for the SVM and SVMLight, 49 models need to be estimated for model selection. Therefore, the computational complexity of the SCM is of the same order of magnitude as that of the SVM.
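As a concrete illustration of this baseline model selection, a minimal scikit-learn sketch of the 7 × 7 grid search (49 candidate models, fivefold cross-validation) could look like the following; mapping $\sigma$ to gamma assumes the common RBF parameterization $k(\mathbf{x}, \mathbf{x}') = \exp(-\|\mathbf{x} - \mathbf{x}'\|^2 / (2\sigma^2))$, which the letter does not state explicitly.

```python
import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

# 7 values of C and 7 values of sigma -> 49 candidate models, as in the letter.
sigmas = 2.0 ** np.arange(-3, 4)                  # 2^-3 ... 2^3
param_grid = {
    "C": 10.0 ** np.arange(-3, 4),                # 10^-3 ... 10^3
    "gamma": 1.0 / (2.0 * sigmas ** 2),           # assumes k = exp(-||x - x'||^2 / (2 sigma^2))
}
svm_search = GridSearchCV(SVC(kernel="rbf"), param_grid, cv=5)
# svm_search.fit(X_labeled, y_labeled)            # fit on the small labeled split
```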

C. Experimental Results

For ease of comparison, we also carried out experiments with a supervised SVM and a semisupervised SVM, i.e., SVMLight, on the ten splits containing 20 labeled training samples. The results are shown in Table II for each data set. For class 2 versus class 6 (Meadows versus Bare soil), the average accuracy of the SVM is only 63%, varying significantly across the splits, and SVMLight obtains a significant improvement. However, the proposed approach obtains the best average classification accuracy, with an increase of 16.31% (from 63% to 79.31%), and much more stable results for the individual splits. In particular, for Split 8, the proposed approach obtained a significantly better result than the SVM and SVMLight. This is possibly due to the better and much more representative statistics estimated from a large amount of unlabeled samples in the proposed approach.

For class 2 versus class 4 (Meadows versus Trees), the average classification accuracy of the SVM is 81.73% over the ten splits. Since the spectral characteristics of the classes Meadows and Trees are very similar due to a similar spectral reflectance, we consider these classes more carefully. Looking closer at class 2 versus class 4, the average classification accuracy of the proposed algorithm is 88.47%, i.e., much higher than those of the SVM and SVMLight. In particular, all the per-split results of the proposed approach are significantly better than those of the SVM and SVMLight. Furthermore, the ensemble classification result per split is comparable to, or even better than, that obtained by the SVM using all the training samples (i.e., 89.11%). This confirms the effectiveness of the proposed ensemble classification algorithm, which is capable of increasing not only the classification accuracies but also the robustness of the classification results.

Fig. 1 shows the classification maps produced by the SVM and by the proposed approach, compared with the original test map, for split 9 of the Meadows versus Trees data set. From Fig. 1, one can see that Meadows and Trees are more accurately classified by the proposed approach. The possible reason is that the data distribution can be better estimated by the use of a large amount of unlabeled samples. Therefore, the problem of a small-size training data set can be alleviated. Moreover, the ensemble strategy avoids model selection, which has to be taken into account for most supervised classification algorithms.

Fig. 1. Comparison among the original test samples and the results provided by the SVM and the proposed approach. (a) Test map. (b) Map by the SVM. (c) Map by the proposed approach.

IV. DISCUSSION AND CONCLUSIONS

In this letter, an ensemble classification algorithm with generative/discriminative models was proposed to classify hyperspectral remote sensing data. In the proposed approach, unlabeled samples, together with a very small number of labeled samples, are used to estimate generative models, i.e., MoGs. The number of components in an MoG is difficult to determine; therefore, the number of Gaussians is varied from coarse to fine in order to avoid this problem. Then, each MoG is used to define a base classifier for the discriminative classifier, i.e., the SCM [5]. The different generative models lead to a diversity of classification results among the base classifiers. Finally, the results from the different base classifiers are combined to obtain a better and more robust classification accuracy. The experiments were carried out on real hyperspectral data collected by the ROSIS 03 sensor over an area around the University of Pavia, Italy.


The results obtained by the proposed ensemble classification approach gave both better classification accuracies and more robustness compared with state-of-the-art classifiers. In our future research, we will extend the work to multiclass problems. Furthermore, components of a mixture model without labeled information for learning will be investigated further.

ACKNOWLEDGMENT

The authors would like to thank Dr. P. Gamba of the University of Pavia, Italy, for providing the data set.

REFERENCES

[1] B. M. Shahshahani and D. A. Landgrebe, "The effect of unlabeled samples in reducing the small sample size problem and mitigating the Hughes phenomenon," IEEE Trans. Geosci. Remote Sens., vol. 32, no. 5, pp. 1087–1095, Sep. 1994.
[2] M. Chi and L. Bruzzone, "Semi-supervised classification of hyperspectral images by SVMs optimized in the primal," IEEE Trans. Geosci. Remote Sens., vol. 45, no. 6, pp. 1870–1880, Jun. 2007.

[3] T. Bandos, D. Zhou, and G. Camps-Valls, "Semi-supervised hyperspectral image classification with graphs," in Proc. IEEE IGARSS, Denver, CO, Jul. 2006, pp. 3883–3886.
[4] F. Akinori, U. Naonori, and S. Kazumi, "Semisupervised learning for hybrid generative-discriminative classifier based on the maximum entropy principle," IEEE Trans. Pattern Anal. Mach. Intell., vol. 30, no. 3, pp. 424–437, Mar. 2008.
[5] B. Li, M. Chi, J. Fan, and X. Xue, "Support cluster machine," in Proc. 24th Int. Conf. Mach. Learn., Corvallis, OR, Jun. 2007, pp. 505–512.
[6] T. Jebara, R. Kondor, and A. Howard, "Probability product kernels," J. Mach. Learn. Res., vol. 5, pp. 819–844, 2004.
[7] B. Jeon and D. Landgrebe, "A new supervised absolute classifier," in Proc. IEEE IGARSS, May 1990, pp. 2363–2366.
[8] A. Dempster, N. Laird, and D. Rubin, "Maximum likelihood from incomplete data via the EM algorithm," J. R. Stat. Soc., ser. B, vol. 39, no. 1, pp. 1–38, 1977.
[9] B. Schölkopf and A. Smola, Learning With Kernels. Cambridge, MA: MIT Press, 2002.
[10] G. Briem, J. A. Benediktsson, and J. R. Sveinsson, "Multiple classifiers in classification of multisource remote sensing data," IEEE Trans. Geosci. Remote Sens., vol. 40, no. 10, pp. 2291–2299, Oct. 2002.
