IEEE TRANSACTIONS ON GEOSCIENCE AND REMOTE SENSING, VOL. 50, NO. 09, SEPTEMBER 2012
Incremental Import Vector Machines for Classifying Hyperspectral Data
Ribana Roscher, Björn Waske, Member, IEEE, Wolfgang Förstner, Member, IEEE
Abstract—In this paper we propose an incremental learning strategy for import vector machines (IVM), a sparse kernel logistic regression approach. We use the procedure within a self-training concept for the sequential classification of hyperspectral data. The strategy comprises the inclusion of new training samples, to increase the classification accuracy, and the deletion of non-informative samples, to remain memory- and runtime-efficient. Moreover, we update the parameters of the incremental IVM model without re-training from scratch. Therefore, the incremental classifier is able to deal with large data sets. The performance of the IVM in comparison to support vector machines (SVM) is evaluated in terms of accuracy, and experiments are conducted to assess the potential of the probabilistic outputs of the IVM. Experimental results demonstrate that the IVM and SVM perform similarly in terms of classification accuracy. However, the number of import vectors is significantly lower than the number of support vectors, and thus the computation time during classification can be decreased. Moreover, the probabilities provided by the IVM are more reliable than the probabilistic information derived from an SVM's output. In addition, the proposed self-training strategy can increase the classification accuracy. Overall, the IVM and its incremental version are worthwhile for the classification of hyperspectral data.

Index Terms—Import vector machines, incremental learning, hyperspectral data, self-training.
The authors are with the Institute of Geodesy and Geoinformation, Faculty of Agriculture, University of Bonn, 53115 Bonn, Germany (e-mail: [email protected]; [email protected]; [email protected]). This work was supported by the CROP.SENSe.net project, funded by the German Federal Ministry of Education and Research (BMBF) within the scope of the competitive grants program "Networks of excellence in agricultural and nutrition research" (FKZ: 0315529). Manuscript received May 4, 2011.

I. INTRODUCTION

Hyperspectral imaging, also known as imaging spectroscopy, has been used for more than two decades for monitoring the Earth [1]. The spectrally continuous data range from the visible to the short-wave infrared region of the electromagnetic spectrum and thus enable a detailed separation of similar surface materials. Therefore, hyperspectral imagery is used for classification problems that require a precise differentiation in the spectral feature space [2]–[4]. Hyperspectral applications become even more attractive in view of the increased availability of hyperspectral imagery through future space-borne missions, such as the German EnMAP (Environmental Mapping and Analysis Program) [5] and the Italian PRISMA (Hyperspectral Precursor of the Application Mission). Nevertheless, the special properties of hyperspectral imagery demand more sophisticated image (pre-)processing and
analysis [6], [7]. Conventional methods, such as the maximum likelihood classifier, can be limited when applied to hyperspectral imagery, due to the high-dimensional feature space and a finite number of training samples. Consequently, the classification accuracy often decreases with an increasing number of bands (i.e., the well-known Hughes phenomenon). Thus, more flexible classifiers, such as the spectral angle mapper, neural networks and support vector machines (SVM), are applied to hyperspectral imagery [2], [4], [8], [9].

Among the various developments in the field of pattern recognition, SVM [10] are perhaps the most popular approach in recent hyperspectral applications [7]. SVM can outperform other methods in terms of classification accuracy [11], [12] and are still subject to further modification and improvement, e.g., in the context of modified kernel functions [13] and semi-supervised learning [14]. Whereas other classifiers can directly solve multi-class problems, the binary nature of SVM requires an adequate multi-class strategy (e.g., [15]). In contrast to other classifiers, which directly provide class labels or probabilities, SVM provide the distance of each pixel to the hyperplane of the binary classification problem. This information is used to determine the final class membership. Although the output of SVM can be transformed to probabilities (e.g., [16]), the reliability of these values can be inadequate [17], [18]. Nevertheless, probabilities are of interest and can be used, e.g., as input to a Markov random field model or for uncertainty analysis [19]–[21].

Logistic regression, as an alternative probabilistic discriminative classification model, has already been used in the context of classification and feature selection of hyperspectral imagery [22], [23]. The approach was extended to kernel logistic regression, e.g., [24], [25], showing a better accuracy but a higher complexity. To overcome the limitations regarding efficiency and computation time, several sparse realizations of (kernel) logistic regression have been developed, including the explicit usage of a sparsity-enforcing prior [26]–[29], an implicit prior used in the relevance vector machines (RVM) [17], [18], [30], or a greedy subset selection in the concept of import vector machines (IVM) [31].

Recent classifier developments, such as SVM, usually perform well on high-dimensional data sets. Nevertheless, the classification accuracy can be affected when a limited number of training samples is available [2], [32]. One possibility to overcome this problem is active learning, which has been studied extensively in the literature, e.g., [27], [33]–[35]. In this paper we use the specialized concept of self-training [36]. Self-training is based on sequentially training a classifier with new training samples. That is, starting with an initial classifier
model learned from a few labeled training samples, an iterative classification procedure is performed. After each classification, new relevant training samples are selected and usually the classifier is re-trained from scratch. That means the classifier is learned using old and new training samples, without taking the previously learned classifier model into account. For the acquisition of new training samples we use spectral as well as spatial information by employing a discriminative random field (DRF) [37]. The gain of integrating spatial information has already been discussed in several studies in the context of classifying hyperspectral data [20], [21], [26], [34], [38]. Moreover, we consider the probabilistic outputs of the IVM to assess the reliability of the classification result. However, to be time- and memory-efficient we need an incremental learning strategy within the self-training approach.

Several incremental learning methods have been proposed, extending classical and state-of-the-art off-line methods. Incremental generative models, e.g., [39], provide probabilities but tend to become very complex. Discriminative models, on the other hand, like incremental linear discriminant analysis [40] or incremental support vector machines, e.g., [41], show a good performance in classification tasks. Kernel-based algorithms have achieved considerable success in incremental and off-line learning settings [42], [43]. However, incremental kernel-based learning settings need strategies for dealing with existing and new samples. The challenge is to perform efficient, incremental update steps without suffering a loss in performance.

The main objective of this paper is to propose an incremental learning strategy to update the trained IVM model. The resulting classifier is called incremental IVM. The IVM model is not re-trained from scratch, as is usually done in the context of active learning and self-training approaches. The incremental IVM consists of the selection of new training samples, the deletion of irrelevant training samples, and the update of the IVM model. Therefore, the classifier is able to deal with an unlimited stream of new training samples. The potential of the IVM and its incremental version in the context of classifying hyperspectral data is evaluated. The performance is compared to SVM in terms of accuracy. In addition, the reliability of the probabilistic outputs provided by IVM and SVM is evaluated by using a discriminative random field, and the effect is further investigated by neglecting uncertain test samples. Our study aims at the classification of three different hyperspectral data sets, i.e., two urban areas from the city of Pavia and an agricultural area from Indiana, USA, using SVM and IVM.

The paper is organized as follows. Section II discusses logistic regression, kernel logistic regression and the IVM algorithm. Moreover, the concept of SVM and related classifiers is briefly introduced and compared to the IVM. Section III introduces the proposed strategy for self-training, including the DRF model; also the incremental IVM is explained. The experimental setup is given in Section IV. The results are presented and discussed in Section V. We conclude in Section VI.
II. THEORETICAL BACKGROUND

In this section the IVM model is introduced. Starting from logistic regression, we discuss kernels and sparsity and finally the sparse kernel logistic regression model, i.e., the IVM. A brief introduction to SVM and sparse multinomial logistic regression is also given. In Section II-D the discriminative random field is introduced, which incorporates spatial information.

A. Logistic Regression and Kernel Logistic Regression

a) Logistic Regression: We assume a training set (x_n, y_n), n = 1, ..., N of N labeled samples with feature vectors x_n ∈ R^M and class labels y_n ∈ C = {C_1, ..., C_K}. The observations are collected in a matrix X = [x_1, ..., x_N], while the corresponding labels are summarized in the vector y = [y_1, ..., y_N]. In the two-class case the posterior probability p_n of a feature vector x_n is assumed to follow the logistic regression model

$$ p_n = p(y_n = C_1 \mid \mathbf{x}_n; \mathbf{w}) = \frac{1}{1 + \exp(-\mathbf{w}^T \mathbf{x}_n)} \qquad (1) $$

with the extended feature vector $\mathbf{x}_n^T = [1, \mathbf{x}_n^T] \in \mathbb{R}^{M+1}$ and the extended parameters $\mathbf{w}^T = [w_0, \mathbf{w}^T] \in \mathbb{R}^{M+1}$ containing the bias w_0 and the weight vector w.

b) Kernel Logistic Regression: In linearly non-separable cases, the original observations X are implicitly mapped from the input space to a higher-dimensional kernel space with the kernel matrix K = [k_nm] via the kernel function k_nm = k(x_n, x_m). The kernel matrix K consists of affinities between the points depending on the distance measure defined by the kernel function. The parameters, referred to as α in the kernel-based approach, are determined in an iterative way with

$$ \boldsymbol{\alpha}^{(i)} = \left( \tfrac{1}{N} \mathbf{K}^T \mathbf{R} \mathbf{K} + \lambda \mathbf{K} \right)^{-1} \mathbf{K}^T \mathbf{R} \mathbf{z} \qquad (2) $$

$$ \mathbf{z} = \tfrac{1}{N} \left( \mathbf{K} \boldsymbol{\alpha}^{(i-1)} + \mathbf{R}^{-1} (\mathbf{p} - \mathbf{t}) \right) \qquad (3) $$

by optimizing the objective function

$$ Q^{(i)} = -\tfrac{1}{N} \sum_n \left[ t_n \log p_n + (1 - t_n) \log (1 - p_n) \right] + \tfrac{\lambda}{2} \boldsymbol{\alpha}^{(i)T} \mathbf{K} \boldsymbol{\alpha}^{(i)} \qquad (4) $$

using the Newton-Raphson procedure. The (N × N)-dimensional diagonal matrix R has the elements r_nn = p_n(1 − p_n) with p_n = 1/(1 + exp(−k_n α)) and k_n as the nth row of the kernel matrix K. The binary target vector t ∈ {0, 1}^N of length N codes the labels with t_n = 0 for y_n = C_1 and t_n = 1 for y_n = C_2. Additionally, we add an L2-norm regularization term with parameter λ to prevent overfitting.
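To make the notation concrete, the following NumPy sketch fits a regularized kernel logistic regression in the spirit of (1)-(4) using explicit Newton-Raphson steps; the RBF kernel, the variable names, and the fixed iteration count are illustrative assumptions rather than the authors' implementation.

```python
import numpy as np

def rbf_kernel(A, B, gamma=1.0):
    """k(x, x') = exp(-gamma * ||x - x'||^2)."""
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(axis=-1)
    return np.exp(-gamma * d2)

def train_klr(X, t, lam=1e-3, gamma=1.0, n_iter=25):
    """Regularized kernel logistic regression (objective as in (4)),
    fitted with Newton-Raphson steps.
    X : (N, M) training features, t : (N,) targets in {0, 1}."""
    N = X.shape[0]
    K = rbf_kernel(X, X, gamma)                      # (N, N) kernel matrix
    alpha = np.zeros(N)
    for _ in range(n_iter):
        p = 1.0 / (1.0 + np.exp(-K @ alpha))         # posterior of each sample
        r = p * (1.0 - p)                            # diagonal of R
        grad = K.T @ (p - t) / N + lam * K @ alpha   # gradient of the objective
        hess = (K.T * r) @ K / N + lam * K           # Hessian of the objective
        alpha -= np.linalg.solve(hess + 1e-10 * np.eye(N), grad)
    return alpha, K

# usage sketch:
# alpha, K = train_klr(X_train, t_train)
# p_new = 1.0 / (1.0 + np.exp(-(rbf_kernel(X_new, X_train) @ alpha)))
```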
B. Import Vector Machines

The kernel logistic regression includes all training samples to train the classifier, which is computationally expensive and memory intensive for data sets with many training samples. Similar to the SVM, the IVM algorithm [31] chooses a subset V of feature vectors out of the training set, with V = |V| samples X_V = [x_{V,m}], m = 1, ..., V, obtaining a sparse solution of the kernel logistic regression. These feature vectors are called import vectors.
Following (2) and (3), the parameters in iteration i are determined by

$$ \boldsymbol{\alpha}^{(i)} = \left( \tfrac{1}{N} \mathbf{K}_V^T \mathbf{R} \mathbf{K}_V + \lambda \mathbf{K}_R \right)^{-1} \mathbf{K}_V^T \mathbf{R} \mathbf{z} \qquad (5) $$

$$ \mathbf{z} = \tfrac{1}{N} \left( \mathbf{K}_V \boldsymbol{\alpha}^{(i-1)} + \mathbf{R}^{-1} (\mathbf{p} - \mathbf{t}) \right) \qquad (6) $$

The (N × V)-dimensional kernel matrix is given by K_V = [k(x_n, x_{V,m})] and the (V × V)-dimensional regularization matrix by K_R = [k(x_{V,l}, x_{V,m})], {l, m} = 1, ..., V. The IVM is illustrated in Algorithm 1.

Algorithm 1 (IVM): In every iteration i each point x_n ∈ X_(i) from the current training set X_(i) is tested for inclusion in the set of import vectors V_(i). The point x* yielding the lowest error Q_(i)n is included. The algorithm stops as soon as Q has converged.

    Initialize V_(0) := {}, X_(0) := {x_1, ..., x_N}, i := 0;
    repeat
        Compute z_(i) from the current set V_(i);
        foreach x_n ∈ X_(i) do
            Let V_(i)n := V_(i) ∪ {x_n};
            Compute α_(i)n from V_(i)n in a one-step iteration;
            Evaluate the error function Q_(i)n;
        end
        Find the best point x* = x_n with n = argmin_n Q_(i)n;
        Update V_(i+1) := V_(i) ∪ {x*}, X_(i+1) := X_(i) \ {x*}, i := i + 1;
    until Q converged;

The convergence criterion is based on the ratio |Q_(i) − Q_(i−Δi)| / |Q_(i)| with a small integer Δi. The original algorithm selects the import vectors in a greedy forward selection procedure. The approach is extended to a forward stepwise selection, which allows forward and backward steps. The advantage of this procedure is that import vectors which have once entered can be dropped again if they are no longer relevant. In all experiments an improvement of the results could be observed. Furthermore, an incremental update procedure is used to compute the inverse in (5) depending on the last iteration, which makes the algorithm more efficient. The incremental update is described in detail in Section III-C.
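The greedy forward selection of Algorithm 1 can be sketched as follows for the two-class case; this simplified illustration re-solves the Newton system from scratch for every candidate (no warm start and no incremental inverse update), so the helper names and defaults are assumptions, not the original MATLAB/C++ code.

```python
import numpy as np

def ivm_objective(K_V, K_R, alpha, t, lam):
    """Penalized negative log-likelihood Q, cf. (4) with K_V and K_R."""
    p = np.clip(1.0 / (1.0 + np.exp(-K_V @ alpha)), 1e-12, 1 - 1e-12)
    nll = -(t * np.log(p) + (1 - t) * np.log(1 - p)).mean()
    return nll + 0.5 * lam * alpha @ K_R @ alpha

def one_step_alpha(K_V, K_R, alpha0, t, lam):
    """Single Newton step for a candidate import vector set."""
    N = K_V.shape[0]
    p = 1.0 / (1.0 + np.exp(-K_V @ alpha0))
    r = p * (1.0 - p)
    grad = K_V.T @ (p - t) / N + lam * K_R @ alpha0
    hess = (K_V.T * r) @ K_V / N + lam * K_R
    return alpha0 - np.linalg.solve(hess + 1e-10 * np.eye(len(alpha0)), grad)

def ivm_forward_selection(K_full, t, lam=1e-3, tol=1e-3, max_iv=50):
    """Greedy forward selection of import vectors (Algorithm 1).
    K_full : (N, N) kernel matrix of all training samples."""
    N = K_full.shape[0]
    V, Q_hist = [], []
    candidates = list(range(N))
    while len(V) < max_iv:
        best = None
        for n in candidates:                         # test every remaining point
            idx = V + [n]
            K_V, K_R = K_full[:, idx], K_full[np.ix_(idx, idx)]
            alpha = one_step_alpha(K_V, K_R, np.zeros(len(idx)), t, lam)
            Q = ivm_objective(K_V, K_R, alpha, t, lam)
            if best is None or Q < best[0]:
                best = (Q, n)
        Q_best, n_best = best
        V.append(n_best)                             # include the best point
        candidates.remove(n_best)
        Q_hist.append(Q_best)
        if len(Q_hist) > 1 and abs(Q_hist[-1] - Q_hist[-2]) / abs(Q_hist[-1]) < tol:
            break                                    # Q converged
    return V
```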
The two-class model can be generalized to the multi-class model. Then the objective function is

$$ Q = -\tfrac{1}{N} \sum_n \mathbf{t}_n^T \log \mathbf{p}_n + \tfrac{\lambda}{2} \sum_k \boldsymbol{\alpha}_k^T \mathbf{K}_R \boldsymbol{\alpha}_k \qquad (7) $$

with the probabilities P = [p_1, ..., p_N] obtained by

$$ p_{nk} = \frac{\exp(\mathbf{k}_{V,n} \boldsymbol{\alpha}_k)}{\sum_l \exp(\mathbf{k}_{V,n} \boldsymbol{\alpha}_l)} \qquad (8) $$

The binary target vector t_n of length K uses the 1-of-K coding scheme, so that all components but t_nk are 0 if the point x_n is from class C_k. In the Newton-Raphson procedure in (5) and (6) we have to use one R_k and one z_k for each class.
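As a small illustration of (8), the multi-class posteriors are a softmax over the per-class activations k_{V,n} α_k; the array layout below is an assumption.

```python
import numpy as np

def ivm_multiclass_probs(K_V, A):
    """Class posteriors p_nk of (8).
    K_V : (N, V) kernels between samples and import vectors,
    A   : (V, K) matrix whose k-th column is alpha_k."""
    scores = K_V @ A                                # activations k_{V,n} alpha_k
    scores -= scores.max(axis=1, keepdims=True)     # numerical stabilization
    e = np.exp(scores)
    return e / e.sum(axis=1, keepdims=True)
```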
C. Related Classifiers

Recently several algorithms have been developed which enforce sparseness to control both the generalization capability of the learned classifier model and its complexity, e.g., [44]–[47]. In these algorithms the model consists of a sparse weighted linear combination of basis functions, which are the input features themselves, nonlinear transformations of them, or kernels centered on them. In this section we restrict ourselves to a review of realizations of sparse (kernel) logistic regression and of SVM, whereby the latter is used for comparison in the experiments.

1) Realizations of Sparse (Kernel) Logistic Regression: Using logistic regression or its kernel realization can be prohibitive regarding memory and time requirements if the dimension of the features or the number of training samples is large. Several sparse algorithms have been developed in recent years to overcome this problem. The relevance vector machine [18] uses the same model as the kernel logistic regression in combination with an implicit prior as regularization term, the so-called ARD (automatic relevance determination) prior [48], to induce sparseness. The prior includes several regularization parameters, also called hyperparameters, which are determined during the optimization process. The algorithm has been shown to be very sparse, but it also tends to underfit, leading to a poorly generalized model [29]. Additionally, the RVM uses an expectation-maximization (EM)-like learning method and can therefore suffer from local minima, leading to non-optimal classification results. Alternatively, [28], [29] use a Laplace prior to enforce sparseness, whose regularization parameter is determined via cross-validation. These approaches propose sparse multinomial (kernel) logistic regression (SMLR) using different methods for a fast computation. In the field of hyperspectral image classification these approaches have been applied and further developed in, e.g., [26], [27].

2) Support Vector Machines: The SVM finds an optimal nonlinear decision boundary by minimizing the objective function

$$ Q_{SVM} = \tfrac{1}{N} \sum_n \left[ 1 - y_n f(\mathbf{x}_n) \right]_+ + \tfrac{\lambda}{2} \| f \|^2 \qquad (9) $$

with $f(\mathbf{x}_n) = \sum_n \alpha_n K(\mathbf{X}, \mathbf{x}_n)$. Contrary to the IVM, which maximizes the posterior probabilities, the SVM aims to maximize the margin between the hyperplane and the closest training samples, the so-called support vectors. The SVM is a binary classifier, with the decision rule given by the sign of f(x_n).

3) Comparison with Import Vector Machines: The main properties of SVM, IVM, SMLR and RVM are summarized in Table I and briefly compared below.
TABLE I: Summary of the SVM, RVM and IVM algorithm regarding several characteristics. Here "–/0/+" means that the algorithm satisfies a certain property "barely/partially/completely".

| Algorithm | Objective function, optimization procedure | Sparse | Training time | Testing time | Probabilistic | Multi-class |
|-----------|--------------------------------------------|--------|---------------|--------------|---------------|-------------|
| SVM  | convex, greedy with SMO algorithm | 0 | + | 0 | 0 | 0 |
| RVM  | nonconvex, IRLS, EM-like | + | – | + | + | + |
| SMLR | convex, see Sec. II-C1 | 0/+ | 0/+ | 0/+ | + | + |
| IVM  | convex, IRLS with greedy forward stepwise selection (see Sec. II-B and III-C) | + | 0/+ | + | + | + |
The objective function of the SVM is quadratic and therefore convex, and it can be solved efficiently with the sequential minimal optimization (SMO) algorithm [49]. The objective function of the IVM is convex, but non-quadratic. It can be solved with iteratively re-weighted least squares (IRLS) and a greedy forward selection of import vectors.

All models are sparse, i.e., the model parameters are mostly zero. The sparseness arises from different techniques, namely the usage of a prior in the SMLR models and the RVM, or the greedy selection of a fraction of the training samples in the IVM model. However, IVM and RVM have been shown to be sparser than SVM [30], [31], and thus require less computation time during classification. The training time of the IVM, on the other hand, can be longer than for the SVM because of the non-quadratic objective function. Nonetheless, the training time depends on the number of training samples and the attainable sparseness, so that SMLR and IVM can also achieve a fast training time.

In comparison to the RVM and SMLR approaches, which use the whole kernel matrix during the optimization procedure, the IVM algorithm is suitable for large data sets, since only a subset of the training samples is used for the computations during the optimization. Consequently, only a fraction of the whole kernel matrix has to be computed and stored. In contrast, e.g., the standard RVM algorithm has to solve a large matrix inversion of the size of the number of training samples during the optimization process and can therefore be slow and intractable for large data sets.

While IVM, SMLR and RVM directly provide a probabilistic output, the SVM output can be transformed to probabilities only after some post-processing steps [16]. However, as Tipping [18] shows, the transformed output is not necessarily statistically interpretable. IVM, SMLR and RVM are introduced as multi-class classifiers. The standard SVM can be applied to multi-class problems, but it needs a coupling strategy to obtain multi-class classification results. Moreover, these coupling approaches are more suitable for practical use than direct multi-class formulations of the SVM [50].
D. Discriminative Random Fields

A discriminative random field is employed to model prior knowledge about the neighborhood relations within the image. The final classification is assumed to be smooth, i.e., neighboring pixels are more likely to belong to the same class than to different classes. With this, the best classification C_DRF is given by the argument of the minimum of the energy

$$ E(\mathbf{y}) = -\sum_{j \in I} \log p(y_j \mid \mathbf{x}_j) - \beta \sum_{\{m, j\} \in N} \delta(y_j, y_m) \qquad (10) $$

where x_j is the observed feature vector of the jth pixel, I is the set of all pixels and δ is the Kronecker delta function. The weighting parameter β can be determined via cross-validation. The first term in (10) models the probability of a class assignment y_j of the jth pixel, defined by the probabilistic output of the IVM. The second term describes the interaction potential as a Potts model over a 2D lattice, penalizing every dissimilar pair of labels and therefore heterogeneous regions. The set of all neighboring pixels is given by N.
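For illustration, the energy in (10) can be evaluated for a given label image as in the following sketch; the 4-neighborhood and the array layout are assumptions.

```python
import numpy as np

def drf_energy(labels, probs, beta):
    """Energy E(y) of (10) for a label image under a Potts prior.
    labels : (H, W) integer class labels y_j,
    probs  : (H, W, K) per-pixel class posteriors p(y_j | x_j) from the IVM,
    beta   : interaction weight."""
    H, W = labels.shape
    rows, cols = np.indices((H, W))
    unary = -np.log(np.clip(probs[rows, cols, labels], 1e-12, None)).sum()
    # Potts interaction over a 4-neighborhood: each equal neighbor pair lowers the energy by beta
    pairs = (labels[:, :-1] == labels[:, 1:]).sum() + (labels[:-1, :] == labels[1:, :]).sum()
    return unary - beta * pairs
```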
III. INCREMENTAL LEARNING STRATEGY FOR SELF-TRAINING

In this section we introduce the self-training concept and the learning strategy for the incremental IVM. Self-training refers to the sequential selection of new training data and the adaption of the previous classifier model. The previous classifier model can (i) be neglected and re-trained from scratch or (ii) be updated incrementally. Following (i), a regular classifier training is performed using the whole training sample set (i.e., previous plus new training samples). The latter approach "simply" updates the previous model using the newly selected training samples.

A. Self-training Concept

[Fig. 1: Remote sensing image → classification with the IVM → improved classification with the DRF; the posteriors and the DRF result yield new training samples, which feed the incremental learning procedure.]

Fig. 1. Self-training scheme consisting of two steps: In the first step (Fig. 1a) the image is classified with the learned incremental IVM model and the DRF. The probabilistic output of the IVM and the classification result of the DRF are used to acquire new training samples. In the second step (Fig. 1b) the classifier model is incrementally updated.
In the first step (Fig. 1a) we identify potential new training samples. For this selection step, we use the classification result provided by the DRF and the probabilistic outputs of the incremental IVM. In the second step (Fig. 1b) we use the incremental learning strategy to update the classifier model without re-training from scratch. The procedure is repeated until no additional training samples can be selected. The latter step consists of the inclusion of new training samples, the deletion of irrelevant training samples and, finally, the update of the IVM model. Therefore the approach can handle large data sets or even infinite data streams. However, this step is independent of the self-training strategy used, and other approaches could be employed to identify new training samples. Both steps are explained in detail in the next paragraphs, and a high-level sketch of the loop is given below.
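The sketch below outlines this loop in the spirit of Fig. 1; all four callables are hypothetical interfaces standing in for the components described in Sections II-D, III-B and III-C, not a released API.

```python
def self_train(classify, improve_with_drf, select_samples, update_model, max_rounds=20):
    """Self-training driver corresponding to Fig. 1. The arguments are
    user-supplied callables (assumed interfaces):
      classify()                        -> (posteriors, labels) from the current IVM model
      improve_with_drf(posteriors)      -> improved label map, cf. the DRF of Sec. II-D
      select_samples(labels, labels_drf, posteriors) -> new (feature, label) pairs, Sec. III-B
      update_model(new_samples)         -> incremental IVM update, Sec. III-C
    """
    for _ in range(max_rounds):
        posteriors, labels = classify()              # step 1a: classify the image
        labels_drf = improve_with_drf(posteriors)    # step 1b: DRF smoothing
        new_samples = select_samples(labels, labels_drf, posteriors)
        if not new_samples:                          # stop: nothing worthwhile left
            break
        update_model(new_samples)                    # step 2: incremental update
```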
B. Acquisition of New Training Samples for Self-training

This acquisition step, illustrated in Fig. 2, is subdivided into three parts. First, we use a DRF as an expert to identify worthwhile samples by evaluating the disagreement between the classification C yielded by the IVM classifier and the DRF result C_DRF. This implies that the samples with estimated label y_DRF,j derived from the DRF are sufficiently uncertain under the IVM classifier, with max(p_j) < 0.5. Using such samples we ensure progress in self-training. Second, we exclude those selected samples whose probability is too small. That is, the influence of the new samples is restricted to ensure that the model changes gradually and remains stable. Finally, we sort the chosen samples by their potential influence on the model, to retain the flexibility of the model during self-training. The potential influence is measured using the concept of leverage points in regression [51]. The leverage values in a weighted regression are contained in the vector

$$ \mathbf{l} = \mathrm{diag} \left( \mathbf{K}_{pot} \left( \mathbf{K}_{pot}^T \mathbf{R}_{pot} \mathbf{K}_{pot} \right)^{-1} \mathbf{K}_{pot}^T \mathbf{R}_{pot} \right) \qquad (11) $$

The kernel matrix obtained from the potential new training samples X_pot is given by K_pot = [k(x_{pot,j}, x_{V,m})], and R_pot is the weight matrix of the class the training sample belongs to. Training samples with a high leverage value are considered first, since they cause large effects in the learned model. The self-training is stopped if no more training samples can be acquired.

To prevent an acquisition which leads to an imbalanced number of training samples per class, we ensure that an equal number of training samples is sampled for each class. If not enough new training samples can be acquired with the proposed self-training approach, we also consider training samples with a high probability and whose label in C and C_DRF is the same. These samples have a small influence on the model and should not change the result too much, but they balance the number of samples per class. If there are still not enough samples for a class, we follow the over-sampling approach of duplicating existing samples from the concerned class at random and adding a small noise to them [52]. A sketch of the leverage computation follows below.
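A minimal sketch of the leverage computation in (11) and the resulting ordering; the small ridge term is a numerical safeguard added here and is not part of the formula.

```python
import numpy as np

def leverage_values(K_pot, R_pot):
    """Diagonal of the hat matrix in (11).
    K_pot : (P, V) kernels between candidate samples and import vectors,
    R_pot : (P,)  diagonal IRLS weights of the class each candidate is assigned to."""
    KtR = K_pot.T * R_pot                               # K_pot^T R_pot
    A = KtR @ K_pot + 1e-10 * np.eye(K_pot.shape[1])    # K_pot^T R_pot K_pot (+ ridge)
    H = K_pot @ np.linalg.solve(A, KtR)                 # hat matrix
    return np.diag(H)

def rank_candidates(K_pot, R_pot):
    """Candidate indices sorted by decreasing leverage (most influential first)."""
    return np.argsort(-leverage_values(K_pot, R_pot))
```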
C. Incremental Learning

To update the learned classifier, we consider the following aspects:
- New training samples acquired with the proposed self-training approach (see Section III-B) are included.
- Non-informative samples are deleted.
- The set of import vectors is updated.

a) Update Training Vectors: For the two-class case the incremental learning procedure is stated as follows. We add training vectors X_Δ with targets t_Δ, so that N_(s) := N_(s−1) + ΔN with ΔN the number of new training samples. At each self-training iteration s we extend the matrices and vectors

$$ \mathbf{K}_{(s)} = \begin{bmatrix} \mathbf{K}_{(s-1)} \\ \mathbf{K}_{\Delta} \end{bmatrix}, \qquad \mathbf{t}_{(s)} = \begin{bmatrix} \mathbf{t}_{(s-1)} \\ \mathbf{t}_{\Delta} \end{bmatrix} $$
to obtain the updated parameters α+_(s) given by (12) and (13),

$$ \boldsymbol{\alpha}^{+}_{(s)} = \left( \tfrac{1}{N_{(s)}} \left( \mathbf{K}_{(s-1)}^T \mathbf{R}_{(s-1)} \mathbf{K}_{(s-1)} + \mathbf{K}_{\Delta}^T \mathbf{R}_{\Delta} \mathbf{K}_{\Delta} \right) + \lambda \mathbf{K}_{R,(s-1)} \right)^{-1} \left( \mathbf{K}_{(s-1)}^T \mathbf{R}_{(s-1)} \mathbf{z}_{(s-1)} + \mathbf{K}_{\Delta}^T \mathbf{R}_{\Delta} \mathbf{z}_{\Delta} \right) \qquad (12) $$

$$ \mathbf{z}_{(s)} = \tfrac{1}{N_{(s)}} \begin{bmatrix} \mathbf{K}_{(s-1)} \boldsymbol{\alpha}_{(s-1)} + \mathbf{R}_{(s-1)}^{-1} \left( \mathbf{p}_{(s-1)} - \mathbf{t}_{(s-1)} \right) \\ \mathbf{K}_{\Delta} \boldsymbol{\alpha}_{(s-1)} + \mathbf{R}_{\Delta}^{-1} \left( \mathbf{p}_{\Delta} - \mathbf{t}_{\Delta} \right) \end{bmatrix} \qquad (13) $$

yielding

$$ \mathbf{z}_{(s)} = \begin{bmatrix} \mathbf{z}_{(s-1)} \\ \mathbf{z}_{\Delta} \end{bmatrix}, \qquad \mathbf{p}_{(s)} = \begin{bmatrix} \mathbf{p}_{(s-1)} \\ \mathbf{p}_{\Delta} \end{bmatrix}. $$

The Sherman-Morrison-Woodbury (SMW) formula [53] is used to compute the inverse in (12), yielding

$$ \boldsymbol{\alpha}^{+}_{(s)} = \left( \mathbf{A}^{-1} - \tfrac{1}{N_{(s)}} \mathbf{A}^{-1} \mathbf{K}_{\Delta}^T \left( \mathbf{R}_{\Delta}^{-1} + \tfrac{1}{N_{(s)}} \mathbf{K}_{\Delta} \mathbf{A}^{-1} \mathbf{K}_{\Delta}^T \right)^{-1} \mathbf{K}_{\Delta} \mathbf{A}^{-1} \right) \left( \mathbf{K}_{(s-1)}^T \mathbf{R}_{(s-1)} \mathbf{z}_{(s-1)} + \mathbf{K}_{\Delta}^T \mathbf{R}_{\Delta} \mathbf{z}_{\Delta} \right) \qquad (14) $$

with $\mathbf{A} = \tfrac{1}{N_{(s)}} \mathbf{K}_{(s-1)}^T \mathbf{R}_{(s-1)} \mathbf{K}_{(s-1)} + \lambda \mathbf{K}_{R,(s-1)}$. Note that the update (14) only incorporates an inverse of size ΔN × ΔN, since the inverse A^{-1} was already computed at time step s−1. With these steps the parameters can be updated in an incremental way without re-training from scratch.

The update (12) can also be formulated for a decreasing number of training vectors in a similar manner, leading to an efficient update rule as in (14), only with the sign of R_Δ changed. Training vectors could be removed based on their "age", so that they follow a first-in-first-out strategy; however, this can lead to unstable results. Therefore, we identify the training vectors that can be removed using Cook's distance [54]. Cook's distance measures the effect of deleting a training vector: the higher the value, the more informative the training vector is:

$$ d_n = \frac{l_n \left( \mathbf{p}_n - \mathbf{t}_n \right)^T \left( \mathbf{p}_n - \mathbf{t}_n \right)}{a \, \mathrm{MSE} \, (1 - l_n)^2} \qquad (15) $$

with a the number of parameters and MSE the mean squared error summed over all classes, given by the difference between 1 and the mean probability of all training samples belonging to class C_k. The leverage value of each training vector is given by l_n. The training vectors with the lowest distance are removed in a greedy backward selection until the value of the optimization function increases by more than 5%.

b) Update the Set of Import Vectors: To add and remove import vectors we use the forward stepwise selection described in Section II. We also make use of the SMW formula and proceed in the same way as described in Algorithm 1 until a convergence criterion is reached. To generalize the two-class model to the multi-class model we use the class-specific values R_k,(s−1) and z_k,(s−1).
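The following sketch illustrates the Sherman-Morrison-Woodbury update of (12)-(14) and the Cook's distance of (15); the cached quantities, vectorized diagonal weights and function names are illustrative assumptions.

```python
import numpy as np

def smw_update_alpha(A_inv, K_old, R_old, z_old, K_d, R_d, z_d, N_s):
    """Incremental parameter update, cf. (12)-(14).
    A_inv : inverse of A (cached from step s-1), shape (V, V),
    K_old, R_old, z_old : quantities of the previous training set,
    K_d, R_d, z_d : kernel rows, diagonal IRLS weights and working responses
    of the Delta-N new samples; only a (Delta-N x Delta-N) system is solved."""
    KdA = K_d @ A_inv                                       # K_Delta A^{-1}
    S = np.diag(1.0 / R_d) + KdA @ K_d.T / N_s              # small core matrix of (14)
    B_inv = A_inv - (A_inv @ K_d.T) @ np.linalg.solve(S, KdA) / N_s
    rhs = K_old.T @ (R_old * z_old) + K_d.T @ (R_d * z_d)   # right-hand side of (12)
    return B_inv @ rhs, B_inv

def cooks_distance(l, p, t, n_params, mse):
    """Cook's distance d_n of (15) for each training sample.
    l : (N,) leverage values, p, t : (N, K) probabilities and 1-of-K targets."""
    resid2 = ((p - t) ** 2).sum(axis=-1)                    # (p_n - t_n)^T (p_n - t_n)
    return l * resid2 / (n_params * mse * (1.0 - l) ** 2)
```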
[Fig. 2: Classification with IVM yields posteriors P and classification C; the improved classification with the DRF yields C_DRF; compare C and C_DRF; select worthwhile samples with y_j ≠ y_DRF,j and p(y_j | x_j) sufficiently large; sort samples regarding their leverage values.]
Fig. 2. Schematic diagram of the selection of new training samples: Worthwhile samples are identified by comparing the classifications C and C_DRF. Samples with a relatively small posterior probability are excluded. The remaining samples are sorted according to their leverage value, i.e., samples with high influence are preferred.
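A compact sketch of this selection step; the lower probability threshold p_min is an assumed value, since the text only states that samples with too small a probability are excluded.

```python
import numpy as np

def select_worthwhile(labels_ivm, labels_drf, probs, p_min=0.25, p_max=0.5):
    """Candidate selection of Fig. 2 (thresholds are assumptions):
    keep samples where IVM and DRF disagree and the IVM is uncertain,
    but not so uncertain that the sample would destabilize the model.
    labels_ivm, labels_drf : (P,) label vectors, probs : (P, K) IVM posteriors."""
    conf = probs.max(axis=-1)
    mask = (labels_ivm != labels_drf) & (conf < p_max) & (conf > p_min)
    return np.flatnonzero(mask)     # indices, to be ranked by leverage afterwards
```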
IV. EXPERIMENTAL SETUP

A. Data Sets

We use three hyperspectral data sets – CENTER OF PAVIA, UNIVERSITY OF PAVIA and INDIAN PINES – from study sites with different environmental settings. The data sets have been used in a multitude of studies, e.g., [7], [13], [32], [55], [56].

The CENTER OF PAVIA image was acquired by the ROSIS-3 sensor in 2003. The spatial resolution of the image is 1.3 m per pixel. The data cover the range from 0.43 µm to 0.86 µm of the electromagnetic spectrum. However, some bands have been removed due to noise, and finally 102 channels have been used in the classification. The image strip, 1096×492 pixels in size, covers the area around the center of Pavia. The classification aims at 9 land cover classes.

The UNIVERSITY OF PAVIA data set was also acquired by the ROSIS-3 sensor; it is 610×340 pixels in size and has 103 channels. The classification aims at 9 land cover classes.

The INDIAN PINES data set was acquired by the AVIRIS instrument in 1992. The study site lies in a predominantly agricultural region in NW Indiana, USA. AVIRIS operates from the visible to the short-wave infrared region of the electromagnetic spectrum, ranging from 0.4 µm to 2.4 µm. The data set covers 145×145 pixels, with a spatial resolution of 20 m per pixel. The experiments aim at the classification of 16 classes (Table II).

B. Methods

In the experiments the IVM for classifying hyperspectral data is analyzed. In addition, SVM were applied to the data sets. SVM are perhaps the most popular approach in recent applications and seem particularly advantageous when classifying high-dimensional data sets. Thus, the method is regarded as a kind of benchmark classifier for comparison with new approaches.

Moreover, a DRF is applied to the respective probabilistic output. Besides serving as input for the DRF, we use the probabilistic
outputs to analyze the uncertainty of the classification result. We assess the reliability of the probabilities by rejecting uncertain test samples and deriving the classification accuracy on the non-rejected test points [19]. The rejection rate is given by a threshold on the posterior probability, whereby the accuracy provided by SVM and IVM is reported as a function of the rejection rate in discrete intervals. In addition, the incremental IVM is evaluated by applying the self-training approach to the three data sets, in terms of classification accuracy and sparsity.

To investigate the impact of the number of training samples on the performance of the model (e.g., in terms of sparsity and accuracy), we use two different training sets, containing (i) all initial training samples and (ii) 10% of each class (with a minimum of at least 10 samples per class). For (ii), we performed a stratified random sampling, selecting 10% of the samples of each class from the initial training set. The final results were averaged.

For the SVM and the (incremental) IVM we use a radial basis function kernel. The kernel parameters are determined by a 5-fold cross-validation. The DRF parameter β is also determined by 5-fold cross-validation. The result provided by the common IVM is used for the initialization of the self-training. The self-training procedure is repeated until no more training samples can be selected.

The IVM algorithm (http://www.ipb.uni-bonn.de/ivm/) is implemented in MATLAB and C++. The SVM classification is performed in MATLAB, using the LIBSVM implementation by Chang and Lin [57]. To compute the result in (10) we use a graph-cut algorithm (http://vision.csd.uwo.ca/code/) [58]. Besides the standard SVM classification, we use the method of [16] to convert the output of the SVM to probabilities (from now on referred to as probabilistic SVM).
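The accuracy-versus-rejection analysis described above can be computed as in the following sketch; the threshold grid and names are assumptions.

```python
import numpy as np

def accuracy_vs_rejection(probs, y_true, thresholds=np.linspace(0.0, 0.9, 10)):
    """Overall accuracy on the non-rejected test samples for a set of
    thresholds on the maximum posterior probability.
    probs : (N, K) class posteriors, y_true : (N,) reference labels in 0..K-1."""
    y_pred = probs.argmax(axis=1)
    conf = probs.max(axis=1)
    results = []
    for thr in thresholds:
        keep = conf >= thr                           # reject uncertain samples
        if keep.sum() == 0:
            continue
        oa = (y_pred[keep] == y_true[keep]).mean()
        results.append((thr, 1.0 - keep.mean(), oa)) # threshold, rejection rate, OA
    return results
```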
TABLE II: Number of training and test samples.

CENTER OF PAVIA
| Class | # training | # test |
| Asphalt | 678 | 6907 |
| Bare Soil | 820 | 5729 |
| Bitumen | 808 | 6479 |
| Meadows | 797 | 2108 |
| Bricks | 485 | 1667 |
| Shadow | 195 | 1970 |
| Tiles | 223 | 2899 |
| Trees | 785 | 5723 |
| Water | 745 | 64533 |
| Total | 5536 | 91108 |

UNIVERSITY OF PAVIA
| Class | # training | # test |
| Asphalt | 548 | 6304 |
| Bare Soil | 532 | 4572 |
| Bitumen | 375 | 981 |
| Gravel | 392 | 1815 |
| Meadows | 540 | 18146 |
| Metal Sheets | 265 | 1113 |
| Bricks | 514 | 3364 |
| Shadow | 231 | 795 |
| Trees | 524 | 2912 |
| Total | 3921 | 33698 |

INDIAN PINES
| Class | # training | # test |
| Alfalfa | 27 | 27 |
| Corn-no till | 778 | 656 |
| Corn-min till | 408 | 426 |
| Corn | 110 | 124 |
| Grass/Pasture | 208 | 289 |
| Pasture/Trees | 339 | 408 |
| Hay | 228 | 261 |
| Soybeans-no till | 506 | 462 |
| Soybeans-mid till | 975 | 1493 |
| Soybeans-clean till | 265 | 349 |
| Wheat | 100 | 112 |
| Woods | 637 | 657 |
| Bldg-Grass | 163 | 217 |
| Stone | 45 | 50 |
| Grass/Pasture-mowed | 12 | 14 |
| Oats | 10 | 10 |
| Total | 4811 | 5555 |
TABLE III: Overall accuracy (OA), average accuracy (AA) and kappa coefficient (Kappa) of SVM, SVM with transformed probabilities (Prob. SVM), IVM, incremental IVM (iIVM) with self-training (ST) and additional DRF. For the smaller data sets (10%) we report the mean and standard deviation in brackets over 20 runs.

| Train data | Algorithm | CENTER OF PAVIA OA[%] / AA[%] / Kappa | UNIVERSITY OF PAVIA OA[%] / AA[%] / Kappa | INDIAN PINES OA[%] / AA[%] / Kappa |
|------------|-----------|---------------------------------------|-------------------------------------------|------------------------------------|
| 100% | SVM | 97.1 / 94.6 / 0.95 | 79.0 / 87.9 / 0.73 | 61.6 / 69.6 / 0.57 |
| 100% | Prob. SVM | 97.0 / 94.2 / 0.95 | 78.4 / 87.7 / 0.73 | 60.7 / 68.9 / 0.56 |
| 100% | IVM | 97.2 / 94.2 / 0.95 | 78.3 / 87.4 / 0.73 | 63.7 / 67.8 / 0.59 |
| 100% | iIVM+ST | 97.3 / 95.2 / 0.95 | 86.6 / 90.0 / 0.83 | 63.7 / 67.8 / 0.59 |
| 100% | Prob. SVM+DRF | 97.8 / 96.4 / 0.96 | 82.6 / 90.5 / 0.78 | 66.3 / 71.6 / 0.62 |
| 100% | IVM+DRF | 98.4 / 96.5 / 0.97 | 85.9 / 91.4 / 0.82 | 70.6 / 80.0 / 0.67 |
| 100% | iIVM+ST+DRF | 98.6 / 97.1 / 0.98 | 94.2 / 93.8 / 0.92 | 70.6 / 80.2 / 0.67 |
| 10% | SVM | 92.1 (0.6) / 85.1 (1.0) / 0.86 (0.01) | 70.2 (3.8) / 77.5 (2.4) / 0.62 (0.04) | 54.7 (2.3) / 63.0 (1.6) / 0.49 (0.03) |
| 10% | Prob. SVM | 93.0 (1.3) / 86.2 (2.1) / 0.87 (0.02) | 69.4 (3.8) / 77.3 (2.4) / 0.61 (0.04) | 54.9 (2.5) / 63.2 (2.3) / 0.50 (0.03) |
| 10% | IVM | 92.2 (1.1) / 85.4 (1.6) / 0.86 (0.02) | 70.3 (3.5) / 77.3 (0.8) / 0.62 (0.04) | 58.2 (0.1) / 65.2 (0.8) / 0.53 |
| 10% | iIVM+ST | 94.3 (1.0) / 88.2 (1.7) / 0.90 (0.02) | 74.3 (2.4) / 77.9 (1.1) / 0.66 (0.02) | 64.8 (1.1) / … / … |
| 10% | Prob. SVM+DRF | 93.7 (1.5) / 88.9 (2.5) / 0.89 (0.03) | 76.3 (4.9) / 79.2 (4.5) / 0.70 (0.06) | 60.1 (2.8) / … / … |
| 10% | IVM+DRF | 92.8 (1.3) / 87.8 (2.0) / 0.87 (0.02) | 78.5 (4.3) / 79.9 (1.5) / 0.72 (0.05) | 64.5 (1.3) / … / … |
| 10% | iIVM+ST+DRF | 95.8 (0.8) / 92.3 (2.1) / 0.92 (2.2) | 80.9 (2.9) / 81.0 (1.9) / 0.74 (0.03) | 70.1 (1.2) / … / … |
V. EXPERIMENTAL RESULTS

Fig. 4 shows the ground truth and the classification results for all three data sets. As shown in Table III, the IVM is competitive with the SVM in terms of accuracy and results in almost identical overall accuracies and kappa coefficients, irrespective of the number of training samples. However, the SVM outperforms the IVM in terms of the average class accuracy (AA) for the INDIAN PINES data set when using all training samples. As expected, the accuracies increase with an increasing number of training samples, independently of the classifier. It is interesting to underline that in many cases the probabilistic SVM provides (slightly) lower accuracies (i.e., OA and AA) than a standard SVM. Given this, the reliability of the probabilistic output of the SVM can be questioned. This assumption is confirmed by the results provided by the DRF. The accuracies of both methods, the probabilistic SVM and the IVM, are increased by the DRF. This is in accordance with other studies that have successfully integrated spatial