IEEE TRANSACTIONS ON GEOSCIENCE AND REMOTE SENSING, VOL. 48, NO. 11, NOVEMBER 2010
Adaptive Classification for Hyperspectral Image Data Using Manifold Regularization Kernel Machines
Wonkook Kim, Student Member, IEEE, and Melba M. Crawford, Fellow, IEEE
Abstract—Localized training data typically utilized to develop a classifier may not be fully representative of class signatures over large areas but could potentially provide useful information that can be updated to reflect local conditions in other areas. An adaptive classification framework is proposed for this purpose, whereby a kernel machine is first trained with labeled data and then iteratively adapted to new data using manifold regularization. Assuming that no class labels are available for the data for which spectral drift may have occurred, resemblance associated with the clustering condition on the data manifold is used to bridge the change in spectra between the two data sets. Experiments are conducted using spatially disjoint data in EO-1 Hyperion images, and the results of the proposed framework are compared to semisupervised kernel machines.

Index Terms—Adaptive classifier, hyperspectral, kernel machine, knowledge transfer, manifold regularization.
I. INTRODUCTION
HYPERSPECTRAL image data provide enhanced capability for land cover classification as high spectral resolution enables discrimination between similar classes. However, it is well known that large numbers of labeled samples are required to fully exploit information in the high-dimensional data [1]. Unfortunately, the time and labor costs of collecting ground reference data are often high, frequently resulting in a major obstacle to analysis of newly acquired hyperspectral data. Increasing demand for rapid, cost-effective analysis of data, particularly in remote or inaccessible areas, justifies development of classification strategies that do not require extensive quantities of labeled data and leverage information derived from existing labeled data.

To address this problem, we propose an adaptive classification framework where existing ground reference data are reused for the classification of new data from a different area. The problem can be characterized as a transfer learning or knowledge transfer [2] scenario. In the transfer setting, direct use of existing labels and class characteristics of one area for another area within a scene can be problematic because the spectral responses related to localized conditions may vary significantly over extended regions. Possible factors for such spatial variation include differences in
vegetation density and composition, soil moisture, topography, and illumination conditions. Characterization of those changes is an ill-posed problem due to the limited availability of reference data and its location-dependent characteristics. In this challenging situation, invariant characteristics of data that are not affected by local conditions are extremely important for robust classification. The proposed approach seeks to mitigate the impact of these issues on classification in several ways.

In the proposed classification framework, we leverage a clustering assumption on the data manifold—a nonlinear space supported by data samples, which usually has lower intrinsic dimensionality than the original input space—by using a manifold regularization kernel machine (MRC) [3]. The clustering assumption is based on the idea that spectrally similar samples are likely to have similar class labels. That is, samples from the same class remain close in feature space, even after unpredictable changes in spectra, thus providing a valuable clue to identification of the class label of new data. Combining this concept with kernel machines in the form of regularization, MRC has achieved enhanced performance for classification of remote sensing data [4]. This semisupervised classification method constrains its classification function to be close to the data manifold by using both labeled and unlabeled samples and finds the classification boundary in a low-density region between two clusters.

However, as in other semisupervised kernel machines such as semisupervised support vector machines (S3VMs) [5], [6] and transductive support vector machines (TSVMs) [7], [8], MRC assumes that unlabeled samples are from the same population as the labeled data, which may not be true in this transfer learning setting. As shown in Fig. 1(a), samples of a given class from a spatially disjoint data set may have a different distribution, potentially resulting in confusion with another class as in Fig. 1(b). To tackle this problem, many adaptive classification methods use a scheme where transductive or semilabeled samples (samples with tentative labels) are selected based on a criterion that is evaluated using existing labeled samples [8]–[10]. However, if spectral signatures of samples overlap significantly between the original and the new data sets, but the new data have good separation between the classes, it may be better to disassociate the classifier from the original labeled data and depend on the clustering condition.

In this context, we propose an adaptive framework, where the classifier is no longer dependent on the existing data after extracting relevant information from the original labeled data (i.e., information on the commonality between the two data sets). The classifier is then adapted to the new data set via iterative application of the classifier using the clustering condition on the data manifold.
Fig. 1. (a) Spectral signatures of two spatially disjoint sets of Botswana Savanna (class 7) with std. dev. envelopes. (b) Spectral signatures of two upland classes, Savanna (class 7) and Island Interior (class 5), are shown.
Apart from the knowledge transfer framework, we also develop a fast grid search scheme for parameter tuning. Although various parameter tuning techniques have been developed for kernel machines [11], [12], most researchers rely on grid search due to its easy implementation and capability to avoid local minima. However, this is time consuming, and the computational load increases exponentially with the number of parameters. (Note that MRC can have up to four tunable parameters.) To mitigate the impact of this problem, we propose an intelligent grid search (IGS) method, a heuristic implementation of the gradient descent method, for the optimal parameter search. Tuning accuracy is estimated by cross-validation. IGS selects potential grid points based on the tuning accuracy of the current grid, reducing time focused on nonoptimal regions of the parameter space.

This paper is organized as follows. Section II reviews both semisupervised classification methods for hyperspectral data and classification methods that have addressed the knowledge transfer problem. The formulation of MRC that is used as a base classifier in the proposed framework is presented in Section III. Section IV describes the proposed knowledge transfer framework, which includes the proposed IGS, a hierarchical tree construction scheme, and the iterative classification approach. In Section V, we test the proposed approach for the transfer learning problem using spatially disjoint samples of Hyperion data. The performance of the proposed method is compared to that achieved by Laplacian support vector machines (LapSVM) and TSVM. Conclusions and future work are discussed in Section VI.

II. RELATED WORK

A. Use of Unlabeled Samples for Hyperspectral Image Classification

In classification of hyperspectral data, unlabeled samples are often used to compensate for small numbers of labeled samples.
Adaptive methods based on a generative model are proposed in [9] and [13]. In [9], Jackson et al. use semilabels repetitively to determine optimal regularization parameters and deduce regularized covariance estimators. They also adopt a self-training strategy in [13], where semilabels are employed to reduce the sensitivity of the expectation–maximization (EM) procedure to the number of labeled samples.

Because it is robust to high-dimensional data, the support vector machine (SVM) method is frequently utilized for classification of hyperspectral data [14]–[18]. Maximizing the margin between labeled samples in a feature space, SVMs often have good performance for data sets with nonlinear decision boundaries between classes. Although SVMs are known to be relatively robust to small sample sizes, use of unlabeled samples has been investigated to enhance performance through semisupervised approaches such as S3VM [5] and TSVM [7]. S3VM incorporates unlabeled samples into the objective function of the binary classification problem by assigning each sample the label that has minimum estimated classification error. The objective function of the 1-norm SVM is then optimized by integer programming. TSVM implements a similar idea in an iterative framework. SVMs are repeatedly applied to test samples, where the test samples are added to the training data set after assigning labels based on their distance from the hyperplane (the separating plane) in the feature space. Both methods are adapted and applied to the classification of remote sensing data in [6] and [8].

Graph-based algorithms also provide a powerful framework for incorporating unlabeled samples. In those methods, a graph is usually represented by "nodes" composed of both labeled and unlabeled samples, and weights between the nodes are assigned based on the corresponding pairwise similarities. In MRC [3], the manifold regularization term is derived from a graph Laplacian which is estimated from both labeled and unlabeled samples. Gomez-Chova et al. [4] apply this method to remote sensing data for binary classification tasks and achieve a higher
overall classification accuracy than TSVM and classic SVM. Camps-Valls et al. [19] also develop a composite kernel to accommodate contextual information in a graph-based kernel machine.

B. Semisupervised Learning for the Transfer Learning Problem

One of the earliest investigations of the knowledge transfer problem was conducted by Bruzzone and Prieto [20]. The key idea is to retrain a maximum likelihood classifier with unlabeled samples from a new image in an EM framework. Rajan et al. [2] extend the binary hierarchical classifier (BHC) [21] to data acquired over the same area at a subsequent time. The BHC's Fisher discriminant analysis is modified iteratively at each node of the tree using unlabeled samples from the new image. In [22], active learning is employed to expedite classifier training by identifying the most informative unlabeled samples. Inamdar et al. [23] address the knowledge transfer problem by transforming the probability density functions of two data sets to a common space.

Kernel-based methods for the knowledge transfer problem can be found in [10] and [24]. Ghoggali and Melgani [24] use a genetic algorithm to update an existing land cover map. Using a priori information on the class transition as input, estimates of an optimal class transition matrix are obtained via a genetic algorithm. SVM is then used to assign final labels to new data with the updated training set. Bruzzone and Marconcini [10] approach the problem of updating land cover by introducing a domain adaptation scheme into the transductive framework of SVMs. The core idea is to update the training data set via an iterative procedure, where unlabeled samples of new data are added incrementally to the existing training data set while a group of existing samples is discarded based on the given criteria. They observe that the data close to the margin boundaries in the margin band (the area between margin boundaries in the feature space) are the most informative samples for the adaptation. Samples and their semilabels are then added to the training set if the semilabels are consistent with the existing classifier.
III. MRC

In this section, we introduce the MRC method, which is used as the base classifier in the proposed framework. Belkin et al. [3] propose this regularization concept by incorporating a smoothness measure of the classification function relative to the data manifold. Define a data matrix X = {x_1, x_2, ...} and the label set y = {y_1, y_2, ...} of the data, where x is a p-dimensional observation vector associated with each pixel in an image and y is the class label of the corresponding sample. Given a set of l labeled samples (x_i, y_i), i = 1, ..., l, and u unlabeled samples x_i, i = (l+1), ..., (l+u), the objective function of MRC is formulated as

R = \frac{1}{l}\sum_{i=1}^{l} V(x_i, y_i, f) + \gamma_A \|f\|_K^2 + \gamma_I \|f\|_I^2    (1)

where f is the classification function and γ_A and γ_I are the regularization parameters. With the data samples and the parameters fixed, the objective function R depends on the classification function f through one cost term and two regularization terms. The cost term (the first term) provides a measure of the classification error (or loss) by comparing the labels of the training samples to the classification function evaluated at those samples. The most common loss functions are the hinge loss and the squared loss, given by

V(x_i, y_i, f) = \max(0, 1 - y_i f(x_i))    (2)

V(x_i, y_i, f) = (y_i - f(x_i))^2.    (3)

In this paper, we use the hinge loss function to exploit the margin maximization scheme because our framework deals with binary classification problems between two "meta-classes," which are often composed of multiple classes. MRC with the hinge loss function is also known as LapSVM. The first regularization term ||f||_K^2 is a function norm in the associated reproducing kernel Hilbert space (RKHS) [25]. The manifold regularization term ||f||_I^2 provides a measure of smoothness of the function on the data manifold. Its estimate for discrete data is given by

\|f\|_I^2 = \frac{1}{(u+l)^2}\sum_{i,j=1}^{l+u} \left(f(x_i) - f(x_j)\right)^2 w_{ij} = \frac{1}{(u+l)^2}\,\mathbf{f}^{T}\mathbf{L}\,\mathbf{f}    (4)

where w_ij is the weight between any two nodes i and j, f is the vector of function values at every node, and L is the graph Laplacian given by L = D − W. The weights of the k nearest points are stored in the weight matrix W, and the diagonal matrix D is given by D_{ii} = \sum_{j=1}^{l+u} w_{ij}.

Training of the classifier involves finding the optimal function f* that minimizes the objective function R

f^* = \arg\min_{f \in \mathcal{H}_K} R.    (5)

According to the representer theorem [26], [27], the solution to this minimization problem exists and can be written in the following form:

f^*(x) = \sum_{i=1}^{l+u} \alpha_i K(x_i, x)    (6)

where K is the kernel function of the RKHS. The expansion coefficients α = [α_1, ..., α_{l+u}] for the hinge loss function can be obtained by solving the linear system

\boldsymbol{\alpha} = \left(2\gamma_A \mathbf{I} + 2\frac{\gamma_I}{(u+l)^2}\mathbf{L}\mathbf{K}\right)^{-1}\mathbf{J}^{T}\mathbf{Y}\boldsymbol{\beta}^{*}    (7)

where I is an identity matrix, K is the Gram matrix whose elements are given by K_ij = K(x_i, x_j), Y = diag(y) is formed from the (l+u)-dimensional label vector y = [y_1, ..., y_l, 0, ..., 0], and J is an (l+u) × (l+u) diagonal matrix given by J = diag(1, ..., 1, 0, ..., 0) with the first l entries equal to one. The optimal values of the Lagrange multipliers β* can be obtained using a standard SVM solver [3]. The tunable parameters of the kernel machine are the two regularization coefficients γ_A and γ_I, the size of the neighborhood k, and the kernel parameters, if kernels other than the linear kernel are used.
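To make the quantities in (4)–(7) concrete, the following sketch builds a k-nearest-neighbor graph Laplacian and evaluates the manifold penalty of (4) with NumPy and scikit-learn. It is not the authors' implementation: the Gaussian edge weighting and the kernel width sigma are assumptions, and because the hinge-loss coefficients in (7) additionally require the Lagrange multipliers beta* from an SVM solver, the closed-form line shown here uses the squared-loss (Laplacian RLS) counterpart of (7) purely for illustration.

```python
# Sketch of the graph Laplacian, the manifold penalty in (4), and a squared-loss
# analogue of (7). Names (X_all, f_vals, y_labeled, gamma_A, gamma_I) are illustrative.
import numpy as np
from sklearn.neighbors import kneighbors_graph
from sklearn.metrics.pairwise import rbf_kernel

def laplacian_and_penalty(X_all, f_vals, n_neighbors=10, sigma=1.0):
    """Graph Laplacian L = D - W from a kNN graph, plus the penalty of (4)."""
    n = X_all.shape[0]
    A = kneighbors_graph(X_all, n_neighbors, mode="distance").toarray()
    A = np.maximum(A, A.T)                                # symmetrize the kNN graph
    W = np.where(A > 0, np.exp(-A**2 / (2 * sigma**2)), 0.0)  # Gaussian edge weights
    D = np.diag(W.sum(axis=1))                            # D_ii = sum_j w_ij
    L = D - W
    penalty = f_vals @ L @ f_vals / n**2                  # ||f||_I^2 as in (4)
    return L, penalty

def laprls_alpha(X_all, y_labeled, n_labeled, gamma_A, gamma_I, L, sigma=1.0):
    """Expansion coefficients for the squared-loss counterpart of (7)."""
    n, l = X_all.shape[0], n_labeled
    K = rbf_kernel(X_all, X_all, gamma=1.0 / (2 * sigma**2))
    J = np.diag(np.r_[np.ones(l), np.zeros(n - l)])       # selects labeled samples
    y = np.r_[y_labeled, np.zeros(n - l)]                  # labels padded with zeros
    M = J @ K + gamma_A * l * np.eye(n) + (gamma_I * l / n**2) * (L @ K)
    return np.linalg.solve(M, y)
```

Either set of expansion coefficients plugs into (6) to evaluate f* at new samples.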
IV. PROPOSED AMC METHOD

A. Assumptions in the Proposed AMC Framework

An important assumption for this knowledge transfer problem is that the two data sets may have different sample distributions but have considerable commonalities that can be identified from either of the data sets. This condition is commonly encountered in large images where ground reference data are obtained for only a subset of the image and in multitemporal data sets with short interacquisition time intervals. In the proposed framework, existing data are used for knowledge transfer in three ways, providing the following: 1) reliable estimates of model parameters; 2) a natural class hierarchy; and 3) rough correspondence between clusters and classes.

Similar to several other studies [14]–[16], we assume that a set of parameters determined for one data set is reasonably reliable for the other region of the image. In the proposed framework, we tune the parameters of MRC using existing data and retain them for the iterative classification procedure.

The second type of information that can often be transferred to new data is a class hierarchy. A hierarchical framework also provides natural capability for extending kernel machines to multiclass problems. Although the kernel machine formulation can be directly modified for multiclass labels, the approach is not widely used because of computational complexity [28], [29]. Simple but widely used schemes for fully supervised kernel machines in multiclass scenarios include pairwise approaches such as one-against-one (OAO) [16], [30] and pairwise coupling [31], one-per-class approaches such as one-against-all (OAA) [6], [8], [15], [16], and binary tree structures [16], [32]–[35]. Error-correcting output codes [36] can also be used to combine different output representations of multiple classifiers.

However, because unlabeled samples are utilized in the training stage, semisupervised methods pose a unique problem which we refer to as "association of unlabeled samples." Given a binary classifier, unlabeled samples can be best exploited only when they are from one of the two classes being addressed by the classifier. Because it is unable to determine this association, OAO is not a viable option for semisupervised classification methods, unless some alternative approach such as using semilabeled samples can be used [37].

The hierarchical approach of our proposed framework provides both an efficient decomposition and a means to handle the association problem. Each node involves a binary split of the given classes into two meta-classes, where the split is determined to maximize separation among the combinations of candidate classes. Classes with similar characteristics are grouped at the top of the hierarchy, whereas those with subtle differences
between individual classes are separated at the bottom of the tree. The hierarchy may thus reveal intrinsic groupings of a data set (e.g., wetland versus upland and vegetated versus nonvegetated), whereas the OAA approach constructs c OAA classifiers regardless of the characteristics of the data. Another advantage of the proposed tree approach relative to OAA is that the number of samples used for each classifier is progressively reduced as it traverses toward the bottom nodes of the tree, whereas each of the c classifiers in OAA requires the full set of training samples.

Existing data can also help to identify correspondence between clusters and classes in the new data set. Identification of this correspondence is important when using a clustering-based method for classification, particularly when each meta-class contains multiple classes as in the hierarchical approach. When a naive classification is performed on new data by training the classifier with existing data, the results may be poor if there is population drift. We assume a scenario where the initial classification results provide reasonable characterization of each class in the spectral domain, so that the subsequent adaptive procedure can correctly classify at least a few test samples. Adaptive manifold classification (AMC) then uses manifold regularization to recover the correct decision boundary based on the results of the initial classification.

B. Proposed AMC Framework

The proposed AMC methodology involves three operations: 1) constructing a binary tree representing the class hierarchy; 2) tuning free parameters at each node of the tree; and 3) constructing and adapting a classifier through iterative classification using MRC. As noted in Section IV-A, the class hierarchy and optimal parameters are obtained from the existing labeled data and then transferred for the classification of new data via the proposed iterative classification method. The overall framework is shown in Fig. 3.

In the following discussion, we use the notation (X̃, ỹ) = {(x̃_1, ỹ_1), ..., (x̃_m, ỹ_m)} for the original labeled data and X = {x_1, ..., x_n} for the new unlabeled data. The goal is to assign correct labels y = {y_1, ..., y_n} to the new data.

1) Binary Tree Construction: Finding the optimal binary split between multiple classes is a computationally demanding task as it involves a combinatorial problem. The normalized cut [38] is employed to determine the optimal class split at each node of the tree. When a graph G is given with vertices V, edges E, and similarities W, a cut measures the dissimilarity between two disjoint partitions of the graph. Although well-separated sets can be obtained by minimizing the cut, they are often highly unbalanced, producing small sets of isolated nodes in the graph. This "biased" partitioning can often be avoided by using a normalized cut, where the cut is divided by the total edge connections to all the nodes in the graph. To obtain the partitioning of a given set of classes, we define the weights between classes as a function of the class-specific sample mean vectors

\mu_c = \frac{1}{n_c}\sum_{t=1}^{n_c}\tilde{x}_t, \qquad \tilde{x}_t \in \{\tilde{x}_j \mid \tilde{y}_j = c\}.    (8)
The weights are then assigned using the distances between the mean vectors

d_{ij} = \|\mu_i - \mu_j\|, \qquad i, j \in \{1, \ldots, c\}    (9)

w_{ij} = \max_{i,j}(d_{ij}) - d_{ij}.    (10)
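As an illustration of (8)–(10), the sketch below computes the class means, their pairwise distances, and the resulting class-similarity weights; the scikit-learn SpectralClustering call is only a stand-in for the normalized-cut solver of [38] that partitions the classes, as described next. The array names X_lab and y_lab (labeled pixels and their class labels) are assumptions.

```python
# Hypothetical sketch of the class-similarity weights (8)-(10) and a two-way
# spectral split approximating the normalized-cut assignment of (11).
import numpy as np
from sklearn.cluster import SpectralClustering

def class_split(X_lab, y_lab):
    classes = np.unique(y_lab)
    mu = np.stack([X_lab[y_lab == c].mean(axis=0) for c in classes])   # eq. (8)
    d = np.linalg.norm(mu[:, None, :] - mu[None, :, :], axis=-1)       # eq. (9)
    w = d.max() - d                                                     # eq. (10)
    assign = SpectralClustering(n_clusters=2, affinity="precomputed",
                                random_state=0).fit_predict(w)
    return dict(zip(classes, assign))   # class -> 0 (left node) or 1 (right node)
```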
The normalized cut algorithm returns the association of each class to either of the child nodes (the meta-classes) as

A_i = \begin{cases} 0, & \text{assigned to left node} \\ 1, & \text{assigned to right node.} \end{cases}    (11)

Once the association is determined, the partitioning is repeated until a single class remains at a leaf node.

2) Parameter Tuning via IGS: After the binary tree is constructed, parameter tuning is performed at each node of the tree using IGS. We propose this simple but effective tuning method to facilitate the selection of an appropriate interval between candidate parameter values. The proposed recursive grid search method performs well with a small grid size (e.g., 3 × 3 or 3 × 3 × 3 for two and three parameters, respectively) as it intelligently adjusts the interval and the location of the grid based on the tuning results of the previous grid. Here, the grid size refers to the number of rows and columns of the grid (e.g., 3 × 3), and the interval refers to the difference in parameter values between adjacent grid points. Once the tuning accuracies are obtained for the initial grid, IGS applies one of the following two procedures, depending on where the maximum tuning accuracy occurs in the grid.

—Shift: When the entry with the maximum tuning accuracy is located on the boundary of the grid, the true maximum is likely to lie outside the current grid, so the location of the grid is shifted. The current grid is moved toward the maximum point so that the center of the grid is placed on the maximum point.

—Shrinking: When the entry with the maximum tuning accuracy does not occur on the boundary, the interval of the grid is halved to determine the optimal parameter value more precisely.

The concept is shown in Fig. 2. The two operations are repeated until the interval of the grid is reduced to a user-defined value. To obtain reliable estimates, m-fold cross-validation is performed for each grid point. Once a validation set is selected for parameter tuning, the samples in the set are divided into m subsets, where the classifier with the parameters of the grid point is trained on (m − 1) subsets and tested on the remaining subset. Whereas the validation set can be constructed with only labeled samples for supervised classification methods, inclusion of unlabeled samples is important for the semisupervised classifier to perform well on new data. For example, in tuning the MRC parameters (γ_A, γ_I, k), we obtained very small values for γ_I when no unlabeled samples were used in the validation set, because the number of labeled samples was large enough that unlabeled samples were not required to achieve adequate classification accuracies. For this reason, we remove a portion of the labels from the currently labeled samples at a given rate R_remove and then use both the labeled and unlabeled samples for the training procedure of the validation process.
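A minimal sketch of the shift/shrink loop for the two-parameter case (γ_A, γ_I) follows. The log10-scale stepping, the default grid center, the iteration cap, and the cv_accuracy callback (an m-fold cross-validation score supplied by the caller) are assumptions rather than details taken from the paper.

```python
# Illustrative IGS loop on a 3x3 grid in log10 parameter space.
import numpy as np

def igs(cv_accuracy, center=(0.0, 0.0), step=1.0, min_step=0.1, max_rounds=50):
    """Shift the grid while the best point lies on its boundary; otherwise halve the interval."""
    center = np.asarray(center, dtype=float)
    offsets = np.array([(i, j) for i in (-1, 0, 1) for j in (-1, 0, 1)])
    for _ in range(max_rounds):
        if step <= min_step:                      # user-defined stopping interval
            break
        grid = center + step * offsets            # current 3x3 grid
        acc = [cv_accuracy(10.0 ** p[0], 10.0 ** p[1]) for p in grid]
        best = offsets[int(np.argmax(acc))]
        if best.any():                            # maximum on the boundary -> shift
            center = center + step * best
        else:                                     # maximum at the center -> shrink
            step /= 2.0
    return 10.0 ** center

# usage (my_mfold_cv is a hypothetical cross-validation routine):
# gamma_A, gamma_I = igs(lambda gA, gI: my_mfold_cv(gA, gI))
```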
Fig. 2. (a) Example of a 3 × 3 grid for the two-parameter case. (b) "Shift" is performed when the (red) optimal point occurs on the boundary of the current grid. (c) "Shrinking" is performed when the (red) optimal point does not occur on the boundary.
Fig. 3. Overall framework of the proposed AMC method.
3) Adaptive MRC for Binary Problems: Given the hierarchical structure and the tuned parameters at each node, the proposed adaptive classification method is applied to the new data at each node. The overall framework is shown in Fig. 3. The key operations of the method are outlined in the remainder of this section.

—Initial classification: For the initial classification, MRC is trained with both the original data (X̃, ỹ) and the new data X and is applied to the new data X to provide semilabels ŷ^(0) for the test samples

C^{(0)} = \operatorname{train}\left([\tilde{X}\; X], \tilde{y}\right)    (12)

\hat{y}^{(0)} = C^{(0)}(X).    (13)
When population drift exists between the original and the new data, the classification accuracy of this initial classification is usually low.
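A sketch of this initial step in (12) and (13) is given below. The make_classifier argument stands in for an MRC/LapSVM constructor with a scikit-learn-style fit/predict interface that treats zero-labeled rows as unlabeled; this interface is an assumption, not a published API.

```python
# Sketch of the initial classification in (12)-(13); all names are illustrative.
import numpy as np

def initial_classification(X_src, y_src, X_new, make_classifier):
    """Train on original labels plus new unlabeled data, then semilabel X_new."""
    X_all = np.vstack([X_src, X_new])
    y_all = np.concatenate([y_src, np.zeros(len(X_new))])   # 0 marks unlabeled samples
    clf = make_classifier().fit(X_all, y_all)                # C^(0), eq. (12)
    return clf.predict(X_new)                                # semilabels y_hat^(0), eq. (13)
```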
TABLE I. NO. OF LABELED SAMPLES FOR BOTSWANA DISJOINT DATA SETS
TABLE II. DATA PAIRS USED IN EACH EXPERIMENT
TABLE III. NOTATION: FREE PARAMETERS FOR KERNEL MACHINES
—Iterative classification: A relaxation is performed on the semilabels to limit their effect

\hat{y}_{\text{relaxed}} = \operatorname{relax}(\hat{y}, r_{\text{relax}}).    (14)
When n is the number of samples in the new data set, the semilabels of n · r_relax (0 < r_relax < 1) randomly selected samples are removed, and the selected samples become unlabeled. The purpose of this procedure is to reduce the effect of the currently labeled samples in a dynamic environment. MRC does not require many labeled samples when the data satisfy a strong clustering condition with well-clustered samples, as can be seen in the experiments with synthetic data in [3]. For such data, only a few labeled samples are sufficient to recover a nonlinear class boundary if the parameters are set correctly. To reduce the impact of semilabels, which are possibly incorrect for the new data, the role of the semilabels in this binary classification problem is limited to providing correspondence between classes and clusters so as to improve the initialization of the subsequent adaptive procedure. However, unlike the synthetic data, it is not possible to identify the true labels for our new data. We may also need additional samples because the new data may be multimodal, and the clustering condition may not be as strong. By randomly selecting semilabeled samples, we aim to weaken the effect of the semilabels under the fixed set of parameters provided by the original data and use them to obtain the approximate range and location of the classes within the spectral manifold. The relaxed semilabels are then used for the training of the next classifier, and the classifier is again applied to the new data

C^{(i)} = \operatorname{train}(X, \hat{y}_{\text{relaxed}})    (15)

\hat{y}^{(i)} = C^{(i)}(X).    (16)
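The relaxation and retraining loop of (14)–(16) can be sketched as below, reusing the hypothetical make_classifier interface from the earlier sketch. The fixed iteration count and the decay factor eta applied to r_relax mirror the convergence device described in the text that follows, but their values are illustrative.

```python
# Sketch of the iterative adaptation in (14)-(16); parameter values are illustrative.
import numpy as np

def adapt(X_new, y_semi, make_classifier, r_relax=0.5, eta=0.9, n_iter=10, seed=0):
    """Relax the semilabels, retrain MRC on the new data, and reclassify."""
    rng = np.random.default_rng(seed)
    y_hat = np.asarray(y_semi).copy()
    for _ in range(n_iter):
        y_relaxed = y_hat.copy()
        n_drop = int(r_relax * len(y_hat))                 # eq. (14): remove n*r_relax labels
        drop = rng.choice(len(y_hat), size=n_drop, replace=False)
        y_relaxed[drop] = 0                                # 0 marks an unlabeled sample
        clf = make_classifier().fit(X_new, y_relaxed)      # C^(i), eq. (15)
        y_hat = clf.predict(X_new)                         # y_hat^(i), eq. (16)
        r_relax *= eta                                     # reduce the relaxation rate
    return y_hat
```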
To guarantee the convergence of the method, we reduce the relaxation rate at each loop of the iteration by a given rate η, 0