Proceedings 5th EARSeL Workshop on Imaging Spectroscopy. Bruges, Belgium, April 23-25 2007
ENSEMBLE CLASSIFIERS FOR HYPERSPECTRAL CLASSIFICATION

Jonathan Cheung-Wai Chan and Frank Canters

Vrije Universiteit Brussel, Department of Geography, Brussels, Belgium
[email protected],
[email protected]
ABSTRACT

Machine learning algorithms are methods developed to deal with large volumes of data with high efficiency. Adaboost has been among the most popular and promising algorithms of the last decade and has demonstrated its potential for classification of remote sensing data. Previous studies have shown that Adaboost, though less stable than bagging (another well-known ensemble classification algorithm), consistently produces higher accuracies in classification tasks across a wide variety of data domains. The use of Adaboost for hyperspectral classification, however, has not been fully explored. Like Adaboost, Random Forest is a recently proposed bootstrap method that generates numerous classifiers, up to hundreds, for classification. Using the same resampling strategy as bagging, Random Forest introduces a new feature, the out-of-bag samples, for feature ranking and evaluation. Its only tuning parameter is the number of features to split on at each node, to which accuracy is reported to be insensitive. Adaboost, in comparison, has no parameter other than the amount of pruning applied to the base classifier; Random Forest applies no pruning at all. In this paper, we compare the results obtained with both classifiers on hyperspectral data. Results from two applications, one on ecotope mapping and one on urban mapping, are presented. Compared with using a single decision tree classifier, Adaboost increases classification accuracy by 9%, and Random Forest by 13%. Both classifiers achieve comparable results in terms of overall accuracy. Random Forest, however, is more efficient because it uses only a random feature subset at each split and applies no pruning. Our results show that both Adaboost and Random Forest are exceptionally fast in training and achieve higher accuracies than classifiers generally considered accurate, such as Multi-Layer Perceptrons. Their limited demands on user input for parameter tuning make them ideal algorithms for operationally oriented tasks. The study demonstrates that Adaboost and Random Forest perform well with hyperspectral data, in terms of both accuracy and ease-of-use.

INTRODUCTION

Hyperspectral imagery provides information in hundreds of spectral bands. This rich spectral information can be utilized in many application domains. For the classification of hyperspectral data, the Spectral Angle Mapper (SAM) is one of the most frequently used algorithms (1). While SAM has more or less become the standard classifier for hyperspectral data and has been implemented in most software packages for hyperspectral image processing, several studies have indicated its limited success. While some authors have investigated ways to enhance the accuracy obtained with SAM (2), many studies have focused on classifiers conventionally applied to multi-spectral imagery, such as the maximum likelihood classifier, artificial neural networks, fuzzy classifiers and decision trees. Recently, the use of binary-hierarchical techniques for hyperspectral image classification has also been investigated (3).
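To make the SAM decision rule concrete, the following minimal sketch assigns each pixel the class whose reference spectrum subtends the smallest spectral angle with the pixel spectrum. It is an illustrative implementation, not that of any particular software package; the array names `image` and `references` are placeholders.

```python
# A minimal, illustrative SAM classifier. `image` is an (n_pixels, n_bands)
# array of spectra and `references` an (n_classes, n_bands) array of class
# reference spectra; both names are placeholders.
import numpy as np

def sam_classify(image, references):
    # Normalize to unit length so the dot product equals cos(angle).
    px = image / np.linalg.norm(image, axis=1, keepdims=True)
    ref = references / np.linalg.norm(references, axis=1, keepdims=True)
    cosines = np.clip(px @ ref.T, -1.0, 1.0)  # (n_pixels, n_classes)
    angles = np.arccos(cosines)               # spectral angles in radians
    return angles.argmin(axis=1)              # class with the smallest angle
```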
Another potentially interesting approach is the use of ensemble classifiers for hyperspectral classification. Instead of building one classifier by applying a learning algorithm, ensemble methods build hundreds of classifiers from the original data set using probabilistic sampling methods. The two most studied and compared ensemble methods that have been applied in remote sensing research are boosting and bagging (4, 5, 6, 7). They are also called voting classification algorithms, since the final labeling of an unknown case is determined by voting over the many classifiers so created.

Adaboost, a representative boosting method, modifies the distribution of the training samples so that cases incorrectly classified in a previous trial have a higher probability of being chosen in a subsequent trial (8). In doing so, newly constructed ensemble classifiers are forced to focus on the more difficult cases. The voting for the final labeling is weighted by the accuracy performance of each classifier. Bagging uses a different approach that exploits the instability of a classifier (9). 'Instability' of a classifier refers to the situation where a small change in the training samples results in a comparatively large change in accuracy. Bagging creates a new training set by random sampling from the original set with replacement. Hence, some of the samples will be replicated (chosen more than once) while others will be missing (not chosen) in a new set. In the end, the vote of each ensemble classifier carries the same weight (a code sketch follows below). In case of a tie, the final decision can be taken randomly or by following prescribed rules. The accuracy improvement obtained with bagging is expected to be larger for classifiers that are more unstable. The concept of bagging forms the basis for the Random Forest predictor, which is explained in more detail in the "METHODS" section.

While ensemble classification algorithms can be applied to any classifier, they were initially developed to boost the accuracy performance of 'weak' learners. A weak learner is defined as a prediction function that has a low bias (10). As low-bias predictors often produce high variance, weak learners are usually not the most accurate predictors. A decision tree, for instance, is considered a weak learning algorithm and is less accurate than 'strong' learners such as multi-layer perceptrons or support vector machines. Decision trees are frequently applied for the classification of remotely sensed data (11, 12, 13). Ensemble methods can be implemented with decision tree classifiers at low computational cost, and their application in the classification of remotely sensed data has shown consistent results (6, 7).

The interest in developing algorithms like Adaboost and Random Forest stems from the need in the machine learning community to detect patterns and extract useful information from huge amounts of data within a limited time frame. Both algorithms have been applied in many different domains. Adaboost has been widely experimented with in the last decade and is reportedly effective in increasing classification accuracy. Though Random Forest has been developed more recently, it has received much attention from the pattern recognition research community. So far, however, the application of Adaboost and Random Forest for hyperspectral classification has not been thoroughly investigated.
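The following is a minimal sketch of the bagging procedure just described. It assumes a generic `fit_tree(X, y)` helper (hypothetical) that returns a trained decision tree exposing a `predict` method, and integer class labels.

```python
# A minimal sketch of bagging with unweighted majority voting.
# `fit_tree(X, y)` is a hypothetical helper returning a trained decision
# tree with a `predict` method; integer class labels are assumed.
import numpy as np

def bagging(X, y, fit_tree, n_trees=100, seed=0):
    rng = np.random.default_rng(seed)
    n = len(y)
    trees = []
    for _ in range(n_trees):
        # Sample n cases with replacement: some cases are replicated,
        # others are left out of the new training set.
        idx = rng.integers(0, n, size=n)
        trees.append(fit_tree(X[idx], y[idx]))
    return trees

def majority_vote(trees, X):
    # Every tree's vote carries the same weight.
    preds = np.stack([t.predict(X) for t in trees])  # (n_trees, n_cases)
    return np.apply_along_axis(lambda v: np.bincount(v).argmax(), 0, preds)
```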
In (14), binary-hierarchical classification is implemented in conjunction with Random Forest prediction for the analysis of hyperspectral data. The results presented in (14) show the performance of two variants of Random Forest that differ in the fraction of the original training samples being used. However, since the results were not compared with results obtained by applying Random Forest in a traditional multi-class classification framework, it is difficult to evaluate the advantages of applying the binary-hierarchical approach in conjunction with Random Forest prediction.

A good choice of classification algorithm is usually determined by balancing accuracy performance against other criteria such as speed and ease-of-use. Although some conventional classification methods such as artificial neural networks and Support Vector Machines are accurate, their performance depends strongly on parameter tuning, and long training times have always been a disadvantage. Furthermore, since users are often required to input many parameters, repeatability may be problematic. Ensemble classifiers that build numerous classifiers are not an intuitive choice for the analysis of hyperspectral data, since they increase the computational burden of a classification procedure already complicated by high-dimensional inputs.
However, since decision trees are extremely fast to build, the cost of building hundreds of decision trees may still be lower than that of training a single multi-layer perceptron. Decision trees also satisfy many other important criteria of a good learning model, such as high accuracy, stable performance, robustness to noise, high repeatability and high interpretability. In this paper, we present the results of applying Adaboost and Random Forest, in combination with a decision tree base classifier, to two hyperspectral data sets: one for an urban area and one for a rural area. The results obtained with both algorithms are compared with those obtained by Multi-Layer Perceptrons. The objective is to evaluate the effectiveness of both ensemble classification algorithms for hyperspectral classification.

METHODS

For robust training, we have chosen decision trees as the base classifier. Decision trees predict class membership by recursively partitioning a data set into more homogeneous subsets. In univariate decision trees, as applied in this study, each new node results from a binary split based on one feature. To avoid overfitting, pruning can be used to produce more stable predictions. One important component in decision tree classification is the method used to estimate splits at each internal node of the tree. C5.0, a commercial successor of C4.5, uses the "information gain ratio" to estimate these splits: the split is selected by maximizing the reduction in entropy of the descendant nodes. Readers interested in the mathematical details of C4.5 are referred to (15). In this study, the commercial software C5.0 was used to carry out the experiments described in this paper.

1. Adaboost

Ensemble classification has been heavily studied in machine learning research. Previous studies have shown that the accuracy obtained by voting over a group of classifiers is always superior to the accuracy obtained by one of the classifiers taken from the same group. Adaboost, proposed in (8), is one of the most studied methods. The algorithm attaches a higher weight to cases that are incorrectly classified in the present trial, so that they have a higher probability of being chosen in the new training set that is used to create a new classifier in the next trial. After each trial, Adaboost changes the probability of a misclassified case by the factor β_t = (1 − ε_t)/ε_t, where ε_t is the sum of the probabilities of the cases misclassified by the current classifier C_t at trial t. The sum of all probabilities is then normalized to 1. If the performance is worse than a random guess (i.e. ε_t is greater than 0.5), the trials terminate and the final trial T becomes t − 1. If ε_t = 0, then T becomes t. Finally, the classifiers C_1, ..., C_T are combined by voting, with each vote weighted by log(β_t).

For our experiment, all boosting routines were run from 9 to 99 trials, in steps of 10 trials. Including the first generated tree, results are thus recorded as snapshots at every 10 trees up to one hundred. Since most experiments reported in the literature show accuracy peaking before the first 100 trials, we assume the chosen number of trials is adequate to give a close-to-complete picture of the strength of the algorithm. The boosting experiments in this study were implemented using the commercial data mining software package C5.0.
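The weight-update loop just described can be sketched as follows. `fit_weighted_tree` is a hypothetical stand-in for C5.0's induction of a tree on a weighted sample; it is assumed to return a tree with a `predict` method.

```python
# A sketch of the Adaboost weight-update loop described above, under the
# stated assumptions; `fit_weighted_tree(X, y, w)` is hypothetical.
import numpy as np

def adaboost(X, y, fit_weighted_tree, n_trials=100):
    n = len(y)
    w = np.full(n, 1.0 / n)              # initial case probabilities
    trees, vote_weights = [], []
    for t in range(n_trials):
        tree = fit_weighted_tree(X, y, w)
        miss = tree.predict(X) != y
        eps = w[miss].sum()              # sum of misclassified probabilities
        if eps > 0.5:                    # worse than random: stop, T = t - 1
            break
        trees.append(tree)
        if eps == 0.0:                   # perfect trial: stop, T = t
            vote_weights.append(1.0)     # degenerate case, log(beta) unbounded
            break
        beta = (1.0 - eps) / eps
        w[miss] *= beta                  # boost misclassified cases
        w /= w.sum()                     # renormalize probabilities to 1
        vote_weights.append(np.log(beta))  # classifier votes with log(beta)
    return trees, vote_weights
```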
2. Random Forest

In contrast with boosting, bagging creates new training sets by random sampling with replacement from the original data set n times, n being the number of samples in the original set. Random Forest creates new training sets using the same bootstrap method as bagging. For each new training set, random feature selection is used for tree building (16). More specifically, for each of the CART-like trees, a random subset of the features is considered at each split. The number of features used at each split is the only parameter to be defined by the user; an often suggested value is the square root of the number of input features. Trees are left to grow fully without pruning. Zero pruning means that low-bias trees are obtained, and random feature selection also keeps the correlation between individual trees low.
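As an illustration, the sketch below trains a Random Forest with exactly these settings using scikit-learn; this is not the implementation used in this study, and the data arrays are random placeholders standing in for training spectra and labels.

```python
# A minimal Random Forest sketch in scikit-learn; data are placeholders.
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)
X_train = rng.random((500, 126))        # placeholder: 500 spectra, 126 bands
y_train = rng.integers(0, 16, 500)      # placeholder: 16 class labels

rf = RandomForestClassifier(
    n_estimators=100,      # number of CART-like trees
    max_features="sqrt",   # square root of the input features at each split
    oob_score=True,        # accuracy estimate from the out-of-bag samples
).fit(X_train, y_train)
print(rf.oob_score_)       # out-of-bag accuracy estimate
```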
Random Forest uses an unbiased method to evaluate the classification accuracy in case a separate test set is not available. For each new training set that is generated, roughly one-third of the training samples are randomly left out; these are called the out-of-bag (OOB) samples. The remaining (in-the-bag) samples are used for building a tree. For accuracy estimation, votes for each case are counted every time the case belongs to the OOB samples, and a majority vote determines the final label. Only approximately one-third of the trees built will vote for each case. The OOB error estimate has been shown to be unbiased in many tests (10).

Although the structure of a single decision tree provides information on important features, such an interpretation becomes impossible when hundreds of trees are used. An additional feature of Random Forest, however, is its ability to evaluate the importance of each feature based on the internal OOB estimates. To evaluate the importance of the m-th variable, its values in the OOB samples are randomly permuted, and the perturbed OOB samples are run down each tree again. The difference in the number of correctly labeled cases between the original and the perturbed OOB samples, averaged over all trees grown in the Random Forest, becomes the importance score of the m-th variable and is used as a ranking index (a sketch of this procedure is given below).

Since only a portion of the input features is used at each split and no pruning is applied, the computational load of Random Forest is comparatively light. The computational time is of the order T·M·N·log(N), where T is the number of trees, M is the number of features used for each split, and N is the number of training samples (10). A prediction is made by the unweighted majority vote of the ensemble of classification trees. Random Forest is described as a very accurate technique, with no risk of overfitting, low bias and low variance.
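The permutation-based importance score described above can be sketched as follows. Here `trees` is a list of fitted trees, `oob_idx[t]` holds the out-of-bag sample indices of tree t, and `X`, `y` are the full training arrays; all names are hypothetical.

```python
# A sketch of the per-variable permutation importance, under the stated
# assumptions about `trees` and `oob_idx`.
import numpy as np

def importance_score(trees, oob_idx, X, y, m, seed=0):
    rng = np.random.default_rng(seed)
    diffs = []
    for t, tree in enumerate(trees):
        Xo, yo = X[oob_idx[t]], y[oob_idx[t]]
        correct = np.sum(tree.predict(Xo) == yo)
        Xp = Xo.copy()
        Xp[:, m] = rng.permutation(Xp[:, m])   # permute the m-th variable
        correct_perm = np.sum(tree.predict(Xp) == yo)
        diffs.append(correct - correct_perm)   # drop in correctly labeled cases
    return float(np.mean(diffs))               # average over all trees
```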
STUDY AREAS

The first study area is a rural area situated east of the town of Geraardsbergen in the province of East Flanders, within the South Flemish loamy hill district (Figure 1). The valley of the river Dender crosses the northern part of the area. Besides the species-rich improved grasslands that dominate this valley, the area also includes semi-natural Calthion grasslands, poplar plantations, and alluvial and oak forest. The forest area in the central part consists mostly of oak forest, with small areas of beech forest and woodland on alluvial soils. The interfluves mostly consist of arable land, with some species-poor to species-rich improved grasslands. The area was recently surveyed (spring of 2004). The ground truth was made available by the Research Institute for Nature and Forest (INBO: Instituut voor Natuur- en Bosonderzoek).

Airborne HyMap data were acquired on June 8, 2004. A level 2 product was generated after radiometric correction, geometric correction (ortho-rectified using bilinear resampling), atmospheric correction and calibration. The image was geo-referenced in UTM based on WGS-84 geodetic coordinates. The pixel size of the imagery is 4×4 m. The HyMap sensor has 4 spectrometers, each producing 32 spectral bands. However, since the VIS and NIR spectrometers show a slight overlap between the long-wavelength region of the VIS spectrometer and the short-wavelength region of the NIR spectrometer, the total number of calibrated bands is 126 instead of 128. The final calibration yields a continuous spectrum from 0.4 to 2.5 μm.
Figure 1: The figure on the left shows the location of the image within the Flemish Region: the image corresponds to the shaded rectangle on the extract of the topographic map. The figure on the right shows the ground truth for each of the 16 ecotope classes considered in the analysis.

The second study area is located to the south of the city of Ghent and covers a peri-urban area characterized by high-rise buildings and residential zones (Figure 2). The airborne CASI image available for the area was collected on September 13, 2002 (OSTC APEX exploitation program). The image measures about 200 m by 1000 m, with a spatial resolution of 1.34 m. It consists of 48 spectral bands spanning a spectral range from 0.425 to 0.975 μm. The imagery was pre-processed with radiometric, geometric and atmospheric corrections. Ground truth was based on visual interpretation of large-scale aerial photography verified in the field.

Table 1 presents the classification schemes for the two study areas and the number of training and validation pixels. The right parts of Figures 1 and 2 show the location of the ground truth pixels. These pixels were extracted from the imagery and randomly divided into training pixels and validation pixels for accuracy assessment. Average spectral profiles were created for all the classes in each of the two data sets (Figures 3 and 4). Figure 3 presents the spectral profiles of each ecotope class (reflectance) derived from the 126 HyMap channels. Figure 4 shows the spectral profiles of the urban surface types (radiance) derived from the 48 CASI channels.

RESULTS

The same classification strategy was used on both data sets. Both Random Forest and Adaboost were applied with 1 to 100 trees; in the case of Adaboost, one hundred trees equals 99 trials of boosting. Snapshots at every ten trees were extracted to check the accuracy. Results on the relationship between the number of trees and the accuracy of the classifiers, which are not presented in this paper, show that overall accuracy comes close to its peak with fewer than 50 trees. Accuracy increases further as the number of trees grows, but only marginally. The snapshots corresponding to the highest accuracies are shown in Tables 2 and 3.
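A possible realization of this snapshot procedure is sketched below for a Random Forest in scikit-learn (not the software used in the study); the training and validation arrays are random placeholders.

```python
# Record overall accuracy every ten trees, from 10 up to 100 trees.
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score

rng = np.random.default_rng(0)
X_train, y_train = rng.random((500, 126)), rng.integers(0, 16, 500)
X_val, y_val = rng.random((200, 126)), rng.integers(0, 16, 200)

for n_trees in range(10, 101, 10):
    rf = RandomForestClassifier(n_estimators=n_trees, max_features="sqrt",
                                random_state=0).fit(X_train, y_train)
    oa = accuracy_score(y_val, rf.predict(X_val))
    print(f"{n_trees:3d} trees: overall accuracy = {oa:.3f}")
```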
Figure 2: The figure on the left shows the location of the urban study area in the south of the city of Ghent. The figure on the right shows the ground truth extracted for each of the 17 urban surface types present in the scene.

Table 1: Classification scheme and number of training and validation pixels for the two study areas
Figure 3: Average spectral profiles for the ecotope classes
Figure 4: Average spectral profiles for the urban classes.

For the ecotope data set, accuracy peaks at 80 trees with Adaboost and at 70 trees with Random Forest. Compared to the use of one decision tree, Adaboost increases classification accuracy by 9%; with Random Forest, the increase in accuracy is more than 11% compared to the use of one CART-like tree. In both cases, overall accuracies are close to 70%. For the urban data set, accuracies peak at 70 and 80 trees for Adaboost and Random Forest respectively.
Compared to using one decision tree, Adaboost increases the accuracy by 8% and Random Forest by 13%. Both ensemble classifiers achieve 75% overall accuracy.

For the sake of comparison, we also used Multi-Layer Perceptrons (MLP) for the classification of both data sets. The MLP is a type of artificial neural network that is considered very accurate for classification tasks. The overall accuracy of the ecotope classification with MLP is 63.7%, and that of the urban classification is 65.6%. These accuracies are about 6% lower than those obtained with the ensemble classifiers in the case of ecotope mapping, and 10% lower in the case of urban mapping. Compared to MLP, the superiority of the ensemble classifiers has thus been shown not just in terms of overall accuracy, but also in terms of ease-of-use, speed and repeatability, which are other important criteria of an optimal classification algorithm.

Both ensemble classifiers handle the hyperspectral data sets with great efficiency. Random Forest is even less demanding than Adaboost, as the out-of-bag strategy uses only a portion of the training samples to build each tree and a random feature subset is used for each split. Most runs finish within 2-3 minutes. In the case of the ecotope data set, which has a larger number of training samples than the urban data set, computing 100 trees with Adaboost takes roughly 30 minutes. By comparison, it takes an MLP between 30 minutes and more than an hour to train, depending on the parameter settings. Furthermore, both ensemble classifiers require very little input from the user. The only parameter of Adaboost is the amount of pruning, for which we used the default value of 25%. The only parameter of Random Forest is the number of features used for each split, which we determined empirically by building 10 or 20 trees with different numbers of features and choosing the number that gives the highest accuracy (a sketch of this procedure is given below). Dependency on user input has always been a concern when choosing a classification algorithm such as MLP. Fewer requirements in terms of user input improve the repeatability of the results, which is an important feature of a classification tool.
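One way such tuning could be realized is sketched below with scikit-learn: grow a small forest for several candidate per-split feature counts and keep the value with the best out-of-bag accuracy. The candidate values and data arrays are illustrative only.

```python
# Empirical tuning of the per-split feature count with small forests.
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)
X_train, y_train = rng.random((500, 126)), rng.integers(0, 16, 500)

scores = {}
for m in (4, 8, 11, 16, 32):           # candidates around sqrt(126) ~ 11
    rf = RandomForestClassifier(n_estimators=20, max_features=m,
                                oob_score=True, random_state=0)
    scores[m] = rf.fit(X_train, y_train).oob_score_
best_m = max(scores, key=scores.get)
print(f"{best_m} features per split (OOB accuracy {scores[best_m]:.3f})")
```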
Table 2: Overall accuracies of Adaboost and Random Forest for the ecotope data set
Table 3: Overall accuracies of Adaboost and Random Forest for the urban data set
CONCLUSIONS

Classification of hyperspectral data using ensemble classifiers has not been fully investigated in the literature. This paper presents results of two benchmark ensemble classifiers: Random Forest and Adaboost. Both ensemble classification algorithms were applied to airborne hyperspectral imagery acquired for two very different applications: ecotope mapping and urban mapping. In both cases, the ensemble classifiers showed substantially higher accuracies than a single decision tree. Results were also compared with those of Multi-Layer Perceptrons, which are generally considered very accurate classifiers. Not only do both ensemble classifiers achieve higher accuracies than MLPs, they also complete their training in a much shorter time. Another advantage of Random Forest and Adaboost is that both algorithms require very little user input, which improves the repeatability of results. The results obtained with the two data sets show that ensemble classifiers are effective for the classification of hyperspectral data and that many of their properties make them well-suited for use in operationally oriented projects. More research is therefore necessary to further explore the potential of these algorithms for hyperspectral data analysis.
ACKNOWLEDGEMENTS

The research presented in this paper is funded by the Belgian Science Policy Office in the framework of the STEREO II programme (project SR/00/103). We would like to thank William De Genst from the Department of Geography of the Vrije Universiteit Brussel, Marc Binard from the Laboratoire SURFACES of the Université de Liège, and Nathalie Stephenne and Alexandre Carleer from IGEAT of the Université Libre de Bruxelles for their assistance in the collection of the training and validation data for the Ghent study area. We also thank the Instituut voor Natuur- en Bosonderzoek (INBO) for making the ground truth of the Biological Valuation Map available for our study.
Desiré Paelinckx, Toon Van Daele and Luc de Bruyn from INBO are thanked for their assistance during the field visits to the Geraardsbergen study area.

REFERENCES
1 Kruse F A, A B Lefkoff, J W Boardman, K B Heidebrecht, A T Shapiro, P J Barloon & A F H Goetz, 1993. The Spectral Image Processing System (SIPS) - interactive visualization and analysis of imaging spectrometer data. Remote Sensing of Environment, Special issue on AVIRIS, 44: 145-163.
2 Schmidt K S, A K Skidmore, E H Kloosterman, H van Oosten, L Kumar & J A M Janssen, 2004. Mapping coastal vegetation using an expert system and hyperspectral imagery. Photogrammetric Engineering & Remote Sensing, 70(6): 703-715.
3 Kumar S, J Ghosh & M M Crawford, 2002. Hierarchical fusion of multiple classifiers for hyperspectral data analysis. Pattern Analysis and Applications, 5(2): 210-220.
4 Bauer E & R Kohavi, 1999. An empirical comparison of voting classification algorithms: bagging, boosting, and variants. Machine Learning, 36(1-2): 105-139.
5 Dietterich T G, 2000. An experimental comparison of three methods for constructing ensembles of decision trees: bagging, boosting, and randomization. Machine Learning, 40(2): 139-158.
6 Chan J C-W, C Huang & R S DeFries, 2001. Enhanced algorithm performance for land cover classification from remotely sensed data using bagging and boosting. IEEE Transactions on Geoscience and Remote Sensing, 39(3): 693-695.
7 DeFries R S & J C-W Chan, 2000. Multiple criteria for evaluating machine learning algorithms for land cover classification from satellite data. Remote Sensing of Environment, 74: 503-515.
8 Freund Y & R E Schapire, 1996. Experiments with a new boosting algorithm. In: Machine Learning: Proceedings of the Thirteenth International Conference, pp. 148-156.
9 Breiman L, 1996. Bagging predictors. Machine Learning, 24(2): 123-140.
10 Breiman L, 2003. Manual for Setting Up, Using, and Understanding Random Forest V4.0.
11 DeFries R S, M Hansen, J R G Townshend & R Sohlberg, 1998. Global land cover classification at 8 km spatial resolution: the use of training data derived from Landsat imagery in decision tree classifiers. International Journal of Remote Sensing, 19(16): 3141-3168.
12 Friedl M A & C E Brodley, 1997. Decision tree classification of land cover from remotely sensed data. Remote Sensing of Environment, 61(3): 399-409.
13 Friedl M A, C E Brodley & A Strahler, 1999. Maximizing land cover classification accuracies produced by decision trees at continental to global scales. IEEE Transactions on Geoscience and Remote Sensing, 37(2): 969-977.
14 Crawford M M, J Ham, Y Chen & J Ghosh, 2003. Random forests of binary hierarchical classifiers for analysis of hyperspectral data. In: IEEE Workshop on Advances in Techniques for Analysis of Remotely Sensed Data, 27-28 Oct. 2003, pp. 337-345.
15 Quinlan J R, 1993. C4.5: Programs for Machine Learning (Morgan Kaufmann Publishers, San Mateo, CA), 302 pp.
16 Breiman L, 2001. Random forests. Machine Learning, 45: 5-32.