Robust supervised image classifiers by spatial AdaBoost based on robust loss functions

Ryuei Nishii(a) and Shinto Eguchi(b)

(a) Faculty of Mathematics, Kyushu University, Hakozaki, Higashi-ku, Fukuoka 812-8581, Japan
(b) Institute of Statistical Mathematics, Minami-Azabu, Minato-ku, Tokyo 106-8569, Japan
ABSTRACT

Spatial AdaBoost, proposed by Nishii and Eguchi (2005), is a machine learning technique for contextual supervised image classification of land-cover categories of geostatistical data. The method classifies a pixel through a convex combination of the log posterior probability at the current pixel and averages of log posteriors in various neighborhoods of the pixel. The weights for the log posteriors are tuned by minimizing the empirical risk based on the exponential loss function. The method classifies test data very fast and shows performance similar to the Markov-random-field-based classifier in many cases. However, it is also known that the classifier gives poor results for some data when the exponential loss puts too large a penalty on misclassified data. In this paper, we consider robust Spatial boosting methods obtained by taking a robust loss function instead of the exponential loss. For example, the logit loss function puts an approximately linear penalty on misclassified data and is therefore robust. The Spatial boosting methods are applied to artificial multispectral images and to benchmark data sets. It is shown that Spatial LogitBoost, based on the logit loss, classifies the benchmark data very well even though Spatial AdaBoost, based on the exponential loss, fails to classify the data.

Keywords: Bayes rule, image segmentation, loss function, machine learning, Markov random fields
1. INTRODUCTION

Supervised image classification of land-cover categories of geostatistical data is an important issue in the remote sensing community. For this purpose, statistical approaches have been widely discussed in the literature1-4, and a review paper5 has been presented. Fusion techniques and machine learning techniques6-8 have also been discussed. In the paradigm of supervised learning, AdaBoost9 was proposed and has been widely and rapidly improved for use in pattern recognition. It combines multiple classifiers linearly, and the derived classifier shows high performance10,11.

Spatial AdaBoost12 is a machine learning technique for contextual supervised image classification. The method classifies a pixel through a convex combination of the log posterior probability at the current pixel and averages of log posteriors in various neighborhoods of the pixel. The weights for the log posteriors are tuned by minimizing the empirical exponential risk. The method classifies test data very fast and shows performance similar to the Markov-random-field-based (MRF-based) classifier in many cases. However, it is also known that the classifier gives poor results for some data when the exponential loss puts too large a penalty on misclassified data. In this paper, we consider a robust Spatial boosting method obtained by taking a robust loss function instead of the exponential loss. For example, the logit loss function puts an approximately linear penalty on misclassified data and is robust. The proposed method is applied to artificial multispectral images and benchmark data sets. It is shown that the proposed method classifies the benchmark data very well even though Spatial AdaBoost, based on the usual exponential loss, fails to classify the data.

The remainder of this paper is organized as follows. In Section 2, loss functions for misclassification in the multi-category case are given. Section 3 introduces Spatial boosting based on a loss function so as to derive a robust classification result.
Further author information: (Send correspondence to R. Nishii)
R. Nishii: E-mail: [email protected], Telephone: +81 92-642-2765
S. Eguchi: E-mail: [email protected], Telephone: +81 3-5421-8728
Fig. 1. Loss functions (exponential, logit, and 0−1) plotted against the margin t.
Neighborhoods of a pixel and the averages of the log posteriors therein are defined there and are later used to build classifiers. The Spatial boosting methods as well as MRF-based classifiers are examined through three data sets in Section 4. It is shown that Spatial LogitBoost and Spatial AdaBoost show similar performance, but the former still performs well in a case where the latter fails in classification. Section 5 concludes the article.
2. LOSS FUNCTIONS AND EMPIRICAL RISK FUNCTIONS

We first review the ordinary boosting method9. Suppose that there are g possible land-cover categories C1, ..., Cg (conifer, broad leaf, etc.). Let R = {1, ..., n} be a training region with n pixels, and suppose that each pixel belongs to one of the categories. We denote the set of all category labels by G = {1, ..., g}. Let xi ∈ R^m be an m-dimensional feature vector observed at pixel i, and let yi be its true label in the label set G. Note that pixel i in the region R is a numbered small unit area on the surface of the earth.

Let F(x, k) be a classification function of the feature vector x ∈ R^m and the label k in the label set G. We allocate the feature vector x into the category with label ŷF ∈ G given by the maximizer

ŷF = arg max_{k∈G} F(x, k).   (1)
A typical example of the classification function is the posterior probability of the label Y = k given the feature vector X = x, and this classification function gives the Bayes classifier. Let y ∈ G be the true label of the feature vector x. Then, the loss of misclassification, L(F, k | x, y), due to the classification function F is assessed by the loss functions

Lexp(F, k | x, y) = exp{F(x, k) − F(x, y)}  and  Llogit(F, k | x) = F(x, k) − log Σ_{ℓ∈G} exp{F(x, ℓ)},   (2)

where k is a label in G. Lexp and Llogit are called the exponential and the logit loss functions, respectively. Empirical risks are defined as averages of the loss functions evaluated on the training data set {(xi, yi) ∈ R^m × G | i ∈ R}:

Rexp(F) = (1/n) Σ_{i∈R} Σ_{k∈G} Lexp(F, k | xi, yi)  and  Rlogit(F) = (1/n) Σ_{i∈R} Σ_{k∈G} Llogit(F, k | xi).   (3)

AdaBoost and LogitBoost aim to minimize the exponential risk Rexp(F) and the logit risk Rlogit(F), respectively. For the two-category case (g = 2), let the label set be {1, −1}. Then, the true label y ∈ {1, −1} of a test vector x can be estimated by the sign of the classification function f(x): if f(x) > 0, then x is classified into label 1, and otherwise into −1.
Fig. 2. Isotropic subsets Ur(i) with center pixel i and radius r = 0, 1, √2, 2 (panels (a)–(d)).

Fig. 3. True labels of the simulated data.
Hence, if t ≡ yf(x) < 0, the vector x is misclassified. In this case, the loss functions given in formula (2) reduce to

Lexp(t) = exp(−t)  and  Llogit(t) = log{1 + exp(−2t)} − log 2 + 1.   (4)

Fig. 1 illustrates these loss functions together with the usual 0−1 loss, plotted against t = yf(x). The exponential and logit functions are smooth and convex. The exponential function puts a much heavier penalty on misclassified data than the logit function does. This fact implies that AdaBoost is not robust compared with LogitBoost. Other types of robust loss functions14 can also be implemented.
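To make the comparison in Fig. 1 concrete, the following minimal sketch (in Python/NumPy, not part of the original paper; the function names are ours) evaluates the three losses of (4) on a few margin values t = yf(x).

```python
import numpy as np

def loss_exp(t):
    """Exponential loss exp(-t) of the margin t = y*f(x), cf. Eq. (4)."""
    return np.exp(-t)

def loss_logit(t):
    """Logit loss log(1 + exp(-2t)) - log 2 + 1, normalized so that it equals 1 at t = 0."""
    return np.logaddexp(0.0, -2.0 * t) - np.log(2.0) + 1.0

def loss_01(t):
    """0-1 loss: 1 for a misclassification (t < 0), 0 otherwise."""
    return (t < 0).astype(float)

t = np.linspace(-3.0, 3.0, 7)
for name, fn in (("exponential", loss_exp), ("logit", loss_logit), ("0-1", loss_01)):
    print(f"{name:>12}: {np.round(fn(t), 3)}")
```

At t = −3, for example, the exponential loss is about 20.1 while the logit loss is about 6.3, which is the heavier penalization referred to above.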
3. AVERAGES OF LOG POSTERIORS AND SPATIAL BOOSTING

Next, we proceed to Spatial boosting. Consider a statistical approach to classification. Let p(x | k) be a probability density function of x over the feature space R^m specific to the category Ck, where k is a label in the label set G. Then, the posterior distribution of the label Y = k given the feature vector X = x is derived as

Pr{Y = k | X = x} ≡ p(k | x) = πk p(x | k) / Σ_{ℓ∈G} πℓ p(x | ℓ),   (5)

where πk is the prior probability of the category Ck. The posterior probability p(k | x) defined by (5) is a measure of the confidence of the current classification and is closely related to logistic-type discriminant functions13.

Let the subset Ur(i) of the training region R be a neighborhood of pixel i with radius r defined by

Ur(i) = {j ∈ R | d(i, j) = r},  r = 0, 1, √2, 2, √5, ...,

where d(i, j) denotes the distance between the centers of pixels i and j. Fig. 2 gives the neighborhoods of pixel i with radius r = 0, 1, √2, 2. Then, define the average of the log posteriors in the neighborhood Ur(i) by
fr(k | i) = (1 / |Ur(i)|) Σ_{j∈Ur(i)} log p(k | xj)  if |Ur(i)| > 0,  and  fr(k | i) = 0  otherwise,  for r = 0, 1, √2, ...,   (6)
where |S| denotes the cardinality of the set S, and p(k | xj) is the posterior probability defined by (5). The most efficient classifier among the functions fr in (6) would be f0 (the log posterior itself), with f1 (the average of log posteriors in the first-order neighborhood) next, and so on. Hence, Spatial AdaBoost first minimizes the risk Rexp(c f0) defined by (3) with respect to a positive constant c; call the minimizer c0. An iterative procedure for obtaining the optimal coefficient c minimizing Rexp(F + c f) for given classification functions F and f is available, based on the delta method12. Similarly, the optimal coefficient c minimizing Rlogit(F + c f) can be obtained. Thus, the proposed Spatial boosting proceeds as follows.

Procedure of Spatial boosting
1. Fix a loss function and the corresponding risk function R(·) defined by formula (3).
2. For the log posterior f0 defined by (6), find the optimal coefficient c which minimizes the empirical risk R(c f0); call it c0.
3. If the coefficient c0 is negative, quit the procedure. Otherwise, consider the empirical risk R(c0 f0 + c f1) with c0 given in the previous step. Then, find the optimal coefficient c that minimizes this empirical risk; call it c1.
4. If c1 is negative, quit the procedure. Otherwise, consider the empirical risk R(c0 f0 + c1 f1 + c f_√2). This procedure is repeated, and we obtain positive coefficients c0, c1, ..., cr for the averages of log posteriors.
5. The label y at the i-th pixel of the test data is estimated by maximizing the weighted sum

Fr(k | i) = c0 f0(k | i) + c1 f1(k | i) + · · · + cr fr(k | i)   (7)

with respect to the label k in G = {1, ..., g}, where the functions f0(· | ·), ..., fr(· | ·) are the averages of log posteriors evaluated on the test data.
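As an illustration of steps 1–5, here is a minimal Python/NumPy sketch of Spatial AdaBoost under the exponential risk of (3). It assumes the averaged log posteriors fr(k | i) of (6) have already been computed and stacked as arrays f_list[r] of shape (n, g). Unlike the paper, which uses an iterative delta-method update12 for each coefficient, this sketch simply searches each coefficient over a grid of non-negative values; all names are ours.

```python
import numpy as np

def exp_risk(F, y):
    """Empirical exponential risk of Eq. (3): mean over pixels i and labels k of exp{F(i,k) - F(i,y_i)}."""
    n = F.shape[0]
    margins = F - F[np.arange(n), y][:, None]
    return np.exp(margins).sum() / n

def fit_spatial_boost(f_list, y, grid=np.linspace(0.0, 3.0, 601)):
    """Greedily tune c0, c1, ... (steps 2-4), stopping when no positive coefficient lowers the risk."""
    n, g = f_list[0].shape
    F = np.zeros((n, g))
    coeffs = []
    for f in f_list:                      # f[i, k] = f_r(k | i), in increasing order of the radius r
        risks = [exp_risk(F + c * f, y) for c in grid]
        c_opt = grid[int(np.argmin(risks))]
        if c_opt <= 0.0:                  # optimal coefficient is not positive: quit (steps 3-4)
            break
        coeffs.append(c_opt)
        F = F + c_opt * f
    return coeffs

def classify(f_list, coeffs):
    """Step 5: maximize F_r(k | i) = sum_r c_r f_r(k | i) of Eq. (7) over the labels k."""
    F = sum(c * f for c, f in zip(coeffs, f_list))
    return np.argmax(F, axis=1)
```

Replacing exp_risk by a logit-type risk turns the same loop into Spatial LogitBoost; only the risk function changes.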
4. NUMERICAL EXPERIMENTS

The proposed method is examined through three data sets: an artificially generated data set and the benchmark data sets grss dfc 0006 and grss dfc 0009 for supervised image classification. The last two data sets can be accessed from the IEEE GRSS Data Fusion reference database15.
4.1. Application to an Artificial Data Set

The proposed method is applied to multispectral images12 generated over the true label image of Fig. 3, of size 91 × 91. There are three categories (g = 3), and the numbers of training pixels for the categories are 3330, 1371 and 3580 (n = 8281), respectively. We simulate four-dimensional spectral images (m = 4) at each pixel of the true image following independent multivariate Gaussian distributions with mean vectors µ1 = (0 0 0 0)^T, µ2 = (1 1 0 0)^T/√2, and µ3 = (1.0498 −0.6379 0 0)^T and the common variance-covariance matrix σ²I, where I stands for the identity matrix. The mean vectors were chosen so as to maximize the pseudo-likelihood of the training data. Test data of size 8281 are generated similarly over the same image of Fig. 3, independently of the training data.

Gaussian density functions with the common variance-covariance matrix are used to derive the posterior probabilities. It is well known that these densities yield the linear discriminant function (LDF). We apply the classification functions F0, F1, ..., F_√20 to the test data, where the functions Fr are defined by (7), and the coefficients c0, c1, ..., c_√20 are sequentially tuned by minimizing the empirical risk (3).

Table 1 compares the error rates due to Gaussian MRF-based (GMRF) classifiers and the Spatial boosting methods for the error variance σ² = 1 and 4. Each row corresponds to a GMRF with the neighborhood system U1(i) ∪ U_√2(i) ∪ · · · ∪ Ur(i) and to the classification function Fr. Note that classification based on radius 0 indicates non-contextual classification by the LDF. Boldfaced numerals in Table 1 indicate the minimum of each column. When the variance σ² = 1, the error rate 41.75% due to the LDF is drastically improved to less than 5% by the use of spatial information. The GMRF classifier is slightly superior to the Spatial boosting methods, and Spatial AdaBoost and Spatial LogitBoost show similar performance. Note that Spatial boosting is very fast compared with GMRF-based classifiers. Matlab source code of Spatial AdaBoost and the simulated data used here are available through WWW∗.
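For concreteness, the following Python/NumPy sketch imitates this simulation: Gaussian four-band spectra with the stated mean vectors and common covariance σ²I are generated and classified non-contextually via the LDF posteriors of (5). The true 91 × 91 label image of Fig. 3 is not reproduced here, so a random label image is used as a hypothetical stand-in, and the true means and empirical priors are plugged in instead of being estimated from training data.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical stand-in for the 91 x 91 true label image of Fig. 3 (three categories).
labels = rng.integers(0, 3, size=(91, 91))

# Mean vectors and common covariance sigma^2 * I of the four spectral bands (m = 4).
mu = np.array([[0.0, 0.0, 0.0, 0.0],
               [1.0 / np.sqrt(2.0), 1.0 / np.sqrt(2.0), 0.0, 0.0],
               [1.0498, -0.6379, 0.0, 0.0]])
sigma2 = 1.0
X = mu[labels] + np.sqrt(sigma2) * rng.standard_normal(labels.shape + (4,))

def ldf_log_posteriors(X, means, sigma2, priors):
    """Log posteriors of Eq. (5) for Gaussian densities with a common spherical covariance (the LDF case)."""
    d2 = ((X[..., None, :] - means) ** 2).sum(axis=-1)      # squared distance of each pixel to each mean
    logp = np.log(priors) - 0.5 * d2 / sigma2
    return logp - np.logaddexp.reduce(logp, axis=-1, keepdims=True)

priors = np.bincount(labels.ravel(), minlength=3) / labels.size
logpost = ldf_log_posteriors(X, mu, sigma2, priors)
print("non-contextual (r = 0) error rate:", (np.argmax(logpost, axis=-1) != labels).mean())
```

The resulting log-posterior array is exactly the input f0 needed by the boosting sketch of Section 3; the neighborhood averages f1, f_√2, ... would be obtained by averaging it over the subsets Ur(i) of Fig. 2.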
4.2. Application to the Benchmark Data Set grss dfc 0006

Next, the Spatial boosting methods are applied to the benchmark data sets grss dfc 0006 and grss dfc 0009.
∗ http://www.math.kyushu-u.ac.jp/~nishii
Table 1. Error rates (%) due to GMRF and Spatial boosting for the simulated data with variance-covariance matrix σ²I

                          σ² = 1                        σ² = 4
  Neighborhood                Spatial boosting              Spatial boosting
  r²             GMRF      Exp.     Logit      GMRF      Exp.     Logit
  0              41.75     41.67    41.67      54.56     54.56    54.41
  1              10.68     20.40    19.45      40.55     40.88    40.77
  2               5.01     12.15    11.63      17.14     33.79    33.95
  4               3.57      8.77     8.36      12.04     28.78    28.74
  5               3.23      5.72     5.93      10.04     22.71    22.44
  8               4.61      5.39     5.43      10.18     20.50    20.55
  9               5.12      4.82     4.96      10.19     18.96    19.01
  10              6.10      4.38     4.65      11.05     16.58    16.56
  13              9.52      4.34     4.54      19.41     14.99    15.08
  16             11.70      4.31     4.43      18.51     14.14    14.12
  17             13.56      4.19     4.31      20.66     13.02    13.22
  18             21.78      4.34     4.58      21.46     12.57    12.69
  20             30.58      4.43     4.66      22.79     12.16    12.11
Table 2. Error rates (%) by GMRF and Spatial boosting for grss dfc 0006

  Neighborhood               Spatial boosting
  r²             GMRF      Exp.     Logit
  0              8.61      8.61     8.61
  4              6.09      7.08     7.93
  9              5.69      6.11     7.33
  16             5.35      5.95     6.74
  25             5.69      5.83     6.35
  36             5.69      5.85     6.13
  49             6.13      5.99     6.32
  64             6.65      6.01     6.48
The set grss dfc 0006 consists of samples acquired by ATM and SAR (six and nine bands, respectively; m = 15) with five agricultural categories (g = 5) in Feltwell, UK. The numbers of the training data and test data are 5072 = 1436 + 1070 + 341 + 1411 + 814 and 5760 = 2017 + 1355 + 548 + 877 + 963, respectively, in a rectangular region of size 350 × 250. Gaussian distributions with the common variance-covariance matrix are fitted to the class-conditional densities. Table 2 lists the error rates due to the GMRF7 , to Spatial AdaBoost and to Spatial LogitBoost. The Spatial boosting methods improve the classification result, and error rates are comparable to those due to the GMRF-based method.
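The fitting step mentioned above can be sketched as follows (Python/NumPy, our own illustration and not the authors' code), assuming training features X of shape (n, m) and integer labels y: class means, a pooled (common) covariance matrix, and empirical priors are estimated, from which the LDF posteriors of (5) follow.

```python
import numpy as np

def fit_common_covariance_gaussians(X, y, g):
    """Estimate class means, the pooled covariance matrix and priors, as used for the LDF posteriors."""
    n, m = X.shape
    means = np.stack([X[y == k].mean(axis=0) for k in range(g)])
    pooled = np.zeros((m, m))
    for k in range(g):
        Xk = X[y == k]
        pooled += (Xk - means[k]).T @ (Xk - means[k])
    pooled /= (n - g)                       # pooled within-class covariance
    priors = np.bincount(y, minlength=g) / n
    return means, pooled, priors

def gaussian_log_posteriors(X, means, cov, priors):
    """Log posteriors log p(k | x) of Eq. (5) under the fitted Gaussian class densities."""
    prec = np.linalg.inv(cov)
    d = X[:, None, :] - means[None, :, :]   # (n, g, m) differences to each class mean
    quad = np.einsum('ngm,mp,ngp->ng', d, prec, d)
    logp = np.log(priors)[None, :] - 0.5 * quad
    return logp - np.logaddexp.reduce(logp, axis=1, keepdims=True)
```

Because the covariance matrix is common to all classes, its determinant cancels in the posterior and only the quadratic terms and priors matter, which is what makes the resulting discriminant function linear.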
Table 3. Optimal coefficients tuned by minimizing the exponential risks based on LDF and QDF for the data set grss dfc 0009

                           LDF             QDF
  Error rate by f0         10.69 %         1.66 %
  Tuned c0 to f0           −0.006805       −0.000953
  Error rate by c0 f0      nearly 100 %    nearly 100 %

Table 4. Error rates (%) due to the contextual classification function c0 f0 + c1 f1 given c0 for the data set grss dfc 0009. The c1 are tuned by minimizing the empirical exponential risk.

  given c0      LDF       QDF
  10^−7          9.97     1.35
  10^−6         10.00     1.35
  10^−5         10.00     1.35
  10^−4         10.00     1.25
  10^−3          9.10     1.00
  10^−2          9.45     1.38
  10^−1         10.24     1.66
  1             10.69     1.66
  ∞             10.69     1.66
4.3. Application to the Benchmark Data Set grss dfc 0009

The data set grss dfc 0009 consists of samples acquired by Landsat 7 ETM+ (m = 7) with eleven agricultural categories (g = 11) in Flevoland, the Netherlands. The numbers of training data and test data are 2891 and 2890, respectively, in a square region of size 512 × 512. For the class-conditional densities, we fit two types of Gaussian distributions: 1) with the common variance-covariance matrix, and 2) with category-specific variance-covariance matrices. The second case gives quadratic discriminant functions (QDF). The non-contextual classifiers LDF and QDF then give error rates of 10.69% and 1.66%, respectively. This implies that the classifiers f0 give a small 0−1 risk. Unfortunately, in both cases, the coefficient c0 for the log posterior probability log p(k | xi) is estimated to be negative: −0.006805 and −0.000953, respectively. See Table 3. Hence, Spatial AdaBoost stops at the third step of the procedure described in Section 3 and fails to combine the classifiers. This inconvenience is caused by the difference between the exponential and the 0−1 loss functions of misclassification: the exponential loss puts a huge penalty on misclassified outlying data.

Therefore, if we set c0 to a small positive value, the proposed procedure may still work. Table 4 lists the error rates of the classification function c0 f0 + c1 f1 when the coefficient c0 is pre-determined and c1 is tuned by minimizing the empirical exponential risk Rexp(c0 f0 + c1 f1). The LDF column corresponds to posterior probabilities based on Gaussian distributions with the common variance-covariance matrix, and the QDF column corresponds to Gaussian distributions with category-specific variance-covariance matrices. In both cases, the information in the first-order neighborhood (r = 1) improves the classification slightly.

Table 5 compares the GMRF-based and Spatial LogitBoost classifiers. In this case, Spatial LogitBoost works normally and gives performance similar to the GMRF. Spatial AdaBoost and Spatial LogitBoost employ the same classification functions f0, f1, .... However, the loss function used by LogitBoost puts less penalty on misclassified training data. Hence, Spatial LogitBoost has a robust property.
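As a rough illustration of the workaround of Table 4 (Python/NumPy, names ours), the leading coefficient c0 is fixed at a small positive value and only c1 is tuned by minimizing the empirical exponential risk of c0 f0 + c1 f1 over a grid; f0, f1 and the training labels y are assumed to be precomputed as in the sketch of Section 3.

```python
import numpy as np

def exp_risk(F, y):
    """Empirical exponential risk of Eq. (3)."""
    n = F.shape[0]
    margins = F - F[np.arange(n), y][:, None]
    return np.exp(margins).sum() / n

def tune_c1_given_c0(f0, f1, y, c0, grid=np.linspace(0.0, 3.0, 601)):
    """Fix c0 in advance and choose c1 minimizing Rexp(c0*f0 + c1*f1), cf. Table 4."""
    risks = [exp_risk(c0 * f0 + c1 * f1, y) for c1 in grid]
    return grid[int(np.argmin(risks))]

# Hypothetical usage over pre-set values of c0 of the kind examined in Table 4:
# for c0 in [1e-7, 1e-5, 1e-3, 1e-1, 1.0]:
#     c1 = tune_c1_given_c0(f0_train, f1_train, y_train, c0)
```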
Table 5. Error rates (%) by GMRF and Spatial LogitBoost for grss dfc 0009

  Neighborhood        LDF                QDF
  r²             GMRF     Logit      GMRF     Logit
  0              10.69    10.69      1.66     1.66
  1               8.58     9.48      1.97     1.31
  4               7.06     6.51      1.69     0.93
  8               3.81     5.43      0.55     0.76
  9               4.01     5.40      0.52     0.76
  10              4.08     5.29      0.52     0.83
  13              4.01     5.16      0.52     0.87
  16              4.01     5.26      0.52     0.90
5. CONCLUSIONS

Spatial AdaBoost sometimes fails to combine the classifiers because the exponential loss puts too much penalty on misclassified training data; recall Table 3. This fact motivates us to substitute a more robust loss function, e.g., the logit loss function, for the exponential loss. Although Spatial boosting is a straightforward extension of Spatial AdaBoost, the extension is necessary so that the classification method can be applied to various types of data. By taking a robust loss function, we obtain a robust classifier.

Spatial boosting is introduced in order to provide contextual image classification with a robustness property. The features of Spatial boosting are as follows:
(a) Various types of posteriors can be implemented.
(b) Various forms of neighborhoods can be implemented.
(c) Classification is performed non-iteratively, and the performance is similar to that of MRF-based classifiers.
The features (a)–(c) are common to Spatial AdaBoost, and a detailed discussion is found in the paper12.

The flexibility of Spatial boosting leaves us many problems in turn:
• Choice of the loss function (exponential, logit, or something else).
• Choice of the posterior probability (LDF, QDF, probabilistic support vector machines, or something else).
• Choice of the maximum radius r.
The most difficult problem is how to choose a loss function suitable for the current data set. If we choose a loss function that is too robust, the classification performance may become worse. Our answer at this stage is as follows: first, consider Spatial AdaBoost; if the exponential risk yields a negative optimal coefficient cr for the averaged log posteriors with a small radius r, take another, more robust loss function.
ACKNOWLEDGMENTS

The data sets grss dfc 0006 and grss dfc 0009 were provided by the IEEE GRS-S Data Fusion Committee.
REFERENCES

1. Switzer, P. (1980): Extensions of linear discriminant analysis for statistical classification of remotely sensed satellite imagery, Mathematical Geology, vol. 12(4), pp. 367-376.
2. Geman, S. and Geman, D. (1984): Stochastic relaxation, Gibbs distributions, and the Bayesian restoration of images, IEEE Transactions on Pattern Anal. Machine Intell., vol. PAMI-6, pp. 721-741.
3. Besag, J. (1986): On the statistical analysis of dirty pictures, J. R. Stat. Soc. B, vol. 48, pp. 259-302.
4. McLachlan, G. J. (2004): Discriminant Analysis and Statistical Pattern Recognition (2nd ed.), John Wiley & Sons, New York.
5. Jain, A. K., Duin, R. P. W., and Mao, J. (2000): Statistical pattern recognition: a review, IEEE Transactions on Pattern Anal. Machine Intell., vol. PAMI-22, pp. 4-37.
6. Benediktsson, J. A. and Kanellopoulos, I. (1999): Classification of multisource and hyperspectral data based on decision fusion, IEEE Transactions on Geoscience and Remote Sensing, vol. 37, pp. 1367-1377.
7. Nishii, R. (2003): A Markov random field-based approach to decision level fusion for remote sensing image classification, IEEE Transactions on Geoscience and Remote Sensing, vol. 41(10), pp. 2316-2319.
8. Melgani, F. and Bruzzone, L. (2004): Classification of hyperspectral remote sensing images with support vector machines, IEEE Transactions on Geoscience and Remote Sensing, vol. 42(8), pp. 1778-1790.
9. Freund, Y. and Schapire, R. E. (1997): A decision-theoretic generalization of on-line learning and an application to boosting, Journal of Computer and System Sciences, vol. 55(1), pp. 119-139.
10. Friedman, J., Hastie, T. and Tibshirani, R. (2000): Additive logistic regression: a statistical view of boosting (with discussion), Annals of Statistics, vol. 28, pp. 337-407.
11. Hastie, T., Tibshirani, R. and Friedman, J. (2001): The Elements of Statistical Learning: Data Mining, Inference, and Prediction, Springer, New York.
12. Nishii, R. and Eguchi, S. (2005): Supervised image classification by contextual AdaBoost based on posteriors in neighborhoods, to appear in IEEE Transactions on Geoscience and Remote Sensing.
13. Eguchi, S. and Copas, J. (2002): A class of logistic-type discriminant functions, Biometrika, vol. 89, pp. 1-22.
14. Takenouchi, T. and Eguchi, S. (2004): Robustifying AdaBoost by adding the naive error rate, Neural Computation, vol. 16, pp. 767-787.
15. IEEE GRSS Data Fusion reference database, data sets GRSS DFC 0006 and GRSS DFC 0009, online: http://www.dfcgrss.org/, 2001.