variable importance and random forest classification ... - IEEE Xplore

VARIABLE IMPORTANCE AND RANDOM FOREST CLASSIFICATION USING RADARSAT-2 POLSAR DATA

Siddharth Hariharan, Siddhesh Tirodkar, Shaunak De and Avik Bhattacharya
Centre of Studies in Resources Engineering, IIT Bombay, Powai-400076, India

ABSTRACT

In this paper we have classified Polarimetric Synthetic Aperture Radar (PolSAR) data using the Random Forest (RF) classifier. The variables were ranked for each terrain class using the mean decrease in accuracy permutation method. RADARSAT-2 (RS-2) data acquired over Mumbai, India was used in this study. This technique is able to classify the dataset efficiently, as well as rank the parameters used in the classifier.

Index Terms: Random Forest Classification, Synthetic Aperture Radar, Polarimetry

1. INTRODUCTION

Several methods have been studied to classify terrain using PolSAR data. Some of these methods are based on polarimetric target decomposition theorems, Wishart classification [1], Support Vector Machines (SVM) [2], etc. Apart from measuring the classification accuracy, it is necessary to measure the importance of each polarimetric parameter in characterizing a particular terrain. Identifying relevant variables can help in studying the underlying physical mechanism for terrain identification. For this purpose RF, an ensemble decision tree-based classifier [3], has been used in this study. Along with classification, RFs also evaluate the importance of the parameters used for terrain identification. Another advantage of RFs is that the parameters need not be scaled, and they can be numerical as well as categorical.

RFs have been used for land cover classification in [4]. RF classification using multiple polarimetric features was studied in [5]. In this study we used RF to determine the contribution of individual parameters to terrain classification accuracy. The study was performed on an RS-2 C-band image of Mumbai acquired on 16th Feb. 2011.
Five terrain classes were identified: Forest, Mangrove, Urban, Water and Wetland. The polarimetric parameters used in this study are listed in Table 1.

978-1-4799-5775-0/14/$31.00 ©2014 IEEE

2. METHODOLOGY

In this paper we apply the RF method for individual class and overall variable ranking, followed by classification. RF fits many classification trees to a data set, and then combines the predictions from all trees. The algorithm [3] begins with the selection of many (e.g., 1000) bootstrap samples from the data. In a typical bootstrap sample, approximately 63% of the original observations occur at least once. Observations in the original data set that do not occur in a bootstrap sample are called out-of-bag observations. A classification tree is fit to each bootstrap sample, but at each node only a small number of randomly selected variables (e.g., the square root of the number of variables) are available for binary partitioning. The trees are fully grown and each is used to predict its out-of-bag observations. The predicted class of an observation is calculated by majority vote of the out-of-bag predictions for that observation. Accuracy and error rates are computed for each observation using the out-of-bag predictions, and then averaged over all observations. Because the out-of-bag observations are not used in fitting the trees, they essentially provide cross-validated estimates of the final classification accuracy.

Among the various predictor variables used to split a node in a decision tree, certain variables are more successful in classifying homogeneous areas. It is essential to identify the important variables among the many predictor variables that describe a data set. An advanced variable importance measure available in RFs is the Permutation Accuracy Importance (PAI) measure [9], which has been utilized in this study. Randomly permuting the predictor variable Xj breaks its original association with the response Y. When the permuted variable Xj, together with the remaining unpermuted predictor variables, is used to predict the response, the prediction accuracy (i.e., the number of observations classified correctly) decreases substantially if the original variable Xj was associated with the response.
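The "approximately 63%" figure quoted for bootstrap samples follows from the fact that an observation is missed by all n draws with probability (1 - 1/n)^n, which tends to 1/e for large n. The following minimal Python sketch, which is an illustration and not part of the original study, compares this expectation with one simulated bootstrap sample. The function names, the sample size and the random seed are assumptions chosen for the example.

```python
import math
import random


def expected_unique_fraction(n):
    """Expected fraction of distinct observations in a bootstrap sample of size n."""
    return 1.0 - (1.0 - 1.0 / n) ** n


def bootstrap_split(n, rng):
    """Draw n indices with replacement; return the in-bag and out-of-bag index sets."""
    in_bag = {rng.randrange(n) for _ in range(n)}
    out_of_bag = set(range(n)) - in_bag
    return in_bag, out_of_bag


rng = random.Random(0)   # fixed seed (assumption) so the run is repeatable
n = 10_000               # hypothetical number of training observations
in_bag, out_of_bag = bootstrap_split(n, rng)

print("simulated in-bag fraction:", len(in_bag) / n)
print("expected in-bag fraction: ", expected_unique_fraction(n))
print("large-n limit 1 - 1/e:    ", 1.0 - math.exp(-1.0))
```

The remaining observations, roughly 37% of the data for each tree, are the out-of-bag observations used for the accuracy and importance estimates.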
The greater the contribution of Xj in attaining the final response Y, the greater the variable importance of Xj, and the greater the decrease in prediction accuracy of Y when Xj is permuted. Thus a reasonable measure of variable importance is the difference in prediction accuracy before and after permuting Xj, averaged over all trees. For each tree in the forest there is a misclassification rate for the out-of-bag observations. To assess the importance of a specific predictor variable, the values of that variable are randomly permuted for the out-of-bag observations. The modified out-of-bag observations are passed down the tree to get new predictions. The difference between the misclassification rates for the modified and original out-of-bag data, divided by the standard error, is a measure of the importance of the variable.

IGARSS 2014

Table 1: SAR polarimetric parameters used

| Decomposition parameters | Description |
|---|---|
| Touzi [6] | Touzi symmetric scattering type magnitude (αs1, αs2, αs3) |
| | Touzi symmetric scattering type phase (φs1, φs2, φs3) |
| | Kennaugh-Huynen target helicity (τ1, τ2, τ3) |
| | Eigenvalues corresponding to the eigenvector of the scatterer (λ1, λ2, λ3) |
| | Kennaugh-Huynen maximum amplitude parameter (m1, m2, m3) |
| Yamaguchi 4-component [7] | Odd-bounce scattering power (Ps), double-bounce scattering power (Pd), volume scattering power (Pv), helix scattering power (Pc) |
| Cloude-Pottier [8] | Entropy (H), anisotropy (A), average target scattering mechanism (α) |

| Other parameters | Description |
|---|---|
| Intensity parameters | Dual polarimetric real intensity elements of raw binary data (I11, I12, I22) |
| Phase parameters | HH-VV phase difference (φHH−VV) |
| Correlation coefficient | Correlation coefficient between HH and VV polarized waves (ρ12) |
| Circular correlation coefficient | Correlation coefficient in circular polarization basis (CCC) |
| Span | Total scattering power |

Fig. 1: (a) Pauli RGB of RS-2, Mumbai. (b) Classified output using RF (classified subset shown with a red border).

The training areas for the individual terrain classes were identified. An RF ensemble of 1000 decision trees was built. At each node, 5 randomly selected parameters were considered for node splitting. The Gini node impurity was calculated to determine the best split at each node. The individual class accuracy for each terrain type and the overall classification accuracy were evaluated. The contribution of each parameter to the terrain classification accuracy was evaluated for all five classes.
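The out-of-bag permutation procedure described in this section can be sketched in a few lines of Python. This is a simplified illustration under stated assumptions, not the implementation used in the study: each "tree" is a one-split decision stump rather than a fully grown tree, the split is chosen by training error rather than Gini impurity, the data are synthetic, and the names `fit_stump` and `oob_permutation_importance` are hypothetical. The bootstrap sampling, the random variable subset, the out-of-bag permutation and the scaling by the standard error follow the description above.

```python
import random
import statistics


def fit_stump(X, y, features):
    """Fit a one-split tree: pick the feature, threshold and orientation with fewest errors."""
    best = None
    for j in features:
        for t in sorted({row[j] for row in X}):
            for left in (0, 1):
                errors = sum((left if row[j] <= t else 1 - left) != label
                             for row, label in zip(X, y))
                if best is None or errors < best[0]:
                    best = (errors, j, t, left)
    _, j, t, left = best
    return lambda row: left if row[j] <= t else 1 - left


def accuracy(predict, X, y):
    """Fraction of rows whose predicted label matches the true label."""
    return sum(predict(row) == label for row, label in zip(X, y)) / len(y)


def oob_permutation_importance(X, y, n_trees=100, mtry=2, seed=0):
    """Per variable: (mean decrease in OOB accuracy, mean divided by its standard error)."""
    rng = random.Random(seed)
    n, p = len(X), len(X[0])
    decreases = [[] for _ in range(p)]
    for _ in range(n_trees):
        bag = [rng.randrange(n) for _ in range(n)]       # bootstrap sample
        oob = sorted(set(range(n)) - set(bag))           # out-of-bag rows
        if not oob:
            continue
        tree = fit_stump([X[i] for i in bag], [y[i] for i in bag],
                         rng.sample(range(p), mtry))     # random variable subset
        X_oob, y_oob = [X[i] for i in oob], [y[i] for i in oob]
        base = accuracy(tree, X_oob, y_oob)
        for j in range(p):
            column = [row[j] for row in X_oob]
            rng.shuffle(column)                          # break the link between Xj and Y
            X_perm = [row[:j] + [v] + row[j + 1:] for row, v in zip(X_oob, column)]
            decreases[j].append(base - accuracy(tree, X_perm, y_oob))
    result = []
    for d in decreases:
        mean = statistics.mean(d)
        se = statistics.stdev(d) / len(d) ** 0.5
        result.append((mean, mean / se if se > 0 else 0.0))
    return result


# Synthetic data (an assumption): only variable 0 determines the class.
data_rng = random.Random(1)
X = [[data_rng.random() for _ in range(3)] for _ in range(60)]
y = [int(row[0] > 0.5) for row in X]

importance = oob_permutation_importance(X, y)
for j, (mean, scaled) in enumerate(importance):
    print(f"X{j}: mean decrease = {mean:.3f}, scaled = {scaled:.1f}")
```

In this toy setting the informative variable should show a clearly positive mean decrease in accuracy, while the two noise variables should stay near zero. Ranking the variables by either score gives an ordering of the kind reported in Table 2.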


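Table 2: Variable ranking for individual terrain classes (top 10 are shown)

| Rank | Forest | Mangrove | Urban | Water | Wetland | All Classes |
|---|---|---|---|---|---|---|
| 1 | Pv | Ps | Pd | Ps | Ps | Ps |
| 2 | CCC | Span | CCC | Span | Pd | Span |
| 3 | Pd | λ3 | λ3 | I11 | CCC | CCC |
| 4 | Span | H | αs1 | Pd | Pv | Pv |
| 5 | I11 | Pd | ρ12 | Pv | φs1 | α |
| 6 | I12 | CCC | I11 | I12 | A | I11 |
| 7 | Pc | λ1 | Ps | CCC | Span | A |
| 8 | λ3 | α | A | ρ12 | φs2 | Pd |
| 9 | A | ρ12 | Span | A | α | I22 |
| 10 | Ps | A | α | I22 | I22 | λ2 |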
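One simple way to quantify how much the per-class rankings in Table 2 overlap is to compare the top-10 sets directly. The sketch below is an illustration, not an analysis from the original study. It transcribes the five per-class columns of Table 2, with ASCII labels such as `lambda3` and `alpha_s1` standing in for the Greek symbols, and the helper names `shared_by_all` and `jaccard` are hypothetical.

```python
from itertools import combinations

# Top-10 rankings transcribed from Table 2 (ASCII labels replace the Greek symbols).
TOP10 = {
    "Forest":   ["Pv", "CCC", "Pd", "Span", "I11", "I12", "Pc", "lambda3", "A", "Ps"],
    "Mangrove": ["Ps", "Span", "lambda3", "H", "Pd", "CCC", "lambda1", "alpha", "rho12", "A"],
    "Urban":    ["Pd", "CCC", "lambda3", "alpha_s1", "rho12", "I11", "Ps", "A", "Span", "alpha"],
    "Water":    ["Ps", "Span", "I11", "Pd", "Pv", "I12", "CCC", "rho12", "A", "I22"],
    "Wetland":  ["Ps", "Pd", "CCC", "Pv", "phi_s1", "A", "Span", "phi_s2", "alpha", "I22"],
}


def shared_by_all(rankings):
    """Variables that appear in the top-10 list of every class."""
    return set.intersection(*(set(names) for names in rankings.values()))


def jaccard(a, b):
    """Size of the intersection divided by size of the union of two lists."""
    a, b = set(a), set(b)
    return len(a & b) / len(a | b)


common = shared_by_all(TOP10)
print("In every top 10:", sorted(common))
for first, second in combinations(TOP10, 2):
    print(f"{first:8s} vs {second:8s}: Jaccard = {jaccard(TOP10[first], TOP10[second]):.2f}")
```

Five variables (Ps, Pd, CCC, Span and A) appear in the top 10 of all five classes, although their rank within each class differs.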

Table 3: Confusion matrix with overall classification accuracy (in %)

| Class | Forest | Mangrove | Urban | Water | Wetland |
|---|---|---|---|---|---|
| Forest | 87.70 | 12.01 | 0.00 | 0.10 | 0.20 |
| Mangrove | 10.27 | 80.71 | 0.36 | 2.98 | 5.67 |
| Urban | 7.42 | 3.28 | 87.41 | 0.60 | 1.28 |
| Water | 0.00 | 0.00 | 0.21 | 99.79 | 0.00 |
| Wetland | 0.40 | 15.95 | 0.00 | 1.07 | 82.57 |

Overall Accuracy = 89.27%
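For reference, overall accuracy and the kappa coefficient are normally computed from a confusion matrix of sample counts. Each row of Table 3 sums to about 100%, so the table holds per-class percentages, and the per-class sample counts are not reported. The published overall accuracy and kappa therefore cannot be recomputed from the table alone. The sketch below only illustrates the calculation, assuming the kappa reported is Cohen's kappa. The function name and the two-class count matrix are hypothetical.

```python
def overall_accuracy_and_kappa(counts):
    """Return (overall accuracy, Cohen's kappa) for a square confusion matrix of counts."""
    total = sum(sum(row) for row in counts)
    observed = sum(counts[i][i] for i in range(len(counts))) / total
    row_totals = [sum(row) for row in counts]
    col_totals = [sum(col) for col in zip(*counts)]
    # Agreement expected by chance, from the row and column marginals.
    expected = sum(r * c for r, c in zip(row_totals, col_totals)) / total ** 2
    return observed, (observed - expected) / (1.0 - expected)


# Hypothetical two-class counts, not taken from the paper.
example_counts = [
    [45, 5],
    [10, 40],
]
oa, kappa = overall_accuracy_and_kappa(example_counts)
print(f"overall accuracy = {oa:.2f}, kappa = {kappa:.2f}")
```

Because overall accuracy weights each class by its number of samples, it need not equal the simple mean of the diagonal percentages in Table 3.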

3. RESULTS AND DISCUSSIONS

3.1. Variable Importance

The contribution of individual parameters to the RF classification is tabulated in Table 2. The importance was computed using the permutation accuracy importance method. It is observed that variables that are important in one class also rank highly in the other classes. For instance, Ps is the most important parameter in the identification of the mangrove class, as well as the water and wetland classes. This can be understood from the fact that all three classes resemble each other: both mangroves and wetland consist of a surface of water on terrain. Similar patterns can be seen for other classes. The classification is done using the important variables obtained from the mean decrease in accuracy estimated over all classes.

3.2. Classification using RF

The RS-2 image of Mumbai was classified with the RF technique using the variables ranked by the PAI method. The Pauli RGB is shown in Fig. 1a and the classified output is shown in Fig. 1b. As seen, the classes are well separated; in particular, the wetland area and the salt-pan (classified as wetland in this case) are well segregated from the mangroves. Some mixing has been observed between forests and mangroves. Quantitative analysis, along with the overall accuracy and the kappa coefficient (κ = 0.85), is presented in Table 3.

4. CONCLUSION

In this study we have demonstrated the capability of RFs in terrain classification and in identifying the important variables, overall and per terrain class. It is seen that for classes of a similar nature, the same parameters have been found to be important. For example, the Yamaguchi decomposition components (Ps, Pd, Pv) occur repeatedly in the top ranks of each class. Classification obtained from the RF method has an overall accuracy of 89.27%. Visually as well, the classes appear well discriminated. The performance of RF classification on datasets of different frequencies can be investigated in the future. A comparative study with other existing classification techniques will also be beneficial.

5. REFERENCES

[1] Jong-Sen Lee, Mitchell R Grunes, Eric Pottier, and Laurent Ferro-Famil, “Unsupervised terrain classification preserving polarimetric scattering characteristics,” Geoscience and Remote Sensing, IEEE Transactions on, vol. 42, no. 4, pp. 722–731, 2004.
[2] Lamei Zhang, Bin Zou, Junping Zhang, and Ye Zhang, “Classification of polarimetric SAR image based on support vector machine using multiple-component scattering model and texture features,” EURASIP Journal on Advances in Signal Processing, vol. 2010, pp. 1, 2010.
[3] Leo Breiman, “Random forests,” Machine Learning, vol. 45, no. 1, pp. 5–32, 2001.
[4] Pall Oskar Gislason, Jon Atli Benediktsson, and Johannes R Sveinsson, “Random forests for land cover classification,” Pattern Recognition Letters, vol. 27, no. 4, pp. 294–300, 2006.
[5] Tongyuan Zou, Wen Yang, Dengxin Dai, and Hong Sun, “Polarimetric SAR image classification using multifeatures combination and extremely randomized clustering forests,” EURASIP Journal on Advances in Signal Processing, vol. 2010, pp. 4, 2010.
[6] Ridha Touzi, “Target scattering decomposition in terms of roll-invariant target parameters,” Geoscience and Remote Sensing, IEEE Transactions on, vol. 45, no. 1, pp. 73–84, 2007.
[7] Yoshio Yamaguchi, Akinobu Sato, W-M Boerner, Ryoichi Sato, and Hiroyoshi Yamada, “Four-component scattering power decomposition with rotation of coherency matrix,” Geoscience and Remote Sensing, IEEE Transactions on, vol. 49, no. 6, pp. 2251–2258, 2011.
[8] Shane R Cloude and Eric Pottier, “An entropy based classification scheme for land applications of polarimetric SAR,” Geoscience and Remote Sensing, IEEE Transactions on, vol. 35, no. 1, pp. 68–78, 1997.
[9] Carolin Strobl and Achim Zeileis, “Danger: High power! – exploring the statistical properties of a test for random forest variable importance,” Technical Report No. 17, University of Munich, 2008.
