J Indian Soc Remote Sens (September 2013) 41(3):523–530 DOI 10.1007/s12524-013-0265-4
RESEARCH ARTICLE
Satellite Data Classification Using Open Source Support
S. Biswal · A. Ghosh · R. Sharma · P. K. Joshi
Received: 13 June 2012 / Accepted: 12 February 2013 / Published online: 30 April 2013 © Indian Society of Remote Sensing 2013
Abstract In this study we explored the potential of open source data mining software to classify freely available Landsat imagery. The study identified several major classes that can be distinguished using Landsat data of 30 m spatial resolution. Decision tree classification (DTC) using the Waikato Environment for Knowledge Analysis (WEKA), an open source software package, was used to prepare a land use land cover (LULC) map, and the result was compared with supervised (maximum likelihood classifier, MLC) and unsupervised (Iterative Self-Organizing Data Analysis Technique, ISODATA, clustering) classification techniques. The accuracy assessment indicates the highest accuracy for the map prepared using DTC, with overall accuracy (OA) of 92 % (kappa = 0.90), followed by MLC with OA 88 % (kappa = 0.84) and ISODATA with OA 76 % (kappa = 0.69). The results indicate that a data set with well-defined training sites can produce an LULC map with good overall accuracy using a decision tree. The paper demonstrates the utility of open source systems for information extraction and the importance of the DTC algorithm.
Keywords Decision tree · LULC · Open source software · Satellite data
S. Biswal · A. Ghosh · R. Sharma · P. K. Joshi (*) Department of Natural Resources, TERI University, New Delhi, India 110070 e-mail: [email protected]
S. Biswal Risk Management Solutions India Pvt. Ltd., Noida, India 201301
Introduction Remote sensing technology is a proven means to better document, characterize and quantify land use land cover (LULC) (Wentz et al. 2008). This information is a vital input for various development, environment and resource planning applications, and for regional as well as global scale process models. Globally, these remote sensing inputs have been recognized from the inception of the Kyoto Protocol (Rosenqvist et al. 2003) to the latest debates on reducing emissions from deforestation and forest degradation (REDD) (Miles and Kapos 2008). Such databases are also important for national accounting of natural resources and for planning at regular intervals. The challenge in classifying LULC in any dynamic landscape using multi-spectral remote sensing data is heterogeneity among the classes (Stefanov et al. 2001). Mixed pixels are a common confounding factor in classification using moderate resolution datasets (Small 2005; Woodcock and Strahler 1987). To resolve these issues, investigators have utilized various approaches such as artificial neural networks (Pu et al. 2008), fuzzy classifiers (Feitosa et al. 2009), image segmentation (Gamanya et al. 2007), expert classification (Wentz et al. 2008), support vector machines (Carrão et al. 2008) and many others. The new generation of satellite datasets demands large computing resources as well as robust classification procedures. Classification algorithms are also data dependent, and extraction of information from such data poses various challenges since these are closely associated with human intervention. The classifier should nevertheless be able to handle spectral variability and incorporate ground truth information (Kandrika and Roy 2008). The algorithms available in proprietary packages and software come with a set of a-priori assumptions. There is a need to develop approaches that can be trained quickly and can handle huge data sets from numeric and non-numeric sources. Developing such approaches is convenient using open source data mining software systems. These are flexible, intuitively simple and computationally efficient, which is leading to increased acceptance (Pal and Mather 2003; Quinlan 1993). This study aims to assess the use of open source remote sensing datasets and data mining software support for LULC classification. For evaluation, the proposed procedure is also compared with LULC maps produced using traditional hard classifiers implemented through proprietary software packages.
Study Area The study area covers a portion of Thane district of Maharashtra. It lies between the latitudinal parallels of 19° 16′ N and 19° 17′ N and the longitudinal parallels of 73° 01′ E and 73° 04′ E (Fig. 1, Mumbai suburban region) and occupies part of the south-western region of India (216 m above sea level). With an area of 17.5 sq km, it corresponds to a typical patch of the tropical region, dominated by residential areas, agriculture and water bodies. Its South–North length is approximately 1.9 km and its East–West distance is 8.8 km. Figure 1 shows the location of the study area.
Fig. 1 Location of study area in Thane district of Maharashtra
Materials and Methods
Remote Sensing Data Landsat 5 Thematic Mapper (TM) data from 2009, having 7 spectral bands, were used for this study. The data are freely available from the United States Geological Survey (USGS) archive. The thermal band was dropped and the remaining six bands were used for image classification. A False Color Composite (FCC) of the study area is shown in Fig. 4.
Software Used Waikato Environment for Knowledge Analysis (WEKA) is open source software issued under the General Public License, comprising an object oriented, Java based suite of machine learning algorithms for data mining tasks. WEKA contains tools for data pre-processing, classification, regression, clustering, association rules, and visualization (Weka 3). In this study, the data pre-processing and classification tools have been used extensively (http://www.cs.waikato.ac.nz). WEKA has more than 100 classification methods, broadly categorized as Bayesian (Naïve Bayes, Bayesian nets, etc.), lazy methods (nearest neighbor and variants), rule-based methods (decision tables, OneR, RIPPER), tree learners (C4.5, Naive Bayes trees, M5), function-based learners (linear regression, support vector machine (SVM), Gaussian processes), and miscellaneous methods. For the current study the J48 algorithm, which implements the C4.5 decision tree classifier, was employed. Table 1 gives a list of parameters for J48 and their description.
Table 1 List of parameters for J48 and their description
Parameter name: Description
Binary splits: Whether to use binary splits at each node to build the decision tree; we have used binary splits
Confidence factor: The confidence factor used for pruning; a smaller confidence factor implies more pruning; the default value is 0.25; we have used 0.05
Number of objects: Minimum number of instances (samples) per leaf; if fewer samples than the assigned value are present in a leaf, the leaf will not be considered as a class; we have set this value to 5
Number of folds: Determines the amount of data used for reduced error pruning; one fold is used for pruning and the rest are used for growing the tree; we have set this value to 6, so 1 fold is used for pruning and 5 folds are used for growing the tree
Size of tree: Number of nodes
Number of leaves: Occurrence of pure classes; this is the same as the number of rules
WEKA has several advantages over other software. First and foremost is its simple GUI. It gives the user the liberty to use a pruned or an unpruned tree. It also allows the user to suppress subtree raising, which makes the algorithm more efficient. The confidence threshold for pruning can be set, and reduced-error pruning can be performed, to optimize performance. One limitation of WEKA is its non-compatibility with multi-relational data sets, but it provides compatibility with SQL databases using Java Database Connectivity.
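WEKA reads training data from its ARFF text format, so pixel samples need to be exported in that form before J48 can be run on them. The following is a minimal sketch of such an export, not the procedure used in the study; the helper name `to_arff`, the relation name, the band attribute names and the sample values are all illustrative assumptions.

```python
def to_arff(relation, band_names, class_names, samples):
    """Return ARFF text for a list of (band_values, class_label) samples."""
    lines = [f"@relation {relation}", ""]
    # One numeric attribute per spectral band.
    for name in band_names:
        lines.append(f"@attribute {name} numeric")
    # Nominal class attribute; labels are quoted because they contain
    # spaces and slashes.
    quoted = ",".join(f"'{c}'" for c in class_names)
    lines.append(f"@attribute class {{{quoted}}}")
    lines += ["", "@data"]
    for values, label in samples:
        if len(values) != len(band_names):
            raise ValueError("one value per band is required")
        if label not in class_names:
            raise ValueError(f"unknown class: {label}")
        row = ",".join(str(v) for v in values)
        lines.append(f"{row},'{label}'")
    return "\n".join(lines) + "\n"


# Six reflective Landsat 5 TM bands (thermal band dropped).
BANDS = ["B1", "B2", "B3", "B4", "B5", "B7"]
CLASSES = ["Water body/river", "Sea/ocean", "Settlements"]
# Hypothetical digital numbers, for illustration only.
SAMPLES = [
    ([62, 25, 20, 11, 7, 4], "Sea/ocean"),
    ([70, 31, 33, 18, 12, 8], "Water body/river"),
    ([95, 44, 52, 48, 70, 45], "Settlements"),
]

arff_text = to_arff("lulc_training", BANDS, CLASSES, SAMPLES)
print(arff_text)
```

The resulting text can be saved with an `.arff` extension and opened in the WEKA GUI, where the Table 1 parameters are then set on the J48 classifier.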
Image Classification In general, the major classification techniques can be divided into two groups: unsupervised and supervised methods. In this study we additionally used Decision Tree Classification (DTC) to evaluate the importance of an open source classification algorithm. Class separability was studied using the Transformed Divergence (TD) test. The test works by assigning exponentially decreasing weight to increasing distance between the classes. The results range from 0 to 2,000. The classes exhibit good separability if the TD value is greater than 1,900; the separation is fair if the value is between 1,700 and 1,900, and poor for values below 1,700. Additionally, the feature space plot was analyzed to evaluate the training sets (Fig. 2).
Fig. 2 Feature space plot for different classes
ISODATA clustering Unsupervised classification was done using the Iterative Self-Organizing Data Analysis (ISODATA) technique, which is a modified form of the K-means clustering algorithm. This method includes (i) merging clusters when the distance calculated in multispectral feature space is below a user-specified threshold and (ii) rules for splitting a single cluster into two new clusters (Jensen 2005). A total of 50 clusters were used to perform the unsupervised classification, with 6 iterations and a 0.95 convergence threshold.
MLC classifier Supervised classification was carried out using the maximum likelihood classification (MLC) approach. For supervised classification, selection of samples for training and testing the classifier is essential. The accuracy of the samples is directly related to the results of a comparison study based on different methods and parameters. In this paper, stratified random sampling was used to acquire training samples for model building and test samples for evaluation of classification accuracy. MLC calculates the probability of a pixel belonging to each of a predefined set of classes and assigns the pixel to the class with the highest probability (Jensen 2005). More details of the ISODATA and MLC classifiers can be found in the user manual of the proprietary packages (ERDAS Field Guide TM 2010).
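The maximum likelihood rule described above can be illustrated with a short sketch. It is not the ERDAS implementation: to stay dependency-free it assumes equal class priors and treats the bands as independent (a diagonal covariance matrix), whereas a full MLC uses the complete class covariance matrix. The class names and pixel values are hypothetical.

```python
import math


def fit_class_stats(training):
    """training: {class_name: [pixel vectors]} -> {class_name: (means, variances)}."""
    stats = {}
    for name, pixels in training.items():
        n = len(pixels)
        bands = list(zip(*pixels))  # regroup values band by band
        means = [sum(b) / n for b in bands]
        # Sample variance per band; a small floor avoids division by zero.
        variances = [
            max(sum((x - m) ** 2 for x in b) / (n - 1), 1e-6)
            for b, m in zip(bands, means)
        ]
        stats[name] = (means, variances)
    return stats


def log_likelihood(pixel, means, variances):
    """Gaussian log-likelihood of a pixel, assuming independent bands."""
    return sum(
        -0.5 * math.log(2 * math.pi * v) - (x - m) ** 2 / (2 * v)
        for x, m, v in zip(pixel, means, variances)
    )


def classify(pixel, stats):
    """Assign the class whose Gaussian model gives the highest likelihood."""
    return max(stats, key=lambda name: log_likelihood(pixel, *stats[name]))


# Hypothetical two-band training pixels, for illustration only.
training = {
    "water": [(10, 5), (12, 6), (11, 4)],
    "vegetation": [(30, 80), (32, 85), (28, 78)],
}
stats = fit_class_stats(training)
print(classify((11, 5), stats))
print(classify((31, 82), stats))
```

Working with log-likelihoods rather than raw probabilities avoids numerical underflow when many bands are multiplied together.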
A total of 1937 pixels were used as training samples for different classes. 31 % of these pixels were represented by ‘Open vegetation’ class (593 pixels), 20 % by ‘Dense settlement’ (389 pixels), and 14 % (264 pixels) and 13 % (249 pixels) represented training samples for ‘Sea’ and ‘Open scrub’. ‘Dense vegetation’ had 218 sample pixels, ‘Water’ class had 133 pixels and ‘Open area’ had just 5 % (91) of training pixels.
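The Transformed Divergence check applied to training sets such as these can be sketched as follows. The scaling TD = 2000 (1 − exp(−D/8)) is the standard form of the test; the divergence D is computed here for a single band only (univariate Gaussians) to keep the example short, whereas the study used all six bands. The class statistics are hypothetical.

```python
import math


def divergence_1d(mean_i, var_i, mean_j, var_j):
    """Divergence between two univariate Gaussian class signatures."""
    spread_term = 0.5 * (var_i - var_j) * (1.0 / var_j - 1.0 / var_i)
    mean_term = 0.5 * (1.0 / var_i + 1.0 / var_j) * (mean_i - mean_j) ** 2
    return spread_term + mean_term


def transformed_divergence(divergence):
    """Scale divergence to the 0..2000 range with a saturating exponential."""
    return 2000.0 * (1.0 - math.exp(-divergence / 8.0))


def rate_separability(td):
    """Apply the thresholds used in the text."""
    if td > 1900:
        return "good"
    if td >= 1700:
        return "fair"
    return "poor"


# Hypothetical single-band class statistics (mean, variance).
td_far = transformed_divergence(divergence_1d(10, 4, 30, 4))
td_near = transformed_divergence(divergence_1d(10, 4, 12, 4))
print(round(td_far, 2), rate_separability(td_far))
print(round(td_near, 2), rate_separability(td_near))
```

The exponential term is what gives well-separated class pairs a value that saturates near 2,000, so additional distance between already distinct classes adds little to the score.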
Decision tree classifier DTC techniques have been used successfully for a wide range of classification problems, but have only recently been tested in detail by the remote sensing community (Otukei and Blaschke 2010; Punia et al. 2011). The algorithm uses a "divide-and-conquer" approach to grow the tree. The selection of explanatory variables for splitting a node, and the composition of classes in the leaves, is determined by the information gain ratio. The gain ratio represents the proportion of information generated by the split that is useful for classification. The objective is to maximize this ratio subject to the constraint of a large information gain. We used an open source Java implementation of the C4.5 algorithm in the WEKA data mining tool, known as J48. The decision tree (DT) has binary splits with a 0.05 confidence factor, and the minimum number of objects is 5. The number of folds used for decision tree classification is 6. The DT is shown in Fig. 3. The entropy method is used as the splitting index. The classes mapped are open vegetation, dense vegetation, open scrub, settlements, sea/ocean, water body/rivers and blank/open area.
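The gain ratio used to choose a split can be made concrete with a small sketch. It evaluates one binary split of a numeric attribute at a given threshold, as C4.5 does for continuous band values; the additional C4.5 constraint on a sufficiently large information gain is omitted for brevity. The band values, labels and thresholds are hypothetical.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())


def gain_ratio(values, labels, threshold):
    """Gain ratio of the binary split `value <= threshold`."""
    left = [y for x, y in zip(values, labels) if x <= threshold]
    right = [y for x, y in zip(values, labels) if x > threshold]
    if not left or not right:
        return 0.0  # the split separates nothing
    n = len(labels)
    # Weighted entropy remaining after the split.
    remainder = (len(left) / n) * entropy(left) + (len(right) / n) * entropy(right)
    info_gain = entropy(labels) - remainder
    # Split information: entropy of the partition sizes themselves.
    split_info = entropy(["L"] * len(left) + ["R"] * len(right))
    return info_gain / split_info


# Hypothetical mid-IR (B5) values for six training pixels.
b5 = [5, 7, 8, 60, 65, 70]
labels = ["water", "water", "water", "vegetation", "vegetation", "vegetation"]

print(gain_ratio(b5, labels, threshold=8))
print(gain_ratio(b5, labels, threshold=5))
```

A tree learner would evaluate candidate thresholds for every band in this way and split the node on the best one, then repeat the procedure on each resulting subset.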
Fig. 3 Decision tree for classifying satellite data into 7 classes
Accuracy Assessment The paper emphasizes the use of open source software, and Google Earth is the best available open source of high resolution imagery on the internet. The use of high resolution data through Google Earth is free, and the datasets come from more than two sensors (e.g. Quickbird, Geoeye). The database is also updated regularly and has a huge repository of historical images. Reference data for 50 random pixels were collected from Google Earth high resolution data and compared with the classified maps to assess commission, omission and overall accuracy (Stehman 1996).
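The accuracy measures reported in this paper (overall, user's and producer's accuracy, and kappa) all derive from a confusion matrix of map classes against reference classes. The sketch below shows the arithmetic; the 3-class matrix is hypothetical and is not the study's data, and the function name is an assumption for illustration.

```python
def accuracy_metrics(matrix):
    """matrix[i][j]: pixels mapped as class i whose reference class is j."""
    k = len(matrix)
    total = sum(sum(row) for row in matrix)
    row_totals = [sum(row) for row in matrix]
    col_totals = [sum(matrix[i][j] for i in range(k)) for j in range(k)]
    diagonal = [matrix[i][i] for i in range(k)]

    overall = sum(diagonal) / total
    # User's accuracy = 1 - commission error (per map class).
    users = [d / r if r else 0.0 for d, r in zip(diagonal, row_totals)]
    # Producer's accuracy = 1 - omission error (per reference class).
    producers = [d / c if c else 0.0 for d, c in zip(diagonal, col_totals)]
    # Agreement expected by chance, from the marginal totals.
    expected = sum(r * c for r, c in zip(row_totals, col_totals)) / total ** 2
    kappa = (overall - expected) / (1 - expected)
    return overall, users, producers, kappa


# Hypothetical confusion matrix for 50 reference pixels and 3 classes.
confusion = [
    [15, 2, 1],
    [1, 12, 2],
    [0, 1, 16],
]
overall, users, producers, kappa = accuracy_metrics(confusion)
print(f"OA = {overall:.2%}, kappa = {kappa:.4f}")
```

Kappa is lower than overall accuracy because it discounts the agreement that would be expected by chance given the class totals.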
Results and Discussion A separability test considering all the bands was conducted before classification to study the disjunction of the land use classes to be mapped. Table 2 shows the TD test results among different class pairs. According to the transformed divergence test, values greater than 1,900 represent good separability among classes. The separability is fair when the values lie between 1,700 and 1,900. For values below 1,700, the separation is considered poor. Based on the results of the separability test, it could be concluded that all classes were separable, since all the values were above 1,900. Feature space plot analysis was further carried out to identify the extent of overlap between the classes. The feature space plot (Fig. 2) reveals that most of the classes could be separated from each other, except a few that exhibited some overlap. These classes were Dense vegetation, Open vegetation, Open scrub, Open area and Dense settlement. Classified maps produced using the three classification methods are presented in Fig. 4. Qualitative assessment indicates that the distribution of water appears uniform across the maps prepared using the three algorithms. In the MLC and ISODATA classified maps, some of the water channels are misclassified as settlements. There is also minor mixing between the dense and open vegetation cover. Intermixing has also been observed between open scrub and other classes. The intermixing is attributed to similar reflectance values and heterogeneity of the landscape. The interspersion between settlements and vegetation is mapped well using DTC, which is able to identify the variations and differences among the features. Accuracy assessment of the maps prepared using the different classification algorithms is shown in Table 3. The results show the highest accuracy for the map prepared using DTC (overall accuracy 92 %, kappa 0.90), followed by MLC (overall accuracy 88 %, kappa 0.84) and ISODATA clustering (overall accuracy 76 %, kappa 0.69). Blank/open area shows lower user's accuracy in ISODATA clustering, and the accuracy is improved in the other classifications. User's accuracy of sea/ocean has increased in DTC, unlike in the other two. User's accuracy of open scrub and open vegetation has decreased in DTC, whereas it has improved for settlement and dense vegetation. The DT is shown in Fig. 3. The time taken to train the model is 0.16 seconds, which is much less than that of ISODATA clustering and MLC. The time taken to apply the rules to the dataset is comparable to that of the other two classification algorithms. The correctly classified instances are 98.24 % in the pruned tree. The DT classifies water body/river, settlement, blank/open area, and sea/ocean correctly. B5 (mid-IR) is used for the initial binary splitting of the image, which is subsequently supported by B3 (red) and B1 (blue). In the first node, B3 (red), B7 (SWIR) and B4 (NIR) are used to segregate water body/river, sea/ocean, settlement and dense vegetation. The second node uses B1 (blue), B4 (NIR), and B5 (mid-IR) to segregate open vegetation, blank/open area and settlement. B2 (green) was not found significant for differentiating among the classes. The DTC technique has been used successfully for digital image classification (Sesnie et al. 2008; Tooke et al. 2009). This is because it can handle huge datasets from numeric and non-numeric sources and can be trained quickly. A variety of works have demonstrated that decision trees provide an accurate and efficient methodology for land cover classification (DeFries et al. 1998; Friedl et al. 1999; Hansen et al. 2000; Punia et al. 2011). The main advantages of the DTC approach are its intuitive classification structure, its ability to handle noisy and missing data, and the fact that it works with no assumptions regarding the distribution of the input data.
Table 2 Transformed Divergence test results among different class pairs
Classes (row and column headings): Water Body/River, Blank/Open area, Sea/Ocean, Settlements, Open scrub, Dense vegetation, Open vegetation
TD values for the 21 class pairs: 2000, 2000, 2000, 2000, 2000, 2000, 1989, 2000, 2000, 2000, 2000, 2000, 2000, 2000, 1999, 1987, 1998, 2000, 2000, 2000, 2000
Fig. 4 Classified maps produced by different classification methods, clockwise: a) FCC, b) Decision tree classifier, c) MLC and d) ISODATA clustering
Table 3 Accuracy assessment of classified maps
Class names         ISODATA clustering    MLC classifier      Decision tree
                    UA       PA           UA       PA         UA       PA
Water Body/River    100      100          100      100        100      100
Blank/Open Area     90       64.29        100      100        100      76.92
Sea/Ocean           75       100          71.43    100        100      100
Settlements         93.75    88.24        94.12    88.89      93.75    93.75
Open Scrub          100      100          94.12    88.89      50       100
Dense Vegetation    31.25    100          75       100        93.75    93.75
Open Vegetation     85.71    62.07        92.59    89.29      85.71    94.74
OA                  76                    88                  92
Kappa               0.6979                0.8439              0.9003
UA User's accuracy, PA Producer's accuracy, OA Overall accuracy
Conclusion The study was taken up to demonstrate the importance of open source remote sensing datasets and data mining software support for LULC classification. We used ISODATA clustering, the ML classifier and decision tree classification to extract LULC information from satellite data. The comparison of the classification results shows DTC to be superior to the other two. These results could be due to additional advantages possessed by the DTC method. DTs can easily handle both continuous and discrete attributes, and can work well
even in the presence of redundant variables. Being nonparametric, DTs have the advantage that they do not make any assumption regarding the distribution of the input data. This method of classification is significant as it can readily manage the noisy and non-linear relationships between remotely sensed data and LULC classes. The use of open source software proves to be beneficial as it builds the DT on the basis of the attributes of a pixel in different bands, which is not possible using DT tools in proprietary software packages. With open source data mining software support, it is easier to adapt and implement the classification algorithms. It also gives the analyst the flexibility to customize the procedure and incorporate background and a-priori knowledge. Open source satellite datasets are easier to procure and can give satisfactory results for most applications that need LULC information. The classification performance can be increased by further integration of ancillary geospatial data such as a digital elevation model (DEM), slope, aspect and others. Many such databases are available free of cost and are convenient to procure, and DTCs can handle these ancillary data along with spectral information under a single classification framework. This work employed the J48 algorithm from the WEKA open source data mining software for DTC. This was done considering the various advantages that the software offers, including a simple GUI, the ability to be integrated in Java based platforms, the freedom to use a pruned or unpruned tree, the liberty to set confidence thresholds, and so on. One of the biggest strengths of WEKA is that it is compatible with DBMS using SQL. This can be explored further to achieve higher scalability, providing the same outputs while preserving optimum computation time. WEKA is a software package that offers a set of data mining and machine learning algorithms for experimentation in a much easier setup.
References
Carrão, H., Goncalves, P., & Caetano, M. (2008). Contribution of multispectral and multitemporal information from MODIS images to land cover classification. Remote Sensing of Environment, 112, 986–997.
DeFries, R., Hansen, M., Townshend, J., & Sohlberg, R. (1998). Global land cover classifications at 8 km spatial resolutions: the use of training data derived from Landsat imagery in decision tree classifiers. International Journal of Remote Sensing, 19(6), 3141–3168.
ERDAS Field Guide TM (2010). Technical documentation, ERDAS Inc.
Feitosa, R. Q., Costa, G. A. O. P., Mota, G. L. A., Pakzad, K., & Costa, M. C. O. (2009). Cascade multitemporal classification based on fuzzy Markov changes. ISPRS Journal of Photogrammetry and Remote Sensing, 64, 159–170.
Friedl, M. A., Brodley, C. E., & Strahler, A. H. (1999). Maximizing land cover classification accuracies produced by decision trees at continental to global scales. IEEE Transactions on Geoscience and Remote Sensing, 37(2), 969–977.
Gamanya, R., Maeyer, P. D., & Dapper, M. D. (2007). An automated satellite image classification design using object oriented segmentation algorithms: a move towards standardization. Expert Systems with Applications, 32, 616–624.
Hansen, M. C., DeFries, R. S., Townshend, J. R. G., & Sohlberg, R. (2000). Global land cover classification at 1 km spatial resolution using a classification tree approach. International Journal of Remote Sensing, 21, 1331–1364.
Jensen, J. R. (2005). Introductory digital image processing (3rd ed.). Upper Saddle River: Prentice Hall.
Kandrika, S., & Roy, P. S. (2008). Land use land cover classification of Orissa using multi-temporal IRS-P6 AWiFS data: a decision tree approach. International Journal of Applied Earth Observation and Geoinformation, 10(2), 186–193.
Miles, L., & Kapos, V. (2008). Reducing greenhouse gas emissions from deforestation and forest degradation: global land-use implications. Science, 320, 1454–1455.
Otukei, J. R., & Blaschke, T. (2010). Land cover change assessment using decision trees, support vector machines and maximum likelihood classification algorithms. International Journal of Applied Earth Observation and Geoinformation, 12, 27–31.
Pal, M., & Mather, P. M. (2003). An assessment of the effectiveness of decision tree methods for land cover classification. Remote Sensing of Environment, 86, 554–565.
Pu, R., Gong, P., Michishta, R., & Sasgawa, R. (2008). Spectral mixture analysis for mapping abundance of urban surface components from the Terra/ASTER data. Remote Sensing of Environment, 112, 939–954.
Punia, M., Joshi, P. K., & Porwal, M. C. (2011). Decision tree classification of land use land cover for Delhi, India using IRS-P6 AWiFS data. Expert Systems with Applications, 38, 5577–5583.
Quinlan, J. R. (1993). C4.5: Programs for machine learning. San Mateo: Morgan Kaufmann Publishers.
Rosenqvist, A., Milne, A., Lucas, R., Imhoff, M., & Dobsone, C. (2003). A review of remote sensing technology in support of the Kyoto Protocol. Environmental Science & Policy, 6, 441–455.
Sesnie, S. E., Gessler, P. E., Finegan, B., & Thessler, S. (2008). Integrating Landsat TM and SRTM–DEM derived variables with decision trees for habitat classification and change detection in complex neotropical environments. Remote Sensing of Environment, 112, 2145–2159.
Small, C. (2005). A global analysis of urban reflectance. International Journal of Remote Sensing, 26(4), 661–681.
Stefanov, W. L., Ramsey, M. S., & Christensen, P. R. (2001). Monitoring urban land cover change: an expert system approach to land cover classification of semiarid to arid urban centers. Remote Sensing of Environment, 77, 173–185.
Stehman, S. V. (1996). Estimation of Kappa coefficient and its variance using stratified random sampling. Photogrammetric Engineering and Remote Sensing, 26, 401–407.
Tooke, T. R., Coops, N. C., Goodwin, N. R., & Voogt, J. A. (2009). Extracting urban vegetation characteristics using spectral mixture analysis and decision tree classifications. Remote Sensing of Environment, 113, 398–407.
Wentz, E. A., Nelson, D., Rahman, A., Stefanov, W. L., & Roy, S. S. (2008). Expert system classification of urban land use/cover for Delhi, India. International Journal of Remote Sensing, 29(15–16), 4405–4427.
Woodcock, C. E., & Strahler, A. H. (1987). The factor of scale in remote sensing. Remote Sensing of Environment, 21, 311–332.