Geoderma 253–254 (2015) 67–77
Comparing data mining classifiers to predict spatial distribution of USDA-family soil groups in Baneh region, Iran
R. Taghizadeh-Mehrjardi a,⁎, K. Nabiollahi b, B. Minasny c, J. Triantafilis d
a Faculty of Agriculture and Natural Resources, University of Ardakan, Ardakan, Iran
b Faculty of Agriculture, University of Kurdistan, Sanandaj, Iran
c Department of Environmental Sciences, Faculty of Agriculture and Environment, The University of Sydney, Biomedical Building C81, 1 Central Avenue, Australian Technology Park, Eveleigh, NSW 2015, Australia
d School of Biological, Earth and Environmental Sciences, Faculty of Science, The University of New South Wales, NSW 2052, Australia
Article info
Article history: Received 30 July 2014; Received in revised form 7 April 2015; Accepted 8 April 2015; Available online xxxx
Keywords: Digital soil mapping; Taxonomic distance; Auxiliary data; Iran
Abstract
Digital soil mapping (DSM) involves the use of auxiliary data to assist in the mapping of soil classes. In this research, we investigate the predictive power of six data mining classifiers, namely logistic regression (LR), artificial neural network (ANN), support vector machine (SVM), K-nearest neighbour (KNN), random forest (RF) and decision tree analysis (DTA), to create a digital soil map across an area of 3000 ha in Kurdistan Province, north-west Iran. In this area, 217 soil profiles were selected using the conditioned Latin hypercube sampling method, then sampled, analysed and allocated to taxonomic classes according to Soil Taxonomy up to the family level. To test the user accuracy (UA), we split the data into calibration and validation sets (70:30%). For the five soil family classes we map, the highest overall accuracy (0.71) and kappa index (0.69) are achieved using the DTA and ANN methods. More specifically, the UA of prediction was up to 18.33% better than that of LR. Moreover, our results showed that minimizing taxonomic distance gave no improvement in the prediction accuracy of the DTA algorithm compared with minimizing misclassification error (0.71). Overall, our results suggest that the developed methodology could be used to predict soil classes in other regions of Iran. © 2015 Elsevier B.V. All rights reserved.
⁎ Corresponding author. E-mail address: [email protected] (R. Taghizadeh-Mehrjardi).
http://dx.doi.org/10.1016/j.geoderma.2015.04.008
0016-7061/© 2015 Elsevier B.V. All rights reserved.
1. Introduction
Soil survey and mapping in Iran follow the traditional approach, which requires the collection of a large amount of soil morphological data (e.g. soil colour, soil texture) and other information (Taghizadeh-Mehrjardi et al., 2014). The result has been a soil map of the whole country at a scale of 1:1,000,000. However, at this scale the map is not suitable for detailed farm management planning because the spatial resolution is low, owing to the time, labour and expense involved in collecting the soil morphological data needed for soil classification (i.e. USDA Soil Taxonomy). In addition, the accuracy of the final map is unknown. These issues are particularly important in arid and semi-arid areas because of the large spatial variability resulting from differences in parent material, age of land surface, topography, water distribution, amount and intensity of rainfall and plant heterogeneity (Shmida and Burgess, 1988; Cantón et al., 2003). A more appropriate approach may be to apply high-resolution Digital Soil Mapping (DSM) techniques to aid and enhance soil use and management (Lagacherie et al., 2007). This is because the underlying principle in DSM is to use computer-assisted methods to harmonise
semi-quantitative morphological data with more readily measured auxiliary variables in an objective way (McBratney et al., 2000, 2003). Differences in the auxiliary data, which are much easier to obtain, can therefore be used to study and identify various soils. The first and most common auxiliary variable used in DSM studies is remote sensing data, because it provides information about parent material and soil properties across a landscape. Land cover can also be characterized. The second is terrain attributes, because they provide information about relief, which plays a strong role in soil-forming processes (McKenzie and Ryan, 1999). For example, surface topography controls water movement through and over a landscape, which has a significant impact on soil development (Moore et al., 1991). The third is geomorphological surfaces, which have a significant impact on soil distribution in arid and semi-arid areas (Taghizadeh-Mehrjardi et al., 2014; Jafari et al., 2014). A further key issue in DSM is the choice of data mining classifier. These classifiers have been developed to link soil classes to auxiliary variables and can learn from data (Mitchell, 1997). The choice is crucial because the algorithm is given the inputs and the corresponding outputs and must learn the relationship between them. The most commonly used algorithms include artificial neural network (ANN), decision tree (DTA), K-nearest neighbour (KNN), random forest (RF), logistic regression (LR) and support vector machine (SVM). For example, ANN (Zhu, 2000; Behrens et al., 2005), DTA (Bui and Moran,
2001; Adhikari et al., 2014), KNN (Mansuy et al., 2014), LR (Vasques et al., 2014), SVM (Ballabio, 2009) and RF (Pahlavan-Rad et al., 2014; Heung et al., 2014) have been applied by several researchers to predict various soil attributes and soil classes. Most recently, Jafari et al. (2012) and Taghizadeh-Mehrjardi et al. (2014) used LR and DTA, respectively, to predict the spatial distribution of USDA great groups in central Iran, whilst Pahlavan-Rad et al. (2014) updated soil survey maps using random forest in the loess-derived soils of northern Iran. Most of the DSM studies conducted in Iran have therefore focused on arid regions (Taghizadeh-Mehrjardi et al., 2014; Jafari et al., 2012) or humid regions (Pahlavan-Rad et al., 2014). DSM at the soil family level has received little attention in the semi-arid regions of Iran, and few studies have compared different data mining classifiers for DSM. Therefore, in this paper we use a range of DSM techniques (i.e. ANN, RF, DTA, LR, SVM and KNN) to map soil classes (30 m × 30 m) up to the family level of the USDA system in a semi-arid region located in Kurdistan Province, Iran.
2. Material and methods
2.1. Study area
The study area is located in Kurdistan Province, a major agricultural region in the north-west of Iran (Fig. 1). It covers approximately 3000 ha and is located 12 km to the north-west of Baneh City. The elevation ranges from 1400 m to about 1700 m above mean sea level. Mean annual rainfall and mean annual temperature are approximately 700 mm and 13.8 °C, respectively. The soil moisture and temperature regimes are xeric and mesic, respectively. Farm lands occupy approximately 70% of the total area, with the remainder consisting of range lands and forest.
Parent materials are mainly schist and limestone. Geomorphologic units consist of piedmont, plateau and hills.
2.2. Auxiliary data collection and pre-processing
A number of auxiliary variables are used to represent the various SCORPAN factors (i.e. soil, climate, organisms, relief, parent materials and spatial position). Herein we describe the factors and the auxiliary variables chosen to represent them in the DSM procedure. All auxiliary variables are projected to the same geographic space using the Universal Transverse Mercator system (RMSE = 0.42), clipped to the extent of the study area, and converted to ERDAS Imagine format. All raster data (i.e. 30 grid layers of auxiliary variables) are also co-registered to the same raster grid size of 30 m.
i) Landsat spectral data are used to predict soil distribution because, in arid and semi-arid areas, the soil surface is not completely covered by vegetation. As such, spectral data can detect mineralogical properties of soil. Herein a Landsat 8 ETM+ image (acquired on March 28, 2013) is processed using ERDAS Imagine software (Leica Geosystems Geospatial Imaging, 2008). Several band ratios that are useful indicators of vegetation are calculated, including the Normalized Difference Vegetation Index (NDVI; Rouse et al., 1973), the ratio vegetation index (RVI; Pearson and Miller, 1972) and the soil adjusted vegetation index (SAVI; Huete, 1988). Band ratios that represent parent material and soil factors are also calculated and include the Clay Index (CI; Boettinger et al., 2008), B6/B4 (Gad and Kusky, 2006), the Carbonate Index (CrI; Boettinger et al., 2008), the Gypsum Index (GI; Nield et al., 2007), the Salinity Ratio (SR; Metternicht and Zinck, 2003) and the Brightness Index (BI; Metternicht and Zinck, 2003).
ii) In areas with sufficient relief, terrain attributes can also be combined with Landsat spectral data for spatial modelling of soil classes.
In this study 14 terrain parameters are obtained from a digital elevation model (DEM) with a spatial resolution of 10 m × 10 m (National Cartographic Center, 2010), including: elevation, slope, aspect, profile curvature, plane curvature, convergence index, catchment slope, catchment area, wetness index, mid-slope position, altitude above channel network, channel network base level, valley depth, multi-resolution ridge top flatness index, and multi-resolution index of valley bottom flatness (MrVBF; Gallant and Dowling, 2003). All terrain parameters are calculated using SAGA GIS software (Olaya, 2004). In this study, the DEM is prepared from RADAR images. A two-dimensional discrete wavelet transform (Graps, 1995) is used to remove artificial noise from the DEM (Lark and Webster, 2004) and to decompose it into four levels: L1, L2, L3 and L4. These correspond to pixel sizes of 20, 40, 80 and 160 m, respectively. Results showed that, for the prediction of the target variable (i.e. soil families), the decomposed data layer L4 gave higher accuracy than the original DEM. Therefore, the subset derived from the decomposed data layer L4 was used for spatial modelling of soil families.

Fig. 1. Location of the study area in the north-west of Iran and spatial distribution of soil sample locations draped over the digital elevation model (A: Coarse loamy, mixed, mesic, Lithic Xerorthents; B: Fine, mixed, mesic, Typic Calcixerepts; C: Fine loamy, carbonatic, mesic, Typic Calcixerepts; D: Fine loamy, mixed, mesic, Typic Haploxerepts; E: Fine, mixed, mesic, Typic Haploxerepts; V: validation data set; T: training data set; DEM: digital elevation model).

iii) A useful source of information for assessing soil parent material and soil genesis is the geomorphology map. Scull et al. (2005) showed the dominant role of geomorphologic processes in determining the spatial distribution of soil classes in arid regions. Traditionally, landform entities are delineated manually on aerial photographs. This can be improved by automated processes. Here, a robust approach defined by MacMillan et al. (2000) is used to segment landforms automatically. This model uses derivatives computed from the DEM and a fuzzy rule base to identify up to 15 morphologically defined landform facets (Table 1).
Herein, ten terrain attributes are derived from the DEM to act as input variables for automated landform classification. The terrain attributes are slope gradient (Eyton, 1991), profile curvature (Quinn et al., 1991), plane curvature, wetness index, percent Z relative to min & max elevation for the entire study area, percent Z relative to top & bottom of each watershed, percent Z relative to local pits & peaks, percent Z relative to nearest stream & divide, absolute height (Z) above the local pit cell, and absolute maximum pit to peak relief (Z) (MacMillan and Pettapiece, 1997). These data are classified using a fuzzy semantic import (SI) model (Burrough et al., 1992), which involves converting the terrain attributes and individual fuzzy landform attribute values into continuous numbers scaled from 0 to 100 (Fig. 2).
2.3. Data collection and soil sample analysis
Sampling points were selected according to the conditioned Latin hypercube (cLHC) method (Minasny and McBratney, 2006), which is an efficient sampling method because it captures the variability of multiple input auxiliary variables. In total, 217 soil profiles were excavated and described according to Soil Survey Staff (2010), and samples were taken from the genetic horizons identified. Fig. 1 shows the location of the soil profiles draped over the DEM. From the genetic horizons, 594 samples were analysed in the laboratory. The samples were air-dried at room temperature and then passed through a 2 mm sieve. The particle size distribution was determined by the Bouyoucos hydrometer method (Gee and Bauder, 1986). The electrical conductivity of a saturated soil paste extract (ECe) and pH values were measured using
Table 1
Names and general characteristics of the 15 facets.

ID no.  Name                   Slope shape  Curvature  Slope gradient
1       Level crest            Planar       Planar     Low
2       Divergent shoulder     Convex       Convex     Any
3       Upper depression       Concave      Concave    Low
4       Back slope             Planar       Planar     High
5       Divergent back slope   Planar       Convex     High
6       Convergent back slope  Planar       Concave    High
7       Terrace                Planar       Planar     Low
8       Saddle                 Concave      Convex     Any
9       Mid slope depression   Concave      Concave    Low
10      Foot slope             Concave      Concave    High
11      Toe slope              Planar       Planar     High
12      Fan                    Planar       Convex     High
13      Lower slope mound      Convex       Convex     Any
14      Level lower slope      Planar       Planar     Low
15      Lower depression       Concave      Concave    Low
a PW-9527 Philips Conductivity Meter and an EYELA-2000 pH meter. Gypsum was measured by the Nelson and Sommers method (Nelson, 1982), and organic carbon was determined by the Walkley-Black method (Walkley and Black, 1934). Soluble calcium, magnesium, chloride, carbonate, bicarbonate, sodium and potassium were measured according to standard methods (Sparks et al., 1996). The soil profiles were allocated to two orders (i.e. Inceptisols and Entisols), two suborders (i.e. Xerepts and Orthents), three great groups (i.e. Haploxerepts, Calcixerepts and Xerorthents), three subgroups (i.e. Typic Haploxerepts, Typic Calcixerepts and Lithic Xerorthents) and five families (i.e. Coarse loamy, mixed, mesic, Lithic Xerorthents; Fine, mixed, calcareous, mesic, Typic Calcixerepts; Fine loamy, carbonatic, mesic, Typic Calcixerepts; Fine loamy, mixed, mesic, Typic Haploxerepts; Fine, mixed, mesic, Typic Haploxerepts). Table 2 shows the basic soil morphological and physico-chemical properties of the representative soil profiles.
2.4. Auxiliary data selection
The next step is to select the most appropriate auxiliary data (Table 3), both to reduce the dimensionality and to allow the learning algorithms to operate more effectively. Irrelevant and redundant information may decrease the prediction accuracy of common machine learning algorithms (Hall, 1997; Hall et al., 2009; Mollazade et al., 2012). Different techniques can be used to rank the relevance of auxiliary variables, including correlation-based feature selection (CFS), principal component analysis (Omid et al., 2010), factor analysis and sensitivity analysis. Here, we apply CFS using the CfsSubsetEval algorithm available in the WEKA software package (Hall et al., 2009). CFS is a fully automatic algorithm that requires neither predefined thresholds nor a predefined number of features. The algorithm ranks auxiliary data according to a correlation-based heuristic evaluation function and retains auxiliary data that are highly correlated with the target variable (in our case, soil classes). Irrelevant data, with low correlations, are screened out. CFS typically eliminates over half the features. In our case, the CFS algorithm reduced the number of covariates from 30 to 12 layers: NDVI, SAVI, clay index, elevation, slope, plane curvature, channel network base level, aspect, wetness index, mid-slope position, catchment area and convergence index.
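As an illustration of the correlation-based heuristic described above, the following minimal Python sketch scores a feature subset with the CFS merit function, k·r_cf / sqrt(k + k(k − 1)·r_ff), and grows the subset by greedy forward selection. It is a simplification under stated assumptions and not the WEKA CfsSubsetEval implementation used in this study: absolute Pearson correlation stands in for WEKA's correlation measure, the class is coded as a number, and the function names (cfs_merit, cfs_forward) and the synthetic data are hypothetical.

```python
import numpy as np


def cfs_merit(X, y, subset):
    """CFS merit of a feature subset: k*r_cf / sqrt(k + k*(k-1)*r_ff)."""
    k = len(subset)
    # Mean absolute feature-class correlation (relevance).
    r_cf = np.mean([abs(np.corrcoef(X[:, j], y)[0, 1]) for j in subset])
    if k == 1:
        return r_cf
    # Mean absolute feature-feature correlation (redundancy).
    pairs = [(a, b) for i, a in enumerate(subset) for b in subset[i + 1:]]
    r_ff = np.mean([abs(np.corrcoef(X[:, a], X[:, b])[0, 1]) for a, b in pairs])
    return k * r_cf / np.sqrt(k + k * (k - 1) * r_ff)


def cfs_forward(X, y):
    """Greedy forward search: add the feature that most improves the merit."""
    selected, best = [], 0.0
    remaining = list(range(X.shape[1]))
    while remaining:
        merit, j = max((cfs_merit(X, y, selected + [j]), j) for j in remaining)
        if merit <= best:  # no candidate improves the subset, so stop
            break
        selected.append(j)
        remaining.remove(j)
        best = merit
    return selected, best


# Synthetic demonstration (hypothetical data, not the study's covariates).
rng = np.random.default_rng(0)
n = 500
f0 = rng.normal(size=n)                  # relevant
f1 = f0 + 0.05 * rng.normal(size=n)      # near-duplicate of f0 (redundant)
f2 = rng.normal(size=n)                  # relevant, independent of f0
f3 = rng.normal(size=n)                  # irrelevant noise
X = np.column_stack([f0, f1, f2, f3])
y = (f0 + f2 > 0).astype(float)          # binary class coded as 0/1

selected, merit = cfs_forward(X, y)
print("selected feature indices:", selected)
```

In this toy setting the search is expected to keep the independent relevant feature together with only one of the two near-duplicates, which is the behaviour that lets CFS discard redundant as well as irrelevant covariates.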
2.5. Spatial modelling
The relationship between soil classes and auxiliary variables is modelled by applying empirical models (Kempen et al., 2009). Various modelling techniques have been used for the digital mapping of soil classes. Here, multinomial logistic regression (LR), artificial neural networks (ANNs), support vector machines (SVMs), decision tree analysis (DTA), K-nearest neighbour (KNN) and random forest (RF) are evaluated using MATLAB software (Mathworks, 2010). We describe each in turn.
2.5.1. Multinomial logistic regression (MLR)
Logistic regression is a probabilistic statistical classification model that belongs to the family of generalized linear models (Tabachnick and Fidell, 1996). Logistic regression can be binomial or multinomial. Binomial logistic regression deals with situations in which the observed soil samples can be classified into only two groups (i.e. A and B), whereas multinomial logistic regression deals with situations in which the observed soil samples can be classified into three or more groups (e.g. A, B, C, etc.). For binomial logistic regression, the probability of occurrence of soil class (A) is calculated as follows:

$$\mathrm{Probability}(A) = \frac{e^{B_0 + B_1 X_1 + B_2 X_2 + \dots + B_k X_k}}{1 + e^{B_0 + B_1 X_1 + B_2 X_2 + \dots + B_k X_k}} \tag{1}$$

where $B_0$ to $B_k$ are coefficients, and $X_1$ to $X_k$ are independent variables (i.e. auxiliary data presented in Table 3). Note that the regression coefficients are usually estimated using maximum likelihood estimation (Bailey et al., 2003). Eq. (1) can be written in terms of log odds (referred to as a logit):

$$\log\left(\frac{\mathrm{prob}(A)}{\mathrm{prob}(B)}\right) = B_0 + B_1 X_1 + B_2 X_2 + \dots + B_k X_k \tag{2}$$

A logistic coefficient can be interpreted as the change in log odds associated with a one-unit change in the independent variable. As it is easier to think of odds rather than log odds, Eq. (2) can be written as:

$$\frac{\mathrm{prob}(A)}{\mathrm{prob}(B)} = e^{B_0 + B_1 X_1 + B_2 X_2 + \dots + B_k X_k} \tag{3}$$

The binomial logistic regression model is easily generalized to the multinomial logistic regression (MLR). For example, if there are five soil groups, the probability of occurrence of class (A) can be calculated as follows:

$$\mathrm{Probability}(A) = \frac{e^{B_0 + B_{A1} X_1 + B_{A2} X_2 + \dots + B_{Ak} X_k}}{e^{B_0 + B_{A1} X_1 + B_{A2} X_2 + \dots + B_{Ak} X_k} + e^{B_0 + B_{B1} X_1 + B_{B2} X_2 + \dots + B_{Bk} X_k} + \dots + e^{B_0 + B_{E1} X_1 + B_{E2} X_2 + \dots + B_{Ek} X_k}} \tag{4}$$

The logistic platform fits the probabilities for response categories to continuous predictors. Here, MLR is used to model the relationships between the family groups and the auxiliary data.

Fig. 2. Landforms in the study area based on the method of MacMillan et al. (2000). Refer to Table 1.
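To make Eqs. (1) and (4) concrete, the short Python sketch below evaluates the binomial and multinomial probabilities for a given set of coefficients. The coefficient values, the input vector and the function names are hypothetical and chosen only for illustration; in practice the coefficients would be estimated by maximum likelihood.

```python
import numpy as np


def binomial_probability(x, b0, b):
    """Eq. (1): P(A) = exp(z) / (1 + exp(z)), with z = B0 + sum(Bi * Xi)."""
    z = b0 + np.dot(b, x)
    return np.exp(z) / (1.0 + np.exp(z))


def multinomial_probabilities(x, b0, B):
    """Eq. (4): one linear predictor per class, normalised over all classes.

    B holds one row of coefficients per class (five rows for classes A to E).
    """
    z = b0 + B @ x
    z = z - z.max()          # subtract the maximum for numerical stability
    e = np.exp(z)
    return e / e.sum()


# Hypothetical example: two auxiliary variables, five soil classes (A to E).
x = np.array([0.4, 1.2])
b0 = 0.1
B = np.array([[0.5, -0.2],
              [0.1, 0.3],
              [-0.4, 0.8],
              [0.0, 0.0],
              [0.9, -0.6]])

p = multinomial_probabilities(x, b0, B)
print(dict(zip("ABCDE", np.round(p, 3))), "sum =", p.sum())

# The log odds of the binomial model recover the linear predictor.
pA = binomial_probability(x, b0, B[0])
log_odds = np.log(pA / (1.0 - pA))
print("log odds:", log_odds)
```

The class probabilities sum to one by construction, and the log odds of the binomial case equal the linear predictor B0 + B1X1 + B2X2, which is the interpretation of the coefficients given in the text.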
Table 2
The basic soil morphological and physico-chemical properties of the representative family group soils (Depth: cm; sand, silt, clay, OC, gypsum, CCE: %; CEC: cmol kg−1); (OC: organic carbon; CCE: calcium carbonate equivalent; CEC: cation exchange capacity; SaL: sandy loam; CL: clay loam; C: clay; SiCL: silty clay loam).

Soil horizon  Depth (cm)  Sand (%)  Silt (%)  Clay (%)  Texture  Gypsum (%)  CCE (%)  pH    ECe (dS/m)  CEC (meq/100 g)  OC (%)
(A) Coarse loamy, mixed, mesic, Lithic Xerorthents
A             0–10        54.00     31.00     15.00     SaL      0           1.75     7.18  0.54        14.5             0.51
R             +10         –         –         –         –        –           –        –     –           –                –
(B) Fine, mixed, mesic, Typic Calcixerepts
AP            0–15        26.00     41.00     33.00     CL       1.75        3.00     7.63  0.51        35.5             0.69
Bw            15–65       23.00     35.00     42.00     C        1.8         9.00     7.56  0.43        44.5             0.51
Bk            65–120      32.00     31.00     37.00     CL       2.5         32.00    8.78  0.45        33.5             0.34
(C) Fine loamy, carbonatic, mesic, Typic Calcixerepts
AP            0–20        25.00     43.00     32.00     CL       2.2         6.75     7.58  0.42        31.4             0.67
Bk1           20–50       12.00     55.00     33.00     SiCL     5.25        34.5     7.72  0.56        28.45            0.5
Bk2           50–150      41.00     30.00     29.00     CL       6.5         45.75    8.84  0.59        33.7             0.16
(D) Fine loamy, mixed, mesic, Typic Haploxerepts
AP            0–15        34.00     33.00     33.00     CL       2.05        3.00     7.82  0.44        34.25            0.45
Bw1           15–60       27.00     40.00     33.00     CL       0.00        1.00     6.72  0.62        43.5             0.2
Bw2           60–150      24.00     33.00     43.00     C        1.75        1.75     6.63  0.45        29.75            0.53
(E) Fine, mixed, mesic, Typic Haploxerepts
AP            0–20        22.00     45.00     33.00     CL       2.35        5.25     7.34  0.68        32               0.67
Bw1           20–70       20.00     37.00     43.00     C        2.45        3.50     7.26  0.44        39.75            0.44
Bw2           70–150      24.00     35.00     41.00     C        2.35        1.50     7.51  0.55        43.5             0.55
Table 3
Auxiliary variables used for spatial modelling (variables in the best subset are marked †).

Covariate data source: Digital elevation model
Attributes: Elevation (E)†, mid-slope position (MSP)†, slope†, altitude above channel network (AACN), catchment area (CA)†, convergence index (CI)†, plane curvature (PlC)†, profile curvature (PrC), multi-resolution ridge top flatness index (MrRTF), multi-resolution index of valley bottom flatness (MrVBF), catchment network base level (CNBL)†, valley depth (VD), catchment slope (CS), wetness index (WI)†, aspect†

Covariate data source: Landsat 8
Attributes: Blue, green, red, near infrared, shortwave IR-1, shortwave IR-2, normalized difference vegetation index (NDVI†: (Shortwave IR-1 − Near infrared)/(Shortwave IR-1 + Near infrared)), ratio vegetation index (RVI: Shortwave IR-1/Near infrared), soil adjusted vegetation index (SAVI†: [(Shortwave IR-1 − Near infrared)/(Shortwave IR-1 + Near infrared + L)]*(1 + L)), clay index (CI†: Shortwave IR-1/Shortwave IR-2), Shortwave IR-1/Red, carbonate index (CrI: Red/Green), gypsum index (GI: (Shortwave IR-1 − Near infrared)/(Shortwave IR-1 + Near infrared)), salinity ratio (SR: (Red − Near infrared)/(Red + Near infrared)), brightness index (BI: ((Red)^2 + (Near infrared)^2)^0.5)

Covariate data source: Geomorphology map
Attributes: Landform facets
2.5.2. Artificial neural networks (ANNs)
Artificial neural networks (ANNs) are mathematical models that try to copy, in the simplest way, the parallel local computing system of the human brain (McClelland and Rumelhart, 1988). Herein we use a Multi-Layer Perceptron (MLP), a feed-forward back-propagation neural network model, with one hidden layer. The network uses the sigmoid activation function in the hidden layer and the linear activation function in the output layer. Training is performed with the back-propagation algorithm, a gradient descent algorithm that has been used successfully and extensively to train feed-forward neural networks (Minsky and Papert, 1969; Levine et al., 1996; Morshed and Kaluarachchi, 1998; Amini et al., 2005). In the training step, the number of neurons in the hidden layer is varied from 2 to 20, with training run for 1000 epochs. We also use the Levenberg–Marquardt training algorithm (Levenberg, 1944; Marquardt, 1963) due to its efficiency and simplicity.
2.5.3. Support vector machines (SVMs)
In machine learning, support vector machines (SVMs) belong to a family of generalized linear classifiers and are used for classification and regression analysis (Cortes and Vapnik, 1995; Burges, 1998). The foundations of SVMs were developed by Vapnik (1995) and rest on the principle of “structural risk minimisation” (Vapnik, 1998). The basic idea is to use a linear model to implement non-linear class boundaries through non-linear mapping of the input vector into a high-dimensional feature space. The kernel method is used to construct linear classifiers in high-dimensional feature spaces that are non-linearly related to the input space (Aronszajn, 1950; Mollazade et al., 2012).
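The kernel idea in the preceding paragraph can be illustrated with a small numerical check. The Python sketch below is only an illustration under stated assumptions: it uses a homogeneous polynomial kernel of degree 2 on two-dimensional inputs, which is not necessarily one of the kernels evaluated in this study, and the function names are hypothetical. It shows that the kernel value computed in the input space equals an ordinary dot product in an explicitly constructed higher-dimensional feature space.

```python
import numpy as np


def poly2_kernel(x, y):
    """Homogeneous polynomial kernel of degree 2: K(x, y) = (x . y)^2."""
    return float(np.dot(x, y)) ** 2


def poly2_feature_map(x):
    """Explicit map for 2-D input: phi(x) = (x1^2, sqrt(2)*x1*x2, x2^2)."""
    x1, x2 = x
    return np.array([x1 ** 2, np.sqrt(2.0) * x1 * x2, x2 ** 2])


x = np.array([1.0, 2.0])
y = np.array([3.0, -1.0])

# Kernel evaluated directly in the 2-D input space.
k_input = poly2_kernel(x, y)
# The same quantity as a plain dot product in the 3-D feature space.
k_feature = float(np.dot(poly2_feature_map(x), poly2_feature_map(y)))
print(k_input, k_feature)
```

Because the two values agree, a classifier that is linear in the feature space can be trained using kernel evaluations alone, without ever forming the mapped vectors, while its decision boundary in the input space is non-linear.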
The advantages of SVMs are that they are effective in high-dimensional spaces, remain effective when the number of dimensions is greater than the number of samples, use only a subset of training points (the support vectors) in the decision function, and allow different kernel functions to be specified for the decision function (Vapnik, 2000).
2.5.4. Decision trees (DTAs)
A decision tree relates several independent variables, which have direct or indirect relationships to a target variable, to that variable through a tree structure generated by partitioning the data recursively into a number of groups (Breiman et al., 1984). A constructed decision tree consists of nodes (each representing an attribute), branches (each representing an attribute value) and leaves (each representing a soil class). A training
dataset is used to discover or exploit the unknown relationships between the predictor variables and the predicted variable. The approach assumes that all the information required to establish soil predictions is contained in the data and can be extracted if a sufficient amount of training data is collected (Elnaggar, 2007; Dobos et al., 2006). To predict soil classes, four different DTA induction algorithms were adopted here: J48 (the C4.5 DTA learner), REP (reduced-error pruning), CART (classification and regression trees) and C5 (Quinlan, 2001). More information about these algorithms is presented in Witten and Frank (2005). For instance, C5 constructs decision trees in two phases. A large tree is first grown and is then ‘pruned’. This pruning process is applied to every sub-tree. In the Pruning CF option, values smaller than the default (25%) cause more of the initial tree to be pruned, whilst larger values result in less pruning. The Minimum case option constrains the degree to which the initial tree can fit the data. Values higher than the default (2 cases) can lead to an initial tree that fits the training data only approximately (Quinlan, 2001).
2.5.5. K-nearest neighbour (KNN)
Nemes et al. (2006) introduced a nonparametric, nearest neighbour approach for classification and regression problems. Here we adapted their algorithm for the prediction of soil classes. The KNN technique classifies each record in a dataset based on a combination of the classes of the k record(s) most similar to it in a historical dataset. It searches for the soils that are most similar to the target soil, based on the selected input attributes. In most KNN studies, the ‘distance’ measure is calculated as the classical Euclidean distance between the target and the known instances.
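The nearest-neighbour search described above can be sketched in a few lines of Python. This is an illustration and not the implementation used here: the distance-weighted vote, in which each of the k neighbours contributes a weight proportional to (1/distance)^p, is one assumed way of combining a neighbour count k with a weighting exponent p, and the function name and toy data are hypothetical. The attributes are assumed to be on comparable scales.

```python
import numpy as np


def knn_predict(X_ref, y_ref, x_target, k=3, p=1.0):
    """Classify x_target from its k nearest reference soils.

    Neighbour votes are weighted by (1 / distance)**p (an assumed weighting).
    """
    # Euclidean distance: square root of the sum of squared differences.
    d = np.sqrt(((X_ref - x_target) ** 2).sum(axis=1))
    nearest = np.argsort(d)[:k]
    # Guard against division by zero when the target matches a reference soil.
    weights = 1.0 / np.maximum(d[nearest], 1e-12) ** p
    votes = {}
    for idx, w in zip(nearest, weights):
        votes[y_ref[idx]] = votes.get(y_ref[idx], 0.0) + w
    return max(votes, key=votes.get)


# Toy reference set: two attributes, classes "A" and "B" (hypothetical).
X_ref = np.array([[0.0, 0.0], [0.2, 0.1], [1.0, 1.0], [0.9, 1.2], [1.1, 0.8]])
y_ref = np.array(["A", "A", "B", "B", "B"])

pred_1 = knn_predict(X_ref, y_ref, np.array([0.1, 0.1]), k=3, p=1.0)
pred_2 = knn_predict(X_ref, y_ref, np.array([1.0, 0.9]), k=3, p=1.0)
print(pred_1, pred_2)
```

With p = 0 every neighbour counts equally, while larger values of p let the closest soils dominate the vote, so the two parameters can be tuned jointly on held-out data.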
The ‘distance’ of each soil from the target soil is calculated as the square root of the sum of squared differences in attributes. The approach has been applied in many fields, including pattern recognition, statistical classification (Dasarathy, 1991), interpolation of soil particle size distribution (Nemes et al., 1999) and estimation of soil hydraulic properties (Nemes et al., 2006). Herein we optimize the two parameters of the KNN technique (i.e. p and k). The parameter k is the number of soils selected from the reference data set that are then used to formulate the estimate of the output attribute of the target soil. The parameter p is used to weight each of the selected k soils whilst forming the estimate of the output attribute.
2.5.6. Random forest (RF)
Random forest was introduced by Breiman (2001). It is an ensemble learning method for classification that operates by constructing a multitude of decision trees. The main principle behind ensemble methods is that a group of “weak learners” can come together to form a “strong learner”. Random forest has been used for predicting soil classes in unmapped areas (Stum et al., 2010; Barthold et al., 2013; Pahlavan-Rad et al., 2014). A different subset of the training data (~2/3) is selected, with replacement, to train each tree. In the present research, we first optimized the two parameters of the RF technique (i.e. mtry and ntree), which are the number of environmental covariates in each random subset and the number of trees in the forest, respectively. The parameters were selected by iterating mtry values from 1 to 12 and ntree values from 100 to 10,000.
2.6. Taxonomic distance
The data mining algorithms mentioned above try to minimize the misclassification error, so that errors in all classes are treated as equally important (Hastie et al., 2001).
However, this is not appropriate for soil classes, because taxonomic relationships exist among soil classes and some errors are therefore more serious than others. In this case, we can specify an algorithm that minimizes the taxonomic distance between soil classes, which reflects their relative importance, rather than the misclassification error (Minasny and
McBratney, 2007). We calculated the taxonomic distance between soil families using ordination (Hole and Hironaka, 1960). A discriminant analysis was applied to predict soil families from observed soil physical and chemical properties. Then, the Mahalanobis distance was used to calculate the distance between the class centroids (i.e. the taxonomic distance).
2.7. Model evaluation
In order to test the accuracy of predictions, the data were divided randomly into two sets. The larger set was used for training (70% = 152 soil samples) and the smaller set was set aside for validation (30% = 65 soil samples) (Schmidt et al., 2008). Note that we tried to preserve the class distribution of the whole dataset in the training and validation sets. As can be seen from Fig. 1, soil classes A, B, C, D and E account for the same percentages of the training and validation data sets. In the current study, several validation criteria were applied (Malone et al., 2014): kappa index (Grinand et al., 2008), user accuracy (UA), producer accuracy (PA) and overall accuracy (OA) (Jensen, 1996; Foody, 2002).
3. Results and discussion
3.1. Spatial modelling
3.1.1. Multinomial logistic regression (LR)
A stepwise multinomial logistic regression was used to model the relationships between the soil family (categorical dependent variable) and the terrain attributes, remote sensing indices (quantitative predictors) and landform facets (categorical predictors). Our results indicated that the most significant correlations existed between the family classes and the landform facets, elevation, WI and MrVBF. This confirmed that the terrain attributes were the most effective characteristics in predicting the family classes, indicating that relief is the most important factor explaining soil formation. The wetness index and MrVBF indicate potential zones of transport for many materials, particularly sediment and other materials in excess water flow (Whiteway et al., 2004).
The geomorphology map was also a powerful predictor in model fitting. This emphasizes the role of geomorphological processes in soil development, as reported in many soil-geomorphology studies (Toomanian et al., 2006; Jafari et al., 2012). Using these auxiliary data, LR predicted the spatial distribution of soil classes at the family level with overall accuracy and kappa index of 0.60 and 0.53, respectively, which is similar to the results obtained by other researchers (Jafari et al., 2012; Bourennane et al., 2014; Vasques et al., 2014). Table 4 further shows that the highest accuracy for the prediction of family classes was obtained for soils B and E, with a user accuracy of 0.63 based on the validation data set. That these classes occur in 49% of the soil observations might be one reason why soils B and E had the highest accuracy. Supporting this suggestion is the poor prediction of soil D, which was represented by only a few observations (12% of the total). Jafari et al. (2012) and Pahlavan-Rad et al. (2014) mentioned that the size of sampling units relative to the total study area is an important factor determining the purity of a map; hence a smaller sample size contributes to uncertainty.
3.1.2. Artificial neural network (ANN)
Using the feed-forward back-propagation algorithm, a number of three-layer ANNs (input layer, hidden layer and output layer) were trained to predict the soil map at the family level. In the input layer, the number of neurons was fixed at the number of independent variables (12 auxiliary variables). In the output layer, the number of neurons was fixed at the number of dependent classes, namely the five family classes (A, B, C, D and E). Amini et al. (2005) indicated that too many hidden neurons cause overfitting, whilst too few cause underfitting. To find a suitable number of neurons in the hidden layer, neural networks with 2 to 20 hidden neurons were investigated. The overall accuracy for the prediction of soil classes based on the validation data set is plotted in Fig. 3. Fig. 3a shows that the best number of neurons in the hidden layer is eight. Therefore, we used a 12-8-5 network as the best topology, containing 12 neurons in the input layer, 8 neurons in the hidden layer and 5 neurons in the output layer, for the prediction of soil families (Fig. 3b). The overall accuracy and kappa index obtained for the validation data set were 0.71 and 0.69, respectively. This is consistent with other published studies (Zhu, 2000; Fidencio et al., 2001; Zhao
Table 4
Confusion matrix obtained from evaluation of spatial modelling (LR: Logistic regression; ANN: Artificial neural network; SVMNP: Normalized polynomial; KNN: K-nearest neighbour; RF: Random forest; DTAC5: Decision tree C5 algorithm; PA: Producer accuracy; UA: User accuracy; A: Coarse loamy, mixed, mesic, Lithic Xerorthents; B: Fine, mixed, mesic, Typic Calcixerepts; C: Fine loamy, carbonatic, mesic, Typic Calcixerepts; D: Fine loamy, mixed, mesic, Typic Haploxerepts; E: Fine, mixed, mesic, Typic Haploxerepts).

Class   LR           ANN          SVMNP        KNN          RF           DTAC5
        PA    UA     PA    UA     PA    UA     PA    UA     PA    UA     PA    UA
A       0.67  0.62   0.77  0.77   0.60  0.69   0.78  0.54   0.86  0.46   0.73  0.62
B       0.71  0.63   0.80  0.75   0.65  0.69   0.78  0.88   0.72  0.81   0.72  0.81
C       0.47  0.58   0.54  0.58   0.62  0.67   0.73  0.67   0.75  0.75   0.67  0.67
D       0.36  0.50   0.67  0.50   0.50  0.75   0.22  0.25   0.43  0.38   0.44  0.50
E       0.77  0.63   0.72  0.81   1.00  0.50   0.81  0.81   0.67  0.88   0.81  0.76
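As an illustration of how the PA and UA columns of Table 4 relate to a confusion matrix, the following minimal Python sketch computes both measures. It assumes the common convention that rows hold the observed (reference) classes and columns the predicted classes; the source does not state its convention, and the 3-class matrix below is invented purely for illustration (it is not the study's data).

```python
def producer_user_accuracy(cm):
    """Return (PA, UA) lists for a square confusion matrix.

    Assumed convention (not stated in the source): cm[i][j] counts
    validation profiles observed as class i and predicted as class j.
    """
    n = len(cm)
    pa, ua = [], []
    for i in range(n):
        observed_total = sum(cm[i])                       # row total
        predicted_total = sum(cm[r][i] for r in range(n))  # column total
        # PA: share of observed class i that was mapped correctly.
        pa.append(cm[i][i] / observed_total if observed_total else float("nan"))
        # UA: share of predictions of class i that are correct.
        ua.append(cm[i][i] / predicted_total if predicted_total else float("nan"))
    return pa, ua


# Hypothetical 3-class matrix, for illustration only.
toy_cm = [
    [8, 2, 0],
    [1, 6, 3],
    [1, 0, 9],
]
pa, ua = producer_user_accuracy(toy_cm)
```

Under this convention a class can have a high PA but a low UA (or the reverse), which is why Table 4 reports both for every method.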
Fig. 3. (a) Variation of overall accuracy with the number of neurons in the hidden layer; (b) the best ANN model for predicting soil family classes, with a 12-8-5 topology (A: Coarse loamy, mixed, mesic, Lithic Xerorthents; B: Fine, mixed, mesic, Typic Calcixerepts; C: Fine loamy, carbonatic, mesic, Typic Calcixerepts; D: Fine loamy, mixed, mesic, Typic Haploxerepts; E: Fine, mixed, mesic, Typic Haploxerepts; NDVI: normalized difference vegetation index; SAVI: soil adjusted vegetation index; DEM: digital elevation model; Plane C.: plane curvature; WI: wetness index; CNBL: catchment network base level).
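To make the 12-8-5 topology of Fig. 3b concrete, the sketch below runs one forward pass through a network of that shape in plain Python. It is only a structural illustration: the tanh hidden activation, the softmax output, and the random untrained weights are assumptions, since the source does not report the activation functions or the fitted weights.

```python
import math
import random


def init_layer(n_in, n_out, rng):
    """Random weights (n_out x n_in) and zero biases; placeholders, not fitted values."""
    weights = [[rng.uniform(-0.5, 0.5) for _ in range(n_in)] for _ in range(n_out)]
    biases = [0.0] * n_out
    return weights, biases


def dense(inputs, weights, biases):
    """Weighted sum plus bias for every neuron in a layer."""
    return [sum(w * x for w, x in zip(row, inputs)) + b
            for row, b in zip(weights, biases)]


def softmax(z):
    """Convert output scores to class probabilities that sum to 1."""
    m = max(z)
    exps = [math.exp(v - m) for v in z]
    total = sum(exps)
    return [e / total for e in exps]


def forward(x, hidden, output):
    """12 covariates -> 8 hidden neurons (tanh, assumed) -> 5 class probabilities."""
    h = [math.tanh(v) for v in dense(x, *hidden)]
    return softmax(dense(h, *output))


def n_parameters(sizes):
    """Number of weights and biases in a fully connected network."""
    return sum(a * b + b for a, b in zip(sizes, sizes[1:]))


rng = random.Random(0)
hidden = init_layer(12, 8, rng)   # 12 auxiliary variables -> 8 hidden neurons
output = init_layer(8, 5, rng)    # 8 hidden neurons -> 5 family classes (A to E)
covariates = [rng.random() for _ in range(12)]  # hypothetical scaled covariates
probs = forward(covariates, hidden, output)
total_params = n_parameters([12, 8, 5])
```

With h hidden neurons such a network has 18h + 5 adjustable parameters, so the parameter count grows linearly over the 2 to 20 range that was searched.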
et al., 2009; Behrens et al., 2005). Moreover, according to the confusion matrix in Table 4, the user accuracy of the ANN for classes A, B, C, D and E was 0.77, 0.75, 0.58, 0.50, and 0.81, respectively. According to the weights assigned to each input parameter by the ANN model, the wetness index received the greatest weight and was the most effective parameter for the prediction of soil classes. The important parameters were ordered as follows: wetness index, landform facets, plane curvature, altitude above channel network, SAVI, NDVI, and MrVBF.

3.1.3. Support vector machine (SVM)
SVMs use kernel functions to project the data into a higher-dimensional space in which they are linearly separable. To enhance the performance of SVM for the prediction of soil classes, four commonly used kernel functions, namely the polynomial, normalized polynomial, radial basis, and universal Pearson VII kernel functions (Cristianini and Shawe-Taylor, 2000), were evaluated. The confusion matrix in Table 4 shows that the user accuracy of SVM in the classification of soil classes A, B, C, D and E was 0.69, 0.69, 0.67, 0.75 and 0.50, respectively. Whilst many recent studies have shown the superior performance of SVMs versus other techniques, especially when using small training sets (Boyd et al., 2006; Pal and Mather, 2005; Li et al., 2013), this was not the case here, except for class D, for which SVM had the highest UA of any of the methods (0.75). The overall accuracy and kappa index (0.65 and 0.51, respectively) were also lower than those attained using the ANN model and comparable to the LR model results.

3.1.4. K-nearest neighbour (KNN)
Using KNN, two parameters, p and k, need to be optimized. We obtained the optimal values by varying both parameters in the algorithm and evaluating the predictions on the validation data set. We varied k from 1 to 40 and p from 0.2 to 2. Fig. 4 shows the RMSE values obtained using the different combinations of k and p. It can be seen that the KNN technique is not very sensitive to p, whereas k has a larger effect. Changing p from 0.2 to 2 changed the RMSE of prediction from 0.28 to 0.31, and changing k from 13 to 40 changed the RMSE of prediction from 0.26 to 0.48. Therefore, the combination of p = 0.8 and k = 13 was selected to run the algorithm.
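The search over k and p can be sketched as a simple grid search. The example below is a minimal plain-Python illustration, not the authors' implementation. It assumes that p is the order of a Minkowski-type distance (this section of the source does not define p), and it scores each combination by the validation misclassification rate rather than the RMSE reported in Fig. 4. The points and labels are invented toy data.

```python
from collections import Counter


def minkowski(a, b, p):
    """Minkowski-type distance of order p (assumed meaning of p)."""
    return sum(abs(x - y) ** p for x, y in zip(a, b)) ** (1.0 / p)


def knn_predict(train_X, train_y, query, k, p):
    """Majority vote among the k training points closest to the query."""
    ranked = sorted(range(len(train_X)),
                    key=lambda i: minkowski(train_X[i], query, p))
    votes = Counter(train_y[i] for i in ranked[:k])
    return votes.most_common(1)[0][0]


def grid_search(train_X, train_y, val_X, val_y, k_values, p_values):
    """Return (error, k, p) for the combination with the lowest validation error."""
    best = None
    for k in k_values:
        for p in p_values:
            wrong = sum(knn_predict(train_X, train_y, x, k, p) != y
                        for x, y in zip(val_X, val_y))
            err = wrong / len(val_y)
            if best is None or err < best[0]:
                best = (err, k, p)
    return best


# Hypothetical two-covariate toy data, for illustration only.
train_X = [(0, 0), (0, 1), (1, 0), (5, 5), (5, 6), (6, 5)]
train_y = ["A", "A", "A", "B", "B", "B"]
val_X = [(0.5, 0.5), (5.5, 5.5)]
val_y = ["A", "B"]

best_err, best_k, best_p = grid_search(
    train_X, train_y, val_X, val_y,
    k_values=[1, 3], p_values=[0.2, 0.8, 2.0],
)
```

Because ties are resolved in favour of the first combination tried, a flat error surface such as the one described for p returns the earliest value in the grid, so the grid order matters when the surface is flat.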
According to the confusion matrix (Table 4), the user accuracy of the individual soil classes A, B, C, D, and E was 0.54, 0.88, 0.67, 0.25, and 0.81, respectively. Apart from class D, for which KNN gives the lowest UA of any class and any method, these results show that KNN performs better than most other methods. This is particularly the case for class B, for which KNN achieves the highest UA of any method (0.88), and for class E (0.81). However, whilst the KNN technique predicted soil classes better than the SVM algorithm, its overall accuracy and kappa index of 0.68 and 0.60, respectively, are not quite as good as those achieved with the ANN.

3.1.5. Random forest (RF)
In the RF technique, the two parameters mtry and ntree need to be optimised. Optimal parameters were obtained based on the overall out-of-bag (OOB) error criterion, which is the total number of misclassified cases divided by the total number of cases. Each tree in the RF was trained on an independent bootstrap sample. Note that we allocated only 70% of the soil observations to this step; the remaining 30% were set aside for the final independent validation. Using this approach, values of 9 and 800 were determined as optimal for mtry and ntree, respectively. The OOB error obtained for the best combination of parameters was 78%. In terms of the classification results, RF predicted the five classes with an overall accuracy and kappa index of 0.70 and 0.69, respectively. Table 4 shows that the UA for classes A, B, C, D, and E was 0.46, 0.81, 0.75, 0.38, and 0.88, respectively. Of interest is that classes C and E achieved the highest UA of any method, whereas class A had the lowest UA of any method. Results also showed that some covariates were more important for the prediction of soil classes at the family level, including the wetness index, landform facets, MrVBF, plane curvature, altitude above channel network, and SAVI.

3.1.6. Decision tree (DTA)
Soil classification was carried out with the J48, REP, CART and C5 tree algorithms; for the C5 tree (DTAC5 in Table 4), the UA for classes A, B, C, D, and E was 0.62, 0.81, 0.67, 0.50, and 0.76, respectively (Fig. 5). It is worth noting that whilst the UA values are quite large for some of the classes (i.e. A and E), neither was the highest compared with the UA obtained for them by other models. It is also worth stating that class D returned a UA of just 0.50. This relatively low accuracy occurs for this model and for most of the others because class D has the smallest number of observations, covering less than 12% of the soil observations. Moreover, the results indicate the significance of each type of auxiliary data, expressed as an attribute usage percentage, i.e. the percentage of training cases for which the value of that covariate is used in predicting a class. The most powerful predictor was the wetness index (100%), which was used by the model for every prediction. The second most important predictor was landform facets, which was used by the model in 84% of soil family predictions. This emphasizes the role of geomorphological processes in soil development, as reported in many soil-geomorphology studies (Jafari et al., 2012; Toomanian et al., 2006). The remaining auxiliary variables, in decreasing order of importance for the prediction of soil family classes, were aspect (63%), elevation (51%), catchment area (49%), SAVI (47%), channel network base level (41%), convergence index (34%), NDVI (24%) and plane curvature (11%).

3.2. Comparison of different spatial modelling
Fig. 4. Three-dimensional representation of the relationship between the p and k values.
The ability of six data-mining algorithms (i.e. LR, ANN, SVM, KNN, RF, and DTA) to predict five soil classes in the Baneh region was tested on the basis of a small validation data set. The kappa index and overall accuracy for all spatial models are summarized in Table 5. From the data in this table, we can see that the highest kappa index and overall accuracy at the family level were obtained with DTAC5 and ANN (0.69 and 0.71, respectively). Therefore, we could recommend DTAC5 and ANN as the best models for the prediction of soil classes up to the family level.
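The two summary measures used for this comparison can be computed from a confusion matrix as in the following minimal Python sketch. It uses the standard definition of Cohen's kappa, kappa = (p_o - p_e) / (1 - p_e), where p_o is the overall accuracy and p_e is the agreement expected by chance from the row and column totals. The matrix is a made-up 3-class example, not the validation data of this study.

```python
def overall_accuracy(cm):
    """Correctly classified cases (diagonal) divided by all cases."""
    total = sum(sum(row) for row in cm)
    correct = sum(cm[i][i] for i in range(len(cm)))
    return correct / total


def kappa_index(cm):
    """Cohen's kappa: agreement beyond what the marginal totals predict by chance."""
    n = len(cm)
    total = sum(sum(row) for row in cm)
    p_o = overall_accuracy(cm)
    row_totals = [sum(cm[i]) for i in range(n)]
    col_totals = [sum(cm[r][j] for r in range(n)) for j in range(n)]
    # Chance agreement from the products of matching row and column totals.
    p_e = sum(r * c for r, c in zip(row_totals, col_totals)) / (total * total)
    return (p_o - p_e) / (1.0 - p_e)


# Hypothetical confusion matrix, for illustration only.
toy_cm = [
    [6, 2, 2],
    [2, 6, 2],
    [2, 2, 6],
]
acc = overall_accuracy(toy_cm)
kappa = kappa_index(toy_cm)
```

Kappa is never larger than the overall accuracy when chance agreement is positive, which is why the kappa values reported for each model are lower than the corresponding overall accuracies.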
WI > 8.15: :...Dem 1443.42: : :...Aspect > 5.63: E (5/1) : Aspect 1457.02: E (7) : CNBL