Environ. Process. DOI 10.1007/s40710-017-0248-5 ORIGINAL ARTICLE

Application and Comparison of Decision Tree-Based Machine Learning Methods in Landside Susceptibility Assessment at Pauri Garhwal Area, Uttarakhand, India

Binh Thai Pham 1 · Khabat Khosravi 2 · Indra Prakash 3

Received: 15 March 2017 / Accepted: 6 June 2017 © Springer International Publishing AG 2017

Abstract Landslide susceptibility assessment has been conducted at the Pauri Garhwal area of Uttarakhand state, India, an area affected by numerous landslides causing significant losses of life, infrastructure and property every year. Decision tree-based machine learning methods, namely Random Forest (RF), Logistic Model Trees (LMT), Best First Decision Trees (BFDT) and Classification and Regression Trees (CART) have been used, and results are compared herein for proper spatial prediction of landslides. Analysis of the data has been done considering sixteen conditioning factors (i.e., slope angle, elevation, slope aspect, profile curvature, land cover, curvature, lithology, plan curvature, soil, distance to lineaments, lineament density, distance to roads, road density, distance to river, river density and rainfall), and 1295 historical landslide polygons. Models were validated and compared using Receiver Operating Characteristics (ROC) curve and statistical indices. The results show that the RF model has the highest predictive capability, followed by the LMT, BFDT and CART models, respectively, and indicate that although all four methods have shown good results, the performance of the RF method is the best for landslide spatial prediction.

Keywords Landslide susceptibility mapping · Machine learning · Decision trees · Random Forest · India

* Binh Thai Pham [email protected]; [email protected]

1 Department of Geotechnical Engineering, University of Transport Technology, 54 Trieu Khuc, Thanh Xuan, Ha Noi, Viet Nam

2 Department of Watershed Management Engineering, Faculty of Natural Resources, Sari Agricultural Science and Natural Resources University, Sari, Iran

3 Department of Science & Technology, Government of Gujarat, Bhaskarcharya Institute for Space Applications and Geo-Informatics (BISAG), Gandhinagar, India

Pham B.T. et al.

1 Introduction
A landslide is defined as the downslope movement of soil, debris, and rock under the action of gravity (Das et al. 2012). Landslides are among the most catastrophic natural disasters in hilly and mountainous areas all over the world (Chakraborty and Pradhan 2012; Shirzadi et al. 2017b). García-Rodríguez et al. (2008) stated that landslides cause larger annual losses of property than other natural disasters such as floods and windstorms. Every year, landslides kill hundreds of people all over the world (Yalcin et al. 2011). According to statistical analyses, the average annual cost of landslide damage is about 1.4 billion dollars in Canada, 3.8 billion dollars in Italy, 2.4 billion dollars in USA, 19.6 million dollars in Nepal, and 2 million Euro in China (Bai et al. 2012; Sidle and Ochiai 2006). The Himalayan region of India is affected by many minor to major landslides causing disruption of communication systems and significant loss of property and life (Pham et al. 2017a). The Pauri Garhwal district of Uttarakhand state is located in the Himalaya, where landslides are considered the most catastrophic natural disaster (Chakraborty and Pradhan 2012). Landslide activity is increasing due to rapid urbanization, deforestation and climate change (Chen et al. 2017; Kanungo et al. 2009). A landslide susceptibility map can be used for land use planning and for hazard and risk assessment of landslide-prone areas (Hong et al. 2017a; Pham et al. 2016f; Tsangaratos et al. 2016); therefore, it is a helpful tool in landslide hazard management. Landslide susceptibility mapping is carried out on the assumption that future landslides are likely to occur under the same conditions as previous historical landslides (Guzzetti 2006; Ilia and Tsangaratos 2016). During recent decades, different mathematical methods have been developed and applied for landslide susceptibility mapping.
Machine learning methods, such as Support Vector Machines (Hong et al. 2016c, d), artificial neural networks (Tsangaratos and Benardos 2014), and Naïve Bayes (Pham et al. 2016b; Tsangaratos and Ilia 2016a), have been applied widely and efficiently for landslide susceptibility assessment and prediction. In addition, the machine learning method known as Decision Trees (DT) has been applied widely in solving many real world problems including landslide prediction (Saito et al. 2009; Tsangaratos and Ilia 2016b). However, it suffers from classification errors, and the preparation of decision trees is a complex process, especially for large datasets (Nayab and Scheid 2011). Therefore, decision tree-based machine learning methods, namely Random Forest (RF), Logistic Model Trees (LMT), Best First Decision Trees (BFDT) and Classification and Regression Trees (CART), have been developed to address the flaws of the conventional DT. In the present study, these modified decision tree-based methods (RF, LMT, BFDT and CART) have been applied and compared to find the best method for landslide susceptibility assessment and mapping. Validation of landslide models has been done using the ROC curve and statistical indices. Modeling and data processing have been done using the Weka 3.7.12 and ArcMap software.

2 General Characteristics of the Study Area
The study area (longitude 78°37′22″ to 78°52′55″ and latitude 30°4′37″ to 29°52′22″) is located in the Pauri Garhwal district of Uttarakhand State, India, in the Himalayan region (Fig. 1). Topographically, the area is hilly, with high mountains and deep valleys ranging in elevation from 460 m to 2130 m. Hill slopes reach angles of up to 70 degrees.

Decision Trees Methods in Landside Susceptibility Assessment

Fig. 1 Landslide location map of the study area

Geologically, the study area is folded and faulted with complex geological structures. It is occupied by various lithological units consisting of limestone, dolomite, sandstone, shale, phyllite and schists. These units belong to different groups namely Boulder slate formation, Tal group, Amri group, Baliana group, Krol group, and Bijni group. Silt and loamy soils are present in the area. About 76% of the area is occupied by loamy soil. The area is located in tropical monsoon region with annual rainfall varying from 706 mm to 1872 mm, and temperature from sub-zero to 45 °C. The humidity varies between 54% and 63%. Most of the area is covered by dense and open forest vegetation.


3 Materials and Methods
3.1 Data Used
3.1.1 Landslide and Non-landslide Data
An important step in landslide spatial prediction is to construct a landslide inventory map based on past and present landslide locations (Guzzetti 2006; Guzzetti et al. 2005; Pourghasemi and Kerle 2016). In the present study, extensive field surveys and interpretation of satellite and Google Earth images using remote sensing techniques have been carried out to identify landslide locations. In total, 1295 landslide polygons have been identified. The area of individual landslides varies from 750 m² to 60,989 m². Three main types of landslides occur in the area: translational (750 locations), rotational (120 locations), and debris flows (425 locations) (Figs. 1 and 2). In addition, non-landslide data is also important in landslide susceptibility modeling because landslide prediction is treated as a binary classification problem (Pham et al. 2016d). In this study, non-landslide data has been generated from stable areas where landslides have not been observed. A total of 1295 non-landslide locations have been identified and used for landslide susceptibility analysis in this study.

3.1.2 Landslide Affecting Factors
Determination of landslide conditioning factors is crucial for developing a model for assessment of landslide susceptibility (Oh and Pradhan 2011; Youssef et al. 2015). In the present study, sixteen landslide affecting or conditioning factors, namely elevation, slope angle, slope aspect, curvature, plan curvature, profile curvature, lithology, distance to

Fig. 2 Landslide photos of Uttarakhand Area (Source: Geological Survey of India report)


lineaments, lineament density, soil type, land cover, rainfall, distance to roads, road density, distance to river, and river density have been considered for landslide susceptibility modeling. These factors have been selected, extracted, and generated from various sources, namely literature review, field surveys, and analysis of the mechanism of landslide occurrences and its relation to the geo-environmental characteristics of the area (Pham et al. 2015b, g). Elevation, slope angle, slope aspect, curvature, plan curvature, and profile curvature maps have been extracted from the ASTER Global DEM of 30 m resolution. Elevation is an important factor which influences weathering of rock mass, affecting landslide occurrences (Pham et al. 2016e). The elevation map, in this study, has been extracted with the following classes: 0–700, 700–900, 900–1100, 1100–1300, 1300–1500, 1500–1700, 1700–1900, and >1900 m (Fig. 3a).

Fig. 3 a Elevation map; b Slope angle map; c Slope aspect map; d Curvature map; e Plan curvature map; f Profile curvature map; g Lithology map; h Soil type map; i Distance to lineaments map; j Lineament density map; k Land cover map; l Rainfall map; m Distance to roads map; n Road density map; o Distance to rivers map; and p River density map


The Slope angle affects landslide occurrences (Hong et al. 2017b). Moderate to steep slopes are more prone to landslides than gentle slopes (< 10 degrees) (Lee and Min 2001; Pham et al. 2016c). The Slope angle map, in this study, has been extracted with seven classes: 0–10, 10–20, 20–30, 30–40, 40–50, 50–60, and >60 degrees (Fig. 3b). The Slope aspect affects the solar radiation, the wind, and the rainfall falling on the sides of the hill. The moisture content of slope forming material depends on this factor (Tien Bui et al. 2016b). In this study, the slope aspect map has been extracted with different classes, namely flat, north, northeast, east, southeast, south, southwest, west, and northwest (Fig. 3c). The Curvature is the rate of change of the slope angle and aspect of the terrain surface, and it affects landslide occurrence. Its morphology depends on riverbank erosion, and its value is positive, zero, and negative for convex, flat, and concave surfaces, respectively (Chang


et al. 2007). A Curvature map, in this study, has been developed in three classes, namely concave (< −0.05), flat (−0.05 to 0.05), and convex (> 0.05) (Fig. 3d). The Plan curvature represents the horizontal local relief, which depends on the divergence and convergence of flow direction on the hill slope, affecting landslide occurrences (Ohlmacher 2007). In this study, a plan curvature map has been generated with five classes: 1 [(−10.134) – (−1.042)], 2 [(−1.042) – (−0.331)], 3 [(−0.331) – 0.223], 4 [0.223–0.934], and 5 [0.934–10.026] (Fig. 3e). The Profile curvature is a horizontal form of morphology in the direction of the maximum slope (Yesilnacar and Topal 2005). In this study, a profile curvature map has been generated with five classes, namely 1 [(−14.22) – (−1.33)], 2 [(−1.33) – (−0.44)], 3 [(−0.44) – 0.26], 4 [0.26–1.25], and 5 [1.25–11.07] (Fig. 3f). The Lithology represents the general physical characteristics of rocks. It is one of the most important factors influencing the susceptibility to landslide occurrences (Cevik and Topal


2003). A lithology map has been extracted from an available regional scale map of Uttarakhand (1:1,000,000). Different formations and groups have been identified, namely Boulder slate formation, Tal group (sandstone, shale, quartzite, phyllite, and limestone), Krol group (boulder bed and limestone), Bijni group (quartzite, phyllite), and Amri group (quartzite, phyllite) (Fig. 3g). The Soil type generally refers to different sizes of mineral particles derived from geologic litho-units. Most landslides depend on the physico-mechanical properties of soils or rocks. Therefore, soil type is an important factor for landslide occurrences (Sarkar and Kanungo 2004). The soil map of the study area has been prepared from the regional scale soil map of Uttarakhand (1:1,000,000), with classes such as coarse loam, fine loam, skeletal loam, mixed loam, and fine-silt (Fig. 3h). The Distance to lineaments, such as fractures, shears and faults, influences landslide occurrences as these features induce weakness in the adjacent rock mass/ground mass. In this study,


lineaments have been extracted from LANDSAT-8 satellite images using Geomatica 2015 software. A distance to lineaments map has been generated by buffering lineaments on the study area with different distance classes, namely 0–50, 50–100, 100–150, 150–200, 200–250, 250–300, 300–350, 350–400, 400–450, 450–500, and >500 m (Fig. 3i). The Lineament density is also a very important factor for landslide spatial modeling. Areas with higher Lineament density are more prone to landslides (Sarkar and Kanungo 2004). A Lineament density map has been generated using the linear density function in ArcGIS software with different classes such as very low [0–0.224], low [0.224–0.579], moderate [0.579–0.916], high [0.916–1.308], and very high [1.308–2.382] (Fig. 3j). The Land cover is the observed physical cover on the earth's surface, which includes vegetation and man-made features. Vegetation helps in preventing soil erosion, and thus provides stability to slopes (Varnes 1984). A Land cover map has been generated from the regional scale Uttarakhand land cover map (1:1,000,000) with different classes such as open-forest, dense-forest, non-forest, and scrub-land (Fig. 3k). The Rainfall creates instability in rock mass/soil mass, and it is one of the most common causes of landslides (Pham et al. 2015a; Tien Bui et al. 2016a). Rainfall adversely affects the properties of soil and rocks, especially soft and weathered rocks on slopes. A Rainfall map has been generated from 30 years (1984 to 2014) of rainfall data obtained from Global Weather meteorological data (NCEP 2014) using the spline interpolation method with different classes: 0–700, 700–900, 900–1100, 1100–1300, 1300–1500, 1500–1700, 1700–1900, 1900–2100, 2100–2300, and >2300 mm (Fig. 3l). Distance to roads is an important factor for landslide occurrences as excavation of roads creates instability in the surrounding rock mass and ground slope (Yesilnacar and Topal 2005).
In this study, the road networks have been extracted from Google Earth satellite images. A Distance to roads map has been generated by buffering road networks on the study area with six classes: 0–50, 50–100, 100–150, 150–200, 200–250, and >250 m (Fig. 3m). The Road density also affects the stability of slopes as areas having higher road density are more prone to landslides. A map of Road density has been generated using the linear density function in ArcGIS software with different classes: very low [0–0.633], low [0.633–1.542], moderate [1.542–2.534], high [2.534–3.773], and very high [3.773–7.023] (Fig. 3n). The Distance to rivers affects stability of slopes as water action on the slopes and erosion of the ground mass are greater near rivers. River networks have been extracted from the ASTER Global DEM using the hydrological function in ArcGIS software, and then a Distance to rivers map has been generated by buffering river networks on the study area with different distance classes, namely 0–50, 50–100, 100–150, 150–200, 200–250, and >250 m (Fig. 3o). The River density affects the stability of slopes, as landslides often occur in the areas where the river density is high. A River density map has been generated using the linear density function in ArcGIS software with different classes, namely very low (0–0.05), low (0.05–0.154), moderate (0.154–0.268), high (0.268–0.423), and very high (0.423–1.268) (Fig. 3p).
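The reclassification of a continuous factor into the distance classes listed above can be illustrated with a short sketch. It is only an illustration in Python, not the ArcGIS buffering workflow used in the study; the function name `classify` and the rule that a value lying exactly on a break goes to the upper class are assumptions, since the text does not say which side the class boundaries belong to.

```python
from bisect import bisect_right

# Upper bounds (m) of the distance-to-roads classes listed in the text;
# anything beyond the last bound falls in the open-ended ">250" class.
ROAD_BREAKS = [50, 100, 150, 200, 250]
ROAD_LABELS = ["0-50", "50-100", "100-150", "150-200", "200-250", ">250"]


def classify(value, breaks, labels):
    """Return the label of the class that contains `value`.

    Assumption: a value exactly on a break is assigned to the upper class.
    """
    return labels[bisect_right(breaks, value)]


# Hypothetical distances (m) from a pixel to the nearest road.
for distance in (12.0, 50.0, 249.9, 800.0):
    print(distance, "->", classify(distance, ROAD_BREAKS, ROAD_LABELS))
```

The same helper would apply to the other classed factors (distance to lineaments, distance to rivers, rainfall) by swapping in their break values.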

3.2 Decision Tree-Based Methods
3.2.1 Random Forest (RF)
The RF is a nonparametric ensemble classifier based on a flexible decision tree algorithm, developed by Breiman (2001). This method is an extension of the classification and regression tree; it combines many trees, each of which is generated from a bootstrap sample (Breiman et al. 1984; Hong et al. 2016b; Rahmati et al. 2016). In this method, random selections of the training data are drawn automatically from the original dataset by the algorithm used to construct the model (Breiman 2001; Catani et al. 2013; Youssef et al. 2016). During the learning process, the split at each node of a tree is determined from a randomized subset of the variables. To minimize the classification error, each tree is fully expanded, but the result is influenced by the random selection (Zabihi et al. 2016). The RF algorithm can also measure how much the prediction error increases when the values of a certain variable are permuted while all other variables are left unchanged, which gives an estimate of the importance of that variable (Catani et al. 2013; Liaw and Wiener 2002).
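The permutation idea described above can be sketched independently of the forest itself. The following Python sketch is illustrative only: `predict` is a hypothetical stand-in for a trained RF model, and the toy data (one informative column, one noise column) are invented for the example rather than taken from the study.

```python
import random


def error_rate(predict, X, y):
    """Fraction of samples that `predict` labels incorrectly."""
    wrong = sum(1 for row, label in zip(X, y) if predict(row) != label)
    return wrong / len(y)


def permutation_importance(predict, X, y, column, rng):
    """Increase in error after shuffling one column, all others unchanged."""
    baseline = error_rate(predict, X, y)
    shuffled = [row[column] for row in X]
    rng.shuffle(shuffled)
    X_perm = [row[:column] + [v] + row[column + 1:]
              for row, v in zip(X, shuffled)]
    return error_rate(predict, X_perm, y) - baseline


# Hypothetical toy data: column 0 is a slope angle (degrees), column 1 is noise.
rng = random.Random(0)
X = [[rng.uniform(0, 60), rng.random()] for _ in range(200)]
y = [1 if row[0] > 30 else 0 for row in X]


def predict(row):
    """Stand-in for a trained model: a rule that only looks at column 0."""
    return 1 if row[0] > 30 else 0


imp_slope = permutation_importance(predict, X, y, 0, rng)
imp_noise = permutation_importance(predict, X, y, 1, rng)
print("slope importance:", imp_slope, "noise importance:", imp_noise)
```

Shuffling the informative column raises the error, whereas shuffling the noise column leaves it unchanged, which is the behaviour the importance measure relies on.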

3.2.2 Logistic Model Trees (LMT)
The LMT is a classification tree method which combines logistic regression and decision tree learning algorithms (Landwehr et al. 2005). The LMT provides a piecewise linear regression model which is used at the leaves of the classification tree (Landwehr et al. 2005). The LMT has a tree structure which consists of two parts (terminal nodes and non-terminal nodes). In this method, the tree structure gives a disjoint subdivision of the space S, which is spanned by all attributes (t = ti), into regions St represented by leaves, as in the following equation:

$$ S = \bigcup_{t \in T} S_t, \qquad S_t \cap S_{t'} = \emptyset \quad \text{for } t \neq t' \tag{1} $$

Here, the leaves are connected to a logistic regression (LR) function f(t) which considers a subset Ut ⊆ U of n attributes (t = ti). The class membership probabilities are given by the following equation:

$$ P(t) = \frac{e^{f_n(t)}}{\sum_{i=1}^{n} e^{f_i(t)}} \tag{2} $$
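As a small numerical illustration of Eq. (2), the sketch below turns a list of leaf scores f_i(t) into class membership probabilities. The function name and the example scores are hypothetical; subtracting the maximum score is a standard numerical safeguard that does not change the result.

```python
import math


def class_probabilities(scores):
    """Eq. (2): exp(f_i) / sum_j exp(f_j) for every class score f_i."""
    shift = max(scores)  # avoids overflow; the ratio is unchanged
    exps = [math.exp(s - shift) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


# Hypothetical scores for the "landslide" and "non-landslide" classes.
probs = class_probabilities([1.2, -0.4])
print(probs)
```

The probabilities sum to one, and the class with the larger score receives the larger probability.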

3.2.3 Best First Decision Trees (BFDT)
The BFDT is a decision tree-based method which constructs the tree in best-first order rather than in a fixed order (Shi 2007). At each step, it expands the node that offers the best split in order to construct the classification tree. In this method, both pre-pruning and post-pruning can be performed and compared. During the training process, the BFDT first chooses an attribute to place at the root node; thereafter, branches are built for that attribute based on a splitting criterion (Shi 2007). A splitting process then divides the training instances into subsets, which are used to extend nodes from the root node. These steps are repeated on the best subset to construct the branches of the classification tree (Shi 2007). This process continues until a specified number of expansions has been reached.
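The best-first ordering can be sketched with a priority queue that always expands the node whose best split gives the largest impurity reduction. This is a minimal sketch under stated assumptions, not the Weka implementation used in the study: it handles a single numeric attribute, uses Gini impurity as the splitting criterion, and stops after a fixed number of expansions. All names and the toy data are hypothetical.

```python
import heapq
from itertools import count


def gini(labels):
    """Gini impurity of a list of 0/1 labels."""
    if not labels:
        return 0.0
    p = sum(labels) / len(labels)
    return 2.0 * p * (1.0 - p)


def best_split(rows):
    """Best threshold on one numeric attribute by weighted Gini reduction.

    `rows` is a list of (x, label) pairs. Returns (gain, threshold) or None.
    """
    rows = sorted(rows)
    parent = gini([lab for _, lab in rows]) * len(rows)
    best = None
    for i in range(1, len(rows)):
        if rows[i - 1][0] == rows[i][0]:
            continue  # no threshold fits between equal values
        left = [lab for _, lab in rows[:i]]
        right = [lab for _, lab in rows[i:]]
        gain = parent - gini(left) * len(left) - gini(right) * len(right)
        if best is None or gain > best[0]:
            best = (gain, (rows[i - 1][0] + rows[i][0]) / 2.0)
    return best


def best_first_splits(rows, max_expansions):
    """Expand the node with the largest gain first, up to a fixed budget."""
    tie = count()  # tie-breaker so the heap never compares row lists
    heap, chosen = [], []

    def push(node_rows):
        split = best_split(node_rows)
        if split and split[0] > 0:
            heapq.heappush(heap, (-split[0], next(tie), split[1], node_rows))

    push(rows)
    while heap and len(chosen) < max_expansions:
        neg_gain, _, threshold, node_rows = heapq.heappop(heap)
        chosen.append((threshold, -neg_gain))
        push([r for r in node_rows if r[0] <= threshold])
        push([r for r in node_rows if r[0] > threshold])
    return chosen


# Hypothetical (slope angle, landslide label) pairs.
data = [(5, 0), (10, 0), (15, 0), (25, 1), (35, 1), (40, 0), (45, 1), (55, 1)]
splits = best_first_splits(data, max_expansions=2)
print(splits)
```

With a fixed-order (depth-first) builder the second split would simply be the next node in traversal order; here it is whichever open node currently promises the largest gain.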

3.2.4 Classification and Regression Trees (CART)
Breiman et al. (1984) stated that the CART method is a non-parametric regression technique which is one of the most popular machine learning methods. It is flexible, as it can deal with any type of numeric and binary data, and its results are not affected by monotone transformations or by different scales of measurement (Aertsen et al. 2010). A binary partitioning algorithm is usually used to develop the decision trees in CART (Naghibi et al. 2016). Regression trees through replacement are often used to deal with missing data in a particular factor (Breiman et al. 1984). In this method, over-fitting of terminal nodes in the tree is avoided by recursively snipping off splits (Breiman et al. 1984). The following equation, which is based on comparing the target attribute distribution in the two child nodes, is used in the CART technique for classification problems:

$$ I(\text{split}) = \left[ 0.25\,\bigl(e(1-e)\bigr)^{u} \sum_{k} \left| P_L(k) - P_R(k) \right| \right]^{2} \tag{3} $$

where k indexes the target classes, P_L(k) and P_R(k) are the probability distributions of the target in the left and right child nodes, respectively, and u is a penalty on splits (Wu et al. 2008).
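A short numerical sketch of Eq. (3) follows. The text does not define e; the sketch assumes it is the fraction of cases sent to the left child, which is how the corresponding term is usually described for this criterion. The function name and the class counts are hypothetical.

```python
def split_score(left_counts, right_counts, u=1.0):
    """Eq. (3) evaluated from per-class counts in the two child nodes.

    Assumption: `e` is the fraction of cases sent to the left child.
    """
    n_left, n_right = sum(left_counts), sum(right_counts)
    e = n_left / (n_left + n_right)
    diff = sum(abs(l / n_left - r / n_right)
               for l, r in zip(left_counts, right_counts))
    return (0.25 * (e * (1.0 - e)) ** u * diff) ** 2


# Counts are [non-landslide, landslide] in each child (hypothetical).
perfect = split_score([50, 0], [0, 50])      # classes fully separated
useless = split_score([25, 25], [25, 25])    # identical distributions
print(perfect, useless)
```

A split that separates the classes completely scores highest, while a split whose children have identical class distributions scores zero.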

3.3 Evaluation Methods
The performance of the landslide models has been evaluated and validated using seven statistical indices, namely positive predictive value, negative predictive value, sensitivity, specificity, accuracy, kappa, and Root Mean Squared Error (RMSE) (Bennett et al. 2013). Positive and negative predictive values indicate the probability of pixels classified correctly as the “landslide” class and the “non-landslide” class, respectively (Pham et al. 2016a). Sensitivity is the probability of landslide pixels being classified correctly as the “landslide” class, whereas specificity is the probability of non-landslide pixels being classified correctly as the “non-landslide” class (Pham et al. 2017b). Accuracy is the proportion of landslide and non-landslide pixels classified correctly (Pham et al. 2017c; Shirzadi et al. 2017a). Kappa measures the reliability of the landslide models. RMSE represents the error that occurred in the modeling process (Bennett et al. 2013). The statistical indices have been calculated from the values of confusion matrices based on the pixel classification (Fisher et al. 1969). In addition, the Receiver Operating Characteristic (ROC) curve method, which is the most common quantitative evaluation method for landslide models (Pourghasemi and Rossi 2016), has been selected for landslide model evaluation in the present study. The ROC curve method is based on two statistical values, “sensitivity” and “100-specificity” (Bai et al. 2010; Hong et al. 2015). The quantitative performance of the landslide models has been validated by the AUC (Area Under the Curve) value (Pourghasemi et al. 2012). A higher value (with an upper limit of “1”) indicates better predictive capability of a landslide model (Hanley and McNeil 1982; Hong et al. 2016a).
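The indices above follow directly from the four cells of a confusion matrix, and the AUC can be read as the probability that a randomly chosen landslide pixel receives a higher score than a randomly chosen non-landslide pixel. The sketch below illustrates both using standard textbook definitions; the counts and scores are hypothetical and are not the values behind Table 1.

```python
import math


def confusion_indices(tp, fp, tn, fn):
    """Statistical indices from true/false positive/negative counts."""
    n = tp + fp + tn + fn
    accuracy = (tp + tn) / n
    # Agreement expected by chance, from the row and column totals.
    p_chance = ((tp + fp) * (tp + fn) + (tn + fn) * (tn + fp)) / (n * n)
    return {
        "ppv": tp / (tp + fp),
        "npv": tn / (tn + fn),
        "sensitivity": tp / (tp + fn),
        "specificity": tn / (tn + fp),
        "accuracy": accuracy,
        "kappa": (accuracy - p_chance) / (1.0 - p_chance),
    }


def rmse(predicted, observed):
    """Root mean squared error between predicted and observed values."""
    squared = [(p - o) ** 2 for p, o in zip(predicted, observed)]
    return math.sqrt(sum(squared) / len(squared))


def auc(pos_scores, neg_scores):
    """Share of (landslide, non-landslide) pairs ranked correctly; ties count half."""
    wins = 0.0
    for p in pos_scores:
        for q in neg_scores:
            wins += 1.0 if p > q else 0.5 if p == q else 0.0
    return wins / (len(pos_scores) * len(neg_scores))


# Hypothetical counts: 100 landslide and 100 non-landslide pixels.
indices = confusion_indices(tp=90, fp=20, tn=80, fn=10)
area = auc([0.9, 0.8, 0.4], [0.7, 0.3, 0.2])
print(indices, area)
```

For a balanced test set such as the one used here, chance agreement is 0.5, so kappa reduces to twice the accuracy minus one.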

4 Methodology
Landslide susceptibility assessment using decision tree-based methods has been carried out in the following five main steps (Fig. 4):

(1) Selecting the affecting factors using a feature selection method: In this step, the chi-square feature selection method has been used to test the predictive capability of the landslide affecting factors in order to select suitable factors for landslide modeling. The results of this step have then been used to calculate and assign the weights of factors for generating the final datasets.

(2) Generating training and testing datasets: Using the selected factors, the training dataset and the testing dataset have been generated for constructing and validating the models, respectively. The training dataset has been generated using 906 landslide pixels (70% of landslide locations) and 906 non-landslide pixels, whereas the testing dataset has been generated using 389 landslide pixels (the remaining 30% of landslide locations) and 389 non-landslide pixels. The sampling process has been carried out in the ArcGIS environment to combine landslide and non-landslide data with the landslide affecting factors for generating the final training and testing datasets.

(3) Constructing landslide models using decision tree-based methods: Landslide models using the RF, LMT, BFDT and CART methods have been constructed using the training dataset. In this study, the RF model has been constructed with 300 iterations and 100% of the training set size, whereas the LMT model has been trained with a minimum of 15 instances at which a node is considered for splitting. The BFDT and CART models have been trained with a minimum of 2 instances at the terminal node and 5 folds in internal cross-validation.

(4) Validating models: The landslide models (RF, LMT, BFDT and CART) have been validated using the testing dataset. The ROC curve and statistical indices have been used to evaluate the predictive capability of the models.

(5) Generating landslide susceptibility maps: Landslide susceptibility maps have finally been constructed using the RF, LMT, BFDT and CART models. The reliability of these maps has been validated by landslide density analysis on each map.

Fig. 4 The flow chart of the methodology adopted in this study
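The 70/30 partition in step (2) can be reproduced arithmetically with a short sketch. The text does not state how locations were assigned to each subset, so the random shuffle and the seeds below are assumptions, and the integer identifiers are placeholders for the actual landslide and non-landslide locations.

```python
import random


def split_70_30(items, seed=0):
    """Shuffle, then split into 70% training and 30% testing subsets."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)
    n_train = len(shuffled) * 70 // 100  # integer arithmetic: 1295 -> 906
    return shuffled[:n_train], shuffled[n_train:]


landslide_ids = range(1295)       # placeholders for the 1295 landslide locations
stable_ids = range(1295, 2590)    # placeholders for the 1295 non-landslide locations

ls_train, ls_test = split_70_30(landslide_ids, seed=1)
st_train, st_test = split_70_30(stable_ids, seed=2)

training = [(i, 1) for i in ls_train] + [(i, 0) for i in st_train]
testing = [(i, 1) for i in ls_test] + [(i, 0) for i in st_test]
print(len(ls_train), len(ls_test), len(training), len(testing))
```

Splitting each class separately keeps both subsets balanced, matching the 906 + 906 training and 389 + 389 testing pixels reported above.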

5 Results and Analysis
5.1 Landslide Influencing Factor Selection
In the present study, the chi-square feature selection method, which is one of the most effective feature selection methods (Jin et al. 2006; Zheng et al. 2004), was applied to evaluate the importance of affecting factors. Importance of factors is quantitatively indicated by chi-square value (χ²) which is calculated on the basis of measurement of lack of independence between a factor c and a term t indicated in the following equation (Singh et al. 2010):

$$ \chi^2(t, c) = \frac{n\,[N_1 N_2 - N_3 N_4]^2}{(N_1 + N_3)(N_2 + N_4)(N_1 + N_4)(N_2 + N_3)} \tag{4} $$

where n is the total number of factors, N1 is the number of factors of class c containing term t, N2 is the number of factors of the other class (not c) containing term t, N3 is the number of factors of class c not containing term t, and N4 is the number of factors of the other class not containing term t (Singh et al. 2010). The results show that all sixteen landslide influencing factors are important for landslide modeling (Fig. 5). Distance to roads is the most important (χ²=716.093), followed by Road density (χ²=443.771), Slope angle (χ²=268.197), Slope aspect (χ²=178.303), Elevation (χ²=114.484), Soil type (χ²=109.953), River density (χ²=90.77), Rainfall (χ²=84.697), Plan curvature (χ²=83.935), Profile curvature (χ²=79.036), Lineament density (χ²=77.315), Land cover (χ²=56.995), Distance to lineament (χ²=30.267), Curvature (χ²=16.465), Lithology (χ²=10.313), and Distance to rivers (χ²=4.173), respectively. The results of the analysis show that the road network is the most influential factor for the landslide models. Since all sixteen landslide influencing factors contribute to the landslide models, they have all been selected for landslide susceptibility analysis in the present case. The results of this selection task have then been used to calculate the weights assigned to the factors in generating the datasets for constructing and validating the models.
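For orientation, a chi-square score of this kind can be computed from a 2×2 table of counts. The sketch below uses the standard contingency-table form of the statistic, which is an assumption: its pairing of the four counts may differ from Eq. (4) as printed, and n is taken here as the sum of the four counts. The counts themselves are hypothetical and do not reproduce the χ² values reported above.

```python
def chi_square_2x2(n1, n2, n3, n4):
    """Chi-square statistic of a 2x2 table in the standard contingency form.

    n1: class c with term t        n2: other class with term t
    n3: class c without term t     n4: other class without term t
    """
    n = n1 + n2 + n3 + n4
    numerator = n * (n1 * n4 - n2 * n3) ** 2
    denominator = (n1 + n2) * (n3 + n4) * (n1 + n3) * (n2 + n4)
    return numerator / denominator if denominator else 0.0


# Hypothetical example: "term" = pixel lies within 50 m of a road,
# "class c" = landslide pixel, using 906 pixels per class.
near_road = chi_square_2x2(n1=600, n2=150, n3=306, n4=756)
independent = chi_square_2x2(50, 50, 50, 50)
print(near_road, independent)
```

A factor class that is distributed identically in both classes scores zero, and stronger association with landslide occurrence gives a larger value.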

5.2 Model Performance
The results of validation of the various models using statistical indices are shown in Table 1, Fig. 6 and Table 2. The results indicate that the RF model has the highest statistical index values (positive predictive value, negative predictive value, sensitivity, specificity, accuracy, and kappa), followed by the LMT model, the BFDT model, and the CART model (Table 1). It has also been noticed that the CART model has the highest classification error index (RMSE), followed by the BFDT, LMT and the RF models, respectively. The analysis of results of the various landslide models using the ROC curve

Fig. 5 Importance of landslide influencing factors using the chi-square method

Table 1 Statistical index values based on the evaluation of the performance of the various landslide models

No  Parameters                     RF      LMT     BFDT    CART
1   Positive predictive value (%)  92.61   91.03   88.65   86.28
2   Negative predictive value (%)  97.42   94.07   93.81   94.33
3   Sensitivity (%)                97.23   93.75   93.33   93.70
4   Specificity (%)                93.10   91.48   89.43   87.56
5   Accuracy (%)                   95.05   92.57   91.26   90.35
6   Kappa                          0.9008  0.8513  0.8252  0.8068
7   Root Mean Squared Error        0.2083  0.265   0.2802  0.2907

method indicates that the RF model has the highest value of AUC (0.985), followed by the LMT model (0.945), the BFDT model (0.934), and the CART model (0.933), respectively (Fig. 6). The results of the pairwise comparison of the ROC curves show that the difference between the RF model and each of the other models (LMT, BFDT and CART) is statistically significant (p < 0.001), but the differences among the LMT, BFDT and CART models are not statistically significant (p > 0.05) (Table 2).
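The entries of Table 2 are related in a simple way that can be checked by hand. The sketch below assumes a two-sided test based on the normal approximation, which the text does not state explicitly, and recomputes the z statistic, significance level and 95% confidence interval from the difference between areas and its standard error for the LMT ~ CART pair.

```python
import math


def auc_difference_test(difference, standard_error):
    """z statistic, two-sided p-value and 95% CI for a difference in AUC.

    Assumption: normal approximation with a two-sided alternative.
    """
    z = difference / standard_error
    p_value = math.erfc(abs(z) / math.sqrt(2.0))
    half_width = 1.96 * standard_error
    return z, p_value, (difference - half_width, difference + half_width)


# LMT ~ CART pair: difference and standard error as reported in Table 2.
z, p_value, ci = auc_difference_test(0.00972, 0.00934)
print(round(z, 3), round(p_value, 3), [round(v, 3) for v in ci])
```

Because the confidence interval for this pair spans zero, the difference between the two AUC values is not significant at the 5% level, in line with the reported p-value.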

5.3 Landslide Susceptibility Maps Developed Using Various DT Models
Four landslide susceptibility maps have been generated using the RF, LMT, BFDT and CART models, showing five susceptibility categories (very low, low, moderate, high and very high) obtained by reclassifying the landslide susceptibility indices with the natural breaks method (Fig. 7). Validation of these maps has been done based on the landslide density on the maps, calculated from the distribution of landslide pixels and class pixels (Pham et al. 2016g) when the landslide inventory map was combined with the landslide susceptibility map. The results of the analysis show that all maps are reliable for susceptibility assessment, as landslide density increases gradually from the very low class to the very high class (Table 3). The results of the landslide density analysis also show that the landslide density value in the very high class of the RF model is the highest compared to those of the other models (LMT, BFDT and CART).
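The text does not give the formula for landslide density. One common reading, assumed in the sketch below, is the ratio between a class's share of landslide pixels and its share of all map pixels. The pixel counts are hypothetical and are not those behind Table 3.

```python
def landslide_density(landslide_pixels, class_pixels):
    """Share of landslide pixels in a class divided by the class's share of area.

    Assumption: this ratio is what "landslide density" denotes here.
    """
    total_landslide = sum(landslide_pixels.values())
    total_pixels = sum(class_pixels.values())
    return {
        name: (landslide_pixels[name] / total_landslide)
        / (class_pixels[name] / total_pixels)
        for name in class_pixels
    }


# Hypothetical pixel counts per susceptibility class.
class_pixels = {"very low": 4000, "low": 3000, "moderate": 1500,
                "high": 1000, "very high": 500}
landslide_pixels = {"very low": 2, "low": 8, "moderate": 20,
                    "high": 70, "very high": 100}

density = landslide_density(landslide_pixels, class_pixels)
for name, value in density.items():
    print(name, round(value, 3))
```

Under this reading, a value above 1 means the class holds more landslides than its area alone would suggest, so a reliable map shows values rising from the very low to the very high class.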

Fig. 6 Analysis of the ROC curves of the various decision tree-based models (horizontal axis: 100-Specificity; legend: RF, AUC = 0.985; LMT, AUC = 0.945; BFDT, AUC = 0.934; CART, AUC = 0.933)

Table 2 Pairwise comparison of the ROC curves of the various decision tree-based classifiers

No  Pair comparison           RF ~ LMT        RF ~ CART       RF ~ BFDT      LMT ~ CART       LMT ~ BFDT       CART ~ BFDT
1   Difference between areas  0.0408          0.0505          0.0523         0.00972          0.0115           0.0018
2   Standard Error            0.00753         0.00834         0.00893        0.00934          0.0101           0.00734
3   95% CI                    0.026 to 0.056  0.034 to 0.067  0.035 to 0.07  −0.009 to 0.028  −0.008 to 0.031  −0.013 to 0.016
4   z statistic               5.411           6.050           5.854          1.041            1.145            0.246
5   Significance level        p < 0.001       p < 0.001       p < 0.001      p = 0.298        p = 0.252        p = 0.806

6 Discussion
Landslide susceptibility modeling is one of the necessary and difficult tasks in landslide hazard management (Guzzetti 2006). Even though several methods have been proposed and applied for landslide susceptibility modeling, this task still faces the problem of how to choose the most suitable method for each study region (Pham et al. 2015b; Pradhan

Fig. 7 Landslide susceptibility maps using the various landslide models

Table 3 Landslide density on various landslide susceptibility maps

No  Susceptible classes  Landslide density
                         RF     LMT    BFDT   CART
1   Very low             0.00   0.01   0.02   0.03
2   Low                  0.02   0.14   0.27   0.31
3   Moderate             0.03   0.41   1.24   0.30
4   High                 0.36   1.32   5.99   4.19
5   Very high            27.77  12.05  12.30  10.05

2013). In the present study, decision tree-based methods (i.e., RF, LMT, BFDT and CART) have been investigated and their results compared to identify the best method for landslide susceptibility modeling in a part of the Pauri Garhwal area, Uttarakhand state, Himalaya, India. Pradhan (2013) and Pham et al. (2015b) stated that the predictive capability of landslide models depends significantly on the quality of data. Therefore, the predictive capability of the landslide influencing factors should be evaluated before the model learning process (Pham et al. 2015b). Using the results of this task, the weights of the factors can be calculated and assigned for generating the final datasets for training and validating the models (Pham et al. 2016g). In the present study, the chi-square feature selection method was applied to check the predictive capability of sixteen landslide influencing factors. The result shows that Distance to roads is the most influential factor for the landslide models. This has also been observed in other areas, where road excavation creates instability in the rock mass/ground mass along and adjacent to roads. The Slope angle also makes a major contribution to the landslide models in the present study, which is comparable with other studies (Ohlmacher and Davis 2003; van den Eeckhaut et al. 2006). The results of validation of the different models using statistical indices indicate that the RF model has the highest statistical index values, followed by the LMT model, the BFDT model and the CART model. Similar results have been obtained by the analysis of the different landslide models using the ROC curve method. The modeling results indicate that all four landslide models performed well for susceptibility modeling, but the RF model demonstrated the best performance, followed by the LMT model, the BFDT model and the CART model.
This is because the RF method has several advantages that help improve its performance: (i) it does not need assumptions on the distribution of explicative factors; (ii) it is capable of calculating interactions between factors; (iii) the random predictor selection used in RF keeps bias low; and (iv) it is able to deal with unbalanced data and over-fitting (Aertsen et al. 2010; Prasad et al. 2006). In addition, the LMT model result corroborates the observations of other researchers, as it has good predictive capability for landslide susceptibility assessment (Akgun 2012; van den Eeckhaut et al. 2006). The results of the BFDT and CART methods are mostly comparable to those observed by other researchers (Shi 2007). Landslide susceptibility maps have been developed using the different DT models (RF, LMT, BFDT and CART). Among these maps, the susceptibility map developed using the RF model can be considered the most reliable, as its landslide density in the very high susceptibility class is the highest among the four models.
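The statistical indices used to rank the models can be illustrated with a short sketch. The code below is a minimal example, not the authors' implementation: it assumes the standard confusion-matrix definitions of the indices and Cohen's formulation of kappa, and the function names (`validation_indices`, `rmse`) and all counts and probabilities are hypothetical values chosen only for illustration.

```python
from math import sqrt


def validation_indices(tp, tn, fp, fn):
    """Compute confusion-matrix indices for a binary landslide classifier.

    tp/fn: landslide pixels classified as landslide / non-landslide.
    tn/fp: non-landslide pixels classified as non-landslide / landslide.
    Standard textbook definitions are assumed here.
    """
    n = tp + tn + fp + fn
    accuracy = (tp + tn) / n
    # Agreement expected by chance, as used in Cohen's kappa (assumed form).
    chance_landslide = ((tp + fp) / n) * ((tp + fn) / n)
    chance_stable = ((tn + fn) / n) * ((tn + fp) / n)
    expected = chance_landslide + chance_stable
    return {
        "ppv": tp / (tp + fp),          # positive predictive value
        "npv": tn / (tn + fn),          # negative predictive value
        "sensitivity": tp / (tp + fn),
        "specificity": tn / (tn + fp),
        "accuracy": accuracy,
        "kappa": (accuracy - expected) / (1 - expected),
    }


def rmse(observed, predicted):
    """Root mean squared error between 0/1 labels and predicted probabilities."""
    squared = [(o - p) ** 2 for o, p in zip(observed, predicted)]
    return sqrt(sum(squared) / len(squared))


# Hypothetical validation counts and probabilities, for illustration only.
scores = validation_indices(tp=80, tn=70, fp=30, fn=20)
error = rmse([1, 0, 1, 0], [0.9, 0.2, 0.6, 0.1])

for name, value in scores.items():
    print(f"{name}: {value:.3f}")
print(f"rmse: {error:.3f}")
```

For every index except RMSE a higher value indicates a better model, whereas a lower RMSE is better, so a ranking such as RF, LMT, BFDT, CART should be read index by index.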


7 Conclusions

Landslide susceptibility assessment of the Pauri Garhwal area of Uttarakhand state, India, has been conducted using decision tree-based machine learning methods, namely RF, LMT, BFDT and CART. Landslide affecting factors, such as elevation, slope angle, slope aspect, curvature, plan curvature, profile curvature, lithology, distance to lineaments, lineament density, soil type, land cover, rainfall, distance to roads, road density, distance to river and river density, have been analyzed and selected for landslide modeling in conjunction with 1295 historical landslide locations. The ROC curve and statistical indices, such as the positive predictive value, negative predictive value, sensitivity, specificity, accuracy, kappa and RMSE, have been used to validate the performance of the models. The results of the analysis show that although all four landslide models have good predictive capability for landslide susceptibility modeling, the RF model showed the best performance, followed by the LMT, BFDT and CART models. Thus, RF is a promising method for landslide susceptibility assessment. Based on the RF, LMT, BFDT and CART methods, landslide susceptibility maps have been developed and compared. The results indicate that the RF model is the better choice for land use planning and management of landslide-prone areas.

Acknowledgements The authors are thankful to the Director, Bhaskaracharya Institute for Space Applications and Geo-Informatics (BISAG), Department of Science & Technology, Government of Gujarat, Gandhinagar, Gujarat, India, for providing facilities to carry out this research work.

References

Aertsen W, Kint V, van Orshoven J, Özkan K, Muys B (2010) Comparison and ranking of different modelling techniques for prediction of site index in Mediterranean mountain forests. Ecol Model 221:1119–1130
Akgun A (2012) A comparison of landslide susceptibility maps produced by logistic regression, multi-criteria decision, and likelihood ratio methods: a case study at İzmir, Turkey. Landslides 9:93–106
Bai S-B, Wang J, Lü G-N, Zhou P-G, Hou S-S, Xu S-N (2010) GIS-based logistic regression for landslide susceptibility mapping of the Zhongxian segment in the Three Gorges area, China. Geomorphology 115:23–31
Bai S, Wang J, Zhang Z, Cheng C (2012) Combined landslide susceptibility mapping after Wenchuan earthquake at the Zhouqu segment in the Bailongjiang Basin, China. Catena 99:18–25. doi:10.1016/j.catena.2012.06.012
Bennett ND, Croke BF, Guariso G, Guillaume JH, Hamilton SH, Jakeman AJ, Marsili-Libelli S, Newham LT, Norton JP, Perrin C (2013) Characterising performance of environmental models. Environ Model Softw 40:1–20
Breiman L (2001) Random forests. Mach Learn 45:5–32
Breiman L, Friedman J, Stone CJ, Olshen RA (1984) Classification and regression trees. CRC Press, Boca Raton
Catani F, Lagomarsino D, Segoni S, Tofani V (2013) Landslide susceptibility estimation by random forests technique: sensitivity and scaling issues. Nat Hazards Earth Syst Sci 13:2815–2831
Cevik E, Topal T (2003) GIS-based landslide susceptibility mapping for a problematic segment of the natural gas pipeline, Hendek (Turkey). Environ Geol 44:949–962
Chakraborty S, Pradhan R (2012) Development of GIS based landslide information system for the region of East Sikkim. Int J Comput Appl 49:5–9
Chang K-T, Chiang S-H, Hsu M-L (2007) Modeling typhoon- and earthquake-induced landslides in a mountainous watershed using logistic regression. Geomorphology 89:335–347
Chen W, Pourghasemi HR, Naghibi SA (2017) Prioritization of landslide conditioning factors and its spatial modeling in Shangnan County, China using GIS-based data mining algorithms. Bull Eng Geol Environ 75:1–19
Das I, Stein A, Kerle N, Dadhwal VK (2012) Landslide susceptibility mapping along road corridors in the Indian Himalayas using Bayesian logistic regression models. Geomorphology 179:116–125. doi:10.1016/j.geomorph.2012.08.004
Fisher DF, Monty RA, Glucksberg S (1969) Visual confusion matrices: fact or artifact? J Psychol 71:111–125

García-Rodríguez MJ, Malpica JA, Benito B, Díaz M (2008) Susceptibility assessment of earthquake-triggered landslides in El Salvador using logistic regression. Geomorphology 95:172–191. doi:10.1016/j.geomorph.2007.06.001
Guzzetti F (2006) Landslide Hazard and Risk Assessment. PhD thesis. University of Bonn
Guzzetti F, Reichenbach P, Cardinali M, Galli M, Ardizzone F (2005) Probabilistic landslide hazard assessment at the basin scale. Geomorphology 72:272–299
Hanley JA, McNeil BJ (1982) The meaning and use of the area under a receiver operating characteristic (ROC) curve. Radiology 143:29–36
Hong H, Pradhan B, Xu C, Bui DT (2015) Spatial prediction of landslide hazard at the Yihuang area (China) using two-class kernel logistic regression, alternating decision tree and support vector machines. Catena 133:266–281
Hong H, Naghibi SA, Pourghasemi HR, Pradhan B (2016a) GIS-based landslide spatial modeling in Ganzhou City, China. Arab J Geosci 9:1–26
Hong H, Pourghasemi HR, Pourtaghi ZS (2016b) Landslide susceptibility assessment in Lianhua County (China): a comparison between a random forest data mining technique and bivariate and multivariate statistical models. Geomorphology 259:105–118
Hong H, Pradhan B, Bui DT, Xu C, Youssef AM, Chen W (2016c) Comparison of four kernel functions used in support vector machines for landslide susceptibility mapping: a case study at Suichuan area (China). Geomat Nat Haz Risk:1–26
Hong H, Pradhan B, Jebur MN, Bui DT, Xu C, Akgun A (2016d) Spatial prediction of landslide hazard at the Luxi area (China) using support vector machines. Environ Earth Sci 75:40
Hong H, Chen W, Xu C, Youssef AM, Pradhan B, Tien Bui D (2017a) Rainfall-induced landslide susceptibility assessment at the Chongren area (China) using frequency ratio, certainty factor, and index of entropy. Geocarto Int 32:139–154
Hong H, Ilia I, Tsangaratos P, Chen W, Xu C (2017b) A hybrid fuzzy weight of evidence method in landslide susceptibility analysis on the Wuyuan area, China. Geomorphology 290:1–16
Ilia I, Tsangaratos P (2016) Applying weight of evidence method and sensitivity analysis to produce a landslide susceptibility map. Landslides 13:379–397
Jin X, Xu A, Bie R, Guo P (2006) Machine learning techniques and chi-square feature selection for cancer classification using SAGE gene expression profiles. In: International Workshop on Data Mining for Biomedical Applications. Springer, pp 106–115
Kanungo D, Arora M, Sarkar S, Gupta R (2009) Landslide susceptibility zonation (LSZ) mapping-a review. J South Asia Disaster Stud 2:81–105
Landwehr N, Hall M, Frank E (2005) Logistic model trees. Mach Learn 59:161–205
Lee S, Min K (2001) Statistical analysis of landslide susceptibility at Yongin, Korea. Environ Geol 40:1095–1113
Liaw A, Wiener M (2002) Classification and regression by randomForest. R News 2:18–22
Naghibi SA, Pourghasemi HR, Dixon B (2016) GIS-based groundwater potential mapping using boosted regression tree, classification and regression tree, and random forest machine learning models in Iran. Environ Monit Assess 188:1–27
Nayab N, Scheid J (2011) Disadvantages to Using Decision Trees. http://www.brighthubpmcom/projectplanning/106005-disadvantages-to-using-decision-trees/
NCEP (2014) Global Weather Data for SWAT. http://globalweather.tamu.edu/home
Oh H-J, Pradhan B (2011) Application of a neuro-fuzzy model to landslide-susceptibility mapping for shallow landslides in a tropical hilly area. Comput Geosci 37:1264–1276
Ohlmacher GC (2007) Plan curvature and landslide probability in regions dominated by earth flows and earth slides. Eng Geol 91:117–134
Ohlmacher GC, Davis JC (2003) Using multiple logistic regression and GIS technology to predict landslide hazard in northeast Kansas, USA. Eng Geol 69:331–343
Pham BT, Tien Bui D, Indra P, Dholakia M (2015a) Landslide susceptibility assessment at a part of Uttarakhand Himalaya, India using GIS–based statistical approach of frequency ratio method. Int J Eng Res Technol 4:338–344
Pham BT, Tien Bui D, Pourghasemi HR, Indra P, Dholakia MB (2015b) Landslide susceptibility assesssment in the Uttarakhand area (India) using GIS: a comparison study of prediction capability of naïve Bayes, multilayer perceptron neural networks, and functional trees methods. Theor Appl Climatol 122:1–19. doi:10.1007/s00704-015-1702-9
Pham BT, Bui DT, Dholakia MB, Prakash I, Pham HV, Mehmood K, Le HQ (2016a) A novel ensemble classifier of rotation forest and Naïve Bayer for landslide susceptibility assessment at the Luc Yen District, Yen Bai Province (Viet Nam) using GIS. Geomat Nat Haz Risk:1–23. doi:10.1080/19475705.2016.1255667
Pham BT, Bui DT, Prakash I, Dholakia M (2016b) Evaluation of predictive ability of support vector machines and naive Bayes trees methods for spatial prediction of landslides in Uttarakhand state (India) using GIS. J Geom 10:71–79

Pham BT, Pradhan B, Tien Bui D, Prakash I, Dholakia MB (2016c) A comparative study of different machine learning methods for landslide susceptibility assessment: a case study of Uttarakhand area (India). Environ Model Softw 84:240–250. doi:10.1016/j.envsoft.2016.07.005
Pham BT, Tien Bui D, Dholakia MB, Prakash I, Pham HV (2016d) A comparative study of least square support vector machines and multiclass alternating decision trees for spatial prediction of rainfall-induced landslides in a tropical cyclones area. Geotech Geol Eng 34:1–18. doi:10.1007/s10706-016-9990-0
Pham BT, Tien Bui D, Pham HV (2016e) Spatial prediction of rainfall induced landslides using Bayesian network at Luc Yen District, Yen Bai Province (Viet Nam). In: International Conference on Environmental Issues in Mining and Natural Resources Development (EMNR 2016), Hanoi University of Mining and Geology (HUMG), Viet Nam, pp 1–10
Pham BT, Tien Bui D, Pham HV, Le HQ, Prakash I, Dholakia MB (2016f) Landslide hazard assessment using random subspace fuzzy rules based classifier ensemble and probability analysis of rainfall data: a case study at mu Cang Chai District, Yen Bai Province (Viet Nam). J Indian Soc Remote Sens 35:1–11. doi:10.1007/s12524-016-0620-3
Pham BT, Tien Bui D, Prakash I, Dholakia MB (2016g) Rotation forest fuzzy rule-based classifier ensemble for spatial prediction of landslides using GIS. Nat Hazards 83:1–31. doi:10.1007/s11069-016-2304-2
Pham BT, Tien Bui D, Prakash I, Dholakia MB (2017a) Hybrid integration of multilayer perceptron neural networks and machine learning ensembles for landslide susceptibility assessment at Himalayan area (India) using GIS. Catena 149, Part 1:52–63. doi:10.1016/j.catena.2016.09.007
Pham BT, Tien Bui D, Prakash I, Nguyen LH, Dholakia MB (2017b) A comparative study of sequential minimal optimization-based support vector machines, vote feature intervals, and logistic regression in landslide susceptibility assessment using GIS. Environ Earth Sci 76:371. doi:10.1007/s12665-017-6689-3
Pham BT, Bui DT, Prakash I (2017c) Landslide susceptibility assessment using bagging ensemble based alternating decision trees, logistic regression and J48 decision trees methods: a comparative study. Geotech Geol Eng:1–15. doi:10.1007/s10706-017-0264-2
Pourghasemi HR, Kerle N (2016) Random forests and evidential belief function-based landslide susceptibility assessment in western Mazandaran Province, Iran. Environ Earth Sci 75:1–17
Pourghasemi HR, Rossi M (2016) Landslide susceptibility modeling in a landslide prone area in Mazandarn Province, north of Iran: a comparison between GLM, GAM, MARS, and M-AHP methods. Theor Appl Climatol:1–25
Pourghasemi HR, Pradhan B, Gokceoglu C (2012) Application of fuzzy logic and analytical hierarchy process (AHP) to landslide susceptibility mapping at Haraz watershed, Iran. Nat Hazards 63:965–996
Pradhan B (2013) A comparative study on the predictive ability of the decision tree, support vector machine and neuro-fuzzy models in landslide susceptibility mapping using GIS. Comput Geosci 51:350–365. doi:10.1016/j.cageo.2012.08.023
Prasad AM, Iverson LR, Liaw A (2006) Newer classification and regression tree techniques: bagging and random forests for ecological prediction. Ecosystems 9:181–199
Rahmati O, Pourghasemi HR, Melesse AM (2016) Application of GIS-based data driven random forest and maximum entropy models for groundwater potential mapping: a case study at Mehran region, Iran. Catena 137:360–372
Saito H, Nakayama D, Matsuyama H (2009) Comparison of landslide susceptibility based on a decision-tree model and actual landslide occurrence: the Akaishi Mountains, Japan. Geomorphology 109:108–121
Sarkar S, Kanungo D (2004) An integrated approach for landslide susceptibility mapping using remote sensing and GIS. Photogramm Eng Remote Sens 70:617–625
Shi H (2007) Best-first decision tree learning. PhD thesis. The University of Waikato
Shirzadi A, Bui DT, Pham BT, Solaimani K, Chapi K, Kavian A, Shahabi H, Revhaug I (2017a) Shallow landslide susceptibility assessment using a novel hybrid intelligence approach. Environ Earth Sci 76:60
Shirzadi A, Shahabi H, Chapi K, Bui DT, Pham BT, Shahedi K, Ahmad BB (2017b) A comparative study between popular statistical and machine learning methods for simulating volume of landslides. Catena 157:213–226
Sidle RC, Ochiai H (2006) Landslides: Processes, Prediction, and Land Use. Vol 18. American Geophysical Union
Singh SR, Murthy HA, Gonsalves TA (2010) Feature selection for text classification based on Gini coefficient of inequality. International Conference on Fuzzy System and Data Mining 10:76–85
Tien Bui D, Ho T-C, Pradhan B, Pham B-T, Nhu V-H, Revhaug I (2016a) GIS-based modeling of rainfall-induced landslides using data mining-based functional trees classifier with AdaBoost, bagging, and MultiBoost ensemble frameworks. Environ Earth Sci 75:1–22. doi:10.1007/s12665-016-5919-4
Tien Bui D, Pham BT, Nguyen QP, Hoang N-D (2016b) Spatial prediction of rainfall-induced shallow landslides using hybrid integration approach of least-squares support vector machines and differential evolution optimization: a case study in Central Vietnam. Int J Digital Earth 9:1–21. doi:10.1080/17538947.2016.1169561
Tsangaratos P, Benardos A (2014) Estimating landslide susceptibility through a artificial neural network classifier. Nat Hazards 74:1489–1516

Tsangaratos P, Ilia I (2016a) Comparison of a logistic regression and Naïve Bayes classifier in landslide susceptibility assessments: the influence of models complexity and training dataset size. Catena 145:164–179
Tsangaratos P, Ilia I (2016b) Landslide susceptibility mapping using a modified decision tree classifier in the Xanthi perfection, Greece. Landslides 13:305–320
Tsangaratos P, Ilia I, Hong H, Chen W, Xu C (2016) Applying information theory and GIS-based quantitative methods to produce landslide susceptibility maps in Nancheng County, China. Landslides:1–21
van den Eeckhaut M, Vanwalleghem T, Poesen J, Govers G, Verstraeten G, Vandekerckhove L (2006) Prediction of landslide susceptibility using rare events logistic regression: a case-study in the Flemish Ardennes (Belgium). Geomorphology 76:392–410. doi:10.1016/j.geomorph.2005.12.003
Varnes DJ (1984) Landslide hazard zonation: a review of principles and practice. Vol 3. UNESCO, Paris
Wu X, Kumar V, Quinlan JR, Ghosh J, Yang Q, Motoda H, McLachlan GJ, Ng A, Liu B, Philip SY (2008) Top 10 algorithms in data mining. Knowl Inf Syst 14:1–37
Yalcin A, Reis S, Aydinoglu AC, Yomralioglu T (2011) A GIS-based comparative study of frequency ratio, analytical hierarchy process, bivariate statistics and logistics regression methods for landslide susceptibility mapping in Trabzon, NE Turkey. Catena 85:274–287. doi:10.1016/j.catena.2011.01.014
Yesilnacar E, Topal T (2005) Landslide susceptibility mapping: a comparison of logistic regression and neural networks methods in a medium scale study, Hendek region (Turkey). Eng Geol 79:251–266
Youssef AM, Pradhan B, Pourghasemi HR, Abdullahi S (2015) Landslide susceptibility assessment at Wadi Jawrah Basin, Jizan region, Saudi Arabia using two bivariate models in GIS. Geosci J:1–21
Youssef AM, Pourghasemi HR, Pourtaghi ZS, Al-Katheeri MM (2016) Landslide susceptibility mapping using random forest, boosted regression tree, classification and regression tree, and general linear models and comparison of their performance at Wadi Tayyah Basin, Asir region, Saudi Arabia. Landslides 13:839–856
Zabihi M, Pourghasemi HR, Pourtaghi ZS, Behzadfar M (2016) GIS-based multivariate adaptive regression spline and random forest models for groundwater potential mapping in Iran. Environ Earth Sci 75:1–19
Zheng Z, Wu X, Srihari R (2004) Feature selection for text categorization on imbalanced data. ACM SIGKDD Explor Newsl 6:80–89
