EARTH SURFACE PROCESSES AND LANDFORMS
Earth Surf. Process. Landforms 37, 607–619 (2012)
Copyright © 2012 John Wiley & Sons, Ltd. Published online 1 January 2012 in Wiley Online Library (wileyonlinelibrary.com) DOI: 10.1002/esp.2273
Predicting gully initiation: comparing data mining techniques, analytical hierarchy processes and the topographic threshold

Tal Svoray,1* Evgenia Michailov,2 Avraham Cohen,2 Lior Rokach2 and Arnon Sturm2

1 Department of Geography and Environmental Development, Ben-Gurion University of the Negev, Beer-Sheva 84105, Israel
2 Department of Information Systems Engineering, Ben-Gurion University of the Negev, Beer-Sheva 84105, Israel

Received 23 March 2011; Revised 6 November 2011; Accepted 17 November 2011

*Correspondence to: T. Svoray, Department of Geography and Environmental Development, Ben-Gurion University of the Negev, Beer-Sheva 84105, Israel. E-mail: [email protected]
ABSTRACT: Predicting gully initiation at catchment scale was done previously by integrating a geographical information system (GIS) with physically based models, statistical procedures or with knowledge-based expert systems. However, the reliability and validity of applying these procedures are still questionable. In this work, a data mining (DM) procedure based on decision trees was applied to identify areas of gully initiation risk. Performance was compared with the analytic hierarchy process (AHP) expert system and with the commonly used topographic threshold (TT) technique. A spatial database was used to test the models, composed of a target variable (presence or absence of initiation points) and ten independent environmental, climatic and human-induced variables. The following findings emerged: using the same input layers, DM provided better predictive ability of gully initiation points than the application of both AHP and TT. The main difference between DM and TT was the very high overestimation inherent in TT. In addition, the minimum slope observed for soil detachment was 2°, whereas in other studies it is 3°. This could be explained by soil resistance, which is substantially lower in agricultural fields, while most studies test unploughed soil. Finally, rainfall intensity events >62.2 mm h⁻¹ (for a period of 30 min) were found to have a significant effect on gully initiation. Copyright © 2012 John Wiley & Sons, Ltd.

KEYWORDS: AHP; data mining; ephemeral gullies; GIS; land degradation; topographic threshold
Introduction

Erosion processes in agricultural fields can cause major environmental damage through soil loss (Trimble and Crosson, 2000; Van Rompaey et al., 2003). Since the mid-1980s, much attention has been given to understanding and predicting gully initiation (Cheng et al., 2006), which, of all erosion processes, is now regarded as one of the most destructive mechanisms affecting agricultural soils (De-Santisteban et al., 2006). Gully erosion represents an important – if not the dominant – sediment source within catchments, while the gullies themselves constitute effective links for transferring runoff, sediment and other materials from source to sink. They thereby play an important role in increasing connectivity at the landscape scale (Casali et al., 2009). This is probably why, in recent years, efforts have been made to use field measurements, as well as numerical model simulations, to study gully processes and gully dynamics (Kirkby and Bracken, 2009). Special attention has been given in previous research to the following questions: What causes gully initiation? Where is it most likely to occur? And what can be done to prevent gully creation? There are several ways to predict gully initiation using, mainly, physically based models. However, in recent years, various computer-supported methods have been used to predict gully occurrence. Among the several computational methods, there
are two very different approaches to predicting gullying: (1) systems based on expert knowledge and experience; and (2) empirical methods. An expert-based system enables human experts to integrate and translate their quantitative and qualitative knowledge into computer language, using formal and controlled procedures (Malczewski, 2004). The method assumes that the expert understands the mechanisms studied and that this knowledge can be translated accurately into computer language. Among expert-based methods, the most intuitive are the analytical hierarchy process (AHP) mechanisms (Saaty, 1977; Malczewski, 1999). In general, a typical AHP includes two main procedures: scoring and interpolating. Pairwise comparison is an advanced scoring method for examining real-world conditions in a relatively reliable way. Using this method, a matrix is developed in which every criterion is accorded a value based on its importance in relation to all other criteria, and the weight of each criterion's relative importance is calculated. Once the weights of importance for each criterion are established, the variables can be combined and interpolated over the entire study area. Methods of interpolating AHPs can be classified into two main groups, based on the level of analysis required from decision-makers and experts and according to the methods of ranking and developing weights per variable. The first group includes compensatory methods, in which a scale of
adjustment levels and index weights that compensate one another is used. The compensatory approach is demanding, since it requires that decision-makers and experts place the range of criteria on a scale according to specific adjustment levels, in addition to developing each criterion's weight. In the second group, which includes non-compensatory methods, the comparison between the alternatives (land use, for example) is carried out with no option of alternating and compensating between the internal scale (the range of criterion values) and the weights of each criterion. Entailing, at most, only a serial ranking of the criteria, this approach requires less attention from decision-makers and ranking experts (Jankowski, 1995). Weighted linear combination (WLC) is a common compensatory method for estimating and implementing numerous criteria in a geographical information system (GIS). This method simply combines successive variables on a linear basis, forming points of adaptability for specific purposes. In spatial studies, AHP analyses, combined with GIS, have improved the ability of classic multi-layered analysis and process-based models (Collins et al., 2001) to predict spatial phenomena. Thus, the method has found its way into many fields of decision-making, such as forestation (Gilliams et al., 2005). Furthermore, AHP has been widely used in such fields as engineering geology (Dai et al., 2001) and geomorphology (Ni and Li, 2003; Ni et al., 2008; Svoray and Ben-Said, 2010). However, although previous studies show the potential of AHP in geomorphology, several aspects of AHP require further elucidation: the reliability of the experts and of the knowledge mining procedures, and the algorithms applied to interpolate the predictions in space. On the other hand, a purely empirical method – the topographic threshold (TT) method – can predict the threshold value of initiation based on data of observed gully initiation points.
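The pairwise-comparison scoring and WLC interpolation described above can be sketched as follows. This is a minimal illustration, not the paper's implementation: the comparison matrix and criterion values are invented, and the column-normalisation shortcut only approximates Saaty's principal-eigenvector weights.

```python
def ahp_weights(pairwise):
    """Approximate AHP priority vector: normalise each column of the
    pairwise-comparison matrix, then average across each row (the common
    column-normalisation shortcut to Saaty's principal eigenvector)."""
    n = len(pairwise)
    col_sums = [sum(row[j] for row in pairwise) for j in range(n)]
    return [sum(pairwise[i][j] / col_sums[j] for j in range(n)) / n
            for i in range(n)]

def wlc(values, weights):
    """Weighted linear combination: criterion values, already rescaled to a
    common 0-1 suitability scale, combined linearly with the AHP weights."""
    return sum(v * w for v, w in zip(values, weights))

# Invented 3-criterion comparison on Saaty's 1-9 scale: criterion A judged
# 3x as important as B and 5x as important as C.
M = [[1, 3, 5],
     [1 / 3, 1, 2],
     [1 / 5, 1 / 2, 1]]
w = ahp_weights(M)                 # weights sum to 1, with w[0] > w[1] > w[2]
score = wlc([0.8, 0.2, 0.5], w)    # suitability score of one hypothetical cell
```

Applying `wlc` cell by cell over the rasterised criteria is what interpolates the expert weights across the study area.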
Several studies of the slope and area required to support channel head incision have found that an inverse relationship exists between the upslope contributing area and the local slope in different environments (Montgomery and Dietrich, 1988; Hancock and Evans, 2006; Svoray and Markovitch, 2009). According to this approach, runoff volume may increase proportionally to catchment area, and gully erosion may take place where the threshold value predicted by the empirical TT is exceeded. Although the TT method is commonly used, several studies have found that it may induce errors in predicting gully initiation (Chaplot et al., 2005). A more complex empirical method is the knowledge discovery database (KDD) method, which is established mathematically on the basis of data mining (DM) techniques. KDD is the process of identifying valid, novel, useful, and understandable patterns from large datasets (Fayyad et al., 1996). With the ever-increasing rate of data accumulation, KDD is becoming an important tool for transforming these data into useful information. KDD techniques have been used in geomorphological studies, for example for landslide susceptibility zonation (Melchiorre et al., 2008); for evaluating sedimentation vulnerability (Hentati et al., 2010); to determine land stability (Pavel et al., 2008) and to study gully initiation (Gutierrez et al., 2009a). Data mining is the mathematical core of the KDD process (Maimon and Rokach, 2005), involving the inference of algorithms that explore the data, develop mathematical models and discover significant patterns (implicit or explicit) – which are the essence of useful knowledge. In general, DM includes three stages: the first, pre-processing, includes data assemblage, cleaning and division into training and validation sets; the second involves the actual DM algorithms; and the third and final step includes validation of the results against observations.
DM can be used for various tasks such as classification, clustering and regression. Here, we focus on a binary classification task, where the goal is to classify points into either 'gully initial point' or 'non-gully initial point'. In a typical classification
task, a training set of labelled examples is given and the goal is to form a description that can predict previously unseen examples. An inducer aims to build a classifier (also known as a classification model) by learning from a set of pre-classified instances. The classifier can then classify unlabelled instances. When acquiring new information, the expert and empirical approaches rely on different paradigms. Whereas the former approach is based on mining processed knowledge that may not be derived from field measurements of the studied area, the latter extracts information based on a well-defined database of measurements whose size determines the validity of the results. It is quite clear that the two methods are imperfect and each has its advantages and disadvantages. Regarding the expert-based approach, various questions naturally arise: Who are the experts? How reliable are they? Do they understand the underlying mechanisms? In addition, there are objective difficulties, such as the fact that the process of translating knowledge, which is often qualitative, into computer language is not fully understood. Finally, human expertise is a scarce resource that is not always available. All the above difficulties are known as the 'knowledge acquisition bottleneck'. The empirical approach also raises several questions. How representative is the database? How large does such a database need to be? And what can we infer about the underlying mechanisms from the complex statistical process? The fact that the two approaches, despite their limitations, have been found useful for understanding geomorphological and soil erosion processes calls for a quantitative comparison.
The aims of this study are therefore: (1) to apply a data mining technique to predict gully initiation points in an agricultural catchment; (2) to compare the model results with predictions made by AHP and the TT; and (3) to study the rules generated by the data mining procedure to better understand what causes gully incision.
The Study Area

The study area is the Yehezkel catchment, located in northern Israel, with a drainage area of 13 km² (Figure 1). The climate is that of a transition zone between Mediterranean and semi-arid environments, with an average annual rainfall of 450 mm and a potential evaporation of 1700 mm (Bitan and Robin, 1991). The parent rock is basalt with alluvial-clay sediments, and the slope of the cultivated fields ranges between 0° and 28°. The soils are largely alluvial (vertisols) in the centre of the study area and colluvial at the margins (Nir, 1993). Topsoil texture is generally clay, varying in space in the range 55–64% clay and in rare cases going down to 45% clay. The primary erosion agent in the study area and the entire region is water. Continuous observations show that erosion processes do not usually produce permanent channels, but ephemeral gullies that may be easily refilled by farmers during the dry season (Nir, 1989). The Yehezkel catchment suffers relatively high erosion rates, expressed in high gully-length density (Svoray and Markovitch, 2009). The agriculture here consists mainly of field crops, including wheat, sunflowers and corn, as well as orchards of citrus fruit, almonds and olives.
Methodology

Data preparation

For this study, we used the Yehezkel spatial database described previously by Svoray and Markovitch (2009) and Svoray and Ben-Said (2010). Briefly, the database includes ten variables which were used to represent the most influential factors in the studied area: five environmental, one climatic and four human-induced. Topographic indices, including slope, aspect
Figure 1. The study area. A shaded relief map of the Yehezkel catchment in Israel.
and the upslope contributing area, were calculated from a digital elevation model (DEM). The contour-based DEM was extracted from a triangulated irregular network (TIN), created from a digital topographic map with 5-m contour intervals prepared by the Survey of Israel. Each 1.5 × 1.5 m² cell in the catchment was assigned an elevation value ranging from −15 m to +495 m. The DEM was tested against field measurements and high correlation was achieved between the predicted elevation values and actual measurements (r² = 0.90; n = 20; p < 0.0001). Vertical accuracy was found to be approximately 5 m and the positional accuracy was better than 1.3 m. All artificial pits were removed from the DEM (Tarboton et al., 1989) and slope and flow accumulation were calculated per cell, using TauDEM Dinf (Tarboton, 1997). To map the study area's vegetation, rock and soil cover as input, maximum likelihood classification was applied to an orthophoto with a spatial resolution of 0.4 × 0.4 m², acquired in November 2006 under clear-sky conditions. To calculate the cover percentage of these classes, the ArcGIS workstation Fishnet operation was used to produce continuous 25 × 25 m² cells covering the entire study area. Spatial representation of rainfall intensity was obtained from meteorological radar data that covered the entire study area during a rainfall event (28/10/2006) with a return period of 20 years. From the 2006 aerial photo, and based on visual interpretation, tillage direction was digitized for all fields of the watershed. To express the effect of tillage direction on gully initiation, we used a cosine function. A land-use map with a spatial resolution of 2 × 2 m² was compiled, based on data from the National GIS of Israel, made by the Survey of Israel. Unpaved roads were manually digitized from the 2006 orthophoto, based on visual interpretation.
As derived from expert recommendations, the roads layer was divided into two criteria (occupying two separate GIS layers): (1) roads-as-runoff-contributors, enhancing the effect of roads on the contributing area downslope; and (2) roads-as-barriers to water ponding and sediment logging, enhancing the effect of roads as
barriers to water and sediment flow from upslope. The cells in the two layers were coded according to their distance from the road. Since one of the goals of this paper was to compare a DM approach with the existing AHP and TT methods, we based our analysis on the same dataset for each one of the methods. In particular, when comparing with the AHP approach, we used for validation only 32 gully initiation points, observed in the 2006 airphoto. The feature set for the comparison with AHP included four integer variables (slope, land use, aspect, and upslope contributing area); four double variables (tillage direction, rock, rainfall intensity and vegetation cover) and one enumeration variable (unpaved roads). In our comparison with the TT model, we used only 19 physically measured gully initiation points as a training set and 113 digitized points, observed in the 2006 airphoto, as a validation set. Only two features were included in the comparison with the TT technique: slope (integers) and upslope contributing area (integers).
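These two features suffice because the topographic threshold is conventionally an inverse power law between local slope S and upslope contributing area A, of the form S_cr = a·A^(−b). A minimal sketch of the resulting per-cell test, with illustrative coefficients rather than values fitted in this study:

```python
def exceeds_topographic_threshold(slope, area_m2, a=0.15, b=0.3):
    """True when local slope exceeds the power-law threshold a * A**(-b).

    slope:   local slope as a gradient (tan of the slope angle)
    area_m2: upslope contributing area in m^2
    a, b:    illustrative coefficients; in practice they are fitted to the
             slope and contributing area observed at mapped gully heads.
    """
    return slope > a * area_m2 ** (-b)

# A gentle cell draining a large area can exceed the threshold, while a
# slightly steeper cell draining almost nothing does not.
flag_large_area = exceeds_topographic_threshold(0.05, 10_000.0)
flag_small_area = exceeds_topographic_threshold(0.02, 10.0)
```

Because the test is binary, every cell above the curve is flagged, which is the root of the overestimation discussed later in the paper.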
The data mining procedure

The use of KDD to study gullying is not trivial, since the classification problem of whether a given cell is likely to include a gully initiation point or not involves several methodological challenges, primarily the 'rare cases' challenge (Weiss, 2010). The observed data in this study consist of 113 detected gully initiation points, versus millions of other raster cells within the catchment without gully initiation points. In the study of gully initiation, this is a typical phenomenon in many eroded catchments; the number of cells with initiation points is very much smaller than the number of cells without them. In such cases, there is a high chance that an overly simple classifier may classify all points in question as 'no gully initiation point'. This problem is not limited to gullies as a phenomenon, or even to geomorphology, but confronts classifiers in all fields of study. For a variety of reasons, rare cases pose difficulties for induction algorithms (Weiss, 2010). The most obvious and fundamental problem is the associated
lack of data – rare cases tend to cover only a few training examples (i.e. absolute rarity). This lack of data hampers the detection of rare cases and, even if a rare case is detected, it makes generalization difficult, since it is hard to identify regularities from only a few data points. This problem is usually referred to in the literature as prevalence (different proportions of presence/absence observations in the dataset). Beguería (2006) has shown that prevalence raises the need to use threshold-independent statistics, such as the receiver operating characteristics (ROC) curve, for the validation of predictive models. In particular, Gutiérrez et al. (2009b) analyse the role of prevalence in the success of decision trees in modelling the distribution of gullies. They conclude that prevalence has no influence on the success of the models when the training set includes a sufficiently large number of observations (approximately 500 presences and 2000 absences). Another problem associated with mining rare cases is reflected by the phrase 'like a needle in a haystack'. The difficulty is not so much that the needle is small – or that there is only one needle – but that the needle is obscured by a huge number of strands of hay. Similarly, in data mining, rare cases may be obscured by common cases (relative rarity). This is especially a problem when data mining algorithms rely on what are called 'greedy search heuristics' that examine one variable at a time, as in the case of classification trees (Rokach and Maimon, 2008). With greedy algorithms, rare cases may depend on the conjunction of many conditions, and any single condition in isolation may not provide much guidance. Another challenge is that the predictive accuracy of a classifier alone is insufficient as an evaluation criterion. One reason for this is that different validation and classification errors must be dealt with differently.
More specifically, false negative errors, i.e., mistaking 'gully' for 'non-gully', are less desirable than false positive errors. Moreover, predictive accuracy alone does not provide enough flexibility when selecting a target for gully initiation. For example, practitioners may want to examine 1% of the available potential points, but the model may predict that only 0.1% of them are gully points. To resolve the issue of rare cases, then, we need to use other validation measures, as discussed in the Validation section below. Figure 2 specifies the training process used for applying the general data mining procedure. Given the collected set of gully initiation points (step 0), we added, in step 1, random points (which are not gully initiation points) in an attempt to convert the problem into a binary classification task that can be solved with existing algorithms. The number of these points was of a magnitude of 100, with
respect to the number of gully initiation points. In a preliminary study, we examined the predictive performance (measured by the area under the ROC curve) for several ratio values, from 10 to 150 in steps of 10, using Adaboost-J48 as the classification algorithm. As can be seen in Figure 3, the area under the curve (AUC) increases with the ratio. The process converges at a ratio of approximately 90. To address the rare cases challenge, we converted the dataset into multiple balanced datasets (step 2). An inducer was executed on each balanced dataset (step 3) and a single classifier was generated for each balanced dataset. Combining all classifiers from all datasets constitutes an 'ensemble of classifiers' (Rokach, 2010). Note that each member of the ensemble was trained separately over a dataset that includes all detected gully initiation points and an equivalent set of randomly selected points with no gully initiation points. By doing so, each classifier was trained over a balanced dataset. Moreover, in a preliminary study we tried to build an Adaboost-J48 classifier without splitting the data into balanced datasets. However, we obtained poor results (an AUC of only 0.639). The incompatibility with the previous results of Gutiérrez et al. (2009b) may be explained by the fact that in this research we have only 32 presence points, while in the previous research several hundred presence points were employed. Figure 4 refers to the testing phase. In this phase, the induced classifiers were used to identify a gully initiation point. We fed each new unclassified point into each one of the classifiers trained during the training phase. Each classifier provided its estimated probability that the given point was a gully initiation point (step 2). These probabilities were averaged to provide the final classification (step 3). We applied several induction
Figure 3. The area under the ROC curve for several ratio values (no-gully to gully) using AdaBoost-J48 as the classification algorithm.
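The AUC used here as the performance measure can be computed without fixing any decision threshold, via the rank-sum formulation; a minimal sketch with invented toy scores:

```python
def auc(pos_scores, neg_scores):
    """Area under the ROC curve as the probability that a randomly chosen
    positive (gully) outscores a randomly chosen negative; ties count half.
    This threshold independence is why AUC suits rare-case problems."""
    wins = sum((p > n) + 0.5 * (p == n)
               for p in pos_scores for n in neg_scores)
    return wins / (len(pos_scores) * len(neg_scores))

# Toy model scores: three positives against three negatives.
print(auc([0.9, 0.8, 0.4], [0.5, 0.3, 0.2]))
```

A value of 0.5 corresponds to random ranking and 1.0 to perfect separation of gully from non-gully cells.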
Figure 2. Outline of the training phase given only positive points. First, randomly selected negative points are added to the dataset. Then multiple balanced datasets are created, and each is used by the inducer to train a classifier.
Figure 4. Outline of testing a new point. All raster cells are tested in turn using all classifiers. Each classifier gives the probability of the cell hosting a gully initiation point. Next, the final score is calculated as the average of all classifier probabilities.
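The training and testing scheme of Figures 2 and 4 can be sketched as follows. This is a minimal stdlib illustration, not the study's code: one-feature threshold 'stumps' and invented toy data stand in for the AdaBoost-J48 members and the real feature layers.

```python
import random

random.seed(0)

# Toy stand-in data, one feature (think 'slope'): the rare positives (gully
# initiation points) sit higher on average than the abundant background cells.
gully = [random.gauss(3.0, 1.0) for _ in range(30)]         # all positives
background = [random.gauss(1.0, 1.0) for _ in range(3000)]  # negatives

def train_stump(pos, neg):
    """A one-feature threshold 'classifier': midpoint of the class means."""
    t = (sum(pos) / len(pos) + sum(neg) / len(neg)) / 2.0
    return lambda x, t=t: 1.0 if x > t else 0.0

# Step 2 of Figure 2: one member per balanced dataset, each built from all
# positives plus an equally sized random sample of negatives.
members = [train_stump(gully, random.sample(background, len(gully)))
           for _ in range(25)]

def gully_probability(x):
    """Steps 2-3 of Figure 4: average the members' outputs for a new point."""
    return sum(m(x) for m in members) / len(members)
```

A value near the positive mean (e.g. `gully_probability(4.0)`) scores close to 1, while a typical background value scores close to 0; the balancing step is what keeps each member from collapsing to an all-negative classifier.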
algorithms to find the one best suited to the nature of the dataset. We selected five techniques for evaluating the data. Below, we provide a short explanation of each technique and the reasons for its selection. The decision tree algorithm (Quinlan, 1993) is a well-established family of learning algorithms. Decision trees (or classification trees) are used to classify an object (geographical points in our case) into a predefined set of classes (gully/non-gully initiation points) based on its feature values (such as tillage direction and rainfall intensity). The decision tree combines the features in a hierarchical fashion such that the most important feature is located at the root of the tree. Each node in the tree refers to one of the features. Each leaf is assigned to one class (gully/non-gully initiation points) representing the most frequent class value. In addition, the leaf holds a probability vector indicating the probability of having gully initiation points. New points are classified by navigating them from the root of the tree down to a leaf, according to the outcome of the tests along the path. Decision trees assume that the feature space can be divided into axis-parallel rectangles such that each rectangle has a different gully probability. Figure 5 illustrates a simple decision tree for classifying a geographical point and its corresponding space partitioning. For the sake of clarity, we used only two features (aspect and vegetation cover). In particular, points with an aspect greater than 166° and vegetation cover less than 0.017 are associated with the top-left rectangle and have a probability Pgully = 0.67 of being a gully point. Decision trees are considered to be self-explanatory; there is no need to be a data mining expert in order to follow a certain decision tree. Classification trees are usually represented graphically as hierarchical structures, making them easier to interpret than other techniques.
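The Figure 5 example can be written directly as nested tests. Only the quoted leaf (aspect > 166°, vegetation cover < 0.017, Pgully = 0.67) comes from the text; the probabilities on the other leaves are placeholders invented for illustration:

```python
def p_gully(aspect_deg, veg_cover):
    """Traverse the toy two-feature tree of Figure 5 from the root down to a
    leaf and return that leaf's gully probability."""
    if aspect_deg > 166:
        if veg_cover < 0.017:
            return 0.67   # leaf quoted in the text: the top-left rectangle
        return 0.10       # placeholder leaf probability
    return 0.05           # placeholder leaf probability
```

Each `if` test corresponds to one axis-parallel split of the (aspect, vegetation cover) plane, so every leaf is one rectangle of the partitioning.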
If the classification tree becomes complicated (i.e., has many nodes), its straightforward graphical representation becomes useless. In these cases it is sometimes useful to represent the tree as a list of 'if-then' rules. Each path from the root of a decision tree to one of its leaves can be transformed into a rule simply by conjoining the tests along the path to form the antecedent part and taking the leaf's class prediction as the class value. For example, the above-mentioned path in Figure 5 can be transformed into the rule: 'If the aspect is greater than 166° and vegetation cover is less than 0.017, then the inspected point is a gully point with a probability of 0.67'. It should be noted that decision trees have been successfully used in the past to obtain spatial gully distribution, but not exclusively for gully initiation (Gutiérrez et al., 2009b). In our
study, to overcome the drawbacks of decision trees, we generate several trees and combine them to form a decision forest. This is a well-known approach for overcoming decision tree drawbacks, such as the inability to induce a meaningful model when there is a lack of adequate data. Decision forests are considered to be 'a very promising technique for a wide range of environmental problems due to their flexibility, adaptability, interpretability and performance' (Kuhnert et al., 2010). In this paper we utilize the well-known Adaboost algorithm to obtain the forest. Classifiers are represented as trees, whose internal nodes are tests of individual features, while the leaves are classification decisions. Typically, a greedy top-down search method is used to find a small decision tree that correctly classifies the training data. The decision tree is induced from the dataset by splitting the variables based on the expected information gain. Modern implementations include pruning, which avoids over-fitting. In this study, we evaluated Adaboost-J48, the WEKA version of the commonly used C4.5 algorithm (Quinlan, 1993). An important characteristic of decision trees is the explicit form of their knowledge, which can be represented as a set of if-then rules. While the Adaboost algorithm using decision trees has not previously been applied in geomorphology, the idea of an ensemble of decision trees (also known as a decision forest (Rokach and Maimon, 2008)) has been examined in the past (Saito et al., 2009; Kuhnert et al., 2010). Moreover, the Adaboost algorithm has been applied in geomorphology with induction algorithms other than decision trees (Shu and Burn, 2004). An artificial neural network (ANN), such as the multilayer perceptron (Bishop, 1995), is an information-processing paradigm inspired by the way biological nervous systems (i.e., the brain) process information.
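The boosting loop behind Adaboost can be sketched with one-feature decision stumps standing in for the J48 trees it normally boosts; a minimal illustration under that simplification, not the WEKA implementation:

```python
import math

def adaboost(X, y, rounds=10):
    """Minimal AdaBoost: each round fits the best weighted decision stump,
    weights it by its accuracy (alpha), then re-weights the training points
    so the next stump concentrates on the mistakes. y must be in {-1, +1}."""
    n = len(X)
    w = [1.0 / n] * n
    ensemble = []                                  # (alpha, threshold, polarity)
    for _ in range(rounds):
        best = None
        for t in sorted(set(X)):                   # candidate thresholds
            for pol in (1, -1):
                err = sum(wi for xi, yi, wi in zip(X, y, w)
                          if (pol if xi > t else -pol) != yi)
                if best is None or err < best[0]:
                    best = (err, t, pol)
        err, t, pol = best
        err = min(max(err, 1e-10), 1 - 1e-10)      # guard against log of zero
        alpha = 0.5 * math.log((1 - err) / err)
        ensemble.append((alpha, t, pol))
        w = [wi * math.exp(-alpha * yi * (pol if xi > t else -pol))
             for xi, yi, wi in zip(X, y, w)]
        s = sum(w)
        w = [wi / s for wi in w]                   # renormalise the weights
    return ensemble

def predict(ensemble, x):
    """Sign of the alpha-weighted vote of all stumps."""
    score = sum(a * (pol if x > t else -pol) for a, t, pol in ensemble)
    return 1 if score > 0 else -1

# Toy one-feature data, separable at x = 3.
ens = adaboost([1, 2, 3, 4, 5, 6], [-1, -1, -1, 1, 1, 1], rounds=3)
```

Replacing the stump fitter with a full C4.5/J48 learner yields the boosted decision forest evaluated in the paper.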
The key element of this paradigm is the structure of the information processing system. It is a network composed of a large number of highly interconnected processing elements, called neurons, working together in order to approximate a specific function. An ANN is configured for a specific application, such as pattern recognition or data classification, through a learning process, during which the weights of the inputs in each neuron are updated. The weights are updated by a training algorithm, such as backpropagation, according to the examples the network receives, in order to reduce the value of an error function. The power and usefulness of ANNs have been demonstrated in numerous applications, including speech synthesis, medicine, finance, and many other pattern-recognition problems, including the geomorphology domain (Shu and Burn, 2004; Ermini et al., 2005; Gomez and Kavzoglu, 2005). For some application
2. Veg. Cover > 0.00024 AND roads = 3
3. Veg. Cover > 0.00024 AND roads = 1
Figure 11. The number of gully initiation points identified, plotted against the number of points (cells) searched, for both the TT and the DM procedures. The results show that Adaboost-J48 outperforms the TT, while the latter is closer to a random search.
Figure 13. The AUC of the classifier as a function of the number of instances in each leaf.
Figure 12. The probability map (model score) for gully initiation by both DM and TT. The results show the large overestimation that occurs when using TT as a binary estimator. Reddish colours refer to areas of high probability for gully initiation. This figure is available in colour online at wileyonlinelibrary.com/journal/espl
4. Veg. Cover > 0.00024 AND rain > 62.168 AND rock <= 0.016
5. Veg. Cover <= 0.00024 AND slope > 2° AND 1.373123 < cost <= 1.759 AND rain > 50.674 AND 178° < aspect <= 226°
6. 0.000052 < Veg. Cover <= 0.00024 AND slope > 2° AND cost <= 1.068 AND aspect > 183°
7. Veg. Cover <= 0.00024 AND slope > 2° AND cost <= 1.373 AND aspect <= 183° AND rain > 53.362
8. 0.00024 < Veg. Cover <= 0.0276 AND rain <= 62.1678 AND flowAcc > 4 AND slope > 2° AND cost <= 1.466
9. Veg. Cover > 0.085 AND rain > 65.313 AND rock > 0.077 AND cost <= 1.323 AND slope <= 4°
10. 0.00024 < Veg. Cover <= 0.0856 AND rain > 62.168 AND rock > 0.016 AND cost <= 1.323 AND slope <= 4°

This set of rules includes all ten variables, with threshold values that, in general, agree with the existing literature. Thus, slope values > 2°, west-facing slope aspects (more exposed to the wind from the Mediterranean Sea), relatively large rainfall-intensity values and the road buffer areas are all observed here as promoting the development of initiation points.

Second group:

1. Veg. Cover <= 0.00024 AND slope > 2° AND cost <= 1.373 AND aspect > 183°
2. 0.00024 < Veg. Cover <= 0.005 AND rain > 62.168
3. Veg. Cover > 0.00024 AND roads = 3
4. Veg. Cover > 0.00024 AND roads = 1

The second group of rules shows that vegetation cover recurs as an important factor in all four rules, yet with a range that covers the entire population of values: > 0.00024 and, at the same time, <= 0.00024. Therefore, from our data, it is hard to interpret the role of vegetation cover in gully initiation. Slope > 2° seems to be an important threshold, as does rainfall intensity > 62.168 mm h⁻¹. Roads were also found to be important, particularly at the category distances of 1 m and 3 m. All other variables were omitted from the analysis, and an AUC of 0.82 was achieved considering just these six variables. The other variables were found to be less important for gully prediction.
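Because each rule is an explicit conjunction, the compact second-group rule set can be applied verbatim to a cell's feature values; a sketch with the thresholds copied from the list above (the parameter names are shorthand for the paper's layers):

```python
def second_group_fires(veg, slope, cost, aspect, rain, roads):
    """Evaluate the four second-group rules for one cell; the cell is
    flagged as a candidate initiation point if any rule fires.
    slope/aspect in degrees, rain in mm/h, roads as the distance category."""
    rules = (
        veg <= 0.00024 and slope > 2 and cost <= 1.373 and aspect > 183,
        0.00024 < veg <= 0.005 and rain > 62.168,
        veg > 0.00024 and roads == 3,
        veg > 0.00024 and roads == 1,
    )
    return any(rules)

# A bare, steep, south-west-facing cell with low tillage cost fires rule 1:
flag = second_group_fires(veg=0.0001, slope=3, cost=1.2, aspect=200,
                          rain=40.0, roads=0)
```

This transparency is the practical payoff of the if-then representation: the six retained variables can be audited threshold by threshold.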
Table II indicates how the AUC of AdaBoost-J48 is reduced by removing each one of the attributes. These values can be used to rank the attributes according to their importance to the model. As one can see, all attributes are relevant. The most important attributes are slope, slope aspect and land use; the least significant attribute is flow accumulation.
Discussion

DM versus AHP

In general, gully initiation and surface processes are too multifaceted to be predicted in every location and time without
uncertainty (Kirkby, 2010). It is therefore agreed that model predictions are approximations and are, to some extent, uncertain in predicting gullying and gully initiation points. A comparative study of the predictive abilities of different models, preferably in a heuristic manner, can shed new light on a model's pros and cons and better explain the different ways in which we perceive surface phenomena. Furthermore, the comparison presented here between the predictive abilities of DM and AHP for gully initiation touches on a fundamental issue in the interpretation of human knowledge. On the one hand, DM is based on actual measurements from a case study, and predictions are made from the local database. On the other hand, AHP provides predictions derived from knowledge interpreted by experts from previous measurements in other places, visual observations and subjective impressions, computer simulations and, to some extent, intuition. Thus, although we cannot answer in full whether it is better to use empirical data or to rely on experts in predicting gully erosion, our study has several implications for the question at hand. The results in Figure 8 show that if one queries enough points in the studied area (~2000), one will eventually find all the initiation points in the catchment. In such a case, however, only a few tens of true initiation points are found, while nearly two thousand cells suspected as initiation points are not confirmed as such in the validation. A similar overestimation occurs for a random decision (which assumes in every query that a cell has a 50% chance of being an initiation point and vice versa). This result reinforces the conclusion that what matters most is the ability to find areas under threat with minimal overestimation or, in other words, with the smallest possible number of queries.
Considering prevention, the search refers to the number of points that one needs to treat physically, for the purpose of early conservation, in order to prevent future gully initiation. Since conservation activity is expensive, one should aim to reduce the number of such activities. DM clearly finds most (30) of the initiation points at a very early stage of the search (below 1000 points), whereas AHP achieves this only after double that number of searched points; that is, using AHP, roughly another thousand points would need to be treated to avoid future gully initiation. Thus, because the number of soil conservation treatments is always limited, distributing treatments in the catchment areas ranked by DM-predicted risk will probably prevent more gully initiation than using AHP. This result is in line with a review of a large number of environmental-modelling studies (Guisan and Zimmermann, 2000), which argues that predictions trained on empirical data are usually more realistic and accurate, whereas more theoretically based mechanistic models yield predictions of higher generality. Our study clearly verifies this for the case of gully initiation prediction.
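The search-curve comparison described above can be sketched directly: rank cells by a model's predicted risk, visit them in order, and count how many true initiation points are recovered after a given number of queries. The scores and labels below are synthetic stand-ins (a well-separated scorer versus a coin flip), not the paper's data, but they reproduce the qualitative behaviour of Figure 8 with 30 initiation points among ~2000 cells.

```python
# Hedged sketch of a search curve: synthetic scores, not the study's data.
import random

def hits_after(scores, labels, n_queries):
    """Number of true initiation points found in the first n_queries
    cells when cells are visited in descending score order."""
    order = sorted(range(len(scores)), key=lambda i: -scores[i])
    return sum(labels[i] for i in order[:n_queries])

random.seed(0)
labels = [1] * 30 + [0] * 1970                      # rare cases: 30 in ~2000
skilled = [random.random() + 0.8 * y for y in labels]  # DM-like, well separated
coin = [random.random() for _ in labels]               # random-decision baseline

print(hits_after(skilled, labels, 100), "of 30 found by the skilled model")
print(hits_after(coin, labels, 100), "of 30 found by random search")
```

Both rankings eventually find all 30 points if the whole area is queried; the operational difference is entirely in how early the hits arrive, which is exactly the overestimation issue discussed above.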
Table II. Reduction in AUC by removing a selected attribute.

Attribute name       Effect on AUC
Slope aspect         0.07
Land use             0.07
Slope                0.06
Rain intensity       0.03
Vegetation cover     0.03
Unpaved roads        0.02
Rock cover           0.02
Tillage direction    0.01
Flow accumulation    0.01
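The ablation behind Table II can be sketched as a drop-one-attribute loop: score the cells with and without each attribute and record the change in AUC. In the sketch below, a toy additive scorer and integer-valued toy data stand in for retraining AdaBoost-J48 in Weka; the AUC itself is computed via the Mann–Whitney statistic.

```python
# Hedged sketch of the Table II procedure. The additive scorer and the
# two-attribute toy dataset are illustrative, not the study's model.

def auc(scores, labels):
    """Area under the ROC curve via the Mann-Whitney statistic."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def ablation_table(X, labels, names):
    """Return {attribute: AUC drop when that attribute is removed}."""
    full = auc([sum(row) for row in X], labels)
    drops = {}
    for j, name in enumerate(names):
        reduced = [sum(v for k, v in enumerate(row) if k != j) for row in X]
        drops[name] = round(full - auc(reduced, labels), 3)
    return drops

# Toy integer data: 'slope' separates the classes, 'noise' does not.
X = [(9, 2), (8, 7), (7, 1), (2, 6), (1, 3), (3, 8)]
labels = [1, 1, 1, 0, 0, 0]
print(ablation_table(X, labels, ["slope", "noise"]))
# → {'slope': 0.556, 'noise': -0.222}
```

A large positive drop marks an informative attribute; a negative drop (removing the attribute improves the AUC) marks one that mostly adds noise. In Table II every drop is positive, which is why all ten attributes are retained.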
DM versus topographic threshold

The TT is a theoretical approach that is based on an empirical dataset of gullies derived from actual field measurements. The TT, in the form of Equation 1, expresses the combined effect of slope and upslope contributing area and is widely used by geomorphologists to predict gully initiation (Montgomery and Dietrich, 1988; Hancock and Evans, 2006):

S = aA^b    (1)
where S is the slope; A is the upslope contributing area, which acts as a surrogate for the volume of runoff; and a and b are empirical coefficients. The threshold is defined as the lowest point of a dataset of gully initiation observations measured in the field. Different TT values and a and b coefficients have been found for different climates, types of vegetation cover, topsoil structures, soil moisture conditions and land uses (Svoray and Markovitch, 2009). Although the TT method is commonly used, several studies have found that it may induce prediction error in estimating the spatial distribution of gullies and initiation points (Chaplot et al., 2005). Patton and Schumm (1975), who were among the first to suggest the lower limit of the scatter of data as an identifier of an unstable valley floor, also pointed out the limitations of the TT as a surrogate for physically based runoff modelling. Our results for the Yehezkel catchment (Figure 12) show that the prediction based on the TT is close to random selection. The reason may lie in the binary nature of the TT: the method simply defines a threshold that depends on the slope–drainage-area relationship, usually a straight line fitted through the lowermost points of an empirical plot of slope versus upslope contributing area at gully initiation points measured in the field. Because large parts of the catchment lie above the threshold, large parts of the catchment are predicted to be under threat, and the method therefore suffers from large predictive overestimation. Based on the observations, not all areas above the threshold fulfil all the conditions required for gully development. In that sense, the TT is not an efficient method for tackling the rare-cases problem in general or the gully initiation prediction problem in particular.
Furthermore, gullies in certain areas could result from other factors in addition to slope and drainage area. The cause, however, is apparently not the size of the training set, namely the number of initiation points measured in the field for the scatter plot (as those points were also used to train the DM), but the mechanisms that initiate the gullies (discussed further in the next section). This point is crucial: if many points above the threshold are not under threat, the meaning of the threshold as a predictor is questionable. At best it can give evidence for the 'safe' points only, but not for the occurrence of initiation points in 'unsafe' cells.
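Equation 1 can be applied as a simple per-cell test, which also makes the overestimation mechanism concrete: every cell whose slope plots above the fitted line is flagged, regardless of any other condition. The coefficients a and b below are illustrative placeholders (b is typically negative), not the values fitted for the Yehezkel catchment.

```python
# Hedged sketch of the topographic-threshold test S = a * A**b.
# A_COEF and B_EXP are hypothetical coefficients, not fitted values.

A_COEF, B_EXP = 0.15, -0.3

def above_threshold(slope_mpm, area_m2):
    """True if the cell's (slope, area) pair plots above S = a * A**b.
    slope_mpm: local slope (m/m); area_m2: upslope contributing area (m2)."""
    return slope_mpm > A_COEF * area_m2 ** B_EXP

# Three hypothetical cells: (slope, contributing area)
cells = [(0.05, 500.0), (0.02, 50.0), (0.10, 2000.0)]
print([above_threshold(s, a) for s, a in cells])  # → [True, False, True]
```

Because the test is binary and one-sided, every cell above the line is treated as equally at risk, which is exactly why large portions of a catchment can be flagged even though only a handful of cells actually develop gullies.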
The extracted rules

In several previous studies, the minimum slope observed for soil detachment was 3° (see discussion in Bowman et al., 2011). However, the AdaBoost-J48 rules predict that gullying can occur in the studied area on slopes > 2°, that is, also on slopes between 2° and 3°. The fact that gully initiation can occur on slopes < 3° may be explained by soil resistance, which is substantially lower in agricultural fields, while the findings above mainly refer to uncultivated soils. Another topographic factor prone to affect gullying is flow accumulation, or the upslope contributing area. According to the TT concept, flow accumulation is among the most influential factors in gully initiation. However, flow accumulation does not often appear in the rules extracted by AdaBoost-J48. This may imply that the runoff mechanism in this area is very efficient: even a small contributing area can cause gullying, with the assistance of other factors such as tillage direction and, primarily, rainfall intensity. Our results show that a major effect of rainfall intensity on gully initiation results from rainfall events with an intensity of 62.2 mm h-1 (for a period of 30 min). The greater and more infrequent the rainfall, the greater the risk of gully development.
The importance of major rainfall events in causing soil loss has already been noted in semi-arid catchments, as have the difficulties of monitoring soil loss in the field due to damage to instruments during these events (Coppus and Imeson, 2002). The AdaBoost-J48 analysis further supports the role of major events in gully initiation. This result is important because it implies an increased threat to this catchment, especially in light of the prediction that the frequency of extreme events is increasing on a global scale (Woodward, 1999). This finding indicates that while the rainy season may become shorter due to climate change, the number of extreme rainfall events and floods is increasing. As a result of these changes, modelling the agricultural environment requires more components than slope and flow accumulation. Yet the method chosen to analyse the variation of the different factors affecting gully initiation remains important; in particular, the weighting of the different factors can affect the predictive ability of the system. With the AHP method, the experts assigned relatively low weights to roads and to tillage direction; the AdaBoost-J48 rules, however, assigned a greater influential role to roads.
Conclusions

In this work, we compared data mining procedures, a multicriteria mechanism and the topographic threshold as methods for predicting gully initiation in a semi-arid catchment. Based on the results, the following conclusions may be drawn:

(1) Using the same input layers, a data mining procedure provided better prediction of gully initiation points than an expert-based system. This was expressed in particular in the number of points that needed to be tested before all gully initiation points were identified, and it means that an expert-based system is liable to propose more conservation measures than are really needed.

(2) Using the same input layers, a data mining procedure provided better prediction of gully initiation points than the topographic threshold method. The main difference was the very high overestimation embedded in the use of the topographic threshold method. In addition, the minimum slope observed for soil detachment was 2°, whereas in other studies it is 3°. This may be explained by soil resistance, which is substantially lower in agricultural fields, while most studies test unploughed soil. Rainfall intensity events > 62.2 mm h-1 (for a period of 30 min) were found to have a significant effect on gully initiation. With the predicted increase in occasional rainfall intensity in the region, it is important to further study this value as a marker of events with a higher probability of initiating gullying.

Data mining procedures can be a useful tool to study gully initiation and other erosion processes. Further use of data mining requires more careful treatment of the rare-cases problem, in order to better understand the mechanisms involved.
Acknowledgements—The database used in this research was funded by the Soil Branch Advisory Board of the Israel Ministry of Agriculture and by the International Middle East Regional Agricultural Program supported by 'Danida'. We thank the editors, two anonymous reviewers and especially reviewer #2 for the thorough review that helped to improve the manuscript.
References

Basnet BB, Apan AA, Raine SR. 2001. Selecting suitable sites for animal waste application using a raster GIS. Environmental Management 28: 519–531.
Beguería S. 2006. Validation and evaluation of predictive models in hazard assessment and risk management. Natural Hazards 37: 315–329.
Bishop C. 1995. Neural Networks for Pattern Recognition. Clarendon Press: Oxford.
Bitan A, Rubin S. 1991. Climatic Atlas for Physical and Environmental Planning in Israel. Ramot, Tel Aviv University Press: Tel Aviv.
Bowman D, Devora S, Svoray T. 2011. Drainage reorganization on an emerged lake bed following base level fall, the Dead Sea, Israel. Quaternary International 233: 53–60.
Bradley AP. 1997. The use of the area under the ROC curve in the evaluation of machine learning algorithms. Pattern Recognition 30: 1145–1159.
Casali J, Gimenez R, Bennett S. 2009. Gully erosion processes: monitoring and modelling. Earth Surface Processes and Landforms 34: 1839–1840.
Chang C-C, Lin C-J. 2001. LIBSVM – A Library for Support Vector Machines. http://www.csie.ntu.edu.tw/~cjlin/libsvm/
Chaplot V, Coadou le Brozec E, Silvera N, Valentin C. 2005. Spatial and temporal assessment of linear erosion in catchments under sloping lands of northern Laos. Catena 63: 167–184.
Cheng H, Wu YQ, Zou XY, Si H, Zhao YZ, Liu DG, Yue XL. 2006. Study of ephemeral gully erosion in a small upland catchment on the Inner Mongolian Plateau. Soil & Tillage Research 90: 184–193.
Collins MG, Steiner FR, Rushman MJ. 2001. Land-use suitability analysis in the United States: historical development and promising technological achievements. Environmental Management 28: 611–621.
Coppus R, Imeson AC. 2002. Extreme events controlling erosion and sediment transport in a semi-arid sub-Andean valley. Earth Surface Processes and Landforms 27: 1365–1375.
Dai FC, Lee CF, Zhang XH. 2001. GIS-based geo-environmental evaluation for urban land-use planning: a case study. Engineering Geology 61: 257–271.
Dai FC, Lee CF. 2003. A spatiotemporal probabilistic modelling of storm-induced shallow landsliding using aerial photographs and logistic regression. Earth Surface Processes and Landforms 28: 527–545.
De Santisteban LM, Casali J, Lopez JJ. 2006. Assessing soil erosion rates in cultivated areas of Navarre (Spain). Earth Surface Processes and Landforms 31: 487–506.
Ermini L, Catani F, Casagli N. 2005. Artificial neural networks applied to landslide susceptibility assessment. Geomorphology 66: 327–343.
Fan RE, Chen PH, Lin CJ. 2005. Working set selection using second order information for training SVM. Journal of Machine Learning Research 6: 1889–1918.
Fayyad UM, Piatetsky-Shapiro G, Smyth P. 1996. From data mining to knowledge discovery: an overview. In Advances in Knowledge Discovery and Data Mining, Fayyad UM, Piatetsky-Shapiro G, Smyth P, Uthurusamy R (eds). MIT Press/AAAI Press: Menlo Park, CA.
Freund Y, Schapire RE. 1996. Experiments with a new boosting algorithm. In Machine Learning: Proceedings of the Thirteenth International Conference, Bari, Italy, 3–6 July. Morgan Kaufmann: Waltham, MA; 325–332.
Gilliams S, Raymaekers D, Muys B, Van Orshoven J. 2005. Comparing multiple criteria decision methods to extend a geographical information system on afforestation. Computers and Electronics in Agriculture 49: 142–158.
Gomez H, Kavzoglu T. 2005. Assessment of shallow landslide susceptibility using artificial neural networks in Jabonosa River basin, Venezuela. Engineering Geology 78(1–2): 11–27.
Guisan A, Zimmermann NE. 2000. Predictive habitat distribution models in ecology. Ecological Modelling 135: 147–186.
Gutierrez AG, Schnabel S, Felicisimo AM. 2009a. Modelling the occurrence of gullies in rangelands of southwest Spain. Earth Surface Processes and Landforms 34: 1894–1902.
Gutierrez AG, Schnabel S, Lavado Contador FJ. 2009b. Using and comparing two nonparametric methods (CART and MARS) to model the potential distribution of gullies. Ecological Modelling 220: 3630–3637.
Hancock GR, Evans KG. 2006. Channel head location and characteristics using digital elevation models. Earth Surface Processes and Landforms 31: 809–824.
Hentati A, Kawamura A, Amaguchi H, Iseri Y. 2010. Evaluation of sedimentation vulnerability at small hillside reservoirs in the semi-arid region of Tunisia using the Self-Organizing Map. Geomorphology 122: 56–64.
Jankowski P. 1995. Integrating geographical information systems and multiple criteria decision-making methods. International Journal of Geographical Information Systems 9: 251–273.
Jiang YL, Metz CE, Nishikawa RM. 1996. A receiver operating characteristic partial area index for highly sensitive diagnostic tests. Radiology 201: 745–750.
Joshi MV, Agarwal RC, Kumar V. 2002. Predicting rare classes: can boosting make any weak learner strong? In Proceedings of the Eighth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, Edmonton, Alberta, Canada, 23–26 July.
Kirkby MJ. 2010. Distance, time and scale in soil erosion processes. Earth Surface Processes and Landforms 35: 1621–1623.
Kirkby MJ, Bracken LJ. 2009. Gully processes and gully dynamics. Earth Surface Processes and Landforms 34: 1841–1851.
Kuhnert PM, Henderson AK, Bartley R, Herr A. 2010. Incorporating uncertainty in gully erosion calculations using the random forests modelling approach. Environmetrics 21: 493–509.
Maimon O, Rokach L. 2005. The Data Mining and Knowledge Discovery Handbook: A Complete Guide for Practitioners and Researchers. Springer: New York.
Malczewski J. 1999. GIS and Multicriteria Decision Analysis. John Wiley & Sons: New York; 182–187.
Malczewski J. 2004. GIS-based land-use suitability analysis: a critical overview. Progress in Planning 62: 3–65.
Meyers JA, Martínez-Casasnovas JA. 1999. Prediction of existing gully erosion in vineyard parcels of NE Spain: a logistic modelling approach. Soil and Tillage Research 50: 319–331.
Melchiorre C, Matteucci M, Azzoni A, Zanchi A. 2008. Artificial neural networks and cluster analysis in landslide susceptibility zonation. Geomorphology 94: 379–400.
Metz CE. 2006. ROCKIT version 1.1. Department of Radiology, The University of Chicago.
Metz CE, Wang PL, Kronman HB. 1984. A new approach for testing the significance of differences between ROC curves measured from correlated data. In Information Processing in Medical Imaging VIII, Deconinck F (ed); 432–445.
Montgomery DR, Dietrich WE. 1988. Where do channels begin? Nature 336: 232–234.
Ni JR, Li XX, Borthwick AGL. 2008. Soil erosion assessment based on minimum polygons in the Yellow River basin, China. Geomorphology 93: 233–252.
Ni JR, Li YK. 2003. Approach to soil erosion assessment in terms of land-use structure changes. Journal of Soil and Water Conservation 58: 158–169.
Nir D. 1989. The Geomorphology of Israel. Academon: Jerusalem.
Nir D. 1993. Studies of the morphology of the Shifim river. In Studies of Physical Geography of Israel and Southern Sinai, Nir D (ed). Ariel: Jerusalem; 50–62.
Patton PC, Schumm SA. 1975. Gully erosion, Northwestern Colorado: a threshold phenomenon. Geology 3: 88–90.
Pavel M, Fannin RJ, Nelson JD. 2008. Replication of a terrain stability mapping using an artificial neural network. Geomorphology 97: 356–373.
Platt J. 1999. Fast training of support vector machines using sequential minimal optimization. In Advances in Kernel Methods – Support Vector Learning, Schoelkopf B, Burges C, Smola A (eds). MIT Press: Cambridge, MA.
Quinlan JR. 1993. C4.5: Programs for Machine Learning. Morgan Kaufmann: San Francisco.
Quinlan JR. 1996. Bagging, boosting, and C4.5. In Proceedings of the National Conference on Artificial Intelligence. AAAI Press/The MIT Press; 725–730.
Rokach L. 2010. Ensemble-based classifiers. Artificial Intelligence Review 33: 1–39.
Rokach L, Maimon O. 2008. Data Mining with Decision Trees: Theory and Applications. World Scientific: Singapore.
Saaty TL. 1977. A scaling method for priorities in hierarchical structures. Journal of Mathematical Psychology 15: 234–281.
Saito H, Nakayama D, Matsuyama H. 2009. Comparison of landslide susceptibility based on a decision-tree model and actual landslide occurrence: the Akaishi Mountains, Japan. Geomorphology 109: 108–121.
Shu C, Burn DH. 2004. Artificial neural network ensembles and their application in pooled flood frequency analysis. Water Resources Research 40: 1–10.
Svoray T, Ben-Said S. 2010. Soil loss, water ponding and sediment deposition variations as a consequence of rainfall intensity and land use: a multicriteria analysis. Earth Surface Processes and Landforms 35: 202–216.
Svoray T, Markovitch H. 2009. Catchment scale analysis of the effect of topography, tillage direction and unpaved roads on ephemeral gully incision. Earth Surface Processes and Landforms 34: 1970–1984.
Swets JA. 1988. Measuring the accuracy of diagnostic systems. Science 240: 1285–1293.
Tarboton DG, Bras RL, Rodriguez-Iturbe I. 1989. The analysis of river basins and channel networks using digital terrain data. TR 326, Department of Civil and Environmental Engineering, Massachusetts Institute of Technology, Boston.
Tarboton DG. 1997. A new method for the determination of flow directions and upslope areas in grid digital elevation models. Water Resources Research 33: 309–319.
Trimble SW, Crosson P. 2000. Land use – US soil erosion rates – myth and reality. Science 289: 248–250.
Van Rompaey A, Bazzoffi P, Dostal T, Verstraeten G, Jordan G, Lenhart T, Govers G, Montanarella L. 2003. Modeling off-farm consequences of soil erosion in various landscapes in Europe with a spatially distributed approach. Proceedings of the OECD Expert Meeting on Soil Erosion and Soil Biodiversity Indicators, 25–28 March, Rome, Italy.
Weiss GM. 2010. Mining with rare cases. In The Data Mining and Knowledge Discovery Handbook, Rokach L, Maimon O (eds). Springer: New York; 765–777.
Witten IH, Frank E. 2005. Data Mining: Practical Machine Learning Tools and Techniques. Morgan Kaufmann: Waltham, MA.
Woodward DE. 1999. Method to predict cropland ephemeral gully erosion. Catena 37: 393–399.