
Geoderma 259–260 (2015) 134–148

Contents lists available at ScienceDirect

Geoderma journal homepage: www.elsevier.com/locate/geoderma

On the application of Bayesian Networks in Digital Soil Mapping

K. Taalab a, R. Corstanje b, J. Zawadzka b, T. Mayr b,⁎, M.J. Whelan c, J.A. Hannam b, R. Creamer d

a University College London, London, Greater London WC1E 6BT, United Kingdom
b Cranfield University, Cranfield, Bedfordshire MK43 0AL, United Kingdom
c University of Leicester, Leicester, Leicestershire LE1 7RH, United Kingdom
d Teagasc, Johnstown Castle, Wexford, Co. Wexford, Ireland

Article info

Article history:
Received 5 June 2014
Received in revised form 15 May 2015
Accepted 26 May 2015
Available online 11 June 2015

Keywords: Bayesian Networks; Soil; Bulk density; Expert knowledge; Mapping; Modelling

Abstract

Two corresponding issues concerning Digital Soil Mapping are the demand for up-to-date, fine resolution soil data and the need to determine soil–landscape relationships. In this study, we propose a Bayesian Network framework as a suitable modelling approach to fulfil these requirements. Bayesian Networks are graphical probabilistic models in which predictions are obtained using prior probabilities derived from either measured data or expert opinion. They represent cause and effect relationships through connections in a network system. The advantage of the Bayesian Networks approach is that the models are easy to interpret and the uncertainty inherent in the relationships between variables can be expressed in terms of probability. In this study we will define the fundamentals of a Bayesian Network and the probability theory that underpins predictions. Then, using case studies, we demonstrate how they can be applied to predict soil properties (bulk density) and soil taxonomic class (associations). © 2015 Elsevier B.V. All rights reserved.

1. Introduction

To satisfy the growing demand for up-to-date, fine resolution soil data, there is a call to fully explore the potential of current mapping and modelling software, and apply existing modelling techniques in novel and innovative ways (Hartemink and McBratney, 2008). Predictive modelling of the spatial pattern of soil types and properties is based on a quasi-mechanistic understanding of soil formation and the factors that drive soil variation in the landscape, namely the ClORPT factors (Climate, Organic activity, Relief, Parent material and Time; Jenny, 1941). The relationships between soil forming factors and soil properties are complex and several non-linear modelling techniques have been employed to represent them including Random Forests (RFs) (Liaw and Wiener, 2002; Grimm et al., 2008; Wiesmeier et al., 2011) and Artificial Neural Networks (ANNs) (Agyare et al., 2007; Zhao et al., 2010). A principal disadvantage of these methods is that they are ‘black-box’, meaning that it is often difficult to interpret the relationship between response and predictor variables in physical terms (Suuster et al., 2012). In Bayesian Networks (BNs) the relationship between soil forming factors and soil properties can be directly addressed (Tavares Wahren et al., 2012). Many significant soil processes are not particularly well understood at the landscape scale and would benefit from the clarity and insight provided by BN modelling (e.g. Braakhekke et al., 2012).

⁎ Corresponding author. E-mail address: t.mayr@cranfield.ac.uk (T. Mayr).

http://dx.doi.org/10.1016/j.geoderma.2015.05.014 0016-7061/© 2015 Elsevier B.V. All rights reserved.

Chen and Pollino (2012) stated that improving system understanding is a key motivation for using a BN. BNs are graphical probabilistic models in which predictions are obtained using prior probabilities derived from either measured data or expert opinion. They represent cause and effect relationships via connections in a network system (Hough et al., 2010) but they differ from other network based methods, such as ANNs, in that the structure of the network and the interactions between nodes are defined by the user based on prevailing process understanding. BNs are a flexible way of structuring process understanding stochastically and, unlike purely deterministic models, reflect the uncertainty surrounding cause–effect relationships (one event leading to another) by expressing the relationships between soil classes/properties and the covariates as a probability (Dlamini, 2010). They are also ideal for addressing problems where data are limited (Kuhnert and Hayes, 2009). BNs have been applied to ecological systems (McCann et al., 2006), notably conservation (McCloskey et al., 2011), habitat mapping (Smith et al., 2007), and risk mapping of events such as wildfire (Dlamini, 2010) and peat erosion (Aalders et al., 2011). Bayesian modelling approaches have also been applied to modelling soil classes (Skidmore et al., 1996; Bui et al., 1999; Mayr et al., 2010) or soil attributes (Cook et al., 1996; Corner et al., 2002). Despite this, BNs are not yet established as a mainstream tool in Digital Soil Mapping (DSM). BNs were developed from the branch of mathematics known as probability theory, in particular from probabilistic reasoning (Pearl, 1988). Unlike deterministic models, BNs offer a structured method of dealing with uncertainty that, as a rule, diminishes as more information


is gathered. In the case of predicting the spatial distribution of soil classes and properties, the relationships between variables are highly uncertain and data availability is often limited, so BNs have great potential as a predictive tool (Finke, 2012). Another appealing aspect of BNs is their ability to integrate expert knowledge into the model, which can be used to supplement measured data, or define relationships between variables directly. There has been a long-standing drive to formally introduce expert knowledge into soil mapping, usually focusing on fuzzy set theory or possibility theory (McBratney and Odeh, 1997). In contrast, BNs use probability theory, which can be seen to offer a more coherent structure to decision making problems (Degroot, 1988), although there has been some debate as to which is the superior approach (Krueger et al., 2012). In this study, BNs are explored for two typical problems in DSM: i) the prediction of a soil property, soil bulk density, and ii) the prediction of a soil taxonomic class.

BNs have a number of advantages compared with other modelling techniques regularly employed in DSM:

- The real strength of BNs can be fully appreciated in situations where domain knowledge is crucial and availability of data is scarce, such as in Case Study 1 on bulk density.
- While BNs did not improve predictive performance, they have the advantage of offering some process-based insight (Correa et al., 2009). This has been confirmed in this study, where the results are very similar to (albeit slightly lower than) those of the ANN and RF black-box modelling techniques which have previously been used to predict topsoil Db with the same dataset (Taalab et al., 2012).
- In recent years, the focus of DSM has moved away from straightforward classification of soils towards developing a better understanding of the spatial distribution of soils in relation to the wider environment (Grunwald, 2009). This is necessary in order to resolve challenges such as climate change, desertification, and food production which are putting increasing pressure on soils as a resource (Hartemink and McBratney, 2008).
- BNs provide an opportunity to assess the understanding of soil processes. In conjunction with expert knowledge, BNs can either confirm or contradict the opinions of the expert(s). If the BN contradicts what the expert believes, it can prompt further investigation into the process, indicating a knowledge gap or a problem with the model itself. If the latter is the case, it is easy to amend both the model structure and the probabilistic relationship between nodes. Identifying the source of predictive inaccuracies in a black-box model is much less straightforward.
- As BNs are based on process understanding they can be used to answer specific questions using predictive reasoning. For example, they can give the probability of X given certain information, a capability that black-box models do not possess. In addition, BNs are also capable of diagnostic reasoning. For example, given an outcome, the favourable conditions likely to lead to this outcome can be predicted.

In summary, the major appeal of BNs is their clarity, which allows experts to judge whether the model makes pedogenic sense and to develop a better understanding of the soil processes.

2. Materials and methods

2.1. Theory

BNs are named after the Reverend Thomas Bayes who, in the 18th century, developed a theorem regarding changing probabilities given new information (Bayes, 1783). The basis of a BN is conditional probability, which can be explained using an example from Jensen (1996), where a statement of conditional probability reads “Given an event B, the probability of event A is x.”


In mathematical notation this would read

$$P(A \mid B) = x. \tag{1}$$

This statement holds true only if all other information which could affect event A is known and has been accounted for. The basic rule of conditional probability is:

$$P(A \mid B)\,P(B) = P(A, B) \tag{2}$$

where P(A,B) is the probability of the joint event A and B both being true (A ∧ B). From this, the Bayes Rule (Eq. (3)) can be derived.

$$P(A \mid B) = \frac{P(B \mid A)\,P(A)}{P(B)}. \tag{3}$$

This rule forms the basis of BN modelling, as we can use Bayes' rule to inform us of the probability of event A given information about B. In Eq. (1), the posterior probability P(A|B) was an unknown x; we now see that it can be calculated using our prior belief in the occurrence of event A, P(A), and of event B, P(B), and the probability of B given that A has occurred, P(B|A). This is known as Bayesian inference. To illustrate how this might work in practice for DSM applications, we adapt an example given by Aitkenhead and Aalders (2009).

From Eq. (3), P(A|B) is the posterior probability of event A (e.g. high bulk density; Db) given B (e.g. arable land use). Note that the class ‘high bulk density’ is an example of discretization of a continuous variable into a set of classes, the boundaries of which would need to be defined. P(A) is the probability that bulk density is ‘high’ (a prior probability derived either from data, i.e. the percentage of samples recorded as high, or from expert opinion), P(B) is the probability of the occurrence of arable land (the proportion of the study area that is arable land) and P(B|A) is the proportion of high bulk density samples that are found on arable land. For example, suppose that 30% of the total number of Db samples are classified as high, i.e. P(A) = 0.3, that 40% of the terrain in the study area is classed as arable, i.e. P(B) = 0.4, and that the proportion of high Db samples found on arable land is 50%, i.e. the conditional probability P(B|A) = 0.5. This probability can be generated either by expert knowledge or using observed data. Combined, these probabilities give the probability that if the land is arable, the bulk density will be high, known as the posterior probability P(A|B). In this instance:

$$P(A \mid B) = \frac{0.5 \times 0.3}{0.4} = 0.375 \tag{4}$$
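The arithmetic in Eq. (4) can be checked with a few lines of Python. This is a minimal sketch rather than part of the original analysis; the function name `posterior` and the variable names are illustrative assumptions, and only the three probabilities come from the worked example.

```python
def posterior(p_b_given_a: float, p_a: float, p_b: float) -> float:
    """Bayes' rule, Eq. (3): P(A|B) = P(B|A) * P(A) / P(B)."""
    if not 0.0 < p_b <= 1.0:
        raise ValueError("P(B) must lie in (0, 1]")
    return p_b_given_a * p_a / p_b


# Values taken from the worked example in the text
p_high_db = 0.3            # P(A): share of samples with high Db
p_arable = 0.4             # P(B): share of the study area under arable land
p_arable_given_high = 0.5  # P(B|A): share of high-Db samples found on arable land

p_high_given_arable = posterior(p_arable_given_high, p_high_db, p_arable)
print(round(p_high_given_arable, 3))  # 0.375
```

The same function can be reused for any pair of events once the three input probabilities have been estimated from data or elicited from experts.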

Hence, there is a 37.5% probability that Db will be high on arable land. In reality, when dealing with complex problems in soil mapping, there will be numerous factors that influence variables of interest. Hence BNs are designed to link large numbers of influencing variables and combine the conditional probabilities of each.

BNs comprise two components: 1) a directed acyclic graph (DAG), in which each node represents a variable and the directed links between nodes represent the conditional dependencies of the model, and 2) a quantitative component consisting of conditional probability tables (CPTs) that accompany each node and define the dependencies of each variable. Each CPT contains a list of possible states that could be applied to the variable. Using an example adapted from Nadkarni and Shenoy (2004), Fig. 1 shows a BN comprised of four variables: Land Use (L), Soil Group (S), Organic Carbon Content (C) and Soil Bulk Density (D). The directional arrows between variables indicate causality. The variables with arrows leading into them are known as the ‘child nodes’ and the variables where the arrows originate are known as ‘parent nodes’. Each state is mutually exclusive and the list is definitive; for clarity, we have kept the number of states in Fig. 1 to a minimum. It is acknowledged, however, that in complex natural systems the environmental


Fig. 1. An example Bayesian Network of soil properties and influencing factors (adapted from Nadkarni and Shenoy, 2004), showing the conditional probability tables for each node. Nodes: Land Use (L), Soil Group (S), Organic Carbon (C), Bulk Density (D).

covariates are seldom independent of each other and BNs can account for those scenarios as will be demonstrated in the case studies.

There are three types of connection in a BN (Jensen, 1996): serial, converging and diverging. In a serial connection, evidence about C (Organic Carbon) is passed directly to D (Bulk Density). In a converging connection, evidence about L (Land Use) and S (Soil Group) is passed to their common child C. If nothing is known about C except what may be inferred from knowledge of its parents L and S, then the parents are independent: evidence on one of them has no influence on the certainty of the other (information is not passed between them). This is important because often BNs will be used to determine the most likely cause of an event given evidence. However, if anything is known about C (including ‘soft evidence’, which may not determine the state of C, but may alter its probability distribution), then L and S become (conditionally) dependent. In a diverging connection (not shown in Fig. 1), hypothetically, evidence about X is transmitted to both nodes Y and Z. If there is no hard evidence about the state of X, evidence about Y can be transmitted to Z. When there is hard evidence of X (X is referred to as being ‘instantiated’), evidence about Y does not transmit to Z. When X is certain, Y and Z become (conditionally) independent.

In order to function, BNs rely on certain independence assumptions. The links between nodes indicate what information about probabilities is required to produce the probability distribution at the node of interest. All parentless nodes need to be supplied with a prior probability and all child nodes need to have a conditional probability table covering every combination of states of their parent nodes. Expanding the fundamental rule of conditional probability (Eq. (2)) to incorporate n variables we get the chain rule (Eq. (5)), which allows us to calculate the full joint probability for all the variables in the network.

$$P(A_1, A_2, \ldots, A_n) = P(A_1 \mid A_2, \ldots, A_n)\,P(A_2 \mid A_3, \ldots, A_n) \cdots P(A_{n-1} \mid A_n)\,P(A_n) \tag{5}$$

There is, however, a practical drawback with this rule in its current form. For example, with n random binary variables, the number of joint probabilities required is 2^n − 1. This can quickly escalate to a very large number when a network represents the natural environment due to the large number of potentially interacting variables and the fact that variables will often have more than just two states (i.e. they will not be binary). In order to avoid this issue, BNs use an assumption of independence, which reduces the number of probabilities which need to be specified (Charniak, 1991). Consequently, how nodes are linked determines how the probabilistic relationships between variables are propagated through the network.
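To give a sense of the saving, the sketch below counts the independent probabilities that must be specified with and without the independence assumption. It assumes the four-node structure of Fig. 1 (L and S as parents of C, C as parent of D) and, purely for illustration, two states per node; the function names are our own.

```python
from math import prod

# Number of states per node; all binary here, matching the 2^n - 1 example.
states = {"L": 2, "S": 2, "C": 2, "D": 2}

# Parents of each node in the Fig. 1 network (L and S feed C, C feeds D).
parents = {"L": [], "S": [], "C": ["L", "S"], "D": ["C"]}


def full_joint_parameters(states):
    """Independent probabilities needed for the unrestricted joint table."""
    return prod(states.values()) - 1


def factorised_parameters(states, parents):
    """Independent probabilities needed when each node depends only on its parents."""
    total = 0
    for node, n_states in states.items():
        # One distribution per combination of parent states; each distribution
        # has (n_states - 1) free values because it must sum to one.
        parent_combinations = prod(states[p] for p in parents[node])
        total += (n_states - 1) * parent_combinations
    return total


print(full_joint_parameters(states))           # 15
print(factorised_parameters(states, parents))  # 8
```

With more nodes or more states per node, the gap between the two counts widens quickly, which is why the choice of links matters so much in practice.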

Given these assumptions about conditional probability, it is possible to re-write Eq. (5) as:

$$P(A_1, \ldots, A_n) = \prod_{i=1}^{n} P(A_i \mid \mathrm{parents}(A_i)) \tag{6}$$

where parents(Ai) is the state of parent node for variable Ai. In this way, the joint probability distribution of the nodes in a BN is greatly simplified. Using the network in Fig. 1 as an example, without the assumption of conditional independence, the joint probability for the network is:

$$P(L, S, C, D) = P(L)\,P(S \mid L)\,P(C \mid L, S)\,P(D \mid L, S, C) \tag{7}$$

where L denotes Land use, S denotes Soil Group, C denotes Organic Carbon, and D denotes Bulk Density. However, by assuming the nodes are conditionally independent, the joint probability is given by:

$$P(L, S, C, D) = P(L)\,P(S)\,P(C \mid L, S)\,P(D \mid C). \tag{8}$$
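Eq. (8) can be evaluated directly for a small network. The following sketch does so by brute-force enumeration and then uses the joint distribution for predictive and diagnostic queries. All probability values, state names and function names are hypothetical and were invented for illustration; they are not taken from Fig. 1 or from the case studies, and real BN software uses more efficient inference algorithms than enumeration.

```python
from itertools import product

# Hypothetical probabilities, invented for illustration only.
P_L = {"arable": 0.4, "grass": 0.6}
P_S = {"brown": 0.7, "gley": 0.3}
# P(C = "high" | L, S); P(C = "low" | L, S) is the complement.
P_C_HIGH = {("arable", "brown"): 0.2, ("arable", "gley"): 0.4,
            ("grass", "brown"): 0.5, ("grass", "gley"): 0.8}
# P(D = "high" | C); P(D = "low" | C) is the complement.
P_D_HIGH = {"high": 0.1, "low": 0.6}

NAMES = ("L", "S", "C", "D")
WORLDS = list(product(P_L, P_S, ("high", "low"), ("high", "low")))


def joint(l, s, c, d):
    """Eq. (8): P(L, S, C, D) = P(L) P(S) P(C | L, S) P(D | C)."""
    p_c = P_C_HIGH[(l, s)] if c == "high" else 1.0 - P_C_HIGH[(l, s)]
    p_d = P_D_HIGH[c] if d == "high" else 1.0 - P_D_HIGH[c]
    return P_L[l] * P_S[s] * p_c * p_d


def prob(**evidence):
    """Marginal probability of the evidence, summing the joint over all other nodes."""
    total = 0.0
    for world in WORLDS:
        assignment = dict(zip(NAMES, world))
        if all(assignment[k] == v for k, v in evidence.items()):
            total += joint(*world)
    return total


def conditional(query, given):
    """P(query | given) as a ratio of two marginals."""
    return prob(**query, **given) / prob(**given)


# Predictive reasoning: probability of high Db on arable land.
p_pred = conditional({"D": "high"}, {"L": "arable"})
# Diagnostic reasoning: probability that the land is arable given high Db.
p_diag = conditional({"L": "arable"}, {"D": "high"})
# Conditional independence: once C is known, L adds nothing about D.
p_d_given_c = conditional({"D": "high"}, {"C": "low"})
p_d_given_c_l = conditional({"D": "high"}, {"C": "low", "L": "arable"})
```

In this sketch `p_d_given_c` and `p_d_given_c_l` coincide, which is the conditional independence of D from L given C that Eq. (8) encodes.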

Note that here we assume that P(S) = P(S|L), meaning that the probability of event S is the same as the probability of event S given L, making S independent of L. A further assumption is that P(D|C) = P(D|L, S, C), showing that D is conditionally independent of L and S given C: if we know the value of C, information regarding the variables L and S will not affect D. Given any sequence of variables on any network, it is assumed that (if the parent nodes are known) two nodes that are not directly linked are conditionally independent (Nadkarni and Shenoy, 2004).

As is the case with most landscape level stochastic models, there is an assumption of stationarity of the conditional dependencies in BNs. This study does not consider the implications of deviation from stationarity, other than to note that it can be determined (e.g. Corstanje et al., 2008) and that there are methods which will allow for non-stationary behaviour, such as Dynamic Bayesian Networks (Robinson and Hartemink, 2010). Non-stationary behaviour in BNs for DSM is a further consideration for future studies.

There are also a number of practical constraints involved in determining the model structure, as BNs cannot account for cycles or feedbacks (Jensen, 2001), hence the graphs are described as acyclic. Furthermore, there should be no more than four ‘layers’ to the model structure to avoid unnecessary propagation of uncertainty (Marcot et al., 2006). The size of the CPTs at each node is given by:

$$\mathrm{Size}(CPT) = S \prod_{i=1}^{n} P_i \tag{9}$$


where S is the number of states of the node and P_i is the number of states in the ith parent node (Chen and Pollino, 2012). Therefore, if a node has many parents, its CPT will become very large, which makes populating the table difficult due to increasing demands for data (from either empirical data or a multifarious process of expert knowledge elicitation). If the conditional probabilities at each node are derived from data, Cain (2001) suggests that at least 20 cases for every combination of variables are required to ensure the model is robust.

2.2. Implementation

Once the variables of interest have been identified, the relationship between them must be determined. This requires the construction of a conceptual model which links the ‘driver’ variables with a wider suite of environmental variables and outlines key assumptions inherent in the model about the relationships between drivers and the property of interest.

2.2.1. Naive networks

Described as the qualitative part of the network (Nadkarni and Shenoy, 2004), the simplest Bayesian model structure is known as a Naive Bayesian Network or a Bayesian Classifier (Duda and Hart, 1973). Here, the structure is very simple, as the naive network works under the assumption that all variables (L, S, C) are independent of each other given information about the root of the network (Bulk Density), which is the variable being predicted (Friedman et al., 1997). Note that despite all the variables (L, S, C) being used to predict D, the directional arrows run outwards from D. This is to allow the assumption of conditional independence. A naive network is often used if there is little understanding of the system which is being modelled.

2.2.2. Naive optimised networks

As well as the naive network, an ‘optimised’ naive network was developed using a stepwise classification procedure where each individual prediction variable was added to the model in turn and the predictive capabilities of the variable were assessed using the training data. To clarify, the predictive power of the network is tested using each individual predictor. The predictor variable that leads to the most accurate prediction is then added to the network structure (for example land cover). The cycle then repeats for the remaining variables. A variable is rejected if it leads to an increase in the predictive error of the model. Hence only predictors which lead to an increase in the predictive power of the model are included. The optimised naive network was built using the training data. The approach was developed using the Netica (Norsys Software Corp, 2006) Application Programmer Interfaces (APIs) and a forward stepwise selection approach.

2.2.3. TAN learned structure networks

For a more realistic representation of the system in question (in Digital Soil Mapping, this will usually be the natural environment) it is desirable to model the interactions between variables in the network. This can be accomplished using a data-mining approach; frequently a Tree Augmented Naive (TAN) Bayes algorithm is used to derive the optimal structure of the BN (Friedman et al., 1997). The TAN algorithm modifies the naive network by identifying dependencies between predictor variables (L, S, C). The model structure is altered so that the predictor variables can have an additional parent node from one of the other predictor variables, based on the conditional mutual information contained in a training dataset (Jiang et al., 2005). While this approach has been shown to outperform naive BNs, the drawback is that it requires a large amount of data and predictions can be highly sensitive to changes in the model parameters.
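The conditional mutual information on which the TAN structure is based can be estimated from counts in a discretized training set. The sketch below is an illustration only: the function name, the toy records and the state labels are assumptions made for this example, and it does not reproduce the Netica implementation used in the study.

```python
from collections import Counter
from math import log


def conditional_mutual_information(xs, ys, cs):
    """Empirical I(X; Y | C) in nats from three equal-length sequences of discrete values."""
    n = len(cs)
    n_xyc = Counter(zip(xs, ys, cs))
    n_xc = Counter(zip(xs, cs))
    n_yc = Counter(zip(ys, cs))
    n_c = Counter(cs)
    cmi = 0.0
    for (x, y, c), count in n_xyc.items():
        # p(x,y,c) * log[ p(x,y|c) / (p(x|c) p(y|c)) ], written with raw counts
        cmi += (count / n) * log(count * n_c[c] / (n_xc[(x, c)] * n_yc[(y, c)]))
    return cmi


# Toy training records (hypothetical): the class is bulk density, the rest are predictors.
db_class = ["high", "high", "high", "high", "low", "low", "low", "low"]
land_use = ["arable", "arable", "grass", "grass", "arable", "arable", "grass", "grass"]
carbon = ["low", "low", "high", "high", "low", "low", "high", "high"]  # mirrors land use
soil = ["brown", "gley", "brown", "gley", "brown", "gley", "brown", "gley"]

cmi_land_carbon = conditional_mutual_information(land_use, carbon, db_class)
cmi_land_soil = conditional_mutual_information(land_use, soil, db_class)
```

In this toy data, land use and carbon are perfectly dependent within each bulk density class, so their score is large, whereas land use and soil are independent within each class, so their score is zero. A TAN learner would therefore favour a link between land use and carbon over one between land use and soil.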
Furthermore, the complexity of environmental systems sometimes prohibits these algorithms from producing adequate (feasible and efficient) model structures, such that superfluous nodes are included which can overcomplicate the network and reduce sensitivity to variations in relevant nodes (Chen and Pollino, 2012),


which is typically referred to as “overfitting” the data. BNs benefit from modelling parsimony, meaning it is preferable to exclude peripheral variables (those with little predictive power) to improve the ability of the model to predict independent data (Borsuk, 2008).

2.2.4. Expert structure

The alternative is to construct the conceptual model using expert knowledge, where an expert or group of experts select the explanatory variables that are most likely to influence the predicted property and specify the relationships between them. At the very least, it is wise to incorporate an expert review of conceptual models built using a structured learning algorithm.

3. Case studies

This paper will discuss two case studies: i) a continuous soil property, soil bulk density, which is of significant interest for the calculation of carbon stock estimates at the landscape scale and ii) mapping of soil taxonomic units, representing a traditional, categorical approach to soil mapping. Both case studies were part of a larger project that developed a soil information system for the Republic of Ireland. The Soil Information System for the Republic of Ireland was a 5 year collaborative programme between The Irish Agriculture and Food Development Authority (Teagasc), Cranfield University and University College Dublin commissioned by the Environmental Protection Agency under their STRIVE (Science, Technology, Research and Innovation for the Environment) programme. The project developed a national association soil map for Ireland at a scale of 1:250,000, together with an associated digital soil information system, providing both spatial and quantitative information on soil types and properties across the country.
The development of a national soil information system contributed to the European Commission's need for a unified European soil map (Montanarella et al., 2005) and facilitates member state legislative requirements under INSPIRE (Infrastructure for Spatial Information in the European Community) and the proposed Soil Framework Directive. Both the map and the information system are made freely available to the public (http://gis.teagasc.ie/soils/).

3.1. Case Study 1: soil bulk density

At the start of the project, there were only a very limited number of bulk density measurements available for the Republic of Ireland. As bulk density is a key physical property, for example in the determination of soil moisture and carbon stock estimations, a research programme was undertaken to explore the feasibility of mapping bulk density at the landscape scale using both data mining and expert knowledge approaches. Until more data became available through a field programme, various methodologies were developed with data from England and Wales. This Case Study assesses the utility of BNs for predicting Db at the landscape scale. In a broader context, this will go some way to establishing whether BNs can be used for a host of other Digital Soil Mapping applications. The paper aimed to assess the extent to which BNs can be used in combination with readily available, landscape-scale data to produce physically interpretable models, which link soil Db to easy-to-obtain environmental variables. The empirically derived conditional probability distributions were tested across a number of model structures in order to produce spatial predictions of topsoil Db. At the landscape-scale, model inputs were selected so as to be explicitly non-reliant on point samples (with the exception of measured Db data).
One advantage of the approach taken is that the results can be directly compared to those obtained from ANNs and RFs models for Db built and analysed using the same dataset (reported in Taalab et al., 2012). Finally, a map of predicted Db values is produced, without interpolation, giving a uniform and quantifiable level of accuracy for the entire landscape.


3.1.1. Soil bulk density (Db)

Fine earth soil bulk density (Db) is defined as the oven-dry mass per unit volume of soil (IUSS Working Group, 2006). Many studies require Db estimates for carbon stocks, emissions and sinks of CO2 and CH4, nutrient turnover and soil erosion, but suffer from the generally low spatial resolution of Db (Finke, 2012). This is a result of the time and expense required for direct measurement at multiple sites (Tranter et al., 2007). As a consequence, Db is often estimated using pedotransfer functions (PTFs) from other measured soil properties such as particle size distribution and organic matter content (Stewart et al., 1970; Rawls, 1983; Calhoun et al., 2001). While this technique is well established, it is only really applicable at the pedon scale. Across the landscape, there is a need to account for differences in topography, parent material and land use. This has been acknowledged both implicitly, in the stratification of datasets (Steller et al., 2008; Moreira et al., 2009; Hollis et al., 2012), and explicitly, with inclusion of the landscape attributes as direct predictors (Martin et al., 2009; Jalabert et al., 2010). As such, this study uses only continuous landscape covariates (Table 1) to create landscape scale prediction of Db.

3.1.2. Study area and data

The study was conducted in an 18,150 km² region of the English Midlands, selected due to the relatively high density of pre-existing Db sample data (Fig. 2). The soils in the area are dominated by brown earths and surface water gleys, most of which have either a coarse or fine loamy texture, with some more clayey soils in the south of the region (McGrath and Loveland, 1992). A total of 342 Db samples from the A Horizon were used in this study, collected between 1970 and 1987 during the 1:25,000 and 1:50,000 soil mapping of England and Wales. Models were built using 239 training samples and validated using the remaining 103 samples.
The other covariates used in the model, which were sampled in ArcGIS 10.1 (ESRI, 2011), are detailed in Table 1.

3.1.3. BN model development

Identifying the variables of interest was a relatively straightforward procedure for two reasons. First, in a previous paper (Taalab et al., 2012) a range of available, landscape-scale environmental variables were examined using linear (multiple regression) and non-linear (RFs and ANNs) modelling techniques within the study area, which has yielded information on the relative influence of different landscape variables on Db. Secondly, as the purpose of this study is to map Db at a landscape scale, only environmental variables which can be represented at that scale (in the form of a GIS data layer) were considered as inputs (Johnson et al., 2012). When constructing the DAG for a network, it is important to explore a range of network structures (Kuhnert and Hayes, 2009). For this reason, we tested both naive and expert-derived structures, as well as a data-derived expert structure. The nodes containing continuous variables were discretized into 5 classes. After testing the two most commonly used discretization techniques (equal width and equal frequency) along with the popular ‘minimum description length principle’ algorithm (Fayyad and Irani, 1993), this study found that equal frequency discretization provided the best results (lowest error when internally validated). This is consistent with the conclusion of Aitkenhead and Aalders (2009) that when some categories within a landscape are more prevalent than others, a frequency approach often gives a better representation.

3.1.4. Deployment

The models were applied to the study area (Fig. 2) using 100 m resolution raster data of all the environmental covariates detailed in Table 1, with a total of 1,815,000 grid cells processed. During the deployment, the “expval” function in Netica 4.09 (Norsys Software Corp, 2006) was used, which generates a continuous value of bulk density, calculated as the sum of the products of the mean bulk density and probability for each class.

3.2. Case Study 2: mapping soil classes

One primary project objective of the Irish Soil Information System was to construct a soil map at 1:250,000 scale, with a harmonised national legend. The new map was developed by harmonizing existing soil survey legacy data (Gardiner and Ryan, 1969) that only covered 44% of the country. This harmonised data was subsequently used as a training dataset for predictive modelling into previously unsurveyed areas.
The predictive mapping was based on a Multiple Classifier system using predictions from Random Forests and Bayesian Belief Networks and was validated through an extensive field programme.

3.2.1. Predicting soil classes

Even in the most developed countries, it is rare to have complete, high-resolution spatial data on soil types, primarily due to the time and expense involved in traditional soil surveying (McBratney et al., 2003). Digital Soil Mapping can improve the scale and spatial coverage

Table 1. Spatial explanatory covariates used in all BNs for the prediction of Db.

| Covariate | Source | Description | No. of classes in the BN for the study area |
|---|---|---|---|
| Soil | Cranfield University - NATMAP | 1:250,000 scale National Soil Association Map of England and Wales (Hallett et al., 1996) - subgroup code | 24 |
| Land Use | Cranfield University - Representative Soil Profiles | Land use for Representative Profiles at the time of sampling | 8 |
| Land Use | Centre for Ecology and Hydrology (CEH) - Land Cover Map 2000 | Satellite imagery was classified into a 25 m raster dataset which was subsequently aggregated to a ten-class 1 km grid land cover map (Fuller et al., 2002). | 14 |
| Parent Material | Cranfield University - NATMAP | 1:250,000 scale National Soil Association Map of England and Wales (Hallett et al., 1996) - parent material lithology and subtype codes | 18 |
| Bedrock Geology | British Geological Survey (BGS) - DiGMapGB-50 | 1:50,000 geological map showing the BGS rock classification scheme (RCS) detailing the lithology of the bedrock (available for download from http://edina.ac.uk/digimap) | 27 |
| Topography | 10 m DTM (available for download from http://edina.ac.uk/digimap) | Elevation, slope, aspect, curvature (plan, profile and mean), SAGA wetness index and Iwahashi landscape classification (Iwahashi and Pike, 2007) derived using ArcGIS 10.1 (ESRI, 2011) | Discretized into 5 classes; Iwahashi - 8 classes |
| Climate | UK Meteorological Office, 1971–2000 | Average annual rainfall (mm yr−1), accumulated temperature above 0 °C, median number of field capacity days (i.e. the number of days per year that the soil moisture content is above field capacity), annual average potential evapotranspiration (mm yr−1) and maximum potential soil moisture deficit (i.e. the water required to bring the whole soil profile back to field capacity, mm) were all derived on a 5 km grid (Perry and Hollis, 2005). | Discretized into 5 classes |


K. Taalab et al. / Geoderma 259–260 (2015) 134–148

Fig. 2. The Study Area. a) In relation to England & Wales, b) map of the Soil great groups within the study area (derived from NATMAP: Avery, 1980) and the sample locations (black points). Legend classes: Brown soils, Ground-water gley soils, Lithomorphic soils, Man made soils, Peat soils, Pelosols, Podzolic soils and Surface-water gley soils; lakes and rivers are also shown.

of existing soil taxonomic maps by predicting into previously unsurveyed areas (McBratney et al., 2003). Soil legacy data, usually in the form of existing soil maps, provide information on soils based on surveyors' tacit understanding of how soil classes vary across the landscape (Carré et al., 2007) and are a resource commonly used for the calibration of predictive models in DSM (Grunwald, 2009). When pre-existing soil maps are used as a basis for sampling and combined with other landscape variables to predict soil classes across the landscape, it is possible to statistically recreate the implicit soil–landscape relationships identified by soil surveyors (Lemercier et al., 2012). Assessing the utility of BNs to predict soil classes is of interest as a fundamental facet of DSM and in the

Fig. 3. The Study Area for predicting soil types. a) The location of Tipperary North within Ireland and b) the location of the sample points for the training dataset. Legend: training data sample points; elevation (m), from 24.5155 (low) to 690.24 (high).


adoption and integration of new tools and techniques in the prediction of soil classes or attributes (Grunwald, 2009). BNs are also a flexible modelling tool, as the CPTs can be refined and amended as new data are generated when the predictions are validated by additional field observations (McCann et al., 2006).

3.2.2. Study area and data

To illustrate the use of BN modelling for DSM, the paper focuses on North Tipperary, a 2080 km² county in central Ireland (Fig. 3a). North Tipperary is characterised by limestone till lowlands dominated by Luvisols and by higher ground of shale and sandstone associated with Histosols, Podzols and Cambisols. The existing medium-scale soil map (1:250,000) was used to generate the training data for the study area. The map is based on the original 1:126,720 County Soil map (An Foras Talúntais) (Daly and Fealy, 2007), which was harmonised and rationalised for this project. The training data (Fig. 3b) were sampled at a density of approximately 12 points per km², giving a total of 25,282 points within the study area.

3.2.3. BN model development

For soil type prediction, the relationship between soil forming factors and soil taxonomic units is not immediately obvious in the absence of expert knowledge. The predictor variables of interest were selected to represent the full range of SCORPAN (McBratney et al., 2003) landscape attributes. Initially, more than two hundred environmental predictor variables were produced and considered for modelling. Using the redundancy and feature selection functions of Statistica (StatSoft, Inc., 2012), a total of 31 variables (Table 2) were then selected to build a naive BN and a naive-TAN BN using Netica 4.09. The number of variables was further reduced to nine after deploying an optimisation algorithm. In accordance with guidelines set out by Marcot et al. (2006), all continuous nodes were discretized into five classes. The boundaries of these classes were determined by equal frequency discretization, which is a robust, unsupervised technique that has been shown to provide similar results to more complex, supervised methods of discretization (Dougherty et al., 1995).
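As an illustration of equal frequency discretization, the following is a minimal sketch that places class boundaries at sample quantiles so that each of the five classes holds roughly the same number of observations. It is not the Netica implementation used in the study; the function names and the elevation values are hypothetical.

```python
import numpy as np

def equal_frequency_edges(values, n_classes=5):
    """Return n_classes + 1 boundaries placed at equally spaced sample quantiles."""
    values = np.asarray(values, dtype=float)
    quantiles = np.linspace(0.0, 1.0, n_classes + 1)
    return np.quantile(values, quantiles)

def discretize(values, edges):
    """Map each value to a class index 0..n_classes-1 using the interior boundaries."""
    values = np.asarray(values, dtype=float)
    # Only interior edges are passed, so the minimum falls in class 0
    # and the maximum falls in the last class.
    return np.digitize(values, edges[1:-1], right=False)

# Hypothetical elevations in metres (illustrative only, not data from the study).
elevation = np.array([9, 20, 35, 54, 60, 77, 90, 110, 125, 140, 200, 410], dtype=float)
edges = equal_frequency_edges(elevation, n_classes=5)
classes = discretize(elevation, edges)
counts = np.bincount(classes, minlength=5)
```

With only twelve hypothetical values the classes cannot be exactly equal in size, but their counts differ by at most one, which is the behaviour that distinguishes this approach from equal width binning.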

3.2.4. Deployment

Predictive maps were generated from a deployment dataset of nearly five million points distributed on a regular 20 m grid across the study area, corresponding to the centres of the grid cells of the DTM from which the terrain covariates were derived. The soil class at each of these points was predicted from the relationship between soil class and the environmental variables used.
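A deployment dataset of this kind can be sketched as the set of cell-centre coordinates of a regular raster. The example below is only an assumption about how such points might be generated; the function name, origin and grid dimensions are hypothetical and do not come from the study.

```python
import numpy as np

def grid_cell_centres(x_min, y_min, n_cols, n_rows, cell_size=20.0):
    """Return an (n_cols * n_rows, 2) array of cell-centre coordinates."""
    # Centres sit half a cell in from the lower-left corner of each cell.
    xs = x_min + (np.arange(n_cols) + 0.5) * cell_size
    ys = y_min + (np.arange(n_rows) + 0.5) * cell_size
    xx, yy = np.meshgrid(xs, ys)
    return np.column_stack([xx.ravel(), yy.ravel()])

# Hypothetical 5 x 4 block of 20 m cells with its lower-left corner at (0, 0).
centres = grid_cell_centres(0.0, 0.0, n_cols=5, n_rows=4)
```

Each row of `centres` would then be paired with the covariate values sampled at that location before being passed to the network for prediction.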

4. Results

4.1. Case 1: mapping soil bulk density

The predictive maps were validated using a total of 103 independent validation points. Mapping bulk density as a continuous variable allowed the determination of both R² and RMSE values. The results (Table 3) show that the BN that best described the variation in topsoil Db is the naive optimised network shown in Fig. 4, followed by the TAN learned structure network (Fig. 5), the expert structure network (Fig. 6) and the naive network. Although the performance of the TAN learned structure network was similar to that of the naive optimised network, the latter used notably fewer predictor variables. The most important variables determined by the ‘Sensitivity to Findings’ feature of Netica (the five most important are ranked in order in Table 3) were land use, climatic factors and soil association. The results are comparable to those of Taalab et al. (2012), who reported validation R² values of 0.56 and 0.55 for RFs and ANNs respectively, for the same dataset (Table 3).
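The paper describes the 'Sensitivity to Findings' ranking as being measured by the reduction of entropy. A minimal sketch of that idea follows, assuming the entropy reduction is the mutual information between the target node and a single covariate, computed from their joint probability table. The function name and both example tables are hypothetical, and this is not a reproduction of Netica's own routine.

```python
import numpy as np

def entropy_reduction(joint):
    """Mutual information I(T; X) in bits from a joint table P(T, X).

    Rows index the target classes and columns index the covariate states.
    """
    joint = np.asarray(joint, dtype=float)
    joint = joint / joint.sum()            # normalise counts to probabilities
    p_t = joint.sum(axis=1, keepdims=True)  # marginal of the target
    p_x = joint.sum(axis=0, keepdims=True)  # marginal of the covariate
    nonzero = joint > 0                     # skip empty cells (0 * log 0 = 0)
    ratio = joint[nonzero] / (p_t @ p_x)[nonzero]
    return float(np.sum(joint[nonzero] * np.log2(ratio)))

# Hypothetical tables: an uninformative covariate and an informative one.
independent = np.outer([0.5, 0.5], [0.3, 0.7])
dependent = np.array([[0.4, 0.1],
                      [0.1, 0.4]])
mi_independent = entropy_reduction(independent)
mi_dependent = entropy_reduction(dependent)
```

Ranking covariates by this quantity places those whose states most reduce uncertainty about the target first; a covariate that is independent of the target scores zero.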

Table 2. Covariates used in the optimised BN. The order of covariates in the table reflects the importance in making the prediction based on the Sensitivity to Findings analysis and measured by the reduction of entropy.

| Covariate (abbreviation) | Description | Source/method of computation | No. of classes in the BN for the study area |
|---|---|---|---|
| Soil | Teagasc: General Soil Map of Ireland, 2nd Edition | 1:575,000 scale, 44 soil associations countrywide presented at the Great Soil Group Level combined into broad physiographic divisions (Gardiner and Radford, 1980). | 16 |
| Landform (ASO2) | SOTER landform classification | Modified Dobos et al. (2005) approach. Derived from 100 m resolution DTM with circular search window of 9600 m in diameter. | 79 |
| Parent Material | Teagasc: Subsoil map | 1:50,000 scale map produced by Teagasc and re-classified during the project (Fealy et al., 2009) into 21 divisions. | 15 |
| Landform (AHLS) | Hammond landform classification | Modified Dikau et al. (1991) approach. Derived from 100 m resolution DTM with circular search window of 9600 m in diameter. | 18 |
| Geology | Geological Survey of Ireland: Bedrock Geology | The Geological Survey of Ireland map at the scale of 1:100,000, re-classified and harmonised during the project (Geological Survey of Ireland) into a new classification system for Ireland consisting of 28 divisions nationwide. | 5 |
| Topography (eza1) | Environmental Protection Agency: 20 m DTM | Total elevation range from depression to peak within a local catchment in the DTM of a flow path that runs through a given grid cell. Equivalent to zpit2peak layer derived with LandMapR software (MacMillan, 2003). Based on 20 m resolution DTM. | Continuous layer discretized into 5 classes in Netica |
| Topography (aje2) | Environmental Protection Agency: 20 m DTM | Actual river network in a vector format provided by the Irish EPA based on 1:50,000 scale Ordnance Survey mapping, with Compass Informatics Ltd and the Central Fisheries Board as contributors. Potential river network derived by reclassification of the upslope area raster obtained by analysis of 20 m DTM undertaken with the FlowMapR utility in LandMapR software (MacMillan, 2003). Drainage density calculated as the total length of actual or potential river network within local hillsheds (catchments) derived by LandMapR in 20 m resolution raster format. | Continuous layer discretized into 5 classes in Netica |
| Land cover | European Environment Agency: Land cover map | CORINE, 2000 version Jan 10, 2007, scale 1:100,000; 44 landcover classes countrywide produced by interpretation of Landsat TM and SPOT HRV satellite imagery (EEA internet link) | 20 |
| Topography (elp1) | Environmental Protection Agency: 20 m DTM | Total length of a flow path that runs through a given cell from depression to peak within a local catchment in the DTM. Equivalent to lpit2peak layer derived with LandMapR software (MacMillan, 2003). Based on 20 m resolution DTM. | Continuous layer discretized into 5 classes in Netica |

Table 3. Independently validated results for different BNs to predict bulk density.

| Network | Number of covariates | R² training | RMSE training | R² validation | RMSE validation |
|---|---|---|---|---|---|
| Naive | 17 | 0.47 | 0.20 | 0.37 | 0.19 |
| Naive optimised | 6 | 0.58 | 0.16 | 0.43 | 0.17 |
| TAN learned | 17 | 0.77 | 0.13 | 0.42 | 0.18 |
| Expert learned | 9 | 0.51 | 0.18 | 0.39 | 0.18 |
| Random Forests (a) | | | | 0.56 | 0.17 |
| Neural Networks (a) | | | | 0.55 | 0.17 |

(a) Taalab et al. (2012).

Analysing Table 3, a number of interesting observations can be made. None of the four BNs matches the results of the previous study. The naive optimised and the TAN learned networks have similar validation R² values, indicating that the inclusion of relationships between the covariates does not improve the results. In addition, the complexity of the TAN learned network might reflect specific relationships in the training data which cannot always be extrapolated to the wider landscape. The results also highlight the potential for expert knowledge structures. The table also indicates overfitting for networks using all 17 covariates. Table 4 provides a statistical analysis of the original sampling points and the predictions of the four networks.

In Fig. 7a the bars represent the average probability of each of the states across all Db classes. However, to get a better understanding of the relationship between (in this example) Db and average annual rainfall (AAR), the CPT for the node can be examined (Fig. 7b). This shows the probabilistic relationship of the five Db classes to the five AAR classes. The general trend reflected in the data is for low Db to occur in areas receiving the highest AAR (and vice versa). Using the optimised BN, the best performing model, we were able to create a continuous predicted surface of Db (Fig. 8). This model can account for nearly 50% of the variation in topsoil bulk density using the following landscape covariates: land use, average annual rainfall, median number of field capacity days, soil group, elevation, rock classification scheme, parent material and soil wetness index. The covariates are listed in order of importance in making the prediction, based on the ‘Sensitivity to Findings’ analysis and measured by the reduction of entropy (see Marcot et al., 2006).

4.2. Case 2: mapping soil taxonomic units

For this case, we constructed a set of naive, TAN learned and naive optimised networks and deployed them in the study area. Validation is based on comparing the predicted map with the original 1:250,000 soil map (Fig. 10a). Two different comparisons were undertaken: the first based on the 20 m resolution predictive map, the second on a predicted map generalised to 1:250,000 by eliminating areas smaller than or equal to 25 ha. Table 5 lists the results in the form of correct proportions, quantity disagreement and allocation disagreement for the three models. Pontius and Millones (2010) define the quantity disagreement as the amount of the difference between the reference map and a comparison map that is due to the less than perfect match in the proportions of the categories, and the allocation disagreement as the less than optimal match in the spatial allocation of the categories given the proportions of the categories in the reference and comparison maps. As with the case study for bulk density, it is obvious from Table 5 that making use of all available covariates resulted in overfitting. This highlights that covariate selection is an important component of network design.
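The two disagreement measures defined above can be computed from a confusion matrix between the comparison map and the reference map. The following is a minimal sketch under the assumption that the matrix entries can be treated directly as proportions of the study area; the function name and the example counts are hypothetical and are not results from the study.

```python
import numpy as np

def disagreement(confusion):
    """Return (proportion correct C, quantity disagreement Q, allocation disagreement A).

    Rows are categories in the comparison (predicted) map,
    columns are categories in the reference map.
    """
    p = np.asarray(confusion, dtype=float)
    p = p / p.sum()                 # convert counts to proportions
    diag = np.diag(p)               # agreement per category
    row_tot = p.sum(axis=1)         # category proportions in the comparison map
    col_tot = p.sum(axis=0)         # category proportions in the reference map
    correct = diag.sum()
    # Quantity: mismatch in category proportions, halved to avoid double counting.
    quantity = 0.5 * np.abs(row_tot - col_tot).sum()
    # Allocation: per category, twice the smaller of commission and omission, halved overall.
    allocation = np.minimum(row_tot - diag, col_tot - diag).sum()
    return correct, quantity, allocation

# Hypothetical 3-class confusion matrix of cell counts.
counts = np.array([[45, 10, 0],
                   [5, 30, 5],
                   [0, 0, 5]])
c, q, a = disagreement(counts)
```

For any confusion matrix the three quantities satisfy 1 − C = Q + A, which is the relationship used in Table 5 to report total disagreement.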

Fig. 4. The naive optimised network structure. The variables included were determined using an optimisation algorithm that selected only variables with significant predictive power based on the training data (see Table 1 for details). Nodes shown: Bulk_Density, Soil, Land_Use, Geology_RCS, Topography_Curvature_Plan, Climate_PT and Topography_Curvature.


Fig. 5. The learned network structure. The links between the variables were determined using a Tree Augmented Naive (TAN) Bayes algorithm. Nodes shown: Bulk_Density, LANDUSE_SAMPLED, Elevation, Soil_association, PM1, AT0_ANNUAL, Iwahashi, PT, Slope, PSMD, Aspect, AAR, Curvature_Plan, FCD_MED, RCS, Curvature, curvature_Profile and SWI.

The optimised model (Fig. 9) was used to make predictions on the deployment dataset and the results were mapped. The resulting raw map (Fig. 10b) was subsequently generalised to match the 1:250,000 mapping of the existing soil map (Fig. 10c). This was achieved by merging polygons with an area smaller than or equal to 25 ha into neighbouring polygons based on the length of shared border. In this fully automated approach no consideration of soil classes or expert knowledge was applied. With each class prediction, there is an associated probability based on the CPTs in the optimised network, and as such it was possible to plot a spatial representation of the probability (Fig. 10d). This gives the probability that the soil class has been identified correctly, ranging from highly probable (0.99) to improbable (0.17). Finally, a confusion matrix was constructed between the existing soil map, treated as truth data, and the raw (Fig. 10b) and generalised (Fig. 10c) predicted maps. The accuracies were lower than those obtained in the internal validation of the optimised network, the decrease being caused by: (1) comparing full extents of the predicted and actual soil maps, rather than a point sub-sample of the study area; and (2) using the training dataset as the validation dataset in the internal validation of the model.

5. Discussion

For Case Study 1, the performance of the optimised naive network is very similar to (albeit slightly lower than) that of the ANN and RF black-box modelling techniques which have previously been used to predict topsoil Db (Table 3) with the same dataset (Taalab et al., 2012). While BNs did not improve predictive performance, they have the obvious advantage of offering some process-based insight (Correa et al., 2009). For example, using the optimised naive network (Fig. 4), we can see that topsoil Db is most likely to be very high (above 1.41 g cm−3) in areas of low elevation and rainfall, on brown podzolic soils which overlie drift with siliceous stones. In conjunction with expert knowledge, this can either confirm or contradict the opinions of the expert(s). In this instance the results are plausible, as areas of low rainfall would typically be associated with low organic matter and hence high Db. Although Brown Podzolic soils would not necessarily be associated with the highest Db values, Hallett et al. (1998) found that Podzolic soils can be associated with extremely high Db values. If, however, the BN contradicts what the expert believes, it can prompt further investigation into the process, indicate a knowledge gap or reveal a problem with the model itself. If the latter is the case, it is easy to amend both the model structure and the probabilistic relationship between nodes. Identifying the source of predictive inaccuracies in a black-box model is much less straightforward. As BNs are based on process understanding, they can be used to answer specific questions using predictive reasoning, for example "what is the probability of X, given certain information?". This is a capability that black-box models do not possess.
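The kind of predictive query described above can be sketched for a naive network, in which each covariate depends only on the target node. The example below is a minimal illustration rather than the Netica model from the study: the function name, the two Db classes, the covariates and every probability are hypothetical.

```python
import numpy as np

def naive_bayes_posterior(prior, cpts, evidence):
    """Return P(target | evidence), assuming covariates are independent given the target.

    prior:    P(target), one entry per target class.
    cpts:     dict of arrays; cpts[name][k, s] = P(covariate state s | target class k).
    evidence: dict mapping covariate name to its observed state index.
              Covariates missing from the dict are simply left out of the product.
    """
    posterior = np.asarray(prior, dtype=float).copy()
    for name, state in evidence.items():
        posterior *= cpts[name][:, state]
    return posterior / posterior.sum()

# Hypothetical target classes: index 0 = low Db, index 1 = high Db.
prior = np.array([0.5, 0.5])
cpts = {
    # Columns: low rainfall, high rainfall.
    "rainfall": np.array([[0.2, 0.8],
                          [0.7, 0.3]]),
    # Columns: arable, grassland.
    "land_use": np.array([[0.4, 0.6],
                          [0.6, 0.4]]),
}

# Query with both covariates observed: low rainfall, arable land.
full = naive_bayes_posterior(prior, cpts, {"rainfall": 0, "land_use": 0})
# Query with land use unobserved, to show that missing evidence is tolerated.
partial = naive_bayes_posterior(prior, cpts, {"rainfall": 0})
```

In this hypothetical setting, observing low rainfall and arable land raises the probability of the high Db class well above its prior, and the same function still returns a valid distribution when one covariate is not observed.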

Fig. 6. The expert-knowledge network structure. The variables included and the links between variables were determined using expert knowledge (see Table 1 for details). Nodes shown: Bulk_Density, AT0_ANNUAL, Topography_Elevation, Climate_PT, Climate_AAR, Parent_Material, Landuse, Topography_SWI, Soil and Landform_Iwahashi.

Furthermore, BNs are also capable of diagnostic reasoning. For example, given an outcome, we can predict the conditions likely to lead to this outcome. Such reasoning has already been applied to predict the locations of suitable habitat for endangered species (Smith et al., 2007) and, more pertinently for the Digital Soil Mapping community, to assess spatially the risk of peat erosion (Aalders et al., 2011).

Table 4. Descriptive statistics for the bulk density samples.

| Statistic | Laboratory Data | Naive | TAN Learned structure | Naive optimised | Expert structure |
|---|---|---|---|---|---|
| Valid N | 103 | 91 | 91 | 93 | 102 |
| Mean | 1.232 | 1.251 | 1.211 | 1.227 | 1.203 |
| Median | 1.250 | 1.279 | 1.211 | 1.230 | 1.199 |
| Minimum | 0.680 | 0.750 | 0.750 | 0.764 | 0.960 |
| Maximum | 1.670 | 1.578 | 1.580 | 1.551 | 1.396 |
| Std. dev. | 0.222 | 0.197 | 0.210 | 0.153 | 0.090 |
| Skewness | −0.293 | −0.853 | −0.209 | −0.527 | −0.450 |
| Kurtosis | −0.532 | 0.780 | −0.186 | 0.835 | 0.725 |

That the best performing BN was a naive optimised network is, at first, surprising, as generally the best BNs are those which combine an expert derived structure with a series of conditional probabilities calculated from measured data (Nadkarni and Shenoy, 2004). Naive networks are the simplest form of Belief Networks and assume that the value of a particular covariate is independent of the value of any other covariate, given the value of the predicted variable. However, despite their naive design and apparently oversimplified assumptions, naive Bayes classifiers have worked quite well in many complex real-world situations. Zhang (2004) showed in an analysis of the Bayesian classification problem that there are sound theoretical reasons for the apparently implausible efficacy of naive Bayes classifiers even though the conditional independence assumption is violated. However, in this study, a further optimised network approach was introduced to reduce the number of environmental covariates, thereby removing many of the highly correlated covariates and satisfying the independence assumptions to a much greater degree. Finally, the Tree Augmented Naive (TAN) Bayes approach relaxes the independence assumption by explicitly introducing links between covariates which are highly correlated.

The performance of the TAN BN is also of interest. Friedman et al. (1997) found TAN BNs to be far superior to naive BNs. Although it outperformed the naive network in this study, it did not outperform the naive optimised network. This can be attributed to the relative lack of data which, in this study, has led to overfitting of CPTs (the R² value of the predicted vs observed values of the training data was 0.77), meaning random error associated with these data was included in the model. This highlights the importance of testing models using independent data to get an accurate estimate of a model's predictive power. Although in third place, the performance of the expert learned network demonstrates the capability of BNs to capture expert knowledge in the absence of hard data.

Although BNs explicitly model uncertainty, they are themselves subject to second order uncertainty. The uncertainty associated with BNs typically comes from inadequate datasets, bias or a lack of understanding within expert opinion and from imperfect representation of real life by the model structure. There is, however, no way of distinguishing between the sources, which makes formalising this uncertainty itself, in the form of a probability distribution, uncertain. Hence, getting a genuine idea of model performance requires testing using independent data (Krueger et al., 2012). Often BNs are not subject to any validation (with the justification that the modelling approach is often applied specifically to situations where data are scarce). Aguilera et al. (2011) point out that, of the BNs which have been reported to solve regression or

classification problems in environmental science between 1990 and 2010, fewer than thirty percent were tested using independent data. This is problematic for this type of modelling, because it will lead to BNs being compared unfavourably with other data mining techniques which are more routinely validated empirically. Jakeman et al. (2006) suggest that evaluation should go beyond the quantitative and include a subjective review of the utility and transparency of the model.

Fig. 7. Conditional probability table for AAR: a) average probabilities for each state in the node AAR (average annual rainfall) and b) the conditional probability table (CPT) linking bulk density and AAR.

In the second Case Study, implementing a TAN structured learning approach increased the predictive power of the model substantially. The results show that BNs are a viable tool for DSM applications. For all models the error rate generated using internal validation was similar to those reported for other studies predicting soil class or property using landscape variables, which were validated in the same manner (Lemercier et al., 2012; Moran and Bui, 2002). The accuracy of results often decreases when models are tested with independent data. In another study regarding soil class prediction, Grinand et al. (2008) found that soil type classification accuracy using independent data fell by up to 40%, although their internal validation provided much more accurate results, suggesting that the model was overfit to the training data. In reality, we might expect the classification accuracy to decrease to around 50% (Lemercier et al., 2012).

Fig. 8. A continuous spatial prediction of Db made using the optimised naive network.

While the results do not suggest a dramatic improvement in predictive accuracy compared to data mining approaches, the use of BNs is appealing for a variety of reasons. In recent years, the focus of DSM has moved away from straightforward classification of soils towards developing a better understanding of the spatial distribution of soils in relation to the wider environment (Grunwald, 2009). This is necessary in order to resolve challenges such as climate change, desertification and food production, which are putting increasing pressure on soils as a resource (Hartemink and McBratney, 2008). There are many advantages to applying BNs to soil mapping, including their ability to handle missing data and their explicit handling of uncertainty. The major appeal of BNs, however, is their clarity, which allows experts to judge whether the model makes pedogenic sense and to develop a better understanding of the environmental processes driving variations in soil properties.

Bayesian Networks can, however, deal with continuous variables in only a limited manner (Friedman and Goldzmidt, 1996). The usual solution is to discretize the variables and build the model over the discrete domain. There is a trade-off, however, as the discretization can only capture rough characteristics of the original distribution (Friedman and Goldzmidt, 1996), and we may lose statistical power if the relationship between the variables is, in fact, linear (Myllymäki et al., 2002). On the other hand, we gain the ability to use the reasoning machinery of BNs, which is especially efficient if the relationships between the variables are non-linear and complex (Myllymäki et al., 2002). How to discretize the data is a more difficult question. Automatic data discretization techniques have been developed and discussed, but no satisfactory automatic discretization methods for Bayesian Networks have been found (Uusitalo, 2007). Thus, finding a discretization that can be reasonably interpreted in terms of the study problem, generally given by domain experts, remains the best solution.

There are a number of additional limitations according to Kelly et al. (2013), Uusitalo (2007) and Marcot et al. (2006): (i) Probabilistic relations within BNs reflect uncertainty in model parameterization, not model structure. Assessment of structural uncertainty is often neglected, but can be addressed by building and comparing outputs from alternative models based on different hypotheses about the system, (ii) BN models are rarely explicitly spatial or temporal, although

lumped representations of space and time are occasionally used. This is not necessarily due to a limitation in the method; it has more to do with the nature of applications to which BNs have been applied in the past and (iii) Bayesian Networks are acyclic, and thus do not support feedback loops that would sometimes be beneficial in environmental modelling. Temporal or spatial dynamics can be modelled in BNs

Table 5. Comparison of performance of the naive and optimised naive BNs used to predict soil taxonomic units.

| Network | Map | Covariates | Training error | Proportion correct (C) | Quantity disagreement (Q) | Allocation disagreement (A) | Total disagreement (D) |
|---|---|---|---|---|---|---|---|
| Naive | Raw | 31 | 0.15 | 0.651 | 0.174 | 0.175 | 0.349 |
| Naive | 250k | 31 | 0.15 | 0.667 | 0.179 | 0.154 | 0.333 |
| Naive opt | Raw | 9 | 0.36 | 0.603 | 0.135 | 0.262 | 0.397 |
| Naive opt | 250k | 9 | 0.36 | 0.613 | 0.139 | 0.247 | 0.387 |
| TAN | Raw | 31 | | 0.716 | 0.066 | 0.218 | 0.284 |
| TAN | 250k | 31 | | 0.738 | 0.066 | 0.197 | 0.262 |

D = 1 − C = Q + A.

Fig. 9. The optimised Naive Bayesian Network (see Table 2 for details). Nodes shown: Associations, Soil, Landuse, Parent_Material, Geology, Terrain_elp, Terrain_aje, Terrain_eza, Landform_ASO and Landform_AHLS.

146

K. Taalab et al. / Geoderma 259–260 (2015) 134–148

association. The ancillary soils in Patrickswell 1 are well drained soils (Brown Earths) compared with poorly drained (groundwater gleys) and peaty soils in Patrickswell 3. When comparing these associations in the feature space they are differentiated primarily on morphometric parameters such as the height between pit and peak features (EZA) and the landscape wetness (AJE) (Table 6). The EZA parameter in Patrickswell 3 indicates greater elevation contrast in the landscape, where there is greater depth between local minima (pits) and maxima (peak). The ancillary groundwater gley and peat soils are likely to occupy these local depressions highlighted by the contrast between pit and peak features. In Patrickswell 1 the landscape is more subdued (undulating) having less contrast between the pit and peaks. This indicates fewer hydrologically significant depressions and thus better drained soils in the association overall, which is reflected in the ancillary soils (Brown soils).

6. Conclusions

Fig. 10. Predictive maps and associated uncertainty. a) The pre-existing 1:250,000 soil association map, b) the raw predictive map of soil associations, c) the rationalised predictive map of soil associations, and d) the probability that the rationalised predictive map is correct.

using a separate network for each time slice; however, this is often very tedious. The greatest advantage of a Bayesian Belief Network is that the causal relationships between the environmental covariates and the predicted class, in this case soil associations, can be both graphically investigated as well as by analysis of the underlying Conditional Probability tables. For example, the efficacy of predicting soil classes can be assessed by analysing the conditional probabilities for two different but closely related soil associations 411a (Patrickswell 1) and 411c (Patrickswell 3). These two associations share the same dominant soil (Luvisol) but have different ancillary soils as components of the soil

Table 6 Comparison of two similar soil associations in feature space, showing the dominant class for each co-variate. GSM — small scale national soil map, SBS — parent material, GEO — geology, COR — CORINE landcover map, ELP — distance between pit and peak, EZA — height between pit and peak, AJE — drainage density index and ASO — SOTER landform classification. Covariate 411a Patrickwell 1

411c Patrickwell 3

GSM32 SBS15 GEO4 COR24 ELP EZA AJE ASO

Minimal Grey Brown Podzolic 80% Drift_limestone Limestone Pastures 0–7000 70–180 0.5–0.8 A3Yc — Slope 0–2%, Relief Intensity 100–300 m, PDD above 90 and Elevation 50–100 m

Minimal Grey Brown Podzolic 70% Drift_limestone Limestone Pastures 0–7000 b1.11 0.8–0.9 A3Yc — Slope 0–2%, Relief Intensity 100–300 m, PDD above 90 and Elevation 50–100 m

BNs provide a feasible alternative to the black-box data mining techniques often applied to the modelling and mapping of soil properties. Both their ease of interpretation and their ability to deal explicitly with uncertainty set them apart. There are numerous approaches for applying BNs to the prediction of soil properties, many of which remain relatively unexploited (Aguilera et al., 2011). It is important to stress that the cornerstone of good practice for the application of BNs is clarity throughout the modelling process (Chen and Pollino, 2012). A clear record of the choices and assumptions that underpin the model, in terms of parameterisation, model structure, elicitation and evaluation techniques, is critical to ensure that the modelling approach remains credible. For Digital Soil Mapping, BNs provide a logical way of structuring knowledge, which can be used to disentangle complex processes, as well as ‘filling in gaps’ in empirical data. Many soil mapping applications are essentially attempts to formalise a soil surveyor's thought process, where the BN will perform expert-like reasoning. While this does not require expert opinion, as both the structure and CPTs can be ‘learned’ from data, all BNs will benefit from some form of expert evaluation to ensure that the relationships between variables are scientifically sound. This paper has demonstrated the effectiveness of BNs for quantitative prediction of a soil physical property (Db) and qualitative prediction of soil class. In both cases the results were comparable to those obtained using black-box modelling techniques, with the benefit that the modelling process is much easier to interpret. Our study of soil Db has shown that, where possible, model validation using independent data is invaluable. This is because expert judgement will always contain a measure of uncertainty reflecting both knowledge gaps and inherent natural variation, which are difficult to separate.
Current limitations in the ability of BNs to make highly accurate spatial predictions are offset by the clarity of the modelling approach (the process by which predictions are made) and the ability to model future scenarios (e.g. for different land use or climate regimes; Chen and Pollino, 2012). This paper examined the potential of Bayesian Belief Networks as an additional tool in Digital Soil Mapping. For Case Study 1, the paper demonstrated the limitations of Bayesian Belief Networks for continuous data and the potential loss of predictive power due to the discretization process. However, Bayesian modelling to predict bulk density (Db) incorporating expert knowledge provides a valid methodology for evaluating the expert's priors post hoc. For Case Study 2, the paper has demonstrated the value of Bayesian Belief Networks for both mapping and system understanding of soil class maps. Although naive networks do violate the independence assumption, Zhang (2004) showed that there are sound theoretical reasons for the apparently implausible efficacy of naive Bayes classifiers. In addition, the use of the TAN learning algorithm provides a tool to generate networks that adhere to the independence assumptions.
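To make the independence assumption of a naive network concrete, the following minimal sketch computes the posterior probability of two soil associations from a prior and one conditional probability table per covariate. It is not the software or the parameterisation used in this study: the function name, the `low`/`high` EZA states and every probability value are hypothetical, chosen only to show the arithmetic. Only the class labels (411a, 411c) and the covariate names (EZA, AJE) are borrowed from the text.

```python
import math


def naive_bayes_posterior(priors, cpts, evidence):
    """Return P(class | evidence), treating the covariates as
    conditionally independent given the class (the naive assumption).

    Assumes every CPT entry used is strictly positive.
    """
    log_scores = {}
    for cls, prior in priors.items():
        log_p = math.log(prior)
        for covariate, state in evidence.items():
            log_p += math.log(cpts[covariate][cls][state])
        log_scores[cls] = log_p
    # Normalise in log space to avoid underflow with many covariates.
    peak = max(log_scores.values())
    weights = {cls: math.exp(s - peak) for cls, s in log_scores.items()}
    total = sum(weights.values())
    return {cls: w / total for cls, w in weights.items()}


# Hypothetical prior and CPTs: P(state | class) for each covariate.
priors = {"411a": 0.5, "411c": 0.5}
cpts = {
    "EZA": {
        "411a": {"low": 0.2, "high": 0.8},
        "411c": {"low": 0.7, "high": 0.3},
    },
    "AJE": {
        "411a": {"0.5-0.8": 0.6, "0.8-0.9": 0.4},
        "411c": {"0.5-0.8": 0.3, "0.8-0.9": 0.7},
    },
}

posterior = naive_bayes_posterior(priors, cpts, {"EZA": "high", "AJE": "0.5-0.8"})
for cls, prob in posterior.items():
    print(f"{cls}: {prob:.3f}")
```

Because each covariate contributes one multiplicative term per class, the CPT columns for two associations can be read side by side to see which covariates separate them. A TAN network relaxes the assumption by letting each covariate depend on one other covariate in addition to the class.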


Acknowledgements
This work is part of the Irish Soils Information Project and is jointly funded by Teagasc (The Irish Agriculture and Food Development Authority), RMIS 5809, and the Environmental Protection Agency, 2007-CD1-1-S1, under the National Development Plan (2007–2013). The authors would also like to thank the anonymous reviewers for their valuable comments.

References
Aalders, I., Hough, R.L., Towers, W., 2011. Risk of erosion in peat soils – an investigation using Bayesian belief networks. Soil Use Manag. 27, 538–549.
Aguilera, P.A., Fernandez, A., Fernandez, R., Rumi, R., Salmeron, A., 2011. Bayesian Networks in environmental modelling. Environ. Model. Softw. 26 (12).
Agyare, W.A., Park, S.J., Vlek, P.L.G., 2007. Artificial neural network estimation of saturated hydraulic conductivity. Vadose Zone J. 6, 423–431.
Aitkenhead, M.J., Aalders, I.H., 2009. Predicting land cover using GIS, Bayesian and evolutionary algorithm methods. J. Environ. Manag. 90, 236–250.
Avery, B.W., 1980. Soil classification for England and Wales [higher categories]. Technical Monograph, Soil Survey of England and Wales 14.
Bayes, T., 1783. Essay towards solving a problem in the doctrine of chances. Philos. Trans. R. Soc. 53, 370–418.
Borsuk, M.E., 2008. Ecological Informatics: Bayesian Networks. In: Jorgensen, S.E., Fath, B. (Eds.), Encyc Ecol. Elsevier, pp. 307–317.
Braakhekke, M.C., Wutzler, T., Beer, C., Kattge, J., Schrumpf, M., Schöning, I., Hoosbeek, M.R., Kruijt, B., Kabat, P., Reichstein, M., 2012. Modeling the vertical soil organic matter profile using Bayesian parameter estimation. Biogeosci. Discuss. 9, 11239–11292.
Bui, E.N., Loughhead, A., Corner, R., 1999. Extracting soil–landscape rules from previous soil surveys. Aust. J. Soil Res. 37 (3), 495–508.
Cain, J., 2001. Planning improvements in natural resources management. Guidelines for Using Bayesian Networks to Support the Planning and Management of Development Programmes in the Water Sector and Beyond. Centre for Ecology and Hydrology, Wallingford, UK.
Calhoun, F.G., Smeck, N.E., Slater, B.L., Bigham, J.M., Hall, G.F., 2001. Predicting bulk density of Ohio soils from morphology, genetic principles, and laboratory characterization data. Soil Sci. Soc. Am. J. 65, 811–819.
Carré, F., McBratney, A.B., Mayr, T., Montanarella, L., 2007. Digital soil assessments: beyond DSM. Geoderma 142 (1–2), 69–79.
Charniak, E., 1991. Bayesian Networks without tears. AI Mag. 12 (4), 50.
Chen, S.H., Pollino, C.A., 2012. Good practice in Bayesian Network modelling. Environ. Model. Softw. 37, 134–145.
Cook, S.E., Corner, R.J., Grealish, G., Gessler, P.E., Chartres, C.J., 1996. A rule-based system to map soil properties. Soil Sci. Soc. Am. J. 60, 1893–1900.
CORINE, 2000. Corine Land Cover 2000 seamless vector data. European Environment Agency (http://www.eea.europa.eu/data-and-maps/data/corine-land-cover-2000clc2000-seamless-vector-database-4).
Corner, R.J., Hickey, R.J., Cook, S.E., 2002. Knowledge based soil attribute mapping in GIS: the Expector method. Trans. GIS 6 (4), 383–402.
Correa, M., Bielza, C., Pamies-Teixeira, J., 2009. Comparison of Bayesian Networks and artificial neural networks for quality detection in a machining process. Expert Syst. Appl. 36 (3), 7270–7279.
Corstanje, R., Grunwald, S., Lark, R.M., 2008. Inferences from fluctuations in the local variogram about the assumption of stationarity in the variance. Geoderma 143, 123–132.
Daly, K., Fealy, R., 2007. Digital Soil Information System for Ireland Scoping Study (2005S-DS-22-M1) Final Report. Environmental Protection Agency, Ireland.
Degroot, M.H., 1988. A Bayesian view of assessing uncertainty and comparing expert opinion. J. Stat. Plan. Infer. 20 (3).
Dikau, R., Brabb, E.E., Mark, R.M., 1991. Landform classification of New Mexico by computer. U.S. Geological Survey Open File Report 91-634.
Dlamini, W.M., 2010. A Bayesian belief network analysis of factors influencing wildfire occurrence in Swaziland. Environ. Model. Softw. 25 (2).
Dobos, E., Daroussin, J., Montanarella, L., 2005. An SRTM-based procedure to delineate SOTER Terrain Units on 1:1 and 1:5 million scales, EUR 21571 EN. Office for Official Publications of the European Communities, Luxembourg (55 pp., http://eusoils.jrc.it/ESDB_Archive/eusoils_docs/Images/EUR21571EN.pdf).
Dougherty, J., Kohavi, R., Sahami, M., 1995. Supervised and unsupervised discretization of continuous features. In: Prieditis, A., Russell, S. (Eds.), Proc 12th Int Con Machine Learn. Morgan Kaufmann, San Francisco, CA, pp. 194–202.
Duda, R.O., Hart, P.E., 1973. Pattern Classification and Scene Analysis. John Wiley & Sons, New York.
ESRI, 2011. ArcGIS 10.1 Geographical Information System. ESRI, Redlands, California.
Fayyad, U.M., Irani, K.B., 1993. Multi-interval discretization of continuous-valued attributes for classification learning. Proceedings of the Thirteenth International Joint Conference on Artificial Intelligence. Morgan Kaufmann, San Francisco, CA, pp. 1022–1027.
Fealy, R.M., Green, S., Loftus, M., Meehan, R., Radford, T., Cronin, C., Bulfin, M., 2009. Teagasc EPA Soils and Subsoils Mapping Project – Final Report. Volume I, Teagasc, Dublin.
Finke, P.A., 2012. On digital soil assessment with models and the Pedometrics agenda. Geoderma 171, 3–15.
Friedman, N., Goldszmidt, M., 1996. Discretizing continuous attributes while learning Bayesian Networks. Proceedings of the 13th International Conference on Machine Learning (ICML). Morgan Kaufmann Publishers, San Francisco, CA, pp. 157–165.


Friedman, N., Geiger, D., Goldszmidt, M., 1997. Bayesian Network classifiers. Mach. Learn. 29.
Fuller, R.M., Smith, G.M., Sanderson, J.M., Hill, R.A., Thomson, A.G., 2002. The UK Land Cover Map 2000: construction of a parcel-based vector map from satellite images. Cartogr. J. 39, 15–25.
Gardiner, M.J., Radford, T., 1980. Ireland: general soil map. An Foras Talúntais (now Teagasc), Dublin, Ireland, 2nd ed.
Gardiner, M.J., Ryan, P., 1969. A new generalised soil map of Ireland and its land-use interpretation. Ir. J. Agric. Res. 95–109.
Geological Survey of Ireland, d. http://www.gsi.ie/.
Grimm, R., Behrens, T., Maerker, M., Elsenbeer, H., 2008. Soil organic carbon concentrations and stocks on Barro Colorado Island – digital soil mapping using Random Forests analysis. Geoderma 146, 102–113.
Grinand, C., Arrouays, D., Laroche, B., Martin, M.P., 2008. Extrapolating regional soil landscapes from an existing soil map: sampling intensity, validation procedures, and integration of spatial context. Geoderma 143, 180–190.
Grunwald, S., 2009. Multi-criteria characterization of recent digital soil mapping and modeling approaches. Geoderma 152, 195–207.
Hallett, S.H., Jones, R.J.A., Keay, C.A., 1996. Environmental information systems developments for planning sustainable land use. Int. J. Geogr. Inf. Sci. 10, 47–64.
Hallett, S.H., Hollis, J.M., Keay, C.A., 1998. Derivation and evaluation of a set of pedogenically-based empirical algorithms for predicting bulk density in British soils. http://www.landis.org.uk/downloads/index.cfm_Predicting_Bulk_Density.pdf.
Hartemink, A.E., McBratney, A., 2008. A soil science renaissance. Geoderma 148, 123–129.
Hollis, J.M., Hannam, J., Bellamy, P.H., 2012. Empirically-derived pedotransfer functions for predicting bulk density in European soils. Eur. J. Soil Sci. 63, 96–109.
Hough, R.L., Towers, W., Aalders, I., 2010. The risk of peat erosion from climate change: land management combinations: an assessment with Bayesian belief networks. Hum. Ecol. Risk. Assess. 16, 962–976.
IUSS Working Group WRB, 2006. World Reference Base for Soil Resources. second ed. World Soil Resources Rep 103. FAO, Rome.
Iwahashi, J., Pike, R.J., 2007. Automated classifications of topography from DEMs by an unsupervised nested-means algorithm and a three-part geometric signature. Geomorphology 86, 409–440.
Jakeman, A.J., Letcher, R.A., Norton, J.P., 2006. Ten iterative steps in development and evaluation of environmental models. Environ. Model. Softw. 21, 602–614.
Jalabert, S.S.M., Martin, M.P., Renaud, J., Boulonne, L., Jolivet, C., Montanarella, L., Arrouays, D., 2010. Estimating forest soil bulk density using boosted regression modelling. Soil Use Manag. 26, 516–528.
Jenny, H., 1941. Factors of Soil Formation. McGraw-Hill, New York, USA.
Jensen, F.V., 1996. An Introduction to Bayesian Networks. UCL Press, London.
Jensen, F.V., 2001. Bayesian Networks and Decision Graphs. Springer-Verlag, New York 0387-95259-4.
Jiang, L., Zhang, H., Cai, Z., Su, J., 2005. Learning tree augmented naive Bayes for ranking. Proceedings of the 10th International Conference on Database Systems for Advanced Applications. Springer-Verlag, Berlin, pp. 688–698.
Johnson, S., Low-Choy, S., Mengersen, K., 2012. Integrating Bayesian Networks and geographic information systems: good practice examples. Integr. Environ. Assess. Manag. 8.
Kelly, R.A., Jakeman, A.J., Barreteau, O., Borsuk, M.E., ElSawah, S., Hamilton, S.H., Henriksen, H.J., Kuikka, S., Maier, H.R., Rizzoli, A.E., van Delden, H., Voinov, A., 2013. Selection among five common modelling approaches for integrated environmental assessment and management. Environ. Model. Softw. 47, 159–181.
Krueger, T., Page, T., Hubacek, K., Smith, L., Hiscock, K., 2012. The role of expert opinion in environmental modelling. Environ. Model. Softw. 36, 2012.
Kuhnert, P.M., Hayes, K.R., 2009. How believable is your BBN? 18th World IMACS/MODSIM Congress, Cairns, Australia, 13–17 July 2009.
Lemercier, B., Lacoste, M., Loum, M., Walter, C., 2012. Extrapolation at regional scale of local soil knowledge using boosted classification trees: a two-step approach. Geoderma 171, 1–98.
Liaw, A., Wiener, M., 2002. Classification and Regression by randomForest. R News 2, 18–22 (http://cran.r-project.org/doc/Rnews/).
MacMillan, R.A., 2003. LandMapR© Software Toolkit-C++ Version: Users Manual. LandMapper Environ. Sol. Inc., Edmonton, AB (110 pp.).
Marcot, B.G., Steventon, J.D., Sutherland, G.D., McCann, R.K., 2006. Guidelines for developing and updating Bayesian belief networks applied to ecological modeling and conservation. Can. J. For. Res. 36, 3063–3074.
Martin, M.P., Lo Seen, D., Boulonne, L., Jolivet, C., Nair, K.M., Bourgeon, G., Arrouays, D., 2009. Optimizing pedotransfer functions for estimating soil bulk density using boosted regression trees. Soil Sci. Soc. Am. J. 73, 485–493.
Mayr, T., Rivas-Casado, M., Bellamy, P., Palmer, R., Zawadzka, J., Corstanje, R., 2010. Two methods for using legacy data in digital soil mapping. In: Boettinger, J.L., Howell, D.W., Moore, A.C., Hartemink, A.E., Kienast-Brown, S. (Eds.), Digital Soil Mapping: Bridging Research, Environmental Application, and Operation. Springer, Dordrecht, pp. 191–202.
McBratney, A.B., Odeh, I.O.A., 1997. Application of fuzzy sets in soil science: fuzzy logic, fuzzy measurements and fuzzy decisions. Geoderma 77, 85–113.
McBratney, A.B., Santos, M.L.M., Minasny, B., 2003. On digital soil mapping. Geoderma 117, 3–52.
McCann, R.K., Marcot, B.G., Ellis, R., 2006. Bayesian belief networks: applications in ecology and natural resource management. Can. J. For. Res. 36, 3053–3062.
McCloskey, J.T., Lilieholm, R.J., Cronan, C., 2011. Using Bayesian belief networks to identify potential compatibilities and conflicts between development and landscape conservation. Landsc. Urban Plan. 101, 190–203.
McGrath, S.P., Loveland, P.J., 1992. The Geochemical Atlas of England and Wales. Blackie, London, p. 101.


Montanarella, L., Jones, R.J.A., Dusart, J., 2005. The European Soil Bureau Network. In: Jones, R.J.A., Houskova, B., Bullock, P., Montanarella, L. (Eds.), Soil Resources of Europe, second edition. European Soil Bureau Research Report No. 9, EUR 20559 EN, pp. 3–14.
Moran, C.J., Bui, E.N., 2002. Spatial data mining for enhanced soil map modelling. Int. J. Geogr. Inf. Sci. 16, 533–549.
Moreira, C.S., Brunet, D., Verneyre, L., Sa, S.M.O., Galdos, M.V., Cerri, C.C., Bernoux, M., 2009. Near infrared spectroscopy for soil bulk density assessment. Eur. J. Soil Sci. 60, 785–791.
Myllymäki, P., Silander, T., Tirri, H., Uronen, P., 2002. B-Course: a web-based tool for Bayesian and causal data analysis. Int. J. Artif. Intell. Tools 11 (3), 369–387.
Nadkarni, S., Shenoy, P.P., 2004. A causal mapping approach to constructing Bayesian Networks. Decis. Support. Syst. 38, 259–281.
Norsys Software Corp, 2006. Sensitivity to findings. Netica Software Documentation (www.norsys.com).
Pearl, J., 1988. Probabilistic Reasoning in Intelligent Systems: Networks of Plausible Inference. Morgan Kaufmann.
Perry, M., Hollis, D., 2005. The generation of monthly gridded datasets for a range of climatic variables over the UK. Int. J. Climatol. 25, 1041–1054.
Pontius, R.G., Millones, M., 2010. Death to Kappa: birth of quantity disagreement and allocation disagreement for accuracy assessment. Int. J. Remote Sens. 32, 4407–4429.
Rawls, W.J., 1983. Estimating soil bulk-density from particle-size analysis and organic-matter content. Soil Sci. 135, 123–125.
Robinson, J.W., Hartemink, A.J., 2010. Learning non-stationary dynamic Bayesian Networks. J. Mach. Learn. Res. 11, 3647–3680.
SAGA GIS, 2012. System for Automated Geoscientific Analysis.
Skidmore, A.K., Watford, F., Luckananurug, P., Ryan, P.J., 1996. An operational GIS expert system for mapping forest soils. Photogramm. Eng. Remote Sens. 62 (5), 501–511.
Smith, C.S., Howes, A.L., Price, B., McAlpine, C.A., 2007. Using a Bayesian belief network to predict suitable habitat of an endangered mammal – the Julia Creek dunnart (Sminthopsis douglasi). Biol. Conserv. 139, 333–347.
Statistica, 2012. Statistics and Analytics Software Package 11.0. StatSoft.
Steller, R.M., Jelinski, N.A., Kucharik, C.J., 2008. Developing models to predict soil bulk density in southern Wisconsin using soil chemical properties. J. Integr. Biosci. 6 (1), 53–63.
Stewart, V.I., Adams, W.A., Abdulla, H.H., 1970. Quantitative pedological studies on soils derived from Silurian mudstones. 2. Relationship between stone content and apparent density of fine earth. J. Soil Sci. 21, 248–255.
Suuster, E., Ritz, C., Roostalu, H., Kolli, R., Astover, A., 2012. Modelling soil organic carbon concentration of mineral soils in arable land using legacy soil data. Eur. J. Soil Sci. 63, 351–359.
Taalab, K.P., Corstanje, R., Creamer, R., Whelan, M.J., 2012. Modeling soil bulk density at the landscape scale and its contributions to C stock uncertainty. Biogeosci. Discuss. 9, 18831–18864.
Tavares Wahren, F., Tarasiuk, M., Mykhnovych, A., Kit, M., Feger, K.H., Schwärzel, K., 2012. Estimation of spatially distributed soil information: dealing with data shortages in the Western Bug Basin, Ukraine. Environ. Earth Sci. 65, 1501–1510.
Tranter, G., Minasny, B., McBratney, A.B., Murphy, B., McKenzie, N.J., Grundy, M., Brough, D., 2007. Building and testing conceptual and empirical models for predicting soil bulk density. Soil Use Manag. 23, 437–443.
Uusitalo, L., 2007. Advantages and challenges of Bayesian Networks in environmental modelling. Ecol. Model. 203, 312–318.
Wiesmeier, M., Barthold, F., Blank, B., Koegel-Knabner, I., 2011. Digital mapping of soil organic matter stocks using Random Forest modeling in a semi-arid steppe ecosystem. Plant Soil 340, 7–24.
Zhang, H., 2004. The Optimality of Naïve Bayes. FLAIRS 2004 Conference. Miami Beach, Florida, USA.
Zhao, Z., Yang, Q., Benoy, G., Chow, T.L., Xing, Z., Rees, H.W., Meng, F., 2010. Using artificial neural network models to produce soil organic carbon content distribution maps across landscapes. Can. J. Soil Sci. 90, 75–87.
