Spatial data mining for enhanced soil map modelling

Int. J. Geographical Information Science, 2002, vol. 16, no. 6, 533–549

Research Article

Spatial data mining for enhanced soil map modelling

CHRISTOPHER J. MORAN and ELISABETH N. BUI
CSIRO Land and Water, GPO Box 1666, ACT 2601, Australia; e-mail: [email protected]

(Received 22 January 2001; accepted 16 November 2001)

Abstract. The principle of using induction rules based on spatial environmental data to model a soil map has previously been demonstrated. Whilst the general pattern of classes of large spatial extent and those with close association with geology were delineated, small classes and the detailed spatial pattern of the map were less well rendered. Here we examine several strategies to improve the quality of the soil map models generated by rule induction. Terrain attributes that are better-suited to landscape description at a resolution of 250 m are introduced as predictors of soil type. A map sampling strategy is developed. Classification error is reduced by using boosting rather than cross-validation to improve the model. Further, the benefit of incorporating the local spatial context for each environmental variable into the rule induction is examined. The best model was achieved by sampling in proportion to the spatial extent of the mapped classes, boosting the decision trees, and using spatial contextual information extracted from the environmental variables.

1. Introduction

The Murray-Darling Basin (MDB), some 1.1 × 10⁶ km² in eastern Australia (figure 1), is an important agricultural production region. Considerable areas of the

Figure 1. Location of the Toowoomba map in Murray-Darling River Basin.

International Journal of Geographical Information Science ISSN 1365-8816 print/ISSN 1362-3087 online © 2002 Taylor & Francis Ltd http://www.tandf.co.uk/journals DOI: 10.1080/13658810210138715


Basin are not covered by any detailed form of land-resource inventory. This paper presents a spatial modelling method whereby the soil-landscape models, developed in areas that have been surveyed, are captured for subsequent extension across unmapped areas (Bui and Moran, submitted). Given the extent of the MDB only a restricted set of data sources with complete basin-wide coverage was suitable for use (these are described in detail later). Significantly, the best resolution of a digital elevation model of the region is 9″ (approx. 250 m when gridded). This, therefore, constrained the resolution at which spatial modelling of soil distribution was undertaken. An important precursor was to establish whether data at such low resolution contained sufficient information to predict soil type distribution. A test area with 1:100 000 scale mapping from Toowoomba, Queensland (Thompson and Beckmann 1959) was chosen.

Bui et al. (1999) reported an initial attempt to use environmental variables to develop a set of rules that could reproduce a soil map. Using two different rule-induction methods, they were able to generate numerical empirical models that rendered ~50% of the area of the original map, which came from the Toowoomba area in Queensland, Australia. They then attempted to generalise what the machine learning had produced into a set of expert rules in an attempt to mimic the mental process used by the soil surveyors (Thompson and Beckmann 1959). Whilst broadly the original map was reproduced the detailed structure was not, some classes were not successfully modelled at all and spatially extensive classes dominated predictions.

This paper extends the work of Bui et al. (1999) by: (i) introducing some terrain attributes that are better-suited to landscape description at a resolution of 250 m; (ii) developing a map sampling strategy; and (iii) reducing classification error by employing boosting to improve the model.
Further, we examine the benefits of providing the rule-induction process with local spatial contextual information for each environmental variable.

2. Materials and methods

2.1. Modelling approach

Our hypothesis is that the soil surveyor’s knowledge is expressed in soil maps as the spatial structure of the delineated polygons and the soil associations listed in the accompanying legend, i.e. patterns of local correlation and exceptions (Bui et al. 1999). Hudson (1992) provides a compelling argument that a well-constructed soil survey captures knowledge if it is based explicitly on the soil-landscape paradigm. He states that the large number of adherents to the paradigm and the simplicity of the statement ‘soils are natural bodies that are distributed in a predictable way and in response to a systematic interaction of environmental variables’ gives it the necessary power to embody and therefore potentially communicate knowledge.

There is a growing literature demonstrating the predictive capacity of the soil-landscape paradigm using digital data and empirical numerical modelling techniques. Examples of spatial prediction have been provided, across a range of physiographical environments and spatial extents, for a number of soil properties (Moore et al. 1993, Gessler et al. 1995, McBratney and Odeh 1998, McKenzie and Ryan 1999) and for categorisation and mapping of soil types (Lagacherie 1992). Evidence that knowledge


has been captured can be demonstrated if rules governing local correlation and exceptions can be successfully extrapolated; something has been learned that can be applied elsewhere. Lagacherie et al. (1995) undertook a controlled process that provided such evidence for mapping of soil types and this was extended to spatial distribution of soil properties by Voltz et al. (1997). Bui et al. (1999) have demonstrated the potential for the (re)discovery of the knowledge embedded in completed surveys by effectively reverse engineering the soil-landscape model through inverse modelling using rule induction techniques based on decision trees. The success of the inverse procedure is assessed by the ability to mimic the soil map using samples taken from it. This assessment quantifies the degree of success in capturing the surveyors’ realisation of the soil-landscape paradigm for the particular area. By implication, this also captures the embedded knowledge. However, this does not necessarily imply that the underlying rules used by the surveyors and those generated through the rule induction process are the same. In this sense, the two sets of rules converge on the same knowledge (the map representation) and could be considered as mimics of each other.

The approach of discovering and capturing knowledge (here implemented using rule induction) embedded in a large data set is termed ‘data mining’ in the artificial intelligence literature (Quinlan 1993). C5.0, a data mining tool, was used as the rule-induction engine to build decision trees. Decision trees are classification algorithms that partition a data set into more and more homogeneous subsets. Nodes are where trees branch or split the data set; terminal nodes are called leaves. C5.0 builds a tree by determining splits in the data set which minimize the entropy at a node (Quinlan 1993).
Decision trees have been used previously to estimate soil properties (McBratney and Odeh 1998, McKenzie and Ryan 1999) and to develop soil maps (Lagacherie 1992, Lagacherie and Holmes 1997). The decision on the optimal split at a given node is made according to the gain ratio criterion, which is the ratio of the gain to the split info. The gain is the change in entropy between the node and the weighted entropy across the sub-nodes stemming from the split. The split info is used to avoid bias in favour of splits with many outcomes. The gain of a split X is:

$$\mathrm{gain}(X) = \mathrm{info}(T) - \sum_{i=1}^{n} \frac{|T_i|}{|T|}\,\mathrm{info}(T_i) \qquad (1)$$

where $T$ is the training cases at the node, $T_i$ the training cases at the ith sub-node following split X and $|\cdot|$ gives the count. $\mathrm{info}(T)$ and $\mathrm{info}(T_i)$ are the average information of sets $T$ and $T_i$, respectively (also known as the entropy), where:

$$\mathrm{info}(S) = -\sum_{j=1}^{k} \frac{\mathrm{freq}(C_j, S)}{|S|}\,\log_2 \frac{\mathrm{freq}(C_j, S)}{|S|} \qquad (2)$$

for set $S$, where $C_j$ identifies the jth class. The split info of a split X is given by:

$$\mathrm{split\ info}(X) = -\sum_{i=1}^{n} \frac{|T_i|}{|T|}\,\log_2 \frac{|T_i|}{|T|} \qquad (3)$$

and is large when there are no dominant groups in terms of counts. The gain ratio measures the proportion of the information generated by the split


that is beneficial to the classification. Each split is chosen so as to maximize the gain ratio and thus the information gained. In our application the decision trees are used to describe classes that already exist (as defined by the soil mapper). Once generated, the trees are converted into a set of rules of the form:

    if    slope < a
          distance downhill to river ≥ b
          lithology class = (p, q)
    then  polygon class = category II

where a and b are numerical values and p and q denote lithology categories.

Decision trees (models) are generated from a matrix of environmental variables (see below) that are believed to reflect the properties integrated by a soil surveyor when making the source map. The quality of the model is tested by generating it using only a proportion of the available data. The model is then used to re-create (predict) the full extent of the original map. Results were evaluated by comparing uncertainty coefficients (Press et al. 1988) and kappa statistics (Bonham-Carter 1994) from cross-tabulations of the resulting map classes with the original ones (contingency tables). Uncertainty coefficients, based on entropy measurements, describe how well the two maps correspond and include the dispersion in the cross-tabulations. Kappa coefficients describe how closely concentrated the data are about the diagonal of the matrix taking into account class size. This permits assessment of how much of the apparent quality of prediction is due to chance given that the spatial distribution and extent of some classes represented in the map render them more amenable to prediction than others.

2.2. Data and location

The 1:100 000 Soil Association Map of the Toowoomba Area, Darling Downs (Thompson and Beckmann 1959) (hereafter referred to as the Toowoomba map) as used by Bui et al. (1999) was regrouped following additional expert information (C. Thompson, personal communication; A. Biggs, personal communication). The final groupings are presented in Appendix 1.
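The gain ratio criterion of equations (1) to (3) in section 2.1 can be illustrated with a minimal Python sketch. It is not C5.0's implementation; the function names `info` and `gain_ratio` and the toy class labels are assumptions made for illustration only.

```python
import math
from collections import Counter

def info(labels):
    """Entropy (in bits) of a list of class labels, as in equation (2)."""
    total = len(labels)
    return -sum((c / total) * math.log2(c / total)
                for c in Counter(labels).values())

def gain_ratio(parent, subsets):
    """Gain (equation 1) divided by split info (equation 3).

    parent  : class labels of the training cases at the node.
    subsets : one list of labels per sub-node; all must be non-empty.
    """
    total = len(parent)
    weights = [len(s) / total for s in subsets]
    gain = info(parent) - sum(w * info(s) for w, s in zip(weights, subsets))
    split_info = -sum(w * math.log2(w) for w in weights)
    return gain / split_info if split_info > 0 else 0.0

# Hypothetical node holding four pixels of class "I" and four of class "II".
parent = ["I"] * 4 + ["II"] * 4
perfect = [["I"] * 4, ["II"] * 4]          # the split separates the classes
useless = [["I", "I", "II", "II"]] * 2     # both sub-nodes stay mixed
print(gain_ratio(parent, perfect))
print(gain_ratio(parent, useless))
```

In this toy case the split that separates the two classes has a higher gain ratio than the split that leaves both sub-nodes mixed, which is the behaviour the criterion is meant to reward.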
Following digitising, the Toowoomba map and digital geology (1:250 000) for Ipswich were co-registered to the 9″ digital elevation model (DEM) (AUSLIG 1996) gridded to a 250 m cell size. Terrain attributes, derived from the DEM for the Murray-Darling River basin, were: natural logarithms of up-slope contributing area (ln ca), distance downhill to channels, and distance downhill from hilltops (figure 2). Aspect, slope and curvature, which had been used in the initial work (Bui et al. 1999), were excluded because they were not considered to be well represented at 250 m resolution. Landsat MSS bands 1–4 (Geoimage 1995) were used as surrogates for vegetation and, to an extent, land use patterns, as there existed no better representation of the vegetation.

2.3. Sampling

In a direct analogy with surveying a new region, a sampling strategy needs to be designed for the source map. The sampling strategy includes decisions on where to sample and with what intensity. A decision on intensity is required at the polygon, class, and whole map levels.
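The class-level choice examined in this section, between allocating samples in proportion to class extent and allocating equal numbers per class, can be sketched as follows. This is a hedged illustration rather than the authors' procedure: the function name `sample_by_class`, the toy map and the floor of one sample per class (standing in for the requirement that small classes be represented explicitly) are all assumptions.

```python
import random
from collections import defaultdict

def sample_by_class(class_map, n_total, area_weighted=True, seed=0):
    """Select cell indices at random within each mapped class.

    class_map     : list of class labels, one per grid cell.
    n_total       : approximate total number of cells to draw.
    area_weighted : if True, the quota of each class is proportional to its
                    extent; otherwise every class gets the same quota.
    """
    rng = random.Random(seed)
    cells = defaultdict(list)
    for idx, label in enumerate(class_map):
        cells[label].append(idx)
    chosen = {}
    for label, members in cells.items():
        if area_weighted:
            quota = round(n_total * len(members) / len(class_map))
        else:
            quota = round(n_total / len(cells))
        # Assumed floor of one sample, capped at the class size.
        quota = max(1, min(quota, len(members)))
        chosen[label] = rng.sample(members, quota)
    return chosen

# Hypothetical map of 100 cells with one extensive and two small classes.
toy_map = ["A"] * 80 + ["B"] * 15 + ["C"] * 5
weighted = sample_by_class(toy_map, 20, area_weighted=True)
equal = sample_by_class(toy_map, 20, area_weighted=False)
print({k: len(v) for k, v in weighted.items()})
print({k: len(v) for k, v in equal.items()})
```

With equal allocation the small class "C" is sampled exhaustively while the extensive class "A" contributes fewer cells than under area weighting, which is the contrast the comparison is designed to test.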


Figure 2. Terrain attributes used in the soil map modelling: (a) logarithm of contributing area, (b) distance uphill from the nearest river, (c) distance downhill from the nearest hill top.


There is no basis for differentially sampling individual polygons because some (perhaps not negligible) component of the surveyors’ model is encoded in the multiple polygons that compose a class. At the class level, a decision exists regarding the spatial distribution of samples with which to build trees. We compare the effect of sampling equal numbers per class and weighting the number of samples according to the class extent, i.e. the proportion of the map occupied by each class. This is termed area-weighted sampling. Area-weighted sampling is similar to random sampling except that we ensure that small classes are represented explicitly when random sampling may under-sample them (given that we only undertake a single sampling). In area-weighted and equal representation sampling, the samples taken from each class are selected at random.

A second consideration, in addition to the spatial distribution and frequency of samples between classes, is the intensity of sampling of the whole map. The trade-off is between having enough data to represent the variability underlying the polygons and leaving sufficient data to test the quality of the model. In the extreme, building a tree model with all the available data becomes a fitting exercise; we would expect to reproduce the map very well. At the other extreme, a very small sample may not inform the model sufficiently to represent the mapped area (even if the model happens to have a low error rate in describing the samples). A range of sampling intensities was investigated. The efficiency of the models is assessed by comparing the size of the input data set, the error rate and size of the models with the quality of the predictions (percent correct, uncertainty and kappa coefficients). This provides an estimate of the conversion of data to information in the modelling exercise and also suggests the degree of information compression in creating polygon representations of the soil distributions.

2.4. Pruning

When dealing with decision trees it is generally necessary to simplify the models. One method employed is to prune the trees. Here pruning is performed by ensuring that a minimum number of counts is included in each binary split in the tree. In equations (1), (2) and (3) a split is not acceptable if there are insufficient counts. The definition of unacceptable is provided by the user. An increasing severity of pruning, i.e. an increasing minimum number of counts, is examined.

2.5. Boosting

Boosting is a bias reduction procedure that sequentially retrains on the data, iteratively producing tree models that concentrate on the mis-classified data from previous iterations (Drucker et al. 1994, Freund and Schapire 1996). First, a tree is generated as usual. Then a second tree is built which makes more effort on the cases mis-classified in the first tree. This is repeated for n iterations (here n = 10). For the first tree, all data are given equal weights, i.e. 1/N, where N is the number of training cases. In subsequent trees the weight accorded to cases misclassified in the first tree is increased relative to those correctly classified. The functional form of the weighting and optimisation weighting of subsequent trees is a subject of current developments in the statistics literature (Friedman et al. 2000). The class allocated to each training pixel (sample) is the modal result from using all the trees generated. This allocation process is known as voting. Similarly, for test data or predictions, voting derives the result allocated to each location. The theoretical basis for the success of boosting (and other tree building optimisation algorithms) has more recently been demonstrated (Schapire et al. 1998,


Kearns and Mansour 1999). Boosting has also been reported to avoid over-fitting of decision trees (Haruno et al. 1999). Here we examine the effectiveness of boosting using the same comparison statistics as earlier, i.e. the percentage of the original map rendered correctly, kappa, and the uncertainty coefficient. In addition, we present the percentage of the training data correctly classified and, similarly, the percentage of the test data correct.

2.6. Spatial context

Classification trees have no mechanisms for detecting spatial relationships between the training data. Therefore, no spatial information has been used in previous work using trees to predict soil properties (Lagacherie 1992, McKenzie and Ryan 1999, Bui et al. 1999). Indeed, Gessler et al. (1995) deliberately attempted to avoid the introduction of spatial information into the models by ensuring sample sites were separated by a distance greater than the geostatistical range for the compound topographic index, the environmental variable considered most likely to be correlated with the majority of soil properties.

Polygons on soil maps represent generalization and simplification of the incorporated spatial variation. Therefore, local variation is smoothed into a categorical representation. We examined the utility of informing the tree generation software of the spatial context of the data sampled from the source map. Switzer (1980) showed that inclusion of local spatial autocorrelation improved the classification of multi-spectral Landsat data using linear discriminant analysis. Schetselaar et al. (2000) found addition of a local mean filter with the raw data, across a range of remotely-sensed and geophysical data, improved the ability to discriminate geological units using parametric classification of field observations. The attributes we examined were the mean, range (max–min), variance and slope at the origin of the local variogram (see below) of each variable for some radius around each point.
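Three of these contextual attributes (mean, range and variance) can be sketched for a single grid cell as below. The sketch assumes a fixed circular window measured in cells, whereas the window radius used in this paper varies from place to place; the variogram slope is omitted, and the function name `window_stats` and the toy elevation grid are hypothetical.

```python
def window_stats(grid, row, col, radius):
    """Mean, range and variance of the values within `radius` cells of (row, col).

    grid is a list of equal-length rows; cells that would fall outside the
    grid are skipped, so windows near the edge contain fewer values.
    """
    values = []
    for r in range(max(0, row - radius), min(len(grid), row + radius + 1)):
        for c in range(max(0, col - radius), min(len(grid[0]), col + radius + 1)):
            if (r - row) ** 2 + (c - col) ** 2 <= radius ** 2:
                values.append(grid[r][c])
    n = len(values)
    mean = sum(values) / n
    # Population variance: squared deviations from the window mean, divided by n.
    variance = sum((x - mean) ** 2 for x in values) / n
    return mean, max(values) - min(values), variance

# Hypothetical 5 x 5 elevation grid (metres) rising steadily to the east.
dem = [[10.0 * c for c in range(5)] for _ in range(5)]
print(window_stats(dem, 2, 2, 1))
```

Applying the function to every cell of each environmental variable would yield one filtered layer per attribute, which can then be supplied to the tree builder alongside the raw data.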
The variance was given by:

$$\mathrm{variance} = \frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n} \qquad (4)$$

where n is the number of pixels within the radius of the window (see below), $x_i$ is the ith observation in the window and $\bar{x}$ is the window mean. We decided to assume a priori that the spatial variation about each point was not the same. This attempts to recognise that the local soil associations will exist over different length scales in different environmental domains within the mapped region. Therefore, we varied the radius within which the spatial contextual information would be derived. Spatial filtering using a variable window is hereafter referred to as adaptive filtering. Example images are shown in figure 3.

The first step of adaptive filtering is to compute the radius about each point that will be used to filter the data. We decided to use the amount of spatial variation in the raw data to set the filtering radius. The raw data are the DEM for terrain attributes and each of the MSS bands for their filtered options. The local spatial variation was described by the slope, at the origin, of the local variogram that was computed for a fixed radius of 1 km about each point. Figure 3(a) shows the DEM for the study region. The variogram slope image was computed and then scaled so that the minimum and maximum corresponded to a minimum and maximum radius

Figure 3. Example of adaptive filtering to derive the spatial contextual attributes: (a) digital elevation model; (b) window size used for filtering; (c) variance.


(3.0–5.5 km) for the filtering window resulting in a filter radius image (figure 3(b)). The variance of the DEM filtered in this fashion is shown in figure 3(c).

The geostatistical range of the variogram may be a preferable property to reflect the extent of local spatial variability and therefore to dictate the size of the window. However, the task of computing local variograms combined with the problem of fitting the most appropriate variogram model to each point was considered impractical for the extent (some 16 million pixels) and purpose of the modelling, i.e. filling of gaps in soil map coverage of the Murray-Darling Basin. Since this work was completed, software for computing and fitting a large number of variograms has become available (Minasny et al. 1999). However, computing platforms and incorporation into the data processing stream for large spatial modelling tasks remain an issue.

3. Results and discussion

To examine the sampling method and intensity and the impacts of pruning and boosting, it is necessary to present data using the model settings finally chosen. Whilst this may seem back to front, we believe that it is better to discuss the rudiments of the final model before presenting the effects of spatial filtering and selection of various attributes as these are the more substantial results.

3.1. Sampling method

Area-weighted sampling gives better predictions than equal number sampling on all statistics (table 1). This result supports the hypothesis that small regions delineated by surveyors tend to describe particular and recognisable features of the soil-landscape, i.e. examples of exceptions. Therefore, fewer data are required to describe these classes. However, as class spatial extent increases, more observations are required to describe them. Our results indicate that the typical soil-landscape units in the more extensive classes must be more complex (or there are more soil-landscape units) than the less extensive classes. If not, equal number sampling would perform approximately as well as area-weighted sampling. Alternatively, the more extensive classes consist of a greater degree of generalization than the less extensive classes. Greater variability in map units may be introduced by the aggregation of map units, which decreased the original number of map classes. Given these results, all modelling was conducted using area-weighted sampling.

3.2. Sampling intensity

Figure 4(a–d) shows the results from increasing the intensity of sampling the input map. The greatest increases in predictive capability occur over the first 10% of the data and then diminishing return becomes increasingly obvious (figure 4(a)). We note that the difference between the percent correct and the kappa statistic (figure 4(a)) decreases as more training data become available for the model.
Similarly the quality of the prediction, as indicated by the uncertainty coefficient (figure 4(b)),

Table 1. Comparison of area-weighted sampling with equal numbers per class.

Treatment        % correct   kappa   u.c.
Area weighted    69.9        64.1    0.49
Equal number     57.8        52.1    0.42


Figure 4. Results from gradually increasing the proportion of the map sampled to construct a model: (a) solid line is the % correct, dotted line shows kappa statistic; (b) uncertainty coefficient; (c) tree size (number of terminal leaves) as a function of % area sampled; (d) tree size as a function of % area correctly predicted.


decreases between 10% and 25% sampling and then decreases more rapidly with increasing sampling intensity. Given these results, a sampling intensity of 25% was used for all further modelling.

Figure 4 also provides the opportunity to examine the relationship between data and information in the map and model. We use the size of the tree (in all cases the trees were pruned using the same criterion; see below) and the degree to which it is able to reproduce the original map as a measure of the efficiency of information capture. Figure 4(c) shows that the size of the trees increases linearly with the increase in sampling intensity. Figure 4(d) shows the strong non-linearity of the information capture with increasing sampling intensity. Linear extrapolation of the first three points would give an estimate of tree size slightly greater than 100 for prediction of the original map to approximately 80%. However, the tree needed to render 80% of the map correct is close to 350 leaves. This exponential increase in tree size reflects the degree of generalization of the spatial variability incorporated into the polygons. This indicates that the surveyors have indeed captured soil-landscape pattern in the classes delineated in the map. Further, the general pattern can be captured in a model that uses only a modest amount of the potential data. However, there is a degree of spatial variability that is not general to the rest of the class and prediction of the soil type becomes more and more difficult. This is likely a result of the reality of subdominant and minor soils within the classes, i.e., within-unit exceptions. Without attempting to include more information from the map legend about these other soil types we should not expect to predict everything. Even with this information, the variation may not be sufficiently systematic (at the resolution available) to be explained.

3.3. Boosting

Table 2 shows the effect of boosting on a range of statistics.
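As a reminder of the procedure outlined in section 2.5, the loop below sketches boosting with reweighting and modal voting. It is a simplified illustration under stated assumptions: single-threshold stumps stand in for full decision trees, the weight update follows the AdaBoost.M1 form of Freund and Schapire (1996), and the vote is unweighted; C5.0's own weighting scheme is not reproduced. All function names and the toy data are hypothetical.

```python
from collections import Counter

def fit_stump(xs, ys, weights):
    """Best weighted one-threshold rule: predict `lo` if x < t, else `hi`."""
    classes = sorted(set(ys))
    best = None
    for t in sorted(set(xs)):
        for lo in classes:
            for hi in classes:
                err = sum(w for x, y, w in zip(xs, ys, weights)
                          if (lo if x < t else hi) != y)
                if best is None or err < best[0]:
                    best = (err, t, lo, hi)
    return best

def predict_stump(stump, x):
    _, t, lo, hi = stump
    return lo if x < t else hi

def boost(xs, ys, n_rounds=10):
    """Build up to n_rounds models, reweighting the cases after each one."""
    n_cases = len(xs)
    weights = [1.0 / n_cases] * n_cases      # equal weights for the first model
    models = []
    for _ in range(n_rounds):
        stump = fit_stump(xs, ys, weights)
        models.append(stump)
        err = stump[0]
        if err == 0 or err >= 0.5:
            break
        # Shrink the weights of correctly classified cases, then renormalise,
        # so that misclassified cases count for more in the next round.
        beta = err / (1.0 - err)
        weights = [w * beta if predict_stump(stump, x) == y else w
                   for x, y, w in zip(xs, ys, weights)]
        total = sum(weights)
        weights = [w / total for w in weights]
    return models

def vote(models, x):
    """Modal class over all models (unweighted voting)."""
    return Counter(predict_stump(m, x) for m in models).most_common(1)[0][0]

# Hypothetical one-variable training set that a single threshold cannot fit.
xs = [1, 2, 3, 4, 5, 6]
ys = ["a", "a", "b", "b", "a", "a"]
models = boost(xs, ys)
print(len(models), [vote(models, x) for x in xs])
```

Because no single threshold separates the toy classes, the first model leaves some cases misclassified and at least one further round is run on the reweighted data.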
Clearly, the result from boosting is a better model that repays the additional CPU time required. In all cases presented boosting has been used.

3.4. Pruning

Figure 5 shows the result of gradually increasing the severity of tree pruning. Figure 5(a) shows a decrease in the correct rendering of the final map, at a decreasing rate, following pruning at a minimum of 40 counts. However, figure 5(b) shows that the size of the tree begins to decrease at a decreasing rate between pruning of 10–20. The errors in the fit to training and test data begin to decrease at a decreasing rate between pruning rates of 20–40. As a trade-off between tree accuracy and size we chose a pruning rate of 20 for the model building.
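The minimum-count constraint behind these pruning rates can be sketched as a filter on candidate binary splits of a numeric attribute. This is an illustrative reading of the constraint, assuming both branches must hold at least the minimum number of cases; the exact rule applied by C5.0 may differ, and the function name and slope values are hypothetical.

```python
def acceptable_thresholds(values, min_count=20):
    """Thresholds t for which both `value < t` and `value >= t` keep
    at least `min_count` training cases."""
    n = len(values)
    keep = []
    for t in sorted(set(values)):
        below = sum(1 for v in values if v < t)
        if below >= min_count and n - below >= min_count:
            keep.append(t)
    return keep

# Hypothetical slope values for 50 training pixels.
slopes = list(range(50))
print(acceptable_thresholds(slopes, min_count=20))
print(acceptable_thresholds(slopes, min_count=40))
```

Raising the minimum count shrinks the set of admissible splits, and once no split is admissible the node becomes a leaf, which is why more severe pruning yields smaller trees.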

Table 2. The effect of boosting on prediction quality of the decision tree model.

Treatment      % correct   kappa   u.c.
Not boosted    58.3        50.0    0.35
Boosted        69.9        64.1    0.49
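The kappa and uncertainty coefficient (u.c.) columns reported in tables 1 and 2 are both derived from a contingency table of original against predicted classes. The sketch below shows one way to compute them. It assumes the symmetric form of the uncertainty coefficient; which form the authors used is not stated in the text. Kappa is returned as a fraction, whereas the tables appear to report it multiplied by 100. The function names and the toy table are hypothetical.

```python
import math

def kappa(table):
    """Cohen's kappa for a square contingency table
    (rows: original classes, columns: predicted classes)."""
    total = sum(sum(row) for row in table)
    observed = sum(table[i][i] for i in range(len(table))) / total
    row_tot = [sum(row) for row in table]
    col_tot = [sum(col) for col in zip(*table)]
    expected = sum(r * c for r, c in zip(row_tot, col_tot)) / total ** 2
    return (observed - expected) / (1.0 - expected)

def entropy(counts, total):
    """Entropy of a set of counts, ignoring empty cells."""
    return -sum((c / total) * math.log(c / total) for c in counts if c > 0)

def uncertainty_coefficient(table):
    """Symmetric uncertainty coefficient:
    U = 2 [H(x) + H(y) - H(x, y)] / [H(x) + H(y)]."""
    total = sum(sum(row) for row in table)
    h_x = entropy([sum(row) for row in table], total)
    h_y = entropy([sum(col) for col in zip(*table)], total)
    h_xy = entropy([c for row in table for c in row], total)
    return 2.0 * (h_x + h_y - h_xy) / (h_x + h_y)

# Hypothetical two-class cross-tabulation with 80% of cells on the diagonal.
table = [[40, 10],
         [10, 40]]
print(kappa(table))
print(uncertainty_coefficient(table))
```

Both statistics equal 1 for a purely diagonal table; kappa falls to 0 when agreement is no better than chance, and the uncertainty coefficient falls to 0 when the two maps are statistically independent.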

Figure 5. Results from gradually increasing the severity of pruning the trees (increasing the minimum count for any tree split).

3.5. Spatial context

Table 3 shows that the predictive capacity of the modelling approach using C5.0 with the raw data is similar to that given by Bui et al. (1999) in their comparison

Table 3. Comparison of the predictive capacity of raw and filtered data.

Treatment            % correct   kappa   UC     Tree size
Raw data             49.0        37.7    0.26   92
All data             69.9        64.1    0.49   153
Spatial data only    65.4        58.7    0.44   149

of Splus and Expector. This is not surprising as Bui et al. (1999) used decision tree models in Splus. However, unlike these earlier results, all map classes are represented. Table 3 also shows the results from providing C5.0 with spatial context information for model building. It is clear that spatial filtering has a significant impact on the predictive capacity of the model. The much larger uncertainty coefficient is an indication that the details in the map simulations using spatial filtering are far superior to those using only the raw data. When the raw data are included with the filtered data, the predictions are slightly better than with the filtered data alone. Figure 6 shows the quality of the maps rendered using raw data and raw data with the spatial contextual information.

3.6. Attribute selection

We examined the predictive capability of various attributes and spatial contextual information by building models using various combinations of the predictor variables. With all combinations, attributes that were not subject to filtering are included, i.e. distances to ridge and river. Table 4 shows that the terrain data appear to have the greatest predictive capacity with the MSS data a little less. There is only a 4% difference in total area correctly predicted (6% kappa) between these two major variable types. They provide most of the predictive power of the combined models. Indeed combining the terrain and MSS data and excluding geology gives the best model.

Table 5 compares the spatial contextual information by the nature of the filter, i.e. mean, range, variance and variogram slope. There is overlap in predictive capacity between each of the mean, range, and variance. There is perhaps slightly better capacity in the mean and variance than the range when they are combined. There is little apparent benefit likely to be gained by including the variogram slope in the predictions. However, including all the variables does generate the best model.

4. General discussion

The results presented above demonstrate that it is possible to model a soil map using low-resolution environmental correlation variables, viz. terrain attributes from the 9″ DEM, geology, and MSS bands 1–4. This is not a demonstration of prediction of soil type per se but rather illustration that the soil patterns as observed in the landscape and generalized into soil association classes by the soil surveyors can be captured by machine learning. We have shown that model optimisation through boosting improves the quality of the information capture. Further, use of spatial contextual information results in a better model. This is not surprising as we expect that a local spatial correlation exists, which is smoothed in creation of class-based polygons. Analysis of the predictive capacity of the different correlation attributes indicated that terrain attributes were marginally more powerful than the MSS data. Geology, whilst very useful in

Figure 6. Maps rendered from models: (a) original map; (b) map prediction using raw data; (c) map prediction using raw and spatial contextual data.

Table 4. Comparison of the predictive capacity of correlation attributes by category, i.e. terrain, MSS and geology.

Treatment            % correct   kappa   UC     Tree size
All data             69.9        64.1    0.49   153
MSS                  60.5        52.7    0.38   144
Geology              33.2        14.9    0.19   6
All terrain          65.0        58.2    0.43   147
MSS & geology        61.2        53.7    0.39   129
Terrain & geology    63.2        55.9    0.41   139
MSS & terrain        70.1        64.2    0.49   148

Table 5. Comparison of the predictive capacity of correlation attributes by spatial contextual type, i.e., mean, range, variance and variogram slopes.

Treatment            % correct   kappa   UC     Tree size
Mean                 60.8        53.0    0.38   137
Variance             58.3        50.3    0.36   134
Range                58.9        50.6    0.37   128
Mean & variance      64.9        58.2    0.43   145
Mean & range         64.1        57.2    0.42   140
Variance & range     64.1        57.1    0.42   142
Variogram slopes     36.7        21.8    0.11   117
No variograms        69.2        63.2    0.48   147

predicting one class, actually slightly degraded the quality of the model (table 4). This result was surprising because geology was a high level predictor in the Splus trees of Bui et al. (1999). For the spatial contextual attributes, the mean was found to be the strongest predictor, followed by the range and variance.

Sampling in proportion to the spatial extent of classes provides a better result than taking an equal number of samples from each class. This indicates that the classes with greatest extent also contain more variation and therefore more samples are required to build a good model. Further, this implies that the polygon purity is related to the size of the polygon. Whilst in itself this does not present a problem, it is important information to be used in interpreting the map. Not all soil map polygons at a given scale of mapping have the same degree of homogeneity and one needs to check the legend and soil resources inventory report for a description of the homogeneity of each map unit.

An analysis of the information content of the model shows that most of the available information can be represented using relatively few samples, i.e. with sampling rates of greater than ~10% there is non-linear improvement in prediction quality for each increment of data added. It is interesting to consider how one might design a sampling scheme, without prior knowledge of the soil distribution, to most efficiently determine the relationships represented by the 10%.

We have applied this spatial modelling approach to building soil map models of the patchwork of mapped areas in the Murray-Darling Basin and extended the rules from those models to unmapped areas (Bui and Moran, submitted). We propose this as a method for reconnaissance mapping to provide rapid assessment of soil associations and their properties in unmapped areas. Further, it can be used as a priori information upon which an efficient sampling scheme can be planned for more

548

C. J. Moran and E. N. Bui

detailed mapping as an exercise in falsiŽ cation/veriŽ cation of the predicted soil distribution. 5.

Conclusions We conclude that there is suYcient predictive capacity in the environmental correlation attributes representing geology, terrain, and soil/water/vegetation interactions (MSS bands 1–4 ) to model a known soil map. The best model was achieved by sampling in proportion to the spatial extent of the mapped classes, boosting the decision trees, and using spatial contextual information extracted from the environmental variables. Based on the soil-landscape paradigm, explicit linkages have been drawn between data, information and knowledge. References AUSLIG, 1996, GEODATA 9 sec DEM (DEM-9S) (Belconnen, ACT: Australian Survey and Land Information Group). Bonham-Carter, G. F., 1994, Geographic Information Systems for Geoscientists: Modelling with GIS (Oxford: Pergamon Press). Bui, E. N., Loughhead, A., and Corner, R., 1999, Extracting soil-landscape rules from previous soil surveys. Australian Journal of Soil Science, 37, 495–508. Bui, E. N., and Moran. C. J., forthcoming, Spatial modelling to generate a soil map and assess its quality: an example from the Murray-Darling basin of Australia (submitted Geoderma). Drucker, H., Cortes, C., Techel, L. P., Lecan, Y., and Vaprik, V., 1994, Boosting and other ensemble methods. Neural Computation, 6, 1289–1301. Freund, Y., and Schapire, R., 1996, Experiments with a new boosting algorithm. In Machine L earning: Proceedings of the T hirteenth International Conference, July, 1996 (San Mateo, California: Morgan Kaufmann). Friedman, J., Hastie, T., and Tibshirani, R., 2000, Additive logistic regression: A statistical view of boosting. Annals of Statistics, 28, 337–374. GEOIMAGE, 1995, Australian L andsat MSS mosaic (100 m resolution ) (Brisbane: GEOIMAGE). Gessler, P. E., Moore, I. D., McKenzie, N. J., and Ryan, P. J., 1995, Soil-landscape modelling and spatial prediction of soil attributes. International Journal Geographical Information Systems, 9, 421–432. 
Haruno, M., Shirai, S., and Ooyama, Y., 1999, Using decision trees to construct a practical parser. Machine Learning, 43, 131–149.
Hudson, B. D., 1992, The soil survey as a paradigm-based science. Soil Science Society of America Journal, 56, 836–841.
Kearns, M., and Mansour, Y., 1999, On the boosting ability of top-down decision tree learning algorithms. Journal of Computer and System Sciences, 58, 109–128.
Lagacherie, P., 1992, Formalisation des lois de distribution des sols pour automatiser la cartographie pedologique a partir d'un secteur pris comme reference. Ph.D. Thesis. Institut National de la Recherche Agronomique, Laboratoire de science du sol, Montpellier.
Lagacherie, P., Legros, J. P., and Burrough, P. A., 1995, A soil survey procedure using the knowledge of soil pattern established on a previously mapped reference area. Geoderma, 65, 283–301.
Lagacherie, P., and Holmes, S., 1997, Addressing geographical data errors in a classification tree for soil unit prediction. International Journal of Geographical Information Science, 11, 183–198.
Minasny, B., McBratney, A. B., and Whelan, B. M., 1999, VESPER version 1.0. Australian Centre for Precision Agriculture, McMillan Building A05, The University of Sydney, NSW 2006 (http://www.usyd.edu.au/su/agric/acpa).
Moore, I. D., Gessler, P. E., Nielsen, G. A., and Peterson, G. A., 1993, Soil attribute prediction using terrain analysis. Soil Science Society of America Journal, 57, 443–452.
McBratney, A. B., and Odeh, I. O. A., 1998, An overview of pedometric techniques for use in soil survey. (Keynote Paper, Symposium 17) Paper 622. 16th World Congress of Soil Science, Montpellier, France, 20–26 August, 1998. CD.
McKenzie, N. J., and Ryan, P. J., 1999, Spatial prediction of soil properties using environmental correlation. Geoderma, 89, 67–94.
Press, W. H., Flannery, B. P., Teukolsky, S. A., and Vetterling, W. T., 1988, Numerical recipes in C: The art of scientific computing (Cambridge: Cambridge University Press).
Quinlan, J. R., 1993, C4.5: Programs for machine learning (San Mateo, California: Morgan Kaufmann).
Schapire, R. E., Freund, Y., Bartlett, P., and Lee, W. S., 1998, Boosting the margin: A new explanation for the effectiveness of voting methods. Annals of Statistics, 26, 1651–1686.
Schetselaar, E. M., Chung, C-J. F., and Kim, K. E., 2000, Integration of Landsat TM, Gamma-ray, magnetic, and field data to discriminate lithological units in vegetated granite-gneiss terrain. Remote Sensing of Environment, 71, 89–105.
Switzer, P., 1980, Extensions of linear discriminant analysis for statistical classification of remotely sensed satellite imagery. Mathematical Geology, 12, 367–376.
Thompson, G. G., and Beckmann, C. H., 1959, Soils and Land Use in the Toowoomba Area, Darling Downs, Queensland, Soils and Land Use Series No 28 (Australia: CSIRO).
Voltz, M., Lagacherie, P., and Louchart, X., 1997, Predicting soil properties over a region using sample information from a mapped reference area. European Journal of Soil Science, 48, 19–30.

Appendix 1. Revised Toowoomba soil map re-classification (after A. Biggs, personal communication)

Soil map unit  Unit name                        Class code
A              Aubigny                          7
B              Burton                           17
BeP            Beauaraba—Purrawunda             14
C              Cecilvale                        15
ChBe           Charlton—Beauaraba               14
ChCr                                            16
DKy            Drayton—Kynoch                   4
IP             Irving—Purrawunda                16
Ir             Irongate                         12
K              Knapdale                         5
Ke             Kenmuir                          18
KeBe           Kenmuir—Beauaraba                18
KeM            Kenmuir (stony)—Mallard          18
KeS            Kenmuir (stony)—Southbrook       6
KegM           Kenmuir (gravelly)—Mallard       18
KegS           Kenmuir (gravelly)—Southbrook    6
Kevar          Kenmuir (var)                    18
MaCh           Majuba—Charlton                  10
Mu             Murlaggan                        12
RMR            Ruthven—Middle Ridge             4
Ry             Ramsay                           11
SA             Association A (unnamed)          3
SB             Association B (unnamed)          2
TG             Toowoomba—Gabbinbar              4
W              Waco                             13
WcT1           Waco (with type 1)               9
Wy             Waverly                          13
Y9E            Yargullen—Edgecombe              8
YO             Yarranlea—Oakview                1
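The re-classification in this appendix amounts to a lookup from soil map unit to class code. The following is a minimal sketch in Python of how such a lookup might be applied before modelling. It assumes that the class codes are listed in the same order as the map units; the names UNIT_TO_CLASS, reclassify and units_by_class are hypothetical and are not taken from the original study.

```python
from collections import defaultdict

# Map unit -> class code, transcribed from Appendix 1.
# Assumption: the codes appear in the same order as the map units.
UNIT_TO_CLASS = {
    "A": 7, "B": 17, "BeP": 14, "C": 15, "ChBe": 14, "ChCr": 16,
    "DKy": 4, "IP": 16, "Ir": 12, "K": 5, "Ke": 18, "KeBe": 18,
    "KeM": 18, "KeS": 6, "KegM": 18, "KegS": 6, "Kevar": 18,
    "MaCh": 10, "Mu": 12, "RMR": 4, "Ry": 11, "SA": 3, "SB": 2,
    "TG": 4, "W": 13, "WcT1": 9, "Wy": 13, "Y9E": 8, "YO": 1,
}


def reclassify(units, unit_to_class=UNIT_TO_CLASS):
    """Translate a sequence of map-unit labels into class codes."""
    return [unit_to_class[unit] for unit in units]


def units_by_class(unit_to_class=UNIT_TO_CLASS):
    """Group map units by the class code they are merged into."""
    grouped = defaultdict(list)
    for unit, code in unit_to_class.items():
        grouped[code].append(unit)
    # Return a plain dict ordered by class code, units sorted within each class.
    return {code: sorted(grouped[code]) for code in sorted(grouped)}


if __name__ == "__main__":
    for code, units in units_by_class().items():
        print(code, ", ".join(units))
```

Under this pairing the 29 map units collapse into 18 classes, and units that share a component (for example the two Southbrook units, KeS and KegS) fall into the same class.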
