
GLOBAL BIOGEOCHEMICAL CYCLES, VOL. 23, GB4033, doi:10.1029/2009GB003506, 2009


Using knowledge discovery with data mining from the Australian Soil Resource Information System database to inform soil carbon mapping in Australia

Elisabeth Bui,1 Brent Henderson,2 and Karin Viergever3

Received 3 March 2009; revised 27 July 2009; accepted 3 September 2009; published 31 December 2009.

[1] We present a piecewise linear decision tree model for predicting percent of soil organic C (SOC) in the agricultural zones of Australia generated using a machine learning approach. The inputs for the model are a national database of soil data, national digital surfaces of climate, elevation, and terrain variables, Landsat multispectral scanner data, lithology, land use, and soil maps. The model and resulting map are evaluated, and insights into biogeological surficial processes are discussed. The decision tree splits the overall data set into more homogeneous subsets; in this case, it identifies areas where SOC responds closely to climatic and other environmental variables. The spatial pattern of SOC corresponds well to maps of estimated primary productivity and bioclimatic zones. Topsoil organic C levels are highest in the high rainfall, temperate regions of Tasmania, Victoria, and Western Australia, along the coast of New South Wales, and in the wet tropics of Queensland, and lowest in arid and semiarid inland regions. While this pattern broadly follows continental vegetation, soil moisture, and temperature patterns, it is governed by a spatially variable hierarchy of different climatic and other variables across bioregions of Australia. At the continental scale, soil moisture level, rather than temperature, seems most important in controlling SOC.

Citation: Bui, E., B. Henderson, and K. Viergever (2009), Using knowledge discovery with data mining from the Australian Soil Resource Information System database to inform soil carbon mapping in Australia, Global Biogeochem. Cycles, 23, GB4033, doi:10.1029/2009GB003506.

1 Land and Water, CSIRO, Canberra, ACT, Australia.
2 Mathematical and Information Sciences, CSIRO, Canberra, ACT, Australia.
3 Ecometrica, Edinburgh, UK.

Published in 2009 by the American Geophysical Union.

1. Introduction

[2] Soil organic C (SOC) is the largest pool in the biological terrestrial global C budget [Amundson, 2001]; hence, proportionally small changes in SOC content can have large impacts on the global C budget. Given the importance of SOC in the global C budget and its potential inclusion in future greenhouse gas emissions accounting [e.g., Jones et al., 2005; Garnaut, 2008, chapter 22], initiatives to estimate and map SOC stocks nationally have intensified in countries around the world: e.g., in the United States (Guo et al. [2006], after initial work by Kern [1994]); in the United Kingdom [Milne and Brown, 1997; Howard et al., 1995]; in France [Arrouays et al., 2001]; in Brazil [Bernoux et al., 2002; Batjes, 2005]; in New Zealand [Tate et al., 2005]; and in China [Wu et al., 2003; Yu et al., 2007].

[3] In most countries, these efforts are led by national soil survey organizations, and they use soil survey maps (sometimes in conjunction with vegetation maps) linked to estimated average SOC values for map units [e.g., Guo et al., 2006; Bernoux et al., 2002; Milne and Brown, 1997]. To generate high resolution maps, they rely on finer and finer-scale inventories and more soil sampling and laboratory measurements of SOC to estimate the values to attribute to the new, smaller map units. This approach is both labor and time-intensive and thus expensive and slow. Moreover, no direct insight is gained into processes responsible for regulating SOC levels. Such insight is obtained from localized studies of SOC dynamics and thus is difficult to extrapolate to large areas, yet this is what national-scale applications of C-cycling models attempt to do.

[4] In Australia, prior to the work discussed here, which was part of the Australian Soil Resource Information System (ASRIS), and the more recent work of Wynn et al. [2006], the last national synthesis of SOC data was that of Spain et al. [1983]. The approach in ASRIS used spatial environmental modeling: measured SOC values at georeferenced points from the ASRIS database were used as training data, and a set of gridded predictor variables chosen to represent state factors of soil formation (climate, parent material, and topography) and land use over Australia were used to develop the models. Thus we assumed that SOC is a response variable driven by environmental factors and that the spatial variation in the environmental variables controls the pattern of SOC.

[5] Modeling was implemented with machine learning software and has been described in detail by Henderson et al. [2001, 2005]. Machine learning is a branch of computer science that focuses on artificial intelligence algorithms to allow computers to perform sophisticated cognitive tasks, e.g., pattern recognition, natural language processing, medical diagnoses. Data mining and knowledge discovery tools are one major type of algorithm used in machine learning. The advantage of the machine learning approach used here is that it allows for both the production of maps of SOC and the gaining of insight into biogeological surficial processes. The objective of this paper is to outline how knowledge discovery and data mining from the ASRIS point database have been used to estimate and map SOC over Australia and to discuss what has been learned in the process, in terms of understanding of bioregional C cycle processes and their environmental drivers.

Figure 1. Locations of topsoil C data points. The extent of Australian Soil Resource Information System (ASRIS) maps generated by modeling is also shown.

2. Methods

2.1. Data

[6] The ASRIS point database was created as part of the National Land and Water Resources Audit (NLWRA) of Australia [Johnston et al., 2003] and was used to generate maps of soil properties including SOC over the catchments that contained major agricultural areas [Henderson et al., 2001, 2005]. The data represent a collation assembled from State and Commonwealth of Australia agencies involved in natural resources management and were collected over many years, analyzed using different techniques with different original purposes, and thus exhibit sample selection bias and unknown measurement error.

[7] The database included 11,483 measurements of topsoil C, representing sampling from the first or thickest A horizon or depth 0–30 cm if no horizon was specified, and 5,100 measurements of subsoil C, representing sampling from the B1 horizon. The locations of the sample sites for topsoil C are shown in Figure 1.

[8] Analytical methods used for assessing total SOC were the Walkley-Black (6A1, 6A1.UC), Heanes wet oxidation (6B1), and the combustion methods (6B2, 6B3 and 6.DC) as described by Rayment and Higginson [1992]. All methods were assumed to estimate total organic C, although the Walkley-Black method is generally known to give incomplete recovery, historically quoted in the vicinity of 75–80% [Rayment and Higginson, 1992]. While a correction factor from the Walkley-Black methods to total SOC of 1.3 is sometimes used [e.g., Spain et al., 1983], there is no universal correction factor. In an Australia-wide investigation, Skjemstad et al. [2000] found differences that could be attributed to the laboratory and the date at which the sample was analyzed, with more recent analyses showing a much more complete recovery. The appropriate correction factor was notably less than 1.3 and for a large part of the data not needed at all. In this light, and not knowing whether the correction factor had already been applied or not, no correction factor was used.

[9] The data as summarized by method are presented in Table 1. The distributions are strongly positively skewed. Outliers with values larger than 15% organic C have been omitted and a natural log transformation used to reduce the skewness.

[10] Maps of soil thickness were also produced during the NLWRA, but as it proved difficult to generate models with the machine learning approach [Henderson et al., 2005], lookup tables of soil depth by soil type were linked to soil maps [Carlile et al., 2001]. These and bulk density maps, similarly derived from soil surveys, as there were few recorded measurements of this soil property in the ASRIS database [Carlile et al., 2001], are needed to estimate C density as t ha⁻¹ from percent of C by mass of soil [Homann et al., 2007]. No correction for percent of gravel was available.

Table 1. Topsoil Percent Organic C by Laboratory Method in the ASRIS Database(a)

Method   Min    q10    q25    q50    q75    q90    Max     N
6A1      0.02   0.64   0.92   1.40   2.20   3.40   14.02   3934
6A1.UC   0.04   0.85   1.54   2.76   3.45   4.37   14.80   3913
6B1      0.01   0.75   1.28   2.25   3.63   5.52   14.59   2157
6B2      0.10   0.60   1.10   1.80   5.07   8.09   14.49   585
6B3      0.05   0.80   1.09   1.63   2.50   3.98   13.70   1132
6.DC     0.04   0.68   1.00   2.60   4.93   6.94   14.70   351

(a) ASRIS, Australian Soil Resource Information System.

2.2. Modeling

[11] Piecewise linear tree models were built using Cubist, a commercial data mining software package (http://www.rulequest.com), as explained in detail by Henderson et al. [2001, 2005], to predict ln(org C) in topsoil and in subsoil using predictors that included climatic, digital elevation and terrain variables, Landsat Multispectral Scanner (4 bands), lithology, soil type, and land use maps [Johnston et al., 2003]. The predictors were chosen to represent potential drivers of soil processes. All predictors were coregistered and resampled to match the 9 s digital elevation grid. A list and description of the predictors is available in Text S1–S4.(1)

[12] Cubist uses a recursive partitioning of the predictor variable space in a similar way to the regression tree methodology of CART [Breiman et al., 1984]. Both methods take a divide-and-conquer strategy and seek to minimize the intrasubset variation at each node. However, where CART uses the variance, Cubist uses the standard deviation as a measure of error. The reduction in error as a result of splitting at a node is given by

$$\Delta\mathrm{error} = \mathrm{sd}(T) - \sum_{i} \frac{|T_i|}{|T|} \times \mathrm{sd}(T_i)$$

where T denotes the training cases available at that node, T_i represents the subset of those cases with the ith outcome following a given split, and |·| denotes the count. The standard deviation of the response values is calculated for T and each subset T_i. The Δerror then represents the expected reduction in error as a result of that split. Cubist chooses the split so as to maximize the expected reduction in error rate across all potential splits.

[13] Cubist models take the form: if [condition(s)], then [linear model], e.g., if [lowest monthly radiation > 2620 and annual mean moisture index ≤ 5542 and lithology in {5, 8, ...}], then [percent of SOC = linear model]. If the predictor variables associated with an observation satisfy the set of conditions, the linear model is used to predict the response.

(1) Auxiliary materials are available in the HTML. doi:10.1029/2009GB003506.
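The error-reduction criterion of paragraph [12] can be illustrated with a minimal Python sketch. It is not Cubist's implementation: the function names `sd_reduction` and `best_threshold` are hypothetical, the population standard deviation is an assumption (the text does not say which estimator Cubist uses), and the sample values are invented.

```python
from statistics import pstdev


def sd_reduction(parent, subsets):
    """Delta error = sd(T) - sum_i (|T_i| / |T|) * sd(T_i).

    Assumption: population standard deviation; Cubist's exact
    estimator is not stated in the text.
    """
    n = len(parent)
    weighted = sum(len(s) / n * pstdev(s) for s in subsets if s)
    return pstdev(parent) - weighted


def best_threshold(x, y):
    """Scan candidate thresholds on one continuous predictor and
    return (threshold, reduction) for the largest error reduction."""
    best = (None, 0.0)
    for t in sorted(set(x))[:-1]:  # the largest value cannot split the data
        left = [yi for xi, yi in zip(x, y) if xi <= t]
        right = [yi for xi, yi in zip(x, y) if xi > t]
        gain = sd_reduction(y, [left, right])
        if gain > best[1]:
            best = (t, gain)
    return best


# Invented example: a predictor and ln(org C) values at four sites.
predictor = [1, 2, 3, 4]
ln_org_c = [0.1, 0.1, 2.0, 2.0]
threshold, gain = best_threshold(predictor, ln_org_c)
```

In this toy case the split between the second and third sites yields two subsets with zero spread, so the whole parent standard deviation is recovered as error reduction.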

The advantage of the condition set in each rule is that it enables interactions to be handled automatically by allowing different linear models to capture the local linearity in different parts of the predictor variable space. This can often lead to smaller trees and better prediction accuracy than regression trees [Quinlan, 1992; Uysal and Güvenir, 1999]. To avoid overfitting, the size of the decision tree can be controlled by two optional parameters in Cubist: a constraint on the minimum number of observations upon which to base a rule (here, set at 2.5% of cases or 183 points) and a brevity factor (2% was used) which uses heuristics to control the complexity of the model.

[14] When a given observation and its associated predictor variables satisfy more than one rule set, the average of the predictions is taken as the overall prediction. A smoothing process is adopted to compensate for the discontinuities that may occur between linear models at different leaves as described in detail by Quinlan [1992].

[15] Models were built with a 70:30 training:test data split in the development stage. 70% of the observations were used to construct the model, and 30% were held back in order to assess the performance of each model. Once the strongest possible model according to performance on the test data was identified, it was refitted using all the data to maximize the use of the sparse data, with the same model form and options. The performance of the model on the full data set was assessed by 10-fold cross validation. The data were randomly divided into 10 partitions or folds. At each step, nine of these partitions were used to fit the model and the performance assessed on the remaining partition held back as the test data. This procedure was repeated for each partition sequentially. The performance, averaged over all 10 partitions held back, delivers the cross-validated performance assessment.
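The rule form of paragraph [13] and the averaging of overlapping rules in paragraph [14] can be sketched as follows. This is an illustration under stated assumptions, not the ASRIS model: the thresholds echo values quoted in the text, but the rule set, intercepts, coefficients, and the names `RULES` and `predict_ln_soc` are invented, and the smoothing step of Quinlan [1992] is left out.

```python
# Hypothetical rule set: thresholds echo values quoted in the text, but the
# intercepts and coefficients are invented for illustration only.
RULES = [
    {   # fires in drier cells on selected lithology classes
        "when": lambda p: p["ammi"] <= 5542 and p["lith"] in {5, 8, 11},
        "intercept": -0.2,
        "coefs": {"ammi": 0.0001},
    },
    {   # fires above an elevation threshold
        "when": lambda p: p["elevation"] > 226.1127,
        "intercept": 0.1,
        "coefs": {"elevation": 0.001},
    },
    {   # fires in moister cells
        "when": lambda p: p["ammi"] > 5542,
        "intercept": 0.9,
        "coefs": {},
    },
]


def predict_ln_soc(predictors, rules=RULES):
    """Average the linear-model predictions of every rule whose conditions
    are satisfied; return None when no rule applies."""
    matched = [
        r["intercept"] + sum(c * predictors[name] for name, c in r["coefs"].items())
        for r in rules
        if r["when"](predictors)
    ]
    if not matched:
        return None
    return sum(matched) / len(matched)


# A grid cell that satisfies the first two rules; their predictions are averaged.
cell = {"ammi": 5000, "lith": 5, "elevation": 300.0}
prediction = predict_ln_soc(cell)
```

Keeping each rule as a condition plus a linear model mirrors the "if [condition(s)], then [linear model]" structure, which is also what makes it possible to map where each rule applies.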
[16] Because the predictors are spatially exhaustive, maps of percent of organic C could be produced from the Cubist model’s rules after applying a conversion factor to the values predicted to account for the use of the natural log transformation on the original data. The resolution of the resulting map matches that of the predictors, 9 arc sec or 250 m in this case. In fact, this is one of the advantages of using this approach to map SOC: the resolution of the map can increase as higher resolution predictor layers become available, without new sampling if the positional accuracy of the points with SOC measurements is good enough, i.e., it falls within that of the new predictors. Other advantages of tree-based methods are that they can use a combination of continuous and categorical predictors, are robust to predictor multicollinearity, and can handle missing values. In addition it is easy to evaluate the rule-based models by examining the spatial pattern of the rules and their predictor variables.

2.3. Model Evaluation

2.3.1. Quantitative Evaluation

[17] Cubist reports a set of global model error diagnostics and an estimated error for each rule. The average error gives the average absolute difference between the observed and predicted values, i.e.,

$$\text{average error} = \frac{1}{m} \sum_{j=1}^{m} \left| y_j - \hat{y}_j \right|.$$

Lower average errors imply that the predicted values are closer to the observed values more often. The average error is also known as the mean absolute deviation.

[18] The relative error is defined as the ratio of the average absolute error magnitude to the average error magnitude that would result from predicting the mean value:

$$\text{relative error} = \frac{\frac{1}{m} \sum_{j=1}^{m} \left| y_j - \hat{y}_j \right|}{\frac{1}{m} \sum_{j=1}^{m} \left| y_j - \bar{y} \right|}$$
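The two diagnostics defined above can be computed with a short Python sketch. The function names and the sample values are illustrative assumptions; Cubist reports these quantities itself, and this is not its code.

```python
def average_error(observed, predicted):
    """Mean absolute difference between observed and predicted values."""
    return sum(abs(y - y_hat) for y, y_hat in zip(observed, predicted)) / len(observed)


def relative_error(observed, predicted):
    """Average error divided by the average error obtained when the
    mean of the observations is used as the prediction everywhere."""
    mean_y = sum(observed) / len(observed)
    baseline = sum(abs(y - mean_y) for y in observed) / len(observed)
    return average_error(observed, predicted) / baseline


# Invented values standing in for observed and predicted ln(org C).
obs = [1.0, 2.0, 3.0, 4.0]
pred = [1.5, 2.0, 2.5, 4.0]
mae = average_error(obs, pred)
rel = relative_error(obs, pred)
```

Here the model's average error is a quarter of the baseline error from predicting the mean, so the relative error is 0.25.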

If there is little improvement on the mean, the environmental variables have little predictive capacity and the relative error is close to 1. Generally, the smaller the relative error, the better the model.

2.3.2. Ground Truthing

[19] Ideally, the predictions from the Cubist model should be validated against independently collected, unbiased data. Unfortunately, a ground truth exercise was not within the scope of ASRIS. Instead, the data set reported in the auxiliary material of Wynn et al. [2006] has been used for validation. Data of Wynn et al. [2006] were collected over 1999–2002 using a sampling design spatially stratified across the range of Australian native vegetation formations and analyzed by a single laboratory procedure. The data tables represent bulked sampling from five transects (500–1000 m long) that can be more than 1 km apart, so the ASRIS SOC predicted values gridded to 1 km resolution (that is, the resolution of the map delivered to the NLWRA) were used in the assessment rather than the original prediction at 250 m.

2.3.3. Qualitative Evaluation

[20] As explained by Bui et al. [2006], further evaluation of the Cubist models beyond the error diagnostics hinges on the visualization of the spatial pattern of the predictor variables and of the rules and evaluation of these patterns in light of disciplinary knowledge in biogeosciences. The regions of allocation by rules were examined against the following possible definitions or conceptualizations of process zones:

[21] 1. Biogeographical regionalizations such as the Interim Bioregions of Australia (IBRA), which is an integrated classification of both biotic and abiotic variation. The 80 IBRA regions represent a landscape-based approach to classifying the land surface, including attributes of climate, geomorphology, lithology, and characteristic flora and fauna. The regionalization is based on expert knowledge and a time series of vegetation greenness index [Thackway and Cresswell, 1995]. A map of IBRA is available at http://www.environment.gov.au/parks/nrs/science/ibra.html.

[22] 2. The estimated extent of pre-1750 (pre-European) major vegetation groups (MVG) as mapped during the NLWRA assessment of native vegetation (M. Cofinas and C. Creighton, Australian Native Vegetation Assessment, 2001, available at http://www.anra.gov.au/topics/vegetation/pubs/native_vegetation/nat_veg_nvis.html), where dominant growth form, cover, height and broad floristic code, usually dominant land cover genus of the uppermost or dominant stratum, are represented. The MVG descriptions are at http://www.environment.gov.au/erin/nvis/mvg/index.html.

[23] 3. Estimated primary productivity [Roxburgh et al., 2004] and percent of land cover surfaces [Lu et al., 2003], which represent above ground biomass that is the source of soil organic matter. None of the data sets listed above were used as predictors in ASRIS models; thus, they provide an independent assessment.

[24] 4. Other Australia-wide estimates of SOC [Spain et al., 1983; Wynn et al., 2006].

[25] 5. Zones where geochemical and biological processes control environmental chemistry [Reiners, 2003]: Geochemical control is dominant where surficial geological processes are rapid and physical transport processes are energetic (e.g., actively eroding slopes) or where climate conditions are extreme (e.g., deserts). Biotic control dominates where biologically congenial ranges of temperature and water availability prevail and where disturbances are weak. Such zonation in the control of environmental chemistry is reflected in the contrasting soil formation processes of leaching versus carbonate accumulation in the two zonal soil orders [Baldwin et al., 1938], pedalfer and pedocal. Pedalfers referred to soils, usually found in humid regions, in which sesquioxides increase relative to silica during pedogenesis.
On the other hand, pedocals referred to soils that form in arid or semiarid regions where the ratio of precipitation: evaporation  1, characterized by a thin A horizon low in organic matter, and secondary precipitation of calcite in the subsoil (caliche). (In the Australian Soil Classification [Isbell, 2002], these process zones would be reflected in the order Calcarosol versus other soil orders; a map of soil orders was used as a predictor in the Cubist model).

3. Results

3.1. Model and Map Evaluation

[26] The rules from the final Cubist model were applied to the ASRIS extent to generate maps of percent of SOC predictions (Figure 2a) and of estimated C stocks in the top 30 cm of soil (Figure 3). Given the error propagation involved in combining the maps required to estimate C density, the percent of SOC map is likely to be more reliable. Given that in Australia the C flux between land and atmosphere is largely from the upper 20 cm of soil [Barrett, 2002] and that the topsoil model is more reliable than the subsoil one, further discussion in this paper focuses on the topsoil percent of SOC model and map. The global model error diagnostics are given in Table 2. Table 3 lists the set of conditions of the 29 rules in the topsoil model, and the complete Cubist output is given in Text S5. A map of ranked rule-based estimated error as reported in Text S5 is shown in Figure 2b.

Figure 2. (a) Map of topsoil percent of soil organic C (SOC); (b) map of ranked, rule-by-rule, estimated error reported by Cubist.

Figure 3. Map of estimated SOC density from ASRIS data layers.

[27] Using other machine learning techniques such as Artificial Neural Networks (ANNs) and ensembles of ANNs on the same point data set did not generate very different results or improve overall model performance/statistical diagnostics [Spencer et al., 2006]. The performance of the various modeling techniques may be limited by the quality of the SOC data and the likely measurement errors as discussed above. In fact, the SOC model is one of the weakest produced in ASRIS [Henderson et al., 2005; Bui et al., 2006]. Nevertheless, the R² between predicted and measured is better than that obtained from environmental modeling in New Zealand [Tate et al., 2005].

[28] The R² between our predictions and SOC over the 0–30 cm depth interval as measured by Wynn et al. [2006] is 0.8 (Figure 4), which suggests that the ASRIS model is much better than it appeared from the Cubist model statistics. This underscores the potential imperfection of the ASRIS data as discussed in section 2.1, the ability of Cubist to identify meaningful structure under fairly low signal to noise conditions, and indirectly vindicates our decision not to use a correction factor for the Walkley-Black SOC analyses in ASRIS.

[29] The 29 rules and the associated condition statements and thresholds involved generate spatially/geographically coherent patterns (Figures 2b and 5). When the mapping resulting from the individual rules is analyzed visually, four broad patterns emerge: most of the Murray-Darling basin, Northern Territory, South Australia, and Western Australia, where C levels are low, are accounted for by the first eight rules (Figures 5a and 5b); southern Queensland is accounted for by rules 9–12 (Figure 5c); eastern New South Wales is accounted for by rules 13–25 (Figures 5d and 5e); and Victoria and Tasmania, where C levels are highest, are accounted for by rules 28–29 (Figure 5f). The coherence of the spatial patterns generated by the data-driven rules raises the question: do the regions delineated correspond to any landscape or soil process zones as recognized by expert natural scientists?

3.2. Spatial Pattern of Rules

3.2.1. Areas of Low SOC

[30] Areas defined by rules 1, 2, 3, and 4 have very low C content (Figure 5a). Areas covered by rules 1 and 2 correspond mostly to mallee (multistemmed, lignotuberous eucalypts) and Acacia shrublands in southern Australia and to Melaleuca and Eucalyptus forests and woodlands in the north. Calcarosols and other soils rich in calcrete, finely divided carbonates, and clay [Verboom and Pate, 2006] are the most important soil orders covered by these two rules. Whereas these areas are very low in organic C, they would have very high inorganic C. Areas defined by rule 3 (mostly the Darling and Murray Riverina clay-rich floodplains) are separated from those defined by rule 2 by a relief threshold of

38,200 Elevation ≤ 226.1127 Lith {6, 14, 19}

Table 3. (continued)

Rule  Condition
2     AMMI 38,200; Elevation ≤ 142; Relief > 20.55441; Lith {1, 2, 3, 5, 7, 8, 9, 10, 11, 12, 13, 15, 16, 18, 20, 22}
3     AMMI 38,200; Elevation ≤ 142; Relief ≤ 20.55441; Lith {1, 2, 3, 5, 7, 8, 9, 10, 11, 12, 13, 15, 16, 18, 20, 22}
4     Highest month radiation > 2,620; AMMI ≤ 5,542; Lith {5, 8, 11, 13, 18, 19}
5     AMMI ≤ 5,542; Moisture index seasonality > 38,200; Elevation > 226.1127; Lith {1, 7, 9, 12, 14, 15, 22}
6     AMMI ≤ 5,542; Moisture index seasonality > 69,354; Elevation ≤ 226.1127; Lith {1, 2, 3, 5, 7, 8, 9, 10, 11, 12, 13, 15, 16, 18, 20, 22}
7     AMMI ≤ 5,542; Moisture index seasonality > 38,200; 226.1127 ≥ Elevation > 142; Lith {1, 2, 3, 5, 7, 8, 9, 10, 11, 12, 13, 15, 16, 18, 20, 22}
8     Max T, warmest month > 3,094; Highest month radiation ≤ 2,620; AMMI ≤ 5,542; Moisture index seasonality > 38,200; Elevation > 226.1127; Lith {1, 5, 7, 8, 9, 11, 12, 13, 14, 15, 18, 19, 22}
9     AMMI ≤ 5,542; Moisture index seasonality ≤ 38,200; ASC {Kurosol, Sodosol, Kandosol, Rudosol, Tenosol}
10    Max T, warmest month ≤ 3,094; Highest month radiation ≤ 2,620; AMMI ≤ 5,542; Moisture index seasonality > 38,200; Elevation > 226.1127; Lith {5, 8, 11, 13, 18, 19}
11    Lowest month radiation > 1,299; AMMI ≤ 5,542; Moisture index seasonality ≤ 38,200; ASC {Vertosol, Chromosol, Ferrosol, Dermosol}
14    AMMI ≤ 5,542; Moisture index seasonality > 38,200; Elevation > 226.1127; Lith {2, 3, 4, 6, 10}
15    MAT > 1,233; Isothermality ≤ 5,209; Max T, warmest month ≤ 2,954; MAP ≤ 1,374; AMMI > 5,542; Highest MMI ≤ 994; Elevation > 33.0575; Lith {1, 2, 4, 5, 6, 7, 9, 10, 11, 13, 17, 18, 20, 21}; Land use {3, 10, 13, 14, 16}; ASC {Vertosol, Kurosol, Sodosol, Dermosol, Kandosol, unknown}
16    Isothermality ≤ 5,209; MAP > 1,374; Highest MMI ≤ 994; Lith {1, 2, 4, 5, 6, 7, 9, 10, 11, 13, 17, 18, 20, 21}
17    MAT ≤ 1,233; Isothermality ≤ 5,209; Max T, warmest month ≤ 2,954; MAP ≤ 1,374; Highest MMI ≤ 994
18    Isothermality ≤ 5,209; Max T, warmest month > 2,954; MAP ≤ 1,374; AMMI > 5,542; Lith {1, 2, 4, 5, 6, 7, 9, 10, 11, 13, 17, 18, 20, 21}; ASC {Vertosol, Chromosol, Calcarosol, Ferrosol, Dermosol, Tenosol}
19    MAT > 1,233; Isothermality ≤ 5,209; Max T, warmest month ≤ 2,954; MAP ≤ 1,374; Elevation ≤ 33.0575
20    AMMI ≤ 5,542; Moisture index seasonality ≤ 38,200; Relative elevation > 55.92682; ASC {Vertosol, Chromosol, Ferrosol, Dermosol}
21    Isothermality > 5,209; Max T, warmest month > 2,582; AMMI > 5,542; Highest MMI ≤ 994; Lith {1, 2, 4, 5, 6, 7, 9, 10, 11, 13, 17, 18, 20, 21}
22    Isothermality ≤ 5,277; AMMI > 5,542; Relative elevation 5,542 Lith {1, 2, 4, 5, 6, 7, 9, 10, 11, 13, 17, 18, 20, 21}; ASC {Kurosol, Sodosol, Kandosol, unknown}
23    Isothermality ≤ 5,209; Max T, warmest month ≤ 2,954; MAP ≤ 1,374; AMMI > 5,542; Highest MMI ≤ 994; Lith {1, 2, 4, 5, 6, 7, 9, 10, 11, 13, 17, 18, 20, 21}; Land use {1, 12, 15, 17, 18}; ASC {Vertosol, Kurosol, Sodosol, Dermosol, Kandosol, unknown}
24    MAT > 1,233; Isothermality ≤ 5,209; Max T, warmest month ≤ 2,954; AMMI > 5,542; Elevation > 33.0575; Lith {1, 2, 4, 5, 6, 7, 9, 10, 11, 13, 17, 18, 20, 21}; ASC {Hydrosol, Chromosol, Ferrosol, Tenosol}
25    Isothermality ≤ 5,277; AMMI > 5,542; Highest MMI < 991; Lith {0, 3, 8, 12, 14, 15, 16, 19, 22}
26    Isothermality > 5,209; Max T, warmest month ≤ 2,582; Lith {1, 2, 4, 5, 6, 7, 9, 10, 11, 13, 17, 18, 20, 21}
27    Isothermality ≤ 5,277; AMMI > 5,542; Lith {0, 3, 8, 12, 14, 15, 16, 19, 22}
28    Highest MMI > 994; Lith {1, 2, 4, 5, 6, 7, 9, 10, 11, 13, 17, 18, 20, 21}
29    Isothermality ≤ 5,277; Highest MMI > 991; Relative elevation > 45.7383

a Codes for lithology and land use are listed in Text S1 and Text S4, respectively. ASC classes are soil orders of the Australian Soil Classification [Isbell, 2002]. AMMI, annual mean moisture index; MAT, mean annual temperatures; MAP, mean annual precipitation.

Cobar Plain (a central New South Wales IBRA) and in Western Australia [Verboom and Pate, 2006], and silcrete in the Stony Plains of South Australia (http://www.anra.gov.au/topics/vegetation/assessment/sa/ibra-stony-plains.html: "Arid stony silcrete tablelands and gibber and gypsum plains with sparse low chenopod shrublands over short-lived tussock grasses on duplex soils and calcareous earths."). Rule 5 addresses the Yalgoo-Murchison, Avon Wheatbelt, Coolgardie, and mallee areas of Western Australia not covered by rule 4. Rule 6 maps the heaths of Western Australia’s Geraldton Sandplain, pockets of mallee and in South Australia and New South Wales, and most of northern Australia (Figure 5b). The Einasleigh Uplands are mapped in rules 7, 8, 10, and 14 (Figure 5b).

3.2.2. Inland Areas of Moderate SOC

[31] The area of southern Queensland addressed by rules 9, 11, 12, and 20 corresponds with the brigalow (Acacia harpophylla) belt IBRA in the Belyando, Fitzroy, and Condamine-Culgoa river basins (Figure 5c). In this region, soils have moderate topsoil C and, not surprisingly, high subsoil C given the organic C soil profiles reported by Spain et al. [1983] for similar soil orders. In this region soil type is an important predictor of SOC and there is an association between soils and vegetation (Table 3): certain soil orders coincide with patches of Eucalypt woodlands (rule 9) whereas different soils correspond to Acacia forests and woodlands, and tussock grasslands (rules 11 and 12) in the pre-European vegetation map. Patches of vine thickets are included in the area attributed by rule 20. The highest levels of topsoil organic C in this region, though, are attributed by rule 24, in pockets where the annual mean moisture index (AMMI) is >5542 (Figure 6). (The AMMI is an indexed estimate of the average weekly soil moisture content that mimics the effect of soil texture on the water balance [Houlder et al., 2000]. It requires input data for rainfall, evaporation, and soil water storage/availability, all in mm. It ranges between 0 (dry) and 1 (moist), but the grid has been multiplied by 10⁴ to convert it to integer, so 5542 is really 0.5542.)

3.2.3. Coastal Areas of Moderate to High SOC

[32] Areas of moderate to high SOC content in relatively moist coastal zones are mapped by rules 13, 15–19, and 21–29 (Figures 5d and 5f). In these areas, temperature variables and mean annual precipitation are important in defining the spatial pattern of rules. Land use is used as a predictor in the rule set only in rules 15 and 23. Most of the extent of rule 15 corresponds with native pastures whereas rule 23 applies over large areas labeled as "conservation" and "production forest". Compared to the first few rules (Figure 5a), the extent of application of the rules is much smaller and the spatial pattern more complex (Figures 5d and 5f). Many rules are applied in piecemeal fashion over many IBRA regions. A few are more restricted in extent: Rule 17 applies over the Southeastern Highlands of New South Wales where vegetation is reportedly highly varied and dependent on elevation, rainfall, aspect and drainage (http://www.anra.gov.au/topics/vegetation/assessment/New South Wales/ibra-south-eastern-highlands.html).

[33] The Queensland Wet Tropics are attributed by Rule 21. Rule 22 applies to the basaltic Volcanic Plains in Victoria and the Tasmanian Northern Midlands, both areas where much of the original characteristic eucalyptus forests and woodlands have been cleared and replaced by improved pastures. Rule 23 applies mostly over the Sydney basin and South East Corner bioregions.
The South East Corner bioregion is important biogeographically as the area of overlap between cool temperate and warm temperate zones. Vegetation consists of high elevation woodlands, wet and damp sclerophyll forests interspersed with rain shadow woodlands in the Snowy River Valley, sclerophyll forests and woodlands, warm temperate rain forest, heath, and wetlands being common along the coast (http://www.anra.gov.au/topics/vegetation/assessment/New South Wales/ibra-south-east-corner.html). Rule 24 applies mostly to the New South Wales North Coast bioregion.

3.2.4. Areas of High SOC

[34] Areas of high soil C are mapped by rules 26 to 29 and are all associated with remnant native forests, but occur on a wide range of soil orders. Rules 26 and 27 cover mostly the Warren bioregion in Western Australia that is described as "a refugia, with relict taxa from a wetter milder era, evidenced by groups and species of vascular and cryptic flora and invertebrates normally associated with the rainforests/Nothofagus forests of SE Australia, and now extinct elsewhere in the State." (http://www.anra.gov.au/topics/vegetation/assessment/wa/ibra-warren.html). Rule 27 also attributes the wet tropical rain forests along coastal northern Queensland. Current areas of Nothofagus forests in Tasmania are mapped by rule 29, which allocates the highest C levels and also attributes the Jarrah forests in Western Australia and tall eucalypt forests in Victoria.


Rule 28, which covers some of the same regions as rule 29, may be attributing slightly lower C levels in areas that are under stress from changes in landscape function associated with land use, especially in the Southeastern Highlands bioregion of Victoria (http://www.anra.gov.au/topics/vegetation/assessment/vic/ibra-south-eastern-highlands.html).

Figure 4. (a) Location of data reported in the auxiliary material of Wynn et al. [2006]. Only 25 points overlap with the ASRIS extent. (b) Relationship between SOC predicted with ASRIS data and data reported by Wynn et al. [2006]. R^2 between ASRIS predictions and SOC_30_T (near trees) is 0.84, and R^2 between ASRIS predictions and SOC_30_G (away from trees, in grass) is 0.78.

3.3. Environmental Thresholds and Their Significance

[35] Consistent threshold values emerge in the conditions of the 29 rules; the significance of these thresholds is considered next.

[36] A single threshold value of 5542 in the AMMI predictor is used consistently throughout the rule set


Figure 5. Maps of the domain of application of the 29 rules for predicting topsoil C. Rules are not mutually exclusive and there is sometimes overlap in their areas, although this is not detectable here.
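
The note in the Figure 5 caption that rules are not mutually exclusive matters for prediction, because a site can satisfy the conditions of more than one rule. The following is a minimal sketch, not the authors' implementation, of how overlapping rules can be combined in a piecewise linear rule model. It assumes that the linear predictions of all matching rules are simply averaged, which is how Cubist-style models are generally described. The rule names, conditions, coefficients, and predictor names (`ammi`, `map_mm`) are hypothetical and are not the fitted values reported in Table 3.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

Site = Dict[str, float]  # predictor name -> value at one grid cell


@dataclass
class Rule:
    """One rule: a condition plus a linear model used where the condition holds."""
    name: str
    condition: Callable[[Site], bool]
    intercept: float
    coefficients: Dict[str, float]

    def predict(self, site: Site) -> float:
        # Linear model: intercept + sum of coefficient * predictor value.
        return self.intercept + sum(
            coef * site[name] for name, coef in self.coefficients.items()
        )


def predict_soc(rules: List[Rule], site: Site) -> float:
    """Average the linear predictions of every rule whose condition matches."""
    matches = [rule.predict(site) for rule in rules if rule.condition(site)]
    if not matches:
        raise ValueError("no rule covers this site")
    return sum(matches) / len(matches)


# Hypothetical rules; conditions and coefficients are illustrative only.
RULES = [
    Rule("wet", lambda s: s["ammi"] > 5542, 1.0, {"ammi": 0.0002}),
    Rule("high_rain", lambda s: s["map_mm"] > 1000, 2.0, {"map_mm": 0.001}),
    Rule("dry", lambda s: s["ammi"] <= 5542, 0.5, {"ammi": 0.0001}),
]

if __name__ == "__main__":
    # This site satisfies both "wet" and "high_rain", so two models are averaged.
    site = {"ammi": 6000.0, "map_mm": 1200.0}
    print(round(predict_soc(RULES, site), 3))
```

Averaging is only one possible way to resolve overlap; the point of the sketch is that the mapped domain of a rule (Figure 5) and the final predicted value at a site are not the same thing when rules overlap.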


Figure 6. The moisture index (AMMI) threshold emergent in the Cubist model corresponds well, over Western Australia and eastern Australia, to the water availability threshold that emerges in the work of Wynn et al. [2006] and to the 2% SOC threshold.

(Table 3). This value delineates a boundary that is roughly aligned with the Great Dividing Range in eastern Australia and the Darling Range in Western Australia (Figure 6). The position of this threshold over eastern and southern Australia is very similar to the position of the threshold of 1835 mm/yr for water availability (W*, defined as a function of mean annual precipitation and global solar radiation) above which Wynn et al. [2006] found that the availability of water is no longer the dominant controlling variable for SOC (Figure 6). Where AMMI < 5542 or W* is 2% whereas on the drier side of the boundary soil C is 5542 (higher soil moisture) (Table 3). The rules’ thresholds are consistent with the results of Wynn et al.


Figure 7a. Topsoil percent of OC and percent of clay in ASRIS point database.

Figure 7b. SOC, AMMI and percent of clay and mean annual temperatures (MAT) thresholds emergent in Cubist models. MAT > 12.33°C is from the topsoil C model (see Table 3 and Text S5). The 25% clay threshold is from the Cubist percent of clay model [Bui et al., 2006].
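
The thresholds named in the Figure 7b caption can be written down compactly. The sketch below is illustrative only: it assumes the AMMI grid is stored as an integer equal to the 0 to 1 index multiplied by 10^4 (as stated in the text), and it treats a value exactly at a threshold as falling on the lower side, which is an assumption rather than something the source specifies. The function and key names are hypothetical.

```python
AMMI_SCALE = 10_000        # grid stores the 0-1 moisture index multiplied by 10^4
AMMI_THRESHOLD = 5542      # i.e. 0.5542 on the original 0-1 scale
MAT_THRESHOLD_C = 12.33    # mean annual temperature threshold (topsoil C model)
CLAY_THRESHOLD_PCT = 25.0  # threshold from the separate percent-of-clay model


def ammi_to_fraction(ammi_int: int) -> float:
    """Convert the integer grid value back to the 0 (dry) to 1 (moist) index."""
    return ammi_int / AMMI_SCALE


def threshold_flags(ammi_int: int, mat_c: float, clay_pct: float) -> dict:
    """Report on which side of each emergent threshold a site falls."""
    return {
        "wetter_than_ammi_threshold": ammi_int > AMMI_THRESHOLD,
        "warmer_than_mat_threshold": mat_c > MAT_THRESHOLD_C,
        "clay_above_threshold": clay_pct > CLAY_THRESHOLD_PCT,
    }


if __name__ == "__main__":
    print(ammi_to_fraction(AMMI_THRESHOLD))
    print(threshold_flags(6000, 15.0, 30.0))
```

Keeping the scale factor as a named constant makes it harder to compare an integer grid value against a threshold expressed on the 0 to 1 scale by mistake.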


[2006] who found that temperature affected decomposition significantly only above the W* threshold of 1835 mm/yr and that the effect of fine soil texture was noticeable only at mean annual temperatures (MAT) above 10°C and where the fine fraction was >20% but the trend is not: As shown in Figure 7b, SOC levels are higher when MAT is 25. Similarly, Ise and Moorcroft [2006] found that the temperature sensitivity of SOC decomposition at global scales is significantly less than is assumed by many terrestrial ecosystem models and that the maximal rate of decomposition occurs at higher moisture values than is assumed by many models. Most recently, Adair et al. [2008] found that a composite climate variable that incorporates the effect of soil moisture and temperature is better at modeling litter decomposition than simply using evapotranspiration, MAT, and MAP.

[40] Land use is used only in rules 15 and 23 (Figures 5d and 5e), east of the Great Dividing Range, where AMMI > 5542. Rule 15 applies mostly over native pastures and rule 23 over forested areas. This is important from the knowledge discovery standpoint because the ratio of decomposable to resistant plant material in litter from these two types of land use/cover is assumed to be quite different in soil C models but there is not much support for the ratio used for native grasses [Shirato and Yokozawa, 2006]. While they cannot be used to ascribe a value to the ratios, the ASRIS results indicate that there is a significant difference between the land uses and suggest areas where comparative decomposition studies should be pursued. The C:N ratio is narrower and the upper value lower for rule 15 (8–21) than for rule 23 (7–68, with most values in the 20–30 range), thus there is clearly more decomposition in the native pasture than in the forests covered in rule 23 (eucalypt open forests in the pre-European vegetation map). 
That land use does not turn out to be an important predictor in the Murray-Darling basin and the Western Australia Wheatbelt, Australia’s most productive agricultural regions, is surprising. These regions have 5542, a MAP threshold of < or >1374 mm and a MAT threshold of < or >12.33°C emerge consistently. Thus, in the dry interior of Australia, abiotic photodegradation may be a dominant decomposition mechanism whereas, in wetter coastal regions, biotic decomposition dominates.

[43] The rules separate the overall data set into more homogenous subsets, thus in this case, they identify areas where SOC responds closely to climatic and other environmental variables. Clearly the spatial pattern of rules from the Cubist model for SOC corresponds to the IBRA regions – to a surprising extent given that many of the sampled sites used to build the model are no longer under native vegetation. Thus it appears that there is a lingering C legacy signature and that change in land use has not changed SOC much. It seems that the rules in the model for topsoil C reproduce a bioregionalization approach somewhat akin to that used by biogeographers where higher levels of the hierarchy are determined by climatic characteristics and lower levels are defined by landforms and biocoenosis/vegetation formations [Cox and Moore, 2000]. This outcome is interesting from a knowledge discovery viewpoint in that, given similar input data sets, the data mining algorithm seems to be mimicking the classification process of natural scientists, with similar results.

[44] The association of soil C predictions with duricrusts is also interesting from a knowledge discovery standpoint because Pate et al. [2001] have reported an association between ferruginous gravels and organic C in Western Australia, where up to 15 000 kg ha^-1 of C could be tied up in Fe-rich pisoliths that have accumulated over 40 000–400 000 years. 
Thus rule 4 seems to be identifying a zone associated with a particular C stabilization process. The spatial patterns identified in rules 2, 3, and 4 that correspond to major landform groups such as the Riverina and the Cobar Plain are notable. Rules 7 and 8 also identify some major New South Wales landforms such as the western slopes of the Great Dividing Range.

[45] The results for the topsoil C model using the ASRIS point database show a spatially variable hierarchy of environmental factors that control soil C distribution: in drier areas (where AMMI < 5542), soil moisture, radiation, elevation and relief (that define landforms), and soil type are the main predictors that define the spatial patterns in SOC whereas, in wetter areas (where AMMI > 5542), other climatic variables (isothermality, temperature in the warmest month, mean annual precipitation) and land use become important. Thus, with the insight gained from this data-driven modeling, we have begun to elucidate the drivers of SOC dynamics and their interaction over different regions in Australia.

[46] There are few continental scale data-driven studies of the distribution of SOC and the factors that govern it but, in


the United States, Homann et al. [2007] also found a spatially variable hierarchy of importance for environmental variables in terms of their explanatory power for the variance in SOC across the United States. For example, in the Northwestern Forests, log (MAP) was dominant whereas it was fourth in accounting for variance in SOC in the neighboring Western Mountain Forests, and in the Southern Plains Grasslands, percent of clay was the most important in explaining variance in SOC.

5. Summary and Concluding Remarks [47] A rule induction modeling approach is able to generate a credible map of SOC across agriculturally productive areas of Australia. Using this approach, it is possible to produce a map with subkilometer resolution even with a relatively sparse training data set that is biased (nonrandom) in terms of sample selection and contains potentially redundant or irrelevant information in the predictor set. In fact this kind of data set is well suited for data mining applications [e.g., Zhang and Fan, 2008]. The benefit of this approach over other mapping methods is that it allows additional insight into biogeological surficial processes when the predictors are chosen to represent drivers of those processes. [48] The spatial pattern of rules corresponds well (1) to maps of estimated primary productivity; (2) to bioclimatic zones as defined by experts responsible for delineating the IBRA; (3) to pre-European vegetation cover type estimates; and (4) to landforms and soils in drier areas where geological processes would be expected to dominate biotic ones, but not across Australia. [49] Overall the spatial topsoil C pattern and its correspondence with vegetation is consistent with and thus supports Reiners’ [1986] theorem 4, that the world’s biota drives and regulates the global biogeochemical cycles of elements. [50] That the spatial pattern of Cubist results is consistent with that of Wynn et al. [2006] for Australia serves to validate both C models. When and where independently derived model results converge they serve to validate each other. Both models are compatible with Reiners’ [1986] ideas on the spatial distribution of biotic control over environmental chemistry. Both support the view expressed by Bui et al. [2006] that the vegetation response to climate in Australia has been instrumental in controlling certain chemical soil properties. 
[51] A lingering C legacy signature from the original native vegetation suggests that SOC does not respond quickly to land use change. Thus the map produced could be considered as a baseline, c. 1990, for future monitoring of SOC in Australia. The modeling approach presented herein could also be used with other historical SOC measurements at georeferenced points in national soil survey databases around the world. Doing so could provide additional insight into SOC dynamics under different environmental conditions. [52] The spatial pattern of association between SOC and IBRA regions observed from the Cubist model may reflect parallel patterns in soil chemistry, litter chemistry and plant and microbial communities but there is not enough research


across Australian ecosystems to draw any definitive conclusions regarding this. A more detailed investigation of the soil nutrients and vegetation relationships using ASRIS and NVIS is the subject of another paper (E. N. Bui and B. L. Henderson, Organic matter stoichiometry in Australian soils with respect to vegetation and environmental factors, manuscript in preparation, 2009). Investigating the chemistry of litter and of its decomposition, and the soil microbial communities under different vegetation groups, should be a priority for future research in C cycling in Australia.

[53] Acknowledgments. Many thanks to Warwick McDonald and Raphael Viscarra-Rossel for providing constructive comments on an earlier draft of this manuscript and to Jonathan Wynn for providing a map of W*.

References

Adair, E. C., W. J. Parton, S. J. Del Grosso, W. L. Silver, M. E. Harmon, S. A. Hall, I. C. Burke, and S. C. Hart (2008), Simple three-pool model accurately describes patterns of long-term litter decomposition in diverse climates, Global Change Biol., 14, 1–25, doi:10.1111/j.1365-2486.2008.01674.x.
Amundson, R. (2001), The carbon budget in soils, Annu. Rev. Earth Planet. Sci., 29, 535–562, doi:10.1146/annurev.earth.29.1.535.
Arrouays, D., W. Deslais, and V. Badeau (2001), The carbon content of topsoil and its geographical distribution in France, Soil Use Manage., 17, 7–11.
Attiwill, P. M., P. J. Polglase, C. J. Weston, and M. A. Adams (1996), Nutrient cycling in forests of south-eastern Australia, in Nutrition of Eucalypts, edited by P. M. Attiwill and M. A. Adams, pp. 191–227, CSIRO Publ., Melbourne, Vic., Australia.
Baldwin, M., C. E. Kellogg, and J. Thorp (1938), Soil Classification, in Soils and Men: Yearbook of Agriculture 1938, edited by U.S. Dept. of Agric., pp. 979–1001, U.S. Govt. Print. Off., Washington, D. C.
Barrett, D. J. (2002), Steady state turnover of carbon in the Australian terrestrial biosphere, Global Biogeochem. Cycles, 16(4), 1108, doi:10.1029/2002GB001860.
Batjes, N. H. (2005), Organic carbon stocks in the soils of Brazil, Soil Use Manage., 21, 22–24, doi:10.1079/SUM2005286.
Bernoux, M., M. Conceicao Santana Carvalho, B. Volkoff, and C. C. Cerri (2002), Brazil’s soil carbon stocks, Soil Sci. Soc. Am. J., 66, 888–896.
Berry, S., B. Mackey, and T. Brown (2007), Potential applications of remotely sensed vegetation greenness to habitat analysis and the conservation of dispersive fauna, Pac. Conserv. Biol., 13, 120–127.
Brandt, L. A., J. Y. King, and D. G. Milchunas (2007), Effects of ultraviolet radiation on litter decomposition depend on precipitation and litter chemistry in a shortgrass steppe ecosystem, Global Change Biol., 13, 2193–2205, doi:10.1111/j.1365-2486.2007.01428.x.
Breiman, L., J. H. Friedman, R. A. Olshen, and C. J. Stone (1984), Classification and Regression Trees, Wadsworth, Monterey, Calif.
Bui, E. N., and B. L. Henderson (2003), Vegetation indicators of soil salinity in north Queensland, Austral. Ecol., 28, 539–552.
Bui, E. N., B. L. Henderson, and K. Viergever (2006), Knowledge discovery from models of soil properties developed through data mining, Ecol. Modell., 191, 431–446, doi:10.1016/j.ecolmodel.2005.05.021.
Carlile, P., E. Bui, C. Moran, D. Simon, and B. Henderson (2001), Method used to generate soil attribute surfaces for ASRIS using soil maps and look up tables, Tech. Rep. 24/01, Land and Water, CSIRO, Canberra, ACT, Australia.
Cox, C. B., and P. D. Moore (2000), Biogeography: An Ecological and Evolutionary Approach, 6th ed., Blackwell Sci., Malden, Mass.
de Toledo Castanho, C., and A. A. de Oliveira (2008), Relative effect of litter quality, forest type and their interaction on leaf decomposition in south-east Brazilian forests, J. Trop. Ecol., 24, 149–156, doi:10.1017/S0266467407004749.
Garnaut, R. (2008), The Garnaut Climate Change Review, Cambridge Univ. Press, Port Melbourne, Vic., Australia. (Available at http://www.garnautreview.org.au/index.htm)
Guo, Y., R. Amundson, P. Gong, and Q. Yu (2006), Quantity and spatial variability of soil carbon in the conterminous United States, Soil Sci. Soc. Am. J., 70, 590–600, doi:10.2136/sssaj2005.0162.
Henderson, B. L., E. N. Bui, C. J. Moran, D. A. P. Simon, and P. Carlile (2001), ASRIS: Continental-scale soil property predictions from point data, Tech. Rep. 28/01, Land and Water, CSIRO, Canberra, ACT, Australia.


Henderson, B. L., E. N. Bui, C. J. Moran, and D. A. P. Simon (2005), Australia-wide predictions of soil properties using decision trees, Geoderma, 124, 383–398, doi:10.1016/j.geoderma.2004.06.007.
Homann, P. S., J. S. Kapchinske, and A. Boyce (2007), Relations of mineral-soil C and N to climate and texture: Regional differences within the conterminous USA, Biogeochemistry, 85, 303–316, doi:10.1007/s10533-007-9139-6.
Houlder, D., M. Hutchinson, H. Nix, and J. McMahon (2000), ANUCLIM, Version 5.1, User Guide, Cent. for Resour. and Environ. Stud., Aust. Natl. Univ., Canberra, ACT, Australia. (Available at http://fennerschool.anu.edu.au/publications/software/anuclim/doc/Contents.html)
Howard, P. J. A., P. J. Loveland, R. I. Bradley, F. T. Dry, D. M. Howard, and D. C. Howard (1995), The carbon content of soil and its geographical distribution in Great Britain, Soil Use Manage., 11, 9–15, doi:10.1111/j.1475-2743.1995.tb00488.x.
Isbell, R. (2002), Revised Edition of the Australian Soil Classification, CSIRO Publ., Melbourne, Vic., Australia.
Ise, T., and P. R. Moorcroft (2006), The global-scale temperature and moisture dependencies of soil organic carbon decomposition: An analysis using a mechanistic decomposition model, Biogeochemistry, 80, 217–231, doi:10.1007/s10533-006-9019-5.
Johnston, R. M., et al. (2003), ASRIS: The database, Aust. J. Soil Res., 41, 1021–1036, doi:10.1071/SR02033.
Jones, R. J. A., R. Hiederer, E. Rusco, P. J. Loveland, and L. Montanarella (2005), Estimating organic carbon in the soils of Europe for policy support, Eur. J. Soil Sci., 56, 655–671, doi:10.1111/j.1365-2389.2005.00728.x.
Kern, J. S. (1994), Spatial patterns of soil organic carbon in the contiguous United States, Soil Sci. Soc. Am. J., 58, 439–455.
Lu, H., I. P. Prosser, C. J. Moran, J. C. Gallant, G. Priestley, and J. G. Stevenson (2003), Predicting sheetwash and rill erosion over the Australian continent, Aust. J. Soil Res., 41, 1037–1062, doi:10.1071/SR02157.
Milne, R., and T. A. Brown (1997), Carbon in the vegetation and soils of Great Britain, J. Environ. Manage., 49, 413–433, doi:10.1006/jema.1995.0118.
Moran, C., I. Prosser, and G. Cannon (2001), Specification of the SEDUM model for modeling patterns of sediment transport based on unit stream power, CSIRO Land Water Tech. Rep. 25/01, Land and Water, CSIRO, Canberra, ACT, Australia. (Available at http://www.clw.csiro.au/publications/technical2001/tr25-01.pdf)
Parr, J. F., and L. A. Sullivan (2005), Soil carbon sequestration in phytoliths, Soil Biol. Biochem., 37, 117–124, doi:10.1016/j.soilbio.2004.06.013.
Pate, J. S., W. H. Verboom, and P. D. Galloway (2001), Co-occurrence of Proteaceae, laterite and related oligotrophic soils: Coincidental associations or causative inter-relationships?, Aust. J. Bot., 49(5), 529–560, doi:10.1071/BT00086.
Quinlan, J. R. (1992), Learning with continuous classes, in AI ’92: Proceedings of the 5th Australian Joint Conference on Artificial Intelligence: Hobart, Tasmania, 16–18 November 1992, pp. 343–348, World Sci., Hackensack, N. J.
Rayment, G. E., and F. R. Higginson (1992), Australian Laboratory Handbook of Soil and Water Chemical Methods, Inkata, Melbourne, Vic., Australia.
Reiners, W. A. (1986), Complementary models for ecosystems, Am. Nat., 127, 59–73, doi:10.1086/284467.
Reiners, W. A. (2003), Spatial/temporal variation in the biological control of environmental chemistry, Geol. Soc. Am. Abstr. Programs, 35(6), 270.
Roxburgh, S. H., et al. (2004), A critical review of model estimates of net primary productivity for the Australian continent, Funct. Plant Biol., 31, 1043–1059, doi:10.1071/FP04100.
Roxburgh, S. H., B. G. Mackey, C. Dean, L. Randall, A. Lee, and J. Austin (2006), Organic carbon partitioning in soil and litter in subtropical woodlands and open forests: A case study from the Brigalow Belt, Queensland, Rangeland J., 28, 115–123, doi:10.1071/RJ05015.
Shirato, Y., and M. Yokozawa (2006), Acid hydrolysis to partition plant material into decomposable and resistant fractions for use in the Rothamsted carbon model, Soil Biol. Biochem., 38, 812–816, doi:10.1016/j.soilbio.2005.07.008.
Skjemstad, J. O., L. R. Spouncer, and A. Beech (2000), Carbon conversion factors for historical soil carbon data, National Carbon Accounting System, Tech. Rep. 15, Aust. Greenhouse Off., Canberra, ACT, Australia.
Spain, A. V., and B. R. Hutson (1983), Dynamics and fauna of the litter layer, in Soils: An Australian Viewpoint, pp. 611–628, Div. Soils, CSIRO, Melbourne, Vic., Australia.
Spain, A. V., R. F. Isbell, and M. E. Probert (1983), Soil organic matter, in Soils: An Australian Viewpoint, pp. 551–563, Div. Soils, CSIRO, Melbourne, Vic., Australia.
Spencer, M., T. Whitfort, J. McCullagh, and E. Bui (2006), Dynamic ensemble approach for estimating organic carbon using computational intelligence, in Proceedings of IASTED International Conference on Advances in Computer Science and Technology (ACST 2006), edited by S. Sahni, pp. 186–192, ACTA Press, Calgary, Alb., Canada. (Available at http://www.actapress.com/Content_of_Proceeding.aspx?proceedingid=396)
Tate, K. R., R. H. Wilde, D. J. Giltrap, W. T. Baisden, S. Saggar, N. A. Trustrum, N. A. Scott, and J. P. Barton (2005), Soil organic carbon stocks and flows in New Zealand: System development, measurement and modelling, Can. J. Soil Sci., 85, 481–489.
Thackway, R., and I. D. Cresswell (1995), An Interim Biogeographic Regionalisation for Australia: A Framework for Setting Priorities in the National Reserves System Cooperative Program, version 4.0, Aust. Nat. Conserv. Agency, Canberra, ACT, Australia.
Uysal, I., and H. A. Güvenir (1999), An overview of regression techniques for knowledge discovery, Knowl. Eng. Rev., 14, 319–340, doi:10.1017/S026988899900404X.
Verboom, W. H., and J. S. Pate (2006), Evidence of active biotic influences in pedogenetic processes. Case studies from semiarid ecosystems of south-west Western Australia, Plant Soil, 289(1–2), 103–121, doi:10.1007/s11104-006-9075-6.
Wu, H., Z. Guo, and C. Peng (2003), Distribution and storage of soil organic carbon in China, Global Biogeochem. Cycles, 17(2), 1048, doi:10.1029/2001GB001844.
Wynn, J. G., M. I. Bird, L. Vellen, E. Grand-Clement, J. Carter, and S. L. Berry (2006), Continental-scale measurement of the soil organic carbon pool with climatic, edaphic, and biotic controls, Global Biogeochem. Cycles, 20, GB1007, doi:10.1029/2005GB002576.
Yu, D. S., X. Z. Shi, H. J. Wang, W. X. Sun, E. D. Warner, and Q. H. Liu (2007), National-scale analysis of soil organic carbon storage in China based on Chinese Soil Taxonomy, Pedosphere, 17(1), 11–18, doi:10.1016/S1002-0160(07)60002-2.
Zhang, K., and W. Fan (2008), Forecasting skewed biased stochastic ozone days: Analyses, solutions and beyond, Knowl. Inf. Syst., 14, 299–326, doi:10.1007/s10115-007-0095-1.

E. Bui, Land and Water, CSIRO, GPO Box 1666, Canberra, ACT 2601, Australia.
B. Henderson, Mathematical and Information Sciences, CSIRO, GPO Box 664, Canberra, ACT 2601, Australia.
K. Viergever, Ecometrica, Unit 3B Kittle Yards, Edinburgh EH9 1PJ, UK.
