ICNIRS 2013 - 16th International Conference on Near Infrared Spectroscopy, 2 - 7 June 2013 - 34280 La Grande-Motte
Cubist, a regression rule approach for use in calibration of NIR spectra
Budiman Minasny1, Alex B. McBratney1, Uta Stockmann1, Suk Young Hong2
1 Faculty of Agriculture & Environment, The University of Sydney, NSW 2006, Australia.
2 National Academy of Agricultural Science, Rural Development Administration (RDA), Suwon 441707, Gyeonggi-do, Republic of Korea.
Corresponding author:
[email protected]
Introduction
Near infrared diffuse reflectance spectroscopy has become a new tool for measuring carbon content in the soil, mainly because it requires minimal sample preparation and can potentially be used in the field. Although infrared spectroscopy techniques are easy to use, they generate a large amount of data that requires chemometric processing. In addition, because the absorptions of soil constituents in the infrared spectra are complex and overlapping, it is not possible to predict soil C concentration from absorbance at selected wavelengths or a single region of interest. This has been addressed mainly by data reduction or factorial methods that enable prediction of soil C from the whole spectrum, such as principal component regression and partial least squares regression (PLSR) (Bellon-Maurel et al., 2010).
Another option for spectral calibration is the use of data mining techniques, which allow for nonlinear relationships between the reflectance (or absorbance) and soil C. Techniques such as random forests, boosted regression trees, support vector machines, and model trees have been tested (Viscarra Rossel and Behrens, 2010).
Here we show the use of the software Cubist, a rule-based regression model, to calibrate NIR spectra. Cubist is essentially a decision tree with linear models at the leaf nodes (Holmes et al., 1999). It was introduced as an alternative for handling soil spectral data by Minasny and McBratney (2009). It is an attractive tool because it produces descriptive models that help in understanding the complicated structure and relationships present in the data. Minasny and McBratney (2009) analysed mid-infrared spectra of soil and found that the Cubist model gives high prediction accuracy, is easy to interpret, selects variables automatically, which makes it parsimonious, and respects the upper and lower bounds of the predictand (i.e. the predicted soil C content will not be negative).
The objective of this study is to demonstrate the capability of Cubist, run in an open-source software environment, for predicting soil C content from visible-near infrared spectra, and to compare its predictive power with that of PLSR and bootstrap aggregated (bagged) PLSR.
Materials and Methods
Data
Two datasets were used to demonstrate the prediction of soil C content. The first is an Australian database of soils from locations in New South Wales (NSW), Australia. The second is a Korean database containing samples from throughout South Korea. Both databases contain measurements for different soil horizons, from the soil surface to a depth of approximately 1 m. Soil C content was measured using the dry combustion technique. The NIR spectra for both the Australian and Korean soil samples were collected using the ASD FieldSpec Pro (Analytical Spectral Devices, Boulder, CO). A Spectralon panel was used as the reflectance standard, and reflectance at wavelengths from 350 to 2500 nm was recorded at 1-nm intervals.
Table 1. Statistics of soil C content (in g/100 g) for the two databases.
Dataset     Subset        n    Mean  Std. dev  Median  Min    Max
Australia   Calibration   293  1.09  1.20      0.81    0.09  11.94
Australia   Validation     98  1.40  1.77      0.99    0.06  12.74
Korea       Calibration   538  1.62  2.03      1.62    0.00  14.88
Korea       Validation    179  1.63  1.89      0.93    0.00   9.49
The reflectance measurements were converted to absorbance, log(1/R), and the absorbance spectra were then smoothed using a Savitzky-Golay filter with a window size of 10 nm and a polynomial degree of 2. Only absorbance data from 500 to 2450 nm were used, resampled to a resolution of 10 nm. The spectra were then transformed using the Standard Normal Variate (SNV) to remove the baseline effect.
Prediction methods
Three prediction methods were compared: Cubist, partial least squares regression (PLSR), and bootstrap aggregated PLSR (bagPLSR). All models were run in the R statistical software using the packages Cubist (Kuhn et al., 2013) and pls (Mevik et al., 2007). All code used in this paper can be obtained from the authors.
Cubist is a commercial regression-rules program that extends an earlier program called M5, or model tree (RuleQuest Research; Quinlan, 1993). Recently, C code was released by Quinlan under the GNU General Public Licence and subsequently ported to R, an open-source statistical program, by Kuhn et al. (2013). Cubist first creates rules by splitting the data on its independent variables so as to minimise the within-class variation. It then builds a linear model of the absorbance spectra for each rule, so that the overall model resembles a piecewise linear function. The details of the algorithm are presented in Quinlan (1993) and Holmes et al. (1999) and will not be repeated here.
BagPLSR is a way of 'strengthening' the PLSR prediction by generating multiple PLSR models and averaging their predictions, a so-called ensemble model (Mevik et al., 2005). Bootstrap aggregating, or bagging (Breiman, 1996), manipulates the training data to generate different models. The bootstrap is a general statistical method used to assess the accuracy of a prediction by 'sampling the training data with replacement'.
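Bootstrap aggregation as introduced above can be illustrated with a minimal sketch. It is not the authors' R code: to stay self-contained it uses a one-variable least-squares line as a stand-in for the PLSR fit, and the names `fit_line`, `bagged_fit`, `bagged_predict` and the default of 50 bootstrap models are assumptions made for illustration only.

```python
import random


def fit_line(xs, ys):
    """Least-squares line y = a + b*x (hypothetical stand-in for a PLSR fit)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx if sxx > 0 else 0.0  # degenerate resample: flat line
    intercept = mean_y - slope * mean_x
    return intercept, slope


def bagged_fit(xs, ys, n_boot=50, seed=0):
    """Fit one model per bootstrap resample of the training data."""
    rng = random.Random(seed)
    n = len(xs)
    models = []
    for _ in range(n_boot):
        # Draw n indices with replacement: one bootstrap dataset of size n.
        idx = [rng.randrange(n) for _ in range(n)]
        models.append(fit_line([xs[i] for i in idx], [ys[i] for i in idx]))
    return models


def bagged_predict(models, x):
    """Bagging estimate: the average of the individual model predictions."""
    preds = [a + b * x for a, b in models]
    return sum(preds) / len(preds)
```

With a real calibration set, `fit_line` would be replaced by a multivariate PLSR fit on the preprocessed spectra; the resampling and averaging steps stay the same.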
Assume the training data consist of n pairs of predictors and response. We randomly 'create' B datasets, each of size n, by sampling from the training data with replacement. For each bootstrap dataset, we fit a PLSR model. The bagging estimate is calculated as the average of all the model predictions. Bagging therefore combines the outputs of many 'weak' models to produce a powerful prediction. This is useful for data with high variation, as each bootstrap 'realisation' produces a model that fits a particular subset of the data, which may differ from other realisations. The aggregated predictor averages the predictions over a collection of bootstrap realisations, thereby reducing the variance of the prediction. The accuracy of the prediction is increased when the prediction method is unstable, i.e. when small changes in the training data used in the bootstrap can result in large changes in the model. This method was used by McBratney et al. (2006) for predicting soil properties and quantifying the uncertainty of the prediction. The improvement over a single PLSR might be small, but bagPLSR was found to be more robust against noise in the spectra (Mevik et al., 2005).
Accuracy assessment
Both datasets were split randomly into 75% for calibration and 25% for validation. The accuracy of the predictions was assessed using the root mean squared error (RMSE), bias, and the ratio of performance to interquartile range (RPIQ) (Bellon-Maurel et al., 2011). In addition, we calculated the concordance correlation coefficient (CC), which measures the agreement between measured and predicted values, i.e. how closely the model predictions fall along a 45-degree line through the origin when plotted against the measured data (Lin, 1989).
Results and Discussion
For Korean soils (Table 2), Cubist consistently performs better than PLSR and bagPLSR (lowest RMSE and highest RPIQ). The Korean database contains more measurements of soil C, and the data are highly variable as they include several regions of the country with widely varying soil C contents. The Australian data, which came from a single region (NSW) of the country, have a lower variance of soil C values (see Table 1). The Cubist model is able to capture low as well as high concentration values (Figure 1).
It is evident from Figure 1 that the Korean data are skewed and that PLSR cannot fit such data very well. Despite the skewness of the data (skewness = 2.9), Cubist predicts the values reasonably well because it stratifies the prediction into 4 rules based on absorbance at wavelengths 1410 and 2210 nm (which are known to be related to OH bonds and organic matter). Table 3 shows that each rule corresponds to a different range of soil C content, from small to large values. A regression is thus formed for each rule, which allows a better prediction over the whole range of C values without the need for any data transformation.
Table 2. Accuracy of the predictions of PLSR, bagPLSR, and Cubist for soil C content in Korean soils. RMSE is the root mean squared error (in g/100 g), RPIQ is the ratio of performance to interquartile range, and CC is the concordance correlation coefficient.
Model     Subset        n    RMSE  bias   RPIQ  CC
PLSR      Calibration   538  1.17   0.00  0.46  0.80
PLSR      Validation    179  1.17   0.04  0.40  0.77
bagPLSR   Calibration   538  1.16  -0.01  0.46  0.80
bagPLSR   Validation    179  1.20   0.04  0.39  0.76
Cubist    Calibration   538  0.63  -0.01  0.86  0.95
Cubist    Validation    179  0.82  -0.13  0.57  0.89
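The accuracy measures reported in Table 2 can be computed along the lines of the following sketch. The paper does not give the formulas, so several choices here are assumptions: bias is taken as mean(predicted - measured), RPIQ as the interquartile range of the measured values divided by RMSE, quartiles use the 'inclusive' interpolation of Python's `statistics.quantiles`, and the concordance coefficient uses 1/n variances as in Lin's definition. The function names are hypothetical.

```python
import math
import statistics


def rmse(obs, pred):
    """Root mean squared error between measured and predicted values."""
    return math.sqrt(sum((p - o) ** 2 for o, p in zip(obs, pred)) / len(obs))


def bias(obs, pred):
    """Mean error; the sign convention (predicted - measured) is an assumption."""
    return sum(p - o for o, p in zip(obs, pred)) / len(obs)


def rpiq(obs, pred):
    """Interquartile range of the measured values divided by RMSE."""
    q1, _, q3 = statistics.quantiles(obs, n=4, method="inclusive")
    return (q3 - q1) / rmse(obs, pred)


def concordance(obs, pred):
    """Lin's concordance correlation coefficient (1/n variances)."""
    n = len(obs)
    mean_o = sum(obs) / n
    mean_p = sum(pred) / n
    var_o = sum((o - mean_o) ** 2 for o in obs) / n
    var_p = sum((p - mean_p) ** 2 for p in pred) / n
    cov = sum((o - mean_o) * (p - mean_p) for o, p in zip(obs, pred)) / n
    return 2 * cov / (var_o + var_p + (mean_o - mean_p) ** 2)
```

Unlike the ordinary correlation coefficient, the concordance coefficient is penalised by a constant offset between predicted and measured values, which is why it is reported alongside the bias.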
Figure 1. A comparison of observed and NIR predicted soil C content for Korean soil based on the validation dataset.
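How a rule-based model of the kind described in the text turns conditions on absorbance into a prediction can be sketched as follows. This is an illustration, not the Cubist implementation: the thresholds, intercepts, and coefficients are hypothetical and are not taken from Table 3, and averaging over all matching rules and clipping to the calibration range are simplifying assumptions about how the rules are combined.

```python
def covers(rule, sample):
    """True when the sample satisfies every condition of the rule."""
    for variable, op, threshold in rule["conditions"]:
        value = sample[variable]
        if op == ">" and not value > threshold:
            return False
        if op == "<=" and not value <= threshold:
            return False
    return True


def linear_model(rule, sample):
    """Evaluate the rule's own linear model on the sample."""
    return rule["intercept"] + sum(
        coef * sample[variable] for variable, coef in rule["coefs"].items()
    )


def predict(rules, sample, lower, upper):
    """Average the models of all matching rules, then clip to [lower, upper]."""
    preds = [linear_model(r, sample) for r in rules if covers(r, sample)]
    if not preds:
        raise ValueError("no rule covers this sample")
    value = sum(preds) / len(preds)
    return min(max(value, lower), upper)


# Hypothetical two-rule model on SNV-transformed absorbance (values invented).
RULES = [
    {"conditions": [("A1410", ">", 0.0)],
     "intercept": 0.5, "coefs": {"A2210": 0.2}},
    {"conditions": [("A1410", "<=", 0.0)],
     "intercept": 3.0, "coefs": {"A1410": -1.5}},
]
```

For example, `predict(RULES, {"A1410": 0.5, "A2210": 1.0}, 0.0, 4.0)` uses only the first rule, while a sample with `A1410 <= 0` is handled by the second rule and its linear model; the clipping step mirrors the property that predicted soil C cannot fall outside the bounds of the calibration data.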
Table 3. Rules, conditions, mean soil C content, and RMSE of the prediction for a Cubist model for Korean soils. Axxxx refers to absorbance at wavelength xxxx.
Rules    Conditions
Rule 1
A1410 > -0.282 A2210 > -0.482 A1410 -0.482 A550 > 3.254 A1410 -0.482 A2210