
Comparison of linear and mixed-effect regression models and a k-nearest neighbour approach for estimation of single-tree biomass

Lutz Fehrmann, Aleksi Lehtonen, Christoph Kleinn, and Erkki Tomppo

Abstract: Allometric biomass models for individual trees are typically specific to site conditions and species. They are often based on a small number of easily measured independent variables, such as diameter at breast height and tree height. The prevalence of small data sets and few study sites limits their application domain. One challenge in the context of the current climate change discussion is to find more general approaches for reliable biomass estimation. Nonparametric approaches can therefore be seen as an alternative to commonly used regression models. In this pilot study, we compare a nonparametric instance-based k-nearest neighbour (k-NN) approach for estimating single-tree biomass with predictions from linear mixed-effect regression models and subsidiary linear models, using data sets of Norway spruce (Picea abies (L.) Karst.) and Scots pine (Pinus sylvestris L.) from the National Forest Inventory of Finland. For all trees, the predictor variables diameter at breast height and tree height are known. The data sets were split randomly into a modelling and a test subset for each species. The test subsets were used neither for the estimation of regression coefficients nor as training data for the k-NN imputation. The relative root mean square errors of linear mixed models and k-NN estimations are slightly lower than those of an ordinary least squares regression model. Relative prediction errors of the k-NN approach are 16.4% for spruce and 14.5% for pine. Errors of the linear mixed models are 17.4% for spruce and 15.0% for pine. Our results show that nonparametric methods are suitable in the context of single-tree biomass estimation.

Received 22 December 2006. Accepted 19 June 2007. Published on the NRC Research Press Web site at cjfr.nrc.ca on 23 January 2008.

L. Fehrmann¹ and C. Kleinn. Institute of Forest Management, Georg-August-Universität Göttingen, Büsgenweg 5, 37077 Göttingen, Germany.
A. Lehtonen. Finnish Forest Research Institute, P.O. Box 18, FIN-01301 Vantaa, Finland.
E. Tomppo. Finnish Forest Research Institute, Unioninkatu 40 A, Helsinki, FIN-00170, Finland.
¹Corresponding author (e-mail: [email protected]).

Can. J. For. Res. 38: 1–9 (2008)
doi:10.1139/X07-119
© 2008 NRC Canada

Introduction

Estimation of forest biomass and the related carbon sequestration is an important topic, not only in the context of the legally accepted framework of the Kyoto protocol, but also for the management of fuelwood production for bioenergy (Joosten et al. 2003; Wirth et al. 2003; Lehtonen et al. 2004a; Rosenbaum et al. 2004; Tremblay et al. 2006). The standard methodology of single-tree biomass estimation by fitting parametric regression models is frequently based on relatively small data sets. Numerous models have been derived, most of which are expressed as allometric functions of easily observable variables, such as diameter at breast height (DBH) and tree height. Biomass has usually been determined by destructive sampling of trees and subsampling of biomass components within a tree (Korhonen and Maltamo 1990). Typically, these models are specific to the tree species and site conditions of the particular study. Because growing conditions and the particular stand history influence parameter estimates that are based on chronosequences, the derived models are of limited suitability for estimations on a larger scale (Montagu et al. 2005). Attempts to derive generally applicable model formulations by meta-analyses (e.g., Jenkins et al. 2003; Zianis and Mencuccini 2004; Chave et al. 2005) are often constrained by the absence of raw data.

Future research in single-tree biomass estimation should focus on finding the factors behind local variations of allometric coefficients, which are still not completely explained. A compilation of single data sets is therefore useful, because the high correlation of potential explanatory variables on spatially limited study sites limits further research on single-variable effects and their interactions (Fehrmann and Kleinn 2006).

Once a larger single-tree database is available, nonparametric approaches, such as the k-nearest neighbour (k-NN) method, are an alternative to regression models. In contrast to mixed models and other regression techniques that need a predefined functional form, this approach does not require an a priori formulation of a statistical model that describes the relationship between the target variable and the predictors. As a result, factors affecting individual tree biomass can be included without detailed knowledge of their influence or interactions. The k-NN technique has been successfully applied for tree and stand variable analysis (e.g., Haara et al. 1997; Maltamo and Kangas 1998; Tommola et al. 1999; Maltamo et al. 2003).
The k-NN imputation of tree-level attributes has been demonstrated, for example, by Sironen et al. (2001, 2003), Korhonen and Kangas (1997), Malinen (2003a), and Malinen et al. (2003). Furthermore, the k-NN technique has shown promise in small area estimation and in pixel-level prediction using satellite image data and forest inventory data in large-area forest inventories (Tomppo 1991; Moeur and Stage 1995; Holmström et al. 2001; McRoberts et al. 2002; Temesgen 2003; Tomppo and Halme 2004).

The goal of this pilot study is to investigate the performance of approaches for single-tree biomass prediction, namely the k-NN imputation and linear mixed regression models as well as subsidiary linear models, based on observations of DBH, tree height, and aboveground biomass.

Material and methods

Data

Biomass data from the National Forest Inventory (NFI) of Finland were used to test different biomass estimation techniques. The data originate from the Vapu database (National Tree Research). Trees were cut between 1988 and 1990 from the sample plots of NFI-8, excluding private lands. The three trees closest to the plot centre were selected; they were predominantly from the dominant canopy layer. In the case of mixed stands, up to three additional trees located closest to the plot centre were cut. The data set consists of 203 Norway spruce (Picea abies (L.) Karst.) and 205 Scots pine (Pinus sylvestris L.) trees. Korhonen and Maltamo (1990) give more detail on the data.

The trees were felled and dimensions, density, and increment were measured. Diameters over bark and under bark were measured at 16–18 heights. For wood density (oven-dried matter per unit volume), six samples per tree were taken (relative heights of 5%, 20%, 40%, 60%, and 80%). Bark thickness, wood density, and biomass were measured at seven to nine relative heights depending on tree size. All branches were numbered, and approximately 10 branches per tree were sampled systematically (with a random starting point). Each sample branch was subdivided into two parts: one with and one without needles. Both were weighed separately for fresh mass determination. Every second, fifth, and eighth sample branch was taken to the laboratory for dry matter analysis. Measures of size and location on the tree (diameter and distance from the top) were taken for additional branches.

Upscaling measurements

Stem and bark biomasses were predicted from volume (determined from the stem dimensions) and wood density. Cubic splines were fitted to stem diameter over and under bark for each tree. Each fit was analysed visually to obtain logical taper curves and to define an appropriate smoothness of the fit. The variation of wood and bark density as a function of tree height was also modelled for each tree by splines with PROC GAM of SAS Institute Inc. (2005). Wood biomass was obtained by integrating the product of stem diameter and wood density over tree height. Bark biomass was estimated as the product of bark density and bark volume, where the latter was calculated as the difference between the taper models over and under bark. Foliage mass for each tree was obtained by measuring 3 of the 10 sampled branches.
A linear mixed model (LMM; McCulloch and Searle 2000) for the proportion of foliage biomass in total branch biomass was developed. The arcsine square root transformation of the proportion of foliage biomass in a branch was modelled as a function of branch diameter, its relative height in the crown, and tree DBH. After multiplying the predicted proportion by total branch mass, an additional LMM was estimated for foliage biomass with branch diameter and relative height as predictor variables (Lehtonen 2005). Branch biomass of trees was predicted by an LMM with branch diameter as predictor (approximately 2000 branches per species). Foliage biomass was subtracted from total branch biomass before modelling. Sample plot based ratios of fresh biomass to dry biomass were used to avoid variation caused by the weather (each plot was measured within 1 day) (Lehtonen et al. 2004b; Muukkonen and Lehtonen 2004). The estimates of aboveground biomass were obtained for each tree by summing the predicted biomasses of branches, foliage, bark, and stem.

The k-NN method

The k-NN imputation is a nonparametric approach and one of the oldest and simplest learning techniques based on pattern recognition and classification of unknown objects (Fix and Hodges 1951; Altman 1992; Mitchell 1997). A missing attribute value of a target variable for a population unit (instance) is imputed as a local approximation from observed values of the target variable in a subset of "nearest neighbours" from a set of training data as

$$\hat{f}(x_q) = \frac{\sum_{i=1}^{k} w_i \, f(x_i)}{\sum_{i=1}^{k} w_i} \tag{1}$$

where $\hat{f}(x_q)$ is the predicted value for the unknown target value of a query instance $x_q$, $f(x_i)$ is the observed target value of training instance $x_i$, $w_i$ is the weighting factor of the $i$th neighbour, and $k$ is the number of nearest neighbours used for prediction. Following Maltamo and Kangas (1998) and Sironen et al. (2003), we derived the distance-dependent weights as

$$w_i = \frac{\left( \dfrac{1}{d_{q,i}} \right)^{t}}{\sum_{j=1}^{k} \left( \dfrac{1}{d_{q,j}} \right)^{t}} \tag{2}$$

where $d_{q,i}$ is the (weighted) distance between a query point $x_q$ and the neighbour $x_i$, and $t$ is a weighting parameter that influences the rate of decrease in $w$ per unit increase in distance. Because the influence of training instances decreases with increasing distance (dissimilarity of instances), theoretically all training data can be included in the estimation process with this approach. In this sense, nearest neighbours are training instances that are identified as most similar according to a distance metric in an m-dimensional feature space (space of the explanatory variables) and that are used for prediction under the assumption that these instances are also similar concerning their target variable values.

Standard measures of proximity known from discriminant and cluster analysis can be used (Jobson 1994). We applied the following modified Minkowski distance metric:

$$d_{i,j} = \left[ \sum_{r=1}^{n} w_r \left( \frac{|x_{ir} - x_{jr}|}{\delta_r} \right)^{c} \right]^{1/c} \tag{3}$$

where $d_{i,j}$ is the weighted distance between two instances $i$ and $j$, $x_{ir}$ and $x_{jr}$ are the values of the $r$th variable for the respective instance, $w_r$ is the weighting factor for variable $r$, $n$ is the number of variables in the analysis, $\delta_r$ is the standardization factor, and $c$ is the Minkowski constant ($c = 1$, Manhattan distance; $c = 2$, Euclidean distance). The standardization factor $\delta_r$ can be defined as a function of the range of the variable. In our study, we set $\delta_r = 2\sigma_r$, with $\sigma_r$ being the standard deviation of the respective variable. Although standardization in the multivariate case is a matter of eliminating the distorting influences of differently scaled feature spaces, the weights are an expression of the unequal relevance of the variables for the target variable (Bellmann 1961; Wettschereck and Aha 1995). In the case of linear relationships, feature weights of different variables can be derived on the basis of their correlations with the target value (Tomppo et al. 1999) or according to the relationship between coefficients of a first-order regression model (Holmström et al. 2001). In our study, feature weights were derived by minimizing the relative root mean square error (rRMSE) by means of a leave-one-out cross validation of the training data in an iterative process. In this procedure, every tree is excluded in turn from the training database of observed trees, and a prediction for this particular individual is derived using the remaining trees.

Instance-based methods confront the analyst with a bias–variance dilemma. Although increasing k may reduce the variance, it simultaneously increases the bias because of an asymmetric neighbourhood at the extremes of the distribution of observations (Altman 1990; Katila and Tomppo 2001; McRoberts et al. 2002; Finley et al. 2006). This effect becomes worse as k increases and, therefore, should be seen as the downside of increasing the number of neighbours. As an alternative to a fixed k, neighbours can be considered up to a fixed maximum distance from a certain query point (Atkeson et al. 1997; Malinen 2003b). In this case, k can change depending on the distribution of nearby observations in the feature space. To apply this kernel approach, we standardized the distances of all database trees to a particular query point to a fixed interval between zero and one (where zero is most similar and one is most dissimilar). This procedure allows a relative maximum distance to be defined, up to which the resulting variable number of neighbours is considered for predictions. To find an appropriate bandwidth and (or) number of neighbours, we used the rRMSE calculated by a leave-one-out cross validation:

$$\mathrm{rRMSE} = \frac{\sqrt{\dfrac{1}{N} \sum_{i=1}^{N} \left( x_i - \hat{x}_i \right)^2}}{\bar{\hat{x}}} \times 100\% \tag{4}$$
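The prediction rule of eqs. 1 and 2, the distance of eq. 3, and the leave-one-out rRMSE of eq. 4 can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the implementation used in the study: the function names, the toy data, the use of the sample standard deviation for the standardization factor, and the handling of zero distances are all introduced here for illustration.

```python
import math
import statistics


def scale_factors(train_x):
    """Standardization factors (twice the standard deviation), one per predictor."""
    # Sample standard deviation is an assumption; the text does not name the estimator.
    return [2.0 * statistics.stdev(column) for column in zip(*train_x)]


def minkowski_distance(a, b, feature_weights, deltas, c=2.0):
    """Eq. 3: weighted, standardized Minkowski distance between instances a and b."""
    total = sum(
        w * (abs(x - y) / delta) ** c
        for x, y, w, delta in zip(a, b, feature_weights, deltas)
    )
    return total ** (1.0 / c)


def knn_predict(query, train_x, train_y, k, t, feature_weights, deltas, c=2.0):
    """Eqs. 1 and 2: inverse-distance-weighted mean of the k nearest neighbours."""
    neighbours = sorted(
        ((minkowski_distance(query, x, feature_weights, deltas, c), y)
         for x, y in zip(train_x, train_y)),
        key=lambda pair: pair[0],
    )[:k]
    # Exact matches have zero distance; averaging them avoids a division by zero.
    # This tie rule is an assumption and is not described in the text.
    exact = [y for d, y in neighbours if d == 0.0]
    if exact:
        return sum(exact) / len(exact)
    raw = [(1.0 / d) ** t for d, _ in neighbours]
    # The normalizing sum of eq. 2 cancels in eq. 1, so raw weights suffice here.
    return sum(w * y for w, (_, y) in zip(raw, neighbours)) / sum(raw)


def loo_rrmse(train_x, train_y, k, t, feature_weights, deltas, c=2.0):
    """Eq. 4 evaluated by leave-one-out cross validation on the training data."""
    predictions = []
    for i, query in enumerate(train_x):
        rest_x = train_x[:i] + train_x[i + 1:]
        rest_y = train_y[:i] + train_y[i + 1:]
        predictions.append(
            knn_predict(query, rest_x, rest_y, k, t, feature_weights, deltas, c)
        )
    n = len(predictions)
    rmse = math.sqrt(sum((y - p) ** 2 for y, p in zip(train_y, predictions)) / n)
    return 100.0 * rmse / (sum(predictions) / n)


# Invented toy data: (DBH in cm, height in m) and aboveground biomass in kg.
trees = [[10.0, 9.0], [14.0, 12.0], [18.0, 15.0],
         [22.0, 18.0], [26.0, 20.0], [30.0, 22.0]]
biomass = [25.0, 60.0, 115.0, 190.0, 280.0, 390.0]
deltas = scale_factors(trees)

# Choose k by leave-one-out rRMSE; equal feature weights and t = 2 are arbitrary choices.
best_k = min(
    range(1, 5),
    key=lambda k: loo_rrmse(trees, biomass, k, 2.0, [1.0, 1.0], deltas),
)
print(best_k, knn_predict([20.0, 16.0], trees, biomass, best_k, 2.0, [1.0, 1.0], deltas))
```

The same leave-one-out loop could be wrapped in a search over the feature weights or over a maximum distance instead of a fixed k; the sketch only varies k to keep the example short.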

In eq. 4, $x_i$ is the observed value for instance $i$, $\hat{x}_i$ is the respective predicted value, $N$ is the number of observations, and $\bar{\hat{x}}$ is the mean of the estimates.

Linear mixed-effect models

The reference biomass of the test subsets was estimated by LMMs. Using mixed models is reasonable here because of the sampling design employed for the Vapu data set: one to six trees were sampled on each plot and constitute a subsample of the plot trees. It is likely that their observations are correlated (McCulloch and Searle 2000; Lappi et al. 2006). Aboveground biomass (AGB) was estimated separately for Scots pine and Norway spruce trees. DBH and height were used as predictors, and models were selected by comparing the log-likelihood estimates of the models and the mean square errors of the predictions. For Scots pine, AGB was predicted by using DBH and tree height:

$$\ln(\mathrm{AGB}_{ki}) = \ln(\beta_0) + \ln(a_k) + \beta_1 \ln(\mathrm{DBH}_{ki}) + \beta_2 \ln(h_{ki}) + e_{ki}$$

Fig. 1. Diameter distributions of the randomly selected test and modelling subsets. [Panels: Norway spruce and Scots pine; x-axis: DBH class (cm); y-axis: No. of trees; bars: modelling, test.]

For Norway spruce, DBH and slenderness (an interaction term between tree height and DBH) were used:

$$\ln(\mathrm{AGB}_{ki}) = \ln(\beta_0) + \ln(a_k) + \beta_1 \ln(\mathrm{DBH}_{ki}) + \beta_2 \frac{h_{ki}}{\mathrm{DBH}_{ki}} + e_{ki}$$

where $\mathrm{AGB}_{ki}$ is the aboveground biomass, $\mathrm{DBH}_{ki}$ is the diameter at breast height, $h_{ki}$ is the tree height, and $e_{ki}$ is the residual of tree $i$ on plot $k$; $a_k$ is a random plot effect; and $\beta_0$, $\beta_1$, and $\beta_2$ are the fixed-effect parameters. Predictions of $a_k$ were estimated for each plot (with an expected value of zero) to obtain plot-specific aboveground biomass models, which were used for biomass prediction. Regression was computed on the logarithmic scale and then back transformed. Following Sprugel (1983), a correction was applied to reduce the bias introduced by the transformation by adding one-half of the variance to the intercept.

The plotwise estimation of the dummy regressors $a_k$ makes implicit use of ancillary spatial information (namely, the affiliation of a tree to a certain plot location) that is not used in the k-NN approach. For better comparability, we also estimated parameters for simple linear models without an additional random effect, which would be appropriate in the case of missing prior information about the spatial dependency of sample trees.

Comparison of results

To evaluate the k-NN approach in comparison with mixed-effect and simple linear regression models, we divided the data randomly into "modelling" (n = 143 for spruce and n = 145 for pine) and "test" subsets (n = 60 for both species). Only the larger subsets were used to estimate regression coefficients for the given model formulations and (or) as training data for the k-NN algorithm. Figure 1 shows the diameter distributions of the test subsets and the modelling data sets. For all approaches, we used DBH and tree height as predictor variables.

In our comparison, single-tree biomass is referred to as the "observed value," even though it is not a measurement but a modelling result from a number of predictor variables and, therefore, carries both measurement and model errors. To evaluate the prediction errors, we used a mix of quadratic and relative goodness-of-fit criteria (Weber 1998). The RMSE, rRMSE, mean absolute percentage error (MAPE), and mean error (ME) of the aboveground biomass for all 60 trees per species were calculated for all approaches:

$$\mathrm{ME} = \frac{1}{N} \sum_{i=1}^{N} \left( x_i - \hat{x}_i \right) \tag{5}$$

$$\mathrm{RMSE} = \sqrt{\frac{1}{N} \sum_{i=1}^{N} \left( x_i - \hat{x}_i \right)^2} = \sqrt{\mathrm{MSE}} \tag{6}$$

$$\mathrm{MAPE} = \frac{1}{N} \sum_{i=1}^{N} \left| \frac{x_i - \hat{x}_i}{x_i} \right| \times 100\% \tag{7}$$

where $N$ is the number of database trees, $x_i$ is the observed value for the $i$th tree, and $\hat{x}_i$ is the respective predicted value. We used the MAPE as an additional error criterion, because information about the relative error is important for evaluating the predictions across the changing dimensions of trees.
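The error criteria of eqs. 5–7 translate directly into code. The following Python sketch is only an illustration with invented values; the function names are hypothetical and do not come from the study.

```python
import math


def mean_error(observed, predicted):
    """Eq. 5: mean of (observed - predicted); positive values indicate underestimation."""
    return sum(o - p for o, p in zip(observed, predicted)) / len(observed)


def rmse(observed, predicted):
    """Eq. 6: root mean square error, in the units of the target variable."""
    squared = sum((o - p) ** 2 for o, p in zip(observed, predicted))
    return math.sqrt(squared / len(observed))


def mape(observed, predicted):
    """Eq. 7: mean absolute percentage error, relative to each observed value."""
    relative = sum(abs((o - p) / o) for o, p in zip(observed, predicted))
    return 100.0 * relative / len(observed)


# Invented biomass values in kg for two trees.
observed = [100.0, 200.0]
predicted = [110.0, 180.0]
print(mean_error(observed, predicted), rmse(observed, predicted), mape(observed, predicted))
```

Because the MAPE divides each residual by the observed value of the same tree, an error of 10 kg counts as heavily on a small tree as an error of 20 kg on a tree twice its mass, whereas the RMSE is dominated by the larger absolute error.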


Table 1. Parameter estimates and their SEs for linear mixed models, including the covariance parameter estimates for residuals and plot factor, and for the coefficients of linear models.

Scots pine Statistics



Norway spruce 









Linear mixed model Estimate –2.36 SE 0.054 p