SPECIAL FEATURE: UNCERTAINTY ANALYSIS
A parsimonious approach for modeling uncertainty within complex nonlinear relationships
BOGDAN M. STRIMBU,1 ALEXANDRU AMARIOAREI,2,3 AND MIHAELA PAUN2,4
1 College of Forestry, Oregon State University, Corvallis, Oregon 97333 USA
2 Bioinformatics Department, National Institute of Research and Development for Biological Sciences, Splaiul Independenței 296, Bucharest 060031 Romania
3 Faculty of Mathematics and Computer Science, University of Bucharest, Str. Academiei nr. 14, Bucharest 010014 Romania
4 Faculty of Administration and Business, University of Bucharest, Bdul. Regina Elisabeta, nr. 4 - 12, Bucharest 030018 Romania
Citation: Strimbu, B. M., A. Amarioarei, and M. Paun. 2017. A parsimonious approach for modeling uncertainty within complex nonlinear relationships. Ecosphere 8(9):e01945. 10.1002/ecs2.1945
Abstract. Advancements in information technology led environmental scientists to the illusion that efforts should be mainly focused on developing models that reduce uncertainty rather than on models adjusted to the existing uncertainty. As a result, environmental relationships are represented by non-parsimonious and suboptimal models, which in many instances could even be wrong. The objective of this research was to provide scientists focused on modeling ecosystem processes with a procedure that supplies parsimonious correct results. The procedure transforms the response variable to achieve a linear model and the normality of the residuals. After the parameters of the transformed model are estimated, the bias induced by back-transforming is corrected. We have computed the bias corrections for 11 of the most popular functions from the power, trigonometric, and hyperbolic families by considering the truncated normal distribution, when necessary. Using generated data, we have shown that the proposed procedure supplies unbiased results. We have identified a sample size artifact of data generation such that when the variance increases the truncation of the distribution starts altering the corrections of predicted values, sometimes by more than 50% from the actual values. Our results indicate that uncertainty, measured by variance, impacts the analysis in a non-intuitive way when the defining domain of the response variable is restricted. The subtle way in which uncertainty influences the development of complex nonlinear models advocates the use of parsimonious linear models, which are less sensitive to the method of processing data. Finally, ecosystem processes should be modeled with strategies that consider not only processes and computation aspects, but also uncertainty, in particular reducing variance to levels with no significant impact on the results.
Key words: bias correction; linearization; sample size artifact; Special Feature: Uncertainty Analysis; truncated distribution.
Received 3 January 2017; revised 7 June 2017; accepted 28 July 2017. Corresponding Editor: Jeffrey Taylor.
Copyright: © 2017 Strimbu et al. This is an open access article under the terms of the Creative Commons Attribution License, which permits use, distribution and reproduction in any medium, provided the original work is properly cited.
INTRODUCTION
The spectacular development in information technology led environmental scientists to the illusion that computational challenges of data analysis were overcome; hence, onward efforts should be placed primarily on developing models that reduce uncertainty rather than on models adjusted to existing uncertainty. Therefore, traditional approaches of modeling nonlinear relationships by transforming the response variable seemed obsolete (Gregoire et al. 2008, Warton and Hui 2011). The main issue of altering the predicted variable is that biased results are obtained when back-transformed; hence, they have to be corrected (Bartlett 1936, Finney 1941, Neyman and Scott 1960). The impact of transformations on models was acknowledged very early in ecological modeling (Meyer 1953), but bias corrections were not widely used (Clutter et al. 1983). To provide a procedure for correcting the bias produced by changing the response, Neyman and Scott (1960) established a framework for all transformations, with the assumption that the residuals are normally distributed. Development of generalized linear models (GLM) by Nelder and Wedderburn (1972) marked a sharp change in modeling complex relationships, as the transformation of the response variable did not seem to be required anymore. Strong arguments were provided to support departure from modeling approaches that change the response variable (Warton and Hui 2011), chief among them being that transformations impact simultaneously the predictor and the distribution of the random component (Gregoire et al. 2008). Generalized linear models not only allow the separation of the first-order moment (i.e., mean) from the second-order centered moment (i.e., variance), but also permit specification of the random error (Gregoire et al. 2008). While these properties make GLM attractive for modeling, there are several situations when transformation of the response variable could be preferred, particularly when the predictor variables are combined in complex formulas. Additionally, to separate the first-order moment from the second-order moment GLM introduces a link function, which decreases the parsimony of the model. Furthermore, while GLM is successful for relatively simple relationships, it fails to solve complex problems that combine nonlinear functions in a nonlinear manner, because it lacks the ability to estimate all parameters analytically.

Therefore, to maintain the response unchanged, a vast array of procedures that estimate all parameters of nonlinear models were developed (Pierre 1986, Cormen et al. 2010). Most procedures computing the parameters of the nonlinear models are based on heuristic algorithms, which have the attractive property of supplying solutions in a relatively short amount of time. However, heuristic techniques yield suboptimal solutions, which can lead to severe departure from the actual solution, particularly for problems represented and formalized with discrete data (Hoffman 1998). To obtain results close to optimality, selection of appropriate parameters needed by various heuristic techniques is required (Murray and Church 1995, Bettinger et al. 2002, 2009, Pukkala and Kurttila 2005, Pukkala and Heinonen 2006). Some of the parameters used by heuristic algorithms can be estimated from a simplified model obtained by changing the response variable. Seppelt and Richter (2005) proved empirically that algorithms, as well as their implementation, affect the solution, suggesting that nonlinear problems, particularly the complex ones, should be investigated with procedures that do not rely on multiple parameters and assumptions; in other words, parsimony is preferred, all else being equal.

To illustrate the sensitivity of nonlinear models to algorithms and prior knowledge, as well as the effectiveness of transforming the predicted variable, a simple example is presented. The idea is to generate data using a known nonlinear model, then fit a model describing the generated data, and finally compare the two models, the one that generated the data and the one derived from the data. In the event that the fitted model does not match the data generating model, either the algorithm estimating the parameters does not perform well or the solution depends on the interaction between nonlinearity and uncertainty. Either way, the investigator reaches an erroneous conclusion and alternative strategies should be considered, such as the transformation of the response variable. The model under consideration predicts the response variable Y with a complex nonlinear function h that combines several predictor variables, X:

$$ Y = h(X) + \varepsilon \tag{1} $$
where ε are residuals. Currently, the advocated approach for estimating the coefficients of the function h is based on heuristic algorithms, such as Marquardt or Newton methods (Pierre 1986). Alternatively, one can alter Eq. 1 by transforming Y using a function, g (preferably continuously differentiable), such that Y′ = g(Y). The transformation of the response variable is executed such that it will lead to a linear combination of either vectors from X or changed vectors from X:

$$ g(Y) = h'(X) + \varepsilon \tag{2} $$

where h′ is a vector of functions. If h′ is the identity function, then Eq. 2 can be solved using GLM, as it linearizes to

$$ g(Y) = Xb + \varepsilon \tag{3} $$

where b is the vector of coefficients. After computing the coefficients from Eq. 2 or 3, back-transform Y′ to Y = g⁻¹(Y′) and correct for bias. The underlying idea of the transformation is to facilitate identification of the relationship between Y and X. Let us assume that the actual model for Y is

$$ y = \arcsin\left(1 - 0.1\, e^{0.0001x^2 + 0.001x}\right) + \varepsilon. \tag{4} $$

A possible representation of Y can be obtained by generating 100 consecutive values of x, from 1 to 100 in steps of 1 (one can think of Y as decay through time, represented by x). The transformed model y′ = g(y) = sin(y) is

$$ y' = \sin(y) = 1 - 0.1\, e^{0.0001x^2 + 0.001x} + \varepsilon' \tag{5} $$

For simplicity and consistency with the Neyman and Scott (1960) approach, let us assume that ε′ is normally distributed with mean 0 and variance σ². To obtain y′, and consequently y, 100 random generations of ε are executed (Fig. 1) assuming σ = 0.01 (generated data are presented in Data S1). It should be noticed that the generated y does not have a normal distribution, which is most of the time the case for environmental models, particularly when relationships of interest are described with complex nonlinear functions.

Fig. 1. Example data to support the impact of uncertainty in identification of the correct model. Axes: Predictor (horizontal) and Response (vertical); series: arcsin and arcsin + error N(0, 0.01).

To model the generated data, one can guess, based on trial and observations, that Y could be predicted using Eq. 6, which is the same as Eq. 4, except that the coefficients are not known.

$$ y = h(x) = \arcsin\left(b_0 + b_1 e^{b_2 x^2 + b_3 x}\right) + \varepsilon. \tag{6} $$

The NLIN procedure of SAS 9.4 (SAS Institute, Cary, North Carolina, USA) using the Gauss-Newton algorithm (Madsen et al. 2004) solved Eq. 6 and supplied significant values for all coefficients (P < 0.05): b0 = 1.2224, b1 = −0.329, b2 = 0.00006, and b3 = −0.0017. The difference between the actual and computed b0, b1, and b2 is not worrisome, as sampling commonly leads to estimates different from population parameters. However, b3 has the opposite sign to the actual coefficient, which shows that this approach can lead to wrong models. Depending on the generated data, other coefficients could have the wrong sign (e.g., b2). An alternative approach to identify the model is by transforming Y: Y′ = g(Y) = sin(Y). The proposed transformed model is consequently:

$$ y' = \sin(h(x)) = b_0 + b_1 e^{b_2 x^2 + b_3 x} + \varepsilon'. \tag{7} $$

If Eq. 7 is used to estimate the parameters, the same NLIN procedure leads to parameters identical with the ones from Eq. 4. The back-transformed values are biased because of the error term; however, the model is correct. This result matches the intuition built on linear models, as nonlinear algorithms can supply accurate models. Therefore, transformation of the predicted variable that linearizes the model or significantly reduces the complexity of the model (i.e., increases parsimony) could lead to accurate solutions. In the event that the investigation is constrained to algorithms that require initial solutions to solve nonlinear problems, biased results could supply starting values close to actual values. Therefore, the objective of the paper was to provide the researchers and practitioners that model complex nonlinear ecosystems with a procedure that is unbiased and more parsimonious than current approaches.
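The data-generating step of the example built on Eqs. 4 and 5 can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the code used to produce Data S1: the seed, the function names, and the clipping guard are hypothetical choices, and only the standard library is used.

```python
import math
import random

def true_curve(x):
    """Noise-free transformed response of Eq. 5: 1 - 0.1*exp(0.0001*x^2 + 0.001*x)."""
    return 1.0 - 0.1 * math.exp(0.0001 * x**2 + 0.001 * x)

def generate(sigma=0.01, seed=1):
    """Return (x, y_prime, y) for x = 1..100, where y = arcsin(y_prime)."""
    rng = random.Random(seed)  # arbitrary seed, used only for repeatability
    xs = list(range(1, 101))
    # Eq. 5: the error is additive and normal on the transformed (sine) scale
    y_prime = [true_curve(x) + rng.gauss(0.0, sigma) for x in xs]
    # Assumption: guard the arcsin domain [-1, 1]; at sigma = 0.01 the curve stays
    # many standard deviations below 1, so the guard is effectively inactive
    y_prime = [max(-1.0, min(1.0, v)) for v in y_prime]
    ys = [math.asin(v) for v in y_prime]
    return xs, y_prime, ys

xs, y_prime, ys = generate()
print(min(ys), max(ys))
```

Because the error is normal only on the sine scale, the back-transformed y inherits a skewed error, which is the situation the bias corrections of the Methods section are meant to handle.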
METHODS
Framework
The complex representation of ecosystem processes can be simplified to a linear relationship (Eq. 2) by applying function g to Eq. 1:

$$ Y' = g \circ h(X) + \varepsilon \tag{8} $$

where g∘h is the composition of g with h. In most situations, g∘h is not a linear function, but g can be selected such that the composition of the two functions can be linearized either without significant effort or by investigating pairwise relationships between Y and individual variables (i.e., vectors from X). Therefore, let us assume that X is a random vector and Y a random variable, such that for a continuously differentiable function g Eq. 8 becomes linear:

$$ g(Y) = Y' = X'b + \varepsilon \tag{9} $$

where b are coefficients, X′ = g∘h(X) is a random vector, and ε ~ N(0, σ²). The goal is to predict Y = g⁻¹(Y′) given x₀. Knowing that Y′|x₀ ~ N(g∘h(x₀ᵀ)b, σ²), which can be rewritten as Y′ ~ N(ξ, σ²) with ξ = g∘h(x₀ᵀ)b, the density of Y can be computed using the change of variable relation (Grimmett and Stirzaker 2002):

$$ f_Y(y) = f_{Y'}(g(y)) \left| \frac{dg(y)}{dy} \right| \mathbf{1}_A(y) \tag{10} $$

where

$$ f_{Y'}(y') = \frac{1}{\sqrt{2\pi}\,\sigma}\, e^{-\frac{1}{2}\left(\frac{y' - \xi}{\sigma}\right)^2} $$

is the density of the normal distribution and A = {y | y = g⁻¹(y′) for some y′ such that f_{Y′}(y′) > 0}. Therefore, given x₀,

$$ E[Y] = \frac{1}{\sqrt{2\pi}\,\sigma} \int_A y\, e^{-\frac{1}{2}\left(\frac{g(y) - \xi}{\sigma}\right)^2} |g'(y)|\, dy \;\overset{t = (g(y) - \xi)/\sigma}{=}\; \frac{1}{\sqrt{2\pi}} \int_B g^{-1}(\xi + \sigma t)\, e^{-t^2/2}\, dt \tag{11} $$

where B = {t | t = (g(y) − ξ)/σ, y ∈ A}. For the functions considered in this research, the expectation from Eq. 11 can be expressed with the help of the exponential integral

$$ I(a, b, n) = \int_{-a}^{b} t^n e^{-t^2/2}\, dt, $$

where a, b are real positive numbers, and n is a natural number, n > 0. The integral can be computed as a series that combines both the probability density function φ(t) and the cumulative distribution function Φ(t) of the standard normal distribution (proof in Appendix S1).

Models
The general framework developed by Neyman and Scott (1960) for bias correction of the back-transformed variable suffers from a major drawback rooted in the assumption that the variables have to be positive. The positivity assumption makes sense practically, but it is violated from an analytical perspective. An obvious case is the square root transformation, for which the normally distributed error term takes negative values but the transformation allows only half of them. Neyman and Scott (1960) identified this issue, but they decided to “ignore [. . .] the ambiguity connected with the fact that [. . .] ξ (i.e., transformed variable) [. . .] must be capable of assuming negative values.” Their decision leads to closed analytical formulas for a large palette of functions, but they are not accurate. Therefore, we adjust the approach of Neyman and Scott to three types of transformations (i.e., power, trigonometric, and hyperbolic) by considering a truncated normal distribution, one for which both practicality and theoretical validity hold. The simultaneous fulfillment of the two conditions (i.e., realism and analytical soundness) leads to complicated formulas, but their implementation is not necessarily difficult. A power function has the general equation g(y) = y^(1/n) (n a natural number), and to ensure realism, we restricted its base to positive values (i.e., y > 0). In this case, the general unbiased back-transformed expectation is (proof in Appendix S1):

$$ E[Y] = \frac{1}{\sqrt{2\pi}} \sum_{k=0}^{n} \binom{n}{k} \xi^{n-k} \sigma^{k} I(a; k) \tag{12} $$

where a = ξ/σ, and the rest of the symbols as before. We considered two transformations often encountered in ecological investigations: the square root and the cubic root (Andersen et al. 2005, Gregoire et al. 2008). The unbiased back-transformed expectation for the square root (i.e., n = 2) is from Eq. 12:

$$ E[Y] = (\xi^2 + \sigma^2)\left[1 - \Phi\left(-\frac{\xi}{\sigma}\right)\right] + \xi\sigma\,\varphi\left(\frac{\xi}{\sigma}\right) \tag{13} $$
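As a quick sanity check of the square-root correction, the closed form of Eq. 13 can be compared with a direct numerical evaluation of the integral in Eq. 11 with g⁻¹(u) = u². The sketch below assumes the integration set B is (−ξ/σ, ∞), approximated by a finite upper cutoff; the function names, the cutoff, and the midpoint rule are illustrative choices, not part of the original study.

```python
import math

def phi(t):
    """Standard normal probability density."""
    return math.exp(-0.5 * t * t) / math.sqrt(2.0 * math.pi)

def Phi(t):
    """Standard normal cumulative distribution."""
    return 0.5 * (1.0 + math.erf(t / math.sqrt(2.0)))

def sqrt_corrected(xi, sigma):
    """Closed form of Eq. 13 for the square-root transformation."""
    a = xi / sigma
    return (xi**2 + sigma**2) * (1.0 - Phi(-a)) + xi * sigma * phi(a)

def sqrt_numeric(xi, sigma, steps=20_000, upper=10.0):
    """Midpoint-rule value of Eq. 11 with g^{-1}(u) = u^2 over t in (-xi/sigma, upper)."""
    lower = -xi / sigma
    h = (upper - lower) / steps
    total = 0.0
    for i in range(steps):
        t = lower + (i + 0.5) * h
        total += (xi + sigma * t) ** 2 * phi(t)
    return total * h

# (xi, sigma) pairs taken from the square-root row of Table 1
for xi, sigma in [(0.5, 0.14), (5.0, 1.4)]:
    print(xi, sigma, xi**2, sqrt_corrected(xi, sigma), sqrt_numeric(xi, sigma))
```

For these pairs the corrected expectation exceeds the naive back-transform ξ² by roughly σ², since the truncated tail carries very little probability when σ < ξ/3.5.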
For the cubic root function (i.e., n = 3), the same expectation is:

$$ E[Y] = \xi(\xi^2 + 3\sigma^2)\left[1 - \Phi\left(-\frac{\xi}{\sigma}\right)\right] + \sigma(\xi^2 + 2\sigma^2)\,\varphi\left(\frac{\xi}{\sigma}\right) \tag{14} $$

Trigonometric transformations of the response variable are encountered in ecological studies, but sometimes they are combined with other functions, such as square root (Lohrey and Bailey 1977). A strong argument against the arcsine transformation was made by Warton and Hui (2011), who recommend its replacement with GLM or nonlinear approaches. However, the example that we provided showed that their argument rests either in the simplicity of the model or solely in the ability of the heuristic algorithms to deliver the correct solution. Trigonometric transformations, if adequate, are particularly helpful when complex formulations are used to describe ecological processes. The following back-transformation corrections address the concerns raised by advocates of GLM or direct nonlinear algorithms, without sacrificing the parsimony or appealing to semi-subjective rules of solving complex equations. The unbiased back-transformed expectation for the sine function g(y) = sin(y), for which y is restricted to the interval [−π/2, π/2], is (proof in Appendix S1):

$$ E[Y] = \frac{1}{\sqrt{2\pi}} \sum_{n=0}^{\infty} \frac{(2n-1)!!}{(2n)!!}\, \frac{1}{2n+1} \left( \sum_{k=0}^{2n+1} \binom{2n+1}{k} \xi^{2n+1-k} \sigma^{k} I(a, b, k) \right) \tag{15} $$

where k!! is the product of all even or odd numbers not exceeding the even or odd k, respectively, a = (1 + ξ)/σ and b = (1 − ξ)/σ (rest of the symbols as before).

The unbiased back-transformed expectation for the arcsine function g(y) = arcsin(y), for which y is restricted to the interval [−1, 1], is (proof in Appendix S1):

$$ E[Y] = \frac{1}{\sqrt{2\pi}} \sum_{n=0}^{\infty} \frac{(-1)^n}{(2n+1)!} \sum_{k=0}^{2n+1} \binom{2n+1}{k} \xi^{2n+1-k} \sigma^{k} I(a, b, k) \tag{16} $$

with a = (π/2 + ξ)/σ and b = (π/2 − ξ)/σ (rest of the symbols as before). The unbiased back-transformed expectation for the cosine function g(y) = cos(y), for which y is restricted to the interval [0, π], is (proof in Appendix S1):

$$ E[Y] = \frac{\pi}{2}\left(\Phi(b) - \Phi(-a)\right) - \frac{1}{\sqrt{2\pi}} \sum_{n=0}^{\infty} \frac{(2n-1)!!}{(2n)!!}\, \frac{1}{2n+1} \sum_{k=0}^{2n+1} \binom{2n+1}{k} \xi^{2n+1-k} \sigma^{k} I(a, b, k) \tag{17} $$

with a = (1 + ξ)/σ and b = (1 − ξ)/σ (rest of the symbols as before). The unbiased back-transformed expectation for the arccos function g(y) = arccos(y), with y ∈ [−1, 1], is (proof in Appendix S1):

$$ E[Y] = \frac{1}{\sqrt{2\pi}} \sum_{n=0}^{\infty} \frac{(-1)^n}{(2n)!} \sum_{k=0}^{2n} \binom{2n}{k} \xi^{2n-k} \sigma^{k} I(a, b, k) \tag{18} $$

where a = ξ/σ and b = (π − ξ)/σ (rest of the symbols as before). The unbiased back-transformed expectation for the tangent, g(y) = tan(y), with y ∈ [0, π], is (proof in Appendix S1):

$$ E[Y] = \frac{1}{\sqrt{2\pi}} \sum_{n=0}^{\infty} \frac{(-1)^n}{(2n+1)!} \sum_{k=0}^{2n+1} \binom{2n+1}{k} \xi^{2n+1-k} \sigma^{k} I(a, b, k) \tag{19} $$

where a = ξ/σ and b = (1 − ξ)/σ (rest of the symbols as before). The last trigonometric function considered is the arctan function g(y) = arctan(y), for which the unbiased back-transformed expectation assuming y ∈ [−1, 1] is (proof in Appendix S1):

$$ E[Y] = \frac{1}{\sqrt{2\pi}} \sum_{n=0}^{\infty} \frac{(-1)^n\, 2^{2n} (2^{2n} - 1) B_{2n}}{(2n)!} \left( \sum_{k=0}^{2n-1} \binom{2n-1}{k} \xi^{2n-1-k} \sigma^{k} I(a, b, k) \right) \tag{20} $$

where B_n is the nth Bernoulli number, which is computed with the recurrence formula

$$ B_n = -\sum_{k=0}^{n-1} \binom{n}{k} \frac{B_k}{n - k + 1}, \quad n \ge 1, \text{ and } B_0 = 1. \tag{21} $$

Hyperbolic functions are not often encountered in ecological studies (Gotelli and Ellison 2013), but they are present in econometrics, to model, for example, extreme values (Burbidge et al. 1988) or regressions (MacKinnon and Magee 1990). Consequently, they could be used to model ecosystem processes driven by extreme events, such as fire or infestation by insects. The unbiased back-transformed expectation for the hyperbolic sine, g(y) = sinh(y) = (e^y − e^(−y))/2, where y ∈ [0, a], a < ∞, is (proof in Appendix S1):

$$ E[Y] = \frac{1}{\sqrt{2\pi}} \sum_{n=0}^{\infty} \frac{(-1)^n (2n)!}{(n!)^2 (2n+1)} \sum_{k=0}^{2n+1} \binom{2n+1}{k} \xi^{2n+1-k} \sigma^{k} I(a, b, k) \tag{22} $$

where a = ξ/σ and b = (1 − ξ)/σ (rest of the symbols as before). The unbiased back-transformed expectation for the hyperbolic arcsine or inverse hyperbolic sine, g(y) = arcsinh(y) = ln(y + √(y² + 1)), with y ∈ [0, ∞), is (proof in Appendix S1):

$$ E[Y] = \frac{1}{2} e^{\sigma^2/2} \left[ 2\sinh(\xi) + e^{-\xi}\, \Phi\left(-\left(\frac{\xi}{\sigma} - \sigma\right)\right) - e^{\xi}\, \Phi\left(-\left(\frac{\xi}{\sigma} + \sigma\right)\right) \right]. \tag{23} $$

The last function considered in this study is the hyperbolic tangent, g(y) = tanh(y) = (e^y − e^(−y))/(e^y + e^(−y)), with y ∈ [0, ∞), which has the unbiased back-transformed expectation (proof in Appendix S1):

$$ E[Y] = \frac{1}{\sqrt{2\pi}} \sum_{n=0}^{\infty} \frac{1}{2n+1} \sum_{k=0}^{2n+1} \binom{2n+1}{k} \xi^{2n+1-k} \sigma^{k} I(a, b, k) \tag{24} $$

where a = ξ/σ and b = (1 − ξ)/σ (rest of the symbols as before).

Bayesian estimation
Stow et al. (2006) argued that the form of a model could influence the bias throughout the parameter space. Therefore, Stow et al. (2006) proposed a framework that uses a Bayesian approach to correct the retransformation bias. To compare the corrections of Eqs. 13–24 with their Bayes estimates, we followed the procedure used by Stow et al. (2006). The computations were executed with Stan, an open-source probabilistic programming language written in C++ for Bayesian inference and optimization. Stan was run on a Dell Precision 7910 workstation equipped with an Intel Xeon CPU E5-2630 v.3 and 32 GB RAM through the R interface using the RStan package (Stan Development Team 2016). Stan is computationally intense, as it is based on a Markov chain Monte Carlo approach that depends on a large number of parameters. Among these parameters, the number of iterations is preeminent; the iterations were executed using four chains, as recommended by Stan Development Team (2016). Similarly to Stow et al. (2006), the Bayesian estimates were obtained using a non-informative prior distribution.

Data
The bias correction for the logarithmic transformation computed by Finney (1941) depends only on the variance of the log-transformed variable. The corrections for bias of Eqs. 13–24 include, besides the standard deviation of the transformed response, σ, the predictor vector, ξ. The departure from the simple rectification of Finney (1941) rests on one side in the non-exponential functions considered in this study, such as sine or cosine, and on the other side in the truncation of the normal distribution.
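The dependence of the corrections on ξ as well as σ can be illustrated numerically for the two closed forms other than the square root, the cubic root (Eq. 14) and the hyperbolic arcsine (Eq. 23). The sketch below is hedged: it uses the large-sample lognormal factor exp(σ²/2) as a stand-in for a variance-only correction (not necessarily Finney's exact estimator), checks each closed form against a midpoint-rule evaluation of Eq. 11 over an assumed set B = (−ξ/σ, cutoff), and all function names are hypothetical.

```python
import math

def phi(t):
    """Standard normal probability density."""
    return math.exp(-0.5 * t * t) / math.sqrt(2.0 * math.pi)

def Phi(t):
    """Standard normal cumulative distribution."""
    return 0.5 * (1.0 + math.erf(t / math.sqrt(2.0)))

def log_factor(sigma):
    """Lognormal mean over naive back-transform: exp(sigma^2 / 2), free of xi."""
    return math.exp(0.5 * sigma**2)

def cubic_corrected(xi, sigma):
    """Closed form of Eq. 14 for the cubic-root transformation."""
    a = xi / sigma
    return (xi * (xi**2 + 3.0 * sigma**2) * (1.0 - Phi(-a))
            + sigma * (xi**2 + 2.0 * sigma**2) * phi(a))

def asinh_corrected(xi, sigma):
    """Closed form of Eq. 23 for the hyperbolic-arcsine transformation."""
    a = xi / sigma
    return 0.5 * math.exp(0.5 * sigma**2) * (
        2.0 * math.sinh(xi)
        + math.exp(-xi) * Phi(-(a - sigma))
        - math.exp(xi) * Phi(-(a + sigma)))

def numeric(inverse, xi, sigma, steps=20_000, upper=10.0):
    """Midpoint-rule value of Eq. 11 for a given inverse transformation."""
    lower = -xi / sigma
    h = (upper - lower) / steps
    return h * sum(inverse(xi + sigma * (lower + (i + 0.5) * h))
                   * phi(lower + (i + 0.5) * h) for i in range(steps))

# Correction factor = corrected expectation / naive back-transform
for xi in (0.5, 5.0):
    print(xi, log_factor(0.14), cubic_corrected(xi, 0.14) / xi**3)
```

With σ fixed at 0.14, the lognormal factor is the same for both values of ξ, whereas the cubic-root factor is close to 1 + 3σ²/ξ² and therefore shrinks toward 1 as ξ grows.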
Variance and, consequently, standard deviation are measures of uncertainty (Smaldino 2013). Therefore, it is of interest to assess the impact of uncertainty on the relationship between actual values and corrected values. It is also desirable that the whole spectrum of variance values, or at least one large enough to allow meaningful inferences, be analyzed, a case that is not encountered in studies based on real data. To ensure consistency of the results across all transformations considered and corrected with Eqs. 13–24, simulated data were used. Because the corrections depend on σ and ξ, the simulations consider values for both. The magnitude of ξ is model dependent; therefore, it can take any values from subunit to very large (e.g., 10^n, where n is a natural number). To accommodate the wide range that X′ can have, we rely on the property that an affine transformation of a random vector is a random vector. Therefore, division of X′ by its standard deviation will lead to another random vector, X″, which will have the same properties as X′, but will be scaled up or down depending on variance (i.e., smaller or larger than 1). Considering that ε is normally distributed, it can be assumed that 99.99% of the values of X′, and consequently X″, will be within seven standard deviations from ξ (i.e., ±3.5 × σ). Therefore, one can select as representative for ξ > 1 values in the range (1, 5). A similar argument holds for subunit values, but in this case, ξ should be chosen such that ξ ± 3.5 × σ will be in the interval (0, 1). Therefore, when ξ has the same sign (i.e., positive or negative), the largest standard deviation considered for simulations (Table 1) has to fulfill the condition:

$$ \sigma_{\max} < \xi / 3.5. \tag{25} $$

The smallest standard deviation was determined such that it would be one order of magnitude, in terms of significant digits, lower than σmax. Depending on the transformation, for each selected ξ at least 10 standard deviations within the range (σmin, σmax) were simulated, each simulation containing 1000 generated values. A simulation with 1000 values will ensure normality and accurate estimation of the normal distribution parameters from data (Strimbu 2012).

Table 1. Values of σ and ξ for which simulations were executed to assess the impact of uncertainty on complex nonlinear models.

| No. | Transformation | Y′ formula | Y domain | ξ domain | ξ | σmin | σmax |
|---|---|---|---|---|---|---|---|
| 1 | Square root | √Y | (0, ∞) | (0, ∞) | 0.5; 5 | 0.05; 0.5 | 0.14; 1.4 |
| 2 | Cubic root | ∛Y | (0, ∞) | (0, ∞) | 0.5; 5 | 0.05; 0.5 | 0.14; 1.4 |
| 3 | Sine | sin(Y) | [−π/2, π/2] | [−1, 1] | 0.5 | 0.05 | 0.14 |
| 4 | Cosine | cos(Y) | [0, π] | [−1, 1] | 0.5 | 0.05 | 0.14 |
| 5 | Tangent | tan(Y) | [0, π/4] | [0, 1] | 0.5 | 0.05 | 0.14 |
| 6 | Arcsine | arcsin(Y) | [−1, 1] | [−π/2, π/2] | π/4 | π/64 | π/16 |
| 7 | Arccosine | arccos(Y) | [−1, 1] | [0, π] | π/4 | π/64 | π/16 |
| 8 | Arctangent | arctan(Y) | [−1, 1] | [0, π/4] | π/8 | π/96 | π/32 |
| 9 | Hyperbolic sine | (e^Y − e^(−Y))/2 | [0, ∞) | [0, ∞) | 0.5 | 0.05 | 0.14 |
| 10 | Hyperbolic arcsine | ln(Y + √(Y² + 1)) | [0, ∞) | (0, ∞) | 0.5; 5 | 0.05; 0.5 | 0.14; 1.4 |
| 11 | Hyperbolic tangent | (e^Y − e^(−Y))/(e^Y + e^(−Y)) | [0, ∞) | [0, 1) | 0.5 | 0.05 | 0.14 |

We supplemented the simulations with a real example, which used the square root to predict the canopy fuel weight from lidar. The data were collected on a 5.2 km² area within Capitol State Forest, Washington, USA (Andersen et al. 2005). The model uses three predictor variables: the fraction of lidar first return 2 m above ground (D), and the 25th and 90th height percentile (i.e., h25 and h90, respectively, measured in meters):

$$ \sqrt{\text{canopy fuel weight [kg/ha]}} = 22.7 + 2.9\,h_{25} - 1.7\,h_{90} + 106.6\,D. \tag{26} $$

The standard deviation of the model was 11.5 (i.e., square root units) and the coefficient of determination R² = 0.86 (P < 0.001), which rendered the relationship significant. The correction of Eq. 26 when back-transformed is executed with Eq. 13, which depends on ξ. According to Andersen et al. (2005), the approximate range of ξ is (40, 160). The authors did not back-transform Eq. 26 to original units, probably for two reasons: First, the results would be biased, and second, the relationships would have been either non-significant or with a small coefficient of determination. Here, we estimate the bias when back-transformed as the ratio between the uncorrected value and the expectation of Eq. 13:

$$ \text{bias} = \frac{\xi^2}{E[Y|x]} = \left[ \left(1 + \frac{\sigma^2}{\xi^2}\right) \left(1 - \Phi\left(-\frac{\xi}{\sigma}\right)\right) + \frac{\sigma}{\xi}\,\varphi\left(\frac{\xi}{\sigma}\right) \right]^{-1}. \tag{27} $$
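The bias ratio of Eq. 27 is simple to evaluate over the range of ξ reported for the canopy fuel model. The following sketch assumes σ = 11.5 and ξ between 40 and 160, as stated in the text; the function name and the grid of ξ values are illustrative.

```python
import math

def phi(t):
    """Standard normal probability density."""
    return math.exp(-0.5 * t * t) / math.sqrt(2.0 * math.pi)

def Phi(t):
    """Standard normal cumulative distribution."""
    return 0.5 * (1.0 + math.erf(t / math.sqrt(2.0)))

def sqrt_bias_ratio(xi, sigma):
    """Eq. 27: naive back-transform xi^2 divided by the corrected expectation (Eq. 13)."""
    a = xi / sigma
    corrected_over_naive = ((1.0 + (sigma / xi) ** 2) * (1.0 - Phi(-a))
                            + (sigma / xi) * phi(a))
    return 1.0 / corrected_over_naive

SIGMA = 11.5  # standard deviation of Eq. 26, in square-root units
for xi in (40, 80, 120, 160):  # illustrative grid inside the reported range of xi
    print(xi, round(sqrt_bias_ratio(xi, SIGMA), 4))
```

Under these assumptions the ratio is roughly 0.92 at ξ = 40 and roughly 0.99 at ξ = 160, so the uncorrected back-transform underestimates the expectation most strongly at the lower end of the range.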
RESULTS
The back-transformed values computed with Eqs. 13–24 are [...]

[...] transformations involving power or exponentials is sensitive to variance and size of predictors, and ranges from 1% to 62% (Figs. 2, 4). The dependence is obvious for hyperbolic arcsine of variables larger than 1 (Fig. 4c), which shows a bias ≥12%. Corrections based on Bayesian estimates were close to the real values (Figs. 2, 4). However, for the vast majority of transformations and variances, the Bayesian approach supplied worse corrections than the ones in Eqs. 13–24 (exception for variance 1.4 and the hyperbolic sine, hyperbolic arcsine, tangent, and hyperbolic tangent). In addition, a chain is completed in more than 400 s, irrespective of the transformation, which renders a solution in approximately 0.5 h. The confidence intervals for the estimated values supplied by RStan not only include the biased results but sometimes also 0, which questions its ability to model complex relationships. Even though the Bayesian estimates are surprisingly close to the actual values, the long time needed to obtain a solution reduces its attractiveness. Alternatively, the proposed approach for modeling complex nonlinear models renders a solution almost instantaneously. However, it could be argued that increases in computational power are expected, which could render this caveat irrelevant. Nevertheless, the main issue associated with the Bayesian approach springs from solutions that are not significantly different from 0.

To assess the ability of the Bayesian procedure to identify nonlinear complex models, we have used the same synthetically generated example from Eq. 4 and the same goal: to estimate parameters knowing the functional form of the relationship. Considering that RStan depends on the number of iterations, eight pairs (total number of iterations–number of warm-up iterations) were considered. The pairs were considered to evaluate the effort needed to obtain convergence of Markov chain Monte Carlo-based solutions. The combinations were 10,000–3500, 10,000–5000, 15,000–5000, 15,000–7500, 20,000–6500, 20,000–10,000, 25,000–7500, and 25,000–10,000, where the first number is the total number of iterations and the second is the warm-up number of iterations. To evaluate the impact of the a priori information on the results, a set of four distributions were assigned to the four coefficients: (1) all bi are uniformly distributed between 1 and 1.5 (called Set 1); (2) all bi are between 1 and 2, with b0 ~ [...]

Fig. 5. Bayesian estimates of the model y = arcsin(1 − 0.1 × exp(0.0001x² + 0.001x)); (a) 10,000 total iterations, (b) 50,000 total iterations.

[...] 1), we notice an increase with variance of the difference between the expectations computed using the transformed model and the actual values. On the other hand, for power and trigonometric functions, there is no obvious pattern relating variance or size of predictor variable with the difference between modeled expectations and actual values. Therefore, it is natural to investigate whether or not the size of variance impacts expectation derived from the transformed model. In the simulations, the size of variance depends on the mean of the transformed model (i.e., X″), but Eq. 25 restricted the coefficient of variation (CV) to 1. The hyperbolic arcsine has a closed-form expectation (Eq. 23) and allows values for ξ > 1, as its domain is the set of real positive numbers. Possible values to provide meaningful inference are ξ = 8 and σ = 3, which lead to CV = 37.5%. We restricted CV < 50% to avoid altering the results by generating invalid data, as the probability of obtaining values outside the defining domain is larger than 15% when σ > ξ/2. To enhance the results, we executed 60 simulations: 30 with 1 million observations and 30 with 1000 observations. Presence of multiple simulations allows identification of patterns, which cannot be detected from Fig. 4, as only one value was produced for each pair ξ–σ. The results of the new simulation are surprising, as accuracy, expressed as “mean from data divided by mean from transformed model,” exhibits an inverse relationship with the number of observations (Figs. 5, 6). In fact, an almost constant underestimation of the actual mean is revealed when 1 million observations are used in computations (i.e., accuracy is approximately 1.5). A very different picture is painted when expectation is based on 1000 observations, accuracy bouncing as expected around 1, from 0.6 to more than 2.

Fig. 6. Impact of size of generated data on the accuracy of the back-transformed hyperbolic arcsine expectations. Axes: Simulation (horizontal) and Accuracy (vertical); series: 1000 and 1 million observations.

DISCUSSION
Andersen et al. (2005) were cautious in presenting conclusions based on back-transformed values, as significant bias was associated with almost half of the range of the predictor variables. However, the canopy fuel weight model could have been used without any amendments for the upper portion of the X range. Now, the [...]
12
September 2017
❖ Volume 8(9) ❖ Article e01945
SPECIAL FEATURE: UNCERTAINTY ANALYSIS
STRIMBU ET AL.
The reduction in variance is more prominent when larger datasets are generated, as more extreme observations will be dropped, while the bulk of the data will be located around the mean. Therefore, for truncated distributions, generated data will exhibit an increase in bias with an increase in size of dataset. The bias is an artifact of the procedure used to generate data, and it is easy to eliminate by selection of a transformation which not only fit the data but also reduced the uncertainty. We are able to identify the bias for generated data because all information is available (i.e., model and distribution of residuals), which is not the case of real data when the relevant details are not attainable. Therefore, for complex nonlinear models, estimation of variance and, consequently, uncertainty acts like a threshold between bias and non-biased results. Considering that most transformations of the response reduce variance, the proposed approach recommends even more the parsimonious procedure presented in this research as the limit on which uncertainty starts influencing the results are unlikely to be reached. The lack of information on distribution of residuals recommends the development of complex nonlinear models based on transformed response, if not entirely at least for comparison with models developed on unchanged predicted variable. In conclusion, for complex nonlinear models not only that the transformation of the response produces unbiased results, but is also more parsimonious than the models obtained by non-transforming the predicted variable. The present research estimated the bias corrections needed for back-transforming the dependent variable only for 10 functions. 
For a large array of transformations, expectation of response given a linear model was not computed for two reasons: First, it is size related, in the sense that there are too many functions to be included in one paper, and second, for many functions, computation of expectation is not a trivial task. Therefore, we expect that further work will expand the approach advocated here and will continue computing E[Y] for other transformations.
model can be used for the entire range considered by Andersen et al. (2005). While the model developed by Andersen et al. (2005) is very simple (only the square root), transformation of the predicted variable proves to be helpful for complex nonlinear models. The argument of altering the response rests on two observations: First, conversion of a nonlinear function to a linear relationship is preferred, as solutions do not depend on external factors, such as the performance of the algorithms used to find the coefficients, and second, reduction in variance, which is usually induced by transformations, leads to results close to the real values. The main argument against changing the predictor is that it lacks intuition, as modeled units make little sense. The unattractiveness of alteration of units is enhanced by the presence of bias when results are back-transformed without considering data uncertainty. The proposed approach answers both concerns raised by transformation of the response variable: lack of intuition and bias. The main attraction of the solution presented here is its ability to identify a model that is not only close to the real one, but it is also relatively robust, if not insensitive, to the algorithms used to find the coefficients. From the later perspective, it is obvious that complex nonlinear models when identified using algorithms that do not require a change of the response variable should be verified by the same model with transformed variable whose coefficients are determined using the computed expectations. Bayesian estimation of complex, nonlinear models is close to the real model (Figs. 2–4), when predicted for one individual value. However, when the entire model is considered, then wrong results are obtained (Fig. 5), similar to nonlinear procedures. Furthermore, the time needed to compute a Bayesian solution is significantly larger than the proposed corrections, which are virtually instantaneous. 
A counterintuitive finding was that size of the dataset used to develop a model plays a significant role in the estimation of parameters, in the sense that more generated observations could lead to biased results. This outcome is the result of the truncation of the normal distribution to a domain on which the functions are defined. Truncation will reduce the variance and according to the Eqs. 13–24 will underestimate the actual values. ❖ www.esajournals.org
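The statement that truncation reduces the variance of generated data can be illustrated with a minimal sketch, which is not the simulation code used in this study. It assumes a normal distribution with mean 8 and standard deviation 3 (the values quoted for the hyperbolic arcsine example), a hypothetical lower bound of 0 standing in for the defining domain, and an illustrative helper named `truncation_effect`; values below the bound are simply dropped.

```python
import random
import statistics


def truncation_effect(mean, sd, n, lower=0.0, seed=1):
    """Draw n normal values, drop those below `lower`, and compare variances.

    Illustrative helper (not from the paper): `lower` is an assumed
    boundary of the defining domain of the transformation.
    """
    rng = random.Random(seed)
    raw = [rng.gauss(mean, sd) for _ in range(n)]
    kept = [v for v in raw if v >= lower]  # truncation by rejection
    return {
        "cv": sd / mean,                        # coefficient of variation
        "dropped": len(raw) - len(kept),        # observations outside the domain
        "var_raw": statistics.pvariance(raw),   # variance before truncation
        "var_truncated": statistics.pvariance(kept),  # variance after truncation
    }


if __name__ == "__main__":
    # Two dataset sizes, chosen only for illustration.
    for n in (1_000, 100_000):
        print(n, truncation_effect(mean=8.0, sd=3.0, n=n))
```

With these settings only a small share of the draws falls below the assumed bound, so the variance shrinks modestly; increasing `sd` relative to `mean` makes the reduction larger.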
CONCLUSION
The present research was triggered by the spectacular development in information technology, which led environmental scientists to the illusion that efforts should be focused on developing models that reduce uncertainty rather than models adjusted to the existing uncertainty. As a result, environmental relationships are represented by non-parsimonious and suboptimal models, which in many instances could be even wrong. We empirically showed that models developed using nonlinear algorithms applied to raw (i.e., untransformed) data could lead to wrong models. Therefore, this research focused on providing scientists who model ecosystem processes with a procedure that supplies parsimonious, accurate results. The procedure transforms the response variable to achieve two objectives simultaneously: first, normality of the residuals, and second, either a linear model or a model that can be easily linearized. After the parameters of the transformed model are estimated, the bias induced by back-transforming is corrected. The bias corrections are computed for 10 popular functions (i.e., power, trigonometric, and hyperbolic) by considering the truncated normal distribution, when necessary (e.g., the square root transformation). Based on generated data, we have shown that the proposed procedure supplies unbiased results and increases parsimony compared with procedures based on the untransformed response. We noticed that when variance increases, truncation of the distribution starts altering the corrections such that predicted values will be more than 50% off from the actual values. The departure of the expectation from the actual value in this case is an artifact of the procedure used to generate data. Our results indicate that uncertainty, measured by variance, impacts the analysis in a non-intuitive way when the defining domain of the response variable is restricted. The subtle way in which uncertainty influences the development of complex nonlinear models advocates even more the usage of parsimonious linear models, which are less sensitive to the method used by a particular software to process data. Finally, ecosystem processes should be modeled with strategies that consider not only modeling and computation aspects, but also data uncertainties, particularly reducing variance to levels with no significant impact on the results.
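As a minimal illustration of why back-transformation needs a bias correction, the sketch below uses the square root transformation under a simplifying assumption: the transformed response W = sqrt(Y) is normal and untruncated, so that E[Y] = μ² + σ² (from the standard identity E[W²] = E[W]² + Var(W)). This is not necessarily the form of the corrections in Eqs. 13–24, which account for truncation when necessary. The function name and the values of `mu` and `sigma` are hypothetical.

```python
import random
import statistics


def sqrt_backtransform_demo(mu, sigma, n, seed=7):
    """Compare naive and corrected back-transformed means for Y = W**2.

    Assumes W = sqrt(Y) ~ Normal(mu, sigma) without truncation, a
    simplification that is reasonable only when mu is several sigma
    above zero.
    """
    rng = random.Random(seed)
    w = [rng.gauss(mu, sigma) for _ in range(n)]  # transformed scale
    y = [v * v for v in w]                        # original scale
    mu_hat = statistics.fmean(w)
    var_hat = statistics.variance(w)
    return {
        "mean_y": statistics.fmean(y),        # mean of the data
        "naive": mu_hat ** 2,                 # back-transform without correction
        "corrected": mu_hat ** 2 + var_hat,   # back-transform plus variance term
    }


if __name__ == "__main__":
    print(sqrt_backtransform_demo(mu=10.0, sigma=2.0, n=50_000))
```

In this setting the naive back-transform falls short of the data mean by roughly the variance of the transformed response, while the corrected value matches it closely.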
ACKNOWLEDGMENTS
This work was partially supported by the National Institute of Food and Agriculture, U.S. Department of Agriculture, McIntire Stennis Project OREZ-FERM875, and by the Romanian ANCSI Project POC P-37–257. Bogdan Strimbu designed the research, produced the example, and wrote most of the manuscript. Alexandru Amarioarei estimated all the integrals and wrote the sections Models and Bayesian estimation. Mihaela Paun produced the figures, wrote the conclusions, and reviewed the manuscript.
LITERATURE CITED
Andersen, H.-E., R. J. McGaughey, and S. E. Reutebuch. 2005. Estimating forest canopy fuel parameters using LIDAR data. Remote Sensing of Environment 94:441–449.
Bartlett, M. S. 1936. The square root transformation in analysis of variance. Journal of the Royal Statistical Society 3(Suppl):68–78.
Bettinger, P., D. Graetz, K. Boston, J. Sessions, and W. D. Chung. 2002. Eight heuristic planning techniques applied to three increasingly difficult wildlife planning problems. Silva Fennica 36:561–584.
Bettinger, P., J. Sessions, and K. Boston. 2009. A review of the status and use of validation procedures for heuristics used in forest planning. Mathematical and Computational Forestry & Natural Resource Sciences 1:26–37.
Burbidge, J. B., L. Magee, and A. L. Robb. 1988. Alternative transformations to handle extreme values of the dependent variable. Journal of the American Statistical Association 83:123–127.
Clutter, J. L., J. C. Fortson, L. V. Pienaar, G. H. Brister, and R. L. Bailey. 1983. Timber management: a quantitative approach. Krieger, Malabar, Florida, USA.
Cormen, T. H., C. E. Leiserson, R. L. Rivest, and C. Stein. 2010. Introduction to algorithms. Third edition. MIT Press, Cambridge, Massachusetts, USA.
Finney, D. J. 1941. On the distribution of a variate whose logarithm is normally distributed. Journal of the Royal Statistical Society B 7:155–161.
Gotelli, N. J., and A. M. Ellison. 2013. A primer of ecological statistics. Sinauer, Sunderland, Massachusetts, USA.
Gregoire, T. G., Q. F. Lin, J. Boudreau, and R. Nelson. 2008. Regression estimation following the square-root transformation of the response. Forest Science 54:597–606.
Grimmett, G. D., and D. R. Stirzaker. 2002. Probability and random processes. Oxford University Press, New York, New York, USA.
Hoffman, P. 1998. The man who loved only numbers. Hyperion, New York, New York, USA.
Hoffman, M. D., and A. Gelman. 2014. The no-U-turn sampler: adaptively setting path lengths in Hamiltonian Monte Carlo. Journal of Machine Learning Research 15:1593–1623.
Lohrey, R. E., and R. L. Bailey. 1977. Yield tables and stand structure for unthinned longleaf pine plantations in Louisiana and Texas. SO-133. USDA Forest Service, New Orleans, Louisiana, USA.
MacKinnon, J. G., and L. Magee. 1990. Transforming the dependent variable in regression models. International Economic Review 31:315–339.
Madsen, K., H. B. Nielsen, and O. Tingleff. 2004. Methods for non-linear least squares problems. IMM 3215. Technical University of Denmark, Kongens Lyngby, Denmark.
Meyer, H. A. 1953. Forest mensuration. Penns Valley Publishers, State College, Pennsylvania, USA.
Monnahan, C. C., J. T. Thorson, and T. A. Branch. 2017. Faster estimation of Bayesian models in ecology using Hamiltonian Monte Carlo. Methods in Ecology and Evolution 8:339–348.
Murray, A. T., and R. L. Church. 1995. Heuristic solution approaches to operational forest planning problems. OR Spektrum 17:193–203.
Neal, R. M. 2011. MCMC using Hamiltonian dynamics. Pages 113–162 in S. Brooks, A. Gelman, G. Jones, and X.-L. Meng, editors. Handbook of Markov Chain Monte Carlo. Chapman & Hall/CRC Press, Boca Raton, Florida, USA.
Nelder, J. A., and R. W. M. Wedderburn. 1972. Generalized linear models. Journal of the Royal Statistical Society A 135:370–384.
Neyman, J., and E. L. Scott. 1960. Correction for bias introduced by a transformation of variables. Annals of Mathematical Statistics 31:643–655.
Pierre, D. A. 1986. Optimization theory with applications. Dover, Mineola, New York, USA.
Pukkala, T., and T. Heinonen. 2006. Optimizing heuristic search in forest planning. Nonlinear Analysis: Real World Applications 7:1284–1297.
Pukkala, T., and M. Kurttila. 2005. Examining the performance of six heuristic optimisation techniques in different forest planning problems. Silva Fennica 39:67–80.
Seppelt, R., and O. Richter. 2005. “It was an artefact not the result”: a note on systems dynamic model development tools. Environmental Modelling & Software 20:1543–1548.
Smaldino, P. E. 2013. Measures of individual uncertainty for ecological models: variance and entropy. Ecological Modelling 254:50–53.
Sprugel, D. G. 1983. Correcting for bias in log-transformed allometric equations. Ecology 64:209–210.
Stan Development Team. 2016. Rstan: the R interface to Stan. https://cran.r-project.org/web/packages/rstan/vignettes/rstan.html
Stow, C. A., K. H. Reckhow, and S. S. Qian. 2006. A Bayesian approach to retransformation bias in transformed regression. Ecology 87:1472–1477.
Strimbu, B. M. 2012. Correction for bias of models with lognormal distributed variables in absence of original data. Annals of Forest Research 55:265–279.
Warton, D. I., and F. K. C. Hui. 2011. The arcsine is asinine: the analysis of proportions in ecology. Ecology 92:3–10.
SUPPORTING INFORMATION
Additional Supporting Information may be found online at: http://onlinelibrary.wiley.com/doi/10.1002/ecs2.1945/full