Function regression in ecology and evolution: FREE

Methods in Ecology and Evolution 2015, 6, 17–26

doi: 10.1111/2041-210X.12290

Jian D. L. Yen1*, James R. Thomson2, David M. Paganin1, Jonathan M. Keith3 and Ralph Mac Nally2

1School of Physics, Monash University, Clayton, Vic 3800, Australia; 2Institute for Applied Ecology, The University of Canberra, Bruce, ACT 2617, Australia; and 3School of Mathematical Sciences, Monash University, Clayton, Vic 3800, Australia

Summary

1. Many questions in ecology and evolutionary biology consider response variables that are functions (e.g. species-abundance distributions) rather than a single scalar value (e.g. species richness). Although methods for analysing function-valued data have been available for several decades, ecological and evolutionary applications are rare.

2. We outline methods for regression when the response variable is a function (‘function regression’) and introduce the R package FREE, which focuses on straightforward implementation and interpretation of function regression analyses. Several computational methods are implemented, including machine learning and several Bayesian methods. We compare different methods using simulated data and real ecological data on individual-size distributions (ISDs) of fish and trees.

3. No single method performed best overall, with several performing equally well for a given data set. Which method to use depends on sample sizes and the questions being considered; in many cases, a consensus approach should be used to combine or compare fitted models.

4. Function regression allows the direct modelling of many function-valued data (e.g. species-abundance distributions) rather than having to reduce those functions to a single scalar response variable (e.g. species diversity or functional diversity indices). Our ecological examples using ISD data show that larger rivers support more-even fish-size distributions than smaller rivers and that low initial planting densities lead to more-even tree-size distributions than high initial planting densities. Function regression provided more informative and intuitive interpretations of these data than conventional non-function-valued approaches.

Key-words: Bayesian statistics, ecological modelling, functional data analysis, functional responses, function-valued data, multivariate data, R package, trait distributions

*Correspondence author. E-mail: [email protected]

Introduction

Many questions in ecology and evolutionary biology consider how variables respond to biotic or abiotic gradients. For example, ecologists study the response of diversity or abundance to local environmental conditions (Brown 1995), while evolutionary biologists study how fitness or its proxies are related to genotypes (a form of biotic gradient) (e.g. Dalziel, Rogers & Schulte 2009). These studies almost always focus on a single-valued response variable, where each subject (e.g. a replicate site or individual) has one number representing its ‘response’. Advanced applications might consider several correlated response variables, but such studies are rare and, typically, are restricted to very simple correlation structures or distance-based analyses (Legendre & Legendre 1998; Wang et al. 2012).

This usual approach to statistical analyses does not appreciate that many response variables in ecology and evolutionary biology are continuous or discrete functions. Examples of ‘function-valued’ data include species-abundance distributions, size-abundance distributions, species–area curves, reaction norms and traits that themselves are functions, such as growth curves or movement and dispersal patterns. Ecologists and evolutionary biologists often ask questions about these functions, such as ‘How do species-abundance distributions change as a function of vegetation structure?’ This question could be investigated using traditional regression approaches, whereby a log-normal species-abundance distribution would be fitted at each site and then the parameters of the fitted log-normal would be regressed on the vegetation characteristics (e.g. Yen, Thomson & Mac Nally 2013). A more elegant approach is to regress the entire species-abundance distribution, a smooth function, directly on the vegetation characteristics. This is ‘function regression’ (Fig. 1).

Methods for function regression have been available for several decades but remain predominantly in the literature of applied statistics (e.g. Müller & Stadtmüller 2005; James, Wang & Zhu 2009). Discussions of function-valued data have appeared recently in the literature of evolutionary biology, but applications in evolution and ecology are scarce (Stinchcombe & Kirkpatrick 2012; Hadjipantelis et al. 2013). Implementation of function regression is not straightforward, and few software methods exist that are accessible to ecologists or evolutionary biologists. Ramsay, Hooker & Graves (2009) have developed the R package fda (functional data analysis) for function regression analyses, but this implementation was

© 2014 The Authors. Methods in Ecology and Evolution © 2014 British Ecological Society

[Fig. 1: panels (a) to (c); axes: Response, Predictor, Argument]

not designed for use by ecologists or evolutionary biologists and requires an understanding of several technical details. Function regression encompasses several different analyses and the methods depend on whether the response variable, predictor variables or both are function-valued (Ramsay & Silverman 2005). We focus here on analyses where the response variable is a function and predictor variables are scalars, which we believe has the most application to ecological and evolutionary questions. Ramsay & Silverman (2005) provided a detailed introduction to, and discussion of, the different methods for functional data analysis; our focus is on the implementation and interpretation of function regression analyses for ecological and evolutionary questions. It is rare for ecologists or evolutionary biologists to measure data that are truly continuous. Rather, we define function-valued data as any data that are continuous in principle, even if measured sparsely, or any discrete data that vary across the range of some ordered index variable (the ‘argument’; Fig. 1). The only requirement is that this ordered index variable exists and that each observation is associated with a particular value of the index variable. The function-valued nature of data provides additional information to an analysis so, even when data are measured sparsely, function-valued analyses might outperform traditional statistical methods. We distinguish function regression from ‘functional responses’ (sensu Holling 1965) and the general curve-fitting methods used commonly in ecology and evolution. In these cases, the response and predictor variables are scalar quantities, rather than functions per se, and covariation between these scalar quantities is measured. Function regression extends functional response methods, using the estimated curves as response or predictor variables for further analyses (Fig. 1). 
Several methods already used by ecologists and evolutionary biologists have the capacity to handle function-valued data.

Fig. 1. Illustrations of different forms of regression analysis: (a) traditional linear regression; (b) nonlinear regression (sometimes associated with ‘functional responses’); and (c) function-valued regression. In linear and nonlinear regression, the response and predictor variables both take scalar values and the aim is to estimate the relationship between these variables [solid lines in (a) and (b)]. In function-valued regression, the response is itself a function of some ‘argument’ and the aim is to estimate the relationship between the entire response function and a scalar predictor variable.

These include repeated-measures analyses, multivariate modelling with complex correlation structures and hierarchical Bayesian models with random parameters (Clark 2007; Wang et al. 2012). Although these methods can incorporate function-valued data, the emphasis of these methods is dealing with non-standard error or model structures rather than studying variation in functions. Here, we argue for an emphasis on relationships between functions and their surrounding abiotic and biotic environments. Broad application of function regression could provide substantial new insights into ecological and evolutionary questions. Stinchcombe & Kirkpatrick (2012) outlined several examples for evolutionary biology, such as exploring variation in growth curves among different genotypes and exploring dominant modes of variation in genetic covariance functions (see also Wang et al. 2014). Stewart-Koster, Olden & Gido (2014) demonstrated one possible ecological application, using function regression to explore how the function-valued flow regime affected fish community composition and species’ densities. Other potential ecological applications include studying how species-abundance and species-biomass distributions are related to abiotic or biotic environmental conditions, exploring variation in trait-abundance distributions, studying movement and dispersal patterns in animals or plant propagules and understanding variation in resource-use curves among individuals or species. These applications almost always have been addressed using non-function-valued methods (e.g. Yen, Thomson & Mac Nally 2013; Murren et al. 2014), but a function-valued method is likely to provide much deeper ecological and evolutionary insight. We have two aims in this paper: (i) to introduce function regression to an ecological and evolutionary audience; and (ii) to provide software and recommendations for function regression analyses. 
We outline the mathematical details of function regression and describe six computational methods for these analyses. One of these methods is a Gibbs sampler that we have developed specifically for function regression; the other methods are implementations of existing software. These methods were compared using simulated data and real ecological data for individual-size distributions. All six computational methods and methods for simulating data have been combined into the R package FREE (Function Regression in Ecology and Evolution; available at ). A step-by-step guide to the package FREE, including details of cross-validation for estimating parameter settings, is in Appendix S1. We emphasize applications over methodological details and have attempted to facilitate these applications by providing an interface that is similar to the standard regression functions in R.

Materials and methods

MATHEMATICAL DETAILS FOR FUNCTION REGRESSION

Similarly to the expression of a standard linear regression model as response = intercept + coefficient × predictor + error, function regression for function-valued response variables can be represented as response(m) = intercept(m) + coefficient(m) × predictor + error(m) (Ramsay & Silverman 2005). Here, m in parentheses denotes a function of the index variable m, so that the response, intercept, coefficient and error all are functions of the index variable m (Fig. 1). For example, if one were interested in the response of organism growth curves to temperature, then response(m) would be a function specifying the observed size (e.g. snout-vent length, body mass) at each age m, predictor would be the temperature, intercept(m) would be the growth curve when the temperature was zero, coefficient(m) would be a function describing the change in the growth curve for a unit change in temperature and error(m) would be a function specifying the residual or error in the growth curve.

It is very rare for ecological and evolutionary studies to collect actually continuous data. It is much more likely that response(m) is recorded at discrete values of the argument m. If we assume that response(m) is recorded at the same p values of its argument ($m_1, \ldots, m_p$) for all subjects (this is not an essential assumption), we can write an equation for function regression in a discrete form (Ramsay & Silverman 2005):

\[
y_i(m_j) = \sum_{q=1}^{Q} \beta_q(m_j)\, x_{i,q} + \varepsilon_i(m_j), \qquad \text{eqn 1}
\]

where $y_i(m_j)$ is the measured data for the response function at $m_j$ for subject i (i = 1, . . ., n; j = 1, . . ., p), $\beta_q(m_j)$ is the value of the ‘coefficient’ function for variable q (q = 1, . . ., Q) at $m_j$, $x_{i,q}$ is the measured data for the predictor variable q for subject i and $\varepsilon_i(m_j)$ is the error or residual at $m_j$ for subject i. For simplicity, we define $y_{i,j} \equiv y_i(m_j)$, $\beta_{j,q} \equiv \beta_q(m_j)$ and $\varepsilon_{i,j} \equiv \varepsilon_i(m_j)$. Using matrix notation, eqn (1) can be simplified to

\[
\mathbf{Y} = \mathbf{X}\mathbf{B}^{T} + \mathbf{E}, \qquad \text{eqn 2}
\]

where $\mathbf{B}^{T}$ is the transpose of matrix $\mathbf{B}$ and the following definitions are used:

\[
\mathbf{y} \equiv \begin{bmatrix} y_1 \\ \vdots \\ y_p \end{bmatrix}, \qquad
\mathbf{Y} \equiv \begin{bmatrix} \mathbf{y}_1^{T} \\ \vdots \\ \mathbf{y}_n^{T} \end{bmatrix}
\equiv \begin{bmatrix} y_{1,1} & \cdots & y_{1,p} \\ \vdots & \ddots & \vdots \\ y_{n,1} & \cdots & y_{n,p} \end{bmatrix},
\]

\[
\mathbf{x} \equiv \begin{bmatrix} x_1 & \cdots & x_Q \end{bmatrix}, \qquad
\mathbf{X} \equiv \begin{bmatrix} \mathbf{x}_1 \\ \vdots \\ \mathbf{x}_n \end{bmatrix}
\equiv \begin{bmatrix} 1 & x_{1,2} & \cdots & x_{1,Q} \\ \vdots & \vdots & \ddots & \vdots \\ 1 & x_{n,2} & \cdots & x_{n,Q} \end{bmatrix},
\]

\[
\mathbf{B} \equiv \begin{bmatrix} \beta_{1,1} & \cdots & \beta_{1,Q} \\ \vdots & \ddots & \vdots \\ \beta_{p,1} & \cdots & \beta_{p,Q} \end{bmatrix},
\]

\[
\boldsymbol{\varepsilon} \equiv \begin{bmatrix} \varepsilon_1 & \cdots & \varepsilon_p \end{bmatrix}, \qquad
\mathbf{E} \equiv \begin{bmatrix} \boldsymbol{\varepsilon}_1 \\ \vdots \\ \boldsymbol{\varepsilon}_n \end{bmatrix}
\equiv \begin{bmatrix} \varepsilon_{1,1} & \cdots & \varepsilon_{1,p} \\ \vdots & \ddots & \vdots \\ \varepsilon_{n,1} & \cdots & \varepsilon_{n,p} \end{bmatrix}.
\]
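As a check on the notation, a single element of eqn 2 can be written out in full. The expansion below is only a sketch: it writes β for the coefficients and ε for the errors, and it assumes, as in the definition of X, that the first column of X is a column of ones, so that the first column of B plays the role of the intercept function.

```latex
\begin{aligned}
y_{i,j} &= \sum_{q=1}^{Q} x_{i,q}\,\beta_{j,q} + \varepsilon_{i,j} \\
        &= \beta_{j,1} + \sum_{q=2}^{Q} \beta_{j,q}\,x_{i,q} + \varepsilon_{i,j},
\qquad i = 1,\dots,n;\quad j = 1,\dots,p .
\end{aligned}
```

The dimensions agree with eqn 2: Y and E are n × p, X is n × Q and B is p × Q, so the product of X with the transpose of B is n × p.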

In their simplest form, if the $\beta_{j,q}$ are independent for each j (i.e. there is no dependence within each column of B), then equations (1) and (2) are equivalent to a standard linear model with an interaction between a categorical variable m and the predictor variables $x_{i,q}$. Interpreting the function regression model in this way gives some insight into the meaning of the estimated regression coefficients $\beta_{j,q}$. For example, when studying variation in growth curves, temperature might affect growth differently depending on the age of an organism. The parameter $\beta_{\mathrm{temp,age}}$ represents an interaction between the predictor variable ‘temperature’ and the index ‘age’, which tells us the effect of temperature on an organism’s growth at different ages. However, function regression models are more complex than conventional interaction models because the coefficient functions are linked over the different values of $m_j$, that is the $\beta_{j,q}$ are not independent for each j. Instead, there is some structure over the different values of j, which leads to the coefficient parameters $\beta_{j,q}$ being ‘smooth’ functions (i.e. each column of B is a smooth function). Controlling the smoothness of the slope parameters is central to function regression, and several approaches for doing so are discussed in Appendix S1. We used two approaches to define smooth functions for the slope parameters in the matrix B: (i) treat each column of B as an independent spline; and (ii) introduce a B-spline basis to model the columns of B. The B-spline basis approach can be viewed as an extension of generalized additive models (GAMs), that is, several simple functions (‘basis functions’) are used as building blocks to create a more complex function. This approach is simpler than the spline model because the basis functions are pre-defined, so we need only to keep track of a set of weights that determine how the basis functions are combined. 
However, the spline model allows more flexible functions because it does not assume any pre-defined functional form. A description of the spline and basis-function approaches used here is in Appendix S2, and Ramsay & Silverman (2005) provided a detailed description of the basis-function approach.

The structure of errors over the argument values (m) can be treated in several ways, although some of the estimation methods we used here are limited in the forms of error structures that can be implemented. We expect that errors will be correlated within subjects, suggesting an error structure of the form $\varepsilon_{i,j} = s_i + \delta_{i,j}$, where $s_i$ is the mean error for subject i and $\delta_{i,j}$ is the deviation from that mean at argument value j. Generally, it will be appropriate to treat the $s_i$ as independent [e.g. $s_i \sim N(0, \sigma^2)$]. The within-subject errors also can be regarded as independent [e.g. $\varepsilon_{i,j} \sim N(s_i, \psi^2)$], but we would expect some within-subject correlation structure if $\varepsilon_{i,j}$ is a smooth function (i.e. similar argument values have more similar errors). The simplest such structure is a first-order autoregressive (AR1) model for $\delta_{i,j}$, which leads to the overall error structure $\varepsilon_{i,j} = s_i + \phi_i \delta_{i,j-1} + \delta_{i,j}$.

We focus on only two error structures: (i) independent (IID) errors, in which errors over the argument values are identically distributed and independent of one another; and (ii) autoregressive (AR1) errors, in which errors over the argument values have a first-order autoregressive structure. Three of the estimation methods (gamboost, INLA and BUGS, see below) have the capacity to include subject-level error terms and AR1 error structures, so $\varepsilon_{i,j} = s_i + \delta_{i,j}$ (IID) or $\varepsilon_{i,j} = s_i + \phi_i \delta_{i,j-1} + \delta_{i,j}$ (AR1) for these methods. The other estimation methods do not include subject-level error terms or AR1 error structures, so $\varepsilon_{i,j} = \delta_{i,j}$ (IID) for these methods. More complex structures could be used (e.g. multivariate normal with unstructured covariance matrix), but we found IID and AR1 error structures to be adequate for our examples.
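The ‘simplest form’ of the model described in this section, in which the coefficients are independent across argument values, amounts to a separate least-squares fit at each value of the argument. The sketch below illustrates that special case for a single scalar predictor. It is an illustration only and is not one of the six estimation methods in FREE: the function name `fit_pointwise` and the toy growth-curve data are hypothetical, and no smoothing across argument values is applied.

```python
def fit_pointwise(Y, x):
    """Fit Y[i][j] = intercept[j] + slope[j] * x[i] separately at each argument index j.

    Y is a list of n rows (subjects), each holding p values of the response function;
    x holds one scalar predictor value per subject.
    """
    n = len(x)
    p = len(Y[0])
    x_bar = sum(x) / n
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    intercept, slope = [], []
    for j in range(p):
        y_j = [Y[i][j] for i in range(n)]  # response values at argument value j
        y_bar = sum(y_j) / n
        sxy = sum((x[i] - x_bar) * (y_j[i] - y_bar) for i in range(n))
        b = sxy / sxx                      # ordinary least-squares slope at j
        slope.append(b)
        intercept.append(y_bar - b * x_bar)
    return intercept, slope


# Hypothetical, noise-free toy data: size at three ages for four temperatures.
ages = [1, 2, 3]
temps = [0.0, 1.0, 2.0, 3.0]
Y = [[2.0 * m + 0.5 * m * t for m in ages] for t in temps]

intercept, slope = fit_pointwise(Y, temps)
print(intercept)  # intercept curve: size at temperature zero for each age
print(slope)      # coefficient curve: change in size per unit temperature at each age
```

Because each argument value is fitted on its own, the estimated coefficient curve can be rough when data are noisy; the spline and B-spline basis approaches described above exist to share information across neighbouring argument values.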

COMPUTATIONAL DETAILS FOR FUNCTION REGRESSION

Six computational procedures were used to fit the function regression model (eqn 2): minimization of a least-squares loss function (fda) (Ramsay, Hooker & Graves 2009); minimization of a loss function using model-based boosting (gamboost) (Buehlmann & Hothorn 2007; Hothorn et al. 2013); Bayesian estimation using integrated nested Laplace approximations (INLA) (Rue, Martino & Chopin 2009; Martins et al. 2013); Bayesian estimation using Hamiltonian Monte Carlo (stan) (Hoffman & Gelman 2014; Stan Development Team 2014); Bayesian estimation using reversible-jump Markov chain Monte Carlo in WINBUGS 1.4 (BUGS) (Lunn et al. 2000; Lunn, Whittaker & Best 2006; Lunn, Best & Whittaker 2008); and Bayesian estimation using a purpose-built Gibbs sampler (Gibbs). The gamboost, INLA, BUGS and Gibbs methods are best suited to spline-based models, while the fda and stan methods are most appropriate for models with B-spline bases. Technical details of these methods are in the cited references and in Appendix S2; we focus on their implementation and interpretation for ecological and evolutionary studies. Default settings for the R package FREE were used in almost all of our analyses; the settings used are given in Appendix S3.

DATA EXAMPLES

The relative performance of each method was assessed using simulated data and real ecological data. Model performance was measured using $r^2$ values between fitted and observed data (which we label as $r^2_{\mathrm{naive}}$) and, for the simulated data, mean-squared errors between fitted and real parameters ($\mathrm{MSE}_\beta$). Ten-fold cross-validation was used to estimate out-of-sample predictive performance for the real ecological data, with $r^2$ values between predicted and observed data (which we label $r^2_{\mathrm{cv}}$) used to quantify model predictive performance. Mean-squared errors and $r^2_{\mathrm{cv}}$ values were used to identify overfitting in fitted models; the most obvious sign of overfitting is high $r^2_{\mathrm{naive}}$ values coupled with high MSEs or low $r^2_{\mathrm{cv}}$ values. Computing time was measured for simulated data sets with different numbers of subjects.

Simulated data

Function-valued data were generated with the following structure:

\[
\begin{aligned}
y_{i,j} &= \alpha_j + \beta_{1,j}\, x_{i,1} + \beta_{2,j}\, x_{i,2} + \varepsilon_{i,j}, \\
\alpha_j &= \sin\!\left(\frac{m_j}{4} + \frac{8}{5}\right), \\
x_{i,q} &\sim \mathrm{Normal}(\phi, \psi), \\
\beta_{q,j} &= \gamma \sin\!\left(\eta\, m_j + \frac{q\pi}{3}\right).
\end{aligned}
\]

$\phi$ and $\psi$ were drawn from discrete uniform distributions ($\phi \in \{1, \ldots, 10\}$ and $\psi \in \{0·5, 1, 1·5, 2\}$) and $\gamma$ and $\eta$ were drawn from uniform distributions ($\gamma \in [0·1, 2]$ and $\eta \in [0·1, 1]$). The argument $m_j$ took all integer values in [−10, 10] (j = 1, . . ., 21). The $\varepsilon_{i,j}$ are noise added to the simulated data and took one of three forms: independent (IID), first-order autoregressive (AR1) and multivariate normal (MVN). For IID noise, we used $\varepsilon_{i,j} = s_i + \delta_{i,j}$, where $s_i$ is a subject-level error and $\delta_{i,j}$ is a subject- and bin-specific error. Both $s_i$ and $\delta_{i,j}$ were drawn from standard normal distributions. For AR1 errors, $\varepsilon_{i,j}$ was simulated using an autoregressive structure [$\varepsilon_{i,j} = s_i + \phi\, \varepsilon_{i,j-1} + \delta_{i,j}$; $\varepsilon_{i,0} = 0$], with the autoregressive parameter $\phi$ drawn from a uniform distribution on [0·5, 0·85] and the $s_i$ and $\delta_{i,j}$ drawn from a normal distribution with zero mean and unit variance. The MVN errors also took the form $\varepsilon_{i,j} = s_i + \delta_{i,j}$, where the $\delta_{i,j}$ were drawn simultaneously for all j from a multivariate normal distribution with zero mean and covariance matrix

\[
\boldsymbol{\Sigma} = \begin{bmatrix} 1 & \cdots & e^{-i/3} \\ \vdots & \ddots & \vdots \\ e^{-i/3} & \cdots & 1 \end{bmatrix},
\]

where i denotes the distance from the diagonal in rows or columns. FREE includes a function for simulating function-valued data, where the user can define any of the parameters.

Results were for 200 simulated subjects. A larger data set of 400 subjects also was simulated, and subsets (n = 50, 100, 200, 400) were used to compare computing time. Computing time was averaged over 10 model fittings for fda, gamboost, INLA and Gibbs methods. Only one model fitting was used to estimate computing time for stan and BUGS methods because these MCMC methods were extremely slow for larger data sets. Variability in estimates of computing time is low relative to overall computing time in slower analyses, so using one rather than 10 model fittings should not much affect our results.

Real data

Two real ecological data sets were used to test the performance of different function regression methods. The first data set was extracted from the National Water-Quality Assessment (NAWQA) Program data sets collected by the U.S. Geological Survey (U.S. Geological Survey 2001; see also http://cida.usgs.gov/nawqa_public/apex/f?p=136:1:0, last accessed April 15 2014). This data set consisted of body-size data for fish, sampled across the United States between 1993 and 2011. Data from 100 sites were used. The response variable was the distribution of body sizes among individuals in a given community: the individual-size distribution (ISD). The predictor variables were as follows: organic carbon (mg L⁻¹), orthophosphate (mg L⁻¹), light intensity (μmol s⁻¹ m⁻²), elevation (m) and drainage area (km²). These variables are a subset of the available data, but only these were measured consistently for all sites and with few missing values. All predictor variables were standardized to zero mean and unit variance prior to analyses.

The second data set was collected by the Forest Resource Analysis & Modelling unit within the Department of Sustainability and Environment (Victoria, Australia) (Horner et al. 2009). These data were sizes of forest trees (stem diameter at breast height over bark, dbh, measured in cm), sampled eight times over 42 years following initial plantings at different densities (600, 1000, 2000, 4000 and 8000 trees ha⁻¹) (Horner et al. 2009). The focal species was the floodplain dominant river red gum (Eucalyptus camaldulensis Dehnh.) in the Barmah–Millewa Forest in south-eastern Australia (35°50′ S, 145°00′ E) (Horner et al. 2009). Horner et al. (2009) used these data to explore the effects of initial planting densities on long-term patterns of mortality and growth in floodplain forests. We repeated one component of this analysis, namely modelling the differences in the size distribution of forest trees 42 years after initial plantings at different densities. The response variable was the distribution of dbh among individuals, and the predictor variable was the initial planting density.
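The simulation design described under ‘Simulated data’ can be sketched in a few lines. This is a hedged illustration rather than the simulation function shipped with FREE: the function name `simulate_function_data` is hypothetical, the hyperparameters are assumed to be drawn once per data set, ψ is assumed to be a standard deviation, the intercept curve is taken to be sin(m/4 + 8/5), and only the IID and AR1 error structures are covered (the MVN case is omitted).

```python
import math
import random


def simulate_function_data(n=200, error="IID", seed=None):
    """Simulate n subjects observed at the 21 integer argument values in [-10, 10]."""
    if error not in ("IID", "AR1"):
        raise ValueError("error must be 'IID' or 'AR1'")
    rng = random.Random(seed)
    m = list(range(-10, 11))

    # Hyperparameters, drawn once per data set (an assumption of this sketch).
    mean_x = rng.randint(1, 10)              # phi in {1, ..., 10}
    sd_x = rng.choice([0.5, 1.0, 1.5, 2.0])  # psi, treated here as a standard deviation
    gamma = rng.uniform(0.1, 2.0)
    eta = rng.uniform(0.1, 1.0)
    ar_coef = rng.uniform(0.5, 0.85)         # autoregressive parameter for AR1 errors

    # Intercept curve and the two coefficient curves.
    alpha = [math.sin(mj / 4 + 8 / 5) for mj in m]
    beta = [[gamma * math.sin(eta * mj + q * math.pi / 3) for mj in m] for q in (1, 2)]

    X, Y = [], []
    for _ in range(n):
        x_i = [rng.gauss(mean_x, sd_x) for _ in range(2)]
        s_i = rng.gauss(0.0, 1.0)            # subject-level error
        y_i = []
        prev = 0.0                           # error at j = 0 is defined as zero
        for j in range(len(m)):
            d_ij = rng.gauss(0.0, 1.0)       # subject- and bin-specific error
            if error == "AR1":
                e_ij = s_i + ar_coef * prev + d_ij
            else:
                e_ij = s_i + d_ij
            prev = e_ij
            y_i.append(alpha[j] + beta[0][j] * x_i[0] + beta[1][j] * x_i[1] + e_ij)
        X.append(x_i)
        Y.append(y_i)
    return {"m": m, "X": X, "Y": Y, "alpha": alpha, "beta": beta}


sim = simulate_function_data(n=200, error="AR1", seed=1)
print(len(sim["Y"]), len(sim["Y"][0]))  # number of subjects, number of argument values
```

Returning the true intercept and coefficient curves alongside the data makes it straightforward to compute mean-squared errors between fitted and true parameters for any fitting method.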

Results

SIMULATED DATA

The fda and Gibbs methods were substantially quicker than all other methods ( 800). The INLA method was up to eight times slower when modelling correlated errors (Table 1). The BUGS method was consistently 1·3 times slower when accounting for correlated errors (Table 1).

Based on $r^2_{\mathrm{naive}}$ values (i.e. correlation between observed and fitted values), the INLA method consistently performed best, followed closely by the BUGS and gamboost methods (Table 2). The fda and gamboost methods did not always estimate model parameters accurately (Table 2, Fig. 2; see Appendix S4 for fitted curves for all methods). In particular, despite a high $r^2_{\mathrm{naive}}$ value, the gamboost method generated consistently poor parameter estimates, suggesting that this method was overfitting (Table 2; Appendix S4).

Table 2. Performance of all methods for simulated data (n = 200). Data were simulated with three error structures within subjects: independent and identically distributed (IID); first-order autoregressive (AR1); and multivariate normal (MVN). Where possible, models were fitted with both independent error structures (IID) and autoregressive error structures (AR1). The model error structure that performed best (based on $r^2_{\mathrm{naive}}$) for a given data error structure is listed in the ‘Model errors’ column. Model fit ($r^2_{\mathrm{naive}}$) and mean-squared errors for each parameter [$\alpha(m)$, $\beta_1(m)$ and $\beta_2(m)$] were recorded; values in bold denote best performance for a given data error structure

Data errors   Method     Model errors   r²naive   MSEα    MSEβ1   MSEβ2
IID           fda        NA             0·78      1·46    0·41    2·79
IID           gamboost   AR1            0·85      3·19    6·38    6·34
IID           INLA       AR1            0·90      1·03    0·21    1·76
IID           stan       NA             0·78      0·53    0·26    1·01
IID           BUGS       IID            0·85      0·68    0·10    1·2
IID           Gibbs      NA             0·78      0·28    0·20    0·52
AR1           fda        NA             0·76      2·44    0·49    3·16
AR1           gamboost   AR1            0·86      3·73    6·38    6·32
AR1           INLA       AR1            1·00      1·26    0·54    1·16
AR1           stan       NA             0·76      1·22    0·54    1·15
AR1           BUGS       IID            0·86      1·48    0·42    1·58
AR1           Gibbs      NA             0·76      1·69    0·50    1·76
MVN           fda        NA             0·78      2·38    0·60    3·62
MVN           gamboost   AR1            0·86      3·22    6·18    6·42
MVN           INLA       AR1            1·00      1·08    0·62    2·06
MVN           stan       NA             0·78      1·27    0·68    1·68
MVN           BUGS       IID            0·87      1·45    0·60    2·58
MVN           Gibbs      NA             0·78      1·59    0·60    2·24

Table 1. Computing time for n simulated subjects for all methods. Models were fitted with both independent error structures (IID) and autoregressive error structures (AR1) where applicable. Values for fda, gamboost, INLA and Gibbs methods were based on an average of 10 model fittings, while stan and BUGS methods were fitted only once each due to their slow performance. All timings are based on a PC with an Intel Core i7-3770 3·40 GHz processor and 16·0 GB of RAM running Windows 7

                     Running time (s)
Model errors   n     fda    gamboost*        INLA     stan       BUGS       Gibbs
IID            50    0·3    6·1 (610)        3·5      2454·2     1513·9     0·5
IID            100   0·4    6·6 (660)        5·8      5409·4     2927·7     0·7
IID            200   0·6    7·7 (770)        10·5     12089·6    6089·0     1·1
IID            400   1·4    10·2 (1020)      19·7     25086·7    15948·7    2·0
AR1            50    NA     9·8 (980)        16·1     NA         1889·0     NA
AR1            100   NA     21·3 (2130)      44·1     NA         3841·2     NA
AR1            200   NA     86·0 (8600)      60·0     NA         8428·0     NA
AR1            400   NA     526·5 (52650)    157·4    NA         21713·0    NA

*The gamboost method requires a bootstrap resampling procedure if confidence intervals are required; approximate computing times using 100 bootstrap samples are shown in brackets.

[Fig. 2: panels (a) to (c); horizontal axis: m]

Correlated errors in the data did not affect model performance substantially, with each method estimating parameters similarly regardless of the correlation structure in the data (Table 2). Even when errors in the data were highly correlated, the BUGS method estimated parameters best when errors were assumed to be IID. The gamboost and INLA methods estimated parameters best using an AR1 error structure (Table 2). A simulation approach suggested that correlated errors in the data only affected fitted models for small n (n < 20; details in Appendix S7).


Table 3. Performance of all methods for diatom and fish data (n = 100). Models were fitted with both independent error structures (IID) and autoregressive error structures (AR1); the model error structure that performed best for a given data error structure is listed in the ‘Model errors’ column. Model fit (r2naive) and estimated predictive performance (r2cv) were recorded; values in bold denote best performance for a given data set Data set

Method

Model errors

r2naive

r2cv

Fish

fda gamboost INLA stan BUGS Gibbs fda gamboost INLA stan BUGS Gibbs

IID AR1 AR1 IID AR1 IID IID IID IID IID IID IID

0701 0723 0751 0709 0744 0726 0763 0592 0770 0795 0823 0821

0634 0577 0661 0660 0657 0650 0693 0389 0647 0687 0658 0689

REAL DATA

The BUGS method fitted the river red gum data best (r2naive), but all methods other than gamboost performed similarly and had good predictive performance (r2cv > 060; Table 3). All methods performed better when treating errors as independent rather than correlated for the river red gum data (Table 3). Initial planting densities had a strong influence on the size distributions of river red gum forests (Fig. 3; see also Appendix S5). High initial planting densities (8000 or 4000 trees ha1) led to many small stems and few large stems (Fig. 3a,b). Low initial planting densities (600 or 1000 trees ha1) substantially reduced the number of small stems and increased the number of medium- and large-sized stems, leading to a more-even distribution of tree sizes than sites with high initial planting densities (Fig. 3d,e). The INLA method fitted the fish data best (r2naive), but all methods had similar predictive performance for this data set (r2cv > 050; Table 3). The gamboost and INLA methods both performed better when treating errors as correlated than when errors were modelled as being independent (Table 3). The majority of variation in fish abundances across size classes was captured in the intercept curve (i.e. there was a

20

Fig. 2. Simulated data (solid line) and the parameter estimates from the INLA model (mean – dashed line; 95% pointwise credible intervals – dotted line). Key: (a) mean curve a (m); (b) effects of the first independent variable b1(m); and (c) effects of the second independent variable b2(m). Simulation details are described in the main text, and other methods are shown in Appendix S4.

River red gums

consistent pattern across all sites; Fig. 4; see also Appendix S5). Less of the variation around this intercept curve was explained (Fig. 4). Light intensity and elevation had the strongest effects on fish ISDs, with peaks for fish between 1 g and 200 g (Fig. 4). Orthophosphate and drainage area had less marked effects on fish ISDs, with both displaying negative effects for fish 200 g (Fig. 4). The fitted effects of drainage area on fish ISDs highlight the different interpretations that can be gained from using function-valued data (Fig. 5). Increasing drainage area was associated with increases in the relative abundance of large fishes (>200 g), decreases in the relative abundance of mid-sized


Function regression in ecology and evolution 4000 trees ha−1 40

40 30 20 10 0

β (m)

μ (m)

Intercept (8000 trees ha−1)

23

20 0 −20 −40

10

20

30

40

0

30

2000 trees ha−1

1000 trees ha−1

40

40

20

20

0

40

0 −20

−40

−40 0

10

20

30

40

0

dbh (cm)

10

20

30

40

dbh (cm)

600 trees ha−1 40

β (m)

20 0 −20 −40 0

10

20

30

40

dbh (cm)

Intercept

Organic carbon 1·0

β (m)

μ (m)

3 2 1 0

0·0

1e−02

1e+01

1e+04

1e−05

1e+01

Mass (g)

Orthophosphate

Light intensity

1 0 −1 −2 −3

1e+04

3 2 1 0 1e−05

1e−02

1e+01

1e+04

1e−05

Mass (g)

β (m) 1e−02

fishes and little change in the relative abundance of the smallest fish (Fig. 5). The effects of drainage area on relative abundances of different-sized fishes would not be apparent without function-valued data. For example, a multiple linear regression

1e+01

Mass (g)

1e+01

1e+04

Drainage area

0·5 0·0 −0·5 −1·0 −1·5 −2·0 1e−05

1e−02

Mass (g)

Elevation

β (m)

1e−02

Mass (g)

β (m)

β (m)

0·5

−0·5 1e−05

Fig. 4. Fitted parameters from a function regression model of fish individual-size distributions (ISDs). The fitted curves display the effect of a given predictor variable on the ISD across different size classes. Solid lines are the mean value for each estimated curve, dashed lines give approximate 95% pointwise credible intervals for the fitted curve and the dotted line denotes no effect. Parameters were fitted using the BUGS method; other methods are shown in Appendix S5.


Fig. 3. Fitted parameters from a function regression model of river red gum size distributions. The fitted curves display the effect of different initial planting densities on size distributions measured 42 years after initial plantings. Solid lines are the mean value for each estimated curve, dashed lines give approximate 95% pointwise credible intervals for the fitted curve and the dotted line denotes no effect. Parameters were fitted using the Gibbs method; results for other methods are shown in Appendix S5.
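The mean curves and approximate 95% pointwise credible intervals described in the captions of Figs 3 and 4 can be illustrated with a minimal sketch. It assumes that posterior draws of an effect curve β(m) are available on a common grid; the draws below are simulated for illustration only, and the function names `pointwise_summary` and `quantile` are hypothetical and not part of the FREE package.

```python
import random
import statistics


def quantile(sorted_values, prob):
    """Linear-interpolation quantile of an already sorted list."""
    position = prob * (len(sorted_values) - 1)
    below = int(position)
    above = min(below + 1, len(sorted_values) - 1)
    weight = position - below
    return sorted_values[below] * (1 - weight) + sorted_values[above] * weight


def pointwise_summary(draws, lower=0.025, upper=0.975):
    """Summarise draws of a curve separately at each grid point.

    draws: list of curves, each a list of beta(m) values on the same grid.
    Returns the pointwise mean, lower band and upper band.
    """
    means, lows, highs = [], [], []
    for j in range(len(draws[0])):
        column = sorted(curve[j] for curve in draws)
        means.append(statistics.fmean(column))
        lows.append(quantile(column, lower))
        highs.append(quantile(column, upper))
    return means, lows, highs


# Simulated draws (an assumption): a linear effect curve plus Gaussian noise.
random.seed(1)
grid = [i / 10 for i in range(11)]
true_curve = [2.0 * m - 1.0 for m in grid]
draws = [[b + random.gauss(0.0, 0.2) for b in true_curve] for _ in range(2000)]

mean, low, high = pointwise_summary(draws)

# The dotted "no effect" line is zero; flag grid points whose interval excludes it.
no_effect_excluded = [not (lo <= 0.0 <= hi) for lo, hi in zip(low, high)]
```

Because the intervals are pointwise, they describe uncertainty at each argument value separately and do not form a simultaneous band for the whole curve.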


of total fish abundance against the same predictor variables would indicate that total fish abundance increased with drainage area (bdrain = 0·87, 95% credible interval = [0·30, 1·44], r2 = 0·24), but would not provide the detailed picture of how


abundances of different size classes change with drainage area (Fig. 5; see also Appendix S6). With the exception of the gamboost method, all methods tended to estimate similar effects for each variable (Fig. 5; see also Appendix S6). The gamboost method fitted markedly different curves to the other methods, a trend that also was apparent for simulated data (Fig. 5; see also Appendix S4). It is likely that the gamboost method was overfitting, a conclusion that is supported by the relatively poor r2cv performance of this method for both the fish and river red gum data sets (relative to r2naive; Table 3).

Discussion

Many questions in ecology and evolutionary biology consider function-valued data, such as trait distributions or reaction norms, but current analyses typically reduce these functions to a single variable prior to analyses (e.g. using diversity indices). Function regression allows these functions to be analysed directly, providing much greater insight into key ecological and evolutionary questions.

A limitation of function regression is the complexity of the underlying models to be fitted and the resulting computational loads. We have partially addressed this issue by providing the R package FREE that incorporates six methods for function regression. FREE has a formula interface that resembles the standard lm() function in R and so should be accessible to many ecologists and evolutionary biologists familiar with R.

The interpretation of fitted effects from function regression analyses is intuitive (e.g. Figs 3 and 4) and provides insights into the effects of different variables on function-valued measurements. In the examples here, the effect of drainage area on fish ISDs was more complex than the corresponding linear model suggests, with larger drainage areas associated with more large fish and fewer mid-sized fish. This result might reflect the capacity for large rivers (large drainage areas) to support more large fish, increasing the number of top predators with associated declines in the number of small- and mid-sized fish. Fitted models of river red gum size distributions were consistent with those of Horner et al. (2009), but provided a more nuanced interpretation of how planting densities affected tree-size distributions.

The fda, INLA and Gibbs methods performed consistently well, both in computing time and in model fit. The BUGS and stan methods performed slowly, but fitted models reliably. Given the adaptability of BUGS and stan to missing data and complex model structures, these methods are feasible for smaller data sets (n < 200). The gamboost method is not a good choice for fitting function regression models because it is prone to overfitting and requires bootstrap resampling to generate confidence intervals. Given the similar performance of the fda, INLA and Gibbs methods, one option would be to fit models using all three methods and to combine or synthesize the fitted models. Such consensus methods often perform well and consistent estimates from different methods increase confidence in the fitted models (Yen et al. 2011).

The choice of function regression method should consider potential pitfalls in each method. Adjusting settings for MCMC algorithms or basis functions could be necessary to improve model performance for the fda and Gibbs methods,

[Figure 5 panels: gamboost, fda, BUGS, Gibbs, stan and INLA; x-axis: Mass (g); y-axis: β (m).]

Fig. 5. Fitted effects of drainage area on fish individual-size distributions (ISDs). The fitted curves display the effect of drainage area on abundance of fishes in different size classes. Solid lines are the mean value for each estimated curve, dashed lines give approximate 95% pointwise confidence or credible intervals for the fitted curve and the dotted line denotes no effect. Fitted parameters are shown for all methods.


but the complexity of some of these settings introduces potential pitfalls for inexperienced users. A brief guide to some settings is in Appendix S1, but the INLA method has fewer adjustable settings and is likely to have the best out-of-the-box performance. INLA fits a Bayesian model but removes the challenge of monitoring MCMC settings and performance.

Controlling the smoothness of parameter functions is important when fitting function regression models. If parameter functions are too smooth, then actual patterns in the data are blurred, but if parameter functions are too flexible, then the models are prone to overfitting (Ramsay & Silverman 2005). Parameter smoothness for Bayesian methods is controlled by the prior distributions, while parameter smoothness can be controlled using smoothing penalties for the fda and gamboost methods. Bayesian methods inherently penalize complexity, which means that they favour smooth parameter functions. A guide to adjusting and selecting these settings is presented in Appendix S1.

Existing methods for function regression have not focused extensively on Bayesian approaches (Crainiceanu & Goldsmith 2010; Stinchcombe & Kirkpatrick 2012). A Bayesian approach to statistical inference has several benefits, such as handling observation errors and missing data, the ability to incorporate multiple sources of information and uncertainty, and greater flexibility to build complex model structures. Although these extensions are not currently built into the FREE package, there is an option for users to provide their own BUGS or stan code for non-standard model structures. It is likely that more complex applications will require the flexibility of Bayesian estimation (Stinchcombe & Kirkpatrick 2012). An extension to the methods used here would be to use Bayesian nonparametric methods, which would reduce or remove the need to specify parameter details (Orbanz & Teh 2010).
Bayesian nonparametric methods have the capacity to explore an infinite-dimension parameter space, which allows flexible model structures, with the final model structure determined by the data (Orbanz & Teh 2010). These methods have not been applied to function regression, but the extension of Bayesian nonparametric methods to function regression warrants further exploration.

Correlated error structures have received much attention in ecology and evolutionary biology because they can inflate the number of ‘false positive’ results (Barnett et al. 2010). Although we compared only independent and autoregressive error structures, these options estimated parameters well even for data with complex covariance matrices (see Appendix S7). Correlated errors in the data affected model performance only when the number of subjects n was small (n < 20; Appendix S7). More complex error structures could be included in the methods used here, but were not necessary for the data sets in this study.

We focused exclusively on models with function-valued response variables and scalar predictor variables, but several alternative models are possible: (i) scalar responses with function-valued predictors [i.e. y ~ ∫ b(m) x(m) dm]; (ii) function-valued responses with function-valued predictors measured at the same argument values and whose effects are restricted to a given argument value [i.e. y(m) ~ b(m) x(m)]; and (iii) function-valued responses with function-valued predictors measured at any argument values and whose effects are not restricted to particular argument values [i.e. y(m) ~ ∫ b(m, n) x(n) dn]. Stewart-Koster, Olden & Gido (2014) used a model with a scalar response and function-valued predictor to explore the effects of function-valued flow regimes on community composition and species’ densities and there are several other potential applications for these models, such as predicting species richness or other diversity measures from trait distributions. All of these function regression models can be fitted using similar computational methods to those presented here, and our intention is to incorporate these different models into FREE.

There are many possible extensions to function regression, including model selection, hierarchical model structures and nonlinear model structures. Random effects or spatially correlated error terms can be included using the BUGS, INLA or stan methods, while nonlinear models could be fitted using polynomial transformations of predictor variables. Model selection is more challenging and often requires computationally demanding techniques such as reversible-jump MCMC. Several of our approaches (INLA, stan, BUGS) calculate marginal likelihoods and the deviance information criterion, which can be used for model selection, but the reliability of these methods for function regression is unknown. Model selection is an important challenge for function regression, and we will update the FREE package as new methods are developed.
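The three alternative model forms (i) to (iii) listed above differ only in how the coefficient function is combined with the predictor function. A minimal sketch of their linear predictors on a discrete grid is given below, assuming that the integrals are approximated by the trapezoidal rule and that intercepts and error terms are omitted. The function names are hypothetical and are not part of the FREE package.

```python
def trapezoid_weights(grid):
    """Weights w_j such that sum(w_j * f(m_j)) approximates the integral of f over the grid."""
    weights = [0.0] * len(grid)
    for j in range(len(grid) - 1):
        half_step = 0.5 * (grid[j + 1] - grid[j])
        weights[j] += half_step
        weights[j + 1] += half_step
    return weights


def scalar_on_function(beta, x, grid):
    """Form (i): y = integral of b(m) x(m) dm, giving a single scalar."""
    w = trapezoid_weights(grid)
    return sum(wj * bj * xj for wj, bj, xj in zip(w, beta, x))


def concurrent(beta, x):
    """Form (ii): y(m) = b(m) x(m), evaluated at each argument value."""
    return [bj * xj for bj, xj in zip(beta, x)]


def function_on_function(beta_surface, x, grid_n):
    """Form (iii): y(m) = integral of b(m, n) x(n) dn, one value per row of the surface."""
    w = trapezoid_weights(grid_n)
    return [sum(wk * bk * xk for wk, bk, xk in zip(w, row, x)) for row in beta_surface]


# Illustrative inputs (assumptions): x(m) = m on [0, 1], b(m) = 2 and b(m, n) = m.
grid = [i / 100 for i in range(101)]
x = list(grid)
beta = [2.0] * len(grid)
beta_surface = [[m] * len(grid) for m in grid]

y_scalar = scalar_on_function(beta, x, grid)          # integral of 2m dm over [0, 1]
y_concurrent = concurrent(beta, x)                    # 2m at each grid point
y_function = function_on_function(beta_surface, x, grid)  # m * integral of n dn = m / 2
```

In this discretized form, model (i) reduces to an ordinary linear predictor with one weighted coefficient per grid point, which is why some form of smoothing or regularization of b(m) is needed when the grid is fine relative to the number of observations.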

Acknowledgements

Martin Burd and Tapio Simula participated in discussions during the development of the ideas presented in this manuscript. JDLY acknowledges the financial support of the Victorian Life Sciences Computation Initiative and the Monash University Sir James McNeill foundation. The fish data used here were available due to the efforts of the many researchers involved in the NAWQA program (USGS). River red gum data were provided by Gillis Horner and the staff of the DSE’s Forest Resource Analysis & Modelling unit.

Data accessibility

Fish ISD data: available at http://cida.usgs.gov/nawqa_public/apex/f?p=136:1:0

FREE R package: available at https://github.com/jdyen/FREE.

References

Barnett, A.G., Koper, N., Dobson, A.J., Schmiegelow, F. & Manseau, M. (2010) Using information criteria to select the correct variance-covariance structure for longitudinal data in ecology. Methods in Ecology and Evolution, 1, 15–24.
Brown, J.H. (1995) Macroecology. The University of Chicago Press, Chicago, Illinois, USA.
Buehlmann, P. & Hothorn, T. (2007) Boosting algorithms: regularization, prediction and model fitting (with discussion). Statistical Science, 22, 477–505.
Clark, J.S. (2007) Models for Ecological Data: An Introduction. Princeton University Press, Princeton.
Crainiceanu, C.M. & Goldsmith, A.J. (2010) Bayesian functional data analysis using WinBUGS. Journal of Statistical Software, 32, 1–33.
Dalziel, A.C., Rogers, S.M. & Schulte, P.M. (2009) Linking genotypes to phenotypes and fitness: how mechanistic biology can inform molecular ecology. Molecular Ecology, 18, 4997–5017.
Hadjipantelis, P.Z., Jones, N.S., Moriarty, J., Springate, D.A. & Knight, C.G. (2013) Function-valued traits in evolution. Journal of the Royal Society Interface, 10, 20121032.


Hoffman, M.D. & Gelman, A. (2014) The No-U-Turn Sampler: adaptively setting path lengths in Hamiltonian Monte Carlo. Journal of Machine Learning Research, 15, 1351–1381.
Holling, C.S. (1965) The functional response of predators to prey density and its role in mimicry and population regulation. Memoirs of the Entomological Society of Canada, 97, 5–60.
Horner, G.J., Baker, P.J., Mac Nally, R., Cunningham, S.C., Thomson, J.R. & Hamilton, F. (2009) Mortality of developing floodplain forests subjected to a drying climate and water extraction. Global Change Biology, 15, 2176–2186.
Hothorn, T., Buehlmann, P., Kneib, T., Schmid, M. & Hofner, B. (2013) mboost: model-based boosting. URL: http://CRAN.R-project.org/package=mboost (accessed 15 April 2014).
James, G.M., Wang, J. & Zhu, J. (2009) Functional linear regression that’s interpretable. The Annals of Statistics, 37, 2083–2108.
Legendre, P. & Legendre, L. (1998) Numerical Ecology, 2nd edn. Elsevier Science B.V., Amsterdam.
Lunn, D.J., Best, N. & Whittaker, J. (2008) Generic reversible jump MCMC using graphical models. Statistics and Computing, 19, 395–408.
Lunn, D.J., Whittaker, J.C. & Best, N. (2006) A Bayesian toolkit for genetic association studies. Genetic Epidemiology, 30, 231–247.
Lunn, D.J., Thomas, A., Best, N. & Spiegelhalter, D. (2000) WinBUGS – a Bayesian modelling framework: concepts, structure, and extensibility. Statistics and Computing, 10, 325–337.
Martins, T.G., Simpson, D., Lindgren, F. & Rue, H. (2013) Bayesian computing with INLA: New features. Computational Statistics & Data Analysis, 67, 68–83.
Müller, H.-G. & Stadtmüller, U. (2005) Generalized functional linear models. The Annals of Statistics, 33, 774–805.
Murren, C.J., Maclean, H.J., Diamond, S.E., Steiner, U.K., Heskel, M.A., Handelsman, C.A., Ghalambor, C.K. & Auld, J.R. (2014) Evolutionary change in continuous reaction norms. The American Naturalist, 183, 453–467.
Orbanz, P. & Teh, Y.W. (2010) Bayesian nonparametric models. Encyclopedia of Machine Learning (eds C. Sammut & G.I. Webb), pp. 81–89. Springer, New York.
Ramsay, J.O., Hooker, G. & Graves, S. (2009) Functional Data Analysis with R and MATLAB. Springer, New York.
Ramsay, J.O. & Silverman, B.W. (2005) Functional Data Analysis, 2nd edn. Springer, New York.
Rue, H., Martino, S. & Chopin, N. (2009) Approximate Bayesian inference for latent Gaussian models using integrated nested Laplace approximations (with discussion). Journal of the Royal Statistical Society, Series B, 71, 319–392.
Stan Development Team (2014) Stan: A C++ Library for Probability and Sampling, Version 2.2. URL: http://mc-stan.org (accessed 15 April 2014).
Stewart-Koster, B., Olden, J.D. & Gido, K.B. (2014) Quantifying flow-ecology relationships with functional linear models. Hydrological Sciences Journal, 59, 1–16.
Stinchcombe, J.R. & Kirkpatrick, M. (2012) Genetics and evolution of function-valued traits: understanding environmentally responsive phenotypes. Trends in Ecology & Evolution, 27, 637–647.
U.S. Geological Survey (2001) National Water Information System (NWIS): U.S. Geological Survey database. URL: http://cida.usgs.gov/nawqa_public/apex/f?p=136:1:0 (accessed 15 April 2014).

Wang, Y., Naumann, U., Wright, S.T. & Warton, D.I. (2012) mvabund – an R package for model-based analysis of multivariate abundance data. Methods in Ecology and Evolution, 3, 471–474.
Wang, Z., Pang, X., Wu, W., Wang, J., Wang, Z. & Wu, R. (2014) Modeling phenotypic plasticity in growth trajectories: a statistical framework. Evolution, 68, 81–91.
Yen, J.D.L., Thomson, J.R. & Mac Nally, R. (2013) Is there an ecological basis for species abundance distributions? Oecologia, 171, 517–525.
Yen, J.D.L., Thomson, J.R., Vesk, P.A. & Mac Nally, R. (2011) To what are woodland birds responding? Inference on relative importance of in-site habitat variables using several ensemble habitat modelling techniques. Ecography, 34, 946–954.

Received 22 June 2014; accepted 13 October 2014
Handling Editor: Ryan Chisholm

Supporting Information

Additional Supporting Information may be found in the online version of this article.

Appendix S1. Introduction to the R package FREE.
Appendix S2. Details for computation of function regression analyses.
Appendix S3. Settings for all methods used in function regression analyses.
Appendix S4. Plots of parameter estimates for simulated data for each method.
Appendix S5. Plots of parameter estimates from models of fish and forest tree ISDs, arranged by method.
Appendix S6. Plots of parameter estimates from models of fish and forest tree ISDs, arranged by parameter.
Appendix S7. Testing the effects of correlated errors in data on fitted models.
Data S1. Sample data for the argument variable.
Data S2. Sample data for the predictor variable.
Data S3. Sample data for the response variable.