Copyright © Physiologia Plantarum 2010, ISSN 0031-9317
Physiologia Plantarum 141: 197–200. 2011
TECHNICAL FOCUS
Does using stepwise variable selection to build sequential path analysis models make sense?
Marcin Kozak(a,*) and Ricardo A. Azevedo(b)
(a) Department of Experimental Design and Bioinformatics, Faculty of Agriculture and Biology, Warsaw University of Life Sciences, Nowoursynowska 159, 02-776 Warsaw, Poland
(b) Departamento de Genética, Escola Superior de Agricultura Luiz de Queiroz, Universidade de São Paulo, Piracicaba, SP 13418-900, Brazil
Correspondence: *Corresponding author, e-mail: [email protected]
Received 14 July 2010; revised 9 October 2010
doi:10.1111/j.1399-3054.2010.01431.x
Causal inference methods – mainly path analysis and structural equation modeling – offer plant physiologists information about cause-and-effect relationships among plant traits. Recently, an unusual approach to causal inference through stepwise variable selection has been proposed and used in various works on plant physiology. This approach should not be considered correct from a biological point of view. Here, it is explained why stepwise variable selection should not be used for causal inference, and it is shown what strange conclusions can be drawn from such an analysis when one aims to interpret cause-and-effect relationships among plant traits.
Many biological processes are too complex to understand without the help of statistical methods. Unfortunately, there are so many statistical methods, each with its own aim, assumptions and interpretation, that choosing the best method for a particular problem is seldom easy. When choosing a method for interpreting biological data, it is thus extremely important to ensure that statistics aids interpretation rather than drives it toward biologically incorrect and nonsensical assumptions that nevertheless remain statistically valid. This is because what is correct from a statistical point of view does not have to be correct from a biological point of view, while for a biologist it is crucial that any analysis is both biologically and statistically valid.
Causal inference, which in general is a methodology of searching for cause-and-effect relationships among traits, is an important concept for plant physiology: it helps to improve our understanding of various physiological processes and the key factors affecting them. For example, it can suggest patterns of how crop plants develop during ontogeny, offering clues about the traits that are important in the process. Do numbers of days to
mid-flowering and to maturity influence rice grain yield or not? If yes, do they influence it directly, or maybe indirectly, through some other traits? Which plant and crop traits influence rice grain yield and milling quality during ontogeny? These are examples of questions (taken from Kozak et al. 2007) that can be analyzed through causal inference.
Statistical inference usually does not aim to provide models that reflect biological processes, but rather to offer precise assessment of populations based on samples taken from these populations; it is the biologist’s task to ensure that the models considered are correct from the biological point of view. Hence, should one want a statistical model to reflect the biological model, the former should follow hints offered by knowledge of the latter; only then can statistical inference offer interpretations about causal relationships. A simple example: we know that soil nitrogen supply at the beginning of the growing season can affect final plant height, but final plant height cannot affect soil nitrogen supply; no one would argue with that. A statistical model can be applied to study both relationships, but in terms of causal inference, only the former will be correct from a biological point of
view. Another example of the same problem is the number of days to mid-flowering, which can affect the number of days to maturity; the opposite influence is impossible. From a more physiological point of view, leaf Rubisco content (and therefore photosynthetic capacity and rate) is influenced by N supply, but cannot influence N supply. Moreover, at any moment in time, abscisic acid (ABA) content or stomatal conductance is influenced by the availability of water in the soil, but not the converse.
Abbreviations – ABA, abscisic acid; SEM, structural equation modeling.
Currently, the most important method for causal inference is structural equation modeling (SEM). Path analysis, a method which was first proposed 90 years ago (Wright 1921, 1934), is now considered a part of SEM: it is distinguished from the more general SEM because path models do not include latent (unobserved) variables. Simply put, path analysis helps to analyze and interpret associations among a set of many traits. As such, it can be used to analyze quantitative data to answer the questions asked in the first paragraph of this paper.
When analyzing causal processes, contemporary applications mainly rely on multi-level models, in which traits are placed at various levels according to their development in ontogeny and the assumed directions of causal influence (multi-level models in this meaning must not be confused with multi-level models in the mixed-effects modeling framework). A dependent variable, like rice grain yield (or several dependent variables, like rice grain yield and quality of milling; Kozak et al. 2007), is placed at the final (the last in ontogeny) level. Thus, for example in rice, the number of days to heading and number of days to maturity will be assumed to possibly influence plant height and number of tillers per plant, but will not be assumed to co-develop with them.
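The level structure described above can be written down explicitly before any statistics are computed. The following is a minimal sketch, not the procedure of any cited study: the trait names, the three-level assignment and the restriction of paths to adjacent levels are illustrative assumptions (real models may also allow paths that skip a level), and the helper names `hypothesized_paths` and `is_admissible` are hypothetical.

```python
from itertools import product

# Hypothetical level assignment for the rice example; names are illustrative only.
levels = [
    ["days_to_heading", "days_to_maturity"],  # earliest in ontogeny
    ["plant_height", "tillers_per_plant"],    # intermediate level
    ["grain_yield"],                          # dependent trait, final level
]


def hypothesized_paths(levels):
    """Return (cause, effect) pairs from each level to the next level only."""
    paths = []
    for earlier, later in zip(levels, levels[1:]):
        paths.extend(product(earlier, later))
    return paths


def is_admissible(cause, effect, levels):
    """A path is admissible only if the cause sits at a strictly earlier level."""
    rank = {trait: i for i, level in enumerate(levels) for trait in level}
    return rank[cause] < rank[effect]


paths = hypothesized_paths(levels)
for cause, effect in paths:
    print(f"{cause} -> {effect}")
```

The point of the sketch is that the direction of every path is fixed by biological knowledge (the order of the levels) and not by the data; the statistical analysis then only estimates and tests the paths that this structure allows.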
More often, possible causal relationships among the physiological traits studied can be stated very precisely, based on biological knowledge of the process, although Samborski et al. (2008), when discussing the relationship between nitrogen uptake and nitrogen uptake efficiency for cereals, showed that such decisions are not always easy to make based on contemporary knowledge. The bottom line is that path analysis assumes that the models it analyzes are biologically correct. Multiple regression behaves differently: it analyzes how all predictor traits affect a dependent trait, so all cause-and-effect associations among the predictors are completely ignored. For instance, the number of days to heading, number of days to maturity, plant height and number of tillers per plant would be treated as co-developing traits, an approach that ignores the structure of associations among them. Theoretically, SEM enables one to test directions of causal associations among traits (Shipley 2002), but this
assumes that statistical analyses of experimental results can provide conclusions about such delicate aspects of plant development as whether trait A influences trait B or vice versa. This is a matter for discussion that is beyond the scope of this paper, but the approach of assuming possible causal relations among plant traits and subsequently testing and exploring them by means of statistical methods has been commonly employed in plant sciences (Giménez-Benavides et al. 2007, Kozak et al. 2007, 2008, Schulze et al. 2006, Tisné et al. 2008, Turner et al. 2008, Vargas et al. 2010, which all employ sequential path analysis or SEM). The same approach has been taken, for example, in sequential yield component analysis (Eaton and Kyte 1978, Mądry et al. 2005).
This paper aims to point out a methodological flaw that can be found in some recent causal studies based on path analysis (Asghari-Zakaria et al. 2007, Feyzian et al. 2009, Mohammadi et al. 2003, Nemati et al. 2009, Rabiei et al. 2004, Sabaghnia et al. 2010). In these analyses, which are all based on sequential path models, the sequentiality does not refer to directions of causal associations among traits (as discussed earlier), although this is how sequentiality is normally understood and assumed in the methods of causal inference (Pearl 2000, Shipley 2002). Instead, traits are placed at particular levels based on stepwise variable selection in the regression and on analysis of their total contribution to the variation of the dependent variable (Mohammadi et al. 2003). The stepwise approach to multiple regression aims to select, from the whole set of predictors, a smaller set of predictors whose influence on the dependent trait is the highest; it is therefore based on a multiple linear regression model, and the structure of cause-and-effect relationships among the predictors is ignored.
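To make the last point concrete, the sketch below runs a plain forward selection (one common stepwise variant, and not the exact procedure of the cited studies) on simulated data that follow the chain GS31 → GS49 → GS65 → MKW → SY of the artificial wheat example in Table 2. Everything numeric here is an assumption made for illustration: the sample size, the coefficients 0.8 and 0.6, the seed, the `min_gain` stopping rule, and the helper names `ols_r2`, `forward_select` and `simulate_chain`.

```python
import random


def ols_r2(y, columns):
    """R^2 of an ordinary least-squares fit of y on the columns plus an intercept."""
    n = len(y)
    X = [[1.0] + [col[i] for col in columns] for i in range(n)]
    p = len(X[0])
    # Augmented normal equations [X'X | X'y], solved by Gauss-Jordan elimination.
    A = [[sum(X[i][r] * X[i][c] for i in range(n)) for c in range(p)]
         + [sum(X[i][r] * y[i] for i in range(n))] for r in range(p)]
    for k in range(p):
        piv = max(range(k, p), key=lambda r: abs(A[r][k]))  # partial pivoting
        A[k], A[piv] = A[piv], A[k]
        for r in range(p):
            if r != k:
                f = A[r][k] / A[k][k]
                A[r] = [a - f * b for a, b in zip(A[r], A[k])]
    beta = [A[r][p] / A[r][r] for r in range(p)]
    mean_y = sum(y) / n
    ss_tot = sum((v - mean_y) ** 2 for v in y)
    ss_res = sum((y[i] - sum(b * x for b, x in zip(beta, X[i]))) ** 2
                 for i in range(n))
    return 1.0 - ss_res / ss_tot


def forward_select(data, response, candidates, min_gain=0.01):
    """Greedily add the candidate with the largest R^2 gain until the gain is small."""
    selected, r2_path, current = [], [], 0.0
    remaining = list(candidates)
    while remaining:
        scores = {name: ols_r2(data[response],
                               [data[s] for s in selected + [name]])
                  for name in remaining}
        best = max(scores, key=scores.get)
        if scores[best] - current < min_gain:
            break
        selected.append(best)
        remaining.remove(best)
        current = scores[best]
        r2_path.append(current)
    return selected, r2_path


def simulate_chain(n=200, seed=1):
    """Simulate the assumed chain; each trait depends only on the previous one."""
    rng = random.Random(seed)
    names = ["GS31", "GS49", "GS65", "MKW", "SY"]
    data = {names[0]: [rng.gauss(0, 1) for _ in range(n)]}
    for prev, cur in zip(names, names[1:]):
        data[cur] = [0.8 * v + 0.6 * rng.gauss(0, 1) for v in data[prev]]
    return data


data = simulate_chain()
order_sy, path_sy = forward_select(data, "SY", ["GS31", "GS49", "GS65", "MKW"])
# The procedure accepts any trait as the response, including the earliest one:
order_gs31, path_gs31 = forward_select(data, "GS31", ["GS49", "GS65", "MKW"])
print("selected for SY:  ", order_sy)
print("selected for GS31:", order_gs31)
```

Note that `forward_select` receives only trait names and values; nothing tells it that GS31 is measured before GS49. It will therefore return later measurements as "predictors" of GS31 without complaint, which is a valid statement about prediction but not about cause and effect.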
These two concepts – defining sequentiality based on the biology and based on stepwise variable selection – have different philosophies and can provide different models (Table 1). This is not to say that stepwise variable selection is in general incorrect. Simply put, it was not designed for causal inference, but rather to limit a set of predictor variables to the smallest one that would explain as much variability in a dependent variable as possible. This can be extremely useful, for example, in prediction models (e.g. for predicting final cereal seed yield from various plant- and weather-related traits observed during a vegetation season), climatic models, etc. Thus, a consequence of such a definition of a causal model is that it does not have to represent causal associations among traits. This can lead, for example, to the number of grains per panicle (directly) and thousand-grain weight (indirectly) affecting rice plant height (Rabiei et al. 2004) or thousand-kernel weight in maize directly affecting the number of kernels per row (Nemati et al. 2009). Returning to the example of soil nitrogen supply, theoretically, models could also be derived in which final plant height affects nitrogen supply at the beginning of the season; see also Table 2 for another example of such inappropriately defined causal models.
This does not mean that a methodology defined in this way must provide models that are incorrect from a causal point of view; for example, the models fitted by Mohammadi et al. (2003), Asghari-Zakaria et al. (2007) and Feyzian et al. (2009) do not seem to contradict the biological knowledge of the processes they represent. However, such agreement is accidental, because such models serve as an indication of the traits that are the strongest predictors of the corresponding dependent variable, causality put aside. In addition, the fact that these models need not look biologically nonsensical does not mean they are correct. This is because such models were not built to represent biological causal associations, but to minimize the number of traits in the model.
Table 1. Comparison of philosophies of building causal models based on stepwise variable selection and causal inference methods.
Stepwise variable selection: Aims to select an optimal model in statistical terms, with the biological sense put aside. Thus, the only criterion of selection of traits to be included in the model is the statistical quality of the model. Hence, models derived based on stepwise variable selection describe a set of predictor traits that suffice to efficiently (in statistical terms) explain the variability in the dependent trait.
Causal inference: Aims to test whether hypothesized causal relations among traits are significant or not; such hypotheses are based upon the knowledge of the biology of a process being studied. Hence, models derived based on causal inference describe causal relationships among traits in the biological process, suggesting how traits develop during ontogeny. Thus, predictor traits can be significant causes of the dependent traits, thereby giving an overall view of cause-and-effect relationships among the traits being studied.
Table 2. An artificial example of a set of plant traits of winter wheat that can be studied in terms of causal associations.
Experimental design: A field experiment was designed as a randomized complete block design with four replications. Twenty winter wheat genotypes were studied. The aim was to find traits determining final seed yield (SY) of winter wheat from among the traits given below. For the analyses, genotype means of the traits were taken.
Traits to be considered: Soil Plant Analysis Development (SPAD) chlorophyll content measurement at growth stage (GS) 31 (GS31), 49 (GS49) and 65 (GS65) (according to Zadoks et al. 1974), mean kernel weight (MKW), final SY.
Hypothesized causal model to be tested (→ means that the trait at left can influence the trait at right, but not vice versa):
GS31 → GS49 → GS65 → MKW → SY
Possible (although unlikely) models for SY as a dependent trait, based on stepwise selection:
GS65 & MKW → GS31 → SY
MKW → GS65 → GS49 → GS31 → SY
Possible conclusions based on the above ill-stated models:
MKW affects chlorophyll content at GS 31
Chlorophyll content at GS 49 affects its content that was observed earlier (at GS 31)
From the above discussion, several conclusions can be drawn as follows:
• Causal inference must not be confused with statistical inference.
• Hypothesized causal relationships among traits for the purpose of building a causal model should reflect biological relationships among traits, so that the final model reflects the biology of the process.
• Because path analysis aims to draw inferences and interpretations of causal associations, the methodology of building final models based on the contribution of traits to the variation of the dependent variable should not be used. Stepwise variable selection should not be confused with causal modeling of sequentially developing traits: in biology, we are interested in the models following the latter type of modeling.
• Stepwise variable selection for multiple regression should not be used as a tool for building causal models. It should rather be reserved for the applications it was constructed for.
References
Asghari-Zakaria R, Fathi M, Hasan-Panah D (2007) Sequential path analysis of yield components in potato. Potato Res 49: 273–279
Eaton GW, Kyte TR (1978) Yield component analysis in strawberry. J Amer Soc Hort Sci 103: 578–583
Feyzian E, Dehghani H, Rezai AM, Jalali M (2009) Correlation and sequential path model for some yield-related traits in melon (Cucumis melo L.). J Agr Sci Technol 11: 341–353
Giménez-Benavides L, Escudero A, Iriondo JM (2007) Reproductive limits of a late-flowering high-mountain Mediterranean plant along an elevational climate gradient. New Phytol 173: 367–382
Kozak M, Singh PK, Verma MR, Hore DK (2007) Causal mechanism for determination of grain yield and milling quality of lowland rice. Field Crops Res 102: 178–184
Kozak M, Bocianowski J, Rybiński W (2008) Selection of promising genotypes based on path and cluster analyses. J Agr Sci 146: 85–92
Mądry W, Kozak M, Pluta S, Żurawicz E (2005) A new approach to sequential yield component analysis (SYCA): application to fruit yield in blackcurrant (Ribes nigrum L.). J New Seeds 7: 85–107
Mohammadi SA, Prasanna BM, Singh NN (2003) Sequential path model for determining interrelationships among grain yield and related characters in maize. Crop Sci 43: 1690–1697
Nemati A, Sedghi M, Sharifi RS, Seiedi MN (2009) Investigation of correlation between traits and path analysis of corn (Zea mays L.) grain yield at the climate of Ardabil region (northwest Iran). Notulae Botanicae Horti Agrobotanici Cluj-Napoca 7: 194–198
Pearl J (2000) Causality: Models, Reasoning, and Inference. Cambridge University Press, Cambridge
Rabiei B, Valizadeh M, Ghareyazie B, Moghaddam M (2004) Evaluation of selection indices for improving rice grain shape. Field Crops Res 89: 359–367
Sabaghnia N, Dehghani H, Alizadeh B, Mohghaddam M (2010) Interrelationships between seed yield and 20 related traits of 49 canola (Brassica napus L.) genotypes in non-stressed and water-stressed environments. Span J Agr Res 8: 356–370
Samborski S, Kozak M, Azevedo RA (2008) Does nitrogen uptake affect nitrogen uptake efficiency, or vice versa? Acta Physiol Plant 30: 419–420
Schulze ED, Turner NC, Nicolle D, Schumacher J (2006) Species differences in carbon isotope ratios, specific leaf area and nitrogen concentrations in leaves of Eucalyptus growing in a common garden compared with along an aridity gradient. Physiol Plant 127: 434–444
Shipley B (2002) Cause and Correlation in Biology. A User’s Guide to Path Analysis, Structural Equations and Causal Inference. Cambridge University Press, Cambridge
Tisné S, Reymond M, Vile D, Fabre J, Dauzat M, Koornneef M, Granier C (2008) Combined genetic and modeling approaches reveal that epidermal cell area and number in leaves are controlled by leaf and plant developmental processes in Arabidopsis. Plant Physiol 148: 1117–1127
Turner NC, Schulze ED, Nicolle D, Schumacher J, Kuhlmann I (2008) Annual rainfall does not directly determine the carbon isotope ratio of leaves of Eucalyptus species. Physiol Plant 132: 440–445
Vargas R, Baldocchi DD, Querejeta JI, Curtis PS, Hasselquist NJ, Janssens IA, Allen MF, Montagnani L (2010) Ecosystem CO2 fluxes of arbuscular and ectomycorrhizal dominated vegetation types are differentially influenced by precipitation and temperature. New Phytol 185: 226–236
Wright S (1921) Correlation and causation. J Agr Res 20: 557–585
Wright S (1934) The method of path coefficients. Ann Math Stat 5: 161–215
Zadoks JC, Chang TT, Konzak CF (1974) A decimal code for growth stages of cereals. Weed Res 14: 415–421
Edited by P. Gardeström