Journal of Clinical Epidemiology 64 (2011) 1463–1469
LETTERS TO THE EDITOR

Performance of logistic regression modeling: beyond the number of events per variable, the role of data structure

To the Editor:

In their commentary, Steyerberg et al. [1] point out both a possible solution to decrease bias and an additional source of bias. These comments expand rather than contradict our findings, and we thank the authors for thus providing a more comprehensive view of the various predictors of correct (or incorrect) estimation of parameters [2].

Although it is true that there are alternatives to maximum likelihood estimation (MLE) that guarantee convergence, these methods are not implemented by default in most currently available statistical analysis programs. Thus, most researchers will continue to encounter convergence problems. Because convergence problems occur when data are separated or nearly separated (i.e., when the distributions of the predictor under Y = 1 and Y = 0 do not overlap), statistical programs should report whether separation is present. They could also be programmed to apply alternatives to MLE whenever convergence problems occur, particularly when data are separated or nearly separated [3–5].
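As an illustration of the kind of diagnostic we have in mind, the sketch below (our own, on arbitrary simulated data, not part of any existing program) flags complete separation for a single continuous predictor in R; a real implementation would also need to detect quasi-separation and separation along linear combinations of several predictors.

    set.seed(1)
    x <- rnorm(40)
    y <- as.integer(x > 0)   # deliberately separated: x fully determines y

    # Complete separation: the ranges of x under y = 1 and y = 0 do not overlap
    overlap <- min(x[y == 1]) <= max(x[y == 0]) &&
               min(x[y == 0]) <= max(x[y == 1])
    if (!overlap) warning("complete separation: the MLE does not exist")

    # Standard glm() fits anyway, emitting only the generic warning
    # "fitted probabilities numerically 0 or 1 occurred"
    fit <- glm(y ~ x, family = binomial)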
We also agree that selection bias, which arises from overfitting the model to a specific sample, may be larger than the estimation bias, especially when model selection is purely data-driven. In that situation, two phenomena contribute to bias. First, when power is low, only large estimates are statistically significant and are retained in the model (bias due to the covariate selection method). Second, if important covariates are omitted because they lack significance, the effects of the remaining covariates may be incorrectly adjusted and hence biased (bias due to under-adjustment). Under-adjustment bias will potentially be stronger when the predictors are more strongly correlated.

To explore this issue, we used the simulation results obtained in the main article for seven continuous predictors with a true odds ratio of 1.5 per standard deviation and correlations among predictors of either 0.2 or 0.7 (Fig. 1). Like Steyerberg et al. [1], we either considered the predictors as prespecified or selected only those predictors with statistically significant effects. As expected, the relative bias among the selected coefficients increases as events per variable (EPV) decreases. Moreover, the correlations among predictors influence relative bias in selected models but not in prespecified models. Indeed, at identical EPV, relative bias approximately doubles between selected models with weakly correlated covariates (r = 0.2) and selected models with strongly correlated covariates (r = 0.7). This confirms the impact of omitting nonsignificant confounding factors when estimating the effects of the significant variables.

Fig. 1. Relative bias of the estimate.
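For readers who wish to reproduce the flavor of this experiment, the following condensed R sketch (ours; the sample size, number of replications, and seed are arbitrary, and the original study's code is not reproduced here) draws seven correlated standardized predictors with true OR = 1.5 each, retains only the statistically significant coefficients, and averages their relative bias. Rerunning it with r = 0.2 versus r = 0.7 shows the correlation effect described above.

    library(MASS)                         # mvrnorm() for correlated predictors
    set.seed(2)
    p <- 7
    beta  <- rep(log(1.5), p)             # true OR = 1.5 per SD for each predictor
    r     <- 0.7                          # compare r = 0.2 vs r = 0.7
    Sigma <- matrix(r, p, p); diag(Sigma) <- 1
    n     <- 70                           # about 35 events, i.e., EPV near 5

    rel_bias <- replicate(500, {
      X <- mvrnorm(n, mu = rep(0, p), Sigma = Sigma)
      y <- rbinom(n, 1, plogis(as.vector(X %*% beta)))
      f <- suppressWarnings(glm(y ~ X, family = binomial))
      s <- summary(f)$coefficients[-1, , drop = FALSE]   # drop intercept
      keep <- s[, "Pr(>|z|)"] < 0.05      # data-driven selection step
      if (!any(keep)) NA
      else mean((s[keep, "Estimate"] - beta[keep]) / beta[keep])
    })
    mean(rel_bias, na.rm = TRUE)          # relative bias among selected coefficients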
Relative bias in predictors is an important problem for diagnostic and prognostic scores. We agree with Steyerberg et al. that the best statistical and clinical solution for obtaining unbiased coefficients is to preselect a set of predictors based on subject-matter knowledge. However, data-driven model selection will often remain necessary when little prior knowledge is available (e.g., after discovery of a new pathogen, or in genomics). We also agree that guidelines on the number of EPV appropriate for data-driven model selection should be more demanding than the guidelines (intended for parameter estimation) provided in the main article. Nevertheless, the principle of taking data structure into account when determining the required number of EPV remains relevant.

Delphine S. Courvoisier*
Christophe Combescure
Thomas Agoritsas
Angele Gayet-Ageron
Thomas V. Perneger
Faculty of Medicine, University of Geneva, Switzerland
Division of Clinical Epidemiology, University Hospitals of Geneva, 6, rue Perret-Gentil, 1205 Geneva, Switzerland
*Corresponding author.
E-mail address: [email protected] (D.S. Courvoisier)
References

[1] Steyerberg EW, Schemper M, Harrell F. Logistic regression modeling and the number of events per variable: selection bias dominates. J Clin Epidemiol 2011;64:1464–5 [in this issue].
[2] Courvoisier DS, Combescure C, Agoritsas T, Gayet-Ageron A, Perneger TV. Performance of logistic regression modeling: beyond the number of events per variable, the role of data structure. J Clin Epidemiol 2011;64:993–1000.
[3] Heinze G. A comparative investigation of methods for logistic regression with separated or nearly separated data. Stat Med 2006;25:4216–26.
[4] Heinze G, Ploner M. Fixing the nonconvergence bug in logistic regression with SPLUS and SAS. Comput Methods Programs Biomed 2003;71:181–7.
[5] Heinze G, Schemper M. A solution to the problem of separation in logistic regression. Stat Med 2002;21:2409–19.

doi: 10.1016/j.jclinepi.2011.06.013
Logistic regression modeling and the number of events per variable: selection bias dominates

To the Editor:

Courvoisier et al. [1] report an important study on the issue of the number of events per variable (EPV) in logistic regression modeling. The article clearly shows that EPV > 10 is no guarantee of unbiased estimation of regression coefficients, and that considerable optimism may remain in performance as quantified by the area under the receiver operating characteristic curve.

A first comment concerns the nonconvergence in some of the reported simulations. Conclusions from such simulations are questionable because results are informatively missing for some simulated samples (in particular, those with more extreme regression effects). Modifications of logistic (and Cox) regression are now available that guarantee convergence; they eliminate the occurrence of monotone likelihood by means of Firth's modified score equations, originally developed to reduce the general bias of maximum likelihood estimates [2,3]. The required procedures, such as penalized maximum likelihood estimation, penalized likelihood ratio tests, and profile penalized likelihood confidence intervals, have been implemented in standard software (SAS 9.2/proc logistic) and in add-on packages (STATA/firthlogit and R/logistf). The occurrence of nonconvergence of parameter estimates under standard estimation may hence be of little relevance to the discussion of the role of EPV.
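As a pointer for readers, fitting such a model with the R package named above takes a single call; the data set and variable names in this minimal sketch are hypothetical.

    library(logistf)              # Firth-penalized logistic regression
    # 'outcome', 'age', and 'exposure' are hypothetical columns of 'mydata'
    fit <- logistf(outcome ~ age + exposure, data = mydata)
    summary(fit)                  # penalized MLEs with profile penalized
                                  # likelihood confidence intervals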
Second, it is important to stress that the authors studied only prespecified models, with 1, 7, or 25 predictors. One predictor implies a univariate analysis; seven is a realistic number for a typical prediction model; and 25 is a number considered in initial, exploratory epidemiological analyses. We note that the authors suggest that the set of seven nonnull predictors "may arise in a later stage of modeling, after covariates that show weak associations with the outcome have been eliminated from consideration" [1]. Indeed, model structure is rarely fully prespecified in epidemiological studies. Analysts make all kinds of decisions based on the data under study. The broad distinction in a regression context is between modeling choices based only on studying the predictors, such as their frequency distributions, and choices based on examining predictor–outcome relationships [4]. The latter include graphical inspection of the relationships of continuous predictors with the outcome, definition of optimal cut points, univariate prescreening of candidate predictors, and stepwise selection methods. This process may lead to a model that some consider optimal. It is, however, optimal only for the specific data under study. A well-known problem is that the resulting models may not validate well in new subjects [4–7].

A specific issue is that regression coefficients in stepwise-selected models are biased [8]. This bias has been termed "testimation" bias, to indicate the problem of estimation after testing. An intuitive analogy is publication bias: trials with statistically significant results have a higher chance of being published, leading to overly extreme effect estimates in summaries of published reports. The problem of biased regression coefficients after stepwise selection is insufficiently recognized, and we therefore illustrate it again below. We used the simulation scheme described by Courvoisier et al. for a single predictor with an odds ratio (OR) of 1.5 or 2 per standard deviation, with 2,000 replications for greater stability. If this predictor is prespecified in a logistic regression model, we confirm that some bias occurs for EPV < 10; for EPV > 5, however, the bias was less than 20%. If we select only predictors with statistically significant effects, the bias in the few selected coefficients is enormous. For EPV = 10, the bias was 300% (OR = 1.5) and 140% (OR = 2), that is, median ORs of 5.1 and 5.2 rather than 1.5 and 2, respectively. This illustrates that selection bias dominates the bias that may occur in a prespecified model (Fig. 1). The implication is that analysts confronted with an unfavorable EPV cannot save themselves by selecting only statistically significant predictors [4,9]. If we start with 25 predictors and seven appear to have strong, statistically significant effects (so that 18 nonsignificant predictors are omitted), the bias in the coefficients of these seven predictors will be far larger than the bias from simply estimating the coefficients in a prespecified 25-predictor model, even with low EPV. Careful preselection of a limited set of candidate predictors based on subject-matter knowledge remains key to good modeling, along with summary predictors to save degrees of freedom [10] and model updating [11].
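To make the phenomenon concrete, here is a condensed R sketch of a simulation in this spirit (our illustration, not the authors' code; the sample size and seed are arbitrary): a single standardized predictor with a true OR of 1.5, roughly EPV = 10, and retention of the estimate only when it is significant at the 5% level.

    set.seed(3)
    b <- log(1.5)                         # true OR = 1.5 per SD
    selected_or <- replicate(2000, {
      x <- rnorm(20)                      # about 10 events: EPV near 10
      y <- rbinom(20, 1, plogis(b * x))
      f <- suppressWarnings(glm(y ~ x, family = binomial))
      co <- summary(f)$coefficients["x", ]
      if (co["Pr(>|z|)"] < 0.05) exp(co["Estimate"]) else NA
    })
    median(selected_or, na.rm = TRUE)     # far above 1.5: testimation bias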