Journal of Clinical Epidemiology 71 (2016) 76–85
A tutorial on variable selection for clinical prediction models: feature selection methods in data mining could improve the results

Farideh Bagherzadeh-Khiabani(a), Azra Ramezankhani(a), Fereidoun Azizi(b), Farzad Hadaegh(a), Ewout W. Steyerberg(c), Davood Khalili(a,d,*)

(a) Prevention of Metabolic Disorders Research Center, Research Institute for Endocrine Sciences, Shahid Beheshti University of Medical Sciences, Velenjak, 1985717413 Tehran, Iran
(b) Endocrine Research Center, Research Institute for Endocrine Sciences, Shahid Beheshti University of Medical Sciences, Velenjak, 1985717413 Tehran, Iran
(c) Department of Public Health, Erasmus MC, Rotterdam, The Netherlands
(d) Department of Biostatistics and Epidemiology, Research Institute for Endocrine Sciences, Shahid Beheshti University of Medical Sciences, Velenjak, 1985717413 Tehran, Iran

Accepted 2 October 2015; Published online 22 October 2015
Abstract

Objectives: Identifying an appropriate set of predictors for the outcome of interest is a major challenge in clinical prediction research. The aim of this study was to show the application of variable selection methods, usually used in data mining, in an epidemiological study; we introduce here a systematic approach.

Study Design and Setting: The P-value-based method, usually used in epidemiological studies, and several filter and wrapper methods were implemented to select the predictors of diabetes among 55 variables in 803 prediabetic females, aged ≥ 20 years, followed for 10–12 years. To develop a logistic model, variables were selected from a training data set and evaluated on the test data set. The Akaike information criterion (AIC) and the area under the curve (AUC) were used as performance criteria. We also implemented a full model with all 55 variables.

Results: We found that the worst and the best models were the full model and the models based on the wrappers, respectively. Among the filter methods, symmetrical uncertainty gave both the best AUC and the best AIC.

Conclusion: Our experiment showed that the variable selection methods used in data mining could improve the performance of clinical prediction models. An R program was developed to make these methods more feasible and to visualize the results. © 2016 Elsevier Inc. All rights reserved.

Keywords: Data mining; Variable selection; Feature selection; Methods; Prediction; Statistical model
Conflict of interest: None.
Funding: None.
* Corresponding author. P.O. Box: 19395-4763, Tehran, Iran. Tel.: +982122432500; fax: +982122416264. E-mail address: [email protected] (D. Khalili).

1. Introduction

In prediction models, the belief that "the more the variables, the better the performance" is no longer acceptable; thus, over the past decades, the application of variable (feature) selection techniques has been fast gaining popularity in the field of data mining [1]. In particular, the high-dimensional nature of clinical data sets has attracted much attention to variable selection [2]; however, the methods developed have not yet been applied broadly in clinical prediction models. Data sets mostly contain a large number of variables of varying quality, which affects the success of the prediction model.
Irrelevant and redundant information not only is of no help to the models but may also adversely affect their performance. Selecting a small number of highly predictive variables generally helps to avoid overfitting and results in a more efficient, applicable, and comprehensible model [1–6]. A large number of variable selection methods have been developed; nevertheless, there is no general consensus on one that performs well under all conditions, so for each given data set, the best method should be chosen specifically [7].

A simple approach to variable selection is to use subject-matter knowledge obtained from a literature review and consultation with experts; however, relevant literature may not always be available, the experts' interpretation may introduce bias, and interesting new findings may be overlooked [8–10].
What is new?

Key findings
- The P-value-based methods, commonly used for variable selection in epidemiological studies, are not always optimal, and applying modern methods, usually used for feature selection in data mining, could improve the performance of prediction models.

What this adds to what was known?
- The belief that "more variables result in better performance of prediction models" is known to be false. Thus, some limited strategies are used for variable selection in epidemiological studies, such as the "literature review and expert opinion" approach or P-value-based methods.
- In this study, we showed that instead of forward or backward methods for variable selection, which are commonly evaluated by a limited number of metrics such as the P-value of the likelihood ratio, the use of newer search strategies (such as best first search or hill climbing search) in conjunction with more informative evaluation measures (area under the curve and AIC) may be a better choice.

What is the implication and what should change now?
- Providing researchers with more insight into variable selection, and with alternatives to traditional methods, is essential in clinical prediction modeling.
- Preparing facilities such as user-friendly statistical programs for variable selection is of critical importance to prognostic and diagnostic research.
Another common approach is the use of P-values to screen for statistically significant predictors, either in univariable analysis or in a multivariable forward or backward manner; this approach stems from the old belief that including nonsignificant variables in a model is not reasonable. On the one hand, important variables may be overlooked when the sample size is small; on the other hand, under multiple comparisons the risk of type I error rises and P-values are biased toward low values, increasing the probability that noise variables enter the model [6,11–14].

This article is intended to summarize variable selection strategies in a systematic way, as a tutorial, and to briefly clarify the pros and cons of each strategy. As an experiment, we implemented a whole data mining process on a real-world data set from the Tehran Lipid and Glucose Study (TLGS) [15–17]. Our aim is expressly to show how different approaches to variable selection influence the final efficiency of a clinical
prediction model, and not to clinically interpret the selected variables; to this end, we used logistic regression, a common regression model in epidemiological studies.

2. Variable selection process in data mining

2.1. General description

In general, variable selection methods repeatedly generate a subset of variables and evaluate the generated subset until a stopping criterion is met. Such a process includes three basic steps: (1) subset generation, (2) subset evaluation, and (3) stopping criteria (Fig. 1) [1,3,5,18–20].

Fig. 1. Overview of the complete study procedure.

The most common and understandable categorization of variable selection methods places them into four broad categories: filter, wrapper, embedded, and hybrid [1–5,18–24]. Filter methods generate and evaluate the subset of variables without involvement of the model (before modeling starts), which is why they are also called independent methods [3,21]. In wrapper approaches, the subset is also generated before modeling starts, but the merit of the subset of variables is measured by the performance of the model. In embedded methods, a subset of variables is proposed and evaluated during the construction of the model (such as a decision tree) [2,21,24]. Hybrid methods are a combination of the filter and wrapper approaches [1,18]. From here on, we focus on the filter and wrapper methods (Fig. 2); embedded and hybrid methods are beyond the scope of the present article.

Based on the correlation between predictors, filters are categorized into univariate and multivariate methods: univariate methods do not incorporate the correlation between predictors, whereas multivariate methods do [2,22,25]. Wrappers are all multivariate methods (Fig. 2). Based on the product of variable selection, filters are categorized into ranker and subset selector methods: rankers rank the variables based on their quality (e.g., ranking based on the chi-square statistic), whereas subset selectors generate a subset of good-quality variables [e.g., correlation-based feature selection (CFS)] [5,20,22,26,27]. Wrappers are all subset selector methods (Fig. 2).

Being independent of the model makes the filters simpler than the wrappers; it also eliminates the need to repeat variable selection for different models and makes filters applicable to high-dimensional data sets [21,23]. Wrappers usually tend to give better performance than filters because variable selection is optimized for the particular model being used; nevertheless, because the model must be refitted for each and every generated subset, wrappers are much more computationally demanding [1,2,21,23]. In the following section, we discuss the three steps of the variable selection procedure.

2.2. Steps of variable selection in data mining

2.2.1. Subset generation (search strategy)

The generation procedure uses a search strategy to generate subsets of variables. The exhaustive search strategy evaluates all 2^N candidate subsets of variables, where N is the number of variables, and thus guarantees finding the optimal subset; however, it becomes practically infeasible as the number of variables increases, a challenge which called for the development of nonexhaustive search strategies that focus on a limited number of subsets. These are commonly divided into three categories: heuristic, complete, and random search strategies [1,5,18,19,27]:
2.2.1.1. Heuristic search. Heuristic search methods speed up the process of reaching an appropriate (but not necessarily optimal) subset. Examples include the sequential forward (SF) search, the sequential backward (SB) search, and the greedy hill climbing (HC) search [1,19]. The SF starts with no variables and iteratively adds one variable at a time, selected by an evaluation criterion [1,19,22]. The SB starts with the set of all variables and proceeds by discarding, one by one, the variable yielding the worst estimated quality according to the evaluation criterion. The HC is a bidirectional procedure; that is, it evaluates all the neighbor subsets resulting from the addition or deletion of a single variable to/from the current subset in a stepwise manner and chooses the best one. The search procedures continue until a stopping criterion is met [1,21,28].

2.2.1.2. Complete search. A complete search is a heuristic one with the capability to change its previous decisions, an ability called backtracking, which results in an optimal subset [1,19]. The best first (BF) search lets the search backtrack; that is, if all possible changes to a given state have been tried without improvement, it backtracks to the last best selected subset and proceeds to an unexplored subset by trying a new variable instead of the currently selected one. The search stops when a maximum number of backtracks is reached [21,23].

2.2.1.3. Random search. This starts with a random subset of variables and generates the next subset to be evaluated by either a heuristic or a random procedure [1,19].

Fig. 2. Categorization of the data mining variable selection methods used in the present study, along with their pros and cons (all listed methods were used in our study; please see Appendix 1/Appendix A at www.jclinepi.com for abbreviations):
- Filter, univariate, ranker (Chi2, IG, GR, SU, OneR, RF). Pros: simple, fast, independent of the prediction model, easy-to-understand output (a rank of variables). Cons: overlooks correlation between predictors, overlooks the prediction model, not a fully automatic procedure (needs a method to select the subset of variables).
- Filter, multivariate, ranker (ReliefF). Pros: independent of the prediction model, easy-to-understand output (a rank of variables), considers correlation between predictors. Cons: overlooks the prediction model, not an automatic procedure (needs a method to select the subset of variables), slow.
- Filter, multivariate, subset selector (CFS, Consistency). Pros: independent of the prediction model, fully automatic procedure (offers a subset of the most important variables), considers correlation between predictors. Cons: overlooks the prediction model, slow.
- Wrapper, multivariate, subset selector (SF.AUC, SB.AUC, BF.AUC, HC.AUC, SF.AIC, SB.AIC, BF.AIC, HC.AIC). Pros: considers the prediction model, fully automatic procedure (offers a subset of the most important variables), considers correlation between predictors. Cons: dependent on the prediction model, slow.

2.2.2. Subset evaluation

The various filter methods are categorized based on the criteria used to evaluate predictors. Data-intrinsic measures such as distance measures (Euclidean distance), information measures (entropy), dependency measures (chi-square statistic), and consistency measures (Liu's measure) are the most popular [1,4,8,10,11]; all these criteria relate to the general characteristics of the data and can be calculated without constructing a model. In wrappers, on the contrary, a performance measure of the model, such as the area under the curve (AUC) or the AIC, is used to evaluate the subsets.

2.2.3. Stopping criteria

The variable selection process must have a criterion to stop the search for new subsets. Some of the commonly used criteria are as follows: (1) the search is complete; (2) a preset threshold is reached, where a threshold can be a minimum number of variables or a maximum number of iterations; (3) the next iteration fails to produce a better subset; and (4) an acceptable subset is selected according to the evaluation criterion (i.e., the evaluation criterion is good enough according to the domain expert) [1,18,21].
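To make the wrapper procedure concrete, the following minimal R sketch wires a sequential forward search to a 10-fold cross-validated AUC evaluator, in the spirit of the SF.AUC wrapper used later in this article. It relies on the FSelector [28] and pROC packages; the data frame `train` with a binary outcome `diabetes`, the fold count, and the seed are illustrative assumptions, not the authors' exact code.

```r
## A hedged wrapper sketch: sequential forward search maximizing the
## 10-fold cross-validated AUC of a logistic regression.
library(FSelector)  # forward.search(), best.first.search(), ...
library(pROC)       # auc()

set.seed(1)
cv.auc <- function(subset) {
  ## Evaluation criterion: mean out-of-fold AUC for the candidate subset
  if (length(subset) == 0) return(0)
  folds <- sample(rep(1:10, length.out = nrow(train)))
  mean(sapply(1:10, function(k) {
    fit <- glm(reformulate(subset, response = "diabetes"),
               family = binomial, data = train[folds != k, ])
    p <- predict(fit, newdata = train[folds == k, ], type = "response")
    as.numeric(auc(train$diabetes[folds == k], p))
  }))
}

predictors <- setdiff(names(train), "diabetes")
sf.auc <- forward.search(predictors, cv.auc)     # heuristic (SF) search
bf.auc <- best.first.search(predictors, cv.auc)  # complete (BF) search
## For AIC-based wrappers, return -AIC(fit) instead, because the search
## routines maximize the evaluation criterion.
```

Swapping the evaluation function, or the search routine (e.g., backward.search or hill.climbing.search), would yield the other wrapper variants in the same way.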
3. Materials and methods
3.1. Study population and measurements

We selected 803 prediabetic women, aged ≥ 20 years, from the TLGS cohort to predict the diabetes outcome. Details of the TLGS design have been published before [15–17]; briefly, it is a population-based study performed on a representative sample of residents of Tehran (n = 15,005, aged ≥ 3 years), who entered the first phase of the study in 1999–2001. Data on subjects were collected through interview, physical examination, and blood sampling at baseline, a process which was repeated every 3 years; in the second phase, 3,500 new subjects were added to the study and followed for the two next phases [17].

Prediabetes was defined as fasting plasma glucose (FPG) ≥ 100 and < 126 mg/dL or 2-h postchallenge plasma glucose (2h PG) ≥ 140 and < 200 mg/dL. Subjects taking antidiabetic drugs at baseline were excluded from the study. Of the 803 subjects with complete follow-up, 293 cases of type 2 diabetes were identified by the end of the follow-up in 2009–2012 (phase 4). Type 2 diabetes comprised subjects with FPG ≥ 126 mg/dL or 2h PG ≥ 200 mg/dL or taking antidiabetic medication in any follow-up period. Details of the study have been described elsewhere [16,17].

We considered female subjects only to increase the internal validity of our analysis. Moreover, in our data set, more variables were available for females than for males (such as history of pregnancy or menstruation), which makes variable selection more important for females in our study. Study variables included demographic, biochemical, and anthropometric measures, medical history, clinical examination, smoking status, physical activity, and history of treatment and drug use. The list of exposure variables is presented in Supplementary Table 1/Appendix B at www.jclinepi.com; incident type 2 diabetes was considered the outcome [29]. The flowchart of study subjects is presented in Supplementary Fig. 1/Appendix B at www.jclinepi.com. Our main focus was on the variable selection phase.
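Read literally, the baseline inclusion rule above amounts to a one-line filter. The sketch below is only a hedged illustration: `dat0`, `FPG`, `PG2h`, and `antidiabetic.drug` are hypothetical variable names standing in for the TLGS baseline data.

```r
## Hypothetical sketch of the baseline inclusion criteria described above
prediabetic <- with(dat0,
  ((FPG >= 100 & FPG < 126) | (PG2h >= 140 & PG2h < 200)) &
    !antidiabetic.drug)      # baseline antidiabetic drug users excluded
dat <- dat0[prediabetic, ]   # 803 prediabetic women in the actual study
```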
3.2. Preprocessing data

A multivariate iterative method was implemented to impute missing values. This method performs very well under conditions such as high dimensionality, complex interactions, and nonlinear data structures, and it can handle mixed-type correlated data [30,31]. The outcome variable was not included in the imputation, to avoid biased results. For identifying outliers, a multivariate procedure was applied because of its applicability to high-dimensional, large data sets [32].

To achieve unbiased results, we set aside 20% of the original data for the final report (test data set) and applied all the variable selection methods to the remaining 80% (training data set). Stratified sampling was implemented to preserve the actual ratio of the outcome in the training and test data sets. An overview of the complete study procedure is presented in Fig. 1.
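The preprocessing pipeline can be sketched as follows. The random-forest imputation matches the missForest package cited by the authors [30,31], whereas the object names, the seed, and the base-R stratified split are illustrative assumptions rather than the authors' exact code.

```r
## Impute missing predictor values with random-forest imputation;
## the outcome is deliberately left out to avoid leakage.
library(missForest)
imp <- missForest(dat[, setdiff(names(dat), "diabetes")])
dat.imp <- cbind(imp$ximp, diabetes = dat$diabetes)

## Stratified 80/20 split, preserving the outcome ratio in both sets
set.seed(1)
idx <- unlist(lapply(split(seq_len(nrow(dat.imp)), dat.imp$diabetes),
                     function(i) sample(i, round(0.8 * length(i)))))
train <- dat.imp[idx, ]   # variable selection is carried out here
test  <- dat.imp[-idx, ]  # held out for the final report
```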
3.3. Statistical analysis

To choose appropriate predictors of the outcome from the 55 variables, we used several common filter and wrapper approaches along with P-value-based methods and identified the most suitable approach for our particular data set. We also built a model using all the variables (full model) [28].
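As a baseline for comparison, the full model and the two performance criteria used throughout (AIC on the training fit, AUC on held-out data) take only a few lines; a minimal sketch under the same assumptions as above (`train`/`test` with outcome `diabetes`):

```r
## Full logistic model with all candidate predictors
library(pROC)
full <- glm(diabetes ~ ., family = binomial, data = train)
AIC(full)  # training AIC; lower is better
auc(test$diabetes,
    predict(full, newdata = test, type = "response"))  # test-set AUC
```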
3.3.1. Filter methods

The filter methods, depicted in Fig. 2, are discussed below.

3.3.1.1. Chi-square statistic. The chi-square statistic can be used to compute the relation of each individual variable with the outcome [4,18,24,33,34].

3.3.1.2. Information measures. Information measures are based on a concept called entropy, a measure of the uncertainty or unpredictability of a variable [23]. Three popular measures of information are as follows: (1) Information gain (IG) is the difference in the uncertainty of a variable before and after observing another variable, IG(X; Y) = H(Y) − H(Y|X); the bigger this difference, the stronger the relationship. The problem with IG is that it is biased toward variables with more categories (such as an identification number), although they may be less informative [21,23]. (2) The gain ratio (GR), GR = IG/H(X), was introduced to correct this bias by normalizing IG to the entropy of the predictor; its limitation is that it is nonsymmetrical. (3) The symmetrical uncertainty (SU) criterion, SU = 2 IG/[H(X) + H(Y)], compensates for both the bias of IG and the nonsymmetry of GR [18,21,24,34].

3.3.1.3. OneR. OneR, short for "one rule," creates a one-level decision tree for each individual variable in the data. Variables are ranked on the premise that those yielding more accurate trees are more important [18,34].

3.3.1.4. Random forest. Random forest (RF), an ensemble of decision trees, can be used to evaluate variables. It is based on the simple idea that if a variable is not important for prediction of a particular outcome, randomly permuting its values among the instances will not change the performance of the prediction model [22,35,36].

3.3.1.5. ReliefF. ReliefF is a ranker approach that accounts for correlations between predictors because it is based on a nearest-neighbor procedure. The worth of each variable is estimated by considering how well its values distinguish between neighboring subjects [19,21,22,25,34,36].

3.3.1.6. Correlation-based feature selection. The idea behind CFS is that a good variable subset contains variables uncorrelated with each other while being highly correlated with the outcome [21,23,24,33].

3.3.1.7. Consistency. Consistency measures are based on the idea that a data set containing only the selected variables must be consistent; that is, two subjects with the same predictor values must belong to the same outcome [3]. Liu's measure of consistency was used in the present study.

Because ranker approaches only deliver a ranking of all variables, we still need to select a subset of the most important variables to include in the final model. To do so, we used an approach referred to here as "Best Rank (BR) search": we built consecutive nested models using first the top-ranked variable, then the two top-ranked variables, and so on through the full ranking; the model with the best AIC was picked (see the sketch below).
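As a concrete illustration of a ranker plus the BR search, the sketch below ranks predictors by symmetrical uncertainty, via the FSelector package cited in [28], and then picks the AIC-minimizing cut point along the ranking; the object names and glm workflow are assumptions consistent with the earlier sketches, not the authors' exact code.

```r
## Rank all predictors by symmetrical uncertainty (a univariate filter)
library(FSelector)
w <- symmetrical.uncertainty(diabetes ~ ., data = train)
ranked <- cutoff.k(w, k = nrow(w))   # all variables, best first

## "Best Rank" search: nested models over the ranking, keep the best AIC
aics <- sapply(seq_along(ranked), function(k)
  AIC(glm(reformulate(ranked[1:k], response = "diabetes"),
          family = binomial, data = train)))
br.subset <- ranked[seq_len(which.min(aics))]
## Filter subset selectors need no such extra step: FSelector's cfs()
## and consistency() return a variable subset directly.
```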
In the filter subset selectors, a BF approach was implemented to generate subsets. For all the methods that could not handle numerical data (chi-square statistic, information measures, OneR, CFS, and consistency), a quintile discretization was implemented.

3.3.2. Wrapper methods

In the wrappers, the model was set to logistic regression. For the search strategies, we implemented a complete search (a BF search), two heuristic searches (an SF and an SB search), and a random search (a random HC search). For evaluating subsets, the AUC and AIC of the logistic model were calculated. Because the AUC computed on the same data used to construct the model reflects overfitting, we used 10-fold cross-validation [4,37]. The complexity term in the AIC penalizes complicated models; as a result, the AIC, which has certain cross-validation properties, is considered a good measure of prediction performance [38]. The combination of the four search strategies mentioned and the two performance criteria resulted in eight wrapper procedures (SF.AUC, SB.AUC, BF.AUC, HC.AUC, SF.AIC, SB.AIC, BF.AIC, and HC.AIC). All the data mining variable selection methods used in this article are summarized in Fig. 2, along with their pros and cons.

3.3.3. Finalizing variable selection

We decided on the AIC as our final evaluation measure; the best model is the one having the lowest AIC. Commonly, the difference between models' AICs indicates the relative support for each model. At this point, we are confronted with multiple models/hypotheses instead of a single null hypothesis, and it is clear that some models have more support than others. To assess this support, we subtracted the smallest AIC from the AIC of each model. These differences (Δi = AICi − AICmin) range from 0 upward, where 0 corresponds to the best model. The proposed rule of thumb for the relative goodness of models is as follows: models with Δi ≤ 2 have substantial support, those with 4 ≤ Δi ≤ 7 have considerably less support, and those with Δi > 10 have essentially no support [39–41]. We considered models with Δi ≤ 2 equally parsimonious and appropriate.

All the above variable selection methods and their evaluation were carried out on the training data set. We finally reported the AUCs of the appropriate models on the test data set.

4. Results

The whole data preprocessing left us with a data set of 734 records and 56 variables, of which 35 were categorical and 21 were numerical (Supplementary Table 1/Appendix B at www.jclinepi.com). By the end of follow-up, 257 cases had developed diabetes. As shown in Supplementary Table 1/Appendix B at www.jclinepi.com, of the 55 predictors, 29 variables had a P-value < 0.05 for the difference between diabetic and nondiabetic subjects in univariable analysis.
First, we constructed a logistic regression including all the variables (full model) and then performed 19 methods of variable selection (nine filter, eight wrapper, and two P-value based); each method resulted in its own subset of variables. We then constructed different logistic regressions by entering the 19 distinct subsets of variables (reduced models). The performance measures of each model (AIC, AUC, and Δi) and the number of their predictors are reported in Table 1. The complete list of the variables selected by each method is presented in Fig. 3. The following results were extracted from Table 1: Among all methods, the wrappers gave the best performance (whether the performance measure was AUC or AIC); the four AIC-based wrappers resulted in the best AICs, and the four AUC-based wrappers resulted in the best AUCs. Among the wrappers, HC.AIC resulted in the best AIC, and both SF.AUC and SB.AUC gave the best AUC. Among the filters, SU resulted in both the best AIC and the best AUC.
Table 1. Performance of models together with the number of variables each method selected

Variable selection method   AUC (95% CI)         AIC   Δi   No. of selected variables
Full model                  0.73 (0.68, 0.78)    683   53   55
Filter
  Chi2                      0.77 (0.74, 0.81)    646   16   13
  IG                        0.77 (0.74, 0.81)    646   16   13
  GR                        0.78 (0.74, 0.81)    647   17   13
  SU                        0.78 (0.69, 0.80)    643   13   12
  OneR                      0.75 (0.74, 0.82)    671   41   24
  RF                        0.78 (0.70, 0.79)    643   13   8
  Relief                    0.74 (0.73, 0.82)    662   32   36
  CFS                       0.77 (0.72, 0.80)    652   22   5
  Consistency               0.76 (0.75, 0.82)    662   32   27
Wrapper
  SF.AIC                    0.79 (0.75, 0.82)    631   1    11
  SB.AIC                    0.77 (0.74, 0.81)    632   2    19
  BF.AIC                    0.79 (0.76, 0.83)    631   1    11
  HC.AIC                    0.78 (0.74, 0.83)    630   0    16
  SF.AUC                    0.79 (0.76, 0.83)    646   16   15
  SB.AUC                    0.79 (0.76, 0.82)    650   20   25
  BF.AUC                    0.79 (0.74, 0.82)    646   16   15
  HC.AUC                    0.79 (0.74, 0.82)    639   9    16
P-value based
  SF.P-value                0.78 (0.74, 0.81)    636   6    6
  SB.P-value                0.78 (0.74, 0.81)    636   6    6

The performance of the best and worst models according to AUC and AIC is shown in bold. Abbreviations: CI, confidence interval; please see Appendix 1/Appendix A at www.jclinepi.com for other abbreviations. Δi represents the difference between a model's AIC and the minimum AIC among all models: Δi ≤ 2, substantial support; 4 ≤ Δi ≤ 7, considerably less support; Δi > 10, essentially no support (the model can be excluded from further consideration).
Fig. 3. Complete list of the variables selected by each method. Please see the Appendix 1/Appendix A at www.jclinepi.com for abbreviations.
Considering the AIC, the best model (HC.AIC) and the worst model (full model) had AICs of 630 and 683, respectively; their corresponding AUCs in the test data set were 0.78 and 0.73. Models with Δi ≤ 2 were HC.AIC, SF.AIC, BF.AIC, and SB.AIC, and their AUCs in the test data set were 0.80, 0.82, 0.82, and 0.79, respectively.
Fig. 4 shows how many times each variable was selected as a predictor by the different methods. Of the 29 variables significant in the univariable analysis, 10 remained in the best model (HC.AIC), and 6 of the nonsignificant variables also entered this model.

Fig. 4. The number of times each variable has been selected as a predictor by the methods. Please see the Appendix 1/Appendix A at www.jclinepi.com for abbreviations.

5. Discussion
In this study, we estimated the performance of a variety of filter and wrapper variable selection methods, along with P-value-based methods, in the context of a logistic regression model. A model containing all the variables (full model) was also built to show the importance of variable selection.

In epidemiology, variable selection before modeling starts is of critical importance. The choice of variables based on expert opinion and the available literature may introduce bias and may add nothing new to existing knowledge [8–10]. According to a review by Walter and Tiemeier [8], the second most common approach in epidemiology is the stepwise selection procedure based on statistical testing, which is a subject of controversy [6]; in their view, these methods have not yet been replaced by newer ones because the newer methods are not available in common user-friendly analysis software and might not be accepted by other researchers; we agree with this view. Although some of the modern methods have been warmly welcomed, they have rarely been used. Ladha and Deepa [24] and Saeys et al. [2] discussed the pros and cons of variable selection methods in data mining, some of which are discussed here and illustrated in Fig. 2.

Filter approaches (model-independent methods) are preferred by some experts who are not willing to rerun variable selection whenever one kind of model is replaced by another (e.g., logistic regression by a Cox model) [7,23,42]. Univariate filter methods, which are all rankers, do not involve many calculations and hence are simple and fast; besides, they deliver an easy-to-understand output (i.e., a weighting of variables according to their importance) and are thus preferred by some domain experts. Such methods are not fully automatic; that is, after the analysis, the expert has to choose the number of good-quality variables by some method or by eyeballing; nevertheless, the estimated weights give medical experts an idea of the importance of the variables and the ability to compare their results with the available literature and knowledge. Some experts may opt for multivariate filter approaches, which consider the correlation between predictors at the cost of more calculations [2,22,24]. In contrast, as Cehovin [22] states, wrappers give better performance than filters at the cost of many more calculations, a view in agreement with our results. To be more specific, the four AIC-based wrappers had the lowest AICs, and the four AUC-based wrappers had the largest AUCs, compared with the other methods.

We decided to consider the AIC as the final performance measure for constructing a prediction model. We agree that the AUC has been the most common measure of discrimination for prediction models with a binary outcome; however, after inclusion of substantially important variables in a model, which
corresponds to achieving an acceptable AUC level, the AUC is no longer sensitive to the addition of new important variables; in other words, it changes very little even after inclusion of a variable with a good relative risk. Thus, researchers believe that the AUC is not a proper measure for model selection; using the AUC as the only evaluation measure is not recommended, as it may cause valuable predictors to be overlooked [43–47]. We took advantage of the AUC for its popularity. The AIC, in contrast, is considered a proper measure for comparing models. We used the simple rule of thumb proposed by Burnham and Anderson to assess the relative goodness of the models in the set [39–41]. Incidentally, if all the sets of candidate variables are poor, the best model will also be poor; hence, it does not suffice to report only the AIC, which is why we also reported the AUC.

Our results confirm Harrell's view that statistical significance is not usually a good criterion for deciding which variables should be included in or excluded from the model [13]. It is noteworthy that even adjustment for multiple comparisons may not eliminate the problem of the P-value method, because after adjustment, nonsignificant variables that could help prediction will still not appear in the model (P-values take larger values after adjustment for multiple comparisons); although this method is not recommended as an absolute method of variable selection, it may still be better than some other methods. In our study, the P-value-based models had a Δi of 6 and a corresponding AUC of 0.78, which seems acceptable.

In the final step of our analysis, we found four sets of variables with substantial support as appropriate sets of predictors for our prediction model and hence faced making the most appropriate choice. Referring to the first simple approach of using subject-matter knowledge, experts can choose the set whose variables are fewer in number, simpler, and cheaper and more feasible to measure. For instance, based on our results, all the final appropriate models included waist circumference or waist-to-hip ratio as a predictor, except the SB.AIC model, which included body mass index (BMI) instead; measurement of BMI is more practical than that of waist circumference. Besides, even a model with slightly worse performance than the selected models may be preferred by the expert on subject-matter grounds. For instance, if we consider models with less support (i.e., models with 4 ≤ Δi ≤ 7) in our study, the P-value-based models, which in this particular study had the smallest number of variables (Table 1 and Fig. 3), may be preferable.

The expert may be concerned about the biological plausibility of the selected variables; of note, prediction is different from causation. A variable may be a good predictor because of its correlation with some important confounders, and a strong association without any causation may still help us to predict an outcome. Thus, in a prediction model, we do not intend to interpret a predictor as
a risk or protective factor. Indeed, it is commonly held that epidemiology can help control diseases even before their etiology is known to science.

Based on the above, we can make several recommendations. First, researchers need to know the pros and cons of the variable selection method they apply to their data set. We suggest that practitioners simply opt for variable selection method(s) from the wide range available, depending on their priorities (simplicity, comprehensibility, and predictive ability of the methods). Because there is no general rule, one could implement various optional methods and accept the most appropriate set(s) of variables according to the performance measures. Among the accepted sets of variables, the expert in the field can then choose one set based on the simplicity and feasibility of the variables in the set. Of note, the methods may select unfamiliar predictors, which introduce new hypotheses to be explored and discussed in other studies. As the last step, the final model should be checked for internal and, if needed, external validity [48]. The code of this work can be applied to any data set; readers can find the R code of "VariableSelection" in Appendix 2/Appendix A at www.jclinepi.com.

Our study has several limitations. First, we did not intend to be comprehensive: more novel methods involving many more calculations, such as the nonnegative garrote (Breiman 1995) and the lasso (Tibshirani 1996) [49], as well as embedded and hybrid methods, which are more complex, have not been covered in this work. Second, although categorizing variables reduces the explained variance, it is unavoidable for variable selection methods that cannot handle continuous variables and may otherwise perform improperly. However, despite such discretization in the variable selection phase, the selected variables were entered in their original form (with no transformation) when constructing the final models. Third, considering interactions among variables in variable selection is not a simple matter because an interaction may be 2-, 3-, ..., or n-way. If a researcher is concerned about an interaction, s/he could add the related interaction as a variable (x1 × x2) to the analysis or could split the data based on the effect-modifier variable.

We cannot guarantee that the best strategy demonstrated by our study is generalizable to other data sets; each prognostic or diagnostic study should pick its own predictor variables using methods similar to ours. It must be kept in mind that performing a simulation to choose the best method is quite problematic because of the many possible situations, involving different numbers of candidate variables, each with a different distribution and different effects on the outcome.

Our experiment showed that the variable selection methods used in data mining could improve the performance of clinical prediction models. We tried to simplify the filter and wrapper variable selection methods as a tutorial and prepared an R program to make these methods more practical and to visualize the results.
Acknowledgments The authors appreciate the contribution by investigators, staff, and participants of the TLGS Cohort Study for preparing this population-based cohort data. The authors also wish to acknowledge Ms Niloofar Shiva for critical editing of English grammar and syntax of the manuscript.
Supplementary data

Supplementary data related to this article can be found at http://dx.doi.org/10.1016/j.jclinepi.2015.10.002.

References

[1] Liu H, Yu L. Toward integrating feature selection algorithms for classification and clustering. IEEE Trans Knowl Data Eng 2005;17(4):491–502.
[2] Saeys Y, Inza I, Larrañaga P. A review of feature selection techniques in bioinformatics. Bioinformatics 2007;23:2507–17.
[3] Arauzo-Azofra A, Benitez JM, Castro JL. Consistency measures for feature selection. J Intell Inf Syst 2008;30:273–92.
[4] Guyon I, Elisseeff A. An introduction to variable and feature selection. J Mach Learn Res 2003;3:1157–82.
[5] Liu H, Motoda H. Feature selection for knowledge discovery and data mining. New York: Springer Science+Business Media; 1998.
[6] Steyerberg EW, Eijkemans MJ, Habbema JDF. Stepwise selection in small data sets: a simulation study of bias in logistic regression analysis. J Clin Epidemiol 1999;52:935–42.
[7] Wang G, Song Q, Sun H, Zhang X, Xu B, Zhou Y. A feature subset selection algorithm automatic recommendation method. J Artif Intell Res 2013;47:1–34.
[8] Walter S, Tiemeier H. Variable selection: current practice in epidemiological studies. Eur J Epidemiol 2009;24:733–6.
[9] Steyerberg EW. Clinical prediction models. New York: Springer Science+Business Media; 2009.
[10] Greenland S. Invited commentary: variable selection versus shrinkage in the control of multiple confounders. Am J Epidemiol 2008;167:523–9.
[11] Flom PL, Cassell DL. Stopping stepwise: why stepwise and similar selection methods are bad, and what you should use. In: NorthEast SAS Users Group Inc 20th Annual Conference; 11–14 November 2007; Baltimore, Maryland; 2007.
[12] Hammami D, Lee TS, Ouarda TB, Lee J. Predictor selection for downscaling GCM data with LASSO. J Geophys Res Atmos 2012;117.
[13] Harrell FE. Regression modeling strategies: with applications to linear models, logistic regression, and survival analysis. New York: Springer Science+Business Media; 2001.
[14] Austin PC, Tu JV. Automated variable selection methods for logistic regression produced unstable models for predicting acute myocardial infarction mortality. J Clin Epidemiol 2004;57:1138–46.
[15] Azizi F, Madjid M, Rahmani M, Emami H, Mirmiran P, Hadjipour R. Tehran Lipid and Glucose Study (TLGS): rationale and design. Iranian J Endocrinol Metab 2000;2(2):77–86.
[16] Azizi F, Rahmani M, Emami H, Mirmiran P, Hajipour R, Madjid M, et al. Cardiovascular risk factors in an Iranian urban population: Tehran Lipid and Glucose Study (phase 1). Soz Präventivmed 2002;47(6):408–26.
[17] Azizi F, Ghanbarian A, Momenan AA, Hadaegh F, Mirmiran P, Hedayati M, et al. Prevention of non-communicable disease in a population in nutrition transition: Tehran Lipid and Glucose Study phase II. Trials 2009;10:5.
[18] Novakovic J, Strbac P, Bulatovic D. Toward optimal feature selection using ranking methods and classification algorithms. Yugosl J Oper Res 2011;21. ISSN: 0354-0243, EISSN: 2334-6043.
[19] Dash M, Liu H. Feature selection for classification. Intell Data Anal 1997;1(3):131–56.
[20] Liu H, Motoda H, Setiono R, Zhao Z, editors. Feature selection: an ever evolving frontier in data mining. In: JMLR Workshop and Conference Proceedings 10: The Fourth Workshop on Feature Selection in Data Mining; 2010.
[21] Hall MA. Correlation-based feature selection for machine learning [PhD thesis]. Hamilton, New Zealand: Department of Computer Science, The University of Waikato; 1999.
[22] Cehovin L, Bosnic Z. Empirical evaluation of feature selection methods in classification. Intell Data Anal 2010;14(3):265–81.
[23] Hall MA. Feature selection for discrete and numeric class machine learning. Hamilton, New Zealand: Department of Computer Science, University of Waikato; 1999.
[24] Ladha L, Deepa T. Feature selection methods and algorithms. Int J Computer Sci Eng 2011;3(5):1787–97.
[25] Megchelenbrink W, Marchiori E, Lucas P. Relief-based feature selection in bioinformatics: detecting functional specificity residues from multiple sequence alignments [Master thesis]. Nijmegen: Department of Information Science, Radboud University; 2010.
[26] Novakovic J. The impact of feature selection on the accuracy of Naïve Bayes classifier. In: 18th Telecommunications Forum TELFOR; 2010.
[27] Yu L, Liu H. Efficient feature selection via analysis of relevance and redundancy. J Mach Learn Res 2004;5:1205–24.
[28] Romanski P, Kotthoff L. FSelector: selecting attributes. R package version 0.19; 2014.
[29] Report of the expert committee on the diagnosis and classification of diabetes mellitus. Diabetes Care 1997;20(7):1183–97.
[30] Stekhoven DJ, Bühlmann P. MissForest - non-parametric missing value imputation for mixed-type data. Bioinformatics 2012;28:112–8.
[31] Stekhoven DJ. missForest: nonparametric missing value imputation using random forest. R package version 1.3; 2013.
[32] Filzmoser P, Gschwandtner M. mvoutlier: multivariate outlier detection based on robust methods. R package version 2.0.6; 2015.
[33] Liu H, Li J, Wong L. A comparative study on feature selection and classification methods using gene expression profiles and proteomic patterns. Genome Inform 2002;13:51–60.
[34] Jensen R, Shen Q. Feature selection for aiding glass forensic evidence analysis. Intell Data Anal 2009;13(5):703–23.
[35] Livingston F. Implementation of Breiman's random forest machine learning algorithm. ECE591Q Machine Learning Journal Paper; 2005.
[36] Strobl C, Hothorn T, Zeileis A. Party on! The R Journal 2009;1(2):14–7.
[37] Sewell M. Feature selection. 2007. Available at: http://machine-learning.martinsewell.com/feature-selection. Accessed November 8, 2015.
[38] Stone M. An asymptotic equivalence of choice of model by cross-validation and Akaike's criterion. J R Stat Soc Series B Stat Methodol 1977;39(1):44–7.
[39] Burnham KP, Anderson DR. Information theory and log-likelihood models: a basis for model selection and inference. In: Model selection and inference. New York: Springer; 1998:32–74.
[40] Burnham KP, Anderson DR. Model selection and multimodel inference: a practical information-theoretic approach. New York: Springer-Verlag; 2002.
[41] Burnham KP, Anderson DR. Multimodel inference: understanding AIC and BIC in model selection. Sociol Methods Res 2004;33(2):261–304.
[42] Senliol B, et al. Fast Correlation Based Filter (FCBF) with a different search strategy. In: International Symposium on Computer and Information Sciences (ISCIS 2008). Istanbul, Turkey: Istanbul Technical University, Suleyman Demirel Cultural Center; 2008.
[43] Spitz MR, Amos C, D'Amelio A Jr, Dong Q, Etzel C. Re: discriminatory accuracy from single-nucleotide polymorphisms in models to predict breast cancer risk. J Natl Cancer Inst 2009;101(24):1731.
[44] Cook NR. Use and misuse of the receiver operating characteristic curve in risk prediction. Circulation 2007;115:928–35.
[45] Pepe MS, Janes H, Longton G, Leisenring W, Newcomb P. Limitations of the odds ratio in gauging the performance of a diagnostic, prognostic, or screening marker. Am J Epidemiol 2004;159(9):882–90.
[46] Pencina MJ, D'Agostino RB, Massaro JM. Understanding increments in model performance metrics. Lifetime Data Anal 2013;19:202–18.
[47] Biswas S, Arun B, Parmigiani G. Reclassification of predictions for uncovering subgroup specific improvement. Stat Med 2014;33:1914–27.
[48] Steyerberg EW, Harrell FE Jr. Prediction models need appropriate internal, internal-external, and external validation. J Clin Epidemiol 2015; http://dx.doi.org/10.1016/j.jclinepi.2015.04.005 [Epub ahead of print].
[49] George EI. The variable selection problem. J Am Stat Assoc 2000;95:1304–8.