Letter to the Editors
www.molinf.com
DOI: 10.1002/minf.201400030
External Evaluation of QSAR Models, in Addition to Cross-Validation: Verification of Predictive Capability on Totally New Chemicals

Paola Gramatica*[a]

[a] P. Gramatica
QSAR Research Unit in Environmental Chemistry and Ecotoxicology, Department of Theoretical and Applied Sciences, University of Insubria, Via Dunant 3, 21100 Varese, Italy
*e-mail: [email protected]
Dear Editors,

An interesting paper by Gütlein et al., recently published in your journal,[1] has reopened the debate on the crucial topic of QSAR model validation, which, over the past decade, has been the subject of wide discussion in the scientific and regulatory communities. Many notable scientific papers have been published (I cite here only a few of the most pertinent[2–17]) with different underlying ideas on the "best" way to validate QSAR models, using various methodological approaches: a) only by cross-validation (CV),[1,6–9] simple or double CV; b) by an additional external validation,[2–5,10–17] (better if verified, in my opinion, by different statistical parameters),[15–18] after the necessary preliminary internal validation by CV. The common final aim is to propose good QSAR models that are not only statistically robust, but also have a verified high predictive capability. The discrepancy between these two approaches lies in this point: how to verify the predictive performance of a QSAR model when it is applied to completely new chemicals.

In the Introduction to their paper,[1] Gütlein et al. wrote: "Many (Q)SAR researchers consider validation with a single external test set as the 'gold standard' to assess model performance and they question the reliability of cross-validation procedures". In my opinion, this point is not commented on clearly, at least in reference to my cited work,[10] so I wish to clarify my validation approach in order to highlight and resolve some misunderstandings.

First of all, I am sure that no good QSAR modeller can disagree that CV (not simply by LOO, but also by LMO and/or bootstrap) is a necessary preliminary step in any QSAR validation, and it is unquestionably the best way to validate each model for its statistical performance, in terms of the robustness and predictivity of the partial sub-models on the chemicals that are iteratively put aside (held out) in the test sets. According to some authors,[2–14] including me,[10] this should be defined as internal validation because, at the end of the complete modelling process, the molecular structures of all the chemicals have been seen within the validation procedure, and their structural information has contributed to the molecular descriptor selection, at least in the CV runs in which they were iteratively placed in the training sub-set. Therefore, they are not really external (completely new) to the final model.
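To make this internal validation step concrete for the reader, the sketch below shows one way to estimate Q2LOO and Q2LMO for a simple OLS model (illustrative Python/scikit-learn code, not the QSARINS implementation; the pooled-PRESS form of Q2 used here is one common variant among several):

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import LeaveOneOut, ShuffleSplit

def q2_cv(X, y, splitter):
    """Cross-validated Q2: the residuals of the held-out chemicals are
    pooled over all iterations (PRESS) and compared with the deviations
    of the held-out responses from the corresponding training-subset
    means (one common way of writing Q2; other variants exist)."""
    X, y = np.asarray(X, float), np.asarray(y, float)
    press = tss = 0.0
    for train, test in splitter.split(X):
        sub = LinearRegression().fit(X[train], y[train])  # fold sub-model
        press += np.sum((y[test] - sub.predict(X[test])) ** 2)
        tss += np.sum((y[test] - y[train].mean()) ** 2)
    return 1.0 - press / tss

# Q2LOO: every chemical is held out exactly once;
# Q2LMO: "many" (here 30%) are held out, repeated 100 times.
# q2_loo = q2_cv(X_tr, y_tr, LeaveOneOut())
# q2_lmo = q2_cv(X_tr, y_tr, ShuffleSplit(100, test_size=0.3, random_state=1))
```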
Indeed, internal validation parameters for proposed QSAR models must always be reported in publications to guarantee model robustness. Moreover, in QSAR modelling it is important to distinguish an approach that proposes predicted data from a specific single model (easily reproduced by any user) from an approach that produces predicted data by averaging the results of multiple models, and therefore by a more complex algorithm. In my research I always apply the first approach, while the work discussed by Gütlein et al. in their paper uses the second one. The reason to prefer a single model, i.e. a unique specific regression equation based on a few selected descriptors with their relative coefficients, is mainly that the "unambiguous algorithm" (requested by the second of the well-known "OECD Principles for the validation of QSAR models" for applicability in regulation[19]) should be as simple and as easily reproducible as possible, and therefore easily applicable by a wide number of users, including regulators under REACH, the new European legislation on chemicals.

According to Principle 4, discussed in depth in my previous paper[10] and in the Guidance Document on the OECD Principles,[20] the model must be verified for its goodness of fit (by R2), robustness (by internal cross-validation: Q2LOO and Q2LMO) and external predictivity (on external set compounds, which did not take part in the model development). The Guidance Document also makes a clear distinction between internal and external validation in this sense. Only models with good internal validation parameters, which guarantee their robustness, should be chosen from among all the single models obtained by using the Genetic Algorithm (GA) as the method for descriptor selection in Ordinary Least Squares (OLS) regression (my QSAR approach, as implemented in my in-house software QSARINS).[18] However, my personal experience (and not only mine)[5,10] is that some QSAR models show good performance when verified
by CV, but are unable to predict really unseen new chemicals well (see as examples models 4–6 in Table 1, which are over-optimistically verified as predictive according to CV while, by the various statistical parameters used for external validation,[15,16] they demonstrate their inability to predict new independent chemicals when applied to two different prediction sets). This is evidence that CV is necessary, but not sufficient, to guarantee predictivity for new external chemicals.[5,10,20] Therefore, it is not a "psychological argument" (as stated by Gütlein et al.), but rather the different "philosophical approach" of a QSAR modeller who wishes to propose only cross-validated single models that are additionally verified for their possible predictivity on truly never-seen chemicals, to guarantee a wider generalizability. Certainly we would never wish to propose some of the models that could be present in the GA population of CV-validated OLS models, such as some models in Tab. 1 of my paper on the principles of QSAR model validation[10] and the externally unpredictive models no. 4–6 in Table 1 here.

In my cited paper, "Principles of QSAR models validation: internal and external",[10] I clarified the different aspects of what are, in my opinion and in the OECD Guidance Document for QSAR model validation,[20] internal and external validation, but additional clarification is needed and is provided here. The question that requires an answer, obtainable only by an additional external validation of one specific QSAR model, is: "Is the developed model, whose robustness has been validated by CV, also able to predict completely new chemicals?" These external chemicals must never be included in the training sets during the complete process of model development, not even in one single iteration of the k-fold CV procedure; therefore, their structural information must NEVER be taken into account. Psychologically, the best external set of new chemicals for this evaluation (the so-called "blind" set) would be one that becomes available to a QSAR modeller only after the model development; such a set could also be called a "temporal" set. However, it is very rare to have a blind data set, owing to limited data availability and time constraints (we would have to wait for new experimental data to become available for the model evaluation). Therefore, if the QSAR modeller wishes to verify the real predictivity of a model before proposing his "best" single model, his only option is to exploit the actual data availability, sacrificing part of these
data in a preliminary step before the model development, putting aside these "supposedly unknown" compounds for later use in the evaluation of the models, which are developed only on the remaining training set used in the learning process. In terms of validation procedure there is no difference between an external data set that is temporally delayed and an external set obtained by splitting an available data set. The chemicals put aside in this preliminary splitting step constitute the set that, preferably, should be called the external prediction set,[10,15–18] or the external evaluation set,[14] to be clearly distinguished from the iterative test sets of CV.

Therefore, it should be clear that, in this approach, these two validations of single models have completely different aims and cannot be used as parallel or alternative processes, but only as sequential ones. The aim of CV is the preliminary validation of each single model in the GA population, and to help in the selection of the most robust and internally predictive models; the use of external prediction sets has the subsequent goal of evaluating each single model (based only on the structural information found in the training set compounds, and having passed the previous cross-validation) with regard to its predictivity on actually "unseen" compounds whose chemical structures, as already pointed out, have NEVER influenced the descriptor selection. The single model in the GA population of CV-validated robust models which, simulating a real application of the model, also shows prediction performances on the prediction set compounds (measured by Q2ext or CCC)[15,16] similar to the internal ones (measured by Q2LOO and Q2LMO) is preferred as a verified externally predictive model and is our proposal (for instance, models no. 1–3 in Table 1). To avoid misunderstandings on this point, it is probably useful, and better, to define this additional check on really external compounds as the "external evaluation" or "external verification" of a specific QSAR model, before its proposal.

In my recent works (see Gramatica et al.,[17] as an example), after having checked, at the end, that the molecular descriptors selected in a robust specific model, taking information only from the structures of the training chemicals, are also able to successfully predict completely new chemicals (the prediction sets), the same descriptors are used to redevelop a full model on the complete data set to exploit all the available information.
Table 1. Comparison of internal and external validation parameters for some algae toxicity models.

(a) Splitting by structure (Kohonen maps)

Model  Variables                     R2     Q2LOO   Q2LMO   Q2ext-Fn         CCCext
1      T(N..S), AEigZ, Seigv*        0.83   0.76    0.72    0.72–0.84        0.87
2      AEigm, F08[O-O], Seigv        0.84   0.80    0.76    0.72–0.84        0.84
3      nDB, X2sol, JGI4              0.80   0.70    0.66    0.80–0.88        0.87
4      Xindex, F07[C-Cl], F08[O-O]   0.84   0.76    0.74    (−0.02)–0.40     0.62
5      nDB, Xt, F08[N-O]             0.84   0.77    0.75    (−0.13)–0.34     0.60
6      Xt, nCONN, nCXr               0.83   0.79    0.77    (−0.43)–0.16     0.58

(b) Splitting by ordered response (same variables per model)

Model  R2     Q2LOO   Q2LMO   Q2ext-Fn           CCCext
1      0.85   0.79    0.77    0.73–0.79          0.86
2      0.87   0.83    0.81    0.69–0.76          0.82
3      0.83   0.76    0.74    0.70–0.77          0.86
4      0.83   0.75    0.73    0.10–0.32          0.62
5      0.80   0.72    0.71    0.02–0.25          0.69
6      0.84   0.81    0.78    (−0.36)–(−0.04)    0.56

* Model published by Gramatica et al.[10] For CCC (Concordance Correlation Coefficient) see the literature.[15,16] Q2ext-Fn: range over the different external Q2 formulations (Fn).[15,16]
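Since Table 1 relies on CCC for the external sets, a minimal sketch of its computation may be useful to the reader (illustrative code following the standard definition of Lin's coefficient discussed in the cited literature;[15,16] the function name is mine):

```python
import numpy as np

def ccc(y_obs, y_pred):
    """Lin's Concordance Correlation Coefficient:
    CCC = 2*cov(o, p) / (var(o) + var(p) + (mean(o) - mean(p))**2).
    Unlike a plain correlation, it also penalizes systematic bias, which
    is why it is a strict criterion for external predictivity."""
    o, p = np.asarray(y_obs, float), np.asarray(y_pred, float)
    cov = np.mean((o - o.mean()) * (p - p.mean()))  # biased covariance
    return 2.0 * cov / (o.var() + p.var() + (o.mean() - p.mean()) ** 2)
```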
Figure 1. Scheme of the model development and validation procedure described in the text (modified from Figure 1 in Gramatica et al.[17]).
This full model (which is based on the descriptors selected from the training sets, but has different coefficients) is normally our final QSAR model, which we propose.[17] This procedure is schematized in Figure 1 (modified from Figure 1 in Gramatica et al.[17]).

I hope that it is now clearer that, in my approach, "external validation" is not opposed to CV, nor is it normally used as a substitute for CV. It is an additional evaluation simulating a temporal "supposedly unknown" set, while the comparative exercise of Gütlein et al.[1] has a different purpose from that of the usual application of external validation of QSAR models (particularly single models). Gütlein et al. recall that Hawkins, in his papers,[7,8] recommends a "true CV" without additional external validation, to avoid the loss of information. The "true CV" should be done by selecting the modelling features within each fold of the CV procedure; however, it is important to note that in this way several different parallel sub-models, which are also based on different descriptors, are obtained, one in each fold. The predictions of these parallel, Hawkins-type sub-models could be used, for instance, in a consensus approach.
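To make this contrast concrete, the sketch below illustrates a Hawkins-type "true CV" (illustrative code; a simple univariate filter stands in for a real descriptor-selection method such as GA): the selection is repeated inside every fold, so each fold yields its own sub-model, generally with different descriptors and certainly with different coefficients.

```python
import numpy as np
from sklearn.feature_selection import SelectKBest, f_regression
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold

def true_cv_predictions(X, y, n_folds=5, n_desc=3, seed=1):
    """Hawkins-type "true CV": descriptor selection is redone inside every
    fold, so no held-out chemical ever influences the selection step.
    Returns the pooled hold-out predictions of the per-fold sub-models."""
    X, y = np.asarray(X, float), np.asarray(y, float)
    y_hat = np.empty_like(y)
    for train, test in KFold(n_folds, shuffle=True, random_state=seed).split(X):
        sel = SelectKBest(f_regression, k=n_desc).fit(X[train], y[train])
        sub = LinearRegression().fit(sel.transform(X[train]), y[train])
        # sel.get_support() and sub.coef_ generally differ from fold to
        # fold: these sub-models are NOT the finally reported model.
        y_hat[test] = sub.predict(sel.transform(X[test]))
    return y_hat  # could feed, for instance, a consensus approach
```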
In my opinion, the interesting results of the Gütlein et al. paper demonstrate that double CV can give less variable and better results (the lowest average prediction error for the available and modelled data) than external validation verified on a single set, but only if these two kinds of validation are used as alternative validation methods. However, the better prediction performances of the double cross-validated models are, in my opinion, logically expected, simply because all the structural information of the complete data set has, in the end, been seen by the overall final algorithm of this ensemble of sub-models. In fact, the chemicals considered as new or "unseen" are new only in each iterated fold, when they are held out in the test sets, but not for the complete algorithm. Some comments on Cross Validated (http://stats.stackexchange.com/), a question-and-answer site for statisticians, data analysts, data miners and data visualization experts, confirm this point: "cross-validation is used to optimize a model, but no 'real' validation is done", and "Cross-validation will lead to lower estimated errors, which makes sense because you are constantly reusing the data, …".

Moreover, I wish to clarify a point that seems not to be shared by Gütlein et al., namely that "the models that are built and validated on the folds are different from the finally reported model",[3,1] exemplifying with an MLR model developed by GA-OLS, as in our QSARINS software.[18] I recall here that in every run of the k-fold CV, the chemicals that are put in the test set are predicted by the respective sub-model, which is developed on a subset of chemicals that is different in each iteration. Even if the selected molecular descriptors are the same in each sub-model of the iterative runs and in the finally reported model, the coefficients of the descriptors in the model equation are different, because in each iteration the training set is different. It is also important to highlight that the final model obtained from the training set, with exactly its coefficients, must be applied to the external prediction set; it is not correct to rebuild the model on these compounds, otherwise one obtains a new model that simply fits the external data rather than predicting them.

An additional crucial aspect that must be taken into consideration is that the composition of the prediction set(s) (as is also true for the training set) has a marked influence on the results. It is not reasonable to verify model predictivity on too few chemicals, as is sometimes done, because the results could be good just by chance. Certainly, small input data sets are the most problematic for performing a reliable external evaluation, and in these cases CV is the only validation that makes sense. The nature of the chemicals in the prediction set is also important: they should have a feature range and distribution similar to the training data, because any QSAR model can be reliably applied (and therefore must also be verified) only on chemicals belonging to the Applicability Domain (AD) of the training set.[5,10,17,20] If the external chemicals are outside the model AD, the verification of predictive performance would be performed in an extrapolation zone of less reliable predictions.
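One widely used way to implement such an AD check for OLS models is the leverage approach, which underlies the Williams plot; a minimal sketch with the conventional warning threshold h* = 3p'/n follows (illustrative code, not the QSARINS implementation):

```python
import numpy as np

def inside_ad(X_train, X_ext):
    """Leverage-based Applicability Domain check for an OLS model (the
    basis of the Williams plot). External chemicals with leverage
    h > h* = 3*p'/n (p' = number of model parameters including the
    intercept, n = training-set size) fall outside the AD, so their
    predictions are extrapolations and should be used with caution."""
    Xt = np.column_stack([np.ones(len(X_train)), np.asarray(X_train, float)])
    Xe = np.column_stack([np.ones(len(X_ext)), np.asarray(X_ext, float)])
    core = np.linalg.inv(Xt.T @ Xt)                 # (X'X)^-1
    h_ext = np.einsum("ij,jk,ik->i", Xe, core, Xe)  # leverages of X_ext
    h_star = 3.0 * Xt.shape[1] / Xt.shape[0]        # warning threshold
    return h_ext <= h_star                          # True = inside the AD
```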
For this reason, it is important to apply suitable rules for splitting the original data into a training set (for the learning step and the subsequent random splits into training sub-sets and test sets for CV) and a prediction set (for external evaluation after model development and CV).[17,20] It is also useful to apply different kinds of splitting methods, as implemented in QSARINS.[18] To avoid the limitation of using only a single external set, in our recent papers[17] we always verify our models on two or three different prediction sets: one obtained on the sorted responses (to verify the model in the response domain), another based on structural similarity by Kohonen maps (to check the model in the structural domain), and/or even a random one (the splitting that is, in a sense, most similar to the real-life situation of unknown new chemicals and that, being unbiased with respect to response and structure, cannot be accused of purposeful manipulation).

In conclusion, the two modelling approaches compared here are philosophically different, and neither should be considered right or wrong. The approach based on double CV is focused only on obtaining the best statistical performance (the lowest prediction error). The information from all the available data is exploited, and therefore the best results are expected. This approach produces an ensemble of different models, each verified on test chemicals that can be considered "external" only for the corresponding sub-model; in the end the complete algorithm is not verified on really new chemicals. Therefore, I agree with this statement in the Gütlein et al. paper:[1] "If external validation implies (i) that no instance from any test set is ever used for building the final model (see e.g. the literature[3,6,20]), then no form of cross-validation (in which the complete data set is repeatedly divided into disjoint training and test sets) can be regarded as external validation". The approach based on the additional verification of statistically robust models on really external chemicals has the aim of proposing good QSAR models (even if, probably, not the "best" possible ones), evaluating them additionally for external predictivity before their presentation, in order to guarantee a wider generalizability. We hope to avoid proposing models that are predictive only in appearance, such as models no. 4–6 in Table 1.

In my opinion, it is important to remember that the ultimate goal of a validation strategy should be to simulate, with sufficient accuracy, the difficulties that one would encounter when applying a methodology in future circumstances (new experimental data), trying to represent the future working situation of the particular model: only an additional "external evaluation" on totally new chemicals can do this for QSAR
models, after their internal validation by CV, but this should be done already at the proposal step.
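Finally, the sequential workflow defended in this letter can be summarized in executable form. The sketch below is purely illustrative (synthetic data, a hypothetical ordered-response split that holds out every fourth chemical, descriptor selection omitted, plain scikit-learn instead of QSARINS), but it shows the essential discipline: split first, develop and cross-validate on the training set only, then apply the frozen model to the external prediction set and compute the external parameters (Q2F1 here; CCC analogously).

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)                 # synthetic stand-in data
X = rng.normal(size=(60, 3))                   # 60 chemicals, 3 descriptors
y = X @ np.array([1.0, -0.5, 0.3]) + rng.normal(scale=0.2, size=60)

def ordered_response_split(y, every=4):
    """Put aside every `every`-th chemical, after sorting by response,
    as the external prediction set, BEFORE any model development."""
    order = np.argsort(y)
    ext = order[::every]
    return np.setdiff1d(order, ext), ext

# 1) Split first: the prediction-set structures never enter the training.
tr, ext = ordered_response_split(y)

# 2) Develop the model on the training set only (descriptor selection by
#    GA and internal validation by LOO/LMO CV are omitted for brevity).
model = LinearRegression().fit(X[tr], y[tr])

# 3) Apply the FROZEN model, with exactly its coefficients, to the
#    external set; rebuilding it on these compounds would fit them,
#    not predict them.
y_hat = model.predict(X[ext])
q2_f1 = 1 - np.sum((y[ext] - y_hat) ** 2) / np.sum((y[ext] - y[tr].mean()) ** 2)

# 4) Only after this external evaluation are the same descriptors used to
#    redevelop the full model on the complete data set (Figure 1).
```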
Acknowledgements

I thank Knut Baumann, and also my collaborators Nicola Chirico and Stefano Cassani, for the interesting and helpful discussions during the preparation and revision of this letter.
References

[1] M. Gütlein, C. Helma, A. Karwath, S. Kramer, Mol. Inf. 2013, 32, 516 – 528.
[2] H. Kubinyi, F. A. Hamprecht, T. Mietzner, J. Med. Chem. 1998, 41, 2553 – 2564.
[3] H. Kubinyi, Quant. Struct.-Act. Relat. 2002, 21, 348 – 356.
[4] A. Golbraikh, A. Tropsha, J. Mol. Graph. Model. 2002, 20, 269 – 276.
[5] A. Tropsha, P. Gramatica, V. K. Gombar, QSAR Comb. Sci. 2003, 22, 69 – 77.
[6] K. Baumann, TrAC 2003, 22, 395 – 406.
[7] D. M. Hawkins, S. C. Basak, D. Mills, J. Chem. Inf. Comput. Sci. 2003, 43, 579 – 586.
[8] D. M. Hawkins, J. Chem. Inf. Comput. Sci. 2004, 44, 1 – 12.
[9] K. Baumann, N. Stiefl, J. Comput.-Aided Mol. Des. 2004, 18, 549 – 562.
[10] P. Gramatica, QSAR Comb. Sci. 2007, 26, 694 – 701.
[11] A. Tropsha, A. Golbraikh, Curr. Pharm. Des. 2007, 13, 3494 – 3504.
[12] A. Tropsha, Mol. Inf. 2010, 29, 476 – 488.
[13] K. H. Esbensen, P. Geladi, J. Chemom. 2010, 24, 168 – 187.
[14] T. M. Martin, P. Harten, D. M. Young, E. N. Muratov, A. Golbraikh, H. Zhu, A. Tropsha, J. Chem. Inf. Model. 2012, 52, 2570 – 2578.
[15] N. Chirico, P. Gramatica, J. Chem. Inf. Model. 2011, 51, 2320 – 2335.
[16] N. Chirico, P. Gramatica, J. Chem. Inf. Model. 2012, 52, 2044 – 2058.
[17] P. Gramatica, S. Cassani, P. P. Roy, S. Kovarich, C. W. Yap, E. Papa, Mol. Inf. 2012, 31, 817 – 835.
[18] P. Gramatica, N. Chirico, E. Papa, S. Cassani, S. Kovarich, J. Comput. Chem. 2013, 34, 2121 – 2132.
[19] OECD Principles, 2004; http://www.oecd.org/dataoecd/33/37/37849783.pdf (accessed 02/02/2014).
[20] Guidance Document on the Validation of (Quantitative) Structure-Activity Relationship Models, ENV/JM/MONO(2007)2; http://search.oecd.org/officialdocuments/displaydocumentpdf/?doclanguage=en&cote=env/jm/mono%282007%292 (accessed 02/02/2014).