ATLA 30, Supplement 2, 133–137, 2002
ECVAM's Activities on Computer Modelling and Integrated Testing

Andrew P. Worth

ECVAM, Institute for Health & Consumer Protection, European Commission Joint Research Centre, 21020 Ispra (VA), Italy

Summary — This paper introduces the basic concepts of quantitative structure–activity relationship (QSAR), expert system and integrated testing strategy, and explains how the analogy between QSARs and prediction models leads naturally to criteria for the validation of QSARs. ECVAM's in-house research programme on QSAR modelling and integrated testing is summarised, along with plans for the validation of QSAR models and expert system rulebases at the European Union level.

Key words: computer modelling, prediction model, quantitative structure–activity relationship, testing strategy.

Address for correspondence: Dr A.P. Worth, European Chemicals Bureau, Institute for Health & Consumer Protection, European Commission Joint Research Centre, 21020 Ispra (VA), Italy.
Introduction

The distinction between a quantitative structure–activity relationship (QSAR) and a structure–activity relationship (SAR) is often misunderstood, perhaps because these terms are not used consistently by practitioners in the field. The distinction used in this paper follows the usage recommended by Livingstone (1). An SAR is an association between a chemical substructure and a biological activity; for example, it is known that the presence of a carboxylic acid group (–COOH) or an amino group (–NH2) in a molecule can impart skin corrosion potential. Such substructures are referred to as structural alerts (or biophores) for skin corrosion. Sometimes, another substructure in the same molecule modulates (reduces) the activity imparted by the biophore, and is called a biophobe.

In contrast, a QSAR is a mathematical relationship between a quantitative measure of chemical structure, or a quantitative measure of a physicochemical property, and a biological activity. Two types of QSARs can be distinguished: classification models (CMs), for which the response variable is on a categorical scale, and regression models, for which the response is continuous. An example of a classification-based QSAR is the following CM for predicting the corrosion potential of organic liquids (2):

If MW ≤ 123 g/mol, predict as corrosive; otherwise, predict as non-corrosive. (Equation 1)

(n = 189, sensitivity = 70%, specificity = 68%, concordance = 69%)
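To make the form of such a CM concrete, the following sketch applies the molecular-weight rule of Equation 1 to a small set of labelled chemicals and computes the three Cooper statistics quoted above. The data and function names are invented for illustration; this is not code from the original study (2).

```python
# Illustrative sketch only: applies the molecular-weight rule of Equation 1
# to hypothetical data and computes the Cooper statistics quoted in the text.

def predict_corrosive(mw: float) -> bool:
    """Classification model of Equation 1: corrosive if MW <= 123 g/mol."""
    return mw <= 123.0

def cooper_statistics(pairs):
    """pairs: iterable of (molecular_weight, observed_corrosive) tuples."""
    tp = tn = fp = fn = 0
    for mw, observed in pairs:
        predicted = predict_corrosive(mw)
        if predicted and observed:
            tp += 1
        elif not predicted and not observed:
            tn += 1
        elif predicted and not observed:
            fp += 1
        else:
            fn += 1
    return {
        "sensitivity": tp / (tp + fn),        # corrosives correctly predicted
        "specificity": tn / (tn + fp),        # non-corrosives correctly predicted
        "concordance": (tp + tn) / (tp + tn + fp + fn),
    }

# Hypothetical (MW, corrosive?) data, not the n = 189 set of reference 2.
data = [(60.1, True), (88.1, True), (150.2, False), (46.0, True),
        (200.3, False), (130.0, True), (100.0, False)]
print(cooper_statistics(data))
```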
As an example of a regression-based QSAR, numerous workers have established that skin permeability can be expressed as a function of lipophilicity and molecular weight, as in the following equation (3):

log K = 0.77 log P – 0.0103 MW – 2.3 (Equation 2)

(n = 107, s = 0.394, r = 0.93, r² = 0.86)

A QSAR for a biological endpoint, such as skin permeability or skin corrosion, provides a means of extrapolating from a physicochemical property, or other measure of chemical structure, to the endpoint of interest. Such a QSAR can be regarded as an alternative (non-animal) method, in the sense that it could be used to reduce, refine, or replace the use of animals for an experimental purpose, such as toxicity testing. A term that is widely used in the literature on alternatives and in vitro toxicology is "prediction model" (PM), which is essentially an unambiguous statement that enables an extrapolation to be made from one or more in vitro endpoints to an in vivo biological endpoint (4). Thus, QSARs for in vivo endpoints in humans or animals are analogous to PMs, in the sense that the former extrapolate from the molecular level, whereas the latter extrapolate from the level of cells, tissues or organs. It should be noted that QSARs can also be developed (probably with more success) for in vitro endpoints, and QSARs also exist for environmental endpoints, such as chemical biodegradability, and toxicity to fish and insects.
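Equation 2 is exactly the kind of unambiguous, easily applied algorithm discussed later in this paper. The sketch below encodes it as a one-line function, where K is taken to be the skin permeability coefficient described in the text; the example compound values are invented here, not drawn from the data set of reference 3.

```python
# Minimal encoding of the regression QSAR of Equation 2 (reference 3).
# The example input values are hypothetical.

def log_skin_permeability(log_p: float, mw: float) -> float:
    """Equation 2: log K = 0.77 log P - 0.0103 MW - 2.3."""
    return 0.77 * log_p - 0.0103 * mw - 2.3

# Hypothetical compound: log P = 2.0, MW = 150 g/mol.
log_k = log_skin_permeability(2.0, 150.0)
print(f"predicted log K = {log_k:.2f}  (K = {10 ** log_k:.2e})")
```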
The term "expert system" has been defined in the following way (5): "An expert system for predicting toxicity is considered to be any formalised system, not necessarily computer-based, which enables a user to obtain rational predictions about the toxicity of chemicals. All expert systems for the prediction of chemical toxicity are built upon experimental data representing one or more toxic manifestations of chemicals in biological systems (the database), and/or rules derived from such data (the rulebase)."

The key words in this definition are "database" and "rulebase". Rules can be obtained in various ways, such as by formalising the knowledge of an expert, or by statistical induction from a training set. For the purposes of validation, it is the acceptability (defined below) of the rulebase that is important. The database itself is not validated directly, although rules derived from data of poor quality (i.e. poorly reproducible or heterogeneous data) will inevitably have a lower predictive capacity than rules derived from data of higher quality (i.e. highly reproducible or homogeneous data). For this reason, it is desirable that the data in an expert system database have been subjected to some degree of quality control, and it is essential that all data used to validate a QSAR or expert system rulebase undergo a rigorous quality-control process. The expert systems that are currently available have been reviewed elsewhere (5, 6).
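The database/rulebase distinction can be illustrated with a toy rulebase built from the structural alerts mentioned in the Introduction (–COOH and –NH2 as alerts for skin corrosion). Everything else below, including the data structures and the assumption that alerts are supplied directly rather than detected from the chemical structure, is purely illustrative.

```python
# Toy illustration of the database/rulebase distinction: the rulebase maps
# structural alerts (biophores) to predicted effects; it says nothing about
# how the underlying experimental database was assembled or quality-controlled.

RULEBASE = {
    # alert name: (substructure label, predicted effect)
    "carboxylic_acid": ("-COOH", "skin corrosion"),
    "primary_amine": ("-NH2", "skin corrosion"),
}

def predict_effects(alerts_present: set) -> set:
    """Return the effects predicted for a chemical from its structural alerts.

    In a real expert system the alerts would be detected automatically from
    the structure (e.g. by substructure searching); here they are supplied
    directly to keep the sketch self-contained.
    """
    return {effect for name, (_, effect) in RULEBASE.items()
            if name in alerts_present}

print(predict_effects({"carboxylic_acid"}))   # {'skin corrosion'}
print(predict_effects({"nitro_group"}))       # set(): no rule fires
```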
ECVAM's Research on QSAR Modelling and Integrated Testing

The general objectives of the work conducted at ECVAM in the fields of computer modelling and integrated testing have been:

1. to develop and assess scientifically-based mathematical models ([Q]SARs, PMs and expert system rulebases) for predicting toxicological hazard; and

2. to develop and assess alternative testing strategies that could be used to reduce and refine, or replace, the use of animals in toxicity testing.

Until now, most of the research has been conducted in the context of Andrew Worth's PhD project (2). The general aim of the project was to investigate the integrated use of physicochemical and in vitro data for predicting the toxicological hazard of chemicals in animals. This was achieved in two stages: firstly, by developing two types of model for acute dermal and ocular toxicity — QSARs based on easily calculated physicochemical properties, and PMs based on experimentally derived physicochemical data (7) or in vitro data (8); and secondly, by evaluating the tiered testing approach to hazard classification, in which different CMs are applied sequentially before animal testing is conducted (9, 10), as sketched below.
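The following sketch shows the general shape of such a tiered strategy: each tier is a classification model applied in sequence, and the animal test is reached only if no earlier tier classifies the chemical as positive. The tier rules, thresholds and data are invented for illustration (tier 1 is loosely inspired by the pH-based models of reference 7, but the cut-offs here are not from that study).

```python
# Illustrative sketch of a tiered (stepwise) testing strategy: classification
# models are applied in sequence, and an animal test is required only if no
# tier gives a positive classification. All rules and values are hypothetical.

from typing import Callable, List, Optional

def physicochemical_cm(chem: dict) -> Optional[str]:
    """Tier 1: hypothetical rule on an easily measured property (cf. ref. 7)."""
    return "corrosive" if chem["ph"] <= 2.0 or chem["ph"] >= 11.5 else None

def in_vitro_cm(chem: dict) -> Optional[str]:
    """Tier 2: hypothetical prediction model based on an in vitro endpoint."""
    return "corrosive" if chem["in_vitro_score"] >= 0.5 else None

def classify(chem: dict, tiers: List[Callable]) -> str:
    for tier in tiers:
        result = tier(chem)
        if result is not None:        # positive result: stop, no animal test
            return result
    return "animal test required"     # all tiers negative

chemical = {"ph": 6.8, "in_vitro_score": 0.7}
print(classify(chemical, [physicochemical_cm, in_vitro_cm]))  # 'corrosive' at tier 2
```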
The PhD thesis (2) therefore reports the development and assessment of CMs for skin irritation, skin corrosion and eye irritation, as well as the outcome of simulations in which the models were incorporated into tiered testing strategies for these toxicological endpoints. The tiered testing approach to hazard classification was shown to provide a reliable means of reducing and refining the use of animals, without compromising the ability to classify chemicals. In addition to the above-mentioned CMs, regression models for corneal permeability were developed, and the relationship between corneal permeability and eye irritation was investigated (2, 11).

The project also involved the development and assessment of a novel statistical method called embedded cluster modelling (ECM), which generates elliptical models of biological activity from embedded data sets (12). An embedded data set is one in which the variables, when plotted on a box plot (one variable) or a scatter plot (two or more variables), reveal that the chemicals can be classified into two groups (active/inactive or toxic/non-toxic), with the chemicals in one group forming an "embedded cluster" surrounded by the chemicals in the other group, the "diffuse cluster". Typically, the active (or toxic) chemicals form the embedded cluster, and the inactive (or non-toxic) chemicals form the diffuse cluster, indicating that, for each of the chosen variables, there is an optimum range of values for the biological response to occur. The combined use of ECM with the existing method of cluster significance analysis (CSA), which is used to assess the statistical significance of apparent embedded clusters, was illustrated through the development of QSARs for eye irritation potential (13).

The PhD project also involved investigations of novel applications of bootstrap resampling in in vitro toxicology. Bootstrap resampling is a statistical method for making inferences about an unknown population on the basis of a single sample (14, 15). Algorithms based on this method were shown to provide a means of assessing: a) the variability in Cooper statistics (commonly used to summarise the performance of two-group CMs) that arises from chemical variation (16); and b) the minimal variability associated with the Draize rabbit tests for skin and eye irritation (17). It is important to know the variability in Cooper statistics, in order to determine whether the predictive performance of an alternative test, as judged with a particular test set of chemicals, would have varied significantly if the test had been assessed with a different set of chemicals. It is important to know the variability of an in vivo endpoint, since this places a limit on the maximal predictive capacity that can be expected of any alternative test. A minimal version of the resampling idea is sketched below.
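The sketch resamples a test set of chemicals with replacement, recomputes a Cooper statistic on each resample, and takes the spread of the resampled values as an estimate of the variability due to chemical selection. The data, the choice of concordance as the statistic, and the number of resamples are all illustrative; the real applications are described in references 16 and 17.

```python
# Illustrative bootstrap sketch: estimate the chemical-sampling variability of
# a CM's concordance by resampling the test set with replacement (cf. 14-17).

import random

random.seed(0)  # for reproducibility of this illustration

def concordance(sample):
    """Fraction of (predicted, observed) pairs that agree."""
    return sum(p == o for p, o in sample) / len(sample)

# Hypothetical test set of (predicted, observed) classifications.
test_set = [(True, True)] * 60 + [(False, False)] * 50 + \
           [(True, False)] * 20 + [(False, True)] * 20

boot = []
for _ in range(2000):
    # Resample the same number of chemicals, with replacement.
    resample = [random.choice(test_set) for _ in test_set]
    boot.append(concordance(resample))

boot.sort()
lo, hi = boot[int(0.025 * len(boot))], boot[int(0.975 * len(boot))]
print(f"concordance = {concordance(test_set):.2f}, "
      f"bootstrap 95% interval ~ [{lo:.2f}, {hi:.2f}]")
```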
Current research at ECVAM is being conducted largely in the context of Iglika Lessigiarska's PhD project, which started in February 2002. The aims of this project are two-fold:

1. to develop QSARs and PMs for predicting: a) blood–brain barrier function; b) in vitro and in vivo endpoints of acute lethal toxicity; and c) metabolism (cytochrome P450 interactions); and

2. to integrate these models into a tiered strategy.

The project is being carried out under the aegis of Liverpool John Moores University (UK).

The Validation of QSARs and Expert Systems

The validation of computer-based models and systems, such as structure–activity relationships and expert systems, has been discussed in various reports (5, 18, 19), in which a number of logistical considerations have already been noted. For example, it has been recognised that the validation of QSAR models and expert system rulebases should be conducted differently from the validation of experimental tests, since the issues of intralaboratory and interlaboratory reproducibility are not relevant for computer models. Also, it has been noted that it is not possible to test coded chemicals in the same manner as in laboratory-based tests, because the structure of the chemical has to be known in order to enter the appropriate information into the computer program. Furthermore, an important consideration is the quality of the reference data used to develop and validate the systems. When validating computer models, the chemicals used in the development of the algorithm should not be used for its subsequent validation.

The validation of QSARs

Since QSARs are analogous to PMs (indeed, QSARs for certain endpoints can be regarded as PMs), the acceptability criteria that are applied to PMs (4) can be extended to QSARs. An international workshop on QSARs was held in Setúbal, Portugal, on 4–6 March 2002, under the auspices of the European Chemical Industry Council (CEFIC) (20–23). One of the objectives of the workshop was to discuss and achieve consensus on the acceptability criteria for QSARs, with a view to promoting the use of QSAR predictions for regulatory purposes. ECVAM therefore proposed some acceptability criteria for QSARs, which were based on ECVAM's internationally accepted criteria for the development and validation of in vitro systems (24, 25). According to these criteria, which were agreed by the workshop participants, an acceptable QSAR should:

1. be associated with a defined endpoint, which it serves to predict;

2. take the form of an unambiguous and easily applicable algorithm for predicting a pharmacotoxicological endpoint;

3. ideally, have a clear mechanistic basis;

4. be accompanied by a definition of the domain of its applicability, for example, the physicochemical classes of chemicals for which it is applicable;

5. be associated with a measure of its goodness-of-fit and internal goodness-of-prediction, assessed by using the training set of data; and

6. be assessed in terms of its predictive power by using data that were not used in the development of the model (external validation).

A clearly defined endpoint is necessary to reflect the fact that QSARs represent a highly reductionist approach to the modelling of in vivo effects; it is therefore unrealistic to expect QSARs to be capable of modelling, for diverse types of chemicals, generic endpoints that cover numerous mechanisms of action. For example, it is more meaningful to develop a QSAR for acetylcholinesterase inhibition than to develop a QSAR for "neurotoxicity".

QSARs proposed for regulatory purposes should take the form of unambiguous algorithms, partly to avoid the use of "black-box" models, such as neural-network models, which can sometimes appear to be highly predictive but may actually be overfitting the training set of data, and partly because such approaches can only be applied by experts with specialised software. By contrast, QSARs that take the form of PMs can be applied by a wide range of end-users, for example, in the form of equations entered into Excel™ spreadsheets.

It is stated that QSARs should "ideally" have a clear mechanistic basis, since the biological mechanisms underlying an in vivo effect are not always understood, so it is not always possible to select predictor variables that have a known mechanistic relevance. However, it is often possible to select predictors that have a plausible mechanistic relevance, and to use them in the development of QSARs with an acceptable predictive ability (26).

It is essential that all QSARs proposed for regulatory purposes are associated with a domain of applicability. This requirement is based on the view that reductionist models can be developed for certain physicochemical classes of chemicals, but probably not for the whole universe of chemicals (however the latter may be defined). According to ECVAM's validation criteria, the same criterion is applied to in vitro tests (24). One simple way of expressing such a domain is sketched below.
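One common convention (an assumption here, not a definition prescribed by the workshop) is to treat the ranges of the descriptor values in the training set as the model's applicability domain, and to flag predictions for chemicals that fall outside those ranges:

```python
# One simple, assumed convention for an applicability domain: the ranges of
# the training-set descriptors. Predictions outside those ranges are flagged.
# The descriptors and values below are hypothetical.

def descriptor_ranges(training_set):
    """training_set: list of dicts mapping descriptor name -> value."""
    names = training_set[0].keys()
    return {n: (min(c[n] for c in training_set),
                max(c[n] for c in training_set)) for n in names}

def in_domain(chemical, ranges):
    """True if every descriptor of the chemical lies within the training range."""
    return all(lo <= chemical[n] <= hi for n, (lo, hi) in ranges.items())

train = [{"logP": -1.0, "MW": 60.0}, {"logP": 3.5, "MW": 280.0},
         {"logP": 1.2, "MW": 150.0}]
ranges = descriptor_ranges(train)
print(in_domain({"logP": 2.0, "MW": 200.0}, ranges))  # True: inside the domain
print(in_domain({"logP": 6.0, "MW": 200.0}, ranges))  # False: logP out of range
```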
All QSARs should be associated with a measure of goodness-of-fit to a defined training set of data. In general, the goodness-of-fit of a QSAR model to its training set is expected to exceed its predictive power, as judged by applying the model to an independent (external) test set of data. When deciding whether a QSAR is sufficiently predictive for a given purpose, the criteria for predictive ability should be defined by taking into consideration the proposed regulatory application of the model, and the consequences of making wrong or inaccurate predictions. The contrast between the two measures is sketched below.
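The sketch fits a toy regression QSAR by least squares to synthetic data (all values are invented), then computes r² separately on the training chemicals (goodness-of-fit) and on chemicals withheld from model development (external predictive power); the former is generally the higher of the two.

```python
# Toy contrast between goodness-of-fit (r^2 on the training set) and external
# predictive power (r^2 on chemicals withheld from model development).
# The synthetic descriptor/activity data are illustrative only.

import numpy as np

rng = np.random.default_rng(0)
X = rng.uniform(0, 5, size=(40, 2))                  # two descriptors
y = 0.8 * X[:, 0] - 0.1 * X[:, 1] + rng.normal(0, 0.3, 40)

X_train, y_train = X[:30], y[:30]                    # model development
X_test, y_test = X[30:], y[30:]                      # external validation only

A = np.column_stack([X_train, np.ones(len(X_train))])
coef, *_ = np.linalg.lstsq(A, y_train, rcond=None)   # least-squares QSAR fit

def r_squared(Xs, ys):
    pred = np.column_stack([Xs, np.ones(len(Xs))]) @ coef
    ss_res = np.sum((ys - pred) ** 2)
    ss_tot = np.sum((ys - ys.mean()) ** 2)
    return 1 - ss_res / ss_tot

print(f"goodness-of-fit r^2 (training) = {r_squared(X_train, y_train):.2f}")
print(f"external r^2 (test)            = {r_squared(X_test, y_test):.2f}")
```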
The validation of expert systems

Acceptability criteria for expert systems have not been discussed to the same extent as those for QSARs. However, it is expected that many of the same considerations will apply. In addition, the author's view is that, in the case of expert systems, the validation process should apply to an expert system rulebase for a particular endpoint, as defined at a fixed time-point in its development. For example, the DEREK system (27) makes predictions for various toxicological endpoints, which are continuously being revised. Therefore, instead of validating the DEREK system itself, it is proposed that a particular DEREK rulebase could be validated. This emphasises the fact that the validation process is not being extended to the platform on which the expert system operates, and should also avoid concerns over patented systems being proposed for regulatory purposes.

ECVAM's plans for the validation of computer models

The independent and prospective validation of computer models has not been conducted to date. However, the Joint Research Centre (JRC) plans to coordinate the validation of such models, to partially address the need to develop and validate alternative methods for the assessment of chemicals in the context of the EU's future REACH (Registration, Evaluation and Authorisation of CHemicals) system (28). Initial priorities are likely to include the validation of QSARs and expert system rulebases for: a) dermal and ocular toxicity; and b) genotoxicity and carcinogenicity. This work will be carried out in the context of a JRC activity on the development, validation and dissemination of QSAR models, which will involve a collaboration with other services of the European Commission, and which will be coordinated by the European Chemicals Bureau (ECB).
Concluding Remarks

Since 1997, an in-house research programme at ECVAM has aimed to develop and assess QSARs, PMs and integrated testing strategies. Future work, involving ECVAM and the ECB, is likely to extend these activities to the validation of computer models that are proposed for regulatory purposes, such as QSARs and expert systems, not least because the implementation of the REACH system will require validated computer models to be available for the regulatory assessment of chemicals. The development and validation of computer models, and the subsequent dissemination of validated models to end-users, will require the establishment and coordination of international networks comprising model developers, regulators, users and other stakeholders.
References

1. Livingstone, D. (1995). Data Analysis for Chemists: Applications to QSAR and Chemical Product Design. Oxford, UK: Oxford University Press.

2. Worth, A.P. (2000). The Integrated Use of Physicochemical and In Vitro Data for Predicting Chemical Toxicity. PhD Thesis, Liverpool John Moores University, UK.

3. Cronin, M.T.D., Dearden, J.C., Moss, G.P. & Murray-Dickson, G. (1999). Investigation of the mechanism of flux across human skin in vitro by quantitative structure–permeability relationships. European Journal of Pharmaceutical Sciences 7, 325–330.

4. Worth, A.P. & Balls, M. (2001). The importance of the prediction model in the development and validation of alternative tests. ATLA 29, 135–143.

5. Dearden, J.C., Barratt, M.D., Benigni, R., Bristol, D.W., Combes, R.D., Cronin, M.T.D., Judson, P.N., Payne, M.P., Richard, A.M., Tichy, M., Worth, A.P. & Yourick, J.J. (1997). The development and validation of expert systems for predicting toxicity. The report and recommendations of an ECVAM/ECB workshop (ECVAM workshop 24). ATLA 25, 223–252.

6. Greene, N., Judson, P.N., Langowski, J.J. & Marchant, C.A. (1999). Knowledge-based expert systems for toxicity and metabolism prediction: DEREK, StAR, METEOR. SAR and QSAR in Environmental Research 10, 299–314.

7. Worth, A.P. & Cronin, M.T.D. (2001). The use of pH measurements to predict the potential of chemicals to cause acute dermal and ocular toxicity. Toxicology 169, 119–131.

8. Worth, A.P. & Cronin, M.T.D. (2001). Prediction models for eye irritation potential based on endpoints of the HET-CAM and neutral red assays. In Vitro and Molecular Toxicology 14, 143–156.

9. Worth, A.P., Fentem, J.H., Balls, M., Botham, P.A., Curren, R.D., Earl, L.K., Esdaile, D.J. & Liebsch, M. (1998). An evaluation of the proposed OECD testing strategy for skin corrosion. ATLA 26, 709–720.
10. Worth, A.P. & Fentem, J.H. (1999). A general approach for evaluating stepwise testing strategies. ATLA 27, 161–177.

11. Worth, A.P. & Cronin, M.T.D. (2000). Structure–permeability relationships for transcorneal penetration. ATLA 28, 403–413.

12. Worth, A.P. & Cronin, M.T.D. (1999). Embedded cluster modelling: a novel method for analysing embedded data sets. Quantitative Structure–Activity Relationships 18, 229–235.

13. Worth, A.P. & Cronin, M.T.D. (2000). Embedded cluster modelling: a novel QSAR method for generating elliptic models of biological activity. In Progress in the Reduction, Refinement and Replacement of Animal Experimentation (ed. M. Balls, A-M. van Zeller & M.E. Halder), pp. 479–491. Amsterdam, The Netherlands: Elsevier.

14. Efron, B. & Tibshirani, R.J. (1993). An Introduction to the Bootstrap, 436pp. London, UK: Chapman & Hall.

15. Wehrens, R., Putter, H. & Buydens, L.M.C. (2000). The bootstrap: a tutorial. Chemometrics and Intelligent Laboratory Systems 54, 35–52.

16. Worth, A.P. & Cronin, M.T.D. (2001). The use of bootstrap resampling to assess the uncertainty of Cooper statistics. ATLA 29, 447–459.

17. Worth, A.P. & Cronin, M.T.D. (2001). The use of bootstrap resampling to assess the variability of Draize tissue scores. ATLA 29, 557–573.

18. Anon. (1996). In Technical Guidance Documents in Support of the Commission Directive 93/67/EEC on Risk Assessment for New Notified Substances and the Commission Regulation (EC) 1488/94 on Risk Assessment for Existing Substances, pp. 517–526. Luxembourg: CEC.

19. Worth, A.P., Barratt, M.D. & Houston, J.B. (1998). The validation of computational prediction techniques. ATLA 26, 241–247.

20. Cronin, M.T.D., Jaworska, J.S., Walker, J.D., Comber, M.H.I., Watts, C.D. & Worth, A.P. (2003). Use of QSARs in international decision-making frameworks to predict health effects of chemical substances. Environmental Health Perspectives, in press.

21. Cronin, M.T.D., Walker, J.D., Jaworska, J.S., Comber, M.H.I., Watts, C.D. & Worth, A.P. (2003). Use of QSARs in international decision-making frameworks to predict ecological effects and environmental fate of chemical substances. Environmental Health Perspectives, in press.

22. Eriksson, L., Jaworska, J.S., Worth, A.P., Cronin, M.T.D., McDowell, R.M. & Gramatica, P. (2003). Methods for reliability, uncertainty assessment, and applicability evaluations of regression based and classification QSARs. Environmental Health Perspectives, in press.

23. Jaworska, J.S., Comber, M., Auer, C. & Van Leeuwen, C.J. (2003). Summary of a workshop on regulatory acceptance of (Q)SARs for human health and environmental endpoints. Environmental Health Perspectives, in press.

24. Balls, M., Blaauboer, B.J., Fentem, J.H., Bruner, L., Combes, R.D., Ekwall, B., Fielder, R.J., Guillouzo, A., Lewis, R.W., Lovell, D.P., Reinhardt, C.A., Repetto, G., Sladowski, D., Spielmann, H. & Zucco, F. (1995). Practical aspects of the validation of toxicity test procedures. The report and recommendations of ECVAM workshop 5. ATLA 23, 129–147.

25. Worth, A.P. & Balls, M. (2001). The role of ECVAM in promoting the regulatory acceptance of alternative methods in the European Union. ATLA 29, 525–535.

26. Cronin, M.T.D., Dearden, J.C., Duffy, J.C., Edwards, R., Manga, N., Worth, A.P. & Worgan, A.D.P. (2002). The importance of hydrophobicity and electrophilicity descriptors in mechanistically-based QSARs for toxicological endpoints. SAR and QSAR in Environmental Research 13, 167–176.

27. Sanderson, D.M. & Earnshaw, C.G. (1991). Computer prediction of possible toxic action from chemical structure; the DEREK system. Human and Experimental Toxicology 10, 261–273.

28. Botham, P. (2002). ECVAM, ECETOC and the EU chemicals policy. ATLA 30, Suppl. 2, 185–187.