Interpretable Ames mutagenicity predictions using statistical learning techniques
S.J. Webb (a, b), P. Krause (a), J.D. Vessey (b)
Email: [email protected]
Tel: +44 (0)113 394 4965
Introduction
Many machine learning algorithms achieve good accuracy in the toxicology prediction domain. However, many of the resulting models focus heavily on predictive performance, measured by accuracy, sensitivity and specificity, with little regard for the interpretability and applicability of the model.
The models shown here have been built on a curated version of the benchmark mutagenicity dataset released by Hansen et al. [1]. Curation resulted in the removal of a number of structures, the standardisation of functional group representation and, in some cases, the replacement of a structure with one from a higher quality source.
A new method focusing on the interpretability of strongly performing models, such as a Random Forest, has been developed for Ames mutagenicity prediction using simple structural fragment descriptors. Careful analysis has enabled us to select a set of simple fragment descriptors while maintaining predictive performance over a number of independent datasets.
Results
Models have been built using fragment descriptors above a minimum presence threshold in the training set (cut-offs of 4 and 10); models built using the lower cut-off give better results in both internal and external validation. Internal validation is based on the out of bag (OOB) set. External validation is based on either the combination of unique molecules from three datasets (Bursi [2], Helma [3] and Novartis [4]), referred to as BHN in Table 1, or the unique molecules from a confidential internal dataset [5].
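The presence threshold described above can be sketched as a simple counting step. This is a minimal illustration, not the poster's KNIME implementation: the function name `select_fragments`, the toy fragment labels, and the choice to count each fragment once per molecule with an inclusive threshold are all assumptions.

```python
from collections import Counter


def select_fragments(training_fragments, min_presence):
    """Return fragments found in at least `min_presence` training molecules.

    Assumption: presence is counted once per molecule and the threshold
    is inclusive; the poster does not state either detail.
    """
    counts = Counter()
    for fragments in training_fragments:
        counts.update(set(fragments))  # one count per molecule
    return sorted(f for f, n in counts.items() if n >= min_presence)


# Hypothetical toy training set: each molecule is a set of fragment labels.
toy_training = [
    {"aromatic nitro", "benzene"},
    {"aromatic amine", "benzene"},
    {"aromatic nitro", "benzene", "furan"},
    {"benzene"},
]

low_cutoff = select_fragments(toy_training, 2)
high_cutoff = select_fragments(toy_training, 4)
print(low_cutoff)   # ['aromatic nitro', 'benzene']
print(high_cutoff)  # ['benzene']
```

A lower cut-off keeps more fragments as descriptors, which is consistent with the larger descriptor count reported for the 4 cut-off model.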
Structural fragments are the only descriptors used for the modelling and are used to create a binary fingerprint for each training and query molecule.
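As a rough sketch of the fingerprinting step, each molecule can be encoded as one bit per retained fragment. The function name `binary_fingerprint` and the fragment labels below are hypothetical; the actual fragments come from the in-house chemical engine.

```python
def binary_fingerprint(molecule_fragments, fragment_index):
    """Encode a molecule as a list of 0/1 flags, one per fragment descriptor.

    `fragment_index` is the ordered list of fragments kept for the model;
    fragments of the molecule that are not in the index are ignored.
    """
    present = set(molecule_fragments)
    return [1 if fragment in present else 0 for fragment in fragment_index]


# Hypothetical descriptor set and query molecule.
fragment_index = ["aromatic amine", "aromatic nitro", "benzene"]
query = {"benzene", "furan"}  # "furan" is not a model descriptor

fingerprint = binary_fingerprint(query, fragment_index)
print(fingerprint)  # [0, 0, 1]
```

Training and query molecules share the same `fragment_index`, so every fingerprint has the same length and bit order.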
Figure 3 shows how the model performance indicators vary with a confidence measure derived from 1) the combination of RF node purity and compound similarity for each tree and 2) the domain coverage, which is a numeric representation of the modelled portion of the query.
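The per-bin performance indicators can be computed as sketched below. The confidence values are taken as given, since the poster does not specify how node purity, similarity and coverage are combined; the function names and the five equal-width bins are assumptions for illustration.

```python
from collections import defaultdict


def confidence_bin(confidence, n_bins=5):
    """Map a confidence in [0, 1] to a bin index; 1.0 falls in the last bin."""
    return min(int(confidence * n_bins), n_bins - 1)


def ratio(numerator, denominator):
    """Return numerator / denominator, or None when the denominator is zero."""
    return numerator / denominator if denominator else None


def metrics(pairs):
    """Compute indicators from (actual, predicted) booleans, True = active."""
    tp = fp = tn = fn = 0
    for actual, predicted in pairs:
        if actual and predicted:
            tp += 1
        elif actual:
            fn += 1
        elif predicted:
            fp += 1
        else:
            tn += 1
    return {
        "accuracy": ratio(tp + tn, tp + fp + tn + fn),
        "sensitivity": ratio(tp, tp + fn),   # true positive rate
        "specificity": ratio(tn, tn + fp),   # true negative rate
        "precision": ratio(tp, tp + fp),     # positive predictive value
    }


def metrics_by_confidence(predictions, n_bins=5):
    """Group (confidence, actual, predicted) records by bin and score each bin."""
    bins = defaultdict(list)
    for confidence, actual, predicted in predictions:
        bins[confidence_bin(confidence, n_bins)].append((actual, predicted))
    return {b: metrics(pairs) for b, pairs in sorted(bins.items())}


# Hypothetical predictions: (confidence, actual active?, predicted active?)
toy_predictions = [
    (0.90, True, True),
    (0.95, False, False),
    (0.85, True, False),
    (0.10, False, True),
]
per_bin = metrics_by_confidence(toy_predictions)
```

Bins that contain no actives (or no predicted actives) return `None` for the affected indicator rather than a misleading zero.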
Method
The methodology has been developed using the open source workflow software KNIME and utilising the R integration for the machine learning package Random Forest (Figure 1).
New nodes have been developed internally, utilising our in-house chemical engine for descriptor generation, structure representation and similarity, in addition to nodes for model analysis, confidence and cause assignment.

Dataset                        Hansen    BHN      Internal
Structures                     6477      504      2098
% Active                       54        33       36
No. descriptors (4 cut-off)    758       ---      ---
No. descriptors (10 cut-off)   311       ---      ---

Validation                     BHN       Internal
4 fragment cut-off
  Accuracy                     0.817     0.769
  Sensitivity                  0.683     0.608
  Specificity                  0.884     0.859
  Precision                    0.745     0.706
10 fragment cut-off
  Accuracy                     0.808     0.761
  Sensitivity                  0.653     0.569
  Specificity                  0.884     0.868
  Precision                    0.736     0.706
Table 1: Dataset performance and composition

[Figure 3: accuracy, sensitivity, specificity and precision plotted against confidence bins from 0.0 to 1.0 in steps of 0.2]
Figure 3: BHN confidence bin performance for 4 fragment cut-off model using 200 trees

The breakdown of the domain coverage of the BHN structures (Figure 4) shows that this fragmentation method, with a cut-off of 4 examples, gives good coverage of the validation set; only 33 molecules fall below 40% coverage.

[Figure 4: number of structures per domain coverage bin, from 0.0 to 1.0 in steps of 0.2, for the 4-frag and 10-frag models]
Figure 4: Domain coverage of BHN structures
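A coverage breakdown of this kind can be sketched as follows. The poster describes domain coverage only as a numeric representation of the modelled portion of the query, so the definition used here (the fraction of a query's fragments that are model descriptors) is an assumption, as are the function names and toy fragment labels.

```python
def domain_coverage(query_fragments, model_fragments):
    """Fraction of the query's fragments known to the model.

    Assumption: coverage is fragment-based; the real measure may instead
    be based on atoms or bonds covered by known fragments.
    """
    query = set(query_fragments)
    if not query:
        return 0.0
    return len(query & set(model_fragments)) / len(query)


def coverage_histogram(coverages, n_bins=5):
    """Count coverages per equal-width bin on [0, 1]; 1.0 goes in the last bin."""
    counts = [0] * n_bins
    for coverage in coverages:
        counts[min(int(coverage * n_bins), n_bins - 1)] += 1
    return counts


# Hypothetical model descriptors and query molecules.
model_fragments = {"aromatic nitro", "aromatic amine", "benzene"}
queries = [
    {"aromatic nitro", "benzene"},                   # fully covered
    {"benzene", "furan"},                            # half covered
    {"benzene", "furan", "thiophene", "epoxide"},    # a quarter covered
    {"epoxide"},                                     # not covered
]

coverages = [domain_coverage(q, model_fragments) for q in queries]
histogram = coverage_histogram(coverages)
below_40_percent = sum(histogram[:2])  # bins 0.0-0.2 and 0.2-0.4
print(histogram, below_40_percent)
```

Summing the first two bins gives the count of poorly covered molecules, which is the quantity quoted for the BHN set above.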
Analysis of the model allows the cause of a prediction (active/inactive) to be interpreted. Activating or deactivating fragments, or combinations of fragments, can then be assigned, which adds another level of interpretability to the model and moves away from black box model predictions.
Figure 1: Representation of the modelling methodology
Descriptors
The structural fragment descriptors are produced by fragmenting all of the training molecules while retaining connectivity information for each atom, such as aromaticity and number of attachments. This covers the chemical space to a higher degree than a pre-defined set of structural fragments. Examples of fragments are shown in Figure 2.
Figure 5: Examples of cause assignment
[Figure 2 panels: Input Structures; Functional groups; Ring Scaffolds; Ring substitutions. Labels shown: aromatic nitro, aromatic amine, aromatic ring nitrogen, aromatic ring oxygen]
Ring scaffolds will match any ring of this type, regardless of substitutions.
Ring substitution fragments will only match rings with the same substitution pattern.
Figure 2: Fragmentation examples
Conclusion
The simplicity of the descriptors has been exploited to develop a new domain assignment methodology based on the chemical space of the training set, to incorporate domain coverage into the confidence calculation and to explain the cause of a prediction. This methodology performs to the same high standard as other mutagenicity predictive systems; however, it adds interpretation and domain assessment to the prediction, challenging the need to produce black box models for Ames mutagenicity.
References
1. K. Hansen, S. Mika, T. Schroeter, A. Sutter, A. ter Laak, T. Steger-Hartmann, N. Heinrich and K.-R. Müller, J. Chem. Inf. Model., 2009, 49 (9), 2077–2081
2. J. Kazius, R. McGuire and R. Bursi, J. Med. Chem., 2005, 48 (1), 312–320
3. C. Helma, T. Cramer, S. Kramer and L. DeRaedt, J. Chem. Inf. Comput. Sci., 2004, 44 (4), 1402–1411
4. P. McCarren, C. Springer and L. Whitehead, Journal of Cheminformatics, 2011, 3:51