Generative Topographic Mapping as a Virtual Screening Tool: A Benchmarking Study
Arkadii Lin (a,b), Dragos Horvath (a), Gilles Marcou (a), Alexandre Varnek (a), and Bernd Beck (b)
(a) Laboratory of Chemoinformatics, Faculty of Chemistry, University of Strasbourg, 4, Blaise Pascal str., 67081 Strasbourg, France
(b) Department of Medicinal Chemistry, Boehringer Ingelheim Pharma GmbH & Co. KG, Birkendorferstrasse 65, 88397 Biberach an der Riss, Germany
Introduction
Generative Topographic Mapping (GTM) is a dimensionality reduction method that can be used to visualize and analyze large chemical data sets. [1,2] It has recently been demonstrated that GTM activity landscapes can be used effectively to predict activity values or classes. [3,4] Here, we benchmark GTM as a Virtual Screening (VS) tool against Random Forest, Neural Networks, and Similarity searching with data fusion.
[Figure: GTM activity landscape of the screened database, showing active zones and inactive hits. Labels: ChEMBL 1827; 1.5M compounds, v.23; 900K compounds.]
Results
[Figure panels: Sequence Fragments; Atom-centered Fragments; Property labels.]
Method
Four universal GTM manifolds were built and validated internally (using 618 ChEMBL targets) and then applied to retrieve actives for 9 targets within the DUD compound sets. The results were compared with “local” GTMs built for each particular target and with other popular VS methods (Neural Network, Random Forest, and Similarity search).
Generative Topographic Mapping (GTM) is a dimensionality reduction method that projects data points from an N-dimensional space into a 2D latent space through a flexible 2D hyperplane (manifold).
[Figure: GTM schematic relating the data space to the latent space.]
Internal validation (with ChEMBL)
• Randomly create Training and Test sets (0.5/0.5)
• Build a landscape
• Predict class
• Build ROC
• Compute ROC AUC
• Repeat 10 times
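The internal validation steps listed above can be sketched in a few lines of Python. This is a minimal illustration and not the authors' implementation: it assumes each compound has already been assigned to a single map node (a hard assignment, whereas GTM normally yields a probability distribution over nodes), and the names `roc_auc`, `build_landscape`, and `internal_validation` are hypothetical.

```python
import random

def roc_auc(labels, scores):
    """Rank-based ROC AUC: fraction of (active, inactive) pairs ranked correctly.
    Ties count as half a correct pair."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("ROC AUC needs both actives and inactives")
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def build_landscape(nodes, labels, n_nodes, empty=0.5):
    """Class landscape: fraction of actives among training compounds per node.
    Nodes without training compounds get the assumed neutral value `empty`."""
    active = [0] * n_nodes
    total = [0] * n_nodes
    for node, y in zip(nodes, labels):
        total[node] += 1
        active[node] += y
    return [a / t if t else empty for a, t in zip(active, total)]

def internal_validation(nodes, labels, n_nodes, repeats=10, seed=0):
    """Repeat a random 0.5/0.5 split, build the landscape on the training half,
    score the test half, and return the mean ROC AUC over all repeats."""
    rng = random.Random(seed)
    idx = list(range(len(labels)))
    aucs = []
    for _ in range(repeats):
        rng.shuffle(idx)
        half = len(idx) // 2
        train, test = idx[:half], idx[half:]
        landscape = build_landscape([nodes[i] for i in train],
                                    [labels[i] for i in train], n_nodes)
        scores = [landscape[nodes[i]] for i in test]
        aucs.append(roc_auc([labels[i] for i in test], scores))
    return sum(aucs) / len(aucs)
```

In this sketch the test-set score of a compound is simply the landscape value of the node it falls on; a target would count as well predicted when the returned mean ROC AUC exceeds 0.8.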
[Figure: bar chart of the number of targets with ROC AUC > 0.8 (axis 0 to 700) for UGTM, NN, RF, GTM, SS in initial space, and SS in latent space.]
Data and Descriptors
[Figure labels: Active; t*; x*; x2.]
Virtual Screening of DUD database
Each data point can be considered both as a point and as a probability distribution. Accumulating these distributions readily generates a landscape of the data space. Coloring the resulting landscape by the corresponding activity value turns it into a QSAR model.
[Figure labels: t2; Gaussian function; Active.]
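To make the landscape idea concrete, the sketch below accumulates per-compound distributions into a class landscape and uses it to score a new compound. It assumes the distributions are available as responsibility vectors over the map nodes (one vector per compound, summing to 1); the function names and the neutral default of 0.5 are illustrative assumptions, not details taken from the poster.

```python
def class_landscape(responsibilities, labels):
    """Responsibility-weighted fraction of actives at each map node.
    Returns None for nodes that received no density at all."""
    n_nodes = len(responsibilities[0])
    active = [0.0] * n_nodes
    total = [0.0] * n_nodes
    for r, y in zip(responsibilities, labels):
        for k, r_k in enumerate(r):
            total[k] += r_k          # cumulated density at node k
            active[k] += r_k * y     # density contributed by actives (y = 1)
    return [a / t if t > 0 else None for a, t in zip(active, total)]

def predict(landscape, r, default=0.5):
    """Score a new compound as the responsibility-weighted mean of the
    landscape values over the populated nodes it projects onto."""
    num = 0.0
    den = 0.0
    for value, r_k in zip(landscape, r):
        if value is not None:
            num += r_k * value
            den += r_k
    return num / den if den > 0 else default

# Three training compounds on a two-node map: active, inactive, active.
R = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]
y = [1, 0, 1]
landscape = class_landscape(R, y)
```

Here the first node is populated only by actives and the second node mostly by the inactive compound, so a new compound projected onto the first node receives a higher score than one projected onto the second.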
[Figure: ROC AUC (axis 0 to 1) for targets T1 to T9, compared across UGTM, NN, RF, SS in initial space, SS in latent space, and GTM. Stray labels from adjacent panels: x1, t1, Inactive, Target 2, Target 3.]
UGTM – Universal GTM, NN – Neural Network, RF – Random Forest, SS – Similarity search with data fusion. T1 to T9 – ChEMBL targets 1827, 1952, 251, 260, 279, 301, 4282, 4338, and 4439 (see ChEMBL for the target descriptions).
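For readers unfamiliar with the SS baseline, similarity search with data fusion can be sketched as follows. The poster does not state which similarity measure or fusion rule was used, so this example assumes Tanimoto similarity on sets of fragment identifiers and MAX fusion over several reference actives; all names are hypothetical.

```python
def tanimoto(a, b):
    """Tanimoto (Jaccard) similarity between two sets of fragment identifiers."""
    a, b = set(a), set(b)
    union = len(a | b)
    return len(a & b) / union if union else 0.0

def fused_score(candidate, references, fuse=max):
    """Fuse the similarities of one candidate to all reference actives.
    MAX fusion is assumed here; other rules (e.g. mean) can be passed as `fuse`."""
    return fuse(tanimoto(candidate, ref) for ref in references)

def rank(candidates, references):
    """Return candidate names sorted by decreasing fused similarity."""
    return sorted(candidates,
                  key=lambda name: fused_score(candidates[name], references),
                  reverse=True)

# Toy data: two reference actives and three database compounds.
references = [{1, 2, 3}, {4, 5}]
candidates = {"a": {1, 2, 3}, "b": {4, 6}, "c": {7}}
ranking = rank(candidates, references)
```

The same ranking function applies whether the fragments come from the initial descriptor space or from a latent-space representation; only the way compounds are encoded changes.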
Four Universal GTM manifolds
• Built on a large chemical DB
• Validated on 618 ChEMBL targets
• Easy to visualize
• Clear model interpretation
• Can potentially be used for many other targets
• Etc.
[Figure labels: Target 1, Target 4, Target 5, Target 618.]
Conclusions
• Four “universal” maps were built on ChEMBL v.23 and successfully validated (510 out of 618 targets were predicted with ROC AUC > 0.8).
• DUD data sets were successfully screened for 9 targets (ROC AUC ≈ 0.7 to 0.9).
• In virtual screening, universal GTM performs similarly to popular machine-learning methods.
References
[1] Bishop CM, Svensén M, Williams CK. GTM: The Generative Topographic Mapping. Neural Comput. 1998, 10(1), 215-34.
[2] Lin A, Horvath D, Afonina V, Marcou G, Reymond JL, Varnek A. ChemMedChem. doi:10.1002/cmdc.201700561.
[3] Gaspar HA, Baskin II, Marcou G, Horvath D, Varnek A. J. Chem. Inf. Model. 2015, 55(1), 84-94.
[4] Sidorov P, Gaspar H, Marcou G, Varnek A, Horvath D. J. Comput.-Aided Mol. Des. 2015, 29(12), 1087-108.
This project has received funding from the European Union’s Horizon 2020 research and innovation program under the Marie Skłodowska-Curie grant agreement No 676434, “Big Data in Chemistry” (“BIGCHEM”, http://bigchem.eu).