Introduction Data and Descriptors Method Results ...

1 downloads 0 Views 1MB Size Report
a Laboratory of Chemoinformatics, Faculty of Chemistry, University of Strasbourg, 4, Blaise Pascal str.,. 67081 Strasbourg, France b Department of Medicinal ...
Generative Topographic Mapping As A Virtual Screening Tool: A Benchmarking Study Arkadii Lina,b, Dragos Horvatha, Gilles Marcoua, Alexandre Varneka, and Bernd Beckb a Laboratory of Chemoinformatics, Faculty of Chemistry, University of Strasbourg, 4, Blaise Pascal str., 67081 Strasbourg, France b Department of Medicinal Chemistry, Boehringer Ingelheim Pharma GmbH & Co. KG, Birkendorferstrasse 65, 88397 Biberach an der Riss, Germany

Introduction

Screened database

ChEMBL 1827

Generative Topographic Mapping (GTM) is a dimensionality reduction method that can be used for large chemical data visualization and analysis. [1,2] Recently it has been demonstrated that GTM activity landscapes could be effectively used for activity values or classes prediction. [3, 4] Here, we compare GTM performance as Virtual Screening (VS) tool compared to Random Forest, Neural Networks, and Similarity searching with data fusion. Active Zones

1.5M compounds, v.23 900K compounds

GTM activity landscape

Inactive Hits

Results  Sequence Fragments

2

1

 Atom-centered Fragments

2

1

4

0

 Property labels

Method

Four universal GTM manifolds were built and validated internally (using 618 ChEMBL targets) and then applied to mine for the actives for 9 targets, within DUD compound sets. The results were compared with “local” GTM built for each particular target and other popular VS methods (Neural Network, Random Forest and Similarity search). Internal validation (with ChEMBL)

Randomly create Training and Test sets (0.5/0.5)

Generative Topographic Mapping (GTM) is a dimensionality reduction method which projects the data points in N-dimensional space into 2D Latent space through a 2D flexible hyperplane (manifold). Latent space

Compute ROC AUC

Data space t3

Build a landscape

Repeat 10 times

Build ROC

Predict class

Number of targets with ROC AUC > 0.8

Data and Descriptors

Active

700 600 500 400 300 200 100 0 UGTM SS in SS in initial latent space space

t*

x*

NN

RF

GTM

x2

Virtual Screening of DUD database

t2 Gaussian function

Each data point can be considered as a point and as a probability distribution. Cumulating such distributions it can easily generate a Landscape of a Data space. Coloring the obtained landscape with the corresponded activity value, we can use it as a QSAR model. Active

ROC AUC

x1

t1

1 0.9 0.8 0.7 0.6 0.5 0.4 0.3 0.2 0.1 0

UGTM NN RF SS in initial space SS in latent space GTM T1

Inactive Target 3

Target 2

T2

T3

T4

T5

T6

T7

T8

T9

UGTM – Universal GTM, NN – Neural Network, RF- Random Forest, SS – Similarity search with data fusion. T1-9 – targets 1827, 1952, 251, 260, 279, 301, 4282, 4338, 4439 from ChEMBL DB (see ChEMBL for the targets description).

Conclusions • Built on large chemical DB Target 4

Target 1

• Validated on 618 ChEMBL targets • Easy to visualize • Clear model interpretation

Target 5

Four Universal GTM manifolds

Target 618

• Potentially can be used for many other targets • Etc.

• Four “universal” maps have been built on ChEMBL v.23 and successfully validated (510 out of 618 targets were predicted with ROC AUC > 0.8). • DUD data sets were successfully screened over 9 targets (ROC AUC ≈ 0.7÷0.9). • In virtual screening, universal GTM performs similarly to popular machine-learning methods.

References [1] Bishop CM, Svensén M, Williams CK. GTM: Neural computation. 1998, 10(1), 215-34. [2] Lin, A., Horvath, D., Afonina, V., Marcou, G., Reymond J.L. and Varnek, A. ChemMedChem. doi:10.1002/cmdc.201700561. [3] Gaspar HA, Baskin II, Marcou G, Horvath D, Varnek A. J. Chem.Inf.Mod.. 2015, 55(1), 84-94. [4] Sidorov P, Gaspar H, Marcou G, Varnek A, Horvath D. J. Comp.-Aided Mol. Design. 2015, 29(12), 1087-108.

This project has received funding from the European Union’s Horizon 2020 research and innovation program under the Marie Skłodowska-Curie grant agreement No 676434, “Big Data in Chemistry” (“BIGCHEM”, http://bigchem.eu).