DOI: 10.1002/minf.201100163
Generative Topographic Mapping (GTM): Universal Tool for Data Visualization, Structure-Activity Modeling and Dataset Comparison N. Kireeva,[a, b] I. I. Baskin,[a, c] H. A. Gaspar,[a] D. Horvath,[a] G. Marcou,[a] and A. Varnek*[a]
Abstract: Here, the utility of Generative Topographic Maps (GTM) for data visualization, structure-activity modeling and database comparison is evaluated using subsets of the Directory of Useful Decoys (DUD). Unlike other popular dimensionality reduction approaches such as Principal Component Analysis, Sammon Mapping or Self-Organizing Maps, the great advantage of GTMs is that they provide data probability distribution functions (PDF), both in the high-dimensional
space defined by molecular descriptors and in 2D latent space. PDFs for the molecules of different activity classes were successfully used to build classification models in the framework of the Bayesian approach. Because PDFs are represented by a mixture of Gaussian functions, the Bhattacharyya kernel has been proposed as a measure of the overlap of datasets, which leads to an elegant method of global comparison of chemical libraries.
Keywords: Generative topographic maps · Dimensionality reduction · Manifold learning · Data visualization · Predicting activity profiles · Comparison of databases · Bhattacharyya kernel
[a] N. Kireeva, I. I. Baskin, H. A. Gaspar, D. Horvath, G. Marcou, A. Varnek, Laboratoire d'Infochimie, UMR 7177 CNRS, Université de Strasbourg, 4 rue B. Pascal, 67000 Strasbourg, France; *e-mail: [email protected]
[b] N. Kireeva, Institute of Physical Chemistry and Electrochemistry RAS, Leninsky pr-t 31, 119991 Moscow, Russian Federation
[c] I. I. Baskin, Department of Chemistry, Lomonosov Moscow State University, 119991 Moscow, Russian Federation

1 Introduction

Chemography[1] is a relatively new field dealing with the visualization of chemical data, the representation of chemical space and navigation in this space.[2] It may help chemists to choose compounds to be purchased or synthesized in order to enrich "in-house" databases, to select subsets for screening campaigns, and to assess the overlap of different databases. The main problem of visualization of high-dimensional data concerns their representation in two or three dimensions with minimal information loss. Comparative analyses of various dimensionality reduction techniques and their use in drug discovery are given in books[3–4] and reviews.[5–7] Among other techniques, Principal Component Analysis (PCA),[8] Multidimensional Scaling (MDS),[9–10] Self-Organizing Maps (SOMs),[11] Sammon Mapping (SM),[12] Stochastic Proximity Embedding (SPE),[13–15] and Stochastic Neighbor Embedding (SNE)[16–17] are the most commonly used. However, they have some limitations. PCA can efficiently be applied to huge datasets with linearly dependent features, but it is less effective for nonlinear data distributions;[7] as a consequence, this approach fails to represent the cluster structure of vast multidimensional data.[18] MDS is also a linear technique, which, for the case of Euclidean distances, gives results equivalent to PCA.[19] SM, SPE and SNE have no explicit mapping function and therefore do not allow one to place new data on an already existing map; in that case, a new map must be rebuilt from scratch.[20] Besides, the calculation and storage of all inter-point distances are required, which imposes severe restrictions on many practical applications dealing with large amounts of data or incremental data flow. The SOM approach has no well-defined objective function to be optimized in the course of training[21–24] and, therefore, no theoretical framework to prove its convergence and to select the method's parameters can be defined. This leads to some ambiguity in the selection of the "best" SOM. Although SM, SPE and SNE possess such an objective function, its optimization is either complex and inefficient (SM) or heuristic with no theoretically proven guarantee of convergence (SPE and SNE). It should also be pointed out that there is a serious (although not obvious and seldom discussed in the literature) drawback shared by all afore-mentioned methods: none of them is probabilistic (although the notion of probability is present in the formulation of SNE), i.e. the corresponding models do not define a probability density function for the data distribution. As a consequence, it is rather difficult or even impossible to assess the robustness of the information contained in such data maps. For example, an assignment
of a compound to a SOM node can be chemically meaningful (if the node is populated by its analogues) or be an artifact. The latter typically happens for structural outliers, which must nevertheless be assigned somewhere, or for molecules almost equidistant to different neurons, which may therefore "occasionally" be assigned to an irrelevant neuron. The Generative Topographic Mapping approach (GTM)[21–22,24–25] not only overcomes all the above-mentioned drawbacks, but offers some additional advantages resulting from the rigorous probabilistic character of its 2D maps. Applications of GTM to the visualization of chemical data have recently been discussed.[18,26] Thus, Maniyar et al.[18] described GTM using magnification factor mapping and directional curvature plots, and introduced a hierarchical GTM which arranges a set of GTMs in a tree structure. Owen et al.[26] considered two modifications of the GTM method – the Latent Trait Model (LTM) and the Linear Latent Trait Model (LTM-LIN) – which are specially tailored to deal with binary data, such as molecular fingerprints. Several useful parameters to compare the quality of the data visualization provided by different maps were suggested.[26] At the same time, all cited publications consider GTM as a data visualization tool only and do not discuss other possible areas of application arising from the probability density function (PDF) assessed with GTM.

This article demonstrates how GTM can successfully be used in structure-property modeling and dataset comparison. First, we show how predictive classification models can directly be built from the PDFs of different classes. Then, we describe how the Bhattacharyya kernel[27–28] calculated on PDFs can be used for dataset comparison. Finally, some additional means for data visualization combining GTM and PCA techniques are discussed. The suggested approaches have been applied to compounds taken from the Directory of Useful Decoys (DUD),[29] which is often used for the validation of new methods.
2 GTM Methodology

2.1 GTM Build-up

In contrast to the PCA, MDS, SM, SPE, SNE and SOM approaches, in which the data from the initial multidimensional space are projected into 2D space, in GTM the data in the initial space are generated from objects situated in the 2D space. According to probability theory, any stochastic data generation can be viewed as a sampling of a random variable that describes a data distribution law by means of a probability distribution function.[30] Dimensionality reduction is treated in a generative model by means of L-dimensional latent variables x (usually L = 2), from which the points in the D-dimensional data space (D > L) can be produced through a transformation y carried out by an RBF neural network:

y := y(x; W) = W\,\phi(x) \qquad (1a)

y_d := y_d(x; W) = \sum_{h=1}^{H} W_{hd}\,\phi_h(x) = \sum_{h=1}^{H} W_{hd}\,\exp\!\left(-\frac{\|x - x_h\|^2}{2\sigma}\right) \qquad (1b)

Here, d runs from 1 to D, W is the weight matrix providing the connections between the H hidden and D output units, φ(x) are the Gaussian activation functions of the hidden units, which can be viewed as the basis functions for this non-linear transformation, and x_h is the center of the h-th radial basis function in the latent space. The centers of these Gaussian basis functions form a square grid in the latent space; their number H and variance σ are parameters of the method. Each point can be mapped from the low-dimensional latent space into the high-dimensional data space using transformation (Eq. 1). The images of this mapping form, in the D-dimensional data space, a continuous set (manifold) with intrinsic dimensionality L. For L = 2 such a manifold can be associated with a flexible "rubber sheet" hovering over the data space (Figure 1). The smoothness of the manifold is controlled by the smoothness of the transformation function (Eq. 1), which, in turn, depends on the value of the parameter σ: larger σ corresponds to smoother and flatter manifolds. Due to the smooth and continuous nature of the transformation function (Eq. 1), neighboring points in the latent space remain neighbors in the data space.

The GTM is often viewed as a probabilistic extension of the SOM.[21–22] Like SOM, it operates with a grid of K nodes in the latent space, which can be considered as analogs of SOM nodes. The algorithm of data generation contains three steps: (i) random selection of one of the nodes in the latent space, (ii) mapping the chosen node onto the manifold in the data space, and (iii) addition of random noise to the point mapped on the manifold. The latter is achieved by associating the given point with the center of an isotropic normal distribution with inverse variance β. This three-step process corresponds to the sampling of a random variable t with the following probability density function:

p(t \mid W, \beta) = \frac{1}{K}\sum_{k=1}^{K} p(t \mid x_k; W, \beta) = \frac{1}{K}\sum_{k=1}^{K} \left(\frac{\beta}{2\pi}\right)^{D/2} \exp\!\left(-\frac{\beta}{2}\,\|t - y(x_k; W)\|^2\right) \qquad (2)
where k is the ordinal number of the selected node, x_k are its coordinates in the latent space, y(x_k; W) are the coordinates of its image in the data space, while the D-dimensional random variable t spans the whole dataset and can, in principle, match any data point. Let a dataset consist of N examples, the n-th point being located at position t_n in the data space. In this case, the correspondence between a GTM model and the data can be measured by the probability with which the data could be generated (sampled) from t.
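To make the generative construction concrete, the following minimal numpy sketch (an illustration under stated assumptions, not the Netlab implementation used in this work) builds a square grid of K latent nodes and a grid of H Gaussian RBF centers, maps the latent nodes into the data space according to Equation 1, and evaluates the mixture density of Equation 2 for a descriptor vector t; the weight matrix W and the inverse noise variance β are assumed to come from an already fitted model (random values are used here only to make the snippet runnable).

```python
import numpy as np

def latent_grid(n):
    """K = n*n nodes on a square grid in [-1, 1]^2 of the latent space."""
    g = np.linspace(-1.0, 1.0, n)
    return np.array([[a, b] for b in g for a in g])            # shape (n*n, 2)

def rbf_basis(x, centers, sigma):
    """phi_h(x) of Eq. 1b, evaluated for every latent point in x."""
    d2 = ((x[:, None, :] - centers[None, :, :]) ** 2).sum(-1)  # squared distances, shape (K, H)
    return np.exp(-d2 / (2.0 * sigma))                         # sigma plays the role of the RBF variance

def map_nodes(x_nodes, rbf_centers, W, sigma):
    """Eq. 1: images y(x_k; W) of the latent nodes in the D-dimensional data space."""
    return rbf_basis(x_nodes, rbf_centers, sigma) @ W          # (K, H) @ (H, D) -> (K, D)

def mixture_density(t, Y, beta):
    """Eq. 2: p(t | W, beta) as an equally weighted mixture of K isotropic Gaussians."""
    K, D = Y.shape
    sq = ((t - Y) ** 2).sum(axis=1)                            # ||t - y(x_k; W)||^2 for every node
    norm = (beta / (2.0 * np.pi)) ** (D / 2.0)
    return np.mean(norm * np.exp(-0.5 * beta * sq))

# toy usage with random (i.e. untrained) parameters
rng = np.random.default_rng(0)
X_nodes, C_rbf = latent_grid(25), latent_grid(4)               # K = 625 nodes, H = 16 RBF centers
W = rng.normal(size=(C_rbf.shape[0], 20))                      # D = 20 (e.g. 20 principal components)
Y = map_nodes(X_nodes, C_rbf, W, sigma=2.0)
print(mixture_density(rng.normal(size=20), Y, beta=1.0))
```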
Figure 1. GTM describes the data probability distribution in the data space by means of a mixture of Gaussians situated on a manifold (a two-dimensional "rubber sheet") embedded in the high-dimensional data space so as to provide the best fit of the data distribution. Each Gaussian is generated by the non-linear transformation y(x,W) from a grid node (*) of the latent space; the transformation y(x,W) is performed by means of RBF functions located at the points marked by *. The GTM is the planar map resulting from the manifold's unbending. The location of the data on the map is defined by the probability distribution assessed by Equation 4. This presentation is inspired by Figure 2 in the original publication by Bishop et al.[21]
In GTM, the logarithm of this probability, called the log likelihood, can be considered as a function of two adjustable parameters, W and β:
L(W, \beta) = \sum_{n=1}^{N} \ln\left\{\frac{1}{K}\sum_{k=1}^{K} \left(\frac{\beta}{2\pi}\right)^{D/2} \exp\!\left(-\frac{\beta}{2}\,\|t_n - y(x_k; W)\|^2\right)\right\} \qquad (3)

The higher the value of the data log likelihood, the better the random variable t fits the data. Thus, the log likelihood is the objective function to be maximized during training. Usually, this is achieved using the Expectation-Maximization algorithm,[31] which guarantees convergence to a local maximum (see details in the literature[21]). The posterior probability p(x_k|t) that a given point t in the data space is generated from the k-th node (the so-called responsibility of the k-th node for data point t) is computed using Bayes' theorem:

p(x_k \mid t) = \frac{p(t \mid x_k; W, \beta)\,p(x_k)}{\sum_{k'=1}^{K} p(t \mid x_{k'}; W, \beta)\,p(x_{k'})} = \frac{\exp\!\left(-\frac{\beta}{2}\,\|t - y(x_k; W)\|^2\right)}{\sum_{k'=1}^{K} \exp\!\left(-\frac{\beta}{2}\,\|t - y(x_{k'}; W)\|^2\right)} \qquad (4)

Equation 4 shows that, unlike SOM, in which only a single unit (node) is "responsible" for a data point, in GTM all nodes share such a "responsibility". This means that a given data point t has a non-zero probability to be mapped (assigned) to any node, according to the responsibility of the latter. Hence, the GTM method can be considered as a fuzzy analog of SOM, with the membership function for mapping data to different units expressed by the node responsibilities. In that case, the mean position x_mean(t) in the latent space is calculated by averaging the coordinates of all nodes, taking the responsibilities as weighting factors:

x_{mean}(t) = \sum_{k=1}^{K} x_k\, p(x_k \mid t) \qquad (5)
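As an illustration of Equations 4 and 5, the hedged numpy sketch below computes the node responsibilities for a set of compounds and projects each compound onto the 2D map at its responsibility-weighted mean position; the node images Y = y(x_k; W), the latent grid X_nodes and β are assumed to come from an already trained GTM, such as the one sketched above.

```python
import numpy as np

def responsibilities(T, Y, beta):
    """Eq. 4: posterior probabilities p(x_k | t_n) for N compounds T (N, D) and K node images Y (K, D)."""
    d2 = ((T[:, None, :] - Y[None, :, :]) ** 2).sum(-1)   # squared distances, shape (N, K)
    log_r = -0.5 * beta * d2
    log_r -= log_r.max(axis=1, keepdims=True)             # numerical stabilisation before exponentiation
    R = np.exp(log_r)
    return R / R.sum(axis=1, keepdims=True)                # each row sums to 1

def mean_latent_positions(T, Y, X_nodes, beta):
    """Eq. 5: responsibility-weighted mean position of each compound on the 2D map."""
    return responsibilities(T, Y, beta) @ X_nodes          # (N, K) @ (K, 2) -> (N, 2)
```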
2.2 GTM-Based Classification Models
The activity profile of a chemical compound can be assessed starting from the values of the class-conditional probability distribution functions p(t|Ck) computed for each class Ck, where t is its molecular descriptor vector. Such a function can be built, for each activity class, by training a separate GTM model on the data belonging to class Ck. The class-conditional probabilities p(t|Ck) can be used to compute the posterior probabilities of class membership P(Ck|t) for a given compound using Bayes' theorem:

P(C_k \mid t) = \frac{p(t \mid C_k)\, P(C_k)}{p(t)} \qquad (6)

where P(Ck) = Nk/Ntot is the prior probability of class membership (Nk is the number of compounds belonging to class Ck; Ntot is the total number of compounds), whereas p(t), the marginal probability density function, is the normalization factor:

p(t) = \sum_{k} p(t \mid C_k)\, P(C_k) \qquad (7)
The latter ensures that the estimated posterior probabilities are normalized. By applying Equation 6 to each class Ck, one can assess the posterior probability of class membership for each compound. According to statistical decision theory,[19] the optimal class assignment is determined by the maximal value of the posterior class probabilities P(Ck|t). The performance of classification models can be measured by the Balanced Accuracy (BA)

BA = \frac{0.5\,tp}{tp + fn} + \frac{0.5\,tn}{tn + fp} \qquad (8)
calculated from the numbers of true positives (tp), true negatives (tn), false positives (fp) and false negatives (fn) assessed in cross-validation. Another useful measure of the predictive performance of GTM classification models is the area under the ROC curve (ROC AUC). The latter plots the true positive rate vs. the false positive rate and can easily be obtained by thresholding the values of the posterior class probabilities P(Ck|t). Two strategies for the development of GTM-based classification models can be considered: for a given set of objects of different classes, one can build either a single multi-class classification model or an ensemble of binary classification models. Here, the first strategy has been used to classify the actives against 10 different DUD targets, whereas the second one has been applied to the actives/decoys classification models.
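The sketch below illustrates how the class-conditional densities of Equations 6 and 7 could be combined into a Bayes classifier and scored with the balanced accuracy of Equation 8. It is a generic sketch rather than the authors' Netlab code: class_densities stands for any list of per-class GTM density functions p(t|Ck) (for instance, the mixture_density sketch above with per-class parameters), and the labels 1 (active) and 0 (decoy) are an assumed encoding.

```python
import numpy as np

def bayes_posteriors(t, class_densities, priors):
    """Eqs. 6-7: posterior class probabilities P(C_k | t) from class-conditional PDFs and priors."""
    p = np.array([density(t) for density in class_densities]) * np.array(priors)
    return p / p.sum()                                   # division by the marginal p(t)

def predict_class(t, class_densities, priors):
    """Optimal class assignment: the class with maximal posterior probability."""
    return int(np.argmax(bayes_posteriors(t, class_densities, priors)))

def balanced_accuracy(y_true, y_pred):
    """Eq. 8 for a binary problem with labels 1 (active) and 0 (decoy)."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    tp = np.sum((y_true == 1) & (y_pred == 1))
    tn = np.sum((y_true == 0) & (y_pred == 0))
    fn = np.sum((y_true == 1) & (y_pred == 0))
    fp = np.sum((y_true == 0) & (y_pred == 1))
    return 0.5 * tp / (tp + fn) + 0.5 * tn / (tn + fp)
```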
2.3 Comparison of Databases with Bhattacharyya Kernel

One of the best-known measures of dissimilarity between two probability distributions p and p' is the Kullback-Leibler (KL) divergence.[32] However, the KL divergence is not symmetric and, therefore, can hardly be considered as a distance measure between two probability distributions. Here, we propose to use the Bhattacharyya kernel[27–28] (or B-kernel) to assess the overlap between two PDFs built with GTM and representing two different datasets:

B\text{-kernel} = \int \sqrt{p(t)}\,\sqrt{p'(t)}\, dt \qquad (9)
For PDFs expressed by mixtures of Gaussian functions, the B-kernel can be computed analytically using the formulas derived by Jebara et al.[27–28] Here, for convenience, we use its negative logarithm, LBK = −log(B-kernel), which fits the meaning of a "distance" better than the B-kernel itself.
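A possible implementation is sketched below under explicit assumptions: for two GTMs sharing the same inverse noise variance β, every pair of mixture components is a pair of isotropic Gaussians with identical covariance, for which the Bhattacharyya overlap is exp(−β‖μ − μ'‖²/8); the sketch combines these pairwise overlaps with the component weights in the spirit of the mixture approximation of Jebara et al. It does not reproduce the exact formula used in the paper, so it should be read as an assumption-laden illustration only.

```python
import numpy as np

def lbk(Y1, Y2, beta, w1=None, w2=None):
    """Negative log Bhattacharyya kernel (LBK) between two GTM mixture densities.

    Y1, Y2 : (K1, D) and (K2, D) arrays of node images y(x_k; W) of the two GTMs;
    beta   : common inverse noise variance;
    w1, w2 : component weights (uniform 1/K by default).
    Pairwise-component approximation in the spirit of Jebara et al. (assumption)."""
    K1, K2 = Y1.shape[0], Y2.shape[0]
    w1 = np.full(K1, 1.0 / K1) if w1 is None else np.asarray(w1)
    w2 = np.full(K2, 1.0 / K2) if w2 is None else np.asarray(w2)
    d2 = ((Y1[:, None, :] - Y2[None, :, :]) ** 2).sum(-1)   # squared distances between component means
    bc = np.exp(-beta * d2 / 8.0)                           # Bhattacharyya overlap of equal-covariance Gaussians
    kernel = np.sqrt(w1)[:, None] * np.sqrt(w2)[None, :] * bc
    return -np.log(kernel.sum())
```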
3 Method

3.1 Data and Descriptors

Data sets were collected from the DUD (Directory of Useful Decoys) repository.[29] The whole database comprises 2950 active compounds against 40 targets and 95316 decoy compounds. The DUD dataset was drawn from ZINC,[33] a free database of commercially available compounds originally created for virtual screening. Ten datasets containing both active compounds and decoys were used: ache (inhibitors of human acetylcholinesterase), cox2 (cyclooxygenase-2 inhibitors), dhfr (dihydrofolate reductase inhibitors), egfr (inhibitors of the epidermal growth factor receptor), fgfr1 (fibroblast growth factor receptor 1 inhibitors), fxa (factor Xa inhibitors), p38 (inhibitors of mitogen-activated protein (MAP) kinases), pdgfrb (inhibitors of the beta-type platelet-derived growth factor receptor), src (tyrosine kinase inhibitors), and vegfr2 (inhibitors of the vascular endothelial growth factor receptor). Chemaxon Standardizer[34] and Instant JChem[35] were used for data preparation. Explicit hydrogen atoms have been removed and the structures have been aromatized. Instant JChem was used to remove duplicates from the data sets. Each data set was resampled in order to bring the proportion of active compounds in a set to an active/decoy ratio of 1/10 (see Table 1).

Table 1. Number of active and inactive compounds for the ten resampled DUD datasets after data normalization (Na: number of active compounds, Ndec: number of decoys).

#    Dataset   Na     Ndec
1    ache      106    1060
2    cox2      409    4090
3    dhfr      408    4080
4    egfr      427    4270
5    fgfr1     97     970
6    fxa       146    1460
7    pdgfrb    110    1100
8    p38       342    3420
9    src       49     490
10   vegfr2    76     760
     Total     2170   21700

ISIDA Property-Labeled Fragment descriptors (IPLF)[36] were used to encode molecular structures. They represent selected subgraphs in which atom vertices are colored with respect to some local property/feature. In our study we have used two types of IPLF descriptors: atom-centered fragments (augmented atoms) colored either by pH-dependent pharmacophores (pharmacophore-based fragment descriptors) or by atom symbols (structure-based fragment descriptors). Augmented atoms of radius 1 to 3 have been considered.

3.2 Computational Procedure

The basic implementation of the GTM approach has been taken from the Netlab package (a MATLAB toolbox for neural networks and pattern recognition).[37–38] Additional MATLAB procedures have been written in order to apply GTM to the comparison of databases and the prediction of activity profiles. PCA has been used as a pre-processing step in the model development because, in the Netlab implementation of GTM, the dimensionality of the data should not be too high. In order to visualize the manifold and the data in 3D, the three first principal components have been used.
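The pre-processing step can be reproduced with any standard PCA implementation; the scikit-learn sketch below (an assumption about tooling, since the original work used MATLAB/Netlab) reduces a descriptor matrix to 20 principal components for GTM training and keeps the first three components for the 3D visualization of the manifold and the data.

```python
import numpy as np
from sklearn.decomposition import PCA

def pca_preprocess(X, n_components=20):
    """Project an (N, D) descriptor matrix onto its first principal components."""
    pca = PCA(n_components=n_components)
    scores = pca.fit_transform(X)          # (N, n_components) input for GTM training
    coords_3d = scores[:, :3]              # first three PCs for 3D manifold/data visualization
    return scores, coords_3d, pca

# example with random data standing in for IPLF descriptor vectors
X = np.random.default_rng(1).normal(size=(500, 120))
scores, coords_3d, pca = pca_preprocess(X)
print(scores.shape, coords_3d.shape)
```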
3.2.1 Grid Optimization of Selected GTM Parameters
There are several parameters to be adjusted in GTM design: the number of RBF basis functions, the number of latent grid points, and several hyper-parameters: the inverse variance of the prior over the weights (α), the inverse noise variance (β) and the width of the Gaussian basis functions (σ). A grid search has been performed to adjust these parameters.
Figure 2. GTM (left) and the corresponding manifold visualized in the data space approximated by the three first principal components (right) for the all-ligands dataset. Each point on the map corresponds to a single compound. All activity classes are shown in the top view, whereas the bottom view shows only the cox2 ligands. Model information: pharmacophore-based IPLF descriptors; 20 principal components; resolution 25 × 25; number of RBF centers = 16; α = 0.1, σ = 2.
For the mapping of new compounds, several criteria for assessing GTM performance have been used: visual inspection, the complexity of the manifold shape and the preservation of the topography between the data space and the latent space, as well as the value of the log likelihood function. For activity-labeled compounds, we also considered the discriminatory power of the GTM as a quality criterion. Here, parameters were selected by optimizing the Balanced Accuracy obtained in 5-fold external cross-validation (5-CV). The calculations were performed both on the 10 DUD actives/decoys datasets and on a dataset containing the ensemble of all actives without decoys (the "all actives" dataset).
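Such a parameter selection can be organised as a plain grid search over the candidate settings. The sketch below is a generic skeleton rather than the authors' script: fit_and_score is a user-supplied, hypothetical callable that trains a GTM classifier with the given parameters on the training fold and returns predicted labels for the test fold, and scikit-learn's balanced_accuracy_score implements the balanced accuracy of Equation 8.

```python
import numpy as np
from itertools import product
from sklearn.model_selection import StratifiedKFold
from sklearn.metrics import balanced_accuracy_score

def grid_search_gtm(X, y, fit_and_score, grid, n_splits=5, seed=0):
    """Pick GTM hyper-parameters by maximising balanced accuracy in n-fold external CV.

    grid          : dict of candidate values, e.g. {'n_rbf': [9, 16, 25], 'sigma': [1, 2, 4], 'alpha': [0.01, 0.1]}
    fit_and_score : hypothetical callable (X_tr, y_tr, X_te, params) -> predicted labels for X_te
    """
    cv = StratifiedKFold(n_splits=n_splits, shuffle=True, random_state=seed)
    best_params, best_ba = None, -np.inf
    keys = sorted(grid)
    for values in product(*(grid[k] for k in keys)):
        params = dict(zip(keys, values))
        scores = []
        for tr, te in cv.split(X, y):
            y_pred = fit_and_score(X[tr], y[tr], X[te], params)
            scores.append(balanced_accuracy_score(y[te], y_pred))
        if np.mean(scores) > best_ba:
            best_ba, best_params = np.mean(scores), params
    return best_params, best_ba
```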
3.2.2 Comparison with Other Visualization Methods

PCA and SOM plots for the "all actives" dataset were calculated using the same descriptors as for GTM. The SOM plot, of the same size (25 × 25) as the GTM, has been calculated with R. The PCA plot in the two first principal components has been built with MATLAB.
4 Results and Discussion

4.1 Data Visualization
The GTM of the "all actives" dataset separates well most of the 10 classes of the DUD database; a few classes do, however, overlap (e.g., egfr and vegfr2 ligands: see Figure 2, top). This is hardly a surprise, since six of the ten targets are kinases, likely displaying some affinity for ligands formally associated with other (kinase) classes, and not only those. Recall, for example, that dihydrofolic acid, the natural ligand of dhfr, is also a nucleoside-like compound, not that dissimilar from ATP, featuring an aromatic heterocyclic base and negatively charged groups (carboxylates). In a pharmacophore pattern-based chemical space, partial overlaps of these sets of actives are actually expected. At the same time, one particular class may form several distinct clusters. This is demonstrated in Figure 2 (bottom) for the cox2 ligands, which form three clusters on the GTM. The complex curved shape of the manifold provides the best fit of the data distribution in the data space. Singletons are usually explained by their structural difference from the molecules in the clusters. For some singletons, GTM offers an interesting interpretation related to their multimodal distribution. In fact, the objects mapped onto the GTM may have several peaks of probability to be located in different parts of the map, which could be associated (but not necessarily) with different clusters. Typically, this is observed when the compound has fragments in common with other compounds located in different regions of the map.
Figure 3. Case of multimodal probability distribution. The query compound (cox2 ligand) has a probability to be located in three different areas of the map. This can be explained by similarity of its structure to the compounds located in these areas. Three zones – A, B and C – correspond to related clusters on Figure 2 (bottom). The maxima of posterior probability in these zones are shown by white crosses. The query position on the map is calculated by Equation 5.
This is illustrated in Figure 3 for a particular cox2 ligand whose probability density function (PDF) has three peaks located in different areas of the map. Roughly, they correspond to clusters A, B and C in Figure 2 (bottom). One can see that this compound shares some fragments with typical representatives of the above clusters. Its location on the map is defined as a weighted average over the three peaks of the PDF. This ligand is a kind of 'chimera' of other ligands and, as such, it was not merged with the 'best matching' ligand family but rather positioned as an outlier. In some cases the GTM interpretation is not straightforward, as demonstrated in Figure 4 for the dataset of egfr ligands and their decoys. Here, clusters A and B are well separated on the map. However, both visual inspection of the 3D view of the manifold (Figure 4, right) and the large value of the Tanimoto coefficient (Tc) averaged over all inter-cluster compound pairs (Tc = 0.91) show that they belong to one and the same big cluster. The separation of A and B on the map can be explained by the curved shape of the manifold, in which two corners are in close vicinity. On the other hand, a similar analysis shows that cluster C is well separated from A and B. We believe that the visualization of the data and the manifold in 3D using the three first principal components facilitates the interpretation of GTM and is complementary to magnification factor mapping and directional curvature plots.[18]
4.2 Comparison of Data Distributions
As suggested in the Method section, we used the negative logarithm of the B-kernel (LBK) to measure the distance between two particular data sets. Two opposite cases, overlapping and separated datasets, were considered. As expected, the distance between the strongly overlapping vegfr2 and egfr datasets (LBK = 2.78) is much shorter than that between the well separated ache and egfr datasets (LBK = 11.48). This is not surprising, taking into account the obvious molecular similarity of vegfr2 and egfr ligands (averaged inter-pair Tc = 0.94), on the one hand, and the dissimilarity of ache and egfr ligands (averaged inter-pair Tc = 0.47), on the other hand (Figure 5). The scatter plot of LBK vs. the Soergel distance (1 − Tc) for all 45 possible pairs of different DUD datasets (Figure 6) displays a very weak correlation between these two parameters. This can be explained by their different meaning: LBK measures the distance between two probability distributions, whereas Tc averages the pairwise similarities between particular compounds of the two datasets. Hence, a strong correlation between them can be expected only for very small values of the Gaussian noise variance β⁻¹, i.e. for a perfect fit of the data by the manifold.
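For completeness, the Tanimoto-based measure discussed above can be computed from binary fingerprints as follows; this sketch assumes plain numpy 0/1 fingerprint matrices (one row per compound), averages the Tanimoto coefficient over all inter-set compound pairs, and reports the corresponding Soergel distance 1 − Tc.

```python
import numpy as np

def mean_inter_tanimoto(F1, F2):
    """Average Tanimoto coefficient over all pairs (i in set 1, j in set 2) of binary fingerprints."""
    F1, F2 = np.asarray(F1, dtype=float), np.asarray(F2, dtype=float)
    common = F1 @ F2.T                                      # |A AND B| for every inter-set pair
    union = F1.sum(1)[:, None] + F2.sum(1)[None, :] - common
    return float((common / np.maximum(union, 1.0)).mean())

def soergel_distance(F1, F2):
    """Soergel distance 1 - Tc averaged over all inter-set pairs."""
    return 1.0 - mean_inter_tanimoto(F1, F2)
```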
4.3 Predicting Activity Profiles
In this work, we consider molecules belonging to at least one of 10 different activity classes, being known strong inhibitors of one of the 10 considered enzymes and receptors (Table 1). Unfortunately, the DUD provides no information about the activity of compounds with respect to targets other than the one they were designed for. Therefore, each molecule is considered to be a member of only one of the 10 activity classes – formally labeled Ck – or, for decoy compounds, of none of these.
Figure 4. GTM for the egfr target considering ligands and decoys as two different classes: 2D map and manifold visualization. The numbers correspond to the Tanimoto coefficient averaged over all inter-cluster pairs of molecules. Both the 3D view and the structure similarity calculations show that A and B belong to one and the same cluster, whereas C is an individual cluster separated from A and B.
Figure 5. Examples of overlapping (top) and distinct (bottom) datasets extracted from the maps shown in Figure 2, with the corresponding manifolds (ache: black stars, egfr: magenta stars, vegfr2: blue open circles).
The performance of GTM in predicting activity profiles has been assessed according to Equations 5 and 6 and compared with the Naive Bayes approach implemented in Weka.[39] A five-fold external cross-validation (5-CV) procedure has been used. Two types of classification models have been obtained: (a) a multiclass model for the 10 types of biological activity, and (b) a set of 10 binary classification models, one for each ligand class against its decoys. The results shown in Figures 7 and 8 demonstrate the good performance of these models. On average, over all models of the two types, the Balanced Accuracy (BA) is about 0.9 and the ROC AUC varies from 0.94 to 0.97, both for pharmacophore-based and structure-based fragment descriptors (Figures 7 and 8). These results are similar to, or even outperform, those of the Naive Bayes (NB) models. For instance, for the ligand/decoys NB models involving structure-based descriptors, the BA averaged over all models is 0.78 (Figure 7) and the ROC AUC for 5 out of 10 models varies from 0.72 to 0.85 (Figure 8).

4.5 Comparison of GTM with SOM and PCA
The advantage of GTM over PCA and SOM for data visualization is clearly seen in Figure 9 for the "all actives" dataset.
Figure 6. LBK vs. Soergel distance for all possible 45 pairs composed of 10 DUD datasets under study.
Figure 7. GTM vs. Naive Bayes for predicting activity profiles: balanced accuracy of classification considering (a) ligands of one DUD class against ligands of the other 9 classes, (b) ligands and decoys as two different classes. The balanced accuracy was averaged over the models for all 10 DUD classes. Structure-based and pharmacophore-based IPLF descriptors were used (see Section 3.1).
On the PCA plot, the different classes of DUD actives are not well separated and tend to occupy a relatively small region at the center of the plot. On the SOM plot, the different DUD classes are, in general, well separated. However, the objects confined to a given node are not distinguished even if they belong to different activity classes; this may lead to wrong activity assessments for new compounds injected into an existing map. Last but not least, neither PCA nor SOM is a probabilistic method and, therefore, unlike GTM, they cannot be used directly to build classification models or to estimate the overlap between different data sets.
5 Conclusions

Generative Topographic Mapping (GTM) is known to be a powerful tool for manifold learning, dimensionality reduction and data visualization, which combines the advantages and, at the same time, overcomes the drawbacks of such popular visualization approaches as Kohonen Self-Organizing Maps, Principal Component Analysis, Sammon Mapping, Stochastic Proximity Embedding, and Stochastic Neighbor Embedding.
Figure 8. Prediction performance of GTM compared to that of Naive Bayes method. AUC values for GTM classification models considering: (a) ligands of one DUD class against ligands of the other 9 classes, (b) ligands and decoys as different classes.
In contrast to the above approaches, GTM explicitly defines the probability density function (PDF) of the data, which offers additional powerful means for data analysis. On the other hand, GTM is more time-consuming than PCA and is not as good as SM, SPE and SNE in preserving distances and proximities. This paper explores some opportunities of GTM in the chemoinformatics context. First, this concerns data visualization in the 2D latent space. Unlike PCA, MDS, SM, SPE, SNE and SOM, in which a molecule is represented by a single point in the latent space (assigned to one particular neuron in SOM), GTM calculates the probability distribution over the whole space. This provides additional means to give a simple "chemical" interpretation of the relative positions of objects on the map. As a complementary way to represent the manifold (a 2D "rubber sheet" embedded in the multidimensional data space), we suggested combining GTM with the representation of objects in the space of the three first principal components. It has been demonstrated that the probability distribution function built by GTM in the data space can be efficiently used to obtain classification models via Bayes' theorem. This opens an exciting opportunity to assess the biological activity profile of chemical compounds mapped on a GTM. Finally, we have described how the distance between different datasets can be evaluated using the Bhattacharyya kernel, which calculates the overlap of the corresponding PDFs.
Figure 9. GTM (a), PCA (b) and SOM (c) visualizations for the all-ligands dataset. Each point on the maps corresponds to a single compound.
Thus, GTM can be considered as a universal tool for data visualization, structure-activity modeling and database comparison.
Acknowledgements NK and IIB thank GDRI SupraChem, the ARCUS “AlsaceRussia/Ukraine” Project, and the French Embassy in Russia for support.
References

[1] T. I. Oprea, J. Gottfries, J. Comb. Chem. 2001, 3, 157 – 166.
[2] A. Varnek, I. I. Baskin, Mol. Inf. 2011, 30, 20 – 32.
[3] J. A. Lee, M. Verleysen, Nonlinear Dimensionality Reduction, Springer, New York, 2007.
[4] A. N. Gorban, B. Kégl, D. C. Wunsch, A. Zinovyev, Principal Manifolds for Data Visualisation and Dimension Reduction, Springer, Heidelberg, 2007.
[5] Y. A. Ivanenkov, E. V. Bovina, K. V. Balakin, Russ. Chem. Rev. 2009, 78, 465 – 483.
[6] Y. A. Ivanenkov, N. P. Savchuk, S. Ekins, K. V. Balakin, Drug Discov. Today 2009, 14, 767 – 775.
[7] K. V. Balakin, Pharmaceutical Data Mining. Approaches and Applications for Drug Discovery, Wiley, New Jersey, 2010.
[8] I. T. Jolliffe, Principal Component Analysis, 2nd ed., Springer, New York, 2002.
[9] J. B. Kruskal, Psychometrika 1964, 29, 1 – 27.
[10] J. B. Kruskal, Psychometrika 1964, 29, 115 – 129.
[11] T. Kohonen, Self-Organizing Maps, Springer, Heidelberg, 2001.
[12] J. W. Sammon, IEEE Trans. Computer 1969, 18, 401 – 409.
[13] D. K. Agrafiotis, H. Xu, Proc. Natl. Acad. Sci. USA 2002, 99, 15869 – 15872.
[14] D. K. Agrafiotis, J. Comp. Chem. 2003, 24, 1215 – 1221.
[15] D. N. Rassokhin, D. K. Agrafiotis, J. Mol. Graphics Modell. 2003, 22, 133 – 140.
[16] G. E. Hinton, S. T. Roweis, in Advances in Neural Information Processing Systems, Vol. 15, The MIT Press, Cambridge, MA, USA, 2002, pp. 833 – 840.
[17] M. Reutlinger, W. Guba, R. E. Martin, A. I. Alanine, T. Hoffmann, A. Klenner, J. A. Hiss, P. Schneider, G. Schneider, Angew. Chem. 2011, 123, 11837 – 11840.
[18] D. M. Maniyar, I. T. Nabney, B. S. Williams, A. Sewing, J. Chem. Inf. Model. 2006, 46, 1806 – 1818.
[19] C. M. Bishop, Pattern Recognition and Machine Learning, Springer, New York, 2006.
[20] H. Yin, in Intelligent Data Engineering and Automated Learning (Eds: J. Liu, Y. Cheung, H. Yin), Springer, Heidelberg, 2003, pp. 377 – 388.
[21] C. M. Bishop, M. Svensén, C. K. I. Williams, Neural Comput. 1998, 10, 215 – 234.
[22] C. M. Bishop, M. Svensén, C. K. I. Williams, Technical Report, Neural Computing Research Group, 1997.
[23] E. Erwin, K. Obermayer, K. Schulten, Biol. Cybernetics 1992, 67, 47 – 55.
[24] J. F. M. Svensén, PhD Thesis, Aston University (UK), 1998.
[25] C. M. Bishop, M. Svensén, C. K. I. Williams, Neurocomputing 1998, 21, 203 – 224.
[26] J. R. Owen, I. T. Nabney, J. L. Medina-Franco, F. López-Vallejo, J. Chem. Inf. Model. 2011, 51, 1552 – 1563.
[27] T. Jebara, R. Kondor, Lect. Notes Comput. Sci. 2003, 2777, 57 – 71.
[28] T. Jebara, R. Kondor, A. Howard, J. Mach. Learn. Res. 2004, 5, 819 – 844.
[29] N. Huang, B. K. Shoichet, J. J. Irwin, J. Med. Chem. 2006, 49, 6789 – 6801.
[30] E. T. Jaynes, Probability Theory. The Logic of Science, Cambridge University Press, Cambridge, 2003.
[31] A. P. Dempster, N. M. Laird, D. B. Rubin, J. Roy. Stat. Soc. B Met. 1977, 39, 1 – 38.
[32] S. Kullback, K. P. Burnham, N. F. Laubscher, G. E. Dallal, L. Wilkinson, D. F. Morrison, M. W. Loyer, B. Eisenberg, Am. Statistician 1987, 41, 340 – 341.
[33] J. J. Irwin, B. K. Shoichet, J. Chem. Inf. Model. 2005, 45, 177 – 182.
[34] Chemaxon Standardizer, http://www.chemaxon.com/library/scientific-presentations/standardizer/.
[35] Instant JChem, www.chemaxon.com/products/instant-jchem/.
[36] F. Ruggiu, G. Marcou, A. Varnek, D. Horvath, Mol. Inf. 2010, 29, 855 – 868.
[37] http://www1.aston.ac.uk/eas/research/groups/ncrg/resources/netlab/.
[38] I. T. Nabney, NETLAB: Algorithms for Pattern Recognition, Springer, London, 2002.
[39] M. Hall, E. Frank, G. Holmes, B. Pfahringer, P. Reutemann, I. H. Witten, SIGKDD Explorations 2009, 11.
Received: December 15, 2011 Accepted: February 29, 2012