Applied Soft Computing 10 (2010) 170–182
Genetic programming for QSAR investigation of docking energy

Francesco Archetti a,b, Ilaria Giordani a,c, Leonardo Vanneschi a,*

a Dipartimento di Informatica, Sistemistica e Comunicazione (D.I.S.Co.), University of Milano-Bicocca, via Bicocca degli Arcimboldi 8, 20126 Milan, Italy
b Consorzio Milano Ricerche, 20126 Milan, Italy
c DELOS Srl, 20091 Bresso (Milan), Italy
Article history: Received 21 February 2008; received in revised form 19 June 2009; accepted 28 June 2009; available online 5 July 2009.

Abstract
Statistical methods, and in particular Machine Learning, have been increasingly used in the drug development workflow to accelerate the discovery phase and to eliminate possible failures early during clinical developments. In the past, the authors of this paper have been working specifically on two problems: (i) prediction of drug induced toxicity and (ii) evaluation of the target–drug chemical interaction based on chemical descriptors. Among the numerous existing Machine Learning methods and their applications to drug development (see for instance [F. Yoshida, J.G. Topliss, QSAR model for drug human oral bioavailability, Journal of Medicinal Chemistry 43 (2000) 2575–2585; H. Fröhlich, J. Wegner, F. Sieker, A. Zell, Kernel functions for attributed molecular graphs—a new similarity based approach to ADME prediction in classification and regression, QSAR and Combinatorial Science 38(4) (2003) 427–431; C.W. Andrews, L. Bennett, L.X. Yu, Predicting human oral bioavailability of a compound: development of a novel quantitative structure–bioavailability relationship, Pharmacological Research 17 (2000) 639–644; J. Feng, L. Lurati, H. Ouyang, T. Robinson, Y. Wang, S. Yuan, S.S. Young, Predictive toxicology: benchmarking molecular descriptors and statistical methods, Journal of Chemical Information and Computer Sciences 43 (2003) 1463–1470; T.M. Martin, D.M. Young, Prediction of the acute toxicity (96-h LC50) of organic compounds to the fathead minnow (Pimephales promelas) using a group contribution method, Chemical Research in Toxicology 14(10) (2001) 1378–1385; G. Colmenarejo, A. Alvarez-Pedraglio, J.L. Lavandera, Chemoinformatic models to predict binding affinities to human serum albumin, Journal of Medicinal Chemistry 44 (2001) 4370–4378; J. Zupan, J. Gasteiger, Neural Networks in Chemistry and Drug Design: An Introduction, 2nd edition, Wiley, 1999]), we have been specifically concerned with Genetic Programming. A first paper [F. Archetti, E. Messina, S. Lanzeni, L. Vanneschi, Genetic programming for computational pharmacokinetics in drug discovery and development, Genetic Programming and Evolvable Machines 8(4) (2007) 17–26] has been devoted to problem (i). The present contribution aims at developing a Genetic Programming based framework on which to build specific strategies which are then shown to be a valuable tool for problem (ii). In this paper, we use target estrogen receptor molecules and genistein based drug compounds. Being able to precisely and efficiently predict their mutual interaction energy is a very important task: for example, it may have an immediate relationship with the efficacy of genistein based drugs in menopause therapy and also as a natural prevention of some tumors. We compare the experimental results obtained by Genetic Programming with those of a set of "non-evolutionary" Machine Learning methods, including Support Vector Machines, Artificial Neural Networks, Linear and Least Square Regression. Experimental results confirm that Genetic Programming is a promising technique from the viewpoint of the accuracy of the proposed solutions, of the generalization ability and of the correlation between predicted data and correct ones.

© 2009 Elsevier B.V. All rights reserved.
Keywords: Genetic Programming; Machine learning; Regression; Docking energy; Computational biology; Drug design; QSAR
* Corresponding author. Tel.: +39 02 64487874; fax: +39 02 64487805. E-mail address: [email protected] (L. Vanneschi).
doi:10.1016/j.asoc.2009.06.013

1. Introduction

The goal of this paper is to investigate the usefulness of Genetic Programming (GP) [9,10] for automatically generating the underlying functional relationship between a set of molecular descriptors of drug-like compounds and their value of the interaction, or docking, energy with a particular estrogen receptor. Being able to develop automatic computer systems that successfully and efficiently predict the mutual interaction energy between drug-like compounds and estrogen receptors would have a great impact, given that this interaction energy has an immediate relationship with the efficacy of those drugs. GP is an evolutionary approach which extends the genetic model of learning to the space of programs. It is a major variation of
Genetic Algorithms [11,12] in which the evolving individuals are themselves computer programs instead of fixed length strings from a limited alphabet of symbols. In the last few years, GP has become more and more popular for biomedical and pharmacokinetic applications. In particular, GP has recently been used to mine large datasets with the goal of automatically generating the underlying (hidden) functional relationship between data and correlating the behavior of latent features with some interesting pharmacokinetic parameters bound to drug activity patterns. For instance, in [13] GP has been used to classify drug-like molecules in terms of their bioavailability, in [14] it has been used with mutual information methods for analyzing complex molecular data, in [8] it has been used for quantitative prediction of drug induced toxicity and in [15] it has been applied to cancer expression profiling data to select features and build molecular classifiers by mathematical integration of genes. GP can be regarded as an optimization method which makes no assumption on the objective functions and data. Furthermore, as pointed out in [8] and explained in detail later in this paper, GP often automatically performs a feature selection, maintaining in the population expressions that use subsets of the data. Thus, the motivation behind our choice of investigating the usefulness of GP for assessing large biomedical datasets is twofold. First, biological/chemical data are not independent of each other: it has been verified that in most complex biochemical systems, small subsets of components work in cohesion [16]. These phenomena lead to high multi-dependency among the features. Hence, the underlying algorithm should make no assumption on the inter-dependencies between the different variables, and it should be capable of extracting the underlying features governing the biochemical reactions from high-dimensional correlated data. Second, the dimensionality of the feature space in biomedical datasets is normally much higher than the number of observations available for training. Hence, automatic feature selection as well as other methods to handle overfitting and minimize the generalization error should be encouraged.

Pharmacokinetics prediction tools are usually based on two approaches: molecular modelling, which uses intensive protein structure calculations, and data modelling. Methods based on data modelling are widely reported in the literature; they all belong to the category of Quantitative Structure Activity Relationship (QSAR) models [17] and they are adopted in the present work. To quantify the real usefulness of GP for the presented application, experimental results are compared with those of a set of well-known Machine Learning (ML) methods, including Support Vector Machines (SVM), Artificial Neural Networks, Linear and Least Square Regression. They will be referred to as "non-evolutionary" methods for simplicity. This paper is structured as follows: Section 2 discusses previous and related work; in Section 3 we describe the method employed to build the dataset used in our experiments; Section 4 briefly describes the non-evolutionary ML methods used in this paper and discusses their experimental results on our dataset; in Section 5 we introduce the different versions of GP that we have tested in this work and we discuss their experimental results; Section 6 contains the description of a method to improve GP results for the studied problem; finally, Section 8 concludes the paper and offers hints for future research.

2. Previous and related work

As outlined above, the goal of this paper is to investigate the usefulness of GP in generating the hidden relationship between molecular descriptors and docking energy. Virtual molecular docking represents a basic step in rational drug design. Its objective is to predict how a macromolecule (typically a protein or nucleic acid) interacts with other molecules, called "ligands" (possibly other proteins, peptides or small drug-like molecules), by calculating their interaction energy in some particular positions. Considerable efforts have been directed toward understanding this process and optimizing it by computer simulations using many different computational methods, including Evolutionary Algorithms; see for instance [18–22]. We do not analyze all these contributions in detail here, because this paper does not present a docking application but a QSAR approach, where docking energy values are used as the target. Also, many software environments for molecular docking have been developed and commercialized. For the sake of brevity, here we only quote [23–25] and the DELOS software platform [26], which has been recently developed and which we have used to build our dataset, as described in Section 3. For a more detailed survey and discussion of the numerous existing software environments for docking optimization see for instance [19]. In the present work, we choose a particular macromolecule and particular ligands. As ligands, we use a set of drug-like compounds belonging to the genistein family. Genistein (genesteina or genista tinctoria) is an isoflavone (C15H10O5) found especially in soybeans which has been shown in laboratory experiments to be effective as a natural prevention of some tumors. As a macromolecule, we have used the estrogen receptor ERα, a member of the nuclear hormone family of intracellular receptors, which is activated by 17β-estradiol. The important effects of genistein on estrogen receptors are pointed out in many contributions; see for instance [27,28]. Many contributions have appeared to date using ML methods for training QSAR models. For instance, fuzzy adaptive Least Squares are used in [1], GAs and Self Organizing Maps are used in [29], SVM are used in [2], various kinds of multivariate and Partial Least Square Regressions are used in [3], recursive partitioning and Partial Least Square regression have been tested in [4], multivariate Linear Regression and Artificial Neural Networks have been applied in [5], and a technique called Genetic Function Approximation has been proposed in [6]. Artificial Neural Networks, often used for QSAR [7], are frequently integrated in existing commercial packages developed by software vendors involved in the field of molecular modelling. Some of these tools are analyzed in [30]. Among the leaders in this field, Accelrys Inc. [31] and Pharma Algorithms Inc. [32] essentially provide black-box mathematical models and/or Data Mining tools that can be used to build new predictors. In the last few years, GP has become popular for QSAR modelling and related biomedical applications. For instance, in [13] GP is used to classify molecules in terms of their bioavailability; in [14] it has been used together with mutual information methods for analyzing QSAR data; in [8] it is used for quantitative prediction of drug induced toxicity, and in [15] it is applied to cancer expression profiling data to select feature genes and build molecular classifiers. To the best of our knowledge, the present contribution represents the first effort to develop a QSAR model for docking energy assessment using GP. This work is inspired by [27], where the goal was to discover new compounds that display the benefits of estrogens while avoiding the risk of reproductive tissue cancer. Its authors applied Virtual High-Throughput Screening, based on docking simulations, for the identification of new possible selective receptor compounds and discovered good values of the docking energy when some genistein molecules were used with the ERα estrogen receptor.

3. Dataset

We have collected from the RCSB PDB database [33] a small set of estrogen–genistein virtual molecules. Successively we
have defined substitution points on which we have clasped a small database of substituents (OH, CH3, CH2CH3, CH2OH, CH2CH2OH, CH2CH2NH2, OCH2CH2NH2), obtaining a set of 992 genistein based virtual molecules. The resulting chemical structures were then optimized by means of molecular mechanics using the MOE software [34] and the MMFF94 force field [35] for calculating 267 molecular descriptors. Finally, for each one of these ligands, we have calculated their docking energy value by means of the DELOS software platform [26], an environment for effective virtual screening and docking simulations recently produced by the Discovery and Lead Optimization Systems company (Bresso, Italy). The resulting dataset was composed of 992 genistein based molecules, each of which is represented by a vector of 267 molecular descriptors and with known values of the docking energy. It can be downloaded from the web page: http://personal.disco.unimib.it/Vanneschi/Docking.htm. Our dataset is a matrix H = [H(i, j)] of 992 rows and 268 columns, where each line i represents a molecule whose known docking energy value has been placed at position H(i, 268). In this way, the last column of matrix H contains all the known docking energy values. Our task is now to generate a mapping F such that F(H(i, 1), H(i, 2), …, H(i, 267)) = H(i, 268) for each line i in the dataset. Of course, we also want F to have a good generalization ability, i.e. to be able to assess the docking energy value for new drug-like compounds that have not been used in the training phase. For this reason, we use a set of ML techniques, and in particular GP, as discussed in Sections 4–6. A random splitting of the dataset is performed before model construction, by partitioning it into a training and a test set: 70% of the molecules are randomly selected with uniform probability and inserted into the training set, while the remaining 30% form the test set.
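The dataset layout and the random 70/30 split just described (together with the standardization and normalization detailed in the rest of this section) can be sketched in a few lines of NumPy. This is not the authors' code: the matrix below is random stand-in data with the same shape as the real H, not the actual descriptor values.

```python
import numpy as np

rng = np.random.default_rng(42)
# Random stand-in for the real 992 x 268 matrix H: 267 molecular
# descriptors per row plus the known docking energy in column 268.
H = rng.normal(size=(992, 268))

# Column-wise standardization: each column gets mean 0, std. dev. 1.
H_std = (H - H.mean(axis=0)) / H.std(axis=0)

# Min-max normalization into [0, 1], using the global minimum and
# maximum of the standardized matrix (as in the text, not per column).
H_norm = (H_std - H_std.min()) / (H_std.max() - H_std.min())

# Random 70/30 split: 695 training rows and 297 test rows.
idx = rng.permutation(H_norm.shape[0])
train, test = H_norm[idx[:695]], H_norm[idx[695:]]
X_train, y_train = train[:, :267], train[:, 267]
X_test, y_test = test[:, :267], test[:, 267]
```

With this split, X_train has shape (695, 267) and the last column of each partition plays the role of the docking energy target.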
In other words, matrix H is split into two matrices H(TRAIN) and H(TEST). The first one has 695 rows and 268 columns, whereas H(TEST) has 297 rows and 268 columns. Before applying the ML methods, we have also standardized and normalized all values in H. In particular, first of all, for each column in H we have performed a standardization by replacing each value H(i, j) with a new value H^std(i, j) such that:

H^std(i, j) = (H(i, j) − μ_j) / δ_j

where μ_j is the average of the values in the j-th column of H and δ_j is their standard deviation. In this way, each column in H^std has average equal to 0 and standard deviation equal to 1. Successively, we have normalized H^std by replacing each value H^std(i, j) with a new value H^norm(i, j) such that:

H^norm(i, j) = (H^std(i, j) − min^std) / (max^std − min^std)

where max^std and min^std are the maximum and minimum values in H^std, respectively. In this way, each value in H^norm falls in the range [0, 1]: the maximum value in H^std corresponds to 1 in H^norm and the minimum value in H^std corresponds to 0 in H^norm. Our experiments have been performed using the data in H^norm.

4. Non-evolutionary methods

To assess our dataset, we have used a set of regression methods. For simplicity, we partition them into two broad classes that we call non-evolutionary methods and GP methods. GP methods basically consist of some variants of the standard version of tree-based GP and will be described in the next sections. In this section, we present the non-evolutionary methods we have used and discuss their experimental results.
4.1. Brief introduction to the non-evolutionary methods

These methods are described here in a deliberately concise way, since they are well known and well established ML techniques. Furthermore, they have also been used after a pre-processing phase in which two well known feature selection algorithms have been employed; these are briefly described in the next paragraph. For more details on these methods and algorithms and their use, the reader is referred to the respective references quoted below.

4.1.1. Feature selection procedures

We adopted two attribute selection heuristics: Correlation Based Feature Selection (CorrFS) and Principal Component Based Feature Selection (PCFS) [36]. The central hypothesis in CorrFS is that good feature sets contain features highly correlated with the target, yet uncorrelated with each other. A feature evaluation formula, based on ideas from test theory, provides an operational definition of this hypothesis. The algorithm couples the evaluation formula with an appropriate correlation measure and a search strategy. In our experiments CorrFS performs a greedy forward search through the space of attribute subsets (no backward search has been executed). It starts with no attributes and stops when the addition of any remaining attribute results in a decrease in evaluation. PCFS transforms a number of (possibly) correlated variables into a (smaller) number of uncorrelated variables called principal components. The basic idea in PCFS is to find the components that explain the maximum amount of variance possible by n linearly transformed components. The first principal component accounts for as much of the variability in the data as possible, and each subsequent component accounts for as much of the remaining variability as possible [37]. In our experiments using PCFS, dimensionality reduction has been accomplished by choosing a number of eigenvectors that accounts for 95% of the variance in the original data.
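The experiments in the paper were run with Weka; purely as an illustration, a rough scikit-learn analogue of the PCFS-style pre-processing (retaining components that explain 95% of the variance) feeding two of the regressors could look like the sketch below. The toy matrix, variable names and parameters are our own, not the paper's actual Weka settings.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.linear_model import LinearRegression
from sklearn.svm import SVR

rng = np.random.default_rng(0)
# Toy stand-in for the training data (the real matrix is 695 x 267).
X_train = rng.normal(size=(200, 50))
y_train = rng.normal(size=200)

# PCFS-style reduction: keep the smallest number of principal
# components whose cumulative explained variance reaches 95%.
pca = PCA(n_components=0.95)
X_reduced = pca.fit_transform(X_train)

# Regressors analogous to two of the non-evolutionary methods.
linear = LinearRegression().fit(X_reduced, y_train)
svr_poly = SVR(kernel="poly", degree=2).fit(X_reduced, y_train)
predictions = svr_poly.predict(X_reduced)
```

Passing a float to `n_components` makes scikit-learn's PCA choose the component count from the explained-variance threshold, which mirrors the "95% of the variance" criterion described above.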
The Weka [38] implementation of the two described feature selection procedures has been adopted for the experiments. For all remaining parameters, we have used the Weka default values.

4.1.2. Linear and Least Square Regression

The Linear Regression model is based on the Akaike criterion for model selection (AIC) [39], which relies on the Kullback–Leibler information between two densities, corresponding to the fitted model and the true model. The M5 criterion is used for further attribute selection [39]. The Least Square Regression model is based on the algorithm for robust regression and outlier detection described in [40], searching for the most plausible linear relationship between outputs and targets. Also for these methods the Weka implementation was used in our experiments [38].

4.1.3. Multilayered Perceptron

The Multilayered Perceptron Artificial Neural Network [41] implementation included in the Weka software distribution [38] was adopted. We used the Backpropagation algorithm [41] with a learning rate equal to 0.3. All the neurons had a sigmoid activation function. All the other parameters have been set to the default values proposed by the Weka implementation. A momentum of 0.1, progressively decreasing down to 0.0001, has been used to escape local minima on the error surface.

4.1.4. Support Vector Machine regression

The Smola and Schölkopf sequential minimal optimization algorithm [42] was adopted for training a Support Vector regression using polynomial kernels. Also for this method the Weka implementation [38] has been used. More precisely, we have
built two models using polynomial kernels of first and second degree, respectively.

4.2. Experimental results of the non-evolutionary methods
Tables 1–5 show the experimental results returned by Linear Regression, Least Square Regression, Multilayered Perceptron, SVM regression with first degree polynomial kernel and SVM regression with second degree polynomial kernel, respectively. These tables must be interpreted as follows: the upper part (part (a)) shows the results obtained when no feature selection strategy has been employed (data from our dataset have been used as input with no filtering nor pre-processing), the middle part (part (b)) reports the results obtained when PCFS has been used to preprocess the data and the lower part (part (c)) shows the results obtained using CorrFS. In particular, PCFS has selected only 22 features (i.e. columns of the dataset) of the 267 total ones, while CorrFS has selected 95 of them. In each table, columns 2 and 3 report the results obtained on the training set and columns 4 and 5 those on the test set; in both cases, we report the root mean squared error (RMSE) and the correlation coefficient (CC) between outputs and goals returned by the trained model. Furthermore, for both RMSE and CC we report the best result, the average and the standard deviation, calculated over 100 independent executions for each considered ML method. These experiments have been performed using the Weka public domain software [38] and the parameter setting used for each method is the standard one proposed by Weka. We do not report all these parameters' values here, because they are discussed in [38].

Table 1
Experimental results returned by Linear Regression. Parts (a)–(c): no feature selection, PCFS and CorrFS, respectively.

           RMSE on train   CC on train   RMSE on test   CC on test
(a) No feature selection
Best       0.0816          0.7265        0.1169         0.6952
avg.       0.0903          0.6754        0.1175         0.6432
std. dev.  0.0064          0.0467        0.0071         0.0391
(b) Principal Component Based Feature Selection (PCFS)
Best       0.1054          0.6951        0.1328         0.6003
avg.       0.1183          0.6592        0.1395         0.5835
std. dev.  0.0082          0.0362        0.0052         0.0746
(c) Correlation Based Feature Selection (CorrFS)
Best       0.0945          0.7064        0.1185         0.6972
avg.       0.0995          0.6845        0.1276         0.6325
std. dev.  0.0056          0.0476        0.0036         0.0427

Table 2
Experimental results returned by Least Square Regression. Parts (a)–(c) as in Table 1.

           RMSE on train   CC on train   RMSE on test   CC on test
(a) No feature selection
Best       0.0745          0.7454        0.1279         0.6496
avg.       0.0845          0.7045        0.1372         0.6165
std. dev.  0.0029          0.0572        0.0036         0.0576
(b) PCFS
Best       0.0834          0.7145        0.1469         0.5618
avg.       0.0997          0.6243        0.1645         0.5328
std. dev.  0.0031          0.0465        0.0045         0.0365
(c) CorrFS
Best       0.0701          0.7493        0.1277         0.6754
avg.       0.0832          0.7164        0.1375         0.6945
std. dev.  0.0039          0.0356        0.0032         0.0455

Table 3
Experimental results returned by the Multilayered Perceptron. Parts (a)–(c) as in Table 1.

           RMSE on train   CC on train   RMSE on test   CC on test
(a) No feature selection
Best       0.0854          0.6045        0.1373         0.6389
avg.       0.0964          0.5432        0.1456         0.5974
std. dev.  0.0039          0.0845        0.0059         0.0385
(b) PCFS
Best       0.0994          0.5946        0.1661         0.5437
avg.       0.1756          0.5794        0.1856         0.5135
std. dev.  0.0048          0.0286        0.0056         0.0289
(c) CorrFS
Best       0.1004          0.5835        0.1728         0.5036
avg.       0.1156          0.5487        0.1644         0.4849
std. dev.  0.0045          0.0478        0.0048         0.0238

Table 4
Experimental results returned by SVM regression with first degree polynomial kernel. Parts (a)–(c) as in Table 1.

           RMSE on train   CC on train   RMSE on test   CC on test
(a) No feature selection
Best       0.0728          0.7489        0.1268         0.6934
avg.       0.0806          0.7065        0.1334         0.6360
std. dev.  0.0031          0.0576        0.0046         0.0375
(b) PCFS
Best       0.0946          0.7145        0.1538         0.5003
avg.       0.1056          0.6462        0.1603         0.4876
std. dev.  0.0028          0.0365        0.0037         0.0434
(c) CorrFS
Best       0.0795          0.7527        0.1455         0.6021
avg.       0.0867          0.7164        0.1587         0.5846
std. dev.  0.0041          0.0572        0.0045         0.038

Table 5
Experimental results returned by SVM regression with second degree polynomial kernel. Parts (a)–(c) as in Table 1.

           RMSE on train   CC on train   RMSE on test   CC on test
(a) No feature selection
Best       0.0945          0.6964        0.1709         0.4145
avg.       0.1769          0.6065        0.1865         0.4395
std. dev.  0.0085          0.0392        0.0083         0.0426
(b) PCFS
Best       0.0971          0.6837        0.1805         0.4531
avg.       0.0996          0.6046        0.1965         0.4297
std. dev.  0.0047          0.0265        0.0085         0.0385
(c) CorrFS
Best       0.0901          0.7013        0.1661         0.5143
avg.       0.0983          0.6954        0.1753         0.4975
std. dev.  0.0038          0.0238        0.0029         0.0285

The first thing that one might observe looking at Tables 1–5 is that feature selection is not very helpful; in fact, results obtained using feature selection have almost never been better than the
ones obtained by the same ML method using the whole dataset, both for RMSE and CC. Furthermore, the highest correlation value on the test set that we have been able to find is 0.6952. From both these considerations, we can deduce that the regression problem we are trying to solve is quite a difficult one and the relationship between data and targets is probably not straightforward. Both the best RMSE and the best CC have been found by Linear Regression with no feature selection and they are respectively equal to 0.1169 and 0.6952. Linear Regression with CorrFS returned slightly poorer results, i.e. RMSE equal to 0.1185 and CC equal to 0.6672. Linear Regression with PCFS returned RMSE equal to 0.1328 and CC equal to 0.6003. In Fig. 1 we show the scatterplot of the true docking energy values against the docking energy predictions (one point for each line in the test set) calculated by the solution with the best CC on the test set found by Linear Regression with no feature selection. We also report the axis bisector that indicates the ideal correlation and we observe that many points in this scatterplot are rather "far" from it.

Fig. 1. Scatterplot of the true docking energy values against the docking energy predictions on the test set calculated by the solution with the best CC found by Linear Regression over 100 independent executions with no feature selection (one point for each line in the test set).

5. GP methods

In this section we describe the GP versions that we have used and discuss their experimental results. Configurations and parameters have been tuned by a set of experiments, in which many possible alternatives have been tested. The different configurations that have been tested are discussed in Appendix A. We have used a tree-based GP configuration for regression problems inspired by [9,43,44]. Each molecular feature has been represented as a floating point number. Potential solutions (GP individuals) have been built by means of the set of functions F = {+, −, *, /}, where the same technique as in [43] has been used to avoid individuals containing a division with the denominator equal to zero. The set of terminals T we have used was composed of n floating point variables (where n is the number of descriptors of each molecule in the dataset). The other parameters we have used are: population size of 200 individuals; ramped half-and-half initialization; tournament selection of size 7; maximum tree depth equal to 10; subtree crossover rate [9] equal to 0.95; subtree mutation rate [9] equal to 0.1; maximum number of generations equal to 50. Furthermore, we have used generational tree-based GP with elitism, i.e. unchanged copy of the best individual into the next population at each generation. Finally, no explicit feature selection strategy has been employed (data from our dataset have been used as input to GP with no filtering nor pre-processing). Three different fitness functions have been used to evaluate the quality of GP individuals: the root mean squared error (RMSE); the correlation coefficient (CC) between outputs and targets; the RMSE with linear scaling, as described in [43]. These three fitness functions, with their experimental results, are discussed in Sections 5.1–5.3, respectively. Furthermore, for each of these three fitness functions, the training set has been handled using the iterative algorithm described in Fig. 2 (see footnote 1). In synthesis, this algorithm partitions the training set into k subsets and uses k − 1 of these subsets to calculate fitness at each generation, iteratively changing the unused subset every p generations in a cyclic way. In this way, we hope to improve GP generalization ability: by iteratively changing the data used to calculate fitness, GP should be forced to keep in the population solutions that are more "general", since the selection algorithm should discard all the individuals that are "specialized" on a particular set of data each time these data are changed. In the experiments presented below, we have used a value of the constant p equal to 15 (i.e. the training set is modified every 15 generations) and we have partitioned the 695 lines composing the training set T into 10 subsets T1, T2, …, T10, such that Ti contains 70 consecutive lines of T if i is odd and 69 consecutive lines of T if i is even, for each i = 1, 2, …, 10.

5.1. Root mean squared error as fitness

For simplicity, we call RMSEGP the GP variant that uses the RMSE computed on the training set as fitness function. Given an individual k producing the docking energy prediction Di(PRE) on the i-th molecule of the training set, we define the fitness of k (RMSEk) as:

RMSEk = f(k) = sqrt( (1 / m*) * sum_{i=1..m*} (Di − Di(PRE))^2 )    (1)

where m* is the number of molecules (lines) in the training set and Di is the correct value of the docking energy. Besides calculating the RMSE on the training set, which we have used as fitness, for each individual in the population we have also evaluated the CC on the training set. More precisely, with CC we indicate the correlation between the results returned by each GP candidate solution and the expected outputs [45]; in other words, let D = {D1, D2, …, Dm*} be the set of correct docking energy values for all the molecules in the training set and let D(PRE) = {D1(PRE), D2(PRE), …, Dm*(PRE)} be the set of corresponding docking energy predictions produced by an individual k; then the correlation coefficient of k (CCk) is defined as follows:

CCk = C(D, D(PRE)) / (σD · σD(PRE))    (2)

where C(D, D(PRE)) is the covariance of the sets D and D(PRE), and σD and σD(PRE) are their standard deviations, respectively.
Furthermore, for each individual in the population, we have calculated the RMSE and CC on the test set, and we have used them to compare the generalization ability of GP with that of the other employed methods.1

1 In Appendix A we describe the other methods to handle the training set that we have tested, including the simple use of all the lines in the training set to calculate fitness at each generation, and we motivate the choice of the algorithm in Fig. 2.
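As an illustration of how the two measures above are computed, the following is a minimal Python sketch of Eqs. (1) and (2); this is our own illustration and not the authors' implementation, and the function names are ours:

```python
import math

def rmse(predictions, targets):
    """Root mean squared error between predicted and true docking energies (Eq. (1))."""
    m = len(targets)
    return math.sqrt(sum((p - t) ** 2 for p, t in zip(predictions, targets)) / m)

def correlation(predictions, targets):
    """Correlation coefficient CC_k (Eq. (2)): covariance divided by the
    product of the standard deviations of the two sets of values."""
    m = len(targets)
    mean_p = sum(predictions) / m
    mean_t = sum(targets) / m
    cov = sum((p - mean_p) * (t - mean_t) for p, t in zip(predictions, targets)) / m
    sd_p = math.sqrt(sum((p - mean_p) ** 2 for p in predictions) / m)
    sd_t = math.sqrt(sum((t - mean_t) ** 2 for t in targets) / m)
    return cov / (sd_p * sd_t)
```

In RMSEGP the first function, evaluated on the training set, would serve as the fitness, while the second would be monitored on both training and test sets.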
Fig. 2. The algorithm used to handle the training set in the experiments presented in this paper. It partitions the training set into k subsets and uses k - 1 of these subsets to calculate fitness at each generation, cyclically changing the unused subset every p generations.
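The rotation scheme described in the caption of Fig. 2 can be sketched as follows; this is an illustrative reconstruction under our own naming, not the authors' code:

```python
def fitness_subsets(training_set, k, p, generation):
    """Partition the training set into k subsets and return the k - 1 subsets
    used for fitness evaluation at the given generation. The excluded subset
    changes every p generations, cycling through all k subsets."""
    size = len(training_set)
    bounds = [round(i * size / k) for i in range(k + 1)]
    subsets = [training_set[bounds[i]:bounds[i + 1]] for i in range(k)]
    unused = (generation // p) % k  # index of the subset left out this phase
    return [row for i, s in enumerate(subsets) if i != unused for row in s]
```

For instance, with k = 5 and p = 2, generations 0-1 exclude the first subset, generations 2-3 the second, and so on cyclically.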
Table 6
Results obtained performing 100 independent runs of RMSEGP on our dataset. Upper part: best, average and standard deviation of the results returned by the individuals with the best RMSE on the test set at each run. Lower part: the same for the individuals with the best CC on the test set at each run.

                RMSE on test   CC on test   RMSE on train   CC on train
(a) Individual with the best RMSE on the test set
Best            0.0805         0.7592       0.1104          0.7100
avg.            0.0899         0.7022       0.1227          0.6509
std. dev.       0.0056         0.0442       0.0059          0.0367
(b) Individual with the best CC on the test set
Best            0.0830         0.7913       0.1110          0.7323
avg.            0.0913         0.6924       0.1268          0.6659
std. dev.       0.0069         0.0532       0.0084          0.0330
5.1.1. Experimental results

Table 6 reports the results of RMSEGP, obtained by executing 100 independent runs. For each run, we have monitored the individual with the best RMSE on the test set and the one with the best CC on the test set. The upper part of Table 6 reports the best (first line), average (second line) and standard deviation (third line) of the results returned by the individuals with the best RMSE on the test set at each run; the lower part does the same for the individuals with the best CC on the test set. Comparing the results obtained by RMSEGP on the test set with those obtained by the non-evolutionary methods, the best RMSE obtained by RMSEGP is comparable to the best result of the non-evolutionary methods (returned by Linear Regression with no feature selection), while the best CC obtained by RMSEGP is slightly better than the best one obtained by the non-evolutionary methods (also returned by Linear Regression with no feature selection). Furthermore, the averages and standard deviations of the best results over the 100 runs show that the RMSEGP results are stable: the results obtained in the 100 GP runs are similar to each other. In Fig. 3, we show the scatterplot of the true docking energy values against the docking energy predictions (one point for each line in the test set), calculated by the individual with the best RMSE (Fig. 3(a)) and by the one with the best CC (Fig. 3(b)) on the test set found by RMSEGP. We also report the axis bisector, which indicates the ideal correlation. Comparing these scatterplots with the one in Fig. 1, we can observe an improvement in correlation. Furthermore, Fig. 3 shows that the solutions returned by RMSEGP approximate "small" docking energy values (i.e. values that, after standardization and normalization, are approximately equal to 0) better than "large" ones (i.e. values that, after standardization and normalization, are approximately equal to 1): points near the axis origin are closer to the bisector than the other ones. Since it is well known (see for instance [18,19]) that docking energy values are "good" when they are "small" (docking is a minimization problem), we can state that RMSEGP works well in predicting "good" docking energy values, while its predictions of "bad" docking energy values may be imprecise.

5.2. Correlation coefficient as fitness

The second GP variant that we have tested is like the previous one, except that the fitness of each individual k is this time equal to CC_k on the training set, as defined in Eq. (2). For this reason, we call this GP version CCGP. As above, we also calculate the RMSE of each
Fig. 3. Scatterplot of the true docking energy values against the docking energy predictions on the test set. (a): predictions calculated by the individual with the best RMSE on the test set found by RMSEGP; (b): predictions calculated by the individual with the best CC on the test set found by RMSEGP.
Table 7
Results obtained performing 100 independent runs of CCGP on our dataset. Upper part: best, average and standard deviation of the results returned by the individuals with the best RMSE on the test set at each run. Lower part: the same for the individuals with the best CC on the test set at each run.

                RMSE on train   CC on train   RMSE on test   CC on test
(a) Individual with the best RMSE on the test set
Best            0.0893          0.8017        0.1225         0.6618
avg.            0.1202          0.8329        0.1446         0.5600
std. dev.       0.0190          0.0420        0.0094         0.0975
(b) Individual with the best CC on the test set
Best            4.2266          0.9137        6.2392         0.9020
avg.            5.9515          0.9070        7.6128         0.8758
std. dev.       4.0070          0.0093        5.0734         0.0195
individual both on the training and test set, together with the CC on the test set, and we use these values to compare the results obtained by CCGP with those obtained by the other methods.

5.2.1. Experimental results

Table 7 reports the results of CCGP. This table must be interpreted like Table 6, and it clearly shows that if we optimize the correlation on the training set, we obtain a CC on the test set which is considerably better than the CC returned by any of the non-evolutionary techniques and by RMSEGP. Nevertheless, CCGP also returns poor RMSE results. We also point out that the standard deviations of the RMSE are high, compared with those obtained with RMSEGP, both on the training and test set. These results suggest that optimizing only the correlation is not a good strategy for our problem. On the other hand, we would like to develop a method that optimizes both the RMSE and the CC, hoping to obtain results comparable to those of CCGP for the correlation, but with better RMSE. This is done in the next section. In Fig. 4, we show the scatterplot of the true docking energy values against the docking energy predictions calculated by the individual with the best RMSE (Fig. 4(a)) and by the one with the best CC (Fig. 4(b)) on the test set found by CCGP. Approximately the same remarks that we have made on Fig. 3 hold for Fig. 4 too, except that an improvement in correlation can be seen in Fig. 4(b).

5.3. Linear scaling

After having tested GP using the RMSE and the CC as fitness functions separately, we now want to test a GP version optimizing both these criteria at the same time. One possibility could have been to use a multi-objective GP algorithm like, for instance, the one presented in [46]. Instead, we have decided to use linear scaling, first introduced in [43]. In synthesis, linear scaling consists in calculating the slope and intercept of the formula coded
by the GP individual. Let D_i^{(PRE)} be the output of a GP individual on the i-th input data; a linear regression on the target values D_i can be performed using the equations:

b = \sum_{i=1}^{m} (D_i - \bar{D})(D_i^{(PRE)} - \bar{D}^{(PRE)}) / \sum_{i=1}^{m} (D_i^{(PRE)} - \bar{D}^{(PRE)})^2

a = \bar{D} - b \bar{D}^{(PRE)}

where m is the number of lines in the training set and \bar{D}^{(PRE)} and \bar{D} denote the average output and the average target value respectively. These expressions calculate the slope and intercept of a set of outputs D^{(PRE)}, such that the sum of the squared errors between D and a + b D^{(PRE)} is minimized. After this, any error measure can be calculated on the scaled formula a + b D^{(PRE)}, for instance the RMSE:

RMSE_k(a + b D^{(PRE)}) = \sqrt{(1/m) \sum_{i=1}^{m} (a + b D_i^{(PRE)} - D_i)^2}
If a is different from 0 or b is different from 1, the procedure outlined above is guaranteed to reduce the RMSE of any formula [43]. Our choice of linear scaling is motivated by the fact that, as explained in [44], optimizing the RMSE with linear scaling automatically optimizes the CC as well. Furthermore, the cost of calculating the slope and intercept is linear in the size of the training set. By efficiently calculating the slope and intercept for each individual, the need to search for these two constants is removed from the GP run: GP is then free to search for the expression whose shape is most similar to that of the target function. The efficacy of linear scaling in GP for many symbolic regression problems has been widely demonstrated in [43,44]. GP using the RMSE with linear scaling as fitness function will be called LinScalGP from now on.

5.3.1. Experimental results

Table 8 reports the results we have obtained executing 100 independent runs of LinScalGP on our dataset. This table must be interpreted like Tables 6 and 7. It clearly shows that both the best RMSE and the best CC on the test set found by LinScalGP are better than the best RMSE and CC found by any of the non-evolutionary techniques and by the other GP variants. Furthermore, the average best RMSE and the average best CC also outperform the best RMSE and CC found by any of the other techniques. Finally, the standard deviations confirm that the behavior of LinScalGP is stable (i.e. the results of the 100 runs are rather similar to each other). All these considerations allow us to conclude that LinScalGP is a suitable technique for our problem.
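The slope/intercept computation of Section 5.3 and the resulting scaled RMSE can be sketched in Python as follows; this is a minimal illustration based on the equations above (the function names are ours, not the authors'):

```python
import math

def linear_scaling(targets, outputs):
    """Compute the slope b and intercept a that minimize the squared error
    between the targets D and the scaled outputs a + b * D_pre
    (linear scaling in the sense of [43])."""
    m = len(targets)
    mean_t = sum(targets) / m
    mean_o = sum(outputs) / m
    num = sum((t - mean_t) * (o - mean_o) for t, o in zip(targets, outputs))
    den = sum((o - mean_o) ** 2 for o in outputs)
    b = num / den if den != 0 else 0.0  # constant outputs: fall back to b = 0
    a = mean_t - b * mean_o
    return a, b

def scaled_rmse(targets, outputs):
    """RMSE of the linearly scaled outputs, usable as the LinScalGP fitness."""
    a, b = linear_scaling(targets, outputs)
    m = len(targets)
    return math.sqrt(sum((a + b * o - t) ** 2 for t, o in zip(targets, outputs)) / m)
```

Note that any individual whose outputs are a linear transformation of the targets gets a perfect scaled fitness, which is why GP only needs to find the right shape of the function.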
Fig. 4. Scatterplot of the true docking energy values against the docking energy predictions on the test set. (a): predictions calculated by the individual with the best RMSE on the test set found by CCGP; (b): predictions calculated by the individual with the best CC on the test set found by CCGP.
Table 8
Results obtained performing 100 independent runs of LinScalGP on our dataset. Upper part: best, average and standard deviation of the results returned by the individuals with the best RMSE on the test set at each run. Lower part: the same for the individuals with the best CC on the test set at each run.

                RMSE on test   CC on test   RMSE on train   CC on train
(a) Individual with the best RMSE on the test set
Best            0.0740         0.9193       0.1000          0.9065
avg.            0.0757         0.8939       0.1092          0.8781
std. dev.       0.0055         0.0180       0.0036          0.0539
(b) Individual with the best CC on the test set
Best            0.0691         0.9356       0.1000          0.9245
avg.            0.0735         0.9221       0.1107          0.9057
std. dev.       0.0041         0.0113       0.0042          0.0074
In Fig. 5, we show the scatterplot of the true docking energy values against the docking energy predictions calculated by the individual with the best RMSE (Fig. 5(a)) and the one with the best CC (Fig. 5(b)) on the test set found by LinScalGP. If we compare these scatterplots with the ones in Figs. 1, 3 and 4, we can observe a correlation improvement, in particular for ‘‘bad’’ (i.e. ‘‘large’’) docking energy values.
5.4. The best solutions found

In this section we show the genotypes of the individual with the best RMSE and of the one with the best CC found by LinScalGP (whose results have been reported in Table 8 and Fig. 5). They are given here as expressions in infix form, and the molecular descriptors are represented using the same identifiers as in [47]. The individual with the best RMSE is:

(POLA_pmi + (SMR_SAS0 Z_pc_plus + VAdjMa)(b_1rotR + chi0v_C + POLA_pmi + b_1rotR + SlogP_VOL0)(SMR_SAS0 Z_pc_plus)(chi0v + VAdjMa))(chi0v + VAdjMa) + chi0v_C + (SMR_SAS0 Z_pc_plus)(chi0v + VAdjMa) + VAdjMa

The first thing that one might observe when looking at this expression is that it uses a limited number of molecular descriptors: only 12 different descriptors out of the 267 included in the dataset. In other words, although no explicit feature selection algorithm has been applied to reduce the number of input data, GP has implicitly performed a strong feature selection. The mechanism that allows GP to perform feature selection is simple: GP searches over the space of all arithmetic expressions of 267 variables. This search space includes the expressions that use all 267 variables, but also the ones that use a smaller number of variables, and in principle there is no reason why an expression using a smaller number of variables could not have a better fitness value than an expression using all the 267 variables. If expressions using a smaller number of variables get a better fitness, they survive, given that fitness is the only principle used by GP for selecting genes. This is evidently what happened during the presented GP executions: GP has found expressions using a small number of variables with a better fitness value than the ones using all variables. Thus, the former expressions survived into the population, while the latter ones became extinct. Furthermore, we point out that all the descriptors that have been used have an intuitive correlation with docking energy. In fact they mostly belong to the categories of constitutional descriptors, derived from properties like solvent accessible surface and log P, characteristics known to influence the binding energy. For a definition of the molecular descriptors used, see Appendix B. The genotype of the individual with the best CC on the test set found by LinScalGP is:

((KierA1 + KierA3 + Kier1)(Z_pcminus + VOL_pol)(SlogP_SAS0 + NULL_pmiY + POLA_pmiZ + fNULL_pmiZ + POLA_pmiZ + fNULL_pmiZ + SlogP_SAS0 + NULL_pmiY) + KierA3 + Kier1 + ((POLA_pmiZ + fNULL_pmiZ)/chi0) + SlogP_SAS0 + NULL_pmiY)(a_count + b_count + (KierA1 + KierA3 + Kier1)(Z_pcminus + VOL_pol)(KierA3 + Kier1 + KierA3 + Kier1 + SlogP_SAS0 + NULL_pmiY) + KierA1 + (KierA3/Kier1) + Z_pcminus + KierA1 + a_count + b_count + (KierA1 + POLA_pmiZ + fNULL_pmiZ) Z_pcminus (POLA_pmiZ + fNULL_pmiZ + a_acid + Q_VOL_F_fo + a_count + b_count) + a_count + b_count + (KierA1 + KierA3 + Kier1) Z_pcminus (SlogP_SAS0 NULL_pmiY + SlogP_SAS0 NULL_pmiY))

Even though this individual is clearly longer than the previous one (i.e. it is composed of a larger number of nodes), also in this case GP has performed a strong feature selection: this individual contains only 18 molecular descriptors out of the 267 considered. Also in this case, all these molecular descriptors belong to the categories of constitutional descriptors and are intuitively correlated with docking energy. Finally, we remark that the two individuals shown in this section share some common shapes: for instance, both of them are mainly composed of + and × functional symbols, while the - and / symbols do not appear very often (only three - symbols in the first individual and only two / symbols in the second one). Furthermore, in both individuals some shapes appear more than once; this is the case, for instance, of the subtrees (SMR_SAS0 Z_pc_plus) and (chi0v + VAdjMa)
Fig. 5. Scatterplot of the true docking energy values against the docking energy predictions on the test set. (a): predictions calculated by the individual with the best RMSE on the test set found by LinScalGP; (b): predictions calculated by the individual with the best CC on the test set found by LinScalGP.
Fig. 6. Scatterplot of the true docking energy values against the docking energy predictions on the test set. (a): predictions calculated by the individual with the best RMSE on the test set found by LinScalGP+; (b): predictions calculated by the individual with the best CC on the test set found by LinScalGP+.
in the first individual, and of the subtrees (KierA3 + Kier1) and (a_count + b_count) in the second one. These considerations guided us in the design of the experiments that we present in the next section.

6. Improving GP results

6.1. Search for recurrent patterns

In all the experiments presented so far, we have empirically observed that "good" individuals often share some common structures (i.e. subtrees or "parts" of subtrees). For this reason, for each one of the 100 independent LinScalGP2 runs discussed in Section 5.3, we have considered the individual with the best RMSE on the test set and the one with the best CC on the test set. We have then analyzed the genotypes of these 200 individuals, looking for recurrent structures. In particular, we were looking for the most recurrent subtrees of depth 1 and 2 and the most recurrent subtree "fragments" of depth 1 and 2. Here, with the term subtree "fragment" we indicate a schema as defined, for instance, in [48], i.e. a subtree that may contain some "don't care" symbols (represented with the symbol "#" below), each of which can be replaced by any other subtree. The most recurrent subtrees that we have found are:

(SASA_acid CHGR_VSAm4)
(a_pol + Z_pcminus)
(chi1v zagreb)
(Weight Kier2)
(VAdjMa + SMR_VSA0)
(a_count / area)
(SMR_VSA0 + Z_pcminus)
(VSA_acid a_hea)

while the most recurrent subtree fragments that we have found are:

(a_acid + (# + #))
(Kier + (# + #))
(KierA3 + (# + #))
(Z_pcminus + (# + #))
(a_pol + (# + #))
(Kier3 + (# + #))
(KierA1 + (# + #))
(Z_pc_plus + (# + #))
(VAdjMa + #)
(SMR_VSA0 + #)
For a definition of the molecular descriptors, the reader is referred to Appendix B. Here, it is of interest to remark that each
2 We have used the LinScalGP version because it is the one that has returned the best results, as discussed in Section 5.
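The search for recurrent structures described above can be approximated with a simple counting procedure over the tree genotypes. The sketch below is our own illustration, not the authors' code; it represents GP individuals as nested tuples of the form ('+', left, right) and counts the subtrees of a given depth:

```python
from collections import Counter

def subtrees(tree, depth):
    """Yield all subtrees of exactly the given depth from a nested-tuple GP tree,
    e.g. ('+', 'a_pol', ('*', 'chi1v', 'zagreb'))."""
    def height(t):
        # A leaf (variable name) has height 0; an operator node adds one level.
        return 0 if not isinstance(t, tuple) else 1 + max(height(c) for c in t[1:])
    stack = [tree]
    while stack:
        node = stack.pop()
        if isinstance(node, tuple):
            if height(node) == depth:
                yield node
            stack.extend(node[1:])

def most_recurrent(trees, depth, top=10):
    """Count subtree occurrences over a collection of best-of-run individuals
    and return the most frequent ones."""
    counts = Counter()
    for t in trees:
        counts.update(subtrees(t, depth))
    return counts.most_common(top)
```

Counting fragments (with "#" wildcards) would additionally replace each child of a matched node by "#" before counting, but the overall procedure is the same.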
Table 9
Results obtained performing 100 independent runs of LinScalGP+ on our dataset. Upper part: best, average and standard deviation of the results returned by the individuals with the best RMSE on the test set at each run. Lower part: the same for the individuals with the best CC on the test set at each run.

                RMSE on test   CC on test   RMSE on train   CC on train
(a) Individual with the best RMSE on the test set
Best            0.0647         0.9259       0.0961          0.9256
avg.            0.0678         0.9182       0.1001          0.9005
std. dev.       0.0017         0.0043       0.0018          0.167
(b) Individual with the best CC on the test set
Best            0.0686         0.9166       0.0983          0.9290
avg.            0.0680         0.9176       0.1012          0.9238
std. dev.       0.0019         0.0050       0.0017          0.0027
one of the most recurrent subtrees that we have found could be defined as a new terminal symbol and inserted into the set of terminals T. Analogously, each one of the most recurrent subtree fragments could be defined as a new N-ary operator, where N is the number of "don't care" symbols in the fragment, and all these new operators could be inserted into the set of non-terminal symbols F. Finally, new GP experiments could be performed using these new F and T sets to build the GP individuals. In this way, we hope to "facilitate" the task of GP, removing from the GP runs the computational effort needed to build these useful structures. In the next section, we present the experimental results obtained using LinScalGP with these new F and T sets3 (all the other parameters are as in Section 5). This "extended" version of LinScalGP will be called LinScalGP+ from now on.

6.2. New GP results

Table 9 reports the results we have obtained executing 100 independent runs of LinScalGP+ on our dataset. This table must be interpreted like Tables 6-8. LinScalGP+ outperforms all the other methods for both RMSE and CC. Furthermore, the average best RMSE and the average best CC also outperform the best RMSE and CC found by any other studied technique. Finally, the standard deviations confirm that the results obtained over the 100 independent runs are rather stable. Our conclusion is that LinScalGP+ is a suitable technique for our dataset. In Fig. 6, we show the scatterplot of the true docking energy values against the docking energy predictions calculated by the individual with the best RMSE (Fig. 6(a)) and by the one with the best CC (Fig. 6(b)) on the test set found by LinScalGP+.

3 We point out that the F and T sets used in these new experiments also contained the operators and terminal symbols used in the previous experiments and discussed in Section 5.
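The promotion of mined structures into new primitives can be sketched as follows. This is purely illustrative (our own names and data layout, assuming descriptor values are looked up in a per-molecule dictionary); it shows one recurrent subtree becoming a terminal and one one-wildcard fragment becoming a unary operator:

```python
def subtree_terminal(features):
    """Recurrent subtree (a_pol + Z_pcminus) promoted to a new terminal symbol:
    it evaluates directly on the molecule's descriptor values."""
    return features["a_pol"] + features["Z_pcminus"]

def fragment_operator(features, x):
    """Recurrent fragment (VAdjMa + #) promoted to a new unary operator:
    the single don't-care argument x is the value of any evaluated subtree."""
    return features["VAdjMa"] + x
```

A fragment with N wildcards would analogously become an N-ary operator added to F, while complete subtrees are added to T.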
Comparing these scatterplots with the previous ones (Figs. 1, 3, 4 and 5) we can see that points are more clustered around the axis bisector. The correlation improvement is particularly visible for ‘‘bad’’ (i.e. ‘‘large’’) docking energy values.
Table 10 Average CPU completion times for the executions of the various Machine Learning methods presented in the previous sections. All CPU times are expressed in seconds. As explained in the text, GP has been used with no explicit feature selection method.
6.3. The best solutions found

The genotype of the individual with the best RMSE on the test set found by LinScalGP+ is:

(a_pol + ((Z_pcminus + (a_nC + logS))(a_acid + (Q_VOL_F_fi a_C10 + a_count/area a_count/area (Kier3 + (Kier2 + (QM_dipX + b_1rotR) + a_acid + (Weight + SlogP_VOL3)))))
Method                                    No feature selection   PCFS    CorrFS
Linear Regression                         89.65                  54.34   75.26
Least Square Regression                   121.54                 25.28   72.49
Multilayered Perceptron                   328.40                 97.28   146.54
SVM (first degree polynomial kernel)      135.23                 77.18   113.34
SVM (second degree polynomial kernel)     176.39                 98.45   121.47
RMSEGP                                    156.59                 –       –
CCGP                                      154.29                 –       –
LinScalGP                                 168.34                 –       –
LinScalGP+                                166.22                 –       –
+ a_O04))(Z_pcminus + ((Z_pcminus + (Q_VSA_fi + Q_VSA_pol))(Kier3 + (dipY_co + SlogP)) + VAdjMa + ((Z_pcminus + (a_nC + logS))(Z_pcminus + (KierA1 R_pc_pm SMR_SAS4 + a_O04 fCHGR_pmiY)))))(Kier3 + (Kier2 + (NULL_pmi + fPOLA_pmiZ) + Z_pcminus + (a_nC + logS) + a_acid + (a_acid + (Z_pcminus + (Q_SAS_F_fo + KierA3 + (a_ICM + a_fi))) + a_acid + (CHGR_VOL_p5 + CHGR_SASm1) KierA3 + (a_IC + KierA2)) + Z_pcminus + (Q_VSA_fi + Q_VSA_pol))))

and the individual with the best CC on the test set is:

(a_pol + ((Z_pcminus + (Q_VSA_fi + Q_VSA_pol))(Kier3 + (fCHGR_pmiX + Q_VSA_fi)) + Z_pc_plus + (a_pol + (Q_SAS_F_fo a_acid + a_acid + (a_acid + (Weight + SlogP_VOL3) + KierA3 + (VOL_fi + VOL_base)))) + (Kier3 + (SlogP_SAS4 + SASA_acid CHGR_VSAm4))(Kier2 + (Q_SAS_F_fo a_acid + Kier2 + (NULL_pmi + fPOLA_pmiZ) + Z_pcminus + (a_nC + logS)))(SMR_VOL1 + (Z_pcminus + (Q_VSA_fi + Q_VSA_pol))(KierA3 + (NULL_pmiY + MASS_pmiZ) + SMR_VSA0 + (Q_SAS_F_pos))))))(Z_pcminus + (a_acid + (Q_SAS_fo + fCHGR_pmiZ) + Z_pcminus + (a_nC + logS) + KierA3 + (a_ICM + a_fi)))

In both cases, GP has implicitly performed a feature selection, since only a small subset of the molecular descriptors is used by these solutions. Furthermore, also in this case, these descriptors belong to the categories of constitutional descriptors and have an intuitive correlation with docking energy.

7. Further experiments

In this section, we describe further details about the experiments whose results have been presented above. In particular, Section 7.1 contains an analysis of the CPU completion times of the various Machine Learning methods, in order to further establish the utility of the proposed method, and Section 7.2 reports the average RMSE and CC across generations for the different studied GP variants.

7.1. CPU times

We have calculated the computational time for all the executions whose results are reported in the previous sections (i.e. 100 different executions for each Machine Learning method) on a dedicated machine (Intel Pentium III 500 with 128 MB RAM), and we have computed the averages of these computational times. Results are shown in Table 10. As this table shows, all the GP methods require more or less the same CPU time to complete one execution (roughly 3 min). A slightly smaller amount of time is required by SVM with a first degree polynomial kernel: one execution of SVM without feature selection requires approximately 135 s of CPU time. On the other hand, SVM with a second degree polynomial kernel is rather slower. The Multilayered Perceptron is the slowest of the studied methods, requiring on average about 328 s to complete one execution when no feature selection is used. Linear and Least Square Regression are faster than the other methods, requiring average times of about 89 and 121 s respectively to complete one run with no feature selection. We also point out that, as expected, feature selection helps make all the methods faster. In particular, methods are fastest when using PCFS, given that PCFS selects only 22 of the 267 available features (but we also have to point out that the results obtained using PCFS are generally the poorest ones obtained in our experiments, as shown in the previous sections). CorrFS has selected 95 of the 267 features and can be considered a good compromise between efficiency (in terms of CPU time) and effectiveness (in terms of the quality of the returned results). We conclude this discussion by pointing out that all the GP methods have reasonable CPU times, i.e. comparable with those of the other Machine Learning methods using no feature selection, and not much worse than those of the other methods using feature selection. This, together with the fact that GP finds the best quality results, performs an automatic feature selection and returns human readable models, makes us claim that GP is an interesting method for this application.

7.2. RMSE and CC for each GP generation

Fig. 7 reports the following results for the four GP variants studied (RMSEGP, CCGP, LinScalGP and LinScalGP+): Fig. 7(a): average of the best RMSE on the training set in the population at each generation; Fig. 7(b): average of the best CC on the training set in the population at each generation; Fig. 7(c): average of the RMSE obtained on the test set by the best individual on the training set; Fig. 7(d): average of the CC obtained on the test set by the best individual on the training set. As already known from the previous results, LinScalGP+ is the method that returns both the best RMSE and the best CC. What is new from these graphs is that LinScalGP+ finds on average the best RMSE and the best CC at each generation. This
Fig. 7. (a) Average best RMSE on training; (b) average best CC on training; (c) average best RMSE on test of the best individual on training; (d) average best CC on test of the best individual on training.
allows us to conclude that LinScalGP+ is the most interesting of the methods that we have studied.

8. Conclusions and future work

Machine Learning methods, including various versions of Genetic Programming, have been employed for assessing and predicting the value of the docking energy of genistein based drug compounds with estrogen receptor proteins. This application is important since the ability to correctly predict this value could help us select the most promising genistein based drugs for menopause therapy and also as a natural prevention of some tumors. Genetic Programming using linear scaling to optimize both the error and the correlation coefficient between outputs and targets has proven to be the most promising among the techniques considered, from the point of view of the accuracy of the proposed solutions, of their generalization capabilities and of the correlation between predicted and correct data. Its results can be further improved if the most recurrent structures in "good" solutions are stored and used as functionals or terminals for building new GP solutions. These results are encouraging and should pave the way to a deeper study of Genetic Programming for drug discovery and development. One of the main limitations of this work is that we did not use any application specific problem knowledge. For instance, a feature selection based on the mutual "relevance" of the molecular descriptors and their expected correlation with the target might have been used; or a "semantic" analysis of the best solutions found by GP could have helped us generate new and possibly more effective solutions. We are currently working in this direction: we are trying to develop a sort of "application based" feature selection and, in parallel, we are trying to give a biological interpretation to the solutions found by GP, in order to infer interesting properties.
Other possible developments of GP methods may be suggested by the analysis of the available data: experimental measurements return
average values associated with specific confidence intervals. This will have to be taken into account in our future activity, possibly by considering an "error-tolerant" threshold for calculating fitness on the training set, and also by assigning a different relevance to over-estimation and under-estimation errors. Finally, in order to characterize more deeply the ability of GP in drug discovery, we are planning to test our methodologies on other datasets, for example for the prediction of Blood Brain Barrier Permeability or Cytochrome P450 interactions.

Acknowledgments

We acknowledge DELOS Srl [26] for allowing us to use their software environment.

Appendix A. GP configuration and parameter tuning

The GP configuration and parameters used in the experiments discussed in this paper are the ones that have returned the best results among a set of alternatives that have been empirically compared. In summary, for each of the three fitness functions discussed above (i.e. RMSE, CC and RMSE with linear scaling), we have tested three different algorithms to handle the training set: using all the lines composing the training set to calculate fitness at each generation; using the algorithm reported in Fig. 2; and using the algorithm reported in Fig. A.1. The algorithm in Fig. A.1 has been implemented with the same objective as the one in Fig. 2, i.e. to improve GP generalization ability by iteratively modifying the data on which fitness is evaluated. In this way, we hope to force GP to keep in the population solutions that are more "general", since selection should discard all the individuals that are "specialized" on a particular set of data each time
Fig. A.1. One of the algorithms we have used to handle the training set with the aim of improving GP generalization ability. Some experiments that we do not report in this paper have shown that using the algorithm in Fig. 2 allows GP to return better results on the test set, both in terms of RMSE and CC.
Table B.1
A brief definition of the molecular descriptors used in the subtrees and subtree fragments presented in Section 6.

SASA_acid: Solvent accessible surface area of the acid atoms of the current molecule.
CHGR_VSAm4: Van der Waals atomic surface area summed over atoms with partial charge in the range [-0.25, -0.20).
Z_pcminus: z component of the negative charge centre.
Z_pcplus: z component of the positive charge centre.
chi1v: Atomic valence connectivity index (order 1) from [47], calculated as the sum of (d_i d_j)^(-1/2) over all bonds between heavy atoms i and j with i < j. For each atom i, d_i is the approximate accessible van der Waals surface area of i [47].
zagreb: Sum of d_i^2 over all heavy atoms i (d_i as above).
Weight: Molecular weight (including implicit hydrogens).
VAdjMa: Vertex adjacency information (magnitude): 1 + log_2(m), where m is the number of heavy-heavy bonds; if m is zero, then zero is returned.
SMR_VSA0: Van der Waals surface area summed over atoms with molecular refractivity in the range [0, 0.11].
area: Area of the van der Waals surface calculated using a connection table approximation.
VSA_acid: Approximation to the sum of the van der Waals surface areas of acidic atoms.
a_count: Total number of atoms (including implicit hydrogens).
a_heavy: Number of heavy atoms.
a_acid: Number of acid atoms.
a_pol: Sum of atomic polarizabilities.
Kier1: First kappa shape index: (n - 1)^2 / m^2, where n is the number of atoms and m the number of bonds in the hydrogen suppressed graph [47].
Kier2: Second kappa shape index: (n - 1)^2 / m^2, with n and m as above [47].
Kier3: Third kappa shape index: (n - 1)^2 / m^2, with n and m as above [47].
KierA1: First alpha modified shape index: s(s - 1)^2 / m^2, where s = n + a, with n and m as above and a the sum of (r_i / r_c - 1) over all atoms i, where r_i is the covalent radius of atom i and r_c is the covalent radius of a carbon atom [47].
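Several of the descriptors in Table B.1 are simple functions of the hydrogen suppressed molecular graph. As an illustration (our own sketch, not code from the paper; function names are hypothetical, and d_i is taken here in its classical graph-theoretic sense of heavy-atom degree), the zagreb and VAdjMa descriptors can be computed from atom degrees and the heavy-bond count:

```python
import math

def zagreb(degrees):
    """Zagreb-style index: sum of d_i^2 over all heavy atoms i,
    with d_i the number of heavy-atom bonds of atom i."""
    return sum(d * d for d in degrees)

def vadjma(n_heavy_bonds):
    """Vertex adjacency information (magnitude): 1 + log2(m),
    where m is the number of heavy-heavy bonds; 0 if m == 0."""
    return 0.0 if n_heavy_bonds == 0 else 1.0 + math.log2(n_heavy_bonds)

# Ethanol (C-C-O): heavy-atom degrees are 1, 2, 1; two heavy-heavy bonds.
print(zagreb([1, 2, 1]))  # 6
print(vadjma(2))          # 2.0
```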
these data are changed. Compared to the algorithm in Fig. 2, the algorithm in Fig. A.1 operates a more radical modification of the data used for fitness calculation: like the algorithm in Fig. 2, it partitions the training set into k subsets, but instead of using k - 1 subsets to calculate fitness at each generation and cyclically changing the unused subset every p generations (as the algorithm in Fig. 2 does), the algorithm in Fig. A.1 uses only one subset to calculate fitness at each generation, cyclically changing the used subset every p generations. For both the algorithm in Fig. 2 and the one in Fig. A.1, we have tested values of the constant p equal to 5, 15, 30 and 50. Furthermore, for each possible combination of the three fitness functions (RMSE, CC and RMSE with linear scaling) and of these three ways of handling the training set, we have tested the use of data with no feature selection, with PCFS and with CorrFS. Finally, for each of these possible GP configurations, we have performed simulations using the set of terminal symbols T discussed in Section 5 (i.e. a set composed of n floating point variables, where n is the number of descriptors of each molecule in the dataset) and also using an "extended" set of terminal symbols that additionally includes n ephemeral random constants (ERCs) [49] generated with uniform distribution in the range [m, M], where m (respectively M) is the minimum (respectively the maximum) value assumed by the molecular descriptors in the training set (as in [45]). Analogously, for each possible configuration, we have built GP individuals using the set of functions F = {+, -, *, /} (as discussed in Section 5) and also an "extended" set of functional symbols including the functions sin, cos, log, exp and sqrt. Given the huge number of simulations that we have performed4, we have decided not to show the results of all of them in this paper, but only to report the best ones.
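The two training-set rotation schemes described above can be sketched as follows. This is a minimal illustration of the idea, not the authors' implementation; function names and the list-of-samples representation of the training set are our own assumptions:

```python
import random

def partition(training_set, k):
    """Split the training set into k roughly equal disjoint subsets."""
    data = list(training_set)
    random.shuffle(data)
    return [data[i::k] for i in range(k)]

def fitness_data(subsets, generation, p, leave_one_out=True):
    """Return the data used for fitness evaluation at a given generation.

    Every p generations the 'active' subset index advances cyclically.
    leave_one_out=True mimics the algorithm of Fig. 2 (fitness on the
    k - 1 subsets other than the active one); False mimics Fig. A.1
    (fitness on the active subset only).
    """
    k = len(subsets)
    active = (generation // p) % k
    if leave_one_out:
        return [x for i, s in enumerate(subsets) if i != active for x in s]
    return list(subsets[active])
```

For example, with k = 3 subsets and p = 5, generations 0-4 exclude (Fig. 2) or use (Fig. A.1) subset 0, generations 5-9 subset 1, and so on, wrapping around after subset 2.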
4 If we consider that we have tested 3 different fitness functions, 3 methods to handle the training set with 4 possible values of parameter p, 3 feature selection algorithms and 2 different sets of terminal symbols, we can conclude that, to test all their possible combinations, we have performed 3 x 3 x 4 x 3 x 2 = 216 different GP experiments. Since each experiment consisted of 100 independent GP runs, we have globally executed 21,600 independent GP runs. Given that each GP run took approximately 3 min to complete, we have been executing GP continuously for around 45 days. This simulation time does not include the time that we have spent for the experiments presented in Section 6, nor the time spent choosing the set of functional symbols {+, -, *, /}.
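The experiment count in the footnote can be reproduced by enumerating the configuration grid. The axis labels below are our own hypothetical shorthand for the options named in the text:

```python
from itertools import product

# Hypothetical labels for the tested configuration axes (footnote 4).
fitness_functions = ["RMSE", "CC", "RMSE+linear_scaling"]
training_handling = ["standard", "Fig2_rotation", "FigA1_rotation"]
p_values = [5, 15, 30, 50]
feature_selection = ["none", "PCFS", "CorrFS"]
terminal_sets = ["variables_only", "variables+ERCs"]

configs = list(product(fitness_functions, training_handling, p_values,
                       feature_selection, terminal_sets))
print(len(configs))        # 216 GP experiments
print(len(configs) * 100)  # 21,600 independent runs in total
```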
Appendix B. Molecular descriptors

In Table B.1 we define the molecular descriptors contained in the most recurrent subtrees and subtree fragments presented in Section 6. For a definition of all the descriptors contained in our dataset, the reader is referred to [47] and to the web page: http://personal.disco.unimib.it/Vanneschi/Docking.htm.

Note for the reviewers: the table containing a description of all the 267 molecular descriptors used in our dataset is very large and takes a lot of space. For this reason, we have decided to present in this appendix only a small set of those descriptors (the most interesting ones for our system, since they are the ones that most frequently appear in "good" solutions) and to refer the reader to a web page for the whole dataset and descriptions. Nevertheless, the complete table can be inserted here, if requested.

References

[1] F. Yoshida, J.G. Topliss, QSAR model for drug human oral bioavailability, Journal of Medicinal Chemistry 43 (2000) 2575–2585.
[2] H. Fröhlich, J. Wegner, F. Sieker, A. Zell, Kernel functions for attributed molecular graphs—a new similarity based approach to ADME prediction in classification and regression, QSAR and Combinatorial Science 38 (4) (2003) 427–431.
[3] C.W. Andrews, L. Bennett, L.X. Yu, Predicting human oral bioavailability of a compound: development of a novel quantitative structure-bioavailability relationship, Pharmacological Research 17 (2000) 639–644.
[4] J. Feng, L. Lurati, H. Ouyang, T. Robinson, Y. Wang, S. Yuan, S.S. Young, Predictive toxicology: benchmarking molecular descriptors and statistical methods, Journal of Chemical Information and Computer Sciences 43 (2003) 1463–1470.
[5] T.M. Martin, D.M. Young, Prediction of the acute toxicity (96-h LC50) of organic compounds to the fathead minnow (Pimephales promelas) using a group contribution method, Chemical Research in Toxicology 14 (10) (2001) 1378–1385.
[6] G. Colmenarejo, A. Alvarez-Pedraglio, J.L. Lavandera, Chemoinformatic models to predict binding affinities to human serum albumin, Journal of Medicinal Chemistry 44 (2001) 4370–4378.
[7] J. Zupan, J. Gasteiger, Neural Networks in Chemistry and Drug Design: An Introduction, 2nd edition, Wiley, 1999.
[8] F. Archetti, E. Messina, S. Lanzeni, L. Vanneschi, Genetic programming for computational pharmacokinetics in drug discovery and development, Genetic Programming and Evolvable Machines 8 (4) (2007) 17–26.
[9] J.R. Koza, Genetic Programming, The MIT Press, Cambridge, MA, 1992.
[10] L. Vanneschi, Theory and Practice for Efficient Genetic Programming, PhD thesis, Faculty of Sciences, University of Lausanne, Switzerland, 2004.
[11] J.H. Holland, Adaptation in Natural and Artificial Systems, The University of Michigan Press, Ann Arbor, MI, 1975.
[12] D.E. Goldberg, Genetic Algorithms in Search, Optimization and Machine Learning, Addison-Wesley, 1989.
[13] W.B. Langdon, S.J. Barrett, Genetic programming in data mining for drug discovery, Evolutionary Computing in Data Mining (2004) 211–235.
[14] V. Venkatraman, A.R. Dalby, Z.R. Yang, Evaluation of mutual information and genetic programming for feature selection in QSAR, Journal of Chemical Information and Computer Sciences 44 (2004) 1686–1692.
[15] J. Yu, J. Yu, A.A. Almal, S.M. Dhanasekaran, D. Ghosh, W.P. Worzel, A.M. Chinnaiyan, Feature selection and molecular classification of cancer using genetic programming, Neoplasia 9 (4) (2007) 292–303.
[16] N. Dasgupta, S.M. Lin, L. Carin, Modeling pharmacogenomics of the NCI-60 anticancer data set: utilizing kernel PLS to correlate the microarray data to therapeutic responses, in: Methods of Microarray Data Analysis II, Springer, USA, 2002.
[17] H. Van de Waterbeemd, S. Rose, C.G. Wermuth, The Practice of Medicinal Chemistry, 2nd edition, Academic Press, 2003, pp. 1367–1385.
[18] D.B. Kitchen, H. Decornez, J.R. Furr, J. Bajorath, Docking and scoring in virtual screening for drug discovery: methods and applications, Nature Reviews Drug Discovery 3 (2004) 935–949.
[19] E.M. Krovat, T. Steindl, T. Langer, Recent advances in docking and scoring, Current Computer-Aided Drug Design 1 (1) (2005) 93–102.
[20] J.M. Blaney, J.S. Dixon, A good ligand is hard to find: automated docking methods of special interest, Perspectives in Drug Discovery and Design 1 (1993) 301–319.
[21] J.S. Dixon, Flexible docking of ligands to receptor sites using genetic algorithms, in: C.G. Wermuth (Ed.), Trends in QSAR and Molecular Modelling 92, ESCOM, Leiden, 1993, pp. 412–413.
[22] C.M. Oshiro, I.D. Kuntz, J.S. Dixon, Flexible ligand docking using a genetic algorithm, Journal of Computer-Aided Molecular Design 9 (1995) 113–120.
[23] AutoDock, a docking program developed by the Olson group at the Scripps Research Institute, 2007. http://autodock.scripps.edu.
[24] GOLD, a docking program produced by the CCDC in Cambridge, UK, 2007. http://www.ccdc.cam.ac.uk/products/life_sciences/gold/.
[25] DOCK, a docking program developed in the Kuntz and Shoichet groups at the University of California, San Francisco, 2007. http://dock.compbio.ucsf.edu.
[26] DELOS S.r.l., Discovery and Lead Optimization Systems, 20091 Bresso (MI), Italy, 2007. http://www.delos-bio.it.
[27] F. Chiappori, M.G. Ferrario, N. Gaiji, P. Fantucci, Docking of estrogen and genistein like molecular library on estrogen receptor alpha and beta, in: Proceedings of the Bioinformatics Italian Society (BITS) Annual Meeting, 2005. Publication on CD; downloadable version available at http://www.itb.cnr.it/bits2005/abstract/26.pdf.
[28] T.T. Wang, N. Sathyamoorthy, J.M. Phang, Molecular effects of genistein on estrogen receptor mediated pathways, Carcinogenesis 17 (2) (1996) 271–275.
[29] M. Pintore, H. Van de Waterbeemd, N. Piclin, J.R. Chrétien, Prediction of oral bioavailability by adaptive fuzzy partitioning, European Journal of Medicinal Chemistry 38 (4) (2003) 427–431.
[30] N. Greene, Computer systems for the prediction of toxicity: an update, Advanced Drug Delivery Reviews 54 (2002) 417–431.
[31] Accelrys Inc., cheminformatics for drug development, 2006. See http://www.accelrys.com.
[32] Pharma Algorithms Inc., a company active in the field of ADMET predictions, 2006. See http://www.ap-algorithms.com.
[33] RCSB Protein Data Bank (PDB), an information portal to biological macromolecular structures, 2007. http://www.rcsb.org/pdb/home/home.do.
[34] Molecular Operating Environment (MOE), a software developed by Chemical Computing Group Inc., 2007. http://www.chemcomp.com.
[35] MMFF94 Validation Suite, created by Computational Chemistry List Ltd., 2007. http://www.ccl.net/cca/data/MMFF94.
[36] M.A. Hall, Correlation-based Feature Selection for Machine Learning, PhD thesis, Department of Computer Science, Waikato University, Hamilton, NZ, 1998.
[37] I.T. Jolliffe, Principal Component Analysis, 2nd edition, Springer Series in Statistics, 1999.
[38] Weka, a multi-task machine learning software developed by Waikato University, 2006. See http://www.cs.waikato.ac.nz/ml/weka.
[39] H. Akaike, Information theory and an extension of the maximum likelihood principle, in: 2nd International Symposium on Information Theory, Akademiai Kiado, June 1973.
[40] P.J. Rousseeuw, A.M. Leroy, Robust Regression and Outlier Detection, Wiley, New York, 1987.
[41] S. Haykin, Neural Networks: A Comprehensive Foundation, Prentice Hall, London, 1999.
[42] A.J. Smola, B. Schölkopf, A tutorial on support vector regression, Technical Report NC2-TR-1998-030, NeuroCOLT2, 1999.
[43] M. Keijzer, Improving symbolic regression with interval arithmetic and linear scaling, in: C. Ryan et al. (Eds.), Genetic Programming, Proceedings of the 6th European Conference, EuroGP 2003, vol. 2610 of LNCS, Essex, Springer, Berlin, Heidelberg, New York, 2003, pp. 71–83.
[44] M. Keijzer, Scaled symbolic regression, Genetic Programming and Evolvable Machines 5 (3) (2004) 259–269.
[45] F. Archetti, S. Lanzeni, E. Messina, L. Vanneschi, Genetic programming for human oral bioavailability of drugs, in: M. Cattolico (Ed.), Proceedings of the 8th Annual Conference on Genetic and Evolutionary Computation, Seattle, Washington, USA, 2006, pp. 255–262.
[46] E. Zitzler, M. Laumanns, L. Thiele, SPEA2: Improving the Strength Pareto Evolutionary Algorithm, Technical Report 103, Computer Engineering and Networks Laboratory (TIK), Department of Electrical Engineering, Swiss Federal Institute of Technology (ETH), Zurich, Switzerland, 2001.
[47] L.H. Hall, L.B. Kier, The molecular connectivity chi indices and kappa shape indices in structure–property modelling, Reviews of Computational Chemistry 2 (1991) 367–422.
[48] W.B. Langdon, R. Poli, Foundations of Genetic Programming, Springer, Berlin, Heidelberg, New York, 2002.
[49] W. Banzhaf, P. Nordin, R.E. Keller, F.D. Francone, Genetic Programming: An Introduction, Morgan Kaufmann, San Francisco, CA, 1998.