High-Throughput Ligand Screening via Preclustering and Evolved Neural Networks

David Hecht and Gary B. Fogel

Abstract—The pathway for novel lead drug discovery has many major deficiencies, the most significant of which is the immense size of small molecule diversity space. Methods that increase the search efficiency and/or reduce the size of the search space increase the rate at which useful lead compounds are identified. Artificial neural networks optimized via evolutionary computation provide a cost- and time-effective solution to this problem. Here, we present results that suggest that preclustering of small molecules prior to neural network optimization is useful for generating models of quantitative structure-activity relationships for a set of HIV inhibitors. Using these methods, it is possible to prescreen compounds to separate active from inactive compounds, or even active and mildly active compounds from inactive compounds, with high predictive accuracy while simultaneously reducing the feature space. It is also possible to identify “human interpretable” features from the best models that can be used for proposal and synthesis of new compounds in order to optimize potency and specificity.

Index Terms—Computational intelligence, evolutionary computation, artificial neural networks, medicine and science.
1 INTRODUCTION

THE worldwide pharmaceutical industry is invested in technologies for high-throughput screening (HTS) of natural product and combinatorial chemical libraries for small molecule drug discovery. Given the current social and economic impact of the rising costs of pharmaceuticals, there is a significant need to increase the efficiency of the drug discovery and development process. As seen from Fig. 1, the increase in research and development funding over the last decade has failed to produce a proportional increase in new medical entities (NMEs), or drugs without prior US Food and Drug Administration (FDA) approval. A major factor in this phenomenon has been the high attrition rate of compounds in the pipeline. On average, far less than 1 percent of screened compounds enter preclinical testing. For every 250 compounds in preclinical testing, five survive to enter clinical testing, with only one approved as a drug by the FDA after an average of 15 years in total of research and development [1], [2]. At each phase of the drug discovery and development process, numerous factors contribute to this high attrition rate. For instance, during lead discovery, HTS hits are often found to be irreproducible in follow-up studies due to synthetic impurities as well as chemical instabilities. In addition, most of the “reproducible hits” have concentrations required for 50 percent viral inhibition (IC50) in the low µM/high nM range and are often nonselective for the target, thus requiring significant additional optimization via medicinal chemistry and biochemistry. On average,
discovery and development takes 3-6 years to achieve an investigational new drug (IND) filing, whereas clinical trials can last up to 10 years or more before the product reaches the market [1], [2]. One promising approach for improving the efficiency and productivity of early discovery is the use of “focused” compound screening libraries. Each compound in a focused library is selected based upon structural and physical properties that increase its probability of having activity against the specific target (e.g., via a quantitative structure-activity relationship (QSAR) model). A simplified application to a typical lead discovery process such as HTS is described below. Although this varies in practice, assume a 1 percent confirmed and validated hit rate from a conventional commercial screening library. For a library consisting of 100,000 compounds, only 1,000 of them would be true hits. Assume a QSAR model is approximately 85 percent accurate for predicting hits (e.g., confirmed actives) and is also 85 percent accurate for predicting confirmed inactives. Also assume the worst-case scenario that the model has not been successfully trained for the removal of false positives. Out of the 1,000 active compounds in the library, only 850 would be correctly identified by the model. Out of the 99,000 inactive compounds, the model would incorrectly select 14,850 false positives. The resulting focused library would consist of 15,700 compounds, representing a 6-fold reduction in the number of compounds required for testing in order to obtain 85 percent of the same number of confirmed and verified hits. In addition, the focused library will have an approximate hit rate of 5.4 percent as compared to the 1 percent from the original screening library. QSAR models [3] have proven to be an effective approach for handling the massive quantities of structural and biological data generated with combinatorial libraries and HTS in lead discovery, lead optimization, and drug development [4], [5].
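The enrichment arithmetic above can be reproduced with a few lines of code. The following sketch (an illustration of the back-of-the-envelope calculation, not code from the study) recomputes the focused-library size, hit rate, and fold reduction from an assumed library size, true hit rate, and model accuracy on actives and inactives.

```python
def focused_library_stats(library_size, hit_rate, sensitivity, specificity):
    """Estimate the size and hit rate of a model-focused screening library.

    library_size -- number of compounds in the full screening library
    hit_rate     -- fraction of compounds that are true (confirmed) hits
    sensitivity  -- fraction of true actives the QSAR model labels active
    specificity  -- fraction of true inactives the model labels inactive
    """
    actives = library_size * hit_rate
    inactives = library_size - actives

    true_positives = sensitivity * actives             # actives kept by the model
    false_positives = (1.0 - specificity) * inactives  # inactives wrongly kept

    focused_size = true_positives + false_positives
    focused_hit_rate = true_positives / focused_size
    fold_reduction = library_size / focused_size
    return focused_size, focused_hit_rate, fold_reduction

# Worked example from the text: 100,000 compounds, 1 percent true hit rate,
# and a model that is 85 percent accurate on both actives and inactives.
size, rate, fold = focused_library_stats(100_000, 0.01, 0.85, 0.85)
print(f"focused library: {size:.0f} compounds")  # 15700 compounds
print(f"hit rate: {rate:.1%}")                   # 5.4%
print(f"fold reduction: {fold:.1f}x")            # 6.4x
```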
Fig. 1. Increased research and development expenditures have not resulted in increased numbers of new medical entities over time.
QSAR models are essentially functions relating parameters/descriptors/features based on physicochemical properties of small molecule compounds to a biological response or activity (e.g., the amount of material required to produce a specified effect in 50 percent of an animal population (ED50), the dose of a chemical that kills 50 percent of a sample population (LD50), or other responses such as IC50, Ki, or absorption). Some of the more widely used commercially available software packages for performing QSAR include Cerius2 (Accelrys), MOE (Chemical Computing Group), and Sybyl (Tripos). A standard difficulty with QSAR approaches is identifying the appropriate selection and weighting of descriptors when relating these in combination to biological response. This often involves an iterative, time-consuming process. The development of improved and more efficient strategies is a very active area of research, and some of the more popular techniques include multiple linear regression, partial least squares analysis, principal component analysis, artificial neural networks (ANNs), and evolutionary algorithms [6], [7], [8], [9]. Artificial neural networks have been used in this domain given their ability to handle nonlinear relationships between QSAR features relative to output decisions of small molecule activity. Optimizing ANNs using evolutionary algorithms has been demonstrated to be useful for QSAR problems [10], [11], [12], [13] so that correlations between descriptors and predictions of activity can be determined from the space of all possible ANN models without resorting to gradient search methods such as backpropagation for training. Since 1955, the National Cancer Institute (NCI) Developmental Therapeutics Program [14] has provided a publicly available chemical compound repository for anticancer drug screening. This collection now contains tens of thousands of diverse compounds that have each been tested against a panel of 60 cell lines representing nine clinical tissue types. In recent years, anti-HIV assays and compounds have been added to the NCI collection. Several approaches have been developed to mine these repositories for lead compounds and classes of compounds with anti-HIV activity. These approaches include pharmacophore and 3D database searching [15], [16], [17], as well as clustering [18]. The NCI databases are an extremely valuable resource
and are representative of structure-activity databases in the pharmaceutical and drug discovery industry at large. As such, these databases are ideal to use for the development and testing of novel QSAR approaches. Previous preliminary research [19] examined the utility of evolved ANNs as a prescreening method for compound subclusters from the Developmental Therapeutics Program NCI/NIH AIDS Antiviral Screen public database [20]. Here, we extend this work by providing additional justification for data preclustering, demonstrating the utility of preclustering methods and the successful use of evolved ANNs for activity prediction within clusters. Additional research expanding the number of features used as input to the neural network provides evidence to support the conclusion that robust models can be developed using only a minimum set of features from the possible feature space.
2 METHODS
2.1 Database Construction

Forty-two thousand five hundred ten compounds screened previously for anti-HIV activity were obtained from the NCI/NIH AIDS Antiviral Screen database. For each compound, activity was provided as either “confirmed active (CA),” “confirmed moderately active (CM),” or “confirmed inactive (CI),” according to the guidelines established by the NCI/NIH. The data set was downloaded for our model development with the following number of compounds per classification category: nCA = 422, nCM = 1,079, and nCI = 41,004 (five compounds did not have scores published). A ChemFinder [21] database was created with all 42,510 chemical structures, their NSC numbers (the NCI compound identification number), and their activities. Preliminary experiments evolving ANN classifiers over all compounds met with limited success (e.g., low predictive accuracies for actives and inactives), likely because of the large compound diversity found in this database containing a small number of confirmed active compounds relative to inactive compounds. In order to increase the predictive accuracies of the models, we “preclustered” the compound library. This approach consisted of clustering the 422 active compounds by their similarity to each other using a Tanimoto index [22], [23], [24]. Each of the resulting clusters was then used in iterative similarity searches over the entire database to identify CM and CI compounds with similar structures. As a result of this process, the vast majority of the 42,510-compound database was excluded from further analysis. The resulting clusters from this similarity search were then used to generate QSAR models. Our analysis required the choice of an appropriate Tanimoto cutoff. Iterative similarity searches were performed with the 422 active compounds. The “hits” resulting from each search were removed to avoid redundancy in the clusters. Table 1 demonstrates that, as the Tanimoto index was relaxed from 0.9 to 0.5, the clusters became larger in size and fewer in number. From this analysis, a Tanimoto cutoff of 0.80 was chosen as the best compromise between the number of compounds per cluster (cluster size) and the number of clusters in the small molecule space to be modeled.
TABLE 1 Cluster Size and Number as a Function of Tanimoto Cutoff
Cluster 3 (from the Tanimoto = 0.8 clustering of the 422 active compounds) was selected for QSAR development, as it was the largest cluster at a Tanimoto cutoff of 0.8 having the most favorable distribution of actives, moderates, and inactives upon expansion from a similarity search over the entire structural database. This expansion was performed using a representative compound (NSC 140025) from Cluster 3 for a similarity search over the entire 42,510-compound database. The resulting expanded Cluster 3 consisted of 163 similar compounds with the following representation: nCA = 25, nCM = 14, and nCI = 124, suitable for ANN training and testing.
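A minimal sketch of this preclustering step is given below. It assumes fingerprints are already available as sets of "on" bit positions and uses a simple greedy grouping; the fingerprint source, the greedy leader-style clustering, and the function names are illustrative assumptions rather than the exact procedure used with the commercial tools in the study.

```python
def tanimoto(fp_a, fp_b):
    """Tanimoto coefficient between two fingerprints given as sets of on-bits."""
    shared = len(fp_a & fp_b)
    return shared / (len(fp_a) + len(fp_b) - shared)

def cluster_actives(fingerprints, cutoff=0.8):
    """Greedy clustering of the active compounds: each unassigned compound seeds
    a cluster and pulls in all remaining compounds within the Tanimoto cutoff.
    `fingerprints` maps compound id (e.g., NSC number) -> set of on-bit indices."""
    unassigned = dict(fingerprints)
    clusters = []
    while unassigned:
        seed_id, seed_fp = next(iter(unassigned.items()))
        members = [cid for cid, fp in unassigned.items()
                   if tanimoto(seed_fp, fp) >= cutoff]
        for cid in members:          # remove "hits" so clusters do not overlap
            del unassigned[cid]
        clusters.append(members)
    return clusters

def expand_cluster(representative_fp, database_fps, cutoff=0.8):
    """Similarity search: pull CA, CM, and CI compounds from the full database
    that resemble a representative structure (e.g., NSC 140025 for Cluster 3)."""
    return [cid for cid, fp in database_fps.items()
            if tanimoto(representative_fp, fp) >= cutoff]
```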
2.2 Structure-Activity Relationship (SAR) of Cluster 3

Cluster 3 consists of four different classes of structural templates (see Appendix A, which can be found on the Computer Society Digital Library at http://computer.org/tcbb/archives.htm). Out of the 163 structures, 144 contained Template I, four contained Template II, three contained Template III, nine contained Template IV, and there were
three remaining compounds that did not fall into any of the four main structure classes. The major differences between the templates consisted primarily of the presence or absence of a double bond in one of the two rings (although Template IV has two six-membered rings). Each of the four structure classes was decomposed into an R-group table using MOE [25] (see Appendix A, which can be found on the Computer Society Digital Library at http://computer.org/tcbb/archives.htm). The core structural template was numbered starting at the oxygen, identifying the point of attachment of each R-group in the tables. From this analysis, it became quite clear that the majority of the diversity in this cluster was with respect to substitutions at positions R2 and R3. These consist of large, bulky hydrophobic groups (with some polar/acidic functionalities) as well as smaller polar and ionizable groups (e.g., halogens and hydroxides). Given the complexity of this R-group analysis of Cluster 3, ANNs are an appropriate choice for developing QSAR models. In addition, ANNs are appropriate for handling HTS activity data that are binary (active or inactive) or ternary (active, moderate, or inactive) rather than continuous (e.g., IC50, EC50). Handling HTS data of this nature has proven to be a difficult problem for most commercially available QSAR software packages that employ multiple linear regression or partial least squares analysis. For each compound, QikProp [23], [24] and MOE [25] were used to generate 200 descriptors. Multiple linear regression analysis was performed using the “correlation matrix” feature of MOE to reduce these descriptors to the most relevant features for QSAR development using ANNs. The activity scores were assigned integer values (CA = 3, CM = 2, and CI = 1) and were compared to each set of normalized descriptor values. Seventy-one descriptors having a correlation to activity of 0.18 ≤ r ≤ 0.41 (the maximum observed correlation) were selected (Appendix B, which can be found on the Computer Society Digital Library at http://computer.org/tcbb/archives.htm). The 0.18 cutoff was chosen arbitrarily to provide a sufficient number of descriptors for initial studies. A more systematic optimization of the number of descriptors and their correlation to activity will be a focus of future studies. For the purposes of this investigation, the data were then divided into two experimental data sets. Data Set 1 contained only CA and CI compounds (total n = 149) and Data Set 2 contained CA and CM compounds grouped together as “actives” and CI compounds as “inactives” (total n = 163). These two data sets were used to test the discrimination ability of the ANN in light of the additional noise introduced when using CM compounds as actives.
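The descriptor reduction step can be illustrated as follows. This sketch recomputes a Pearson correlation of each normalized descriptor against the coded activity (CA = 3, CM = 2, CI = 1) and keeps descriptors at or above the 0.18 cutoff; the use of NumPy, the absolute-value form of the cutoff, and the normalization details are assumptions made here for illustration, since the study used the correlation matrix feature of MOE.

```python
import numpy as np

def select_descriptors(X, activity, names, r_min=0.18):
    """Rank descriptors by Pearson correlation with the coded activity and keep
    those at or above the cutoff.

    X        -- (n_compounds, n_descriptors) array of raw descriptor values
    activity -- length-n_compounds array coded CA=3, CM=2, CI=1
    names    -- descriptor names, parallel to the columns of X
    """
    # Scale each descriptor to [0, 1] before correlating, mirroring the
    # comparison of activity scores against normalized descriptor values.
    X = np.asarray(X, dtype=float)
    lo, hi = X.min(axis=0), X.max(axis=0)
    X_norm = (X - lo) / np.where(hi > lo, hi - lo, 1.0)

    y = np.asarray(activity, dtype=float)
    yc = y - y.mean()
    xc = X_norm - X_norm.mean(axis=0)
    r = (xc * yc[:, None]).sum(axis=0) / (
        np.sqrt((xc ** 2).sum(axis=0) * (yc ** 2).sum()) + 1e-12)

    keep = np.abs(r) >= r_min
    return [(names[j], float(r[j])) for j in np.flatnonzero(keep)]
```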
2.3 Evolved Neural Networks

Artificial neural networks are computer algorithms based loosely on modeling of natural neuronal interconnections. They are stimulus-response transfer functions that accept some input and yield an output decision and are typically used to learn an input-output mapping over a set of examples. Optimization of neural networks with evolutionary computation is powerful because there are many potential parameters for each small molecule and the
contribution of each parameter is unknown a priori. Evolutionary computation can be used as a tool to rapidly search for the appropriate number of input parameters simultaneously with the optimization of other neural network features such as weights, number of connections, and processing elements, while avoiding premature convergence due to inappropriate heuristics. For additional detail on this approach, the reader is directed to [26]. For the purpose of this experimentation, all 71 possible features were made available as input to a fully connected neural network. However, each solution in the evolving population of neural networks contained only a subset of the 71 features at input, chosen at random initially and evolved during neural network optimization (see below). Every generation, each parent neural network architecture generated offspring neural network architectures by varying all of its weighted connections (and, where indicated, its subset of the 71 input features) simultaneously, following a Gaussian distribution with zero mean and an adaptable standard deviation. The update rule applied lognormal perturbations to the standard deviation before generating offspring. Specifically, offspring were created using the following equations:

$$\sigma'_i = \sigma_i \exp\bigl(\tau N(0,1) + \tau' N_i(0,1)\bigr), \quad i = 1, \ldots, n, \qquad (1)$$

$$x'_i = x_i + \sigma'_i N_i(0,1), \quad i = 1, \ldots, n, \qquad (2)$$

where $i$ denotes the $i$th dimension of the solution vector $x$ or strategy parameter vector $\sigma$, $N(0,1)$ is a standard Gaussian random variable, $N_i(0,1)$ designates that a standard Gaussian random variable is sampled anew for each $i$th dimension, and $\tau$ and $\tau'$ are constants set equal to $1/\sqrt{2n}$ and $1/(2\sqrt{n})$, respectively, where $n$ is the number of dimensions in $x$ and $\sigma$. The weights of each connection on each member of the initial population were set to 0.0, and the initial standard deviation in each dimension for each parent was set to 0.1. For the purpose of evolving ANNs, mean squared error (MSE) over the known training exemplars was used as a measure of fitness, where MSE was defined as:

$$\mathrm{MSE} = \frac{1}{N} \sum_{k=1}^{N} (P_k - O_k)^2, \qquad (3)$$

where P is the predicted activity, O is the observed activity, and N is the number of patterns in the training set. MSE was to be minimized over evolutionary optimization. Given that the ratio of inactives to actives in each database was roughly 5, an additional penalty of 5 was applied for models that incorrectly predicted inactives as active compounds. The best evolved models on the training data were then assessed for their predictive ability on held-out examples using leave-one-out cross validation. A population of 50 parents and 50 offspring feed-forward, fully connected neural network models was evolved for generations {100, 500, 1,000, 1,750, 2,500, 3,750, 5,000, 7,500, 10,000} using tournament selection with four opponents. Additional information on this selection method can be found in [27]. For each data set, three series of experiments over all generations above were conducted with 10, 20, or 30 features as input from the larger possible space of 71 features. Thus, the number of features used in model development was reduced; however, the evolutionary process was used to determine which combinations of 10, 20, or 30 features from the 71 possible features could result in optimized ANN models. The feed-forward ANN models also utilized a fixed number of five hidden nodes and one output node (representing the activity prediction). All input features were normalized to the range [0.1, 0.9]. Evolutionary computation was used to optimize the choice of input features for the ANN models simultaneously with the weight assignments on the ANN connections. For each ANN architecture and number of generations of optimization, MSE over all training and testing examples was calculated, as was the overall ROC curve area. These values, in conjunction with convergence plots describing the evolutionary learning over the training examples, were used to determine the number of generations that retained the best performance on the held-out leave-one-out samples without overfitting to the training examples.

TABLE 2 Area under the ROC Curve for Different Generations of Evolution and with Different Numbers of Features Used as Input to the Neural Networks for Data Set 1
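A compact sketch of the variation and fitness rules described above is given below, assuming a fixed 10-input, 5-hidden-node, 1-output feed-forward network with logistic activations and activity targets coded 1 (active) and 0 (inactive). The activation function, target coding, and flattened-weight representation are illustrative assumptions; the lognormal self-adaptation, the MSE criterion of equation (3), and the additional penalty of 5 on false positives follow the text.

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def forward(net, x):
    """Feed-forward pass: selected inputs (normalized to [0.1, 0.9]) ->
    five hidden nodes -> one output in (0, 1)."""
    h = sigmoid(net["W1"] @ x + net["b1"])
    return float(sigmoid(net["W2"] @ h + net["b2"]))

def fitness(net, X, y, penalty=5.0):
    """MSE over the training exemplars (equation (3)), with the extra penalty
    applied when an inactive compound (y = 0) is scored as active."""
    total = 0.0
    for x_k, o_k in zip(X, y):
        p_k = forward(net, x_k)
        sq = (p_k - o_k) ** 2
        if o_k < 0.5 and p_k >= 0.5:   # false positive: inactive called active
            sq *= penalty
        total += sq
    return total / len(X)

def mutate(weights, sigmas):
    """Self-adaptive lognormal mutation of a flattened weight vector and its
    strategy parameters (equations (1) and (2))."""
    n = weights.size
    tau, tau_prime = 1.0 / np.sqrt(2.0 * n), 1.0 / (2.0 * np.sqrt(n))
    sigmas_new = sigmas * np.exp(tau * rng.standard_normal()
                                 + tau_prime * rng.standard_normal(n))
    weights_new = weights + sigmas_new * rng.standard_normal(n)
    return weights_new, sigmas_new

# Example initialization, following the text: weights 0.0 and sigmas 0.1.
# In the evolutionary loop the network's weight arrays would be flattened
# into a single vector before calling mutate() and reshaped afterwards.
n_in = 10
net = {"W1": np.zeros((5, n_in)), "b1": np.zeros(5),
       "W2": np.zeros(5), "b2": 0.0}
```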
3 RESULTS
3.1 Data Set 1—CA versus CI

Table 2 provides the area under the ROC curve (A(z)) for all experiments with Data Set 1. Larger values of A(z) indicate more successful predictive models that maximize the probability of detection while minimizing the probability of false alarm. When using 10 features, the best results were obtained after 2,500 generations of evolution (A(z) = 0.921). This process required roughly 10 minutes for each leave-one-out sample, or roughly one day of CPU time for all samples, on a 2.2 GHz Pentium IV processor. With 20 input features, the best performing neural networks were obtained after only 1,750 generations of evolution (A(z) = 0.910). With 30 features as input, 7,500 generations were required to obtain the best neural network models (A(z) = 0.887).
Fig. 2. Performance of best evolved neural networks for Data Set 1 showing separation of CA compounds (1) from CI compounds (0) as box plots and ROC curves on the held-out examples during leave-one-out cross validation for neural networks with (a) 10 inputs, (b) 20 inputs, and (c) 30 inputs chosen at random from the larger set of 71 possible input features. For each ROC curve, the probability of false alarm (P(FA)) is shown relative to the probability of detection (P(D)).
Fig. 2a, Fig. 2b, and Fig. 2c provide box plots and ROC curves for these best results when using evolved neural networks with 10, 20, or 30 inputs, respectively. For Data Set 1, the best performing neural networks used only 10 input features.
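The area under the ROC curve reported here can be computed from the leave-one-out output scores via the rank-sum (Mann-Whitney) identity: A(z) is the probability that a randomly chosen active receives a higher score than a randomly chosen inactive. The sketch below is an illustrative calculation, not the evaluation code used in the study.

```python
def roc_auc(scores_active, scores_inactive):
    """Area under the ROC curve as the probability that a randomly chosen
    active scores higher than a randomly chosen inactive (ties count 1/2)."""
    wins = 0.0
    for a in scores_active:
        for b in scores_inactive:
            if a > b:
                wins += 1.0
            elif a == b:
                wins += 0.5
    return wins / (len(scores_active) * len(scores_inactive))

# Example: leave-one-out output scores for CA (label 1) and CI (label 0) compounds.
print(roc_auc([0.9, 0.8, 0.7, 0.4], [0.6, 0.3, 0.2, 0.1]))  # 0.9375
```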
3.2 Data Set 2—CA+CM versus CI

Table 3 provides the area under the ROC curve (A(z)) for all experiments with Data Set 2. When using 10 features as input randomly selected from the 71 possible features, the best results were obtained after 1,000 generations of evolution (A(z) = 0.828). With 20 input features, the best performing neural networks were obtained after 7,500 generations of evolution (A(z) = 0.836). This process required approximately 30 minutes for each leave-one-out sample, or the equivalent of three days of CPU time for all samples, on a 2.2 GHz Pentium IV processor. With 30 features as input, 2,500 generations were required to obtain the best neural network models (A(z) = 0.836). Fig. 3a, Fig. 3b, and Fig. 3c provide box plots and ROC curves for these best results when using evolved neural networks with 10, 20, or 30 inputs, respectively. For Data Set 2, the most parsimonious, best performing neural networks used only 20 input features.
3.3 Comparison of Data Sets 1 and 2

The vast majority of models developed for Data Set 1 resulted in higher A(z) scores relative to models developed for Data Set 2 (Table 2 and Table 3), suggesting that the discrimination of CA versus CI compounds in Cluster 3 was easier than discriminating CA and CM compounds from CI compounds for the same cluster. For Data Set 2, performance after only 100 generations of evolution was poor, indicating insufficient learning on this more difficult problem relative to Data Set 1, where 100 generations of evolution were sufficient for the generation of models with reasonable ROC curve areas.
TABLE 3 Area under the ROC Curve for Different Generations of Evolution and with Different Numbers of Features Used as Input to the Neural Networks for Data Set 2
4 DISCUSSION
Initial investigation on all 42,510 compounds in the Developmental Therapeutics Program NCI/NIH AIDS Antiviral Screen public database [20] indicated that development of neural network models capable of distinguishing CA from CI compounds was difficult (data not shown). This was likely due to the large structural diversity represented in the database relative to the small number of CA compounds. Interrogation of random samples of small molecules chosen from the entire database indicated that there was little correlation between small molecule descriptors and level of activity. This also suggested that it is futile to develop models for the entire database with limited features, given that some features will only be important in very limited regions of the small molecule space. A more effective modeling strategy was required for effective high-throughput screening. We developed a strategy that first clusters active small molecules in the database according to structural similarity using a stringent Tanimoto coefficient. These active compounds are then used to extract structurally related CM and CI compounds. This approach greatly simplifies the problem by focusing model development on useful portions of the entire small molecule space while simultaneously ensuring that the features associated with those clusters are most meaningful. An analysis was performed to choose the optimal Tanimoto cutoff. A Tanimoto coefficient of 0.8 was selected as the best compromise between clusters that are well defined and able to be modeled and clusters that are large but diverse. Cluster 3 was chosen, and a representative structure (NSC 140025) was used to perform a similarity search over the entire database with a relaxed Tanimoto coefficient of 0.8. We then examined this single cluster to determine to what degree evolved ANNs could discriminate active from inactive small molecules while simultaneously reducing the number of features used for such discrimination.
Two main experiments were conducted. The first experiment focused on developing models that could distinguish CA from CI compounds. The data from Table 2 and Fig. 2 suggest that it is possible to evolve ANNs capable of discriminating actives from inactives from within subclusters of similar compounds with reasonably high accuracy (A(z) > 0.90). The second experiment grouped CA and CM compounds to determine if these could be separated from CI compounds. The data from Table 3 and Fig. 3, in comparison to Table 2 and Fig. 2, suggest that this is possible, but with reduced accuracy (A(z) > 0.80). Under the restriction that a random subset of 10 features from the larger set of 71 features could be used as input to the ANNs, it becomes possible to examine the relative frequency with which features were used as input in the best-evolved models so as to determine their importance. Fig. 4 presents these data for the first 100 leave-one-out cross validations after 2,500 generations of evolution for Data Set 1. The mean number of times each feature was used in the best neural network models was 14.085 ± 15.778. The overrepresentation of three features ((15) diameter, (26) KierA2, and (42) PEOE_VSA_HYD) was statistically significant, exceeding the mean representation by more than two standard deviations (p < 0.01). Features (2) a_don and (25) Kier3 were also overrepresented at values greater than one standard deviation from the mean representation (p < 0.05). Four additional features were used often, (55) QPPMDCK, (56) radius, (57) rgyr, and (68) vsa_hyd, but were not statistically significantly overrepresented relative to the mean representation. Fig. 5 presents the same data for Data Set 2 after 7,500 generations of evolution with 20 possible inputs (the parameters that led to the best evolved neural networks for Data Set 2). The mean number of times each feature was used in the best neural network models was 14.085 ± 7.893. The overrepresentation of features (39) PEOE_VSA_FPOL, (42) PEOE_VSA_HYD, (64) std_dim3, and (68) vsa_hyd was statistically significant (p < 0.01). The overrepresentation of features (2) a_don, (25) Kier3, and (32) opr_nrot was statistically significant (p < 0.05). When comparing the resulting models from Data Sets 1 and 2, four features are found in common ((2) a_don, (26) KierA2, (42) PEOE_VSA_HYD, and (68) vsa_hyd), with feature 42 being represented often in the best models from both data sets. Historically, one of the major limitations of the use of QSAR models in lead optimization has been that many of the molecular descriptors used in these models are not “human-interpretable.” Lead optimization is an iterative process of systematically exploring the SAR of a lead series discovered in an HTS campaign. Structural analogues of a “hit” compound are synthesized and tested both in vitro and in vivo (at significant cost) in order to obtain high-information-content data that include binding affinities, IC50, EC50, and Ki values, as well as specificity for a target. A “human interpretable” feature is one that can be used by a medicinal chemist to propose and synthesize new compounds in order to optimize potency and specificity. It is very encouraging that three of the four features found in common in Data Sets 1 and 2 are “human interpretable” and appear to be features directly relevant to protein-ligand interactions. For example, a_don is the number of hydrogen bond donor atoms, while PEOE_VSA_HYD and vsa_hyd are related to the hydrophobic van der Waals surface area of the compound.
Fig. 3. Performance of best evolved neural networks for Data Set 2 showing separation of CA compounds (1) from CI compounds (0) as box plots and ROC curves on the held-out examples during leave-one-out cross validation for neural networks with (a) 10 inputs, (b) 20 inputs, and (c) 30 inputs chosen at random from the larger set of 71 possible input features.
Hydrophobic interactions and hydrogen bonding are both primary driving forces and contributors to the binding affinity and specificity of a ligand to an active site. Both can be readily modulated to systematically develop an SAR that will ultimately lead to more potent and specific inhibitors. The other feature, KierA2, is related to the shape of the molecule, but is not so readily interpretable. In addition, these features are completely consistent with the SAR of Cluster 3 presented in Appendix B, which can be found on the Computer Society Digital Library at http://computer.org/tcbb/archives.htm. Most of the diversity in Cluster 3 came from hydrophobic R-groups (with some polar and acidic functionality) and smaller polar (e.g., halogens) and ionizable groups (e.g., hydroxides) that can act as hydrogen-bond donors. Many of the other commonly used features are also “human interpretable.” These include terms related to size (diameter, radius) and relative polar surface area (PEOE_VSA_FPOL). These results are extremely encouraging and imply that evolved ANNs used for focused library generation can also be used for SAR and lead optimization.

Fig. 4. Frequency of occurrence of all 71 features, chosen 10 at a time, as input to the best evolved neural networks from Data Set 1 after 2,500 generations of evolution.
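A small sketch of the feature-frequency analysis behind Figs. 4 and 5 is given below: tally how often each of the 71 features appears among the inputs of the best evolved networks, and flag features whose counts lie more than two standard deviations above the mean. The data structures and function name are illustrative assumptions, not the analysis code used in the study.

```python
from collections import Counter

def overrepresented_features(selected_inputs_per_model, n_features=71, z_cut=2.0):
    """Count how often each feature index appears among the inputs of the best
    evolved networks (one list of feature indices per leave-one-out model) and
    return the features lying more than `z_cut` standard deviations above the
    mean count."""
    counts = Counter()
    for inputs in selected_inputs_per_model:
        counts.update(inputs)
    values = [counts.get(j, 0) for j in range(n_features)]
    mean = sum(values) / n_features
    std = (sum((v - mean) ** 2 for v in values) / n_features) ** 0.5
    return [(j, v) for j, v in enumerate(values)
            if std > 0 and (v - mean) / std > z_cut]
```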
TABLE 5 Confusion Matrix for Leave-One-Out Cross Validation after 7,500 Generations of Evolution on Data Set 2 Using 20 Inputs
Fig. 5. Frequency of occurrence of all 71 features chosen 20 at a time as input to the best evolved neural networks from Data Set 2, after 7,500 generations of evolution.
5 CONCLUSION
We have applied a preclustering method coupled with evolved ANNs to the prediction of small molecule activity from the Developmental Therapeutics Program NCI/NIH AIDS Antiviral Screen public database [20]. The best results from Data Sets 1 and 2 can be analyzed in terms of a confusion matrix (Table 4 and Table 5). The results from Data Set 1 (Table 4) indicate that the best-evolved ANNs are capable of 84 percent predictive accuracy on true positives and 87 percent on true negatives, while the results from Data Set 2 (Table 5) indicate 76 percent predictive accuracy on true positives and 71 percent on true negatives. Given that no attempt was made to optimize the penalty associated with false positive prediction, and as indicated in the introduction, a model with 85 percent accuracy represents a 5- to 6-fold reduction in the number of compounds required for testing in order to obtain 85 percent of the same number of confirmed and verified hits. Future research will make use of a reduced feature set incorporating the most valuable features indicated above, resulting in more parsimonious models with potentially higher predictive accuracy.

TABLE 4 Confusion Matrix for Leave-One-Out Cross Validation after 2,500 Generations of Evolution on Data Set 1 Using 10 Inputs

From a very practical new-leads-discovery perspective, the resulting focused library of predicted actives should be as enriched (e.g., number of “true” actives per total) as possible in order to minimize the costs associated with testing false positives. In practice, moderately active compounds would not be of interest except to demonstrate that there is a systematic SAR in the structure class that can be optimized. A structure class of compounds having a gradient of activity will often be of more value than a higher activity “hit” that represents an island of activity in structure space. Given the observable performance difference after adding CM compounds to model development, we are interested in evaluating other combinations of compound types, such as grouping CM compounds with CI compounds rather than with CA compounds, to see if there is any performance improvement. Treating all three compound types as completely independent classes is another option. In addition, the selection of descriptors having at least 0.18 correlation to activity was chosen arbitrarily to ensure a sufficient number of descriptors. A more systematic analysis of the effect of the number of descriptors and their correlation to experimental activity on the performance of the ANNs is warranted. The models will be improved by optimizing both the stringency as well as the number and type of descriptors used. The data set used in this analysis will be made available upon request to the authors.
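The per-class accuracies quoted from the confusion matrices follow directly from the matrix counts. The sketch below shows the calculation; the example counts are placeholders chosen only to be consistent with the class sizes and accuracies reported in the text, not values copied from Table 4.

```python
def confusion_stats(tp, fn, fp, tn):
    """Per-class predictive accuracies from a binary confusion matrix:
    sensitivity = accuracy on true positives (actives),
    specificity = accuracy on true negatives (inactives)."""
    sensitivity = tp / (tp + fn)
    specificity = tn / (tn + fp)
    return sensitivity, specificity

# Usage with placeholder leave-one-out counts (25 actives, 124 inactives):
sens, spec = confusion_stats(tp=21, fn=4, fp=16, tn=108)
print(f"true-positive accuracy {sens:.0%}, true-negative accuracy {spec:.0%}")
```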
ACKNOWLEDGMENTS

The authors would like to thank Mars Cheung and Bill Porto for their assistance with code development and Connie Ma and Susanna Wong for their assistance with small molecule clustering. In addition, the authors would like to thank the reviewers for their suggestions and comments.
REFERENCES

[1] PhRMA Industry Profile 2003 Report, http://www.phrma.org/publications, 2003.
[2] Center for Drug Evaluation and Research, http://www.fda.gov/cder/rdmt, 2006.
[3] C. Hansch and T. Fujita, “ρ-σ-π Analysis: A Method for the Correlation of Biological Activity and Chemical Structure,” J. Am. Chemical Soc., vol. 86, p. 1616, 1964.
[4] G.E. Kellogg and S.F. Semus, “3D QSAR in Modern Drug Design,” Modern Methods of Drug Discovery, pp. 223-241, 2003.
[5] E. Ernesto, G. Patlewicz, and U. Eugenio, “From Molecular Graphs to Drugs: A Review on the Use of Topological Indices in Drug Design and Discovery,” Indian J. Chemistry, Section A: Inorganic, Bioinorganic, Physical, Theoretical and Analytical Chemistry, vol. 42A, pp. 1315-1329, 2003.
[6] P.V. Desai and E.C. Coutinho, “QSAR in Drug Discovery and Development,” Asian Chemistry Letters, vol. 5, pp. 77-86, 2001.
[7] V.J. Gillet, “Application of Evolutionary Algorithms to Combinatorial Library Design,” Computing Approaches in Chemistry, pp. 130, 2003.
[8] S.P. Niculescu, “Artificial Neural Networks and Genetic Algorithms in QSAR,” J. Molecular Structure, vol. 622, pp. 71-83, 2003.
[9] M.J. Embrechts, M. Ozdemir, L. Lockwood, C. Breneman, K. Bennett, D. Devogelaere, and M. Rijckaert, “Feature Selection Methods Based on Genetic Algorithms for in Silico Drug Design,” Evolutionary Computation in Bioinformatics, pp. 317-339, 2002.
[10] D. Weekes and G.B. Fogel, “Evolutionary Optimization, Backpropagation, and Data Preparation Issues in QSAR Modeling of HIV Inhibition by HEPT Derivatives,” BioSystems, vol. 72, pp. 149158, 2003. [11] D.G. Landavazo, G.B. Fogel, and D.B. Fogel, “Quantitative Structure-Activity Relationships by Evolved Neural Networks for the Inhibition of Dihydrofolate Reductase by Pyrimidines,” BioSystems, vol. 65, pp. 37-47, 2002. [12] S.-S. So and M. Karplus, “Evolutionary Optimization in Quantitative Structure-Activity Relationships: An Application of Genetic Neural Networks,” J. Medical Chemistry, vol. 39, pp. 1521-1530, 1996. [13] B.T. Luke, “Evolutionary Programming Applied to the Development of Quantitative Structure-Activity Relationships and Quantitative Structure-Property Relationships,” J. Chemical Information and Computing Science, vol. 34, pp. 1279-1287, 1994. [14] http://dtp.nci.nih.gov/, 2006. [15] H. Hong, N. Neamati, S. Wang, M.C. Nicklaus, A. Mazumder, H. Zhao, T.R. Burke, Y. Pommier, and G.W.A. Milne, “Discovery of HIV-1 Integrase Inhibitors by Pharmacophore Searching,” J. Medical Chemistry, vol. 40, pp. 930-936, 1997. [16] N. Neamati, H. Hong, A. Mazumder, S. Wang, S. Sunder, M.C. Nicklaus, G.W.A. Milne, B. Proksa, and Y. Pommier, “Depsides and Depsidones as Inhibitors of HIV-1 Integrase: Discovery of Novel Inhibitors through 3D Database Searching,” J. Medical Chemistry, vol. 40, pp. 942-951, 1997. [17] S.Y. Tamura, P.A. Bacha, H.S. Gruver, and R.F. Nutt, “Data Analysis of High-Throughput Screening Results: Application of Multidomain Clustering to the NCI Anti-HCV Data Set,” J. Medical Chemistry, vol. 45, pp. 3082-3093, 2002. [18] M.C. Nicklaus, N. Neamati, H. Hong, A. Mazumder, S. Sunder, J. Chen, G.W.A. Milne, and Y. Pommier, “HIV-1 Integrase Pharmacophore: Discovery of Inhibitors through Three-Dimensional Database Searching,” J. Medical Chemistry, vol. 40, pp. 920-929, 1997. [19] C.Y.C. Ma, S.W.M. Wong, D. Hecht, and G. Fogel, “Evolved Neural Networks for High Throughput Anti-HIV Ligand Screening,” Proc. IEEE 2006 Congress on Evolutionary Computation, 2006. [20] http://dtp.nci.nih.gov/docs/aids/aids_data.html, 2006. [21] www.cambridgesoft.com, 2006. [22] P. Willett, “Chemical Similarity Searching,” J. Chemical Information and Computing Science, vol. 38, pp. 983-996, 1998. [23] E.M. Duffy and W.L. Jorgensen, “Prediction of Properties from Simulations: Free Energies of Solvation in Hexadecane, Octanol, and Water,” J. Am. Chemistry Soc., vol. 122, pp. 2878-2888, 2000. [24] W.L. Jorgensen and E.M. Duffy, “Prediction of Drug Solubility from Monte Carlo Simulations,” Bioorganic and Medical Chemistry Letters, vol. 10, pp. 1155-1158, 2000. [25] http://www.chemcomp.com/, 2006. [26] X. Yao, “Evolving Neural Networks,” Proc. IEEE, vol. 87, pp. 14231447, 1999. [27] D.B. Fogel, Evolutionary Computation: Toward a New Philosophy of Machine Intelligence, third ed. IEEE Press, 2006. [28] R.S. Pearlman and K.M. Smith, “Novel Software Tools for Chemical Diversity,” Perspectives in Drug Discovery and Design, pp. 339-353, 1998. [29] S.A. Wildman and G.M. Crippen, “Prediction of Physiochemical Parameters by Atomic Contributions,” J. Chemistry Information and Computer Science, vol. 39, pp. 868-873, 1999. [30] M. Petitjean, “Applications of the Radius-Diameter Diagram to the Classification of Topological and Geometrical Shapes of Chemical Compounds,” J. Chemistry Information and Computer Science, vol. 32, pp. 331-337, 1992. [31] L.H. Hall and L.B. 
Kier, “The Molecular Connectivity Chi Indices and Kappa Shape Indices in Structure-Property Modeling,” Rev. Computing Chemistry, vol. 2, pp. 367-422, 1991. [32] L.H. Hall and L.B. Kier, “The Nature of Structure-Activity Relationships and Their Relation to Molecular Connectivity,” European J. Medical Chemistry, vol. 12, p. 307, 1977. [33] T.I. Oprea, “Property Distribution of Drug-Related Chemical Databases,” J. Computer-Aided Molecular Design, vol. 14, pp. 251-264, 2000. [34] J. Gasteiger and M. Marsili, “Iterative Partial Equalization of Orbital Electronegativity—A Rapid Access to Atomic Charges,” Tetrahedron, vol. 36, p. 3219, 1980.
[35] P. Ertl, B. Rohde, and P. Selzer, “Fast Calculation of Molecular Polar Surface Area as a Sum of Fragment-Based Contributions and Its Application to the Prediction of Drug Transport Properties,” J. Medical Chemistry, vol. 43, pp. 3714-3717, 2000. [36] M. Yazdanian, S.L. Glynn, J.L. Wright, and A. Hawi, “Correlating Partitioning and Caco-2 Cell Permeability of Structurally Diverse Small Molecular Weight Compounds,” Pharmaceutical Research, vol. 15, pp. 1490-1494, 1998. [37] J.D. Irvine, L. Takahashi, K. Lockhart, J. Cheong, J.W. Tolan, H.E. Selick, and J.R. Grove, “MDCK (Madin-Darby Canine Kidney) Cells: A Tool for Membrane Permeability Screening,” J. Pharmaceutical Science, vol. 88, pp. 28-33, 1999. [38] P. Stenberg, U. Norinder, K. Luthman, and P. Artursson, “Experimental and Computational Screening Models for the Prediction of Intestinal Drug Absorption,” J. Medical Chemistry, vol. 44, pp. 1927-1937, 2001. [39] R.O. Potts and R.H. Guy, “Predicting Skin Permeability,” Pharmaceutical Research, vol. 9, pp. 663-669, 1992. [40] R.O. Potts and R.H. Guy, “Predicting Skin Permeability: II. The Effects of Molecular Size and Hydrogen Bond Activity,” Pharmaceutical Research, vol. 12, pp. 1628-1633, 1995. [41] G. Colmenarejo, A. Alvarez-Pedraglio, and J.-L. Lavandera, “Cheminformatic Models to Predict Binding Affinities to Human Serum Albumin,” J. Medical Chemistry, vol. 44, pp. 4370-4378, 2001. [42] E.M. Duffy and W.L. Jorgensen, “Prediction of Properties from Simulations: Free Energies of Solvation in Hexadecane, Octanol, and Water,” J. Am. Chemistry Soc., vol. 122, pp. 2878-2888, 2000. [43] W.L. Jorgensen and E.M. Duffy, “Prediction of Drug Solubility from Monte Carlo Simulations,” Bioorganic Medical Chemistry Letters, vol. 10, pp. 1155-1158, 2002. [44] W.L. Jorgensen and E.M. Duffy, “Prediction of Drug Solubility from Structure,” Advanced Drug Delivery Rev., vol. 54, pp. 355-366, 2002.

David Hecht received the PhD degree in macromolecular structural biology and chemistry from the Scripps Research Institute in 2005, the MS degree in chemistry from the University of California, Berkeley, in 1989, and the BS degree in biochemistry from Rutgers University in 1988. He is currently an assistant professor of chemistry at Southwestern Community College in Chula Vista, California. He has spent the previous 10 years in several biotechnology companies actively involved in drug discovery and optimization activities including HTS, computational biology, and chemistry as well as research informatics. He is a member of the IEEE.

Gary B. Fogel received the PhD degree in biology from the University of California, Los Angeles, in 1998 and the BA degree in biology from the University of California, Santa Cruz, in 1991. He is currently vice president of Natural Selection, Inc., in La Jolla, California. His experience includes more than 13 years of applying computational intelligence methods to bioinformatics problems. He has more than 40 publications in the technical literature, the majority treating the science and application of evolutionary computation, and he is the coeditor of the book Evolutionary Computation in Bioinformatics (Morgan Kaufmann). He also serves as an associate editor for the IEEE/ACM Transactions on Computational Biology and Bioinformatics, IEEE Computational Intelligence Magazine, and IEEE Transactions on Evolutionary Computation, and is on the editorial board of four other technical journals.
He served as the general chairman for the 2005 and 2006 IEEE Computational Intelligence in Bioinformatics and Computational Biology Symposia, as cotechnical chair for the 2001 and 2006 IEEE Congress on Evolutionary Computation, and as program chair for the 2004 IEEE Congress on Evolutionary Computation. He was chair of the IEEE Computational Intelligence Society Bioinformatics and Bioengineering Technical Committee (2004-2005). He is a senior member of the IEEE.