Indian Journal of Biochemistry & Biophysics Vol. 49, June 2012, pp 202-210

Linear and non-linear quantitative structure-activity relationship models on indole substitution patterns as inhibitors of HIV-1 attachment

Mahyar Nirouei1*, Ghasem Ghasemi2, Parviz Abdolmaleki3, Abdolreza Tavakoli1 and Shahab Shariati4

1 Department of Electrical Engineering, Lahijan Branch, Islamic Azad University, P. O. Box 1616, Lahijan, Iran
2 Department of Chemistry, Islamic Azad University, Rasht Branch, Rasht, Iran
3 Department of Biophysics, Tarbiat Modares University, Tehran, Iran
4 Department of Chemistry, Science and Research Branch, Islamic Azad University, Guilan, Iran

Received 16 October 2011; revised 27 April 2012

The antiviral drugs that inhibit human immunodeficiency virus (HIV) entry into target cells are already in different phases of clinical trials. They prevent viral entry and have a highly specific mechanism of action with a low toxicity profile. Few QSAR studies have been performed on this group of inhibitors. This study was performed to develop a quantitative structure–activity relationship (QSAR) model of the biological activity of indole glyoxamide derivatives as inhibitors of the interaction between HIV glycoprotein gp120 and host cell CD4 receptors. Forty different indole glyoxamide derivatives were selected as a sample set and geometrically optimized using Gaussian 98W. Different combinations of multiple linear regression (MLR), genetic algorithms (GA) and artificial neural networks (ANN) were then utilized to construct the QSAR models. These models were also utilized to select the most efficient subsets of descriptors in a cross-validation procedure for non-linear log (1/EC50) prediction. The results obtained using GA-ANN were compared with those of the MLR-MLR and MLR-ANN models. Predictive ability improved from the MLR-MLR to the MLR-ANN and GA-ANN models, with prediction-set root mean square errors (RMSE) of 0.99, 0.91 and 0.67, respectively (N = 40). In summary, machine learning methods were more effective than the statistical method in designing QSAR models.

Keywords: HIV, Indole glyoxamide derivatives, Quantitative structure-activity relationship, Genetic algorithm, Artificial neural network, Multiple linear regressions

*Corresponding author. Tel: +98-131-7730352; Fax: +98-131-7720327; E-mail: [email protected]

Abbreviations: ANN, artificial neural networks; CCR5, C-C chemokine receptor type 5; CVSET, cross validation set; CXCR4, C-X-C chemokine receptor type 4; 3D-MoRSE, 3D-molecular representation of structure based on electron diffraction; GA, genetic algorithms; GETAWAY, geometry, topology and atom-weights assembly; HIV-1, human immunodeficiency virus-1; MLR, multiple linear regressions; PCM, polarized continuum model; PSET, prediction set; QSAR, quantitative structure–activity relationship; RBP, resilient back-propagation; RDF, radial distribution function; RMSE, root mean square error; TSET, training set; WHIM, weighted holistic invariant molecular.

The process of human immunodeficiency virus-1 (HIV-1) entry into host cells offers considerable potential for therapeutic intervention, with viral entry proceeding through multiple sequential steps involving attachment, co-receptor binding and fusion [1,2]. The early step of viral entry into the host cell is accomplished through binding of the viral envelope glycoprotein complex gp160 to the cellular receptor CD4. This attachment is followed by conformational changes of the gp160 external glycoprotein portion gp120, which facilitate the second step involving binding to a cellular co-receptor, usually the chemokine receptor CCR5 or CXCR4 [1,2]. Co-receptor binding, in turn, facilitates a large conformational change and initiates the final entry event, which leads to dissociation of gp120 from gp41, the virus membrane-spanning protein that mediates the fusion of the virus with the host cell.

The indole-3-glyoxamide derivatives, such as BMS-378806 and BMS-488043, have been described as the first small-molecule inhibitors of the gp120-CD4 interaction (HIV-1 attachment inhibition) that demonstrate potent antiviral activity in cell culture [3]. These compounds, of which the 4-fluoro derivative is prototypical, appear to act by stabilizing a specific conformation of gp120 that is poorly recognized by CD4 [3]. However, under certain circumstances, compounds of this class have been shown to form a ternary complex with gp120 and CD4 and interfere with the CD4-induced exposure of the gp41 heptad repeats, providing a potential additional mode of action [3]. Activity of attachment inhibitors is


independent of human chemokine co-receptor binding and persists irrespective of viral tropism or host cell phenotype [4-6]. Since attachment inhibitors target a viral protein rather than a host chemokine receptor, they are not expected to impact human immune responses. These compounds exhibit some favorable pharmacokinetic traits, such as low protein binding, good oral bioavailability in animal models and a good safety profile in pre-clinical testing [6]. They have also been reported to be safe and well tolerated, with no serious adverse events [7].

QSAR provides valuable information for drug designers to improve the efficiency of drugs. In QSAR techniques, quantitative structural descriptors that are closely related to molecular activity are selected, and the relation between these descriptors and the activity is then described by a suitable quantitative model. Various statistical and machine learning techniques have been applied for building QSAR models. Different kinds of regression analysis, genetic algorithms and artificial neural networks can be used for selecting significant descriptors and developing QSAR models. Software for molecular modeling, neural networks and statistical analysis has been applied to data sets to generate predictive models for biological activities. Neural networks have been used to predict biological activities of HIV-1 reverse transcriptase and vesicular monoamine transporter-2 inhibitors, drug resistance, lipophilicity, aqueous solubility, intestinal absorption and site of protease cleavage [8]. Training and testing molecules can be obtained from either single or multiple databanks or from a compilation of several previous studies [9].

In the present study, we have applied multiple linear regressions (MLR), genetic algorithms (GA) and artificial neural networks (ANN) as linear and non-linear models to investigate the QSAR of indole glyoxamide derivatives [3] as inhibitors of HIV-1 attachment. Few QSAR studies have been performed on this group of inhibitors. These models have also been used to select the more effective descriptors to obtain a hybrid computational model for a rough prediction of inhibitory activity. The ability of these methods to predict the inhibitory activity of indole glyoxamide derivatives has also been compared.

Methodology

Molecular dataset

The structures and half-maximal effective concentration (EC50, in nM) values of all compounds were obtained from the work of Meanwell et al. [3]. This set contained the effective concentration activities of 40 indole glyoxamide derivatives. The basic structures of these compounds are shown in Table 1. The inhibitory activities on a logarithmic scale (log (1/EC50)) ranged from -3.1899 (compound 10) to 1.2218 (compound 29), with a mean value of -0.937. A set of 8 compounds was randomly removed from the dataset to be used as the prediction set (PSET). The log (1/EC50) values of this set spanned the range of the entire dataset. The remaining 32 compounds were utilized as the training set (TSET) [8].

Model evaluation

The log (1/EC50) of the compounds was used as the dependent variable in model development. In the neural-network-based QSAR models, a cross-validation procedure was utilized to avoid any possible bias in selecting the members of the testing set. The structures of all models were optimized for minimum root mean square error (RMSE) as the performance benchmark.

Descriptor calculation and selection

Geometry optimizations of the 40 compounds were carried out in the following sequence: AM1/HF/6-31G* → B3LYP/6-31G(d) in Gaussian 98W [10,11]. DFT/B3LYP was chosen because this method has been shown to produce satisfactory results for molecular geometries and energies [11]. The polarized continuum model (PCM) was applied to account for non-specific solvent effects; to investigate the effect of the solvent environment on the structures, all molecules were optimized in water. The molecular descriptors for constructing the best model were calculated with the Dragon program. Different types of numerical descriptors were generated to describe each compound. These descriptors were categorized into topological, geometrical, MoRSE [12,13], RDF [13,14], GETAWAY [15,16], autocorrelation [13] and WHIM [16,18] groups. In total, 1038 descriptors were generated.

The number of descriptors was then reduced through an objective feature selection performed in three steps. First, descriptors that had the same value for at least 70% of the compounds in the dataset were removed. Second, descriptors with a correlation coefficient of less than 0.25 with the dependent variable (log (1/EC50)) were considered redundant and removed [19]. Finally, pairs of variables with a correlation coefficient greater than 0.90 were classified as inter-correlated, and one member of each correlated pair was randomly eliminated. After these three steps, the number of descriptors was reduced to 58.
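The three-step objective feature selection described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function name, the use of the absolute Pearson correlation, and the NumPy-based layout (`X` as a compounds-by-descriptors array, `y` as log (1/EC50)) are assumptions made for the example.

```python
import numpy as np

def objective_feature_selection(X, y, same_frac=0.70, min_target_r=0.25,
                                max_pair_r=0.90, seed=0):
    """Return the column indices of X that survive the three filters.

    Assumption: correlations are compared by absolute value (|r|).
    """
    rng = np.random.default_rng(seed)

    def mode_fraction(col):
        # Fraction of compounds sharing the most common value of a descriptor.
        _, counts = np.unique(col, return_counts=True)
        return counts.max() / len(col)

    def abs_r(a, b):
        return abs(np.corrcoef(a, b)[0, 1])

    # Step 1: drop descriptors with the same value for >= 70% of compounds.
    keep = [j for j in range(X.shape[1]) if mode_fraction(X[:, j]) < same_frac]

    # Step 2: drop descriptors with |r| < 0.25 against the dependent variable.
    keep = [j for j in keep if abs_r(X[:, j], y) >= min_target_r]

    # Step 3: for each pair with |r| > 0.90, drop one member at random.
    removed = set()
    for pos, a in enumerate(keep):
        for b in keep[pos + 1:]:
            if a in removed or b in removed:
                continue
            if abs_r(X[:, a], X[:, b]) > max_pair_r:
                removed.add(int(rng.choice([a, b])))
    return [j for j in keep if j not in removed]

if __name__ == "__main__":
    # Toy data (hypothetical): 40 "compounds", 5 "descriptors".
    y = np.linspace(-1.0, 1.0, 40)
    X = np.column_stack([
        y,                                       # informative
        y + 0.01 * np.sin(np.arange(40)),        # near-duplicate of column 0
        np.r_[np.zeros(32), np.arange(1, 9)],    # 80% identical values
        y ** 2,                                  # uncorrelated with y
        np.sign(y),                              # informative, |r| < 0.90 with column 0
    ])
    print(objective_feature_selection(X, y))
```

In the toy data, column 2 falls to the first filter, column 3 to the second, and one of the near-duplicate columns 0 and 1 to the third, so two descriptors remain.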


Table 1—Structure, HIV pseudo-type virus inhibitory activity and cytotoxicity associated with indole glyoxamide derivatives

Compd # | R1 | R2 | R3 | R4 | R5 | Log (1/EC50)
1 | H | H | H | H | H | -2.1846
2 | H | H | H | H | F | -0.4133
3 | H | H | H | H | Cl | -0.6335
4 | H | H | H | H | Br | -0.6532
5 | H | H | H | H | OCH3 | 0.284
6 | H | H | H | H | OCH2CH3 | 0.3468
7 | F | H | H | H | H | -2.9234
8 | Cl | H | H | H | H | -2.5966
9 | Br | H | H | H | H | -3.0375
10 | CH3 | H | H | H | H | -3.1899
11 | OCOCH3 | H | H | H | H | -2.9787
12 | H | F | H | H | H | -1.3243
13 | H | Cl | H | H | H | -2.3181
14 | H | OCH3 | H | H | H | -2.5169
15 | H | H | F | H | H | -0.8633
16 | H | H | Cl | H | H | -0.6435
17 | H | H | CH3 | H | H | -1.9518
18 | H | H | CH3CH2 | H | H | -1.3838
19 | H | H | OCH3 | H | H | -0.8195
20 | H | H | OCH2CH3 | H | H | 0.301
21 | H | H | CN | H | H | -0.6902
22 | Br | F | H | H | F | -1.3636
23 | Cl | Cl | H | H | H | -2.699
24 | F | H | Br | H | H | -1.8639
25 | H | H | F | H | F | 0.4559
26 | H | H | OCH3 | H | OCH3 | 0.6383
27 | H | H | Cl | H | OCH3 | 1.1549
28 | H | H | Br | H | OCH3 | 0.8861
29 | H | H | CN | H | OCH3 | 1.2218
30 | H | H | Br | H | OCF3 | -1.6314
31 | H | H | Br | H | F | 0.8861
32 | H | H | CH3 | H | F | 0.2518
33 | H | H | OCH3 | H | F | 1.2218
34 | H | H | OCH2CF3 | H | F | 0.1487
35 | H | H | CN | H | F | -0.2788
36 | H | H | F | H | Br | -0.9138
37 | H | F | H | H | F | 0.3768
38 | F | F | F | H | F | -0.2304
39 | H | H | H | CH3 | H | -2.4232
40 | H | H | H | CH3CH2 | H | -3.1614
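To make the activity scale in Table 1 concrete, the short sketch below converts between EC50 (in nM) and log (1/EC50). It assumes that "log" denotes the base-10 logarithm, which the source does not state explicitly; the function names are illustrative only.

```python
import math

def activity_from_ec50(ec50_nm: float) -> float:
    """log10(1/EC50) for an EC50 given in nM (base 10 is an assumption)."""
    return -math.log10(ec50_nm)

def ec50_from_activity(activity: float) -> float:
    """Inverse transform: EC50 in nM from log10(1/EC50)."""
    return 10.0 ** (-activity)

least_active = -3.1899   # compound 10 in Table 1
most_active = 1.2218     # compound 29 in Table 1

print(ec50_from_activity(least_active))        # EC50 in the micromolar range
print(ec50_from_activity(most_active))         # EC50 well below 1 nM
print(round(most_active - least_active, 4))    # span of the activity scale: 4.4117
```

Under this assumption the least active compound corresponds to an EC50 of roughly 1.5 µM and the most active to roughly 0.06 nM, and the difference of the two tabulated values reproduces the range of 4.4117 quoted in the Results.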


Since an ANN cannot by itself select the most significant descriptors from the pool of calculated molecular descriptors, it was essential to apply a variable selection method. In this work, stepwise multiple linear regression (stepwise-MLR) and genetic algorithm (GA) variable subset selection methods were used to select the most relevant descriptors from the pool of 58 descriptors. The overall performance of all models was evaluated in terms of the root mean square error, calculated from the following equation:

$$\mathrm{RMSE} = \sqrt{\frac{\sum_{i=1}^{n} (y_i - y_o)^2}{n}}$$ … (1)

where y_i is the desired output, y_o is the value predicted by the model, and n is the number of molecules in our data set.

QSAR-1 (MLR-MLR)

Multiple linear regression (MLR) was utilized to establish the first type of QSAR model. Using the minimum RMSE of the TSET as a benchmark, subsets of descriptors were examined to establish the best linear QSAR. The size of the descriptor subset used for model building was increased until no further improvement was seen. After model development with the TSET members, the best model was further examined with the PSET compounds.

QSAR-2 (MLR-ANN)

The best subset of descriptors selected in QSAR-1 was fed into neural networks to develop QSAR-2. The neural networks used in this study were all three-layered, fully connected feed-forward networks. Such networks were expected to identify any non-linear relationship between the structural descriptors and the inhibitory activity of the compounds. The networks were trained on the TSET members with the resilient back-propagation (RBP) algorithm [20]. Each neuron in the network was connected to all neurons in the neighboring layer(s) through adjustable weights. Network training is the process of adjusting these weights so that the error is minimized. The number of input layer neurons was equal to the number of descriptors. There was only one output layer neuron, while the number of hidden layer neurons was a matter of optimization [8].
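A three-layer feed-forward network trained with a resilient back-propagation rule can be sketched in a few lines of NumPy. The sketch below is an illustration under stated assumptions, not the Matlab code used in the study: it uses tanh hidden units, a linear output neuron, the iRprop- variant of the update rule, and hypothetical hyperparameters (`eta_plus`, `eta_minus`, initial step 0.05).

```python
import numpy as np

def init_params(n_in, n_hidden, rng):
    """Random weights for an n_in - n_hidden - 1 network."""
    return {
        "W1": rng.normal(0.0, 0.5, (n_in, n_hidden)),
        "b1": np.zeros(n_hidden),
        "W2": rng.normal(0.0, 0.5, (n_hidden, 1)),
        "b2": np.zeros(1),
    }

def forward(p, X):
    hidden = np.tanh(X @ p["W1"] + p["b1"])             # hidden layer
    return hidden, (hidden @ p["W2"] + p["b2"]).ravel()  # linear output neuron

def gradients(p, X, y):
    """Gradient of half the mean squared error with respect to every weight."""
    hidden, out = forward(p, X)
    d_out = ((out - y) / len(y))[:, None]
    d_hidden = (d_out @ p["W2"].T) * (1.0 - hidden ** 2)
    return {"W1": X.T @ d_hidden, "b1": d_hidden.sum(axis=0),
            "W2": hidden.T @ d_out, "b2": d_out.sum(axis=0)}

def rmse(y, y_hat):
    return float(np.sqrt(np.mean((y - y_hat) ** 2)))

def train_rprop(X, y, n_hidden=4, epochs=300, seed=0,
                eta_plus=1.2, eta_minus=0.5, step_min=1e-6, step_max=1.0):
    """Rprop uses only the sign of each gradient; step sizes adapt per weight."""
    rng = np.random.default_rng(seed)
    p = init_params(X.shape[1], n_hidden, rng)
    step = {k: np.full_like(v, 0.05) for k, v in p.items()}
    prev = {k: np.zeros_like(v) for k, v in p.items()}
    for _ in range(epochs):
        g = gradients(p, X, y)
        for k in p:
            agree = g[k] * prev[k]
            # Same sign as last epoch: grow the step; sign flip: shrink it.
            step[k] = np.where(agree > 0, np.minimum(step[k] * eta_plus, step_max), step[k])
            step[k] = np.where(agree < 0, np.maximum(step[k] * eta_minus, step_min), step[k])
            g_k = np.where(agree < 0, 0.0, g[k])   # iRprop-: skip update after a flip
            p[k] = p[k] - np.sign(g_k) * step[k]
            prev[k] = g_k
    return p

if __name__ == "__main__":
    rng = np.random.default_rng(1)
    X = rng.uniform(-1, 1, (32, 3))                 # 32 toy "training compounds"
    y = np.sin(2 * X[:, 0]) + X[:, 1] ** 2          # hypothetical non-linear target
    trained = train_rprop(X, y, n_hidden=4)
    print("training RMSE:", rmse(y, forward(trained, X)[1]))
```

With seven input descriptors and `n_hidden=4` the same code would give a 7-4-1 layout; the number of hidden neurons is left as a parameter because the text treats it as something to be optimized.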

QSAR-3 (GA-ANN)

In the last QSAR model, GA was utilized for non-linear feature selection. Neural networks with properties similar to those used in the second model were used to calculate the fitness function of the GA. In this model, the 58 descriptors were considered as possible inputs and fed into the input layer of the ANNs in the GA-ANN model. A binary vector of dimension 58 represented each individual in the population. In other words, the defined chromosome contained 58 genes, one gene for each feature, each of which could take 2 values. A value of 0 indicated that the corresponding feature was not selected and a value of 1 indicated that the feature was selected. Therefore, there were 2^58 possible feature subsets in the search space of this investigation. GA selected the best features from these possible feature subsets over successive generations. In each generation, the population was probabilistically modified, generating new chromosomes with a better chance of solving the problem. New characteristics were introduced into a chromosome by crossover and mutation. The probability of survival or reproduction of an individual depends on its fitness to the environment, so the population evolves toward higher fitness [20]. This procedure is shown in Fig. 1.

Fig. 1—Hybrid GA-ANN model

In our study, two-point binary crossover and binary mutation were used as recombination operators. The roulette wheel selection strategy was used in the algorithm for parent selection. The relevant parameter settings were: population size, 40; number of generations, 100; probability of crossover, 0.8; probability of mutation, 0.09 [20,21]. A number of different fitness functions were assessed, and the optimal fitness function to be minimized by the GA was found to be as follows:


$$F = 100 \times \mathrm{RMSE}_{\mathrm{CVSET}} \times \mathrm{RMSE}_{\mathrm{TSET}}$$ … (2)

Each fitness value was obtained in a cross-validation procedure in which 8 compounds were removed from the dataset as the PSET and the other 32 remained as the TSET each time. This was done in such a way that each compound was used four times as a TSET member and once as a PSET member. The average result of the five simulations was reported. We varied the number of hidden units in the ANN part of our hybrid model from two to eight, in order to prevent any dependence on the number of hidden units. All calculations in the present work were carried out in the Matlab environment (V 7.1, The Mathworks, Inc.) with the GA toolbox, on a 2.6 GHz Dual-Core Intel Pentium IV with 2 GB RAM under Windows XP.

Results and Discussion

As mentioned in the previous section, a linear and a non-linear feature selection method (stepwise-MLR and GA) were used to select the most significant descriptors. According to the variable selection method and the feature mapping technique, the models are denoted MLR–MLR, MLR–ANN and GA-ANN. The following MLR equations were generated for each group of compounds:

y1 = −178.488 + 1.694D(1) + 1.99.8D(7) + 145.778D(14) − 9.087D(16) + 51.296D(57) … (3)
y2 = −18.404 + 1.467D(1) − 2.611D(17) + 0.122D(20) + 1.339D(30) − 1.118D(32) + 9.451D(50) … (4)
y3 = −63.117 + 48.524D(14) − 0.257D(19) − 0.63D(28) + 1.47D(30) − 1.375D(32) − 0.802D(33) + 1.952D(36) + 8.962D(50) + 3.497D(54) + 23.426D(57) … (5)
y4 = 15.428 − 560.493D(4) + 86.217D(14) … (6)
y5 = −89.098 + 230.272D(7) − 9.873D(12) + 1.6.107D(14) − 5.049D(18) + 6.185D(50) … (7)

The definitions of the descriptors in the above equations are given in Table 2. According to Eqs (3) to (7), the most important descriptor is MATS6m (D(14)), because it was selected four times by the MLR model [22,23]. With these equations, the RMSE of the predicted activity was 0.99 for the PSET compounds and 0.53 for the TSET. The correlation coefficient (R^2) calculated for the PSET was 0.55. One of the most important advantages of this model was its ability to clarify the weight of each selected significant parameter, which highlighted its usefulness in determining the structure-activity relationship. The descriptors that were selected at least twice by the QSAR-1 model were fed to the neural networks to establish the QSAR-2 model. The optimal network architecture, which resulted in the best TSET and PSET RMS errors, was observed to be 7-4-1. In this QSAR model, the RMSE for predicted activity was

Table 2—The best selected descriptors using MLR model

Index | Descriptor | Definition | Type | No. of selections
D(1) | MAXDP | Maximal electrotopological positive variation | Topological | 2
D(4) | X3A | Average connectivity index chi-3 | Topological | 1
D(7) | PW5 | Path/walk 5 Randic shape index | Topological | 2
D(12) | VEA1 | Eigenvector coefficient sum from adjacency matrix | Topological | 1
D(14) | MATS6m | Moran autocorrelation-lag 6/weighted by atomic masses | 2D Autocorrelation | 4
D(16) | MATS7v | Moran autocorrelation-lag 7/weighted by atomic van der Waals volumes | 2D Autocorrelation | 1
D(17) | GATS5e | Geary autocorrelation-lag 5/weighted by atomic Sanderson electronegativities | 2D Autocorrelation | 1
D(18) | FDI | Folding degree index | Geometrical | 1
D(19) | RDF075u | Radial distribution function-7.5/unweighted | RDF | 1
D(20) | RDF030m | Radial distribution function-3.0/weighted by atomic masses | RDF | 1
D(28) | Mor03u | 3D-MoRSE-signal 03/unweighted | 3D-MoRSE | 1
D(30) | Mor03m | 3D-MoRSE-signal 03/weighted by atomic masses | 3D-MoRSE | 2
D(32) | Mor05m | 3D-MoRSE-signal 05/weighted by atomic masses | 3D-MoRSE | 2
D(33) | Mor09m | 3D-MoRSE-signal 09/weighted by atomic masses | 3D-MoRSE | 1
D(36) | Mor18m | 3D-MoRSE-signal 18/weighted by atomic masses | 3D-MoRSE | 1
D(47) | Mor31e | 3D-MoRSE-signal 31/weighted by atomic Sanderson electronegativities | 3D-MoRSE | 1
D(50) | ISH | Standardized information content on the leverage equality | GETAWAY | 3
D(54) | H4v | H autocorrelation of lag 4/weighted by atomic van der Waals volumes | GETAWAY | 1
D(57) | R5u+ | R maximal autocorrelation of lag 5/unweighted | GETAWAY | 2
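The stepwise-MLR selection behind Table 2 can be illustrated with a forward-selection sketch. The stopping rule is an assumption: the Methodology only says that the subset was enlarged "until no improvement was seen", so the threshold `min_gain` and the cap `max_terms` below are hypothetical, as are the function names.

```python
import numpy as np

def fit_mlr(X, y):
    """Ordinary least squares with an intercept; returns [b0, b1, ...]."""
    A = np.column_stack([np.ones(len(y)), X])
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    return coef

def predict_mlr(coef, X):
    return coef[0] + X @ coef[1:]

def rmse(y, y_hat):
    return float(np.sqrt(np.mean((y - y_hat) ** 2)))

def forward_stepwise(X, y, min_gain=0.01, max_terms=10):
    """Greedily add the descriptor that lowers the training RMSE most.

    Stops when the best candidate improves RMSE by less than min_gain
    (assumed criterion) or when max_terms descriptors are in the model.
    """
    selected = []
    best = rmse(y, np.full_like(y, y.mean()))   # intercept-only baseline
    while len(selected) < max_terms:
        trials = []
        for j in range(X.shape[1]):
            if j in selected:
                continue
            cols = selected + [j]
            coef = fit_mlr(X[:, cols], y)
            trials.append((rmse(y, predict_mlr(coef, X[:, cols])), j))
        if not trials:
            break
        err, j = min(trials)
        if best - err < min_gain:
            break
        selected.append(j)
        best = err
    return selected, best

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    X = rng.normal(size=(32, 10))                       # toy training set
    y = 2.0 * X[:, 3] - 1.5 * X[:, 7] + 0.01 * rng.normal(size=32)
    print(forward_stepwise(X, y))
```

Because the training RMSE can only decrease as terms are added, some explicit threshold is needed to stop the search; a validation-set criterion would be an equally reasonable choice.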


Table 3—Selected parameters with different hidden units using GA and their corresponding fitness values

No. of hidden units | Bit string showing the features selected by GA | Fitness value
2 | 0,0,1,0,0,0,1,1,0,0,0,0,0,1,1,0,0,1,0,0,0,1,0,0,1,0,1,1,1,1,0,1,1,1,0,1,1,0,0,1,1,0,0,0,1,0,0,1,1,1,0,0,0,0,1,0,1,1 | 11.699
3 | 0,0,0,0,0,0,1,0,0,0,0,1,1,0,0,0,1,1,0,0,0,0,0,0,1,0,1,1,1,1,0,1,1,0,0,1,1,1,0,0,0,1,1,0,1,1,1,1,1,1,1,0,0,1,1,0,0,1 | 9.265
4 | 1,0,0,0,1,0,1,1,0,0,0,0,1,1,0,0,1,1,0,0,0,0,0,0,1,0,1,1,1,1,0,1,0,0,0,1,0,0,0,1,1,1,0,0,0,0,0,1,0,1,0,0,0,1,0,0,0,1 | 9.444
5 | 0,0,0,0,0,0,1,0,0,0,0,1,1,0,0,0,1,1,0,0,0,0,0,0,1,0,1,1,1,1,0,1,1,0,0,1,1,1,0,0,0,1,1,0,1,1,1,1,1,1,1,0,0,1,1,0,0,1 | 9.143
6 | 0,0,0,1,0,0,0,0,0,0,1,1,0,1,1,0,0,0,0,1,0,0,0,0,1,0,1,1,1,1,0,1,1,1,0,0,1,1,0,1,1,1,0,0,0,0,0,1,0,1,1,1,0,1,0,1,0,0 | 9.595
7 | 0,0,0,1,1,1,1,1,0,0,0,0,1,0,0,0,1,1,0,0,0,0,0,0,1,0,1,1,1,1,0,1,1,0,0,1,0,0,0,1,0,1,1,0,1,0,0,1,0,0,1,0,0,1,0,1,0,0 | 8.853
8 | 0,0,1,0,1,1,1,0,1,0,0,0,0,0,1,0,1,0,0,1,0,0,0,0,1,0,1,1,1,1,0,0,1,0,0,1,1,1,0,0,1,1,1,0,1,1,0,1,0,0,0,0,0,0,0,0,1,0 | 9.178
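Bit strings such as those in Table 3 are the output of the GA search described in the Methodology. The sketch below shows one way the stated ingredients (binary chromosomes, roulette-wheel parent selection, two-point crossover, bit mutation, and the fitness of Eq. 2 evaluated over five disjoint folds) could fit together. Several parts are assumptions for illustration: a least-squares model stands in for the ANN, roulette weights are taken as inverse fitness, the mutation probability is applied per bit, and the best individual is carried over unchanged.

```python
import numpy as np

def rmse(y, y_hat):
    return float(np.sqrt(np.mean((y - y_hat) ** 2)))

def fitness(mask, X, y, folds):
    """Eq. (2)-style fitness, 100 * RMSE_CV * RMSE_train (lower is better).

    Assumption: a least-squares model replaces the ANN to keep this short.
    """
    if not mask.any():
        return np.inf
    A = np.column_stack([np.ones(len(y)), X[:, mask]])
    cv_err, tr_err = [], []
    for held in folds:                      # each compound held out exactly once
        train = np.setdiff1d(np.arange(len(y)), held)
        coef, *_ = np.linalg.lstsq(A[train], y[train], rcond=None)
        cv_err.append(rmse(y[held], A[held] @ coef))
        tr_err.append(rmse(y[train], A[train] @ coef))
    return 100.0 * np.mean(cv_err) * np.mean(tr_err)

def roulette(fit, rng):
    """Roulette-wheel pick; inverse fitness as weight (minimisation)."""
    w = 1.0 / (np.asarray(fit, dtype=float) + 1e-12)
    return rng.choice(len(fit), p=w / w.sum())

def two_point_crossover(a, b, rng):
    i, j = sorted(rng.choice(np.arange(1, len(a)), size=2, replace=False))
    child = a.copy()
    child[i:j] = b[i:j]                     # swap in the middle segment
    return child

def mutate(chrom, rng, p_mut):
    flip = rng.random(len(chrom)) < p_mut   # per-bit flip (assumption)
    return np.where(flip, ~chrom, chrom)

def run_ga(X, y, pop_size=40, generations=100, p_cross=0.8, p_mut=0.09, seed=0):
    rng = np.random.default_rng(seed)
    folds = np.array_split(rng.permutation(len(y)), 5)
    pop = rng.random((pop_size, X.shape[1])) < 0.5
    best_mask, best_fit = None, np.inf
    for _ in range(generations):
        fit = [fitness(ind, X, y, folds) for ind in pop]
        k = int(np.argmin(fit))
        if fit[k] < best_fit:
            best_fit, best_mask = fit[k], pop[k].copy()
        new_pop = [best_mask.copy()]        # elitism (assumption)
        while len(new_pop) < pop_size:
            a, b = pop[roulette(fit, rng)], pop[roulette(fit, rng)]
            child = two_point_crossover(a, b, rng) if rng.random() < p_cross else a.copy()
            new_pop.append(mutate(child, rng, p_mut))
        pop = np.array(new_pop)
    return best_mask, best_fit

if __name__ == "__main__":
    rng = np.random.default_rng(3)
    X = rng.normal(size=(40, 12))           # 40 toy compounds, 12 toy descriptors
    y = 1.5 * X[:, 2] - 2.0 * X[:, 5] + 0.05 * rng.normal(size=40)
    mask, fit = run_ga(X, y, pop_size=20, generations=30)
    print(mask.astype(int), fit)
```

With 40 compounds, `np.array_split` yields five folds of 8, so each compound serves four times in training and once in the held-out set, mirroring the scheme described in the text. Note that a linear stand-in with many selected descriptors can fit the training fold almost perfectly and drive the product toward zero, which is one reason the toy example uses only 12 candidate descriptors.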

Table 4—Statistical parameters of different QSAR models (N = 40)

QSAR model | Training set RMSE | Training set R^2 | Prediction set RMSE | Prediction set R^2
MLR-MLR | 0.53 | 0.85 | 0.99 | 0.55
MLR-ANN | 0.46 | 0.95 | 0.91 | 0.58
GA-ANN | 0.23 | 0.96 | 0.67 | 0.75

Table 5—The best selected descriptors by GA-ANN hybrid model

Descriptor | Definition
PW5 | Path/Walk 5 (topological)-Randic shape index
RDF055v | Radial distribution function-5.5/weighted by atomic van der Waals volumes
RDF035p | Radial distribution function-3.5/weighted by atomic polarizabilities
Mor03u | 3D-MoRSE-signal 03/unweighted
Mor17u | 3D-MoRSE-signal 17/unweighted
Mor03m | 3D-MoRSE-signal 03/weighted by atomic masses
Mor05m | 3D-MoRSE-signal 05/weighted by atomic masses
Mor09m | 3D-MoRSE-signal 09/weighted by atomic masses
Mor18m | 3D-MoRSE-signal 18/weighted by atomic masses
Mor18v | 3D-MoRSE-signal 18/weighted by atomic van der Waals volumes
E2m | 2nd component accessibility directional WHIM index/weighted by atomic masses

Fig. 2—The relative importance of descriptors using the results of Table 2
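The selection frequencies can be recomputed directly from the bit strings of Table 3. The sketch below counts, for each of the 58 positions, how many of the seven runs (2 to 8 hidden units) selected it, and lists the positions chosen at least six times, which is the threshold the text gives for the final GA-ANN model. The variable names are illustrative; the bit strings are copied from Table 3.

```python
# Bit strings from Table 3, keyed by number of hidden units.
TABLE3 = {
    2: "0,0,1,0,0,0,1,1,0,0,0,0,0,1,1,0,0,1,0,0,0,1,0,0,1,0,1,1,1,1,0,1,1,1,0,1,1,0,0,1,1,0,0,0,1,0,0,1,1,1,0,0,0,0,1,0,1,1",
    3: "0,0,0,0,0,0,1,0,0,0,0,1,1,0,0,0,1,1,0,0,0,0,0,0,1,0,1,1,1,1,0,1,1,0,0,1,1,1,0,0,0,1,1,0,1,1,1,1,1,1,1,0,0,1,1,0,0,1",
    4: "1,0,0,0,1,0,1,1,0,0,0,0,1,1,0,0,1,1,0,0,0,0,0,0,1,0,1,1,1,1,0,1,0,0,0,1,0,0,0,1,1,1,0,0,0,0,0,1,0,1,0,0,0,1,0,0,0,1",
    5: "0,0,0,0,0,0,1,0,0,0,0,1,1,0,0,0,1,1,0,0,0,0,0,0,1,0,1,1,1,1,0,1,1,0,0,1,1,1,0,0,0,1,1,0,1,1,1,1,1,1,1,0,0,1,1,0,0,1",
    6: "0,0,0,1,0,0,0,0,0,0,1,1,0,1,1,0,0,0,0,1,0,0,0,0,1,0,1,1,1,1,0,1,1,1,0,0,1,1,0,1,1,1,0,0,0,0,0,1,0,1,1,1,0,1,0,1,0,0",
    7: "0,0,0,1,1,1,1,1,0,0,0,0,1,0,0,0,1,1,0,0,0,0,0,0,1,0,1,1,1,1,0,1,1,0,0,1,0,0,0,1,0,1,1,0,1,0,0,1,0,0,1,0,0,1,0,1,0,0",
    8: "0,0,1,0,1,1,1,0,1,0,0,0,0,0,1,0,1,0,0,1,0,0,0,0,1,0,1,1,1,1,0,0,1,0,0,1,1,1,0,0,1,1,1,0,1,1,0,1,0,0,0,0,0,0,0,0,1,0",
}

def selection_counts(bitstrings):
    """Number of runs in which each of the 58 descriptor positions was selected."""
    rows = [[int(bit) for bit in s.split(",")] for s in bitstrings]
    assert all(len(row) == 58 for row in rows)
    return [sum(column) for column in zip(*rows)]

counts = selection_counts(TABLE3.values())

# 1-based positions selected in at least six of the seven runs.
chosen = [pos for pos, c in enumerate(counts, start=1) if c >= 6]
print(chosen)       # [7, 25, 27, 28, 29, 30, 32, 33, 36, 42, 48]
print(len(chosen))  # 11
```

Eleven positions pass the threshold, which agrees with the 11 descriptors listed in Table 5. Where Table 2 gives an index, the names also agree: positions 7, 28, 30, 32, 33 and 36 correspond to PW5, Mor03u, Mor03m, Mor05m, Mor09m and Mor18m.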

found to be 0.91 for the PSET compounds and 0.46 for the TSET. The correlation coefficient (R^2) of this model for the PSET was 0.58, better than that of the previous model. To establish QSAR-3, the 58 descriptors were fed to the GA-ANN model to select the best descriptors. After a hundred generations, the parameters that minimized the value of the fitness function were selected by the GA. The selected parameters for different numbers of hidden units and their corresponding fitness values are given in Table 3. As can be seen in Table 3, some descriptors were selected in almost all simulations. The selection rates of each descriptor in simulations with different numbers of hidden units are shown in Fig. 2. The statistical parameters of all QSAR models are presented in Table 4. As can be seen from this table, the statistical parameters of the GA–ANN model were better than those of the other models; therefore, we explain only the descriptors used in this model. The cross-validation method (leave-8-out) was utilized to obtain the results in this table. We chose 11 descriptors from the parameters selected by the GA-ANN model with different numbers of hidden units: the descriptors that were selected at least six times by the model were employed to build the final model. These parameters are listed in Table 5.

The Jack-Knife method was utilized to calculate the results of the GA-ANN model with 11 input parameters. We used 3 neurons in the hidden layer of the GA-ANN model, because the ANN with the simpler structure gave better results in the Jack-Knife test. The obtained RMSE and R^2 were 0.73 and 0.75, respectively. The values of the descriptors selected by the GA-ANN model are presented in Table 6. The observed and predicted values of log (1/EC50) using the Jack-Knife method are reported in Table 7, and the plot of observed versus predicted log (1/EC50) values is depicted in Fig. 3.

The results of the three QSAR models indicated that the non-linear feature selection model performed better than its linear counterparts. The GA-ANN model gave the best results, with better predictive ability than the other models. The relatively high RMS errors of the models have two causes. First, by its nature, RMSE is highly dependent on the range of the dependent variable [8]. The range of log


Table 6—Descriptor values for GA-ANN model

PW5 | RDF055v | RDF035p | Mor03u | Mor17u | Mor03m | Mor05m | Mor09m | Mor18m | Mor18v | E2m
0.096 | 4.975 | 7.094 | -4.126 | -1.138 | -3.413 | -5.479 | -2.167 | -0.953 | -1.013 | 0.257
0.098 | 4.273 | 6.983 | -4.178 | -0.752 | -3.123 | -6.177 | -2.647 | -0.894 | -1.029 | 0.288
0.098 | 6.952 | 7.813 | -4.099 | -0.765 | -3.117 | -5.917 | -2.157 | -1.23 | -1.16 | 0.305
0.098 | 7.694 | 7.708 | -4.024 | -0.769 | -2.931 | -5.6 | -2.119 | -0.881 | -1.054 | 0.371
0.102 | 6.187 | 7.818 | -3.942 | -0.815 | -2.822 | -6.546 | -2.812 | -1.131 | -1.189 | 0.207
0.104 | 9.06 | 9.342 | -5.626 | -1.376 | -3.325 | -6.624 | -2.658 | -1.006 | -1.152 | 0.172
0.099 | 4.929 | 7.049 | -4.27 | -1.086 | -3.642 | -5.698 | -2.375 | -0.937 | -0.987 | 0.337
0.099 | 5.081 | 7.005 | -4.225 | -1.038 | -4.158 | -6.209 | -1.907 | -0.891 | -0.977 | 0.405
0.099 | 6.024 | 6.785 | -4.205 | -1.077 | -5.033 | -6.656 | -2.07 | -0.739 | -0.953 | 0.404
0.099 | 4.946 | 7.316 | -4.157 | -1.284 | -3.594 | -5.783 | -2.14 | -0.983 | -1.035 | 0.22
0.095 | 7.178 | 9.191 | -3.986 | -1.291 | -3.866 | -6.51 | -2.26 | -1.043 | -1.086 | 0.342
0.097 | 5.206 | 6.775 | -4.288 | -1.018 | -3.404 | -5.541 | -2.151 | -0.947 | -0.981 | 0.259
0.097 | 5.285 | 6.888 | -4.327 | -1.011 | -3.811 | -6.009 | -2.267 | -0.983 | -0.999 | 0.254
0.097 | 4.903 | 6.965 | -4.134 | -0.961 | -3.261 | -5.829 | -2.26 | -1.131 | -1.139 | 0.24
0.096 | 5.092 | 6.639 | -4.327 | -1.008 | -3.024 | -5.783 | -2.577 | -1.008 | -1.013 | 0.241
0.096 | 4.782 | 6.634 | -4.414 | -1.019 | -3.081 | -5.986 | -1.702 | -0.926 | -0.998 | 0.227
0.096 | 5.599 | 7.102 | -4.421 | -1.326 | -3.103 | -5.779 | -2.181 | -0.96 | -1.036 | 0.259
0.098 | 6.551 | 8.868 | -5.123 | -1.431 | -2.977 | -5.708 | -2.176 | -0.94 | -1.029 | 0.258
0.098 | 3.844 | 5.106 | -4.851 | -0.968 | -3.169 | -6.345 | -2.854 | -1.252 | -1.269 | 0.241
0.101 | 5.933 | 6.952 | -5.524 | -1.196 | -3.085 | -6.13 | -2.716 | -1.171 | -1.236 | 0.231
0.098 | 4.085 | 6.203 | -3.935 | -1.07 | -2.706 | -6.01 | -2.515 | -1.054 | -1.192 | 0.247
0.098 | 6.526 | 4.813 | -4.526 | -0.578 | -8.015 | -7.752 | -3.761 | -0.709 | -1.059 | 0.359
0.098 | 3.869 | 4.79 | -4.421 | -0.909 | -4.795 | -6.344 | -1.776 | -1.152 | -1.126 | 0.346
0.098 | 4.55 | 4.894 | -4.441 | -1.061 | -3.879 | -5.973 | -2.6 | -1.178 | -1.144 | 0.329
0.101 | 5.032 | 4.693 | -4.381 | -0.722 | -3.492 | -6.345 | -2.595 | -1.276 | -1.164 | 0.326
0.106 | 5.297 | 5.035 | -4.435 | -0.796 | -3.36 | -6.743 | -2.612 | -1.289 | -1.274 | 0.176
0.104 | 4.897 | 4.971 | -4.359 | -0.641 | -3.025 | -6.751 | -2.159 | -1.211 | -1.194 | 0.188
0.104 | 5.019 | 4.922 | -4.364 | -0.663 | -2.529 | -6.961 | -2.305 | -1.62 | -1.241 | 0.162
0.106 | 5.569 | 6.008 | -4.026 | -0.601 | -2.876 | -6.451 | -2.468 | -1.209 | -1.237 | 0.191
0.1 | 6.309 | 6.134 | -5.217 | -0.993 | -4.306 | -6.404 | -3.83 | -1.629 | -1.247 | 0.5
0.101 | 10.323 | 8.042 | -3.698 | -0.942 | -2.437 | -6.075 | -2.386 | -1.43 | -1.3 | 0.288
0.101 | 4.487 | 7.028 | -4.277 | -0.949 | -2.962 | -6.505 | -2.627 | -0.932 | -1.065 | 0.291
0.102 | 4.491 | 6.915 | -4.664 | -0.725 | -3.174 | -6.683 | -2.871 | -0.988 | -1.116 | 0.278
0.096 | 10.299 | 8.205 | -5.448 | -0.906 | -4.99 | -6.711 | -3.037 | -1.351 | -1.36 | 0.323
0.102 | 5.412 | 5.861 | -4.085 | -0.715 | -3.179 | -6.218 | -2.369 | -1.217 | -1.22 | 0.346
0.101 | 13.713 | 18.073 | -3.308 | 0.132 | -2.423 | -3.716 | -3.642 | -0.836 | -0.803 | 0.251
0.099 | 10.429 | 8.435 | -3.689 | -0.821 | -2.441 | -5.86 | -2.359 | -1.268 | -1.246 | 0.489
0.099 | 5.213 | 6.653 | -4.584 | -0.404 | -4.292 | -7.242 | -3.047 | -1.055 | -1.016 | 0.335
0.101 | 5.809 | 7.419 | -4.413 | -1.158 | -3.662 | -5.537 | -2.255 | -0.97 | -1.054 | 0.236
0.105 | 7.168 | 7.991 | -4.672 | -1.454 | -3.704 | -5.604 | -2.214 | -0.924 | -1.037 | 0.252

Fig. 3—Plot of observed vs. Jack-Knife predicted log (1/EC50) inhibitory activity

(1/EC50) in our dataset was 4.4117, and the best GA-ANN model gave a PSET RMSE of 0.67, which was still good enough to make this model trustworthy for future predictions. The second reason was possible errors in the experimental data used in this study.

Since the chemical variation among the considered compounds was low, it is very important to select chemical descriptors that can encode the small variations between the structures of the molecules in the dataset. MoRSE descriptors are very informative 3D descriptors that can encode structural features of molecules, and they were included in the GA–ANN model. 3D-MoRSE descriptors are based on the idea of obtaining information from the 3D


Table 7—Observed and predicted values of log (1/EC50) using Jack-Knife method

Compound | Observed | Predicted | Residual
1 | -2.185 | -2.186 | 0.002
2 | -0.413 | -0.198 | -0.216
3 | -0.634 | -0.707 | 0.074
4 | -0.653 | 0.454 | -1.107
5 | 0.284 | 0.422 | -0.138
6 | 0.347 | -0.007 | 0.353
7 | -2.923 | -2.455 | -0.469
8 | -2.597 | -3.586 | 0.99
9 | -3.038 | -2.458 | -0.579
10 | -3.19 | -2.546 | -0.644
11 | -2.979 | -2.612 | -0.367
12 | -1.324 | -2.272 | 0.948
13 | -2.318 | -1.895 | -0.423
14 | -2.517 | -1.396 | -1.121
15 | -0.863 | -0.407 | -0.456
16 | -0.644 | -1.892 | 1.248
17 | -1.952 | -1.459 | -0.493
18 | -1.384 | -1.418 | 0.035
19 | -0.82 | 0.661 | -1.481
20 | 0.301 | -0.12 | 0.421
21 | -0.69 | -1.684 | 0.994
22 | -1.364 | -2.544 | 1.18
23 | -2.699 | -3.603 | 0.904
24 | -1.864 | -1.612 | -0.252
25 | 0.456 | -0.244 | 0.7
26 | 0.638 | 0.425 | 0.213
27 | 1.155 | 1.535 | -0.38
28 | 0.886 | 1.492 | -0.606
29 | 1.222 | 1.132 | 0.09
30 | -1.631 | -1.325 | -0.306
31 | 0.886 | 0.154 | 0.732
32 | 0.252 | -0.431 | 0.683
33 | 1.222 | 0.309 | 0.913
34 | 0.149 | -1.03 | 1.179
35 | -0.279 | -0.194 | -0.084
36 | -0.914 | 0.224 | -1.138
37 | 0.377 | -0.286 | 0.663
38 | -0.23 | 0.329 | -0.56
39 | -2.423 | -2.331 | -0.092
40 | -3.161 | -1.947 | -1.215
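As a check on Eq. (1), the RMSE of the Jack-Knife predictions can be recomputed from the observed and predicted columns of Table 7. The sketch below is a plain-Python illustration; the list and function names are chosen for the example.

```python
import math

# Observed and Jack-Knife predicted log(1/EC50), compounds 1-40 (Table 7).
observed = [
    -2.185, -0.413, -0.634, -0.653, 0.284, 0.347, -2.923, -2.597, -3.038, -3.19,
    -2.979, -1.324, -2.318, -2.517, -0.863, -0.644, -1.952, -1.384, -0.82, 0.301,
    -0.69, -1.364, -2.699, -1.864, 0.456, 0.638, 1.155, 0.886, 1.222, -1.631,
    0.886, 0.252, 1.222, 0.149, -0.279, -0.914, 0.377, -0.23, -2.423, -3.161,
]
predicted = [
    -2.186, -0.198, -0.707, 0.454, 0.422, -0.007, -2.455, -3.586, -2.458, -2.546,
    -2.612, -2.272, -1.895, -1.396, -0.407, -1.892, -1.459, -1.418, 0.661, -0.12,
    -1.684, -2.544, -3.603, -1.612, -0.244, 0.425, 1.535, 1.492, 1.132, -1.325,
    0.154, -0.431, 0.309, -1.03, -0.194, 0.224, -0.286, 0.329, -2.331, -1.947,
]

def rmse(obs, pred):
    """Eq. (1): square root of the mean squared difference."""
    n = len(obs)
    return math.sqrt(sum((o - p) ** 2 for o, p in zip(obs, pred)) / n)

print(round(rmse(observed, predicted), 2))  # 0.73
```

The result rounds to 0.73, in agreement with the Jack-Knife RMSE reported in the text.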

atomic coordinates by the transform used in electron diffraction studies [24]. These descriptors are calculated by summing atom weights viewed by a different angular scattering function. The 3D-MoRSE descriptors have shown great potential for the representation of molecular structures and have several merits [24]: (a) the number of values is independent of the size of the molecule, which allows the study of data sets of great structural variety; (b) the number of these values can be changed, so that the resolution in the representation of a molecular structure can be scaled.

Conclusion

Linear and non-linear feature selection methods were used to select the most significant descriptors and to construct QSAR models on indole substitution patterns as inhibitors of HIV-1 attachment. Although a large number of non-linear and hybrid models could be employed to establish QSAR models, the GA-ANN model was one of the best of them. In spite of its time-consuming training process, the obtained model could discover complex and non-linear relations between the dependent and independent variables, reflecting the complicated relations between the structure and activity of the compounds. These results also indicated that RDF, topological, 3D-MoRSE and WHIM descriptors were more significant than other descriptors in building this QSAR model and predicting the biological activity of indole substitution patterns. The 3D-MoRSE descriptors played an important role in predicting the log (1/EC50) of the compounds. As can be seen from Tables 2 and 5, a large number of the descriptors selected by the linear and non-linear feature selection methods belonged to this group. The study demonstrated the efficiency of using statistical and machine learning techniques as a pre-processor in determining effective parameters. It also revealed the significance of 3D-MoRSE descriptors in predicting the inhibitory activity of indole glyoxamide derivatives as inhibitors of HIV-1 attachment. This method has the potential to greatly assist medicinal chemists in the design of lead compounds for inhibiting HIV-1 entry into target cells.

Acknowledgement

This work was financially supported by Islamic Azad University-Lahijan Branch and the authors are thankful to this university for its continuous assistance.
References

1 Kuritzkes D R (2009) Curr Opin HIV AIDS 4, 82–87
2 Tilton J C & Doms R W (2010) Antiviral Res 85, 91–100
3 Meanwell N A, Wallace O B, Fang H, Wang H, Deshpande M, Wang T, Yin Z, Zhang Z, Pearce B C, James J, Yeung K S, Qiu Z, Kim Wright J J, Yang Z, Zadjura L, Tweedie D L, Yeola S, Zhao F, Ranadive S, Robinson B A, Gong Y F, Wang H G, Spicer T P, Blair W S, Shi P Y, Colonno R J & Lin P F (2009) Bioorg Med Chem Lett 19, 1977-1981
4 Guo Q, Ho H T, Dicker I, Fan L, Zhou N, Friborg J, Wang T, McAuliffe B V, Wang H G, Rose R E, Fang H, Scarnati H T, Langley D R, Meanwell N A, Abraham R, Colonno R J & Lin P F (2003) J Virol 77, 10528-10536
5 Ho H T, Fan L, Nowicka-Sans B, McAuliffe B, Li C B, Yamanaka G, Zhou N, Fang H, Dicker I, Dalterio R, Gong Y F, Wang T, Yin Z, Ueda Y, Matiskella J, Kadow J, Clapham P, Robinson J, Colonno R & Lin P F (2006) J Virol 80, 4017-4025
6 Lin P F, Blair W, Wang T, Spicer T, Guo Q, Zhou N, Gong Y F, Wang H G, Rose R, Yamanaka G, Robinson B, Li C B, Fridell R, Deminie C, Demers G, Yang Z, Zadjura L, Meanwell N & Colonno R (2003) Proc Natl Acad Sci (USA) 100, 11013-11018
7 Hanna G J, Lalezari J, Hellinger J A, Wohl D A, Nettles R, Persson A, Krystal M, Lin P, Colonno R & Grasela D M (2011) Antimicrob Agents Chemother 55, 722-728
8 Sadat Hayatshahi S H, Abdolmaleki P, Ghiasi M & Safarian S (2007) FEBS Lett 581, 506-514
9 Fabry-Asztalos L, Andonie R, Collar C J, Abdul-Wahid S & Salim N (2008) Bioorg Med Chem 16, 2903-2911
10 Frisch M J, Trucks G W, Schlegel H B, Scuseria G E, Robb M A, Cheeseman J R, Zakrzewski V G, Montgomery J A, Stratmann R E, Burant J C, Dapprich S, Millam J M, Daniels A D, Kudin K N, Strain M C, Farkas O, Tomasi J, Barone V, Cossi M, Cammi R, Mennucci B, Pomelli C, Adamo C, Clifford S, Ochterski J, Petersson G A, Ayala P Y, Cui Q, Morokuma K, Malick D K, Rabuck A D, Raghavachari K, Foresman J B, Cioslowski J, Ortiz J V, Baboul A G, Stefanov B B, Liu G, Liashenko A, Piskorz P, Komaromi I, Gomperts R, Martin R L, Fox D J, Keith T, Al-Laham M A, Peng C Y, Nanayakkara A, Challacombe M, Gill P M W, Johnson B, Chen W, Wong M W, Andres J L, Gonzalez C, Head-Gordon M, Replogle E S & Pople J A (1998) Gaussian 98 (Revision A.9), Gaussian, Inc., Pittsburgh, PA, USA
11 De Melo E B & Ferreira M M (2009) Eur J Med Chem 44, 3577-3583
12 Schuur J H, Selzer P & Gasteiger J (1996) J Chem Inf Comput Sci 36, 334-344
13 Todeschini R & Consonni V (2000) Handbook of Molecular Descriptors, Wiley-VCH
14 Hemmer M C, Steinhauer V & Gasteiger J (1999) Vibr Spectrosc 19, 151-164
15 Consonni V, Todeschini R & Pavan M (2002) J Chem Inf Comput Sci 42, 682-692
16 Consonni V, Todeschini R, Pavan M & Gramatica P (2002) J Chem Inf Comput Sci 42, 693-705
17 Gramatica P, Consonni V & Todeschini R (1999) Chemosphere 38, 1371-1378
18 Gramatica P, Corradi M & Consonni V (2000) Chemosphere 41, 763-777
19 Fatemi M H & Gharaghani S (2007) Bioorg Med Chem 15, 7746-7754
20 Nirouei M, Abdolmaleki P, Tavakoli A & Gity M (2008) Proceedings of the 2nd International Conference on Electrical Engineering Design and Technology, Hammamat, Tunisia
21 Zhang P, Verma B & Kumar K (2005) Pattern Recognit Lett 26, 909-919
22 Sadat Hayatshahi S H, Abdolmaleki P, Safarian S & Khajeh K (2005) Biochem Biophys Res Commun 338, 1137-1142
23 Weekes D & Fogel G B (2003) BioSystems 72, 149-158
24 Cheng Z, Zhang Y, Zhou C, Zhang W & Gao S (2010) Int J Digital Content Technol Appl 2, 109-121