Cardiovascular & Hematological Agents in Medicinal Chemistry, 2007, Vol. 5, No. 1, 11-19
Computer Prediction of Cardiovascular and Hematological Agents by Statistical Learning Methods X. Chen1,2, H. Li1, C.W. Yap1, C.Y. Ung1,3, L. Jiang1, Z.W. Cao4, Y.X. Li4 and Y.Z. Chen1,4,* 1 Bioinformatics and Drug Design Group, Department of Pharmacy and Department of Computational Science, National University of Singapore, Blk S16, Level 8, 3 Science Drive 2, Singapore 117543; 2College of life sciences, Zhejiang University, No.368 Zijinghua Road, Hangzhou, Zhejiang, P. R. China 310058; 3Department of Biochemistry, The Yong Loo Lin School of Medicine, National University of Singapore, Blk MD7, #02-03, 8 Medical Drive, Singapore, 117597; 4 Shanghai Center for Bioinformation Technology, Shanghai, P. R. China 201203
Abstract: Computational methods have been explored for predicting agents that produce therapeutic or adverse effects in cardiovascular and hematological systems. The quantitative structure-activity relationship (QSAR) method was the first statistical learning method successfully used for predicting various classes of cardiovascular and hematological agents. In recent years, more sophisticated statistical learning methods have been explored for predicting cardiovascular and hematological agents, particularly those of diverse structures that might not be straightforwardly modelled by single QSAR models. These methods include partial least squares, multiple linear regression, linear discriminant analysis, k-nearest neighbour, artificial neural networks and support vector machines. Their application potential has been demonstrated in the prediction of various classes of cardiovascular and hematological agents, including 1,4-dihydropyridine calcium channel antagonists, angiotensin converting enzyme inhibitors, thrombin inhibitors, AchE inhibitors, HERG potassium channel inhibitors and blockers, potassium channel openers, platelet aggregation inhibitors, protein kinase inhibitors, dopamine antagonists and torsade de pointes causing agents. This article reviews the strategies, current progress and problems in using statistical learning methods for predicting cardiovascular and hematological agents. It also evaluates algorithms for properly representing and extracting the structural and physicochemical properties of compounds relevant to the prediction of cardiovascular and hematological agents.
Key Words: Statistical learning methods, cardiovascular agents, haematological agents, pharmacodynamic, pharmacokinetic, QSAR.
INTRODUCTION
Cardiovascular diseases are the main causes of morbidity and mortality in the world [1]. Drug-induced hematological reactions often lead to type II, III and IV hemolytic anemia, hypersensitivity, agranulocytosis, thrombocytopenia and aplastic anemia [2]. Identification of agents that produce therapeutic or adverse effects in cardiovascular and hematological systems is important for designing new drugs and for detecting potentially harmful agents. Efforts have been made to explore computational methods for predicting various classes of cardiovascular and hematological agents [3-8]. In particular, statistical learning methods have shown promising potential for performing these tasks [7-10] as well as for predicting agents of other pharmaceutical applications, toxicological properties, and pharmacokinetic profiles [11]. More sophisticated statistical learning methods have been introduced to complement conventional quantitative structure-activity relationship (QSAR) methods [4, 5, 12] and to cover more diverse ranges of cardiovascular and haematological agents [7-10]. In contrast to QSAR methods, these statistical learning methods derive implicit statistical models capable of describing multiple mechanisms and non-linear relationships between chemical structures and a particular activity, which can be used for predicting new agents having the same activity [13-15]. Regression methods can be incorporated into these statistical learning methods for deriving the activity level of these agents [16-18].
*Address correspondence to this author at Bioinformatics and Drug Design Group, Department of Pharmacy and Department of Computational Science, National University of Singapore, Blk S16, Level 8, 3 Science Drive 2, Singapore 117543; Tel: 65-6516-6877; Fax: 65-6774-6756; E-mail: [email protected]
1871-5257/07 $50.00+.00 © 2007 Bentham Science Publishers Ltd.
MOLECULAR DESCRIPTORS FOR REPRESENTING CHEMICAL AGENTS
Successful application of statistical learning methods depends on proper representation of the structural and physicochemical features of chemical agents [19, 20]. Over 3,700 molecular descriptors, computed from the 1D, 2D or 3D structure of an agent, have been developed to quantitatively represent different structural and physicochemical features [19, 21-25]. These descriptors range from constitutional descriptors such as molecular weight to more complex 2D and 3D descriptors representing different geometric, connectivity, and physicochemical properties. These molecular descriptors can be computed by using popular computer programs and descriptor sets such as DRAGON [23], Molconn-Z [22], JOELib [24], the Xue descriptor set [19], and MODEL (http://jing.cz3.nus.edu.sg/cgi-bin/model/model.cgi). These descriptors can be divided into 18 classes, some of which contain members overlapping with members of other classes. Examples of typical descriptor classes are constitutional descriptors that include molecular weight and number
of hydrogen bond donors, geometrical descriptors that include volume and surface areas, and topological descriptors such as the number of rings and rotatable bonds. Many descriptor classes contain descriptors of mixed properties that are frequently used collectively in various QSAR and QSPR models. RDF descriptors represent inter-atomic distances in the entire molecule and other useful information such as bond distances, ring types, planar and non-planar systems, atom types and molecular weight [26]. 3D-MoRSE descriptors describe features such as molecular weight, van der Waals volume, electronegativities and polarizabilities [28]. BCUT descriptors represent connectivity information and atomic properties relevant to intermolecular interaction [29]. WHIM descriptors describe the size, shape, symmetry, atom distribution and polarizability of a molecule [30]. Other useful descriptor classes are molecular walk counts [27], Galvez topological charge indices and charge descriptors [31], GETAWAY descriptors [32], 2D autocorrelations, functional groups, atom-centred descriptors, aromaticity indices [33], Randic molecular profiles [34], electrotopological state descriptors [35], and linear solvation energy relationship descriptors [36].
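As a minimal illustration of the simplest descriptor class mentioned above, the following Python sketch derives a few constitutional descriptors (molecular weight and atom counts) from a molecular formula. The function names, the small atomic-weight table and the example formula (C17H18N2O6, the formula of nifedipine, a 1,4-dihydropyridine) are assumptions chosen for illustration only; the programs cited above compute far richer descriptor sets from full 2D or 3D structures.

```python
import re

# Rounded standard atomic weights (g/mol) for a few common elements.
# This table is an illustrative assumption, not taken from the reviewed work.
ATOMIC_WEIGHT = {"C": 12.011, "H": 1.008, "N": 14.007, "O": 15.999,
                 "S": 32.06, "F": 18.998, "Cl": 35.45}

def parse_formula(formula):
    """Return {element: count} for a simple formula such as 'C17H18N2O6'."""
    counts = {}
    for element, number in re.findall(r"([A-Z][a-z]?)(\d*)", formula):
        counts[element] = counts.get(element, 0) + (int(number) if number else 1)
    return counts

def constitutional_descriptors(formula):
    """Compute a few formula-based (constitutional) descriptors."""
    counts = parse_formula(formula)
    return {
        "molecular_weight": sum(ATOMIC_WEIGHT[e] * n for e, n in counts.items()),
        "n_atoms": sum(counts.values()),
        "n_heavy_atoms": sum(n for e, n in counts.items() if e != "H"),
        "n_heteroatoms": sum(n for e, n in counts.items() if e not in ("C", "H")),
    }

desc = constitutional_descriptors("C17H18N2O6")
print(desc)
```

Descriptors such as the number of hydrogen bond donors cannot be obtained from a formula alone, because they require the connectivity of the molecule.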
FEATURE SELECTION METHODS
Normally, only a fraction of these descriptors is needed for representing the features of a particular class of agents. Features useful for agents of a particular activity can be selected either by intuition, as in conventional QSAR studies, or by using feature selection methods. The commonly used feature selection methods include recursive feature elimination (RFE) [37], genetic algorithm-based approaches (GA) [38], and simulated annealing-based approaches (SA) [39]. Some of these methods, particularly RFE, have gained popularity due to their effectiveness in discovering informative features in the analysis of drug activity [37, 40] and of pharmacokinetic and toxicological properties [19, 20, 41-44].
These feature selection methods are primarily based on the following strategy. First, a statistical learning model is generated by using either all or a few of a selected set of descriptors as the starting set. This model is then used to rank the contribution of these descriptors. For the all-descriptors starting set, descriptors contributing the least to a studied property are eliminated. For the few-descriptors starting set, those contributing the most to a studied property are retained and the rest are eliminated. In the next step, a new machine learning model is constructed by using either the reduced set of descriptors (for the all-descriptors starting set) or the retained set of descriptors plus newly added descriptors (for the few-descriptors starting set). This new model is subsequently used to rank and then eliminate or add descriptors. The iteration continues until all of the irrelevant descriptors are eliminated or all of the relevant descriptors are added.
The ranking of descriptors can be illustrated by using the RFE method as an example. Descriptor ranking in RFE is based on the magnitude of the change of an objective function of a statistical learning model upon removing each descriptor (which roughly measures the extent of the contribution of each feature to the prediction capability of the model) [45]. The prediction capability of a statistical learning model is more significantly affected by a greater change in the objective function, and thus the corresponding descriptor is ranked higher.
In many cases, it is difficult to uniquely select an optimal set of descriptors due to the high redundancy and overlapping nature of many descriptors [46]. Separate sets of descriptors containing different members of redundant descriptor classes have been found to give similar prediction accuracies [47]. In these cases, the prediction results are more appropriately interpreted at the descriptor class level, where redundant and overlapping descriptors are grouped into one class [20, 41, 48].
COMMONLY USED STATISTICAL LEARNING METHODS
Linear Discriminant Analysis (LDA)
As shown in Fig. 1, LDA [49] separates two classes of feature vectors by constructing a hyperplane defined by a linear discriminant function:
$$L = \sum_{i=1}^{k} w_i x_i$$
where $L$ is the resultant classification score and $w_i$ is the weight associated with the corresponding descriptor $x_i$. A positive or negative $L$ value indicates that a feature vector x belongs to the positive or negative class, respectively.
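The linear discriminant function can be made concrete with a small sketch. The Python code below fits Fisher-style weights for two descriptors and classifies vectors by the sign of L. The toy data, the function names and the choice to centre the descriptors at the midpoint of the two class means (so that the separating hyperplane passes through the origin, matching the bias-free form of L) are all illustrative assumptions rather than the procedure of any cited study.

```python
def mean_vector(rows):
    """Column-wise mean of a list of equal-length vectors."""
    n = len(rows)
    return [sum(r[j] for r in rows) / n for j in range(len(rows[0]))]

def fit_lda_2d(pos, neg):
    """Fisher-style LDA for two descriptors; returns (weights, centre)."""
    m_pos, m_neg = mean_vector(pos), mean_vector(neg)
    # Pooled within-class scatter matrix (2 x 2).
    s = [[0.0, 0.0], [0.0, 0.0]]
    for rows, m in ((pos, m_pos), (neg, m_neg)):
        for r in rows:
            d = [r[0] - m[0], r[1] - m[1]]
            for a in range(2):
                for b in range(2):
                    s[a][b] += d[a] * d[b]
    det = s[0][0] * s[1][1] - s[0][1] * s[1][0]
    inv = [[s[1][1] / det, -s[0][1] / det],
           [-s[1][0] / det, s[0][0] / det]]
    diff = [m_pos[0] - m_neg[0], m_pos[1] - m_neg[1]]
    # w = S^-1 (m_pos - m_neg)
    w = [inv[0][0] * diff[0] + inv[0][1] * diff[1],
         inv[1][0] * diff[0] + inv[1][1] * diff[1]]
    centre = [(m_pos[j] + m_neg[j]) / 2 for j in range(2)]
    return w, centre

def lda_score(x, w, centre):
    """L = sum_i w_i * x_i, evaluated on descriptors shifted by `centre`."""
    return sum(wi * (xi - ci) for wi, xi, ci in zip(w, x, centre))

# Hypothetical two-descriptor feature vectors (not real compound data).
positives = [[2.0, 3.1], [2.5, 3.6], [3.0, 2.9], [2.8, 3.3]]
negatives = [[0.5, 1.0], [1.0, 0.7], [0.8, 1.4], [1.2, 1.1]]
w, centre = fit_lda_2d(positives, negatives)
labels = [1 if lda_score(x, w, centre) > 0 else -1
          for x in positives + negatives]
print(w, labels)
```

With more than two descriptors the 2 x 2 inverse would be replaced by a general linear solve, and a regularised scatter matrix is usually needed when descriptors are highly correlated.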
Multiple Linear Regression (MLR)
An MLR model is developed under the assumption that there is a linear relationship between a specific set of molecular descriptors of a compound, which is usually expressed as a feature vector x with each descriptor as its component, and a particular property, y. An MLR model can be described using the following equation:
$$\hat{y} = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \dots + \beta_k X_k$$
where $\{X_1, \dots, X_k\}$ are molecular descriptors, $\beta_0$ is the regression model constant, and $\beta_1$ to $\beta_k$ are the coefficients corresponding to the descriptors $X_1$ to $X_k$. The values of $\beta_0$ to $\beta_k$ are chosen by minimizing the sum of squares of the vertical distances of the points from the hyperplane so as to give the best prediction of y from x.
Partial Least Squares (PLS)
PLS is similar to MLR in that it is also developed on the basis of a linear relationship between a vector x and a particular property y. However, the problems of collinear descriptors are avoided by calculating the principal components for the molecular descriptors and the target property separately. The scores for the molecular descriptors are then used as the feature vector x for predicting the scores for the target property, which can then be used to predict y. An important consideration in PLS is the appropriate number of principal components to be used for the QSAR model. This is usually determined by using cross-validation methods such as 5-fold cross-validation and leave-one-out. Comparative Molecular Field Analysis (CoMFA) [50] is a popular 3D-QSAR technique which uses PLS as the data analysis method. In CoMFA, compounds are aligned to a common substructure
and the magnitudes of the steric and electrostatic fields of each compound are sampled at regular intervals and used as molecular descriptors.
k Nearest Neighbor (kNN)
In kNN, the Euclidean distance between an unclassified vector x and each individual vector $x_i$ in the training set is measured [51, 52]. The k training vectors nearest to the unclassified vector x are used to determine the class of that vector: the class of the majority of the k nearest neighbors is chosen as the predicted class of x.
Artificial Neural Network (ANN), Neural Network (NN), Principal Component ANN (PC-ANN)
An artificial neural network (ANN) or neural network (NN) is an information-processing paradigm inspired by the way the densely interconnected, parallel structure of the mammalian brain processes information. As shown in Fig. 2, an NN consists of a set of highly interconnected entities, called nodes or units. Each unit is designed to mimic its biological counterpart, the neuron, mathematically. Each node accepts a weighted set of inputs and responds with an output [53]. The principal component-artificial neural network (PC-ANN) was proposed to improve training speed and decrease the overall calibration error [54]. In this method, the input data are subjected to principal component analysis (PCA) before being introduced into the neural network, and the most significant principal components of the original data matrix are selected and used as ANN input.
Fig. (1). Schematic diagram illustrating the prediction of chemical agents with a cardiovascular or haematological property from their structure by using a statistical learning method - linear discriminant analysis (LDA). A, B: feature vectors of agents with the property; E, F: feature vectors of agents without the property; the feature vector (hj, pj, vj,…) represents such structural and physicochemical properties as hydrophobicity, volume, polarizability, etc.
Fig. (2). Schematic diagram illustrating the prediction of chemical agents with a cardiovascular or haematological property from their structure by using a statistical learning method - neural networks (NN). A, B, E, F and (hj, pj, vj,…) are the same as those in Fig. 1.
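The kNN majority-vote rule described in the kNN subsection can be sketched in a few lines of Python. The function name, the toy descriptor vectors and the class labels below are hypothetical and serve only to show the mechanics: Euclidean distances to all training vectors are computed, the k nearest are kept, and the most frequent class among them is returned.

```python
import math
from collections import Counter

def knn_predict(x, training_vectors, training_labels, k=3):
    """Predict the class of x by majority vote among its k nearest neighbours."""
    # Euclidean distance from x to every training vector, nearest first.
    distances = sorted(
        (math.dist(x, xi), label)
        for xi, label in zip(training_vectors, training_labels)
    )
    votes = Counter(label for _, label in distances[:k])
    return votes.most_common(1)[0][0]

# Hypothetical two-descriptor training set (not real compound data).
train_x = [[0.2, 0.1], [0.4, 0.3], [0.3, 0.5],
           [2.0, 2.2], [2.4, 1.9], [2.1, 2.5]]
train_y = ["inactive"] * 3 + ["active"] * 3

print(knn_predict([2.2, 2.0], train_x, train_y, k=3))
print(knn_predict([0.3, 0.2], train_x, train_y, k=3))
```

An odd k avoids tied votes in two-class problems, and descriptors are usually scaled to comparable ranges first so that no single descriptor dominates the distance.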
Support Vector Machine (SVM)
There are two types of SVM algorithms, linear and nonlinear SVM. Nonlinear SVM, which is illustrated in Fig. 3, is more useful for chemical agents of diverse structures and is thus more extensively used [13, 20, 37, 48]. Linear SVM constructs a hyperplane separating two different classes of feature vectors with a maximum margin [55, 56]. This hyperplane is constructed by finding a vector w and a parameter b that minimize $\|w\|^2$ and satisfy the following conditions:
$$w \cdot x_i + b \geq +1, \quad \text{for } y_i = +1 \text{ (positive class)}$$
$$w \cdot x_i + b \leq -1, \quad \text{for } y_i = -1 \text{ (negative class)}$$
Here $x_i$ is a feature vector, $y_i$ is the group index, w is a vector normal to the hyperplane, $|b| / \|w\|$ is the perpendicular distance from the hyperplane to the origin and $\|w\|^2$ is the squared Euclidean norm of w. Nonlinear SVM projects feature vectors into a high-dimensional feature space by using a kernel function such as
$$K(x_i, x_j) = e^{-\|x_j - x_i\|^2 / 2\sigma^2}$$
The linear SVM procedure is then applied to the feature vectors in this feature space. After the determination of w and b, a given vector x can be classified by using $\mathrm{sign}[(w \cdot x) + b]$; a positive or negative value indicates that the vector x belongs to the positive or negative class, respectively.
PREDICTION PERFORMANCE
The reported studies on the use of statistical learning methods for predicting cardiovascular and haematological agents can be divided into two groups. One group includes classification-based statistical learning methods that predict cardiovascular and haematological agents without providing the activity level of the predicted agents. The second group includes regression-based statistical learning methods that estimate the activity level in addition to classifying whether or not a compound is a cardiovascular or haematological agent.
Fig. (3). Schematic diagram illustrating the prediction of chemical agents with a cardiovascular or haematological property from their structure by using a statistical learning method - support vector machines (SVM). A, B, E, F and (hj, pj, vj,…) are the same as those in Fig. 2.
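As a rough illustration of the nonlinear SVM described in the Support Vector Machine subsection, the Python sketch below evaluates the Gaussian kernel and the sign-based decision rule. It assumes the standard kernel expansion of the decision function, in which w · x is replaced by a weighted sum of kernel values over support vectors; this reformulation, the function names, and the support vectors, coefficients `alphas` and offset `b` are hand-picked hypothetical values for illustration, not the output of an actual training run.

```python
import math

def rbf_kernel(xi, xj, sigma=1.0):
    """K(xi, xj) = exp(-||xj - xi||^2 / (2 * sigma^2))."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(xi, xj))
    return math.exp(-sq_dist / (2.0 * sigma ** 2))

def svm_decision(x, support_vectors, labels, alphas, b, sigma=1.0):
    """Kernel form of sign[(w . x) + b]; returns +1 or -1."""
    score = sum(a * y * rbf_kernel(sv, x, sigma)
                for sv, y, a in zip(support_vectors, labels, alphas)) + b
    return 1 if score >= 0 else -1

# Hypothetical values chosen by hand (no optimisation is performed here).
support_vectors = [[1.0, 1.0], [3.0, 3.0]]
labels = [-1, 1]
alphas = [1.0, 1.0]
b = 0.0

print(svm_decision([2.8, 3.1], support_vectors, labels, alphas, b))
print(svm_decision([0.9, 1.2], support_vectors, labels, alphas, b))
```

In practice the coefficients and the offset are obtained by solving the margin-maximization problem with an SVM library, and the kernel width sigma is tuned by cross-validation.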
Table 1 summarises the performance of classification-based methods for predicting cardiovascular and haematological agents. These agents include HERG potassium channel inhibitors, calcium channel antagonists, torsade de pointes causing agents, protein kinase inhibitors and dopamine antagonists. Inhibition of the HERG potassium channel can lead to prolongation of the QT interval, which might trigger torsade de pointes arrhythmia [57]. Calcium channel antagonists control calcium-dependent biological events by blocking the flux of calcium from the extracellular medium to the cell cytoplasm, which has implications for the treatment of cardiovascular diseases such as variant and exertional angina, certain types of cardiac arrhythmias, and hypertension [58]. Torsade de pointes is an atypical rapid ventricular tachycardia with periodic waxing and waning of the amplitude of the QRS complexes on the electrocardiogram as well as rotation of the complexes about the isoelectric line [59]. Protein kinases such as protein kinase C and MAPK induce vascular contraction and increase blood pressure [60-62]. Inhibitors of such protein kinases can thus be used as agents for lowering blood pressure. Dopamine produces mixed cardiovascular effects such as vasodilation. Dopamine antagonists inhibit dopamine receptors and thus enhance the synthesis of dopamine, which is clinically used in the treatment of circulatory shock [63]. As shown in Table 1, the reported overall prediction accuracies of classification-based methods are in the range of 71% to 100%, with the majority concentrated in the range of 81% to 98%. These are similar to the reported accuracy of the prediction of compounds with other pharmacodynamic, pharmacokinetic and toxicological properties by statistical
Table 1. Performance of Classification-Based Statistical Learning Methods for Predicting Cardiovascular and Hematological Agents
Property | Method | Molecular descriptors | No. of compounds in training set | Validation method (No. of compounds in validation set) a | Reported overall prediction accuracy
HERG potassium channel inhibitors | SVM [7] | MOE 2D descriptors, molecular fragment-count descriptors, S log P | 73 | Validation (73+414) | 86~97%
Calcium channel antagonists | LDA [16] | Two topological descriptors, one geometric descriptor, three quantum chemical descriptors, and one electrostatic descriptor | 45 | LOO (45) | 86.7%
Calcium channel antagonists | LS-SVM [16] | Two topological descriptors, one geometric descriptor, three quantum chemical descriptors, and one electrostatic descriptor | 45 | LOO (45) | 91.1%
Calcium channel antagonists | PCA [8] | Molecular polarizability, Verloop minimum width and length of the substituent, rotational barrier, net atomic charge, frontier electron and orbital densities, and molecular hardness | 45 | -- | 82-100%
Calcium channel antagonists | NN [8] | Molecular polarizability, Verloop minimum width and length of the substituent, rotational barrier, net atomic charge, frontier electron and orbital densities, and molecular hardness | 45 | LOO (45) | 77-100%
Torsade de pointes causing agents | SVM [78] | Linear solvation energy relationship | 271 | Validation set (78) | 91.0%
Protein kinase inhibitors | Consensus NN [79] | 20 standard BCUT descriptors | 480 | Validation set (297) | 98.7%
Dopamine antagonists | ANN [80] | Topological structural fragments based on the enumeration of all possible substructures of a chemical structure and their numerical characterization | 1227 | Validation set (137) | 81%
Dopamine antagonists | ANN [81] | Structural and topological descriptors | 1022 | Validation set (113) | 71.7%
Dopamine antagonists | LDA [81] | Structural and topological descriptors | 1022 | Validation set (113) | 72.6%
Abbreviations: HERG - human ether-a-go-go-related gene; LDA - linear discriminant analysis; PCA - principal component analysis; NN - neural network; ANN - artificial neural network; SVM - support vector machine; LS-SVM - least squares support vector machines; LOO - leave-one-out; BCUT - Burden-CAS-University of Texas eigenvalues.
a - number in parentheses denotes the number of compounds used for model validation.
learning methods [11, 64]. These results suggest that the classification methods surveyed here have a certain level of capability for predicting cardiovascular and haematological agents.
Table 2 summarises the performance of the reported works that used regression methods for predicting cardiovascular and haematological agents. These agents include calcium channel antagonists, angiotensin-converting enzyme (ACE) inhibitors, thrombin inhibitors, potassium channel openers, dopamine antagonists, hERG potassium channel blockers and platelet aggregation inhibitors. Angiotensin-converting enzyme is a membrane-bound enzyme on the surface of vascular endothelial cells, mainly of the lung. ACE converts angiotensin I to angiotensin II, which results in vasoconstriction and hypertension. ACE inhibitors are used in antihypertensive therapy [65]. Thrombin is one of the key enzymes in the blood coagulation system, controlling thrombus formation, and its inhibition has become a primary means of antithrombotic control of thrombosis-linked pathological states [66]. Opening of potassium channels by openers causes membrane hyperpolarization. Potassium channel openers are used in the treatment of asthma and chronic obstructive pulmonary disease [67]. Anti-platelet aggregation drugs have found clinical applications in the secondary prevention of vascular events including acute myocardial infarction, stroke and cardiovascular death [4].
Table 2. Performance of Regression-Based Statistical Learning Methods for Predicting Cardiovascular and Hematological Agents
Activity | Method | Molecular descriptors | No. of compounds in training set | Validation method (No. of compounds in validation set) a | Reported prediction statistics
Calcium channel antagonists | LS-SVM [16] | Two topological descriptors, one geometric descriptor, three quantum chemical descriptors, and one electrostatic descriptor | 45 | LOO (45) | r² = 0.82
Calcium channel antagonists | Heuristic Method [16] | Two topological descriptors, one geometric descriptor, three quantum chemical descriptors, and one electrostatic descriptor | 45 | LOO (45) | r² = 0.82
Calcium channel antagonists | PC-GA-ANN [9] | Constitutional descriptors, topological indices, topological charge indices, geometrical descriptors, molecular walk counts, Burden's eigenvalue descriptors, autocorrelation descriptors, physicochemical parameters and liquid properties | 110 | Training set (110) | r² = 0.93~0.94
Calcium channel antagonists | PC-ANN [10] | Graph theoretic and information theoretic variables | 46 | LOO (46) | r² = 0.55
Calcium channel antagonists | MLR [10] | Graph theoretic and information theoretic variables | 46 | LOO (46) | r² = 0.55
Calcium channel antagonists | CoMFA [82] | 128 3D SYBYL QSAR descriptors | 45 | LOO (45) | r² = 0.60
Calcium channel antagonists | CoMSIA [82] | 83 3D SYBYL QSAR descriptors | 45 | LOO (45) | r² = 0.74
Calcium channel antagonists | GRID/GOLPE [82] | Descriptors representing interaction fields between the ligands and the probe molecules; 65 3D descriptors were selected | 45 | LOO (45) | r² = 0.62
Angiotensin converting enzyme inhibitors | PLS [12] | CoMFA, CoMSIA, HQSAR, EVA, and 2D, 2.5D traditional descriptors | 114 | Leave 10% out | q² = 0.72
Angiotensin converting enzyme inhibitors | GFA [12] | CoMFA, CoMSIA, HQSAR, EVA, and 2D, 2.5D traditional descriptors | 114 | Leave 10% out | q² = 0.67~0.70
Angiotensin converting enzyme inhibitors | GPLS [12] | CoMFA, CoMSIA, HQSAR, EVA, and 2D, 2.5D traditional descriptors | 114 | Leave 10% out | q² = 0.72
Angiotensin converting enzyme inhibitors | NN [12] | CoMFA, CoMSIA, HQSAR, EVA, and 2D, 2.5D traditional descriptors | 114 | Leave 10% out | q² = 0.74
Thrombin inhibitors | PLS [12] | CoMFA, CoMSIA, HQSAR, EVA, and 2D, 2.5D traditional descriptors | 88 | Leave 10% out | q² = 0.45
Thrombin inhibitors | GFA [12] | CoMFA, CoMSIA, HQSAR, EVA, and 2D, 2.5D traditional descriptors | 88 | Leave 10% out | q² = 0.49~0.50
Thrombin inhibitors | GPLS [12] | CoMFA, CoMSIA, HQSAR, EVA, and 2D, 2.5D traditional descriptors | 88 | Leave 10% out | q² = 0.61
Thrombin inhibitors | NN [12] | CoMFA, CoMSIA, HQSAR, EVA, and 2D, 2.5D traditional descriptors | 88 | Leave 10% out | q² = 0.64
Potassium channel openers | CP-ANN [68] | Molecular electrostatic potential at surface points | 18 | LOO | r² = 0.96
Potassium channel openers | PLS [83] | GRIND descriptors | 17 | Cross validation | r² = 0.94
Dopamine antagonists | GA-PLS [84] | Topological indices | 29 | Cross validation | r² = 0.73
Dopamine antagonists | kNN [84] | Topological indices | 29 | Cross validation | r² = 0.79
hERG potassium channel blockers | PLS [85] | GRIND descriptors | 332 | Validation set (16) | r² = 0.76, q² = 0.72
hERG potassium channel blockers | PLS [85] | GRIND descriptors | 518 | Validation set (26) | r² = 0.77, q² = 0.74
Platelet aggregation inhibitors | MLR [86] | Log P, calculated molar refractivity, molar volume, Hammett sigma, resonance and inductive sigma constants | 35 | Cross validation | r² = 0.74, q² = 0.64
Abbreviations: LS-SVM - least squares support vector machines; PC-GA-ANN - principal component genetic algorithm artificial neural network; PC-ANN - principal component artificial neural network; MLR - multiple linear regression; CoMFA - comparative molecular field analysis; CoMSIA - comparative molecular similarity indices analysis; GRID/GOLPE - generating optimal linear partial least square estimations; GRIND - grid-independent descriptors; QSAR - quantitative structure-activity relationship; SYBYL - a general-use molecular modelling and visualization package; PLS - partial least squares; HQSAR - hologram quantitative structure-activity relationship; GFA - genetic function approximation; EVA - eigenvalue analysis; GPLS - genetic partial least squares; NN - neural network; CP-ANN - counter-propagation artificial neural network; kNN - k nearest neighbor; LOO - leave-one-out.
The performance of the regression-based methods is primarily measured by the r² value, which measures the variance in the experimentally estimated activities that is explained by the computed activities. Moreover, q² values, the cross-validated counterparts of r² that are typically obtained by leave-one-out cross-validation, are also frequently computed to further evaluate the predictive capability of these models. The computed r² values range from 0.55 to 0.96 [10, 68], which is comparable to the range of 0.51 to 0.88 in typical conventional QSAR studies [69, 70]. These results suggest that regression-based methods are useful for predicting the activity values of compounds of a particular property at accuracy levels comparable to conventional QSAR methods.
UNDERLYING DIFFICULTIES IN THE APPLICATION OF STATISTICAL LEARNING METHODS
The performance of statistical learning methods critically depends on the diversity of chemical agents in a training dataset and the appropriate representation of these agents. The datasets in Table 1 and Table 2 are not expected to be fully representative of all of the agents with a particular cardiovascular or hematological property.
Various degrees of inadequate representation of chemical agents in these studies likely affect, to a certain extent, the prediction accuracy of the developed statistical learning models. In general, a sufficiently diverse set of agents is needed for developing a statistical learning model. Mining the literature and other sources [71, 72] for agents known to possess a particular property and for those that do not possess it is the key to more extensive exploration of machine learning methods. Databases such as the PDSP Ki database [73], KiBank [74], PubChem [75], and CLiBE [76], which provide information about the activity of agents possessing a specific activity, are useful resources for this purpose, and more such databases are desired.
Based on studies of agents incorrectly predicted by statistical learning methods, it has been suggested that the currently used descriptors are insufficient to adequately represent some agents that contain complex structural or chemical configurations [42, 43, 48]. Examples are agents with a large rigid structure combined with a short flexible hydrophilic tail, and compounds that contain multiple rings with various heteroatoms such as nitrogen, oxygen, sulphur, fluorine and chlorine. Due to the limited coverage of the number of bond links in a heteroatom loop, topological descriptors are not yet capable of describing the special features of a complex multi-ring structure that contains multiple heteroatoms. It appears that none of the currently available descriptors can fully represent molecules containing a long flexible chain. Therefore, there is a need to explore different combinations of descriptors and to select more optimal sets of descriptors by using more refined feature selection algorithms and parameters. However, indiscriminate use of many existing topological descriptors, which overlap with and are redundant to each other, may introduce noise as well as extend the coverage of some aspects of these special features. Thus, it may be necessary to introduce new descriptors for more appropriately representing these and other special features.
CONCLUDING REMARKS
Statistical learning methods appear to be useful for facilitating the prediction of cardiovascular and haematological agents. The performance of statistical learning methods may be further improved in several respects. The first is the introduction of new descriptors for describing complex structures not well covered by existing descriptors. The second is more appropriate selection of molecular descriptors that can best represent compounds of a particular property. Feature selection methods have been extensively used [20, 37, 48], but they are computationally costly. Recent efforts are directed at improving the efficiency and speed of feature selection methods [77], which can further help to optimally select molecular descriptors and enable the development of more accurate and efficient prediction tools. The third is the incorporation of additional factors such as hydrogen bonding and the relationship between pharmacodynamic and pharmacokinetic properties. These and other improvements will enable the development of statistical learning methods into useful tools for facilitating the prediction of a diverse variety of cardiovascular and haematological agents.
[31] [32] [33] [34] [35] [36] [37] [38] [39] [40]
[41]
REFERENCES
[1] World Health Organization. Avoiding heart attacks and strokes: don't be a victim - protect yourself; WHO Press: Geneva, 2005.
[2] Rang, H.P.; Dale, M.M.; Ritter, J.M.; Moore, P.K. Pharmacology; Churchill Livingstone: Edinburgh, 2003.
[3] Pinelli, A.; Godio, C.; Laghezza, A.; Mitro, N.; Fracchiolla, G.; Tortorella, V.; Lavecchia, A.; Novellino, E.; Fruchart, J.C.; Staels, B.; Crestani, M.; Loiodice, F. J. Med. Chem., 2005, 48, 5509.
[4] Verma, R.P. Mini Rev. Med. Chem., 2006, 6, 467.
[5] Gupta, S.P. Mini Rev. Med. Chem., 2003, 3, 315.
[6] Liu, H.; Ji, M.; Luo, X.; Shen, J.; Huang, X.; Hua, W.; Jiang, H.; Chen, K. J. Med. Chem., 2002, 45, 2953.
[7] Tobita, M.; Nishikawa, T.; Nagashima, R. Bioorg. Med. Chem. Lett., 2005, 15, 2886.
[8] Takahata, Y.; Costa, M.C.; Gaudio, A.C. J. Chem. Inf. Comput. Sci., 2003, 43, 540.
[9] Hemmateenejad, B.; Akhond, M.; Miri, R.; Shamsipur, M. J. Chem. Inf. Comput. Sci., 2003, 43, 1328.
[10] Viswanadhan, V.N.; Mueller, G.A.; Basak, S.C.; Weinstein, J.N. J. Chem. Inf. Comput. Sci., 2001, 41, 505.
[11] Li, H.; Yap, C.W.; Xue, Y.; Li, Z.R.; Ung, C.Y.; Han, L.Y.; Chen, Y.Z. Drug Dev. Res., 2006, 66, 245.
[12] Sutherland, J.J.; O'Brien, L.A.; Weaver, D.F. J. Med. Chem., 2004, 47, 5541.
[13] Trotter, M.W.B.; Holden, S.B. QSAR Comb. Sci., 2003, 22, 533.
[14] Manallack, D.T.; Livingstone, D.J. Eur. J. Med. Chem., 1999, 34, 195.
[15] Burbidge, R.; Trotter, M.; Buxton, B.; Holden, S. Comput. Chem., 2001, 26, 5.
[16] Yao, X.; Liu, H.; Zhang, R.; Liu, M.; Hu, Z.; Panaye, A.; Doucet, J.P.; Fan, B. Mol. Pharm., 2005, 2, 348.
[17] Ng, C.; Xiao, Y.; Putnam, W.; Lum, B.; Tropsha, A. J. Pharm. Sci., 2004, 93, 2535.
[18] Yap, C.W.; Chen, Y.Z. J. Pharm. Sci., 2004, 94, 153.
[19] Xue, Y.; Li, Z.R.; Yap, C.W.; Sun, L.Z.; Chen, X.; Chen, Y.Z. J. Chem. Inf. Comput. Sci., 2004, 44, 1630.
[20] Li, H.; Yap, C.W.; Ung, C.Y.; Xue, Y.; Cao, Z.W.; Chen, Y.Z. J. Chem. Inf. Model., 2005, 45, 1376.
[21] Cruciani, G.; Pastor, M.; Guba, W. Eur. J. Pharm. Sci., 2000, 11, S29.
[22] Hall, L.H.; Kellogg, G.E.; Haney, D.N. Molconn-Z, eduSoft, LC: Ashland, 2002.
[23] Todeschini, R.; Consonni, V.; Mauri, A.; Pavan, M. DRAGON, TALETE srl: Milano, 2005.
[24] Wegner, J.K. JOELib/JOELib2; http://www-ra.informatik.uni-tuebingen.de/software/joelib/index.html, 2005.
[25] Hopfinger, A.J. J. Am. Chem. Soc., 1980, 102, 7196.
[26] Hemmer, M.C.; Steinhauer, V.; Gasteiger, J. Vib. Spectrosc., 1999, 19, 151.
[27] Ruecker, G.; Ruecker, C. J. Chem. Inf. Comput. Sci., 1993, 33, 683.
[28] Schuur, J.H.; Setzer, P.; Gasteiger, J. J. Chem. Inf. Comput. Sci., 1996, 36, 334.
[29] Pearlman, R.S.; Smith, K.M. J. Chem. Inf. Comput. Sci., 1999, 39, 28.
[30] Bravi, G.; Gancia, E.; Mascagni, P.; Pegna, M.; Todeschini, R.; Zaliani, A. J. Comput. Aided Mol. Des., 1997, 11, 79.
[31] Galvez, J.; Garcia, R.; Salabert, M.T.; Soler, R. J. Chem. Inf. Comput. Sci., 1994, 34, 520.
[32] Consonni, V.; Todeschini, R.; Pavan, M. J. Chem. Inf. Comput. Sci., 2002, 42, 682.
[33] Randic, M. Tetrahedron, 1975, 31, 1477.
[34] Randic, M. New J. Chem., 1995, 19, 781.
[35] Kier, L.B.; Hall, L.H. Molecular structure description: The electrotopological state; Academic Press: San Diego, 1999.
[36] Platts, J.A.; Butina, D.; Abraham, M.H.; Hersey, A. J. Chem. Inf. Comput. Sci., 1999, 39, 835.
[37] Guyon, I.; Weston, J.; Barnhill, S.; Vapnik, V. Mach. Learn., 2002, 46, 389.
[38] Lucasius, C.B.; Kateman, G. Chemometr. Intell. Lab., 1993, 19, 1.
[39] Sutter, J.M.; Kalivas, J.H. Microchem. J., 1993, 47, 60.
[40] Yu, H.; Yang, J.; Wang, W.; Han, J. IEEE Computer Society Bioinformatics Conference (CSB'03); August 11 - 14, 2003, Stanford: California, 220.
[41] Serra, J.R.; Thompson, E.D.; Jurs, P.C. Chem. Res. Toxicol., 2003, 16, 153.
[42] Xue, Y.; Yap, C.W.; Sun, L.Z.; Cao, Z.W.; Wang, J.F.; Chen, Y.Z. J. Chem. Inf. Comput. Sci., 2004, 44, 1497.
[43] Li, H.; Ung, C.Y.; Yap, C.W.; Xue, Y.; Li, Z.R.; Cao, Z.W.; Chen, Y.Z. Chem. Res. Toxicol., 2005, 18, 1071.
[44] Iyer, M.; Mishru, R.; Han, Y.; Hopfinger, A.J. Pharm. Res., 2002, 19, 1611.
[45] Kohavi, R.; John, G.H. Artif. Intell. Med., 1997, 97, 273.
[46] Gramatica, P.; Pilutti, P.; Papa, E. J. Chem. Inf. Comput. Sci., 2004, 44, 1794.
[47] Izrailev, S.; Agrafiotis, D.K. J. Mol. Graph. Model., 2004, 22, 275.
[48] Yap, C.W.; Chen, Y.Z. J. Chem. Inf. Model., 2005, 45, 982.
[49] Huberty, C.J. Applied Discriminant Analysis; John Wiley & Sons: New York, 1994.
[50] Cramer, R.D.; Patterson, D.E.; Bunce, J.D. Prog. Clin. Biol. Res., 1989, 291, 161.
[51] Johnson, R.A.; Wichern, D.W. Applied Multivariate Statistical Analysis; Prentice Hall: Englewood Cliffs, NJ, 1982.
[52] Fix, E.; Hodges, J.L. Discriminatory Analysis: Non-parametric Discrimination Consistency Properties; USAF School of Aviation Medicine: Texas, 1951.
[53] Aleksander, I.; Morton, H. An Introduction to Neural Computing; International Thomson Computer Press: London, 1995.
[54] Gemperline, P.J.; Long, J.R.; Gregoriou, V.G. Anal. Chem., 1991, 63, 2313.
[55] Vapnik, V.N. The Nature of Statistical Learning Theory; Springer: New York, 1995.
[56] Cristianini, N.; Shawe-Taylor, J. An Introduction to Support Vector Machines and Other Kernel-Based Learning Methods; Cambridge University Press: New York, 2000.
[57] De Ponti, F.; Poluzzi, E.; Montanaro, N. Eur. J. Clin. Pharmacol., 2001, 57, 185.
[58] Costa, M.C.A.; Gaudio, A.C.; Takahata, Y. Theochem, 1997, 394, 291.
[59] Saunders, W.B. Dorland's Illustrated Medical Dictionary; W.B. Saunders Company: London, 2000.
[60] Salamanca, D.A.; Khalil, R.A. Biochem. Pharmacol., 2005, 70, 1537.
[61] Jarajapu, Y.P.; Knot, H.J. Am. J. Physiol. Heart Circ. Physiol., 2005, 289, H1917.
[62] Umemoto, S.; Kawahara, S.; Hashimoto, R.; Umeji, K.; Matsuda, S.; Tanaka, M.; Kubo, M.; Matsuzaki, M. Hypertens. Res., 2006, 29, 179.
[63] Zou, A.P.; Parekh, N.; Steinhausen, M. Int. J. Microcirc. Clin. Exp., 1990, 9, 285.
[64] Yap, C.W.; Xue, Y.; Li, H.; Li, Z.R.; Ung, C.Y.; Han, L.Y.; Zheng, C.J.; Cao, Z.W.; Chen, Y.Z. Mini Rev. Med. Chem., 2006, 6, 449.
[65] Mallareddy, M.; Parikh, C.R.; Peixoto, A.J. J. Clin. Hypertens., 2006, 8, 398.
[66] Stubbs, M.T.; Bode, W. Thromb. Res., 1993, 69, 1.
[67] Pelaia, G.; Gallelli, L.; Vatrella, A.; Grembiale, R.D.; Maselli, R.; De Sarro, G.B.; Marsico, S.A. Life Sci., 2002, 70, 977.
[68] Mlinsek, G.; Novic, M.; Hodoscek, M.; Solmajer, T. J. Chem. Inf. Comput. Sci., 2001, 41, 1286.
[69] Lobell, M.; Sivarajah, V. Mol. Divers., 2003, 7, 69.
[70] Hou, T.J.; Xu, X.J. J. Chem. Inf. Comput. Sci., 2003, 43, 2137.
[71] AHFS drug information. http://www.ashp.org/ahfs; American Society of Health-System Pharmacists: Bethesda, 2001.
[72] MICROMEDEX. http://www.micromedex.com; Thomson MICROMEDEX: Greenwood Village: Colorado, 2003.
[73] Roth, B.L.; Kroeze, W.K.; Patel, S.; Lopez, E. Neuroscientist, 2000, 6, 252.
[74] Zhang, J.; Aizawa, M.; Amari, S.; Iwasawa, Y.; Nakano, T.; Nakata, K. Comput. Biol. Chem., 2004, 28, 401.
[75] Pubchem. http://pubchem.ncbi.nlm.nih.gov; National Institutes of Health (NIH): Bethesda, 2004.
[76] Chen, X.; Ji, Z.L.; Zhi, D.G.; Chen, Y.Z. Comput. Chem., 2002, 26, 661.
[77] Furlanello, C.; Serafini, M.; Merler, S.; Jurman, G. Neural Netw., 2003, 16, 641.
[78] Yap, C.W.; Cai, C.Z.; Xue, Y.; Chen, Y.Z. Toxicol. Sci., 2004, 79, 170.
[79] Manallack, D.; Pitt, W.; Gancia, E.; Montana, J.; Livingstone, D.; Ford, M.; Whitley, D. J. Chem. Inf. Comput. Sci., 2002, 42, 1256.
[80] Fujishima, S.; Takahashi, Y. J. Chem. Inf. Comput. Sci., 2004, 44, 1006.
[81] Kim, H.J.; Choo, H.; Cho, Y.S.; Koh, H.Y.; No, K.T.; Pae, A.N. Bioorg. Med. Chem., 2006, 14, 2763.
[82] Schleifer, K.J.; Tot, E. Quant. Struct.-Act. Rel., 2002, 21, 239.
[83] Carosati, E.; Lemoine, H.; Spogli, R.; Grittner, D.; Mannhold, R.; Tabarrini, O.; Sabatini, S.; Cecchetti, V. Bioorg. Med. Chem., 2005, 13, 5581.
[84] Hoffman, B.; Cho, S.J.; Zheng, W.; Wyrick, S.; Nichols, D.E.; Mailman, R.B.; Tropsha, A. J. Med. Chem., 1999, 42, 3217.
[85] Cianchetta, G.; Li, Y.; Kang, J.; Rampe, D.; Fravolini, A.; Cruciani, G.; Vaz, R.J. Bioorg. Med. Chem. Lett., 2005, 15, 3637.
[86] De Candia, M.; Summo, L.; Carrieri, A.; Altomare, C.; Nardecchia, A.; Cellamare, S.; Carotti, A. Bioorg. Med. Chem., 2003, 11, 1439.

Received: 22 June, 2006
Revised: 31 July, 2006
Accepted: 04 August, 2006