Revue Interdisciplinaire
Vol1, n°1 (2016)
Basic approaches and applications of QSAR/QSPR methods
Samir Chtita*¹, Mohammed Bouachrine² and Tahar Lakhlifi¹
¹ Molecular Chemistry and Natural Substances Laboratory, Faculty of Science, University Moulay Ismail, Meknes, Morocco.
² High School of Technology (EST Meknès), University Moulay Ismail, Meknes, Morocco.
*Corresponding author. E-mail: samirchtita@gmail.com
Abstract: The main objective of this paper is to describe briefly the methodologies and applications involved in QSAR/QSPR, and to relate and compare them to some of our previously published works. QSAR and QSPR are an important field of application for the results discussed in this work. Theoretical and practical results on the statistical analysis and modeling of molecular descriptors are presented, with particular emphasis on the statistical methods used to model data from molecular descriptors.
Key Words: QSAR, QSPR, Activity, Validation.
1. Introduction:
The increasing number of papers describing applications of quantitative structure-activity/property relationship (QSAR/QSPR) methods shows their effectiveness and strength. The advantage of QSAR/QSPR has grown with the rapid development of computer science and of theoretical quantum chemical studies. QSAR/QSPR methods are based on the assumption that the activity or property of a chemical compound, such as drug binding to DNA or a toxic effect, is related to its structure through a mathematical algorithm. Once established, this relationship can be used in the prediction, interpretation and assessment of new compounds with desired activities/properties, reducing and rationalizing the time, effort and cost of synthesis and new product development.
The basic assumption behind a QSAR/QSPR model is that the effect is a mathematical function of the chemical properties. The effect y (activity or property) is therefore written as a function f of the chemical properties x: y = f(x). To find this function, we use a number of chemical compounds with known values of the studied effect (y). For each compound, we calculate a series of parameters called chemical descriptors. We then look for an algorithm that returns values close to the experimental ones. The final step is to check whether the algorithm can predict the activity/property values of other chemicals not used to build the model; this phase is called validation of the model. Validation is essential: the model must work not only for the chemical substances in the training set but also for other, similar chemicals. Consequently, the challenge is to define the correct statistical properties of
the model. A flowchart of the method used to develop the QSAR/QSPR models, along with the various validation methods used in our work, is shown in Figure 1.
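As a minimal sketch of the workflow outlined in the introduction (fit y = f(x) on compounds with known activity, then check the model on compounds it has not seen), the following Python fragment fits a one-descriptor linear model and reports r² for the training set and for a held-out test set. The descriptor values, the activities and the function names are invented for illustration and are not data from the cited studies; a real model would use several descriptors. The test-set r² is computed here against the mean of the test activities, which is one common convention among several.

```python
def fit_line(x, y):
    """Least-squares fit of y = a + b*x for a single descriptor."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    b = sxy / sxx
    a = mean_y - b * mean_x
    return a, b


def r_squared(y_obs, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_obs = sum(y_obs) / len(y_obs)
    ss_res = sum((o - p) ** 2 for o, p in zip(y_obs, y_pred))
    ss_tot = sum((o - mean_obs) ** 2 for o in y_obs)
    return 1.0 - ss_res / ss_tot


# Hypothetical data: one descriptor value and one measured activity per compound.
train_x = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
train_y = [2.1, 3.9, 6.2, 7.8, 10.1, 12.0]
test_x = [2.5, 4.5]   # compounds kept out of model building
test_y = [5.0, 9.1]

a, b = fit_line(train_x, train_y)
r2_train = r_squared(train_y, [a + b * x for x in train_x])
r2_test = r_squared(test_y, [a + b * x for x in test_x])
print(f"y = {a:.3f} + {b:.3f} x")
print(f"r2 (training) = {r2_train:.3f}, r2 (test) = {r2_test:.3f}")
```

A large gap between the two coefficients would suggest that the model describes the training compounds without generalizing to similar chemicals.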
Figure 1: Flowchart of the methodology used in our QSAR/QSPR works
2. Materials and methods:
In our previous studies [1-5], QSAR/QSPR methods were used to predict and interpret activities/properties and to design new compounds, using linear and nonlinear methods. All these QSAR/QSPR studies consist of four stages: selection of the dataset and generation of molecular descriptors; descriptive analysis; statistical analysis (prediction and evaluation of models); and suggestion of novel compounds.
In the first stage, the datasets were collected from previous works reporting known values of the studied effect, and the values of the descriptors were calculated. A sufficiently large number of compounds is required: the risk of chance correlation is low when the number of observations is at least five times the number of descriptors [6]. Thousands of chemical descriptors can be calculated. In our studies, two types of molecular descriptors were employed: electronic descriptors (e.g. LUMO and HOMO energies, dipole moment) computed with the Gaussian 03 software, and physicochemical descriptors (e.g. molar refractivity) or constitutional indices (e.g. molecular weight) computed with the ACD/ChemSketch software.
In the second stage, Principal Component Analysis (PCA) and Hierarchical Cluster Analysis (HCA) or K-means clustering were used to form dissimilar clusters of compounds, against which query compounds can be compared to determine their degree of similarity, and to check the non-multicollinearity among the variables (descriptors). After that, the dataset is divided into training and test sets.
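The cluster-based division into training and test sets can be sketched as follows, assuming (as the K-means subsection of this paper describes) that one compound per cluster is drawn at random for the test set. This is an illustrative pure-Python version of Lloyd's k-means, not the XLSTAT, SPSS or Matlab routines used in the cited studies. The descriptor matrix, the seeds and the function names are hypothetical, and the descriptors are assumed to be already standardized.

```python
import random


def squared_distance(p, q):
    """Squared Euclidean distance between two descriptor vectors."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def k_means(points, k, seed=0, n_iter=100):
    """Plain Lloyd's k-means; returns one cluster label per compound."""
    rng = random.Random(seed)
    centers = [list(p) for p in rng.sample(points, k)]
    labels = None
    for _ in range(n_iter):
        # Assignment step: each compound joins its nearest centre.
        new_labels = [
            min(range(k), key=lambda c: squared_distance(p, centers[c]))
            for p in points
        ]
        if new_labels == labels:
            break  # assignments are stable, so the algorithm has converged
        labels = new_labels
        # Update step: move each centre to the mean of its members.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:  # an empty cluster keeps its previous centre
                centers[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels


def split_by_cluster(labels, seed=0):
    """Draw one compound at random from each non-empty cluster for the test set."""
    rng = random.Random(seed)
    clusters = {}
    for index, label in enumerate(labels):
        clusters.setdefault(label, []).append(index)
    test = sorted(rng.choice(members) for members in clusters.values())
    train = [i for i in range(len(labels)) if i not in test]
    return train, test


# Hypothetical standardized descriptors for nine compounds (two descriptors each).
descriptors = [
    (0.1, 0.2), (0.2, 0.1), (0.0, 0.3),
    (5.0, 5.1), (5.2, 4.9), (4.9, 5.0),
    (9.8, 0.1), (10.1, 0.3), (10.0, 0.0),
]
labels = k_means(descriptors, k=3, seed=1)
train_idx, test_idx = split_by_cluster(labels, seed=1)
print("cluster labels:", labels)
print("training compounds:", train_idx)
print("test compounds:", test_idx)
```

Because k-means depends on its random starting centres, the clusters found can differ between seeds; in practice several starts are usually compared before the split is fixed.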
In the third stage, in order to propose mathematical models and to evaluate quantitatively the physicochemical effects of the substituents on the activity/property, we submitted the data matrix, formed by the descriptors of the studied molecules, to the Multiple Linear Regression (MLR) method, which was used to propose a linear model
and to select the descriptors used for the nonlinear methods, such as Multiple Nonlinear Regression (MNLR) and Artificial Neural Networks (ANN). All of these statistical methods are presented later in this paper and in detail in our previous articles [1-5]. The developed models were compared and validated using internal validation techniques, such as the key statistical terms (correlation and determination coefficients, r and r²) and leave-one-out or leave-N-out cross-validation (CV-LOO/CV-LNO), and external validation using the test set (i.e. a group of molecules not in the training set on which the model was developed).
In the fourth stage, we evaluated and compared the proposed models. We also proposed new compounds with desired activities/properties.
2.1. Descriptive Analysis:
a) Principal Component Analysis (PCA)
PCA is a useful statistical technique for summarizing the information encoded in the structures of the compounds, and it is very helpful for understanding their distribution. It is an essentially descriptive statistical method that aims to present, in graphical form, as much as possible of the information contained in the dataset [4].
b) Hierarchical Cluster Analysis (HCA)
The aim of HCA is to recognize groups of objects based on their similarity; it groups a collection of objects into clusters (subsets) such that the objects within each cluster are more closely related to one another than to objects in different clusters. It is a multivariate chemometric technique that produces results by class or cluster [7].
c) K-means Clustering
K-means clustering is a non-hierarchical clustering method that can be used when the number of clusters present in the objects or cases is known. In general, the k-means method produces exactly k different clusters. The division of the dataset into training and test sets was performed using the HCA or the K-means clustering technique.
In this approach, one compound was selected at random from each obtained cluster to serve as a test set compound [8-9].
2.2. Statistical Methods:
a) Multiple linear regression (MLR)
MLR is one of the most popular QSAR/QSPR methods thanks to its simplicity of operation, its reproducibility and the easy interpretation of the features used. Its main advantage is that it is highly transparent: the algorithm is available and predictions can be made easily. Another advantage is that it can aid a priori descriptor selection [9].
b) Partial least squares (PLS)
PLS is a generalization of MLR. It can analyze data with strongly collinear, correlated and noisy variables. In MLR, if the number of descriptors becomes too large (e.g. close to the number of observations), the model is likely to fit the sampled data perfectly, a phenomenon called overfitting [10].
c) Multiple nonlinear regression (MNLR)
MNLR is a nonlinear method in which we applied the descriptors selected by the MLR on the same dataset (training set). In our previous works, we used the preprogrammed function:
Y = a + (bX₁ + cX₂ + dX₃ + eX₄ + ···) + (fX₁² + gX₂² + hX₃² + iX₄² + ···)
where a, b, c, d, ... are the parameters and X₁, X₂, X₃, X₄, ... are the variables.
d) Artificial Neural Networks (ANN)
To increase the probability of a good characterization of the studied compounds, artificial neural networks (ANN) can be used to generate predictive QSAR/QSPR models relating the set of molecular descriptors obtained from the MLR to the observed activities. The ANN model of calculated activities was developed using the properties of several studied compounds. We used a parameter ρ, which helps determine the number of hidden neurons and plays a major role in determining the best ANN architecture. It is defined as follows:
ρ = (Number of data points in the training set) / (Sum of the number of connections in the ANN)
In order to avoid overfitting or underfitting, it is recommended that 1.8 < ρ […] (> 0.5) indicates better predictivity of the model [11].
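The ratio ρ defined above can be tabulated for candidate hidden-layer sizes, as in the sketch below. It assumes a fully connected network with a single hidden layer and counts only the weights between layers as connections (bias terms are left out); both choices are assumptions, since the text does not specify them. Only the lower bound of 1.8 is taken from the text; the upper bound is left as an optional parameter for the reader to supply. The dataset size and the function names are hypothetical.

```python
def n_connections(n_inputs, n_hidden, n_outputs):
    """Weights of a fully connected input-hidden-output network (biases not counted)."""
    return n_inputs * n_hidden + n_hidden * n_outputs


def rho(n_train, n_inputs, n_hidden, n_outputs):
    """Ratio of training data points to network connections."""
    return n_train / n_connections(n_inputs, n_hidden, n_outputs)


def candidate_hidden_sizes(n_train, n_inputs, n_outputs,
                           lower, upper=None, max_hidden=20):
    """Hidden-layer sizes whose rho lies above `lower` (and below `upper`, if given)."""
    sizes = []
    for n_hidden in range(1, max_hidden + 1):
        value = rho(n_train, n_inputs, n_hidden, n_outputs)
        if value > lower and (upper is None or value < upper):
            sizes.append((n_hidden, round(value, 2)))
    return sizes


# Hypothetical case: 38 training compounds, 4 descriptors, 1 predicted activity.
for n_hidden, value in candidate_hidden_sizes(38, 4, 1, lower=1.8):
    print(f"{n_hidden} hidden neuron(s): rho = {value}")
```

Adding hidden neurons increases the number of connections and therefore lowers ρ, which is why the ratio bounds the size of the hidden layer for a given training set.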
Figure 2: Procedure of k-fold cross-validation
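The k-fold procedure of Figure 2 can be sketched in a few lines: the compounds are divided into k folds, each fold is predicted by a model fitted on the remaining folds, and the pooled predictions give a cross-validated coefficient r²cv. Setting k equal to the number of compounds gives leave-one-out cross-validation. The one-descriptor linear model, the interleaved fold assignment, the data and the function names are illustrative assumptions; r²cv is computed here against the mean of all observed activities, which is one common convention.

```python
def fit_line(x, y):
    """Least-squares fit of y = a + b*x for a single descriptor."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    b = sxy / sxx
    return mean_y - b * mean_x, b


def k_fold_r2cv(x, y, k):
    """Cross-validated r2: each compound is predicted by a model that never saw it."""
    n = len(x)
    predictions = [None] * n
    for fold in range(k):
        # Interleaved folds: compound i belongs to fold i % k.
        test_idx = [i for i in range(n) if i % k == fold]
        train_idx = [i for i in range(n) if i % k != fold]
        a, b = fit_line([x[i] for i in train_idx], [y[i] for i in train_idx])
        for i in test_idx:
            predictions[i] = a + b * x[i]
    mean_y = sum(y) / n
    press = sum((yi - pi) ** 2 for yi, pi in zip(y, predictions))
    ss_tot = sum((yi - mean_y) ** 2 for yi in y)
    return 1.0 - press / ss_tot


# Hypothetical data: one descriptor value and one measured activity per compound.
x = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
y = [2.2, 3.8, 6.1, 8.3, 9.7, 12.2, 13.8, 16.1]

r2cv_loo = k_fold_r2cv(x, y, k=len(x))  # leave-one-out
r2cv_4 = k_fold_r2cv(x, y, k=4)         # 4-fold
print(f"r2cv (LOO) = {r2cv_loo:.3f}, r2cv (4-fold) = {r2cv_4:.3f}")
print("passes the 0.5 criterion:", r2cv_loo > 0.5)
```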
b) External Validation:
The real predictive power of a QSAR/QSPR model is tested by its ability to predict accurately the activity/property of compounds from an external test set (compounds not used for model development). The purpose of a good QSAR/QSPR model is not only to predict the activity of the training set compounds but also to predict the activities of external molecules (test set) [12]. A good model should predict the activity of the test set molecules in agreement with the experimentally determined values. The predictive capacity of the models was judged from the test validation coefficient R²test, determined based on
the predictive ability of the model for the test set; a higher value of R²test (> 0.5) indicates improved predictivity of the model.
2.4. Software used in our QSAR/QSPR development study
Many free and commercial software packages are available for QSAR/QSPR development. These include specialized software for drawing chemical structures, generating 3D structures, calculating chemical descriptors and developing QSAR models. The software used in our works is listed in the following table:
Table 1: Software used in this work.
Drawing chemical structures | Marvin Sketch, ACD/ChemSketch, ChemBioDraw
Generating 3D structures | Gauss View 3.0 and ChemBio3D
Calculating chemical descriptors | Gaussian 03, Marvin Sketch 6.2, ChemSketch and ChemBio3D
Developing QSAR models | XLSTAT 2009, SPSS Statistics 20 and Matlab R2009b
3. Results and discussions:
In our five studies, we used QSAR/QSPR methods to establish quantitative relationships for several activities/properties of compounds based on imidazo[1,2-a]pyrazine derivatives, dibenzo[a,d]cycloalkenimine derivatives, (1,3-benzothiazol-2-yl) amino-9-(10H)-acridinone derivatives and 9-anilinoacridine derivatives (Table 2). We used the PCA, HCA or K-means clustering methods to form dissimilar clusters of compounds, and we studied the similarity and the non-multicollinearity among descriptors. Moreover, we divided the database into training and test sets and proposed linear and nonlinear models using the MLR, PLS, MNLR and ANN methods with different descriptors. The proposed models were then compared and evaluated using validation techniques.
Table 2: Activity/property and chemical derivatives used in this study.
Activities/Properties | Compounds | Number of compounds | Ref.
Activity antagonist to NMDA receptor | dibenzo[a,d]cycloalkenimine | 48 | [1]
Antiproliferative towards human monocytes | (1,3-benzothiazol-2-yl) amino-9-(10H)-acridinone | 16 | [2]
Association constant for drug-DNA binding | 9-anilinoacridine | 31 | [3]
Cytotoxic against cancer cell lines (MDAMB-231 and SK-N-SH) | Imidazo[1,2-a]pyrazine | 13 | [4]
Cytotoxic against cancer cell lines (HCF-7 and HepG-2) | Imidazo[1,2-a]pyrazine | 13 | [5]
- MLR gives the most readily interpretable results; it is a transparent approach and also serves for descriptor selection. MNLR (a nonlinear approach) gives better results than MLR. The two methods are complementary: MNLR is an improvement of MLR that adds a second-degree correction (in our studies) to the correlation with the descriptors selected by the MLR.
- Comparing the key statistical terms of the methods used, the ANN has substantially better predictive capability than MLR, MNLR and PLS. The drawback of this method is its poor transparency, so predictions cannot be made easily.
- The most important finding of our research is that we designed and proposed new compounds with higher or lower values of the studied property (the association
constant for drug-DNA binding) than existing compounds [3]. We added suitable substituents and calculated their property using the equations of the proposed models. Consequently, the proposed models will reduce the time and cost of synthesis as well as of the determination of the drug-DNA binding capacity of 9-anilinoacridine derivatives. We have also given a good explanation of the descriptors associated with the studied property.
- The results show that the models proposed in [1] and [3] can predict the studied activity/property accurately, and that the selected parameters are pertinent. The predictive power of the model equations was validated by an internal test (cross-validation) and an external test set. The accuracy and predictability of the proposed models were illustrated by comparing the key statistical terms r or r², r²cv and r²test.
- In the three other studies [2, 4-5], there are two obvious problems. The first is that the number of samples (compounds) is too small, resulting in QSAR/QSPR models with little statistical significance. The second is the lack of diversity among the compound structures. Virtual screening of drugs requires a dataset with diverse compound structures; without it, the mathematical model built is not robust or stable because of chance correlation. Thus, the practical application of these studies to predict the activity is doubtful.
4. Conclusions:
In the present work, we carried out a comparative study of the results we have obtained over the last three years on the use of statistical methods for the quantitative study of the relationship between the structure of various compounds and their activities/properties. Our work has progressed along with the development of our skills and working methods. Other works are still in progress, in which we have taken these mistakes into account.
A simple QSAR/QSPR model can hardly give good guidance for drug screening and design or for experiment. For the successful application of the developed models to the prediction of new compounds, rigorous validation must be used. Finally, we recommend that current and future modelers follow the Organization for Economic Cooperation and Development (OECD) Principles for the Validation of QSAR/QSPR models when developing models.
References:
[1] S. Chtita, M. Larif, M. Ghamali, M. Bouachrine and T. Lakhlifi, Quantitative structure–activity relationship studies of di-benzo[a,d]cycloalkenimine derivatives for non-competitive antagonists of N-methyl-d-aspartate based on density functional theory with electronic and topological descriptors, J. of Taibah Univ. for Sci. 9 (2015) 143-154. http://dx.doi.org/10.1016/j.jtusci.2014.10.006
[2] S. Chtita, M. Larif, M. Ghamali, M. Bouachrine and T. Lakhlifi, QSAR Studies of Toxicity Towards Monocytes with (1,3-benzothiazol-2-yl) amino-9-(10H)-acridinone Derivatives Using Electronic Descriptors, Orbital: Electron. J. Chem. 7 (2) (2015) 176-184. http://dx.doi.org/10.17807/orbital.v7i2.677
[3] S. Chtita, R. Hmamouchi, M. Larif, M. Ghamali, M. Bouachrine and T. Lakhlifi, QSPR studies of 9-aniliioacridine derivatives for their DNA drug binding properties based on density functional theory using statistical methods: Model, validation and influencing factors, J. of Taibah Univ. for Sci. (2016), In Press. http://dx.doi.org/10.1016/j.jtusci.2015.04.007
[4] S. Chtita, M. Ghamali, M. Larif, A. Adad, H. Rachid, M. Bouachrine and T. Lakhlifi, Studies of two different cancer cell lines activities (MDAMB-231 and SK-N-SH) of imidazo[1,2-a]pyrazine derivatives by combining DFT and QSAR results, International Journal of Innovative Research in Science, Engineering and Technology 2 (11) (2013) 6586-6601.
[5] S. Chtita, M. Ghamali, M. Larif, A. Adad, H. Rachid, M. Bouachrine and T. Lakhlifi, Prediction of biological activity of imidazo[1,2-a]pyrazine derivatives by combining DFT and QSAR results, International Journal of Innovative Research in Science, Engineering and Technology 2 (12) (2013) 7951-7962.
[6] J.G. Topliss and R.P. Edwards, Chance factors in studies of quantitative structure-activity relationships, Journal of Medicinal Chemistry 22 (10) (1979) 1238-1244.
[7] J. Han, M. Kamber and J. Pei, Data Mining: Concepts and Techniques, 3rd edition (Chapter 8: Classification: Basic Concepts), Morgan Kaufmann Publishers (July 2011).
[8] F.R. Burden, M.G. Ford, D.C. Whitley and D.A. Winkler, Use of automatic relevance determination in QSAR studies using Bayesian neural networks, J. Chem. Inf. Comput. Sci. 40 (2000) 1423-1430. http://dx.doi.org/10.1021/ci000450a
[9] K. Roy, S. Kar and R. Narayan Das, Understanding the Basics of QSAR for Applications in Pharmaceutical Sciences and Risk Assessment, Chapter 6: Selected Statistical Methods in QSAR, Academic Press, Boston (2015) 191-229. http://dx.doi.org/10.1016/B978-0-12-801505-6.00006-5
[10] T. Puzyn, J. Leszczynski and M.T. Cronin, Recent Advances in QSAR Studies: Methods and Applications, Part I: Theory of QSAR, Challenges and Advances in Computational Chemistry and Physics (2010) 1-217. http://dx.doi.org/10.1007/978-1-4020-9783-6
[11] P. Refaeilzadeh, L. Tang and H. Liu, Cross Validation, in: M. Tamer Özsu and Ling Liu (Eds.), Encyclopedia of Database Systems, Springer (2009).
[12] S. Ekins, G. Bravi, S. Binkley, J.S. Gillespie, B.J. Ring, J.H. Wikel, et al., Drug Metab. Dispos. 28 (2000) 994.