
Nonlinear SVM Approaches to QSPR/QSAR Studies and Drug Design

Jean-Pierre Doucet*, Florent Barbault, Hairong Xia, Annick Panaye and Botao Fan

ITODYS-CNRS UMR7086, Université Paris 7-Denis Diderot, 1, rue Guy de la Brosse, 75005 Paris, France

Abstract: Recently, a promising new nonlinear method, the support vector machine (SVM), was proposed by Vapnik. It rapidly found numerous applications in chemistry, biochemistry and pharmacochemistry. Several attempts to use SVM in drug design have been reported, and it has become an attractive nonlinear approach in this field. In this review, the theoretical basis of SVM in classification and regression is briefly described. Its applications in QSPR/QSAR studies, and particularly in drug design, are discussed. Comparative studies with some linear and other nonlinear methods show SVM's high performance in both classification and correlation.

Keywords: Support vector machine (SVM), QSPR/QSAR, drug design, classification, correlation.

INTRODUCTION

The QSPR/QSAR approach (quantitative structure-property or structure-activity relationships) has become very useful and widespread for the prediction of physico-chemical or biological properties, particularly in drug design. This approach is based on the assumption that variations in the properties of compounds can be correlated with changes in their molecular features, characterized by so-called "molecular descriptors". Recently, a powerful classification and regression tool, the Support Vector Machine (SVM), was proposed [1, 2] and became increasingly popular in various machine learning applications [3]. This review briefly summarizes the basic principles of SVM and presents some applications in various fields of physical and biomedical chemistry.

Fig. (1). (A) In 2D, various lines could be drawn to separate disks from circles; (B) Determination of the Maximum Margin Hyperplane, which may be thought of as giving the most robust classifier for unknowns. Circled points correspond to support vectors.
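The idea behind Fig. (1) can be checked numerically on a toy problem. The sketch below is only an illustration: the four points, their labels and the candidate lines are invented for the purpose, and the function name geometric_margin is hypothetical. It computes, for each candidate line w·x + b = 0, the smallest signed distance of a training point to the line, and keeps the candidate with the largest such margin.

```python
from math import sqrt

def geometric_margin(w, b, points, labels):
    """Smallest signed distance y_i * (w.x_i + b) / ||w|| over the training set.

    The value is positive only if the line w.x + b = 0 separates the two classes.
    """
    norm = sqrt(sum(wj * wj for wj in w))
    return min(
        y * (sum(wj * xj for wj, xj in zip(w, x)) + b) / norm
        for x, y in zip(points, labels)
    )

# Invented toy data: class -1 on the left, class +1 on the right.
points = [(0.0, 0.0), (0.0, 1.0), (2.0, 0.0), (2.0, 1.0)]
labels = [-1, -1, 1, 1]

# Several lines that all separate the two classes (as in panel A).
candidates = {
    "x = 0.5": ((1.0, 0.0), -0.5),
    "x = 1.0": ((1.0, 0.0), -1.0),
    "x = 1.5": ((1.0, 0.0), -1.5),
    "tilted": ((1.0, 1.0), -1.5),
}

margins = {
    name: geometric_margin(w, b, points, labels)
    for name, (w, b) in candidates.items()
}
best = max(margins, key=margins.get)
print(best)  # x = 1.0
```

All four candidates classify the training points correctly, but only the line lying midway between the two classes keeps every point at distance 1; the others pass closer to at least one point and would be less robust for unknowns.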

As stated by Mattioni and Jurs [4], the development of QSAR/QSPR models involves four major steps: structure entry (2D or 3D, possibly with conformational analysis), descriptor generation with objective feature selection (in order to remove unimportant or highly correlated descriptors), model formation, and validation (performance evaluation). Originally, QSAR/QSPR largely used linear models: linear discriminant analysis (LDA), k Nearest Neighbours (k-NN), principal component analysis (PCA)… in classification; multilinear regression (MLR), principal component regression (PCR), partial least squares regression (PLS)… in correlation [5, 6]. Some nonlinear regressions also appeared, to account for nonlinear effects detected in some cases as substituent-substituent interactions, for example in the QFIT (Quantitative Factorisation of Interactions Treatment) approach [7-9]. In such models, nonlinearity is treated by introducing two- or three-fold products of structural parameters [10], and only seldom higher-degree terms.
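The objective feature selection step mentioned above can be sketched as follows. This is a minimal illustration, not the procedure of ref. [4]: the descriptor names, the toy values and both thresholds (min_variance, max_abs_r) are assumptions chosen for the example. Near-constant descriptors are discarded, and a descriptor is dropped when it is highly correlated with one that has already been retained.

```python
from math import sqrt
from statistics import mean, pvariance

def pearson(x, y):
    """Pearson correlation coefficient between two equally long sequences."""
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)

def objective_feature_selection(descriptors, min_variance=1e-6, max_abs_r=0.95):
    """Return the names of descriptors kept after variance and correlation filters.

    descriptors maps a descriptor name to its values over the compound set.
    Both thresholds are illustrative assumptions, not values from the review.
    """
    kept = []
    for name, values in descriptors.items():
        if pvariance(values) < min_variance:
            continue  # (nearly) constant: carries no information
        if any(abs(pearson(values, descriptors[k])) > max_abs_r for k in kept):
            continue  # redundant with a descriptor already retained
        kept.append(name)
    return kept

# Hypothetical descriptor table for five compounds.
toy_descriptors = {
    "logP": [1, 2, 3, 4, 5],
    "molar_refractivity": [2, 1, 4, 3, 5],
    "ring_count": [7, 7, 7, 7, 7],          # constant
    "logP_rescaled": [10, 20, 30, 40, 50],  # exact multiple of logP
    "dipole_moment": [5, 3, 4, 1, 2],
}

selected = objective_feature_selection(toy_descriptors)
print(selected)  # ['logP', 'molar_refractivity', 'dipole_moment']
```

The filter is "objective" in the sense that it uses only the descriptor values, never the property to be predicted.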

\[
Q(w, b, \xi, \xi^{*}, v) = \frac{1}{2}\,\lVert w \rVert^{2} + C \sum_{i=1}^{m} \left( \xi_i + \xi_i^{*} \right)
\]
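An objective of this kind, half the squared norm of w plus C times a sum over the m training samples, can be evaluated directly for a given linear model. The sketch below assumes the regression form with an ε-insensitive loss, in which the summed term for sample i is max(0, |y_i − (w·x_i + b)| − ε); this choice, the function name svr_objective and the toy data are assumptions made for illustration.

```python
def svr_objective(w, b, X, y, C=1.0, epsilon=0.1):
    """Evaluate 0.5 * ||w||^2 + C * sum of epsilon-insensitive slacks.

    For a fixed (w, b), the slack of sample i is the amount by which its
    absolute residual exceeds epsilon (zero inside the epsilon tube).
    """
    flatness_term = 0.5 * sum(wj * wj for wj in w)
    slack_total = 0.0
    for xi, yi in zip(X, y):
        prediction = sum(wj * xj for wj, xj in zip(w, xi)) + b
        slack_total += max(0.0, abs(yi - prediction) - epsilon)
    return flatness_term + C * slack_total

# Invented one-descriptor data set; the last point deviates from y = x by 1.
X = [(0.0,), (1.0,), (2.0,), (3.0,)]
y = [0.0, 1.0, 2.0, 4.0]

q = svr_objective(w=(1.0,), b=0.0, X=X, y=y, C=10.0, epsilon=0.5)
print(q)  # 5.5
```

Here the first term contributes 0.5 and the single point lying outside the tube contributes C × 0.5 = 5.0. A larger C penalizes such deviations more heavily, while a smaller C favors a flatter model.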

The most widely used artificial neural networks (ANN), layered networks working with back-propagation of errors (BPNN), have led to many successful applications. However, they suffer from some limitations: (a) architecture and parameter settings are difficult to fix, and no systematic method exists for doing so; (b) convergence of the algorithm may be slow, with the risk of getting stuck in a local minimum; (c) poor generalization capability may result from overfitting and overtraining, with a lack of robustness (nearly similar networks may give differing results, largely due to random initialization of the weights [15]…).

To overcome these difficulties, other artificial intelligence techniques have been introduced. Among them, SVM (initially proposed for classification problems and later extended to regression applications) has attracted attention owing to its remarkable generalization performance. It has now found extensive applications in pattern recognition and regression, where it seems very promising for nonlinear problems. As stated by Zernov [3], the main advantages of SVM are: (a) results are stable, reproducible, and largely independent of the optimization algorithm; (b) the solution is guaranteed to be optimal, since the quadratic programming approach avoids getting stuck at local minima; (c) a simple geometric interpretation is attainable; (d) any complex classifier can be built with the introduction of kernels for nonlinear decisions; (e) few parameters have to be adjusted: the regularization parameter (C), and the nature and parameters of the kernel function; (f) the result is built on a sparse subset of training samples: SVM avoids the "curse of dimensionality" in high-dimensional problems.

*Address correspondence to this author at the ITODYS-CNRS UMR7086, Université Paris 7-Denis Diderot, 1, rue Guy de la Brosse, 75005 Paris, France; E-mail: [email protected]

SVM has been presented in many publications [1-3, 16-25]. Further information may be found at: http://www.kernel-machines.org and http://www1.cs.columbia.edu/compbio
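Two of the advantages listed earlier, the use of kernels for nonlinear decisions and a solution built on a sparse subset of training samples, can be made concrete with a short sketch. It assumes a Gaussian (RBF) kernel and a decision function of the usual form f(x) = Σ c_i K(s_i, x) + b, where the sum runs over the support vectors s_i only. The support vectors, coefficients, bias and gamma below are hypothetical values chosen by hand; they do not come from a trained model.

```python
from math import exp

def rbf_kernel(x, z, gamma=0.5):
    """Gaussian (RBF) kernel exp(-gamma * ||x - z||^2)."""
    return exp(-gamma * sum((a - b) ** 2 for a, b in zip(x, z)))

def decision_value(x, support_vectors, coefficients, bias, gamma=0.5):
    """Kernel expansion over the support vectors only.

    Each coefficient stands for (Lagrange multiplier * class label) of one
    support vector; training samples that are not support vectors do not
    appear in the sum at all.
    """
    return bias + sum(
        c * rbf_kernel(sv, x, gamma)
        for sv, c in zip(support_vectors, coefficients)
    )

# Hypothetical, hand-picked values (not the result of any optimization).
support_vectors = [(0.0, 0.0), (2.0, 2.0)]
coefficients = [-1.0, 1.0]
bias = 0.0

for query in [(0.1, 0.2), (1.9, 2.1)]:
    value = decision_value(query, support_vectors, coefficients, bias)
    print(query, "class +1" if value > 0 else "class -1")
```

Only the kernel parameter (gamma here), the regularization parameter C used during training, and the choice of kernel need to be set by the user; the cost of a prediction scales with the number of support vectors rather than with the size of the training set.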

Table 1. Summary of Observational Studies in Human Volunteers in which an Aspect of DNA Repair has been Measured and Relationships with Diet/Nutritional Status Investigated

| Reference | Subjects | Study description | Tissue investigated |
|---|---|---|---|
| Wei et al. (1993) [27] | 135 healthy control subjects and 88 patients with basal cell carcinoma (BCC) | Observational study | Peripheral blood lymphocytes |
| Gonzalez et al. (2002) [26] | Group 1: well-nourished, noninfected children (6 children); Group 2: well-nourished, infected* children (6 children); Group 3: malnourished, infected* children (7 children) | Observational study prior to initiation of drug treatment | |
| Wei et al. (2003) [28] | 559 non-Hispanic white adults | Observational study in which folate intake was assessed by a food frequency questionnaire-based method | |
| Pool-Zobel et al. (2004) [36] | 10 male alcoholics plus 6 female and 3 male control subjects | Observational study | |

DNA repair measurement

Study outcome

Host cell reactivation assay for Nucleotide Excision Repair

Significant (P