
ISSN 0012-5008, Doklady Chemistry, 2011, Vol. 437, Part 2, pp. 107–111. © Pleiades Publishing, Ltd., 2011. Original Russian Text © P.V. Karpov, I.I. Baskin, V.A. Palyulin, N.S. Zefirov, 2011, published in Doklady Akademii Nauk, 2011, Vol. 437, No. 5, pp. 642–646.

CHEMISTRY

Virtual Screening Based on One-Class Classification

P. V. Karpov, I. I. Baskin, V. A. Palyulin, and Academician N. S. Zefirov

Received October 26, 2010

DOI: 10.1134/S0012500811040082

Moscow State University, Moscow, 119991 Russia

Accelerating the search for a lead compound is one of the most important problems aimed at optimizing drug design. Although high-throughput screening is being developed, its efficiency remains rather low [1]. Therefore, virtual screening, the rapid in silico evaluation of large electronic libraries of structures of organic compounds, is now being actively advanced. In this work, we propose an essentially new ligand-based virtual screening method using one-class classification.

Similarity searching proceeds from the postulate that similar structures should exhibit similar properties [2]. If, for a given biological activity, there is a set of active structures, then one can construct a virtual screening filter that finds compounds similar to the active structures in electronic databases. The method consists in the pairwise comparison of structures, one of which has a known biological activity while the other is a test structure. The structures of both compounds are represented as binary descriptors, and the Tanimoto coefficient (or another parameter) characterizing the similarity between the two structures is calculated. If the Tanimoto coefficient exceeds 0.85, the structures are considered similar and the test compound is regarded as a potential candidate for a lead compound; otherwise, the test compound is rejected.

The well-known drawback of this approach is a strong dependence of the prediction results on insignificant changes in the metric of the descriptor space. Even a small variation in the weights of descriptors, let alone their mixing, may lead to a considerable change in the set of structures regarded as potential lead candidates, which introduces much subjectivity into the analysis. Recently, other significant disadvantages of this approach have been found. In particular, it is operative only when the structure–activity surface is sufficiently smooth and contains no activity cliffs [3]. This condition is far from being always valid, which necessitates the development of alternative approaches to similarity searching.

A widely used technique for constructing virtual screening filters is statistical modeling, in which a classification model that distinguishes between active and inactive structures is constructed. To correctly create such a model, it is necessary to analyze a large number of both active structures (examples) and inactive structures (counterexamples). Unfortunately, the literature provides only active compounds, whereas inactive compounds are often absent. Therefore, researchers artificially assemble sets of inactive structures, assuming that a structure is inactive if it is not known to be active [4, 5]. Obviously, such an approach has a number of disadvantages. First, the compounds selected as counterexamples may well turn out to be active. Second, the model quality depends essentially on the selection of the set of counterexamples and their chemical diversity. Third, the set of counterexamples should involve all conceivable inactive compounds, which are impossible to encompass.

These disadvantages are absent from one-class classification methods because (1) owing to the presence of a training phase, these methods can be adjusted to the existing metric, so the prediction results are more adequate and much less dependent on variations in the metric of the descriptor space; (2) these methods impose no such strict requirements on the smoothness of the structure–activity surface; and (3) no counterexamples are necessary for applying them. Moreover, unlike the similarity searching methods used in chemoinformatics, one-class classification methods are not heuristic; rather, they are based on the rigorous statistical theory of probability density approximation.
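The conventional Tanimoto filter described above amounts to a few lines of code. The sketch below is illustrative only; the fingerprints are hypothetical toy bit sets, not descriptors from the paper. Binary fingerprints are represented as sets of "on" bit indices, and the usual 0.85 cutoff is applied:

```python
def tanimoto(a: set, b: set) -> float:
    """Tanimoto coefficient of two binary fingerprints given as sets of set-bit indices."""
    union = len(a | b)
    return len(a & b) / union if union else 0.0

def similarity_filter(actives, candidates, threshold=0.85):
    """Keep candidates whose Tanimoto similarity to any active exceeds the threshold."""
    return [c for c in candidates
            if any(tanimoto(c, a) > threshold for a in actives)]

# Toy fingerprints: bit positions that are set.
actives = [{1, 2, 3, 4, 5, 6, 7}]
candidates = [{1, 2, 3, 4, 5, 6, 7, 8},  # 7/8 = 0.875 -> passes
              {1, 2, 3, 9, 10}]          # 3/9 = 0.33  -> rejected
print(similarity_filter(actives, candidates))  # -> [{1, 2, 3, 4, 5, 6, 7, 8}]
```

The hard 0.85 cutoff illustrates the brittleness criticized in the text: reweighting a few bits can move borderline candidates across the threshold.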
Among one-class classification methods are the support vector data description (SVDD) method, probabilistic methods (the Parzen window), methods using artificial neural networks with special architectures, etc. The SVDD method is the simplest. In this work, as a one-class classification method, we used the 1-SVM method [6] implemented in LIBSVM, a library for support vector machines [7]. With Gaussian kernels, the two methods, SVDD and 1-SVM, are mathematically equivalent.

Fig. 1. Illustration of (a) one- and (b) two-class classification methods. The thin lines restrict the positions of the points representing samples of both classes; the heavy line separates these points.

The main idea of one-class classification methods is to construct a model using only a class of examples (without a class of counterexamples). Figure 1 compares one- and two-class classification methods. Figure 1a presents a geometric interpretation of a one-class classifier. Small circles lying on the large circle of radius Ropt represent support vectors of the 1-SVM model. If a structure is represented by a small circle lying within the large circle, i.e., if the distance R2 from the center of the small circle to the center of the large circle is smaller than the optimal radius Ropt, then this structure is likely to be active. If a structure is represented by a small circle lying outside the large circle, i.e., R1 > Ropt, then the classifier categorizes this structure as inactive. Figure 1b illustrates two-class classification; the circles and boxes represent examples of different classes.

Two-class classification methods involve the construction of a separating open hypersurface in the descriptor space and the determination of the side of this hypersurface on which the point representing a test structure is located. Conversely, one-class classification methods involve the construction of a closed hypersurface that divides the entire descriptor space into two regions, outer and inner, such that the inner region has the minimum possible volume but simultaneously includes the maximum possible number of examples belonging to the target class.

Thus, for one-class classification, the closed hypersurface separates the inner region, with a high density of points, from the outer region, with a low density of points. The farther into the outer region the point representing the test compound lies from the separating hypersurface, the lower the probability that the compound exhibits the modeled activity. Conversely, if the point representing the test compound is in the inner region, then it is highly probable that using this compound as a lead compound is promising. In the SVDD method, the separating hypersurface is a hypersphere, and the returned result is the value of f(x) = Ropt – R.

The construction of a hypersphere in the initial descriptor space may not give an operative model because of the overly complex shape of the spatial region containing the points. Therefore, the dimension is often artificially increased by introducing special functions, kernels. It is believed that, in a higher-dimensional space, it is easier to find a separating surface. The Gaussian function, as the most universal one, is typically used as a kernel.

In QSAR/QSPR modeling, the structure of an organic compound is represented as a set of descriptors, numbers characterizing the structures of compounds. Because virtual screening databases contain a large number of compounds, descriptors should be calculated as rapidly as possible. Fragment descriptors meet this requirement. We have previously developed the FRAGMENT block with a hierarchical classification of atoms [8], which performs very well in QSAR/QSPR modeling of various properties. Nonetheless, the redundancy of descriptors prevents this block from being used for solving virtual screening problems, where the calculation speed is of prime importance. Since the classification of non-hydrogen atoms built into the block ensures the efficient coding of structural information, we attempted to combine the classification scheme of the FRAGMENT block with Carhart descriptors [9], which are widely used for solving virtual screening problems, in order to accelerate calculations without losing prognostic properties.
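The paper used the 1-SVM implementation in LIBSVM; as an illustrative sketch, the same idea can be reproduced with scikit-learn's OneClassSVM, which wraps LIBSVM. The data here are synthetic binary vectors standing in for real fragment descriptors, and the parameter values are arbitrary; the sign of decision_function plays the role of f(x) = Ropt – R:

```python
import numpy as np
from sklearn.svm import OneClassSVM

rng = np.random.default_rng(0)

# Synthetic "actives": 200 binary descriptor vectors that mostly agree
# with a common prototype pattern (a stand-in for real fragment descriptors).
prototype = rng.integers(0, 2, size=50)
actives = np.array([np.where(rng.random(50) < 0.9, prototype, 1 - prototype)
                    for _ in range(200)])

# nu bounds the fraction of training actives left outside the boundary;
# gamma is the width parameter of the Gaussian (RBF) kernel.
model = OneClassSVM(kernel="rbf", nu=0.05, gamma=0.02).fit(actives)

# decision_function > 0 corresponds to the inner region (predicted active),
# playing the role of f(x) = Ropt - R in the SVDD picture.
inside = np.where(rng.random(50) < 0.9, prototype, 1 - prototype).reshape(1, -1)
outside = rng.integers(0, 2, size=(1, 50))  # an unrelated random vector
print(model.decision_function(inside), model.decision_function(outside))
```

A prototype-like vector should score higher than the random one; only examples (no counterexamples) were used in fitting, which is the defining property of the one-class setting.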


DUD modeling results

Biotarget      γ     ν     OT       AUC   TP    TN     FP    FN   Specificity   Sensitivity
ACE           0.05  0.01  –0.0093  0.92   40    1692   105    9      94.2          81.6
ACHE          0.07  0.02  –0.0249  0.95   93    3819    73   14      98.1          86.9
ADA           0.07  0.01  –0.0138  0.99   38     918     9    1      99.0          97.4
ALR2          0.03  0.06  –0.0748  0.71   17     664   331    9      66.7          65.4
AmpC          0.06  0.01  –0.0111  0.86   18     681   105    3      86.6          85.7
AR            0.05  0.05  –0.0912  0.94   72    2644   210    7      92.6          91.1
CDK2          0.05  0.01  –0.0135  0.89   58    2025    49   14      97.6          80.6
COMT          0.20  0.20  –0.2002  0.63    4     413    55    7      88.2          36.4
COX1          0.06  0.01  –0.0112  0.74   17     705   206    8      77.4          68.0
COX2          0.06  0.02  –0.0286  0.97  399   12650   639   27      95.2          93.7
DHFr          0.02  0.01  –0.0412  0.99  405    8280    87    5      99.0          98.8
EGFr          0.02  0.01  –0.0349  0.97  443   14931  1065   32      93.3          93.3
ERagonist     0.05  0.01  –0.0167  0.95   60    2454   116    7      95.5          89.6
ERantagonist  0.03  0.04  –0.0696  0.97   37    1374    74    2      94.4          94.9
FGFr1         0.03  0.02  –0.0339  0.98  115    4444   106    5      97.7          95.8
FXa           0.03  0.01  –0.0151  0.91  123    5498   247   23      95.7          84.2
GART          0.02  0.02  –0.0456  0.98   37     824    55    3      93.7          92.5
GPB           0.03  0.01  –0.2388  0.92   46    1896   244    6      88.6          88.5
GR            0.03  0.02  –0.0492  0.96   71    2689   258    7      91.2          91.0
HIVPR         0.02  0.01  –0.0159  0.94   56    1916   122    6      94.0          90.3
HIVRT         0.06  0.01  –0.0119  0.80   27    1511     8   16      99.5          62.8
HMGA          0.01  0.03  –0.0866  0.92   31    1312   168    4      88.6          88.6
HSP90         0.02  0.04  –0.1028  0.96   35     928    51    2      94.8          94.6
InhA          0.04  0.01  –0.0131  0.92   74    2984   282   12      91.4          86.0
MR            0.02  0.01  –0.0167  0.84   12     518   118    3      81.4          80.0
NA            0.02  0.03  –0.0653  0.94   45    1723   151    4      91.9          91.8
P38           0.02  0.01  –0.0377  0.99  437    8799   342   17      96.3          96.3
PARP          0.05  0.03  –0.0385  0.94   33    1276    75    2      94.4          94.3
PDE5          0.02  0.07  –0.1139  0.93   78    1757   221   10      88.8          88.6
PDGFrb        0.05  0.01  –0.0146  0.97  159    5776   204   11      96.6          93.5
PNP           0.02  0.06  –0.1537  0.98   48     995    41    2      96.0          96.0
PPARgamma     0.02  0.03  –0.0562  0.98   81    3018   109    4      96.5          95.3
PR            0.03  0.01  –0.0168  0.93   23     900   141    4      86.5          85.2
RXRalpha      0.01  0.03  –0.0412  0.99   19     748     2    1      99.7          95.0
SAHH          0.02  0.04  –0.0931  0.96   30    1226   120    3      91.9          90.9
SRC           0.04  0.09  –0.1351  0.98  152    6165   154    7      97.6          95.6
Thrombin      0.02  0.03  –0.0581  0.93   62    2121   335   10      86.4          86.1
TK            0.05  0.01  –0.0129  0.88   18     768   123    4      86.2          81.8
Trypsin       0.02  0.03  –0.0516  0.95   46    1576    88    3      94.7          93.9
VEGFr2        0.04  0.01  –0.0132  0.79   64    2392   514   24      82.3          72.7

Note: ACE—angiotensin-converting enzyme, ACHE—acetylcholinesterase, ADA—adenosine deaminase, ALR2—aldose reductase, AmpC—AmpC β-lactamase, AR—androgen receptor, CDK2—cyclin-dependent kinase, COMT—catechol O-methyltransferase, COX1—cyclooxygenase-1, COX2—cyclooxygenase-2, DHFr—dihydrofolate reductase, EGFr—epidermal growth factor receptor, ERagonist—estrogen receptor agonist, ERantagonist—estrogen receptor antagonist, FGFr1—fibroblast growth factor receptor, FXa—factor Xa, GART—glycinamide ribonucleotide transformylase, GPB—glycogen phosphorylase, GR—glucocorticoid receptor, HIVPR—HIV protease, HIVRT—HIV reverse transcriptase, HMGA—hydroxymethylglutaryl-CoA reductase, HSP90—heat shock protein, InhA—enoyl-ACP reductase, MR—mineralocorticoid receptor, NA—neuraminidase, P38—p38 mitogen-activated protein, PARP—poly(ADP-ribose) polymerase, PDE5—phosphodiesterase type 5, PDGFrb—platelet-derived growth factor receptor kinase, PNP—purine nucleoside phosphorylase, PPARgamma—peroxisome proliferator-activated receptor gamma, PR—progesterone receptor, RXRalpha—retinoic X receptor alpha, SAHH—S-adenosylhomocysteine hydrolase, SRC—Src tyrosine kinase, Thrombin—thrombin, TK—thymidine kinase, Trypsin—trypsin, VEGFr2—vascular endothelial growth factor receptor.
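The Specificity and Sensitivity columns of the table follow directly from the four counts TP, TN, FP, and FN; a quick check against the ACE row:

```python
def sensitivity(tp: int, fn: int) -> float:
    # Percentage of true actives recovered.
    return 100 * tp / (tp + fn)

def specificity(tn: int, fp: int) -> float:
    # Percentage of decoys correctly rejected.
    return 100 * tn / (tn + fp)

# ACE row of the table: TP = 40, TN = 1692, FP = 105, FN = 9.
print(round(sensitivity(40, 9), 1))      # -> 81.6
print(round(specificity(1692, 105), 1))  # -> 94.2
```

Both values match the tabulated 81.6 and 94.2 for ACE.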


[Figure 2: ROC curves; axes are the true positive rate (TPR) vs. the false positive rate (FPR)]
Fig. 2. Examples of ROC curves for the models of (1) ACHE, (2) INHA, and (3) TK. The dashed line represents a random classification.

To construct Carhart descriptors, the structure of an organic compound is divided into small fragments (atoms with attached hydrogen atoms), and then all the possible topological distances between them are found. The descriptors have the form F1–Dist–F2, where F1 and F2 are the above small fragments and Dist is the topological distance between them. The descriptors are usually binary vectors indicating the absence or presence of a given chain fragment in the structure.

The applicability of the one-class classification method was studied using the directory of useful decoys (DUD) [10]. The DUD is a collection of ligands acting on 40 different targets together with structures, called decoys, that resemble the ligands in their physicochemical properties but are inactive. The DUD has recently come to be considered a standard for checking and comparing various docking-based virtual screening methods. The receptors represented in this directory include hydrophobic, polar, cationic, and anionic binding sites; deeply buried hydrophobic pockets; and more open polar regions. This diversity of binding types and structures is important for checking virtual screening methods based on one-class classification.

The main difficulty in using one-class classification is to evaluate the quality of the constructed model. In ordinary two-class classification, models are evaluated by cross-validation, in which all the data for constructing a model are divided into k parts. A model is constructed on all the data except the first part, and predictions are made for that part; then a model is constructed on all the data except the second part, and predictions are made for it; the procedure is repeated for all the parts. When each part contains a single example, this method is called leave-one-out cross-validation (LOOCV). As a result, predictions are obtained for all the structures used for training. The model quality can then be evaluated from the area under the receiver operating characteristic (ROC) curve. ROC curves are used for evaluating the quality of classification models [11]; they show the dependence of the true positive rate on the false positive rate at various threshold values. The larger the area under the ROC curve, the better the constructed model; for a perfect classifier, this area is unity.

The general methodology of model construction is as follows. For each structure from the set of ligands, fragment descriptors were calculated. Then, one-class classification models were constructed by the 1-SVM method, and their quality was evaluated by the LOOCV method. The nonlinear transformation function was the Gaussian function. The algorithm for constructing models requires the optimal determination of two parameters, γ and ν. The kernel parameter γ controls the degree of nonlinear transformation. The parameter ν is the fraction of the vectors that should be used for constructing the model. At ν = 1, the 1-SVM method degenerates into a statistical method, the Parzen window. The search was performed by simultaneously testing the possible values of γ ∈ [0.01, 0.1] and ν ∈ [0.01, 0.1] with a step of 0.01. For the final version of the model, the parameter values at which the area under the ROC curve is maximal were taken. A total of 40 models were constructed, the characteristics of which are given in the table. Each model is characterized by the following parameters: TP (number of true positives), TN (number of true negatives), FP (number of false positives), FN (number of false negatives), AUC (area under the ROC curve), OT (optimal threshold), γ, and ν. Figure 2 shows three ROC curves obtained for the ACHE, INHA, and TK models.
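A minimal sketch of this grid search is given below, assuming one plausible reading of the protocol (the paper does not spell out how decoys enter the AUC calculation): actives are scored by leave-one-out 1-SVM models, decoys by a model fit on all actives, and the (γ, ν) pair with the highest ROC AUC is kept. The data are toy binary vectors, not DUD descriptors:

```python
import itertools
import numpy as np
from sklearn.metrics import roc_auc_score
from sklearn.svm import OneClassSVM

def loo_scores(actives, gamma, nu):
    """Leave-one-out 1-SVM decision scores for each active structure."""
    scores = []
    for i in range(len(actives)):
        rest = np.delete(actives, i, axis=0)
        m = OneClassSVM(kernel="rbf", gamma=gamma, nu=nu).fit(rest)
        scores.append(m.decision_function(actives[i:i + 1])[0])
    return np.array(scores)

def grid_search(actives, decoys):
    """Scan gamma, nu over [0.01, 0.1] (step 0.01); keep the pair maximizing ROC AUC."""
    labels = np.r_[np.ones(len(actives)), np.zeros(len(decoys))]
    grid = [round(0.01 * k, 2) for k in range(1, 11)]
    best = (-1.0, None, None)
    for gamma, nu in itertools.product(grid, grid):
        decoy_scores = (OneClassSVM(kernel="rbf", gamma=gamma, nu=nu)
                        .fit(actives).decision_function(decoys))
        auc = roc_auc_score(labels, np.r_[loo_scores(actives, gamma, nu), decoy_scores])
        if auc > best[0]:
            best = (auc, gamma, nu)
    return best

# Toy data: 15 "ligands" near a common binary pattern, 30 random "decoys".
rng = np.random.default_rng(1)
proto = rng.integers(0, 2, size=30)
actives = np.array([np.where(rng.random(30) < 0.9, proto, 1 - proto)
                    for _ in range(15)])
decoys = rng.integers(0, 2, size=(30, 30)).astype(float)
auc, gamma, nu = grid_search(actives, decoys)
print(f"best AUC = {auc:.2f} at gamma = {gamma}, nu = {nu}")
```

On such clearly separable toy data, the best AUC is close to 1; on real descriptor sets the LOO loop dominates the cost, so caching kernel matrices would be the first optimization.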
The data in the table demonstrate that the one-class classification method works well with the DUD data sets. The minimum area under the ROC curve is 0.63; for most of the models, this parameter exceeds 0.8. The high values of the sensitivity and specificity directly attest to the efficiency of the proposed method.

Thus, we have developed a method that is an alternative to the conventional approach to virtual screening involving similarity searching with the Tanimoto coefficient. In the proposed approach, the similarity between a given compound and a target class of objects (chemical compounds) is measured not by the geometric distances between points but by the probability density of the distribution of objects of the given class at a given point of the space. The higher the probability density of the distribution of such objects, the more similar to the objects of this class the object at this point is considered. Thereby, we have actually proposed to reconsider the concept of similarity of chemical compounds on the basis of the density of their distribution in the descriptor space.

REFERENCES

1. Chemoinformatics Approaches to Virtual Screening, Varnek, A. and Tropsha, A., Eds., Rhodes: RSC, 2008.
2. Skvortsova, M.I., Stankevich, I.V., Palyulin, V.A., and Zefirov, N.S., Usp. Khim., 2006, vol. 75, pp. 1074–1093.
3. Guha, R. and Van Drie, J.H., J. Chem. Inf. Model., 2008, vol. 48, pp. 646–658.


4. Bruno-Blanch, L. and Gálvez, R.G.D., Bioorg. Med. Chem. Lett., 2003, vol. 13, pp. 2749–2754.
5. Rodgers, S., J. Chem. Inf. Model., 2006, vol. 46, pp. 569–576.
6. Schölkopf, B., Platt, J.C., Shawe-Taylor, J., et al., Neural Comput., 2001, vol. 13, pp. 1443–1471.
7. Chang, Chih-Chung and Lin, Chih-Jen, LIBSVM: A Library for Support Vector Machines, 2001.
8. Artemenko, N.V., Baskin, I.I., Palyulin, V.A., and Zefirov, N.S., Dokl. Chem., 2001, vol. 381, pp. 317–320.
9. Carhart, R.E., Smith, D.H., and Venkataraghavan, R., J. Chem. Inf. Comput. Sci., 1985, vol. 25, pp. 64–73.
10. Huang, N., Shoichet, B.K., and Irwin, J.J., J. Med. Chem., 2006, vol. 49, pp. 6789–6801.
11. Fawcett, T., Pattern Recogn. Lett., 2006, vol. 27, pp. 861–874.
