In Silico Prediction of Cytochrome P450-Drug Interaction - MDPI

International Journal of

Molecular Sciences Article

In Silico Prediction of Cytochrome P450-Drug Interaction: QSARs for CYP3A4 and CYP2C9 Serena Nembri † , Francesca Grisoni † , Viviana Consonni and Roberto Todeschini * Department of Earth and Environmental Sciences, University of Milano-Bicocca, P.za della Scienza 1, 20126 Milano, Italy; [email protected] (S.N.); [email protected] (F.G.); [email protected] (V.C.) * Correspondence: [email protected]; Tel.: +39-02-6448-2820 † These authors contributed equally to the work. Academic Editor: Jesus Vicente De Julián Ortiz Received: 16 May 2016; Accepted: 6 June 2016; Published: 9 June 2016

Abstract: Cytochromes P450 (CYP) are the main actors in the oxidation of xenobiotics and play a crucial role in drug safety, persistence, bioactivation, and drug-drug/food-drug interaction. This work aims to develop Quantitative Structure-Activity Relationship (QSAR) models to predict the drug interaction with two of the most important CYP isoforms, namely 2C9 and 3A4. The presented models are calibrated on 9122 drug-like compounds, using three different modelling approaches and two types of molecular description (classical molecular descriptors and binary fingerprints). For each isoform, three classification models are presented, based on a different approach and with different advantages: (1) a very simple and interpretable classification tree; (2) a local (k-Nearest Neighbor) model based classical descriptors and; (3) a model based on a recently proposed local classifier (N-Nearest Neighbor) on binary fingerprints. The salient features of the work are (1) the thorough model validation and the applicability domain assessment; (2) the descriptor interpretation, which highlighted the crucial aspects of P450-drug interaction; and (3) the consensus aggregation of models, which largely increased the prediction accuracy. Keywords: cytochrome P450; QSAR; CYP2C9; CYP3A4; in silico; ADMET

1. Introduction Cytochromes P450 (CYP) are a family of monooxygenase enzymes known for their crucial role in the metabolism of xenobiotics, as they are involved in the oxidation of the majority of compounds [1]. Despite the fact that the human genome encodes up to 57 different CYP genes, only six isoforms are mainly interested in drug metabolism [2,3]. Two of them account for 43% of the metabolism of known drugs: (1) the CYP3A4 isoform, which interacts with more than a half of all clinically used drugs, e.g., large and lipophilic molecules and; (2) the CYP2C9 isoform, which mainly metabolizes NSAIDs (Non-Steroidal Anti-Inflammatory Drugs) and weakly acidic molecules with a hydrogen bond acceptor [4]. These isoforms were the targets of the present work. The evaluation of the interaction of CYP with chemicals constitutes a fundamental step for drug discovery/design, as well as for toxicity assessment [2,5–7]. For this reason, Cytochrome P450 isozymes have been the target of many modelling studies [8,9]. In this context, relevant contributions to the field have been given by Quantitative Structure-Activity Relationship (QSAR) studies, which link the molecular structure, encoded within the so-called molecular descriptors [10], to a biological activity of interest through a mathematical/statistical approach. Throughout the years, many QSAR models have been proposed [11], based on different statistical techniques, such as Support Vector Machine (SVM) [12–15], k-Nearest Neighbors (k-NN) [16] and Self Organizing Maps (SOM) [17]. However, the majority of proposed QSAR models have some limitations that can often hamper the applicability and Int. J. Mol. Sci. 2016, 17, 914; doi:10.3390/ijms17060914

www.mdpi.com/journal/ijms

Int. J. Mol. Sci. 2016, 17, 914

2 of 19

reliability of their predictions, especially for ADMET (adsorption, distribution, metabolism, excretion and toxicity) applications. In particular, the most notable drawbacks are (1) a limited applicability due to the small datasets they are often trained on or to the lack of an applicability domain assessment; and (2) the lack of biochemical interpretability due to the use of very complex modelling approaches and molecular descriptors. The present work stems from these considerations and targets the development of novel, simple and easily interpretable QSAR systems to screen the drug binding to 3A4 and 2C9 isoform of CYP, which account for the metabolism of 30% and 13% of known drugs, respectively. Models were developed on High-Throughput Screening data and special attention was paid to model validation, applicability domain assessment and interpretation of the selected molecular descriptors. 2. Results and Discussion 2.1. Modelling Approach The starting point of this study was the database of potency values for 17,143 drug-like compounds for five CYP isoforms developed by Veith et al. [18], determined through quantitative High-Throughput Screening with a bioluminescent assay, which recognizes both inhibitors and substrates. The database was retrieved from PubChem (AID: 1851) [19]. Our modelling approach followed six logical steps: 1.

2.

3.

Data curation and splitting. CYP3A4 and CYP2C9 isoforms were curated separately by removing mismatching duplicates, inconsistent records and disconnected structures. The molecules were divided in (a) a Shared set, comprising the 9122 compounds with annotated activity for both the isoforms (Figure 1) and (b) an External set, having activity data for one isoform molecules (2996 and 2818 for CYP3A4 and CYP2C9, respectively). The Shared set molecules were randomly split into a training (70%, 6385 compounds) and a test set (30%, 2737 compounds), keeping the active/inactive proportion of both the isoforms (49:100 and 66:100 for 2C9 and 3A4, respectively). The training set served to select the variables, calibrate the models and perform the cross-validation (five-fold). The test set was used only in a later stage to validate the final pool of selected models. The external sets were used in the final stage to further validate the best models. Molecular description. To allow for the mathematical treatment of molecules, they were described using the so-called molecular descriptors [10], that is, numbers encoding for the presence of particular structural features, fragments or chemical properties. Two types of descriptors were calculated: (a) 3763 classical Dragon 6 [20] molecular descriptors (MDs) from 0-dimensional to 2-dimensional molecular representation, from which only a set of 1472 non redundant MDs was finally retained (see Materials and Methods); and (b) two types of binary fingerprints (FPs), that is, the extended connectivity (ECFP) [21] and the path fingerprints (PFP) [22], which are 1024 bit strings encoding the presence of particular fragments/substructures of molecules. Three-dimensional descriptors were not considered, as in a preliminary phase they did not lead to an improvement in the predictions. Variable selection and modelling. The Genetic Algorithms (GA) [23], a benchmark variable selection method characterized by an optimal trade-off between computational time and exploration/exploitation ability [24], were used to retain the most relevant subsets of variables. A refined two-step GA procedure (see Materials and Methods) was applied on the training set descriptors in combination with six classification techniques: (a) Classification and Regression Trees (CART) [25]; (b) k-Nearest Neighbor (k-NN) [26]; (c) N-Nearest Neighbors (N3) [27]; (d) Binned Nearest Neighbors (BNN) [27]; (e) Linear Discriminant Analysis (LDA) [28]; and (f) Partial Least Squares Analysis (PLSDA) [29]. Note that for FPs, no variable selection was performed, as they, unlike MDs, give a description of the molecule when used as a whole. To model FPs, only similarity-based classifiers (k-NN, N3 and BNN) were used. On both the isoforms, the best results were obtained using: (1) CART [25], based on binary splits of the data using one variable at time according to its optimized threshold values; (2) k-NN [26], in which

Int. J. Mol. Sci. 2016, 17, 914

3 of 19

Int. J. Mol. Sci. 2016, 17, 914

3 of 19

4.

4.

5.

5.

6. 6.

every molecule is classified according to the majority vote of its k more similar objects [14]; and (3) N3molecule [27], which uses all the availabletomolecules as neighbors through an optimized α every is classified according the majority vote of itsand, k more similar objects [14]; exponent, their contribution decreasing with decreasing theirand, similarity to the object. and (3) N3tunes [27], which uses all the as available molecules as neighbors through an new optimized The model parameters (number of objects per leaf, k and α) were optimized in cross-validation as α exponent, tunes their contribution as decreasing with decreasing their similarity to the those giving the best classification performance. new object. The model parameters (number of objects per leaf, k and α) were optimized in Model selection and validation. theclassification pool of calculated models, the final models were chosen cross-validation as those giving From the best performance. as the best compromise between classification performance five-fold (the Model selection and validation. From the pool of calculatedinmodels, thecross-validation final models were higher as thethe better) and number between of variables (the smaller the better).inModels with interpretable chosen best compromise classification performance five-fold cross-validation descriptors, if relevant, were preferred. (the higher the better) and number of variables (the smaller the better). Models with Applicabilitydescriptors, Domain Assessment. selected models were evaluated for their chemical space interpretable if relevant,The were preferred. of prediction reliability (Applicability Domain, AD). The were AD assessment strongly depends on the Applicability Domain Assessment. The selected models evaluated for their chemical space nature of the modelling and the characteristics of AD the dataset [30], thus, it was calibrated of prediction reliability approach (Applicability Domain, AD). The assessment strongly depends on it on a case-by-case basis, and rationalized to the modeling approach Materials the nature of the modelling approach and according the characteristics of the dataset [30],(see thus, it was and Methods). calibrated it on a case-by-case basis, and rationalized according to the modeling approach External validation. Models were selected according to the cross-validation results and the (see Materials and Methods). best models were screened theirselected performance on the Finally, for each isoform, External validation. Modelson were according to test the set. cross-validation results and the external set molecules were used in order to test their andFinally, predictivity towards real best models were screened on their performance onrobustness the test set. for each isoform, unknown the externaldata. set molecules were used in order to test their robustness and predictivity towards real unknown data. CYP3A4

CYP2C9

External Set (2996)

Training Set (6385)

Shared Set (9122)

External Set (2818)

Test Set (2737)

Figure 1. Scheme of the data splitting. Figure 1. Scheme of the data splitting.

The model performance in recognizing active/inactive compounds was calculated through the The model performance recognizing active/inactive compounds was calculated the Sensitivity (Sn), Specificity (Sp)inand Non-Error Rate (NER), defined for two-class problemsthrough as follows: Sensitivity (Sn), Specificity (Sp) and Non-Error Rate (NER), defined for two-class problems as follows: TP Sn = TP FN Sn “ TP+TP ` FN TN Sp “ TN (1) Sp = TN `FP (1) TN + FP Sn`Sp NER “ 2 Sn + Sp NER = positives, true negatives, false positives and false where TP, TN, FP and FN are the number of true 2 negatives of each class, respectively. Sn, Sp and NER were calculated in fitting, cross-validation, and where TN, FP and FN are the number of true positives, true negatives, false positives and false on the TP, test/external sets. negatives of each class, respectively. Sn, Sp and NER were calculated in fitting, cross-validation, and on the test/external sets.

Int. J. Mol. Sci. 2016, 17, 914

4 of 19

2.2. Quantitative Structure-Activity Relationship (QSAR) Models 2.2.1. Isoform 3A4 The proposed QSAR models for 3A4 are collected in Table 1. For all the models, a similar performance on the training and test sets can be noted, indicating the robustness and reliability of the predictions towards unknown data. The CART model, which is based on three very simple molecular descriptors, showed a very good balance between Sn and Sp. The k-NN model has a slightly better prediction ability, especially for the inactive compounds (higher Sp). If the model is restricted to its applicability domain (AD), more balanced predictions (in terms of Sn and Sp) are obtained with the same NER value. Finally, the N3 model (based on ECFPs) is characterized by high Sn values, that is, it identifies well the active compounds. Table 1. Model statistics for CYP3A4 isoform. Models are described according to the method and type of descriptors, the Applicability Domain (AD: yes/no (y/n)), number of variables (p) and classification parameters (parameter: object/leaf ratio for Classification and Regression Trees (CART), k for k-Nearest Neighbours (k-NN) and α for N-Nearest Neighbors (N3)). For each model, the Non-Error Rate (NER), the Sensitivity (Sn) and the Specificity (Sp) are reported in Fitting, Cross-Validation and on the test set. %out indicates the percentage of test set compounds outside of the AD. MD: molecular descriptors; ECFP: extended connectivity fingerprints. Model

Descriptors AD

Fitting

Cross-Validation

Test Set

p

Parameter NER

Sn

Sp

NER

Sn

Sp

%out

NER

Sn

Sp

3 3

210 210

0.74 0.74

0.74 0.74

0.75 0.75

0.74 0.74

0.73 0.73

0.75 0.75

0

0.75 0.75

0.74 0.74

0.76 0.76

CART

MD

y n

k-NN

MD

y n

6 6

14 14

0.76 0.76

0.73 0.73

0.79 0.79

0.76 0.76

0.73 0.73

0.78 0.78

5

0.77 0.77

0.75 0.76

0.79 0.78

N3

ECFP

y n

1024 1024

1 1

0.79 0.79

0.88 0.88

0.71 0.71

0.79 0.79

0.87 0.87

0.70 0.70

1

0.78 0.78

0.86 0.86

0.71 0.71

The selected MDs are briefly presented in Table 2. In Figure 2a, their role in the CART classification is depicted. To visualize the effect of the descriptors on the k-NN model, a Principal Component Analysis (PCA) [31] was performed (Figure 2b,c). PCA is a well-known data visualization technique that allows to observe the objects in a few new variables (Principal Components, PC) space, according to their coordinates (scores, Figure 2b). The contribution of the variables in the PC space is given by the loadings (Figure 2c), the higher they are (in absolute value) the larger their role. PC1 and PC2 explain 2/3 of data variance. Information about PC3, which explains 18% of data variance, is reported in the Supplementary Material (Figure S1). PC3 explains 14% of data variance, and it leads to considerations similar to those derived from PC1/PC2. The PCA loadings (Figure 2c) allow for an understanding of the contribution of the molecular descriptors: the higher their PC1 loadings, the higher their value for inactive molecules and vice versa for active compounds. Active compounds group on the left side of the score plot (negative PC1 scores), while the inactive molecules distribute on the right side (positive PC1 scores). At the same time, active compounds have relatively lower PC2 values, indicating that high PC2 values relate to inactivity. The descriptors with high loadings on PC1 (e.g., nROH) and PC2 (e.g., ATSC4p and ATSC6i) influence at most the observed class separation. With the aid of the PCA and the CART classification scheme, a brief description and interpretation of the descriptors is given below. 1.

The descriptor nROH represents the number of hydroxyl groups. It was independently selected in both the models based on MDs, underscoring its relevance in modelling the CYP3A4 activity. In particular, in both cases (Figure 2), high nROH values tend to correspond to inactive molecules. An increasing number of hydroxyl group generally increases the hydrophilicity, while molecules with no (or a few of) hydroxyl groups have a lipophilic nature. Molecular lipophilicity is known to facilitate the adsorption and to limit the excretion of the compounds. As a result, lipophilic

Int. J. Mol. Sci. 2016, 17, 914

2.

3.

4.

5.

6.

5 of 19

molecules are oxidated by CYP and converted into more hydrophilic compounds that can be easily eliminated [32]. nBM and nBO are the number of multiple bonds and bonds in the H-depleted molecular structure, respectively, which were selected within the CART model (Figure 2c). The descriptor nBM represents a measure of the unsaturation level and, therefore, gives information about the molecular interaction ability with CYP. In addition, this descriptor contains information about molecular size, flexibility, presence of heteroatoms (Table 2). In particular, small and flexible molecules with a few of the multiple bonds (nBM < 13, Figure 2c) tend to be inactive. The relevance of heteroatoms can be related to the P450-mediated oxygen addition to nitrogen, sulfur, phosphorus, and iodine atoms [33–36]. The descriptor nBO mainly encodes the information about molecular size. Molecules with no hydroxyl groups, less than 16 multiple bonds and with less than 25 non-hydrogen bonds tend to be inactive. These molecules probably have a small effective dimension within the receptor pocket, as they are either small or relatively small, but very flexible. In this branch also lie some hydrophilic compounds classified as inactive. They probably do not interact with CYP as they are easily dissolved in the aqueous body fluid and, consequently, are excreted from the body [37]. C% and NaasC are the percentage of C atoms and number of aromatic carbons bonded with non-H atoms, respectively (Table 2). They have a crucial role in identifying the active compounds of k-NN, as shown by their very high loadings on PC1. This means that relative large and/or aromatic molecules are generally classified as actives. The mechanisms of aromatic oxidation have been already elucidated [36,38]. SIC5 is the structural information content of order 5 [39]. It encodes information about atom equivalence and represents a general measure of structural complexity, the higher, the larger SIC5. It has negative PC1/PC2 loadings, suggesting that relatively large, branched and/or polycyclic compounds tend to be active. The effect of dimension on activity was already suggested (e.g., [40]), as well-known CYP3A4 ligands are commonly relatively large molecules. ATSC4p and ATSC6i are the Centred Broto-Moreau autocorrelations [10] of lag 4 and 6, respectively. The former is weighed on the atomic polarizability, while the latter on the atomic ionization potential. They lie very close to each other in the PC space and have positive loadings on PC2, indicating that inactive compounds tend to have high values of both these MDs. They increase when increasing the molecular dimension, the number of heteroatoms and the branching/cyclicity. Because of their weighting scheme type, ATSC4p and ATSC6i increase when increasing the atomic polarizability and the ionization potential, respectively. These features are known for their relevant contribution in receptor binding, especially when charged systems are taken into consideration [41,42].

In order to interpret the ECFP, the relevant molecular fragments (MFs) encoded by ECFP were generated by means of Dragon 7 [22], with the same settings used for the fingerprints calculation. A small set of 19 large fragments (length from 4 to 6) was retained from the original pool by: (a) removing those that were not selected for at least 100 molecules; and (b) considering only those that were characterized by a difference of frequency between the classes greater than or equal to 1%. The fragments are depicted in Figure 3. Only the cyclohexanediol derivate fragment (No. 1) occurs more frequently within the inactive compounds. All the other MFs have a higher frequency for the active molecules. In particular, the MFs quinazolinamine-like (No. 4 and 5) and benzamide-like (No. 14) MFs are mostly present within active compounds (frequency ranging from 3.6% to 5.1%), and their frequency value within the inactive molecules is lower than 0.5%. The butenamide- (No. 2) and fluorobenzene (No. 9)-derived MFs are less useful for the discrimination between the classes, since they have the lowest difference in the frequency. The remaining fragments are high for both the classes.

Int. J. Mol. Sci. 2016, 17, 914

6 of 19

Int. J. Mol. Sci. 2016, 17, 914

6 of 19

Int. J.J. Mol. Sci. 17, Mol.2016, Sci. 2016, 2016, 17, 914 914 Int. J. Int. Mol. Sci. 17, 914 Mol.2016, Sci. 2016, 17, 914 Int. J. Int. Mol.J. Sci. 17, 914

66 of of 19 19 6 of 19 6 of 19 6 of 19

Int. J. Mol. Sci. 2016, 17, 914

6 of 19

Figure2.2.Representation Representation ofmolecular the molecular descriptor (MD)-based CYP3A4: (a) Figure of the descriptor (MD)-based models formodels CYP3A4:for (a) Classification Classification and Regression (CART) model;plot (b) of Score of themolecules training molecules described and Regression Trees (CART)Trees model; (b) Score theplot training described by the by the k-Nearest Neighbours (k-NN) descriptors, coloured according to their activity; (c) Loading k-Nearest Neighbours descriptors, coloured according to(MD)-based their activity; (c) for Loading plot(a)of plot the Figure 2. Representation of molecular descriptor models CYP3A4: Figure 2.(k-NN) Representation of the the molecular descriptor (MD)-based models for CYP3A4: Figure 2. Representation of the molecular descriptor (MD)-based models for CYP3A4: (a) (a) Figure 2. Representation ofTrees the molecular descriptor (MD)-based models for CYP3A4: Figure 2. Representation of the molecular descriptor (MD)-based models formolecules CYP3A4: (a) (a) of the k-NN descriptors. Classification and Regression (CART) model; (b) Score plot of the training described k-NN descriptors. Classification and Regression (CART) model; (b) Score plot the training molecules described Classification and Regression TreesTrees (CART) model; (b) Score plot of theoftraining molecules described

Classification and Regression Trees (CART) model; (b) Score plot the training molecules described Classification and Regression Trees (CART) model; (b) Score plot of theoftraining molecules described by the Neighbours (k-NN) descriptors, coloured according to their activity; (c) Figure 2. k-Nearest Representation of the molecular descriptor (MD)-based models for(c)CYP3A4: (a) plot byk-Nearest the k-Nearest Neighbours (k-NN) descriptors, coloured according to their activity; (c) Loading Loading plot by the Neighbours (k-NN) descriptors, coloured according to their activity; Loading plot byk-Nearest thek-NN k-Nearest Neighbours (k-NN) descriptors, coloured according to their activity; (c) Loading by the Neighbours (k-NN) descriptors, coloured according to their activity; (c) Loading plot plot of the descriptors. Classification and Regression Trees (CART) model; (b) Score plot of the training molecules described ofk-NN the k-NN descriptors. Table 2. List and brief description of the classical molecular descriptors (MDs) selected for CYP3A4. of the descriptors. the k-NN descriptors. of the classical molecular descriptors (MDs) selected for CYP3A4. Table 2. Listofand brief description theofk-NN descriptors. by the k-Nearest Neighbours (k-NN) descriptors, coloured according to their activity; (c) Loading plot Someexamples examplesTable ofmolecules molecules withdescription lowand andof high MDvalues values arealso alsoreported. reported. 2. List and the molecular descriptors (MDs) selected for Some of with low high MD are of theTable k-NN 2.descriptors. List brief and brief brief description ofclassical the classical classical molecular descriptors (MDs) selected for CYP3A4. CYP3A4. Table 2. List and description of theof molecular descriptors (MDs) selected for CYP3A4. Table 2. and List brief andofbrief description the classical molecular descriptors (MDs) selected for CYP3A4. Table 2. List description of the classical molecular descriptors (MDs) selected for CYP3A4. Some examples molecules with low and high MD values are also reported. Some examples of molecules with low high and high MD Low values are reported. also reported. SomeSome examples of molecules with low and MD values are also MD Description Reference Model Value High Value examples of molecules low high and high MD values are reported. also reported. Some examples of molecules with with low and MD values are also Table 2. List and brief description of the Model classical molecularLow descriptors (MDs) selected for CYP3A4. MD Description Reference Value High Value MD Description Reference Model Low Value High Value nBM Number of [10] CART 0 66 MD Description Reference Model Low Value High Value MD MD Description Reference Model Low Value High ValueValue Description Reference Model Low Value Some examples of of molecules with low and high MD values also reported. MD Reference Model Loware Value High High Value nBMDescription Number [10] CART 0 66 nBM Number of multiple bonds. nBM nBM Number of of Number

nBM

nBO

nBM Number of bonds. multiple multiple bonds. multiple bonds. MD Description multiple bonds. multiple bonds. Number of multiple nBM Number of bonds. multiple bonds.

nBO

Number of nonnBO nBO Number of nonNumber ofNumber non[10] Number ofbonds. nonnBO nBO of nonhydrogen hydrogen bonds. hydrogen bonds. hydrogen bonds. hydrogen bonds. hydrogen bonds. nBO

nBO

C% C%

Number of non-

[10] [10] [10] [10] Reference [10] [10]

Number of non-

Number of non-hydrogen hydrogen bonds. bonds.

CARTCART CARTCART Model CART CART

00

3 3

[10]

3

[10]

CART

0 0

0 0

[10] CART [10] [10] CARTCART [10] [10] CART CARTCART

CART

C% Percentage of C [10] k-NN Percentage C% C% Percentage of C of C [10] [10] k-NNk-NN Percentage C% C% Percentage of C of C [10] [10] k-NNk-NN atoms. atoms. atoms. atoms. atoms. C% Percentage of C [10] k-NN Percentage k-NN Percentage ofofCC atoms. [10] [10] k-NN atoms.

atoms.

Low Value

High Value

66 66

66

3

333

59 59

66 66

66

59 59 59

59

59

3

59

0 0

00 0

58.3 58.3 58.3 58.3 58.3

00

0

58.3

58.3 58.3

Int. J. Mol. Sci. 2016, 17, 914

7 of 19

Table 1. Cont. Int. J. Mol. Sci.Description 2016, 17, 914 MD Reference SIC5 Structural Int. Int.J. Mol.Sci. Sci.2016, 2016,17, 17,914 914 [39] Int. J.J.Mol. Mol. Sci. 2016, 17, 914 Int. J. Mol. Sci. 2016, 17, 914 Information Int. Mol. Sci. 2016, 17, 914 Int. J.J.Int. Mol. Sci. 2016, 17, 914 Int.J.J.Mol. Mol.Sci. Sci.2016, 2016,17, 17,914 914 Content—order 5.17, 914 Int. J. Mol. Sci. 2016,

ATSC4p MD

SIC5

ATSC4p ATSC6i

ATSC6i

nROH

nROH

NaasC

Low Value

NaasC Counts NaasC Countsof ofthe theEENaasC Counts of the ECounts of the E[43] NaasC Counts Counts of the Estate atom types. state atom types. NaasC of the Estate atom types. NaasC Counts of the ENaasC Counts ofofthe Estate atom NaasCtypes. Counts the Estate atom types. statestate atomatom types. state atom types. Counts thetypes. Estateof atom types.

Counts of the E-state state atom types. atom types.

[43] [43] [43] [43] [43] [43] [43] [43] [43]

1. Table Cont. TableTable 2. Cont. Table 1.1.Cont. Cont. Table Cont. Table 1. 1. Cont. Table 1. Cont.

[43]

k-NN k-NN k-NN k-NN

k-NN k-NN k-NN k-NN k-NN k-NN

k-NN

Figure Figure3. Cont. Figure 3.3.Cont. Cont. Figure Cont. Figure 3. 3. Cont. Figure 3. Cont. Figure 3. Cont. Figure 3. Cont. Figure 3. Cont.

Figure 3. Cont. Figure 3. Cont.

7 of 19 77of of19 19 1.00

High Value 0.28

Table MD Low MD Description [20] Reference Reference Model LowValue Value 0.21 Table1.1.Cont. Cont. Centred Broto-Description k-NNModel MD Description Reference Model Low Value MD Description Reference Model Low Value Table Description Reference Model Low Value 0.28 SIC5 Structural [39] k-NN SIC5Description Structural [39] Model k-NN 1. Cont. 0.28 MD Description Reference Model Low Value MD Reference Low Value SIC5 Structural [39] k-NN 0.28 Moreau MD Description Reference Model Low Value MD Structural Description Reference Model Low Value0.28 SIC5 [39] k-NN Information Information SIC5 Structural [39] k-NN 0.28 0.28 SIC5 Structural [39] k-NN 0.28 Information SIC5 Structural [39] k-NN MD Description Reference Model Low Value autocorrelations— SIC5 Structural [39] k-NN 0.28 Information Content—order Content—order5. StructuralInformation Information Information Content—order 5.5. [39][39] Information k-NN 0.28 Structural k-NN 0.28 Information Content—order 5. lagContent—order 4SIC5 (weighted by 5. Content—order 5. Content—order 5. 5. Content—order ATSC4p Centred Broto[20] k-NN 0.21 ATSC4p Centred Broto[20] k-NN 0.21 Information Content—order 5. ATSC4p Centred Broto[20] k-NN 0.21 atomic ATSC4p Centred Centred Broto- 5. [20] k-NN 0.21 Moreau Moreau Content—order ATSC4p Broto[20] k-NN 0.21 ATSC4p Centred Broto[20] k-NN 0.21 Moreau ATSC4p Centred Broto[20] k-NN 0.21 Centred Broto-Moreau ATSC4p Centred Broto[20] k-NN 0.21 polarizability). Moreau autocorrelations— autocorrelations— Moreau autocorrelations— Moreau ATSC4p Moreau Centred Brotok-NN 0.21 autocorrelations—lag 4 Moreau autocorrelations— lag 444(weighted by lag (weighted by [20][20] autocorrelations— k-NN 0.21 Centred Broto[20] k-NN 0 lag (weighted by autocorrelations— autocorrelations— Moreau (weighted by atomic autocorrelations— 4 (weighted atomic laglag (weighted bybyby lag 44atomic (weighted by atomic Moreau lag 44(weighted autocorrelations— polarizability). lag (weighted by atomic polarizability). polarizability). atomic atomic polarizability). atomic lag 4 (weighted by autocorrelations— atomic polarizability). ATSC6i Centred Broto[20] k-NN 000 ATSC6i Centred Broto[20] k-NN polarizability). polarizability). ATSC6i Centred Broto[20] k-NN polarizability). atomic polarizability). lag 6 (weighted by ATSC6i Centred Broto[20] k-NN Centred Broto-Moreau Moreau Moreau ATSC6i Centred Broto[20] k-NN 00 0 ATSC6i Centred Broto[20] k-NN 0 Moreau ATSC6i Centred Broto[20] k-NN ATSC6i polarizability). Centred Broto[20] k-NN 0 Moreau autocorrelations—lag 6 autocorrelations— autocorrelations— atomic ionization Moreau autocorrelations— [20][20] k-NN 00 Moreau ATSC6i Moreau Centred Broto- by k-NN Moreau autocorrelations— lag 666(weighted (weighted by atomic lag (weighted by autocorrelations— autocorrelations— lag (weighted by potential). autocorrelations— Moreau autocorrelations— lag 6 (weighted ionization atomic ionization ionizationlag potential). (weighted bybyby 66atomic (weighted by atomic ionization lag (weighted autocorrelations— Number oflag CART, 0 lag66ionization (weighted[10] by atomic potential). potential). atomic ionization atomic ionization potential). atomic ionization lag 6 (weighted by atomic ionization potential). hydroxyl groups. k-NN nROH Number of [10] CART, 0 nROH Number of [10] CART, potential). nROHpotential). Number of [10] CART, 00 potential). atomic ionization potential). nROH Number Number [10] CART, hydroxyl groups. k-NN hydroxyl groups. k-NN nROH ofof of [10] CART, 00 0 nROH Number of [10] CART, 0 hydroxyl groups. k-NN nROH Number [10] CART, nROH potential). Number of [10] CART, 0 hydroxyl groups. k-NN hydroxyl groups. k-NN hydroxyl groups. k-NN CART, Number hydroxyl hydroxyl k-NN nROH of Number of groups. CART, hydroxyl groups. [10][10] k-NN 00 k-NN groups. hydroxyl groups. k-NN

NaasC

NaasC

Model k-NN

0000 0 00 0 0 0

0

7 of 19 7 of 19 of 19 77 of 19 77ofof19 19 7 of 19

High HighValue Value High Value High ValueValue 1.00 High 1.00 High Value High Value 1.00 High Value High Value 1.00 1.00 1.00 1.00 High Value 1.00 1.00 1.00 46.6 46.6 46.6 46.6 46.6 46.6 46.6 46.6 46.6

46.6

46.6

4.1

4.1 4.1 4.1 4.14.14.1 4.1 4.1 4.1 4.1

9 999

9 99 9 9 9 9

16 16 16 161616 16 16 16

16

16

Int. J. Mol. Sci. 2016, 17, 914

8 of 19

Int. J. Mol. Sci. 2016, 17, 914

8 of 19

Figure 3. Occurrence frequency of the 19 selected fragments for CYP3A4 within the active/inactive Figure 3. Occurrence frequency of the 19 selected fragments for CYP3A4 within the active/inactive compounds. Symbols SMARTS language) language) have have the the compounds. Symbols associated associated with with the the fragments fragments (according (according to to SMARTS following meaning: X + number = number of total bonds in which the considered atom is involved; following meaning: X + number = number of total bonds in which the considered atom is involved; aromatic atom; atom; A A= = aliphatic aa == aromatic aliphatic atom. atom.

2.2.2. Isoform 2C9 2.2.2. Isoform 2C9 The statistics of the selected classification models for the CYP2C9 are reported in Table 3. As for The statistics of the selected classification models for the CYP2C9 are reported in Table 3. As for the 3A4 isoform, in this case the proposed models also have comparable performance on the training the 3A4 isoform, in this case the proposed models also have comparable performance on the training and test set. Furthermore, the CART model still appears to be the simplest and most balanced model, and test set. Furthermore, the CART model still appears to be the simplest and most balanced model, and only a few of the compounds are outside of its AD. and only a few of the compounds are outside of its AD. The best performance in recognizing inactive compounds was obtained with the k-NN approach The best performance in recognizing inactive compounds was obtained with the k-NN approach (Table 3). On the contrary, N3 model, which is known for its ability to predict better the less (Table 3). On the contrary, N3 model, which is known for its ability to predict better the less represented represented class, is better in the actives recognition. The applicability domain characterization for class, is better in the actives recognition. The applicability domain characterization for the k-NN model the k-NN model slightly worsens the predictions on inactive compounds. For the N3 model, only a slightly worsens the predictions on inactive compounds. For the N3 model, only a small fraction of small fraction of compounds was excluded, not affecting the classification performance. compounds was excluded, not affecting the classification performance.

Int. J. Mol. Sci. 2016, 17, 914

9 of 19

Table 3. Model statistics for CYP2C9 isoform. Models are described according to the method and type of descriptors, the Applicability Domain (AD: yes/no), number of variables (p) and classification parameters (parameter: object/leaf ratio for CART, k for kNN and α for N3). For each model, the Non-Error Rate (NER), the Sensitivity (Sn), and the Specificity (Sp) are reported in Fitting, Cross-Validation and on the test set. %out indicates the percentage of test set compounds outside the AD. Model

Descriptors AD

p

Fitting

Parameter

Cross-Validation

Test Set

NER

Sn

Sp

NER

Sn

Sp

%out

NER

Sn

Sp

4 4

210 210

0.75 0.75

0.75 0.75

0.75 0.75

0.75 0.75

0.75 0.75

0.75 0.75

0

0.75 0.75

0.75 0.75

0.74 0.74

CART

MD

y n

k-NN

MD

y n

6 6

14 14

0.77 0.77

0.69 0.69

0.85 0.85

0.77 0.77

0.68 0.68

0.85 0.85

5

0.76 0.76

0.67 0.67

0.86 0.84

N3

ECFP

y n

1024 1024

1 1

0.80 0.80

0.87 0.87

0.73 0.73

0.80 0.80

0.86 0.86

0.73 0.73

1

0.78 0.78

0.83 0.83

0.73 0.73

Table 4 collects all the selected MDs for CYP2C9. In analogy with the previous case, in Figure 4, the representation of the CART model (Figure 4a) and of the PCA on k-NN descriptors (Figure 4b,c) can be found. In this case, PC1 and PC2 explain 57% of data variance. PC3 explains 14% of data variance and leads to similar considerations than those derived from PC1/PC2 (see Supplementary Material). In analogy with CYP3A4, the active compounds are grouped on the left part, having a low score value on PC1 and PC2, while the inactive compounds mainly cover the right side (high PC1 and PC2 scores). The selected MDs are briefly described and interpreted below: ‚

‚

‚

‚

‚

‚

‚

nBM (number of multiple bonds) resulted to be relevant also for this isoform. As for the 3A4 CART model, small flexible molecules with a few number of multiple bonds (nBM < 13, Figure 4c) are classified as inactive, while those with a high number of multiple bonds (nBM > 17, Figure 4c) are classified as active. The presence of pyrimidines (nPyrimidines), together with the high values of Sp (sum of atomic carbon-scaled polarizability, Sp ě 30.6) characterize the active compounds. The inhibition ability of pyrimidine derivatives towards CYP2C9 was already suggested [44–47]. ARR represents the aromatic ratio, that is, the ratio between the number of aromatic bonds and bonds in the H-depleted molecular structure. Molecules without pyrimidine rings but with an aromatic character (ARR ě 0.38) and high atomic polarizability value (Sp ě 23.6) are active. The presence of two MDs related to aromaticity (i.e., ARR and nPyrimidines) suggests that this feature is fundamental for CYP2C9-drug interaction, in agreement with previous studies (e.g., [40]). GATS2i is the Geary autocorrelation of lag 2 weighted by the ionization potential [10]. It plays a crucial role in identifying the inactive compounds for the k-NN model, as denoted by its (high) positive loading on PC1. The inactive compounds, distributed on the right side of the score plot, are characterized by low values of this MD and have, therefore, low ionization potential. When increasing the ionization potential, the number of active compounds increases. nRNR2 counts the number of aliphatic tertiary amines. It has high PC2 loading, meaning that it tends to be higher for inactive compounds. Tertiary aliphatic amines are, in fact, generally oxidized by Flavin monooxygenase [48] and are inactive on CYP. F01[C–N], represents the frequency of bonded C and N atoms. It has a positive loading on PC2 and, in analogy with nRNR2, tends to be higher for inactive compounds, meaning that a high number of C-N could limit the interaction with the CYP2C9 isoform. In addition to the information overlap with nRNR2, F01[C–N] also takes into account the information about the presence of primary/secondary amines and of amides. Eta_betaP_A, is the average measure of π bonds and lone pairs of non-H atoms [49]. It has the lowest loading on PC2 and it increases when increasing the number of multiple bonds and lone pairs. Molecules with a high value of Eta_betaP_A tend to be active, confirming the

Int. J. Mol. Sci. 2016, 17, 914

‚

‚

10 of 19

previous considerations, that is, the higher the unsaturation, the higher the probability of activity towards CYP. MLOGP is the Moriguchi octanol-water partition coefficient [50]. Drug lipophilicity is known for playing aInt.crucial role in the CYP binding affinity [51], and a significant correlation 10 was reported J. Int. Mol.J. Sci. 17, 914 of 10 19 of Mol.2016, Sci. 2016, 2016, 17, 914 914 19 Mol. Sci. 17, Int. J.Int. Mol.J. Sci. 2016, 17, 914 10 of 10 19 of 19 Int. J. Mol. Sci. 2016, 17, 914 10 of 19 J. Mol. binding Sci. 2016, 17, 914 of 19 the between theInt. P450 affinity and the compound lipophilicity [51–54]. As noted10from J. Mol. Sci.the 2016, 17,P450 914 10 19 between P450 binding affinity and and the compound lipophilicity [51–54]. As noted from between the binding affinity the lipophilicity [51–54]. As noted J.Int. Mol. Sci. 2016, 914 10from ofthe 19ofthe between the17, P450 binding affinity and the compound lipophilicity [51–54]. As noted from the PCA, theInt. compounds with high MLOGP value arecompound active. between binding affinity and the compound lipophilicity [51–54]. As noted fromfrom the the between the P450 binding affinity and the compound lipophilicity [51–54]. As noted PCA, theP450 compounds with high MLOGP value are active. active. PCA, the the compounds with high MLOGP value are active. PCA, the compounds with high MLOGP value are PCA, the compounds with high MLOGP value are active. J.the Sci. 2016, 17, 914 10 ofthe between P450 binding affinity and the compound [51–54]. As noted from HyWi_B(m) is hyper-Wiener-like weighted by mass [10]. ItItItcontains information PCA, the compounds with highindex MLOGP value are active. Int. Sci. 2016, 17, 914 10 ofthe 19 Int. J.Mol. Mol. Sci. 2016, 17, 914 10 of1919about PCA, the compounds with high MLOGP value are between the P450 binding affinity and the compound lipophilicity As noted from •Mol. HyWi_B(m) is the hyper-Wiener-like index weighted bylipophilicity mass [10]. contains information about • J.Int. HyWi_B(m) isthe the hyper-Wiener-like index weighted byactive. mass [10]. It [51–54]. contains information about HyWi_B(m) is the hyper-Wiener-like index weighted by mass contains information about • •HyWi_B(m) is the hyper-Wiener-like index weighted by mass [10].[10]. It contains information about PCA, the compounds with high MLOGP value are active. • HyWi_B(m) is the hyper-Wiener-like index weighted by mass [10]. It HyWi_B(m) contains information about HyWi_B(m) is cyclicity the hyper-Wiener-like index weighted by mass [10]. It contains information about the compounds with high MLOGP value are active. molecular size, branching, cyclicity and presence of heavy atoms. HyWi_B(m) tends tohigher be higher molecular size, branching, cyclicity and presence ofheavy heavy atoms. HyWi_B(m) tends to beto molecularInt.size, branching, and presence of atoms. tends to be higher J.•PCA, Mol. Sci. 2016, 17, 914 10 ofhigher 19 molecular size, branching, cyclicity and presence of heavy atoms. HyWi_B(m) tends be molecular size, branching, cyclicity andand presence of heavy atoms. HyWi_B(m) tends to be higher between the P450 binding affinity the compound lipophilicity As noted from the the binding affinity and the compound lipophilicity [51–54]. noted from the between the P450 binding affinity and the compound lipophilicity [51–54]. As from the HyWi_B(m) is the hyper-Wiener-like index weighted byanalogy mass [10]. It [51–54]. contains information about molecular size, branching, cyclicity and presence ofinheavy atoms. HyWi_B(m) tends tonoted be higher molecular size, branching, cyclicity and presence of heavy atoms. HyWi_B(m) tends to be higher • • between HyWi_B(m) isP450 the hyper-Wiener-like index weighted by mass [10]. Itwith contains information about for and relatively large and polycyclic compounds, in with the 3A4 isoform. As this for relatively large and polycyclic compounds, analogy with the 3A4As isoform. As this for relatively large and polycyclic compounds, in analogy the 3A4 isoform. As this for relatively large polycyclic compounds, in analogy with the 3A4 isoform. As this descriptor for PCA, relatively large and polycyclic compounds, in analogy with the 3A4 isoform. As this the compounds with high MLOGP value are active. PCA, therelatively compounds with high MLOGP value are PCA, the compounds with high value are active. molecular size, branching, cyclicity and presence ofanalogy heavy atoms. HyWi_B(m) tends toAs be higher for relatively large and polycyclic compounds, in with the 3A4 isoform. this for large and polycyclic compounds, in analogy 3A4 isoform. As this between the P450 binding affinity and the compound lipophilicity [51–54]. As noted from the molecular size, branching, cyclicity presence ofactive. heavy atoms. HyWi_B(m) tends be higher descriptor captures several types ofMLOGP chemical information and itswith contribution istonot easily descriptor captures several types of chemical information and its the contribution is not easily descriptor captures several types ofindex chemical information and its contribution is not easily ••HyWi_B(m) HyWi_B(m) is the hyper-Wiener-like index weighted by mass [10]. Itits information about captures •several types of chemical information and its contribution is not easily detectable is the hyper-Wiener-like weighted by mass [10]. Itwith contains information about HyWi_B(m) is the hyper-Wiener-like index weighted by mass [10]. Itcontains contains information about for relatively large and polycyclic compounds, in analogy the 3A4 isoform. As this from descriptor captures several types of chemical information and its contribution is not easily descriptor captures several types ofvalue chemical information and contribution isAsnot easily PCA, the compounds with high MLOGP are in active. for relatively and polycyclic compounds, analogy with the 3A4 isoform. this detectable from the PCA, it needs needs further investigation. detectable fromlarge the PCA, it needs further investigation. detectable from the PCA, it further investigation. detectable from thebranching, PCA, itcyclicity needs further investigation. molecular cyclicity and presence of atoms. HyWi_B(m) tends be molecular size, branching, and presence of heavy atoms. HyWi_B(m) tends be molecular size, branching, cyclicity and presence ofheavy heavy atoms. HyWi_B(m) tends behigher higher descriptor captures several types of chemical information and its contribution istotohigher not easily detectable from the PCA, it needs further investigation. detectable from the PCA, it needs further investigation. the PCA,•it needs further investigation. HyWi_B(m) is size, the hyper-Wiener-like index weighted by mass [10]. It contains information about descriptor captures several types of chemical information and its contribution istonot easily for large and polycyclic compounds, in with the 3A4 As for relatively large and polycyclic inheavy analogy with the 3A4 isoform. As this for relatively large and compounds, in analogy analogy with theselected 3A4 isoform. As this this detectable from the PCA, itpolycyclic needs further investigation. Table 4. List and brief description ofcompounds, thepresence classical molecular descriptors (MDs) selected forbe CYP2C9. Table 4.relatively List and brief description of the classical molecular descriptors (MDs) selected forisoform. CYP2C9. molecular size, branching, cyclicity and of atoms. HyWi_B(m) tends to higher detectable from the PCA, it needs further investigation. Table 4. List and brief description of the classical molecular descriptors (MDs) for CYP2C9. Table 4. List and brief description of the classical molecular descriptors (MDs) selected for CYP2C9. descriptor captures several types of chemical information and its contribution is not easily descriptor captures several types of chemical information and its contribution is not easily Table 4. List and brief description of the classical molecular descriptors (MDs) selected for CYP2C9. descriptor captures several types of chemical information and its contribution is not easily Table 4. List and brief description of the classical molecular descriptors (MDs) selected for CYP2C9. Some examples of molecules with low and high MD values are also reported. Some examples of molecules with low and high MD values are also reported. for relatively large polycyclic compounds, analogy with the 3A4 isoform. As this Some examples of and molecules low and high MDin values are reported. also reported. Some examples of molecules withwith low and high MD values are also detectable from the PCA, ititwith needs further investigation. Table 4. List and brief description of the classical molecular descriptors selected fornot CYP2C9. Some examples of molecules low and high MD values are also reported. detectable from the PCA, itwith needs further investigation. Some examples of molecules low and high MD values are also (MDs) reported. detectable from the PCA, needs further investigation. Table 4. List and brief description of the molecular descriptors selected for CYP2C9. descriptor captures several types ofclassical chemical information and its (MDs) contribution is easily

MD MD Model Lowdescriptors Value HighHigh Value Table 4. List and briefDescription description of Reference thewith classical molecular (MDs) selected for CYP2C9. MD Description Reference Reference Model Low Value Value Value Description Model Low MD Description Reference Model Low Value HighHigh ValueValue Some examples of PCA, molecules low high values are also reported. Some examples of the molecules lowfurther and and high MD MD values are also reported. detectable from itwith needs investigation. MD Description Reference Model Low Value HighHigh ValueValue44 MD Description Reference Model Low Value nBM Number of with [10] CART 0 (MDs) nBM Number of [10] CART 0 44 Some examples of molecules low and high MD values are also reported. Table 4. List and brief description of the classical molecular descriptors selected for CYP2C9. nBM Number of [10] CART 0 Table 4. List and brief description of the classical molecular descriptors (MDs) selected for CYP2C9. 4. List of and brief description of the classical molecular descriptors nBM TableNumber [10] CART 0 (MDs) selected for CYP2C9. 44 44

Description Reference Model Value Value 44 44 multiple nBM Number of of Reference [10] CART 0 multiple MD MD Model LowLow Value HighHigh Value nBM Description Number [10] low CART 0 multiple Some examples ofofmolecules and MD are multiple Some examples ofbrief molecules withwith low and high MD values are also reported. Some examples molecules with low andhigh high MDvalues values arealso alsoreported. reported. Table 4. List and description of the classical molecular descriptors (MDs) selected for CYP2C9. bonds. multiple bonds. multiple nBM Number of [10] CART 0 bonds. nBM Number [10] CART 0 44 44 bonds. of MD Description Reference Model Low Value High Value bonds. bonds. MD Description Reference Low Value Value Some examples of molecules with low andModel high MD values are also reported. multiple MD Description Reference Model Low Value HighHigh Value MD Description Reference Model Low Value High Value multiple Sp Sp of [10] [10] 5.0 5.0 56.1 56.1 Sp Sum Sum of of [10] CART CART 5.0 56.1 Sum bonds. CART Sp nBM SumNumber of [10] [10] CART 5.0 56.1 bonds. CART nBM Number of ofof Reference [10] CART 0 00 44 4444 Number [10] Model CART MD Description Low Value High Value Sp nBM Sum of [10] [10] CARTCART 5.0 56.1 atomic Sp atomic Sum of 5.0 56.1 atomic atomic multiple Number of multiple multiple multiple [10] polarizabiliti atomic nBM polarizabiliti atomic Sum of CART Number of [10] [10] CART CART 0 0 5.0 44 56.1 44 polarizabiliti Sp Sp Sum of [10] CART 5.0 56.1 polarizabiliti bonds. bonds. nBM bonds. bonds. es scaled on on polarizabiliti es scaled polarizabiliti atomic multiple es scaled atomic es scaled on on Carbon Carbon es scaled on es scaled polarizabiliti bonds. Carbon polarizabiliti Sum ofof on CART Carbon Sp Sp of [10] [10] 5.0 5.0 56.1 56.1 Sp Sum Sum [10] CART CART 5.0 56.1 atom. atom. Carbon Carbon es scaled on atom. es scaled on atomic atom. atomic atomic atom. atom. Sp ARR Sum of 5.0 56.1 Carbon Sum ofARR atomic Ratio [10] CART 0 0.96 ARR Ratio [10] CART 0 0.96 Carbon polarizabiliti Ratio 0 polarizabiliti polarizabiliti [10] [10] CARTCART ARR Ratio 0 0.96 0.96 atomic atom. Sp polarizabilities on [10] [10] [10] CART between the ARR ARRscaled Ratio CARTCART 05.0 0 0.96 0.9656.1 between the Ratio atom. es between the es scaled on on esscaled scaled on between the polarizabiliti number of Carbon atom. number of between the between the [10] [10] CART Ratio CART 0 Carbon number Carbon ARRARR Carbon Ratio 0 0.96 0.96 number of of es scaled aromatic aromatic number ofon the number between atom. aromatic atom. atom. between the of aromatic Carbon bonds over bonds over over aromatic aromatic number bonds number of of Ratio CART bonds over ARRARR [10] [10] 0 00 0.96 0.96 ARR Ratio Ratio [10] CART CART 0.96 atom. the total bonds over the total bonds over aromatic the total aromatic between the the total between the between the Ratio between the number of the total number of the total ARR Ratio [10] CART 0 0.96 bonds over number of bonds over number number of ofof number of number number of aromatic non-H number of non-H number between the of the total non-H ARR CART 0 0.96 the total aromatic non-H aromatic aromatic [10] bonds over the total bonds. non-H bonds. non-H number of over number of bonds. number of bonds bonds. bonds over bonds over number of non-H bonds. HyWi_B(m) Hyper[55] k-NN 2.3 5.1 bonds. HyWi_B(m) Hyper[55] k-NN 2.3 5.1 bonds. aromatic non-H HyWi_B(m) Hypernon-H the HyWi_B(m) Hyper[55] [55] k-NNk-NN 2.3 2.3 5.1 5.1 the total thetotal total Wiener-like HyWi_B(m) Hyper[55] [55] k-NNk-NN 2.3 2.3 5.1 5.1 Wiener-like HyWi_B(m) Hyperbonds over of bonds. Wiener-like bonds. number Wiener-like number of number of index from Wiener-like index from Wiener-like the total HyWi_B(m)HyperHyperindex HyWi_B(m) [55] [55] k-NNk-NN 2.3 2.3 5.1 5.1 non-H index from from non-H non-H Burden Burden index from index number of from Wiener-like Burden Wiener-like bonds. Burden Hyper-Wiener-likebonds. index bonds. matrix matrix Burden Burden non-H index matrix index fromfrom [55] [55] [55] HyWi_B(m) Hyperk-NN matrix Hyperk-NN 2.3 5.1 5.1 HyWi_B(m) fromHyWi_B(m) Burden matrix 2.32.3 HyWi_B(m) Hyper[55] k-NN k-NN 2.3 5.1 5.1 weighted by weighted by by matrix matrix bonds. Burden weighted Burden Wiener-like weighted by weighted by mass.Wiener-like Wiener-like mass. weighted by by [55] mass. weighted HyWi_B(m) Hyperk-NN 2.3 5.1 matrix mass. matrix index mass. index fromfrom index from GATS2i Geary [56] [56] k-NNk-NN 0.09 0.09 1.93 1.93 mass. GATS2i weighted Geary mass. Wiener-like weighted GATS2i Geary by by Burden GATS2i Geary [56] [56] k-NNk-NN 0.09 0.09 1.93 1.93 Burden Burden autocorrelati GATS2i Geary [56] [56] k-NNk-NN 0.09 0.09 1.93 1.93 autocorrelati GATS2i mass. Geary index from mass. autocorrelati matrix autocorrelati matrix matrix on ofGeary lagof 2 lag autocorrelati GearyGATS2i autocorrelation of on lag by autocorrelati Burden GATS2i Geary [56] on 22 k-NNk-NN 0.09 0.09 1.93 1.93 on ofweighted lagof2by weighted weighted by [56] weighted on ofautocorrelati lagof2by GATS2i lag 2 weighted byautocorrelati [56] k-NN 0.09 1.93 weighted by on lag 2by matrix weighted mass. weighted by mass. mass.by ionization weighted ionizationGATS2i potential. ionization weighted on of2lag 2 by ionization on of lag Geary [56] k-NN 0.09 1.93 ionization GATS2i Geary [56] k-NN 0.09 1.93 GATS2i potential. Geary [56] k-NN 0.09 1.93 potential.by ionization ionization mass. weighted potential. weighted by autocorrelati potential. autocorrelati autocorrelati Eta_betaP_A Eta pi and and [49] k-NNk-NN k-NN 0 1.02 Eta_betaP_A Eta pi and [49] 0 1.02 potential. potential. GATS2i Geary [56] 0.09 1.93 ionization Eta_betaP_A Eta pi [49] 0 1.02 ionization on of lag 2 Eta_betaP_A Eta pi and [49] k-NN 0 1.02 on of on lagof 2 lag 2 lone pair lone pair Eta_betaP_A Eta pi and [49] [49] k-NNk-NN 0 1.02 1.02 Eta pi and lone pair Eta_betaP_A Eta pi and 0 autocorrelati potential. lone pair potential. weighted by lone pair Eta_betaP_A [49] k-NN 0 1.02 weighted by by weighted average lone pair average VEM count. average lone on of lag Eta_betaP_A Eta pi2pair and 0 average Eta_betaP_A Eta pi and [49] [49] k-NNk-NN 0 1.02 1.02 ionization average ionization ionization VEM count. VEM count. average average weighted by lone pair VEM count. lone pair VEMpotential. count. potential. potential. nPyrimidines Number of [10] CARTCART CART 0 nPyrimidines Number ofcount. [10] [10] 0 2 VEM count. VEM ionization average nPyrimidines Number of 22 average Eta_betaP_A Eta pipi k-NN nPyrimidines Number ofand [10] [49] CART 2 1.02 Eta_betaP_A Eta pi and [49] k-NN 00 000 1.02 Eta_betaP_A Eta and [49] CART k-NN Pyrimidines. nPyrimidines Number ofcount. [10] [10] 0 2 1.02 Pyrimidines. nPyrimidines Number of CART 0 2 potential. VEM Pyrimidines. VEM count. lone pair Pyrimidines. lone pair lone pair Pyrimidines. Pyrimidines. Eta_betaP_A Eta pi and [49] [10] CART k-NN 1.02 nPyrimidines Number CART 2 2 nPyrimidines Number of of [10] [10] 00 0 0 2 average nPyrimidines Number of Pyrimidines. CART average average loneVEM pair count. Pyrimidines. Pyrimidines. VEM count.count. VEM average nPyrimidines Number CART nPyrimidines Number of ofof [10] [10] 0 00 2 22 nPyrimidines Number [10] CART CART VEMPyrimidines. count. Pyrimidines. Pyrimidines. nRNR2 Number of [10] k-NN 0 32 nRNR2 Number of [10] k-NN 0 nRNR2 Number Number 0 33 nPyrimidines of [10] [10] CART nRNR2 Number of of [10] k-NNk-NN 00 3 nRNR2 Number of of [10] [10] k-NNk-NN 0 3 aliphatic nRNR2 aliphatic Number 0 3 aliphatic Pyrimidines. aliphatic tertiary of tertiary aliphatic aliphatic Number ofnRNR2 aliphatic Number [10] k-NN 0 3 tertiary nRNR2 Number of [10] k-NN 0 3 tertiary nRNR2 k-NN 0 3 amines. [10] amines. tertiary tertiary tertiary amines. aliphatic aliphatic amines. amines. amines. amines. of tertiary tertiary nRNR2 Number [10] k-NN 0 3 nRNR2 of [10] [10] k-NNk-NN 0 3 nRNR2 Number Number of 0 3 amines. amines. aliphatic aliphatic aliphatic nRNR2 Number of [10] k-NN 0 3 tertiary tertiary tertiary aliphatic amines. amines. amines. tertiary amines.

Int. J. Mol. Sci. 2016, 17, 914

11 of 19

Table 4. Cont.

Int. J. Mol. Sci. 2016, 17, 914 Int. J. Mol. Sci. 2016, 17, 914

Mol.2016, Sci. 2016, 17, 914 Int. J. Int. Mol.J.Sci. MD Description Reference k-NN Model F01[C–N] Frequency of 17, 914 [57] Int. J. Mol. Sci. 2016, 17, 914 C–N at F01[C–N] Frequency of [57] k-NN F01[C–N] Frequency of [57] k-NN C–N at topological C–N at F01[C–N] Frequency of [57] k-NN topological Frequency of 1. C–Ntopological at at distance C–N [57] k-NN F01[C–N] 1. topological distance 1. distance distance 1. topological

11 of 19 Low Value

0 0

0 0

11 of 19

of 19 High 11 Value

11 of 19 19 19 19

0

19

19

distance 1.

MLOGP MLOGP

MLOGP Moriguchi [50] k-NN Moriguchi MLOGP Moriguchi[50] [50] k-NNk-NN octanoloctanolMLOGP Moriguchi [50] k-NN octanolwater water octanolMoriguchi partition water octanol-water [50] k-NN partition water partition coefficient. coefficient. partition coefficient. partition coefficient. coefficient.

-6.3 -6.3 -6.3 -6.3

9.6 9.6

-6.3

9.6

9.6 9.6

Figure 4. Representation of the MD-based models for CYP2C9: (a) CART model; (b) Score plot of the Figure 4. Representation of the MD-based models for CYP2C9: (a) CART model; (b) Score plot of the k-NN descriptors, colored to their activity; (c) Loading plotmodel; of the (b) k-NN descriptors. Figure 4. Representation the according MD-based models for(c) CYP2C9: (a)plot CART Score plot of the k-NN descriptors, coloredofaccording to their activity; Loading of the k-NN descriptors. k-NN descriptors, colored according to their activity; (c) Loading plot of the k-NN descriptors.

In order to easily interpret information encoded by CYP2C9 fingerprints, the same approach In order to easily interpret information encoded by CYP2C9 fingerprints, the same approach as as for 3A4 used (Figure 5), obtaining in this case a subset of 16 themost 16 the most relevant molecular In order to easily interpret information same approach as for 3A4 was was used (Figure 5), obtaining in encoded this caseby a CYP2C9 subset offingerprints, the relevant molecular fragments (MFs). Also for this isoform, the cyclohexanediol-like MF16 (No. 1) mainly characterizes for 3A4 was used (Figure 5), isoform, obtaining this case a subset MF of the relevant molecular fragments (MFs). Also for this theincyclohexanediol-like (No. 1)most mainly characterizes the the inactive compounds. quinazolinamine-like 2 MF and 3) have a higher frequency within fragments (MFs). AlsoThe for The this isoform, the cyclohexanediol-like (No. 1) mainly characterizes the inactive compounds. quinazolinamine-like MFsMFs (No.(No. 2 and 3) have a higher frequency within inactive structures. This is a relevant difference respect the isoform, in which the opposite inactive compounds. The MFs (No. 2 and 3)tohave a higher frequency within inactive structures. This isquinazolinamine-like a relevant difference withwith respect to the isoform, in which the opposite situation observed. remaining are more frequent the actives the same inactive structures. ThisThe is aThe relevant difference with respect to thewithin isoform, in which the and opposite situation was was observed. remaining MFsMFs are more frequent within the actives class,class, and the same conclusion as3A4 for 3A4 isoform candrawn. be drawn. Figure 4. Representation of the MD-based models for CYP2C9: (a) CART model; (b) Score plot of the the situation was observed. The remaining MFs are more frequent within the actives class, and the same conclusion as for isoform can be Figure 4. Representation of the MD-based models for CYP2C9: (a) CART model; (b) Score plot of conclusioncolored as for 3A4 isoform can drawn. k-NN descriptors, according to be their activity; (c) Loading plot of the k-NN descriptors.

k-NN descriptors, colored according to their activity; (c) Loading plot of the k-NN descriptors.

In order to easily interpret information encoded by CYP2C9 fingerprints, the same approach as In order to easily interpret information encoded by CYP2C9 fingerprints, the same approach as for 3A4 was used (Figure 5), obtaining in this case a subset of the 16 most relevant molecular for 3A4 was used (Figure 5), obtaining in this case a subset of the 16 most relevant molecular fragments fragments (MFs). Also for this isoform, the cyclohexanediol-like MF (No. 1) mainly characterizes the (MFs). Also for this isoform, the cyclohexanediol-like MF (No. 1) mainly characterizes the inactive inactive compounds. The quinazolinamine-like MFs (No. 2 and 3) have a higher frequency within compounds. The quinazolinamine-like MFs (No. 2 and 3) have a higher frequency within inactive inactive structures. This is a relevant difference with respect to the isoform, in which the opposite structures. This is a relevant difference with respect to the isoform, in which the opposite situation was situation was observed. The remaining MFs are more frequent within the actives class, and the same observed. The remaining MFs are more frequent within the actives class, and the same conclusion as conclusion as for 3A4 isoform can be drawn. for 3A4 isoform can be drawn.

Int. J. Mol. Sci. 2016, 17, 914

12 of 19

Int. J. Mol. Sci. 2016, 17, 914

12 of 19

Figure Figure5.5.Occurrence Occurrencefrequency frequencyof ofthe the16 16selected selectedfragments fragmentsfor forCYP2C9 CYP2C9within withinthe theactive/inactive active/inactive compounds. Symbols associated with the fragments (according to SMARTS language) compounds. Symbols associated with the fragments (according to SMARTS language) have have the the following following meaning: meaning: X X ++ number number = = number number of of total total bonds bonds in in which which the the considered considered atom atom is is involved; involved; aa== aromatic aromatic atom; atom; A A == aliphatic aliphatic atom. atom.

2.3. 2.3. Consensus Consensus Modelling Modelling The The consensus consensus approach approach is is well-known well-known in in the the field field of ofQSAR QSAR[58–61] [58–61]and andconsists consistsin incombining combining the several models to maximize theirtheir advantages and minimize their drawbacks, aiming thepredictions predictionsofof several models to maximize advantages and minimize their drawbacks, to increase the global prediction accuracy. In this work, the selected classification models, which aiming to increase the global prediction accuracy. In this work, the selected classification models, are based very descriptors and classification techniques, werewere merged in two types of which are on based ondifferent very different descriptors and classification techniques, merged in two types consensus approaches (Table 5): 5): of consensus approaches (Table

1.

Consensus 1: a molecule was classified if and only if the following conditions were met: (1) all the model predictions agreed in its predicted class; (2) the molecule was inside the AD of all of the models.

Int. J. Mol. Sci. 2016, 17, 914

1.

2.

13 of 19

Consensus 1: a molecule was classified if and only if the following conditions were met: (1) all the model predictions agreed in its predicted class; (2) the molecule was inside the AD of all of the models. Consensus 2: is based on the majority vote approach, i.e., the compound is classified according to the most frequently predicted class. In this case, the AD of the models used for the prediction was considered. Table 5. Consensus models (cons) 3A4 and 2C9. Non-Error Rate (NER), Sensitivity (Sn), and Specificity (Sp) are reported in Fitting, Cross-Validation and on the test set. Fitting

Cross-Validation

Test set

CYP

Type %na

NER

Sn

Sp

%na

NER

Sn

Sp

%na

NER

Sn

Sp

3A4

cons 1 cons 2

33 -

0.88 0.79

0.92 0.80

0.84 0.78

33 -

0.88 0.78

0.92 0.79

0.84 0.77

36 6

0.88 0.80

0.92 0.81

0.83 0.80

2C9

cons 1 cons 2

33 -

0.89 0.81

0.90 0.80

0.88 0.82

34 -

0.89 0.81

0.90 0.80

0.88 0.82

40 1

0.89 0.79

0.89 0.77

0.88 0.81

For both the isoforms, the same outcome of the consensus approach was reached. Consensus 1 provided increased predictions reliability, especially for the active compounds (Sn ranging from 0.89 to 0.92 on the test set), at the expense of a large number of excluded molecules (up to 40% when both the AD and the model disagreement are considered). Consensus 2 showed a NER comparable with the single models, but with more balanced Sn and Sp values and the advantage of providing a prediction for each model. 2.4. External Validation After recalibrating the models including the test set compounds within the AD domain, their predictivity was further tested on each isoform’s external set (Table 6). The single model performances tend to decrease on the external set. In particular, for CYP3A4 individual models, the Sp decreases substantially, while Sn values are stable; this means that the considered models are more reliable in recognizing the actives. This represents a prominent feature in the field of drug-drug interaction prediction and virtual screening [62,63]. On the contrary, on CYP2C9 single models, a comparable and moderately high decrease of Sn and Sp values can be noted, with the exception of N3, which has higher NER and Sn values. The best results are reached by the consensus 1 for both the isoforms, confirming the reliability of these models. For what concerns the consensus 2, they are slightly less accurate than the former case, but the advantage is that the predictions are provided for all the molecules, with a higher predictivity than the single models. Table 6. Classification results of the models on the external set in terms of Sn, Sp, NER and percentage of not assigned/out of the AD compounds (%na).

a

Fitting a

Cross-Validation a

External Set

CYP

Mod. %na

NER

Sn

Sp

%na

NER

Sn

Sp

%na

NER

Sn

Sp

3A4

CART k-NN N3 cons 1 cons 2

32 -

0.75 0.76 0.80 0.88 0.80

0.74 0.73 0.87 0.91 0.80

0.75 0.79 0.73 0.85 0.79

33 -

0.74 0.76 0.79 0.88 0.79

0.73 0.73 0.87 0.91 0.80

0.76 0.79 0.71 0.84 0.78

1 1 42 1

0.66 0.70 0.72 0.80 0.71

0.68 0.70 0.85 0.89 0.76

0.63 0.69 0.59 0.70 0.67

2C9

CART k-NN N3 cons 1 cons 2

34 -

0.75 0.77 0.80 0.89 0.81

0.77 0.69 0.86 0.91 0.81

0.74 0.85 0.74 0.88 0.82

35 -

0.74 0.77 0.79 0.89 0.80

0.73 0.68 0.85 0.90 0.79

0.76 0.85 0.74 0.88 0.82

1 45 1

0.66 0.69 0.75 0.83 0.73

0.66 0.58 0.83 0.85 0.71

0.66 0.81 0.68 0.82 0.75

Performance on the models recalibrated on all the shared set (9122 molecules), i.e., on training and test set compounds.

Int. J. Mol. Sci. 2016, 17, 914

14 of 19

3. Materials and Methods 3.1. Data Curation This study is based on the publicly available CYP bioactivity database developed by Veith et al. [18] of potency values for 17,143 drug-like compounds for five Cytochrome P450 isoforms (3A4, 2D6, 2C9, 2C19, 1A2), screened using a quantitative High-Throughput Screening with a bioluminescent assay, which recognizes both inhibitors and substrates. The dataset was retrieved from PubChem (AID: 1851), which provides the class of activity (active/inactive) for each compound, identified by a SMILES (Simplified Molecular Input Line Entry System). For each isoform of interest (3A4, 2C9), data were curated by: (a) removing the records without SMILES and/or activity class; (b) removing duplicate structures with mismatching class activity; (c) removing disconnected structures. The comparison between the isoform datasets led to the development of two types of sets (Table 7): 1. 2.

Shared set (9122 molecules), comprising the molecule available for both the isoforms. It served to calibrate, validate and choose the optimal model(s) for each isoform. External sets (2996 and 2818 molecules for CYP3A4 and CYP2C9, respectively). These datasets comprise the molecules with annotated activity values for one isoform only. They were used as an external validation tool to further evaluate the model predictivity on unknown molecules. Table 7. Characteristics of the shared and external sets for each isoform: n = number of molecules, %act = percentage of active compounds.

CYP Isoform 3A4 2C9

Shared Set

External Set

n

%act

n

%act

9122 9122

40 33

2996 2818

49 36

The shared set of molecules was randomly split into a training set (70%, 6385 compounds) and a test set (30%, 2737 compounds), keeping the active/inactive proportion of both the isoforms (49:100 and 66:100 for 2C9 and 3A4, respectively). The training set was used to select the variables, calibrate the models and cross-validate them (five-fold). The test set served in a later stage to validate the final pool of selected models. The external set was used as a further validation tool only. 3.2. Molecular Descriptors Two types of molecular descriptors were considered: 1.

2.

Classical molecular descriptors (MDs) were calculated using Dragon 6 [20]. From the obtained 3763 MDs, we filtered out those that: (a) were constant or near-constant; (b) had at least one missing value; (c) had a pairwise correlation larger than 0.95. Eventually, 1472 descriptors were retained. Two types of binary fingerprints (FPs) were calculated, namely: (1) extended connectivity (ECFP) [21], circular/topological FPs that encode for several molecular features including stereochemical information; and (2) path connectivity (PFP) [64] based on the presence of particular molecular fragments without accounting for the stereochemical information. A total of 1024 bit FPs were calculated using Dragon 7 [22]. The detailed settings can be found in the Supplementary Material. PFP were considered in the preliminary phase, but their results were not reported as they were outperformed by ECFP.

3.3. Variable Selection The most relevant MDs were selected using Genetic Algorithms [23], in the version proposed by Leardi et al. [65]. For each method, the following strategy was always followed: (1) GA were run on

Int. J. Mol. Sci. 2016, 17, 914

15 of 19

the pool of MDs (1472); (2) the most frequent MDs of phase 1 (up to a maximum of 200) were subjected to a second selection phase; and (3) the most relevant MDs of phase 2 (up to 15) were tested for all of their possible combinations (All Subset Modelling Strategy, e.g., [66]). 3.4. Applicability Domain (AD) Assessment The AD approach was rationalized according to the classifier characteristics, as explained below. 1.

2.

3.

As CART is based on univariate splits of data, a “bounding box” approach was applied, by excluding the molecules with descriptor values outside the training set ranges (reported as Supplementary Material) [58]. For k-NN, every compound too far from its k neighbors was considered out of the AD, using the approach proposed by Sahigara et al. [67]. The same distance metric of the best k-NN models (Euclidean distance) was used. For what concerns N3, as the largest contribute to the classification is given by the closest compound, a query was considered inside the AD if its similarity with its nearest neighbor was larger than or equal to 0.2. The same similarity metric of best models (Jaccard-Tanimoto) was used.

3.5. Software and Code Data curation was performed using KNIME [68] v 2.11.3 workflows. Training/test splitting, variable selection, cross-validation and model calibration were performed in MATLAB [69] environment, using routines written by Milano Chemometrics and QSAR Research Group [70]. 4. Conclusions The overall goal of this work was to obtain reliable QSAR screening systems for two of the most relevant isoforms of Cytochrome P450, namely CYP2C9 and CYP3A4. The study was based on the CYP bioactivity database developed by the National Institutes of Health Chemical Genomics Center (NCGC), from which activity data for 14,936 molecules were obtained and used to train/validate the models. Different classification approaches were tested on both classical molecular descriptors and binary fingerprints. For each isoform, three models led to the optimal results: (1) a classification tree, characterized by high interpretability and prediction balance on the classes; (2) a k-Nearest Neighbor model, with high complexity but also high predictivity towards inactive compounds; and (3) a N-Nearest Neighbor model, which had the highest performance towards the actives. The interpretation of the selected molecular descriptors made it possible to gather insights into the structural features that determine the activity on Cytochrome P450. In particular, for both the isoforms, molecular dimension and flexibility ultimately influenced the activity towards CYP, in that small and flexible molecules tend to be inactive, while lipophilic compounds tend to be active. As for CYP3A4, distinctive features of the active molecules turned out to be the presence of hydroxyl groups, and/or a low polarizability/ionization potential. Regarding the CYP2C9, the analysis underscored the fact that active molecules tend to have a large number of aromatic bonds, along with a high polarizability, a high unsaturation degree and often the presence of pyrimidines. Moreover, the interaction with CYP2C9 could be limited by a high number of bonded C and N atoms in the molecule. As a final classification step, the models were aggregated in a consensus manner. This allowed us to obtain a system with a good predictive ability towards both the activity classes and to exploit the advantages of the single models. Finally, the models were validated using more than 2000 molecules for each isoform as an external set. The external validation confirmed the model reliability and stability. A future perspective will be the development of analogous approaches for the remaining CYP isoforms of relevance.

Int. J. Mol. Sci. 2016, 17, 914

16 of 19

Supplementary Materials: The curated datasets along with the calculated molecular descriptors and other supplementary materials can be found at http://www.mdpi.com/1422-0067/17/6/914/s1. The dataset can be also freely downloaded from Milano Chemometrics website (http://michem.disat.unimib.it/chm/). Author Contributions: Serena Nembri and Francesca Grisoni conceived the study. Serena Nembri curated the dataset and performed the calculations. Serena Nembri and Francesca Grisoni analyzed and interpreted the results, and wrote the paper. Viviana Consonni and Roberto Todeschini supervised the study and reviewed the manuscript. Conflicts of Interest: The authors declare no conflict of interest.

Abbreviations AD ADMET CART CYP FP ECFP k-NN MD MF N3 NER QSAR SMILES Sn Sp

Applicability Domain Absorption, Distribution, Metabolism, Excretion and Toxicity Classification and Regression Trees Cytochrome P450 binary Fingerprints Extended Connectivity Fingerprints k-Nearest Neighbors classical Molecular Descriptors Molecular Fragment N-Nearest neighbors Non-Error Rate Quantitative Structure-Activity Relationship Simplified Molecular Input Line Entry System Sensitivity Specificity

References 1. 2. 3. 4. 5. 6. 7. 8. 9. 10. 11.

12. 13. 14.

Munro, A.W.; Girvan, H.M.; Mason, A.E.; Dunford, A.J.; McLean, K.J. What makes a P450 tick? Trends Biochem. Sci. 2013, 38, 140–150. [CrossRef] [PubMed] Yan, Z.; Caldwell, G.W. Metabolism Profiling, and Cytochrome P450 inhibition & induction in drug discovery. Curr. Top. Med. Chem. 2001, 1, 403–425. [PubMed] Singh, D.; Kashyap, A.; Pandey, R.V.; Saini, K.S. Novel advances in cytochrome P450 research. Drug Discov. Today 2011, 16, 793–799. [CrossRef] [PubMed] Zanger, U.M.; Schwab, M. Cytochrome P450 enzymes in drug metabolism: Regulation of gene expression, enzyme activities, and impact of genetic variation. Pharmacol. Ther. 2013, 138, 103–141. [CrossRef] [PubMed] Pb, W. Role of cytochromes P450 in drug metabolism and hepatotoxicity. Semin. Liver Dis. 1990, 10, 235–250. Gonzalez, F.J.; Gelboin, H.V. Role of human cytochromes P450 in the metabolic activation of chemical carcinogens and toxins. Drug Metab. Rev. 1994, 26, 165–183. [CrossRef] [PubMed] Gonzalez, F.J. Role of cytochromes P450 in chemical toxicity and oxidative stress: Studies with CYP2E1. Mutat. Res. Mol. Mech. Mutagen. 2005, 569, 101–110. [CrossRef] [PubMed] Langowski, J.; Long, A. Computer systems for the prediction of xenobiotic metabolism. Adv. Drug Deliv. Rev. 2002, 54, 407–415. [CrossRef] Kirton, S.B.; Baxter, C.A.; Sutcliffe, M.J. Comparative modelling of cytochromes P450. Adv. Drug Deliv. Rev. 2002, 54, 385–406. [CrossRef] Todeschini, R.; Consonni, V. Molecular Descriptors for Chemoinformatics; John Wiley & Sons: Hoboken, NJ, USA, 2009; Volume 41. Li, H.; Sun, J.; Fan, X.; Sui, X.; Zhang, L.; Wang, Y.; He, Z. Considerations and recent advances in QSAR models for cytochrome P450-mediated drug metabolism prediction. J. Comput. Aided Mol. Des. 2008, 22, 843–855. [CrossRef] [PubMed] Yap, C.W.; Chen, Y.Z. Prediction of Cytochrome P450 3A4, 2D6, and 2C9 inhibitors and substrates by using support vector machines. J. Chem. Inf. Model. 2005, 45, 982–992. [CrossRef] [PubMed] Rostkowski, M.; Spjuth, O.; Rydberg, P. WhichCyp: Prediction of cytochromes P450 inhibition. Bioinformatics 2013, 29, 2051–2052. [CrossRef] [PubMed] Pan, X.; Chao, L.; Qu, S.; Huang, S.; Yang, L.; Mei, H. An improved large-scale prediction model of CYP1A2 inhibitors by using combined fragment descriptors. RSC Adv. 2015, 5, 84232–84237. [CrossRef]

Int. J. Mol. Sci. 2016, 17, 914

15. 16.

17.

18.

19. 20. 21. 22. 23. 24. 25. 26. 27. 28. 29. 30.

31. 32. 33. 34. 35. 36. 37. 38.

17 of 19

Sun, H.; Veith, H.; Xia, M.; Austin, C.P.; Huang, R. Predictive models for cytochrome P450 isozymes based on quantitative high throughput screening data. J. Chem. Inf. Model. 2011, 51, 2474–2481. [CrossRef] [PubMed] Jensen, B.F.; Vind, C.; Brockhoff, P.B.; Refsgaard, H.H.F. In silico prediction of cytochrome P450 2D6 and 3A4 inhibition using gaussian kernel weighted k-Nearest neighbor and extended connectivity fingerprints, including structural fragment analysis of inhibitors versus noninhibitors. J. Med. Chem. 2007, 50, 501–511. [CrossRef] [PubMed] Balakin, K.V.; Ekins, S.; Bugrim, A.; Ivanenkov, Y.A.; Korolev, D.; Nikolsky, Y.V.; Skorenko, A.V.; Ivashchenko, A.A.; Savchuk, N.P.; Nikolskaya, T. Kohonen maps for prediction of binding to human cytochrome P450 3A4. Drug Metab. Dispos. 2004, 32, 1183–1189. [CrossRef] [PubMed] Veith, H.; Southall, N.; Huang, R.; James, T.; Fayne, D.; Artemenko, N.; Shen, M.; Inglese, J.; Austin, C.P.; Lloyd, D.G.; et al. Comprehensive characterization of cytochrome P450 isozyme selectivity across chemical libraries. Nat. Biotechnol. 2009, 27, 1050–1055. [CrossRef] [PubMed] NCBI The PubChem Project. Available online: http://pubchem.ncbi.nlm.nih.gov/ (accessed on 1 March 2016). Talete srl. Dragon (Software for Molecular Descriptor Calculation). Version 6.0. 2012. Available online: http://www.talete.mi.it/(accessed on 8 June 2016). Rogers, D.; Hahn, M. Extended-connectivity fingerprints. J. Chem. Inf. Model. 2010, 50, 742–754. [CrossRef] [PubMed] Kode srl. Dragon (Software for Molecular Descriptor Calculation), Version 7.0. Pisa, Italy, 2016. Goldberg, D.E.; Holland, J.H. Genetic algorithms and machine learning. Mach. Learn. 1988, 3, 95–99. [CrossRef] Grisoni, F.; Cassotti, M.; Todeschini, R. Reshaped sequential replacement for variable selection in QSPR: Comparison with other reference methods. J. Chemom. 2014, 28, 249–259. [CrossRef] Breiman, L.; Friedman, J.; Stone, C.J.; Olshen, R.A. Classification and Regression Trees; CRC Press: Boca Raton, FL, USA, 1984. Kowalski, B.R.; Bender, C.F. K-Nearest Neighbor Classification Rule (pattern recognition) applied to nuclear magnetic resonance spectral interpretation. Anal. Chem. 1972, 44, 1405–1411. [CrossRef] Todeschini, R.; Ballabio, D.; Cassotti, M.; Consonni, V. N3 and BNN: Two new similarity based classification methods in comparison with other classifiers. J. Chem. Inf. Model. 2015, 55, 2365–2374. [CrossRef] [PubMed] McLachlan, G.J. Discriminant Analysis and Statistical Pattern Recognition; Wiley: Hoboken, NJ, USA, 1992. Ståhle, L.; Wold, S. Partial least squares analysis with cross-validation for the two-class problem: A Monte Carlo study. J. Chemom. 1987, 1, 185–196. [CrossRef] Sahigara, F.; Mansouri, K.; Ballabio, D.; Mauri, A.; Consonni, V.; Todeschini, R. Comparison of different approaches to define the applicability domain of QSAR models. Molecules 2012, 17, 4791–4810. [CrossRef] [PubMed] Jolliffe, I. Principal component analysis. In Wiley StatsRef: Statistics Reference Online; John Wiley & Sons, Ltd.: Boca Raton, FL, USA, 2014. Smith, D.A.; Jones, B.C.; Walker, D.K. Design of drugs involving the concepts and theories of drug metabolism and pharmacokinetics. Med. Res. Rev. 1996, 16, 243–266. [CrossRef] De Montellano, P.R.O. Cytochrome P450: Structure, Mechanism, and Biochemistry; Springer Science & Business Media: Berlin, Germany; Heidelberg, Germany, 2005. Guengerich, F.P. Mechanisms of cytochrome P450 substrate oxidation: MiniReview. J. Biochem. Mol. Toxicol. 2007, 21, 163–168. [CrossRef] [PubMed] Guengerich, F.P. Oxidation of halogenated compounds by cytochrome P-450, peroxidases, and model metalloporphyrins. J. Biol. Chem. 1989, 264, 17198–17205. Meunier, B.; Visser, S.P.D. Mechanism of oxidation reactions catalyzed by cytochrome P450 enzymes. Chem. Rev. 2004, 104, 3947–3980. [CrossRef] [PubMed] Mutschler, E.; Derendorf, H. Drug Actions: Basic Principles and Theraputic Aspects; CRC Press: Boca Raton, FL, USA, 1995. Guroff, G.; Daly, J.W.; Jerina, D.M.; Renson, J.; Witkop, B.; Udenfriend, S. Hydroxylation-induced migration: The NIH shift. Recent experiments reveal an unexpected and general result of enzymatic hydroxylation of aromatic compounds. Science 1967, 157, 1524–1530. [CrossRef] [PubMed]

Int. J. Mol. Sci. 2016, 17, 914

39. 40. 41. 42.

43. 44.

45. 46. 47. 48. 49. 50. 51. 52. 53. 54.

55. 56. 57. 58. 59.

60.

61. 62. 63.

18 of 19

Magnuson, V.R.; Harriss, D.K.; Basak, S.C. Topological indices based on neighborhood symmetry: Chemical and biological applications. Chem. Appl. Topol. Graph Theory 1983, 178–191. Raunio, H.; Kuusisto, M.; Juvonen, R.O.; Pentikäinen, O.T. Modeling of interactions between xenobiotics and cytochrome P450 (CYP) enzymes. Front. Pharmacol. 2015, 6. [CrossRef] [PubMed] Jiao, D.; Golubkov, P.A.; Darden, T.A.; Ren, P. Calculation of protein-ligand binding free energy by using a polarizable potential. Proc. Natl. Acad. Sci. USA 2008, 105, 6290–6295. [CrossRef] [PubMed] Stauffer, D.A.; Karlin, A. Electrostatic potential of the acetylcholine binding sites in the nicotinic receptor probed by reactions of binding-site cysteines with charged methanethiosulfonates. Biochemistry 1994, 33, 6840–6849. [CrossRef] [PubMed] Butina, D. Performance of Kier-hall E-state descriptors in quantitative structure activity relationship (QSAR) studies of multifunctional molecules. Molecules 2004, 9, 1004–1009. [CrossRef] [PubMed] Gunes, A.; Coskun, U.; Boruban, C.; Gunel, N.; Babaoglu, M.O.; Sencan, O.; Bozkurt, A.; Rane, A.; Hassan, M.; Zengil, H.; et al. Inhibitory effect of 5-fluorouracil on cytochrome P450 2C9 activity in cancer patients. Basic Clin. Pharmacol. Toxicol. 2006, 98, 197–200. [CrossRef] [PubMed] Gilbar, P.J.; Brodribb, T.R. Phenytoin and fluorouracil interaction. Ann. Pharmacother. 2001, 35, 1367–1370. [CrossRef] [PubMed] Brown, M.C. An adverse interaction between warfarin and 5-fluorouracil: A case report and review of the literature. Chemotherapy 1999, 45, 392–395. [CrossRef] [PubMed] Stiborová, M.; Bieler, C.A.; Wiessler, M.; Frei, E. The anticancer agent ellipticine on activation by cytochrome P450 forms covalent DNA adducts. Biochem. Pharmacol. 2001, 62, 1675–1684. [CrossRef] Beedham, C. The role of non-P450 enzymes in drug oxidation. Pharm. World Sci. 1997, 19, 255–263. [CrossRef] [PubMed] Roy, K.; Saha, A. QSPR with TAU indices: Water solubility of diverse functional acyclic compounds. Intern. Electron. J. Mol. Des. 2003, 2, 475–491. Moriguchi, I.; Hirono, S.; Liu, Q.; Nakagome, I.; Matsushita, Y. Simple method of calculating octanol/water partition coefficient. Chem. Pharm. Bull. 1992, 40, 127–130. [CrossRef] Lewis, D.F.V.; Jacobs, M.N.; Dickins, M. Compound lipophilicity for substrate binding to human P450s in drug metabolism. Drug Discov. Today 2004, 9, 530–537. [CrossRef] Hansch, C.; Zhang, L. Quantitative structure-activity relationships of cytochrome P-450. Drug Metab. Rev. 1993, 25, 1–48. [CrossRef] [PubMed] Al-Gailany, K.A.S.; Houston, J.B.; Bridges, J.W. The role of substrate lipophilicity in determining type 1 microsomal P450 binding characteristics. Biochem. Pharmacol. 1978, 27, 783–788. [CrossRef] Ramos-Nino, M.E.; Clifford, M.N.; Adams, M.R. Quantitative structure activity relationship for the effect of benzoic acids, cinnamic acids and benzaldehydes on Listeria monocytogenes. J. Appl. Bacteriol. 1996, 80, 303–310. [CrossRef] [PubMed] Consonni, V.; Todeschini, R. New spectral indices for molecule description. MATCH 2008, 1, 2. Geary, R.C. The contiguity ratio and statistical mapping. Inc. Stat. 1954, 5, 115–146. [CrossRef] Carhart, R.E.; Smith, D.H.; Venkataraghavan, R. Atom pairs as molecular features in structure-activity studies: Definition and applications. J. Chem. Inf. Comput. Sci. 1985, 25, 64–73. [CrossRef] Grisoni, F.; Consonni, V.; Vighi, M.; Villa, S.; Todeschini, R. Investigating the mechanisms of bioconcentration through QSAR classification trees. Environ. Int. 2016, 88, 198–205. [CrossRef] [PubMed] Cassotti, M.; Consonni, V.; Mauri, A.; Ballabio, D. Validation and extension of a similarity-based approach for prediction of acute aquatic toxicity towards Daphnia magna. SAR QSAR Environ. Res. 2014, 25, 1013–1036. [CrossRef] [PubMed] Mansouri, K.; Abdelaziz, A.; Rybacka, A.; Roncaglioni, A.; Tropsha, A.; Varnek, A.; Zakharov, A.; Worth, A.; Richard, A.M.; Grulke, C.M.; et al. CERAPP: Collaborative estrogen receptor activity prediction project. Environ. Health Perspect. 2016. [CrossRef] [PubMed] Gissi, A.; Nicolotti, O.; Carotti, A.; Gadaleta, D.; Lombardo, A.; Benfenati, E. Integration of QSAR models for bioconcentration suitable for REACH. Sci. Total Environ. 2013, 456–457, 325–332. [CrossRef] [PubMed] Fp, G. Role of cytochrome P450 enzymes in drug-drug interactions. Adv. Pharmacol. 1996, 43, 7–35. Walters, W.P.; Stahl, M.T.; Murcko, M.A. Virtual screening—An overview. Drug Discov. Today 1998, 3, 160–178. [CrossRef]

Int. J. Mol. Sci. 2016, 17, 914

64. 65. 66. 67.

68.

69. 70.

19 of 19

James, C.A.; Weininger, D.; Delany, J. Daylight Theory Manual; Daylight Chemical Information Systems, Inc.: Aliso Viejo, CA, USA, 1995; 3951p. Leardi, R.; Lupiáñez González, A. Genetic algorithms applied to feature selection in PLS regression: How and when to use them. Chemom. Intell. Lab. Syst. 1998, 41, 195–207. [CrossRef] Cassotti, M.; Grisoni, F. Variable selection methods: An introduction. Available online: http://www. moleculardescriptors.eu (accessed on 2 February 2016). Sahigara, F.; Ballabio, D.; Todeschini, R.; Consonni, V. Defining a novel k-Nearest neighbours approach to assess the applicability domain of a QSAR model for reliable predictions. J. Cheminform. 2013, 5, 27. [CrossRef] [PubMed] Berthold, M.R.; Cebron, N.; Dill, F.; Gabriel, T.R.; Kötter, T.; Meinl, T.; Ohl, P.; Sieb, C.; Thiel, K.; Wiswedel, B. KNIME: The konstanz information miner. In Data Analysis, Machine Learning and Applications; Preisach, C., Burkhardt, P.D.H., Schmidt-Thieme, P.D.L., Decker, P.D.R., Eds.; Studies in Classification, Data Analysis, and Knowledge Organization; Springer: Berlin, Germany; Heidelberg, Germany, 2008; pp. 319–326. MATLAB, 2015. R2015a; The MathWorks Inc.: Natick, MA, USA, 2015. Milano Chemometrics and QSAR Research Group. Available online: http://michem.disat.unimib.it/chm/ download/datasets.htm (accessed on 3 March 2016). © 2016 by the authors; licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC-BY) license (http://creativecommons.org/licenses/by/4.0/).