Prediction of chemical biodegradability using computational methods

Molecular Simulation

ISSN: 0892-7022 (Print) 1029-0435 (Online) Journal homepage: http://www.tandfonline.com/loi/gmos20

Prediction of chemical biodegradability using computational methods
Zhixiong Zhan, Linlang Li, Sheng Tian, Xuechu Zhen & Youyong Li

To cite this article: Zhixiong Zhan, Linlang Li, Sheng Tian, Xuechu Zhen & Youyong Li (2017): Prediction of chemical biodegradability using computational methods, Molecular Simulation, DOI: 10.1080/08927022.2017.1328556
To link to this article: http://dx.doi.org/10.1080/08927022.2017.1328556


Published online: 12 Jun 2017.



Molecular Simulation, 2017 https://doi.org/10.1080/08927022.2017.1328556

Prediction of chemical biodegradability using computational methods
Zhixiong Zhan (a,*), Linlang Li (b,*), Sheng Tian (b), Xuechu Zhen (b) and Youyong Li (a)

(a) Institute of Functional Nano and Soft Materials (FUNSOM), Soochow University, Suzhou, P.R. China; (b) Jiangsu Key Laboratory of Translational Research and Therapy for Neuro-Psycho-Diseases and College of Pharmaceutical Sciences, Soochow University, Suzhou, P.R. China

ABSTRACT

Biodegradability is a key factor in describing the long-term effects of chemicals as they decompose in the environment. Compared with time-consuming and laborious experimental testing, the use of in silico approaches for assessing chemical biodegradability is strongly encouraged by legislators. In this study, based on an extensive data-set of 547 ready biodegradation (RB) and 1178 non-ready biodegradation (NRB) chemicals, we first examined the differences in important physico-chemical properties and scaffold architectures between the RB and NRB molecules. We found that, compared with the NRB molecules, the RB molecules are usually smaller, more flexible and more hydrophilic, and have fewer polar groups and less complicated structural patterns (ring systems). However, the RB and NRB molecules cannot be well distinguished by any simple property-based or substructure-based rule. The naïve Bayesian classification (NBC) approach was then employed to develop classifiers for discriminating the RB and NRB molecules. Based on the 21 physico-chemical properties, 76 VolSurf descriptors and LPFP_4 structural fingerprints, the Bayesian classifier achieves a sensitivity of 0.877, a specificity of 0.864, a global accuracy of 0.869, a C value (Matthews correlation coefficient) of 0.720 and an AUC value of 0.890 for the training set. The best predictions are achieved by the classifiers based on the combination of simple physico-chemical properties, VolSurf descriptors and LPFP_6 fingerprints for test set I (AUC = 0.921), and on any of three fingerprint classes (ECFC_6, ECFC_8 or LPFC_4) for test set II (AUC = 0.901). In addition, 20 structural fragments favourable and unfavourable for ready biodegradation, which were generated directly from the best naïve Bayesian classifier, are highlighted and discussed. The results provide useful guidelines/tools for designing promising chemical compounds with good biodegradability.

1. Introduction

Microbes remove organic matter from the environment through oxidation, reduction and hydrolysis, which either destroy part of the molecular structure of the organic matter or mineralise it [1,2]. This process, known as biodegradation, is an important route for removing pollutants from the environment [3−5]. The accumulation of persistent or non-readily biodegradable chemicals may pose a threat to human beings and ecosystems [6]. Biodegradability has therefore become one of the most important indicators for organic chemicals [7−9], and it is critical to establish a scientific and effective method for assessing the biodegradation of organics. The assessment of biodegradation is a critical issue for health and environmental regulatory organisations around the world [5], such as the Organisation for Economic Co-operation and Development (OECD), the International Organisation for Standardisation (ISO), the Japanese Ministry of International Trade and Industry (MITI), the National Institute of Technology and Evaluation (NITE), the European Union (EU) and the United States Environmental Protection Agency (US EPA) [10]. According to the Pollution Prevention Act of 1990, pollution

KEYWORDS

Biodegradation; property distribution; scaffold architecture; naive Bayesian classification (NBC); QSAR

should be prevented or reduced at the source whenever possible in the United States. In 2007, Europe introduced the regulation on Registration, Evaluation, Authorisation and Restriction of Chemicals (REACH); under REACH, companies that produce or import more than 1 ton of industrial compounds per year must provide biodegradability information. However, the number of existing chemicals that need to be tested is huge and increases rapidly every year. Experimental assessment of biodegradation for all existing chemicals is therefore a difficult or even impossible task. Compared with experimental assays, theoretical models provide a more rapid and efficient screening platform [10−28]. Theoretical models that combine various molecular descriptors with machine learning approaches to evaluate the biodegradation of chemicals have attracted increasing attention in the past decade [10,22,24]. In 2012, Cheng et al. employed four machine learning methods, namely support vector machine (SVM), k-nearest neighbour (kNN), naive Bayes and C4.5 decision tree, to build classifiers using 1440 diverse chemicals derived from the Japanese Ministry of International Trade and Industry (MITI) [10]. Based on physico-chemical descriptors or molecular fingerprints, the classifiers achieved global accuracies of up

CONTACTS  Sheng Tian  [email protected]; Xuechu Zhen  [email protected]; Youyong Li
*These authors contributed equally to this paper.
Supplemental data for this article can be accessed at https://doi.org/10.1080/08927022.2017.1328556.
© 2017 Informa UK Limited, trading as Taylor & Francis Group

ARTICLE HISTORY

Received 12 February 2017 Accepted 2 May 2017



Table 1. The number of the molecules (RB and NRB) in the training set, test set I and test set II.

Data-set        RB     NRB    Total
Training set    284    553    837
Test set I      72     146    218
Test set II     191    479    670

to 80% and even higher. Subsequently, Mansouri and co-workers argued that the very limited external test set used by Cheng et al. [10], with only 4 RB and 23 NRB molecules, may not be sufficient to evaluate the prediction capacities of the classifiers. With these considerations in mind, in 2013 Mansouri et al. compiled a comprehensive data-set of 356 RB and 699 NRB molecules from the National Institute of Technology and Evaluation (NITE) in Japan. The whole data-set was split into a training set of 837 molecules and a test set of 218 molecules [22], and an external validation set of 191 RB and 479 NRB molecules was used to evaluate the prediction abilities of the classifiers. The best classifier, built using the 'consensus 2' scheme based on the predictions of the kNN, partial least squares discriminant analysis (PLSDA) and SVM techniques, yielded a sensitivity (SE) of 0.88 and a specificity (SP) of 0.94 for the test set, and an SE of 0.81 and an SP of 0.94 for the external validation set. In 2014, Cao and co-workers proposed a DE-SVM classifier for the prediction of chemical biodegradability by coupling the differential evolution (DE) algorithm with a support vector classifier (SVC); the DE-SVM classifier exhibited better classification performance than the models previously used for such studies [24]. Overall, the reported classification models still have obvious limitations, such as limited data-sets for training and testing, questionable prediction accuracy and applicability domain, and a poorly understood mechanism of chemical biodegradation [10,22,24].

In this study, based on an extensive data-set of 1725 chemicals prepared by Mansouri et al. [22], the differences in simple physico-chemical properties and structural features between RB and NRB chemicals were systematically compared. Then, three representative structure partitioning methods, namely the Murcko framework [29], the Scaffold Tree [30] and a scheme based on different complexities of ring combinations and side chains, were used to characterise the structural features of the studied molecules. Finally, the naïve Bayesian classifier (NBC) technique was employed to develop classifiers that distinguish RB from NRB chemicals. The impact of different molecular properties and structural fingerprints on the prediction accuracy of the classifiers was systematically investigated. The results provide deeper insight into the inherent differences between RB and NRB molecules and may also give useful guidelines/tools for designing promising chemical compounds with good biodegradability.

2. Materials and methods

2.1. Preparation of the data-sets

The experimental data for RB and NRB chemicals from the MITI test were collected from the NITE of Japan, as described in the previous study [22]. The MITI test measures ready biodegradation by the biochemical oxygen demand (BOD) in an aerobic aqueous medium over 28 days; a compound with a BOD value higher than 60% was considered to be an RB chemical, whereas the others were labelled as NRB chemicals [10,21]. The chemicals were carefully preprocessed as described by Mansouri and co-workers (Table 1) [22]. Finally, 1055 chemicals (356 RB and 699 NRB molecules) were retained. A total of 284 RB and 553 NRB chemicals (about 80%), randomly selected from the data-set, served as the training set, and the remaining chemicals (72 RB and 146 NRB) served as test set I [22]. In addition, test set II (191 RB and 479 NRB chemicals) collected by Mansouri [10,22] was also used in our study. All chemicals in the three data-sets were optimised by molecular mechanics (MM) with the MMFF94 force field in the SYBYL-X simulation package (version 1.1) for the following analysis [24].

2.2. Calculations of molecular descriptors and structural fingerprints

In the current study, 32 molecular physico-chemical property descriptors, which are extensively used in QSAR modelling [31−43], were considered: the octanol-water partition coefficient (AlogP) based on Ghose and Crippen’s method [44], the apparent partition coefficient at pH 7.4 (logD7.4) based on Csizmadia’s method, molecular solubility (logS) based on Tetko’s multiple linear regression model [45], molecular weight (MW), the number of rotatable bonds (Nrot), the number of rings in the smallest set of smallest rings (NRings), the number of aromatic rings in the smallest set of smallest rings (NAR), the number of hydrogen bond acceptors (NHBA), the number of hydrogen bond donors (NHBD), molecular polar surface area (MPSA), molecular surface area (MSA), the number of carbon atoms (NC), the number of nitrogen atoms (NN), the number of oxygen atoms (NO), the number of atoms (NAtom), the number of bonds (NBonds), the number of bridgehead atoms that connect a bridge to a ring (NBHA), the number of bonds in rings (NRingBonds), the number of bonds in aromatic ring systems (NAromaticBonds), the number of bonds in bridgehead ring systems (NBridgeBonds), the number of ring assemblies
(NRA), the numbers of three-, four-, five-, six-, seven- and eight-membered rings (NR3, NR4, NR5, NR6, NR7 and NR8), the number of rings with nine or more members (NR9+), the number of unbranched chains needed to cover all the non-ring bonds in the molecule (NChains), the number of chain assemblies (NChainA), the number of hydrogen bond acceptors as defined in Lipinski’s Rule-of-five (NHBAL) [46], and the number of hydrogen bond donors as defined in Lipinski’s Rule-of-five (NHBDL) [46]. All these molecular descriptors were calculated using the Discovery Studio molecular simulation package (version 2.5, DS2.5) [33]. The descriptors are summarised in Table S1 in the Supporting Information. In addition, a set of 76 VolSurf descriptors calculated by the MOE simulation package [47], which have proved helpful for improving classification accuracy [48], were also used in our study. Then, the SciTegic extended-connectivity fingerprints (ECFC, ECFP, FCFC, FCFP, LCFC and LCFP), Daylight-style path-based fingerprints (EPFC, EPFP, FPFC, FPFP, LPFC and LPFP) and the atom environment fingerprint (SEFP) developed by Bender et al. [49] were used to characterise the substructural patterns of the RB and NRB molecules. The name of each fingerprint class studied here consists of four letters followed by a number, which is the maximum diameter (in bond lengths) of the largest


structure represented by the fingerprint or the maximum length of the path. For each fingerprint class, three maximum diameters (4, 6 and 8) were considered in this study. Detailed discussions can be found in our previous studies [48,50−55]; all molecular fingerprints were generated using DS2.5 [33].

2.3. Generation of scaffold architectures

The representative scaffolds for the RB and NRB chemicals were generated by three scaffold representations: the Murcko frameworks developed by Bemis [29] for depicting the molecular frame structure, the Scaffold Tree proposed by Schuffenhauer [30] for characterising the cyclic substructures of molecules, and the occurrence of cyclic substructures at different complexity levels, including simple rings, ring assemblies [56,57] and bridged assemblies. Moreover, the side chains [58] attached to the Murcko frameworks were also examined in our study. First, the ring systems at different complexity levels, including Murcko frameworks (Figure 2(a)), ring assemblies (Figure 2(c)), bridge assemblies (Figure 2(d)) and rings (Figure 2(e)), were generated using the Generate Fragments component in Pipeline Pilot 7.5. The side chains (Figure 2(b)) [58], which are considered important for improving synthetic accessibility and reducing the metabolism and toxicity of the studied chemicals, were also generated using the Generate Fragments component in Pipeline Pilot 7.5. Secondly, the Scaffold Tree representations were produced by hierarchically generating different levels of the ring systems of each molecule [30]. By removing rings according to predefined prioritisation rules, each molecule is pruned into ever smaller substructures until the remaining fragment contains only one ring system. Thus, we obtain a list of ring systems at different levels of the Scaffold Tree for each molecule (Figure 2(f)). The root node, which is also the original molecule in the tree, is named level 0, and the subsequent nodes are numbered consecutively.
As shown in Figure 2(f), the scaffolds become more complicated as the level of the Scaffold Tree increases. To choose representative scaffolds that balance molecular complexity and diversity when analysing the differences between the RB and NRB molecules, levels 0, 1 and 2 of the Scaffold Tree were used in our study, following Langdon’s work and our previous study [52,59]. The Scaffold Tree for each RB or NRB chemical was generated using the Linear Fragmentation Function in the MOE simulation package [47]. A script written in SVL (Scientific Vector Language) was applied to the SDF files (RB and NRB molecules), and the level 0, 1 and 2 scaffolds were generated and retained for further analysis.

2.4. Scaffold diversity analysis and tree maps generation

To analyse the structural differences between the RB and NRB molecules, their scaffold diversities were compared. In this study, the scaffold diversity of the RB and NRB molecules was characterised by two types of diversity measurement: the distribution of molecules over the unique scaffolds present in the studied data-sets, and the structural diversity of the scaffolds. The scaffold counts and the cumulative scaffold frequencies (CSF) were used to measure the distribution of molecules over the unique scaffolds that appeared in each class


(RB or NRB). Here, the numbers of representative scaffolds, including Murcko frameworks, side chains, ring assemblies, bridge assemblies, rings and levels 0, 1 and 2 of the Scaffold Tree, were counted for each class. Then, the percentage cumulative scaffold frequency (CSF) for each class was plotted by sorting the scaffolds from the most to the least frequent. Unlike the traditional representation of tree structures as a directed graph, with the root node at the top and child nodes connected by lines below their parent node, the Tree Maps proposed by Shneiderman adopt a 2D space-filling approach and use circles or rectangles to represent designated properties of molecules for clear, intuitive visualisation [60]. Tree Maps have been used to visualise hierarchical clustering by organising molecular data on the basis of the similarity between chemical structures, or the similarity across a predefined profile of biological assay values, and to prepare visual representations of molecular structure hierarchies alongside activity information. Here, we used Tree Maps to analyse the structural diversity of different scaffold architectures. The frequency of each scaffold is represented by the colour and area of its circle. Tree Maps can highlight both scaffold structural diversity and the distribution of compounds over scaffolds. The Cluster Molecules component in Pipeline Pilot 7.5 was used to cluster the scaffold architectures of the molecules based on the ECFP_6 fingerprints, with the average number of compounds per cluster set to five. This protocol randomly selects a molecule from the data-set as the first cluster centre and then selects the remaining cluster centres so as to maximise their dissimilarity to the first cluster centre and to each other. After the cluster centres are chosen, each remaining molecule is assigned to a cluster based on its similarity to the centre molecules.
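The centre-selection and assignment steps described above can be illustrated with a minimal sketch. It is not the Pipeline Pilot implementation, whose internals are not given in the text. It assumes that each fingerprint is a plain Python set of integer feature identifiers, that similarity is the Tanimoto coefficient, and that the number of clusters is the data-set size divided by the requested average cluster size. The function names `tanimoto` and `max_dissimilarity_clusters` are hypothetical.

```python
import random


def tanimoto(a, b):
    """Tanimoto similarity between two fingerprints given as sets of feature ids."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def max_dissimilarity_clusters(fps, avg_cluster_size=5, seed=0):
    """Pick cluster centres by maximum dissimilarity, then assign every item.

    fps: list of sets (assumed fingerprint representation).
    Returns (centres, assignment): centres holds indices into fps, and
    assignment[i] is the position in `centres` of the cluster of item i.
    """
    if not fps:
        return [], []
    n_clusters = max(1, round(len(fps) / avg_cluster_size))
    rng = random.Random(seed)
    # First centre: a randomly selected item.
    centres = [rng.randrange(len(fps))]
    # nearest[i] = highest similarity of item i to any centre chosen so far.
    nearest = [tanimoto(fp, fps[centres[0]]) for fp in fps]
    while len(centres) < n_clusters:
        candidates = [i for i in range(len(fps)) if i not in centres]
        # Next centre: the item least similar to all existing centres.
        nxt = min(candidates, key=lambda i: nearest[i])
        centres.append(nxt)
        for i, fp in enumerate(fps):
            nearest[i] = max(nearest[i], tanimoto(fp, fps[nxt]))
    # Assign each item to the centre it is most similar to.
    assignment = []
    for fp in fps:
        sims = [tanimoto(fp, fps[c]) for c in centres]
        assignment.append(max(range(len(centres)), key=lambda k: sims[k]))
    return centres, assignment
```

With real data, the sets would be replaced by the ECFP_6 feature sets of the scaffolds; the greedy centre selection costs one similarity pass over the data-set per centre.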
After clustering, each scaffold has a cluster number (1, 2, 3, etc.) indicating the cluster that it belongs to, a cluster centre flag (1 or 0) indicating whether or not it is a cluster centre, and a cluster size giving the total number of scaffolds that belong to the same cluster.

2.5. Development of naïve Bayesian classifiers (NBC)

Compared with many other machine learning approaches, the NBC technique can handle large amounts of data, learns fast and is tolerant of random noise during model building. In addition, NBC needs only a small training set to estimate the parameters required for classification (the means and variances of the variables). In our study, each compound in the data-set is categorised into the RB (+) or NRB (−) class, and a vector f = (f1, f2, …, fn) was prepared, where f1, f2, …, fn are the calculated values of the n feature variables F1, F2, …, Fn (simple physico-chemical properties, VolSurf descriptors and molecular fingerprints). Then, based on Bayes’ theorem, we obtain:

$$p(C \mid F_1, F_2, \cdots, F_n) = \frac{p(C)\, p(F_1, \cdots, F_n \mid C)}{p(F_1, \cdots, F_n)} \qquad (1)$$

In Equation (1), C refers to a compound’s class, p(C|F1, ⋯, Fn) is the posterior probability of the compound class, p(C) is the prior probability, which is induced from the training set, p(F1, ⋯, Fn|C) is the probability of observing the descriptors given that the compound is ready biodegradable or not ready biodegradable, and


p(F1, ⋯, Fn) is the marginal probability that the given molecular descriptors appear in the data-set. The three probabilities on the right-hand side of Equation (1) can be learned from a training set comprising a number of RB and NRB molecules. The mathematical procedure for training a naïve Bayesian classifier was described in previous studies [48,50,51,53−55]. All naïve Bayesian classifiers were developed in DS2.5 [33].

2.6. Validation of Bayesian models for RB/NRB classification

For all naïve Bayesian classifiers, the true positives (TP), true negatives (TN), false positives (FP) and false negatives (FN) were counted. The performance of each naïve Bayesian classifier was evaluated by sensitivity, SE = TP/(TP + FN); specificity, SP = TN/(TN + FP); prediction accuracy of RB, Q+ = TP/(TP + FP); prediction accuracy of NRB, Q− = TN/(TN + FN); global accuracy, GA = (TP + TN)/(TP + TN + FP + FN); and the Matthews correlation coefficient, $C = \frac{TP \times TN - FN \times FP}{\sqrt{(TP + FN)(TP + FP)(TN + FN)(TN + FP)}}$.

Furthermore, the prediction accuracy was evaluated by AUC, which is the area under the receiver operating characteristic (ROC) curve.
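As an illustration of the validation measures defined in this section, the following minimal sketch computes SE, SP, Q+, Q−, GA and C from the four confusion-matrix counts, and estimates AUC with the rank-based (Mann-Whitney) formulation rather than by explicitly integrating the ROC curve. The function names and the example counts are hypothetical and are not taken from the study.

```python
import math


def classification_metrics(tp, tn, fp, fn):
    """Return SE, SP, Q+, Q-, GA and the Matthews coefficient C."""

    def ratio(num, den):
        # Undefined ratios (zero denominator) are reported as NaN.
        return num / den if den else float("nan")

    mcc_den = math.sqrt((tp + fn) * (tp + fp) * (tn + fn) * (tn + fp))
    return {
        "SE": ratio(tp, tp + fn),   # sensitivity
        "SP": ratio(tn, tn + fp),   # specificity
        "Q+": ratio(tp, tp + fp),   # prediction accuracy of RB
        "Q-": ratio(tn, tn + fn),   # prediction accuracy of NRB
        "GA": ratio(tp + tn, tp + tn + fp + fn),  # global accuracy
        "C": ratio(tp * tn - fn * fp, mcc_den),   # Matthews coefficient
    }


def auc_from_scores(scores, labels):
    """AUC as the probability that a positive outranks a negative.

    labels use 1 for RB (positive) and 0 for NRB (negative); ties count 0.5.
    """
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("both classes must be present")
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p in pos
        for n in neg
    )
    return wins / (len(pos) * len(neg))


if __name__ == "__main__":
    # Hypothetical counts, for illustration only.
    print(classification_metrics(tp=80, tn=90, fp=10, fn=20))
    print(auc_from_scores([0.9, 0.8, 0.3, 0.1], [1, 0, 1, 0]))
```

The pairwise AUC estimate is quadratic in the number of molecules, which is acceptable for data-sets of the size used here; a sort-based implementation would be preferable for much larger sets.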

3. Results and discussions

3.1. Important property distributions for RB and NRB molecules

To understand the relationships between essential molecular physico-chemical properties and biodegradation, the distributions of eight important molecular properties for all the RB and NRB molecules in the training set and the two test sets (Table 1) are depicted in Figure 1. These eight properties are MW, logS, AlogP, logD, MPSA, Nrot, NHBD and NHBA. The significance of the difference between the means of the RB and NRB molecules for each of the eight properties was evaluated by Student’s t-test. As can be seen in Figure 1, none of the investigated properties except MW can separate the RB from the NRB molecules effectively, as indicated by the quite large P values (>10^−20). Although the P value associated with the difference between the means of the two molecular weight distributions is 3.09 × 10^−32, the MW distribution of the RB molecules overlaps heavily with that of the NRB molecules and is skewed towards lower molecular weights (Figure 1). The molecular weights are distributed between 31.06 and 975.38, with a mean value of 177.54, for the RB molecules, and between 32.05 and 1626.23, with a mean value of 271.57, for the NRB molecules. The NRB molecules are thus usually bigger than the RB molecules, but a simple MW-based rule cannot serve as an effective filter to distinguish the RB from the NRB molecules. Of the molecular properties examined in this study, three (AlogP, logD7.4 and logS) are closely related to the hydrophobicity of a molecule. As shown in Figure 1, the AlogP and logD distributions of the NRB molecules are skewed towards higher values, whereas the logS distribution of the NRB molecules is shifted to lower values. For example, the mean values of AlogP

of the RB and NRB molecules are 2.52 and 3.36 (Table 2), respectively, suggesting that the RB molecules are more hydrophilic than the NRB molecules. Three physico-chemical properties, MPSA, NHBA and NHBD, measure the H-bonding capacity of a molecule. The average MPSA values of the RB and NRB molecules are 37.96 and 51.78, respectively. As shown in Figure 1, the MPSA distributions of NRB and RB overlap substantially, and the NRB distribution tends towards higher MPSA values. As listed in Table 2, the mean values of NHBA for NRB and RB are 2.89 and 2.23, respectively, and the mean value of NHBD for NRB (0.92) is slightly higher than that for RB (0.82). We then calculated the elemental composition of the molecules (Table 2). On average, the NRB molecules have many more nitrogen atoms than the RB molecules (1.27 vs. 0.29), whereas the numbers of oxygen atoms for NRB and RB are comparable (1.95 vs. 2.03). The mean sums of oxygen and nitrogen atoms for the NRB and RB molecules are 3.22 and 2.32, respectively. Thus, we can conclude that the NRB molecules have more polar groups/moieties that can serve as H-bond acceptors. The distributions of the number of rotatable bonds (Nrot) for the two classes are shown in Figure 1. The P value associated with the difference between the means of the two Nrot distributions is only 5.20 × 10^−7, indicating that the number of rotatable bonds cannot be employed as a useful filter for discriminating the RB from the NRB molecules. The average Nrot values for the RB and NRB molecules are 5.72 and 4.08, respectively, suggesting that the RB molecules are more flexible than the NRB molecules. In addition, by analysing the parameters related to ring systems, we observe that the NRB molecules have much more complicated molecular patterns.
For example, the mean numbers of rings and aromatic rings for the RB molecules are 0.54 and 0.39, whereas those for the NRB molecules are 1.60 and 1.22. This suggests that almost half of the RB molecules are purely acyclic compounds. The mean values of the ring-system descriptors of different complexity for the RB molecules are also considerably lower than those for the NRB molecules, suggesting that the NRB molecules are usually more rigid than the RB molecules. In summary, according to the statistical results (Table 2 and Figure 1), the differences in the 32 studied physico-chemical properties between the RB and NRB molecules are, with the exception of MW, not significant by the criterion used here (P values > 10^−20). Thus, we cannot anticipate that the RB molecules can be easily separated from the NRB molecules using any simple property-based rule. However, our results still provide some useful clues. For example, the NRB molecules are usually bigger and more rigid, and have more complicated structural patterns. This suggests that, to obtain compounds with better biodegradability, candidate chemicals should preferably be smaller (lower MW), more hydrophilic and more flexible, with fewer polar groups/moieties and less complicated patterns (ring systems).

3.2. Analysis of scaffold diversity and similarity between RB and NRB molecules

As the RB molecules cannot be discriminated from the NRB molecules effectively by simple molecular physico-chemical


Figure 1. (Colour online) The distributions of eight important molecular physico-chemical properties for the RB and NRB molecules.

properties, we next examined whether the typical structural architectures of the RB and NRB molecules differ. First, the representative structures/scaffolds of the molecules, including Murcko frameworks, ring assemblies, bridge assemblies, rings and ring systems of different complexity levels generated by the Scaffold Tree (Figure 2), were counted and compared for the RB and NRB molecules (Tables 3 and 4, and Figures 3 and 4). As can be seen from Table 3, the percentages


Table 2. The max, min and mean values of different physico-chemical properties for the RB and NRB chemicals.

                           RB                                      NRB
No.  Descriptor            Max      Min      Mean                  Max       Min      Mean
1    AlogP                 22.19    −6.00    2.52 ± 3.55 (a)       22.18     −7.69    3.36 ± 2.74
2    LogD7.4               22.19    −5.27    2.14 ± 3.75           22.18     −8.44    3.06 ± 2.91
3    LogS                  2.32     −26.59   −2.87 ± 3.81          4.03      −27.19   −4.25 ± 3.28
4    MW                    975.38   31.06    177.54 ± 113.18       1626.23   32.05    271.57 ± 165.14
5    Nrot                  54       0        5.72 ± 7.76           68        0        4.08 ± 5.47
6    NRings                6        0        0.54 ± 0.73           10        0        1.60 ± 1.47
7    NAR                   3        0        0.39 ± 0.61           7         0        1.22 ± 1.16
8    NHBA                  13       0        2.23 ± 1.70           18        0        2.89 ± 2.65
9    NHBD                  6        0        0.82 ± 1.01           16        0        0.92 ± 0.21
10   MPSA                  233.12   0        37.96 ± 28.85         399.78    0        51.78 ± 46.81
11   MSA                   991.83   49.49    195.10 ± 124.40       1319.41   52.04    256.03 ± 143.65
12   NC                    57       1        9.87 ± 7.53           73        0        12.65 ± 8.37
13   NN                    6        0        0.29 ± 0.66           10        0        1.27 ± 1.70
14   NO                    12       0        2.03 ± 1.66           18        0        1.95 ± 2.10
15   NAtom                 69       2        12.35 ± 7.94          85        2        17.32 ± 10.43
16   NBonds                74       1        11.89 ± 8.10          88        1        17.92 ± 11.39
17   NBHA                  2        0        0.02 ± 0.19           10        0        0.07 ± 0.57
18   NRingBonds            18       0        3.07 ± 3.97           52        0        9.00 ± 7.78
19   NAromaticBonds        18       0        2.33 ± 3.62           40        0        7.18 ± 6.74
20   NBridgeBonds          10       0        0.08 ± 0.80           30        0        0.26 ± 1.94
21   NRA                   6        0        0.50 ± 0.66           8         0        1.24 ± 1.00
22   NR3                   6        0        0.02 ± 0.28           4         0        0.03 ± 0.26
23   NR4                   1        0        0.01 ± 0.07           4         0        0.01 ± 0.14
24   NR5                   2        0        0.07 ± 0.28           4         0        0.16 ± 0.52
25   NR6                   3        0        0.45 ± 0.65           9         0        1.38 ± 1.26
26   NR7                   1        0        0.00 ± 0.04           1         0        0.00 ± 0.04
27   NR8                   0        0        0                     1         0        0.01 ± 0.07
28   NR9+                  0        0        0                     2         0        0.01 ± 0.12
29   NChains               12       0        2.46 ± 1.44           35        0        4.74 ± 3.79
30   NChainA               7        0        1.39 ± 0.81           18        0        2.84 ± 2.16
31   NHBAL                 15       0        2.33 ± 1.77           24        0        3.21 ± 3.05
32   NHBDL                 7        0        0.89 ± 1.14           16        0        1.10 ± 1.47

Note: (a) The value after ± represents the standard deviation (SD).

of the molecules that have level 0 scaffolds for RB and NRB are 42.23 and 78.35%, respectively. Moreover, the percentages of molecules with Murcko frameworks for the RB and NRB molecules are 43.51 and 78.69%, respectively (Table 4). The results indicate that more than half of the RB molecules do not have any ring system and are purely acyclic compounds. We also found that the number of scaffolds decreases rapidly as the level of the Scaffold Tree increases, for both the RB and NRB molecules. The percentages of the level 1 scaffolds for the RB and NRB molecules are 8.22% (45) and 41.60% (490), respectively, which demonstrates that more than 90% of the RB molecules have one or zero ring systems. At level 3 of the Scaffold Tree, the number of scaffolds for the RB molecules is zero, whereas that for the NRB molecules is 111 (9.42%). For the NRB molecules, scaffolds are found up to level 9 (Table 3), indicating that the RB molecules are simpler and more flexible than the NRB molecules. As shown in Figures 3 and 4, the numbers of non-duplicated scaffolds for the different ring systems, including the different levels of the Scaffold Tree, Murcko frameworks, rings, ring assemblies and bridge assemblies, are also quite limited for the RB and NRB molecules. For example, the percentages of non-duplicated level 0 scaffolds for the RB and NRB molecules are only 7.13 and 8.57% (Table 3 and Figure 3), respectively. In addition, the numbers of non-duplicated Murcko frameworks present in the RB and NRB molecules are 63 (11.52%) and 297 (25.21%), clearly indicating that the structural diversity of the NRB molecules is much higher than that of the

RB molecules. According to the statistical results listed in Tables 3 and 4, the numbers of side chains for the RB and NRB molecules are 761 and 3348, respectively; the numbers of unique side chains present in the RB and NRB molecules are also very high (364 vs. 510). As discussed above, the number of Murcko frameworks for the RB molecules is 238, of which 63 are unique. We may therefore conclude that most RB molecules are combinations of a quite limited set of core structural patterns/architectures and a variety of side chains with good structural diversity. Then, the cumulative frequencies of the molecules with the level 0, 1 and 2 scaffolds generated from the Scaffold Tree, and with the Murcko frameworks, were computed for the RB and NRB molecules. By sorting the frequencies of the respective scaffolds, the cumulative scaffold frequencies (CSF) for the available scaffolds at levels 0, 1 and 2, and for the Murcko frameworks, of the RB and NRB molecules are depicted in Figure 5. As shown in Figure 5(b) and (c), the level 1 and 2 scaffolds cannot represent more than half of the RB or NRB molecules. For example, all level 1 scaffolds (29 vs. 149, Table 3) for the RB and NRB molecules represent no more than 2 and 20% of the corresponding data-sets. Meanwhile, the available level 2 scaffolds are found in no more than 10% of the RB and 50% of the NRB molecules. In contrast, the level 0 scaffolds and Murcko frameworks are found in more than 40% of the RB and NRB molecules (Figure 5(a) and (d)). The numbers of level 0 scaffolds for the RB and NRB molecules are 231 and 923, respectively, and those of the unique ones are only 39 (7.13%)


Figure 2. (Colour online) A molecule depicted by different scaffold generation methods. The molecule codeine is dissected into (a) Murcko framework: the union of ring systems and linkers of a molecule; (b) side chains: the combination of any non-ring and non-linker atoms; (c) ring assemblies: contiguous ring systems; (d) bridge assemblies: contiguous ring systems sharing two or more bonds; (e) rings: individual rings; and (f) the scaffolds in different levels generated from Scaffold Tree.

Table 3. The number of the scaffolds at the different levels of the Scaffold Tree for the 547 RB molecules and 1178 NRB molecules.

          No. of scaffolds                No. of non-duplicated scaffolds
Level     RB             NRB              RB            NRB
Level 0   231 (42.23%)   923 (78.35%)     39 (7.13%)    101 (8.57%)
Level 1   45 (8.22%)     490 (41.60%)     29 (5.30%)    149 (12.65%)
Level 2   8 (1.46%)      233 (19.78%)     6 (1.10%)     129 (10.95%)
Level 3   0              111 (9.42%)      0             77 (6.54%)
Level 4   0              38 (3.22%)       0             31 (2.63%)
Level 5   0              16 (1.36%)       0             14 (1.19%)
Level 6   0              12 (1.02%)       0             10 (0.85%)
Level 7   0              9 (0.76%)        0             7 (0.59%)
Level 8   0              8 (0.68%)        0             6 (0.51%)
Level 9   0              1 (0.08%)        0             1 (0.08%)

and 101 (8.57%), respectively (Table 3). The level 0 scaffolds (Figure 2(f)), which are single ring systems generated from the Scaffold Tree, are quite simple. In addition, the statistics show that the most frequent level 0 scaffold for both RB and NRB is the benzene ring. The most frequent level 0 scaffold occurs 156 times in the RB molecules and 602 times in the NRB molecules, whereas the second most frequent one can only be found

in 10 RB and 31 NRB molecules. The results demonstrate that the level 1 scaffolds are not typical substructures for measuring the structural differences between the RB and NRB molecules. More than 50% of the molecules in each class (RB or NRB) can be represented by the 50 most frequently occurring Murcko frameworks. Thus, the Murcko frameworks, which balance structural complexity against the typical features of all molecules, are the most suitable for characterising the structural differences between the RB and NRB molecules. The similarity of the Murcko frameworks of the RB and NRB molecules was then evaluated based on the ECFP_6 fingerprints. The numbers of similar Murcko frameworks for RB and NRB are listed in Table 5 and depicted in Figure 6. As shown in Table 5, when the similarity cut-off is set to 0.5, 36 Murcko frameworks generated from the RB molecules are similar to those generated from the NRB molecules, and 42 Murcko frameworks in the NRB molecules are similar to those in the RB molecules. This means that about 57.14% (36/63) of the Murcko frameworks found in the RB molecules are similar to those found in the NRB molecules, whereas only about 14.14% (42/297) of the Murcko frameworks found in the NRB molecules are similar to those found in the RB molecules (Table 5).
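
The counts in Table 5 can in principle be obtained by scoring each Murcko framework of one class against all frameworks of the other class and counting those whose best match reaches the cut-off. The following is a minimal sketch under stated assumptions: the Tanimoto coefficient is assumed as the similarity measure (this section does not name it), fingerprints are represented as plain sets of feature identifiers, and the toy feature sets and the helper names `tanimoto` and `count_similar` are illustrative rather than taken from the study.

```python
def tanimoto(a, b):
    """Tanimoto coefficient between two fingerprints given as sets of feature IDs."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def count_similar(query, reference, cutoff):
    """Count query fingerprints whose best match in `reference` reaches `cutoff`."""
    return sum(
        1 for q in query
        if any(tanimoto(q, r) >= cutoff for r in reference)
    )


# Toy feature sets standing in for ECFP_6 fingerprints of Murcko frameworks
# (hypothetical data, not from the paper).
rb_frameworks = [frozenset({1, 2, 3}), frozenset({4, 5}), frozenset({6, 7, 8, 9})]
nrb_frameworks = [frozenset({1, 2, 3}), frozenset({4, 5, 10}), frozenset({11, 12})]

for cutoff in (1.0, 0.5, 0.0):
    rb_in_nrb = count_similar(rb_frameworks, nrb_frameworks, cutoff)
    nrb_in_rb = count_similar(nrb_frameworks, rb_frameworks, cutoff)
    print(f"cut-off >= {cutoff}: RB in NRB = {rb_in_nrb}, NRB in RB = {nrb_in_rb}")
```

At a cut-off of 0 every framework counts as similar, which is why the last row of Table 5 equals the totals of 63 and 297 unique Murcko frameworks.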


Table 4. The numbers of the Murcko frameworks, ring assemblies, rings, bridge assemblies and side chains present in the 547 RB and 1178 NRB chemicals.

                        No. of scaffolds               No. of non-duplicated scaffolds
Scaffold architecture   RB             NRB             RB             NRB
Murcko frameworks       238 (43.51%)   927 (78.69%)    63 (11.52%)    297 (25.21%)
Ring assemblies         273 (49.91%)   1464 (1.24)     47 (8.59%)     165 (14.01%)
Rings                   298 (54.48%)   1886 (1.60)     40 (7.31%)     131 (11.12%)
Bridge assemblies       5 (0.91%)      29 (2.46%)      4 (0.73%)      14 (1.19%)
Side chains             761 (1.39)     3348 (2.84)     364 (66.54%)   510 (43.29%)

Figure 3. (Colour online) The numbers of the scaffolds at the different levels of the Scaffold Tree for the RB and NRB molecules.

Figure 4. (Colour online) The numbers of the Murcko frameworks, ring assemblies, rings, bridge assemblies and side chains for the RB and NRB molecules.

Then, as shown in Figure 6, as the similarity cut-off decreases, the number of Murcko frameworks in the RB molecules that are similar to those in the NRB molecules increases slowly, whereas the number of Murcko frameworks in the NRB molecules that are similar to those in the RB molecules increases rapidly. In addition, the number of the Murcko frameworks in the RB molecules that are similar to those in the NRB molecules is approximately equal to the number of the

Murcko frameworks in the NRB molecules that are similar to those in the RB molecules (similarity cut-off >0.4). In general, our observations show that for most specific structural signatures (Murcko frameworks) in the RB molecules, similar Murcko frameworks can also be found in the NRB molecules. Finally, Tree Maps were used to visually display the structural diversity of the scaffolds within each data-set (RB or NRB). The Murcko frameworks of the RB and NRB molecules were chosen as the typical structural patterns and clustered based on the structural similarity measured by the ECFP_6 fingerprints. With the average number of compounds per cluster set to five, the Tree Maps for the Murcko frameworks of the RB and NRB molecules are illustrated in Figures 7 and 8, respectively. In each Tree Map, a grey circle represents an independent cluster (labelled with a red number), and its size is proportional to the share of the Murcko frameworks in that cluster among all Murcko frameworks of the RB or NRB molecules. Within each grey circle, the size of each coloured circle is determined by the frequency of the corresponding Murcko framework, which is also marked with a black number: the bigger the circle, the more frequent the Murcko framework within its cluster. The largest coloured circle within each grey circle therefore marks the most frequent Murcko framework of that cluster. The 2-D structures for the six largest clusters (grey circles) with the highest frequencies in the RB and NRB molecules are also depicted in Figures 7 and 8. As can be seen from Figures 7 and 8, the Tree Map for the NRB molecules has more cluster centres (grey circles): the numbers of cluster centres for the RB and NRB molecules are 13 and 60, respectively. This means that the Murcko frameworks in the NRB molecules have a sparser structural distribution than those in the RB molecules. 
In other words, the structural domain of the RB molecules is quite conservative, and the NRB molecules have higher structural diversity than the RB molecules. Furthermore, we found obvious differences between the most frequently occurring Murcko frameworks of the RB and NRB molecules. As shown in Figure 7, the most frequently occurring Murcko framework of the largest grey circle is benzene (cluster centre: 10), and the other frequently occurring Murcko frameworks are usually simple single ring systems, including cyclohexane (cluster centre: 6), tetrahydrofuran (cluster centre: 3) and 1,3-dioxolan-2-one (cluster centre: 13), with the exception of biphenyl (cluster centre: 9) and tribenzyl phosphite (cluster centre: 1). Compared with the most frequent Murcko frameworks in the RB molecules (Figure 7), those in the NRB molecules are more complicated (Figure 8). The most frequently occurring Murcko framework for NRB is also benzene, with a frequency of 321 (cluster centre: 56). The other frequently occurring Murcko frameworks are ring systems that combine two six-membered rings. Our observations are consistent with the conclusions obtained from the property comparisons between the RB and NRB molecules in the previous section. In summary, the structural differences between the representative substructures were systematically investigated, and the statistics show that, compared with the NRB molecules, the RB molecules are simpler and have quite limited structural diversity. In addition, most RB molecules are either pure acyclic compounds or combinations of a limited set of core structures/architectures with modified side chains. Then, the Murcko frameworks were used to evaluate the structural


Figure 5. (Colour online) Cumulative frequencies of the Scaffolds (CSF) at the levels 0/1/2 of Scaffold Tree and Murcko frameworks.

Table 5. The number of the Murcko frameworks of RB/NRB that are similar to those in NRB/RB based on different similarity cut-offs of the ECFP_6 fingerprints.

Similarity   RB in NRB   NRB in RB
=1           29          32
≥0.9         30          32
≥0.8         32          34
≥0.7         34          36
≥0.6         34          37
≥0.5         36          42
≥0.4         46          63
≥0.3         54          122
≥0.2         63          213
≥0.1         63          292
≥0           63          297

differences between the RB and NRB molecules, and the similarity comparison shows that for most specific structural signatures in the RB molecules, similar Murcko frameworks can also be found in the NRB molecules. Finally, the structural diversity of the representative scaffolds (Murcko frameworks) for each class (RB or NRB) was visually displayed using Tree Maps. The results suggest that the RB molecules have less complicated structural patterns.

3.3. Naïve Bayesian classifiers

The results provided by the previous sections suggest that the RB and NRB molecules cannot be effectively classified by

Figure 6. (Colour online) The number of the Murcko frameworks in RB/NRB that are similar to those in NRB/RB based on different similarity cut-offs of the ECFP_6 fingerprints.

simple property-based or substructure-based rules. Thus, the naïve Bayesian classifier (NBC) technique was employed to develop classification models that discriminate the RB from the NRB molecules.
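
To make the classification idea concrete, the following is a minimal sketch of a naïve Bayesian classifier over binary structural features. It is a generic Bernoulli naïve Bayes with add-one smoothing, written as an assumption for illustration only; this section does not describe the software or the exact Bayesian variant used in the study. The feature names, the toy training molecules and the function names `train_nb`, `log_posteriors` and `predict_nb` are all hypothetical.

```python
import math
from collections import Counter


def train_nb(samples, labels):
    """Fit a Bernoulli naive Bayes model on sets of binary fingerprint features."""
    classes = sorted(set(labels))
    vocab = sorted(set().union(*samples))
    n_per_class = Counter(labels)
    feature_counts = {c: Counter() for c in classes}
    for features, label in zip(samples, labels):
        feature_counts[label].update(features)
    log_prior = {c: math.log(n_per_class[c] / len(labels)) for c in classes}
    # Add-one smoothing keeps every conditional probability strictly inside (0, 1).
    prob = {
        c: {f: (feature_counts[c][f] + 1) / (n_per_class[c] + 2) for f in vocab}
        for c in classes
    }
    return {"classes": classes, "vocab": vocab, "log_prior": log_prior, "prob": prob}


def log_posteriors(model, features):
    """Return the unnormalised log-posterior of each class for one molecule."""
    scores = {}
    for c in model["classes"]:
        score = model["log_prior"][c]
        for f in model["vocab"]:
            p = model["prob"][c][f]
            # Present features contribute log p, absent ones log(1 - p).
            score += math.log(p if f in features else 1.0 - p)
        scores[c] = score
    return scores


def predict_nb(model, features):
    """Predict the class with the highest log-posterior."""
    scores = log_posteriors(model, features)
    return max(scores, key=scores.get)


# Hypothetical toy training set: each molecule is a set of fragment names.
samples = [
    {"acyclic_ester"}, {"acyclic_ester", "hydroxyl"}, {"hydroxyl"},
    {"aryl_chloride"}, {"aryl_chloride", "fused_ring"}, {"fused_ring"},
]
labels = ["RB", "RB", "RB", "NRB", "NRB", "NRB"]

model = train_nb(samples, labels)
print(predict_nb(model, {"hydroxyl"}))
print(predict_nb(model, {"aryl_chloride"}))
```

The independence assumption lets each feature contribute one additive log term, which is why fingerprints with thousands of bits remain cheap to train on and why per-fragment contributions can later be inspected individually.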


Figure 7. (Colour online) Tree Map for the Murcko frameworks of the RB molecules. Scaffolds are represented by different coloured circles, and similar scaffolds are clustered in the independent grey circles. The most frequently occurring scaffolds for the six largest clusters are depicted.

Figure 8.  (Colour online) Tree Map for the Murcko frameworks for the NRB molecules. Scaffolds are represented by different coloured circles, and similar scaffolds are clustered in the independent grey circles. The most frequently occurring scaffolds for the six largest clusters are depicted.

Table 6. The performances of the naïve Bayesian classifiers for the training set based on different combinations of molecular descriptors.

Descriptors     TP    FN   FP    TN    SE     SP     Q+     Q−     GA     C      AUC
MP (a)          207   77   165   388   0.729  0.702  0.556  0.834  0.711  0.410  0.758
V (b)           226   58   163   390   0.796  0.705  0.581  0.871  0.736  0.476  0.804
MP+V            217   67   144   409   0.764  0.740  0.601  0.859  0.748  0.482  0.806
MP+V+ECFC_4     255   29   117   436   0.898  0.788  0.685  0.938  0.826  0.654  0.874
MP+V+ECFC_6     257   27   111   442   0.905  0.799  0.698  0.942  0.835  0.672  0.875
MP+V+ECFC_8     258   26   109   444   0.908  0.803  0.703  0.945  0.839  0.679  0.875
MP+V+ECFP_4     253   31   130   423   0.891  0.765  0.661  0.932  0.808  0.623  0.875
MP+V+ECFP_6     255   29   122   431   0.898  0.779  0.676  0.937  0.820  0.645  0.879
MP+V+ECFP_8     257   27   119   434   0.905  0.785  0.684  0.941  0.826  0.657  0.879
MP+V+EPFC_4     243   41   123   430   0.856  0.778  0.664  0.913  0.804  0.604  0.849
MP+V+EPFC_6     257   27   138   415   0.905  0.750  0.651  0.939  0.803  0.622  0.844
MP+V+EPFC_8     237   47   92    461   0.835  0.834  0.720  0.907  0.834  0.648  0.836
MP+V+EPFP_4     248   36   125   428   0.873  0.774  0.665  0.922  0.808  0.617  0.877
MP+V+EPFP_6     242   42   94    459   0.852  0.830  0.720  0.916  0.838  0.659  0.873
MP+V+EPFP_8     245   39   87    466   0.863  0.843  0.738  0.923  0.849  0.683  0.869
MP+V+FCFC_4     249   35   140   413   0.877  0.747  0.640  0.922  0.791  0.592  0.849
MP+V+FCFC_6     253   31   137   416   0.891  0.752  0.649  0.931  0.799  0.610  0.851
MP+V+FCFC_8     253   31   135   418   0.891  0.756  0.652  0.931  0.802  0.614  0.852
MP+V+FCFP_4     240   44   126   427   0.845  0.772  0.656  0.907  0.797  0.589  0.857
MP+V+FCFP_6     238   46   107   446   0.838  0.807  0.690  0.907  0.817  0.620  0.864
MP+V+FCFP_8     241   43   108   445   0.849  0.805  0.691  0.912  0.820  0.627  0.866
MP+V+FPFC_4     264   20   212   341   0.930  0.617  0.555  0.945  0.723  0.522  0.815
MP+V+FPFC_6     244   40   157   396   0.859  0.716  0.608  0.908  0.765  0.545  0.810
MP+V+FPFC_8     230   54   125   428   0.810  0.774  0.648  0.888  0.786  0.559  0.803
MP+V+FPFP_4     234   50   119   434   0.824  0.785  0.663  0.897  0.798  0.584  0.854
MP+V+FPFP_6     242   42   121   432   0.852  0.781  0.667  0.911  0.805  0.605  0.852
MP+V+FPFP_8     251   33   127   426   0.884  0.770  0.664  0.928  0.809  0.622  0.846
MP+V+LCFC_4     258   26   139   414   0.908  0.749  0.650  0.941  0.803  0.623  0.872
MP+V+LCFC_6     265   19   146   407   0.933  0.736  0.645  0.955  0.803  0.634  0.872
MP+V+LCFC_8     265   19   135   418   0.933  0.756  0.663  0.957  0.816  0.653  0.872
MP+V+LCFP_4     255   29   129   424   0.898  0.767  0.664  0.936  0.811  0.632  0.874
MP+V+LCFP_6     259   25   125   428   0.912  0.774  0.674  0.945  0.821  0.652  0.878
MP+V+LCFP_8     261   23   127   426   0.919  0.770  0.673  0.949  0.821  0.655  0.878
MP+V+LPFC_4     247   37   53    500   0.870  0.904  0.823  0.931  0.892  0.764  0.886
MP+V+LPFC_6     259   25   59    494   0.912  0.893  0.814  0.952  0.900  0.786  0.877
MP+V+LPFC_8     264   20   54    499   0.930  0.902  0.830  0.961  0.912  0.812  0.863
MP+V+LPFP_4     249   35   75    478   0.877  0.864  0.769  0.932  0.869  0.720  0.890
MP+V+LPFP_6     245   39   58    495   0.863  0.895  0.809  0.927  0.884  0.747  0.887
MP+V+LPFP_8     250   34   53    500   0.880  0.904  0.825  0.936  0.896  0.773  0.880
MP+V+SEFP_4     250   34   136   417   0.880  0.754  0.648  0.925  0.797  0.603  0.872
MP+V+SEFP_6     251   33   104   449   0.884  0.812  0.707  0.932  0.836  0.667  0.884
MP+V+SEFP_8     255   29   95    458   0.898  0.828  0.729  0.940  0.852  0.697  0.888

Notes: (a) MP represents 21 molecular physico-chemical properties; (b) V represents 76 VolSurf descriptors.


Table 7. The performances of the naïve Bayesian classifiers for the test set I based on different combinations of molecular descriptors.

Descriptors     TP   FN   FP   TN    SE     SP     Q+     Q−     GA     C      AUC
MP (a)          60   12   54   92    0.833  0.630  0.526  0.885  0.697  0.436  0.777
V (b)           57   15   43   103   0.792  0.705  0.570  0.873  0.734  0.469  0.848
MP+V            63   9    54   92    0.875  0.630  0.538  0.911  0.711  0.476  0.842
MP+V+ECFC_4     58   14   23   123   0.806  0.842  0.716  0.898  0.830  0.631  0.899
MP+V+ECFC_6     58   14   22   124   0.806  0.849  0.725  0.899  0.835  0.639  0.901
MP+V+ECFC_8     66   6    39   107   0.917  0.733  0.629  0.947  0.794  0.611  0.902
MP+V+ECFP_4     64   8    35   111   0.889  0.760  0.646  0.933  0.803  0.613  0.897
MP+V+ECFP_6     65   7    43   103   0.903  0.705  0.602  0.936  0.771  0.572  0.901
MP+V+ECFP_8     66   6    42   104   0.917  0.712  0.611  0.945  0.780  0.592  0.902
MP+V+EPFC_4     57   15   31   115   0.792  0.788  0.648  0.885  0.789  0.555  0.898
MP+V+EPFC_6     58   14   33   113   0.806  0.774  0.637  0.890  0.784  0.553  0.888
MP+V+EPFC_8     57   15   33   113   0.792  0.774  0.633  0.883  0.780  0.540  0.862
MP+V+EPFP_4     59   13   26   120   0.819  0.822  0.694  0.902  0.821  0.618  0.910
MP+V+EPFP_6     62   10   39   107   0.861  0.733  0.614  0.915  0.775  0.560  0.907
MP+V+EPFP_8     51   21   18   128   0.708  0.877  0.739  0.859  0.821  0.592  0.900
MP+V+FCFC_4     61   11   33   113   0.847  0.774  0.649  0.911  0.798  0.590  0.885
MP+V+FCFC_6     61   11   33   113   0.847  0.774  0.649  0.911  0.798  0.590  0.889
MP+V+FCFC_8     61   11   32   114   0.847  0.781  0.656  0.912  0.803  0.597  0.888
MP+V+FCFP_4     56   16   24   122   0.778  0.836  0.700  0.884  0.817  0.599  0.881
MP+V+FCFP_6     60   12   29   117   0.833  0.801  0.674  0.907  0.812  0.607  0.889
MP+V+FCFP_8     60   12   29   117   0.833  0.801  0.674  0.907  0.812  0.607  0.891
MP+V+FPFC_4     60   12   44   102   0.833  0.699  0.577  0.895  0.743  0.501  0.867
MP+V+FPFC_6     58   14   44   102   0.806  0.699  0.569  0.879  0.734  0.475  0.865
MP+V+FPFC_8     52   20   38   108   0.722  0.740  0.578  0.844  0.734  0.441  0.840
MP+V+FPFP_4     56   16   35   111   0.778  0.760  0.615  0.874  0.766  0.513  0.874
MP+V+FPFP_6     54   18   38   108   0.750  0.740  0.587  0.857  0.743  0.466  0.873
MP+V+FPFP_8     58   14   48   98    0.806  0.671  0.547  0.875  0.716  0.449  0.866
MP+V+LCFC_4     60   12   26   120   0.833  0.822  0.698  0.909  0.826  0.631  0.906
MP+V+LCFC_6     61   11   26   120   0.847  0.822  0.701  0.916  0.830  0.643  0.908
MP+V+LCFC_8     59   13   25   121   0.819  0.829  0.702  0.903  0.826  0.626  0.907
MP+V+LCFP_4     64   8    36   110   0.889  0.753  0.640  0.932  0.798  0.606  0.897
MP+V+LCFP_6     65   7    36   110   0.903  0.753  0.644  0.940  0.803  0.619  0.900
MP+V+LCFP_8     60   12   27   119   0.833  0.815  0.690  0.908  0.821  0.623  0.900
MP+V+LPFC_4     55   17   18   128   0.764  0.877  0.753  0.883  0.839  0.638  0.917
MP+V+LPFC_6     55   17   21   125   0.764  0.856  0.724  0.880  0.826  0.612  0.906
MP+V+LPFC_8     61   11   24   122   0.847  0.836  0.718  0.917  0.839  0.658  0.897
MP+V+LPFP_4     60   12   29   117   0.833  0.801  0.674  0.907  0.812  0.607  0.920
MP+V+LPFP_6     62   10   27   119   0.861  0.815  0.697  0.922  0.830  0.647  0.921
MP+V+LPFP_8     60   12   28   118   0.833  0.808  0.682  0.908  0.817  0.615  0.916
MP+V+SEFP_4     61   11   36   110   0.847  0.753  0.629  0.909  0.784  0.568  0.892
MP+V+SEFP_6     62   10   32   114   0.861  0.781  0.660  0.919  0.807  0.610  0.901
MP+V+SEFP_8     61   11   30   116   0.847  0.795  0.670  0.913  0.812  0.612  0.909

Notes: (a) MP represents 21 molecular physico-chemical properties; (b) V represents 76 VolSurf descriptors.
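
The measures reported in Tables 6 and 7 can be recomputed from the four confusion-matrix counts. The sketch below assumes the usual definitions: SE and SP are sensitivity and specificity, Q+ and Q− are the prediction accuracies for the RB and NRB classes, GA is the global accuracy, and C is taken to be the Matthews correlation coefficient. The definitions are not given in this section, so the reading of C in particular is an assumption, although it does reproduce the tabulated value. The function name `classification_metrics` is illustrative.

```python
import math


def classification_metrics(tp, fn, fp, tn):
    """Compute SE, SP, Q+, Q-, GA and C from confusion-matrix counts."""
    se = tp / (tp + fn)                      # sensitivity: RB molecules recovered
    sp = tn / (tn + fp)                      # specificity: NRB molecules recovered
    q_pos = tp / (tp + fp)                   # accuracy of RB predictions
    q_neg = tn / (tn + fn)                   # accuracy of NRB predictions
    ga = (tp + tn) / (tp + fn + fp + tn)     # global accuracy
    # Matthews correlation coefficient (assumed meaning of the C value).
    c = (tp * tn - fp * fn) / math.sqrt(
        (tp + fp) * (tp + fn) * (tn + fp) * (tn + fn)
    )
    return {"SE": se, "SP": sp, "Q+": q_pos, "Q-": q_neg, "GA": ga, "C": c}


# Counts for the MP+V+LPFP_4 classifier on the training set (Table 6).
metrics = classification_metrics(tp=249, fn=35, fp=75, tn=478)
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
# Rounded to three decimals these match the tabulated
# 0.877, 0.864, 0.769, 0.932, 0.869 and 0.720.
```

Because C uses all four counts, it is less sensitive to the imbalance between the RB and NRB classes than GA alone.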

In total, 42 naïve Bayesian models were generated based on different combinations of molecular descriptors. The statistical significance of all Bayesian classifiers was evaluated using leave-one-out (LOO) cross-validation, and the results are listed in Table 6 (training set), Table 7 (test set I) and Table S2 (test set II) in the Supporting Information. As can be seen from Table 6, the Bayesian classifier based on the 21 simple physico-chemical properties cannot achieve satisfactory prediction accuracy, as indicated by the low C value of 0.410 and AUC value of 0.758. The classifiers based on the VolSurf descriptors alone, or on the combination of the VolSurf descriptors and the 21 simple molecular properties, also cannot yield satisfactory classification capacity. The Bayesian classifier built on the combination of the 21 physico-chemical properties and 76 VolSurf descriptors has C and AUC values of 0.482 and 0.806, respectively (Table 6). Then, by combining each class of structural fingerprints with the simple physico-chemical properties and VolSurf descriptors, 39 naïve Bayesian classifiers were developed. It can be observed that the prediction accuracies of the Bayesian classifiers by

adding any structural fingerprints can be improved significantly (Table 6). The best Bayesian classifier for discriminating the RB and NRB molecules in the training set is obtained using the 21 simple physico-chemical properties, VolSurf descriptors and LPFP_4 fingerprints. This classifier has a sensitivity of 0.877, a specificity of 0.864, a prediction accuracy of 0.769 for the RB molecules, a prediction accuracy of 0.932 for the NRB molecules, a GA of 0.869 and a C value of 0.720. In addition, a more stringent criterion, the ROC curve, was used to evaluate the quality of the classifiers. In total, six naïve Bayesian classifiers have quite reliable prediction accuracy, as indicated by their relatively high AUC values (≥0.88). Furthermore, the actual classification capacity of all Bayesian classifiers was evaluated by their predictions for test set I (Table 7) and test set II (Table S2 in the Supporting Information). The classifiers based only on the physico-chemical properties and/or VolSurf descriptors cannot provide reliable predictions for test sets I and II. Besides, it is interesting to find that the Bayesian classifier based on the simple molecular properties, VolSurf descriptors and LPFP_6 instead of


Figure 9. (Colour online) (a) The 10 good and (b) 10 bad structural fragments predicted by the best naive Bayesian classifier.

LPFP_4 fingerprints yields the best predictions for test set I (AUC = 0.921), and the Bayesian classifiers based on the simple molecular properties, VolSurf descriptors and any of three fingerprints (ECFC_6, ECFC_8 or LPFC_4) achieve the best predictions for test set II (AUC = 0.901). Because the AUC and C values were not used to measure classification accuracy in the previous studies, the prediction capacities of the classification models cannot be compared directly on these measures. Instead, a comparison of the other indexes (SE, SP and GA) shows that the classification models built in this study have prediction capacity comparable to that of the reported classification models [10,22,24]. For example, the best DE-SVC model proposed by Cao et al. [24] gives an SE of 0.77, an SP of 0.93 and a GA of 0.88 for test set I, and an SE of 0.74, an SP of 0.93 and a GA of 0.88 for test set II. Based on the combination of physico-chemical properties, VolSurf descriptors and LPFC_4 fingerprints, the best classification model in our work as ranked by GA yields an SE of 0.764, an SP of 0.877 and a GA of 0.839 for test set I, and an SE of 0.712, an SP of 0.914 and a GA of 0.857 for test set II (Table 7 and Table S2 in the Supporting Information). Finally, considering the fingerprint classes and prediction accuracy, we selected the five best Bayesian classifiers according to the AUC values for consensus predictions. The five Bayesian classifiers were developed based on the physico-chemical properties, VolSurf descriptors and one fingerprint each (LPFP_4, ECFP_6, LCFP_6, LPFC_4 or SEFP_8). For each molecule, the five Bayesian classifiers afford five predictions, from which a consensus prediction is obtained. For example, if a RB molecule was predicted as a true

positive by more than three out of the five Bayesian classifiers, the attribute of this chemical can be confirmed, and vice versa. The statistical results show that 257 RB and 456 NRB molecules in the training set are predicted correctly by the consensus model, which gives a GA of 0.852 and a C value of 0.699. In addition, the consensus predictions provide a GA of 0.826 and a C value of 0.631 for test set I, and a GA of 0.844 and a C value of 0.612 for test set II. The consensus predictions thus do not improve on the prediction accuracy of the best Bayesian classifier. Similar phenomena were also observed in our previous work [61]. In summary, by introducing the structural fingerprints, the naïve Bayesian classifiers achieve quite reliable prediction accuracy, and they can be used as reliable tools to virtually screen the huge chemical space for promising chemical compounds with good biodegradability.

3.4. Analysis of important structural fragments

The relative importance of all structural fragments represented by the fingerprints is ranked by Bayesian scores, and the important structural fragments may be quite useful for designing more promising RB molecules. According to the predictions from the Bayesian classifier based on the 21 physico-chemical properties, 76 VolSurf descriptors and LPFP_4 fingerprints, the top 10 good and top 10 bad fragments, i.e. those most favourable and most unfavourable to chemical biodegradation as ranked by the Bayesian scores, are shown in Figure 9. As shown in Figure 9(a), the top 10 substructures are quite favourable for the likelihood of a RB molecule. It seems that


almost all of these favourable fragments are pure acyclic compounds, except fragment 10. We also found that these fragments are quite similar to one another, suggesting that chemical compounds with such structural fragments may have an overwhelming advantage over those with other, distinct structural fragments. This finding is consistent with our previous observations in the scaffold-based comparisons between the RB and NRB molecules: the RB molecules are usually simpler and have quite limited structural diversity (limited numbers of Murcko frameworks and of unique ones) compared with the NRB molecules. Ten structural fragments with negative contributions to molecular biodegradation are depicted in Figure 9(b). These unfavourable fragments contain ring systems and are more complicated than the favourable fragments. This observation is also consistent with the results of the previous sections: compared with the RB molecules, the NRB molecules have more ring systems. Besides, six of the 10 structural fragments with negative contributions (fragments 1–5 and 10) contain chlorine. Our findings are consistent with the results reported by Cheng and co-workers [10]. The numbers of chlorine element in the RB and NRB molecules are 27 (27/547 = 4.94%) and 296 (296/1178 = 25.13%), respectively. It is therefore advisable to avoid halogens such as chlorine, especially chlorine in ring systems such as benzene (fragments 3, 5 and 10) and naphthalene (fragments 1 and 4). In brief, according to the analysis of the structural fragments favourable or unfavourable to chemical biodegradation, simple, purely acyclic compounds without chlorine are a rational choice for improving biodegradation.
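
The idea of ranking fragments by how strongly they point towards one class can be illustrated with a smoothed log-ratio of class frequencies. This is a sketch only and not the Bayesian score used by the authors, whose exact form is not given in this section. It also reads the chlorine counts quoted above (27 of 547 RB and 296 of 1178 NRB) as numbers of chlorine-containing molecules, which is an assumption; the other two fragments and their counts are invented for illustration, as is the function name `fragment_score`.

```python
import math


def fragment_score(n_rb_with, n_rb, n_nrb_with, n_nrb, alpha=1.0):
    """Smoothed log-ratio of a fragment's frequency in RB versus NRB molecules.

    Positive scores mean the fragment is enriched among RB molecules,
    negative scores mean it is enriched among NRB molecules.
    """
    p_rb = (n_rb_with + alpha) / (n_rb + 2 * alpha)
    p_nrb = (n_nrb_with + alpha) / (n_nrb + 2 * alpha)
    return math.log(p_rb / p_nrb)


N_RB, N_NRB = 547, 1178  # data-set sizes stated in the paper

# (count in RB, count in NRB); only the chlorine row uses numbers from the text.
fragment_counts = {
    "chlorine": (27, 296),
    "hypothetical_acyclic_ester": (120, 40),
    "hypothetical_fused_aromatic": (5, 150),
}

scores = {
    name: fragment_score(rb, N_RB, nrb, N_NRB)
    for name, (rb, nrb) in fragment_counts.items()
}
ranked = sorted(scores, key=scores.get, reverse=True)

for name in ranked:
    print(f"{name}: {scores[name]:+.3f}")
```

Under this reading chlorine receives a clearly negative score, in line with its over-representation among the NRB molecules.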

4. Conclusions

In this study, based on well-compiled RB and NRB data-sets, the differences between the RB and NRB molecules were examined systematically. We observed that the RB molecules are usually smaller, more flexible and hydrophilic, and have less complicated structural patterns. The structural analysis shows that, compared with the NRB molecules, most RB molecules are pure acyclic chemicals or have limited core scaffold architectures. The analysis of the structural diversity using Tree Maps demonstrated that chemical compounds with smaller size and fewer ring systems have good biodegradation. Finally, in order to provide more quantitative classification models for distinguishing the RB from the NRB molecules, various naïve Bayesian classifiers were built using different combinations of molecular descriptors. The best Bayesian classifier, based on physico-chemical properties, a set of VolSurf descriptors and structural fingerprints, can discriminate the RB from the NRB molecules efficiently. In addition, the top 10 structural fragments favourable and the top 10 unfavourable for molecular biodegradation, generated from the best Bayesian classifier, were highlighted and discussed. The best Bayesian classifier built in this study can serve as a powerful tool to virtually screen huge chemical space instead of experimental testing, and the important structural fragments can also give useful guidelines for designing promising chemical compounds with improved biodegradation.


Disclosure statement

No potential conflict of interest was reported by the authors.

Funding

This study was supported by the National Science Foundation of China [grant numbers 81502982 and 21673149]; the National Science Foundation for Post-doctoral Scientists of China [grant numbers 2015T80586 and 2015M581862]; and the Jiangsu Key Laboratory of Translational Research for Neuropsychiatric Diseases [grant number BM2013003].

References

[1] Ruecker C, Kuemmerer K. Modeling and predicting aquatic aerobic biodegradation – a review from a user’s perspective. Green Chem. 2012;14:875–887.
[2] Shah AA, Hasan F, Hameed A, et al. Biological degradation of plastics: a comprehensive review. Biotechnol Adv. 2008;26:246–265.
[3] Peijnenburg W. Structure–activity relationships for biodegradation: a critical review. Pure Appl Chem. 1994;66:1931–1941.
[4] Pavan M, Worth AP. Review of estimation models for biodegradation. QSAR Comb Sci. 2008;27:32–40.
[5] Rorije E, Loonen H, Müller M, et al. Evaluation and application of models for the prediction of ready biodegradability in the MITI-I test. Chemosphere. 1999;38:1409–1417.
[6] Boethling RS. Designing biodegradable chemicals. In: DeVito SC, Garrett RL, editors. Designing safer chemicals: green chemistry for pollution prevention. Vol. 640. American Chemical Society; 1996. p. 156–171.
[7] Raymond JW, Rogers TN, Shonnard DR, et al. A review of structure-based biodegradation estimation methods. J Hazard Mater. 2001;84:189–215.
[8] Howard PH, Muir DCG. Identifying new persistent and bioaccumulative organics among chemicals in commerce. Environ Sci Technol. 2010;44:2277–2285.
[9] Howard PH, Muir DCG. Identifying new persistent and bioaccumulative organics among chemicals in commerce II: pharmaceuticals. Environ Sci Technol. 2011;45:6938–6946.
[10] Cheng F, Ikenega Y, Zhou Y, et al. In silico assessment of chemical biodegradability. J Chem Inf Model. 2012;52:655–669.
[11] Boethling RS, Sommer E, DiFiore D. Designing small molecules for biodegradability. Chem Rev. 2007;107:2207–2227.
[12] Howard PH, Stiteler WM, Meylan WM, et al. Predictive model for aerobic biodegradability developed from a file of evaluated biodegradation data. Environ Toxicol Chem. 1992;11:593–603.
[13] Hiromatsu K, Yakabe Y, Katagiri K, et al. Prediction for biodegradability of chemicals by an empirical flowchart. Chemosphere. 2000;41:1749–1754.
[14] Hou BK, Wackett LP, Ellis LBM. Microbial pathway prediction: a functional group approach. J Chem Inf Comput Sci. 2003;43:1051–1057.
[15] DeLisle RK, Dixon SL. Induction of decision trees via evolutionary programming. J Chem Inf Comput Sci. 2004;44:862–870.
[16] Philipp B, Hoff M, Germa F, et al. Biochemical interpretation of quantitative structure–activity relationships (QSAR) for biodegradation of N-heterocycles: a complementary approach to predict biodegradability. Environ Sci Technol. 2007;41:1390–1398.
[17] Andreini C, Bertini I, Cavallaro G, et al. A simple protocol for the comparative analysis of the structure and occurrence of biochemical pathways across superkingdoms. J Chem Inf Model. 2011;51:730–738.
[18] Boethling RS, Sabljic A. Screening-level model for aerobic biodegradability based on a survey of expert knowledge. Environ Sci Technol. 1989;23:672–679.
[19] Boethling RS, Howard PH, Meylan WM, et al. Group contribution method for predicting probability and rate of aerobic biodegradation. Environ Sci Technol. 1994;28:459–465.
[20] Loonen H, Lindgren F, Hansen B, et al. Prediction of biodegradability from chemical structure: modeling of ready biodegradation test data. Environ Toxicol Chem. 1999;18:1763–1768.
[21] Tunkel J, Howard PH, Boethling RS, et al. Predicting ready biodegradability in the Japanese ministry of international trade and industry test. Environ Toxicol Chem. 2000;19:2478–2485.
[22] Mansouri K, Ringsted T, Ballabio D, et al. Quantitative structure-activity relationship models for ready biodegradability of chemicals. J Chem Inf Model. 2013;53:867–878.
[23] Lombardo A, Pizzo F, Benfenati E, et al. A new in silico classification model for ready biodegradability, based on molecular fragments. Chemosphere. 2014;108:10–16.
[24] Cao Q, Leung KM. Prediction of chemical biodegradability using support vector classifier optimized with differential evolution. J Chem Inf Model. 2014;54:2515–2523.
[25] Fernández A, Rallo R, Giralt F. Prioritization of in silico models and molecular descriptors for the assessment of ready biodegradability. Environ Res. 2015;142:161–168.
[26] Boethling R. Comparison of ready biodegradation estimation methods for fragrance materials. Sci Total Environ. 2014;497–498:60–67.
[27] Pizzo F, Lombardo A, Manganaro A, et al. In silico models for predicting ready biodegradability under REACH: a comparative study. Sci Total Environ. 2013;463–464:161–168.
[28] Jaworska JS, Boethling RS, Howard PH. Recent developments in broadly applicable structure–biodegradability relationships. Environ Toxicol Chem. 2003;22:1710–1723.
[29] Bemis GW, Murcko MA. The properties of known drugs. 1. Molecular frameworks. J Med Chem. 1996;39:2887–2893.
[30] Schuffenhauer A, Ertl P, Roggo S, et al. The scaffold tree – visualization of the scaffold universe by hierarchical scaffold classification. J Chem Inf Model. 2007;47:47–58.
[31] Hou T, Wang J. Structure–ADME relationship: still a long way to go? Expert Opin Drug Metab Toxicol. 2008;4:759–770.
[32] Hou T, Wang J, Zhang W, et al. Recent advances in computational prediction of drug absorption and permeability in drug discovery. Curr Med Chem. 2006;13:2653–2667.
[33] Hou T, Li Y, Zhang W, et al. Recent developments of in silico predictions of intestinal absorption and oral bioavailability. Comb Chem High Throughput Screening. 2009;12:497–506.
[34] Wang J, Hou T. Advances in computationally modeling human oral bioavailability. Adv Drug Deliv Rev. 2015;86:11–16.
[35] Zhu J, Wang J, Yu H, et al. Recent developments of in silico predictions of oral bioavailability. Comb Chem High Throughput Screening. 2011;14:362–374.
[36] Hou T, Wang J, Li Y. ADME evaluation in drug discovery. 8. The prediction of human intestinal absorption by a support vector machine. J Chem Inf Model. 2007;47:2408–2415.
[37] Hou T, Wang J, Zhang W, et al. ADME evaluation in drug discovery. 6. Can oral bioavailability in humans be effectively predicted by simple molecular property-based rules? J Chem Inf Model. 2007;47:460–463.
[38] Hou T, Wang J, Zhang W, et al. ADME evaluation in drug discovery. 7. Prediction of oral absorption by correlation and classification. J Chem Inf Model. 2007;47:208–218.
[39] Hou TJ, Zhang W, Xia K, et al. ADME evaluation in drug discovery. 5. Correlation of Caco-2 permeation with simple molecular properties. J Chem Inf Comput Sci. 2004;44:1585–1600.
[40] Lei T, Li Y, Song Y, et al. ADMET evaluation in drug discovery: 15. Accurate prediction of rat oral acute toxicity using relevance vector machine and consensus modeling. J Cheminform. 2016;8:6.
[41] Wang J, Hou T. Recent advances on aqueous solubility prediction. Comb Chem High Throughput Screening. 2011;14:328–338.
[42] Wang J, Krudy G, Hou T, et al. Development of reliable aqueous solubility models and their application in druglike analysis. J Chem Inf Model. 2007;47:1395–1404.
[43] Wang S, Li Y, Xu L, et al. Recent developments in computational prediction of hERG blockage. Curr Top Med Chem. 2013;13:1317–1326.
[44] Csizmadia F, Tsantili-Kakoulidou A, Panderi I, et al. Prediction of distribution coefficient from structure. 1. Estimation method. J Pharm Sci. 1997;86:865–871.
[45] Tetko IV, Tanchuk VY, Kasheva TN, et al. Estimation of aqueous solubility of chemical compounds using E-state indices. J Chem Inf Comput Sci. 2001;41:1488–1493.
[46] Lipinski CA, Lombardo F, Dominy BW, et al. Experimental and computational approaches to estimate solubility and permeability in drug discovery and development settings. Adv Drug Deliv Rev. 1997;23:3–25.
[47] MOE, Chemical Computing Group. [Internet]. Montreal; 2009. Available from: http://www.chemcomp.com.
[48] Shi H, Tian S, Li Y, et al. Absorption, distribution, metabolism, excretion, and toxicity evaluation in drug discovery. 14. Prediction of human pregnane X receptor activators by using naive Bayesian classification technique. Chem Res Toxicol. 2015;28:116–125.
[49] Bender A, Mussa HY, Glen RC, et al. Molecular similarity searching using atom environments, information-based feature selection, and a naive Bayesian classifier. J Chem Inf Comput Sci. 2004;44:170–178.
[50] Chen L, Li YY, Zhao Q, et al. ADME evaluation in drug discovery. 10. Predictions of p-glycoprotein inhibitors using recursive partitioning and naive Bayesian classification techniques. Mol Pharm. 2011;8:889–900.
[51] Li D, Xu L, Li Y, et al. ADMET evaluation in drug discovery. 13. Development of in silico prediction models for p-glycoprotein substrates. Mol Pharm. 2014;11:716–726.
[52] Tian S, Li Y, Wang J, et al. Drug-likeness analysis of traditional Chinese medicines: 2. Characterization of scaffold architectures for drug-like compounds, non-drug-like compounds, and natural compounds from traditional Chinese medicines. J Cheminform. 2013;5(1):5.
[53] Tian S, Wang J, Li Y, et al. The application of in silico drug-likeness predictions in pharmaceutical research. Adv Drug Deliv Rev. 2015;86:2–10.
[54] Tian S, Wang J, Li Y, et al. Drug-likeness analysis of traditional Chinese medicines: prediction of drug-likeness using machine learning approaches. Mol Pharm. 2012;9:2875–2886.
[55] Wang S, Li Y, Wang J, et al. ADMET evaluation in drug discovery. 12. Development of binary classification models for prediction of hERG potassium channel blockage. Mol Pharm. 2012;9:996–1010.
[56] Wipke WT, Dyott TM. Use of ring assemblies in a ring perception algorithm. J Chem Inf Comput Sci. 1975;15:140–147.
[57] Wang J, Hou T. Drug and drug candidate building block analysis. J Chem Inf Model. 2010;50:55–67.
[58] Bemis GW, Murcko MA. Properties of known drugs. 2. Side chains. J Med Chem. 1999;42:5095–5099.
[59] Langdon SR, Brown N, Blagg J. Scaffold diversity of exemplified medicinal chemistry space. J Chem Inf Model. 2011;51:2174–2185.
[60] Shneiderman B. Tree visualization with tree-maps: 2-d space-filling approach. ACM Trans Graphics. 1992;11:92–99.
[61] Tian S, Sun H, Pan P, et al. Assessing an ensemble docking-based virtual screening strategy for kinase targets by considering protein flexibility. J Chem Inf Model. 2014;54:2664–2679.
