Computer-Aided Drug Design of Bioactive Natural Products

Veda Prachayasittikul1,3, Apilak Worachartcheewan1,2, Watshara Shoombuatong1, Napat Songtawee1, Saw Simeon1, Virapong Prachayasittikul3 and Chanin Nantasenamat1,3,*

1Center of Data Mining and Biomedical Informatics, Faculty of Medical Technology, Mahidol University, Bangkok 10700, Thailand; 2Department of Clinical Chemistry, Faculty of Medical Technology, Mahidol University, Bangkok 10700, Thailand; 3Department of Clinical Microbiology and Applied Technology, Faculty of Medical Technology, Mahidol University, Bangkok 10700, Thailand

*Address correspondence to this author at the Center of Data Mining and Biomedical Informatics, Faculty of Medical Technology, Mahidol University, Bangkok 10700, Thailand; Tel: 66-2-441-4371; Fax: 66-2-441-4380; E-mail: [email protected]

Abstract: Natural products have been an integral part of sustaining civilizations because of their medicinal properties. Past discoveries of bioactive natural products have relied on serendipity, and these compounds serve as inspiration for the generation of analogs with desired physicochemical properties. Bioactive natural products with therapeutic potential are abundantly available in nature, and some of them are beyond exploration by conventional methods. The effectiveness of computational approaches as versatile tools for facilitating drug discovery and development has been recognized for decades, and natural products are no exception. In the post-genomic era, scientists are bombarded with data produced by advanced technologies; thus, rendering these data into interpretable and meaningful knowledge has become an essential issue. In this regard, computational approaches utilize existing data to generate knowledge that provides valuable understanding for addressing current problems and for guiding the further research and development of new naturally derived drugs. Furthermore, several medicinal plants have been used continuously in many traditional medicine systems since antiquity throughout the world, and their mechanisms have not yet been elucidated. Therefore, the utilization of computational approaches and advanced synthetic techniques would yield great benefit for improving the health and well-being of the world's population.

Keywords: Natural products, Biological activity, Data mining, Drug discovery, Computer-aided drug design.

INTRODUCTION

Far-reaching impacts of natural products on human beings have been noted for centuries in the realms of home remedies and medicines. Historical evidence of the first natural products was revealed through paleoanthropological studies in which pollen deposits were found in a grave at Shanidar in present-day Iraq, estimated to date back more than 60,000 years [1]. The importance of natural products to civilizations can be attributed to their diverse pharmacological properties. Medical records on the use of natural products as therapeutics have been documented across regions. A clay tablet depicting information regarding medicinal extracts (i.e., resins, oils and juices from approximately 1,000 plants) was discovered in Mesopotamia and dates back to 2600 B.C. [2]. The Ebers Papyrus, an Egyptian medical text, contains information on plant-based remedies for various diseases [3]. The first known Chinese text on this subject was called Wu Shi Er Bing Fang (containing 52 prescriptions), followed by the Shennong Herbal (containing 365 drugs) and the Tang Herbal (containing 850 drugs) [4]. As for Western countries, historical evidence for the use of natural products was identified in monasteries of England, Ireland, Germany and France during the Dark and Middle Ages [4]. Furthermore, it should not be overlooked that Avicenna, the Persian pharmacist, made significant contributions to the field of pharmacy through his work "Canon Medicinae" [4].

Historical records identified medicinal plants, fungi and algae as rich sources of bioactive natural products [5]. The use of medicinal plants originated from the human instinct for survival, i.e., searching for food and seeking to avoid death [6]. Native Americans used ashes of plants of the genus Salvia to aid childbirth and protect infants from respiratory diseases [7]. The ancient Europeans used Parmelia omphalodes extracts to cure burns and cuts owing to its anti-inflammatory properties [8]. Fungi have been used as food (mushrooms), raw materials for perfumes and cosmetics, and ingredients for preparing alcohol and medicine since the early Chinese and Egyptian civilizations [9]. The red alga Chondrus crispus was widely used for the treatment of chest infections [10]. Parmelia omphalodes (Linnaeus) Acharius was widely used in the British Isles as a dye and in Ireland as an anti-inflammatory agent to cure burns and cuts [11]. Among algae, the juice of the red alga Porphyra umbilicalis (Linnaeus) Kützing has been noted for its anticancer properties, particularly with respect to breast cancer [12].

The importance of natural products in medicine has been indicated by the continual use of classical natural products. One of the classic examples of a natural product is Papaver somniferum, the opium poppy, which contains naturally occurring alkaloids as bioactive compounds [13].


From the Egyptian to the Chinese civilizations, opium was cultivated and used for several purposes. Ancient physicians used it as an anesthetic agent to perform medical surgery [14]. Likewise, opium was used as a painkiller during the American Civil War and as a recreational drug in ancient China. The Chinese and Indians are considered to be the pioneers of herbal medicine, and their formulae have had great impacts on the traditional medicine of many countries worldwide [15]. The knowledge of the Chinese and Indians was exchanged for a long time through the Silk Road [16]. Ayurveda is an Indian traditional medicine that defines the body in terms of three main constitutions (dosha), and the dynamic equilibrium of these dosha is essential for normal bodily function [17]. In contrast, the disturbance of these dosha is believed to be the root cause of diseases [18]. Similarly, Traditional Chinese Medicine (TCM) defines yin, yang and qi as the three main biological forces in the human body. The balanced equilibrium of yin and yang is essential for being healthy, and qi is required as the energy that circulates and nourishes the entire body [19, 20]. Traditional Chinese medicine is considered to be the prototype of Japanese traditional medicine (kampo medicine) [21] and of Korean traditional medicine or Sasang constitution medicine (SCM) [22], in which the original formulae have been adapted. The Chinese and Ayurvedic traditional medicine systems have had great impacts on traditional medicine in Asian countries, including Thailand, where the use of natural products for medicinal purposes has been noted since the Ayutthaya period (1350-1767 A.D.) [23]. Both Ayurveda and TCM are herbal medicine systems in which formulae containing various medicinal herbs are prescribed to provide synergistic effects and reduce adverse effects [24]. Despite having distinct formulae, the traditional medicines of India and China are based on the same belief that an individual's physical constitution plays a major role in susceptibility to diseases and response to treatment [25]. The prescribed formulae can be adjusted according to the patient's condition [24]. A similar basis of different body constitutions leading to differential responses to herbs is also implied in SCM [26].


The unique characteristics of these traditional medicines are in agreement with modern individualized medicine [27]. Furthermore, recent studies have revealed relationships between traditional medicine systems (i.e., Ayurveda [28, 29], Chinese [30, 31], Japanese [32, 33] and Korean [34-36]) and the genomic differences of individuals [27], which renders these systems thought-provoking alternative personalized treatment strategies in the post-genomic era [27]. The great importance of natural products to human beings has been well documented. Approximately 11% of drugs in the WHO's essential medicines list are exclusively derived from plants, and 25% of the drugs prescribed worldwide are plant-derived products [37]. Most of the African and Asian populations rely on traditional medicine for their primary healthcare [38] because of limited access to healthcare facilities and healthcare professionals [39], affordability, and belief in its safety [40]. In addition, the ancient use of natural products has formed the basis of later clinical, pharmacological and chemical studies [5], as can be seen from the discovery and development of many currently used drugs, e.g., aspirin, morphine, digitoxin, quinine and pilocarpine [41]. Currently, the status of botanical products differs among countries because of differences in scientific and technological advancement, national regulations, culture and society [42]. In the European Union (EU) and the United States of America (USA), herbal extracts are used as active ingredients in herbal medicinal products, dietary supplements (in the USA) and food supplements (in the EU). In Asian countries, natural products from plants are widely used as drugs for therapeutic purposes in traditional medicine and as health foods for the prevention of diseases and promotion of good health [42].

DRUG DISCOVERY AND DEVELOPMENT

Drug discovery and development is a complex process that requires expertise from multidisciplinary fields. It consists of many time-consuming steps, from target identification to clinical trials, that require substantial financial efforts (Fig. 1) [43].

[Fig. (1): flowchart of the drug discovery and development pipeline — target identification (What biomolecule can be a target? Data mining tools, molecular docking) → hit identification (Which compounds can bind the target? Virtual screening: structure-based molecular docking; ligand-based pharmacophore, machine learning and similarity searching) → hit-to-lead and lead optimization (↑ potency, ↑ drug-likeness, ↓ toxicity; QSAR/QSPR) → pre-clinical trials → clinical trials (Phase I, II, III) → drug approval → market.]

Fig. (1). Conceptual framework of drug discovery and development and the roles of computational approaches. (Hits = compounds that can bind to a target, Leads = hits with preferable potency, QSAR = quantitative structure-activity relationships, QSPR = quantitative structure-property relationships).


Owing to the complexity of the drug development process, bioinformatics and computational approaches have become versatile tools for facilitating and accelerating drug design and development [44, 45]. The conceptual framework of drug discovery and development and the roles of computational approaches in each step are illustrated in (Fig. 1).

Target identification is the process in which drug targets are identified from databases of experimental results [43]. Data mining tools are useful for creating such databases, and molecular docking is capable of identifying potential targets by docking drugs against large libraries of proteins [43]. Hits are defined as groups of compounds that exhibit the desired activity in the screening process [43]. The process of hit identification can be performed using high throughput screening (HTS) and virtual screening. HTS is performed by screening an entire library of compounds against the target by automation; however, secondary assays are required for confirmation [46]. Virtual screening is an effective means of searching for potential compounds by using computational approaches. One widely used computational method in this process is molecular docking, in which the crystal structure of the target protein is used to simulate binding in silico against large libraries of compounds. Active compounds with good binding affinity to the target, represented by a docking score [43], are identified as hits and are further developed [47]. Hits are subsequently optimized to obtain improved potency and pharmacokinetic properties and reduced toxicity [43]. The optimization is performed by structural modification of compounds, in which medicinal chemistry and computational approaches play essential interactive roles [48]. Quantitative structure-activity relationships (QSAR) and quantitative structure-property relationships (QSPR) are computational methods for correlating the chemical structures of compounds with their activities/properties. Understanding these relationships is useful for structural modification by medicinal chemists in the search for potential compounds [43, 48]. In addition, molecular modeling and molecular docking can be used for the discovery of new binding sites on target proteins [43].

PRIVILEGED STRUCTURES

The similarity principle has been widely applied in drug design on the basis that structurally related compounds may elicit similar biological activities [49]. In addition, the importance of the most common molecular fragments, or privileged structures, was noted by Evans et al. in 1988 [50]. Privileged structures are defined as molecular substructures that are capable of binding to a diverse array of receptors, and the modification of these substructures can provide an alternative approach to the discovery of novel receptor agonists and antagonists [50]. It has also been suggested that privileged structures provide affinity towards binding with receptors, whereas the rest of the molecule defines the selectivity of a potential compound [51]. Privileged structures have been successfully used as core structures for the synthesis of novel biologically active compounds [52-54] and as starting points for the synthesis of compound libraries [55]. The importance of privileged structures in drug design and discovery renders computational approaches a powerful tool for the search for novel privileged structures. It is widely known that natural products are rich sources of bioactive compounds.


Recently, diverse types of privileged structures have been identified from natural products, e.g., indole, quinolone, isoquinoline, purine, quinoxaline, quinazolinone, tetrahydroquinoline, tetrahydroisoquinoline, benzoxazole, benzofuran, 3,3-dimethylbenzopyran, chromone, coumarin, carbohydrate, steroid and prostanoic acid [55].

DRUG-LIKE PROPERTIES

Drug-likeness is essential for effective drugs because active compounds become useless if they are not capable of behaving like drugs in clinical situations. Drug-likeness is expressed by drug-like properties, as indicated by Lipinski's rule of five [44, 56]. Lipinski's rule suggests that drug-like compounds are molecules with molecular weights (MW) < 500 Da, calculated octanol/water partition coefficients (clogP) < 5, a number of hydrogen-bond donors < 5 and a number of hydrogen-bond acceptors < 10 [56]. However, these rules are used as guidelines rather than as absolute cut-offs for determining drug-like properties [44]. Recently, the importance of other physicochemical and structural properties influencing drug-likeness has been suggested in terms of property-based design [57]. The basis of property-based design is that molecules with similar chemical structures are expected to possess similar pharmacokinetic properties [57]. The pharmacokinetic profiles of drugs, i.e., absorption (A), distribution (D), metabolism (M), excretion (E) and toxicity (T), are essential for determining whether bioactive compounds could be used as safe and efficient oral drugs [43], and they are considered to be crucial factors for decision-making in the further development of investigated compounds [58]. All of these ADMET properties indicate the drug-likeness of compounds and notably affect efficacy, toxicity and drug-drug interactions [44]. For decades, many drugs have failed and been withdrawn in the late stages of drug development, causing considerable financial loss [43, 59]. The two main reasons that lead to the clinical failures of drugs are poor ADME properties [44] and severe toxicities (T) [43, 59]. Hence, considerable attention has been paid to the evaluation of the pharmacokinetic (ADME) properties and toxicity (T) of investigated compounds in the early stages of drug development to reduce the risk of failure and, therefore, save time and cost [60, 61]. In this regard, many computational approaches have been employed for the prediction of ADMET properties [62-67].

COMPUTATIONAL TOOLS

Databases

In recent years, we have witnessed the introduction of a wide range of databases to aid drug discovery efforts, and these can be broadly classified into two groups: bioactivity databases and target databases.

Bioactivity databases are valuable tools for identifying hit chemical compounds. For example, the ChemNavigator database (http://www.chemnavigator.com/) is a comprehensive resource containing over 91.5 million druggable compounds, although post-curation is needed before performing docking studies and/or quantitative structure-activity relationship (QSAR) studies [68].


ZINC (http://zinc.docking.org) is a docking-friendly database in which 35 million purchasable drug-like compounds are deposited in ready-to-use 3D formats [69]. The ChEMBL database (https://www.ebi.ac.uk/chembl/) contains over 1 million compounds with information on their binding affinities, functional assays, bioactivity measurements and ADMET properties [70]. PubChem is a freely accessible repository that contains more than 63 million compounds and provides diverse bioactivity results for approximately 45 million of them; one of the features that makes PubChem an attractive tool for in silico drug design is the PubChem Download Service [71]. BindingDB is a public and openly accessible database that holds approximately 20,000 binding affinities of small compounds experimentally tested against protein targets with known 3D structures.

Target databases are important for identifying druggable proteins that are involved in pathogenesis. For instance, the tropical disease pathogens target database (http://tdrtargets.org) contains information on protein structures, functional genomics and biochemical pathways to aid the in silico identification of protein targets [72]. The potential drug target database (http://www.dddc.ac.cn/pdtd/) contains over 1,100 3D druggable protein structures ranging from enzymes to lipid binding proteins [73]. The Therapeutic Target Database (http://xin.cz3.nus.edu.sg/group/ttd/ttd.asp) contains over 2,360 targets with information on 3D structures, diseases, binding properties and functional properties [74].


The Protein Data Bank (PDB) contains all known crystallized 3D protein structures, conveniently providing structural information that is not available in sequence databases (e.g., GenBank) and tremendously aiding in silico drug design by allowing researchers to identify novel potential drug targets and to perform docking studies [75].

Chemical Space of Natural Products

Chemical space is the total possible number of descriptors of chemical compounds; like the spatial extent of the universe, these descriptors are essentially infinite in number. Despite advances in the synthesis of organic compounds and the characterization of natural products, only a small fraction of possible compounds have been synthesized and used. Thus, by exploring the chemical space originating in living organisms, new strategies to combat diseases will emerge. A visualization of the chemical space of natural products obtained from 12 natural product databases available from the ZINC database is shown in (Fig. 2) by means of a PCA scores plot.

Chemical space analyses of FDA approved drugs have been performed to explore the properties and characteristics of drug-like chemical compounds. For example, Vieth et al. [76] performed a fragment analysis of 1,082 FDA approved drugs and 1,729 marketed drugs.

[Fig. (2): PCA scores plot (PC1 vs. PC2, both spanning approximately -1.5 to 1.5) of natural products, colored by source database: AfroDb, AnalytiCon, HIT, IBScreen, Indofine, NPACT, Nubbe, Princeton, Specs, TCM, Tongju and UEFS.]

Fig. (2). PCA plot of compounds from 12 databases obtained from the ZINC database. Random selection of 100 compounds from each of the 12 databases was carried out followed by representing each compound by the ECFP substructure fingerprint. Finally, PCA was computed in R using the prcomp function from the stats package and the resulting plot is visualized using the ggplot2 package. Acronyms and full names of the 12 databases are provided hereafter (AfroDb: African natural products, AnalyticCon: AnalytiCon discovery natural products, HIT: Herbal Ingredients Targets, IBScreen: IBScreen natural products, Indofine: Indofine natural products, NPACT: Naturally occurring plant based anticancerous compound-activity-target database, Nubbe: Nuclei of Bioassays, Biosynthesis and Ecophysiology of Natural Products, Princeton: Princeton natural products, Specs: Specs natural products, TCM: Traditional Chinese Medicine Database, Tongju: Tongji University herbal ingredients in vivo metabolism, UEFS: Universidade Estadual De Feira De Santana natural products).
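The caption above describes an R workflow (prcomp from the stats package, plotted with ggplot2). For readers working in Python, a rough analogue of the same idea is sketched below; it assumes RDKit and scikit-learn are available, and the `libraries` dictionary with a handful of placeholder SMILES stands in for the actual random samples drawn from the 12 ZINC natural product subsets.

```python
import numpy as np
import matplotlib.pyplot as plt
from rdkit import Chem, DataStructs
from rdkit.Chem import AllChem
from sklearn.decomposition import PCA

# Placeholder input: database name -> list of SMILES (stand-ins for the
# 100 compounds randomly drawn from each ZINC natural product subset).
libraries = {
    "AfroDb": ["CCO", "c1ccccc1O", "CC(=O)Oc1ccccc1C(=O)O"],
    "TCM": ["CN1CCC[C@H]1c1cccnc1", "O=C(O)c1ccccc1O"],
}

fingerprints, labels = [], []
for name, smiles_list in libraries.items():
    for smi in smiles_list:
        mol = Chem.MolFromSmiles(smi)
        if mol is None:
            continue
        # ECFP-like Morgan fingerprint (radius 2, 1024 bits)
        fp = AllChem.GetMorganFingerprintAsBitVect(mol, radius=2, nBits=1024)
        arr = np.zeros((1024,), dtype=np.int8)
        DataStructs.ConvertToNumpyArray(fp, arr)
        fingerprints.append(arr)
        labels.append(name)

# Two-component PCA of the fingerprint matrix, then a scores scatter plot
scores = PCA(n_components=2).fit_transform(np.vstack(fingerprints))

for name in libraries:
    idx = [i for i, lab in enumerate(labels) if lab == name]
    plt.scatter(scores[idx, 0], scores[idx, 1], label=name, s=15)
plt.xlabel("PC1")
plt.ylabel("PC2")
plt.legend()
plt.show()
```

Encoding each compound as a Morgan fingerprint before the two-component PCA mirrors the ECFP substructure fingerprints used for Fig. (2).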


The results showed that the halogen contents of marketed drugs are identical, and the molecular weights of FDA approved drugs are lower than 500. These results are consistent with Lipinski's rule of five, which states that drugs should possess a MW smaller than 500 to have good absorption and bioavailability. Chemical space analyses have also been performed on natural products. For instance, Ganesan [77] used 24 unique natural products to explore the associated chemical space. Of these 24 natural products, half obey Lipinski's rule of five, whereas the other half do not. A closer examination of the physicochemical properties of these 24 natural products revealed that almost all of them obey the log P rule, i.e., their values are smaller than 5.
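As a small illustration of how such rule-of-five profiling can be automated, the sketch below computes the four Lipinski properties with RDKit (assuming RDKit is installed); aspirin is used purely as an example molecule, not as part of Ganesan's 24-compound set.

```python
from rdkit import Chem
from rdkit.Chem import Crippen, Descriptors, Lipinski

def rule_of_five(smiles):
    """Return Lipinski's rule-of-five properties and the number of violations."""
    mol = Chem.MolFromSmiles(smiles)
    if mol is None:
        raise ValueError("Could not parse SMILES: " + smiles)
    props = {
        "MW": Descriptors.MolWt(mol),        # molecular weight (Da)
        "clogP": Crippen.MolLogP(mol),       # calculated octanol/water partition coefficient
        "HBD": Lipinski.NumHDonors(mol),     # hydrogen-bond donors
        "HBA": Lipinski.NumHAcceptors(mol),  # hydrogen-bond acceptors
    }
    props["violations"] = sum([
        props["MW"] > 500,
        props["clogP"] > 5,
        props["HBD"] > 5,
        props["HBA"] > 10,
    ])
    return props

# Example: aspirin, a classic plant-derived drug mentioned above
print(rule_of_five("CC(=O)Oc1ccccc1C(=O)O"))
```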

Koch et al. [78] explored the chemical space of natural products by classifying their chemical scaffolds, which allowed the identification of 11 novel β-hydroxysteroid dehydrogenase type 1 inhibitors. Reayi and Arya [79] stressed that the chemical space of natural products can be populated by diversity-oriented synthesis (DOS), a chemical synthesis strategy for quickly creating libraries of compounds, which will aid in the deorphanization of druggable protein targets. Josefin et al. [80] utilized ChemGPS-NP to explore the chemical space of natural products from several databases and found that 40,348 compounds from the Dictionary of Natural Products passed Lipinski's rule of five. Osada and Hertweck [81] claimed that the chemical space of natural products is populated naturally by gene clustering, whereby the enzymes that synthesize natural products are altered to increase their chemical space. Lachance et al. [82] claimed that the bioactive chemical space of natural products can be analyzed, charted and navigated to identify relevant substituents to aid modern chemical synthesis in drug discovery and development.

Analogous to Lipinski's rule of five (drug-likeness), Zhou et al. [83] used structure-activity relationships to explore the chemical space of natural products and to define "bioactive natural compound-likeness" (BNC-likeness). Structural properties were compared between bioactive and non-bioactive natural products and between the drug-likeness and BNC-likeness models. A dataset of 1,580 natural products was obtained from a total of 7,549 natural product ingredients from the Ethnobotanical Database and Dr. Duke's Phytochemical Database. Of the 1,580 natural products, 790 were bioactive and 790 were not, providing a balanced dataset. SVM with radial basis function kernels was used to build bioactive natural compound-likeness models, using the 1,580 compounds with bioactivity annotations as the training set. The performance of the models was tested with an independent external data set that included 81 bioactive natural products and 81 non-bioactive natural products from widely used medicinal herbs. The prediction results demonstrated that 75 bioactive compounds were successfully classified, suggesting that the models are robust and do not suffer from overfitting. Overfitting is a common problem in machine learning and occurs when noise is incorporated into the independent variables used to develop highly predictive models; although such models work very well on internal datasets, their performance is very low when a new class of data or a test set is applied [83]. A closer examination of the structural properties of bioactive and non-bioactive natural products showed that they were clearly different. For example, the molecular weights, the number of rings, the number of carbon atoms and, in particular, the number of oxygen atoms of bioactive natural products were higher than those of non-bioactive natural products [83]. Nevertheless, the results showed that most of the bioactive natural products exhibited drug-likeness despite having increased numbers of hydrogen-bond donors and acceptors. This result suggested that natural products have desirable properties for drug discovery and development because compounds that obey the rule of five are orally active and very specific in binding to their targets [83].

To compare the drug-likeness and BNC-likeness models, a data set of 59,000 drugs from the World Drug Index (WDI) was randomly chosen and screened to obtain 3,930 compounds, of which 1,965 were bioactive and 1,965 were non-bioactive. Molecular descriptors were extracted for each compound to develop a drug-likeness model using SVM as the learning technique. The performance of the drug-likeness model decreased when the natural product data set was used, and the opposite phenomenon was observed for the BNC-likeness model [83]. These two models may have differed because they rely on different properties of synthesized drugs and natural products. A closer look at the key descriptors of the two models, revealed by the RuleSet algorithm (an algorithm based on decision trees), indicated that only a few descriptors are important for the classification. In the development of the BNC-likeness predictive model, 180 descriptors were used, whereas 328 descriptors were used as inputs to construct the drug-likeness model. There were significant differences when the distributions of the 180 and 328 descriptors were plotted. To confirm these differences, 1,647 Dragon molecular descriptors were extracted for each compound and were split with the k-means clustering approach; the descriptors were clustered into 50 groups based on their Pearson's correlation coefficients. The important descriptors for the two models (i.e., drug-likeness and BNC-likeness) were significantly different: the descriptors in clusters 35, 33, 28, and 36 were mainly used to build the BNC-likeness prediction model and were rarely used to create drug-likeness models, whereas clusters 19, 7, and 18 were largely used to build drug-likeness models and were rarely used to make BNC-likeness models [83].
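The classification workflow described above (an RBF-kernel SVM trained on a balanced set of bioactive and non-bioactive natural products and evaluated on an external set) is not reproduced here, but a minimal scikit-learn sketch of the same type of model is shown below; the random descriptor matrix and labels are synthetic placeholders for the Dragon descriptors and bioactivity annotations used by Zhou et al.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC
from sklearn.metrics import accuracy_score

# Synthetic placeholders: 1,580 natural products x 180 descriptors,
# with a balanced bioactive (1) / non-bioactive (0) label vector.
rng = np.random.default_rng(0)
X = rng.normal(size=(1580, 180))
y = np.repeat([1, 0], 790)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.1, stratify=y, random_state=0)

# Scale descriptors, then fit an SVM with a radial basis function kernel
scaler = StandardScaler().fit(X_train)
model = SVC(kernel="rbf", C=1.0, gamma="scale")
model.fit(scaler.transform(X_train), y_train)

y_pred = model.predict(scaler.transform(X_test))
print("External-set accuracy:", accuracy_score(y_test, y_pred))
```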

Natural Products as Sources of Inspiration for New Drugs

Small molecules and secondary metabolites have been economically designed and synthesized by nature for the benefit of evolution; in other words, they have been evolutionarily selected [84]. Owing to this power of evolution, natural products contain diverse types of biologically relevant privileged structures that have saved millions of lives, which renders them a continuous source of inspiration for the discovery of new drugs [85]. These naturally occurring ligands serve as excellent structural starting points for exploring biologically relevant chemical space [86]. Therefore, the identification of natural products that are capable of modulating protein functions in pathogenesis-related pathways is at the heart of drug discovery and development [78].


To date, a number of natural products have been chemically modified and developed into Food and Drug Administration (FDA) approved drugs [77]. From 1981 to 2010, natural products and their derivatives accounted for 74.8% of all candidate drugs approved by the FDA [87]. Good examples of natural product-inspired drugs are carfilzomib, omacetaxine mepesuccinate and mitoxantrone.

Carfilzomib is derived from epoxomicin, a naturally occurring bacterial proteasome inhibitor first reported in 1992 by Hanada et al. [88], whose mechanism of action was then unknown. In the late 1990s, epoxomicin was structurally modified by Crews' lab at Yale University to obtain derivatives that were structurally similar to the parent compound [89, 90]. Several research groups put forward great effort to structurally modify these compounds, and the resulting derivative, carfilzomib, was developed by Proteolix and Onyx and approved by the FDA in 2012 for the treatment of multiple myeloma [91].

Homoharringtonine is a bioactive cephalotaxine alkaloid isolated from the extract of evergreen trees. In 1976, homoharringtonine was clinically observed to have anticancer potential against acute leukemia [92]. Since then, this natural compound has been examined by several organizations and companies. Finally, the ester derivative of homoharringtonine was approved by the FDA in 2012 for the treatment of chronic myeloid leukemia under the name omacetaxine mepesuccinate [91].

Mitoxantrone is an anticancer agent derived from a natural product pharmacophore. Mitoxantrone is a doxorubicin analog that was designed to minimize the cardiotoxicity of its parent compound [93]. Mitoxantrone has been approved by the FDA for the treatment of many cancers, including acute leukemia, breast cancer and lymphomas [94]. In addition, it was approved by the European Medicines Agency (EMEA) of the European Union (EU) in September 2012 for the treatment of B cell lymphomas [91], and applications are currently before the FDA for approval for the treatment of non-Hodgkin's lymphoma [91].

The commercial success of these naturally derived drugs clearly demonstrates that natural products provide great sources of biologically relevant privileged structures that are useful as structural starting points for the screening, design and development of novel potential drugs.

Synthesis of Natural Products

Natural products are in high demand owing to their exceptional range of bioactivities, but some of them are scarce or inaccessible in nature. Organic synthesis often solves this problem by supplying these scarce compounds and enabling the conversion of bioactive natural compounds into more drug-like derivatives [95]. It is well known that the chemical structures of the majority of natural products are complex, which renders their total synthesis a difficult task [95]. Therefore, novel organic synthetic approaches have been developed in an attempt to yield potential compounds with medicinal value [96]. Principally, structural modifications of natural product core structures are performed to improve selectivity and potency, to provide additional properties [97], and to facilitate their synthesis [95].


Furthermore, some novel synthetic strategies have been developed to increase structural diversity, in other words, to expand the chemical space of the investigated compounds [84, 98]. Examples of organic synthesis methods are given below.

Semi-synthesis is performed by the chemical modification of natural products to improve potency, selectivity and other properties [97]. This method has historically been used to yield a number of therapeutic compounds with significant impacts on mankind; a notable example of this approach is heroin, which is derived from the acetylation of morphine [99]. Fragment exchange is a complementary approach that replaces chemical fragments of natural products with synthetically derived fragments [97]. The statins, i.e., mevastatin, lovastatin, simvastatin and atorvastatin, are lipid-lowering drugs that have been developed from the naturally occurring statins using semi-synthesis and fragment exchange [100].

Diversity-oriented synthesis (DOS) is an effective tool for generating libraries of structurally diverse compounds with desirable biological properties [101, 102]. Structural diversity is one of the key strategies for expanding the investigated chemical space and thereby increasing the rate of finding potential hits [98]. Conceptually, natural products are used as starting scaffolds to generate compound libraries by various organic synthesis methods [103], in which novel molecules are generated in short reaction sequences (no more than 4 or 5 steps) [104]. Examples of natural products used as starting scaffolds are gibberellic acid (a plant hormone), adrenosterone (a steroid hormone) and quinine (a compound isolated from the bark of the cinchona tree) [103].

Function-oriented synthesis (FOS) is an effective strategy for producing therapeutic lead compounds in a step-economical fashion, such that small molecules are generated with less structural complexity and with preferable properties [95]. The principle of FOS is based on the fact that only a portion (substructure) of a compound is responsible for its biological activities, and these crucial moieties can be modified to facilitate synthesis, enhance desirable biological activities and improve drug-like properties [95]. It should be noted that natural products most likely bind to multiple targets [84] and are not designed for human therapeutic use [95]; these characteristics lead to undesired side effects and inferior pharmacokinetic properties [95]. FOS has been noted to address these problems by reducing undesired side effects, enhancing desired biological activities and improving pharmacokinetic properties [95]. FOS has been applied to the development of many natural compounds, such as bryostatin [105], halichondrin B [106], the statins [107], dynemicin [108] and laulimalide [95].

One of the challenges in drug discovery and development is the identification of biologically relevant areas located inside an investigated chemical space [109]. Biology-oriented synthesis (BIOS) is based on the structural analysis of small molecules and target proteins, where biological relevance is a prime criterion for the selection of starting scaffolds for the synthesis of biologically active compound collections [84].


Briefly, natural product scaffolds are analyzed and classified according to their core structures, and protein targets are clustered by their similarity. Consequently, scaffold collections and protein clusters are matched by biological pre-validation [84] to provide a starting point for the subsequent synthesis of small molecules enriched with biological activity [86]. In this regard, computational approaches, i.e., chemoinformatics and bioinformatics, are necessary [86]. It should be noted that BIOS only provides a starting point for discovery, and the continuous development of practical synthetic methods, i.e., one-pot sequences, cascade and domino reactions, is essential as a final step to obtain biologically active, naturally derived compound collections [86]. To date, many synthetic strategies have been reported for the synthesis of natural product analogs, including the solid-phase technique [110, 111], the solution-phase technique [96, 111], polymer-immobilized scavenger reagents [112, 113], direct sorting [114], combinatorial biosynthesis [115-118], total synthesis using gold catalysis [119] and biology-oriented synthesis [84, 120, 121].

Quantitative Structure-Activity/Property Relationship (QSAR/QSPR)

Quantitative structure-activity/property relationships (QSAR/QSPR) describe mathematical and statistical relationships between the molecular descriptors of compounds (X) and their biological activities/properties (Y). Hansch et al. first demonstrated the use of mathematical and statistical approaches for constructing a QSAR/QSPR model [122, 123]. Over the last several decades, QSAR/QSPR models have been used to effectively reduce the time-consuming, laborious and expensive processes of innovative drug research [124-126], and they have also performed well for the prediction of physicochemical and biological properties [127-135]. Thus, it is desirable to develop efficient and reliable QSAR/QSPR models to improve the drug discovery process. The development of a QSAR/QSPR model essentially comprises five major steps: i) calculating the molecular descriptors; ii) selecting relevant and informative molecular descriptors; iii) dividing the data into training/internal and testing/external sets; iv) establishing the QSAR/QSPR model using the training set; and v) validating the QSAR/QSPR model.

Calculating the Molecular Descriptors

The chemical structure of a compound can be represented as a set of numerical values called molecular descriptors [136]. First, chemical structures are drawn, geometrically optimized and processed to obtain descriptor values. Typically, many types of descriptors, i.e., physicochemical properties, molecular properties and molecular fingerprints, can be extracted from the chemical structures of natural products using computer software [137]. Although several thousand descriptors can be obtained from conventional software packages, these descriptors may not all be informative or useful for predicting the bioactivity of the compounds of interest. Thus, feature selection via machine learning algorithms is essential to select a set of informative descriptors prior to the construction of QSAR/QSPR models [125].


Bioactivities are the effects of natural products on living organisms, which can be either beneficial or harmful depending on the structural composition and concentration of the compounds. Accurate and precise bioactivity data are essential for the successful construction of QSAR/QSPR models; therefore, multiple rounds of activity assays should be performed to obtain accurate and precise bioactivities. Recently, QSAR models have been successfully constructed using several types of endpoints, such as minimum inhibitory concentration, toxicity, solubility, sorption, absorption, bioconcentration, permeability, metabolism, clearance and binding affinity [125].

Initially, the chemical structures of the natural products can be collected from public databases, commercial repositories and the literature. The chemical structures are drawn, geometrically optimized and subjected to descriptor calculations [125]. Many types of descriptors (e.g., constitutional, topological, geometric, electrostatic, fingerprint, steric and quantum chemical descriptors) can be obtained, depending on the software used to perform the descriptor calculation. The physicochemical properties, quantum chemical properties and molecular fingerprints of natural products can be extracted as sets of descriptor values by free and/or commercial software [138-140]. There are openly available descriptor calculators that permit descriptor extraction by the user. For example, the free online E-Dragon molecular descriptor calculator (http://www.vcclab.org/lab/edragon/start.html) allows users to extract 1,600 molecular descriptors, where SDF (MDL) or MOL2 (Sybyl) input files of the 3D structures (with added hydrogen atoms) are used as inputs [141]. Another example of a free descriptor calculator is JCompoundMapper (http://jcompoundmapper.sourceforge.net/); users can download the java client for this application, called JCMapperCLI.jar, and conveniently calculate molecular fingerprint descriptors from the command line [142]. In addition, molecular structures represented in the simplified molecular-input line-entry system (SMILES) format, together with their endpoints or biological and chemical properties, can be used to develop QSPR/QSAR models with the Monte Carlo method implemented in the CORrelation And Logic (CORAL) software (http://www.insilico.eu/coral) [134].

Selecting Relevant and Informative Molecular Descriptors

Many QSAR/QSPR models are not suited to handling a large number of irrelevant descriptors. Thus, the selection of relevant and informative descriptors plays a crucial role in the construction of QSAR/QSPR models. The objectives of selecting descriptors based on their importance are manifold: (I) to alleviate overfitting and enhance QSAR/QSPR performance; (II) to provide faster and more cost-effective models; and (III) to gain deeper insight into the underlying chemical structures of the natural products [143]. Currently, there are three major families of feature selection methods: filter, wrapper, and embedded approaches [143, 144]. Filter approaches assess the relevance of descriptors by ranking them with a univariate relevance score, such as the t-test, and filtering on that score. Subsequently, the top-ranked informative features are used to construct a predictive model. This approach considers the intrinsic properties of the data and ignores the interaction with the model. Filter techniques are simple and fast, with little computational complexity, and they are also easy to apply to very high dimensional data sets. However, these techniques are independent of the prediction model.


Instead of selecting informative features independently of the model selection step, wrapper approaches were proposed to mitigate this issue by wrapping the model around the search for candidate feature subsets. Some examples of wrapper approaches include sequential forward selection (SFS) [145], sequential backward elimination (SBE) [145] and the genetic algorithm [146]. An advantage of wrapper approaches is the interaction between the candidate feature subset and the model selection, whereas a common drawback of these techniques is that they have a higher risk of overfitting than filter techniques and are computationally intensive. In the last category of feature selection techniques, termed embedded approaches, the selection of the informative feature subset is built into the model. Similar to wrapper approaches, embedded approaches have the advantage that they include the interaction with the classification model; however, they are far less computationally intensive than wrapper methods. Some examples of embedded approaches include decision tree [147], logistic model tree [148] and random forest approaches [149]. (Table 1) provides a summary of the feature selection methods, showing the most prominent advantages, disadvantages, and some examples of each method.
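To make the three families concrete, the sketch below shows one representative of each in scikit-learn: a univariate filter (ANOVA F-test, in the same spirit as the t-test), a wrapper (recursive feature elimination around a logistic regression model) and an embedded method (random forest feature importances). The synthetic descriptor matrix is a placeholder, not data from the cited studies.

```python
import numpy as np
from sklearn.feature_selection import SelectKBest, f_classif, RFE
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier

# Synthetic data: 200 compounds x 50 descriptors; only a few descriptors are informative
rng = np.random.default_rng(1)
X = rng.normal(size=(200, 50))
y = (X[:, 0] + X[:, 3] + rng.normal(scale=0.5, size=200) > 0).astype(int)

# Filter: rank descriptors with a univariate score and keep the top k
filt = SelectKBest(score_func=f_classif, k=10).fit(X, y)

# Wrapper: recursive feature elimination guided by a classifier
wrap = RFE(LogisticRegression(max_iter=1000), n_features_to_select=10).fit(X, y)

# Embedded: importances learned inside the model itself
forest = RandomForestClassifier(n_estimators=200, random_state=0).fit(X, y)
top_embedded = np.argsort(forest.feature_importances_)[::-1][:10]

print("filter:  ", np.flatnonzero(filt.get_support()))
print("wrapper: ", np.flatnonzero(wrap.get_support()))
print("embedded:", np.sort(top_embedded))
```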

Dividing the Data into Training and Testing Sets

To alleviate the overfitting problem, a QSAR/QSPR model must perform well on both the training and testing sets to be considered effective and efficient. Currently, there are a number of splitting algorithms, such as Kennard-Stone, Duplex and k-means sampling. These three algorithms are implemented in the R environment within the prospectr package, which can be downloaded at no cost from http://cran.r-project.org/web/packages/prospectr/index.html.

Establishing the QSAR/QSPR Model

The construction of a QSAR/QSPR model is based on the principal ideas of machine learning. Well-known machine learning techniques used to build QSAR/QSPR models include multiple linear regression (MLR), partial least squares (PLS), k-nearest neighbors (k-NN), artificial neural networks (ANN), support vector machines (SVM), decision trees (DT), and random forests (RF), all of which have been reported in many applications of QSAR/QSPR modeling. Machine learning tasks are typically classified into two broad categories: classification and regression. Classification tasks aim to discriminate a variable Y into its class or property, where Y can be divided into two or more than two classes (binary and multi-class classification, respectively). In contrast, regression tasks primarily focus on predicting the value of the variable Y as a numerical output. The MLR, PLS, ANN, SVM, and RF methods can be utilized in both classification and regression tasks, whereas k-NN and DT are used mainly for classification. Additionally, machine learning tasks can be further divided according to their inclusion (supervised learning) or omission (unsupervised learning) of the variable Y. All of the QSAR/QSPR models mentioned above are supervised learning methods, whereas a well-known unsupervised learning method is principal component analysis (PCA). (Fig. 3) displays a schematic representation of the major concepts of popular machine learning techniques that are used for the construction of QSAR/QSPR models.

A simple method for the classification task is the k-NN algorithm. This algorithm is conceptually based on a distance function, such as the Euclidean distance, to measure the similarity between a pair of data points. Given a data set D = {x_1, ..., x_M} of M compounds, where x_j ∈ ℝ^N and N is the number of molecular descriptors, a positive integer k, and a new datum x to be classified, the k-NN algorithm finds the k nearest neighbors of x in D, denoted k-NN(x), and returns the dominating class label in k-NN(x) as the label of x. Given the descriptors of two compounds x_i and x_j, the Euclidean distance Dist(x_i, x_j) is

Dist(x_i, x_j) = \sqrt{\sum_{n=1}^{N} (x_{in} - x_{jn})^2}    (1)

where x_{in} is the value of the nth molecular descriptor of the ith compound. A schematic representation of the k-NN method is illustrated in (Fig. 3A).

The MLR method attempts to model the relationship between a set of molecular descriptors X and a quantitative value Y by fitting a linear equation to the observed data. In MLR analysis, stepwise regression is used to select the most informative descriptors and improve the performance of the QSAR/QSPR model. Formally, the QSAR/QSPR model constructed by the MLR method is

y_i = \beta_0 + \beta_1 x_{i1} + \beta_2 x_{i2} + \ldots + \beta_N x_{iN} = \sum_{n=1}^{N} \beta_n x_{in} + \beta_0    (2)

where y_i is the output value. To obtain the MLR parameters \beta_n, the ordinary least squares (OLS) approach is used, which minimizes the sum of squared differences between the actual and predicted values (the loss function). In practice, it is laborious to directly manipulate and visualize high-dimensional data; rather than analyzing the original dimensions of the data X, extracted variables are often more useful. In this regard, PCA is probably the most popular unsupervised learning technique; it is a statistical approach that reduces the dimensionality of the data set to a smaller set of principal components (PCs) while preserving its dominant characteristics (variance) [150]. The major goals of PCA are as follows: 1) to extract the most information from the X variables; 2) to analyze the patterns of the X and Y variables; and 3) to remove outliers. (Fig. 3B) shows the scores and loading plots derived from the PCA approach. Practically, if there are more variables (i.e., molecular descriptors) than compounds, the MLR method is not a suitable option, and the OLS approach might provide unstable parameters \beta_n that are difficult to interpret. The PLS method was proposed to handle a large number of variables.
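The following sketch ties the last three subsections together in scikit-learn: a data split (a plain random split is used here; the Kennard-Stone and Duplex algorithms mentioned above live in the R prospectr package rather than in scikit-learn), a Euclidean-distance k-NN classifier (Eq. 1) and an ordinary-least-squares MLR model (Eq. 2). The descriptor matrix and responses are synthetic placeholders.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.linear_model import LinearRegression

# Synthetic placeholders: 300 compounds x 20 descriptors
rng = np.random.default_rng(2)
X = rng.normal(size=(300, 20))
y_class = (X[:, 0] > 0).astype(int)                                  # activity class
y_value = 2.0 * X[:, 0] - 0.5 * X[:, 1] + rng.normal(0, 0.1, 300)    # continuous endpoint

# Split into training/internal and testing/external sets
X_tr, X_te, yc_tr, yc_te, yv_tr, yv_te = train_test_split(
    X, y_class, y_value, test_size=0.2, random_state=0)

# k-NN classification based on the Euclidean distance of Eq. 1
knn = KNeighborsClassifier(n_neighbors=5, metric="euclidean").fit(X_tr, yc_tr)
print("k-NN external accuracy:", knn.score(X_te, yc_te))

# MLR fitted by ordinary least squares, as in Eq. 2
mlr = LinearRegression().fit(X_tr, yv_tr)
print("MLR external R^2:", mlr.score(X_te, yv_te))
```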


Table 1. Summary of feature selection approaches.

Model Search | Advantage | Disadvantage | Examples
Filter | Independent of the classifier; better computational complexity than wrapper methods | Ignores interaction with the classifier | t-test
Wrapper | Interacts with the classifier; models feature dependencies | Risk of overfitting; classifier-dependent selection | Genetic algorithm [146]; sequential forward selection [145]; sequential backward elimination [145]
Embedded | Better computational complexity than wrapper methods; interacts with the classifier; models feature dependencies | Classifier-dependent selection | Decision tree [147]; logistic model tree [148]; random forests [149]

Fig. (3). Schematic overview of commonly used machine learning techniques, comprising k-nearest neighbors (A), principal component analysis (B), artificial neural networks (C), support vector machines (D), decision trees (E), and random forests (F).

PLS is the most commonly utilized approach in this situation, rather than MLR or PCA. Practically, PLS (projection to latent structures) is used to establish the correlation of a matrix of X variables that have high variance and good correlation with a matrix of Y variables. The correlation approximation is achieved by simultaneously projecting the X and Y matrices onto lower dimensional spaces that are represented by PLS components. The idea behind the PLS model is to calculate the PLS components T by decomposing the block of X = TP + residuals and predicting the response variable as Y = TC + residuals, where P and C are the loadings of X and Y, respectively. Additional details of the PLS model can be found in references [151-153].
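A minimal scikit-learn sketch of this decomposition is given below; PLSRegression computes the score matrix T and the loadings internally, and a synthetic descriptor matrix with more variables than compounds is used to illustrate the situation in which MLR/OLS breaks down.

```python
import numpy as np
from sklearn.cross_decomposition import PLSRegression

# Synthetic data: 100 compounds described by 500 descriptors (more variables than samples)
rng = np.random.default_rng(3)
X = rng.normal(size=(100, 500))
y = X[:, :5].sum(axis=1) + rng.normal(scale=0.1, size=100)

# Project X and y onto a small number of latent PLS components
pls = PLSRegression(n_components=5).fit(X, y)

T = pls.transform(X)                            # score matrix T (compounds x components)
print("scores T:", T.shape)
print("X loadings P:", pls.x_loadings_.shape)   # descriptors x components
print("training R^2:", pls.score(X, y))
```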


The k-NN, MLR, and PLS methods are suitable for modeling linear relationships between the variables X and Y; thus, when data sets possess a nonlinear relationship, these three methods might not perform well. ANN was proposed for use with nonlinearly separable data sets. Computational models of this method were inspired by the human central nervous system, and the details of ANN evolved from the perceptron concept, one of the algorithms used for supervised classification [154]. Mathematically, an ANN unit is represented by a nonlinear weighted sum:

y_i = \theta\left( \sum_{n=1}^{N} \beta_n x_{in} + \beta_0 \right)    (3)

where \theta(\cdot) is the activation function. The sigmoid function is a commonly used activation function in ANN and refers to the special case of the logistic function defined by

\theta(x_i) = \frac{1}{1 + e^{-\left( \sum_{n=1}^{N} \beta_n x_{in} + \beta_0 \right)}}    (4)

The prediction result takes a value of 1 if Eq. 4 is greater than the threshold value; otherwise, the prediction result is 0. Because the goal of any supervised learning algorithm is to construct a model that performs well on both internal and external sets, backpropagation (also called backward propagation) is commonly used to train ANNs together with an optimization method such as gradient descent. Backpropagation calculates the gradient of a loss function with respect to all of the weights or parameters \beta in the network; the gradient is fed to the optimization procedure to update the weights and minimize the loss function. This method has been applied in both classification and regression tasks. (Fig. 3C) shows the most common structure of an ANN, composed of three fully inter-connected layers, i.e., input, hidden, and output layers (Table 2). Additional details of the ANN method can be found in references [155] and [156].
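A tiny NumPy illustration of Eqs. 3-4 is given below for a single neuron; in practice the weights β would be learned by backpropagation (for example with scikit-learn's MLPClassifier or MLPRegressor), and the random descriptors and weights here are placeholders.

```python
import numpy as np

def sigmoid(z):
    """Logistic activation function theta(z) = 1 / (1 + exp(-z)), as in Eq. 4."""
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(4)
x_i = rng.normal(size=10)     # molecular descriptors of one compound
beta = rng.normal(size=10)    # weights (learned by backpropagation in a real ANN)
beta_0 = 0.1                  # bias term

# Single-neuron output y_i = theta(sum_n beta_n * x_in + beta_0), as in Eq. 3
y_i = sigmoid(np.dot(beta, x_i) + beta_0)
predicted_class = int(y_i > 0.5)   # thresholding as described in the text
print(y_i, predicted_class)
```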


SVM was originally developed for classification by Cortes and Vapnik [157]. This method attempts to construct a separating hyperplane that maximizes the margin between the two classes of a data set. Intuitively, good separation or classification occurs when the hyperplane has the greatest distance to the neighboring data points of both classes, because a larger margin leads to lower values of the loss function of the classifier and more accurate predictions for each data point. To easily understand SVM, a linear model (i.e., Eq. 2) can be used for a binary classification problem given a data set D. To maximize the margin, the optimization problem is defined as

\min_{w, \beta_0} \frac{1}{2} \|w\|^2 \quad \text{s.t.} \quad y_i (w \cdot x_i + \beta_0) \geq 1    (5)

This method has w and \beta_0 as its parameters. Previously, SVM has been successfully applied in QSAR modeling by utilizing the \varepsilon-insensitive loss function [157, 158] as follows:

L_\varepsilon(y, f(x, \beta)) = \begin{cases} |y - f(x, \beta)| - \varepsilon, & |y - f(x, \beta)| \geq \varepsilon \\ 0, & |y - f(x, \beta)| < \varepsilon \end{cases}
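As with the other methods, both the maximum-margin classifier of Eq. 5 and ε-insensitive support vector regression are available in scikit-learn; a brief sketch with synthetic placeholder data follows.

```python
import numpy as np
from sklearn.svm import SVC, SVR

# Synthetic placeholders: 200 compounds x 15 descriptors
rng = np.random.default_rng(5)
X = rng.normal(size=(200, 15))
y_class = (X[:, 0] + X[:, 1] > 0).astype(int)                 # binary activity labels
y_value = X[:, 0] - 0.5 * X[:, 1] + rng.normal(0, 0.1, 200)   # continuous endpoint

# Maximum-margin classifier (Eq. 5); C controls the softness of the margin
clf = SVC(kernel="linear", C=1.0).fit(X, y_class)

# Support vector regression using the epsilon-insensitive loss
reg = SVR(kernel="rbf", C=1.0, epsilon=0.1).fit(X, y_value)

print("classification accuracy:", clf.score(X, y_class))
print("regression R^2:", reg.score(X, y_value))
```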