Quantitative Structure-Activity Relationship Models for ...

2 downloads 0 Views 1MB Size Report
94 therapeutic subgroups belonging to the second level of ATC classification. The high-confidence .... Therapeutic Subgroup (ATC Code). High-confidence.
toxicological sciences 136(1), 242–249 2013 doi:10.1093/toxsci/kft189 Advance Access publication August 31, 2013

Quantitative Structure-Activity Relationship Models for Predicting Drug-Induced Liver Injury Based on FDA-Approved Drug Labeling Annotation and Using a Large Collection of Drugs Minjun Chen,* Huixiao Hong,* Hong Fang,† Reagan Kelly,* Guangxu Zhou,* Jürgen Borlak,‡ and Weida Tong*,1 *Division of Bioinformatics and Biostatistics, National Center for Toxicological Research, U.S. Food and Drug Administration, Jefferson, Arkansas 72079; †Office of Scientific Coordination, National Center for Toxicological Research, U.S. Food and Drug Administration, Jefferson, Arkansas 72079; and ‡Center for Pharmacology and Toxicology, Hannover Medical School, Hannover, Germany 1

To whom correspondence should be addressed. Fax: (870) 543-7854. E-mail: [email protected].

Drug-induced liver injury (DILI) is one of the leading causes of the termination of drug development programs. Consequently, identifying the risk of DILI in humans for drug candidates during the early stages of the development process would greatly reduce the drug attrition rate in the pharmaceutical industry but would require the implementation of new research and development strategies. In this regard, several in silico models have been proposed as alternative means in prioritizing drug candidates. Because the accuracy and utility of a predictive model rests largely on how to annotate the potential of a drug to cause DILI in a reliable and consistent way, the Food and Drug Administration–approved drug labeling was given prominence. Out of 387 drugs annotated, 197 drugs were used to develop a quantitative structure-activity relationship (QSAR) model and the model was subsequently challenged by the left of drugs serving as an external validation set with an overall prediction accuracy of 68.9%. The performance of the model was further assessed by the use of 2 additional independent validation sets, and the 3 validation data sets have a total of 483 unique drugs. We observed that the QSAR model’s performance varied for drugs with different therapeutic uses; however, it achieved a better estimated accuracy (73.6%) as well as negative predictive value (77.0%) when focusing only on these therapeutic categories with high prediction confidence. Thus, the model’s applicability domain was defined. Taken collectively, the developed QSAR model has the potential utility to prioritize compound’s risk for DILI in humans, particularly for the high-confidence therapeutic subgroups like analgesics, antibacterial agents, and antihistamines. Key Words:  drug-induced liver injury; predictive model; quantitative structure-activity relationship; drug label; therapeutic categories; external validation. Disclaimer: The views presented in this article do not necessarily reflect current or future opinion or policy of the U.S. Food and Drug Administration. Any mention of commercial products is for clarification and not intended as an endorsement.

The pharmaceutical industry suffers from a decline in the number of new drugs brought to market (Paul et  al., 2010; Watkins, 2011) due to about 90% of drug candidates approved for human testing failing in clinical trials (Kola and Landis, 2004). Efficacy and toxicity are 2 main causes for drug failure (Arrowsmith, 2011a,b) with drug-induced liver injury (DILI) being a principal cause (Ballet, 1997). Several methodologies are employed to evaluate risk for DILI at the preclinical stage, including in vivo, in vitro, and in silico approaches (Hamburg, 2011; Zhang et al., 2012). Notably, in silico computational methods like quantitative structure-activity relationship (QSAR) have been shown to be useful for safety screening during the early stages in drug discovery due to their rapid results and the fact that drug substances are unnecessary (Merlot, 2010; Muster et al., 2008; Thomas and Will, 2012), and physicochemical nature of compounds like lipophilicity has been identified as an important risk factor for DILI when considering together with daily dose (Chen et al., 2013a; Kaplowitz, 2013). QSAR models have been widely used for the study of hepatotoxicity (Przybylak and Cronin, 2012) but with only a few studies focusing on predicting risk for DILI in humans (Ekins et  al., 2010; Greene et  al., 2010; Liu et  al., 2011; Rodgers et  al., 2010). For example, Greene et  al. (2010) reported a QSAR model for human-specific DILI using a knowledgebased expert system named Derek for Windows. Ekins et  al. (2010) also reported a DILI QSAR model using a Bayesian approach. Liu et al. (2011) developed a DILI prediction system (DILIps) by modeling 13 side effects that collectively provided an indication for DILI using a QSAR approach. These pioneering works demonstrated the potential utility of QSAR for predicting complex endpoints associated with heterogeneous mechanisms like DILI in humans. The predictive performance of a QSAR model, particularly its performance as assessed by external validation sets containing chemicals unknown to the training model, is the most important consideration for its

Published by Oxford University Press on behalf of the Society of Toxicology 2013. This work is written by (a) US Government employee(s) and is in the public domain in the US.

Downloaded from http://toxsci.oxfordjournals.org/ by guest on January 13, 2016

Received April 23, 2013; accepted August 13, 2013

QSAR Models for Predicting DILI



and Precautions section as a high/moderate DILI risk. No-DILI-concern drugs (termed as negatives hereafter) are those whose drug labels contained no precautions regarding DILI. Drugs that did not fall into these 2 categories were labeled as less-DILI-concern. Because the DILI potential of less-DILI-concern drugs is often debatable, drugs falling into this category were not included in the model development. In our previous study, we established a DILI annotation for 137 positive drugs and 65 negative drugs (Chen et al., 2011a). In the present study, the same annotation was carried out for additional drugs numbering approximately 200. Altogether, 176 positive and 211 negative drugs were defined. The data set was divided into training and validation sets based on the drugs’ PubChem identification (CID). The training set contains 197 drugs with even numbered CID (81 positives and 116 negatives). The validation set (referred as NCTR validation set hereafter) contains 190 drugs with odd numbered CID (95 positive and 95 negative drugs). The drugs in the NCTR training and validation sets are listed in Supplementary Tables S1 and S2, respectively. Independent Validation Data Sets In addition to the NCTR validation set mentioned above, 2 additional data sets were included as independent validation data sets to assess the predictive performance of the QSAR model constructed from the training set. None of the drugs in the 3 validation sets were used in the training set. One data set was originally reported by Greene et al. (called Greene data set hereafter), which divides drugs into 4 DILI risk groups: HH (evidence for hepatotoxicity in humans), NE (no evidence for hepatotoxicity in any species), WE (weak evidence for human hepatotoxicity with < 10 case reports but generally considered to present no liver injury potential in humans), and AH (animal hepatotoxicity observed, not tested in humans) (Greene et al., 2010). Because the relevance of both WE and AH to human hepatotoxicity is not well established, only HH and NE drugs were used in this study. Of the 627 drugs/compounds in the original report, 328 drugs were selected for external validation after excluding those drugs that overlapped with the training set: 214 HH drugs and 114 NE drugs (Supplementary Table S3). The second data set used for independent validation was reported by Xu et  al. (called Xu data set hereafter) and was annotated according to human hepatotoxicity (Xu et al., 2008). The data set contains 344 drugs/compounds, but only 241 drugs (132 positives and 109 negatives) were used after excluding those whose chemical structures were not available or overlapped with the training set (Supplementary Table S4). Notably, the DILI annotations among the approaches of the NCTR, Greene, and Xu data sets are >80% in agreement (Chen et al., 2011a). In total, there were 483 unique drugs in the 3 validation data sets with only 54 drugs common to all 3 data sets, 131 drugs shared between the Xu and Greene data sets, 31 shared between the NCTR and Xu data sets, and 6 shared between the NCTR and Greene data sets. More information about these validation sets is shown in Supplementary Figure S1. QSAR Modeling

NCTR Data Sets With DILI Annotation Using the FDA-Approved Drug Labels

The term QSAR as used in the present study is interpreted broadly to include methodologies that predict activity on an ordinal or categorical scale. The reported QSAR model for predicting human-specific DILI was developed using Mold2 molecular descriptors (Hong et al., 2008) and the Decision Forest (DF) algorithm (Tong et al., 2003), both of which were developed in-house at the NCTR/FDA. The overall modeling procedure is depicted in Figure 1 and detailed below.

The FDA’s National Center for Toxicological Research (NCTR) developed a methodology to assess a drug’s potential to cause DILI in humans with FDA-approved drug labeling and the method categorized drugs into 3 groups: most-DILI-concern, less-DILI-concern, and no-DILI-concern (Chen et  al., 2011a). A drug was considered as most-DILI-concern (termed as positive hereafter) if it was withdrawn/discontinued from the market, labeled with a Boxed Warning due to a clinical case of DILI or specifically indicated in the Warnings

Molecular descriptors.  Molecular descriptors for drugs were generated using Mold2 (http://www.fda.gov/ScienceResearch/BioinformaticsTools/ Mold2/default.htm), a free software tool developed at NCTR. Mold2 calculates molecular descriptors from 2D chemical structures and has been demonstrated to be reliable for developing QSAR models (Hong et al., 2008). Specifically, 777 Mold2 descriptors were generated for all of the data sets (training and validation sets). Thereafter, the descriptors were filtered by removing those with

Material and Methods

Downloaded from http://toxsci.oxfordjournals.org/ by guest on January 13, 2016

potential use in decision making (Thomas and Will, 2012). To date, the predictive performance of most published models for DILI in humans has been unsatisfactory, with accuracies of approximately 60% or less, especially when the models are challenged by external validation sets. Besides the complex molecular events leading to hepatotoxicity, the limited availability of reliable data on human DILI is a formidable barrier for QSAR modeling (Przybylak and Cronin, 2012). In this study, we implemented an improved strategy to develop a QSAR model for DILI in humans. Firstly, the model is developed by using the Food and Drug Administration (FDA)– approved drug labeling data (Chen et al., 2011a). Human hepatotoxicity data are of crucial importance for QSAR modeling development; however, annotating a drug’s potential to cause human hepatotoxicity is a huge challenge (Zimmerman, 1999) and requires the assessment of severity, causality, and incidence (Chen et al., 2011a). Drug labeling is one of the few public data sources capable of addressing all 3 attributes (Willy and Li, 2004). Drug labels are drafted by manufacturers and approved by the FDA. Drug manufacturers utilize postmarketing case reports in addition to clinical trial data to incorporate adverse drug reactions (ADRs) in different sections of the drug label. The contents of the labels are reviewed by the FDA; consequently, the decision to include or exclude ADRs in the drug label is relatively consistent and stable. Drug label–based DILI annotation might lead to more reproducible and reliable results for the assessment of a drug’s risk for DILI in humans. Secondly, an extensive model validation strategy was applied to ensure the model’s performance that was better than chance and sustainable to the challenge by the largest validation sets that have ever been reported. Specifically, the performance of the model was first assessed with cross-validation and permutation tests and then validated using a total of 483 unique drugs from 3 independent data sets. Given the fact that a QSAR model usually predicts better for some categories of chemicals than for others (ie, applicability domain), we further exploited the dependency of the model performance on the therapeutic categories of drugs. Two groups of therapeutic categories were identified, one associating with a “high-confidence” prediction and the other with the “low-confidence” prediction. As a result, the model achieved a high prediction accuracy (ranging from 68% to 82% for 3 validation sets) for the high-confidence category of drugs.

243

244

Chen et al. developed. The performance of these 2000 models was compared with the QSAR model derived from the training set. A 2-sided paired t test was used to determine the statistical significance of the difference between the training results and permutation analysis.

Statistical Analysis constant value across all drugs and those with less than 5% of drugs having nonzero values. The remaining 584 descriptors were used in modeling. QSAR model development.  We previously reported an improved supervised machine learning methodology called DF (Tong et al., 2003). DF employs a consensus modeling technique of combining multiple heterogeneous decision trees to achieve a more accurate predictive model (Chen et  al., 2011b). DF has been employed in the modeling of a variety of biological systems including complex biological endpoints such as predicting estrogen receptor binding activity and classifying prostate cancer samples (Tong et al., 2004a,b). In the present study, the DF procedure began with 1 tree that was set up with each node containing a minimum of 5 drugs. The process was then repeated for a second tree using the descriptors that were not used by the previous tree. The process was repeated until no trees could be developed using the remaining descriptors. The trees were constructed using a variant of the Classification and Regression Tree method and were pruned based on tree cost complexity (Tong et  al., 2003). The DF algorithm is implemented in a freely available software tool published by the FDA (http://www.fda.gov/ScienceResearch/ BioinformaticsTools/DecisionForest/default.htm). Cross-validation.  A 10-fold cross-validation was applied to assess the model performance (Tropsha and Golbraikh, 2007). The data set was randomly split into 10 equal portions tailored to the ratio between the positive and negative drugs. Nine of the 10 portions were used to develop a model, which was then applied to the remaining portions to assess prediction performance. This process was repeated sequentially so that each of the 10 portions was left out once. The prediction results of the 10 cross-validation models were then averaged to measure the performance. The 10-fold cross-validation was repeated 2000 times to achieve a statistically reliable estimation of the model’s performance. Permutation analysis to assess chance correlation.  Permutation analysis is a common approach to determine whether a model’s performance estimates differ from random chance (Rodgers et al., 2010). In this study, the class labels (DILI or not) of drugs in the training set were randomly shuffled, whereas the Mold2 descriptor values (the independent variables) remained unchanged. We generated 2000 permuted data sets, and for each, a QSAR model was

The odds ratio obtained from logistic regression was used to measure the relative risk for DILI in a specific group, and a 2-sided Fisher’s exact test was used to examine the statistical significance of the association. The QSAR model’s predictive performance was assessed by accuracy [(TP + TN)/(TP + TN + FP + FN)], sensitivity [TP/(TP + FN)], specificity [TN/(TN + FP)], positive predictive value (PPV) [TP/(TP + FP)], and negative predictive value (NPV) [TN/(TN + FN)], where TP is true positive, TN is true negative, FP is false positive, and FN is false negative. Estimates for the odds ratio and Fisher’s exact tests were obtained using R and the “Stats” package (R Development Core Team, 2011).

Results

A QSAR model using DF and Mold2 molecular descriptors based on the NCTR training set of 197 drugs annotated by FDAapproved drug labeling was developed. The QSAR model consisted of 6 decision trees using 82 descriptors (Supplementary Figure S2 and Table S5). Cross-validation and permutation analysis were performed to demonstrate that the observed predictive performance was not by chance. Three validation data sets with a total of 483 unique drugs were employed to further assess the model’s predictive power. We also identified the highconfidence therapeutic subgroups from the ATC Classification System in which the QSAR model performs best. Development of the QSAR Model Two thousand repetitions of 10-fold cross-validation were performed based on the training set; the results are summarized in Table 1. The mean accuracy, sensitivity, and specificity were 69.7% (relative standard deviation [RSD] = 2.9%), 57.8% (RSD = 6.2%), and 77.9% (RSD = 3.0%), respectively.

Downloaded from http://toxsci.oxfordjournals.org/ by guest on January 13, 2016

Fig.  1.  Flowchart of the quantitative structure-activity relationship (QSAR) model construction and validation process in the present study. The Decision Forest (DF) model was developed using molecular descriptors calculated using Mold2. Permutation tests were conducted to assess the level of chance correlation of the model. The predictive performance of the QSAR model was assessed using 2000 repetitions of 10-fold cross-validation and external validation of 3 independent data sets. The applicability of the QSAR model was also assessed using the 3 validation data sets.

Identifying the high-confidence therapeutic subgroups.  The World Health Organization’s (WHO) Anatomical Therapeutic Chemical (ATC) classification system is widely used for defining drug classes. ATC contains 5 levels of classification according to target organs or systems, therapeutic, pharmacological, and chemical properties. The analysis of determining the dependency of the model performance on the therapeutic categories was based on a total of 94 therapeutic subgroups belonging to the second level of ATC classification. The high-confidence therapeutic subgroups where the model performed better were identified with cross-validation based on the training set. Specifically, prediction accuracies of all subgroups were ranked based on the prediction results from the 2000 runs of 10-fold cross-validation. Only therapeutic subgroups containing at least 3 drugs in the training set were included in the confidence analysis; otherwise, they were termed as “not defined” and were excluded from the confidence analysis. A  subgroup was termed as a high-confidence group if it had a higher accuracy than the overall accuracy assessed by the whole training set and termed as a low-confidence one otherwise. A small number of drugs belonged to multiple therapeutic subgroups. Because the assessment was done within therapeutic subgroups rather than across multiple subgroups, these drugs were counted in each of the subgroups in which they appeared. The high-confidence and low-confidence subgroups determined from cross-validation were further assessed by the 3 external validation data sets.

245

QSAR Models for Predicting DILI



Table 1 Summary of Internal Cross-Validation and External Validation Results Internal Cross-Validation (2000 runs)

Accuracy Sensitivity Specificity PPV NPV Drugs

External Validation

NCTR Training Seta

NCTR Validation Set

Greene Data Set

Xu Data Set

69.7% ± 2.9% 57.8% ± 6.2% 77.9% ± 3.0% 64.6% ± 4.3% 72.6% ± 2.5% 197 (pos/neg = 81/116)

68.9% 66.3% 71.6% 70.0% 68.0% 190 (pos/neg = 95/95)

61.6% 58.4% 67.5% 77.2% 46.4% 328 (pos/neg = 214/114)

63.1% 60.6% 66.1% 68.4% 58.1% 241 (pos/neg = 132/109)

Note. Cross-validated results are averaged values of 2000 runs of 10-fold cross-validations. External validation results are prediction results on the 3 validation sets. Abbreviation: pos/neg, positive/negative. a Mean ± RSD.

Fig.  2.  Distribution of prediction accuracy of the 2000 runs of 10-fold cross-validation (right) and the 2000 permutation analyses (left).

A permutation analysis was conducted to assess chance correlation of the model. The results of the 2000 cross-validations and 2000 permutations are plotted in Figure  2. The median accuracy value from the cross-validations was significantly higher than that of the permutations’ results (69.7% vs 48.5%, p = 1.57E-12). The permutation analysis demonstrated that the prediction performance of the QSAR model was not due to random chance. QSAR Model Performance Evaluated by 3 External Validation Sets The predictive performance of the QSAR model was assessed by 3 validation data sets: the NCTR, Xu, and Greene validation sets. The QSAR model’s estimated performance on these data sets is summarized in Table 1. The prediction accuracy of the NCTR validation data set (68.9%) was slightly higher than that of the other 2 data sets (63.1% for the Xu data set and 61.6% for the Greene data set).

Identification of High-Confidence Therapeutic Subgroups Two thousand repetitions of cross-validation based on the training set were performed to segregate therapeutic subgroups defined by the second level of ATC in which the QSAR model had a higher or lower accuracy (ie, to identify high- and low-confidence subgroups) than the overall accuracy (69.7%) as shown in Supplementary Figure S3. The analysis identified 22 therapeutic subgroups applied in the QSAR model as high-confidence (eg, analgesics, antihistamines, antibacterials, antihemorrhagics, corticosteroids, antihypertensives, vasoprotectives, and muscle relaxants) and 18 low-confidence therapeutic subgroups such as psycholeptics, anabolic agents, calcium channel blockers, antivirals, and anesthetics as detailed in Table 2. The predictive accuracy in the high-/low-confidence subgroups as assessed by the 3 validation sets individually was plotted in Figure 3. The QSAR model performed consistently better on drugs in the high-confidence therapeutic subgroups than on drugs in the low-confidence therapeutic subgroups.

Downloaded from http://toxsci.oxfordjournals.org/ by guest on January 13, 2016

We specifically assessed drugs appearing in more than 1 of the 3 validation sets and compared the QSAR prediction between drugs with consistent and inconsistent DILI annotations across the validation sets. Among the drugs with consistent annotations between the NCTR validation and the Greene data sets, 69.1% (47/68) of the drugs were correctly predicted by the QSAR model compared with only 58.8% (10/17) of those drugs with inconsistent annotations (the accuracies are calculated based on the NCTR annotation). Similar trends were observed in the comparison between the NCTR validation and Xu data sets or between the Greene and Xu data sets. Furthermore, the positive prediction rates of the QSAR model on drugs with DILI concern in the drug label’s Boxed Warning section, Warnings & Precautions section, and Adverse Reactions section were 76.9% (20/26), 59.3% (51/86), and 42.6% (40/94), respectively.

246

Chen et al.

Table 2 The High- and Low-Confidence Therapeutic Subgroups (Second Level of ATC Classification) Identified From Cross-Validation of the QSAR Model Based on the Training Set Confidence Domain High-confidence

Stomatological (A01), drugs for functional gastrointestinal disorders (A03), antidiarrheals, intestinal anti-inflammatory/anti-infective agents (A07), antihemorrhagics (B02), cardiac therapy (C01), antihypertensives (C02), vasoprotectives (C05), antipruritics, incl. antihistamines, anesthetics, etc. (D04), antibiotics and chemotherapeutics for dermatological use (D06), corticosteroids for systemic use (H02), antibacterials for systemic use (J01), muscle relaxants (M03), analgesics (N02), nasal preparations (R01), throat preparations (R02), drugs for obstructive airway diseases (R03), cough and cold preparations (R05), antihistamines for systemic use (R06), ophthalmologicals (S01), otologicals (S02), ophthalmological and otological preparations (S03), contrast media (V08) Vitamins (A11), anabolic agents for systemic use (A14), antithrombotic agents (B01), peripheral vasodilators (C04), calcium channel blockers (C08), agents acting on the renin-angiotensin system (C09), antifungals for dermatological use (D01), gynecological anti-infectives and antiseptics (G01), sex hormones and modulators of the genital system (G03), urologicals (G04), antivirals for systemic use (J05), antineoplastic agents (L01), endocrine therapy (L02), antiinflammatory and antirheumatic products (M01), anesthetics (N01), psycholeptics (N05), psychoanaleptics (N06), antiprotozoals (P01)

The average accuracies for the NCTR validation, Greene, and Xu data sets are 81.8%, 68.6%, and 68.3% for high-confidence therapeutic subgroups and 61.3%, 59.6%, and 64.2% for low-confidence subgroups. Supplementary Table S6 lists the performance for each of the high-/low-confidence subgroups in the 3 validation data sets. When we combined the unique drugs with consistent annotations across the 3 validation data sets, the logistic regression modeling estimated an odds ratio for DILI risk of 7.50 for drugs in the high-confidence subgroup versus 1.95 for drugs in the low-confidence subgroup (Table 3), and the negative predictive value for the high-confidence subgroup was 77.0% versus 46.6% for the low-confidence subgroup. Discussion

In silico testing strategies like QSAR models have become a promising approach for identification of toxic drug candidates in the early stages of drug discovery (Meanwell, 2011);

Fig. 3.  Prediction accuracy in the high- and low-confidence therapeutic subgroups derived from 3 external validation sets.

however, the predictive performance of the currently available QSAR models in predicting DILI risk in humans is poor. Specifically, the low negative predictive values of these models would increase the risk of toxic drug candidates passing this initial screening and proceeding into the development pipeline (Merlot, 2010; Thomas and Will, 2012). By using a large number of drugs to develop the QSAR model, relying on FDA-approved drug labeling for DILI annotation, and focusing on high-confidence ATC subgroups, we have achieved an improvement in both overall accuracy (73.6% vs 55%–60%) as well as negative predictive value (77.0% vs 42%–45%) (Ekins et al., 2010; Greene et al., 2010) without significantly sacrificing the positive predictive value (69.1% vs 75%–77%). Notably, the success of in silico methods for prediction of toxicity largely depends on the quality and quantity of the data used for developing predictive models (Przybylak and Cronin, 2012). The low frequency of DILI in large patient populations together with the lack of robust biomarkers for diagnosis of DILI has posed a challenge to reliably assess DILI in humans (Chen et al., 2012). Our annotation based on FDA drug labels has proved to be more reliable and consistent than other DILI annotation methods that have been proposed in the literature (Chen et al., 2011a). The importance of the quality of annotation is underscored by our observations. For example, the QSAR model predicted best in drugs with Boxed Warning, followed by drugs with DILI concern in the Warnings & Precautions section and the drugs with DILI concern in the Adverse Reactions section. This is likely due to the fact that the more serious regulatory actions require stronger evidence and so will tend to be more accurate assessments of DILI potential. For a complicated endpoint like DILI, “one-size-fits-all” approaches are not appropriate (Chen et  al., 2013b). In fact, QSAR models have been found to perform better for some chemicals than others (Netzeva et al., 2005; Rusyn et al., 2012). Because of this, it is important to identify to which domains QSAR models may be most successfully applied. In the present

Downloaded from http://toxsci.oxfordjournals.org/ by guest on January 13, 2016

Low-confidence

Therapeutic Subgroup (ATC Code)

QSAR Models for Predicting DILI

46.6 56.7 1.95 (1.0–3.9), p = .05

58.6

59.8

69.0

77.0 69.1 77.0

Abbreviation: pos/neg, positive/negative.

Low confidence (pos/neg = 97/60)

High confidence (pos/neg = 54/74)

Positive Negative Positive Negative

38 17 58 39

17 57 26 34

7.50 (3.2–17.9), p < .001

73.6

69.1

PPV (%) Specificity (%) Sensitivity (%) Accuracy (%) Odds Ratio (95% Confidence) DILI Negative DILI Positive QSAR Prediction Confidence Domain

Observed Results

247

study, we examined the training set using cross-validation of the QSAR model to identify 22 high-confidence and 18 low-confidence therapeutic subgroups. More importantly, the application of the QSAR model to drugs from the 3 independent validation data sets confirmed superior performance in the high-confidence therapeutic subgroups. Some drugs in the high-confidence subgroups are well documented either to cause or not to cause DILI. For example, acetaminophen is predicted as positive by the QSAR model and belongs to a high-confidence therapeutic subgroup (analgesics) as a well-known DILI drug that accounts for almost half of the acute liver failure cases in the United States (Ostapowicz et  al., 2002). Moreover, our QSAR model works well for some therapeutic subgroups that were reported to rarely cause significant liver injury, such as antihistamines (Fontana, 2008). Thus, understanding a model’s applicability domain will enhance the appropriate use of the model. By focusing on those therapeutic subgroups with high confidence, our QSAR model achieves better performance and allows alternative methods to predict DILI to be developed for other therapeutic subgroups. This approach allows us to address the current problem with the application of QSAR modeling to DILI assessment, which is the overall poor predictive performance of the models. The applicability domain of QSAR has gained increased attention in the area of predictive modeling (Rusyn et al., 2012; Tong et al., 2004a). We have investigated both major aspects of applicability domains in the present study: chemical structure space and prediction confidence. The predictive performance for DILI drugs is not obviously improved (mostly less than 5% of increase in accuracy as indicated in Supplementary Table S7). DILI is a complex and multifactorial endpoint. Its prediction in the drug space is not trivial. The simple dichotomy of application domain based on chemical structure space or prediction confidence might oversimplify the complexity. The therapeutic categories represent respective groups of chemicals in the drug space, which seems to be a useful alternative to stratify the drug space to approach complex endpoints in QSAR. Nonetheless, this finding warrant further investigation by either using a larger set of drug data for DILI or examined in other complex endpoints. The newly developed QSAR model was based on Mold2 molecular descriptors (which are calculated from the chemical structure) and built using the DF algorithm, both freely available software packages developed in-house. The molecular descriptors used for the developed QSAR model are very diverse (Supplementary Figure S2 and Table S5). For example, hydrophilic factor index (D775) from level 3 in tree #4 reflects a compound’s lipophilicity and was recently shown to be an important risk factor for DILI (Chen et al., 2013a). Moreover, some descriptors define the molecular shape (D242) and size (D123), which may affect off-target activities of drug candidates and thereby contribute to risk for DILI (Leeson and Springthorpe, 2007). Some caveats should be noted in the application of our QSAR model. Firstly, this study is a retrospective analysis based on a survey of ADRs of marketed drugs, which passed extensive safety testing before obtaining market authorization.

Downloaded from http://toxsci.oxfordjournals.org/ by guest on January 13, 2016

Table 3 Predictive Performances in the High- and Low-Confidence Therapeutic Subgroups Derived From a Combined Data Set of the 3 External Validation Sets

NPV (%)



248

Chen et al.

Supplementary Data

Supplementary data are available online at http://toxsci. oxfordjournals.org/. Funding

German Federal Ministry for Education and Research as part of the Virtual Liver Network initiative (031 6154 to J.B.). Furthermore, J.B. is recipient of an ORISE Stipend of the Food and Drug Administration, which is gratefully acknowledged. References Arrowsmith, J. (2011a). Trial watch: Phase II failures: 2008–2010. Nat. Rev. Drug Discov. 10, 328–329. Arrowsmith, J. (2011b). Trial watch: Phase III and submission failures: 2007– 2010. Nat. Rev. Drug Discov. 10, 87. Ballet, F. (1997). Hepatotoxicity in drug development: Detection, significance and solutions. J. Hepatol. 26(Suppl. 2), 26–36. Chen, M., Borlak, J., and Tong, W. (2013a). High lipophilicity and high daily dose of oral medications are associated with significant risk for drug-induced liver injury. Hepatology 58, 388–396. Chen, M., Shi, L., Kelly, R., Perkins, R., Fang, H., and Tong, W. (2011b). Selecting a single model or combining multiple models for microarray-based classifier development?—A comparative analysis based on large and diverse datasets generated from the MAQC-II project. BMC Bioinformatics 12, S3. Chen, M., Vijay, V., Shi, Q., Liu, Z., Fang, H., and Tong, W. (2011a). FDAapproved drug labeling for the study of drug-induced liver injury. Drug Discov. Today 16, 697–703. Chen, M., Zhang, J., Wang, Y., Liu, Z., Kelly, R., Zhou, G., Fang, H., Borlak, J., and Tong, W. (2013b). Liver Toxicity Knowledge Base (LTKB)—A systems approach to a complex endpoint. Clin. Pharmacol. Therapeut. 95, 409–412.

Chen, M., Zhang, M., Borlak, J., and Tong, W. (2012). A decade of toxicogenomic research and its contribution to toxicological science. Toxicol. Sci. 130, 217–228. Ekins, S., Williams, A. J., and Xu, J. J. (2010). A predictive ligand-based Bayesian model for human drug induced liver injury. Drug Metab. Dispos. 38, 2302–2308. Fontana, R. J. (2008). Acute liver failure due to drugs. Semin. Liver Dis. 28, 175–187. Greene, N., Fisk, L., Naven, R. T., Note, R. R., Patel, M. L., and Pelletier, D. J. (2010) Developing structure-activity relationships for the prediction of hepatotoxicity. Chem. Res. Toxicol. 23, 1215–1222. Hamburg, M. A. (2011). Advancing regulatory science. Science. 331, 987. Hong, H., Xie, Q., Ge, W., Qian, F., Fang, H., Shi, L., Su, Z., Perkins, R., and Tong, W. (2008). Mold(2), molecular descriptors from 2D structures for chemoinformatics and toxicoinformatics. J. Chem. Inf. Model. 48, 1337–1344. Kaplowitz, N. (2013). Avoiding idiosyncratic DILI: Two is better than one. Hepatology 58, 15–17. Kola, I., and Landis, J. (2004). Can the pharmaceutical industry reduce attrition rates? Nat. Rev. Drug Discov. 3, 711–715. Leeson, P. D., and Springthorpe, B. (2007). The influence of drug-like concepts on decision-making in medicinal chemistry. Nat. Rev. Drug Discov. 6, 881–890. Liu, Z., Shi, Q., Ding, D., Kelly, R., Fang, H., and Tong, W. (2011). Translating clinical findings into knowledge in drug safety evaluation—Drug induced liver injury prediction system (DILIps). PLoS Comput. Biol. 7, e1002310. Meanwell, N. A. (2011). Improving drug candidates by design: A  focus on physicochemical properties as a means of improving compound disposition and safety. Chem. Res. Toxicol. 24, 1420–1456. Merlot, C. (2010) Computational toxicology—A tool for early safety evaluation. Drug Discov. Today 15, 16–22. Muster, W., Breidenbach, A., Fischer, H., Kirchner, S., Muller, L., and Pahler, A. (2008). Computational toxicology in drug development. Drug Discov. Today 13, 303–310. Netzeva, T. I., Worth, A., Aldenberg, T., Benigni, R., Cronin, M. T., Gramatica, P., Jaworska, J. S., Kahn, S., Klopman, G., Marchant, C. A., et al. (2005). Current status of methods for defining the applicability domain of (quantitative) structure-activity relationships. The report and recommendations of ECVAM Workshop 52. Altern. Lab. Anim. 33, 155–173. Ostapowicz, G., Fontana, R. J., Schiodt, F. V., Larson, A., Davern, T. J., Han, S. H., McCashland, T. M., Shakil, A. O., Hay, J. E., Hynan, L., et al. (2002). Results of a prospective study of acute liver failure at 17 tertiary care centers in the United States. Ann. Intern. Med. 137, 947–954. Paul, S. M., Mytelka, D. S., Dunwiddie, C. T., Persinger, C. C., Munos, B. H., Lindborg, S. R., and Schacht, A. L. (2010). How to improve R&D productivity: the pharmaceutical industry’s grand challenge. Nat. Rev. Drug Discov. 9, 203–214. Przybylak, K. R., and Cronin, M. T. (2012). In silico models for druginduced liver injury—Current status. Expert Opin. Drug Metab. Toxicol. 8, 201–217. R Development Core Team. (2011). R: A  Language and Environment for Statistical Computing. R Foundation for Statistical Computing, Vienna, Austria. ISBN 3-900051-07-0. Available at: http://www.R-project.org/. Rodgers, A. D., Zhu, H., Fourches, D., Rusyn, I., and Tropsha, A. (2010). Modeling liver-related adverse effects of drugs using knearest neighbor quantitative structure-activity relationship method. Chem. Res. Toxicol. 23, 724–732. Rusyn, I., Sedykh, A., Low, Y., Guyton, K. Z., and Tropsha, A. (2012). Predictive modeling of chemical hazard by integrating numerical descriptors of chemical structures and short-term toxicity assay data. Toxicol. Sci. 127, 1–9. Thomas, C. E., and Will, Y. (2012). The impact of assay technology as applied to safety assessment in reducing compound attrition in drug discovery. Expert Opin. Drug Discov. 7, 109–122.

Downloaded from http://toxsci.oxfordjournals.org/ by guest on January 13, 2016

Thus, additional studies are needed to assess the full potential of the developed QSAR model. Secondly, the QSAR model has established a statistical association between a compound’s chemical structure (ie, molecular descriptors) and the potential to cause DILI in humans but does not necessarily indicate any specific mechanistic relationship. Our study demonstrates that it is possible to develop QSAR models that are capable of accurately predicting a drug’s DILI risk in humans. By considering high-quality annotations, such as the FDA-approved drug labeling; utilizing powerful chemoinformatic, bioinformatic, and statistical methods; building the model with a large initial set of drugs; validating the model with external data sets; and focusing on therapeutic subgroups where the model performs best, we have been able to mitigate the problem of low accuracy and negative predictive values while maintaining acceptable positive predictive values. The models may be used for a wide range of applications; for one example, they can identify putative DILI liability of drug candidates during the early drug discovery stages, particularly for those falling in high-confidence therapeutic subgroups like analgesics, antibacterials, and antihistamines.



QSAR Models for Predicting DILI

Tong, W., Hong, H., Fang, H., Xie, Q., and Perkins, R. (2003). Decision forest: Combining the predictions of multiple independent decision tree models. J. Chem. Inf. Comput. Sci. 43, 525–531. Tong, W., Xie, Q., Hong, H., Shi, L., Fang, H., and Perkins, R. (2004a). Assessment of prediction confidence and domain extrapolation of two structure-activity relationship models for predicting estrogen receptor binding activity. Environ. Health Perspect. 112, 1249–1254. Tong, W., Xie, Q., Hong, H., Shi, L., Fang, H., Perkins, R., and Petricoin, E. F. (2004b). Using decision forest to classify prostate cancer samples on the basis of SELDI-TOF MS data: Assessing chance correlation and prediction confidence. Environ. Health Perspect. 112, 1622–1627. Tropsha, A., and Golbraikh, A. (2007). Predictive QSAR modeling workflow, model applicability domains, and virtual screening. Curr. Pharm. Des. 13, 3494–3504.

249

Watkins, P. B. (2011). Drug safety sciences and the bottleneck in drug development. Clin. Pharmacol. Ther. 89, 788–790. Willy, M. E., and Li, Z. (2004). What is prescription labeling communicating to doctors about hepatotoxic drugs? A study of FDA approved product labeling. Pharmacoepidemiol. Drug Saf. 13, 201–206. Xu, J. J., Henstock, P. V., Dunn, M. C., Smith, A. R., Chabot, J. R., and de Graaf, D. (2008). Cellular imaging predictions of clinical drug-induced liver injury. Toxicol. Sci. 105, 97–105. Zhang, M., Chen, M., and Tong, W. (2012). Is toxicogenomics a more reliable and sensitive biomarker than conventional indicators from rats to predict drug-induced liver injury in humans? Chem. Res. Toxicol. 25, 122–129. Zimmerman, H. J. (1999). Hepatotoxicity: The Adverse Effects of Drugs and Other Chemicals on the Liver. Lippincott, Williams & Wilkins, Philadelphia, PA.

Downloaded from http://toxsci.oxfordjournals.org/ by guest on January 13, 2016

Suggest Documents