SCIENCE CHINA Life Sciences • RESEARCH PAPERS •

October 2010 Vol.53 No.10: 1252–1262 doi: 10.1007/s11427-010-4062-9

Phenotype prediction of nonsynonymous single nucleotide polymorphisms in human phase II drug/xenobiotic metabolizing enzymes: perspectives on molecular evolution

HAO DaCheng1*, XIAO PeiGen2* & CHEN ShiLin2

1 Laboratory of Biotechnology, College of Environment, Dalian Jiaotong University, Dalian 116028, China;
2 Institute of Medicinal Plant Development, Chinese Academy of Medical Sciences and Peking Union Medical College, Beijing 100193, China

Received April 18, 2010; accepted May 27, 2010

Nonsynonymous single nucleotide polymorphisms (nsSNPs) in coding regions can lead to amino acid changes that might alter the protein’s function and account for susceptibility to disease and altered drug/xenobiotic response. Many nsSNPs have been found in genes encoding human phase II metabolizing enzymes; however, little is known about the relationship between the genotype and phenotype of nsSNPs in these enzymes. We have identified 923 validated nsSNPs in 104 human phase II enzyme genes from the Ensembl genome database and the NCBI SNP database. Using the PolyPhen, Panther, and SNAP algorithms, 44%–59% of nsSNPs in phase II enzyme genes were predicted to have an impact on protein function. Predictions largely agree with the available experimental annotations: 68% of deleterious nsSNPs were correctly predicted as damaging. This study also identified many amino acids that are likely to be functionally critical but have not yet been studied experimentally. There was significant concordance between the predicted results of Panther and PolyPhen, and between SNAP non-neutral predictions and PolyPhen scores. Evolutionarily non-neutral (destabilizing) amino acid substitutions are thought to be the pathogenetic basis for the alteration of phase II enzyme activity and to be associated with disease susceptibility and drug/xenobiotic toxicity. Furthermore, the molecular evolutionary patterns of phase II enzymes were characterized with regard to the predicted deleterious nsSNPs.

Keywords: phenotype, PolyPhen, Panther, SNAP, SNP, phase II drug/xenobiotic metabolizing enzyme

Citation: Hao D C, Xiao P G, Chen S L. Phenotype prediction of nonsynonymous single nucleotide polymorphisms in human phase II drug/xenobiotic metabolizing enzymes: perspectives on molecular evolution. Sci China Life Sci, 2010, 53: 1252–1262, doi: 10.1007/s11427-010-4062-9

*Corresponding author (email: [email protected]; [email protected])
© Science China Press and Springer-Verlag Berlin Heidelberg 2010 life.scichina.com www.springerlink.com

Phase II metabolism reactions, usually known as conjugation reactions (e.g., with glucuronic acid, sulfonates, glutathione, or amino acids), are usually detoxifying in nature and involve interactions of the polar functional groups of phase I metabolites. Sites on drugs where conjugation reactions occur include carboxyl (-COOH), hydroxyl (-OH), amino (-NH2), and sulfhydryl (-SH) groups. Products of conjugation reactions have increased molecular weight and are usually inactive, unlike phase I reactions, which often produce active metabolites. Several major enzymes and pathways are involved in phase II drug/xenobiotic metabolism, including UDP-glucuronosyltransferases (UGTs, for glucuronidation), glutathione S-transferases (GSTs, for glutathione conjugation), sulfotransferases (SULTs, for sulfation), N-acetyltransferases (NATs, for acetylation), and methyltransferases (MTs, for methylation) [1,2].

The glucuronidation reaction comprises the transfer of a glucuronosyl group from uridine 5'-diphospho-glucuronic acid to substrate molecules that contain oxygen, nitrogen, sulfur, or carboxyl functional groups. For example, UGT1A3 is responsible for the glucuronidation of the anti-HIV drug candidate PA-457 (bevirimat), which is not metabolized by phase I detoxification enzymes [3]. The resulting glucuronide is more polar and more easily excreted than the substrate molecule. UGTs form a gene superfamily, and currently a total of 22 different UGT proteins have been detected in human tissues, belonging to either the UGT1A (1A1, 1A3, 1A4, 1A5, 1A6, 1A7, 1A8, 1A9 and 1A10), the UGT2A (2A1, 2A2 and 2A3), the UGT2B (2B4, 2B7, 2B10, 2B11, 2B15, 2B17 and 2B28), the UGT3 (3A1 and 3A2), or the UGT8 family [4]. Nonsynonymous SNPs (nsSNPs) of the human UGT genes can cause absent or reduced enzyme activity, and polymorphisms of UGT have been found to be closely related to altered drug clearance and/or drug response, hyperbilirubinemia, Gilbert’s syndrome, and Crigler-Najjar syndrome (CN) [5,6]. Similarly, nsSNPs of the human GST, SULT, NAT, and MT genes have been found to be closely related to altered drug clearance and/or drug response, and to increased risk for various diseases such as cancer, chronic obstructive pulmonary disease, and Alzheimer’s and Parkinson’s diseases [7–13]. For example, nsSNPs in the NAT2 gene, which allow the population to be split into three categories (slow, intermediate, and fast metabolizers), result in inherited variation in the pharmacokinetics and pharmacodynamics of anti-HIV drugs, hydralazine, and isoniazid, ultimately determining their efficacy and toxicity [14].

Human genetic variation can directly or indirectly influence response to modern antiretroviral therapies for HIV.
Some immunogenetic and other human genetic variations affect the natural history of HIV disease progression in untreated individuals, but less information is available as to whether these differences are still relevant in the context of HAART (highly active antiretroviral therapy) [15]. Antiretroviral therapy adds further opportunities for human genetic contributions to affect prognosis, in particular for those genes that influence pharmacokinetics and/or adverse events, e.g., phase II detoxification enzyme genes. To date, the majority of studies investigating the influence of human genetic variation on HIV disease and treatment outcome have focused on a small number of SNPs, not including those of phase II enzymes. The functional impact of most nsSNPs in human phase II metabolizing enzyme genes is still unknown.

Computational technologies aid the experimental exploration of nsSNPs and are indispensable in predicting the response to HAART in HIV-infected subjects. Di et al. [5] identified 248 nsSNPs from human UGT genes and used SIFT and PolyPhen to predict the impact of these nsSNPs on protein function. However, the numbers of nsSNPs in UGTs and other phase II metabolizing enzyme genes collected in public databases and publications are rapidly increasing, and new algorithms with better predictive power are becoming available. Currently, the assessment of functionally essential residues of drug/xenobiotic metabolizing enzymes remains relatively incomplete. In this study, we have investigated the potential effect of known human phase II metabolizing enzyme nsSNPs on protein function using the PolyPhen, Panther, and SNAP algorithms. The data set we compiled is larger and more diverse and could be used for the evaluation of prediction methods that require similar types of inputs.

1 Materials and methods

1.1 Gene nomenclature and dataset

The UGT, GST, SULT, and NAT genes were named according to the literature [4,16–18]. The data on human phase II metabolizing enzyme genes were collected from Ensembl (http://www.ensembl.org/Homo_sapiens/Search) and Entrez Gene on the National Center for Biotechnology Information (NCBI) website (http://www.ncbi.nlm.nih.gov/sites/entrez). Expired and merged gene names were excluded from the study. The majority of the variants included in this analysis were identified during the screening of 25 human UGT, 19 GST, 42 SULT, 16 NAT, and two MT genes from Ensembl (http://www.ensembl.org/Homo_sapiens/Gene/Variation_Gene). Ensembl integrates genetic variants from dbSNP, UniProt, the personal genomes of Watson and Venter, and Illumina human genome sequencing results, and links to information such as transcript, population genetics, individual genotype, genomic context, phenotype data, and phylogenetic context. For each variant, the following information was collected: gene symbol, gene name, mRNA accession number (ENST or NM), protein accession number (ENSP or NP), SNP ID, amino acid residue 1 (wild-type, wt), amino acid position, and amino acid residue 2 (missense). Supplementary variants were identified from Entrez Gene on NCBI and by a PubMed literature search and added to the dataset after cross-examination. The information on the effect of the nsSNPs on enzyme activity and the correlation between the nsSNPs and disease/adverse drug reaction/toxicant intake were extracted from in vivo and in vitro experiments (e.g., recombinant enzyme analysis), according to the literature.

1.2 Prediction of the phenotype of nsSNPs in human phase II metabolizing enzyme genes

The effects of the variant amino acid substitutions on protein function were predicted for the identified UGT, GST, SULT, NAT, and MT nsSNPs using PolyPhen (http://genetics.bwh.harvard.edu/pph), Panther (http://www.pantherdb.org/tools/csnpScoreForm.jsp), and SNAP (http://cubic.bioc.columbia.edu/services/SNAP/submit.html).
The table detailing the above information is available upon request.

PolyPhen uses empirically derived rules, based on previous research in protein structure, interaction, and evolution, to predict automatically whether a replacement is likely to be deleterious for the protein on the basis of three-dimensional structure and multiple alignments of homologous sequences [19]. In this study, the PolyPhen input is a protein amino acid sequence together with the sequence position and the two amino acid variants characterizing the polymorphism. Generally, PolyPhen scores of 0–1.49 are classified as benign, 1.50–1.99 as possibly damaging, and ≥2 as probably damaging. In the concordance analysis, scores of 0–0.99 were classified as benign, 1–1.24 as borderline, and 1.25–1.49 as potentially damaging. There are exceptions, however: some predictions were presented as benign, possibly damaging, or probably damaging but without scores, and some prediction scores were below zero. These prediction results were therefore excluded from the concordance analysis between the functional consequences for nsSNPs predicted by two different algorithms.

The Panther program analyzes variants and provides subPSEC scores that range from −10 (most severe) to 0 (least severe) [20,21]. It then uses an HMM based on position-specific independent counts to convert these data into a probability score that the particular amino acid substitution is damaging. To convert the Panther results, a score from 0 to −3 is considered to be tolerated and a score from −3 to −10 is considered not tolerated. The value of −3 was chosen because it represents a probability of 0.5 of being damaging.

SNAP combines many sequence analysis tools in a battery of neural networks to predict the functional effects of nsSNPs [22]. For example, SNAP uses functional effects from SIFT [23], as well as conservation information from PSIC [24]. For each mutant, SNAP returns three values: the binary prediction (neutral/non-neutral), the RI (Reliability Index, range 0–9), and the expected accuracy, which estimates accuracy on a large dataset at the given RI (i.e., accuracy of test set predictions calculated for each neutral and non-neutral RI).
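The score cut-offs described above can be summarized in a short Python sketch. All function names are hypothetical. Several details are assumptions rather than statements from the text: intervals are treated as half-open (the quoted ranges leave values such as 1.495 unassigned), a subPSEC of exactly −3 is treated as tolerated, the concordance scheme reuses the general classes for scores of 1.5 and above, and the Panther probability is given a logistic form chosen so that −3 maps to 0.5, which is the only property the text states.

```python
import math
from typing import Optional


def classify_polyphen(score: float) -> str:
    """General PolyPhen classes; negative scores are rejected as in the text."""
    if score < 0:
        raise ValueError("negative PolyPhen scores were excluded from analysis")
    if score >= 2.0:
        return "probably damaging"
    if score >= 1.5:
        return "possibly damaging"
    return "benign"


def classify_polyphen_concordance(score: float) -> str:
    """Finer classes used for the concordance analysis (below 1.5 only)."""
    if score < 0:
        raise ValueError("negative PolyPhen scores were excluded from analysis")
    if score < 1.0:
        return "benign"
    if score < 1.25:
        return "borderline"
    if score < 1.5:
        return "potentially damaging"
    # Assumption: at 1.5 and above the general classes still apply.
    return classify_polyphen(score)


def panther_tolerated(sub_psec: float) -> bool:
    """True when the subPSEC score lies between 0 and -3 (inclusive here)."""
    return sub_psec >= -3.0


def panther_p_deleterious(sub_psec: float) -> float:
    """Assumed logistic mapping; equals 0.5 at subPSEC = -3."""
    return 1.0 - 1.0 / (1.0 + math.exp(-sub_psec - 3.0))


def usable_for_concordance(score: Optional[float]) -> bool:
    """Predictions without a score, or with a score below zero, are dropped."""
    return score is not None and score >= 0.0
```

A variant with a PolyPhen score of 1.3 and a subPSEC of −4.5, for instance, would be labelled "potentially damaging" in the concordance scheme and "not tolerated" by Panther under these assumptions.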
The RI and the expected accuracy correlate; when both are provided, the server chooses the one yielding better predictions.

1.3 Validation of the prediction results

We first searched PubMed using the respective gene names as keywords, and then downloaded from the NCBI the search results that had been recorded before April 2010. From this search, we obtained ~4000 papers that contained the gene names. We then curated the data manually and retrieved phenotype data that related to the nsSNPs. Different researchers double-checked all the articles. nsSNPs with experimental evidence of altered enzyme activity or disease association were regarded as deleterious. The phenotypic data were from both in vivo and in vitro studies, in which analysis of site-directed mutagenesis or enzymatic changes often provided direct evidence indicating the functional impact of nsSNPs. Prediction accuracy was analyzed according to the positive findings from these experiments. As a test of the ability of the PolyPhen, Panther, and SNAP algorithms to identify substitutions impacting enzymatic activity, scores were obtained and compared for the collected nsSNPs of human phase II metabolizing enzyme genes related to loss of enzyme activity and disease, based on experimental and clinical studies.

1.4 Statistical analysis

A chi-square test was used to compare percentages. Given that the three algorithms employ different approaches and also different datasets as foundations for their analysis, it is important to find the concordance of the three prediction tools on the functional consequences of each nsSNP prediction. Concordance analysis of each nsSNP predicted by PolyPhen, Panther, and SNAP was assessed using the linear correlation coefficient R. Prediction scores of each nsSNP were plotted on scatter graphs and analyzed using linear trend lines. P-values below 0.05 were considered statistically significant.

1.5 Evolutionary analyses

Whole-gene values for dN/dS (ω, the ratio of nucleotide substitution rates at nonsynonymous and synonymous sites) were calculated by SLAC [25]. Multiple codon alignments were made with RevTrans [26] and checked manually for proper alignment. Predictions of individual codons that evolved under positive selection were performed by maximum likelihood using the NSsites model comparisons in PAML [27]. To identify positive selection, a significant difference was required in log likelihoods, by chi-square testing, between the nested PAML models M8a and M8, and M7 and M8. Model M7 allows codons to have dN/dS values according to a beta distribution (two parameters). Model M8 is the same as M7 except that it adds a discrete category of dN/dS with dN/dS>1. Codons predicted to be under positive selection were identified by Bayes-Empirical-Bayes (BEB) analysis. Positively selected codons, i.e., rapidly evolving ones, should be predicted in both the M8a vs. M8 and the M7 vs. M8 model comparisons with posterior probabilities greater than 0.95.

As the Yang models are based on theoretical assumptions and ignore the empirical observation that distinct amino acids differ in their replacement rates, we also implemented the MEC model [28], which takes into account not only the transition-transversion bias and the nonsynonymous/synonymous ratio, but also the different amino acid replacement probabilities, as specified in empirical amino acid matrices. The LRT (Likelihood Ratio Test) is applicable only when two models are nested and thus is not suitable for comparing the MEC and M8a models; therefore, the second-order Akaike information criterion (AICc) was used for these comparisons [28–30]. Those sites that are most likely to be in the positive selection class (ω>1) are identified as likely selection targets.
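The two model-comparison criteria named above, the likelihood ratio test for nested models and AICc for non-nested ones, can be illustrated with a minimal Python sketch. The function names, the example log-likelihoods, and the degrees of freedom (2 for M7 vs. M8, 1 for M8a vs. M8) are assumptions made for illustration; the text does not report them. The AICc expression is the standard small-sample form, with n taken as the number of observations (for example, alignment length, which is also an assumption).

```python
import math


def chi2_sf(x: float, df: int) -> float:
    """Upper-tail chi-square probability, closed forms for df = 1 and 2."""
    if x < 0:
        raise ValueError("chi-square statistic must be non-negative")
    if df == 1:
        return math.erfc(math.sqrt(x / 2.0))
    if df == 2:
        return math.exp(-x / 2.0)
    raise ValueError("this sketch only handles df = 1 or 2")


def lrt_p_value(lnl_null: float, lnl_alt: float, df: int) -> float:
    """Likelihood ratio test: 2 * (lnL_alt - lnL_null) against chi-square."""
    stat = 2.0 * (lnl_alt - lnl_null)
    # A nested alternative cannot fit worse; clamp small numerical noise.
    return chi2_sf(max(stat, 0.0), df)


def aicc(lnl: float, k: int, n: int) -> float:
    """Second-order AIC for a model with k parameters and n observations."""
    if n - k - 1 <= 0:
        raise ValueError("AICc requires n > k + 1")
    return -2.0 * lnl + 2.0 * k + (2.0 * k * (k + 1)) / (n - k - 1)


# Hypothetical log-likelihoods for one gene alignment.
p_m7_m8 = lrt_p_value(lnl_null=-5230.4, lnl_alt=-5224.1, df=2)
p_m8a_m8 = lrt_p_value(lnl_null=-5226.9, lnl_alt=-5224.1, df=1)
positive_selection_supported = p_m7_m8 < 0.05 and p_m8a_m8 < 0.05
```

When two models are not nested, the one with the lower AICc is preferred; unlike the LRT, this comparison yields no P-value.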


The HyPhy software package (www.datamonkey.org) was also used to estimate positive selection [31]. To identify particular codons that evolved under positive selection, SLAC, FEL, and REL with the HKY85 or other optimal model for the respective set of sequences were used. GABranch [32] was used to estimate lineage-specific values of dN/dS.
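Returning to the statistical analysis of Section 1.4, the two calculations named there, the linear correlation coefficient R between the scores of two prediction tools and a chi-square comparison of two percentages, can be sketched in Python as follows. The function names and the example numbers are hypothetical, and the chi-square statistic is computed without a continuity correction, which is an assumption because the text does not say which variant was used.

```python
import math
from typing import Sequence, Tuple


def pearson_r(x: Sequence[float], y: Sequence[float]) -> float:
    """Linear (Pearson) correlation coefficient between paired scores."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("need two equally long sequences of length >= 2")
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        raise ValueError("correlation is undefined for constant input")
    return sxy / math.sqrt(sxx * syy)


def chi_square_2x2(a: int, b: int, c: int, d: int) -> Tuple[float, float]:
    """Chi-square statistic and P-value (df = 1) for the table [[a, b], [c, d]]."""
    n = a + b + c + d
    denom = (a + b) * (c + d) * (a + c) * (b + d)
    if denom == 0:
        raise ValueError("every row and column total must be non-zero")
    stat = n * (a * d - b * c) ** 2 / denom
    p_value = math.erfc(math.sqrt(stat / 2.0))
    return stat, p_value


# Hypothetical paired scores for five nsSNPs (PolyPhen vs. subPSEC).
polyphen_scores = [0.2, 0.9, 1.4, 1.8, 2.5]
sub_psec_scores = [-0.8, -2.1, -2.9, -4.0, -6.3]
r = pearson_r(polyphen_scores, sub_psec_scores)

# Hypothetical counts: damaging vs. benign calls in two gene families.
stat, p_value = chi_square_2x2(20, 10, 10, 20)
significant = p_value < 0.05
```

In this toy data set R is negative because more severe Panther scores are more negative while more severe PolyPhen scores are larger.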

2 Results

2.1 Validated nsSNPs of human phase II metabolizing enzyme genes

A total of 923 amino acid substitution variants were identified in the systematic screening of 104 human phase II metabolizing enzyme genes for the analysis of the potential impact of all nsSNPs. With development in bioinformatics and updated data, some previously reported SNPs in Ensembl and dbSNP have been identified as invalid by later studies because of sequencing and alignment errors. These incorrect SNPs have either been terminated or merged into other SNPs. We have cross-examined the databases and removed those invalid SNPs. No nsSNPs were identified in SULT1A3, SULT1C1, CHST2, CHST8, CHST11, HS3ST3A1, HS3ST3A2, HS3ST3B1, HS3ST3B2, and TPST1; therefore, they were not included in this study.

2.2 Prediction of functional effect of nsSNPs of human phase II metabolizing enzyme genes

In this study, PolyPhen, Panther, and SNAP predicted 921, 749, and 923 nsSNPs, respectively. 749 nsSNPs were predicted by both Panther and PolyPhen, while 921 nsSNPs were predicted by both SNAP and PolyPhen. These nsSNPs were used in the statistical concordance test.

Figure 1 Prediction results of nsSNPs of human phase II metabolizing enzyme genes. NS, not scored (not predicted); Neu, neutral; Non, non-neutral; others, other phase II enzymes.

As shown in Figure 1, 234 of 430 (54.4%) of identified UGT nsSNPs exhibited PolyPhen scores of ≥1.5 and were classified as “Damaging” variants by PolyPhen. 323 of 430 (75.1%) of identified UGT nsSNPs had prediction scores of