Digital Inventory of Arabidopsis Transcripts Revealed by 61 RNA Sequencing Samples [1][C][W]
Xiaoyong Sun [2]*, Qiuying Yang [2], Zhiping Deng, and Xinfu Ye
Agricultural Big-Data Research Center, College of Information Science and Engineering, Shandong Agricultural University, Taian, Shandong 271018, China (X.S.); Department of Physiology, University of Texas Southwestern Medical Center, Dallas, Texas 75235 (Q.Y.); State Key Laboratory Breeding Base for Zhejiang Sustainable Pest and Disease Control, Institute of Virology and Biotechnology, Zhejiang Academy of Agricultural Sciences, Hangzhou 310021, China (Z.D.); and Fruit Research Institute, Fujian Academy of Agricultural Sciences, Fuzhou, Fujian 350013, China (X.Y.)
ORCID ID: 0000-0002-2425-3526 (X.S.).
Alternative splicing is an essential biological process to generate proteome diversity and phenotypic complexity. Recent improvements in RNA sequencing accuracy and computational algorithms have provided unprecedented opportunities to examine the expression levels of Arabidopsis (Arabidopsis thaliana) transcripts. In this article, we analyzed 61 RNA sequencing samples from 10 totally independent studies of Arabidopsis and calculated the transcript expression levels in different tissues, treatments, developmental stages, and varieties. These data provide a comprehensive profile of Arabidopsis transcripts with single-base resolution. We quantified the expression levels of 40,745 transcripts annotated in The Arabidopsis Information Resource 10, comprising 73% common transcripts, 15% rare transcripts, and 12% nondetectable transcripts. In addition, we investigated diverse common transcripts in detail, including ubiquitous transcripts, dominant/subordinate transcripts, and switch transcripts, in terms of their expression and transcript ratio. Interestingly, alternative splicing was the highly enriched function for the genes related to dominant/subordinate transcripts and switch transcripts. In addition, motif analysis revealed that TC motifs were enriched in dominant transcripts but not in subordinate transcripts. These motifs were found to have a strong relationship with transcription factor activity. Our results shed light on the complexity of alternative splicing and the diversity of the contributing factors.
[1] This work was supported by Shandong Agricultural University (start-up grant to X.S.), the Ministry of Science and Technology of China (973 Program grant no. 2013CB127101), and the Zhejiang Natural Science Foundation (grant no. LR12C02002 to Z.D.).
[2] These authors contributed equally to the article.
* Address correspondence to [email protected].
The author responsible for distribution of materials integral to the findings presented in this article in accordance with the policy described in the Instructions for Authors (www.plantphysiol.org) is: Xiaoyong Sun ([email protected]).
[C] Some figures in this article are displayed in color online but in black and white in the print edition.
[W] The online version of this article contains Web-only data.
www.plantphysiol.org/cgi/doi/10.1104/pp.114.241604
Plant Physiology®, October 2014, Vol. 166, pp. 869–878, www.plantphysiol.org © 2014 American Society of Plant Biologists. All Rights Reserved.
Alternative splicing is a crucial process that not only produces proteome diversity but also functions as an evolutionary force to produce phenotypic complexity. It has been reported that 92% to 94% of human genes (Wang et al., 2008) and 61% of Arabidopsis (Arabidopsis thaliana) genes (Marquez et al., 2012) are alternatively spliced. Some human genes, such as Myelin Oligodendrocyte Glycoprotein, have more than 70 isoforms. By alternatively selecting splice sites, removing introns, and retaining exons, the cell yields a plethora of isoforms and thus generates protein complexity, which impacts developmental processes and environmental and stress responses. Because it is a fundamental biological process, alternative splicing has been implicated in many human diseases and physiological processes, including cancer, Parkinson’s disease, heart disease, cardiovascular disease, blood coagulation, and cholesterol homeostasis (Maugeri et al., 1999; Hui et al., 2004; Wang et al., 2008; Marquez et al., 2012).
At present, RNA sequencing (RNA-seq) provides unprecedented details of the transcriptome with single-base resolution. Wang et al. (2008) reported the existence of transcripts with switch-like splicing regulation (i.e. the isoform ratio in the same gene can be switched in different tissues, thus resulting in a substantial change in protein production). Those authors utilized deep sequencing technology to explore the relative expression values of skipped exons. However, the relative expression and percentage of the isoforms generated from alternative splicing remained unclear. The main computational challenge for this question comes from the method used to deconvolute the expression of transcripts based on 50- or 100-bp short reads. Since 2006, many algorithms have been developed to address this task (Stanke et al., 2006; Zerbino and Birney, 2008; Trapnell et al., 2010; Schulz et al., 2012; Mezlini et al., 2013). Recently, Merkin et al. (2012) performed quantitative reverse transcription-PCR to measure isoform expression in three mouse tissues and used this tool as a reference. By analyzing RNA-seq data using Cufflinks, they provided strong evidence that the results from Cufflinks are highly consistent with the results from quantitative reverse transcription-PCR (r = 0.9). Improvements in sequencing accuracy and computational algorithms have provided opportunities for the community to further investigate this issue.
More recently, based on high-resolution and high-depth RNA-seq, Gonzàlez-Porta et al. (2013) utilized computational approaches to investigate two different data sets, one from the Illumina body map and the other containing nucleus and cytoplasm information. By quantifying and deconvoluting these isoforms, they showed that dominant transcripts (i.e. transcripts with a substantially higher expression level than the other transcripts of the same gene) exist widely in human cells. These findings raise the following questions: Do these dominant transcripts exist in plants? What other types of transcripts can be distinguished based on expression level? What are the biological functions of these transcripts? These unanswered questions motivated us to search, screen, and quantify all the transcripts from Arabidopsis.
Here, we explored 61 Arabidopsis RNA-seq samples from 10 independent projects to computationally quantify and deconvolute all isoforms using Cufflinks, resulting in a comprehensive and digital transcript catalog categorized by expression levels. This inventory will help the community to gain a deeper understanding of the complexity and diversity of alternative splicing and related protein functions. In particular, we discuss common transcripts, rare transcripts, and nondetectable transcripts in terms of their expression levels. Additionally, we identified dominant transcripts, ubiquitous transcripts, and switch transcripts (transcripts with a switch event) and further explored their biological and sequence properties by Gene Ontology analysis and motif analysis. Interestingly, all the genes with switch events were also found to harbor dominant transcripts.
Our results show that the genes related to dominant transcripts and switch transcripts are both involved in alternative splicing.
RESULTS
Genes and Transcripts in Arabidopsis
Based on The Arabidopsis Information Resource (TAIR) 10 annotation, Arabidopsis has 32,678 genes, including protein-coding genes, microRNAs, ribosomal RNAs, tRNAs, small nuclear RNAs, small nucleolar RNAs, and some other RNAs. Many genes have only one transcript, while some genes have multiple isoforms. To investigate the relationship between transcripts and genes, we defined genes based on the number of annotated transcripts as one-transcript genes or single-transcript genes, two-transcript genes, etc. As such, we observed 26,795 one-transcript genes, 4,316 two-transcript genes, 1,144 three-transcript genes, 293 four-transcript genes, 90 five-transcript genes, 26 six-transcript genes, seven seven-transcript genes, five eight-transcript genes, one nine-transcript gene, and one 10-transcript gene (Supplemental Fig. S1). We found that genes with more transcripts have more diverse functions by the following two comparisons (Supplemental Method S1): (1) the number of Gene Ontology groups from genes with two transcripts versus that from genes with only one transcript (Mann-Whitney test, P = 2.2e-16), and (2) the number of Gene Ontology groups from genes with three or more transcripts versus that from genes with two transcripts (Mann-Whitney test, P = 2.644e-07). Interestingly, AT1G43170 and AT4G32850 have nine and 10 transcripts, respectively. In the context of Gene Ontology, AT1G43170 functions as a structural constituent of ribosomes and participates in RNA methylation, embryo development ending in seed dormancy, and translation, so it is not surprising to see so many different isoforms for this gene. Similarly, AT4G32850 is located in the nucleus and is involved in many essential biological processes, including RNA 3′ end processing, RNA polyadenylation, and transcription (Seya et al., 1999).
We also investigated the number of the transcripts in each sample calculated with different expression thresholds (Fig. 1). Most samples had transcripts with expression around 1 to 106 FPKM (fragments per kilobase of transcript per million mapped reads), and the average number of detectable transcripts was around 22,000. Following many previous studies (Hebenstreit et al., 2011; Vogel and Marcotte, 2012; Fagerberg et al., 2013; Gonzàlez-Porta et al., 2013), we used 1 FPKM as the threshold for expressed transcripts. The number of transcripts in different tissues is shown in Supplemental Table S1. Almost all of these transcripts come from protein-coding genes with 1 FPKM as the detectable threshold. In addition, TAIR 10 annotation describes 40,745 transcripts with 215,908 exons and states that 18% of all Arabidopsis genes undergo alternative splicing (Lamesch et al., 2012), which is different from the results (61%) reported by Marquez et al. (2012; Table I; Supplemental Table S2; Supplemental Fig. S2). However, because most of the transcripts described in that work do not have related functional annotations or descriptions, in the following analysis we only focused on TAIR 10 annotation to discuss the transcripts.
Figure 1. Number of transcripts per sample with different expression thresholds. It is clear that FPKM 1 is the threshold for the expressed transcripts. The average number of detectable transcripts was 22,567 ± 1,007 (55% of all annotated transcripts). [See online article for color version of this figure.]
To analyze the transcripts based on the expression level, we divided the transcripts into the following categories (Fig. 2). (1) Common transcripts, which have expression in more than 10 samples out of 61 samples (15% or more of samples in at least two independent studies). We found that 29,654 transcripts from 25,216 genes were detected as being expressed based on the FPKM value. Interestingly, there were 4,940 transcripts existing in all 61 samples, which we called ubiquitous transcripts. In terms of the transcript number per gene, there were two kinds of transcripts: transcripts that did not have alternative isoforms with expression and transcripts that had alternative isoforms with expression. Notably, 21,494 genes belonged to the former category, of which 19,505 were one-transcript genes; 3,722 genes were in the latter category. Finally, we found 917 transcripts identified to have the switch event; thus, we called this kind of transcript switch transcripts. (2) Rare transcripts, which have detectable expression in 10 samples or fewer (less than 15% of samples).
A total of 6,302 transcripts from 5,995 genes were found to be rarely expressed. (3) Nondetectable transcripts, which had no expression in any of the samples in this study (0% of samples). We found that 4,789 transcripts from 4,574 genes belonged to this category. (4) Novel transcripts, which were not reported in TAIR 10.
Table I. Alternative splicing summary for TAIR 9, Marquez et al. (2012), and TAIR 10 and RNA-seq (this study) based on gene annotation (General Feature Format file)
Genes                                                     TAIR 9    Marquez et al. (2012)    TAIR 10 and RNA-seq
One-transcript genes                                      27,970    11,033                   26,795
Multiple-transcript genes                                  4,622    12,872                    5,883
Alternative splicing genes                                 4,622    12,872                    5,883
Intron-containing genes/multiexonic genes                 22,190    19,514                   22,307
All genes                                                 32,592    23,905                   32,678
Percentage of alternative splicing in multiexonic genes       21        66                       26
Percentage of alternative splicing in all genes               14        54                       18
Common Transcripts from One-Transcript Genes
Eighty-two percent (26,795) of all annotated genes were one-transcript genes, suggesting that most genes do not have alternative splicing. Using 1 FPKM as the threshold value for expression, at the gene level, we found 21,494 genes with only one detectable transcript, 91% from one-transcript genes and 9% from genes with more than one transcript. At the transcript level, 53% of all transcripts (40,745) came from one-transcript genes. Notably, the transcripts from one-transcript genes had an expression profile with less variation than those from multiple-transcript genes (Mann-Whitney test, P = 5.61e-13), suggesting that the genes with more transcripts and more functions have much broader expression profiles. Specifically, the former group had expression ranging from five (25%) to 15 (75%) with a few exceptions, while the latter group had expression with a wider range, from zero (25%) to 15 (75%; Supplemental Fig. S3).
Previously, 29 genes were reported as reference genes with low variation (Czechowski et al., 2005). Of these, 14 genes had only one transcript, 10 genes had two transcripts, four genes had three transcripts, and one gene had four transcripts. Twenty-eight genes (28 of 29 = 97%) had one common transcript across all 61 samples, while only one gene, a one-transcript gene (AT1G58050), had a rare transcript. Supplemental Figure S4 shows that the transcripts from these reference genes had higher expression than the other transcripts (Mann-Whitney test, P = 2.2e-16).
Common Transcripts from Multiple-Transcript Genes
Based on the definition of common transcripts, we found that 8,160 common transcripts came from 3,722 multiple-transcript genes, including 4,946 (61%) from two-transcript genes, 2,086 (26%) from three-transcript genes, 745 (9%) from four-transcript genes, 234 (3%) from five-transcript genes, 92 (1.1%) from six-transcript genes, 29 (0.4%) from seven-transcript genes, 17 (0.2%) from eight-transcript genes, four (0.05%) from nine-transcript genes, and seven (0.09%) from 10-transcript genes. We then explored the expression differences between the two common-transcript categories (i.e. single detectable transcripts in a single gene and multiple transcripts in a single gene). The left part of Supplemental Figure S5 shows that there was no remarkable difference in expression level between these two kinds of transcripts, even though the difference was statistically significant (Mann-Whitney test, P = 2.2e-16) and the multiple detectable transcripts showed greater variation. In addition, we extracted all the dominant and subordinate transcripts from each gene and compared their expression variation across all samples. Notably, the expression variation of the dominant transcripts was similar to that of the subordinate transcripts (right part of Supplemental Fig. S5), although the general expression level of the former was higher than that of the latter (Supplemental Fig. S6; Mann-Whitney test, P = 2.2e-16).
Figure 2. Hierarchy tree for transcripts. A, The transcripts include common transcripts, rare transcripts, and nondetectable transcripts. Common transcripts can be divided into the transcripts from single-transcript genes and the transcripts from multiple-transcript genes. The dominant transcripts and the subordinate transcripts are defined based on the following criteria: if the transcript had the highest expression among all the transcripts from the same gene and the ratio of this transcript to the transcript with the second highest expression was greater than 2, this transcript is the dominant transcript, and the second most abundant transcript from this gene is considered to be the subordinate transcript. The ubiquitous transcripts are those transcripts that are expressed in all samples. B, Percentage of diverse transcripts based on the 61 RNA-seq samples.
Identification of Dominant and Subordinate Transcripts in 61 Samples
Following Gonzàlez-Porta et al. (2013), we considered a transcript as the dominant transcript if this transcript had the highest expression among all the transcripts from the same gene and the ratio of this transcript to the transcript with the second highest expression was greater than 2; the second most abundant transcript was considered the subordinate transcript. Most dominant transcripts came from two-transcript genes and three-transcript genes (57% and 27%, respectively), while only 16% of the dominant transcripts were from genes with more than three transcripts. Similarly, 58%, 26%, and 16% of all the subordinate transcripts were from two-transcript genes, three-transcript genes, and genes with more than three transcripts, respectively. A total of 1,860 transcripts can function as both the dominant transcript and the subordinate transcript.
Based on the ratio of the expression level of the dominant transcripts to that of the subordinate transcripts with the highest expression in the same gene (Supplemental Table S3), we further divided the dominant transcripts into the following categories. (1) Strongly dominant transcripts (ratio > 5); 4,315 transcripts were found to be in this category, with 4,246 matching subordinate transcripts. Specifically, 2,636 genes had only one subordinate transcript, 621 genes had two subordinate transcripts, and 99 genes had three subordinate transcripts. (2) Weakly dominant transcripts (2 < ratio < 5); 754 transcripts were found to be weakly dominant transcripts and 811 were subordinate transcripts. In particular, these subordinate transcripts were from 619 and 96 genes with one and two subordinate transcripts, respectively. To reveal the differences between the dominant transcripts and the subordinate transcripts, we only focused on the strongly dominant transcripts in the following analysis.
We also compared the ratio from the top two common transcripts from the same multiple-transcript genes. Notably, the transcript ratios from the two-transcript genes had the greatest variation while those from the 10-transcript genes had the smallest variation (Supplemental Fig. S7). One interesting finding was that 56% of all the ratios were higher than 2, which is a bit lower than the 79% found in humans (Gonzàlez-Porta et al., 2013).
Differences between Dominant Transcripts and Subordinate Transcripts
We found 4,315 strongly dominant transcripts from 3,373 genes across all 61 samples. To reveal the differences between dominant transcripts and subordinate transcripts, we selected those dominant transcripts (596) and subordinate transcripts (598) identified consistently in more than 50 samples (more than 75% of total samples) and compared the sequence differences of these two alternative isoforms. Interestingly, we found that 75% of the selected dominant transcripts had only one or two regions of insertions/deletions (indels) different from the subordinate transcripts, and most of the differences (69%) were in either the first exon or the last exon. To test the statistical significance of the indel difference, we selected those genes that have two annotated transcripts but do not have dominant transcripts. We then performed the Mann-Whitney test and compared the indel numbers from the genes with dominant transcripts with those from the genes without dominant transcripts selected previously. Notably, the indel difference was not statistically significant (Supplemental Method S2). In addition, the average exon length of the dominant transcripts was a bit shorter than that of the subordinate transcripts, with no statistical significance. With regard to exon number as well as the average and SD of intron length, there were no differences between the selected dominant transcripts and matching subordinate transcripts.
In the context of alternative splicing, we also compared the differences in splicing types between the dominant and subordinate transcripts. Interestingly, most differences (i.e. 377 cases [63% of all splicing cases]) came from the alternative first exon, while 231, 162, 104, 75, and 37 cases resulted from intron retention, alternative 3′ sites, alternative 5′ sites, alternative last exons, and exon skipping, respectively. We also found that 20% of all the selected dominant transcripts were short transcripts (i.e. transcripts with three or fewer exons), and most of these dominant short transcripts had similar numbers of exons to their subordinate partners.
To gain a better understanding of the differences between dominant transcripts and subordinate transcripts, we performed motif analysis at both the gene and transcript levels using MEME with discriminative motif discovery, as described in “Materials and Methods.” One motif pattern is shown in Figure 3. These motifs were enriched with TC elements and were consistently found in all comparisons at the gene and transcript levels. In the following GOMO analysis, these motifs were found to have a strong relationship with transcription factor activity (GO:0003700; P < 0.05 adjusted for multiple comparisons), suggesting that differences in transcription factor activity may underlie the expression difference between the dominant transcripts and the subordinate transcripts.
Finally, we performed Gene Ontology analysis on the genes that had dominant transcripts. We found that genes containing dominant transcripts were enriched in the following functional groups: alternative splicing (false discovery rate [FDR]-adjusted P = 5.5E-34), splice variant (FDR-adjusted P = 1.7E-10), plastid (FDR-adjusted P = 1.0E-3), and chloroplast (FDR-adjusted P = 1.8E-3).
Ubiquitous Transcripts
Ubiquitous transcripts are those transcripts that are always expressed. Using this criterion, we identified 4,940 ubiquitous transcripts in all 61 samples. These ubiquitous transcripts belonged to 4,921 genes, including single-transcript genes and multiple-transcript genes. Supplemental Table S4 shows that most of the ubiquitous transcripts had FPKM > 1,000, and almost all the related genes were protein-coding genes. We also investigated the expression levels of ubiquitous transcripts (Supplemental Fig. S8). Interestingly, the FPKM values of this kind of transcript showed quite stable expression across all samples, although the samples from projects 1 and 6 had slightly different expression profiles. Almost all the transcripts had FPKM values above 5 (12.9 ± 1.3).
Figure 3. TC motifs inferred from MEME with discriminative motif discovery. A, Motif in the dominant transcripts but not in the subordinate transcripts. B, Motif in the 500 bases upstream of the dominant transcripts but not in that of the subordinate transcripts. C, Motif in the gene with dominant transcripts but not in the gene without dominant transcripts. D, Motif in the 500 bases upstream of the gene with dominant transcripts but not in that of the gene without dominant transcripts. The TC motifs are found to have a strong relationship with transcription factor activity. [See online article for color version of this figure.]
Switch Transcripts
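Switch events are defined in “Materials and Methods”; the reading assumed in the following minimal Python sketch is that the log ratio of two transcripts from the same gene changes sign between samples. The function name, the requirement that both transcripts reach the 1 FPKM detection threshold in a sample, and the optional `min_abs_log_ratio` margin are assumptions for illustration, not the procedure used in this study.

```python
import math
from typing import Sequence


def has_switch_event(
    fpkm_a: Sequence[float],
    fpkm_b: Sequence[float],
    threshold: float = 1.0,          # detection threshold used in this article
    min_abs_log_ratio: float = 0.0,  # hypothetical margin; 0 means any sign change
) -> bool:
    """Return True if log2(a/b) is positive in at least one sample and
    negative in at least one other sample."""
    signs = set()
    for a, b in zip(fpkm_a, fpkm_b):
        # Assumption: samples where either transcript is undetected are skipped.
        if a < threshold or b < threshold:
            continue
        log_ratio = math.log2(a / b)
        if log_ratio > min_abs_log_ratio:
            signs.add(1)
        elif log_ratio < -min_abs_log_ratio:
            signs.add(-1)
    return signs == {1, -1}


# Hypothetical gene with two transcripts measured in two samples:
# transcript a is higher in the first sample, transcript b in the second.
print(has_switch_event([8.0, 2.0], [2.0, 8.0]))
```

Raising `min_abs_log_ratio` restricts the call to pairs whose ratio changes sign by a larger margin, which may be closer to the "drastic" changes described in this section.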
The log ratios of some transcripts from the same gene changed drastically under different conditions (i.e. from positive to negative or from negative to positive). These cases are the switch events defined in “Materials and Methods.” In the 61 samples, we found 812 genes exhibiting switch events. Supplemental Figure S9 shows the transcript ratio switch, suggesting that the range of most ratios was from −5 to 5. Gene functional analysis (Supplemental Fig. S10) revealed that the following functional groups were statistically significant: alternative splicing (FDR-adjusted P = 2.1E-40), splice variants (FDR-adjusted P = 4.3E-26), regulation of transcription (FDR-adjusted P = 2.4E-3), DNA binding (FDR-adjusted P = 4.0E-3), etc. These results suggest that the switch transcripts had a strong correlation with transcription itself. The frequent ratio switch of these transcripts contributes to the diversity of the proteome, directly leading to complex biological functions. Another notable issue is that 69 genes had multiple transcripts with more than one switch event. These genes were also related to transcription, transcription regulation, alternative splicing, nucleus, and transcription factor. Six genes (AT2G17442, AT2G32700, AT5G06440, AT5G28020, AT5G44290, and AT5G47455) were involved in three switch events, and all of them except AT5G28020 are from the mitochondrion or chloroplast.
Rare, Nondetectable, and Novel Transcripts
We found 6,302 transcripts from 5,995 genes with expression in 10 or fewer samples and 4,789 transcripts from 4,574 genes with no expression at all, indicating that 18% of the annotated genes harbored rare transcripts and 14% of the annotated genes were not expressed in the 61 samples from 10 Arabidopsis projects. Specifically, 36% of the rare transcripts were generated from those genes that had multiple transcripts and 64% of the rare transcripts came from one-transcript genes. In addition, compared with ubiquitous transcripts, rare transcripts had lower exon numbers (Mann-Whitney test, P = 2.2e-16) and shorter exon lengths (Mann-Whitney test, P = 2.2e-16; Supplemental Fig. S11).
Interestingly, rare transcripts had comparatively high expression, ranging from 8 to 17 FPKM (Supplemental Fig. S12). We selected genes containing rare transcripts and performed Gene Ontology analysis. The following functional groups were found to be enriched with statistical significance (FDR-adjusted P < 0.05): base pairing, base pairing with RNA, triplet codon-amino acid adaptor activity, molecular adaptor activity, defense response to fungus, plant defense, gene silencing by microRNA, noncoding RNA processing, etc. Similarly, we also performed Gene Ontology analysis for nondetectable genes, and the enriched functional groups were similar to those for genes containing rare transcripts.
To screen the novel transcripts, we searched for the transcripts marked as class code j from Cufflinks. A total of 5,987 genes were found to have novel transcripts, which were observed in more than 10 samples. Remarkably, 93% of these novel transcripts were related to common transcripts, which may be the products of splicing noise during alternative splicing. We also utilized the SplicingTypesAnno package to screen for novel junctions involved in five types of alternative splicing: intron retention, exon skipping, alternative donor sites, alternative acceptor sites, and both alternative sites. The results showed that 88% of novel splicing was also identified by SplicingTypesAnno. Another notable issue is the number of splicing genes. There were 21,252 one-transcript genes based on TAIR 10 annotation. In these samples, we found that 20% of these one-transcript genes had alternative splicing events. Compared with the expression level from common transcripts, the lower 20% of the novel transcripts had higher expression levels, while the other 80% of the novel transcripts had slightly lower expression, suggesting that the novel transcripts had expression profiles with less variation (Supplemental Fig. S13).
The expression level of the novel transcripts was around 11.4 ± 1.1 (log-transformed FPKM; Supplemental Fig. S14). The significant functional groups inferred from genes with novel transcripts included nucleotide binding (FDR-adjusted P = 1.3E-46), alternative splicing (FDR-adjusted P = 3.0E-24), DNA metabolic processes (FDR-adjusted P = 1.3E-19), DNA repair (FDR-adjusted P = 7.8E-16), etc.
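The expression-based categories used in this article (common, ubiquitous, rare, and nondetectable) follow from counting, for each transcript, the samples in which it is detected. The minimal Python sketch below illustrates that bookkeeping under stated assumptions: detection is taken as FPKM of at least 1, ubiquitous transcripts are reported as a labeled subset of common transcripts, and the additional requirement that common transcripts appear in at least two independent studies is not modeled. The function name and transcript identifiers are hypothetical.

```python
from typing import Dict, List


def categorize_transcripts(
    fpkm: Dict[str, List[float]],
    threshold: float = 1.0,  # 1 FPKM detection threshold used in this article
    rare_max: int = 10,      # rare: detected in 10 samples or fewer
) -> Dict[str, str]:
    """Map each transcript ID to an expression category based on the
    number of samples in which it is detected."""
    categories = {}
    for transcript_id, values in fpkm.items():
        # Assumption: "detected" means FPKM >= threshold.
        n_detected = sum(1 for v in values if v >= threshold)
        if n_detected == 0:
            category = "nondetectable"
        elif n_detected <= rare_max:
            category = "rare"
        elif n_detected == len(values):
            category = "common (ubiquitous)"  # detected in every sample
        else:
            category = "common"
        categories[transcript_id] = category
    return categories


# Hypothetical transcripts measured in 61 samples.
example = {
    "t_all": [5.0] * 61,
    "t_some": [5.0] * 11 + [0.0] * 50,
    "t_few": [5.0] * 10 + [0.0] * 51,
    "t_none": [0.5] * 61,
}
print(categorize_transcripts(example))
```

Novel transcripts are not covered by this sketch, because they are identified from the Cufflinks class code rather than from expression counts.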
Overlapping Functions of Transcript and Gene
To investigate the transcripts discussed above, we systematically applied the Gene Ontology analysis to those genes related to dominant transcripts, switch transcripts, and ubiquitous transcripts and analyzed the overlapping and nonoverlapping functional groups (Supplemental Fig. S15). Most importantly, we found that the genes harboring switch transcripts also have dominant transcripts. All three transcript groups are related to alternative splicing and splice variant functional groups with statistical significance. In addition, switch transcripts and dominant transcripts are also involved with regulation of transcription, transcription regulator activity, and transcription factor activity, while ubiquitous transcripts participate in many ribosome-related functions.
Project Summary
In 10 independent projects including 61 samples, we investigated several tissues, including leaves (11 samples), flowers (24 samples), roots (eight samples), shoots (two samples), siliques (one sample), seedlings (four samples), and whole plants (11 samples). To understand the relationship between samples, we clustered all the samples based on isoform expression (Supplemental Fig. S16). Unsurprisingly, most samples from the same tissues were clustered together. However, some project bias was observed: the samples from the same projects tended to be clustered together. In particular, the flower samples from projects 3 and 5 were separated in the dendrogram.
DISCUSSION
In this article, we analyzed 61 RNA-seq samples from 10 independent studies on Arabidopsis and calculated the transcript expression levels in different tissues. These data provide a comprehensive profile of the Arabidopsis transcripts with single-base resolution. We quantified the expression levels of 40,745 transcripts annotated in TAIR 10, comprising 73% common transcripts, 15% rare transcripts, and 12% nondetectable transcripts. In addition, we investigated diverse common transcripts in detail, including ubiquitous transcripts, dominant/subordinate transcripts, and switch transcripts, in terms of existence and the transcript ratio. Interestingly, the genes related to dominant/subordinate transcripts and switch transcripts were involved in alternative splicing. Our results shed light on the expression level of the Arabidopsis transcriptome as well as the transcript ratio determined by the splicing mechanism.
Dominant transcripts in human genes have been studied in a recent work by Gonzàlez-Porta et al. (2013). In this study, we found that 59% of all the transcript ratios were greater than 2, which is less than the value of 79% identified in the human genome. We performed the comparison of indel numbers and exon/intron length between dominant transcripts and subordinate transcripts, but the results were not statistically significant. We also compared dominant transcripts with subordinate transcripts using the discriminative motif discovery algorithm from MEME. Interestingly, TC motifs were found to be statistically significant, suggesting a strong relationship with transcription factor activity (GO:0003700). TC motifs were reported previously (Bernard et al., 2010) as alternative motifs to the TATA box, which participates in transcription regulation. In addition, Pickrell et al. (2010) suggested that, during information transfer from DNA to RNA, there is a set of probability rules that cells follow. As such, dominant transcripts and subordinate transcripts may just be the products of the probability distribution determined by the quantity and relative ratio of splicing factors, binding to either TC motifs or the TATA box. The overexpression or underexpression of these splicing-related transcripts may have significant biological functions (Jensen et al., 2014). These sequence differences will provide a starting point for further experimental validation and hypothesis generation.
Transcripts with switch events were reported by Wang et al. (2008) using RNA-seq data. They found that genes involved in switch events come from the following functional groups: developmental process, cell communication, signal transduction, and regulation of metabolism. Interestingly, in our work, the genes involved in switch events were enriched in the alternative splicing, splice variant, regulation of transcription, and DNA binding categories. In the study of Wang et al. (2008), 15 human tissues were sequenced simultaneously, while our data were generated from 10 totally independent studies, differing in tissue, treatment, developmental stage, variety, and research goals. The differences in the enriched functional groups may come from the heterogeneity of the samples, revealing that the complexity of splicing mechanisms is affected not only by tissue but also by many different cofactors.
It is straightforward to assume that the biological significance has a positive correlation with the expression level of the transcript. Several lines of evidence support the notion that the expression level impacts the plasticity of the phenotype (Dal Santo et al., 2013; Grzeskowiak et al., 2013). Pickrell et al. (2010) argued that low-expression transcripts are just by-products of splicing errors.
In Figure 1, most transcripts with detectable expression had values greater than 10, indicating that the low-expression transcripts are a very small fraction of the whole transcriptome. However, the biological significance of low-expression transcripts remains unclear. Long noncoding RNAs with low expression are known to have nonnegligible biological functions (Gupta et al., 2010), suggesting that the expression level may not be the only criterion for determining biological importance. Recent work by Khan et al. (2013) confirmed that significant changes in the transcript levels of some genes do not lead to similar differences in protein levels, indicating that transcript expression and protein expression can be uncoupled. More experiments and techniques are needed to provide answers about the biological significance of low-expression transcripts.
Another interesting finding is that 12% of all the annotated transcripts did not have any detectable expression in the 61 samples. A recent study by Jüschke et al. (2013) reported similar results in Drosophila spp.: about 82% of all Kyoto Encyclopedia of Genes and Genomes annotated transcripts were expressed in fly heads, while 18% were not expressed. This concordance suggests that there are some as yet unidentified important factors that may control the expression of these transcripts.
This study has some limitations. Sixty-one RNA-seq samples were selected from 10 independent studies. Although we found that most of the transcripts are shared by four tissues (more than 96% of the protein-coding transcripts; Supplemental Table S4), the unbalanced sample size from diverse tissues may produce biases with regard to tissue expression.
In conclusion, the diverse categories of transcripts reflect the existence of different and matching splicing mechanisms: some splicing mechanisms exist in all samples, and some splicing mechanisms are activated only by specific environmental or developmental factors. The digital inventory of Arabidopsis transcripts reveals the complexity of alternative splicing and the diversity of contributing factors.
MATERIALS AND METHODS
Data Set
The Sequence Read Archive from the National Center for Biotechnology Information was searched with two keywords, “Arabidopsis” and “RNA-seq,” on July 1, 2013. The results were downloaded as File format. Only those projects using the Illumina platform were selected and ranked by the release date. Then, six independent studies (accession nos. SRP013631, SRP012576, SRP010481, SRP011086, SRP012587, and ERP002617) were randomly chosen from the pool. In addition, four independent studies (SRP009136, SRP007763, SRP007845, and SRP035234), which focused on alternative splicing, were also included in this study. Thus, 61 samples from 10 independent studies were selected for the analysis. These RNA-seq samples were generated using the Illumina platform and cover various tissues (siliques, flowers, leaves, roots, and whole plants). Detailed information about the projects is available in Supplemental Table S5.
Alignment
RNA-seq data sets were aligned with TopHat version 2.0.6 (Trapnell et al., 2009) using the following parameter: min-intron-length 20, which accommodates the small introns reported in Arabidopsis (http://www.plantgdb.org/ASIP/pub/Mini_Intron.php). To avoid alignment bias, we set the anchor length of TopHat (i.e. reads with at least this many bases on each side of the junction) at eight nucleotides and allowed no mismatches in these regions. After alignment, SAMtools was used to sort and index the bam files (Li et al., 2009). TAIR 10 was used as the reference annotation.
Quantification
Cufflinks version 2.0.2 was used to quantify the known transcripts with the upper-quartile-norm and GTF-guide options. GTF-guide quantifies both known transcripts and novel isoforms. Isoforms with an FPKM status of Fail, for which the software had problems with deconvolution of the isoform expression, were marked as “no value” and removed from the subsequent analysis to avoid ambiguity. Cuffcompare was then used to compare the transfrags with the reference annotation. The novel isoforms were detected by the j class code.
Splicing Junction Detection
The Bioconductor packages GenomicRanges and IRanges and the R package SplicingTypesAnno (http://sourceforge.net/projects/splicingtypes/) were used to detect the junction reads from the bam files generated in the previous step. The whole pipeline accepted the bam files and output an html report for all alternative splicing types, including the details of the novel and known splicing junctions. To avoid PCR artifacts, we set the parameter minReadCounts to 10, requiring a minimum read count of 10 to accept a splicing junction as valid. The final output for the splicing junction sites was imported into R for further analysis.
Transcript Comparison
All transcript comparisons were done in R. The Bioconductor package Rsamtools was used to read bam files; the IRanges package was used to compare the transcripts based on the genomic coordinates; the Biostrings package was used to extract the related sequences for downstream analysis; the R package SplicingTypesAnno (previously developed by our group) was used to extract the exon, intron, and splicing information. In addition, the figures were generated with the R packages ggplot2 and lattice. In the analysis, a transcript with FPKM < 1 was treated as not expressed (Hebenstreit et al., 2011; Vogel and Marcotte, 2012; Fagerberg et al., 2013).
One-Transcript Genes and Multiple-Transcript Genes
Based on the number of annotated transcripts, we defined the genes as one-transcript genes, two-transcript genes, etc. One-transcript genes are genes that do not have alternative splicing based on the current annotation and produce a single transcript during transcription. Multiple-transcript genes are genes that experience alternative splicing and generate a few isoforms. The number of transcripts in one gene is not equal to the number of alternative splicing events.
Dominant Transcripts and Subordinate Transcripts
We followed the definition of Gonzàlez-Porta et al. (2013) to define dominant transcripts. If a gene has $n$ transcripts $(A_1, A_2, \ldots, A_n)$ ordered by expression from low to high, the ratio of the dominant to the subordinate transcript is the ratio of $A_n$ to $A_{n-1}$. If the ratio is greater than 2, then the transcript $A_n$ is considered the dominant transcript and $A_{n-1}$ is considered the matching subordinate transcript.
Transcript Switch Ratio
We followed Gonzàlez-Porta et al. (2013) to define the switch transcripts based on the ratio of two transcripts from the same gene. Specifically,
\[
\mathrm{ratio} =
\begin{cases}
\dfrac{\text{trans\_a}}{\text{trans\_b}}, & \text{if } \text{trans\_a} > \text{trans\_b},\\[2ex]
-\dfrac{\text{trans\_b}}{\text{trans\_a}}, & \text{otherwise},
\end{cases}
\]
where trans_a and trans_b are the two transcripts from the same gene. We calculated $\mathrm{ratio}_i$ for sample $i$ and $\mathrm{ratio}_j$ for sample $j$. If the transcript ratios from two samples met either one of the following requirements, $\mathrm{ratio}_i > 2$ and $\mathrm{ratio}_j < -2$, or $\mathrm{ratio}_i < -2$ and $\mathrm{ratio}_j > 2$, we marked these transcripts as switch transcripts.
Gene Ontology Analysis
The network analysis was done with DAVID (Huang et al., 2009). The input data included genes containing the targeted transcripts selected with expression levels or transcript ratios. Genes were mapped to the DAVID pathway knowledge base in terms of TAIR identifiers. The gene functional groups with FDR-adjusted P < 0.05 were further explored based on the biological information provided by DAVID.
Motif Analysis
The motif analysis was performed using the discriminative motif discovery algorithm from MEME (Bailey and Elkan, 1994). The dominant transcripts were submitted to the MEME Web site as the input sequences, and the subordinate transcripts were treated as negative sequences. The default parameters were chosen for the following analysis. The motifs found in MEME were further analyzed by GOMO (Buske et al., 2010), which is designed to identify the Gene Ontology groups related to these targeted motifs. The parameter Category was set to multiple species, and Database was configured to Arabidopsis thaliana. We also compared the sequences from the genes with dominant transcripts with those from the genes with no dominant transcripts (negative sequences). In addition, we extracted the 500 bases upstream of the dominant transcripts and compared those sequences with those from the subordinate transcripts (negative sequences). Finally, we extracted the 500 bases upstream of the genes with dominant transcripts and compared those sequences with those from the genes with no dominant transcripts (negative sequences).
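As an illustration only (not the R implementation used in this study), the dominant/subordinate and switch criteria described in the Methods can be sketched in Python. The function names and the transcript identifiers in the usage note are hypothetical. Dropping transcripts with FPKM < 1 before ranking is also an assumption of this sketch, because the text does not state how unexpressed transcripts are handled when the ratios are computed.

```python
from typing import Dict, Optional, Tuple

# Threshold stated in the Methods: FPKM < 1 is treated as no expression.
FPKM_MIN = 1.0


def dominant_pair(fpkm: Dict[str, float]) -> Optional[Tuple[str, str, float]]:
    """Return (dominant, subordinate, ratio) for one gene in one sample.

    Returns None when fewer than two transcripts are expressed or when the
    ratio of the top two transcripts is not greater than 2.
    """
    # Assumption of this sketch: unexpressed transcripts are ignored,
    # which also avoids dividing by zero.
    expressed = sorted((value, name) for name, value in fpkm.items()
                       if value >= FPKM_MIN)  # ordered from low to high
    if len(expressed) < 2:
        return None
    (sub_value, sub_name), (dom_value, dom_name) = expressed[-2], expressed[-1]
    ratio = dom_value / sub_value  # A_n / A_(n-1)
    if ratio > 2:
        return dom_name, sub_name, ratio
    return None


def signed_ratio(trans_a: float, trans_b: float) -> float:
    """Signed ratio of two transcripts of one gene in one sample.

    Both values are assumed to be positive expression levels.
    """
    if trans_a > trans_b:
        return trans_a / trans_b
    return -(trans_b / trans_a)


def is_switch(ratio_i: float, ratio_j: float) -> bool:
    """True when the signed ratios of two samples point in opposite
    directions, each with a magnitude greater than 2."""
    return (ratio_i > 2 and ratio_j < -2) or (ratio_i < -2 and ratio_j > 2)
```

For a hypothetical gene with FPKM values {"AT1G01010.1": 30.0, "AT1G01010.2": 10.0, "AT1G01010.3": 0.4}, dominant_pair reports AT1G01010.1 as dominant over AT1G01010.2 with a ratio of 3.0. If the same two transcripts had values of 5.0 and 20.0 in a second sample, the signed ratios would be 3.0 and -4.0, and is_switch would mark the pair as switch transcripts.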
Supplemental Data
The following materials are available in the online version of this article.
Supplemental Figure S1. Total gene numbers versus isoform numbers in one gene.
Supplemental Figure S2. Transcript comparison of TAIR9, Marquez et al. (2012), and TAIR10 based on the exact match of the transcript sequences.
Supplemental Figure S3. Transcript expression (log-transformed FPKM) for 61 samples from 10 projects.
Supplemental Figure S4. Expression levels (log-transformed FPKM) for transcripts from the reference genes and other genes.
Supplemental Figure S5. Expression levels (log-transformed FPKM) for common transcripts from single-transcript genes and multiple-transcript genes.
Supplemental Figure S6. Expression profiles of dominant transcripts and subordinate transcripts.
Supplemental Figure S7. Ratio of top two transcripts from one gene versus the number of transcripts per gene.
Supplemental Figure S8. Expression profiles for ubiquitous transcripts in 61 samples.
Supplemental Figure S9. Transcript ratio versus gene ID.
Supplemental Figure S10. Enriched functional group for genes with switch event.
Supplemental Figure S11. Exon numbers and mean exon length for rare transcripts and ubiquitous transcripts.
Supplemental Figure S12. Expression profiles for rare transcripts in 61 samples.
Supplemental Figure S13. Cumulative percentage of expression levels for novel transcripts and common transcripts.
Supplemental Figure S14. Expression profiles for novel transcripts in 61 samples.
Supplemental Figure S15. Venn diagram for genes from dominant transcripts, switch transcripts, and ubiquitous transcripts, and related gene ontology groups.
Supplemental Figure S16. Hierarchical cluster analysis for 61 samples from 10 projects based on the transcript expression level.
Supplemental Table S1. Number of detectable transcripts per tissue with the threshold FPKM = 1.
Supplemental Table S2. Gene number and transcript number annotated in TAIR9, Marquez et al. (2012), and TAIR10.
Supplemental Table S3. The expression ratio of the top two transcripts per gene (dominant transcript/subordinate transcript).
Supplemental Table S4. Ubiquitous transcripts.
Supplemental Table S5. The details for 10 independent projects investigated in this study.
Supplemental Method S1. Statistical test for genes with more transcripts and more diverse functions.
Supplemental Method S2. Statistical test for difference of indel numbers between the genes with dominant transcripts and the genes without dominant transcripts.
Received April 18, 2014; accepted August 10, 2014; published August 12, 2014.
LITERATURE CITED
Bailey TL, Elkan C (1994) Fitting a mixture model by expectation maximization to discover motifs in biopolymers. In R Altman, D Brutlag, P Karp, R Lathrop, D Searls, eds, Proceedings of the Second International Conference on Intelligent Systems for Molecular Biology. AAAI Press, Menlo Park, CA, pp 28–36
Bernard V, Brunaud V, Lecharny A (2010) TC-motifs at the TATA-box expected position in plant genes: a novel class of motifs involved in the transcription regulation. BMC Genomics 11: 166
Buske FA, Bodén M, Bauer DC, Bailey TL (2010) Assigning roles to DNA regulatory motifs using comparative genomics. Bioinformatics 26: 860–866
Czechowski T, Stitt M, Altmann T, Udvardi MK, Scheible WR (2005) Genome-wide identification and testing of superior reference genes for transcript normalization in Arabidopsis. Plant Physiol 139: 5–17
Dal Santo S, Tornielli GB, Zenoni S, Fasoli M, Farina L, Anesi A, Guzzo F, Delledonne M, Pezzotti M (2013) The plasticity of the grapevine berry transcriptome. Genome Biol 14: r54
Fagerberg L, Oksvold P, Skogs M, Algenäs C, Lundberg E, Pontén F, Sivertsson A, Odeberg J, Klevebring D, Kampf C, et al (2013) Contribution of antibody-based protein profiling to the human Chromosome-centric Proteome Project (C-HPP). J Proteome Res 12: 2439–2448
Gonzàlez-Porta M, Frankish A, Rung J, Harrow J, Brazma A (2013) Transcriptome analysis of human tissues and cell lines reveals one dominant transcript per gene. Genome Biol 14: R70
Grzeskowiak L, Costantini L, Lorenzi S, Grando MS (2013) Candidate loci for phenology and fruitfulness contributing to the phenotypic variability observed in grapevine. Theor Appl Genet 126: 2763–2776
Gupta RA, Shah N, Wang KC, Kim J, Horlings HM, Wong DJ, Tsai MC, Hung T, Argani P, Rinn JL, et al (2010) Long non-coding RNA HOTAIR reprograms chromatin state to promote cancer metastasis. Nature 464: 1071–1076
Hebenstreit D, Fang M, Gu M, Charoensawan V, van Oudenaarden A, Teichmann SA (2011) RNA sequencing reveals two major classes of gene expression levels in metazoan cells. Mol Syst Biol 7: 497
Huang W, Sherman BT, Lempicki RA (2009) Systematic and integrative analysis of large gene lists using DAVID bioinformatics resources. Nat Protoc 4: 44–57
Hui L, Zhang X, Wu X, Lin Z, Wang Q, Li Y, Hu G (2004) Identification of alternatively spliced mRNA variants related to cancers by genome-wide ESTs alignment. Oncogene 23: 3013–3023
Jensen MA, Wilkinson JE, Krainer AR (2014) Splicing factor SRSF6 promotes hyperplasia of sensitized skin. Nat Struct Mol Biol 21: 189–197
Jüschke C, Dohnal I, Pichler P, Harzer H, Swart R, Ammerer G, Mechtler K, Knoblich JA (2013) Transcriptome and proteome quantification of a tumor model provides novel insights into post-transcriptional gene regulation. Genome Biol 14: r133
Khan Z, Ford MJ, Cusanovich DA, Mitrano A, Pritchard JK, Gilad Y (2013) Primate transcript and protein expression levels evolve under compensatory selection pressures. Science 342: 1100–1104
Lamesch P, Berardini TZ, Li D, Swarbreck D, Wilks C, Sasidharan R, Muller R, Dreher K, Alexander DL, Garcia-Hernandez M, et al (2012) The Arabidopsis Information Resource (TAIR): improved gene annotation and new tools. Nucleic Acids Res 40: D1202–D1210
Li H, Handsaker B, Wysoker A, Fennell T, Ruan J, Homer N, Marth G, Abecasis G, Durbin R (2009) The Sequence Alignment/Map format and SAMtools. Bioinformatics 25: 2078–2079
Marquez Y, Brown JW, Simpson C, Barta A, Kalyna M (2012) Transcriptome survey reveals increased complexity of the alternative splicing landscape in Arabidopsis. Genome Res 22: 1184–1195
Maugeri A, van Driel MA, van de Pol DJ, Klevering BJ, van Haren FJ, Tijmes N, Bergen AA, Rohrschneider K, Blankenagel A, Pinckers AJ, et al (1999) The 2588G→C mutation in the ABCR gene is a mild frequent founder mutation in the Western European population and allows the classification of ABCR mutations in patients with Stargardt disease. Am J Hum Genet 64: 1024–1035
Merkin J, Russell C, Chen P, Burge CB (2012) Evolutionary dynamics of gene and isoform regulation in mammalian tissues. Science 338: 1593–1599
Mezlini AM, Smith EJ, Fiume M, Buske O, Savich GL, Shah S, Aparicio S, Chiang DY, Goldenberg A, Brudno M (2013) iReckon: simultaneous isoform discovery and abundance estimation from RNA-seq data. Genome Res 23: 519–529
Pickrell JK, Pai AA, Gilad Y, Pritchard JK (2010) Noisy splicing drives mRNA isoform diversity in human cells. PLoS Genet 6: e1001236
Schulz MH, Zerbino DR, Vingron M, Birney E (2012) Oases: robust de novo RNA-seq assembly across the dynamic range of expression levels. Bioinformatics 28: 1086–1092
Seya T, Hirano A, Matsumoto M, Nomura M, Ueda S (1999) Human membrane cofactor protein (MCP, CD46): multiple isoforms and functions. Int J Biochem Cell Biol 31: 1255–1260
Stanke M, Keller O, Gunduz I, Hayes A, Waack S, Morgenstern B (2006) AUGUSTUS: ab initio prediction of alternative transcripts. Nucleic Acids Res 34: W435–W439
Trapnell C, Pachter L, Salzberg SL (2009) TopHat: discovering splice junctions with RNA-Seq. Bioinformatics 25: 1105–1111
Trapnell C, Williams BA, Pertea G, Mortazavi A, Kwan G, van Baren MJ, Salzberg SL, Wold BJ, Pachter L (2010) Transcript assembly and quantification by RNA-Seq reveals unannotated transcripts and isoform switching during cell differentiation. Nat Biotechnol 28: 511–515
Vogel C, Marcotte EM (2012) Insights into the regulation of protein abundance from proteomic and transcriptomic analyses. Nat Rev Genet 13: 227–232
Wang ET, Sandberg R, Luo S, Khrebtukova I, Zhang L, Mayr C, Kingsmore SF, Schroth GP, Burge CB (2008) Alternative isoform regulation in human tissue transcriptomes. Nature 456: 470–476
Zerbino DR, Birney E (2008) Velvet: algorithms for de novo short read assembly using de Bruijn graphs. Genome Res 18: 821–829
Plant Physiol. Vol. 166, 2014