Deep RNA-Seq uncovers the peach transcriptome landscape

Plant Mol Biol (2013) 83:365–377 DOI 10.1007/s11103-013-0093-5

Lu Wang • Shuang Zhao • Chao Gu • Ying Zhou • Hui Zhou • Juanjuan Ma • Jun Cheng • Yuepeng Han



Received: 15 April 2013 / Accepted: 15 June 2013 / Published online: 20 June 2013
© Springer Science+Business Media Dordrecht 2013

Abstract

Peach (Prunus persica) is one of the most important deciduous fruit trees worldwide. To facilitate the isolation of genes controlling important horticultural traits of peach, transcriptome sequencing was conducted in this study. A total of 133 million paired-end RNA-Seq reads were generated from leaf, flower, and fruit, and 90 % of the reads were mapped to the peach draft genome. Sequence assembly revealed 1,162 transcription factors and 2,140 novel transcribed regions (NTRs). Of these 2,140 NTRs, 723 contain an open reading frame, while the remaining 1,417 are non-coding RNAs. A total of 9,587 SNPs were identified across six peach genotypes, with an average density of one SNP per ~5.7 kb. The top of chromosome 2 has a higher density of expressed SNPs than the rest of the peach genome. The average density of SSRs is 312.5/Mb, with tri-nucleotide repeats being the most abundant. Most of the detected SSRs are AT-rich repeats, and the most common di-nucleotide repeat is CT/TC. The predominant type of alternative splicing (AS) event in peach is exon skipping, which accounts for 43 % of all the observed AS events. In addition, the most actively transcribed regions in the peach genome were analyzed. Our study reveals for the first time the complexity of the peach transcriptome, and our results will be helpful for functional genomics research in peach.

Keywords Peach · Transcriptome · Alternative splicing · RNA-Seq · Non-coding RNA

Electronic supplementary material The online version of this article (doi:10.1007/s11103-013-0093-5) contains supplementary material, which is available to authorized users.

L. Wang · S. Zhao · C. Gu · Y. Zhou · H. Zhou · J. Ma · J. Cheng · Y. Han (✉)
Key Laboratory of Plant Germplasm Enhancement and Specialty Agriculture, Wuhan Botanical Garden of the Chinese Academy of Sciences, Wuhan 430074, People’s Republic of China

Introduction

Peach (Prunus persica), a member of the family Rosaceae, is the third most important deciduous fruit tree worldwide, ranking only after apple and pear. It is a diploid with a base chromosome number of 8. Peach is not only a major economic fruit crop grown worldwide, but also serves as an important model species for functional genomics research in woody perennial angiosperms owing to several distinct advantages, including self-compatibility, a short juvenile phase (2–3 years), and a small genome size (~230 Mb) (Arús et al. 2012). Over the last several decades, great efforts have been made to develop various genomics resources such as ESTs (Yamamoto et al. 2002), genetic maps (Joobeur et al. 2000; Dirlewanger et al. 2004), and BAC libraries (Zhebentyayeva et al. 2008) and


to address the molecular mechanisms underlying various horticultural traits in peach (Boudehri et al. 2009; Li et al. 2009; Jiménez et al. 2010a, b; Brandi et al. 2011). However, only limited information is available on gene networks associated with economically important traits. The genetic identification of functionally important genes in peach is hampered by the lack of comprehensive transcript and physical maps.

Recently, the high-quality genome sequence of the doubled haploid peach cv. ‘Lovell’ has been released (The International Peach Genome Initiative 2013), which suggests that peach research is entering the post-genomic era. Extensive functional genomics work is underway to identify the activity of various functional elements in the peach genome. Progress in functional genomics research depends on the availability of detailed transcriptome information. However, most molecular studies in peach have dealt with structural genomics, and few transcriptome studies have been conducted (Shulaev et al. 2008; Martínez-Gómez et al. 2011). As of now, only 79,689 expressed sequence tags (ESTs) have been deposited in NCBI. To facilitate both functional annotation of the peach genome and identification of genes controlling important traits, it is important to explore the transcriptome landscape of peach.

In the past decade, several approaches such as EST sequencing and microarray analysis have been used to investigate features of the transcriptome in fruit trees (Yamamoto et al. 2002; Newcomb et al. 2006; Trainotti et al. 2006; Vecchietti 2009; Soria-Guerra et al. 2011). However, Sanger-based EST sequencing generates highly redundant sequences from highly expressed transcripts and captures few low-expressed transcripts, which makes it less suitable than deep transcriptome analysis. Microarray design requires prior knowledge of ESTs or genomic sequences, and microarray analysis can detect neither RNA variants such as alternative splicing (AS) transcripts nor novel transcripts.

More recently, a deep RNA sequencing methodology, also called RNA-Seq, has been developed and has shown tremendous power in characterizing transcriptomes because it can detect low-expressed transcripts, splice variants, and novel transcripts (Mortazavi et al. 2008; Socquet-Juglard et al. 2013). Therefore, RNA-Seq is now regarded as the latest and most powerful tool for sequencing and profiling of transcriptomes. In peach, two versions of microarrays developed from 4,806 and 7,862 non-redundant ESTs, respectively, have been used to investigate transcriptome changes associated with biological processes such as response to hormone treatments, fruit ripening, and chilling injury (Livio et al. 2007; Bonghi et al. 2011; Martínez-Gómez et al. 2011). However, the utilization of these microarrays has been greatly limited in peach transcriptome analysis because


they were designed for only several thousand genes, which is inadequate on the scale of the whole transcriptome. To obtain a global view of the peach transcriptome, RNA-Seq was performed on different tissues of peach. Based on extensive data analyses, we have identified a substantial number of novel transcripts that significantly improve the current genome annotation of peach. Moreover, other features have also been investigated, including alternatively spliced (AS) isoforms, single nucleotide polymorphisms (SNPs), simple sequence repeats (SSRs), and untranslated region (UTR) boundaries. The transcriptome data allow us to make accurate predictions of gene structures. Our results will be very helpful for future functional genomics research in peach and other fruit trees.

Materials and methods

Plant materials

Four peach varieties (‘Baifeng’, ‘Jinxiang’, ‘Dahongpao’, and ‘Mantianhong’) and two ornamental peach varieties (‘Hongbaihua’ and ‘Hongyetao’), maintained at Wuhan Botanical Garden of the Chinese Academy of Sciences (Hubei Province, PRC), were used for transcriptome analysis. Leaves were collected from cv. Hongyetao and Mantianhong at the juvenile stage in spring. Flowers were collected from cv. Hongbaihua and Mantianhong at the pink stage. Fruits were collected from cv. Baifeng, Jinxiang, and Dahongpao at 65 and 85 days after pollination. All samples were immediately frozen in liquid nitrogen and then stored at -75 °C until use.

cDNA library preparation and Illumina sequencing

Total RNA was extracted using TRIzol (Invitrogen, CA, USA) according to the manufacturer’s instructions, and treated with RNase-free DNase I (Takara, Dalian, China) to remove residual DNA. Equal amounts of total RNA from fruit tissues of the same genotype were pooled and subjected to purification of poly(A) mRNA. Poly(A) mRNA was purified using oligo-dT attached to magnetic beads. The mRNAs were fragmented by sonication, and then subjected to first- and second-strand cDNA synthesis using random hexamer primers. The cDNA libraries were prepared according to Illumina’s protocols. Fragments of ~300 bp were excised and enriched by PCR for 18 cycles. In total, we constructed 2, 2, and 3 paired-end cDNA libraries for leaf, flower, and fruit tissues, respectively. The cDNA libraries were sequenced using an Illumina HiSeq 2000 sequencer according to the manufacturer’s instructions.


Mapping RNA-Seq reads to the peach genome and transcript annotation

RNA-Seq reads were aligned against the peach genome sequences (http://www.rosaceae.org/node/355) using the programs TopHat, Bowtie, and BWA (Trapnell et al. 2009; Langmead et al. 2009; Li and Durbin 2009). Overlapping RNA-Seq reads were merged into continuous transcribed sequences using the Cufflinks package (Trapnell et al. 2009), and the splice junction maps and splicing isoforms were generated simultaneously. UTRs were identified according to the method previously described by Lu et al. (2010). The sequences of the assembled transcripts were compared against the NCBI RefSeq nucleotide database and the Swiss-Prot and UniProt protein databases. Homologues were sequentially annotated according to the BLAST results, followed by the pathway annotation pipelines, including COG (http://www.ncbi.nlm.nih.gov/COG/), GO (http://www.geneontology.org), and KEGG (www.genome.jp/kegg/).

Identification of SNPs, SSRs, and alternative splicing events

RNA-Seq reads from all six genotypes were used for SNP identification. SNP calling was conducted using VarScan 2.2.10 with the default parameters. The distribution of SNPs in coding and UTR regions was analyzed using ANNOVAR (Wang et al. 2010). The assembled transcribed sequences were searched for perfect microsatellites, with a basic motif length of 2–6 bp, using the SSR scanning program (Temnykh et al. 2001). Repeats with a minimum length of 12 bp for di- to tetra-nucleotide repeats, 15 bp for penta-nucleotide repeats, and 18 bp for hexa-nucleotide repeats were recorded. The read-mapping results were used to detect AS events with the MATS package (Shen et al. 2012). RNA-Seq reads from all six genotypes were used for identification of splicing events.
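The SSR search criteria above (perfect repeats, motifs of 2–6 bp, minimum lengths of 12, 12, 12, 15, and 18 bp) can be illustrated with a minimal Python sketch. This is not the SSR scanning program of Temnykh et al. (2001); the function names and the regular-expression approach are assumptions made here purely for illustration.

```python
import re

# Minimum total repeat length (bp) per motif size, as stated in the text.
MIN_LENGTH_BP = {2: 12, 3: 12, 4: 12, 5: 15, 6: 18}


def is_shorter_repeat(motif):
    """True if the motif is itself a repeat of a shorter unit (e.g. AA, CTCT)."""
    n = len(motif)
    return any(n % d == 0 and motif == motif[:d] * (n // d) for d in range(1, n))


def find_perfect_ssrs(seq):
    """Return sorted (start, motif, repeat_count) tuples for perfect SSRs.

    Hypothetical helper: a regex backreference finds a motif followed by
    enough exact copies of itself to reach the minimum length.
    """
    seq = seq.upper()
    hits = []
    for size, min_len in MIN_LENGTH_BP.items():
        min_units = -(-min_len // size)  # ceiling division
        pattern = re.compile(r"([ACGT]{%d})\1{%d,}" % (size, min_units - 1))
        for match in pattern.finditer(seq):
            motif = match.group(1)
            if is_shorter_repeat(motif):
                continue  # already reported under the shorter motif
            hits.append((match.start(), motif, len(match.group(0)) // size))
    return sorted(hits)
```

For example, `find_perfect_ssrs("GG" + "CT" * 6 + "GG")` reports a single CT repeat of six units starting at position 2, whereas five CT units (10 bp) fall below the 12 bp cutoff and are not reported.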
Differentially or strongly expressed gene assessment

The RNA-Seq read-mapping results were used to predict gene expression profiles, and gene expression levels were quantified as FPKM values (fragments per kilobase of transcript per million mapped reads). FPKM values were calculated using the program Cufflinks (http://cufflinks.cbcb.umd.edu/) with the RSEM statistical method. We set a threshold of 0.3 FPKM to determine whether or not a gene was expressed in a specific tissue. To identify actively transcribed regions in the peach genome, RNA-Seq reads from all samples were used to calculate the FPKM value of each gene. The genes with the top 1 and 10 % highest FPKM values were considered to derive from actively transcribed regions.
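The expression measure and the two selection rules described above can be sketched as follows. This is only an illustration of the textbook FPKM definition, the 0.3 FPKM cutoff, and a top-fraction ranking; it does not reproduce the Cufflinks estimation procedure, and all function names and example numbers are hypothetical.

```python
def fpkm(fragments, transcript_length_bp, total_mapped_fragments):
    """Fragments per kilobase of transcript per million mapped fragments."""
    return fragments * 1e9 / (transcript_length_bp * total_mapped_fragments)


def is_expressed(fpkm_value, threshold=0.3):
    """Apply the 0.3 FPKM cutoff used in the text to call a gene expressed."""
    return fpkm_value > threshold


def top_fraction(fpkm_by_gene, fraction):
    """Return the gene IDs with the highest FPKM values (e.g. fraction=0.01)."""
    ranked = sorted(fpkm_by_gene, key=fpkm_by_gene.get, reverse=True)
    n_keep = max(1, int(len(ranked) * fraction))
    return ranked[:n_keep]


# Hypothetical toy data: ten genes with FPKM values 0.0 to 9.0.
example = {f"gene{i}": float(i) for i in range(10)}
```

With these definitions, a 2,000 bp transcript covered by 500 fragments in a library of 20 million mapped fragments has an FPKM of 12.5, and `top_fraction(example, 0.1)` keeps only the single most highly expressed gene in the toy data.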


Real-time PCR analysis and AS validation

Total RNA was extracted using TRIzol (Invitrogen, CA, USA) following the manufacturer’s instructions. Approximately 5 µg of total RNA per sample was treated with DNase I (Takara, Dalian, China), and then subjected to first-strand cDNA synthesis. A SYBR Green-based real-time PCR assay was carried out in a total reaction volume of 20 µL containing 10.0 µL of 2× SYBR Green I Master Mix (Takara, Dalian, China), 0.2 µM of each primer, and 100 ng of template cDNA. The peach gene PpEF2 was used as a constitutive control (Tong et al. 2009). Amplifications were performed using a StepOne real-time PCR System (Applied Biosystems). The amplification program consisted of one cycle of 95 °C for 30 s, followed by 40 cycles of 95 °C for 15 s and 60 °C for 30 s. Fluorescent products were detected in the last step of each cycle. Melting curve analysis was performed at the end of the 40 cycles to ensure proper amplification of target fragments. All analyses were repeated three times using biological replicates.

The validation of AS events was performed using RT-PCR. A mixture of cDNAs prepared from leaves of cv. Hongyetao and Mantianhong was used as template. The PCR program consisted of one cycle of 95 °C for 5 min, followed by 45 cycles of 95 °C for 30 s, 55 °C for 30 s, and 72 °C for 1 min. The primer sequences are listed in Table S1.

Results

Overview of the RNA-Seq data

A total of 40.8, 61.2, and 26.9 million paired-end reads, each 81 bp in length, were generated from flower, fruit, and leaf tissues, respectively (Table 1). Raw reads were trimmed by removing adaptor sequences, empty reads, and low-quality sequences. As a result, 260.9 million (98.1 %) high-quality reads, designated as clean reads, were obtained. Of the clean reads, 98.8 and 1.2 % were paired- and single-end reads, respectively (Table 1). The majority of the clean reads (89.3 %) were mapped to the peach genome. Most of the mapped reads (95.4 %) were anchored onto eight mega-scaffolds that represent 96 % of the peach genome and correspond to the 8 haploid chromosomes (n = 8) of peach. A small portion of the mapped reads (4.6 %) were located on 194 other sequence scaffolds, which have not been anchored onto chromosomes.

Identification of transcribed regions in the peach genome

RNA-Seq reads from all three tissues were mapped to the peach genome and 35,263 consensus sequences were


Table 1 RNA-Seq clean reads and their physical mapping result in peach

Samples   No. of reads (million)   Total sizes (Mb)          No. of mapped reads (million)
          Paired      Single       Left read    Right read   Left       Right
Flower    40.80       0.77         3,191.2      3,180.6      37.48      37.63
Fruit     61.16       1.61         4,699.5      4,729.3      55.80      55.79
Leaf      26.89       0.80         2,084.4      2,088.8      24.20      24.35

identified. All 35,263 consensus sequences were generated from 24,427 genes. Of the 24,427 genes, 22,287 were previously predicted in the peach reference gene set in the GDR database (http://www.rosaceae.org/peach/genome). The remaining 2,140 genes were not included in the peach reference gene set and represent novel transcribed regions (NTRs) (Fig. 1a). Moreover, our analysis revealed ~20.0 Mb of UTRs. On average, each gene contained 876 bp of UTR. The transcript sizes ranged from 0.1 to 13.3 kb, with an average of 1.8 kb (Fig. 1b). The predicted peach transcripts in the GDR database were improved when combined with our results: the percentage of RNA-Seq reads mapped to the transcribed regions identified in our study was higher than the percentage mapped to the GDR predicted transcripts.

The annotation of genes identified in this study is shown in Fig. 1c. Of the 24,427 genes, 15,684 (64.2 %) had one or more GO (Gene Ontology) annotations, resulting in 11,118, 11,311, and 12,980 biological process, cellular component, and molecular function terms, respectively. Of the 24,427 genes, 4,745 had one or more KEGG annotations, and they belonged to 2,489 pathways. A total of 1,162 genes were identified as putative transcription factors (TFs). Among these peach TFs, 37 were not included in the GDR predicted gene set.

Of the 2,140 NTRs, 723 (33.8 %) contained an open reading frame, while the remaining 1,417 (66.2 %) were non-coding RNA (ncRNA) genes. The physical distributions of NTRs and ncRNAs on the peach genome are shown in Fig. 2. The protein-coding NTRs were evenly distributed throughout the peach genome, while a high density of ncRNAs was observed on chromosomes 1, 4, 5, 7, and 8. Of the 2,140 NTRs, 2,072 (96.8 %) and 69 (3.2 %) were located on genetically anchored and non-anchored scaffolds, respectively. Of the anchored NTRs, 692 (32.3 %) and 1,380 (64.5 %) were protein-coding and ncRNAs, respectively. In contrast, 31 and 38 of the non-anchored NTRs were protein-coding and ncRNAs, respectively.

Analysis of SNPs in the transcribed regions of peach

In this study, transcriptome sequencing data were generated from six genotypes, which provided an opportunity to investigate the frequency of SNPs in transcribed regions.



As a result, a total of 9,587 SNPs were identified in the peach transcribed regions, with an average density of one SNP per ~5.7 kb. Of all the SNPs, 61.7 and 38.3 % were transitions and transversions, respectively (Table 2). The two transitions, A/G and C/T, were the most abundant SNPs and accounted for 31.00 and 30.71 % of all SNPs, respectively. The four transversions, i.e., A/C, A/T, G/C, and G/T, were evenly represented, with each accounting for ~10 % of all SNPs. Moreover, 16 and 6 SNPs caused stop codon gain and loss, respectively.

The physical distribution of expressed SNPs on the peach genome is summarized in Fig. 3. All 9,587 SNPs were located on 31 scaffolds (Table S2). Scaffold 2 had the highest density of expressed SNPs, followed by scaffolds 6 and 4. Of the 9,587 SNPs, 9,109 (95.0 %) and 478 (5.0 %) were located in 5,127 protein-coding genes and 359 ncRNAs, respectively. Of the 5,127 protein-coding transcripts, 3,143 contained one SNP and 1,984 had two or more SNPs. A transcript encoding a CC-NBS-LRR protein (GDR accession no. ppa016901m) involved in disease resistance showed the highest genetic variation, with 30 SNPs. However, most of the transcripts (77.9 %) had no SNPs. Moreover, 5,148, 916, and 3,045 SNPs were identified in coding sequences (CDSs), 5′ UTRs, and 3′ UTRs, respectively. Of the 5,148 SNPs in CDSs, 2,687 (52.2 %) were synonymous and located in 1,741 genes, while 2,461 (47.8 %) were non-synonymous and located in 2,025 genes.

Analysis of SSRs in the transcribed regions of peach

A total of 17,979 SSRs were identified in the peach transcriptome, with an average density of one SSR per 3.2 kb (Table 3). Tri-nucleotide repeats were the most abundant, accounting for 36.5 % of all SSRs. Di-, tetra-, penta-, and hexa-nucleotide repeats accounted for 32.5, 17.6, 6.5, and 6.9 % of all SSRs, respectively. Of the dimers, CT/TC repeats were the most abundant, accounting for 57.2 % of all dimers, while AG/GA, AT/TA, AC/CA, and GT/TG repeats accounted for 27.4, 10.2, 3.1, and 2.0 %, respectively. Only one GC/CG repeat, 12 bp in length, was found. Of the trimers, AAG/AGA/GAA repeats were the most abundant, accounting for 15.8 % of all trimers, followed by CTT/TCT/TTC (14.6 %) and CCT/CTC/TCC (8.0 %).


Fig. 1 Overview of the peach transcriptome. a Comparison of the estimated gene numbers between GDR database (I) and our study (II). b Distribution of transcript sizes. c GO annotation of genes identified in our study

Among tetramers, AT-rich repeats were the most abundant, accounting for 20.7 % of all tetramers. Moreover, 76.9 % of SSR motifs were <20 bp in length. The physical distribution of SSRs across the peach genome is shown in Fig. 3. Overall, the SSR density was consistent with the transcript density in the peach genome. In addition, the distribution of SSRs in UTRs and CDSs was also investigated (Table 4). Among UTR-SSRs, di-nucleotide repeats were the most abundant, accounting for 42.0 %. CT/TC and AG/GA dimers prevailed in UTR sequences, accounting for 23.3 and 11.9 % of all UTR-SSRs, respectively. Of the CDS-SSRs, tri-nucleotide repeats were the most abundant, accounting for 68.2 % of all CDS-SSRs. AAG/AGA/GAA and CTT/TCT/TTC trimers were frequently encountered in CDSs, accounting for 10.3 and 7.4 % of all CDS-SSRs, respectively (Fig. S1).
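The SSR classes reported above group each motif with its cyclic rotations (for example AAG/AGA/GAA), while the two strands are counted separately (AAG and CTT are distinct classes). A minimal Python sketch of that grouping, together with the arithmetic behind the density and percentage figures, is given below. The helper names are hypothetical and the counts are taken from Table 3.

```python
def motif_class(motif):
    """Label a motif by the sorted set of its cyclic rotations, e.g. AAG/AGA/GAA."""
    motif = motif.upper()
    rotations = {motif[i:] + motif[:i] for i in range(len(motif))}
    return "/".join(sorted(rotations))


def density_per_mb(kb_per_ssr):
    """Convert an average spacing (kb per SSR) into SSRs per Mb."""
    return 1000.0 / kb_per_ssr


def percent(part, whole):
    """Share of `part` in `whole`, rounded to one decimal place."""
    return round(100.0 * part / whole, 1)


# Counts from Table 3: CT/TC dimers and all dimers; AAG/AGA/GAA and all trimers.
ct_share_of_dimers = percent(3346, 5848)
aag_share_of_trimers = percent(1039, 6563)
```

Under these assumptions, `motif_class("AGA")` and `motif_class("GAA")` both give `AAG/AGA/GAA`, one SSR per 3.2 kb corresponds to about 312.5 SSRs per Mb, and the CT/TC and AAG/AGA/GAA shares come out at 57.2 and 15.8 %.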

Identification of alternative splicing events and exons in the peach transcriptome

Five types of AS events were analyzed in the peach transcriptome: exon skipping (ES), alternative 5′ splice site (A5SS), alternative 3′ splice site (A3SS), mutually exclusive exons (MXE), and retained intron (RI). As a result, 10,835 AS events were identified in 5,520 transcribed regions, including 496 NTRs (Fig. 4a). ES was the most abundant type of AS event (42.8 %), followed by A3SS (37.1 %) and A5SS (15.6 %) (Fig. 4b). RI and MXE were rare and accounted for only 2.8 and 1.7 %, respectively. All these AS events occurred in 5,634 transcripts. The physical distribution of the transcripts containing AS events across the peach genome is shown in Fig. 4c.


Fig. 2 Distribution of transcripts on the peach genome

Nearly all the transcripts containing AS events were located on scaffolds 1–9.

Genes expressed in peach leaf, flower, and fruit

Overall, 19,300, 18,932, and 18,435 genes were identified in leaf, flower, and fruit tissues, respectively (Fig. 5a). The expression of 16,472 genes was detected in all three tissues, while 831, 892, and 680 genes were specifically expressed in leaf, flower, and fruit, respectively. A further 856, 427, and 1,141 genes were expressed in two tissues, i.e., leaf/fruit, flower/fruit, and flower/leaf, respectively. The remaining 3,128 genes had extremely low expression in all three tissues. As mentioned above, 1,162 TFs were identified in all samples analyzed. Of the 1,162 TFs, 874 were expressed in all analyzed tissues, while 21, 33, and 52 TFs were specifically expressed in flower, fruit, and leaf, respectively (Fig. 5b). A further 77, 68, and 31 TFs were expressed in two tissues, i.e., flower/leaf, leaf/fruit, and flower/fruit, respectively. Six TFs had extremely low expression in all three tissues.
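The gene and TF counts reported above partition a three-set Venn diagram, so the per-tissue totals can be recovered by summing the regions that include each tissue. The short Python sketch below does this bookkeeping with the numbers from the text; the dictionary layout and function names are assumptions made for illustration.

```python
# Venn regions for all genes: tissues in which a gene is expressed -> gene count.
GENE_REGIONS = {
    frozenset({"leaf", "flower", "fruit"}): 16472,
    frozenset({"leaf"}): 831,
    frozenset({"flower"}): 892,
    frozenset({"fruit"}): 680,
    frozenset({"leaf", "fruit"}): 856,
    frozenset({"flower", "fruit"}): 427,
    frozenset({"flower", "leaf"}): 1141,
}

# The same layout for transcription factors.
TF_REGIONS = {
    frozenset({"leaf", "flower", "fruit"}): 874,
    frozenset({"flower"}): 21,
    frozenset({"fruit"}): 33,
    frozenset({"leaf"}): 52,
    frozenset({"flower", "leaf"}): 77,
    frozenset({"leaf", "fruit"}): 68,
    frozenset({"flower", "fruit"}): 31,
}


def tissue_total(regions, tissue):
    """Number of genes expressed in a tissue, summed over all Venn regions."""
    return sum(count for tissues, count in regions.items() if tissue in tissues)


def expressed_anywhere(regions):
    """Number of genes expressed in at least one tissue."""
    return sum(regions.values())
```

Summing the regions reproduces the per-tissue totals of 19,300 (leaf), 18,932 (flower), and 18,435 (fruit) genes; 21,299 genes are expressed in at least one tissue, which together with the 3,128 low-expression genes gives the 24,427 genes identified. For TFs, 1,156 are expressed in at least one tissue, and adding the six low-expression TFs gives 1,162.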


Among the TFs expressed in the tested tissues, 100, 66, 21, 33, 67, and 45 encoded MYB, bHLH, WD40, MADS-box, ERF, and WRKY proteins, respectively (Fig. 5c). Overall, most of these typical TFs (79 %) were expressed in all three tissues, while 26, 14, 1, 13, 14, and 9 of the MYB, bHLH, WD40, MADS-box, ERF, and WRKY TFs, respectively, were differentially expressed (Fig. 5d). Of the 26 differentially expressed MYB TFs, 11, 4, and 2 were exclusively expressed in leaves, flowers, and fruits, respectively, while 8 and 1 were expressed in two types of tissues, viz. flower/leaf and flower/fruit, respectively. Of the 14 differentially expressed bHLH TFs, 3, 2, and 1 were exclusively expressed in leaf, flower, and fruit tissues, respectively, while 5 and 3 were expressed in two types of tissues, viz. flower/leaf and leaf/fruit, respectively. Of the 14 differentially expressed ERF TFs, 4, 1, and 5 were exclusively expressed in leaf, flower, and fruit tissues, respectively, while 2, 1, and 1 were expressed in two tissues, viz. flower/leaf, flower/fruit, and leaf/fruit, respectively. Among the 13 differentially expressed MADS TFs, 6 and 3 were


Table 2 Composition of SNPs in the transcribed regions of peach

Region    Type           SNP      No.      %
5′ UTR    Transition     A↔G      276      2.91
                         C↔T      290      3.06
                         Total    566      5.97
          Transversion   A↔C      89       0.94
                         A↔T      91       0.96
                         G↔C      167      1.76
                         G↔T      81       0.86
                         Total    428      4.52
3′ UTR    Transition     A↔G      823      8.69
                         C↔T      823      8.69
                         Total    1,646    17.38
          Transversion   A↔C      257      2.71
                         A↔T      292      3.08
                         G↔C      511      5.39
                         G↔T      240      2.53
                         Total    1,300    13.72
CDS       Transition     A↔G      1,707    18.02
                         C↔T      1,640    17.31
                         Total    3,347    35.33
          Transversion   A↔C      401      4.23
                         A↔T      454      4.79
                         G↔C      523      5.52
                         G↔T      420      4.43
                         Total    1,798    18.98
ncRNA     Transition     A↔G      107      1.13
                         C↔T      136      1.44
                         Total    243      2.57
          Transversion   A↔C      34       0.36
                         A↔T      29       0.31
                         G↔C      41       0.43
                         G↔T      24       0.25
                         Total    145      1.53

exclusively expressed in leaf and flower, respectively, while 1, 1, and 2 were expressed in two tissues, viz. flower/leaf, flower/fruit, and leaf/fruit, respectively. Of the 9 differentially expressed WRKY TFs, 5 were exclusively expressed in leaf, while 4 were expressed in leaf and fruit. Intriguingly, only one out of 23 WD40 TFs showed differential expression, and it was exclusively expressed in fruits (Fig. 5d).

The most active transcribed regions in peach genome

The expression of 24,427 genes was quantified using FPKM values, and 21,299 (87.2 %) had an FPKM value >0.3 in at least one tissue. The physical distribution of the 500 most highly expressed genes is shown in Fig. 6a. Scaffolds 1, 2, 3, 5, 7, and 8 contained more of the top 500 genes at the bottom than at the top. In contrast, scaffold 4 had more of the top 500 genes at the top than at the bottom. Scaffold 6 contained slightly more of the top 500 genes at the top than at the bottom. The annotation of the top 500 genes is shown in Fig. 6b. Most of the top 500 genes (97.5 %) had one or more GO annotations, including biological process, cellular component, and molecular function terms.

Validation of gene expression profiles and AS events using RT-PCR

Seven pairs of primers were designed to verify AS events. Of the 7 primer pairs, five generated more than one band of the predicted sizes (Fig. S2). In addition, five primer pairs were designed to validate the in silico expression data (Fig. S3). As a result, four genes showed high expression in all tested tissues and one was expressed only in leaf tissue. This result was consistent with the expression profiles estimated from the RNA-Seq data.

Fig. 3 Distribution of expressed SNPs and SSRs on the peach genome


Table 3 Composition and length distribution of SSR motifs in the peach transcriptome

Repeat unit   Repeat type                      Repeat length (bp)   Total    Frequency (%)
                                               <20       ≥20
Dimer         AC/CA                            132       46         178      0.99
              GT/TG                            94        25         119      0.66
              AG/GA                            800       805        1,605    8.93
              CT/TC                            1,541     1,805      3,346    18.61
              AT/TA                            388       211        599      3.33
              CG                               1         0          1        0.01
              Total                            2,956     2,892      5,848    32.53
Trimer        AAT/ATA/TAA                      143       15         158      0.88
              ATT/TAT/TTA                      160       19         179      1.00
              AAC/ACA/CAA                      308       40         348      1.94
              GTT/TGT/TTG                      154       7          161      0.90
              AAG/AGA/GAA                      924       115        1,039    5.78
              CTT/TCT/TTC                      849       107        956      5.32
              ACC/CAC/CCA                      416       35         451      2.51
              GGT/GTG/TGG                      271       12         283      1.57
              CCT/CTC/TCC                      472       54         526      2.93
              AGG/GAG/GGA                      297       42         339      1.89
              CCG/CGC/GCC                      83        2          85       0.47
              CGG/GCG/GGC                      83        5          88       0.49
              Others                           1,774     176        1,950    10.84
              Total                            5,934     629        6,563    36.50
Tetramer      AAAC/AACA/ACAA/CAAA              101       3          104      0.58
              AAAG/AAGA/AGAA/GAAA              276       13         289      1.61
              AAAT/AATA/ATAA/TAAA              221       5          226      1.26
              AACC/ACCA/CAAC/CCAA              54        0          54       0.30
              AAGG/AGGA/GAAG/GGAA              67        2          69       0.38
              AATT/ATTA/TTAA/TAAT              125       3          128      0.71
              ACCC/CACC/CCAC/CCCA              34        0          34       0.19
              ATTT/TATT/TTAT/TTTA              283       18         301      1.67
              AGGG/GAGG/GGAG/GGGA              53        2          55       0.31
              CCCT/CCTC/CTCC/TCCC              105       10         115      0.64
              CTTT/TCTT/TTCT/TTTC              286       21         307      1.71
              CCTT/CTTC/TCCT/TTCC              74        3          77       0.43
              GGGT/GGTG/GTGG/TGGG              19        0          19       0.11
              GGTT/GTTG/TGGT/TTGG              49        1          50       0.28
              GTTT/TGTT/TTGT/TTTG              122       6          128      0.71
              Others                           1,129     81         1,210    6.73
              Total                            2,998     168        3,166    17.61
Pentamer      AAAAT/AAATA/AATAA/ATAAA/TAAAA    74        11         85       0.47
              AAAAG/AAAGA/AAGAA/AGAAA/GAAAA    76        18         94       0.52
              CTTTT/TCTTT/TTCTT/TTTCT/TTTTC    80        21         101      0.56
              ATTTT/TATTT/TTATT/TTTAT/TTTTA    76        11         87       0.48
              GTTTT/TGTTT/TTGTT/TTTGT/TTTTG    43        8          51       0.28
              Others                           629       123        752      4.18
              Total                            978       192        1,170    6.51
Hexamer       Total                            953       279        1,232    6.85

SSRs recorded for the final dataset included dimers and trimers at least 12 bp in length and tetramers to hexamers with at least 3 repeats


Table 4 Distribution of SSRs in CDSs and UTRs in the peach transcriptome

Repeat      Region   Repeat length (bp)   Sum      %
                     <20       ≥20
Dimer       CDS      313       309        622      3.46
            UTR      2,643     2,583      5,226    29.07
Trimer      CDS      3,444     325        3,769    20.96
            UTR      2,490     304        2,794    15.54
Tetramer    CDS      476       8          484      2.69
            UTR      2,522     160        2,682    14.92
Pentamer    CDS      103       19         122      0.68
            UTR      875       173        1,048    5.83
Hexamer     CDS      396       132        528      2.94
            UTR      557       147        704      3.92

Discussion

Frequency of SSRs and SNPs identified in the peach transcriptome

SSRs and SNPs derived from transcribed sequences serve as gene-tagged markers and will be very helpful for genetic and functional genomics studies. Here, our study reveals that SSRs are abundant in the peach transcriptome, with an average density of 312 SSRs per Mb. The SSR density in the peach transcriptome is similar to the overall density of SSRs in the expressed sequences of other dicots such as Arabidopsis (357 SSRs/Mb), Medicago (324 SSRs/Mb), soybean (403 SSRs/Mb), poplar (424 SSRs/Mb), grapevine (247 SSRs/Mb), and cucumber (370 SSRs/Mb), but lower than that of monocots such as rice (739 SSRs/Mb) and sorghum (646 SSRs/Mb) (Cavagnaro et al. 2010). Moreover, tri-nucleotide repeats are the most frequent SSR type in the peach transcriptome, followed by di- and tetra-nucleotide repeats. This result is in agreement with previous reports that tri-nucleotide repeats are the most abundant type of SSRs in the expressed sequences of plant species such as Arabidopsis, Medicago, soybean, poplar, grapevine, rice, and sorghum (Cavagnaro et al. 2010).

The most common di-nucleotide repeat in the peach transcriptome is AG/GA/CT/TC, which is consistent with previous findings in the expressed sequences of other plant species, including monocots such as rice, maize, barley, and sorghum and dicots such as apple, almond, rose, Arabidopsis, Medicago, soybean, poplar, grapevine, tomato, and cotton (Cardle et al. 2000; Kantety et al. 2002; Jung et al. 2005; Cavagnaro et al. 2010; Zhang et al. 2012). The rarity of GC/CG repeats observed in the peach transcriptome seems to be common in the expressed sequences of other species as well. However, the most common type of tri-nucleotide repeat in the expressed sequences varies among plant species. For example, AAG/AGA/GAA are the predominant tri-nucleotide repeats in the expressed sequences of dicots such as Arabidopsis, peach, apple, and cucumber, while CCG/CGC/GCC repeats prevail in the expressed sequences of monocots such as rice and sorghum

Fig. 4 AS events in the peach transcriptome. a Diagram of five major types of AS events (Shen et al. 2012). b Proportions of different types of AS events. c Distribution of AS events on the peach genome


Fig. 5 Characterization of genes in the peach transcriptome. a Genes expressed in leaf, fruit, and flower. b TFs expressed in leaf, fruit, and flower. c Typical TFs in the peach transcriptome. d Typical TFs expressed in leaf, fruit, and flower

Fig. 6 Physical distribution (a) and GO annotation (b) of the top 500 highest expressed genes in the peach transcriptome


(Zhang et al. 2012; Cavagnaro et al. 2010). This variation in the frequency of tri-nucleotide repeats may be partially attributed to the fact that GC contents in monocots are generally higher than those observed in dicots (Cavagnaro et al. 2010). In contrast to the high frequency of SSRs in the peach transcriptome, a low density of SNPs in the expressed sequences (*0.2/kb) was observed across six varieties. This observed SNP density in the peach transcriptome is extremely lower than those reported for other fruit tree species. For example, an average density of 15.6 SNPs per kb has been reported in the expressed sequences from grapevine (Lijavetzky et al. 2007). In apple, 71,482 SNPs were identified from 9,555 EST contigs, with an overall density of 6.7 SNPs per kb (Chagne´ et al. 2008). Similarly, Khan et al. (2012) detected 37,807 SNPs from 6,888 apple EST contigs, with an average density of 5.3 SNPs per kb. A low density of SNPs in the peach transcriptome could be partially attributed to a small sample of peach varieties used in this study. On the other hand, whole genome resequencing of 56 peach breeding accessions revealed 1,022,354 SNPs, with an overall density of 4.4 SNPs/kb in the peach genome (Ahmad et al. 2011; Verde et al. 2012). The frequency of SNPs in genomic DNA sequences is much higher than observed in the transcribed sequences in this study. This result implies selection pressure could be stronger in genic regions than in nongenic regions during the process of peach domestication and adaptation. It is worth noting that a putative resistance gene (ppa016901m) that contains the highest density of expressed SNPs is located at the top of scaffold 2. Intriguingly, the density of expressed SNPs, including both synonymous and non-synonymous coding SNPs, is higher at the top of scaffold 2 than those at the rest of the peach genome (Fig. 3). 
Peach genome sequencing of cultivated varieties and wild species also reveals a high SNP density at the top of scaffold 2 (The International Peach Genome Initiative 2013). The top of scaffold 2 is rich in resistance genes. Therefore, our study confirms the previous finding that regions hosting resistance genes are evolving rapidly (McHale et al. 2006).

Novel transcribed regions and AS events in the peach transcriptome

In this study, we have produced over 21.5 Gb of Illumina RNA-Seq data, which represent ~96-fold coverage of the peach genome and over 550-fold coverage of the peach reference gene set. In addition, up to 90 % of RNA-Seq reads have been mapped to the peach reference genome. This percentage of mapped reads is much higher than the ~60 % previously reported in rice (Lu et al. 2010). This outcome suggests that both the transcriptome sequences and the peach reference genome are of high quality. Therefore, our cDNA deep sequencing data provide a good opportunity to identify AS events and NTRs in peach. First, alternative splicing is common in plants, and over 20 % of plant genes produce two or more transcript isoforms (Campbell et al. 2006). Here, 22.6 % of peach genes are observed to undergo AS, which is commensurate with the levels observed in rice and Arabidopsis. However, the predominant type of AS event in peach is exon skipping, which accounts for 43 % of all the observed AS events. This result contradicts the previous finding that exon skipping is relatively rare in plants (Barbazuk et al. 2008). For example, the proportions of AS events that involve exon skipping in Arabidopsis, rice and maize are 3, 11, and 5 %, respectively. Similarly, intron retention isoforms are rare in peach, accounting for only about 3 % of all the observed AS events, as opposed to over 30 % reported in Arabidopsis, rice and maize (Barbazuk et al. 2008). These differences suggest that the predominant type of AS event is not conserved across a wide spectrum of plant species. In peach, 838 AS events have been discovered through EST sequencing (The International Peach Genome Initiative 2013). However, this conventional approach is usually limited by sequencing depth and by the bias of EST collections towards highly expressed genes. For example, although EST databases have been extensively developed in human, 31 % of exons are still represented by no EST or only a single EST (Johnson et al. 2003). The human EST collection is much deeper than any EST collection in plants (Barbazuk et al. 2008). Thus, the use of EST collections to investigate AS events is greatly limited in plants and may lead to underestimation of AS events. High-throughput sequencing tools such as RNA-Seq are more powerful than Sanger EST sequencing for investigating AS events.
In this study, the transcribed regions of peach were deeply sequenced, with an average coverage of 550-fold. Therefore, our estimates of AS events in peach should be both repeatable and reliable. Nevertheless, more cDNA libraries covering different tissues, developmental stages and a range of stress conditions need to be sequenced to obtain a full view of AS events in peach. Second, our study reveals 2,140 NTRs, most of which (66.2 %) are long non-coding RNAs (lncRNAs). lncRNAs have recently become an active research topic in plants. For example, two classes of lncRNAs that play important roles in the regulation of vernalization have been identified at the Arabidopsis FLC locus (Swiezewski et al. 2009; Heo and Sung 2011). Likewise, several lncRNAs that are responsive to powdery mildew infection and heat stress have been reported in wheat (Xin et al. 2011). Here, we report lncRNA transcripts at the genome-wide level in peach, and our findings will be helpful for functional genomics research in peach.
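As a rough sanity check on the sequencing-depth figures quoted in this section (over 21.5 Gb of data, ~96-fold coverage of the genome, and 550-fold coverage of the reference gene set), fold coverage can be treated as total sequenced bases divided by the size of the target. The sketch below inverts that relation to show the target sizes those figures imply. It assumes that 1 Gb means 10^9 bases and that the quoted fold values are approximate; the function names are hypothetical and no genome size is taken from outside the text.

```python
# Back-of-the-envelope check of the coverage figures quoted in the text.
# Assumption: "Gb" means 1e9 bases; fold values are approximate in the source.

TOTAL_SEQUENCED_BASES = 21.5e9  # "over 21.5 Gb" of RNA-Seq data


def fold_coverage(total_bases: float, target_size_bp: float) -> float:
    """Average depth: sequenced bases divided by the size of the target."""
    return total_bases / target_size_bp


def implied_target_size_mb(total_bases: float, fold: float) -> float:
    """Invert the coverage relation to get the target size in megabases."""
    return total_bases / fold / 1e6


# Target sizes implied by the two quoted coverage values.
genome_mb = implied_target_size_mb(TOTAL_SEQUENCED_BASES, 96)
gene_set_mb = implied_target_size_mb(TOTAL_SEQUENCED_BASES, 550)

print(f"implied genome size: ~{genome_mb:.0f} Mb")
print(f"implied reference gene set size: ~{gene_set_mb:.0f} Mb")

# Round trip: plugging the implied size back in recovers the quoted fold.
print(f"round-trip genome coverage: {fold_coverage(TOTAL_SEQUENCED_BASES, genome_mb * 1e6):.0f}x")
```

The two implied sizes differ by the same factor as the two coverage values (550/96), which is all this check can show without an independent estimate of the genome or gene-set size.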

Transcriptome sequence serves as a complement to the draft sequence of the peach genome

Compared with the reference genomes of apple and strawberry, the peach reference genome has a high-quality sequence assembly (The International Peach Genome Initiative 2013). However, most of the predicted genes contain no UTRs, suggesting that there is considerable room for improving the annotation of the peach genome. In this study, the UTRs of the peach reference genes have been extended to an average size of 876 bp. These UTRs are very useful for digital gene expression profiling because UTR sequences are unique and can serve as gene tags (Nishiyama et al. 2012). Moreover, our study also reveals 2,983 novel transcripts. Surprisingly, these new transcripts include 15 S-locus genes (6 encoding S-haplotype-specific S-RNase and 9 encoding S-haplotype-specific F-box protein). All these S-locus genes are highly expressed in the tested tissues. It is well known that S-locus genes are responsible for self-incompatibility in fruit trees of the Rosaceae such as almond, pear and apple (Wang et al. 2009). Peach, unlike its close relative almond, is self-fertile. Thus, it is not clear whether these S-locus genes have retained the same function as their orthologs since the divergence of peach from other Rosaceae species. In addition, 10,835 alternative splicing events and 2,461 non-synonymous SNPs have also been identified in this study. The expressed SNPs are a highly informative resource for genotyping applications such as the design of a peach SNP chip.

In short, our study reveals for the first time the complexity of the peach transcriptome and provides extensive new knowledge about alternative splicing, NTRs, and gene boundaries. The results will not only serve as a complement to the predicted gene database of peach, but also provide an invaluable resource for functional genomics research in peach and other fruit trees in the future.
Acknowledgments This project was supported by funds received from the National 863 program of China (No. 2011AA100206), the National 948 Project from the Ministry of Agriculture of China, and the National Natural Science Foundation of China (No. 31201604 and 31000139).

References

Ahmad R, Parfitt DE, Fass J, Ogundiwin E, Dhingra A, Gradziel TM, Lin D, Joshi NA, Martínez-García PJ, Crisosto CH (2011) Whole genome sequencing of peach (Prunus persica L.) for SNP identification and selection. BMC Genomics 12:569
Arús P, Verde I, Sosinski B, Zhebentyayeva T, Abbott AG (2012) The peach genome. Tree Genet Genomes 8:1–17
Barbazuk WB, Fu Y, McGinnis KM (2008) Genome wide analyses of alternative splicing in plants: opportunities and challenges. Genome Res 18:1381–1392
Bonghi C, Trainotti L, Botton A, Tadiello A, Rasori A, Ziliotto F, Zaffalon V, Casadoro G, Ramina A (2011) A microarray approach to identify genes involved in seed-pericarp cross-talk and development in peach. BMC Plant Biol 11:107
Boudehri K, Bendahmane A, Cardinet G, Troadec C, Moing A, Dirlewanger E (2009) Phenotypic and fine genetic characterization of the D locus controlling fruit acidity in peach. BMC Plant Biol 9:59
Brandi F, Bar E, Mourgues F, Horváth G, Turcsi E, Giuliano G, Liverani A, Tartarini S, Lewinsohn E, Rosati C (2011) Study of 'Redhaven' peach and its white-fleshed mutant suggests a key role of CCD4 carotenoid dioxygenase in carotenoid and norisoprenoid volatile metabolism. BMC Plant Biol 11:24
Campbell MA, Haas BJ, Hamilton JP, Mount SM, Buell CR (2006) Comprehensive analysis of alternative splicing in rice and comparative analyses with Arabidopsis. BMC Genomics 7:327
Cardle L, Ramsay L, Milbourne D, Macaulay M, Marshall D, Waugh R (2000) Computational and experimental characterization of physically clustered simple sequence repeats in plants. Genetics 156:847–854
Cavagnaro PF, Senalik DA, Yang L, Simon PW, Harkins TT, Kodira CD, Huang S, Weng Y (2010) Genome-wide characterization of simple sequence repeats in cucumber (Cucumis sativus L.). BMC Genomics 11:569
Chagné D, Gasic K, Crowhurst RN, Han Y, Bassett HC, Bowatte DR, Lawrence TJ, Rikkerink EH, Gardiner SE, Korban SS (2008) Development of a set of SNP markers present in expressed genes of the apple. Genomics 92:353–358
Dirlewanger E, Cosson P, Howad W, Capdeville G, Bosselut N, Claverie M, Voisin R, Poizat C, Lafargue B, Baron O, Laigret F, Kleinhentz M, Arús P, Esmenjaud D (2004) Microsatellite genetic linkage maps of myrobalan plum and an almond-peach hybrid-location of root-knot nematode resistance genes. Theor Appl Genet 109:827–838
Heo JB, Sung S (2011) Vernalization-mediated epigenetic silencing by a long intronic noncoding RNA. Science 331:76–79
Jiménez S, Li ZG, Reighard GL, Bielenberg DG (2010a) Identification of genes associated with growth cessation and bud dormancy entrance using a dormancy-incapable tree mutant. BMC Plant Biol 10:25
Jiménez S, Reighard GL, Bielenberg DG (2010b) Gene expression of DAM5 and DAM6 is suppressed by chilling temperatures and inversely correlated with bud break. Plant Mol Biol 73:157–167
Johnson JM, Castle J, Garrett-Engele P, Kan Z, Loerch PM, Armour CD, Santos R, Schadt EE, Stoughton R, Shoemaker DD (2003) Genome-wide survey of human alternative pre-mRNA splicing with exon junction microarrays. Science 302:2141–2144
Joobeur T, Periam N, de Vicente MC, King GJ, Arús P (2000) Development of a second generation linkage map for almond using RAPD and SSR markers. Genome 43:649–655
Jung S, Abbott A, Jesudurai C, Tomkins J, Main D (2005) Frequency, type, distribution and annotation of simple sequence repeats in Rosaceae ESTs. Funct Integr Genomics 5:136–143
Kantety V, Rota L, Matthews E, Sorrells E (2002) Data mining for simple sequence repeats in expressed sequence tags from barley, maize, rice, sorghum and wheat. Plant Mol Biol 48:501–510
Khan MA, Han Y, Zhao YF, Korban SS (2012) A high-throughput apple SNP genotyping platform using the GoldenGate™ assay. Gene 494:196–201
Langmead B, Trapnell C, Pop M, Salzberg SL (2009) Ultrafast and memory-efficient alignment of short DNA sequences to the human genome. Genome Biol 10:R25
Li H, Durbin R (2009) Fast and accurate short read alignment with Burrows–Wheeler transform. Bioinformatics 25:1754–1760
Li ZG, Reighard GL, Abbott AG, Bielenberg DG (2009) Dormancy associated MADS genes from the EVG locus of peach [Prunus persica (L.) Batsch] have distinct seasonal and photoperiodic expression patterns. J Exp Bot 60:3521–3530
Lijavetzky D, Cabezas JA, Ibáñez A, Rodríguez V, Martínez-Zapater JM (2007) High throughput SNP discovery and genotyping in grapevine (Vitis vinifera L.) by combining a re-sequencing approach and SNPlex technology. BMC Genomics 8:424
Livio T, Tadiello A, Casadoro G (2007) Variations of the peach fruit transcriptome during ripening and in response to hormone treatments. Caryologia 60:156–159
Lu T, Lu G, Fan D, Zhu C, Li W, Zhao Q, Feng Q, Zhao Y, Guo Y, Li W, Huang X, Han B (2010) Function annotation of the rice transcriptome at single-nucleotide resolution by RNA-seq. Genome Res 20:1238–1249
Martínez-Gómez P, Crisosto CH, Bonghi C, Rubio M (2011) New approaches to Prunus transcriptome analysis. Genetica 139:755–769
McHale L, Tan X, Koehl P, Michelmore RW (2006) Plant NBS-LRR proteins: adaptable guards. Genome Biol 7:212
Mortazavi A, Williams BA, McCue K, Schaeffer L, Wold B (2008) Mapping and quantifying mammalian transcriptomes by RNA-Seq. Nat Methods 5:621–628
Newcomb RD, Crowhurst RN, Gleave AP, Rikkerink EH, Allan AC, Beuning LL, Bowen JH, Gera E, Jamieson KR, Janssen BJ, Laing WA, McArtney S, Nain B, Ross GS, Snowden KC, Souleyre EJ, Walton EF, Yauk YK (2006) Analyses of expressed sequence tags from apple. Plant Physiol 141:147–166
Nishiyama R, Le DT, Watanabe Y, Matsui A, Tanaka M et al (2012) Transcriptome analyses of a salt-tolerant cytokinin-deficient mutant reveal differential regulation of salt stress response by cytokinin deficiency. PLoS One 7:e32124
Shen S, Park JW, Huang J, Dittmar KA, Lu ZX, Zhou Q, Carstens RP, Xing Y (2012) MATS: a Bayesian framework for flexible detection of differential alternative splicing from RNA-Seq data. Nucleic Acids Res 40:e61
Shulaev V, Korban SS, Sosinski B, Abbott AG, Aldwinckle HS, Folta KM, Iezzoni A, Main D, Arús P, Dandekar AM, Lewers K, Gardiner SE, Potter D, Veilleux E (2008) Multiple models for Rosaceae genomics. Plant Physiol 147:985–1003
Socquet-Juglard D, Kamber T, Pothier JF, Christen D, Gessler C, Duffy B, Patocchi A (2013) Comparative RNA-Seq analysis of early-infected peach leaves by the invasive phytopathogen Xanthomonas arboricola pv. pruni. PLoS One 8:e54196
Soria-Guerra RE, Rosales-Mendoza S, Gasic K, Wisniewski ME, Band M, Korban SS (2011) Gene expression is highly regulated in early developing fruit of apple. Plant Mol Biol Rep 29:885–897
Swiezewski S, Liu F, Magusin A, Dean C (2009) Cold-induced silencing by long antisense transcripts of an Arabidopsis Polycomb target. Nature 462:799–802
Temnykh S, DeClerck G, Lukashova A, Lipovich L, Cartinhour S, McCouch S (2001) Computational and experimental analysis of microsatellites in rice (Oryza sativa L.): frequency, length variation, transposon associations, and genetic marker potential. Genome Res 11:1441–1452
The International Peach Genome Initiative (2013) The high-quality draft genome of peach (Prunus persica) identifies unique patterns of genetic diversity, domestication and genome evolution. Nat Genet. doi:10.1038/ng.2586
Tong Z, Gao Z, Wang F, Zhou J, Zang Z (2009) Selection of reliable reference genes for gene expression studies in peach using real-time PCR. BMC Mol Biol 10:71
Trainotti L, Bonghi C, Ziliotto F, Zanin D, Rasori A, Casadoro G, Ramina A, Tonutti P (2006) The use of microarray μPEACH1.0 to investigate transcriptome changes during transition from preclimacteric to climacteric phase in peach fruit. Plant Sci 170:606–613
Trapnell C, Pachter L, Salzberg SL (2009) TopHat: discovering splice junctions with RNA-Seq. Bioinformatics 25:1105–1111
Vecchietti A (2009) Comparative analysis of expressed sequence tags from tissues in ripening stages of peach. Tree Genet Genomes 5:377–391
Verde I, Bassil N, Scalabrin S, Gilmore B, Lawley CT, Gasic K, Micheletti D, Rosyara UR, Cattonaro F, Vendramin E, Main D, Aramini V, Blas AL, Mockler TC, Bryant DW, Wilhelm L, Troggio M, Sosinski B, Aranzana MJ, Arús P, Iezzoni A, Morgante M, Peace C (2012) Development and evaluation of a 9K SNP array for peach by internationally coordinated SNP detection and validation in breeding germplasm. PLoS One 7:e35668
Wang C, Xu G, Jiang X, Chen G, Wu J, Wu H, Zhang S (2009) S-RNase triggers mitochondrial alteration and DNA degradation in the incompatible pollen tube of Pyrus pyrifolia in vitro. Plant J 57:220–229
Wang K, Li M, Hakonarson H (2010) ANNOVAR: functional annotation of genetic variants from high-throughput sequencing data. Nucleic Acids Res 38:e164
Xin MM, Wang Y, Yao YY, Song N, Hu ZR, Qin DD, Xie CJ, Peng HR, Ni ZF, Sun QX (2011) Identification and characterization of wheat long non-protein coding RNAs responsive to powdery mildew infection and heat stress by using microarray analysis and SBS sequencing. BMC Plant Biol 11:61–73
Yamamoto T, Mochida K, Imai T, Shi IZ, Ogiwara I, Hayashi T (2002) Microsatellite markers in peach [Prunus persica (L.) Batsch] derived from an enriched genomic and cDNA libraries. Mol Ecol Notes 2:298–302
Zhang Q, Ma B, Li H, Chang Y, Han Y, Li J, Wei G, Zhao S, Khan MA, Zhou Y, Gu C, Zhang X, Han Z, Korban SS, Li S, Han Y (2012) Identification, characterization, and utilization of genome-wide simple sequence repeats to identify a QTL for acidity in apple. BMC Genomics 13:537
Zhebentyayeva T, Swire-Clark G, Georgi L, Garay L, Jung S, Forrest S, Blenda A, Blackmon B, Mook J, Horn R, Howad W, Arús P, Main D, Tomkins J, Sosinski B, Baird W, Reighard G, Abbott A (2008) A framework physical map for peach, a model Rosaceae species. Tree Genet Genomes 4:745–756
