Tree Genetics & Genomes (2015) 11:82 DOI 10.1007/s11295-015-0907-5

SHORT COMMUNICATION

Exploiting genome variation to improve next-generation sequencing data analysis and genome editing efficiency in Populus tremula×alba 717-1B4

Liang-Jiao Xue 1,2,3 & Magdy S. Alabady 4 & Mohammad Mohebbi 5 & Chung-Jui Tsai 1,2,3

Received: 26 April 2015 / Accepted: 7 July 2015 © Springer-Verlag Berlin Heidelberg 2015

Abstract

Populus species are widely distributed across the Northern Hemisphere. The genetic diversity makes the genus an ideal study system for traits of ecological or agronomic significance. However, sequence variation between the genome-sequenced Populus trichocarpa Nisqually-1 and many other Populus species and hybrids poses significant challenges for research that employs sequence-sensitive approaches, such as next-generation sequencing and site-specific genome editing. Using the routinely transformed genotype Populus tremula×alba 717-1B4 as a test case, we utilized established variant-calling pipelines with affordable re-sequencing (~20×) and publicly available transcriptome data to generate a variant-substituted custom genome (sPta717). The sPta717 genome harbors over 10 million SNPs or small indels relative to the P. trichocarpa v3 reference genome. When applied to RNA-Seq analysis, the fraction of uniquely mapped reads increased by 13–28 % relative to that obtained with the P. trichocarpa reference genome, depending on read length and sequence type. The enhanced mapping rates enabled detection of several hundred more expressed genes and improved the differential expression analysis. Similar improvements were observed for DNA-Seq and ChIP-Seq data mapping. The sPta717 genome is also instrumental in guide RNA (gRNA) design for CRISPR-mediated genome editing. We showed that a majority of gRNAs designed from the P. trichocarpa reference genome contain mismatches with the corresponding target sequences of sPta717, likely rendering those gRNAs ineffective in transgenic 717. A website is provided for querying the sPta717 genome by gene model or homology search. The same approach should be applicable to other outcrossing species with a closely related reference genome.

Keywords Re-sequencing . SNP . Substituted genome . RNA-Seq . CRISPR

Communicated by A. Brunner

This article is part of the Topical Collection on Genome Biology

Electronic supplementary material The online version of this article (doi:10.1007/s11295-015-0907-5) contains supplementary material, which is available to authorized users.

* Chung-Jui Tsai [email protected]

1 School of Forestry and Natural Resources, University of Georgia, Athens, GA, USA

2 Department of Genetics, University of Georgia, Athens, GA, USA

3 Institute of Bioinformatics, University of Georgia, Athens, GA, USA

4 Department of Plant Biology, University of Georgia, Athens, GA, USA

5 Department of Computer Science, University of Georgia, Athens, GA, USA

Introduction

Next-generation sequencing (NGS) technologies are widely used in many aspects of functional genomic research, including whole-genome sequencing, re-sequencing, epigenetic modifications, transcript profiling, and high-throughput genotyping (Morozova and Marra 2008; Varshney et al. 2009). These technologies enable characterization of gene expression, DNA methylation, and genomic variations at the single nucleotide level. Large-scale efforts such as the 1000 Genomes Project (The 1000 Genome Project Consortium 2012) and the ENCODE (ENCyclopedia Of DNA Elements) Project (ENCODE Project Consortium 2004) for human, the 1001 Genomes Project for Arabidopsis thaliana (Weigel and Mott 2009), and the 1000 Plants Initiative (Wickett et al. 2014) have yielded unprecedented resources and analytical methods to enhance our understanding of complex traits ranging from human health to agricultural productivity.

As the third plant species to have been sequenced, Populus is a model system for genomics of woody perennials (Tuskan et al. 2006). However, genetic diversity is high within the genus (six sections) and at the species or individual levels, owing to its outcrossing nature. This presents a challenge for functional genomics research, especially when the target species deviates from the genome-sequenced Populus trichocarpa Nisqually-1 (section Tacamahaca). For instance, hybrid aspen Populus tremula×alba 717-1B4 (717, section Populus) is arguably the most widely used model in transgenic experiments due to its ease of transformation and rapid growth (Leple et al. 1992). The genome variation between the two species poses significant challenges for both wet-lab and dry-lab investigations that depend on high sequence specificity. Examples include quantitative PCR, microarray-based expression or genotyping assays, and NGS data analysis. The powerful clustered regularly interspaced short palindromic repeats (CRISPR) system for genome editing is also highly sequence-specific (Jinek et al. 2012).

To begin to address these challenges, we generated more than 94 million paired-end 100 bp (PE-100) shotgun sequencing reads from 717 (~18 Gb). However, de novo assembly of a highly heterologous genome remains impractical due to the lack of large-insert libraries and limitation of current methods. Reference-guided assembly was not satisfactory as only half of the reads can be mapped to the P. trichocarpa genome (~20× coverage). Instead, we applied established variant-calling pipelines to process genomic, transcriptomic, and publicly available mRNA sequences from 717 and identified over 10 million SNPs and small indels. By substituting sequence variations into the reference genome, we generated a custom P. tremula×alba 717 genome referred to as sPta717. When used in RNA-Seq, DNA-Seq, and ChIP-Seq analysis, the sPta717 genome showed better performance for read alignment than the reference genome. Both the sPta717 genome and the collection of sequence variants were integrated into a web-based search engine to facilitate other applications, such as primer or CRISPR guide RNA (gRNA) designs that are sensitive to sequence mismatches.
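The sequencing figures quoted above (about 94 million read pairs of 100 bp, ~18 Gb, ~20× mapped coverage) can be cross-checked with simple arithmetic. The sketch below is illustrative only: the reference size of roughly 420 Mb is an assumption made for this example and is not stated in the article, and the function names are hypothetical.

```python
def sequenced_bases(read_pairs: int, read_length: int) -> int:
    """Total bases in a paired-end run: two reads per pair."""
    return read_pairs * 2 * read_length


def mean_coverage(total_bases: float, genome_size: float,
                  mapped_fraction: float = 1.0) -> float:
    """Average depth contributed by the mapped share of the bases."""
    return total_bases * mapped_fraction / genome_size


# Values taken from the text.
READ_PAIRS = 94_000_000      # "more than 94 million" PE-100 read pairs
READ_LENGTH = 100            # bp
MAPPED_FRACTION = 0.5        # "only half of the reads can be mapped"

# ASSUMPTION: approximate size of the reference assembly, not given in the article.
ASSUMED_REFERENCE_SIZE = 420_000_000

total = sequenced_bases(READ_PAIRS, READ_LENGTH)
coverage = mean_coverage(total, ASSUMED_REFERENCE_SIZE, MAPPED_FRACTION)
print(f"{total / 1e9:.1f} Gb sequenced, roughly {coverage:.0f}x mapped coverage")
```

Under these assumptions the result lands in the low twenties, in line with the ~20× figure; a different assumed genome size would shift the estimate proportionally.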

Materials and methods

717 re-sequencing

Genomic DNA was extracted from young leaves of P. tremula×alba (717-1B4) using the DNeasy Plant Mini Kit (Qiagen, Valencia, CA) and fragmented on a Covaris (Woburn, MA) for standard library preparation. To enrich for genic sequences, genomic DNA was digested with the methyl-sensitive RsrII (NEB, Ipswich, MA) or subjected to cDNA-primed amplification using Phi29 DNA polymerase (NEB). Equal amounts of total RNA extracted from various tissues of wild-type and transgenic 717 (Babst et al. 2014; Frost et al. 2012; Xue et al. 2013) with or without stress treatments (Supplemental 1: Table S1) were pooled for mRNA isolation using the Oligotex mRNA mini kit (Qiagen) and fragmented using the NEBNext Magnesium RNA Fragmentation Module (NEB). First-strand cDNA synthesis was performed using SuperScript® II reverse transcriptase and random hexamers (Invitrogen, Carlsbad, CA) and purified using the Microcon® DNA Fast Flow Filter (EMD Millipore, Billerica, MA) for use in Phi29 genome amplification. Illumina-compatible libraries were prepared using the PrepX ILM DNA library kit on the Apollo 324 automated system (IntegenX, Pleasanton, CA). The libraries were pooled for PE100 sequencing on an Illumina HiSeq-2500 at the Georgia Regents University. All sequence reads were submitted to NCBI Sequence Read Archive (SRA) under accession no. SRP049825 (Supplemental 1: Table S2).

Identification of genomic variants using DNA and RNA data

Multiple variant-calling programs, including SHORE v0.9.3 (Ossowski et al. 2008), GATK v3.2.2 (DePristo et al. 2011), and SAMtools v0.1.19.0 (Li et al. 2009), were used to process 717 sequences against the P. trichocarpa reference genome (v3.0, http://phytozome.jgi.doe.gov). Parameters for SHORE were the same as described (Schmitz et al. 2013), except that both "consensus" and "qVar" methods were performed using two alignment settings ("15 % -g 9 %" and "-n 10 % -g 6 %") with Burrows-Wheeler Aligner (BWA) v0.7.10 (Voelker et al. 2010). GATK analysis followed the "Best Practice" recommended by the developer using BWA-MEM v0.7.10 for read alignment. The indels and SNPs obtained from SHORE were used for 'Base Quality Score Recalibration' in GATK, and both UnifiedGenotyper and HaplotypeCaller methods were applied for variant calling.

Table 1 Distribution of SNPs and indels across genomic features

Feature                       SNPs                    INDELs
                              No.           %         No.       %
Genic                         5,426,701     52.9      55,738    76.6
  Exon                        1,067,936     10.4       3,246     4.5
  Intron                      1,969,844     19.2      30,627    42.1
  Upstream 1-500                510,016      5.0       6,183     8.5
  Upstream 501-1000             390,262      3.8       2,649     3.6
  Upstream 1001-1500            359,440      3.5       1,780     2.5
  Downstream 1-500              466,570      4.6       6,815     9.4
  Downstream 501-1000           357,119      3.5       2,745     3.8
  Downstream 1001-1500          305,514      3.0       1,693     2.3
Intergenic                    4,834,490     47.1      16,985    23.4
Total                        10,261,191    100.0      72,723   100.0

SNPs/indels were also identified using transcripts assembled from in-house RNA-Seq data, which included more than 600 million read pairs (SRA accession nos. SRP041959, SRP042117, and SRP059838, Supplemental 1: Table S2). De novo assembly was performed using Trinity r20140717 (Grabherr et al. 2011) with default settings, and transcripts ≥200 bp were mapped onto the P. trichocarpa reference genome for variant calling using SAMtools. mRNA sequences of 717 downloaded from NCBI (July 16th 2014) were also included in the analysis. The variants supported by at least two of the five NGS datasets (two methods each from SHORE and GATK on DNA-Seq data plus RNA-Seq data), along with those identified from the NCBI mRNA dataset, were deemed high quality. The variant-substituted 717 genome sequences were then generated using VCFtools v0.1.12b (Danecek et al. 2011). Gene annotation information was converted from the P. trichocarpa reference genome using liftOver (https://genome.ucsc.edu/util.html). The sPta717 v1.1 genome, as well as gene model, transcript and promoter sequences are available for download at AspenDB (http://aspendb.uga.edu/s717).
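The selection rule described above (keep a variant if at least two of the five NGS call sets report it, and also keep variants found in the mRNA data) and the subsequent substitution into the reference can be sketched in plain Python. This is a minimal illustration, not the VCFtools procedure used in the study: the function names and the tuple layout are hypothetical, zygosity is ignored, and variants are assumed not to overlap.

```python
from collections import Counter

# A variant is keyed as (chromosome, 1-based position, ref allele, alt allele).


def high_quality_variants(ngs_callsets, mrna_variants, min_support=2):
    """Keep variants seen in >= min_support NGS call sets, plus all mRNA variants."""
    support = Counter()
    for callset in ngs_callsets:
        support.update(set(callset))  # each call set votes once per variant
    consensus = {v for v, n in support.items() if n >= min_support}
    return consensus | set(mrna_variants)


def substitute(sequence, variants):
    """Apply (position, ref, alt) variants to one sequence.

    Edits are applied from right to left so that indels never shift the
    coordinates of variants that are still waiting to be applied.
    """
    for pos, ref, alt in sorted(variants, reverse=True):
        start = pos - 1
        if sequence[start:start + len(ref)] != ref:
            raise ValueError(f"reference allele mismatch at position {pos}")
        sequence = sequence[:start] + alt + sequence[start + len(ref):]
    return sequence


# Toy data: five NGS call sets and one mRNA-derived set (all hypothetical).
callsets = [
    {("Chr01", 2, "C", "T"), ("Chr01", 5, "A", "AGG")},
    {("Chr01", 2, "C", "T")},
    {("Chr01", 5, "A", "AGG"), ("Chr01", 7, "GT", "G")},
    set(),
    {("Chr01", 3, "G", "A")},
]
mrna = {("Chr01", 7, "GT", "G")}

selected = high_quality_variants(callsets, mrna)
custom = substitute("ACGTACGT", [(pos, ref, alt) for _, pos, ref, alt in selected])
print(sorted(selected))
print(custom)
```

In this toy case the SNP at position 3 is dropped because only one call set supports it, while the deletion at position 7 is retained through the mRNA evidence. Because insertions and deletions change sequence length, annotation coordinates have to be converted afterwards, which is the role liftOver plays in the text.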

NGS read mapping

Leaf, bark, and xylem RNA-Seq data of wild-type 717 plants from two stress experiments in our lab were used to evaluate mapping performance. These samples contain three biological replicates, each with 10–15 million paired-end 100 bp (PE100, SRA accession no. SRP059838) or 50 bp (PE50, SRA accession no. SRP041959) reads. For SE100 or SE50 datasets, only read 1 from the paired-end data was analyzed. RNA-Seq reads were processed to remove adapter and rRNA sequences and mapped to either genome by TopHat2 v2.0.12 (Kim et al. 2013) with default settings, except for the maximum length of introns (10,000 bp) and maximum number of mismatches (two per 50-bp read). Leaf RNA-Seq samples from well-watered and drought-stressed 717 plants (SRA accession no. SRP041959) were used for differential expression (DE) analysis. The alignment output from TopHat2 was processed by HTSeq v0.6.1p1 (Anders et al. 2015), and uniquely mapped reads were analyzed by DESeq2 v1.6.2 (Love et al. 2014) using SLIM (Wang et al. 2011) for multiple testing corrections. Only genes with FPKM ≥1 in all three biological replicates were included in the analysis. Sequence alignment to either genome was visualized using IGV v2.3.40 (Robinson et al. 2011).

DNA read mapping performance was evaluated using 717 re-sequencing data (SRA accession no. SRP049825), as well as the genomic input sequences and ChIP-Seq data (SRA accession no. SRP028935) described by Liu et al. (2014). The original PE100 sequences from 717 were trimmed to generate PE50, SE100, and SE50 datasets. Both DNA-Seq and ChIP-Seq reads were mapped onto either genome using Bowtie2 v2.2.3.0 (Langmead and Salzberg 2012).

Affymetrix probe re-annotation and gRNA quality assessment

Probes of the Affymetrix Poplar Genome Array were mapped onto the sPta717 genome using Bowtie2. Custom Perl scripts were used for probe annotation following the NetAffx procedures (Liu et al. 2003). gRNAs were designed according to the method of Lei et al. (2014) for 1000 randomly selected genes extracted from the P. trichocarpa or sPta717 genome. The resulting gRNA sequences were mapped to the respective genome using BatMis (Tennakoon et al. 2012) to evaluate specificity according to Hsu et al. (2013). Up to 10 exonic gRNAs per gene with a score of ≥0.5 were retained for further analysis. The gRNA sequences were searched against the compiled genomic variants of 717 to determine the frequencies of SNPs and indels.

Results and discussion

Construction of a variant-substituted sPta717 genome

Multiple data resources were used for identification of genomic variants in 717, including re-sequencing and RNA-Seq data, and GenBank mRNA sequences. The SHORE (Ossowski et al. 2008) and GATK (DePristo et al. 2011) pipelines were applied for genomic sequence analysis, each with two variant calling methods. Between 10 and 11 million variants were identified by these methods, except for HaplotypeCaller in GATK, which returned a much lower number of variants (1.1 million) (Supplemental 1: Table S3). Trinity-assembled RNA-Seq data were processed by SAMtools (Li et al. 2009) to generate 1.3 million variants. The GenBank 717 mRNA sequences were also processed by SAMtools and about 11,000 variants were detected. Aggregating these discoveries resulted in 10,333,182 variants that were detected in at least two of the five deep-sequencing datasets, with an additional 732 variants found only from the mRNA data. Among the total 10,333,914 variants, only 72,723 represent indels (Table 1). The relatively low frequency of indels is consistent with other published studies (Evans et al. 2014; Kelleher et al. 2007).

Analysis of the variants across different genome features showed a near-even distribution of SNPs between genic and intergenic regions, with a slight preference toward the former (~53 %) (Table 1). Within the genic region, more SNPs were detected in introns (19 %) than in exons (10 %), as would be expected due to selective constraints in coding sequences. The SNP frequencies were similar in upstream (12 %) and downstream (11 %) regions and decreased with distance (Table 1). Indels, on the other hand, were more than three times as frequently detected in genic regions (77 %) as in intergenic regions (23 %) (Table 1). The vast majority of genic indels were located in noncoding sequences, i.e., introns (42 % of total indels or 55 % of genic indels) and flanking upstream and downstream sequences (30 % of total), and less than 5 % were from exons (Table 1). This is consistent with strong selection against indels in exons that can potentially disrupt reading frames. It should be mentioned that the indel frequency in intergenic regions was likely under-estimated due to technical challenges in indel calling associated with greater degrees of non-coding sequence variation between species.

Developing a personal/individual genome incorporating known sequence variants was deemed a necessary first step for analysis of functional genomics data derived from the 1000 Genomes Project (Rozowsky et al. 2011) and the 1001 Genomes Project for A. thaliana (Schmitz et al. 2013). Following a similar approach, we constructed a custom 717 genome by substituting the aggregated set of SNPs and indels (Supplemental 1: Table S3) into the P. trichocarpa reference genome v3 (Ptr_v3) and updated the genomic coordinates of all annotated structural features. We refer to this variant-substituted genome for the P. tremula×alba clone 717 as the "sPta717" genome.

Fig. 1 Comparisons of leaf RNA-Seq data analysis using the variant-substituted P. tremula×alba 717 (sPta717) genome or the P. trichocarpa (Ptr_v3) reference genome. a Transcript abundance. Genes with significantly different FPKM values are highlighted in red (higher in sPta717) or blue (higher in Ptr_v3). b Transcriptional response to drought. Genes are color-coded if they were found to exhibit significant differences by either genome (black), by sPta717 only (red), by Ptr_v3 only (blue), or neither (gray). Significant difference threshold was Q ≤0.05 and fold change ≥2

Table 2 Mapping performance of RNA-Seq data against the reference and the sPta717 genomes

                          Leaf                 Bark                 Xylem
                          Ptr_v3    sPta717    Ptr_v3    sPta717    Ptr_v3    sPta717
SE50 (%)
  Uniquely mapped         65.6      81.1       65.3      80.3       69.7      83.1
  Multiply mapped         12.5       8.8       10.1       7.3        9.8       7.3
  Unmapped                21.9      10.1       24.6      12.4       20.5       9.6
SE100 (%)
  Uniquely mapped         41.1      68.0       41.0      68.0       46.7      73.0
  Multiply mapped          6.4       3.9        6.7       4.0        5.9       3.8
  Unmapped                52.5      28.1       52.3      28.0       47.4      23.2
PE50 (%)
  Concordant alignment    59.3      77.7       55.9      74.3       61.7      79.4
  Discordant alignment    18.5      11.7       18.8      12.6       17.3      10.4
  Unmapped                22.2      10.6       25.3      13.1       21.0      10.2
PE100 (%)
  Concordant alignment    30.5      58.4       33.6      60.6       37.5      65.7
  Discordant alignment    14.3      11.4       13.5      10.6       14.6      10.3
  Unmapped                55.2      30.2       52.9      28.8       47.9      24.0
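The mapping gains attributed to the sPta717 genome can be recomputed directly from the Table 2 values. The sketch below takes the uniquely mapped (single-end) or concordantly aligned (paired-end) fractions and reports the difference in percentage points for each tissue; treating concordant alignments as the paired-end counterpart of uniquely mapped reads is an assumption of this example, and the helper names are hypothetical.

```python
# Uniquely mapped (SE) or concordantly aligned (PE) read fractions from Table 2,
# stored as (Ptr_v3, sPta717) percentages per tissue.
TABLE2 = {
    "SE50": {"leaf": (65.6, 81.1), "bark": (65.3, 80.3), "xylem": (69.7, 83.1)},
    "SE100": {"leaf": (41.1, 68.0), "bark": (41.0, 68.0), "xylem": (46.7, 73.0)},
    "PE50": {"leaf": (59.3, 77.7), "bark": (55.9, 74.3), "xylem": (61.7, 79.4)},
    "PE100": {"leaf": (30.5, 58.4), "bark": (33.6, 60.6), "xylem": (37.5, 65.7)},
}


def gains(table):
    """Percentage-point gain of sPta717 over Ptr_v3 for every library and tissue."""
    return {
        library: {tissue: round(new - old, 1) for tissue, (old, new) in tissues.items()}
        for library, tissues in table.items()
    }


def gain_range(table, libraries):
    """Smallest and largest gain across the selected library types."""
    all_gains = gains(table)
    values = [g for library in libraries for g in all_gains[library].values()]
    return min(values), max(values)


print(gain_range(TABLE2, ["SE50", "PE50"]))
print(gain_range(TABLE2, ["SE100", "PE100"]))
```

With the values above, the 50-bp datasets gain between 13.4 and 18.4 points and the 100-bp datasets between 26.3 and 28.2 points, so the longer reads benefit most from the custom genome.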

Applications of the sPta717 genome to NGS data analysis

RNA-Seq data from leaf, bark, and xylem tissues of 717 were used to compare the alignment performance with the reference Ptr_v3 and the custom sPta717 genomes. As expected, shorter and/or single-end (SE) reads had overall higher mapping rates than the more discriminating longer and/or paired-end (PE) sequences (Table 2). Regardless of data type, the mapping rates were consistently higher with the sPta717 genome than with the reference genome across all three tissues. Uniquely mapped reads increased by 13–18 % for SE/PE50 datasets and by 26–28 % for SE/PE100 datasets, while ambiguously mapped reads decreased (Table 2). Across tissues, the fractions of unmapped reads decreased from 22–25 to 10–13 % for SE/PE50 datasets and from 52–55 to 28–30 % for SE/PE100 datasets. The results support an overall improvement of read mapping using the sPta717 genome. Across all data types and tissues, we consistently detected a greater number of genes (383–974) that passed an arbitrary detection threshold (FPKM ≥1) with the sPta717 genome than with the reference genome (Supplemental 1: Table S4). When the expression values from the two analyses were plotted, a trend toward higher FPKM values was evident using the sPta717 genome (Fig. 1a, Supplemental 2: Fig. S1a–b). The data suggested that improved read mapping with the sPta717 genome also translated into higher transcript abundance estimations across several orders of magnitude.

We then partitioned genes into three groups based on FPKM estimates with either genome: significantly higher with sPta717, significantly higher with Ptr_v3, or similar (significance based on Q ≤0.05 and fold-change ≥2). For all three tissues examined, a majority of the genes (61–65 %) with increased FPKM values by sPta717 have highly similar paralogs (≥90 % coding sequence identity) elsewhere in the genome, versus 31–39 % (Ptr_v3-higher) and 27–28 % (similar) such genes in the other groups. Using one leaf sample as a test case, we found that ~40 % of reads mapped to multiple locations of Ptr_v3 can be assigned to unique positions on sPta717. Together, these results suggest that using the custom genome enhanced mapping sensitivity and accuracy within the redundant gene space by improving discrimination between highly similar sequences while reducing erroneous alignments.

Fig. 2 Read alignment patterns for tandem duplicates Potri.019G075200 (left) and Potri.019G075300 (right) using all mapped reads (a) or uniquely mapped reads (b), with the sPta717 (black-boxed) or the Ptr_v3 (red-boxed) genomes. SNPs between RNA-Seq and genome sequences are shown as color ticks. Numbers of mapped fragments are indicated

Table 3 Mapping rates of DNA-Seq reads against the reference and the sPta717 genomes

                      Single-end            Paired-end
                      Ptr_v3    sPta717     Ptr_v3    sPta717
50 bp (%)
  Uniquely mapped     29.4      39.9        25.0      33.4
  Multiply mapped     22.7      24.1        13.7      14.7
  Unmapped            47.9      36.0        61.4      51.8
100 bp (%)
  Uniquely mapped     32.7      40.2        27.3      33.9
  Multiply mapped     22.6      23.9        14.8      15.7
  Unmapped            44.8      36.0        57.9      50.3

Table 4 Frequencies of sequence variants in gRNAs designed based on the reference and sPta717 genomes

                      Ptr_v3 (a)          sPta717 (b)
                      no.       %         no.       %
Total gRNAs (c)       9492      –         9750      –
Indels                  56      0.6        727      7.5
SNPs                  5396     56.8       3560     36.5
  PAM (NGG)            750     13.9        563     15.8
  Proximal 10-nt      2980     55.2       1942     54.6
  Distal 10-nt        1666     30.9       1055     29.6

a gRNAs were designed based on the Ptr_v3 reference genome and cross-checked against genomic variants of 717

b gRNAs were designed based on the sPta717 genome and cross-checked against genomic variants of 717

c Up to 10 gRNAs per gene (with a specificity score ≥0.5) from 1000 randomly selected genes were used for the analysis
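Table 4 groups SNP-containing gRNAs by where the variant falls: in the PAM, in the PAM-proximal 10 nt, or in the PAM-distal 10 nt. The sketch below shows one way such a breakdown could be computed for a single target. It assumes a 23-nt target written 5'→3' as a 20-nt protospacer followed by a 3-nt NGG PAM, and it counts all three PAM positions; the article does not spell out these conventions, so the layout, names, and example sequence are hypothetical.

```python
PROTOSPACER_LEN = 20
PAM_LEN = 3
TARGET_LEN = PROTOSPACER_LEN + PAM_LEN


def region_of_offset(offset):
    """Map a 0-based offset within the 23-nt target to a Table 4 style category."""
    if not 0 <= offset < TARGET_LEN:
        raise ValueError(f"offset {offset} is outside the {TARGET_LEN}-nt target")
    if offset < 10:
        return "distal 10-nt"      # far from the PAM
    if offset < PROTOSPACER_LEN:
        return "proximal 10-nt"    # adjacent to the PAM
    return "PAM"


def mismatch_regions(designed, actual):
    """Regions in which the designed target differs from the sequence in the plant."""
    if len(designed) != TARGET_LEN or len(actual) != TARGET_LEN:
        raise ValueError("both sequences must be 23 nt (protospacer + PAM)")
    return sorted(
        {region_of_offset(i) for i, (d, a) in enumerate(zip(designed, actual)) if d != a}
    )


# Hypothetical target designed from the reference, and the same locus in 717
# carrying one SNP at offset 15 (inside the PAM-proximal half).
designed = "ACGTACGTACGTACGTACGT" + "AGG"
in_717 = designed[:15] + "A" + designed[16:]
print(mismatch_regions(designed, in_717))
```

A gRNA whose comparison returns an empty list matches the edited genotype exactly; any non-empty result flags a design that may need to be redone against the custom genome.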

We extended the comparison to differential expression (DE) analysis using leaf data from a drought stress experiment as a test case. For this purpose, DE criteria were set at Q ≤0.05 and fold-change ≥2. Using the sPta717 genome, we detected 1468 DE genes in response to drought versus 1362 with the reference genome (Fig. 1b). While the vast majority of DE genes (1300) overlapped between the two analyses, 62 and 168 were unique to the Ptr_v3 and sPta717 datasets, respectively. Similar results were also observed for bark and xylem (Supplemental 2: Figure S1c–d). Some of the discrepancies can be explained by fluctuations of the FPKM estimates using two different genomes. In support of this idea, a majority of the Ptr_v3-only (83 %) and sPta717-only (78 %) DE genes were expressed at low levels (average FPKM
