Comparing and Optimizing Transcriptome Assembly

Comparing and Optimizing Transcriptome Assembly Pipeline for Diploid Wheat

Natasha Pavlovikj1, Kevin Begcy2, Sairam Behera1, Malachy Campbell2, Harkamal Walia2, Jitender S. Deogun1

1Department of Computer Science and Engineering, 2Department of Agronomy and Horticulture, University of Nebraska-Lincoln

Email: [email protected]

ABSTRACT
Gene expression and transcriptome analysis are currently a main research focus for many scientists. However, assembling raw sequence data into a draft transcriptome of an organism is a complex multi-stage process, usually composed of pre-processing, assembly, and post-processing. Each of these stages includes multiple steps, such as data cleaning, contaminant removal, error correction, and assembly validation. Implementing all these steps requires extensive knowledge of different algorithms and of various bioinformatics tools and software. In this paper, we generate multiple transcriptome assembly pipelines by using different tools and approaches in the process. Analyzing these pipelines, we observe that using the error correction method with Velvet Oases and merging the individual k-mer assemblies with the highest N50 produces the most stable base for further biological analysis of the transcriptome.

Categories and Subject Descriptors
D.0 [Software]: General; J.3 [Computer Applications]: Life and Medical Sciences – Biology and genetics.

General Terms
Design, Experimentation.

Permission to make digital or hard copies of part or all of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for third-party components of this work must be honored. For all other uses, contact the Owner/Author. Copyright is held by the owner/author(s). BCB '14, September 20-23, 2014, Newport Beach, CA, USA. ACM 978-1-4503-2894-4/14/09. http://dx.doi.org/10.1145/2649387.2662450

1. EXPERIMENTAL RESULTS
In this paper, our objective is to develop and optimize an assembly pipeline for the diploid wheat transcriptome. To that end, we generated multiple transcriptome assembly pipelines by using different tools and approaches in the process. First, we investigate the importance of using the digital normalization algorithm, the error correction method, or a combination of both in the pre-processing stage of the assembly pipeline. The datasets generated by these three approaches are then passed to two different de novo assembly tools: Velvet Oases [4] and SOAPdenovo-Trans [5]. The outputs of the assembly pipeline are analyzed using different metrics. The importance of the k value is also investigated by comparing the final assembly pipeline outputs when 5 k values (k=45, 51, 55, 61, 63), 8 k values (k=21, 31, 41, 51, 61, 71, 81, 91), and 10 k values (k=21, 25, 31, 35, 41, 45, 51, 55, 61, 63) are grouped in the merging step. With three pre-processing approaches and three groups of k values, we have 9 different assembly pipelines for Velvet Oases and, similarly, 9 different assembly pipelines for SOAPdenovo-Trans, all of which are investigated further.

For our experiments, we used the diploid wheat Triticum urartu dataset to generate the transcriptome assembly. The NCBI BioProject PRJNA191053 [1] contains all sequence libraries submitted by the UCD group. The sequencing was performed on the Illumina HiSeq platform with a 100 bp paired-end protocol. The total number of raw Illumina paired-end reads for T. urartu is 248 million.

The first step of the pre-processing stage is to remove the Illumina adapters. For that purpose, Scythe [7] was used. The removal of adapters changes the length of the reads, but not their number. After Scythe, Sickle [6] is used for trimming poor quality bases. If only one read of a pair passes trimming, it is saved in a separate output file. After Scythe and Sickle, the total number of reads is reduced by 1,938,921.

For the purpose of this paper, we use the digital normalization algorithm (Khmer), the error correction method (Seecer), and a combination of both (Khmer Seecer) in order to see how their usage influences the overall quality of the assembly. To compare the performance of these approaches, we use the following metrics: the total number of reads, the average read length, and N50. From the results generated when Khmer, Seecer, and Khmer Seecer are used, we observe a huge difference in the total number of reads after Khmer and after Seecer. When Khmer is used, the total number of reads is reduced to 57,359,540, while when Seecer is used, the total number of corrected reads is 246,593,875. The digital normalization algorithm removes redundant reads and errors, and also evens out the coverage. Because it discards reads from regions that are already well covered, digital normalization significantly reduces the size of the dataset.
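To make the difference in read counts concrete, the following is a minimal sketch of the idea behind digital normalization: a read is kept only if the median abundance of its k-mers, counted over the reads kept so far, is below a coverage cutoff. It is a toy illustration under stated assumptions, not the Khmer implementation. The function names, the exact dictionary-based counting, and the values k=4 and cutoff=2 are chosen here purely for illustration.

```python
from collections import Counter
from statistics import median


def kmers(read, k):
    """Return all overlapping k-mers of a read, in order."""
    return [read[i:i + k] for i in range(len(read) - k + 1)]


def normalize(reads, k=4, cutoff=2):
    """Toy digital normalization (illustrative only, not the Khmer code).

    A read is kept when the median count of its k-mers among the reads
    kept so far is below `cutoff`; only kept reads update the counts.
    """
    counts = Counter()
    kept = []
    for read in reads:
        read_kmers = kmers(read, k)
        if not read_kmers:
            continue  # read is shorter than k, nothing to count
        if median(counts[km] for km in read_kmers) < cutoff:
            kept.append(read)
            counts.update(read_kmers)
    return kept


# Hypothetical input: five copies of one read plus one unrelated read.
reads = ["ACGTTGCAAG"] * 5 + ["TTGACCAGTA"]
kept = normalize(reads)
print(len(reads), "reads in,", len(kept), "reads kept")
```

In this toy input the redundant copies beyond the cutoff are dropped while the unrelated read survives, which mirrors, at a much smaller scale, why the read count falls sharply after normalization but not after error correction.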
The error correction method, in contrast, removes errors from the raw data without reducing the total number of reads; only the read length changes. When the combined approach of both Khmer and Seecer is used, we notice that the total number of reads is the same as when only Khmer is used. Therefore, we can say that the error correction method is just a part of the more general digital normalization approach. Comparing these three datasets, we observe that the biggest difference is the number of final reads after Khmer and after Seecer. The other compared metrics, the average read length and N50, are almost the same for the three approaches.

After the pre-processing stage, the next step of the assembly pipeline is the de novo assembly. We use both Velvet Oases and SOAPdenovo-Trans, constructing 13 individual assemblies with each, one for each k value of 21, 25, 31, 35, 41, 45, 51, 55, 61, 63, 71, 81, and 91. Although the multiple-k method improves the transcript diversity of the assembly, there is no established way to determine the range of k values that produces the best assembly. Therefore, for each de novo assembler we investigated the quality of the assembly when groups of 5, 8, and 10 different k values were merged together. The 5 k values are chosen because the individual assemblies for these k values have the highest N50. The 8 k values span the whole range from 21 to 91, while the 10 k values are based on the work of Krasileva et al. [2].

Comparing the results for Velvet Oases from the three input datasets generated from Khmer, Seecer, and Khmer Seecer, we observe that the assemblies obtained with Khmer Seecer have slightly better metrics than the assemblies obtained with Khmer or Seecer alone. On the other hand, when SOAPdenovo-Trans is used, the best N50 value occurs for Khmer, although the metrics generated from Seecer and Khmer Seecer do not differ much. Better performance is observed when the group of 5 k values is used for Velvet Oases and the group of 8 k values is used for SOAPdenovo-Trans.

To test the overall quality of the assembly pipeline, we aligned the resulting transcripts against 19,200 sequences from the full-length common wheat cDNA dataset from TriFLDB [3], which has an average length of 1,652.9 bp (see Table 1). For better understanding of the results, we denote the assemblies generated from Khmer, Seecer, and Khmer Seecer with K-Xk, S-Xk, and KS-Xk, respectively, where X is the number of k values in the group (X=5, 8, 10). We observe that a better alignment rate occurs when Velvet Oases is used as the de novo assembler, for all datasets. As stated above, the dataset generated from Seecer did not have a large impact on the metrics evaluated. However, the best alignment rates are achieved when the Seecer dataset is used.
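The naming scheme can be enumerated mechanically. The sketch below combines the three pre-processing datasets with the three groups of k values listed earlier, yielding the 9 labels used for each assembler. The variable and function names are illustrative and not part of any tool used in this work.

```python
from itertools import product

# Dataset prefixes as used in the text.
DATASETS = {"K": "Khmer", "KS": "Khmer Seecer", "S": "Seecer"}

# Groups of k values merged in each pipeline, as listed in the text.
K_GROUPS = {
    5: [45, 51, 55, 61, 63],
    8: [21, 31, 41, 51, 61, 71, 81, 91],
    10: [21, 25, 31, 35, 41, 45, 51, 55, 61, 63],
}

ASSEMBLERS = ["Velvet Oases", "SOAPdenovo-Trans"]


def pipeline_labels():
    """Return labels such as 'K-5k' for every dataset/group combination."""
    return [f"{prefix}-{size}k" for prefix, size in product(DATASETS, K_GROUPS)]


labels = pipeline_labels()
for assembler in ASSEMBLERS:
    print(f"{assembler}: {len(labels)} pipelines -> {', '.join(labels)}")

# The union of the three groups gives the individual k values assembled.
all_k = sorted(set().union(*K_GROUPS.values()))
print(len(all_k), "individual k values:", all_k)
```

The union of the three groups recovers the 13 individual k values for which assemblies were constructed.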
For Velvet Oases, the best assembly is produced when the 5 k assemblies with the highest N50 are used in the merging process. For SOAPdenovo-Trans, slightly better results are observed when the group of 8 k values is used. However, using all the assemblies for various k values did not improve the assembly quality. Therefore, further investigation is needed to ascertain which k values produce the best assembly quality.
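Because N50 drives the choice of the 5 k values, a short sketch of how it is commonly computed may help. N50 is the length of the shortest contig in the smallest set of longest contigs that together cover at least half of the total assembly length. The contig lengths and per-k N50 values below are made-up placeholders, not results from this study, and the function names are illustrative.

```python
def n50(lengths):
    """Return the N50 of a collection of contig lengths."""
    total = sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if 2 * running >= total:  # at least half of the assembly is covered
            return length
    raise ValueError("n50() requires at least one contig length")


def top_k_by_n50(n50_by_k, how_many):
    """Return the `how_many` k values with the highest N50, in ascending order."""
    ranked = sorted(n50_by_k, key=n50_by_k.get, reverse=True)
    return sorted(ranked[:how_many])


# Hypothetical contig lengths (bp) for a single assembly.
contigs = [100, 200, 300, 400, 500]
print("N50:", n50(contigs))

# Hypothetical N50 per k value; placeholders only.
example_n50_by_k = {21: 900, 31: 1100, 41: 1300, 51: 1500, 61: 1400}
print("k values to merge:", top_k_by_n50(example_n50_by_k, 3))
```

Selecting k values this way rewards contiguity only, so a high N50 does not by itself guarantee a better alignment rate.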

2. CONCLUSION
In this paper, we developed a bioinformatics assembly pipeline and analyzed different tools used for the different steps of the pipeline. Analyzing the 9 different assemblies generated for Velvet Oases and the 9 different assemblies generated for SOAPdenovo-Trans, we observe that using the error correction method with Velvet Oases and merging the individual k-mer assemblies with the highest N50 produces the most stable base for further biological analysis of the transcriptome. Developing a multi-stage assembly pipeline is a crucial part of generating an accurate and meaningful transcriptome assembly. Using error correction methods, as well as merging assemblies with different k values that have good metrics, can improve the overall quality of a transcriptome assembly.

3. ACKNOWLEDGMENTS
This work was supported in part by a grant from UNL-Wheat, Wheat Products Research Program and was completed utilizing the Holland Computing Center of the University of Nebraska.

4. REFERENCES
[1] http://www.ncbi.nlm.nih.gov/bioproject/?term=PRJNA191053
[2] Ksenia V. Krasileva et al, “Separating homeologs by phasing in the tetraploid wheat transcriptome”, Genome Biology 2013, 14:R66, doi:10.1186/gb-2013-14-6-r66
[3] Mochida K., Yoshida T., Sakurai T., Ogihara Y., Shinozaki K., “TriFLDB: a database of clustered full-length coding sequences from Triticeae with applications to comparative grass genomics”, Plant Physiol 2009, 150:1135-1146
[4] Marcel H. Schulz, Daniel R. Zerbino, Ewan Birney et al, “Oases: robust de novo RNA-seq assembly across the dynamic range of expression levels”
[5] Yinlong X et al, “SOAPdenovo-Trans: De novo transcriptome assembly with short RNA-Seq reads”, arXiv:1305.6760 [q-bio.GN]
[6] https://github.com/najoshi/sickle
[7] https://github.com/vsbuffalo/scythe

Table 1. Statistics when transcripts are aligned with the full-length common wheat cDNA dataset from TriFLDB

Velvet Oases

| Dataset | total # of reads | unaligned reads | # of aligned reads=1 | # of aligned reads>1 | overall alignment rate |
|---|---|---|---|---|---|
| K-5k | 210,463 | 37.45% | 14.98% | 47.58% | 62.55% |
| K-8k | 409,262 | 37.21% | 15.36% | 47.42% | 62.79% |
| K-10k | 377,440 | 42.49% | 16.07% | 41.45% | 57.51% |
| KS-5k | 206,450 | 37.84% | 15.52% | 46.64% | 62.16% |
| KS-8k | 399,845 | 37.23% | 15.35% | 47.42% | 62.77% |
| KS-10k | 367,166 | 42.93% | 16.17% | 40.90% | 57.07% |
| S-5k | 356,696 | 32.15% | 10.96% | 56.89% | 67.85% |
| S-8k | 466,964 | 33.78% | 11.60% | 54.62% | 66.22% |
| S-10k | 441,723 | 38.79% | 13.37% | 47.83% | 61.21% |

SOAPdenovo-Trans

| Dataset | total # of reads | unaligned reads | # of aligned reads=1 | # of aligned reads>1 | overall alignment rate |
|---|---|---|---|---|---|
| K-5k | 108,464 | 56.52% | 18.47% | 25.01% | 43.48% |
| K-8k | 130,594 | 56.48% | 18.32% | 25.20% | 43.52% |
| K-10k | 130,328 | 59.70% | 17.13% | 23.18% | 40.30% |
| KS-5k | 106,096 | 56.45% | 18.55% | 24.99% | 43.55% |
| KS-8k | 128,455 | 56.72% | 18.43% | 24.86% | 43.28% |
| KS-10k | 128,432 | 59.78% | 17.11% | 23.12% | 40.22% |
| S-5k | 111,523 | 55.61% | 18.56% | 25.83% | 44.39% |
| S-8k | 135,806 | 55.16% | 18.56% | 26.27% | 44.84% |
| S-10k | 134,221 | 58.95% | 17.13% | 23.92% | 41.05% |

Legend: In the first column, the notation K, KS, and S denotes the Khmer, Khmer Seecer, and Seecer datasets, respectively.
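As a quick consistency check on Table 1, the overall alignment rate should equal the sum of the two aligned categories. The sketch below re-derives it for the Velvet Oases rows, with values copied from the table, and reports the dataset with the highest overall rate. The variable names are illustrative, and the tolerance of 0.02 percentage points is an assumption made here to absorb rounding in the published figures.

```python
# Velvet Oases rows from Table 1:
# label -> (unaligned %, aligned exactly once %, aligned more than once %, overall %)
VELVET_OASES = {
    "K-5k": (37.45, 14.98, 47.58, 62.55),
    "K-8k": (37.21, 15.36, 47.42, 62.79),
    "K-10k": (42.49, 16.07, 41.45, 57.51),
    "KS-5k": (37.84, 15.52, 46.64, 62.16),
    "KS-8k": (37.23, 15.35, 47.42, 62.77),
    "KS-10k": (42.93, 16.17, 40.90, 57.07),
    "S-5k": (32.15, 10.96, 56.89, 67.85),
    "S-8k": (33.78, 11.60, 54.62, 66.22),
    "S-10k": (38.79, 13.37, 47.83, 61.21),
}

TOLERANCE = 0.02  # assumed allowance for rounding, in percentage points


def is_consistent(row, tolerance=TOLERANCE):
    """Check that aligned-once plus aligned-many matches the overall rate."""
    _unaligned, once, many, overall = row
    return abs((once + many) - overall) <= tolerance


inconsistent = [label for label, row in VELVET_OASES.items() if not is_consistent(row)]
best = max(VELVET_OASES, key=lambda label: VELVET_OASES[label][3])

print("rows failing the check:", inconsistent)
print("highest overall alignment rate:", best, VELVET_OASES[best][3])
```

The same check can be applied to the SOAPdenovo-Trans rows by swapping in their values.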
