Fastq_clean: an optimized pipeline to clean the Illumina sequencing data with quality control
Mi Zhang School of Computer Science and Technology Tianjin University Tianjin 300072, P.R.China
Feng Zhan Department of Computer Science Guangxi University Nanning, Guangxi 530004, P. R. CHINA
Honghe Sun Boyce Thompson Inst. for Plant Research Cornell University Ithaca, NY 14853, USA
Xiujun Gong School of Computer Science and Technology Tianjin University Tianjin 300072, P.R.China
Zhangjun Fei Boyce Thompson Inst. for Plant Research Cornell University Ithaca, NY 14853, USA
Shan Gao College of Life Science Nankai University Tianjin 300071, P. R. CHINA
Abstract—The usability of NGS technologies relies heavily on the accuracy of the data. Many research groups have developed programs to clean raw sequencing data by removing adapter contamination and trimming low-quality nucleotides. However, these programs are not optimized to process data from any specific equipment. In this study, we present Fastq_clean, an optimized pipeline to clean DNA-seq and RNA-seq data from the Illumina sequencer. Fastq_clean removes low-quality nucleotides and adapter contamination precisely while keeping as many of the qualified nucleotides as possible. It can process sequenced data in batches and export statistics for data quality control (QC) with a single command line. Compared with the two most widely used tools on a published dataset, Fastq_clean reached the best performance. Fastq_clean has already been used successfully in several genome and transcriptome projects. It can also be used to clean NGS data from other sequencers (e.g. 454), but needs some modification to reach the best performance.
Keywords: NGS; deep sequencing; data cleaning; data quality; pipeline
I. INTRODUCTION
With the rapid advancement of next generation sequencing (NGS) technologies, more and more research groups around the world are using NGS as their first choice for basic genomic research or diagnostics [1] [2]. Because the usability of NGS technologies relies heavily on the accuracy of the data, high-quality clean data must be derived from the raw sequencing data in a preprocessing step, also called the data cleaning process. As a result, many NGS data users have had to develop their own data cleaning programs to deal with the same types of problems (e.g. adapter contamination and low-quality nucleotides). These problems, if not solved, can lead to misalignments or incorrect assemblies, which affect downstream analysis (e.g. genotyping or SNP calling [3]). Regarding adapter contamination, if the read length is longer than the insert size, the read will contain part or all of the adapter sequence, usually at the 3’ end of the read [4]. As for low-quality nucleotides, most of them lie in the 3’ end region of the reads, while a few lie in the 5’ end region or elsewhere in the reads. The data cleaning process generally consists of two parts: adapter removal and trimming of low-quality nucleotides. It also needs to remove ambiguous nucleotides, which are represented by the character N, and other contamination, e.g. rRNA and virus contamination.
In previous studies, researchers developed a number of data cleaning tools, including AdapterRemoval [4], FASTX-Toolkit [5], Trimmomatic [6], Btrim [7], CANGS [8], SeqTrim [9] and TagCleaner [10]. Although these general tools have shown their own strengths in various features, their main problem is that they are not optimized to process data from any specific equipment (e.g. the Illumina sequencer) by taking full advantage of the characteristics of that equipment. In this study, we present Fastq_clean, an optimized pipeline to clean DNA-seq and RNA-seq data from the Illumina sequencer, which dominates the NGS market with a share of almost 60%. Users can remove low-quality nucleotides and Ns, adapter contamination, and possible rRNA and virus contamination by running Fastq_clean with a single command line. Fastq_clean is designed to remove low-quality nucleotides and adapters precisely while keeping as many of the qualified nucleotides as possible. It can process sequenced data in batches and export statistics for data quality control. It has been used successfully in many genome and transcriptome projects [11] [12] [13] [14] [15] [16] [17] [18] [19] [20] and has helped produce genomes and transcriptomes of high quality. Fastq_clean can also be used to clean NGS data from other sequencers (e.g. 454), but needs some modification to reach the best results. The Fastq_clean software package is available at https://app.box.com/s/v7sjjmjo56r71k8bvqkv.
II. SYSTEM DESIGN
Fastq_clean is composed of four procedures that remove low-quality nucleotides and Ns, adapter contamination, and possible rRNA and virus contamination from NGS data. The pipeline can be run consecutively by a single command on a Linux system (see the user manual in the downloadable package). The first and second procedures are written in R and are based on the ShortRead package [21]. The third and fourth procedures are written in Perl. The four procedures are pipelined by a Perl script. All parameters have default values suited to 100 bp reads and can be changed by users to process reads of any length. The first procedure recognizes the current quality score systems (either Phred+33 or Phred+64) without user specification. First, it masks all the low-quality nucleotides (the default setting is = 20).
Using FASTX-Toolkit, we reduced the number of ambiguous nucleotides from 22,760,837 to 32,487 and increased the Q20 ratio from 93.98% to 98.34%. AdapterRemoval performed better, reducing the number of ambiguous nucleotides to 26,328 and increasing the Q20 ratio to 98.35%. Fastq_clean reached the best performance, with the highest Q20 ratio (99.21%) and only 1,134 ambiguous nucleotides remaining. Moreover, Fastq_clean performed uniformly on each of the six RNA-seq datasets without any bias (see supplementary 1).
The fastx_clipper program in the FASTX-Toolkit removed 11.35% of the nucleotides as adapters, while AdapterRemoval removed only 0.32%. After carefully checking the results, we found that fastx_clipper removed too many reads that contained no adapter sequence. This is because fastx_clipper was designed for small RNA experiments and is therefore tuned to be very sensitive but not specific, clipping anything that resembles an adapter and any nucleotides after it. AdapterRemoval failed to remove much of the adapter contamination because of two weaknesses in handling single-end reads. First, it cannot deal with the adapter self-ligation problem. Second, it cannot deal with gapped alignments. Fastq_clean removed 2.75% of the nucleotides as adapters. This ratio is close to the ratio that AdapterRemoval reached. The small difference came mainly from adapter self-ligation.
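As a rough illustration of the first procedure described in the system design (automatic recognition of the quality score system followed by masking of low-quality nucleotides), the following Python sketch may help. It is not the authors' R/ShortRead code. The function names are hypothetical, the detection rule (any quality character below ASCII 59 implies Phred+33) is a common heuristic assumed here, and both the cutoff of 20 and the use of N as the masking character are assumptions.

```python
def detect_phred_offset(quality_strings):
    """Guess the quality offset (33 or 64) from the lowest character seen.

    Assumption: characters below ASCII 59 (';') cannot occur in +64 data,
    so their presence implies Phred+33. High-quality Phred+33 data without
    any such character would be misclassified by this heuristic.
    """
    lowest = min(ord(ch) for q in quality_strings for ch in q)
    return 33 if lowest < 59 else 64


def mask_low_quality(seq, qual, offset, min_q=20):
    """Replace every base whose Phred score is below min_q with 'N'."""
    return "".join(
        base if ord(q) - offset >= min_q else "N"
        for base, q in zip(seq, qual)
    )


def to_phred33(qual, offset):
    """Re-encode a quality string in the Phred+33 system."""
    return "".join(chr(ord(q) - offset + 33) for q in qual)


if __name__ == "__main__":
    quals = ["hhhhB", "ffff"]            # toy Phred+64 quality strings
    offset = detect_phred_offset(quals)
    print(offset)
    print(mask_low_quality("ACGTA", quals[0], offset))
    print(to_phred33(quals[0], offset))
```

Detecting the offset once per file and passing it explicitly keeps the masking step independent of the encoding, which is one simple way to support both systems without user specification.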
TABLE II. COMPARISON WITH OTHER METHODS
Method              Raw         FASTX-Toolkit  AdapterRemoval  Fastq_clean
Total_Reads         59214931    59214237       59208916        59213100
Total_NT            5921493100  5659145450     5593920081      5216604530
Total_N             22760837    32487          26328           1134
Total_Q20           5565183085  5565183085     5501768511      5175578953
N Percentage        0.38%       0.00%          0.00%           0.00%
Q20 Percentage      93.98%      98.34%         98.35%          99.21%
NT in Clean         -           5016577573     5576277668      5072933042
Adapter Percentage  -           11.35%         0.32%           2.75%
The Raw column gives the values for the raw data. For the other columns, the rows are: 1. the total number of reads after removing the low-quality nucleotides and Ns; 2. the total number of nucleotides after removing the low-quality nucleotides and Ns; 3. the total number of Ns after removing the low-quality nucleotides and Ns; 4. the number of nucleotides with a quality score of more than 20 after removing the low-quality nucleotides and Ns; 5. the percentage of Ns in the total nucleotides; 6. the percentage of Q20 nucleotides in the total nucleotides; 7. the total number of nucleotides after removing the adapters; 8. the percentage of nucleotides removed as adapters.
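The percentages in Table II can be checked from the counts. The short Python sketch below assumes that the N and Q20 percentages are taken relative to Total_NT and that the adapter percentage is (Total_NT - NT in Clean) / Total_NT. These formulas are inferred from the table rather than stated in the text, and the dictionary keys are illustrative names.

```python
# Counts copied from Table II; key names are illustrative only.
TABLE_II = {
    "Raw": {"total_nt": 5921493100, "total_n": 22760837,
            "total_q20": 5565183085},
    "FASTX-Toolkit": {"total_nt": 5659145450, "total_n": 32487,
                      "total_q20": 5565183085, "nt_in_clean": 5016577573},
    "AdapterRemoval": {"total_nt": 5593920081, "total_n": 26328,
                       "total_q20": 5501768511, "nt_in_clean": 5576277668},
    "Fastq_clean": {"total_nt": 5216604530, "total_n": 1134,
                    "total_q20": 5175578953, "nt_in_clean": 5072933042},
}


def pct(part, whole):
    """Percentage of part in whole, rounded to two decimals."""
    return round(100.0 * part / whole, 2)


def summarize(row):
    """Derive the percentage rows of Table II from the count rows."""
    out = {
        "n_pct": pct(row["total_n"], row["total_nt"]),
        "q20_pct": pct(row["total_q20"], row["total_nt"]),
    }
    if "nt_in_clean" in row:  # the raw data has no adapter-removal step
        removed = row["total_nt"] - row["nt_in_clean"]
        out["adapter_pct"] = pct(removed, row["total_nt"])
    return out


if __name__ == "__main__":
    for method, row in TABLE_II.items():
        print(method, summarize(row))
```

Under these assumptions the computed values agree with the table, for example 99.21% Q20 and 2.75% adapter nucleotides for Fastq_clean.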
V. CONCLUSION AND DISCUSSION
Fastq_clean can be used to clean DNA-seq and RNA-seq data from the Illumina sequencer. It can process sequenced data in batches and export statistics for data quality control with a single command line. Compared with the two most widely used tools on a published dataset, Fastq_clean reached the best performance. Fastq_clean can also be used to clean NGS data from other sequencers (e.g. 454), but needs some modification to reach the best performance.
Fastq_clean does not process paired-end reads jointly, that is, it does not remove adapter contamination by using the identical sequence in the overlapping region between the paired reads (with one being reverse-complemented), for the following reasons. First, it is very difficult for a user to define a proper mismatch ratio for finding the overlapping region because of the unbalanced quality of the paired reads. Second, an overlap that is too short may come from random matches and lead to false positive overlaps. Third, the multiplexing strategy may introduce significant levels of sample cross-contamination, so the paired reads may come from two different templates [24]. Therefore, Fastq_clean processes paired-end reads separately and then provides a Perl script to extract the paired reads based on their identical read names. Another reason to process paired-end reads separately is to remove rRNA or virus contamination strictly. That is, if one read of a pair can be aligned to the rRNA or virus databases and is thus removed, the other read of the pair is also removed by the pair-extracting Perl script. In previous studies, we found that in most cases only one read of a pair can be aligned to the rRNA or virus databases, because of the unbalanced sequencing quality or the incomplete reference sequences in the databases. Using this strict strategy, Fastq_clean removes much more contamination than other tools.
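The pair-extraction idea described above (keep a read only if its mate also survived cleaning, matching mates by read name) can be sketched in Python as follows. The pipeline itself uses a Perl script for this step, so this is only an illustration. The function names, the tuple layout of a read, and the assumption that mates share a name up to a trailing "/1" or "/2" are all hypothetical.

```python
def read_key(header):
    """Return the read name without '@', description, or '/1' '/2' suffix."""
    name = header.lstrip("@").split()[0]
    if name.endswith("/1") or name.endswith("/2"):
        name = name[:-2]
    return name


def extract_pairs(left_reads, right_reads):
    """Keep only reads whose mate is present in the other cleaned file.

    Each input is a list of (header, sequence, quality) tuples. A read whose
    mate was removed earlier (e.g. as rRNA or virus contamination) is dropped
    here, which implements the strict strategy of removing both mates.
    """
    right_by_name = {read_key(r[0]): r for r in right_reads}
    pairs = []
    for left in left_reads:
        mate = right_by_name.get(read_key(left[0]))
        if mate is not None:
            pairs.append((left, mate))
    return pairs


if __name__ == "__main__":
    left = [("@r1/1", "ACGT", "IIII"), ("@r2/1", "GGCC", "IIII"),
            ("@r3/1", "TTAA", "IIII")]
    right = [("@r1/2", "TGCA", "IIII"), ("@r3/2", "AATT", "IIII"),
             ("@r4/2", "CCGG", "IIII")]
    for l, r in extract_pairs(left, right):
        print(l[0], r[0])
```

Indexing one file by read name keeps the matching independent of read order, at the cost of holding that file in memory. For very large files a sorted or streaming approach would be preferable.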
Fastq_clean also provides some advantages when the cleaned data are used for downstream analysis. First, the clean data use the standard Phred+33 quality score system. Many NGS programs (e.g. Bowtie) assume by default that the input data use the Phred+33 system. They do not report errors when users input data in the Phred+64 system, but they produce wrong results. Second, Fastq_clean automatically adds “/1” and “/2” to the cleaned left and right reads, respectively. Some NGS software requires this information for paired-end alignment (e.g. BWA) or other uses.
ACKNOWLEDGMENTS
We thank Associate Professor Jijun Tang from the Department of Computer Science & Engineering, University of South Carolina. This research was supported by the Natural Science Foundation of Jiangxi Province of China (20132BAB214009), the National Natural Science Foundation of China (31201191 and 61170177) and the National Basic Research Program of China (2013CB32930X).
REFERENCES
[1] T. P. Niedringhaus, D. Milanova, M. B. Kerby, M. P. Snyder, and A. E. Barron, "Landscape of next-generation sequencing technologies," Analytical Chemistry, vol. 83, pp. 4327-4341, 2011.
[2] Z. Su, B. Ning, H. Fang, H. Hong, R. Perkins, W. Tong, and L. Shi, "Next-generation sequencing and its applications in molecular diagnostics," Expert Review of Molecular Diagnostics, vol. 11, pp. 333-343, 2011.
[3] R. Nielsen, J. S. Paul, A. Albrechtsen, and Y. S. Song, "Genotype and SNP calling from next-generation sequencing data," Nat Rev Genet, vol. 12, pp. 443-51, Jun 2011.
[4] S. Lindgreen, "AdapterRemoval: easy cleaning of next-generation sequencing reads," BMC Res Notes, vol. 5, p. 337, 2012.
[5] A. Gordon and G. Hannon, "Fastx-toolkit," FASTQ/A short-reads pre-processing tools (unpublished), http://hannonlab.cshl.edu/fastx_toolkit, 2010.
[6] A. M. Bolger, M. Lohse, and B. Usadel, "Trimmomatic: a flexible trimmer for Illumina sequence data," Bioinformatics, p. btu170, 2014.
[7] Y. Kong, "Btrim: a fast, lightweight adapter and quality trimming program for next-generation sequencing technologies," Genomics, vol. 98, pp. 152-153, 2011.
[8] R. V. Pandey, V. Nolte, and C. Schlötterer, "CANGS: a user-friendly utility for processing and analyzing 454 GS-FLX data in biodiversity studies," BMC Research Notes, vol. 3, p. 3, 2010.
[9] J. Falgueras, A. J. Lara, N. Fernández-Pozo, F. R. Cantón, G. Pérez-Trabado, and M. G. Claros, "SeqTrim: a high-throughput pipeline for pre-processing any type of sequence read," BMC Bioinformatics, vol. 11, p. 38, 2010.
[10] R. Schmieder, Y. W. Lim, F. Rohwer, and R. Edwards, "TagCleaner: Identification and removal of tag sequences from genomic and metagenomic datasets," BMC Bioinformatics, vol. 11, p. 341, 2010.
[11] The Tomato Genome Consortium, "The tomato genome sequence provides insights into fleshy fruit evolution," Nature, vol. 485, pp. 635-41, May 31 2012.
[12] S. Guo, J. Zhang, H. Sun, J. Salse, W. J. Lucas, H. Zhang, Y. Zheng, L. Mao, Y. Ren, Z. Wang, J. Min, X. Guo, F. Murat, B. K. Ham, Z. Zhang, S. Gao, M. Huang, Y. Xu, S. Zhong, A. Bombarely, L. A. Mueller, H. Zhao, H. He, Y. Zhang, S. Huang, T. Tan, E. Pang, K. Lin, Q. Hu, H. Kuang, P. Ni, B. Wang, J. Liu, Q. Kou, W. Hou, X. Zou, J. Jiang, G. Gong, K. Klee, H. Schoof, Y. Huang, X. Hu, S. Dong, D. Liang, J. Wang, K. Wu, Y. Xia, X. Zhao, Z. Zheng, M. Xing, X. Liang, B. Huang, T. Lv, Y. Yin, H. Yi, R. Li, M. Wu, A. Levi, X. Zhang, J. J. Giovannoni, Y. Li, and Z. Fei, "The draft genome of watermelon (Citrullus lanatus) and resequencing of 20 diverse accessions," Nature Genetics, vol. 45, pp. 51-8, Jan 2013.
[13] Y. Xu, S. Gao, Y. Yang, M. Huang, L. Cheng, Q. Wei, Z. Fei, J. Gao, and B. Hong, "Transcriptome sequencing and whole genome expression profiling of chrysanthemum under dehydration stress," BMC Genomics, vol. 14, p. 662, Sep 28 2013.
[14] Y. Yang, C. Ma, Y. Xu, Q. Wei, M. Imtiaz, H. Lan, S. Gao, L. Cheng, M. Wang, Z. Fei, B. Hong, and J. Gao, "A Zinc Finger Protein Regulates Flowering Time and Abiotic Stress Tolerance in Chrysanthemum by Modulating Gibberellin Biosynthesis," The Plant Cell, May 23 2014.
[15] A. Bolger, F. Scossa, M. E. Bolger, C. Lanz, F. Maumus, T. Tohge, H. Quesneville, S. Alseekh, I. Sørensen, and G. Lichtenstein, "The genome of the stress-tolerant wild tomato species Solanum pennellii," Nature Genetics, 2014.
[16] J. Qi, X. Liu, D. Shen, H. Miao, B. Xie, X. Li, P. Zeng, S. Wang, Y. Shang, and X. Gu, "A genomic variation map provides insights into the genetic basis of cucumber domestication and diversity," Nature Genetics, 2013.
[17] S. Grassi, G. Piro, J. M. Lee, Y. Zheng, Z. Fei, G. Dalessandro, J. J. Giovannoni, and M. S. Lenucci, "Comparative genomics reveals candidate carotenoid pathway regulators of ripening watermelon fruit," BMC Genomics, vol. 14, p. 781, 2013.
[18] S. Huang, J. Ding, D. Deng, W. Tang, H. Sun, D. Liu, L. Zhang, X. Niu, X. Zhang, and M. Meng, "Draft genome of the kiwifruit Actinidia chinensis," Nature Communications, vol. 4, 2013.
[19] S. Cohen, M. Itkin, Y. Yeselson, G. Tzuri, V. Portnoy, R. Harel-Baja, S. Lev, U. Sa'ar, R. Davidovitz-Rikanati, and N. Baranes, "The PH gene determines fruit acidity and contributes to the evolution of sweet melons," Nature Communications, vol. 5, 2014.
[20] H. G. Rosli, Y. Zheng, M. A. Pombo, S. Zhong, A. Bombarely, Z. Fei, A. Collmer, and G. B. Martin, "Transcriptomics-based screen for genes induced by flagellin and repressed by pathogen effectors identifies a cell wall-associated kinase involved in plant immunity," Genome Biology, vol. 14, p. R139, 2013.
[21] M. Morgan, S. Anders, M. Lawrence, P. Aboyoun, H. Pages, and R. Gentleman, "ShortRead: a bioconductor package for input, quality assessment and exploration of high-throughput sequence data," Bioinformatics, vol. 25, pp. 2607-2608, Oct 1 2009.
[22] H. Li and R. Durbin, "Fast and accurate short read alignment with Burrows-Wheeler transform," Bioinformatics, vol. 25, pp. 1754-1760, Jul 15 2009.
[23] C. Quast, E. Pruesse, P. Yilmaz, J. Gerken, T. Schweer, P. Yarza, J. Peplies, and F. O. Glockner, "The SILVA ribosomal RNA gene database project: improved data processing and web-based tools," Nucleic Acids Res, vol. 41, pp. D590-D596, Jan 2013.
[24] M. Kircher, S. Sawyer, and M. Meyer, "Double indexing overcomes inaccuracies in multiplex sequencing on the Illumina platform," Nucleic Acids Res, vol. 40, Jan 2012.