From trash to treasure: detecting unexpected contamination in unmapped NGS data
Ilaria Granata1,#, Mara Sangiovanni1,#, Amarinder Singh Thind1 and Mario R. Guarracino1
1 ICAR-CNR, Naples, Italy
# Equal contributors
Corresponding Authors: Ilaria Granata, [email protected]; Mara Sangiovanni, [email protected]

Introduction
Standard procedures for NGS data analysis involve a pre-processing step of read quality assessment, followed by the alignment of the filtered reads to a reference genome. The mapped sequences are then investigated to extract information such as transcript expression, splicing events, nucleotide or structural variations, and enriched regions of specific binding sites. Typically, the fraction of reads that correctly maps to the reference genome ranges between 70% and 90%, leaving in some cases a considerable fraction of unmapped sequences. Their presence can be ascribed either to technical errors during the sequencing protocol, when the quality assessment step is not performed, or to sequence differences between the reads and the reference. Investigating the reasons for this discrepancy may provide relevant information about the source of the so-called unmapped reads.

It is not unusual for genetic material of microorganisms to be present in biological samples undergoing sequencing. These exogenous sequences can derive from the microbiome of normal or altered tissues (upstream contamination) or from contamination occurring during sample processing (downstream contamination). Upstream contamination has been reported by several research groups that have used NGS techniques to discover exogenous agents in human tissue samples and cell lines. Detecting downstream contamination can help in checking the quality of the working environment and the possibility of cross-contamination among samples, which, in turn, might affect the reliability of the experimental results, as discussed in Laurin-Lemay et al. [1] and in Olarerin-George and Hogenesch [2].

Here we propose DecontaMiner, a tool to unravel the presence of contaminating sequences among the unmapped reads. It uses a subtraction approach in which the sequences are first filtered according to quality parameters and then mapped to ribosomal and mitochondrial reference sequences.
The unaligned reads are then mapped, through a local alignment algorithm (MegaBLAST), to bacterial, fungal and viral genomes. DecontaMiner generates several output files to track all the processed reads and to provide a complete report of their characteristics. The good-quality matches on microorganism genomes are counted and compared among samples.

The main novelty of DecontaMiner is its versatility, together with a complete, easy-to-use and automatic pipeline. DecontaMiner differs from other tools such as Taxonomer [3] or SURPI [4] because it is not aimed at metagenomics studies but is focused on the detection and identification of upstream/downstream contamination among the unmapped reads. DecontaMiner does not require a complicated installation, unlike RNA CoMPASS [5], since it relies on several tools that are widely used by the sequencing community. It has no limit on read length and performs the alignment to several microorganism databases. Moreover, it does not require a priori knowledge, unlike FastQ Screen [6] or contamination_screen [7], where the user must provide the genome of each putative contaminating species. Other tools, such as TruePure [8], have limits on the input size and process only small subsets of a fastq file.

Methods
DecontaMiner has been developed to work on one or more samples, and on both paired-end and single-end experiments. The tool is composed of two main modules: the first handles the format conversion, filtering and mapping steps, and the second performs the extraction and parsing of the results.

PeerJ Preprints | https://doi.org/10.7287/peerj.preprints.3230v1 | CC BY 4.0 Open Access | rec: 6 Sep 2017, publ: 6 Sep 2017
The first module is composed of one main shell script that automatically and sequentially executes all the steps up to the alignment to the microbial genomes. All the individual scripts belonging to this module are provided, so the user can also run them separately. The second module is composed of two shell scripts, one to filter the results of the BLAST alignment and one to collect the information according to user-defined thresholds. The code is written in Perl, with bash scripts to connect and launch the various submodules of the pipeline. DecontaMiner depends on the following external tools: Samtools [9], FastX toolkit [10], SortMeRNA [11] and BLAST [12].

The input file is accepted in fastq, fasta and bam formats, and the format determines the starting point of the pipeline. The quality parameters used to retain or discard reads must be set by the user. The sequences that pass the filtering step are then aligned against the human ribosomal and mitochondrial RNA using SortMeRNA. The non-human sequences can be mapped to bacterial, viral and fungal genome databases (NCBI nt) using the MegaBLAST algorithm, specifying the alignment length and the number of allowed mismatches/gaps. The BLAST table output contains all the matches satisfying the alignment criteria. In addition, files containing the reads discarded at each stage of the pipeline are generated, so that low-quality, rRNA-mapped, ambiguous and unaligned reads are all available.

The second module requires some thresholds in order to be executed. In particular, the user must specify the minimum number of total reads successfully mapped to a single organism in order to consider it a contaminant. The uniqueness of the mapping is evaluated at the genus level. The results can be extracted and grouped by both genus and species. In addition, DecontaMiner builds an offline HTML page containing summary statistics and plots. The latter are obtained using the state-of-the-art D3 JavaScript library [13].
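The filtering and thresholding logic described above can be illustrated with a minimal Python sketch. It is not DecontaMiner's own code (which is written in Perl) and rests on several assumptions: the BLAST table is taken to be in the standard 12-column tabular layout, the function call_contaminants and its parameter names are hypothetical, and the lookup from subject identifiers to genus names (here the toy dictionary SUBJECT_TO_GENUS with made-up identifiers) is assumed to be available from some taxonomy source. Reads whose retained hits span more than one genus are treated as ambiguous, which is one possible reading of "uniqueness at the genus level".

```python
import csv
import io
from collections import Counter, defaultdict

# Column order of the standard 12-column BLAST tabular output (assumed here).
COLUMNS = ["qseqid", "sseqid", "pident", "length", "mismatch", "gapopen",
           "qstart", "qend", "sstart", "send", "evalue", "bitscore"]


def call_contaminants(blast_table, subject_to_genus, min_length,
                      max_mismatches, max_gaps, min_reads):
    """Return (genus -> read count for genera reaching min_reads, ambiguous read ids).

    All names and thresholds are hypothetical and only illustrate the idea.
    """
    # Step 1: keep hits satisfying the alignment criteria, grouped by read.
    genera_per_read = defaultdict(set)
    reader = csv.DictReader(blast_table, fieldnames=COLUMNS, delimiter="\t")
    for row in reader:
        if (int(row["length"]) >= min_length
                and int(row["mismatch"]) <= max_mismatches
                and int(row["gapopen"]) <= max_gaps):
            genus = subject_to_genus.get(row["sseqid"])
            if genus is not None:
                genera_per_read[row["qseqid"]].add(genus)

    # Step 2: a read counts only if all of its retained hits agree on one genus.
    counts = Counter()
    ambiguous = set()
    for read_id, genera in genera_per_read.items():
        if len(genera) == 1:
            counts[next(iter(genera))] += 1
        else:
            ambiguous.add(read_id)

    # Step 3: report genera supported by at least min_reads reads.
    contaminants = {g: n for g, n in counts.items() if n >= min_reads}
    return contaminants, ambiguous


# Toy input: subject identifiers and alignment values are invented for illustration.
SUBJECT_TO_GENUS = {"subjA": "Propionibacterium", "subjB": "Escherichia"}
EXAMPLE_TABLE = "\n".join([
    "read1\tsubjA\t100.0\t100\t0\t0\t1\t100\t1\t100\t1e-50\t185",
    "read2\tsubjA\t99.0\t100\t1\t0\t1\t100\t5\t104\t1e-48\t180",
    "read3\tsubjA\t100.0\t40\t0\t0\t1\t40\t9\t48\t1e-15\t75",
    "read4\tsubjA\t100.0\t100\t0\t0\t1\t100\t1\t100\t1e-50\t185",
    "read4\tsubjB\t100.0\t100\t0\t0\t1\t100\t1\t100\t1e-50\t185",
    "read5\tsubjB\t100.0\t100\t0\t0\t1\t100\t1\t100\t1e-50\t185",
])

contaminants, ambiguous = call_contaminants(
    io.StringIO(EXAMPLE_TABLE), SUBJECT_TO_GENUS,
    min_length=50, max_mismatches=2, max_gaps=0, min_reads=2)
print(contaminants, ambiguous)
```

In this toy input, read3 is dropped because its alignment is too short, read4 is set aside as ambiguous because it matches two genera, and only the genus supported by at least two reads is reported. Grouping by species instead of genus would only require a different lookup table.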
DecontaMiner has mainly been used to detect contamination in human RNA-seq data, but the pipeline can easily be tailored, using the configuration files and flags, to process all kinds of NGS data, as well as unmapped data coming from other (non-human) organisms.

Results
DecontaMiner has been tested on two publicly available datasets downloaded from the GEO (Gene Expression Omnibus) portal: GSE69240 and GSE68086. The first (GSE69240) contains polyA-RNA samples. DecontaMiner found no contamination, as expected given the type of sequenced data, since an efficient polyA enrichment and a sterile environment should guarantee contamination-free samples. However, downstream contamination after the polyA enrichment step might still occur and be detected, in that case suggesting the presence of contaminants in the working environment. The second (GSE68086) is a dataset of total RNA sequencing of tumor and healthy samples. Here we found a background contamination in almost all the samples. The most abundant organisms were P. acnes and E. coli and, in addition, some tumor samples significantly matched A. baumannii, a well-known nosocomial pathogen that is possibly associated with the outcome of cancer diseases. The alignments to fungal and viral genomes yielded far fewer matches than those to bacteria, although the matches to Propionibacterium and Enterobacteria phages were in agreement with the bacterial contamination we found.

Conclusions
DecontaMiner is a tool designed and developed to investigate the presence of contaminating sequences in unmapped NGS data. It is freely available at www-labgtp.na.icar.cnr.it/decontaminer/index.php . We are also working on providing the presented pipeline as a tool in the Galaxy environment [14]. The novelty of DecontaMiner lies in its easy integration with the standard analysis of NGS data, which makes it a useful tool for additional investigation of the sequenced data.
The contamination analysis might shed light on the origin of unmapped reads, which might derive from laboratory contamination or from the biological source itself, and can suggest further investigations or experimental validations.

References
1. Laurin-Lemay, S., Brinkmann, H., and Philippe, H. Origin of land plants revisited in the light of sequence contamination and missing data. Current Biology. 2012. 22(15):R593-R594.
2. Olarerin-George, A. O., and Hogenesch, J. B. Assessing the prevalence of mycoplasma contamination in cell culture via a survey of NCBI’s RNA-seq archive. Nucleic Acids Research. 2015. 43(5):2535-2542.
3. Flygare, S., Simmon, K., et al. Taxonomer: an interactive metagenomics analysis portal for universal pathogen detection and host mRNA expression profiling. Genome Biology. 2016. 17:111.
4. Naccache, S. N., et al. A cloud-compatible bioinformatics pipeline for ultrarapid pathogen identification from next-generation sequencing of clinical samples. Genome Research. 2014. 24(7):1180-1192.
5. Xu, G., Strong, M. J., Lacey, M. R., Baribault, C., Flemington, E. K., et al. RNA CoMPASS: A Dual Approach for Pathogen and Host Transcriptome Analysis of RNA-Seq Datasets. PLOS ONE. 2014. 9(2):e89445.
6. Babraham Bioinformatics Institute. FastQ Screen. https://www.bioinformatics.babraham.ac.uk/projects/fastq_screen/
7. George Cresswell. contamination_screen. https://github.com/luslab/contamination_screen
8. Expedeon True Helix. TruePure. https://www.expedeon.com/services/truehelix/truepure
9. Li, H., et al. The sequence alignment/map format and samtools. Bioinformatics. 2009. 25(16):2078-2079.
10. Gordon, A., and Hannon, G. J. Fastx-toolkit. 2010.
11. Kopylova, E., et al. SortMeRNA: fast and accurate filtering of ribosomal RNAs in metatranscriptomic data. Bioinformatics. 2012. 28(24):3211-3217.
12. Zhang, Z., et al. A greedy algorithm for aligning DNA sequences. Journal of Computational Biology. 2000. 7(1-2):203-214.
13. Bostock, M. Data-driven documents. 2015. https://d3js.org/
14. Galaxy Platform. https://usegalaxy.org