Application Note

MEGAHIT: An ultra-fast single-node solution for large and complex metagenomics assembly via succinct de Bruijn graph

Dinghua Li1,†, Chi-Man Liu2,†, Ruibang Luo2,†, Kunihiko Sadakane3 and Tak-Wah Lam1,2,*

1HKU-BGI Bioinformatics Algorithms Research Laboratory & Department of Computer Science, University of Hong Kong, Hong Kong
2L3 Bioinformatics Limited, Hong Kong
3National Institute of Informatics, Chiyoda-ku, Tokyo, Japan

*To whom correspondence should be addressed. †Equal first authors.

Received on XXXXX; revised on XXXXX; accepted on XXXXX

Associate Editor: XXXXXXX

ABSTRACT

Summary: MEGAHIT is an NGS de novo assembler for assembling large and complex metagenomics data in a time- and cost-efficient manner. It finished assembling a soil metagenomics dataset of 252 Gbp in 44.1 hours and 99.6 hours on a single computing node with and without a GPU, respectively. MEGAHIT assembles the data as a whole, i.e., it avoids pre-processing steps such as partitioning and normalization, which might compromise the integrity of the results. MEGAHIT generates an assembly 3 times larger than the previous one, with a longer contig N50 and a longer average contig length; 55.8% of the reads could be aligned to the assembly, 4 times higher than for the previous assembly.

Availability: The source code of MEGAHIT is freely available at https://github.com/voutcn/megahit under the GPLv3 license.

Contact: [email protected], [email protected]

1 INTRODUCTION

Next-generation sequencing technologies have offered new opportunities to study metagenomics and to understand various microbial communities, such as those of the human gut, the rumen and soil. Constrained by the lack of reference genomes, de novo assembly of metagenomic short reads is a beneficial and almost inevitable step in metagenomics analysis (Qin et al., 2010). Such a step is, however, challenged by the heavy demand on computational resources, especially for the large and complex datasets encountered in environmental metagenomics (Howe et al., 2014). The soil metagenomics dataset recently published by Howe et al. comprises 252 Gbp even after the removal of low-quality bases. Although the dataset was successfully assembled, the pre-processing steps, including partitioning and digital normalization, are prone to loss of completeness. At present, no de novo assembler targeting metagenomics can assemble the data as a whole within a feasible amount of memory: SOAPdenovo2 (Luo et al., 2012) and IDBA-UD (Peng et al., 2012) are estimated to require at least 4 TB of memory to assemble the soil dataset. As ever larger volumes of metagenomics data are generated, we were motivated to develop MEGAHIT, an assembler that can assemble large and complex metagenomics data in a time- and cost-efficient manner, especially on a single-node server (standard memory capacity is currently bounded by 16 GB x 24 = 384 GB or 32 GB x 24 = 768 GB).

2 METHODS

MEGAHIT makes use of the succinct de Bruijn graph (SdBG) (Bowe et al., 2012), a compressed representation of the de Bruijn graph. An SdBG encodes a graph with m edges in O(m) bits and supports O(1)-time traversal from a vertex to its neighbors. In our implementation, we added a bit vector of length m that marks the validity of each vertex, making the graph editable, and an auxiliary array of 2kt bits (where k is the k-mer size and t is the number of zero-indegree vertices) that stores the sequences of the zero-indegree vertices, ensuring that the graph is lossless.
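The constant-time traversal of an SdBG rests on rank/select queries over bit vectors. To give a flavor of that machinery (a minimal sketch, not MEGAHIT's or Bowe et al.'s actual code), the following C++ bit vector answers rank queries in O(1) time after one linear preprocessing pass, using GCC/Clang's __builtin_popcountll. For brevity it keeps one counter per 64-bit word, which costs O(m) extra bits; production succinct structures use two-level directories to reduce this to o(m).

```cpp
#include <cstdint>
#include <vector>

// Rank-support bit vector: rank1(i) = number of 1-bits in positions [0, i),
// answered in O(1). Queries like this map an edge index of an SdBG to the
// index of its target vertex, which is what makes neighbor traversal O(1).
class RankBitVector {
public:
    explicit RankBitVector(const std::vector<bool>& bits)
        : n_(bits.size()), words_((n_ + 63) / 64, 0), prefix_(words_.size() + 1, 0) {
        for (size_t i = 0; i < n_; ++i)
            if (bits[i]) words_[i / 64] |= uint64_t(1) << (i % 64);
        for (size_t w = 0; w < words_.size(); ++w)   // cumulative popcount per word
            prefix_[w + 1] = prefix_[w] + __builtin_popcountll(words_[w]);
    }
    // Number of set bits strictly before position i (0 <= i <= size).
    uint64_t rank1(size_t i) const {
        size_t w = i / 64, r = i % 64;
        uint64_t partial =
            r ? __builtin_popcountll(words_[w] & ((uint64_t(1) << r) - 1)) : 0;
        return prefix_[w] + partial;
    }
private:
    size_t n_;
    std::vector<uint64_t> words_;    // the O(m)-bit payload
    std::vector<uint64_t> prefix_;   // rank directory (one counter per word)
};
```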

Despite its advantages, constructing an SdBG efficiently is non-trivial. MEGAHIT is rooted in a fast parallel algorithm for constructing an SdBG, which requires sorting, in reverse lexicographical order, the length-k prefixes (k-mers) of the set of (k+1)-mers that form the edges of the SdBG. MEGAHIT exploits the parallelism of a CUDA-enabled Graphics Processing Unit (GPU) by adapting the recent BWT-construction algorithm CX1 (Liu et al., 2014), which takes advantage of a GPU to sort the suffixes of a set of reads very efficiently. To cope with the relatively small onboard memory of a GPU, we adopt a block-wise strategy: the k-mers to be sorted are partitioned according to their length-l prefix (l = 8 in our implementation), and k-mers in consecutive partitions whose total size fits into the GPU memory are sorted together. Leveraging the massive parallelism of the GPU, MEGAHIT speeds up the construction 3-5 times over its CPU-only counterpart.
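A CPU-side sketch of this block-wise batching strategy is given below. The names, the 2-bit packing convention and the use of std::sort are illustrative stand-ins for MEGAHIT's GPU sorter, and plain integer order stands in for the reverse-lexicographic edge order used by the SdBG. Because the buckets partition the k-mers by their length-l prefix, sorting each batch independently and concatenating the buckets in order still yields a globally sorted sequence.

```cpp
#include <algorithm>
#include <cstdint>
#include <vector>

// (k+1)-mers are 2-bit-packed into the low 2*(k+1) bits of a uint64_t with
// the first base most significant (so k+1 <= 32 in this sketch).
constexpr int L = 8;                                   // prefix length for bucketing
constexpr size_t NUM_BUCKETS = size_t(1) << (2 * L);   // 4^L buckets

std::vector<std::vector<uint64_t>> bucket_by_prefix(
        const std::vector<uint64_t>& kmers, int k_plus_1) {
    std::vector<std::vector<uint64_t>> buckets(NUM_BUCKETS);
    int shift = 2 * (k_plus_1 - L);          // top 2L bits hold the first L bases
    for (uint64_t km : kmers)
        buckets[km >> shift].push_back(km);
    return buckets;
}

// Sort consecutive buckets in batches whose total size fits the memory budget
// (in MEGAHIT, each batch is shipped to the GPU's onboard memory).
void sort_in_batches(std::vector<std::vector<uint64_t>>& buckets,
                     size_t budget_bytes) {
    size_t begin = 0;
    while (begin < buckets.size()) {
        size_t end = begin, batch_bytes = 0;
        while (end < buckets.size() &&
               batch_bytes + buckets[end].size() * sizeof(uint64_t) <= budget_bytes)
            batch_bytes += buckets[end++].size() * sizeof(uint64_t);
        if (end == begin) ++end;             // a single oversized bucket: take it alone
        for (size_t b = begin; b < end; ++b)
            std::sort(buckets[b].begin(), buckets[b].end());
        begin = end;
    }
}
```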

Notably, sequencing errors are non-negligible: a single erroneous base gives rise to as many as k erroneous k-mer singletons, which would increase the memory consumption of MEGAHIT significantly. To cope with this, before graph construction all (k+1)-mers of the input reads are sorted and counted, and only the (k+1)-mers that appear at least d times (d = 2 by default) are kept as solid k-mers.
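A minimal, self-contained version of this filter is shown below, assuming the (k+1)-mers are already 2-bit-packed into 64-bit integers as above; MEGAHIT's actual counting runs over the GPU-sorted partitions described earlier.

```cpp
#include <algorithm>
#include <cstdint>
#include <vector>

// Keep only (k+1)-mers occurring at least d times (d = 2 by default in
// MEGAHIT). A single sequencing error corrupts up to k+1 of the (k+1)-mers
// covering it; almost all of these are singletons and are discarded here.
std::vector<uint64_t> solid_kmers(std::vector<uint64_t> kmers, int d = 2) {
    std::sort(kmers.begin(), kmers.end());
    std::vector<uint64_t> solid;
    for (size_t i = 0; i < kmers.size(); ) {
        size_t j = i;
        while (j < kmers.size() && kmers[j] == kmers[i]) ++j;  // run of equal k-mers
        if (j - i >= (size_t)d) solid.push_back(kmers[i]);     // keep if count >= d
        i = j;
    }
    return solid;
}
```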

This method removes a large number of spurious edges, but it is risky for metagenomics assembly, since many low-abundance species may have been sequenced at very low depth. We therefore introduce a mercy-k-mer strategy to recover these low-depth edges. Given two solid (k+1)-mers x and y from the same read, where x has no outdegree and y has no indegree, if none of the (k+1)-mers between x and y in that read is solid, they are all added to the de Bruijn graph as mercy k-mers. Mercy k-mers strengthen the contiguity of low-depth regions; without them, many authentic low-depth edges would be incorrectly identified as tips and removed from the graph.
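The sketch below implements a simplified form of this rule on plain strings. MEGAHIT operates on packed k-mers rather than hash sets, and the real rule additionally checks x's outdegree and y's indegree against the whole graph; here, only the within-read condition is checked.

```cpp
#include <string>
#include <unordered_set>
#include <vector>

// Simplified mercy-k-mer recovery for one read: whenever two solid
// (k+1)-mers in the read are separated only by non-solid ones, the
// intermediate (k+1)-mers are rescued so low-depth regions stay connected.
std::vector<std::string> find_mercy_kmers(
        const std::string& read, int k,
        const std::unordered_set<std::string>& solid) {
    std::vector<std::string> mercy;
    int len = k + 1;                       // SdBG edges are (k+1)-mers
    if ((int)read.size() < len) return mercy;
    int last_solid = -1;                   // position of the previous solid (k+1)-mer
    for (int i = 0; i + len <= (int)read.size(); ++i) {
        if (!solid.count(read.substr(i, len))) continue;
        if (last_solid >= 0 && i - last_solid > 1) {
            // every (k+1)-mer strictly between the two solid ones is
            // non-solid, so rescue them all as mercy k-mers
            for (int j = last_solid + 1; j < i; ++j)
                mercy.push_back(read.substr(j, len));
        }
        last_solid = i;
    }
    return mercy;
}
```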

On top of the SdBG, we implemented a multiple-k-mer strategy in MEGAHIT (Peng et al., 2012). The method iteratively builds SdBGs for assembly from a smaller k to a larger k: a small k-mer size is favorable for filtering erroneous edges and for filling gaps in low-coverage regions, whereas a large k-mer size helps to resolve repeats. In each iteration, MEGAHIT cleans potentially erroneous edges by removing tips, merging bubbles and removing edges of low local coverage; this last technique is specially introduced for metagenomics data, which suffer from non-uniform sequencing depths.
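Structurally, the iteration can be pictured as follows. The SdBG struct and the three helper functions are toy stand-ins (not MEGAHIT's real API) that only make the control flow concrete: contigs produced at one k are fed, together with the original reads, into the next, larger k.

```cpp
#include <string>
#include <vector>

// Toy stand-ins; MEGAHIT's real SdBG and cleaning routines are far more involved.
struct SdBG { std::vector<std::string> seqs; int k; };
static SdBG build_sdbg(const std::vector<std::string>& seqs, int k) { return {seqs, k}; }
static void clean_graph(SdBG&) { /* remove tips, merge bubbles, drop low local coverage */ }
static std::vector<std::string> extract_contigs(const SdBG& g) { return g.seqs; }

// The multiple-k-mer loop, e.g. k = 27, 37, ..., 87 as used in Section 3.
std::vector<std::string> multi_k_assemble(const std::vector<std::string>& reads,
                                          int k_min, int k_max, int k_step) {
    std::vector<std::string> contigs;
    for (int k = k_min; k <= k_max; k += k_step) {
        std::vector<std::string> input = reads;                    // reads every round
        input.insert(input.end(), contigs.begin(), contigs.end()); // plus last contigs
        SdBG g = build_sdbg(input, k);
        clean_graph(g);
        contigs = extract_contigs(g);   // becomes extra input for the next k
    }
    return contigs;
}
```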

The overall workflow of MEGAHIT is shown in Fig. 1.

Fig. 1. The workflow of MEGAHIT.

3 RESULTS

For evaluation, we selected the Iowa prairie soil metagenomics dataset, which comprises 3.3 billion reads totaling 252 billion base pairs (Howe et al., 2014). In Howe's study, the input was pre-processed by digital normalization and partitioning before Velvet was used to assemble each partition. We also benchmarked Minia (Chikhi and Rizk, 2012), a memory-efficient but rather primitive general-purpose assembler. The experiments were conducted on a server equipped with two 2.0 GHz 6-core Xeon CPUs, 384 GB of memory and a Tesla K40c GPU card. As for parameter settings, MEGAHIT used k-mer sizes from 27 to 87 with a step of 10; Minia used a single k-mer size of 31 (constrained by the minimum read length).

MEGAHIT's GPU-accelerated and CPU-only versions finished the assembly in 44.1 hours and 99.6 hours, respectively (Table 1). Minia, even with a single k-mer size, took ~331 hours, 7.5 times slower than MEGAHIT. MEGAHIT reached its peak memory consumption of 345 GB during (k+1)-mer counting and SdBG construction; this matches expectations, since the sorting module underlying these two steps automatically adjusts itself to fully utilize all available memory on a server. As described in Howe's paper, over 488 hours and 287 GB of memory were spent on digital normalization and partitioning alone (the time for the Velvet assembly was not reported). The programs, scripts and assembly results of MEGAHIT and Minia are available at http://www.bio8.cs.hku.hk/dataset/MEGAHIT/.

To be consistent with Howe's analysis, we only consider contigs ≥300 bp in the following. As shown in Table 1, the total size of the MEGAHIT assembly is at least 3 times larger than those of the other methods (4.90 Gbp versus 1.50 Gbp and 1.49 Gbp). MEGAHIT achieved the longest N50, the largest average contig length and the largest number of long contigs (≥1,000 bp) among the three, showing better assembly contiguity.

Table 1. Summary statistics for MEGAHIT, Howe and Minia

                          MEGAHIT      Howe        Minia
  Wall time (hr)          44.1*        >488        331.4
  Peak memory (GB)        345          287         29
  Total size (Mbp)        4,902        1,503       1,490
  Average length (bp)     633          485         505
  N50 (bp)                657          471         488
  Longest (bp)            184,210      9,397       32,679
  # of contigs            7,749,211    3,096,464   2,951,575
  # of contigs ≥1 kbp     841,257      129,513     158,402

* The wall time for the CPU-only counterpart of MEGAHIT is 99.6 hours.
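As a reference for the contiguity metrics in Table 1: N50 is the largest length N such that contigs of length at least N cover at least half of the total assembly size. A minimal helper (not part of MEGAHIT) makes this precise:

```cpp
#include <algorithm>
#include <cstdint>
#include <functional>
#include <numeric>
#include <vector>

// N50 of a set of contig lengths: walk the lengths from longest to shortest
// and return the first length at which the running sum reaches half the total.
uint64_t n50(std::vector<uint64_t> lengths) {
    std::sort(lengths.begin(), lengths.end(), std::greater<uint64_t>());
    uint64_t total = std::accumulate(lengths.begin(), lengths.end(), uint64_t(0));
    uint64_t running = 0;
    for (uint64_t len : lengths) {
        running += len;
        if (2 * running >= total) return len;
    }
    return 0;  // empty input
}
```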

To inspect the completeness and correctness of the assemblies, the raw reads were aligned back to the assembled contigs using Bowtie2 (Langmead and Salzberg, 2012). As shown in Table 2, 55.81% of the raw reads were mapped to the MEGAHIT assembly, whereas only 10.72% were mapped to Howe's contigs and 13.03% to Minia's. The MEGAHIT assembly has 45.68% of the paired-end reads properly aligned, ~5-6 times higher than Minia and Howe. ~30% of the reads mapped to the MEGAHIT assembly with multiple hits, which implies a better ability to deal with similar sequences; this is essential in metagenomics assembly for recovering subspecies in ultra-diversified samples.

Table 2. Alignment statistics of MEGAHIT, Howe and Minia

                               MEGAHIT       Howe          Minia
  Total # of reads                      3,252,369,195
  Reads overall aligned (%)    55.81         10.72         13.03
  Total # of SE reads                     356,742,333
  SE aligned 1 time (%)        37.00         8.72          12.38
  SE aligned >1 time (%)       14.68         0.32          0.02
  Total # of PE reads                   1,447,813,431
  PE p. aligned 1 time (%)     36.78         7.41          9.48
  PE p. aligned >1 time (%)    8.90          0.20          0.01
  PE improperly aligned (%)    2.67          0.54          0.82

SE: single-end; PE: paired-end; p.: properly (defined as an aligned insert size within 3 standard deviations of the experimentally designed insert size).

4 CONCLUSIONS

MEGAHIT provides better completeness and contiguity while enabling efficient assembly of large and complex metagenomics data on a single node. MEGAHIT is available in both CPU-only and GPU-accelerated versions; with a GPU, the assembly time for the soil dataset is shortened from about 4 days to less than 2 days.

ACKNOWLEDGEMENTS

We thank S.M. Yiu, C.M. Leung and Y. Peng for their detailed explanation of IDBA-UD.

Conflict of Interest: none declared.

REFERENCES

Bowe, A., Onodera, T., Sadakane, K. and Shibuya, T. (2012) Succinct de Bruijn graphs. In: Algorithms in Bioinformatics. Springer-Verlag, pp. 225-235.
Chikhi, R. and Rizk, G. (2012) Space-efficient and exact de Bruijn graph representation based on a Bloom filter. In: Raphael, B. and Tang, J. (eds), Algorithms in Bioinformatics. Springer Berlin Heidelberg, pp. 236-248.
Howe, A.C., et al. (2014) Tackling soil diversity with the assembly of large, complex metagenomes. Proc. Natl. Acad. Sci. USA, 111, 4904-4909.
Langmead, B. and Salzberg, S.L. (2012) Fast gapped-read alignment with Bowtie 2. Nat. Methods, 9, 357-359.
Liu, C.-M., Luo, R. and Lam, T.-W. (2014) GPU-accelerated BWT construction for large collection of short reads. arXiv:1401.7457.
Luo, R., et al. (2012) SOAPdenovo2: an empirically improved memory-efficient short-read de novo assembler. GigaScience, 1, 18.
Peng, Y., et al. (2012) IDBA-UD: a de novo assembler for single-cell and metagenomic sequencing data with highly uneven depth. Bioinformatics, 28, 1420-1428.
Qin, J., et al. (2010) A human gut microbial gene catalogue established by metagenomic sequencing. Nature, 464, 59-65.
