Application Note

MEGAHIT: An ultra-fast single-node solution for large and complex metagenomics assembly via succinct de Bruijn graph

Dinghua Li1,†, Chi-Man Liu2,†, Ruibang Luo2,†, Kunihiko Sadakane3 and Tak-Wah Lam1,2,*

1HKU-BGI Bioinformatics Algorithms Research Laboratory & Department of Computer Science, University of Hong Kong, Hong Kong
2L3 Bioinformatics Limited, Hong Kong
3National Institute of Informatics, Chiyoda-ku, Tokyo, Japan

*To whom correspondence should be addressed. †Equal first authors.

Received on XXXXX; revised on XXXXX; accepted on XXXXX

Associate Editor: XXXXXXX

ABSTRACT

Summary: MEGAHIT is an NGS de novo assembler for assembling large and complex metagenomics data in a time- and cost-efficient manner. It finished assembling a soil metagenomics dataset of 252 Gbp in 44.1 hours and 99.6 hours on a single computing node with and without a GPU, respectively. MEGAHIT assembles the data as a whole, i.e., it avoids pre-processing steps such as partitioning and normalization, which might compromise the integrity of the results. MEGAHIT generates an assembly 3 times larger than the previous one, with a longer contig N50 and a longer average contig length; 55.8% of the reads could be aligned to the assembly, 4 times higher than for the previous assembly.

Availability: The source code of MEGAHIT is freely available at https://github.com/voutcn/megahit under the GPLv3 license.

Contact: [email protected], [email protected]

1 INTRODUCTION

Next-generation sequencing technologies have offered new opportunities to study metagenomics and to understand various microbial communities, such as those of the human gut, the rumen and soil. Constrained by the lack of reference genomes, de novo assembly of metagenomic short reads is a beneficial and almost inevitable step in metagenomics analysis (Qin et al., 2010). Such a step is, however, challenged by the heavy demand on computational resources, especially for the large and complex datasets encountered in environmental metagenomics (Howe et al., 2014). The soil metagenomics dataset recently published by Howe et al. comprises 252 Gbp even after the removal of low-quality bases. Although the dataset was successfully assembled, the pre-processing steps, including partitioning and digital normalization, are prone to loss of completeness. At present, no de novo assembler targeting metagenomics can assemble the data as a whole within a feasible amount of memory: SOAPdenovo2 (Luo et al., 2012) and IDBA-UD (Peng et al., 2012) are estimated to require at least 4 TB of memory to assemble the soil dataset. As ever larger volumes of metagenomics data are generated, we were motivated to develop MEGAHIT, an assembler that can assemble large and complex metagenomics data in a time- and cost-efficient manner, especially on a single-node server (standard memory capacity is currently bounded by 16 GB x 24 = 384 GB or 32 GB x 24 = 768 GB).

2 METHODS

MEGAHIT makes use of the succinct de Bruijn graph (SdBG) (Bowe et al., 2012), a compressed representation of the de Bruijn graph. An SdBG encodes a graph with m edges in O(m) bits and supports O(1)-time traversal from a vertex to its neighbors. In our implementation, we added a bit vector of length m that marks the validity of each vertex, making the graph editable, and an auxiliary array of 2kt bits (where k is the k-mer size and t is the number of zero-indegree vertices) that stores the sequences of the zero-indegree vertices, ensuring that the graph is lossless.
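The constant-time traversal of an SdBG rests on rank/select queries over bit vectors. To give a flavor of that machinery (a minimal sketch, not MEGAHIT's or Bowe et al.'s actual code), the following C++ bit vector answers rank queries in O(1) time after one linear preprocessing pass, using GCC/Clang's __builtin_popcountll. For brevity it keeps one counter per 64-bit word, which costs O(m) extra bits; production succinct structures use two-level directories to reduce this to o(m).

```cpp
#include <cstdint>
#include <vector>

// Rank-support bit vector: rank1(i) = number of 1-bits in positions [0, i),
// answered in O(1). Queries like this map an edge index of an SdBG to the
// index of its target vertex, which is what makes neighbor traversal O(1).
class RankBitVector {
public:
    explicit RankBitVector(const std::vector<bool>& bits)
        : n_(bits.size()), words_((n_ + 63) / 64, 0), prefix_(words_.size() + 1, 0) {
        for (size_t i = 0; i < n_; ++i)
            if (bits[i]) words_[i / 64] |= uint64_t(1) << (i % 64);
        for (size_t w = 0; w < words_.size(); ++w)   // cumulative popcount per word
            prefix_[w + 1] = prefix_[w] + __builtin_popcountll(words_[w]);
    }
    // Number of set bits strictly before position i (0 <= i <= size).
    uint64_t rank1(size_t i) const {
        size_t w = i / 64, r = i % 64;
        uint64_t partial =
            r ? __builtin_popcountll(words_[w] & ((uint64_t(1) << r) - 1)) : 0;
        return prefix_[w] + partial;
    }
private:
    size_t n_;
    std::vector<uint64_t> words_;    // the O(m)-bit payload
    std::vector<uint64_t> prefix_;   // rank directory (one counter per word)
};
```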

Despite its advantages, constructing an SdBG efficiently is non-trivial. MEGAHIT is rooted in a fast parallel algorithm for constructing an SdBG, which requires sorting, in reverse lexicographical order, the length-k prefixes (k-mers) of the set of (k+1)-mers that form the edges of the SdBG. MEGAHIT exploits the parallelism of a CUDA-enabled Graphics Processing Unit (GPU) by adapting the recent BWT-construction algorithm CX1 (Liu et al., 2014), which takes advantage of a GPU to sort the suffixes of a set of reads very efficiently. To cope with the relatively small onboard memory of a GPU, we adopt a block-wise strategy: the k-mers to be sorted are partitioned according to their length-l prefix (l = 8 in our implementation), and k-mers in consecutive partitions whose total size fits into the GPU memory are sorted together. Leveraging the massive parallelism of the GPU, MEGAHIT speeds up the construction 3-5 times over its CPU-only counterpart.
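A CPU-side sketch of this block-wise batching strategy is given below. The names, the 2-bit packing convention and the use of std::sort are illustrative stand-ins for MEGAHIT's GPU sorter, and plain integer order stands in for the reverse-lexicographic edge order used by the SdBG. Because the buckets partition the k-mers by their length-l prefix, sorting each batch independently and concatenating the buckets in order still yields a globally sorted sequence.

```cpp
#include <algorithm>
#include <cstdint>
#include <vector>

// (k+1)-mers are 2-bit-packed into the low 2*(k+1) bits of a uint64_t with
// the first base most significant (so k+1 <= 32 in this sketch).
constexpr int L = 8;                                   // prefix length for bucketing
constexpr size_t NUM_BUCKETS = size_t(1) << (2 * L);   // 4^L buckets

std::vector<std::vector<uint64_t>> bucket_by_prefix(
        const std::vector<uint64_t>& kmers, int k_plus_1) {
    std::vector<std::vector<uint64_t>> buckets(NUM_BUCKETS);
    int shift = 2 * (k_plus_1 - L);          // top 2L bits hold the first L bases
    for (uint64_t km : kmers)
        buckets[km >> shift].push_back(km);
    return buckets;
}

// Sort consecutive buckets in batches whose total size fits the memory budget
// (in MEGAHIT, each batch is shipped to the GPU's onboard memory).
void sort_in_batches(std::vector<std::vector<uint64_t>>& buckets,
                     size_t budget_bytes) {
    size_t begin = 0;
    while (begin < buckets.size()) {
        size_t end = begin, batch_bytes = 0;
        while (end < buckets.size() &&
               batch_bytes + buckets[end].size() * sizeof(uint64_t) <= budget_bytes)
            batch_bytes += buckets[end++].size() * sizeof(uint64_t);
        if (end == begin) ++end;             // a single oversized bucket: take it alone
        for (size_t b = begin; b < end; ++b)
            std::sort(buckets[b].begin(), buckets[b].end());
        begin = end;
    }
}
```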

Notably, sequencing errors are non-negligible: a single erroneous base gives rise to as many as k erroneous k-mer singletons, which would increase the memory consumption of MEGAHIT significantly. To cope with this, before graph construction all (k+1)-mers of the input reads are sorted and counted, and only the (k+1)-mers that appear at least d times (d = 2 by default) are kept as solid k-mers.
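A minimal, self-contained version of this filter is shown below, assuming the (k+1)-mers are already 2-bit-packed into 64-bit integers as above; MEGAHIT's actual counting runs over the GPU-sorted partitions described earlier.

```cpp
#include <algorithm>
#include <cstdint>
#include <vector>

// Keep only (k+1)-mers occurring at least d times (d = 2 by default in
// MEGAHIT). A single sequencing error corrupts up to k+1 of the (k+1)-mers
// covering it; almost all of these are singletons and are discarded here.
std::vector<uint64_t> solid_kmers(std::vector<uint64_t> kmers, int d = 2) {
    std::sort(kmers.begin(), kmers.end());
    std::vector<uint64_t> solid;
    for (size_t i = 0; i < kmers.size(); ) {
        size_t j = i;
        while (j < kmers.size() && kmers[j] == kmers[i]) ++j;  // run of equal k-mers
        if (j - i >= (size_t)d) solid.push_back(kmers[i]);     // keep if count >= d
        i = j;
    }
    return solid;
}
```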

This method removes a large number of spurious edges, but it is risky for metagenomics assembly, since many low-abundance species may have been sequenced at very low depth. We therefore introduce a mercy-k-mer strategy to recover these low-depth edges. Given two solid (k+1)-mers x and y from the same read, where x has no outdegree and y has no indegree, if none of the (k+1)-mers between x and y in that read is solid, they are all added to the de Bruijn graph as mercy k-mers. Mercy k-mers strengthen the contiguity of low-depth regions; without them, many authentic low-depth edges would be incorrectly identified as tips and removed from the graph.
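The sketch below implements a simplified form of this rule on plain strings. MEGAHIT operates on packed k-mers rather than hash sets, and the real rule additionally checks x's outdegree and y's indegree against the whole graph; here, only the within-read condition is checked.

```cpp
#include <string>
#include <unordered_set>
#include <vector>

// Simplified mercy-k-mer recovery for one read: whenever two solid
// (k+1)-mers in the read are separated only by non-solid ones, the
// intermediate (k+1)-mers are rescued so low-depth regions stay connected.
std::vector<std::string> find_mercy_kmers(
        const std::string& read, int k,
        const std::unordered_set<std::string>& solid) {
    std::vector<std::string> mercy;
    int len = k + 1;                       // SdBG edges are (k+1)-mers
    if ((int)read.size() < len) return mercy;
    int last_solid = -1;                   // position of the previous solid (k+1)-mer
    for (int i = 0; i + len <= (int)read.size(); ++i) {
        if (!solid.count(read.substr(i, len))) continue;
        if (last_solid >= 0 && i - last_solid > 1) {
            // every (k+1)-mer strictly between the two solid ones is
            // non-solid, so rescue them all as mercy k-mers
            for (int j = last_solid + 1; j < i; ++j)
                mercy.push_back(read.substr(j, len));
        }
        last_solid = i;
    }
    return mercy;
}
```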

On top of the SdBG, we implemented a multiple-k-mer strategy in MEGAHIT (Peng et al., 2012). The method iteratively builds SdBGs for assembly from a smaller k to a larger k: a small k-mer size is favorable for filtering erroneous edges and for filling gaps in low-coverage regions, whereas a large k-mer size helps to resolve repeats. In each iteration, MEGAHIT cleans potentially erroneous edges by removing tips, merging bubbles and removing edges of low local coverage; this last technique is specially introduced for metagenomics data, which suffer from non-uniform sequencing depths.
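Structurally, the iteration can be pictured as follows. The SdBG struct and the three helper functions are toy stand-ins (not MEGAHIT's real API) that only make the control flow concrete: contigs produced at one k are fed, together with the original reads, into the next, larger k.

```cpp
#include <string>
#include <vector>

// Toy stand-ins; MEGAHIT's real SdBG and cleaning routines are far more involved.
struct SdBG { std::vector<std::string> seqs; int k; };
static SdBG build_sdbg(const std::vector<std::string>& seqs, int k) { return {seqs, k}; }
static void clean_graph(SdBG&) { /* remove tips, merge bubbles, drop low local coverage */ }
static std::vector<std::string> extract_contigs(const SdBG& g) { return g.seqs; }

// The multiple-k-mer loop, e.g. k = 27, 37, ..., 87 as used in Section 3.
std::vector<std::string> multi_k_assemble(const std::vector<std::string>& reads,
                                          int k_min, int k_max, int k_step) {
    std::vector<std::string> contigs;
    for (int k = k_min; k <= k_max; k += k_step) {
        std::vector<std::string> input = reads;                    // reads every round
        input.insert(input.end(), contigs.begin(), contigs.end()); // plus last contigs
        SdBG g = build_sdbg(input, k);
        clean_graph(g);
        contigs = extract_contigs(g);   // becomes extra input for the next k
    }
    return contigs;
}
```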

The overall workflow of MEGAHIT is shown in Fig. 1.

Fig. 1. The workflow of MEGAHIT.

3 RESULTS

For evaluation, we selected the Iowa prairie soil metagenomics dataset, which comprises 3.3 billion reads totaling 252 billion base pairs (Howe et al., 2014). In Howe's study, the input was pre-processed by digital normalization and partitioning before Velvet was used to assemble each partition. We also benchmarked Minia (Chikhi and Rizk, 2012), a memory-efficient but rather primitive general-purpose assembler. The experiments were conducted on a server equipped with two 2.0 GHz 6-core Xeon CPUs, 384 GB of memory and a Tesla K40c GPU card. As for parameter settings, MEGAHIT used k-mer sizes from 27 to 87 with a step of 10; Minia used a single k-mer size of 31 (constrained by the minimum read length).

MEGAHIT's GPU-accelerated and CPU-only versions finished the assembly in 44.1 hours and 99.6 hours, respectively (Table 1). Minia, even with a single k-mer size, took ~331 hours, 7.5 times slower than MEGAHIT. MEGAHIT reached its peak memory consumption of 345 GB during (k+1)-mer counting and SdBG construction; this matches expectations, since the sorting module underlying these two steps automatically adjusts itself to fully utilize all available memory on a server. As described in Howe's paper, over 488 hours and 287 GB of memory were spent on digital normalization and partitioning alone (the time for the Velvet assembly was not reported). The programs, scripts and assembly results of MEGAHIT and Minia are available at http://www.bio8.cs.hku.hk/dataset/MEGAHIT/.

To be consistent with Howe's analysis, we only consider contigs ≥300 bp in the following. As shown in Table 1, the total size of the MEGAHIT assembly is at least 3 times larger than those of the other methods (4.90 Gbp versus 1.50 Gbp and 1.49 Gbp). MEGAHIT achieved the longest N50, the largest average contig length and the largest number of long contigs (≥1,000 bp) among the three, showing better assembly contiguity.

Table 1. Summary statistics for MEGAHIT, Howe and Minia

                          MEGAHIT      Howe        Minia
  Wall time (hr)          44.1*        >488        331.4
  Peak memory (GB)        345          287         29
  Total size (Mbp)        4,902        1,503       1,490
  Average length (bp)     633          485         505
  N50 (bp)                657          471         488
  Longest (bp)            184,210      9,397       32,679
  # of contigs            7,749,211    3,096,464   2,951,575
  # of contigs ≥1 kbp     841,257      129,513     158,402

* The wall time for the CPU-only counterpart of MEGAHIT is 99.6 hours.
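As a reference for the contiguity metrics in Table 1: N50 is the largest length N such that contigs of length at least N cover at least half of the total assembly size. A minimal helper (not part of MEGAHIT) makes this precise:

```cpp
#include <algorithm>
#include <cstdint>
#include <functional>
#include <numeric>
#include <vector>

// N50 of a set of contig lengths: walk the lengths from longest to shortest
// and return the first length at which the running sum reaches half the total.
uint64_t n50(std::vector<uint64_t> lengths) {
    std::sort(lengths.begin(), lengths.end(), std::greater<uint64_t>());
    uint64_t total = std::accumulate(lengths.begin(), lengths.end(), uint64_t(0));
    uint64_t running = 0;
    for (uint64_t len : lengths) {
        running += len;
        if (2 * running >= total) return len;
    }
    return 0;  // empty input
}
```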

To inspect the completeness and correctness of the assemblies, the raw reads were aligned back to the assembled contigs using Bowtie2 (Langmead and Salzberg, 2012). As shown in Table 2, 55.81% of the raw reads were mapped to the MEGAHIT assembly, whereas only 10.72% were mapped to Howe's contigs and 13.03% to Minia's. The MEGAHIT assembly has 45.68% of the paired-end reads properly aligned, ~5-6 times higher than Minia and Howe. ~30% of the reads mapped to the MEGAHIT assembly with multiple hits, which implies a better ability to deal with similar sequences; this is essential in metagenomics assembly for recovering subspecies in ultra-diversified samples.

Table 2. Alignment statistics of MEGAHIT, Howe and Minia

                               MEGAHIT       Howe          Minia
  Total # of reads                      3,252,369,195
  Reads overall aligned (%)    55.81         10.72         13.03
  Total # of SE reads                     356,742,333
  SE aligned 1 time (%)        37.00         8.72          12.38
  SE aligned >1 time (%)       14.68         0.32          0.02
  Total # of PE reads                   1,447,813,431
  PE p. aligned 1 time (%)     36.78         7.41          9.48
  PE p. aligned >1 time (%)    8.90          0.20          0.01
  PE improperly aligned (%)    2.67          0.54          0.82

SE: single-end; PE: paired-end; p.: properly (defined as an aligned insert size within 3 standard deviations of the experimentally designed insert size).

4 CONCLUSIONS

MEGAHIT provides better completeness and contiguity while enabling efficient assembly of large and complex metagenomics data on a single node. MEGAHIT is available in both CPU-only and GPU-accelerated versions; with a GPU, the assembly time for the soil dataset is shortened from about 4 days to less than 2 days.

ACKNOWLEDGEMENTS

We thank S.M. Yiu, C.M. Leung and Y. Peng for their detailed explanation of IDBA-UD.

Conflict of Interest: none declared.

REFERENCES

Bowe, A., Onodera, T., Sadakane, K. and Shibuya, T. (2012) Succinct de Bruijn graphs. In: Algorithms in Bioinformatics. Springer-Verlag, pp. 225-235.
Chikhi, R. and Rizk, G. (2012) Space-efficient and exact de Bruijn graph representation based on a Bloom filter. In: Raphael, B. and Tang, J. (eds), Algorithms in Bioinformatics. Springer Berlin Heidelberg, pp. 236-248.
Howe, A.C., et al. (2014) Tackling soil diversity with the assembly of large, complex metagenomes. Proc. Natl. Acad. Sci. USA, 111, 4904-4909.
Langmead, B. and Salzberg, S.L. (2012) Fast gapped-read alignment with Bowtie 2. Nat. Methods, 9, 357-359.
Liu, C.-M., Luo, R. and Lam, T.-W. (2014) GPU-accelerated BWT construction for large collection of short reads. arXiv:1401.7457.
Luo, R., et al. (2012) SOAPdenovo2: an empirically improved memory-efficient short-read de novo assembler. GigaScience, 1, 18.
Peng, Y., et al. (2012) IDBA-UD: a de novo assembler for single-cell and metagenomic sequencing data with highly uneven depth. Bioinformatics, 28, 1420-1428.
Qin, J., et al. (2010) A human gut microbial gene catalogue established by metagenomic sequencing. Nature, 464, 59-65.
