2015 IEEE International Conference on Bioinformatics and Biomedicine (BIBM)
Error Correction and DeNovo Genome Assembly for the MinION Sequencing Reads mixing Illumina Short Reads
Mehdi Kchouk & Mourad Elloumi
Laboratory of Technologies of Information and Communication and Electrical Engineering (LaTICE), National Superior School of Engineers of Tunis (ENSIT),
University of Tunis, Tunisia
[email protected],
[email protected]
Abstract-The new MinION sequencer from Oxford Nanopore Technologies is characterized by its small size and is powered from the USB 3.0 port of a laptop computer. This sequencer produces long reads at a low production cost and with high throughput. However, long reads generated by the MinION sequencer have a high error rate (about 25% [1]), which degrades the quality of the results obtained by analyzing them. One solution for correcting long reads is to exploit the high coverage and high quality of short reads generated by second-generation sequencing technology. Here, we present MiRCA (MinION Reads Correction Algorithm), a hybrid pipeline that detects and corrects errors in MinION long reads using pre-assembled Illumina MiSeq short reads; we then use the Overlap-Layout-Consensus (OLC) approach to assemble the corrected reads. MiRCA is able to correct deletion, insertion and substitution errors by forming a multiple sequence alignment, and does not require a large memory space. We use the Saccharomyces cerevisiae W303 genome and the Escherichia coli K-12 MG1655 bacterial genome to test the efficiency of our pipeline.

Keywords-De Novo Assembly, Error Correction, Oxford Nanopore MinION sequencing, Illumina, Algorithm.

I. INTRODUCTION

New sequencing technologies, also called Next Generation Sequencing (NGS) technologies (second, third and even fourth generations), include Illumina sequencing technology [2] for short reads and Oxford Nanopore technology, known for its new MinION sequencing platform [3]. They produce data at a higher throughput and a lower cost. Despite their low cost and high throughput, NGS technologies have their own drawbacks, mainly sequencing errors, which are one of the main problems in analyzing high-throughput sequencing data. Error correction is used to address this problem.

Error correction is an important task in the analysis and manipulation of NGS data. It consists of detecting and repairing errors in reads, using the high read coverage provided by NGS sequencers to correct the erroneous bases. According to the literature, there are three main approaches to error correction [4]: the k-spectrum based approach, the suffix tree and array based approach, and the alignment based approach. There are also several error correction algorithms for short reads and for long reads. In our case, we look at the hybrid error correction algorithms for long reads, including PacBioToCA [5], LSC [6], proovread [7] and LoRDEC [8], which correct PacBio reads, and Nanocorr [9], which corrects the long reads generated by the MinION sequencer.

Here, we propose a new hybrid algorithm for error correction of MinION long reads. It uses pre-assembled short reads as a reference to correct the long reads, and it supports substitution, insertion and deletion errors. Finally, we assemble the corrected long reads using an OLC approach.

II. METHODS

Considering the limitations of long-read tools and the high error rates generated by the sequencing platforms, we present MiRCA (MinION Reads Correction Algorithm), which takes as input a set of long reads and a set of paired-end short reads. There are four main steps to correct the MinION long reads in our pipeline:

(i) Cleaning Data: the reads obtained directly from the sequencing platform are contaminated and are likely to cause alignment errors, so they need to be cleaned to obtain proper reads. This step consists of eliminating the reads that are heavily contaminated and cannot be corrected.

(ii) Pre-assembly Step: next, we take the high-quality short reads generated by the Illumina sequencer and pre-assemble them into longer sequences called "contigs" using the de Bruijn graph approach. Pre-assembling the short reads into contigs is intended to preserve all the information in the short reads and to allow a better interpretation of them.

(iii) Error Correction Step: next, we form multiple alignments in which each long read is defined as the consensus sequence. We seek the contigs that overlap with each long read, or that share at least k-mer positions with the base sequence; the multiple alignment is created by aligning these contigs to the long read, and errors are corrected using a majority voting scheme. After the contigs are aligned to the long reads, the consensus (the long read) is updated according to the types of errors found in it.

(iv) De Novo Assembly Step: after the correction step, the overlaps between the highly accurate long reads can be easily detected and the reads can be assembled. We use the OLC approach to detect the overlaps between long reads and build the consensus sequence of the genome.

III. DATASETS

To test our pipeline, we use the S. cerevisiae genome and the Escherichia coli K-12 MG1655 bacterial genome. The MinION long reads of S. cerevisiae W303 are provided by [9], and the MinION long reads of Escherichia coli are provided by [10]. The MiSeq Illumina short reads of S. cerevisiae and Escherichia coli are from [9].

IV. REFERENCES

[1] Laehnemann D, Borkhardt A, McHardy AC. Denoising DNA deep sequencing data - high-throughput sequencing errors and their correction. Brief Bioinform. (2015).
[2] Bentley DR, Balasubramanian S, Swerdlow HP, et al. Accurate whole human genome sequencing using reversible terminator chemistry. Nature 2008;456:53-9.
[3] Mikheyev AS, Tin MMY. A first look at the Oxford Nanopore MinION sequencer. Molecular Ecology Resources (2014).
[4] Yang X, Chockalingam SP, Aluru S. A survey of error correction methods for next-generation sequencing. Brief Bioinform 2013;14:56-66.
[5] Koren S, Schatz MC, Walenz BP, et al. Hybrid error correction and de novo assembly of single-molecule sequencing reads. Nat Biotechnol;30:693-700. (2012).
[6] Au KF, Underwood JG, Lee L, et al. Improving PacBio long read accuracy by short read alignment. PLoS ONE;7:e46679. (2012).
[7] Hackl T, Hedrich R, Schultz J, et al. proovread: large-scale high-accuracy PacBio correction through iterative short read consensus. Bioinformatics;30:3004-11. (2014).
[8] Salmela L, Rivals E. LoRDEC: accurate and efficient long read error correction. Bioinformatics;30:3506-14. (2014).
[9] Goodwin S, Gurtowski J, Ethe-Sayers S, et al. Oxford Nanopore sequencing and de novo assembly of a eukaryotic genome. bioRxiv;013490. (2015).
[10] Loman NJ, Quick J, Simpson JT. A complete bacterial genome assembled de novo using only nanopore sequencing data. Nature Methods 12, 733-735 (2015).

978-1-4673-6799-8/15/$31.00 ©2015 IEEE
1785
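
As an illustration of the majority voting scheme mentioned in error correction step (iii), the following minimal Python sketch corrects one long read from contigs that have already been aligned to it. It is not the MiRCA implementation: the function name `majority_vote_consensus`, the gap symbol `-`, the no-coverage symbol `.`, the tie-breaking rule and the toy alignment are all assumptions made for this example. Substitutions are fixed when the contigs agree on a different base, insertions in the long read are removed when the contigs vote for a gap, and deletions are filled when the contigs vote for a base in a column where the long read has a gap.

```python
from collections import Counter

GAP = "-"     # alignment gap (assumed symbol)
NO_COV = "."  # contig does not cover this column (assumed symbol)


def majority_vote_consensus(long_read_row, contig_rows):
    """Correct one aligned long read by column-wise majority vote over contigs.

    All rows belong to the same multiple alignment and must have equal length.
    Hypothetical helper written for illustration only.
    """
    if any(len(row) != len(long_read_row) for row in contig_rows):
        raise ValueError("all alignment rows must have the same length")

    corrected = []
    for col, read_symbol in enumerate(long_read_row):
        # Only contigs that actually cover this column are allowed to vote.
        votes = Counter(row[col] for row in contig_rows if row[col] != NO_COV)
        chosen = read_symbol  # default: no coverage or a tie keeps the long read
        if votes:
            best = max(votes.values())
            winners = [sym for sym, n in votes.items() if n == best]
            if len(winners) == 1:
                chosen = winners[0]
        # A gap in the consensus means "no base here", so it is dropped.
        if chosen != GAP:
            corrected.append(chosen)
    return "".join(corrected)


if __name__ == "__main__":
    # Toy alignment: the long read has a substitution (column 2),
    # an inserted base (column 5) and a deleted base (column 7).
    long_read = "ACTTAAC-T"
    contigs = [
        "ACGTA-CGT",
        "ACGTA-...",
        "...TA-CGT",
    ]
    print(majority_vote_consensus(long_read, contigs))  # prints ACGTACGT
```

In this sketch the long read itself does not vote; it is only used as a fallback where no contig covers a column or where the contigs are tied. A real pipeline would have to choose this policy explicitly, since it determines how strongly the short-read evidence overrides the long read.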