
2015 IEEE International Conference on Bioinformatics and Biomedicine (BIBM)

Error Correction and DeNovo Genome Assembly for the MinION Sequencing Reads mixing Illumina Short Reads

Mehdi Kchouk & Mourad Elloumi

Laboratory of Technologies of Information and Communication and Electrical Engineering (LaTICE), National Superior School of Engineers of Tunis (ENSIT), University of Tunis, Tunisia

Abstract-The new MinION sequencer from Oxford Nanopore Technologies is characterized by its small size and is powered from the USB 3.0 port of a laptop computer. This sequencer produces long reads at low cost and with high throughput. However, long reads generated by the MinION sequencer have a high error rate (about 25% [1]), which degrades the quality of the results obtained by analyzing them. One solution is to correct the long reads using the high coverage and high quality of short reads generated by second generation sequencing technology. Here, we present MiRCA (MinION Reads Correction Algorithm), a hybrid pipeline that detects and corrects errors in MinION long reads using pre-assembled Illumina MiSeq short reads; we then use the Overlap-Layout-Consensus (OLC) approach to assemble the corrected reads. MiRCA corrects deletion, insertion and substitution errors by forming a multiple sequence alignment, and it does not require a large memory space. We use the Saccharomyces cerevisiae W303 genome and the Escherichia coli K-12 MG1655 bacterial genome to test the efficiency of our pipeline.

Keywords-De Novo Assembly, Error Correction, Oxford Nanopore MinION sequencing, Illumina, Algorithm.

I. INTRODUCTION

New sequencing technologies, also called Next Generation Sequencing (NGS) technologies (second, third and even fourth generation), have appeared, including Illumina sequencing technology [2] for short reads and Oxford Nanopore technology, known for its new MinION sequencing platform [3]. These technologies produce data at a higher throughput and a lower cost. Despite their low cost and high throughput, NGS technologies have their own drawbacks, mainly sequencing errors, which are one of the main problems in analyzing high-throughput sequencing data. Error correction is used to address this problem.

Error correction is an important task in the analysis and manipulation of NGS data. It consists of detecting and repairing errors in reads, using the high read coverage provided by NGS sequencers to correct the erroneous bases. According to the literature, there are three main approaches to error correction [4]: the k-spectrum based approach, the suffix tree and array based approach, and the alignment based approach. There are also several error correction algorithms for short reads and for long reads. Here we consider the hybrid error correction algorithms for long reads, including PacBioToCA [5], LSC [6], proovread [7] and LoRDEC [8], which correct PacBio reads, and Nanocorr [9], which corrects the long reads generated by the MinION sequencer.

We propose a new hybrid algorithm for error correction of MinION long reads that uses pre-assembled short reads as a reference and supports substitution, insertion and deletion errors. Finally, we assemble the corrected long reads using an OLC approach.

II. METHODS

Considering the limitations of long read tools and the high error rates generated by the sequencing platforms, we present MiRCA (MinION Reads Correction Algorithm), which takes as input a set of long reads and a set of paired-end short reads. There are four main steps to correct the MinION long reads in our pipeline:

(i) Cleaning Data: the reads obtained directly from the sequencing platform are contaminated and are likely to cause alignment errors, so they need to be cleaned to obtain proper reads. This step eliminates the very contaminated reads that cannot be corrected.

(ii) Pre-assembly Step: next, we take the high quality short reads generated by the Illumina sequencer and pre-assemble them into longer sequences called "contigs" using the de Bruijn graph approach. Pre-assembling the short reads into contigs is intended to preserve the information contained in the short reads while allowing a better interpretation of them.

(iii) Error Correction Step: the next step is to form multiple alignments in which each long read is defined as the consensus (base) sequence. For each long read, we seek the contigs that overlap it or that share k-mer positions with it. The multiple alignment is created by aligning these contigs to the long read, and errors are corrected using a majority voting scheme. After the contigs are aligned to the long read, the consensus (the long read) is updated according to the types of errors found in it.

(iv) De Novo Assembly Step: after the correction step, the overlaps between the highly accurate long reads can be easily detected and the reads can be assembled. We use the OLC approach to detect the overlaps between long reads and build the consensus sequence of the genome.

III. DATASETS

To test our pipeline, we use the S. cerevisiae W303 genome and the Escherichia coli K-12 MG1655 bacterial genome. The MinION long reads of S. cerevisiae W303 are provided by [9], and the MinION long reads of Escherichia coli are provided by [10]. The Illumina MiSeq short reads of S. cerevisiae and Escherichia coli are from [9].

IV. REFERENCES

[1] Laehnemann D, Borkhardt A, McHardy AC. Denoising DNA deep sequencing data-high-throughput sequencing errors and their correction. Brief Bioinform. (2015)
[2] Bentley DR, Balasubramanian S, Swerdlow HP, et al. Accurate whole human genome sequencing using reversible terminator chemistry. Nature;456:53-9. (2008)
[3] Mikheyev AS, Tin MMY. A first look at the Oxford Nanopore MinION sequencer. Molecular Ecology Resources. (2014)
[4] Yang X, Chockalingam SP, Aluru S. A survey of error correction methods for next-generation sequencing. Brief Bioinform;14:56-66. (2013)
[5] Koren S, Schatz MC, Walenz BP, et al. Hybrid error correction and de novo assembly of single-molecule sequencing reads. Nat Biotechnol;30:693-700. (2012)
[6] Au KF, Underwood JG, Lee L, et al. Improving PacBio long read accuracy by short read alignment. PLoS One;7:e46679. (2012)
[7] Hackl T, Hedrich R, Schultz J, et al. proovread: large-scale high accuracy PacBio correction through iterative short read consensus. Bioinformatics;30:3004-11. (2014)
[8] Salmela L, Rivals E. LoRDEC: accurate and efficient long read error correction. Bioinformatics;30:3506-14. (2014)
[9] Goodwin S, Gurtowski J, Ethe-Sayers S, et al. Oxford Nanopore sequencing and de novo assembly of a eukaryotic genome. bioRxiv;013490. (2015)
[10] Loman NJ, Quick J, Simpson JT. A complete bacterial genome assembled de novo using only nanopore sequencing data. Nature Methods;12:733-735. (2015)

978-1-4673-6799-8/15/$31.00 ©2015 IEEE
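As an illustration of the error correction step (iii) in Section II, the following is a minimal Python sketch of k-mer based contig selection followed by majority voting over a multiple alignment. It is not the MiRCA implementation: the function names (`kmers`, `candidate_contigs`, `majority_vote_correct`), the `min_shared` threshold, the `.` marker for columns a contig does not cover, and the tie rule (keep the long read base unless a strict majority of contigs disagrees) are all assumptions made for this example. The sketch also assumes the alignment has already been computed by some aligner and is given as equal-length rows with `-` for gaps.

```python
from collections import Counter

GAP = "-"        # alignment gap character (assumed convention)
UNCOVERED = "."  # contig does not span this column (assumed convention)


def kmers(seq, k):
    """Return the set of all k-mers contained in seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def candidate_contigs(long_read, contigs, k, min_shared=1):
    """Keep the contigs sharing at least min_shared k-mers with the long read.

    min_shared is a hypothetical parameter; the paper does not give a threshold.
    """
    read_kmers = kmers(long_read, k)
    return [c for c in contigs if len(read_kmers & kmers(c, k)) >= min_shared]


def majority_vote_correct(aligned_read, aligned_contigs):
    """Correct one long read from a precomputed multiple alignment.

    aligned_read and every row of aligned_contigs must have the same length.
    In each column the contigs covering it vote:
      - a different base wins   -> substitution is corrected
      - a gap wins              -> an inserted base is removed from the read
      - a base wins over a gap  -> a deleted base is restored in the read
    Without a strict majority, the long read's own base is kept.
    """
    if any(len(row) != len(aligned_read) for row in aligned_contigs):
        raise ValueError("all alignment rows must have the same length")

    corrected = []
    for col, read_base in enumerate(aligned_read):
        votes = Counter(
            row[col] for row in aligned_contigs if row[col] != UNCOVERED
        )
        winner = read_base
        if votes:
            base, count = votes.most_common(1)[0]
            if 2 * count > sum(votes.values()):  # strict majority only
                winner = base
        if winner != GAP:  # gaps are dropped from the corrected read
            corrected.append(winner)
    return "".join(corrected)


# Toy example: the read has one substitution (column 2), one inserted base
# (column 4) and one deleted base (column 7) relative to the contigs.
read_row = "ACTTTAC-GA"
contig_rows = [
    "ACGT-ACGGA",
    "ACGT-AC...",
    "..GT-ACGGA",
]
corrected_read = majority_vote_correct(read_row, contig_rows)
```

In this toy alignment the three contigs agree in every column they cover, so `corrected_read` is `ACGTACGGA`. Keeping the read base on ties is a conservative choice for the sketch; a real pipeline might instead weight votes by contig coverage or alignment quality.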