REAL: An efficient REad ALigner for next generation sequencing reads

Kimon Frousios
King’s College London, Centre for Bioinformatics, Strand, London WC2R 2LS, England, United Kingdom
[email protected]

Costas S. Iliopoulos∗
King’s College London, Dept. of Computer Science, Strand, London WC2R 2LS, England, United Kingdom
[email protected]

Solon P. Pissis
King’s College London, Dept. of Computer Science, Strand, London WC2R 2LS, England, United Kingdom
[email protected]

Laurent Mouchard†
Université de Rouen, Dept. of Computer Science, LITIS EA 4108, 76821 Mont Saint Aignan, France
[email protected]

German Tischler‡
King’s College London, Dept. of Computer Science, Strand, London WC2R 2LS, England, United Kingdom
[email protected]
ABSTRACT
Motivation: The constant advances in sequencing technology are turning whole-genome sequencing into a routine procedure, resulting in massive amounts of data that need to be processed. Tens of gigabytes of data, in the form of short reads, need to be mapped back to reference sequences that are a few gigabases long. A first generation of short read alignment software successfully employed hash tables, and the current second generation uses the Burrows-Wheeler Transform, further improving mapping speed. However, there is still demand for faster and more accurate mapping.
Results: In this paper, we present REad ALigner, an efficient, accurate and consistent tool for aligning short reads obtained from next generation sequencing. It is based on a new, simple, yet efficient mapping algorithm that can match and outperform current BWT-based software.

Keywords
next generation sequencing, reads, mapping, string algorithms, pattern matching

∗Prof. Iliopoulos is also affiliated with Curtin University, Digital Ecosystems & Business Intelligence Institute, Centre for Stringology & Applications, GPO Box U1987 Perth WA 6845, Australia.
†Dr. Mouchard is also affiliated with King’s College London, Dept. of Computer Science, Strand, London WC2R 2LS, England, United Kingdom, and Curtin University, Digital Ecosystems & Business Intelligence Institute, Centre for Stringology & Applications, GPO Box U1987 Perth WA 6845, Australia.
‡Newton Fellow
1. INTRODUCTION
The traditional Sanger capillary sequencing methods [25, 26], developed in the mid 70’s, have been the workhorse technology for DNA sequencing for almost 30 years, and remain the go-to technique for high-quality sequencing. But sequencing technology has come a long way since the time when traditional techniques required many labs around the world to cooperate for over a decade in order to sequence the human genome for the first time. Nowadays, high-throughput Sequencing By Synthesis technologies have reduced the task of sequencing a whole genome to a matter of days or even hours, and the cost has decreased by orders of magnitude, making it an accessible experimental procedure for many labs [28]. This has opened the door for re-sequencing to become a more routine procedure, as it finds many applications in the detection of genetic variability among individuals. Thus, it can help us understand the extent of that variability, identify specific variants, alternative splicing sites and patterns, and epigenetic effects, and relate them to gene regulation and expression, as well as to diseases ([1], [29], [30], [22]). DNA sequencing is therefore quickly becoming a powerful tool in diagnostic medicine, and eventually personalized treatment [28]. The data resulting from a single sequencing experiment can be quite large, and it is not uncommon to have data from multiple experiments. This trend of increasing availability of sequencing data will continue as projects even more ambitious than the 1000 Genomes Project [1] start to materialize. According to their respective websites, typical output sizes for the three main next generation sequencing platforms are: over a million 400bp-long reads per 10-hour run
for the 454/Roche platform [3], up to 300GB per run for the ABI SOLiD platform [2], and up to 500 million 100bp-long paired-end reads for the Illumina GA [4]. In most cases these reads are too short to be directly assembled, especially in the presence of repetitive regions [19], so a reference sequence is usually required. In the case of human genome re-sequencing, the reference genome is approximately 3Gbp long. However, attempts to directly assemble short reads from simpler genomes have begun [27], and a first attempt on human data has also recently been reported [18]. Mapping so many short reads onto such a long reference sequence is a very challenging task that cannot be adequately carried out by traditional search and alignment algorithms [12] like BLAST [5] and FASTA [23], so a broad array of programs has been published to address this task, each placing emphasis on different aspects of the challenge. The different algorithms implement various combinations of innovations and trade-offs to address computing speed, system resource requirements, and the biological relevance and accuracy of the computed results. The need for more efficient ways to map large numbers of short sequences was first acknowledged in 2002, and was addressed by modifying the BLAST [5] algorithm to index the reference instead of the queries [12]. But really fast and efficient mapping software started with ELAND [9], the software bundled with the Illumina GA pipeline; with constant development to match the advances of the Illumina platform, it is still one of the fastest algorithms. MAQ [15] was released as an independent alternative to ELAND. It makes different use of base-calling qualities and introduced mapping qualities, but cannot do gapped alignment and has an upper limit on the length of reads it can map. Indexing the reads also potentially imposes a high demand on system resources, limiting the scalability of the method.
SOAP [16] indexes the reference for more efficient memory usage and offers some form of gapped alignment, while SeqMap [11] allows more flexibility for gaps and substitutions. Bowtie [13], SOAP2 [17] and BWA [14] (the successor of MAQ) make use of the Burrows-Wheeler Transform [8] to index the reference, and are able to achieve very good speed and relatively low memory usage. A number of other tools exist as well ([24], [29]), each combining solutions differently and to different extents. A comprehensive overview of read mapping software can be found in the review by Dalca and Brudno [10]. In this paper, we present REAL, an efficient read aligner for next generation sequencing reads. Our approach resembles the strategies presented in [6] and [7], for exact and approximate matching, respectively. It first preprocesses the genomic sequence, based on the length of the short reads, using word-level parallelism and radix sort. Then, instead of hashing the short reads, we convert each read to a unique arithmetic value, using a 2-bits-per-base encoding of the DNA alphabet, and use the pigeonhole principle, binary search, and simple word-level operations to map the reads to the reference. The rest of the paper is structured as follows. In Section 2, we present the basic definitions used throughout the paper, and in Section 3, we formally define the problem solved. Section 4 presents the proposed method for efficiently and accurately mapping the reads to a reference sequence. Section 5 provides experimental results demonstrating REAL’s performance on various real and simulated datasets, in comparison to SOAP2. Finally, we briefly conclude with some proposals for future work in Section 6.
2. PRELIMINARIES
A string is a sequence of zero or more symbols from an alphabet Σ. In this work, we consider the finite alphabet Σ for DNA sequences, where Σ = {A, C, G, T}. The length of a string x is denoted by |x|. The i-th symbol of a string x is denoted by x[i]. A string w is a factor of x if x = uwv, where u, v ∈ Σ∗. We denote by x[i . . . j] the factor of x that starts at position i and ends at position j. For two strings x and y, such that |x| = |y|, the Hamming distance δH(x, y) is the number of positions at which the two strings differ, i.e. have different characters. Formally,

δH(x, y) = ∑_{i=1}^{|x|} 1_{x[i]≠y[i]},  where 1_{x[i]≠y[i]} = 1 if x[i] ≠ y[i], and 0 otherwise.
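As a concrete illustration of the definition above, the Hamming distance can be sketched in a few lines of Python (a minimal illustration, not part of REAL's implementation):

```python
def hamming_distance(x, y):
    """Number of positions at which equal-length strings x and y differ."""
    if len(x) != len(y):
        raise ValueError("Hamming distance requires |x| == |y|")
    # Sum the indicator 1_{x[i] != y[i]} over all positions.
    return sum(1 for a, b in zip(x, y) if a != b)

# Example over the DNA alphabet {A, C, G, T}:
print(hamming_distance("ACGT", "AGGA"))  # 2 (positions 2 and 4 differ)
```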
3. PROBLEM DEFINITION
We denote the set of generated short reads as p1, p2, ..., pr, where r is a natural number (r > 10^7 in practice), and we call them patterns. The length of each pattern currently lies, typically, between 25 and 75 bp. Without loss of generality, we denote that length by ℓ. We are given a genomic sequence t = t[1..n], where n > 10^8, and a threshold k, 0 ≤ k ≤ ℓ, which denotes the number of allowed mismatches. The case k > 0 corresponds to the possibility that the pattern either contains a sequencing error, or reflects a small difference between a mutant and the reference genome. We formally define the problem of mapping tens of millions of short sequences to a reference genome as follows.
Problem 1. Find whether the pattern pi = pi[1..ℓ], for all 1 ≤ i ≤ r, with pi ∈ Σ∗, Σ = {A, C, G, T}, occurs with at most k mismatches in t = t[1..n], with t ∈ Σ∗. In particular, we are interested in reporting a pattern pi, for all 1 ≤ i ≤ r, in the case that pi occurs, with the least possible number of allowed mismatches, exactly once in t.
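Problem 1 can be stated operationally with a naive reference implementation, shown below as a hedged Python sketch for clarity only (0-based positions; REAL's actual algorithm, described in Section 4, avoids this brute-force scan):

```python
def map_reads(patterns, t, k):
    """For each pattern, report a position only if the pattern occurs in t
    exactly once at the least number of mismatches observed, and that
    number is at most k. Returns {pattern_index: (position, mismatches)}."""
    results = {}
    for idx, p in enumerate(patterns):
        ell = len(p)
        best, hits = None, []
        for i in range(len(t) - ell + 1):
            # Hamming distance between the pattern and the window t[i..i+ell-1].
            d = sum(1 for a, b in zip(p, t[i:i + ell]) if a != b)
            if d > k:
                continue
            if best is None or d < best:
                best, hits = d, [i]
            elif d == best:
                hits.append(i)
        # Report only unique best-scoring occurrences, as Problem 1 requires.
        if best is not None and len(hits) == 1:
            results[idx] = (hits[0], best)
    return results

print(map_reads(["ACGT", "TTT"], "ACGAACGT", 1))  # {0: (4, 0)}
```

Note how a pattern with two equally good occurrences, or none within k mismatches, is simply not reported.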
4. METHODS
REAL is a new read aligner, which addresses the problem of efficiently mapping p1, p2, ..., pr to t with at most k mismatches. To make the procedure efficient, we use word-level parallelism by transforming each factor of length ℓ of t into a signature. We obtain the signature σ(x) of a string x by transforming it into its binary equivalent, using the 2-bits-per-base encoding of the DNA alphabet, and storing its decimal value in a computer word (see Table 1 and Table 2). In addition, we adopt the idea of using the pigeonhole principle to split each read into ν fragments. The general idea for the k-mismatches problem is that inside any match of a pattern of length m with at most k errors, there must be at least m − k letters belonging to the pattern [21]. In our case, by requiring ν − k of the fragments (instead of all of them) to match perfectly on t, the non-candidates can be filtered out very quickly. For example, to admit two mismatches, a read can be split into four fragments. The two mismatches can exist in at most two of the fragments at the same time. Then, if we try all six combinations of two fragments as the seed, we can catch all hits with two mismatches.

Table 1: Binary encoding of the DNA alphabet
A → 00, C → 01, G → 10, T → 11

Table 2: Signature of an example string
String x: AGCAT; Binary form: 00 10 01 00 11; Signature σ(x): 147

Lemma 1. Given the number of fragments ν of a string x = {x^1, x^2, ..., x^ν}, and the number of allowed mismatches k, k < ν, the k mismatches cannot occur, at the same time, in more than k fragments of x; hence at least ν − k fragments of x are free of mismatches.

Proof. Immediate from the pigeonhole principle.

Without loss of generality, we choose ν such that ν − k = k. We denote by c_j(σ(x)) = {σ(x^{a_1}), σ(x^{a_2}), ..., σ(x^{a_{ν−k}})}, with a_1 < a_2 < ... < a_{ν−k}, the C(ν, ν−k) combinations of σ(x) = {σ(x^1), σ(x^2), ..., σ(x^ν)}, such that if c_{j+1}(σ(x)) = {σ(x^{b_1}), σ(x^{b_2}), ..., σ(x^{b_{ν−k}})}, then

∑_{i=1}^{ν−k} a_i ≤ ∑_{i=1}^{ν−k} b_i

• f(j): a function that, given j, returns q such that if c_j(σ(x)) = {σ(x^{a_1}), σ(x^{a_2}), ..., σ(x^{a_{ν−k}})} and d_q(σ(x)) = {σ(x^{b_1}), σ(x^{b_2}), ..., σ(x^{b_k})}, then
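To make the signature and fragment machinery concrete, here is a small Python sketch of the 2-bits-per-base encoding of Table 1 and of the C(ν, ν−k) seed combinations used in the example with ν = 4, k = 2. Function names are illustrative only, not REAL's actual API:

```python
from itertools import combinations

# Table 1: binary encoding of the DNA alphabet.
ENCODE = {"A": 0b00, "C": 0b01, "G": 0b10, "T": 0b11}

def signature(x):
    """Pack a DNA string into one integer, 2 bits per base (Table 2)."""
    sig = 0
    for base in x:
        sig = (sig << 2) | ENCODE[base]
    return sig

print(signature("AGCAT"))  # 0b0010010011 = 147, as in Table 2

def seed_combinations(nu, k):
    """All C(nu, nu-k) choices of nu-k fragment indices required to match
    exactly; by Lemma 1, in any k-mismatch hit at least one such choice
    consists entirely of mismatch-free fragments."""
    return list(combinations(range(1, nu + 1), nu - k))

print(len(seed_combinations(4, 2)))  # 6 combinations, as in the example
```

With the whole signature held in a computer word, comparing a fragment of a read against a fragment of the reference reduces to a single integer comparison, which is what enables the binary-search step mentioned in the introduction.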