Algorithm to search for genomic rearrangements
Katarzyna Nałęcz-Charkiewicz, Robert Nowak
Faculty of Electronics and Information Technology, Warsaw University of Technology, Warsaw, Poland
ABSTRACT The aim of this article is to discuss the issue of comparing nucleotide sequences in order to detect chromosomal rearrangements (for example, in the study of genomes of two cucumber varieties, Polish and Chinese).
Two basic algorithms for detecting rearrangements are described: the Smith-Waterman algorithm and a new method of searching for genetic markers in combination with the Knuth-Morris-Pratt algorithm. A computer program in client-server architecture was developed. The properties of the algorithms were examined on the Escherichia coli and Arabidopsis thaliana genomes, and the algorithms are prepared to compare two cucumber varieties, Polish and Chinese. The results are promising and further work is planned.
Keywords: genome rearrangements, next generation sequencing, computer program, web application, Smith-Waterman, Knuth-Morris-Pratt, sequence alignment
1. INTRODUCTION
"Nothing in bioinformatics makes sense if considered in isolation of evolution"∗, state Attwood and Higgs in the initial chapter of their book [1]. These words perfectly reflect the importance of the subject of this paper: comparing the genomes of organisms, in particular in order to find rearrangements which occurred during evolution. Because of the extremely rapid development of methods for the analysis of biological data in recent years [2] (such as the invention of so-called second generation sequencing methods in 2008, far cheaper and faster than the previously used Sanger method), especially in comparative genomics, the development of appropriate bioinformatics tools enabling effective and efficient analysis of protein and DNA sequences seems to be of particular importance.
1.1 The importance of the study of genomes
There are many reasons for comparing protein and nucleotide sequences. It is the first stage of structural and functional research on newly found sequences. Because about 70% of genes from newly discovered genomes are homologous† to genes in other, distant genomes [3], it is possible to transfer functional annotations from better-known organisms by searching for similar gene fragments. Another application is the study of the genomes of two organisms: detecting similarities and, based on this, deciding whether they have a common origin. This in turn enables, inter alia, the construction of so-called phylogenetic trees. They provide a graphical representation of the evolutionary history of homologous organisms, taking into account the order of separation of their evolutionary lines. The comparison of entire genomes of different organisms, including the number and location of individual genes, also allows the degree of conservation of genomes to be determined. This allows us to explore the mechanisms of evolution and gene transfer between them. Another application of genome comparison is to determine what is called a minimal genome, which is the smallest set of genes necessary for sustaining the life of the organism [4]. Finding it will help to identify genes which are essential for the metabolic pathways necessary for cell survival.
Further author information: (Send correspondence to Robert Nowak) Robert Nowak: E-mail: [email protected]
∗ These words paraphrase the famous geneticist and evolutionary biologist Theodosius Dobzhansky, who said: "Nothing in biology makes sense if considered in isolation of evolution."
† Two species are known as homologous if they have a common ancestor.
Photonics Applications in Astronomy, Communications, Industry, and High-Energy Physics Experiments 2013, edited by Ryszard S. Romaniuk, Proc. of SPIE Vol. 8903, 890319 © 2013 SPIE · CCC code: 0277-786X/13/$18 · doi: 10.1117/12.2032639 Proc. of SPIE Vol. 8903 890319-1 Downloaded From: http://proceedings.spiedigitallibrary.org/ on 10/29/2013 Terms of Use: http://spiedl.org/terms
1.2 DNA sequencing
DNA sequencing involves reading the order of nucleotide pairs‡ in a deoxyribonucleic acid molecule. The origins of sequencing date back to 1984, when scientists managed for the first time to obtain the sequence of a viral genome (Epstein-Barr virus). Complete genome sequences of cellular life forms did not become available until several years later. The publication of the first two bacterial genomes, Haemophilus influenzae and Mycoplasma genitalium, had historical significance. Another great success was the publication (by the Human Genome Sequencing Consortium) of the human genome on February 15, 2001 in the journal Nature.
Starting from 2003, the reduction of sequencing costs outpaced projections based on Moore's Law, which assumes a two-fold increase in computing power every two years [5]. The significant acceleration of cost reductions was caused by the invention and implementation of second generation methods. It should be mentioned that third generation sequencing techniques are being developed [6].
1.3 Rearrangements
Genomic rearrangements are changes that occur in DNA sequences in the course of evolution and that determine the emergence of new varieties of a given species, as well as new species (usually over millions of years). Mutations are therefore a source of genetic variability of organisms. The basic types of rearrangements are as follows:
• insertion: insertion of one or more nucleotide pairs,
• deletion: deletion of one or more nucleotide pairs,
• inversion: inverting a part of the chromosome by 180°,
• transposition: displacement of a transposon to another position in the genome,
• translocation: exchange of parts between two chromosomes.
Figure 1. Inversion: inverting a part of the chromosome by 180°.
Figure 2. Transposition: displacement of a transposon to another position in the genome.
Figure 3. Translocation: exchange of parts between two chromosomes.
‡ In each pair, there are two out of the four nitrogenous bases: adenine, guanine, cytosine, and thymine.
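The three structural rearrangement types listed above can be illustrated with a small Python sketch. It is a simplification under stated assumptions: a chromosome is modelled as a list of labelled blocks (A, B, C, D, as in the figures), the function names are hypothetical, and the inversion only reverses block order, whereas on a real DNA strand the inverted segment would also appear as its complement.

```python
# Hypothetical illustration: a chromosome is modelled as a list of labelled blocks.

def inversion(blocks, start, end):
    """Reverse the order of blocks[start:end] (the segment turned by 180 degrees)."""
    return blocks[:start] + blocks[start:end][::-1] + blocks[end:]


def transposition(blocks, start, end, target):
    """Cut blocks[start:end] out and re-insert it at index `target` of the remainder."""
    segment = blocks[start:end]
    rest = blocks[:start] + blocks[end:]
    return rest[:target] + segment + rest[target:]


def translocation(chrom1, cut1, chrom2, cut2):
    """Exchange the tails of two chromosomes after the given cut points."""
    return chrom1[:cut1] + chrom2[cut2:], chrom2[:cut2] + chrom1[cut1:]


chromosome = ["A", "B", "C", "D"]
print(inversion(chromosome, 1, 3))         # ['A', 'C', 'B', 'D']
print(transposition(chromosome, 1, 2, 2))  # ['A', 'C', 'B', 'D']
print(translocation(["A", "B", "C"], 1, ["D", "E", "F"], 2))
# (['A', 'F'], ['D', 'E', 'B', 'C'])
```

On block labels alone, an inversion of two blocks and a transposition of one block over its neighbour give the same order, so telling them apart requires looking at the orientation of the sequence inside the blocks.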
It should be noted that the sequence comparison algorithms described in the following subsections only allow us to detect that a rearrangement occurred in one of the analyzed genomes (for instance, we can say that either an insertion occurred in one of the sequences or a deletion occurred in the other). We cannot say in which of them the change from the original common ancestor took place. In general, however, detection of this type is possible; it is, for example, the basis on which phylogenetic trees are constructed.
1.4 The genome of the cucumber
The study of genomic rearrangements, to which this work is an introduction, will be carried out on real data from the nuclear genomes of two varieties of cucumber. The genome of the Polish variety of cucumber was sequenced in 2011 by the Polish Consortium of Cucumber Genome Sequencing, in particular by scientists from the Department of Plant Genetics, Breeding and Biotechnology, Faculty of Horticulture and Landscape Architecture at the Warsaw University of Life Sciences (SGGW). Its size is approximately 367 million base pairs.
Figure 4. Schematic representation of chromosomal rearrangements between the two varieties of cucumber. Line B10 (Polish) is on the top, line 9930 (Chinese) is at the bottom. Source: [7]
The second of the tested varieties, the Chinese cucumber (line 9930), was sequenced by scientists from China. Comparisons carried out so far have shown many differences in the composition of the genes of both plants which are involved in key processes, such as photosynthesis, respiration or sugar metabolism. So far, research has been limited to rearrangements within corresponding chromosomes of both varieties. Results of these studies, published in [7], are shown in Figure 4. Most of the observed rearrangements were inversions and translocations. The issue still remaining unexplored is to carry out a similar comparison applied to different chromosomes (that is, comparing each chromosome of one variety with all chromosomes of the other).
2. THE ALGORITHMS
2.1 Smith-Waterman algorithm
The algorithm described below not only determines a numerical measure of the alignment, but also determines the best sequence alignment. By sequence alignment we mean a mapping of two or more sequences such that each element is aligned either to an element or to a gap. Alignment to a gap occurs when there was an insertion in one sequence (or, equivalently, a deletion in the other). Alignment of two gaps to each other is not allowed.
This raises the question: what exactly do we mean by the best alignment? Figure 5 depicts several possible alignments of two DNA sequences, each of which satisfies the above limitations. In order to choose the optimal one among them, it is necessary to determine the quality of the alignments. The presented algorithm does this by summing the values assigned to each of the stages of creating the alignment path.

c g g g t a t c c a a
c c c t a g g t c c c a

c g g g t a - - t c c a a
c c c - t a g g t c c c a

c g g g t a - - t - c c a a
c c c - t a g g t c c c a

Figure 5. Different possible alignments of two DNA sequences.
In the described algorithm an important role is played by the scoring system, which consists of the sequence element matching rate and the gap penalty function. When evaluating the alignment of two DNA sequences, the scoring rules are very simple: 1 point if the nucleotides are identical, 0 otherwise§. The gap penalty function $W(l)$ is also an important part of the scoring system; $l$ means the length of the gap. In the simplest case it is a linear function $W(l) = gl$. However, such an approach does not reflect reality well, because the occurrence of a gap of length 3 is not three times less likely than the occurrence of a gap of length 1. In other words, the probability that a gap opens is much smaller than the probability of its extension, provided it has already occurred. Taking this into account requires an affine function instead of the linear one. It has the form
\[
W(l) = g_{open} + g_{ext}(l - 1),
\]
where $g_{open}$ is the penalty for gap occurrence and $g_{ext} < g_{open}$ is the penalty for the extension of an already existing gap.
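The difference between the two forms of $W(l)$ can be seen in a short Python sketch. The numeric values of g, g_open and g_ext below are arbitrary assumptions chosen for illustration, not parameters taken from the described program.

```python
def linear_gap_penalty(length, g=1.0):
    """W(l) = g * l: every gap position costs the same."""
    return g * length


def affine_gap_penalty(length, g_open=2.0, g_ext=0.5):
    """W(l) = g_open + g_ext * (l - 1): opening is expensive, extending is cheap."""
    if length <= 0:
        return 0.0
    return g_open + g_ext * (length - 1)


for gap_length in (1, 2, 3):
    print(gap_length, linear_gap_penalty(gap_length), affine_gap_penalty(gap_length))
```

With these assumed values a gap of length 1 costs 1.0 under the linear function and 2.0 under the affine one, while a gap of length 3 costs 3.0 in both cases: the affine function discourages opening many short gaps but tolerates a single longer one.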
The Smith-Waterman algorithm (1981) finds an optimal local alignment of sequences [4]. The method uses a two-dimensional matrix, initially filled with zeros. Starting from the top left corner, we fill its elements according to the function $H(i, j)$:
\[
H(i, j) = \max
\begin{cases}
H(i-1, j-1) + S(a_i, b_j) \\
H(i-1, j) - W(l) \\
H(i, j-1) - W(l) \\
0
\end{cases}
\]
where: $a_i$ is an element from the first sequence, $b_j$ is an element from the second sequence, $S(a_i, b_j)$ is the sequence elements matching rate, $W(l)$ is the gap penalty function, and $l$ is the gap length. Initial conditions: $H(i, 0) = H(0, j) = H(0, 0) = 0$. In the simplest case $W(l) = gl$ (linear function) and
\[
S(a_i, b_j) =
\begin{cases}
1 & a_i = b_j \\
0 & \text{otherwise}
\end{cases}
\]
The final alignment score is equal to
\[
\mathrm{score} = \sum S(a, b) - \sum W(l).
\]
§ This scoring system does not work well when comparing sequences of proteins, where the alphabet of possible characters includes as many as 20 amino acids. The scoring system also has to take into account the fact that some amino acids are more alike than others. This makes it necessary to use more sophisticated methods, such as the PAM or BLOSUM scoring matrices.
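The recurrence for $H(i, j)$ can be sketched in Python as follows. This is a minimal illustration, assuming the simplest scoring described above (1 for a match, 0 for a mismatch) and a linear gap penalty with an assumed value g = 1; the function name and parameters are hypothetical and the sketch is not the C++ module of the described program.

```python
def smith_waterman(a, b, match=1, mismatch=0, gap=1):
    """Local alignment with a linear gap penalty W(l) = gap * l.

    Returns (best score, aligned part of a, aligned part of b).
    """
    rows, cols = len(a) + 1, len(b) + 1
    # H is initially filled with zeros; row 0 and column 0 stay zero.
    H = [[0] * cols for _ in range(rows)]
    best, best_pos = 0, (0, 0)
    for i in range(1, rows):
        for j in range(1, cols):
            s = match if a[i - 1] == b[j - 1] else mismatch
            H[i][j] = max(0,
                          H[i - 1][j - 1] + s,   # align a_i with b_j
                          H[i - 1][j] - gap,     # a_i aligned to a gap
                          H[i][j - 1] - gap)     # b_j aligned to a gap
            if H[i][j] > best:
                best, best_pos = H[i][j], (i, j)

    # Traceback from the best cell until a zero cell is reached.
    i, j = best_pos
    top, bottom = [], []
    while i > 0 and j > 0 and H[i][j] > 0:
        s = match if a[i - 1] == b[j - 1] else mismatch
        if H[i][j] == H[i - 1][j - 1] + s:
            top.append(a[i - 1]); bottom.append(b[j - 1]); i -= 1; j -= 1
        elif H[i][j] == H[i - 1][j] - gap:
            top.append(a[i - 1]); bottom.append("-"); i -= 1
        else:
            top.append("-"); bottom.append(b[j - 1]); j -= 1
    return best, "".join(reversed(top)), "".join(reversed(bottom))


# The two sequences of Figure 5.
print(smith_waterman("cgggtatccaa", "ccctaggtccca"))
```

The full matrix has (m + 1)(n + 1) cells, which is where the O(mn) time and memory cost of the method comes from.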
2.2 Genetic markers and Knuth-Morris-Pratt algorithm
The basic disadvantage of the above algorithm is its computational and memory complexity ($O(mn)$, where $m$ and $n$ are the lengths of the compared sequences). Therefore, it is infeasible to apply it to the comparison of whole genomes of more complex eukaryotic organisms, such as the already mentioned cucumber genome.
Figure 6. The example of Knuth-Morris-Pratt algorithm results (source: [8]).
It seems that a solution which can be applied in this case is the Knuth-Morris-Pratt (KMP) algorithm (1977) [9], which belongs to the string matching methods. Its computational and memory complexity is linear. Its key part is the observation that when the pattern does not match the searched sequence, it is possible to identify where the next match attempt should start without re-examining already aligned elements.
For this purpose the so-called prefix-suffix array is used¶. This array, denoted as KMPNext, contains the maximum sizes of prefix-suffixes satisfying the following condition: $\mathrm{pattern}[k + 1] \neq \mathrm{pattern}[i + 1]$ for $i < n$, where $\mathrm{pattern}[1 \ldots m]$ is the searched pattern, $k$ is the maximum prefix-suffix size, $i$ is the index of the current position in the array and $n$ is the sequence size. In case there is no such $k$, we place $-1$ in the array.
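One way such an array could be filled is sketched below in Python. The sketch follows the classic KMP preprocessing scheme and makes its own assumptions: indexing is 0-based (unlike the 1-based notation above), the array has one entry more than the pattern, and the name build_kmp_next is hypothetical.

```python
def build_kmp_next(pattern):
    """Return the KMPNext array (length len(pattern) + 1) with -1 as the guard.

    kmp_next[i] is the length of the longest prefix-suffix of pattern[:i]
    whose following character differs from pattern[i]; -1 if none exists.
    """
    m = len(pattern)
    kmp_next = [-1] * (m + 1)
    b = -1  # length of the current prefix-suffix, -1 is the guard
    for i in range(1, m + 1):
        # Shorten the prefix-suffix until it can be extended by pattern[i - 1].
        while b > -1 and pattern[b] != pattern[i - 1]:
            b = kmp_next[b]
        b += 1
        if i == m or pattern[i] != pattern[b]:
            kmp_next[i] = b
        else:
            # The next characters are equal, so this shift would fail again.
            kmp_next[i] = kmp_next[b]
    return kmp_next


print(build_kmp_next("abab"))  # [-1, 0, -1, 0, 2]
```

For the pattern "abab" the last entry is 2, because the prefix "ab" is also a suffix of the whole pattern; this is the value used to continue the search after a full match.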
Figure 7. The use of genetic markers in searching for genome rearrangements.
After filling the KMPNext array we begin to compare the pattern with the sequence. In our case the patterns are genetic markers, which are unique (or almost unique, i.e. present a small number of times) subsequences of one of the analyzed genomes, spread evenly over it. Determining their positions in the examined genomes will allow us to detect rearrangements (insertions, deletions, reversals). The initial pattern position (indicated in the code as pp) is set to −1, whereas the initial prefix-suffix length (b) is set to 0. The for loop determines the elements of the sequence that are compared with the pattern (marker) elements. We look for the prefix-suffix (or break after finding the guard −1), then extend it by one character (if the prefix was a guard, it is set to 0). If the prefix does not cover the whole pattern, we continue with the loop. Otherwise the position of the pattern in the sequence is determined and b is reduced to the length of the prefix-suffix of the pattern. If the pattern is not found in the sequence, −1 is returned.
The proposed method, using KMP searching, is described in Alg. 1. Unfortunately, this approach is also not without drawbacks. The reason is that it only allows us to search for exact occurrences of a pattern in the DNA sequence. Compatibility of each pair of nucleotides is required. The possibility of the occurrence of point mutations is not taken into consideration. Such mutations are relatively common, yet they are not of interest in whole-genome comparison. What is more, the Knuth-Morris-Pratt algorithm does not take gaps in DNA sequences into account.
¶ A prefix-suffix is each proper prefix of the word which is equal to its suffix. [10]
Algorithm 1 Rearrangement searching algorithm using KMP and markers
function KMP(seq, pattern)
    patternPositions ← [ ]
    pp ← −1, b ← 0
    for i = 0 → (|seq| − 1) do
        while b > −1 and pattern[b] ≠ seq[i] do
            b ← KMPNext[b]
        end while
        b ← b + 1
        if b < |pattern| then
            continue
        end if
        pp ← i − b + 1
        patternPositions.add(pp)
        b ← KMPNext[b]
    end for
    if pp = −1 then
        patternPositions.add(−1)
    end if
    return patternPositions
end function

markerPositionsMap ← { }
for marker ∈ markers do
    markerPositionsMap[marker.ID] ← KMP(sequence, marker)
end for
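Algorithm 1 translates almost line by line into Python. The sketch below is an illustration only: it assumes 0-based indexing, builds the KMPNext array inside a helper (Algorithm 1 takes it as given), and represents markers as a plain dictionary from a hypothetical marker ID to the marker string. All function and variable names are assumptions.

```python
def build_kmp_next(pattern):
    """Classic KMP preprocessing; -1 is the guard value."""
    m = len(pattern)
    kmp_next = [-1] * (m + 1)
    b = -1
    for i in range(1, m + 1):
        while b > -1 and pattern[b] != pattern[i - 1]:
            b = kmp_next[b]
        b += 1
        if i == m or pattern[i] != pattern[b]:
            kmp_next[i] = b
        else:
            kmp_next[i] = kmp_next[b]
    return kmp_next


def kmp_search(seq, pattern):
    """Return all start positions of pattern in seq, or [-1] if there are none."""
    if not pattern:
        raise ValueError("pattern must not be empty")
    kmp_next = build_kmp_next(pattern)
    positions = []
    b = 0  # length of the pattern prefix matched so far
    for i, symbol in enumerate(seq):
        # Fall back to shorter prefix-suffixes until the symbol fits.
        while b > -1 and pattern[b] != symbol:
            b = kmp_next[b]
        b += 1
        if b < len(pattern):
            continue
        positions.append(i - b + 1)  # full match ends at index i
        b = kmp_next[b]
    return positions or [-1]


# Hypothetical toy data standing in for a genome and its markers.
sequence = "ACGTACGTTTGACA"
markers = {"m0": "ACGT", "m1": "TTGA", "m2": "GGGG"}
marker_positions = {marker_id: kmp_search(sequence, marker)
                    for marker_id, marker in markers.items()}
print(marker_positions)  # {'m0': [0, 4], 'm1': [8], 'm2': [-1]}
```

A marker reported at more than one position (m0 here) is not unique in the sequence, which is why the choice of marker length matters for this method.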
3. THE COMPUTER PROGRAM
Figure 8 shows the planned architecture of the application used for searching for genomic rearrangements. It is based on the thin-client variant of the client-server model. The advantages of this solution are security, easy scalability, portability, ease of maintenance and application development, and convenience of database management.
Figure 8. The application architecture. Components shown: client (web browser, GUI), HTTP, WWW server (Flask), server application, calculation module (C++), database (PostgreSQL).
On the server side Flask, a BSD-licensed microframework for Python based on Werkzeug, Jinja 2 and good intentions [11], will be used. The core server module, responsible for running all genomic algorithms, was written in C++, chosen mostly because of its high performance, the large number of available libraries (such as STL and Boost [12]) and the ease of their use. All genomic data are stored in a PostgreSQL database. The user interface, available via a website, will be implemented with the use of HTML5 and JavaScript. The language binding all layers and modules will be Python; the C++ to Python conversions use the Boost.Python library.
As for now, the application in its present form provides a text interface‖. The input data are read from a text file (in FASTA format), the name of which is the first command-line argument. The second argument is a prefix added at the beginning of all output file names. As a result of running the program the following files are created:
• prefix_clear.txt: a file containing only sequence elements (without newline characters and comments);
• prefix_rearranged.txt: a file containing an artificially rearranged sequence from prefix_clear.txt;
• prefix_2compare_genomes.json: a file containing both sequences, written in JSON format;
• prefix_patterns.json: a file containing a collection of genetic markers, written in JSON format;
• prefix_result.json: a file containing the results of the algorithm, written in JSON format.
The resulting JSON file contains two elements, each corresponding to one of the compared genomes. Each element is described by a unique identifier and a collection of markers (their identifiers and positions).
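The cleaning step and the shape of the result can be sketched in Python. Everything below is an assumption made for illustration: the function name clean_fasta, the treatment of lines starting with ">" or ";" as headers and comments, and in particular the JSON key names, which the text does not specify.

```python
import json


def clean_fasta(text):
    """Keep only sequence symbols: drop header/comment lines and newlines."""
    lines = (line.strip() for line in text.splitlines())
    return "".join(line for line in lines
                   if line and not line.startswith((">", ";")))


fasta_text = ">example record\nACGT\nTTGA\n"
sequence = clean_fasta(fasta_text)
print(sequence)  # ACGTTTGA

# Hypothetical layout of the result: two genomes, each with an identifier
# and a collection of markers (identifier and positions).
result = [
    {"id": "initial", "markers": [{"id": 0, "positions": [0]}]},
    {"id": "rearranged", "markers": [{"id": 0, "positions": [4]}]},
]
print(json.dumps(result, indent=2))
```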
4. RESULTS
The software was tested on two real DNA sequences with artificially generated rearrangements. The first one was the Escherichia coli (E. coli) genome (4,093,725 nucleobases), and the second one was the Arabidopsis thaliana genome (117,765,951 nucleobases). The testing procedure was as follows. First, a sequence in FASTA format was read from the database. Afterwards the sequence was divided into 10 pieces of equal size (except for the last one) and the pieces were randomly shuffled, in order to create artificial genomic rearrangements. Then, in the first (not rearranged) sequence 10 genetic markers of length 100 were selected, evenly distributed over the whole sequence. Finally, the Knuth-Morris-Pratt algorithm was used to look for these markers in the second sequence.
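The testing procedure can be reproduced on a small scale with the following Python sketch. It rests on several assumptions: a random 5005-symbol string stands in for a real genome, the helper names are hypothetical, and the built-in str.find is used in place of the Knuth-Morris-Pratt search purely to keep the example short.

```python
import random


def split_into_pieces(seq, n_pieces):
    """Split seq into n_pieces of equal size; the last piece takes the remainder."""
    size = len(seq) // n_pieces
    pieces = [seq[k * size:(k + 1) * size] for k in range(n_pieces - 1)]
    pieces.append(seq[(n_pieces - 1) * size:])
    return pieces


def pick_markers(seq, n_markers, length):
    """Take n_markers substrings of the given length, evenly spaced over seq."""
    step = len(seq) // n_markers
    return {k: seq[k * step:k * step + length] for k in range(n_markers)}


rng = random.Random(0)  # fixed seed so the run is repeatable
genome = "".join(rng.choice("ACGT") for _ in range(5005))

pieces = split_into_pieces(genome, 10)
shuffled = pieces[:]
rng.shuffle(shuffled)
rearranged = "".join(shuffled)

markers = pick_markers(genome, 10, 100)
positions = {marker_id: rearranged.find(marker)
             for marker_id, marker in markers.items()}

# Marker IDs in the order in which they appear in the rearranged genome.
marker_order = sorted(positions, key=positions.get)
print(marker_order)
```

Because each marker starts at the beginning of a piece, the order of marker IDs in the rearranged sequence directly reveals how the pieces were permuted.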
Figure 9. Results of genetic markers and Knuth-Morris-Pratt algorithm for E. coli genome. Two panels (initial genome, rearranged genome), each listing marker IDs and marker positions. In the initial genome the markers are located at positions 0, 409372, 818744, 1228116, 1637488, 2046860, 2456232, 2865604, 3274976 and 3684348.
Figure 10. Results of genetic markers and Knuth-Morris-Pratt algorithm for Arabidopsis thaliana genome. Two panels (initial genome, rearranged genome), each listing marker IDs and marker positions. In the initial genome the markers are located at positions 0, 11776595, 23553190, 35329785, 47106380, 58882975, 70659570, 82436165, 94212760 and 105989355.
‖ A preliminary genomecomp version of the program can be downloaded from https://[email protected]/knalecz/
The results achieved in both cases are depicted in Figures 9 and 10. All genetic markers on the shuffled pieces of the sequence were properly found. What is more, each marker was found exactly once, which means that a length of 100 seems to be sufficient for genomes of size close to the two examined. However, one should be aware that the test described above was an idealized situation. The number of genetic markers was equal to the number of pieces. Moreover, such artificial rearrangements are far easier to detect than real ones, whose size and number are completely unknown and unpredictable. Furthermore, in both presented cases there were no so-called point mutations in the second sequence (no insertions, deletions or substitutions of single nucleotides). All these arguments demonstrate the need for more advanced research on real data.
5. POSSIBLE IMPROVEMENTS
The methods of searching for genomic rearrangements briefly presented here are only an outline of solutions that have practical significance. The size of the input data makes classical methods of searching for sequence alignment, such as the Smith-Waterman algorithm, not very useful. As far as the Knuth-Morris-Pratt algorithm is concerned, despite its obvious advantages, namely linear computational and memory complexity, it is of limited use because it is an exact string matching algorithm. A possible solution may be methods based on approximate string matching algorithms. Another approach, which seems to be potentially very useful, is the image correlation method. This algorithm, described in [13], applies digital signal processing techniques to the problem of DNA sequence alignment. Its main advantage is very high sensitivity and specificity, especially when applied to noisy data (i.e. with a large number of point mutations). This issue, however, requires further examination, in particular testing its performance on real biological data (with real, not artificial, genomic rearrangements).
REFERENCES
1. P. G. Higgs and T. K. Attwood, Bioinformatics and Molecular Evolution, Wydawnictwo Naukowe PWN, Warsaw, 2008.
2. E. R. Mardis et al., The impact of next-generation sequencing technology on genetics, Trends in genetics 24(3), p. 133, 2008.
3. E. V. Koonin, A. R. Mushegian, M. Y. Galperin, and D. R. Walker, Comparison of archaeal and bacterial genomes: computer analysis of protein sequences predicts novel functions and suggests a chimeric origin for the archaea, Mol Microbiol., 1997.
4. J. Xiong, Essential bioinformatics, Wydawnictwo Uniwersytetu Warszawskiego, Warsaw, 2006.
5. J. M. Rothberg and J. H. Leamon, The development and impact of 454 sequencing, Nature biotechnology 26(10), pp. 1117-1124, 2008.
6. D. Branton, D. W. Deamer, A. Marziali, H. Bayley, S. A. Benner, T. Butler, M. Di Ventra, S. Garaj, A. Hibbs, X. Huang, et al., The potential and challenges of nanopore sequencing, Nature biotechnology 26(10), pp. 1146-1153, 2008.
7. R. Wóycicki, J. Witkowicz, P. Gawroński, J. Dąbrowska, A. Lomsadze, M. Pawełkowicz, E. Siedlecka, K. Yagi, W. Pląder, A. Seroczyńska, et al., The genome sequence of the north-european cucumber (cucumis sativus l.) unravels evolutionary adaptation mechanisms in plants, PLoS One 6(7), p. e22728, 2011.
8. W. J., Algorithms. Data structures. [online]. http://edu.i-lo.tarnow.pl/inf/alg/001_search/0049.php.
9. T. H. Cormen, C. E. Leiserson, R. L. Rivest, and C. Stein, Introduction to algorithms, MIT press, 2001.
10. K. Diks, A. Malinowski, and W. Rytter, Text algorithms I. In: Algorithms and data structures. [online], 2008. http://wazniak.mimuw.edu.pl/index.php?title=Algorytmy_i_struktury_danych/Algorytmy_tekstowe_I.
11. Flask. [online]. http://flask.pocoo.org/.
12. R. Nowak and A. Pająk, C++ programming, boost libraries and design patterns, in Polish: Język C++: mechanizmy, wzorce, biblioteki, BTC, Legionowo, 2010. ISBN 978-83-60233-66-5.
13. M. C. Saldías, F. V. Sassarini, C. M. Poblete, A. V. Vásquez, and I. M. Butler, Image correlation method for dna sequence alignment, PloS one 7(6), p. e39221, 2012.