Algorithm to search for genomic rearrangements

Katarzyna Nałęcz-Charkiewicz, Robert Nowak
Faculty of Electronics and Information Technology, Warsaw University of Technology, Warsaw, Poland

ABSTRACT
The aim of this article is to discuss the issue of comparing nucleotide sequences in order to detect chromosomal rearrangements (for example, in the study of genomes of two cucumber varieties, Polish and Chinese). Two basic algorithms for detecting rearrangements have been described: the Smith-Waterman algorithm, as well as a new method of searching for genetic markers in combination with the Knuth-Morris-Pratt algorithm. A computer program in client-server architecture was developed. The properties of the algorithms were examined on the Escherichia coli and Arabidopsis thaliana genomes, and the algorithms are prepared to compare two cucumber varieties, Polish and Chinese. The results are promising and further work is planned.

Keywords: genome rearrangements, next generation sequencing, computer program, web application, Smith-Waterman, Knuth-Morris-Pratt, sequence alignment

1. INTRODUCTION
"Nothing in bioinformatics makes sense if considered in isolation of evolution"∗, state Attwood and Higgs in the initial chapter of their book [1]. These words perfectly reflect the importance of the subject of this paper: comparing the genomes of organisms, in particular in order to find rearrangements which occurred during evolution. Because of the extremely rapid development of methods for the analysis of biological data in recent years [2] (like the invention of so-called second generation sequencing methods in 2008, far cheaper and faster than the previously used Sanger method), especially in comparative genomics, the development of appropriate bioinformatics tools that enable effective and efficient analysis of protein and DNA sequences seems to be of particular importance.

1.1 The importance of the study of genomes
There are many reasons for comparing protein and nucleotide sequences. It is the first stage of structural and functional research of newly found sequences. Because about 70% of genes from newly discovered genomes are homologous† to other distant genomes [3], it is possible to transfer functional annotations of the better-known organisms by searching for similar gene fragments. Another application is the study of the genomes of two organisms, detecting similarities, and, based on this, deciding whether they have a common origin. This in turn enables, inter alia, the construction of so-called phylogenetic trees. They provide a graphical representation of the evolutionary history of homologous organisms, taking into account the order of separation of their evolutionary lines. The comparison of entire genomes of different organisms, including the number and location of individual genes, also makes it possible to determine the degree of conservation of genomes. This allows us to explore the mechanisms of evolution and gene transfer between them. Another application of genome comparison is to determine what is called a minimal genome, which is the smallest set of genes necessary for sustaining the life of the organism [4]. Finding it will help to identify genes which are essential for the metabolic pathways necessary for cell survival.

Further author information: (Send correspondence to Robert Nowak) Robert Nowak: E-mail: [email protected]
∗ These words paraphrase the famous geneticist and evolutionary biologist Theodosius Dobzhansky, who said: "Nothing in biology makes sense if considered in isolation of evolution."
† Two species are known as homologous if they have a common ancestor.

Photonics Applications in Astronomy, Communications, Industry, and High-Energy Physics Experiments 2013, edited by Ryszard S. Romaniuk, Proc. of SPIE Vol. 8903, 890319 © 2013 SPIE · CCC code: 0277-786X/13/$18 · doi: 10.1117/12.2032639 Proc. of SPIE Vol. 8903 890319-1 Downloaded From: http://proceedings.spiedigitallibrary.org/ on 10/29/2013 Terms of Use: http://spiedl.org/terms

1.2 DNA sequencing
DNA sequencing involves reading the order of nucleotide pairs‡ in a deoxyribonucleic acid molecule. The origins of sequencing date back to 1984, when scientists managed for the first time to obtain the sequence of a viral genome (Epstein-Barr). Complete genome sequences of cellular life forms did not become available until several years later. The publication of the first two bacterial genomes, Haemophilus influenzae and Mycoplasma genitalium, had a historical significance. Another great success was the publication (by the Human Genome Sequencing Consortium) of the human genome on February 15, 2001 in the journal Nature.

Starting from 2003, the reduction in the cost of sequencing overtook projections based on Moore's Law, which assumes a two-fold increase in computing power every two years [5]. A significant acceleration of cost reductions was caused by the invention and implementation of second generation methods. It should be mentioned that third generation sequencing techniques are being developed [6].

1.3 Rearrangements
Genomic rearrangements are changes that occur in the course of evolution in DNA sequences and that determine the emergence of new varieties of a given species, as well as new species (usually over millions of years). Mutations are therefore a source of genetic variability of organisms. The basic types of rearrangements are as follows:

• insertion: insertion of one or more nucleotide pairs,
• deletion: deletion of one or more nucleotide pairs,
• inversion: inverting a part of the chromosome by 180°,
• transposition: displacement of a transposon to another position in the genome,
• translocation: swapping parts of two chromosomes.

Figure 1. Inversion: inverting a part of the chromosome by 180°.

Figure 2. Transposition: displacement of a transposon to another position in the genome.

Figure 3. Translocation: swapping parts of two chromosomes.

‡ In each pair, there are two out of the four nitrogenous bases: adenine, guanine, cytosine, and thymine.

It should be noted that the sequence comparison algorithms described in the following subsections only allow us to detect the occurrence of a rearrangement in one of the analyzed genomes (for instance, we can say that an insertion occurred in one of the sequences, or a deletion in the other). We cannot say in which of them the change from the original common ancestor took place. In general, however, detection of this type is possible; for example, this is how phylogenetic trees are constructed.

1.4 The genome of the cucumber
The study of genomic rearrangements, to which this work is the introduction, will be carried out on real data from the nuclear genome of two varieties of cucumber. The genome of the Polish variety of the cucumber (line B10) was sequenced in 2011 by the Polish Consortium of Cucumber Genome Sequencing, in particular by scientists from the Department of Plant Genetics, Breeding and Biotechnology, Faculty of Horticulture and Landscape Architecture at the Warsaw University of Life Sciences (SGGW). Its size is approximately 367 million base pairs.

Figure 4. Schematic representation of chromosomal rearrangements between the two varieties of cucumber. Line B10 (Polish) is on the top, line 9930 (Chinese) is at the bottom. Source: [7].

The second of the tested varieties, the Chinese cucumber genome (line 9930), was sequenced by scientists from China. The comparison carried out so far showed many differences in the composition of genes of both plants which are involved in key processes, such as photosynthesis, respiration or sugar metabolism. So far, research has been limited to rearrangements within corresponding chromosomes of both varieties. Results of these studies, published in [7], are shown in Figure 4. Most of the observed rearrangements were inversions and translocations. The issue that remains unexplored is a similar comparison applied to different chromosomes (that is, comparing each chromosome of one variety to all chromosomes of the other variety).

2. THE ALGORITHMS

2.1 Smith-Waterman algorithm
The algorithm, which will now be described, not only determines a numerical measure of the alignment, but also determines the best sequence alignment. By sequence alignment is meant the mapping of two or more sequences in such a way that an element is aligned to an element or to a gap. Alignment to a gap occurs when there was an insertion in one sequence (or, equivalently, a deletion in the other). Aligning two gaps to each other is not allowed.

This raises the question: what exactly do we mean by the best alignment? Figure 5 depicts several possible alignments of two DNA sequences, each of which satisfies the above limitations. In order to choose the optimal one among them, it is necessary to determine the quality of the alignments. The presented algorithm performs this by summing the values assigned to each of the stages of creating the alignment path.

c g g g t a t c c a a
c c c t a g g t c c c a

c g g g t a - - t c c a a
c c c - t a g g t c c c a

c g g g t a - - t - c c a a
c c c - t a g g t c c c a

Figure 5. Different possible alignments of two DNA sequences.

In the described algorithm an important role is played by the scoring system, which consists of the sequence element matching rate and the gap penalty function. When evaluating the alignment of two DNA sequences, the scoring rules are very simple: 1 point if the nucleotides are identical, 0 otherwise§. The gap penalty function $W(l)$, where $l$ means the length of the gap, is also an important part of the scoring system. In the simplest case it is a linear function $W(l) = gl$. However, such an approach does not reflect reality well, because the occurrence of a gap of length 3 is not three times less likely than the occurrence of a gap of length 1. In other words, the probability that a gap occurs is much smaller than the probability of its extension, provided it has already occurred. Taking this into account requires an affine function instead of the linear one. It has the form

$$W(l) = g_{open} + g_{ext}\,(l - 1),$$

where $g_{open}$ is the penalty for gap occurrence, $g_{ext}$ is the penalty for the extension of an already existing gap, and $g_{ext} < g_{open}$.
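To make the difference between the two penalty functions concrete, the following minimal Python sketch evaluates both for a few gap lengths. The numeric values of `g`, `g_open` and `g_ext` are illustrative assumptions only; the paper does not specify them.

```python
def linear_gap(l, g=2):
    """Linear penalty W(l) = g * l (g = 2 is an assumed, illustrative value)."""
    return g * l


def affine_gap(l, g_open=5, g_ext=1):
    """Affine penalty W(l) = g_open + g_ext * (l - 1), with g_ext < g_open.

    g_open = 5 and g_ext = 1 are assumed, illustrative values.
    """
    return g_open + g_ext * (l - 1)


# Opening a gap is expensive under the affine scheme, extending it is cheap.
for length in (1, 3, 10):
    print(length, linear_gap(length), affine_gap(length))
```

With these assumed parameters a gap of length 3 costs three times as much as a gap of length 1 under the linear function, but only 7/5 as much under the affine one.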

The Smith-Waterman algorithm (1981) finds an optimal local alignment of sequences [4]. The method uses a two-dimensional matrix, initially filled with zeros. Starting from the top left corner, we fill its elements according to the function $H(i, j)$:

$$H(i, j) = \max \begin{cases} H(i-1, j-1) + S(a_i, b_j) \\ H(i-1, j) - W(l) \\ H(i, j-1) - W(l) \\ 0 \end{cases}$$

where: $a_i$ is an element from the first sequence, $b_j$ is an element from the second sequence, $S(a_i, b_j)$ is the sequence element matching rate, $W(l)$ is the gap penalty function, and $l$ is the gap length. Initial conditions: $H(i, 0) = H(0, j) = H(0, 0) = 0$.

In the simplest case $W(l) = gl$ (linear function) and

$$S(a_i, b_j) = \begin{cases} 1 & a_i = b_j \\ 0 & \text{otherwise} \end{cases}$$

The final alignment score is equal to

$$\text{score} = \sum S(a, b) - \sum W(l).$$
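The matrix filling described above can be sketched in Python as follows. This is a minimal illustration, not the authors' C++ implementation: it assumes the simplest scoring (1 for a match, 0 otherwise) and a linear gap penalty with an assumed value `g = 1`, and the function name `smith_waterman` is hypothetical.

```python
def smith_waterman(a, b, g=1):
    """Local alignment with S = 1 for a match, 0 otherwise, and W(l) = g * l.

    Returns (score, aligned_a, aligned_b). g = 1 is an assumed value.
    """
    rows, cols = len(a) + 1, len(b) + 1
    # H(i, 0) = H(0, j) = 0: the matrix starts filled with zeros.
    H = [[0] * cols for _ in range(rows)]
    best, best_pos = 0, (0, 0)
    for i in range(1, rows):
        for j in range(1, cols):
            s = 1 if a[i - 1] == b[j - 1] else 0
            H[i][j] = max(0,
                          H[i - 1][j - 1] + s,   # align a_i with b_j
                          H[i - 1][j] - g,       # gap in the second sequence
                          H[i][j - 1] - g)       # gap in the first sequence
            if H[i][j] > best:
                best, best_pos = H[i][j], (i, j)

    # Trace back from the best cell until a zero cell is reached.
    i, j = best_pos
    top, bottom = [], []
    while i > 0 and j > 0 and H[i][j] > 0:
        s = 1 if a[i - 1] == b[j - 1] else 0
        if H[i][j] == H[i - 1][j - 1] + s:
            top.append(a[i - 1]); bottom.append(b[j - 1]); i -= 1; j -= 1
        elif H[i][j] == H[i - 1][j] - g:
            top.append(a[i - 1]); bottom.append("-"); i -= 1
        else:
            top.append("-"); bottom.append(b[j - 1]); j -= 1
    return best, "".join(reversed(top)), "".join(reversed(bottom))


score, top, bottom = smith_waterman("acgtacgt", "acgacgt")
print(score)
print(top)
print(bottom)
```

Both the time and the memory used by this sketch grow with the product of the two sequence lengths, since the whole matrix is kept for the traceback.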

§ This scoring system does not work well when comparing protein sequences, where the alphabet of possible characters includes as many as 20 amino acids. The scoring system also has to take into account the fact that some amino acids are more alike than others. This makes it necessary to use more sophisticated methods, such as the PAM or BLOSUM scoring matrices.


2.2 Genetic markers and Knuth-Morris-Pratt algorithm
The basic disadvantage of the above algorithm is its computational and memory complexity ($O(mn)$, where $m$ and $n$ are the lengths of the compared sequences). Therefore, it is impossible to apply it to the problem of comparing whole genomes of more complex eukaryotic organisms, such as the already mentioned cucumber genome.

Figure 6. An example of Knuth-Morris-Pratt algorithm results (source: [8]).

It seems that a solution which can be applied in this case is the Knuth-Morris-Pratt (KMP) algorithm (1977) [9], which belongs to string matching methods. Its computational and memory complexity is linear. Its key part is the observation that when the pattern does not match the searched sequence, it is possible to identify where the next test match should start without re-examining the already aligned elements.



For this purpose the so-called prefix-suffix¶ array is used. This array, denoted as KMPNext, contains the maximum sizes of prefix-suffixes satisfying the following condition: $pattern[k + 1] \neq pattern[i + 1]$ for $i < n$, where $pattern[1 \ldots m]$ is the searched pattern, $k$ is the maximum prefix-suffix size, $i$ is the index of the current position in the array and $n$ is the sequence size. If there is no such $k$, we place $-1$ in the array.
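The construction of such an array can be sketched as follows. This is a minimal Python illustration using 0-based indexing and the common "strong" prefix-suffix construction with a $-1$ guard; it is assumed, not confirmed by the paper, that this matches the authors' KMPNext. The function name `build_kmp_next` is hypothetical.

```python
def build_kmp_next(pattern):
    """Build a prefix-suffix table with a -1 guard (0-based indexing).

    Entry i describes the prefix pattern[:i]; the table has len(pattern) + 1
    entries so that it can also be consulted after a full match.
    """
    m = len(pattern)
    kmp_next = [-1] * (m + 1)      # kmp_next[0] = -1 is the guard
    b = -1                         # current prefix-suffix length
    for i in range(1, m + 1):
        # Shrink b until the prefix-suffix can be extended by pattern[i - 1].
        while b > -1 and pattern[b] != pattern[i - 1]:
            b = kmp_next[b]
        b += 1
        if i == m or pattern[i] != pattern[b]:
            kmp_next[i] = b
        else:
            # The next characters agree, so falling back to b would fail
            # again on the same character; reuse the shorter entry instead.
            kmp_next[i] = kmp_next[b]
    return kmp_next


print(build_kmp_next("abab"))
```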

Figure 7. The use of genetic markers in searching for genome rearrangements.

After filling the KMPNext array we begin to compare the pattern with the sequence. In our case the patterns are genetic markers, which are unique (or almost unique, i.e. present a small number of times) subsequences of one of the analyzed genomes, spread evenly over it. Determining their positions in the examined genomes will allow us to detect rearrangements (insertions, deletions, reversals). The initial pattern position (indicated in the code as pp) is set to −1, whereas the initial prefix-suffix length (b) is set to 0. The for loop iterates over the elements of the sequence that are compared with the pattern (marker) elements. We look for the prefix-suffix (or break after finding the guard −1), then extend it by one character (if the prefix was a guard, it is set to 0). If the prefix does not cover the whole pattern, we continue with the loop. Otherwise the position of the pattern in the sequence is determined and b is reduced to the length of the prefix-suffix of the pattern. If the pattern is not found in the sequence, −1 is returned.

The proposed method, using KMP searching, is described in Alg. 1. Unfortunately, this approach is also not without drawbacks. The reason is that it only allows us to search for exact occurrences of the pattern in the DNA sequence. Compatibility of each pair of nucleotides is required. The possibility of the occurrence of point mutations is not taken into consideration. Such mutations, relatively common, are not interesting in terms of whole genome comparison. What is more, the Knuth-Morris-Pratt algorithm does not take gaps in DNA sequences into account.

¶ A prefix-suffix is each proper prefix of the word which is equal to its suffix [10].


Algorithm 1 Rearrangement searching algorithm using KMP and markers

function KMP(seq, pattern)
    patternPositions ← [ ]
    pp ← −1, b ← 0
    for i = 0 → (|seq| − 1) do
        while b > −1 & pattern[b] ≠ seq[i] do
            b ← KMPNext[b]
        end while
        b ← b + 1
        if b < |pattern| then
            continue
        end if
        pp ← i − b + 1
        patternPositions.add(pp)
        b ← KMPNext[b]
    end for
    if pp = −1 then
        patternPositions.add(−1)
    end if
    return patternPositions
end function

markerPositionsMap ← { }
for marker ∈ markers do
    markerPositionsMap[marker.ID] ← KMP(sequence, marker)
end for
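A runnable Python rendering of Algorithm 1 might look as follows. It is a sketch under stated assumptions rather than the authors' C++ code: the helper `build_kmp_next` is one common way to compute a guarded prefix-suffix table, markers are assumed to be given as a dictionary from marker ID to marker sequence, and all function names are hypothetical.

```python
def build_kmp_next(pattern):
    """Prefix-suffix table with a -1 guard; len(pattern) + 1 entries."""
    m = len(pattern)
    kmp_next = [-1] * (m + 1)
    b = -1
    for i in range(1, m + 1):
        while b > -1 and pattern[b] != pattern[i - 1]:
            b = kmp_next[b]
        b += 1
        if i == m or pattern[i] != pattern[b]:
            kmp_next[i] = b
        else:
            kmp_next[i] = kmp_next[b]
    return kmp_next


def kmp(seq, pattern):
    """Return all start positions of pattern in seq, or [-1] if there are none."""
    if not pattern:
        raise ValueError("pattern must not be empty")
    kmp_next = build_kmp_next(pattern)
    positions = []
    b = 0                                  # current prefix-suffix length
    for i, ch in enumerate(seq):
        # Fall back through the table until ch extends the matched prefix
        # or the guard -1 is reached.
        while b > -1 and pattern[b] != ch:
            b = kmp_next[b]
        b += 1
        if b < len(pattern):
            continue
        positions.append(i - b + 1)        # whole pattern matched
        b = kmp_next[b]                    # keep the reusable prefix-suffix
    return positions or [-1]


def marker_positions_map(sequence, markers):
    """markers: assumed mapping marker ID -> marker sequence."""
    return {marker_id: kmp(sequence, marker)
            for marker_id, marker in markers.items()}


print(kmp("abababab", "abab"))
print(marker_positions_map("aabaaab", {0: "aab", 1: "tt"}))
```

Each character of the sequence is read once, and the table is built once per marker, so the cost per marker is linear in the length of the sequence plus the length of the marker.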

3. THE COMPUTER PROGRAM
Figure 8 shows the planned architecture of the application used for searching for genomic rearrangements. It is based on the thin-client variant of the client-server model. The advantages of this solution are security, easy scalability, portability, ease of maintenance and application development, and convenience in database management.

Figure 8. The application architecture. Components shown: client (Web browser, GUI), HTTP, WWW server (Flask), server application, calculation module (C++), database (PostgreSQL).

On the server side Flask, a BSD-licensed microframework for Python based on Werkzeug, Jinja 2 and good intentions [11], will be used. The core server module, responsible for running all genomic algorithms, was written in C++, chosen mostly because of its high performance, a large number of libraries (like STL, boost [12]) and the ease of their use. All genomic data are stored in the PostgreSQL database. The user interface, available via a website, will be implemented with the use of HTML5 and JavaScript. The language binding all layers and modules will be Python; the C++ to Python conversions use the Boost.Python library.

For now, the application in its present form provides a text interface‖. The input data are read from a text file (in FASTA format), the name of which is the first command-line argument. The second argument is a prefix added at the beginning of all output file names. As a result of running the program the following files are created:

• prefix_clear.txt: a file containing only sequence elements (without newline characters and comments);
• prefix_rearranged.txt: a file containing an artificially rearranged sequence from prefix_clear.txt;
• prefix_2compare_genomes.json: a file containing both sequences, written in JSON format;
• prefix_patterns.json: a file containing a collection of genetic markers, written in JSON format;
• prefix_result.json: a file containing the results of the algorithm, written in JSON format.

The resulting JSON file contains two elements, each corresponding to one of the compared genomes. Each element is described by a unique identifier and a collection of markers (their identifiers and positions).
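The first and the last of the files described above can be illustrated with a short Python sketch. It is not the authors' code: the function names are hypothetical, and the JSON key names (`id`, `markers`, `positions`) are assumptions, since the paper only states that each element has an identifier and a collection of markers with identifiers and positions.

```python
import json


def clean_fasta(text):
    """Keep only sequence characters from FASTA text.

    Header lines (starting with '>') and comment lines (starting with ';')
    are dropped, and line breaks are removed.
    """
    return "".join(
        line.strip()
        for line in text.splitlines()
        if line and not line.startswith((">", ";"))
    )


def result_to_json(marker_positions_by_genome):
    """Serialize {genome_id: {marker_id: [positions]}} with assumed key names."""
    elements = [
        {
            "id": genome_id,
            "markers": [
                {"id": marker_id, "positions": positions}
                for marker_id, positions in markers.items()
            ],
        }
        for genome_id, markers in marker_positions_by_genome.items()
    ]
    return json.dumps(elements, indent=2)


print(clean_fasta(">example sequence\nACGT\nTTGA\n"))
print(result_to_json({"initial": {0: [0]}, "rearranged": {0: [5]}}))
```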

4. RESULTS
The software was tested on two real DNA sequences with artificially generated rearrangements. The first one was the Escherichia coli (E. coli) genome (4,093,725 nucleobases), and the second one was the Arabidopsis thaliana genome (117,765,951 nucleobases). The testing procedure was as follows. Firstly, a sequence in FASTA format was read from the database. Afterwards the sequence was divided into 10 pieces of equal size (except for the last one), which were randomly shuffled in order to create artificial genomic rearrangements. Then, in the first (not rearranged) sequence, 10 genetic markers of length 100 were selected, evenly distributed over the whole sequence. Finally, the Knuth-Morris-Pratt algorithm was used to look for these markers in the second sequence.
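The testing procedure above can be imitated on toy data with the following Python sketch. Several things here are assumptions made for illustration: the random toy genome, the seeds, the shorter marker length, the function names, and the use of Python's built-in `str.find` as a stand-in for the Knuth-Morris-Pratt search (both are exact matching).

```python
import random


def make_rearranged(seq, pieces=10, seed=0):
    """Split seq into equal pieces (the last one takes the remainder) and shuffle them."""
    size = len(seq) // pieces
    blocks = [seq[k * size:(k + 1) * size] for k in range(pieces - 1)]
    blocks.append(seq[(pieces - 1) * size:])
    order = list(range(pieces))
    random.Random(seed).shuffle(order)
    return "".join(blocks[k] for k in order), order


def pick_markers(seq, count=10, length=100):
    """Take evenly spaced substrings of the given length as markers."""
    step = len(seq) // count
    return {k: seq[k * step:k * step + length] for k in range(count)}


def locate(markers, seq):
    """Position of the first exact occurrence of each marker (-1 if absent)."""
    return {k: seq.find(marker) for k, marker in markers.items()}


def marker_order(positions):
    """Marker IDs sorted by their position in the genome."""
    return [k for k, _ in sorted(positions.items(), key=lambda kv: kv[1])]


# Toy data standing in for a real genome.
rng = random.Random(1)
genome = "".join(rng.choice("ACGT") for _ in range(5000))
rearranged, order = make_rearranged(genome, seed=2)
markers = pick_markers(genome, length=20)

print(marker_order(locate(markers, genome)))      # marker order in the initial genome
print(marker_order(locate(markers, rearranged)))  # marker order after shuffling
print(order)                                      # the shuffle that was applied
```

Comparing the marker order in the two genomes recovers the order in which the pieces were shuffled, which is the idea behind using markers to expose rearrangements.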

Figure 9. Results of genetic markers and Knuth-Morris-Pratt algorithm for E. coli genome. Panels: initial genome and rearranged genome, each showing marker IDs and marker positions.

Figure 10. Results of genetic markers and Knuth-Morris-Pratt algorithm for Arabidopsis thaliana genome. Panels: initial genome and rearranged genome, each showing marker IDs and marker positions.

‖ A preliminary version of the program (genomecomp) can be downloaded from https://[email protected]/knalecz/


The results achieved in both cases are depicted in Figures 9 and 10. All genetic markers on the shuffled pieces of the sequence were properly found. What is more, each marker was found exactly once, which means that a length of 100 seems to be enough for genomes of size close to the two examined. However, one should be aware that the test described above was an idealized situation. The number of genetic markers was equal to the number of pieces. Moreover, such artificial rearrangements are far easier to detect than real ones, whose size and number are completely unknown and unpredictable. Furthermore, in both presented cases there were no so-called point mutations in the second sequence (no insertions, deletions or substitutions of single nucleotides). All these arguments demonstrate the need for more advanced research on real data.

5. POSSIBLE IMPROVEMENTS
The methods of searching for genomic rearrangements briefly presented here are only an outline of solutions that have practical significance. The size of the input data makes classical methods of searching for sequence alignment, such as the Smith-Waterman algorithm, not very useful. As far as the Knuth-Morris-Pratt algorithm is concerned, despite its obvious advantages, namely linear computational and memory complexity, it is of limited use because it is an exact string matching algorithm. A possible solution may be methods based on approximate string matching algorithms. Another approach, which seems to be potentially very useful, is the image correlation method. This algorithm, described in [13], applies digital signal processing techniques to the problem of DNA sequence alignment. Its main advantage is very high sensitivity and specificity, especially when applied to noisy data (i.e. with a large number of point mutations). This issue, however, requires further examination, in particular testing its performance on real biological data (with real, not artificial, genomic rearrangements).
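As a simple illustration of what approximate string matching could mean for marker search, the sketch below reports every position where a marker matches with at most a given number of substitutions. It is a naive quadratic-time example chosen for clarity, not a method proposed in the paper; it tolerates substitutions only (no insertions or deletions), and the function name and the `max_mismatches` parameter are hypothetical.

```python
def find_with_mismatches(seq, pattern, max_mismatches=1):
    """Return (start, mismatches) for every window of seq that differs from
    pattern in at most max_mismatches positions (substitutions only)."""
    m = len(pattern)
    hits = []
    for start in range(len(seq) - m + 1):
        mismatches = 0
        for a, b in zip(seq[start:start + m], pattern):
            if a != b:
                mismatches += 1
                if mismatches > max_mismatches:
                    break              # too many differences, try next window
        else:
            hits.append((start, mismatches))
    return hits


print(find_with_mismatches("ACGTACCT", "ACGT", max_mismatches=1))
```

A marker carrying a single point mutation would be missed by exact matching but is still located here, at the cost of losing the linear running time of the Knuth-Morris-Pratt algorithm.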

REFERENCES
1. P. G. Higgs and T. K. Attwood, Bioinformatics and Molecular Evolution, Wydawnictwo Naukowe PWN, Warsaw, 2008.
2. E. R. Mardis et al., "The impact of next-generation sequencing technology on genetics," Trends in Genetics 24(3), p. 133, 2008.
3. E. V. Koonin, A. R. Mushegian, M. Y. Galperin, and D. R. Walker, "Comparison of archaeal and bacterial genomes: computer analysis of protein sequences predicts novel functions and suggests a chimeric origin for the archaea," Mol Microbiol., 1997.
4. X. Jin, Essential bioinformatics, Wydawnictwo Uniwersytetu Warszawskiego, Warsaw, 2006.
5. J. M. Rothberg and J. H. Leamon, "The development and impact of 454 sequencing," Nature Biotechnology 26(10), pp. 1117-1124, 2008.
6. D. Branton, D. W. Deamer, A. Marziali, H. Bayley, S. A. Benner, T. Butler, M. Di Ventra, S. Garaj, A. Hibbs, X. Huang, et al., "The potential and challenges of nanopore sequencing," Nature Biotechnology 26(10), pp. 1146-1153, 2008.
7. R. Wóycicki, J. Witkowicz, P. Gawroński, J. Dąbrowska, A. Lomsadze, M. Pawełkowicz, E. Siedlecka, K. Yagi, W. Pląder, A. Seroczyńska, et al., "The genome sequence of the north-european cucumber (cucumis sativus l.) unravels evolutionary adaptation mechanisms in plants," PLoS One 6(7), p. e22728, 2011.
8. W. J., "Algorithms. Data structures." [online]. http://edu.i-lo.tarnow.pl/inf/alg/001_search/0049.php.
9. T. H. Cormen, C. E. Leiserson, R. L. Rivest, and C. Stein, Introduction to algorithms, MIT Press, 2001.
10. K. Diks, A. Malinowski, and W. Rytter, "Text algorithms I. In: Algorithms and data structures." [online], 2008. http://wazniak.mimuw.edu.pl/index.php?title=Algorytmy_i_struktury_danych/Algorytmy_tekstowe_I.
11. "Flask." [online]. http://flask.pocoo.org/.
12. R. Nowak and A. Pająk, C++ programming, boost libraries and design patterns, in Polish: Język C++: mechanizmy, wzorce, biblioteki, BTC, Legionowo, 2010. ISBN 978-83-60233-66-5.
13. M. C. Saldías, F. V. Sassarini, C. M. Poblete, A. V. Vásquez, and I. M. Butler, "Image correlation method for dna sequence alignment," PLoS One 7(6), p. e39221, 2012.
