Longest Common Circular Chains of Maximal Unique Matches between Bacterial Genomes Fr´ed´eric Guyon, Serge Hazout INSERM E346 2, place Jussieu, 75005 Paris, France fguyon,
[email protected]
Alain Gu´enoche IML-CNRS 163 Av. de Luminy, 13009 Marseille
[email protected]
keywords: genome comparison, genome evolution.
1. Introduction The aim of this study is to compare complete genomes to identify preserved large DNA fragments in order to analyse evolution process. For this purpose, we propose an efficient method to identify conserved regions between multiple genomes. We first transform each complete genome into a permutation of Maximal Unique Matches (MUMs). Secondly, we compute the longest common sequence of MUMs in the same order in all the genomes. This permits to define conserved genome segments as long DNA fragments having MUMs in the same order and to analyse evolution events as reversal or transposition of this fragments.Finally, we propose some genomic distances based on MUMs and conserved genome segments length and number.
2. More on MUMs and LCCCs Recently, closely related bacterial genomes have been made available. Comparing these genomes reveals some large scale rearrangements. In particular, chromosomal inversion centered at the origin of replication is an important mechanism involved in bacterial evolution [5]. Many recent works deal with the sorting by reversal problem which is to find the shortest sequences of signed inversions transforming one genome into another one [2], [1], [3]. However, this theory supposes that a genome can be represented by an ordering of oriented genes (signs + and - corresponding to strands), in order to define a bijective mapping between genomes. Therefore the comparison of two genomes implies to select first the common genes and to designate the ortholog ones, that are directly inherited from a common ancestor. This is not possible for bacterial genomes, even for closely related ones, because bacteria are subject to non conservative transformations such as gene duplication, deletion and insertion (horizontal transfer). Moreover, these events can blur large scale inversions and transpositions making them difficult to identify. To overcome these difficulties we propose a new method based on order comparison between genomes. These latter are encoded as sequences of common patterns that are unique words belonging to all the genomes. They also have a maximal length. They are called Maximal Unique Matches and were introduced by Delcher et al.[4]. Using MUMs has a great advantage compared to genes ; it becomes useless to match ortholog genes and unnecessary to distinguish them from paralog genes (resulting of duplications). In other words, MUMs permits to define a one-to-one mapping between genomes since they are common and unique words. Moreover, this approach can be easily extended to more than two genomes because the identification of MUMs shared by several genomes is feasible. Let M be the set of all the selected MUMs, and n its number of elements. They permit to consider each genome as a permutation on M , denoted (m1 ; m2 ; ::mn ), where mi is the reference number of the i-th MUM in this permutation. The total number of these patterns is very large depending on the genomes length: several thousands when comparing two bacterial genomes. They can cover hundreds bases showing that long fragments of DNA can be preserved. However, generally they are much shorter, around 30 nucleotides, and oriented in one sens or the other according to the inversions of sequence fragments generated by the evolutionary process. So, given several circular permutations on a set M , corresponding to different genomes, we look for a subset having its elements ranked in the same circular order in each permutation. It means that the order of MUMs reduced to these chains are identical. This subset is totally ordered, so it is a chain, called a longest common circular chain (LCCC).
3. Methods and Results
P.horikoshii
P.abyssi
P.furiosus
MUMs shared by several genomes comparing sequences in direct and reverse orders are selected using a suffix tree. Suffix trees are very efficient data structures for finding all the matches common to two or more strings [7]. We also establish some statistics on the number and the length of expected motifs according to the genome length and number. These statistics show that MUMs exist in great quantity even in totally random sequences , and they give us the mean to assess the significance of the MUMs chains depending on MUMs number and length. We describe an algorithm to built a longest common chain to two linear permutations and we have extended it to circular ones. We have developed a new method to build LCCC to more than two permutations and we have proved that this problem has a polynomial time and space complexity [6]. We compare some complete bacterial genomes. The mapping of different conserved segments permits to reveal some large evolution rearrangements between closely related species (Fig.1). Finally, we discuss some genomic distances based on MUMs and conserved genome segments length and number. We construct phylogenetic trees of microbial genomes based on these distances and we compare them to trees obtained with different methods.
0
500000
1000000
1500000
pB
Figure 1. Segments mapping plot of Pyrococcus horikoshii with Pyrococcus abyssi and Pyrococcus furiosus.
References [1] D. Bader, B. Moret, and M. Yan. A linear-time algorithm for computing inversion distances between signed permutations with an experimental study. J. Comp. Biol, 8:483–491, 2001. [2] V. Bafna and P. Pevzner. Transforming cabbage into turnip: polynomial algorithm for sorting signed permutations by reversal. J.ACM, 46:1–27, 1999. [3] A. Bergeron and M. Corteel, S.and Raffinot. The algorithmic of gene teams. In Lecture Notes in Computer Science, volume 2452, 2002. [4] A. Delcher, S. Kasif, R. Fleischmann, J. Peterson, O. White, and S. Salzberg. Alignment of whole genomes. Nucleic Acids Res, 27(11):2369–76, 1999. [5] J. Eisen, J. Heidelberg, O. White, and S. Salzberg. Evidence for symmetric chromosomal inversions around the replication origin in bacteria. Genome Biol, 1(6), 2000. [6] A. Gu´enoche and P. Vitte. M´ethodes exacte et approch´ee pour le probl`eme de la plus longue sous-s´equence commune a` p chaˆınes de caract`eres. T.S.I, 14:887–915, 1995. [7] D. Gusfield. Algorithms on Strings, Trees and Sequences: Computer Science and Computational Biology. Cambridge University Press, New York, 1997.