Identification of Genomic Features using Domain Teams
S. Pasek∗†, A. Bergeron‡, J-L. Risler∗, A. Louis∗, E. Ollivier∗ and M. Raffinot§
Abstract
Comparison of genomic segments within a species, or across several species, reveals a high number of “cut-copy-paste” mutations that alter gene content, order, orientation and proximity. The study of these segments of almost identical gene content considerably helps the prediction of features of interest, such as gene fusions, or physical and functional interactions. In this paper, we discuss the automated detection of such groups of genes, based on a combinatorial model that allows the simultaneous processing of several intra- and interspecies comparisons. In addition to being adapted to large-scale comparisons, our approach adopts the innovative technique of deconstructing genes into domains before detecting conserved groups. We are thus able to avoid many of the pitfalls of gene family identification methods, which are often hampered by multi-domain genes or low levels of sequence similarity.
1 Introduction
According to widely accepted terminology, the title of this paper should have been Conserved gene clusters of clusters of conserved genes, where each repetition of the same word covers a quite different reality. Conserved gene clusters refers to segments with almost identical gene content that can be found in different organisms. In this context, the word ‘gene’ usually designates a family of similar sequences, and genes from different families are clustered in a single organism. On the other hand, clusters of conserved genes refers to the families of similar sequences. Here, the word ‘gene’ refers to a single sequence, and genes from different organisms are clustered in a single family. To further confuse matters, gene clusters can occur at different places in the same organism, and clustering genes does not always involve sequence similarity. In this paper, clusters, in the sense of segments with almost identical gene content, will be referred to as teams. These have been extensively studied, since the detection, across several genomes, of local conservation of gene content and proximity considerably helps the prediction of features of interest (Galperin and Koonin, 2000; Sali, 1999; Marcotte et al., 1999a,b). Indeed, conservation of physical proximity of homologous genes – not necessarily operons – across different species probably points, in many cases, to a selection pressure that tends to preserve this very proximity (Overbeek et al., 1999). Hence proteins encoded by such conserved groups of genes are good candidates for inferring physical interactions, or participation in common metabolic/regulatory networks. Conservation of gene content may also provide a way to infer phylogenetic reconstructions by identification of some of the numerous rearrangement events that can affect a genome, such as transpositions, deletions, insertions, inversions, fusions and fissions.
Historically, the first examples involved segments with conserved gene content and order (Overbeek et al., 1999), but the plasticity of genomes soon yielded concepts “where order, contiguity and even strandedness may be relaxed to some extent.” (Sankoff, 2003). Such sets are sometimes informally defined as “groups of homologous genes that are located at contiguous positions in the genome of multiple organisms” (Fujibuchi et al., 2000), or more formally as gene teams (Luc et al., 2001). More recently (He and Goldwasser, 2004), an extension of gene teams was proposed for the comparison of two chromosomes in the presence of multiple copies of each gene. Here, we further generalize this extension to cover self-comparison and the comparison of multiple organisms simultaneously. The simultaneous comparison of many different genomes also raises the delicate problem of clustering genes, in the sense of identifying families of similar sequences. In the sequel, we will refer to this problem as the labeling problem: each sequence should have a label that identifies its family. Labels can be gene names, gene annotations, multiple alignments, structural alignments, COG numbers (Tatusov et al., 2000), EC numbers (Bairoch, 2000), etc.

∗ Laboratoire Génome et Informatique, Evry, France.
† Corresponding author: [email protected]
‡ LaCIM, Université du Québec à Montréal, Canada.
§ Laboratoire Génome et Informatique, Evry, and Ecole Normale Supérieure, Paris, France.

In this paper, we take the novel approach of temporarily forgetting genes, and working directly with domains. This choice has many interesting consequences: we integrate structural information, bypass strict homology, and detect rearrangements that would not be obvious otherwise. This new approach is discussed in detail in Section 2, and we denote this extension of the previous gene teams model (Luc et al., 2001) as domain teams. This model also allows the identification of gene fusion events, since many of these can be considered as extreme cases of conservation of gene proximity. Indeed, “evolution of gene fusion often involves an intermediate stage, during which the future fusion components exist as juxtaposed and co-regulated, but still distinct genes within operons” (Yanai et al., 2002). Fusions can lead to the production of, for example, a bifunctional enzyme where one single protein catalyzes two steps in the same metabolic pathway. Well-known examples are the thrA and metL genes in E. coli that encode the bifunctional aspartate kinase-homoserine dehydrogenase I and II (Patte, 1996). Thus, the detection of gene fusion events can be used to predict functional associations of proteins such as functional interaction, or complex formation (Yanai et al., 2001; Enright and Ouzounis, 2001; Enright et al., 1999). In the present study, gene fusions are detected only for genes that are physically close to one another in at least two species. We thus expect a number of false negatives, but the proximity constraint should considerably reduce the number of false positives. We implemented this concept in a program named DomainTeams, freely available for academic purposes. It identifies domain teams in a large set of organisms in reasonable time.
The software produces a wealth of results that can be queried to detect various genomic features, such as gene fusions. This paper is organized as follows. In Section 2, we discuss the choice of considering domains instead of genes. In Section 3, we introduce the concept of domain teams using real examples. Section 4 presents many biological instances of domain teams, arising from the comparison of various organisms, and illustrating the particularities of our approach. We present the software in Appendix A and the experimental methods in Appendix B.
2 From Genes to Domains
The first model of gene teams (Bergeron et al., 2002) was based on the assumption that a gene had exactly one ortholog in each chromosome under study. The classical answer to the selection of such orthologous groups of genes is to use bi-directional best hits computed with popular local alignment software. In this paper, we want to process much more realistic models of chromosomes, in which genes can be duplicated, or even be absent from some chromosomes. In this case, the techniques required to correctly label gene families are much less obvious. A common approach is to use one of the various techniques to cluster proteins or genes. One well-known example is the COG database (Tatusov et al., 2000), which imposes very stringent conditions on its clusters. Finding teams using COG numbers, as shown in (He and Goldwasser, 2004), produces far fewer results than expected. One of the reasons seems to be the treatment of genes resulting from fusions. In the COG database, these genes have two or more COG numbers. In order to achieve computational tractability, algorithms to find gene teams rely on sequences that have a unique label. He & Goldwasser answer this problem by removing these genes from their study [private communication]. Relaxing constraints introduces other types of complications. Indeed, a number of studies have been devoted to the clustering of (protein) sequences on the basis of some similarity criterion, but the modular nature of proteins makes it impossible to partition the sequences. It is not feasible, and it would be biologically irrelevant, to assign a multi-domain or bi-functional protein to one single unambiguous class. For example, the aggregation of sequences linked by a level of similarity above a given threshold often yields huge families. These families result from a massive chain effect due to multi-domain proteins. Highly divergent sequences get the same label, therefore invalidating the results of the algorithms.
Moreover, all criteria used to split those huge families are bound to fail at one point, since many proteins have multiple functions. It was thus quite natural to consider domains, instead of genes, as the basic units to cluster, in both meanings of the term. Since many databases, such as Pfam (Bateman et al., 2004), impose a non-overlapping rule on domain families, the problem of unique labels is solved. Moreover, considering duplications at the domain level provides an intermediate level between the purely syntactic approach of Pevzner and Tesler (Pevzner and Tesler, 2003), and the classical steps of gene-finding and ortholog identification. We thus gain considerable sensitivity, without sacrificing biological relevance.
3 Domain Teams
From a computational point of view, a chromosome can be defined as a collection of genes. A gene is known through a certain number of properties such as position, orientation, nucleic acid sequence, coding sequence(s), etc. For the study of simple organisms, one can assume that genes do not overlap, and that there is a unique coding sequence associated to the nucleic acid sequence of a gene. Thus, a chromosome becomes an ordered sequence of genes, and admits a representation such as the one in Fig. 1.

[Figure 1 diagram: gene FRUB with domains 359 and 381; gene FRUK with domain 294; gene FRUA with domains 2379 and 2378.]
Figure 1: Three genes and their five domains on the chromosome of Y. pestis. Genes are represented by thick arrows and labelled by their names above the arrows. Within these arrows are given the PfamA domain numbers found in the corresponding encoded proteins, where 359 stands for PF00359.
We will also assume that each gene is decomposed in consecutive domains, each domain having a label. For example, the three genes of Fig. 1 have a total of five domains. The distance between two domains on the same chromosome is the difference between their positions. The position of a domain can be defined as its starting position along the DNA sequence, or using the order in which domains appear on the chromosome. This last approach, which is used in this paper, is highly suitable for the comparative study of successive, or nearly successive, domains or genes. Indeed, although prokaryotic genomes are generally compact, intergenic regions of several thousand base pairs can be found even in the simplest eukaryotic genomes, and comparing distances in base pairs across different organisms becomes meaningless.

Given a set S of domain labels, and a fixed distance δ, the labels of S divide a set of chromosomes in δ-chains. These are maximal runs of domains, with labels in S, such that the distance between two consecutive domains in a run is less than or equal to δ. For example, consider the following set C of chromosomes, in which the domains of interest are those with labels in S = {A, B, C}:

C = ABD EF BCAGH IJAKBCLM N OP CAQARS

With δ = 2, the set S induces four δ-chains on the chromosomes of C: AB, BCA, AKBC, and CAQA. Note that the domains in different δ-chains can appear in different order, and are not necessarily contiguous in a given δ-chain. The content of a δ-chain is the subset of S of labels that appear in the domains of the run. Each δ-chain that contains all labels of a set S is called an occurrence of the set S. A set of labels T is an extension of a set S if S is contained in T, and each occurrence of S is contained in an occurrence of T.

Definition 1 Given δ, a set S of labels is a δ-team of a set of chromosomes C if there is at least one occurrence of the set S in C, and S has no extension.
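The notion of δ-chain can be illustrated with a minimal Python sketch. It is not taken from DomainTeams: the function name `delta_chains`, the representation of a chromosome as a string of single-letter labels, and the split of the example set C into two chromosomes are all assumptions made for illustration (the four chains come out the same if C is read as one chromosome).

```python
from typing import List, Set, Tuple

# (chromosome index, first position, last position), positions being ordinal ranks
Chain = Tuple[int, int, int]


def delta_chains(chromosomes: List[str], labels: Set[str], delta: int) -> List[Chain]:
    """Return the maximal runs of domains with labels in `labels` such that two
    consecutive selected domains are at most `delta` positions apart."""
    chains: List[Chain] = []
    for c, chrom in enumerate(chromosomes):
        start = prev = None
        for pos, label in enumerate(chrom):
            if label not in labels:
                continue
            if prev is not None and pos - prev > delta:
                chains.append((c, start, prev))  # gap too large: close the run
                start = pos
            if start is None:
                start = pos
            prev = pos
        if prev is not None:
            chains.append((c, start, prev))
    return chains


# Hypothetical split of the example set C into two chromosomes.
C = ["ABDEFBCAGH", "IJAKBCLMNOPCAQARS"]
chains = delta_chains(C, {"A", "B", "C"}, 2)
print([C[c][s:e + 1] for c, s, e in chains])  # ['AB', 'BCA', 'AKBC', 'CAQA']
```

Positions are ordinal ranks along each chromosome, as in the text, so a chain such as AKBC is reported with its intervening domain K.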
For example, in the set of chromosomes C above, the set S = {A, B, C} is a δ-team with δ = 2. It has two occurrences: BCA and AKBC. On the other hand, the set {B} is not a δ-team, since the set T = {A, B} is an extension of {B}, which means that each occurrence of label B implies a nearby occurrence of label A. [Note that the reverse is not true.] Fig. 2 shows an example of a domain team in four different organisms, exhibiting significant rearrangements. The five domains present in Y. pestis (Fig. 1) are transposed, reversed and duplicated in S. typhi, E. coli and V. cholerae.

Figure 2: A domain team of five domains with occurrences in four different organisms, with two occurrences in S. typhi. The first occurrence in S. typhi has the same domain order and content as the occurrence in Y. pestis, except that the whole segment is reversed. In the second occurrence in S. typhi, domain 294 is duplicated in reverse, sandwiching an insertion of a new domain; there is also a transposition of the two domains 359 and 381, and a duplication of domain 359, with respect to the four other occurrences. V. cholerae has a tandem duplication of domain 2379, and E. coli has a duplication of domain 294.
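Definition 1 can be checked directly on the small example set C of this section. The sketch below is a brute-force test written for illustration only, not the algorithm used by DomainTeams; the names `occurrences`, `is_extension` and `is_team`, and the split of C into two chromosomes, are assumptions. The only shortcut it takes is that a label of an extension must appear on every chromosome that carries an occurrence of S.

```python
from itertools import combinations


def occurrences(chromosomes, labels, delta):
    """δ-chains of `labels` whose content is the whole set, as (chromosome, first, last)."""
    labels = set(labels)
    found = []
    for c, chrom in enumerate(chromosomes):
        runs, run = [], []
        for pos, label in enumerate(chrom):
            if label not in labels:
                continue
            if run and pos - run[-1] > delta:
                runs.append(run)
                run = []
            run.append(pos)
        if run:
            runs.append(run)
        found += [(c, r[0], r[-1]) for r in runs if {chrom[i] for i in r} == labels]
    return found


def is_extension(chromosomes, small, big, delta):
    """True if `small` is strictly inside `big` and every occurrence of `small`
    lies within an occurrence of `big`."""
    small, big = set(small), set(big)
    if not small < big:
        return False
    big_occ = occurrences(chromosomes, big, delta)
    return all(
        any(c == c2 and s2 <= s and e <= e2 for c2, s2, e2 in big_occ)
        for c, s, e in occurrences(chromosomes, small, delta)
    )


def is_team(chromosomes, labels, delta):
    """Brute-force check of Definition 1 (exponential in the number of candidate labels)."""
    labels = set(labels)
    occ = occurrences(chromosomes, labels, delta)
    if not occ:
        return False
    # Any extra label must be present on every chromosome carrying an occurrence.
    candidates = set.intersection(*(set(chromosomes[c]) for c, _, _ in occ)) - labels
    for k in range(1, len(candidates) + 1):
        for extra in combinations(sorted(candidates), k):
            if is_extension(chromosomes, labels, labels | set(extra), delta):
                return False
    return True


C = ["ABDEFBCAGH", "IJAKBCLMNOPCAQARS"]  # hypothetical split of the example set C
print(is_team(C, {"A", "B", "C"}, 2))        # True
print(is_team(C, {"B"}, 2))                  # False
print(is_extension(C, {"B"}, {"A", "B"}, 2))  # True
print(is_extension(C, {"A"}, {"A", "B"}, 2))  # False
```

The last two lines mirror the remark in the text: every B has a nearby A, but the run CAQA contains domains labelled A with no B within distance δ.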
Definition 1 is a direct generalization of the notion of gene teams introduced in (Bergeron et al., 2002), which addressed the case of chromosomes containing a unique copy of each gene. In (He and Goldwasser, 2004), He & Goldwasser define an extension of gene teams that allows multiple copies of a gene in a chromosome. However, to achieve polynomial-time complexity of their algorithm, they restrict the number of chromosomes to two, and identify as gene teams those with at least one occurrence in each chromosome. This is clearly interesting for the comparison of two organisms, but does not allow for self-comparison, and does not upgrade polynomially to several organisms.

The number of teams can be exponential. Without additional constraints, Definition 1 also leads to theoretically exponential algorithms, since the number of gene teams can be exponential in the number of labels. However, as we will see in Section 4, real-life examples involving thousands of genes can be computed efficiently. In order to show the exponential nature of Definition 1, consider a set L of n labels. Construct n chromosomes, each containing n − 1 different labels obtained by removing one different label from L. Then, for δ = n − 2, each proper subset of L is a δ-team. For example, with n = 5 and L = {A, B, C, D, E}, one gets the following five chromosomes: ABCD ABCE ABDE ACDE BCDE. Each proper subset S of L has at least one occurrence, since S is contained in at least one chromosome, and the distance between two labels in a chromosome is always at most δ = n − 2. For any domain d not in S, there is an occurrence of S which is not contained in an occurrence of S ∪ {d}, namely the one on the chromosome from which d was removed; the same chromosome rules out every larger set containing d, therefore S has no extension. Thus S is a δ-team.
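The construction above can be verified by exhaustive enumeration for small n. The following sketch is illustrative only; `occurrences`, `is_team` and `count_teams` are hypothetical names, and the team test is a direct, exponential transcription of Definition 1 rather than the DomainTeams algorithm. Counting non-empty label sets, the expected number of δ-teams is 2^n − 2.

```python
from itertools import combinations


def occurrences(chromosomes, labels, delta):
    """δ-chains of `labels` containing every label, as (chromosome, first, last)."""
    found = []
    for c, chrom in enumerate(chromosomes):
        runs, run = [], []
        for pos, label in enumerate(chrom):
            if label not in labels:
                continue
            if run and pos - run[-1] > delta:
                runs.append(run)
                run = []
            run.append(pos)
        if run:
            runs.append(run)
        found += [(c, r[0], r[-1]) for r in runs if {chrom[i] for i in r} == labels]
    return found


def is_team(chromosomes, labels, delta, alphabet):
    """Definition 1, checked against every superset of `labels` in `alphabet`."""
    occ = occurrences(chromosomes, labels, delta)
    if not occ:
        return False
    rest = sorted(alphabet - labels)
    for k in range(1, len(rest) + 1):
        for extra in combinations(rest, k):
            big_occ = occurrences(chromosomes, labels | set(extra), delta)
            if all(any(c == c2 and s2 <= s and e <= e2 for c2, s2, e2 in big_occ)
                   for c, s, e in occ):
                return False  # an extension exists
    return True


def count_teams(n):
    """Build the n chromosomes of the construction and count the δ-teams for δ = n - 2."""
    alphabet = [chr(ord("A") + i) for i in range(n)]
    chromosomes = ["".join(x for x in alphabet if x != removed) for removed in alphabet]
    subsets = (set(s) for k in range(1, n + 1) for s in combinations(alphabet, k))
    return sum(is_team(chromosomes, s, n - 2, set(alphabet)) for s in subsets)


print(count_teams(5))  # 30, i.e. 2**5 - 2 non-empty proper subsets
```

The full set L is not counted because no chromosome contains all n labels, so it has no occurrence.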
4 Experimental validation
We computed the δ-teams of various sets of proteomes containing up to 20 species, using the program DomainTeams with different values of the parameter δ. Our goal was to assess: (i) the relevance of using Pfam domains rather than similarity criteria (section 4.1); (ii) the efficiency in retrieving intra-chromosomal duplications involving paralogous genes (section 4.2); (iii) the feasibility of detecting fusion events (section 4.3). These experiments also allowed an evaluation of the program efficiency under different conditions (section 4.4). The complete results are available from the site http://lgi.infobiogen.fr/DomainTeams/.
4.1 Validity of using Pfam domains
The search for conserved teams between E. coli and S. typhi provides an example showing that the use of Pfam domains can be more sensitive than sequence comparisons. Fig. 3 displays a schematic representation of a conserved team in which the proteins share five domains. The proteins encoded by pgtA and pgtB in S. typhi are known to be the members of a two-component system (Kadner, 1996); as shown in the STRING database (Mering et al., 2003), genes encoding two-component systems are often adjacent. The pairs YfhA/YfhK and Sty2809/Sty2811 are putative proteins that received the same annotation by homology. However, sequence comparisons of PgtB with both YfhK and Sty2811 resulted in high Blast2 E-values (10 and 0.17 respectively). As a consequence, a search based on sequence similarities would have missed PgtB and, consequently, the team PgtA/PgtB. Similarly, the probable duplication of these genes in S. typhi would not have been detected by an automated procedure based on E-values. Note that a procedure based on Z-values would also have led to the same failure.

Figure 3: Part of a domain team obtained from the two bacteria E. coli (top) and S. typhi (bottom). Genes that do not contain any Pfam indication (gray arrows) correspond to PfamA unannotated proteins. Names between stars correspond to gaps within the team. For S. typhi, two occurrences are observed, separated by 207 domains. The E-values are indicated for each pair of proteins sharing the same domains. Due to the low similarities between PgtB and YfhK, and also between PgtB and Sty2811, a search based on sequence alignments would have missed the pgtA/pgtB component. The use of Pfam domains is therefore more sensitive. PgtA/PgtB is a two-component system and the two other elements are similarly annotated by homology.
4.2 Performing self-comparisons
Fig. 4 illustrates an example of a conserved team that was found by the simultaneous scanning of the four proteomes of E. coli, S. typhi, Y. pestis and V. cholerae. In E. coli, the team contains three genes (fruA, fruK, fruB), two of them (fruA and fruB) being known to encode proteins belonging to the fructose-dependent phosphotransferase system (PTS). The three genes probably belong to an operon (Lin, 1996), and their products are involved in three sequential steps that import extracellular fructose into the cell (FruA and FruB), phosphorylate fructose into fructose-1-P (FruA) and phosphorylate fructose-1-P into fructose-1,6-P2 (FruK). The same team is also found in the three other bacteria. According to Swissprot annotations, the encoded proteins have the same specificity as their counterparts in E. coli. In addition, the team is duplicated in S. typhi, but in this case the specificity of the proteins might be different. Sty3436 is annotated as “possible carbohydrate kinase”, while Sty3438 is not known to be inducible by fructose. In any case, the conservation of this team might indicate that proteins Sty3436, Sty3437 and Sty3438 are involved in the transport and phosphorylation of a sugar. Other duplications and gene fusions are observed as well for this family. They are not shown here for the sake of clarity but are available from the supplementary material.

Figure 4: The fructose-specific phosphotransferase system: a conserved team between the four bacteria E. coli, S. typhi, Y. pestis, V. cholerae, and two occurrences within the S. typhi chromosome. In the second occurrence in S. typhi, the specificity of the proteins might be different from the first.
4.3 Detecting fusions
Fig. 5 results from the search for conserved domain teams in a set of 16 chromosomes containing 10 bacterial chromosomes, 4 chromosomes of S. cerevisiae, one chromosome of S. pombe, and one of A. thaliana. This team contains genes encoding proteins that belong to the aconitase/isomerase family, which appears on 12 of the 16 studied chromosomes. Most of the genes contain two Pfam domains (694 and 330), but some have only one of these domains: the two variants coexist in some organisms. These mono-domain proteins are, in the identified teams, always encoded by pairs of adjacent genes. Therefore the aconitases containing the two domains – probably fused proteins – were detected by the program. Proteins encoded by leuC and leuD are involved in the second step of leucine biosynthesis and are known to interact by forming a heterodimer, in which LeuC is called the large subunit, and LeuD the small subunit, of the 3-isopropylmalate dehydratase. In S. cerevisiae, Leu1 is a monomeric enzyme and is known to have exactly the same activity, in the same pathway, as the heterodimer LeuC-LeuD (Umbarger, 1996). Note that this fusion is also predicted by the difFuse program (Enright and Ouzounis, 2001). This example, involving proteins which are known to interact, is typical of the relevance of searching for fusion events.
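The idea of reading fusions off the occurrences of a team can be sketched as follows. This is an assumed, simplified criterion written for illustration, not the procedure implemented in DomainTeams: a protein that carries several domains of the team in one occurrence is reported when another occurrence carries the same domains on separate proteins. The function names, the data layout, and the assignment of domain 330 to leuC and domain 694 to leuD are assumptions.

```python
from typing import Dict, List, Set, Tuple

# One occurrence of a team: (protein name, domain label) pairs in chromosome order.
Occurrence = List[Tuple[str, str]]


def domains_by_protein(occurrence: Occurrence) -> Dict[str, Set[str]]:
    """Group the domain labels of an occurrence by the protein that carries them."""
    grouped: Dict[str, Set[str]] = {}
    for protein, domain in occurrence:
        grouped.setdefault(protein, set()).add(domain)
    return grouped


def candidate_fusions(team: Dict[str, Occurrence]):
    """Return (organism, fused protein, other organism, split proteins) tuples."""
    hits = []
    for org_a, occ_a in team.items():
        for fused, domains in domains_by_protein(occ_a).items():
            if len(domains) < 2:
                continue  # a mono-domain protein cannot be the fused form
            for org_b, occ_b in team.items():
                if org_b == org_a:
                    continue
                parts = {p: ds & domains
                         for p, ds in domains_by_protein(occ_b).items() if ds & domains}
                covered = set().union(*parts.values())
                # Same domains, but no single protein carries all of them.
                if covered == domains and all(ds != domains for ds in parts.values()):
                    hits.append((org_a, fused, org_b, sorted(parts)))
    return hits


team = {
    "E. coli": [("leuD", "694"), ("leuC", "330")],        # illustrative assignment
    "S. cerevisiae": [("leu1", "694"), ("leu1", "330")],
}
print(candidate_fusions(team))
# [('S. cerevisiae', 'leu1', 'E. coli', ['leuC', 'leuD'])]
```

Because the input consists of occurrences of a single team, the proximity constraint discussed in the Introduction is already satisfied by construction.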
4.4 Efficiency
Computation time is a function of the number of proteins, the number of genomes, the parameter δ, and the degree of conservation between the studied organisms. In order to explore the strengths and limitations of the methodology, we have tested it on different sets of proteins containing an increasing number of bacterial proteomes. Table 1 shows that, whichever set we consider, more than half of the proteins are annotated in the PfamA database. Up to 22 bacterial proteomes, totaling 65667 proteins, can be processed in a matter of hours, with δ = 3. For one set of four bacterial proteomes, we have tested the evolution of computation time for values of δ ranging from 3 to 10. The results are shown in Table 2.

Proteins set   nb of proteins   nb of domains   nb of teams   nb of fusions   Time user CPU (s)
4 bacteria          10638            13211           2099            69               294
8 bacteria          22402            27392           4228           170              1199
16 bacteria         48257            57609           8821           338              9589
18 bacteria         52401            63740          11169           447             46472
20 bacteria         58323            71381          12220           502             45616

Table 1: Average parameters of computation for different sets of proteins with an increasing number of bacteria and δ = 3.
Values of δ   CPU Time (s)
     3             159
     4             230
     5             381
     6             757
     7            1522
     8            1615
     9            3617
    10            9010

Table 2: Computation time for processing the set of four bacterial proteomes with different values of δ. Note that a value of δ = 3 corresponds to a maximum gap of 2 domains.
A Overview of the Software
The software DomainTeams is available on the site http://lgi.infobiogen.fr/DomainTeams/. Basically, it takes as input a file (option -a) and a maximal distance (option -d).
A.1 Format of the input file
Here is an example of the input format:

    A>PF02518  2  3455  HATPASE_C
    LA2830  Q8F2E1  0  2  Two-component hybrid sensor and regulator
    2813122  375  R
    A>PF00512  2  3456  HISKA
    LA2830  Q8F2E1  1  2  Two-component hybrid sensor and regulator
    2813122  375  R

Each entry corresponds to a domain in a specific position on a unique chromosome. It begins with A>. The first line (line 1 of the example) is formatted the following way:
• the name of the domain (ex: PF02518);
• the number of the chromosome on which it stands (beginning at 0, ex: 2);
• its position on the chromosome (ex: 3455);
• its annotation (ex: HATPASE_C).

The next two lines (lines 2 and 3 of the example) are devoted to the protein to which the domain belongs. The first of these is composed of:
• the name of the protein (ex: LA2830);
• its SwissProt ID (ex: Q8F2E1);
• the number of the domain in the protein (ex: 0);
• the total number of domains in the protein (ex: 2);
• its annotation (ex: Two-... regulator).

The second is composed of:
• the physical position of the protein (ex: 2813122);
• its size in amino acids (ex: 375);
• its orientation on the chromosome (F or R).

The same protein may be associated with many domains (in the example, protein LA2830 is associated with 2 domains) and the information is partly redundant. However, the entry format has the advantage of being easy to generate automatically. Many examples of complete entry files are available with the software.

XXX domains
Special domains, labelled XXX, indicate that the underlying protein is not annotated with any domain.

    A>XXX  2  3281  XXX
    LA2683_2  Q8F2T5  0  0  Conserved hypothetical protein
    2668962  0  R

Two different XXX domains are never considered as orthologs or paralogs.
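The three-line entry format described above lends itself to a small reader. The sketch below is not part of DomainTeams; the class `DomainEntry`, the function `parse_entries` and all field names are hypothetical, and it assumes that fields are separated by whitespace, with the free-text annotation as the last field of its line.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class DomainEntry:
    domain: str              # e.g. PF02518, or XXX for an unannotated protein
    chromosome: int          # chromosome number, starting at 0
    position: int            # ordinal position of the domain on the chromosome
    domain_annotation: str
    protein: str
    swissprot_id: str
    domain_index: int        # number of the domain in the protein
    domain_count: int        # total number of domains in the protein
    protein_annotation: str
    protein_start: int       # physical position of the protein
    protein_size: int        # size in amino acids
    orientation: str         # F or R


def parse_entries(text: str) -> List[DomainEntry]:
    """Parse entries made of three non-empty lines each (assumed whitespace-separated)."""
    lines = [ln.strip() for ln in text.splitlines() if ln.strip()]
    if len(lines) % 3 != 0:
        raise ValueError("expected three lines per entry")
    entries = []
    for i in range(0, len(lines), 3):
        head, prot, loc = lines[i:i + 3]
        if not head.startswith("A>"):
            raise ValueError(f"entry does not start with 'A>': {head!r}")
        domain, chromosome, position, domain_annotation = head[2:].split(None, 3)
        protein, swissprot_id, index, count, protein_annotation = prot.split(None, 4)
        start, size, orientation = loc.split()
        entries.append(DomainEntry(
            domain, int(chromosome), int(position), domain_annotation,
            protein, swissprot_id, int(index), int(count), protein_annotation,
            int(start), int(size), orientation))
    return entries


SAMPLE = """\
A>PF02518 2 3455 HATPASE_C
LA2830 Q8F2E1 0 2 Two-component hybrid sensor and regulator
2813122 375 R
A>PF00512 2 3456 HISKA
LA2830 Q8F2E1 1 2 Two-component hybrid sensor and regulator
2813122 375 R
"""

entries = parse_entries(SAMPLE)
print(entries[0].domain, entries[0].position, entries[1].domain_annotation)
# PF02518 3455 HISKA
```

Sorting the parsed entries by `(chromosome, position)` would give the ordered sequence of domain labels on which δ-chains are defined in Section 3.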
A.2 Output
A domain team
Each domain team is reported the following way.
    ---------------------------------------------------------------------------------------------
    NUMBER|[EVAL]|NBCHR|NBOCC|NBDOM|NBPROT|PRCOMPL|POSTOT|POSIN|POSOUT|MAXDIS|MINDIS|AVDIS|NBFUS|
      34  |   1  |  2  |  2  |  4  |   7  |   5   |  15  |  11 |   4  |   9  |   6  |2.181|  1  |
    ---------------------------------------------------------------------------------------------
    * 2 LA2827_2[PF00072][2/2] LA2828_2[PF02518][1/2] LA2828_2[PF00512][2/2] LA2829_2[PF00072][1/1] LA2830_2[PF02518][1/2] LA2830_2[PF00512][2/2] #LA2831_2[PF00989][1/2]# #LA2831_2[PF00989][2/2]# LA2832_2[PF01734][1/2]
      3 LB241_3[PF00512][1/3] LB241_3[PF02518][2/3] LB241_3[PF00072][3/3] *[268]* *[269]* LB244_3[PF01734][1/1]
    !
    // FUSION
    * 2 LA2829_2[PF00072][1/1] LA2830_2[PF02518][1/2] LA2830_2[PF00512][2/2] {527}
      3 LB241_3[PF00512][1/3] LB241_3[PF02518][2/3] LB241_3[PF00072][3/3] {869}
Parameters of the domain team are listed first (see next paragraph), followed by the domain team itself. Each occurrence is a block of elements of the form Protein[Domain][e/t], where e is the number of the domain in the protein and t the total number of domains in the protein (ex: LA2830_2[PF02518][1/2]). An element enclosed in # (ex: #LA2831_2[PF00989][1/2]#) does not belong to the domain team, but is included due to its position. A domain enclosed in ∗ (ex: *[268]*) is a non-annotated domain inside the team. If a fusion is detected, it is listed separately. The sum of the lengths of the proteins appears at the end of each occurrence.
The set of results
The resulting set of domain teams is summarized in a table sorted by decreasing score. The score is the result of a user-defined rational expression evaluated on each domain team using global and local parameters. Global parameters are the total number of domains and proteins, the number of chromosomes and the input distance δ. The parameters local to each domain team are:
• its number of chromosomes (NBCHR);
• its number of occurrences (NBOCC);
• its number of distinct domains (NBDOM);
• its number of complete proteins (PRCOMPL);
• its total number of positions (POSTOT);
• the number of positions really appearing in the domain team (POSIN);
• the number of positions inside but not belonging to the domain team (POSOUT);
• the maximal and minimal length of an occurrence (MAXDIS and MINDIS);
• the average distance between two positions in the domain team (AVDIS);
• its number of fusions (NBFUS).
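Several of the local parameters can be recomputed from the elements of a reported domain team, which gives a quick consistency check on the output. The sketch below is an illustration under assumptions, not code from DomainTeams: the regular expressions, the function `local_parameters`, and the readings of NBPROT (distinct proteins with a domain in the team) and of MAXDIS/MINDIS (number of positions in the longest and shortest occurrence) are inferred from the sample report of Section A.2.

```python
import re
from collections import Counter

# Protein[Domain][e/t], optionally wrapped in '#' when it lies outside the team.
ELEMENT = re.compile(r"(#?)(\w+)\[(\w+)\]\[(\d+)/(\d+)\]#?")
# Non-annotated domain inside the team, e.g. *[268]*
GAP = re.compile(r"\*\[\d+\]\*")


def local_parameters(occurrences):
    """Recompute a few local parameters from the element lists of the occurrences."""
    postot = posin = 0
    domains, seen, totals, lengths = set(), Counter(), {}, []
    for line in occurrences:
        tokens = line.split()
        lengths.append(len(tokens))
        for token in tokens:
            postot += 1
            match = ELEMENT.fullmatch(token)
            if match is None:
                if GAP.fullmatch(token) is None:
                    raise ValueError(f"unrecognized element: {token!r}")
                continue  # non-annotated position: counted in POSTOT only
            if match.group(1) == "#":
                continue  # annotated, but not part of the team
            posin += 1
            protein = match.group(2)
            domains.add(match.group(3))
            seen[protein] += 1
            totals[protein] = int(match.group(5))
    return {
        "NBOCC": len(occurrences), "NBDOM": len(domains), "NBPROT": len(seen),
        "PRCOMPL": sum(seen[p] == totals[p] for p in seen),
        "POSTOT": postot, "POSIN": posin, "POSOUT": postot - posin,
        "MAXDIS": max(lengths), "MINDIS": min(lengths),
    }


sample = [
    "LA2827_2[PF00072][2/2] LA2828_2[PF02518][1/2] LA2828_2[PF00512][2/2] "
    "LA2829_2[PF00072][1/1] LA2830_2[PF02518][1/2] LA2830_2[PF00512][2/2] "
    "#LA2831_2[PF00989][1/2]# #LA2831_2[PF00989][2/2]# LA2832_2[PF01734][1/2]",
    "LB241_3[PF00512][1/3] LB241_3[PF02518][2/3] LB241_3[PF00072][3/3] "
    "*[268]* *[269]* LB244_3[PF01734][1/1]",
]
params = local_parameters(sample)
print(params)
```

On the sample team, the recomputed values agree with the header row of the report (NBOCC 2, NBDOM 4, NBPROT 7, PRCOMPL 5, POSTOT 15, POSIN 11, POSOUT 4, MAXDIS 9, MINDIS 6).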
B Methods
The proteomes of various eubacteria, archaebacteria and eukaryotes were downloaded from the EBI server (http://www.ebi.ac.uk/proteome) together with the “chromosome tables” that give, for each protein, through its Swissprot/TrEmbl accession number, the position of the corresponding coding gene on the chromosome. The list of Pfam motifs present in the proteins (swisspfam) was downloaded from the Sanger Institute (http://ftp.sanger.ac.uk/pub/databases/Pfam/swisspfam.gz), and curated to keep only the reliable PfamA domains. Computations were performed on the Infobiogen Sun 15k computer.
References
Bairoch, A., 2000. The enzyme database in 2000. Nucleic Acids Res 28, 304–305.
Bateman, A., Coin, L., Durbin, R., Finn, R. D., Hollich, V., Griffiths-Jones, S., Khanna, A., Marshall, M., Moxon, S., et al., 2004. The Pfam protein families database. Nucleic Acids Research 32, D138–D141.
Bergeron, A., Corteel, S., Raffinot, M., 2002. The algorithmic of gene teams. In: Workshop on Algorithms in Bioinformatics (WABI). No. 2452 in Lecture Notes in Computer Science. Springer-Verlag, Berlin, pp. 464–476.
Enright, A. J., Iliopoulos, I., Kyrpides, N. C., Ouzounis, C. A., 1999. Protein interaction maps for complete genomes based on gene fusion events. Nature 402 (6757), 86–90.
Enright, A. J., Ouzounis, C. A., 2001. Functional associations of proteins in entire genomes by means of exhaustive detection of gene fusions. Genome Biol 2 (9), RESEARCH0034.1–0034.7.
Fujibuchi, W., Ogata, H., Matsuda, H., Kanehisa, M., 2000. Automatic detection of conserved gene clusters in multiple genomes by graph comparison and p-quasi grouping. Nucleic Acids Research 28 (20), 4029–4036.
Galperin, M. Y., Koonin, E. V., 2000. Who’s your neighbor? New computational approaches for functional genomics. Nat. Biotechnol 18, 609–613.
He, X., Goldwasser, M., 2004. Identifying conserved gene clusters in the presence of orthologous groups. In: Recomb2004. pp. xx–xx.
Kadner, R. J., 1996. Cytoplasmic Membrane in Escherichia coli and Salmonella typhimurium, cellular and molecular biology, second edition. Vol. 1. American Society for Microbiology.
Lin, E. C. C., 1996. Dissimilatory Pathways for Sugars, Polyols, and Carboxylates in Escherichia coli and Salmonella typhimurium, cellular and molecular biology, second edition. Vol. 1. American Society for Microbiology.
Luc, N., Risler, J.-L., Bergeron, A., Raffinot, M., 2001. Gene teams: a new formalization of gene clusters for comparative genomics. Computational Biology and Chemistry (311), 639–656.
Marcotte, E. M., Pellegrini, M., Thompson, M. J., Yeates, T., Eisenberg, D., 1999a. A combined algorithm for genome-wide prediction of protein function. Nature 402, 83–86.
Marcotte, E. M., Pellegrini, M. J., Ho-Leung, N., Rice, D. W., Yeates, T., Eisenberg, D., 1999b. Detecting protein function and protein-protein interactions from genome sequences. Science 285, 751–753.
Mering, C., Huynen, M., Jaeggi, D., Schmidt, S., Bork, P., Snel, B., 2003. STRING: a database of predicted functional associations between proteins. Nucleic Acids Research 31 (1), 258–261.
Overbeek, R., Fonstein, M., D’Souza, M., Pusch, G. D., Maltsev, N., 1999. The use of gene clusters to infer functional coupling. Proc. Natl. Acad. Sci. USA 96 (6), 2896–2901.
Patte, J. C., 1996. Biosynthesis of Threonine and Lysine in Escherichia coli and Salmonella typhimurium, cellular and molecular biology, second edition. Vol. 1. American Society for Microbiology.
Pevzner, P., Tesler, G., 2003. Genome Rearrangements in Mammalian Evolution: Lessons From Human and Mouse Genomes. Genome Res. 13 (1), 37–45.
Sali, A., 1999. Functional links between proteins. Nature 402, 23–26.
Sankoff, D., 2003. Rearrangements and chromosomal evolution. Current opinion in genetics and development (13), 583–587.
Tatusov, R. L., Galperin, M. Y., Natale, D. A., Koonin, E. V., 2000. The COG database: a tool for genome-scale analysis of protein functions and evolution. Nucl. Acids. Res. 28 (1), 33–36.
Umbarger, H. E., 1996. Biosynthesis of the Branched-Chain Amino-Acids in Escherichia coli and Salmonella typhimurium, cellular and molecular biology, second edition. Vol. 1. American Society for Microbiology.
Yanai, I., Derti, A., DeLisi, C., 2001. Genes linked by fusion events are generally of the same functional category: a systematic analysis of 30 microbial genomes. Proc Natl Acad Sci U S A 98 (14), 7940–5.
Yanai, I., Wolf, Y. I., Koonin, E. V., 2002. Evolution of gene fusions: horizontal transfer versus independent events. Genome Biol 3 (5), research0024.1–0024.13.
Figure 5: Members of the aconitase/isomerase family across different species, comprising bacteria and eukaryotes, where fusions have been detected by DomainTeams. Numbers below genes represent amino-acid lengths of proteins or pairs of proteins.