ARTICLE IN PRESS doi:10.1016/j.jmb.2004.11.050

J. Mol. Biol. (2005) xx, 1–11

The Relationship Between Domain Duplication and Recombination

Christine Vogel*, Sarah A. Teichmann and Jose Pereira-Leal

MRC Laboratory of Molecular Biology, Hills Rd, Cambridge CB2 2QH, UK

Protein domains represent the basic evolutionary units that form proteins. Domain duplication and shuffling by recombination are probably the most important forces driving protein evolution and hence the complexity of the proteome. While the duplication of whole genes as well as domain-encoding exons increases the abundance of domains in the proteome, domain shuffling increases versatility, i.e. the number of distinct contexts in which a domain can occur. Here, we describe a comprehensive, genome-wide analysis of the relationship between these two processes. We observe a strong and robust correlation between domain versatility and abundance: domains that occur more often also have many different combination partners. This supports the view that domain recombination occurs in a random way. However, we do not observe all the different combinations that are expected from a simple random recombination scenario, and this is due to frequent duplication of specific domain combinations. When we simulate the evolution of the protein repertoire considering stochastic recombination of domains followed by extensive duplication of the combinations, we approximate the observed data well. Our analyses are consistent with a stochastic process that governs domain recombination and thus protein divergence with respect to domains within a polypeptide chain. At the same time, they support a scenario in which domain combinations are formed only once during the evolution of the protein repertoire, and are then duplicated to various extents. The extent of duplication of different combinations varies widely and, in nature, will depend on selection for the domain combination based on its function. Some of the pair-wise domain combinations that are highly duplicated also recur frequently with other partner domains, and thus represent evolutionary units larger than single protein domains, which we term “supra-domains”. © 2004 Elsevier Ltd. All rights reserved.

*Corresponding author

Keywords: domain recombination; neutral evolution; domain versatility; domain shuffling

Abbreviation used: SCOP, structural classification of proteins.
E-mail address of the corresponding author: [email protected]

Introduction

Protein domains represent the basic evolutionary units that form proteins.1,2 A domain is defined here as a structural unit observed in nature either in isolation or in more than one context in multidomain proteins.2 Examination of their properties provides a key to understanding the evolution of proteomes and of an organism’s complexity. The protein domains examined here are classified into superfamilies2,3 of domains with a common ancestor. Thus, two proteins with domains of a common superfamily are likely to be evolutionarily related, originating by processes such as duplication and/or shuffling of whole genes or domain-encoding exons.4 Figure 1(a) illustrates how an organism’s protein repertoire can be represented as a collection of proteins with different domain architectures. The combinations of domains in an organism can also be represented as a network (Figure 1(b)) in which the nodes denote different domain superfamilies and the edges denote that the two domains are adjacent to each other in a protein with arrows pointing from N to C terminus. These domain combination

0022-2836/$ - see front matter © 2004 Elsevier Ltd. All rights reserved.


Figure 1. The duplication and recombination of domains form the protein repertoire. The Figure illustrates, from a domain perspective, the main processes that formed the protein repertoire and which are discussed here: the duplication and recombination of domains. (a) Proteome of an organism, consisting of eight different proteins (black lines) with domains of different superfamilies represented as boxes of different colours. (b) An alternative representation of the proteome: the colour of a circle represents a particular superfamily, the size of the circle corresponds to the abundance of the domains of that superfamily in the genome. The connections between the circles denote domain combinations. (c) Abundance is the result of a variety of duplication processes. The abundance of each domain superfamily is the number of domains in the proteins of the proteome, disregarding tandem repeats. (d) Versatility is the number of different N and C-terminal partner domains of a particular superfamily that are observed. The example shows the combinations of the blue superfamily: it occurs with one N-terminal and two different C-terminal partner superfamilies.

networks are known to be scale-free, which means that a few superfamilies are connected to many different superfamilies, while most superfamilies are adjacent to only one or two types of neighbours.5,6 Duplication, transposition and horizontal gene transfer result in the expansion of domain superfamilies in terms of their abundance; that is, the number of occurrences of a domain in the proteome of one organism (Figure 1(c)). Protein and domain duplication have been studied extensively and have been connected with an increase in an organism’s complexity in evolution, for example expansions of particular families in higher eukaryotes.7–15 Domains can recombine to form multi-domain proteins, and proteins with two or more domains constitute the majority of proteins in all organisms studied.16,17 Thus, the recombination of existing domains may be a major mechanism that modifies protein function18 and increases proteome complexity.16,19 The combination or shuffling of domains increases what we term the versatility of a domain superfamily; that is, the number of different partner domains that domains of a particular superfamily are adjacent to (Figure 1(d)). The exact interplay of the mechanisms that underlie the evolution of the protein repertoire is still not completely resolved. In particular, it is unclear to what extent the recombination of protein domains, or domain versatility, is under selection, or if it is the result of a purely random shuffling

process.20 Next, we discuss the arguments for both mechanisms. The frequency distribution of domain family sizes, of protein length measured as number of domains per protein and the distribution of domain versatility are all best approximated by a power-law function.5,6,21 These mathematical relationships are often taken as support of a neutral or random mode of evolution.5,20–23 However, it is still unclear to what extent a random model holds true for detailed examples. The observation that the number of domain combinations in nature is only a small fraction of the possible number of combinations suggests that domain recombination is under strong selection.5,24 In fact, several lines of evidence support this view. First, eukaryotic proteins have more different domains per protein than prokaryotic ones.11,16,25–27 They also have more different domain combinations than simpler organisms.5,11,16,27,28 This means that individual domain superfamilies are observed with a greater variety of combination partners in higher organisms. However, eukaryotes also have larger genomes and higher rates of retention of duplicated genes, which questions the extent to which these observations are evidence for selection. Furthermore, a detailed analysis of the number of co-occurring domains in human and other eukaryotes showed no change in versatility in terms of co-occurring domains, but just a change in the repertoire.15


Next, domains with “generic” or especially “useful” functions clearly are the most versatile superfamilies in many organisms.5,6 These are, for example, cofactor or co-substrate-binding domains, like the P-loop nucleotide triphosphate hydrolase or the NAD(P)-binding Rossmann domains. Protein–protein interaction domains are also found in multi-domain proteins adjacent to a variety of other domains, and these different combinations are used to regulate distinct aspects of cellular organization.29 These domain superfamilies are, however, also very abundant in genomes,30 thus it is unclear if their great versatility is suggestive of selection or just a consequence of their high abundance. Furthermore, domains that co-occur in proteins are more likely to display similar function19 or localisation31 than domains in separate proteins, and may support selection acting on domain combinations. However, this could be due to a bias in the function classification schemes towards annotation of whole proteins rather than domains.32 Finally, since some highly abundant folds tend to have particular structures,33 it is possible that the three-dimensional structure of a domain may impose constraints on its ability to combine with other domains and hence reduce its versatility. However, to the best of our knowledge no such constraints have been observed. In summary, although there is support for selective forces acting on the recombination of protein domains, there are also observations that suggest a random process. Here, we present a comprehensive, genome-wide study of domain recombination, testing the hypothesis that it is a random process. We use a data-driven approach combined with simulations to show that the domain combinations observed are consistent with stochastic recombination together with differential duplication of domain combinations. Duplication of domain combinations is much more common than invention of new combinations.
Selection at this level has resulted in a few highly duplicated, very abundant domain combinations in genomes, while the bulk of domain combinations are rare, occurring in only a few proteins.

Results and Discussion

The abundance of domain superfamilies correlates with their number of combination partners

In a random scenario, more trials result in more successes. For domains, the domain abundance represents the number of trials for recombination. This means that the more abundant domains would be expected to be “luckier” and have more combination partners than the less abundant domains. This is illustrated in Figure 1(b) where the number of edges pointing to and from a node


Figure 2. Domain versatility and abundance. The abundance and versatility of all superfamilies are plotted on a log–log scale. Each point represents the abundance and versatility of a particular superfamily. The plot shows the data for human, and the distribution is very similar for all other eukaryotes. In bacteria and archaea, the abundance and versatility are much smaller, so a relationship between the two measures is much less obvious. The plotted data are best fitted by a power-law function, $V \sim A^{q}$ with $q_{\mathrm{obs}} = 0.42 \pm 0.04$ ($r^2 = 0.75$), compared to an exponential, linear or logarithmic function.

(connectivity) would be expected to correlate with the size of the node. A random scenario would also imply that the abundance of a domain alone should determine how often it is found with a different combination partner. Thus, we compared abundance and versatility of each superfamily in various genomes, and the data for the human proteome are plotted in Figure 2. We observe that the absolute versatility (V; see Methods and Data Sets for definitions) correlates positively with its absolute abundance (A). This relationship is best approximated by a power-law, where $V \sim A^{q}$, with $q = 0.42$ ($r^2 = 0.75$). The finding of a power-law relationship between abundance and versatility is consistent with the null hypothesis of a random process of domain combination. However, as different domains may differ in their propensity for retention after duplication and recombination, this result does not exclude selection at the level of domain recombination. In order to further investigate the hypothesis that domain recombination is a random process we study the robustness of the observation that versatility correlates with abundance. If domain recombination is indeed a random process, then we expect the relationship between versatility and abundance of any group of protein domains to be constant. In order to investigate the robustness of this relationship, we define the exponent q as the relative versatility of a protein domain, describing the relationship between the absolute versatility of a domain and its absolute abundance. Note that q is identical with the slope of the fitted line in Figure 2. We observe that the relative versatility is roughly the same for each of the bacterial, archaeal or


Figure 3. The relationship between versatility and abundance across different domain properties. Versatility is plotted as a function of abundance similar to Figure 2. In each panel, the superfamily data on 42 different genomes are grouped according to four different criteria: structural class, phylogeny (affiliation to kingdom), domain molecular function and biological process. See Methods and Data Sets for a detailed description of the groupings. Left: each panel shows the lines fitted through the superfamily groupings. Right: a graph displays the slopes q of the fitted lines on the left together with their standard deviations.


eukaryote genomes in our data set, similar to the human data shown in Figure 2.

The versatility relative to abundance is constant

Next, we consider the factors that can affect domain versatility, as discussed in the Introduction. We ask whether distinct groups of domains, defined according to their function, structure and evolutionary history, display different relative versatilities. Such differences would reveal selection, as some combinations would be more duplicated, thus having higher abundances without the corresponding increase in versatility expected from a random scenario. In order to address this hypothesis, we test whether the relationship $V \sim A^{q}$ holds true in particular subsets of domains and whether the exponent q is constant. If the relative rates of domain duplication and combination are dictated by characteristics of their component domains, one would expect that: (i) the power-law relationship does not hold for one or more of the domain groups; or that (ii) the slope $q_x$ for domains of group x is different from the slope $q_y$ for domains of group y. This will tell us whether the relative rate is different for distinct groups of domains. We divide protein domains according to structural classes and folds, phylogenetic affiliation, molecular function and biological process as described in Methods and Data Sets. We examine the relationship between versatility and abundance in each subset (Figure 3). For each grouping, we observe a remarkable consistency of (i) the power-law function being the best fit to the data and (ii) the slope q, the relative versatility across all subgroups (Figure 3). Thus, the relationship between versatility and abundance is very similar in all subsets of domains. This confirms that the versatility of a domain in a particular genome can be predicted relatively accurately purely from its abundance in that genome, independently of other properties of the domain.
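The group-wise slope comparison described above can be sketched as follows. This is a minimal illustration, assuming an ordinary least-squares fit of log V against log A for each group; the function names, the group labels and the toy numbers are hypothetical and are not taken from the analysis itself. Skipping superfamilies with zero versatility (whose logarithm is undefined) is likewise an assumption.

```python
import math


def fit_power_law(abundance, versatility):
    """Least-squares fit of log10(V) = q * log10(A) + c; returns (q, r_squared)."""
    # Assumption: points with A <= 0 or V <= 0 are dropped before taking logs.
    pts = [(math.log10(a), math.log10(v))
           for a, v in zip(abundance, versatility) if a > 0 and v > 0]
    n = len(pts)
    if n < 2:
        raise ValueError("need at least two usable points")
    mean_x = sum(x for x, _ in pts) / n
    mean_y = sum(y for _, y in pts) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in pts)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in pts)
    syy = sum((y - mean_y) ** 2 for _, y in pts)
    if sxx == 0:
        raise ValueError("all abundances are identical; slope is undefined")
    q = sxy / sxx
    r_squared = (sxy * sxy) / (sxx * syy) if syy > 0 else 1.0
    return q, r_squared


def slopes_by_group(records):
    """records: iterable of (group, abundance, versatility), one per superfamily."""
    grouped = {}
    for group, a, v in records:
        xs, ys = grouped.setdefault(group, ([], []))
        xs.append(a)
        ys.append(v)
    return {group: fit_power_law(xs, ys) for group, (xs, ys) in grouped.items()}


if __name__ == "__main__":
    # Hypothetical toy records, only to show the call pattern.
    toy = [("alpha", 1, 1), ("alpha", 4, 2), ("alpha", 16, 4),
           ("beta", 2, 1), ("beta", 8, 2), ("beta", 32, 4)]
    for group, (q, r2) in slopes_by_group(toy).items():
        print(group, round(q, 3), round(r2, 3))
```

Under a random scenario, the fitted exponents for the different groups would be expected to agree within their uncertainties; a group whose slope departs clearly from the others would be a candidate for selection acting on recombination.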
The combination of domains into novel contexts is constant relative to its duplication. With respect to the illustration in Figure 1(b), this means that the connectivity of the node is correlated with its size. Given the vast variety of domain functions, structures and other characteristics considered here, this is a surprising result. It is consistent with the hypothesis that the versatility of protein domains is the result of a stochastic process in which more abundant domains naturally have a higher chance of combining with different partner domains than less abundant ones. In other words, our findings can be explained by a random model of domain shuffling; and the null hypothesis of random recombination cannot be rejected.

S0: a random model of domain combination

The dependence of domain versatility on abundance suggests that new combinations of domains occur in a random or neutral fashion, independently of any other particular circumstances. If this is the case, random shuffling of domains given their distribution of abundances should result in a similar absolute versatility for individual superfamilies and thus a similar relationship between abundance and versatility as that observed. We term this model S0. We performed 10,000 random shuffling experiments where the abundance of each domain and the number of proteins and their size in terms of domains are maintained for each genome considered. The expected maximal versatility we obtained in this random shuffling is also related to abundance by a power-law function (Figure 4, blue), which supports a random model. Highly abundant domains are more versatile, and conversely, less abundant domains are less versatile. However, the expected maximal relative versatility of each domain according to this random scenario ($q_{\mathrm{rand}} = 0.72 \pm 0.01$) is considerably higher than what we observe in the genomes ($q_{\mathrm{obs}} = 0.42 \pm 0.04$). This means that overall the observed relative versatility of protein domains is lower than that expected by simple random shuffling of domains. This represents a marked discrepancy between the model S0 and the observed data (Figure 4). Inspection of this discrepancy reveals that a small proportion of the domain combinations are more frequent than expected by random shuffling of domains, whereas a larger proportion of combinations are under-represented; the random shuffling of domains results in a more even distribution of domain combinations and abundances. This means that in the genomes, many domain superfamilies have fewer combination partners than expected by chance, which is consistent with previously reported results.20 For example, in the human genome, these over-represented combinations correspond to 7% (61/856) of all observed pair-wise combinations and under-represented combinations correspond to 41% (352/856).
One example is the SH3 domain, which is a highly abundant domain (226 copies in the human genome). This domain combines with the SH2 and protein kinase-like domains much more frequently than would be expected given the abundance of the two domain superfamilies: 31 times with an SH2 domain and 27 times with a protein kinase domain given a single expected occurrence of each type. This suggests that the two domain combinations have been duplicated repeatedly, and as a consequence other domain combinations involving the SH3 domain are underrepresented. Thus, when the combination is duplicated, the abundance of each component domain is increased but the versatility is kept constant. This explains why the versatility of protein domains is not at the maximal level possible according to their abundance in a random shuffling model. The discrepancy between the predictions of the S0


Figure 4. Two different random scenarios on how domain combinations were formed. The versatility of domain superfamilies is plotted against their abundance. The simulation procedure is described in the text. The observed data are plotted in black diamonds, the simulated data in blue (S0) or red (S1) crosses. S0, simple random model. S1, random model with duplication of combinations.

model and the observed data is because the model does not take into account the duplication of domain combinations.
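The S0 shuffling procedure can be sketched in a few lines of Python. This is a hedged illustration under stated assumptions rather than the simulation actually used: a proteome is represented as a list of proteins, each a list of superfamily names from N to C terminus; all function names and the toy proteome are hypothetical; and the sketch reports the mean versatility over runs, whereas the text does not specify how the "expected maximal versatility" was aggregated.

```python
import random
from collections import Counter, defaultdict


def adjacent_versatility(proteome):
    """Distinct N- plus C-terminal neighbours per superfamily (tandem pairs skipped)."""
    n_side, c_side = defaultdict(set), defaultdict(set)
    for protein in proteome:
        for left, right in zip(protein, protein[1:]):
            if left != right:
                c_side[left].add(right)   # right is a C-terminal partner of left
                n_side[right].add(left)   # left is an N-terminal partner of right
    families = {sf for protein in proteome for sf in protein}
    return {sf: len(n_side[sf]) + len(c_side[sf]) for sf in families}


def shuffle_s0(proteome, rng):
    """Redistribute all domains at random.

    The number of proteins, the size of each protein in domains and the
    abundance of each superfamily are all preserved.
    """
    pool = [sf for protein in proteome for sf in protein]
    rng.shuffle(pool)
    shuffled, start = [], 0
    for protein in proteome:
        shuffled.append(pool[start:start + len(protein)])
        start += len(protein)
    return shuffled


def mean_versatility_s0(proteome, n_runs=1000, seed=0):
    """Average versatility per superfamily over n_runs random shufflings."""
    rng = random.Random(seed)
    totals = Counter()
    for _ in range(n_runs):
        totals.update(adjacent_versatility(shuffle_s0(proteome, rng)))
    families = {sf for protein in proteome for sf in protein}
    return {sf: totals[sf] / n_runs for sf in families}


if __name__ == "__main__":
    # Hypothetical toy proteome, only to show the call pattern.
    toy = [["SH3", "SH2", "kinase"], ["SH3", "PDZ"], ["SH3"], ["SH2", "kinase"]]
    print("observed:", adjacent_versatility(toy))
    print("S0 mean: ", mean_versatility_s0(toy, n_runs=200))
```

Because the pool is shuffled as a whole and then cut back into proteins of the original sizes, nothing in this sketch duplicates an existing combination, which is exactly the ingredient that S0 lacks.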

S1: a random model of domain combination with duplication of combinations

Given the observation that domain combinations in genomes are often more duplicated than simulated combinations in the random model S0, we implemented a different model, which we termed S1. In this model we explicitly consider the duplication of domain combinations. As in model S0, the domain abundance, number and size of proteins are maintained. Thus, we examine recombination given selection processes that may have produced the observed number of domains per superfamily. All domains are randomly shuffled. As soon as one combination is formed, we regard this combination as fixed and it is duplicated until all the instances of at least one of the component domains are used up. This reduces the abundance of the component domains that are available for further re-shuffling. The recombination and subsequent duplication is repeated until all domains are placed into combinations. This simulation was repeated 10,000 times. The relationship between the average versatility and abundance predicted by this model is shown in Figure 4. It is a substantially better approximation of the real data than model S0. Model S1 reproduces the observed data four times better than model S0: model S1 correctly predicts the number of different combination partners for 39% of all superfamilies in human, whereas model S0 predicts only 10% (within 20% error). A correct prediction of domain versatility is most meaningful for large domain superfamilies, and this is where model S1 improves enormously compared to model S0: for superfamilies that occur in ten or more proteins, model S1 predicts 54% correctly, and for superfamilies that occur in 30 or more proteins, model S1 even predicts two-thirds correctly (66%). Model S0 correctly predicts only 8% and 3% of the abundant superfamilies, respectively (Figure 4). In contrast to model S0, model S1 is best approximated by a linear relationship between domain versatility and abundance ($r^2 = 0.991$) but is also well approximated by a power-law ($r^2 = 0.948$). Since we lack data especially for domains of high abundance and versatility, it remains inconclusive when the approximations from model S1 would break down. Given that the model S1, emphasizing duplication of combinations, is much more accurate than the first random model, we conclude that duplication and retention of domain combinations due to selection plays an important role in the evolution of the protein repertoire. We show that recombination and duplication of domain combinations can be approximated by our stochastic simulation, and this suggests that selection favours duplicates of an existing combination much more often than novel combinations. This means that it is easier or more likely that a functional protein retained by selection evolves by sequence divergence and mutation of an existing combination, rather than combining existing domains in a new way.

Prevalent domain combinations

Model S1 is consistent with the observed data, predicting versatilities that are lower than S0, and approximating better the overall relationship between versatility and abundance. It also predicts that for a given domain, a few combinations will be highly duplicated whereas most combinations will be infrequent. To what extent do we observe this prediction in the protein repertoire? In order to answer this question, we analysed the abundance distribution of combinations of particular superfamilies: we wanted to know whether for each particular superfamily, there are one or a few prevailing combinations. The abundance distribution for the pair-wise combinations of large superfamilies in the human genome is shown in Figure 5. For example, the SH3 domain occurs 31 times in combination with the SH2 domain, as mentioned above. The next most abundant combination of the SH3 domain is with the PDZ peptide-binding domain and the P-loop nucleotide triphosphate hydrolase domains: each combination occurs 12 times in the human genome. The other 24 combinations of the SH3 domain occur much less frequently. The distribution in Figure 5 is in accordance with our prediction that large superfamilies have only one or few combinations with many duplicates. These results and the striking fact that model S1 accounts for about two-thirds of the large superfamilies support the possibility that the mechanisms simulated in this model are an approximation


Figure 5. Duplication of domain combinations. The Figure shows for eight large domain superfamilies in the human genome the abundance of their combinations. The different colours correspond to different superfamilies, and each entry on the x-axis corresponds to a combination of that superfamily with another domain. The y-axis denotes how often the particular combination is observed in the human genome. We omit tandem combinations where the two component domains are the same and combinations with unknown domains. The plot is restricted to the 13 most abundant combinations of each domain superfamily.

of the evolution of protein repertoires. We do not claim that the model resolves the chronological or evolutionary sequence of events, but simply that the concerted action of random re-shuffling and selection and duplication of specific combinations can have formed the protein repertoire. Furthermore, while model S1 clearly does not reproduce the exact versatility of each individual superfamily, we demonstrate how the observed relationship can result from an underlying stochastic process of evolution, involving random domain shuffling. The model also includes selection at the level of duplication of domain combinations: a few combinations are highly duplicated, while most have been duplicated and retained to a much lesser extent. As is obvious from Figure 4, the observed domain combinations have a broad distribution of abundance and versatility, suggesting that nature employs some combination of model S0, S1 and other processes.
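The core step of model S1 (form a combination once, then duplicate it until one component superfamily is used up) can be sketched as below. This is a deliberately simplified illustration, not the simulation reported here: it assumes pair-wise combinations only, draws partners in proportion to their remaining abundance, and does not maintain the number or size of proteins as the full model does. The function names and the toy abundances are hypothetical.

```python
import random
from collections import Counter, defaultdict


def simulate_s1_pairs(abundance, rng):
    """One simplified S1 run on a {superfamily: abundance} mapping.

    Returns (combos, leftover): combos maps (N-terminal, C-terminal) pairs to
    their number of copies; leftover holds domains that found no partner.
    """
    remaining = Counter({sf: n for sf, n in abundance.items() if n > 0})
    combos = Counter()
    while len(remaining) >= 2:
        names = sorted(remaining)
        # Assumption: partners are drawn in proportion to remaining abundance.
        first = rng.choices(names, weights=[remaining[sf] for sf in names])[0]
        others = [sf for sf in names if sf != first]
        second = rng.choices(others, weights=[remaining[sf] for sf in others])[0]
        # The new combination is fixed and duplicated until one component runs out.
        copies = min(remaining[first], remaining[second])
        combos[(first, second)] += copies
        for sf in (first, second):
            remaining[sf] -= copies
            if remaining[sf] == 0:
                del remaining[sf]
    return combos, dict(remaining)


def versatility_from_combos(combos):
    """Distinct N- plus C-terminal partners per superfamily."""
    partners = defaultdict(set)
    for n_term, c_term in combos:
        partners[n_term].add(("C", c_term))
        partners[c_term].add(("N", n_term))
    return {sf: len(p) for sf, p in partners.items()}


if __name__ == "__main__":
    # Hypothetical abundances, only to show the call pattern.
    toy = {"SH3": 5, "SH2": 3, "kinase": 2, "PDZ": 1}
    combos, leftover = simulate_s1_pairs(toy, random.Random(0))
    print(dict(combos), leftover, versatility_from_combos(combos))
```

In this sketch every combination is created exactly once, because at least one of its components is exhausted as soon as it is formed; a few combinations therefore absorb many copies while the rest remain rare, which is the qualitative behaviour the model is meant to capture.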

Conclusions

The interplay of neutral processes and selection has been debated for a long time.34,35 Upon gene duplication, the redundant copy is under selection, and the “usefulness” and ability to subfunctionalise, i.e. modify function, is thought to increase a duplicate’s chance of retention.36,37 In order to modify function, the duplicate diverges, for example, with respect to the sequence or the spatial and temporal expression.38,39 Another mode of divergence that can change protein function18 has been discussed here: the recombination with other protein domains. We described an analysis of the general processes that may have governed the recombination of protein domains. We compared the known domain superfamilies and their arrangements in proteins in a variety of genomes to stochastic models of domain

recombination and duplication. We defined domain versatility as the number of domain superfamilies that a particular domain can combine with, and we observed a robust correlation between domain abundance and versatility by a power-law. This correlation is independent of domain function, structure or evolutionary history. It suggests that, similar to transcriptome evolution,40,41 domain reshuffling represents a form of random protein divergence. Despite this evidence for random evolution of domain combinations, the observed domain versatility in genomes is very different to that resulting from a simple random model (S0). This is due to highly duplicated domain combinations, and many of these qualify as supra-domains, which are evolutionary units larger than a single domain and which recur in many different proteins.24,28 One such supra-domain is the combination of the P-loop nucleotide triphosphate hydrolase domain with the translation proteins domain.28 Proteins with this domain combination hydrolyse GTP and interact with the ribosome, and these are functions that are needed by all translation factors. Thus, the combination recurs in 23 proteins in human. When incorporating the duplication of particular domain combinations into a more elaborate model, we were able to approximate the observed versatility fairly accurately. Thus, selection appears to occur at the level of retaining particular domains and domain combinations after duplication, while the footprint of selection is less clear at the level of domain recombination. The emphasis on duplication of domain combinations supports the view that most combinations formed once during evolution, and all examples of a combination are related duplicates rather than emerging from independent recombination events. This is consistent with: (i) the prevalence of most combinations in only one N to C-terminal order,5

despite no obvious structural or functional constraints on the sequential order; (ii) structural evidence of an evolutionary relationship between all instances of the same domain combination;42 and (iii) the lack of functional conservation if the domain order is modified.42,43 Such a view implies that duplication of combinations is more parsimonious than several independent recombination events, and predicts the prevalence of one or two combination partners for a particular domain superfamily in a genome. We observe such a distribution of combination partners in Figure 5. Usually, modification of an existing domain occurs by incremental mutation of the coding sequence. The recombination with a novel partner domain, however, is a different and more drastic type of modification as it can require the evolution of a new domain interface, for example. Once the domain combination is established as a functional unit, it is “easier” for nature to retain duplicates of this combination and re-use them in different proteins.

Methods and Data Sets

Genomic and domain assignment data

The 14 eukaryote, 14 bacterial and 14 archaeal genomes used in our analysis are listed in Table 1. The gene predictions and domain assignments to the gene predictions were taken from the SUPERFAMILY database version 1.63.30 The domain superfamilies are defined and classified in the SCOP database.3 All domains within a SCOP superfamily are related and can be regarded as descendants from one common ancestral domain. All eukaryotic genomes, except for the two yeasts and the two plants, were made non-redundant with respect to predicted splice variants: for each gene only the longest transcript was included.
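The splice-variant filter described above amounts to keeping one transcript per gene. A minimal sketch, assuming transcripts are available as (gene identifier, transcript identifier, protein sequence) tuples; this record layout, the function name and the tie-breaking rule (first transcript seen wins) are illustrative assumptions, not details given in the text.

```python
def longest_transcript_per_gene(transcripts):
    """Keep only the longest predicted protein for each gene.

    transcripts: iterable of (gene_id, transcript_id, protein_sequence).
    Returns {gene_id: (transcript_id, protein_sequence)}.
    """
    best = {}
    for gene_id, transcript_id, sequence in transcripts:
        current = best.get(gene_id)
        # Assumption: on equal length, the transcript seen first is kept.
        if current is None or len(sequence) > len(current[1]):
            best[gene_id] = (transcript_id, sequence)
    return best


if __name__ == "__main__":
    # Hypothetical identifiers and sequences, only to show the call pattern.
    records = [("geneA", "tx1", "MKV"), ("geneA", "tx2", "MKVLA"), ("geneB", "tx3", "MA")]
    print(longest_transcript_per_gene(records))
```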

Table 1. The genomes used in this analysis

Name | Kingdom (a) | Total number of predicted proteins (b)
Anopheles gambiae release 19.2a | E | 16,148
Arabidopsis thaliana | E | 28,677
Aspergillus nidulans release 1 r3.1 | E | 9541
Caenorhabditis briggsae release Aug03 | E | 19,507
Caenorhabditis elegans release 19.102 | E | 21,631
Danio rerio release 19.2 | E | 26,587
Drosophila melanogaster release 3.1 | E | 18,484
Fugu rubripes release 19.2 | E | 37,245
Homo sapiens release 19.34a | E | 29,802
Mus musculus release 19.30 | E | 32,911
Neurospora crassa release 3 | E | 10,082
Oryza sativa ssp. japonica Nipponbare | E | 56,056
Saccharomyces cerevisiae | E | 6703
Schizosaccharomyces pombe | E | 4993
Bacillus subtilis ssp. subtilis 168 | B | 4112
Bifidobacterium longum NCC2705 | B | 1727
Buchnera aphidicola Bp | B | 504
Chlamydophila pneumoniae J138 | B | 1069
Escherichia coli K12 | B | 4311
Helicobacter pylori 26695 | B | 1576
Mycoplasma genitalium | B | 484
Prochlorococcus marinus ssp. marinus CCMP1375 | B | 1882
Salmonella typhimurium LT2 | B | 4451
Streptomyces coelicolor A3(2) | B | 7769
Thermotoga maritima | B | 1858
Tropheryma whipplei Twist | B | 808
Vibrio cholerae | B | 3835
Wigglesworthia glossinidia | B | 611
Aeropyrum pernix | A | 1841
Archaeoglobus fulgidus DSM 4304 | A | 2420
Halobacterium sp. NRC-1 | A | 2075
Methanobacterium thermoautotrophicum | A | 1869
Methanopyrus kandleri AV19 | A | 1687
Methanosarcina acetivorans C2A | A | 4540
Methanothermobacter thermautotrophicus Delta H | A | 1873
Nanoarchaeum equitans Kin4-M | A | 563
Pyrococcus abyssi | A | 1896
Pyrococcus furiosus DSM 3638 | A | 2125
Sulfolobus solfataricus | A | 2977
Sulfolobus tokodaii | A | 2826
Thermoplasma acidophilum | A | 1482
Thermoplasma volcanium | A | 1499

All assignment data and genome information were taken from SUPERFAMILY version 1.63.30 On average, about 45% of the sequences had at least one domain predicted.
(a) A, archaea; B, bacteria; E, eukaryotes.
(b) Eukaryote genomes were made non-redundant in terms of splice variants.


Domain properties

Abundance and versatility
The duplication or abundance of a domain superfamily in each genome was measured as the number of domains belonging to the particular superfamily (Figure 1(c)). The recombination or absolute versatility of a particular superfamily was measured as the sum of different N and C-terminal domains that are immediately adjacent to the domain (Figure 1(d)). Tandem duplications of the same domain within one protein are ignored. The power-law function was fitted using regression. The minimal versatility of a domain, and thus the intercept of the graph, is known (no combination partners), so it seemed meaningful to restrict the analysis to the slope or exponent q.

Alternative definitions of abundance and versatility
We tested our results on alternative definitions of abundance and versatility, and found the same general relationships. When we considered the number of proteins instead of the number of domains as a measure of abundance, or when we considered the number of co-occurring domains instead of immediately adjacent domains as a measure of versatility, the exponent q increased slightly. In both cases, the versatility relative to the abundance is slightly higher, but their power-law relationship is the same. The same holds true if we do account for tandem duplications of domains within one protein.

Structural classification
Our analysis is based on the four main structural classes in the SCOP database: alpha, beta, alpha/beta and alpha+beta [2]. Each class contains several folds, and each fold contains several superfamilies. For the hierarchical classification of protein domains, please refer to the SCOP database [2,3].

Annotation of domain function and process

Manual procedure
All domain superfamilies were manually annotated with respect to their dominant molecular function, i.e. the function within the protein, and their dominant biological process, i.e. their role in the cell. Using a manual scheme, we ensured a high-quality annotation and a domain-centred annotation instead of the widely used whole-gene or whole-protein centred annotation. The different schemes are described below.

Scheme for biological processes of domains. For an analysis of the biological process in which domain superfamilies act, we annotated all superfamilies with respect to their affiliation to one of 43 different process categories (Table 2). Each of the 43 process categories was then mapped onto one of eight process classes, which are defined as follows.
(i) Information: storage, maintenance of the genetic code.
(ii) Regulation: regulation of gene expression, protein activity; information processing in response to environmental input.
(iii) Metabolism: all anabolic and catabolic processes; cell maintenance/homeostasis.
(iv) Energy: energy production/storage.
(v) Processes_IC: intra-cellular processes; cell motility/division; transport.
(vi) Processes_EC: inter or extra-cellular processes; organismal processes (such as blood clotting).
(vii) General: general enzymatic reactions and multiple functions.
(viii) Unknown/Other: unknown or unclassifiable superfamilies, viral functions.
The functional distribution of all SCOP superfamilies and those in the respective genomes is available upon request.

Scheme for molecular functions of domains. All domain superfamilies were assigned to one of 27 functional categories. Each category maps to one of the eight rough functional classes (Table 3). The annotation was based on information in SCOP [3] and the literature.

Table 2. Classification scheme for biological processes of domains

Biological process class | Biological process category
Information | Chromatin structure and dynamics; translation, ribosomes, ribosome biogenesis; tRNA metabolism; transcription, DNA replication, recombination, repair; RNA processing, dynamics; nuclear structure
Regulation | RNA processing and modification; DNA-binding (transcription factors); kinases and phosphatases and inhibitors; signal transduction
Metabolism | Electron transfer/transport; amino acid transport and metabolism; nitrogen metabolism; nucleotide transport and metabolism; carbohydrate transport and metabolism; polysaccharide metabolism; lipid/polysaccharide storage; coenzyme metabolism; lipid transport and metabolism; cell envelope biogenesis, outer membrane; secondary metabolites biosynthesis, transport and catabolism; oxidation/reduction; transferases; other enzymes
Energy | Energy production and conversion; photosynthesis
Intra-cellular processes, Processes_IC | Cell division and chromosome partitioning, cell cycle; phospholipid metabolism; cell motility, cytoskeleton; intracellular trafficking and secretion; post-translational modification, protein turnover, chaperones; proteases, peptidases and their inhibitors; inorganic ion transport and metabolism; transport
Inter-cellular processes, Processes_EC | Cell adhesion; immune response; blood clotting; toxins and defense enzymes
General | Small molecule binding; general or several functions; protein–protein interaction
Unknown/Other | Function unknown; viral proteins

Table 3. Classification scheme for the molecular functions of domains

Functional class | Functional category
Enzyme/related function | Catalytic; substrate-binding; regulatory only; other enzyme function
Cofactor binding | Cofactor-binding; ion-binding; ion-cluster-binding; other molecule-binding function
Nucleic acid binding | DNA-binding; RNA-binding; other nucleic acid binding function
Protein binding | Permanent protein–protein interaction; transient protein–protein interaction; part of complex; other protein binding function
Lipid binding | Lipid binding; other lipid binding function
Carbohydrate binding | Carbohydrate binding; other carbohydrate binding function
Transport | Channel protein; electron transfer; other transport function
Other function | Structural role; general/multiple functions; viral proteins; toxin; other/ambiguous/unknown function
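The abundance and versatility measures defined under "Abundance and versatility", and the fit of the exponent q, can be illustrated with a short sketch. Several points here are assumptions rather than the published procedure: proteins are represented as lists of superfamily identifiers in N-to-C order (a hypothetical input format); tandem repeats are collapsed before counting; versatility counts distinct adjacent superfamilies regardless of side (counting N-terminal and C-terminal neighbours separately is another possible reading); and the exponent is taken as the ordinary least-squares slope in log-log space, since the text only says "regression".

```python
import math
from collections import Counter, defaultdict


def collapse_tandem_repeats(architecture):
    """Drop consecutive repeats of the same superfamily (A-A-B becomes A-B)."""
    collapsed = []
    for superfamily in architecture:
        if not collapsed or collapsed[-1] != superfamily:
            collapsed.append(superfamily)
    return collapsed


def abundance_and_versatility(architectures):
    """Return (abundance, versatility) dicts keyed by superfamily.

    abundance:   number of domains of the superfamily (tandem repeats collapsed)
    versatility: number of distinct superfamilies found immediately adjacent
    """
    abundance = Counter()
    neighbours = defaultdict(set)
    for architecture in architectures:
        domains = collapse_tandem_repeats(architecture)
        abundance.update(domains)
        for left, right in zip(domains, domains[1:]):
            neighbours[left].add(right)
            neighbours[right].add(left)
    versatility = {sf: len(neighbours[sf]) for sf in abundance}
    return dict(abundance), versatility


def fit_exponent(abundance, versatility):
    """Least-squares slope q of log(versatility) against log(abundance).

    Superfamilies with no combination partner (versatility 0) are skipped,
    because their logarithm is undefined; this handling is an assumption.
    """
    points = [(math.log(abundance[sf]), math.log(versatility[sf]))
              for sf in abundance if versatility.get(sf, 0) > 0]
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in points)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    if sxx == 0:
        raise ValueError("need at least two distinct abundance values")
    return sxy / sxx


# Toy genome with four proteins and hypothetical superfamily labels.
toy = [["a", "b"], ["a", "a", "c"], ["b", "a"], ["d"]]
toy_abundance, toy_versatility = abundance_and_versatility(toy)
```

In the toy genome, superfamily "a" occurs three times and has two distinct partners, while "d" occurs once and has none. Only the slope is returned, in line with the decision above to restrict the analysis to the exponent q.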

Controls for our functional classification

Automated domain annotation
As a control, we used the automated annotation of GO process, function and location [44] to Pfam [45] domains in InterPro [46]. Pfam domains were mapped onto SCOP domain superfamilies based on sequence similarity. This provided annotation for 647, 667 and 343 domain superfamilies, respectively. The manual domain annotation was largely consistent with the Gene Ontology annotation for Pfam domains and their mappings to the domains described in SUPERFAMILY [30].

Random annotation of function
As a further control, all domain superfamilies were randomly assigned to 15 categories, and the relationship between versatility and abundance was tested as described above (data not shown). The relationship between abundance and versatility, and the slope q, were the same for all subsets.

Random models
For a simulation of the versatility expected from random shuffling, we considered each genome separately. The abundance of a particular domain superfamily and the number of proteins and their lengths in terms of domains were kept constant. The resulting versatility was recorded for each superfamily, and the average was taken after 10,000 repeats of the simulation.

S0. All domains were considered at their observed abundance. The domains were then picked and assigned to random positions in the proteins until each position was filled.

S1. All domains were considered at their observed abundance. Two different domain superfamilies were chosen at random to combine with each other. The combination was then duplicated until one of the component domains was "used up". The process was repeated taking the new abundances for both superfamilies into account. Eventually, all domains were assigned to combinations.
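The two random models can be sketched as below. This is an illustration under stated assumptions, not the code used for the analysis: proteins are lists of superfamily identifiers (a hypothetical format); in S1 the two superfamilies are drawn uniformly from those with domains remaining, combinations are treated as unordered pairs, and domains of a last unpaired superfamily are simply reported as leftover; how the pairs are placed back into proteins of fixed length is not modelled. All function names are hypothetical.

```python
import random
from collections import Counter, defaultdict


def versatility(architectures):
    """Number of distinct adjacent partner superfamilies per superfamily."""
    partners = defaultdict(set)
    for architecture in architectures:
        for superfamily in architecture:
            _ = partners[superfamily]  # register domains that have no partner
        for left, right in zip(architecture, architecture[1:]):
            if left != right:  # tandem repeats are ignored
                partners[left].add(right)
                partners[right].add(left)
    return {sf: len(p) for sf, p in partners.items()}


def simulate_s0(architectures, rng):
    """S0: shuffle all domains across all positions, keeping protein lengths."""
    pool = [sf for architecture in architectures for sf in architecture]
    rng.shuffle(pool)
    shuffled, start = [], 0
    for architecture in architectures:
        shuffled.append(pool[start:start + len(architecture)])
        start += len(architecture)
    return shuffled


def simulate_s1(abundance, rng):
    """S1: pick two superfamilies, duplicate the pair until one is used up.

    Returns (combinations, leftover): copies per unordered pair, and the
    domains of a final superfamily that could not be paired (an assumption).
    """
    remaining = Counter(abundance)
    combinations = {}
    while True:
        available = sorted(sf for sf, n in remaining.items() if n > 0)
        if len(available) < 2:
            break
        first, second = rng.sample(available, 2)
        copies = min(remaining[first], remaining[second])
        combinations[tuple(sorted((first, second)))] = copies
        remaining[first] -= copies
        remaining[second] -= copies
    leftover = {sf: n for sf, n in remaining.items() if n > 0}
    return combinations, leftover


def mean_s0_versatility(architectures, repeats=10000, seed=0):
    """Average S0 versatility per superfamily over `repeats` shuffles."""
    rng = random.Random(seed)
    totals = Counter()
    for _ in range(repeats):
        for sf, value in versatility(simulate_s0(architectures, rng)).items():
            totals[sf] += value
    return {sf: totals[sf] / repeats for sf in totals}
```

Because each S1 step exhausts at least one of the two chosen superfamilies, every pair can be formed only once and is then duplicated, which mirrors the "formed once, then duplicated" scenario discussed in the paper. The per-superfamily averages from `mean_s0_versatility` can be compared with the observed versatility.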

Acknowledgements We are grateful to Cyrus Chothia, M. Madan Babu, Shinhan Shiu and Christine Orengo for helpful discussion. C.V. acknowledges financial support by the Boehringer Ingelheim Fonds.

References

1. Rossmann, M. G., Moras, D. & Olsen, K. W. (1974). Chemical and biological evolution of nucleotide-binding protein. Nature, 250, 194–199.
2. Murzin, A. G., Brenner, S. E., Hubbard, T. & Chothia, C. (1995). SCOP: a structural classification of proteins database for the investigation of sequences and structures. J. Mol. Biol. 247, 536–540.
3. Andreeva, A., Howorth, D., Brenner, S. E., Hubbard, T. J., Chothia, C., Murzin, A. G. et al. (2004). SCOP database in 2004: refinements integrate structure and sequence family data. Nucl. Acids Res. 32, D226–D229.
4. Doolittle, R. F. (1995). The multiplicity of domains in proteins. Annu. Rev. Biochem. 64, 287–314.
5. Apic, G., Gough, J. & Teichmann, S. A. (2001). Domain combinations in archaeal, eubacterial and eukaryotic proteomes. J. Mol. Biol. 310, 311–325.
6. Wuchty, S. (2001). Scale-free behavior in protein domain networks. Mol. Biol. Evol. 18, 1694–1702.
7. Patthy, L. (2003). Modular assembly of genes and the evolution of new functions. Genetica, 118, 217–231.
8. Sinha, S., van Nimwegen, E. & Siggia, E. D. (2003). A probabilistic method to detect regulatory modules. Bioinformatics, 19, i292–i301.
9. Chervitz, S. A., Aravind, L., Sherlock, G., Ball, C. A., Koonin, E. V., Dwight, S. S. et al. (1998). Comparison of the complete protein sets of worm and yeast: orthology and divergence. Science, 282, 2022–2028.
10. Rubin, G. M., Yandell, M. D., Wortman, J. R., Gabor Miklos, G. L., Nelson, C. R., Hariharan, I. K. et al. (2000). Comparative genomics of the eukaryotes. Science, 287, 2204–2215.
11. Aravind, L. & Subramanian, G. (1999). Origin of multicellular eukaryotes - insights from proteome comparisons. Curr. Opin. Genet. Dev. 9, 688–694.
12. Lespinet, O., Wolf, Y. I., Koonin, E. V. & Aravind, L. (2002). The role of lineage-specific gene family expansion in the evolution of eukaryotes. Genome Res. 12, 1048–1059.
13. Ranea, J. A., Buchan, D. W., Thornton, J. M. & Orengo, C. A. (2004). Evolution of protein superfamilies and bacterial genome size. J. Mol. Biol. 336, 871–887.
14. Vogel, C., Teichmann, S. A. & Chothia, C. (2003). The immunoglobulin superfamily in Drosophila melanogaster and Caenorhabditis elegans and the evolution of complexity. Development, 130, 6317–6328.
15. Muller, A., MacCallum, R. M. & Sternberg, M. J. (2002). Structural characterization of the human proteome. Genome Res. 12, 1625–1641.
16. Lander, E. S., Linton, L. M., Birren, B., Nusbaum, C., Zody, M. C., Baldwin, J. et al. (2001). Initial sequencing and analysis of the human genome. Nature, 409, 860–921.
17. Teichmann, S. A., Park, J. & Chothia, C. (1998). Structural assignments to the Mycoplasma genitalium proteins show extensive gene duplications and domain rearrangements. Proc. Natl Acad. Sci. USA, 95, 14658–14663.
18. Todd, A. E., Orengo, C. A. & Thornton, J. M. (2001). Evolution of function in protein superfamilies, from a structural perspective. J. Mol. Biol. 307, 1113–1143.
19. Ye, Y. & Godzik, A. (2004). Comparative analysis of protein domain organization. Genome Res. 14, 343–353.
20. Apic, G., Huber, W. & Teichmann, S. A. (2003). Multi-domain protein families and domain pairs: comparison with known structures and a random model of domain recombination. J. Struct. Funct. Genomics, 4, 67–78.
21. Koonin, E. V., Wolf, Y. I. & Karev, G. P. (2002). The structure of the protein universe and genome evolution. Nature, 420, 218–223.
22. Barabasi, A. L. & Albert, R. (1999). Emergence of scaling in random networks. Science, 286, 509–512.
23. Qian, J., Luscombe, N. M. & Gerstein, M. (2001). Protein family and fold occurrence in genomes: power-law behaviour and evolutionary model. J. Mol. Biol. 313, 673–681.
24. Chothia, C., Gough, J., Vogel, C. & Teichmann, S. A. (2003). Evolution of the protein repertoire. Science, 300, 1701–1703.
25. Koonin, E. V., Aravind, L. & Kondrashov, A. S. (2000). The impact of comparative genomics on our understanding of evolution. Cell, 101, 573–576.
26. Gerstein, M. (1998). How representative are the known structures of the proteins in a complete genome? A comprehensive structural census. Fold. Des. 3, 497–512.
27. Waterston, R. H., Lindblad-Toh, K., Birney, E., Rogers, J., Abril, J. F., Agarwal, P. et al. (2002). Initial sequencing and comparative analysis of the mouse genome. Nature, 420, 520–562.
28. Vogel, C., Berzuini, C., Bashton, M., Gough, J. & Teichmann, S. A. (2004). Supra-domains: evolutionary units larger than single protein domains. J. Mol. Biol. 336, 809–823.
29. Pawson, T. & Nash, P. (2003). Assembly of cell regulatory systems through protein interaction domains. Science, 300, 445–452.
30. Madera, M., Vogel, C., Kummerfeld, S. K., Chothia, C. & Gough, J. (2004). The SUPERFAMILY database in 2004: additions and improvements. Nucl. Acids Res. 32, D235–D239.
31. Mott, R., Schultz, J., Bork, P. & Ponting, C. P. (2002). Predicting protein cellular localization using a domain projection method. Genome Res. 12, 1168–1174.
32. Ouzounis, C. A., Coulson, R. M., Enright, A. J., Kunin, V. & Pereira-Leal, J. B. (2003). Classification schemes for protein structure and function. Nature Rev. Genet. 4, 508–519.
33. Li, H., Helling, R., Tang, C. & Wingreen, N. (1996). Emergence of preferred structures in a simple model of protein folding. Science, 273, 666–669.
34. Kimura, M. (1983). The Neutral Theory of Molecular Evolution, Cambridge University Press, Cambridge, UK.
35. Darwin, C. (1859). The Origin of Species by Means of Natural Selection, or the Preservation of Favoured Races in the Struggle for Life, John Murray, London.
36. Lynch, M. & Conery, J. S. (2000). The evolutionary fate and consequences of duplicate genes. Science, 290, 1151–1155.
37. Wolfe, K. H. (2001). Yesterday's polyploids and the mystery of diploidization. Nature Rev. Genet. 2, 333–341.
38. Gu, Z., Rifkin, S. A., White, K. P. & Li, W. H. (2004). Duplicate genes increase gene expression diversity within and between species. Nature Genet. 36, 577–579.
39. Wagner, A. (2000). Decoupled evolution of coding region and mRNA expression patterns after gene duplication: implications for the neutralist–selectionist debate. Proc. Natl Acad. Sci. USA, 97, 6579–6584.
40. Khaitovich, P., Weiss, G., Lachmann, M., Hellmann, I., Enard, W., Muetzel, B. et al. (2004). A neutral model of transcriptome evolution. PLoS Biol. 2, E132.
41. van Noort, V., Snel, B. & Huynen, M. A. (2004). The yeast coexpression network has a small-world, scale-free architecture and can be explained by a simple model. EMBO Rep. 5, 280–284.
42. Bashton, M. & Chothia, C. (2002). The geometry of domain combination in proteins. J. Mol. Biol. 315, 927–939.
43. Hegyi, H. & Gerstein, M. (2001). Annotation transfer for genomics: measuring functional divergence in multi-domain proteins. Genome Res. 11, 1632–1640.
44. Harris, M. A., Clark, J., Ireland, A., Lomax, J., Ashburner, M., Foulger, R. et al. (2004). The gene ontology (GO) database and informatics resource. Nucl. Acids Res. 32, D258–D261.
45. Bateman, A., Coin, L., Durbin, R., Finn, R. D., Hollich, V., Griffiths-Jones, S. et al. (2004). The Pfam protein families database. Nucl. Acids Res. 32, D138–D141.
46. Mulder, N. J., Apweiler, R., Attwood, T. K., Bairoch, A., Barrell, D., Bateman, A. et al. (2003). The InterPro Database, 2003 brings increased coverage and new features. Nucl. Acids Res. 31, 315–318.

Edited by G. von Heijne (Received 23 September 2004; received in revised form 18 November 2004; accepted 19 November 2004)