Genetica DOI 10.1007/s10709-006-0035-0
ORIGINAL PAPER
Functional and evolutionary inference in gene networks: does topology matter?
Mark L. Siegal · Daniel E. L. Promislow · Aviv Bergman
Received: 25 June 2004 / Accepted: 20 June 2005 Springer Science+Business Media B.V. 2006
Abstract The relationship between the topology of a biological network and its functional or evolutionary properties has attracted much recent interest. It has been suggested that most, if not all, biological networks are ‘scale free.’ That is, their connections follow power-law distributions, such that there are very few nodes with very many connections and vice versa. The number of target genes of known transcriptional regulators in the yeast, Saccharomyces cerevisiae, appears to follow such a distribution, as do other networks, such as the yeast network of protein–protein interactions. These findings have inspired attempts to draw biological inferences from general properties associated with scale-free network topology. One often cited general property is that, when compromised, highly connected nodes will tend to have a larger effect on network function than sparsely connected nodes. For example, more highly connected proteins are more likely to be lethal when knocked out. However, the correlation between lethality and connectivity is relatively weak, and some highly connected proteins can be removed without noticeable phenotypic effect. Similarly, network topology only weakly predicts the response of gene expression to environmental perturbations. Evolutionary simulations of gene-regulatory networks, presented here, suggest that such weak or non-existent correlations are to be expected, and are likely not due to inadequacy of experimental data. We argue that ‘top-down’ inferences of biological properties based on simple measures of network topology are of limited utility, and we present simulation results suggesting that much more detailed information about a gene’s location in a regulatory network, as well as dynamic gene-expression data, are needed to make more meaningful functional and evolutionary predictions. Specifically, we find in our simulations that: (1) the relationship between a gene’s connectivity and its fitness effect upon knockout depends on its equilibrium expression level; (2) correlation between connectivity and genetic variation is virtually non-existent, yet upon independent evolution of networks with identical topologies, some nodes exhibit consistently low or high polymorphism; and (3) certain genes show low polymorphism yet high divergence among independent evolutionary runs. This latter pattern is generally taken as a signature of positive selection, but in our simulations its cause is often neutral coevolution of regulatory inputs to the same gene.
Keywords Evolutionary systems biology · Gene network · Developmental systems drift
M. L. Siegal (✉) Department of Biology, New York University, 100 Washington Square East, New York, NY 10003, USA e-mail: [email protected]
D. E. L. Promislow Department of Genetics, University of Georgia, Athens, GA 30602, USA
A. Bergman Departments of Pathology and Molecular Genetics, Albert Einstein College of Medicine, 1300 Morris Park Avenue, Bronx, NY 10461, USA
Introduction
Biologists have persistently struggled with the enormous complexity of living organisms. The emerging field of Systems Biology represents an effort to
embrace this complexity and to study it on its own terms, and therefore marks a dramatic departure from the long successful and still dominant reductionist program that seeks to understand complex systems through detailed investigations of their individual parts (Ideker et al. 2001; Kitano 2002; Levesque and Benfey 2004; Barabási and Oltvai 2004). Of course, what enables modern molecular biologists to conceive of such an endeavor as Systems Biology is the availability of high-throughput technologies, such as DNA microarrays, for simultaneously probing the molecular function of many cellular components. That is, Systems Biology begins with massively parallel reductionism. If it is not to end there as well, then there must be theoretical advances that turn a flood of data into true, integrative understanding. One hope is that biological systems will be found to have characteristic organizational features and that these features will in turn determine, or at least circumscribe, functional properties to a large extent. Thus, a natural focus has been on the architecture, or topology, of complex biological networks, such as gene-regulatory networks, protein-interaction networks and metabolic networks, and on whether topological information has predictive value with regard to system-level functional properties of interest.
Recent work (reviewed in Barabási and Oltvai 2004) has suggested that a variety of biological networks do share characteristic organizational principles, and indeed that these principles are also shared with complex non-biological networks, such as social networks and computer networks. In other words, a diagram showing connections between nodes on the Internet does not look much different from a diagram showing connections between cellular metabolites linked by enzymatic reactions, nor from a diagram showing connections between proteins that physically interact.
If, as has been claimed (Barabási and Oltvai 2004), these shared organizational principles were then to confer important functional properties on the networks in question, then this could provide a powerful framework for understanding biological as well as non-biological systems. It is therefore worthwhile to examine critically both the empirical evidence for commonality in network architecture and the theoretical basis for drawing functional inferences from topological features. To this end, we clearly must consider the nature and quality of the data and methods used to infer network structure. Universality of network organizational features, even if it were to be firmly established, does not necessarily imply universality of functional properties. This is partly because details might matter: the behavior of a complex network might be exquisitely sensitive to slight differences in its composition, even if
those differences do not violate the apparent organizational rules (see, for example, Guet et al. 2002; Kim and Tidor 2003). It is also because various meanings and measures of ‘function’ pertain to different types of networks, and so conclusions drawn from one type of network, or from abstract representations of idealized networks, might have little relevance to other types of network.
For example, consider the simple network module depicted in Fig. 1, in which a highly connected node ‘H’ (a so-called ‘hub’) connects to many nodes that have far fewer links, a commonly observed motif within complex biological and non-biological networks. We may ask what the effect is of removing a given node on the function of the system. Specifically, let us ask whether the node labeled ‘A’ is ‘essential’ to the network. Of course, the answer to this question depends on what type of network is depicted, which in turn will define what we mean by ‘essential.’ If this were a computer network, we would most likely be concerned with the ability of information to pass between two nodes that are not directly connected. In this case, disrupting node A only affects node A, and does not affect the path between any two other nodes. Therefore we conclude that node A is entirely nonessential to the function of the network. However, what if the network were instead a gene-regulatory network, with node A (gene A) serving as a trigger to activate gene H, which then acts on many downstream targets? The network diagram is the same (although we might wish to add arrows to the links to show which gene is the target), but the definition of ‘essential’ has changed. Indeed, gene A is absolutely essential to the function of this network, because without it gene H and all its targets cannot be activated.
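The contrast between the two readings of Fig. 1 can be made concrete with a minimal sketch. It assumes that 'function' in the computer network means that surviving nodes can still reach one another, and that activation in the gene network simply propagates along directed edges; both are simplifications introduced here for illustration, and the function names are hypothetical.

```python
from collections import deque

# Star-shaped module from Fig. 1: hub H linked to nodes A-G.
LEAVES = ["A", "B", "C", "D", "E", "F", "G"]


def reachable(adjacency, start, removed=frozenset()):
    """Return the set of nodes reachable from `start`, skipping removed nodes."""
    if start in removed:
        return set()
    seen = {start}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        for neighbor in adjacency.get(node, ()):
            if neighbor not in removed and neighbor not in seen:
                seen.add(neighbor)
                queue.append(neighbor)
    return seen


# Interpretation 1: computer network, undirected (each edge listed both ways).
computer_net = {"H": list(LEAVES), **{leaf: ["H"] for leaf in LEAVES}}

# Interpretation 2: gene-regulatory network, A activates H, H activates B-G.
gene_net = {"A": ["H"], "H": [leaf for leaf in LEAVES if leaf != "A"]}


def still_connected_without_a():
    """In the computer network, can C still reach every other surviving node?"""
    survivors = set(computer_net) - {"A"}
    return reachable(computer_net, "C", removed={"A"}) == survivors


def genes_activated(trigger_present):
    """Genes switched on downstream of trigger A (assumed simple propagation)."""
    removed = frozenset() if trigger_present else frozenset({"A"})
    return reachable(gene_net, "A", removed=removed) - {"A"}
```

Under these assumptions the same diagram gives opposite answers: removing A leaves the computer network fully connected, whereas in the regulatory reading no gene downstream of A is activated.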
This simple example illustrates the potential pitfalls of over-generalization when it comes to the relationship between network architecture and function, and it suggests that gene-regulatory networks in particular might have unique properties that distinguish them from other types of complex network and that deserve greater theoretical attention. This paper represents an effort to review briefly the current state of knowledge of complex biological networks, and to investigate, through a combination of data analysis and numerical simulations, the extent to which predictions of network theory hold for gene-regulatory networks. Our hope is that this kind of critical examination will not only point out misleading aspects of current network theory as it pertains to gene networks, but also identify promising avenues for future theoretical investigation. In particular, we wish to highlight the potential insights gained from incorporating microevolutionary (i.e., population-genetic) perspectives into gene-network theory.
Representation of networks and types of network architecture
Fig. 1 Different interpretations of a network graph. (a) Hub H is connected to nodes A–G in a simple network module. (b) The graph is interpreted as a computer network. Edges are undirected, because electronic information flows both ways along links. When node A is disrupted, no information flows between A and H (missing edge), but the flow of information between all other nodes is unaffected (for example, C and F remain connected through H, dotted line). (c) The graph is interpreted as a gene-regulatory network, in which A activates H and H activates B–G. Edges are directed toward targets. (d) When gene A is disrupted, gene H cannot be activated and therefore neither can genes B–G (missing arrows)
A graph of nodes connected by edges, as in Fig. 1, is an intuitive and flexible means of representing a network. Nodes represent the system components of interest and edges link interacting components. For a gene-regulatory network, the nodes are genes and the edges represent activation or repression of one gene’s activity by another gene’s product. Note that whereas the definition of a gene is (relatively) unambiguous in this context, the definition of a regulatory interaction is not fixed and indeed varies in ways that impact how one interprets the network that is drawn. For example, some consider only direct interactions, i.e., activation or repression mediated by physical interaction between a transcription factor and a cis-regulatory DNA sequence (e.g., Shen-Orr et al. 2002; Yu et al. 2003). Others, however, consider indirect effects as well, and draw edges between gene pairs in which the elimination (deletion) of one member causes a change in expression of the other, no matter how that change is mediated (e.g., Featherstone and Broadie 2002). Clearly, the network of indirect effects will have more edges than the network of direct effects, and likely will have more anastomoses (re-connections) as well. As we will see below, networks are often characterized by the distribution of the number of connections per node, as well as by measures describing the relative independence of different parts of the network. Thus, care must always be taken to specify, and remain cognizant of, the nature of the connections that are depicted. One of the most basic descriptors of a node is its ‘degree,’ or the number of edges connected to it. In Fig. 1, node H has high degree, whereas the other nodes have low degree. Even this simple example shows how the degree distribution among nodes in a network can capture important features of the network’s topology. 
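As a small illustration of these descriptors, the sketch below counts node degree from an edge list and tabulates the degree distribution for the module in Fig. 1. The function names are hypothetical, and nodes with no edges at all are not represented because the input is a list of edges.

```python
from collections import Counter


def degrees(edges, directed=False):
    """Count edges per node.

    Undirected: return {node: degree}.
    Directed: return (out_degree, in_degree) as two dicts.
    """
    out_deg, in_deg = Counter(), Counter()
    for source, target in edges:
        out_deg[source] += 1
        in_deg[target] += 1
    if directed:
        return dict(out_deg), dict(in_deg)
    return dict(out_deg + in_deg)


def degree_distribution(degree_by_node):
    """Map each degree k to the fraction of nodes having that degree."""
    counts = Counter(degree_by_node.values())
    n = len(degree_by_node)
    return {k: c / n for k, c in sorted(counts.items())}


# Fig. 1 module: hub H linked to seven low-degree nodes.
fig1_edges = [("H", leaf) for leaf in "ABCDEFG"]
fig1_degrees = degrees(fig1_edges)
fig1_distribution = degree_distribution(fig1_degrees)
```

For a gene-regulatory network the directed form is the relevant one, since the number of targets of a regulator (outgoing degree) and the number of regulators of a target (incoming degree) are distinct quantities.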
Consider, for example, if the N nodes in a network were to be connected such that any pair had the same probability, P, of being connected as any other pair. Such a network would belong to the well-studied class of so-called ‘random’ networks, also known as Erdős–Rényi graphs after the two mathematicians who pioneered their study (Erdős and Rényi 1960). In a random network the degree follows a Poisson distribution, which peaks around the average degree and has variance equal to the average degree. This means that networks that are on average weakly connected will have vanishingly few nodes with very high degree. Similarly, networks that are on average highly connected will have vanishingly few nodes with very low degree. Perhaps the best known feature of
random networks is their ‘small-world’ nature. That is, the shortest path between any two randomly chosen nodes typically passes through rather few nodes.
It turns out that actual complex networks of many types are not random, but instead appear to be organized such that the probability that a node will have k links follows a power-law distribution: $P(k) \propto k^{-\gamma}$, with $\gamma$ constant for a given network (Barabási and Oltvai 2004). This recent discovery might seem like a rather esoteric difference, but it has consequences that have garnered attention even beyond the scientific literature (see, for example, Gertner 2003, in The New York Times Magazine’s 2003 Year in Ideas issue), due to applications in social, technological and scientific realms. The upshot of the power-law distribution of connectivity is that it has substantially heavier tails than the Poisson distribution. That is, in a power-law-distributed network there are, on average, more nodes with very many connections (hubs) and more nodes with few connections (as in Fig. 1). Indeed, a power-law distribution always decays from a maximum at k = 1. Because this peak does not necessarily coincide with the average degree, and because the decay rate is independent of k, these networks have been termed ‘scale-free’ (Barabási and Albert 1999).
In addition to having more very highly connected hubs than random networks, scale-free networks with degree exponents within the observed range of biological and non-biological networks ($2 < \gamma < 3$) are ‘ultra small.’ That is, scale-free networks have even shorter minimum paths between nodes than random networks with the same number of nodes (Barabási and Oltvai 2004). Moreover, as with random networks, the minimum path length is typically robust to disruptions of random nodes, so that even when a large proportion of nodes is eliminated, the distances between the remaining nodes are essentially unchanged.
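The difference in tail weight can be checked numerically. The sketch below compares the fraction of nodes with at least 20 links under a Poisson degree distribution and under a truncated power law with the same average degree. The exponent of 2.5, the truncation at k = 1000 and the hub threshold of 20 are arbitrary choices made here for illustration, not values taken from any data set.

```python
import math


def power_law_pmf(gamma, k_max):
    """Return {k: P(k)} with P(k) proportional to k**(-gamma), k = 1..k_max."""
    weights = {k: k ** (-gamma) for k in range(1, k_max + 1)}
    norm = sum(weights.values())
    return {k: w / norm for k, w in weights.items()}


def poisson_pmf(mean_degree, k_max):
    """Return {k: P(k)} for a Poisson degree distribution, k = 0..k_max.

    Computed in log space so that large k does not overflow.
    """
    log_mean = math.log(mean_degree)
    return {
        k: math.exp(-mean_degree + k * log_mean - math.lgamma(k + 1))
        for k in range(k_max + 1)
    }


def tail(pmf, k_min):
    """Probability that a node has degree k_min or greater."""
    return sum(p for k, p in pmf.items() if k >= k_min)


K_MAX = 1000   # assumed truncation of the degree range
GAMMA = 2.5    # assumed exponent, inside the 2 < gamma < 3 range
HUB_MIN = 20   # arbitrary threshold for calling a node a hub

scale_free = power_law_pmf(GAMMA, K_MAX)
mean_degree = sum(k * p for k, p in scale_free.items())
random_net = poisson_pmf(mean_degree, K_MAX)

hub_fraction_power_law = tail(scale_free, HUB_MIN)
hub_fraction_poisson = tail(random_net, HUB_MIN)
```

With these settings the power-law distribution places its maximum at k = 1 and assigns many orders of magnitude more probability to hub-sized degrees than the Poisson distribution with the same mean.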
However, if a hub in a scale-free or random network is compromised, then the distances between remaining nodes tend to increase dramatically, and indeed might go to infinity, in that the removal of a hub potentially fractures the network into isolated sub-networks that are no longer connected. Thus, by this distance-based definition of network function, scale-free and random networks share an important property: they are robust to random failure, but very sensitive to targeted attacks on hubs (Albert et al. 2000a, b). It is this functional distinction between hubs and non-hubs that has received the most attention in considerations of biological networks. Below, we will critically evaluate the inferences that have been drawn in this regard, and their experimental support, but first let us quickly review the evidence that biological networks are indeed scale-free.
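A toy calculation illustrates the distinction between failure of an arbitrary node and an attack on a hub. The sketch below uses a hand-built network of two linked hubs with five spokes each (a hypothetical example, not a real data set) and measures the size of the largest connected component that survives the removal of a single node, a simple proxy for the distance-based notion of function.

```python
from collections import deque


def build_undirected(edges):
    """Adjacency sets for an undirected graph given as (u, v) pairs."""
    adjacency = {}
    for u, v in edges:
        adjacency.setdefault(u, set()).add(v)
        adjacency.setdefault(v, set()).add(u)
    return adjacency


def largest_component(adjacency, removed):
    """Size of the largest connected component after deleting `removed`."""
    alive = set(adjacency) - set(removed)
    seen, best = set(), 0
    for start in alive:
        if start in seen:
            continue
        seen.add(start)
        queue, size = deque([start]), 0
        while queue:
            node = queue.popleft()
            size += 1
            for neighbor in adjacency[node]:
                if neighbor in alive and neighbor not in seen:
                    seen.add(neighbor)
                    queue.append(neighbor)
        best = max(best, size)
    return best


def mean_size_after_removal(adjacency, candidates):
    """Average largest-component size when each candidate is removed alone."""
    sizes = [largest_component(adjacency, {node}) for node in candidates]
    return sum(sizes) / len(sizes)


# Toy network: hubs H1 and H2 joined by one edge, each with five spokes.
toy_edges = (
    [("H1", "H2")]
    + [("H1", f"a{i}") for i in range(5)]
    + [("H2", f"b{i}") for i in range(5)]
)
toy_net = build_undirected(toy_edges)
top_degree = max(len(neighbors) for neighbors in toy_net.values())
hubs = [n for n, nb in toy_net.items() if len(nb) == top_degree]
non_hubs = [n for n, nb in toy_net.items() if len(nb) < top_degree]

after_hub_attack = mean_size_after_removal(toy_net, hubs)
after_random_failure = mean_size_after_removal(toy_net, non_hubs)
```

In this toy case, losing a spoke leaves the other 11 nodes connected, whereas losing either hub cuts the largest connected group to 6 nodes. Whether this graph-based measure of damage corresponds to biological function is exactly the assumption examined in the text.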
Evidence for scale-free architecture of biological networks
The first biological networks to be investigated for scale-free architecture were those linking cellular metabolites via enzyme-catalyzed reactions (Jeong et al. 2000; Wagner and Fell 2001). Metabolic reactions are occasionally reversible, but often favor one direction, so edges in the graph representation of such networks must be directional (Jeong et al. 2000; but see alternative representation of Wagner and Fell 2001; also see summary of different representations by Arita 2004). For each of 43 species, spanning eubacteria, archaea and eukaryota, Jeong et al. (2000) demonstrated power-law distributions for both the number of enzymatic reactions in which a metabolite serves as a substrate and the number of enzymatic reactions in which a metabolite is the product (i.e., for both outgoing and incoming degree). The network structures for these analyses derive from known biochemical reactions, or from strong inference using complete genomic sequences to determine the full enzyme complements of those organisms for which extensive biochemical data are not available. Thus, the data are quite reliable, and make credible the claim that scale-free topology is a universal feature of metabolic networks, at least of enzyme-linked metabolite networks as constructed by Jeong et al. (2000). This last qualifier is necessary, because an alternative (and arguably better) representation of the E. coli metabolic network challenges a central tenet of scale-free network theory, namely the ultra-small-world phenomenon. Arita (2004) computationally traced individual atoms in substrates and products of enzymatic reactions and showed that typical path lengths between metabolites are much longer than those predicted by graph representations that merely link each enzyme’s substrates with its products.
The extent to which this difference reveals something fundamentally ‘un-scale-free’ about metabolism is unclear, but it is nonetheless a forceful reminder that network representations can be misleading if not interpreted properly. Scale-free topology is also apparently a feature of protein-interaction networks, although limitations in the data are substantial. Protein-interaction networks are determined primarily by yeast two-hybrid assays, which do not probe native interactions but instead employ novel protein fusions and, for the proteins of all species besides Saccharomyces cerevisiae, a foreign cellular environment. Moreover, two proteins may score as physically interacting in a yeast two-hybrid assay, yet they might never be found in the same cell type or same subcellular location, and therefore might
never actually interact in their native organism. These limitations no doubt contribute to a large number of false-positive and false-negative results. Indeed, von Mering et al. (2002) estimate that, even for yeast proteins, over half the interactions detected by high-throughput experiments are spurious. For this reason, protein-interaction data as they exist currently are inherently less reliable than data on enzymatic reactions. Nevertheless, a fairly strong case can be made for the scale-free topology of the S. cerevisiae protein-interaction network, based on the findings in independent data sets of a power-law distribution of connections, as well as on the extremely low probability of finding such an empirical distribution (even from incomplete and imperfect data) when the underlying network is not scale-free (Yook et al. 2004; but see Stumpf et al. 2005). Thus, whereas any particular inferred interaction may be suspect, the overall topology of inferred interactions is likely to resemble that of the actual network. Interestingly, the protein-interaction network of Caenorhabditis elegans also appears to be scale-free (Li et al. 2004), yet the protein-interaction network of Drosophila melanogaster does not conform strictly to a power-law distribution of connections. Instead, the degree distribution is fit better by a combination of power-law and exponential decay, suggesting a more complicated hierarchical organization of the network (Giot et al. 2003). It should be pointed out, however, that an earlier study of the yeast protein-interaction network found the degree distribution to be best fit by a power-law with an exponential cutoff (Jeong et al. 2001), but subsequent studies using larger data sets have found the power-law without exponential cutoff to be sufficient (Yook et al. 2004).
As with metabolic networks, accurate and meaningful graph representation is an issue with protein-interaction networks.
For example, we would expect that a graph drawn of physical interactions that actually take place in the cell would differ from one drawn of physically possible interactions (i.e., interactions detected by biochemical assays but not necessarily occurring in the cell), and that the former might be more desirable as an indicator of system-level properties. An effort has been made recently (Gavin et al. 2002; Ho et al. 2002) to characterize proteins that actually interact in complexes within the cell, to complement yeast two-hybrid data, although the complex data would overestimate pair-wise interactions if used incorrectly because members of the same multi-protein complex need not physically touch each other (Hahn et al. 2004). The graph representation might also be misleading in those cases when protein interactions are
not symmetric. That is, protein-interaction graphs derived from two-hybrid assays are undirected, for the simple reason that if protein A touches protein B, then protein B touches protein A. However, certain physical interactions might lead to alterations in protein structure (biochemical and/or conformational modifications) such that, for example, when protein A touches protein B, B is changed to a functionally distinct molecule, B′. This would suggest a directionality to the interaction, or perhaps even that a better representation would be to have B and B′ as separate nodes linked by A. Large-scale data sets containing this type of information do not yet exist.
Among biological networks, gene-regulatory networks present perhaps the greatest challenge in terms of their experimental determination and their interpretation. As mentioned above, one problem lies in the definition of an interaction. Consider a simple genetic pathway A → B → C (where an arrow represents a direct interaction between a transcription factor produced by the gene at its tail and a cis-regulatory element controlling the gene at its head). A microarray experiment assaying the effects of deleting each of these genes in turn would yield the network with edges A → B, B → C and A → C,
because such an experiment does not distinguish between direct and indirect effects. This network looks like a feedforward loop (Shen-Orr et al. 2002; Mangan and Alon 2003), even though the underlying genetic pathway does not contain such a motif. As feedforward loops are characterized by important functional properties that do not hold for simple linear cascades, such as attenuation of transient activation signals (Shen-Orr et al. 2002; Mangan and Alon 2003), care must be taken in the interpretation of the microarray-derived network. This is why only established transcription-factor/direct-target interactions were used by Shen-Orr et al. (2002) and by Yu et al. (2003) for their analyses of network motifs in Escherichia coli and S. cerevisiae, respectively.
A microarray-derived gene-expression network in S. cerevisiae has been shown to be scale-free (Featherstone and Broadie 2002), but as the above considerations suggest, it is not clear to what extent this network of indirect interactions holds relevant information about system-level functional properties. Smaller networks, incorporating only known transcription factors and their targets, have been determined for yeast and E. coli (Guelzim et al. 2002). This work has shown an interesting departure from the power-law: in both yeast and E. coli, the outgoing degree distribution (the number of target genes for each transcription factor) does appear to be scale-free, but the incoming degree distribution (the number of
transcription factors acting on each target gene) is exponential. In yeast, 93% of target genes were found to be regulated by only 1–4 factors. Thus, there do exist transcription factor hubs that regulate a large number of target genes, but there do not appear to be a significant number of target hubs that funnel signals from a large number of transcription factors. It would be an overstatement, or at least an oversimplification, to say that scale-free topology is universal in biological networks. However, as noted by Barabási and Oltvai (2004), what does appear to be universal is the existence of hubs in these networks. Current network theory ascribes great functional importance to these hubs, so in the next section we explore the inferences that have been drawn concerning the correlation between network topology and network function, and the extent to which these predictions are corroborated by experimental evidence.
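The earlier point in this section about direct versus indirect effects in a three-gene linear pathway can be illustrated with a short sketch. It assumes, as a simplification introduced here, that deleting a gene detectably changes the expression of every gene downstream of it; under that assumption a deletion-based experiment recovers the transitive closure of the direct network. The function names are hypothetical.

```python
def downstream(direct_targets, gene):
    """All genes reachable from `gene` through one or more regulatory steps."""
    found = set()
    stack = list(direct_targets.get(gene, ()))
    while stack:
        current = stack.pop()
        if current not in found:
            found.add(current)
            stack.extend(direct_targets.get(current, ()))
    return found


def deletion_network(direct_targets):
    """Edges a deletion experiment would report: gene -> every affected gene."""
    genes = set(direct_targets)
    for targets in direct_targets.values():
        genes.update(targets)
    return {gene: downstream(direct_targets, gene) for gene in sorted(genes)}


# Linear pathway of direct interactions: A regulates B, B regulates C.
pathway = {"A": {"B"}, "B": {"C"}}
inferred = deletion_network(pathway)
extra_edges = {
    (gene, target)
    for gene, targets in inferred.items()
    for target in targets
    if target not in pathway.get(gene, set())
}
```

The only edge in the inferred network that is absent from the direct pathway is the one from A to C, which is what gives the inferred network the appearance of a feedforward loop.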
Inferences based on scale-free architecture and their experimental support
One of the key predictions of network theory is that hubs are more essential than non-hubs to the function of the network. It is important to recall that this prediction derives from the observation that disrupting a hub in an idealized, undirected network tends to dramatically increase the distance between remaining nodes. Thus, the expectation that hubs in biological networks will tend to be more functionally important than non-hubs is based on an assumption concerning the equivalence of two definitions of function, the simple graph-based one (the existence of a short path between two nodes) and the much more complicated biological one (ultimately, the ability of the system to contribute to the survival and reproduction of the organism). This assumption is as yet unproven and, we believe, deserves much more critical attention. Nevertheless, we may ask if there is any evidence that the prediction of greater importance of hubs holds for biological networks.
The essentiality prediction has been tested for the yeast protein-interaction network, by comparing each protein’s number of connections to its effect on cell viability when absent (by gene deletion) (Jeong et al. 2001; Pržulj et al. 2004; Hahn and Kern 2005). An approximately 3-fold difference was observed in the proportion of essential proteins (those causing lethality when removed) between the subset of most connected proteins (those with greater than 15 links) and the subset of least connected proteins (those with fewer than 6 links) (Jeong et al. 2001). Correspondingly, although there is considerable overlap between the
degree distributions of essential and non-essential proteins, the average degree of essential proteins is significantly greater than that of non-essential proteins (2-fold greater according to Pržulj et al. 2004; 1.2-fold greater according to Hahn and Kern 2005, who used a different subset of the protein-interaction data and different data on protein essentiality).
In contrast to the above results, there appears to be at best a weak relationship between connectivities in the yeast gene-expression network and the relative growth rates of yeast deletion mutants. Featherstone and Broadie (2002) compared the number of genes whose expression levels changed significantly when a given gene was deleted (indirect interactions) to the relative growth rate of the strain lacking that gene, and found only a very weak trend toward lower fitness of strains with a more highly connected gene deleted.
It has been argued (Hahn et al. 2004) that viability or relative growth rate measurements under artificial laboratory conditions do not capture true fitness differences that emerge under natural conditions over evolutionary time scales, and therefore that a better test of hub essentiality is one that assays selective constraint on proteins through interspecies sequence comparisons. Hahn et al. (2004) recently reviewed and critiqued efforts to do so using data from yeast, and concluded that, while a negative correlation exists between a protein’s number of interaction partners and its rate of sequence divergence, this correlation is exceedingly weak. That is, they found that connectivity explained only on the order of 1% of the variance in rate of protein evolution (Hahn et al. 2004).
One possible reason for the low predictive power of connectivity with respect to protein evolution is that there is a long causal chain between the two, to which many other influences may contribute. Indeed, acknowledging this, Hahn et al.
(2004) controlled for several of these other potential influences, including GC content and expression level (which is known to influence codon usage). Another approach is to step back in the causal chain, that is, to seek correlations between connectivity and more proximate measures of function, as opposed to measures of ‘ultimate’ function (fitness). We have attempted to do this below, by analyzing the relationship between protein connectivity and cellular responses to environmental conditions. Specifically, because of the purported functional centrality of hubs, one might expect responses to environmental differences to be mediated through hubs, and therefore would expect genes encoding hub proteins to experience a greater variance in expression across environmental conditions than those encoding non-hubs.
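A test of this expectation can be sketched in a few lines. The example below summarizes each gene's expression across environments by its mean and variance and regresses both on log-transformed node degree by ordinary least squares. The function names, the choice of base-10 logarithms and the toy numbers are all assumptions made here for illustration; the toy data are invented and come from no experiment.

```python
import math
from statistics import mean, variance


def ols(x, y):
    """Least-squares fit y = intercept + slope * x.

    Returns (slope, intercept, r_squared).
    """
    mx, my = mean(x), mean(y)
    sxx = sum((xi - mx) ** 2 for xi in x)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    syy = sum((yi - my) ** 2 for yi in y)
    slope = sxy / sxx
    intercept = my - slope * mx
    r_squared = sxy ** 2 / (sxx * syy) if syy > 0 else float("nan")
    return slope, intercept, r_squared


def expression_vs_connectivity(log_ratios, degree):
    """Regress per-gene mean and log10 variance of expression on log10 degree.

    log_ratios: {gene: [log expression ratio in each environment]}
    degree:     {gene: number of protein interactions (must be >= 1)}
    """
    genes = sorted(set(log_ratios) & set(degree))
    x = [math.log10(degree[g]) for g in genes]
    means = [mean(log_ratios[g]) for g in genes]
    log_vars = [math.log10(variance(log_ratios[g])) for g in genes]
    return {"mean": ols(x, means), "log_variance": ols(x, log_vars)}


# Invented toy data, for illustration only.
toy_log_ratios = {
    "g1": [0.2, 0.0, 0.2, 0.0],
    "g2": [1.0, -1.0, 1.0, -1.0],
    "g3": [9.9, -10.1, 9.9, -10.1],
}
toy_degree = {"g1": 1, "g2": 10, "g3": 100}
toy_fit = expression_vs_connectivity(toy_log_ratios, toy_degree)
```

The squared correlation returned by `ols` is the proportion of variance explained, which is the quantity that distinguishes a statistically detectable trend from a biologically informative one when the number of genes is large.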
Evidence for functional significance of protein-interaction hubs in cellular responses
A large number of microarray studies have examined expression levels for each gene in the yeast genome under hundreds of different independent environments (many of these datasets are available through the Stanford Yeast Database, http://www.yeastgenome.org/). With these data, one can determine relatively easily which genes respond to specific environmental changes (e.g., nutrient availability, temperature, salinity, etc.), and whether specific clusters of genes tend to respond in tandem to environmental change, either through up- or down-regulation. We coupled these expression data with information on connectivity obtained from assays of yeast protein–protein interactions.
Data and statistics
Rather than pool data on gene expression patterns from many different labs, which may increase background noise, we chose to estimate variance in expression patterns for one particular dataset (Gasch et al. 2000). Gasch et al. (2000) measured levels of mRNA for each of 6152 genes in S. cerevisiae under standard ‘control’ conditions, and under 173 different stressful environments. The actual data consist of log-transformed ratios of RNA levels under treatment and control conditions. Gasch et al.’s study investigated a variety of stressful environments, including changes in temperature, salinity, hydrogen peroxide exposure, salt concentration, nitrogen depletion and so forth. In some cases, different environments consist of yeast raised under a particular stressor for different amounts of time, or changes in the concentration of the stressor. To measure variation in expression levels, we simply calculated the mean and variance for each gene across the 173 environments tested by Gasch et al. (2000). Variances were log-transformed to better approximate normality. Data for protein-network connectivity were obtained from von Mering et al. (2002). These authors compiled data on connectivity from a variety of sources, and then determined the overall quality of the data based on the degree to which particular protein–protein interactions were replicated across independent experiments. For the results presented here, we confined our analysis to those interactions that were ranked as having ‘medium’ or ‘high’ confidence by von Mering et al. There were 1921 genes for which we had data on both variation in gene expression and protein–protein connectivity. We used least-squares regression to test for a dependence on connectivity of the mean and variance of gene-expression levels.
[Figure 2. Panel a: mean expression versus number of protein interactions. Panel b: variance in expression versus number of protein interactions.]
Fig. 2 Relationship between mean and variance in yeast gene-expression levels among a range of stressful environments. (a) Each gene’s mean expression (log-transformed ratio of treatment to control RNA levels) among 173 environments is plotted against its protein’s connectivity (log-transformed node degree) in the inferred network of protein–protein interactions. The least-squares regression line is shown. (b) The log-transformed variance in each gene’s expression among the 173 environments is plotted against its protein’s connectivity. The least-squares regression line is shown
Results
As shown in Fig. 2, both mean and variance of expression ratios correlate significantly with log-transformed values of connectivity. Genes with higher connectivity show lower mean expression ratios ($r^2 = 0.052$, $F_{1,1918} = 106.1$, $P = 3 \times 10^{-24}$), but significantly higher variances ($r^2 = 0.023$, $F_{1,1918} = 45.32$, $P = 2 \times 10^{-11}$). Similarly low P-values were obtained using the non-parametric Spearman Rank test (not shown). The opposing trends for the mean and variance
123
Genetica
suggest that a statistic measuring variance relative to the mean would show an even more significant negative correlation with connectivity. Typically, the coefficient of variation (CV = s.d./mean) is a useful statistic in such cases, but analysis of CVs assumes that the distribution of mean values is approximately Normal and strictly positive; the latter condition is violated by the log-transformed ratios, and so we could not use CV as a measure of the spread of the distribution.
Our results suggest that environmentally induced changes in levels of RNA expression for a particular gene can be predicted in part by its protein’s connectivity. Larger numbers of interactions tend to increase the amount of variation one sees in expression levels for a given gene. However, while the patterns are highly statistically significant owing to large sample size, from a biological perspective the results are not especially striking. As indicated by the low r² values, the amount of variation in gene expression explained by connectivity is small. That is, the predictive value of connectivity is relatively weak.
We have seen that node degree in general correlates only weakly with more or less proximate measures of network function. This does not necessarily mean that topology is unimportant in determining function. Instead, better or more detailed network measures might be required to explain a greater proportion of the variance in function (Bray 2003). Such measures might include other topological descriptors of the network, such as local clustering of nodes, betweenness centrality (the propensity for a node to lie on the shortest paths between other nodes), and the existence of small sub-network motifs. For example, in the protein-interaction networks of S. cerevisiae, D. melanogaster and C. elegans, betweenness correlates better with a protein’s evolutionary rate than does node degree (Hahn and Kern 2005). For S. cerevisiae and D. melanogaster, betweenness also has a significant effect on the probability of a protein’s being essential, even after the effect of node degree is removed statistically (Hahn and Kern 2005). Motif structure also has some explanatory power. There are three possible three-node motifs that contain a terminal (non-regulatory) gene that is regulated by only one input. In S. cerevisiae only two of these motifs are common, and terminal genes that belong to one motif (a so-called ‘three chain’) have significantly greater variance in expression across environmental conditions than those that belong to the other (Promislow 2005).
As an alternative to studying network structure, it might be profitable to focus more on network dynamics, making links to topological features when possible. Even the strongest proponents of the ‘top-down’
analysis of complex networks acknowledge that a more detail-oriented, quantitative, dynamical view of complex biological networks is necessary (Barabási and Oltvai 2004), and have made important strides in this direction with their flux balance analysis of the E. coli metabolic network (Almaas et al. 2004). Just as gene-regulatory networks present greater challenges than metabolic networks in their structural representation and analysis, they also present greater challenges in dynamical analysis. Much less is known about the quantitative aspects of gene regulation, and our qualitative knowledge strongly suggests that much of gene regulation is context-dependent. That is, for the vast majority of genes, we do not know their concentration-dependent responses to individual transcription factors, let alone the combinatorial cis-regulatory logic that governs how they respond to multiple, simultaneous transcription-factor inputs. However, the hope is that these data will emerge as high-throughput technologies are widely employed and further developed (Bray 2003).
In the meantime, it is not premature to begin asking questions about how dynamical considerations might fit into a structural understanding of complex gene-regulatory networks. We choose to do this by using simulations of evolving gene networks. The advantage of the simulation framework is that it permits exploration of complex, non-linear interactions in a fairly tractable way, although there is always a tradeoff between realism and tractability. As a starting point, we feel it is appropriate to use a rather abstract model with relatively few parameters, and to endeavor to reveal interesting patterns that may then be investigated in more detail-rich models and in experimental data. We also feel strongly that an evolutionary perspective is immensely valuable. 
Despite the evidence of superficial resemblance between complex biological systems and complex engineered systems (Alon 2003), we should not take for granted that complex biological systems, such as gene networks, are exquisitely crafted for optimal function. Instead, we must allow that the contingent evolutionary histories of organisms impinge on their organization and function in complicated ways (Wagner 2003). Indeed, such a perspective might help explain the lack of strong correlations between node degree and functional measures that we saw above. We explore this possibility below.
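
Before turning to the simulations, the regression analysis described under ‘Data and statistics’ can be made concrete with a minimal sketch. It assumes the data are available as two hypothetical dictionaries (`log_ratios`, mapping each gene to its log expression ratios across environments, and `degree`, mapping each gene to its number of protein interactions); the base-10 logarithms and the sample-variance estimator are assumptions, since the text specifies only that the values were log-transformed.

```python
import math
from statistics import mean, variance


def least_squares(x, y):
    """Fit y = intercept + slope * x by ordinary least squares.

    Returns (slope, intercept, r_squared). Assumes x and y are not constant.
    """
    mx, my = mean(x), mean(y)
    sxx = sum((xi - mx) ** 2 for xi in x)
    syy = sum((yi - my) ** 2 for yi in y)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return slope, my - slope * mx, (sxy * sxy) / (sxx * syy)


def expression_vs_connectivity(log_ratios, degree):
    """Regress per-gene expression statistics on log-transformed connectivity.

    log_ratios: gene -> list of log expression ratios, one per environment
                (hypothetical input format).
    degree:     gene -> number of protein interactions (must be >= 1).
    Returns (fit of mean expression, fit of log-transformed variance),
    each as (slope, intercept, r_squared).
    """
    genes = [g for g in log_ratios if g in degree]
    x = [math.log10(degree[g]) for g in genes]          # assumed base 10
    means = [mean(log_ratios[g]) for g in genes]
    log_vars = [math.log10(variance(log_ratios[g])) for g in genes]
    return least_squares(x, means), least_squares(x, log_vars)
```

A low r_squared from either fit, even when the slope is clearly non-zero, is the pattern the text describes: a real but weakly predictive relationship.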
Gene-network simulations
Computer simulations can be used to explore the relationship between the functional properties of a gene-regulatory network and its topology, in a setting in which various aspects of network architecture can be controlled and in which measurement error is not an issue. To perform such simulations, we use a model of an evolving network of interacting transcription factors first proposed by Wagner (1996). We previously used an extension of Wagner’s model to explore the causes and consequences of robustness to mutations in such networks (Siegal and Bergman 2002; Bergman and Siegal 2003). Here, we are interested in the effects of mutations and other functional properties as they relate to network topology.
Fig. 3 Representation of a gene-regulatory network. Each gene (horizontal arrow) produces a transcription factor (open circles, squares or diamonds). These factors influence the expression of each gene via upstream cis-regulatory elements (solid circles, squares or diamonds, shown with different gray levels to reflect inherent differences in affinity and in whether the interaction is activating or repressing). Genotype is represented by the matrix, W, of regulatory interactions, the elements of which are drawn randomly from the standard Normal distribution. The connectivity of the network is predetermined by choosing certain wij elements to be nonzero, such that outgoing connections follow a power-law and incoming connections follow a Poisson distribution. The regulatory interactions are iterated from an arbitrary starting point of gene-product levels, represented by the vector S(0). Phenotype is the vector, Ŝ, of gene-product levels at equilibrium. For more details on the simulation procedure, see text as well as Siegal and Bergman (2002) and Bergman and Siegal (2003).
[Fig. 3 labels: regulatory elements w11 … wnn of W and gene-product levels s1 … sn]
A network model of an evolving regulatory system
The Wagner (1996) model (Fig. 3) considers a set of N genes, each of which produces a factor capable of regulating the expression of any of the N genes. A factor may act positively on some targets and negatively on others, and does not necessarily target all N genes. Such networks of interregulatory transcription factors are believed to be key regulatory modules in the development and physiology of practically all organisms (Guelzim et al. 2002; Davidson et al. 2002; Davidson 2001). In the model, when multiple factors regulate the same gene, they combine in a
concentration-dependent, non-linear way to determine the output of the target gene. The non-linearity is introduced by a sigmoidal function, the steepness of which is set for each simulation by a single parameter, a (Siegal and Bergman 2002; Bergman and Siegal 2003). When a is large, the sigmoid approaches a step function, essentially restricting genes to be either fully ‘on’ or fully ‘off’; lower values of a allow intermediate expression states as well. The regulatory effect of each factor on each gene is drawn from the standard Normal distribution and comprises an element of an N × N matrix, W, which is considered the ‘genotype’ of a haploid individual. As described below, it is this genotype that evolves by mutation.
Here, the connectivity of the network is predetermined by choosing certain wij elements to be non-zero, as follows. The number, k, of genes that a given factor regulates (‘outgoing’ connections) is drawn from a power-law probability distribution, $P(k) \propto k^{-\gamma}$. For each factor, these k connections are assigned randomly with equal probability to the N potentially regulated genes, such that the number of factors regulating a given gene (‘incoming’ connections) is approximately Poisson distributed. It has been found that biological networks typically have degree exponent between 2 and 3, which is the critical range of γ for which characteristic scale-free properties emerge (Barabási and Oltvai 2004). Accordingly, for the simulations here, we use γ = 2 or γ = 2.7.
The concentration of each factor at any time t is recorded in an N-element vector, S, whose entries range from 0 (no expression) to 1 (full expression). The initial concentrations, S(0), are chosen randomly to be either 0 or 1, but are the same for each individual in a given simulation. The equilibrium state of the S vector, Ŝ, is achieved by iterated multiplication of W and S (with passage of each element of S through the sigmoid). 
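
The construction of the connectivity and of W described above can be sketched as follows. This is a minimal illustration rather than the simulation code used for the results: the function names are hypothetical, and it assumes that out-degree k ranges from 1 to N, that a factor's k targets are distinct, and that self-regulation is allowed, none of which is stated explicitly in the text.

```python
import random


def power_law_out_degree(n_genes, gamma, rng):
    """Draw k from 1..n_genes with P(k) proportional to k**(-gamma).

    The support 1..n_genes is an assumption of this sketch.
    """
    ks = list(range(1, n_genes + 1))
    weights = [k ** (-gamma) for k in ks]
    return rng.choices(ks, weights=weights, k=1)[0]


def random_network(n_genes, gamma, rng):
    """Build W as a list of rows; w[i][j] is the effect of factor j on gene i.

    Column j therefore holds the outgoing connections of factor j, and
    row i holds the cis-regulatory inputs to gene i.
    """
    w = [[0.0] * n_genes for _ in range(n_genes)]
    for j in range(n_genes):
        k = power_law_out_degree(n_genes, gamma, rng)
        # Assign the k outgoing connections to distinct, equally likely targets.
        for i in rng.sample(range(n_genes), k):
            w[i][j] = rng.gauss(0.0, 1.0)  # standard Normal regulatory effect
    return w
```

Because each factor picks its targets uniformly at random, the in-degree of a gene is a sum of many independent, low-probability events, which is why the incoming connections come out approximately Poisson distributed.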
Formally, $S(t+1) = f(W S(t))$, where f is the sigmoidal function, $f(x) = 1/(1 + e^{-ax})$, applied to each element of the vector that results from multiplying W and S(t). Ŝ is considered the ‘phenotype’ of an individual. This ‘phenotype’ is clearly far removed from an actual morphological, physiological or behavioral phenotype, but it is assumed that the activity pattern of key regulatory genes will somehow determine downstream events that lead to actual phenotypic outcomes. Furthermore, an equilibrium gene-expression profile is a phenotype that can be straightforwardly measured, using microarrays for example, and so is a relevant trait for experimental analysis. Figure 4 shows a hypothetical example of a network and its approach to equilibrium.
Fig. 4 Example of gene-network dynamics in the model. (a) A hypothetical four-gene network, represented as a W matrix. (b) The activation and repression relationships encapsulated in the W matrix, represented as a familiar ‘arrow and T-bar’ diagram. (c) The successive iterations of network dynamics (using a = 3), represented as if gene-expression levels were measured in a microarray experiment. In this example, the initial state of the system is genes 1 and 3 on, and genes 2 and 4 off. That is, S(0) = [1;0;1;0]. Readings from the same gene are aligned horizontally, and from the same time point are aligned vertically. As shown in the legend at bottom, white squares represent full expression, and black squares represent zero expression. Note how equilibrium is reached at S(2) after transient activation of gene 4.
Evolution in the model consists of a standard population-genetic process. Each population is founded by a clone of a single founder individual, and the population size is then kept constant through subsequent generations (at 500 individuals in the simulations presented here). Mutations are simulated by random replacements of elements of the W matrix (and occur at an average rate of 0.1 replacements per individual). Mating occurs by choosing two parents at random to form a transient diploid, which then segregates rows of the W matrix to form a single, haploid progeny. That is, the regulatory elements controlling a given gene are completely linked to one another and to the gene, whereas the genes are unlinked to one another. Selection has two components. First, a genotype is lethal if it has no equilibrium S vector, that is, if its gene-expression profile does not settle down to a steady state. Second, for non-lethal genotypes, fitness is determined by phenotypic distance to a pre-determined optimum (S_OPT, typically set as the phenotype of the founder of each population). This distance from the optimum, D, is defined simply as the sum of squared differences between Ŝ and S_OPT, normalized to range between 0 and 1 by dividing by N. The fitness, F, decreases exponentially with increasing phenotypic distance from the optimum, and the strength of selection is set by a single parameter, σ, which determines the steepness of the exponential curve ($F(\hat{S}) = e^{-D/\sigma}$).
The relationship between node degree and essentiality
The first functional property we investigate here is the fitness upon knocking out a network gene. As we reviewed above, experimental analyses suggest that at least a weak correlation exists, such that knocking out a highly connected gene (a gene with many direct outputs) is expected to yield a lower fitness than knocking out a gene with fewer outputs. Here we simulate 100 independent networks of 25 genes with sigmoid parameter a = 6. The topology of each network is drawn from a scale-free distribution of outgoing connections (power-law exponent γ = 2). Each network then undergoes 1000 generations of evolution, under strong selection for the founder’s phenotype (σ = 0.1). The 500 members of each evolved population are then subjected to a knockout of a given gene, and the average fitness effect of the knockout across the population is measured. As shown in Fig. 5, there is a weak negative correlation in these simulations between the average fitness of a knockout and the number of direct targets of the knocked-out gene. This result is consistent with the experimental observations, and implies that while a relationship does exist between a gene’s connectivity and its essentiality, this relationship is exceedingly weak and holds little predictive power on its own. 
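
A knockout measurement of this kind can be sketched with three small functions covering the sigmoid dynamics, the fitness function and the knockout itself. The sketch rests on assumptions that the text does not spell out: the knockout is modeled by clamping the gene's expression to zero at every iteration, equilibrium is declared when no expression level changes by more than a hypothetical tolerance `tol` within `max_iter` iterations, and a genotype that fails to reach equilibrium is given fitness zero.

```python
import math


def sigmoid(x, a):
    """f(x) = 1 / (1 + exp(-a x)), written to avoid overflow for large |a x|."""
    z = a * x
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def equilibrium(w, s0, a, knocked_out=None, max_iter=100, tol=1e-4):
    """Iterate S(t+1) = f(W S(t)) from s0.

    Returns the equilibrium vector, or None if no steady state is reached
    (a lethal genotype). If knocked_out is a gene index, that gene's
    expression is clamped to 0 throughout (an assumed knockout model).
    """
    n = len(w)
    s = list(s0)
    if knocked_out is not None:
        s[knocked_out] = 0.0
    for _ in range(max_iter):
        nxt = [sigmoid(sum(w[i][j] * s[j] for j in range(n)), a)
               for i in range(n)]
        if knocked_out is not None:
            nxt[knocked_out] = 0.0
        if max(abs(p - q) for p, q in zip(nxt, s)) < tol:
            return nxt
        s = nxt
    return None


def fitness(s_hat, s_opt, sigma):
    """F = exp(-D / sigma), with D the mean squared distance to the optimum."""
    if s_hat is None:
        return 0.0  # no equilibrium: lethal
    d = sum((p - q) ** 2 for p, q in zip(s_hat, s_opt)) / len(s_opt)
    return math.exp(-d / sigma)
```

For a population, the quantity plotted in Fig. 5 would then be the mean of `fitness(equilibrium(w, s0, a, knocked_out=g), s_opt, sigma)` over all individuals `w`, computed for each gene `g` in turn.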
Fig. 5 Relationship between fitness effect of a gene knockout and number of outgoing connections of the gene. One hundred simulated networks were evolved for 1000 generations each, after which each member of each final population was subjected to a knockout of each gene in turn. For each gene knockout, the average fitness across the population (vertical axis: mean fitness upon gene knockout) is plotted against the number of genes regulated by the factor that the knocked-out gene normally produces (horizontal axis: number of outgoing connections, log-transformed). Data are divided into three classes based on the equilibrium expression level of the knocked-out gene: ŝ > 0.75 (circles), 0.75 ≥ ŝ ≥ 0.25 (diamonds), and ŝ < 0.25 (squares). For the first five values on the horizontal axis, the data points for the three expression classes are offset slightly so that they do not obscure one another. There is a weak correlation when all data are considered together, but a stronger correlation for just the first (circle) class [r = –0.769, r² = 0.59, P < 10⁻³⁰ on arcsin-transformed data (Sokal and Rohlf 1995)]. For these simulations, N = 25, a = 6, σ = 0.1 and γ = 2.
Note, however, that the higher connectivity genes do not display a simple relationship with average fitness, and indeed that a clear pattern emerges when the data are broken down into three classes based on the genes’ having low (0–0.25), intermediate (0.25–0.75) or high (0.75–1) expression at equilibrium. Among the high equilibrium expression genes especially, there is a pronounced negative correlation between connectivity and fitness (r = –0.769, r² = 0.59, P < 10⁻³⁰). Thus, in these simulations combining the connectivity measure (node degree) with a dynamic property (equilibrium gene-expression level) improves prediction considerably. Similar results are seen in simulations with a = 100 and γ = 2.7 (see Table 1). We also ran simulations with random (Poisson-distributed) network topology but the same average number of connections as in the γ = 2 scale-free networks, and found a similar correlation between degree and fitness, especially among the high equilibrium-expression genes (r = –0.589, r² = 0.35, P < 10⁻³⁰), suggesting that the importance of relative node
degree does not strictly depend on the network’s having a scale-free topology.
Given a connection between topology and the fitness effect of a gene knockout, one might also expect a connection between topology and polymorphism. That is, one might expect more essential genes to have lower polymorphism, and therefore a negative correlation between polymorphism and number of direct targets. In our simulations, however, this correlation is nearly non-existent. We measure polymorphism first for each cis-regulatory element (that is, for each element of the W matrix) by taking the variance of allelic values across the population. Then an average polymorphism is calculated for each gene, by averaging these variances across each row of W. The relationship between average polymorphism and number of direct targets is shown in Fig. 6 (r = –0.044, r² = 0.002, P = 0.020). This correlation is not improved, and indeed is non-existent, when only genes with high expression at equilibrium are considered (r = –0.006, r² = 4 × 10⁻⁵, P = 0.851). An alternative measure, the virtual heterozygosity of haplotypes at each locus (Hartl and Clark 1989), does not give qualitatively different results (data not shown).
One possible reason for the lack of correlation between polymorphism and connectivity in these simulations is that mutations are independently drawn from the standard Normal distribution. A mutation model with a small number of allele classes, or with correlations between the values of old and new alleles at a locus, might constrain polymorphism in such a way as to reveal a link with connectivity. This will be an interesting possibility to explore. Another possibility is that because these simulated networks buffer genetic variation to a large extent (Siegal and Bergman 2002; Bergman and Siegal 2003), the assumption that more essential genes would have less variation might not be valid. 
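
The polymorphism measure just described can be sketched as follows, assuming a population is held as a list of W matrices and the fixed topology as a hypothetical Boolean matrix `topology` marking which wij are allowed to be non-zero. The population-variance estimator is an assumption; the text says only that the variance of allelic values is taken.

```python
from statistics import pvariance


def element_polymorphism(population):
    """Variance of each w_ij across the individuals of one population.

    population: list of W matrices (lists of rows), all with the same shape.
    """
    n = len(population[0])
    return [[pvariance([w[i][j] for w in population]) for j in range(n)]
            for i in range(n)]


def gene_polymorphism(population, topology):
    """Average the element variances across each row of W.

    Only connected elements (topology[i][j] is True) are averaged; a gene
    with no regulatory inputs gets None.
    """
    var = element_polymorphism(population)
    result = []
    for i, row in enumerate(topology):
        cols = [j for j, connected in enumerate(row) if connected]
        result.append(sum(var[i][j] for j in cols) / len(cols) if cols else None)
    return result
```

Correlating the output of `gene_polymorphism` with each gene's number of outgoing connections (the non-zero entries in the corresponding column of `topology`) gives the kind of comparison shown in Fig. 6.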
Table 1 Correlation statistics for network simulations with different sigmoidal parameters, a, and with topologies drawn from different power-law distributions, defined by exponent γ

                                                      a = 100                a = 6
Comparison(a)                              Statistic  γ = 2      γ = 2.7     γ = 2      γ = 2.7    c = 0.1(b)
Fitness versus k (high expression)         r          –0.766     –0.632      –0.769     –0.678     –0.589
                                           P          < 10⁻³⁰    < 10⁻³⁰     < 10⁻³⁰    < 10⁻³⁰    < 10⁻³⁰
Polymorphism versus k (all genes)          r          –0.067     –0.0002     –0.044     –0.054     –0.058
                                           P          0.0008     0.992       0.020      0.008      0.004
Polymorphism versus k (high expression)    r          –0.099     –0.053      –0.006     –0.004     –0.016
                                           P          0.001      0.098       0.851      0.911      0.628

(a) ‘Fitness’ refers to the average fitness of individuals within the population upon knockout of a network gene. ‘High expression’ refers to the class of genes with equilibrium expression levels greater than 0.75. ‘Polymorphism’ refers to the variance in allelic values for a gene, averaged across its cis-regulatory elements
(b) Networks with c = 0.1 have Poisson-distributed node degree, with probability 0.1 that any two nodes are connected
Fig. 6 Relationship between allelic variation of a gene and number of outgoing connections of the gene. For the same 100 simulations as in Fig. 5, a measure of polymorphism for each gene (vertical axis: variance in allelic values) is plotted against the number of genes regulated by the factor that the knocked-out gene normally produces (horizontal axis: number of outgoing connections, log-transformed). Here, polymorphism is measured for each gene i as the variance of each wij element across the population, averaged over all j for which wij is nonzero. These values are Box-Cox transformed (Sokal and Rohlf 1995) so that their distribution better approximates normality. The major principal axis is shown. There is a weak, but marginally significant, negative correlation (r = –0.044, r² = 0.002, P = 0.020).
Indeed, as described in the next section, our simulation results promote a more complicated view of
the expected effects of purifying selection on levels of within- and between-population variation in regulatory genes.
Re-playing the tape: repeated evolution of identical topologies
The above results suggest that simple measures of network structure may hold little predictive power on their own, but may nonetheless be important factors in combination with other descriptors, such as dynamical gene-expression data. As a first attempt at exploring the prospects for strong inference of functional and evolutionary properties based on more sophisticated, detailed measures of network topology and dynamics, we use the same simulation framework, but instead of analyzing 100 independent realizations of scale-free network topology, we select a small number of topologies and replicate the evolutionary simulations 100 times. That is, we ask whether the evolution of the identical topology leads to predictable outcomes, and then whether certain topological or dynamical features emerge as important functional predictors.
We begin, as before, by exploring the average fitness effect of a gene knockout. By doing 100 simulations of
the same topology, allowing only the magnitudes and signs of connections to change, we can analyze the variance in fitness for a given gene knockout across evolutionary runs. As shown in Fig. 7 for one such topology, the average fitness effect of knocking out a given gene typically varies a considerable amount from run to run. This suggests that contingencies of the stochastic processes of mutation, mating and selection contribute greatly to the functional properties associated with a given gene. However, for some genes in some network topologies, the variance in fitness effect is rather small, suggesting that some aspect of the network topology constrains the functional properties of certain genes. The dynamics of gene expression also appear to be important, in that, as might be expected, the genes with the least fitness effect upon knockout tend to be those with low expression at equilibrium, whereas the genes with the greatest fitness effect upon knockout tend to be those with high expression at equilibrium (Figs. 7 and 8). For the latter class of genes, the fitness effect of the knockout tends to be larger than what would be expected by the loss of only the knocked out gene. That is, these knockouts disrupt the equilibrium expression levels of other genes in the network. Moreover, for some topologies, such as the one in Fig. 8, the magnitude of disruption in equilibrium expression levels appears to depend on node degree. That is, some topologies, upon repeated evolution, uphold a rather strong negative correlation between fitness and direct targets, especially when the equilibrium expression levels of the genes are taken into account. This again suggests that the weak correlations seen in more global analyses may reflect a composite of some sub-networks that obey a strong correlation and some sub-networks that have no correlation at all. 
Fig. 7 Repeated evolution of the same network topology. (a) The gene network that founded each of 100 independently evolving populations is depicted (N = 25, a = 6, σ = 0.1 and γ = 2). Positive regulation is indicated by a solid arrow, negative regulation is indicated by a dashed arrow, and the magnitude of each interaction is indicated by the shading of the arrow (stronger connections are more darkly shaded). The equilibrium expression state of each gene is indicated by shading of the circle marking its node (black = 0, white = 1). The graph was created with Pajek software (Batagelj and Mrvar 1998), then adjusted manually. (b) The relationship between fitness effect of a gene knockout and number of outgoing connections of the gene is shown, as in Fig. 5, except each gene is shown separately with its number of outgoing connections indicated below on the horizontal axis and with genes ordered left to right by number of outgoing connections. The equilibrium expression state of each gene is also indicated below on the horizontal axis, by shading as with nodes in (a). (c) The relationship between allelic variation of a gene and number of outgoing connections of the gene is shown, as in Fig. 6, except with horizontal axis as in (b).
It remains to be investigated what aspects of topology will be able to predict which subnetworks will obey the correlation and which will not. It is possible that local clustering of nodes matters, or that hierarchical structure of the network does, or that small motifs within the network do. It is also possible that descriptors of a network’s dynamic properties will be more informative than descriptors of its topology. The conclusions are similar for polymorphism in the networks that were evolved 100 times. In general, the same gene may have a wide range of polymorphism levels across evolutionary runs, as seen in Figs. 7c and 8c. However, some genes clearly harbor consistently low levels of polymorphism, suggesting the particular network topology constrains the variation in these genes. An even better view of polymorphism in these simulations is provided by considering not the average polymorphism across cis-regulatory inputs (rows of W)
but the allelic variation at each cis-regulatory element. As seen in Fig. 9, some elements have consistently low allelic variation within populations. Thus, it might be fruitful to examine such elements in greater detail, and in a broader population-genetic context, especially since the correlation analysis performed above on a collection of independent networks revealed virtually no predictive value of node degree on polymorphism, even when equilibrium gene-expression level was considered. Specifically, it is often illuminating to
compare allelic variation within populations to divergence between them. These simulations do indeed allow us to measure divergence between isolated populations, in that the runs of the same network topology can be thought of as populations splitting off from a common starting point. It is therefore possible to compare polymorphism within populations with divergence between them, and to perform these comparisons in the context of network topology and dynamics.
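
The comparison of within-population polymorphism with between-population divergence can be sketched for a single regulatory element as follows. The data layout (a list of independently evolved populations, each a list of W matrices) and the function name are hypothetical, and summarizing divergence as the variance of population means across runs is an assumption of this sketch; the figures discussed below instead display the per-run means directly.

```python
from statistics import mean, pvariance


def polymorphism_and_divergence(runs, i, j):
    """Summarize element w_ij across replicate populations.

    runs: list of populations evolved independently from the same founder;
          each population is a list of W matrices (lists of rows).
    Returns (polymorphism, divergence), where polymorphism is the mean
    within-population variance of w_ij and divergence is the variance of
    the population-mean value of w_ij across runs.
    """
    within = []
    pop_means = []
    for population in runs:
        alleles = [w[i][j] for w in population]
        within.append(pvariance(alleles))
        pop_means.append(mean(alleles))
    return mean(within), pvariance(pop_means)
```

An element with low polymorphism and low divergence is constrained in every run; one with low polymorphism but high divergence is the case singled out in the discussion of Fig. 10d.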
Fig. 8 Repeated evolution of the same network topology. (a) As in Fig. 7, a gene network is shown that founded each of 100 independently evolving populations (same parameter values as in Fig. 7). (b) The relationship between fitness effect of a gene knockout and number of outgoing connections of the gene is shown, as in Fig. 7. (c) The relationship between allelic variation of a gene and number of outgoing connections of the gene is shown, as in Fig. 7.
Figure 10a shows, for one set of 100 runs of the same network topology, the relationship between withinpopulation allelic variation and the population mean allelic value for each element of the W matrix. The interesting features of these data emerge when the mean and variance values for a given W element in its 100 independent evolutionary runs are highlighted in the graph. We then see the relationship between polymorphism and divergence. As shown in Fig. 10b, some regulatory elements appear rather unconstrained, in that polymorphism tends to be high and the range of mean allelic values across populations also tends to be large. As might be expected, such unconstrained elements are more likely to be inputs from genes with
lower expression at equilibrium (correlation between equilibrium expression level of input and average polymorphism for network topology in Fig. 7: r = –0.738, P = 3 × 10⁻⁹, and for network topology in Fig. 8: r = –0.561, P = 4 × 10⁻⁷). In contrast, some regulatory elements appear quite constrained: polymorphism tends to be low and the mean allelic value for each population falls within a tight window (Fig. 10c).
Fig. 9 Allelic variation of individual cis-regulatory elements. (a) A measure of polymorphism is plotted for each non-zero wij element, for the 100 independent evolutionary runs of the network in Fig. 7. Here, polymorphism is measured as the variance of each wij element (and Box-Cox transformed). The wij elements are flattened into one dimension such that they are arranged by row from left to right. That is, the first 25 vertical slices correspond to cis-regulatory inputs to gene 1 from factors 1–25, and are therefore placed in a bin labeled ‘1’ on the horizontal axis. Not all vertical slices are occupied because some factors do not regulate some genes (i.e., wij = 0). (b) The same plot as in (a) is shown, for the 100 independent evolutionary runs of the network in Fig. 8. (Vertical axes: variance in allelic values; horizontal axes: gene, 1 to 25.)
Polymorphism and divergence are not always linked in these simulated populations. Figure 10d shows an example of a regulatory element with low polymorphism within each population, yet high divergence between populations. This type of disconnect between polymorphism and divergence is usually ascribed to deviation from a standard neutral model of molecular evolution; specifically, low polymorphism and high divergence are typically taken as evidence of adaptive evolution (Li 1997). This presents a paradox, in that all 100 populations in these simulations are subject to the same, constant selection, as the optimum phenotype is fixed across populations and across time. However,
constant selection for a particular phenotype does not imply that the selection coefficient associated with a given allele at a given locus is independent of the rest of the genotype, nor for that matter independent of the genetic composition of the entire population. That is, there are likely to be rather intricate dependencies among natural selection, genetic drift and the epistatic interactions of alleles segregating at loci in the
network. From this perspective, genetic drift might even be said to play a creative role, fixing mildly deleterious alleles that are subsequently compensated by changes elsewhere in the network (Zuckerkandl 1997; Hartl and Taubes 1996). Thus, a resolution to our original paradox presents itself in what has been termed ‘developmental system drift’ (True and Haag 2001). A complex developmental program with a sufficient amount of redundancy or robustness may conserve its function through evolutionary time while elements of the program drift in and out; the developmental mechanism may thus appear different in divergent lineages, but its ultimate function is never altered. Such a process might be expected to lead to dependencies between elements of the program, such that mutations in one element are only tolerable in combination with certain alleles at other loci (Phillips et al. 2000; Frank 1999). This intuition is supported by Johnson and Porter (2000) and Porter and Johnson (2002), who modeled the regulatory interactions between transcription factors and their DNA binding sites in simple developmental pathways, and saw hybrid incompatibility arise between independently evolving lineages. Johnson and Porter (this issue) also observed developmental system drift in their simulations. We thus investigated whether allelic correlations existed between elements that each displayed low polymorphism and high divergence in our simulations. This is exactly what we found. An example is shown in Fig. 10e. Here, two regulatory elements in the same network display low polymorphism and high divergence. As shown in Fig. 10f, the mean allelic values for the two elements within a population are correlated. Interestingly, these two correlated elements are regulatory inputs to the same gene, consistent with the intuition that components of a developmental program with most immediately related function will be those most likely to co-evolve. 
In addition, the gene controlled by these elements has an optimum equilibrium expression level that is neither fully on nor fully off, but
is intermediate (Fig. 10g). Inputs to a gene whose expression must be precisely controlled might be expected to be more likely to show such correlated evolution. Overall, we found many more allelic correlations than expected between regulatory elements controlling the same gene (for the network of Fig. 7, only 31 of 1081 possible pairs of non-zero elements are inputs to the same gene, whereas 8 of 14 pairs of elements that are significantly correlated at P < 0.01 are inputs to the same gene [P = 1.8 × 10⁻⁷, G-test with Williams’s correction]; for the network of Fig. 8, only 87 of 2415 possible pairs of non-zero elements are inputs to the same gene, whereas 21 of 46 pairs of elements that are significantly correlated at P < 0.01 are inputs to the same gene [P = 1.1 × 10⁻¹⁶, G-test with Williams’s correction]). Some interesting exceptions existed, such as the case shown in Fig. 11, in which three regulatory elements are correlated and these three elements form a closed loop within the network. This finding suggests that a profitable future direction might be to focus more explicitly on network motif structure (Shen-Orr et al. 2002) as it relates to population-genetic properties of interest.
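The enrichment test reported above can be sketched in a few lines. The sketch below is illustrative only. It assumes the counts are arranged as a 2 × 2 table of independence (significantly correlated versus not, inputs to the same gene versus not) and that the significantly correlated pairs are a subset of all possible pairs. The text does not spell out the exact table, so the value obtained need not match the reported P exactly. The function names are hypothetical. The Williams correction factor is the standard one for a 2 × 2 table found in biometry texts such as Sokal and Rohlf (1995), and the P value uses the chi-square distribution with one degree of freedom.

```python
import math


def g_test_2x2_williams(a, b, c, d):
    """G-test of independence for the 2x2 table [[a, b], [c, d]].

    Returns (G_adj, p), where G_adj is G divided by Williams's correction
    factor q and p is the upper-tail chi-square probability with 1 df.
    """
    n = a + b + c + d
    rows = (a + b, c + d)
    cols = (a + c, b + d)
    observed = ((a, b), (c, d))

    # G = 2 * sum(O * ln(O / E)); empty cells contribute zero.
    g = 0.0
    for i in range(2):
        for j in range(2):
            o = observed[i][j]
            if o > 0:
                expected = rows[i] * cols[j] / n
                g += o * math.log(o / expected)
    g *= 2.0

    # Williams's correction factor for a 2x2 table.
    q = 1.0 + (
        (n / rows[0] + n / rows[1] - 1.0) * (n / cols[0] + n / cols[1] - 1.0)
    ) / (6.0 * n)
    g_adj = g / q

    # Chi-square survival function with 1 df: P(X > x) = erfc(sqrt(x / 2)).
    p = math.erfc(math.sqrt(g_adj / 2.0))
    return g_adj, p


def same_gene_enrichment(total_pairs, same_gene_pairs,
                         correlated_pairs, correlated_same_gene):
    """Build the assumed 2x2 table from the four counts and run the test."""
    a = correlated_same_gene                       # correlated, same gene
    b = correlated_pairs - correlated_same_gene    # correlated, different genes
    c = same_gene_pairs - correlated_same_gene     # uncorrelated, same gene
    d = total_pairs - correlated_pairs - c         # uncorrelated, different genes
    return g_test_2x2_williams(a, b, c, d)


if __name__ == "__main__":
    # Counts quoted in the text for the networks of Fig. 7 and Fig. 8.
    print(same_gene_enrichment(1081, 31, 14, 8))
    print(same_gene_enrichment(2415, 87, 46, 21))
```

Under this assumed table, both networks give a very small P, in line with the conclusion that correlated pairs are strongly enriched for inputs to the same gene.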
Fig. 10 Relationship between polymorphism and divergence in repeated evolutionary simulations. (a) For each of the 100 independent evolutionary runs of the network in Fig. 8, polymorphism of each wij element (as in Fig. 9a) is plotted against the mean value of the wij element, such that variation within populations is represented on the vertical axis while divergence between populations is represented along the horizontal axis. (b) The same plot as in (a) is shown, with points in gray except those corresponding to a particular wij element, which are superimposed in black. The vertical line shows the value of this wij element in the founder individual. (c) The same plot as in (b) is shown, for a different highlighted wij element. (d) The same plot as in (b) and (c) is shown, for a different highlighted wij element. (e) The same plot as in (d) is shown, with an additional wij element highlighted in white. The dashed vertical line shows the value of this element in the founder individual. (f) The mean wij values for the elements highlighted in (e) are correlated (r = –0.572, r² = 0.33, P = 5 × 10⁻¹⁰). The major principal axis is shown. (g) The gene regulated by the elements in (e) and (f) has an optimum expression state that is intermediate between 0 and 1, as shown in this plot of the expression state of the regulated gene through iterations of the network dynamics, for the founder individual
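The two quantities summarized in panel (f), the correlation coefficient and the major principal axis, can be computed from paired population means as in the following sketch. It is an illustration under stated assumptions, not the authors' code: the function name is hypothetical, the input values are made up, and the major axis is taken to be the first principal axis of the sample covariance matrix. For reference, r = –0.572 corresponds to r² ≈ 0.33, as quoted in the caption.

```python
import math


def pearson_and_major_axis(xs, ys):
    """Return (r, slope, intercept) for paired observations.

    r is the Pearson correlation coefficient. The slope and intercept
    describe the major principal axis, i.e. the line through the centroid
    along the leading eigenvector of the 2x2 covariance matrix.
    """
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two equal-length sequences with at least 2 points")
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n

    # Sums of squares and cross-products about the means.
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    if sxx == 0 or syy == 0 or sxy == 0:
        raise ValueError("correlation or major-axis slope is undefined")

    r = sxy / math.sqrt(sxx * syy)

    # Slope of the leading eigenvector of [[sxx, sxy], [sxy, syy]].
    slope = (syy - sxx + math.sqrt((syy - sxx) ** 2 + 4.0 * sxy ** 2)) / (2.0 * sxy)
    intercept = mean_y - slope * mean_x
    return r, slope, intercept


if __name__ == "__main__":
    # Hypothetical mean allelic values of two regulatory elements across runs.
    element_a = [0.10, 0.35, 0.20, 0.55, 0.40]
    element_b = [0.80, 0.45, 0.70, 0.30, 0.50]
    r, slope, intercept = pearson_and_major_axis(element_a, element_b)
    print(r, r ** 2, slope, intercept)
```

Unlike an ordinary least-squares regression, the major axis treats the two elements symmetrically, which suits a case where neither variable is the predictor.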
Fig. 11 Correlation of allele values for genes that form a closed loop within a network. See Fig. 7 for the closed regulatory loop between genes 13, 15 and 19. (a–c) The values of w13,15, w15,19 and w19,13 are correlated with one another in two of the three pairwise comparisons (correlation in (a): r = –0.316, P = 0.001; correlation in (b): r = 0.313, P = 0.002; correlation in (c): r = –0.050, P = 0.623). The major principal axis is shown
Conclusions
High-throughput technologies have revolutionized molecular biology, facilitating the elucidation of complex networks of interacting cellular components. Attempts to understand this complexity have been guided by important new conceptual advances in network theory (Proulx et al. 2005). In this article, we have endeavored to highlight these advances, while at the same time cautioning against over-generalization. In particular, we have focused on the various claims (some stronger than others) that biological networks follow power-law distributions of connectivity (i.e., are scale-free), and that this type of architecture imbues them with key properties, such as the disproportionate functional importance of hubs (Barabási and Oltvai
2004). Our simulations suggest that the degree distribution itself (i.e., power-law versus Poisson) does not exert a major influence on functional properties associated with nodes, such as fitness upon knockout or polymorphism, consistent with our previous simulation results (Bergman and Siegal 2003). A similar conclusion can be drawn from the work of Albert et al. (2000a, b), who demonstrated little difference between scale-free and random undirected networks in their tolerance to node removal, as measured by the distances between remaining nodes. Thus, in quite different settings, degree distribution alone appears insufficient to explain functional properties of interest. Further, both our empirical data analysis and our computer simulations of gene-regulatory networks suggest that simple measures of connectivity, such as a node’s degree, carry little explanatory power with respect to functional properties of that node. This must be due in part to the critical differences between actual biological networks and idealized theoretical networks, as well as to the contingencies of the evolutionary process. That is, details matter, both when it comes to the underlying operational logic that is masked by superficial graph representations of connectivity, and when it comes to the history of living systems. We have focused most on gene-regulatory networks, because they are central to cellular function and multicellular development, and because they present the greatest challenges in terms of biologists’ ability to elucidate and to understand them. Perhaps the supreme challenge with respect to network theory is to incorporate the complicated cis-regulatory logic that characterizes the nodes in gene-regulatory networks, especially because uncovering this logic (in all its quantitative detail) is a difficult and largely unfinished experimental task. Thus, we lack the critical link between network topology and network dynamics. 
Nevertheless, we can begin to explore topology and dynamics in more sophisticated ways. For example, because certain network motifs, such as feedback and feedforward loops, possess circumscribed dynamical properties, their analysis can be productive even in the absence of extensive quantitative data (Shen-Orr et al. 2002; Mangan and Alon 2003). Similarly, topological measures such as clustering coefficients (Barabási and Oltvai 2004) and betweenness centrality (Hahn and Kern 2005), which characterize nodes in relation to one another, are expected to be more informative than node degree alone. We can also approach the problem from the opposite direction, using gene-network simulations with arbitrary yet well-defined cis-regulatory logic to explore possible connections between topology, dynamics and ultimate function, as we have done
here. The simulation approach can obviously be extended in a number of directions. More detail-rich models can be used, incorporating more of what we do know about actual regulatory networks. It would also be valuable to explore models of environmental plasticity (i.e., models in which distinct sets of inputs map to distinct outputs), both because such models might have a more natural mapping between gene-expression phenotype and fitness, and because this would allow more direct comparison with the extensive yeast data on environment-dependent gene expression. Exploring models that consider a variety of selection regimes (directional selection, fluctuating selection, etc.) and demographic scenarios (population subdivision, changing population sizes, etc.) might also generate novel and testable predictions concerning the origin and significance of particular network configurations. Microevolutionary perspectives will play a large role in making sense of accumulating network data. As we have noted, it is tempting to think of complex organisms as globally optimized systems, but this would be a mistake. Evolving systems might have counter-intuitive properties that cannot be understood based on engineering principles alone. Furthermore, there is an ever-increasing body of comparative genomic and gene-expression data that can be brought to bear on issues of network structure and function. Important work has already been done in this regard, including that reviewed and critiqued by Hahn et al. (2004). Another notable recent contribution is the work of Wuchty et al. (2003), who demonstrated greater evolutionary conservation of proteins participating in particular motifs in the yeast protein-interaction network. The rich resource of comparative data should continue to motivate network theory, and also provide a means of validating it. Network theory also holds great promise for enriching microevolutionary theory. 
We have seen one example of this in our simulations, with the coevolution of regulatory elements leading to low polymorphism and high divergence, the usual mark of adaptive evolution. Such a coevolutionary process might occur not only for cis-regulatory elements of the same gene, but also for trans-acting factors converging on the same promoter or for any proteins cooperating in some other cellular process (we do not model such protein-coding mutations in our simulations). This possibility deserves further attention, as it might force reinterpretation of data showing low polymorphism relative to divergence in protein-coding regions. That is, the molecular footprint of developmental system drift (generally thought of as a non-adaptive process) may be difficult to distinguish from the footprint of adaptive evolution. Similar caution
was urged by Hartl and Taubes (1996), who made the case that fixations of nearly neutral deleterious alleles are common enough to spark frequent episodes of selection of compensatory mutations to restore an already adapted state. A focus on interacting regulatory genes might be the most fruitful approach for seeking experimental validation of this notion. More broadly, network modeling will necessarily contribute to the most conspicuously deficient aspect of population-genetic theory, the mapping of genotype to phenotype. This traditional black box of population genetics will not become transparent anytime soon, despite the immensely powerful technologies now at our disposal. Nevertheless, each day our understanding of the regulatory logic of complex organisms grows. With that understanding comes the potential for truly integrative understanding of life and its diversity.
Acknowledgements The authors thank Norman Johnson and Matthew Hahn for helpful comments on the manuscript. During part of this work, M.L.S. was supported by an NIH NRSA postdoctoral fellowship, and thanks Bruce Baker for his support. A.B. thanks Eric and Illeana Benhamou and The Benhamou Global Ventures, and The Rockefeller Foundation Creativity & Culture Program, for their generous and continual support. D.E.L.P.’s contribution to this work was supported by the National Science Foundation under Grant No. 0214022, and by the Ellison Medical Foundation.
References
Albert R, Jeong H, Barabási A-L (2000a) Error and attack tolerance of complex networks. Nature 406:378–382
Albert R, Jeong H, Barabási A-L (2000b) Error and attack tolerance of complex networks (Correction). Nature 409:542
Almaas E, Kovács B, Vicsek T, Oltvai ZN, Barabási A-L (2004) Global organization of metabolic fluxes in the bacterium Escherichia coli. Nature 427:839–843
Alon U (2003) Biological networks: the tinkerer as an engineer. Science 301:1866–1867
Arita M (2004) The metabolic world of Escherichia coli is not small. Proc Natl Acad Sci U S A 101:1543–1547
Barabási A-L, Albert R (1999) Emergence of scaling in random networks. Science 286:509–512
Barabási A-L, Oltvai ZN (2004) Network biology: understanding the cell’s functional organization. Nat Rev Genet 5:101–113
Batagelj V, Mrvar A (1998) Pajek - program for large network analysis. Connections 21:47–57
Bergman A, Siegal ML (2003) Evolutionary capacitance as a general feature of complex gene networks. Nature 424:549–552
Bray D (2003) Molecular networks: the top-down view. Science 301:1864–1865
Davidson EH (2001) Genomic regulatory systems: development and evolution. Academic Press, San Diego, CA
Davidson EH, Rast JP, Oliveri P, Ransick A, Calestani C, Yuh C-H, Minokawa T, Amore G, Hinman V, Arenas-Mena C, Otim O, Brown CT, Livi CB, Lee PY, Revilla R, Rust AG, Pan Zj, Schilstra MJ, Clarke PJC, Arnone MI, Rowen L, Cameron RA, McClay DR, Hood L, Bolouri H (2002) A genomic regulatory network for development. Science 295:1669–1678
Erdös P, Rényi A (1960) On the evolution of random graphs. Publ Math Inst Hung Acad Sci 5:17–61
Featherstone DE, Broadie K (2002) Wrestling with pleiotropy: genomic and topological analysis of the yeast gene expression network. Bioessays 24:267–274
Frank SA (1999) Population and quantitative genetics of regulatory networks. J Theor Biol 197:281–294
Gasch AP, Spellman PT, Kao CM, Carmel-Harel O, Eisen MB, Storz G, Botstein D, Brown PO (2000) Genomic expression programs in the response of yeast cells to environmental changes. Mol Biol Cell 11:4241–4257
Gavin A-C, Bösche M, Krause R, Grandi P, Marzioch M, Bauer A, Schultz J, Rick JM, Michon A-M, Cruciat C-M, Remor M, Höfert C, Schelder M, Brajenovic M, Ruffner H, Merino A, Klein K, Hudak M, Dickson D, Rudi T, Gnau V, Bauch A, Bastuck S, Huhse B, Leutwein C, Heurtier M-A, Copley RR, Edelmann A, Querfurth E, Rybin V, Drewes G, Raida M, Bouwmeester T, Bork P, Seraphin B, Kuster B, Neubauer G, Superti-Furga G (2002) Functional organization of the yeast proteome by systematic analysis of protein complexes. Nature 415:141–147
Gertner J (2003) Social networks. The New York Times 12/14/03, section 6, p. 92
Giot L, Bader JS, Brouwer C, Chaudhuri A, Kuang B, Li Y, Hao YL, Ooi CE, Godwin B, Vitols E, Vijayadamodar G, Pochart P, Machineni H, Welsh M, Kong Y, Zerhusen B, Malcolm R, Varrone Z, Collis A, Minto M, Burgess S, McDaniel L, Stimpson E, Spriggs F, Williams J, Neurath K, Ioime N, Agee M, Voss E, Furtak K, Renzulli R, Aanensen N, Carrolla S, Bickelhaupt E, Lazovatsky Y, DaSilva A, Zhong J, Stanyon CA, Finley RL Jr, White KP, Braverman M, Jarvie T, Gold S, Leach M, Knight J, Shimkets RA, McKenna MP, Chant J, Rothberg JM (2003) A protein interaction map of Drosophila melanogaster. Science 302:1727–1736
Guelzim N, Bottani S, Bourgine P, Képès F (2002) Topological and causal structure of the yeast transcriptional regulatory network. Nat Genet 31:60–63
Guet CC, Elowitz MB, Hsing W, Leibler S (2002) Combinatorial synthesis of genetic networks. Science 296:1466–1470
Hahn MW, Conant GC, Wagner A (2004) Molecular evolution in large genetic networks: does connectivity equal constraint? J Mol Evol 58:203–211
Hahn MW, Kern AD (2005) Comparative genomics of centrality and essentiality in three eukaryotic protein-interaction networks. Mol Biol Evol 22:803–806
Hartl DL, Clark AG (1989) Principles of population genetics. Sinauer Associates, Inc., Sunderland, MA
Hartl DL, Taubes CH (1996) Compensatory nearly neutral mutations: selection without adaptation. J Theor Biol 182:303–309
Ho Y, Gruhler A, Heilbut A, Bader GD, Moore L, Adams S-L, Millar A, Taylor P, Bennett K, Boutilier K, Yang L, Wolting C, Donaldson I, Schandorff S, Shewnarane J, Vo M, Taggart J, Goudreault M, Muskat B, Alfarano C, Dewar D, Lin Z, Michalickova K, Willems AR, Sassi H, Nielsen PA, Rasmussen KJ, Andersen JR, Johansen LE, Hansen LH, Jespersen H, Podtelejnikov A, Nielsen E, Crawford J, Poulsen V, Sørensen BD, Matthiesen J, Hendrickson RC, Gleeson F, Pawson T, Moran MF, Durocher D, Mann M, Hogue CWV, Figeys D, Tyers M (2002) Systematic identification of protein complexes in Saccharomyces cerevisiae by mass spectrometry. Nature 415:180–183
Ideker T, Galitski T, Hood L (2001) A new approach to decoding life: systems biology. Annu Rev Genomics Hum Genet 2:343–372
Jeong H, Mason SP, Barabási A-L, Oltvai ZN (2001) Lethality and centrality in protein networks. Nature 411:41–42
Jeong H, Tombor B, Albert R, Oltvai ZN, Barabási A-L (2000) The large-scale organization of metabolic networks. Nature 407:651–654
Johnson NA, Porter AH (2000) Rapid speciation via parallel, directional selection on regulatory genetic pathways. J Theor Biol 205:527–542
Johnson NA, Porter AH (2005) Evolution of branched regulatory genetic pathways: directional selection on pleiotropic loci accelerates developmental system drift. Genetica, this issue
Kim PM, Tidor B (2003) Limitations of quantitative gene regulation models: a case study. Genome Res 13:2391–2395
Kitano H (2002) Systems biology: a brief overview. Science 295:1662–1664
Levesque MP, Benfey PN (2004) Systems biology. Curr Biol 14:R179–R180
Li S, Armstrong CM, Bertin N, Ge H, Milstein S, Boxem M, Vidalain P-O, J-Han DJ, Chesneau A, Hao T, Goldberg DS, Li N, Martinez M, Rual J-F, Lamesch P, Xu L, Tewari M, Wong SL, Zhang LV, Berriz GF, Jacotot L, Vaglio P, Reboul J, Hirozane-Kishikawa T, Li Q, Gabel HW, Elewa A, Baumgartner B, Rose DJ, Yu H, Bosak S, Sequerra R, Fraser A, Mango SE, Saxton WM, Strome S, van den Heuvel S, Piano F, Vandenhaute J, Sardet C, Gerstein M, Doucette-Stamm L, Gunsalus KC, Harper JW, Cusick ME, Roth FP, Hill DE, Vidal M (2004) A map of the interactome network of the metazoan C elegans. Science 303:540–543
Li W-H (1997) Molecular evolution. Sinauer Associates, Inc., Sunderland, MA
Mangan S, Alon U (2003) Structure and function of the feedforward loop network motif. Proc Natl Acad Sci U S A 100:11980–11985
Phillips PC, Otto SP, Whitlock MC (2000) Beyond the average: the evolutionary importance of gene interactions and variability of epistatic effects, in Epistasis and the evolutionary process, edited by Wolf JB, Brodie ED, III, Wade MJ. Oxford University Press, New York, NY
Porter AH, Johnson NA (2002) Speciation despite gene flow when developmental pathways evolve. Evolution 56:2103–2111
Promislow D (2005) A regulatory network analysis of phenotypic plasticity in yeast. Am Nat 165:515–523
Proulx SR, Promislow DEL, Phillips PC (2005) Network thinking in ecology and evolution. Trends Ecol Evol 20:345–353
Pržulj N, Wigle DA, Jurisica I (2004) Functional topology in a network of protein interactions. Bioinformatics 20:340–348
Shen-Orr SS, Milo R, Mangan S, Alon U (2002) Network motifs in the transcriptional regulation network of Escherichia coli. Nat Genet 31:64–68
Siegal ML, Bergman A (2002) Waddington’s canalization revisited: Developmental stability and evolution. Proc Natl Acad Sci U S A 99:10528–10532
Sokal RR, Rohlf FJ (1995) Biometry. W. H. Freeman and Company, New York, NY
Stumpf MPH, Wiuf C, May RM (2005) Subnets of scale-free networks are not scale-free: sampling properties of networks. Proc Natl Acad Sci U S A 102:4221–4224
True JR, Haag ES (2001) Developmental system drift and flexibility in evolutionary trajectories. Evol Dev 3:109–119
von Mering C, Krause R, Snel B, Cornell M, Oliver SG, Fields S, Bork P (2002) Comparative assessment of large-scale data sets of protein–protein interactions. Nature 417:399–403
Wagner A (1996) Does evolutionary plasticity evolve? Evolution 50:1008–1023
Wagner A (2003) Does selection mold molecular networks? Sci STKE 2003:341
Wagner A, Fell DA (2001) The small world inside large metabolic networks. Proc R Soc Lond B Biol Sci 268:1803–1810
Wuchty S, Oltvai ZN, Barabási A-L (2003) Evolutionary conservation of motif constituents in the yeast protein interaction network. Nat Genet 35:176–179
Yook S-H, Oltvai ZN, Barabási A-L (2004) Functional and topological characterization of protein interaction networks. Proteomics 4:928–942
Yu H, Luscombe NM, Qian J, Gerstein M (2003) Genomic analysis of gene expression relationships in transcriptional regulatory networks. Trends Genet 19:422–427
Zuckerkandl E (1997) Neutral and nonneutral mutations: the creative mix—evolution of complexity in gene interaction systems. J Mol Evol 44:S2–8