THE EVOLUTIONARY DYNAMICS OF EUKARYOTIC GENE ORDER

REVIEWS

Laurence D. Hurst*, Csaba Pál*‡ and Martin J. Lercher*

In eukaryotes, unlike in bacteria, gene order has typically been assumed to be random. However, the first statistically rigorous analyses of complete genomes, together with the availability of abundant gene-expression data, have forced a paradigm shift: in every complete eukaryotic genome that has been analysed so far, gene order is not random. It seems that genes that have similar and/or coordinated expression are often clustered. Here, we review this evidence and ask how such clusters evolve and how this relates to mechanisms that control gene expression.

TRANSGENE
Foreign DNA that is inserted experimentally into totipotent embryonic cells or into unicellular organisms.

POSITION EFFECT
In general terms, any effect of a gene’s genomic location on its expression. A phenomenon that is often observed in transgenic organisms in which transcription of an inserted transgene is affected by the proximity to heterochromatin.

*Department of Biology and Biochemistry, University of Bath, Bath BA2 7AY, UK. ‡ MTA, Theoretical Biology Research Group, Eötvös Loránd University, Pázmány Péter Sétány 1/C, Budapest H-1117, Hungary. Correspondence to L.D.H. e-mail: [email protected] doi:10.1038/nrg1319

NATURE REVIEWS | GENETICS

The organization of genes within a genome (gene order) can be considered at two levels: first, between chromosomes (for example, comparison of the distribution of genes among autosomes and sex chromosomes), and second, within chromosomes. Here, we focus on the second of these two levels: in particular we discuss the evidence for, and the causes of, non-random gene order within chromosomes.

The idea that genes in eukaryotic genomes could be distributed non-randomly, and, moreover, that genes of comparable and/or coordinated expression might cluster, is important in terms of understanding how genomes function and how they have evolved. This idea, however, also has important practical implications. For example, it might explain why an intact gene in a novel genomic location can have a pathological phenotype1–4. Moreover, understanding how and why genes cluster might also be important if we are to understand development5 and ageing. For example, in contrast to genes that are upregulated in quiescent cells (that is, those that are in reversible proliferative arrest), upregulated genes in cells that undergo replicative senescence (irreversible proliferative arrest) are clustered6.

The extent to which the location of a gene in a genome affects expression is also important when we consider genetic modification. TRANSGENE activity can depend on the chromosomal integration site (see, for example, REF. 7), and some argue that successful manipulation of genomes must wait until we understand the causes of POSITION EFFECTS1. Importantly, recent analyses indicate that genomic regions that contain the most actively expressed genes are also those of the highest gene density8, which makes it more probable that functional integration would interfere with other genes. Understanding position effects should, and does, inform the design of gene-therapy vectors, both with respect to improving their efficacy7,9 and their safety10.

With the plethora of statistically rigorous whole-genome analyses of gene organization that have recently been published, for the first time we are in a position to make a general assessment of the dynamics of gene order in eukaryotic genomes. Here, we review what is known about the non-random gene order in eukaryotes and the underlying molecular mechanisms that favour coordinated gene expression. In particular, we ask what is the evidence for non-random order (see TABLE 1), what are the probable mechanisms of coordinated regulation, is it probable that selection for coordinated expression explains the initial evolution of linkage in all cases, and are the clusters maintained by selection (and should we expect them to be)?

Evidence for non-random gene order

Anecdotal evidence for clusters. In some ways, the null hypothesis of eukaryotic gene order (random distribution along chromosomes) could be viewed as a straw man. Even before whole-genome sequences were available, numerous apparent exceptions were known. Many of these, however, involve tandem duplicates, the Hox and globin clusters being examples. What, if anything, is the worth of anecdotal observations of clusters that are not explained by tandem duplication?

VOLUME 5 | APRIL 2004 | 299


Table 1 | Whole-genome studies on gene order in relation to expression or protein function

Species | Experimental method | Controlled for duplicates | Observation | Reference

Protists
P. falciparum | Chromatography/mass spectrometry | No | Clusters of co-expressed proteins (3–6 neighbours) | 124

Fungi
S. cerevisiae | Microarray | No | Clusters of cell-cycle-dependent genes (direct neighbours)* | 18
S. cerevisiae | Microarray, MIPS | Yes | Pairs (triplets) of co-expressed neighbouring genes, independent of orientation; pairs of neighbouring genes with similar function (MIPS) | 19
S. cerevisiae | Microarray | No | Pairs of co-expressed neighbouring genes, independent of orientation | 52
S. cerevisiae | Knockout | Yes | Essential genes that are clustered in regions of low recombination, independent of co-expression | 111

Plants
Z. mays | Polymorphisms, cDNA, QTL | No | 10–30-centimorgan (cM)-long clusters that contain developmental genes* | 28
A. thaliana | Microarray, cDNA | Yes | Regions of correlated expression patterns, in part owing to functionally related genes (KEGG) | 25
A. thaliana | Microarray | Yes | Regions of increased expression rate; regions of correlated expression patterns, with few functionally related genes | 27‡
A. thaliana | Microarray | Yes | Clusters of co-expressed genes | 26

Animals: invertebrates
D. melanogaster | Microarray | Yes | Clusters of adjacent co-expressed genes encompassing 20% of genes; few clusters are enriched for functional classes (GO); no association with cytogenetic bands or matrix-attachment regions | 24
D. melanogaster | EST | Yes | Clusters of adjacent tissue-specific genes; no clusters on the X chromosome | 23
D. melanogaster | Microarray | No | Small clusters of circadian genes; evidence for regular spacing between some clusters | 49
C. elegans | Microarray | No | Operons (2–8 genes) that contain 15% of genes | 20
C. elegans | mRNA tagging/microarray | Yes | Clusters of adjacent genes that are expressed in muscle (including housekeeping genes) | 22
C. elegans | RNAi, microarray | No | Large regions that are enriched in genes of similar RNAi phenotype/co-expressed genes | 102
C. elegans | Microarray | Yes | Clusters of co-expressed neighbouring genes; long-range owing to duplicate genes, short-range owing to operons and neighbouring genes on opposing strands | 21

Animals: mammals
H. sapiens | EST | No | Clusters of muscle-expressed genes (including housekeeping genes) | 31
H. sapiens | SAGE | No | Large regions of highly expressed genes | 36
H. sapiens | EST | No | Clusters of genes that are expressed in adipose tissue* | 34
H. sapiens | SAGE, EST | Yes | Clusters of housekeeping genes; regions of highly expressed genes/genes that are expressed in one tissue as secondary effects | 29
H. sapiens | SAGE, EST | No | Housekeeping genes that are mainly located in high GC and R-bands§ | 72
H. sapiens | SAGE | No | Regions of highly expressed genes at high GC; clusters of genes with low levels of expression | 8
H. sapiens | EST | No | Clusters of tissue-specific genes | 39
H. sapiens | EST | No | Regions of increased expression in tumours | 3
H. sapiens | Microarray | No | Clusters of genes that are upregulated in senescent, or downregulated in quiescent cells | 6
M. musculus | EST | No | Clusters of extra-embryonically expressed genes* | 30
M. musculus | RNA in situ hybridization, PCR | No | Clusters of genes that are upregulated in the brain and downregulated in the heart, lung, testis and muscle | 40

Multiple phyla
S. cerevisiae, A. thaliana, C. elegans, D. melanogaster, H. sapiens | KEGG pathways | No | Clusters of genes that are involved in the same pathway; the number of clustered pathways is variable: yeast (98%) > human > worm > A. thaliana > fly (30%) | 17
S. cerevisiae, C. elegans, D. melanogaster, H. sapiens, M. musculus, R. norvegicus | Microarray | Yes | Comparable clustering of co-expressed genes in all genomes; in yeast, there are many functional overlaps (GO) between clustered genes, whereas in humans there are only a few functional overlaps | 41

*Studies in which statistics were not robust (statistics were robust in all other studies).
‡Additional details from T. Zhu (personal communication).
§Chromosome banding patterns that are produced by Giemsa staining (G-bands). The reciprocal pattern (reverse, or R-bands) can be produced with various other staining procedures.
A. thaliana, Arabidopsis thaliana; C. elegans, Caenorhabditis elegans; D. melanogaster, Drosophila melanogaster; GO, Gene Ontology; H. sapiens, Homo sapiens; KEGG, Kyoto Encyclopaedia of Genes and Genomes; MIPS, Munich Information Centre for Protein Sequences; M. musculus, Mus musculus; P. falciparum, Plasmodium falciparum; QTL, quantitative trait loci; R. norvegicus, Rattus norvegicus; RNAi, RNA interference; S. cerevisiae, Saccharomyces cerevisiae; SAGE, serial analysis of gene expression; Z. mays, Zea mays.


www.nature.com/reviews/genetics


Box 1 | Genome-wide analysis of gene clusters: statistical considerations

To find non-trivial cases of non-random gene order, it is necessary to start by formulating a test function. This measures the degree of order in the genome. For example, in assaying the clustering of essential genes in a genome, the frequency with which an essential gene has another essential gene as its immediate neighbour might be considered. After determining the value of the function for the real genome, we then ask how often this figure or higher would be observed if gene order were random. To do this, we must define a null and test for deviation from it.

Testing for deviation from the null is often best done by randomizing the location of genes in a genome, recalculating the test function for the random genome and repeating this process many times. This generates the null distribution of values of the test function. The real value can then be compared with this distribution. If there are n random simulants and r have a test score that is equal to or greater than that observed in the real data, the probability (p) of observing the degree of order that is seen in the real genome is given in equation 1.

$$p = \frac{r + 1}{n + 1} \qquad (1)$$

The rules that are used to define a ‘random’ genome define the null hypothesis that is being examined. A common null hypothesis is that there is a lack of spatial pattern in the distribution of genes with shared properties. The simplest procedure, then, is to allow, in randomizations, any gene to assume any ‘location’ in the genome while preventing two or more genes from assuming the same location. However, this often fails to exclude trivial or competing biological explanations. For example, the presence of tandem duplicates can lead to a deviation from random as they can show similar properties (that is, expression profiles) that result from common evolutionary history or experimental design (for example, cross-hybridization in microarray studies). If the physical location of genes is of interest, rather than their order alone, the null should reflect observed gene-density variation. One problem with cluster analysis using quantitative trait loci (QTLs) is that the null often supposes an equal probability of finding a gene in all genomic locations. Differences in generating random gene-order variants mean that the results of different randomization studies are often difficult to compare. Alternative analytical methods therefore have attractions. A few studies116,117 elaborate exact analytical solutions or approximate formulae for non-random gene distribution or borrow previously elaborated methods from time-series analysis118. In many cases, however, randomization seems the only tractable method.
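The randomization procedure in this box can be illustrated with a minimal Python sketch. It is an illustration under stated assumptions, not the method of any particular study: the function names, the boolean encoding of genes and the choice of test function (the number of adjacent gene pairs in which both genes carry the property of interest) are all hypothetical. It implements only the simplest null, in which any gene can take any position, so it does not control for tandem duplicates or for variation in gene density.

```python
import random


def adjacent_pair_count(flags):
    """Test function (an assumption for this sketch): number of adjacent
    gene pairs in which both genes carry the property of interest."""
    return sum(1 for a, b in zip(flags, flags[1:]) if a and b)


def randomization_p_value(flags, n_simulants=1000, seed=0):
    """Estimate p = (r + 1) / (n + 1), as in equation 1.

    flags       -- booleans in chromosomal order (True = gene has the property)
    n_simulants -- number n of randomized genomes
    seed        -- seed for the random number generator, for reproducibility
    """
    rng = random.Random(seed)
    observed = adjacent_pair_count(flags)
    shuffled = list(flags)
    r = 0  # simulants scoring at least as high as the real genome
    for _ in range(n_simulants):
        # Simplest null: every gene may occupy any position in the genome.
        rng.shuffle(shuffled)
        if adjacent_pair_count(shuffled) >= observed:
            r += 1
    return (r + 1) / (n_simulants + 1)


if __name__ == "__main__":
    # Hypothetical genome of 100 genes in which 10 'essential' genes are contiguous.
    genome = [True] * 10 + [False] * 90
    print(randomization_p_value(genome, n_simulants=1000, seed=1))
```

Adding 1 to both r and n means that the estimated p can never be zero, however extreme the real genome is; the smallest attainable value is 1/(n + 1), so the number of simulants sets the resolution of the test.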

IMPRINTED GENES
Genes that are expressed from only one of the two parental copies, the choice being dependent on the sex of the parent from which the gene was derived.

CO-EXPRESSION
A property of genes that show similar spatial or temporal expression patterns.


QUANTITATIVE TRAIT LOCI
(QTLs). Genes that segregate for a quantitative trait. QTL mapping allows the determination of the genomic location of QTL using genetic markers.

SERIAL ANALYSIS OF GENE EXPRESSION
(SAGE). An experimental method for determining transcript abundances in a tissue on the basis of sequencing thousands of short gene-specific tags.

TRANSCRIPTION-COUPLED REPAIR
A specialized repair pathway that counteracts the toxic effects of DNA damage in transcriptionally active genes.

EXPRESSION BREADTH
The number of tissues in which a gene is expressed.

EXPRESSION RATE
mRNA or protein abundances of a gene in a given tissue or under given cellular conditions.

KYOTO ENCYCLOPAEDIA OF GENES AND GENOMES
(KEGG). An online database that integrates current knowledge on molecular interaction networks (for example, metabolic pathways and protein complexes).

MUNICH INFORMATION CENTRE FOR PROTEIN SEQUENCES
(MIPS). An online database that provides protein sequence-related information on the basis of whole-genome analysis of Saccharomyces cerevisiae, Arabidopsis thaliana and Neurospora crassa.

GENE ONTOLOGY [DATABASE]
(GO). A collaborative effort to address the need for consistent descriptions and functional classification of gene products in different databases.

Perhaps the strongest anecdotal evidence of non-random gene order in eukaryotes is the observed clustering of mammalian IMPRINTED GENES11. However, this clustering could be related to localized cis activity of imprint control regions and so could be interpreted as a strange exception to the rule of random gene order11. On a more limited scale, there are numerous reports of gene clusters of related function (see, for example, REFS 12–15). For instance, human glutamine phosphoribosyl pyrophosphate amidotransferase (GPAT), which is necessary for the initial step in de novo purine synthesis, and phosphoribosylamidoimidazole-succinocarboxamide synthase (AIRC), which encodes an enzyme for later steps in the pathway, are closely linked16.

Do such incidences disturb our null hypothesis? The problem is that in random genomes, we expect curious coincidental linkages. Consequently, these anecdotes fail to show that there is more clustering than would be expected by chance. Even assuming that we can eliminate tandem duplication as a cause, the problems are numerous (see also BOX 1). First, we have no a priori expectation as to which genes should be clustered and which should not. So, to understand the statistical significance of the finding of a given cluster, we need to ask, not, for example, how often GPAT and AIRC reside next to each other in random genomes, but how often two or more genes that act in the same pathway are found next to each other. As we had no expectation that purine synthesis would be unusual, we must consider all ‘comparable’ pathways (N.B. defining ‘comparable’ is problematic). We then need to determine the null expected number of incidences of linkage of genes in the same pathway (see BOX 1). To do this, we will need much more extensive data than is provided in the original observation of linkage of two genes. Therefore, although larger clusters, such as the seven linked genes that are involved in quinic acid utilization in fungi12, are strongly indicative of non-random gene order, rigorous analysis is only possible with complete genome data. We can now, for example, ask whether there are more metabolic pathways in which two or more genes are clustered than would be expected by chance17. However, determining whether any given cluster requires special explanation remains problematic. Nonetheless, the same statistical tools can be applied, albeit with less power, to address this issue.

Evidence of clustering from whole-genome studies. The study of genes that are involved in the mitotic cell cycle in yeast by Cho et al.18 was the first to show clustering of CO-EXPRESSED genes on a genomic scale. They found that 25% of genes with cell-cycle-dependent expression patterns were directly adjacent to genes induced in the same phase of the cell cycle (see also REF. 19). Clusters of co-expressed yeast genes rarely seem to exceed ten genes or a few kilobases (C.P., unpublished observations). Although clusters of a similar size are found in the worm Caenorhabditis elegans, many of these clusters could be attributed to the co-transcription of these genes in operons: a process that is unusual among eukaryotes. Approximately 15% of C. elegans genes are contained in operons, that is, stretches of two to eight genes that are transcribed into polycistronic pre-mRNAs20. Although operons, together with tandemly duplicated genes, account for most of the observed co-expression clusters in the worm21, significant local co-expression is still evident after excluding these two causes21,22.

Clusters of co-expressed genes in multicellular eukaryotes can be substantially larger than those that are described in yeast and the worm. In Drosophila melanogaster, 45% of genes that are expressed only in testes were found in uninterrupted stretches of at least four genes23. However, a looser definition of a cluster that allows for intervening genes with different expression patterns led to the identification of much larger groups of co-expressed genes. When averaging co-expression over 10-kb windows, Spellman and Rubin found that 20% of genes occur in co-expression clusters that span 10–30 genes or, on average, 125 kb of DNA24. Within the Arabidopsis thaliana genome, co-expression clusters (excluding tandem duplicates) span up to 20 genes25 (see also REFS 26,27), and QUANTITATIVE TRAIT LOCUS (QTL) studies indicate that they might be considerably larger28.

The physical scale of co-expression seems to be even larger in mammals, with clusters that extend up to 1,000 kb (REF. 29). Several reports30–35 indicate that, when a large body of cDNAs (or ESTs) are extracted for a given tissue, the genes that specify the proteins tend to cluster in the genome. Other reports note that highly expressed genes, defined by SERIAL ANALYSIS OF GENE EXPRESSION (SAGE) tags, tend to cluster in large domains (regions of increased gene expression; RIDGEs)8,36. Similarly, TRANSCRIPTION-COUPLED REPAIR is prominent in specific chromosomal domains37, although this pattern might reflect variation in gene density. Lercher et al.29 argue that all these patterns might be explained by a tendency for (on average) highly expressed housekeeping genes to cluster.
They note that although genes tend to cluster according to their EXPRESSION BREADTH even if their EXPRESSION RATE is controlled for, they do not tend to cluster according to their expression rate if breadth is controlled for. This is not to say that tissue-specific, highly expressed genes in the clusters cannot be identified8, simply that the dominant trend is for clustering of genes that are expressed in many tissues. Lercher et al.29 also concluded that they could not find much evidence for clustering by tissue of tissue-specific genes: for example, genes that are expressed only in muscle do not tend to cluster with other genes that are expressed only in muscle. A significant clustering was detected for 4 of 14 tissues, but only 1 remains significant after control for multiple testing. Clustering of tissue-specific genes is, however, well described for testes-specific genes in flies23.

Whether the human genome contains blocks of multiple genes that are expressed exclusively in the same tissue remains to be resolved. However, this might not be the most important question. Given evidence for tissue-specific chromosomal silencing38, it might be more relevant to ask whether there are chromosomal domains that are associated with up- or downregulation in a given tissue. Evidence for such clustering has been found39,40, although these studies did not control for the effect of tandem duplicates. These clusters of co-suppressed genes extend over several megabases40.

In summary, there is extensive evidence for the clustering of co-expressed genes across all major eukaryotic kingdoms. However, there seems to be a correlation between the physical size of clusters and organismal complexity, with cluster size ranging from a few kilobases in yeast to several megabases in mammals (see also REF. 41). This might be partly explained by differences in genome compactness; however, it might also reflect different underlying mechanisms.

Are functionally related genes clustered? Bacterial operons often consist of genes that are functionally related, such as being part of the same metabolic pathway. Do we more generally find that functionally related genes cluster in eukaryotes? The answer to this question to some extent depends on how we define functional relatedness. Unlike co-expression, what it means to be ‘functionally related’ is relatively ambiguous. It might mean involvement in the same pathway, proteins that interact with each other, or genes, the alleles of which affect the same trait, and so on. There can be overlap between all of these meanings.

Lee and Sonnhammer17 examined the physical location, in numerous genomes, of genes that have proteins that are involved in metabolic pathways, defined from the KYOTO ENCYCLOPAEDIA OF GENES AND GENOMES (KEGG) database. In all species examined (human, worm, fly, A. thaliana and yeast), there was a significant tendency for genes from the same metabolic pathway to cluster. However, the fraction of pathways with significant chromosomal clustering of genes is highly variable, ranging from 30% for D. melanogaster to a remarkable 98% for yeast, whereas 11% are expected under the null hypothesis17. In at least one well-characterized example, a cluster of genes evolved independently in two different species42.
Similarly, in yeast, the genes that are involved in stable protein–protein complexes tend to be more tightly linked than expected43. Cooper13 has suggested that, in the human genome, proteins tend to be linked to their receptor. However, from a whole-genome analysis, we find that the number of such incidences is not different from random (L.D.H., C.P. and M.J.L., unpublished observations).

A less robust approach to identifying functional clusters of genes is to examine the clustering of QTLs mapped for any given trait. Several such QTL studies indicate co-localization of QTLs for related traits28,44–46. However, these effects might result from variations in gene density or from multiple effects at one gene and its control sequences.

The relationship between the above functional clusters and co-expression clusters is often uncertain. In a few cases, the link between co-expression and co-functionality has been examined. In yeast, many genes in co-expression clusters seem to be functionally related: they either belong to the same MUNICH INFORMATION CENTRE FOR PROTEIN SEQUENCES (MIPS) category19 or the same GENE ONTOLOGY (GO) classification41. Similarly, in A. thaliana, both genes with protein products that interact and genes that act in the same pathway (defined by KEGG) explain some but not all of the observed co-expression25. Clustering of co-expressed linked genes that belong to the same GO category is relatively rare in humans41. Although some functionally related genes are found in co-expression clusters in D. melanogaster, these seem to be mostly the result of tandemly duplicated genes24,47.

Evidence for regular spacing of genes. Non-random gene order need not be manifested as clustering. A rarely considered possibility is that genes might be regularly spaced. Intriguingly, Képès48 reports that genes that are regulated by the same sequence-specific transcription factor tend to be regularly spaced along the yeast chromosome. Regular spacing along chromosomes of co-expressed gene pairs, defined by chip array data, has also been described in Saccharomyces cerevisiae19,49 and D. melanogaster, but these reports might be artefacts of the chip design50.

Figure 1 | Schematic representation of the different levels of transcriptional co-regulation. Panels: a, primary (~10 kb), cis-acting elements; b, secondary (~100 kb), histone modifications; c, tertiary (~1,000 kb), active chromatin hub; d, tertiary (~1,000 kb), chromosome territories. a | At the primary level, cis-acting elements directly affect the transcription of neighbouring genes. The figure depicts a bidirectional promoter causing co-regulation of transcription of genes on the two DNA strands. This level will only affect genes within a few kilobases of each other. b | At the secondary level, HISTONE modifications spread from a LOCUS CONTROL REGION (LCR) (depicted in orange) down the CHROMATIN fibre until they are stopped at a BOUNDARY ELEMENT (depicted in pink). Modification of the histones (depicted in red) suppresses transcription of the intervening genes (grey boxes), whereas unmodified histones (green) beyond the boundary element retain an open chromatin structure, thereby allowing transcription of a neighbouring gene (blue). This type of co-regulation will affect regions of up to a few hundred kilobases. c | At the tertiary level, cis-acting elements (orange ovals) come together to form the node of chromatin loops (the ‘active chromatin hub’). Genes close to the hub (blue) are accessible to transcription, whereas genes further away (grey) are inaccessible. d | An alternative view of the tertiary level posits that chromatin is arranged in compact chromosome territories, with transcription largely being restricted to territory surfaces (blue genes), but suppressed within the interior (grey genes). In both pictures of tertiary-level regulation, effects are expected to range up to several megabases.

HISTONES
Positively charged DNA-binding proteins that mediate the folding of DNA.

LOCUS CONTROL REGION
(LCR). Cis-acting sequence that organizes a gene cluster into an active chromatin block and enhances transcription.

CHROMATIN
A highly condensed structure of DNA that is associated with histone proteins and other DNA-binding proteins.

BOUNDARY ELEMENTS [OR INSULATORS]
Cis-acting DNA sequences that act as barriers to the effects of distal enhancers and silencers.

BIDIRECTIONAL PROMOTERS
Promoter sequences between divergently transcribed neighbouring gene pairs that initiate transcription in both directions.

POLYCISTRONIC TRANSCRIPT
mRNA that encodes several polypeptides; a common phenomenon in bacteria.

Mechanisms

There seems to be a broad split of incidences of co-expression into those that act on a relatively small local scale and those that act over much broader genomic spans. We argue that such a pattern is consistent with what is known of the mechanisms for co-expression (FIG. 1).

DNA METHYLATION
Covalent modification of the DNA that inhibits transcription initiation.

HISTONE ACETYLATION/DEACETYLATION
These processes regulate changes in chromatin structure by covalent modification of histone proteins, and therefore influence the ability of transcription factors to bind to promoters.

CHROMATIN IMMUNOPRECIPITATION
An experimental method that is used for analysing the acetylation state of histones in a specific genomic region.

ISOCHORIC [STRUCTURE]
Large-scale variation in the G+C content of vertebrate genomes.

The simple null hypothesis is that a gene’s expression depends only on the promoters in its immediate vicinity. Many of the local scale phenomena discussed above are consistent with this model. Trivially, tandem duplicates tend to have comparable expression because they have comparable promoters21,51. In yeast18,19,52 and in humans13,16,53,54 (see also REF. 25), some co-expression of adjacent genes can be attributed to a BIDIRECTIONAL PROMOTER that resides between the two. Similarly, although there are POLYCISTRONIC TRANSCRIPTS in some eukaryotes20,55–57, such incidences do not disturb the simple ‘promoter-drives-expression’ model any more than does the finding of genes nested within the introns of other genes. At the extreme, co-expression of multiple linked genes is achieved by fusion of all of the genes to make one protein product58,59.

However, even on the small scale, this simple null model is unable to explain everything. Notably, cis effects, such as the downstream effects of upstream activating sequences (UAS), explain some examples of tight co-expression of gene pairs19. Moreover, the broader scale co-expression patterns indicate that the promoter-drives-expression model is too simplistic, as does parallel work that indicates that higher-order features are crucial to understanding chromosomal domains of expression.

Beyond the one-dimensional array of genes on chromosomes, to understand gene expression, it now seems important to consider two higher levels of chromosomal organization: the state of chromatin and its positioning within the nucleus (particularly its proximity to intranuclear transcription-associated machinery). These two features interact and often it is difficult to dissect out the two causes. The tightly packed state of DNA (heterochromatin) that leaves genes largely inaccessible to transcription factors (and is therefore transcriptionally incompetent) tends to reside towards the periphery of the nucleus60. These issues have been well-reviewed elsewhere60–62 so here we only deal with the main features that are pertinent to understanding the expression clusters.

Chromatin-level regulation. Studies on the success of transgene inserts show that inserts into heterochromatin tend to be inactive7,9. Chromatin is not, however, static and transitions between states are linked to changes in gene expression63 and are causally related to covalent modifications of the core histones64. The best current model indicates that specific histone-modifying proteins initiate the opening or closing of chromatin (for example, at a locus control region (LCR)), and that this modification spreads along a chromosome until it meets a boundary element65,66. In this way, all the genes in a region of a chromosome might be prevented from being expressed. Alternatively, a chromosomal region can be made accessible for transcription, but whether these genes are actually expressed depends on other factors such as DNA METHYLATION status, nuclear position, available transcription factors and cis-UAS effects19. Consequently, we expect to see domains of downregulation (see, for example, REF. 38) more than we see domains of coordinated upregulation. Indeed, Akashi et al.5 suggest a model on the basis of this sort of premise. They propose that stem cells have a largely open chromatin formation and each step towards specialization is accompanied by the downregulation of genes in specific chromosomal regions.

In some cases, these modifications are stably inherited through cell division and are therefore important in differentiation and development65. Different mechanisms of silencing have different consequences for the stability of the silencing. For example, silencing by histone lysine methylation is reversed only by the slow process of replacement of histones or through DNA replication. In other cases, the modification can be rapidly modulated by the alteration of activities of HISTONE ACETYLASES and HISTONE DEACETYLASES (HDAC) (see, for example, REF. 67). The relationship between chromatin modification and co-expression has been most elegantly demonstrated in yeast. Yeast contains a family of five related HDACs. Using CHROMATIN IMMUNOPRECIPITATION and intergenic microarrays to generate genome-wide HDAC enzyme-activity maps, Robyr et al.67 reported a striking division of labour that enables yeast to modify the chromatin to activate a block of genes that are associated with a given function. Hda1, for example, deacetylates subtelomeric domains that contain normally repressed genes that are used instead for gluconeogenesis, growth on carbon sources other than glucose and adverse growth conditions. By contrast, Hos1/Hos3 and Hos2 preferentially affect ribosomal DNA and ribosomal protein genes, respectively. In humans, there is evidence that comparable mechanisms can explain the inactivation of blocks of tissue-specific genes. The zinc-finger gene-specific repressor element RE-1 silencing transcription factor (REST) can mediate restriction of gene activity in non-neuronal tissues by imposing active repression through histone deacetylase recruitment38. Through the recruitment of an associated co-repressor, CoREST, it can also enable long-term gene silencing that spreads down the chromosome38, affecting transcriptional units that do not themselves contain REST response elements.
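The spreading model discussed above (a histone modification that is initiated at an LCR and propagates along the chromosome until it meets a boundary element) can be caricatured in a few lines of Python. This is a deliberately crude sketch: the list encoding of a chromosome, the marker strings and the function name are hypothetical, and the assumption that the modification spreads in both directions from the LCR is made here for simplicity rather than taken from the studies cited.

```python
BOUNDARY = "BOUNDARY"  # hypothetical marker for a boundary element (insulator)
LCR = "LCR"            # hypothetical marker for a locus control region


def genes_in_modified_domain(elements, origin):
    """Return the genes reached by a modification that starts at an LCR.

    elements -- chromosome as an ordered list of strings; each entry is a
                gene name, LCR or BOUNDARY
    origin   -- index of the LCR at which the modification is initiated

    Assumption of this sketch: spreading is bidirectional and stops at the
    first boundary element (or the chromosome end) on each side.
    """
    if elements[origin] != LCR:
        raise ValueError("origin must point at an LCR entry")
    reached = []
    for step in (-1, 1):  # walk left, then right, from the LCR
        i = origin + step
        while 0 <= i < len(elements) and elements[i] != BOUNDARY:
            if elements[i] != LCR:  # other LCRs are not genes; skip them
                reached.append(i)
            i += step
    return [elements[i] for i in sorted(reached)]


if __name__ == "__main__":
    chromosome = ["g1", BOUNDARY, "g2", "g3", LCR, "g4", BOUNDARY, "g5"]
    print(genes_in_modified_domain(chromosome, chromosome.index(LCR)))
```

In this caricature, every gene between the flanking boundary elements shares the same chromatin state whatever its own promoter, which is the sense in which such a mechanism could produce blocks of co-regulated neighbours.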

[Figure 2 plot. x-axis: position (Mb). y-axes: average number of tissues per gene (expression breadth) and average proportion of intronic GC per gene.]

Figure 2 | Expression breadth and surrounding GC content along human chromosome 11. The figure shows the number of analysed tissues in which a gene is expressed and the local GC content; both are averaged over a sliding window of 15 genes drawn across human chromosome 11. Genes with high breadth of expression (that is, expressed in many tissues) tend to reside in regions in which the local GC content is especially high. This indicates that the ISOCHORIC STRUCTURE of the human genome (the regional variation in GC) might reflect underlying selection for transcriptional competency. Modified with permission from REF. 72 © (2003) Oxford Univ. Press.
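The smoothing described in the caption of Figure 2 can be sketched as a simple moving average over genes in chromosomal order. This is only an illustration: the function name `sliding_window_means` and the toy values are assumptions, not data from REF. 72, and the window is shortened from 15 genes to 3 so that the example stays small.

```python
from statistics import mean


def sliding_window_means(values, window=15):
    """Return the mean of every run of `window` consecutive values.

    `values` must be listed in chromosomal order. A list shorter than
    the window yields an empty result.
    """
    if window < 1:
        raise ValueError("window must be at least 1")
    return [mean(values[i:i + window])
            for i in range(len(values) - window + 1)]


# Hypothetical per-gene measurements (illustrative values only).
breadth = [1, 2, 2, 3, 5, 6, 7, 7, 8, 8]  # tissues in which each gene is expressed
intronic_gc = [0.35, 0.36, 0.38, 0.40, 0.48, 0.52, 0.55, 0.58, 0.60, 0.62]

# A 3-gene window keeps the toy example readable; the figure uses 15 genes.
smoothed_breadth = sliding_window_means(breadth, window=3)
smoothed_gc = sliding_window_means(intronic_gc, window=3)
```

Plotting the two smoothed series against the position of each window would give curves of the kind the caption describes. Overlapping windows make neighbouring points non-independent, so any correlation between the two series needs a test that allows for this.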


SC-35 DOMAINS

A set of 10–30 prominent domains of the eukaryotic nucleus that are concentrated in mRNA metabolic factors. They are probably important in organizing euchromatin domains.

NUCLEOSOMAL FIBRE

Fibre of chromatin that is made up of nucleosomes.


Three-dimensional structure and intra-nuclear position. We know that targeting a gene to the periphery of the yeast nucleus induces silencing68, which indicates that the location within the nucleus might be an important component in promoting or repressing transcription. Indeed, interphase chromosomes in many species occupy unique, relatively compact positions in the nucleus60. Moreover, gene-dense chromosomes tend to be more central in the nucleus60,69, which also indicates that there might be a relationship between three-dimensional position and expression. Can the need to be at a particular intra-nuclear location also drive the evolution of similarly expressed genes to cluster in particular chromosomal regions? This has long been considered a possibility for rRNA genes. Linkage of these genes makes sense because they are associated with the nucleolus: the factory that enables their rapid expression. Are there other intra-nuclear structures that might be of importance? SC-35 DOMAINS are one such group of structures that could promote gene clustering70. Typically, eukaryotic nuclei contain 10–30 prominent domains that are concentrated in mRNA metabolic factors. Gene-rich reverse-chromosomal bands71 show extensive contact with these domains70, which tallies with the tendency for domains of broadly/highly expressed genes to be located in GC-rich R-bands72 (see also FIG. 2). Shopland et al.70 argue that these findings indicate a functional rationale for gene clustering in chromosomal bands, which relates to nuclear clustering of genes with SC-35 domains. They propose a model of SC-35 domains as functional centres for a multitude of clustered genes, forming local euchromatic ‘neighbourhoods’. This model also indicates a mechanism for restricting expression even in euchromatin — that is, the chromatin might be open but if the DNA is not associated with SC-35 domains, transcription will be limited. 
However, whether nuclear location and chromosomal clustering are as tightly coordinated remains unclear. Analysis of tRNA genes73 suggests a different story: the genes are associated with the nucleolus but are not co-localized on the chromosomes. So, intranuclear location could determine the potential for gene expression, but might not necessarily lead to evolution of clustering of the genes on a chromosome. Analysis of co-regulated genes in yeast supports the idea that selection that acts on gene location might not result in the genes being clustered, but might nonetheless drive non-random gene order48. Képès proposed that the three-dimensional arrangement of genes within the nucleus might underpin the regular spacing of genes that are under the control of a given transcription factor. Specifically, if the DNA NUCLEOSOMAL FIBRE folds into topologically closed loops of regular size, the promoters of these regularly spaced genes would cluster in a small region of the nuclear space. This model fits with the ‘active chromatin hub’ model of gene regulation62. In this model, at least two cis-acting regulatory structures, at either end of a broadly defined region, come together in three-dimensional space to form a DNA loop. Gene

expression is then allowed only in close proximity to the point at which the elements meet. Multiple loops would then act to enable co-expression of regularly spaced genes and inhibition of intervening genes.

Between-species comparison of co-expression modes. Do the mechanisms of co-expression vary between species? Certainly, clustering of co-expressed genes in all eukaryotic genomes does not necessarily imply a common underlying mechanism. For example, operons are common in the worm20, but are rarely found in other eukaryotes55–57. Similarly, although it is probable that in all species, tandem duplicates contribute to co-expression, in the worm these happen to be unusually common21. Bidirectional promoters might explain many instances of co-expression of gene pairs in yeast18,52 but by no means all of them19. Their role in other species is now starting to be examined on a genomic scale54. Less clear is the importance of chromatin-level regulation. Histone-modified genomic domains in yeast are now well described67. However, a whole-genome analysis in the worm revealed little evidence that broad-scale effects mediate co-expression21. It is still unclear how common chromatin-mediated inactivation of broad spans of genes is in the human genome5,38. So, differences between species in the mechanisms that promote non-random gene order seem to be largely quantitative rather than qualitative. Nonetheless, there might be mechanisms that truly are limited to certain taxa. In flies, for example, there is a coupling of the timing of replication and initiation of transcription74, but no such effect is seen in yeast75. Similarly, methylation, although rare in D. melanogaster and common in plants and vertebrates, is absent in yeast76.

Formation and maintenance of clusters

Why might clusters have formed? To address this issue, we first ask whether non-random organization is itself evidence for selection on gene order? We then address whether it is adequate to suppose that, because the current organization allows co-regulation, selection for co-regulation drives the aggregation process. Non-random gene order need not necessarily imply the activity of selection. First, if gene expression is a noisy process, then opening chromatin to allow expression from one gene might incidentally allow leaky expression of linked genes24. Given that in D. melanogaster, the large regions of co-expression are not also regions in which the genes are functionally related24, this model cannot be trivially dismissed. Second, the random model was a poor null because it failed to make allowance for biases in the rates and dimensions of various forms of gene rearrangement (duplication, transposition/retroposition, translocation, inversion, and so on), and in the parameters that differ between species77–79 and between chromosomes80. Removal of tandem duplicates is desirable as it attempts to correct for these known biases. Retroposition might also cause such a bias in gene order: insertion of retroposing viruses seems to be more common in open chromatin81. Such a bias alone could explain, in principle,


Box 2 | Supergene clusters with low recombination rates

Supergene clusters are genomic regions in which selection favours tight linkage to maintain linkage disequilibrium between alternative alleles at two or more loci. The mating-type locus of the single-celled green alga Chlamydomonas reinhardtii is an example. For instance, the chloroplasts in the zygote of this species are derived from both parental cells but a ‘destruction’ allele in one of the gametes eliminates the chloroplast genomes of the mating partner before SPORULATION. Haploid gametes with this allele should be under selection to avoid mating with each other. Assuming that uniparental inheritance is beneficial, cells without this allele will be under selection to mate with a partner that does have the allele. Selection can then favour the linkage of the organelle-inheritance allele with a mating-type allele, as linkage disequilibrium between them reduces the rate of the more deleterious matings: destroyer with destroyer, non-destroyer with non-destroyer. Therefore, it is predicted119,120 that mating-type (+ and – type) and organelle-inheritance alleles should come to be linked and to be in strict linkage disequilibrium (all gametes of one mating type should be the destroyer type, whereas all gametes of the opposite mating type should be the non-destroyers). This is what is seen96,121. The genome region has features that minimize the recombination rate within it96, including inversions, rearrangements and insertions. The other well-described supergene clusters are segregation distorters, such as Sd in flies and t-complex in mice98. In the simplest model, at around the time of male meiosis, a toxin is given to all sperm, but the anti-toxin is restricted to those sperm that contain the anti-toxin allele. Selection strongly favours linkage of the alleles for toxin and anti-toxin, as a chromosome that bears the toxin allele but not the anti-toxin allele is immediately eliminated from the population122,123.
As predicted, the genes are usually in regions of low recombination (for example, centromeres) and often have inversions98. SD has at least two loci, Sd and Rsp. Sd+ is the toxic allele and alleles at Rsp determine sensitivity to the toxin. The two loci span the centromere on chromosome 2 and are often associated with an inversion. As predicted, a modifying allele (E(Sd)) that increases the extent of segregation distortion is linked to and is in linkage disequilibrium with SD, residing between Sd+ and Rsp.
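The link between low recombination and the maintenance of linkage disequilibrium can be illustrated with the standard two-locus coefficient D = p(AB) − p(A)p(B). The sketch below assumes an effectively infinite, randomly mating population with no selection, in which D shrinks by a factor of (1 − r) each generation for recombination fraction r. The function names and starting frequencies are hypothetical and are not taken from the studies cited in this box.

```python
def linkage_disequilibrium(freqs):
    """Return D = p(AB) - p(A) * p(B) for two biallelic loci.

    `freqs` maps the haplotypes 'AB', 'Ab', 'aB' and 'ab' to their
    frequencies, which must sum to 1.
    """
    if abs(sum(freqs.values()) - 1.0) > 1e-9:
        raise ValueError("haplotype frequencies must sum to 1")
    p_a = freqs["AB"] + freqs["Ab"]
    p_b = freqs["AB"] + freqs["aB"]
    return freqs["AB"] - p_a * p_b


def recombine(freqs, r):
    """Haplotype frequencies after one generation of random mating.

    Assumes no selection, mutation or drift; `r` is the recombination
    fraction between the two loci (0 <= r <= 0.5).
    """
    d = linkage_disequilibrium(freqs)
    return {
        "AB": freqs["AB"] - r * d,
        "Ab": freqs["Ab"] + r * d,
        "aB": freqs["aB"] + r * d,
        "ab": freqs["ab"] - r * d,
    }


# Hypothetical starting point: strict disequilibrium, only AB and ab present.
start = {"AB": 0.5, "Ab": 0.0, "aB": 0.0, "ab": 0.5}

tight = start
loose = start
for _ in range(10):
    tight = recombine(tight, r=0.001)  # tightly linked loci
    loose = recombine(loose, r=0.5)    # unlinked loci

d_tight = linkage_disequilibrium(tight)
d_loose = linkage_disequilibrium(loose)
```

After ten generations the tightly linked pair retains nearly all of its initial disequilibrium, whereas the unlinked pair has lost almost all of it. In this simplified setting, that is why inversions and other suppressors of recombination help to preserve favourable allele combinations.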

SPORULATION

A defence mechanism of microbes in response to unfavourable environmental conditions that results in spores that are highly resistant to physical and chemical abuse.

LINKAGE DISEQUILIBRIUM

Non-random assortment of alleles at different, usually linked, loci. Low population size and selection can increase linkage disequilibrium whereas recombination reduces it.

MEIOTIC DRIVE

A departure from Mendelian segregation of chromosomes.

MAJOR HISTOCOMPATIBILITY COMPLEX

(MHC). MHC molecules bind peptide fragments that are derived from pathogens and display them on the cell surface for recognition by the appropriate T cells. The organizations of the MHC gene clusters are similar in many species.

GENE CONVERSION

Non-reciprocal transfer between a pair of non-allelic or allelic DNA sequences during meiosis and mitosis.


why gene density is not random and why highly expressed genes tend to reside in regions of highest gene density8 — that is, highly expressed genes have the highest probability of being in open chromatin and therefore of having new genes inserted in close proximity. Similarly, clustering of organelle-associated genes in the nuclear genomes of D. melanogaster82 and A. thaliana83 might reflect nothing more than a block transfer of genes from organelle to nucleus83. So, the discovery of more structure to genomes than was previously foreseen need not implicate a role for selection. However, the presence of functional clusters does indicate that selection is important. So, under the assumption that selection might favour certain genes to be co-expressed, can we suppose that this will explain the evolution of clusters? Insertion of a gene into the region might well directly affect its expression profile. For example, if a gene moves into a chromosomal domain that is regulated by Hda1, we might assume that regulation of the cluster will affect its activity. This type of model could be tested by analysis of expression characteristics of de novo retrovirus insertions to find out whether those that are inserted into transcriptionally more competent chromatin are more likely to be expressed. Preliminary data support the idea that high GC content of insertion sites might be necessary for activity84. This tallies with reports that gene-expression parameters vary with the GC content of the flanking sequence8,72,85.
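The random null model discussed earlier in this section can be made concrete with a permutation test: shuffle gene positions many times and ask how often the shuffled order shows at least as many adjacent pairs of flagged genes (for example, co-expressed genes) as the real order. The sketch below is a minimal version of that naive null, with hypothetical function names and toy data. It deliberately ignores tandem duplication, insertion bias towards open chromatin and the other rearrangement biases that the text argues a better null must allow for.

```python
import random


def adjacent_pairs(flags):
    """Count neighbouring gene pairs in which both genes are flagged."""
    return sum(1 for a, b in zip(flags, flags[1:]) if a and b)


def permutation_p_value(flags, n_perm=10_000, seed=0):
    """Compare the observed count of flagged neighbours with shuffled orders.

    Returns (observed count, one-sided p-value). The p-value uses the
    usual +1 correction so that it is never exactly zero.
    """
    rng = random.Random(seed)
    observed = adjacent_pairs(flags)
    shuffled = list(flags)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(shuffled)  # uniform shuffle: the naive null
        if adjacent_pairs(shuffled) >= observed:
            hits += 1
    return observed, (hits + 1) / (n_perm + 1)


# Hypothetical chromosome of 20 genes, six of them flagged and contiguous.
toy_chromosome = [1] * 6 + [0] * 14
observed, p_value = permutation_p_value(toy_chromosome, n_perm=2_000)
```

A small p-value here shows only that the order departs from a uniform shuffle. As the text stresses, such a departure need not implicate selection, because a uniform shuffle is a poor description of how genes actually move.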

Some types of co-regulation might require additional steps, such as the evolution of bidirectional promoters, or the establishment of operons. This dislocation between the eventual co-expression of genes and the reason for the assemblage of operons has been noted previously86,87. Consider the evolution of an operon with two functionally related but initially unlinked genes (A and B). Why might A and B be favoured to come into closer proximity? Lawrence86,87 argues that before the evolution of the polycistronic transcript, it cannot be supposed that ever-closer linkage means ever-tighter co-expression. Consequently, although selection might favour the adsorption of the two genes into a single operon, once they are tightly linked, selection for co-expression cannot explain the original evolution of linkage. There are at least two alternative explanations to account for the initial co-location of the genes: either that there was some other force promoting linkage or that chance happens to put the two genes into proximity.

Selection for linkage independent of selection on co-expression. What other forces might promote linkage? Lawrence86 suggests that linkage of functionally related genes in prokaryotes might enable simultaneous horizontal transfer. However, there is evidence against this model in prokaryotes88, and its relevance to eukaryotes is limited. More importantly, although interest has understandably concentrated on the relationship between expression and linkage, a body of work on population-genetic theory of the evolution of the recombination rate89–92 (and therefore of linkage) has been relatively overlooked.
In 1930, for example, Fisher93 noted that if, in a haploid, alleles A and B together confer high fitness, as do a and b, although Ab and aB are of low fitness, selection will favour the establishment and maintenance of LINKAGE DISEQUILIBRIUM between them to form AB and ab clusters that are rarely broken by recombination owing to the close genetic position. Although numerous examples have been discussed15,90,94,95, perhaps the strongest evidence has come from the examination of the mating-type loci of Chlamydomonas reinhardtii96 and MEIOTIC-DRIVE ‘genes’97,98 (BOX 2). A problem with these examples, as with the finding of the clustering of imprinted genes11, is that they might simply be strange phenomena that are associated with strange genes. Similarly, it is indicated that in the MAJOR HISTOCOMPATIBILITY COMPLEX (MHC), new beneficial alleles can be created by GENE CONVERSION99. Although this might provide selection for linkage (see also REF. 100), this could only apply to genes in the same family. The issue, then, is whether population-genetic forces that promote linkage can have broad-scale effects on gene order. Recent evidence indicates that this is possible. In both yeast101 and worm101,102, essential genes (those for which the knockout is not viable) cluster in the genome. In both cases, the clusters are associated with low recombination rates (FIG. 3), indicating that a population genetics model for linkage might be needed. What might be going on? There might be a simple NEUTRALIST explanation: recombination might enable the production of tandem duplicates. As duplicates tend to


be non-essential103, regions of high recombination might be regions of clusters of non-essential genes. However, we find that only relatively few tandem duplicates reside in regions of high recombination, and this slight bias cannot explain the observed association between gene dispensability and recombination rate (C.P., unpublished observations). Additionally, Pál and Hurst101 show, in yeast, that this clustering is not associated with co-expression. Nei91 showed that if deleterious alleles that are maintained at mutation-selection equilibrium interact with positive epistasis (organisms with two mutations are not as badly affected as expected given the effects of each mutation alone), then selection favours linkage of the genes and reduced recombination rates. He argues that essential genes are more likely to harbour positive epistatic mutations104. Similarly, inversions105 in flies have emerged as suppressors of recombination to maintain a positive epistatic relationship among loci within the gene rearrangements. Alternatively, Gessler and Xu106 note that the strength of selection on an enhancer of recombination is weaker if the strength of selection on the deleterious mutations in two linked genes is larger, as expected for essential genes. An alternative possibility is that selection might promote relatively important genes to be in mutational cold spots, which also correspond to regions of low recombination107,108. Pál and Hurst found no evidence that the essential genes had especially low mutation rates in yeast. This interpretation, however, has been put forward to explain clustering of genes of similar synonymous substitution rates (a proxy for the mutation rate) in the human genome109. It is notable that genes in mutation cold spots are biased109 towards essential cellular processes (gene regulation, RNA processing, and so on).

NEUTRALIST [MODEL]

Evolutionary model that assumes that the trait being investigated has no selective advantage. Changes in allele frequency are said to be the result of chance (drift) alone.


Are clusters maintained by selection? So, how and why clusters of genes form remains unclear, but can we say anything about whether selection acts to maintain them once they are formed? Although high rates of gene-order evolution have been taken as evidence of an absence of constraint110, the most detailed analysis so far supports a role for selection111. In yeast, there are at least two strong independent predictors of the probability that given gene pairs are still linked in Candida albicans: the intergene spacer size and the degree of co-expression111. The role of intergene spacer is consistent with a simple null neutralist model in which only rearrangements with breakpoints between genes are tolerated. However, co-expressed genes remain linked more than expected, which indicates that selection might favour their retention as a pair (see also REF. 112). Linked pairs of essential genes in yeast are also retained as linked more often than expected101. It is unclear then why clusters of metabolically related genes are not especially well-conserved between species17. One possible explanation is that selection for the importance of any given metabolic pathway varies over time in a given lineage. Another possibility is that linkage is not under selection. If co-expression is on the broader chromatin level scale, do we expect similar selection on gene order? If we take Hda1 control of spans of genes in yeast that are associated with stress response67 as a model example, then the answer must be no. If the cluster was formed under selection, then we expect the cluster to be retained together within the relevant chromosomal domain, but the precise order and orientation of genes need not be under selection. This suggestion has yet to receive much systematic scrutiny, although it has

[Figure 3 plot. a: number of essential genes and recombination rate (deviation from mean) against position on chromosome 9, 5′ to 3′, with the centromere marked. b: recombination rate against N, the number of essential genes per window.]

Figure 3 | The number of essential genes and the recombination rate along yeast chromosome 9. a | Sliding window analysis. The figure shows the number of essential genes and the recombination in a sliding window (of 10 genes) drawn across yeast chromosome 9. Note that in general, if the recombination rate is low, the number of essential genes is high. The recombination rate is assayed as the number of standard deviations from the mean recombination rate (that is, +1 = one standard deviation above the mean). The inset box illustrates the same data for non-overlapping windows (N = number of essential genes, Recombination rate = recombination rate assayed in standard deviations). Modified with permission from REF. 101 © (2003) Macmillan Magazines Ltd. b | Non-overlapping window analysis figure. The recombination rate in a 10-gene window is plotted against the number of essential genes in that window. Blocks with many essential genes only ever have low recombination rates.
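The two steps behind panel b, expressing recombination rates in standard deviations from the mean and summarising non-overlapping blocks of genes, can be sketched as follows. The details are assumptions made for illustration: rates are standardised per gene using the population standard deviation, a trailing partial block is dropped, and the data are invented. REF. 101 may have made different choices.

```python
from statistics import mean, pstdev


def standardise(rates):
    """Express each rate as the number of standard deviations from the mean."""
    mu = mean(rates)
    sigma = pstdev(rates)  # population SD; an assumption of this sketch
    if sigma == 0:
        raise ValueError("rates are constant, so z-scores are undefined")
    return [(rate - mu) / sigma for rate in rates]


def block_summaries(essential, rates, block=10):
    """Summarise non-overlapping blocks of genes in chromosomal order.

    Returns one (number of essential genes, mean rate) pair per block.
    A trailing block shorter than `block` is dropped.
    """
    if len(essential) != len(rates):
        raise ValueError("one essentiality flag is needed per rate")
    summaries = []
    for start in range(0, len(essential) - block + 1, block):
        stop = start + block
        summaries.append((sum(essential[start:stop]),
                          mean(rates[start:stop])))
    return summaries


# Hypothetical chromosome of six genes: 1 = essential, 0 = non-essential.
essential = [1, 1, 0, 1, 0, 0]
raw_rates = [0.5, 0.7, 1.0, 0.6, 2.0, 2.2]  # arbitrary units

z_rates = standardise(raw_rates)
blocks = block_summaries(essential, z_rates, block=3)
```

In this toy case the block with more essential genes has a below-average recombination rate, which is the direction of the association that the figure reports for yeast chromosome 9.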


EFFECTIVE POPULATION SIZE

The number of individuals in a population that contribute to the next generation. It never exceeds the actual population size.


recently been claimed that the MHC has conserved gene composition but non-conserved gene order113.

Summary and outlook

It is no longer tenable to suppose that gene order in eukaryotes is random. Parallel advances in our understanding of the control of gene expression and of the distribution of genes in the genome have led to a new, more organized, view. Although we are not yet in a position to present a complete integration of the bioinformatic results, the understanding of chromatin and the role of intra-nuclear location, such an integration is both necessary and realistic. Nonetheless, if eukaryotic gene order is not random, what sort of model might take its place? The idea of the genome as, in part, a series of chromosomal blocks, each being opened for the potential for transcription or inactivated under particular conditions, seems like a helpful, guiding new model. It agrees well with the notion that stem cells generally have open chromatin and that part of the development of specificity is the inactivation of particular spans of genes5. It also tallies with the evidence for a region of downregulation of neuronal-specific genes38 and with the division of labour between histone deacetylases in yeast67. However, as always, a new model generates new questions. For example, is the extent of genome organization the same in all species? We have been struck by the extent to which many patterns are highly discernible in yeast. Of all the complete genomes, yeast has the highest degree of linkage of genes that have proteins that are involved in the same metabolic pathway17, it shows the most striking clustering of essential genes into regions of low recombination101 and has many incidences of highly coordinated expression of linked

1. Kleinjan, D. J. & van Heyningen, V. Position effect in human genetic disease. Hum. Mol. Genet. 7, 1611–1618 (1998).
2. Glinsky, G. V., Krones-Herzig, A. & Glinskii, A. B. Malignancy-associated regions of transcriptional activation: gene expression profiling identifies common chromosomal regions of a recurrent transcriptional activation in human prostate, breast, ovarian, and colon cancers. Neoplasia 5, 218–228 (2003).
3. Zhou, Y. et al. Genome-wide identification of chromosomal regions of increased tumor expression by transcriptome analysis. Cancer Res. 63, 5781–5784 (2003).
4. Joos, S. et al. Variable breakpoints in Burkitt lymphoma cells with chromosomal t(8;14) translocation separate c-myc and the IgH locus up to several hundred kb. Hum. Mol. Genet. 1, 625–632 (1992).
5. Akashi, K. et al. Transcriptional accessibility for genes of multiple tissues and hematopoietic lineages is hierarchically controlled during early hematopoiesis. Blood 101, 383–389 (2003).
A quality analysis that supports the hypothesis that stem cells possess a wide-open chromatin structure to maintain their multipotentiality, which is progressively quenched as they go down a particular pathway of differentiation.
6. Zhang, H., Pan, K. H. & Cohen, S. N. Senescence-specific gene expression fingerprints reveal cell-type-dependent physical clustering of upregulated chromosomal loci. Proc. Natl Acad. Sci. USA 100, 3251–3256 (2003).
7. Milot, E. et al. Heterochromatin effects on the frequency and duration of LCR-mediated gene transcription. Cell 87, 105–114 (1996).
8. Versteeg, R. et al. The human transcriptome map reveals extremes in gene density, intron length, GC content, and repeat pattern for domains of highly and weakly expressed genes. Genome Res. 13, 1998–2004 (2003).


genes19,52. We can also imagine reasons why genomes might vary in the extent to which they are organized. One possibility is that organisms with a large EFFECTIVE POPULATION SIZE (which we assume yeast must have) should be able to resist the spread of weakly deleterious mutations and therefore are expected to be more ‘optimally’ organized114. Alternatively, we might expect the genomes of more ‘complex’ organisms to be more ‘organized’, and as such, organization might be necessary in development5. Indeed, could a connection between GC content and regional transcriptional competence8,72,85,115 explain the evolution of isochores in mammals (FIG. 2)? The above new model supposes that selection that favours coordinated control of gene expression is the only reason for gene-order evolution. However, not only is it often difficult to eliminate sophisticated neutralist models, but there are counter-examples to indicate that selection can favour linkage for other reasons. Although the need to evoke alternative models has been advocated87 and evidence that population genetics models are needed has been provided101, the relevance of prior population genetics theory for gene-order evolution89,91 is uncertain. More generally, there is the need to develop a theory of genome-organization evolution, taking into account the mechanisms of genome rearrangement, mechanisms of control of gene expression and the evolutionary forces that result from different interactions of loci. Understanding how genes are rearranged will be important in defining a more appropriate null. Moreover, different mechanisms have different population-genetic consequences. Inversions alter recombination rates and duplicates can mask deleterious mutations, whereas translocations might disrupt meiosis. Modelling the evolution of gene order, from both selective and neutralist perspectives, represents a considerable challenge.

9. Festenstein, R. et al. Locus control region function and heterochromatin-induced position effect variegation. Science 271, 1123–1125 (1996).
10. Kuhn, E. J. & Geyer, P. K. Genomic insulators: connecting properties to mechanism. Curr. Opin. Cell Biol. 15, 259–265 (2003).
11. Reik, W. & Walter, J. Genomic imprinting: parental influence on the genome. Nature Rev. Genet. 2, 21–32 (2001).
12. Turner, G. in The Eukaryotic Genome (eds Broda, P., Oliver, S. G. & Sims, P. F. G.) 107–125 (Cambridge Univ. Press, Cambridge, 1993).
13. Cooper, D. N. Human Gene Evolution (BIOS Scientific, Oxford, 1999).
14. Hughes, A. L. & Yeager, M. Molecular evolution of the vertebrate immune system. Bioessays 19, 777–786 (1997).
15. Korol, A. B., Preigel, I. A. & Preigel, S. I. Recombination Variability and Evolution (Chapman and Hall, London, 1994).
16. Brayton, K. A. et al. Two genes for de novo purine nucleotide synthesis on human chromosome 4 are closely linked and divergently transcribed. J. Biol. Chem. 269, 5313–5321 (1994).
17. Lee, J. M. & Sonnhammer, E. L. L. Genomic gene clustering analysis of pathways in eukaryotes. Genome Res. 13, 875–882 (2003).
First systematic evidence that, in eukaryotes, genes from the same metabolic pathway tend to cluster. The study reveals striking differences between species in the extent to which this is true.
18. Cho, R. J. et al. A genome-wide transcriptional analysis of the mitotic cell cycle. Mol. Cell 2, 65–73 (1998).
19. Cohen, B. A., Mitra, R. D., Hughes, J. D. & Church, G. M. A computational analysis of whole-genome expression data reveals chromosomal domains of gene expression. Nature Genet. 26, 183–186 (2000).


Early systematic evidence that adjacent pairs of genes, as well as nearby non-adjacent pairs of genes, show correlated expression.
20. Blumenthal, T. et al. A global analysis of Caenorhabditis elegans operons. Nature 417, 851–854 (2002).
Evidence that in the worm genome, operons are not a rare peculiarity: it contains at least 1,000 operons that are 2–8 genes long, which contain approximately 15% of all C. elegans genes.
21. Lercher, M. J., Blumenthal, T. & Hurst, L. D. Co-expression of neighboring genes in Caenorhabditis elegans is mostly due to operons and duplicate genes. Genome Res. 13, 238–243 (2003).
22. Roy, P. J., Stuart, J. M., Lund, J. & Kim, S. K. Chromosomal clustering of muscle-expressed genes in Caenorhabditis elegans. Nature 418, 975–979 (2002).
23. Boutanaev, A. M., Kalmykova, A. I., Shevelyov, Y. Y. & Nurminsky, D. I. Large clusters of co-expressed genes in the Drosophila genome. Nature 420, 666–669 (2002).
24. Spellman, P. T. & Rubin, G. M. Evidence for large domains of similarly expressed genes in the Drosophila genome. J. Biol. 1, 5 (2002).
Robust report to show that groups of adjacent and co-regulated genes, which are not otherwise functionally related in any obvious way, can be identified by expression profiling in D. melanogaster.
25. Williams, E. J. B. & Bowles, D. J. Co-expression of neighbouring genes in the genome of Arabidopsis thaliana. Genome Res. (in the press).
26. Birnbaum, K. et al. A gene expression map of the Arabidopsis root. Science 302, 1956–1960 (2003).
27. Zhu, T. Global analysis of gene expression using GeneChip microarrays. Curr. Opin. Plant Biol. 6, 418–425 (2003).


28. Khavkin, E. & Coe, E. Mapped genomic locations for developmental functions and QTLs reflect concerted groups in maize (Zea mays L.). Theor. Appl. Genet. 95, 343–352 (1997).
29. Lercher, M. J., Urrutia, A. O. & Hurst, L. D. Clustering of housekeeping genes provides a unified model of gene order in the human genome. Nature Genet. 31, 180–183 (2002).
Evidence that genes that are expressed in most tissues tend to cluster. This is proposed to explain why highly expressed genes cluster and why cDNAs extracted from any given tissue show clustering.
30. Ko, M. S. H. et al. Genome-wide mapping of unselected transcripts from extraembryonic tissue of 7.5-day mouse embryos reveals enrichment in the t-complex and underrepresentation on the X chromosome. Hum. Mol. Genet. 7, 1967–1978 (1998).
31. Bortoluzzi, S. et al. A comprehensive, high-resolution genomic transcript map of human skeletal muscle. Genome Res. 8, 817–825 (1998).
32. Dempsey, A. A., Pabalan, N., Tang, H. & Liew, C.-C. Organization of human cardiovascular-expressed genes on chromosomes 21 and 22. J. Mol. Cell. Cardiol. 33, 587–591 (2001).
33. Gabrielsson, B. L., Carlsson, B. & Carlsson, L. M. S. Partial genome scale analysis of gene expression in human adipose tissue using DNA array. Obes. Res. 8, 374–384 (2000).
34. Yang, Y. S. et al. Chromosome localization analysis of genes strongly expressed in human visceral adipose tissue. Endocrine 18, 57–66 (2002).
35. Soury, E. et al. Chromosomal assignments of mammalian genes with an acute inflammation-regulated expression in liver. Immunogenet. 53, 634–642 (2001).
36. Caron, H. et al. The human transcriptome map: clustering of highly expressed genes in chromosomal domains. Science 291, 1289–1292 (2001).
First systematic evidence for clustering of genes in the human genome according to their expression profile.
37. Surralles, J., Ramirez, M. J., Marcos, R., Natarajan, A. T. & Mullenders, L. H. Clusters of transcription-coupled repair in the human genome. Proc. Natl Acad. Sci. USA 99, 10571–10574 (2002).
38. Lunyak, V. V. et al. Co-repressor-dependent silencing of chromosomal regions encoding neuronal genes. Science 298, 1747–1752 (2002).
Elegant evidence for the existence of a large domain of downregulation of genes in non-neuronal tissues.
39. Megy, K., Audic, S. & Claverie, J. M. Positional clustering of differentially expressed genes on human chromosomes 20, 21 and 22. Genome Biol. 4, P1 (2003).
40. Reymond, A. et al. Human chromosome 21 gene expression atlas in the mouse. Nature 420, 582–586 (2002).
41. Fukuoka, Y., Inaoka, I. & Kohane, I. S. Inter-species differences of co-expression of neighboring genes in eukaryotic genomes. BMC Genomics 5, 4 (2004).
42. Vieira, C. P., Vieira, J. & Hartl, D. L. The evolution of small gene clusters: evidence for an independent origin of the maltase gene cluster in Drosophila virilis and Drosophila melanogaster. Mol. Biol. Evol. 14, 985–993 (1997).
43. Teichmann, S. & Veitia, R. Genes encoding subunits of stable complexes are clustered on the yeast chromosomes. Genetics (in the press).
44. Tuberosa, R. et al. Mapping QTLs regulating morphophysiological traits and yield: case studies, shortcomings and perspectives in drought-stressed maize. Ann. Bot. 89, 941–963 (2002).
45. Santos, C. A. F. & Simon, P. W. QTL analyses reveal clustered loci for accumulation of major provitamin A carotenes and lycopene in carrot roots. Mol. Genet. Genomics 268, 122–129 (2002).
46. Cai, H. W. & Morishima, H. QTL clusters reflect character associations in wild and cultivated rice. Theor. Appl. Genet. 104, 1217–1228 (2002).
47. Ueda, H. R. et al. Genome-wide transcriptional orchestration of circadian rhythms in Drosophila. J. Biol. Chem. 277, 14048–14052 (2002).
48. Képès, F. Periodic epi-organization of the yeast genome revealed by the distribution of promoter sites. J. Mol. Biol. 329, 859–865 (2003).
Elegant evidence that in yeast, genes that are controlled by the same sequence-specific transcription factor tend to be regularly spaced along the chromosome arms. It is proposed that these regularities are consistent with a genome-wide loop model of chromosomes, in which coregulated genes tend to dynamically co-localize in 3D.
49. Mannila, H., Patrikainen, A., Seppanen, J. K. & Kere, J. Long-range control of expression in yeast. Bioinformatics 18, 482–483 (2002).

50. Balazsi, G., Kay, K. A., Barabasi, A. L. & Oltvai, Z. N. Spurious spatial periodicity of co-expression in microarray data due to printing design. Nucleic Acids Res. 31, 4425–4433 (2003).
51. Papp, B., Pál, C. & Hurst, L. D. Evolution of cis-regulatory elements in duplicated genes of yeast. Trends Genet. 19, 417–422 (2003).
52. Kruglyak, S. & Tang, H. Regulation of adjacent yeast genes. Trends Genet. 16, 109–111 (2000).
53. Wright, K. L. et al. Coordinate regulation of the human Tap1 and Lmp2 genes from a shared bidirectional promoter. J. Exp. Med. 181, 1459–1471 (1995).
54. Trinklein, N. D. et al. An abundance of bidirectional promoters in the human genome. Genome Res. 14, 62–66 (2004).
55. Gray, T. A., Saitoh, S. & Nicholls, R. D. An imprinted, mammalian bicistronic transcript encodes two independent proteins. Proc. Natl Acad. Sci. USA 96, 5616–5621 (1999).
56. Reiss, J. et al. Mutations in a polycistronic nuclear gene associated with molybdenum cofactor deficiency. Nature Genet. 20, 51–53 (1998).
57. Nanbru, C. et al. Translation of the human c-myc P0 tricistronic mRNA involves two independent internal ribosome entry sites. Oncogene 20, 4270–4280 (2001).
58. Hawkins, A. R. The complex Arom locus of Aspergillus nidulans. Evidence for multiple gene fusions and convergent evolution. Curr. Genet. 11, 491–498 (1987).
59. Zhang, X. & Smith, T. F. Yeast ‘operons’. Microb. Comp. Genomics 3, 133–140 (1998).
60. Cremer, T. & Cremer, C. Chromosome territories, nuclear architecture and gene regulation in mammalian cells. Nature Rev. Genet. 2, 292–301 (2001).
61. van Driel, R., Fransz, P. F. & Verschure, P. J. The eukaryotic genome: a system regulated at different hierarchical levels. J. Cell Sci. 116, 4067–4075 (2003).
62. de Laat, W. & Grosveld, F. Spatial organization of gene expression: the active chromatin hub. Chromosome Res. 11, 447–459 (2003).
63. Eberharter, A. & Becker, P. B. Histone acetylation: a switch between repressive and permissive chromatin. Second in review series on chromatin dynamics. EMBO Rep. 3, 224–229 (2002).
64. Strahl, B. D. & Allis, C. D. The language of covalent histone modifications. Nature 403, 41–45 (2000).
65. Turner, B. M. Cellular memory and the histone code. Cell 111, 285–291 (2002).
66. Labrador, M. & Corces, V. G. Setting the boundaries of chromatin domains and nuclear organization. Cell 111, 151–154 (2002).
67. Robyr, D. et al. Microarray deacetylation maps determine genome-wide functions for yeast histone deacetylases. Cell 109, 437–446 (2002). Acetylation microarrays are used to uncover a striking ‘division of labour’ for yeast histone deacetylases, with individual deacetylases controlling highly specific chromosomal domains.
68. Andrulis, E. D., Neiman, A. M., Zappulla, D. C. & Sternglanz, R. Perinuclear localization of chromatin facilitates transcriptional silencing. Nature 394, 592–595 (1998). A manipulation experiment that shows that perinuclear localization helps to establish transcriptionally silent chromatin.
69. Tanabe, H. et al. Evolutionary conservation of chromosome territory arrangements in cell nuclei from higher primates. Proc. Natl Acad. Sci. USA 99, 4424–4429 (2002).
70. Shopland, L. S., Johnson, C. V., Byron, M., McNeil, J. & Lawrence, J. B. Clustering of multiple specific genes and gene-rich R-bands around SC-35 domains: evidence for local euchromatic neighborhoods. J. Cell Biol. 162, 981–990 (2003). Evidence that chromosomal bands relate to nuclear clustering of genes around SC-35 domains.
71. Saccone, S., Pavlicek, A., Federico, C., Paces, J. & Bernardi, G. Genes, isochores and bands in human chromosomes 21 and 22. Chromosome Res. 9, 533–539 (2001).
72. Lercher, M. J., Urrutia, A. O., Pavlicek, A. & Hurst, L. D. A unification of mosaic structures in the human genome. Hum. Mol. Genet. 12, 2411–2415 (2003).
73. Thompson, M., Haeusler, R. A., Good, P. D. & Engelke, D. R. Nucleolar clustering of dispersed tRNA genes. Science 302, 1399–1401 (2003). tRNA genes are shown to be unclustered in one dimension (linear order on chromosomes) but highly clustered when considered in three dimensions (that is, intra-nuclear location).
74. Schubeler, D. et al. Genome-wide DNA replication profile for Drosophila melanogaster: a link between transcription and replication timing. Nature Genet. 32, 438–442 (2002).
75. Raghuraman, M. K. et al. Replication dynamics of the yeast genome. Science 294, 115–121 (2001).

76. Regev, A., Lamb, M. J. & Jablonka, E. The role of DNA methylation in invertebrates: developmental regulation or genome defense? Mol. Biol. Evol. 15, 880–891 (1998).
77. Coghlan, A. & Wolfe, K. H. Fourfold faster rate of genome rearrangement in nematodes than in Drosophila. Genome Res. 12, 857–867 (2002).
78. Seoighe, C. et al. Prevalence of small inversions in yeast gene order evolution. Proc. Natl Acad. Sci. USA 97, 14433–14437 (2000).
79. Ranz, J. M., Gonzalez, J., Casals, F. & Ruiz, A. Low occurrence of gene transposition events during the evolution of the genus Drosophila. Evolution 57, 1325–1335 (2003).
80. Gonzalez, J., Ranz, J. M. & Ruiz, A. Chromosomal elements evolve at different rates in the Drosophila genome. Genetics 161, 1137–1154 (2002).
81. Rynditch, A. V., Zoubak, S., Tsyba, L., Tryapitsina-Guley, N. & Bernardi, G. The regional integration of retroviral sequences into the mosaic genomes of mammals. Gene 222, 1–16 (1998).
82. Lefai, E., Fernandez-Moreno, M. A., Kaguni, L. S. & Garesse, R. The highly compact structure of the mitochondrial DNA polymerase genomic region of Drosophila melanogaster: functional and evolutionary implications. Insect Mol. Biol. 9, 315–322 (2000).
83. Elo, A., Lyznik, A., Gonzalez, D. O., Kachman, S. D. & Mackenzie, S. A. Nuclear genes that encode mitochondrial proteins for DNA and RNA metabolism are clustered in the Arabidopsis genome. Plant Cell 15, 1619–1631 (2003).
84. Glukhova, L. A. et al. Localization of HTLV-1 and HIV-1 proviral sequences in chromosomes of persistently infected cells. Chromosome Res. 7, 177–183 (1999).
85. Vinogradov, A. E. Isochores and tissue-specificity. Nucleic Acids Res. 31, 5212–5220 (2003).
86. Lawrence, J. G. & Roth, J. R. Selfish operons: horizontal transfer might drive the evolution of gene clusters. Genetics 143, 1843–1860 (1996).
87. Lawrence, J. G. Gene organization: selection, selfishness, and serendipity. Annu. Rev. Microbiol. 57, 419–440 (2003).
88. Pál, C. & Hurst, L. D. Evidence against the selfish operon hypothesis. Trends Genet. (in the press).
89. Bodmer, W. F. & Parsons, P. A. Linkage and recombination in evolution. Adv. Genet. 11, 1–100 (1962).
90. Charlesworth, D. & Charlesworth, B. Theoretical genetics of Batesian mimicry II. Evolution of supergenes. J. Theor. Biol. 55, 305–324 (1975).
91. Nei, M. Modification of linkage intensity by natural selection. Genetics 57, 625–641 (1967).
92. Otto, S. P. & Lenormand, T. Resolving the paradox of sex and recombination. Nature Rev. Genet. 3, 252–261 (2002).
93. Fisher, R. A. The Genetical Theory of Natural Selection (Clarendon, Oxford, 1930).
94. Sinervo, B. & Svensson, E. Correlational selection and the evolution of genomic architecture. Heredity 89, 329–338 (2002).
95. Ford, E. B. Ecological Genetics (Chapman and Hall, London, 1971).
96. Ferris, P. J., Armbrust, E. V. & Goodenough, U. W. Genetic structure of the mating-type locus of Chlamydomonas reinhardtii. Genetics 160, 181–200 (2002).
97. Hurst, L. D. The evolution of genomic anatomy. Trends Ecol. Evol. 14, 108–112 (1999).
98. Lyttle, T. W. Segregation distorters. Annu. Rev. Genet. 25, 511–557 (1991).
99. Hogstrand, K. & Bohme, J. Gene conversion can create new MHC alleles. Immunol. Rev. 167, 305–317 (1999).
100. Hurst, L. D. & Smith, N. G. C. The evolution of concerted evolution. Proc. R. Soc. Lond. B 265, 121–127 (1998).
101. Pál, C. & Hurst, L. D. Evidence for co-evolution of gene order and recombination rate. Nature Genet. 33, 392–395 (2003). Evidence that essential genes reside in regions of low recombination in yeast and worm. Evidence is also presented to indicate that this is not the result of tandem duplicates or co-expression.
102. Kamath, R. S. et al. Systematic functional analysis of the Caenorhabditis elegans genome using RNAi. Nature 421, 231–237 (2003). First large-scale assay of gene dispensability in a multicellular organism showing that the essential genes cluster in regions of low recombination.
103. Gu, Z. et al. Role of duplicate genes in genetic robustness against null mutations. Nature 421, 63–66 (2003).
104. Nei, M. Genome evolution: let’s stick together. Heredity 90, 411–412 (2003).
105. Schaeffer, S. W. et al. Evolutionary genomics of inversions in Drosophila pseudoobscura: evidence for epistasis. Proc. Natl Acad. Sci. USA 100, 8319–8324 (2003).
106. Gessler, D. D. & Xu, S. On the evolution of recombination and meiosis. Genet. Res. 73, 119–131 (1999).

VOLUME 5 | APRIL 2004 | 309

107. Lercher, M. J. & Hurst, L. D. Human SNP variability and mutation rate are higher in regions of high recombination. Trends Genet. 18, 337–340 (2002).
108. Perry, J. & Ashworth, A. Evolutionary rate of a gene affected by chromosomal position. Curr. Biol. 9, 987–989 (1999).
109. Chuang, J. H. & Li, H. Functional bias and spatial organization of genes in mutational hot and cold regions in the human genome. PLoS Biol. 2, e29 (2004).
110. Ranz, J. M., Casals, F. & Ruiz, A. How malleable is the eukaryotic genome? Extreme rate of chromosomal rearrangement in the genus Drosophila. Genome Res. 11, 230–239 (2001).
111. Hurst, L. D., Williams, E. J. B. & Pál, C. Natural selection promotes the conservation of linkage of co-expressed genes. Trends Genet. 18, 604–606 (2002). First evidence that selection acts to preserve linked pairs of co-expressed genes, even after allowing for the effect of intergene distance.
112. Huynen, M. A. & Snel, B. in Frontiers in Computational Genomics (eds Galperin, M. Y. & Koonin, E. V.) 145–166 (Horizon Scientific Press, Wymondham, UK, 2003).
113. Danchin, E. G., Abi-Rached, L., Gilles, A. & Pontarotti, P. Conservation of the MHC-like region throughout evolution. Immunogenet. 55, 141–148 (2003).
114. Lynch, M. & Conery, J. S. The origins of genome complexity. Science 302, 1401–1404 (2003).

115. Vinogradov, A. E. DNA helix: the importance of being GC-rich. Nucleic Acids Res. 31, 1838–1844 (2003).
116. Durand, D. & Sankoff, D. Tests for gene clustering. J. Comput. Biol. 10, 453–482 (2003).
117. Lefebvre, J. F., El-Mabrouk, N., Tillier, E. & Sankoff, D. Detection and validation of single gene inversions. Bioinformatics 19 (Suppl. 1), I190–I196 (2003).
118. Bradnam, K. R., Seoighe, C., Sharp, P. M. & Wolfe, K. H. G+C content variation along and among Saccharomyces cerevisiae chromosomes. Mol. Biol. Evol. 16, 666–675 (1999).
119. Hurst, L. D. Why are there only 2 sexes? Proc. R. Soc. Lond. B 263, 415–422 (1996).
120. Hutson, V. & Law, R. Four steps to two sexes. Proc. R. Soc. Lond. B 253, 43–51 (1993).
121. Armbrust, E. V., Ferris, P. J. & Goodenough, U. W. A mating type-linked gene cluster expressed in Chlamydomonas zygotes participates in the uniparental inheritance of the chloroplast genome. Cell 74, 801–811 (1993).
122. Feldman, M. W. & Otto, S. P. A comparative approach to the theoretical population-genetics theory of segregation distortion. Am. Nat. 137, 443–456 (1991).
123. Thomson, G. J. & Feldman, M. W. Population genetics of modifiers of meiotic drive. II. Linkage modification in the segregation distortion system. Theor. Popul. Biol. 5, 155–162 (1974).

124. Florens, L. et al. A proteomic view of the Plasmodium falciparum life cycle. Nature 419, 520–526 (2002).

Acknowledgements
We wish to thank two anonymous reviewers and J. Lawrence for comments on an earlier version of the manuscript. We also thank F. Grosveld, A. Ward, R. Kelsh and L. Weinert for discussion. M.J.L. is funded by a Royal Society University Research Fellowship. L.D.H. and C.P. are funded by the Biotechnology and Biological Sciences Research Council.

Competing interests statement
The authors declare that they have no competing financial interests.

Online links
DATABASES
The following terms in this article are linked online to:
LocusLink: http://www.ncbi.nlm.nih.gov/LocusLink
AIRC | GPAT | Sd
FURTHER INFORMATION
Laurence Hurst’s web page: http://www.bath.ac.uk/biosci/hurst.htm
Martin Lercher’s web page: http://www.bath.ac.uk/biosci/lercher.htm

CORRECTION

TRANSGENE INTROGRESSION FROM GENETICALLY MODIFIED CROPS TO THEIR WILD RELATIVES
C. Neal Stewart, Matthew D. Halfhill and Suzanne I. Warwick
Nature Reviews Genetics 4, 806–817 (2003); doi:10.1038/nrg1179

In reference to an article that appeared in Nature in 2001 (Quist and Chapela, Nature 414, 541–543 (2001)), it was incorrectly stated that “After much controversy, Nature retracted the paper because introgression per se was not shown” (page 806). The article was never formally retracted by Nature (see editorial footnote to Nature 417, 898 (2002)). In fact, Nature concluded that although “the evidence is not sufficient to justify the publication of the original paper”, it was best “to allow [the] readers to judge the science for themselves” (see editorial footnote to Nature 416, 600–601 (2002)).


www.nature.com/reviews/genetics