Functional variation and evolution of non-coding DNA

Christine P Bird, Barbara E Stranger and Emmanouil T Dermitzakis

The focus of large genomic studies has shifted from only looking at genes and protein-coding sequences to exploring the full set of elements in each genome. The explosion of comparative sequencing data has led to an increase in methodologies, approaches and ideas on how to analyze the unknown fraction of the genome, namely the non-protein-coding fraction. The main issues relate to the discovery, evolutionary analysis and natural variation of non-coding DNA, and the parameters that prevent us from fully understanding the properties of non-coding DNA.

Addresses
The Wellcome Trust Sanger Institute, Wellcome Trust Genome Campus, Cambridge, CB10 1SA, UK

Corresponding author: Dermitzakis, Emmanouil T ([email protected])

Current Opinion in Genetics & Development 2006, 16:559–564

This review comes from a themed issue on Genomes and evolution
Edited by Chris Tyler-Smith and Molly Przeworski

Available online 19th October 2006

0959-437X/$ – see front matter © 2006 Elsevier Ltd. All rights reserved.

DOI 10.1016/j.gde.2006.10.003

Introduction

In recent years, non-coding DNA has attracted a lot of attention [1,2,3,4]. This has been mainly due to the realization that a large fraction (in some cases the majority) of functional DNA in the human and other genomes is not encoded by protein-coding sequences but by other sequences, the exact function of which remains elusive. For many years before this realisation, most attention and resources were directed toward the study of protein-coding sequences, and it required effort to convince the community that the non-coding part of the genome harbours exciting and valuable jewels. A historical view on genome annotation will show us that for many years the presence of mRNAs and open reading frames was considered proof that a sequence was genic. For this reason, large efforts to identify cDNAs and, subsequently, expressed sequence tags (ESTs) were initiated. Such efforts facilitated gene discovery and annotation, and evidence-based methodologies for gene discovery were developed. In the past decade, comparative methods have been developed to explore genome function further by looking at sequence conservation, starting with the analysis of the mouse genome [4].

In this sequence of events, it is understandable that gene sequences were given priority. But one wonders what the focus would have been if large-scale sequencing and comparative efforts had preceded cDNAs and ESTs. We might have come to the realization that constrained (i.e. conserved) sequences are important, regardless of whether they undergo transcription. It is easy to argue that evolutionary constraint is a much more reliable indicator of functionality than transcription is, and the delay in focusing on non-coding DNA was an example of ascertainment bias in our experimental methodologies (Figure 1). Here, we review some of the old, but mostly the recent, data on non-coding DNA evolution and function. The explosion of experimental and analytical approaches that have been applied to non-coding sequences cannot be fully captured in this review; instead, we have attempted to give the highlights and to indicate some exciting new directions of research. For the purposes of this review, non-coding DNA is defined as the DNA regions that do not correspond to protein-coding DNA or non-coding RNA genes (e.g. rRNAs and microRNAs), because the world of non-coding RNA genes constitutes a big category of functional genomic sequences that we cannot address here.

Evolution and comparative analysis of non-coding elements

From the early days of comparative sequence analysis, it was clear that there is a large fraction of conserved non-coding DNA. Preliminary studies indicated that much of it is probably regulatory and that relying simply on sequence conservation would aid the identification of regulatory elements [5,6]. Two surprising results were nevertheless not predicted by the original studies. First, conservation of non-coding DNA goes beyond genic regions [1,2,3,7], and in fact there appears to be a negative correlation between the density of conserved non-coding DNA and that of exonic sequences [8]. Second, the sheer amount of conserved non-coding sequence was unexpected. In the mouse genome study [4], it was estimated that at least 5% of the human genome was selectively constrained when compared with the mouse genome, of which only about a third corresponds to protein-coding DNA. This is of course an underestimate, mainly because this 5% corresponds to the amount of constrained DNA that is responsible for the biological processes common to human and mouse. Some of the common processes might still rely on divergent sequence features, and of course there are obvious species-specific properties that rely on unique functional sequences in each genome.
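To make the idea of relying simply on sequence conservation concrete, the following is a minimal sketch of a sliding-window identity scan over a pairwise alignment. It is an illustration only and not the method of any study cited here: the function name, the window length and the identity threshold are all hypothetical choices, and real analyses use much longer windows and explicit evolutionary models.

```python
def conserved_windows(aligned_a, aligned_b, window=10, min_identity=0.8):
    """Return (start, identity) for alignment windows at or above a threshold.

    aligned_a and aligned_b are two rows of a pairwise alignment of equal
    length, with '-' marking gaps. The window size and threshold are
    illustrative assumptions, not values taken from the review.
    """
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have equal length")
    hits = []
    for start in range(len(aligned_a) - window + 1):
        column_pairs = zip(aligned_a[start:start + window],
                           aligned_b[start:start + window])
        # A column counts as a match only if both bases agree and neither is a gap.
        matches = sum(1 for x, y in column_pairs if x == y and x != "-")
        identity = matches / window
        if identity >= min_identity:
            hits.append((start, identity))
    return hits


# Toy alignment: the first ten columns are identical, the rest differ.
species_a = "AAAAAAAAAACCCCCCCCCC"
species_b = "AAAAAAAAAAGGGGGGGGGG"
hits = conserved_windows(species_a, species_b, window=10, min_identity=0.8)
```

A scan of this kind flags candidate conserved blocks but cannot by itself tell selective constraint apart from a low regional mutation rate, which is the question taken up later in the review.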


Figure 1

Illustration of real and hypothetical sequence of events with respect to coding and non-coding DNA inference of function. (a) In the real scenario, the realization of the existence of transcription came early and, therefore, transcribed sequences were given a lot of attention. Conserved sequences came later, so there was a lot of scepticism as to whether they were functional, because they were not transcribed. (b) In the hypothetical scenario, conservation was observed before transcription was found. If this was the case, we would assume that it is more likely for a conserved sequence that was not transcribed to be functional than it is for a transcribed sequence that was not conserved. The figure shows a clear example of ascertainment bias where the order of discoveries affects the confidence we put on the significance of subsequent discoveries.

Deeper and more rigorous approaches followed, which explored multi-species comparisons of protein-coding and non-coding DNA to dissect the evolutionary properties and to develop models for the identification of constrained non-coding sequences [3,7,9]. Some of these models are routinely used to interrogate the increasing number of sequenced genomes [2,10]. If a sufficient number of vertebrate sequences becomes available, and if the methods are applied at the whole-genome level, they promise an invaluable resource not only for identifying constrained non-coding elements but also for obtaining nucleotide-level estimates of selective constraint. Such a resource, coupled with experimental approaches, will enable further understanding of the function of non-coding DNA and the consequences of individual nucleotide substitutions in a similar way to that of protein-coding DNA.

Much attention has been given to single nucleotide changes, but these are not the only processes that underlie sequence evolution. One cannot ignore larger-scale events that affect many nucleotides at the same time, events such as indels or segmental duplications. Studies in Drosophila have shown that there is a suppression of indels in conserved non-coding DNA, most probably due to selective constraint on the primary sequence but also on the spacer sequences between non-coding elements [3,7,9,11,12]. In a recent study in mammals, the deficit of interspecific indels was used to estimate the fraction of DNA that is under selective constraint [13]. Overall, these results suggest that more attention should be given to indel rates, because these types of mutational process highlight evolutionary properties that might differ from those identified by nucleotide substitutions.

The element that we understand best in non-coding DNA is the traditional transcriptional regulatory region (i.e. promoter, enhancer and suppressor), the key components of which are the transcription factor binding sites. In the days before comparative data became available, it was expected that transcription factor binding sites would be conserved, on the basis of what we had seen in coding DNA. Transcription factor binding sites do not, however, always stand out as being highly conserved in mammalian alignments, although a fraction of them can still be functionally conserved [5]. This result had been shown before in Drosophila studies [14], with the seminal work on the even-skipped stripe 2 enhancer, which showed a clear case of transcription factor binding site turnover. A recent study highlighted the magnitude of such phenomena, with the identification of a regulatory region that has no primary sequence conservation between human and zebrafish but has an equivalent function in both species [15]. It has become obvious that sequence conservation can only help in the identification of a fraction, and probably a biased one, of regulatory regions in mammalian genomes. Models have been and are being developed to distinguish neutral from constrained DNA [16–18], in particular regulatory DNA, but these are still in embryonic stages.

Recent evolution of non-coding DNA

Using conservation to identify non-coding elements is only one side of the story. Ideally, one would like to use patterns of recent evolution to learn how much conserved non-coding DNA, as identified by vertebrate, insect or other sequence alignments, actually has a function in the focus genome today and what the range of functions is in that genome. The conservation of non-coding DNA suggests that a large fraction has similar levels of selective constraint to protein-coding DNA. The study of non-coding DNA does not, however, benefit from the useful properties of a transcriptional and protein product as does coding DNA (e.g. synonymous or non-synonymous nucleotide changes) to assess the implications of nucleotide substitutions. Modified versions of tests for natural selection [19] can be applied to non-coding sequences, utilizing nucleotide variation data from the comparison of whole or partial genome sequences to study variation between closely related species (i.e. interspecific) or within the same species (i.e. population variation).

Is conserved non-coding DNA really selectively constrained?

To attribute functional relevance to non-coding DNA, it must be shown that its primary sequence (or a higher order property of it) is actively maintained to a higher degree than neutral DNA, rather than passively being conserved purely owing to low regional mutation rates. The signature of selective constraint is manifested as reduced levels of polymorphisms and divergence, and an excess of rare variants (Figure 2). This is measurable by reduced interspecies substitution rates or the suppression of the frequency of new (derived) alleles within a population. In a study assessing sequence divergence in two closely related Drosophila species [20], Tajima’s D [21] was applied as a test for neutrality in coding and non-coding DNA. Non-coding DNA showed a pattern in its distribution of negative Tajima’s D values similar to that of non-synonymous sites when both were compared with synonymous sites, as would be expected if both non-coding DNA and non-synonymous sites are under purifying selection. Keightley and colleagues [22] demonstrated that selective constraint was present in mammals within conserved non-coding sequences and their flanking regions but concluded that it is 50% lower in hominids than in murids. Two studies focusing on the level of sequence constraint in the non-coding DNA that flanks mammalian genes appeared to draw conflicting conclusions [22,23], but they agreed that hominid non-coding sequences upstream of genes do show increased sequence divergence. Bush and Lahn [23] emphasized that, despite an overall increase in divergence rate in hominid non-coding regions, significant constraint remains at some sites.

Figure 2

Distribution of new variants in haplotypes in three different classes of sequences (neutral non-coding, coding and conserved non-coding). This figure illustrates the predominant signal of purifying selection in functional sequences.

One problem with assessing nucleotide variation using interspecific divergence is that it is affected by varying mutation rates. An alternative method to study nucleotide variation is to use frequency spectra of nucleotide variants (e.g. single nucleotide polymorphisms [SNPs]) within a population. Frequency spectra are unaffected by variable regional mutation rates, because they describe the properties of the variation of a site after the initial mutation has happened. Instead, they can be affected by any bias in the selection of the polymorphic sites, such as an over-representation in protein-coding DNA or ascertainment in small samples and genotyping in large samples. This is particularly true in the study of human data, where complete resequencing of regions or the whole genome is currently a very difficult task. To illustrate the selective constraint in non-coding DNA in the whole of the euchromatic human genome, Drake and colleagues [24] used SNPs from the HapMap project [25]. This study assessed the SNPs within conserved non-coding sequences in the human genome as a group and used the chimpanzee sequence to infer the ancestral allele. Comparison of the frequency spectrum of derived alleles (i.e. the derived allele frequency [DAF]) for SNPs in coding sequences and conserved non-coding sequences shows that both are under a similar level of selective constraint, and therefore the interspecific conservation observed in conserved non-coding sequences is not due to a low regional mutation rate.
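The frequency-based reasoning in this section can be illustrated with a short sketch. Given the number of chromosomes carrying the derived allele at each SNP, it computes the fraction of rare derived alleles and Tajima’s D using the standard published formula [21]. The function names and the toy counts are hypothetical, and the code is a simplified illustration rather than the pipeline used in the studies cited above.

```python
from math import sqrt


def rare_fraction(derived_counts, n, threshold=0.1):
    """Fraction of SNPs whose derived allele frequency is at or below threshold."""
    return sum(1 for k in derived_counts if k / n <= threshold) / len(derived_counts)


def nucleotide_diversity(derived_counts, n):
    """Mean number of pairwise differences (pi), summed over all SNPs."""
    pairs = n * (n - 1) / 2
    return sum(k * (n - k) for k in derived_counts) / pairs


def tajimas_d(derived_counts, n):
    """Tajima's D for S segregating sites sampled in n chromosomes.

    derived_counts holds, for each SNP, how many of the n chromosomes carry
    the derived allele (each count must satisfy 0 < k < n).
    """
    s = len(derived_counts)
    if n < 4 or s == 0:
        raise ValueError("need at least 4 chromosomes and 1 segregating site")
    if any(k <= 0 or k >= n for k in derived_counts):
        raise ValueError("every site must be polymorphic in the sample")
    a1 = sum(1 / i for i in range(1, n))
    a2 = sum(1 / i ** 2 for i in range(1, n))
    b1 = (n + 1) / (3 * (n - 1))
    b2 = 2 * (n * n + n + 3) / (9 * n * (n - 1))
    c1 = b1 - 1 / a1
    c2 = b2 - (n + 2) / (a1 * n) + a2 / a1 ** 2
    e1 = c1 / a1
    e2 = c2 / (a1 ** 2 + a2)
    pi = nucleotide_diversity(derived_counts, n)
    theta_w = s / a1  # Watterson's estimator
    return (pi - theta_w) / sqrt(e1 * s + e2 * s * (s - 1))


# Made-up samples of 10 chromosomes with 5 SNPs each.
d_rare = tajimas_d([1, 1, 1, 1, 1], 10)    # all singletons
d_common = tajimas_d([5, 5, 5, 5, 5], 10)  # all intermediate-frequency variants
```

In this toy input, the sample made up only of singletons gives a negative D and the sample of intermediate-frequency variants gives a positive D, which matches the expectation that purifying selection produces an excess of rare variants.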

Is there some aspect of positive DNA selection?

The amount of selective constraint appears to differ between organisms. Could this be due to adaptation? Are the weaker signatures of selective constraint that have been detected in some cases a result of some positive selection? Three studies [22,23,26] agree that regions conserved between primate and rodent groups show higher relative conservation in rodents, indicating a genome-wide relaxation of selective constraint in primates. They all detected a small fitness effect of mutations in highly conserved non-coding regions. But it is likely that some of the effect we see is due to a mixture of constraint and positive selection. In Drosophila, Andolfatto [20] looked at population-level variability by studying the frequency of polymorphisms at coding and non-coding sites. A modified McDonald-Kreitman test [19], comparing polymorphism within species with divergence between species, was applied to distinguish between variation of mutation rate under neutrality and signatures of negative or positive selection in Drosophila. Untranslated regions (UTRs) of genes as well as other non-coding sequences showed that a proportion of nucleotide divergence has been driven to fixation by positive selection. Non-coding sequences in Saccharomyces cerevisiae have been shown to be hypervariable in a study comparing rates of variation within, and between, species in coding and non-coding sequences at polymorphic sites [27]. This variability could be due to common, but transient, mutational hotspots or rarer events caused by natural selection on mutations affecting gene expression. Population subdivision would prevent the spread of positively selected alleles through the entire species; the alternative explanation could be balancing or diversifying selection. The studies to date suggest that some conserved non-coding sequences have undergone loss of selective constraint or have undergone positive selection (Table 1). The adaptive impact underlying this change in selection has yet to be demonstrated.

Table 1
Fraction of selectively constrained genome in various organisms.

Organism                  Fraction of genome constrained   Reference
Drosophila                22%                              [11]
Drosophila                40–70%                           [20]
Caenorhabditis elegans    33%                              [42]
Human                     5%                               [4]
Human                     2.5–3.25%                        [13]

The apparent differences in the Drosophila estimates come from the fact that different species’ evolutionary distances were used for the estimation. The discrepancy in the human estimates (although probably within the error of the estimates) comes from the fact that one study [4] used nucleotide substitutions and the other [13] indels to produce the estimates.

It has been highlighted recently that sequence variation occurs not only in the form of SNPs but also structurally in the form of indels and larger segments described as copy number variants, the extent of which is still being explored [28]. Several diseases have been associated with altered gene expression (for a good summary of recent papers, see reviews by Feuk et al. [28] and Kleinjan and van Heyningen [29]) as a result of ‘position effects’, that is, chromosomal rearrangement of sequences outside the transcription and promoter region [29]. These rearrangements result in the detachment, removal or disruption of long-range enhancers and their binding sites in the lost or disrupted non-coding DNA. Interestingly, recent findings by Merla and colleagues [30] suggest that gene expression changes are not always directly correlated with copy number. Their findings agree with those of Kleinjan and van Heyningen [31], who found that functional gene domains extend well beyond the transcription units highlighted by previous studies on a 200 kb downstream regulatory domain of PAX6 [32,33]. These results give a strong message that genome context and not simply sequence is important for genome function.
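The logic of the McDonald-Kreitman test [19] referred to earlier in this section can be sketched as a 2×2 comparison of fixed differences and polymorphisms in a test class of sites (for example non-coding) against a putatively neutral reference class (for example synonymous). The sketch below shows the basic test only, not the modified version applied in [20]; the function names and all counts are invented for illustration.

```python
from math import comb


def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher's exact test p-value for the 2x2 table [[a, b], [c, d]]."""
    row1, row2, col1 = a + b, c + d, a + c
    total = row1 + row2

    def prob(x):
        # Hypergeometric probability of x in the top-left cell, margins fixed.
        return comb(row1, x) * comb(row2, col1 - x) / comb(total, col1)

    observed = prob(a)
    low, high = max(0, col1 - row2), min(row1, col1)
    # Sum the probabilities of all tables no more likely than the observed one.
    return sum(p for p in (prob(x) for x in range(low, high + 1))
               if p <= observed * (1 + 1e-9))


def mk_test(d_test, p_test, d_neutral, p_neutral):
    """Return (neutrality index, alpha, p-value) from divergence and polymorphism counts.

    d_* are fixed differences between species and p_* are polymorphisms within
    a species, for the test class and the neutral reference class.
    """
    if min(d_test, p_neutral, d_neutral) == 0:
        raise ValueError("counts in the ratio denominators must be non-zero")
    neutrality_index = (p_test / d_test) / (p_neutral / d_neutral)
    alpha = 1.0 - neutrality_index  # estimated fraction of adaptive divergence
    p_value = fisher_exact_two_sided(d_test, p_test, d_neutral, p_neutral)
    return neutrality_index, alpha, p_value


# Hypothetical counts: an excess of divergence relative to polymorphism in the test class.
ni, alpha, p_value = mk_test(d_test=40, p_test=20, d_neutral=30, p_neutral=45)
```

With these invented counts the neutrality index is below 1 and alpha is positive, the pattern expected when part of the divergence in the test class has been driven to fixation by positive selection. A neutrality index above 1 would instead point to an excess of polymorphism, as expected under weak purifying selection.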

Functional experimental variation

One of the key issues in the exploration of function of non-coding DNA is experimental validation. This has proven to be a difficult task, because only a small fraction of elements can be tested, and publication bias (i.e. only positive results get published) has led to an ascertainment bias in what is tested. This has resulted in strong supporting evidence for the regulatory role of non-coding sequences that are conserved in most vertebrates (e.g. mammals, chicken and fish) [34], but we have very little experimental evidence for less conserved non-coding sequences that are mammalian only. It is possible that the mammalian-specific or primate-specific elements, which are the vast majority in the human genome, serve a much wider range of roles than traditional regulatory regions. The ENCODE project [35] is starting to reveal some of these functions.

Exploring sequence variation within or between species is providing interesting insights into the contribution of non-coding DNA to phenotypic variation and evolution. But, ultimately, one would wish to get direct functional information for some of the non-coding sequences, in order to explore the range of functions that such sequences perform. One approach to this is to perform reporter assays that specifically test for functional variation. Only a limited number of such studies have been successful, mainly owing to the fact that reporter assays are not sensitive enough to detect small differences in activation. An alternative approach is to detect the natural variants in non-coding DNA that explain gene expression differences between individuals or haplotypes. A range of studies have been performed in humans, mice and yeast that specifically test for association of non-coding variants with gene expression variation [36–39]. Some of these experiments are done as genotypic associations, whereas others have tested for allele-specific effects within an organism or cell [40,41]. As methodologies become more sensitive to small effects, a wider range of experiments can be used, such as binding assays, to reveal haplotype-specific functional effects.
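As a rough illustration of a genotypic association, the sketch below regresses an expression measurement on the number of copies of one allele (0, 1 or 2) carried by each individual. It assumes an additive model fitted by ordinary least squares; the function name and the data are hypothetical, and the cited studies use their own statistical frameworks, including correction for multiple testing, which this sketch omits.

```python
def expression_association(genotypes, expression):
    """Least-squares fit of expression level on allele count.

    genotypes: allele counts (0, 1 or 2) per individual.
    expression: expression measurement per individual, in the same order.
    Returns (slope, intercept, r_squared).
    """
    n = len(genotypes)
    if n != len(expression) or n < 3:
        raise ValueError("need matching lists with at least 3 individuals")
    mean_g = sum(genotypes) / n
    mean_e = sum(expression) / n
    sxx = sum((g - mean_g) ** 2 for g in genotypes)
    syy = sum((e - mean_e) ** 2 for e in expression)
    sxy = sum((g - mean_g) * (e - mean_e) for g, e in zip(genotypes, expression))
    if sxx == 0 or syy == 0:
        raise ValueError("genotypes and expression must both vary")
    slope = sxy / sxx  # expression change per extra copy of the allele
    intercept = mean_e - slope * mean_g
    r_squared = sxy ** 2 / (sxx * syy)  # share of expression variance explained
    return slope, intercept, r_squared


# Made-up data for six individuals at one non-coding SNP.
slope, intercept, r_squared = expression_association(
    [0, 0, 1, 1, 2, 2],
    [1.0, 1.2, 2.0, 2.2, 3.0, 3.2],
)
```

The slope estimates the change in expression per copy of the allele, and the squared correlation gives the share of expression variance that the variant accounts for in this sample.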

Conclusions

We have attempted to provide an overview of the evolutionary processes that non-coding DNA undergoes. It is apparent that both purifying and positive selection occur in these sequences, and also that much phenotypic variation maps to such regions. The timing is right, and the excitement in the community is there, to ensure new and surprising discoveries in the next few years.

References and recommended reading

Papers of particular interest, published within the annual period of review, have been highlighted as of special interest or of outstanding interest.

1. Dermitzakis ET, Reymond A, Lyle R, Scamuffa N, Ucla C, Deutsch S, Stevenson BJ, Flegel V, Bucher P, Jongeneel CV et al.: Numerous potentially functional but non-genic conserved sequences on human chromosome 21. Nature 2002, 420:578-582.

2. Siepel A, Bejerano G, Pedersen JS, Hinrichs AS, Hou M, Rosenbloom K, Clawson H, Spieth J, Hillier LW, Richards S et al.: Evolutionarily conserved elements in vertebrate, insect, worm, and yeast genomes. Genome Res 2005, 15:1034-1050.
This is the first study to describe selectively constrained sequences in a wide range of genomes.

3. Thomas JW, Touchman JW, Blakesley RW, Bouffard GG, Beckstrom-Sternberg SM, Margulies EH, Blanchette M, Siepel AC, Thomas PJ, McDowell JC et al.: Comparative analyses of multi-species sequences from targeted genomic regions. Nature 2003, 424:788-793.

4. Waterston RH, Lindblad-Toh K, Birney E, Rogers J, Abril JF, Agarwal P, Agarwala R, Ainscough R, Alexandersson M, An P et al.: Initial sequencing and comparative analysis of the mouse genome. Nature 2002, 420:520-562.
The first rigorous and genome-wide comparative study in vertebrates.

5. Dermitzakis ET, Clark AG: Evolution of transcription factor binding sites in mammalian gene regulatory regions: conservation and turnover. Mol Biol Evol 2002, 19:1114-1121.

6. Wasserman WW, Palumbo M, Thompson W, Fickett JW, Lawrence CE: Human–mouse genome comparisons to locate regulatory sites. Nat Genet 2000, 26:225-228.

7. Margulies EH, Blanchette M, Haussler D, Green ED: Identification and characterization of multi-species conserved sequences. Genome Res 2003, 13:2507-2518.

8. Dermitzakis ET, Reymond A, Antonarakis SE: Conserved nongenic sequences – an unexpected feature of mammalian genomes. Nat Rev Genet 2005, 6:151-157.

9. Dermitzakis ET, Reymond A, Scamuffa N, Ucla C, Kirkness E, Rossier C, Antonarakis SE: Evolutionary discrimination of mammalian conserved non-genic sequences (CNGs). Science 2003, 302:1033-1035.


10. Cooper GM, Brudno M, Green ED, Batzoglou S, Sidow A: Quantitative estimates of sequence divergence for comparative analyses of mammalian genomes. Genome Res 2003, 13:813-820.

11. Bergman CM, Kreitman M: Analysis of conserved noncoding DNA in Drosophila reveals similar constraints in intergenic and intronic sequences. Genome Res 2001, 11:1335-1345.

12. Bergman CM, Pfeiffer BD, Rincon-Limas DE, Hoskins RA, Gnirke A, Mungall CJ, Wang AM, Kronmiller B, Pacleb J, Park S et al.: Assessing the impact of comparative genomic sequence data on the functional annotation of the Drosophila genome. Genome Biol 2002, 3:RESEARCH0086.

13. Lunter G, Ponting CP, Hein J: Genome-wide identification of human functional DNA using a neutral indel model. PLoS Comput Biol 2006, 2:e5.
A very elegant use of indel fixation to study selective constraints in genomes.

14. Ludwig MZ, Bergman C, Patel NH, Kreitman M: Evidence for stabilizing selection in a eukaryotic enhancer element. Nature 2000, 403:564-567.

15. Fisher S, Grice EA, Vinton RM, Bessling SL, McCallion AS: Conservation of RET regulatory function from human to zebrafish without sequence similarity. Science 2006, 312:276-279.
An exciting study that showed the limits of conservation of function without sequence conservation.

16. Down TA, Hubbard TJ: NestedMICA: sensitive inference of over-represented motifs in nucleic acid sequence. Nucleic Acids Res 2005, 33:1445-1453.

17. Elnitski L, Hardison RC, Li J, Yang S, Kolbe D, Eswara P, O’Connor MJ, Schwartz S, Miller W, Chiaromonte F: Distinguishing regulatory DNA from neutral sites. Genome Res 2003, 13:64-72.

18. Xie X, Lu J, Kulbokas EJ, Golub TR, Mootha V, Lindblad-Toh K, Lander ES, Kellis M: Systematic discovery of regulatory motifs in human promoters and 3′ UTRs by comparison of several mammals. Nature 2005, 434:338-345.

19. McDonald JH, Kreitman M: Adaptive protein evolution at the Adh locus in Drosophila. Nature 1991, 351:652-654.

20. Andolfatto P: Adaptive evolution of non-coding DNA in Drosophila. Nature 2005, 437:1149-1152.

21. Tajima F: Statistical method for testing the neutral mutation hypothesis by DNA polymorphism. Genetics 1989, 123:585-595.

22. Keightley PD, Lercher MJ, Eyre-Walker A: Evidence for widespread degradation of gene control regions in hominid genomes. PLoS Biol 2005, 3:e42.

23. Bush EC, Lahn BT: Selective constraint on noncoding regions of hominid genomes. PLoS Comput Biol 2005, 1:e73.

24. Drake JA, Bird C, Nemesh J, Thomas DJ, Newton-Cheh C, Reymond A, Excoffier L, Attar H, Antonarakis SE, Dermitzakis ET et al.: Conserved noncoding sequences are selectively constrained and not mutation cold spots. Nat Genet 2006, 38:223-227.
The first illustration that conserved non-coding sequences are not mutation cold-spots.

25. Altshuler DM, Brooks LD, Chakravarti A, Collins FS, Daly MJ, Donnelly P: A haplotype map of the human genome. Nature 2005, 437:1299-1320.

26. Kryukov GV, Schmidt S, Sunyaev S: Small fitness effect of mutations in highly conserved non-coding regions. Hum Mol Genet 2005, 14:2221-2229.

27. Doniger SW, Huh J, Fay JC: Identification of functional transcription factor binding sites using closely related Saccharomyces species. Genome Res 2005, 15:701-709.

28. Feuk L, Carson AR, Scherer SW: Structural variation in the human genome. Nat Rev Genet 2006, 7:85-97.

29. Kleinjan DJ, van Heyningen V: Position effect in human genetic disease. Hum Mol Genet 1998, 7:1611-1618.

30. Merla G, Howald C, Henrichsen CN, Lyle R, Wyss C, Zabot MT, Antonarakis SE, Reymond A: Submicroscopic deletion in patients with Williams-Beuren syndrome influences expression levels of the nonhemizygous flanking genes. Am J Hum Genet 2006, 79:332-341.
An interesting study that illustrates the complexity of copy number effects in mammalian genomes.

31. Kleinjan DA, van Heyningen V: Long-range control of gene expression: emerging mechanisms and disruption in disease. Am J Hum Genet 2005, 76:8-32.

32. Tyas DA, Simpson TI, Carr CB, Kleinjan DA, van Heyningen V, Mason JO, Price DJ: Functional conservation of Pax6 regulatory elements in humans and mice demonstrated with a novel transgenic reporter mouse. BMC Dev Biol 2006, 6:21.

33. van Heyningen V, Williamson KA: PAX6 in sensory development. Hum Mol Genet 2002, 11:1161-1167.

34. Woolfe A, Goodson M, Goode DK, Snell P, McEwen GK, Vavouri T, Smith SF, North P, Callaway H, Kelly K et al.: Highly conserved non-coding sequences are associated with vertebrate development. PLoS Biol 2005, 3:e7.

35. ENCODE Project Consortium: The ENCODE (ENCyclopedia Of DNA Elements) Project. Science 2004, 306:636-640.

36. Brem RB, Kruglyak L: The landscape of genetic complexity across 5,700 gene expression traits in yeast. Proc Natl Acad Sci USA 2005, 102:1572-1577.

37. Cheung VG, Spielman RS, Ewens KG, Weber TM, Morley M, Burdick JT: Mapping determinants of human gene expression by regional and genome-wide association. Nature 2005, 437:1365-1369.

38. Pastinen T, Hudson TJ: Cis-acting regulatory variation in the human genome. Science 2004, 306:647-650.

39. Stranger BE, Forrest MS, Clark AG, Minichiello MJ, Deutsch S, Lyle R, Hunt S, Kahl B, Antonarakis SE, Tavare S et al.: Genome-wide association of gene expression variation in humans. PLoS Genet 2005, 1:e78.

40. Wittkopp PJ, Haerum BK, Clark AG: Evolutionary changes in cis and trans gene regulation. Nature 2004, 430:85-88.

41. Pastinen T, Ge B, Gurd S, Gaudin T, Dore C, Lemire M, Lepage P, Harmsen E, Hudson TJ: Mapping common regulatory variants to human haplotypes. Hum Mol Genet 2005, 14:3963-3971.

42. Shabalina SA, Kondrashov AS: Pattern of selective constraint in C. elegans and C. briggsae genomes. Genet Res 1999, 74:23-30.
