J Mol Evol (2007) 64:1–3 DOI: 10.1007/s00239-005-0120-5
Searching for Sequence Directed Mutagenesis in Eukaryotes Emmanuel D. Ladoukakis, Adam Eyre-Walker Centre for the Study of Evolution and School of Life Sciences, University of Sussex, Brighton BN1 9QG, UK Received: 18 May 2005 / Accepted: 3 October 2006 [Reviewing Editor: Dr. Richard Kliman]
Abstract. Sequence directed mutagenesis is a mechanism by which imperfect repeats "repair" each other to become perfect, generating mutations in the process. It is known to be prevalent in prokaryotes and has been implicated in several human genetic diseases. Here we test whether sequence directed mutagenesis occurs in the protein coding sequences of eukaryotes using extensive DNA sequence data from humans, mice, Drosophila, nematodes, yeast, and Arabidopsis. Using two tests we find little evidence of sequence directed mutagenesis. We conclude that sequence directed mutagenesis is not prevalent in eukaryotes and that the examples of human diseases apparently caused by sequence directed mutagenesis are probably coincidental. Key words: Sequence directed mutagenesis — Eukaryotes — Inverted repeats
Correspondence to: Adam Eyre-Walker; email: [email protected]
Introduction The genomes of most organisms are littered with short inverted repeats, which often differ from one another by a few nucleotides. These are potentially a source of mutations because imperfect repeats can undergo a process known as sequence directed mutagenesis, which generates perfect repeats and a set of mutations. The process by which this occurs is not fully understood. It seems to involve template switching and the formation of hairpin structures during DNA replication (Ripley 1982), which facilitates the "correction" of the imperfect repeats. Sequence directed mutagenesis does not appear to involve recombination and is consequently different from gene conversion. The process has been directly demonstrated in a number of bacteria, and a recent survey of bacterial genomes showed that almost all have an excess of short (7-bp) inverted repeats, suggesting that sequence directed mutagenesis is common in both eubacterial and archaebacterial genomes (van Noort et al. 2003). Furthermore, computer simulations have shown that sequence directed mutagenesis might be an important source of mutations (Fieldhouse and Golding 1991). The process has been less extensively studied in eukaryotes. Concurrent substitutions in human interferon (Golding and Glickman 1985), frameshift mutations in yeast (Ripley 1982), and a list of human diseases (Blisser 1998) have been attributed to sequence directed mutagenesis. However, neither the role nor the frequency of sequence directed mutagenesis in eukaryotes has been firmly established. A recent analysis has shown that eukaryotic genomes have an excess of inverted repeats (Cox and Mirkin 1997), but many of these appear to be due to recent duplications (Thomas et al. 2004). Here we set out to test whether sequence directed mutagenesis occurs in eukaryotic genomes using two tests. First, we test whether inverted repeats are more common than would be expected by chance. Second, we investigate whether perfect inverted repeats tend to be close together.
Materials and Methods To perform our analysis we extracted all protein coding exons longer than 2000 bp from the Caenorhabditis elegans, Drosophila melanogaster, Mus musculus, Homo sapiens, Saccharomyces cerevisiae, and Arabidopsis thaliana genomes using the ACNUC retrieval system. We restricted our analysis to exons for two reasons. First, we were initially interested in whether sequence directed mutagenesis was generating mutations associated with human disease, and most mutations which cause Mendelian diseases are found in protein coding sequences (Stenson et al. 2003). Second, considering coding sequences reduces the problem of repeats generated by transposable elements and spontaneous duplications of small sequences (Thomas et al. 2004). We restricted our analysis to exons of >2000 bp to ensure proper randomization of the sequences in our analysis. In total we retrieved 139 exons from C. elegans, 742 from Drosophila, 492 from human, 404 from mouse, 1209 from yeast, and 1039 from Arabidopsis.
To identify repeats we slid a window of 50, 100, and 500 bp across each exon. Within that window we counted (a) the number of perfect inverted repeats 6, 7, 8, and 9 bp long and (b) the distance between those repeats. Each length class contained only repeats of exactly that length, not shorter ones. For example, the count of 7-bp repeats did not include 6-bp repeats, because those belong to a different class.
The numbers of repeats and the distances between them were also counted in randomized sequences. To randomize a sequence we swapped synonymous codons, thus preserving the amino acid sequence and codon usage of the sequence. Because nucleotides can affect the mutation pattern of their adjacent neighbors, we further restricted the randomization by only swapping synonymous codons which were followed by the same base (e.g., CAC.G could be swapped with CAT.G but not CAT.C).
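The constrained randomization described above can be sketched as follows. This is a minimal illustration, not the authors' program: the function name `shuffle_synonymous` is hypothetical, the standard genetic code is assumed, and the sequence is assumed to be in frame and written in upper-case A, C, G, T. One further assumption goes beyond the text: codons are grouped by their own first base as well as by amino acid and following base, so that swapping a sixfold degenerate codon (Leu, Ser, Arg) can never change the base that follows the preceding codon.

```python
import random
from collections import defaultdict

# Standard genetic code, bases ordered T, C, A, G ("*" marks stop codons).
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    a + b + c: AMINO[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}

def shuffle_synonymous(seq, rng=None):
    """Permute synonymous codons that share the same following base.

    Assumption (not from the source): the group key also includes the
    codon's own first base, so every codon-boundary context is preserved.
    """
    rng = rng or random.Random()
    n = len(seq) // 3
    codons = [seq[3 * p:3 * p + 3] for p in range(n)]
    tail = seq[3 * n:]  # incomplete trailing codon, left untouched

    # Collect the positions of interchangeable codons.
    groups = defaultdict(list)
    for idx, codon in enumerate(codons):
        following = seq[3 * idx + 3:3 * idx + 4]  # "" after the last base
        groups[(CODON_TABLE[codon], codon[0], following)].append(idx)

    # Shuffle the codons within each group and write them back.
    out = list(codons)
    for positions in groups.values():
        pool = [codons[p] for p in positions]
        rng.shuffle(pool)
        for p, codon in zip(positions, pool):
            out[p] = codon
    return "".join(out) + tail
```

Because codons are only permuted within a group, the amino acid sequence and the overall codon usage are unchanged by construction. Calling the function repeatedly with different random states gives a set of randomized sequences for one exon.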
Because this randomization scheme greatly reduces the number of permutations that are possible, we restricted our analysis to exons which were >2000 bp in length. In such a sequence each sense codon, followed by each nucleotide, is represented on average 2.7 times, which means that a twofold degenerate amino acid will have on average 5.4 codons, and a fourfold degenerate amino acid 10.8 codons; i.e., we expect most codons to be represented more than once and randomization to be possible. Analyses with exons >4000 bp in length gave qualitatively similar results. van Noort et al. (2003) adopted a different strategy to assess whether perfect repeats were more common than expected by chance; they compared the relative number of perfect repeats to the relative number of imperfect repeats, which differed from one another by one nucleotide, where the number of repeats in each case was relative to the number expected from the dinucleotide frequencies. However, this approach is only valid if dinucleotide frequencies are homogeneous across the genome, which they are not in many organisms. For example, mammals show very strong heterogeneity in base composition (Bernardi 1993; Lander et al. 2001; Waterston et al. 2002).
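The averages quoted above can be reproduced with a short calculation. The derivation here is an inference rather than something stated in the text: it assumes a 2000-bp exon, 61 sense codons, four possible following bases, and uniform usage of every codon and context. All variable names are illustrative.

```python
# Assumed inputs: a 2000-bp exon and uniform codon and base usage.
exon_length_bp = 2000
codons_in_exon = exon_length_bp / 3       # about 667 codons
sense_codons = 61                         # 64 codons minus 3 stops
contexts = sense_codons * 4               # each codon followed by A, C, G, or T

# Mean number of times each (codon, following base) context occurs.
mean_per_context = round(codons_in_exon / contexts, 1)   # 2.7

# Codons available for swapping per amino acid and following base.
twofold_pool = 2 * mean_per_context       # about 5.4
fourfold_pool = 4 * mean_per_context      # about 10.8

print(mean_per_context, twofold_pool, fourfold_pool)
```

Real exons have biased codon usage, so rare codons will often fall below these averages; the figures only indicate that swapping is usually possible at this exon length.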
Results and Discussion If sequence directed mutagenesis is occurring, we would expect perfect repeats to be more common than expected by chance. To test this we counted the number of perfect repeats of between 6 and 9 bp in windows of 50, 100, and 500 bp and compared this to the number we found when each exon was randomized. In our study humans, mice, and Arabidopsis show significantly more repeats than expected by chance, whereas Drosophila shows a significant deficit (Fig. 1).
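A minimal sketch of this comparison is given below. It is an illustration under stated assumptions, not the authors' code: the function names are hypothetical, a repeat is taken to be a k-bp arm followed downstream by its non-overlapping reverse complement with both arms inside one window, and the exclusive length classes are approximated by discarding any repeat that can be extended by one complementary base pair at either end. Sequences are assumed to be upper-case A, C, G, T.

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")

def revcomp(s):
    """Reverse complement of an upper-case DNA string."""
    return s.translate(COMPLEMENT)[::-1]

def pairs(a, b):
    """True if bases a and b are Watson-Crick complements."""
    return a.translate(COMPLEMENT) == b

def inverted_repeats(seq, k, window, maximal=True):
    """Return (i, j, spacer) for perfect inverted repeats with k-bp arms.

    Arm 1 is seq[i:i+k]; arm 2 is seq[j:j+k] == revcomp(arm 1), with
    j >= i + k and both arms inside a span of at most `window` bp.
    With maximal=True (an assumed reading of the exclusive length
    classes), repeats extendable by one paired base are skipped.
    """
    index = {}
    for j in range(len(seq) - k + 1):
        index.setdefault(seq[j:j + k], []).append(j)
    hits = []
    for i in range(len(seq) - k + 1):
        for j in index.get(revcomp(seq[i:i + k]), ()):
            if j < i + k or j + k - i > window:
                continue
            if maximal:
                outward = (i > 0 and j + k < len(seq)
                           and pairs(seq[i - 1], seq[j + k]))
                inward = (j - (i + k) >= 2
                          and pairs(seq[i + k], seq[j - 1]))
                if outward or inward:
                    continue
            hits.append((i, j, j - (i + k)))
    return hits

def observed_expected_ratio(seq, randomized, k, window):
    """Observed repeat count divided by the mean count in randomized copies."""
    observed = len(inverted_repeats(seq, k, window))
    expected = sum(len(inverted_repeats(r, k, window))
                   for r in randomized) / len(randomized)
    return observed / expected if expected else float("nan")
```

Each repeat is counted once per exon here rather than once per window position, which avoids counting the same pair repeatedly as the window slides; whether the original analysis did the same is not stated. The third element of each hit is the spacer length between the arms, which is the quantity needed for the second test.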
Fig. 1. Ratio of the observed to the expected number of perfect inverted repeats in six eukaryotic genomes. Bars indicate standard errors. CE: C. elegans, DM: Drosophila, HS: humans, MM: mouse, SC: yeast, AT: Arabidopsis.
Similar patterns are observed for the repeat classes individually and for all window sizes. However, although humans, mice, and Arabidopsis show a significant excess, the size of the excess is very small; we estimate that the repeats are