American Journal of Botany 92(8): 1221–1233. 2005.
SPECIAL PAPER
CODON USAGE PATTERNS DISTORT PHYLOGENIES FROM OR OF DNA SEQUENCES¹
MICHAEL L. CHRISTIANSON²

Department of Plant and Microbial Biology, University of California, Berkeley, California 94720 USA

Papers reporting phylogenetic reconstructions often include discussion of the nature of third position substitutions and have often treated third position data differently from other data. This paper extends such considerations. Plant biotechnologists interested in high levels of expression of foreign proteins have accumulated information on preferences for otherwise synonymous codons. This paper presents a simple analysis for codon bias. Not only is bias frequent, but bias also varies between cohorts of proteins, both by amino acid and by taxon. Analysis of codon usage in the parallel divergence of phytochromes in three model plants finds identical bias for all family members within each taxon and increasingly divergent patterns of bias between increasingly divergent taxa. The molecular constraint of taxon-specific pools of tRNA molecules means individual triplets in a coding sequence are often not independent; algorithms designed to analyze independent characters are inappropriate for such data. Although a misestimate of the number of differences between taxa and groups of taxa can still generate an accurate description of the nesting of clades, other phylogenetic parameters will be strongly affected. Importantly, since codon bias produces smaller-than-expected within-taxon variance (common use of favored triplets) and larger-than-expected between-taxa variance (different favorites in different taxa), statistical support for nodes is certain to be wrong. The translational control of gene expression mediated by codon bias has implications for modern molecular systematics.

Key words: Arabidopsis thaliana; Ceratodon purpureus; convergent selection; molecular systematics; Nicotiana tabacum; Physcomitrella patens.
¹ Manuscript received 23 November 2003; revision accepted 23 May 2005. This work was enabled by the hospitality of L. J. Feldman, his laboratory, and A. O. Jackson as Department Chair. The author appreciates interactions with members of the Department of Integrative Biology and comments on the manuscript from Loren Rieseberg, Elizabeth Kellogg, and the anonymous reviewers; readers should not assume these people agree with all or any of the contentions in this paper.
² E-mail: [email protected]

Joseph Felsenstein (1988, p. 445) begins an excellent and often-cited review by saying “Systematists and evolutionary geneticists don’t often talk to each other.” As an interested outsider to the inner circles of either population genetics or systematics, I can point out what Felsenstein does not. These two traditions are interested in different things, the average behavior of genes and the actual genetic history or pedigree of a lineage, respectively. As interesting as the average behavior of a gene or sequence may be, phylogenetic reconstructions are a litany of historical accidents, including all the improbable twists as Murphy’s Law trumps Occam’s Razor. Felsenstein (1988, p. 468) exhorts systematists and evolutionary geneticists to “start communicating.” As an interested outsider, I note that they may also need to hear the voices of practical molecular biologists; generating transgenic organisms is, after all, nothing more than instant genetic evolution.

Dobzhansky (1970, pp. 5–6) observed that in biology “nothing makes sense except in light of evolution.” Indeed, I would very much like to see plant systematics unravel the relationships between the major clades. I would, of course, prefer a DNA-based phylogeny, so all the morphological traits I find so interesting remain free to be examined against that phylogeny. And while I selfishly focus on how useful such an advance in systematics could be for the study of morphology, I admit that finally knowing the relatedness of the various
plant lineages is key to an evolutionary understanding of plant physiology, of ecology, and of other areas of botany. It is extremely disheartening (as well as encouraging), therefore, when the pages of this journal report that a careful analysis of phylogenies for the land plants finds that important topological features of the inferred branching of lineages depend on analytical technique and the treatment of the so-called third position data (Magallón and Sanderson, 2002).

At first glance, the continuity of DNA sequence from generation to generation makes DNA the sine qua non for reconstructing phylogenetic relationships. Admittedly, because that genie, mutation, can and does make new DNA sequence from old, there can be some uncertainty: previously different sequence can become similar sequence. This phenomenon, often viewed as back-mutation, is the unavoidable counterpart to the fact that new mutations can and do mark lineage divergence and are useful for constructing phylogenies. Indeed, each of us has newly arisen mutations in our germ line (some 13 by one estimate; Neel, 1983); comprehensive genome sequencing could distinguish our offspring’s offspring from those of even their close relatives.

It is possible to imagine that using large numbers of “characters” will somehow solve any and all problems, especially with a computer and multivariate statistics to sort the signal from the noise. Critical analysis, however, has established that simply increasing the size of a data set is not a solution to many of the problems. A Special Paper in this journal (Kellogg and Juliano, 1997), for example, has explored the implications of work by biochemists interested in the effects of amino acid substitutions on catalytic activity or other features affecting protein function. Large numbers of phylogenies are based on the use of DNA sequence for Rubisco subunits. Each of these phylogenies does capture signal from the ancestry of the taxa.
As Kellogg shows, however, any reconstruction that does not consider the functional consequences of amino acid substitutions will also capture signal from convergence (Kellogg and Juliano, 1997). Any amino acid substitution giving even slight functional advantage in a particular habitat is an opportunity for convergence in DNA sequence; similarly, speciation into contrasting habitats is an opportunity for convergence to an alternate amino acid sequence and divergence in DNA sequence. Signal from selection on the protein distorts the pure signal from ancestry.

There are, however, mutations in DNA sequence that do not result in changes in amino acid sequence. The triplet code is redundant: the identity of the third base in the codon is often not essential to fix the identity of the amino acid specified by the codon. It is possible to imagine that changes toggling among A, T, G, and C at the third position are “neutral mutations,” insensitive to selection. If so, such changes would serve as faithful reporters of signal from ancestry, free from signal introduced by convergent or divergent forces and also free of the debate over using characters subject to selection. Many phylogenies include attempts to improve the signal for ancestry by treating the third base in the codon differently from the other two bases. A recent examination of a number of those phylogenies for the seed plants finds that the conclusions drawn in each study are, at least in part, a reflection of how the third base in the codon was treated (Magallón and Sanderson, 2002). That paper prompted this explicit exploration of synonymous substitutions. Unlike other examinations of this idea (e.g., Bernardi et al., 1993; Kwiatowski et al., 1994; Muse, 1996; Bustamente et al., 2002; Yang and Nielsen, 2002), this paper uses the perspective of practical genetic engineers.
The phenomenon of nonrandom or biased use of codons is important to biologists attempting to get high levels of expression from genes moved between taxa by now-routine molecular techniques; this work is not of interest to many plant systematists. While the specifics of much of this work remain as proprietary secrets of corporations, the general phenomenon is well documented. The early decision by Genentech to create a ‘‘multiple suppressor strain of bacteria,’’ rather than to adjust codon use in each mammalian gene they wanted to over-express, reflects the magnitude of the effect of codon bias on protein synthesis, even from gene constructs able to make large amounts of mRNA. The gross details of the phenomenon, especially differences in bias between monocotyledonous and dicotyledonous plants, have been described for quite some time (Murray et al., 1989; Fennoy and Bailey-Serres, 1993; Kumar and Sharma, 1995) and are now appearing in an explicit systematic context (e.g., Tiffin and Hahn, 2002). Such preferences in sequence extend beyond triplet codon usage; there are preferences for intron splice junctions, for example (Sinibaldi and Mettler, 1992). Practical demonstration of the impact of codon bias can be seen in the adjustments of codon identity to optimize expression of green fluorescent protein reporters for various taxa (e.g., Yuang et al., 1996; Cormack et al., 1997; Rouwendal et al., 1997). This body of work demonstrates that codon bias can and does affect protein expression levels, and to a major, not a minor, degree. The magnitude of the effects of codon bias or preference should be of interest to any systematist concerned about signal in DNA data being twisted by the force of selective pressures. Papers published on the effect itself or reporting high or higher levels of protein expression by tweaking codon use do not,
however, explore the phenomenon in ways useful to systematists. This paper intends to begin that process.

The message of this paper is simple. Competition for a taxon-specific common pool of charged tRNAs in cytoplasm or organello-plasm is a powerful force for what traditional geneticists would call “coincidence” in sequence within each of the respective genomes in an organism. Clades are diverged in the abundances of the various tRNA transcripts. Competition within taxon-specific pools of tRNAs results in increased homogeneity of sequence within taxa and increased variance in sequence between taxa. These effects necessarily distort calculations using algorithms that presume random and independent changes in sequence both within and between taxa.

This paper presents an analysis of data on codon use in a diverse set of protein coding sequences available in the National Center for Biotechnology Information (NCBI) database, asking if triplets synonymous in specification of amino acid are also synonymous in actual use. Few, if any, are. It is, however, possible to estimate the magnitude of codon preference and to estimate the net change in codon preference by mutational change. The analysis is simple, uses ordinary commercial software, and relies on standard statistical tests. I show that it is possible to detect variance in codon preference between proteins for certain amino acids; raw sequence of such idiosyncratic proteins may be less useful for certain systematic purposes. I show, in contrast, that within-taxon codon preference for amino acids is uniform for members of an ancient protein family, the phytochromes. Because the phytochrome family in each taxon arises as independent divergences from a single protein in the remote common ancestor of the taxa examined, mosses and angiosperms, the homogeneity within taxa and the variance between taxa illustrate the powerful constraint of tRNA pools over long times.
Molecular biologists constructing phylogenies of individual genes in gene families also need to consider the impact of codon bias on the architecture and strength of nodes in those phylogenies.

MATERIALS AND METHODS

Sequence data were retrieved from the NCBI database (www.ncbi.nlm.nih.gov/entrez/query.fcgi) for the following taxa:
(1) Nicotiana tabacum L. (Solanaceae)—cytokinin-regulated kinase-1 AF302082; cyclin A-like protein AF518250; alpha-tubulin (tubA1 gene) AJ421411; late embryogenesis abundant protein-5 (lea5) AF053076; cdc2 L77083; protoporphyrinogen oxidase PX-1 AF044128.
(2) Physcomitrella patens (Hedw.) Bruch & Schimp. (Funariaceae)—cyclin D (cycD gene) AJ428953; putative retinoblastoma protein (rb gene) AJ428952; heme oxygenase (ho-1 gene) AJ489941; aldolase AB048209; alpha tubulin AB096718; phytochromes (phy1) AY123146; (phy2) AY123147; (phy3) AY123148; (phy4) AY123145.
(3) Ceratodon purpureus (Hedw.) Brid. (Ditrichaceae)—phytochromes (phy1) U87632; (phy2) U72993; (phy3) AY123149; (phy0;2) U56698.
(4) Arabidopsis thaliana (L.) Heynh. (Brassicaceae)—phytochromes (phyA) X17341; (phyB) X17342; (phyC) X17343; (phyD) X76609; (phyE) X76610.

Retrieved sequences were divided into triplet codons (Microsoft Word, Microsoft Corp., Redmond, Washington, USA); these triplets were sorted and their frequencies counted (Cricket Graph, Computer Associates, Islandia, New York, USA). As A, T, G, and C are discontinuous categories, statistical tests rely upon calculating values of chi-square (Microsoft Excel); while double-stranded DNA obeys Chargaff’s Rule (A = T; G = C), sequences of individual strands are not so constrained and contribute 3 degrees of freedom. When expected numbers were smaller than 11, Yates’ correction was employed (Simpson et al., 1960). Calculated values of chi-square were compared against the well-known published tables of critical values.
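The counting and testing steps described above can also be sketched in a few lines of Python. This is an illustrative sketch under stated assumptions, not the workflow actually used for this paper: the function names (`count_codons`, `chi_square_equal_use`, `chi_square_homogeneity`, `significance`) are hypothetical, degrees of freedom are assumed to be the number of synonymous codons minus one, Yates' correction is applied whenever the expected count is below 11 as stated in the text, and the contingency form of the between-taxon test is an assumption, since the text does not spell out that calculation.

```python
from collections import Counter

# Upper-tail critical values of chi-square at the 5%, 1%, and 0.1% levels,
# indexed by degrees of freedom (standard published table values).
CRITICAL = {
    1: (3.841, 6.635, 10.828),
    2: (5.991, 9.210, 13.816),
    3: (7.815, 11.345, 16.266),
    5: (11.070, 15.086, 20.515),
}


def count_codons(cds):
    """Split an in-frame coding sequence into triplets and count each one."""
    cds = cds.upper().replace("U", "T")
    usable = len(cds) - len(cds) % 3  # ignore a trailing partial codon
    return Counter(cds[i:i + 3] for i in range(0, usable, 3))


def chi_square_equal_use(observed, yates_below=11):
    """Chi-square for the hypothesis that synonymous codons are used equally.

    Assumption: Yates' correction (subtract 0.5 from each absolute deviation)
    is applied whenever the expected count is below `yates_below`.
    """
    expected = sum(observed) / len(observed)
    correction = 0.5 if expected < yates_below else 0.0
    return sum(
        max(abs(o - expected) - correction, 0.0) ** 2 / expected
        for o in observed
    )


def chi_square_homogeneity(rows):
    """Contingency chi-square: do several taxa (rows) share one usage pattern?"""
    row_totals = [sum(r) for r in rows]
    col_totals = [sum(c) for c in zip(*rows)]
    grand = sum(row_totals)
    chi2 = 0.0
    for row, row_total in zip(rows, row_totals):
        for observed, col_total in zip(row, col_totals):
            expected = row_total * col_total / grand
            chi2 += (observed - expected) ** 2 / expected
    return chi2


def significance(chi2, df):
    """Return 'ns', '*', '**', or '***' for the 5%, 1%, and 0.1% levels."""
    stars = sum(chi2 >= critical for critical in CRITICAL[df])
    return "*" * stars if stars else "ns"
```

With the glutamine counts quoted in the Results, `chi_square_equal_use([16, 27])` gives about 2.81 (not significant at 1 degree of freedom), `chi_square_equal_use([43, 77])` gives about 9.63 (significant at the 1% level), and `chi_square_homogeneity([[16, 27], [43, 77]])` gives about 0.03, in line with the corresponding values in Table 1.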
Fig. 1. The canonical table of codon usage. Codon position is emphasized by typeface: first position, bold italic; second position, bold. Third codon position within each of the 16 cells toggles among A, T, G, and C. Sets of synonymous triplet codons are color coded: green = unique codons, blue = two-membered sets, red = three-membered sets, black = four-membered sets, pink = six-membered sets. The synonymous codons for leucine, serine, and arginine are highlighted.
RESULTS

Not simply a “third position” phenomenon—The literature most often considers synonymous codons in terms of identity of the base at the third position, and discusses potential problems in terms of weighting of such third position data. Inspection of the canonical table relating each DNA triplet to the amino acid specified by that triplet shows that amino acids (or termination) are specified by unique triplets (two cases), by pairs of triplets (nine cases), by triplets of triplets (two cases), by quartets of triplets (five cases), and by hextuplets of triplets (three cases) (Fig. 1). This table facilitates finding the results of mutational changes to triplets. Changes in the first base convert a triplet into another triplet in the same column of the table. Changes in the second base convert a triplet into another triplet in the same row of the table. Changes in the third position toggle between triplets within the same cell of the table. It is only the instances of quartets of triplets, five of the 21 cases, where any base in the third position results in the specification of the same amino acid in the translated protein. Examining each of the 64 triplets for the average effect of base pair change at the first, the second, or the third position reveals that 4.2% of first position changes, 1.0% of second position changes, and 66.7% of third position changes result in the specification of the same amino acid as the original triplet.

Rules for treating third position data, devised as if all changes at third positions were redundant (and that no first or second position changes were redundant), are misapplied when used on sequences other than the five appropriate quartets of triplets. Such general application of rules gives two kinds of effect. For sets of synonymous codons with fewer than four members (13 of the 21 cases), third position changes in sequence over-correct for effects of redundancy.
For example, with the pairs of synonymous triplets, purine to purine or pyrimidine to pyrimidine changes result in no change in amino acid specification. With random mutation, however, two-thirds of the
mutations will be from purine to pyrimidine (or the reverse) and will change the amino acid specified. The synonymous hextuplets present another problem. For four of the six triplets in each set, changes in the third position do not result in changes in amino acid. For two triplets of each six, however, mutation at the third position results in the changes just described for pairs of triplets. Additionally, mutation in the first positions in triplets from two of the three hextuplet sets can result in no change in amino acid. For leucine, for example, with random mutation, one-third of the changes from the first T in TTA or TTG codons are changes to C, and both CTA and CTG still specify leucine. Conversely, one-third of the changes from the C in CTA and CTG codons still specify leucine. In contrast, first position changes in the CTT and CTC leucine codons never give codons that specify leucine. The case for the six arginine codons is exactly parallel to that just described for the leucine codons; the six serine codons, however, would require two coincident changes, in both first and second positions, to generate a codon that still specifies serine. Given the nonuniform pattern of redundancy in the genetic code, all general rules for adjusting sequence data are doomed to be inappropriate for some of the synonymous sets of triplets. While theoretical population geneticists may still profitably model average effects, the use of actual sequence data to reconstruct phylogenies cannot use simple global rules to accommodate the effects of the redundancy of the genetic code. Since sets of triplets redundant in specification are not necessarily comprised of triplets that are functionally identical, the practical use of sequence data for historical reconstruction would need to further adjust any global rule to match patterns seen in the genes actually sequenced from the taxa being investigated. 
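The counts behind these statements can be checked by enumerating every single-base change in the standard genetic code. The following is a minimal sketch, assuming the standard nuclear code, treating the three termination codons as one synonymous set (as the 21 cases above do), and weighting all base changes equally; the names `CODE`, `synonymous_changes`, and `synonym_set_sizes` are illustrative only.

```python
from collections import Counter
from itertools import product

BASES = "TCAG"
# Standard genetic code, codons ordered TTT, TTC, TTA, TTG, TCT, ... GGG;
# "*" marks termination.
AMINO = (
    "FFLLSSSSYY**CC*W"
    "LLLLPPPPHHQQRRRR"
    "IIIMTTTTNNKKSSRR"
    "VVVVAAAADDEEGGGG"
)
CODE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}


def synonymous_changes(position):
    """Count single-base changes at one codon position (0, 1, or 2).

    Returns (changes that keep the same amino acid, all possible changes).
    """
    same = total = 0
    for codon, amino_acid in CODE.items():
        for base in BASES:
            if base == codon[position]:
                continue
            mutant = codon[:position] + base + codon[position + 1:]
            total += 1
            same += CODE[mutant] == amino_acid
    return same, total


def synonym_set_sizes():
    """Map set size -> number of amino acids (or termination) with that size."""
    codons_per_amino_acid = Counter(CODE.values())
    return Counter(codons_per_amino_acid.values())


for position, label in enumerate(("first", "second", "third")):
    same, total = synonymous_changes(position)
    print(f"{label} position: {same}/{total} = {100 * same / total:.1f}% synonymous")
```

Under these assumptions the enumeration gives 8, 2, and 128 synonymous changes out of 192 possible changes at the first, second, and third positions, which correspond to the 4.2%, 1.0%, and 66.7% quoted above, and `synonym_set_sizes()` returns the 2, 9, 2, 5, and 3 cases of one-, two-, three-, four-, and six-membered sets.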
An overall look at codon usage—Comparison of codon use for a similarly diverse set of six proteins in both Nicotiana and Physcomitrella begins with within-species comparisons of codon use for the respective six proteins (Table 1). As expected, some proteins (tubulin, for example) can show idiosyncratic or atypical use of a particular set of synonymous codons; removing these proteins from the database generates a homogeneous set of codon use for each amino acid, an estimate of the typical use for the species. In two cases, the leucine codons for Nicotiana and the alanine codons in Physcomitrella, it was not possible to generate a homogeneous data set by subtracting the data from one or two proteins (using data from the remaining four or five). A molecular biologist might suspect that differences in cellular expression of these various proteins are mediated by translational control from tRNA availability. Table 1 shows codon use by all six proteins in these two cases, but these data were not used in tests against homogeneous data sets. For both species, there are amino acids for which the synonymous codons are used randomly (Table 1). Two amino acids are specified by single codons: the ATG for methionine and the TGG for tryptophan. Nicotiana and Physcomitrella must both use those single codons 100% of the time, and both species must use the same codon; these cases are not informative. In both species, however, Table 1 documents cases where codons synonymous in specification of an amino acid are also synonymous with respect to the frequency at which they are used in coding sequences. For Nicotiana, two amino acids, phenylalanine and histidine, have synonymous codons
TABLE 1. Codon usage typical of Nicotiana tabacum and Physcomitrella patens. Sequences for six proteins, spanning similar presumed diversity in expression levels and tissue specificity (see Materials and Methods for list), were retrieved from the National Center for Biotechnology Information public database, sorted into triplets, collated, and analyzed for random use (via chi-square test) using standard commercial word processing, plotting, and spreadsheet software.
χ² significance Nicotiana N
Random?
Physcomitrella N
Random?
49
92
trp TGG
25
31
53 42 40 36
1.28 ns
0.22 ns
65 51
1.69 ns
26 56 10.98***
7.10**
27 22
0.53 ns
13 24
3.27 ns
3.38 ns
gln CAA CAG
16 27
2.81 ns
43 77
9.63**
0.03 ns
70 28 18.01*** 52 58
0.34 ns
45 44
0.01 ns
23 52 11.21***
asp GAT GAC
81 37 16.42***
79 45
glu GAA GAG
72 59
31 80 21.63***
cys TGT TGC
34 15
1.30 ns
n.c.
29 32
9.32**
0.16 ns
8.58**
74 27 28 53 33.34***
42 24 19 62
31.1***
18.03**
50 21 48 6 43.99***
37 24 23 22
5.62 ns
32.83***
thr ACT ACC ACA ACG
42 6 33 15 33.75***
50 14 38 31
20.41***
20.57***
ala GCT GCC GCA GCG
53 25 46 12 31.47***
77 39 66 73
n.c.
61.63***
gly GGT GGC GGA GGG
69 22 58 29 34.36***
54 41 55 66
5.82 ns
58.68***
Six (4 + 2) synonyms 5.10*
0.66 ns
69 27 20 36.33***
55 52 36
4.34 ns
leu TTA TTG CTT CTC CTA CTG
24 65 61 25 18 27
ser TCT TCC TCA TCG AGT AGC arg CGT CGC CGA CGG AGA AGG
6 55 18 11 8 63 120.00***
50.81***
43 33 63 16 37 16 47.08***
19 11 13 38 13 28
27.90***
58.50***
31 16 9 4 34 23 35.44***
24 24 24 40 20 33
10.16 ns
37.08***
n.c.
17.96***
5.30*
Three synonyms ile ATT ATC ATA
Nt- vs Pp-pattern
Random?
0.00 ns
his CAT CAC
lys AAA AAG
Random?
Physcomitrella N
val GTT GTC GTA GTG pro CCT CCC CCA CCG
Two synonyms
asn AAT AAC
N
Four synonyms
met ATG
tyr TAT TAC
Nicotiana
Nt- vs Pp-pattern
Monotypic codons
phe TTT TTC
χ² significance
11.37**
Notes: n.c. = not calculated; species’ proteins have heterogeneous usage. Statistical significance: ns = not significantly different, * = 5%, ** = 1%, *** = 0.1%. Color code: red = preferred codon(s), green = random codon use, underlined when test against taxon with preferred use is nonsignificant.
used with equal frequency. For Physcomitrella, eight amino acids have synonyms used with equal frequency (Table 1). These cases are not the typical result, however. The most typical finding about codon usage is the detection of statistically significant bias, or codon preference. For Nicotiana, there are nine amino acids specified with a preference for codon use; in the Physcomitrella data set, analysis also finds nine amino acids specified with codon bias. The nine in Nicotiana and the nine in Physcomitrella, however, are not the same set of nine. A comparison of patterns of codon use for each amino acid between Nicotiana and Physcomitrella discovers four amino acids with similar patterns of codon usage between species. All four are amino acids with two synonymous codons. For phenylalanine and histidine, both species use synonymous codons with equal frequency. And while the use of the two glutamine codons by Nicotiana (16 vs. 27) is not statistically distinct from 1 : 1, that usage is also not statistically distinct from the biased use of these codons (43 vs. 77) in Physcomitrella; this example is best interpreted as a case of similar, and biased, use of codons in these two species. The other case of similar, biased, codon usage is aspartate, with a strong preference for GAT in both species (81 vs. 37 and 79 vs. 45; Table 1). A phylogenetic analysis restricted to synonymous substitutions at histidine and phenylalanine residues would not be disturbed by the phenomenon of codon bias. Using additional portions of the DNA sequence would disturb the accuracy of phylogenetic reconstruction because those substitutions synonymous in specification of amino acid are not equally preferred in use across taxa. Table 1 presents data on nuclear-encoded genes; codon bias is found in plastidic and mitochondrial genomes, as well. 
Understanding bias in those genomic compartments is complicated by the phenomena of RNA-editing and import of some of the organelle tRNAs (see Gillham, 1994); demonstration of bias, however, is straightforward. A comparison of sequence in the NCBI database for edited transcripts of the Rubisco large subunit in moss and tobacco, for example, finds biases in codon usage for amino acids, and reveals statistically significant taxon-specific bias for the amino acids histidine, isoleucine, and leucine (data not shown).

Within-taxon preference is not a uniform preference—Some aspects of codon bias may certainly reflect the often described effect of overall GC ratio of the genome, or a strong, but general, preference for third positions. However, examination of codon usage by amino acid specification, rather than over all triplets, finds bias in codon use for some amino acids that is different from the average bias over all triplets. Analysis of codon usage in the proteins of Nicotiana finds 10 sets of synonyms with bias (Table 1). Nine of these cases involve preference for an A or T at the third position; Nicotiana, however, prefers to specify glutamine with CAG, not CAA. Seven of the nine cases of biased use of synonyms in Physcomitrella are preferences for a third position use of G or C, but two of the cases are preferences for a third position T (Table 1). Additional examples can be found in other data, including examples where direction of bias is different for the same amino acids in related species (e.g., proline and serine usage in the mosses, Table 3). Although it is possible to calculate average or overall preferences for bases at the third position (via third position GC preference), such information is of limited usefulness for adjusting data sets to remove the effects of codon bias. While accurately correcting bias in those sets of triplets with bias close to the calculated average, such corrections are not helpful for triplets with random use of codons (i.e., those triplets that had undistorted phylogenetic signal) and even less helpful for synonymous sets with distinctive patterns of bias.

Calculating the relative preference of synonymous substitutions—It would be desirable to use observed frequencies of codon use to calculate some kind of weighting factor for members of each set of synonymous substitutions. For glutamine in the data set presented in Table 1, the observed bias in use for each triplet codon (actual use/use if random) can be used to generate a simple matrix (Table 2). In this example, the synonymous alteration in the third position that converts a CAA to the preferred CAG represents a ~1.8-fold change in codon preference. The converse change, from the preferred CAG to a CAA, is a 0.6-fold change in preference. It is a simple matter to generate such matrices for each example of observed biased codon usage presented in Table 1. The most extreme examples of bias result in the most extreme selective potential for third position changes. In the larger synonymous triplet sets, change from one moderately preferred triplet to another moderately preferred triplet often results in no net change in codon preference (a ~1.0-fold change). The distribution of all preference changes for synonyms with biased use spans a wide range (increases and decreases of up to 8.3-fold); the 1.8-fold change of the example shown in Table 2 is not at all extreme (Fig. 2). This analysis documents the magnitude of the potential supplied by codon bias for coincidence in seemingly independent triplet sequences.

The effects of codon bias should also be found by examining families of proteins, comparing codon use within and between taxa.

TABLE 2. The preference of triplet codons specifying glutamine and the net effect of synonymous changes in third position. Combined data for glutamine from Table 1 are analyzed in a matrix as described in the text. The preferred use of the CAG triplet (1.28 of random) means that a third position change to A, giving the triplet CAA, represents a relative change in codon preference of 0.57 (0.72/1.28). This synonymous substitution is to a substantially less favored triplet, with consequent effects on the efficiency of protein synthesis. The converse change, from CAA to CAG, represents a relative change in preference of 1.76.

Glutamine     Use     If random     Preference
CAA           59      81.5          0.72
CAG           104     81.5          1.28
Total         163

Effect of third position changes
Original      Preference     New A (0.72)     New G (1.28)
A             0.72           —                1.76
G             1.28           0.57             —
Mean          1.00                    1.17
Variance      0.15                    0.71
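The relative-preference matrix of Table 2 can be reproduced directly from the codon counts. The sketch below is illustrative only: the function names `preference` and `preference_changes` are hypothetical, "use if random" is taken to mean equal use of every codon in the synonymous set, and the variance is assumed to be the sample variance (n − 1 denominator), which is what matches the tabulated values.

```python
from statistics import mean, variance


def preference(counts):
    """Observed use of each codon divided by its use if synonyms were random."""
    expected = sum(counts.values()) / len(counts)
    return {codon: n / expected for codon, n in counts.items()}


def preference_changes(counts):
    """Fold change in preference for every synonymous change old -> new."""
    pref = preference(counts)
    return {
        (old, new): pref[new] / pref[old]
        for old in pref
        for new in pref
        if old != new
    }


# Combined glutamine counts for Nicotiana and Physcomitrella (Table 2).
glutamine = {"CAA": 59, "CAG": 104}

pref = preference(glutamine)
changes = preference_changes(glutamine)

for codon, value in pref.items():
    print(f"{codon}: preference {value:.2f}")
for (old, new), fold in changes.items():
    print(f"{old} -> {new}: {fold:.2f}-fold")
print(f"preference: mean {mean(pref.values()):.2f}, "
      f"variance {variance(pref.values()):.2f}")
print(f"changes: mean {mean(changes.values()):.2f}, "
      f"variance {variance(changes.values()):.2f}")
```

The same two functions accept the four- and six-membered synonymous sets unchanged, so a full set of matrices for Table 1 could be generated by looping over the counts for each amino acid.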
If bias has the effect of canalizing patterns of codon use, families of appropriate proteins will show similar codon use within taxa, despite the evolutionary divergence between family members. Bias will also result in exaggerated divergence of patterns of codon use between taxa,
TABLE 3. Codon usage in phytochrome family members from three taxa. Sequences for phytochrome family member mRNAs (Physcomitrella, four genes; Ceratodon, four genes; Arabidopsis, five genes) were retrieved from the National Center for Biotechnology Information public database, sorted into triplets, collated, and analyzed for homogeneity within family for each taxon and then for random use of synonyms (via chi-square test) using standard commercial word processing, plotting, and spreadsheet software.
Within taxa Physcomitrella Family:
Homogeneous? Total
Two synonyms phe TTT TTC
3.58 ns
gln CAA CAG
2.15 ns
asn AAT AAC
0.66 ns
lys AAA AAG
1.95 ns
asp GAT GAC
0.17 ns
glu GAA GAG
2.44 ns
cys TGT TGC
1.44 ns
Four synonyms val GTT GTC GTA GTG
1.98 ns
0.55 ns
1.54 ns
68 57
0.97 ns
2.31 ns
4.71 ns
85 39
17.06***
1.93 ns
2.10 ns
119 135
1.01 ns
6.60*
6.62*
97 98
0.01 ns
3.76 ns
8.63*
141 152
0.41 ns
0.06 ns
0.21 ns
138 43
49.86***
8.09**
24.36***
203 186
0.74 ns
1.13 ns
6.05*
79 48
7.57**
0.09 ns
0.39 ns
145 110 82
17.70***
0.02 ns
3.86 ns
190 69 57 109 101.97***
1.39 ns
26.83***
46
51
0.73 ns
8.93 ns 98 74
8.10*
3.35 ns
1.54 ns
9.36 ns 55 54
3.88*
0.01 ns
7.89*
1.78 ns 80 33
5.95*
19.55***
6.22 ns
4.60 ns 80 123
0.29 ns
9.11**
0.50 ns 86 44 13.57***
4.21 ns 77 64
1.20 ns
5.46 ns 112 125
6.80 ns 113 132
0.71 ns
1.47 ns
4.30 ns 194 101 29.32***
0.93 ns 167 140
2.37 ns
9.22* 135 151
0.74 ns 132 176
0.90 ns
6.29*
0.28 ns 50 33
3.21 ns 54 39
3.49 ns
7.44 ns
2.43 ns
5.14 ns 110 61 62 20.20***
1.88 ns
18.01* 118 67 68
20.17***
10.93 ns 105 65 55 133 43.83***
11.85 ns
122 101
40
116 108
thr ACT ACC ACA ACG
Same?
176
65 40
3.85 ns
Same?
181
37 56
pro CCT CCC CCA CCG
Random?
Total Random?
167
103 66
his CAT CAC
Three synonyms ile ATT ATC ATA
Total
2.79 ns
2.07 ns
3-taxa
Homogeneous?
Random?
tyr TAT TAC
Moss-Moss
Arabidopsis
Homogeneous?
Usage:
Monotypic codons met ATG trp TGG
Across taxa
Ceratodon
16.19 ns 110 71 56 117
29.80***
13.44 ns 74 29 47 18 42.71***
9.30 ns 43 34 63 36
11.95**
13.55 ns 95 43 74 31 41.95***
101 27 60 44
51.90***
16.76**
25.44***
112 44 70 45
44.94***
4.13 ns
5.14 ns
9.96 ns 95 40 69 49
28.22***
TABLE 3. Continued. Within taxa Physcomitrella Family:
Homogeneous?
15.40 ns
Total
gly GGT GGC GGA GGG
5.98 ns
26.00*
Six (4 + 2) synonyms leu 21.67 ns TTA TTG CTT CTC CTA CTG ser TCT TCC TCA TCG AGT AGC
10.67 ns
arg CGT CGC CGA CGG AGA AGG
14.25 ns
9.38*
13.96 ns
9.17*
15.38 ns
58.20***
9.82 ns
168 56 113 63
80.98**
5.93 ns
27.57***
139 57 112 76
41.94***
4.38 ns
13.67*
92 146 122 76 62 79
52.38***
6.51 ns
14.96 ns
118 51 108 41 92 81
57.62***
21.28***
42.49***
41 22 32 31 86 90
88.32***
3.93 ns
38.01***
18.68 ns 65 46 55 54 64 36
11.39*
18.85 ns 41 20 46 40 60 36 21.02***
Same?
26.80 ns 51 101 119 79 41 66
57 47 55 40 32 66 15.38**
Same?
10.57 ns 97 64 69 88
60 111 102 60 47 78 42.67***
Random?
16.46 ns 112 73 103 88
106 62 89 72 13.67**
3-taxa
Total Random?
101 78 107 57 18.32***
Moss-Moss
Homogeneous?
Random?
ala GCT GCC GCA GCG
Arabidopsis
Homogeneous? Total
Usage:
Across taxa
Ceratodon
35.44* 43 30 36 44 55 41
8.52 ns
Notes: ? = Reports value of chi-square test for homogeneity or random use with statistical significance: ns = not significantly different, * = 5%, ** = 1%, *** = 0.1%. Color code: green = random codon use, underlined when test against taxon with preferred use is nonsignificant; red = preferred codon(s); pink = mosses prefer same codons, Arabidopsis prefers others; orange = one moss prefers codon, other moss and Arabidopsis have random codon use; blue = a distinct pattern for each taxon.
the presumed consequences of between-taxa differences in tRNA pool sizes and diversity. A case study: the phytochrome gene family in three model plants—The family of phytochromes is an example of a protein family appropriate for measuring the canalizing effects of tRNA pools on codon use. These proteins are large enough that comparison of triplet use will not be affected by low N. Although the family members have diverged in function, they certainly represent more similar states of gene expression than, say, a storage protein and a rare transcription factor. Most importantly, phylogenetic analysis uniformly suggests that these families diverged independently from a single ancestral protein in the common ancestor of mosses and the vascular plants (Schneider-Poetsch et al., 1994; Mathews and Sharrock, 1997). Equally importantly, all or most of the family members have been sequenced in two mosses and for at least one vascular plant. If codon use within taxa is not canalized, there should be as much diversity in codon use between gene family members within a taxon as diversity in codon use between taxa. If, on
the other hand and as seen for the proteins in Table 1, there are preferences for particular codons in synonymous-in-specification sets of triplets, then a protein family coevolving within the convergent force of one cytoplasmic tRNA pool will tend to show parallel patterns in codon use within taxa and more divergent patterns of codon use between taxa. This homogeneity within taxa and divergence between taxa is exactly what analysis of codon usage in the phytochrome genes of Physcomitrella, Ceratodon, and Arabidopsis documents. The phytochromes of two mosses and a vascular plant each show homogeneous patterns of codon use (Table 3). There are four cases where chi-square values could support a conclusion of nonhomogeneity. In three cases, the chi-square value exceeds the less stringent 5% value, but not the more stringent 1% value. Indeed, because a series of 54 independent chi-square tests should recover values in excess of the 5% level some 3.4 times, these three sets of codons are judged as homogeneous using the 1% level of confidence. (Recall that A, T, G, and C are discontinuous variables, precluding analyses by techniques for continuous variables, and that, as for the discontinuous traits in peas of Mendel [Fairbanks and Rytting,
Fig. 2. Fold-change in codon preference from mutation. Relative preferences calculated as in Table 2 for all synonymous triplet sets of Table 1.
2001], sequential analyses by chi-square often seem curious, but are not.) In the fourth case, the use of 37 GCG triplets in the Ceratodon phytochrome-3 gene results in a distinctive pattern of alanine codon use, that cell in the chi-square contributing 14.14 to the total chi-square value of 26.00. Because the Ceratodon sequences being compared include both typical phytochromes and the novel phytochromes discovered in Ceratodon (Thümmler et al., 1992), it is not surprising to discover such a variation. This single understandable exception underscores the remarkable homogeneity within taxa. This homogeneity could reflect widespread use of synonymous codons without codon bias. It does not. Analysis of codon usage for this gene family over these three taxa does discover two sets of synonyms in which the synonyms are used with equal frequency: the TAT and TAC specifying tyrosine and the AAA and AAG specifying lysine (Table 3). These codons would be useful for constructing phylogenies. There are sets of synonymous codons used with the same bias by Physcomitrella, Ceratodon, and Arabidopsis: those that specify the six amino acids phenylalanine, histidine, cysteine, isoleucine, threonine, and leucine (Table 3). These codons could be useful in constructing phylogenies with an adjustment to account for the presumed selective values of particular mutational changes and consequent coincidence in triplets. There are six additional sets of synonymous codons that are used with similar bias by the two mosses, but used randomly (asparagine, glutamate) or with a different bias (valine, alanine, glycine, arginine) in Arabidopsis phytochromes (Table 3). There is one example of a set of synonymous codons, those specifying glutamine, where one moss and Arabidopsis show random use of codons, but the other moss has preferred codons (Table 3).
Finally, there are three sets of synonymous codons, those specifying serine, proline and aspartate, for which no two taxa use the same pattern or degree of bias (Table 3). Adjusting for coincidence from such patterns of biased usage will be more problematic.
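The test for random use that underlies these groupings can be sketched in a few lines of Python. This is a minimal illustration, not the tabulation used for Table 3. It assumes that "random use" means equal expected counts for every codon in a synonym set; the function names are invented for the example, and the counts are hypothetical placeholders, not values from the phytochrome data. The critical values are the standard upper-tail chi-square values at the 5%, 1%, and 0.1% levels.

```python
# Upper-tail chi-square critical values (5%, 1%, 0.1%), keyed by degrees of freedom.
CRITICAL = {
    1: (3.841, 6.635, 10.828),
    2: (5.991, 9.210, 13.816),
    3: (7.815, 11.345, 16.266),
    5: (11.070, 15.086, 20.515),
}


def chi_square_random_use(counts):
    """Goodness-of-fit statistic against equal use of every synonym (assumption)."""
    expected = sum(counts) / len(counts)
    return sum((observed - expected) ** 2 / expected for observed in counts)


def significance(stat, df):
    """Return 'ns', '*', '**' or '***', following the notation of Table 3."""
    label = "ns"
    for stars, cutoff in zip(("*", "**", "***"), CRITICAL[df]):
        if stat > cutoff:
            label = stars
    return label


def random_use_sets(usage):
    """Names of synonym sets whose codon use does not differ from random."""
    kept = []
    for name, counts in usage.items():
        stat = chi_square_random_use(counts)
        if significance(stat, len(counts) - 1) == "ns":
            kept.append(name)
    return kept


# HYPOTHETICAL counts, for illustration only (not taken from Table 3).
hypothetical_usage = {
    "two-codon set A": [52, 48],
    "two-codon set B": [80, 30],
    "four-codon set C": [40, 20, 35, 10],
}

for name, counts in hypothetical_usage.items():
    stat = chi_square_random_use(counts)
    print(name, round(stat, 2), significance(stat, len(counts) - 1))

print("retained as random:", random_use_sets(hypothetical_usage))
```

With these made-up counts only the first set would be retained as randomly used; the other two show preference at the 0.1% level. A test of homogeneity among family members would use the same statistic on a genes-by-codons contingency table.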
Treating two, three, four, and six synonym sets on an equal basis, these observations can be summarized briefly as follows: (1) Within-taxa variance in codon use: little or none. Physcomitrella, 18 homogeneous patterns of use; Ceratodon, 17 homogeneous patterns, one of four phytochromes has a unique pattern for alanine codons; Arabidopsis, 18 homogeneous patterns of use. (2) Between-taxa variance in codon use: more and more. The two mosses have the same use pattern for 14 of the 18 sets of codons. The three taxa have the same use pattern for eight of the 18 sets of codons. The observed overall pattern of variance in codon usage (minimal variance within taxa, divergence between taxa) will produce characteristic distortions in phylogenies constructed as if synonymous triplets were independent, and not (as they actually are) subject to coincidence in sequence. Ancestry of species, judged by numbers of base pair differences, will be overestimated: similarity from taxon-specific codon bias will inflate the contrast to between-species base pair diversity. Ancestry of proteins will be underestimated, as family members within taxa will be more alike due to taxon-specific codon bias. The analysis reported here uses observed frequencies of codon usage to infer coincidence, not independence, in triplets, as a consequence of translational control from competition for tRNA molecules. While some scientists will find it sufficient to outline the nature of codon bias, how to detect it, and how preference in use of erstwhile synonyms would affect phylogenies, scientists with an experimental bent would also like to observe directly the process of codon usage becoming coincident.
It is easy to imagine experiments using the techniques of molecular biology: introducing a selectable plasmid with nonoptimal codons into Physcomitrella, a species in which free plasmid will replicate and persist (Schaefer et al., 1991), recovering populations of plasmid at various times, and monitoring for the appearance of Physcomitrella-preferred codons. Fortunately, such an experiment is performed in nature each time a coding sequence migrates from an organelle and is captured by the nucleus. Equally important, the successive steps in this process, from initial capture, through acquisition of transit peptide, to functional replacement and eventual loss of the organellar copy of the gene, have been outlined and supported by direct examination for the presence (Southern blots) and expression (Northern blots) of the nuclear and organellar copies of the gene (Brennicke et al., 1993). An example: codon preferences in genes acquired from organelles—Functional copies of formerly mitochondrial genes have been cloned and sequenced: cox2 in soybean, in which a nonfunctional mitochondrial copy still exists, and in cowpea, in which the mitochondrial copy has been lost (Nugent and Palmer, 1991; Covello and Gray, 1992). Comparison of codon usage between the lengthy mitochondrial targeting presequences (existing nuclear sequences; Kadowaki et al., 1996) and the actual cox2 portion of the protein (originally a mitochondrial sequence) finds homogeneous use of codons between these parts of the protein as well as between taxa (analysis not shown), exactly in keeping with the observations of Brennicke et al. (1993) that fully functional nuclear copies of genes from organelles have appropriate adjustment(s) in codon usage. Southern blots have identified at least one taxon in which a cox2 gene has been transferred to the nucleus but does not make transcript (Pisum; Nugent and Palmer, 1991). It is not
known whether the nuclear copy of cox2 in Pisum has acquired a targeting presequence and appropriate upstream elements for translation; recall that mitochondrial genes are translated using organellar, prokaryotic-like ribosomes. A coding region must be transcribed and processed into mRNA for presumed selective pressure of any nuclear-bias in codon usage to come into play during translation into protein. A full series of legume cox2 genes—one without targeting presequence, one with a captured presequence but without gene expression, and ones with varying amounts of expression of both the mitochondrial and nuclear versions of the gene—would document the process of adjustment in codon usage that the fully functional nuclear versions of cox2 recovered in soybean and cowpea have completed. DISCUSSION On rates and frequencies of mutation—Geneticists are often careful to distinguish between a mutation rate (newly arisen mutations/population/time) and the frequency of mutations, either as individual organisms or alleles (number of mutants observed/population). This dual use of the word ‘‘mutation,’’ referring to a process with units of reciprocal time and to the results of that process, does not disturb geneticists (who are equally sanguine about using the same word to refer to a genetic locus, a coding region at that location, the RNA transcribed from that region, or the protein product translated from that RNA, and using the same word to refer to the wild-type and the mutant versions of any of the above, although in these cases, orthographic conventions of capitalization and italics make textual references less ambiguous.) Indeed, the relationship between the mutation rate and the frequency of mutations is a standard topic in traditional texts of genetics (Dobzhansky, 1970) and in modern population biology (Futuyma, 1998). 
The advent of molecular genetic techniques has only increased the precision with which geneticists view mutation and its component processes, from the rate of initial lesions, to the inherent error rates of repair polymerases, to the failure rates of proofreading enzymes, which together determine the net rate of mutation (Freidberg et al., 1995). For example, clever use of isoschizomers that distinguish between dimerized and ordinary adjacent thymidines now allows plant geneticists to follow the kinetics of repair and production of mutations in each of the three plant genomes simultaneously. The variations in sequence used by systematists and population geneticists are the end result of mutation in its broadest sense, and many workers carefully indicate that they measure the frequencies of mutant alleles in a population (although some workers do call this a ‘‘mutation rate,’’ blurring the distinction from the actual mutation rate, a number with units of reciprocal time). The mutant alleles or base substitutions observed in DNA sequencing of an individual or a population are the ones that have managed to first escape the stochastic extinction events to which most newly arisen mutations succumb and are only the ones associated with a selection coefficient that lets them persist and build to a reasonable frequency (or to fixation) in the population. These are topics dealt with at length in the literature of population genetics, with great mathematical rigor. The overall concept can be grasped by considering the odds that the one pollen grain with a newly arisen mutation actually resulted in a seed, rather than becoming food for some bee larva, falling to the ground, or losing out in the race of pollen tubes toward
the ovule, and that that seed was one of the few seeds to escape being eaten, to escape damping-off fungi, and to escape each of the subsequent tests of Nature, “red in tooth and claw” (Tennyson, 1850), to perpetuate that particular incidence of mutation through pollen and seed of its own. This paper presents an analysis of the frequency with which codons are actually used in protein-coding sequences. This analysis in no way suggests that the frequency of mutation, i.e., the actual process of mutation, is biased toward or against transitions or transversions or any particular base. It should be understood to reflect the fact that newly arisen synonymous mutants succeed or fail, persisting in or vanishing from a population, after stochastic gating effects, by virtue of how well they work with the pool of tRNAs in the cells of a particular species. On codons as characters—Perhaps because the central questions in plant morphology (for me, at least) involve discovering the number, arrangement, identity, and extent of the natural (vs. designated) iterative units which comprise plants, a morphologist might be expected to ask whether codons were elemental characters, or components of an elemental character, and if problems arise when extensive sequences of codons need to be examined. Indeed, beyond their observations about the impact of third position data, Magallón and Sanderson (2002) also note differences linked to method of analysis, i.e., parsimony and maximum likelihood. Consideration of the effects of method of analysis leads to insight into the nature of codons as characters. The examination of codon preference in this paper has already documented its main effect: coincidence in triplet usage (Tables 1 and 3). This demonstration should not be taken as suggesting a bias in the process of mutation itself, but as a consequence of selection during the process of fixation in a population of any newly arisen mutant allele.
Given the disadvantage for nonpreferred triplets (and consequent effects on organisms when homozygous; Table 2), subsequent mutational events that convert the triplet into a more-preferred triplet (specifying the same or a different amino acid) will be strongly favored. The chance of two, not one, mutational events closely linked in time as well as chromosomal location might seem vanishingly remote; in plants, however, each mitotic division is an opportunity for a somatic mutation, and a large, albeit older, literature documents the magnitude and success of ‘‘diplontic selection’’ (D’Amato, 1965). Whatever the exact molecular genetic and population genetic mechanisms that operate to result in coincident, not random, use of triplets specifying particular amino acids, the effect important for systematics is to introduce an additional event (or step). Analysis of data sets composed only of those triplets in which use of synonyms is random may accurately prefer the most parsimonious outcomes. When there is preference for some triplets, data sets including those triplets necessarily include the extra steps to a preferred codon. In those instances, the choice of parsimony as an arbiter of truth is a wrong choice. The coincidence of triplets used within sequences creates a problem for analysis based on maximum likelihood. The problem is two-fold. As just seen in the discussion of parsimony, the actual number of changes in sequence is misestimated (the numerator of the observed frequency). In addition, the coincidence, not independence, of triplets reduces the denominator (number of bases in the sequence examined) to an unknown
extent. Population geneticists have noticed this last implication and have explored ways to calculate the “effective” size of a gene (Wright, 1990); extension of this approach to embrace codon families, instead of using average bias, could adjust lengths of sequence for systematic purposes, resulting in better estimates of frequencies and, therefore, more meaningful maximum-likelihood calculations. The nature of the problem for the method of maximum likelihood is also revealed by examination of bases within triplets. The focus of this paper on the extent and consequences of codon preference necessarily involves discussion of triplets and sets of triplets synonymous in specification. One can reasonably ask if this hierarchical level is the natural or elemental level of organization, or if individual bases are the characters for systematic analysis. Although this question may be answered in various ways, one particularly apt way is to ask how bases assort in triplets. Ignoring third position data, the 64 possible triplets of Fig. 1 collapse into a set of 16 cells. Because these 16 cells are roughly equivalent to kinds of amino acids in protein, and because it is well known that some amino acids are rare, that each coding sequence has only one stop codon, and that gene sequences do not use A, T, G, and C in equal frequency, it is no surprise that statistical analysis of the data for Nicotiana sequences in Table 1 (collapse to 16 cells not shown) finds that the 16 base-base-x combinations are not used with equal frequency (chi-square, 15 df, 232.7, P < 0.001). The more revealing analysis is asking if the 16 base-base-x combinations reflect random association of the As, Ts, Gs, and Cs in the genome. Variation in overall GC content of genomes is a widely recognized phenomenon, and many authors have noted that third positions may differ in GC content from the rest of the sequence.
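The comparison of first- and second-position base usage reported in this part of the Discussion can be checked from the counts quoted in the text (581, 438, 603, 745 and 662, 536, 744, 425, as T, C, A, G). The Python sketch below is illustrative only, and its function names are invented for the example. It computes the chi-square statistic for homogeneity between the two positions, and the count of a base-base-x combination predicted if first- and second-position bases combined independently.

```python
BASES = ("T", "C", "A", "G")
# Position-specific base counts for the Nicotiana sequences, as quoted in the text.
FIRST = dict(zip(BASES, (581, 438, 603, 745)))
SECOND = dict(zip(BASES, (662, 536, 744, 425)))


def chi_square_homogeneity(rows):
    """Chi-square statistic and degrees of freedom for an r x c contingency table."""
    row_totals = [sum(row) for row in rows]
    col_totals = [sum(col) for col in zip(*rows)]
    grand_total = sum(row_totals)
    stat = 0.0
    for row, row_total in zip(rows, row_totals):
        for observed, col_total in zip(row, col_totals):
            expected = row_total * col_total / grand_total
            stat += (observed - expected) ** 2 / expected
    df = (len(rows) - 1) * (len(col_totals) - 1)
    return stat, df


def predicted_pair_count(first, second, base1, base2):
    """Count of base1-base2-x codons expected under independent assortment."""
    n_codons = sum(first.values())
    return first[base1] * second[base2] / n_codons


stat, df = chi_square_homogeneity(
    [[FIRST[b] for b in BASES], [SECOND[b] for b in BASES]]
)
print(f"first vs. second position: chi-square = {stat:.2f}, df = {df}")

for pair in ("AG", "GG", "TG"):
    predicted = predicted_pair_count(FIRST, SECOND, pair[0], pair[1])
    print(f"{pair}x predicted: {predicted:.2f}")
```

The statistic comes out at 117.42 with 3 df, well beyond the 0.1% critical value of 16.27. The predictions for AGx and TGx agree with the 108.3 and 104.3 quoted in the text, and the GGx prediction comes out near 133.8. The full 15-df test of assortment additionally needs the 16 observed base-base-x counts, which are not reproduced here.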
Indeed, tabulation of base usage by position reveals that the Nicotiana sequences in Table 1 document distinct usage in first and second positions as well (581, 438, 603, 745 and 662, 536, 744, 425, respectively, as T, C, A, G; chi-square, 3 df, 117.42, P < 0.001). This phenomenon by itself is a problem for methods of systematic analysis that use overall frequencies of bases, not frequencies of bases by position. The actual frequencies of A, T, G, and C at the first and the second positions can be used to test whether the observed use of the 16 base-base-x combinations in the Nicotiana data in Table 1 reflects simple assortment or probabilistic combination of first and second position bases. Such a test does not support a simple probabilistic combination; the value of chi-square, 15 df, is 58.54, P < 0.001. Simply stated, even when adjusted for position-specific use of A, T, G, and C, the actual frequencies of codons used in a sequence (or homogeneous set of sequences) are not the frequencies predicted by combination of individual base frequencies. Although many of the 16 base-base-x combinations might appear to be “accurately” predicted in such a way (e.g., 110 observed, 108.3 predicted AGx codons), chi-square examines the entire distribution of comparisons (including the 178 observed, 133.87 predicted GGx codons, and the 77 observed, 104.3 predicted TGx codons) to test the idea of random use of bases by codons in the data set. If, as just shown, the observed frequencies of codons are not those predicted from the frequencies of their component bases, then individual bases are not “characters” themselves. This is not to deny that a difference in base sequence at any one particular position of a sequence may, can, and will mark branches in a reconstruction; such differences will. But individual bases are not characters with statistical properties useful for establishing the confidence of aspects of a phylogenetic reconstruction. It is the triplet within which the base is found that is the elemental character and that marks the branch. The morphologically inclined, who understand a leaf to be composed of leaf base, petiole, and blade, will find this 3-in-1 concept easy to accept; others may need to consider the example of St. Patrick and the shamrock. Methods that correct for the effects of codon bias will be methods based on analysis of triplet codons and that deal individually with each family of synonyms. Using overall frequencies of individual bases to attempt some global average adjustment, while responding to the central issue, and while useful for population geneticists interested in predicting the average behavior of an average gene, is not a method actually useful for reconstructing the phylogeny of taxa. Proteins other than phytochromes—The phytochromes were an appropriate family of proteins to demonstrate coincidence in sequence, seen as codon usage patterns, driven, as shown by Ikemura (1985), by tRNA pools in cells. Beyond the fact of such coincidence, the analysis demonstrates that each set of synonymous triplets needs to be examined for bias and that sweeping generalizations about all third positions are always inappropriate. The effects of competition for charged tRNAs should be expected to differ in stringency for proteins with atypical amino acid composition. Given entire genomes in sequence databases, and the growing databases from gene chip experiments on relative expression of mRNAs and of proteins, it will soon be possible to ask and answer fine questions about codon usage. In a protein with large numbers of arginine residues, for example, codon preference that reflects relative abundance of the six tRNA species will be balanced by the intensity of competition for any abundant tRNA.
I would expect such proteins to show less bias for particular arginine codons than proteins with fewer arginine residues. Similarly, proteins being translated from low-abundance mRNAs will be more likely to show bias toward favored triplets, thereby facilitating the speed of translation, than proteins with similar final concentrations per cell being translated from abundant mRNAs (Xia, 1996). Finally, because protein synthesis can be and is regulated not only by transcriptional effects (mRNA abundance) but also via effects on translation, there will be proteins where evolution has fine-tuned a particular codon usage to keep the level of protein product at a currently favorable level, an exception to the general finding that the strength of codon bias is a function of the level of protein expression (Grantham et al., 1981; Sharp and Li, 1986; Sharp and Devine, 1989; Akashi, 2003). In all of these cases, sequencing of coding regions done for systematics may well document molecular mechanisms of interest to protein chemists or phenomena that are the physiology behind speciation. There are already some reports that proteins exist as groups, identified by the overall properties of their sequences (Carels and Bernardi, 2000). Although the GC-rich proteins are associated with no or few, short introns, and the GC-poor proteins with numerous and long introns, an observation true over a large range in genome size as well as for both monocots and dicots, these classes of proteins will also vary dramatically in codon usage. Indeed, given the results presented in Table 1 of this paper on the homogeneity in codon usage for each amino acid (or lack thereof) for a diverse group of protein-coding
regions, it seems likely that exhaustive analysis of coding regions for proteins within and across the sequenced genomes of plants will generate large numbers of cohorts of proteins, sharing codon biases within cohorts, and differing between cohorts. These differences could reflect tissue-specific or developmentally regulated changes in tRNA pools; such phenomena have been noted (Chiapello et al., 1998). These differences could, of course, also document proteins whose translation is similarly regulated through codon bias or reveal some other natural ordering of proteins in cellular physiology. Codon bias in organellar genomes—The analyses presented in this paper use gene sequences that are encoded in the nuclear DNA. Genes encoded in the two other genomes in plants, that of the plastid and that of the mitochondrion, if translated into protein in those compartments are constrained both by the tRNA pools in those compartments, as well as by differences from the universal translation code, including RNA editing (or by overall change in GC composition of the genome; Andersson and Kurland, 1995). Details of these phenomena, including the intriguing import of tRNAs, are covered in Gillham’s (1994) text and specialized reviews (e.g., Small et al., 1999). While the mechanistic details differ from those for nuclear-encoded genes, the overall effect, coincidence in codon usage, is the same. Indeed, published examination of organellar coding regions for bias finds extreme bias (i.e., extreme uniformity in usage pattern), underscoring my observations in this paper about the Rubisco large subunit in moss and tobacco. The bias can be so strong and pervasive that individual proteins with an atypical codon usage are seen as clues to understanding the relative importance of forces that could result in bias (Morton, 1997; Morton and Levin, 1997). 
Effects of number of genes or choice of gene or incorporating other data—It is true that “parallel studies on genes subject to other functional constraints than phytochrome genes” (Kolukisaoglu et al., 1995, p. 336) can moderate the distorting effects of coincidence in triplet usage just as averaging over several proteins can defeat problems introduced by normalizing selection to conserve amino acid sequence in functionally important regions of a protein (Kellogg and Juliano, 1997). At one level, this idea of using many different sources of information rather than single characters as the basis of a sound, natural, system of classification is just another expression of a central concept usually ascribed to the 1760s and Bernard de Jussieu (Morton, 1981). From the perspective of this paper, the success of such a strategy should rely on the fact (or hope) that, given enough different proteins, each with its own bias, the net bias in codon usage will be zero and, furthermore, that algorithms requiring independence of codons will then be appropriate. Although, with the right collection of proteins, the average bias may be zero, it is obvious that the variance within such an inhomogeneous collection is exceptionally large. Although there will be independent patterns of coincidence between proteins with different patterns of codon bias, there will still be coincidence of codons within individual protein sequences, with all the conceptual problems that then ensue. The incorporation of morphological traits into an analysis “averages” the data set in yet another way. But as single genetic differences can and do produce several phenotypic effects (e.g., Anderson and de Winton, 1935), a phenomenon geneticists know as pleiotropy and morphologists know as correlation, morphological traits may be the best example of how more data may not represent more information. Genes or segments of either genes or genomes that do not code for proteins are not subject to distortions from codon bias, of course. They are, however, variously subject to constraints on sequence imposed by processing sites, functionally required secondary structure, and similar phenomena, all topics well recognized and outside the scope of this paper. The analysis in this paper—Given the option of improving statistical confidence by increasing sample size or by decreasing the variance in the samples, the most powerful option, that giving the greatest resolving power, is to use better, low-variance data. Such low-variance data take account of what protein chemists know about conserved or required amino acids in a protein (Kellogg and Juliano, 1997). This paper demonstrates that low-variance data will also take account of codon bias. Fortunately, this paper also demonstrates that bias among otherwise synonymous codons is easily tabulated and can be tested for with statistical rigor. The paper also shows that each set of synonyms must be examined individually. The coincidence in codon use among phytochrome family members and the difference in family usage between species demonstrate the existence and the magnitude of the effect. It is true that pruning data sets of segments of sequence from domains in which functional constraints distort phylogenetic signal is not only extra work, but requires judgment calls and standards that are only in the earliest stages of being established (Kellogg and Juliano, 1997). It is also true that pruning data sets until they retain only those sets of synonyms for which codon usage can be shown to be random, not biased, is a brutal pruning. This pruning, however, is driven by statistical test and is straightforward. The idea of pruning data does conjure up negative images.
This pruning is not doctoring the data to make the data support a particular conclusion. And as difficult as it is to accept the idea that less data can be more information, that is the overall objective of the pruning. Those data that are pruned are data that contain phylogenetic signal variously distorted by the coincidence in codons; those data that remain are independent contributors to phylogenetic signal. Such data sets will, of course, still be plagued by the decay of phylogenetic signal with increasing durations of phylogenetic time, consequences of back-mutation, and other phenomena known to systematists, and for which corrective approaches have been developed. It is worth noting here that the use of members of gene families to provide an internal calibration of the molecular clock is exactly the one situation most liable to be affected by forces toward coincidence from codon usage. Confidence in suggestions about the times of origin of clades of plants (e.g., Kolukisaoglu et al., 1995) needs to include consideration of coincidence within taxa and variable divergence between several taxa due to codon bias. It is also worth noting that experimental measurements of the strengths of selection from codon bias also reveal areas of sequence that seem selected not for codon usage (selective effects during translation of proteins) but for aspects of secondary structure of the mRNA itself (Hartl et al., 1994). Morphology not superior to DNA sequence—Before DNA-based phylogenies were possible, careful consideration of certain morphological details by systematists led to phylogenetic conclusions; indeed, some of those conclusions are so robust
that they have yet to be revised. Were it possible, however, to exhaustively measure morphology, problems would arise. Many of these problems are obvious; others are more subtle but just as problematic. Allometric relationships are mathematical evidence of developmental linkage, for example. Although an investigator may have measured distances between seemingly distinct sets of landmarks, analysis that finds an allometric relationship means that those measurements were replicate, not unique, measures of the natural organization of the plant. More measurements do not result in more information, just other estimates of the same information about the plant. A similar, and perhaps more subtle, phenomenon is pleiotropy, the effect of one gene on various parts or aspects of an organism’s morphology or physiology. Genes identified by effects on the shape of leaves have been shown to affect the shapes of floral parts (Anderson and DeWinton, 1935). Given the evolutionary linkages between stem, leaf, and flower, this is not an unexpected result. Less intuitive, however, are the effects on the overall architecture of the inflorescence. Just as for allometric relationships, pleiotropy means exhaustive morphological characterization will ultimately document coincident effects from individual elemental changes, rather than resulting in more information. Indeed, a morphologist might suspect that such coincidence defeated much of the promise of ‘‘numerical taxonomy.’’ Pools of tRNA as a character—This paper describes how codon usage is a force for coincidence in DNA sequence, distorting the phylogenetic signal in protein-coding regions. Inherent in this demonstration is the idea that there are characteristic distributions of tRNA molecules for taxa: taxon-specific tRNA pools. The composition of the tRNA pool is itself a character and contains phylogenetic information. 
The taxon-specific pools of tRNA in related species represent independent divergences from the tRNA pool of the common ancestor. This complex (numbers of molecular species) and quantitative (relative or absolute abundance of each tRNA) trait is, at first glance, not amenable to character encoding, even if the raw data were easy to obtain. I suspect, however, that some clever systematist may find ways to capture the phylogenetic signal contained in tRNA pools.

Envoi—This paper attempts to document and explore the phenomenon of codon bias, placing what practical biotechnologists know into an explicit phylogenetic context. It is my hope that the paper, with its general observations and its practical demonstration of the effect in one protein family, points out the magnitude of the problem, explains why different treatment of third position data might be expected to influence both the architecture and the strength of nodes in a phylogeny, and focuses attention on the issue in a useful way. Preference for particular codons in some synonymous sets of codons, and random use of codons in other sets, can be documented using ordinary word processing software and the simple chi-square statistical test. Some synonymous sets of codons differ at the first or second position rather than the third, as in the six-codon sets for leucine, arginine, and serine. Global approaches to correcting for “third position” effects will therefore always be inferior to corrections based on actual analysis of codon usage. Although codon usage poses a problem for phylogenetic reconstructions, exploring the problem reveals avenues for solutions—including a focus on
triplets, not codons, as the character—and identifies tRNA pools as a “character” that diverges over evolutionary time.

LITERATURE CITED

AKASHI, H. 2003. Translational selection and yeast proteome evolution. Genetics 164: 1291–1303.
ANDERSON, E. A., AND D. DE WINTON. 1935. The genetics of Primula sinensis. IV. Indications as to the ontogenetic relationship of leaf and inflorescence. Annals of Botany 49: 671–687.
ANDERSSON, S. G. E., AND C. G. KURLAND. 1995. Genomic evolution drives the evolution of the translation system. Biochemistry and Cell Biology 73: 775–787.
BERNARDI, G., D. MOUCHIROUD, AND C. GAUTIER. 1993. Silent substitutions in mammalian genomes and their evolutionary implications. Journal of Molecular Evolution 37: 583–589.
BRENNICKE, A., L. GROHMANN, R. HIESEL, V. KNOOP, AND W. SCHUSTER. 1993. The mitochondrial genome on its way to the nucleus: different stages of gene transfer in higher plants. FEBS Letters 325: 140–145.
BUSTAMENTE, C. D., R. NIELSEN, AND D. L. HARTL. 2002. A maximum likelihood method for analyzing pseudogene evolution: implications for silent site evolution in humans and rodents. Molecular Biology and Evolution 19: 110–117.
CARELS, N., AND G. BERNARDI. 2000. Two classes of genes in plants. Genetics 154: 1819–1825.
CHIAPELLO, H., F. LISACEK, M. CABOCHE, AND A. HÉNAUT. 1998. Codon usage and gene function are related in sequences of Arabidopsis thaliana. Gene 209: gc1–gc38.
CORMACK, B. P., G. BERTRAM, M. EDGERTON, N. A. GOW, S. FALKOW, AND A. J. BROWN. 1997. Yeast-enhanced green fluorescent protein (yEGFP): a reporter of gene expression in Candida albicans. Microbiology 143: 303–311.
COVELLO, P. S., AND M. W. GRAY. 1992. Silent mitochondrial and active nuclear genes for subunit 2 of cytochrome c oxidase (cox2) in soybean: evidence for RNA-mediated gene transfer. EMBO Journal 11: 3815–3820.
D’AMATO, F. 1965. Chimera formation in mutagen-treated seed and diplontic selection. In The use of induced mutations in plant breeding, 303–315. Pergamon Press, London, UK.
DOBZHANSKY, T. 1970. Genetics of the evolutionary process. Columbia University Press, New York, New York, USA.
FAIRBANKS, D. J., AND B. RYTTING. 2001. Mendelian controversies: a botanical and historical review. American Journal of Botany 88: 737–752.
FELSENSTEIN, J. 1988. Phylogenies and quantitative characters. Annual Review of Ecology and Systematics 19: 445–471.
FENNOY, S. L., AND J. BAILEY-SERRES. 1993. Synonymous codon usage in Zea mays L. nuclear genes is varied by levels of C- and G-ending codons. Nucleic Acids Research 21: 5294–5300.
FREIDBERG, E. C., G. C. WALKER, AND W. SIEDE. 1995. DNA repair and mutagenesis. ASM Press, Washington, D.C., USA.
FUTUYMA, D. J. 1998. Evolutionary biology, 3rd ed. Sinauer Associates, Inc., Sunderland, Massachusetts, USA.
GILLHAM, N. W. 1994. Organelle genes and genomes. Oxford University Press, New York, New York, USA.
GRANTHAM, R., C. GAUTIER, M. GOUY, M. JACOBZONE, AND R. MERCIER. 1981. Codon catalog usage is a genome strategy modulated for gene expressivity. Nucleic Acids Research 9: r43–r74.
HARTL, D. L., E. N. MORIYAMA, AND S. A. SAWYER. 1994. Selection intensity for codon bias. Genetics 138: 227–234.
IKEMURA, T. 1985. Codon usage and tRNA content in unicellular organisms. Molecular Biology and Evolution 2: 13–34.
KADOWAKI, K., N. KUBO, K. OZAWA, AND A. HIRAI. 1996. Targeting presequence acquisition after mitochondrial gene transfer to the nucleus occurs by duplication of existing targeting signals. EMBO Journal 15: 6652–6661.
KELLOGG, E. A., AND N. D. JULIANO. 1997. The structure and function of RuBisCO and their implications for systematic studies. American Journal of Botany 84: 413–428.
KOLUKISAOGLU, H. Ü., S. MARX, C. WIEGMANN, S. HANELT, AND H. A. W. SCHNEIDER-POETSCH. 1995. Divergence of the phytochrome gene family predates angiosperm evolution and suggests that Selaginella and Equisetum arose prior to Psilotum. Journal of Molecular Evolution 41: 329–337.
KUMAR, P. A., AND R. P. SHARMA. 1995. Codon usage in Brassica genes. Journal of Plant Biochemistry and Biotechnology 4: 113–115.
KWIATOWSKI, J., D. SKARECKY, K. BAILEY, AND F. J. AYALA. 1994. Phylogeny of Drosophila and related genera inferred from the nucleotide sequence of the Cu,Zn Sod gene. Journal of Molecular Evolution 38: 443–454.
MAGALLÓN, S., AND M. J. SANDERSON. 2002. Relationships among seed plants inferred from highly conserved genes: sorting conflicting phylogenetic signals among ancient lineages. American Journal of Botany 89: 1991–2006.
MATHEWS, S., AND R. A. SHARROCK. 1997. Phytochrome gene diversity. Plant Cell and Environment 20: 666–671.
MORTON, A. G. 1981. History of botanical science. Academic Press, London, UK.
MORTON, B. R. 1997. Rates of synonymous substitution do not indicate selective constraints on the codon use of the plant psbA gene. Molecular Biology and Evolution 14: 412–419.
MORTON, B. R., AND J. A. LEVIN. 1997. The atypical codon usage of the plant psbA gene may be the remnant of an ancestral bias. Proceedings of the National Academy of Sciences, USA 94: 11434–11438.
MURRAY, E. E., J. LOTZER, AND M. EBERLE. 1989. Codon usage in plant genes. Nucleic Acids Research 17: 477–498.
MUSE, S. V. 1996. Estimating synonymous and nonsynonymous substitution rates. Molecular Biology and Evolution 13: 105–114.
NEEL, J. V. 1983. Frequency of spontaneous and induced “point” mutations in higher eukaryotes. Journal of Heredity 74: 2–15.
NUGENT, J. M., AND J. D. PALMER. 1991. RNA-mediated transfer of the gene coxII from the mitochondrion to the nucleus during flowering plant evolution. Cell 66: 473–481.
ROUWENDAL, G. J., O. MENDES, E. J. WOLBERT, AND A. DOUWE DE BOER. 1997. Enhanced expression in tobacco of the gene encoding green fluorescent protein by modification of its codon usage. Plant Molecular Biology 33: 989–999.
SCHAEFER, D., J.-P. ZRŸD, C. D. KNIGHT, AND D. J. COVE. 1991. Stable transformation of the moss Physcomitrella patens. Molecular and General Genetics 226: 418–424.
SCHNEIDER-POETSCH, H. A. W., S. MARX, H. Ü. KOLUKISAOGLU, S. HANELT, AND B. BRAUN. 1994. Phytochrome evolution: phytochrome genes in ferns and mosses. Physiologia Plantarum 91: 241–250.
SHARP, P. M., AND K. M. DEVINE. 1989. Codon usage and gene expression level in Dictyostelium discoideum: highly expressed genes do ‘prefer’ optimal codons. Nucleic Acids Research 17: 5029–5039.
SHARP, P. M., AND W.-H. LI. 1986. An evolutionary perspective on synonymous codon usage in unicellular organisms. Journal of Molecular Evolution 24: 28–38.
SIMPSON, G. G., A. ROE, AND R. C. LEWONTIN. 1960. Quantitative zoology, rev. ed. Harcourt, Brace & World, Inc., New York, New York, USA.
SINIBALDI, R. M., AND I. J. METTLER. 1992. Intron-splicing and intron-mediated enhanced expression in monocots. Progress in Nucleic Acid Research and Molecular Biology 42: 229–257.
SMALL, I., K. AKASHI, A. CHAPRON, A. DIETRICH, A.-M. DUCHÊNE, D. LANCELLIN, L. MARÉCHAL-DROUARD, B. MENAND, H. MIREA, Y. MOUDDEN, J. OVESNA, N. PEETERS, W. SAKAMOTO, G. SOUCLET, AND H. WINTZ. 1999. The strange evolutionary history of plant mitochondrial tRNAs and their aminoacyl-tRNA synthetases. Journal of Heredity 90: 333–337.
TENNYSON, A., LORD. 1850. In memoriam. Edward Moxon, London, UK.
THÜMMLER, F., M. DUFNER, P. KREISEL, AND P. DITTRICH. 1992. Molecular cloning of a novel phytochrome gene of the moss Ceratodon purpureus which encodes a putative light-regulated protein kinase. Plant Molecular Biology 20: 1003–1017.
TIFFIN, P., AND M. W. HAHN. 2002. Coding sequence divergence between two closely related plant species: Arabidopsis thaliana and Brassica rapa ssp. pekinensis. Journal of Molecular Evolution 54: 746–753.
WRIGHT, F. 1990. The ‘effective number of codons’ used in a gene. Gene 87: 23–29.
XIA, X. 1996. Maximizing transcription efficiency causes codon usage bias. Genetics 144: 1309–1320.
YANG, Z., AND R. NIELSEN. 2002. Codon-substitution models for detecting molecular adaptation at individual sites along specific lineages. Molecular Biology and Evolution 19: 908–917.
YUANG, T. T., L. CHENG, AND S. R. KAIN. 1996. Optimized codon usage and chromophore mutations provide enhanced sensitivity with the green fluorescent protein. Nucleic Acids Research 24: 4592–4593.