J Mol Evol (1998) 47:508–516
© Springer-Verlag New York Inc. 1998
The Root of the Universal Tree of Life Inferred from Anciently Duplicated Genes Encoding Components of the Protein-Targeting Machinery
Simonetta Gribaldo, Piero Cammarano
Istituto Pasteur Fondazione Cenci-Bolognetti, Università di Roma “La Sapienza,” Dipartimento di Biotecnologie Cellulari ed Ematologia, Sezione di Genetica Molecolare, Policlinico Umberto I°, Viale Regina Elena 324, 00161, Roma, Italy
Received: 19 March 1998 / Accepted: 5 June 1998
Abstract. The key protein of the signal recognition particle (termed SRP54 for Eucarya and Ffh for Bacteria) and the protein (termed SRα for Eucarya and Ftsy for Bacteria) involved in the recognition and binding of the ribosome–SRP–nascent polypeptide complex are the products of an ancient gene duplication that appears to predate the divergence of all extant taxa. The paralogy of the genes encoding the two proteins (both of which are GTPases) is supported by obvious sequence similarities between the N-terminal half of SRP54(Ffh) and the C-terminal half of SRα(Ftsy). This enables a universal phylogeny based on either protein to be rooted using the second protein as an outgroup. Phylogenetic trees inferred by various methods from an alignment (220 amino acid positions) of the shared SRP54(Ffh) and SRα(Ftsy) regions generate two reciprocally rooted universal trees corresponding to the two genes. The root of both trees is firmly positioned between Bacteria and Archaea/Eucarya, thus providing strong support for the notion (Iwabe et al. 1989; Gogarten et al. 1989) that the first bifurcation in the tree of life separated the lineage leading to Bacteria from a common ancestor of Archaea and Eucarya. None of the gene trees inferred from the two paralogues supports a paraphyletic Archaea with the crenarchaeota as a sister group to Eucarya.
Key words: Tree of life; Signal recognition particle (SRP); SRP receptor; Gene duplication; Last universal common ancestor; Phylogeny
Correspondence to: P. Cammarano; e-mail: Cammarano@bce.med.uniroma1.it
Introduction
The root of the universal tree relating all three domains of life (Archaea, Eucarya, Bacteria) cannot be inferred from sequences of single-gene homologues (Iwabe et al. 1989). This difficulty was circumvented by Gogarten et al. (1989) and Iwabe et al. (1989) by reciprocally rooting trees for paralogous genes that arose by duplication prior to the divergence of prokaryotes and eucaryotes. Composite trees generated in this way from two independent sets of anciently duplicated genes, namely the genes encoding the elongation factors (EFs) Tu(1α) and G(2) (Iwabe et al. 1989) and the genes for the α and β subunits of membrane ATPases (Gogarten et al. 1989; Iwabe et al. 1989), placed the Archaea together with the Eucarya to the exclusion of all Bacteria. More recently, the sisterhood of Archaea and Eucarya has been confirmed (i) by rooting a universal tree of isoleucyl-tRNA synthetase sequences with a tree of paralogous bacterial and eucaryal valyl-tRNA synthetase sequences (Brown and Doolittle 1995) and (ii) by reciprocally rooting the universal trees generated from the two repeats of an internally duplicated sequence of a carbamoylphosphate synthetase subunit (Lawson et al. 1996).
Given the issue at stake (the topology of the tree of life and the nature of the last universal common ancestor), it is essential that the Gogarten/Iwabe rooting be challenged with several alternative sets of anciently duplicated genes. To this end, we have identified two promising candidates among the key protein components of the cellular machinery that ubiquitously assists the transport of secretory proteins to their final destinations. The two proteins involved are a 50- to 54-kDa subunit of the signal recognition particle (SRP), termed SRP54 for Eucarya and Ffh (for "fifty-four homologue") for Bacteria, and an SRP receptor protein (SR), termed SRα for Eucarya and Ftsy for Bacteria. SRP54 participates in the formation of the eucaryal SRP in close association with a 7S RNA molecule and five additional protein subunits (9, 14, 19, 68, and 72 kDa), while its bacterial homologue, Ffh, is the sole protein of a less complex SRP comprising what appears to be a truncated (4.5S) version of the eucaryal 7S RNA (Poritz et al. 1990; Kaine and Merkel 1989; Althoff et al. 1994). A greater complexity of the eucaryal protein-targeting system is also observed for the SRP receptor, in that SRα occurs as a subunit of an ER membrane-bound heterodimeric (SRα–SRβ) protein, while Ftsy (the bacterial SRP receptor) exists as a single cytosolic protein. As for the Archaea, the complexity of the SRP and SRP receptor can be argued only from gene sequencing data; these indicate that Archaea exhibit an intermediate situation in having a 7S RNA moiety like Eucarya, but only two 7S RNA-associated proteins (the 54- and 19-kDa components). Similar to Bacteria, however, the archaeal SRP receptor system does not appear to contain homologues of the eucaryal SRβ subunit (Pohlschröder et al. 1997).
Both SRP54(Ffh) and SRα(Ftsy) are guanosine triphosphatases (GTPases) whose functioning (in Eucarya and Bacteria) has been extensively investigated and exhaustively reviewed (Walter and Johnson 1994). In Eucarya, SRP54 is responsible for recognizing the signal sequence of nascent secretory proteins at the ribosomal exit domain, while SRα binds, in a GTP-dependent manner, the nascent chain/ribosome/SRP complex to the ER membrane, causing the release of SRP and the onset of cotranslational translocation. A similar SRP-mediated signal recognition and a similar SRP–receptor interaction occur in Bacteria.
However, the precise mechanisms whereby secretory proteins are translocated across the bacterial cytoplasmic membrane remain unclear.
As the key components of the protein-targeting machinery (7S(4.5S) RNA, SRP54(Ffh), and SRα(Ftsy)) are universally distributed and display conserved primary-structural features, it is argued that they are descended from an ancestral mechanism predating the divergence of all extant organismal lineages. More importantly, a region [about 250 amino acids (aa)] of SRP54(Ffh) comprising the GTP-binding domain and two flanking regions is obviously repeated in SRα(Ftsy) (Althoff et al. 1994). This suggests that SRP54(Ffh) and SRα(Ftsy) originated from an ancient gene duplication, thus permitting the rooting of the tree of life. Here we show that composite trees of SRP54(Ffh) and SRα(Ftsy) sequences from a phylogenetically diverse spectrum of organisms are firmly rooted between two highly distinct groups, the Archaea/Eucarya and the Bacteria, thus providing strong support for the original Gogarten/Iwabe rooting of the universal tree. Some of the results in the present report have a bearing on ongoing controversies concerning the paraphyly of Archaea and the sisterhood of crenarchaeotes and Eucarya.
Methods
Sequences. SRP54(Ffh)- and SRα(Ftsy)-related sequences were retrieved by BLAST (Altschul et al. 1990) and FASTA (Pearson et al. 1988) probing of the DNA and protein databases with the tBLASTN and FASTAp programs, using the GCG program suite (Genetics Computer Group) (Devereux et al. 1984) of the UK MRC Human Genome Mapping Project (HGMP) Resource Centre (Cambridge University, UK).
Alignments. Preliminary multiple alignments of amino acid sequences were generated with the program Clustal W (Thompson et al. 1994) using default gap penalties. The selection of characters eligible for the construction of phylogenetic trees was optimized by comparing all sections of the SRP54(Ffh) and SRα(Ftsy) alignments with comprehensive inventories of significant binary alignments obtained by BLAST probing of the DNA and protein databases with representative eucaryal, archaeal, and bacterial sequences (Cammarano et al. 1998). Complete multiple alignments of the individual SRP54(Ffh) and SRα(Ftsy) sequences and a complete alignment of SRP54(Ffh) with the SRα(Ftsy) sequences are retrievable (Filename SRP.aln) via anonymous ftp at ftp.bce.med.uniroma1.it; dir/cammara.
Tree-Making Algorithms. Phylogenetic trees were constructed using distance-matrix (DM), maximum-parsimony (MP), and maximum-likelihood (ML) methods. MP analyses used the PROTPARS program of the Phylogeny Inference Package (PHYLIP), Version 3.57c (Felsenstein 1993), which neglects synonymous substitutions; the PHYLIP programs SEQBOOT, PROTPARS, and CONSENSE were used (in that order) to derive an MP tree based on 100 bootstrap replicates. Evolutionary distances between all pairs of taxa (DM analyses) were calculated with the program PROTDIST of the PHYLIP 3.57c package, which estimates the number of expected amino acid replacements per position using a substitution model based on the Dayhoff 120 matrix; the resultant distance matrix was then used to construct a neighbor-joining tree with the program NEIGHBOR. The PHYLIP programs SEQBOOT, PROTDIST, NEIGHBOR, and CONSENSE were used (in that order) to derive a consensus tree based on 100 bootstrap replications of the original alignment. ML analyses utilized the ProtML program of the MOLPHY (Molecular Phylogenetics) software package, Version 2.2 (Adachi and Hasegawa 1992), and the program PUZZLE, Version 4.0 (Strimmer and Von Haeseler 1995), which allows rate heterogeneity among sites to be taken into account. The amino acid substitution model was chosen by the ML criterion using the Jones–Taylor–Thornton (JTT), Dayhoff, and Blosum 62 (Henikoff and Henikoff 1992) substitution models implemented in the PUZZLE program and the JTT-F and Dayhoff-F models implemented in the MOLPHY programs. The program ProtML was used to estimate the relative bootstrap confidence levels of alternative topologies.
To this end, 1000 candidate topologies were selected (of 2,027,025) by the approximate log-likelihood criterion (Adachi 1995; Waddel 1995) from an exhaustive search of a partially constrained starting tree in which 16 SRP54(Ffh) and 16 SRα(Ftsy) sequences were organized in 10 topological groupings; the retained 1000 top-ranking trees were then analyzed for the best tree by the RELL (resampling of estimated log-likelihood) bootstrap method with the “users” option of ProtML (Kishino and Hasegawa 1989; Kishino et al. 1990). The quartet puzzling algorithm implemented in the program PUZZLE was used with a gamma-distributed model of rate heterogeneity among sites and eight gamma-rate categories.
Fig. 1. Schematic drawing of the alignment of 28 SRP54(Ffh) and 22 SRα(Ftsy) sequences from Bacteria (B), Archaea (A), and Eucarya (E). Abbreviations: G, X, and M indicate the respective domains; g1, g2, and g3 indicate the three G-domain consensus motifs that are common to all members of the GTP-binding protein superfamily: (i) GXXXXGKT, (ii) DXXG, and (iii) ([T,S,N]KXD) (Romisch et al. 1989; Bernstein et al. 1989; Ævarsson 1995); PGB is the putative guanine nucleotide dissociation stimulator binding element identified by Althoff et al. (1994). Figures in parentheses indicate lengths of sequences. Because of the remarkable length variability, the longest sequence in each protein family has been drawn.
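The figure of 2,027,025 candidate topologies quoted in the Methods for 10 topological groupings is consistent with the standard count of unrooted, strictly bifurcating trees on n tips, (2n − 5)!!. The short Python sketch below checks this arithmetic. It assumes that the 10 groupings were treated as the tips of an unrooted tree; the function name is illustrative and is not part of MOLPHY or any other package named in the text.

```python
def num_unrooted_trees(n_tips: int) -> int:
    """Count unrooted, strictly bifurcating tree topologies on n_tips labelled tips.

    Uses the double-factorial formula (2n - 5)!! = 3 * 5 * ... * (2n - 5),
    valid for n_tips >= 3 (a single topology exists for 3 tips).
    """
    if n_tips < 3:
        raise ValueError("at least 3 tips are needed for an unrooted tree")
    count = 1
    # Multiply the odd numbers 3, 5, ..., 2n - 5 (empty product when n_tips == 3).
    for k in range(3, 2 * n_tips - 4, 2):
        count *= k
    return count


if __name__ == "__main__":
    # Assumption: the 10 topological groupings act as 10 tips.
    print(num_unrooted_trees(10))  # 2027025
```

Under this assumption, the 1000 retained trees correspond to roughly 0.05% of all possible arrangements of the 10 groupings.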
Results and Discussion
Sequence Alignment. By screening current databases we have retrieved an inventory of 28 SRP54(Ffh) and 22 SRα(Ftsy) sequences spanning a broad spectrum of phylogenetic diversity and including several representatives of both the euryarchaeota (Methanococcus jannaschii, Methanobacterium thermoautotrophicum, Archaeoglobus fulgidus, Thermococcus sp.) and the crenarchaeota (Acidianus ambivalens and two members of the Sulfolobales). The two proteins display a remarkable length heterogeneity, ranging in size from 430 to 541 residues for SRP54(Ffh) and from 329 to 638 residues for SRα(Ftsy). The result of multiply aligning the SRP54(Ffh) sequences with their SRα(Ftsy) counterparts is shown schematically in Fig. 1. In accordance with previous data (Althoff et al. 1994), the N-terminal half of SRP54(Ffh) is repeated in (and is easily alignable with) the C-terminal half of SRα(Ftsy). This shared region contains (i) a short sequence termed the X domain (about 60 residues), possibly responsible for SRP54(Ffh)/SRα(Ftsy) contact, and (ii) a G-domain (about 170 residues) comprising the three consensus motifs (I–III) characteristic of the G superfamily (see legend to Fig. 2) together with a so-called PGB motif proposed to be the site of interaction of the two proteins with a common regulatory factor (Althoff et al. 1994).
Outside the shared region, no further homology between the two molecules can be found: SRP54(Ffh) contains a unique C-terminal, methionine-rich accretion termed the M domain, while SRα(Ftsy) possesses a unique N-terminal accretion, termed the α domain. The M and α domains accommodate most of the length differences seen between the SRP54(Ffh) and the SRα(Ftsy) proteins, respectively, while the length of the shared region appears to be conserved over vast evolutionary distances. Based on the SRP54(Ffh)/SRα(Ftsy) alignment, the mean sequence identity values between the overlapping regions of the two protein families were 29.3% for Bacteria, 28.6% for Archaea, and 24.1% for Eucarya. Despite these low similarities, the two proteins are more closely related to one another than to any other member of the G-superfamily. In fact, while the SRP54(Ffh) and SRα(Ftsy) sequences were reciprocally retrieved at low values of P(N) (the Poisson probability for random homology) by BLAST probing of the protein and DNA databases, no other GTP-binding proteins [e.g., the translational GTPases EF-Tu(1α), EF-Ts, EF-G(2), IF2, and Ras-related GTPases] were retrieved with SRP54(Ffh) and SRα(Ftsy) as the query sequences. Conversely, no GTPases of the protein-targeting pathway were retrieved with EF-Tu(1α) and EF-G(2) as the queries up to P(N) = 0.9995. This argues strongly against the possibility that SRP54(Ffh) and SRα(Ftsy) arose independently from two different sets of ancestral G-protein genes.
For phylogenetic tree reconstruction, seven conserved blocks (A–G in Fig. 2, totaling 220 amino acid positions) were selected from the region of the SRP54(Ffh)/SRα(Ftsy) alignment encompassing the sequence overlap. In this data set the mean intradomain sequence identity values among the SRP54(Ffh) sequences were 52.5% for Bacteria, 57.3% for Archaea, and 65.0% for Eucarya.
The corresponding intradomain identities among the SRα(Ftsy) sequences were 45.6% for Bacteria, 54.7% for Archaea, and 51.4% for Eucarya. The mean sequence identity values between SRP54(Ffh) and SRα(Ftsy) were 35.5% for Bacteria, 35.1% for Archaea, and 29.5% for Eucarya; we calculate similar identity values for the EF-Tu(1α)/EF-G(2) data set used by Baldauf et al. (1996) (33.1, 34.1, and 34.7% for Bacteria, Archaea, and Eucarya, respectively).
Phylogenetic Analyses. Composite trees of SRP54(Ffh) and SRα(Ftsy) amino acid sequences were inferred from the selected sequence blocks (A–G) by distance-matrix, maximum-parsimony, and maximum-likelihood methods (Figs. 3–5). All three analyses generated two universal trees: one comprising the SRP54(Ffh) sequences, the other corresponding to the SRα(Ftsy) sequences. This confirms that the gene duplication that gave rise to the two key components of the protein-targeting machinery preceded the divergence of Bacteria, Archaea, and Eucarya, and for this reason, it can be used to root the tree of life.
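The mean identity values reported for the two protein families can be illustrated with a minimal Python sketch of how a mean pairwise percent identity between two groups of aligned sequences might be computed. The text does not state how gaps were handled or which software produced the figures, so skipping gapped positions is an assumption made here, and the function names and toy sequences are hypothetical rather than taken from the actual alignment.

```python
from itertools import product


def percent_identity(seq_a: str, seq_b: str, gap: str = "-") -> float:
    """Percent identity between two aligned sequences of equal length.

    Assumption: columns where either sequence has a gap are skipped.
    """
    if len(seq_a) != len(seq_b):
        raise ValueError("aligned sequences must have the same length")
    compared = 0
    matches = 0
    for res_a, res_b in zip(seq_a, seq_b):
        if res_a == gap or res_b == gap:
            continue  # ignore gapped columns
        compared += 1
        if res_a == res_b:
            matches += 1
    return 100.0 * matches / compared if compared else 0.0


def mean_identity_between(group_a: list[str], group_b: list[str]) -> float:
    """Mean percent identity over all pairs (one sequence from each group)."""
    pairs = list(product(group_a, group_b))
    return sum(percent_identity(a, b) for a, b in pairs) / len(pairs)


if __name__ == "__main__":
    # Toy aligned fragments (hypothetical, not from the SRP.aln data set).
    family_one = ["GKTTT", "GKTTS"]
    family_two = ["GKSTT"]
    print(mean_identity_between(family_one, family_two))
```

The same pairwise function could be averaged over all pairs within one group to obtain an intradomain value.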
Fig. 2. Abridged alignment of SRP54(Ffh) and SRα(Ftsy) amino acid sequences from the three domains of life. Only 28 of the 50 sequences used for the SRP54(Ffh)/SRα(Ftsy) alignment are shown for reasons of space; the complete alignment is retrievable via anonymous ftp (see Methods). Sequence blocks (A–G) were used to construct the composite trees shown in Figs. 3–5. Blocks A and B correspond to the X domain; blocks C to G correspond to the G domain (g1, g2, g3, and PGB as in Fig. 1). The shading in block E indicates a position deselected for the phylogenetic analysis. Capital S and D (for “docking” protein) preceding species names identify SRP54(Ffh) and SRα(Ftsy) sequences, respectively. Highlighted characters indicate positions that are occupied by identical or similar amino acids (ILVM, DEKRH, ST, GA, FYW, NQ) in no less than 80% of the aligned sequences. The starting positions for each sequence are indicated in each block; ending positions of each sequence are given for block G together with the total lengths of the sequence (in parentheses). The full organism names and accession numbers for the corresponding SRα(Ftsy) sequences are as follows: Aam (Acidianus ambivalens; X95989), Afu (Archaeoglobus fulgidus; AE00961), Bsu (Bacillus subtilis; P51835), dog (Canis familiaris; P06625), Eco (Escherichia coli; P10121), Hin (Haemophilus influenzae; P44870), Hsa (Homo sapiens; P08240), Mge (Mycoplasma genitalium; P47539), Mho (Mycoplasma hominis; Y11726), Mja (Methanococcus jannaschii; Q57739), Mle (Mycobacterium leprae; Z97369), Mmy (Mycoplasma mycoides; Y10137), Mpn (Mycoplasma pneumoniae; P75362), Mth (Methanobacterium thermoautotrophicum; AE000920), Mtu (Mycobacterium tuberculosis; Q10969), Ngo (Neisseria gonorrhoeae; P14929), Rpr (Rickettsia prowazekii; Y11784), Sac (Sulfolobus acidocaldarius; X77509), Sce (Saccharomyces cerevisiae; P32916), Sso (Sulfolobus solfataricus; P27414), Syn (Synechocystis sp.; D90910), and Tsp (Thermococcus sp.; U95207).
The full organism names and the accession numbers for the (Continued on next page)
Fig. 3. Neighbor-joining tree constructed from the SRP54(Ffh)/SRα(Ftsy) alignment. Numbers attached to internal nodes are bootstrap confidence levels (BCL) based on 100 bootstrap replicates of the original alignment. BCLs