5578–5588 Nucleic Acids Research, 2009, Vol. 37, No. 16 doi:10.1093/nar/gkp547
Published online 15 July 2009
Probing the relationship between Gram-negative and Gram-positive S1 proteins by sequence analysis Philippe Salah1, Marco Bisaglia1,4, Pascale Aliprandi1, Marc Uzan2,3, Christina Sizun1 and Franc¸ois Bontems1,* 1
CNRS, Centre de Recherche CNRS de Gif-sur-Yvette (FRC 3115), Institut de Chimie des Substances Naturelles, avenue de la terrasse, 91198 Gif-sur-Yvette Cedex, 2Universite´ Paris 6-Pierre et Marie Curie, 4 place Jussieu, 75252 Paris cedex 05, 3CNRS, FRE3207 Acides Nucle´iques et Biophotonique, 4 place Jussieu, 75251 Paris cedex 05, France and 4Department of Biology, University of Padova, via U. Bassi 58/B 35121 Padova, Italia
Received April 10, 2009; Revised May 19, 2009; Accepted June 10, 2009
ABSTRACT Escherichia coli ribosomal protein S1 is required for the translation initiation of messenger RNAs, in particular when their Shine–Dalgarno sequence is degenerated. Closely related forms of the protein, composed of the same number of domains (six), are found in all Gram-negative bacteria. More distant proteins, generally formed of fewer domains, have been identified, by sequence similarities, in Gram-positive bacteria and are also termed ‘S1 proteins’. However in the absence of functional information, it is generally difficult to ascertain their relationship with Gram-negative S1. In this article, we report the solution structure of the fourth and sixth domains of the E. coli protein S1 and show that it is possible to characterize their b-barrel by a consensus sequence that allows a precise identification of all domains in Gram-negative and Gram-positive S1 proteins. In addition, we show that it is possible to discriminate between five domain types corresponding to the domains 1, 2, 3, 4–5 and 6 of E. coli S1 on the basis of their sequence. This enabled us to identify the nature of the domains present in Gram-positive proteins and, subsequently, to probe the filiations between all forms of S1. INTRODUCTION Prokaryotic cells (and chloroplasts) translation seldom starts at the first AUG codon. The small ribosomal subunit identifies the initiation codons among synonymous triplets by the means of specific signals in their vicinity. The most often encountered signal, recognized by the translation system of all bacteria, is the Shine–Dalgarno sequence (with consensus AAGGAG), located 5–13 nt
upstream from the initiator AUG (1,2). However, in addition to this universal mechanism, specific systems co-exist. In particular, the translation initiation of most, if not all, Escherichia coli messenger RNAs (mRNAs) depends on the presence of the ribosomal protein S1 (3) that was described to recognize an A/U rich sequence upstream of the initiation codon (4). This protein is found in Gram-negative bacteria. More or less distantly related forms have been observed in Gram-positive bacteria, but the nature of these proteins, their roles and their relationship with the Gram-negative S1 remain unclear. Gram-negative proteins S1 are formed of six similar domains. These domains have been shown to play different roles in the E. coli protein S1. The two first are involved in the binding of the protein to the ribosome, while the four following are involved in the interactions with mRNAs (5). As the last domain has been shown to be dispensable for translation initiation (6), its function is still unknown. The fragment of S1 formed of the domains 3, 4 and 5 (F3–5 fragment) also enhances the activity of the ribonuclease RegB of the bacteriophage T4, whose function is to inactivate several of the phage early messengers, when their translation in no more required, by cleaving them in their Shine–Dalgarno sequence. The domains 3, 4 and 5 are simultaneously involved in the binding of RNAs (7), but there is a structural and functional asymmetry between the domain 3 on the one hand and the domains 4 and 5 on the other hand. The domains 4 and 5 are structurally associated in the F3–5 fragment, while the domain 3 is free to move, in the absence of RNA at least (7). The fragment 45 has a function by itself (in RegB activation) while the domain 3 is only active when linked to the two others (8). The domains of the protein S1 are more or less closely related to a set of domains found in many proteins involved in the RNA metabolism in all kinds of organisms. Several structures of such domains have been determined (9), showing that they belong to the OBfold (oligonucleotide–oligosaccharide-binding fold) family.
*To whom correspondence should be addressed. Tel: +(33) 1 69 82 36 78; Email:
[email protected] ß 2009 The Author(s) This is an Open Access article distributed under the terms of the Creative Commons Attribution Non-Commercial License (http://creativecommons.org/licenses/ by-nc/2.0/uk/) which permits unrestricted non-commercial use, distribution, and reproduction in any medium, provided the original work is properly cited.
Nucleic Acids Research, 2009, Vol. 37, No. 16 5579
However, to date, there is no known structure of a protein S1 domain. To gain insight into the properties of these domains and with the objective to find structural features that could be related to their functional differences, we determined the structure of the domains 4 and 6 of the E. coli protein S1 and investigated the interactions between the domain 6 and two RNAs [poly(A) and poly(U)]. We show that there is no structural difference between the domains. RNA binding on the domain 6 induces fewer modifications at the surface of the b-barrel than in the case of the domains 3, 4 or 5, and involves the C-terminal flexible segment of the protein, but the interaction occurs on the same side of the b-barrel as for the other domains. However, comparison of the structures of the two domains allowed us to identify consensus sequences characteristic of the five b-strands of the domains of all Gram-negative and, apparently, Grampositive proteins S1. Using these sequences we were able to precisely align the sequences of the domains of the Gram-negative proteins S1 and consequently to show that it is possible to discriminate between the different domains (at positions 1, 2, 3, 4, 5 and 6) by using only the sequences. Accordingly we also were able to identify the nature of all domains of the Gram-positive proteins S1 and to probe the relationship between the proteins S1 of the two groups of bacteria. MATERIALS AND METHODS Proteins production and purification Typically, 25 ml overnight cultures of the E. coli strain BL21(DE3) transformed with the adequate plasmid (8) were used to inoculate 1 l of M9 minimal medium supplemented with 1.0 g l–1 15NH4Cl and either 4 g l–1 glucose (15N U-labeled protein samples) or 2 g l–1 13C-glucose (15N13C U-labeled protein samples). Protein expression was induced at OD600 = 0.6 using 1 mmol l–1 isopropylb-D-thiogalactopyranoside. Cells were harvested 3–4 h later, disrupted by sonication and proteins were purified either on a Talon resin (Clontech) or on a Fast Flow Ni-Histrap column (GE Healtcare) using the manufacturer’s recommended protocols. The proteins were dialyzed against nuclear magnetic resonance (NMR) buffer [50 mmol l–1 phosphate, pH 6.8, 200 mmol l–1 NaCl, 20 mmol l–1 dithiothreitol (DTT)] and concentrated up to 0.7–0.8 mmol l–1. NMR spectroscopy All NMR experiments were realized on a Bruker DRX 600 spectrometer equipped first with a 5-mm TXI triple resonance Z-gradient probe (F6 study) and then with a 5-mm TXI triple resonance Z-gradient cryoprobe (F4 study). Data were processed with GIFA (10) or XWINNMR 3.0 (Bruker) and analyzed with XEASY (11) or Sparky software (University of California San Francisco, Thomas L. Goddard). All spectra were recorded at 303 K. The resonance frequency assignment of the backbone atoms (HN, N, C0 , Ca and Cb) was obtained by recording and analyzing HNCO, HNCA, HN(CO)CA, HNCACB
and CBCA(CO)NH triple-resonance experiments recorded on 15N13C U-labeled samples in 90/10% H2O/ D2O. The Ha and aliphatic side-chain 1H and 13C resonance frequencies were further assigned by the means of 3D TOCSY-HSQC (60 ms spin-lock time, 15N U-labeled samples in 90/10% H2O/D2O) and HCCH-TOCSY (12 ms spin-lock time, 15N13C U-labeled samples in 100% D2O) experiments. Finally, the aromatic 1H resonance frequencies were assigned using NOESY and COSY spectra (15N U-labeled samples in 100% D2O). Distance constraints were obtained from the analysis of 15N-NOESY-HSQC (80 ms mixing-time, 15N U-labeled samples in 90/10% H2O/D2O), 13Caliphatic-NOESY-HSQC (80 ms mixingtime, 15N13C U-labeled samples in 90/10% H2O/D2O for the F4 fragment or 100% D2O for the F6 fragment) and a series of 2D-NOESY (25–100 ms mixing-times experiments, 15N U-labeled samples in 100% D2O). Domain 6/poly(A) and poly(U) interaction experiments were analyzed by recording 15N–HQSC spectra on 0.5 mmol l–1 15N U-labeled samples of the domain 6 in the presence of increasing amounts of poly(A) and poly(U) (Amersham Pharmacia Biotech.). Each titration point was obtained by transferring the sample from the NMR tube into a 1.5 ml vial containing the desired amount of dry polyribonucleotide and transferring it back into the NMR tube. For each series, five HSQC spectra were recorded corresponding to RNA/protein ratios of 0, 1, 5, 10 and 20, where RNA quantities are expressed in nucleotide (monomer) quantities. In all interaction experiments, all recording parameters were kept rigorously constant, the only adjustment concerning the probe tuning and the field shimming. All HSQC were recorded with 200 15N increments in order to obtain a sufficient resolution. Structure calculations The structures of domains 4 and 6 were calculated using the INCA software (12), which mainly uses unassigned NOE cross peaks to perform the structure determination. Typically, the software carried out 22 calculation cycles, each corresponding to an automatic assignment, a structure calculation and an analysis step. The assignment step converts each NOE cross peak to a list of possible constraints based on the distances observed in the structures calculated during the preceding cycle. The calculation step carries out the calculation of 500 structures from the constraint lists by simulated annealing. Subsequently, the best 20 structures are selected during the analysis step. The program input consisted of three lists of unassigned NOE correlations picked from the 80 ms 13C- and 15NNOESY–HSQC and the 60 ms 2D-NOESY spectra, with the corresponding lists of 1H, 15N and 13C chemical shifts, with a list of constraints derived from manually assigned NOEs (mainly those characteristic of the secondary structure elements and topology) and the identified hydrogen bonds and finally with a list of phi and psi dihedral angle constraints derived from the TALOS analysis of the backbone chemical shifts (13). The coordinates of 12 structures and the restraint files have been deposited at the
5580 Nucleic Acids Research, 2009, Vol. 37, No. 16
RCSB with accession code 2KHI (domain 4) and 2KHJ (domain 6). Sequence analysis Using the NCBI search engine (www.ncbi.nlm.nih.gov /sites/entrez), we retrieved the sequences of nine S1 proteins belonging to the main groups of Gram-negative bacteria, namely those of the proteobacter a Sinorhizobium melitoti (Sm, P14129), proteobacter b Neisseria meningitidis (Nm, CAM08662), proteobacter g E. coli (Ec, ABJOO328), proteobacter d Anaeromyxobacter sp. (A, YP_002134703), proteobacter e Arcobacter butzleri (Ab, YP_001490936), aquificae Aquifex aeolicus (Aa, AAC07419), chlamydiaec Chlamydophila pneumonia (Cp, Q9Z8M3), bacteroides Bacteroides fragilis (Bf, CAH06700) and that of the spirochetes Borrelia burgdorferi (Bb, NP212261). All these proteins have similar length (between 550 for the S1 of A. butzleri and 597 for the S1 of B. fragilis) and are formed of the same number of S1 domains (six). We also retrieved 17 sequences noted as ‘S1 proteins’ representative of the main Gram-positive eubacteria subdivisions as defined by Olsen and collaborators (14) that possess very different sizes (from 111 for the tenericutes Spiroplasma kunkelii to 827 for the fusobacteria Fusobacterium nucleatum) and a variable number of S1 domains (1–6). We did not found any S1 sequences for Archea and thermodesulfobacteria. The sequences of the Gram-negative S1 proteins were aligned in two steps. The b-strands were first manually aligned by using characteristic residues from the hydrophobic core of the S1 domains (see ‘Results’ section). The loop regions were then automatically aligned by clustalW2 (15). The phylogenic trees were built using the parsimony method (protpars algorithm of the PHYLYP 3.68 package). The contribution of the different S1 regions to the functional specialization of the domains was estimated by comparing the evolutionary trees obtained from the complete sequences and from partially masked sequences, the masks resulting of the use of a weighting file containing 0 (masked) and 1 (unmasked) coefficients applied to the desired amino acids. The hmm profile library of the domains of the Gram-negative S1 proteins was built by using the hmmbuild algorithm of the HHMER suite (16). The agreement between all other S1 domain sequences and the hmm profiles of the library was then tested by using the hmmpfam algorithm of the same suite. RESULTS Escherichia coli domain 4 and 6 structure determination We used the INCA software (12) that performs simultaneous NOE cross peak interpretation and structure determination. The success of such an automatic procedure critically relies on the availability of an essentially complete list of atom resonance frequencies for each used NOESY spectrum (17). We obtained 94% of the proton frequencies in domain 4 spectra and more than 95% in domain 6 spectra. In the case of domain 4, we missed all Met–CeH3, the side chain protons of Lys314 and Lys347, the aromatic proton of His305, His317, His361
and Trp357, and the Trp311-He1,Hz3 in all spectra. We also missed Asn315-NgH2 and Gln348-NdH2 in 15 N-NOESY-HSQC. In the case of domain 6, we missed all Met-CeH3 Glu527-Hb and Phe505-Hz protons in all NOESY. We also missed Lys449–He and Lys450-Hg,d in 15 N-NOESY-HSQC and 2D–NOESY but assigned them in 13C–NOESY-HSQC. The structures were calculated by using a set of already assigned NOEs characteristic of the secondary structure elements, the (f,c) dihedral angles, and the lists of nonassigned correlations manually peaked in the NOESY spectra (1685 for domain 4, 1326 for domain 6) (Table 1). These correlations were converted during the structure calculation process in 1176 (domain 4) and 921 (domain 6) nontrivial, non redundant constraints among which 880 (domain 4) and 740 (domain 6) were unambiguously assigned. Some of these constraints completed the net of backbone-backbone proximities defining the protein secondary structures. Twelve structures are represented for each domain in the Figure 1. None of these structures presents a distance violation larger than 0.5 A˚ and a dihedral angle violation larger than 108. Their covalent geometry is nearly perfect and 96% (domain 4) and 97% (domain 6) of the residues of the structured regions are in the most and additionally allowed regions of the Ramachandran diagram (Table 1). The structure of domain 4 (rmsd of 0.7 A˚, calculated on the backbone atoms of the structured regions) is better defined than that of the domain 6 (0.8 A˚), due to a greater number of constraints. As awaited, both structures consist of the five-stranded b-barrel characteristic of the S1 domain structures (9). Their geometry is very similar; the rmsd between the Ca of the b-barrel and of the short loops connecting the strands B1 and B2, B2 and B3, and B4 and B5 is about 1.2 A˚. The long loop between the strands B3 and B4 is mainly disorganized, but presents a propency to form a helix turn at each of its extremities. The b-barrels are stabilized through a set of similar hydrophobic interactions. In both barrels, three residues are involved in the case of the strands B1 (L/V-x-G-x-V), B2 (C/A-x-V-x-I/L), B3 (V-x-G-x-V/L) and B5 (I-x-L-x-L/V). Four hydrophobic residues and, more surprisingly, an aspartate are found in the case of the strand B4 (D-x-V-x-V/A-x-V/F-xx-I/V). Similarly, a set of conserved glycines is found at or near the extremities of the strands B1, B2, B3 and B4, which does not participate to the packing of the barrels, but seems important for the connections between the strands. The main difference between the two domains resides in the space between the parallel B3 and B5 strands (slightly wider in domain 6 than in domain 4) and the presence of a two-turns a-helix at the N-terminus extremity of the domain 6. However, the pertinence of these observations is difficult to assess. The absence of the adjacent domains, in particular, is likely to influence the structures of the extremities. Domain 6 interactions with poly(U) and poly(A) RNAs We recently characterized the interactions of an S1 fragment composed of the domains 3, 4 and 5 with three
Nucleic Acids Research, 2009, Vol. 37, No. 16 5581
Table 1. Structural statistics of the S1 domains 4 and 6 Domain 4 (12 structures) Experimental restraints A priori assigned restraints Sequential (HN-HN, Ha-HN) Non-sequential (HN-HN, Ha-Ha, HN-Ha) Hydrogen bonds (two restraints by bond) A posteriori assigned restraints Peak number Assigned peaks Constraint number Non-ambiguous restraints Intraresidual Sequential Non-sequential Ambiguous constraints Phi-Psi dihedral angle constraints
Domain 6 (12 structures)
144 51 57 36
96 38 40 18
1685 1606 1176 889 432 223 234 287 62
1326 1244 921 740 411 130 199 181 120
0 0
0 0
Restraints violations NOE violations > 0.5 A˚ Dihedral angle violations > 108 Structural coordinates rmsd Bond rmsd (maximal deviation) Angle rmsd (maximal deviation) Improper rmsd (maximal deviation)
0.016 A˚ (