Closed loops: persistence of the protein chain returns

Protein Engineering vol.15 no.12 pp.955–957, 2002


Igor N.Berezovsky1,2, Valery M.Kirzhner3, Alla Kirzhner3, Vladimir R.Rosenfeld3 and Edward N.Trifonov1,3

1Department of Structural Biology, The Weizmann Institute of Science, P.O.B. 26, Rehovot 76100 and 3Genome Diversity Center, Institute of Evolution, University of Haifa, Haifa 31905, Israel

2To whom correspondence should be addressed. E-mail: [email protected]

It has recently been discovered that globular proteins are universally built from standard loop-n-lock units of about 30 amino acid residues. It has been hypothesized that protein evolution passed through a loop stage in which these units were autonomous, and that they later joined together to make longer chains. One would expect that the early individual loop-n-lock elements might still be detected in modern protein sequences as remnants of the hypothetical 30-residue sequence prototypes. Among several strong sequence motifs extracted from the protein sequences of 23 complete bacterial proteomes, one 32-residue prototype was studied here in detail. Numerous sequence segments related to the prototype are identified in the crystal structures of proteins of the PDB_SELECT database. Analysis of the respective chain trajectories for cases with different degrees of sequence conservation confirms that the majority of the segments correspond to closed loops. In the evolutionary diversification of the prototypes the secondary structure yields first, while the sequence is still moderately conserved. The last feature to go is the chain return property. Apparently, the opening of the loops would severely destabilize the protein fold, which explains their conservation.

Keywords: closed loops/protein chain return persistence/protein evolution

© Oxford University Press

Introduction

Analysis of protein crystal structures shows that a protein chain trajectory makes many returns to itself, thus forming closed loops (Berezovsky et al., 2000; Berezovsky and Trifonov, 2001a,b). Loops of nearly standard size (25–30 residues) dominate, and the proteins actually consist of assemblies of these standard elements. Their size is confirmed by an independent study of the protein sequences of 23 complete bacterial genomes, which indicates that clusters of hydrophobic residues preferentially follow one another at that distance (Berezovsky et al., 2001; Trifonov et al., 2001). Analysis of the distributions of hydrophobic residues in crystallized proteins also indicates this characteristic distance between hydrophobic residues (Lamarine et al., 2001). These residues apparently serve as van der Waals locks closing the loops (Berezovsky and Trifonov, 2001a). The linear arrangement of the loop-n-lock elements along the sequences (Berezovsky et al., 2000; Berezovsky and Trifonov, 2001b) has important evolutionary implications. It was suggested that the individual closed loops may be descendants of certain structural and sequence prototypes (Trifonov et al., 2001). Modern versions of the loops may have sequences that have diverged to various degrees from their hypothetical prototype. Similarly, one would expect various degrees of conservation of their secondary structures. To verify this general hypothesis, we conducted an extensive search for the most frequent sequence motifs of size 25–35 residues in bacterial proteomes. Several of the motifs found (Trifonov and Berezovsky, 2002) perfectly match closed loop elements in protein crystal structures (PDB database), in accordance with the above hypothesis. One such 32-residue long sequence motif was studied here in detail. Many elements of even marginal identity to this sequence prototype are found to conserve their closed loop property, while the secondary structure varies. The conservation of the loop return property is, presumably, important for the integrity of the overall protein fold.

Materials and methods

The protein sequences of the following complete prokaryotic genomes were used for the calculations: Archaea, A.pernix, A.fulgidus, M.thermoautotrophicum and P.abyssi; and Eubacteria, A.aeolicus, B.burgdorferi, C.jejuni, C.pneumoniae, C.trachomatis, D.radiodurans, E.coli, H.influenzae, H.pylori, M.tuberculosis, M.pneumoniae, N.meningitidis, R.prowazekii, Synechocystis, T.maritima, T.pallidum, U.urealyticum, V.cholerae and X.fastidiosa. The sequences were provided by the National Center for Biotechnology Information, via Entrez Browser. The sequences were used without any filtering.

The search for the most frequent motifs of size 30 residues consisted of the following steps. (i) Every 30 amino acid long segment from an exhaustive collection of about one million different segments taken from the proteome of E.coli is matched to all protein sequences of the 23 complete bacterial proteomes.
The number of matching fragments is counted. In this first step the threshold is set to 11 matching residues, to ensure a sufficiently large number of well matching (>37% match) fragments. (ii) The collected matching fragments (of the order of several hundred for a successful sequence motif) are used to derive an initial matrix of the distribution of the 20 amino acid residues over the 30 positions. (iii) The initial matrix is used for the next round of collecting matching fragments. In this case the comparison is made between the matrix and the tested sequence, rather than sequence-to-sequence as in the initial stage. The similarity of each 30-residue long tested sequence from the bacterial proteomes is calculated as an average of the matching normalized matrix elements. If this average exceeds a certain minimal value, the threshold of similarity, the sequence fragment is included in the family used to calculate the next round matrix. Inclusion of sequence fragments of lower similarity would cause instability of the matrix, that is, its divergence in the iteration process. The range of threshold values corresponding to stable solutions may vary (between 0.4 and 0.8) depending on the amino acid compositions of the initial 30-residue sequence and of the whole sequence ensemble. It is noteworthy that convergence of the iterated matrices to a stable solution is the sole criterion for considering the respective pattern as being present in the sequences, irrespective of the numbers of matching segments in natural and random (shuffled) sequences. This also means that the choice of the initial match threshold and of the subsequent matrix thresholds is empirical. (iv) From the total of about one million different tested segments, those are chosen which show the highest final scores after several rounds (typically 5–10) of convergent iteration of the matrices. (v) The sequence dimension of the resulting matrix is adjusted by inspecting the frequencies of the amino acid residues below and beyond the initial 30-residue range. For example, the matrix for the sequence prototype analyzed in this work has a sequence dimension of 32 residues. In the histogram shown in Figure 1 it corresponds to the dark gray area with high frequency values at the edges. If the dimension is taken beyond the 32 residues, the respective frequency values drop to the background level. On the other hand, the choice of a shorter range results in high values beyond it, indicating that the choice is not optimal. The size of the family of the 'descendants' of this prototype is 978 fragments. The similarity threshold is set to 0.4.

Fig. 1. Histogram of the most frequent ('consensus') elements of the prototype matrix.

Results and discussion

Table I presents a converged matrix (available in electronic form upon request) with the 'consensus sequence' LSGGQRQRVAIARALALEPKLLLLDEPTSALD. Columns of the matrix correspond to the amino acids, the rows to their positions in the sequence and the highest frequency elements (bold face) to the consensus sequence.
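The search steps (i)–(iii) of Materials and methods can be illustrated with a minimal Python sketch. It is not the authors' program: the function names are hypothetical, the normalization of matrix elements (division by the largest frequency at each position, so that a consensus residue scores 1) is an assumption because the text does not specify it, and convergence is tested simply as "the matrix no longer changes". A brute-force scan like this one would be far too slow for about one million seed segments.

```python
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

def identity_match(a, b):
    """Number of positions at which two equal-length segments carry the same residue."""
    return sum(x == y for x, y in zip(a, b))

def windows(sequence, size):
    """All overlapping segments of the given size (empty if the sequence is shorter)."""
    return [sequence[i:i + size] for i in range(len(sequence) - size + 1)]

def collect_by_identity(seed, proteins, threshold=11):
    """Step (i): segments matching the seed in at least `threshold` positions."""
    return [w for p in proteins for w in windows(p, len(seed))
            if identity_match(seed, w) >= threshold]

def frequency_matrix(fragments):
    """Step (ii): one row per position, giving the frequency of each residue."""
    matrix = []
    for pos in range(len(fragments[0])):
        counts = Counter(f[pos] for f in fragments)
        matrix.append({aa: counts.get(aa, 0) / len(fragments) for aa in AMINO_ACIDS})
    return matrix

def similarity(matrix, segment):
    """Step (iii): average of matrix elements matched by the segment.
    ASSUMPTION: each element is normalized by the row maximum."""
    total = 0.0
    for row, aa in zip(matrix, segment):
        top = max(row.values())
        total += row.get(aa, 0.0) / top if top > 0 else 0.0
    return total / len(matrix)

def refine(matrix, proteins, threshold=0.4, max_rounds=10):
    """Iterate matrix -> family -> matrix until the matrix stops changing."""
    for _ in range(max_rounds):
        family = [w for p in proteins for w in windows(p, len(matrix))
                  if similarity(matrix, w) >= threshold]
        if not family:
            break  # nothing passes the threshold; keep the last matrix
        new_matrix = frequency_matrix(family)
        if new_matrix == matrix:
            break  # converged
        matrix = new_matrix
    return matrix
```

A run would start from `frequency_matrix(collect_by_identity(seed, proteins))` and pass the result to `refine`; whether the iteration settles depends on the similarity threshold, as the text notes.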
The matrix and the sequence provide a generalized description of the sequences of this consensus family identified in about 42 000 proteins of the 23 bacterial proteomes. The actual number of related sequences is much higher, since a high discriminating threshold of 11 matching residues was used for the first run (for comparison, two random sequences of this size would typically have a match of only two to five residues). Figure 1 displays the frequencies of the 'consensus' residues beyond the 32 amino acid prototype sequence. Note the sharp limits of the distribution. The same limits are indicated by positional cross-correlation analysis of the proteomic sequences. In particular, the occurrence of the border two-letter elements LS and LD at a distance of 31 residues from one another is one of the highest in the cross-correlation data for the range of distances up to 50 residues (data not shown).

Figure 2 displays the structures from the PDB_SELECT database of crystallized proteins (Hobohm and Sander, 1994) corresponding to the sequences matching the prototype. A total of 32 such sequence segments are located in the database, with a match of 9–26 residues. For respective random sequences the typical match is 2–5 residues. (If sequences of 30 residues of uniform amino acid composition are compared, the expected match is 30/20 = 1.5 residues; the higher match of 2–5 residues is due to non-uniformity of the composition.) Figure 2 includes the four cases with the highest matches observed [from 11 to 26 residues; Figure 2(A)–(D)], and also a representative set of structures with lower sequence match. Among the segments with a match of nine residues [Figure 2(H)–(L) and data not shown], 12 display a closed loop structure and 10 segments have a non-loop appearance [as in Figure 2(K) and (L)].

Fig. 2. Chain trajectories of the matching segments extracted from the PDB_SELECT database (Hobohm and Sander, 1994). A, protein 1b0u, chain A (match 26); B, 1f2t, chain B (match 15); C, 1cs1, chain A (match 12); D, 1qhf, chain A (match 11); E, 1qap, chain A (match 10); F, 1di1, chain A (match 10); G, 1ds1, chain A (match 10); H, 1nf1, chain A (match 9); I, 8ohm, chain A (match 9); J, 1d4c, chain C (match 9); K, 6gsv, chain A (match 9); L, 1guq, chain A (match 9).

The four higher sequence match structures (11–26 residues) are all of the type α–turn–β. With a match of 9–10 residues (28 cases in total), this structural motif still dominates, appearing nine times. The other cases are either loops of different structures [e.g. Figure 2(F), (H) and (I)] or non-loop sections (seven and 12 times, respectively). Thus, of the 32 structures, the four highest sequence match cases correspond to standard α–turn–β elements; nine more cases have lower sequence match, but still retain the α–turn–β structure; seven structures of lower sequence match have lost the α–turn–β motif, while retaining their loop property; finally, 12 elements of lower sequence match have lost both the typical structural motif and the chain return property.
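The expected match between random sequences quoted above (30/20 = 1.5 residues for uniform composition) follows from a standard probability argument: two independent residues drawn from the same composition coincide with probability equal to the sum of the squared residue frequencies, so the expected number of matching positions is the segment length times that sum. The Python sketch below illustrates this; the function name and the skewed composition are hypothetical and are not taken from the paper.

```python
def expected_match(length, composition):
    """Expected number of identical positions between two random sequences of
    the given length, both drawn independently from the same composition.
    `composition` maps residue -> relative abundance (need not sum to 1)."""
    total = sum(composition.values())
    return length * sum((c / total) ** 2 for c in composition.values())

# Uniform composition over the 20 amino acids: the 30/20 case from the text.
uniform = {aa: 1.0 for aa in "ACDEFGHIKLMNPQRSTVWY"}

# HYPOTHETICAL skewed composition (Leu and Ala over-represented), for illustration only.
skewed = dict(uniform, L=3.0, A=2.0)

print(round(expected_match(30, uniform), 3))
print(round(expected_match(30, skewed), 3))
```

Any departure from uniform composition raises the sum of squared frequencies above 1/20, which is consistent with the remark that the observed random match exceeds 1.5 because the composition is non-uniform.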


Table I. Distribution of amino acid residues in the matrix description of the sequence prototype (rows, positions in the sequence; columns, amino acids)

Pos A     C     D     E     F     G     H     I     K     L     M     N     P     Q     R     S     T     V     W     Y
 1  0.011 0.001 0.003 0.003 0.086 0.011 0.001 0.027 0.002 0.754 0.018 0.004 0.004 0.003 0.006 0.008 0.005 0.019 0.002 0.033
 2  0.018 0.000 0.013 0.002 0.006 0.021 0.000 0.007 0.002 0.009 0.001 0.004 0.014 0.002 0.006 0.863 0.011 0.016 0.000 0.001
 3  0.024 0.000 0.015 0.014 0.011 0.725 0.007 0.012 0.025 0.031 0.008 0.002 0.011 0.013 0.008 0.026 0.021 0.023 0.001 0.017
 4  0.024 0.002 0.008 0.008 0.005 0.886 0.002 0.004 0.001 0.010 0.000 0.001 0.004 0.004 0.002 0.021 0.005 0.006 0.002 0.004
 5  0.016 0.002 0.017 0.278 0.010 0.021 0.003 0.011 0.005 0.027 0.123 0.041 0.005 0.377 0.004 0.008 0.012 0.020 0.000 0.007
 6  0.078 0.001 0.014 0.011 0.010 0.007 0.004 0.017 0.219 0.070 0.013 0.005 0.007 0.189 0.291 0.016 0.013 0.012 0.011 0.008
 7  0.046 0.009 0.006 0.017 0.013 0.019 0.006 0.013 0.064 0.013 0.011 0.016 0.011 0.614 0.090 0.015 0.020 0.018 0.005 0.003
 8  0.032 0.000 0.004 0.008 0.006 0.017 0.004 0.009 0.093 0.053 0.034 0.001 0.005 0.008 0.693 0.016 0.003 0.009 0.000 0.003
 9  0.063 0.000 0.006 0.014 0.009 0.012 0.002 0.134 0.005 0.241 0.016 0.009 0.003 0.009 0.005 0.013 0.022 0.423 0.001 0.001
10  0.395 0.006 0.016 0.078 0.016 0.078 0.008 0.035 0.036 0.070 0.038 0.014 0.008 0.015 0.026 0.061 0.020 0.046 0.000 0.007
11  0.029 0.025 0.008 0.003 0.031 0.008 0.003 0.439 0.005 0.356 0.027 0.004 0.004 0.003 0.009 0.008 0.007 0.046 0.008 0.003
12  0.770 0.000 0.005 0.005 0.004 0.057 0.000 0.020 0.001 0.039 0.004 0.004 0.005 0.003 0.008 0.020 0.005 0.025 0.002 0.004
13  0.053 0.015 0.011 0.010 0.011 0.023 0.003 0.036 0.073 0.058 0.047 0.004 0.003 0.044 0.552 0.040 0.019 0.016 0.002 0.007
14  0.566 0.016 0.004 0.038 0.006 0.023 0.004 0.031 0.007 0.066 0.019 0.007 0.004 0.025 0.018 0.056 0.033 0.060 0.001 0.006
15  0.024 0.013 0.004 0.009 0.037 0.002 0.004 0.130 0.007 0.643 0.036 0.004 0.003 0.005 0.007 0.009 0.017 0.045 0.010 0.001
16  0.264 0.002 0.008 0.012 0.024 0.029 0.007 0.096 0.014 0.193 0.061 0.012 0.004 0.014 0.018 0.073 0.012 0.131 0.009 0.016
17  0.070 0.014 0.010 0.027 0.013 0.054 0.049 0.033 0.077 0.137 0.042 0.065 0.021 0.061 0.093 0.078 0.061 0.058 0.005 0.017
18  0.047 0.032 0.137 0.162 0.005 0.051 0.021 0.011 0.120 0.031 0.006 0.141 0.021 0.067 0.113 0.027 0.012 0.012 0.002 0.006
19  0.109 0.004 0.015 0.007 0.006 0.026 0.009 0.016 0.014 0.035 0.006 0.016 0.616 0.005 0.014 0.040 0.025 0.027 0.004 0.003
20  0.037 0.005 0.138 0.161 0.003 0.028 0.010 0.015 0.185 0.023 0.008 0.053 0.067 0.057 0.085 0.045 0.044 0.026 0.004 0.011
21  0.027 0.002 0.004 0.006 0.045 0.050 0.001 0.193 0.007 0.322 0.026 0.005 0.005 0.006 0.010 0.008 0.013 0.245 0.000 0.018
22  0.031 0.006 0.003 0.009 0.021 0.011 0.003 0.146 0.018 0.533 0.027 0.001 0.007 0.007 0.021 0.013 0.008 0.098 0.001 0.029
23  0.017 0.002 0.005 0.004 0.055 0.021 0.003 0.248 0.002 0.442 0.022 0.006 0.009 0.009 0.009 0.007 0.031 0.103 0.009 0.001
24  0.143 0.002 0.006 0.007 0.074 0.022 0.003 0.021 0.008 0.516 0.073 0.005 0.005 0.003 0.002 0.008 0.008 0.026 0.001 0.030
25  0.019 0.035 0.821 0.007 0.004 0.008 0.001 0.009 0.005 0.009 0.007 0.013 0.010 0.004 0.007 0.009 0.007 0.035 0.004 0.020
26  0.020 0.002 0.019 0.807 0.006 0.010 0.000 0.020 0.003 0.044 0.003 0.005 0.009 0.012 0.003 0.009 0.008 0.014 0.000 0.002
27  0.101 0.001 0.024 0.011 0.007 0.008 0.003 0.024 0.008 0.041 0.000 0.007 0.671 0.002 0.005 0.025 0.012 0.039 0.001 0.003
28  0.034 0.003 0.043 0.027 0.089 0.019 0.004 0.017 0.004 0.121 0.010 0.005 0.006 0.008 0.009 0.042 0.497 0.050 0.002 0.004
29  0.163 0.008 0.009 0.044 0.004 0.080 0.006 0.022 0.005 0.036 0.004 0.092 0.023 0.016 0.037 0.338 0.091 0.024 0.001 0.002
30  0.295 0.002 0.029 0.014 0.009 0.215 0.060 0.013 0.014 0.029 0.021 0.108 0.027 0.008 0.010 0.092 0.017 0.021 0.001 0.013
31  0.021 0.000 0.005 0.007 0.008 0.013 0.001 0.060 0.002 0.722 0.012 0.002 0.006 0.013 0.003 0.004 0.021 0.087 0.002 0.005
32  0.036 0.002 0.759 0.009 0.005 0.031 0.002 0.012 0.003 0.021 0.004 0.016 0.005 0.003 0.008 0.019 0.047 0.013 0.003 0.001
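As a small illustration of how the 'consensus sequence' relates to the matrix, the Python sketch below picks the highest-frequency residue at each position. The function name is hypothetical; only the first four positions of Table I are included to keep the example short, and the values are copied from the table.

```python
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # column order of Table I

def consensus(matrix):
    """Return the residue with the highest frequency at each position (row)."""
    return "".join(
        AMINO_ACIDS[max(range(len(row)), key=row.__getitem__)] for row in matrix
    )

# Positions 1-4 of Table I, one row per position, columns in AMINO_ACIDS order.
first_four_rows = [
    [0.011, 0.001, 0.003, 0.003, 0.086, 0.011, 0.001, 0.027, 0.002, 0.754,
     0.018, 0.004, 0.004, 0.003, 0.006, 0.008, 0.005, 0.019, 0.002, 0.033],
    [0.018, 0.000, 0.013, 0.002, 0.006, 0.021, 0.000, 0.007, 0.002, 0.009,
     0.001, 0.004, 0.014, 0.002, 0.006, 0.863, 0.011, 0.016, 0.000, 0.001],
    [0.024, 0.000, 0.015, 0.014, 0.011, 0.725, 0.007, 0.012, 0.025, 0.031,
     0.008, 0.002, 0.011, 0.013, 0.008, 0.026, 0.021, 0.023, 0.001, 0.017],
    [0.024, 0.002, 0.008, 0.008, 0.005, 0.886, 0.002, 0.004, 0.001, 0.010,
     0.000, 0.001, 0.004, 0.004, 0.002, 0.021, 0.005, 0.006, 0.002, 0.004],
]

print(consensus(first_four_rows))  # LSGG, the first four consensus residues
```

Applied to all 32 rows, the same rule should reproduce the consensus sequence given in the text, although positions such as 17 and 18 are only weakly dominated by a single residue.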

As the data above demonstrate, the prototype α–turn–β structure survives even after 72% of the presumed original prototype sequence has been lost (a match of nine residues out of 32). Furthermore, the chain return property is conserved even in those low-match structures where the prototype secondary structure is lost. Apparently, the return property is maintained owing to marginal sequence conservation, to the supporting influence of the remaining parts of the fold, or to both. Unfolding of the closed loop would cause severe changes in the path of the protein chain, while sequence variations and changes in the details of secondary structure would have only local influence and, in general, be of less importance for the overall protein fold. This work introduces an important dimension in protein evolutionary studies. Highly diverged protein segments with barely recognizable sequence similarity and with no structural resemblance may still be related, as long as the segments appear as closed loops in the folds. For possible generalization of the above observations, other sequence prototypes have to be analyzed in a similar way (work in progress). Many of the closed loops observed in the crystallized proteins resemble the structural types described in the early study by Levitt and Chothia (1976). It remains to be seen what the complete spectrum of closed loop structures would be.

References

Berezovsky,I.N. and Trifonov,E.N. (2001a) J. Mol. Biol., 307, 1419–1426.
Berezovsky,I.N. and Trifonov,E.N. (2001b) Protein Eng., 14, 403–407.
Berezovsky,I.N., Grosberg,A.Y. and Trifonov,E.N. (2000) FEBS Lett., 466, 283–286.
Berezovsky,I.N., Kirzhner,A., Kirzhner,V.M. and Trifonov,E.N. (2001) Proteins, 45, 346–350.
Hobohm,U. and Sander,C. (1994) Protein Sci., 3, 522–524.
Lamarine,M., Mornon,J.-P., Berezovsky,I.N. and Chomilier,J. (2001) Cell. Mol. Life Sci., 58, 492–498.
Levitt,M. and Chothia,C. (1976) Nature, 261, 552–558.
Trifonov,E.N. and Berezovsky,I.N. (2002) Mol. Biol., 36, 239–243.
Trifonov,E.N., Kirzhner,A., Kirzhner,V.M. and Berezovsky,I.N. (2001) J. Mol. Evol., 53, 394–401.

Received May 28, 2002; revised September 6, 2002; accepted October 1, 2002

Acknowledgements

We are grateful to Mrs A.Weinberg for editing the text. I.N.B. is a Post-Doctoral Fellow of the Feinberg Graduate School, Weizmann Institute of Science. V.M.K. is supported by the Ministry of Absorption.
