Protein Engineering vol.10 no.4 pp.353–359, 1997
Prediction of protein side-chain conformations by principal component analysis for fixed main-chain atoms
Koji Ogata and Hideaki Umeyama Kitasato University, School of Pharmaceutical Sciences, 5-9-1, Shirokane, Minato-ku, Tokyo 108, Japan
A method of side-chain prediction that does not require calculating a potential function is introduced. It is based on the assumption that similar side-chain conformations have similar structural environments around the side chains. The environmental information is represented by vectors obtained from principal component analysis, which describe the variance of the positions of main-chain atoms around the side chain. This information was added to a side-chain library (rotamer library) made from X-ray structures. Side-chain conformations were constructed using this side-chain library without using potential functions. An optimal solution was determined by comparing the environmental information of the backbone conformation around the side chain to be predicted with that of the native side chains in the library. The method was applied to 15 proteins whose structures were known. The root-mean-square deviation between the predicted and X-ray side-chain conformations was ~1.5 Å (~1.1 Å for core residues) and the percentage of χ1 angles predicted correctly within 40° was ~65% (75% for the core). The computational time was short (~60 s for the prediction of proteins with 200 amino acid residues). About 70% of the side-chain conformations were constructed from the location of the main-chain atoms around the central Cβ atom, and their average r.m.s.d. was ~1.4 Å (~1.0 Å for core residues).
Keywords: database searching/homology modeling/prediction of side-chain conformations/principal component analysis/side-chain library
© Oxford University Press
Introduction
The prediction of side-chain conformations is important in the modeling of unknown protein structures (Blundell et al., 1987; Holm and Sander, 1991). It has usually been treated as an optimization problem, and such methods have been reported to yield side-chain conformations similar to native ones (Reid and Thornton, 1989; Summers and Karplus, 1989; Lee and Subbiah, 1991; Tuffery et al., 1991; Holm and Sander, 1992; Wilson et al., 1993; Koehl and Delarue, 1994). These studies evaluated the optimal solution by calculating a potential energy function using molecular mechanics force fields such as AMBER (Weiner et al., 1986). However, because the optimal solution had to be chosen from many conformations, the potential function was calculated repeatedly during the prediction, which was time-consuming. The number of combinations of side-chain conformations is large and it was difficult to find the structure at the global minimum of the potential function. Therefore, the larger the protein, the more difficult it is to obtain the optimal
solution. To address this problem, some methods used the rotamer library of Ponder and Richards (1987), which collected representative side-chain conformations by analyzing the side-chain torsion angles of various proteins in the Brookhaven Protein Data Bank (Bernstein et al., 1977). However, even with this library, the number of combinations of side-chain conformations is still huge. Recently, the dead-end elimination theorem (Desmet et al., 1992; Lasters and Desmet, 1993; Leach, 1994; Tanimura et al., 1994), which detects side-chain conformations that are incompatible with the global minimum, was reported to yield the global minimum energy conformation. This method can predict side-chain conformations relatively rapidly, because the number of combinations searched is smaller than the total number of combinations. The potential function, especially the van der Waals term, is strongly influenced by small changes in side-chain conformation. Side-chain conformations in the protein interior are determined by the tight packing of atoms and even a small overlap of atoms makes the potential value very large. This means that the Ponder and Richards rotamer library may not contain enough side-chain conformations to obtain the best solution for various proteins. To avoid this difficulty, the prediction should use a large rotamer library that includes a large number of side-chain conformations, and the evaluation of conformations should be made faster than in currently used methods. As one approach, Laughton (1994) performed the prediction using a large library including environmental information taken from X-ray structures. This method limited the generation of side-chain conformations by a local three-dimensional homology defined for each side-chain conformation. After restricting the side-chain conformations, an optimal solution was evaluated with a Monte Carlo procedure.
However, this method calculated the potential function for a large number of conformations and its running time was therefore long. In this paper, we introduce a fast and accurate method for predicting side-chain conformations. Our method is based on the supposition that similar side-chain conformations have a similar structural environment. Hereafter, the word ‘environment’ is used in place of ‘structural environment’. This supposition is supported by the results of the GAP model method of Eisenmenger et al. (1993). Results from side-chain prediction with the backbone-dependent rotamer library of Dunbrack and Karplus (1993), in which the side-chain conformation depends on the main-chain torsion angles (i.e. φ and ψ), also support our supposition. This means that side-chain conformations are influenced by the main-chain atoms around the side chains. In this work, the environment around the side chain was represented by a set of three eigenvectors, which were obtained by diagonalizing a variance–covariance matrix made from the positions of the main-chain and Cβ atoms. The more similar two sets of eigenvectors are (i.e. the more similar the environments around the two side chains), the more similar the side-chain conformations are expected to be.
Fig. 1. Definition of the local coordinate frame. The y-axis was defined to be perpendicular to both the x-axis and the z-axis, so that the three axes form a right-handed system.
A side-chain conformation whose eigenvectors were similar to those of the side chain to be predicted was picked from a side-chain library made from X-ray structures. We predicted the side-chain conformations of known proteins, treating them as unknown.
Materials and methods
The prediction of side-chain conformations was performed on a fixed main chain, which included the N, Cα, C, O and Cβ atoms.
Local coordinate frame
To express quantitatively the environment around a Cβ atom, from which the side chain is constructed, we defined a local coordinate frame for each residue. This coordinate frame was used to examine whether the environments around two side-chain conformations from different proteins were similar. First, a sphere of radius r was centered on the Cβ atom (the central Cβ atom) and the main-chain atoms inside it were collected. The origin of the local coordinate frame was put at the center of mass of the atoms in the sphere. Next, the z-axis of the frame was defined by the vector from the frame origin to the Cα atom of the considered residue, and the x-axis was defined parallel to the perpendicular from the z-axis to the N atom of the considered residue, directed toward the N atom. The y-axis was defined from the x- and z-axes as shown in Figure 1.
Side-chain library
We generated a side-chain library, which included environmental information around the central Cβ atoms, by the following procedure. The environmental information was defined as the eigenvectors obtained from a variance–covariance matrix (Kendall, 1975) generated from the coordinates of the main-chain and Cβ atoms. First, we calculated the local coordinate frame for each residue of the known proteins. The number (N) of main-chain atoms in the sphere of radius r around the central Cβ atom was also counted for each residue. Note that the main-chain atoms include Cβ atoms. The variance–covariance matrix was given by
$$\Sigma = \frac{1}{N} \sum_{i=1}^{N} \begin{bmatrix} x_i x_i & x_i y_i & x_i z_i \\ y_i x_i & y_i y_i & y_i z_i \\ z_i x_i & z_i y_i & z_i z_i \end{bmatrix} \qquad (1)$$
where x_i, y_i and z_i are, respectively, the x, y and z components of the ith atom (1 ≤ i ≤ N) in the local coordinate frame. Here, we did not distinguish the atomic types in the sphere around the Cβ atom. We then diagonalized this matrix and obtained three eigenvalues and three eigenvectors. The eigenvalues were denoted by λ1, λ2 and λ3 (λ1 ≤ λ2 ≤ λ3) and the set of three eigenvectors corresponding to these eigenvalues was defined by u = {u1, u2, u3}. The eigenvalues are the variances of the atomic positions in the sphere and the eigenvectors give the directions of these variances. Thus, the eigenvalues and eigenvectors represent the degree of packing of main-chain atoms in the sphere. If the number of atoms and the variance in the sphere are similar for two residues, the side-chain conformations of the two residues are supposed to be similar to each other. Therefore, the side-chain library generated from known proteins consisted of the numbers of atoms in the spheres (i.e. N) and the sets of eigenvectors (i.e. u); for each amino acid type, N and u, calculated for the various main-chain conformations in the known proteins, were stored in the library.
Construction of side-chain conformations
The construction of unknown side-chain conformations was performed on fixed main-chain atoms by using the side-chain library including the environmental information (i.e. N and u) around the central Cβ atom. Consider a certain amino acid residue to be predicted. Suppose that M side-chain conformations with environmental information are registered for this amino acid type in the side-chain library. N for the kth side-chain conformation (1 ≤ k ≤ M) in the library was denoted by N_k. Similarly, its set of three eigenvectors was denoted by u_k = (u_k1, u_k2, u_k3). The construction procedure of the side-chain conformation was as follows. First, for the side chain to be predicted, the local coordinate frame was calculated around the central Cβ atom. Next, N in the sphere was counted.
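The environmental descriptor defined so far (sphere, local coordinate frame, Equation 1 and its eigenvectors) can be sketched as follows. This is only an illustration under stated assumptions, not the authors' program: the function names and the example coordinates are hypothetical, the center of mass is taken as the unweighted centroid (atomic types are not distinguished in the text), and a small Jacobi-rotation routine stands in for whatever diagonalization routine was actually used.

```python
import math

def _sub(a, b):
    return [a[i] - b[i] for i in range(3)]

def _dot(a, b):
    return sum(a[i] * b[i] for i in range(3))

def _unit(a):
    norm = math.sqrt(_dot(a, a))
    return [c / norm for c in a]

def _cross(a, b):
    return [a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0]]

def covariance_in_local_frame(cb, ca, n_atom, atoms, r):
    """Return (N, Sigma): atom count in the sphere and the matrix of Equation 1."""
    inside = [p for p in atoms if math.dist(p, cb) <= r]
    count = len(inside)
    # Assumption: the unweighted centroid stands in for the center of mass.
    origin = [sum(p[i] for p in inside) / count for i in range(3)]
    z = _unit(_sub(ca, origin))                       # origin -> C-alpha
    to_n = _sub(n_atom, origin)
    # x: part of origin -> N that is perpendicular to z, pointing toward N
    x = _unit([to_n[i] - _dot(to_n, z) * z[i] for i in range(3)])
    y = _cross(z, x)                                  # completes a right-handed frame
    local = [[_dot(_sub(p, origin), axis) for axis in (x, y, z)] for p in inside]
    sigma = [[sum(q[i] * q[j] for q in local) / count for j in range(3)]
             for i in range(3)]
    return count, sigma

def eigen_symmetric_3x3(matrix, sweeps=50):
    """Jacobi rotations; returns (eigenvalues ascending, matching eigenvectors)."""
    a = [row[:] for row in matrix]
    v = [[float(i == j) for j in range(3)] for i in range(3)]
    for _ in range(sweeps):
        if sum(a[p][q] ** 2 for p in range(3) for q in range(3) if p != q) < 1e-24:
            break
        for p, q in ((0, 1), (0, 2), (1, 2)):
            if a[p][q] == 0.0:
                continue
            theta = 0.5 * math.atan2(2.0 * a[p][q], a[q][q] - a[p][p])
            if theta > math.pi / 4:                   # keep |theta| <= pi/4
                theta -= math.pi / 2
            elif theta < -math.pi / 4:
                theta += math.pi / 2
            c, s = math.cos(theta), math.sin(theta)
            for m in (a, v):                          # right-multiply by the rotation
                for k in range(3):
                    mkp, mkq = m[k][p], m[k][q]
                    m[k][p], m[k][q] = c * mkp - s * mkq, s * mkp + c * mkq
            for k in range(3):                        # left-multiply a by its transpose
                apk, aqk = a[p][k], a[q][k]
                a[p][k], a[q][k] = c * apk - s * aqk, s * apk + c * aqk
    order = sorted(range(3), key=lambda i: a[i][i])
    values = [a[i][i] for i in order]
    vectors = [[v[k][i] for k in range(3)] for i in order]
    return values, vectors

# Hypothetical coordinates (in Å), for illustration only.
n_atom, ca, c_atom, cb = [0.0, 1.4, 0.0], [0.0, 0.0, 0.0], [1.5, -0.5, 0.0], [-0.5, -0.6, 1.2]
neighbours = [[2.0, 1.0, 2.0], [-3.0, 0.5, 1.0], [6.0, 6.0, 6.0]]
count, sigma = covariance_in_local_frame(cb, ca, n_atom,
                                         [n_atom, ca, c_atom, cb] + neighbours, r=4.0)
values, vectors = eigen_symmetric_3x3(sigma)
```

Because the origin is the centroid of the atoms in the sphere, the sum of coordinate products in Equation 1 is already a variance–covariance matrix and no mean needs to be subtracted. The eigenvalues are returned in ascending order, matching λ1 ≤ λ2 ≤ λ3.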
The similarity index, a set of three eigenvectors defined by v = (v1, v2, v3), was calculated in the sphere. Note that N and v were obtained only from the positions of main-chain atoms. Here, the difference between N of this side chain and N_k of the kth side chain in the library was given by

$$\Delta_k = | N - N_k | \qquad (2)$$

and the difference between the two sets of eigenvectors was evaluated by

$$\langle v \cdot u_k \rangle = \frac{1}{3} \sum_{i=1}^{3} | v_i \cdot u_{ki} | \qquad (3)$$
where v_i·u_ki is the inner product of v_i and u_ki, so that ⟨v·u_k⟩ is the average of |v_i·u_ki| over the three eigenvectors. If Δ_k is zero and ⟨v·u_k⟩ is close to 1.0, the value at which the two sets of eigenvectors have the same orientation, the variances in the two spheres are similar. Hence, the environments around the two central Cβ atoms are similar and their side-chain conformations are likely to be similar to each other. In multivariate statistical analysis, the first and second eigenvectors are usually considered, while the third is ignored. However, we considered all three eigenvectors in this work in order to make the difference between the two sets of eigenvectors clear.
Fig. 2. The order of side-chain construction. Side-chain construction was performed starting from the residue nearest to the center of mass. In the diagram on the left, the order of construction is A → B → C → D → E. If residue D has a short contact with another residue, it is predicted first and the order is changed to D → A → B → C → E.
Next, the side chains in the library were arranged in order of increasing Δ_k among the M side-chain conformations. For the kth and lth (l ≠ k) side-chain conformations in the library, if Δ_k is the same as Δ_l, these side chains
were arranged in order of decreasing inner product (Equation 3). After this arrangement, the construction of the side-chain conformation was attempted starting from the optimal one in the library (i.e. the side-chain conformation with the smallest Δ_k and, among conformations with the same Δ_k, the largest ⟨v·u_k⟩). This optimal side-chain conformation was placed on the fixed main chain by superposing the N, Cα, C and Cβ atoms of the selected library residue on those of the residue to be predicted. When this side-chain conformation had a short contact with surrounding atoms, the next best side chain in the library was tried, and so on. Side chains were constructed starting from the residue nearest to the center of mass of the protein to be predicted. If a constructed side chain always had a short contact, the prediction was restarted so that this side chain was predicted first, as Figure 2 shows, and the other side chains were then treated in order from the residue nearest to the center of mass. If the construction cycled among side chains, the next optimal side chain was chosen from the arranged library. In the prediction, a short contact was defined as a distance shorter than 2.5 Å between a predicted side-chain atom and the other main-chain atoms, the previously generated side-chain atoms or the atoms of Pro residues, or a distance shorter than 2.2 Å between the predicted atom and the main-chain atoms of the residue being predicted. These values were chosen by inspecting X-ray structures. Our procedure avoided the calculation of potential energy functions, which would have to be repeated thousands of times. Therefore, the running time of this method was shorter.
Construction of Cys forming a disulfide bridge
The above procedure was not used for Cys pairs forming disulfide bridges, because the chemical bond between their Sγ atoms (~2.0 Å) must be taken into account.
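For reference, the general (non-Cys) selection procedure just referred to can be sketched as below: library entries are ordered by increasing Δ_k (Equation 2), ties are broken by decreasing ⟨v·u_k⟩ (Equation 3), and the first candidate without a short contact is accepted. This is a minimal sketch under stated assumptions. The `LibraryEntry` record, the function names and the example data are hypothetical, and the short-contact test assumes the caller has already removed covalently bonded atom pairs, a detail the text does not spell out.

```python
import math
from dataclasses import dataclass

@dataclass
class LibraryEntry:
    """Hypothetical record for one library side chain."""
    label: str       # stands in for the stored side-chain coordinates
    n_atoms: int     # N_k
    eigvecs: list    # u_k: three unit vectors

def mean_abs_inner_product(v, u):
    """Equation 3: average of |v_i . u_ki| over the three eigenvectors."""
    return sum(abs(sum(a * b for a, b in zip(vi, ui))) for vi, ui in zip(v, u)) / 3.0

def rank_candidates(n_query, v_query, entries):
    """Increasing Delta_k first; equal Delta_k ordered by decreasing <v.u_k>."""
    return sorted(entries,
                  key=lambda e: (abs(n_query - e.n_atoms),
                                 -mean_abs_inner_product(v_query, e.eigvecs)))

def has_short_contact(side_chain_atoms, other_atoms, own_main_chain):
    """Distance criteria from the text: 2.5 Å to other atoms, 2.2 Å to own main chain.

    Assumption: covalently bonded pairs were removed by the caller.
    """
    return (any(math.dist(s, o) < 2.5 for s in side_chain_atoms for o in other_atoms)
            or any(math.dist(s, m) < 2.2 for s in side_chain_atoms for m in own_main_chain))

def pick_first_acceptable(ranked, rejected):
    """Return the first ranked entry that the predicate does not reject."""
    for entry in ranked:
        if not rejected(entry):
            return entry
    return None

# Hypothetical three-entry library for one amino acid type.
identity = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
rotated = [[0.0, 1.0, 0.0], [-1.0, 0.0, 0.0], [0.0, 0.0, 1.0]]
library = [LibraryEntry("A", 5, identity),
           LibraryEntry("B", 4, rotated),
           LibraryEntry("C", 4, identity)]
ranked = rank_candidates(4, identity, library)
chosen = pick_first_acceptable(ranked, lambda e: e.label == "C")
```

In this toy library, entries B and C have the same Δ_k, so the inner product decides between them, and C comes first because its eigenvectors coincide with those of the query. The restart logic of Figure 2 is not included.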
We therefore distinguished Cys pairs from the other amino acid residues in the side-chain construction. The side chains of a pair of Cys residues forming a disulfide bridge were determined as follows. First, one of the Cys residues was placed from the side-chain library by the procedure described above. Next, the side-chain library was searched for the best side-chain conformation of the other Cys residue that satisfied an Sγ–Sγ distance >1.9 and <2.1 Å. If no such solution existed, because of a short contact or because the Sγ–Sγ distance criterion was not satisfied, the first Cys
residue was replaced by the next solution in the arranged side-chain library, and so on until a usable pair was found.
Predicted proteins
We predicted side-chain conformations, excluding Ala, Gly and Pro residues, for 15 known proteins with a wide variety of structures (Table I). These structures have high resolution and no missing atomic coordinates, and they have often been used in other studies (Reid and Thornton, 1989; Summers and Karplus, 1989; Lee and Subbiah, 1991; Tuffery et al., 1991; Holm and Sander, 1992; Wilson et al., 1993; Koehl and Delarue, 1994). The prediction was performed using all main-chain atoms (including Cβ atoms) and the side-chain atoms of Pro residues in the native proteins. Before the prediction, a side-chain library was made from 72 proteins with various folding patterns and high resolution (see Appendix). The resulting side-chain library contained 14 075 residues (Table II) from the 72 proteins. When a protein was predicted, it was excluded from the library. The choice of sphere radius around the central Cβ atom is important. It was varied from 2.0 to 6.0 Å in 1.0 Å increments. When the radius was 2.0 Å, the only atoms within the sphere were main-chain atoms of the residue being predicted. When the radius was 6.0 Å, the main-chain atoms belonging to the residues adjacent to the considered residue were included. In this case, our method is similar to one that uses a backbone-dependent rotamer library (Dunbrack and Karplus, 1993), but note that our method does not require calculating a potential function.
Evaluations of the prediction model
The predicted side-chain conformations were compared with the X-ray structures by the average side-chain root-mean-square deviation (r.m.s.d.) and the percentage of correctly predicted χ1 and χ1 + χ2 angles. Here, the r.m.s.d. was calculated excluding Cβ atoms, which were treated as main-chain atoms.
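Both evaluation measures can be sketched in a few lines. The sketch below assumes that torsion angles are given in degrees, that an angle counts as correct when its periodic difference from the X-ray value is at most 40° (the threshold used in this work; whether the boundary is inclusive is an assumption), and that the caller has already removed Cβ atoms and paired the remaining side-chain atoms. The function names are hypothetical.

```python
import math

def angle_difference(a, b):
    """Smallest absolute difference between two torsion angles, in degrees."""
    d = abs(a - b) % 360.0
    return min(d, 360.0 - d)

def chi_correct(predicted, reference, tolerance=40.0):
    """True if every paired angle (chi1 alone, or chi1 and chi2) is within tolerance."""
    return all(angle_difference(p, r) <= tolerance
               for p, r in zip(predicted, reference))

def side_chain_rmsd(predicted, reference):
    """R.m.s.d. over matched side-chain atoms (C-beta excluded by the caller)."""
    squared = [math.dist(p, r) ** 2 for p, r in zip(predicted, reference)]
    return math.sqrt(sum(squared) / len(squared))

# Hypothetical values, for illustration only.
wrap_example = angle_difference(170.0, -175.0)          # crosses the ±180° boundary
chi1_ok = chi_correct([-60.0], [-95.0])                 # 35° apart
chi12_ok = chi_correct([-60.0, 180.0], [-65.0, 90.0])   # chi2 is 90° apart
rmsd = side_chain_rmsd([(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)],
                       [(0.0, 0.0, 1.0), (1.0, 0.0, 0.0)])
```

Handling the wrap-around at ±180° matters because trans rotamers cluster near that boundary, so a naive subtraction would report a difference of 345° for two angles that are only 15° apart.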
The accuracy of the χ1 and χ1 + χ2 angles confirmed the similarity of the conformations; it was defined as the percentage of χ1 and χ1 + χ2 angles that are within 40° of those in the corresponding X-ray structure, the threshold used by Dunbrack and Karplus (1993). These criteria were evaluated for all residues and for core residues, in both cases excluding Ala, Gly and Pro residues. The core residues were defined as those whose solvent surface accessibility in the X-ray structure is
smaller than 25% (Koehl and Delarue, 1994). The solvent surface accessibility was calculated using the program MSAS in BIOCES/PM by NEC Co. (Akahane et al., 1989).

Table I. Test proteins

                                                      Resolution          No. of residues
Protein                                     ID code   (Å)         Length  All(a)  Core(b)
Crambin                                     1CRN      1.50          46      32      10
L7/L12 C-terminal domain                    1CTF      1.70          68      46      15
Lysozyme                                    1LZ1      1.50         130     103      50
Papain                                      1PPP      1.90         212     160      92
Avian pancreatic polypeptide                1PPT      1.37          36      29       2
Ubiquitin                                   1UBQ      1.80          76      65      23
Interleukin-1β                              2I1B      2.00         153     132      59
Leucine/isoleucine/valine-binding protein   2LIV      2.40         344     251     134
Ovomucoid third domain                      2OVO      1.50          56      45      17
Plastocyanin                                2PCY      1.80          99      77      33
Thioredoxin                                 2TRX      1.68         216     164      81
Acid proteinase (penicillopepsin)           3APP      1.80         323     247     144
β-Trypsin                                   4PTP      1.34         223     176      94
Trypsin inhibitor                           5PTI      1.00          58      42      14
Insulin                                     9INS      1.70          51      44      14

(a) Number of predicted residues excluding Ala, Gly and Pro residues.
(b) Residues of accessible surface area <25%.

Table II. The number of side-chain conformations in the library

Residue   No. of residues     Residue         No. of residues
Arg         590               Lys              1174
Asn         880               Met               297
Asp        1192               Phe               679
Cys         154               Ser              1306
Gln         609               Thr              1158
Glu         911               Trp               244
His         446               Tyr               632
Ile         896               Val              1380
Leu        1397               Cys (S–S)(a)      130
                              Total           14075

(a) Cys residue forming a disulfide bridge.

Results and discussion
Table III shows the accuracy of the side-chain predictions. At r = 2.0 Å, the average r.m.s.d. over all residues was 1.489 Å and the average r.m.s.d. for core residues was 1.022 Å. At r = 4.0 Å, the best r.m.s.d. for all residues was 0.98 Å for 1CRN, and at r = 5.0 and 6.0 Å the best r.m.s.d. for core residues among the predicted proteins, excluding 1PPT, was 0.40 Å for 9INS. The best weighted r.m.s.d. for core residues was obtained at r = 4.0 Å, although the simple average was also good at r = 2.0 Å. This means that a radius of 2.0 Å was effective for small proteins such as 9INS and 1CTF, but for large proteins it was too small to characterize the environment around the central Cβ atom. However, as the radius increased beyond 4.0 Å, the r.m.s.d. increased step by step. Since the number of residues in the proteins is distributed over a wide range, as shown in Table I, the weighted average is more appropriate for assessing the accuracy of prediction. Table IV shows the accuracy of the χ1 angle. In Table IV, the best radius for the simple average was 2.0 Å for all residues and 3.0 Å for core residues. The best weighted average was obtained at 4.0 Å for core residues and that for all residues
was 2.0 Å. The weighted average for core residues in Table III was in good accordance with that in Table IV. Table V shows the percentage of residues for which the χ1 and χ2 torsion angles are both within 40° of the experimental angles. The accuracy of the χ1 + χ2 angle followed the same trend as the r.m.s.d. and the χ1 angle. In Table V, the best radius for the weighted average was 4.0 Å for core residues. The conformations of many core residues are thought to be determined under the influence of various interactions. Therefore, the above results show that the structural environment is very important, since the predictive accuracy in terms of the weighted average was best at a radius of 4.0 Å in Tables III–V. The running time of the prediction was ~60 s for a protein of 200 residues (NEC EWS 4800/310 LC, for which the actual speed was ~10 Mflops). Among the five radii, the prediction was fastest at r = 4.0 Å (Table VI). This is because, with a radius of 4.0 Å, side-chain conformations of small- and medium-size residues could be found by considering the atomic locations near the central Cβ atom. In other words, the structural environment plays an important role in determining side-chain conformations. Also, the first side-chain construction from the side-chain library was most likely to succeed at a radius of 4.0 Å, as shown in Table VI. This fact is also closely related to the structural environment. The average r.m.s.d. (Table III) and the accuracy of the χ1 angle (Table IV) were similar to those reported for methods that required potential functions (Koehl and Delarue, 1994). Those reported values, an r.m.s.d. of ~1.6 Å for all residues and ~1.1 Å for core residues, were calculated under different conditions [i.e. the number of predicted proteins, the definition of core residues (Eisenmenger et al., 1993; Koehl and Delarue, 1994) and the atoms taken for the r.m.s.d. calculation].
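The difference between the simple and the weighted averages in Tables III–V can be checked with a short calculation. The sketch below assumes that each protein is weighted by its number of predicted residues; it uses the core residue counts of Table I and the core r.m.s.d. values at r = 4.0 Å from Table III, and the variable names are illustrative only.

```python
# Core residue counts (Table I) and core r.m.s.d. at r = 4.0 Å (Table III),
# both in the order 1CRN, 1CTF, 1LZ1, 1PPP, 1PPT, 1UBQ, 2I1B, 2LIV,
# 2OVO, 2PCY, 2TRX, 3APP, 4PTP, 5PTI, 9INS.
core_counts = [10, 15, 50, 92, 2, 23, 59, 134, 17, 33, 81, 144, 94, 14, 14]
core_rmsd = [0.99, 1.00, 0.79, 1.40, 0.59, 1.18, 1.43, 1.20,
             0.69, 1.03, 1.18, 1.08, 1.34, 1.12, 0.59]

# Simple average: every protein counts equally.
simple_average = sum(core_rmsd) / len(core_rmsd)

# Weighted average: every residue counts equally (assumed weighting).
weighted_average = (sum(n * x for n, x in zip(core_counts, core_rmsd))
                    / sum(core_counts))

print(round(simple_average, 2), round(weighted_average, 2))  # prints 1.04 1.17
```

Under this assumption the calculation gives 1.04 Å and 1.17 Å, the Average and Weighted entries of Table III for core residues at r = 4.0 Å. Small proteins such as 1PPT, with only two core residues, pull the simple average down but barely affect the weighted one.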
Table III. Accuracy of side-chain predictions (r.m.s.d.) for various radii around the central Cβ atom

             All (Å), radius (Å)               Core(a) (Å), radius (Å)
ID code      2.0   3.0   4.0   5.0   6.0       2.0   3.0   4.0   5.0   6.0
1CRN         1.00  1.00  0.98  1.42  1.26      0.78  0.78  0.99  1.07  1.07
1CTF         1.37  1.37  1.61  1.33  1.66      0.73  0.73  1.00  0.58  0.58
1LZ1         1.51  1.56  1.45  1.73  1.62      1.00  0.96  0.79  1.32  1.32
1PPP         1.85  1.76  1.84  1.78  1.77      1.61  1.55  1.40  1.35  1.35
1PPT         1.71  1.70  1.57  1.43  1.47      0.26  0.38  0.59  0.49  0.49
1UBQ         1.68  1.70  1.60  1.83  1.66      1.49  1.49  1.18  1.10  1.10
2I1B         1.73  1.78  1.86  1.88  1.93      1.12  1.17  1.43  1.26  1.26
2LIV         1.74  1.76  1.56  1.81  1.67      1.37  1.39  1.20  1.24  1.24
2OVO         1.32  1.45  1.25  1.77  1.87      0.71  0.86  0.69  0.86  0.86
2PCY         1.48  1.46  1.42  1.38  1.47      1.24  1.23  1.03  1.18  1.18
2TRX         1.50  1.49  1.56  1.68  1.61      1.07  1.08  1.18  1.40  1.40
3APP         1.26  1.25  1.26  1.42  1.46      1.19  1.17  1.08  1.30  1.30
4PTP         1.54  1.53  1.65  1.68  1.82      1.25  1.29  1.34  1.52  1.52
5PTI         1.88  1.95  1.63  1.73  1.83      1.56  1.56  1.12  1.27  1.27
9INS         1.15  1.08  1.48  1.30  1.58      0.49  0.52  0.59  0.40  0.40
Average      1.52  1.52  1.51  1.61  1.65      1.06  1.08  1.04  1.09  1.09
Weighted(b)  1.55  1.55  1.54  1.65  1.66      1.23  1.23  1.17  1.27  1.27

(a) Core residues whose accessibility is lower than 25%.
(b) Weighted average is the average of r.m.s.d. over the number of residues (Tanimura et al., 1994).

Table IV. Accuracy of the χ1 angle for various radii around the central Cβ atom

             All (%), radius (Å)               Core(a) (%), radius (Å)
ID code      2.0   3.0   4.0   5.0   6.0       2.0    3.0    4.0    5.0    6.0
1CRN         84.4  84.4  84.4  71.9  68.8       80.0   80.0   70.0   80.0   70.0
1CTF         80.4  80.4  63.0  73.9  67.4       93.3   93.3   60.0   73.3   93.3
1LZ1         74.8  74.8  75.7  67.0  66.0       84.0   84.0   92.0   70.0   70.0
1PPP         56.9  60.6  59.4  56.9  58.8       59.8   62.0   65.2   59.8   66.3
1PPT         62.1  62.1  65.5  79.3  72.4      100.0  100.0  100.0  100.0  100.0
1UBQ         56.9  55.4  63.1  58.5  53.8       56.5   56.5   65.2   82.6   73.9
2I1B         58.3  56.8  53.0  52.3  49.2       69.5   67.8   59.3   54.2   57.6
2LIV         61.0  59.4  64.9  57.8  61.4       67.9   67.2   73.9   66.4   69.4
2OVO         75.6  71.1  68.9  60.0  48.9       88.2   82.4   88.2   76.5   82.4
2PCY         62.3  61.0  61.0  64.9  61.0       72.7   69.7   78.8   69.7   75.8
2TRX         71.3  72.0  64.6  61.6  59.8       79.0   79.0   70.4   63.0   63.0
3APP         72.1  72.1  68.8  64.8  59.1       75.0   75.0   77.1   68.8   68.8
4PTP         65.9  67.0  63.6  58.0  52.8       73.4   72.3   70.2   57.4   61.7
5PTI         69.0  69.0  78.6  69.0  69.0       78.6   78.6   85.7   85.7   85.7
9INS         72.7  77.3  68.2  79.5  65.9       85.7   85.7   85.7  100.0  100.0
Average      68.2  68.2  66.9  65.0  61.0       77.6   76.9   76.1   73.8   75.9
Weighted(b)  66.4  66.5  65.2  61.7  59.1       72.8   72.4   73.1   66.1   68.5

(a) Core residues whose accessibility is lower than 25%.
(b) Weighted average is the average of χ1 over the number of residues (Tanimura et al., 1994).

The accuracy of the χ1 angle reported in other papers was ~65% for all residues and 75% for core residues. However, our method was fast, because it does not need to calculate the potential function. For example, our method took ~60 s for proteins of ~200 residues. We next considered the relation between the variance–covariance matrix and the three eigenvectors. In the variance–covariance matrix, the atomic coordinates enter in quadratic form (Equation 1) and the eigenvectors were obtained by diagonalizing this matrix. Hence, the eigenvectors were influenced by the distance between the atoms and the center of mass of the main-chain atoms in the sphere, because the atomic coordinates in Equation 1 were expressed in the local coordinate frame defined in Figure 1. The atoms near the boundary of the sphere influence the eigenvectors much more than the atoms near the center of mass. In the predictions with a radius of 5.0 or 6.0 Å, where the number of atoms in the sphere was considerably larger than for radii <5.0 Å (Table VI), not all atoms in the sphere were related to the side-chain conformations, but they were nevertheless included in the variance–covariance matrix. Therefore, unrelated atoms are regarded as noise in the prediction with r ≥ 5.0 Å. On the other hand, the prediction with r = 2.0 Å had no noise, because the main-chain atoms in the sphere belong only to the residue to be predicted. This case is similar to the
method of Dunbrack and Karplus in the sense that it takes advantage of the correlation between the side-chain conformation and the main-chain conformation of the considered residue. This case (r = 2.0 Å) had a larger number of side-chain conformations with ⟨v·u_k⟩ than the other cases (r > 2.0 Å). On the other hand, the number of atoms in the sphere at r = 3.0 Å was 0.1 (Table VI) and these atoms influenced the conformations, because the result for r = 3.0 Å was similar to that for r = 2.0 Å (Tables III and IV). However, some atoms near the boundary of the sphere at r = 3.0 Å were regarded as noise, because the average r.m.s.d. (Table III) and the accuracy of the χ1 angle (Table IV) were worse than at r = 2.0 Å. On the other hand, the weighted average for the prediction at r = 4.0 Å was better than those at r = 2.0 and 3.0 Å for core residues. The average number of atoms in the sphere at r = 4.0 Å was about three, excluding the considered residue (Table
VI). One was the C atom belonging to the residue on the N-terminal side, one was the N atom belonging to the residue on the C-terminal side and the other was a nearby Cβ atom. These C and N atoms, together with the atoms belonging to the considered residue, specified the main-chain torsion angles (i.e. the φ, ψ and ω angles). Therefore, a library with a radius of 4.0 Å can be considered as a backbone-dependent library with environmental information.

Table V. Accuracy of the χ1 + χ2 angle(a) for various radii around the central Cβ atom

             All (%), radius (Å)               Core(b) (%), radius (Å)
ID code      2.0   3.0   4.0   5.0   6.0       2.0    3.0    4.0    5.0    6.0
1CRN         68.8  68.8  68.8  50.0  50.0       80.0   80.0   70.0   80.0   70.0
1CTF         60.9  60.9  47.8  58.7  45.7       66.7   66.7   53.3   60.0   80.0
1LZ1         58.3  58.3  56.3  47.6  46.6       68.0   72.0   74.0   58.0   60.0
1PPP         36.3  37.5  39.4  35.6  41.3       41.3   44.6   50.0   40.2   51.1
1PPT         41.4  41.4  48.3  55.2  55.2      100.0  100.0  100.0  100.0  100.0
1UBQ         44.6  43.1  49.2  43.1  44.6       39.1   39.1   52.2   69.6   56.5
2I1B         40.9  38.6  34.1  29.5  37.9       57.6   55.9   40.7   39.0   45.8
2LIV         43.0  40.6  42.6  37.1  40.6       52.2   50.7   53.7   47.8   50.0
2OVO         62.2  57.8  57.8  44.4  37.8       82.4   76.5   82.4   64.7   58.8
2PCY         42.9  44.2  50.6  53.2  46.8       51.5   51.5   66.7   63.6   54.5
2TRX         71.3  49.4  45.1  48.8  42.7       58.0   58.0   51.9   44.4   38.3
3APP         55.1  55.1  53.4  49.4  45.7       54.9   56.3   58.3   50.7   50.0
4PTP         50.6  50.0  47.2  43.8  40.3       54.3   51.1   54.3   42.6   50.0
5PTI         50.0  47.6  45.2  50.0  45.2       64.3   64.3   64.3   71.4   57.1
9INS         59.1  63.6  52.3  54.5  45.5       85.7   85.7   71.4   85.7   71.4
Average      52.3  50.4  49.2  46.7  44.4       63.7   63.5   62.9   61.2   59.6
Weighted(c)  44.9  41.8  41.7  38.8  37.9       51.9   51.6   52.6   47.0   46.7

(a) Accuracy of the χ1 + χ2 angle. A χ1 + χ2 angle was counted as correct when both the χ1 and χ2 torsion angles were within 40° of those in the X-ray structure. If the side chain does not have a χ2 angle, the accuracy is calculated using only the χ1 angle.
(b) Core residues whose accessibility is lower than 25%.
(c) Weighted average is the average of χ1 + χ2 over the number of residues (Tanimura et al., 1994).

Table VI. Analysis of prediction

                                           Average          Percentage   R.m.s.d. for first
            Average       Average          difference of    for first    replacement residue (Å)(e)
Radius (Å)  No. of        CPU time (s)(b)  eigenvectors     trial (%)(d)
            atoms(a)                       (°)(c)                        All     Core
2.0          0.0          16.3             14.3             69.10        1.48    1.04
3.0          0.1          16.0             15.0             68.72        1.48    1.03
4.0          3.1          13.0             19.8             73.30        1.44    0.97
5.0         10.0          14.4             25.5             69.40        1.53    1.02
6.0         21.5          16.6             30.3             67.37        1.55    1.02

(a) Average number of main-chain atoms in the sphere, excluding the atoms belonging to the residue to be predicted.
(b) Average CPU time, excluding the reading of the side-chain library.
(c) Average difference of eigenvectors, obtained by transforming the average value of the inner products to degrees.
(d) The percentage of success without a short contact in the first trial of placing a side-chain conformation from the library.
(e) Average r.m.s.d. for first replacement residues, for all residues and core residues.

Next, we considered the similarity of the environment around the central Cβ atom in terms of the percentage of side chains placed without a short contact at the first trial. This value was ~70% for all radii (Table VI). The best percentage was 73.3% at r = 4.0 Å, where the average r.m.s.d. for side chains that succeeded at the first trial was 1.44 Å for all residues and 0.97 Å for core residues. Table VII indicates that values of ~74% were obtained regardless of the order of side-chain construction in the prediction. Therefore, our method obtains similar results (Tables III–V) irrespective of the order of side-chain construction. When the number of side-chain conformations in the library increased, the predicted models became very similar to the native structures. Especially when the library included side-chain conformations taken from family proteins of the predicted protein,
the result was considerably better. Table VIII shows the results for the trypsin family predicted with a radius of 4.0 Å using two side-chain libraries. One excluded the side-chain conformations of the family proteins and the other included them. The results including family proteins were considerably better than those excluding them. Our method is therefore useful for generating side chains in homology modeling when many family proteins are known. Our method predicted side-chain conformations using environmental information around the central Cβ atom and the accuracy of the results was similar to that of previously reported methods that required potential functions. From these results, we consider that side-chain conformations are reasonably well determined by the main-chain and Cβ atoms around the central Cβ atom. In particular, the positions of the main-chain atoms of the considered residue and of the other main-chain atoms within 4.0 Å of the central Cβ atom influence the side-chain conformation. Although our method did not include a potential energy calculation for the effect of side-chain atoms around the central Cβ atom, we expect that such a calculation of the side chain–side chain interaction would improve the method presented in this paper.

Table VII. Predictions for various orders of side-chain construction

                                                              R.m.s.d. for first
                                    Percentage for            replacement residue (Å)
Order of side-chain construction(a) first trial (%)(b)        All     Core
Nearest from center of mass         73.30                     1.44    0.97
From N terminus to C terminus       74.00                     1.47    1.01
Random 1                            73.95                     1.46    0.96
Random 2                            74.06                     1.45    0.96
Random 3                            73.40                     1.43    0.96

(a) Predictions for various orders of side-chain construction were performed at r = 4.0 Å. The first row is the result of a prediction performed from the residue nearest to the center of mass. The second is the result of a prediction performed from the N terminus to the C terminus. The third to fifth rows show the results of predictions in which the order of construction was determined at random.
(b) The percentage of success without a short contact in the first trial of placing a side-chain conformation from the library.

Table VIII. Results of side-chain predictions using two libraries excluding and including family structures(a)

                                           Excluding family, Å (%)(b)     Including family, Å (%)(b)
Protein                          ID code   All           Core             All           Core
Trypsin (Streptomyces griseus)   1SGT      1.50 (66.9)   1.19 (69.1)      1.39 (71.4)   1.03 (77.8)
Chymotrypsinogen A               2CGA      1.53 (61.3)   1.42 (62.4)      1.43 (62.6)   1.27 (63.8)
Kallikrein A                     2PKA      1.57 (61.9)   1.21 (72.3)      1.49 (65.2)   1.06 (79.0)
Native elastase                  3EST      1.55 (67.5)   1.27 (72.4)      1.45 (68.6)   1.11 (73.5)
Trypsin (bovine)                 4PTP      1.65 (63.6)   1.34 (70.2)      1.38 (67.6)   0.80 (84.0)

(a) Prediction used the side-chain library referred to in Methods, with or without family structures (see Appendix).
(b) Values in parentheses are the percentage accuracy of χ1.

Conclusion
We have presented a method of side-chain prediction that takes advantage of the environmental information around the Cβ atom without calculating a potential function. Our method yielded an r.m.s.d. from the X-ray structures of ~1.5 Å for all amino acid residues and ~1.05 Å for core amino acid residues. The accuracy of the χ1 angle was 65% for all amino acid residues and 73% for core amino acid residues. The CPU time was short (prediction of 200 residues took ~60 s on an EWS 4800/310 LC), because our method could determine the optimal side-chain conformation without having to calculate the potential function thousands of times.
These results suggest that most side-chain conformations are determined by the positions of the main-chain atoms around the central Cβ atom. In particular, the main-chain atoms of the residue under consideration and two or three other atoms in the sphere influenced the side-chain conformations. If a library that includes all folding patterns in the various types of environment is generated, the results of the method described in this paper should improve greatly.
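As an illustration of the environment around the central Cβ atom, the following Python sketch collects the main-chain and Cβ atoms inside a 4.0 Å sphere and summarizes the spread of their positions by a covariance matrix and its dominant axis. This section does not give the exact definition of the environment vectors, so everything here is an assumption made for illustration: the atom-name set `MAIN_CHAIN`, the function names, and the use of power iteration in place of a full principal component analysis.

```python
import math

# Assumed atom names for main-chain atoms plus C-beta (PDB-style naming).
MAIN_CHAIN = {"N", "CA", "C", "O", "CB"}

def atoms_in_sphere(center, atoms, radius=4.0):
    """Return the (name, xyz) main-chain/C-beta atoms within radius of center."""
    return [(name, xyz) for name, xyz in atoms
            if name in MAIN_CHAIN and math.dist(center, xyz) <= radius]

def covariance(points):
    """3x3 covariance matrix (population form) of a list of (x, y, z) points."""
    n = len(points)
    mean = [sum(p[i] for p in points) / n for i in range(3)]
    return [[sum((p[i] - mean[i]) * (p[j] - mean[j]) for p in points) / n
             for j in range(3)] for i in range(3)]

def principal_axis(cov, iters=200):
    """Dominant eigenvector of a symmetric 3x3 matrix by power iteration.

    The start vector (1, 1, 1)/sqrt(3) is arbitrary; the iteration does not
    converge to the dominant axis if that axis is orthogonal to it.
    """
    v = [1.0 / math.sqrt(3.0)] * 3
    for _ in range(iters):
        w = [sum(cov[i][j] * v[j] for j in range(3)) for i in range(3)]
        norm = math.sqrt(sum(x * x for x in w))
        if norm == 0.0:  # zero matrix: no preferred direction
            break
        v = [x / norm for x in w]
    return v
```

A caller would pass the central Cβ coordinates as `center` and the atoms of the neighbouring residues as `atoms`, then feed the selected coordinates to `covariance`. A production implementation would more likely use a library eigensolver that returns all three components.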
Acknowledgments
We thank Dr Hitoshi Komooka and Dr Junichi Higo for helpful discussions. This work was supported by a grant-in-aid for special project research from the Ministry of Education, Science, Sports and Culture of Japan.
References
Akahane,K., Nagano,Y. and Umeyama,H. (1989) Chem. Pharm. Bull., 37, 86–92.
Bernstein,F.C., Koetzle,T.F., Williams,G.J.B., Meyer,E.F.,Jr, Brice,M.D., Rodgers,J.R., Kennard,O., Shimanouchi,T. and Tasumi,M. (1977) J. Mol. Biol., 112, 535–542.
Blundell,T.L., Sibanda,B.L., Sternberg,M.J.E. and Thornton,J.M. (1987) Nature, 323, 347–352.
Desmet,J., De Maeyer,M., Hazes,B. and Lasters,I. (1992) Nature, 356, 539–542.
Dunbrack,R.L.,Jr and Karplus,M. (1993) J. Mol. Biol., 230, 543–574.
Eisenmenger,F., Argos,P. and Abagyan,R. (1993) J. Mol. Biol., 231, 849–860.
Holm,L. and Sander,C. (1991) J. Mol. Biol., 218, 183–194.
Holm,L. and Sander,C. (1992) Proteins, 14, 213–223.
Kendall,M.G. (1975) Multivariate Analysis. Charles Griffin, London.
Koehl,P. and Delarue,M. (1994) J. Mol. Biol., 239, 249–275.
Lasters,I. and Desmet,J. (1993) Protein Engng, 6, 717–722.
Laughton,C.A. (1994) J. Mol. Biol., 235, 1088–1097.
Leach,A. (1994) J. Mol. Biol., 235, 345–356.
Lee,C. and Subbiah,S. (1991) J. Mol. Biol., 217, 373–388.
Ponder,J.A. and Richards,F.M. (1987) J. Mol. Biol., 193, 775–791.
Reid,L.S. and Thornton,J.M. (1989) Proteins, 5, 170–182.
Summers,N.L. and Karplus,M. (1989) J. Mol. Biol., 210, 785–811.
Tanimura,R., Kidera,A. and Nakamura,H. (1994) Protein Sci., 3, 2358–2365.
Tuffery,P., Etchebest,C., Hazout,S. and Lavery,R. (1991) J. Biomol. Struct. Dyn., 8, 1267–1289.
Weiner,S.J., Kollman,P.A., Nguyen,D.T. and Case,D.A. (1986) J. Comput. Chem., 7, 230–252.
Wilson,C., Gregoret,L.M. and Agard,D.A. (1993) J. Mol. Biol., 229, 996–1006.
Received July 15, 1996; revised October 24, 1996; accepted December 6, 1996
Appendix
The prediction was carried out using all main-chain atoms (including Cβ atoms) and all proline side-chain atoms in their native conformation. Before the prediction, a side-chain library was generated from the following 72 proteins, whose ID codes are 1ALD, 1CAA, 1CCR, 1COB, 1CRN, 1FKF, 1GD1, 1GST, 1HMO, 1HOE, 1IFB, 1L58, 1LH4, 1LTE, 1MBD, 1MEE, 1PMY, 1PPP, 1PPT, 1RBR, 1RRO, 1SAR, 1SCA, 1SGT, 1SRD, 1THB, 1THM, 1TLD, 1TOP, 1UBQ, 1UTG, 1YEA, 1YEB, 1YPI, 256B, 2ALP, 2APR, 2AZA, 2BFH, 2CDV, 2CGA, 2FB4, 2FCR, 2GBP, 2I1B, 2LHB, 2RNT, 2SCP, 2SN3, 2TRX, 2TSC, 3APP, 3BLM, 3EST, 3TMN, 4CPV, 4ENL, 4GCR, 4PEP, 4PTP, 5CNA, 5CPA, 5FD1, 5P21, 5PAL, 5PTI, 6LDH, 6XIA, 7PCY, 7RSA, 7RXN and 9INS. These proteins cover a variety of sizes and folding patterns; their structures have a resolution better than 2.5 Å and no missing atomic coordinates.