Predicting Protein Folding Types by Distance Functions That Make ...

THE JOURNAL OF BIOLOGICAL CHEMISTRY
© 1994 by The American Society for Biochemistry and Molecular Biology, Inc.

Vol. 269, No. 35, Issue of September 2, pp. 22014-22020, 1994
Printed in U.S.A.

Predicting Protein Folding Types by Distance Functions That Make Allowances for Amino Acid Interactions*
(Received for publication, May 10, 1994, and in revised form, June 3, 1994)

Kuo-Chen Chou‡ and Chun-Ting Zhang§
From Computer-Aided Drug Discovery, 7247-267-1, Upjohn Laboratories, Kalamazoo, Michigan 49007-4940

* The costs of publication of this article were defrayed in part by the payment of page charges. This article must therefore be hereby marked "advertisement" in accordance with 18 U.S.C. Section 1734 solely to indicate this fact.
‡ To whom correspondence should be addressed. Tel.: 616-385-6867.
§ On sabbatical leave from Dept. of Physics, Tianjin University, Tianjin, China.

Given the amino acid composition of a protein, how may one predict its folding type? Although a number of methods have been proposed for this problem, none of them has taken into account the correlative effect among different amino acids, and hence the accuracy of prediction could not be improved to the extent that it should have. In view of this, a new method has been developed in which the similarity between two protein molecules is based on the scale of Mahalanobis distance rather than on the ordinary intuitive geometric distances, such as Minkowski's distance and Euclidian distance. By introducing the Mahalanobis distance, the correlative effect among different amino acids can be automatically incorporated. Predictions have been performed for 131 real proteins consisting of α, β, α + β, and α/β proteins. The results indicate that the rates of correct prediction for both α and β proteins are 100%, and those for α + β and α/β are 88.9 and 89.7%, respectively, with an average accuracy of 94.7%. Predictions have also been performed for 10,000 simulated proteins generated by Monte Carlo sampling for each of the above four folding types, yielding an average accuracy of 95.9%. The accuracy thus obtained for the simulated proteins can avoid the bias due to the limited number of testing proteins selected arbitrarily by different investigators and hence can be regarded as an objective accuracy. It is anticipated that a method with such a high objective accuracy should become a reliable tool in predicting the protein folding type and a useful tool for improving the prediction of secondary structure as well.

According to the contents of secondary structures, proteins of known structures are generally classified into one of the following five folding types: α, β, α + β, α/β, and irregular proteins (Levitt and Chothia, 1976; Chou, 1980, 1989; Nakashima et al., 1986; Richardson and Richardson, 1989; Mao et al., 1994). As is well known, the knowledge of protein folding type can help in the determination of the three-dimensional structure of a protein, particularly in improving the prediction of secondary structure (e.g. see Deléage and Roux (1989)). Therefore, it would be very useful if a rapid and reliable method could be developed to predict the folding type of a protein. During the past 10 years or so, various efforts have been made to reach such a goal by many investigators (Chou, 1980, 1989; Nakashima et al., 1986; Klein, 1986; Klein and Delisi, 1986; Zhang and Chou, 1992a; Chou and Zhang, 1993; Metfessel et al., 1993; Dubchak et al., 1993; Kikuchi, 1993).

However, the existing methods for predicting the folding type of a protein from its amino acid composition are based on either the least Minkowski's distance principle (Chou, 1980, 1989), the least Euclidian distance principle (Nakashima et al., 1986), the discriminant analysis principle (Klein, 1986; Klein and Delisi, 1986), the optimization approach principle (Zhang and Chou, 1992a), or the maximum projection principle (Chou and Zhang, 1993). In these methods, the composition of each of the 20 amino acids is treated as an independent variable, and the contribution due to the correlation among different amino acids is completely ignored. In other words, none of them has ever taken into account the coupling effect among different amino acid components. This is an intrinsic weakness in methodology, which sets a barrier for increasing the accuracy.

Thus, the following questions are naturally raised. If the ordinary geometric distances do not accurately reflect the similarity between two protein molecules, what kind of distance should be adopted? How can the coupling effect be taken into account? Is it true that the predicted results obtained after incorporating this effect can be significantly improved? The present study was initiated in an attempt to answer these questions. Below, we shall adopt a special scale, the so-called Mahalanobis distance (Mahalanobis, 1936; Pillai, 1985), to measure the similarity of proteins. (Pillai (1985) also presents a brief biography of Mahalanobis, who was a man of great originality and who made considerable contributions to statistics.)

THEORY

The principle underlying the new prediction method is that the shorter the Mahalanobis distance between two protein molecules, the higher their similarity, and hence the more likely they belong to the same folding type.

Let us first give a brief introduction to the Mahalanobis distance and its difference from ordinary distances, and then a formulation of how to use it to predict the folding type of a protein.

According to its amino acid composition, a protein molecule can be represented by a point or a vector (Chou and Zhang, 1993) in a 20-dimensional (D) space, the so-called composition space (Nakashima et al., 1986). However, the amino acid composition of a protein must be normalized, i.e. constrained by

$$\sum_{i=1}^{20} x_i = 1 \qquad \text{(Eq. 1)}$$

where $x_i$ is the composition component of the ith amino acid in a protein. This indicates that of the 20 amino acid composition components only 19 are independent. Therefore, by leaving out any one of its 20 components, one can still uniquely represent a protein by a point in a 19-D space. Suppose the 20 amino acids are alphabetically ordered according to their single-letter code.


If the last amino acid component is left out, then the 19-D space will be defined by the bases corresponding to the components of A, C, D, E, F, G, H, I, K, L, M, N, P, Q, R, S, T, V, and W, respectively. Once the 19-D space is established, the kth protein in a given protein set can be expressed by

$$X_k = \begin{bmatrix} x_{k,1} \\ x_{k,2} \\ \vdots \\ x_{k,19} \end{bmatrix}, \qquad (k = 1, 2, \ldots, N) \qquad \text{(Eq. 2)}$$

where $x_{k,1}, x_{k,2}, \ldots, x_{k,19}$ are, respectively, the 19 amino acid composition components of the kth protein $X_k$, and N is the total number of proteins in the set. The norm of the protein set concerned is defined by

$$\bar{X} = \begin{bmatrix} \bar{x}_1 \\ \bar{x}_2 \\ \vdots \\ \bar{x}_{19} \end{bmatrix} \qquad \text{(Eq. 3)}$$

where

$$\bar{x}_i = \frac{1}{N} \sum_{k=1}^{N} x_{k,i}, \qquad (i = 1, 2, \ldots, 19) \qquad \text{(Eq. 4)}$$

When the N proteins in Equation 3 are all α proteins, $\bar{X}$ thus defined would become the norm of the α protein set, denoted by $\bar{X}_\alpha$. Likewise, when the N proteins in Equation 3 are all β, or α + β, or α/β proteins, $\bar{X}$ would become the norm of the β, or α + β, or α/β protein set, denoted by $\bar{X}_\beta$, $\bar{X}_{\alpha+\beta}$, or $\bar{X}_{\alpha/\beta}$, respectively.

To incorporate the correlative terms, let us first introduce a 19 x 19 covariance matrix given by

$$S = \begin{bmatrix} s_{1,1} & s_{1,2} & \cdots & s_{1,19} \\ s_{2,1} & s_{2,2} & \cdots & s_{2,19} \\ \vdots & \vdots & \ddots & \vdots \\ s_{19,1} & s_{19,2} & \cdots & s_{19,19} \end{bmatrix} \qquad \text{(Eq. 5)}$$

where

$$s_{i,j} = \sum_{k=1}^{N} [x_{k,i} - \bar{x}_i][x_{k,j} - \bar{x}_j], \qquad (i, j = 1, 2, \ldots, 19) \qquad \text{(Eq. 6)}$$

Suppose X is a protein whose folding type is to be predicted. It can be either one of the N proteins in Equation 2 or a protein outside of them. The protein X also corresponds to a point $(x_1, x_2, \ldots, x_{19})$ in the 19-D space, where $x_i$ is the normalized frequency of its ith amino acid (see Equation 1). Thus, the Mahalanobis distance, $D^2(X, \bar{X})$, between the norm $\bar{X}$ defined by Equation 3 and X in the 19-D space is given by (Mahalanobis, 1936; Pillai, 1985)

$$D^2(X, \bar{X}) = (X - \bar{X})^T S^{-1} (X - \bar{X}) \qquad \text{(Eq. 7)}$$

where T is the transposition operator, and $S^{-1}$ is the inverse matrix of S given by Equation 5. Note that the non-diagonal terms in Equation 5 are generally not equal to zero. It is through these terms that the coupling effect among different amino acid components is incorporated. Also, unlike the ordinary geometric distance, the Mahalanobis distance is unit-independent, i.e. its value will not be changed by using different units of coordinates. More discussion about the Mahalanobis distance is given in "Appendix A," where a simple example is presented that illustrates why it is more appropriate to use Mahalanobis distance as a scale in classifying statistical data. It can also be proved that the value of $D^2(X, \bar{X})$ is independent of which 19 of the 20 components are chosen as the bases for calculation. In other words, leaving out any one of the 20 normalized components from Equation 1 leads to the same result for $D^2(X, \bar{X})$ as long as X, $\bar{X}$, and S are defined in the same 19-D space.

When the N proteins in Equations 4 and 6 are all α proteins, the S thus defined will become the covariance matrix of an α protein set, denoted by $S_\alpha$, and $D^2(X, \bar{X})$ will become $D^2(X, \bar{X}_\alpha)$. Likewise, when the N proteins in Equations 4 and 6 are all β, or α + β, or α/β proteins, S will become the covariance matrix of a β, or α + β, or α/β protein set, denoted by $S_\beta$, $S_{\alpha+\beta}$, or $S_{\alpha/\beta}$, respectively. And the corresponding $D^2(X, \bar{X})$ will become $D^2(X, \bar{X}_\beta)$, $D^2(X, \bar{X}_{\alpha+\beta})$, or $D^2(X, \bar{X}_{\alpha/\beta})$. Thus, the similarity between any protein X and the norms of the four folding types can be quantitatively formulated through the Mahalanobis distance as given below:

$$D^2(X, \bar{X}_\xi) = (X - \bar{X}_\xi)^T S_\xi^{-1} (X - \bar{X}_\xi), \qquad \xi = \alpha,\ \beta,\ \alpha+\beta,\ \alpha/\beta \qquad \text{(Eq. 8)}$$

The smaller $D^2(X, \bar{X}_\xi)$ is, the closer the protein X is to the ξ protein set, and hence the higher the likelihood of it belonging to the ξ folding type; and vice versa. Thus, the protein X will be predicted to be the folding type for which $D^2$ has the least value, as can be formulated as follows. Suppose

$$D^2(X, \bar{X}_\xi) = \mathrm{Min}\left( D^2(X, \bar{X}_\alpha),\ D^2(X, \bar{X}_\beta),\ D^2(X, \bar{X}_{\alpha+\beta}),\ D^2(X, \bar{X}_{\alpha/\beta}) \right) \qquad \text{(Eq. 9)}$$

where the index ξ can be α, β, α + β, or α/β, and the operator Min means taking the least one among those in the parentheses, then the index ξ of Equation 9 will represent which folding type the protein X should belong to. Therefore, the new method is based on the least distance principle. However, as discussed above, the distance defined here has incorporated the coupling of different amino acid components, which is quite different from those defined in the previous methods.

In summary, the prediction can be performed according to the following procedure step by step:

Step 1: Knowing the amino acid compositions of the N database proteins and the unknown protein whose folding type is to be ascertained, normalize their amino acid components by dividing the number of each component amino acid by the total number of amino acids in the protein (cf. Equation 1).
Step 2: Eliminate one of the 20 normalized amino acid components, thereby defining a 19-D space, and express the proteins as points in the 19-D space (cf. Equation 2).
Step 3: Calculate the standard points for α, β, α + β, and α/β proteins, respectively, from the data base proteins (cf. Equations 3 and 4).
Step 4: Calculate the 19 row and 19 column covariance matrix for α, β, α + β, and α/β proteins, respectively, from the data base proteins (cf. Equations 5 and 6).
Step 5: Calculate the Mahalanobis distance from the point of the unknown protein to each of the above four standard points (cf. Equations 7 and 8).
Step 6: The unknown protein is predicted to have the same folding type as the one to which the Mahalanobis distance is the least (cf. Equation 9).

RESULTS AND DISCUSSION

Two sorts of proteins, the real proteins and the simulated proteins, are used to demonstrate the predicted results. The calculated results for the former will indicate the self-consistency of the new method, while those for the latter indicate its objective accuracy (Zhang and Chou, 1992b). A valid new method should give a better result in both of these two aspects.
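The six steps summarized in the Theory section can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the authors' program: the function names are hypothetical, the toy training data use a three-letter alphabet (so the reduced space is 2-D rather than 19-D), the covariance follows the unnormalized sum written in Equation 6, and the product with the inverse matrix is obtained by Gaussian elimination instead of forming the inverse explicitly.

```python
from typing import Dict, List, Sequence

Vector = List[float]
Matrix = List[List[float]]


def normalize(counts: Sequence[float]) -> Vector:
    """Step 1: turn residue counts into fractions that sum to 1 (Eq. 1)."""
    total = float(sum(counts))
    return [c / total for c in counts]


def drop_last(x: Sequence[float]) -> Vector:
    """Step 2: leave out one component (here the last one)."""
    return list(x[:-1])


def mean_vector(points: Sequence[Vector]) -> Vector:
    """Step 3: standard point (norm) of a protein set (Eqs. 3 and 4)."""
    n = len(points)
    return [sum(p[i] for p in points) / n for i in range(len(points[0]))]


def covariance(points: Sequence[Vector], mean: Vector) -> Matrix:
    """Step 4: s_ij = sum_k (x_ki - mean_i)(x_kj - mean_j) (Eqs. 5 and 6)."""
    dim = len(mean)
    return [[sum((p[i] - mean[i]) * (p[j] - mean[j]) for p in points)
             for j in range(dim)] for i in range(dim)]


def solve(a: Matrix, b: Vector) -> Vector:
    """Solve a @ y = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        if abs(m[pivot][col]) < 1e-12:
            raise ValueError("singular covariance matrix")
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    y = [0.0] * n
    for i in range(n - 1, -1, -1):  # back substitution
        y[i] = (m[i][n] - sum(m[i][j] * y[j] for j in range(i + 1, n))) / m[i][i]
    return y


def mahalanobis_sq(x: Vector, mean: Vector, cov: Matrix) -> float:
    """Step 5: D^2 = (x - mean)^T S^-1 (x - mean) (Eq. 7)."""
    diff = [xi - mi for xi, mi in zip(x, mean)]
    return sum(d * y for d, y in zip(diff, solve(cov, diff)))


def predict(x: Vector, training: Dict[str, List[Vector]]) -> str:
    """Step 6: pick the folding type with the least D^2 (Eq. 9)."""
    distances = {}
    for label, points in training.items():
        mean = mean_vector(points)
        distances[label] = mahalanobis_sq(x, mean, covariance(points, mean))
    return min(distances, key=distances.get)


# Toy data (hypothetical): compositions over a 3-letter alphabet, already
# reduced to 2 components; real use would have 19 components per protein.
TRAINING = {
    "alpha": [[0.50, 0.30], [0.60, 0.20], [0.55, 0.30], [0.45, 0.25]],
    "beta": [[0.20, 0.60], [0.25, 0.50], [0.30, 0.65], [0.15, 0.55]],
}
query = drop_last(normalize([11, 5, 4]))  # counts -> fractions -> 2-D point
print(predict(query, TRAINING))  # alpha
```

In this sketch each class needs more proteins than dimensions, otherwise the covariance matrix is singular and `solve` raises an error; with 19 components that means at least 20 proteins per folding type.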


TABLE I
The Mahalanobis distances calculated for 31 α proteins
The predicted folding type for these proteins = α.
PDB^a code of the 31 α proteins

OTMV 1CC5 OAF1 OHBG OLRP A001^c A002^c A003^c 1CPV 1ICB 156B 3CYT 2C2C 2CDV OCY3 155C 351C 05C1 1CCY 1ECD 2MHB 2MHB 1LHB 1LH1 2MBN OMBA 1MHR OC3A 1CYP 1HMQ 1CTS

TABLE II
The Mahalanobis distances calculated for 34 β proteins
The predicted folding type for these proteins = β.

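Mahalanobis distance^b: $D^2(X, \bar{X}_\alpha)$, $D^2(X, \bar{X}_\beta)$, $D^2(X, \bar{X}_{\alpha+\beta})$, $D^2(X, \bar{X}_{\alpha/\beta})$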

0.73* 0.78* 19.16 0.51* 7.22 0.46* 0.64* 0.69* 7.98 0.69* 11.39 0.80* 0.61* 0.71* 14.02 0.75* 19.17 0.76* 2.07 0.53" 0.53* 0.68* 0.49* 0.56" 6.40 0.63* 14.32 0.77* 12.32 0.55" 2.28 0.60* 3.59 0.60* 4.61 0.70* 0.52* 0.63" 3.84 0.61" 2.49 0.42* 0.88* 2.48 0.43* 0.60* 0.15*

2.54 0.963.23 3.72 1.57 7.10 1.88 5.77 2.25 3.95 6.08 9.44 2.93 13.61 3.22 5.79 1.42 2.22 1.74 1.96 2.43 8.46 13.23 6.57 13.31 1.12 4.48 2.27 5.54 3.82 2.07 3.43 3.80 1.73 2.32 3.03 1.85 3.31 3.23 5.63 1.43 1.40 6.98 1.73 0.79 3.94 1.16 0.54 1.43 2.57 Rate of correct prediction = 31/31 = 100% 4.12 4.62 3.15 1.91 1.80 2.95 11.69 3.53 3.86 3.67 3.85 3.34 2.21 3.96 1.94 1.75 1.10 5.96 2.28 3.28 2.65 1.97 1.81 2.05 2.45 3.63 2.12 2.39 1.05 2.62 0.51

a PDB is the Protein Data Bank, Brookhaven National Laboratory, Upton, NY.
b A factor of 10^3 is multiplied to each of the distance values listed here. See Equation 8 for the definition of the Mahalanobis distance, which is different for different folding types. The one with the least value (marked by *) is assumed to correspond to the folding type for the predicted protein (cf. Equation 9).
c PDB code is not available: A001 represents bacteriorhodopsin, Halobacterium halobium; A002 α-tropomyosin, rabbit skeletal muscle; and A003 uteroglobin (steroid-binding protein, blastokinin), rabbit.

1HIP B001^c 1ACX 3CNA 3FAB 3FAB 1REI 1FC1 B002^c 1PEP 2PKA 1ALP 1AZU 1PCY 2PAB 3RXN 1CTX 1NXB 1SN3 2SOD 2CHA 3PTP 1EST 2APP 2APE 1APR 3SGB 3RP2 2STV 2TBV 3SBV B003^c B004^c OHMG

1.90 0.93 1.55 2.08 5.40 1.18 1.72 2.12 2.59 2.42 1.58 1.65 5.13 3.20 5.11 6.88 2.68 3.04 3.32 1.75 3.42 1.99 1.36 4.42 1.70 0.82 0.86 1.16

0.73* 0.44* 0.57* 0.42* 0.28* 0.39* 0.66* 0.50* 0.21* 0.59* 0.69* 0.43* 0.81* 0.57* 0.59* 0.74" 0.86* 0.68* 0.87* 0.78* 0.50" 0.63" 0.67* 0.42* 0.27* 0.43* 0.54* 0.54* 0.49* 0.48" 0.47* 0.85* 0.50* 0.38*

10.25 2.58 5.85 0.69 1.83 2.27 11.96 1.64 1.79 3.50 1.40 3.77 4.91 3.13 4.38 8.25 7.50 1.63 10.00 10.78 1.42 1.96 1.88 6.75 4.05 5.12 6.43 4.51 2.88 2.57 3.45 5.48 4.43 1.09

3.44 0.79 1.87 1.97 2.34 2.17 7.39 1.98 2.30 2.27 1.54 1.70 1.46 1.86 0.76 5.51 13.79 13.99 11.81 2.08 1.75 1.99 1.57 3.06 1.39 1.42 3.33 0.77 2.01 1.56 1.35 1.88 2.12 2.04

Rate of correct prediction = 34/34 = 100%
a See footnote a to Table I.
b See footnote b to Table I.
c PDB code is not available: B001 represents trypsin inhibitor, soybean (Kunitz); B002 neuraminidase, Tokyo/3/67 influenza; B003 γ-crystallin 2, bovine eye lens protein; and B004 gene 5 protein, bacteriophage fd.

Predicted Results for Real Proteins

In their original paper Nakashima et al. (1986) examined 135 proteins of which 31 are α proteins, 34 β, 27 α + β, 39 α/β, and 4 irregular. The irregular folding type proteins have been left out in this study because their number is only four, too small to have any statistical significance. Therefore, the prediction and comparison will be made based on the remaining 131 proteins. The amino acid compositions of the 131 proteins, together with the computer program for predicting folding types, are available upon request.

α Proteins: The four Mahalanobis distances $D^2(X, \bar{X}_\xi)$ (ξ = α, β, α + β, α/β) for the 31 α proteins are given in Table I, from which we can see that $D^2(X, \bar{X}_\alpha)$ has the least values for all the proteins listed there. Therefore, according to the least distance principle (cf. Equation 9), all of the 31 proteins are correctly predicted to be the α folding type. These results indicate that the rate of correct prediction for the α proteins is 100%.

β Proteins: The four Mahalanobis distances for the 34 β proteins are given in Table II, from which we can see that the rate of correct prediction for the β proteins is also 100%.

α + β Proteins: The four Mahalanobis distances for the 27 α + β proteins are given in Table III. As we can see, of the 27 proteins, 24 are correctly predicted as α + β proteins. The rate of correct prediction for the α + β proteins is 88.9%.

α/β Proteins: The four Mahalanobis distances for the 39 α/β proteins are given in Table IV. As we can see, of the 39 proteins, 35 are correctly predicted as α/β proteins. The rate of correct prediction for the α/β proteins is 89.7%.

Listed in Table V are the rates of correct prediction by means of the current method, the least Minkowski's distance method (Chou, 1980, 1989), the least Euclidian distance method (Nakashima et al., 1986), the discriminant analysis method (Klein, 1986; Klein and Delisi, 1986), and the maximum projection method (Chou and Zhang, 1993), respectively. As shown in the table, the current method yields a 100% rate of correct prediction for both α and β proteins. The prediction of α + β proteins had been troublesome, as reflected by the fact that the rate of correct prediction by the least Euclidian distance method for the 27 proteins was only 37.0%. Nakashima et al. (1986) attributed the trouble to the fact that the α + β folding type had a more serious problem in distribution overlapping with all the other folding types. However, for the same set of 27 α + β proteins, the rate of correct prediction by the new method has been increased to 88.9%, which is more than 40 percentage points higher than that obtained by Nakashima et al. (1986). Now it is clear that the trouble is actually due to the error of ignoring the coupling effect among different amino acid components, and such an error would become particularly serious in the case of having the distribution overlapping problem.

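TABLE III
The Mahalanobis distances calculated for 27 α + β proteins
Columns: PDB^b code of the 27 α + β proteins; Mahalanobis distance^a $D^2(X, \bar{X}_\alpha)$, $D^2(X, \bar{X}_\beta)$, $D^2(X, \bar{X}_{\alpha+\beta})$, $D^2(X, \bar{X}_{\alpha/\beta})$; the predicted folding type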

2B5C 2SSI 4PTI lTGS lOV0 1P2P

cool'

0.87

2.83 2.59

C002' C003' 4RSA 2SNS C004' lLZM 2LYZ 8PAP 2ACT 3TLN 2cAB 2BCL OHMG OCTF C005' GAP1 OGAP OCRO lPyP OSDE

1.43 1.99 8.27 3.30 5.15 6.44 1.09 12.88 5.87 2.78 1.24 0.79 4.57 1.96 0.88 1.64 1.18 1.49 1.47 0.43* 0.55* 1.25 1.53 0.49*

2.42 1.44 2.28 1.68 2.99 2.79 2.01 2.20 2.48 3.03 1.81 1.59 0.87 1.37 1.62 1.07 1.64 0.69 1.19 1.14 5.68 2.45 1.33 1.03 2.36 1.40 1.26

0.74* 0.76* 0.77* 0.89* 0.83* 0.79* 0.48* 0.78* 0.69* 0.79* 0.82* 0.76* 0.60* 0.65* 0.67* 0.62* 0.62* 0.45* 0.74* 0.41* 0.85* 0.78* 0.66 0.64 0.61* 0.74* 0.74

TABLE IV
The Mahalanobis distances calculated for 39 α/β proteins
The predicted folding type

2.52 1.65 6.93 10.00 7.48 6.66 1.51 2.99 2.29 5.67 1.52 0.85 1.81 5.72 3.65 1.41 1.01 1.08 1.24 0.92 4.22 1.56 1.31 1.01 2.14 0.86 1.12

Predicted folding type for the 27 proteins of Table III: α+β α+β α+β α+β α+β α+β α+β α+β α+β α+β α+β α+β α+β α+β α+β α+β α+β α+β α+β α+β α+β α+β α α α+β α+β α

Rate of correct prediction = 24/27 = 88.9%
a See footnote b to Table I.
b See footnote a to Table I.
c PDB code is not available: C001 represents ribonuclease (barnase), Bacillus amyloliquefaciens; C002 ribonuclease ST, Streptomyces erythreus; C003 ribonuclease T1 (EC 3.1.27.1), Aspergillus oryzae; C004 deoxyribonuclease I (EC 3.1.21.1), bovine; and C005 DNA binding protein 2, Bacillus stearothermophilus.

Let us define the average accuracy as the percentage of the number of correct prediction events for all types divided by the number of total prediction events; i.e.

$$\eta = \text{average accuracy} = 100 \times \frac{\text{total number of correct prediction events}}{\text{total number of prediction events}}\ \% \qquad \text{(Eq. 10)}$$

As we can see from Table V, the average accuracy obtained by the current method is 124/131 = 94.7%, which is much higher than the rates by any of the previous methods.
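As a quick arithmetic check of Equation 10, the short Python sketch below recomputes the per-type rates and the pooled average accuracy from the counts reported for the current method (31/31, 34/34, 24/27, and 35/39). The dictionary keys and the function name are illustrative choices, not part of the original work.

```python
def average_accuracy(correct: dict, total: dict) -> float:
    """Eq. 10: 100 x (all correct predictions) / (all predictions)."""
    return 100.0 * sum(correct.values()) / sum(total.values())


# Counts for the current method, taken from Tables I-IV.
CORRECT = {"alpha": 31, "beta": 34, "alpha+beta": 24, "alpha/beta": 35}
TOTAL = {"alpha": 31, "beta": 34, "alpha+beta": 27, "alpha/beta": 39}

# Per-type rates of correct prediction, in percent.
RATES = {k: round(100.0 * CORRECT[k] / TOTAL[k], 1) for k in TOTAL}

print(RATES)
print(round(average_accuracy(CORRECT, TOTAL), 1))  # 94.7
```

Note that Equation 10 pools all prediction events (124 of 131) rather than averaging the four per-type percentages, so types with more proteins carry more weight.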

DOOl OGPl OPHH OPBl OETU D002' D003" 3CAT lABP lGBP 3FxN

om1

lSRX OTT4 4ADH 4LDH lLDX 2GPD 4DFR 3DFR 2GRS 2ATC 2ATC lAAT 3PGK 2ADK 3PGM lRHD 2TAA 5CPA 1CPB lSBT lTIM lKGA 2MDH lTSl OMTS OPFK OZGP

0.37 0.97 1.89 0.67 2.57 0.54 1.01 1.35 0.34 0.97 3.17 2.47 1.18 1.40 0.63 2.15 1.71 0.53 0.84 0.96 0.56 1.58 0.27* 0.25 0.70 1.28 1.24 0.87 1.09 0.81 1.12 1.28 0.52* 2.14 1.49 0.74 0.24* 2.28 1.75

0.85 2.44 1.51 0.51 0.83 0.58 0.69 0.88 0.98 1.57 2.40 2.26 0.95 1.53 0.72 0.80 1.23 0.80 0.48* 1.23 0.36 1.57 0.97 0.43 0.73 1.91 1.02 0.78 0.61 1.13 0.79 0.95 1.13 1.45 0.51 1.14 0.29 0.86 0.68

2.15 3.60 0.99 0.95 3.02 1.21 4.27 2.45 1.88 5.89 5.41 4.16 4.31 3.89 2.40 3.08 2.06 2.16 2.51 4.64 2.54 2.22 1.69 0.40 0.85 2.17 1.51 1.56 1.58 2.96 4.83 2.17 2.12 1.86 1.29 0.52 1.04 6.03 3.45

0.33* 0.77* 0.40* 0.38* 0.38* 0.46* 0.54* 0.44* 0.25* 0.57* 0.75* 0.68* 0.52* 0.66* 0.42* 0.42* 0.58* 0.33* 0.49 0.62* 0.21* 0.63* 0.49 0.23* 0.38* 0.62* 0.47* 0.40* 0.54* 0.50* 0.50*

0.69* 0.55 0.56* 0.31* 0.30* 0.25 0.61* 0.56*

Rate of correct prediction = 35/39 = 89.7%

a See footnote a to Table I.
b See footnote b to Table I.
c PDB code is not available: D001 represents 6-phosphogluconate dehydrogenase (EC 1.1.1.44), ovine; D002 phosphoglycerate kinase (EC 2.7.2.3), horse muscle; and D003 alkaline phosphatase isozyme 3 (EC 3.1.3.1), E. coli.

Predicted Results for Simulated Proteins

First of all, let us point out why we need to consider simulated proteins rather than a set of testing proteins. After carefully analyzing the data listed in Table V, one may raise the following questions. Although the average accuracy reported by Nakashima et al. (1986) is 70.2%, lower than the 79.7% reported by Chou (1989), it does not necessarily mean that Nakashima et al.'s method is poorer than Chou's, because the prediction performed by the former was for a set of 131 proteins but that by the latter was for a set of only 64 proteins. Furthermore, even when the predictions by different methods are performed for an identical set of proteins, the accountability of the results thus obtained could still be questionable. This is because a method which gives the best predicted results for one set of proteins does not necessarily remain so when applied to another set of proteins. In other words, the predicted accuracy is, to some extent, dependent on the set of proteins selected by the predictor. A similar problem also exists when prediction is performed for a set of testing proteins. Only when the number of proteins considered is sufficiently large can the bias due to the selection of different protein sets be eliminated. Unfortunately, so far there are only about 500 proteins whose 3-D structures have been determined. Therefore, to make a fair comparison of different prediction methods, one should adopt the objective accuracy (Zhang and Chou, 1992b) as a criterion. The objective accuracy is actually an asymptotical limit for the rate of correct prediction computed for a sufficiently large number of simulated proteins. The simulated proteins were generated through the following procedure.

Suppose

$$X_\xi = \begin{bmatrix} x_{\xi,1} \\ x_{\xi,2} \\ \vdots \\ x_{\xi,19} \end{bmatrix} \qquad \text{(Eq. 11)}$$

is a normal, random vector in the 19-D space, representing the amino acid composition of any protein in the ξ type protein distribution range; then, according to statistical theory (DeGroot, 1986), the distribution density functions for α, β, α + β,


TABLE V
Comparison of various prediction methods for real proteins

                                              Rate of correct prediction
Method                       α type             β type             α + β type         α/β type           Average accuracy^a
This paper^b                 31/31 = 100%       34/34 = 100%       24/27 = 88.9%      35/39 = 89.7%      124/131 = 94.7%
Chou and Zhang (1993)^c      18.5/19 = 97.4%    12/15 = 80.0%      10/14 = 71.4%      13/16 = 81.3%      53.5/64 = 83.6%
P. Y. Chou (1989)^d          16/19 = 84.2%      12/15 = 80.0%      11/14 = 78.6%      12/16 = 75.0%      51/64 = 79.7%
Nakashima et al. (1986)^e    27/31 = 87.1%      22/34 = 64.7%      10/27 = 37.0%      33/39 = 84.6%      92/131 = 70.2%
Klein (1986)^f               20/29 = 68.9%      25/27 = 92.6%      10/16 = 62.5%      20/26 = 76.9%      75/98 = 76.5%
Klein and Delisi (1986)^g    17/20 = 85.0%      14/17 = 82.4%      15/22 = 68.2% (mixed type)            46/59 = 78.0%

a See Equation 10 for the definition of the average accuracy.
b Based on the least Mahalanobis distance principle (cf. Equation 9).
c Based on the maximum projection principle.
d Based on the least Minkowski's distance principle.
e Based on the least Euclidian distance principle.
f Based on the discriminant analysis method.
g Based on the discriminant analysis method. In addition to the amino acid composition, the hydrophobic values of the constituent amino acids were also used as the attribute. The α + β and α/β proteins were grouped into one "mixed type," and hence only three distinctive folding types were considered during prediction.

and α/β proteins can be generally formulated as follows:

$$f_\xi(X) = \frac{1}{(2\pi)^{19/2}\,|S_\xi|^{1/2}} \exp\left\{ -\frac{1}{2} (X - \bar{X}_\xi)^T S_\xi^{-1} (X - \bar{X}_\xi) \right\}, \qquad \xi = \alpha,\ \beta,\ \alpha+\beta,\ \alpha/\beta \qquad \text{(Eq. 12)}$$

Thus, based on the above density distribution functions for the four folding types, we can generate any number of α, β, α + β, or α/β proteins by the Monte Carlo sampling procedure described in a previous paper (Zhang and Chou, 1992b). In the current study, 10,000 simulated proteins were generated for each of the four folding types. For these proteins, predictions were performed by the current method, the maximum projection method (Chou and Zhang, 1993), the method by P. Y. Chou (1989), and the method by Nakashima et al. (1986), respectively, and the predicted results thus obtained are given in Table VI. As demonstrated in the previous study (Zhang and Chou, 1992b), when the number of simulated proteins of each type is greater than 3000, the rate of correct prediction gradually approaches a limit, the so-called asymptotical limit. In this case the errors due to fluctuation would be very small and could actually be omitted. Such an asymptotical limit was defined as the objective accuracy of prediction (Zhang and Chou, 1992b). Even if a perfect training database is not available at present, the accuracy derived from such a large number of simulated proteins would certainly have much less bias than that derived from a very limited number of testing proteins selected quite arbitrarily by different investigators.

The predicted results are listed in Table VI, from which it is seen that the average objective accuracy by the current method is significantly higher than those by the other three methods. Also, it is interesting to see that only a slight difference (in the range of 1-3%) was observed for the objective accuracy among the other three methods. This presents a striking contrast to the results reported by the previous investigators. According to their reports, the rates of correct prediction among these three methods are remarkably different, e.g. the difference between the rate for α + β proteins by P. Y. Chou and that by Nakashima et al. (1986) is as high as 78.6% - 37.0% = 41.6% (see column 3 of Table V). Apparently, this is an artifact owing to selecting different sets of proteins. This kind of artifact can be excluded by computing the rate of correct prediction for a sufficiently large number of simulated proteins as a criterion in examining the accuracy of a method.

CONCLUSION

During the past 10 years or so, various methods were developed for predicting the folding type of a protein according to its amino acid composition (Chou, 1980, 1989; Nakashima et al., 1986; Klein, 1986; Klein and Delisi, 1986; Zhang and Chou, 1992a; Chou and Zhang, 1993; Dubchak et al., 1993). In these methods, however, the composition of each amino acid was treated independently; none of them ever considered the correlative effect among the different amino acid components, and hence the prediction accuracy could not be improved as desired. In view of this, a new method was developed in terms of the Mahalanobis distance in which such a correlation is incorporated through a covariance matrix. Since among the 20 amino acid composition components of a protein only 19 are independent, the Mahalanobis distance can be defined in a 19-D space. The new method is based on the least Mahalanobis distance principle, according to which the shorter the Mahalanobis distance between two proteins, the higher their similarity, and hence the more likely they belong to the same folding type. The predicted results for the 131 real proteins indicate that the rates of correct prediction for the α, β, α + β, and α/β type folding proteins were 100, 100, 88.9, and 89.7%, respectively, with a rate of 94.7% for the average accuracy. The predicted results for the 10,000 x 4 simulated proteins indicate that the objective accuracy of prediction for the α, β, α + β, and α/β type folding proteins were 99.8%, 99.5%, 92.7%, and 91.6%, respectively, with a rate of 95.9% for the average objective accuracy of prediction, which is more than 30 percentage points higher than those by the previous methods. The objective accuracy is an asymptotical limit for the rate of correct prediction, which is independent of

TABLE VI
Comparison of various prediction methods for 10,000 x 4 simulated proteins

                                              Rate of correct prediction
Method                       α type                 β type                 α + β type             α/β type               Average accuracy^a
This paper^b                 9982/10000 = 99.82%    9951/10000 = 99.51%    9273/10000 = 92.73%    9158/10000 = 91.58%    38364/40000 = 95.91%
Chou and Zhang (1993)^c      7193/10000 = 71.93%    7392/10000 = 73.92%    4171/10000 = 41.71%    7421/10000 = 74.21%    26177/40000 = 65.44%
P. Y. Chou (1989)^d          6959/10000 = 69.59%    7130/10000 = 71.30%    4279/10000 = 42.79%    7233/10000 = 72.33%    25601/40000 = 64.00%
Nakashima et al. (1986)^e    7116/10000 = 71.16%    7345/10000 = 73.45%    4260/10000 = 42.60%    7419/10000 = 74.19%    26140/40000 = 65.35%

a, b, c, d, e See corresponding footnotes to Table V.

the protein set selected by different investigators (Zhang and Chou, 1992b).

It is instructive to point out that, for all seven of the incorrectly predicted proteins shown in Tables III and IV, the Mahalanobis distance margin between the actual and the incorrectly predicted types is very small, ranging from 0.01 to 0.25. This suggests that the Mahalanobis distance is quite successful in reflecting the similarity of proteins, and hence that the rate of correct prediction by the current method can be further improved if a good protein data base is available.

The development of prediction methods based on statistical theory generally consists of two parts: one is the establishment of the method itself, and the other is the improvement of the training data. The present study belongs to the former. The very high rate of correct prediction for both the real and simulated proteins indicates that the current method would become a reliable tool for predicting the protein folding type if a good data base were available to represent the folding types of proteins. Since the only input for the new method is the amino acid composition of a protein, the high rate itself would further confirm that the protein folding type is associated with the amino acid composition, as discussed recently by Muskal and Kim (1992). In their study it is suggested that knowledge of sequence information is not necessary for highly accurate predictions of protein secondary structure content, implying that the folding type of a protein may basically depend on its amino acid composition. "The mechanism of secondary folding," as pointed out by one of the anonymous reviewers of this report, "is a convoluted process, which depends upon sequence in some respect, but the final result can be predicted statistically provided that allowance is made for interactions between amino acids. These interactions are apparently a reflection of sequence. The physical analogy is the quantum mechanical prediction of atomic structure, which basically is a probabilistic theory in which interactions between fundamental particles are expressed in terms of potential energy."

It is known that knowledge of the folding type of a protein is important for improving the prediction of its secondary structure (Deleage and Roux, 1989). This has been further demonstrated by a recent report (Cohen et al., 1993) in which it was observed that a completely identical amino acid sequence would assume totally different secondary structures if the sequence is located in two proteins with different folding types. Accordingly, owing to its high accuracy, the new method would also have an impact on improving secondary structure prediction.

Acknowledgments-We thank Dr. Ken Nishikawa, Dr. Hiroshi Nakashima, and Professor Tasuo Ooi for supplying the amino acid composition data used in this study.

APPENDIX A

First, it is instructive to point out that the quantity defined by Equation 9 bears all the basic properties of distance. Suppose $A$, $B$, and $C$ are any three points in the space where the Mahalanobis distance is defined; then it follows that: 1) $D(A, B) \geq 0$, where the equal sign holds true only when $A = B$; 2) $D(A, B) = D(B, A)$; and 3) $D(A, B) + D(B, C) \geq D(A, C)$.

Below, a simple example will be presented that illustrates the advantage of using the Mahalanobis distance as a scale to measure the similarity among a set of points. Suppose in a 2-D space there are 10 points whose coordinates in the Cartesian system $(x_1, x_2)$ are

$$
\begin{aligned}
x_1 &= 0.907,\ 1.596,\ -0.808,\ -0.529,\ 1.116,\ -0.585,\ -1.236,\ 0.663,\ -0.278,\ -0.848 \\
x_2 &= 0.803,\ 1.467,\ -0.978,\ -0.748,\ 0.868,\ -0.463,\ -0.474,\ 1.121,\ -0.267,\ -1.329
\end{aligned}
\tag{A1}
$$

The mean of the 10 points is (cf. Equations 2 and 3)

$$
\bar{X} = \begin{bmatrix} \bar{x}_1 \\ \bar{x}_2 \end{bmatrix} = \begin{bmatrix} 0 \\ 0 \end{bmatrix}
\tag{A2}
$$

The covariance matrix $S$ is (cf. Equations 7 and 8)

$$
S = \begin{bmatrix} s_{1,1} & s_{1,2} \\ s_{2,1} & s_{2,2} \end{bmatrix} = \begin{bmatrix} 1 & 0.9 \\ 0.9 & 1 \end{bmatrix}
\tag{A3}
$$

from which it follows that the inverse matrix of $S$ is

$$
S^{-1} = \frac{1}{0.19} \begin{bmatrix} 1 & -0.9 \\ -0.9 & 1 \end{bmatrix}
\tag{A4}
$$

Given $S^{-1}$, the normal distribution density is generally given by

$$
\begin{aligned}
f(x_1, x_2) &= \frac{1}{2\pi\sqrt{|S|}} \exp\left\{ -\frac{1}{2 \times 0.19} \begin{bmatrix} x_1 & x_2 \end{bmatrix} \begin{bmatrix} 1 & -0.9 \\ -0.9 & 1 \end{bmatrix} \begin{bmatrix} x_1 \\ x_2 \end{bmatrix} \right\} \\
&= \frac{1}{2\pi\sqrt{0.19}} \exp\left\{ -\frac{1}{2 \times 0.19} \left( x_1^2 - 1.8\,x_1 x_2 + x_2^2 \right) \right\}
\end{aligned}
\tag{A5}
$$

Now suppose we have two points $A$ and $B$ whose coordinates in the 2-D space are given by

$$
A = \begin{bmatrix} 1 \\ 1 \end{bmatrix}, \qquad B = \begin{bmatrix} 1 \\ -1 \end{bmatrix}
\tag{A6}
$$

whose Euclidian distances to the central point $\bar{X} = (0, 0)$ are the same, i.e. both equal to $\sqrt{2}$. However, according to Equation A5, the distribution densities at these two points are, respectively,

$$
f(A) = f(1, 1) = 0.216, \qquad f(B) = f(1, -1) = 0.0000166
\tag{A7}
$$

indicating that, from the viewpoint of classification, point $A$ should be much closer to the central point $(0, 0)$ than point $B$, because the distribution density $f(\bar{X}) = f(0, 0)$ at the central point is the maximum. Such a fact cannot be reflected by their Euclidian distances to the central point, but can be


well elucidated by the Mahalanobis distance as follows. The Mahalanobis distance between point $A$ and $\bar{X}$ is (cf. Equation 9)

$$
D(A, \bar{X}) = \sqrt{(A - \bar{X})^{T} S^{-1} (A - \bar{X})} = \sqrt{\frac{0.2}{0.19}} \approx 1.03
\tag{A8}
$$

and that between point $B$ and $\bar{X}$ is

$$
D(B, \bar{X}) = \sqrt{(B - \bar{X})^{T} S^{-1} (B - \bar{X})} = \sqrt{\frac{3.8}{0.19}} \approx 4.47
\tag{A9}
$$

Therefore, the ratio of these two Mahalanobis distances is

$$
\frac{D(B, \bar{X})}{D(A, \bar{X})} \approx 4.4
\tag{A10}
$$

indicating that the Mahalanobis distance from $B$ to $\bar{X}$ is more than four times that from $A$ to $\bar{X}$, although their corresponding Euclidian distances are the same. According to the Mahalanobis distance, point $A$ is much closer to the central point $\bar{X}$ than point $B$, and hence the distribution density at point $A$ is much higher than that at $B$. Therefore, in classifying a set of statistical data that generally follow a normal distribution, it would better reflect the essential aspect of the problem to use the Mahalanobis distance as a scale to measure the similarity.
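The numbers in this appendix example can be checked with a few lines of Python. The sketch below is only an illustration and not the authors' program: it takes the covariance matrix from Equation A3 as given, assumes that Equation 9 has the usual square-root form $D(P, Q) = \sqrt{(P - Q)^{T} S^{-1} (P - Q)}$, and uses hypothetical helper names (`quadratic_form`, `mahalanobis`, `euclidean`, `density`) chosen for this example.

```python
import math

# Covariance matrix S taken from Equation A3; its determinant is 0.19.
S = [[1.0, 0.9], [0.9, 1.0]]
DET_S = S[0][0] * S[1][1] - S[0][1] * S[1][0]

# Inverse of a 2x2 matrix, matching Equation A4.
S_INV = [[S[1][1] / DET_S, -S[0][1] / DET_S],
         [-S[1][0] / DET_S, S[0][0] / DET_S]]


def quadratic_form(p, q):
    """Return (p - q)^T S^-1 (p - q) for two 2-D points."""
    d0, d1 = p[0] - q[0], p[1] - q[1]
    return (d0 * (S_INV[0][0] * d0 + S_INV[0][1] * d1)
            + d1 * (S_INV[1][0] * d0 + S_INV[1][1] * d1))


def mahalanobis(p, q):
    """Mahalanobis distance, assuming the square-root form of Equation 9."""
    return math.sqrt(quadratic_form(p, q))


def euclidean(p, q):
    """Ordinary Euclidian distance between two 2-D points."""
    return math.hypot(p[0] - q[0], p[1] - q[1])


def density(p, mean=(0.0, 0.0)):
    """Bivariate normal density of Equation A5."""
    norm = 2.0 * math.pi * math.sqrt(DET_S)
    return math.exp(-0.5 * quadratic_form(p, mean)) / norm


# Points A and B of Equation A6 and the central point of Equation A2.
A, B, CENTER = (1.0, 1.0), (1.0, -1.0), (0.0, 0.0)

if __name__ == "__main__":
    print("Euclidian  :", euclidean(A, CENTER), euclidean(B, CENTER))
    print("Density    :", f"{density(A):.3f}", f"{density(B):.3g}")
    print("Mahalanobis:", f"{mahalanobis(A, CENTER):.2f}",
          f"{mahalanobis(B, CENTER):.2f}")
    print("Ratio      :", f"{mahalanobis(B, CENTER) / mahalanobis(A, CENTER):.2f}")
```

Under these assumptions the two Euclidian distances coincide at $\sqrt{2}$, the densities come out at about 0.216 and 1.66e-05, and the ratio of the Mahalanobis distances is $\sqrt{19} \approx 4.36$, in line with the rounded value 4.4 quoted above.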

REFERENCES

Chou, K. C. & Zhang, C. T. (1993) J. Protein Chem. 12, 169-178
Chou, P. Y. (1980) Amino acid composition of four classes of proteins, in Abstracts of Papers, Part I, Second Chemical Congress of the North American Continent, Las Vegas
Chou, P. Y. (1989) in Prediction of Protein Structure and the Principles of Protein Conformation (Fasman, G. D., ed) pp. 549-586, Plenum Press, New York
Cohen, B. I., Presnell, S. R. & Cohen, F. E. (1993) Protein Sci. 2, 2134-2145
DeGroot, M. H. (1986) Probability and Statistics, 2nd ed., pp. 267-278, Addison-Wesley Publishing Co., Reading, MA
Deleage, G. & Roux, B. (1989) in Prediction of Protein Structure and the Principles of Protein Conformation (Fasman, G. D., ed) pp. 587-597, Plenum Press, New York
Dubchak, I., Holbrook, S. R. & Kim, S. H. (1993) Proteins 16, 79-91
Kikuchi, T. (1993) J. Protein Chem. 12, 515-523
Klein, P. (1986) Biochim. Biophys. Acta 874, 205-215
Klein, P. & Delisi, C. (1986) Biopolymers 25, 1659-1672
Levitt, M. & Chothia, C. (1976) Nature 261, 552-558
Mahalanobis, P. C. (1936) Proc. Natl. Inst. Sci. India 2, 49-55
Metfessel, B. A., Saurugger, P. N., Connelly, D. P. & Rich, S. S. (1993) Protein Sci. 2, 1171-1182
Mao, B., Chou, K. C. & Zhang, C. T. (1994) Protein Eng. 7, 319-330
Muskal, S. M. & Kim, S. H. (1992) J. Mol. Biol. 225, 713-727
Nakashima, H., Nishikawa, K. & Ooi, T. (1986) J. Biochem. 99, 153-162
Pillai, K. C. S. (1985) in Encyclopedia of Statistical Sciences (Kotz, S., and Johnson, N. L., eds) Vol. 5, pp. 176-181, John Wiley & Sons, New York
Richardson, J. S. & Richardson, D. C. (1989) in Prediction of Protein Structure and the Principles of Protein Conformation (Fasman, G. D., ed) pp. 1-98, Plenum Press, New York
Zhang, C. T. & Chou, K. C. (1992a) Protein Sci. 1, 401-408
Zhang, C. T. & Chou, K. C. (1992b) Biophys. J. 63, 1523-1529