IL NUOVO CIMENTO
VOL. 3 D, N. 2
February 1984
A Statistical Method for Predicting Alpha-Helical and Beta-Sheet Regions in Proteins from their Amino Acidic Sequences.

V. CUOMO
Istituto di Fisica della Facoltà d'Ingegneria dell'Università - Napoli

M. F. MACCHIATO
Istituto di Fisica Sperimentale della Facoltà di Scienze dell'Università - Napoli

A. TRAMONTANO
Istituto Internazionale di Genetica e Biofisica - Napoli

(received 6 July 1983)
Summary. - In this paper we propose a new method to predict the secondary structure of proteins from sequence data. A satisfactory improvement of the available efficiency of prediction is obtained. The described method takes into account the frequency of each pair of amino acids in alpha-helical, beta-sheet and random coil regions, according to previous results that the sequences of amino acidic residues in these regions are autocorrelated. The rules of the method are not derived from the analysis of the regions of proteins with a known secondary structure, but they are instead based on statistical considerations. In such a way the obtained value of efficiency of the method (88%) has a high reliability: in fact, it is correct to test a method only on the data not used to construct it. A new definition of efficiency of a predictive method is given to resolve the ambiguities arising from the previously accepted definitions.
PACS. 87.10. - General, theoretical and mathematical biophysics (including logic of biophysics, quantum biology and relevant aspects of thermodynamics, information theory, cybernetics and bionics).
1. - Introduction.
Since experimental evidence has shown that the secondary structure of proteins is overall determined by their amino acidic sequence (1), many attempts have been made to predict the spatial configuration from their primary structure (2-8). It is generally assumed that the alpha-helical and beta-sheet configurations are determined by short- and medium-range interactions between amino acidic residues. If one only analyses the recurrence of some doublets (4,5), triplets (3) and so on (2,7,8), the correlation of the whole amino acidic sequence is neglected. We have demonstrated in a previous paper (9) that the sequence of amino acidic residues in a protein is autocorrelated and that the order of correlation is at least 2; in other words, the probability that the i-th amino acid belongs to a certain type depends at least on the amino acid that precedes it, and this applies to all i in the sequence. Since the available data establish that the autocorrelation is at least of second order (9), we have analysed the spatial configuration of proteins using the frequencies of each pair of residues in alpha-helical, beta-sheet and random coil regions.
Furthermore, the methods used up to now to predict alpha-helical, beta-sheet and random coil regions of proteins rely on the sequences of proteins with a well-known secondary structure, the number of which is quite small. These sequences are used not only to find the probability of presence of each amino acid in these regions, as is unavoidable, but also to find the rules that, by means of this probability, locate alpha, beta and random coil regions in proteins. In this manner the number of proteins not used to search for the rules, and therefore available to test them, is really low, it being incorrect to test a method on the data used to construct it. The rules developed in this paper are instead based on statistical considerations and, therefore, the number of proteins used to test them is comparable with the number of proteins whose secondary structure is known.
(1) C. B. ANFINSEN, E. HABER, M. SELA and F. H. WHITE: Proc. Natl. Acad. Sci. USA, 47, 1309 (1961).
(2) K. NAGANO: J. Mol. Biol., 75, 401 (1973).
(3) V. I. LIM: J. Mol. Biol., 88, 857 (1974).
(4) V. I. LIM: J. Mol. Biol., 88, 873 (1974).
(5) P. Y. CHOU and G. D. FASMAN: Biochemistry, 13, 211 (1974).
(6) P. Y. CHOU and G. D. FASMAN: Biochemistry, 13, 222 (1974).
(7) P. Y. CHOU and G. D. FASMAN: Adv. Enzymol., 47, 45 (1978).
(8) R. J. GARNIER, D. J. OSGUTHORPE and B. J. ROBSON: J. Mol. Biol., 120, 97 (1978).
(9) M. F. MACCHIATO and A. TRAMONTANO: Lett. Nuovo Cimento, 37, 89 (1983).
2. - Methods.
The first step in the definition of a predictive algorithm that takes into account the occurrence of pairs of amino acids is the computation of the probability of finding each pair of amino acids, and each amino acid, in alpha-helical, beta-sheet and random coil regions of proteins. As is well known, when the number of trials is large, frequency approximates probability. Thus we need the frequency of each pair of amino acids in alpha, beta and random coil regions of the analysed proteins. The number of available proteins for which both the sequence and the secondary structure are known is a little more than 40 (7), so our sample of proteins, listed in table I, is a significant part of this population. We can therefore obtain an approximate value of the probability of presence of each amino acid, or of each pair of amino acids, in alpha-helical, beta-sheet and random coil regions of proteins by simply counting the number of occurrences of each amino acid, or of each pair of amino acids, in these regions of the proteins of the chosen sample.

With a Pascal program run on the 1108 Univac computer of the Naples University we have computed the frequencies of each amino acid in alpha, beta and random coil regions of these proteins. These frequencies, called, respectively, alpha, beta and random coil single potentials, are reported in table II. Thus the amino acidic sequence of a protein can be identified with three sequences of numbers by associating to each amino acid its alpha, beta and random coil single potentials. These three sequences will be called henceforth, respectively, alpha, beta and random coil single-potential sequences.
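The counting step described above can be sketched in a few lines. This is only an illustrative sketch, not the authors' Pascal program: the function name `single_potentials`, the input layout (a list of sequence/assignment string pairs) and the one-character state codes `a`, `b`, `c` for alpha, beta and random coil are all assumptions introduced here.

```python
from collections import Counter

# One-letter codes in the order used by table II.
AMINO_ACIDS = "ARNDCQEGHILKMFPSTWYV"


def single_potentials(proteins):
    """Return {state: {amino_acid: frequency}} for states 'a', 'b', 'c'.

    `proteins` is an iterable of (sequence, assignment) string pairs of
    equal length; assignment[k] is the (hypothetical) state code of
    residue k: 'a' = alpha, 'b' = beta, 'c' = random coil.
    """
    counts = {state: Counter() for state in "abc"}
    for sequence, assignment in proteins:
        if len(sequence) != len(assignment):
            raise ValueError("sequence and assignment lengths differ")
        for residue, state in zip(sequence, assignment):
            counts[state][residue] += 1

    potentials = {}
    for state, counter in counts.items():
        total = sum(counter.values())
        # Frequency of each amino acid within the region type, so that
        # each column sums to 1 (as the columns of table II roughly do).
        potentials[state] = {
            aa: (counter[aa] / total if total else 0.0) for aa in AMINO_ACIDS
        }
    return potentials


# Toy input, not real data.
toy = [("AAGV", "aabc"), ("AG", "ab")]
toy_potentials = single_potentials(toy)
```

All proteins enter the counts with equal weight, in line with the caption of table I.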
With these series of numbers we can define for each protein, respectively, the values Ma, Mb and Mc as the arithmetical means of the alpha, beta and random coil single potentials and the values sa, sb and sc as the standard deviations of the three series of numbers.

We have then ordered the amino acids from 1 to 20 and have constructed, by means of another Pascal program, three matrices, one for each region, in which at the point ij there is the observed frequency of the pair composed of the amino acid in row i and the amino acid in column j, where i and j run from 1 up to 20. These numbers, called pair potentials, are reported in table III. We have at first verified that the three matrices are different from the matrix one would have if the distribution of the amino acids in the proteins were random. For this purpose, we have constructed the matrix with each element equal to 1000/400 and we have carried out a χ²-test between each of the three matrices of the frequencies and this uniform matrix. The obtained value of χ² ensures that the three matrices are statistically different from the random one.
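A minimal sketch of the pair-potential matrix and of the comparison with the uniform matrix is given below. Several points are assumptions, since the text does not specify them: a pair is counted only when two consecutive residues carry the same state code, the matrix is normalized so that its 400 elements sum to 1000 (which makes the uniform matrix equal to 1000/400 everywhere), and the statistic is the plain sum of (observed - expected)²/expected. The function names and the state codes are hypothetical, and the significance level is not evaluated here.

```python
AMINO_ACIDS = "ARNDCQEGHILKMFPSTWYV"
INDEX = {aa: k for k, aa in enumerate(AMINO_ACIDS)}


def pair_matrix(proteins, state, scale=1000.0):
    """20x20 matrix of pair frequencies for one region type.

    Element [i][j] is the frequency of amino acid i followed by amino
    acid j, counted only where both residues are assigned to `state`
    (an assumption), and normalized so that all elements sum to `scale`.
    """
    n = len(AMINO_ACIDS)
    counts = [[0] * n for _ in range(n)]
    for sequence, assignment in proteins:
        for k in range(len(sequence) - 1):
            if assignment[k] == state and assignment[k + 1] == state:
                counts[INDEX[sequence[k]]][INDEX[sequence[k + 1]]] += 1
    total = sum(sum(row) for row in counts)
    if total == 0:
        raise ValueError("no pairs found for state %r" % state)
    return [[scale * c / total for c in row] for row in counts]


def chi_square_vs_uniform(matrix):
    """Chi-square statistic of a matrix against the uniform matrix
    having the same total (1000/400 per element when the total is 1000)."""
    cells = [value for row in matrix for value in row]
    expected = sum(cells) / len(cells)
    return sum((value - expected) ** 2 / expected for value in cells)


# Toy input, not real data: the pairs AA and AG, both in an alpha region.
toy_matrix = pair_matrix([("AAG", "aaa")], "a")
toy_chi2 = chi_square_vs_uniform(toy_matrix)
```

The same statistic, applied element by element to two observed matrices instead of one observed and one uniform matrix, would be one way to compare the three regions with each other.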
TABLE I. - Proteins used for the statistical analysis. The sequences and the alpha-helix and beta-sheet assignments are stored in the data bank constructed on the 1100/80 UNIVAC system of the University of Naples. All proteins have been used in the analysis with equal weight.

Protein                    Species         References
Calcium binding protein    carp            (10)
Carbonic anhydrase B       human           (11)
Carboxypeptidase A         bovine          (12)
α-chymotrypsin             bovine          (13)
Concanavalin A             jack bean       (14)
Cytochrome b5              bovine          (15)
Cytochrome c               horse           (16)
Elastase                   porcine         (17)
Erythrocruorin             chironomus      (18)
Ferredoxin                 P. aerogenes    (19)
Glucagon                   pig             (20)
α-hemoglobin               horse           (21)
β-hemoglobin               horse           (22)
Hemoglobin                 lamprey         (23)
Insulin A                  porcine         (24)
(10) C. E. NOCKOLDS, R. H. KRETSINGER, C. J. COFFEE and R. A. BRADSHAW: Proc. Natl. Acad. Sci. USA, 69, 581 (1972).
(11) K. K. KANNAN: Proc. Natl. Acad. Sci. USA, 72, 51 (1975).
(12) F. A. QUIOCHO and W. N. LIPSCOMB: Adv. Protein Chem., 25, 1 (1971).
(13) J. J. BIRKTOFT and D. M. BLOW: J. Mol. Biol., 68, 187 (1972).
(14) A. JACK, J. WEINZIERL and A. J. KALB: J. Mol. Biol., 58, 389 (1971).
(15) F. S. MATTHEWS, P. ARGOS and M. LEVINE: Cold Spring Harbor Symp. Quant. Biol., 36, 387 (1971).
(16) T. ASHIDA, T. UEKI, A. TSUKIHARA, T. TAKANO and M. KAKUDO: J. Biochem., 70, 913 (1971).
(17) D. M. SHOTTON and H. C. WATSON: Philos. Trans. R. Soc. London, 257, 111 (1970).
(18) D. M. SHOTTON and B. S. HARTLEY: Nature (London), 225, 802 (1970).
(19) E. T. ADMAN, L. C. SIEKER and L. H. JENSEN: J. Biol. Chem., 248, 3987 (1973).
(20) P. Y. CHOU and G. D. FASMAN: Fed. Eur. Biochem. Soc. Meet. Proc., 128, 13 (1977).
(21) M. F. PERUTZ, M. G. ROSSMAN, A. F. CULLIS, H. MUIRHEAD, G. WILL and A. C. T. NORTH: Nature (London), 185, 416 (1960).
(22) M. F. PERUTZ, H. MUIRHEAD, J. M. COX and L. C. G. GOAMAN: Nature (London), 219, 131 (1968).
(23) W. E. LOVE, P. A. KLOCK, E. E. LATTMAN, E. A. PADLAN, K. B. WARD jr. and W. A. HENDRICKSON: Cold Spring Harbor Symp. Quant. Biol., 36, 349 (1971).
(24) T. L. BLUNDELL, J. F. CUTFIELD, E. J. DODSON, G. G. DODSON, D. C. HODGKIN and D. A. MERCOLA: Cold Spring Harbor Symp. Quant. Biol., 36, 233 (1971).
TABLE I. - (continued)

Protein                         Species                   References
Lactate dehydrogenase           dogfish                   (25)
Lysozyme                        duck egg white            (26)
Myogen                          carp                      (10)
Myohemerythrin                  carp                      (27)
Myoglobin                       sperm whale               (28)
Nuclease                        S. aureus                 (29)
Pancreatic trypsin inhibitor    bovine                    (30)
Papain                          papaya                    (31)
Ribonuclease S                  bovine                    (32)
Rubredoxin                      C. pasteurianum           (33)
Subtilisin BPN                  B. amyloliquefaciens      (34)
Superoxide dismutase            bovine                    (35)
Thermolysin                     B. thermoproteolyticus    (36)
Thioredoxin                     E. coli                   (37)
Triose phosphate isomerase      chicken muscle            (38)
(25) M. J. ADAMS, G. C. FORD, A. LILJAS and M. G. ROSSMAN: Biochem. Biophys. Res. Commun., 53, 46 (1973).
(26) T. IMOTO, L. N. JOHNSON, A. C. T. NORTH, D. C. PHILLIPS and J. A. RUPLEY: The Enzymes, edited by P. D. BOYER, 3rd edition, Vol. 7 (1972), p. 665.
(27) W. A. HENDRICKSON and K. B. WARD: Biochem. Biophys. Res. Commun., 66, 1349 (1975).
(28) J. C. KENDREW, R. E. DICKERSON, B. E. STRANDBERG, R. G. HART, D. R. DAVIES, D. C. PHILLIPS and V. C. SHORE: Nature (London), 185, 422 (1960).
(29) F. A. COTTON, C. J. BIER, V. W. DAY, E. E. HAZEN and S. LARSEN: Cold Spring Harbor Symp. Quant. Biol., 36, 243 (1971).
(30) R. HUBER, D. KUKLA, A. RUHLMANN, O. EPP and H. FORMANEK: Naturwissenschaften, 57, 389 (1970).
(31) J. DRENTH, J. N. JANSONIUS, R. KOEKOEK and B. G. WOLTHERS: Adv. Protein Chem., 25, 79 (1971).
(32) H. W. WYCKOFF, D. TSERNOGLOU, A. W. HANSON, J. R. KNOX, B. LEE and F. M. RICHARDS: J. Biol. Chem., 245, 305 (1970).
(33) K. D. WATENPAUGH, L. C. SIEKER, J. R. HERRIOTT and L. H. JENSEN: Cold Spring Harbor Symp. Quant. Biol., 36, 359 (1971).
(34) J. DRENTH, W. G. HOL, J. N. JANSONIUS and R. KOEKOEK: Cold Spring Harbor Symp. Quant. Biol., 36, 107 (1971).
(35) J. S. RICHARDSON, K. A. THOMAS and D. C. RICHARDSON: Biochem. Biophys. Res. Commun., 63, 286 (1975).
(36) P. M. COLMAN, J. N. JANSONIUS and B. W. MATTHEWS: J. Mol. Biol., 70, 701 (1972).
(37) A. HOLMGREN, B. O. SODERBERG, H. EKLUND and C. I. BRANDEN: Proc. Natl. Acad. Sci. USA, 72, 2305 (1975).
(38) D. W. BANNER, A. C. BLOOMER, G. A. PETSKO, D. C. PHILLIPS, C. I. POGSON, J. A. WILSON, P. H. CORRAN, A. J. FURTH, J. D. MILMAN, R. E. OFFORD, J. D. PRIDDLE and S. G. WALEY: Nature (London), 255, 609 (1975).
TABLE II. - Frequency of each amino acid in alpha-helical, beta-sheet and random coil regions of the analysed proteins. The letters on the left are the one-letter codes for the amino acids (39).

Amino acid    P(α)     P(β)     P(r)
A             0.116    0.070    0.094
R             0.023    0.020    0.031
N             0.042    0.070    0.050
D             0.063    0.058    0.057
C             0.015    0.038    0.026
Q             0.031    0.038    0.035
E             0.063    0.035    0.036
G             0.073    0.093    0.091
H             0.033    0.020    0.019
I             0.039    0.058    0.050
L             0.091    0.045    0.071
K             0.087    0.093    0.070
M             0.017    0.013    0.012
F             0.040    0.020    0.039
P             0.039    0.038    0.046
S             0.062    0.093    0.086
T             0.050    0.073    0.062
W             0.015    0.020    0.009
Y             0.027    0.043    0.034
V             0.076    0.063    0.082
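As a small worked illustration of the single-potential sequences and of the quantities Ma and sa defined in the Methods, the sketch below uses the alpha-helix frequencies of four residues copied from table II. The peptide `ALGV` and the function name are hypothetical, and the use of the population standard deviation (`statistics.pstdev`) is an assumption, since the text does not say which form of the standard deviation is meant.

```python
from statistics import mean, pstdev

# Alpha single potentials for four residues, copied from table II.
ALPHA_POTENTIAL = {"A": 0.116, "G": 0.073, "L": 0.091, "V": 0.076}


def single_potential_sequence(sequence, potential):
    """Replace each residue by its single potential."""
    return [potential[residue] for residue in sequence]


peptide = "ALGV"  # hypothetical short sequence, for illustration only
alpha_sequence = single_potential_sequence(peptide, ALPHA_POTENTIAL)

m_a = mean(alpha_sequence)    # arithmetical mean Ma of the alpha sequence
s_a = pstdev(alpha_sequence)  # standard deviation sa (population form assumed)
```

For this toy peptide the alpha single-potential sequence is 0.116, 0.091, 0.073, 0.076, whose mean is 0.356/4 = 0.089. The beta and random coil sequences are obtained in the same way from the other two columns.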
It is also necessary to verify that the three matrices are different from each other in order to be sure that alpha, beta and random coil regions have different statistical characteristics. Another χ²-test ensures that this assumption is true. The next question is whether the matrices are symmetrical, that is to say whether the element in the place ij is, on the average, equal to the element in the place ji. A further χ²-test has been carried out between the group of the elements in the places ij for i = 1-20 and j