Chapter 4 From Sequence to Structure
Traditional biology has proceeded from morphology to cytology and then to the atomic and molecular level, from physiology to microscopic regulation, and from phenotype to genotype. With the founding of theoretical biophysics and the introduction of the concepts and methods of theoretical physics, it has become possible to establish a reverse biology. About 15 years ago, when my interest turned to the life sciences, I was puzzled for a long time by the role of theory in biology, since biology has generally been regarded as an experimental science. I asked whether there exists a theoretical biology that stands to the life sciences as theoretical physics stands to the physical sciences. In an essay published in 1988, I suggested a line of reverse biology: code-sequence-conformation-dynamics. “The reverse biology begins with research on gene and then goes deep from molecular sequence to molecular conformation and from structure to function. In the meantime, it sets about with an unifying principle and extensively used mathematical tools to quantitatively clarify the ever changing phenomena of life” [Luo, 1988]. I was excited to see a similar view published later by the well-known biologist W. Gilbert. He said, “The new paradigm, now emerging, is that all the genes will be known and the starting point of a biological investigation will be theoretical. An individual scientist will begin with a theoretical conjecture, only then turning to experiment to follow or test that hypothesis.” [Gilbert, 1991].

In this reversed line, structure plays the central role that connects biological function with genetic information. How does a biological molecule find its low-energy state in a time shorter than geological time scales? This is the so-called Levinthal paradox (1968). Even more paradoxical is how the spontaneous folding of a protein can occur through a process of self-organization. This paradox was emphasized by Delbrück.
He wrote: “You might as well say that the resolution of this paradox by the reduction in dimensionality from 3-dimensional continuous to 1-dimensional discrete in the genesis of proteins is a new law of physics and one nobody could have pulled out of quantum mechanics without first having seen it in operation.” [Wolynes, 1996]

Based on an experiment on the formation of native ribonuclease during the oxidation of the reduced polypeptide chain, Anfinsen (1973) pointed out an important principle: the amino acid sequence determines the native, biologically active structure of a protein. Biologists then proposed “sequence first, then structure” and tried to find a prediction rule. On the application side, genetic engineering has developed to the stage where novel man-made amino acid sequences can be designed with expected properties. Since the biological function of a protein is generally determined by its molecular conformation, one may meet a functional requirement by modifying a protein, producing new proteins that do not exist in nature through directed mutation of amino acids at given sites. Directed mutation, however, is not only a problem of experimental technique but also a profound theoretical problem. One key is to discover the laws that relate sequence to conformation. How does hereditary information determine protein structure? We feel that there are two key points: first, how the protein secondary structure is determined by the amino acid sequence, possibly further influenced by the mRNA sequence; and second, how the framework structure of a protein is determined by its sequence of secondary structures. We shall discuss these points in this chapter.

For nucleic acids, X-ray diffraction analysis indicates that local deviations of the DNA double helix generally occur as A-form (right-handed), B-form (right-handed) and Z-form (left-handed) secondary structures.
Moreover, they have several kinds of topological structures. DNA curvature and flexibility have been widely studied in recent work. The interaction between cis and trans elements is one of the key points in understanding the laws of gene expression. The local deviations of the double helix and the topological structures of DNA play an important role in the recognition of particular sites by repressors, regulatory proteins and many kinds of enzymes. In principle, however, the information on these DNA structures should be stored in the base sequence. So, finding the law governing the relationship between base sequence and DNA structure is also of special importance, as important as the rule by which the amino acid sequence determines protein structure. In the last section of this chapter, we shall give a preliminary discussion of the relationship between DNA curvature and flexibility and the base sequence.
§4.1 Statistical mechanical approach to the empirical prediction of protein secondary structure [Luo, 1993]
Secondary structure is the basis of the spatial structure of a protein. Chou and Fasman were the first to formulate an approach to the empirical prediction of protein secondary structure from the amino acid sequence; its accuracy was about 50% or somewhat higher. After thirty years of effort, the prediction accuracy of the various methods still hovers at 70% to 75%. So the dinosaurs of secondary structure prediction are still alive [Rost & Sander, 2000]. To improve the empirical prediction, we suggest reformulating the prediction method in the scheme of the Zimm-Bragg model, so that the statistical mechanical approach affords a sound basis for the empirical prediction.

Using the Ising model, Zimm and Bragg discussed the helix-coil transition of a homogeneous peptide chain and deduced the cooperative effect [Schulz & Schirmer, 1979]. For real proteins, one should consider a model of an inhomogeneous chain and more than three conformational states. In addition, the correlation forces between adjacent residues and the hydrogen-bond interactions (and other forces) between non-neighboring residues should be taken into account simultaneously. Generalizing the ideal Zimm-Bragg (Z-B) model to real peptide chains, we can formulate a theoretical basis for the prediction of protein secondary structure.

With bond lengths and bond angles held fixed, there are only three types of internal degrees of freedom: the usual backbone dihedral angles $\phi$ and $\psi$, and the dihedral angle $\chi$, defined as the angle about the peptide bond preceding the residue. The partition function of the peptide chain is

$$Z = \mathrm{const} \int \exp\left[-\frac{E(\phi_1\psi_1\chi_1 \cdots \phi_N\psi_N\chi_N)}{RT}\right] d\phi_1 \cdots d\chi_N \qquad (4.1.1)$$
Assume

$$E = \sum_{i=1}^{N} \varepsilon_i(\phi_i\psi_i\chi_i) + \sum_{j=1}^{N-1} V_{j,j+1}(\phi_j\psi_j\chi_j,\ \phi_{j+1}\psi_{j+1}\chi_{j+1}) + \sum_{j=1}^{N-4} U_{j,j+4}(\phi_j\psi_j\chi_j,\ \phi_{j+4}\psi_{j+4}\chi_{j+4}) \qquad (4.1.2)$$
where the three terms correspond to the single-peptide energy, the dipeptide energy and the hydrogen-bond energy in the alpha-helix structure, respectively. For non-neighboring interactions, only the hydrogen-bond energy in the alpha helix has been written in (4.1.2), as a representative. The generalization to the real case of more non-neighboring interactions is direct. Assuming that the variation of $\varepsilon_i$, $V_{j,j+1}$ etc. is small within the regions allowed by steric clashes [Robson and Garnier, 1986], we have

$$Z = \mathrm{const}\,(Z_1^{\alpha} + Z_1^{\beta} + Z_1^{c})(Z_{12}^{\alpha\alpha} + Z_{12}^{\beta\beta} + Z_{12}^{cc} + Z_{12}^{\alpha\beta} + Z_{12}^{\beta\alpha} + Z_{12}^{\beta c} + Z_{12}^{c\beta} + Z_{12}^{c\alpha} + Z_{12}^{\alpha c})(Z_{15}^{\alpha\alpha} + \cdots) \times (Z_2^{\alpha} + Z_2^{\beta} + Z_2^{c}) \cdots \times (Z_N^{\alpha} + Z_N^{\beta} + Z_N^{c}) \qquad (4.1.3)$$
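where

$$Z_1^{\alpha} = \int_{(\phi_1\psi_1)\in\alpha} \exp\left[-\frac{\varepsilon_1(\phi_1\psi_1\chi_1)}{RT}\right] d\phi_1\, d\psi_1\, d\chi_1 \quad \text{etc.} \qquad (4.1.4)$$

$$Z_{12}^{\alpha\beta} = \frac{\displaystyle\int_{(\phi_1\psi_1)\in\alpha,\ (\phi_2\psi_2)\in\beta} \exp\left[-\frac{V_{12}(\phi_1\psi_1\chi_1\phi_2\psi_2\chi_2)}{RT}\right] d\phi_1\, d\psi_1\, d\chi_1\, d\phi_2\, d\psi_2\, d\chi_2}{\displaystyle\int_{(\phi_1\psi_1)\in\alpha,\ (\phi_2\psi_2)\in\beta} d\phi_1\, d\psi_1\, d\chi_1\, d\phi_2\, d\psi_2\, d\chi_2} \quad \text{etc.} \qquad (4.1.5)$$

$$Z_{15}^{\alpha\alpha} = \frac{\displaystyle\int_{(\phi_1\psi_1)\in\alpha,\ (\phi_5\psi_5)\in\alpha} \exp\left[-\frac{U_{15}(\phi_1\psi_1\chi_1\phi_5\psi_5\chi_5)}{RT}\right] d\phi_1\, d\psi_1\, d\chi_1\, d\phi_5\, d\psi_5\, d\chi_5}{\displaystyle\int_{(\phi_1\psi_1)\in\alpha,\ (\phi_5\psi_5)\in\alpha} d\phi_1\, d\psi_1\, d\chi_1\, d\phi_5\, d\psi_5\, d\chi_5}, \qquad Z_{15}^{\alpha\beta} = \cdots = Z_{15}^{cc} = 1 \qquad (4.1.6)$$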
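Let

$$Y_j^{\alpha} = Z_{j-1,j}^{\alpha\alpha}\, Z_j^{\alpha}\, Z_{j,j+1}^{\alpha\alpha} \quad (j = 1 \cdots N,\ Z_{01} = Z_{N,N+1} = 1) \quad \text{etc.} \qquad (4.1.7)$$

$$\sigma_{j,j+1}^{\alpha\beta} = \frac{Z_{j,j+1}^{\alpha\beta}}{Z_{j,j+1}^{\alpha\alpha}\, Z_{j,j+1}^{\beta\beta}} \quad (j = 1, \cdots, N-1) \quad \text{etc.} \qquad (4.1.8)$$

$$\lambda_{j,j+4}^{\alpha} = \begin{cases} Z_{j,j+4}^{\alpha\alpha} & \text{(if the } j\text{-th and the } (j+4)\text{-th residues are both } \alpha\text{)} \\ 1 & \text{(if } \beta \text{ or } c \text{ occurs at the } j\text{-th or the } (j+4)\text{-th site)} \end{cases} \qquad (4.1.9)$$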
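In the above equations “etc.” means cycling over the different secondary structures. The partition function can be transformed into

$$Z = \sum_{3^N\ \text{terms}} Y_1^{\alpha} \lambda_{15}^{\alpha} Y_2^{\alpha} Y_3^{\alpha} Y_4^{\alpha} Y_5^{\alpha}\, \sigma_{56}^{\alpha\beta}\, Y_6^{\beta} Y_7^{\beta} Y_8^{\beta}\, \sigma_{89}^{\beta c}\, Y_9^{c} \cdots Y_N^{c} \qquad (4.1.10)$$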
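(the typical term shown in the summation corresponds to the secondary structure (αααααβββc...c)). The above equation can be rewritten as

$$Z = (1,1,1) \begin{pmatrix} Y_1^{\alpha}\lambda_{15}^{\alpha} & & 0 \\ & Y_1^{\beta} & \\ 0 & & Y_1^{c} \end{pmatrix} \begin{pmatrix} Y_2^{\alpha}\lambda_{26}^{\alpha} & \sigma_{12}^{\alpha\beta} Y_2^{\beta} & \sigma_{12}^{\alpha c} Y_2^{c} \\ \sigma_{12}^{\beta\alpha} Y_2^{\alpha}\lambda_{26}^{\alpha} & Y_2^{\beta} & \sigma_{12}^{\beta c} Y_2^{c} \\ \sigma_{12}^{c\alpha} Y_2^{\alpha}\lambda_{26}^{\alpha} & \sigma_{12}^{c\beta} Y_2^{\beta} & Y_2^{c} \end{pmatrix} \cdots \begin{pmatrix} Y_N^{\alpha} & \sigma_{N-1,N}^{\alpha\beta} Y_N^{\beta} & \sigma_{N-1,N}^{\alpha c} Y_N^{c} \\ \sigma_{N-1,N}^{\beta\alpha} Y_N^{\alpha} & Y_N^{\beta} & \sigma_{N-1,N}^{\beta c} Y_N^{c} \\ \sigma_{N-1,N}^{c\alpha} Y_N^{\alpha} & \sigma_{N-1,N}^{c\beta} Y_N^{\beta} & Y_N^{c} \end{pmatrix} \begin{pmatrix} 1 \\ 1 \\ 1 \end{pmatrix} \qquad (4.1.11)$$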
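The equivalence between the sum over $3^N$ configurations in Eq (4.1.10) and the matrix product in Eq (4.1.11) can be checked numerically. The following is a minimal sketch, assuming that all hydrogen-bond factors λ are set to 1 and using randomly generated, purely illustrative values for Y and σ; the function names are hypothetical and not part of the original formulation.

```python
import itertools
import numpy as np

def partition_by_enumeration(Y, sigma):
    """Sum the weights of all 3**N configurations (Eq 4.1.10 with lambda = 1).

    Y[j, g]        : weight of residue j in state g (0 = alpha, 1 = beta, 2 = coil)
    sigma[j, g, d] : boundary factor between residue j (state g) and residue j+1
                     (state d), with sigma[j, g, g] = 1 (no boundary).
    """
    n = Y.shape[0]
    total = 0.0
    for conf in itertools.product(range(3), repeat=n):
        weight = Y[0, conf[0]]
        for j in range(1, n):
            weight *= sigma[j - 1, conf[j - 1], conf[j]] * Y[j, conf[j]]
        total += weight
    return total

def partition_by_transfer_matrix(Y, sigma):
    """Evaluate the same sum as a product of 3x3 matrices (Eq 4.1.11 with lambda = 1)."""
    vec = Y[0].copy()                            # (1,1,1) times diag(Y_1)
    for j in range(1, Y.shape[0]):
        transfer = sigma[j - 1] * Y[j][None, :]  # T[g, d] = sigma^{gd} * Y_j^d
        vec = vec @ transfer
    return float(vec.sum())                      # contraction with (1,1,1)^T

def random_parameters(n, seed=0):
    """Illustrative parameters only; not fitted to any protein data."""
    rng = np.random.default_rng(seed)
    Y = rng.uniform(0.5, 2.0, size=(n, 3))
    sigma = rng.uniform(0.05, 0.5, size=(n - 1, 3, 3))
    for j in range(n - 1):
        np.fill_diagonal(sigma[j], 1.0)          # same state on both sides: no boundary
    return Y, sigma

Y, sigma = random_parameters(8)
z_enum = partition_by_enumeration(Y, sigma)
z_tm = partition_by_transfer_matrix(Y, sigma)
print(z_enum, z_tm)
```

The enumeration costs $3^N$ operations while the matrix product costs only order N, which is why the matrix form is the practical one for real chain lengths.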
The $3^N$ terms in the summation of the partition function, Eq (4.1.10), represent the $3^N$ possible secondary structures, each term giving the relative probability of one structure. All statistical properties can be deduced from the partition function. For example, the average frequency of an α/β boundary with adjacent residues A and B on its two sides is

$$L_{AB}^{\alpha\beta} = \frac{\partial \ln Z}{\partial \ln \sigma_{AB}^{\alpha\beta}} \quad \text{etc.} \qquad (4.1.12)$$
The average frequency of hydrogen bonds between residues A and B in α helices is

$$M_{AB}^{\alpha} = \frac{\partial \ln Z}{\partial \ln \lambda_{AB}^{\alpha}} \qquad (4.1.13)$$

The parameter $Y_A^{\alpha}$ ($Y_A^{\beta}$, $Y_A^{c}$) corresponding to residue A depends not only on A but also on its upstream and downstream residues. If one neglects this dependence on adjacent residues, or takes an appropriate average over them, then the averaged conformation parameter $\bar{Y}_A^{\alpha}$ ($\bar{Y}_A^{\beta}$, $\bar{Y}_A^{c}$) is obtained. The average frequency of residue A occurring in the α conformation is

$$N_A^{\alpha} = \frac{\partial \ln Z}{\partial \ln Y_A^{\alpha}} \quad \text{etc.} \qquad (4.1.14)$$
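The statement that occupation numbers follow from logarithmic derivatives of Z can be illustrated on a short chain. The sketch below assumes λ = 1, a position-independent boundary factor, and arbitrary illustrative values of Y; it compares the sum over positions of ∂ln Z/∂ln Y^α, taken by central finite differences, with the mean number of α residues obtained by direct enumeration. All names are hypothetical.

```python
import itertools
import numpy as np

def log_z(Y, sigma):
    """ln Z from the transfer-matrix product, with lambda = 1."""
    vec = Y[0].copy()
    for j in range(1, Y.shape[0]):
        vec = vec @ (sigma * Y[j][None, :])
    return np.log(vec.sum())

def n_alpha_by_derivative(Y, sigma, h=1e-6):
    """Sum over positions of d ln Z / d ln Y_j^alpha (central differences)."""
    total = 0.0
    for j in range(Y.shape[0]):
        up, down = Y.copy(), Y.copy()
        up[j, 0] *= np.exp(h)        # shift ln Y_j^alpha by +h
        down[j, 0] *= np.exp(-h)     # shift ln Y_j^alpha by -h
        total += (log_z(up, sigma) - log_z(down, sigma)) / (2.0 * h)
    return total

def n_alpha_by_enumeration(Y, sigma):
    """Mean number of alpha residues, averaging over all 3**N configurations."""
    n = Y.shape[0]
    z = 0.0
    acc = 0.0
    for conf in itertools.product(range(3), repeat=n):
        weight = Y[0, conf[0]]
        for j in range(1, n):
            weight *= sigma[conf[j - 1], conf[j]] * Y[j, conf[j]]
        z += weight
        acc += weight * conf.count(0)   # state 0 is alpha
    return acc / z

rng = np.random.default_rng(1)
Y = rng.uniform(0.5, 2.0, size=(5, 3))   # illustrative values, not fitted data
sigma = np.full((3, 3), 0.1)             # assumed boundary factor
np.fill_diagonal(sigma, 1.0)

n_deriv = n_alpha_by_derivative(Y, sigma)
n_enum = n_alpha_by_enumeration(Y, sigma)
print(n_deriv, n_enum)
```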
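From (4.1.12) one has the total frequency of α/β boundaries

$$L^{\alpha\beta} = \sum_{AB} L_{AB}^{\alpha\beta} \quad \text{etc.} \qquad (4.1.15)$$

From (4.1.14) one has the total number of residues in α helices

$$N^{\alpha} = \sum_{A} \frac{\partial \ln Z}{\partial \ln Y_A^{\alpha}} \quad \text{etc.} \qquad (4.1.16)$$

So, the average length of an α helix is

$$d^{\alpha} = \frac{2 N^{\alpha}}{L^{\alpha\beta} + L^{\beta\alpha} + L^{\alpha c} + L^{c\alpha}} \quad \text{etc.} \qquad (4.1.17)$$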
Similar equations can be deduced for the β strand and the random coil. On the other hand, if hydrogen bonds are counted from the CO of the j-th residue to the NH of the (j+4)-th residue, then $\sum_B M_{AB}^{\alpha}$ and $\sum_B M_{BA}^{\alpha}$ are the hydrogen-bond-in-C-end frequency and the hydrogen-bond-in-N-end frequency of residue A, respectively. Furthermore,

$$f_A^{C} = \sum_B M_{AB}^{\alpha} \Big/ N_A^{\alpha} \qquad (4.1.18)$$

is the measure of the ability of residue A in an α helix to provide a hydrogen bond at the C end, and

$$f_A^{N} = \sum_B M_{BA}^{\alpha} \Big/ N_A^{\alpha} \qquad (4.1.19)$$

is the measure of the ability of residue A in an α helix to provide a hydrogen bond at the N end.

We have presented a theoretical formulation of protein secondary structure. In fact, apart from the energy terms given in Eq (4.1.2), other energies such as the hydrogen-bond energy in β strands should be taken into account. They lead to new factors in the partition function, but the theoretical formulation remains basically unchanged.

In the following we shall show that the statistical mechanical formulation of the peptide chain given above helps in understanding the physics behind the empirical prediction of protein secondary structure. First, each of the $3^N$ terms in Eq (4.1.10) represents one secondary structure configuration. For example, the typical term
written on the r.h.s. of Eq (4.1.10) denotes the relative probability of the configuration (αααααβββc...c). The largest term in the summation corresponds to the most probable configuration. This is what “the sequence determines the structure” means. However, the other terms in the summation should not, in general, be neglected: the most probable configuration is not the only one that can be realized. Historically, some authors proposed that the relationship between sequence and conformation can be represented by a set of logical rules which are unambiguous and without a probabilistic component. These logical rules are invoked to explain how the secondary structure is generated from the amino acid sequence, and may be said to constitute a deterministic generative grammar. However, the supposition that the sequence exclusively determines the structure has never been demonstrated. As databases grow, exceptions to deterministic rules keep appearing. We should therefore re-examine this point from basic principles. The statistical mechanical formulation above provides a clue. For example, for four consecutive residues A, B, C and D, if $Y_E^{\alpha} > Y_E^{\beta}$ (E = A, B, C, D), then the fragment preferentially takes the α rather than the β conformation. To be definite, assume $Y_E^{\alpha} \approx 2 Y_E^{\beta}$, so that the most probable conformation is the α helix. If the α-helix conformation of the four-peptide fragment is changed to β sheet, with the number of conformational boundaries unchanged, then the replacing term in the partition function is only $(1/2)^4 = 1/16$ of the most probable term. So, for the four-peptide fragment, the statement that sequence determines structure carries an ambiguity of at least 1/16. This shows that the relation between sequence and secondary structure is in essence a statistical one. From a definite peptide sequence, only a distribution of secondary structures can be deduced by the statistical mechanical approach. The full determination of the structure should depend on the self-organization properties of the system.

Second, the cooperative effect can be deduced from the above theory. The conformation probability is described by the boundary parameter $\sigma_{AB}^{\gamma\delta}$ ($\gamma, \delta = \alpha, \beta, c$) in addition to $Y_A^{\gamma}$. In Zimm-Bragg theory the conformational boundary factor is determined by a thermodynamic method, which gives $\sigma = 10^{-4}$ based on homogeneous-chain data. For real polypeptide chains, taking into account the helix-breaking factor, which shortens the helix, a statistical method gives $\sigma \approx 10^{-1}$ [Suzuki & Robson, 1976]. On the other hand, in energy calculations the result depends sensitively on the susceptibility of the medium, and the obtained σ ranges widely from $10^{-9}$ to $10^{-4}$ [Go et al, 1970]. It seems that the choice $\sigma \approx 10^{-1}$ is reasonable for polypeptide chains. A parameter σ < 1 means a tendency to eliminate boundaries. From statistical data on the frequency distribution of residue pairs occurring at conformational boundaries, we find that the frequencies of residue pairs differ by a factor of 5 to 10. So we tentatively estimate σ ≈ 0.05 to 0.5. With this estimate, decreasing the number of boundaries by one would cancel the probability reduction caused by changing 1 to 3~5 residues from the favorable conformation to an unfavorable one. This is the very reason why the minimal length of an α helix is 4 and that of a β strand is 3. Under the combined action of each residue taking its most favorable conformation and the number of boundaries being reduced as far as possible, a nucleus of some secondary structure forms and the structure extends in both directions along the sequence.

Third, hydrogen bonds, disulfide bonds and hydrophobic interactions play an important role in determining protein structure.
For example, the hydrogen-bond factor $\lambda_{j,j+4}^{\alpha}$ can be calculated from Eqs (4.1.9) and (4.1.6). Taking the hydrogen-bond energy to be -0.5 kcal/mol, and the angle between the CO of the j-th residue and the NH of the (j+4)-th residue to be random but smaller than 60 degrees, it is estimated that

$$Z_{j,j+4}^{\alpha\alpha} \approx e^{5/6} \times \frac{60}{180} + \frac{120}{180} \approx 1.5$$

So, the introduction of a hydrogen bond in the α conformation would increase the relative probability by a factor of 1.5. Of course, since hydrogen bonds other than those in the α helix also form during protein folding, the enhancement of the α helix relative to other conformations cannot be this large in reality. The disulfide bond energy is much higher than the hydrogen-bond energy. Both can greatly increase the stability of the protein structure. Besides, in studying tertiary structure one should take the hydrophobic interaction seriously. For a polypeptide chain, each residue gains on average a hydrophobic free energy of 2.6 kcal/mol in going from the extended to the folded state, which is several times the hydrogen-bond energy. So the hydrophobic interaction is of special importance in stabilizing protein tertiary structure.

Finally, from the expression for the partition function, Eq (4.1.10), we know that the probability of a definite conformation depends on the factors $Y_A$, $\sigma_{AB}$, $\lambda_{AB}$ etc. As seen from (4.1.7), $Y_j^{\gamma}$ ($\gamma = \alpha, \beta, c$) is not fully determined by the j-th residue itself but is also influenced by its upstream and downstream residues. This implies that, although the Chou-Fasman mono-peptide prediction is commonly used, the correlation of protein secondary structure with di-peptide and tri-peptide frequencies is stronger than that with mono-peptide frequencies (see next section). In the proposed formulation this point is easily understood, since the Chou-Fasman prediction is only a first-order, mean-field approximation, in which $Y_A^{\gamma}$ has been replaced by the averaged $\bar{Y}_A^{\gamma}$. In fact, the typical term in Eq (4.1.10) can be rewritten as

$$Z_1^{\alpha} \lambda_{15}^{\alpha} X_{12}^{\alpha\alpha} X_{23}^{\alpha\alpha} X_{34}^{\alpha\alpha} X_{45}^{\alpha\alpha} X_{56}^{\alpha\beta} X_{67}^{\beta\beta} X_{78}^{\beta\beta} \cdots$$

$$X_{j,j+1}^{\gamma\delta} = Z_j^{\gamma}\, Z_{j,j+1}^{\gamma\delta}\, Z_{j+1}^{\delta} \quad (\gamma, \delta = \alpha, \beta, c) \qquad (4.1.20)$$

which says that the dipeptide parameter $X_{j,j+1}$ can equally well be used to describe the conformational probability. Prediction based on dipeptide frequencies is generally better than Chou-Fasman’s [Luo & Dong, 1988].

In the above statistical mechanical approach there is a set of secondary structure parameters, namely,
$Y_A^{\gamma}$ ($A = 1 \cdots 20$; $\gamma = \alpha, \beta, c$), $\sigma_{AB}^{\gamma\delta}$ ($\gamma \neq \delta$) and $\lambda_{AB}^{\alpha}$. The parameters $Y_A^{\gamma}$ and $\sigma_{AB}^{\gamma\delta}$ can be replaced by $Z_A^{\gamma}$, $Z_{AB}^{\gamma\delta}$ ($A, B = 1 \cdots 20$; $\gamma, \delta = \alpha, \beta, c$), in which $Z_A^{\gamma}$ depends on a single residue and $Z_{AB}^{\gamma\delta}$ on two neighboring residues. In principle, these theoretical parameters can be calculated by integrating the mono-peptide and di-peptide energies over the appropriate conformations. In practice, it is not easy to calculate them ab initio. However, given a large enough database of protein secondary structures, they can be determined from frequency statistics. In other words, these theoretical parameters can be calculated from mono- and di-peptide frequencies. For example, there are 3660 theoretical parameters $Z_A^{\gamma}$, $Z_{AB}^{\gamma\delta}$ ($A, B = 1 \cdots 20$; $\gamma, \delta = \alpha, \beta, c$) when hydrogen bonds are neglected (20 × 3 = 60 single-residue parameters plus 400 × 9 = 3600 pair parameters). The mono-peptide relative frequency $p_A^{\gamma}$ ($= n_A^{\gamma} / \sum_{A\gamma} n_A^{\gamma} = n_A^{\gamma} / N$) is proportional to $Y_A^{\gamma}$, i.e.

$$p_A^{\gamma} = c \sum_{LR}^{20} Z_{LA}^{\gamma\gamma}\, Z_A^{\gamma}\, Z_{AR}^{\gamma\gamma}\, q_L^{\gamma} q_R^{\gamma} \qquad (4.1.21)$$

where $q_S^{\gamma} = n_S^{\gamma} / \sum_i n_i^{\gamma}$ ($S = L, R$) (the summation is taken over the 20 amino acids for given conformation γ) and c is determined by normalization. Likewise, the dipeptide relative frequency $p_{AB}^{\gamma\delta}$ ($= n_{AB}^{\gamma\delta} / \sum_{AB\gamma\delta} n_{AB}^{\gamma\delta} = n_{AB}^{\gamma\delta} / N_2$) satisfies

$$p_{AB}^{\gamma\delta} = c' \sum_{LR} Z_{LA}^{\gamma\gamma}\, Z_A^{\gamma}\, Z_{AB}^{\gamma\delta}\, Z_B^{\delta}\, Z_{BR}^{\delta\delta}\, q_L^{\gamma} q_R^{\delta} \qquad (4.1.22)$$
The theoretical parameters can be solved from the 3660 equations (4.1.21) and (4.1.22) once the mono-peptide and di-peptide frequencies have been obtained from the statistics. With these theoretical parameters as intermediates, the proposed framework provides an approach to predicting the most probable conformation. This is a semi-empirical approach to protein secondary structure prediction.

The ability of a residue in an α helix to provide hydrogen bonds is an important parameter for the prediction of secondary structure. The statistics based on Kabsch and Sander’s classic work (1983) are given in the following tables.

Table 4.1-1 Ability of a residue in α helix providing hydrogen-bonds-in-C-end

Pro 0.933    Trp 0.917    Asp 0.709    Glu 0.682    Asn 0.681
Ile 0.680    Met 0.672    Thr 0.667    Ala 0.663    Cys 0.655
Leu 0.645    Val 0.641    Gly 0.635    Arg 0.629    Phe 0.627
Ser 0.626    Gln 0.614    Tyr 0.549    His 0.543    Lys 0.535

Table 4.1-2 Ability of a residue in α helix providing hydrogen-bonds-in-N-end

Met 0.866    Tyr 0.802    Val 0.785    Leu 0.774    His 0.765
Lys 0.748    Phe 0.739    Ile 0.710    Arg 0.707    Gln 0.682
Ala 0.681    Cys 0.655    Thr 0.594    Asn 0.572    Gly 0.563
Ser 0.561    Trp 0.542    Glu 0.489    Asp 0.455    Pro 0.000

Table 4.1-3 Ability of a residue in α helix providing hydrogen bond

Met 1.537    Trp 1.458    Val 1.426    Leu 1.419    Ile 1.391
Tyr 1.352    Phe 1.366    Ala 1.344    Arg 1.336    Cys 1.310
His 1.309    Gln 1.295    Lys 1.283    Thr 1.261    Asn 1.254
Gly 1.197    Ser 1.187    Glu 1.172    Asp 1.164    Pro 0.933

(measured by the sum of the values in Tables 4.1-1 and 4.1-2)
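The relation between the three tables can be checked directly. The short sketch below takes the values of Tables 4.1-1 and 4.1-2, adds them residue by residue, and compares the result with Table 4.1-3; the variable names are illustrative. The comparison allows for rounding in the third decimal place.

```python
# Values copied from Table 4.1-1 (hydrogen bond at the C end).
f_c = {
    "Pro": 0.933, "Trp": 0.917, "Asp": 0.709, "Glu": 0.682, "Asn": 0.681,
    "Ile": 0.680, "Met": 0.672, "Thr": 0.667, "Ala": 0.663, "Cys": 0.655,
    "Leu": 0.645, "Val": 0.641, "Gly": 0.635, "Arg": 0.629, "Phe": 0.627,
    "Ser": 0.626, "Gln": 0.614, "Tyr": 0.549, "His": 0.543, "Lys": 0.535,
}
# Values copied from Table 4.1-2 (hydrogen bond at the N end).
f_n = {
    "Met": 0.866, "Tyr": 0.802, "Val": 0.785, "Leu": 0.774, "His": 0.765,
    "Lys": 0.748, "Phe": 0.739, "Ile": 0.710, "Arg": 0.707, "Gln": 0.682,
    "Ala": 0.681, "Cys": 0.655, "Thr": 0.594, "Asn": 0.572, "Gly": 0.563,
    "Ser": 0.561, "Trp": 0.542, "Glu": 0.489, "Asp": 0.455, "Pro": 0.000,
}
# Values copied from Table 4.1-3 (total ability).
f_total = {
    "Met": 1.537, "Trp": 1.458, "Val": 1.426, "Leu": 1.419, "Ile": 1.391,
    "Tyr": 1.352, "Phe": 1.366, "Ala": 1.344, "Arg": 1.336, "Cys": 1.310,
    "His": 1.309, "Gln": 1.295, "Lys": 1.283, "Thr": 1.261, "Asn": 1.254,
    "Gly": 1.197, "Ser": 1.187, "Glu": 1.172, "Asp": 1.164, "Pro": 0.933,
}

# Largest deviation between the tabulated total and the sum of the two ends.
max_diff = max(abs(f_c[aa] + f_n[aa] - f_total[aa]) for aa in f_total)

# Residues ordered by the recomputed total, strongest hydrogen-bond provider first.
ranking = sorted(f_c, key=lambda aa: f_c[aa] + f_n[aa], reverse=True)
print(max_diff, ranking[:5])
```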
According to the order of ability providing H-bond-in-C-end the first 11 pairs are: Pro-Ala, Gly-Ala, Trp-Ala, Trp-Gly, Thr-Ala, Glu-Lys, Ala-Ala, Val-Leu, Phe-Val, Cys-Ala, Pro-Leu. According to the order of ability providing H-bond-in-N-end the first 8 pairs are: Ala-Met, Ala-Trp, Ala-Lys, Leu-Met, Leu-Tyr, Ala-Ala, Ala-Cys, Glu-Lys. The number of H-bonds in N end or C end per residue of the above pairs all exceeds 0.1.
§4.2 Information-theoretic approach to structure prediction and sequence – structure relation viewed from cryptography [Luo et al, 2002]

Information-theoretic method

The prediction of protein secondary structure from its amino acid sequence is essentially a problem of finding the correlation between the two objects. It can be studied in the framework of information theory. The amino acid sequence can be regarded as the information source and the corresponding secondary structure as the information receiver. For an amino acid sequence of length N one can construct a secondary structure sequence of the same length, written in the three letters α, β and c, following the one-to-one correspondence between residue and secondary structure. Let $p(a_i)$ ($a_i = \alpha, \beta, c$) be the probability (normalized frequency) of structure $a_i$ in the secondary structure sequence and $p(s_j)$ ($j = 1, \ldots, 20$) the probability (normalized frequency) of amino acid $s_j$ in the protein. Define the average mutual information

$$I(X;Y) = H(X) - H(X|Y) = -\sum_i p(a_i) \log p(a_i) + \sum_i \sum_j p(s_j)\, p(a_i|s_j) \log p(a_i|s_j) \qquad (4.2.1)$$

Likewise, one has

$$I(Y;X) = H(Y) - H(Y|X) = -\sum_j p(s_j) \log p(s_j) + \sum_i \sum_j p(a_i)\, p(s_j|a_i) \log p(s_j|a_i) \qquad (4.2.2)$$

It is easily proved that

$$I(X;Y) = I(Y;X) \qquad (4.2.3)$$
The maximum of $H(X|Y)$ is $H(X)$, which corresponds to no correlation between X and Y. So, the correlation between secondary structure (X) and amino acid (Y) is defined by

$$r_1 = \frac{I(X;Y)}{H(X)} \quad (a_i = \alpha, \beta, c;\ s_j = \text{A, C, …, W, Y}) \qquad (4.2.4)$$
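The correlation of Eq (4.2.4) can be computed from a table of joint counts. The following is a minimal sketch, assuming a hypothetical array `counts[i, j]` that holds the number of residues with structure $a_i$ and amino acid $s_j$; it uses the symmetric form of the mutual information, which by Eq (4.2.3) equals Eq (4.2.1). The same function applies to di- and tri-peptide windows if the rows and columns are taken to be the corresponding structure and peptide classes.

```python
import numpy as np

def correlation_r(counts):
    """Return r = I(X;Y) / H(X) for a joint count table.

    counts[i, j] : number of observations with structure a_i and residue s_j
                   (a hypothetical input layout chosen for this sketch).
    """
    counts = np.asarray(counts, dtype=float)
    joint = counts / counts.sum()          # p(a_i, s_j)
    p_a = joint.sum(axis=1)                # p(a_i)
    p_s = joint.sum(axis=0)                # p(s_j)

    nonzero_a = p_a > 0
    h_x = -np.sum(p_a[nonzero_a] * np.log(p_a[nonzero_a]))   # H(X)

    mask = joint > 0                       # treat 0 log 0 as 0
    product = np.outer(p_a, p_s)
    mutual_info = np.sum(joint[mask] * np.log(joint[mask] / product[mask]))
    return float(mutual_info / h_x)

# Two limiting cases with made-up counts:
independent = np.outer([1, 2, 3], [4, 5])          # structure independent of residue
determined = np.diag([10, 20, 30])                 # each residue fixes the structure
print(correlation_r(independent), correlation_r(determined))
```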
$r_1$ takes values between 0 and 1. $r_1 = 0$ means no correlation and $r_1 = 1$ means the full determination of secondary structure by amino acid, which occurs when $p(a_i|s_j) = 0$ or 1 for all $a_i$ and $s_j$. The single peptide – structure correspondence can be easily generalized to the di-peptide (tri-peptide) – structure correspondence by shifting a window of width 2 (3) along the sequence. The above equations can be used as well in these generalized cases. For the di-peptide – structure correspondence, $a_i$ takes 9 conformations, namely αα, αβ, αc, βα, ββ, βc, cα, cβ and cc, and $s_j$ takes 400 di-peptides in the above equations. Denote the correlation between secondary structure and neighboring di-peptide as $r_2$,

$$r_2 = \frac{I(X;Y)}{H(X)} \quad (a_i = \alpha\alpha, \alpha\beta, …, cc;\ s_j = \text{AA, AC, …, WY, YY}) \qquad (4.2.5)$$

the correlation between secondary structure and next-to-neighboring di-peptide as $r_{2-1}$, the correlation between secondary structure and next-to-next-to-neighboring di-peptide as $r_{2-2}$, etc.

$$r_{2-1} = \frac{I(X;Y)}{H(X)} \quad (a_i = \alpha{*}\alpha, \alpha{*}\beta, …, c{*}c;\ s_j = \text{A*A, A*C, …, W*Y, Y*Y}) \qquad (4.2.6)$$

$$r_{2-2} = \frac{I(X;Y)}{H(X)} \quad (a_i = \alpha{*}{*}\alpha, \alpha{*}{*}\beta, …, c{*}{*}c;\ s_j = \text{A**A, A**C, …, W**Y, Y**Y}) \qquad (4.2.6')$$

Likewise, we have the correlation between secondary structure and tri-peptide,

$$r_3 = \frac{I(X;Y)}{H(X)} \quad (a_i = \alpha\alpha\alpha, \alpha\alpha\beta, …, ccc;\ s_j = \text{AAA, AAC, …, WYY, YYY}) \qquad (4.2.7)$$

We shall demonstrate that the correlation of protein secondary structure with neighboring di-peptide frequency is much stronger than that with single peptide, and that the correlation with tri-peptide frequency is much stronger than that with di-peptide. Therefore, the prediction of protein secondary structure from di- and tri-peptide frequency distributions is a better approach than single-peptide prediction.

The information-theoretic approach provides a method for estimating the efficiency of a structural prediction. The average mutual information I(X;Y), given by Eq (4.2.1), is a useful quantity for this estimation. Now we shall study how to deduce the maximum of I(X;Y) once the information source has been defined. By use of
$$p(a_i)\, p(s_j|a_i) = p(s_j)\, p(a_i|s_j) \qquad (4.2.8)$$

Eq. (4.2.2) can be rewritten as

$$I\big(p(s_j), p(s_j|a_i), p(a_i|s_j)\big) = -\sum_j p(s_j) \log p(s_j) + \sum_{ij} p(s_j)\, p(a_i|s_j) \log p(s_j|a_i) \qquad (4.2.9)$$

and Eq. (4.2.1) can be rewritten as

$$I\big(p(a_i), p(a_i|s_j), p(s_j|a_i)\big) = -\sum_i p(a_i) \log p(a_i) + \sum_{ij} p(a_i)\, p(s_j|a_i) \log p(a_i|s_j) \qquad (4.2.10)$$

By maximizing (4.2.9) with respect to $p(s_j)$ under the constraint

$$\sum_j p(s_j) = 1 \qquad (4.2.11)$$

we obtain

$$p^*(s_j) = \frac{\exp\Big\{\sum_i p(a_i|s_j) \log p(s_j|a_i)\Big\}}{\sum_{j'} \exp\Big\{\sum_i p(a_i|s_{j'}) \log p(s_{j'}|a_i)\Big\}} \qquad (4.2.12)$$

By maximizing (4.2.9) with respect to $p(s_j|a_i)$ under the constraints

$$\sum_j p(s_j|a_i) = 1 \qquad (4.2.13)$$

we obtain

$$p^*(s_j|a_i) = \frac{p(s_j)\, p(a_i|s_j)}{\sum_{j'} p(s_{j'})\, p(a_i|s_{j'})} \qquad (4.2.14)$$

Eq (4.2.14) is consistent with Eq (4.2.8). In Eq. (4.2.9) I depends linearly on $p(a_i|s_j)$, so no further relation can be obtained by maximizing with respect to $p(a_i|s_j)$. From (4.2.10) one can deduce two equations similar to Eqs (4.2.12) and (4.2.14). Further, we can prove that the final result of iterating (4.2.12) and (4.2.14) is exactly the maximum of I(X;Y).
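The alternating use of Eqs (4.2.12) and (4.2.14) has the same form as the Blahut-Arimoto iteration used in information theory to compute channel capacity. The sketch below is one possible implementation under stated assumptions: the conditional probabilities $p(a_i|s_j)$ are held fixed in a hypothetical array `q[j, i]`, every structure has nonzero probability for at least one residue, and the source distribution starts from uniform. Function names and the test channels are illustrative and not taken from the text.

```python
import numpy as np

def mutual_information(p_s, q):
    """I(X;Y) in nats for source p_s[j] = p(s_j) and q[j, i] = p(a_i | s_j)."""
    joint = p_s[:, None] * q
    p_a = joint.sum(axis=0)
    mask = joint > 0                                    # treat 0 log 0 as 0
    ratio = joint[mask] / (p_s[:, None] * p_a[None, :])[mask]
    return float(np.sum(joint[mask] * np.log(ratio)))

def maximize_information(q, tol=1e-12, max_iter=10000):
    """Iterate Eq (4.2.14) and Eq (4.2.12) until p(s_j) stops changing."""
    q = np.asarray(q, dtype=float)
    p_s = np.full(q.shape[0], 1.0 / q.shape[0])         # uniform starting guess
    positive = q > 0
    for _ in range(max_iter):
        joint = p_s[:, None] * q
        posterior = joint / joint.sum(axis=0, keepdims=True)   # p*(s_j | a_i), Eq (4.2.14)
        log_post = np.zeros_like(q)
        log_post[positive] = np.log(posterior[positive])
        weights = np.exp((q * log_post).sum(axis=1))            # numerator of Eq (4.2.12)
        new_p = weights / weights.sum()
        converged = np.abs(new_p - p_s).max() < tol
        p_s = new_p
        if converged:
            break
    return p_s, mutual_information(p_s, q)

# Illustrative two-letter examples (not protein data).
symmetric = np.array([[0.9, 0.1], [0.1, 0.9]])
p_sym, i_sym = maximize_information(symmetric)

one_sided = np.array([[1.0, 0.0], [0.5, 0.5]])
p_one, i_one = maximize_information(one_sided)
print(p_sym, i_sym, p_one, i_one)
```

For the symmetric example the maximizing source distribution stays uniform, while for the one-sided example the iteration shifts weight toward the letter that determines its output unambiguously.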
Application to the species-dependence of the structure prediction

The database ISSD version 2.0 includes 105 human proteins (119 polypeptide chains, 22306 residues) and 89 E. coli proteins (92 polypeptide chains, 26059 residues) with less than 50% pairwise similarity (sequence identity < 50%) [Adzhubei & Adzhubei, 1999]. The average di-peptide frequency is about 120. The correlations between protein secondary structure and amino acid sequence calculated from the ISSD database are summarized in Table 4.2-1. Human and E. coli genes are listed separately, denoted by Hum and Eco in the table. When they are studied together, the calculated correlation is denoted by ‘tot’. Besides, to improve the statistics we consider the 2-state structural classifications, namely alpha/non-alpha, beta/non-beta and coil/non-coil, in addition to the conventional 3-state (alpha, beta and coil) classification. The results on the correlation between 2-state structures and amino acids are summarized in Table 4.2-2.

Interestingly, the correlation of protein secondary structure with human and E. coli genes considered separately is generally stronger than that with the two species taken as a whole. This means that the di-peptide (neighboring or non-neighboring) signals for protein secondary structure differ between organisms. That is, the same di-peptide may have different structural propensities, and the same structure requires different di-peptide signals, in human and E. coli. The point can be demonstrated by direct calculation of di-peptide or tri-peptide structural signals. To be definite, we consider the tri-peptide signal. Let $N_i$ be the frequency of tri-peptide i and $N_i^{(j)}$ its partial frequency of occurrence in structure j (j = ααα, βββ, ccc, α/c, β/c, c/α and c/β). Here α/c etc. are defined by the boundary of two structures; for example, α/c means ααc, cαα, ccα and αcc. If $N_i^{(j)} \ge 0.8 N_i$ and $N_i \ge 3$, then we define tri-peptide i as a signal of structure j. The statistical meaning of the structural signal defined above is: if tri-peptide i took structure j (j = ααα, …, c/β) at random, then the probability of tri-peptide i occurring in structure j at least $0.8 N_i$ times would be

$$p_i^{(j)} = \sum_{N_i^{(j)} = 0.8 N_i}^{N_i} \frac{N_i!}{(N_i - N_i^{(j)})!\; N_i^{(j)}!} \times \big(q^{(j)}\big)^{N_i^{(j)}} \times \big(1 - q^{(j)}\big)^{N_i - N_i^{(j)}} \qquad (4.2.15)$$

where $q^{(j)}$ is the probability of structure j occurring in the database. So, when $N_i \ge 3$, we always have $p_i^{(j)}$ […] 10 and slowly as $10 \ge m > 6$ but rapidly as $m \le 6$.
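Returning to the signal criterion of Eq (4.2.15), the tail probability is a binomial sum that can be evaluated directly. The sketch below is illustrative: the function name is hypothetical, the lower summation limit is taken as the smallest integer not below $0.8 N_i$, and the values of q used in the example are made up rather than taken from the database.

```python
from math import comb

def signal_tail_probability(n_i, q_j):
    """Probability that at least 0.8 * n_i of n_i occurrences fall in structure j
    when each occurrence independently falls in j with probability q_j (Eq 4.2.15)."""
    k_min = -(-4 * n_i // 5)          # ceil(0.8 * n_i) in exact integer arithmetic
    return sum(
        comb(n_i, k) * q_j**k * (1.0 - q_j) ** (n_i - k)
        for k in range(k_min, n_i + 1)
    )

# Made-up example: a structure class that covers 20% of the database.
p_three = signal_tail_probability(3, 0.2)   # all 3 occurrences must be in j
p_five = signal_tail_probability(5, 0.2)    # at least 4 of 5 occurrences in j
print(p_three, p_five)
```

A small tail probability means that a tri-peptide meeting the 80% criterion is unlikely to do so by chance, which is the sense in which it is called a structural signal.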
However, $r_3(m)$ decreases markedly from the very beginning, and it attains only 1/6 of the original value at m = 2. Since $r_3(20)$ takes a large enough value, the 20-letter alphabet correlates strongly with protein structure, but the rapid decrease of $r_3(m)$ with m indicates that any alphabet reduction will considerably lower the correlation between sequence and structure. In fact, an α or β structural signal of a four-peptide (see section §4.4) does not in general retain its characteristics when the amino acids are replaced by a simplified alphabet. We have demonstrated that, on reduction to a 4-letter alphabet (m = 4), the conservation (at the 90% level) of the structural characteristic of a four-peptide signal is, on average, only 5/256 for α structural signals and 1/256 for β signals. Therefore, reduced alphabets are generally insufficient for the prediction of protein structure. However, the reduction may be a useful tool for studying the effect of large-scale context in the amino acid sequence on structure prediction. For example, about 49% to 57% of four-peptide α-type structural signals can be successfully expanded to six-peptide signals by putting two reduced (m = 4) letters on the two sides, and 70% to 81% of four-peptide β-type structural signals can be successfully expanded to six-peptide signals by the same procedure.

Note:

1. The result shows that the correlation of protein secondary structure with tri-peptide frequency is the strongest and that the correlation with neighboring di-peptide frequency is also much stronger than that with single peptide. Because of the weak statistics (limited by the size of the current database) we have not considered the correlation with tetra-peptide frequency. But we believe that $r_4(m)$ will decrease with m more rapidly than $r_3(m)$ in a large enough database.

2. The correlation of protein secondary structure with amino acid sequence generally changes with the size of the database. This is understandable, since mono- and di-peptide frequencies are not the only factors determining the secondary structure. In fact, the secondary structure is determined by a segment longer than a di-peptide. However, many properties of the alphabet reduction depend only on the relative values of the correlation in a given database. So, the amino acid reduction remains nearly unchanged as the database expands, as long as the latter has attained a large enough size.

3. To check the correctness of the above reduction method, the following approach has been employed: as two amino acids are grouped together in each step to obtain the highest correlation, we consider the pair with the next-to-highest correlation as an added pair to the first group if one member is already in the group, and then we compare the new reduction pathway with the old one to deduce the best reduction.
Table 4.2-5 The correlation r2(m) and the corresponding amino acid reduction

m    r2(m)     Amino acid reduction
20   0.09234   YFRKDENHQSTGPCWAMLVI
19   0.09204   YF/RKDENHQSTGPCWAMLVI
18   0.09167   YF/RQ/KDENHSTGPCWAMLVI
17   0.09128   YFW/RQ/KDENHSTGPCAMLVI
16   0.09088   YFW/RQ/LM/KDENHSTGPCAVI
15   0.09044   YFW/RQK/LM/DENHSTGPCAVI
14   0.09002   YFW/RQK/LM/VI/DENHSTGPCA
13   0.08952   YFW/RQK/LM/VI/HC/DENSTGPA
12   0.08883   YFW/RQK/LM/VI/HC/DN/ESTGPA
11   0.08814   YFW/RQK/LM/VI/HTC/DN/ESGPA
10   0.08747   YFW/RQK/LM/VI/HTC/DN/EA/SGP
9    0.08618   YFW/RQK/LM/VI/HTCS/DN/EA/GP
8    0.08484   YFW/RQKEA/LM/VI/HTCS/DN/GP
7    0.08277   YFWLM/RQKEA/VI/HTCS/DN/GP
6    0.07870   YFWLM/RQKEA/VI/HTCS/DNG/P/
5    0.07435   YFWLMVI/RQKEA/HTCS/DNG/P/
4    0.06818   YFWLMVI/RQKEA/NSDHTCG/P/
3    0.05867   YFWLMVI/RQKEA/NSDHTCGP/
2    0.04469   YFWLMVIRQKEA/NSDHTCGP/

Table 4.2-6 The correlation r2-1(m) and the corresponding amino acid reduction

m    r2-1(m)   Amino acid reduction
20   0.08193   MLYFRQNSKDEHTGPCWAVI
19   0.08159   YW/MLFRQNSKDEHTGPCAVI
18   0.08123   YW/ML/FRQNSKDEHTGPCAVI
17   0.08087   YW/ML/RQ/FNSKDEHTGPCAVI
16   0.08048   YW/ML/RQ/VI/FNSKDEHTGPCA
15   0.08008   YWF/ML/RQ/VI/NSKDEHTGPCA
14   0.07966   YWFC/ML/RQ/VI/NSKDEHTGPA
13   0.07924   YWFC/ML/RQK/VI/NSDEHTGPA
12   0.07876   YWFC/ML/RQK/VI/HT/NSDEGPA
11   0.07805   YWFC/ML/RQK/VI/HT/NS/DEGPA
10   0.07694   YWFC/ML/RQK/VI/HT/NSD/EGPA
9    0.07580   YWFC/ML/RQKE/VI/HT/NSD/GPA
8    0.07394   YWFC/ML/RQKEA/VI/HT/NSD/GP
7    0.07180   YWFCML/RQKEA/VI/HT/NSD/GP
6    0.06878   YWFCML/RQKEA/VI/HT/NSD/GP/
5    0.06573   YWFCML/RQKEA/VI/NSDHT/GP/
4    0.06205   YWFCMLVI/RQKEA/NSDHT/GP/
3    0.05318   YWFCMLVI/RQKEA/NSDHTGP/
2    0.03888   YWFCMLVIRQKEA/NSDHTGP/
Table 4.2-7 The correlation r3(m) and the corresponding amino acid reduction

m    r3(m)     Amino acid reduction
20   0.18845   YFRKDENHQSTGPCWAMLVI
19   0.18210   YF/RKDENHQSTGPCWAMLVI
18   0.17463   YF/RQ/KDENHSTGPCWAMLVI
17   0.16746   YF/RQ/ML/KDENHSTGPCWAVI
16   0.16065   YFW/RQ/ML/KDENHSTGPCAVI
15   0.15443   YFW/RQ/ML/VI/KDENHSTGPCA
14   0.14871   YFW/RQK/ML/VI/DENHSTGPCA
13   0.14344   YFWC/RQK/ML/VI/DENHSTGPA
12   0.13863   YFWC/RQK/ML/VI/SH/DENTGPA
11   0.13400   YFWC/RQK/ML/VI/SH/EA/DNTGP
10   0.12953   YFWC/RQK/ML/VI/SH/EA/DN/TGP
9    0.12550   YFWC/RQK/ML/VI/SHT/EA/DN/GP
8    0.12048   YFWC/RQKEA/ML/VI/SHT/DN/GP
7    0.11555   YFWCML/RQKEA/VI/SHT/DN/GP
6    0.10896   YFWCML/RQKEA/VI/DNSHT/GP
5    0.10244   YFWCMLVI/RQKEA/DNSHT/GP
4    0.09340   YFWCMLVI/RQKEA/DNSHT/GP/
3    0.07871   YFWCMLVI/RQKEA/DNSHTGP/
2    0.05742   YFWCMLVIRQKEA/DNSHTGP/

(Each group of the reduced amino acid alphabet is written between two slashes in the third column of the table.)
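The reduction strings in Tables 4.2-5 to 4.2-7 can be read mechanically. The following is a minimal sketch under an assumed convention that is inferred from the tables rather than stated in full in the text: every segment followed by a slash is a merged group, and a final segment without a trailing slash lists amino acids that are still ungrouped. The function name `parse_reduction` is hypothetical.

```python
def parse_reduction(entry: str) -> list[str]:
    """Split one table entry into its groups.

    Assumed convention (inferred from the tables): segments are separated
    by '/', a segment followed by '/' is a merged group, and a last segment
    with no trailing '/' holds amino acids that are still single letters.
    """
    segments = entry.split("/")
    tail = segments.pop()              # empty string if the entry ends with '/'
    groups = [seg for seg in segments if seg]
    groups.extend(tail)                # each leftover letter is its own group
    return groups


# The number of groups should equal the alphabet size m listed in the table.
for entry in ("YF/RQ/KDENHSTGPCWAMLVI", "YFWCMLVI/RQKEA/DNSHT/GP/"):
    print(len(parse_reduction(entry)), parse_reduction(entry))
```

Under this reading the group count reproduces the m column for the entries checked, which is the main reason for adopting it here.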
Figure 4.2-1 The dependence of correlation r(m) on m
(Ordinate: r(m), from 0 to 0.2; abscissa: 0 to 20. r1(m) is labeled by 1, r2(m) by 2, r2-1(m) by 3 and r3(m) by 4; m = 21 - (the number labeled on the abscissa).)
4. A detailed comparison of the different simplification approaches is given in the following. To assess the different ways of amino acid reduction, namely ours (for brevity we consider the tri-peptide case only; see Table 4.2-8, where it is denoted as 3), that of Wang & Wang, 1999 (denoted as 1), and that of Murphy et al, 2000 (denoted as 2), we introduce a criterion as follows. Consider the reduction to four types of residues: A, B, C and D. Define the average distance of amino acids within group A and within group B,

$$ I(A) = \frac{\sum_{ij} d(A_i A_j)}{\text{number of terms in summation}}, \quad A_i, A_j \in A $$

$$ I(B) = \frac{\sum_{ij} d(B_i B_j)}{\text{number of terms in summation}}, \quad B_i, B_j \in B $$

and the average distance between amino acids in group A and group B,

$$ D(AB) = \frac{\sum_{ij} d(A_i B_j)}{\text{number of terms in summation}} $$

Define

$$ f_{AB} = \frac{2 D(AB)}{I(A) + I(B)} \tag{4.2-16} $$

Likewise, we can define $f_{BC}$, $f_{AC}$, ..., $f_{CD}$. For another reduction, the four types are denoted by A', B', C' and D'. Through comparison between ($f_{AB}$, $f_{BC}$, ..., $f_{CD}$) and ($f_{A'B'}$, $f_{B'C'}$, ..., $f_{C'D'}$), one can determine which reduction is better. The larger the f, the better the grouping. The four types of alphabets (or five types in 1, where only five types are given) in the three reductions are summarized as follows (Table 4.2-8):
Table 4.2-8
The comparison of three reductions of 20 amino acids into four or five groups

Reduction         A                    B                         C                    D                    E
1                 YFWCMLVI             GP                        RQKNS                AHT                  DE
2                 YFW/LVIMC            AGSTP                     EDNQKRH
3                 YFWCMLVI             GP                        RQKEA                DNSHT
characteristics   hydrophobic, large   weak hydrophobic, small   hydrophilic, large   hydrophilic, small

Wang & Wang's reduction is denoted as 1, Murphy's reduction as 2, and our tri-peptide reduction as 3; the last line gives the characteristic description.
Table 4.2-8a The comparison of two reductions of amino acid alphabets into four groups

Reduction   f_AB   f_AC   f_AD   f_BC   f_BD   f_CD   Average
1           2.14   1.70   1.50   1.66   1.13   1.08   1.54
3           2.14   1.61   1.89   1.64   1.48   1.11   1.65
Wang & Wang's reduction is based on the comparison of amino acid interaction matrices [Wang & Lee, 2000], while our reduction is based on the comparison of correlations between amino acids and protein secondary structure. Though the two reductions are based on different principles, their classifications of the twenty amino acids into four or five groups, namely 1 and 3, are nearly the same, as seen from Table 4.2-8. On the other hand, Murphy et al's reduction is based on the direct study of similarity matrix elements. We also find that both reductions 1 and 3 are close to reduction 2 once one notices that two groups in reduction 2, YFW and LVIMC, merge into a single group in reductions 1 and 3.
The main chemical and physical properties used to differentiate amino acids are hydrophobicity and volume. In Table 4.2-8 the characteristics of each group of amino acids are given. Our reduction (into four groups) is more consistent with the characteristic classification.
By use of the physico-chemical distances between amino acids given by Grantham (1974) (see Table 1.3-1) one can give a more quantitative comparison between reductions 1 and 3. All f values have been calculated for each pair of groups by use of Eq (4.2.16) and the results are shown in Table 4.2-8a. From the table we find that, on average, reduction 3 is better than reduction 1.
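The criterion of Eq (4.2.16) is easy to evaluate once a distance function is available. The sketch below is illustrative only: the distances come from a hypothetical one-dimensional property scale (`toy_property`), not from Grantham's table, and it assumes that "number of terms in summation" means the number of unordered pairs within a group and of all cross pairs between groups. The function names are likewise assumptions.

```python
from itertools import combinations, product


def mean_within(group, distance):
    """I(A): average distance over unordered pairs inside one group."""
    pairs = list(combinations(group, 2))   # needs at least two members
    return sum(distance(a, b) for a, b in pairs) / len(pairs)


def mean_between(group_1, group_2, distance):
    """D(AB): average distance over all cross pairs of two groups."""
    pairs = list(product(group_1, group_2))
    return sum(distance(a, b) for a, b in pairs) / len(pairs)


def f_value(group_1, group_2, distance):
    """f_AB = 2 D(AB) / [I(A) + I(B)], as in Eq (4.2-16)."""
    within = mean_within(group_1, distance) + mean_within(group_2, distance)
    return 2 * mean_between(group_1, group_2, distance) / within


# Hypothetical property values, chosen only to make the example run.
toy_property = {"Y": 1.0, "F": 1.2, "W": 1.5, "G": 6.0, "P": 6.5}


def toy_distance(a, b):
    return abs(toy_property[a] - toy_property[b])


f_ab = f_value("YFW", "GP", toy_distance)
print(f"f_AB for the toy groups: {f_ab:.2f}")
```

A value of f well above 1 indicates that the two groups are far apart compared with their internal spread, which is the sense in which a larger f means a better grouping.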
The problem of empirical prediction of protein secondary structure viewed from sequence-structure correlation
The information-theoretic correlation defined by Eqs (4.2.4) to (4.4.7) affords the theoretical basis for the prediction of protein secondary structures. Interestingly, the calculated values of r1, r2, r2−1, r2−2, etc. clearly depend on the database. The calculations based on the ISSD database have been listed in Table 4.2-1. However, by use of 2269 protein data in the IADE1 database (with sequence identity < 40%, see section §4.4) we obtain the correlations shown in Table 4.2-9. The two sets of correlations take different values, though they retain the same trend with the variation of residue distance.
Table 4.2-9
Correlation of protein secondary structure with amino acid sequence (in IADE1 database)

r1      r2      r2−1    r2−2    r2−3    r2−4    r2−5    r2−6    r2−7
0.049   0.088   0.077   0.068   0.064   0.059   0.057   0.056   0.055

r2−8    r2−9    r2−10   r2−11   r2−12   r2−13   r2−14   r2−15   r2−30
0.054   0.053   0.053   0.053   0.053   0.052   0.052   0.052   0.052

r2−40   r2−50   r2−60   r2−70   r2−80   r2−90
0.0518  0.0518  0.0519  0.0518  0.0516  0.0514
Apart from the difference in the number of sequences in the two databases, the main reason for the database dependence of the correlation is that the sequence identity in the IADE1 database (< 40%) is lower than that in ISSD (1500. One may expect a better secondary structure prediction based on a database with magnitude n>1500. This means that, with sequence identity smaller than 40%, a better secondary structure prediction can be obtained by use of a database including more than 270 thousand residues. In a database with 270 thousand residues each residue pair occurs 680 times on average and it occurs in a given secondary structure more than 200 times. The statistics are strong enough. Of course, instead of enlarging the database, we may use another strategy as well. In fact, using a small database with higher homology one could still make a better structural prediction as long as the protein to be predicted has strong enough sequence similarity with the database. Finding homology is of first importance in structural prediction. As the database is enlarged, we are better able to choose a group of proteins with known structures that share high sequence similarity with the protein to be predicted. Comparing this group of proteins with the protein to be predicted, we are more likely to achieve a high prediction accuracy. Another question is how to evaluate the importance of long-range information in secondary structure determination. By calculating the information correlation r2−z with large z we are able to understand the point in a simple way. From Table 4.2-9 we find r2−1 = 0.077 but r2−z = 0.052 (for z=13 to z=80), which differs from r2−1 by about 32% and remains nearly unchanged from z=13 to z=80. Therefore, including long-range information in secondary structure prediction is still a challenging problem.
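The order-of-magnitude figures in this passage can be checked with a few lines of arithmetic. This is a rough sketch, assuming 20 x 20 = 400 residue-pair types, about one adjacent pair per residue, and three secondary-structure states sharing the occurrences evenly; the even split and the variable names are assumptions made for illustration.

```python
n_residues = 270_000            # database size quoted in the text
n_pair_types = 20 * 20          # ordered pairs of the 20 amino acids

# Roughly one adjacent pair per residue, spread over all pair types.
pair_occurrences = n_residues / n_pair_types

# Assumption: three structural states (helix, strand, coil), evenly used.
n_states = 3
per_state = pair_occurrences / n_states

# Relative drop from the adjacent-residue to the long-range correlation.
r_adjacent, r_long_range = 0.077, 0.052
relative_drop = (r_adjacent - r_long_range) / r_adjacent

print(f"occurrences per residue pair : {pair_occurrences:.0f}")
print(f"occurrences per pair and state: {per_state:.0f}")
print(f"relative drop of correlation  : {relative_drop:.1%}")
```

The first number comes out slightly below the quoted 680, and the second stays above 200, consistent with the statement that the statistics are strong enough.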
Sequence-structure relation viewed from cryptography [Luo, 1995]
The genetic information transmitted from the amino acid sequence to the 3-dimensional structure of a protein is generally called the second genetic code. The problem can be discussed in the framework of cryptography. In principle, the tertiary structure of a protein is determined by the dihedral angles of the main chain and side chains, among which the main chain dihedral angles (φ ψ) determine the basic backbone and are more important in structural prediction. The angles (φ ψ) take values continuously. However, due to spatial obstacles (NH...NH clash, CO...CO clash, NH...CO clash, side chain-backbone clash, etc.) and other physical limitations, they actually take values in some discrete regions of the Ramachandran plot. So the problem of predicting the tertiary structure is reduced to the prediction of several structural states of (φ ψ) in the Ramachandran plot. For example, from conformation statistics one finds that these angles are distributed mostly in φ=φR, φL; ψ=ψa, ψb (φR = -60 ±10, φL = -120 ±20, ψa = -40 ±10, ψb = 140 ±10), except for glycine (Gly takes discrete structures too, but different from the above values) [Lee & Luo, 1996b]. Rooman et al reported a prediction of protein backbone conformation based on seven structure assignments [Rooman, 1991]. How can one estimate the prediction ability of the above-mentioned approach? How can one improve the prediction by an appropriate choice of the structure assignments? These problems will be viewed from the point of cryptography. On the other hand, there exist a lot of neutral mutations of proteins due to mutational plasticity. Here "neutrality" means that the amino acid substitution does not change the structure (function) of a protein. Therefore, some authors suggested that the fold is encoded not by a 20-letter code but by a smaller number of characteristics of the protein, for example, the hydrophobicity, charges, volumes
etc. If the degeneracy does exist, what is its influence on structure prediction? The problem can also be viewed from cryptography.
From the idea of cryptography [Diffie & Hellman, 1976], the amino acid sequence and the corresponding 3-dimensional structure can be viewed as code and message respectively. Furthermore, the relation between them can be viewed as the key of the encryption system. Suppose the code is written in q letters (amino acids) and the length of the code is n. There are $q^n$ different sequences in total, which define the code space. Suppose the dihedral angles (φ ψ) occupy s different regions in the Ramachandran plot. There are $s^n$ conformations, which define the message space. Since each amino acid corresponds to s conformations, each amino acid-conformation correspondence is a key of the system. There are $s^q$ keys, which define the key space. Evidently, these keys are not simply direct assignments but form a public-key system based on physico-chemical principles. However, we do not know the construction methods of the encryption clearly. To give a general description we shall use the assumption of random key, that is, the key is homogeneously distributed in key space.
Let us first review Shannon's cryptography theory [Shannon & Weaver, 1949]. Consider codes of length n written in the 26 English letters. There are $C_n$ codes,

$$ C_n = 2^{\gamma n}, \quad \gamma = \log_2 26 = 4.70 \tag{4.2.17} $$

The meaningful messages can be picked out by requiring that the frequency of each letter occurring in the message be consistent with the statistical results for general English literature. Suppose that the meaningful messages are distributed homogeneously in message space. The entropy deduced from the frequencies of English letters is

$$ h = -\sum_i p_i \log_2 p_i = 4.19 \tag{4.2.18} $$

The number of meaningful messages is $2^{hn}$. So, a sequence of n letters can be translated to a meaningful message with probability

$$ p = 2^{-(\gamma - h) n} = 2^{-n d_1} \tag{4.2.19} $$

where $d_1$ is the first-order informational redundancy. (If the correlations between letters are taken into account, $d_2$ should be added to $d_1$.) On the other hand, the number of keys is

$$ K = 2^{\log_2 26!} = 2^{88.38} \tag{4.2.20} $$

If all keys are used in decryption then one obtains $M_n$ meaningful messages,

$$ M_n = 2^{(88.38 - n d_1)} \tag{4.2.21} $$

The condition for only one meaningful message is

$$ n = n_0 = 88.38 / d_1 = 174 \tag{4.2.22} $$

which is called the unicity distance. When $n < n_0$, $M_n$ is larger than 1 and one obtains more than one message. Otherwise, when $n > n_0$, the number of meaningful messages is smaller than 1. In the above discussion the correlation between letters has been neglected. If the correlation of adjacent letters is considered then $d_1$ in Eqs (4.2.19), (4.2.21) and (4.2.22) should be replaced by $d_1 + d_2$, where $d_2$ is the second-order informational redundancy.
Now we will generalize the Shannon theory of cryptography to the sequence-structure relation of proteins. There are three basic quantities describing the system, namely the code number $C_n$, the message number $M_n$ and the key number K. We define the transmission efficiency of information (TEI)

$$ \eta_n = K M_n / C_n \tag{4.2.23} $$
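The numbers in the Shannon review, Eqs (4.2.17) to (4.2.22), can be reproduced directly. The sketch below takes the entropy h = 4.19 as quoted in the text and computes everything else; the variable and function names are hypothetical. With these inputs the unicity distance comes out a little above 173, close to the quoted value of 174 (the small gap presumably reflects rounding of h).

```python
import math

gamma = math.log2(26)                        # bits per letter, Eq (4.2.17)
h = 4.19                                     # letter entropy quoted in the text
d1 = gamma - h                               # first-order redundancy, Eq (4.2.19)

log2_keys = math.log2(math.factorial(26))    # log2 K for substitution keys
unicity_distance = log2_keys / d1            # Eq (4.2.22)


def log2_meaningful_messages(n: int) -> float:
    """log2 of M_n = 2^(log2 K - n d1), Eq (4.2.21)."""
    return log2_keys - n * d1


print(f"gamma = {gamma:.2f}, d1 = {d1:.3f}")
print(f"log2 K = {log2_keys:.2f}")
print(f"unicity distance = {unicity_distance:.1f}")
```

Below the unicity distance `log2_meaningful_messages(n)` is positive (several candidate messages survive); above it the value turns negative.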
For a general code system one always has $M_n < C_n$ and $K > 1$. Set $\eta_n = 1$ at $n = n_0$; $n_0$ is called the threshold length, which is exactly the unicity distance in usual code theory. For the sequence-structure system of proteins the TEI is

$$ \eta_n = s^q (s/q)^n \tag{4.2.24} $$

and the corresponding threshold length is

$$ n_0 = q \log_2 s / (\log_2 q - \log_2 s) \tag{4.2.25} $$

Taking q = 20, we obtain the numerical results as follows

s     1   2   3   4   5   6   7   8   9   10  11  12  13   14   15   16
n0    0   6   11  17  23  29  37  45  55  66  80  97  119  148  188  248

(In the table n0 takes the integral part of the values satisfying (4.2.25))
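The threshold lengths can be recomputed from Eq (4.2.25). This is a minimal sketch with a hypothetical function name `threshold_length`; it assumes s < q so that the denominator is positive.

```python
import math


def threshold_length(s: int, q: int = 20) -> float:
    """n0 = q log2(s) / (log2(q) - log2(s)), Eq (4.2.25); requires s < q."""
    return q * math.log2(s) / (math.log2(q) - math.log2(s))


# Integer part of n0 for s = 1 .. 16 with q = 20.
table = {s: math.floor(threshold_length(s)) for s in range(1, 17)}
for s, n0 in table.items():
    print(f"s = {s:2d}   n0 = {n0}")
```

The integer parts agree with the tabulated values except at s = 14, where the formula gives about 147.98 and the table lists 148, so that entry appears to have been rounded rather than truncated.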
For a native protein n = 50~200. To obtain TEI = 1, s = 9~15 separate regions in the (φ ψ) angles are necessary. For s = 4~7, the TEI $\eta_n$ is much smaller than 1. Furthermore, we see from Eq (4.2.24) that if the 3-dimensional structure is encoded by a smaller number (q) of characteristics of the amino acids rather than by 20 different amino acids, then TEI grows with decreasing q when n is not too small. Therefore, the simplification of amino acid alphabets is generally favorable to the information transmission from sequence to structure. TEI gives how many messages can be transmitted by one code. It reflects the ability to transmit information from code to message. However, for a real informational system, a more important property is the anti-disturbance capacity (ADC) of the code. In conventional code theory ADC is described by the informational redundancy $d_1$ (or $d_1 + d_2$, etc). The larger the informational redundancy is, the stronger the ADC will be. As seen from Eq (4.2.19), $d_1$ is proportional to the negative of log p, $d_1 \sim -\log (M_n / C_n)$. If different keys are used and a random distribution of keys is assumed, then we shall generalize the informational redundancy defined in usual information theory to ADC, which is defined by

$$ ADC = -\log_2 TEI = \log_2 C_n - \log_2 M_n - \log_2 K \tag{4.2.26} $$

Therefore, ADC is proportional to the negative of the logarithm of TEI. A smaller TEI means a stronger ADC and vice versa. Perhaps this explains why a smaller TEI has been taken by natural selection for native proteins in nature.
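The dependence of TEI and ADC on the alphabet size q can be illustrated numerically from Eqs (4.2.24) and (4.2.26). The sketch below works with logarithms to avoid very small numbers; the chain length n = 100 and s = 4 are example values within the ranges discussed in the text, and the function names are assumptions.

```python
import math


def log2_tei(n: int, s: int, q: int) -> float:
    """log2 of eta_n = s^q (s/q)^n, Eq (4.2.24)."""
    return q * math.log2(s) + n * (math.log2(s) - math.log2(q))


def adc(n: int, s: int, q: int) -> float:
    """ADC = -log2 TEI, Eq (4.2.26)."""
    return -log2_tei(n, s, q)


n, s = 100, 4                      # example chain length and number of states
for q in (20, 10, 5):
    print(f"q = {q:2d}   log2 TEI = {log2_tei(n, s, q):8.1f}   ADC = {adc(n, s, q):7.1f}")
```

For these example values log2 TEI rises as q falls, while ADC falls, which illustrates the trade-off between transmission efficiency and anti-disturbance capacity described in the text.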
§4.3 Relationship between protein secondary structure and synonymous codon usage [Li, Liu & Luo, 2001]
Following Anfinsen's principle on the folding of protein chains, the protein spatial structure is fully determined by the information contained in its amino acid sequence [Anfinsen, 1973]. Do the mRNA sequence and structure have nothing to do with the protein secondary structure? Recently, several authors indicated possible connections between protein secondary structure and mRNA information. There exist three approaches in the literature. Firstly, it was proposed that there exist strong signals in mRNA sequence regions surrounding α-helices and β-sheets [Brunak & Engelbrecht, 1996] and that the formation of some β-sheets in E.coli and α-helices in human is correlated with specific Relative Synonymous Codon Usages [Oresic & Shalloway, 1998]. The protein structure may be correlated with the sequential structure of messenger RNA. Secondly, studies showed that translation is a non-uniform process. The tRNA availability may affect the rate of elongation of nascent polypeptide chains [Varenne et al, 1984]. Ribosome pause sites have also been localized in vivo [Kim et al, 1992]. Several authors adduced evidence to show that codon usage determines the translation rate in E.coli [Sorensen et al, 1989]. It was suggested that the protein secondary structural types are differentially coded on mRNA. For E.coli, alpha helices in proteins tend to be preferentially coded by translationally fast mRNA regions, while the slow segments often code for beta strands and coil regions [Thanaraj & Argos, 1996]. The point is consistent with the observation of a partially folded state on nascent peptides [Doinach et al, 1995]. The third group of authors proposed that the stem-loop structures of mRNA may influence the translation rate and, in turn, change the formation of secondary structure and the overall folding of the protein [Zhang & Liu, 1999]. A quantitative analysis was given of how the secondary structure of the ribosome binding site determines translational efficiency [Marrten et al, 1990].
It was assumed that the process of hairpin unfolding can increase the time of translocation from the A to the P ribosome site of the codon 5' to the hairpin, thus decreasing the probability of translational error [Shpaer, 1985]. A hypothesis was also proposed that the folding of the MS2 coat protein in E.coli is modulated by translational pauses resulting from mRNA secondary structure [Guisez et al, 1993]. However, on the question of the direct effect of mRNA secondary structure on translation rate there still exist different points of view [Sorensen et al, 1989; Le et al, 1993]. Though there seems to be some correlation between codon usage (mRNA sequence) and protein structure, to our knowledge there have not been direct experiments which show a protein structural change caused by synonymous codon replacement. So the statistical analysis of up-to-date sequence data is necessary. Several works have been done in this direction [Adzhubei et al, 1996; Xie & Ding, 1998]. In this section we shall present a further statistical study of the problem. Our result will afford more reliable evidence on the correlation of codon usage with protein secondary structures.
Abnormal preference of synonymous codons for protein secondary structural types (di-peptide analysis)
Is the nascent peptide folding determined fully by the amino acid sequence? Is there any influence of synonymous codon usage on protein structure? Recently, statistical analyses of higher mammalian mRNA sequences showed that the non-random usage of synonymous codons may influence the formation of protein secondary structures [Adzhubei et al, 1998; Xie & Ding, 1998; Adzhubei et al, 1999]. Furthermore, some work [Xie & Ding, 1998] indicated that for E.coli there is no significant correlation between codon usage and protein structure. The discussions in the above works are all based on the correlation between single peptide frequency and protein secondary structure. However, following the analyses given in the previous section, the correlation of di-peptide frequencies with protein secondary structures is much stronger than that of single peptide frequencies. From di-peptide frequencies we shall be able to deduce more reliable conclusions on the role of codon usage in the determination of protein secondary structures. In the following analyses the data for mRNA sequences and native three-dimensional structures of the encoded proteins are taken from ISSD 2.0 [Adzhubei et al, 1999]. The average occurrence frequency for each di-peptide is about 120. Corresponding to di-peptide $A_i A_j$ there are seven kinds of secondary structures, namely αα, ββ, CC, αC, Cα, βC and Cβ (the neighboring α and β not occurring). Their frequencies in the database are denoted by $p_1^{(ij)}, p_2^{(ij)}, p_3^{(ij)}, p_4^{(ij)}, p_5^{(ij)}, p_6^{(ij)}, p_7^{(ij)}$ respectively. Define

$$ Sc(A_i A_j) = \max_k p_k^{(ij)} = p_{k_0}^{(ij)} \tag{4.3.1} $$

(The maximum is supposed to occur at $k = k_0$.) On the other hand, the codon coding for $A_i$ is denoted by $B_{i'}$ and the codon coding for $A_j$ is denoted by $B_{j'}$. The frequencies of codon pairs $B_{i'} B_{j'}$ corresponding to the seven kinds of secondary structures occurring in ISSD are $g_1^{(i'j')}, g_2^{(i'j')}, g_3^{(i'j')}, g_4^{(i'j')}, g_5^{(i'j')}, g_6^{(i'j')}, g_7^{(i'j')}$ respectively. Define

$$ Sc'(A_i A_j) = \sum_{B_{i'} \in A_i,\, B_{j'} \in A_j} \max_k g_k^{(i'j')} \tag{4.3.2} $$

Here the summation is over all possible codon pairs. If each term satisfies

$$ \max_k g_k^{(i'j')} = g_{k_0}^{(i'j')} \quad (\text{all } i'j') \tag{4.3.3} $$

then $Sc'(A_i A_j) = Sc(A_i A_j)$; otherwise, if one term (or more terms) in the sum of (4.3.2) satisfies

$$ \max_k g_k^{(i'j')} > g_{k_0}^{(i'j')} \tag{4.3.4} $$

then the secondary structure preference of the codon pair $B_{i'} B_{j'}$ is different from that of the amino acid pair $A_i A_j$. The numbers of di-peptides with different $Sc'(A_i A_j) - Sc(A_i A_j)$ are given in Table 4.3-1.
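The definitions (4.3.1) and (4.3.2) can be made concrete with a small example. The sketch below assumes that the frequencies are raw occurrence counts, so that the di-peptide count in each structure is the sum over its codon pairs; the counts in `toy_counts` are invented for illustration and the function name `sc_values` is hypothetical.

```python
# Order of the seven structure classes: aa, bb, CC, aC, Ca, bC, Cb
def sc_values(codon_pair_counts):
    """Return (Sc, Sc') for one di-peptide.

    codon_pair_counts maps each codon pair to its seven structure counts g_k.
    Sc  is the largest summed count p_k over all codon pairs, Eq (4.3.1).
    Sc' is the sum of each codon pair's own largest count, Eq (4.3.2).
    """
    per_structure = [sum(column) for column in zip(*codon_pair_counts.values())]
    sc = max(per_structure)
    sc_prime = sum(max(counts) for counts in codon_pair_counts.values())
    return sc, sc_prime


# Invented counts for two codon pairs of the di-peptide Ala-Lys.
toy_counts = {
    ("GCU", "AAA"): (5, 1, 0, 0, 0, 0, 0),   # prefers the first class
    ("GCC", "AAG"): (1, 3, 0, 0, 0, 0, 0),   # prefers the second class
}

sc, sc_prime = sc_values(toy_counts)
print("Sc =", sc, " Sc' =", sc_prime, " difference =", sc_prime - sc)
```

The difference is zero when every codon pair peaks in the same structure as the di-peptide as a whole, and positive as soon as one codon pair prefers a different structure, as the second pair does here.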
Table 4.3-1 Di-peptide numbers for given Sc'(AiAj) − Sc(AiAj)

Sc'(AiAj) − Sc(AiAj)               0    1    2    3    4    5    6    7    8    ≥9
Amino acid pair number (human)     94   66   55   38   38   24   16   16   9    44
Amino acid pair number (E coli)    81   80   57   43   35   20   25   15   11   33
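The share of di-peptides with a non-zero difference follows from the counts in Table 4.3-1. A minimal tally, with the two rows copied from the table and a hypothetical helper name:

```python
# Rows of Table 4.3-1: counts for differences 0, 1, ..., 8 and >= 9.
human = [94, 66, 55, 38, 38, 24, 16, 16, 9, 44]
ecoli = [81, 80, 57, 43, 35, 20, 25, 15, 11, 33]


def deviant_fraction(counts):
    """Fraction of di-peptides whose difference Sc' - Sc is not zero."""
    return 1 - counts[0] / sum(counts)


for name, row in (("human", human), ("E coli", ecoli)):
    print(f"{name}: {sum(row)} di-peptides, {deviant_fraction(row):.1%} deviant")
```

Both rows sum to 400, the number of ordered amino acid pairs, and the fractions come out close to the text's figure of roughly 77% to 79%.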
So, for about 77% to 79% of di-peptides the structural preferences of some codon pairs are different from those of the amino acid pairs. However, a more important and difficult problem is whether these deviant structural preferences of codons are caused by chance. We shall discuss this point in the following. For a given amino acid pair $A_i A_j$, suppose the three maximal frequencies among the seven kinds of secondary structures are $p_{k_0}^{(ij)} > p_{k_1}^{(ij)} > p_{k_2}^{(ij)}$, where each of $k_0$, $k_1$ and $k_2$ takes one of the structures αα, ββ, CC respectively. Let the normalized frequencies (probabilities) corresponding to $p_{k_0}^{(ij)}$, $p_{k_1}^{(ij)}$, $p_{k_2}^{(ij)}$ be denoted by x, y, z (x > y > z) and the other four frequencies by u, v, w and t (x + y + z + u + v + w + t = 1) respectively. A codon pair (i'j') with normal structural preference (that is, with secondary structure preference in agreement with that of the amino acid pair) satisfies

$$ \max_k g_k^{(i'j')} = g_{k_0}^{(i'j')} \tag{4.3.5} $$

The codon pairs with abnormal structural preference are divided into two groups:

Abnormal Ⅰ
$$ \max_k g_k^{(i'j')} = g_{k_1}^{(i'j')} \tag{4.3.6} $$

Abnormal Ⅱ
$$ \max_k g_k^{(i'j')} = g_{k_2}^{(i'j')} \tag{4.3.7} $$

Let the total frequencies of abnormal Ⅰ codon pairs occurring in $k_0$, $k_1$, $k_2$ and the other four secondary structures be $n_1, n_2, n_3, n_4, n_5, n_6$ and $n_7$ ($n_2 > n_1$, $n_2 > n_3$, $\sum_{i}^{7} n_i = L$) respectively. Let the total frequencies of abnormal Ⅱ codon pairs be $m_1, m_2, m_3, m_4, m_5, m_6$ and $m_7$ ($m_3 > m_1$, $m_3 > m_2$, $\sum_{i}^{7} m_i = M$) respectively.
For a given amino acid pair $A_i A_j$, if the codon pairs occur at random in the process of forming secondary structure, then in N trials (N codon pairs) the frequencies $\nu_1, \nu_2, \nu_3, \nu_4, \nu_5, \nu_6$ and $\nu_7$ in the seven secondary structures obey the multinomial distribution

$$ P(\nu_1, \nu_2, \nu_3, \ldots, \nu_7) = \frac{N!}{\prod_i \nu_i!}\, x^{\nu_1} y^{\nu_2} z^{\nu_3} u^{\nu_4} v^{\nu_5} w^{\nu_6} t^{\nu_7}, \qquad N = \sum_{i}^{7} \nu_i \tag{4.3.8} $$
Applying (4.3.8) to the abnormal Ⅰ group we have

$$ P(n_1, n_2, n_3, \ldots, n_7) = \frac{L!}{\prod_i n_i!}\, x^{n_1} y^{n_2} z^{n_3} u^{n_4} v^{n_5} w^{n_6} t^{n_7} \tag{4.3.9} $$

To estimate the magnitude of Eq (4.3.9) we consider $P(n_2, n_1, n_3, \ldots, n_7)$. It satisfies

$$ P(n_2, n_1, n_3, \ldots, n_7) = \frac{L!}{\prod_i n_i!}\, x^{n_2} y^{n_1} z^{n_3} u^{n_4} v^{n_5} w^{n_6} t^{n_7} \tag{4.3.10} $$

So we have

$$ \frac{P(n_1, n_2, n_3, \ldots, n_7)}{P(n_2, n_1, n_3, \ldots, n_7)} = \left(\frac{y}{x}\right)^{n_2 - n_1} \tag{4.3.11} $$

Likewise we have

$$ \frac{P(n_1, n_2 + 1, n_3 - 1, n_4, n_5, n_6, n_7)}{P(n_2 + 1, n_1, n_3 - 1, n_4, n_5, n_6, n_7)} = \left(\frac{y}{x}\right)^{n_2 - n_1 + 1} < \left(\frac{y}{x}\right)^{n_2 - n_1} \tag{4.3.12} $$

etc.
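The ratio in Eq (4.3.11) can be verified numerically for any particular set of counts. The following sketch uses invented probabilities and counts that respect x > y > z and n2 > n1, n2 > n3; the function name `multinomial_pmf` is an assumption, and only the standard library is used.

```python
import math


def multinomial_pmf(counts, probs):
    """Multinomial probability of the given counts, as in Eq (4.3.8)."""
    coefficient = math.factorial(sum(counts))
    for c in counts:
        coefficient //= math.factorial(c)    # stays an exact integer
    value = float(coefficient)
    for c, p in zip(counts, probs):
        value *= p ** c
    return value


# Invented example: (x, y, z, u, v, w, t) and abnormal-I counts (n1, ..., n7).
probs = (0.40, 0.30, 0.10, 0.05, 0.05, 0.05, 0.05)
counts = (2, 5, 1, 1, 0, 1, 0)

swapped = (counts[1], counts[0]) + counts[2:]          # exchange n1 and n2
ratio = multinomial_pmf(counts, probs) / multinomial_pmf(swapped, probs)
bound = (probs[1] / probs[0]) ** (counts[1] - counts[0])

print(f"P(n1, n2, ...) / P(n2, n1, ...) = {ratio:.6f}")
print(f"(y / x) ** (n2 - n1)            = {bound:.6f}")
```

Exchanging n1 and n2 leaves the multinomial coefficient unchanged, so only the powers of x and y differ, which is why the two printed numbers agree.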
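It leads to

$$ \sum_{\substack{\nu_2 - \nu_1 \ge n_2 - n_1 > 0 \\ \nu_2 = \max \nu_i}} P(\nu_1, \nu_2, \nu_3, n_4, n_5, n_6, n_7) < \left(\frac{y}{x}\right)^{n_2 - n_1} \sum_{\substack{\nu_2 - \nu_1 \ge n_2 - n_1 \\ \nu_2 = \max \nu_i}} P(\nu_2, \nu_1, \nu_3, n_4, n_5, n_6, n_7) \tag{4.3.13} $$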
∑ P (v , v
v2 − v1 ≥ n2 − n1 v2 = max ν i
1
2
y , v3 , n4 , n5 , n6 , n7 )