Protein & Peptide Letters, 2007, 14, 871-875


Digital Coding of Amino Acids Based on Hydrophobic Index

Xuan Xiao1,* and Kuo-Chen Chou2

1Computer Department, Jing-De-Zhen Ceramic Institute, Jing-De-Zhen 33300, China; 2Gordon Life Science Institute, 13784 Torrey Del Mar Drive, San Diego, CA 92130, USA

Abstract: Analysis of amino acid sequences can provide useful insights into the tertiary structures of proteins and their biological functions. One of the critical problems in amino acid analysis is how to establish a digital coding system to better reflect the properties of amino acids and their degeneracy. Based on the hydrophobic index, a one-to-one relationship has been established between the amino acid sequence and the digital signal process. Such a “bridge” will make it possible to apply all the existing powerful methods in the signal processing area to analysis of the amino acid sequences.

Keywords: Amino acid digital coding, hydrophobic index, sequence analysis, signal process, pseudo amino acid composition.

I. INTRODUCTION

The success of the human genome project has generated a deluge of sequence information. Sequence databases, such as GenBank and EMBL, have been growing at an exponential rate [1-2]. This explosion of biological data has challenged the ability of biologists and computer scientists to analyze the data quickly. In general, gene sequences are stored in computer database systems in the form of long character strings. Reading these sequences with the naked eye would proceed at a snail's pace, and it is very hard to extract any key features by directly reading such long character strings. However, if they can be converted into digital signals, many important features can be automatically manifested and easily studied by means of the existing tools of information theory [3].

Biological information can be analyzed on several levels, such as the nucleotide sequence, the protein sequence, and the genome. Amino acid sequence analysis can provide important insights into the tertiary structures of proteins and their functions. Viewing protein synthesis as an information processing system allows amino acid sequences to be analyzed as messages without considering the physicochemical elements of information processing [4]. There are many established information processing methods that can be used for the analysis of amino acid sequences. In fact, the digital signal processing approach has been used in a number of protein prediction tasks, such as the prediction of subcellular location [5-6] and of structural classes, as will be described later. Digital coding of amino acids can be modeled as a communication channel with the amino acid sequence as the input and a binary (0/1) digital signal as the channel output.
One of the critical problems is how to model the digital coding of amino acids so that it better reflects their properties and degeneracy. Many models of amino acid digital encoding have been built. Cristea (2001) proposed a representation of the genetic code that converts DNA sequences into digital signals and uses a base-four representation of the nucleotides. It leads to the conversion of the codons into numbers in the range 0-63 and of the amino acids (together with the terminator) into numbers in the range 0-20 [7]. According to this model, the 20 amino acids and the terminator are coded as: F=0, L=1, S=2, Y=3, end=4, C=5, W=6, P=7, H=8, Q=9, R=10, I=11, M=12, T=13, N=14, K=15, V=16, A=17, D=18, E=19, G=20. This model better reflects amino acid structure and degeneracy, but the genetic signals built from genes with this model show low autocorrelation.

Pan et al. also proposed an amino acid coding for predicting protein subcellular localization through the stochastic signal processing approach [8]. For simplicity, their model is: A=10, C=20, D=30, E=40, F=50, G=60, H=70, I=80, K=90, L=100, M=110, N=120, P=130, Q=140, R=150, S=160, T=170, V=180, W=190, Y=200. Although these two procedures can encode a protein sequence as a series of digital signals, they only distinguish the amino acids from one another and do not take their physicochemical properties into account.

When Sofer predicted the secondary structure of proteins using genetic algorithms, he assigned one or two five-digit codes to each amino acid, because the rules of genetic algorithms are often encoded as binary strings [9]. Amino acids with similar properties have similar code words. A shortcoming of this model, however, is that the correspondence between amino acids and digital codes is not one-to-one: according to this rule, 12 amino acids each have two possible digital codes. To overcome this shortcoming, Nikola built an encoding model according to the molecular recognition theory [10].

In the current paper, a digital coding approach is proposed based on the amino acid hydrophobicity index and information theory. It not only takes into account the physicochemical properties of amino acids but also makes each of them correspond to one, and only one, digital code.

*Address correspondence to this author at the Computer Department, Jing-De-Zhen Ceramic Institute, Jing-De-Zhen 33300, China; E-mail: [email protected]

© 2007 Bentham Science Publishers Ltd.


II. METHOD

1. Amino-Acid Index

An amino-acid index is a set of 20 numerical values representing one of the various physicochemical properties of the 20 amino acids. A total of 402 sets of amino-acid indices were collected by Tomii and Kanehisa [11]; unfortunately, none of them is universally applicable, although many of them concur with each other on the classification of a particular amino acid. They analyzed the relationships among the amino acid indices in the 402 sets by single-linkage hierarchical cluster analysis, and found that these indices can be clustered into the following six groups: (A) α-helix and tight-turn [12] propensities, (B) β-strand propensity, (C) amino acid composition, (H) hydrophobicity, (P) physicochemical properties, and (O) other properties such as the frequency of left-handed helix.

The hydrophobic amino acids tend to repel the aqueous environment, and therefore reside predominantly in the interior of proteins. Amino acids of this type neither ionize nor participate in the formation of H-bonds. The hydrophilic amino acids, which tend to interact with the aqueous environment, are often involved in the formation of H-bonds and are predominantly found on the exterior surfaces of proteins or in the reactive centers of enzymes. In fact, the hydrophobicity of amino acids is not only one of the major factors that influence amino acid substitution during evolution, but is also able to show the periodicity of the secondary structure [13].

Using auto-correlation functions based on the profile of an amino-acid index along the primary sequence of the query protein (domain), Bu et al. [14] predicted the protein structural classes and found that most of the amino-acid indices lead to considerably lower accuracy than that obtained with the Oobatake-Ooi index and the hydrophobic index of Ponnuswamy. Only nine indices yielded predicted results similar to those of the Ponnuswamy index.
The ten amino-acid indices consist of one physicochemical property, one α-helix and tight-turn propensity, one β-strand propensity, and seven hydrophobicity indices, indicating that the relation between amino acid hydrophobicity and protein structural class is very strong. A formulation of the autocorrelation functions based on the hydrophobicity index of the 20 amino acids has also been used to predict membrane protein types [15], where it was reported that higher predicted accuracy could be obtained with two sets of hydrophobicity indices, the Ponnuswamy index and the Hopp index. Table 1 shows the five hydrophobicity indices of the 20 amino acids that lead to the higher overall predicted accuracy. The amino acid indices listed in Table 1 include decimal fractions and negative values. Therefore, they do not satisfy the information coding principle and cannot be deemed appropriate digital codes for amino acids. Nevertheless, these results point to an intriguing approach based on the Ponnuswamy hydrophobicity index.

2. Optimal Model of Amino Acids Digital Coding

It is well known that all the proteins occurring in living organisms are composed of a total of just 20 different chemical building blocks (amino acids). Information theory makes it possible to determine the smallest number of bits per word that allows unambiguous identification of all amino acids. Words of 4 bits would contain too little information, since they can distinguish only 2^4 = 16 states. Words of 6 bits would be unnecessarily complicated. According to information theory, five bits per word are sufficient and therefore give the most economical method of coding. Five binary digits can represent at most 32 states, from which we have to select 20. According to combinatorics, there are $C_{32}^{20}$ such selections.
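The word-length argument above can be checked with a few lines of arithmetic. The following is only an illustrative sketch: the helper name `min_bits` is hypothetical, and the number of selections is computed as the binomial coefficient, following the expression in the text.

```python
import math


def min_bits(n_symbols: int) -> int:
    """Smallest word length (in bits) that gives every symbol its own code."""
    return math.ceil(math.log2(n_symbols))


n_amino_acids = 20
bits = min_bits(n_amino_acids)      # 4 bits give only 16 words, so 5 are needed
n_states = 2 ** bits                # number of distinct 5-bit words
# Number of ways to choose 20 of the available words (binomial coefficient).
n_selections = math.comb(n_states, n_amino_acids)

print(bits, n_states, n_selections)
```

The binomial coefficient counts which 20 words are used; it does not count the different ways of assigning the chosen words to particular amino acids.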

The coding principle we adopt is that the larger the Ponnuswamy hydrophobicity index of an amino acid, the greater its digital code. Thus, in ascending order of the Ponnuswamy hydrophobicity index, the 20 amino acids are arranged as follows: K, N, D, E, P, Q, R, S, T, G, A, H, W, Y, F, L, M, I, V, C. Because the Ponnuswamy hydrophobicity index of amino acid K is 5.72, we start the digital coding system at the number six. The difference in hydrophobicity index between any two adjacent amino acids in the above sequence is less than 0.45 except for K-N, H-W, Y-F, and V-C. If the difference in hydrophobicity index between two amino acids is small, the two amino acids should be placed close together in the digital coding system, and vice versa. Based on this principle, an optimal coding system is formed as shown in Table 2. It can be seen from Table 2 that the larger the Ponnuswamy hydrophobicity index of an amino acid, the greater its digital code.

In the standard genetic code there are only two one codon-one amino acid (non-degenerate) mappings, for tryptophan and methionine; of the remaining amino acids, nine are twofold, one is threefold, five are fourfold, and three are sixfold degenerate. Judging from the frequency of the amino acids in proteins, it is obvious that the genetic code presents the features of an entropic coding.

III. APPLICATION: PREDICTION OF PROTEIN STRUCTURAL CLASSES

Prediction of protein structural class is an important topic in protein science [20-23]. Many different methods have been proposed for this purpose. Chou et al. [24-25] demonstrated that the interaction among the components of the amino acid composition is an important driving force [26] in determining the structural class of a protein during the sequence folding process, and it was observed that the success rates of the covariant discriminant algorithm in recognizing protein structural classes are significantly higher than those of other algorithms.
Table 1. Hydrophobicity Indices of the 20 Amino Acids. One-Letter Codes are Used to Denote Amino Acids

Amino acid   Biou et al. (a)   Kyte and Doolittle (b)   Ponnuswamy (c)   Ponnuswamy (d)   Wold (e)
A             16                1.8                     12.28             7.62             0.07
C            168                2.5                     14.93            10.93             0.71
D            -78               -3.5                     10.97             6.18             3.64
E           -106               -3.5                     11.19             6.38             3.08
F            189               -3.5                     10.97             6.18             3.64
G            -13               -0.4                     12.01             7.31             2.23
H             50               -3.2                     12.84             7.85             2.41
I            151                4.5                     12.77             9.99            -4.44
K           -141               -3.9                     10.80             5.72             2.84
L            145                3.8                     14.10             9.37            -4.19
M            124                1.9                     14.33             9.83            -2.49
N            -74               -3.5                     11.00             6.17             3.22
P            -20               -1.6                     11.19             6.64            -1.22
Q            -73               -3.5                     11.28             6.67             2.18
R            -70               -4.5                     11.49             6.81             2.88
S            -70               -0.8                     11.26             6.93             1.96
T            -38               -0.7                     11.65             7.08             0.92
V            123                4.2                     15.07            10.38            -2.69
W            145               -0.9                     12.95             8.41            -4.75
Y             53               -1.3                     13.29             8.53            -1.39

(a) Information value for accessibility; average fraction 35% [16]. (b) Hydropathy index [17]. (c) Surrounding hydrophobicity in folded form [18]. (d) Average gain in surrounding hydrophobicity [18]. (e) Principal property value [19].

However, in the above approaches, a protein sample is represented by the conventional amino acid (AA) composition. Obviously, if the AA composition is used to represent a protein sample, all its sequence-order effects are lost. To avoid completely losing the sequence-order information, the pseudo amino acid (PseAA) composition was introduced [27]. Since then, various kinds of PseAA composition have been proposed to improve the prediction quality for various protein attributes (see, e.g., [28-33]). Owing to its wide application, a web server called PseAA was recently established at http://chou.med.harvard.edu/bioinf/PseAA/, by which users can generate many different types of PseAA composition as they wish. Here we shall introduce a new type of PseAA composition based on the current digital coding system, as formulated below.

Given a protein sequence, we can generate a series of digital signals according to Table 2 and define the value of its complexity measure factor. The complexity measure factor has been used in predicting protein subcellular location. The complexity of a sequence can be measured by the minimal number of steps required for its synthesis in a certain process. The advantage of incorporating the complexity measure factor as one of the pseudo amino acid components of a protein is that it can reflect the overall sequence-order feature more effectively than the conventional correlation factors. Now, by following exactly the same procedure as described by Chou [27] and Xiao et al. [34], a protein P can be expressed by a vector or a point in a (20 + λ)-D = (20 + 1)-D = 21-D space; i.e.,

$$P = (p_1, p_2, \ldots, p_{20}, p_{21})^{T} \tag{1}$$

where T is the transpose operator, and

$$p_k = \begin{cases} \dfrac{f_k}{\sum_{i=1}^{20} f_i + w f_{21}}, & (1 \le k \le 20) \\[2ex] \dfrac{w f_{21}}{\sum_{i=1}^{20} f_i + w f_{21}}, & (k = 21) \end{cases} \tag{2}$$

where $f_i$ (i = 1, 2, …, 20) are the occurrence frequencies of the 20 native amino acids in a protein, $f_{21} = \mathrm{CLS}(S)$ is the complexity measure factor that can be derived for a given protein sequence according to the procedure described in [34], and w is the weight factor. The standard vector for the subset $G^{\xi}$ is defined by

$$\bar{P}^{\xi} = \begin{bmatrix} \bar{p}_1^{\xi} \\ \bar{p}_2^{\xi} \\ \vdots \\ \bar{p}_{21}^{\xi} \end{bmatrix}, \qquad (\xi = \alpha, \beta, \alpha/\beta, \alpha+\beta) \tag{3}$$
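Eqs. (1) and (2) can be illustrated with a short sketch. Several points here are assumptions made for illustration only: the function name `pse_aa_vector` is hypothetical, the 20 components are ordered alphabetically by one-letter code, the occurrence frequencies are normalized by the sequence length, and the complexity measure factor is passed in by the caller because the procedure of [34] is not reproduced here.

```python
from typing import List

# Assumed ordering of the first 20 components (alphabetical one-letter codes).
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def pse_aa_vector(sequence: str, complexity: float, w: float) -> List[float]:
    """Return [p_1, ..., p_21] as in Eqs. (1)-(2).

    `complexity` plays the role of f_21 and `w` is the weight factor.
    """
    seq = sequence.upper()
    # f_i: occurrence frequency of each amino acid (assumed normalized by length).
    freqs = [seq.count(aa) / len(seq) for aa in AMINO_ACIDS]
    denominator = sum(freqs) + w * complexity
    # The first 20 components follow the upper branch of Eq. (2),
    # the 21st component follows the lower branch.
    return [f / denominator for f in freqs] + [w * complexity / denominator]


vector = pse_aa_vector("AAKK", complexity=2.0, w=0.5)
print(len(vector), sum(vector))
```

By construction the 21 components sum to one, so the weight factor w controls how much of the vector is taken up by the complexity component relative to the amino acid composition.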

Table 2. Digital Codes of 20 Native Amino Acids

Type        Code
Character   K      N      D      E      P      Q      R      S      T      G
Decimal     6      8      9      10     11     12     13     14     15     16
Binary      00110  01000  01001  01010  01011  01100  01101  01110  01111  10000

Character   A      H      W      Y      F      L      M      I      V      C
Decimal     17     18     20     21     23     24     26     27     28     30
Binary      10001  10010  10100  10101  10111  11000  11010  11011  11100  11110
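As an illustration of how Table 2 might be applied, the sketch below converts a protein sequence into its series of decimal codes and into the corresponding 5-bit binary signal. The names `TABLE2_CODES`, `encode`, `to_bits`, and `decode` are hypothetical; only the code values are taken from Table 2.

```python
from typing import List

# Decimal codes copied from Table 2 (one-letter amino acid -> digital code).
TABLE2_CODES = {
    "K": 6, "N": 8, "D": 9, "E": 10, "P": 11, "Q": 12, "R": 13, "S": 14,
    "T": 15, "G": 16, "A": 17, "H": 18, "W": 20, "Y": 21, "F": 23, "L": 24,
    "M": 26, "I": 27, "V": 28, "C": 30,
}
# Inverse mapping; it exists because the coding is one-to-one.
CODE_TO_RESIDUE = {code: aa for aa, code in TABLE2_CODES.items()}


def encode(sequence: str) -> List[int]:
    """Map a protein sequence to its series of decimal codes."""
    # A residue that is not in Table 2 raises KeyError.
    return [TABLE2_CODES[aa] for aa in sequence.upper()]


def to_bits(sequence: str) -> str:
    """Concatenate the 5-bit binary words of all residues."""
    return "".join(format(code, "05b") for code in encode(sequence))


def decode(bits: str) -> str:
    """Recover the sequence from a string of 5-bit words."""
    words = [bits[i:i + 5] for i in range(0, len(bits), 5)]
    return "".join(CODE_TO_RESIDUE[int(word, 2)] for word in words)


print(encode("MKV"))
print(to_bits("MKV"))
print(decode(to_bits("MKV")))
```

Because each amino acid has exactly one code, decoding the binary signal recovers the original sequence.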

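The similarity between the standard vector $\bar{P}^{\xi}$ and the protein P is characterized by the covariant discriminant, as defined by

$$F(P, \bar{P}^{\xi}) = D^{2}(P, \bar{P}^{\xi}) + \ln\left(\lambda_2^{\xi} \lambda_3^{\xi} \cdots \lambda_{21}^{\xi}\right) \tag{4}$$

where the first term is the squared Mahalanobis distance between P and $\bar{P}^{\xi}$, and the second term reflects the difference of the covariance matrices for different subsets, in which $\lambda_i^{\xi}$ is the i-th eigenvalue of the covariance matrix $C_{\xi}$ [35]. Accordingly, the prediction rule is formulated by

$$F(P, \bar{P}^{\mu}) = \mathrm{Min}\left\{F(P, \bar{P}^{\alpha}),\ F(P, \bar{P}^{\beta}),\ F(P, \bar{P}^{\alpha/\beta}),\ F(P, \bar{P}^{\alpha+\beta})\right\} \tag{5}$$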
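The decision rule of Eqs. (4) and (5) can be sketched in a few dozen lines, under assumptions that should be read carefully. All function names are hypothetical. The sketch computes the second term as ln|C|, the logarithm of the product of all eigenvalues, which requires a non-singular covariance matrix; Eq. (4) instead starts the product at the second eigenvalue, so this is a simplified stand-in rather than the procedure of [34,35]. The covariance matrix is estimated with an n - 1 denominator, and the example uses toy two-dimensional vectors rather than 21-D PseAA vectors.

```python
import math
from typing import Dict, List, Sequence, Tuple

Vector = Sequence[float]


def mean_and_covariance(samples: Sequence[Vector]) -> Tuple[List[float], List[List[float]]]:
    """Standard vector (class mean) and sample covariance matrix of one subset."""
    n, d = len(samples), len(samples[0])
    mean = [sum(s[j] for s in samples) / n for j in range(d)]
    cov = [[sum((s[a] - mean[a]) * (s[b] - mean[b]) for s in samples) / (n - 1)
            for b in range(d)] for a in range(d)]
    return mean, cov


def solve_with_logdet(matrix: Sequence[Vector], rhs: Vector) -> Tuple[List[float], float]:
    """Gaussian elimination with partial pivoting.

    Returns (x, ln|det(matrix)|) with matrix @ x = rhs.
    A singular matrix makes math.log raise ValueError.
    """
    d = len(matrix)
    aug = [list(row) + [rhs[i]] for i, row in enumerate(matrix)]
    logdet = 0.0
    for col in range(d):
        pivot = max(range(col, d), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        logdet += math.log(abs(aug[col][col]))
        for r in range(col + 1, d):
            factor = aug[r][col] / aug[col][col]
            for c in range(col, d + 1):
                aug[r][c] -= factor * aug[col][c]
    x = [0.0] * d
    for i in reversed(range(d)):
        tail = sum(aug[i][j] * x[j] for j in range(i + 1, d))
        x[i] = (aug[i][d] - tail) / aug[i][i]
    return x, logdet


def discriminant(p: Vector, mean: Vector, cov: Sequence[Vector]) -> float:
    """Squared Mahalanobis distance plus ln|C| (stand-in for Eq. (4))."""
    diff = [pi - mi for pi, mi in zip(p, mean)]
    x, logdet = solve_with_logdet(cov, diff)   # x = C^{-1} (p - mean)
    return sum(di * xi for di, xi in zip(diff, x)) + logdet


def predict(p: Vector, training: Dict[str, Sequence[Vector]]) -> str:
    """Return the class with the smallest discriminant value, as in Eq. (5)."""
    scores = {}
    for label, samples in training.items():
        mean, cov = mean_and_covariance(samples)
        scores[label] = discriminant(p, mean, cov)
    return min(scores, key=scores.get)


# Toy two-class example (made-up numbers, not data from the paper).
toy = {
    "class_1": [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (1.0, 1.0)],
    "class_2": [(5.0, 5.0), (6.0, 5.0), (5.0, 6.0), (6.0, 6.0)],
}
print(predict((0.4, 0.6), toy))
```

For normalized composition vectors the covariance matrix can be singular, in which case this simplified version fails and an eigenvalue-based treatment such as the one indicated by Eq. (4) would be needed.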

In Eq. (5), μ can be α, β, α/β, or α+β; Min means taking the least one among those in the braces; and the superscript μ denotes the structural class to which the protein P is predicted to belong. The details of the algorithm can be found in [34,35].

As a demonstration, let us use the same dataset studied by many previous authors. It consists of 204 proteins, of which 52 are all-α, 61 all-β, 45 α/β, and 46 α+β. Their PDB codes are given in Table 2 of Chou [26]. We used jackknife cross-validation to examine the performance of the current approach. This is because, among the independent dataset test, the sub-sampling (e.g., 5-fold sub-sampling) test, and the jackknife test, which are often used for examining the accuracy of a statistical prediction method, the jackknife test is deemed the most rigorous and objective, as analyzed in a comprehensive review [36], and has been increasingly adopted by investigators to test the power of various prediction methods (see, e.g., [37-47]). The results thus obtained are listed in Table 3, where, to facilitate comparison, the corresponding results obtained by the other methods are also given. It can be seen from Table 3 that the current approach yielded the best overall success rate, because the digital coding model on which the current method is based better reflects the physicochemical properties of amino acids and their degeneracy.

Table 3. The Overall Predictive Accuracy in the Jackknife Test for the 3 Sets of Amino Acid Digital Codes

Method: augmented covariant discriminant algorithm (all rows)

Digital coding     All-α            All-β            α/β              α+β              Overall
Cristea [7]        43/52 = 82.7%    55/61 = 90.16%   44/45 = 97.78%   40/46 = 86.95%   182/204 = 89.21%
Xiao et al. [34]   43/52 = 82.7%    55/61 = 90.2%    45/45 = 100%     40/46 = 87.0%    183/204 = 89.7%
This paper         43/52 = 82.7%    57/61 = 93.44%   45/45 = 100%     41/46 = 89.13%   186/204 = 91.17%

CONCLUSIONS

This paper introduces an optimal symbolic-to-digital mapping for amino acids based on the hydrophobicity index and information theory. The model developed from the current coding system can also be used to predict a series of other features of proteins, such as protein subcellular localization [48], membrane protein type [49], protein signal peptide [50], enzyme family class [51-53], GPCR type [54-57], and protease type [58].

ACKNOWLEDGEMENTS

This study was supported by the grants from the National Natural Science Foundation of China (No. 60661003), and the Province National Natural Science Foundation of JiangXi (No. 0611060). The corresponding author would like to express his gratitude to two anonymous reviewers for their constructive comments, which were very helpful for improving the presentation of this paper.

REFERENCES

[1] Venter, J. C., Smith, H. O. and Hood, L. (1996) Nature, 381, 364.
[2] Chou, K. C. (2004) Curr. Med. Chem., 11, 2105.
[3] Xiao, X., Shao, S. H., Ding, Y., Huang, Z., Chen, X. and Chou, K. C. (2005) Amino Acids, 28, 29.
[4] Ramon, R. R., Pedro, B. and Jose, L. O. (1996) Pattern Rec., 29, 1187.
[5] Xiao, X., Shao, S. H., Ding, Y. and Chou, K. C. (2005) Amino Acids, 28, 57.
[6] Xiao, X., Shao, S. H. and Chou, K. C. (2006) Amino Acids, 30, 49.
[7] Cristea, P. (2001) SPIE Conference BIOS 2001-International Biomedical Optics Symposium, San Jose, USA, pp. 20-26.
[8] Pan, Y. X., Zhang, Z. Z., Guo, Z. M., Huang, Z. D. and He, L. (2003) J. Prot. Chem., 22, 395.
[9] Sofer, W. H., http://waksman.rutgers.edu/Waks/Sofer/sofer.html.
[10] Nikola, S. (1998) Croat. Chem. Acta, 71, 573.
[11] Tomii, K. and Kanehisa, M. (1996) Protein Eng., 9, 27.
[12] Chou, K. C. (2000) Analytical Biochem., 286, 1.
[13] Cornette, J. L., Cease, K. B., Margalit, H., Spouge, J. L., Berzofsky, J. A. and Delisi, C. (1987) J. Mol. Biol., 195, 659.
[14] Bu, W. S., Feng, Z. P., Zhang, Z. D. and Zhang, C. T. (1999) Eur. J. Biochem., 266, 1043.
[15] Feng, Z. P. and Zhang, Z. T. (2000) J. Prot. Chem., 19, 269.
[16] Biou, V., Gibrat, J. F., Levin, J. M., Robson, B. and Garnier, J. (1988) Protein Eng., 2, 185.
[17] Kyte, J. and Doolittle, R. F. (1982) J. Mol. Biol., 157, 105.
[18] Ponnuswamy, P. K., Prabhakaran, M. and Manavalan, P. (1980) Biochem. Biophys. Acta, 623, 301.
[19] Wold, S., Eriksson, L. and Hellberg, S. (1987) Can. J. Chem., 65, 1814.
[20] Chou, K. C. (2000) Curr. Prot. Pept. Sci., 1, 171.
[21] Chou, K. C. and Zhang, C. T. (1994) J. Biol. Chem., 269, 22014.
[22] Shen, H. B. and Chou, K. C. (2006) Bioinformatics, 22, 1717.
[23] Shen, H. B., Yang, J., Liu, X. J. and Chou, K. C. (2005) Biochem. Biophys. Res. Commun., 334, 577.
[24] Chou, K. C. (1995) PROTEINS: Struct. Funct. Gene, 21, 319.
[25] Chou, K. C. and Maggiora, G. M. (1998) Protein Eng., 11, 523.
[26] Chou, K. C. (1999) Biochem. Biophys. Res. Commun., 264, 216.
[27] Chou, K. C. (2001) PROTEINS: Struct. Funct. Gene, 43, 246.
[28] Chen, C., Tian, Y. X., Zou, X. Y. and Mo, J. Y. (2006) J. Theor. Biol., 243, 444.
[29] Du, P. and Li, Y. (2006) BMC Bioinformatics, 7, 518.
[30] Mondal, S., Bhavna, R., Mohan Babu, R. and Ramakumar, S. (2006) J. Theor. Biol., 243, 252.
[31] Lin, H. and Li, Q. Z. (2007) Biochem. Biophys. Res. Commun., 354, 548.
[32] Pu, X., Guo, J., Leung, H. and Lin, Y. (2007) J. Theor. Biol., 247, 259–265.
[33] Chen, Y. L. and Li, Q. Z. (2007) J. Theor. Biol., 245, 775.
[34] Xiao, X., Shao, S. H., Huang, Z. D. and Chou, K. C. (2006) J. Comp. Chem., 27, 478.
[35] Chou, K. C. and Elrod, D. W. (1999) Protein Eng., 12, 107.
[36] Chou, K. C. and Zhang, C. T. (1995) Crit. Revi. Biochem. Mol. Biol., 30, 275.
[37] Zhou, G. P. (1998) J. Protein Chem., 17, 729.
[38] Zhou, G. P. and Assa-Munt, N. (2001) PROTEINS: Struct. Funct. Gene, 44, 57.
[39] Chou, K. C. and Shen, H. B. (2006) Biochem. Biophys. Res. Commun., 347, 150.
[40] Kedarisetti, K. D., Kurgan, L. A. and Dick, S. (2006) Biochem. Biophys. Res. Commun., 348, 981.
[41] Chou, K. C. and Shen, H. B. (2006) J. Proteome Res., 5, 1888.
[42] Chou, K. C. and Shen, H. B. (2007) J. Cell. Biochem., 100, 665.
[43] Shen, H. B. and Chou, K. C. (2007) Biopolymers, 85, 233.
[44] Shen, H. B. and Chou, K. C. (2007) Biochem. Biophys. Res. Commun., 355, 1006.
[45] Chou, K. C. and Shen, H. B. (2007) J. Proteome Res., 6, 1728.
[46] Chou, K. C. and Shen, H. B. (2007) Biochem. Biophys. Res. Commun., 357, 633.
[47] Chou, K. C. and Shen, H. B. (2007) Biochem. Biophys. Res. Commun., 360, 339.
[48] Shen, H. B., Yang, J. and Chou, K. C. (2007) Amino Acids, 33, 57.
[49] Chou, K. C. and Cai, Y. D. (2005) J. Chem. Inform. Model., 45, 407.
[50] Chou, K. C. and Shen, H. B. (2007) Biochem. Biophys. Res. Commun., 357, 633.
[51] Chou, K. C. (2005) Bioinformatics, 21, 10.
[52] Chou, K. C. and Cai, Y. D. (2005) Protein Sci., 13, 2857.
[53] Zhou, X. B., Chen, C., Li, Z. C. and Zou, X. Y. (2007) J. Theoret. Biol., doi:10.1016/j.jtbi.2007.1006.1001.
[54] Chou, K. C. and Elrod, D. W. (2002) J. Prot. Res., 1, 429.
[55] Chou, K. C. (2005) J. Prot. Res., 4, 1413.
[56] Gao, Q. B. and Wang, Z. Z. (2006) Prot. Eng. Des. Sel., 19, 511.
[57] Wen, Z., Li, M., Li, Y., Guo, Y. and Wang, K. (2006) Amino Acids, 32, 277.
[58] Chou, K. C. and Cai, Y. D. (2006) Biochem. Biophys. Res. Commun., 339, 1015-1020.

Received: May 30, 2007
Revised: July 02, 2007
Accepted: July 03, 2007