Protein & Peptide Letters, 2007, 14, 203-208


Using a New Alignment Kernel Function to Identify Secretory Proteins

Hui Liu(a,*), Jie Yang(a), Dan-Qing Liu(a), Hong-Bin Shen(a), and Kuo-Chen Chou(a,b)

(a) Institute of Image Processing & Pattern Recognition, Shanghai Jiaotong University, 200030, China; (b) Gordon Life Science Institute, San Diego, CA 92130, USA

Abstract: Because knowledge of protein signal peptides can be used to reprogram cells in a desired way for gene therapy, signal peptides have become a crucial tool for researchers designing new drugs that target a particular organelle to correct a specific defect. To use this technique effectively, however, we need an automated method that predicts signal peptides and their cleavage sites quickly and accurately, particularly in the post-genomic era, when the number of protein sequences is increasing explosively. The first important step is to discriminate secretory proteins from non-secretory proteins. On the basis of the Needleman-Wunsch algorithm, we propose a new alignment kernel function. The approach can be used effectively to extract the statistical properties of protein sequences for machine learning, leading to a higher prediction success rate.

Keywords: Kernel function, global alignment, support vector machine, signal sequence, cleavage site, scaled window.

*Address correspondence to this author at the Institute of Image Processing & Pattern Recognition, Shanghai Jiaotong University, 200030, China; Tel:/Fax: 86-21-3420-1576; 86-21-6293-3428; E-mail: [email protected]

I. INTRODUCTION

The discovery of signal peptides has had an enormous impact on modern cellular biology and protein science. Knowledge of signal peptides is very useful in developing novel strategies for drug discovery, as well as in revealing molecular mechanisms in basic research (see, e.g., the review [1]). Signal sequences are usually N-terminal extensions, although they can also be located within a protein or at its C-terminal end (e.g., for "tail-anchored" membrane proteins [2]). All secreted proteins, as well as many transmembrane proteins, are synthesized with N-terminal signal peptides. N-terminal signal peptides generally have a tripartite structure consisting of three structurally, and possibly functionally, distinct regions: (a) an N-terminal positively charged n-region, (b) a central hydrophobic h-region, and (c) a neutral but polar c-region, as shown in Fig. 1 [1,3]. Although this tripartite feature might provide useful information for their identification, signal peptides have little sequence similarity and their lengths vary widely, which makes the prediction of signal peptides very difficult.

The earliest computational methods for predicting N-terminal signal peptides were published around 20 years ago and used a weight matrix approach [4,5]. In the mid-1990s, the development of prediction methods shifted to machine learning algorithms. One of the popular methods is SignalP, proposed by Nielsen et al. [6]; its latest version is SignalP 3.0, developed in 2004 [7]. SignalP offers two signal peptide prediction modes: neural networks (NN) and a hidden Markov model (HMM). Chou [8] developed a sequence-encoded algorithm to identify signal peptides. Based on the sequence-encoded model, the first-order Markov-chain algorithm [9] and the subsite coupling algorithm [10,11] were subsequently developed. Meanwhile, the SVM (support vector machine) approach was also introduced [12,13].

Most of the aforementioned methods divide the amino acid sequences into segments with a "scaled window", as shown in Fig. 2. Although such an approach can deal with the variable-length problem, it is more desirable to treat the entire sequence as a single entity. Moreover, the division approach may also produce an imbalanced training dataset. In this paper, we develop a global alignment technique that combines the Needleman-Wunsch algorithm [14] with the kernel function approach to distinguish secretory proteins from non-secretory ones, a key step in determining the signal peptide. The global alignment is equivalent to a pair HMM and is therefore well suited to bioinformatics applications. It should be emphasized that the global alignment is different from the redundancy reduction based on local alignment that was used for constructing the signal peptide datasets [15]. The global alignment approach seeks the best match for the entire sequence from one end to the other, whereas local alignment seeks the best similarity between two subsequence segments.

II. MATERIALS

A consistent assessment of predictive performance requires a reliable benchmark dataset. In this study, the dataset was downloaded from http://www.cbs.dtu.dk/ftp/signalp. The dataset consists of 5 subsets: (1) human, (2) E. coli, (3) eukaryotes, (4) Gram-positive bacteria, and (5) Gram-negative bacteria. The dataset originally constructed by Nielsen et al. [6] consists of 1,939 secretory proteins and 1,440 non-secretory proteins. For the secretory proteins, the sequence of the signal peptide and the first 30 amino acids of the mature protein were included in the dataset, whereas for the non-secretory proteins, the first 70 amino acids of each sequence were included.
Such a sequence truncation, which simplifies the problem, is reasonable because N-terminal signal peptides are usually 15-60 amino acids long [1,16], and it is adopted in the current study as well. However, in the original dataset, the eukaryotic and human subsets share 310 sequences, while the Gram-positive and E. coli subsets share 172 sequences. These duplicate sequences were removed from the dataset to avoid redundancy and bias. The refined dataset obtained through this cleaning procedure consists of 1,647 secretory protein sequences and 1,250 non-secretory protein sequences. Table 1 lists the number of proteins in each of the 5 subsets after removal of all duplicate sequences; their Swiss-Prot codes [17] are given in the Supplementary Materials available through the publisher.

Figure 1. A schematic drawing of the three sub-regions of the tripartite structure of a signal peptide: (a) The n-region usually contains relatively hydrophilic (basic) residues with positive charges. (b) The central h-region is dominated by hydrophobic residues and is hence called "the hydrophobic core". (c) The c-region contains more neutral but polar residues. Reproduced from [1] with permission.

Figure 2. Illustration of the scaled window model. When the window [1, +2] slides along a protein sequence from the N-terminus (a) to the C-terminus (c), the scales on the window are aligned with different amino acids so as to define different peptide segments. When, and only when, scale -1 is aligned with the last residue of the signal sequence and scale +1 with the first residue of the mature protein, as shown in panel (b), is the peptide segment seen within the window regarded as secretion-cleavable. Peptide segments seen within the window in all other cases, such as those shown in panels (a) and (c), are regarded as non-secretion-cleavable. Amino acid residues in the signal part are shown as white characters on a black background, while those in the mature protein are shown as black characters on a white background. Reproduced from [9] with permission.

© 2007 Bentham Science Publishers Ltd.

Table 1. Numbers of Proteins in the Five Subsets (a)

Species          Secretory proteins   Non-secretory proteins   Total
Eukaryotic       810                  711                      1521
Human            416                  251                      667
Gram-negative    175                  105                      280
E. coli          105                  119                      224
Gram-positive    141                  64                       205

(a) The detailed protein codes are given in Supplementary Materials available through the publisher.

III. METHOD

Global Alignment Algorithm

The Needleman-Wunsch algorithm [14] is a well-known algorithm for the global alignment of two strings under a scoring system. In addition to producing the alignment, the scoring system measures the similarity between two sequences, which is consistent with the role of a kernel function as a similarity measure. Therefore, we can convert the scoring system into a kind of kernel. The Needleman-Wunsch algorithm uses dynamic programming to obtain the optimal global alignment. Given a pair of sequences $X$ and $Y$:

$$X = [x_1, x_2, \ldots, x_n] \tag{1}$$

$$Y = [y_1, y_2, \ldots, y_m] \tag{2}$$

where $x_i$ $(i = 1, 2, \ldots, n)$ is the $i$th amino acid residue in the sequence $X$, and $y_j$ $(j = 1, 2, \ldots, m)$ the $j$th amino acid residue in the sequence $Y$.

In order to obtain the globally optimal alignment between $X$ and $Y$, let us construct a scoring matrix:

$$\Gamma = \begin{bmatrix} \gamma_{1,1} & \gamma_{1,2} & \cdots & \gamma_{1,m} \\ \gamma_{2,1} & \gamma_{2,2} & \cdots & \gamma_{2,m} \\ \vdots & \vdots & \ddots & \vdots \\ \gamma_{n,1} & \gamma_{n,2} & \cdots & \gamma_{n,m} \end{bmatrix} \tag{3}$$

where $\gamma_{i,j}$ $(i = 1, 2, \ldots, n;\ j = 1, 2, \ldots, m)$ is the score of the best alignment between the initial segment $X_i$ $([x_1, x_2, \ldots, x_i])$ and the initial segment $Y_j$ $([y_1, y_2, \ldots, y_j])$. Initialize $\gamma_{0,0} = 0$, and then fill the matrix $\Gamma$ from the top left to the bottom right. If $\gamma_{i-1,j-1}$, $\gamma_{i-1,j}$ and $\gamma_{i,j-1}$ are known, then we can further calculate $\gamma_{i,j}$. There are three possible ways of obtaining the best score $\gamma_{i,j}$ [14]: (1) $x_i$ is aligned to $y_j$; (2) $x_i$ is aligned to a gap; (3) $y_j$ is aligned to a gap. Thus, we have:

$$\gamma_{i,j} = \mathrm{Max} \begin{cases} \gamma_{i-1,j-1} + \sigma(x_i, y_j) \\ \gamma_{i-1,j} - \delta(g) \\ \gamma_{i,j-1} - \delta(g) \end{cases} \tag{4}$$

where $\sigma(x_i, y_j)$ is the substitution score of the amino acids $x_i$ and $y_j$ (in this study, BLOSUM50 is chosen); $\delta(g)$ is a linear gap penalty function for a gap of length $g$; and Max means taking the maximum value over the three cases listed above. Eq. (4) is computed repeatedly to fill in the $\gamma_{i,j}$ values of the matrix $\Gamma$. Accordingly, $\gamma_{n,m}$ is the score of the best alignment between the sequences $X$ and $Y$, and hence can be regarded as a measure of their similarity.

Alignment Kernel Function for Amino Acid Sequences
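Because the alignment kernel is built on the global alignment score of Eq. (4), it may help to see that score computed concretely. The following is a minimal sketch, not the authors' implementation: the function name `nw_score`, the toy substitution function `toy_sub`, and the gap penalty of 8 per gapped position are illustrative assumptions (the paper uses BLOSUM50 and does not state its gap penalty here). The first row and column are initialized to multiples of the gap penalty, which is the usual convention for a linear gap penalty and is also an assumption.

```python
from typing import Callable


def nw_score(x: str, y: str,
             sub: Callable[[str, str], float],
             gap: float = 8.0) -> float:
    """Score of the best global alignment of x and y (linear gap penalty).

    Only the final score is returned, so two rows of the scoring
    matrix are kept in memory instead of the full n-by-m table.
    """
    n, m = len(x), len(y)
    # Row 0: the first j residues of y are aligned to gaps (assumed boundary).
    prev = [-gap * j for j in range(m + 1)]
    for i in range(1, n + 1):
        # Column 0: the first i residues of x are aligned to gaps.
        curr = [-gap * i] + [0.0] * m
        for j in range(1, m + 1):
            curr[j] = max(
                prev[j - 1] + sub(x[i - 1], y[j - 1]),  # x_i aligned to y_j
                prev[j] - gap,                          # x_i aligned to a gap
                curr[j - 1] - gap,                      # y_j aligned to a gap
            )
        prev = curr
    return prev[m]


def toy_sub(a: str, b: str) -> float:
    """Hypothetical stand-in for a substitution matrix such as BLOSUM50."""
    return 5.0 if a == b else -4.0


if __name__ == "__main__":
    # Two short, made-up N-terminal fragments.
    print(nw_score("MKTLLA", "MKTALA", toy_sub))
```

Replacing `toy_sub` with a lookup into a real substitution matrix would bring the sketch closer to the setting described in the text; the recurrence itself is unchanged.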

Introducing the kernel function into the SVM simplifies the treatment of the feature space, not only in the computation of inner products but also in the design of the learning machine itself. Another attractive feature of the kernel method is that the learning algorithm and theory can largely be decoupled from the specifics of the application area, which need only be encoded into the design of an appropriate kernel function. Therefore, the kernel method can be extended to a non-Euclidean space. Although kernel functions are widely used, a function $K$ qualifies as a kernel function only if it satisfies the following condition (proposition): Suppose $\Omega = \{\omega_1, \omega_2, \ldots, \omega_K\}$ is a finite input space and $K(r, z)$ $(r, z \in \Omega)$ a symmetric function on $\Omega$; then $K(r, z)$ is a kernel function if, and only if, the matrix

$$\mathbf{K} = \begin{bmatrix} K(\omega_1, \omega_1) & K(\omega_1, \omega_2) & \cdots & K(\omega_1, \omega_K) \\ K(\omega_2, \omega_1) & K(\omega_2, \omega_2) & \cdots & K(\omega_2, \omega_K) \\ \vdots & \vdots & \ddots & \vdots \\ K(\omega_K, \omega_1) & K(\omega_K, \omega_2) & \cdots & K(\omega_K, \omega_K) \end{bmatrix} \tag{5}$$

is positive semi-definite, i.e., has non-negative eigenvalues [18].

Now, suppose $\Pi = \{P_1, P_2, \ldots, P_L\}$ is a set of proteins. We define $S(i, j)$ as the similarity between the sequences $P_i = (R_1^i, R_2^i, \ldots, R_n^i)$ and $P_j = (R_1^j, R_2^j, \ldots, R_m^j)$, as formulated by

$$S(i, j) = \gamma_{n,m} \tag{6}$$

where $\gamma_{n,m}$ is the score value obtained by the Needleman-Wunsch algorithm of Eq. (3). Then, based on Eq. (6), we can further obtain a symmetric similarity matrix $\mathbf{S}$ on the protein set $\Pi$:

$$\mathbf{S} = \begin{bmatrix} S(1,1) & S(1,2) & \cdots & S(1,L) \\ S(2,1) & S(2,2) & \cdots & S(2,L) \\ \vdots & \vdots & \ddots & \vdots \\ S(L,1) & S(L,2) & \cdots & S(L,L) \end{bmatrix} \tag{7}$$

where $S(i, j)$ is the similarity score between proteins $P_i$ and $P_j$ (cf. Eq. 6).

However, $\mathbf{S}$ does not necessarily satisfy the above kernel function proposition (cf. Eq. 5), i.e., it need not be positive semi-definite. To turn it into the alignment kernel function, let us transform the matrix $\mathbf{S}$ according to the following procedure:

(1) Minimum-maximum normalization:

$$S(i, j) = \frac{S(i, j) - \min_{i,j} S(i, j)}{\max_{i,j} S(i, j)} \tag{8}$$

(2) Diagonal normalization:

$$S(i, j) = \frac{S(i, j)}{\sqrt{S(i, i)\, S(j, j)}} \tag{9}$$

(3) Positive definiteness: Suppose the eigenvalues and eigenvectors of the symmetric matrix $\mathbf{S}$ are $\Lambda = (\lambda_1, \lambda_2, \ldots, \lambda_L)$ and $\mathbf{V} = (V_1, V_2, \ldots, V_L)$. Because $\mathbf{S}$ is symmetric, the eigenvalues are real numbers, but we cannot guarantee that $\mathbf{S}$ is a positive definite matrix. In our study, however, we found that $\mathbf{S}$ was positive definite for most of the protein datasets; even when it was not, the negative eigenvalues were few and their absolute values were small. So, with the positive eigenvalues $\lambda_i > 0$ $(i = 1, 2, \ldots, \ell;\ \ell \le L)$, we can define the following symmetric matrix $\mathbf{S}^*$ as the alignment kernel function:

$$\mathbf{S}^* = \mathbf{V} \Lambda^* \mathbf{V}^{-1} \tag{10}$$

where $\Lambda^* = \mathrm{diag}(\lambda_1, \lambda_2, \ldots, \lambda_\ell, 0, \ldots, 0)$ contains $L - \ell$ trailing zeros. (For a real symmetric matrix with orthonormal eigenvectors, $\mathbf{V}^{-1} = \mathbf{V}^T$.)

(4) Transformation to the new space: After the above three steps, $\mathbf{S}^*$ is positive semi-definite. Suppose the elements of $\mathbf{S}^*$ are $S^*(i, j) = H_i^T H_j$, where $H_i = (h_1^i, h_2^i, \ldots, h_L^i)$ is the feature vector for the amino acid sequence $P_i \in \Pi$ $(i, j = 1, 2, \ldots, L)$ and the superscript $T$ denotes the matrix transpose. Then $\lambda_k^* V_k^* = \mathbf{S}^* V_k^*$ can be written as follows:

$$\lambda_k^* v_{k,i} = \sum_{j=1}^{L} S^*(i, j)\, v_{k,j}, \quad (i = 1, 2, \ldots, L) \tag{11}$$

where $V_k^* \in \mathbf{V}^*$ is the $k$th eigenvector of $\mathbf{S}^*$ and $v_{k,i}$ is the $i$th element of $V_k^*$. Right-multiplying both sides of Eq. 11 by $H_i$, we obtain $L$ equations, one for each $i$. Summing these $L$ equations gives

$$\lambda_k^* \sum_{i=1}^{L} v_{k,i} H_i = \left( \sum_{i=1}^{L} H_i H_i^T \right) \left( \sum_{i=1}^{L} v_{k,i} H_i \right) \tag{12}$$

By defining the correlation matrix $\mathbf{C} = \frac{1}{L} \sum_{i=1}^{L} H_i H_i^T$ and the vector $Q_k = \sum_{i=1}^{L} v_{k,i} H_i$, Eq. 12 can be written as $\frac{\lambda_k^*}{L} Q_k = \mathbf{C} Q_k$. Then $\lambda_k^* / L$ and $Q_k$ $(k = 1, 2, \ldots, L)$ are the eigenvalues and eigenvectors of the correlation matrix $\mathbf{C}$, respectively. Thus, the new feature vector for the amino acid sequence $P_i$ $(P_i \in \Pi)$ can be obtained according to the following equation:

$$H_i^* = [H_i \cdot Q_1, H_i \cdot Q_2, \ldots, H_i \cdot Q_L] = (S_i^* \cdot V_1^*, S_i^* \cdot V_2^*, \ldots, S_i^* \cdot V_L^*) \tag{13}$$

where $S_i^*$ $(i = 1, 2, \ldots, L)$ is the $i$th row of the matrix $\mathbf{S}^*$ and $V_i^*$ $(i = 1, 2, \ldots, L)$ the $i$th eigenvector of the matrix $\mathbf{S}^*$.

According to Eq. 13, we can use the derived feature vector $H_i^*$ to represent a test protein sequence $P_i$ in the new space. The SVM (support vector machine) was then trained to predict the secretory proteins. In our experiments, secretory protein sequences were labeled as 1 and non-secretory protein sequences as -1. The kernel function is given by the matrix $\mathbf{S}^*$ (cf. Eq. 10). The parameter $C$ for training the SVM was set at 1000 [12,13].

Table 2. Comparison of Performance for the Benchmark Datasets of Table 1

                 Alignment kernel function   Neural Networks [7]      PSORT [32,55]            SubLoc [35]
Species          OSR(a) (%)   MCC(b)         OSR(a) (%)   MCC(b)      OSR(a) (%)   MCC(b)      OSR(a) (%)   MCC(b)
Eukaryotic       99.3         0.99           93.0         0.87        80.0         0.56        77.0         0.47
Human            99.1         0.98           -            -           -            -           -            -
Gram-negative    96.4         0.93           95.0         0.87        75.0         0.58        91.0         0.78
E. coli          98.7         0.97           -            -           -            -           -            -
Gram-positive    97.1         0.93           97.0         0.92        91.0         0.77        86.0         0.76

(a) OSR, the overall success rate as defined in Eq. 15.
(b) MCC, the Matthews correlation coefficient as defined in Eq. 14.
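The four-step transformation described in the Method section (minimum-maximum normalization, diagonal normalization, discarding negative eigenvalues, and projection onto the eigenvectors) can be sketched with NumPy as follows. This is an illustrative sketch under stated assumptions rather than the authors' code: the function name `alignment_kernel` is hypothetical, the minimum-maximum step divides by the maximum of the unshifted matrix as the extracted formula suggests, and the eigenvector matrix is treated as orthonormal so that its inverse equals its transpose. It also assumes that no diagonal entry equals the global minimum, since the diagonal normalization would otherwise divide by zero.

```python
import numpy as np


def alignment_kernel(S):
    """Turn a symmetric similarity matrix into a PSD kernel and features.

    Returns (S_star, features), where row i of `features` is the new
    feature vector of sequence i (projection of row i of S_star onto
    the eigenvectors).
    """
    S = np.asarray(S, dtype=float)

    # (1) Minimum-maximum normalization (assumed form: shift by the
    #     minimum, divide by the maximum of the original matrix).
    S = (S - S.min()) / S.max()

    # (2) Diagonal normalization: S(i,j) / sqrt(S(i,i) * S(j,j)).
    d = np.sqrt(np.diag(S))
    S = S / np.outer(d, d)

    # (3) Keep only non-negative eigenvalues. eigh returns real
    #     eigenvalues and orthonormal eigenvectors for symmetric input.
    eigvals, V = np.linalg.eigh(S)
    clipped = np.clip(eigvals, 0.0, None)
    S_star = (V * clipped) @ V.T  # V diag(clipped) V^T

    # (4) New feature vectors: row i of S_star dotted with each
    #     eigenvector (S_star shares its eigenvectors with S).
    features = S_star @ V
    return S_star, features


if __name__ == "__main__":
    # Made-up similarity scores for three sequences.
    S = np.array([[3.0, 5.0, 1.0],
                  [5.0, 3.0, 1.0],
                  [1.0, 1.0, 3.0]])
    S_star, features = alignment_kernel(S)
    print(S_star)
    print(features)
```

The resulting `S_star` could be handed to any SVM implementation that accepts a precomputed kernel matrix (for example, scikit-learn's `SVC(kernel="precomputed", C=1000)`), with labels 1 and -1 as in the text; that pairing is a suggestion, not something the paper specifies.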

IV. RESULTS AND DISCUSSION

The power of the predictor was examined by the jackknife test [19]. During jackknifing, each sequence in the benchmark dataset is in turn taken out as a test sample, and the prediction rule is trained on the remaining sequences. In this paper, we use (1) the Matthews correlation coefficient (MCC) [20] and (2) the overall success rate (OSR) to evaluate the performance of the alignment kernel function. The MCC is defined as

$$\mathrm{MCC} = \frac{N_{tp} N_{tn} - N_{fp} N_{fn}}{\sqrt{(N_{tn} + N_{fn})(N_{tn} + N_{fp})(N_{tp} + N_{fn})(N_{tp} + N_{fp})}} \tag{14}$$

where $N_{tp}$ is the number of true positives, i.e., correctly predicted secretory proteins; $N_{tn}$ the number of true negatives, i.e., correctly predicted non-secretory proteins; $N_{fp}$ the number of false positives, i.e., non-secretory proteins falsely predicted as secretory; and $N_{fn}$ the number of false negatives, i.e., secretory proteins falsely predicted as non-secretory.

The overall success rate is defined as:

$$\mathrm{OSR} = \frac{N_{tp} + N_{tn}}{N_{tp} + N_{tn} + N_{fp} + N_{fn}} \tag{15}$$

The results predicted by the alignment kernel function approach for the protein sequences in the five subsets, i.e., human, E. coli, eukaryotes, Gram-positive bacteria, and Gram-negative bacteria, are given in Table 2, where, to facilitate comparison, the corresponding results of the other approaches are also listed. As the table shows, the alignment kernel function approach proposed in this paper outperforms the other methods in both the overall success rate and the Matthews correlation coefficient, indicating that a remarkably high prediction quality can be obtained with the new approach.

Because the current approach rests on a powerful formulation of sequence similarity, the alignment kernel function approach can be expected to apply to many other prediction problems in protein sequence analysis, such as protein structural class [21-28], protein subcellular localization [29-40], sub-nuclear location [41,42], membrane protein type [43-47], enzyme family and subfamily class [48,49], G-protein coupled receptor type [50-53], and protease type [54], among many others.

V. CONCLUSIONS

We propose a new alignment kernel function and apply it to discriminate secretory proteins from non-secretory proteins. The results are very promising, as demonstrated by the improvement in the overall success rates and Matthews correlation coefficients. It has not escaped our notice that the current alignment kernel function approach may also become a useful vehicle for the sequence-based prediction of various attributes of proteins.

REFERENCES

[1] Chou, K. C. (2002) Curr. Prot. Peptide Sci., 3, 615-622.
[2] Kutay, U., Ahnert-Hilger, G., Hartmann, E., Wiedenmann, B., and Rapoport, T. A. (1995) EMBO J., 14, 217-223.
[3] Claros, M. G., Brunak, S., and von Heijne, G. (1997) Curr. Opin. Struct. Biol., 7, 394-398.
[4] McGeoch, D. J. (1985) Virus Res., 3, 271-286.
[5] von Heijne, G. (1986) Nucleic Acids Research, 14, 4683-4690.
[6] Nielsen, H., Engelbrecht, J., Brunak, S., and von Heijne, G. (1997) Protein Eng., 10, 1-6.
[7] Bendtsen, J. D., Nielsen, H., von Heijne, G., and Brunak, S. (2004) J. Mol. Biol., 340, 783-795.
[8] Chou, K. C. (2001) PROTEINS: Structure, Function, and Genetics, 42, 136-139.
[9] Chou, K. C. (2001) Peptides, 22, 1973-1979.
[10] Chou, K. C. (2001) Protein Eng., 14, 75-79.
[11] Liu, H., Yang, J., Ling, J. G., and Chou, K. C. (2005) Biochem. Biophys. Res. Comm., 338, 1005-1011.
[12] Cai, Y. D., Lin, S., and Chou, K. C. (2003) Peptides, 24, 159-161.
[13] Wang, M., Yang, J., and Chou, K. C. (2005) Amino Acids (Erratum, ibid. 2005, 29: 301), 28, 395-402.
[14] Needleman, S. B. and Wunsch, C. D. (1970) J. Mol. Biol., 48, 443-453.
[15] Nielsen, H., Engelbrecht, J., von Heijne, G., and Brunak, S. (1996) Proteins, 24, 165-177.

[16] Kammerer, C. M., VandeBerg, J. L., Haffner, S. M., and Hixson, J. E. (1996) Atherosclerosis, 120, 37-45.
[17] Bairoch, A. and Apweiler, R. (2000) Nucl. Acids Res., 25, 31-36.
[18] Cristianini, N. and Shawe-Taylor, J. An Introduction to Support Vector Machines and Other Kernel-based Learning Methods, Chapter 3; Cambridge University Press, 2000.
[19] Chou, K. C. and Zhang, C. T. (1995) Crit. Rev. Biochem. Mol. Biol., 30, 275-349.
[20] Matthews, B. W. (1975) Biochim. Biophys. Acta, 405, 442-451.
[21] Nakashima, H., Nishikawa, K., and Ooi, T. (1986) J. Biochem., 99, 152-162.
[22] Chou, K. C. (1995) Proteins: Structure, Function & Genetics, 21, 319-344.
[23] Chou, K. C. and Zhang, C. T. (1994) J. Biol. Chem., 269, 22014-22020.
[24] Zhou, G. P. (1998) J. Prot. Chem., 17, 729-738.
[25] Luo, R. Y., Feng, Z. P., and Liu, J. K. (2002) Eur. J. Biochem., 269, 4219-4225.
[26] Zhou, G. P. and Assa-Munt, N. (2001) PROTEINS: Structure, Function, and Genetics, 44, 57-59.
[27] Sun, X. D. and Huang, R. B. (2006) Amino Acids, 30, 469-475.
[28] Chen, C., Zhou, X., Tian, Y., Zou, X., and Cai, P. (2006) Anal. Biochem., 357, 116-121.
[29] Cedano, J., Aloy, P., Pérez-Pons, J. A., and Querol, E. (1997) J. Mol. Biol., 266, 594-600.
[30] Chou, K. C. and Elrod, D. W. (1999) Protein Eng., 12, 107-118.
[31] Nakai, K. and Kanehisa, M. (1992) Genomics, 14, 897-911.
[32] Nakai, K. and Horton, P. (1999) Trends in Biochemical Science, 24, 34-36.
[33] Nakai, K. (2000) Advances in Protein Chemistry, 54, 277-344.
[34] Chou, K. C. (2001) PROTEINS: Structure, Function, and Genetics (Erratum: ibid., 2001, Vol. 44, 60), 43, 246-255.
[35] Hua, S. and Sun, Z. (2001) Bioinformatics, 17, 721-728.
[36] Zhou, G. P. and Doctor, K. (2003) PROTEINS: Structure, Function, and Genetics, 50, 44-48.
[37] Xiao, X., Shao, S. H., Ding, Y. S., Huang, Z. D., and Chou, K. C. (2006) Amino Acids, 30, 49-54.
[38] Chou, K. C. and Shen, H. B. (2006) J. Cell. Biochem., 99, 517-527.
[39] Chou, K. C. and Shen, H. B. (2006) Biochem. Biophys. Res. Commun., 347, 150-157.
[40] Chou, K. C. and Shen, H. B. (2006) J. Proteome Res., 5, 1888-1897.
[41] Shen, H. B. and Chou, K. C. (2005) Biochem. Biophys. Res. Comm., 337, 752-756.
[42] Lei, Z. and Dai, Y. (2005) BMC Bioinformatics, 6, 291.
[43] Feng, Z. P. and Zhang, C. T. (2000) J. Protein Chem., 19, 269-275.
[44] Liu, H., Wang, M., and Chou, K. C. (2005) Biochem. Biophys. Res. Commun., 336, 737-739.
[45] Shen, H. B., Yang, J., and Chou, K. C. (2006) J. Theoretical Biol., 240, 9-13.
[46] Shen, H. B. and Chou, K. C. (2005) Biochem. Biophys. Res. Commun., 334, 288-292.
[47] Wang, M., Yang, J., Xu, Z. J., and Chou, K. C. (2005) J. Theoretical Biol., 232, 7-15.
[48] Chou, K. C. and Elrod, D. W. (2003) J. Proteome Res., 2, 183-190.
[49] Chou, K. C. and Cai, Y. D. (2004) Protein Science, 13, 2857-2863.
[50] Chou, K. C. and Elrod, D. W. (2002) J. Proteome Res., 1, 429-433.
[51] Chou, K. C. (2005) J. Proteome Res., 4, 1413-1418.
[52] Wen, Z., Li, M., Li, Y., Guo, Y., and Wang, K. (2006) Amino Acids, DOI 10.1007/S00726-006-0341-y.
[53] Guo, Y. Z., Li, M., Lu, M., Wen, Z., Wang, K., Li, G., and Wu, J. (2006) Amino Acids, 30, 397-402.
[54] Chou, K. C. and Cai, Y. D. (2006) Biochem. Biophys. Res. Comm., 339, 1015-1020.
[55] Gardy, J. L., Spencer, C., Wang, K., Ester, M., Tusnady, G. E., Simon, I., Hua, S., deFays, K., Lambert, C., Nakai, K., and Brinkman, F. S. (2003) Nucl. Acids Res., 31, 3613-3617.

Received: November 11, 2006; Revised: December 22, 2006; Accepted: January 02, 2007
