Analytical Biochemistry 373 (2008) 386–388 (www.elsevier.com/locate/yabio)
Notes & Tips
PseAAC: A flexible web server for generating various kinds of protein pseudo amino acid composition
Hong-Bin Shen a,b,*, Kuo-Chen Chou a,b
a Gordon Life Science Institute, San Diego, CA 92130, USA
b Institute of Image Processing and Pattern Recognition, Shanghai Jiaotong University, Shanghai 200030, China
Received 2 August 2007 Available online 13 October 2007
Abstract
The pseudo amino acid (PseAA) composition can represent a protein sequence in a discrete model without completely losing its sequence-order information, and hence has been widely applied for improving the prediction quality for various protein attributes. However, different problems may call for different kinds of PseAA composition. Here, we present a web server called PseAAC at http://chou.med.harvard.edu/bioinf/PseAA/, by which users can generate various kinds of PseAA composition to best fit their needs.
© 2007 Elsevier Inc. All rights reserved.
* Corresponding author. Current address: BCMP, Harvard Medical School, Boston, MA 02115, USA. E-mail address: [email protected] (H.-B. Shen).
¹ Abbreviations used: PseAA, pseudo amino acid; AA, amino acid.
0003-2697/$ - see front matter © 2007 Elsevier Inc. All rights reserved. doi:10.1016/j.ab.2007.10.012

Pseudo amino acid (PseAA)¹ composition was originally introduced to improve the prediction quality for protein subcellular localization and membrane protein type [1]. PseAA composition, or PseAAC, can be used to represent a protein sequence with a discrete model yet without completely losing its sequence order information. According to its definition, the PseAA composition of a given protein sample is represented by a set of more than 20 discrete factors, where the first 20 factors represent the components of its conventional amino acid (AA) composition while the additional factors incorporate some of its sequence order information via various modes. Typically, these additional factors are a series of rank-different correlation factors along a protein chain [1,2], but they can also be any combination of other factors so long as they can reflect some sort of sequence order effects one way or the other.

Since the concept of PseAA composition was introduced, various PseAA composition approaches have been developed for enhancing the prediction quality of protein
attributes, including protein subcellular localization [3–7], protein structural class [8–11], protein oligomer type [12,13], protein subnuclear localization [14,15], protein submitochondria localization [16], conotoxin superfamily classification [17,18], membrane protein type [19–23], apoptosis protein subcellular localization [24,25], enzyme functional classification [2,26–28], protein fold pattern [29], and signal peptide [30,31].

In the existing approaches, the PseAA components can be roughly categorized into two types: correlation and extraction. Either type first requires converting a protein from a character sequence to a numerical sequence. Various physicochemical AA indexes, such as hydrophobicity, hydrophilicity, and side chain mass, are often used to make such a conversion. Once the numerical sequence is available, its PseAA components of correlation type can be easily generated by the routine correlation equations [1,2,29]. However, the PseAA components of extraction type, such as those generated through cellular automata [4,32], complexity measure factor [10,33], Fourier spectrum [19], and other special functions [3], require special treatment. In this short note, our attention is focused on the correlation type only. Even for the correlation type, there are still many different kinds of PseAA components. For simplicity,
hereafter the PseAA components will always refer to the correlation type.

In view of its many different applications and forms, we have developed a flexible web server called PseAAC, by which the user can generate various kinds of PseAA composition as he or she wishes. This note also serves as a protocol, providing the user with step-by-step instructions for using the PseAAC web server. For the comprehensive principles and various prediction algorithms, the reader is referred to two recent review articles by Chou and Shen [34] and Shen and coworkers [35].

Three different types of parameters are often used to generate various kinds of PseAA composition: quantitative characters of AAs, weight factor, and rank of correlation.

The following six AA characters are supported by the current PseAAC server to calculate the correlations between AAs at different positions along the protein chain: (1) hydrophobicity, (2) hydrophilicity, (3) side chain mass, (4) pK of the α-COOH group, (5) pK of the α-NH3+ group, and (6) pI at 25 °C. Any non-empty subset of these six characters may be selected, so the total number of possible combinations is as follows:

\[
C(6,1) + C(6,2) + C(6,3) + C(6,4) + C(6,5) + C(6,6)
= \frac{6!}{(6-1)!\,1!} + \frac{6!}{(6-2)!\,2!} + \frac{6!}{(6-3)!\,3!} + \frac{6!}{(6-4)!\,4!} + \frac{6!}{(6-5)!\,5!} + \frac{6!}{(6-6)!\,6!} = 63. \tag{1}
\]

The user can select any of the 63 combinations as part of the input.

The weight factor is designed for the user to put weight on the additional PseAA components with respect to the conventional AA components. The user can select any value within the range from 0.05 to 0.70 for the weight factor.

The counted rank (or tier) of the correlation along a protein sequence is usually represented by λ [1]. Note that (i) λ must be smaller than the length of the input protein sequence; (ii) λ must be a non-negative integer; and (iii) when λ = 0, the output of the PseAAC server reduces to the conventional AA composition.

For a given protein sequence, the current web server can generate three different types of PseAA composition.
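As a rough illustration of the three parameters described above, the following minimal Python sketch counts the non-empty subsets of the six AA characters and checks a weight factor and a correlation rank λ against the stated limits. The names `AA_CHARACTERS`, `character_subsets`, and `validate_parameters` are hypothetical and are not part of the PseAAC server; treating both ends of the weight range as allowed is also an assumption.

```python
from itertools import combinations

# The six AA characters listed in the text (labels are illustrative).
AA_CHARACTERS = (
    "hydrophobicity", "hydrophilicity", "side_chain_mass",
    "pK_alpha_COOH", "pK_alpha_NH3", "pI_25C",
)


def character_subsets(characters=AA_CHARACTERS):
    """Return every non-empty subset of the available AA characters."""
    return [
        subset
        for size in range(1, len(characters) + 1)
        for subset in combinations(characters, size)
    ]


def validate_parameters(selected, weight, lam, seq_len):
    """Raise ValueError if the parameters violate the limits given in the text."""
    if not selected or not set(selected) <= set(AA_CHARACTERS):
        raise ValueError("select at least one of the six supported AA characters")
    # Assumption: both ends of the 0.05-0.70 range are allowed.
    if not 0.05 <= weight <= 0.70:
        raise ValueError("weight factor must lie between 0.05 and 0.70")
    if not isinstance(lam, int) or lam < 0:
        raise ValueError("lambda must be a non-negative integer")
    if lam >= seq_len:
        raise ValueError("lambda must be smaller than the sequence length")


print(len(character_subsets()))  # 63, in agreement with Eq. (1)
```

Enumerating the subsets explicitly, rather than only summing binomial coefficients, also yields the concrete combinations a user could choose from.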
Type 1 PseAA composition is also called the parallel correlation type. The final output for type 1 PseAA composition will have (20 + λ) discrete numbers, where λ is the rank of correlation [1].

Type 2 PseAA composition is also called the series correlation type. The final output for type 2 PseAA composition will have (20 + n · λ) discrete numbers, where n is the number of AA characters selected by the user; for example, when hydrophobicity and hydrophilicity are selected, n = 2 [2].

Type 3 PseAA composition is also called the dipeptide type [11,36]. The final output for type 3 PseAA composition will have (20 + 400 = 420) discrete numbers. For generating type 3 PseAA composition, the user need not select any values for the aforementioned parameters.

The user should proceed according to the following steps:
[Figure 1: screenshot of the web form "PseAAC: Generating pseudo amino acid composition" (links: Read Me, Citation). Fields shown: PseAA mode (Type 1, Type 2, Dipeptide composition); amino acid character (Hydrophobicity, Hydrophilicity, Mass, pK1 (α-COOH), pK2 (NH3), pI (at 25 °C)); weight factor (0.05 shown); λ parameter; input box for protein sequences in FASTA format (maximum 50 proteins for each submission); buttons Submit and Clear all.]
Fig. 1. Illustration showing the PseAAC web page at http://chou.med.harvard.edu/bioinf/PseAAC/.
1. Click on http://chou.med.harvard.edu/bioinf/PseAAC/ and you will see the top page of the PseAAC web server on your screen, as shown in Fig. 1.
2. Either type or copy and paste the protein sequences into the input box at the lower center of Fig. 1. Each input sequence should be in FASTA format, an example of which can be seen by clicking on the "?" button right above the input box. The maximum number of protein sequences allowed for each submission is 50.
3. Select the type of PseAA composition you wish to generate. For example, if you click the button to the left of "Type 1," the server will generate the type 1, or parallel correlation type, PseAA composition.
4. Select the AA characters. For example, if you click the buttons to the left of "Hydrophobicity," "Hydrophilicity," and "Mass," these three characters will be used to calculate the sequence correlation factors.
5. Select a value within the range from 0.05 to 0.70 for "Weight factor."
6. Type the desired number for λ into the small box to the right of "Lambda parameter." Note that the number entered for λ must be smaller than the number of AAs in the protein sequence concerned [1].
7. Click on the "Submit" button; the PseAA composition for 50 protein sequences will be returned in less than 10 s. The PseAA composition generated should consist of (20 + λ) discrete factors, which are the components of type 1 PseAA composition as selected in step 3.
8. If in step 3 you click the "Type 2" button, you will get the type 2, or series correlation type, PseAA composition, which consists of (20 + n · λ) discrete factors, where n = 3 because in step 4 you clicked three different buttons to characterize the AAs.
9. If in step 3 you click the "Type 3" button, you will get the type 3, or dipeptide type, PseAA composition, which consists of 420 discrete factors.
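The input and output conventions in the steps above can be sketched offline. The following Python fragment is an illustration only, not the server's code: `read_fasta`, `expected_length`, and `MAX_PROTEINS` are hypothetical names, and the FASTA parsing is deliberately minimal (header lines start with `>`, and the sequence lines that follow are concatenated).

```python
MAX_PROTEINS = 50  # submission limit stated in step 2


def read_fasta(text, limit=MAX_PROTEINS):
    """Parse FASTA text into a list of (header, sequence) pairs."""
    records = []
    header, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                records.append((header, "".join(chunks)))
            header, chunks = line[1:], []
        elif header is None:
            raise ValueError("sequence data found before the first '>' header")
        else:
            chunks.append(line.upper())
    if header is not None:
        records.append((header, "".join(chunks)))
    if len(records) > limit:
        raise ValueError(f"at most {limit} proteins per submission")
    return records


def expected_length(mode, n_characters=0, lam=0):
    """Number of discrete factors in the output for each PseAA type."""
    if mode == 1:   # parallel correlation: 20 + lambda
        return 20 + lam
    if mode == 2:   # series correlation: 20 + n * lambda
        return 20 + n_characters * lam
    if mode == 3:   # dipeptide: 20 + 400
        return 420
    raise ValueError("mode must be 1, 2, or 3")


example = ">demo protein\nMKTAYIAKQR\nQISFVKSHFS\n"
records = read_fasta(example)
print(records[0][1], expected_length(2, n_characters=3, lam=5))
```

With three AA characters selected as in step 4 and, say, λ = 5, the type 2 output would contain 20 + 3 · 5 = 35 factors. Comparing the length of a downloaded result against `expected_length` is a quick sanity check that the intended mode and parameters were submitted.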
Getting an optimal descriptor for protein samples is the key to improving the prediction quality for their attributes. However, the optimal descriptor for protein samples may be different in dealing with different problems. The current web server was designed in an attempt to provide the user with a flexible generator for various kinds of PseAA composition. For instance, the current PseAAC server allows the user to generate 63 different parallel correlation types of PseAA composition and 63 different series correlation types of PseAA composition as well as the dipeptide PseAA composition. The user can apply the PseAAC server to generate the PseAA composition that best fits his or her need. The PseAAC server will be updated in a timely manner by incorporating new scales and/or indexes to characterize AAs (see, e.g., Ref. [37]).
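To make the parallel correlation (type 1) descriptor more concrete, here is a minimal Python sketch written after the general form of the definition in Ref. [1] as it is commonly stated: each selected AA character is standardized over the 20 native amino acids, the λ correlation factors are mean squared property differences between residues k positions apart, and the 20 + λ components are normalized using the weight factor. This is an illustration under those assumptions, not the server's implementation. The names `pseaac_type1` and `standardize` are hypothetical, the sequence is assumed to contain only the 20 standard one-letter codes, and the scale in the usage lines is an arbitrary placeholder rather than real physicochemical data.

```python
from statistics import mean, pstdev

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def standardize(scale):
    """Rescale a {residue: value} table to zero mean and unit spread over the 20 AAs."""
    values = [scale[aa] for aa in AMINO_ACIDS]
    mu, sigma = mean(values), pstdev(values)
    return {aa: (scale[aa] - mu) / sigma for aa in AMINO_ACIDS}


def pseaac_type1(sequence, scales, lam, weight):
    """Return 20 + lam numbers: weighted AA frequencies plus correlation factors."""
    if not 0 <= lam < len(sequence):
        raise ValueError("lam must be non-negative and smaller than the sequence length")
    tables = [standardize(s) for s in scales]
    length = len(sequence)

    # Conventional AA composition (frequencies summing to 1).
    freqs = [sequence.count(aa) / length for aa in AMINO_ACIDS]

    # theta_k: mean squared property difference between residues k apart,
    # averaged over the selected characters.
    thetas = []
    for k in range(1, lam + 1):
        total = 0.0
        for i in range(length - k):
            a, b = sequence[i], sequence[i + k]
            total += mean((t[b] - t[a]) ** 2 for t in tables)
        thetas.append(total / (length - k))

    denom = sum(freqs) + weight * sum(thetas)
    return [f / denom for f in freqs] + [weight * t / denom for t in thetas]


# Placeholder scale: NOT real physicochemical values.
toy_scale = {aa: float(i) for i, aa in enumerate(AMINO_ACIDS)}
vector = pseaac_type1("MKTAYIAKQRQISFVKSHFS", [toy_scale], lam=3, weight=0.05)
print(len(vector))  # 23
```

In this sketch the components sum to 1, and with `lam=0` the function returns the plain 20-component AA composition, in line with the degenerate case noted for λ = 0.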
References
[1] K.C. Chou, Prediction of protein cellular attributes using pseudo amino acid composition, Proteins Struct. Funct. Genet. 43 (2001) 246–255 (Erratum, 2001, vol. 44, p. 60).
[2] K.C. Chou, Using amphiphilic pseudo amino acid composition to predict enzyme subfamily classes, Bioinformatics 21 (2005) 10–19.
[3] Y. Gao, S.H. Shao, X. Xiao, Y.S. Ding, Y.S. Huang, Z.D. Huang, K.C. Chou, Using pseudo amino acid composition to predict protein subcellular location: Approached with Lyapunov index, Bessel function, and Chebyshev filter, Amino Acids 28 (2005) 373–376.
[4] X. Xiao, S.H. Shao, Y.S. Ding, Z.D. Huang, K.C. Chou, Using cellular automata images and pseudo amino acid composition to predict protein subcellular location, Amino Acids 30 (2006) 49–54.
[5] K.C. Chou, H.B. Shen, Euk-mPLoc: A fusion classifier for large-scale eukaryotic protein subcellular location prediction by incorporating multiple sites, J. Proteome Res. 6 (2007) 1728–1734.
[6] H.B. Shen, K.C. Chou, Hum-mPLoc: An ensemble classifier for large-scale human protein subcellular location prediction by incorporating samples with multiple sites, Biochem. Biophys. Res. Commun. 355 (2007) 1006–1011.
[7] K.C. Chou, H.B. Shen, Large-scale plant protein subcellular location prediction, J. Cell. Biochem. 100 (2007) 665–678.
[8] C. Chen, X. Zhou, Y. Tian, X. Zou, P. Cai, Predicting protein structural class with pseudo-amino acid composition and support vector machine fusion network, Anal. Biochem. 357 (2006) 116–121.
[9] C. Chen, Y.X. Tian, X.Y. Zou, P.X. Cai, J.Y. Mo, Using pseudo-amino acid composition and support vector machine to predict protein structural class, J. Theor. Biol. 243 (2006) 444–448.
[10] X. Xiao, S.H. Shao, Z.D. Huang, K.C. Chou, Using pseudo amino acid composition to predict protein structural classes: Approached with complexity measure factor, J. Computat. Chem. 27 (2006) 478–482.
[11] H. Lin, Q.Z. Li, Using pseudo amino acid composition to predict protein structural class: Approached by incorporating 400 dipeptide components, J. Computat. Chem. 28 (2007) 1463–1466.
[12] K.C. Chou, Y.D. Cai, Predicting protein quaternary structure by pseudo amino acid composition, Proteins Struct. Funct. Genet. 53 (2003) 282–289.
[13] S.W. Zhang, Q. Pan, H.C. Zhang, Z.C. Shao, J.Y. Shi, Prediction protein homo-oligomer types by pseudo amino acid composition: Approached with an improved feature extraction and naive Bayes feature fusion, Amino Acids 30 (2006) 461–468.
[14] H.B. Shen, K.C. Chou, Predicting protein subnuclear location with optimized evidence-theoretic K-nearest classifier and pseudo amino acid composition, Biochem. Biophys. Res. Commun. 337 (2005) 752–756.
[15] P. Mundra, M. Kumar, K.K. Kumar, V.K. Jayaraman, B.D. Kulkarni, Using pseudo amino acid composition to predict protein subnuclear localization: Approached with PSSM, Pattern Recogn. Lett. 28 (2007) 1610–1615.
[16] P. Du, Y. Li, Prediction of protein submitochondria locations by hybridizing pseudo-amino acid composition with various physicochemical features of segmented sequence, BMC Bioinform. 7 (2006) 518.
[17] S. Mondal, R. Bhavna, R. Mohan Babu, S. Ramakumar, Pseudo amino acid composition and multi-class support vector machines approach for conotoxin superfamily classification, J. Theor. Biol. 243 (2006) 252–260.
[18] H. Lin, Q.Z. Li, Predicting conotoxin superfamily and family by using pseudo amino acid composition and modified Mahalanobis discriminant, Biochem. Biophys. Res. Commun. 354 (2007) 548–551.
[19] H. Liu, M. Wang, K.C. Chou, Low-frequency Fourier spectrum for predicting membrane protein types, Biochem. Biophys. Res. Commun. 336 (2005) 737–739.
[20] H.B. Shen, K.C. Chou, Using optimized evidence-theoretic K-nearest neighbor classifier and pseudo amino acid composition to predict membrane protein types, Biochem. Biophys. Res. Commun. 334 (2005) 288–292.
[21] S.Q. Wang, J. Yang, K.C. Chou, Using stacked generalization to predict membrane protein types based on pseudo amino acid composition, J. Theor. Biol. 242 (2006) 941–946.
[22] H.B. Shen, J. Yang, K.C. Chou, Fuzzy KNN for predicting membrane protein types from pseudo amino acid composition, J. Theor. Biol. 240 (2006) 9–13.
[23] K.C. Chou, H.B. Shen, MemType-2L: A web server for predicting membrane proteins and their types by incorporating evolution information through Pse-PSSM, Biochem. Biophys. Res. Commun. 360 (2007) 339–345.
[24] Y.L. Chen, Q.Z. Li, Prediction of apoptosis protein subcellular location using improved hybrid approach and pseudo amino acid composition, J. Theor. Biol. 248 (2007) 377–381.
[25] Y.L. Chen, Q.Z. Li, Prediction of the subcellular location of apoptosis proteins, J. Theor. Biol. 245 (2007) 775–783.
[26] K.C. Chou, Y.D. Cai, Predicting enzyme family class in a hybridization space, Protein Sci. 13 (2004) 2857–2863.
[27] X.B. Zhou, C. Chen, Z.C. Li, X.Y. Zou, Using Chou's amphiphilic pseudo-amino acid composition and support vector machine for prediction of enzyme subfamily classes, J. Theor. Biol. 248 (2007) 546–551.
[28] H.B. Shen, K.C. Chou, EzyPred: A top-down approach for predicting enzyme functional classes and subclasses, Biochem. Biophys. Res. Commun. 364 (2007) 53–59.
[29] H.B. Shen, K.C. Chou, Ensemble classifier for protein fold pattern recognition, Bioinformatics 22 (2006) 1717–1722.
[30] K.C. Chou, H.B. Shen, Signal-CF: A subsite-coupled and window-fusing approach for predicting signal peptides, Biochem. Biophys. Res. Commun. 357 (2007) 633–640.
[31] H.B. Shen, K.C. Chou, Signal-3L: A 3-layer approach for predicting signal peptide, Biochem. Biophys. Res. Commun. 363 (2007) 297–303.
[32] X. Xiao, S. Shao, Y. Ding, Z. Huang, X. Chen, K.C. Chou, Using cellular automata to generate Image representation for biological sequences, Amino Acids 28 (2005) 29–35.
[33] X. Xiao, S. Shao, Y. Ding, Z. Huang, Y. Huang, K.C. Chou, Using complexity measure factor to predict protein subcellular location, Amino Acids 28 (2005) 57–61.
[34] K.C. Chou, H.B. Shen, Recent progress in protein subcellular location prediction, Anal. Biochem. 370 (2007) 1–16.
[35] H.B. Shen, J. Yang, K.C. Chou, Methodology development for predicting subcellular localization and other attributes of proteins, Expert Rev. Proteom. 4 (2007) 453–463.
[36] W. Liu, K.C. Chou, Protein secondary structural content prediction, Protein Eng. 12 (1999) 1041–1050.
[37] L.A. Kurgan, W. Stach, J. Ruan, Novel scales based on hydrophobicity indices for secondary protein structure, J. Theor. Biol. 248 (2007) 354–366.