Bioinformatics to Predict Protein Structure and Function REVIEW
Bioinformatics Methods to Predict Protein Structure and Function A Practical Approach
Yvonne J. K. Edwards* and Amanda Cottage
Abstract
Protein structure prediction by using bioinformatics can involve sequence similarity searches, multiple sequence alignments, identification and characterization of domains, secondary structure prediction, solvent accessibility prediction, automatic protein fold recognition, constructing three-dimensional models to atomic detail, and model validation. Not all protein structure prediction projects involve the use of all these techniques. A central part of a typical protein structure prediction is the identification of a suitable structural target from which to extrapolate three-dimensional information for a query sequence. The way in which this is done defines three types of projects. The first involves the use of standard and well-understood techniques. If a structural template remains elusive, a second approach using nontrivial methods is required. If a target fold cannot be reliably identified because inconsistent results have been obtained from nontrivial data analyses, the project falls into the third type of project and will be virtually impossible to complete with any degree of reliability. In this article, a set of protocols to predict protein structure from sequence is presented and distinctions among the three types of project are given. These methods, if used appropriately, can provide valuable indicators of protein structure and function.
Index Entries: Molecular modeling; sequence similarity searches; multiple sequence alignment; identification and characterization of domains; secondary structure prediction; solvent accessibility prediction; automatic protein fold recognition.
1. Introduction
The first step in a typical protein structure prediction is to establish if a protein sequence or part of a protein sequence has any similarities with sequences of known structures present in the Protein Data Bank (PDB) (1,2). Typically, protein structures are experimentally determined and classified at the level of the domain (1–5) (see Note 1). Comparative molecular modeling or homology modeling is the most successful and accurate method for protein structure prediction (see Notes 2 and 3). If a protein structure prediction can be based on comparative molecular modeling, this should be the method of choice (see Fig. 1). In the absence of high sequence identity between sequence and structural homologs (see Notes 4 and 5), deciding what constitutes significant sequence similarity is not an easy task. This type of prediction can be classed as “nontrivial” (see Table 1). The most promising methods for solving this problem involve performing sensitive sequence searches and characterizing sequence compatibility with the structural environments in known secondary and tertiary protein structure (also known as “1D-2D-3D” compatibility matching methods). Sensitive searches help identify weak similarities between the sequence of interest and sequences of structural homologs. The “1D-2D-3D” compatibility matching methods include secondary structure and solvent accessibility predictions as well as automatic protein fold recognition. Such methods can be useful in predicting common structural folds for proteins that share little or no sequence similarity (see Fig. 1). However, at low levels of
*Author to whom all correspondence and reprint requests should be addressed: Dr. Y. J. K. Edwards, Research Division, UK Human Genome Mapping Project Resource Center, Wellcome Trust Genome Campus, Hinxton, Cambridge, CB10, 1SB, England, UK. Tel.: 01223-494531, Fax: 01223-494512. E-mail:
[email protected]. Web: http://fugu.hgmp.mrc.ac.uk Molecular Biotechnology 2003 Humana Press Inc. All rights of any nature whatsoever reserved. 1073–6085/2003/23:2/139–166/$20.00
MOLECULAR BIOTECHNOLOGY
Volume 23, 2003
Fig. 1. The flowchart highlights the steps involved in constructing 3D structural models from protein sequences by using bioinformatics. Predictions using standard methods are the most reliable when they can be applied. The sensitive search and "1D-2D-3D" compatibility matching methods are nontrivial and can add more value to the sequences when the standard techniques do not reliably identify a structural template for the query sequence.
Table 1
Types of Protein Structure Prediction Projects (a)
Description of project based on the
outcome of the following analysis          Standard   Nontrivial   Virtually Impossible
1. Identification of sequence homolog      Yes        Yes          Yes/No
2. Identification of structural homolog    Yes        No           No
3. Mapped domain boundaries                Yes/No     Yes          No
4. Biochemical characterization            Yes/No     Yes/No       No
(a) A protein structure prediction is described as standard, nontrivial, or virtually impossible. A standard protein structure prediction identifies a structural homolog for the query sequence using well-understood and commonly used techniques. In this type of project, the protein fold can be predicted with a high degree of confidence. If a structural homolog cannot be reliably identified using standard methods (see Fig. 1), nontrivial data mining methods need to be used. A nontrivial structure prediction requires the use of more sensitive searches and mapping properties of the query sequence with features of known secondary and tertiary structure. In these projects, much biochemical characterization is required to support and validate the prediction. If a target fold cannot be reliably identified due to inconsistent results obtained from nontrivial data mining methods, the project falls into the third type of project and will be virtually impossible to complete with any degree of reliability.
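The decision logic summarized in Table 1 can be sketched in a few lines of Python. This is only an illustrative reading of the table and its footnote: the class name, the field names, and the idea of collapsing "inconsistent results" into a single boolean are assumptions made for the example, not part of the published protocol.

```python
from dataclasses import dataclass


@dataclass
class Evidence:
    """Outcome of the analyses in Table 1 (field names are hypothetical)."""
    structural_homolog: bool        # row 2: found by standard searches?
    domain_boundaries_mapped: bool  # row 3: are the domain limits known?
    fold_methods_consistent: bool   # do the nontrivial methods agree on a fold?


def classify_project(ev: Evidence) -> str:
    """Return 'standard', 'nontrivial', or 'virtually impossible'."""
    if ev.structural_homolog:
        # A reliable structural homolog makes comparative modeling possible.
        return "standard"
    if ev.domain_boundaries_mapped and ev.fold_methods_consistent:
        # No homolog, but sensitive searches and 1D-2D-3D matching agree.
        return "nontrivial"
    # No homolog and no consistent fold assignment.
    return "virtually impossible"
```

Under this reading, a region with a structural homolog is classed as standard regardless of the other rows, which mirrors the Yes/No entries in the Standard column.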
Fig. 2. Fugu PRGFR1 domains. The PRGFR1 protein has an amino-terminal signal peptide (SIG). The sema domain occurs in semaphorins, a large family of secreted and transmembrane proteins. The PSI domain contains a cysteine-rich repeat found in several different extracellular receptors. The PSI domain is located in Plexins, Semaphorins, and Integrins. The IPT domains are predicted to adopt an Immunoglobulin-like fold also detected in Plexins and Transcription factors (7). The juxtamembrane domain (JUXTA) follows the transmembrane region (Tx). All PRGFR homologs have the conserved tyrosine kinase region and a carboxyl-terminal docking site. The latter includes two conserved tyrosine residues known to be essential for intracellular signaling.
sequence similarity, the structures of proteins sharing a common fold diverge to such an extent that the accuracy of models built by comparative techniques is significantly lower. The 1425 amino-acid residues derived from the translated coding sequence of Fugu rubripes (Fugu) plasminogen-related growth factor receptor 1 (PRGFR1) (6) are used to describe how to predict features of structure and function by using bioinformatics. The PRGFR family has a role in embryogenesis, tissue regeneration, and neoplasia. Although some domains of this protein family have been well delineated at the sequence and structural level, many have only recently been characterized (see Fig. 2). Methods to characterize protein domains and structure predictions in the PRGFR1 homologs (see Note 6) are described. We predict, to atomic detail, the protein tyrosine kinase structure and one of the IPT domains (the Immunoglobulin-like fold shared by domains in Plexins and some Transcription factors) (7). The two predictions fall into the “standard” and “nontrivial” categories, respectively (see Table 1). The regions of PRGFR1 not considered, the sema and PSI domains (see Fig. 2), fall into the “virtually impossible” category. That is, for these regions the predicted solvent accessibility and secondary structure did not match the solvent accessibility and secondary structure of the fold obtained from protein fold recognition methods. Protocols for various protein structure prediction techniques are described (see Tables 1–6). The steps involved in validating and refining structural models (see Table 7), the expected accuracy, and the weaknesses and strengths of the methods are highlighted. The methods outlined can be applied to most protein structure predictions.
2. Materials
It is assumed that you would like to predict structural features for a protein sequence. The time and computational resources you intend to invest in a protein structure prediction project and the scientific questions you would like to address are not presumed. The aim of this article is not to be too specific but to describe the available choices, thus enabling an informed decision about the tools to use.
2.1. Internet Computing Systems for Protein Structure Prediction
Access to a computing system designed for bioinformatics analysis is required. For example, a user account on a computer system running a Unix-flavored operating system (e.g., Solaris, Linux, or IRIX), sufficient memory, disk space, and applications (including an editor, a multiple sequence alignment program, a sequence similarity search program, and access to up-to-date biological sequence, structure, and bibliographic databanks) are required. Such facilities are available to registered users of the UK Human Genome Mapping Project Resource Centre (HGMP-RC) Bioinformatics facilities (see Table 2). An internet connection and use of a Web browser such as Netscape or Internet Explorer are needed. Search engines such as Google or Yahoo enable keyword searches to find web sites hosting bioinformatics
Table 2
Databanks to Classify Protein Structures, Domains, Folds, and Function (a)
Databank   Ref.    Information     URL
ProDom     (8)     Sequence        http://protein.toulouse.inra.fr/prodom.html
Pfam       (9)     Sequence        http://www.sanger.ac.uk/Software/Pfam/
                                   http://pfam.wustl.edu/
                                   http://www.cgr.ki.se/Pfam/
SMART      (9)     Sequence        http://smart.embl-heidelberg.de/
PDB        (1,2)   Structure       http://msd.ebi.ac.uk/
                                   http://www.rcsb.org/pdb/
CATH       (3)     Structure       http://www.biochem.ucl.ac.uk/bsm/cath/
SCOP       (4)     Structure       http://scop.mrc-lmb.cam.ac.uk/scop/
3Dee       (11)    Structure       http://jura.ebi.ac.uk:8080/3Dee/help/help_intro.html
FSSP       (12)    Structure       http://www.ebi.ac.uk/dali/fssp/
HSSP       (13)    Structure       http://www.sander.embl-heidelberg.de/hssp/
Prosite    (14)    Function        http://www.expasy.ch/prosite/
Prints     (15)    Function        http://www.biochem.ucl.ac.uk/dbbrowser/PRINTSPRINTS.html
Blocks     (16)    Function        http://www.blocks.fhcrc.org/
Rasmol     (17)    Visualization   http://www.umass.edu/microbio/rasmol/
BIDS       –       Bibliographic   http://www.bids.ac.uk/
PubMed     –       Bibliographic   http://www.ncbi.nlm.nih.gov/PubMed/
(a) A registered user of the HGMP-RC bioinformatics facilities (http://www.hgmp.mrc.ac.uk/) can access a program called PIX, which identifies domains and functional features in protein sequences using many of the above bioinformatics tools.
Table 3
Descriptions of Nonredundant (nr) Protein Sequence Databanks (a)
Databank   Ref.   Composite Databanks
SPTR       (18)   SwissProt, SPTREMBL, TREMBLNEW
SWALL      –      SwissProt, TREMBL, TREMBLNEW
OWL        (19)   SwissProt, PIR, GenPept, NRL3D (20)
NCBI nr    –      SwissProt, nr-GenPept, PDB (1,2), PIR, PRF
(a) Entries with identical sequences are merged. GenPept is produced by extracting the translated coding regions in GenBank. The PIR database is a protein sequence database founded by The Protein Information Resource, National Biomedical Research Foundation, Georgetown University Medical Center, US. PRF is the databank of protein sequences created by the Protein Research Foundation, Osaka, Japan.
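The footnote to Table 3 states that entries with identical sequences are merged. A minimal Python sketch of that idea follows; it is not the procedure used by any of the databanks listed, the function name is hypothetical, and treating sequences case-insensitively is an assumption of the example.

```python
def merge_identical(records):
    """Merge entries that share exactly the same sequence.

    records: iterable of (identifier, sequence) pairs.
    Returns a list of (identifiers, sequence) pairs, one per unique
    sequence, in order of first appearance.
    """
    merged = {}
    for identifier, sequence in records:
        key = sequence.upper()  # assumption: case differences are ignored
        merged.setdefault(key, []).append(identifier)
    return [(ids, seq) for seq, ids in merged.items()]
```

Each merged entry keeps every source identifier, so the provenance of a sequence that appears in several component databanks is not lost.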
Table 4
Tools for Sequence Similarity Searches and Sequence Retrieval from Databanks (a)
Software Package   Ref.   URL
BLAST              (21)   http://www.hgmp.mrc.ac.uk/
                          http://www2.ncbi.nlm.nih.gov/BLAST/
PSI-BLAST          (21)   http://www2.ncbi.nlm.nih.gov/blast/psiblast.cgi
HMMER              (9)    http://pfam.wustl.edu/hmmsearch.shtml
                          http://www.cgr.ki.se/Pfam/search.html
                          http://www.sanger.ac.uk/Software/Pfam/search.shtml
SRS                –      http://srs.hgmp.mrc.ac.uk/srs6
                          http://srs.ebi.ac.uk/
(a) SRS is a valuable tool to retrieve sequences from databanks.
Table 5
Servers for Secondary Structure Prediction and Automatic Protein Fold Recognition (a)
Server             Ref.   URL
Secondary Structure Prediction
JPRED              (22)   http://jura.ebi.ac.uk:8888/
PHD                (23)   http://www.embl-heidelberg.de/predictprotein/doc/mirrors.html
Automatic Protein Fold Recognition
PHD (TOPITS)       (23)   http://www.embl-heidelberg.de/predictprotein/doc/mirrors.html
GenThreader        (24)   http://insulin.brunel.ac.uk/
FUGUE              (25)   http://www-cryst.bioc.cam.ac.uk/~Fugue/prfsearch.html
3D-PSSM            (26)   http://www.bmm.icnet.uk/
(PCONS)            (27)   http://bioserv.infobiosud.univ-montp1.fr/SERVEUR/HTML_BIO/frame_meta.html
MetaServer         (28)   http://BioInfo.PL/
Meta server CBS    (29)   http://bioserv.infobiosud.univ-montp1.fr/SERVEUR/META/result/meta.175806/frame.html
(a) The metaservers are a new wave of tools that allow a user to submit an amino acid sequence to a single server. This server submits the sequence to several servers that perform structural fold predictions. The results are collated and summarized, and consensus fold predictions are given. Use the metaservers for effective and automatic protein fold predictions.
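The collation step described in the footnote to Table 5 can be pictured as a simple vote over the top-ranked fold returned by each server. The sketch below uses a plain majority vote as a stand-in; the real metaservers use their own scoring schemes, and the function name, the `min_votes` parameter, and the tie-handling rule are all assumptions made for illustration.

```python
from collections import Counter


def consensus_fold(predictions, min_votes=2):
    """Pick the fold reported by the most servers.

    predictions: dict mapping a server name to its top-ranked fold
    identifier (or None if the server returned nothing).
    Returns (fold, votes); fold is None when there is a tie or when
    fewer than min_votes servers agree.
    """
    votes = Counter(f for f in predictions.values() if f is not None)
    if not votes:
        return None, 0
    fold, count = votes.most_common(1)[0]
    tied = [f for f, c in votes.items() if c == count]
    if count < min_votes or len(tied) > 1:
        return None, count  # no reliable consensus
    return fold, count
```

Returning no consensus on a tie reflects the article's point that inconsistent fold recognition results should not be trusted.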
Table 6
Tools for Comparative Molecular Modeling of Protein Structures (a)
Software       Ref.      URL
Academic Versions
COMPOSER^      (30,31)   http://www-cryst.bioc.cam.ac.uk/
Modeller*      (32)      http://guitar.rockefeller.edu/modeller/
WhatIF^        (33)      http://www.sander.embl-heidelberg.de/whatif/
SwissModel^    (34)      http://www.expasy.ch/swissmod/SWISSMODEL.html
NAOMI^         (35)      http://www.cambridgeantibody.com/
Commercial
Modeller*      (32)      http://www.accelrys.com/
Homology^      (36)      http://www.accelrys.com/
DISCOVER       –         http://www.accelrys.com/
SYBYL          –         http://www.tripos.com/
COMPOSER^      (30,31)   http://www.tripos.com/
(a) The key to the symbols used follows: * restraint-based molecular modeling techniques; ^ rigid-body fragment assembly techniques.
Table 7
Software to Assess Modeled Protein Structures
Software                   Ref.   URL
Biotech Validation Suite   (37)   http://biotech.emblheidelberg.de:8400/
                                  http://biotech.ebi.ac.uk:8400/
Joy                        (38)   http://www-cryst.bioc.cam.ac.uk/cgi-bin/joy.cgi
WhatIf                     (33)   http://swift.embl-heidelberg.de/servers2/
programs and servers. Various databanks and analysis tools are available for protein database searches and secondary and tertiary structure prediction via the internet (see Tables 2–7) (8–38).
2.2. Local Computing Systems for Protein Structure Prediction
Rasmol (17) is a macromolecule viewer; the correct mime-types and helper-applications need to be set in the browser’s preferences to view structures (Table 2). Rasmol was not designed to manipulate atomic stereochemistry. Software such as Composer (30,31), Modeller (32), WhatIf (33), SwissModel (34), and Naomi (35) is of value in protein structure modeling to atomic detail (see Table 6). In addition to building protein models, software packages are required to interactively visualize and monitor the building process. Commercial packages for molecular modeling, developed by Accelrys and Tripos, can provide these facilities (see Table 6). These commercial products have extensive graphical user interfaces and have been developed with emphasis on ease of use and project management and continuity. WhatIf also provides a continuing research environment for protein structure predictions and is one example of a noncommercial package. If you are affiliated with a nonprofit academic institute, access to many computational resources will be available free or at a lower cost compared to that available to commercial organizations. Having built a protein structure, you may like to investigate complex molecular recognition processes like protein interaction networks and ligand–receptor binding with the aim of designing therapeutics. If this is the case, accurate control over the visualization and building of complex macromolecules to atomic detail is important. If you are interested in modeling reliable structures of proteins to atomic detail, then access to a bespoke molecular modeling computing system is required. This will provide modules to perform macromolecular editing and high-resolution interactive viewing capability and energy minimization facilities. The commercial software packages (SYBYL from Tripos; Discover and InsightII from Accelrys) and the WhatIf suite provide such modules. Such programs need to be installed and maintained on local computer systems that can include (networked) Silicon Graphics workstations, plus a local copy of the PDB (1,2).
3. Methods
Bioinformatics is a rapidly evolving science, and new and improved versions of the software and databanks are released frequently. Therefore, the methods are presented as a guide to the principles relevant and applicable in the field. Before setting out, it is important to have some background understanding of the following: the databank search algorithms; the information content of the databanks; sequence retrieval from the databanks; sequence alignments; protein structures; and Unix commands. We recommend reading some comprehensive, concise, and introductory reviews on these topics (39–44). The methods to predict the structure and insights into the function of PRGFR1 are described.
3.1. Search for a Structural Homolog Using Standard Search Methods
A sequence similarity search can be performed to query a protein sequence against the amino-acid sequences of known 3D protein structures. If a structural homolog has been reliably identified for a significant fraction of the query sequence (see Note 4), a model can be built by using standard homology modeling methods (30–36). BLASTP (21) was used to detect sequence similarities in a databank of protein sequences with solved structures (see Notes 7 and 8). The result of the BLASTP search, using PRGFR1 as the query sequence, is shown (see Fig. 3). Residues 1089 to 1352 of the PRGFR1 sequence match the tyrosine kinase structure of the human insulin receptor (PDB accession code: 1irk). The PRGFR1 tyrosine kinase sequence has several structural homologs (e.g., 1irk). In Subheading 3.4.1., a 3D model of Fugu PRGFR1 tyrosine kinase is built to atomic detail based on structural homologs identified using SwissModel (34). As a result, this region in PRGFR1 should not be investigated using secondary structure predictions and automatic protein fold recognition methods. For
Fig. 3. The results of the BLASTP search using Fugu PRGFR1 to query sequences derived from known protein structures. (A) A schematic representation of the databank matches aligned to the query. The different gray scales are indicative of the alignment score. Several proteins match the tyrosine kinase region. (B) The pairwise sequence alignment of the tyrosine kinase structure of the insulin receptor and PRGFR1. The percentage sequence identity for this alignment is 39%, spanning amino-acid residues 1089–1352 of PRGFR1. This match indicates that the SwissModel protein modeling server is expected to build a model to atomic detail for this region.
the regions of the protein sequence for which a structural homolog has not been reliably identified (see Note 4), further nontrivial probing of the databanks is required to establish possible links between other protein sequences and structural homolog(s) or analog(s) (see Notes 5 and 6, respectively).
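The triage described in Subheading 3.1 (keep hits that cover a substantial part of the query with significant similarity, and treat the rest of the sequence as nontrivial) can be sketched in Python. The sketch assumes the 12-column tab-delimited hit table that NCBI BLAST+ writes with `-outfmt 6` (query, subject, percent identity, alignment length, mismatches, gap opens, query start, query end, subject start, subject end, E-value, bit score). The function names and the three thresholds are illustrative assumptions only; the article's own criteria are given in its Notes 4 and 8.

```python
def parse_tabular_hits(text):
    """Parse 12-column tab-delimited BLAST hits into dictionaries."""
    hits = []
    for line in text.splitlines():
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and comment lines
        fields = line.split("\t")
        hits.append({
            "subject": fields[1],
            "identity": float(fields[2]),   # percent identity
            "q_start": int(fields[6]),      # 1-based start on the query
            "q_end": int(fields[7]),        # 1-based end on the query
            "evalue": float(fields[10]),
        })
    return hits


def template_candidates(hits, max_evalue=1e-5, min_identity=30.0, min_span=100):
    """Keep hits that pass all three (assumed) thresholds, best E-value first."""
    keep = []
    for hit in hits:
        span = hit["q_end"] - hit["q_start"] + 1  # query residues covered
        if (hit["evalue"] <= max_evalue
                and hit["identity"] >= min_identity
                and span >= min_span):
            keep.append(hit)
    return sorted(keep, key=lambda h: h["evalue"])
```

With the match reported in Fig. 3 (39% identity over PRGFR1 residues 1089–1352), the covered span is 264 residues, so such a hit would pass the default thresholds provided its E-value is also below the cut-off.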
3.2. Search for a Structural Target Fold Using Sensitive Search Methods
3.2.1. Search Against the Databank of Pfam Profiles
Profile Hidden Markov Models (HMMs) built from Pfam alignments can be useful for automatically recognizing that a new protein contains an existing protein domain, even if the sequence similarity is weak (9). This type of sensitive search is useful when investigating multidomain proteins. Query sequences can be used to scan the Pfam HMMs. The query sequence should be in FASTA format. Select whether you wish to scan Pfam-A or Pfam-B families (see Note 9) and the maximum cut-off value for the E-value (see Note 8). The type, the number, and the location of the Pfam domains identified by the search will be reported, plus some annotation about each domain. If available, Pfam will provide references to homologous domains in protein structures deposited and classified in the PDB (1,2), CATH (3,5), and SCOP (4,5) databanks (see Note 1). Our search identified a sema domain, a PSI domain, the aforementioned tyrosine kinase domain, and three IPT domains in PRGFR1 (Table 9). No structural homolog for either the sema domain or the PSI domain was identified from the HMMER search. Several homologous structures were identified for the tyrosine kinase domain. For reasons given in the introduction, no further analysis of the PSI and the sema domains is shown. The results obtained proved unreliable and too complex for discussion in this article. The IPT domains were shown to share detectable but weak sequence similarity with the immunoglobulin-like fold in the cyclodextrin glucanotransferase protein structure (PDB accession code: 1cyg). By using techniques similar to those summarized in (7), we predict the fold for the IPT domain to
Table 8
The Names and Databank Accession Codes for 12 Unique Homologous PRGFR Sequences (a)
      Organism   Protein   Accession Number
1     Mouse      RON       OWL:I48751
2     Human      RON       SPTR:Q04912
3     Chicken    SEA       SPTR:Q08757
4     Frog       SEA       OWL:JC4860
5     Fugu       PRGFR2    SPTR:Q9YGM5
6     Fugu       PRGFR3    SPTR:Q9YGN0
7     Rat        MET       SPTR:P97579
8     Mouse      MET       SPTR:Q62190
9     Human      MET       SPTR:P08581
10    Chicken    MET       SPTR:Q90975
11    Frog       MET       OWL:JC5148
12    Fugu       PRGFR1    SPTR:Q9YGM7
(a) PRGFRs can be grouped into two subfamilies. The MET subfamily includes PRGFR1. The SEA/RON subfamily includes PRGFR2 and PRGFR3. MET was first identified as the activated oncogene in an N-methyl-N'-nitro-N-nitrosoguanidine (MNNG) treated human osteosarcoma and Xeroderma pigmentosa cell lines. RON (Recepteur d’Origine Nantaise) was isolated from a human foreskin keratinocyte cDNA library. It is a 1400 amino acid receptor tyrosine kinase protein and is homologous to the 1408 amino acid MET proto-oncogene. SEA was originally identified in the genome of the S13 avian erythroblastosis retrovirus. This virus causes Sarcomas, Erythroblastoses and Anemias in young chicks.
illustrate how one can gain confidence in a fold prediction by using different and complementary tools. The residue numberings for the start and end of the three IPT domains in PRGFR1 were established from the HMMER search. The sequences defining the relevant IPT domains need to be extracted into separate files. It is good practice to give your files meaningful names, as the files containing these sequences will form the basis of further analysis at the next stage.
3.2.2. Further Characterization of the IPT Domains
3.2.2.1. PSI-BLAST IDENTIFIES A FOURTH IPT DOMAIN IN PRGFR1
Analysis using PSI-BLAST (21) provided evidence for a fourth IPT domain in the PRGFR family. The PRGFR1 IPT2 sequence was used to query a nonredundant databank of protein sequences. Figures 5 and 6 provide details of the parameters used
Fig. 4. The results of the BLASTP search using Fugu PRGFR1 to query a nonredundant databank of protein sequences. The different gray scales are indicative of the alignment score. Twelve unique proteins match the full-length amino-acid sequence of the query (see Table 8) and many more match the tyrosine kinase region. The juxtamembrane domain is not present in mouse RON receptor sequences and, consequently, there is a large gap in the alignment depicted by a striped line.
and the results of the search. Two to four IPT domains were identified in homologous PRGFR sequences by using PSI-BLAST. Other members of the IPT superfamily (see Note 10) were also identified using successive PSI-BLAST iterations (see Fig. 5). Validation was required for the prediction of the fourth IPT domain in the PRGFR family, and this is covered in Subheading 3.2.2.2.
3.2.2.2. FURTHER EVIDENCE FOR THE EXISTENCE OF THE FOURTH IPT DOMAIN IN THE PRGFR FAMILY
BLASTP was used to identify protein homologs of Fugu PRGFR1. The protein sequence of PRGFR1 was used to search a nonredundant protein sequence databank (see Table 4). Twelve unique proteins match the full-length amino-acid sequence (see Table 8 and Fig. 4). PSI-BLAST was used to search a nonredundant protein sequence databank, but in this case no new full-length homologs were established by further iterations. The complete PRGFR homologous amino-acid sequences were retrieved from the SPTR databank (18) using SRS (see Tables 3 and 4) and transferred to the Unix-based computing systems at the HGMP-RC. Two multiple sequence alignments were generated. The first comprised the full-length PRGFR sequences (6). The second comprised the four IPT domains found in chicken MET and Fugu PRGFR1 only (see Fig. 7). These were compiled in order to identify the residues in and the properties of the fourth IPT domain in PRGFR1. From the multiple sequence alignment using the full-length sequences (6), the percentage pairwise sequence identities fell between 30 and
Fig. 5. The results of the PSI-BLAST search using Fugu PRGFR1 IPT2 to query the NCBI nonredundant protein sequence databank. The IPT domain shares weak sequence similarity to the OLF1/Ebf-like transcription factors (OLF), the nuclear factor of activated T-cells (NFAT) family of transcription factors, the transcription factor XCOE2 (XCOE), the viral encoded semaphorin receptor VESPR, and cyclodextrin glucanotransferase (the PDB accession code is 1cyg). The number of IPT domains identified in the proteins following consecutive iterations is shown in parentheses. The E-value was set to 10 for all the PSI-BLAST iterations. Alignments were inspected. Matches to repeat sequences were excluded from succeeding iterations. The outcome of a PSI-BLAST search can be affected by the selected query sequence. Depending on which PRGFR1 IPT domain is used as the query, three or four IPT domains can be identified in the PRGFR homologs. The Fugu PRGFR1 IPT2 domain returned four IPT domains in a single PRGFR sequence, whereas the other PRGFR1 IPT domains returned only three.
58% and equivalent regions with highly conserved residues were established. For example, common regions include conserved cysteines, prolines, and glycines in the IPT domains of the PRGFR family (6,7). A fourth IPT domain is present in all the PRGFR protein sequences. Using
the HGMP-RC Unix-based computing system, a multiple sequence alignment of the four IPT domains in chicken MET and Fugu PRGFR1 was performed using ClustalW. The version of ClustalW used has an interface called EMMA, which is part of the EMBOSS package (45). For the IPT alignment, parameters such as the gap penalty, gap extension, and substitution matrix were altered in a stepwise manner to optimize the number of equivalent residue positions (i.e., the length of the alignment was decreased, whereas the number of conserved positions increased). From the IPT alignment, invariant and highly conserved residues and positions of the alignment that comprise hydrophobic and polar residues have been defined (see Fig. 7). Although the sequences of the fourth IPT domain have diverged compared to the other three IPT domains, it is likely that the four IPT domains share a similar protein fold and mode of action. Characterizing the properties of domains where the sequence identities are this low is a nontrivial exercise. This is an interesting example of in silico characterization of domains with low sequence identity. If the results of this step are wrong, analysis based on the newly characterized domains will be erroneous.
Table 9
Four Different Domain Types Identified in Fugu PRGFR1 as a Result of a Pfam-A Search (a)
Pfam domain name   Query Start   Query End   Pfam Start   Pfam End   Bits    E-value
Sema               51            141         1            99         46.0    6.2e-13
Sema               246           390         194          350        100.0   1.7e-27
Plexin repeat      528           571         1            67         39.3    3.2e-09
IPT                572           667         1            103        48.7    4.7e-12
IPT                669           752         1            103        76.8    1.6e-20
IPT                755           849         1            103        33.4    1.8e-07
Tyrosine kinase    1096          1353        1            274        289.3   1.7e-84
(a) A maximum E-value of 10 was used. For each domain identified in PRGFR1, the start and end points of the Pfam domain and query sequence are given, together with the statistical score and E-value of the sequence alignment of the representative Pfam domain and PRGFR1. A sema domain, a plexin-repeat (also known as the PSI domain), three IPT domains, and a tyrosine kinase region were identified in PRGFR1.
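Subheading 3.2.1 recommends extracting each IPT domain into its own, meaningfully named file. Assuming the query start and end coordinates of the three IPT rows in Table 9 (1-based, inclusive), the step might be sketched in Python as follows. The function names and the labels `PRGFR1_IPT1` to `PRGFR1_IPT3` are hypothetical, and numbering the domains in sequence order is an assumption of the example.

```python
def read_single_fasta(text):
    """Return (header, sequence) from a one-record FASTA string."""
    lines = [ln.strip() for ln in text.strip().splitlines() if ln.strip()]
    if not lines or not lines[0].startswith(">"):
        raise ValueError("expected a FASTA header line starting with '>'")
    return lines[0][1:], "".join(lines[1:])


def extract_domain(sequence, start, end):
    """Cut residues start..end (1-based, inclusive) out of a sequence."""
    if not 1 <= start <= end <= len(sequence):
        raise ValueError("domain coordinates fall outside the sequence")
    return sequence[start - 1:end]


def format_fasta(name, sequence, width=60):
    """Format one FASTA record with fixed-width sequence lines."""
    body = "\n".join(sequence[i:i + width]
                     for i in range(0, len(sequence), width))
    return f">{name}\n{body}\n"


# Query coordinates of the three IPT rows in Table 9 (labels are illustrative).
IPT_DOMAINS = {
    "PRGFR1_IPT1": (572, 667),
    "PRGFR1_IPT2": (669, 752),
    "PRGFR1_IPT3": (755, 849),
}
```

Each record returned by `format_fasta` can then be written to a file named after its label, for example `PRGFR1_IPT2.fasta`, ready for the PSI-BLAST and Jpred2 steps.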
3.3. Search for a Structural Target for the IPT Domain Using 1D-2D-3D Compatibility Matches
3.3.1. Prediction of Secondary Structure and Solvent Accessibility for IPT Domains
There are many good secondary structure and solvent accessibility prediction servers available for use, for example, Jpred2 (22). Jpred2 works by combining many high-quality prediction methods to form consensus predictions. Three types of input can be supplied to Jpred2: a single protein sequence, an unaligned set of protein sequences, or a multiple protein sequence alignment in MSF format. For single sequences submitted, Jpred2 uses PSI-BLAST to perform a sequence similarity search and generates an alignment from the matched sequences. If the input is an unaligned set of sequences, an alignment will be created from the sequences supplied. If your input is an alignment, the alignment will be modified so that it does not contain gaps in the first sequence. The first sequence therefore should be your main sequence of interest. Once an alignment has been generated and modified, the algorithms predict the secondary structure. Jpred2 was used to predict the secondary structure and solvent accessibility of the IPT domains. In this example, Jpred2 determines a consensus secondary structure prediction of seven consecutive beta strands and predicts them to be largely solvent inaccessible (see Fig. 8). Given that the proteins share a low level of sequence identity, the accuracy of the prediction will be affected. The Jpred2 secondary structure prediction is expected to be about 70% accurate at the amino-acid residue level (22). So far, four IPT domains have been identified and characterized in each PRGFR sequence. The second PSI-BLAST iteration for the IPT2 sequence identified the third domain of 1cyg as a matching protein structure (see Fig. 5). The known structure
Fig. 6. Four consecutive IPT domains identified in two PRGFR homologs (chicken and mouse). The second PSI-BLAST iteration identified a match with 1cyg. The PDB databank identifier followed
of 1cyg adopts an immunoglobulin-like fold and is composed of nine beta strands. The Jpred2 analysis of the IPT domains predicts seven beta strands with a predicted solvent accessibility profile similar to that observed in an immunoglobulin-like fold. The collective evidence from these different techniques suggests that the IPT domains consist largely of beta strands and are likely to adopt an immunoglobulin-like fold.
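Subheading 3.3.1 notes that an input alignment is modified so that the first sequence contains no gaps, which is why the sequence of interest should come first. One plausible reading of that step, sketched in Python below, is to delete every alignment column in which the first sequence has a gap. This is an interpretation for illustration, not Jpred2's own code; the function name and the set of gap characters are assumptions.

```python
def drop_gaps_in_first(alignment, gap_chars="-."):
    """Remove every column where the first sequence has a gap.

    alignment: list of (name, aligned_sequence) pairs of equal length,
    with the sequence of interest first.
    """
    if not alignment:
        return []
    length = len(alignment[0][1])
    if any(len(seq) != length for _, seq in alignment):
        raise ValueError("all aligned sequences must have the same length")
    first = alignment[0][1]
    # Column indices where the first sequence carries a residue.
    keep = [i for i, ch in enumerate(first) if ch not in gap_chars]
    return [(name, "".join(seq[i] for i in keep)) for name, seq in alignment]
```

After this step the first row reads as the ungapped query, so every predicted state maps directly onto a query residue number.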
3.3.2. Automatic Protein Fold Recognition Methods
The protein fold recognition program Threader (46) was used to score protein sequence compatibility against known protein folds. Threader can be run from the UK HGMP-RC Bioinformatics applications menu. Sequence threadings against a structural databank of 1902 known protein folds were performed for the 48 IPT sequences (the four
Fig. 6. (continued) by the accession code indicates that this sequence has a structure deposited in the PDB. 1cyg contains 680 amino-acid residues. IPT2 makes a match with the third domain of 1cyg (1cyg03) (3–5). 1cyg03 comprises residues 492–575 and adopts an immunoglobulin-like fold.
Fig. 7. A protein multiple sequence alignment of the IPT domains in chicken MET (GG.MET) and Fugu PRGFR1 (FR.PRGFR) was obtained using ClustalW and EMMA. The gap creation penalty was set to 4, the gap extension penalty was set to 1, and the Gonnet matrix was used; these parameter changes also applied to the pairwise equivalents. Prettyplot, part of the EMBOSS package (45), produced the graphical display and calculated a consensus for the sequences. The alignment shows a consensus comprising invariant and structurally important amino-acid residues such as cysteines, prolines, and a glycine. Conserved hydrophobic or hydrophilic characteristics can be observed in equivalent boxed positions. The alignment provides some evidence for the existence of the fourth IPT domain in the PRGFR family.
Fig. 8. Consensus secondary structure from Jpred2 using Fugu PRGFR1 IPT2 domain as the query.
IPT domains from each of the twelve homologous PRGFR sequences). Threadings were computed in terms of: 1. Pairwise interaction energies, 2. Solvation potential energies, and 3. Their weighted sum, in order to evaluate the fit of each IPT sequence to a particular fold conformation, and were represented as Z-scores [(Energy − Mean)/Standard Deviation]. Provided that at least 50% of the query sequence matched the structure, the Z-scores were sorted for input into a program called SumThreader (47) in order to summarize the outcome of the searches. For each of the three Z-scores, the average rank of each fold was calculated from the 48 values determined for the individual IPT domain sequence threadings. The IPT domains were matched favorably with the following protein structures. The PDB code, the domain, and the average Z-score (in parentheses) are given: 1cgt03 (3.08); 1cyg03 (2.65); 1cdg03 (2.57); and 1vcaA1 (1.51). The first three domains are from the cyclodextrin glycosyltransferases of Bacillus circulans, Bacillus stearothermophilus, and Bacillus circulans, respectively, and the fourth is the N-terminal domain of the human vascular cell adhesion molecule-1. The top-ranked folds, with their average ranked positions in parentheses, are: 1cgt03 (44.79); 1cyg03 (67.96); 1vcaA1 (84.56); and 1cdg03 (85.92). These four domains share a common fold. They are mainly all-beta proteins with a beta sandwich consisting of nine beta strands. The protein fold prediction for the IPT domain corresponds well to the predictions provided by sensitive sequence similarity searches and by solvent accessibility and secondary structure predictions. In addition, similar results were obtained by using other automatic protein fold recognition methods such as TOPITS (23), GenThreader (24), FUGUE (25), and 3D-PSSM (26) (see Table 5) with the input sequence of the IPT2 domain of PRGFR1 (see Note 11).
The results support previous predictions suggesting that the IPT domains are likely to comprise between seven and nine beta strands and adopt an immunoglobulin-like fold. The IPT sequences have been matched with the structural target fold 1cyg(03). In Subheading 3.4.2., a 3D model of Fugu PRGFR1 IPT2 is built using 1cyg(03) as a structural template. This model cannot be built using SwissModel because the association between sequence and structure was not established by using BLASTP.
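The Z-score and average-rank summary used for the threading results can be sketched as follows. This is a minimal illustration and not the Threader or SumThreader implementation: the fold names, energies, and Z-scores are invented, the population standard deviation is an assumption, and the `higher_is_better` switch is included because the sign convention depends on how a given program reports its scores.

```python
from statistics import mean, pstdev


def z_scores(energies):
    """Map fold -> (energy - mean) / standard deviation for one query."""
    values = list(energies.values())
    mu = mean(values)
    sigma = pstdev(values)  # assumption: population standard deviation
    if sigma == 0:
        raise ValueError("identical energies: Z-scores are undefined")
    return {fold: (e - mu) / sigma for fold, e in energies.items()}


def average_ranks(per_sequence_scores, higher_is_better=True):
    """Average the rank (1 = best) of each fold over all query sequences."""
    ranks = {}
    for scores in per_sequence_scores:
        ordered = sorted(scores, key=scores.get, reverse=higher_is_better)
        for rank, fold in enumerate(ordered, start=1):
            ranks.setdefault(fold, []).append(rank)
    return {fold: mean(values) for fold, values in ranks.items()}


# Hypothetical energies for one query threaded onto three folds.
z = z_scores({"foldA": -120.0, "foldB": -80.0, "foldC": -100.0})

# Hypothetical Z-scores for two queries; here a higher score is taken as better.
per_sequence = [
    {"foldA": 3.1, "foldB": 0.4, "foldC": 2.2},
    {"foldA": 2.0, "foldB": 0.1, "foldC": 2.5},
]
avg = average_ranks(per_sequence)
print(z, avg)
```

In the study described above, the same averaging would run over 48 query sequences and 1902 folds rather than the two queries and three folds used here.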
3.4. Molecular Modeling Protein Structures to Atomic Detail
Automated procedures have been developed to facilitate the construction of a protein model based on the assembly of rigid fragments from existing known structures (see Tables 6 and 10–12). The methods generally encompass the following stages: 1. Identification of a structural template for the query sequence. 2. Determination of the structurally conserved regions (SCRs) of suitable protein structures. 3. Construction of the main-chain coordinates of the SCRs of the unknown by using an alignment of the query sequence against the template SCRs. 4. Modeling of the structurally variable regions (SVRs) of the query sequence by searching a databank of fragments of crystal structures and selecting the fragment predicted to be most compatible with the amino-acid sequence. 5. Construction of side chains by using side-chain rotamer libraries and rules governing the conformations of amino-acid side-chains at equivalent positions in the homologous proteins. A short description of other methods is given (see Note 3).
3.4.1. Automatic Comparative Molecular Modeling Using the SwissModel Server
SwissModel is a commonly used automated comparative protein modeling server that employs a rigid-body fragment assembly program (34). The server has many advantages, as it hides both the technical and tedious aspects of the modeling procedure. It is fast, free, and performs some WhatIf checks (33). The results of the analysis are sent to the user via e-mail. However, fully automated sequence alignment algorithms often misplace insertions and deletions when overall sequence identity falls below 30%. Additionally, you have limited control over what features can be engineered in and out of the protein model. In Subheading 3.1., by using BLASTP (21) against NRL3D (20), a structural homolog was identified for the PRGFR1 tyrosine kinase
Table 10
A Summary of How the Protein Modeling Server SwissModel Works

Step  Program   Database       Activities performed
1     BLASTP    ExNRL-3D (20)  Search for target sequence with sequences of known structure
2     SIM       –              Searches for template groups and shows global alignment. Selects template structures with pairwise sequence identity above 25% and projected model size larger than 20 amino acids. Detects domains that can be modeled for target sequence
3     –         –              Generate ProModII input files
4     ProModII  ExPDB (1,2)    Generate models with ProModII
5     Gromos96  –              Energy minimization of all models
Table 11
Defined Structurally Conserved Regions (SCRs) in the Fugu PRGFR1 IPT2 Domain (a)

SCR    1CYG(03)   IPT2    SCR length
SCR1   494–505    1–12    12
SCR2   510–518    19–27   9
SCR3   522–545    32–55   24
SCR4   552–559    65–72   8
SCR5   569–575    79–85   7
Total number of framework residues in IPT2 model: 60

(a) The SCRs are based on the atomic coordinates of the crystal structure of the immunoglobulin-like fold of cyclodextrin glucanotransferase. The five defined SCRs are used to build the framework for the model of the PRGFR1 IPT2 domain.
sequence (see Figs. 1 and 3). Thus, we expect SwissModel to build a credible model for this region. SwissModel was used in the “First Approach Mode.” The following details were entered in the form: an e-mail address, the user’s name, a request title, and the complete sequence of PRGFR1. The BLASTP P(N) limit for template selection was set to 0.0001 (see Note 8). The result option was set to the “normal mode,” and the server returned the final model coordinates file in PDB format and a log file tracing all the actions taken by the server. A WhatCheck report of the final model, which is produced by the WhatIf program, was also requested. The default values were accepted for the other parameters. SwissModel returned a 3D model for the PRGFR1 tyrosine kinase structure. This 3D model comprises two structural domains similar to those of the structural templates (see Figs. 9–11). The domains have been described and classified in the CATH (3) and SCOP (4) databanks. The N-terminal domain comprises a two-layered alpha-beta sandwich, in which one layer is a beta-sheet (with five strands) and the second layer is a long alpha-helix (see Fig. 9). The C-terminal domain is mostly alpha-helical and has a non-bundle fold. The non-bundle architecture is a general term that groups together helical proteins that cannot be classified as bundles (3,4). If the sequence similarity between the model and the template is significant (see Note 4), the secondary structure and solvent accessibility calculated from the homology model (evaluated on a residue-by-residue basis) are expected to be about 70–90% accurate. The program Joy can be used to define the secondary structure and to calculate the solvent accessibility and other structural features using the atomic coordinates of the 3D model as input (see Fig. 11 and Table 7).
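The template selection criteria quoted for SwissModel (a P(N) limit of 0.0001, pairwise sequence identity above 25%, and a projected model size larger than 20 residues) amount to a simple filter. The sketch below is not SwissModel code: the `Hit` record, the template names, and all numbers are hypothetical, and treating the P(N) limit as an inclusive upper bound is an assumption.

```python
from dataclasses import dataclass


@dataclass
class Hit:
    template: str        # hypothetical template identifier
    p_value: float       # BLASTP P(N) of the match
    identity_pct: float  # pairwise sequence identity, in percent
    model_size: int      # projected number of modeled residues


def select_templates(hits, p_limit=1e-4, min_identity=25.0, min_size=20):
    """Keep hits that satisfy all three selection criteria."""
    return [
        h for h in hits
        if h.p_value <= p_limit          # assumption: limit is inclusive
        and h.identity_pct > min_identity
        and h.model_size > min_size
    ]


hits = [
    Hit("tmplA", 1e-40, 41.0, 270),
    Hit("tmplB", 1e-6, 24.0, 150),   # identity too low
    Hit("tmplC", 0.02, 33.0, 90),    # P(N) above the limit
    Hit("tmplD", 1e-8, 30.0, 18),    # projected model too small
]
selected = [h.template for h in select_templates(hits)]
print(selected)
```

A filter of this kind explains why the IPT domains, whose match to 1cyg was found only by PSI-BLAST and threading, fall outside the reach of the automated server.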
3.4.2. Interactive Comparative Molecular Modeling
Based on sensitive sequence searches, solvent accessibility and secondary structure predictions, and automatic protein fold recognition methods, an association between the IPT sequences and an experimentally determined structure has been established. A 3D model of the IPT2 domain in Fugu PRGFR1 is built based on the atomic coordinates of the crystal structure of the immunoglobulin-like domain in cyclodextrin glucanotransferase. All the following computations were performed on a Silicon Graphics workstation with an IRIX operating system. 3D properties of protein structures were visualized and examined on the Silicon
Table 12
Modeling of the Four Structurally Variable Regions (SVRs) of the Fugu PRGFR1 IPT2 Domain (a)

SVR    IPT2        SVR     |C(SCRi)–N(SCRi+1)|  Loop type (42)  SVR source    Flex      Deviation  RMSD
       start–end   length  (Å)                                                residues  (Å)        tail/no tail
SVR1   13–18       6       10.1                 BB β-hairpin    1rsy (196)    3,3       0.39       0.52/N
SVR2   28–32       4       6.1                  BB β-arch       1vba (3:102)  3,3       0.43       1.27/N
SVR3   56–64       9       14.6                 BB β-hairpin    1kit (367)    4,4       0.42       1.87/N
SVR4   73–78       6       14.2                 BB β-arch       1uae (63)     4,4       0.63       0.61/N
Total residues in SVR: 25

(a) For each SVR, the comprising amino-acid residues and the distance between the main-chain C-atom of the preflex (the ith SCR) and the main-chain N-atom of the first residue in the postflex (the i+1th SCR), |C(SCRi)–N(SCRi+1)|, are described. All SVRs are located in the solvent accessible loops, i.e., regions not defined as regular repeating secondary structure in the immunoglobulin-like domain of cyclodextrin glucanotransferase by DSSP (47). The first amino-acid residue number of the selected fragment is specified in parentheses in the SVR source column. The respective residue numbers in the preflex and postflex regions used to screen the protein database for suitable fragments to interconnect the ith SCR and the i+1th SCR, and the RMS deviation of the selected flex region from equivalent residues in the respective pair of SCRs, are given.
Graphics machines using the molecular modeling and visualization software package Insight II (Accelrys). The coordinates of protein structures were obtained from the PDB (1,2). In order to construct an atomic 3D model for the PRGFR1 IPT2 domain based on the template structure 1cyg(03), the rigid-body fragment assembly method was carried out using the commercial software package Homology. Four stages were used in the modeling procedure: 1. Construction of a 3D framework by consideration of known structures. 2. Selection of suitable SVRs from a databank of “structural spare-parts” to connect consecutive SCRs of the framework. 3. Building of side-chain coordinates. 4. Refinement of the model using manual modeling, interactive computer graphics techniques, and energy refinement to improve covalent geometry and minimize bad contacts. The sequence of the IPT2 domain was aligned with the third domain of cyclodextrin glucanotransferase (1cyg03). This alignment was initially extracted from the threading alignment and modified by eye to optimize the equivalences and minimize the number of gaps in the sequence alignment. The percentage sequence identity between 1cyg03 and the Fugu PRGFR1 IPT2 domain is 22.6%. The topological equivalences, identities, and conservative amino-acid exchanges between the sequence of the template structure and the sequence of the IPT2 domain were optimized. Non-gap regions defined the SCRs that formed the framework of the homology model (see Table 11). The insertions and deletions in the alignment were mainly positioned in the loops of the known structure and were designated as SVRs. The framework of SCRs used to build the model of the IPT2 domain of PRGFR1 contains five peptide fragments comprising 60 residues (see Table 11). Fragments of the backbone from the template structure were least-squares fitted to the average Cα positions of the immunoglobulin-like domain in the cyclodextrin glucanotransferase framework in order to construct the core of the model. Table 12 describes the modeling of the four SVRs for the IPT2 domain. Peptide fragments of a predefined length from known protein structures in the PDB were selected and used in modeling the SVRs. A precalculated Cα distance matrix for all well-defined proteins in the PDB was used to search for “spare parts” of protein structure suitable to model the SVRs, that is, to screen for the fragments that best fit the defined number of residues joining a pair of consecutive SCRs. The root mean square (RMS) deviations for Cα atoms between the IPT2 model and the flanking (preflex and postflex) peptide fragments of the selected databank proteins are shown (see Table 12).
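The bookkeeping that relates SCRs to SVRs can be checked with a few lines of code. The sketch below takes the IPT2 residue ranges of the five SCRs from Table 11 and derives the regions between them; the helper names `span` and `gaps_between` are illustrative assumptions, not part of the Homology package.

```python
# SCR ranges in IPT2 numbering, taken from Table 11 (inclusive bounds).
SCRS = [(1, 12), (19, 27), (32, 55), (65, 72), (79, 85)]


def span(region):
    """Number of residues in an inclusive (start, end) range."""
    start, end = region
    return end - start + 1


def gaps_between(regions):
    """Return the inclusive ranges lying between consecutive regions."""
    gaps = []
    for (_, prev_end), (next_start, _) in zip(regions, regions[1:]):
        if next_start - prev_end > 1:
            gaps.append((prev_end + 1, next_start - 1))
    return gaps


svrs = gaps_between(SCRS)
scr_total = sum(span(r) for r in SCRS)
svr_total = sum(span(r) for r in svrs)
print(svrs, scr_total, svr_total)
```

Run on the Table 11 ranges, the framework accounts for 60 residues and the connecting regions for 25, which together give the 85 residues of the model. The derived loop lengths of 6, 4, 9, and 6 residues match the SVR lengths listed in Table 12.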
The SVRs for the IPT2 domain in PRGFR1 comprise 25 amino acids from a total of 85, that is, just under 30% of the model constructed. This is a significant proportion of the structure and will inevitably introduce a higher degree of uncertainty into the 3D model. The loops that comprise the SVRs are all typical of the loop length distribution in protein structures (see Table 12). The side-chain atoms were placed automatically for both the SCRs and SVRs using information from the template structure and general rules for residue exchanges. The model was refined using energy minimization. As a general rule of thumb, use energy minimization refinements and molecular dynamics simulations very sparingly in this type of modeling. In this example, six hundred steps of steepest descent minimization were performed using the program Discover (Accelrys) on the main-chain and side-chain atoms of the splice sites (where the SCRs and SVRs join) and on all side-chain atoms. With regard to the main-chain atoms, only the regions local to the splice sites should be refined. This is important, as overuse of energy minimization on main-chain atoms will result in loss of the valuable secondary structural information provided by the template structure(s), and hence in loss of valuable fold information. After refinement, the covalent geometry of the regions joining SCRs and SVRs and the bad contacts were significantly improved, as assessed by the program Procheck (37).
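The idea of refining only selected atoms while the rest of the framework stays fixed can be illustrated with a toy steepest descent routine. This is only a sketch under strong assumptions: a one-dimensional chain of atoms joined by harmonic springs stands in for a real force field, the step size is fixed, and nothing here reproduces the Discover program. Only the atoms listed in `movable` are updated, mimicking refinement restricted to the splice sites.

```python
def energy(x, d0=1.0):
    """Toy energy: harmonic springs of rest length d0 between neighbors."""
    return sum((x[i + 1] - x[i] - d0) ** 2 for i in range(len(x) - 1))


def gradient(x, d0=1.0):
    """Derivative of the toy energy with respect to each coordinate."""
    g = [0.0] * len(x)
    for i in range(len(x) - 1):
        stretch = x[i + 1] - x[i] - d0
        g[i] -= 2.0 * stretch
        g[i + 1] += 2.0 * stretch
    return g


def steepest_descent(x, movable, steps=600, step_size=0.05):
    """Move only the atoms in `movable` down the energy gradient."""
    x = list(x)
    for _ in range(steps):
        g = gradient(x)
        for i in movable:
            x[i] -= step_size * g[i]
    return x


# Atoms 0, 1, 4, and 5 play the role of the fixed framework;
# atoms 2 and 3 play the role of a distorted splice site.
start = [0.0, 1.0, 2.6, 2.7, 4.0, 5.0]
relaxed = steepest_descent(start, movable={2, 3})
print(energy(start), energy(relaxed))
```

Because the frozen atoms never move, the geometry inherited from the template is preserved exactly, while the strained junction relaxes.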
3.5. Validation of Protein Structure Models
Previously defined SCRs and SVRs can be altered to remove unfavorable geometric or stereochemical features. Alignments may need to be altered manually, especially if the sequence similarity between the sequence of the unknown structure and the sequence of the structural template is low (see Note 4). If the selection of the template structure is wrong, the model based on it will be wrong. If your alignment is incorrect, local features of the model will be incorrect. If the protein is well-characterized biochemically, use this information to validate the model of the protein
Fig. 9. Tyrosine kinase structures. (A) The 3D model of the Fugu PRGFR1 tyrosine kinase produced by SwissModel. This model is based on the atomic coordinates of five homologous tyrosine kinase structures. The PDB accession codes are 1irk, 1ir3, 1fgi, 1agw, and 1fgk. The PRGFR1 tyrosine kinase model contains 273 amino acids (i.e., residues 1091–1363). The model contains two structural domains. The C-terminal domain is mostly alpha-helical. The secondary structure of the model was calculated by using DSSP (47) from the model coordinates. Molscript (48) was used to plot this figure. (B) The model and one of the template structures (1irk) were superimposed in Insight (Accelrys) and are shown in the same orientation. Most of the differences between the two molecules lie in the crevice between the two domains of the protein structure.
Fig. 10. A Ramachandran plot of the Fugu PRGFR1 tyrosine kinase model. This plot is created by Procheck based at the Biotech validation site (see Table 7).
Fig. 11. Joy output highlighting the structural features of the PRGFR1 tyrosine kinase model. The key to the Joy feature formatting follows: solvent inaccessible (UPPER CASE), solvent accessible (lower case), alpha-helix (α), beta-strand (β), 3–10 helix (3), hydrogen bond to main-chain amide (bold), hydrogen bond to main-chain carbonyl (underline), hydrogen bond to side-chain (tilde), cis-peptide (breve), disulfide bond (cedilla), and positive phi (italics).
structure. For example, the information that certain residues are known to coordinate metal ion binding can be of value. Use such information to ensure that the relevant side-chains are in close proximity in 3D space and in the correct orientation for metal coordination. If the alignment is wrong, the residues for coordinating the metal ion could end up on opposite sides of the molecule. Likewise, the knowledge that two cysteine residues are involved in disulfide bridge formation can provide a useful indicator that a protein structure has been modeled correctly (see Fig. 12A). Such information can be used to check the geometry of formed disulfide bridges in your model. It is important to visually inspect various features of the constructed models. Features such as defined secondary structure and main-chain conformation
(see Figs. 10 and 13), strands and helices, the globular nature of the protein fold, and the 3D clusters of interacting side-chains should be investigated. There are tools to help point to regions of the model that might need correcting (see Table 7). To calculate the secondary structure and solvent accessibility of a 3D protein structure (either a constructed 3D model or an experimentally determined structure), submit the protein coordinates to the “Protein Analysis” server at the WhatIf website (see Table 7).
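The disulfide check mentioned above can be approximated by measuring the distance between the two cysteine SG atoms in the model coordinates. The sketch below reads fixed-column PDB ATOM records; the coordinates are invented for illustration (only the residue numbers 42 and 54 echo the IPT2 model), the helper `atom_line` exists solely to build well-formed test records, and the 2.5 Å cutoff is an assumed tolerance around the roughly 2.05 Å sulfur to sulfur bond length.

```python
import math


def atom_line(serial, name, resname, chain, resseq, x, y, z):
    """Build a fixed-column PDB ATOM record (illustrative helper)."""
    return (f"ATOM  {serial:5d} {name:<4s} {resname:>3s} {chain}{resseq:4d}    "
            f"{x:8.3f}{y:8.3f}{z:8.3f}")


def sg_coordinates(pdb_lines):
    """Return {residue number: (x, y, z)} for cysteine SG atoms."""
    coords = {}
    for line in pdb_lines:
        if (line.startswith("ATOM") and line[17:20] == "CYS"
                and line[12:16].strip() == "SG"):
            coords[int(line[22:26])] = (
                float(line[30:38]), float(line[38:46]), float(line[46:54]))
    return coords


def sg_distance(pdb_lines, res_a, res_b):
    """Distance in Angstroms between the SG atoms of two cysteines."""
    sg = sg_coordinates(pdb_lines)
    return math.dist(sg[res_a], sg[res_b])


# Hypothetical coordinates, not taken from the actual model.
model = [
    atom_line(321, " SG ", "CYS", "A", 42, 10.0, 12.0, 8.0),
    atom_line(322, " CB ", "CYS", "A", 42, 11.2, 12.9, 7.1),
    atom_line(410, " SG ", "CYS", "A", 54, 10.0, 12.0, 10.05),
]
distance = sg_distance(model, 42, 54)
plausible_bridge = distance <= 2.5  # assumed tolerance, see lead-in
print(round(distance, 2), plausible_bridge)
```

The same distance test can be adapted to metal-coordinating side chains by selecting the relevant ligand atoms instead of SG.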
3.5.1. Checks Performed by the Biotech Protein Structure Validation Suite
A model of a protein structure built to atomic detail requires validation of the stereochemistry and assessment of the biological viability (33,37,38).
Fig. 12. Immunoglobulin-like folds. (A) The 3D model of the IPT2 domain of Fugu PRGFR1. The atomic coordinates were created with Homology and Discover (Accelrys). The model was based on 1cyg(03) as the template structure. The model consists of 85 amino-acid residues and adopts a beta-sandwich architecture. These structures have two small beta-sheets packed together in a layered arrangement. The side-chains of two cysteines (residues 42 and 54) are shown. These are close in space and are likely to form a disulfide bridge. (B) The 3D structure of the template 1cyg(03) is in the same orientation as the model. See legend to Fig. 9.
The checks performed by Procheck (37) on a given protein structure are as follows: covalent geometry, planarity, dihedral angles, chirality, nonbonded interactions, main-chain hydrogen bonds, disulfide bonds, stereochemical parameters, and residue-by-residue analysis. Molecular modeling to atomic detail can be an iterative cycle. Having obtained initial models, properties such as bad stereochemistry, bad van der Waals contacts, and features of the model not supporting known experimental evidence should be engineered out of the model. Interactive molecular modeling tools are required for this (see Subheadings 2 and 3.4.2.). In correcting undesirable features of models, some parameters that can be altered in the remodeling process are the alignment, the definition of SCRs, and the selection of SVRs. WhatIf performs the following checks: bond angle deviations; bond lengths; buried unsatisfied hydrogen bond donors and acceptors; bad van der Waals contacts; peptide bond flip; amino-acid handedness; histidine, glutamine, and asparagine side-chain conformation; atom nomenclature; side-chain planarity; proline puckering; directional atomic contact analysis; amino-acid side-chain rotamer analysis; symmetry; torsion angle
Fig. 13. A Ramachandran plot (42) of the Fugu PRGFR1 IPT2 model. The plot is created by Procheck. See legend to Fig. 10.
evaluation; isolated water clusters; and atomic occupancy. The server provides detailed descriptions of these properties (see Table 7).
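The dihedral (torsion) angles that underlie these checks and the Ramachandran plots can be computed from four atomic positions. The function below is a generic sketch of the standard atan2 formulation, not code from Procheck or WhatIf; the test points are arbitrary. For the backbone, phi is defined by the atoms C(i−1), N(i), Cα(i), C(i) and psi by N(i), Cα(i), C(i), N(i+1).

```python
import math


def sub(a, b):
    return (a[0] - b[0], a[1] - b[1], a[2] - b[2])


def dot(a, b):
    return a[0] * b[0] + a[1] * b[1] + a[2] * b[2]


def cross(a, b):
    return (a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0])


def dihedral(p0, p1, p2, p3):
    """Signed dihedral angle in degrees defined by four points."""
    b1, b2, b3 = sub(p1, p0), sub(p2, p1), sub(p3, p2)
    n1, n2 = cross(b1, b2), cross(b2, b3)
    y = math.sqrt(dot(b2, b2)) * dot(b1, n2)
    x = dot(n1, n2)
    return math.degrees(math.atan2(y, x))


# Arbitrary test geometries around a central bond along the z axis.
cis = dihedral((1, 0, 0), (0, 0, 0), (0, 0, 1), (1, 0, 1))
trans = dihedral((1, 0, 0), (0, 0, 0), (0, 0, 1), (-1, 0, 1))
quarter = dihedral((1, 0, 0), (0, 0, 0), (0, 0, 1), (0, 1, 1))
print(cis, trans, quarter)
```

Applying the function along the backbone of a model gives the phi and psi pairs that a Ramachandran plot displays.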
3.5.2. Examine Structural and Local Environments in Protein Models by Using Joy
The program Joy annotates protein sequences with 3D structural features (see Table 7). Joy was designed to investigate properties of structural and local environments and the conservation of amino acids in protein families (38). For instance, a solvent inaccessible side-chain hydrogen-bonded to a main-chain amide plays an important role in stabilizing the 3D structure and is well conserved in families of protein structures. In a Joy display, this type of hydrogen bond is shown as a bold-faced letter in a formatted alignment. Also, solvent inaccessible residues are represented as upper-case letters. Structural features like these are highly conserved in families of protein structures and should be monitored in model building (see Figs. 11 and 14).
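The upper-case and lower-case convention for solvent accessibility can be mimicked in a few lines. This sketch is not Joy itself: the function name `joy_case`, the toy sequence, and the accessibility values are invented, and the 7% relative accessibility cutoff is an assumption that should be checked against the Joy documentation.

```python
def joy_case(sequence, relative_accessibility, cutoff=7.0):
    """Upper case for buried residues, lower case for exposed ones.

    `relative_accessibility` holds one percentage per residue; residues
    below `cutoff` (an assumed threshold) are treated as inaccessible.
    """
    if len(sequence) != len(relative_accessibility):
        raise ValueError("one accessibility value is needed per residue")
    return "".join(
        res.upper() if acc < cutoff else res.lower()
        for res, acc in zip(sequence, relative_accessibility)
    )


# Hypothetical residues and accessibilities.
formatted = joy_case("ACDE", [0.0, 50.0, 3.0, 80.0])
print(formatted)
```

Comparing such strings for a model and its template gives a quick view of whether buried positions have been kept buried.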
3.6. Where Next?
The most reliable and accurate predictions are those in which the application of standard and well-established methods results in a successful outcome. Such methods produced reliable results in the PRGFR1 tyrosine kinase structure prediction. Where these standard methods do not lead to the reliable identification of a structural template, the newer and nontrivial methods may help to provide indicators of structure and hence function. These nontrivial methods produced interesting and consistent leads for the PRGFR1 IPT protein structure prediction. For reasons given in the introduction, about 55% of the PRGFR1 structure could not be predicted reliably, and this is typical in structural genomics (53,54). Having said this, the bioinformatics techniques are becoming increasingly effective, quicker, and simpler to use (8–38), and the databanks are growing in size and diversity. These approaches, if used appropriately, will therefore help to close the information gap between sequence and structure and will complement in vitro approaches to investigating molecular structure and function. Additionally, many useful
lessons can be learned from critical assessments of bona fide protein structure predictions in light of their structures being solved (49–52). Furthermore, the methods described in protein structure prediction projects can show, by example, how not to overinterpret results obtained from bioinformatics methods (7,53–56). In this article, we hope we have demonstrated bioinformatics methods to predict protein structure by using a practical approach.
4. Notes
1. A domain is a polypeptide chain or part(s) of a polypeptide chain that can independently fold into a stable tertiary structure. Domains need not be formed from contiguous regions of an amino-acid sequence. Typically, they have distinctive secondary structure content and a hydrophobic core. In small disulfide-rich, zinc-binding, or calcium-binding domains, the hydrophobic core may be replaced with cystines and metal ions. Domains can be defined in terms of a unit of function. Proteins may comprise a single domain or as many as several dozen. At the sequence level, homologous domains with common functions usually show sequence similarities. Structural domains typically comprise 35–350 amino-acid residues.
2. Many proteins, particularly those sharing a common evolutionary origin, have been shown to possess similar backbone structures, that is, they share similar 3D folds. It is this property that is exploited in the approach known as comparative modeling. The aim is to base a 3D model for a protein sequence on a homologous structure (see Note 5) or analogous protein structure (see Note 6). The most reliable indicator that two proteins share a common fold is provided by amino-acid sequence comparisons. If the target protein has a sequence with detectable sequence similarity with the sequence of a protein of known structure (see Note 4), then a model can usually be built that is as accurate as a medium resolution X-ray or NMR structure. A large number of structural parameters are considered when modeling a structure. To minimize subjective decisions, specially written computer algorithms have been developed to exploit our knowledge
Fig. 14. A Joy output of Fugu PRGFR1 IPT2 model. See legend to Figs. 11–13.
of protein structures in a systematic way and make this task more manageable.
3. Comparative modeling using rigid-body fragment-based methods is described in the main text (30,31,34). Three-dimensional models can also be constructed by satisfaction of spatial restraints using methods similar to those used in NMR protein structure determinations. The restraints are obtained from a consideration of homologous or analogous protein structures (32,35). Restraint-based approaches are, perhaps, more automated than fragment assembly approaches and produce models with better stereochemistry. Fragment-assembly-based methods, on the other hand, are significantly less computationally intensive than restraint-based methods. Both techniques can produce models with similar overall accuracy.
4. If a protein sequence has 30% sequence identity or higher (over a region of 70 amino-acid residues or more) to another protein sequence of known structure, then the two proteins can “safely” be described as homologous (57–59). This type of relationship implies that the sequences are highly likely to perform a similar function.
5. Homologous sequences share a common evolutionary ancestor and are believed to have arisen by gene duplication followed by gene divergence. There are no degrees of homology. Sequences are either homologous or they are not. Paralogs and orthologs are homologs. Orthologous sequences are homologous proteins that perform the same function in different species. Paralogous sequences are homologous proteins that perform differing functions in the same species.
6. Analogous sequences are non-homologous proteins that have a similar protein three-dimensional fold or similar functional sites believed to have arisen through convergent evolution.
7. Version 2 of NCBI’s BLAST (Basic Local Alignment Search Tool) program performs gapped alignments (21). BLAST is the heuristic search algorithm employed by the five programs (BLASTP, BLASTX, BLASTN, TBLASTN, and TBLASTX). BLASTP compares an amino-acid query sequence against a protein sequence databank. Databank search algorithms are based on mathematical models. Two general models view alignments as sequence similarity across the full length (global alignments) and as regions of similarity in parts of the sequence (local alignments). For the purpose of protein structure prediction and domain characterization and identification, performing local alignments to search databanks is more desirable.
8. The E-value is the expect value. For a given score, the E-value is the number of hits in a databank search that we expect to see by chance with this score or better. The E-value takes into account the size of the databank searched. The lower the E-value, the more significant the match is. If the E-value is 10, we expect 10 matches to be found by chance. If the E-value ascribed to a match is greater than the E-value threshold, the match will not be reported. The lower the E-value threshold, the more stringent the search, which leads to fewer false positive matches being reported. The P-value is similar to an E-value, but it is the probability of a match occurring by chance with this score or better as opposed to the expected number of hits. The P-value has a maximum value of 1.0.
The E-value can have a maximum value equal to the number of sequences in the databank searched. Version 2 of NCBI’s BLAST program quotes E-values, whereas version 1 of the same package quotes P-values. HMMER reports E-values (9,21) (see Note 9).
9. Pfam is a databank of protein multiple sequence alignments and Hidden Markov Models (HMMs) (9). A Hidden Markov Model is a probabilistic model that is suited for providing a mathematical scoring scheme for profile analysis. The HMMs describe the propensity of amino-acid exchange in common protein domains and conserved regions. The HMMER software uses the multiple sequence alignment to build an HMM profile of the family. The profiles incorporate position-specific scoring information derived from the amino-acid frequency observed in equivalent positions in a protein sequence alignment. Families known as Pfam-A are generated by the Hidden Markov Model package from well-annotated, high-quality families that comprise accurate human-crafted multiple alignments. Families known as Pfam-B are generated using an automatic clustering of the rest of SWISSPROT and TrEMBL derived from the Prodom databank. Pfam-B families are not of consistent quality and little may be known of their function.
10. A superfamily is composed of two or more homologous families where not all members of a family have detectable sequence similarity with all the members of the other families.
11. Many automatic protein fold recognition servers are available. To predict protein folds from sequence, use the servers outlined in Table 5. However, do not enter a 1500 amino-acid residue sequence and expect to receive meaningful predictions. Submitting small (i.e., less than 350 amino-acid residues) and well-characterized sequence domains typically produces results that are more meaningful. Many programs listed in Table 2 can help identify regions of sequences with predicted coiled-coils, transmembrane regions, and BLASTP matches with proteins of known structure.
It is not advisable to submit these regions to automatic protein fold recognition servers.
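The position-specific scoring idea described in Note 9 can be illustrated with a minimal Python sketch. This is not the HMMER algorithm: it omits insert and delete states and transition probabilities, and it only handles ungapped alignments. The toy alignment, the uniform background frequency, the add-one pseudocount, and the function names `build_profile` and `score_sequence` are all assumptions made for illustration.

```python
import math
from collections import Counter

# Hypothetical toy alignment: rows are sequences, columns are equivalent positions.
ALIGNMENT = [
    "ACDW",
    "ACEW",
    "GCDW",
    "ACDF",
]

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
BACKGROUND = 1.0 / len(AMINO_ACIDS)  # assumption: uniform background frequency
PSEUDOCOUNT = 1.0                    # assumption: add-one smoothing


def build_profile(alignment):
    """Return one dict per alignment column mapping residue -> log2-odds score."""
    n_seqs = len(alignment)
    total = n_seqs + PSEUDOCOUNT * len(AMINO_ACIDS)
    profile = []
    for column in zip(*alignment):
        counts = Counter(column)
        scores = {}
        for aa in AMINO_ACIDS:
            # Smoothed frequency of this residue at this position.
            freq = (counts.get(aa, 0) + PSEUDOCOUNT) / total
            # Positive score: residue is more common here than expected by chance.
            scores[aa] = math.log2(freq / BACKGROUND)
        profile.append(scores)
    return profile


def score_sequence(profile, sequence):
    """Sum the position-specific scores for an ungapped sequence of profile length."""
    if len(sequence) != len(profile):
        raise ValueError("sequence length must equal profile length")
    return sum(column[aa] for column, aa in zip(profile, sequence))


profile = build_profile(ALIGNMENT)
family_like = score_sequence(profile, "ACDW")
unrelated = score_sequence(profile, "KLMN")
print(f"ACDW: {family_like:.2f}  KLMN: {unrelated:.2f}")
```

A sequence that matches the conserved columns scores higher than one built from residues never seen in the alignment, which is the behavior a family profile is meant to capture. Real profile HMMs additionally model gaps and use more careful background and prior estimates.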
Acknowledgments
This article is an expanded review based on the following chapter (60).

References
1. Keller, P. A., Henrick, K., McNeil, P., Moodie, S., and Barton, G. J. (1998) Deposition of macromolecular structures. Acta Cryst. D 54, 1105–1108.
2. Berman, H. M., Westbrook, J., Feng, Z., et al. (2000) The Protein Data Bank. Nucleic Acids Res. 28, 235–242.
3. Bray, J. E., Todd, A. E., Pearl, F. M., Thornton, J. M., and Orengo, C. A. (2000) The CATH Dictionary of Homologous Superfamilies (DHS): a consensus approach for identifying distant structural homologs. Protein Eng. 13, 153–165.
4. Lo Conte, L., Ailey, B., Hubbard, T. J., Brenner, S. E., Murzin, A. G., and Chothia, C. (2000) SCOP: a structural classification of proteins database. Nucleic Acids Res. 28, 257–259.
5. Ison, J. C. (2000) Exploring protein domain structure. Briefings in Bioinformatics 1, 305–312.
6. Cottage, A., Clark, M., Hawker, K., et al. (1999) Three receptor genes for plasminogen related growth factors in the genome of the puffer fish Fugu rubripes. FEBS Lett. 443, 370–374.
7. Bork, P., Doerks, T., Springer, T. A., and Snel, B. (1999) Domains in plexins: links to integrins and transcription factors. Trends Biochem. Sci. 24, 261–263.
8. Corpet, F., Servant, F., Gouzy, J., and Kahn, D. (2000) ProDom and ProDom-CG: tools for protein domain analysis and whole genome comparisons. Nucleic Acids Res. 28, 267–269.
9. Bateman, A., Birney, E., Durbin, R., Eddy, S. R., Howe, K. L., and Sonnhammer, E. L. (2000) The Pfam protein families database. Nucleic Acids Res. 28, 263–266.
10. Schultz, J., Copley, R. R., Doerks, T., Ponting, C. P., and Bork, P. (2000) SMART: a web-based tool for the study of genetically mobile domains. Nucleic Acids Res. 28, 231–234.
11. Siddiqui, A. S., Dengler, U., and Barton, G. J. (2001) 3Dee: a database of protein structural domains. Bioinformatics 17, 200–201.
12. Holm, L. and Sander, C. (1998) Touring protein fold space with Dali/FSSP. Nucleic Acids Res. 26, 316–319.
13. Holm, L. and Sander, C. (1999) Protein folds and families: sequence and structure alignments. Nucleic Acids Res. 27, 244–247.
14. Hofmann, K., Bucher, P., Falquet, L., and Bairoch, A. (1999) The PROSITE database, its status in 1999. Nucleic Acids Res. 27, 215–219.
15. Attwood, T. K., Croning, M. D. R., Flower, D. R., et al. (2000) PRINTS-S: the database formerly known as PRINTS. Nucleic Acids Res. 28, 225–227.
16. Henikoff, J. G., Greene, E. A., Pietrokovski, S., and Henikoff, S. (2000) Increased coverage of protein families with the Blocks Database servers. Nucleic Acids Res. 28, 228–230.
17. Sayle, R. A. and Milner-White, E. J. (1995) RasMol: biomolecular graphics for all. Trends Biochem. Sci. 20, 374–376.
18. Bairoch, A. and Apweiler, R. (2000) The SWISSPROT protein sequence database and its supplement TrEMBL in 2000. Nucleic Acids Res. 28, 45–48.
19. Bleasby, A. J., Akrigg, D., and Attwood, T. K. (1994) OWL: a non-redundant, composite protein sequence database. Nucleic Acids Res. 22, 3574–3577.
20. Garavelli, J. S., Hou, Z., Pattabiraman, N., and Stephens, R. M. (2001) The RESID Database of protein structure modifications and the NRL-3D Sequence-Structure Database. Nucleic Acids Res. 29, 199–201.
21. Altschul, S. F., Madden, T. L., Schaffer, A. A., et al. (1997) Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. Nucleic Acids Res. 25, 3389–3402.
22. Cuff, J. A. and Barton, G. J. (2000) Application of multiple sequence alignment profiles to improve protein secondary structure prediction. Proteins 40, 502–511.
23. Rost, B., Schneider, R., and Sander, C. (1997) Protein fold recognition by prediction-based threading. J. Mol. Biol. 270, 471–480.
24. Jones, D. T. (1999) GenTHREADER: an efficient and reliable protein fold recognition method for genomic sequences. J. Mol. Biol. 287, 797–815.
25. Shi, J. Y., Blundell, T. L., and Mizuguchi, K. (2001) FUGUE: sequence-structure homology recognition using environment-specific substitution tables and structure-dependent gap penalties. J. Mol. Biol. 310, 243–257.
26. Kelley, L. A., MacCallum, R. M., and Sternberg, M. J. E. (2000) Enhanced genome annotation using structural profiles in the program 3D-PSSM. J. Mol. Biol. 299, 499–520.
27. Lundstrom, J., Rychlewski, L., Bujnicki, J., and Elofsson, A. (2001) Pcons: a neural-network-based consensus predictor that improves fold recognition. Protein Sci. 10, 2354–2362.
28. Bujnicki, J. M., Elofsson, A., Fischer, D., and Rychlewski, L. (2001) Structure prediction meta server. Bioinformatics 17, 750–751.
29. Douguet, D. and Labesse, G. (2001) Easier threading through web-based comparisons and cross-validations. Bioinformatics 17, 752–753.
30. Sutcliffe, M. J., Haneef, I., Carney, D., and Blundell, T. L. (1987) Knowledge based modeling of homologous proteins, Part I: Three-dimensional frameworks derived from the simultaneous superposition of multiple structures. Protein Eng. 1, 377–384.
31. Sutcliffe, M. J., Hayes, F. R., and Blundell, T. L. (1987) Knowledge based modeling of homologous proteins, Part II: Rules for the conformations of substituted sidechains. Protein Eng. 1, 385–392.
32. Sanchez, R. and Sali, A. (2000) Comparative protein structure modeling. Introduction and practical examples with Modeller. Methods Mol. Biol. 143, 97–129.
33. Vriend, G. (1990) WhatIf: a molecular modeling and drug design program. J. Mol. Graph. 8, 52–56.
34. Guex, N., Diemand, A., and Peitsch, M. C. (1999) Protein modeling for all. Trends Biochem. Sci. 24, 364–367.
35. Brocklehurst, S. M. and Perham, R. N. (1993) Prediction of the three-dimensional structures of the biotinylated domain from yeast pyruvate carboxylase and of the lipoylated H-protein from the pea leaf glycine cleavage system: a new automated method for the prediction of protein tertiary structure. Protein Sci. 2, 626–639.
36. Greer, J. (1981) Comparative model-building of the mammalian serine proteases. J. Mol. Biol. 153, 1027–1042.
37. Laskowski, R. A., Rullmann, J. A., MacArthur, M. W., Kaptein, R., and Thornton, J. M. (1996) AQUA and PROCHECK-NMR: programs for checking the quality of protein structures solved by NMR. J. Biomol. NMR 8, 477–486.
38. Mizuguchi, K., Deane, C. M., Blundell, T. L., Johnson, M. S., and Overington, J. P. (1998) JOY: protein sequence-structure representation and analysis. Bioinformatics 14, 617–623.
39. Bioinformatics: Sequence, Structure and Databanks. A Practical Approach. (2000) (Higgins, D. and Taylor, W., eds.) IRL, Oxford University Press.
40. Attwood, T. K. and Parry-Smith, D. J. (1999) Introduction to Bioinformatics. Cell and Molecular Biology in Action Series. Addison Wesley Longman, Harlow, Essex, England.
41. Genetics Databases. (1999) (Bishop, M. J., ed.) Academic Press.
42. Branden, C. and Tooze, J. (1998) Introduction to Protein Structure, 2nd ed. Garland Publishing Inc., New York and London.
43. Protein Structure Prediction: A Practical Approach. (1996) (Sternberg, M. J. E., ed.) IRL, Oxford University Press.
44. Baker, D. and Sali, A. (2001) Protein structure prediction and structural genomics. Science 294, 93–96.
45. Rice, P., Longden, I., and Bleasby, A. (2000) EMBOSS: the European Molecular Biology Open Software Suite. Trends Genet. 16, 276–277.
46. Jones, D. T., Taylor, W. R., and Thornton, J. M. (1992) A new approach to protein fold recognition. Nature 358, 86–89.
47. Kabsch, W. and Sander, C. (1983) Dictionary of protein secondary structure: pattern recognition of hydrogen-bonded and geometrical features. Biopolymers 22, 2577–2637.
48. Kraulis, P. J. (1991) MOLSCRIPT: a program to produce both detailed and schematic plots of protein structures. J. Appl. Cryst. 24, 946–950.
49. Edwards, Y. J. K. and Perkins, S. J. (1996) Assessment of protein fold predictions from sequence information: the predicted alpha/beta doubly wound fold of the von Willebrand factor type A domain is similar to its crystal structure. J. Mol. Biol. 260, 277–285.
50. Benner, S. A., Cannarozzi, G., Gerloff, D., Turcotte, D., and Chelvanayagam, M. (1997) Bona fide predictions of protein structure using transparent analyses of multiple sequence alignments. Chem. Rev. 97, 2725–2843.
51. Siew, N. and Fischer, D. (2001) Convergent evolution of protein structure prediction and computer chess tournaments: CASP, Kasparov, and CAFASP. IBM Systems Journal 40, 410–425.
52. Bujnicki, J. M., Elofsson, A., Fischer, D., and Rychlewski, L. (2001) LiveBench-1: continuous benchmarking of protein structure prediction servers. Protein Sci. 10, 352–361.
53. Pawlowski, K., Rychlewski, L., Zhang, B. H., and Godzik, A. (2001) Fold predictions for bacterial genomes. J. Struct. Biol. 134, 219–231.
54. Cottage, A., Edwards, Y. J. K., and Elgar, G. (2001) SAND, a new protein family: from nucleic acid to protein structure and function prediction. Compar. Funct. Genom. 2, 226–235.
55. Edwards, Y. J. K. and Perkins, S. J. (1995) The protein fold of the von Willebrand factor type A domain is predicted to be similar to the open twisted beta-sheet flanked by alpha-helices found in human ras-p21. FEBS Lett. 358, 283–286.
56. Devos, D. and Valencia, A. (2000) Practical limits of function prediction. Proteins 41, 98–107.
57. Sander, C. and Schneider, R. (1991) Database of homology-derived protein structures and the structural meaning of sequence alignment. Proteins 9, 56–68.
58. Doolittle, R. F. (1981) Similar amino acid sequences: chance or common ancestry. Science 214, 149–159.
59. Park, J., Karplus, K., Barrett, C., et al. (1998) Sequence comparisons using multiple sequences detect three times as many remote homologs as pairwise methods. J. Mol. Biol. 284, 1201–1210.
60. Edwards, Y. J. K. and Cottage, A. (2001) Prediction of protein structure and function by using bioinformatics. Methods Mol. Biol. 175, 341–375.