FOR THE RECORD

Fold recognition without folds

KRISTIN K. KORETKE,1 ROBERT B. RUSSELL,2,3 AND ANDREI N. LUPAS1,4

1 Microbial Bioinformatics Group, GlaxoSmithKline, Collegeville, Pennsylvania 19426-0989, USA
2 Bioinformatics Research Group, GlaxoSmithKline, Harlow, CM19 5AW, United Kingdom

(RECEIVED August 21, 2001; FINAL REVISION February 25, 2002; ACCEPTED February 25, 2002)

Abstract

Fold recognition predicts protein three-dimensional structure by establishing relationships between a protein sequence and known protein structures. Most methods explicitly use information derived from the secondary and tertiary structure of the templates. Here we show that rigorous application of a sequence search method (PSI-BLAST) with no reference to secondary or tertiary structure information is able to perform as well as traditional fold recognition methods. Since the method, SENSER, does not require knowledge of the three-dimensional structure, it can be used to infer relationships that are not tractable by methods dependent on structural templates.

Keywords: Structure prediction; sequence similarity; fold recognition; PSI-BLAST; HMMer; SENSER

Reprint requests to: Kristin K. Koretke, GlaxoSmithKline, UP1345, 1250 South Collegeville Road, Collegeville, PA 19426-0989, USA; e-mail: [email protected]; fax: (610) 917-7901.
3 Present address: European Molecular Biology Laboratory, Meyerhofstrasse 1, 69012 Heidelberg, Germany.
4 Present address: Department of Protein Evolution, Max-Planck-Institute for Developmental Biology, Spemannstr. 35, D-72076 Tübingen, Germany.
Article and publication are at http://www.proteinscience.org/cgi/doi/10.1110/ps.3590102.
Protein Science (2002), 11:1575–1579. Published by Cold Spring Harbor Laboratory Press. Copyright © 2002 The Protein Society

In the absence of experimental structure data, fold recognition methods predict approximate three-dimensional models for protein sequences by inferring relationships with the database of known structures. Most methods assess the “fit” of a query sequence to a library of structural templates, typically by considering a combination of features such as secondary structure, solvent accessibility, or residue–residue contact preferences (Sippl 1990; Russell et al. 1996; Rost et al. 1997; Jones 1999; Panchenko et al. 1999). Fold recognition methods have made substantial progress in recent years. This is best illustrated by the results of the CASP experiments, which have shown consistent improvements both in the ability to assign a sequence to the correct fold and in the associated alignment of sequence to template (Moult et al. 1999).

In parallel with the progress in fold recognition has been a dramatic improvement in sequence search and alignment methods, specifically the development of iterative database searching strategies (e.g., PSI-BLAST; Altschul et al. 1997) and hidden Markov models (HMMs; Krogh et al. 1994; Eddy 1998; Karplus et al. 1998). Recent work has also exploited the concept of transitivity, where significant similarity between sequences A and B, and between B and C, is used to infer a relationship between sequences A and C (e.g., Holm and Sander 1997; Neuwald et al. 1997; Park et al. 1997, 1998). The relevance of these methods for structure prediction is clear from the results of the CASP3 experiment (Karplus et al. 1999; Koretke et al. 1999) and from whole-genome structure predictions (Huynen et al. 1998; Teichmann et al. 1998; Ruepp et al. 2000).

In CASP3 we used both sequence and secondary structure information to predict protein folds (Koretke et al. 1999). During evaluation, we found that most of the successful predictions were made through rigorous exploration of sequence space, with few predictions augmented by reference to structural information. We found that automation of part of the sequence search routine yielded sequence-to-fold assignments comparable to the best-performing structure-based methods. These insights prompted us to devise a sensitive search routine (SENSER), based on PSI-BLAST and HMMer, and to test it against standard fold recognition benchmarks.

PSI-BLAST (position-specific iterated basic local alignment search tool; Altschul et al. 1997) derives a profile from the statistically significant sequence matches of a gapped BLAST run and uses recurring steps of searching and profile-building to expand the significant sequence space until no new sequences are identified below the chosen threshold. SENSER exploits a basic asymmetry in the operation of PSI-BLAST to establish relatedness between proteins whose similarity would not normally suggest homology. In PSI-BLAST, comparing sequences A and B amounts to comparing a profile of A-like sequences to B when A is the starting sequence, and a profile of B-like sequences to A when B is the starting sequence; the results of the two comparisons can therefore differ substantially. Consequently, if A and B can detect each other, even at E-values up to 10, the two sequences are probably related. We refer to this concept as ‘back-validation.’

In the context of CASP3, we evaluated three different procedures for back-validation (Table II in Koretke et al. 1999). The strictest procedure required that A and B detect each other with E-values smaller than 10, resulting in 2 correct and 0 false assignments (for 21 targets). A less strict procedure only required B to detect any protein within the significant sequence space of A, resulting in 5 correct and 1 false assignments. The least strict procedure introduced transitive searches to maximize the significant sequence space of A before requiring B to back-validate, resulting in 6 correct and 0 false assignments. It is surprising that even the least strict procedure produced so few false positives, as the method generally makes no prediction for targets outside its reach. In light of these results, we chose to implement the least strict procedure for back-validation in SENSER.

As the first step, SENSER performs a PSI-BLAST search with the target sequence using the BLOSUM62 matrix, an E-value cutoff of 1.0e-3 for inclusion in the profile, and gap penalties of 11 to open a gap and 1 to extend it (Fig. 1). Proteins identified in the search are divided into a significant sequence space, containing those sequences with an E-value lower than the threshold, and a ‘trailing end’ of sequences above. For the purposes of this study we defined the threshold as 10^-3 and limited trailing-end sequences to those with E-values smaller than 10. Because some of the proteins detected may contain unrelated domains, all proteins are trimmed to the actual region detected by PSI-BLAST.

In the second step, transitive searches are used to expand the significant sequence space (a step we refer to as recursive searching). Only proteins within the significant sequence space that have less than 25% identity to the target sequence are used as starting points for further PSI-BLAST searches, in order to avoid redundant searches, that is, those that produce similar profiles and sequence spaces. This value was chosen as it is a frequently quoted threshold for the ‘twilight zone,’ below which sequences cannot be confidently said to be homologous (e.g., Hobohm and Sander 1995).

In the third step, trailing-end sequences are tested for their ability to back-validate. Because several PSI-BLAST searches were performed to establish the significant sequence space, trailing-end sequences are pooled and ranked first by number of occurrences and second by lowest E-value, before being tested. If a trailing-end sequence back-validates, its significant sequence space is added to that of the target. The process is then repeated until no further sequences are detected.

Fig. 1. Diagram outlining the steps used for detection and alignment in SENSER. For each of the benchmark runs, we concatenated the sequences of the template structures to a database containing sequences from NCBI’s nonredundant database and from 24 partial genomes. PSI-BLAST is known to have problems with coiled-coil, transmembrane, and low-complexity regions. Therefore, we scanned our databases for these regions using the COILS program (Lupas 1996), an internally designed transmembrane prediction program, and SEG (Wootton and Federhen 1996). Any sequence region identified above the default threshold in these programs was replaced with a run of X characters. Target sequences containing such regions were searched against these modified databases in the SENSER search. A 4-h CPU time constraint was imposed on each run.

The steps above can connect proteins that are far apart in sequence space; however, beyond the first PSI-BLAST search, they do not directly provide an alignment of the target to the sequences detected. Moreover, even for sequences detected in the first step, PSI-BLAST generally provides only partial alignments. For these reasons, we introduced an alignment strategy based on HMMer (Fig. 1; http://hmmer.wustl.edu/). After the first PSI-BLAST search, we build a target HMM from the proteins in the significant sequence space, as aligned by PSI-BLAST. Any sequence detected at this step is then realigned to the target sequence using the target HMM, to yield a full-length alignment. Any sequence detected at a subsequent step is aligned in a five-part process: (1) a PSI-BLAST search is run for the untrimmed sequence, (2) a multiple alignment is extracted, (3) this alignment is combined with the sequences of the target HMM to produce a global alignment, using the target HMM as a template, (4) a final HMM is built from this global alignment, and (5) this HMM is used to realign the detected sequence to the target.

The total CPU time for SENSER is a function of the length of the target protein and the number of sequences within the superfamily. The benchmark sets used in this study were run on an LSF cluster of eight DEC Alpha machines. The average CPU run time was 29 min ± 48 min.

To determine the predictive accuracy of SENSER, we chose two benchmark sets, benchmark68 (Fischer et al. 1996) and PDB40 (Park et al. 1998), which have been used previously for assessing fold recognition methods (Fischer et al. 1996; Jones 1999). The benchmark68 set consists of 68 target proteins and 301 structural templates, including at least one correct template for each target. The proteins in benchmark68 are not decomposed into individual domains. In contrast, the PDB40 set consists almost entirely of distinct protein domains, derived from the SCOP database to yield the largest set with pairwise sequence identities of 40% or less. In this set, discontinuous protein domains (i.e., domains whose sequence is interrupted by the insertion of sequences from other domains) are treated as continuous. PDB40 encompasses 1319 protein domains and 34 multidomain proteins. Each of these served as both a target and a template. PDB40 is a more rigorous test set because it is larger, has a broad representation of structural classes, and includes 153 sequences with unique folds, that is, folds shared by no other sequence of known structure.

In assessing SENSER on the benchmark68 set, we recorded the first identified template protein with less than 30% target-to-template identity as our prediction. We chose this cutoff in order to allow direct comparison with the results of Jones (1999). As shown in Table 1, SENSER detects the correct fold for 53 targets in this set, with 2 incorrect assignments, corresponding to a 96% accuracy of

Table 1. Detection of sequence-to-fold relationships using the benchmark68 set

| | SENSER: BLAST | SENSER: PSI-BLAST | SENSER: Recursive | SENSER: Trailing | SENSER: Total | GenTHREADER: 90% confidence | GenTHREADER: 50% confidence | GenTHREADER: No threshold | GenTHREADER: Total |
|---|---|---|---|---|---|---|---|---|---|
| Correct (a) | 6 | 28 | 11 | 8 | 53 | 33 | 9 | 8 | 50 |
| Incorrect (a) | 0 | 0 | 0 | 2 | 2 | 0 | 3 | 15 | 18 |
| No prediction | 62 | 34 | 23 | 13 | 13 | 35 | 23 | 0 | 0 |
| % accuracy of predictions | 100 | 100 | 100 | 80 | 96 | 100 | 75 | 35 | 74 |
| % success over all proteins | 9 | 50 | 66 | 78 | 78 | 49 | 62 | 74 | 74 |
| % success over proteins not predicted at the previous step | | 45 | 32 | 35 | | | 26 | 35 | |

(a) Except in the ‘total’ column, numbers reflect predictions made incrementally over those at the previous step.
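The percentages in Table 1 follow directly from the raw counts. The sketch below is a minimal illustration in Python, assuming the incremental reading given in the table footnote (each step's counts are added to those of the previous steps). The names `Step`, `summarize`, and `SENSER_BENCHMARK68` are hypothetical and are not part of SENSER.

```python
from dataclasses import dataclass


@dataclass
class Step:
    name: str
    correct: int    # correct predictions made at this step only
    incorrect: int  # incorrect predictions made at this step only


def summarize(steps, n_targets):
    """Return one (name, accuracy, success, incremental, unpredicted) tuple per step.

    accuracy    = % of this step's predictions that are correct
    success     = % of all targets predicted correctly so far (cumulative)
    incremental = % of targets left unpredicted by earlier steps that this
                  step predicts correctly (None for the first step)
    unpredicted = targets still without a prediction after this step
    """
    rows = []
    cumulative_correct = 0
    unpredicted = n_targets
    for index, step in enumerate(steps):
        predicted = step.correct + step.incorrect
        accuracy = round(100 * step.correct / predicted) if predicted else None
        incremental = (round(100 * step.correct / unpredicted)
                       if index > 0 and unpredicted else None)
        cumulative_correct += step.correct
        unpredicted -= predicted
        success = round(100 * cumulative_correct / n_targets)
        rows.append((step.name, accuracy, success, incremental, unpredicted))
    return rows


# Per-step counts taken from the SENSER columns of Table 1.
SENSER_BENCHMARK68 = [Step("BLAST", 6, 0), Step("PSI-BLAST", 28, 0),
                      Step("Recursive", 11, 0), Step("Trailing", 8, 2)]

for row in summarize(SENSER_BENCHMARK68, n_targets=68):
    print(row)
```

Run on the SENSER columns, the sketch reproduces the accuracy, success, and incremental-success rows of the table, which is a quick consistency check on the reconstructed layout.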


Table 2. Detection of sequence-to-fold relationships using PDB40

| | BLAST | PSI-BLAST | Recursive | Trailing | Total |
|---|---|---|---|---|---|
| Correct (a) | 306 | 301 | 71 | 132 | 810 |
| Incorrect (a) | 0 | 9 | 16 | 40 | 65 |
| unique folds (among incorrect) | 0 | 4 | 5 | 18 | 27 |
| No prediction | 1047 | 730 | 643 | 510 | 510 |
| unique folds (among no prediction) | 153 | 149 | 144 | 126 | 126 |
| % accuracy of predictions | 100 | 97 | 82 | 77 | 93 |
| % success over all proteins | 23 | 45 | 51 | 60 | 60 |
| % success over proteins not predicted at the previous step | - | 29 | 10 | 20 | - |

(a) Except in the ‘total’ column, numbers reflect predictions made incrementally over those at the previous step.
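The recursive and trailing-end steps behind the corresponding columns above can be illustrated on a toy hit table. This is a minimal Python sketch under stated assumptions, not the SENSER implementation: `toy_search` stands in for a PSI-BLAST run and only looks up the hypothetical `TOY_HITS` table, `toy_identity` is a placeholder for a pairwise identity calculation, and back-validation is taken to mean that a candidate detects any member of the current significant sequence space with an E-value below 10 (the least strict procedure described in the text). Trimming to the detected region, masking, and the HMMer alignment stage are omitted.

```python
from collections import Counter

SIGNIFICANT_E = 1e-3  # E-value threshold for the significant sequence space
TRAILING_E = 10.0     # upper E-value limit for trailing-end sequences
MAX_IDENTITY = 0.25   # seeds must be below 25% identity to the target


def partition(hits):
    """Split a {sequence: E-value} table into significant space and trailing end."""
    significant = {s for s, e in hits.items() if e < SIGNIFICANT_E}
    trailing = {s: e for s, e in hits.items() if SIGNIFICANT_E <= e < TRAILING_E}
    return significant, trailing


def senser_space(target, search, identity):
    """Grow the significant sequence space of `target` (illustrative only)."""
    space, searched, tested = {target}, set(), set()
    counts, best_e = Counter(), {}

    def absorb(query, hits):
        # Add the significant hits of one search and pool its trailing end.
        searched.add(query)
        significant, trailing = partition(hits)
        space.update(significant)
        for seq, e in trailing.items():
            counts[seq] += 1
            best_e[seq] = min(e, best_e.get(seq, e))

    def recurse():
        # Step 2: transitive searches seeded by distant members of the space.
        while True:
            seeds = [s for s in sorted(space)
                     if s not in searched and identity(target, s) < MAX_IDENTITY]
            if not seeds:
                return
            for seed in seeds:
                absorb(seed, search(seed))

    absorb(target, search(target))  # step 1: search with the target itself
    recurse()
    while True:  # step 3: back-validate pooled trailing-end sequences
        candidates = [s for s in counts if s not in space and s not in tested]
        if not candidates:
            return space
        # Rank by number of occurrences (descending), then by lowest E-value.
        candidates.sort(key=lambda s: (-counts[s], best_e[s], s))
        candidate = candidates[0]
        tested.add(candidate)
        back_hits = search(candidate)
        if any(h in space and e < TRAILING_E for h, e in back_hits.items()):
            space.add(candidate)
            absorb(candidate, back_hits)
            recurse()


# Hypothetical hit table: query -> {hit: E-value}.
TOY_HITS = {
    "A": {"A": 0.0, "B": 1e-10, "D": 2.0},
    "B": {"B": 0.0, "A": 1e-8, "C": 1e-5, "D": 5.0},
    "C": {"C": 0.0, "B": 1e-6, "Z": 8.0},
    "D": {"D": 0.0, "E": 1e-20, "B": 3.0},
    "E": {"E": 0.0, "D": 1e-20},
    "Z": {"Z": 0.0, "Y": 1e-4},
}


def toy_search(query):
    return TOY_HITS.get(query, {})


def toy_identity(a, b):
    return 1.0 if a == b else 0.2


print(sorted(senser_space("A", toy_search, toy_identity)))
# prints ['A', 'B', 'C', 'D', 'E']
```

In the toy data, D appears twice in the pooled trailing end and detects B in return, so it back-validates and brings E with it. Z appears once and detects nothing in the space, so no connection is made for it.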

the predictions made and a 78% success rate over all targets. In comparison, GenTHREADER made correct assignments for 33 target sequences with a 90% confidence level, 42 correct and 3 incorrect assignments with a 50% confidence level, and 50 correct and 18 incorrect assignments overall. There were 11 sequences in benchmark68 which neither SENSER nor GenTHREADER predicted correctly, seven predicted correctly only by SENSER and four predicted correctly only by GenTHREADER.

The results of assessing SENSER on PDB40 are shown in Table 2. Overall, SENSER identified 810 correct and 65 incorrect relationships among the 1353 proteins in PDB40 (93% prediction accuracy, 60% success rate). The prediction accuracy across structural classes was fairly consistent (Table 3); however, the success rate was much higher for α/β folds. This may be due, at least in part, to the greater regularity of these folds, resulting in stronger sequence patterns. Prediction accuracy decreased with each step of the search routine, but remained above 75% even in the last step (Table 1).

As already noted during the CASP3 evaluation (Koretke et al. 1999), SENSER primarily detects homologous relationships. In the PDB40 set, 98% of correct predictions involved proteins where the target and template were part of the same SCOP superfamily. Of the 153 unique folds in PDB40, SENSER made predictions for 27 (four in PSI-BLAST, five in recursive and 18 in trailing-end searching). Recognizable reasons for these incorrect predictions included amino acid distribution (such as the high cysteine content of some targets) and the presence of discontinuous domains in PDB40. For the latter, the ‘greediness’ of PSI-BLAST resulted in the incorrect alignment of parts of a discontinuous domain to the inserted domain, eventually resulting in predicting the structure of the inserted domain for the discontinuous domain.

Table 3. Detection of sequence-to-fold relationship within different SCOP structure classes for the PDB40 benchmark set

| | α | β | α/β | α+β | MD | M/S | Small |
|---|---|---|---|---|---|---|---|
| Correct | 133 | 184 | 263 | 126 | 16 | 0 | 88 |
| Incorrect | 14 | 12 | 19 | 5 | 4 | 0 | 11 |
| No prediction | 97 | 121 | 57 | 139 | 14 | 0 | 49 |
| % accuracy of predictions | 91 | 94 | 93 | 96 | 80 | 0 | 89 |
| % success over proteins in class | 55 | 58 | 78 | 47 | 47 | 0 | 59 |

Abbreviations: MD, multi-domain proteins (alpha and beta); M/S, membrane and cell surface proteins and peptides.

A critical aspect of structure prediction is the alignment between the target sequence and the associated template. We evaluated the alignment strategy of SENSER on the 53 correctly predicted targets of the benchmark68 set, by comparing the target-to-template alignments produced in SENSER with the optimal structure-to-structure alignments obtained by STAMP (Russell and Barton 1992). Only structurally equivalent residues were considered. Table 4 shows the comparison of the STAMP, PSI-BLAST, and SENSER alignments in terms of residues correctly aligned. At every step of the search routine, SENSER produced, on average, more accurate alignments than PSI-BLAST, and the effect increased with decreasing target-to-template similarity. On average, SENSER yielded alignments of at least 40% accuracy even for the most distant sequence matches. Surprisingly, alignments generated for targets predicted in trailing-end searches had a slightly higher proportion of correctly aligned residues than those identified through recursive searches (Table 4). Inspection showed that many of these targets were small, cysteine-rich proteins, where the alignments were anchored by conserved cysteines. Overall, the average alignment accuracy was 66.3% for the 53 correctly predicted targets of the benchmark68 set, a relatively high accuracy for sequences that are on average 18% identical. In comparison, Jones (1999) reported an average alignment accuracy of 46.2% for the 50 targets correctly predicted by GenTHREADER.

Table 4. Number of equivalent residues aligned correctly in the 53 correctly predicted proteins of the benchmark68 set

| | BLAST | PSI-BLAST | Recursive | Trailing | Total |
|---|---|---|---|---|---|
| STAMP | 1668 | 3511 | 934 | 463 | 6576 |
| PSI-BLAST | 1264 | 2272 | 223 | 0 | 3759 |
| % alignment accuracy | 76 | 65 | 24 | - | 61 |
| SENSER | 1277 | 2494 | 376 | 210 | 4357 |
| % alignment accuracy | 77 | 71 | 40 | 45 | 66 |

An initial evaluation of SENSER in the CASP4 experiment supports the findings reported here. Unmodified predictions made by SENSER were filed under the group name SBauto. Of 41 target domains classified as ‘comparative modeling/fold recognition’ or harder and whose coordinates have been made available to the predictors, SENSER made predictions for 21, of which 15 were correct (71% prediction accuracy, 37% success rate). The average alignment accuracy, as judged relative to STAMP, was 41.5%. A detailed assessment of SENSER’s performance in CASP4 will be presented elsewhere.

In summary, despite only using sequence information, SENSER appears to have a rate of successful fold predictions comparable to that of advanced threading programs. It has a low intrinsic rate of false positives and it creates a collection of sequence relationships, which allows the user to recreate the path by which target and template sequences were connected and to evaluate the validity of the connection. Moreover, despite sometimes seemingly tenuous connections, SENSER moves largely in homologous sequence space. This allows functional and/or mechanistic inferences to be drawn more often from the relationships found than with threading methods, which frequently uncover analogous fold similarities. Thus, SENSER may assist in the annotation of the many hypothetical proteins obtained from genome projects not only by making fold predictions but by coupling these to functional or mechanistic hypotheses. An example of SENSER’s applicability is the analysis of the N-domain of the AAA ATPase VAT, which has generated functional hypotheses for a large number of hitherto unclassified ORFs and has pointed to a potential evolutionary connection between double-psi barrels with chaperone activity and aspartic proteases (Coles et al. 1999).

Acknowledgments

The publication costs of this article were defrayed in part by payment of page charges. This article must therefore be hereby marked “advertisement” in accordance with 18 USC section 1734 solely to indicate this fact.
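As a supplementary illustration of the alignment-accuracy figures reported in Table 4, the sketch below shows one way such a percentage could be computed. It assumes that accuracy is the share of structurally equivalent residue pairs in the reference (STAMP) alignment that the sequence-based alignment reproduces exactly; the text does not spell out the formula, so this reading is an assumption. The function names `aligned_pairs` and `alignment_accuracy` and the example sequences are hypothetical.

```python
def aligned_pairs(target_row, template_row, gap="-"):
    """Return the set of (target_index, template_index) pairs in a gapped alignment."""
    if len(target_row) != len(template_row):
        raise ValueError("alignment rows must have equal length")
    pairs = set()
    i = j = 0  # running residue indices in the ungapped sequences
    for a, b in zip(target_row, template_row):
        if a != gap and b != gap:
            pairs.add((i, j))
        if a != gap:
            i += 1
        if b != gap:
            j += 1
    return pairs


def alignment_accuracy(predicted, reference):
    """Percentage of reference residue pairs reproduced by the predicted alignment."""
    if not reference:
        raise ValueError("reference alignment has no equivalent residues")
    return 100.0 * len(predicted & reference) / len(reference)


# Toy reference: six structurally equivalent positions, aligned one to one.
reference = aligned_pairs("ACDEFG", "ACDEFG")
# Toy prediction: a one-residue shift near the end misaligns the last two pairs.
predicted = aligned_pairs("ACDE-FG", "ACDEFG-")

print(round(alignment_accuracy(predicted, reference), 1))  # prints 66.7
```

Applied to the totals in Table 4, the same ratio gives 100 × 4357 / 6576, which rounds to the 66.3% quoted in the text.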

References

Altschul, S.F., Madden, T.L., Schaffer, A.A., Zhang, J., Zhang, Z., Miller, W. and Lipman, D.J. 1997. Gapped BLAST and PSI-BLAST: A new generation of protein database search programs. Nucleic Acids Res. 25: 3389–3402.
Coles, M., Dierks, T., Liermann, J., Gröger, A., Rockel, B., Baumeister, W., Koretke, K.K., Lupas, A., Peters, J. and Kessler, H. 1999. The solution structure of VAT-N reveals a ‘missing link’ in the evolution of complex enzymes from a simple βαββ element. Curr. Biol. 9: 1158–1168.
Eddy, S.R. 1998. Profile hidden Markov models. Bioinformatics 14: 755–763.
Fischer, D., Elofsson, A., Rice, D. and Eisenberg, D. 1996. Assessing the performance of fold recognition methods by means of a comprehensive benchmark. Pac. Symp. Biocomput. 300–318.
Hobohm, U. and Sander, C. 1995. A sequence property approach to searching protein databases. J. Mol. Biol. 251: 390–399.
Holm, L. and Sander, C. 1997. An evolutionary treasure: Unification of a broad set of aminohydrolases related to urease. Proteins 28: 72–82.
Huynen, M., Doerks, T., Eisenhaber, F., Orengo, C., Sunyaev, S., Yuan, Y. and Bork, P. 1998. Homology-based fold predictions for Mycoplasma genitalium proteins. J. Mol. Biol. 280: 323–326.
Jones, D.T. 1999. GenTHREADER: An efficient and reliable protein fold recognition method for genomic sequences. J. Mol. Biol. 287: 797–815.
Karplus, K., Barrett, C., Cline, M., Diekhans, M., Grate, L. and Hughey, R. 1999. Predicting protein structure using only sequence information. Proteins 3: 121–125.
Karplus, K., Barrett, C. and Hughey, R. 1998. Hidden Markov models for detecting remote protein homologies. Bioinformatics 14: 846–856.
Koretke, K.K., Russell, R.B., Copley, R.R. and Lupas, A.N. 1999. Fold recognition using sequence and secondary structure information. Proteins 3: 141–148.
Krogh, A., Brown, M., Mian, I.S., Sjolander, K. and Haussler, D. 1994. Hidden Markov models in computational biology. Applications to protein modeling. J. Mol. Biol. 235: 1501–1531.
Lupas, A. 1996. Prediction and analysis of coiled-coil structures. Methods Enzymol. 266: 513–525.
Moult, J., Hubbard, T., Fidelis, K. and Pedersen, J.T. 1999. Critical assessment of methods of protein structure prediction (CASP): Round III. Proteins 3: 2–6.
Neuwald, A.F., Liu, J.S., Lipman, D.J. and Lawrence, C.E. 1997. Extracting protein alignment models from the sequence database. Nucleic Acids Res. 25: 1665–1677.
Panchenko, A., Marchler-Bauer, A. and Bryant, S.H. 1999. Threading with explicit models for evolutionary conservation of structure and sequence. Proteins 3: 133–140.
Park, J., Teichmann, S.A., Hubbard, T. and Chothia, C. 1997. Intermediate sequences increase the detection of homology between sequences. J. Mol. Biol. 273: 349–354.
Park, J., Karplus, K., Barrett, C., Hughey, R., Haussler, D., Hubbard, T. and Chothia, C. 1998. Sequence comparisons using multiple sequences detect three times as many remote homologues as pairwise methods. J. Mol. Biol. 284: 1201–1210.
Rost, B., Schneider, R. and Sander, C. 1997. Protein fold recognition by prediction based threading. J. Mol. Biol. 270: 470–480.
Ruepp, A., Graml, W., Santos-Martinez, M.L., Koretke, K.K., Volker, C., Mewes, H.W., Frishman, D., Stocker, S., Lupas, A.N. and Baumeister, W. 2000. The genome sequence of the thermoacidophilic scavenger Thermoplasma acidophilum. Nature 407: 508–513.
Russell, R.B. and Barton, G.J. 1992. Multiple protein sequence alignment from tertiary structure comparison: Assignment of global and residue confidence levels. Proteins 14: 309–323.
Russell, R.B., Copley, R.R. and Barton, G.J. 1996. Protein fold recognition by mapping predicted secondary structures. J. Mol. Biol. 259: 349–365.
Sippl, M.J. 1990. The calculation of conformational ensembles from potentials of mean force. An approach to the knowledge based prediction of local structures in globular proteins. J. Mol. Biol. 213: 859–883.
Teichmann, S.A., Park, J. and Chothia, C. 1998. Structural assignments to the Mycoplasma genitalium proteins show extensive gene duplications and domain rearrangements. Proc. Natl. Acad. Sci. 95: 14658–14663.
Wootton, J.C. and Federhen, S. 1996. Analysis of compositionally biased regions in sequence databases. Methods Enzymol. 266: 554–571.
