Vol. 16 no. 5 2000 Pages 458–464
BIOINFORMATICS
RSDB: representative protein sequence databases have high information content Jong Park 1, Liisa Holm 1, Andreas Heger 1 and Cyrus Chothia 2 1 The
European Bioinformatics Institute, EMBL Outstation, Cambridge CB10 1SD, UK and 2 LMB, MRC Centre, Hills Road, Cambridge, CB2 2QH, UK Received on October 28, 1999; revised and accepted on December 9, 1999
Abstract Motivation: Biological sequence databases are highly redundant for two main reasons: 1. various databanks keep redundant sequences with many identical and nearly identical sequences 2. natural sequences often have high sequence identities due to gene duplication. We wanted to know how many sequences can be removed before the databases start losing homology information. Can a database of sequences with mutual sequence identity of 50% or less provide us with the same amount of biological information as the original full database? Results: Comparisons of nine representative sequence databases (RSDB) derived from full protein databanks showed that the information content of sequence databases is not linearly proportional to its size. An RSDB reduced to mutual sequence identity of around 50% (RSDB50) was equivalent to the original full database in terms of the effectiveness of homology searching. It was a third of the full database size which resulted in a six times faster iterative profile searching. The RSDBs are produced at different granularity for efficient homology searching. Availability: All the RSDB files generated and the full analysis results are available through internet: ftp:// ftp.ebi.ac.uk/ pub/ contrib/ jong/ RSDB/ http:// cyrah.ebi.ac.uk:1111/ Proj/ Bio/ RSDB Contact:
[email protected]
Introduction Database growth in the biology field has been exponential. Unlike other databases in science and technology, the biological databases are unique in that the entries can be grouped according to evolutionary relationships. Therefore, they can be clustered, classified and reduced in size without critical kinship information being lost. The question we asked is how many sequences can be thrown out before critical information is lost in big sequence databases. We tried to answer the question by 458
precisely calibrating the information contents of nine non-redundant representative sequence databases (RSDB) by measuring their performance in homology searching. Homology searching in bioinformatics is one of the most standard and widely used methods. Since the 1970s (Needleman and Wunsch, 1970), there has been consistent development in algorithms (Altschul et al., 1990; Pearson and Lipman, 1988; Smith and Waterman, 1981; Waterman, 1986) to cope with the exploding biological sequence databases, mostly from genome projects. A major breakthrough in the search algorithms has been the use of multiple sequences in generating sequence profile (Gribskov et al., 1987; Luthy et al., 1994; Thompson et al., 1994); hidden Markov models (Baldi et al., 1994; Eddy et al., 1995; Krogh et al., 1994) and templates (Taylor, 1988; Tatusov et al., 1994; Yi and Lander, 1994). Most of them can be iterated easily to achieve higher performance (Park et al., 1997; Tatusov et al., 1994). It has been shown that iterative multiple sequence search algorithms can be at least two times more sensitive with a test set of distant homologues [PDB40D: 40% sequence identity or less to each other, Park et al. (1998)]. Such additional sensitivity is crucial in discovering new biological functions with unknown sequences, building threading algorithms, predicting protein structures, and organizing genomic sequences. The third critical assessment of protein structure prediction methods [CASP3, 1998, http: //predictioncenter.llnl.gov/casp3/, Sternberg et al. (1999)] showed a significant increase of prediction quality, both in threading and ab initio protein structure prediction categories as a result of the use of good sequence profile search methods [e.g. PSI-BLAST, Altschul et al. (1997)] However, the time taken by the iteration over very large databases is much higher compared with any single pairwise search like traditional BLAST or FASTA. So, it will be very time consuming to search all the protein sequences in databanks with large numbers of new sequence queries from various genome projects. As an approach to alleviate the speed problem, nonredundant sequence databases (NRDB) can be used to exclude identical and almost identical sequences c Oxford University Press 2000
Representative protein sequence databases
(Bleasby and Wootton, 1990; Bleasby et al., 1994; Holm and Sander, 1998; Kallberg and Persson, 1999). Non-redundancy is important for both sensitivity and the speed of search algorithms. Statistical bias due to the overabundance of and flooding search outputs by certain protein families can affect the scoring methods of search programs. This diminishes the benefit of using large databases for iterative search methods. However, the effectiveness of such non-redundant databases has not been systematically calibrated in terms of information content. Also, it has not been tested if databases with a very small number of RSDB can be sufficient in providing necessary intermediate sequences for building good profiles. It is well-known that above the ‘twilight zone’ of around 30–40% sequence identity in proteins it is often straightforward to search and align sequences (Chung and Subbiah, 1996; Doolittle, 1986; Rost, 1999). Does this mean that an RSDB with 40% of mutual sequence identity (RSDB40) is equivalent to an RSDB with 90% mutual sequence identity (RSDB90)? If so, then, how far can we remove similar sequences before we lose critical information? To answer these questions a structure classification database, a set of gradually concentrated representative sequence databases (RSDB), and PSI-BLAST, a multiple sequence iterative search program, were used. Note: The difference between NRDB and RSDB in this paper is that the term NRDB is used for the databases containing no identical or almost identical sequences; NRDB is a special database of RSDB.
Methods The overall procedure is to run PSI-BLAST with a well defined test set of proteins against nine RSDBs which have different levels of redundancy. The performance of PSI-BLAST in finding distant homologues will reflect the information content of the RSDBs. For a more detailed explanation of the assessment methods a complete package for sequence search algorithms assessment [SAT package, Park et al. (2000)] can be found at: ftp://ftp.ebi.ac.uk/pub/contrib/jong/SAT/. Protein structure is more conserved than sequence, therefore, structural information of proteins can be used in measuring the sensitivity of sequence search methods by providing definite criteria of homology (Brenner et al., 1998; Park et al., 1998). Generating a query database set. PDB40D-J5 The structural classification of proteins (SCOP) provided us with the definite homology criterion information for this work. The SCOP database contains a description of the evolutionary and structural relations of those proteins whose atomic structure has been determined (Murzin
et al., 1995). The current version is available on the internet at http://scop.mrc-lmb.cam.ac.uk/scop/. The unit of classification in the database is the protein domain. Small proteins, and most of those of medium size, have a single domain and are, therefore, treated as a whole. The domains that form large proteins are classified individually. Domains are clustered together into families if they have close evolutionary relationships. Superfamilies bring together families whose proteins have low sequence identities but whose structural details and, in many cases, functional features suggest that a common evolutionary origin is very probable, for example, the variable and constant domains of immunoglobulins. The fold classification brings together superfamilies that have the same secondary structures in the same arrangement. For most superfamilies in this category there is no good evidence that they have evolutionary relationships. In a few cases, the situation is less clear in that the evidence at the time weakly supports the existence of evolutionary relationships. In these cases, the superfamilies are kept separate until the subsequent discovery of intermediate structures provides strong support for their merger. We measured the extent to which PSI-BLAST could find the evolutionary relationships described by the superfamilies in the SCOP. As there are few problems in finding relationships between proteins that have sequence identities of 40% or more, we used a database of sequences that have pairwise identities of 40% or less, which we call PDB40DJ5. This database contains 1566 sequences, 261 superfamilies containing 1283 sequences, and 283 singlets. There were 6958 possible homologous pairs, where both members of a pair are in the same SCOP superfamily. Nonhomologous pairs are formed by any two sequences that have different folds. There are 1 218 437 non-homologous pairs in the PDB40D-J5 database.
Search algorithm Position-Specific Iterative (PSI) BLAST can be used in finding homologues in an iterative fashion. In doing so, a protein family specific similarity profile is generated to discover more distant homologues as the iteration proceeds. PSI-BLAST is dependent on the multiple sequence information given by the alignment of the family members, which are less than 98% identical to each other. The more the distant family members are incorporated, the better the profile becomes for more distant family members. Therefore, the sensitivity of the program is affected largely by: 1. the number of homologues 2. the quality of the profiles generated by the sequences. If PSI-BLAST can find all the family members with a profile made from a subset of homologues from a 459
J.Park et al
large database, it would not require the full evolutionarily redundant database (denoted as RSDB100). The main steps of the procedure are: 1. for a given query sequence, an initial set of homologues is collected from the sequence database using GAP-BLAST, a new version of BLAST that generates gapped alignments, and a conventional score matrix (here we use BLOSUM-62) 2. a weighted multiple alignment is made from the query sequence and the homologues whose match scores are better than a specified cut-off value (h parameter) 3. a position-specific score matrix is constructed from this alignment 4. this matrix is then used to search the database for new homologues 5. new homologues with a good match score are used to construct a new position-specific score matrix which is then used in a further search for homologues 6. rounds of matrix reconstruction and new searches are iterated until no new homologues are found or until the number of iterations reaches a specified limit. The procedure uses two important parameters that can be set by the user. The first parameter, h, is the BLAST step expected number of hits with higher score (E-value) cut-off used for selecting homologues for the alignments and the position-specific score matrix. The E-value is a statistical scoring scheme which is defined as, Evalue = P D, where P is the probability of BLAST algorithm calculated from query sequence length and library (database) size, and D is the size of database used (Altschul and Gish, 1996). If PSI-BLAST parameter h is made too low, only close homologues of the query sequence are used to make the position-specific score matrix, and the sequence variation is too limited to find distant homologues. The second PSI-BLAST parameter, j, is the number of iterations of matrix reconstruction and new searches that are carried out to find new homologues. We found that to achieve a low rate of false positive homology predictions in our experiments (Park et al., 1997), effective parameters for detecting true sequence relationships are h = 0.000 5 and j = 5–10. In the calculations described below all PSI-BLAST calculations used h = 0.000 5 and j = 5. Two additional parameters (b and v) are critical when PSI-BLAST searches databases with large sequence families. They set the number of sequence matches shown in each iteration. We used values of b = 1000 (default: 250) and v = 1000 (default: 500). The value of 1000 means 460
up to 1000 matches are used in profile generation and displayed in the output. With RSDB20, RSDB40, RSDB50, and RSDB60, these parameters do not affect the performance, as the homologues of any very large family are relatively few. For a very large family, such as IG domain, b and v values of 15 000 were necessary to collect all the distant family members in each iteration step. The high b and v parameter caused an unacceptable speed loss. Some sequence from large families took over 12 h each with b = v = 15 000.
Implementation Nine RSDBs were constructed based on the union of Swissprot, Swissnew, Trembl, Tremblnew, Genbank, PIR, Wormpep and PDB (it is called RSDB100 for convenience). The RSDB100 contains 370 405 sequence of 111 958 534 amino acid residues (March, 1999). The schematic representation of the process is shown in Figure 1. The first algorithm used, after removing identical sequences, was NRDB90 (Holm and Sander, 1998) to sort them by size and to remove neighboring sequences with identities of 90% or higher. The same algorithm also generated RSDB99 and RSDB95 databases. Subsequently, the gapped BLAST 2.0 algorithm was applied to remove any pairs for lower percentage identities. The percentage identities were derived from the alignments by BLAST in a comprehensive pairwise comparison database called PAIRSDB (L. Holm et al. unpublished). Sequence neighbours around representatives above the target mutual sequence identity (MSI) were removed keeping the longer homologues in the representatives to reach low sequence identity RSDBs. The smaller homologues are removed in the concentration process. At each stage, some close neighbouring homologues are removed. Lastly, the PDB40D-J5 sequences were added to all the RSDBs to be used as target matches. This makes them very slightly redundant for low identity RSDBs. However, the ratios are such it would not affect RSDBs in any significant way. For example, the smallest RSDB20 database (15 646 160 amino acid residues) is 49 times larger than PDB40D-J5 (317 580 amino acid residues), and the redundancy it creates can not be significant. It should also be noted that PDB40D itself is a representative database. There is a very sharp drop in RSDB size from 100% to 99% and 90%, indicating that there were many nearly identical sequences in the original full database (Figure 2). All the searches were done by dedicated DEC Alpha UNIX machines at 500 MHz clock speed with the memory size from 256 megabytes to 1 gigabytes. The calculations took less than 10 weeks total CPU time. The processing time was most affected by the PSI-BLAST parameters of j, b and v. The time taken by PSI-BLAST with j = 5, b = v = 1000 and is shown in Figure 3
Representative protein sequence databases
Fig. 1. The schematic diagram of generating RSDBs. (1) The original full database embedded in the whole protein sequence space where distance in plane projection reflects distance in sequence identity. Dense regions represent families with many close homologues. (2) Non-redundant database after removal of identical or nearly identical sequences. (3) More neighbouring sequences removed leaving a small number of representatives sequences. (4) Small set of representatives where each sequences has a very low mutual sequence identity.
Fig. 2. The mutual sequence identity versus RSDB size (in amino acid residue numbers). The x-axis shows the mutual sequence identity in sequence space. The y-axis represents the number of amino acids.
From the PSI-BLAST output we calculated: 1. direct pair match hits, 2. direct plus intermediate linkage hits. The intermediate linkage hit scoring scheme is to assess the performance of the databases in their ability to provide intermediate sequences for very distant superfamily members. For example, in the intermediate linkage hit scoring scheme, if a sequence A matches to B and B to C, even if there is no direct match between A and C, A is calculated to be matching C through the intermediate B.
Results and discussion The main result of this report (Figure 4) is that RSDB50 performed better than the full protein sequence database. It performed better than RSDB100 regardless of the scoring schemes used (direct pair match hits or linkage hits scoring scheme). RSDB50 is three times smaller and six times faster than RSDB100. Pairwise comparison using GAP-BLAST, resulted in identical performances for all the RSDBs. This means as long as target sequences are in the target database, hits are found in the same ranks. GAP-BLAST had 981 homologues pairs reported at 70 non-homologoes which
Fig. 3. The speed of the sequence search with different RSDBs. 100% (RSDB100) means there was no removal of neighbouring sequences.
was equivalent to 1% mismatches per possible good pairs (MPGP) while 1403 homologues were found with RSDB20 by using PSI-BLAST (Figure 4). The performance of PSI-BLAST peaked at around RSDB50 and went down again gradually toward RSDB100. The E-value cut-off used (0.0005) is not the same in effect for all the RSDBs, because E-value is dependent on database size. For RSDB20 and RSDB100 the E461
J.Park et al
Fig. 4. The comparison of the information content of nine RSDBs in terms of success in homology searching. The x-axis is for the number of non-homologous matches (errors). The y-axis is for the number of homologous matches (hits). The maximum possible value for the y-axis is 6958. The bottom dashed curve is for GAPBLAST performance. The solid curves are for RSDB20, RSDB30, RSDB40 and RSDB50 from the bottom. The dotted curve are RSDB60, RSDB70, RSDB80, RSDB90 and RSDB100 from the top. MPGP: mismatches per total good pairs.
value difference (reported by PSI-BLAST) was 6.2, on average. In other words, in the rankings of the matches pairs of query sequences, the same sequence pair occurs in RSDB20 with an E-value 6.2 times higher than the one of RSDB100, on average. This corresponds to the size difference between the two RSDBs which is 7. The consequence of these different effective E-values is that the performances of the smaller RSDBs are slightly underestimated. As PSI-BLAST lost some homologous matches with the default low b and v parameters, searches were done for superfamily members from large families such as IG domain. With b = v = 15 000, the performances increased slightly for the larger RSDB sets, as in Figure 5. This suggests that the higher performance with RSDB50 was partly responsible for the b and v parameters of PSIBLAST. Nevertheless, RSDB50 was still equivalent to RSDB100. The performances with the intermediate linkage hit scoring scheme (Figure 6) showed how efficiently the search found members lying at far distances in the sequence space. The bottom line as a control, shows the performance of RSDB50 with a direct pair match hit scoring as in Figure 4. The top two dashed curves are for RSDB90 and RSDB100. The overall picture of the results 462
Fig. 5. The performance of RSDB with b and v parameters of PSIBLAST set to 15 000. The top two dashed curves are for RSDB100 and RSDB90. The thick curve RSDB50. RSDB60, RSDB70 and RSDB80 are close to RSDB50 which are shown as thin curves. The bottom dashed curve shows the pairwise BLAST performance.
is the same as when direct pair match performance is used, as in Figure 4. Theoretically, all the possible pairs with a given number of members for any superfamily could have been found if sequence profiles were perfect. At 70 non-homologues cut-off, RSDB50 scored 3620 hits. This means 52% of the true homologous pairs could be found by PSI-BLAST with the intermediate linkage hit scoring scheme. As the query sequences from PDB40D-J5 were down to 40% non-redundant, for a real world situation, it is a significant underestimation because complete genomes have sequences which have mutual sequence identities over 40%. The control RSDB50 (bottom solid curve in Figure 6) with the direct pair match hits score is 1942 which is equivalent to 28% coverage for 6958 all possible homologues. Therefore, there is about two-fold difference between the two different scoring schemes. This result implies that there is room for improvement in building profiles. A summary of the performance of RSDBs is shown in Figure 7. The performance was again from the nonhomologues cut-off of 70 in the direct pair match scoring scheme. It shows the overall benefit of choosing any specific RSDB. At RSDB50, the performance gain is the highest. There are three important parameters for any productive bioinformatic sequence search involving large databases. They are speed in searching, size of the database (in re-
Representative protein sequence databases
Fig. 6. The performances of RSDBs with intermediate linkage hit scoring scheme. The top thickest curve is for RSDB50. The bottom curve is for RSDB50 measured by the direct pair match hit scoring scheme. The two dashed curves are for RSDB100 and RSDB90 respectively from the bottom at the point 30 on the x-axis. All other curves account for RSDB60, RSDB70 and RSDB80.
gard to storage and manipulation) and sensitivity of the algorithm. If any of the factors costs too much, it diminishes the efficiency of homology searching. This report showed the relationship of the above parameters. The sensitivity (performance) of protein sequence databases in homology searching is not linearly proportional to the size and speed of the databases. This is due to the inherent characteristics of biological sequence databases which have an extensive network of homology relationship between sequences (Holm and Sander, 1996). Therefore, as long as some critical evolutionary information is kept in representative sequences, sequence databases maintain high information content even at 50% mutual sequence identity. The performance of the full sequence database RSDB100, was worse than or similar to that of RSDB50 (Figures 4 and 7). This indicates a decreased efficiency in profile construction of PSI-BLAST with over-represented redundant sequence members and inadequate b and v parameters of of PSI-BLAST. Weighting sequences according to the evolutionary divergence is a general and difficult problem in the development of multiple sequence search algorithms (Karchin and Hughey, 1998; Krogh and Mitchison, 1995). Therefore, the higher performance of RSDB50 exemplifies the characteristics of the representative sequence database which have more regular evolutionary distance between sequences in the search space. With the sensitivity of PSI-BLAST most of the intermediate sequences with critical kinship information were preserved in RSDB50. If PSI-BLAST were more
Fig. 7. The x-axis represents the maximum percentage of identity between pairs of sequences in representative sequence databases (RSDBs). The y-axis represents the performance of PSI-BLAST searches. The first data point (RSDB10) shows a simple noniterative pairwise BLAST (GAP-BLAST) performance as a control.
sensitive, the boundary could have been lower than 50% mutual sequence identity. However, the general picture of conserved information content with smaller RSDB would not change regardless of the parameters and algorithms used. It is noteworthy that only half of all the homologous pairs were found even with the intermediate linkage hit scoring scheme at the 70 non-homologous pair cut-off. In the remaining half of the sequence space 1. sampling is still inadequate for methods of PSIBLAST caliber to detect structurally defined homologous pairs 2. known sequences are at the moment,unevenly distributed causing uneven profile weighting regardless of the level of near-neighbour removal. The results validate the existing approaches of processed, smaller, non-redundant and more organized database construction as a cost-effective solution (Bateman et al., 1999; Sonnhammer et al., 1997; Teichmann et al., 2000) in homology search and analysis.
Acknowledgements J.P. thanks the members of Church laboratory at Harvard Medical School who were lovely and helpful, S.A.Teichmann for collaboration, Terry Horsnell, the sysadmin of MRC LMB, for the wonderful, nonauthoritative and user-oriented computer administration, Maryana Huston for her caring collaboration, and many 463
J.Park et al
people who have shown passion, love, ruthless objectivity and honest care toward science. The project was supported by Hoechst Marion Russell and EU grant to EMBL-EBI. We all thank the sharp, precise and constructive criticisms by the Bioinformatics referees.
References Altschul,S.F. and Gish,W. (1996) Local alignment statistics. Meth. Enzymol., 266, 460–480. Altschul,S.F., Gish,W., Miller,W., Myers,E.W. and Lipman,D.J. (1990) Basic local alignment search tool. J. Mol. Biol., 215, 403– 410. Altschul,S.F., Madden,T.L., Schaffer,A.A., Zhang,J., Zhang,Z., Miller,W. and Lipman,D.J. (1997) Gapped BLAST and PSIBLAST: a new generation of protein database search programs. Nucl. Acids Res., 25, 3389–3402. Baldi,P., Chauvin,Y., Hunkapiller,T. and McClure,M.A. (1994) Hidden Markov models of biologicla primary sequence information. Proc. Natl Acad. Sci. USA, 91, 1059–1063. Bateman,A., Birney,E., Durbin,R., Eddy,S.R., Finn,R.D. and Sonnhammer,E.L. (1999) Pfam 3.1: 1313 multiple alignments and profile HMMs match the majority of proteins. Nucl. Acids Res., 27, 260–262. Bleasby,A.J. and Wootton,J.C. (1990) Construction of validated, non-redundant composite protein sequence databases. Protein Eng., 3, 153–159. Bleasby,A.J., Akrigg,D. and Attwood,T.K. (1994) OWL—a nonredundant composite protein sequence database. Nucl. Acids Res., 22, 3574–3577. Brenner,S.E., Chothia,C. and Hubbard,T. (1998) Assessing sequence comparison methods with reliable structurally identified distant evolutionary relationships. Proc. Natl Acad. Sci. USA, 95, 6073–6078. Chung,S.Y. and Subbiah,S. (1996) A structural explanation for the twilight zone of protein sequence homology. Structure, 15, 1123–1127. Doolittle,R.F. (1986) Of URFs and ORFs: Primer on How to Analyze Derived Amino Acid Sequences. University Science Books, Mill Valley CA. Eddy,S.R. (1995) ISMB 95 (Intelligent Systems in Molecular Biology conference) multiple alignment using hidden Markov models, ISMB 95. Eddy,S.R., Mitchison,G. and Durbin,R. (1995) Maximum discrimination hidden Markov models of sequence consensus. J. Comput. Biol., 2, 9–23. Gribskov,M., McLachlan,A.D. and Eisenberg,D. (1987) Profile analysis: detection of distantly related proteins. Proc. Natl Acad. Sci. USA, 84, 4355–4358. Holm,L. and Sander,C. (1996) Mapping the protein universe. Science, 273, 595–603. Holm,L. and Sander,C. (1998) Removing near-neighbour redundancy from large protein sequence collections. Bioinformatics, 14, 423–429. Holm,L. and Sander,C. (1999) Protein folds and families: sequence and structure alignments. Nucl. Acids Res., 27, 244–247. Kallberg,Y. and Persson,B. (1999) KIND—a non-redundant protein database. Bioinformatics, 15, 260–261. Karchin,R. and Hughey,R. (1998) Weighting hidden Markov mod-
464
els for maximum discrimination. Bioinformatics, 14, 772–782. Krogh,A. and Mitchison,G. (1995) ISMB95 (Intelligent Systems in Molecular Biology conference) Maximum entropy weighting of aligned sequences of proteins or DNA. ISMB95, 215–221. Krogh,A., Brown,M., Mian,I.S., Sjolander,K. and Haussler,D. (1994) Hidden Markov models in computational biology— applications in protein modelling. J. Mol. Biol., 235, 1501–1531. Luthy,R., Xenarios,I. and Bucher,P. (1994) Improving the sensitivity of the sequence of the sequence profile method. Protein Sci., 3, 139–146. Murzin,A., Brenner,S.E., Hubbard,T. and Chothia,C. (1995) Scop: a structural classification of proteins database for the investigation of sequences and structures. J. Mol. Biol., 247, 536–540. :(see also: http://scop.mrc-lmb.cam.ac.uk/scop). Needleman,S.B. and Wunsch,C.D. (1970) A general method applicable to the searc for similarities in the amino acid sequence of two proteins. J. Mol. Biol., 48, 443–453. Park,J., Holm,L. and Chothia,C. (2000) Sequence search assessment testing toolkit (SAT). Bioinformatics, 16, 104–110. Park,J., Teichmann,S.A., Hubbard,T. and Chothia,C. (1997) Intermediate sequences increase the detection of distant sequence homologies. J. Mol. Biol., 273, 349–354. Park,J., Karplus,K., Barrett,C., Hughey,R., Haussler,D., Hubbard,T. and Chothia,C. (1998) Sequence comparisons using multiple sequences detect three times as many remote homologues as pairwise methods. J. Mol. Biol., 284, 1201–1210. Pearson,W.R. and Lipman,D.J. (1988) Improved tools for biologicla sequence comparison. Proc. Natl Acad. Sci. USA, 85, 2444– 2448. Rost,B. (1999) Twilight zone of protein sequence alignments. Protein Eng., 12, 85–94. Smith,T.F. and Waterman,M.S. (1981) Identification of common molecular subsequences. J. Mol. Biol., 147, 195–197. Sonnhammer,E.L., Eddy,S.R. and Durbin,R. (1997) Pfam: a comprehensive database of protein domain families based on seed alignments. Proteins, 28, 405–420. Sonnhammer,E.L., Eddy,S.R., Birney,E., Bateman,A. and Durbin,R. (1998) Pfam: multiple sequence alignments and HMM-profiles of protein domains. Nucl. Acids. Res., 26, 320–322. Sternberg,M.J., Bates,P.A., Kelley,L.A. and MacCallum,R.M. (1999) Progress in protein structure prediction: assessment of CASP3. Curr. Opin. Struct. Biol., 9, 368–373. Tatusov,R.L., Altschul,S.F. and Koonin,E.V. (1994) Detection of conserved segments in proteins: iterative scanning of sequence databases with alignment blocks. Proc. Natl Acad. Sci. USA, 91, 12091–12095. Taylor,W.R. (1988) A flexible method to align large numbers of biologicla sequences. J. Mol. Evol., 28, 161–169. Teichmann,S.A., Chothia,C., Church,G.M. and Park,J. (2000) Fast assignment of protein structures to sequences using the Intermediate Sequence Library PDB-ISL. Bioinformatics, 16, 117–124. Thompson,J.D., Higgins,D.G. and Gibson,T.J. (1994) Improved sensitivity of profile searches through the use of sequence weights and gap excision. Comput. Appl. Biosci., 10, 19–29. Waterman,M.S. (1986) Multiple sequence alignment by consensus. Nucl. Acids Res., 14, 9095–9102. Yi,T.M. and Lander,E.S. (1994) Recognition of related proteins by iterative template refinement. Protein Sci., 3, 1315–1328.