BIOINFORMATICS APPLICATIONS NOTE

Vol. 21 no. 7 2005, pages 1267–1268 doi:10.1093/bioinformatics/bth493

Sequence analysis

SABmark—a benchmark for sequence alignment that covers the entire known fold space

Ivo Van Walle1,∗, Ignace Lasters2 and Lode Wyns1

1 Department of Ultrastructure, Vrije Universiteit Brussel, Pleinlaan 2, 1050 Brussel, Belgium and 2 Algonomics NV, Technologiepark 4, 9052 Gent-Zwijnaarde, Belgium

Received on July 13, 2004; revised and accepted on August 19, 2004
Advance Access publication August 27, 2004

ABSTRACT
Summary: The Sequence Alignment Benchmark (SABmark) provides sets of multiple alignment problems derived from the SCOP classification. These sets, Twilight Zone and Superfamilies, both cover the entire known fold space using sequences with very low to low, and low to intermediate similarity, respectively. In addition, each set has an alternate version in which unalignable but apparently similar sequences are added to each problem.
Availability: SABmark is available from http://bioinformatics.vub.ac.be
Contact: [email protected]

∗ To whom correspondence should be addressed.

Protein sequence alignment has long been one of the most fundamental tools in bioinformatics and is still an active field of research. Plenty of room for improvement remains, however, because structurally related sequences are more often than not quite dissimilar and contain little information from which to produce a good alignment. Normally, each newly developed technique is compared with other algorithms in terms of both accuracy and speed. Initially this was done using test sets of alignments selected by the authors themselves, and there are still few independent databases designed specifically for this purpose (Bahr et al., 2001; Raghava et al., 2003). The often used BAliBASE, for example, contains 167 alignments that address different problems such as equidistant sequences, orphan sequences and internal insertions.

Our Sequence Alignment Benchmark (SABmark) focuses solely on sequences with very low to intermediate similarity (0–50% identity), since it has been shown that above this range the performance of most alignment programs is already excellent and can be improved little (e.g. see Sauder et al., 2000). In addition, it allows benchmarking of cases where not all sequences are related to each other. SABmark systematically covers the entire known fold space using only high-quality structures taken from the SCOP database (Murzin et al., 1995), and is updated following new releases (currently 1.65). For each alignment problem, pairwise reference structure alignments are provided, derived as a consensus from SOFI and CE (Boutonnet et al., 1995; Shindyalov and Bourne, 1998). Between pairs, these references are not necessarily entirely consistent with each other, which, in contrast to a single reference multiple alignment, reflects the uncertainty about parts of the reference alignment. As a result, multiple alignment algorithms cannot obtain a perfect score for each problem, though nearly so for the more similar structures.

SABmark consists of two alignment sets, Twilight Zone and Superfamilies, which contain single-domain sequences with very low to low, and low to intermediate similarity, respectively. The sequences of the Twilight Zone set are taken from a SCOP subset provided by the ASTRAL compendium, in which domains have a pairwise Blast E-value of at least 1, for a theoretical database size of 10^8 residues (Altschul et al., 1997; Chandonia et al., 2004). To ensure the quality of the structures, this subset is reduced further by retaining only those that have all four backbone atoms present in all residues, and for which the PDB ATOM records exactly match the SEQRES sequence. The remaining structures are subsequently divided into groups corresponding to SCOP folds, the lowest level of structure similarity. In order to avoid bias towards very well-represented folds, the maximum size of each group is limited to 25 sequences simply by keeping only the largest structures. The Superfamilies set is created analogously from sequences that have at most 50% identity to each other, but here groups are formed by division into SCOP superfamilies, which correspond to sequences with a putative common evolutionary origin.

One of the major applications of sequence alignment is determining which sequences are related to each other and which are not. SABmark contains a second version of the Twilight Zone and Superfamilies sets, which allows procedures for this purpose to be benchmarked. In these sets, each group of originally N sequences is expanded with at most N other, structurally unrelated yet apparently similar sequences. These false positives were selected from Blast searches of the original sequences (true positives) against a 70% identity subset of SCOP.
Each false positive belongs to a different fold than the true positives and has at least one E-value to a true positive that is lower than at least one E-value of that true positive with another true positive. In practice, N or more false positives were found for almost all alignments. An overview of the general characteristics of each SABmark set is given in Figure 1 and Table 1.

[Figure 1: plot of Number of pairs (0 to 2000) against Identity (%) (0 to 60) for the series Twi FO, Twi SF, Twi FA, Sup SF and Sup FA]

Fig. 1. Distribution of sequence pairs as a function of percentage identity for the Twilight Zone (Twi) and Superfamilies (Sup) sets, and for each level of structure similarity (fold, superfamily and family). Intervals are 2.5% identity wide.

Table 1. Specifications of each set

Set        Groups   Sequences   Sequence pairs
Twi        209      1740        10 667
TwiFP[a]   209      3458        44 056
Sup        425      3280        19 092
SupFP[a]   425      6526        79 095

Twi, Twilight Zone and Sup, Superfamilies. [a] With added false positives.

© The Author 2004. Published by Oxford University Press. All rights reserved. For Permissions, please email: [email protected]

The database is provided as a well-organized set of standard format files (Fasta, PDB), for which new alignments can be created, scored and archived by using a small number of Perl scripts. For example:

    run_alignm.pl twi
    score.pl twi alignm
    archive.pl twi alignm

This will calculate Align-m multiple alignments for each problem of the Twilight Zone set (Van Walle et al., 2004). A flat text file report is subsequently generated with, for each sequence pair, the number of residues aligned correctly and incorrectly with respect to the reference, and some other information such as percentage identity and structure similarity according to SCOP. Finally, all data are archived into a single zipped tarball that can be restored later. To benchmark a new algorithm or different parameter settings, all that needs to be done is to copy and adjust the run_alignm.pl script. Further details can be found in the accompanying manual.
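As an illustration of the per-pair counts in the report, the following Python sketch compares a test alignment of two sequences with a reference alignment of the same pair. It is not the code of score.pl: the function names, the gapped-string input format and the rule that every test pair absent from the reference counts as incorrect are assumptions made for this example.

```python
from typing import Set, Tuple

Pair = Tuple[int, int]


def aligned_pairs(row_a: str, row_b: str, gap: str = "-") -> Set[Pair]:
    """Return the (i, j) residue index pairs matched by a gapped pairwise alignment.

    Indices are 0-based positions in the ungapped sequences (assumed convention).
    """
    if len(row_a) != len(row_b):
        raise ValueError("alignment rows must have equal length")
    pairs: Set[Pair] = set()
    i = j = 0
    for a, b in zip(row_a, row_b):
        if a != gap and b != gap:
            pairs.add((i, j))
        if a != gap:
            i += 1
        if b != gap:
            j += 1
    return pairs


def score_pair(test: Tuple[str, str], reference: Tuple[str, str]) -> Tuple[int, int]:
    """Count test pairs that agree (correct) or disagree (incorrect) with the reference.

    Assumption: a test pair that does not occur in the reference is incorrect.
    """
    test_pairs = aligned_pairs(*test)
    ref_pairs = aligned_pairs(*reference)
    correct = len(test_pairs & ref_pairs)
    incorrect = len(test_pairs - ref_pairs)
    return correct, incorrect


if __name__ == "__main__":
    reference = ("ACDE-F", "AC-EGF")   # hypothetical reference alignment
    test = ("ACDEF", "ACEGF")          # hypothetical ungapped test alignment
    print(score_pair(test, reference))
```

Representing each alignment as a set of index pairs keeps the comparison independent of how gaps are laid out in the two inputs, which suits references that are defined per sequence pair.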

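The selection rule for false positives given earlier can be written as a simple predicate. The Python sketch below is one reading of that rule, not the procedure used to build SABmark: the function name, the dictionary of Blast E-values keyed by (query, hit) and the treatment of missing entries as infinitely large E-values are all assumptions.

```python
import math
from typing import Dict, Sequence, Tuple

# Hypothetical container: E-value of the Blast hit `hit` found with query `query`.
EValues = Dict[Tuple[str, str], float]


def is_false_positive(
    candidate: str,
    candidate_fold: str,
    true_positives: Sequence[str],
    group_fold: str,
    evalue: EValues,
) -> bool:
    """Check whether `candidate` satisfies the false-positive rule for one group.

    The candidate must belong to a different fold than the group, and for at
    least one true positive its E-value must be lower than the E-value between
    that true positive and at least one other true positive.
    Assumption: a missing E-value means no hit and is treated as infinity.
    """
    if candidate_fold == group_fold:
        return False
    for tp in true_positives:
        e_candidate = evalue.get((tp, candidate), math.inf)
        for other in true_positives:
            if other == tp:
                continue
            if e_candidate < evalue.get((tp, other), math.inf):
                return True
    return False


if __name__ == "__main__":
    # Made-up E-values for a group with two true positives, t1 and t2.
    evalue = {
        ("t1", "t2"): 5.0,
        ("t2", "t1"): 5.0,
        ("t1", "c1"): 0.5,
        ("t1", "c2"): 8.0,
        ("t2", "c2"): 9.0,
    }
    print(is_false_positive("c1", "b.1", ["t1", "t2"], "a.1", evalue))
    print(is_false_positive("c2", "b.1", ["t1", "t2"], "a.1", evalue))
```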

ACKNOWLEDGEMENT

This work was supported by a grant from Instituut voor de bevordering van Wetenschap en Techniek (IWT), Belgium.

REFERENCES

Altschul,S., Madden,T., Schäffer,A., Zhang,J., Zhang,Z., Miller,W. and Lipman,D. (1997) Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. Nucleic Acids Res., 25, 3389–3402.
Bahr,A., Thompson,J., Thierry,J. and Poch,O. (2001) BAliBASE (Benchmark Alignment dataBASE): enhancements for repeats, transmembrane sequences and circular permutations. Nucleic Acids Res., 29, 323–326.
Boutonnet,N., Rooman,M., Ochagavia,M., Richelle,J. and Wodak,S. (1995) Optimal protein structure alignments by multiple linkage clustering: application to distantly related proteins. Protein Eng., 8, 647–662.
Chandonia,J.-M., Hon,G., Walker,N.S., Conte,L.L., Koehl,P., Levitt,M. and Brenner,S.E. (2004) The ASTRAL Compendium in 2004. Nucleic Acids Res., 32 (Database issue), D189–D192.
Murzin,A., Brenner,S., Hubbard,T. and Chothia,C. (1995) SCOP: a structural classification of proteins database for the investigation of sequences and structures. J. Mol. Biol., 247, 536–540.
Raghava,G.P.S., Searle,S.M.J., Audley,P.C., Barber,J.D. and Barton,G.J. (2003) OXBench: a benchmark for evaluation of protein multiple sequence alignment accuracy. BMC Bioinformatics, 4, 47.
Sauder,J., Arthur,J. and Dunbrack,R. (2000) Large-scale comparison of protein sequence alignment algorithms with structure alignments. Proteins, 40, 6–22.
Shindyalov,I. and Bourne,P. (1998) Protein structure alignment by incremental combinatorial extension (CE) of the optimal path. Protein Eng., 11, 739–747.
Van Walle,I., Lasters,I. and Wyns,L. (2004) Align-m—a new algorithm for multiple alignment of highly divergent sequences. Bioinformatics, 20, 1428–1435.
