levels. For example, visual inspection or more detailed analysis of superimposed
structures and associated sequences is required to distinguish well-conserved ...
#
2003 Oxford University Press
Nucleic Acids Research, 2003, Vol. 31, No. 1 DOI: 10.1093/nar/gkg127
505–510
The Structure Superposition Database Ranyee A. Chiang1, Elaine C. Meng1,2, Conrad C. Huang2, Thomas E. Ferrin1,2 and Patricia C. Babbitt1,2,* 1
Department of Biopharmaceutical Sciences and 2Department of Pharmaceutical Chemistry, University of California San Francisco, 513 Parnassus Avenue, San Francisco, CA 94143, USA
Received August 14, 2002; Revised and Accepted October 27, 2002
ABSTRACT The need for new tools for investigating biological systems on a large scale is becoming acute, particularly with respect to computationally intensive analyses such as comparisons of many threedimensional protein structures. Structure superposition is a valuable approach for understanding evolutionary relationships and for the prediction of function. But while available tools are adequate for generating and viewing superpositions of single pairs of protein structures, these tools are generally too cumbersome and time-consuming for examining multiple superpositions. To address this need, we have created the Structure Superposition Database (SSD) for accessing, viewing and understanding large sets of structure superposition data. The initial implementation of the SSD contains the results of pairwise, all-by-all superpositions of a representative set of 115 (b=a)8 barrel structures (TIM barrels). Future plans call for extending the database to include representative structure superpositions for many additional folds. The SSD can be browsed with a user interface module developed as an extension to Chimera, an extensible molecular modeling program. Features of the user interface module facilitate viewing multiple superpositions together. The SSD interface module can be downloaded from http:// ssd.rbvi.ucsf.edu.
INTRODUCTION Structural genomics projects have made significant progress in high-throughput protein structure determination (1,2). Comparative analyses of structures are traditionally and successfully used for predicting protein functions, finding functionally important residues and making evolutionary inferences. Such capabilities are particularly useful for examining protein relationships at the family and superfamily levels. For example, visual inspection or more detailed analysis of superimposed structures and associated sequences is
required to distinguish well-conserved regions of protein structure and to suggest new hypotheses for the prediction of function from structural similarities. Large-scale efforts have been made to perform such comparative structure analyses on all available protein structures, the goal being to classify the protein universe and make new predictions with respect to structure and function (3–5). The results of some of these efforts are available through public databases, e.g. Entrez/ MMDB (6), FSSP (7) and the CE database (8). Current databases do provide visualization tools, but these are largely limited to basic manipulations that are not sufficient for more extensive studies. The necessary detailed inspections of a single pair of superimposed structures can be tedious; using current databases and tools on a large set of superpositions is much more cumbersome. As the number of experimentally determined and modeled/predicted structures of unknown function increases, the goal will shift to discovering the functions of these structures and to the larger-frame questions associated with a general understanding of how structure determines function. Large sets of superpositions and new tools for their examination, visualization and analysis will be required. In response to these needs, we have created the Structure Superposition Database (SSD). The SSD currently contains structure superpositions of a diverse, representative set of (b/a)8 barrel, also known as TIM barrel, domains. Future plans call for expanding the SSD to include superpositions of additional folds. We have developed an extension to the modeling program, Chimera (9), to facilitate visualization of large sets of structure superposition information. Users of the database can also utilize Chimera’s standard capabilities independently to manipulate, view and analyze the data for themselves. The SSD user interface module is publicly available for download at http://ssd.rbvi.ucsf.edu. Chimera is available at http://www.cgl.ucsf.edu/chimera/.
Database content and methods We have chosen superpositions of the (b/a)8 barrel fold as the initial dataset for the SSD because these structures are evolutionarily diverse and versatile, performing many different functions and representing perhaps as many as 10% of enzyme structures (10–13). The challenge of understanding the links between the similar structures and diverse functions of these
*To whom correspondence should be addressed. Tel: þ1 415 476 3784; Fax: þ1 415 476 0668; Email:
[email protected]
Downloaded from https://academic.oup.com/nar/article-abstract/31/1/505/2401554/The-Structure-Superposition-Database by guest on 20 September 2017
506
Nucleic Acids Research, 2003, Vol. 31, No. 1
proteins underscores the need for developing the SSD and associated tools. To gather this set of diverse (b/a)8 domains, we ran pairwise sequence alignments using CLUSTALW (14) on all the (b/a)8 barrel domains listed in the Structural Classification of Proteins (SCOP) database (15), version 1.55 and the (b/a)8 barrel proteins recently added to the Protein Data Bank (PDB) (16), as of November 2001, which had not yet been classified in SCOP [additional (b/a)8 barrel structures will be added to the SSD as regular updates]. For proteins containing multiple (b/a)8 barrel domains, we used the first (b/a)8 barrel domain listed in the PDB file. We gathered a diverse set (N ¼ 115) such that no sequence had greater than 30% identity to any other in the set, choosing in all cases the highest resolution, non-mutated proteins. Wherever possible, unliganded proteins were used. Pairwise, all-by-all structural superpositions were generated using the MinRMS algorithm (17) available from http:// www.cgl.ucsf.edu/Research/minrms. This algorithm is unique in that it finds a set of optimal superpositions for a single pair of proteins, rather than providing a single optimal superposition. For any pair of proteins, MinRMS calculates the set of superpositions such that the number of matched residues
ranges from 2 residues to n, where nequals the full-length of the shorter protein. Each superposition is optimized for the lowest root-mean-square distance (RMSD) for the given number of residues matched. The MinRMS superpositions were generated using BioCluster, a 100-CPU cluster built by Compaq Corporation to assist work on annotating the human genome. MinRMS results for any set of superpositions can be viewed using the AlignPlot extension of Chimera. Documentation for Chimera and AlignPlot is available at http://www.cgl.ucsf.edu/chimera/. The SSD contains the results of these 6555 sets of superpositions. For each superposition, the SSD contains: (i) the transformation matrix, (ii) the alignment scores, and (iii) the pairwise sequence alignment [in Multiple Sequence File (MSF) format]. The transformation matrix is used for calculating the position and orientation when viewing structure superpositions and can be viewed and used independently of other SSD capabilities. Scores available for each superposition include the set of residues matched in the two structures, RMSD, longest distance between any two matched residues and Gerstein–Levitt probability score (18). In the structurebased sequence alignment associated with each structure
Figure 1. Schematic representation of SSD data structures. The enolase superfamily (blue) contains multiple homologous proteins. All of these proteins (MR, enolase, . . .), along with a protein from a different superfamily (PRAI) (peach) can be compared. Each protein has an associated structure (magenta). One set of MinRMS results (green) compares two structures and includes multiple superpositions (cyan) with different numbers of alpha-carbons matched (n).
Downloaded from https://academic.oup.com/nar/article-abstract/31/1/505/2401554/The-Structure-Superposition-Database by guest on 20 September 2017
Nucleic Acids Research, 2003, Vol. 31, No. 1
superposition, residues matched in the superposition are aligned and the unmatched residues are treated as insertions and deletions. The essential database structure for the SSD is provided in Figure 1. Features and applications: the user interface module The data currently available in the SSD will be useful, by itself, for investigators who require sophisticated visualization capabilities for pairs of superimposed structures. But to understand large sets of superpositions together, investigators will require more complex data manipulations and visualizations which are provided through the database user interface module of the SSD. Development of the user interface model was guided by asking what manipulations are required for understanding a single superposition and then creating features to facilitate the same manipulations for large sets of super-
507
positions. The following sections describe the features of the user interface module. Chimera (9) is an interactive molecular modeling program where developers have access to internal data structures and can create their own Python programs to customize the modeling environment. We chose to implement the SSD user interface module as a Chimera extension because of Chimera’s sophistication and large number of analysis features as well as the accessibility and easy manipulation of the molecular graphics data. The superpositions can be searched according to PDB identifiers. Users can search the database using four different methods: (i) Browse the structures sorted by superfamily (using SCOP classifications) and select which proteins to compare together, (ii) search for a single protein and then select, using checkboxes, which additional proteins to compare with that protein, (iii) search for a pair of proteins, or (iv) enter a user-defined group of proteins to view all pairwise
Figure 2. Screenshot of SSD search options in the left frame of the SSD user interface module. The search options are: (A) view structures sorted by superfamily, (B) search for a single structure, (C) search for superpositions of a pair of structures, (D) search for superpositions of all structures within a user-defined group and (E) specify visualization options.
Downloaded from https://academic.oup.com/nar/article-abstract/31/1/505/2401554/The-Structure-Superposition-Database by guest on 20 September 2017
508
Nucleic Acids Research, 2003, Vol. 31, No. 1
comparisons within the group. Figure 2 provides a screenshot of part of the user interface showing search options. Users can choose to view all of the superpositions for one pair of proteins or to view only those superpositions within user-specified ranges of RMSD and number of matched alphacarbons. These options are shown in Figure 2. The superpositions for any single pair of proteins are viewed as a graph of alignment scores versus number of matched alpha-carbons (MinRMS panel in Fig. 1). The user selects bars on the graph to display the scores of the superposition and view the structures superimposed in the main Chimera window. Buttons are included for viewing the transformation matrix and the structure-based sequence alignment. The sequence alignments are viewed through the Chimera extension Multalign Viewer, which allows the user to link the protein sequence to the structure and vice versa. Documentation for Multalign Viewer is available at http://www.cgl.ucsf.edu/chimera. When investigating a group of proteins, users can open and view each pairwise superposition with the same tools described above. Users may also choose to view any number of the proteins superimposed together by selecting a template structure to which the other structures will be superimposed and selecting either an RMSD value or the percent of the protein length matched in the superposition. Structure-based sequence alignments for multiple structures superimposed are
constructed by inserting gaps into the sequences for residues not used in the superposition, leaving only residues that are superimposed in the structure superpositions matched in the sequence alignment. The structure-based sequence alignments for a group of proteins can be viewed with Multalign Viewer simultaneously with the associated superimposed structures in the Chimera main screen, thus allowing the user to explore the relevant sequences and structures together. Figure 3 provides a screenshot of the options for viewing a multiple structure superposition. Users can also open their own multiple sequence alignments using Multalign Viewer. These multiple sequence alignments are automatically associated with a set of structures opened using the SSD interface. Additional features are provided to aid in visualization of multiple structure superpositions and allow users to focus on relevant structural features. For example, users may choose to color only the matched residues in the structure superpositions, leaving unmatched residues gray. Also, the SSD user interface module allows for changing the color of structures and for hiding and redisplaying structures. The standard Chimera molecular representation tools are also available for use in addition to the SSD features. Currently, because the SSD data format is specific to MinRMS and the data is read into Chimera from the SSD database, users cannot directly upload and view superpositions
Figure 3. Viewing multiple superpositions together. The menu options (A) for viewing a group of proteins allow the user to view superpositions for a pair of proteins (B) or more than two proteins superimposed together (C). When multiple structures are superimposed relative to one template, the scores for each pairwise superposition are displayed in a table that provides the RMSD values representing the comparison of the template structure to each structure checked in the list (D). The structures are displayed in the main Chimera window. The structure-based sequence alignment is shown through the Multalign Viewer interface (E), which connects the sequence alignments to the graphical display of structures.
Downloaded from https://academic.oup.com/nar/article-abstract/31/1/505/2401554/The-Structure-Superposition-Database by guest on 20 September 2017
Nucleic Acids Research, 2003, Vol. 31, No. 1
509
Table 1. Comparison of the SSD with other structure superposition databases Capability
Database SSD
MMDB
FSSP
CE
Viewing of multiple structures superimposed
Y Chimera
Y Cn3D
Y Rasmol
Search for a group of structures Structures arranged by curated superfamilies Access to internal data structures and customization of modeling environment Visual inspection of quality of alignment/viewing of aligned and unaligned positions of structures Residues in structures linked to protein sequence Residues in structures can be linked to users’ multiple sequence alignments Multiple superpositions available for a single pair of proteins Universal structure file formats and transformation information for additional computation or use with other software can be downloaded All PDB proteins and folds represented
Y Y Y
Y N N
N N Y
Y Rasmol, Java, Chimea N N Y
Y
Y
N
N
Y Y
Y N
N N
N N
Y
N
N
N
Y
N
N
Y
N
Y
Y
Y
a
For this database, the features which meet these criteria are accessible using different visualization techniques.
generated by other structure alignment methods. Users who wish to use the SSD user interface module to view such superpositions can contact the authors to make arrangements. Possibilities for allowing users to view their own MinRMSgenerated superpositions are being explored. Accessibility and availability To use the SSD, Chimera must be downloaded and installed. Chimera is compatible with and has been tested on Windows 98/NT/2000, SGI IRIX 6.5, Compaq Tru64 UNIX 5.0/5.1, and Redhat Linux 6.2/7.1. Chimera also works on Windows XP, although this platform has not been tested by the developers. An alpha test version of Chimera for Mac OS X, which will require an X-windows server, will be released in late 2002. A future version will run with the native OS X windowing system. The SSD user interface module must also be downloaded and installed separately and is available at http:// ssd.rbvi.ucsf.edu. Documentation and instructions for using the SSD, with an example investigating common structural elements of the wellcharacterized enolase superfamily of (b/a)8 proteins (19,20), is available on the SSD site. This ‘Tutorial’ file guides users through manipulations possible within the user interface module. Methods similar to those described here can be used to identify and investigate new superfamily relationships.
order the data, they cannot clearly indicate which regions are better aligned or whether regions of interest are well aligned. Even when visualization of pairwise or multiple superpositions is available, existing databases are especially limited with respect to viewing and evaluation of multiple structural superpositions simultaneously with integrated multiple sequence alignments. The SSD includes several features that enable sophisticated analysis and visualization of multiple structural superpositions together. Many of these features are either not available or only minimally developed in other available databases. Table 1 provides a comparison between the SSD and several existing databases with respect to these issues. The capabilities included in Table 1 reflect those the authors consider most important for sophisticated visualization of structural superposition data. FUTURE DEVELOPMENTS We plan to add structures and structure superpositions representing additional folds. We also plan to add structure superpositions for proteins in well-characterized and established superfamilies as part of a project to create a gold standard set for investigations of superfamilies. Other future developments include support for downloading subsets of the data and more options for viewing multiple superpositions.
Comparison between the SSD and other structure superposition databases
ACKNOWLEDGEMENTS
Beyond the simple visualization tools provided by such software as Cn3D and Rasmol/Chime, evaluation of superpositions in current databases has been largely limited to using various numerical scores representing the overall quality of a structure superposition. Although such quantitative measures allow convenient use of clustering and statistical techniques to
The authors wish to thank Compaq Computer Corporation (now Hewlett Packard Company) for access to their 100-CPU ‘BioCluster’ computer system, where many of our calculations were performed. This work was supported by NIH GM60595 (to P.C.B.) and P41 RR-01081 (to T.E.F.). Ongoing support for SSD access is provided by the Laboratory for Functional
Downloaded from https://academic.oup.com/nar/article-abstract/31/1/505/2401554/The-Structure-Superposition-Database by guest on 20 September 2017
510
Nucleic Acids Research, 2003, Vol. 31, No. 1
Informatics, a component of the UCSF Research Resource for Biocomputing, Visualization and Informatics.
REFERENCES 1. Burley,S.K., Almo,S.C., Bonanno,J.B., Capel,M., Chance,M.R., Gaasterland,T., Lin,D., Sali,A., Studier,F.W. and Swaminathan,S. (1999) Structural genomics: beyond the human genome project. Nature Genet., 23, 151–157. 2. Chance,M.R., Bresnick,A.R., Burley,S.K., Jiang,J.S., Lima,C.D., Sali,A., Almo,S.C., Bonanno,J.B., Buglino,J.A., Boulton,S. et al. (2002) Structural genomics: a pipeline for providing structures for the biologist. Protein Sci., 11, 723–738. 3. Holm,L. and Sander,C. (1996) Mapping the Protein Universe. Science, 273, 595–603. 4. Young,M.M., Skillman,G. and Kuntz,I.D. (1999) A rapid method for exploring the protein structure universe. Proteins, 34, 317–332. 5. Dietmann,S. and Holm,L. (2001) Identification of homology in protein structure classification. Nature Struct. Biol., 8, 953–957. 6. Wang,Y., Anderson,J.B., Chen,J., Greer,L.Y., He,S., Hurwitz,D.I., Liebert,C.A., Madej,T., Marchler,G.H., Marchler-Bauer,A. et al. (2002) MMDB: Entrez’s 3D structure database. Nucleic Acids Res., 30, 249–252. 7. Holm,L. and Sander,C. (1998) Touring protein fold space with Dali/FSSP. Nucleic Acids Res., 26, 316–319. 8. Shindyalov,I.N. and Bourne,P.E. (2001) A database and tools for 3D protein structure comparison and alignment using the Combinatorial Extension (CE) algorithm. Nucleic Acids Res., 29, 228–229. 9. Huang,C.C., Novak,W.R., Babbitt,P.C., Jewett,A.I., Ferrin,T.E. and Klein,T.E. (2000) Integrated tools for structural and sequence alignment and analysis. Pac. Symp. Biocomput., 230–241.
10. Farber,G.K. and Petsko,G.A. (1990) The evolution of alpha/beta barrel enzymes. Trends Biochem. Sci., 15, 228–234. 11. Gerstein,M. (1997) A structural census of genomes: comparing bacterial, eukaryotic, and archaeal genomes in terms of protein structure. J. Mol. Biol., 274, 562–576. 12. Gerstein,M. (1998) Patterns of protein-fold usage in eight microbial genomes: a comprehensive structural census. Proteins, 33, 518–534. 13. Wierenga,R.K. (2001) The TIM-barrel fold: a versatile framework for efficient enzymes. FEBS Lett., 492, 193–198. 14. Thompson,J.D., Higgins,D.G. and Gibson,T.J. (1994) CLUSTALW: improving the sensitivity of progressive multiple sequence alignment through sequence weighting, position-specific gap penalties and weight matrix choice. Nucleic Acids Res., 22, 4673–4680. 15. Conte,L.L., Brenner,S.E., Hubbard,T.J.P., Chothia,C. and Murzin,A.G. (1992) SCOP database in 2002: refinements accomodate structural genomics. Nucleic Acids Res., 30, 264–267. 16. Westbrook,J., Feng,Z., Jain,S., Bhat,T.N., Thanki,N., Ravichandran,V., Gilliland,G.L., Bluhm,W., Weissig,H., Greer,D.S. et al. (2002) The Protein Data Bank: unifying the archive. Nucleic Acids Res., 30, 245–248. 17. Jewett,A.I., Huang,C.C. and Ferrin,T.E. MINRMS: An efficient algorithm for determining protein structure similarity using root-mean squared-distance. Bioinformatics, in press. 18. Levitt,M. and Gerstein,M. (1998) A unified statistical framework for sequence comparison and structure comparison. Proc. Natl Acad. Sci. USA, 94, 5913–5920. 19. Babbitt,P.C., Hasson,M.S., Wedekind,J.E., Palmer,D.R., Barrett,W.C., Reed,G.H., Rayment,I., Ringe,D., Kenyon,G.L. and Gerlt,J.A. (1996) The enolase superfamily: a general strategy for enzyme-catalyzed abstraction of the alpha-protons of carboxylic acids. Biochemistry, 35, 16489–16501. 20. Gerlt,J.A. and Babbitt,P.C. (2001) Divergent evolution of enzymatic function: mechanistically diverse superfamilies and functionally distinct suprafamilies. Annu. Rev. Biochem., 70, 209–246.
Downloaded from https://academic.oup.com/nar/article-abstract/31/1/505/2401554/The-Structure-Superposition-Database by guest on 20 September 2017