
BIOINFORMATICS APPLICATIONS NOTE

Vol. 21 no. 20 2005, pages 3926–3928 doi:10.1093/bioinformatics/bti632

Structural bioinformatics

Structure clustering features on the Sfold Web server

Chi Yu Chan1,*, Charles E. Lawrence1,2 and Ye Ding1

1Bioinformatics Center, Wadsworth Center, New York State Department of Health, 150 New Scotland Avenue, Albany, NY 12208, USA and 2Center for Computational Molecular Biology and Division of Applied Mathematics, Brown University, 182 George Street, Providence, RI 02912, USA

Received on June 23, 2005; revised on August 12, 2005; accepted on August 15, 2005
Advance Access publication August 18, 2005

ABSTRACT

Summary: The energy landscape of RNA secondary structures is often complex, and the Boltzmann-weighted ensemble usually contains distinct clusters. Furthermore, the minimum free energy structure often lies outside of the cluster containing the structure determined by comparative sequence analysis. We have developed procedures to characterize and visualize the Boltzmann-weighted ensemble, and have made them available on the Sfold Web server. The new features on the Web server include clustering statistics, ensemble and cluster centroids, multi-dimensional scaling display and energy landscape representation of the Boltzmann-weighted ensemble.
Availability: http://sfold.wadsworth.org; http://www.bioinfo.rpi.edu/applications/sfold
Contact: [email protected]

BACKGROUND

Free energy minimization is a long-established paradigm for the prediction of RNA secondary structures. Algorithms have been developed for computing the optimal (Zuker and Stiegler, 1981) and suboptimal folds (Zuker, 1989). More recently, Ding and Lawrence (2003) have developed an algorithm for drawing a statistical sample from the Boltzmann-weighted ensemble of RNA secondary structures. We have shown that the sampling approach can substantially improve structure prediction through clustering of sampled structures and the identification of centroids of structural clusters (Ding et al., 2005). In order to make the benefits of this novel prediction framework widely available, we have included clustering features in the Srna module of the Sfold server. Here, we describe these new features and their implementation on the Sfold Web server. Features offered by other application modules of Sfold were reported previously (Ding et al., 2004).

INPUT

Structural clustering features and centroid identification are available in the Srna module output for every submitted job on our server. Input sequences can be submitted in raw format, FASTA format or GenBank format. Because these features can be efficiently computed, the limits on sequence length remain unchanged, at 200 bases for an interactive job and 5000 bases for a batch job. The character N, commonly used for an undetermined nucleotide, is now allowed. All Ns are forced to be unpaired in all sampled structures. After submission of an interactive job, the user receives a progress update in the same browser window every 5 s. Users submitting batch jobs will receive email notifications once their jobs are completed.

*To whom correspondence should be addressed.

OUTPUT

The clustering features are available in the output page of the Srna module. Users submitting jobs to other modules can access the clustering features by selecting Srna under the ‘output from other application modules’ heading in their job output. Two new sections have been added in Srna output, as described below.

Ensemble centroid in comparison with the minimum free energy structure

The centroid for a set of structures is defined as the structure with the minimum total base-pair distance to all structures in the set. The ensemble centroid is the centroid for the entire sampled ensemble (Ding et al., 2005). In this section of Srna output, graphical and base-pair distance information is available for comparison between the ensemble centroid structure and the minimum free energy (MFE) structure computed by mfold 3.1 (Zuker, 2003), for the same set of Turner thermodynamic parameters (Xia et al., 1998; Mathews et al., 1999). Structural diagrams of these two ensemble-level representatives are available in PNG, PDF, PostScript and GCG connect formats. The PNG version offers interactive capabilities of zooming and re-centering, which are useful for local structure display. For print-quality figures, the PDF and PostScript versions are recommended. Base pairs marked as green dots in structural diagrams of the MFE structure and the ensemble centroid represent the common base pairs between these two structures. Base pairs in blue are those present only in the MFE structure, and base pairs in red are those present only in the ensemble centroid. The sum of the numbers of base pairs in red and in blue is thus the base-pair distance between the two structures. Circle diagrams generated by the sir_graph software package developed by Stewart and Zuker (Zuker, 2003) are also available to facilitate comparison. In a circle diagram, bases are positioned along a circle, in a clockwise orientation. An arc connecting two bases across the circle indicates pairing between the bases.
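The color coding and the base-pair distance described above can be illustrated with a short Python sketch. It is not the server's implementation: the dot-bracket input notation, the function names and the toy structures are all assumptions made for illustration. The last function searches only among the sampled structures for the one with the smallest total base-pair distance, which is a simplification of the centroid definition (the centroid itself need not be a member of the sample).

```python
def parse_pairs(dot_bracket):
    """Return the set of 1-based base pairs (i, j) in a dot-bracket string.

    Dot-bracket notation is an assumption of this sketch; the server itself
    offers structures in GCG connect format.
    """
    stack, pairs = [], set()
    for pos, char in enumerate(dot_bracket, start=1):
        if char == "(":
            stack.append(pos)
        elif char == ")":
            if not stack:
                raise ValueError(f"unmatched ')' at position {pos}")
            pairs.add((stack.pop(), pos))
        elif char != ".":
            raise ValueError(f"unexpected character {char!r}")
    if stack:
        raise ValueError("unmatched '('")
    return pairs


def compare_structures(mfe, centroid):
    """Classify base pairs the way the structure diagrams color them."""
    mfe_pairs = parse_pairs(mfe)
    centroid_pairs = parse_pairs(centroid)
    blue = mfe_pairs - centroid_pairs   # present only in the MFE structure
    red = centroid_pairs - mfe_pairs    # present only in the ensemble centroid
    return {
        "green": mfe_pairs & centroid_pairs,  # common to both structures
        "blue": blue,
        "red": red,
        "distance": len(blue) + len(red),     # base-pair distance
    }


def closest_sampled_structure(sample):
    """Sampled structure with the smallest total base-pair distance to the sample.

    Simplification: only members of the sample are considered as candidates.
    """
    pair_sets = [parse_pairs(s) for s in sample]

    def total_distance(candidate):
        return sum(len(candidate ^ other) for other in pair_sets)

    best = min(range(len(sample)), key=lambda i: total_distance(pair_sets[i]))
    return sample[best]


if __name__ == "__main__":
    # Toy structures, chosen only to give one pair of each color.
    result = compare_structures("((...))...", "(.(.).)...")
    print(result["distance"])  # one blue pair plus one red pair
```

In this toy comparison the pair (1, 7) is shared (green), (2, 6) occurs only in the first structure (blue) and (3, 5) only in the second (red), so the base-pair distance is 2.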

Cluster representation of the Boltzmann ensemble

The partitioning of the sampled ensemble into various clusters is summarized in table format. Clusters are sorted in descending order of cluster size. The cluster containing the MFE structure is marked using a red asterisk. The size of a cluster is the sampling estimate for the probability of the cluster, i.e. the sample frequency of the cluster. The MFE structure is excluded from the calculation of cluster sizes, and the sizes of all clusters sum to one. A structure diagram of the cluster centroid and a cluster-level two-dimensional histogram (2Dhist) are provided for each cluster. Cluster centroids are the centroids for structures classified into each cluster. A cluster-level 2Dhist displays base-pair frequencies for that cluster. In addition to individual plots, cluster-level 2Dhist plots and circle diagrams of centroids for all clusters are also available in panel format.

Multi-dimensional scaling (MDS) plot of the sampled ensemble and representative structures

MDS is a technique for representing high-dimensional objects in a low-dimensional space, typically two dimensions (Kruskal and Wish, 1977). For RNA secondary structures, base-pair distances are used as an input to MDS. A 2D MDS plot of the sampled ensemble and representative structures is available. Members of the five largest clusters with sizes of at least 0.010 are drawn as small dots in different colors. Members of all other clusters are plotted as small circles. Representative structures, including the MFE structure, ensemble centroid and centroids of all colored clusters, are drawn as large dots in the graph. The units on the axes of the MDS plot only serve the purpose of indicating the relative positions of the objects; they do not have any real meaning. Because of the drastic reduction in dimensionality that it achieves, MDS may not visually preserve the strong separation of distinct clusters in the original space of a higher dimension. For RNA structure applications, MDS works reasonably well for the majority of tested sequences. In cases for which clusters overlap on the MDS plot, users are encouraged to refer to the text output file as described below, before drawing any conclusion.

Energy landscape of the sampled ensemble and representative structures

A 3D energy landscape plot of the sampled ensemble and representative structures (Fig. 1) is constructed by adding a third dimension of free energies of secondary structures to the 2D MDS plane. In order to enhance the 3D visualization effect and to allow users to choose the best angle at which a particular plot can be viewed, this energy landscape plot is presented as an animation rotating around an imaginary vertical line passing through the center of the horizontal plane. Depending on connection speed, the animation will first take a short while to download all the necessary image files and then it will begin automatically when it is ready. The animation rotates clockwise (as viewed from the top), with an inter-frame increment of 30° and at a speed of 1 frame per second. The user has the option to pause the animation at any particular angle, and then to download the corresponding image in PDF or PostScript format. The color scheme and dot patterns of structures remain the same as in the 2D MDS plot. Coordinates of representative structures are included below the plot to help the user locate their exact positions. Although JavaScript is required to display the animation properly, users with browsers not supporting JavaScript or having JavaScript disabled will still be able to view a static page containing links to individual plots at every angle, in PNG, PDF and PostScript formats.

In addition to the graphical updates mentioned above, a new text output file describing results from the clustering procedures, including listings of cluster members and distances within and between clusters, is available in the ‘Text files’ section. With the exception of the energy landscape plot, all new graphical plots and text output are included in the compressed archive for download.

Fig. 1. The energy landscape of the sampled ensemble and representative structures for Zygosaccharomyces bailii RNase P RNA (GenBank accession no. AF186231), with a length of 205 nt. The optimal number of clusters determined by our software is 11. Structures belonging to the five largest clusters are marked as solid dots of five different colors. Members of all other clusters are plotted as small circles under the same category of ‘All other clusters’. The MFE structure and the ensemble centroid are both in the largest cluster (light blue color), with a probability of 0.622. The coordinates for a structure are (axis 1, axis 2, energy), where the horizontal axes are from MDS and the vertical axis is the free energy of a secondary structure. The coordinates in this example are ( 2.19, 4.81, 66.90) for the MFE structure, ( 1.13, 3.15, 63.40) for the ensemble centroid, ( 2.02, 4.21, 65.00) for the centroid of cluster 1, (14.05, 10.13, 62.10) for the centroid of cluster 2, ( 26.06, 7.76, 63.50) for the centroid of cluster 3, (0.18, 17.12, 53.90) for the centroid of cluster 4 and ( 12.06, 30.98, 62.30) for the centroid of cluster 5.

© The Author 2005. Published by Oxford University Press. All rights reserved. For Permissions, please email: [email protected]
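The cluster table and the (axis 1, axis 2, energy) coordinates described in the OUTPUT section can be sketched in Python as follows. This is an illustration under stated assumptions, not the server's code. The text cites Kruskal and Wish (1977) without naming an MDS variant, so classical (Torgerson) MDS is used here as a stand-in and may differ from what the server computes. The dot-bracket input, the function names, the cluster labels and the energy values are likewise hypothetical.

```python
from collections import Counter

import numpy as np


def parse_pairs(dot_bracket):
    """Set of 1-based base pairs (i, j) in a balanced dot-bracket string."""
    stack, pairs = [], set()
    for pos, char in enumerate(dot_bracket, start=1):
        if char == "(":
            stack.append(pos)
        elif char == ")":
            pairs.add((stack.pop(), pos))
    return pairs


def cluster_table(labels):
    """Cluster sizes as sample frequencies, largest cluster first.

    `labels` holds one cluster label per sampled structure; the frequencies
    sum to one by construction.
    """
    total = len(labels)
    return [(label, count / total) for label, count in Counter(labels).most_common()]


def distance_matrix(structures):
    """Pairwise base-pair distances (size of the symmetric difference)."""
    pair_sets = [parse_pairs(s) for s in structures]
    n = len(pair_sets)
    dist = np.zeros((n, n))
    for i in range(n):
        for j in range(i + 1, n):
            dist[i, j] = dist[j, i] = len(pair_sets[i] ^ pair_sets[j])
    return dist


def classical_mds(dist, dims=2):
    """Classical (Torgerson) MDS: an assumed variant, see the lead-in."""
    n = dist.shape[0]
    centering = np.eye(n) - np.ones((n, n)) / n
    gram = -0.5 * centering @ (dist ** 2) @ centering   # double centering
    eigvals, eigvecs = np.linalg.eigh(gram)             # ascending eigenvalues
    top = np.argsort(eigvals)[::-1][:dims]              # largest eigenvalues first
    # Negative eigenvalues (non-Euclidean input) are clipped to zero.
    return eigvecs[:, top] * np.sqrt(np.clip(eigvals[top], 0.0, None))


def landscape_coordinates(structures, energies):
    """Rows of (axis 1, axis 2, energy): MDS plane plus free energy."""
    return np.column_stack([classical_mds(distance_matrix(structures)), energies])


if __name__ == "__main__":
    sample = ["((...))", "((...))", "(.....)", "......."]   # toy sample
    labels = [1, 1, 1, 2]                                   # hypothetical clusters
    energies = [-3.4, -3.4, -1.2, 0.0]                      # made-up values
    print(cluster_table(labels))
    print(landscape_coordinates(sample, energies))
```

As the text notes for the server's plots, the two MDS axes carry no units and only indicate relative positions. Their sign and orientation are arbitrary, so only distances between points in the plane are meaningful.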

ACKNOWLEDGEMENTS

The Computational Molecular Biology and Statistics Core at the Wadsworth Center is acknowledged for providing computing resources for this work. The Computer Systems Support group at the Wadsworth Center is acknowledged for providing hosting space and network connectivity to the Sfold web server at the Wadsworth Center. The long-term development of the Sfold software and the maintenance of the web server are supported by the National Science Foundation grant DMS-0200970 and the National Institutes of Health grant GM068726 to Y.D. This work was also supported, in part, by the National Institutes of Health grant HG01257 to C.E.L. The server at Rensselaer Polytechnic Institute (RPI) runs on a Linux cluster acquired through an IBM SUR grant awarded to RPI/Wadsworth Bioinformatics Center (Zuker, 2003).

Conflicts of Interest: none declared.

REFERENCES

Ding,Y. and Lawrence,C.E. (2003) A statistical sampling algorithm for RNA secondary structure prediction. Nucleic Acids Res., 31, 7280–7301.
Ding,Y. et al. (2004) Sfold web server for statistical folding and rational design of nucleic acids. Nucleic Acids Res., 32 (Web Server issue), W135–W141.
Ding,Y. et al. (2005) RNA secondary structure prediction by centroids in a Boltzmann weighted ensemble. RNA, 11, 1157–1166.
Kruskal,J.B. and Wish,M. (1977) Multidimensional Scaling. Sage Publications, Beverly Hills, CA.
Mathews,D.H. et al. (1999) Expanded sequence dependence of thermodynamic parameters improves prediction of RNA secondary structure. J. Mol. Biol., 288, 911–940.
Xia,T. et al. (1998) Thermodynamic parameters for an expanded nearest-neighbor model for formation of RNA duplexes with Watson–Crick base pairs. Biochemistry, 37, 14719–14735.
Zuker,M. (1989) On finding all suboptimal foldings of an RNA molecule. Science, 244, 48–52.
Zuker,M. (2003) Mfold web server for nucleic acid folding and hybridization prediction. Nucleic Acids Res., 31, 3406–3415.
Zuker,M. and Stiegler,P. (1981) Optimal computer folding of large RNA sequences using thermodynamics and auxiliary information. Nucleic Acids Res., 9, 133–148.