BIOINFORMATICS RNA Movies: visualizing RNA secondary structure spaces
Abstract
Motivation: RNA Movies is a system for the visualization of RNA secondary structure spaces. Its input is a script consisting of primary and secondary structure information. From this script, the system fully automatically generates animated graphical structure representations. In this way, it creates the impression of an RNA molecule exploring its own two-dimensional structure space.
Results: RNA Movies has been used to generate animations of a switching structure in the spliced leader RNA of Leptomonas collosoma and sequential foldings of potato spindle tuber viroid transcripts.
Availability: Demonstrations of the animations mentioned in this paper can be viewed on our Bioinformatics web server under the following address: http://BiBiServ.TechFak.UniBielefeld.DE/rnamovies/. The RNA Movies software is available upon request from the authors.
Contact:
[email protected]
Introduction: the need for visualization of RNA structure spaces
Finding the secondary structure of an RNA molecule is an important step towards understanding its function in many cases, as the two-dimensional (2D) folding constrains the set of possible tertiary structures (Tinoco et al., 1987). Therefore, research in secondary structure prediction is of ongoing interest (Nussinov et al., 1978; Zuker and Stiegler, 1981; Lefebvre, 1995).
While the computed minimum free energy structure (mfe-structure) can certainly often provide clues as to what the correct structure might be, it is not the answer to all questions and must generally be used with caution:
Mathematically, the mfe-structure is weakly defined. It is drawn from a space of energetically similar structures, which nevertheless may be totally different morphologically (Zuker, 1989; Schuster, 1995). Consequently, a slight change in the energy parameters may affect the result in an unpredictable way. It must be kept in mind that some parameters are interpolated or crudely approximated to achieve computational tractability (Freier et al., 1986).
The model assumes the RNA molecule to be in thermodynamic equilibrium. This is often unrealistic, and known to be false in some cases.
In the case of conformational switches, there are two alternative structures that are functionally relevant. It makes no sense to seek ‘the’ optimal structure.
In studies of molecular evolution, the global relationships between sequence and structure are more relevant than the distinguished mfe-structures.
For these reasons, the problem of parametric RNA folding is a challenging item in the list of open problems in computational biology (Pevzner and Waterman, 1995). Akin to parametric (pairwise) sequence alignment (Gusfield et al., 1992), one would like to calculate a parameterized representation of the structure space of an RNA sequence. From this, structures can be extracted for specific parameter values, and their stability under parameter changes evaluated. These parameters can be temperature, ion concentration, energy contributions of particular structure elements, or more abstract criteria like the presence or absence of certain substructures.
While the general problem has yet to be solved, a few advances have been made. The RNA folding programs have been modified (Mfold and RNAfold, in slightly different ways) to enumerate suboptimal structures (Zuker, 1989; Wuchty et al., 1999). The equilibrium partition function of McCaskill (1990) evaluates base pair probabilities, and allows the study of structures composed of highly probable substructures in various ways. All the tools mentioned give various kinds of static graphics to aid in the interpretation of their results. The tool collection reported in Shapiro and Kasprzak (1996) integrates additional software for visualization. According to the estimates given in Waterman (1989), the number of (suboptimal) structures grows exponentially with sequence length. Any careful study of the structure space will yield a large number of alternative structures, which must be evaluated.
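The growth in the number of structures can be illustrated with a standard counting recurrence. The sketch below is not the estimate of Waterman (1989) itself; it assumes, for simplicity, that any two residues may pair and that a hairpin loop contains at least one unpaired residue, so it counts structure shapes independently of the actual sequence. The function name `count_structures` is ours.

```python
from functools import lru_cache


@lru_cache(maxsize=None)
def count_structures(n: int) -> int:
    """Count secondary structures on n residues, ignoring the sequence.

    Assumption of this sketch: every pair of residues may bond, and a
    hairpin loop encloses at least one unpaired residue.
    """
    if n <= 2:
        return 1  # only the completely unpaired structure exists
    # Case 1: the last residue is unpaired.
    total = count_structures(n - 1)
    # Case 2: the last residue pairs with residue k (1-based, k <= n - 2),
    # splitting the rest into k - 1 residues before the pair and
    # n - 1 - k residues enclosed by it.
    for k in range(1, n - 1):
        total += count_structures(k - 1) * count_structures(n - 1 - k)
    return total


for length in (5, 10, 20, 40):
    print(length, count_structures(length))
```

Even under these crude assumptions the counts grow rapidly with length, which is the practical reason why structures in such a space cannot be inspected one static plot at a time.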
This is where new visualization techniques are required. Known representations are static and suited to visualizing secondary structures individually. Structure animation displays structures together with a (graphically) smooth transition between them. The virtue of structure animation is that changing parts of the structure can be observed moving, while the rest remains stable. In an otherwise unwieldy mass of data, this directs our attention to those structural features that deserve closer scrutiny.
Oxford University Press
The pipeline structure utilized by RNA Movies
Structure space and script generation
Some terminology is required to explain the overall workings of RNA Movies. (This terminology is abstract in the sense that it does not speak about actual representations of structures.) Let χ denote the set of all RNA sequences, which are strings over the alphabet {A, C, G, U}. For x ∈ χ, S(x) denotes the set of all secondary structures compatible with x, i.e. base pairs indicated by s ∈ S(x) correspond to residues in x that actually match according to the generally accepted set of pairing rules. mfe(x) ∈ S(x) denotes an mfe-structure of x.
For X ⊆ χ, a structure space SX is an X-indexed family of structure sets such that Ø ≠ Sx ⊆ S(x) for each x ∈ X. A subspace S′X′ of SX has X′ ⊆ X and S′x ⊆ Sx for each x ∈ X′. A structure space may contain many structures for the same sequence (cf. Applications: Conformational switching), just one structure for each of many sequences (cf. Applications: Sequential folding), as well as a combination of both.
A script for a structure space SX is a list of sequence–structure pairs [(x1, s1), (x2, s2), …] such that s ∈ Sx implies (x, s) = (xi, si) for some i. Note that a script must enumerate the complete structure space, and may contain repetitions. An aligned script is obtained when for each entry (xi+1, si+1) a partial mapping πi from the residues of xi to those of xi+1 is specified. In the animation, the alignment determines that residue j of xi moves onto residue πi(j) of xi+1. A special case of an aligned script is an anchored script: here, a residue pair (m, n) is specified, indicating the alignment πi(m + j) = n + j for all suitable i and j. In the current implementation, all scripts are implicitly anchored at (1,1).
A (structure) space generator is a program that generates some SX from X; a (structure space) filter is a program that maps SX onto a subset of itself.
Finally, a script generator is a program that generates an aligned script for a structure space. The application examples given in the Applications section map onto this formal structure as shown in Figure 1. The separation into space generators, filters and script generators is motivated by our wish to encourage modularity and reuse of existing components. A generator will typically use an RNA folding routine, which can only be argued into creating a larger space than we actually want to see. Filters are responsible for restricting the generated space. The script generator takes responsibility for arranging the structure space in some meaningful order, such that the resulting movie can be given an intuitive interpretation.
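The terminology above can be made concrete with a small sketch. It is only an illustration of the definitions, not part of RNA Movies: structures are assumed to be written in dot-bracket notation, the pairing rules are assumed to be the Watson-Crick pairs plus the G-U wobble pair, and all names (`base_pairs`, `is_compatible`, `apply_filter`, `make_script`) are hypothetical.

```python
from typing import Callable, Dict, List, Set, Tuple

# Assumed pairing rules: Watson-Crick pairs plus the G-U wobble pair.
PAIRS = {("A", "U"), ("U", "A"), ("C", "G"), ("G", "C"), ("G", "U"), ("U", "G")}

Structure = str                     # dot-bracket string, e.g. "((...))"
Space = Dict[str, Set[Structure]]   # S_X: maps each x in X to a non-empty S_x
Script = List[Tuple[str, Structure]]


def base_pairs(structure: Structure) -> List[Tuple[int, int]]:
    """Return the 0-based (i, j) pairs encoded by a dot-bracket string."""
    stack: List[int] = []
    pairs: List[Tuple[int, int]] = []
    for j, symbol in enumerate(structure):
        if symbol == "(":
            stack.append(j)
        elif symbol == ")":
            if not stack:
                raise ValueError("unbalanced ')' at position %d" % j)
            pairs.append((stack.pop(), j))
        elif symbol != ".":
            raise ValueError("unexpected symbol %r" % symbol)
    if stack:
        raise ValueError("unbalanced '('")
    return pairs


def is_compatible(x: str, s: Structure) -> bool:
    """True if s is in S(x): equal length and every pair obeys PAIRS."""
    if len(x) != len(s):
        return False
    return all((x[i], x[j]) in PAIRS for i, j in base_pairs(s))


def apply_filter(space: Space, keep: Callable[[str, Structure], bool]) -> Space:
    """A filter: map S_X onto a subspace of itself."""
    filtered = {x: {s for s in ss if keep(x, s)} for x, ss in space.items()}
    # Sequences whose structure set became empty leave the index set X'.
    return {x: ss for x, ss in filtered.items() if ss}


def make_script(space: Space) -> Script:
    """A trivial script generator: list the whole space in sorted order."""
    return [(x, s) for x in sorted(space) for s in sorted(space[x])]


space = {"GGGAAACCC": {"(((...)))", ".((...)).", "........."}}
paired_only = apply_filter(space, lambda x, s: "(" in s)
for entry in make_script(paired_only):
    print(entry)
```

Because every script is implicitly anchored at (1,1), no explicit mapping πi is stored here: residue j of one entry is simply matched with residue j of the next. A real script generator would also choose a meaningful order rather than the sorted order used in this sketch.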
Fig. 1. Mapping of the application examples onto the formal pipeline structure.
The present paper is devoted to a presentation of the visualization component. Script generators are only touched upon in explaining the applications.
Script visualization
The visualization component was designed to follow a parsimony principle: the visualization shall not create any suggestive impression other than that intended by the script generator. Hence, there are no implicit assumptions about relationships between the structures in the script, and RNA Movies only provides the animation. In particular [and in contrast to the visualization reported in Shapiro and Kasprzak (1996)], it does not convey any information about the algorithm used by the script generator.
The display module provides the well-established metaphors of multimedia presentation, displaying the secondary structure animation in a movieplayer, as shown in Figure 2. It uses play, forward, backward and pause buttons to guide the animation process, and displays the progress in a scrollbar that can also be used to move directly to a certain part of the script. The display of nucleotide abbreviations, nucleotide numbering and base pair type can be toggled in an options menu. Zooming, placement and rotation values for the viewport are automatically chosen to display the largest structure contained in the script. They can be changed interactively at any time, including while the animation is running. A status bar at the bottom of the animation window gives feedback, for instance on error conditions encountered during script parsing. The inner workings of the visualization component are described in the implementation section.
Applications
Conformational switching
RNA conformational switching plays a fundamental role in protein synthesis, mRNA splicing, translational regulation and probably in other biological processes as well. The most prominent example to date is translational attenuation where
D.Evers and R.Giegerich
Fig. 3. Minimal energy pathway of the conformational switch in the spliced leader RNA of Leptomonas collosoma. Animation (top left to bottom right) of the intermediate structures [(p, s3), (p, s4), (p, s5)] with an interpolation rate of five.
Fig. 2. RNA Movies: movieplayer.
terminator and antiterminator structures regulate translation in Escherichia coli and Bacillus subtilis (Fayat et al., 1983). The spliced leader RNA of Leptomonas collosoma exhibits two competing structures of similar free energy (LeCuyer and Crothers, 1993). These forms can switch by binding complementary oligonucleotides at sites not involved in secondary structure formation, thus favouring one conformation over the other. The activation energy needed to perform the conformational change is much lower than the energy associated with a complete dissociation of all helices involved. Therefore, an alternative kinetic pathway not including the fully melted intermediate must occur (LeCuyer and Crothers, 1994). In search of a general computational method to characterize conformational switches (Giegerich et al., 1999), Rehmsmeier (1996) developed the notion of an energy barrier distance, i.e. a metric on structures related to the minimal energy required to switch from one structure to the other. A heuristic was developed that searches for the precise sequence of intermediate structures participating in the switching process. Let p be the primary RNA sequence of L. collosoma. The script describing the switching between structures sb and se takes the form [(p, sb = s1), (p, s2), …, (p, sn = se )], where the intermediate structures s2, …, sn – 1 are calculated by the above heuristic. Naturally, the script is anchored at residues (1,1). This list is an abstract representation of a minimal energy pathway of a conformational switch between two competing
secondary structures. As such, it is not easily interpreted. Used as a script for RNA Movies, the dynamics of structural change become obvious. The script has a length of 27; animated with five interpolated frames per structural change, this gives a total of 157 frames. Figure 3 attempts to give an impression, but, of course, the visual effect lies in the structure actually moving, and cannot be conveyed on paper.
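The frame count quoted above is consistent with one key frame per script entry plus a fixed number of interpolated frames for each transition between neighbouring entries. This reading is inferred from the numbers in the text rather than stated as a formula; the function name `total_frames` is ours.

```python
def total_frames(script_length: int, interpolations: int) -> int:
    """Key frames plus interpolated frames between consecutive key frames."""
    if script_length < 1 or interpolations < 0:
        raise ValueError("need at least one script entry and a rate >= 0")
    transitions = script_length - 1
    return script_length + transitions * interpolations


print(total_frames(27, 5))   # script of length 27, five interpolated frames
print(total_frames(27, 10))  # same script at an interpolation rate of 10
```

The same formula also reproduces the 287 frames reported in the Efficiency section for an interpolation rate of 10.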
Sequential folding
Standard secondary structure prediction by thermodynamic parameters alone implies that RNA molecule intra-action adheres to its equilibrium partition function. This is not sufficient in all cases, however, as the presence of metastable structures in RNA molecules has shown (Nussinov and Tinoco, 1981). One explanation is that the kinetic element of structure formation is an important factor during the transcription phase of a molecule, leading to thermodynamically suboptimal secondary structures. This process is known as sequential folding. Kinetic RNA structure prediction was introduced by Martinez and Mironov using Monte Carlo-based algorithms (Martinez, 1984; Mironov et al., 1985). It has been shown that intermediates of the potato spindle tuber viroid (PSTVd) fold into metastable configurations during replication (Qu et al., 1993). A sequential folding algorithm based on ‘Simulated Annealing’ was used to produce data for the PSTVd, taking into account thermodynamic and kinetic parameters as well as RNA polymerase elongation rates (Schmitz and Steger, 1996).
Fig. 4. A sequential folding of a partial PSTVd transcript [(pi , si )|i ∈ [15, …, 50]] in steps of five. The last six plots are of identical length, being variants in the search space of the simulated annealing procedure.
Given the primary RNA sequence p, we obtain the script [(pi , si )|1 ≤ i ≤ |p|], where pi is the ith prefix of p (i.e. the sequence containing the first i residues of p), and si is the structure calculated for it by simulated annealing. The script is anchored at residues (1,1). The RNA movie generated from these data clearly visualizes those stages where metastable substructures are formed and resolved during transcription. Figure 4 gives a few snapshots.
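The prefix script described above can be sketched as follows. The folding routine is passed in as a parameter because the simulated annealing procedure is not reproduced here; `unfolded` is a hypothetical placeholder that leaves every residue unpaired, and the dot-bracket representation of structures is an assumption of the sketch.

```python
from typing import Callable, List, Tuple


def sequential_script(p: str, fold: Callable[[str], str]) -> List[Tuple[str, str]]:
    """Build [(p_i, s_i) | 1 <= i <= |p|], where p_i is the i-th prefix of p."""
    return [(p[:i], fold(p[:i])) for i in range(1, len(p) + 1)]


def unfolded(prefix: str) -> str:
    """Placeholder folding routine: every residue stays unpaired."""
    return "." * len(prefix)


for entry in sequential_script("GGAC", unfolded):
    print(entry)
```

Since each prefix extends the previous one at its 3' end, anchoring at (1,1) keeps the residues that are already transcribed aligned from one entry to the next.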
Structure space enumeration
In our current research, we are developing a technique for the complete analysis of all potential foldings of an RNA molecule, under varying canonicity restrictions. This leads to structure enumerators, whose voluminous output can be viewed most conveniently in animated form.
Implementation
Algorithms and formats
The Vienna RNAfold and Zuker CT formats are widely used in RNA folding, and are interpreted by RNA Movies. The parsing phase of our algorithm translates these formats into an internal representation.
An animation needs controlling points at key frames to guide the animation process on the time scale. In our case the positions of the nucleotides in 2D space serve this function. To generate these coordinates, we apply the naview algorithm of Bruccoleri (Bruccoleri and Heinrich, 1988). This part can easily be substituted by other algorithms with different properties, for instance with Shapiro’s radial graph representation (Shapiro et al., 1984), but note that not all static structure representations are equally well suited for animation. For example, in the mountain structures of Hogeweg (Hogeweg and Hesper, 1984), local changes have a global effect on the layout, creating the impression of change where actually nothing happens.
Having generated sets of coordinates representing successive instances of the molecule configuration (the key frames), we interpolate the corresponding points in the neighbouring sets to give a smooth transformation of intermediate frames. This obviously only works for anchored scripts where the mapping of the residues of all structures is implicit (see the previous subsection ‘Structure space and script generation’). As the point pairs are guaranteed to be in close proximity by the display algorithm, a linear interpolation of the points has proved sufficient for our examples. (Note that minimal energy paths between dissimilar sets can be generated by the algorithm used for conformational switching.)
We interpret the 2D coordinates of the residues in every frame as control points of a B-spline forming the backbone of the molecule. The frames are animated, annotating nucleotide type, base pair type and nucleotide numbering if desired by the user.
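The linear interpolation between neighbouring key frames can be sketched as below. This is a minimal illustration, not the RNA Movies source: a frame is assumed to be a list of 2D points in residue order, residue j of one key frame is matched with residue j of the next (anchoring at (1,1)), and residues present in only one of the two frames are simply left out of the intermediate frames. The last point is an assumption of the sketch; the text does not say how such residues are drawn.

```python
from typing import List, Tuple

Point = Tuple[float, float]
Frame = List[Point]  # one 2D position per residue, in sequence order


def interpolate(a: Frame, b: Frame, steps: int) -> List[Frame]:
    """Return `steps` frames strictly between key frames a and b."""
    frames: List[Frame] = []
    for k in range(1, steps + 1):
        t = k / (steps + 1)  # fraction of the way from a to b
        # zip() stops at the shorter frame, so only residues present in
        # both key frames are interpolated.
        frames.append([
            (ax + t * (bx - ax), ay + t * (by - ay))
            for (ax, ay), (bx, by) in zip(a, b)
        ])
    return frames


key_a = [(0.0, 0.0), (2.0, 0.0)]
key_b = [(0.0, 4.0), (2.0, 4.0)]
for frame in interpolate(key_a, key_b, 3):
    print(frame)
```

Each residue moves along a straight line between its two key positions, which looks smooth only if corresponding points are close together in neighbouring key frames, as noted in the text.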
The continuity properties of the B-spline give the impression of an elastic band moving through different configurations, leading to a smooth, continuous and pleasing animation of conformational changes in the secondary structure of the RNA molecule. As B-spline interpolation is more expensive than linear interpolation in terms of computation time, this simpler technique is also supplied. The slightly more aesthetic spline view should be chosen when producing visualizations for publication.
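To make the backbone construction concrete, the following sketch samples a B-spline whose control points are the residue coordinates of one frame. The text does not specify the spline's degree or knot vector; a uniform cubic B-spline is assumed here, and the end points are repeated so that the curve starts and ends at the terminal residues. Both choices, and the names `bspline_point` and `backbone`, belong to this illustration only.

```python
from typing import List, Tuple

Point = Tuple[float, float]


def bspline_point(p0: Point, p1: Point, p2: Point, p3: Point, t: float) -> Point:
    """Evaluate one uniform cubic B-spline segment at t in [0, 1]."""
    b0 = (1 - t) ** 3 / 6
    b1 = (3 * t ** 3 - 6 * t ** 2 + 4) / 6
    b2 = (-3 * t ** 3 + 3 * t ** 2 + 3 * t + 1) / 6
    b3 = t ** 3 / 6
    return (
        b0 * p0[0] + b1 * p1[0] + b2 * p2[0] + b3 * p3[0],
        b0 * p0[1] + b1 * p1[1] + b2 * p2[1] + b3 * p3[1],
    )


def backbone(points: List[Point], samples_per_segment: int = 8) -> List[Point]:
    """Sample a smooth backbone curve controlled by the residue positions."""
    if len(points) < 4:
        return list(points)  # too few control points for a cubic segment
    # Repeating each end point twice (three copies in total) makes the
    # curve begin and end exactly at the terminal residues.
    ctrl = [points[0]] * 2 + list(points) + [points[-1]] * 2
    curve: List[Point] = []
    for i in range(len(ctrl) - 3):
        for k in range(samples_per_segment):
            curve.append(bspline_point(*ctrl[i:i + 4], k / samples_per_segment))
    curve.append(points[-1])  # close the final segment at t = 1
    return curve


residues = [(0.0, 0.0), (1.0, 0.0), (2.0, 0.0), (3.0, 0.0)]
print(len(backbone(residues)))
```

The curve generally passes near rather than through the interior control points, which is what produces the elastic-band impression; the cost is several basis evaluations per sampled point instead of the single straight segment per residue used by the simpler linear view.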
Efficiency
The efficiency of the display module mainly depends on the number of frames it is able to display per second. The different parts of the animation algorithm have been separated into different phases of execution. While parsing the input, all 2D coordinates are generated on the fly; depending on the display algorithm, the time efficiency is typically O(p), where p is the number of base pairs in the structure. The animation itself has linear efficiency in time based on the fact that B-spline generation depends on the number of controlling points. The memory allocation is completely dynamic, such that sequences of arbitrary length can be displayed, given enough memory.
For speed, efficiency and portability, RNA Movies was developed under UNIX using the ANSI C programming language. The major libraries used were OpenGL/Mesa and XForms (Frazier, 1997; Paul, 1997; Zhao and Overmars, 1997). The program is currently being tested using the operating systems Sun Solaris 2.4 on a Sparcstation 4, and SGI Irix 5.3 and 6.2 on Indigo2 Extreme and Maximum Impact systems. The animation of the conformational switch with an interpolation rate of 10 frames per structural change takes 44 s to display 287 frames on a Sparcstation 4. Hardware accelerated systems like the SGI Indigo2 series at least double the speed of the animation, depending on the graphics subsystem.
Discussion
A large-scale evaluation of RNA Movies has not been performed yet. Researchers used to labouring through piles of printouts of closely related structures, trying to locate substructures of interest, generally react very positively to the animated visualization. Owing to the parsimony principle, an RNA movie gives you exactly what you put in: if structures adjacent in the script are closely related, the movie will show smooth transitions (note that this solely pertains to the structure information; the primary sequence information hardly influences the animation). If the script juxtaposes quite distinct structures, the movie shows aberrant behaviour. It has been suggested to smooth the transitions in such a case. This could in fact be done using the minimal energy transitions described in the previous section ‘Conformational switching’. However, we are hesitant to include such a feature, as it violates the parsimony principle.
RNA Movies has just become operational, and we hope to evaluate its merits with further applications. Aside from the software, we offer support to researchers in need of the kind of animation that RNA Movies provides.
In our future work, we hope to include some moderate extensions to the visualization component. The basic functionality needs to be enhanced; arbitrary alignments of scripts must be implemented. More convenience must be added: RNA Movies should allow the creation of snapshots for paper publications and (much more appropriate) animated GIFs for electronic documents. Depending on public demand, we would consider a re-implementation of the display module in Java, to achieve full WWW operationality. In the long run, and aside from pure visualization, we also want to construct reusable modules for script generation. This involves research into flexible algorithms for structure space enumeration and structure alignment.
Acknowledgements
The authors would like to thank Gerhard Steger and his colleagues at the Institute for Physical Biology, Heinrich-Heine University, Düsseldorf, for many stimulating discussions on structure formation, and Marc Rachen for his data on PSTVd. Marc Rehmsmeier of the DKFZ-Heidelberg provided us with the minimal energy paths for L. collosoma. The work presented in this paper was partially supported by the German Ministry for Education and Sciences (BMBF), the Ministry of Science of North Rhine Westphalia (MWFNRW) and the German Research Council Graduate Programme on structure formation (DFG-GK ‘Strukturbildungsprozesse’).
References
Bruccoleri,R.E. and Heinrich,G. (1988) An improved algorithm for nucleic acid secondary structure display. Comput. Applic. Biosci., 4, 167–173.
Fayat,G., Mayaux,J.-F., Sacerdot,C., Fromant,M., Springer,M., Grunberg-Manago,M. and Blanquet,S. (1983) Escherichia coli phenylalanyl-tRNA synthetase operon region. Evidence for an attenuation mechanism. Identification of the gene for the ribosomal protein L20. J. Mol. Biol., 171, 239–261.
Freier,S.M., Kierzek,R., Jaeger,J.A., Sugimoto,N., Caruthers,M.H., Neilson,T. and Turner,D.H. (1986) Improved free-energy parameters for predictions of RNA duplex stability. Proc. Natl Acad. Sci. USA, 83, 9373–9377.
Giegerich,R., Haase,D. and Rehmsmeier,M. (1999) Prediction and visualization of structural switches in RNA. In Proceedings of the Pacific Symposium on Biocomputing, Vol. 4.
Gusfield,D., Balasubramaniam,K. and Naor,D. (1992) Parametric optimization of sequence alignment. In Proceedings of the Third Annual ACM-SIAM Symposium on Discrete Algorithms. ACM, pp. 432–439.
Hogeweg,P. and Hesper,B. (1984) Energy directed folding of RNA sequences. Nucleic Acids Res., 12, 67–74.
LeCuyer,K.A. and Crothers,D.M. (1993) The Leptomonas collosoma spliced leader RNA can switch between two alternate structural forms. Biochemistry, 32, 5301–5311.
LeCuyer,K.A. and Crothers,D.M. (1994) Kinetics of an RNA conformational switch. Proc. Natl Acad. Sci. USA, 91, 3373–3377.
Lefebvre,R. (1995) An optimized algorithm well suited to RNA folding. In Proceedings of the Third International Conference on Intelligent Systems for Molecular Biology. AAAI Press, pp. 222–230.
Martinez,H.M. (1984) An RNA folding rule. Nucleic Acids Res., 12, 323–334.
McCaskill,J.S.M. (1990) The equilibrium partition function and base pair binding probabilities for RNA secondary structure. Biopolymers, 29, 1105–1119.
Mironov,A., Dyakonova,L.R. and Kister,A.E. (1985) A kinetic approach to the prediction of RNA secondary structures. J. Biomol. Struct. Dynam., 2, 953–962.
Nussinov,R. and Tinoco,I.,Jr (1981) Sequential folding of a messenger RNA molecule. J. Mol. Biol., 151, 519–533.
Nussinov,R., Pieczenik,G., Griggs,J.R. and Kleitman,D.J. (1978) Algorithms for loop matchings. SIAM J. Appl. Math., 35, 68–82.
OpenGL Architecture Review Board, Frazier,C. (Ed.) and Kempt,R. (1997) OpenGL Reference Manual: The Official Reference Document to OpenGL, Version 1.1. Addison-Wesley, Reading, MA.
Paul,B. (1997) The Mesa 3-D graphics library. http://www.ssec.wisc.edu/brianp/Mesa.html.
Pevzner,P. and Waterman,M. (1995) Open combinatorial problems in computational molecular biology. In Proceedings of the Third Israel Symposium on Theory of Computing and Systems. IEEE Computer Society Press, Tel Aviv, Israel, pp. 158–173.
Qu,E., Heinrich,C., Loss,P., Steger,G., Tien,R. and Riesner,D. (1993) Multiple pathways of reversion in viroids for conservation of structural elements. EMBO J., 12, 2129–2139.
Rehmsmeier,M. (1996) Klassifikation von RNA-Sequenzen durch Analyse ihres Strukturraums. Master’s Thesis, Faculty of Technology, Bielefeld University.
Schmitz,M. and Steger,G. (1996) Description of RNA folding by ‘Simulated Annealing’. J. Mol. Biol., 255, 254–266.
Schuster,P. (1995) How to search for RNA structures, Theoretical concepts in evolutionary biotechnology. J. Biotechnol., 41, 239–257.
Shapiro,B. and Kasprzak,W. (1996) STRUCTURELAB: A heterogeneous bioinformatics system for RNA structure analysis. J. Mol. Graph., 14, 194–205.
Shapiro,B., Maizel,J., Lipkin,L.E., Currey,K. and Whitney,C. (1984) Generating non-overlapping displays of nucleic acid secondary structure. Nucleic Acids Res., 12, 75–99.
Tinoco,I.,Jr, Davis,P.W., Hardin,C.C., Puglisi,J.D., Walker,G.T. and Wyatt,J. (1987) RNA structure from A to Z. Cold Spring Harbor Symp. Quant. Biol., LII, 135–146.
Waterman,M.S. (ed.) (1989) Mathematical Methods for DNA Sequences. CRC Press, Boca Raton, FL.
Wuchty,S., Fontana,W., Hofacker,I.L. and Schuster,P. (1999) Complete suboptimal folding of RNA and the stability of secondary structures. Biopolymers, 49.
Zhao,T.C. and Overmars,M. (1997) XForms Home Page. http://bragg.phys.uwm.edu/xforms.
Zuker,M. (1989) On finding all suboptimal foldings of an RNA molecule. Science, 244, 48–52.
Zuker,M. and Stiegler,R. (1981) Optimal computer folding of large RNA sequences using thermodynamics and auxiliary information. Nucleic Acids Res., 9, 133–148.