/, ./ !'%2
BIOINFORMATICS Stem Trace: an interactive visual tool for comparative RNA structure analysis /*#)%#( !2018!+ !.$ 14#% (!0)1/ .31!-41!,
%2%!1#( 400/13 1/'1!- 1%$%1)#+ !3)/.!, !.#%1 .23)343% ,$' - !.$ -!'% 1/#%22).' %#3)/. !"/1!3/17 /& 60%1)-%.3!, !.$ /-043!3)/.!, )/,/'7 )5)2)/. /& !2)# #)%.#%2 !3)/.!, !.#%1 .23)343% 1%$%1)#+ !.#%1 %2%!1#( !.$ %5%,/0-%.3 %.3%1 !3)/.!, .23)343%2 /& %!,3( ,$' - 1%$%1)#+
Abstract Motivation: Stem Trace is one of the latest tools available in STRUCTURELAB, an RNA structure analysis computer workbench. The paradigm used in STRUCTURELAB views RNA structure determination as a problem of dealing with a database of a large number of computationally generated structures. Stem Trace provides the capability to analyze this data set in a novel, visually driven, interactive and exploratory way. In addition to providing graphs at a high level of abstraction, it is also connected with complementary visualization tools which provide orthogonal views of the same data, as well as drawing of structures represented by a stem trace. Thus, on top of being an analysis tool, Stem Trace is a graphical user interface to an RNA structural information database. Results: We illustrate Stem Trace’s capabilities with several examples of the analysis of RNA folding data performed on 24 strains of HIV-1, HIV-2 and SIV sequences around the HIV dimerization region. This dimer linkage site has been found to play a role in encapsidation, reverse transcription, recombination, and inhibition of translation. Our examples show how Stem Trace elucidates preservation of structures in this region across the various strains of HIV. Availability: The program can be made available upon request. It runs on SUN, SGI and DEC (Compaq) Unix workstations. Contact:
[email protected] or
[email protected] Introduction RNA structure prediction Because of the very fragile nature of RNA, experimental methods for determining its secondary and tertiary structures have not kept pace with the rapid sequencing of RNA and discovery of its function. Therefore, the development of computational prediction methodologies for secondary and tertiary structures has become very important. It has to be realized that secondary structure prediction can be computational-
16
ly expensive. RNA sequences such as HIV can be >9000 nucleotides long and, in general, it is accepted that there are 1.8 n possible structures for a sequence of size n (Turner et al., 1988). Computational spaces of such enormous sizes require both excellent prediction as well as analysis tools. Two types of RNA folding algorithms are utilized by the STRUCTURELAB workbench described below (Shapiro and Kasprzak, 1996). They are a dynamic programming algorithm (DPA) (Zuker and Stiegler, 1981; Zuker, 1989, 1996) and a genetic algorithm (GA) (Shapiro and Navetta, 1994; Shapiro and Wu, 1996, 1997). They provide the RNA structure data and are briefly described here so that the description of the analysis of their results with the stem trace will be easier to understand. The two algorithms rely on minimization of the free energy of a structure as the guiding principle. DPAs, such as MFOLD, are recursive programs, arriving at optimal foldings through an exploration of subproblems. They have two distinct phases. The first computes and stores minimum folding energies for subproblems based on minimum folding energies of small fragments in a solution matrix. In the second phase, the program assembles an entire structure via traceback of the matrix of folding energies. In general, recursive algorithms are designed to yield a single, optimal solution, but with modifications to the traceback phase, the algorithms are capable of sampling the suboptimal solution space and providing multiple alternative, suboptimal structures. The other secondary structure prediction algorithm employed by our computer workbench is a GA. It is a non-deterministic algorithm that is currently implemented on a massively parallel MasPar MP-2 supercomputer, a 16 384 processor SIMD (single instruction multiple data) machine. It is also being adapted to a Cray/SGI T3E and ORIGIN 2000 MIMD (multiple instructions multiple data) machines. The GA iterates over an evolution-like three-step procedure: selection, mutation and cross-over on all 16K processors in parallel. Another approach to RNA structure prediction is phylogenetic analysis. Phylogenetic analysis is based on the assump-
An interactive visual tool for comparative RNA structure analysis
tion that RNAs of similar function share common structural elements. It compares sequences (or primary structures) and looks for the potential to form similar secondary structures. Sequence alignments are employed to study homology. They usually contain gaps in places where sequences differ due to base deletions and insertions. Secondary structures deduced from the sequence alignment must show at least one compensatory base pair mutation per stem (helix) in order to be confirmed by this method. Stem Trace facilitates analysis of the free energy minimization computation results, embodied in the GA and DPA, along with the guiding principles of phylogenetic analysis, thus combining the two major approaches to RNA secondary structure prediction.
STRUCTURELAB Motivated by the importance of the RNA molecule in biology, and with the ever expanding field of bioinformatics in mind, an RNA structure computer analysis workbench, STRUCTURELAB, has been developed. The workbench paradigm approaches the structure determination problem as one of dealing with a database of potentially thousands of computationally generated structures and provides the capability to analyze this data set from different perspectives. Structures may be studied by analyzing a portion of the solution space, or at the level of an individual substructure motif. The development process has involved the incorporation of some already existing algorithms as well as new methodologies to facilitate both the prediction and analysis of RNA secondary and tertiary structure. Consequently, databases which may contain thousands of structures are first generated by the RNA folding algorithms and then analyzed with the help of other STRUCTURELAB tools. One of the most recently developed analysis tools which has proven to be quite useful is Stem Trace, described in this paper. It is worth mentioning briefly that the system utilizes various software modules and hardware complexes available both on our LAN, at the Frederick Cancer Research and Development Center, Frederick, MD, and elsewhere on the Internet through computer networking. Such an approach makes it relatively easy to incorporate already existing programs and to add newly developed algorithms and tools. It also permits the best match of these algorithms with the most appropriate hardware. Thus, the most computationally costly modules of the system, the RNA folding algorithms, are run on the fastest machines available, such as a Cray or a MasPar, while the interactive visualization and analysis tools run on the user’s workstation, minimizing the network delay. In addition to the need for new tools for visualization, manipulation, and interactive analysis of the available data and data processing results, the need for a uniform and user-friendly interface to all the dispersed software tools packages has
motivated the workbench development. Figure 1 shows a snapshot of several tools available within the workbench. The system has been written in Common Lisp which provided all the flexibility and power needed (Steele, 1990). It is running on SUN, SGI and DEC Unix workstations, and it employs a network of participating machines defined in reconfigurable tables. A communications package handles the LAN/WAN cross-network file I/O, batch job submissions and the user’s interactions, while a window-based interface makes this heterogeneous computing environment as transparent to the user as possible. Any platform supporting Xwindows (an X-terminal, or a PC running Linux OS) can be used as a front end display. The system can be confined to just one Unix platform if no other hardware is available.
Stem Trace Generally speaking, Stem Trace displays a two-dimensional (2-D) plot of all unique helical stems existing in a solution space of structure conformations of a given sequence. In other words, all entries corresponding to the same y-axis ‘value’ denote the same stem, where the ‘value’ is a unique triplet of the first 5′ base position, the last 3′ base position and the stem size (number of base pairs). A vertical ‘slice’ through a stem trace for a given x-position contains information about the entire structure. Thus, consecutive x-axis positions correspond to an entire set of structures or region files. Refer to Figure 2 to see the meaning of the unique triplet coding for a 2-D structure and see Figure 3 for an illustration of the graphical encoding employed by Stem Trace. The generic nature of such a representation allowed us to develop Stem Trace into an analysis tool capable of dealing with a range of RNA structure folding data for single or multiple sequences, thus introducing some phylogenetic analysis elements. This has proven to be extremely useful in the structural analysis of sequences such as HIV.
Stem Trace methodology Stem Trace has been used to analyze the following major types of inputs: data from each generation of one run of a GA for RNA folding; the best results for multiple GA runs for one input sequence; ordered sequence of suboptimal results generated by the DPA, such as MFOLD; any of the above for multiple sequences of a family. Stem Trace can be employed in the analysis of secondary structure predictions in many ways. In the case of the GA, we have two major analysis modes. In one, we plot along the horizontal axis best solutions from consecutive generations of one GA run. In the other, final results from multiple runs are plotted along the horizontal axis, showing all the unique re-
17
W.Kasprzak and B.A.Shapiro
Fig. 1. STRUCTURELAB: a general view of some of the available workbench tools (clockwise); the main menu, the taxonomy tree windows, the significance/stability plot, the 2D stem histogram, the structure drawing with base labeling, amino acid labeling, and anno tations, the interactive color control window and a small color scale window, and the large-scale structure drawing.
gions on the vertical axis. In the first case, we can follow the GA’s process of structure maturation in this non-deterministic algorithm. In the second case, persistence (or variability) of structural elements in the solutions becomes clearly visible. Similarly, in the case of the DPA, stem trace visualizes persistent motifs in the space of suboptimal solutions. The more continuous the horizontal line, the more prevalent the stem it represents. Bear in mind that for the DPA-based plots, the energetically optimal solution corresponds to the left-most (first, x = 1) vertical slice, while for the single GA run data plots, it is the last vertical slice that corresponds to the final solution. In the case of multiple GA runs, the x position does not have this significance, since every entry is based on the last generation data from every run. Additionally, Stem Trace can be employed to compare folding results of multiple sequences in one graph. Since the sequences may vary in length, and the resulting structures have
18
different triplet (start 5′ stop 3′ stem size) values for the same structural motifs, the most difficult challenge associated with stem trace plots of multi-sequence data is the proper alignment of the structural elements. We begin with aligning the sequences of interest using PILEUP, one of the programs available in the Wisconsin Package from the Genetics Computer Group (GCG), Madison, WI. Once the sequences are aligned, the result file is automatically translated into a list of shifts, i.e. a list of tuples containing positions at which PILEUP introduced alignment gaps and the widths of the gaps. This list is used in turn to shift-adjust the region tables produced by any of the folding algorithms. The results are plotted. Such a purely sequence-based alignment is not perfect, however. Thus, further refinements are needed that would take structure elements in conjunction with the sequence data into consideration. We have recently conducted successful tests to that effect. At the moment, we employ an iterative procedure
An interactive visual tool for comparative RNA structure analysis
Fig. 2. RNA structure representation in the form of region coding triplets.
for obtaining the preliminary results, as described above, followed by a refinement mechanism. While the stem trace of shift-adjusted region files brings the majority of the stems into correct alignment, some misalignments are unavoidable due to PILEUP’s somewhat arbitrary gap placement in situations where multiple, equally good sequence alignments are possible. Therefore, the refinement steps are needed which would take alignment of structural elements into account. For this purpose, we employ RNAMOT, an RNA Motif Search Program (Laferier et al., 1994). This program takes as its input primary sequences and a descriptor file containing specifications of the structural motif to be searched. When run in the alignment mode, it outputs all occurrences of the searched motif in an alignment based on the secondary structure position rather than an optimal sequence (primary structure) alignment. This is exactly what we want, but can accomplish only after the stem trace based on preliminary sequence alignment reveals potentially dominant structural motifs which can drive the refinement step. Additionally, it is hoped that, together with the structural, the biologically more important information is thus incorporated in the stem trace. As a result, the most recent addition to Stem Trace is a set of functions facilitating the interactive creation of structural consensus motifs later used to realign stems for more accurate display in a stem trace.
Now, these motif functions facilitate the selection of a list of stems to be used as a motif. The user simply has to activate the recording mode and gather the stems of interest via mouse ‘point-and-click’ actions. Then, upon request, the list of regions is translated into the RNAMOT descriptor format and written out to a file. Alignments produced by RNAMOT are used to modify the list of shifts mentioned before and to redraw the stem trace. The most natural way to employ this procedure is to select dominant (high-frequency) stems as the motif and test whether in sequences where some of the motif elements are missing the neighboring stems could be brought into alignment with the help of RNAMOT. A simple example is illustrated in Figure 4. The upper panel shows the stem trace results based on the PILEUP sequence alignment performed for five strains of HIV-1. As can be seen, the lower of the two stems comprising stem–loop 1 (SL1) appears to be different for the HIVMAL sequence. In a refinement step, a structural motif comprised of SL1–SL4 from the remaining four strains was selected and translated into the RNAMOT descriptor file. RNAMOT results were then used to correct the alignment file which was used in the creation of the stem trace shown in the lower panel. It is one of our goals for the immediate future to automate the last step of the described protocol. Other algorithms, such as FOLDALIGN (Gorodkin et al., 1997), may also be considered for guiding the stem alignment process.
19
W.Kasprzak and B.A.Shapiro
Thus, the stem trace clearly visualizes conservation of structural motifs across a family of sequences, combining elements of phylogenetic structural analysis with computational RNA structure prediction methods. As we have mentioned earlier, conceptually, a vertical slice through the stem trace graph is equivalent to the region table for an entire structure. This information, extracted from a trace with a click of a mouse button, is sufficient to draw a secondary and tertiary structure without any need for further file I/O. Automating the same procedure, by putting it in a loop iterating over a range selected by the user, a series of structural drawings can be displayed in a slow-motion sequence. This function is particularly useful in the visual analysis of the GA folding pathways, based on the output saved by this algorithm at every generation. Also, combining the positional stem information with the sequence associated with the stem or stems extracted with mouse clicks makes it easy to label a drawn structure’s stems with sequence symbols, additionally color coded to reflect the stem trace presentation. Linking an abstract representation provided by the stem trace with more explicit secondary structure drawings via use of common color coding proves to be very helpful and effective in the analysis of folding results.
Stem Trace design and implementation Fig. 3. Generic stem trace, sorted in 3′ order, with annotations explaining the graph.
Besides providing a visual representation of a solution space for an RNA secondary structure folding problem, Stem
Fig. 4. Comparison of two structure alignments in a multi-sequence stem trace. In the upper picture, five strains of HIV-1 have been aligned using the GCG’s PILEUP. The lower picture illustrates improvement in the alignment of one of the two stems in the stem–loop structure SL1 in the HIVMAL strain based on the RNAMOT results.
20
An interactive visual tool for comparative RNA structure analysis
Fig. 5. Stem Trace coupled with STRUCTURELAB’s 2D structure drawing tools.
Trace is an active interface to the underlying data. All of the graph points, each denoting one stem of one structure, are mouse sensitive. As a result, a few mouse button clicks facilitate extraction of the data. A left mouse button click over a stem of interest brings up this stem data, whereas complete structural information can be extracted via the middle button. Displayed information is color coded, based on a normalized, 10-color scale, to reflect the viewing criteria of frequency of occurrence or stem energy. To optimize the clarity of the display, a gray scale background can be adjusted over a black to white range by dragging the mouse. A set of functions associated with every graph, briefly described below, performs searching, sorting, scaling, thresholding and, through connections with other STRUCTURELAB tools, drawing and labeling of structures based on the data extracted from a stem trace (see Figure 5). Thus, a stem trace can be viewed as a graphical user interface (GUI) to a database of structure files. The user can draw and maintain multiple and independent stem traces which are handled by a directory of stem trace windows internal to STRUCTURELAB. All of them share a single control window which provides the primary display of information pointed to and extracted from the current, i.e. the one active, in-focus, stem trace. The control window contains buttons which activate state toggles, single functions or display pop-up menus of further action choices. It also displays
a bar of 10 sample colors that are used to color code stems, based on a normalized coding criteria, described below. To provide as much useful information to the user at a glance as possible, the control window permanently displays the current trace state data, such as color coding criterion, thresholding values, absolute energy range of the stems on display or the total number of structures making up the graph. In addition, a display of data related to the stems currently pointed to with the mouse shows x (structure) and y (stem or region) positions, and region frequency. Since a stem trace can be based on structural data associated with multiple sequences (explained earlier), the frequency information is displayed as both the global value, calculated for the entire graph, and a value local to the given sequence interval, which may vary significantly from the global value in the multi-sequence stem traces. Also shown is the name of the sequence associated with the currently pointed to stem/structure data. More detailed information can be displayed via mouse button clicks over the stems. A sample output for one stem is shown below: GENERATION/STRUCTURE = 114 STEM = 111 REGION (UNSHIFTED): ((407 471 2 –2.3)) SEQUENCE BIN FREQUENCY: 100.0% (25 IN 25) ENERGY: –2.3 kcal
Control window functions. The following section briefly discusses the group of functions and functional domains asso-
21
W.Kasprzak and B.A.Shapiro
ciated with the buttons in the stem trace control window (see Figure 5). The ENG-TRACE/FREQ-TRACE button toggles tracing criterion from frequency to energy. The absolute numerical values are also displayed explicitly in the control window. The THRESHOLD button allows the display of only those stems for which the frequencies of occurrence or stem energies, depending on the thresholding criterion, are within the user-specified bounds. The DRAW/NODRAW button enables or disables drawing of RNA structures via middle mouse button clicks over the stem trace window. The LABEL/UNLABEL button enables or disables labeling of the current RNA structure drawing via mouse button clicks over the stem trace window. The MOVIE button invokes the display of a sequence of secondary structures drawings, based on the stem trace data, in a movie-like fashion. DATE: 6/13/1997, TIME: 16:49:46 INPUT FILE(S) : hivs5-dim96.mshft NUMBER OF INPUT STRUCTURES: 125 ENERGY RANGE: 0.0 to –19.5 kcal LOWER GLOBAL THRESHOLD: 0.2 UPPER GLOBAL THRESHOLD: 1.0 reg#
5′
3′
size
eng
#hits freq
1)
1
76
3
-5.0 124
0.99
2)
5
73 11
–14.8 125
1.00
3)
17
48
5
–8.7 125
1.00
4)
30
43
4
–7.4 125
1.00
5)
77
124
8
–10.8
98
0.78
6)
77
124 11
–13.3
25
0.20
… SEQUENCE: hivhxb2r-dim SEQUENCE BIN THRESHOLDS: 0.5 TO 1.0 reg#
5′
3′
size
eng
#hits freq
1)
1
76
3
–5.0 25
1.00
2)
5
73
11
–14.8 25
1.00
3)
17
48
5
–8.7 25
1.00
…
The OUTPUT button allows the user to output information stored by Stem Trace to an indicated file. A small sample of an output file from a multi-sequence (5 HIV-1 dimer site sequences) stem trace is shown below. The output has been thresholded and sorted. The first block of results shows cumulative values, while the following blocks (only one is shown) display sequence bin data. Note the differences in
22
frequency values for the first region. Also worth noting are regions five and six which show Stem Trace’s categorization of overlapping stems (different in length) as two separate entities with very different frequencies. The FIND-STEM button invokes a function searching the stem trace for all occurrences of stems matching a query of the form (5′ [3′[eng [size]]]), where optional entries are shown in square brackets. The SORT button pops up a submenu of stem trace sorting choices. These include a 5′, 3′, region energy, region size and pairing distance sorting criteria. The SCALE button facilitates rescaling and redisplay of the current (in focus) stem trace window. The MOTIF button pops up a submenu of functions involved in the creation of structural motifs that are RNAMOT like for the purpose of further stem trace refinement/realignment and structural motif searches. The DIRECTORY button pops up a menu of stem traces and allows the user to activate (bring in focus and make sensitive to mouse interactions) a trace of interest. The DELETE button deletes the current stem trace window, and if there are any other stem traces, it activates the one from the top of the list. Stem Trace and 2D Stem Histogram interplay. Stem Trace provides a view orthogonal to that of another STRUCTURELAB tool developed before Stem Trace: the 2D Stem Histogram. This function produces a 2D histogram of all base pairs, as opposed to the entire stems, that exist in the input data. It is color coded, on the same 10-color scale used in Stem Trace and based on normalized cumulative frequency information. Continuous diagonals in the histogram correspond to stems of secondary structures. Strong linear structural motifs are particularly clearly visible and appear as chains of diagonals with gaps corresponding to unpaired regions separating the stems (refer to Figure 1). On the other hand, the cumulative nature of the stem histogram fuses information from overlapping stems together and makes it harder to discern which component base pairs in continuous diagonals with varying frequency counts come from which stems in the input structures. While it would have been possible to expand internal data structures to store this information, the visualization limitations could not have been overcome, short of moving into a 3D graph domain, which posed questions about the readability of the display. It was this drawback in visual representation that has motivated us to explore other, more informative ways of representing the data and ultimately led to the development of Stem Trace. In view of the issues arising regarding fuzzy matching of stems, the naturally cumulative results presentation provided by the Stem Histogram makes it easier to spot overlapping alternative stems. Hence, Stem Histogram is still a helpful tool, automatically benefiting from the improvements to the
An interactive visual tool for comparative RNA structure analysis
Stem Trace alignment procedure. It is worth pointing out that a stem trace contains more information which can be collapsed into a stem histogram representation, whereas the opposite transformation cannot take place. Also, as a side benefit of its powerful visualization, Stem Trace multi-sequence data alignment improvements came naturally as the tool matured and were easier to think of than they had been in the case of Stem Histogram. Stem Trace and 2D Stem Histogram are tightly synchronized with each other. RNA secondary structures can be drawn directly based on data from any of them with a click of a mouse button. Resetting the LABEL buttons in either of them (both have similar control windows) toggles states of both the current stem trace, as well that of the related stem histogram. Mouse-driven labeling of the same structure drawing can also be performed interchangeably from either of them. It has to be kept in mind, however, that due to the cumulative nature of stem histograms, drawings based on their data represent composite structures combining most frequently encountered structural elements within the graphed space of solutions. The user can decide on the levels of frequency thresholding, with 50% being the lowest allowed in order to avoid stem conflicts in the same drawing. Technical aspects of the implementation. Structural data collected from input file or files are first hashed into a table with the unique stem-defining triplets being used as the hashing keys (see Figure 2). Then the data are stored more efficiently in a list of structures keeping track of the total cumulative results as well as individual structure–sequence associations and statistics for the sequence bin data. All the data are utilized when a mouse pointer returns (x,y) coordinates, corresponding to the generation and stem indexes, which are translated into the display data. Both stem traces and histograms are based on the powerful and easy-to-use object-oriented Common Windows, a part of the Allegro Common Lisp from Franz Inc. As a side-effect of the Common Windows’ use of multiprocessing, they are separate processes which remain active in parallel with the main menu tree of the entire workbench, thus allowing convenient interactions with other STRUCTURELAB tools. All the functions described above are defined as methods and associated with the control window and acting on the current (the one active) stem trace. They are triggered by mouse button states and mouse gestures over the active stem trace. From the user’s perspective, every ‘pixel’ of a stem trace, i.e. the active area of a trace depicting one stem of one structure (n2 pixels, depending on the trace scaling), can be queried for data.
Examples of usage The following examples illustrate how Stem Trace can be used in the analysis of different RNA folding data. They all
refer to the test case analysis performed on 24 strains of HIV-1, HIV-2 and SIV for 562-nucleotide-long sequences encompassing the dimerization region, starting with the TAR region. It has been reported that two identical and unspliced strands are found in the HIV virion, connected to each other at the dimer linkage structure (DLS). This linkage site is thought to be important in the life cycle of the HIV virus. It has been implicated in encapsidation, reverse transcription, recombination, and inhibition of translation (Darlix et al., 1990; Hu and Temin, 1990; Baudin et al., 1993). It has been proposed that this region of the RNA consists of four stem– loop structures, labeled in our figures as SL1–SL4. Of significant interest is the SL1 structure which contains a wellpreserved palindromic sequence in its hairpin loop (see Figure 5). It is thought to play a crucial role in the formation of the dimer, as the proposed ‘kissing loop’ model suggests (Laughrea and Jette, 1994, 1996; Paillart et al., 1994; Skripkin et al., 1994; Muriaux et al., 1995; Clever et al., 1996; Haddrick et al., 1996; Laughrea et al., 1997). Recently, another structure containing a palindrome (AAGCUU) in a hairpin loop, one that can be seen immediately past the TAR structure in Figure 5, has been suggested to take part in the dimerization process together with SL1. It is marked as PSL in our illustrations, and it has been referred to as r-u5 SL in the paper by Höglund (1997). As the following examples of Stem Trace usage vividly show, these structural features are very well preserved across many strains of HIV-1. We have chosen this particular length of the sequence, 562 nucleotides, based on our folds of the entire HIV-1 genome which showed a self-contained structural domain there. At the same time, the domain is large enough not to force folding results into only one possible conformation. Only the sequences containing the full TAR sequence were selected from the GenBank data, as it is a clearly recognizable landmark structure.
Single-sequence GA folding pathway analysis The first example, illustrated in Figure 6, presents a stem trace for one full run, 120 generations long, of the GA. The upper panel shows the raw data representation of the run, where the last generation ‘vertical slice’ corresponds to the final result to which the GA run has converged. The ordering of stems in the raw trace, related to the implementation issues explained below for the DPA results analysis, corresponds to the first appearance of any given stem in the consecutive generations of the GA. Thus, in this case, the frequency of appearance of color coding of stems gives us some clues about the folding pathway the algorithm takes. It is an additional visual clue in the sorted stem trace depiction shown in the lower panel. For example, the two stems comprising SL1 do not appear in the solution structure simultaneously. It is the stem closer
23
W.Kasprzak and B.A.Shapiro
Single-sequence DPA and GA folding analysis
Fig. 6. Stem trace of one GA run for the HIV-1 strain HIVHXB2R dimerization region, 562 nucleotides long. The top picture shows the stems in order of appearance within the 120 generations, whereas the bottom picture shows them sorted in 5′ order. The final solution structure corresponds to the right-most vertical ‘slice’ of the trace.
to the hairpin loop that enters the solution before the more 5′ stem (lower in a 5′-sorted trace). The TAR structure, on the other hand, is being assembled from its 5′ end (refer to Figure 5), with the longest of its four stems, followed by the stems closer to the hairpin loop, and the less frequent, most 5, short stem of the structure. We have utilized Stem Trace in the capacity described above to evaluate GA runs using different free energy rules (Freier et al., 1986; Turner et al., 1987; Woese et al., 1990).
24
The second example of Stem Trace use, illustrated in Figure 7, shows the stem trace depiction of the solution space produced by a DPA algorithm for a single sequence (HIV-1 HXB2R). The top 100 suboptimal structures corresponding to the top 1% of the free energy range were used as input to the stem traces shown. As the left panel of Figure 7 clearly shows, the unsorted view captures the order of appearance of stems in the consecutive suboptimal solutions. This is related to the actual implementation of Stem Trace which hashes all new entries into the list of stem bins. Thus, in the raw (unsorted) representation, the bottom to top ordering of stems reflects their first appearance in the solution space. The leftmost vertical slice of the DPA stem trace shown corresponds to the optimal structure (the set of its stems). Any single new/ alternative stems appear for the first time along the top-left ‘envelope curve’. In most cases, just one or a few alternative stems enter the picture, indicating minor adjustments to the initially established solution. These, of course can still be very significant, as in the case of SL2 and SL4, indicated in the left panel. Major substructure alternatives comprised of multiple stems show up as vertical bars, such as those including the SL1 structure, also clearly visible in the left panel of Figure 7. While the raw trace is best suited for the order of appearance and alternative solution depictions, a sorted stem trace makes it easier to discern relative positions of the depicted stems. It is especially useful when a stem trace is used in conjunction with a labeled 2D drawing (see Figure 5), making visual associations between the stem trace elements and the structural elements of interest much easier. It is worthwhile keeping in mind that for all the stem traces coupled with the corresponding 2D structure drawings, any stem from the trace can be used (via mouse point and click) to overlay effectively its bases in its color code over a structure drawing currently in focus. In the case of a properly matched stem trace and drawing, the (positionally) matching stem of a structure will be labeled in the color corresponding to its appearance in the stem trace. When the stem selected from a trace is not a part of the displayed drawing, the positions reflecting the stem trace encoding are labeled. In other words, stem labeling may not correspond to the structural elements the drawing shows, and may be seen as two disjoint halves. In this fashion, a quick exploratory visualization of the structural alternatives is possible. This feature is also useful for visual checks on how structurally close the distinct stems laid out near each other in a trace are. Figure 8 illustrates a stem trace for the final results of 25 runs of our GA. Since the GA is a non-deterministic algorithm, it is a standard procedure to run it multiple times and then analyze the results statistically. This 5′-sorted trace shows clearly that the structures of interest, SL1–SL4, are
An interactive visual tool for comparative RNA structure analysis
Fig. 7. Stem trace of the top 100 suboptimal secondary structures computed by the DPA (dynamic programming algorithm such as MFOLD) for the HIV-1 strain HIVHXB2R dimerization region, 562 nucleotides long. The left picture shows the stems in order of appearanc e in the solution space, whereas the right picture shows them sorted in 5′ order. The optimal structure corresponds to the left-most vertical ‘slice’ of the trace.
high-frequency motifs, more vividly so than in the DPA solution space. A reference TAR structure at the beginning of the sequence analyzed was predicted correctly in all the runs. The few stems immediately following the TAR structure
(100% stems) belong to another highly conserved structure with a palindrome in the hairpin loop, the PSL (see Figure 5) (Höglund et al., 1997).
25
W.Kasprzak and B.A.Shapiro
were less vivid in general, and so we have decided to retain the same sequence ordering for ease of visual comparisons between the two stem traces, even though it is not the best ordering from the DPA results perspective.
Fig. 8. A 5′-sorted stem trace of the final results of 25 runs of the GA for the HIV-1 strain HIVHXB2R dimerization region, 562 nucleotides long.
Multiple-sequence DPA and GA phylogenetic analysis In order to facilitate a phylogenetic approach to the analysis of folding results for families of sequences, the initial Stem Trace design, mentioned in our paper on STRUCTURELAB (Shapiro and Kasprzak, 1996), has been extended to handle structural inputs for multiple sequences. The following two subsections illustrate stem traces of the GA and DPA folding results for the same set of sequences containing the DLS. The results were organized in order of preservation of the four stem–loops, SL1–SL4, in the GA results. The DPA results
26
Genetic algorithm results. In exploring the 24 different strains of HIV-1, HIV-2 and SIV, folded with the GA, we have found that the four stem–loop structures are very well maintained. In addition, it is also clear that the GA results point out strong preservation of the TAR structure as well as the PSL stem–loop structure containing a palindrome in the hairpin loop. There is one more strongly preserved stem (in HIV-1 and SIV) between the PSL and SL1 structures, marked simply as SL in Figures 5 and 9. It is a short stem composed of three G-C pairs, with a hairpin loop of size 5 (GAACA). Usually it is supported by another stem of variable size, separated from it by a small internal or a bulge loop. Figure 9 shows the end results of 25 fold runs for each of the 24 sequences performed with the GA and visualized as multiple sequence stem trace. The sequences were aligned with PILEUP, and the structure data (region files) were shift adjusted based on these alignments. One PILEUP alignment artifact, already shown in Figure 4, is the misalignment of the lower of the two SL1 stems for the HIVMAL strain (sequence 1). The lower of the two SL2 stems for strains HIVNDK and HIVJRCSF (sequences 2 and 3) are 1 bp longer then the consensus stem length of 4, and hence are classified as distinctly different by Stem Trace. In the case of the HIVOYI strain (sequence 8), the upper of the two stems in SL1 is of size 6, while the consensus stem is of size 7. This is related to a missing base G which leaves the opposite C unpaired, thus also making the hairpin loop one base larger than in the consensus structures. An additional A to U mutation in the first 5′ position within the hairpin loop does not disrupt the palindromic sequence. However, three out of the 25 solutions for HIVOYI show an alternative, ‘pinched-loop’ structure which pairs the middle two bases of the palindrome (GC). This major alternative may be of some interest, since the GenBank annotations report this strain to be avirulent, ascribing its inactivity to a substitution in OYI’s tat protein, but leaving the exact reasons for avirulence open to further questions. A similar mutation mechanism shortens the upper stem of the SL2 structure for the HIVZ2Z6 sequence (sequence 19) from 3 to 2 bp. SL3 seems to be the most stable of the four stem–loop structures of interest, while SL4 does have a few true alternative structures clearly visible just above it in strains HIVMN, HIVSF2 and HIV3202A12 (sequences 15, 17 and 20, respectively). The same is true for the upper SL2 stem in the case of HIVSF2. While we can see a strong preservation of the four stem– loop structures for the first 20 sequences, the alignments begin to fall apart after that. SL1 and SL2 are disrupted for the HIV-1 strain HIV3202A21. It is not clear whether it is
An interactive visual tool for comparative RNA structure analysis
Fig. 9. A 5′-sorted stem trace of the final results of the GA folding runs for 24 strains of HIV-1, SIV and HIV-2. The sequences folded (562 nucleotides each) come from the dimerization site and begin with the TAR region. The alignment of the stems shown in this picture is based on sequence alignments produced by the GCG’s PILEUP utility.
27
W.Kasprzak and B.A.Shapiro
due to a sequencing mistake or whether this dramatic difference in sequence and the resulting structure is related to the non-syncytium-inducing character of the strain, given the fact that HIV3202A12 comes from the same patient (and it is syncytium inducing) (Groenink et al., 1995; Guillon et al., 1995; Laughrea et al., 1997). Interestingly, SL4 is preserved unchanged in the SIV strain CPZGAB (sequence 22). Inserted sequence prior to the SL1 area disrupts this structure in SIV, resulting in a 5′-shifted stem–loop structure with a hairpin loop of size 8, instead of 9, and a different palindrome (GUGCAC). It is marked as SL1 ALT in Figure 9. On the other hand, SL2, SL3 and SL4 appear correctly, although due to sequence insertions and deletions between them, SL2 and SL3 are both 5′-shifted with respect to the consensus structures, and the hairpin loop sequence in SL3 is mutated from GGAG to GGUG, as in SL2. As in SIV, HIV-1 strain HIVANT70 (sequence 23) has an alternative palindromic SL1, with the same palindrome as in SIV, but in a hairpin loop of size 7 (see label SL1 ALT in Figure 9). It also has a 5′-shifted structure roughly similar to SL2 (i.e. morphologically similar, but with a different sequence in the hairpin loop). Finally, it has a 5′-shifted SL3 structure with the hairpin loop sequence mutated to GGUG, and no SL4 structure. Both SIVCPZGAB and HIVANT70 have a strongly preserved extra structure between SL3 and SL4. HIV2ROD is definitely different in the area where SL1–SL4 occur in the HIV-1 consensus structures, and any comparisons would have to be made in a rather abstract fashion, based on the morphology of the secondary structure. Of special interest may be the strongly preserved structure beginning just past (more 3′) the SL4 consensus stem line, marked as HIV-2 PSL. It has a hairpin loop of size 11, containing the palindromic sequence GGUACC conserved across HIV-2 strains. The three indicated stems are in full agreement with the structure proposed to be responsible for HIV-2 dimerization and encapsidation (McCann and Lever, 1997). The base stem has an alternate structure which is just 1 bp shorter (not affecting the loops). Other structural motifs that are strongly preserved are the TAR structure and the PSL structure. While we do see some variations within the middle and base stems comprising them, none of these disrupts the essentially linear topology of them in HIV-1 and the SIV strains analyzed. In the case of the HIV2ROD, most of the TAR structure is preserved, but it is a part of a bigger tree-like structure. There is no equivalent of the PSL. Instead, there is a strongly preserved, long linear stem-loop structure with two short palindromes (UUAA and AGCU) in its 18-nucleotide-long hairpin loop. Dynamic programming algorithm results. Figure 10 shows the results of the DPA fold with the free energy rules
28
employed in MFOLD 2.3 (Zuker, 1997), performed for the same 24 sequences used in GA folds. For reasons of limited picture space, only the top 10 suboptimal structures for each sequence are used in this graph. We have analyzed up to the top 100 suboptimal solutions for each sequence (the top 1% of the free energy range), and we have not seen any dramatic differences in the results. Therefore, we feel that this more compact stem trace is sufficiently representative of the DPA results. One striking feature, when compared with the GA results, is a much less frequent presence of stem–loop structures SL1, SL2 and SL4. Only SL3 is strongly preserved in the majority of the HIV-1 strains (82%). Also clearly visible is the greater variability of the solution space (more stems, y-axis entries). We have to keep in mind, however, that, in general, the suboptimal solutions produced by the DPA are meant to explore this variability. Nevertheless, particularly strong motifs, such as the TAR stems, the SL, and SL3, are strongly preserved and clearly delineated in the DPA results stem trace. Clearly visible are the strongly preserved stems comprising the TAR structure, whereas the PSL elements are of lower frequency and greater variability. The SL structure, while not as strong as in the GA results, also clearly stands out. The SL1 structure is strongly present in only eight out of the 24 sequences; HIV-1 strains YU10, OYI, YU2, CDC41, D31, CAM1, SF2 and 3202A12 (sequences 5, 8, 9, 10, 11, 12, 17 and 20, respectively). Owing to a mutation in the upper stem of the SL1 for HIVOYI, the hairpin loop is of size 10, the same as in the GA stem trace. In agreement with the GA results are also the two alternatives to SL1 in SIVCPZGAB, and HIVANT70 (sequences 22 and 23), marked as SL1 ALT. SL2 results show great variability with alternatives including long-distance interactions seriously disrupting the structure, such as in the case of HIVOYI, HIVCAM1 or HIVLAI (sequences 8, 12, 13). Similar to GA results, the larger hairpin loop (shorter stem) in the SL1 structure is the dominant conformation in the HIVOYI strain (sequence 8). All the top 10 structures shown have it, and in the top 100 structures (first 1% of energetically suboptimal solutions) it is present in 71% of structures. The alternatives with partially paired palindromic sequence add up to 5% of the solutions. Other solutions totally rearrange sequence pairings in the SL1 area. SL3 is the best preserved structure, while SL4 is strongly present only in the HIV-1 strains JRCSF, BCSG3C, Z2Z6 and 3202A12 (sequences 3, 14, 19 and 20). As expected, HIV3202A21 (sequence 22) does not conform with the other HIV-1 strains. Less expected is a very different conformation of HIVZ2Z6 (sequence 19) in which only the TAR, SL and SL4 match the rest. Both SIVCPZGAB and HIV-1 strain ANT70 (sequences 22 and 23) have an alternative palindromic SL1, as described in the discussion of the GA results, and marked as SL1 ALT in
An interactive visual tool for comparative RNA structure analysis
Fig. 10. A 5′-sorted stem trace of the top 10 suboptimal structures obtained from the DPA folding runs for 24 strains of HIV-1, SIV and HIV-2. The sequences folded (562 nucleotides each) come from the dimerization site and begin with the TAR region. The alignment of the stems shown in this picture is based on sequence alignments produced by the GCG’s PILEUP utility.
Figure 10, as well as a strongly preserved extra structure past the SL3 band, the same as just mentioned in the discussion of the GA results. Neither of them has the SL4 structure.
HIVANT70 has a 5′-shifted structure roughly similar to the SL2, although with a different sequence in the hairpin loop. It also has an SL3 equivalent (in terms of relative position),
29
W.Kasprzak and B.A.Shapiro
but with a GGUG sequence in the hairpin loop (same as the consensus SL2 hairpin loop). Finally, similar to the GA results, the HIV2ROD strain has a structure, marked as HIV-2 PSL, with a palindromic hairpin loop containing the sequence GGUACC, well past (more 3′) the SL4 stem.
Conclusions As the examples of use show, Stem Trace has already proven itself to be a very useful analysis tool. Besides the above presented cases of DLS structure exploration, Stem Trace, together with the 2D Stem Histogram, has been used in the analysis of RNA enteroviruses. The GA runs for the three strains of poliovirus (poliovirus 1 Mahoney, poliovirus 2 Lansing and poliovirus 3 Leon) were studied with its help, resulting in elucidation of the consensus structure (Currey and Shapiro, 1997). The STRUCTURELAB tools and the GA were also applied to the analysis of the 5′ NCR of coxsackievirus (Tu et al., 1995). We are working on the automation of the entire multiple sequence motif elucidation. The iterative method with feedback from sequence and structure alignment tools seems to have a lot of promise and room for improvements. Further integration with STRUCTURELAB tools is possible and beneficial. Links with more sophisticated linear and structural motif searches, currently in use as independent system tools, are planned.
References Baudin,F., Marquet,R., Isel,C., Darlix,J.-L., Ehresmann,B. and Ehresmann,C. (1993) Functional sites in the 5′ region of human immunodeficiency virus type 1 RNA from defined structural domains. J. Mol. Biol., 229, 379–382. Clever,J.L., Wong,M.L. and Parslow,T.G. (1996) Requirements for kissing-loop-mediated dimerization of human immunodeficiency virus RNA. J. Virol., 70, 5902–5908. Currey,K. and Shapiro,B.A. (1997) Secondary structure computer prediction of the polio virus 5′ non-coding region is improved with a genetic algorithm. Comput. Applic. Biosci., 13, 1–12. Darlix,J.-L., Gabus,C., Nugeyre,M.-T., Clavel,F. and Barre-Sinoussi,F. (1990) cis elements and trans-acting factors involved in the RNA dimerization of the human immunodeficiency virus HIV-1. J. Mol. Biol., 216, 689–699. Freier,S.M., Kierzek,R., Jaeger,J.A., Sugimoto,N., Caruthers,M.H., Neilson,T. and Turner,D.H. (1986) Improved free-energy parameters for predictions of RNA duplex stability. Proc. Natl Acad. Sci. USA, 83, 9373–9377. Gorodkin,J., Heyer,L.J. and Stormo,G.D. (1997) Finding the most significant common sequence and structure motifs in a set of RNA sequences. Nucleic Acids Res., 25, 3724–3732.
30
Groenink,M., Fouchier,R.A., de Goede,R.E., de Wolf,F., Gruters,R.A., Cuypers,H.T., Huisman,H.G. and Tersmette,M. (1995) Phenotypic heterogeneity in a panel of infectious molecular human immunodeficiency virus type 1 clones derived from a single individual. J. Virol., 65, 1968–1975. Guillon,C., Bedin,F., Fouchier,R.A., Schuitemaker,H. and Gruters,R.A. (1995) Completion of nucleotide sequences of nonsyncytium-inducing and syncytium-inducing HIV type 1 variants isolated from the same patient. AIDS Res. Hum. Retroviruses, 11, 1537–1538. Haddrick,M., Lear,A.L., Cann,A.J. and Heaphy,S. (1996) Evidence that a kissing loop structure facilitates genomic RNA dimerization in HIV-1. J. Mol. Biol., 259, 58–68. Höglund,S., Ohagen,A., Goncalves,J., Panganiban,A.L. and Gabuzda,D. (1997) Ultrastructure of HIV-1 genomic RNA. Virology, 233, 271–279. Hu,W.-S. and Temin,H.M. (1990) Genetic consequences of packaging two RNA genomes in one retroviral particle: pseudodiploidy and high rate of genetic recombination. Proc. Natl Acad. Sci. USA, 87, 1556–1560. Laferier,A., Gautheret,D. and Cedergren,R. (1994) An RNA pattern matching program with enhanced performance. Comput. Applic. Biosci., 10, 211–212. Laughrea,M. and Jette,L. (1994) A 19-nucleotide sequence upstream of the 5′ major splice donor is part of the dimerization domain of the human immunodeficiency virus 1 genomic RNA. Biochemistry, 33, 13464–13474. Laughrea,M. and Jette,L. (1996) Kissing-loop model of HIV-1 genome dimerization: HIV-1 RNAs can assume alternative dimeric forms, and all sequences upstream or downstream of hairpin 248–271 are dispensable for dimer formation. Biochemistry, 35, 1589–1598. Laughrea,M., Jette,L., Mak,J., Kleiman,L., Liang,C. and Wainberg,M.A. (1997) Mutations in the kissing-loop hairpin of human immunodeficiency virus type 1 reduce viral infectivity as well as genomic RNA packaging and dimerization. J. Virol., 71, 3397–3406. McCann,E.M. and Lever,A.M.L. (1997) Location of cis-acting signals important for RNA encapsidation in the leader sequence of human immunodeficiency virus type 2. J. Virol., 71, 4133–4137. Muriaux,D., Girard,P.M., Bonnet-Mathoniere,B. and Paolettu,J. (1995) Dimerization of HIV-1_Lai RNA at low ionic strength. J. Biol. Chem., 270, 8209–8216. Paillart,J.C., Marquet,R., Skripkin,E., Ehresmann,B. and Ehresmann,C. (1994) Mutational analysis of the bipartite dimer linkage structure of human immunodeficiency virus type 1 genomic RNA. J. Biol. Chem., 269, 27486–27493. Shapiro,B.A. and Kasprzak,W. (1996) Structurelab: a heterogeneous bioinformatics system for RNA structure analysis. J. Mol. Graph., 14, 194–205. Shapiro,B.A. and Navetta,J. (1994) A massively parallel genetic algorithm for RNA secondary structure prediction. J. Supercomput., 8, 195–207. Shapiro,B.A. and Wu,J.C. (1996) An annealing mutation operator in the genetic algorithms for RNA folding. Comput. Applic. Biosci., 12, 171–180. Shapiro,B.A. and Wu,J.C. (1997) Predicting RNA H-type pseudoknots with the massively parallel genetic algorithm. Comput. Applic. Biosci., 13, 459–471.
An interactive visual tool for comparative RNA structure analysis
Skripkin,E., Paillart,J.C., Marquet,R., Ehresmann,B. and Ehresmann,C. (1994) Identification of the primary site of the human immunodeficiency virus type 1 RNA dimerization in vitro. Proc. Natl Acad. Sci. USA, 91, 4945–4949. Steele,G.L. (1990) Common Lisp—The Language, 2nd edn. Digital Equipment Corporation, Bedford, MA. Tu,Z., Chapman,N.M., Hufnagel,G., Tracy,S., Romero,J.R., Barry,W.H., Zhao,L., Currey,K. and Shapiro,B.A. (1995) The cardiovirulent phenotype of coxsackievirus b3 is determined at a single site in the genomic 5′ untranslated region. J. Virol., 69, 4607–4618. Turner,D.H., Sugimoto,N., Jaeger,J.A., Longfellow,C.E., Freier,S.M. and Kierzek,R. (1987) Improved parameters for predictions of RNA structure. Cold Spring Harbour Symp. Quant. Biol., 52, 123–133.
Turner,D.H., Sugimoto,N. and Freier,S.M. (1988) RNA structure prediction. Annu. Rev. Biophys. Biophys. Chem., 17, 167–192. Woese,C.R., Winker,S. and Guttell,R.R. (1990) Architecture for ribosomal RNA: constraints on the sequence of tetra-loops. Proc. Natl Acad. Sci. USA, 87, 8467–8471. Zuker,M. (1989) On finding all suboptimal foldings of an RNA molecule. Science, 244, 48–52. Zuker,M. (1996) Prediction of RNA Secondary Structure by Energy Minimization. Washington University, St Louis, MO. Zuker,M. (1997) RNA Pages. http://www.ibc.wustl.edu/∼zuker/rna/ (14 August 1998). Zuker,M. and Stiegler,P. (1981) Optimal computer folding of large RNA sequences using thermodynamics and auxilliary information. Nucleic Acids Res., 9, 133–148.
31