Genome Informatics 12: 184–193 (2001)
A Mini-Greedy Algorithm for Faster Structural RNA Stem-Loop Search

Jan Gorodkin (1), [email protected]
Rune B. Lyngsø (2), [email protected]
Gary D. Stormo (3), [email protected]

(1) Bioinformatics Research Center and Department of Genetics and Ecology, University of Aarhus, Building 540, Ny Munkegade, DK-8000 Aarhus, Denmark
(2) Department of Computer Science and Engineering, Jack Baskin School of Engineering, University of California, Santa Cruz CA 95064, USA
(3) Department of Genetics, Washington University Medical School, 660 S. Euclid, Box 8232, St. Louis MO 63110, USA
Abstract
When a set of coregulated genes share a common structural RNA motif, e.g. a hairpin, most motif search approaches fail to locate the covarying but structurally conserved motif. There do exist methods that can locate structural RNA motifs, like FOLDALIGN, but the main problem with these methods is that they are computationally expensive. In FOLDALIGN, a major contribution to this is the use of a greedy algorithm to construct the multiple alignment. To ensure good quality many redundant computations must be made. However, by applying the greedy algorithm on a carefully selected subset of sequences, near full greedy quality can be obtained. The basic idea is to estimate the order in which the sequences entered a good greedy alignment. If such a ranking, found from all pairwise alignments, is in good agreement with the order of appearance in the multiple alignment, the core structural motif can be found by performing the greedy algorithm on just the top sequences in the ranking. The ranking used in this mini-greedy algorithm is found by using two complementary approaches: 1) When interpreting the FOLDALIGN score as an inner product (kernel), the sequences can be ranked according to their distance to their center of mass; 2) We construct an algorithm that attempts to find the K closest sequences in the vector space associated with the inner product, and the remaining sequences can be ranked by their minimum distance to any of the sequences in this set, or to the center of mass of this set. The two approaches are compared and merged, and the results discussed. We also show that structural alignments of near full greedy quality can be found in significantly reduced time, using these methods. The algorithm is being included in the SLASH (Stem-Loop Align SearcH) server available at http://www.bioinf.au.dk/slash.
Keywords: RNA Structural alignment, greedy algorithm, FOLDALIGN, SLASH, sequence ranking
1 Introduction
A major challenge in genomics is to describe the mechanism of coregulated genes. Much attention has been given to transcriptional regulation, which typically is accomplished by transcription factor binding sites. These binding sites are often small DNA regions that are constrained by a semiconserved sequence motif. However, of obvious interest is also the much less studied mechanism of post-transcriptional regulation. For example, in yeast there are many genes where the level of mRNA is not highly correlated with the protein level [7]. Many post-transcriptional mechanisms involve binding between the regulatory protein and a structural RNA motif in, for example, the mRNA’s UTR regions, and have impact on stability and translation, e.g., the hairpins of Iron Responsive Elements (IRE) that are responsible for regulating iron metabolism in vertebrate organisms [10]. There exist several pattern recognition approaches that can predict protein binding sites of a set of transcriptionally
coregulated genes [18, 1, 15, 9, 11]. However, when the motif is constrained in structure and has little or no sequence conservation, these methods fail to discover the regulating motif. RNA secondary structures are most reliably determined through comparative sequence analysis from sequences that have the same functionality, e.g., [14, 21]. The hard part is to provide the initial alignment from which the structure should be extracted. This problem gets even harder when the motif only constitutes a small portion of the sequences investigated, such as structural motifs in UTRs. FOLDALIGN was developed to search for such local structural stem-loop motifs that are constrained in sequence as well as structure. FOLDALIGN optimises a score through a dynamic programming approach [3, 4], and the multiple alignment is progressively constructed by aligning a new sequence to an already existing alignment, starting with all pairwise alignments. As with MFOLD [22, 13], for example, where the lowest predicted energy might not correspond to the biologically correct structure, the highest scoring FOLDALIGN alignment may not be correct either. The sequences that make up the highest scoring multiple alignment consisting of three sequences might not be a part of the highest scoring multiple alignment consisting of four sequences. However, if sufficiently many sequences contain the same motif, it is likely to be reinforced throughout the multiple alignments, when the greedy algorithm allows for sufficiently many computations. The goal of this work is to let the greedy algorithm perform only the most relevant computations, by discarding computations that would be redundant anyway. In recent work we removed this redundancy on a large scale, making it feasible to analyze data sets of sizes that would otherwise be intractable for FOLDALIGN [5].
This was done by combining FOLDALIGN and the stochastic context-free grammar (SCFG) implementation COVE [2], where FOLDALIGN only works on a small but sufficient number of (non-redundant) sequences, to find a covarying motif that could be used to train an SCFG [5]. SCFGs were also developed in [16]. Here, we show that the greedy algorithm itself can be modified to run much faster at only a small cost in the quality of the local structural alignments. The sequences in each obtained multiple alignment have been added one by one in a particular order. So if we were able to predict this order for the best alignment, we would only need to save a single alignment at each round, as the sequences could be aligned in that order. Based on all pairwise alignments, we estimate an order in which to align the sequences. But rather than aligning one sequence at a time, we perform the greedy algorithm on a certain fraction of the highest ranking sequences. Thus, compared to a full greedy scheme on all sequences, many redundant computations are avoided. The ranking is based on finding a subset of sequences that have as much in common as possible, but at the same time share features with the remaining sequences in the data set. The latter is done by interpreting the FOLDALIGN score as an inner product, a kernel [8, 20], for which distances to the center of mass can be found. The former is done by an iterative approach that selects a subset of K close sequences.
2 The Mini-Greedy Algorithm

2.1 Greedy Alignments by FOLDALIGN
FOLDALIGN has previously been described in detail [3, 4] and its time complexity discussed in [5]. Here, we briefly describe the concept relevant for the greedy part of the algorithm. As mentioned, after making all pairwise alignments, the multiple alignment is built by adding a sequence to an already existing alignment, see Figure 1. The s best (highest scoring) alignments of r − 1 sequences are kept for the next round, where alignments of r sequences are obtained. As mentioned in [5] the obtained score is the sum of all pairwise scores, that is, the score of aligning r sequences is

$$S_r = \sum_{l=1}^{r-1} \sum_{k=l+1}^{r} s_{lk},$$

when they are constrained to the same structure. The $s_{lk}$ terms are the scores for aligning sequence l and sequence k. The FOLDALIGN score can in obscure cases be negative, and therefore unsuitable for the kernel presented below. However, in practice this does not occur at the pairwise level, if the two sequences contain just a common nucleotide and the score for a match of these is positive.

[Figure 1: diagram of two alignments, A (seq a_1, ..., seq a_{r−2}, seq a_{r−1}) and B (seq b_1, ..., seq b_{r−2}, seq b_{r−1}), each extended by a sequence x not in A, respectively not in B.]
Figure 1: Outline of FOLDALIGN greedy algorithm. A sequence not part of a given alignment of r − 1 sequences is added to the alignment. The r − 1 sequences have the same structural constraints in this alignment. The sequences at the bottom entered the alignment first.
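The greedy construction described in this section can be sketched in a few lines of Python. This is a simplified illustration and not the FOLDALIGN implementation: it assumes that the score of a multiple alignment is exactly the sum of fixed pairwise scores, whereas FOLDALIGN rescores the sequences under a shared structure in every round. The names `sum_of_pairs`, `greedy_rounds`, `pair_score` and `s_keep` are hypothetical.

```python
def sum_of_pairs(members, pair_score):
    """S_r: sum of s_lk over all pairs l < k of the sequences in `members`."""
    ordered = sorted(members)
    return sum(pair_score[a][b]
               for idx, a in enumerate(ordered)
               for b in ordered[idx + 1:])


def greedy_rounds(pair_score, s_keep, max_round):
    """Keep the s_keep best alignments of r - 1 sequences and extend each
    of them by one new sequence to obtain the alignments of r sequences.

    Assumption: pairwise scores are fixed, so an alignment is represented
    only by the set of sequence indices it contains.
    """
    n = len(pair_score)

    def rank_key(members):
        # Highest score first; ties broken by sequence indices.
        return (-sum_of_pairs(members, pair_score), sorted(members))

    # Round 2: all pairwise alignments.
    beam = [frozenset((a, b)) for a in range(n) for b in range(a + 1, n)]
    beam = sorted(beam, key=rank_key)[:s_keep]
    best = {2: beam[0]}
    for r in range(3, max_round + 1):
        extended = {m | {x} for m in beam for x in range(n) if x not in m}
        beam = sorted(extended, key=rank_key)[:s_keep]
        best[r] = beam[0]
    return best


# Toy symmetric score matrix for four sequences (illustrative values only).
pair_score = [[0, 5, 1, 1],
              [5, 0, 4, 1],
              [1, 4, 0, 9],
              [1, 1, 9, 0]]
best = greedy_rounds(pair_score, s_keep=2, max_round=4)
```

In this toy example the best pair is sequences 2 and 3, and the best triple adds sequence 1. A larger `s_keep` lets the search keep alternatives alive at the cost of more alignments per round, which is the redundancy the mini-greedy algorithm aims to reduce.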
2.2 The Center of Mass Ranking
The score of two aligned sequences can be interpreted as an inner product, or kernel, for which there exists a vector space (which can be infinite dimensional), where each sequence is associated with a vector in this space [8, 20]. The score of aligning sequence $x_i$ with $x_j$ can then be interpreted as the kernel: $\mathrm{Score}(x_i, x_j) = K(x_i, x_j) = K(i, j)$, and

$$K(x_i, x_j) = \vec{\phi}(x_i) \cdot \vec{\phi}(x_j) = \vec{\phi}_i \cdot \vec{\phi}_j, \tag{1}$$

where $\vec{\phi}_i$ is the vector associated to the ith of the N sequences in the data set. The distance between the two sequences can be written as

$$\mathrm{dist}(i, j) = d_{ij} = \sqrt{K(x_i, x_i) - 2K(x_i, x_j) + K(x_j, x_j)}. \tag{2}$$
The center of mass vector, which is hard or impossible to compute without explicit knowledge of the kernel, is defined by

$$\vec{\phi}_{CM} \equiv \frac{\sum_{j=1}^{N} m_j \vec{\phi}_j}{\sum_{j=1}^{N} m_j} = \sum_{j=1}^{N} \mu_j \vec{\phi}_j, \tag{3}$$
where $m_i$ is the “mass” of the ith sequence. The mass can be interpreted as a weight of the ith sequence, and could for example include a sequence length or cluster size dependence. Here we shall set $m_i = \mu_i = 1/N$ for all i. The center of mass vector, $\vec{\phi}_{CM}$, is an average vector of all sequences, and if we could convert it back to a sequence, it would be an “average sequence” of our data set. If the data set consists of only a single motif and there are no strong submotifs, we would expect such an average sequence to contribute to a good multiple alignment in a full greedy approach. Thus, the center of mass sequence and the sequences in an appropriate vicinity are expected to contribute significantly to the multiple alignment. As illustrated in Figure 2, the sequences closest to the center of mass are the ones selected for performing the mini-greedy algorithm. The remaining sequences can be ranked according to their distances to the center of mass. We notice that we do not have to know the kernel function in order to compute the distance between sequence i and the center of mass
$$\mathrm{dist}(i, CM) = \sqrt{K(i, i) - 2K(i, CM) + K(CM, CM)}, \tag{4}$$

as

$$K(i, CM) = \vec{\phi}_i \cdot \vec{\phi}_{CM} = \vec{\phi}_i \cdot \sum_{j=1}^{N} \mu_j \vec{\phi}_j = \sum_{j=1}^{N} \mu_j K(i, j), \tag{5}$$
and

$$K(CM, CM) = \vec{\phi}_{CM} \cdot \vec{\phi}_{CM} = \sum_{i,j} \mu_i \mu_j \vec{\phi}_i \cdot \vec{\phi}_j = \sum_{i,j} \mu_i \mu_j K(i, j). \tag{6}$$

[Figure 2: scatter of sequence vectors around the center of mass (CM), with a neighborhood of CM marked.]
Figure 2: An illustration of the vector space associated with the inner product. Only sequences in a neighborhood of the center of mass (CM) are aligned against each other. The remaining sequences can be aligned in the order corresponding to their distances to the center of mass.
Hence $K(i, CM)$ is the average score sequence i has to all sequences in the data set, including itself. $K(CM, CM)$ is a global property of the data set, namely the expected score of two sequences chosen independently and uniformly at random from the data set. Thus, the distance to CM essentially depends on the difference $K(i, i) - 2K(i, CM) = (1 - 2\mu_i) K(i, i) - 2 \sum_{j \neq i} \mu_j K(i, j)$. In most cases we would expect $K(i, i)$ to be of the same order for all i, and most certainly distributed differently than $K(i, j)$, $i \neq j$. If the sequence length is of the order of the size of the motif searched for, we would expect all sequences aligned against themselves to have roughly the same score. The same holds in the case of long sequences with a small motif: if the surrounding nucleotides (of a stem-loop) are of the same random composition in each sequence, they are expected to contribute roughly the same “random structure”. In the case of a small motif, but where the sequences are of different length, the size of the motif searched for is usually (user) limited by FOLDALIGN, such that all sequences aligned against themselves typically have motifs of that length. Hence, the score arguments just mentioned also apply here. Even if the sequences in a data set all share the same clear structural motif, the pairwise scores can vary substantially [5]. It is therefore desirable to include only appropriately close and large scoring sequences in the computation of the center of mass, as these sequences are expected to be the first sequences entering the multiple alignment in the greedy process.
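As a minimal sketch of the center of mass ranking, the distances of Eq. (4) can be computed directly from a matrix of pairwise scores using Eqs. (5) and (6), with uniform weights $\mu_i = 1/N$. The function names are hypothetical. Clamping a negative value under the square root to zero is an added assumption, made because real alignment scores need not form a valid kernel.

```python
import math


def cm_distances(kernel):
    """Distance of every sequence to the center of mass, Eq. (4).

    `kernel[i][j]` is the pairwise score K(i, j); weights are mu_i = 1/N.
    """
    n = len(kernel)
    mu = 1.0 / n
    # Eq. (6): K(CM, CM) = sum_{i,j} mu_i mu_j K(i, j)
    k_cm_cm = sum(mu * mu * kernel[i][j] for i in range(n) for j in range(n))
    distances = []
    for i in range(n):
        # Eq. (5): K(i, CM) = sum_j mu_j K(i, j)
        k_i_cm = sum(mu * kernel[i][j] for j in range(n))
        squared = kernel[i][i] - 2.0 * k_i_cm + k_cm_cm
        # Assumption: guard against small negative values.
        distances.append(math.sqrt(max(squared, 0.0)))
    return distances


def cm_ranking(kernel):
    """Sequence indices ordered by increasing distance to the center of mass."""
    d = cm_distances(kernel)
    return sorted(range(len(kernel)), key=lambda i: d[i])


# Toy check with explicit vectors, so the kernel is a true inner product.
vectors = [(0, 0), (2, 0), (0, 2), (10, 10)]
kernel = [[a[0] * b[0] + a[1] * b[1] for b in vectors] for a in vectors]
distances = cm_distances(kernel)
ranking = cm_ranking(kernel)
```

For the toy vectors the center of mass is (3, 3), so the outlier (10, 10) is ranked last even though the computation never uses the vectors themselves, only their inner products.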
2.3 Homology Ranking (Homage): Selecting the K Closest Sequences
To select the K ≤ N closest or most similar sequences, we introduce a greedy heuristic, where either the K best (the closest in distance or highest in score) sequences are kept, or the N − K worst discarded. Similar approaches have been described in the literature, e.g. by Gusfield [6]. The schemes are based on assigning a value $v_i$ to each sequence $x_i$ in the data set. This value should reflect the largest pairwise distance $x_i$ will have to any other sequence in an optimal set of K sequences including $x_i$. The largest distance from $x_i$ to any other sequence in a set of K sequences will always be at least the (K − 1)th smallest distance from $x_i$ to all the other sequences in the full data
set. Hence, we will use this value as a starting point for assigning a value to $x_i$. One could consider using the (K − 1 + ∆)th smallest distance for some ∆ > 0, as it is unlikely that the optimal set of K sequences containing $x_i$ will consist of the K − 1 sequences closest to $x_i$. One could even use a weighted average of all distances but the K − 2 smallest. However, our experiments showed that these more complicated value assignments did not yield denser sets with the data sets analyzed in this paper. Once we have assigned values $V = \{v_1, \ldots, v_N\}$ to all sequences $X = \{x_1, \ldots, x_N\}$, we can either iteratively discard sequences that have high values from X (removing scheme), or iteratively add sequences that have low values to a set $X_{keep}$ (adding scheme). However, as soon as we remove a sequence $x_i$ from X, this changes the (K − 1)th smallest distance for any sequence $x_j$ that has $x_i$ among its K − 1 closest. Hence, we update $v_j$ for all such sequences to reflect this fact. Similarly, if we add $x_i$ to $X_{keep}$ we know that the largest distance to a sequence in the final set for any other sequence $x_j$ will be at least $d_{ij}$, the distance between $x_i$ and $x_j$. Again, we update $v_j$ to reflect this fact. The algorithms are listed in Table 1.

Table 1: The removing and adding schemes in Homage. The K-subset is drawn from the scheme producing the lowest score. In the rare cases where more than one sequence can be removed or added, we choose one at random; for real sequences, however, this will in practice never happen.
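The two schemes can be sketched from the prose description above. The sketch below is a reading of that description rather than the authors' listing, and several details are assumptions: the score used to choose between the two schemes is taken to be the largest pairwise distance within the selected subset, ties are broken by lowest index instead of randomly, and K is assumed to satisfy 2 ≤ K ≤ N. All function names are hypothetical.

```python
from itertools import combinations


def kth_smallest_to_others(dist, i, members, k):
    """k-th smallest distance (1-based) from sequence i to the other members."""
    return sorted(dist[i][j] for j in members if j != i)[k - 1]


def removing_scheme(dist, K):
    """Iteratively discard the sequence with the highest value v_i."""
    members = set(range(len(dist)))
    while len(members) > K:
        # Values are recomputed on the remaining set after each removal.
        values = {i: kth_smallest_to_others(dist, i, members, K - 1)
                  for i in members}
        worst = max(sorted(members), key=lambda i: values[i])
        members.remove(worst)
    return sorted(members)


def adding_scheme(dist, K):
    """Iteratively add the sequence with the lowest value v_i to X_keep."""
    candidates = set(range(len(dist)))
    values = {i: kth_smallest_to_others(dist, i, candidates, K - 1)
              for i in candidates}
    keep = []
    while len(keep) < K:
        best = min(sorted(candidates), key=lambda i: values[i])
        keep.append(best)
        candidates.remove(best)
        for j in candidates:
            # v_j is at least the distance to the sequence just added.
            values[j] = max(values[j], dist[best][j])
    return sorted(keep)


def subset_score(dist, subset):
    """Assumed score: largest pairwise distance within the subset."""
    return max(dist[i][j] for i, j in combinations(subset, 2))


def homage(dist, K):
    """Return the K-subset from the scheme with the lowest score."""
    candidates = [removing_scheme(dist, K), adding_scheme(dist, K)]
    return min(candidates, key=lambda s: subset_score(dist, s))


# Toy example: six points on a line, three of them close together.
points = [0, 1, 2, 10, 20, 30]
dist = [[abs(a - b) for b in points] for a in points]
selected = homage(dist, 3)
```

On this toy input both schemes select the three clustered points (indices 0, 1 and 2). On less regular data the two schemes can disagree, which is why the subset with the lower score is kept.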
The Homage algorithm is related to the sequence homology reduction algorithm by Hobohm and Sander [12]. If we were instead interested in finding sequences that have as little in common as possible, the algorithm can be reversed to find the set of K most distant or lowest scoring sequences. However, in contrast to the algorithm of Hobohm and Sander, which selects an arbitrary number of sequences at a given similarity reduction threshold, the Homage approach selects a fixed number K of non-similar sequences at an arbitrary threshold. This can be very advantageous in cases where one deals with high time/memory complexity, as is the case with FOLDALIGN. In fact, this Homage approach has already been implemented in the SLASH server (http://www.bioinf.au.dk/slash) to reduce a submission of up to 200 sequences to a feasible number of sequences for FOLDALIGN. This particular reduction is based purely on the sequence similarity found from all pairwise sequence alignments.
2.4 Sequence Ranking and Mini-Greedy Algorithm
[Figure 3: table of all pairwise distances between six elements, with the steps taken by the adding and removing schemes.]
Figure 3: An illustration of how the adding and removing schemes of Homage work when selecting a subset of three elements from a set of six elements. The table shows all pairwise distances between elements in the set. In this example the removing scheme ends up with the best result.

The center of mass and Homage approaches are tested individually as well as combined. The user can decide on a greedy size G ≤ N and a Homage size K ≤ N. When selecting K sequences by Homage these are always ranked highest, and the remaining sequences are ranked according to either their distance to the center of mass of the K sequences, or their minimum distance to any of the K sequences in the ranking. After the ranking has been made we choose a greedy size, G. Usually K ≤ G, but in the cases where K is close to N, Homage is used to get rid of a number of sequences; then the center of mass approach is applied on the remaining sequences to provide a ranking. The sequences that were discarded in the first place can be added to the ranking just described. The mini-greedy approach is also in contrast to the approach in CLUSTAL [19], which first locates individual clusters and then merges them. However, such an approach seems less feasible for RNA structural alignments, as the structure assignment strongly depends on the sequences involved. Different clusters might have slightly different structural assignments, which would make it hard and sometimes impossible to merge two clusters, as the structures could be incompatible.
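The combined ranking described in this section can be sketched as follows, assuming the K core sequences have already been selected (for example by Homage) and that pairwise scores are available as a kernel matrix. The function names and the `mode` parameter are hypothetical, and clamping negative values under the square root is an added safeguard, not part of the described method.

```python
import math


def pair_dist(kernel, i, j):
    """Eq. (2): distance between sequences i and j from kernel values."""
    return math.sqrt(max(kernel[i][i] - 2.0 * kernel[i][j] + kernel[j][j], 0.0))


def dist_to_core_cm(kernel, i, core):
    """Distance from sequence i to the center of mass of the core set only."""
    m = len(core)
    k_i_cm = sum(kernel[i][j] for j in core) / m
    k_cm_cm = sum(kernel[a][b] for a in core for b in core) / (m * m)
    return math.sqrt(max(kernel[i][i] - 2.0 * k_i_cm + k_cm_cm, 0.0))


def rank_sequences(kernel, core, mode="cm"):
    """Core sequences first, the rest by distance to the core.

    mode="cm":  distance to the center of mass of the core sequences.
    mode="min": minimum distance to any of the core sequences.
    """
    rest = [i for i in range(len(kernel)) if i not in core]
    if mode == "cm":
        key = lambda i: dist_to_core_cm(kernel, i, core)
    else:
        key = lambda i: min(pair_dist(kernel, i, j) for j in core)
    return list(core) + sorted(rest, key=key)


def mini_greedy_input(ranking, G):
    """The G highest ranked sequences, on which the greedy algorithm is run."""
    return ranking[:G]


# Toy example with explicit vectors so that the kernel is an inner product.
vectors = [(0, 0), (1, 0), (0, 1), (5, 5), (3, 0), (20, 20)]
kernel = [[a[0] * b[0] + a[1] * b[1] for b in vectors] for a in vectors]
core = [0, 1, 2]
ranking_cm = rank_sequences(kernel, core, mode="cm")
ranking_min = rank_sequences(kernel, core, mode="min")
top = mini_greedy_input(ranking_cm, 4)
```

Both modes give the same order on this toy input. They can differ when the core set is elongated, since a sequence may be close to one core member yet far from the core's center of mass.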
3 Data
We consider the following SELEX data, discussed and analyzed in detail in [17]. The structural motif of RNA ligands binds to bacteriophage R17 coat protein, and has a characteristic tetra-loop and A-bulge in a stem-loop. The set has 36 sequences, which along with the structural motif are shown in Figure 4.
4 Results
Here, we compare the different ranking schemes to the full greedy alignment of the R17 data. First we constructed four random rankings, and considered the average result. Even though these runs did not capture any structure, they did align some of the sequences by the loop region in early rounds. For the pure center of mass (CM) ranking only the loop region was aligned correctly. We applied different values of K (4, 8, and 12) in the Homage approach. Not surprisingly, using K = 12 gave the best result, but the performance of the pure Homage approach is comparable to that of CM.
[Figure 4, left: drawing of the characteristic stem-loop structure with tetra-loop and bulged A. Right: a section of the sequence set, listed here.]
seq1   CAGAGAUAUCACUUCUGUUCACCAUCAGGGGA
seq2   AUAUAAGUAAUGGAUGCGCACCAUCAGGGCGU
seq3   GGAAUAAGUGCUUUCGUCGAUCACCAUCAGGG
seq4   UGGAGUAUAAACCUUUAUGGUCACCAUCAGGG
seq5   UCAGAGAUAGCUCAUAGGACACCAUCAGGG
seq6   CUGAGAUAUAUGACAGAGUCCACCAUCAGGG
seq7   GGAUUAAUAUGUCUGCAUGAUCACCAUCAGGG
seq8   GGGAGAUUCUUAGUACUCACCAUCAGGGGGCA
seq9   AAAUUAUCUUCGGAAUGCACCAUCAGGGCAUGG
seq10  GGGAGAUUCUUACUACUCACCAUCAGGGGGCA
etc.
Figure 4: The R17 data set contains a total of 36 sequences. Left is shown the characteristic structure, and right a section of the sequence set.

The best performance was obtained through a combination of CM and Homage (K = 4). Two approaches were tested, one (recalc) where we recalculated the center of mass each time a sequence entered the ranking, and one where the ranking, based on distances to the original CM, was fixed. The first approach corresponds to a moving CM in the vector space. A short summary is given in Table 2. As we see, the ranking for the CM+Homage approach is in good agreement with the top sequences that entered the best alignment in round 12 of the full greedy approach. Interestingly, if K is increased, the performance drops (not shown).

Table 2: The ranking of the R17 data by different schemes. The two combined CM and Homage schemes are in fine agreement with the ranking given from the full greedy at round 12. Underlined numbers indicate sequences among the G = 12 highest ranked sequences which also are found in the full greedy. For CM+Homage (K = 4), Homage with K = 4 gave the following sequences: 28, 21, 19, and 32.
Full-greedy (round 12):    32 19 28 17 18 27 9 2 14 21 24 36
CM+Homage (K = 4) recalc:  19 28 21 32 18 27 9 7 24 22 17 13 8 10 14 1 4 30 6 5 31 3 12 29 35 11 2 23 20 16 25 26 15 36 34 33
CM+Homage (K = 4):         21 19 28 32 18 27 31 7 24 9 1 22 14 8 17 30 6 2 13 5 15 4 36 10 20 11 25 26 3 29 35 34 23 16 12 33
CM:                        7 13 17 14 8 4 1 10 19 12 5 35 6 31 29 3 22 30 9 23 24 20 16 18 2 26 11 25 15 28 36 27 21 32 34 33
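The agreement reported in Table 2 can be quantified by counting how many of the G = 12 highest ranked sequences of each scheme also occur in the full greedy alignment at round 12. The short script below uses only the numbers listed in Table 2; the helper name `top_g_overlap` is hypothetical.

```python
# Sequence numbers copied from Table 2, highest rank first.
full_greedy_round_12 = [32, 19, 28, 17, 18, 27, 9, 2, 14, 21, 24, 36]
rankings = {
    "CM+Homage (K=4) recalc": [19, 28, 21, 32, 18, 27, 9, 7, 24, 22, 17, 13,
                               8, 10, 14, 1, 4, 30, 6, 5, 31, 3, 12, 29,
                               35, 11, 2, 23, 20, 16, 25, 26, 15, 36, 34, 33],
    "CM+Homage (K=4)": [21, 19, 28, 32, 18, 27, 31, 7, 24, 9, 1, 22,
                        14, 8, 17, 30, 6, 2, 13, 5, 15, 4, 36, 10,
                        20, 11, 25, 26, 3, 29, 35, 34, 23, 16, 12, 33],
    "CM": [7, 13, 17, 14, 8, 4, 1, 10, 19, 12, 5, 35,
           6, 31, 29, 3, 22, 30, 9, 23, 24, 20, 16, 18,
           2, 26, 11, 25, 15, 28, 36, 27, 21, 32, 34, 33],
}


def top_g_overlap(ranking, reference, g):
    """Sequences among the g highest ranked that also occur in `reference`."""
    return sorted(set(ranking[:g]) & set(reference))


G = 12
overlap_counts = {name: len(top_g_overlap(ranking, full_greedy_round_12, G))
                  for name, ranking in rankings.items()}
print(overlap_counts)
```

With these numbers the recalc scheme shares 9 of its top 12 sequences with the full greedy alignment, the fixed CM+Homage scheme shares 8, and the pure CM ranking shares 3.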
Using these rankings we performed the mini-greedy approach for G = 8 and G = 12. The conclusion was the same in both cases, but not surprisingly, the higher G, the better performance. The performance for G = 12 is shown in Figure 5, where the round normalized score growth [5] is shown for some of the tests we made. We see that the full greedy gives the best performance. The two CM+Homage (K = 4) versions have almost the same performance. However the recalc scheme has a higher score during much of the greedy approach. By inspection of the alignments obtained at round 12 we find that they have two strong misalignments each. Hence, the overall quality is the same, but the score at round 12 is 2522 for the recalc version and 2326 for the nonrecalc version. The pure CM approach did not find the structure, however, it does align the sequences by the conserved loop region. Likewise for the pure Homage approach. A few of the four random schemes capture the loop motif, but only in the beginning of the runs. Whereas the score level for the random ranking in general is much lower than the full greedy, the combined CM+Homage approaches obtain near full greedy quality up to round 17–19. The full greedy and CM+Homage (K = 4) recalc alignments are shown in Table 3.
[Figure 5: plot titled FOLDALIGN showing Score / Round (y-axis, 0 to 300) against Round (x-axis, 0 to 35) for six curves: Full greedy; CM + Homage (K=4) recalc; CM + Homage (K=4); CM; Homage (K=12; min); Random (average).]
Figure 5: Round normalized score growth, with greedy size G = 12. The random curve is an average of four runs. Round refers to the number of sequences in the alignment.

Table 3: The alignments of full greedy (first) and CM+Homage K = 4 recalc (second). The order in which the sequences entered the alignments is from the bottom to the top. (Note that the order of the sequences in the mini-greedy alignment differs from the ranking based on pairwise alignments in Table 2.) ‘S’ indicates the predicted structure, which is correct for the full greedy. A pair of matching parentheses indicates a base-pair. In the mini-greedy alignment two misalignments appear (seq 7, 13) along with a misassignment (seq 22) due to the lack of an A-G pairing rule in FOLDALIGN. The predicted structure is correct for the remaining sequences.

Full greedy
Score: 2889
36  AUUGUAGUCACGAGCACGG
24  AUGAGAUAGAUCAUGCUCA
21  UAUAGAGC-AUCAGCCUAU
14  AUAUGAUC-UUAUGGUAUG
2   UGCGCACC-AUCAGGGCGU
9   AAUGCACC-AUCAGGGCAU
27  AUGUUACC-AUCAGGAACA
18  UGCAGAGG-AUCACCCUGC
17  AUGUCACG-AUCACGGGCA
28  AAGAUAGC-AUCAGCAUCU
19  AUUAGAGG-AUCACCCUAG
32  AGUAGAGG-AACACCCUAC
S   .((((.((.....))))))

CM + Homage (K=4) recalc
Score: 2522
7   U-CUGCAUGAUCACCAUCAGG
13  UAUAGAUAGUUC---UACUGA
22  UAAGGACC-AUCA--GGCCUG
24  U-GAGAUAGAUCA--UGCUCA
19  U-UAGAGG-AUCA--CCCUAG
18  G-CAGAGG-AUCA--CCCUGC
17  U-GUCACG-AUCA--CGGGCA
32  G-UAGAGG-AACA--CCCUAC
21  A-UAGAGC-AUCA--GCCUAU
27  U-GUUACC-AUCA--GGAACA
9   A-UGCACC-AUCA--GGGCAU
28  A-GAUAGC-AUCA--GCAUCU
S   (.(((.((.......))))))

5 Final Remarks
We presented a mini-greedy algorithm, an approach where the greedy step in a structural RNA alignment algorithm is performed only on a high ranking portion of the sequences. We applied two ranking schemes, the center of mass (CM) and Homage (homology ranking) for estimating the order in which sequences enter a good greedy alignment. In both cases this ranking was based on all pairwise alignments. We found that when these schemes were combined near full greedy quality of the alignments could be obtained. We used a greedy size of 12 sequences in the mini-greedy approach, a number of sequences in a
structural multiple alignment that in general is sufficient to train an SCFG such as COVE [2]. COVE can then find the motif in the remaining sequences [5]. The use of only 12 rather than the full 36 sequences in the data set corresponds to a reduction in time requirements by a factor of roughly $3^4 = 81$, a significant speed-up. This agrees well with the reduction of run time from 14 hours using the full greedy approach to approximately 14 minutes using the mini-greedy approach on a Pentium III Xeon 500 MHz processor. The score difference between full greedy and mini-greedy was 13%, and inspection showed that near full greedy quality was obtained. We are currently testing the mini-greedy approach on UTR-like sequences containing a stem-loop motif, and from preliminary results we anticipate that the mini-greedy algorithm will provide a significant speed-up in the structural RNA motif search.
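The figures quoted in this section can be checked with a few lines of arithmetic. The sketch uses only numbers stated in the text (36 and 12 sequences, the fourth-power scaling implied by the factor 3^4, the two run times, and the round 12 scores from Table 3); the variable names are hypothetical.

```python
n_full, n_mini = 36, 12

# Expected reduction in time if the cost grows as the fourth power
# of the number of sequences, as in the factor 3^4 quoted in the text.
predicted_speedup = (n_full / n_mini) ** 4

# Observed reduction: 14 hours against approximately 14 minutes.
observed_speedup = (14 * 60) / 14

# Relative score difference at round 12 (scores from Table 3).
score_full, score_mini = 2889, 2522
relative_score_loss = (score_full - score_mini) / score_full

print(predicted_speedup, observed_speedup, round(relative_score_loss, 3))
```

The predicted factor is 81 and the observed factor is about 60, the same order of magnitude, while the relative score difference comes out at about 0.127, matching the 13% quoted above.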
References
[1] Brazma, A., Jonassen, I., Eidhammer, I., and Gilbert, D., Approaches to the automatic discovery of patterns in biosequences, J. Comput. Biol., 5:279–305, 1998.
[2] Eddy, S. and Durbin, R., RNA sequence analysis using covariance models, Nucl. Acids Res., 22:2079–2088, 1994 (http://genome.wustl.edu/eddy/#cove).
[3] Gorodkin, J., Heyer, L. J., and Stormo, G. D., Finding the most significant common sequence and structure motifs in a set of RNA sequences, Nucl. Acids Res., 25:3724–3732, 1997 (http://www.bioinf.au.dk/FOLDALIGN).
[4] Gorodkin, J., Heyer, L. J., and Stormo, G. D., Finding common sequence and structure motifs in a set of RNA sequences, ISMB ’97, AAAI/MIT Press, 5:120–123, 1997.
[5] Gorodkin, J., Stricklin, S. L., and Stormo, G. D., Discovering common stem-loop motifs in unaligned RNA sequences, Nucl. Acids Res., 29(10):2135–2144, 2001 (http://www.bioinf.au.dk/slash).
[6] Gusfield, D., Algorithms on Strings, Trees and Sequences: Computer Science and Computational Biology, Cambridge University Press, 1997.
[7] Gygi, S. P., Rochon, Y., Franza, B. R., and Aebersold, R., Correlation between protein and mRNA abundance in yeast, Mol. Cell Biol., 19:1720–1730, 1999.
[8] Haussler, D., Convolution Kernels on Discrete Structures, UCSC-CRL-99-10, 1999 (http://www.cse.ucsc.edu/~haussler/convolutions.ps).
[9] van Helden, J., André, B., and Collado-Vides, J., Extracting regulatory sites from the upstream region of yeast genes by computational analysis of oligonucleotide frequencies, J. Mol. Biol., 281:827–842, 1998.
[10] Hentze, M. W. and Kuhn, L. C., Molecular control of vertebrate iron metabolism: mRNA-based regulatory circuits operated by iron, nitric oxide, and oxidative stress, Proc. Natl. Acad. Sci. USA, 93:8175–8182, 1996.
[11] Hertz, G. Z. and Stormo, G. D., Identifying DNA and protein patterns with statistically significant alignments of multiple sequences, Bioinformatics, 15:563–577, 1999.
[12] Hobohm, U. and Sander, C., Enlarged representative set of protein structures, Prot. Sci., 3:522–524, 1994.
[13] Mathews, D., Sabina, J., Zuker, M., and Turner, D., Expanded sequence dependence of thermodynamic parameters improves prediction of RNA secondary structure, J. Mol. Biol., 288:911–940, 1999.
[14] Pace, N. R., Smith, D., Olsen, G. J., and James, B. D., Phylogenetic comparative analysis and the secondary structure of ribonuclease P RNA - a review, Gene, 82:65–75, 1989.
[15] Roth, F. P., Hughes, J. D., Estep, P. W., and Church, G. M., Finding DNA-regulatory motifs within unaligned noncoding sequences clustered by whole-genome mRNA quantitation, Nature Biotechnol., 16:939–945, 1998.
[16] Sakakibara, Y., Brown, M., Hughey, R., Mian, I. S., Sjölander, K., Underwood, R. C., and Haussler, D., Stochastic context-free grammars for tRNA modeling, Nucl. Acids Res., 22:5112–5120, 1994.
[17] Schneider, D., Tuerk, C., and Gold, L., Ligands to the bacteriophage R17 coat protein, J. Mol. Biol., 228:862–869, 1992.
[18] Stormo, G. D. and Hartzell, III, G. W., Identifying protein-binding sites from unaligned DNA fragments, Proc. Natl. Acad. Sci. USA, 86:1183–1187, 1989.
[19] Thompson, J. D., Higgins, D. G., and Gibson, T. J., CLUSTAL W: improving the sensitivity of progressive multiple sequence alignment through sequence weighting, positions-specific gap penalties and weight matrix choice, Nucl. Acids Res., 22:4673–4680, 1994.
[20] Watkins, C., Dynamic Alignment Kernels, CSD-TR-98-11, 1999 (http://www.cs.rhbnc.ac.uk/home/chrisw/dynk.ps.gz).
[21] Westhof, E., Auffinger, P., and Gaspin, C., DNA and RNA structure prediction, M. J. Bishop and C. J. Rawlings (eds.), DNA – Protein Sequence Analysis, Oxford UK: IRL/Oxford University Press, 255–278, 1996.
[22] Zuker, M., Mathews, D. H., and Turner, D. H., Algorithms and thermodynamics for RNA secondary structure prediction: a practical guide, J. Barciszewski & B. F. C. Clark (eds.), RNA Biochemistry and Biotechnology: NATO ASI Series, Kluwer Academic Publishers, 11–43, 1999 (http://bioinfo.math.rpi.edu/~mfold/rna/form1.cgi).