Comparison of Dynamic Programming and Evolutionary Algorithms for RNA Secondary Structure Prediction

Alain Deschênes, Student Member, IEEE, Kay C. Wiese, Member, IEEE, and Jagdeep Poonian

Abstract— This paper builds on previous research on an evolutionary algorithm (EA) used to predict the secondary structure of RNA molecules. The EA has the goal of predicting which canonical base pairs will form hydrogen bonds and helices. The addition of stacking energies, through INN and INN-HB, to our thermodynamic model has enhanced our predictions. In this paper, we test three RNA sequences of lengths 118, 543, and 784 nucleotides using a variety of previously successful operators and parameter settings. The accuracy of the predicted structures is compared against that of structures generated by the Nussinov dynamic programming algorithm (DPA) and against known structures. The EA showed high prediction accuracy, especially on short sequences. On all tested sequences, the EA outperforms the Nussinov DPA.

Index Terms— RNA, evolutionary algorithms, structure prediction, folding.

The authors are with Simon Fraser University Surrey, 2400 Central City, 10153 King George Highway, Surrey, B.C., Canada (604-268-7436; fax: 604-268-7488; e-mail: {aadesche, wiese, jpooniaa}@sfu.ca).

I. INTRODUCTION

The question of how macromolecules fold to attain specific shapes is fundamental to biology since the function of molecules is largely determined by structure [1]. Ribonucleic acid (RNA) is a molecule made up of a chain of nucleotides consisting of adenine (A), cytosine (C), guanine (G), and uracil (U). An RNA strand folds onto itself by forming hydrogen bonds between GC, AU, and GU pairs, and their respective mirror images. It is widely believed that RNA molecules are closely related to the molecules from which life originally evolved. Experimental methods for determining RNA structures, including X-ray crystallography and nuclear magnetic resonance (NMR), have been shown to be error prone, difficult, and time consuming. Alternative approaches are used to predict the secondary structure of RNA, including optimization techniques such as dynamic programming algorithms (DPAs) [2] or evolutionary algorithms (EAs) [3].

The evolutionary algorithm [4] for structure prediction of RNA first generates all candidate sets of stacked pairs, also called helices, where three or more adjacent pairs can form. A subset of all these helices is chosen to form a candidate structure. For a helix to be valid, there must be at least three nucleotides in the loop connecting the set of stacked pairs. These simple rules allow for the enumeration of all possible helices that can form in a structure. The challenge is predicting which ones will actually form in nature, making this problem highly combinatorial. The listing of the base pairs in a structure is what is called the secondary structure of an RNA sequence.

This paper builds on previous research [4]–[7] that showed that it is possible to use an EA to minimize the free energy associated with RNA secondary structures. It was shown that permutation encoding finds lower energy structures in fewer generations than binary encoding. With permutation encoding, Keep-Best Reproduction (KBR) [8] was found to outperform Standard Selection (STDS). When encoding candidate structures using permutation vectors, some crossover operators, namely OX2 [9], CX [10], and PMX [11], converged toward lower energy structures faster than other crossover operators. It was also shown that complex thermodynamic models that take stacking energies into account, such as the Individual Nearest-Neighbor (INN) model [12] and the Individual Nearest-Neighbor Hydrogen Bond (INN-HB) model [13], [14], consistently found structures that more closely resemble the natural fold than those generated using simple hydrogen bond models.

This paper will compare our EA with DPA optimization using the Nussinov DPA [2]. DPAs have previously been used to maximize base pair formation using the Nussinov method. Testing and scoring each possible structure is infeasible, since the number of possible structures grows exponentially with the length of the sequence. However, with DP optimization, the search space and computational time required to minimize or maximize certain variables are vastly reduced.

This paper has the following objectives:
• Measure the accuracy of our predicted structures by comparing them to known structures
• Compare the accuracy of our predicted structures with those predicted by the Nussinov DPA
• Strengthen previous claims that permutation based encoding performs better than binary encoding

The paper is structured as follows: Section II contains our method for testing.
Our results are discussed in Section III. Section IV contains our conclusions. Finally, we present ideas on future work in Section V.

II. METHOD

A. Nussinov algorithm

In this experiment, we attempt to find the structures with the maximum number of possible base pairs using the Nussinov DPA. This algorithm, considered the simplest and earliest form of the DPA, works by recursively calculating the optimal structure with maximal base pairs for small subsequences, and successively grows this structure to larger and larger substructures. The key idea is that there are only four possible ways to grow an optimal substructure: leave the first nucleotide unpaired, leave the last nucleotide unpaired, pair the first nucleotide with the last, or join two adjacent optimal substructures. Successively larger optimal structures are retained, and the final result is a structure with a maximal base pair configuration. The minimum hairpin loop size has been set to 3 for this experiment.

    Initialize random population of chromosomes;
    Evaluate the chromosomes in the population;
    while stopping criteria is not reached
        for half of the members of a population
            select 2 parent chromosomes;
            apply crossover operator (Pc);
            apply mutation operator (Pm);
            evaluate the new chromosomes;
            replacement strategy;
            elitism;
            insert them into next generation;
        end for
        update stopping criteria;
    end while

Fig. 1. Our algorithm is based on a standard generational EA. Our stopping criterion is the number of generations.

The Nussinov approach normally only tries to maximize the number of base pairs that a structure can form. We added a modified version of this algorithm that takes the free energy of the base pair contributions into account. The idea is that better predictions can be obtained by minimizing the free energy of a structure (S) evaluated with the following energy function.

$$E(S) = \sum_{(i,j) \in S} e(r_i, r_j) \qquad (1)$$

Here, e(r_i, r_j) denotes the free energy contribution ∆G from the formation of a base pair between the i-th and j-th nucleotides. Reasonable values for e at 37 °C are −3, −2, and −1 kcal/mol for GC, AU, and GU base pairs, respectively, since GC has three hydrogen bonds, AU has two hydrogen bonds, and GU has a much weaker bond [15]. This is implemented using the Nussinov algorithm by modifying the scoring to reflect a GC:AU:GU weighting of 3:2:1, versus the 1:1:1 weighting of plain base pair maximization. A third set, 3:2:2, is also implemented, modeling the ratio of hydrogen bonds in GC:AU:GU [14]. A drawback of these variations is that they do not take into account that helical stacks of base pairs have a stabilizing effect whereas loops have a destabilizing effect on the structure.
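As an illustration of the weighted scoring described above, the following Python sketch implements a Nussinov-style recursion with configurable GC:AU:GU weights and a minimum hairpin loop size of 3. It is a minimal sketch under stated assumptions, not the implementation used in the experiments: the function name `nussinov`, its parameters, and the traceback strategy are hypothetical choices made for this example. Maximizing the weighted score corresponds to minimizing E(S) in (1) when e is taken as the negative of the weight.

```python
def nussinov(seq, weights=None, min_loop=3):
    """Return (best_score, pairs) for an RNA sequence.

    weights maps "GC", "AU", "GU" to a positive score (assumed names);
    pairs is a sorted list of 0-based (i, j) index tuples.
    """
    if weights is None:
        weights = {"GC": 1, "AU": 1, "GU": 1}  # plain base pair maximization
    # Make the scoring symmetric so that mirror images (CG, UA, UG) also pair.
    score = {}
    for pair, w in weights.items():
        score[(pair[0], pair[1])] = w
        score[(pair[1], pair[0])] = w

    n = len(seq)
    if n == 0:
        return 0, []
    best = [[0] * n for _ in range(n)]

    # Fill the table for increasingly long subsequences seq[i..j].
    for span in range(min_loop + 1, n):
        for i in range(n - span):
            j = i + span
            value = max(best[i + 1][j], best[i][j - 1])  # i or j unpaired
            w = score.get((seq[i], seq[j]))
            if w is not None:
                value = max(value, best[i + 1][j - 1] + w)  # i pairs with j
            for k in range(i + 1, j):  # join two adjacent substructures
                value = max(value, best[i][k] + best[k + 1][j])
            best[i][j] = value

    # Trace back one optimal structure.
    pairs = []
    stack = [(0, n - 1)]
    while stack:
        i, j = stack.pop()
        if j - i <= min_loop:
            continue
        w = score.get((seq[i], seq[j]))
        if best[i][j] == best[i + 1][j]:
            stack.append((i + 1, j))
        elif best[i][j] == best[i][j - 1]:
            stack.append((i, j - 1))
        elif w is not None and best[i][j] == best[i + 1][j - 1] + w:
            pairs.append((i, j))
            stack.append((i + 1, j - 1))
        else:
            for k in range(i + 1, j):
                if best[i][j] == best[i][k] + best[k + 1][j]:
                    stack.append((i, k))
                    stack.append((k + 1, j))
                    break
    return best[0][n - 1], sorted(pairs)
```

With the default weights the function performs plain base pair maximization (1:1:1); passing `{"GC": 3, "AU": 2, "GU": 1}` or `{"GC": 3, "AU": 2, "GU": 2}` gives the two weighted variants. The triple loop makes this sketch cubic in the sequence length.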

B. Genetic algorithm

In this experiment, we use an EA (Figure 1) that uses either a permutation or a binary bit-string to encode RNA structures. The details are discussed in our previous work [5]. Our EA is controlled by three parameters: population size, mutation probability (Pm), and crossover probability (Pc). Mutation is applied regardless of whether crossover occurs. In all our runs, we chose a population of 700 chromosomes. Pc and Pm were set to 0.7 and 0.8, respectively. These values have been shown to yield good results in previous research [7]. Our EA prevents the formation of pseudoknots by disallowing them in the decoder.

KBR was used in all runs. The KBR operator first selects two parents via roulette wheel selection; then, after crossover and mutation, the best parent and the best child, chosen by rank, are passed on to the next generation. Using this operator has been shown to be beneficial in finding lower energy structures in fewer generations when compared to standard selection (STDS).

1) Binary encoding: RNA structures can be encoded as bit-strings. A bit-string has length |H|, where H is the set of all possible helices within our helix generation model, and each bit represents the presence (1) or absence (0) of a particular helix in a structure. Some helices are mutually exclusive, which can make a structure infeasible; thus, a repair mechanism is used to ensure that only valid structures exist in the population.

2) Permutation encoding: RNA structures can also be encoded using integer vectors. Each possible helix in the set H is assigned a unique integer, and a chromosome is a permutation of these integers. The decoder reads the permutation from left to right, adding subsequent helices only if they are not mutually exclusive with any previously added helices.

3) Stacking-energy models: The basis of stacking-energy models is that each base pair in a helix contributes to the stability of that helix by an amount dependent on its neighboring bases. For example, the free energy of a GC base pair differs depending on whether it is next to an AU base pair or next to its mirror, a UA base pair. This contrasts with the previous models we used, where a base pair contributes directly to a helix's free energy regardless of its surroundings or orientation. In this experiment, our EA evaluates helices using the INN [12] or INN-HB [13] models. The energy calculation involves two distinct steps: the first models hydrogen bonding of the first base pair (initiation) and the second involves the addition of base pairs to grow the helix (propagation). INN and INN-HB differ in the way that they evaluate terminal pairs. In INN-HB, a penalty is added for AU or GU terminal pairs. INN-HB also adds parameters for nearest neighbors containing GU pairs. It is important to note that these models make no effort to account for higher-order structures, such as loops, junctions, bulges, and pseudoknots.

4) Runtime: The runtime of the evolutionary algorithm is larger than that of the Nussinov dynamic programming algorithm and depends on the crossover operator and the length of the sequence.

C. Sequences tested

Tables I, II, and III show the structures used in this experiment. These are a 784 nucleotide Drosophila virilis sequence, a 543 nucleotide Hildenbrandia rubra sequence, and a 118 nucleotide Saccharomyces cerevisiae sequence. These sequences were chosen from the Comparative RNA Web Site [16], where the known structures are determined through comparative methods. The three structures were tested using the same parameter set, shown in Table IV. Nine crossover operators were tested, three of which used binary encoding. Two runs were added where Symmetric Edge Recombination Crossover (SYMERC) [9] and Asymmetric Edge Recombination Crossover (ASERC) [17] used Pc = 0.2, and two more runs were added where no crossover occurred, using binary encoding and permutation encoding.

TABLE I
Drosophila virilis DETAILS

Filename                       d.16.m.D.virilis.bpseq
Organism                       Drosophila virilis
Accession Number               X05914
Length                         784 nucleotides
# of BPs in known structure    233

TABLE II
Hildenbrandia rubra DETAILS

Filename                       b.I1.e.H.rubra.1.C1.SSU.1506.bpseq
Organism                       Hildenbrandia rubra
Accession Number               L19345
Length                         543 nucleotides
# of BPs in known structure    138

TABLE III
Saccharomyces cerevisiae DETAILS

Filename                       d.5.e.S.cerevisiae.bpseq
Organism                       Saccharomyces cerevisiae
Accession Number               X67579
Length                         118 nucleotides
# of BPs in known structure    37

TABLE IV
GENETIC ALGORITHM PARAMETERS

Pop. Size               700
Generations             700
Crossover Operators     CX, OX, OX2, PMX, ASERC, SYMERC, 1-Point, 2-Point, Uniform, No crossover (Binary and Permutation encoding)
Pc                      0.7 (all), 0.2 (ASERC and SYMERC)
Pm                      0.8
Replacement             KBR
Elitism                 1
Thermodynamic Models    INN, INN-HB
Random seeds            30
Allow pseudoknots       No

III. RESULTS

A. Convergence behaviour of the EA

Figure 2 shows a typical average run for the Hildenbrandia rubra sequence of 543 nucleotides. The lighter outer envelope of the plot shows the extremes of each generation, that is, the individuals with the maximum and minimum free energy. The darker inner envelope of the plot shows the mean free energy of the population with the standard deviation. The graph shows an average run over 700 generations. For the first 50 generations, we see a rapid convergence where the extremes and the standard deviation shrink towards the mean. After this point, the curve flattens, but there continue to be improvements toward lower free energy structures until about generation 300. Beyond generation 300, each generation's population still decreases in free energy, but at a much slower rate. The remaining improvements can be mainly attributed to the high mutation rate coupled with the high selection pressure provided by elitism and KBR.

[Figure 2: free energy (kcal/mol), axis range −225 to −50, plotted against the number of generations, 0 to 700.]

Fig. 2. Hildenbrandia rubra, Pm = 0.8, Pc = 0.7, population size = 700, CX, KBR, 1-elitism, average of 30 random seeds using INN-HB as the thermodynamic model. This run predicted 156.23 base pairs, on average, where 28.5% of the known structure was correctly predicted.
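The two envelopes described for Figure 2 can be derived from the free energies of one generation's population. The sketch below is an illustrative assumption rather than the plotting code behind the figure: the function name `generation_summary` is hypothetical, and the use of the population standard deviation (rather than the sample standard deviation) is a guess.

```python
from statistics import mean, pstdev


def generation_summary(energies):
    """Summarize the free energies (kcal/mol) of one generation.

    Returns the outer envelope (min, max) and the inner envelope
    (mean minus/plus one population standard deviation).
    """
    avg = mean(energies)
    sd = pstdev(energies)  # assumption: population standard deviation
    return {
        "min": min(energies),
        "max": max(energies),
        "mean": avg,
        "inner_low": avg - sd,
        "inner_high": avg + sd,
    }
```

Calling the function once per generation and plotting the five series against the generation number would give a plot with the same layout as Figure 2.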

B. Drosophila virilis – 784 nucleotides

1) Nussinov results: Table V shows the results when using the Nussinov DPA. The results show that the maximum number of possible base pairs with this sequence within a single structure is 320. The DPA is able to correctly predict 12.4% of the known base pairs. With the weights changed to be proportional to the number of hydrogen bonds in each base pair, 3:2:2 for GC, AU, and GU, respectively, the algorithm predicts a structure with 319 base pairs, which contains 9.9% of the real structure. Lastly, changing the weights to 3:2:1 to approximate the stability of the GC, AU, and GU base pairs, respectively, predicted a structure with 309 base pairs where 9.0% of the real structure was correctly predicted.

TABLE V
Drosophila virilis NUSSINOV RESULTS. NUMBER OF KNOWN BASE PAIRS IS 233.

GC:AU:GU Weights    Predicted BP    Correctly Predicted BP    Correctly Predicted (%)
1:1:1               320             29                        12.4
3:2:2               319             23                        9.9
3:2:1               309             21                        9.0

2) Average lowest free energy predictions: Table VI shows the results from the EA with the Drosophila virilis sequence. The first column shows the free energy of the average lowest energy structure from the given parameter set. In all runs, it is an average of the lowest energy structure for 30 random seeds after 700 generations. The second column shows the average number of base pairs predicted in the lowest energy structure after 700 generations. The third column shows the percentage of known base pairs that were correctly predicted. The fourth column shows which crossover operator was used, while the last column shows the chosen thermodynamic model. Each row represents a thermodynamic model and crossover operator combination. Since the two thermodynamic models are incompatible for direct comparison, the table rows have been grouped by the chosen model. Within each thermodynamic model, the runs are sorted by average free energy of the lowest free energy structure after 700 generations, that is, the first column. The rows are then sorted by the average number of correctly predicted base pairs. The bolded row(s) show which thermodynamic model and crossover operator combination fares best with respect to the highest number of known base pairs correctly predicted.

Table VI shows that with INN-HB, the crossover operator able to predict the lowest free energy structures on average was OX2. This run predicted an average structure containing 234.6 base pairs where 12.0% of the base pairs were correctly predicted. This result is very close to the best result found with the Nussinov algorithm, which correctly predicted 12.4% of the known structure. However, the Nussinov DPA is prone to overprediction of base pairs, i.e., it predicts many false positive base pairs that are not found in the natural fold. This EA run was found to be the run with the highest number of correctly predicted base pairs with INN-HB.

With INN, the run finding the lowest free energy structure also used OX2. This run was able to predict structures with 232.5 base pairs and correctly predicted 14.3% of the base pairs in the known structure on average. This result improves on the best result found with INN-HB. This result also predicts more base pairs correctly than any Nussinov result. Better yet, a run using the CX crossover operator was able to predict 14.4% of the known base pairs correctly from its average structure containing 233.7 base pairs, but with a slightly higher free energy.

TABLE VI
AVERAGE RESULTS OF COMPARISON WITH KNOWN Drosophila virilis STRUCTURE GROUPED BY THERMODYNAMIC MODEL

∆G (kcal/mol)    Pred. BPs    Correct BPs (%)    Crossover       Model
-165.94          234.6        12.0               OX2             INN-HB
-160.43          230.1        9.5                CX              INN-HB
-157.40          231.2        9.4                PMX             INN-HB
-153.60          226.9        11.0               2-Point         INN-HB
-152.29          223.6        10.2               1-Point         INN-HB
-151.83          226.6        10.3               NA (Perm)       INN-HB
-149.41          223.7        10.8               SYMERC (0.2)    INN-HB
-149.28          224.7        9.4                ASERC (0.2)     INN-HB
-143.43          221.2        8.3                NA (Bin)        INN-HB
-142.74          219.8        8.1                Uniform         INN-HB
-141.42          224.3        8.2                OX              INN-HB
-141.37          223.4        8.1                SYMERC          INN-HB
-139.70          221.7        7.4                ASERC           INN-HB
-145.47          232.5        14.3               OX2             INN
-144.80          233.7        14.4               CX              INN
-139.95          228.9        12.2               2-Point         INN
-139.90          230.4        12.4               PMX             INN
-138.32          226.4        13.6               1-Point         INN
-132.34          226.1        10.5               SYMERC (0.2)    INN
-131.44          223.0        11.4               Uniform         INN
-131.40          226.9        11.2               NA (Perm)       INN
-131.34          225.3        11.8               ASERC (0.2)     INN
-130.57          223.4        10.9               NA (Bin)        INN
-129.40          221.7        8.9                SYMERC          INN
-127.38          224.1        9.2                OX              INN
-122.99          224.3        8.1                ASERC           INN

3) Individual lowest free energy predictions: Table VII shows the single lowest free energy found with each crossover operator/thermodynamic model combination. The first column shows the lowest free energy structure found within the given parameter set. The second column shows the number of times a structure of similar energy was found within the 30 seeds. The third column shows the generation number at which this structure was found. If the structure is found more than once, then the value is reported as an average of all runs that found this structure. The fourth column shows the number of predicted base pairs in the structure, while the fifth column shows how many base pairs were correctly predicted. The final two columns show the crossover operator and thermodynamic model, respectively.

This table shows that the OX2 crossover operator was able to find the single lowest energy structure with INN-HB. This structure had a free energy of −188.80 kcal/mol and was found by a single random seed after 224 generations. The structure contained 242 base pairs, correctly predicting 7.3% of the known structure. A structure with higher free energy turned out to be the most accurate structure found with INN-HB. This structure was found after 699 generations from a single run using the SYMERC crossover operator with Pc = 0.2. The structure had a free energy 12.2% higher than the lowest energy structure found, and contained 221 base pairs overlapping 23.6% of the base pairs of the known structure. This structure also predicts many more base pairs correctly than any Nussinov result, with fewer false positives.

A similar result was found with INN, where a structure of higher free energy was found to predict more correct base pairs. In this case, the single lowest free energy structure was found with the CX crossover operator at a free energy of −165.9 kcal/mol after 639 generations. This run predicted a structure containing 244 base pairs. This structure correctly predicted 14.6% of the known structure. This prediction was also more accurate than any Nussinov result. However, a structure that had a 9.8% higher free energy and was found with 1-Point crossover had an even higher number of correctly predicted base pairs, at 26.2% of the natural fold. This structure was found after 175 generations and contained 240 base pairs.

After perusing all the lowest energy structures found for all the random seeds, it was found that the overall highest percentage of base pairs correctly predicted by a single run was 26.6%, after 643 generations. This structure, containing 230 base pairs, was predicted by a single run making use of the CX crossover operator with the INN thermodynamic model. This predicted structure was found to be 13.0% higher in energy than the overall single lowest energy structure found with the INN model using the CX crossover operator.

TABLE VII
BEST RESULTS OF COMPARISON WITH KNOWN Drosophila virilis STRUCTURE GROUPED BY THERMODYNAMIC MODEL

∆G (kcal/mol)    Freq.    Gens    Pred. BPs    Corr. BPs (%)    Crossover       Model
-188.80          1        224     242          7.3              OX2             INN-HB
-181.34          1        160     248          12.0             2-Point         INN-HB
-177.90          1        501     233          13.3             NA (Perm)       INN-HB
-175.36          1        555     226          8.2              CX              INN-HB
-174.15          1        125     236          9.0              1-Point         INN-HB
-171.33          1        417     235          6.9              PMX             INN-HB
-168.45          1        172     224          16.7             Uniform         INN-HB
-165.68          1        699     221          23.6             SYMERC (0.2)    INN-HB
-166.47          1        681     229          3.0              SYMERC          INN-HB
-163.15          1        574     230          9.0              ASERC (0.2)     INN-HB
-161.63          1        580     221          9.4              OX              INN-HB
-159.74          1        274     232          9.4              NA (Bin)        INN-HB
-155.78          1        639     225          4.7              ASERC           INN-HB
-165.9           1        639     244          14.6             CX              INN
-159.8           1        590     243          11.2             OX2             INN
-154.9           1        635     225          21.5             PMX             INN
-154.5           1        175     239          11.6             2-Point         INN
-151.6           1        697     237          17.2             ASERC           INN
-150.7           1        693     229          9.4              NA (Perm)       INN
-149.7           1        175     240          26.2             1-Point         INN
-148.9           1        186     245          19.7             Uniform         INN
-148.3           1        316     228          13.7             NA (Bin)        INN
-144.2           1        674     229          4.7              ASERC (0.2)     INN
-142.8           1        485     238          9.0              SYMERC (0.2)    INN
-140.1           1        658     234          6.4              OX              INN
-138.4           1        660     227          12.0             SYMERC          INN
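The accuracy figures reported in this section compare a predicted set of base pairs against the known structure. A minimal Python sketch of such a comparison is given below, assuming each structure is available as a collection of (i, j) index pairs; the function name `compare_structures` and the returned field names are hypothetical and are not taken from the software used in the experiments.

```python
def compare_structures(predicted, known):
    """Compare predicted base pairs with the known structure.

    Both arguments are iterables of (i, j) tuples. The percentage is
    taken relative to the number of known base pairs.
    """
    predicted = set(predicted)
    known = set(known)
    correct = predicted & known
    return {
        "predicted_bp": len(predicted),
        "correct_bp": len(correct),
        "correct_pct": 100.0 * len(correct) / len(known) if known else 0.0,
        # Predicted pairs that do not occur in the known structure.
        "false_positives": len(predicted - known),
    }
```

As a check against Table V, base pair maximization predicts 320 base pairs of which 29 are among the 233 known ones, which gives 100 × 29 / 233 ≈ 12.4% and leaves 320 − 29 = 291 false positives.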

C. Hildenbrandia rubra sequence – 543 nucleotides

1) Nussinov results: The Nussinov results shown in Table VIII give us the upper bound on the number of base pairs possible in this sequence, at 213 base pairs. When maximizing the number of possible base pairs alone, the generated structure contains 213 base pairs, which include 5.0% of those found in the real structure. Changing the weights to be proportional to the number of hydrogen bonds in each base pair predicts a structure with 211 base pairs where 22.5% of the base pairs in the real structure are correctly predicted. Changing the weight for GU pairs to 1, because of their weaker stability, reduces the number of predicted base pairs to 205 but still correctly predicts 22.5% of the known structure.

TABLE VIII
Hildenbrandia rubra NUSSINOV RESULTS. NUMBER OF KNOWN BASE PAIRS IS 138.

GC:AU:GU Weights    Predicted BP    Correctly Predicted BP    Correctly Predicted (%)
1:1:1               213             7                         5.0
3:2:2               211             31                        22.5
3:2:1               205             31                        22.5

2) Average lowest free energy predictions: Table IX shows that the CX crossover operator was able to predict structures closest to the known structure on average using the INN-HB thermodynamic model. With the INN thermodynamic model, the lowest energy structures were found using the OX2 crossover operator. These lowest energy structures are also found to be the ones predicting the highest number of base pairs correctly within their respective thermodynamic models.

The average highest number of correctly predicted base pairs was found using the CX crossover operator with INN-HB. This run predicted, on average, 156.2 base pairs containing 28.5% of the base pairs found in the known structure. This structure contains more correct base pairs than any structure found with the Nussinov algorithm. Using the INN model and the OX2 crossover operator, the EA was able to correctly predict 22.0% of the known base pairs from 155.3 predicted base pairs. This result outperforms base pair maximization but is edged out when the weights are changed in the Nussinov algorithm. The drawback with the Nussinov DPA is again that it predicts structures with a large number of false positive base pairs. It is important to note that three crossover operators using INN-HB were able to outperform all INN runs on average. These were PMX, OX2, and CX, which correctly predicted 22.5%, 24.3%, and 28.5% of the base pairs of the known structure on average, respectively.

TABLE IX
AVERAGE RESULTS OF COMPARISON WITH KNOWN Hildenbrandia rubra STRUCTURE GROUPED BY THERMODYNAMIC MODEL

∆G (kcal/mol)    Pred. BPs    Correct BPs (%)    Crossover       Model
-201.84          156.2        28.5               CX              INN-HB
-198.48          153.9        24.3               OX2             INN-HB
-198.07          154.5        22.5               PMX             INN-HB
-187.94          151.2        22.2               SYMERC (0.2)    INN-HB
-187.87          151.9        19.9               NA (Perm)       INN-HB
-186.47          149.9        18.0               ASERC (0.2)     INN-HB
-185.54          150.3        21.3               OX              INN-HB
-182.23          146.9        19.9               1-Point         INN-HB
-182.06          147.9        21.2               2-Point         INN-HB
-181.32          149.3        17.8               SYMERC          INN-HB
-180.48          147.8        18.1               ASERC           INN-HB
-172.23          144.8        16.6               Uniform         INN-HB
-171.11          143.6        14.2               NA (Bin)        INN-HB
-183.46          155.3        22.0               OX2             INN
-182.83          154.9        21.8               PMX             INN
-181.73          155.2        21.3               CX              INN
-171.46          150.5        16.9               NA (Perm)       INN
-170.19          149.6        18.0               ASERC (0.2)     INN
-169.05          149.5        15.1               SYMERC (0.2)    INN
-168.84          150.2        15.8               OX              INN
-167.42          149.9        17.5               SYMERC          INN
-167.13          148.5        17.6               2-Point         INN
-165.93          149.8        15.0               ASERC           INN
-164.40          148.1        15.7               1-Point         INN
-155.78          144.3        12.1               NA (Bin)        INN
-155.32          143.2        11.9               Uniform         INN

3) Individual lowest free energy predictions: Table X shows the single lowest free energy structure found after running the EA for a maximum of 700 generations. The overall lowest energy structure with INN-HB was found using the CX crossover operator after 684 generations with a single random seed. This structure had a free energy of −216.8 kcal/mol and contained 161 base pairs, with 37.7% of the known base pairs correctly predicted. However, the lowest energy structure found with the binary 2-Point crossover operator was the single structure with the highest number of correctly predicted base pairs. This structure had a free energy 2.0% higher than the overall lowest energy structure but was able to predict 50.0% of the known base pairs correctly from the 161 base pairs it predicted. Coincidentally, this structure was also the one to predict the highest number of correct base pairs overall. These structures, along with the single lowest energy structures predicted with OX2, PMX, 1-Point, ASERC (Pc = 0.2), permutation encoding with mutation alone, and ASERC (Pc = 0.7), also predict more correct base pairs from the known structure than any Nussinov structure.

With the INN thermodynamic model, the lowest single structure was found using the PMX operator after 678 generations. The structure contained 165 base pairs, evaluated to a free energy of −201.6 kcal/mol, and correctly predicted 30.4% of the known structure's base pairs. With the INN thermodynamic model, the CX operator's lowest energy structure contained 162 base pairs and had a free energy 0.3% higher, but predicted 44.9% of the known base pairs correctly after running for 416 generations. These structures, along with the lowest free energy structures found with OX2, ASERC (Pc = 0.2), permutation encoding with mutation alone, 2-Point, SYMERC (Pc = 0.2), SYMERC, 1-Point, and Uniform, predict more correct base pairs from the known structure than any Nussinov structure.

TABLE X
BEST RESULTS OF COMPARISON WITH KNOWN Hildenbrandia rubra STRUCTURE GROUPED BY THERMODYNAMIC MODEL

∆G (kcal/mol)    Freq.    Gens    Pred. BPs    Corr. BPs (%)    Crossover       Model
-216.76          1        684     161          37.7             CX              INN-HB
-215.37          1        417     157          37.0             OX2             INN-HB
-212.35          1        124     161          50.0             2-Point         INN-HB
-210.71          1        690     161          30.4             PMX             INN-HB
-209.61          1        93      151          35.5             1-Point         INN-HB
-206.80          1        626     152          23.9             ASERC (0.2)     INN-HB
-202.93          1        681     160          22.5             SYMERC (0.2)    INN-HB
-201.13          1        382     160          23.9             NA (Perm)       INN-HB
-200.17          1        224     147          10.8             NA (Bin)        INN-HB
-197.95          1        367     149          29.7             ASERC           INN-HB
-195.98          1        112     152          21.7             Uniform         INN-HB
-195.77          1        439     150          14.5             OX              INN-HB
-194.46          1        556     146          18.1             SYMERC          INN-HB
-201.6           1        678     165          30.4             PMX             INN
-200.9           1        416     162          44.9             CX              INN
-194.2           1        637     157          35.5             OX2             INN
-190.2           1        567     161          30.4             ASERC (0.2)     INN
-188.4           1        633     162          36.2             NA (Perm)       INN
-188.4           1        674     150          22.5             OX              INN
-187.3           1        87      164          26.8             2-Point         INN
-184.8           1        684     152          39.1             SYMERC (0.2)    INN
-179.0           1        685     162          18.8             ASERC           INN
-178.5           1        647     153          40.6             SYMERC          INN
-175.6           1        67      142          37.0             1-Point         INN
-175.6           1        229     148          5.8              NA (Bin)        INN
-171.5           1        104     141          23.2             Uniform         INN
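The statements that one structure has a free energy a given percentage "higher" than another can be reproduced with a simple relative difference. The formula below is an assumed reading of how these percentages were obtained, with E_min denoting the lowest free energy found within the same thermodynamic model; applied to the Table X values, it matches the 2.0% and 0.3% figures quoted above.

```latex
\[
\Delta_{\%}(E) = 100 \times \frac{E - E_{\min}}{\lvert E_{\min} \rvert}
\]
\[
\text{INN-HB, 2-Point vs. CX:}\quad
100 \times \frac{-212.35 - (-216.76)}{216.76}
= 100 \times \frac{4.41}{216.76} \approx 2.0
\]
\[
\text{INN, CX vs. PMX:}\quad
100 \times \frac{-200.9 - (-201.6)}{201.6}
= 100 \times \frac{0.7}{201.6} \approx 0.3
\]
```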

TABLE XI Saccharomyces cerevisiae N USSINOV RESULTS . N UMBER OF KNOWN BASE PAIRS IS

37.

GC:AU:GU Predicted Weights BP

Correctly Predicted BP

Correctly Predicted (%)

1:1:1 3:2:2 3:2:1

28 28 9

75.7 75.7 24.3

45 45 44

D. Saccharomyces cerevisiae sequence – 118 nucleotides

1) Nussinov results: Table XI shows the results obtained with the Nussinov DPA. The maximum number of base pairs possible for this sequence is 45. Base pair maximization alone predicts 75.7% of the known base pairs correctly. With weights proportional to the number of hydrogen bonds in each base pair, 3:2:2 for GC, AU, and GU, respectively, the algorithm predicts a different structure, also containing 45 base pairs, which again coincides with 75.7% of the base pairs in the real structure. Lastly, changing the weights to 3:2:1 to approximate the stability of GC, AU, and GU base pairs, respectively, produced a structure with 44 base pairs that predicts only 24.3% of the known base pairs correctly.

2) Average lowest free energy predictions: Table XII shows that the highest number of correctly predicted base pairs was obtained with the INN-HB thermodynamic model. All crossover operators performed equally well, finding the same structure for all 30 seeds. These runs predicted a structure with a free energy of −57.52 kcal/mol containing 39 base pairs, which correctly predicts 89.2% of the known structure. Using no crossover with the permutation encoding yielded the same result, but the binary representation predicted an average structure with a free energy 1.3% higher. This latter structure contained 38.6 base pairs on average and coincided with 87.6% of the real structure. All runs with INN-HB outperformed all structures determined with the Nussinov algorithm.

With the INN thermodynamic model, all crossover operators found the same structure with all their seeds. This structure had a free energy of −52.90 kcal/mol, contained 40 base pairs, and included 75.7% of the known base pairs. Without crossover, both the permutation and the binary representations performed worse, with the former outperforming the latter by a small margin. All runs, except those without crossover, predicted as many correct base pairs as the best Nussinov result, but with fewer false predictions. Finally, although the two last runs predicted fewer correct base pairs than both base pair maximization without constraint and base pair maximization with weights 3:2:2 for GC, AU, and GU, they predicted fewer false positive base pairs.

TABLE XII
AVERAGE RESULTS OF COMPARISON WITH KNOWN Saccharomyces cerevisiae STRUCTURE GROUPED BY THERMODYNAMIC MODEL

∆G (kcal/mol)   Pred. BPs   Correct BPs (%)   Crossover       Model
-57.52          39          89.2              CX              INN-HB
-57.52          39          89.2              OX2             INN-HB
-57.52          39          89.2              PMX             INN-HB
-57.52          39          89.2              SYMERC (0.2)    INN-HB
-57.52          39          89.2              NA (Perm)       INN-HB
-57.52          39          89.2              ASERC (0.2)     INN-HB
-57.52          39          89.2              OX              INN-HB
-57.52          39          89.2              1-Point         INN-HB
-57.52          39          89.2              2-Point         INN-HB
-57.52          39          89.2              SYMERC          INN-HB
-57.52          39          89.2              ASERC           INN-HB
-57.52          39          89.2              Uniform         INN-HB
-56.80          38.6        87.6              NA (Bin)        INN-HB
-52.90          40          75.7              OX2             INN
-52.90          40          75.7              PMX             INN
-52.90          40          75.7              CX              INN
-52.90          40          75.7              ASERC (0.2)     INN
-52.90          40          75.7              SYMERC (0.2)    INN
-52.90          40          75.7              OX              INN
-52.90          40          75.7              SYMERC          INN
-52.90          40          75.7              2-Point         INN
-52.90          40          75.7              ASERC           INN
-52.90          40          75.7              1-Point         INN
-52.90          40          75.7              Uniform         INN
-52.69          40          74.9              NA (Perm)       INN
-50.67          39.2        71.5              NA (Bin)        INN
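The weighted base pair maximization behind the Nussinov results above can be sketched as follows. This is a minimal illustration and not the implementation used for Table XI: the function name `nussinov`, the weight dictionaries, and the `min_loop` parameter (minimum number of unpaired nucleotides enclosed by a pair, defaulting to no restriction) are assumptions introduced here. Only the three weightings 1:1:1, 3:2:2, and 3:2:1 for GC, AU, and GU are taken from the text.

```python
# Hypothetical weight tables for GC:AU:GU, matching the three settings in the text.
WEIGHTS_111 = {"GC": 1, "AU": 1, "GU": 1}
WEIGHTS_322 = {"GC": 3, "AU": 2, "GU": 2}
WEIGHTS_321 = {"GC": 3, "AU": 2, "GU": 1}


def pair_weight(a, b, weights):
    """Weight of pairing nucleotides a and b (0 if they cannot pair)."""
    return weights.get(a + b, weights.get(b + a, 0))


def nussinov(seq, weights, min_loop=0):
    """Return (best total weight, sorted list of base pairs) for seq.

    min_loop is an assumed parameter: the minimum number of unpaired
    nucleotides that a base pair must enclose.
    """
    n = len(seq)
    if n == 0:
        return 0, []
    score = [[0] * n for _ in range(n)]
    for span in range(1, n):
        for i in range(n - span):
            j = i + span
            best = max(score[i + 1][j], score[i][j - 1])  # i or j unpaired
            w = pair_weight(seq[i], seq[j], weights)
            if w and j - i - 1 >= min_loop:
                best = max(best, score[i + 1][j - 1] + w)  # i pairs with j
            for k in range(i + 1, j):  # split into two independent parts
                best = max(best, score[i][k] + score[k + 1][j])
            score[i][j] = best

    # Trace back one optimal structure.
    pairs, stack = [], [(0, n - 1)]
    while stack:
        i, j = stack.pop()
        if i >= j:
            continue
        w = pair_weight(seq[i], seq[j], weights)
        if score[i][j] == score[i + 1][j]:
            stack.append((i + 1, j))
        elif score[i][j] == score[i][j - 1]:
            stack.append((i, j - 1))
        elif w and j - i - 1 >= min_loop and score[i][j] == score[i + 1][j - 1] + w:
            pairs.append((i, j))
            stack.append((i + 1, j - 1))
        else:
            for k in range(i + 1, j):
                if score[i][j] == score[i][k] + score[k + 1][j]:
                    stack.append((i, k))
                    stack.append((k + 1, j))
                    break
    return score[0][n - 1], sorted(pairs)
```

Calling `nussinov(seq, WEIGHTS_111)` corresponds to plain base pair maximization, while `WEIGHTS_322` and `WEIGHTS_321` change only the per-pair reward. The score table is filled in O(n³) time for a sequence of length n.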

3) Individual lowest free energy predictions: The individual INN-HB runs in Table XIII show that all runs, except for two binary runs without crossover, found the optimal structure in fewer than 700 generations. In fact, the most aggressive operator, OX2, found the optimal structure within 4.7 generations on average.

With the INN model, all crossover operators found the same lowest free energy structure for all seeds. However, one random seed with the permutation representation using mutation only found a sub-optimal structure, and the binary runs with no crossover operator found 20 optimal and 10 sub-optimal structures.

Interestingly, permutation encoding without crossover finds the lowest energy structure faster than OX and all ERC runs when using INN-HB. Similarly, with INN, permutation encoding alone outperforms SYMERC (Pc = 0.2, 0.7) and ASERC (Pc = 0.7).

By simply counting the runs that found the highest number of correctly predicted base pairs, 89.2%, we found that 388 out of 390 runs were successful when using INN-HB. With INN, four 1-Point runs also found a structure that reached 89.2% with its 33 base pairs. This structure is sub-optimal within the INN model, with a free energy 4.0% larger than that of the optimal structure.

TABLE XIII
BEST RESULTS OF COMPARISON WITH KNOWN Saccharomyces cerevisiae STRUCTURE GROUPED BY THERMODYNAMIC MODEL

∆G (kcal/mol)   Freq.   Gens    Pred. BPs   Corr. BPs (%)   Crossover       Model
-57.52          30      4.7     39          89.2            OX2             INN-HB
-57.52          30      5.0     39          89.2            CX              INN-HB
-57.52          30      6.5     39          89.2            2-Point         INN-HB
-57.52          30      6.2     39          89.2            1-Point         INN-HB
-57.52          30      8.4     39          89.2            PMX             INN-HB
-57.52          30      8.6     39          89.2            Uniform         INN-HB
-57.52          30      13.7    39          89.2            NA (Perm)       INN-HB
-57.52          30      15.6    39          89.2            OX              INN-HB
-57.52          30      17.7    39          89.2            ASERC (0.2)     INN-HB
-57.52          30      19.6    39          89.2            SYMERC (0.2)    INN-HB
-57.52          30      19.7    39          89.2            ASERC           INN-HB
-57.52          30      24.8    39          89.2            SYMERC          INN-HB
-57.52          28      118.6   39          89.2            NA (Bin)        INN-HB
-52.9           30      5.7     40          75.7            OX2             INN
-52.9           30      6.3     40          75.7            1-Point         INN
-52.9           30      6.9     40          75.7            2-Point         INN
-52.9           30      9.3     40          75.7            Uniform         INN
-52.9           30      10.2    40          75.7            CX              INN
-52.9           30      15.4    40          75.7            PMX             INN
-52.9           30      19.1    40          75.7            OX              INN
-52.9           30      23.7    40          75.7            ASERC (0.2)     INN
-52.9           29      27.4    40          75.7            NA (Perm)       INN
-52.9           30      28.9    40          75.7            SYMERC (0.2)    INN
-52.9           30      33.2    40          75.7            SYMERC          INN
-52.9           30      35.3    40          75.7            ASERC           INN
-52.9           20      120.2   40          75.7            NA (Bin)        INN

4) Graphical comparison: Comparing structures using quantitative measures, such as the number of correctly predicted base pairs, is useful, but comparing the overlap qualitatively can strengthen the interpretation of the results. In particular, a qualitative, graphical comparison can identify regions of high structural similarity between two structures even if the quantitative overlap of base pairs in those regions is low, for example when a bulge that is present in one structure and absent in the other shifts the base pairs.

Figure 3 shows a typical comparison between two structures of the same sequence, in this case how two Saccharomyces cerevisiae structures overlap. The known structure is represented by light grey bonds and the predicted structure by dark grey bonds. Where the known and the predicted structures overlap, the bonds are coloured black.
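The three bond classes used in the graphical comparison (known only, predicted only, and overlap) map directly onto set operations over base pairs, and the same sets give the percentage of correctly predicted base pairs and the number of false positives. The sketch below is illustrative only: the function name `compare_structures`, the representation of a structure as a list of index pairs, and the toy data in the usage note are assumptions, not the tool used to produce the figures.

```python
def compare_structures(known, predicted):
    """Classify base pairs of two structures of the same sequence.

    Each structure is given as an iterable of (i, j) index pairs; the
    order of i and j within a pair is ignored.
    """
    known_set = {tuple(sorted(p)) for p in known}
    pred_set = {tuple(sorted(p)) for p in predicted}
    overlap = known_set & pred_set
    percent = 100.0 * len(overlap) / len(known_set) if known_set else 0.0
    return {
        "overlap": sorted(overlap),                      # drawn in black
        "known_only": sorted(known_set - pred_set),      # light grey
        "predicted_only": sorted(pred_set - known_set),  # dark grey (false positives)
        "percent_correct": percent,                      # share of known pairs predicted
    }
```

For a toy input such as `compare_structures([(0, 10), (1, 9), (2, 8), (12, 20)], [(0, 10), (1, 9), (3, 7), (20, 12)])`, three of the four known pairs overlap, one known pair is missed, and one predicted pair is a false positive.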

Fig. 3. Comparison between the known structure and the EA-predicted structure with the highest number of correctly predicted base pairs. The predicted base pairs are coloured in dark grey, the known base pairs in light grey, and the overlap in black. The EA predicted 89.2% of the known Saccharomyces cerevisiae base pairs.

The comparison between the structure with the highest number of correct base pairs and the known structure is shown in Figure 3. This figure shows that the EA not only predicts as many as 89.2% of the known base pairs correctly but also correctly predicts most of the known structure’s substructures. The known structure contains three branches and two internal loops. The predicted structure contains all these substructures, with minor modifications due to the constraints of the helix generation algorithm.

The Nussinov algorithm using base pair maximization predicts as many as 75.7% of the known base pairs correctly. Figure 4 shows only the predicted structure (light grey) and highlights the bonds that overlap (black) to ease interpretation. Even though the number of correctly predicted base pairs is high, there are very obvious differences between the structures. The predicted structure contains three major branches, similar to the real structure. However, upon closer inspection, the predicted structure fails to recognize a proper multi-branch loop connecting the three branches. Also, the predicted structure does not contain the internal loop in the left branch. Instead, the over-prediction causes two hairpin loops to replace the internal loop from the known structure. The branch on the right hand side is again correctly predicted up to the position where the internal loop is supposed to start. After this point, the Nussinov algorithm predicts three hairpin loops where the known structure contains only one.
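The helix generation constraints referred to above (at least three stacked pairs, at least three nucleotides in the connecting loop, and only GC, AU, or GU pairs) can be sketched as a simple enumeration. This is a hedged illustration under stated assumptions rather than the authors' generator: the name `find_helices`, the choice to report only maximal helices as `(i, j, length)` with 0-based outermost indices, and the constants `MIN_STACK` and `MIN_LOOP` are introduced here for the example.

```python
CANONICAL = {("G", "C"), ("C", "G"), ("A", "U"), ("U", "A"), ("G", "U"), ("U", "G")}
MIN_STACK = 3  # a helix must contain at least three stacked pairs
MIN_LOOP = 3   # at least three nucleotides in the loop closed by the helix


def can_pair(seq, i, j):
    """True if positions i and j form a canonical (GC, AU, GU) pair."""
    return (seq[i], seq[j]) in CANONICAL


def find_helices(seq):
    """Enumerate maximal helices as (i, j, length) with outermost pair (i, j)."""
    n = len(seq)
    helices = []
    for i in range(n):
        for j in range(i + MIN_LOOP + 1, n):
            if not can_pair(seq, i, j):
                continue
            # Assumption: skip starts that could be extended outward, so that
            # each stack is reported once, from its outermost pair.
            if i > 0 and j < n - 1 and can_pair(seq, i - 1, j + 1):
                continue
            length = 0
            # Extend inward while pairs stay canonical and the loop stays long enough.
            while ((j - length) - (i + length) - 1 >= MIN_LOOP
                   and can_pair(seq, i + length, j - length)):
                length += 1
            if length >= MIN_STACK:
                helices.append((i, j, length))
    return helices
```

Under these rules a stack of only two pairs, or a pair involving C and U, is never generated, which is the kind of limitation that shows up when a predicted structure is compared with a known one.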

Fig. 4. Comparison of the structure predicted with the maximal number of base pairs using the Nussinov DPA. The light grey base pairs correspond to the predicted structure, while the black ones correspond to the correctly predicted base pairs. The known base pairs were omitted to make the comparison easier. In this case, the Nussinov algorithm predicted 75.7% of the known base pairs.

E. Helix generation algorithm limitations

As mentioned in Section I, our helix generation algorithm is bound by three rules: a helix must contain at least three stacked pairs, the loop connecting the stacked pairs must contain at least three nucleotides, and base pairs must be composed of GC, AU, or GU. The known structure in Figure 3 contains two base pairs that our model cannot predict: the branch on the right hand side shows two CU base pairs. For a fairer comparison, these could be removed. Removing the first CU base pair also removes the ability of the CU-AU-CG stack to form, since our model does not allow for a helix of length two. If this helix, along with the second CU pair, is removed, our EA has effectively predicted 100% of the remaining known base pairs with six false positives. This lends further support to the notion that the EA search engine is very effective in identifying real structural elements, but is limited by the current helix generation model.

F. Overprediction of base pairs

Table XIV shows the number of false positive base pairs predicted by the two methods. The data in the table was drawn from the single best Nussinov prediction and the best average run of the EA for each sequence. The first column gives the sequence. The second shows the weights used for the DPA. The third column shows the number of false positive base pairs predicted with the Nussinov DPA and the fourth column shows the percentage of known base pairs correctly predicted with this structure. The fifth column shows which crossover operator/thermodynamic model combination was used to find the structure that predicted the largest number of known base pairs on average. The sixth column shows how many false positive base pairs were predicted on average with the EA. Finally, the seventh column shows the percentage of known base pairs correctly predicted with the EA.

TABLE XIV
COMPARISON OF THE NUMBER OF FALSE PREDICTIONS BETWEEN THE BEST RESULTS WITH THE NUSSINOV DPA AND THE BEST AVERAGE EA RUNS

Sequence        GC:AU:GU Weights   DPA overpred.   Corr. Pred. (%)   Cross-Model   EA overpred.   Corr. Pred. (%)
D. virilis      1:1:1              291             12.4              CX/INN        200.7          14.4
H. rubra        3:2:1              174             22.5              CX/INN-HB     116.9          28.5
S. cerevisiae   1:1:1              17              75.7              CX/INN-HB     6.0            89.2

The results show that in all cases the EA outperformed the DPA by predicting more base pairs correctly. Furthermore, the EA consistently predicted far fewer false positive base pairs than the DPA.

IV. CONCLUSION

We have discussed new results for our previously introduced EA for RNA secondary structure prediction. We have added results for two new sequences and have also added results using only mutation without crossover. We have also compared our EA to a dynamic programming method, the Nussinov algorithm. Our algorithm currently performs better on shorter sequences than on longer ones. Overall, we were able to predict as many as 89.2% of the known base pairs on a 118 nucleotide sequence. When testing larger structures, the predictions were not as accurate as with shorter sequences. This is mainly due to thermodynamic model limitations.

This paper also lends further support to the claim that permutation encoding performs better than binary encoding in this domain. In fact, the crossover operators giving the lowest energy structures were OX2, CX, and PMX for all sequences and both thermodynamic models. Also, when comparing permutation encoding directly with binary encoding by disabling the crossover operator, the permutation runs always outperform the binary ones.

The EA was able to outperform the Nussinov DPA for each sequence. With D. virilis, four INN average runs, using OX2, CX, PMX, and 1-Point, performed as well as or better than all Nussinov results. With H. rubra, average runs of the EA with CX, OX2, and PMX using the INN-HB thermodynamic model outperformed all Nussinov results. Lastly, with S. cerevisiae, all INN-HB average runs and all INN runs, with the exception of those without crossover, outperformed all Nussinov results. Also noteworthy is that the EA is not as prone to false base pair predictions as the Nussinov DPA, which further increases the quality of the predicted structures.

V. FUTURE WORK

Future work includes exploring different crossover and mutation parameter settings using both KBR and STDS. Also, our structures will be compared with those generated with mfold [18]. Adding a more complete thermodynamic model to account for hairpin loops, bulges, internal loops, multi-branch loops and pseudoknots is planned. We will also investigate modifying our rules to allow certain non-canonical base pairs to form. These changes should allow us to better model most RNA structures. By integrating a thermodynamic model comparable to mfold’s efn2 into our EA, we expect to be able to predict structures more accurately than mfold.

ACKNOWLEDGMENT

A. Deschênes and J. Poonian would like to acknowledge support from the Natural Sciences and Engineering Research Council in the form of a Postgraduate Scholarship (NSERC, Canada, PGS-A) and a Canada Graduate Scholarship (NSERC, Canada, CGS-M), respectively. K. C. Wiese would like to acknowledge the support of NSERC for this research under Research Grant number RG-PIN 238298 and Equipment Grant number EQPEQ 240868. All authors would also like to acknowledge the support of the InfoNet Media Centre funded by the Canada Foundation for Innovation (CFI) under grant number CFI-3648. The authors would also like to acknowledge Edward Glen for creating the images used for comparisons.

REFERENCES

[1] P. G. Higgs, “RNA secondary structure: physical and computational aspects,” Quarterly Reviews of Biophysics, vol. 33, pp. 199–253, 2000.
[2] R. Nussinov, G. Pieczenik, J. R. Griggs, and D. J. Kleitman, “Algorithms for loop matchings,” SIAM Journal on Applied Mathematics, vol. 35, pp. 68–82, 1978.
[3] B. A. Shapiro and J. Navetta, “A massively-parallel genetic algorithm for RNA secondary structure prediction,” J. Supercomput., vol. 8, pp. 195–207, 1994.
[4] K. C. Wiese and E. Glen, A Permutation Based Genetic Algorithm for RNA Secondary Structure Prediction, ser. Frontiers in Artificial Intelligence and Applications - Soft Computing Systems. Amsterdam: IOS Press, 2002, vol. 87, ch. 4, pp. 173–182.
[5] ——, “A permutation-based genetic algorithm for the RNA folding problem: a critical look at selection strategies, crossover operators, and representation issues,” BioSystems - Special Issue on Computational Intelligence in Bioinformatics, vol. 72, pp. 29–41, 2003.
[6] K. C. Wiese, A. Deschênes, and E. Glen, “Permutation based RNA secondary structure prediction via a genetic algorithm,” in Proceedings of the 2003 Congress on Evolutionary Computation CEC2003, R. Sarker, R. Reynolds, H. Abbass, K. C. Tan, B. McKay, D. Essam, and T. Gedeon, Eds. Canberra: IEEE Press, 8-12 December 2003, pp. 335–342.
[7] A. Deschênes and K. C. Wiese, “Using stacking-energies (INN and INN-HB) for improving the accuracy of RNA secondary structure prediction with an evolutionary algorithm - a comparison to known structures,” in Proceedings of the 2004 IEEE Congress on Evolutionary Computation, vol. 1. Portland, Oregon: IEEE Press, June 2004, pp. 598–606.
[8] K. Wiese and S. D. Goodwin, “Keep-best reproduction: A local family competition selection strategy and the environment it flourishes in,” Constraints, vol. 6, no. 4, pp. 399–422, 2001.
[9] T. Starkweather, S. McDaniel, K. Mathias, D. Whitley, and C. Whitley, “A comparison of genetic sequencing operators,” in Proceedings of the Fourth International Conference on Genetic Algorithms, R. Belew and L. Booker, Eds. San Mateo, CA: Morgan Kaufmann, 1991, pp. 69–76.
[10] I. M. Oliver, D. J. Smith, and J. R. C. Holland, “A study of permutation crossover operators on the traveling salesman problem,” in Proceedings of the Second International Conference on Genetic Algorithms on Genetic algorithms and their application. Lawrence Erlbaum Associates, Inc., 1987, pp. 224–230.
[11] D. Goldberg and J. Lingle, “Alleles, loci and the travelling salesman problem,” in Proceedings of the First International Conference on Genetic Algorithms, J. Grefenstette, Ed. Lawrence Erlbaum Associates, 1985, pp. 154–159.
[12] M. J. Serra and D. H. Turner, “Predicting thermodynamic properties of RNA,” Methods in Enzymology, vol. 259, pp. 242–261, 1995.
[13] T. Xia, J. SantaLucia Jr., M. E. Burkard, R. Kierzek, S. J. Schroeder, X. Jiao, C. Cox, and D. H. Turner, “Thermodynamic parameters for an expanded nearest-neighbor model for formation of RNA duplexes with Watson-Crick base pairs,” Biochemistry, vol. 37, pp. 14719–14735, 1998.
[14] D. H. Mathews, J. Sabina, M. Zuker, and D. H. Turner, “Expanded sequence dependence of thermodynamic parameters improves prediction of RNA secondary structure,” Journal of Molecular Biology, vol. 288, pp. 911–940, 1999.
[15] F. Major, 2003, private communication.
[16] J. J. Cannone, S. Subramanian, M. N. Schnare, J. R. Collett, L. M. D’Souza, Y. Du, B. Feng, N. Lin, L. V. Madabusi, K. M. Müller, N. Pande, Z. Shang, N. Yu, and R. R. Gutell, “The comparative RNA web (CRW) site: an online database of comparative sequence and structure information for ribosomal, intron, and other RNAs,” BMC Bioinformatics, vol. 3, 2002.
[17] K. C. Wiese, S. D. Goodwin, and S. Nagarajan, “ASERC - a genetic sequencing operator for asymmetric permutation problems,” in Canadian AI 2000, LNAI 1822, H. Hamilton and Q. Yang, Eds. Springer-Verlag Berlin Heidelberg, 2000, pp. 201–213.
[18] M. Zuker, D. H. Mathews, and D. H. Turner, “Algorithms and thermodynamics for RNA secondary structure prediction: a practical guide,” in RNA Biochemistry and Biotechnology, ser. NATO ASI Series, J. Barciszewski and B. Clark, Eds. Kluwer Academic Publishers, 1999.
