Published online 12 April 2013

Nucleic Acids Research, 2013, Vol. 41, No. 11 e114 doi:10.1093/nar/gkt187

RNA global alignment in the joint sequence–structure space using elastic shape analysis

Jose Laborde1, Daniel Robinson2, Anuj Srivastava1, Eric Klassen2 and Jinfeng Zhang1,*

1Department of Statistics, Florida State University, FL, USA and 2Department of Mathematics, Florida State University, FL, USA

Received August 9, 2012; Revised February 26, 2013; Accepted February 27, 2013

ABSTRACT

The functions of RNAs, like those of proteins, are determined by their structures, which, in turn, are determined by their sequences. Comparison/alignment of RNA molecules provides an effective means to predict their functions and understand their evolutionary relationships. For RNA sequence alignment, most methods developed for protein and DNA sequence alignment can be directly applied. RNA three-dimensional (3D) structure alignment, on the other hand, tends to be more difficult than protein structure alignment because RNAs lack the regular secondary structures observed in proteins. Most existing RNA 3D structure alignment methods use only the backbone geometry and ignore the sequence information. Using both the sequence and backbone geometry information in RNA alignment may not only produce more accurate classification, but also deepen our understanding of the sequence–structure–function relationship of RNA molecules. In this study, we developed a new RNA alignment method based on elastic shape analysis (ESA). ESA treats RNA structures as three-dimensional curves with sequence information encoded on additional dimensions so that the alignment can be performed in the joint sequence–structure space. The similarity between two RNA molecules is quantified by a formal distance, the geodesic distance. Based on ESA, a rigorous mathematical framework can be built for RNA structure comparison. Means and covariances of full structures can be defined and computed, and probability distributions on spaces of such structures can be constructed for a group of RNAs. Our method was further applied to predict functions of RNA molecules and showed superior performance compared with previous methods when tested on benchmark datasets.

The programs are available at http://stat.fsu.edu/~jinfeng/ESA.html.

INTRODUCTION

Recent discoveries have shown that RNA molecules play important roles in many biological processes such as enzymatic activity, protein synthesis and transport, gene transcriptional regulation, RNA processing and splicing and chromosome replication (1–5). This changed the traditional view of RNA being solely a carrier of genetic information (1,6,7). Comparison of RNAs, including both sequence alignment and structure alignment, can reveal the conserved motifs important for RNA functions, the evolutionary relationships of RNAs and the sequence–structure–function relationship of RNAs in general (6).

Compared with protein alignments, the comparison/alignment of RNAs is much less studied (8–18). Although the alignment of RNA sequences can directly borrow the methods developed for protein or DNA sequence alignment, many methods designed for protein structure alignment cannot be readily used for alignment of RNA structures. This is partially due to the difference between the secondary structures of proteins and RNAs. In the rest of the Introduction, current methods for RNA structure alignment will be briefly reviewed, followed by a conceptual description of our own method.

Since local structural motifs of RNAs often have specific functions, some studies focus on detection of local structural motifs instead of comparing the overall structures [such as NASSAM (17), COMPADRES (16), RNAMotifScan (19) and FR3D (15)]. This is analogous to the identification of functional domains in proteins. Methods that compare 3D RNA structures at scales larger than motifs can be divided largely into two types. The first type of method represents nucleotide residues by some local structure features, which allows the 3D structures to be reduced to 1D sequences. The resulting 1D sequences can then be aligned using traditional sequence alignment methods. The second type of method starts

*To whom correspondence should be addressed. Tel: +1 850 644 3218; Fax: +1 850 644 5271; Email: [email protected]
Correspondence may also be addressed to Anuj Srivastava. Tel: +1 850 644 3218; Fax: +1 850 644 5271; Email: [email protected]

© The Author(s) 2013. Published by Oxford University Press. This is an Open Access article distributed under the terms of the Creative Commons Attribution Non-Commercial License (http://creativecommons.org/licenses/by-nc/3.0/), which permits non-commercial re-use, distribution, and reproduction in any medium, provided the original work is properly cited. For commercial re-use, please contact [email protected]


from alignment of similar local structures and then aims to obtain a larger scale alignment by extending the initial local alignment.

Among the first type of methods, SARA uses a set of unit vectors derived from consecutive nucleotides to represent each nucleotide. Differences between nucleotides are compared via the unit vectors, using the unit-vector root mean square (URMS) as distance (8,9). iPARTS uses a structural alphabet (SA) derived from backbone torsion angles, which are discretized into 23 states. A substitution matrix is then derived for the alphabet of 23 letters and used in the sequence alignment (10). LaJolla uses an n-gram model for analysing the sequences derived from the torsion angles of nucleotides (12). Similarly, PRIMOS/AMIGOS (13) and DIAL (14) also use torsion angles to represent nucleotides and align RNAs in a sequence space encoded by the torsion angle representation. The above approaches do not necessarily produce globally similar alignments between two RNA structures [i.e. alignments with small root mean square deviations (RMSDs)]. To achieve a smaller RMSD for the aligned parts, extra steps need to be taken after the sequence alignment.

Among the second type of RNA structure alignment methods, ARTS uses the P (phosphorus) atoms of two consecutive base pairs as seeds and aligns the overall structures based on the alignment of structurally similar seed quadrants between the two RNA molecules (11). In R3D Align, local alignments are merged to form a global alignment by using a maximum clique algorithm on a specially defined graph called a local alignment graph (18).

Sequence alignment uses information of side chains, which are reduced to single letters, and structure alignment uses information of the backbone geometry. Both approaches have their advantages and disadvantages. Sequence alignment can be used to compare any two or multiple RNA molecules with sequence information, which is readily available.
But such methods often perform poorly for remotely related sequences, and will not work for related structures without detectable sequence similarities. Structure alignment, on the other hand, fills that gap by using structure information directly. Structure alignment also allows for detection of common functional motifs/domains in a set of structures. However, structure information is far more difficult to obtain than sequences. Even though the nucleotide sequences are available, most structure alignment methods do not use this information. A valid question is whether sequences can provide additional useful information for structure alignment. Combining both structure and sequence information in RNA comparison may also provide more insight into the sequence–structure–function relationships of these molecules.

In this study, we address this problem using a novel approach, called elastic shape analysis (ESA). Recently, ESA has been successfully applied to protein structure comparison (20,21). ESA treats protein/RNA structures as three-dimensional curves and uses a geometric framework that was originally developed in image analysis and computer vision for shape analysis of parameterized curves and surfaces (22–25). The basic idea in this framework is to design an infinite-dimensional manifold of curves, endow it with a metric structure and compare

any two objects by computing the distances between them on this manifold. Under this framework, we will be able to (i) quantify the similarity between any two protein/RNA structures by a formal distance, the geodesic distance, computed on their respective shape manifolds. These geodesics can be seen as optimal deformations of objects into each other; (ii) compute intrinsic statistics of the shapes of a collection of protein/RNA conformations. For instance, one can compute their means and variances to study the statistical variability of shapes within the same group; and (iii) use the moments computed above to impose a full probability model on a shape group using a wrapped Gaussian density. This type of probability density can be used for statistical analysis and hypothesis testing of future structures. A nice property of ESA is that additional information can be readily added and the resulting distance can still be a formal one. In this study, we apply ESA to RNA structure alignment and extend the original framework to compare RNA molecules in the joint sequence–structure space.

In the following sections, we will first present the ESA method for RNA structure and sequence comparison. We then apply the method on benchmark datasets for RNA function prediction and compare its performance with that of previous methods. Finally, we provide a discussion and our conclusions.

METHODOLOGY

Elastic shape analysis in the joint sequence–structure space

A mathematical and statistical framework based on ESA has been recently developed for protein and RNA structure alignment where only the backbone geometric information was used in distance calculations (20, 21, 26). In this study, we extend the framework by incorporating additional dimensions containing sequence information so that the comparison can be done in the joint sequence–structure space. For the sake of completeness, we describe the key points of the original framework while focusing more on the new development.
For RNA structure comparison, ESA treats the backbone structures of RNA molecules as parameterized curves in $\mathbb{R}^3$. Since the comparisons involve shapes, the resulting quantifications should not depend on the rigid motions and parameterizations of these curves. When incorporating additional information, such as sequences, we add extra coordinates resulting in curves in higher dimensions. Since the structure dimensions and sequence dimensions represent different types of information, they are treated differently at certain steps of the ESA procedure. We represent each parameterized curve with a special function called the square root velocity function (SRVF) and restrict to the manifold of such functions under the desired constraints. To compare shapes of curves, we remove all shape-preserving transformations from this representation. This is done using an algebraic technique: we form a quotient space of the original manifold with respect to these shape-preserving transformation groups. In the resulting quotient space, called


the shape space of elastic curves, one can perform statistical analysis of curves as if they are random variables. One can compare, match and deform one curve into another, or compute averages and covariances of curve populations, and perform hypothesis testing and classification of curves according to their shapes.

When incorporating sequence information as additional dimensions, nucleotides need to be converted to numerical values while the distances among them should still be sensible. A distance matrix or substitution matrix needs to be used to derive the conversion rule. Here we use the Jukes and Cantor substitution matrix (27), a commonly used and also the simplest substitution model (JC model). The JC model assumes equal base frequencies and equal mutation rates, suggesting that the four nucleotides must be equidistant from one another. Consequently, we use a regular tetrahedron to represent the four nucleotides, where the numerical values of the four nucleotides are set to be the coordinates of the four vertices of the tetrahedron. Specifically, we have $A = (1, 1, 1)$, $C = (1, -1, -1)$, $U = (-1, 1, -1)$ and $G = (-1, -1, 1)$. The distance between any pair of these points is $2\sqrt{2}$. Note that it is impossible to embed four equidistant points in two dimensions and doing this in four dimensions is unnecessary. It is also possible to adopt other substitution matrices in a similar fashion.

Now suppose we have a sequence of $n$ P (phosphorus) atoms with coordinates $(x_1, y_1, z_1), \ldots, (x_i, y_i, z_i), \ldots, (x_n, y_n, z_n)$, taken from the PDB (28, 29) file of an RNA molecule. The corresponding sequence coordinates according to the representation above are $(u_i, v_i, w_i)$. The RNA molecule is represented as a curve in $\mathbb{R}^6$ going through the points

$$P_i := [\, x_i \;\; y_i \;\; z_i \;\; \lambda u_i \;\; \lambda v_i \;\; \lambda w_i \,]^T \in \mathbb{R}^6, \qquad i = 1, 2, \ldots, n,$$

where $\lambda$ is a weight that controls the contribution of the sequence information in the alignment.
The greater the value of $\lambda$, the more influential the sequence information will be, and for a very large $\lambda$, only sequence information will be relevant. Conversely, if $\lambda = 0$, the alignment is done only in the structure space.

Using the representation above, given an RNA structure, we construct a continuous parameterized curve that interpolates the points $\{P_i\}_{i=1}^{n}$ and maps the interval $[0,1]$ to $\mathbb{R}^6$. We will denote the parameterizing variable as $t$. We assign the values $\{t_i\}_{i=1}^{n}$ corresponding to $\{P_i\}_{i=1}^{n}$ as follows:

$$t_1 = 0, \qquad t_{i+1} = t_i + \frac{1}{L}\,\|(x_{i+1}, y_{i+1}, z_{i+1}) - (x_i, y_i, z_i)\|, \qquad i = 1, 2, \ldots, n-1,$$

where $L$ is the total length of the backbone. Note that the sequence information is ignored when selecting parameter values, and also that these are not necessarily uniformly spaced. Once parameter values have been assigned, we define our curve $\beta : [0, 1] \to \mathbb{R}^6$ to be the piecewise-linear interpolating function mapping $t_i \mapsto P_i$ for $i = 1, 2, \ldots, n$.
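The construction above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name `joint_curve`, its arguments and the toy coordinates are hypothetical, and reading P-atom coordinates from a PDB file is left out. Only the tetrahedron encoding, the weight $\lambda$ and the arc-length rule for $t_i$ come from the text.

```python
import numpy as np

# Tetrahedron encoding of the four nucleotides, as given in the text.
NUC = {
    "A": (1, 1, 1),
    "C": (1, -1, -1),
    "U": (-1, 1, -1),
    "G": (-1, -1, 1),
}

def joint_curve(p_xyz, seq, lam):
    """Return the sample points P_i in R^6 and their parameter values t_i.

    p_xyz : (n, 3) P-atom coordinates (hypothetical input, already parsed)
    seq   : string of n nucleotides over A, C, G, U
    lam   : sequence weight lambda
    """
    xyz = np.asarray(p_xyz, dtype=float)
    uvw = np.array([NUC[s] for s in seq], dtype=float)
    P = np.hstack([xyz, lam * uvw])          # P_i = (x, y, z, lam*u, lam*v, lam*w)
    # Parameter values use the 3D backbone only; sequence coordinates are ignored.
    seg = np.linalg.norm(np.diff(xyz, axis=0), axis=1)
    t = np.concatenate([[0.0], np.cumsum(seg) / seg.sum()])
    return P, t

# Toy backbone with segment lengths 1, 2 and 2 (total length L = 5).
P, t = joint_curve([[0, 0, 0], [1, 0, 0], [1, 2, 0], [1, 2, 2]], "ACGU", lam=5.0)
print(t)  # parameter values 0, 0.2, 0.6, 1 (up to floating-point rounding)
```

Because the parameter values follow the backbone arc length, they are unevenly spaced whenever consecutive P atoms are unevenly spaced, which is the situation described above.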


Since the $\beta$ functions are absolutely continuous curves in $\mathbb{R}^6$, all of the ESA techniques for $\mathbb{R}^n$, described in (22, 23), can be readily applied to them. Since $\beta$ is linear on each subinterval $(t_{i-1}, t_i)$, the corresponding SRVF (which we discuss next) will be constant on each of these subintervals.

To analyse the (sequence-annotated) shape of a curve $\beta$, we represent $\beta$ by its square-root velocity function: $q(t) = \dot{\beta}(t) / \sqrt{\|\dot{\beta}(t)\|}$ in $\mathbb{R}^6$, where $\|\cdot\|$ is the standard Euclidean norm in $\mathbb{R}^6$, and $\dot{\beta}(t) = \frac{d\beta(t)}{dt}$. In order for the shape analysis to be invariant to scale, we rescale each curve to length 1. This treatment is optional, but gave (surprisingly) better performance for protein structure alignment (20, 21). Restricting to the curves of interest, represented by their SRVFs, we obtain the set

$$\mathcal{C} \equiv \left\{ q : [0, 1] \to \mathbb{R}^6 \;\middle|\; \int_0^1 \|q(t)\|^2 \, dt = 1 \right\}. \qquad (1)$$
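For a piecewise-linear curve the SRVF can be written down segment by segment, which the following sketch illustrates. The function names are hypothetical, and the rescaling step assumes that "length" means the length of the curve in the full coordinate space, which equals $\int_0^1 \|q(t)\|^2 dt$. The code assumes that no two consecutive sample points coincide.

```python
import numpy as np

def srvf_piecewise_linear(P, t, unit_length=True):
    """SRVF of the piecewise-linear curve through points P at parameters t.

    Returns q with shape (n-1, d); q[k] is the constant SRVF value on (t[k], t[k+1]).
    """
    P = np.asarray(P, dtype=float)
    dt = np.diff(np.asarray(t, dtype=float))
    v = np.diff(P, axis=0) / dt[:, None]        # velocity on each subinterval
    speed = np.linalg.norm(v, axis=1)           # assumed non-zero
    q = v / np.sqrt(speed)[:, None]             # q = beta' / sqrt(||beta'||)
    if unit_length:
        length = np.sum(np.sum(q**2, axis=1) * dt)   # integral of ||q||^2 = curve length
        q = q / np.sqrt(length)                 # rescaling the curve by 1/length
    return q

def l2_norm_sq(q, t):
    """Integral of ||q(t)||^2 over [0, 1] for a piecewise-constant q."""
    return float(np.sum(np.sum(q**2, axis=1) * np.diff(t)))

# Toy curve in the plane with segment lengths 5 and 6.
P = np.array([[0.0, 0.0], [3.0, 4.0], [3.0, 10.0]])
t = np.array([0.0, 0.5, 1.0])
q_raw = srvf_piecewise_linear(P, t, unit_length=False)
q = srvf_piecewise_linear(P, t)
print(l2_norm_sq(q_raw, t), l2_norm_sq(q, t))   # approximately 11 and 1
```

Dividing $q$ by the square root of the length corresponds to dividing the curve itself by its length, so the rescaled SRVF satisfies the constraint in equation (1).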

$\mathcal{C}$ is called the pre-shape space and is the set of all SRVFs representing parameterized curves in $\mathbb{R}^6$ of length one. It is actually a unit sphere in the Hilbert space $L^2$. Of the four shape-preserving transformations, we have already removed translation (the SRVF depends only on the derivative of $\beta$) and scale (optional); rotation and re-parameterization are removed algebraically as follows.

When analysing these curves in $\mathbb{R}^6$, the sequence information is relevant only for the parameterization step, but not for the rotation step: when optimizing over rotations, we only use information of backbone structures. Since the nucleotide sequence is not considered geometric information, when dealing with rotation, we modify the original algorithm by letting $SO(3)$ be the group of $3 \times 3$ rotation matrices and $\Gamma$ be the group of all re-parameterizations (they are actually positive diffeomorphisms of the interval $[0,1]$). Let $\overline{SO(3)}$ be the embedding of $SO(3)$ in $GL_6$ (the general linear group of $6 \times 6$ invertible matrices) obtained by mapping each element $O \in SO(3)$ into $GL_6$ through

$$\bar{O} = \begin{pmatrix} O & 0 \\ 0 & I_3 \end{pmatrix}.$$

It is easy to show that $\overline{SO(3)}$ is a subgroup of $GL_6$ and, indeed, of $SO(6)$.

For a curve $\beta$, an (embedded) rotation $\bar{O} \in \overline{SO(3)}$ and a re-parameterization $\gamma \in \Gamma$, the transformed curve is given by $\bar{O}(\beta \circ \gamma)$. The SRVF of the transformed curve is given by $\sqrt{\dot{\gamma}}\, \bar{O}(q \circ \gamma)$. To unify all elements that denote the same shape in $\mathcal{C}$, we define equivalence classes of the type $[q] = \{ \sqrt{\dot{\gamma}}\, \bar{O}(q \circ \gamma) \mid \bar{O} \in \overline{SO(3)},\ \gamma \in \Gamma \}$. Each such class $[q]$ is uniquely associated with a shape and vice versa. The set of all these equivalence classes is called the shape space $\mathcal{S}$. Mathematically, it is a quotient space of the pre-shape space: $\mathcal{S} \equiv \mathcal{C} / (\overline{SO(3)} \times \Gamma) = \{ [q] \mid q \in \mathcal{C} \}$.

With the above modifications of the original framework, we can obtain geodesic paths and distances between two RNA structures, represented by 3D curves $\beta_1$ and $\beta_2$, as done in (20, 21, 26). Details of the mathematical framework can be found in the appendix.

Obtaining a sequence alignment from the elastic matching

Now suppose that $\beta_1, \beta_2 : [0, 1] \to \mathbb{R}^6$ are two curves obtained as above.
Let $t^1_1, t^1_2, \ldots, t^1_{n_1}$ and $t^2_1, t^2_2, \ldots, t^2_{n_2}$ be the parameter values assigned to the original sample


points for $\beta_1$ and $\beta_2$, respectively, and let $q_1$ and $q_2$ be the corresponding SRVFs. Our approach to finding an optimal re-parameterization $\gamma$, as done in (20,21,26), is basically the same as the dynamic programming algorithm used in (25), but in our case the nucleotide sequence information gives rise to a few considerations which we briefly discuss here.

In the original ESA framework for RNA/protein alignment (21,26), we added a pre-processing step: any two RNAs/proteins to be compared had to have their 3D coordinates re-sampled using interpolations, and we obtained smooth curves with an equal number of points on each structure. That treatment was convenient since it facilitated the use of a uniform grid that evenly partitions $[0, 1] \times [0, 1]$ to search for the optimal $\gamma$. In the new framework, each point is mapped to a nucleotide and re-sampling of points is not necessary. With this modification, sequence alignments are obtained together with structure alignments, as in most other structure alignment methods.

As in the original algorithm, we create a grid on the unit square $[0, 1] \times [0, 1]$ and search for an optimal path through the grid from (0,0) to (1,1). However, our grid is non-uniform: the vertical lines are placed at x-coordinates $t^1_1, t^1_2, \ldots, t^1_{n_1}$ and the horizontal lines are placed at y-coordinates $t^2_1, t^2_2, \ldots, t^2_{n_2}$ (see Figure 1). In our case, the gridpoints have special significance: each gridpoint $(t^1_i, t^2_j)$ represents a match between one of the P atoms on the backbone of the molecule represented by $\beta_1$ and a P atom on the backbone of the molecule represented by $\beta_2$. As in the original algorithm, a list of gridpoints $(t^1_{i_1}, t^2_{j_1}), (t^1_{i_2}, t^2_{j_2}), \ldots, (t^1_{i_m}, t^2_{j_m})$ which starts at (0,0), ends at (1,1), and is strictly increasing in both the x and y components can be thought of as the graph of an increasing piecewise-linear re-parameterization $\gamma : [0, 1] \to [0, 1]$.
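The search over strictly increasing gridpoint lists can be sketched as a dynamic program on the non-uniform grid. The sketch below is illustrative only and rests on several assumptions not stated in this section: the energy is taken to be the usual elastic matching term $\int \|q_1(t) - \sqrt{\dot{\gamma}(t)}\, q_2(\gamma(t))\|^2 dt$ (the paper's equation A.1 is in the appendix and is not reproduced here), the rotation step is ignored, and `max_step`, `edge_cost` and `elastic_match` are hypothetical names. Each SRVF is stored as one constant value per subinterval.

```python
import numpy as np

def edge_cost(q1, t1, q2, t2, i, j, k, l):
    """Energy of one linear piece of gamma joining gridpoint (i, j) to (k, l):
    integral over [t1[i], t1[k]] of ||q1(t) - sqrt(s) * q2(gamma(t))||^2."""
    a, c = t1[i], t2[j]
    s = (t2[l] - c) / (t1[k] - a)                     # slope of gamma on this piece
    # Breakpoints of q1 and of q2 composed with gamma, expressed on the t-axis.
    cuts = np.unique(np.concatenate([t1[i:k + 1], a + (t2[j:l + 1] - c) / s]))
    mid = 0.5 * (cuts[:-1] + cuts[1:])
    i1 = np.clip(np.searchsorted(t1, mid, side="right") - 1, 0, len(q1) - 1)
    i2 = np.clip(np.searchsorted(t2, c + s * (mid - a), side="right") - 1, 0, len(q2) - 1)
    diff = q1[i1] - np.sqrt(s) * q2[i2]
    return float(np.sum(np.diff(cuts) * np.sum(diff**2, axis=1)))

def elastic_match(q1, t1, q2, t2, max_step=4, tol=1e-12):
    """Minimum-energy strictly increasing path of gridpoint indices."""
    n1, n2 = len(t1), len(t2)
    D = np.full((n1, n2), np.inf)
    D[0, 0] = 0.0
    back = {}
    for k in range(1, n1):
        for l in range(1, n2):
            # Nearest predecessors first, so ties keep the finer path.
            for i in range(k - 1, max(-1, k - max_step - 1), -1):
                for j in range(l - 1, max(-1, l - max_step - 1), -1):
                    if not np.isfinite(D[i, j]):
                        continue
                    cost = D[i, j] + edge_cost(q1, t1, q2, t2, i, j, k, l)
                    if cost < D[k, l] - tol:
                        D[k, l] = cost
                        back[(k, l)] = (i, j)
    if not np.isfinite(D[-1, -1]):
        raise ValueError("no admissible path; increase max_step")
    path = [(n1 - 1, n2 - 1)]
    while path[-1] != (0, 0):
        path.append(back[path[-1]])
    return path[::-1], float(D[-1, -1])

# Matching a toy curve against itself: every sample point is matched to itself.
t = np.array([0.0, 0.2, 0.5, 0.7, 1.0])
q = np.array([[1.0, 0, 0], [0, 1.0, 0], [0, 0, 1.0], [1.0, 1.0, 0]])
path, energy = elastic_match(q, t, q, t)
print(path, energy)
```

Several gridpoint lists can describe the same $\gamma$ (for instance, skipping a gridpoint that lies on a straight piece), so the sketch visits the nearest predecessors first and only accepts strict improvements. That tie-breaking rule is a design choice of this example, made so that matched gridpoints are not dropped from the list.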
In our case, such a list also gives us an alignment between the two nucleotide sequences. Here, parameter values $t^i_j$ which do not appear in the list of gridpoints correspond to alignment gaps (elements in the nucleotide sequence which are unmatched). Dynamic programming is used as in the original algorithm to search over the space of all such lists for one which minimizes the energy in equation A.1, and the result gives us both a full elastic matching between $\beta_1$ and $\beta_2$, as well as a nucleotide sequence alignment. Figure 2 shows an example of a sequence alignment obtained in this way.

Since paths in the grid are required to start at (0,0) and end at (1,1), the alignments produced by our method always match the first nucleotide in $\beta_1$ to the first nucleotide in $\beta_2$; similarly, the last nucleotides are always matched together. To work around this restriction, we add a preprocessing step in which a dummy point is added to the beginning and end of each curve in $\mathbb{R}^6$. The 3D coordinates $(x_i, y_i, z_i)$ of these points are obtained by linear extrapolation, and the sequence coordinates, $(u_i, v_i, w_i)$, are duplicates of the sequence coordinates of the first (or last) real point. For the example in Figure 1, we used one dummy point at the beginning and end of each curve. The first row and column and the last row and column in the grid correspond to these extra points. In Figure 2, the extra points have been removed for


Figure 1. Piecewise-linear re-parameterization $\gamma$ yielding an optimal matching between 1h3e chain B and 1gtr chain B. The matching grid has 82 columns, corresponding to the 80 points on the backbone of 1h3eB plus the dummy points at each end. Similarly, the grid has 76 rows, corresponding to the 74 points on the backbone of 1gtrB plus the dummy points at each end.

display, along with any matches involving them. The number of added dummy points is arbitrary and quite easy to change, but one or two should be enough to get around the ‘endpoints’ matching restriction. In this example, we scale the structures to unit length and let the sequence weight in the matching be $\lambda = 5$; the resulting geodesic distance between the two structures is 0.8966. The range of ESA distances for any two unscaled-length RNAs is $[0, \infty)$. The geodesic distance of a structure to itself is zero. In many applications of shape analysis, the lengths of objects are often scaled to unit length. In such cases, distances have an upper bound of $\pi/2$.

Statistics of tertiary structures using 3D shape and sequence

With a formal way to measure the distances between RNA structures, we can compute some important statistics for these shapes. Note that we can only do statistics of shapes if there is a true notion of shape distance. In particular, we would like to calculate the mean and covariance and even impose probability distributions for a given set of RNA structures and, if desired, their nucleotide sequences as well. Computing the mean and covariance in non-linear (in this case, spherical) manifolds is not straightforward, as the shapes are not in a vector space. To get around this limitation, we use the linear properties of the tangent space at each point of $\mathcal{S}$. Specifically, let $\beta_1, \beta_2, \ldots, \beta_n$ be a given set of RNA structures, represented by their SRVFs $q_1, q_2, \ldots, q_n$. The sample mean is defined as the Karcher mean, and the covariance structure is obtained using the differential geometry of the q-function space. For mathematical details, see the appendix.

Mean shapes and probability distributions of RNA structure families/classes can be very useful in automatic classifications of new structures. For example, mean shapes can serve as filters to quickly narrow down the


Figure 2. The nucleotide sequence alignment between 1h3eB and 1gtrB obtained from the matching shown above. The black lines represent the matching between nucleotides of each structure which are labelled and colour coded.

mean shape calculated from the structures in Figure 3a; Figure 3c shows five sampled structures on each of the top three variance components; and Figure 3d shows some randomly sampled structures from the distribution derived from the set of structures shown in Figure 3a. In this study, the calculated statistics are on the shapes only, and the sequences are used to refine the overall registration of the structures to their sample mean. In principle, these statistics can also be calculated using a combination of nucleotide sequences and structures, if desired.

RESULTS

Benchmark dataset

To test the performance of ESA and compare it with previous methods, we use a benchmark dataset, FSCOR (8, 9), compiled from the SCOR database (30). FSCOR contains 419 RNA structures in 168 functional classes. The histogram of chain lengths for the 419 RNA molecules is plotted in Figure 4a, and the histogram of the number of members in each class is plotted in Figure 4b. We can see that most of the RNA molecules have fairly small sizes, with