Current Bioinformatics, 2018, 13, 290-298
RESEARCH ARTICLE
ISSN: 1574-8936 eISSN: 2212-392X
Impact Factor: 0.6

SARELI: Sequence Alignment by Radial Evaluation of Local Interactions

Ricardo Ortega1, Arturo Chavoya1,*, Cuauhtémoc López-Martín1 and Luis Delaye2

1 Departamento de Sistemas de Información, CUCEA, Universidad de Guadalajara, Guadalajara, Mexico
2 Departamento de Ingeniería Genética, Centro de Investigación y de Estudios Avanzados Unidad Irapuato, Irapuato, Mexico
Abstract:

Background: A robust guide tree is necessary as a first step for the multiple sequence alignment of proteins. The guide tree is normally generated using an initial distance matrix based on a particular distance metric.

Objective: A new tool for generating guide trees for multiple protein sequence alignment is presented.

Method: The initialization step of the progressive alignment algorithm uses a novel metric termed Radial Distance, which estimates the variation around symbols in two sequences; after the initial distance matrix is generated, a guide tree is created using the neighbor joining algorithm. The guide trees generated with our tool were then fed independently into MUSCLE and Clustal Omega (as these methods can accept external guide trees) to produce the final alignments.

Results: The results from our approach in the alignment of the sequences were compared with those from MUSCLE and Clustal Omega (with their original guide trees) on the BAliBASE, SABRE, and PREFAB protein sequence databases. For scoring the alignments, we obtained the sum of pairs score and the column score against the reference alignments of the protein benchmark databases used. The alignments produced using the guide trees generated by SARELI obtained statistically better sum of pairs and column scores than those using the original guide trees from MUSCLE and Clustal Omega on the SABRE and PREFAB databases.

Conclusion: Our proposed approach can generate guide trees that can be used by established multiple sequence alignment methods for proteins.

Article history: Received: June 05, 2017; Revised: November 22, 2017; Accepted: January 26, 2018. DOI: 10.2174/1574893613666180130143055
Keywords: Multiple sequence alignment, protein sequences, radial distance, neighbor joining algorithm, sum of pairs, column score.

*Address correspondence to this author at the Departamento de Sistemas de Información, CUCEA, Universidad de Guadalajara, Periférico Norte 799L308, Núcleo Belenes, Zapopan, Jalisco, CP 45100, Mexico; Tel/Fax: +52(33)3770-3352.

© 2018 Bentham Science Publishers

1. INTRODUCTION

The multiple sequence alignment (MSA) problem is one of the most common and important in bioinformatics, since its solution helps to predict protein structure and function, and to perform phylogenetic analyses. Even though there have been advances in the performance of the alignment algorithms, finding accurate alignments is still a problem. Aligned biological sequences allow the scientific community to discover patterns in structural and functional motifs (highly conserved regions in multiple sequence alignments) of nucleotide or protein sequences. Once these patterns are identified, the function of the molecules can be inferred [1]. This analysis has already been used in AIDS and cancer studies [2-5] and a wide range of medical and pharmaceutical studies [6].

By definition, the multiple sequence alignment problem aims at obtaining a set of sequences of the same length that matches as many homologous symbols (representing a protein or a nucleotide segment) as possible from the initial sequences; this set may contain gap symbols to displace the columns of the sequences in order to obtain a better alignment. It has been proven that obtaining an optimal alignment under a score function is an NP-complete problem [7]. Dynamic programming algorithms for aligning multiple sequences can obtain the optimal alignment for a given score scheme, but their cost grows exponentially, as O(L^N) for N sequences of length L, which makes them computationally prohibitive when all sequences are processed at once [8, 9]. This restriction mainly shows up as a lack of sufficient RAM to hold the whole matrix, which forces the computer to issue more read/write commands on the hard disk to use virtual memory, so that results are obtained only after an inconveniently long time.
1.1. Guide Trees

Progressive alignment algorithms align the sequences by pairs, replacing a maximum global score alignment with a local one, and depend strongly on a guide tree for aligning the sequence pairs in an appropriate order [10]. The guide tree is constructed by clustering algorithms, which position the closest sequences at the bottom of the tree, and the most dissimilar sequences at the top. Each node in the tree represents an alignment pair, which can be a sequence aligned to a sequence, a sequence aligned to a previous alignment, or an alignment aligned to another alignment. The two most common clustering algorithms used to construct guide trees are the unweighted pair group method with arithmetic mean (UPGMA) [11] and the neighbor joining algorithm (NJ) [12].

A robust guide tree for an alignment can be obtained by applying a consistent and precise metric at the start of the clustering algorithm; failing to correctly identify the closest related sequences can have a negative impact on the final alignment. Additional heuristics and metrics can refine the tree at intermediate steps of the process in order to reduce this sensitivity and generate a better alignment [13-15]. After initial alignments, iterative refinement methods can be applied to enhance the results, such as those used in MAFFT [16] and MUSCLE [14], or by Berger and Munson [17] and Gotoh [18], or by modifying the gap cost function as done by Yamada et al. [19] and Al-Shatnawi et al. [20] in order to obtain better alignments.

In the present work, a novel metric termed Radial Distance is proposed to enhance the sum of pairs score and the column score on the aligned sequences by generating a distance matrix used by a clustering algorithm. After testing the UPGMA and the NJ clustering algorithms, we decided to use the latter, as it yielded better scores when coupled with our distance matrix generation approach.
As the NJ algorithm takes as input the distance between each pair of sequences, these distances need to be calculated first. Considering that the Radial Distance metric proposed in this work takes into account the effect that adjacent symbols within a certain radius can have on the different symbols to be aligned for the construction of the initial distance matrix, we named our tool SARELI, which stands for Sequence Alignment by Radial Evaluation of Local Interactions. MSA tools such as MUSCLE [14] and Clustal Omega [21] can use the guide trees output by SARELI, which are constructed by the NJ algorithm when fed with the distance matrix generated using the Radial Distance.

1.2. Distance Metrics

Distances that are used to measure the difference between sequences of symbols in general, such as Hamming distance [22], Levenshtein distance [23], Jaro-Winkler string comparison [24, 25], and Wu-Manber string matching [26], measure the difference between sequences by counting absolute mismatches (only considering whether the symbols are the same or not). However, in the protein sequence alignment problem, similarity measures [27-29] need to be used, as the symbols may represent not only absolute mismatches, but also possible substitutions between symbols
corresponding to amino acids that can have a similar biological function. Using similarity measures, such as those found in substitution matrices, makes it possible to take the evolution of biological residues in the analyzed sequences into account, so that the progressive sequence alignment is ordered correctly.

Ibrahim and Rashid [30] published a review of well-known available software packages used to align multiple sequences and how they performed in the reconstruction of phylogenetic trees. These authors also showed how these algorithms measure the distance between the sequences at the beginning of the guide tree initialization. MUSCLE, Clustal Omega and MAFFT are amongst these computer programs and were used, along with T-Coffee, for comparison against our proposed approach, which feeds the guide trees generated by SARELI independently into MUSCLE and Clustal Omega.

T-Coffee (Tree-Based Consistency Objective Function for Evaluation) [31] combines weighted local and global pairwise alignments at the beginning of the distance generation process, as well as within the progressive alignment method. This distance is calculated by using Clustal W [32] to align the sequences pairwise in a global sense, and then using a LALIGN [33] implementation to perform a local alignment of the same pairs of sequences. Once this process is finished, an initial weight is given to the matches on both alignments, equal to the percent identity [34] within the pairwise alignment from which it comes. A second step that extends the first library of pairwise alignments is performed by a heuristic algorithm, yielding another weighted library of the symbols of the sequences, which allows obtaining a fine-tuned guide tree, as well as better control of the progressive alignment algorithm.

The K-mer distance counts the identical residues in groups of K-tuples between the sequences to align; the higher the number of matches, the closer the sequences are [32].
As described in [35], to calculate the distance between two sequences, this metric takes a set of K-tuples from each sequence ($W_x$, $W_y$), and then joins both sets with the union function to yield the set $W_{xy}$. The K-mer distance (KmerDis) is calculated by

$$\mathrm{KmerDis} = \sum_{i=1}^{m} \left( C_{ix} \oplus C_{iy} \right),$$

where $m$ is the length of $W_{xy}$, and $C_x$ and $C_y$ are sets with a binary substitution (1 if found; 0 otherwise) when a tuple is found in the corresponding K-tuple sets, i.e. for each K-tuple in set $W_{xy}$ found in sequence X, set $C_x$ includes the value of one, or zero if not found; set $C_y$ is constructed similarly. Thus, $C_x$ and $C_y$ have the same size as $W_{xy}$, but only include 1s and 0s; if both sequences are equal, the K-mer distance is zero.

MUSCLE and Clustal W use the K-mer distance for the initial distance matrix generation, whereas MAFFT uses a rough version of this distance for an initial calculation of similarity in which the number of K-mers shared by a pair of sequences is counted and considered as an approximation of the degree of similarity [36]. Clustal Omega uses a sampling method called mBed, which embeds each sequence in a space of n dimensions, where n is
proportional to the logarithm of the number of sequences, to speed up the guide tree calculation [21].
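The K-mer distance described above can be sketched in a few lines of Python. This is an illustrative reading of the description, not the code used by MUSCLE or Clustal W: the function names `k_tuples` and `kmer_distance` are hypothetical, the default `k = 3` is arbitrary, and the binary indicators are combined with an exclusive-or, which is an assumption that agrees with the statement that equal sequences have distance zero.

```python
def k_tuples(sequence, k):
    """Return the set of K-tuples (overlapping substrings of length k) in a sequence."""
    return {sequence[i:i + k] for i in range(len(sequence) - k + 1)}


def kmer_distance(seq_x, seq_y, k=3):
    """K-mer distance between two sequences, following the W_x, W_y, W_xy description.

    Assumption: the indicator vectors C_x and C_y are combined with XOR, so a
    K-tuple contributes 1 only when it occurs in exactly one of the sequences.
    """
    w_x = k_tuples(seq_x, k)
    w_y = k_tuples(seq_y, k)
    w_xy = sorted(w_x | w_y)                            # union of both K-tuple sets
    c_x = [1 if t in w_x else 0 for t in w_xy]          # indicators for sequence X
    c_y = [1 if t in w_y else 0 for t in w_xy]          # indicators for sequence Y
    return sum(a ^ b for a, b in zip(c_x, c_y))


if __name__ == "__main__":
    # The two example sequences used in the article's figures.
    print(kmer_distance("IDAETMELHH", "ISGETMELHH", k=3))
    print(kmer_distance("IDAETMELHH", "IDAETMELHH", k=3))
```

Under this reading the distance is simply the number of K-tuples that are not shared, so it needs no prior alignment of the two sequences.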
1.3. Multiple Sequence Alignment Evaluation

In order to compare multiple alignments, the sum of pairs score is commonly used [10, 37-39], and is calculated by adding all the possible pairs from each column of the alignment, without repetition. In order to compare the quality of an alignment against a benchmark reference, the sum of pairs score (SPS) for the set of sequences A is calculated as

$$\mathrm{SPS}(A) = \frac{\sum_{i=1}^{M} S_i}{\sum_{i=1}^{M_r} S_{ri}},$$

where $M$ is the length of the alignment, $M_r$ is the number of columns in the reference alignment and $S_{ri}$ is the $S_i$ score in the reference alignment, the latter being calculated as

$$S_i = \sum_{j=1,\, j \neq k}^{N} \; \sum_{k=1}^{N} p_{ijk},$$

where $N$ is the number of sequences in the set and, if $A_{i1}, A_{i2}, \ldots, A_{iN}$ is the i-th column of alignment A, then for each pair of amino acids $A_{ij}$ and $A_{ik}$, $p_{ijk}$ is defined as 1 if $A_{ij}$ and $A_{ik}$ are aligned with each other in the reference alignment, and 0 otherwise [40].

On the other hand, the column score (CS) measures the proportion of completely aligned columns, which is compared against the reference alignment. This score is commonly used in combination with SPS to validate the alignments. The CS is calculated as

$$\mathrm{CS}(A) = \frac{\sum_{i=1}^{M} C_i}{M},$$

where A is the set of sequences, $M$ is the length of the sequences in the alignment, and $C_i = 1$ if all the residues in the i-th column are aligned, and 0 otherwise [40].

2. METHOD

This section describes the Radial Distance (RD) metric used for the construction of the initial distance matrix, which in turn is used by the NJ algorithm to produce the guide trees fed into MUSCLE and Clustal Omega (named Clustal Ω, from now on). The protein benchmark databases used in the present study are also described, along with the proposed approach.

2.1. Radial Distance

As a novel approach to measure the distance between two sequences, we propose a metric that takes into consideration not only the symbol in the column to be aligned, but also the symbols around the column. This metric takes a given parameter that limits the number of symbols around each column of the pairwise alignment to be considered into the sum. As the distance from the referenced column is increased, the score is decreased with an asymptotic function. The Radial Distance between sequences A and B is calculated as

$$\mathrm{RD}(A, B) = \sum_{i=1}^{M} \; \sum_{j=i-R>0}^{i+R \leq M} \frac{\mathrm{Score}(A_i, B_j)}{\mathrm{Abs}(i-j)+1},$$

where $M$ is the length of the aligned sequences and $R$ is the radial parameter value that indicates how far the weights of the side columns will affect the score. The Score function can be a substitution matrix such as PAM [41] or BLOSUM [42], or any other function that assigns scores for each pair of residues; in our implementation we used the BLOSUM62 substitution matrix. Before measuring the radial distance between two sequences, the pair is aligned using dynamic programming with gap penalties, so the sequences are of equal length.

An example of the calculation of the RD for a radius value of 2 is shown in Fig. (1), where the first step of the process is presented for the first contribution to the RD value. The symbol nc represents a value not yet computed. At the beginning and at the end of the sequences, the radius may not encompass all the symbols, as only a part of the radius is considered. In Fig. (1), the radius on the left is not considered, as there are no symbols to the left of the first symbol.

[Figure 1: R = 2, i = 1, j = 1 to 3. Sequence A: I D A E T M E L H H. Sequence B: I S G E T M E L H H. BLOSUM62 Score(A_i, B_j): 4, -2, -4; weights: 1, 1/2, 1/3. Radial Distance row: 1.67 for position 1, nc for positions 2 to 10.]

Fig. (1). Example of the calculation for the first step of the Radial Distance for sequences A and B.
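The Radial Distance definition can be illustrated with a short Python sketch. It is not the SARELI implementation (which the article states was written in C#); the names `radial_contribution`, `radial_distance`, `FIGURE_SCORES` and `figure_score` are hypothetical. The score table holds only the BLOSUM62 entries that can be read from Figs. (1) and (2), so a complete substitution matrix, and a rule for scoring gap symbols (which the text does not specify), would be needed to compute a full distance.

```python
def radial_contribution(seq_a, seq_b, i, radius, score):
    """Contribution of column i (0-based) to the Radial Distance.

    The window j = i - radius .. i + radius is clipped to the sequence ends,
    and each score is divided by Abs(i - j) + 1.
    """
    last = len(seq_b) - 1
    lo = max(0, i - radius)
    hi = min(last, i + radius)
    return sum(score(seq_a[i], seq_b[j]) / (abs(i - j) + 1) for j in range(lo, hi + 1))


def radial_distance(seq_a, seq_b, radius, score):
    """Sum of the contributions of every column of two equal-length aligned sequences."""
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be pairwise aligned to the same length first")
    return sum(radial_contribution(seq_a, seq_b, i, radius, score) for i in range(len(seq_a)))


# Partial score table: only the BLOSUM62 values shown in Figs. (1) and (2).
FIGURE_SCORES = {
    ("I", "I"): 4, ("I", "S"): -2, ("I", "G"): -4,
    ("M", "E"): -2, ("M", "T"): -1, ("M", "M"): 5, ("M", "L"): 2,
}


def figure_score(a, b):
    """Look up a residue pair in the partial table (raises KeyError for other pairs)."""
    return FIGURE_SCORES[(a, b)]


if __name__ == "__main__":
    a, b = "IDAETMELHH", "ISGETMELHH"
    print(round(radial_contribution(a, b, 0, 2, figure_score), 2))  # first step, Fig. (1)
    print(round(radial_contribution(a, b, 5, 2, figure_score), 2))  # sixth step, Fig. (2)
```

Passing the score function as a parameter mirrors the statement that any function assigning scores to residue pairs can be used in place of BLOSUM62.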
In Fig. (2), the sixth step is calculated for position 6 of the RD for two example sequences; at the end of the process, the sum of all the individual values represents the radial distance between the two sequences.

As stated in [43], for the use of the NJ algorithm, the distance can be a dissimilarity. Since the radial distance works with similarities, it fulfills neither the symmetry condition nor the triangle inequality [44], as RD(A,B) can be different from RD(B,A); however, we found that this difference was small. Thus, to use the NJ algorithm, only the lower triangular part of the distance matrix was calculated, and that part was used to generate the tree.

2.2. Protein Sequence Benchmark Databases

Three different benchmark databases were used in this work to corroborate the validity of our proposed approach: BAliBASE 3 [45], SABRE [46] and PREFAB 4.0 [14].
[Figure 2: R = 2, i = 6, j = 4 to 8. Sequence A: I D A E T M E L H H. Sequence B: I S G E T M E L H H. BLOSUM62 Score(A_i, B_j): -2, -1, 5, -2, 2; weights: 1/3, 1/2, 1, 1/2, 1/3. Radial Distance row: 1.67, -1.34, -0.67, 3.34, 3.84, 3.5 for positions 1 to 6, nc for positions 7 to 10.]

Fig. (2). Example of the calculation for the sixth step of the Radial Distance for sequences A and B.
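As a check on the sixth step, the contribution can be reproduced by hand from the values shown in Fig. (2). With R = 2 and i = 6, residue M of sequence A is compared with residues E, T, M, E and L of sequence B (j = 4 to 8), and each BLOSUM62 score is divided by Abs(i - j) + 1. The symbol $\mathrm{RD}_6$ for this single contribution is introduced here only for illustration.

```latex
\begin{aligned}
\mathrm{RD}_6 &= \frac{-2}{3} + \frac{-1}{2} + \frac{5}{1} + \frac{-2}{2} + \frac{2}{3} \\
              &\approx -0.67 - 0.5 + 5 - 1 + 0.67 = 3.5
\end{aligned}
```

The two outermost terms cancel, so the result matches the value of 3.5 shown for position 6 in the figure.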
The BAliBASE database was initially designed as an evaluation resource for addressing problems that arise when aligning complete sequences [47] and has been widely used for testing and comparison purposes [13, 14, 48, 49]. SABRE is a set of sequences derived from the SABmark database [50] that was selected to be used in MSA comparisons [51]. Finally, PREFAB is a database built by an automated protocol fed by an ad hoc set of sequences from published works [14].

2.3. Proposed Approach

The process followed by SARELI for producing the guide tree for each file for the three databases considered (BAliBASE, SABRE, and PREFAB) is described next. In our proposed approach, the construction of the initial distance matrix is achieved by calculating the Radial Distance using different radius values, and then feeding each matrix to the NJ algorithm to generate the corresponding guide tree in PHY format. These guide trees were then provided independently to MUSCLE and Clustal Ω to proceed with the rest of the alignment process, as these tools can use the guide trees generated by other software.

In the case of MUSCLE, a command line example of the use of an external guide tree in PHY format applied to a sequence set in TFA format to generate an alignment in FASTA format is

muscle -usetree "set.phy" -in "set.tfa" -out "set.fasta"

whereas for Clustal Ω a similar command line example is

clustalo --guidetree-in "set.phy" -i "set.tfa" -o "set.fasta"

where the file formats are the same as in the example for MUSCLE.

When producing the guide tree for a sequence set with SARELI, the user can specify a range of radius values and the algorithm will try each one in turn and output the guide tree files corresponding to those radius values.
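The two command lines above, together with the SARELI invocation and output file naming given in the article's usage example (`SARELI.exe -in BB30030.tfa -r 3,10`, one output file per radius named after the sequence file, an underscore, the radius and the PHY extension), can be chained in a small script. The following Python sketch only assembles and runs those commands; the helper names, the upper-case spelling of the extension, the `_muscle.fasta` output suffix, and the assumptions that all executables are on the PATH and that SARELI writes its trees to the working directory are illustrative and not taken from the article.

```python
import subprocess
from pathlib import Path


def sareli_command(sequence_file, first_radius, last_radius, executable="SARELI.exe"):
    """Argument list for a SARELI run over a range of radius values."""
    return [executable, "-in", str(sequence_file), "-r", f"{first_radius},{last_radius}"]


def guide_tree_names(sequence_file, first_radius, last_radius):
    """Expected guide tree files: <name without extension>_<radius>.PHY (assumed spelling)."""
    stem = Path(sequence_file).stem
    return [f"{stem}_{radius}.PHY" for radius in range(first_radius, last_radius + 1)]


def muscle_command(tree_file, sequence_file, output_file):
    """MUSCLE call with an external guide tree, as in the article's example."""
    return ["muscle", "-usetree", tree_file, "-in", sequence_file, "-out", output_file]


def clustalo_command(tree_file, sequence_file, output_file):
    """Clustal Omega call with an external guide tree, as in the article's example."""
    return ["clustalo", "--guidetree-in", tree_file, "-i", sequence_file, "-o", output_file]


def align_with_all_radii(sequence_file, first_radius=3, last_radius=10):
    """Generate one guide tree per radius with SARELI, then align with MUSCLE for each tree."""
    subprocess.run(sareli_command(sequence_file, first_radius, last_radius), check=True)
    outputs = []
    for tree in guide_tree_names(sequence_file, first_radius, last_radius):
        output = f"{Path(tree).stem}_muscle.fasta"      # hypothetical output name
        subprocess.run(muscle_command(tree, str(sequence_file), output), check=True)
        outputs.append(output)
    return outputs
```

Keeping the command construction separate from `subprocess.run` makes it possible to inspect the exact argument lists before any external program is started.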
After a number of runs, we heuristically determined that the range of radius values from 3 to 10 was sufficient to render high SPS and CS scores for the sequence files from the benchmark databases; we thus specified this range of radius values with SARELI in all runs. For the reported alignment scores, of the eight guide trees generated for each sequence set, we used the guide tree that provided the best CS score, or alternatively the best SPS score in case of a tie of the CS scores. As described in the next section, we did not find a single radius value that worked well for all sequence sets.

Although primarily we were interested in comparing the alignments using the guide trees generated by SARELI against the corresponding alignments from MUSCLE and Clustal Ω with their respective default guide trees, we decided to also compare the alignments using SARELI against those from other well-known MSA methods. We used the SPS and CS scores as described in Section 1.3 for measuring the quality of the alignments employing the QScore software [52] on the BAliBASE, SABRE and PREFAB benchmark databases. Each sequence set from these databases was aligned using the guide trees generated by SARELI and fed into MUSCLE and Clustal Ω, and the scores were compared against those from MUSCLE [14], Clustal Ω [21], MAFFT [36], and T-Coffee [31]. In the case of MAFFT, as suggested in the manual for this method, three different versions of the algorithm were used to obtain more accurate alignments: E-INS-i, G-INS-i, and L-INS-i [53], which were named in the present work as MAFFT GE, MAFFT GL, and MAFFT LO, respectively.

All the runs were performed on a PC with one 8-core 3.52-GHz FX-8320 AMD processor, 16 GB of RAM, and an NVIDIA GeForce GT 730 card. The operating system used was Windows 10, whereas the library containing the alignment algorithm for SARELI was coded in C# using the Visual Studio 2013 IDE and compiler. The statistical analysis was performed using STATGRAPHICS Centurion XVI.

3. RESULTS AND DISCUSSION

We used SARELI coupled with MUSCLE and Clustal Ω (termed SARELI & MUSCLE and SARELI & Clustal Ω, respectively) to align the sequence files in the BAliBASE, SABRE and PREFAB databases and did the same with MUSCLE, Clustal Ω (with their respective default guide trees), the three variants of MAFFT, and T-Coffee. After the alignment files were obtained for all the methods, we calculated their sum of pairs score (SPS) and column score (CS) as defined in Section 1.3 using the QScore software [52] and the reference alignments for each database. Once all the results were collected, we proceeded to analyze them as described next.

After running kurtosis, skewness, chi-squared, and Shapiro-Wilk statistical tests on the BAliBASE, SABRE, and PREFAB databases, the scores obtained for SPS and CS for each set of samples did not show a normal distribution; therefore, we used a non-parametric statistical test to compare the medians of the scores. Since the evaluation consisted of more than two sets of related samples, we used the Friedman test to determine if there was at least one score statistically different; if the p-value indicates that there is a difference among the medians, a Wilcoxon test should be applied to compare the scores by pairs of sample sets. Table 1 shows the results after applying the Friedman test, whereas Tables 2-4 show the medians by database.
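To make the scoring and the guide tree selection rule concrete, the following Python sketch computes SPS and CS following the definitions of Section 1.3 and then picks the radius whose alignment has the best CS, using SPS to break ties. It is not the QScore software used in the study. The function names are hypothetical, alignments are assumed to be lists of equal-length strings with `-` as the gap symbol and the same sequence order as the reference, and two further assumptions are made: pairs involving a gap are not counted, and a column counts as correct only when it reproduces a whole reference column.

```python
from itertools import combinations

GAP = "-"


def residue_columns(alignment):
    """Describe each column by the residue index of every sequence (None for a gap)."""
    counters = [0] * len(alignment)
    columns = []
    for column in zip(*alignment):
        signature = []
        for s, symbol in enumerate(column):
            if symbol == GAP:
                signature.append(None)
            else:
                signature.append(counters[s])
                counters[s] += 1
        columns.append(tuple(signature))
    return columns


def aligned_pairs(alignment):
    """Set of residue pairs placed in the same column (gaps ignored)."""
    pairs = set()
    for signature in residue_columns(alignment):
        for j, k in combinations(range(len(signature)), 2):
            if signature[j] is not None and signature[k] is not None:
                pairs.add((j, signature[j], k, signature[k]))
    return pairs


def sps(test, reference):
    """Fraction of the reference residue pairs that are also aligned in the test alignment."""
    ref_pairs = aligned_pairs(reference)
    return len(aligned_pairs(test) & ref_pairs) / len(ref_pairs)


def cs(test, reference):
    """Fraction of test columns that reproduce a reference column, divided by M."""
    ref_columns = set(residue_columns(reference))
    test_columns = residue_columns(test)
    return sum(1 for c in test_columns if c in ref_columns) / len(test_columns)


def pick_best_radius(candidates, reference):
    """Radius with the best CS; SPS breaks ties (first radius wins if both are tied)."""
    return max(candidates, key=lambda r: (cs(candidates[r], reference),
                                          sps(candidates[r], reference)))


if __name__ == "__main__":
    reference = ["ACGT", "AC-T"]
    candidates = {3: ["ACGT", "ACT-"], 4: ["ACGT", "AC-T"]}
    print(pick_best_radius(candidates, reference))
```

Counting unordered pairs in both the numerator and the denominator gives the same ratio as the double sum over j and k in the SPS definition, since every pair would otherwise be counted twice in both.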
Table 1. Friedman test p-values for the databases.

Database    SPS       CS
BAliBASE    0.0000    0.0000
SABRE       0.0000    0.0000
PREFAB      0.0000    0.0000

According to the p-values from Table 1, at least one score was statistically different for each database; thus, the Wilcoxon test was applied and the results by database are presented in Tables 5-10. In the cases of statistically significant difference and equal medians, the corresponding Box-and-Whisker plots with median notch were used to determine which of the methods performed better.

Table 2. BAliBASE medians.

Method               SPS       CS
SARELI & MUSCLE      0.9075    0.6010
SARELI & Clustal Ω   0.8830    0.5480
MUSCLE               0.8995    0.6060
Clustal Ω            0.9030    0.6280
MAFFT GE             0.9145    0.6555
MAFFT GL             0.9225    0.6800
MAFFT LO             0.9215    0.6680
T-Coffee             0.9170    0.6400

Table 3. SABRE medians.

Method               SPS       CS
SARELI & MUSCLE      0.6050    0.3500
SARELI & Clustal Ω   0.5510    0.2790
MUSCLE               0.5510    0.2350
Clustal Ω            0.5500    0.2300
MAFFT GE             0.5600    0.2570
MAFFT GL             0.5960    0.2880
MAFFT LO             0.5930    0.2820
T-Coffee             0.6310    0.3330

Table 4. PREFAB medians.

Method               SPS/CS
SARELI & MUSCLE      0.7810
SARELI & Clustal Ω   0.7680
MUSCLE               0.7555
Clustal Ω            0.7680
MAFFT GE             0.7870
MAFFT GL             0.7910
MAFFT LO             0.7940
T-Coffee             0.7765

According to the sum of pairs score criterion from Tables 2, 5 and 6, when testing the methods using BAliBASE, SARELI & MUSCLE was statistically better than MUSCLE at 99% of confidence, and better than Clustal Ω at 95% of confidence. As for the column score, SARELI & MUSCLE was statistically equal to Clustal Ω.

Table 5. Wilcoxon test p-values for SARELI & MUSCLE with the BAliBASE database.

Method       SPS       CS
MUSCLE       0.0003    0.0000
Clustal Ω    0.0146    0.8868
MAFFT GE     0.0000    0.0004
MAFFT GL     0.0000    0.0000
MAFFT LO     0.0000    0.0000
T-Coffee     0.0000    0.0000

Table 6. Wilcoxon test p-values for SARELI & Clustal Ω with the BAliBASE database.

Method       SPS       CS
MUSCLE       0.0000    0.0063
Clustal Ω    0.0000    0.0000
MAFFT GE     0.0000    0.0000
MAFFT GL     0.0000    0.0000
MAFFT LO     0.0000    0.0000
T-Coffee     0.0000    0.0000

When comparing the methods using the SABRE database (Tables 3, 7 and 8), with the exception of T-Coffee, SARELI & MUSCLE was statistically better than all of the other methods, both for SPS and CS, with 99% of confidence, being statistically equal to T-Coffee on both scores. On the other hand, for SPS, SARELI & Clustal Ω was statistically better than Clustal Ω and MUSCLE at 99% and 95% of confidence, respectively, and statistically equal to the three variants of MAFFT, whereas for CS, it was statistically better than Clustal Ω and MAFFT GE at 95% of confidence, better than MUSCLE at 99% of confidence, and equal to MAFFT GL and MAFFT LO.

Table 7. Wilcoxon test p-values for SARELI & MUSCLE with the SABRE database.

Method       SPS       CS
MUSCLE       0.0000    0.0000
Clustal Ω    0.0000    0.0009
MAFFT GE     0.0000    0.0000
MAFFT GL     0.0000    0.0000
MAFFT LO     0.0007    0.0000
T-Coffee     0.1972    0.2334
Table 8. Wilcoxon test p-values for SARELI & Clustal Ω with the SABRE database.

Method       SPS       CS
MUSCLE       0.0372    0.0000
Clustal Ω    0.0048    0.0143
MAFFT GE     0.7228    0.0487
MAFFT GL     0.4720    0.2341
MAFFT LO     0.2010    0.1563
T-Coffee     0.0000    0.0000
As for the PREFAB database, Tables 4, 9 and 10 show that regarding both SPS and CS, SARELI & MUSCLE was statistically better than MUSCLE, Clustal Ω and T-Coffee at 99% of confidence, and statistically equal to MAFFT LO, whereas SARELI & Clustal Ω was statistically better than Clustal Ω and MUSCLE at 99% of confidence, for both scores. From the way the sequence sets were constructed for this database, the SPS and CS scores are always the same; thus, only one column is shown for both scores.

Table 9. Wilcoxon test p-values for SARELI & MUSCLE with the PREFAB database.

Method       SPS/CS
MUSCLE       0.0000
Clustal Ω    0.0000
MAFFT GE     0.0196
MAFFT GL     0.0404
MAFFT LO     0.1330
T-Coffee     0.0000

Table 10. Wilcoxon test p-values for SARELI & Clustal Ω with the PREFAB database.

Method       SPS/CS
MUSCLE       0.0000
Clustal Ω    0.0000
MAFFT GE     0.0000
MAFFT GL     0.0000
MAFFT LO     0.0000
T-Coffee     0.0000

A summary of the comparison results from SARELI & MUSCLE against MUSCLE is presented in Table 11, whereas a similar comparison of SARELI & Clustal Ω against Clustal Ω is shown in Table 12. As seen from the previous results, when the guide trees generated by SARELI were used on MUSCLE and Clustal Ω, the SPS and CS scores for the alignments were statistically better than the alignments produced by these two methods and their default guide trees on the SABRE and PREFAB databases. For BAliBASE, the alignments with SARELI & MUSCLE were statistically better than those from MUSCLE only for SPS, whereas Clustal Ω alone was better than SARELI & Clustal Ω for both SPS and CS. As a possible explanation for the previous result, it has been argued that many BAliBASE sequence sets contain structures with uncertain homology, as only 13% of the sequences in this database have known structure [51].

Table 11. Summary of the comparisons of SARELI & MUSCLE against MUSCLE.

Database    SPS    CS
BAliBASE    ++     –
SABRE       ++     ++
PREFAB      ++     ++

++: Better at 99% confidence; –: Worse

Table 12. Summary of the comparisons of SARELI & Clustal Ω against Clustal Ω.

Database    SPS    CS
BAliBASE    –      –
SABRE       ++     +
PREFAB      ++     ++

++: Better at 99% confidence; +: Better at 95% confidence; –: Worse

As with the previous two tables, the summaries from the comparison results of SARELI & MUSCLE, and SARELI & Clustal Ω against the other methods are presented in Tables 13 and 14, respectively.

Table 13. Summary of the comparisons of SARELI & MUSCLE against the other methods.

Method       Score    BAliBASE    SABRE    PREFAB
Clustal Ω    SPS      +           ++       ++
Clustal Ω    CS       =           ++       ++
MAFFT GE     SPS      –           ++       –
MAFFT GE     CS       –           ++       –
MAFFT GL     SPS      –           ++       –
MAFFT GL     CS       –           ++       –
MAFFT LO     SPS      –           ++       =
MAFFT LO     CS       –           ++       =
T-Coffee     SPS      –           =        ++
T-Coffee     CS       –           =        ++

++: Better at 99% confidence; +: Better at 95% confidence; –: Worse; =: No statistically significant difference
As seen in Tables 13 and 14, when comparing the alignments using SARELI and MUSCLE or Clustal Ω against those from the three variants of MAFFT and T-Coffee, mixed results were obtained. For example, with the exception of T-Coffee, the alignments from SARELI & MUSCLE were statistically better for both SPS and CS than all the other methods on the SABRE database, but not for all the cases on the other databases. These comparisons were made in order to get a glimpse of the overall performance of the guide trees from SARELI against those from other methods, but a more accurate comparison would be to replace the guide trees from the three variants of MAFFT and T-Coffee with those from SARELI, and compare the alignments against those from these same methods and their default guide trees. Since there is no straightforward way to replace the default guide trees from MAFFT and T-Coffee, we decided not to attempt these replacements.

Table 14. Summary of the comparisons of SARELI & Clustal Ω against the other methods.

Method       Score    BAliBASE    SABRE    PREFAB
MUSCLE       SPS      –           +        ++
MUSCLE       CS       –           ++       ++
MAFFT GE     SPS      –           =        –
MAFFT GE     CS       –           +        –
MAFFT GL     SPS      –           =        –
MAFFT GL     CS       –           =        –
MAFFT LO     SPS      –           =        –
MAFFT LO     CS       –           =        –
T-Coffee     SPS      –           –        –
T-Coffee     CS       –           –        –

++: Better at 99% confidence; +: Better at 95% confidence; –: Worse; =: No statistically significant difference

With respect to the radius value to use for SARELI, we did not find a clear pattern that allowed us to recommend a particular value. However, after trying different combinations of radius values for the comparisons of SARELI & MUSCLE and SARELI & Clustal Ω against MUSCLE and Clustal Ω, respectively, we found that by selecting the guide tree with the best scores from the five radius values 3, 5, 7, 9 and 10, we obtained similar alignment results, with at least 95% of confidence, as using the eight values from 3 to 10 as described previously (results not shown). We thus recommend establishing a range of radius values when using SARELI, or selecting one or more of the five radius values mentioned above, and using the value that renders the best scores (unreferenced sum of pairs and column scores, or some other score of interest) for the particular sequence set under test.

SARELI is currently only available for the alignment of protein sequences and is distributed as a command line tool for Windows that can generate guide trees in PHY format. An example of a single run is

SARELI.exe -in BB30030.tfa -r 3,10

where the -in parameter is followed by the name of the file with the sequences to align, and -r specifies a comma-separated range in which to search for the best radius value for the Radial Distance metric. The default output is a file in PHY format for each radius value tested; the name of each output file is automatically generated starting with the name of the sequence set file (without the file extension), followed by an underscore and the radius value from the range provided, and ending with the “PHY” extension.

For a description of all options, running sareli.exe without parameters will display the quick start guide. The SARELI source files are freely available and can be found at [54] for the main Windows binary, as well as the SARELI Library, which can be installed into the solution directly from the NuGet repositories by running “Install-Package SARELI_DLL” in the command line of the Package Manager Console.

The latest SARELI executable file can also be downloaded from [54]. This software has only been tested on the Microsoft Windows 10 operating system, and is released under the MIT License.

CONCLUSION
In this article we present a new approach that enhances the multiple sequence alignment process for proteins, with a novel metric for constructing the initial distance matrix used by the neighbor joining algorithm. This matrix is used to obtain a guide tree that seeks to maximize the sum of pairs score and column score by establishing the order in which the sequence pairs are to be aligned. The proposed metric was named Radial Distance, as it considers the effect adjacent symbols within a given radius can have on every symbol in a pair of aligning sequences. The initial distance matrix constructed using the Radial Distance metric is then passed to an implementation of the neighbor joining algorithm to produce a guide tree in PHY format. We named our proposed tool SARELI, which stands for Sequence Alignment by Radial Evaluation of Local Interactions.

SARELI produces guide trees that can be fed into MSA programs that can use external guide trees; we used MUSCLE and Clustal Omega as such programs. We compared the alignments produced by SARELI when combined with MUSCLE and with Clustal Omega against those from these two methods with their default guide trees. When using the guide trees from SARELI, the alignments produced with MUSCLE and Clustal Omega were statistically better than those from these two methods alone for both the sum of pairs and column scores on the SABRE and PREFAB databases. We recommend using SARELI as a specialized tool for generating guide trees fed into MUSCLE or Clustal Omega; this integration could be automated by programming a command line script. We especially recommend using the guide trees from SARELI when a better alignment than those provided by MUSCLE or Clustal Omega and their default guide trees is required. Alternatively, the parts of our proposed tool that produce the guide trees might be incorporated into other multiple alignment methods to test for the possibility of improved accuracy.
As future work, we would like to use our proposed metric with well-known algorithms from other packages to assess how it affects the scores obtained by those implementations. We also plan to exploit the parallel architectures of GPUs and cluster computing to improve execution performance. As for the determination of the radius parameter value, we used a heuristic method to find the best radius for each sequence set file, but it would be helpful if this parameter could be determined automatically from characteristics of the sequence set to be aligned, such as the initial distances, and the length and number of sequences. Furthermore, an iterative restructuring of the guide tree or the use of the Radial Distance at later steps of the alignment process could help obtain even better results.
CONSENT FOR PUBLICATION
Not applicable.
CONFLICT OF INTEREST
The authors declare no conflict of interest, financial or otherwise.
ACKNOWLEDGEMENTS
Declared none.