Evaluating the Accuracy and Efficiency of Multiple Sequence ...

3 downloads 0 Views 892KB Size Report
ali raza awan2, naeem aslam2,5, tanveer Hussain2, nasir naveed6, salman Qadri7, usman Waheed1 and ...... Virtual University of Pakistan especially Syed Shah Moham mad (Lecturer CS) .... Perez MT, Babar ME, Nadeem A, et al. MQAT: an ...
Open Access: Full open access to this and thousands of other papers at http://www.la-press.com.

Evolutionary Bioinformatics

Evaluating the Accuracy and Efficiency of Multiple Sequence Alignment Methods Muhammad Tariq Pervez1,2, Masroor Ellahi Babar3, Asif Nadeem2, Muhammad Aslam4, Ali Raza Awan2, Naeem Aslam2,5, Tanveer Hussain2, Nasir Naveed6, Salman Qadri7, Usman Waheed1 and Muhammad Shoaib4 1

Department of Computer Science, Virtual University of Pakistan, Lahore, Pakistan. 2Institute of Biochemistry and Biotechnology, University of Veterinary and Animal Sciences, Lahore, Pakistan. 3Department of Bioinformatics, Virtual University of Pakistan, Lahore, Pakistan. 4 Department of Computer Science and Engineering, University of Engineering and Technology, Lahore, Pakistan. 5Department of Computer Science, NFC Institute of Engineering and Technological Training, Multan, Pakistan. 6University of Koblenz-Landau, Germany. 7Department of Computer Science, Islamia University of Bahawalpur, Pakistan.

Abstract: A comparison of 10  most popular Multiple Sequence Alignment (MSA) tools, namely, MUSCLE, MAFFT(L-INS-i), MAFFT (FFT-NS-2), T-Coffee, ProbCons, SATe, Clustal Omega, Kalign, Multalin, and Dialign-TX is presented. We also focused on the significance of some implementations embedded in algorithm of each tool. Based on 10 simulated trees of different number of taxa generated by R, 400 known alignments and sequence files were constructed using indel-Seq-Gen. A total of 4000 test alignments were generated to study the effect of sequence length, indel size, dele­ tion rate, and insertion rate. Results showed that alignment quality was highly dependent on the number of deletions and insertions in the sequences and that the sequence length and indel size had a weaker effect. Overall, ProbCons was consistently on the top of list of the evaluated MSA tools. SATe, being little less accurate, was 529.10% faster than ProbCons and 236.72% faster than MAFFT(L-INS-i). Among other tools, Kalign and MUSCLE achieved the highest sum of pairs. We also considered BALiBASE benchmark datasets and the results relative to BAliBASE- and indel-Seq-Gen-generated alignments were consistent in the most cases. Keywords: Multiple Sequence Alignment Tools, comparative study of MSA tools, sum of pairs score, column score, evolutionary parameters Citation: Pervez et al. Evaluating the Accuracy and Efficiency of Multiple Sequence Alignment Methods. Evolutionary Bioinformatics 2014:10 205–217 doi: 10.4137/EBO.S19199. Received: August 7, 2014. ReSubmitted: October 9, 2014. Accepted for publication: October 14, 2014. Academic editor: Jike Cui, Associate Editor TYPE: Original Research Funding: Authors disclose no funding sources. Competing Interests: Authors disclose no potential conflicts of interest. Copyright: © the authors, publisher and licensee Libertas Academica Limited. This is an open-access article distributed under the terms of the Creative Commons CC-BY-NC 3.0 License. Correspondence: [email protected] Paper subject to independent expert blind peer review by minimum of two reviewers. All editorial decisions made by independent academic editor. Upon submission manuscript was subject to anti-plagiarism scanning. Prior to publication all authors have given signed confirmation of agreement to article publication and compliance with all applicable ethical and legal requirements, including the accuracy of author and contributor information, disclosure of competing interests and funding sources, compliance with ethical requirements relating to human and animal study participants, and compliance with any copyright requirements of third parties. This journal is a member of the Committee on Publication Ethics (COPE).

Introduction

Multiple Sequence Alignments (MSAs) have become highly scrutinized and a fundamental approach in several research domains in molecular biology and bioinformatics such as studies of epidemiology and virulence,1 drug design, 2 recon­ struction of phylogenetic tree, prediction of 3D structure, identifying conserved regions, 3–5 and finding molecular func­ tion.6–8 Dozens of algorithms have been developed as a part of an attempt to improve the accuracy of alignments, but still

there is not a single MSA method that may generate accurate alignments for all types of test cases.9 Manually refined repositories of MSAs such as BAliBASE,10 PREFAB,11 and SABmark12 are good sources of accurate alignments to gauge performance of various MSA programs, but they have a number of disadvantages such as due to the small size they do not cover the full range of scenarios of protein evolution and due to the uncertain positional homo­ logy assessing accuracy of the alignments becomes difficult.13

Evolutionary Bioinformatics 2014:10

205

Pervez et al

In addition, because of the lack of evolutionary history among the sequences, benchmark alignments cannot be used to test phylogenetic software applications. The developers may also be misguided to develop algorithms to resolve the problems that are highlighted only in the manually curated alignment sets.14 Finally, high-level expertise is required to generate benchmark alignments. Simulated/true alignments are an alternative to bench­ mark alignments for comparing MSA tools. The major moti­ vation of simulated sequences is that their true evolutionary history is known, which is very useful to generate accurate alignments and phylogenetic trees. Second, the user can generate simulated alignments comprising varying insertion rate, deletion rate, sequence length, indel size, and number of sequences. Third, as compared to benchmark alignments, it is very easy for the end user to generate simulated align­ ments. Simulated sequences also have some drawbacks. First, because of dependency of all observations drawn from true alignments on simplifications and assumptions of the model used to reconstruct these alignments, the simulated sequences cannot give an explanation for all evolutionary aspects. The other potential threat is the use of simulation settings more close to the strategy of some MSA methods than others. For example, the selected model of sequence evolution might be similar to the fundamental model of a particular MSA tool and thus provide it with an excessive advantage.15 Keeping in view disadvantages of simulated sequences, there is a need to compare results of true alignments with the results of bench­ mark sequences. However, the easy construction of simulated alignments is one of the major motivations to apply them for comparative study of MSA methods. Several sequence simulators, with their own strengths and weaknesses, are available. ROSE16 generates MSA for DNA, RNA, and protein sequences, and true evolutionary history is also logged. ROSE incorporates indels linearly in accordance with the evolutionary distance and length distribution. Indel models with nonlinear indel probabilities cannot be shown in ROSE. SIMPROT17 is another simulation tool that incorpo­ rates indels; however, it does not support the feature of root sequence as an input, the conservation of motifs or modify­ ing amino acid frequencies among subsequences. It also does not generate alignments for nucleotide sequences. MySSP18 simulates DNA sequences using different models of DNA evolution such as Jukes-Cantor,19 Kimura two-parameter, 20 equal input, and Hasegawa–Kishino–Yano.21 It incorporates the features of indels, nonstationary patterns and output of ancestral sequences. Indel-Seq-Gen−2.1.03 (iSGv2.0)22 gen­ erates highly divergent DNA sequences and protein families by incorporating a number of indel models. It can also model coding and noncoding DNA evolutions. iSGv2.0 is a new tool for generating simulated datasets. iSGv2.0 improves upon iSGv1.0 through the addition of motif conservation, lineagespecific evolution, using indel tracking, PROSITE-like regu­ lar expressions, and subsequence length constraints, as well as 206

Evolutionary Bioinformatics 2014:10

coding and noncoding DNA evolution. The authors in their original article claimed that iSGv2.0 has unique features for generating highly divergent protein sequences with the incor­ poration of indels. In the current era, a large number of MSA methods are available. Comparative studies9,14,23 of MSA programs showed that none of them were capable to generate accurate alignments for all test cases. The choice of an MSA method is based on the sequences to be aligned. In this article, we present a comparative study of 10 of the most often used MSA methods based on a different approach. The selected MSA programs are T-Coffee, 24 MAFFT(FF-TNS-2), MAFFT(L-INS-i), 25,26 MUSCLE,11,27 Kalign, 28,29 DialignTX, 30 Multalin, 31 Clustal Omega, 32 ProbCons, 33 and SATe. 34,35 Our study also focused on the significance of some implementations embedded in the each program’s algorithm. Using various evolutionary parameters, simulated alignments were constructed through iSG. As a part of input to iSG, simulated trees were generated under birth–death model using TreeSim36 package integrated in R, which is a suite of software tools for data interpretation and graphical view. 37 Birth–death model allows species to speciate with constant rate b, and go extinct with a constant rate d. Consequently, simulated tree grows at the rate of b−d. In order to ensure the net growth of the phylogenetic tree, the birth rate should be higher than the death rate. iSG produced both the aligned and unaligned sequences. Unaligned sequences were used as an input to the selected MSA programs to generate MSAs, which were then compared with the alignments produced by iSG. We also present a comparison of the results obtained on the simulated alignments and the results obtained on BALi­ BASE benchmark alignments. This comparison, with a few exceptions, confirmed that the simulated alignments may be used as an alternative for the comparative study of MSA tools.

Results

Simulated sequences and alignments. Simulated seq­ uen­ces and alignments generated by iSG were used to study the effect of sequence length, indel size, deletion rate, and insertion rate on alignment accuracy. Four hundred known alignments and corresponding sequence files with no indels were constructed based on the trees generated by R. The two most popular scores, ie, sum of pairs (SPS) and column score (CS) were applied to measure quality of the alignments. MSA tool evaluation: overall alignment accuracy. For each of the 400 reference alignments in the simulated data­ set, the 10 MSA methods were applied, resulting in a total of 4000 test alignments. These 4000 test alignments consisted of 1000 alignments with varying indel size, 1000 alignments with varying sequence length, 1000 alignments with varying deletion rate, and 1000 alignments with varying insertion rate. The varying evolutionary parameters are shown in Table 1. The

Comparison of MSA methods

overall accuracy of these alignments was measured using aver­ age sum-of-pairs scores. The experiment confirmed previous findings9,14 in the sense that ProbCons outperformed all other MSA tools (Fig. 1). SATe, which was not tested in the previous studies, was in the second position and MAFFT(L-INS-i) was in the third position. Among other tools, Kalign achieved the

highest score. T-Coffee and MAFFT(FF-TNS-2) generated the least quality alignments. One-way analysis of variance (ANOVA) showed 0.002 significant level, which means that there is a significant difference among SPS of the alignments generated by MSA tools. Multiple Comparisons Table (MCT) developed using Tukey post hoc test confirmed our results

Table 1. Four sets of evolutionary parameters. Varying deletion rates

Varying insertion rates

Sequence Length

Indel size

Insertion rate

Deletion rate

Sequence Length

Indel size

Insertion rate

Deletion rate

500

20

0.000002

0.002

500

20

0.001

0.000002

500

20

0.000002

0.002222222

500

20

0.001333333

0.000002

500

20

0.000002

0.002857143

500

20

0.001666667

0.000002

500

20

0.000002

0.005

500

20

0.002222222

0.000002

500

20

0.000002

0.006666667

500

20

0.002857143

0.000002

500

20

0.000002

0.01

500

20

0.005

0.000002

500

20

0.000002

0.013

500

20

0.006666667

0.000002

500

20

0.000002

0.05

500

20

0.013333

0.000002

500

20

0.000002

0.1

500

20

0.02

0.000002

500

20

0.000002

0.2

500

20

0.04

0.000002

Varying indel sizes

Varying sequence lengths

Sequence Length

Indel size

Insertion rate

Deletion rate

Sequence Length

Indel size

Insertion rate

Deletion rate

500

5

0.000002

0.000002

10

20

0.000002

0.000002

500

10

0.000002

0.000002

50

20

0.000002

0.000002

500

20

0.000002

0.000002

150

20

0.000002

0.000002

500

40

0.000002

0.000002

300

20

0.000002

0.000002

500

80

0.000002

0.000002

500

20

0.000002

0.000002

500

160

0.000002

0.000002

750

20

0.000002

0.000002

500

320

0.000002

0.000002

1200

20

0.000002

0.000002

500

740

0.000002

0.000002

1500

20

0.000002

0.000002

500

850

0.000002

0.000002

2000

20

0.000002

0.000002

500

950

0.000002

0.000002

2500

20

0.000002

0.000002

0.40

0.36

0.35

0.31

0.30 0.25

0.21

0.19

0.20 0.13

0.15

0.17

0.16

0.14

0.12

0.12

0.10 0.05

e fe of TC

ob C on s Pr

ia lig nTX D

SA Te

LE M

U

SC

ul ta lin M

m ta lO lu s C

Ka lig M n AF FT (F FT -N S2) M AF FT (L -IN Si)

0.00

eg a

Alignment accuracy (SPS)

Note: In each of the four sets, three parameters are kept constant and one is varying (highlighted).

Figure 1. Overall alignment quality measured using SPS. Error bars correspond to one standard deviation. ProbCons maintained its first position but MAFFT (L-INS-i) loosed the second position, which was occupied by the SATe.

Evolutionary Bioinformatics 2014:10

207

Pervez et al

that ProbCons, SATe, and MAFFT(L-INS-i) were the most accurate tools. ANOVA and MCT are attached as Supple­ mentary file (ANOVA-MCT). MSA tool evaluation: effect of indel size. To evaluate the effect of indel size, we generated 1000 alignments (100 align­ ments by each MSA method) with varying indel size (5–950). This study showed that alignment quality was much less depen­ dent on indel size (line charts of Figs.  2A and 2B), but nev­ ertheless evaluation of effect of indel size on alignment quality measured using SPS (Fig.  2A) showed that ProbCons was the top performer. SATe and MAFFT(L-INS-i) were in the second and third positions, respectively. Among other MSA

tools, Dialign-TX achieved the highest SPS. Multalin generated alignments with the lowest SPS. Study of the effect of indel size on alignment quality measured using CS (Fig. 2B) showed that ProbCons, SATe, and MAFFT(L-INS-i) were in the first, second, and third positions, respectively. Among other MSA pro­ grams, T-Coffee was the top performer. Majority of the MSA tools such as Clustal Omega, Multalin, Kalign, MAFFT(FFTNS-2), MUSCLE, and Dialign-TX achieved very low CS. MSA tool evaluation: effect of sequence length. To study the effect of increasing sequence length on alignment quality, another dataset having 1000 alignments (100 alignments by each MSA method) of varying sequence lengths (30–2500 res­

5

0.5 0.45 0.4 0.35 0.3 0.25 0.2 0.15 0.1 0.05 0

10 20 40 80 160

e T-

ob

C

C

of

on

fe

s

TX n-

Pr

850

AF

D

M

ia

U

lig

SC

SA

Te

LE

lin ta ul M

(L FT

740

950

Alignment accuracy (SPS)

M

M

AF

C

lu

FT

st

(F

al

FT

-IN

-N

S-

S-

i)

n lig Ka

eg m O

2)

320 a

Alignment accuracy (SPS)

A

Clustal Omega

0.5

Kalign

0.45

MAFFT(FFT-NS-2)

0.4 0.35

MAFFT(L-INS-i)

0.3

Multalin

0.25

MUSCLE

0.2 0.15

SATe

0.1

Dialign-TX

0.05

ProbCons

0 5

10

20

40

80

160

320

740

850

950

T-Coffee

Indel size

B

5

Alignment accuracy (CS)

0.25

10

0.2

20

0.15

40

0.1

80

0.05

160 320 fe e of

s

C T-

on C ob

ia D

M

Pr

n-

TX

Te

lig

SC U

SA

LE

lin ul M

T( L AF F

740 850 950

M

M

AF

ta

S-IN

STN FF FT (

st al lu C

i)

2)

n ig Ka l

O m

eg

a

0

Clustal Omega

Alignment accuracy (CS)

0.25

Kalign MAFFT(FFT-NS-2)

0.2

MAFFT(L-INS-i) 0.15

Multalin MUSCLE

0.1

SATe 0.05

Dialign-TX ProbCons

0 5

10

20

40

80

160

320

740

850

950

T-Coffee

Indel size

Figure 2. Effect of varying indel size on alignment quality. Indel size did not affect alignment quality (line charts); however, ProbCons was the top performer. MAFFT(L-INS-i) and SATe achieved second and third positions, respectively.

208

Evolutionary Bioinformatics 2014:10

Comparison of MSA methods

idues) were generated. This experiment showed that sequence length had a weaker effect on alignment (line charts of Figs. 3A and 3B), but nevertheless study of the effect of sequence length measured using SPS (Fig. 3A) showed that ProbCons achieved the highest average scores. SATe and MAFFT(L-INS-i) were consistently in the second and third positions, respectively. Among other MSA tools, MUSCLE and Multalin achieved the highest and lowest SPS, respectively. Evaluation of effect of sequence length measured using CS (Fig. 3B) showed that ProbCons was in the top, followed by SATe and MAFFT (L-INS-i), whereas other MSA programs achieved a low CS. MSA tool evaluation: effect of deletion rate. Effect of deletion rate on the alignment quality was studied by generat­

ing alignments with varying deletion rates (Table 1). Similar to the studies of effect of indel size and sequence length on alignment quality, effect of deletion rate was investigated by developing a dataset consisting of 1000 alignments (100 alignments by each MSA method). This study showed dif­ ferent results. First, a tradeoff between increasing deletion rate and alignment quality was observed. High deletion rate affected accuracy of almost all MSA tools (line charts of Figs.  4A and 4B). Second, SATe achieved the highest aver­ age sum of pairs scores (Fig. 4A) and column scores (Fig. 4B). In case of alignment quality measured using SPS, ProbCons and Multalin were in the second and third positions, respec­ tively. Among other MSA tools, Kalign and MUSCLE were

A

30

Alignment accuracy (SPS)

0.6

50

0.5

150

0.4

300

0.3

550

0.2

750

0.1

1,200

D

of

fe

e

s

C

2,000

T-

on C Pr

ia

ob

n-

TX

Te

1,500

2,500

M

M

AF

AF

FT

lig

SC

FT

(L

M

M

U

SA

LE

lin ul

S-IN

-N FT (F

al st lu C

ta

i)

2) S-

lig

O

m

Ka

eg

a

n

0

Clustal Omega

Alignment accuracy (SPS)

0.6

Kalign

0.5

MAFFT(FFT-NS-2)

0.4

MAFFT(L-INS-i) Multalin

0.3

MUSCLE

0.2

SATe

0.1

Dialign-TX ProbCons

0 30

50

150

300

750

1,200

1,500

2,000

T-Coffee

2,500

Sequence length

B

30

0.12

50

0.1

150

0.08

300

0.06

550

0.04

750

0.02

1,200

AF F

fe e C

of

2,000

T-

on s C ob Pr

D

ia

M

U

lig

n-

TX

SA Te

SC

LE

ul ta lin M

IN T(

1,500

2,500

M

M

AF

L-

TN FF FT (

lu C

Si)

2) S-

lig Ka

eg st al

O

m

n

0

a

Alignment accuracy (CS)

550

Clustal Omega

Alignment accuracy (CS)

0.12

Kalign

0.1

MAFFT(FFT-NS-2) MAFFT(L-INS-i)

0.08

Multalin

0.06

MUSCLE

0.04

SATe

0.02

Dialign-TX ProbCons

0 30

50

150

300

550

750

1,200

1,500

2,000

2,500

T-Coffee

Sequence length

Figure 3. Effect of increasing sequence length on alignment accuracy. Line charts A and B show that sequence length had a weaker effect on performance of all MSA methods; however, ProbCons outperformed all other MSA tools. SATe and MAFFT(L-INS-i) were in the second and third positions, respectively. Among other MSA tools, MUSCLE and MAFFT(FFT-NS-2) gave good SPS and CS, respectively.

Evolutionary Bioinformatics 2014:10

209

Pervez et al

A

0.002

Alignment accuracy (SPS)

0.6

0.002222222

0.5

0.002857143

0.4

0.005

0.3

0.006666667

0.2

0.01

0.1

0.013

C T-

Pr

al

0.1

M AF

Di

0.05

of fe e

on s

TX ig

ob C

n-

SA

Te

LE SC M U

(L -IN FT

(F FT M AF

C

M ul ta lin

S-

i)

2) S-N FT

lu st al O

m eg a

Ka lig n

0

0.2

Clustal Omega

Alignment accuracy (SPS)

0.6

Kalign

0.5

MAFFT(FFT-NS-2)

0.4

MAFFT(L-INS-i)

0.3

Multalin

0.2

MUSCLE

0.1

SATe

1

2 0.

0.

0.

0.

05

3 01

01 0.

66

0.

66

00

7

5

1 57 28

Dialign-TX

00

ProbCons

0.

0.

0.

00

00

22

0.

22

00

2

2

0

T-Coffee

Deletion rate

0.002

0.25

0.002222222

0.2

0.002857143

0.15

0.005

0.1

0.006666667

0.05

0.01

D

e of

fe

s C T-

on C Pr

ia

ob

lig

n-

TX

Te SA

U M

0.05

M

M

0.2

Clustal Omega

0.25

Alignment accuracy (CS)

0.013

0.1

AF

FT

(L

M

ul

SC

ta

LE

lin

i) S-IN

S-N

AF

C

FT

lu

st

(F

al

FT

O

m

Ka

eg

lig

2)

n

0

a

Alignment accuracy (CS)

B

Kalign

0.2

MAFFT(FFT-NS-2)

0.15

MAFFT(L-INS-i) Multalin

0.1

MUSCLE

0.05

SATe 0

0. 2

0. 1

0. 05

0. 01 3

0. 01

00 66 66 7 0.

0. 00 5

71 85 02 0. 0

22 02 0. 0

0. 00

2

22

Dialign-TX ProbCons T-Coffee

Deletion rate

Figure 4. Effect of increasing deletion rate on alignment quality. The study showed different results. Firstly, SATe beated ProbCons. Second, a higher deletion rate had significant effect on alignment quality (line charts A and B). In case of alignment quality measured using CS, SATe, MAFFT(L-INS-i), and Multalin were in the first, second, and third positions, respectively.

the best performers. The smallest accuracy was shown by MAFFT(FFT-NS-2). In the case of alignment quality mea­ sured using CS, MAFFT(L-INS-i) and Multalin were in the second and third positions, respectively. MSA tool evaluation: effect of insertion rate. To investigate the effect of increasing insertion rate, the fourth dataset consisting of 1000 alignments (100 alignments by each MSA method) with varying insertion rate (Table 1) was generated. Results of this study were similar to the study of

210

Evolutionary Bioinformatics 2014:10

“effect of deletion rate on alignment quality”, which showed that performance of all MSA tools was highly dependent on the insertion rate (line charts of Figs. 5A and 5B). Evalua­ tion of effect of insertion rate measured using SPS (Fig. 5A) showed that SATe generated the most accurate alignments. ProbCons, Kalign, and MAFFT(L-INS-i) were in the second, third, and fourth positions, respectively. T-Coffee achieved the lowest SPS. Study of insertion rate on align­ ment quality measured using CS (Fig.  5B) showed that

Comparison of MSA methods

0.001

0.600

0.001333333

0.500

0.001666667

0.400

0.002222222

0.300

0.002857143

0.200

0.005

0.100

0.006667

of fe e

0.02

Pr

T-

C

ob C

ia lig nT

on s

X

Te

D

M U

SC

SA

LE

M ul ta lin

i) S-

0.013333

0.04

M AF

M AF

FT

(F

FT

FT

-N

(L -IN

S-

Ka lig n

lu st al O C

Clustal Omega

0.600

Kalign

0.500

MAFFT(FFT-NS-2)

0.400

MAFFT(L-INS-i)

0.300

Multalin

0.200

MUSCLE

0.100

SATe

04 0.

02 0.

33

ProbCons

0.

0.

33

66 00

0.

28 00 0.

Dialign-TX

01

67

5 00

1 57

2 22 22 0.

00

16

66

3 00 0.

0.

00

13

0.

33

00

7

0.000

1

Alignment accuracy (SPS)

2)

0.000

m eg a

Alignment accuracy (SPS)

A

Insertion rate

T-Coffee

B 0.001

0.200

0.001333333

0.150

0.001666667

0.100

0.002222222

0.050

0.002857143 0.005

of

fe

e

s

C

0.013333

T-

on C ob

ia D

M

Pr

n-

TX

Te

lig

SC U

SA

LE

lin M

(L FT

0.006667

0.02 0.04

Clustal Omega

0.250

Kalign

0.200

MAFFT(FFT-NS-2)

0.150

MAFFT(L-INS-i)

0.100

Multalin

0.050

MUSCLE

Insertion rate

0. 04

0. 02

0. 01 33 33

7 66 06 0. 0

0. 00 5

1 0. 00

28

57

2 0. 00

22

22

7 66 16 0. 00

0. 00

13

33

3

0.000

0. 00 1

Alignment accuracy (CS)

M

M

AF

FT AF

ul

-IN

-N FT (F

al st lu C

ta

S-

i)

2) S-

lig

O

m

Ka

eg

n

0.000

a

Alignment accuracy (CS)

0.250

SATe Dialign-TX ProbCons T-Coffee

Figure 5. Effect of increasing insertion rate on the alignment accuracy. Insertion rate had significant effect on alignment quality. Performance of almost all MSA tools was poor on high insertion rate (line charts). SATe achieved the highest average SPS. ProbCons and Kalign were in the second and third positions, respectively. Accuracy of T-Coffee was the lowest. Study of insertion rate measured using CS showed that SATe, MUSCLE, and MAFFT(L-INS-i) were in first, second, and third positions, respectively. ProbCons was in the fourth position.

SATe outperformed all other MSA programs. MUSCLE, MAFFT(L-INS-i), and ProbCons were in the second, third, and fourth positions respectively. T-Coffee achieved the lowest CS. MSA tool evaluation: time consumed by each MSA method. Results showed that, overall, accuracy of ProbCons was the highest; however, it was also the slowest tool. SATe, being the second more accurate tool, was 529% faster than ProbCons. MAFFT(L-INS-i) consumed more time than SATe and less time than ProbCons. However, MUSCLE was

the fastest tool. Figure 6 shows time spent in seconds by each MSA method. User guidance for choosing MSA tools. It was impor­ tant to rank MSA methods based on the experiments con­ ducted in this study. Results are summarized in Figure  7. ProbCons, SATe, and MAFFT(L-INS-i) were the best tools for sequences with varying indel size and sequence length. For sequences with varying insertion rate, SATe, ProbCons, and Kalign achieved the highest SPS. In case of sequences with varying deletion rate, SATe, ProbCons, and Multalin

Evolutionary Bioinformatics 2014:10

211

Pervez et al 450,000

425,800

400,000

302,525 300,000 250,000 190,500

200,000 132,575

150,000 100,000

ProbCons

MAFFT(L-INS-i)

MSA tools

MAFFT(FFT-NS-2)

T-Coffee

Kalign

700

375 MUSCLE

1,950 Dialign-TX

1,600 Multalin

0

40,250

Clustal Omega

50,000

80,475

SATe

Time (seconds)

350,000

Figure 6. Efficiency of the MSA tools. ProbCons spent maximum time. SATe was little less accurate but 529.10% faster than ProbCons and 236.72% faster than MAFFT(L-INS-i). MUSCLE was the fastest tool. It took only 375 seconds.

Alignment accuracy (SPS)

0.45 0.4

Clustal Omega

0.35

Kalign

0.3

MAFFT(FFT-NS-2)

0.25

MAFFT(L-INS-i)

0.2

Multalin MUSCLE

0.15

SATe

0.1

Dialign-TX

0.05 0

ProbCons Del rate

Ins rate

Seq len

Indel size

T-Coffee

Evolutionary parameters Figure 7. User guidance for choosing MSA tools. ProbCons, SATe, and MAFFT(L-INS-i) are the best tools for sequences with varying indel size and sequence length. For sequences with varying insertion rate, SATe, ProbCons, and Kalign achieved the highest SPS. In case of sequences with varying deletion rate, SATe, ProbCons, and Multalin outperformed other MSA tools.

outperformed other MSA tools. On the whole, SATe, based on its overall alignment quality and processing speed, was the best tool. Comparison of results obtained on simulated data with the results obtained on benchmark sequences. In order to determine that the results obtained on iSG simulated align­ ments were also applicable to the benchmark alignments, we measured accuracy of the 10 MSA tools using six reference test cases (RV11, RV12, RV20, RV30, RV40, and RV50) available in the version 3 of the BALiBASE (ftp://ftp-igbmc. u-strasbg.fr/pub/BAliBASE3). RV11 contains sequences hav­ ing equal distances. RV12 comprised sequences with orphans. RV20 consists of sequences from deviating subfamilies. RV30 comprised sequences from families with highly diverged 212

Evolutionary Bioinformatics 2014:10

sequences. RV40 contains sequences with N/C terminal extensions. RV50 comprised sequences with large insertions. Overall, the results obtained on BALiBASE bench­ mark alignments were similar to those obtained on the true alignments (Fig.  8). PronCons and MAFFT(L-INS-i) gave similar performances on BALiBASE reference sets RV11, RV20, RV30, and RV50. In case of BALiBASE refer­ ence set RV12, ProbCons outperformed MAFFT(L-INS-i); however, on RV40, MAFFT(L-INS-i) performed better than all other MSA methods. SATe, except on RV11 (where it gave performance equal to ProbCons and MAFFT(L-INS-i)), outperformed all MSA tools on all the reference sets. Almost all MSA tools performed much better, but overall the trends were similar. ProbCons, SATe, and MAFFT(L-INS-i) were

Comparison of MSA methods

A

Del rate

iSG simulated alignments

Seq len Indel size

0.3 0.2

s Pr ob

TC of fe e

C on

-T X

Te

D ia lig n

SA

M U SC LE

M ul ta lin

FT

M AF

M AF

FT

C lu s

(L -IN Si)

TN S2)

lig n

(F F

Ka

O m eg

a

0.1 0

ta l

Alignment accuracy (SPS)

Ins rate 0.5 0.4

B 1.2 1 0.8 0.6 0.4 0.2 0

RV11 RV12

AF

C of fe e T-

s C on ob Pr

X -T D ia lig n

Te SA

U SC LE

RV40 RV50

M

M

ul ta lin M

i)

FT

(L

FT (F FT

-IN S-

2) -N S-

lig n AF

RV30

M

C lu s

ta l

O m

Ka

a

RV20

eg

Alignment accuracy (SPS)

BAliBASE benchmark alignments

Figure 8. Overall alignment quality comparison between the results obtained on simulated data and results obtained on benchmark sequences. With the exception of SATe, T-Coffee, and Clustal Omega, which performed better in case of benchmark alignments, the results were similar to the results obtained on simulated data and confirmed the previous findings.

consistently the best tools. However, in contrast to the results obtained on simulated sequences, SATe, Clustal Omega, and T-Coffee showed different performances. SATe outperformed all other MSA tools, and T-Coffee showed a better perfor­ mance than MUSCLE. Clustal Omega outperformed Kalign, Multalin, and Dialign-TX. Other MSA tools were almost consistent in their performances. Time spent on BALiBASE benchmark alignments by all MSA tools was also very similar to the time spent on simulated alignments (Fig. 9). In case of both datasets, SATe was faster than ProbCons and T-Coffee. However, some differences were also observed. Kalign, with a minute difference, consumed less time than MUSCLE. MAFFT(L-INS-i) and Dialign-TX were faster than SATe and Multalin, respectively. Overall, the findings found from comparison of results obtained on simulated data with the results obtained on benchmark sequences were consistent and confirmed previous findings14 in the sense that ProbCons and MAFFT(L-INS-i) were the best tools.

Discussion

We used 4000 alignments to test whether the MSA methods have potential to generate high-quality alignments. Accu­ racy and efficiency of the latest versions of MSA methods (which were based on various algorithms and techniques) were evaluated with their default parameters/configurations. Different parameter settings may improve their performance. We intentionally generated alignments comprising very high insertion rate, deletion rate, indel size, and sequence length. Overall alignment quality investigation showed that in case

of simulated sequences as well as BAliBASE’s v3.0 reference sets,10 ProbCons, 33 SATe, 34,35 and MAFFT(L-INS-i)25,26 were the best performers. The same results have been reported by the previous studies.9,14,23 Among other MSA methods, in case of simulated alignments, Kalign 28,29 and MUSCLE11,27 achieved the highest SPS, 38 with MUSCLE being the most efficient method, and in the case of BALiBASE benchmark datasets, T-Coffee generated the most accurate alignments, but it was consistently slower than MUSCLE. Studies of the effect of indel size and sequence length measured using SPS and CS showed that they have the least effect on the performance of MSA tools; however, ProbCons, SATe, and MAFFT(L-INS-i) were in the first, second, and third positions, respectively. These results confirmed findings of the previous studies.14,39 Investigation of the effect of dele­ tion rate on alignment quality showed that performance of MSA methods was significantly low on higher deletion rates. Nuin et  al reported the same findings.14 SATe achieved the highest SPS and CS. In case of alignment measured using SPS, ProbCons and Multalin were the second and third top performers. In case of alignment quality measured using CS, Multalin, MUSCLE, and MAFFT(L-INS-i) were in second, third, and fourth positions, respectively. Study of the effect of insertion rate on alignment quality also confirmed the previ­ ous findings14 in the sense that alignment quality is signifi­ cantly dependent on the insertion rate. In case of alignment quality measured using SPS and CS, SATe outperformed all other MSA methods. Alignment quality measured using SPS showed that Kalign and ProbCons were in the second and third Evolutionary Bioinformatics 2014:10

213

Pervez et al

A Time consumed for benchmark alignments by each MSA tool 20,000 18,000 16,000

Time (seconds)

14,000 12,000 10,000 8,000 6,000 4,000

MSA tools

B

ProbCons

SATe

MAFFT(L-INS-i)

MAFFT(FFT-NS-2)

T-Coffee

MUSCLE

Dialign-TX

Kalign

Multalin

0

Clustal Omega

2,000

Time consumed for simulated alignments by each MSA tool 450,000

425,800

350,000 302,525 300,000 250,000 190,500

200,000 132,575

150,000

80,475

100,000

ProbCons

MAFFT(L-INS-i)

MSA tools

MAFFT(FFT-NS-2)

700 T-Coffee

MUSCLE

Multalin

Clustal Omega

375

1,950 Dialign-TX

1,600 0

Kalign

50,000

40,250

SATe

Time (seconds)

400,000

Figure 9. Efficiency comparison between the results obtained on simulated data and results obtained on benchmark sequences. Major findings were almost similar. SATe was faster than ProbCons and T-Coffee. However, Kalign, MAFFT(L-INS-i), and Dialign-TX showed better efficiency than MUSCLE, SATe, and Multalin, respectively, on benchmark alignments.

positions, respectively. Alignment accuracy measured using CS showed that MUSCLE, MAFFT(L-INS-i), and ProbCons were in the second, third, and fourth positions, respectively. For the both evolutionary parameters, ie, deletion rate and inser­ tion rate, T-Coffee achieved the lowest SPS and CS. All MSA methods performed much better with BALi­ BASE benchmark alignments than simulated datasets but the overall trends were similar. Our findings generally confirmed 214

Evolutionary Bioinformatics 2014:10

results of the previous studies.9,14,23 One distinguished finding was the fact that when simulated datasets were used, SATe outperformed MAFFT(L-INS-i). Among the best MSA tools, SATe was also the fastest tool. Original articles of MAFFT(L-INS-i)25,26 and ProbCons33 placed them on the top with the best accuracy on benchmark alignments. Our results also proved the claim of the authors of Kalign 28,29 that its efficiency is very close to the fast mode of MUSCLE11,27

Comparison of MSA methods

and MAFFT(FFT-NS-2)25,26 but accuracy is comparable to other MSA tools. Our study highlighted strengths and weakness of all the MSA tools. ProbCons, which is based on consistency approach, 23 outperformed other MSA tools when alignments with varying sequence length and indel size were tested. SATe, which is based on an iterative divide and conquer approach, 34 outperformed all MSA methods when alignments comprising high insertion or deletion rate were investigated. However, overall, there is a minor difference between ProbCons and SATe. MAFFT(L-INS-i), which also adopted consistency approach in its algorithm, 23 was in the third position in all test cases. Among other MAS tools, overall, Kalign and Multalin whose alignment generating process is based on progressive alignment approach were the better alternatives. 28,29,31 How­ ever, in case of high sequence lengths and indel sizes, they did not generate high-quality alignments. Their performance was good when alignments with high insertion rates and deletion rates were used. Algorithm used by Clustal Omega is based on Hidden Markov Model approach32 and DialignTX is based on consistency-based algorithm. 30 Overall, both tools performed very poorly, and especially in the case of big indel sizes and large sequence lengths, they generated lowquality alignments. MUSCLE, which is based on iterative approach, 27 performed better when alignments compris­ ing high deletion rates and insertion rates were provided. In case of benchmark datasets, T-Coffee, which is based on consistency approach, performed better than MUSCLE. Performance of MUSCLE and MAFFT(FFT-NS-2) was consistent for all test cases.

Materials and Methods

Figure 10 describes all steps of the methodology adopted in this research work. Construction of simulated trees. TreeSim package of R was used to generate 10 simulated trees comprising different number of taxa (10, 25, 75, 200, 350, 500, 600, 700, 850, and 1000) under the birth–death model. Construction of simulated alignments. iSGv2.0 was used to construct four datasets. Each of the four datasets con­ sisted of 100 alignments; 100 with varying deletion rate, 100 with varying insertion rate, 100 with varying indel size, and 100 with varying sequence length. Thus, a total of 400 known alignments were generated. Construction of test alignments. Each of the MSA methods was applied to generate 100 alignments with varying deletion rate, 100 alignments with varying insertion rate, 100 alignments with varying indel size, and 100 alignments with varying sequence length. Thus, a total of 400 alignments were generated by each MSA method, resulting in a grand total of 4000 alignments. Table 1 shows the four parameters and their varying values (base pairs) used to construct the true align­ ments. Each of the four sets had one varying (highlighted) and three constant parameters.

Tree generating script

Tree (along with other parameters)

Corresponding sequences

iSG

MSA Tool

True alignment

Test alignment

R

Comparison

SPS and CS

Figure 10. Methodology of the comparative study.

Alignment accuracy assessment procedure. The most common practice of measuring accuracy of MSA programs is to compute SPS and CS by comparing an alignment gen­ erated by an MSA tool with a reference alignment.38 SPS is calculated by counting the correctly aligned residue pairs. It measures the ability of MSA tools to align some, if not all, of the sequences in an alignment.40 Let an alignment of N sequences comprise M columns. The cth column can be desig­ nated as Ac1, Ac2,…., AcN. For each pair of residues Acj and Ack, we define Scjk such that Scjk = 1, if Acj and Ack are in the same column of reference alignment. The score for cth column (Sc) can be defined as follows. N



Sc = ∑

N



j =1 j ≠ k k =1

Scjk

The sum of pair score for the full alignment can be computed as



M

C

SPS = ∑ c =1 Sc / ∑ i =r1Src

C r denotes number of columns and Src represents the score of the cth column in reference alignment. Column score examines the ability of MSA tools of aligning all the columns correctly.39 It is computed by divid­ ing the “matched” columns between test and reference align­ ments with the total number of “considered” columns in the test alignment. Cc = 1 if a column of a (test) alignment matches with the column of reference alignment otherwise it is zero.



M

CS = ∑ c =1 C c /M

Accuracy of the MSA methods were measured using SPS and CS. MQAT, which is an interactive tool for computing quality scores of several alignments simultaneously,41 was used to calculate SPS and CS. Evolutionary Bioinformatics 2014:10

215

Pervez et al Table 2. MSA methods used in this article. MSA Method

Version

Main Algorithm

Availability

T-Coffee [22]

10.00.r1613

Consistency (multi-core usage capable)

http://www.tcoffee.org/

ProbCons [32]

1.12

Consistency

http://probcons.stanford.edu/download.html

Dialign-TX [29]

1.0.2

Consistency

http://dialign-tx.gobics.de/download

Kalign [27, 28]

2.0

Progressive (Wu-Manber)

http://msa.sbc.su.se/cgi-bin/msa.cgi

Multalin [30]

5.4.1

Progressive

http://multalin.toulouse.inra.fr/multalin/

MAFFT(L-INS-i) [23,24]

7.0

Consistency

http://mafft.cbrc.jp/alignment/software/

Clustal Omega [31]

1.2.0

Hidden Markov Model

http://www.clustal.org/omega/

MUSCLE3.8.31 [25,26]

3.8.31

Iterative

http://www.drive5.com/MUSCLE/downloads.htm

MAFFT(FFT-NS-2) [23,24]

7.0

Iterative

http://mafft.cbrc.jp/alignment/software/

SATe [33,24]

2.2.7

Iterative divide and conquer

http://phylo.bio.ku.edu/software/sate/sate.html

Statistical analysis. For each of the four datasets (based on four evolutionary parameters), average SPS and average CS were computed from the results produced by the 10 MSA meth­ ods. One-way ANOVA was performed to determine the level of significance for accuracy of all MSA tools. In order to find the significant difference between specific MSA tools, we developed MCT using Tukey post hoc test. The positive or negative mean differences indicate the significant difference between the MSA tools. The P , 0.05 was used as the level of significance. MSA methods evaluated. The 10 MSA tools were selected based on two parameters: (1) the underlying algorithms and (2) their popularity. Table 2 describes the MSA tools with their versions, main algorithms, and URL for download. All these MSA methods were run using default parameters. Computing machine. A computing machine with Core i7 3.34 GHz processor, 8 GB RAM, and Fedora OS was used for comparative study of MSA methods.

Conclusions

Our study discloses that SATe was the best tool, based on its overall alignment accuracy and efficiency. Overall, the results showed that ProbCons was consistently on the top of the list of the evaluated MSA tools, but it was a very slow tool. SATe, being little less accurate, was 529.10% faster than ProbCons and 236.72% faster than MAFFT(L-INS-i). Among other tools, Kalign and MUSCLE gave the highest SPS and CS, respectively. A comparison on the results obtained on sim­ ulated alignments and the results obtained on BALiBASE benchmark alignments showed the similar trends. Our analysis allows the user to establish with more detail the strengths and weaknesses of each MSA tool and its algorithmic approach. iSG was also confirmed to be an appropriate choice to test alignment quality. It allows a user to construct large simulated datasets in seconds, with full control of its characteristics. Availability. Simulated trees, unaligned sequences for all the true alignments, true alignments generated by iSG and test alignments constructed by all the MSA tools are available at www.ivistmsa.com. 216

Evolutionary Bioinformatics 2014:10

Acknowledgments

We are highly thankful to the members of Institute of Bio­ chemistry and Biotechnology, University of Veterinary and Animal Sciences Lahore, Pakistan for discussions and tech­ nical support. We are also very thankful to our colleagues at Virtual University of Pakistan especially Syed Shah Moham­ mad (Lecturer CS), Abdul Basit (Lecturer CS), and Gulzar Ahmad Siddiqui (Lecturer Statistics) for their support in the statistical analysis of the comparative study and write-up of the article.

Author Contributions

Conceived the concept: MTP. Designed the experiments: MEB, NN, AN. Contributed to the design of the study: MS. Performed the experiments: ARA, TH, UW, NA. Ana­ lyzed the data: MTP, SQ , MEB. Wrote the first draft of the manuscript: MTP, MA. Contributed to the writing of the manuscript: MS. Made critical revisions and approved final version: MEB. All authors reviewed and approved of the final manuscript.

Supplementary File ANOVA-MCT (DOC)

References

1. Bao Y, Bolotov P, Dernovoy D, et al. The influenza virus resource at the National Center for Biotechnology Information. J Virol. 2008;82:596–601. 2. Kuipers RK, Joosten HJ, van Berkel WJ, et  al. 3DM: systematic analysis of heterogeneous superfamily data to discover protein functionalities. Proteins. 2010;78:2101–13. 3. Kim J, Ma J. PSAR: measuring multiple sequence alignment reliability by proba­ bilistic sampling. Nucl Acids Res. 2011;39(15):6359–68. 4. Siepel A, Bejerano G, Pedersen JS, et al. Evolutionarily conserved elements in vertebrate, insect, worm, and yeast genomes. Genome Res. 2005;15(8):1034–50. 5. Roskin KM, Diekhans M, Haussler D. Scoring two-species local alignments to try to statistically separate neutrally evolving from selected DNA segments. In: Proceedings of the seventh annual international conference on computational molecular biology ACM Press. 2003. 257–66. 6. Levasseur A, Pontarotti P, Poch O, Thompson JD. Strategies for reliable exploi­ tation of evolutionary concepts in high throughput biology. Evol Bioinform Online. 2008;4:121–37.

Comparison of MSA methods 7. Wong KM, Suchard MA, Huelsenbeck JP. Alignment uncertainty and genomic analysis. Science. 2008;319:473–6. 8. Loytynoja A, Goldman N. Phylogeny-aware gap placement prevents errors in sequence alignment and evolutionary analysis. Science. 2008;320:1632–5. 9. Thompson JD, Linard B, Lecompte O, Poch O. A comprehensive benchmark study of multiple sequence alignment methods: current challenges and future perspectives. PLoS One. 2011;6(3):e18093. 10. Thompson JD, Koehl P, Ripp R, Poch O. BAliBASE 3.0: latest developments of the multiple sequence alignment benchmark. Proteins. 2005;61:127–36. 11. Edgar RC. MUSCLE: multiple sequence alignment with high accuracy and high throughput. Nucleic Acids Res. 2004;32(5):1792–7. 12. Walle IV, Lasters I, Wyns L. SABmark-a benchmark for sequence alignment that covers the entire known fold space. Bioinformatics. 2005;21(7):1267–8. 13. Karplus K, Hu B. Evaluation of protein multiple alignments by SAM-T99 using the BAliBASE multiple alignment test set. Bioinformatics. 2001;17(8): 713–20. 14. Nuin PA, Wang Z, Tillier ER. The accuracy of several multiple sequence align­ ment programs for proteins. BMC Bioinformatics. 2006;7:471. 15. Iantorno S, Gori K, Goldman N, Gil M, Dessimoz C. Who watches the watchmen? An appraisal of benchmarks for multiple sequence alignment. In: Russell DJ, ed. Multiple Sequence Alignment Methods. Vol 1069. Clifton, NY: Humana Press; 2014:59–73. 16. Stoye J, Evers D, Meyer F. Rose: generating sequence families. Bioinformatics. 1998;14:157–63. 17. Pang A, Smith AD, Nuin PAS, Tillier ERM. SIMPROT: using an empirically determined indel distribution in simulations of protein evolution. BMC Bioinformatics. 2005;6:236. 18. Rosenberg MS. MySSP: Non-stationary evolutionary sequence simulation, including indels. Evol Bioinform Online. 2005;1:81–3. 19. Jukes TH, Cantor CR. Evolution of protein molecules. In: Munro HN, ed. Mammalian Protein Metabolism. New York: Academic Press; 1969:21–132. 20. Kimura M. A simple method for estimating evolutionary rates of base subsitutions through comparative studies of nucleotide sequences. J Mol Evol. 1980;16:111–20. 21. Hasegawa M, Kishino H, Yano T. Dating of the human-ape splitting by a molec­ ular clock of mitochondrial DNA. J Mol Evol. 1985;22:160–74. 22. Strope CL, Abel K, Scott SD, Moriyama EN. Biological sequence simulation for testing complex evolutionary hypotheses: indel-Seq-Gen version 2.0. Mol Biol Evol. 2009;26(11):2581–93. 23. Pais FSM, Ruy PC, Oliveira G, Coimbra RS. Assessing the efficiency of mul­ tiple sequence alignment programs. Algorithms Mol Biol. 2014;9:4. 24. Notredame C, Higgins DG, Heringa J. T-Coffee: a novel method for fast and accurate multiple sequence alignment. J Mol Biol. 2000;302(1):205–17.

25. Katoh K, Misawa K, Kuma K, Miyata T. MAFFT: a novel method for rapid multiple sequence alignment based on fast Fourier transform. Nucleic Acids Res. 2002;30(14):3059–66. 26. Katoh K, Standley DM. MAFFT multiple sequence alignment software version 7: improvements in performance and usability. Mol Biol Evol. 2013;30:772–80. 27. Edgar RC. MUSCLE: a multiple sequence alignment method with reduced time and space complexity. BMC Bioinformatics. 2004;5:113. 28. Lassmann T, Frings OS, Sonnhammer EL. Kalign2: high-performance multiple alignment of protein and nucleotide sequences allowing external features. Nucleic Acids Res. 2009;37:858–65. 29. Lassmann T, Sonnhammer E. Kalign–an accurate and fast multiple sequence alignment algorithm. BMC Bioinformatics. 2005;6:298. 30. Subramanian AR, Kaufmann M, Morgenstern B. DIALIGN-TX: greedy and progressive approaches for segment-based multiple sequence alignment. Algorithms Mol Biol. 2008;3:6. 31. Corpet F. Multiple sequence alignment with hierarchical clustering. Nucl Acids Res. 1988;16(22):10881–90. 32. Sievers F, Wilm A, Dineen DG, et al. Fast, scalable generation of high-quality pro­ tein multiple sequence alignments using Clustal Omega. Mol Syst Biol. 2011;7:539. 33. Do CB, Mahabhashyam MS, Brudno M, Batzoglou S. ProbCons: probabilistic consistency-based multiple sequence alignment. Genome Res. 2005;15:330–40. 34. Liu K, Raghavan S, Nelesen S, Linder CR, Warnow T. Rapid and accurate large scale coestimation of sequence alignments and phylogenetic trees. Science. 2009;324(5934):1561–4. 35. Liu K, Warnow T, Holder MT, et al. SATé-II: very fast and accurate simultane­ ous estimation of multiple sequence alignments and phylogenetic trees. Syst Biol. 2012;61(1):90–106. 36. Cusimano N, Stadler T, Renner SS. A new method for handling missing spe­ cies in diversification analysis applicable to randomly or non-randomly sampled phylogenies. Syst Biol. 2012;61:785–92. 37. R Development Core Team. R: a language and environment for statistical computing. R Foundation for Statistical Computing, Vienna, Austria. ISBN 3–900051–07–0. 2008. Available from: http://www.R-project.org. 38. Catherine LA, Cory LS, Etsuko NM. SuiteMSA: visual tools for multiple sequence alignment comparison and molecular sequence simulation. BMC Bioinformatics. 2011;12:184. 39. Lassmann T, Sonnhammer E. Quality assessment of multiple alignment pro­ grams. FEBS Lett. 2002;529:126–30. 40. Thompson JD, Plewniak F, Poch O. A comprehensive comparison of multiple sequence alignment programs. Nucleic Acids Res. 1999;27(13):2682–90. 41. Perez MT, Babar ME, Nadeem A, et al. MQAT: an efficient quality assessment tool for large multiple sequence alignments. Life Sci J. 2013;10(9s):9–16.

Evolutionary Bioinformatics 2014:10

217

Suggest Documents