Multiple sequence alignment in parallel on a ...

3 downloads 63 Views 44KB Size Report
The third and final stage of ClustalW produces a multiple sequence alignment. This method is described by Thompson et al. (1994) and Feng and Doolittle.
BIOINFORMATICS APPLICATIONS NOTE

Vol. 20 no. 7 2004, pages 1193–1195 DOI: 10.1093/bioinformatics/bth055

Multiple sequence alignment in parallel on a workstation cluster Justin Ebedes and Amitava Datta∗ School of Computer Science and Software Engineering, University of Western Australia, Perth, WA 6009, Australia Received on July 31, 2003; revised on November 10, 2003; accepted on December 24, 2003 Advance Access publication February 5, 2004

1

INTRODUCTION

Multiple sequence alignment is an important tool in DNA and protein analysis and ClustalW is a popular sequential program for multiple sequence alignment. Several parallel algorithms for multiple sequence alignment have been reported in recent years. Kleinjung et al. (2002) have reported a parallel progressive alignment strategy without a guide tree. Their implementation is not a strict parallelization of the ClustalW algorithm. The parallel version of ClustalW reported by Mikhailov et al. (2001) is designed for sharedmemory multiprocessor machines. Usually, these machines are commodity parallel architectures and quite expensive. There is an increasing use of cluster or network of personal computers as distributed memory parallel machines for solving large-scale computational problems. Such clusters are usually programmed with the help of a message passing API like message passing interface (MPI). Recently, Li (2003) has reported a parallelization of the ClustalW algorithm on a workstation cluster using MPI. Our implementation was done independently to that of Li (2003) and, although similar in many ways, has a number of differences. ∗ To

whom correspondence should be addressed.

Bioinformatics 20(7) © Oxford University Press 2004; all rights reserved.

2

METHODS

The first stage of ClustalW algorithm requires the calculation of a distance matrix, which simply tabulates the results from pairwise aligning each sequence with every other sequence. For n sequences, n(n − 1)/2 pairwise alignments must be calculated. Once the distance matrix has been calculated, it is used in the next phase of the algorithm to produce a phylogenetic or guide tree that determines the order of alignments in the final phase of the algorithm (Thompson et al., 1994). The neighbor-joining method of Saitou and Nei (1987) is the algorithm used by CLUSTALW for constructing this guide tree. The third and final stage of ClustalW produces a multiple sequence alignment. This method is described by Thompson et al. (1994) and Feng and Doolittle (1987), and comprises a series of pairwise alignments on increasingly larger groups of sequences. The order in which these pairwise alignments are performed is dictated by the guide tree. We found that the sequential ClustalW implementation spends almost 96% of the running time in the first stage for pairwise alignment of the n input sequences. The second and third stages take roughly about 4% each of the total running time. As each pairwise alignment is completely independent of all other pairwise alignments, it is quite possible to achieve near-linear speedup in the first stage of the algorithm. For maximizing speedup, it is necessary to find a balance between two factors. First, the amount of message passing required for sharing the sequences should be minimized; and second, the pairwise alignment tasks should be equally distributed among the processors. We experimented with a number of schemes and found that broadcasting all the n sequences to each processor was a better strategy. Each of the P processors performs exactly n(n − 1)/2P alignments. We got maximum speedup from this strategy despite its heavy communication cost. Although we did not parallelize the second stage of the algorithm, this does not affect the overall speedup significantly. The neighbor-joining method is many times faster than other guide tree construction algorithms and usually requires only 1% of the total running time in the sequential algorithm. 1193

Downloaded from bioinformatics.oxfordjournals.org by guest on September 13, 2011

ABSTRACT Summary: Multiple sequence alignment is the NP-hard problem of aligning three or more DNA or amino acid sequences in an optimal way so as to match as many characters as possible from the set of sequences. The popular sequence alignment program ClustalW uses the classical method of approximating a sequence alignment, by first computing a distance matrix and then constructing a guide tree to show the evolutionary relationship of the sequences. We show that parallelizing the ClustalW algorithm can result in significant speedup. We used a cluster of workstations using C and message passing interface for our implementation. Experimental results show that speedup of over 5.5 on six processors is obtainable for most inputs. Availability: The software is available upon request from the second author. Contact: [email protected]

J.Ebedes and A.Datta

6

x 10 4

10 running time : 100 seq, 1500 char

3.5

Speedup

speedup : 100 seq, 1500 char

2.5

speedup : 200 seq, 300 char

5

2

running time : 500 seq, 200 char

1.5

speedup : 500 seq, 200 char

Running time in milliseconds

3

1

0.5 running time : 200 seq, 300 char 0

1

2

3

4

5

6

0

Fig. 1. Running times and speedups for our parallel implementation of ClustalW. We show here the results for three datasets, 100 sequences with 1500 characters in each sequence, 200 sequences with 300 characters in each sequence and 500 sequences with 200 characters in each sequence.

We parallelized the third stage, i.e. the progressive alignment, but the load balancing is not satisfactory. The progressive alignment is based on the guide tree constructed in the second stage. For a well-balanced guide tree, pairs of leaves of the tree can be allocated to independent processors for better load balancing. In general, only the alignments at the same level in the tree can be parallelized. The load balancing becomes difficult as we progress through the higher levels of the tree. First, the sequences become longer and longer as gaps are introduced in the alignments at the lower levels of the tree. Also, the number of sequences to be aligned becomes smaller than the number of processors available. For a perfectly balanced guide tree, it is possible to get good speedup at the lower levels of the tree.

3

RESULTS

Our tests were run on a small cluster of six Pentium IV machines (1.8 GHz, 1 GB of RAM, 100 Mbps Ethernet network). We used both synthetic datasets and real datasets for protein sequences. We show some results in Figure 1. Three widely varying inputs are shown, which are reasonably close to best-case (small number of very large sequences), averagecase (average number of sequences of average length) and worst-case (large number of short sequences) scenarios. As is clear from the graph, even in the cases with ‘bad’ input, there is still a significant level of speedup, which in this case was 4.96 with six processors. When using inputs closer to the average and best cases, even greater speedup is possible, which in the

1194

two cases were 5.56 and 5.81 with six processors. We found that an overall speedup of about 5.5 with six processors, for an efficiency of ∼90% is obtainable for most inputs. Our speedup for the first stage was almost linear and efficiency was close to 100%. There was no speedup for the second stage as we did not parallelize the construction of the guide tree. The average speedup for the third stage was about 3 with six processors, i.e. about 50% efficiency. We derived an estimate of the speedup assuming a perfectly balanced guide tree and the actual speedup was very close to this theoretical estimate for almost all the inputs we considered. Our method and the speedup we achieved are quite similar to those reported by Li (2003). Our program and his program work well for small clusters. However, from our experience the speedup decreases as we employ more processors. The main reason is the poor load balancing in the third stage of the algorithm. During the progressive alignment at the higher levels of the guide tree, some processors are loaded with large alignment tasks. On the other hand, more and more processors become idle as the alignments progress at the higher levels of the guide tree. The extreme example is when there is only one large alignment to be done at the last level. The speedup of our implementation decreases when the number of sequences in the input is large. Again, this inefficiency comes due to the poor load balancing in the third stage. With a large number of input sequences, the alignments at the higher levels of the guide tree involve very large sequences as more and more gaps are introduced at the lower levels. Hence these sequential alignments at the higher levels

Downloaded from bioinformatics.oxfordjournals.org by guest on September 13, 2011

Number of processors

Multiple sequence alignment

dominate the speedup. The proportion of the work in the third stage compared with the total work done by the algorithm increases with increasing number of processors. This is due to the fact that the first stage can be completed faster if we employ more processors. In order to improve the speedup in the third stage of our implementation, we need to devise a method to split up a single alignment task among multiple processors. This will require a new algorithm for parallel dynamic programming that keeps the communication costs low. This is an important future direction for improving the parallel ClustalW algorithm. Another important problem is the establishment of correspondence between biologically meaningful alignments through structural superposition. Most multiple alignment algorithms including ClustalW work on mathematical alignments based on general scoring schemes.

ACKNOWLEDGEMENTS

Feng,D.F. and Doolittle,R.F. (1987) Progressive sequence alignment as a prerequisite to correct phylogenetic trees. J. Mol. Evol., 25, 351–360. Kleinjung,J., Douglas,N. and Herringa,J. (2002) Parallelized multiple alignment. Bioinformatics, 18, 1270–1271. Li,K.-B. (2003) ClustalW-MPI: ClustalW analysis using distributed and parallel computing. Bioinformatics, 19, 1585–1586. Mikhailov,D., Cofer,H. and Gomperts,R. (2001) Performance optimization of ClustalW: parallel ClustalW, HT Clustal, and MULTICLUSTAL. White papers, Silicon Graphics, Mountain View, CA. Saitou,N. and Nei,M. (1987) The neighbor-joining method: a new method for reconstructing phlyogenetic trees. Mol. Biol. Evol., 4, 406–425. Thompson,J.D., Higgins,D.G. and Gibson,T.J. (1994) CLUSTALW: improving the sensitivity of progressive multiple sequence alignment through sequence weighting, position-specific gap penalties and weight matrix choice. Nucleic Acids Res., 22, 4673–4680.

1195

Downloaded from bioinformatics.oxfordjournals.org by guest on September 13, 2011

A.D. is supported by the Western Australian Interactive Virtual Environments Centre (IVEC) and Australian Partnership in Advanced Computing (APAC).

REFERENCES