Performance Evaluation of Computational Phylogeny Software in Parallel Computing Environment

Luka Filipović, Danilo Mrdak and Božo Krstajić
Abstract. Computational phylogeny is a challenging application even for the most powerful supercomputers. One significant application in this area is Randomized Axelerated Maximum Likelihood (RAxML), which is used for sequential and parallel Maximum Likelihood based inference of large phylogenetic trees. This paper covers scalability testing results on high-performance computers on up to 256 cores, for coarse-grained and fine-grained parallelization using the MPI, Pthreads and hybrid versions, and a comparison between the results of the standard and SSE3 versions of RAxML.

Keywords: parallel computing, scalability testing, computational phylogeny, RAxML
1 Introduction
As a result of the fast development of different genetic molecular techniques over the last 20 years, huge data sets related to the genomes of different organisms are available. In the beginning, those data sets were mainly used for the identification of different populations of organisms, but lately, due to the rapid growth of such information, DNA structure information is also used for evolutionary biology purposes. Thus, scientists use rows of nucleic base pairs to reconstruct the evolutionary history of different groups of organisms. Initially, such phylogenetic trees were constructed from a single gene for data sets of a few species, but with the rapid increase of sequence data, these phylogenetic analyses have become more and more demanding, both in terms of the number of species and in terms of the number of genes in use. With the aim of getting closer to the most realistic reconstruction of evolutionary processes, scientists use data sets of more than one gene (a multi-gene approach) to perform phylogenetic analysis. Whether they use a huge number of OTUs (Operational Taxonomic Units, sometimes more than 1000 of them) or 5-10 genes for 20-50 OTUs, such an approach demands strong computing resources and software for analysis execution. Moreover, since DNA sequence data continuously grow both in the number of organisms and in sequence length, there is an increasing need for efficient programs for such analyses, and software parallelization is an adequate answer to these demands.

One significant open-source application from the bioinformatics and DNA analysis area is Randomized Axelerated Maximum Likelihood (RAxML) [1], a program for sequential and parallel Maximum Likelihood based inference of large phylogenetic trees. RAxML exists in several versions: one is a pure sequential program lacking any parallel code, the other is a parallel version which supports three parallelization techniques: MPI, Pthreads and hybrid parallelization. We analyzed all of them and compared them by execution time, speedup, efficiency and CPU consumption on a specified dataset. In our test model we used 123 different DNA sequences of Salmo trutta (Linnaeus, 1758) from the Eurasian geographical region, with 552 base pairs per DNA sequence. Those DNA data sets are available in GenBank [2]. We downloaded them in FASTA format and prepared them by performing multiple DNA alignment in CLUSTAL_X [3].

Luka Filipović, University of Montenegro, Center of Information Systems, Cetinjska 2, Podgorica, Montenegro, e-mail: [email protected]
Danilo Mrdak, University of Montenegro, Faculty of Natural Sciences, Mihaila Lalića 1, Podgorica, Montenegro, e-mail: [email protected]
Božo Krstajić, University of Montenegro, Faculty of Electrical Engineering, Džordža Vašingtona bb, Podgorica, Montenegro, e-mail: [email protected]
© Springer-Verlag Berlin Heidelberg 2011
2 The RAxML Software
The RAxML (Randomized Axelerated Maximum Likelihood) [1] program has been developed to perform both sequential (on a single processor) and parallel (on multiple processors) phylogenetic analysis using the maximum likelihood optimality criterion. Historically, RAxML stems from the FASTDNAML program, which in turn stems from the DNAML program. Although RAxML's design emphasis is on computationally efficient and biologically accurate analysis of very large data sets, it is also appropriate for and amenable to the analysis of data sets of any size. RAxML can use a variety of different character sets, including nucleotide, amino acid, binary, and multistate character state data. Versions of the RAxML program are available for the Unix/Linux, Mac, and Windows operating systems [4][5].

The first step of the RAxML search strategy is the generation of a starting tree. This starting tree is constructed by adding the sequences one by one in random order and identifying their optimal location on the tree under the parsimony optimality criterion [6]. The random order in which sequences are added is likely to generate several different starting trees every time a new analysis is run (especially for data sets with more than a few sequences), which allows better exploration of the tree space. If multiple analyses using different starting trees all converge on the same best tree, then confidence that this is the true best tree increases. The second step of the search strategy involves a method known as lazy subtree rearrangement, or LSR [6][7]. Briefly, under LSR, all possible subtrees of a tree are clipped and reinserted at all possible locations as long as the number of branches separating the clipped and insertion
points is smaller than N branches. RAxML estimates an appropriate N value for a given data set automatically, but the program can also be run with any fixed value. The LSR method is first applied to the starting tree and subsequently, multiple times, to the currently best tree as the search continues, until no better tree is found.

2.1 The RAxML Parallelization

RAxML has been parallelized in various ways and at various levels of granularity as the capabilities of the code have evolved. The approaches can be divided by parallelization method as:
• Coarse-grained parallelization, using MPI
• Fine-grained parallelization, using OpenMP and later Pthreads
• A hybrid version, which combines coarse- and fine-grained parallelization [8]
In addition, three experimental versions have been developed:
• A version for the Cell Broadband Engine, which includes Cell-specific code for the fine-grained parallelization as well as a Cell-specific system-level scheduler for efficiently exploiting the coarse-grained parallelism with MPI.
• A version for BlueGene/L; it was multi-grained, but used MPI at both levels of granularity.
• A version in which the developers compared fine-grained parallelization done with MPI, Pthreads, or OpenMP. Although each approach could be fastest depending on the data set and computer, the Pthreads implementation was adopted in production from version 7.0.0 and replaced the earlier OpenMP implementation.
Hybrid parallelization of RAxML has been available since version 7.2.4. The fine-grained Pthreads parallelization is the same as in recent versions and operates over the number of alignment patterns. The coarse-grained MPI parallelization operates over the number of separate tree searches, similar to that in version 7.0.0. However, the newer MPI approach is simpler than the former master/worker approach and has minimal MPI communication, so a fast and expensive interconnect is not required. The new approach is also more efficient when there is reasonable load balance, which is often the case.
All results presented in this paper were obtained with RAxML version 7.2.8, which has only minor changes compared to version 7.2.4.
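The search strategy outlined in this section (random stepwise addition of sequences, followed by repeated LSR rounds until no better tree is found) can be illustrated with a toy hill-climbing sketch. Everything here is a simplified stand-in, not RAxML code: the "tree" is just a tuple of sequence labels, and the `toy_search`, `score` and `moves` names are hypothetical placeholders for RAxML's real tree structures, parsimony/likelihood scoring and LSR moves.

```python
import random

def toy_search(sequences, score, moves, seed=0):
    """Schematic RAxML-style search loop (toy illustration, not RAxML code).

    1. Build a starting "tree" by adding sequences in random order.
    2. Repeatedly try rearrangement moves (the stand-in for LSR),
       keeping any candidate that improves the score,
       until a full round yields no improvement.
    """
    rng = random.Random(seed)
    order = list(sequences)
    rng.shuffle(order)                 # random addition order -> varied starting trees
    tree = tuple(order)
    best = score(tree)
    improved = True
    while improved:                    # stop when no rearrangement helps
        improved = False
        for cand in moves(tree):
            s = score(cand)
            if s > best:               # keep only improving rearrangements
                tree, best, improved = cand, s, True
    return tree, best

# Toy stand-ins: score = minus the number of out-of-order pairs,
# moves = all adjacent swaps (analogous to LSR with a small radius N).
def score(t):
    return -sum(t[i] > t[j] for i in range(len(t)) for j in range(i + 1, len(t)))

def moves(t):
    for i in range(len(t) - 1):
        yield t[:i] + (t[i + 1], t[i]) + t[i + 2:]
```

With these stand-ins the hill climb simply sorts the tuple; in RAxML the same loop structure operates on real trees under the parsimony (starting tree) and maximum likelihood (search) criteria.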
3 Benchmark Infrastructure
For the purpose of testing and result analysis of RAxML we needed an HPC cluster with support for MPI, Pthreads and the ability to run hybrid programs. One suitable cluster in the HP-SEE project [9] is the HPCG cluster [10], located at the Institute of Information and Communication Technologies (IICT-BAS) in Sofia, Bulgaria. It has 576 computing cores organized in a blade system (HP Cluster Platform Express 7000 enclosures with 36 BL 280c blades, each with dual Intel Xeon X5560 @ 2.8 GHz). The parallel programming paradigms supported by the HPCG cluster are message passing (several MPI implementations, including OpenMPI) and OpenMP. The 36 nodes have a relatively high amount of RAM (24 GB per node). A hybrid approach is available by combining the two paradigms listed above.
4 RAxML Scalability Testing
4.1 Performance Metrics

Three metrics are commonly used to measure the performance of MPI programs: execution time, speedup and efficiency. Several factors, such as the number of processors used, the size of the data being processed and inter-processor communication, influence a parallel program's performance [11]. Speedup is defined by the formula S(p) = T(1) / T(p), where p is the number of cores, T(1) is the execution time of the sequential application and T(p) is the execution time of the parallel algorithm on p cores. Efficiency is defined as E(p) = S(p) / p.

The WC bootstrapping test [12] showed that 100 bootstrap runs are a minimum for any analysis, 500-1200 bootstrap runs are required for any serious analysis, and a minimum of 5000 bootstraps is needed for more than 99.5 % precision in ML searches. We decided to run two bootstrap tests on our dataset: 1000 bootstraps as a smaller test and 5000 bootstraps as a larger test.

4.2 Sequential Application Results

RAxML version 7.2.8 offers two builds: a standard and an SSE3-vectorized version [13]. We tested the parallelized versions of RAxML on the smaller and larger numbers of bootstraps. Serial results for comparison are listed in Table 1.

Table 1. Execution time of the serial version

                             1000 bootstraps    5000 bootstraps
  Standard version           18 455 s           86 828 s
  SSE3 vectorized version    14 407 s           75 759 s
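The speedup and efficiency definitions above can be wrapped in two small helpers; a minimal sketch, with sample values taken from the serial SSE3 run in Table 1 and the 32-core SSE3 run (8 cores per node) reported in Table 2:

```python
def speedup(t_serial, t_parallel):
    """S(p) = T(1) / T(p)"""
    return t_serial / t_parallel

def efficiency(t_serial, t_parallel, p):
    """E(p) = S(p) / p"""
    return speedup(t_serial, t_parallel) / p

# Serial SSE3 run (Table 1) vs. 32-core MPI SSE3 run, 8 cores/node (Table 2)
t1, t32 = 14407.0, 649.33
s = speedup(t1, t32)          # about 22.2
e = efficiency(t1, t32, 32)   # about 0.69, i.e. 69 % efficiency
```

The 50 % efficiency threshold used later in the paper corresponds to e >= 0.5 in this notation.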
The SSE3-based SIMD version gave better results in the sequential case, but we used both versions for the MPI experiments on the HPCG cluster.

4.3 MPI Scalability Results

Coarse-grained parallelization of RAxML is done using MPI. We tested the MPI parallelization on the described dataset with 8 and 16 cores per node. The execution time of the standard and SSE3-vectorized versions for 1000 and 5000 bootstraps is listed in Tables 2 and 3. Application speedup and efficiency are graphically presented in Figures 1-4.
Table 2. Execution time [s] for 1000 bootstraps

                     MPI with SSE3                     MPI without SSE3
  Number of cores    8 cores/node    16 cores/node     8 cores/node    16 cores/node
  8                  2129,29         -                 3849,65         -
  16                 1095,03         1545,57           2260,62         2560,12
  32                 649,33          887,51            1114,89         1342,29
  64                 437,39          436,79            607,83          740,24
  128                262,88          282,25            362,54          608,42
  256                -               233,12            -               389,55
Table 3. Execution time [s] for 5000 bootstraps

                     MPI with SSE3                     MPI without SSE3
  No of cores        8 cores/node    16 cores/node     8 cores/node    16 cores/node
  8                  12477,51        -                 15066,57        -
  16                 6584,08         7559,33           7589,08         11577,62
  32                 4187,13         3831,41           4080,20         6033,55
  64                 2403,53         1963,52           2607,92         3180,24
  128                1176,38         1098,83           1611,18         1607,92
  256                -               723,42            -               954,88

Fig. 1. Speedup for 1000 bootstraps
Fig. 2. Speedup for 5000 bootstraps
Fig. 3. Efficiency for 1000 bootstraps
Fig. 4. Efficiency for 5000 bootstraps
After analyzing the execution times and the attached charts, we conclude that the MPI version with SSE3 instructions is significantly better than the standard MPI version and should be used whenever possible. As a general rule, a parallelization with efficiency lower than 50 % is not considered a good parallelization. By that criterion, the standard MPI version with 8 cores per node is acceptable up to 32 cores for 1000 bootstraps (the smaller test) and up to 64 cores for 5000 bootstraps (the bigger test). Figures 1 and 3 show good scaling for MPI with SSE3 up to 128 cores, especially for the runs with 8 cores per node. Execution on 256 cores did not show sufficiently good results, because the part of the test executed on each core becomes too small, and the time spent in the serial part of the application is significant compared with the execution time per core. Figures 2 and 4 also show good scaling for MPI with SSE3 up to 128 cores, but here the runs with 16 cores per node showed better results because each core had a larger amount of data to process. Execution on 256 cores stayed below 50 % efficiency, but it showed 15 % better efficiency than the test with the smaller number of bootstraps.

4.4 Pthreads Scalability Results

The Pthreads tests execute on a single node. The optimal number of Pthreads increases with the number of distinct patterns in the columns of the multiple sequence alignment, and it is limited by the number of cores per node of the server being used. The Pthreads version was compiled with SSE3 support and run with 1000 bootstraps. All results are compared with the serial application and shown in Table 4.

Table 4. Execution time for the Pthreads version and comparison with the serial version

                          Serial         Number of threads
                          application    2          4          8          12         16
  Time [s]                14407,02       9491,95    7679,72    5766,47    6685,62    6880,30
  Speedup                 -              1,51       1,87       2,49       2,15       2,09
  CPU time [CPU-hours]    4,00           5,27       8,53       12,81      22,28      30,57
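The CPU-time row of Table 4 is simply the wall-clock time multiplied by the number of active threads. A quick consistency check (values taken from Table 4) reproduces the reported CPU-hours to within rounding:

```python
def cpu_hours(wall_seconds, n_threads):
    # total CPU time consumed = wall-clock time x number of active threads
    return wall_seconds * n_threads / 3600.0

# (threads, wall-clock seconds, reported CPU-hours) from Table 4
rows = [(1, 14407.02, 4.00), (2, 9491.95, 5.27), (4, 7679.72, 8.53),
        (8, 5766.47, 12.81), (12, 6685.62, 22.28), (16, 6880.30, 30.57)]
for n, wall, reported in rows:
    assert abs(cpu_hours(wall, n) - reported) < 0.02  # matches within rounding
```

This is why the 12- and 16-thread runs are so expensive: the wall-clock time stops improving beyond 8 threads while the consumed CPU time keeps growing linearly with the thread count.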
Fig. 5. Speedup for Pthreads parallelization
Fig. 6. CPU consumption of Pthreads parallelization
Figure 5 shows scaling up to a speedup of 2.49 on 8 cores and negative returns for more than 8 cores. The speedup drops rapidly beyond 8 cores because a set of 552 base pairs is relatively small for splitting into more than 8 parts. CPU time is shown in Figure 6, which shows that the Pthreads version consumes more CPU time than the serial version for our dataset. We did not execute a larger Pthreads test because the smaller one with 1000 bootstraps already showed an enormous use of resources.

4.5 Hybrid Scalability Results

The hybrid version of RAxML combines coarse- and fine-grained parallelization. It is especially useful for a comprehensive phylogenetic analysis, i.e., the execution of many rapid bootstraps followed by a full maximum likelihood search. Multiple multicore nodes can be used in a single run to speed up the computation and reduce the turnaround time. The hybrid code also allows more efficient utilization of a given number of processor cores. Moreover, it often returns a better solution than the standalone Pthreads code, because additional maximum likelihood searches are conducted in parallel using MPI [8]. Since the Pthreads version gave bad results for larger numbers of cores, we executed the hybrid version with 2 and 4 threads per node and compared the results with the MPI version at 8 cores per node. The results are shown in Table 5.

Table 5. Execution times [s] for the hybrid parallelization and comparison with the MPI and Pthreads versions

                  Hybrid version                 MPI            Pthreads
  No of cores     2 threads      4 threads       version        version
  8               2572,98        4646,24         2129,29        5766,47
  16              1530,92        2433,89         1095,03        6880,30
  32              978,75         1301,88         649,33         -
  64              518,62         836,98          437,39         -

Fig. 7. Execution times of the hybrid and MPI versions
Fig. 8. Efficiency of the RAxML hybrid version
Figure 7 compares the execution times of the hybrid version with 2 and 4 threads for the fine-grained parallelization, the MPI version (coarse-grained parallelization) and the Pthreads parallelization on 8 and 16 cores. The MPI version showed the best performance for our dataset. The Pthreads version used up to 6.2 times more time than the MPI version and 4.5 times more than the hybrid version on 16 cores. Figure 8 shows the efficiency of the hybrid version, where the 2-thread variant achieved significantly better results: it had an efficiency over 50 % on fewer than 30 cores and over 40 % in all tests up to 64 cores, while the 4-thread variant had an efficiency below 40 % across the whole range.
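The coarse-grained scheme underlying the MPI and hybrid versions (independent tree searches distributed across workers, with essentially no communication until the final best-tree reduction) can be sketched with a worker pool. This is a toy illustration under stated assumptions: Python threads stand in for MPI ranks, and `run_search` and `coarse_grained` are hypothetical placeholders, not RAxML calls.

```python
import random
from concurrent.futures import ThreadPoolExecutor  # stand-in for MPI ranks

def run_search(seed):
    """Placeholder for one independent tree search / bootstrap replicate."""
    rng = random.Random(seed)
    tree = "tree-%d" % seed
    log_likelihood = -1000.0 + rng.random()  # fake score for illustration
    return log_likelihood, tree

def coarse_grained(n_replicates, n_workers):
    # Each worker handles a share of the replicates; no inter-worker
    # communication is needed until the final reduction, which is why
    # a fast, expensive interconnect is not required for this scheme.
    with ThreadPoolExecutor(max_workers=n_workers) as pool:
        results = list(pool.map(run_search, range(n_replicates)))
    return max(results)  # reduction: keep the best-scoring tree
```

In the hybrid version, each such worker would additionally split the per-site likelihood computation over Pthreads within its node, which is the fine-grained level of parallelism.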
5 Conclusion
This paper has presented a comparative analysis of RAxML parallelization results. The MPI version with SSE3 support had the best results, especially when executed on fewer than 128 cores on the larger bootstrap test, where every core had a larger chunk of data to analyze. The Pthreads version had a relatively small speedup up to 8 cores; it takes more CPU time than the serial version for our dataset and has negative returns for more than 8 cores. The hybrid version showed worse results than the MPI version, but only by about 20 % in some executions, and it could be competitive for different sizes of input dataset. The hybrid version with 2 threads showed better results than the version with a larger number of threads because we used 552 base pairs per DNA sequence, which cannot be divided over a larger number of cores without delay. We assume that the hybrid and Pthreads versions can give better performance for datasets with a larger number of base pairs per DNA sequence. The parallelized versions of RAxML showed that the results of analyses with big datasets can be obtained in a shorter time using a larger number of CPUs. From the point of view of efficiency and energy consumption, every parallel version uses more energy than the sequential one, especially Pthreads. Users must find the optimal number of cores, which in our example is up to 128 for the MPI version and fewer than 32 for the hybrid version.

Acknowledgements. This work makes use of results produced by the High-Performance Computing Infrastructure for South East Europe's Research Communities (HP-SEE), a project co-funded by the European Commission (under contract number 261499) through the Seventh Framework Programme. HP-SEE involves and addresses specific needs of a number of new multi-disciplinary international scientific communities (computational physics, computational chemistry, life sciences, etc.) and thus stimulates the use and expansion of the emerging new regional HPC infrastructure and its services. Full information is available at: http://www.hp-see.eu/
References

1. RAxML - Scientific Computing Group, The Exelixis Lab, http://sco.hits.org/exelixis/software.html
2. The National Center for Biotechnology Information, http://www.ncbi.nlm.nih.gov/
3. Thompson, J.D., Gibson, T.J., Plewniak, F., Jeanmougin, F., Higgins, D.G.: The Clustal_X windows interface: flexible strategies for multiple sequence alignment aided by quality analysis tools. Nucleic Acids Research 25(24):4876-4882 (1997)
4. Rokas, A.: Phylogenetic Analysis of Protein Sequence Data Using the Randomized Axelerated Maximum Likelihood (RAxML) Program. Wiley Online Library, DOI: 10.1002/0471142727.mb1911s96
5. Stamatakis, A.: RAxML-VI-HPC: Maximum likelihood-based phylogenetic analyses with thousands of taxa and mixed models. Bioinformatics 22:2688-2690 (2006)
6. Stamatakis, A., Ludwig, T., Meier, H.: RAxML-III: A fast program for maximum likelihood-based inference of large phylogenetic trees. Bioinformatics 21:456-463 (2005)
7. Schmidt, H.A., von Haeseler, A.: Phylogenetic inference using maximum likelihood methods. In: Lemey, P., Salemi, M., Vandamme, A.M. (eds.) The Phylogenetic Handbook: A Practical Approach to Phylogenetic Analysis and Hypothesis Testing, pp. 181-209. Cambridge University Press, Cambridge (2009)
8. Pfeiffer, W., Stamatakis, A.: Hybrid MPI/Pthreads Parallelization of the RAxML Phylogenetics Code. In: Proc. Ninth IEEE Int'l Workshop on High Performance Computational Biology (April 2010)
9. HP-SEE project, http://www.hp-see.eu/
10. Resource centre HPCG, HP-SEE wiki, http://hpseewiki.ipb.ac.rs/index.php/Resource_centre_HPCG
11. El-Nashar, A.I.: To parallelize or not to parallelize, speed up issue. International Journal of Distributed and Parallel Systems (IJDPS) 2(2) (March 2011)
12. Pattengale, N.D., Alipour, M., Bininda-Emonds, O.R.P., Moret, B.M.E., Stamatakis, A.: How Many Bootstrap Replicates are Necessary? In: Proc. RECOMB 2009, LNCS 5541, pp. 184-200. Springer-Verlag (2009)
13. Intel® 64 and IA-32 Architectures Software Developer's Manual