Bayesian Phylogeny on Grid

Richard C. van der Wath (1), Elizabeth van der Wath (1), Antonio Carapelli (2), Francesco Nardi (2), Francesco Frati (2), Luciano Milanesi (3), and Pietro Liò (1)

(1) The Computer Laboratory, University of Cambridge, William Gates Building, 15 JJ Thomson Avenue, Cambridge CB3 0FD, UK
(2) Department of Evolutionary Biology, University of Siena, via A. Moro 2, 53100, Siena, Italy
(3) CNR-ITB, Via F.lli Cervi 93, 20090, Segrate, Italy

Abstract. Grid computing is the combination of computers or clusters of computers across networks, such as the internet, to form a distributed supercomputer. This infrastructure allows scientists to run complex and time-consuming computations in parallel and on demand. Phylogenetic inference for large data sets of DNA or protein sequences is known to be computationally intensive and could benefit greatly from this parallel supercomputing approach. Bayesian algorithms allow the estimation of important parameters concerning the mode and timing of species divergence, but at the price of running long, repetitive series of Monte Carlo simulations. As part of the BioinfoGrid project, we ported parallel MrBayes to the EGEE (Enabling Grids for E-sciencE) grid infrastructure. As a case study we investigate both a challenging data set of arthropod phylogeny and the most appropriate model of amino acid replacement for that data set. Our aim is to resolve the position of basal hexapod lineages with respect to Insecta and Crustacea. In this effort, a new matrix of protein change was derived from the data set itself, and its performance was compared with that of other currently used models.

1 Introduction

High-throughput techniques produce large data sets with correspondingly large numbers of parameters, which has made it necessary to develop high performance computers based on clustering technologies and high performance distributed platforms. Grid infrastructures are based on a distributed computing model in which large, multidisciplinary VOs (Virtual Organizations) are given easy access to geographically distributed computing and data management resources. A VO comprises a set of Grid users with similar requirements and interests who are able to share resources and work collaboratively with the other members of the same grouping. In order to submit jobs to the Grid, every user must be a member of a VO; this helps to ensure the integrity of the data stored on the Grid network. Distributed HPC (high performance computing) is considered the way to realize the concept of virtual places where scientists and researchers work together to solve complex problems in Bioinformatics, despite their geographic and organizational boundaries. BioinfoGRID (Bioinformatics Grid Application for life science) is a project funded by the European Union within the Sixth Framework Programme for Research and Technological Development (FP6). The BioinfoGRID project is associated with the Biomedical VO, which consists of a large number of biomedical scientists working in several different fields and countries under one banner.

M. Elloumi et al. (Eds.): BIRD 2008, CCIS 13, pp. 404–416, 2008. © Springer-Verlag Berlin Heidelberg 2008

DNA and amino acid sequences contain information both on the phylogenetic relationships among species and on the evolutionary processes that have caused the sequences to diverge [1,2,33,34,35,36]. Statistical and computational methods try to extract this information to determine how and why DNA and protein molecules work the way they do. Arthropoda (insects, crustaceans and their kin) account for more than 80% of described animal species, and display an extraordinary diversity of morphology and lifestyle adaptations. This diversity, as well as the age of the major taxa, has considerably complicated the reconstruction of their phylogenetic relationships, which are still debated both among and within major lineages [20]. Molecular phylogeny has recently contributed extensively to this issue, and large data sets of mitochondrial sequences are now available for analysis. The mitochondrial genome is a closed circular molecule, inherited via the maternal line. The organization of the genome is quite simple, and its gene content is highly conserved across multicellular animals [27]: 13 protein coding genes (PCGs), 22 transfer RNA (tRNA) genes, and 2 ribosomal RNA (rRNA) genes. Mitochondrial sequences are commonly used as a standard benchmark for testing evolutionary models and systematic methodology (see for instance [6,21,22,3]). Several authors have found differences between phylogenies that result from different data sets on the same species [5,4,7,8,23].
Because of these problems in reconstructing phylogenies, it is important to develop appropriate evolutionary models and to investigate the performance of different existing models for specific phylogenetic problems. Here we investigate the Grid as an infrastructure for biomedical applications. Specifically, we use parallel MrBayes to resolve the phylogeny of Pancrustacea while testing different models of evolution. The article comprises three main sections. The Methodology section covers models of evolution, Bayesian phylogeny and Grid computing. Statistics for the different models of evolution and a visualization of the posterior probabilities of phylogenetic trees are reported as the main results in the Results and Discussion section. Lastly, the Conclusion section summarises our findings on Grid performance and our expectations for the future. For a comprehensive presentation of the underlying biological problem and a more general discussion of the biological relevance of the results obtained, we refer to the accompanying paper [3].

2 Methodology

2.1 Generating Models of Evolution

The process of phylogeny reconstruction requires four steps. The first step comprises sequence selection and alignment to determine site-by-site DNA or amino acid differences. This was done using the data set described in [3]. The thirteen PCGs (protein coding genes) of one hundred species of Pancrustacea for which the complete mitochondrial genome sequence is available were aligned and concatenated. Taxa were then removed if their mitochondrial sequences possess molecular features known to negatively affect phylogenetic reconstruction, such as extreme compositional bias, accelerated rates of nucleotide substitution, and gene order rearrangements that involve a change in the strand on which one or more PCGs are encoded [28,29,3]. A data set with 81 pancrustacean sequences (ingroup) and 5 non-pancrustacean arthropod species (outgroup) was retained and analysed.

The second step is to build a mathematical model describing the evolution of the sequences over time. The theory is usually based on continuous-time models, so we need to derive the instantaneous probabilities of the transition from one amino acid to another. The probabilities of a model can be generated empirically, using properties calculated through comparisons of observed sequences, or parametrically, using chemical and biological properties of DNA and amino acids. Such models permit estimation of the genetic distance between two homologous sequences, measured by the expected number of nucleotide substitutions per site that have occurred on the evolutionary lineages between them and their most recent common ancestor. Such distances may be represented as branch lengths in a phylogenetic tree; the extant sequences form the tips of the tree, while the ancestral sequences form the internal nodes and are generally not known.

The third step involves applying an appropriate statistical method to find the tree topology and branch lengths that best describe the phylogenetic relationships of the sequences. One of the most important methods is maximum likelihood (ML), which is also a necessary ingredient of the Bayesian approach [36,33,34,35].
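As an illustration of how a substitution model turns observed differences into a genetic distance, the sketch below uses the Jukes-Cantor (JC69) nucleotide model. JC69 is an assumption chosen for its closed form; it is not one of the models used in this study, and the site counts are invented for the example. The sketch also checks numerically that the closed-form distance is the value that maximizes the pairwise likelihood.

```python
import math


def jc69_distance(p):
    """Expected substitutions per site under JC69, given the observed
    proportion p of sites that differ between two aligned sequences."""
    if not 0.0 <= p < 0.75:
        raise ValueError("p must lie in [0, 0.75)")
    return -0.75 * math.log(1.0 - 4.0 * p / 3.0)


def jc69_log_likelihood(d, n_sites, n_diff):
    """Log-likelihood of distance d for a pairwise alignment
    (terms that do not depend on d are dropped)."""
    p_diff = 0.75 * (1.0 - math.exp(-4.0 * d / 3.0))  # P(site differs | d)
    return n_diff * math.log(p_diff) + (n_sites - n_diff) * math.log(1.0 - p_diff)


# Hypothetical alignment: 1000 sites, 150 of which differ.
n_sites, n_diff = 1000, 150
d_hat = jc69_distance(n_diff / n_sites)

# Crude grid search for the maximum-likelihood distance (step 0.001).
grid = [i / 1000 for i in range(1, 1000)]
d_grid = max(grid, key=lambda d: jc69_log_likelihood(d, n_sites, n_diff))

print("closed-form distance:", round(d_hat, 4))
print("grid ML distance:    ", d_grid)
```

The estimated distance exceeds the raw proportion of differing sites because the model accounts for multiple substitutions at the same site. Empirical amino acid models work on the same principle with a 20-state rate matrix.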
The likelihood (LH) of a hypothesis (H) is equal to the probability of observing the data if that hypothesis were correct. The observed data are usually taken to be the alignment, although it would of course be more reasonable to say that the sequences are what have been observed and that the alignment should then be inferred along with the phylogeny. The statistical method of ML chooses amongst hypotheses by selecting the one which maximizes the likelihood; that is, the one which renders the data the most plausible. In the context of molecular phylogenetics, a model of nucleotide or amino acid replacement permits the calculation of the likelihood for any possible combination of tree topology and branch lengths [1,30,32]. The topology and branch lengths that maximize this likelihood (or, equivalently, its natural logarithm, lnLH, which is almost invariably used to give a more manageable number) are the ML estimates. Any parameters with values not explicitly specified by the replacement model can be estimated simultaneously, again by selecting the values that maximize the likelihood. The fourth step consists of the interpretation of results.

When properties shared by a set of sequences are too subtle or hidden to be represented analytically (or there are too many degrees of freedom), amino acid replacement models should be obtained through an empirical approach. MtPan, a new model of amino acid replacement in Pancrustacea, was constructed using relative rates of estimated amino acid replacement from pairwise comparisons of sequences in the Pancrustacea data set, maintained by us, that are 85% or more identical (see Figure 1). The estimates of the relative rates of amino acid replacement were computed by examining the database and recording the number of times that amino acid type i is observed in one sequence and type j is observed at the corresponding site in a closely related sequence. Interestingly, the new model of evolution shows high probability values for the transitions G-A, V-A, I-V, F-I and T-I with respect to all the other possible transitions in mtPan and in the other two models of evolution. These differences reflect the particular environment and evolutionary trends of mitochondrial proteins in the species we have considered.

Fig. 1. MtPan, a new model of amino acid replacement in Pancrustacea. (Figure not reproduced: amino acid replacement matrix with rows and columns labelled by the 20 amino acids.)

2.2 Bayesian Phylogeny

In Bayesian statistics the goal is to obtain a full probability distribution over all possible parameter values. Obtaining this so-called posterior probability distribution requires combining the likelihood with the prior probability distribution [9,10,11]. The prior probability distribution expresses one's beliefs about the parameters before seeing any data. Often, prior information about an unknown parameter is not available. In such cases, standard non-informative prior distributions are used, i.e., probability distributions which contain little or no information about the parameters, resulting in posterior distributions that are dominated by the likelihood. In phylogeny it is very common for biologists to have strong beliefs about the relationships of some deep branches. In other cases there are theories based on shared organs or apparatus, or on developmental pathways, which may suggest an alteration of the prior. In Bayesian phylogeny the parameters are of the same kind as in maximum likelihood phylogeny, where typical parameters include tree topology, branch lengths, nucleotide frequencies and substitution model parameters [12,13,14,15,16]. The main objective in maximum likelihood, however, is to determine the best point estimates of parameter values, while Bayesian phylogeny aims to calculate a full probability distribution over all possible parameter values.

In practice the posterior distribution is approximated by sampling with Markov chain Monte Carlo (MCMC). If a target distribution has multiple peaks, separated by low valleys, a Markov chain may have difficulty in moving from one peak to another. As a result, the chain might get stuck on one peak and the resulting samples will not approximate the posterior density correctly. This is a serious practical concern for phylogeny reconstruction, as multiple local peaks are known to exist in tree space during heuristic tree search under the maximum parsimony, maximum likelihood and minimum evolution criteria. The same can be expected for stochastic tree search using MCMC. Many strategies have been proposed to improve the mixing of Markov chains in the presence of multiple local peaks in the posterior density. One of the most successful algorithms is Metropolis-coupled MCMC [11,31].
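The multiple-peak problem and the Metropolis-coupled remedy can be illustrated on a toy problem. The following is a minimal sketch, not the MrBayes implementation: the target is a one-dimensional density with two peaks instead of a tree space, the chains are advanced one after another rather than on separate processors, and the heating scheme beta_i = 1 / (1 + heat * i), the step size and all other settings are assumptions made for the example. Only the unheated ("cold") chain is recorded.

```python
import math
import random


def log_target(x):
    """Log of an unnormalised density with two peaks, near -2 and +2."""
    a = -0.5 * ((x + 2.0) / 0.5) ** 2
    b = -0.5 * ((x - 2.0) / 0.5) ** 2
    m = max(a, b)
    return m + math.log(math.exp(a - m) + math.exp(b - m))


def mc3(n_chains=4, n_gen=20000, heat=0.5, step=1.0, seed=1):
    """Toy Metropolis-coupled MCMC; returns the samples of the cold chain."""
    rng = random.Random(seed)
    betas = [1.0 / (1.0 + heat * i) for i in range(n_chains)]  # chain 0 is cold
    states = [-2.0] * n_chains  # every chain starts on the left peak
    cold_samples = []
    for _ in range(n_gen):
        # Within-chain Metropolis update; chain i targets density ** betas[i].
        for i, beta in enumerate(betas):
            proposal = states[i] + rng.gauss(0.0, step)
            log_ratio = beta * (log_target(proposal) - log_target(states[i]))
            if math.log(1.0 - rng.random()) < log_ratio:
                states[i] = proposal
        # Attempt to swap the states of two randomly chosen chains.
        if n_chains >= 2:
            i, j = rng.sample(range(n_chains), 2)
            log_swap = (betas[i] - betas[j]) * (
                log_target(states[j]) - log_target(states[i])
            )
            if math.log(1.0 - rng.random()) < log_swap:
                states[i], states[j] = states[j], states[i]
        cold_samples.append(states[0])
    return cold_samples


cold = mc3()
right = sum(1 for x in cold if x > 0.0) / len(cold)
print("fraction of cold-chain samples on the right peak:", round(right, 3))
```

Raising the target to a power below one flattens the landscape, so the heated chains cross the valley more easily; a successful swap then hands the cold chain a state on the other peak.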
The program Parallel MrBayes implements this variant of MCMC, called "Metropolis-Coupled Markov Chain Monte Carlo" (MCMCMC) [17]. In this algorithm, m chains with different stationary distributions are run in parallel on m or fewer processors, and all but one of them are heated. Heating increases the probability that a chain accepts a proposed state, so heated chains cross valleys in the probability landscape of trees more readily. State swaps are attempted between randomly chosen chains, which improves mixing; when a swap succeeds, it allows other peaks to be explored and limits confinement to local maxima.

2.3 State of the Art, Potentialities of Grid Computing for Phylogeny

The European Commission-funded "Enabling Grids for E-sciencE" (EGEE) [19] project brings together scientists and engineers from more than 240 institutions in 45 countries world-wide to provide a seamless international Grid infrastructure for e-Science that is available to scientists 24 hours a day. Here we give some details in order to encourage the use of this important resource. The EGEE Grid consists of 41,000 CPUs, in addition to about 5 PB (5 million gigabytes) of disk storage plus tape MSS, and maintains 100,000 concurrent jobs. Having such resources available changes the way scientific research can take place, and could significantly increase our ability to analyse complex data sets using methodologies that are more correct, though computationally more intensive.

Middleware is a key component of any grid computing effort, for it serves as the communication layer enabling interaction across hardware and network environments. Early in the project the LCG (Large Hadron Collider Computing Grid) middleware stack was used on the EGEE infrastructure. Most of this stack was later developed and re-engineered into the current middleware solution, gLite. The gLite Grid services follow a Service Oriented Architecture, which facilitates compliance with upcoming Grid standards, and gLite is currently deployed on hundreds of sites as part of the EGEE project, enabling global science in a great number of disciplines [26]. WMProxy is a new component, implemented as a Web service, which provides access to the gLite Workload Management System (WMS) and efficiently handles large numbers of job submission requests and controls.

Job submission requires a description of the job to be executed and a description of the resources needed. These descriptions are provided in a high-level language, the Job Description Language (JDL). JDL is based on Condor classified advertisements (classads) for describing jobs and aggregates of jobs, including jobs of type MPICH, a high-performance and widely portable implementation of MPI (Message Passing Interface). Actual job submission is done by calling in sequence the two service operations jobRegister and jobStart. When a jobRegister request arrives at the WMProxy and the client has the rights to proceed, a set of specific attributes needed by the WMS to handle the request appropriately, together with a generated unique job identifier, is inserted into the job description.
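To make the job description concrete, the sketch below renders a classad-style JDL text from a Python dictionary. It is an illustration only and not the description used for the runs reported here: the attribute names (Type, JobType, NodeNumber, Executable, Arguments, StdOutput, StdError, InputSandbox, OutputSandbox) follow gLite JDL usage of the period as far as we can tell and should be checked against the middleware documentation, the file names are hypothetical placeholders, and the Requirements expression that selects suitable resources is deliberately left out.

```python
def render_jdl(attributes):
    """Render a dict as a classad-style job description: [ Name = value; ... ]."""

    def fmt(value):
        if isinstance(value, bool):
            return "true" if value else "false"
        if isinstance(value, int):
            return str(value)
        if isinstance(value, (list, tuple)):
            return "{" + ", ".join(fmt(v) for v in value) + "}"
        return '"' + str(value) + '"'

    lines = ["  %s = %s;" % (name, fmt(value)) for name, value in attributes.items()]
    return "[\n" + "\n".join(lines) + "\n]"


# Hypothetical 8-node MPICH job; executable and file names are placeholders.
mrbayes_job = {
    "Type": "Job",
    "JobType": "MPICH",
    "NodeNumber": 8,
    "Executable": "mb-mpi",
    "Arguments": "analysis.nex",
    "StdOutput": "std.out",
    "StdError": "std.err",
    "InputSandbox": ["mb-mpi", "analysis.nex"],
    "OutputSandbox": ["std.out", "std.err"],
}

jdl_text = render_jdl(mrbayes_job)
print(jdl_text)
```

Generating the text from a dictionary makes it easy to produce many similar descriptions, for example one per model of evolution or per number of chains.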
The requesting user is then mapped to a local user by means of LCMAPS, which provides authorization functionality based on VOMS (Virtual Organization Membership Service), so that the job, local directories and files are created with appropriate ownership and permissions. When all the aforementioned steps have been completed successfully, the job, with the generated job identifier and the enriched JDL description, is registered with the RB (Resource Broker). From that point on the job is uniquely identified and can be monitored by its identifier throughout the system and the various job states [25].

Bioinformatics usually entails the execution of very complex analysis workflows. Some applications perform and scale very well in a Grid environment, while others are better suited to a dedicated cluster, especially when bound to certain license agreements or when specialized supporting software is required. Current results suggest that the grid is also better suited to computationally expensive jobs, as job submission times often exceed the execution times of small jobs. An obvious disadvantage of the MCMCMC algorithm is that m chains are run and only one chain is used for inference. For this reason, MCMCMC is ideally suited for implementation on parallel machines, since each chain will in general require the same amount of computation per iteration. The EGEE grid infrastructure, with its uncontested parallel capacity, provides a very favourable environment for Bayesian MCMCMC analysis, allowing a user to combine several simultaneous simulations that vary crucial variables such as the model of evolution, the number of generations and the number of chains.

We embedded Parallel MrBayes into the BioMed virtual organisation of the EGEE grid infrastructure as part of the BioinfoGrid project [18]. We used this framework to perform phylogenetic analysis on the Pancrustacea data set by submitting numerous jobs testing three different models of evolution: first MtRev [24], the general matrix available for mitochondrial genomes, which is based on vertebrate taxa; second MtArt [23]; and last our specifically developed matrix, MtPan. One million generations were run, with either 8 or 4 MC-chains, and trees were sampled every 100 generations. Three nexus files were constructed for the batch processing of MrBayes: two stating the GTR model as the rate matrix and setting the prior for the substitution rates in accordance with the MtArt and MtPan matrices, and the third setting the amino acid model prior variable to MtRev.

Many BioMed VO Grid WNs (worker nodes) are not currently configured correctly to execute MPI (parallel) jobs, even though the JDL requirements variable specifies the prerequisites to which the selected CEs (Computing Elements) should adhere. This is because the Grid infrastructure and the maintenance of contributed resources are still evolving. We submitted 12 similar parallel MrBayes jobs at random (not to specific CEs) and, disappointingly, 9 of these jobs failed. For our main analysis we submitted only to selected CEs, each consisting of hundreds of WNs, that had proven to have very good MPICH success rates.
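The run settings described above can be written down as a MrBayes command block. The sketch below generates such a block as text; it is a hedged reconstruction, not the authors' nexus files. The keywords prset, aamodelpr, aarevmatpr and mcmc follow MrBayes 3 usage as far as we know and should be checked against the manual of the installed version. The 190 exchange rates used here are placeholders (all equal to 1), since the MtArt and MtPan values are not listed in the text, and the fixing of equilibrium amino acid frequencies is omitted.

```python
def mrbayes_block(aa_model, n_chains, n_gen=1_000_000, sample_freq=100, rates=None):
    """Return a MrBayes command block for one amino acid analysis.

    aa_model is either a built-in model name such as "mtrev", or "gtr"
    together with the 190 exchange rates of a user-supplied matrix."""
    if aa_model == "gtr":
        if rates is None or len(rates) != 190:  # 20 * 19 / 2 amino acid pairs
            raise ValueError("a fixed amino acid GTR matrix needs 190 exchange rates")
        rate_list = ",".join(format(r, "g") for r in rates)
        prset = "prset aamodelpr=fixed(gtr) aarevmatpr=fixed(%s);" % rate_list
    else:
        prset = "prset aamodelpr=fixed(%s);" % aa_model
    mcmc = "mcmc ngen=%d samplefreq=%d nchains=%d;" % (n_gen, sample_freq, n_chains)
    return "\n".join(["begin mrbayes;", "  " + prset, "  " + mcmc, "end;"])


# Built-in MtRev model, 8 chains, settings as stated in the text.
mtrev_block = mrbayes_block("mtrev", n_chains=8)
print(mtrev_block)

# User-supplied matrix (placeholder rates, NOT the MtPan values), 4 chains.
custom_block = mrbayes_block("gtr", n_chains=4, rates=[1.0] * 190)

# One million generations sampled every 100 generations.
samples_per_run = 1_000_000 // 100
print("trees sampled per run:", samples_per_run)
```

With these settings each run records about 10,000 sampled trees before burn-in is discarded.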

3 Results and Discussion

As mentioned, we submitted only to a handful of CEs which have proven able to execute MPICH job types successfully. The Resource Broker algorithmically selects the most suitable CE from this list. Of our 45 MtArt 8-chain jobs submitted, 15 failed to reach the running state. The running state refers to the status of a job which has passed successfully through the queueing system and is actually being processed by the number of worker nodes specified. After several concurrent submissions we achieved a total of 30 8-chain jobs and 20 4-chain jobs for each of the three evolutionary models MtArt, MtRev and MtPan. The average MrBayes execution time was around 29 hours, while the average submission time (the time it takes a job to reach the running state from the moment it was submitted to the Resource Broker) was around 17 hours. This resulted in total average execution times of around 45 hours for one 8-chain analysis. Thirty 8-chain and twenty 4-chain jobs for each of the three models were run in parallel. This amounts to a total of 960 worker nodes being occupied, in parallel, for an average period of two days. One 4-chain sequential execution, with the same parameters as used in the main analysis, failed to complete within the 7-day grid proxy validity period. This implies that, done sequentially and taking the average running time of one chain as 45 hours, this analysis would have taken up to 960 × 2 days. See Figure 2 for a summary of the execution and submission times for the 30 successful MtArt 8-chain jobs.

Fig. 2. Grid job submission and execution times for MtArt 8-chain

Fig. 3. Box-and-whisker plot of the posterior probabilities obtained using the three different models (x-axis: replacement matrix, mtPan, mtArt and mtRev; y-axis: probability)

In the analysis of our pancrustacean mitochondrial data set, plots of likelihood versus generations, together with the value of the likelihood towards which each run converges, were used to assess how efficiently the analysis explores the likelihood space and reaches the best maximum, and to assess the relative performance of the three amino acid substitution matrices. Most of the runs, regardless of the matrix used, converged to slightly different maxima. This indicates, on one hand, that the resulting topology for each run is highly dependent on the ability of the algorithm to explore the likelihood surface and on the starting point of the search, thus suggesting prudence when interpreting the results. On the other hand, this underlines the importance of conducting different parallel runs and comparing the results in order to have a global outlook on these aspects of the analysis. Furthermore, comparing the actual topologies to which each run converges, it becomes evident that while most of the shallow nodes are common to most resulting trees, the deepest nodes tend to vary, and the differences in likelihood observed across runs, small as they are, depend on rearrangements at the deepest nodes. This behaviour is most likely due to the intrinsic signal of the data, rather than to limitations of the analysis. It also has a detrimental effect on the biological interpretation of the results, as the most interesting nodes are in fact the ones that connect the major lineages of the Pancrustacea, and these are in turn the least stable.

Nevertheless, the tree with the highest posterior probability (Figure 5) is largely congruent with that obtained in a previous analysis [3], and has a precise taxonomic meaning. It shows, in fact, that two hexapod lineages, Collembola and Diplura, do not cluster with the remaining insects, whose closest relatives appear to be in certain lineages of crustaceans (Stomatopoda, Decapoda, Cephalocarida). The only difference between this tree and the tree obtained in [3] is the relative position of a cluster of four taxa (Pachypsilla, Trialeurodes, Xenos, Armillifer), which was embedded in the clade of Insecta in [3], whereas here it is joined with the basalmost lineages of the whole tree and clustered with the Remipedia and the Maxillopoda. This cluster joins species from distant taxonomic lineages: Homoptera (insects), Strepsiptera (insects) and Pentastomida. Their association must therefore be considered an artifact of the analysis, possibly related to an accelerated rate of evolution of these sequences.
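The resource figures quoted at the start of this section can be checked with a few lines of arithmetic. The sketch below assumes one worker node per chain, which is how we read the node count in the text; the variable names are ours.

```python
MODELS = ("MtArt", "MtRev", "MtPan")
JOBS_PER_MODEL = {8: 30, 4: 20}  # chains per job -> number of successful jobs

# Worker nodes occupied, assuming one worker node per chain.
nodes_per_model = sum(chains * jobs for chains, jobs in JOBS_PER_MODEL.items())
total_nodes = nodes_per_model * len(MODELS)

# Average wall-clock time of one 8-chain job: execution plus submission time.
exec_hours, submission_hours = 29, 17
wall_hours = exec_hours + submission_hours

# Upper bound for a fully sequential analysis, at about two days per node.
sequential_days = total_nodes * 2

# Share of the 45 submitted MtArt 8-chain jobs that never reached the running state.
failed_share = 15 / 45

print("worker nodes per model:", nodes_per_model)
print("worker nodes in total: ", total_nodes)
print("hours per 8-chain job: ", wall_hours)
print("sequential upper bound:", sequential_days, "days")
print("failed MtArt 8-chain submissions: %.0f%%" % (100 * failed_share))
```

The sum of the two averages, 46 hours, agrees with the figure of around 45 hours given above, and the sequential bound of 1920 days corresponds to a little over five years.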

Fig. 4. Box-and-whisker plot of the posterior probabilities obtained using the three different models, showing the two MCMCMC implementations of each model separately (x-axis: replacement matrix and number of chains, mtPan 4, mtPan 8, mtArt 4, mtArt 8, mtRev 4 and mtRev 8; y-axis: probability). The 4 chain versions tended to slightly outperform the 8 chain versions.

Fig. 5. The tree with the highest posterior probability of all the simulations

The highest posterior probabilities from the one-million-generation runs (excluding burn-in) of each job were analysed separately for each matrix. From this multiple analysis it was possible to show (Figures 3 and 4) that the MtPan matrix generally converges to higher likelihood values than MtRev and MtArt. This is not unexpected, as this matrix was derived directly from a collection of pancrustacean mitochondrial genes, and is likely to describe the mechanism of sequence evolution in these groups better than matrices developed for different taxonomic groups [23]. From Figure 4 it can be seen that the tree with the highest overall posterior was found during an MtPan 4-chain run. The consensus tree produced by MrBayes for this run is shown in Figure 5 and is considered our best estimate of the phylogenetic relationships among Pancrustacea based on the data set analysed.

4 Conclusion

Here we present a new model of evolution for pancrustacean mitochondrial PCGs which performs, on average, better than currently available models. From a biological standpoint, this analysis has confirmed that mitochondrial genome sequences unambiguously indicate the reciprocal paraphyly of the formal taxa Hexapoda and Crustacea, as traditionally defined [6,21,22,3], therefore implying a new interpretation of the evolution of the most successful lineage of Metazoa. We found that inching towards better models of evolution may require intensive computation. Models may be improved by adding more taxa from which parameters are estimated, and methods of phylogenetic reconstruction may be improved by taking into account the inevitable presence of different rates of evolution in different lineages of the same phylogenetic tree. The Grid is becoming "the" resource for solving large-scale computing applications in Bioinformatics, systems biology and computational medicine. The distributed high performance computer (HPC) and the Grid are "de facto" considered the way to realize the concept of virtual places where scientists and researchers work together to solve complex scientific problems, despite their geographic and organizational boundaries. In phylogenetic inference the grid will become a computational laboratory for testing models of evolution and phylogenetic hypotheses on large trees, which may provide an effective boost to our investigation of the evolutionary process.

References

1. Whelan, S., Liò, P., Goldman, N.: Molecular phylogenetics: State-of-art methods for looking into the past. Trends Genet. 17, 262–272 (2001)
2. Liò, P., Goldman, N.: Models of molecular evolution and phylogeny. Genome Res. 8, 1233–1244 (1998)
3. Carapelli, A., Liò, P., Nardi, F., van der Wath, E., Frati, F.: Phylogenetic analysis of mitochondrial protein coding genes confirms the reciprocal paraphyly of Hexapoda and Crustacea. BMC Evolutionary Biology 7 (2007), doi:10.1186/1471-2148-7-S2-S8
4. Russo, C.A., Takezaki, N., Nei, M.: Efficiencies of different genes and different tree-building methods in recovering a known vertebrate phylogeny. Mol. Biol. Evol. 13, 933–942 (1996)
5. Zardoya, R., Meyer, A.: Phylogenetic performance of mitochondrial protein-coding genes in resolving relationships among vertebrates. Mol. Biol. Evol. 13, 525–536 (1996)
6. Pollock, D.D., Eisen, J.A., Doggett, N.A., Cummings, M.P.: A case for the evolutionary genomics and the comprehensive examination of sequence biodiversity. Mol. Biol. Evol. 17, 1776–1778 (2000)
7. Cao, Y., Janke, A., Waddell, P.J., Westerman, M., Takenaka, O., Murata, S., Okada, N., Pääbo, S., Hasegawa, M.: Conflict among individual mitochondrial proteins in resolving the phylogeny of eutherian orders. J. Mol. Evol. 47, 307–322 (1998)
8. Liò, P.: Phylogenetic and structural analysis of mitochondrial complex I proteins. Gene 345, 55–64 (1999)
9. Liu, J.S., Lawrence, C.E.: Bayesian inference on biopolymer models. Bioinformatics 15, 38–52 (1999)
10. Shoemaker, J.S., Painter, I.S., Weir, B.: Bayesian statistics in genetics: a guide for the uninitiated. Trends Genet. 15, 354–358 (1999)
11. Larget, B., Simon, D.: Markov chain Monte Carlo algorithms for the Bayesian analysis of phylogenetic trees. Mol. Biol. Evol. 16, 750–759 (1999)
12. Huelsenbeck, J.P., Ronquist, F.: MrBayes: Bayesian inference in phylogenetic trees. Bioinformatics 17, 754–755 (2001)
13. Ronquist, F., Huelsenbeck, J.P.: MrBayes 3: Bayesian phylogenetic inference under mixed models. Bioinformatics 19, 1572–1574 (2003)
14. Rannala, B., Yang, Z.: Bayes estimation of species divergence times and ancestral population sizes using DNA sequences from multiple loci. Genetics 164, 1645–1656 (2003)
15. Mau, B., Newton, M.A., Larget, B.: Bayesian phylogenetic inference via Markov chain Monte Carlo methods. Biometrics 55, 1–12 (1999)
16. Yang, Z., Rannala, B.: Bayesian phylogenetic inference using DNA sequences: Markov chain Monte Carlo methods. Mol. Biol. Evol. 14, 717–724 (1997)
17. Altekar, G., Dwarkadas, S., Huelsenbeck, J.P., Ronquist, F.: Parallel Metropolis coupled Markov chain Monte Carlo for Bayesian phylogenetic inference. Bioinformatics 20, 407–415 (2004)
18. http://www.bioinfogrid.eu/
19. http://public.eu-egee.org/
20. Richter, S.: The Tetraconata concept: hexapod-crustacean relationships and the phylogeny of Crustacea. Org. Divers. Evol. 2, 217–237 (2002)
21. Nardi, F., Spinsanti, G., Boore, J.L., Carapelli, A., Dallai, R., Frati, F.: Hexapod origins: monophyletic or polyphyletic? Science 299, 1887–1889 (2003)
22. Cook, C.E., Yue, Q., Akam, M.: Mitochondrial genomes suggest that hexapods and crustaceans are mutually paraphyletic. Proc. R. Soc. Lond. B 272, 1295–1304 (2005)
23. Abascal, F., Posada, D., Zardoya, R.: MtArt: a new model of amino acid replacement for Arthropoda. Mol. Biol. Evol. 24, 1–5 (2007)
24. Yang, Z., Nielsen, R., Hasegawa, M.: Models of amino acid substitutions and applications to mitochondrial protein evolution. Mol. Biol. Evol. 15, 1600–1611 (1998)
25. http://trinity.datamat.it/projects/EGEE/wiki/
26. http://public.eu-egee.org/industry/ifdocuments/glite-flyer.pdf
27. Boore, J.: Animal mitochondrial genomes. Nucleic Acids Res. 27, 1767–1780 (1999)
28. Cameron, S.L., Miller, K.B., D'Haese, C.A., Whiting, M.F., Barker, S.C.: Mitochondrial genome data alone are not enough to unambiguously resolve the relationships of Entognatha, Insecta and Crustacea sensu lato (Arthropoda). Cladistics 20, 534–557 (2004)
29. Hassanin, A., Léger, N., Deutsch, J.: Evidence for multiple reversals of asymmetric mutational constraints during the evolution of the mitochondrial genome of Metazoa, and consequences for phylogenetic inferences. Syst. Biol. 54, 277–298 (2005)
30. Chor, B., Hendy, M.D., Holland, B.R., Penny, D.: Multiple maxima of likelihood in phylogenetic trees: an analytic approach. In: RECOMB 2000, pp. 108–117 (2000)
31. Mossel, E., Vigoda, E.: Limitations of Markov chain Monte Carlo algorithms for Bayesian inference of phylogeny. Ann. Appl. Probab. 16, 2215–2234 (2006)
32. Chor, B., Tuller, T.: Finding a maximum likelihood tree is hard. J. ACM 53, 722–744 (2006)
33. Gascuel, O.: Mathematics of Evolution and Phylogeny. Oxford University Press, USA (2007)
34. Yang, Z.: Computational Molecular Evolution (Oxford Series in Ecology and Evolution). Oxford University Press, USA (2006)
35. Felsenstein, J.: Inferring Phylogenies, 2nd edn. Sinauer Associates (2003)
36. Nielsen, R.: Statistical Methods in Molecular Evolution (Statistics for Biology and Health), 1st edn. Springer, Heidelberg (2005)
