Mar 4, 2013 - Sha Zhu1,â, James H Degnan2 and Bjarki Eldon3 ... coalescents (Donnelly and Kurtz, 1999; Pitman, 1999; Sagitov, 1999). .... Sagitov, S. (1999) The general coalescent with asynchronous mergers of ancestral lines, J. Appl.
arXiv:1303.0673v1 [q-bio.PE] 4 Mar 2013
Hybrid-Lambda: simulation of multiple merger and Kingman gene genealogies in species networks and species trees Sha Zhu 1,∗ , James H Degnan 2 and Bjarki Eldon 3 1 Wellcome Trust Centre for Human Genetics, University of Oxford, UK 2 Biomathematics Research Centre, University of Canterbury, Christchurch, New Zealand 3 Technische Universit¨at Berlin, Germany ∗ to whom correspondence should be addressed March 5, 2013 Abstract Hybrid-Lambda is a software package that simulates gene trees under Kingman or two Lambda-coalescent processes within species networks or species trees. It is written in C++, and released under GNU General Public License (GPL) version 3. Users can modify and make new distribution under the terms of this license. For details of this license, visit http://www.gnu.org/licenses/. Hybrid-Lambda is available at https://code.google.com/p/hybrid-lambda. Keywords: hybridization, lambda coalescent, gene tree, phylogenetic network
Species trees describe ancestral relations among species. Gene trees describe the random ancestral relations of alleles sampled within species. Gene trees and species trees are often assumed to be bifurcating (Degnan and Salter, 2005; Hudson, 1990; Kingman, 1982). However, for organisms ´ exhibiting sweepstakes reproduction, such as oysters and other marine organisms (E Arnason, 2004; Beckenbach, 1994; Eldon, 2011; Eldon and Wakeley, 2006; Hedgecock, 1994; Hedgecock et al., 1982; Sargsyan and Wakeley, 2008), the Kingman coalescent may not be appropriate, as it only allows binary mergers of ancestral lineages. Thus, we consider models that allow more than two lineages to coalesce simultaneously in the gene trees, that is multiple merger coalescents, also known as Λ coalescents (Donnelly and Kurtz, 1999; Pitman, 1999; Sagitov, 1999). The concordance probabilities between multiple merger gene genealogies and a species tree of two species are investigated by Eldon and Degnan (2012). Species trees may also fail to be bifurcating due to either polytomies or hybridization events. Simulating gene trees from a rooted species network modeling hybridization is another application of hybrid-Lambda. The package ms (Hudson, 2002) can simulate gene trees within a general species network. However the input of ms is difficult to automate when the network is sophisticated or generated from other software. Other simulation studies using species networks have either used a small number of network topologies coded individually (for example, in phylonet (Than et al., 2008) or have assumed that gene trees have evolved on species trees embedded within the species network (Holland et al., 2008; Kubatko, 2009; Meng and Kubatko, 2009). The software hybrid-Lambda will help automate simulation studies of hybridization allowing for a large number of species network topologies and allowing gene trees to evolve directly within the network. 1
binary mergers
gene genealogies multiple merger
species network multiple merger
a1 a2 a3
b1
c1 c2
d1
A
B
C
D
Figure 1: Example of a multiple merger gene genealogy with topology (((a1 ,a2 ,a3 ),c1 ),(b1 ,c2 ,d1 )) simulated in a species network with topology ((((B,C)s1)h1#H1,A)s2,(h1#H1,D)s3)r, where H1 is the probability that a lineage has its ancestry from its left parental population.
1
DESCRIPTION
The program input file for hybrid-Lambda is a character string that describes relationships between species. Standard Newick format (Olsen, 1990) is used for inputting species trees and outputting gene trees, whose interior nodes are not labelled. An extended Newick formatted string (Cardona et al., 2008; Huson et al., 2010) labels all internal nodes, and is used for inputting species networks (see Fig. 1).
1.1
Parameters
Hybrid-Lambda can use multiple lineages sampled from each species, then simulate either a Kingman or a multiple merger (Λ) coalescent within a given species network. The coalescent is a continuous-time Markov process, in which times between coalescent events are independent exponential random variables with different rates. The rates are determined by a so-called coalescent parameter in the program that can be input via command line, or a(n) (extended) Newick formatted input string with specific coalescent parameters as branch lengths. By default, the Kingman coalescent is used, for which a population with b lineages sampled has two lineages coalesce at rate λb,2 = 2b . One can choose between two different examples of a Λ coalescent. If the coalescent parameter is between 0 and 1, then we use ψ for the coalescent parameter, and the rate λbk at which k out of b active ancestral lineages merge is b k λbk = ψ (1 − ψ)b−k , ψ ∈ (0, 1), (1) k and if the coalescent parameter is between 1 and 2, then we use α for the coalescent parameter, and the rate of k-mergers is b B(k − α, b − k + α) , α ∈ (1, 2), (2) λbk = B(2 − α, α) k 2
where B(·, ·) is the beta function (Schweinsberg, 2003). The program hybrid-Lambda assumes that the input network (tree) branch lengths are in coalescent units. However, this is not essential. Coalescent units can be converted through an alternative input file with numbers of generation as branch lengths, then divided by its corresponding effective population sizes. By default, effective population sizes on all branches are assumed to be equal and unchanged. Users can change this parameter using the command line, or using a(n) (extended) Newick formatted string to specify population sizes on all branches though another input file. The simulation requires ultrametric species networks, i.e. equal lengths of all paths from tip to root. Hybrid-Lambda checks the distances in coalescent units between the root and all tip nodes and prints out warning messages if the ultrametric assumption is violated.
1.2
Output
Hybrid-Lambda outputs simulated gene trees in three different files: one contains gene trees with branch lengths in coalescent units, another uses the number of generations as branch lengths, and the third uses the number of expected mutations as branch lengths. Besides outputting gene tree files, hybrid-Lambda also provides several functions for analysis purposes: • user-defined random seed for simulation, • a frequency table of gene tree topologies, • a figure of the species network or tree (this function only works when LATEX or dot is installed), • when gene trees are simulated from two populations, hybrid-Lambda can generate a table of relative frequencies of reciprocal monophyly, paraphyly, and polyphyly. Funding: This work was supported by New Zealand Marsden Fund (SZ and JD), EPSRC (BE). This work was partly conducted while JD was a Sabbatical Fellow at the National Institute for Mathematical and Biological Synthesis, an Institute sponsored by the National Science Foundation, the U.S. Department of Homeland Security, and the U.S. Department of Agriculture through NSF Award #EF-0832858, with additional support from The University of Tennessee, Knoxville.
References ´ Arnason, E. (2004). Mitochondrial cytochrome b variation in the high-fecundity Atlantic cod: transAtlantic clines and shallow gene genealogy, Genetics 166, 1871–1885. Beckenbach, A. T. Mitochondrial haplotype frequencies in oysters: neutral alternatives to selection models, Non-neutral Evolution, ed. B Golding, 188–198 (1994). Chapman & Hall, New York. Cardona, G., Rossell, F., and Valiente, G. (2008). Extended newick: it is time for a standard representation of phylogenetic networks. BMC Bioinformatics 9, 532. Degnan, J. H. and Salter, L. A. (2005). Gene tree distributions under the coalescent process. Evolution 59, 24–37.
3
Donnelly, P. and Kurtz, T. G. Particle representations for measure-valued population models, Ann. Probab. 27, 166–205 (1999). Eldon, B. (2011) Estimation of parameters in large offspring number models and ratios of coalescence times, Theor. Popul. Biol. 80, 16–28. Eldon, B. and Degnan, J. H. (2012). Multiple merger gene genealogies in two species: monophyly, paraphyly, and polyphyly for two examples of Lambda coalescents, Theor. Popul. Biol. 82 117-130. Eldon, B. and Wakeley, J. (2006). Coalesent processes when the distribution of offspring number among individuals is highly skewed. Genetics 172, 2621–2633. Hedgecock, D. (1994). Does variance in reproductive success limit effective population sizes of marine organisms? Genetics and Evolution of Aquatic Organisms, ed. A Beaumont, 1222–1344 (1994). Chapman and Hall, London. Hedgecock, D., Tracey, M. and Nelson, K. (1982). Genetics, The Biology of Crustacea vol. 2, ed. LG Abele, 297–403 (1982). Academic Press, New York. Holland, B. R., Benthin, S., Lockhart, P. J., Moulton, V., and Huber, K. T. (2008) Using supernetworks to distinguish hybridization from lineage-sorting BMC Evol. Biol. 8:202. Hudson, R. R. (1990). Gene genealogies and the coalescent process. Oxford Surveys Evolution Biology 7, 1–44. Hudson, R. R. (2002). Generating samples under a wright-fisher neutral model. Bioinformatics 18, 337–338. Huson, D., Rupp, R. and Scornavacca, C. (2010). Phylogenetic Networks: Concepts, Algorithms and Applications. Phylogenetic Networks: Concepts, Algorithms and Applications. Cambridge University Press. Kingman, J. F. C. (1982) On the genealogy of large populations, J. App. Probab. 19A, 27–43. Kubatko, L. S. (2009) Identifying hybridization events in the presence of coalescence via model selection Syst. Biol. 58, 478–488. Meng, C. and Kubatko, L. S. (2009) Detecting hybrid speciation in the presence of incomplete lineage sorting using gene tree incongruence: A model Theor. Popul. Biol. 75, 35–45. Olsen, G. (1990). Gary Olsen’s interpretation of the “Newick’s 8:45” tree format standard. http://evolution.genetics.washington.edu/phylip/newick doc.html. Pitman, J. (1999). Coalescents with multiple collisions. Ann. Probab. 27, 1870–1902. Sagitov, S. (1999) The general coalescent with asynchronous mergers of ancestral lines, J. Appl. Probab. 36, 1116–1125. Sargsyan, O. and Wakeley, J. (2008). A coalescent process with simultaneous multiple mergers for approximating the gene genealogies of many marine organisms. Theor. Popul. Biol. 74, 104–114. 4
Schweinsberg, J. (2003). Coalescent processes obtained from supercritical Galton-Watson processes. Stoch. Proc. Appl. 106, 107–139. Than, C., Ruths, D., and Nakhleh, L. (2008) PhyloNet: a software package for analyzing and reconstructing reticulate evolutionary relationships, BMC Bioinformatics, 9:322. Yu, Y., Than, C., Degnan, J. H. and Nakhleh, L. (2011). Coalescent histories on phylogenetic networks and detection of hybridization despite incomplete lineage sorting, Syst. Biol. 60, 138– 149. Yu, Y., Degnan, J. H. and Nakhleh, L. (2012). The probability of a gene tree topology within a phylogenetic network with applications to hybridization detection PLoS Genet 8, e1002660.
5