bioinformatics applications note

BIOINFORMATICS APPLICATIONS NOTE Genetics and population analysis

Vol. 30 no. 8 2014, pages 1187–1189 doi:10.1093/bioinformatics/btt763

Advance Access publication January 2, 2014

DIYABC v2.0: a software to make approximate Bayesian computation inferences about population history using single nucleotide polymorphism, DNA sequence and microsatellite data

1Inra, UMR1062 CBGP, Montpellier, France, 2Université Montpellier 2, UMR CNRS 5149, I3M, Montpellier, France, 3Institut de Biologie Computationnelle (IBC), 34095 Montpellier, France and 4CNRS-UM2, Institut de Biologie Computationnelle, LIRMM, Montpellier, France

Associate Editor: Gunnar Ratsch

ABSTRACT

Motivation: DIYABC is a software package for a comprehensive analysis of population history using approximate Bayesian computation on DNA polymorphism data. Version 2.0 implements a number of new features and analytical methods. It allows (i) the analysis of single nucleotide polymorphism data at a large number of loci, in addition to microsatellite and DNA sequence data, (ii) efficient Bayesian model choice using linear discriminant analysis on summary statistics and (iii) the serial launching of multiple post-processing analyses. DIYABC v2.0 also includes a user-friendly graphical interface with various new options. It can be run on three operating systems: GNU/Linux, Microsoft Windows and Apple OS X.

Availability: Freely available with a detailed notice document and example projects to academic users at http://www1.montpellier.inra.fr/CBGP/diyabc

Contact: [email protected]

Supplementary information: Supplementary data are available at Bioinformatics online.

Received on July 31, 2013; revised on November 26, 2013; accepted on December 25, 2013

*To whom correspondence should be addressed.

1 INTRODUCTION

One prospect of current biology is that molecular data will help us to reveal the complex demographic processes that have acted on natural populations. The extensive availability of various molecular markers and increased computer power have promoted the development of inferential methods. Among these novel methods, the approximate Bayesian computation (ABC) method (Beaumont et al., 2002) is increasingly used to make inferences from large datasets for complex models in various research fields, including population and evolutionary biology. General statistical features, practical aspects and applications of ABC in evolutionary biology have been reviewed in three recent papers (Beaumont, 2010; Bertorelle et al., 2010; Csilléry et al., 2010).

Briefly, ABC constitutes a recent approach to carrying out model-based inference in a Bayesian setting in which model likelihoods are difficult to calculate (due to the complexity of the models considered) and must be estimated by massive simulations. In ABC, the posterior probabilities of different models and/or the posterior distributions of the demographic parameters under a given model are determined by measuring the similarity between the observed dataset (i.e. the target) and a large number of simulated datasets; all raw datasets (i.e. multilocus genotypes or individual sequences) are summarized by statistics, such as the mean number of alleles or Fst.

Several ABC programs have been proposed to provide solutions to non-specialist biologists (Table 1 in Bertorelle et al., 2010; see also Supplementary Appendix S1). Cornuet et al. (2008, 2010) developed the (coalescent-based) software DIYABC, in which a user-friendly interface helps non-expert users to perform historical inferences using ABC. DIYABC can handle complex population histories including any combination of population divergence events, admixture events and changes in past population size (with population samples potentially collected at different times). DIYABC can be used to compare competing evolutionary scenarios, quantify their relative support and estimate parameters for one or more scenarios. Finally, it provides a way to evaluate the amount of confidence that can be put into the various estimations and to perform model checking.

In this article, we present DIYABC v2.0, a completely rewritten version of the software DIYABC (Cornuet et al., 2008, 2010). Version 2.0 implements a number of new features and analytical methods allowing extensive analyses of large molecular datasets, including single nucleotide polymorphism (SNP) data.
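The similarity-based comparison of observed and simulated summary statistics can be illustrated with a toy sketch. This is not DIYABC code: the two "scenarios", the single summary statistic, the uniform prior on scenarios and all function names are hypothetical. It only shows the idea of estimating the relative support of each scenario as its share among the simulations closest to the target.

```python
import random


def simulate_statistic(scenario, rng):
    """Hypothetical simulator: return one summary statistic for a scenario."""
    # Assumption: scenario 1 yields statistics centred on 0.2, scenario 2 on 0.6.
    centre = 0.2 if scenario == 1 else 0.6
    return rng.gauss(centre, 0.1)


def direct_scenario_support(observed, n_sim=10000, n_closest=500, seed=1):
    """Share of each scenario among the n_closest simulations to the target."""
    rng = random.Random(seed)
    records = []
    for _ in range(n_sim):
        scenario = rng.choice([1, 2])            # uniform prior on scenarios
        stat = simulate_statistic(scenario, rng)
        records.append((abs(stat - observed), scenario))
    records.sort(key=lambda r: r[0])             # closest simulations first
    closest = [scenario for _, scenario in records[:n_closest]]
    return {s: closest.count(s) / n_closest for s in (1, 2)}


support = direct_scenario_support(observed=0.25)
```

With a target statistic of 0.25, most of the retained simulations should come from the hypothetical scenario 1, so `support[1]` should be the larger of the two values.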

2 NEW FEATURES

2.1 Analysis of SNP data

DIYABC v2.0 allows the analysis of statistically independent SNP markers, in addition to microsatellite and DNA sequence data. Compared with other types of markers, SNP loci have low mutation rates, so that polymorphism at such loci results from a single mutation event in the whole gene tree, and genotypes are bi-allelic. To generate a simulated polymorphic dataset at a given SNP locus, we followed the algorithm proposed by Hudson (2002) (cf. the -s 1 option in the program ms associated with Hudson, 2002). Briefly, the genealogy at a given

© The Author 2014. Published by Oxford University Press. All rights reserved. For Permissions, please e-mail: [email protected]


Jean-Marie Cornuet1, Pierre Pudlo1,2,3, Julien Veyssier1,3,4, Alexandre Dehne-Garcia1,3, Mathieu Gautier1,3, Raphaël Leblois1,3, Jean-Michel Marin2,3 and Arnaud Estoup1,3,*

SUPPLEMENTARY MATERIAL of the paper:

DIYABC v2.0: a software to make approximate Bayesian computation inferences about population history using single nucleotide polymorphism, DNA sequence and microsatellite data

Jean-Marie Cornuet1, Pierre Pudlo1,2,3, Julien Veyssier1,3,4, Alexandre Dehne-Garcia1,3, Mathieu Gautier1,3, Raphaël Leblois1,3, Jean-Michel Marin2,3 and Arnaud Estoup1,3,*

1 Inra, UMR1062 CBGP, Montpellier, France, 2 Université Montpellier 2, UMR CNRS 5149, I3M, Montpellier, France, 3 Institut de Biologie Computationnelle (IBC), 95 rue de la Galéra, 34095 Montpellier, France, 4 CNRS-UM2, Institut de Biologie Computationnelle, LIRMM, Montpellier, France

Published in the journal Bioinformatics

Foreword: this supplementary material was written after receiving the reviewers' comments on an earlier draft of the paper. The authors want to thank the reviewers for having encouraged them to write the three appendix sections below, and especially one of the reviewers for having provided constructive comments and useful insights that were particularly helpful when writing Appendix S3.


Appendix S1: Additional comments regarding DIYABC versus other ABC packages used in genomics studies, Hudson's algorithm to simulate SNP data and ascertainment bias for next generation sequencing (NGS) data.

• DIYABC versus other ABC packages that may be used for SNP-based genomics studies.

So far, because the earlier versions 0 and 1 of DIYABC were not suited to SNP-based genomic analysis (in contrast to version 2), recent SNP-based genomic papers have mainly used the package ABCtoolbox (Wegmann et al., 2010) or, at least for the post-processing part of these studies, the R package ‘abc’ (Csilléry et al., 2012). We believe that the DIYABC Version 2 package, which allows efficient simulation of SNP data, will reach many more empirical users than the latter two packages. This is because DIYABC Version 2 provides a complete (from data set simulation to post-processing analyses) user-friendly workflow. In any case, we are confident that the three above-mentioned packages cater for the full spectrum of technical ability and are by no means incompatible with each other.

• Hudson's algorithm to simulate SNP data.

Two models coexist in the literature to explain SNP data. The first and simplest model, which can be simulated with Hudson's algorithm, considers that the gene genealogy of a SNP locus is given by Kingman's coalescent, and that one and only one mutation event occurs during the past history of the gene sample at a given locus. The second model assumes that, at each base pair of the DNA strand, a gene genealogy is drawn following Kingman's coalescent independently of the gene genealogies of other base pairs of the DNA strand. Then, mutations are put at random on the branches of the genealogies at some very low rate. Most of the gene genealogies will not carry any mutation event and the other gene genealogies will carry only one mutation event (hence the presence of bi-allelic SNP loci). The gene genealogies with mutation event(s) are characterized by a total branch length that is larger than that of genealogies without a mutation event. Gene genealogies with mutation event(s), drawn from the corresponding probability distribution, are often referred to in the literature as Unique Event Polymorphism (UEP) genealogies (see for instance Markovtsova et al., 2000).

The Hudson and UEP models are clearly different. Therefore, simulating data with one model and estimating parameters with the other one leads to a bias that is due to misspecification of the model (independent of the effect of ascertainment bias underlined in the main text). Markovtsova et al. (2001) discussed the consequences of this misspecification in the particular context of an evolutionary neutrality test assuming a simple demographic scenario with a single population and an infinite site mutation model. The paper by Markovtsova et al. (2001) provoked a flurry of responses and comments, which globally suggest that the Hudson approximation is correct, at least for the tests that were carried out for infinite sites models, provided that some conditions on the parameters of the mutation model are satisfied. Additional work is certainly needed to investigate the effect of this misspecification bias on SNP data when more complex demographic histories involving several populations are considered.

It is worth stressing here that the bias between Hudson's algorithm and real SNP data might actually be smaller than this misspecification bias. As a matter of fact, the above-mentioned reasoning for defining UEP genealogies does not take into account that the gene genealogies of two adjacent base pairs in the DNA strand are far from being independent. It is hence likely that UEP genealogies tend to exaggerate the increase in total branch length. A third (and more realistic) model would be as follows: draw the genealogies of each base pair of the DNA strand according to a model that includes recombination (e.g. the ancestral recombination graph), draw mutation(s) on these genealogies and then keep those giving polymorphic samples. As far as we know, this third model, which is particularly heavy to simulate, has not been studied in the literature. It will probably produce genealogies with a total branch length larger than Hudson's algorithm, but smaller than UEP genealogies.

In any case, as George E. P. Box wrote in 1976, “all models are wrong but some are useful”. In our situation, Hudson's algorithm is the only one that provides the simulation efficiency and speed necessary in the context of ABC, where large numbers of simulated data sets including numerous SNP loci have to be generated. This is the main reason why the simulation of SNP loci in DIYABC Version 2 relies on the first model, simulated with Hudson's algorithm. Deciding whether the first model or alternatives such as UEP better fit real SNP data would require a statistical model choice procedure that we are not yet able to conduct.

• Ascertainment bias for NGS data.

As pointed out in the main text of the paper, there should be (in principle) no ascertainment bias (AB) from many of the modern NGS methods. It may be useful, however, to implement a basic simulation filter that keeps only the simulated data sets in which the minor allele frequency, computed for instance on the global simulated data set, is > x%, in order to mimic some basic choices made in the post-processing of raw NGS data. It may also be useful to keep track of the full site-frequency spectrum as well as of monomorphic sites, knowing that this would translate into additional statistics that are not yet implemented in DIYABC. Optimally, it might be necessary to produce resequencing data (and not just SNP genotyping/calling) if one needs accurate and fully model-based inference based on the coalescent. It is worth noting that, in order to take into account strong and obvious AB in some SNP datasets, we intend to include the way SNPs were initially ascertained (if documented) directly in the simulation process of the data sets. We hope to finalize this implementation in a future version of the DIYABC software.

References cited

Box,G.E.P. (1976) Science and Statistics. Journal of the American Statistical Association, 71, 791-799.
Csilléry,K. et al. (2012) abc: an R package for approximate Bayesian computation (ABC). Methods in Ecology and Evolution, 3, 475–47.
Markovtsova et al. (2000) The Age of Unique Event Polymorphism. Genetics, 156, 401-409.
Markovtsova et al. (2001) On a test of Depaulis and Veuille. Molecular Biology and Evolution, 18, 1132-1133.
Wegmann,D. et al. (2010) ABCtoolbox: A versatile toolkit for approximate Bayesian computations. BMC Bioinformatics, 11, 116–123.
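The basic minor-allele-frequency (MAF) filter suggested in the ascertainment bias paragraph of this appendix could be sketched as follows. This is only one possible reading, not something implemented in DIYABC v2.0: the filter is applied locus by locus on alleles pooled over all samples, alleles are coded 0/1, and the function names and the 5% threshold are hypothetical.

```python
def minor_allele_frequency(alleles):
    """MAF at one bi-allelic SNP locus; alleles is a list of 0/1 codes
    pooled over all sampled genes (assumption: no missing data)."""
    if not alleles:
        raise ValueError("no genes sampled at this locus")
    freq_allele_1 = sum(alleles) / len(alleles)
    return min(freq_allele_1, 1.0 - freq_allele_1)


def filter_loci(loci, threshold=0.05):
    """Keep only the loci whose pooled MAF is strictly above the threshold."""
    return [locus for locus in loci if minor_allele_frequency(locus) > threshold]


loci = [
    [0, 0, 0, 0, 0, 0, 0, 0, 0, 1],   # MAF = 1/10
    [0] * 39 + [1],                    # MAF = 1/40, below the 5% threshold
    [1, 1, 1, 0, 0, 0, 0, 0],          # MAF = 3/8
]
kept = filter_loci(loci, threshold=0.05)
```

In an ABC setting the same filter would have to be applied to both the observed and the simulated data sets, so that the two remain comparable.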


Appendix S2: Post-processing analyses that can be launched serially from the DIYABC V2 GUI (see the notice document available at http://www1.montpellier.inra.fr/CBGP/diyabc for additional details).

The GUI panel to choose post-processing analyses looks like this:

[Figure: screenshot of the DIYABC V2 GUI panel used to choose post-processing analyses]

• Analysis “Pre-evaluate scenario-prior combinations”

This option allows checking whether some of the models (scenarios), together with the chosen prior distributions, have the potential to generate at least a subset of summary statistics close to the observed summary statistics (i.e. the target statistics obtained from the data set on which one wants to make inferences). In our experience, using this new option before running a full ABC treatment with DIYABC is a convenient and easy way to reveal noticeable misspecification of prior distributions and/or models (see Cornuet et al., 2010).

• Analysis “Compute posterior probabilities of scenarios”

Choosing among a finite set of models (scenarios) is crucial when making inferences about evolutionary history and processes for at least two reasons: (i) it allows making general conclusions about major evolutionary events (e.g. admixture between populations, occurrence of bottleneck events or identification of source populations) and (ii) it makes it possible to estimate posterior probabilities of parameters assuming a single scenario if the latter is strongly supported. In this option, the (relative) posterior probabilities of a finite set of models are estimated using the direct approach and the logistic-regression-based approach (Cornuet et al., 2008). The LDA sub-option allows efficient (rapid) ABC scenario probability computation by applying linear discriminant analysis to the summary statistics before computing the logistic regression (Estoup et al., 2012).

• Analysis “Evaluate confidence in scenario choice”

As pointed out by Bertorelle et al. (2010) and Robert et al. (2011) among others, confidence in model choice may be empirically evaluated through Monte Carlo evaluation of false allocation rates (type I and II errors) based on ABC posterior probabilities computed from simulated pseudo-observed data sets (hereafter named pods), produced by drawing parameter values from prior distributions or using fixed values. Such computations are increasingly requested by ABC experts for assessing the power to discriminate among scenarios. A version of such exploratory analysis is provided through this particular DIYABC option. Note that, in this analysis option, the exact configuration of the observed data set is copied in terms of sample sizes (hence taking into account missing data). The evaluation of confidence in scenario choice may be time consuming, especially when type I and II errors have to be computed from complex scenarios with many summary statistics and when a large number of pods are needed. The LDA sub-option, which allows efficient ABC scenario probability computation using linear discriminant analysis, can also be used here; it gives users a much faster way to compute regression-based probabilities and hence false allocation rates (type I and II errors).

• Analysis “Estimate posterior distributions of parameters”

In this option, assuming a given scenario, Euclidean distances between each simulated data set and the observed data set in the space of summary statistics are computed, and only the simulated data sets closest to the observed data set are retained. The parameter values used to simulate these selected data sets provide a sample of parameter values approximately distributed according to their own posterior distribution. Beaumont et al. (2002) have shown that a local linear regression provides a better approximation of the posterior distribution: this is what is used in this analysis option. Note that such parameter estimation can be processed assuming several scenarios together. In such multiple-scenario parameter estimation, the weight of each scenario is taken into account in the estimation process and only parameters common to all scenarios are estimated. One advantage of DIYABC v2.0 is that it provides the posterior distributions of demographic parameters scaled either by the mutation rate or by the effective population size, in parallel to those of the original parameters. Scaled parameters are sometimes, if not often, the only type of parameters that can be robustly inferred under many evolutionary scenarios (e.g. Wakeley, 2005).
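The rejection step of the parameter estimation analysis (compute Euclidean distances in the space of summary statistics, keep the closest simulations) can be sketched on a toy problem. This is a minimal illustration, not DIYABC's implementation: the simulator, the uniform prior and all names are hypothetical, scaling each statistic by its standard deviation across simulations is an assumption of this sketch, and the local linear regression adjustment of Beaumont et al. (2002) is deliberately left out.

```python
import math
import random


def simulate(theta, rng, n_loci=50):
    """Hypothetical simulator: two summary statistics for parameter theta."""
    draws = [rng.gauss(theta, 1.0) for _ in range(n_loci)]
    mean = sum(draws) / n_loci
    var = sum((d - mean) ** 2 for d in draws) / (n_loci - 1)
    return (mean, var)


def abc_rejection(observed, n_sim=5000, n_keep=50, seed=42):
    """Return the parameter values of the n_keep simulations closest to observed."""
    rng = random.Random(seed)
    table = []                                   # plays the role of a reference table
    for _ in range(n_sim):
        theta = rng.uniform(0.0, 10.0)           # draw from a uniform prior
        table.append((theta, simulate(theta, rng)))

    # Assumption: each statistic is scaled by its standard deviation
    # across simulations before distances are computed.
    scales = []
    for j in range(len(observed)):
        column = [stats[j] for _, stats in table]
        m = sum(column) / len(column)
        scales.append(math.sqrt(sum((c - m) ** 2 for c in column) / (len(column) - 1)))

    def distance(stats):
        return math.sqrt(sum(((s - o) / sc) ** 2
                             for s, o, sc in zip(stats, observed, scales)))

    table.sort(key=lambda row: distance(row[1]))  # closest simulations first
    return [theta for theta, _ in table[:n_keep]]


accepted = abc_rejection(observed=(4.0, 1.0))
posterior_mean = sum(accepted) / len(accepted)
```

The retained values in `accepted` form a crude sample from the approximate posterior distribution; a regression adjustment would then correct each retained value for the remaining gap between its simulated statistics and the observed ones.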


• Analysis “Compute bias and precision in parameter estimations”

If we assume that the evolutionary scenario is correct, the comparison of real (i.e. known) and estimated values of parameters provides some information about the precision of the estimation process. For this kind of analysis, pseudo-observed data sets (pods) are simulated under a given scenario with known values of parameters, produced by drawing parameter values from prior distributions or using fixed values, and by copying the exact configuration of the observed data set in terms of sample sizes (hence taking into account missing data). Such pods are submitted to the above-mentioned ABC estimation process. A number of statistics measuring estimation precision are provided as output (e.g. relative bias, RMSE, RMAD, …, computed on original or scaled parameters). Note that two values are given for each statistic measuring estimation precision. One value is that of the statistic computed from the posterior distribution of parameters (i.e. using the genetic information provided by the data); the other value, between parentheses, is that of the statistic computed from the prior distribution of parameters (i.e. NOT using the genetic information provided by the data but only the information contained in prior distributions). Having two such values allows a better assessment of the level of information provided by genetic data.

• Analysis “Perform model checking”

This option, with newly designed graphical outputs, aims at measuring the discrepancy between a combination of a model and parameter posterior distributions and a “real” data set by considering various sets of test quantities. These test quantities can be chosen among the large set of ABC summary statistics proposed in DIYABC v2.0 (see Cornuet et al., 2010 for further details). Model checking (i.e. “goodness-of-fit” of a model–posterior combination with respect to a target data set) is a crucial statistical treatment that should be systematically processed at the end of an ABC inferential study.
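For the bias and precision analysis described above, two of the statistics named in the text can be sketched with their common textbook definitions. The exact formulas used by DIYABC are given in its notice document and may differ (for instance in how the error is normalised); the function names and the example values below are hypothetical.

```python
import math


def relative_bias(estimates, true_values):
    """Mean of (estimate - true) / true over pseudo-observed data sets."""
    return sum((e - t) / t for e, t in zip(estimates, true_values)) / len(estimates)


def rmse(estimates, true_values):
    """Root mean square error of the point estimates."""
    squared = sum((e - t) ** 2 for e, t in zip(estimates, true_values))
    return math.sqrt(squared / len(estimates))


# Hypothetical example: three pods with known effective population sizes
# and the point estimates returned for each of them.
true_values = [100.0, 200.0, 400.0]
estimates = [110.0, 180.0, 400.0]

bias = relative_bias(estimates, true_values)
error = rmse(estimates, true_values)
```

Computing the same statistics once with estimates drawn from the posterior and once with values drawn from the prior gives the pair of numbers that the analysis reports for each statistic.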

References cited

Beaumont,M.A. et al. (2002) Approximate Bayesian computation in population genetics. Genetics, 162, 2025-2035.
Bertorelle,G. et al. (2010) ABC as a flexible framework to estimate demography over space and time: some cons, many pros. Mol. Ecol., 19, 2609-2625.
Cornuet,J-M. et al. (2008) Inferring population history with DIY ABC: a user-friendly approach to approximate Bayesian computation. Bioinformatics, 24, 2713-2719.
Cornuet,J-M. et al. (2010) Inference on population history and model checking using DNA sequence and microsatellite data with the software DIYABC (v1.0). BMC Bioinformatics, 11, 401.
Estoup,A. et al. (2012) Estimation of demo-genetic model probabilities with Approximate Bayesian Computation using linear discriminant analysis on summary statistics. Mol. Ecol. Res., 12, 846-855.
Robert,C.P. et al. (2011) Lack of confidence in approximate Bayesian computation model choice. Proceedings of the National Academy of Sciences of the United States of America, 108, 15112-15117.
Wakeley,J. (2005) The limits of theoretical population genetics. Genetics, 169, 1-7.


Appendix S3: DIYABC example projects including all post-processing treatments that can be downloaded from http://www1.montpellier.inra.fr/CBGP/diyabc/viewforum.php?f=27.

Note: We also provide in blue characters estimations of differences in running speeds between version 1 and version 2 of DIYABC when producing simulated data sets (i.e. production of the reftable file). Such evaluations are given for example projects 1 and 2 (not appropriate for example project 3, as SNP loci could not be treated in DIYABC V1). Running speed comparisons were processed on a 2 CPU Intel Xeon X5472 computer (Windows XP platform, 32-bit system, 4 GB of RAM).

• Example project 1: Simple scenarios including a single population which endured (or not) a bottleneck in the past and for which several samples were genotyped at microsatellite markers. [DIYABC Version 2 is 2.4 times faster than DIYABC Version 1].

• Example project 2: Complex scenarios including unsampled populations (i.e. "ghost populations") genotyped at microsatellite markers and different sequence loci located on autosomal, X, Y and mtDNA genomic fragments. [DIYABC Version 2 is 1.7 times faster than DIYABC Version 1].

• Example project 3: Simple split-hybridization scenarios including three populations genotyped at SNP markers located on autosomal, X, Y and mtDNA genomic fragments.


locus of all genes sampled in all populations of the studied dataset is simulated back to the most recent common ancestor according to coalescence theory. Then a single mutation event is put at random on one branch of the genealogy (the branch being chosen with a probability proportional to its length relative to the total gene tree length). This algorithm provides the simulation efficiency and speed necessary in the context of ABC, where large numbers of simulated datasets including numerous SNP loci have to be generated (see Supplementary Appendix S1 for additional comments on Hudson’s algorithm).
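The two steps described above (simulate the genealogy back to the most recent common ancestor, then place a single mutation on a branch chosen in proportion to its length) can be sketched as follows. This is a deliberately simplified illustration and not DIYABC code: it assumes a single panmictic population of constant size with time measured in coalescent units, whereas DIYABC simulates genealogies under the user-defined multi-population scenario. The function name and the 0/1 allele coding are hypothetical.

```python
import random


def simulate_snp_locus(n_genes, rng):
    """Return a list of 0/1 alleles for n_genes sampled genes at one SNP locus."""
    if n_genes < 2:
        raise ValueError("at least two genes are needed for a polymorphic locus")
    # Each active lineage: [set of descendant sample indices, current branch length]
    lineages = [[{i}, 0.0] for i in range(n_genes)]
    branches = []                                  # finished branches: (length, descendants)
    while len(lineages) > 1:
        k = len(lineages)
        wait = rng.expovariate(k * (k - 1) / 2.0)  # time to the next coalescence
        for lineage in lineages:
            lineage[1] += wait                     # every active branch grows
        i, j = rng.sample(range(k), 2)             # the pair of lineages that coalesce
        a, b = lineages[i], lineages[j]
        branches.append((a[1], a[0]))
        branches.append((b[1], b[0]))
        merged = [a[0] | b[0], 0.0]                # ancestral lineage starts a new branch
        lineages = [lin for idx, lin in enumerate(lineages) if idx not in (i, j)]
        lineages.append(merged)
    # Place one mutation on a branch chosen proportionally to its length:
    # all genes descending from that branch carry the derived allele (1).
    lengths = [length for length, _ in branches]
    _, carriers = rng.choices(branches, weights=lengths, k=1)[0]
    return [1 if g in carriers else 0 for g in range(n_genes)]


rng = random.Random(2014)
dataset = [simulate_snp_locus(10, rng) for _ in range(100)]
```

Because the mutation always falls on a branch below the root, every simulated locus is polymorphic by construction, which is what makes this approach efficient compared with simulating mutations at a low rate and discarding monomorphic loci.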

2.2 Computation of scenario probability

Estoup et al. (2012) recently proposed a methodological innovation to deal with the discrimination among a large set of complex scenarios through efficient ABC probability computation. It is based on a linear discriminant analysis on summary statistics before the logistic regression analysis (introduced by Fagundes et al., 2007). A major practical advantage is that it substantially decreases the dimension of explanatory variables making computation of scenario probability (100 times) faster. We have implemented this methodological innovation in DIYABC v2.0 for the analysis of both the real datasets and the simulated pseudo-observed datasets used to evaluate the amount of confidence that can be put into the discrimination of a given set of scenarios.
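The dimension reduction at the heart of this approach can be illustrated for the simplest case of two scenarios, where linear discriminant analysis yields a single axis (more generally, at most one axis fewer than the number of scenarios). The sketch below is not the implementation of Estoup et al. (2012) or of DIYABC: it uses the classical two-class Fisher discriminant, toy Gaussian summary statistics and hypothetical names, and it omits the subsequent logistic regression step.

```python
import random


def mean_vector(rows):
    n = len(rows)
    return [sum(r[j] for r in rows) / n for j in range(len(rows[0]))]


def within_class_scatter(groups, means):
    """Sum over classes of the scatter matrices around each class mean."""
    p = len(means[0])
    scatter = [[0.0] * p for _ in range(p)]
    for rows, m in zip(groups, means):
        for r in rows:
            d = [r[j] - m[j] for j in range(p)]
            for a in range(p):
                for b in range(p):
                    scatter[a][b] += d[a] * d[b]
    return scatter


def solve(matrix, rhs):
    """Solve matrix * x = rhs by Gaussian elimination with partial pivoting."""
    n = len(rhs)
    aug = [row[:] + [rhs[i]] for i, row in enumerate(matrix)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(col + 1, n):
            factor = aug[r][col] / aug[col][col]
            for c in range(col, n + 1):
                aug[r][c] -= factor * aug[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        x[r] = (aug[r][n] - sum(aug[r][c] * x[c] for c in range(r + 1, n))) / aug[r][r]
    return x


def fisher_axis(stats1, stats2):
    """Discriminant axis w solving S_w w = m1 - m2 for two scenarios."""
    m1, m2 = mean_vector(stats1), mean_vector(stats2)
    s_w = within_class_scatter([stats1, stats2], [m1, m2])
    return solve(s_w, [a - b for a, b in zip(m1, m2)])


def project(row, axis):
    return sum(x * w for x, w in zip(row, axis))


# Toy reference tables: three summary statistics simulated under two scenarios.
rng = random.Random(7)
scenario1 = [[rng.gauss(0.2, 0.1), rng.gauss(0.5, 0.1), rng.gauss(1.0, 0.1)] for _ in range(200)]
scenario2 = [[rng.gauss(0.4, 0.1), rng.gauss(0.5, 0.1), rng.gauss(1.2, 0.1)] for _ in range(200)]
axis = fisher_axis(scenario1, scenario2)
scores1 = [project(r, axis) for r in scenario1]
scores2 = [project(r, axis) for r in scenario2]
```

Here three explanatory variables are replaced by one discriminant score per data set; a regression fitted on such scores has far fewer coefficients to estimate than one fitted on the full set of summary statistics, which is where the speed gain comes from.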

2.3 New graphical interface and random number generator

DIYABC v2.0 has a new user-friendly graphical interface structured into two main parts: (i) one part covering the definition of scenarios, prior distributions and summary statistics, and the production of simulated datasets with parameter values drawn from the priors, and (ii) another part covering all types of post-processing computations typical of ABC analyses. Among the new options proposed, part (i) allows the definition of different groups of markers characterized by different mutation models and summary statistics, and part (ii) allows launching multiple post-processing analyses serially (Supplementary Appendix S2). Random number generators (RNG) are an important issue, especially when several processors are used simultaneously for parallel computing. In DIYABC v2.0, we used RNGs of the Mersenne Twister type. In the multithreaded sections of the code that require random draws, each thread uses its own random generator. We initialize the different RNGs with the algorithm proposed by Matsumoto and Nishimura (2000) to produce independent random streams.
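The principle of giving each thread its own generator can be sketched in Python, whose `random` module is itself based on the Mersenne Twister. Note the substitution: this sketch simply seeds each generator differently from a master generator, which is simpler than, and does not offer the same guarantees as, the dynamic creation algorithm of Matsumoto and Nishimura (2000) used in DIYABC. The function names and seed values are hypothetical.

```python
import random
from concurrent.futures import ThreadPoolExecutor


def worker(seed, n_draws):
    """Draw random numbers from a generator private to this task."""
    rng = random.Random(seed)      # own Mersenne Twister instance, no shared state
    return [rng.random() for _ in range(n_draws)]


def run_parallel(master_seed, n_threads=4, n_draws=5):
    """Run one worker per thread, each with its own seed."""
    seeder = random.Random(master_seed)
    seeds = [seeder.getrandbits(64) for _ in range(n_threads)]
    with ThreadPoolExecutor(max_workers=n_threads) as pool:
        return list(pool.map(worker, seeds, [n_draws] * n_threads))


streams = run_parallel(master_seed=12345)
```

Keeping one generator per thread avoids both lock contention on a shared generator and results that depend on thread scheduling, so a run can be reproduced from the master seed alone.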

2.4 Implementation

DIYABC v2.0 is a multithreaded program that runs on three operating systems: GNU/Linux, Microsoft Windows and Apple OS X. Computational procedures are written in C++, and the graphical user interface is based on PyQt, a Python binding of the Qt framework.


3 DISCUSSION

One of the main innovations of DIYABC v2.0 is that it can analyze SNP data using an efficient simulation algorithm, which allows the treatment of multi-population datasets with a large number of loci (e.g. several thousand to tens of thousands of loci within a few hours to a few days). The analyzed SNP data are assumed to correspond to independent, selectively neutral loci, without any ascertainment bias (AB, i.e. the deviations from expected theoretical results due to the SNP discovery process, in which a small number of individuals from selected populations are used as the discovery panel). AB may distort measures of diversity and possibly change conclusions drawn from these measures in unexpected ways (e.g. Albrechtsen et al., 2010). AB is mainly a concern when using SNP data obtained from chip-based high-throughput genotyping. It should affect to a much lesser extent SNP data obtained from recent next-generation sequencing technologies, such as shotgun sequencing or restriction-site associated DNA sequencing techniques, which are increasingly popular, including in population genetics studies of non-model species (Davey et al., 2012). See Supplementary Appendix S1 for additional comments on AB.

Another advantage of DIYABC v2.0 is that it provides the posterior distributions of demographic parameters scaled either by the mutation rate or by the effective population size, in parallel to those of the original parameters. Scaled parameters are sometimes, if not often, the only type of parameters that can be robustly inferred under many evolutionary scenarios (e.g. Wakeley, 2005). Owing to the compilation optimization of the C++ code and the multithreading of additional computation sections of the program, DIYABC v2.0 also runs faster than the previous version of the program (Supplementary Appendix S3).
Finally, the new interface includes an automatic procedure to produce the different files to easily launch simulations on a computer cluster, hence obtaining access to larger computational resources.

ACKNOWLEDGEMENTS

The authors thank the ‘beta-users’ (Eric Lombaert, Michael Fontaine, Carine Brouat, Thomas Guillemaud, Christophe Plantamp, Johan Michaux and Marie-Pierre Chapuis) who tested the software with their data.

Funding: French Agence Nationale de la Recherche (ANR-09BLAN-0145-01), Inra–Jeune Equipe IGGiPop, CBGP HPC computational platform and NUMEV Labex.

Conflict of Interest: none declared.

REFERENCES

Albrechtsen,A. et al. (2010) Ascertainment biases in SNP chips affect measures of population divergence. Mol. Biol. Evol., 27, 2534–2547.
Beaumont,M.A. (2010) Approximate Bayesian computation in evolution and ecology. Annu. Rev. Ecol. Evol. Syst., 41, 379–406.
Beaumont,M.A. et al. (2002) Approximate Bayesian computation in population genetics. Genetics, 162, 2025–2035.
Bertorelle,G. et al. (2010) ABC as a flexible framework to estimate demography over space and time: some cons, many pros. Mol. Ecol., 19, 2609–2625.
Cornuet,J.M. et al. (2008) Inferring population history with DIY ABC: a user-friendly approach to approximate Bayesian computation. Bioinformatics, 24, 2713–2719.



Cornuet,J.M. et al. (2010) Inference on population history and model checking using DNA sequence and microsatellite data with the software DIYABC (v1.0). BMC Bioinformatics, 11, 401.
Csilléry,K. et al. (2010) Approximate Bayesian computation (ABC) in practice. Trends Ecol. Evol., 25, 410–418.
Davey,J.W. et al. (2012) Genome-wide genetic marker discovery and genotyping using next-generation sequencing. Nat. Rev. Genet., 12, 499–510.
Estoup,A. et al. (2012) Estimation of demo-genetic model probabilities with approximate Bayesian computation using linear discriminant analysis on summary statistics. Mol. Ecol. Res., 12, 846–855.
Fagundes,N.J.R. et al. (2007) Statistical evaluation of alternative models of human evolution. Proc. Natl Acad. Sci. USA, 104, 17614–17619.
Hudson,R. (2002) Generating samples under a Wright-Fisher neutral model of genetic variation. Bioinformatics, 18, 337–338.
Matsumoto,M. and Nishimura,T. (2000) Dynamic Creation of Pseudorandom Number Generators. In: Fang,F. et al. (eds) Monte Carlo and Quasi-Monte Carlo Methods 1998. Springer-Verlag, New York, pp. 56–69.
Wakeley,J. (2005) The limits of theoretical population genetics. Genetics, 169, 1–7.

