Molecular Ecology Resources (2014) 14, 209–214

doi: 10.1111/1755-0998.12157

NEESTIMATOR v2: re-implementation of software for the estimation of contemporary effective population size (Ne) from genetic data

C. DO,* R. S. WAPLES,† D. PEEL,‡ G. M. MACBETH,§ B. J. TILLETT¶ and J. R. OVENDEN**

*Conservation Biology Division, Northwest Fisheries Science Center, 2725 Montlake Blvd East, Seattle, WA 98112, USA
†Northwest Fisheries Science Centre, NOAA Fisheries, 2725 Montlake Blvd East, Seattle, WA 98112, USA
‡CSIRO Computational Informatics, Castray Esplanade, Hobart, Tas., 7004, Australia
§Queensland Department of Agriculture, Fisheries and Forestry, 80 Ann St., Brisbane, Qld 4000, Australia
¶Australian Institute of Marine Science, UWA Oceans Institute, 35 Stirling Highway, Crawley, WA 6009, Australia
**Molecular Fisheries Laboratory, School of Biomedical Sciences, University of Queensland, Otto Hirschfeld Building (81), St Lucia, Qld 4072, Australia

Abstract

NEESTIMATOR v2 is a completely revised and updated implementation of software that produces estimates of contemporary effective population size, using several different methods and a single input file. NEESTIMATOR v2 includes three single-sample estimators (updated versions of the linkage disequilibrium and heterozygote-excess methods, and a new method based on molecular coancestry), as well as the two-sample (moment-based temporal) method. New features include the following: (i) an improved method for accounting for missing data; (ii) options for screening out rare alleles; (iii) confidence intervals for all methods; (iv) the ability to analyse data sets with large numbers of genetic markers (10 000 or more); (v) options for batch processing large numbers of different data sets, which will facilitate cross-method comparisons using simulated data; and (vi) correction for temporal estimates when individuals sampled are not removed from the population (Plan I sampling). The user is given considerable control over input data and over the composition and format of output files. The freely available software has a new JAVA interface and runs under MacOS, Linux and Windows.

Keywords: heterozygote-excess, linkage disequilibrium, molecular coancestry, Plan I and II temporal sampling

Received 18 May 2013; revision received 25 July 2013; accepted 25 July 2013

Introduction

Spurred by recent advances in development of molecular markers and nonlethal methods for extracting DNA from natural populations, interest in using genetic methods to estimate contemporary effective population size (Ne) has grown exponentially over the past decade (Palstra & Fraser 2012). Until recently, most such estimates have used the temporal method, which requires at least two samples from the same population spaced in time. However, several new single-sample estimators have been developed recently (Nomura 2008; Waples & Do 2008; Zhdanova & Pudovkin 2008) and since 2009, published estimates using single-sample methods have eclipsed those from the temporal method by a wide margin (Palstra & Fraser 2012).

Correspondence: Jennifer R. Ovenden, Fax: +61-7-3365-1766; E-mail: [email protected]

Given the variety of available methods, an attractive option is to develop software that can apply multiple methods to the same data set. Since 2004, this role has been filled by NEESTIMATOR v1.4 (Ovenden et al. 2007). However, that original implementation predated most of the recent developments in single-sample methods, and this has increasingly limited its usefulness. Here, we describe a completely updated and revamped version of the NEESTIMATOR software (version 2.0) that includes (i) three single-sample estimators [a bias-corrected version of the linkage disequilibrium method (Waples & Do 2008); an updated version of the heterozygote-excess method (Zhdanova & Pudovkin 2008); and a new implementation of the molecular coancestry method (Nomura 2008)]; and (ii) the standard temporal method (Waples 1989), with three different options for computing the standardized variance in allele frequency, F [Fc (Nei & Tajima 1981); Fk (Pollak 1983); and Fs (Jorde & Ryman 2007)].

Published 2013. This article is a U.S. Government work and is in the public domain in the USA.

The new version has a flexible and friendly graphical user interface (Fig. 1) and is suitable for empirical and simulated data sets containing varying numbers of nuclear genotypes consisting of two or more loci and having two or more alleles per locus. The genotypes can represent one to many populations that can be sampled once or at two or more times. NEESTIMATOR (v2) also has versions for Windows, MacOS and Linux operating systems. This programme note summarizes key features of the new software and should be read in conjunction with the primary literature describing the concept of genetic effective population size (e.g. Schwartz et al. 2007; Charlesworth 2009; Luikart et al. 2010; Palstra & Fraser 2012) and the published estimation methods (Waples 1989; Nomura 2008; Waples & Do 2008; Zhdanova & Pudovkin 2008).
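As orientation for the temporal method, the sketch below computes two of the F statistics in their commonly cited single-locus forms (Fc of Nei & Tajima 1981 and Fk of Pollak 1983) and converts F into an estimate of Ne with the moment-based correction of Waples (1989). It is a minimal Python illustration, not the NEESTIMATOR source code: the function names, the single-locus simplification and the example frequencies are assumptions made here, and Fs is omitted.

```python
def fc_locus(x, y):
    """Fc for one locus. x, y: allele frequencies at the two sampling times.
    Assumes every listed allele is present in at least one sample and that
    no allele is fixed in both samples (otherwise a denominator is zero)."""
    k = len(x)
    return sum((xi - yi) ** 2 / ((xi + yi) / 2 - xi * yi)
               for xi, yi in zip(x, y)) / k


def fk_locus(x, y):
    """Fk for one locus, under the same assumptions as fc_locus."""
    k = len(x)
    return sum((xi - yi) ** 2 / ((xi + yi) / 2)
               for xi, yi in zip(x, y)) / (k - 1)


def temporal_ne(f, s0, st, t, census_n=None):
    """Moment-based temporal estimate of Ne.
    f: standardized variance in allele frequency; s0, st: sample sizes
    (individuals) at the two times; t: generations between samples.
    census_n=None applies the Plan II correction; supplying a census size
    adds the 1/N term used under Plan I."""
    drift = f - 1 / (2 * s0) - 1 / (2 * st)  # remove expected sampling error
    if census_n is not None:
        drift += 1 / census_n
    if drift == 0:
        return float("inf")
    return t / (2 * drift)  # negative when sampling error exceeds f


# Hypothetical two-allele locus sampled two generations apart.
x = [0.5, 0.5]
y = [0.6, 0.4]
print(fc_locus(x, y), fk_locus(x, y))
print(temporal_ne(fc_locus(x, y), s0=50, st=50, t=2))
print(temporal_ne(0.015, s0=50, st=50, t=2))  # negative, read as 'infinite'
```

The last call shows how a small F combined with modest sample sizes yields a negative value, which is the situation treated under 'Negative or infinite estimates of Ne'.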

Data input

NEESTIMATOR (v2) accepts common input formats (GENEPOP or FSTAT). The user browses directories to select the appropriate input file and can choose to show only files that are in acceptable formats (.TXT, .GEN, .DAT). One or multiple methods for calculating Ne can be performed simultaneously, generally on a single data input file. There is a batch option for processing many separate files. Input files can include data for a large number of samples. For the single-sample methods (linkage disequilibrium, heterozygote-excess and molecular coancestry), each sample is treated as a separate 'population'. For the temporal method, each population is represented by two or more samples taken at different times, separated by a known number of generations. Generations for each sample can be defined as whole or fractional numbers. In the simplest circumstance, an input file for the temporal method would contain two samples, separated by a number of generations defined by the user. The temporal method would produce a single estimate of Ne applicable to the number of generations between samples, while the single-sample methods would produce separate estimates applicable to each sampled generation.

More advanced sampling strategies can be implemented. For example, the user can specify that the first population was sampled at generations zero and two, the second population at generations three and five, and the remaining three populations were all sampled at generations zero, four and five. In this circumstance, the input data file would contain 13 total samples, analysed as five separate populations. The user can also specify whether samples were taken after reproduction or nonlethally before reproduction, so that sampled individuals can contribute to future generations (Plan I), or whether individuals (typically juveniles) are sampled without replacement before reproduction (Plan II) (Waples 2005). An estimate of census size is needed for temporal estimation under Plan I.

The software also allows the user flexibility in defining parameters for the analyses. For all methods, the user can choose to screen out rare alleles with frequencies below a user-specified criterion, commonly referred to as Pcrit. Previously, this option has only been available for LDNe (Waples & Do 2008). Furthermore, subsets of the input data can be selected for analysis in the graphical user interface. For example, if the data file includes ten populations, the software can be directed to analyse the first two only. The user can also restrict the number of individuals analysed (e.g. to the first 10 or 20) in each sample. Additionally, loci can be selectively excluded, either by specifying a range (e.g. loci 1–5 and 10–15) or by listing individual loci (e.g. loci 2, 4, 6). For the linkage disequilibrium method, the user can toggle between the assumptions of random or monogamous mating.

When an input file contains thousands of loci, or a large number of individuals per population, the LD and coancestry methods can take hours or days to run. The interface makes an approximate assessment of this possibility and displays a warning dialogue box if necessary; the user can then decide whether to continue or to use options available on the interface to limit the data. If the user chooses to run, the terminal screen reports progress at set milestones so that the user can estimate when the run will be finished.

Fig. 1 Key features of the user interface of NEESTIMATOR v2.
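The rare-allele screening described in this section can be pictured with a short Python sketch. It only illustrates the idea of dropping alleles whose sample frequency falls below Pcrit; the function name and the dictionary layout are hypothetical, and the way NEESTIMATOR (v2) applies the criterion internally is not reproduced here.

```python
def screen_rare_alleles(allele_counts, pcrit):
    """Return the alleles at one locus whose frequency is at least pcrit.
    allele_counts maps an allele label to its count in the sample.
    Assumption: alleles exactly at pcrit are kept, since the text screens
    out alleles with frequencies *below* the criterion."""
    total = sum(allele_counts.values())
    if total == 0:
        return {}
    return {allele: count for allele, count in allele_counts.items()
            if count / total >= pcrit}


# Hypothetical locus scored in 50 diploid individuals (100 gene copies).
counts = {"A": 96, "B": 3, "C": 1}
print(screen_rare_alleles(counts, pcrit=0.02))  # allele C (frequency 0.01) is dropped
print(screen_rare_alleles(counts, pcrit=0.0))   # all alleles retained
```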

Missing data

The software describes the extent of missing data in output files. NEESTIMATOR software (v2) implements an improved method to account for missing data by calculating a unique fixed-inverse variance-weighted harmonic mean (Peel et al. 2013). Here, the sample size is taken as the weighted mean sample size across loci, with weights based on the number of alleles. If no data are missing, the formulas using weighted harmonic means will reduce to formulas given by the Jorde & Ryman (2007) method. In evaluations using simulated data generated under a wide range of scenarios (see Fig. 1, Peel et al. 2013), this new method outperformed the simple weighted mean that was implemented to correct for presence of missing data in version 1.4 of NEESTIMATOR. It also is preferred over the method used to jointly weight by sample size and number of independent comparisons, which was implemented in LDNe (Waples & Do 2008).
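To make the term concrete, the sketch below shows the general definition of a weighted harmonic mean applied to per-locus sample sizes. It is only an illustration of the arithmetic: the actual weights derived by Peel et al. (2013) are not reproduced, and the weights used here (number of alleles minus one) are a placeholder assumption, as are the function name and the example numbers.

```python
def weighted_harmonic_mean(values, weights):
    """General weighted harmonic mean: sum(w) / sum(w / v)."""
    return sum(weights) / sum(w / v for v, w in zip(values, weights))


# Hypothetical per-locus sample sizes after dropping missing genotypes.
sample_sizes = [50, 40, 25]
# Placeholder weights (alleles per locus minus one); an assumption for
# illustration, not the weighting derived in Peel et al. (2013).
alleles_per_locus = [2, 2, 3]
weights = [k - 1 for k in alleles_per_locus]

print(weighted_harmonic_mean(sample_sizes, weights))
# With equal weights the result is the ordinary harmonic mean.
print(weighted_harmonic_mean(sample_sizes, [1, 1, 1]))
```

With no missing data every locus has the same sample size, and any choice of weights returns that common value.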

Data output files

One potential downside of software offering multiple analysis methods is large, hard-to-navigate output data files. NEESTIMATOR (v2) overcomes this by generating a simple default output file which describes estimated population parameters for each selected analysis method and allows the user the option to select additional, more detailed output files. For example, the user can choose to have results for each method printed in a separate file that is organized in a streamlined, tabular format that is easy to analyse and import into other software. Other options include reporting frequency data at each locus and results for each pair of loci in the linkage disequilibrium method; the latter can be particularly useful for evaluating evidence for physical linkage in genomics studies.

Confidence intervals

NEESTIMATOR (v2) provides confidence intervals for all methods and in several cases implements new and improved routines. Potential bias associated with standard parametric chi-squared confidence intervals for the LD method is reduced by implementing the jackknife method of Waples & Do (2008) as an alternative and allowing the user to determine if one or both intervals are relevant for their analyses. For the heterozygote-excess method, our implementation corrects an error in the confidence interval method proposed by Zhdanova & Pudovkin (2008). Nomura (2008) did not propose a method for constructing confidence intervals for his molecular coancestry method; NEESTIMATOR (v2) implements a new jackknife method developed specifically for this purpose. An important caveat is that the performance of new methods for confidence intervals implemented in NEESTIMATOR (v2) has not been evaluated. In particular, use of large numbers (100s or 1000s) of SNP loci, many of which inevitably will be linked, introduces important issues related to pseudo-replication that need quantitative evaluation. Precision of estimates based on large numbers of loci might be substantially lower than suggested by traditional methods for computing CIs.
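For readers unfamiliar with jackknifing, the following Python sketch shows a generic delete-one jackknife standard error. It is not the specific routine of Waples & Do (2008) nor the new coancestry jackknife in NEESTIMATOR (v2); the function name and the toy per-locus values are assumptions used only to show the resampling idea.

```python
import math


def jackknife_se(data, statistic):
    """Delete-one jackknife standard error of statistic(data).
    data: list of observations (e.g. one value per locus);
    statistic: function mapping a list to a number."""
    n = len(data)
    # Recompute the statistic n times, leaving out one observation each time.
    leave_one_out = [statistic(data[:i] + data[i + 1:]) for i in range(n)]
    mean_loo = sum(leave_one_out) / n
    variance = (n - 1) / n * sum((v - mean_loo) ** 2 for v in leave_one_out)
    return math.sqrt(variance)


def mean(values):
    return sum(values) / len(values)


# Toy per-locus values; for the mean, the jackknife standard error equals
# the usual standard error of the mean.
per_locus = [1.0, 2.0, 3.0, 4.0, 5.0]
print(jackknife_se(per_locus, mean))
```

Because linked loci are not independent observations, a jackknife of this kind inherits the pseudo-replication concern raised in the text.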

Negative or infinite estimates of Ne

All the methods considered here are based on a genetic index that has two components: one due to genetic drift (the signal) and one due to sampling a finite number of individuals. Unbiased estimators depend on knowing the sample size (so that the expected magnitude of sampling error can be calculated) and subtracting that from the index. By chance, however, the expected sampling error can exceed the observed value of the index, in which case the correction results in a negative estimate of Ne. The usual interpretation in this case is that the estimate of Ne is infinity, that is, there is no evidence for variation in the genetic characteristic caused by a finite number of parents; it can all be explained by sampling error (see discussion in Waples & Do 2010). An equivalent phenomenon also can occur with unbiased estimators of genetic distance or FST (e.g. Nei 1978; Weir & Cockerham 1984). In NEESTIMATOR (v2), negative point estimates and confidence limits are reported as 'infinity' in the main output file. In accessory output files, however, the actual negative values are reported, as negative estimates of Ne contain valuable information when included in harmonic mean calculations to provide an overall estimate of Ne, for example, when there are several replicate samples from the same population.

Rare alleles

NEESTIMATOR software (v2) provides options for screening out rare alleles for all methods except molecular coancestry (for which allele frequency is not an issue), using the same protocols as LDNe (Waples & Do 2008). By default, the software conducts and reports results for separate analyses that use all alleles, or which screen out alleles with frequencies below PCrit values of 0.01, 0.02 and 0.05. The user can change these default settings to implement any desired PCrit value(s). The user also has an option to select an additional output file that contains the allele frequencies for each locus for each population and reports the number of alleles per locus that were removed because they were below the user-specified PCrit value(s).

Examples

We used genetic data simulated using EASYPOP (Balloux 2001) to illustrate some of the novel features of NEESTIMATOR (v2). In the first example, we simulated two groups of 100 isolated populations with true Ne = 100, and for each group, we estimated effective size using the three single-sample estimators. In the first group of populations, we tracked 20 loci similar to microsatellites (μ = 0.0005, maximum of 10 alleles per locus); in the second group, we tracked 200 loci similar to SNPs (μ = 10⁻⁷, maximum of two alleles per locus). We initialized using the Maximum Diversity option and used a burn-in period (25 generations) that produced an average heterozygosity for the 'microsat' loci of approximately 0.8 and ensured that most or all of the 'SNP' loci were still segregating for both alleles in most or all populations. We used PCrit = 0.02 for the microsat loci and used all alleles for the SNP loci. All 100 individuals were sampled for the genetic analyses. We recorded the minimum and maximum estimates for each method, as well as the harmonic mean N̂e and the coefficient of variation in 1/N̂e, which is the drift signal these methods respond to (Wang 2001, 2009). For the latter two metrics, infinite estimates were converted to 99 999.

Results (Fig. 2; Table 1) show that about 80–90% of the estimates for LDNe fell within the range 80–120, while the remaining values were between 120 and 160. For the other two methods, in contrast, most estimates were substantially too low or too high, with 10% or less of the N̂e values falling between 80 and 120. About 30% of the estimates for the heterozygosity excess method were infinite, as were 10–20% of those for the molecular coancestry method. An interesting result was that patterns of bias and precision for all three methods were similar for the 'microsat' and 'SNP' analyses. As expected, based on previous results (Waples & Do 2010), use of PCrit = 0.02 with the microsat loci led to a slight (6%) upward bias for the LD method, while the upward bias was lower for SNPs (about 3%; Table 1) and precision was slightly improved with SNPs (fewer infinite estimates, lower CV).

The second example compared the three methods for computing the temporal F. We simulated a metapopulation of 50 populations, each with an effective size of 100. After a burn-in period of complete panmixia (island model with migration rate = 0.98 per generation), we imposed a single generation of isolation before collecting data. This produced 50 populations of Ne = 100 that, on average, were as divergent from each other as would be samples from a single population taken two generations apart. We made 25 independent pairwise comparisons of these 50 populations and treated them as temporal samples taken two generations apart. We repeated the process four times to produce 100 replicate temporal comparisons. Sample size again was 100 individuals. For the temporal comparisons, we tracked 20 'microsat' loci with up to 20 alleles each (hence large numbers of rare alleles).

In the first analysis, we compared performance of Fs, Fc and Fk using all alleles (PCrit = 0). Results (Table 1) agreed with Jorde & Ryman's (2007) conclusion that Fs is both less biased and less precise than Fc and Fk. Harmonic mean N̂e for Fs showed a slight (about 2%) downward bias, compared with an upward bias of more than 10% for Fc and Fk. On the other hand, CV (1/N̂e) for Fs was about 40% higher than for the other two indices. Table 1 also shows in more detail how rare alleles affect the estimates from Fk: there is little or no bias for PCrit = 0.02 or 0.05; substantial upward bias only occurs when alleles at frequency <0.02 are included. There would therefore be little reason to include these alleles (bias rises to >10% while CV drops by only one-fifth) unless the user was much more concerned about precision than bias.

Fig. 2 Distribution of estimates of effective population size (N̂e) from three single-sample estimators (LDNe, Het excess, Coancestry), based on 100 replicate, simulated data sets using 20 'microsatellite' loci (top panel, with up to 10 alleles each) or 200 'SNP' loci (bottom panel, with up to two alleles each). Horizontal axis: N̂e; vertical axis: Percent. True Ne was 100. The numbers above the vertical bars for N̂e > 200 indicate the number of those estimates that were infinitely large. The microsatellite analyses used PCrit = 0.02; the SNP analyses used PCrit = 0.

Table 1 Summary of results comparing performance of effective size estimators on simulated data with true Ne = 100

Single sample
                Microsats; PCrit = 0.02               SNPs; PCrit = 0
                LDNe     Het Excess   Coancestry      LDNe     Het Excess   Coancestry
Hmean (N̂e)      106.0    77.5         39.6            102.7    78.4         28.2
Min             82.2     18.2         14.6            84.1     19.5         11.3
Max             139.6    Infinite     Infinite        131.7    Infinite     Infinite
% Infinite      0.0      30.0         17.0            0.0      29.0         11.0
CV (1/N̂e)       0.114    0.974        0.777           0.095    0.992        0.648

Temporal
                PCrit = 0                             Fk
                Fs       Fc           Fk              PCrit = 0.05   PCrit = 0.02   PCrit = 0
Hmean (N̂e)      98.1     113.6        111.7           96.9           100.6          111.7
Min             59.5     83.2         81.0            57.2           65.1           81.0
Max             167.3    157.0        156.1           179.1          177.3          156.1
CV (1/N̂e)       0.204    0.144        0.144           0.227          0.182          0.144

In each case, estimates of Ne reflect data for 100 replicates. Figure 2 shows distribution of single-sample estimates summarized here.
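The two summary metrics used in Table 1, the harmonic mean of the estimates and the coefficient of variation of 1/N̂e, can be computed along the following lines. This Python sketch follows the text in replacing infinite estimates with 99 999; the function name, the use of the sample standard deviation and the example values are assumptions, not details taken from the software.

```python
import math
import statistics

INFINITE_AS = 99999  # stand-in for infinite estimates, as stated in the text


def summarize_estimates(estimates):
    """Return (harmonic mean of Ne estimates, CV of 1/Ne).
    Infinite estimates are replaced by INFINITE_AS before inversion.
    Assumption: the CV uses the sample standard deviation."""
    finite = [INFINITE_AS if math.isinf(e) else e for e in estimates]
    inverse = [1 / e for e in finite]
    harmonic_mean = len(inverse) / sum(inverse)
    cv = statistics.stdev(inverse) / statistics.mean(inverse)
    return harmonic_mean, cv


# Hypothetical replicate estimates, one of them infinite.
hmean, cv = summarize_estimates([50.0, 100.0, math.inf])
print(round(hmean, 2), round(cv, 3))

# A negative estimate (as reported in the accessory output files) can enter
# the same calculation directly and pulls the harmonic mean upward.
print(summarize_estimates([50.0, -200.0])[0])
```

Working with 1/N̂e rather than N̂e keeps a few very large or infinite estimates from dominating the summary, which is why the harmonic mean is the natural average here.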


These examples should not be considered to represent definitive evaluations of performance of any of these methods, as only a few specific scenarios were considered. Nevertheless, they illustrate how easy it is, using routine features of the new NEESTIMATOR (v2), to generate comparative information that previously would have required much more effort to compile.

Download and usage

The software can be downloaded at no cost from http://molecularfisherieslaboratory.com.au/neestimator-software. The user can select between MacOS, Linux or Windows versions. To run the NEESTIMATOR (v2) software, start the graphical user interface file: Windows or Mac users can double click on the NEESTIMATOR.jar files, whereas Linux users can start the program from the command line by executing 'java -jar ./NeGUI.jar'. An example of an input data file is provided, as well as a help file in .pdf and .html formats.

Although NEESTIMATOR (v2) can in theory handle arbitrarily large numbers of individuals, loci and populations, large combinations can slow the program considerably, and it is possible that the capabilities could be exceeded at some point. We have successfully run the 32-bit program using the LD method with a data set that included a single population with 27 individuals and >46 000 loci. This analysis involved calculation of r² values for over one billion pairs of loci. The analysis, including calculation of jackknifed confidence intervals, took about 2 h for each PCrit value used on a Dell OptiPlex 390 PC running Windows 7.
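The figure of over one billion locus pairs follows from simple counting: L loci give L(L − 1)/2 distinct pairs. The short Python check below uses 46 000 loci as a lower bound, since the text states only that the data set had more than 46 000; the function name is illustrative.

```python
import math


def locus_pairs(n_loci):
    """Number of distinct pairs of loci, n(n - 1) / 2."""
    return math.comb(n_loci, 2)


print(locus_pairs(46000))  # 1057977000, i.e. just over one billion pairs
```

Because the number of pairs grows with the square of the number of loci, doubling the loci roughly quadruples the work for the LD method.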

Acknowledgements

We thank authors of the methods included here (Per Erik Jorde, Tetsuro Nomura, Alexander Pudovkin and Oxana Zhdanova) for reviewing and confirming the accuracy of implementations of their methods. We also are indebted to our cadre of beta testers, who diligently evaluated earlier versions of the software and provided valuable comments and feedback (Tiago Antão, Dean Blower, Mark Christie, Christine Dudgeon, Jon Hesse, Wes Larson, Friso Palstra, Ivan Phillipsen, Malin Pinsky and Ryan Waples), and to Dezhi Peng for sharing a large data set.

References

Balloux F (2001) EASYPOP (version 1.7): a computer program for population genetics simulations. Journal of Heredity, 92, 301–302.
Charlesworth B (2009) Effective population size and patterns of molecular evolution and variation. Nature Reviews Genetics, 10, 195–205.
Jorde PE, Ryman N (2007) Unbiased estimator for genetic drift and effective population size. Genetics, 177, 927–935.
Luikart G, Ryman N, Tallmon DA, Schwartz MK, Allendorf FW (2010) Estimation of census and effective population sizes: the increasing usefulness of DNA-based approaches. Conservation Genetics, 11, 355–373.
Nei M (1978) Estimation of average heterozygosity and genetic distance from a small number of individuals. Genetics, 89, 583–590.
Nei M, Tajima F (1981) Genetic drift and estimation of effective population size. Genetics, 98, 625–640.
Nomura T (2008) Estimation of effective number of breeders from molecular coancestry of single cohort sample. Evolutionary Applications, 1, 462–474.
Ovenden J, Peel D, Street R et al. (2007) The genetic effective and adult census size of an Australian population of tiger prawns (Penaeus esculentus). Molecular Ecology, 16, 127–138.
Palstra FP, Fraser DJ (2012) Effective/census population size ratio estimation: a compendium and appraisal. Ecology and Evolution, 2, 2357–2365.
Peel D, Waples RS, Macbeth GM, Do C, Ovenden JR (2013) Accounting for missing data in the estimation of contemporary genetic effective population size (Ne). Molecular Ecology Resources, 13, 243–253.
Pollak E (1983) A new method for estimating the effective population size from allele frequency changes. Genetics, 104, 531–548.
Schwartz MK, Luikart G, Waples RS (2007) Genetic monitoring as a promising tool for conservation and management. Trends in Ecology & Evolution, 22, 25–33.
Wang JL (2001) A pseudo-likelihood method for estimating effective population size from temporally spaced samples. Genetical Research Cambridge, 78, 243–257.
Wang JL (2009) A new method for estimating effective population sizes from a single sample of multilocus genotypes. Molecular Ecology, 18, 2148–2164.
Waples RS (1989) A generalized approach for estimating effective population size from temporal changes in allele frequency. Genetics, 121, 379–391.
Waples RS (2005) Genetic estimates of contemporary effective population size: to what time periods do the estimates apply? Molecular Ecology, 14, 3335–3352.
Waples RS, Do C (2008) LDNE: a program for estimating effective population size from data on linkage disequilibrium. Molecular Ecology Resources, 8, 753–756.
Waples RS, Do C (2010) Linkage disequilibrium estimates of contemporary Ne using highly variable genetic markers: a largely untapped resource for applied conservation and evolution. Evolutionary Applications, 3, 244–262.
Weir BS, Cockerham CC (1984) Estimating F-statistics for the analysis of population structure. Evolution, 38, 1358–1370.
Zhdanova O, Pudovkin AI (2008) Nb_HetEx: a program to estimate the effective number of breeders. Journal of Heredity, 99, 694–695.

C.D. wrote the software, with input from R.W., D.P. and G.M. J.O. and R.W. led the project. J.O. coordinated the project. J.O. and B.T. wrote the NeEstimator v2 help file and drafted the manuscript. All authors contributed to and approved the final manuscript.

Data Accessibility

The software files, including an example input data file and documentation, are available from http://molecularfisherieslaboratory.com.au/neestimator-software.
