Copyright © 2003 by the Genetics Society of America

Maximum-Likelihood and Markov Chain Monte Carlo Approaches to Estimate Inbreeding and Effective Size From Allele Frequency Changes

Guillaume Laval,* Magali SanCristobal†,1 and Claude Chevalet†

*Computational and Molecular Population Genetics Laboratory, Zoology Institute, University of Bern, 3012 Bern, Switzerland and †Laboratoire de Génétique Cellulaire, Institut National de la Recherche Agronomique, 31326 Castanet-Tolosan, France

Manuscript received November 28, 2002
Accepted for publication March 25, 2003

ABSTRACT

Maximum-likelihood and Bayesian (MCMC algorithm) estimates of the increase of the Wright-Malécot inbreeding coefficient, $F_t$, between two temporally spaced samples, were developed from the Dirichlet approximation of allelic frequency distribution (model MD) and from the admixture of the Dirichlet approximation and the probabilities of fixation and loss of alleles (model MDL). Their accuracy was tested using computer simulations in which $F_t$ = 10% or less. The maximum-likelihood method based on the model MDL was found to be the best estimate of $F_t$ provided that initial frequencies are known exactly. When founder frequencies are estimated from a limited set of founder animals, only the estimates based on the model MD can be used for the moment. In this case no method was found to be the best in all situations investigated. The likelihood and Bayesian approaches give better results than the classical F-statistics when markers exhibiting a low polymorphism (such as the SNP markers) are used. Concerning the estimations of the effective population size, all the new estimates presented here were found to be better than the F-statistics classically used.

MONITORING genetic diversity in animal populations has become an important concern for institutions involved in the preservation of wildlife and of endangered species, as well as in animal breeding. In the latter case, a few breeds are selected with high selection pressure, while other breeds are no longer extensively used and are faced with risks of extinction. In both cases, severe reductions in the genetic effective size are observed, leading to the loss of genetic diversity. Measuring this loss can be performed by means of historical surveys of allelic frequencies at a number of polymorphic loci. Such time-spaced sampling protocols have been applied in a wide range of populations, from natural and domestic taxa (Waples 1990; Kantanen et al. 1999). Furthermore, the development of systematic molecular analysis, through hybridization onto DNA chips or high-throughput mass spectrometry analyzers of single-nucleotide polymorphism, with expected lower costs, will allow population geneticists to perform large-scale DNA analysis and will probably make this type of study very common in the future.

Consider a neutral marker exhibiting $k$ alleles and an isolated population of effective size $N$, which is described by its allele frequencies $p_0 = (p_{0,1}, \ldots, p_{0,i}, \ldots, p_{0,k})$ and $p_t = (p_{t,1}, \ldots, p_{t,i}, \ldots, p_{t,k})$ (for $i = 1, \ldots, k$), in a first generation, named the founder generation, and $t$ generations later, respectively. In the $t$th generation

1 Corresponding author: Laboratoire de Génétique Cellulaire, INRA, Chemin de Borde-Rouge - Auzeville, BP 27, 31326 Castanet-Tolosan Cedex, France. E-mail: [email protected]

Genetics 164: 1189–1204 (July 2003)

the expectation and variance of the allele frequency distribution are given by

$$E(p_{t,i} \mid p_0) = p_{0,i} \tag{1}$$

$$\mathrm{Var}(p_{t,i} \mid p_0) = \left[1 - \left(1 - \frac{1}{2N}\right)^{t}\right] p_{0,i}\,(1 - p_{0,i}), \tag{2}$$

when genetic drift is assumed during $t$ discrete and nonoverlapping generations. The quantity $1 - (1 - 1/(2N))^t$, which can be viewed as the growth from the founder generation of the Wright-Malécot inbreeding coefficient (Wright 1931; Malécot 1948), will be denoted $F_t$ and used as a measure of time.

A theoretical framework based on F-statistics proposed by Krimbas and Tsakas (1971) was developed by Nei and Tajima (1981) and Pollak (1983) to derive the variance effective size $N$ of populations from the approximated relation $\hat{N} = t/2\hat{F}_t$, in which $t$, the number of generations between two samplings, is known. The estimations of $F_t$ and of the variance effective size performed in a natural population give the increase in the inbreeding coefficient and the size of a Wright-Fisher population that would experience a comparable increase in variance of gene frequency over time.

In fact, estimates of allele frequencies are obtained by sampling a number of individuals in the population, which suggests using the theory of coalescence. Following it (Figure 1, right, broken arrows), the probability of getting a sample $m_t = \{m_{t,1}, \ldots, m_{t,i}, \ldots, m_{t,J_{0,\bullet}}\}$ at time $t$, given the true initial frequencies $p_0 = \{p_{t,1}, \ldots, p_{t,i}, \ldots, p_{t,J_{0,\bullet}}\}$, is obtained by conditioning on the true number $J_{0,\bullet} = \sum_i J_{0,i}$ of original genes, or lines of descent, from which


G. Laval, M. SanCristobal and C. Chevalet

Figure 1. Graph showing different ways to get the probability of a sample drawn from the $t$th generation. The coalescence approach is presented on the right-hand side with broken arrows, and the alternative approach, developed in the present article, is on the left-hand side with solid arrows.

the final sample originates. The probability distribution of $J_{0,\bullet}$, given the size $m_{t,\bullet} = \sum_i m_{t,i}$ of the final sample, can be derived exactly (Tavare 1984, Sect. 7.3). Under that condition, the probability of the observed sample $m_t$ is obtained by summing over the possible drawings $J_{0,i}$ of $J_{0,\bullet}$ genes from the original population with the combinatorial distribution between the partitions $J_{0,i}$ and $m_t$ that is independent of time (Wilson and Balding 1998; Beaumont 1999).

In the following, we consider the alternative approach (Figure 1, left, solid arrows), introducing the direct transition between initial allele frequencies $p_0$ and allele frequencies $p_t$ at time $t$, followed by the final (multinomial) sampling giving the observed sample $m_t$. The asymptotic distribution of allele frequencies is known under special cases involving mutations between a finite number of alleles and leading to Dirichlet distributions (Wright 1951), or under the infinite-alleles model, leading to distributions such as Ewens' sampling formula (Ewens 1972) and related results (Tavare 1984). In the present context we neglect effects of mutations since we refer to medium or small populations analyzed over a short interval of time during which mutations are not expected to play an important role in the change of allelic frequencies. Hence, we refer to the simple drift process as the only source of frequency fluctuations.

The exact transition due to drift has no simple analytical expression. We therefore investigated approximations of joint allele probability using the Dirichlet distribution (Balding and Nichols 1995; Holsinger 1999). The parameters of the Dirichlet are adjusted to fit with the known moments of the drift process. As a by-product, we use the same model as described in Kitada et al. (2000) to directly estimate the increase of the inbreeding coefficient and derive from it the variance effective size.

Since a model based on the Dirichlet approximation at time $t$ implicitly sets $p_{t,i} \neq 0$ for every allele existing in the founder sample, it is clear that this first model cannot take into account the loss of alleles due to the drift process. We introduced a second model, which is an admixture of allele fixation and loss probabilities (Chevalet 2000) and the Dirichlet distribution, to consider that one or more founder alleles can disappear ($p_{t,i} = 0$ for some $i$). Distributions of the sample $m_t$ are obtained conditionally on initial exact frequencies $p_0$.

In this article we therefore consider the following two cases. First, we propose maximum-likelihood estimates based on two models, Dirichlet only and Dirichlet and allele loss probabilities admixture, when founder frequencies are known. This theoretical situation allows us to discuss the relevance of the Dirichlet approximation and the benefit given by the introduction of the allele loss probabilities. Second, the founder frequencies are estimated from a small sample of founder animals. In such a case the second model taking allele loss into account is hardly tractable. Developing unbiased estimates requires further statistical developments that are not presented in this article. Here, we consider only the first model (Dirichlet only) in which the sampling process within the founder generation is dealt with either from a heuristic point of view, checking for possible corrections of maximum-likelihood estimates, or using a Monte Carlo Markov chain (MCMC) algorithm in a Bayesian context.

Comparisons with two F-statistic methods, one introduced by Nei and Tajima (1981) and promoted by Barker et al. (1998) and another that is derived from Reynolds' genetic distance (Reynolds et al. 1983), were performed using computer simulations focused on short-term evolution. The largest simulated values of $F_t$ were chosen to be slightly $> 0.1$, i.e., 10 generations of drift for an effective population size of 50.

Since the number of generations between the founder and the $t$th generation is known, we test the accuracy of the estimations of $N$ from the maximum-likelihood estimations of $F_t$ using the classical formula $\hat{N} = t/2\hat{F}_t$. We also introduce a corrected estimate of $N$. Furthermore, to test these new methods (maximum-likelihood estimates and MCMC algorithm) on a real data set, a French snail population (a species of importance for French consumers) was analyzed.
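As a numerical illustration of the two relations used throughout, $F_t = 1 - (1 - 1/(2N))^t$ and $\hat{N} = t/2\hat{F}_t$, the following minimal sketch evaluates them for illustrative values ($N = 100$, $t = 22$). The function names are assumptions made for this example and are not part of the original work.

```python
def inbreeding_increase(n_eff, t):
    """F_t = 1 - (1 - 1/(2N))^t for t discrete generations of pure drift."""
    return 1.0 - (1.0 - 1.0 / (2.0 * n_eff)) ** t


def effective_size(f_hat, t):
    """Approximate variance effective size, N = t / (2 F_t)."""
    return t / (2.0 * f_hat)


# Illustrative values only: 22 generations at effective size 100.
f_22 = inbreeding_increase(100, 22)
n_back = effective_size(f_22, 22)
print(f"F_t = {f_22:.4f}, back-calculated N = {n_back:.1f}")
```

Because $t/2F_t$ relies on the first-order approximation $F_t \approx t/2N$, the back-calculated value is somewhat larger than the true $N$ even when $F_t$ is known exactly.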

STATISTICAL BACKGROUND

Multinomial-Dirichlet model: In the following sections, this model is called multinomial-Dirichlet (MD). We assumed that the allele-frequency distributions in the current generation can be approximated by a Dirichlet distribution (Balding and Nichols 1995; Holsinger 1999; Kitada et al. 2000), with parameters $\alpha_t = (\alpha_{t,1}, \ldots, \alpha_{t,i}, \ldots, \alpha_{t,k})$,

$$f(p_t \mid \alpha_t) = \frac{\Gamma(A_t)}{\prod_i \Gamma(\alpha_{t,i})} \prod_{i=1}^{k} p_{t,i}^{\alpha_{t,i} - 1}, \tag{3}$$

Inbreeding and Effective Size Estimates

where $\sum_i \alpha_{t,i} = A_t$. Parameters $\alpha_t$ can be related to $F_t$ by adjusting the first two moments under the genetic drift model (Equations 1 and 2) and under the statistical Dirichlet model. This leads to

$$\alpha_{t,i} = A_t\, p_{0,i} \quad \text{with} \quad A_t = \frac{1}{F_t} - 1.$$

It may be shown that the drift and the Dirichlet distributions are only approximately equal, since third moments differ by an amount proportional to $F_t^2$, indicating that the approximation is valid only for small $F_t$ values.

Taking account of gene sampling in the $t$th generation is performed as follows: for one locus, the gene sampling gives partitions $m_t = (m_{t,1}, \ldots, m_{t,i}, \ldots, m_{t,k})$. Let us denote the total number of sampled alleles, i.e., twice the number of individuals, as $\sum_i m_{t,i} = m_{t,\bullet}$. Assuming that the sample stage is described by a multinomial drawing, the whole model is a compound multinomial-Dirichlet model:

$$m_t \mid p_t \sim \mathcal{M}_k(m_{t,\bullet};\, p_t) \quad \text{and} \quad p_t \mid F_t, p_0 \sim \mathcal{D}_k(A_t\, p_0).$$

Accordingly, an approximate likelihood $\mathcal{L}_{MD}$ of $m_t$ given $p_0$ and $A_t$ (after integrating $p_t$ out) can be written as

$$\mathcal{L}_{MD} = \frac{\Gamma(A_t)}{\prod_i \Gamma(A_t\, p_{0,i})}\; \frac{\prod_i \Gamma(m_{t,i} + A_t\, p_{0,i})}{\Gamma(m_{t,\bullet} + A_t)}\; \frac{\Gamma(m_{t,\bullet} + 1)}{\prod_i \Gamma(m_{t,i} + 1)}. \tag{4}$$

Allele loss taken into account: In the following sections, this model is called multinomial-Dirichlet and allele loss (MDL). The Dirichlet distribution is no longer valid when fixation or loss of alleles occurs and is expected to bias the likelihood since any null allele frequency is considered as the result of final sampling only. Taking into account a possible loss of alleles during the drift process implies that $f(p_t \mid p_0, F_t)$ is written as a mixture of discrete and continuous terms, introducing the probabilities that some alleles can be lost. Let $S$ be a state in which some of the alleles are lost; then

$$f(p_t \mid p_0, F_t) = \sum_S \Pr(S \mid p_0, F_t)\, \mathrm{Tr}(p_t \mid S, p_0, F_t). \tag{5}$$

$\Pr(S \mid p_0, F_t)$ is the probability of getting state $S$ at reduced time $F_t$ from the initial conditions $p_0$ and is derived from the probabilities that alleles are fixed,

$$\Pr(p_{t,i} = 1 \mid p_0, F_t), \tag{6}$$

which are given by solutions of the classical partial differential equation of Kimura (1955). Then $\mathrm{Tr}(p_t \mid S, p_0, F_t)$ stands for the distribution of transient frequencies $p_{t,i}$, excluding null frequencies, conditional on state $S$, on "time" $F_t$, and on initial conditions $p_{0,i}$. Appendix A shows the method for the calculation of the probabilities of $S$ states and to derive expectations $q_{S,i}$ of nonnull frequencies under the various $S$ conditions. Following the same heuristics as before, we approximated the transient distributions by Dirichlet distributions, adjusting their parameters to the genetic drift model.

In practice, the following approximations were used. Fixation probabilities were approximated following Chevalet (2000), and the $\alpha_{S,i}$ parameters of the Dirichlet characterizing a transient state $S$ were set to $\alpha_{S,i} = A_t\, q_{S,i}$, keeping a single unconditional $A_t$ parameter, and taking

$$q_{S,i} = \frac{p_{0,i}}{1 - p_{0,(S\text{-lost})}}, \tag{7}$$

where $p_{0,(S\text{-lost})}$ is the total of initial frequencies of lost alleles in state $S$, both approximations being justified by the short-term scope of the present scenario. Writing $f(p_t \mid p_0, F_t)$ as a mixture (Equation 5), the likelihood of a sample becomes a mixture involving various $S$ terms:

$$\mathcal{L}_{MDL} = L(m_t \mid p_0, F_t) = \sum_S \int_{p_t} \Pr(m_t \mid p_t, S)\, \mathrm{Tr}(p_t \mid S, p_0, F_t)\, \Pr(S \mid p_0, F_t).$$

Since the first term $\Pr(m_t \mid p_t, S)$ is zero for any $S$ state in which one allele is observed in the sample while it is of null frequency in state $S$, there are only one or two terms in the previous likelihood. If all the alleles are observed in the sample (state $S_1$), then the likelihood $\mathcal{L}_{MDL}$ is equivalent to $\mathcal{L}_{MD}$ in Equation 4, weighted by $\Pr(S_1 \mid p_0, F_t)$. Otherwise, at least one allele is not observed in the sample. In that case, two $S$ states must be considered, the one in which all frequencies $p_{t,i}$ are positive but sampling did not allow some alleles to be observed (state $S_1$) and the state that indicates that unobserved alleles have been lost (state $S_2$). For example, for three alleles (see Appendix A), if $m_{t,1} > 0$, $m_{t,2} > 0$, and $m_{t,3} = 0$, the whole likelihood $\mathcal{L}_{MDL}$ is given by

$$\mathcal{L}_{MDL} = \Pr(111 \mid p_0, F_t)\, \mathcal{D}_{111} + \Pr(110 \mid p_0, F_t)\, \mathcal{D}_{110} \tag{8}$$

with

$$\mathcal{D}_{111} = \frac{\Gamma(A_t)}{\prod_{i=1}^{3} \Gamma(A_t\, q_{111,i})}\; \frac{\prod_{i=1}^{3} \Gamma(m_{t,i} + A_t\, q_{111,i})}{\Gamma(m_{t,\bullet} + A_t)}\; \frac{\Gamma(m_{t,\bullet} + 1)}{\prod_{i=1}^{3} \Gamma(m_{t,i} + 1)}$$

$$\mathcal{D}_{110} = \frac{\Gamma(A_t)}{\prod_{i=1}^{2} \Gamma(A_t\, q_{110,i})}\; \frac{\prod_{i=1}^{2} \Gamma(m_{t,i} + A_t\, q_{110,i})}{\Gamma(m_{t,\bullet} + A_t)}\; \frac{\Gamma(m_{t,\bullet} + 1)}{\prod_{i=1}^{2} \Gamma(m_{t,i} + 1)},$$

where 111 stands for state $S_1$ in which the three alleles have been kept, and 110 for state $S_2$ in which the third allele has been lost during the drift process.

ESTIMATION PROCEDURE
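The estimation procedures of this section all rest on evaluating the likelihood of Equation 4. A minimal sketch of that evaluation, and of its maximization over $F_t$ for known founder frequencies, is given here. It makes several assumptions that are not from the original work: every founder frequency is strictly positive, the log-likelihood has a single maximum in the search interval, a golden-section search is used as a simple stand-in for the quasi-Newton algorithm, and the toy data and function names are hypothetical.

```python
from math import lgamma


def log_lik_md(m_t, p0, f):
    """Log of Equation 4 for one locus (all p0[i] must be > 0)."""
    a_t = 1.0 / f - 1.0                      # A_t = 1/F_t - 1
    n = sum(m_t)                             # total number of sampled genes
    ll = lgamma(a_t) - lgamma(n + a_t) + lgamma(n + 1)
    for m, p in zip(m_t, p0):
        ll += lgamma(m + a_t * p) - lgamma(a_t * p) - lgamma(m + 1)
    return ll


def log_lik_multilocus(loci, f):
    """Sum of single-locus log-likelihoods (independent loci)."""
    return sum(log_lik_md(m_t, p0, f) for m_t, p0 in loci)


def maximize_f(loci, lo=1e-4, hi=0.5, tol=1e-7):
    """Golden-section search for the F maximizing the log-likelihood.

    Assumes a single maximum inside [lo, hi].
    """
    g = (5 ** 0.5 - 1) / 2
    a, b = lo, hi
    c, d = b - g * (b - a), a + g * (b - a)
    fc, fd = log_lik_multilocus(loci, c), log_lik_multilocus(loci, d)
    while b - a > tol:
        if fc > fd:                          # maximum lies in [a, d]
            b, d, fd = d, c, fc
            c = b - g * (b - a)
            fc = log_lik_multilocus(loci, c)
        else:                                # maximum lies in [c, b]
            a, c, fc = c, d, fd
            d = a + g * (b - a)
            fd = log_lik_multilocus(loci, d)
    return (a + b) / 2


# Hypothetical toy data: (sample counts m_t, known founder frequencies p0).
toy_loci = [((40, 10), (0.5, 0.5)), ((20, 30), (0.5, 0.5)), ((33, 17), (0.5, 0.5))]
f_hat = maximize_f(toy_loci)
print(f"Maximum-likelihood F on toy data: {f_hat:.3f}")
```

Working with `lgamma` rather than the gamma function itself keeps the computation stable for the large arguments that arise when $F_t$ is small and $A_t$ is therefore large.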

The likelihoods $\mathcal{L}_{MD}$ and $\mathcal{L}_{MDL}$, in Equations 4 and 8, depend on the true founder frequencies. Thus we consider the following two cases:

Known founder frequencies: This situation allows us to check by simulation the validity of the Dirichlet approximation made in this study. However, this situation can be found in some selected breeds in which all founder animals are known and can be genotyped. Assuming statistical independence between loci ($\ell$), the maximization of the multilocus log-likelihood

$$\log \mathcal{L} = \sum_{\ell} \log \mathcal{L}_{\ell},$$

in which $\mathcal{L}_{\ell} = \mathcal{L}_{MD}$ in model MD (Equation 4) and $\mathcal{L}_{\ell} = \mathcal{L}_{MDL}$ in model MDL (Equation 8), was performed using a quasi-Newton algorithm. In model MD and in model MDL, the maximum-likelihood estimators are named $\hat{F}_{MD(ML)}$ and $\hat{F}_{MDL(ML)}$, respectively.

Estimated founder frequencies: For one locus, gene sampling within the founder generation gives partitions $m_0 = (m_{0,1}, \ldots, m_{0,i}, \ldots, m_{0,k})$. Let the total number of alleles sampled be $\sum_i m_{0,i} = m_{0,\bullet}$. The estimations of $F_t$ can be based on likelihoods (Equations 4 and 8), where $p_0$ are replaced by consistent estimators $x_0$ ($x_{0,i} = m_{0,i}/m_{0,\bullet}$). Since maximum-likelihood estimations do not explicitly account for the sampling process in the founder generation, this was performed as follows.

Model MD (Dirichlet): $\hat{F}_{MD(ML)}$ shows a positive bias of $1/m_{0,\bullet}$ (data not shown), corresponding to Reynolds et al.'s (1983) distance between the sample and the founder generation in a multinomial sampling scheme. Hence, as in Krimbas and Tsakas (1971), a correction $\hat{F}_{MD(corrML)} = \hat{F}_{MD(ML)} - 1/m_{0,\bullet}$ was proposed. Alternatively, a full Bayesian approach can be implemented, taking advantage of prior knowledge on parameters $A_t$ and $p_0$ through Gamma and Dirichlet distributions, respectively, as

$$A_t \mid a, b \sim \mathcal{G}(a, b), \tag{9}$$

$$p_0 \mid \alpha_0 \sim \mathcal{D}_k(\alpha_0), \tag{10}$$

with $\alpha_0 = (\alpha_{0,1}, \ldots, \alpha_{0,k})$. Computer simulations showed that maximum a posteriori estimates of $F_t$ were biased, with biases depending on sample size (data not shown), and could not be simply corrected as was the case for $\hat{F}_{MD(ML)}$. Therefore, Bayes estimates, noted $\hat{F}_{MD(MC)}$ in the following, were derived from the mean of the marginal a posteriori distribution of parameters of interest, $f(F_t \mid m_0, m_t)$, obtained by means of a MCMC approach using a Metropolis-Hastings algorithm (Metropolis et al. 1953; Hastings 1970; and Appendix B).

Model MDL (Dirichlet and allele loss): Combining the model accounting for gene loss and the sampling process in the founder generation ($p_0$ are replaced by $x_0$) did not give satisfactory results. No simple rule was found to obtain unbiased estimates through maximization of likelihood or of a posteriori probability. Extending the MCMC algorithm was postponed for future studies.

Bias correction for F-statistics: As in Williamson and Slatkin (1999), we used the F-statistic $\hat{F}_{NT}$ (the index $t$ is omitted) of Nei and Tajima (1981) corrected for reduced sample size:

$$\hat{F}_{NT} = \frac{1}{k} \sum_{i=1}^{k} \frac{(x_{0,i} - x_{t,i})^2}{x_i (1 - x_i)} - \left[\frac{1}{m_{0,\bullet}} + \frac{1}{m_{t,\bullet}}\right] \quad \text{with} \quad x_i = (x_{0,i} + x_{t,i})/2. \tag{11}$$

We also used Reynolds' F-statistic $\hat{F}_R$ (Reynolds et al. 1983), given by

$$\hat{F}_R = \frac{\sum_{i=1}^{k} (x_{0,i} - x_{t,i})^2}{1 - \sum_{i=1}^{k} x_{0,i}^2}, \tag{12}$$

in which, for generations 0 and $t$, $\sum_i x_i^2$ is replaced by $(\sum_i x_i^2 - (1/m_{\bullet}))/(1 - (1/m_{\bullet}))$ (Nei 1978).

Multilocus estimation and variance prediction for F-statistics: Denoting $\hat{F}_{\ell}$ a single-locus estimate (Equation 11 or 12), multilocus estimates were written as

$$\hat{F} = \frac{\sum_{\ell} (n_{0,\ell} - 1)\, \hat{F}_{\ell}}{\sum_{\ell} (n_{0,\ell} - 1)}, \tag{13}$$

where $n_{0,\ell}$ are the observed numbers of founder alleles at locus $\ell$ (Pollak 1983). Weighting by $n_{0,\ell} - 1$ allows heterogeneity of information between markers to be taken into account in minimizing the variance of the multilocus estimation assuming statistical independence between loci. The predictions of the standard error (SE) of Nei and Tajima's (1981) distance given in Pollak (1983), Barker et al. (1998), and Foulley and Hill (1999) and of the standard error of Reynolds et al.'s (1983) distance given in Laval et al. (2002) lead to an approximated multilocus standard error of (13) equal to

$$SE(\hat{F}_t) \approx \sqrt{\frac{2}{\sum_{\ell} (n_{0,\ell} - 1)}}\; \left(F_t + \frac{1}{m_{0,\bullet}} + \frac{1}{m_{t,\bullet}}\right). \tag{14}$$
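The F-statistics of Equations 11 to 14 can be sketched as follows. This is an illustrative reading rather than the authors' code: the function names are hypothetical, Equation 11 is taken with the sampling correction subtracted from the mean over alleles, and in the Reynolds statistic the correction of Nei (1978) is applied to the sums of squares of both generations after expanding the numerator, which is one interpretation of the sentence following Equation 12.

```python
def f_nei_tajima(m0, mt):
    """Single-locus estimate following Equation 11 (counts per allele)."""
    n0, nt, k = sum(m0), sum(mt), len(m0)
    total = 0.0
    for c0, ct in zip(m0, mt):
        x0, xt = c0 / n0, ct / nt
        x_bar = (x0 + xt) / 2                # must lie strictly between 0 and 1
        total += (x0 - xt) ** 2 / (x_bar * (1 - x_bar))
    return total / k - 1 / n0 - 1 / nt


def f_reynolds(m0, mt):
    """Single-locus estimate following Equation 12 with corrected sums of squares."""
    n0, nt = sum(m0), sum(mt)
    x0 = [c / n0 for c in m0]
    xt = [c / nt for c in mt]

    def corrected_ssq(x, n):                 # (sum x^2 - 1/m) / (1 - 1/m)
        return (sum(v * v for v in x) - 1 / n) / (1 - 1 / n)

    s0, st = corrected_ssq(x0, n0), corrected_ssq(xt, nt)
    cross = sum(a * b for a, b in zip(x0, xt))
    return (s0 + st - 2 * cross) / (1 - s0)


def multilocus_f(estimates, n_alleles):
    """Equation 13: weight each locus by its number of founder alleles minus one."""
    weights = [n - 1 for n in n_alleles]
    return sum(w * f for w, f in zip(weights, estimates)) / sum(weights)


def predicted_se(f, n_alleles, m0_total, mt_total):
    """Equation 14: approximate standard error of the multilocus estimate."""
    scale = (2 / sum(n - 1 for n in n_alleles)) ** 0.5
    return scale * (f + 1 / m0_total + 1 / mt_total)


# Hypothetical counts for two loci with two alleles each, 50 genes per sample.
samples = [((30, 20), (22, 28)), ((10, 40), (18, 32))]
per_locus = [f_nei_tajima(m0, mt) for m0, mt in samples]
print(multilocus_f(per_locus, [2, 2]), predicted_se(0.05, [2, 2], 50, 50))
```

With identical founder and final samples both single-locus statistics return $-(1/m_{0,\bullet} + 1/m_{t,\bullet})$, which shows the size of the sampling correction.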

SIMULATION PROCEDURE

Twenty genetically independent loci were always considered in the simulations. A simulated population of size $N = 500$, with allele frequencies $p_{00,i}$ (for $i = 1, \ldots, k$), was initially considered, and a pure genetic drift process was simulated 1000 times through five generations. This process generates 1000 quasi-independent populations used as starting points for simulation runs.

To provide inbreeding coefficient values in the interval [0.002, 0.008], each one of these 1000 populations, described by its founder frequencies $p_0$, was submitted to a further pure genetic drift process, with constant diploid effective sizes set to 500 individuals during 8 nonoverlapping generations. Samples of $m_{t,\bullet} = 50$ alleles were drawn (sampling of 25 diploids) in a multinomial way every two generations. To provide inbreeding coefficient values in the interval [0.01, 0.1], the same process was applied to a population of 100 individuals evolving during 22 generations.

Two kinds of sampling at the stage of the founder population were considered: (i) exhaustive sampling, where the founder sample size is made up of the complete founder population, so that the true founder allele frequencies $p_{0,i}$ are known, and (ii) finite sampling, where the founder sample size $m_{0,\bullet}$ was set to 50 alleles (sampling of 25 diploids).

In "Uniform founder frequencies," three sets of 1000 simulations, in which allele frequencies of the initial population were set to $p_{00,i} = 1/k$, were performed with $k = 2$, $k = 4$, and $k = 8$ alleles, respectively. In "Biochemical and microsatellite founder frequencies," two sets of 1000 simulations were used, in which allele frequencies $p_{00,i}$ in the initial populations were set to biochemical and microsatellite marker frequencies published in Kantanen et al. (1999) and in Laval et al. (2000), respectively.

RESULTS
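The simulated data sets on which these results are based can be generated along the following lines. This is a minimal sketch of the protocol described in the simulation procedure (five preliminary generations at $N = 500$, then $t$ generations of pure drift, then multinomial sampling), using only the Python standard library. Function names and the seed are assumptions, and only exhaustive founder sampling is shown.

```python
import random


def multinomial(n, probs, rng):
    """Draw multinomial counts for n genes with the given allele probabilities."""
    counts = [0] * len(probs)
    for i in rng.choices(range(len(probs)), weights=probs, k=n):
        counts[i] += 1
    return counts


def drift(p, n_eff, generations, rng):
    """Wright-Fisher drift: resample 2N genes at each generation."""
    for _ in range(generations):
        counts = multinomial(2 * n_eff, p, rng)
        p = [c / (2 * n_eff) for c in counts]
    return p


def simulate_locus(p00, n_eff, t, sample_genes, rng):
    """Return (founder frequencies p0, sample counts m_t) for one locus."""
    p0 = drift(p00, 500, 5, rng)          # five preliminary generations at N = 500
    p_t = drift(p0, n_eff, t, rng)        # t generations of drift at size n_eff
    return p0, multinomial(sample_genes, p_t, rng)


rng = random.Random(2003)                 # arbitrary seed, for reproducibility only
loci = [simulate_locus([0.25] * 4, 100, 10, 50, rng) for _ in range(20)]
print(loci[0])
```

Finite founder sampling, case (ii) of the protocol, would add one more call such as `multinomial(50, p0, rng)` to obtain the founder counts.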

Performances of the estimates were compared using the bias, $B_t$ (or the relative bias, $B_t/F_t$, with $F_t$ the true value of the inbreeding coefficient), and the standard error, $SE_t$, computed over $S$ simulations (unless explicitly indicated otherwise, $S$ was always set to 1000). The global accuracy of estimations was established on the basis of the square root of the MSE ($\sqrt{MSE_t} = \sqrt{B_t^2 + SE_t^2}$), a criterion combining bias and variance, which is equal to the standard error when the estimation is unbiased. Approximate confidence intervals of relative biases and standard errors were computed using formulas that are valid for normal distributions of estimates. They are indicated in the figures when relative biases are significantly different from zero (zero outside the 95% confidence interval, $P < 0.05$) and when differences between standard errors were significant (no overlapping of the 95% confidence intervals, $P < 0.05$).

Uniform founder frequencies

Known founder frequencies (Figures 2A, 3A, 4A, and 5A): Square root of the mean squared error (Figure 2A): With two founder alleles per locus (top curves), $\hat{F}_{NT}$ gives the least accurate estimations ($\sqrt{MSE}$ values are the highest) whatever the level of drift. With four founder alleles (middle curves) and eight founder alleles (bottom curves) per locus, the two likelihood methods $\hat{F}_{MD(ML)}$ (Dirichlet without gene loss) and $\hat{F}_{MDL(ML)}$ (Dirichlet with gene loss) diverge from each other for $F$ above approximately 0.07 in the first case and $F > 0.04$ in the second case. $\hat{F}_{MD(ML)}$ shows a large loss of accuracy with eight alleles whereas $\hat{F}_{MDL(ML)}$ always gives the best estimations for any $F_t$ value (with results significantly better for $F > 0.08$) and a small to medium number of alleles.

Relative bias (Figure 3A): As expected from its definition, $\hat{F}_R$ shows no significant bias as is illustrated for eight alleles (Figure 3A) and for fewer alleles (data not shown). In contrast to this result, other measures show biases that are dependent on polymorphism and on the amount of drift.
$\hat{F}_{NT}$ shows significant positive bias with two alleles (data not shown); with eight alleles $\hat{F}_{NT}$, and also $\hat{F}_{MDL(ML)}$, exhibit small but significant negative biases for $F > 0.07$ (confidence intervals in Figure 3A). However, biases remain small. Therefore, except for $\hat{F}_{MD(ML)}$ computed with eight alleles per locus, the global accuracy of every method is determined mainly by the standard errors of the estimations ($\sqrt{MSE} \approx SE$) as long as the number of loci is not too large ($< 50$).

Standard error (Figures 4A and 5A): The prediction given by Equation 14 provides a conservative basis for choosing the number and the polymorphism of markers needed to achieve a given level of precision as defined by the variance. Indeed it is apparent that the different methods considered here are characterized by a variance equal to or less than that predicted by the formula. In every situation encountered (four alleles, data not shown, and eight alleles, Figure 5A) $\hat{F}_{MDL(ML)}$ proves to be the best with respect to the criterion of minimal variance, with the difference with the theoretical reference (Equation 14) always significant for $F > 0.07$. For eight alleles (Figure 5A) $\hat{F}_{NT}$ performs nearly as well but is slightly less accurate [differences between $\hat{F}_{NT}$ and $\hat{F}_{MDL(ML)}$ are significant].

Figure 2. Square root of mean square errors of inbreeding estimations: two, four, and eight founder alleles per locus are shown. The estimations of $F_t \le 0.01$ (respectively $F_t \ge 0.01$) were computed over 20 loci and 1000 replications performed with populations of $N = 500$, $t$ varying from 2 to 8 (respectively $N = 100$, $t$ varying from 2 to 22). Every sample size was equal to 25 individuals. In A, the true founder frequencies are known. In B, the founder sample size was set to 25 individuals. In both A and B, dotted lines show the results obtained with uniform distributions involving four founder alleles. Solid lines show the results obtained with uniform distributions involving two (top curves) and eight (bottom curves) founder alleles.
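The accuracy criteria used in this section (bias, relative bias, standard error, and $\sqrt{MSE}$) can be computed from a set of replicate estimates as in the sketch below. The function name is hypothetical, and the use of the $S - 1$ denominator for the variance is an assumption, since the text does not specify it.

```python
from math import sqrt


def accuracy(estimates, true_value):
    """Bias, relative bias, standard error and root MSE over S replicates."""
    s = len(estimates)
    mean = sum(estimates) / s
    bias = mean - true_value
    variance = sum((e - mean) ** 2 for e in estimates) / (s - 1)
    se = sqrt(variance)
    return {
        "bias": bias,
        "relative_bias": bias / true_value,
        "se": se,
        "rmse": sqrt(bias ** 2 + se ** 2),   # sqrt(MSE) = sqrt(B^2 + SE^2)
    }


# Hypothetical replicate estimates of a true F_t of 0.10.
summary = accuracy([0.08, 0.10, 0.12], 0.10)
print(summary)
```

When the bias is zero the last entry reduces to the standard error, as stated in the definition of the criterion.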


Figure 3. Relative biases of inbreeding estimations: eight founder alleles per locus. For the replication conditions refer to the Figure 2 legend. Confidence intervals are shown when the bias is significantly different from 0. In this case all methods except $\hat{F}_R$ show biases significantly different from 0 for almost every value of $F_t$.

Figure 4. Standard errors of inbreeding estimations: two founder alleles per locus. For the replication conditions refer to the Figure 2 legend. The straight line is the expected value of standard deviation from Equation 14. Confidence intervals of SE are shown when differences between methods are significant. In A the differences between methods are significant for $F_t$ values $> 0.08$ [except between $\hat{F}_R$ and $\hat{F}_{MD(ML)}$]. In B $\hat{F}_{MD(corrML)}$ exhibits standard errors significantly lower than the standard errors of the other methods for every value of $F_t \ge 0.01$.

Estimated founder frequencies (Figures 2B, 3B, 4B, and 5B): As mentioned above, taking account of the sampling process within the founder generation needs additional statistical treatments when the Dirichlet approximation and loss of gene probabilities are combined. Replacing $p_{0,i}$ by $x_{0,i}$ provides biased estimates, with biases inversely proportional to the founder sample size. The results given by $\hat{F}_{MDL(ML)}$ are not shown.

Square root of the mean square error (Figure 2B): For two alleles (top curves) a gain in accuracy was found with the corrected likelihood estimate $\hat{F}_{MD(corrML)}$. Beyond 7% drift, $\hat{F}_R$ is slightly better than $\hat{F}_{NT}$. For four alleles, results are nearly identical for all the compared methods. With eight alleles $\hat{F}_{NT}$ turns out to be better than the others.

Relative bias (Figure 3B): $\hat{F}_R$ is always unbiased (biases were never statistically different from zero under the simulation conditions used in this article). Both $\hat{F}_{NT}$ and $\hat{F}_{MD(corrML)}$ measures show a bias that depends on the allelic distribution, changing in sign with the number of alleles and the true value of $F_t$. In the range of $F_t$ values considered, and discarding very low values, the relative bias remains smaller than approximately 10%. However, the global accuracy of methods is still due mainly to the standard error of estimation [$\sqrt{MSE} \approx SE$, except for $\hat{F}_{MD(corrML)}$ computed with eight alleles per locus and $F_t > 0.07$].

Standard error (Figures 4B and 5B): The best method is $\hat{F}_{MD(corrML)}$ when the number of alleles is reduced (two alleles, Figure 4B). With eight alleles per locus (Figure 5B) $\hat{F}_{NT}$ is the best method with a reduction in standard error of up to 25% for an $F_t$ value of approximately 0.10.

The results presented in this section highlight the main drawback of the Dirichlet approximation. Biases depending on the founder polymorphism of markers and on the $F_t$ values appear even for a small amount of drift ($F_t \le 0.1$). This problem can be avoided when allele loss probabilities are taken into account. These results suggest that this new model should be used to develop further estimates, which will be more accurate when highly polymorphic markers are used. However, $\hat{F}_{MD(corrML)}$ performs well with markers exhibiting two alleles. This method should, then, be recommended when markers such as single-nucleotide polymorphisms (SNPs) are used. This method, as well as the MCMC algorithm, was tested with two data sets generated with allele frequencies observed in pig and cattle breeds. Furthermore,


Figure 5. Standard errors of inbreeding estimations: eight founder alleles per locus. For the replication conditions refer to the Figure 2 legend. The straight line is the expected value of standard deviation from Equation 14. Confidence intervals of SE are shown when differences between methods are significant. In A $\hat{F}_R$ and $\hat{F}_{MD(ML)}$ exhibit the same results. The differences between $\hat{F}_R$ and $\hat{F}_{MDL(ML)}$ are significant for $F_t$ values $> 0.03$. The differences between $\hat{F}_{NT}$ and $\hat{F}_{MDL(ML)}$ are significant for $F_t > 0.07$. In B $\hat{F}_{NT}$ exhibits standard errors significantly lower than the standard errors of the other methods for $F_t > 0.05$.

the accuracy of the estimations of $N$ obtained with the corrected maximum-likelihood method (estimations of $N$ were derived from the estimations of $F_t$) was determined with these data sets.

Biochemical and microsatellite founder frequencies

In this section the allele frequency distributions include rare founder alleles and different numbers of alleles between markers. The $\sqrt{MSE}$ values in Figures 6 and 7 were computed with 20 biochemical (Kantanen et al. 1999) and 20 microsatellite (Laval et al. 2000) markers exhibiting mean numbers of founder alleles equal to 2.5 (markers with two and three alleles) and 4 (allele numbers ranging from 2 to 6), respectively.

Estimation of $F_t$: To show the relevance of the model combining Dirichlet and allele loss probabilities, we give the results obtained when the founder frequencies are known (Figures 6A and 7A). $\hat{F}_{MDL(ML)}$ is still the best inbreeding estimator in every case (being unbiased and with the smallest standard error).


Figure 6. Square root of mean square errors: biochemical markers. The results were obtained (1000 replications, $N = 500$ and $N = 100$) with distributions of founder frequencies belonging to 20 biochemical markers, exhibiting an average number of founder alleles of 2.5, observed within cattle breeds from Kantanen et al. (1999).

When the founder frequencies of biochemical markers are estimated (Figure 6B), the likelihood approach $\hat{F}_{MD(corrML)}$ and Nei and Tajima's (1981) statistics provide the best inbreeding estimations: $\hat{F}_{MD(corrML)}$ is nearly the best for biochemical markers. As for the replications performed in the previous section, the effect of bias on the global accuracy ($\sqrt{MSE} \approx SE$) is negligible. For microsatellite markers (Figure 7B), the method giving the smallest standard error still provides the most accurate $F$ estimations, in this case $\hat{F}_{NT}$.

The MCMC algorithm requires larger computation times than the likelihood methods. The results are shown for two different values of $F_t$, $F_t = 0.01$ and $F_t = 0.1$ (Table 1), and the number of simulations was limited to $S = 100$. In practice with a real data set the parameters of candidate generating densities were empirically chosen so that the Metropolis-Hastings algorithm accepts 25% of drawn values (Robert 1996). Since this cannot be performed for all the 100 simulations, we determined optimal parameters with a simulated data set randomly chosen, and we kept them for all the 100 simulations. We discarded simulations in which convergence is of doubtful validity (percentage of accepted values $< 5$).
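To make the role of the acceptance rate concrete, the sketch below runs a random-walk Metropolis sampler (a special case of Metropolis-Hastings with a symmetric proposal) on $\log A_t$ for one locus and reports the fraction of accepted proposals. It is a deliberately simplified illustration and not the algorithm of Appendix B: founder frequencies are treated as known rather than sampled, the Gamma prior hyperparameters, starting value, step size and data are hypothetical, and all function names are assumptions.

```python
import math
import random


def log_target(theta, m_t, p0, shape, rate):
    """Unnormalized log posterior of theta = log(A_t) for one locus."""
    a = math.exp(theta)
    n = sum(m_t)
    ll = math.lgamma(a) - math.lgamma(n + a)
    for m, p in zip(m_t, p0):
        ll += math.lgamma(m + a * p) - math.lgamma(a * p)
    # Gamma(shape, rate) prior on A_t plus the Jacobian of the log transform.
    return ll + (shape - 1.0) * theta - rate * a + theta


def metropolis(m_t, p0, n_iter=5000, step=1.0, shape=1.0, rate=0.01, seed=0):
    """Random-walk sampler on log(A_t); returns F draws and acceptance rate."""
    rng = random.Random(seed)
    theta = math.log(10.0)
    current = log_target(theta, m_t, p0, shape, rate)
    draws, accepted = [], 0
    for _ in range(n_iter):
        proposal = theta + rng.gauss(0.0, step)
        candidate = log_target(proposal, m_t, p0, shape, rate)
        if rng.random() < math.exp(min(0.0, candidate - current)):
            theta, current = proposal, candidate
            accepted += 1
        draws.append(1.0 / (1.0 + math.exp(theta)))   # F_t = 1 / (A_t + 1)
    return draws, accepted / n_iter


draws, acceptance = metropolis((40, 10), (0.5, 0.5))
burn_in = len(draws) // 5
posterior_mean = sum(draws[burn_in:]) / (len(draws) - burn_in)
print(f"posterior mean of F: {posterior_mean:.3f}, acceptance: {acceptance:.2f}")
```

In such a sampler the `step` argument plays the role of the candidate-density parameter: increasing it lowers the acceptance rate and decreasing it raises the rate, so it can be tuned toward a target such as 25%.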

1196

G. Laval, M. SanCristobal and C. Chevalet

However, even if $\hat{F}_t$ is unbiased, $\hat{N}$ is biased as

$$E(\hat{N}) \approx \frac{t}{2E(\hat{F}_t)} + \frac{t}{2E(\hat{F}_t)^3}\,\mathrm{Var}(\hat{F}_t). \qquad (16)$$

Equation 14 suggests the following estimate $\hat{N}'$ of $N$:

$$\hat{N}' = \frac{t}{2\left[\hat{F}_t + \dfrac{2}{\hat{F}_t \sum_{\ell} (n_{0,\ell} - 1)}\left(\hat{F}_t + \dfrac{1}{m_{0,\bullet}} + \dfrac{1}{m_{t,\bullet}}\right)^2\right]}. \qquad (17)$$

Figure 7. Square root of mean square errors: microsatellite markers. The simulations were run (1000 replications, $N = 500$ and $N = 100$) with distributions of founder frequencies taken from 20 microsatellite markers, exhibiting an average number of founder alleles of four, observed within pig breeds by Laval et al. (2000).

The performances of these estimates are given in Tables 2 and 3. Their names are given with the same subscript as used for the $\hat{F}$ notation. The results obtained with $\hat{N}_R$ and $\hat{N}_{MD(MC)}$ are not shown since they are less accurate than those obtained with $\hat{N}_{NT}$ and $\hat{N}_{MD(corrML)}$, respectively. In every case, the $\hat{N}'$ estimations are less biased, with a smaller standard error, than the noncorrected estimators $\hat{N}$. The corrected-likelihood method is more accurate than the F-statistics, for every $t$ and for both biases and standard errors. It is more suitable to combine $\hat{N}'$ with the corrected maximum-likelihood approach: in the last row of Table 3 (20 microsatellites), $\hat{N}'_{MD(corrML)}$ gives an unbiased estimation and leads to a decrease of 30% of the standard error of $\hat{N}_{NT}$.

It should be mentioned here that this corrected estimate must be used only when the experimental conditions lead to a small coefficient of variation of the estimation of $F_t$, since Equation 16 is accurate only when $\mathrm{Var}(\hat{F}_t)/E(\hat{F}_t)^2$ is negligible. The highest relative standard error of $F_t$ in Tables 2 and 3 is <0.5. In practice Equation 14 can be used to estimate the relative standard error to decide whether $\hat{N}'$ is relevant.
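The two estimators of $N$ and the relevance check can be sketched in a few lines. This is a hedged illustration, not the authors' program: the function and argument names are hypothetical, `m0` and `mt` stand for the sample-size terms $m_{0,\bullet}$ and $m_{t,\bullet}$ as defined with Equation 14, and the variance expression is the one that appears inside Equation 17, assumed here to be the Equation 14 approximation.

```python
import math


def n_hat(t, f_hat):
    """Equation 15: uncorrected estimate N = t / (2 F_t)."""
    return t / (2.0 * f_hat)


def var_f_hat(f_hat, founder_alleles, m0, mt):
    """Variance term found inside Equation 17 (assumed to be Equation 14)."""
    dof = sum(n - 1 for n in founder_alleles)  # sum over loci of (n_0 - 1)
    return (2.0 / dof) * (f_hat + 1.0 / m0 + 1.0 / mt) ** 2


def n_hat_corrected(t, f_hat, founder_alleles, m0, mt):
    """Equation 17: N' = t / (2 [F_t + Var(F_t) / F_t])."""
    v = var_f_hat(f_hat, founder_alleles, m0, mt)
    return t / (2.0 * (f_hat + v / f_hat))


def coefficient_of_variation(f_hat, founder_alleles, m0, mt):
    """Relative standard error of F_t, used to decide whether N' is relevant."""
    return math.sqrt(var_f_hat(f_hat, founder_alleles, m0, mt)) / f_hat


# Illustrative inputs (not taken from the article): 20 loci with 4 founder
# alleles each, F_t = 0.1 after t = 20 generations, m0 = mt = 50.
alleles = [4] * 20
plain = n_hat(20, 0.1)
corrected = n_hat_corrected(20, 0.1, alleles, 50, 50)
cv = coefficient_of_variation(0.1, alleles, 50, 50)
```

Because the correction divides by $1 + \mathrm{Var}(\hat{F}_t)/\hat{F}_t^2$, the corrected value is always below the uncorrected one, and the two agree when the coefficient of variation is close to zero.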

A REAL DATA SET

The numbers of simulations used to compute accuracy criteria are equal to 85 (biochemical markers) and 91 (microsatellites) for $F_t = 0.01$ and equal to 100 (for both biochemical and microsatellite markers) for $F_t = 0.1$. With biochemical and microsatellite markers, $\hat{F}_{MD(MC)}$ was found to be significantly more accurate than $\hat{F}_{NT}$ and $\hat{F}_{MD(corrML)}$ for $F_t$ values of 0.01 (the first two rows in Table 1). The $\sqrt{MSE}$ is almost halved in every case. For $F_t = 0.1$ the MCMC algorithm does not give the same large gain in accuracy: there are no significant differences among the three methods. The MCMC algorithm allows us to compute the posterior marginal distribution of the parameter of interest, $f(F_t \mid m_0, m_t)$ (Figures 8 and 9; $F_t = 0.01$ and $F_t = 0.1$, respectively).

Estimation of N: These data sets were also used to obtain estimations of the effective population size since $t$ is known (and small) and biases of $F$ estimations are small. For $F_t$ between 0.01 and 0.1 (a simulated population of $N = 100$), estimations of $N$ can be simply obtained from

$$\hat{N} \approx \frac{t}{2\hat{F}_t}. \qquad (15)$$

We used a data set provided by J. F. Arnaud (unpublished results) to illustrate the behavior of the corrected maximum-likelihood method and the MCMC algorithm with a real data set. The population of Helix aspersa (Gastropoda: Helicidae) lives in an intensive agricultural zone located in Brittany (northwestern France), in the polders of the Bay of Mont-Saint-Michel. The snail population was sampled in 1998 and 2000, with 15 and 30 individuals, respectively. The estimations of $F_t$ and $N$ were computed with four microsatellite markers, exhibiting 5, 6, 8, and 12 alleles in the founder generation, respectively. The estimations of $F_t$ obtained with $\hat{F}_{NT}$, $\hat{F}_R$, and $\hat{F}_{MD(corrML)}$ are equal to 0.011, 0.016, and 0.032, respectively. Their coefficients of variation, computed from Equation 14, are equal to 1.715, 1.264, and 1, respectively. With such coefficients of variation $\hat{N}'$ cannot be used. We estimated the effective size from Equation 15, assuming one generation per year (Madec et al. 2000). The estimations of $N$ obtained with Nei and Tajima's (1981), Reynolds et al.'s (1983), and the corrected maximum-likelihood methods are equal to 88, 62, and 31 individuals, respectively.


TABLE 1
Estimation of $F_t = 0.01$ and $F_t = 0.1$ computed with the MCMC algorithm from biochemical and microsatellite markers

Columns 3 to 5: Nei and Tajima's weighted $\hat{F}_{NT}$. Columns 6 to 8: MD corrected $\hat{F}_{MD(corrML)}$. Columns 9 to 11: MCMC $\hat{F}_{MD(MC)}$.

| $F_t$ | Markers | $B/F_t$ | $SE/F_t$ | $\sqrt{MSE}$ | $B/F_t$ | $SE/F_t$ | $\sqrt{MSE}$ | $B/F_t$ | $SE/F_t$ | $\sqrt{MSE}$ |
| 0.01 | Bioch (a) | −0.029 | 1.267 | 0.013 | 0.172 | 1.230 | 0.012 | −0.270 | 0.529 | 0.006 |
| 0.01 | Micro (b) | −0.106 | 0.842 | 0.008 | 0.231 | 0.847 | 0.009 | −0.181 | 0.416 | 0.005 |
| 0.1 | Bioch (a) | −0.022 | 0.326 | 0.034 | 0.024 | 0.334 | 0.035 | −0.112 | 0.340 | 0.037 |
| 0.1 | Micro (b) | −0.088 | 0.225 | 0.025 | 0.088 | 0.256 | 0.028 | −0.001 | 0.220 | 0.023 |

The estimations of $F_t$ were computed with the simulated data sets presented in Figures 6B and 7B, but in every data set only the first 100 simulations were used. Vague priors $\alpha_0 = 1$, $a = 2$, and $b = 250$ were chosen. For every simulation the total number of iterations was set to 200,000, and only the last 100,000 were used to compute $\hat{F}_{MD(MC)}$. We kept the same parameters (0.4, 4, and 25 for $b_u$, $a_g$, and $b_g$, respectively, for $F_t = 0.01$; 0.5, 1, and 10 for $b_u$, $a_g$, and $b_g$, respectively, for $F_t = 0.1$) for all simulations, and we discarded simulations in which convergence is of doubtful validity (percentage of accepted values <5). The number of simulations used to compute accuracy criteria is equal to 85 (biochemical markers) and 91 (microsatellites) for $F_t = 0.01$ and equal to 100 (for both biochemical and microsatellite markers) for $F_t = 0.1$.
(a) Biochemical markers.
(b) Microsatellite markers.
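As a reading aid for Table 1, the three accuracy criteria are linked by the usual decomposition of the mean square error into squared bias plus variance. The short sketch below (the function name `root_mse` is hypothetical) recomputes $\sqrt{MSE}$ from the relative bias and relative standard error of the $F_t = 0.01$ rows. It takes the nominal value $F_t = 0.01$ as the true value, which is an assumption.

```python
import math


def root_mse(f_t, rel_bias, rel_se):
    """sqrt(MSE) = F_t * sqrt((B/F_t)^2 + (SE/F_t)^2)."""
    return f_t * math.sqrt(rel_bias ** 2 + rel_se ** 2)


# (method, markers, B/F_t, SE/F_t, sqrt(MSE) as reported in Table 1)
rows_ft_001 = [
    ("NT", "biochemical", -0.029, 1.267, 0.013),
    ("NT", "microsatellite", -0.106, 0.842, 0.008),
    ("MD(corrML)", "biochemical", 0.172, 1.230, 0.012),
    ("MD(corrML)", "microsatellite", 0.231, 0.847, 0.009),
    ("MD(MC)", "biochemical", -0.270, 0.529, 0.006),
    ("MD(MC)", "microsatellite", -0.181, 0.416, 0.005),
]

recomputed = [round(root_mse(0.01, b, se), 3) for _, _, b, se, _ in rows_ft_001]
reported = [r for _, _, _, _, r in rows_ft_001]
```

The recomputed values also make visible why the MCMC estimate wins at $F_t = 0.01$: its larger relative bias is more than offset by a standard error less than half that of the other two methods.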

The MCMC estimations of $F_t$ and of the coefficient of variation (computed with the standard error obtained from the marginal posterior distribution of $F_t$, Figure 10) are equal to 0.007 and 0.857, respectively.

DISCUSSION

Since many different methods and situations were investigated in this study, a summary table (Table 4) is given to highlight the main results obtained when the founder frequencies are estimated with a limited founder sample size (here 25 individuals), the situation most widely found in an experimental scheme.

Scope of the present work: The scope of this article was limited to a short-term investigation. The comparisons of methods are based on an expected increase in inbreeding of 10% or less, which corresponds to 10 generations for a population of 50 individuals. This seemed a realistic timescale since the validity of the assumption of no mutation may be questionable for longer time intervals. The number of markers used was 20, which is a common value in practical (MacHugh et al. 1997; Moazami-Goudarzi et al. 1997; Laval et al. 2000) and theoretical studies (Berthier et al. 2002). This makes the comparisons of methods made in the present study useful for experiments currently performed.

Increasing the number of markers to get a better estimate of drift reduces the standard error around the expectation but not around the true value, and therefore requires the use of unbiased measures. Bias becomes a significant consideration when the number of alleles is large, and analytical corrections for $\hat{F}_{NT}$ could be derived to avoid this bias. In practice, however, this may not be crucial since, with the number of markers commonly used and the small values of $F_t$ observed, biases remain small in comparison with the standard errors. To illustrate this point: when $F_t = 0.08$, markers exhibit eight alleles, and founder allele frequencies are exactly known, >200 markers are needed for the unbiased $\hat{F}_R$ statistic to outperform Nei and Tajima's (1981) method. Hence working with measures with a low bias may be an advantage when the level of inbreeding is low.

It must be stressed that this conclusion is quite different when we consider the distance between two different breeds. Indeed, the values of distances are often >0.1. Methods such as $\hat{F}_{NT}$ and similar statistics, as well as likelihood estimates, show nonnegligible biases when allele numbers and $F_t$ values are large and therefore cannot be recommended on the basis of this work. Unbiased methods such as Reynolds et al.'s (1983) distance would be preferred in such cases (Laval et al. 2002).

For dominant markers such as randomly amplified polymorphic DNA or amplified fragment length polymorphism, allelic frequencies can be estimated with the square root rule from the frequencies of absence of bands and can be used to estimate $F_t$, keeping in mind that a deviation from the Hardy-Weinberg proportions leads to biased maximum-likelihood estimations of allelic frequencies. Some software packages (e.g., Arlequin; Schneider et al. 2000) provide better estimations of allelic frequencies for dominant markers by using an expectation-maximization algorithm (Excoffier and Slatkin 1995).

Performances of the various estimates of $F_t$ were sensitive to the total number of alleles over loci (Equation 14). Considering biallelic markers, SNPs seem to be full of potential (Vignal et al. 2002) since their low numbers of alleles can be counterbalanced by the high number of SNPs found in the whole genome. As a consequence, the assumption of independence between loci does not hold with a data set involving a high number of loci.
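The timescale quoted under "Scope of the present work" (about 10% inbreeding after 10 generations in a population of 50) can be checked with a small worked example. The sketch assumes the standard pure-drift recursion $F_t = 1 - (1 - 1/(2N))^t$, which is not written out in this section, and compares it with the linear approximation $F_t \approx t/(2N)$ that underlies Equation 15; the function names are hypothetical.

```python
def expected_inbreeding(n_eff, t):
    """Pure-drift inbreeding after t generations: 1 - (1 - 1/(2N))^t."""
    return 1.0 - (1.0 - 1.0 / (2.0 * n_eff)) ** t


def linear_inbreeding(n_eff, t):
    """Short-term approximation F_t = t / (2N) used to estimate N."""
    return t / (2.0 * n_eff)


exact = expected_inbreeding(50, 10)    # about 0.0956
approx = linear_inbreeding(50, 10)     # 0.1
relative_gap = (approx - exact) / exact
```

Over this range the linear form overstates $F_t$ by a few percent, and the gap widens as $t/(2N)$ grows, which is one reason the short-term restriction matters.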


Figure 8. Histogram and kernel density estimate of MCMC drawings in the posterior distribution of the inbreeding coefficient, for $F_t = 0.01$. We kept the same parameters as in Table 1. A and B were computed with one simulation involving 20 biochemical markers and one simulation involving 20 microsatellite markers, respectively. For biochemical markers the mean and the standard error are equal to 0.0062 and 0.0046, respectively, and the percentage of accepted values of the $A_t$ parameter by the Metropolis-Hastings algorithm is equal to 19%. For microsatellite markers the mean and the standard error are equal to 0.0065 and 0.0046, respectively, and the percentage of accepted values of the $A_t$ parameter is equal to 17%.

The theoretical prediction (Equation 14), although computed under the assumption of the statistical independence of loci (linkage disequilibrium remains null), seems to be a conservative estimate of the variance of the optimized F-statistics (Equation 13) and of the other measures considered. With a real data set in which some markers are in linkage disequilibrium, the expected standard error cannot be computed easily. Some preliminary simulations have been undertaken with 100 individuals, 22 generations, and 20 highly polymorphic loci (20 founder alleles per locus), with a recombination rate $r$ between them varying from 0.5 to 0.001. Results show that the variance of the $F_R$ estimate is not affected by linkage between loci as long as $r > 0.1$. The standard deviation is increased by 17% if $r = 0.05$ and by 155% if $r = 0.001$ (9 and 104%, respectively, with 4 founder alleles per locus). More work is needed to quantify the influence of nonindependence of loci on the variances of the estimates.

Analytical approximation: We derived an approximate likelihood by using the Dirichlet approximation of the conditional allele frequency distribution. Comparisons between likelihood methods computed when true founder frequencies are known and the F-statistics (unbiased like $\hat{F}_R$ and with a small variance like $\hat{F}_{NT}$) indicate that the MD approximation model is relevant when the polymorphism of markers is low. When factors enhance the probability that alleles may be lost (high polymorphism leading to the occurrence of rare founder alleles, and intermediate and high levels of drift), this simple approximation is no longer relevant. Several levels of approximation were used to circumvent the combinatorial problems raised by the exact coalescent approach when the number of alleles increases. Likelihoods and posterior probabilities were derived by means of a Dirichlet model, and probabilities of fixation and loss of alleles made use of a simple approximation. The latter was previously checked (Chevalet 2000) and should not induce much error in the short-term evolution considered here. The Dirichlet approximation may be more sensitive. The simpler version (model MD) makes use of a single distribution that does not allow for null frequencies of alleles. Indeed, the behavior of the corresponding estimate ($\hat{F}_{MD(ML)}$) observed in Figure 2A shows the deleterious effects of increasing drift or number of alleles, with both effects increasing the probabilities that some alleles may be lost by drift.

Figure 9. Histogram and kernel density estimate of MCMC drawings in the posterior distribution of the inbreeding coefficient, for $F_t = 0.1$. We kept the same parameters as in Table 1. A and B were computed with one simulation involving 20 biochemical markers and one simulation involving 20 microsatellite markers, respectively. For biochemical markers the mean and the standard error are equal to 0.093 and 0.03, respectively, and the percentage of accepted values of the $A_t$ parameter by the Metropolis-Hastings algorithm is equal to 23%. For microsatellite markers the mean and the standard error are equal to 0.093 and 0.023, respectively, and the percentage of accepted values of the $A_t$ parameter is equal to 27%.

TABLE 2
Estimation of $N = 100$ from biochemical markers

Columns 4 to 6: Nei and Tajima weighted, $\hat{N}_{NT}$ (Equation 15). Columns 7 to 9: Nei and Tajima weighted, $\hat{N}'_{NT}$ (Equation 17). Columns 10 to 12: MD corrected, $\hat{N}_{MD(corrML)}$ (Equation 15). Columns 13 to 15: MD corrected, $\hat{N}'_{MD(corrML)}$ (Equation 17).

| $t$ | $F_t$ | $d$ | μ | σ | $d'$ | μ | σ | $d'$ | μ | σ | $d'$ | μ | σ | $d'$ |
| 12 | 0.058 | 2 | 137 | 115 | 1 | 99 | 40 | 0 | 130 | 87 | 2 | 94 | 39 | 0 |
| 14 | 0.067 | 0 | 132 | 103 | 0 | 101 | 41 | 0 | 129 | 101 | 0 | 100 | 40 | 0 |
| 16 | 0.077 | 2 | 126 | 78 | 2 | 103 | 42 | 1 | 123 | 62 | 1 | 101 | 40 | 1 |
| 18 | 0.086 | 0 | 124 | 67 | 0 | 103 | 42 | 0 | 121 | 61 | 0 | 102 | 39 | 0 |
| 20 | 0.095 | 0 | 123 | 65 | 0 | 104 | 42 | 0 | 120 | 57 | 0 | 102 | 38 | 0 |
| 22 | 0.100 | 0 | 119 | 49 | 0 | 103 | 36 | 0 | 116 | 44 | 0 | 101 | 33 | 0 |

The estimations of effective sizes were computed with the simulated data set presented in Figure 6B but restricted to the simulated evolution of the population of 100 individuals. μ is the mean and σ is the standard error. $t$ is the number of generations from the founder generation. If at least one method provided values $> 50 \times N = 5000$ (Williamson and Slatkin 1999) or negative values, the simulation was discarded for the corresponding $t$ ($d$ and $d'$ are, respectively, the total and per method numbers of discarded runs). In addition, when the number of discarded runs exceeded 10 (1%), the entire row for this number of generations was removed from the table. The names of the $N$ estimates are given by the same subscripts as used for the $F$ notations.

Combining probabilities of losses and a mixture of distributions in model MDL allowed the best estimates to be obtained when using initial known frequencies (Figure 2A). The advantage over other methods seems to be uniform over all considered cases, and the most pronounced improvement concerns the variance of $F$ estimates, although the gain seems to be less for a large number of alleles. The latter observation suggests that the combined approximation becomes less accurate for more than eight alleles per locus. It may be suggested that the choice of the Dirichlet distributions used for transient allele frequencies (appendix A) is the most sensitive step since, considering the first two moments of distributions, it can be proved that it is not possible to equate the true mixture (Equation 5) to the mixture of Dirichlet distributions that yields Equation 8. Other parameter adjustments, such as using exact conditional expectations (appendix A) rather than the simplified expression of Equation 7, and adjusting the dispersion parameters ($A$) for each transient state $S$, might give a better account of drift for mid- or long-term processes. At the same time this should avoid the combinatorial problem driven by a large number of alleles in the exact coalescence.

TABLE 3
Estimation of $N = 100$ from microsatellite markers

Columns 4 to 6: Nei and Tajima weighted, $\hat{N}_{NT}$ (Equation 15). Columns 7 to 9: Nei and Tajima weighted, $\hat{N}'_{NT}$ (Equation 17). Columns 10 to 12: MD corrected, $\hat{N}_{MD(corrML)}$ (Equation 15). Columns 13 to 15: MD corrected, $\hat{N}'_{MD(corrML)}$ (Equation 17).

| $t$ | $F_t$ | $d$ | μ | σ | $d'$ | μ | σ | $d'$ | μ | σ | $d'$ | μ | σ | $d'$ |
| 8 | 0.040 | 2 | 135 | 84 | 2 | 107 | 38 | 0 | 113 | 59 | 0 | 95 | 34 | 0 |
| 10 | 0.049 | 0 | 129 | 84 | 0 | 108 | 37 | 0 | 108 | 44 | 0 | 96 | 32 | 0 |
| 12 | 0.058 | 0 | 128 | 73 | 0 | 111 | 38 | 0 | 108 | 42 | 0 | 97 | 32 | 0 |
| 14 | 0.067 | 0 | 123 | 48 | 0 | 114 | 35 | 0 | 106 | 38 | 0 | 97 | 30 | 0 |
| 16 | 0.077 | 0 | 125 | 47 | 0 | 114 | 36 | 0 | 106 | 34 | 0 | 98 | 29 | 0 |
| 18 | 0.086 | 0 | 123 | 38 | 0 | 114 | 32 | 0 | 104 | 30 | 0 | 97 | 27 | 0 |
| 20 | 0.095 | 0 | 123 | 38 | 0 | 115 | 32 | 0 | 103 | 30 | 0 | 97 | 27 | 0 |
| 22 | 0.100 | 0 | 124 | 35 | 0 | 115 | 30 | 0 | 103 | 27 | 0 | 97 | 24 | 0 |

The estimations of effective sizes were computed with the simulated data set presented in Figure 7B but restricted to the simulated evolution of the population of 100 individuals; same legend as in Table 2.

Figure 10. Histogram and kernel density estimate of MCMC drawings in the posterior distribution of the inbreeding coefficient, for a French snail population. The parameters of the prior distribution are equal to those presented in Table 1. The parameters of the candidate generating densities ($b_u = 0.4$, $a_g = 8$, and $b_g = 20$) and the number of replicates (500,000 with a burn-in period of 150,000 replicates) were empirically chosen to give an optimal convergence of the chains of the $A_t$ and $p_{0,i}$; 25% (respectively 10%) of the sampled values of the $A_t$ parameter (respectively the founder frequencies $p_{0,i}$) are accepted by the Metropolis algorithm (Robert 1996). Because of the small number of markers (four markers), the convergence of chains needs large numbers of replicates to be completed (the number of replicates used here is significantly higher than those used with the simulated data sets; Table 1). The mean and the standard error are equal to 0.007 and 0.006, respectively.

TABLE 4
Advantage of each method as a function of the polymorphism of markers used and with moderate founder sample sizes

| Method | Low polymorphism (SNPs or biochemical markers) | High polymorphism (microsatellites or haplotypes) |
| $\hat{F}_{MD(corrML)}$ | + | = (small $F_t$); − (high $F_t$) |
| $\hat{F}_{MD(MC)}$ | + | + (small $F_t$); = (high $F_t$) |
| $\hat{N}_{MD(corrML)}$ | + | + |
| $\hat{N}'_{NT}$ (a) | + | + |
| $\hat{N}'_{MD(corrML)}$ (a) | + | + |

−, less accurate than the best F-statistic; =, as accurate as the best F-statistic; +, more accurate than the best F-statistic.
(a) The $\hat{N}'$ estimator can be used when the coefficient of variation of the estimation of $F_t$ is <1.

Initial sampling: Precision in the estimation of founder allele frequencies is key to accurately estimating the amount of drift. For example, the best estimate obtained with exact biochemical founder frequencies [Figure 6A, $\hat{F}_{MDL(ML)}$ estimate] shows the same mean square error as the best estimate possible with microsatellite markers sampled in the founder generation (Figure 7B, $\hat{F}_{NT}$ estimate). As mentioned above, in experimental schemes applied to domestic breeds all founder animals are known and can be sampled. In this special case, we have shown that it is possible to derive a valuable approximation of the drift process, which allows improvements of $F_t$ estimation to be obtained with both biochemical and microsatellite markers. The gain is significant for intermediate values of $F_t$ (in the range of 5–10%) and if the mean number of alleles is not too large. From a practical point of view concerning natural populations, the methods based on this MDL model might be used when the founder sample is large (up to 100 individuals). The bias of the maximum-likelihood estimates, which is inversely proportional to the founder sample size (data not shown), tends to be small relative to the standard error.

For small founder sample sizes no method was found to be consistently better than the others over all the situations tested. Introducing the allele loss probabilities in the Dirichlet model is hardly tractable in this case and we kept this for future work. However, the methods based on the Dirichlet model without allele loss greatly improve the estimation of the amount of drift in several interesting situations. The corrected maximum-likelihood method ($\hat{F}_{MD(corrML)}$) and the MCMC algorithm should be preferred when markers of low polymorphism are used, a situation that may be of importance with the advent of SNP markers.
Using estimations of founder frequencies can lead to biased $F_t$ estimations with the maximum-likelihood method based on the MD model, but we have shown that this problem can be solved simply with a heuristic correction. This bias correction gives more accurate estimations than the F-statistics give, without changing the standard error of the maximum-likelihood estimate. In contrast, the MCMC algorithm greatly reduces the standard error of the maximum-likelihood estimate. Using this algorithm, which performs a numerical integration of the nuisance parameters (here $p_0$), is the most relevant when a large part of this standard error is due to the sampling in the founder generation, as was the case for small $F_t$. We can deduce from Equation 14 that the sampling process is largely responsible for the decrease of the accuracy of estimations when $F_t$ is small: the relative standard error $SE_t/F_t$ becomes large when $F_t$ tends to 0 [the part depending on $1/m_{0,\bullet}$ is inversely proportional to $F_t$].

The analysis of the French snail population shows that the MCMC algorithm can be applied to a real data set. The Markov chain converges well and the estimation of $F_t$ is of the order of magnitude of the estimations given by the other approaches, suggesting that the results given by the MCMC algorithm are consistent. The MCMC algorithm presented here is based on model MD and is therefore affected by factors that enhance the probability that alleles are lost: the presence of rare alleles in the founder sample, highly polymorphic markers, and values of inbreeding $\approx 0.1$. For this reason the MCMC algorithm does not bring significant improvement with this range of inbreeding values. The simulations made with the known founder frequencies show that the standard error of $\hat{F}_{MDL(ML)}$ is always smaller than the standard error of $\hat{F}_{MD(ML)}$. This suggests that efforts should be made to implement an MCMC algorithm based on the model MDL, although analysis may be difficult and substantial computation times may be required to analyze large multiple data sets. Nevertheless, since for $F_t = 0.1$ the algorithm based on the model MD remains as accurate as the best F-statistics, it can be used even with polymorphic markers (haplotypes or microsatellites). Moreover, this algorithm gives the posterior distribution of all the parameters and allows us to compute statistics, such as highest posterior probability intervals, directly from the data rather than from approximate calculations resting on unverified assumptions.

Simulated tests of departure from drift: As a by-product, this work will allow the distributions of the $F_t$ estimates to be computed for one locus, considering the pure genetic drift model as the null hypothesis. This provides a way to test for deviations from drift, using the comparison between the value of $F_t$ estimated from allele frequency changes and the distribution of the $F_t$ estimates obtained from computer simulations (assuming panmixia or using pedigree information when it is available). Such simulation-based tests may be helpful to detect parts of the genome that are under the influence of selection or of any departure from drift.
As discussed in Williamson and Slatkin (1999) and Anderson et al. (2000), methods based on the complete enumeration of all possible genetic states of a population (hidden Markov chain as well as coalescent approaches; Berthier et al. 2002) require intensive computation times. Although these approaches are valuable, they are difficult to apply with multiple simulated data sets: ~10,000–100,000 simulations are needed to get the null distribution used to detect a marker under selection. Using analytical approximations allows substantial cuts in computation times. The corrected-likelihood method might easily be used in such simulation-based tests with markers expected to be widely spread over the whole genome, like the SNP markers. Moreover, with a single marker exhibiting two founder alleles (a common SNP) and $F_t$ values >0.06, the standard error of $\hat{F}_{MD(corrML)}$ is almost one-half that of $\hat{F}_R$ (data not shown), which is the most accurate F-statistic in this situation (Figure 2B).

This work also showed that the MCMC algorithm can be used to analyze multiple data sets obtained by way of simulation. The consensus values (identical for all the simulations performed) of the parameters of the candidate generating densities can be empirically determined to obtain optimal convergence of the MCMC algorithm for most simulations. Although the computation time required for each simulation is reasonable (e.g., 5 min with 20 microsatellites on a Unix Sun operating system with a processor of 480 MHz), this is still a limitation with ~10,000–100,000 simulations. Nevertheless, computation times might be improved with multiprocessing computations using more powerful processors.

Estimation of effective sizes: To compare the performances of our estimates to the maximum-likelihood method developed by Williamson and Slatkin (1999), we performed 1000 simulations, keeping the same parameters as they used in the third line of their Table 1, i.e., a population of 25 individuals evolving during four generations in which samples of 50 individuals were drawn in the founder and the fourth generation, respectively. Estimations were computed with 15 diallelic markers with founder frequencies uniformly distributed. The mean and the standard error of $2 \times \hat{N}_{NT}$ (which is the haploid size of the population as defined in their article) between the founder and the fourth generation are equal to 64 and 44, respectively, while they are equal to 68 and 43, respectively, in Table 1 of Williamson and Slatkin (1999). Since our programs return similar results for Nei and Tajima's (1981) statistic, a comparison can be made. The gain in accuracy obtained with the maximum-likelihood method of Williamson and Slatkin is interesting: the mean and the standard error are equal to 62 and 35, respectively. Our corrected maximum-likelihood estimate computed from Equation 15 [$\hat{N}_{MD(corrML)}$] seems to be as upward biased, with a smaller standard error: 54 and 26 for mean and standard error, respectively.
Whatever the method, the estimations computed from Equation 17 exhibit smaller standard errors. The mean and standard error of the $2 \times \hat{N}'_{NT}$ estimate are equal to 48 and 25, respectively. The $2 \times \hat{N}'_{MD(corrML)}$ estimate gives a mean and a standard error equal to 52 and 25, respectively. These results suggest that it would be advantageous to combine the methods recently proposed (Williamson and Slatkin 1999; Berthier et al. 2002) with the $\hat{N}'$ estimate. A comparative study should be made to confirm this.

Other related situations: It should be noted that all the likelihood and Bayesian (MCMC algorithm) methods presented in this article can be extended to include several sampling regimes. For example, a situation that occurs in practice is where animals are sampled and genotyped at several points in time, as described in Anderson et al. (2000). Data can be treated time interval by time interval, or a single likelihood may be written, taking the whole process into account,

$$p_t \mid p_{t-1}, F_{t-1,t} \sim \mathcal{D}\!\left(\left(\frac{1}{F_{t-1,t}} - 1\right) p_{t-1}\right), \quad t = 1, \ldots, T \qquad (18)$$

or

$$p_t \mid p_0, F_{0,t} \sim \mathcal{D}\!\left(\left(\frac{1}{F_{0,t}} - 1\right) p_0\right), \quad t = 1, \ldots, T, \qquad (19)$$

where $F_{t_1,t_2}$ is the variation in the inbreeding coefficient between times $t_1$ and $t_2$ and $T + 1$ is the total number of generations. The evolution of the population effective size can be drawn with such an analysis. However, if only the total variation $F_{0,T}$ in the inbreeding coefficient is of interest, then the addition of numerous nuisance parameters (the allele frequencies $p_t$ and the $F_{t,t+1}$ for $t = 1, \ldots, T - 1$) will probably add unnecessary noise in the statistical analysis.

The cases when animals are sampled at a single point in time do not fit in the framework of the present longitudinal study, but the methods presented could be used if some knowledge of an ancestral population is available. In the case of a single population that is meant to originate from a large founder wild population, an estimate of its mean (harmonic) effective size may be proposed under some hypotheses about the number of generations and about the initial frequency distribution of alleles. If several populations are sampled, as in genetic biodiversity studies, the computation of an appropriate genetic distance between pairs of populations (Reynolds et al. 1983, say) gives an estimate of the average variation in the inbreeding coefficient $F = (F_1 + F_2)/2$, where $F_i = t/(2N_i)$ is the variation of the inbreeding coefficient in population $i$ since divergence, assuming that divergence is due to genetic drift only, and $N_i$ is its effective size ($i = 1, 2$). Then an MCMC algorithm based on the model MD can be implemented, allowing us to estimate every $F_i$. Effective population sizes can be estimated if divergence time is known. If $t$ is unknown, the multilocus estimation of every $F_i$ gives a starting point to simulate the drift distribution of estimates for every locus, since these distributions depend on $t/2N_i$. These simulated distributions will allow us to test the deviation from the null assumption of drift and thus localize every part of the genome under selection.
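The Dirichlet transition written in Equations 18 and 19 can be simulated directly, which gives a quick check of what the model implies. The sketch below draws $p_t$ from a Dirichlet distribution with parameters $(1/F - 1)\,p_0$ using gamma variates from the Python standard library; under this model each frequency has mean $p_{0,i}$ and variance $F\,p_{0,i}(1 - p_{0,i})$. The function name and the example frequencies are hypothetical, and every founder frequency must be strictly positive, in line with the remark that model MD does not allow for null frequencies.

```python
import random


def draw_allele_frequencies(p0, f, rng):
    """One draw of p_t from Dirichlet((1/F - 1) * p0), as in Equation 19."""
    concentration = 1.0 / f - 1.0
    # A Dirichlet vector is a set of independent gamma variates, normalized.
    gammas = [rng.gammavariate(concentration * p, 1.0) for p in p0]
    total = sum(gammas)
    return [g / total for g in gammas]


rng = random.Random(0)
p0 = [0.5, 0.3, 0.2]   # illustrative founder frequencies
f = 0.1                # illustrative increase in inbreeding
draws = [draw_allele_frequencies(p0, f, rng) for _ in range(20000)]

first = [d[0] for d in draws]
mean_first = sum(first) / len(first)
var_first = sum((x - mean_first) ** 2 for x in first) / (len(first) - 1)
expected_var = f * p0[0] * (1.0 - p0[0])   # F * p0 * (1 - p0) = 0.025
```

Chaining such draws generation by generation, with $p_{t-1}$ in place of $p_0$, would correspond to the interval-by-interval form of Equation 18.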
We thank Jean-Francois Arnaud for the data set on a snail population; Jean-Marie Cornuet, John William James, and Miguel Perez-Enciso for motivating remarks; Grant Hamilton for the English revision; and two anonymous referees for improvements of the manuscript.

LITERATURE CITED

Anderson, E. C., E. G. Williamson and E. A. Thompson, 2000 Monte Carlo evaluation of the likelihood for Ne from temporally spaced samples. Genetics 156: 2109–2118.
Balding, D. J., and R. A. Nichols, 1995 A method for quantifying differentiation between populations at multi-allelic loci and its implications for investigating identity and paternity. Genetica 96: 3–12.
Barker, J. S. F., W. G. Hill, D. Bradley, M. Nei, R. Fries et al., 1998 Measurement of Domestic Animal Diversity (MoDAD): Original Working Group Report. FAO, Rome.
Beaumont, M. A., 1999 Detecting population expansion and decline using microsatellites. Genetics 153: 2013–2029.
Berthier, P., M. A. Beaumont, J.-M. Cornuet and G. Luikart, 2002 Likelihood-based estimation of the effective population size using temporal changes in allele frequencies: a genealogical approach. Genetics 160: 741–751.
Chevalet, C., 2000 The number of lines of descent and fixation probabilities of alleles in the pure genetic drift process: analytical approximations. Theor. Popul. Biol. 57: 167–175.
Excoffier, L., and M. Slatkin, 1995 Maximum-likelihood estimation of molecular haplotype frequencies in a diploid population. Mol. Biol. Evol. 12: 921–927.
Ewens, W. J., 1972 The sampling theory of selectively neutral alleles. Theor. Popul. Biol. 3: 87–112.
Foulley, J. L., and W. G. Hill, 1999 On the precision of estimation of genetic distance. Genet. Sel. Evol. 31: 457–464.
Hastings, W. K., 1970 Monte Carlo sampling methods using Markov chains and their applications. Biometrika 57: 97–109.
Holsinger, K. E., 1999 Analysis of genetic diversity in geographically structured populations: a Bayesian perspective. Hereditas 130: 245–255.
Kantanen, J., I. Olsaker, S. Adalsteinsson, K. Sandberg, E. Eythorsdottir et al., 1999 Temporal changes in genetic variation of north European cattle breeds. Anim. Genet. 30: 16–27.
Kimura, M., 1955 Process of random genetic drift with a continuous model. Proc. Natl. Acad. Sci. USA 41: 144–150.
Kitada, S., T. Hayashi and H. Kishino, 2000 Empirical Bayes procedure for estimating genetic distance between populations and effective population size. Genetics 156: 2063–2079.
Krimbas, C. B., and S. Tsakas, 1971 The genetics of Dacus oleae. V. Changes of esterase polymorphism in a natural population following insecticide control: selection or drift? Evolution 25: 454–460.
Laval, G., N. Iannuccelli, C. Legault, D. Milan, M. A. M. Groenen et al., 2000 Genetic diversity of eleven European pig breeds. Genet. Sel. Evol. 32: 187–203.
Laval, G., M. SanCristobal and C. Chevalet, 2002 Measuring genetic distances between breeds: use of some distances in various short term evolution models. Genet. Sel. Evol. 34: 481–507.
MacHugh, D. E., M. D. Shriver, R. T. Loftus, P. Cunningham and D. G. Bradley, 1997 Microsatellite DNA variation and the evolution, domestication and phylogeography of taurine and zebu cattle (Bos taurus and Bos indicus). Genetics 146: 1071–1086.
Madec, L., C. Desbuquois and M. A. Coutellec-Vreto, 2000 Phenotypic plasticity in reproductive traits: importance in the life history of Helix aspersa (Mollusca: Helicidae) in a recently colonized habitat. Biol. J. Linn. Soc. 69: 25–39.
Malécot, G., 1948 Les Mathematiques de l'Heredite. Masson, Paris.
Metropolis, N., A. W. Rosenbluth, M. N. Rosenbluth, A. H. Teller and E. Teller, 1953 Equation of state calculations by fast computing machines. J. Chem. Phys. 21: 1087–1092.
Moazami-Goudarzi, K., D. Laloe, J. P. Furet and F. Grosclaude, 1997 Analysis of genetic relationships between 10 cattle breeds with 17 microsatellites. Anim. Genet. 28: 338–345.
Nei, M., 1978 Estimation of average heterozygosity and genetic distance from a small number of individuals. Genetics 89: 583–590.
Nei, M., and F. Tajima, 1981 Genetic drift and estimation of effective population size. Genetics 98: 625–640.
Pollak, E., 1983 A new method for estimating the effective population size from allele frequency changes. Genetics 104: 531–548.
Reynolds, J., B. S. Weir and C. C. Cockerham, 1983 Estimation of the coancestry coefficient: basis for a short-term genetic distance. Genetics 105: 767–779.
Robert, C., 1996 Methodes de Monte Carlo par Chaines de Markov. Economica, Paris.
Schneider, S., D. Roessli and L. Excoffier, 2000 A software for population genetics data analysis. http://anthro.unige.ch/arlequin.
Tavare, S., 1984 Line-of-descent and genealogical processes, and their applications in population genetics models. Theor. Popul. Biol. 26: 119–164.
Vignal, A., D. Milan, M. SanCristobal and A. Eggen, 2002 A review on SNP and other types of molecular markers and their use in animal genetics. Genet. Sel. Evol. 34: 275–305.
Waples, R. S., 1990 Temporal changes of allele frequency in Pacific salmon: implications for mixed-stock fishery analysis. Can. J. Fish. Aquat. Sci. 47: 968–976.
Williamson, E. G., and M. Slatkin, 1999 Using maximum likelihood to estimate population size from temporal changes in allele frequencies. Genetics 152: 755–761.
Wilson, I. J., and D. J. Balding, 1998 Genealogical inference from microsatellite data. Genetics 150: 499–510.
Wright, S., 1931 Evolution in Mendelian populations. Genetics 16: 97–159.
Wright, S., 1951 The genetical structure of populations. Ann. Eugen. 15: 323–354.

Communicating editor: S. W. Schaeffer

APPENDIX A: APPROXIMATE DISTRIBUTION OF ALLELE FREQUENCIES WITH FIXATION OR LOSS OF ALLELES

Notations used are the same as in the text, discarding index t. When fixation or loss of alleles does occur, the continuous Dirichlet distribution is not appropriate, since the distribution of allele frequencies is a mixture of continuous distributions among partial sets of nonzero frequencies. We explain the general rationale in the case with three alleles. Let S be a state in which some of the alleles are lost, and denote it by a string of zeroes and ones indicating that the corresponding alleles are lost or present. For three alleles, there are seven possible S states:

111, all three alleles are still present;
110, the third allele is lost, the other two are present;
101, the second allele is lost, the other two are present;
011, the first allele is lost, the other two are present;
100, the first allele is fixed in the population;
010, the second allele is fixed in the population;
001, the third allele is fixed in the population.

We write the distribution $f(p \mid p_0, F)$ as a sum over S states,

$$f(p \mid p_0, F) = \sum_{S} \Pr(S \mid p_0, F)\, \mathrm{Tr}(p \mid S, p_0, F),$$

where $\Pr(S \mid p_0, F)$ is the probability of reaching state S at "time" F from the initial conditions $p_0$, and $\mathrm{Tr}(p \mid S, p_0, F)$ stands for the distribution of transient frequencies p (excluding null frequencies), conditional on state S, on time F, and on initial conditions $p_0$.

Probabilities of S states: Probabilities $\Pr(S \mid p_0, F)$ are derived from the probabilities that alleles are fixed. For example, the probability that allele 3 has been lost can be written as $\Pr(110 \mid p_0, F) + \Pr(100 \mid p_0, F) + \Pr(010 \mid p_0, F)$, while $\Pr(100 \mid p_0, F)$ is exactly the probability that allele 1 was fixed, and $\Pr(010 \mid p_0, F)$ is the probability that allele 2 was fixed. Hence these probabilities $\Pr(S \mid p_0, F)$ can be derived from the probabilities that any subset of alleles has been lost, solving a triangular linear system of equations. Following a compact approximation (Chevalet 2000),

$$\Pr(p_i = 1 \mid p_0, F) \simeq p_{0,i} \exp\left(-2(1 - p_{0,i})\left(\frac{1}{F} - 1\right)\right) = \mathrm{FIX}(p_{0,i};\, p_0, F),$$

we may write

$$\Pr(100 \mid p_0, F) = \mathrm{FIX}(p_{0,1};\, p_0, F)$$
$$\Pr(010 \mid p_0, F) = \mathrm{FIX}(p_{0,2};\, p_0, F)$$
$$\Pr(110 \mid p_0, F) = \mathrm{FIX}(p_{0,1} + p_{0,2};\, p_0, F) - \mathrm{FIX}(p_{0,1};\, p_0, F) - \mathrm{FIX}(p_{0,2};\, p_0, F)$$

and analogous expressions for $\Pr(001 \mid p_0, F)$, $\Pr(101 \mid p_0, F)$, and $\Pr(011 \mid p_0, F)$, from which $\Pr(111 \mid p_0, F)$ is derived as the complement to 1 of the sum of the six previous probabilities. Although this rationale can be extended to any number of alleles, its practical use is limited to rather small values (up to ~10), because the number of states grows exponentially ($2^{n_0} - 1$ for $n_0$ alleles). For numerical applications, it may be necessary to group several alleles, avoiding merging rare alleles into another allelic class, since rare alleles are expected to be highly informative about the drift process.

Transient distributions of allele frequencies: There is no known simple analytical expression for the transient distributions $\mathrm{Tr}(p \mid S, p_0, F)$.

Solutions derived from the diffusion approximation are given as infinite series involving the latent roots of the process and the corresponding eigenfunctions. However, the convergence is slow so that the solution is not of practical use, except near fixation, i.e., at the end of the process for large values of the fixation index F. Here, we approximate these transient conditional distributions by Dirichlet distributions, although there is no way to adjust parameters so that moments coincide with those of the true distribution (it can be easily checked, in the case with three alleles, that it is not possible even for the first two moments). Parameters $\alpha_S$ of the Dirichlet distribution used in place of $\mathrm{Tr}(p \mid S, p_0, F)$ are set according to the following two rules: (i) adjust parameters so that expectations are equal to the true conditional expectations $q_{S,i}$ of transient unfixed frequencies ($p_i$) in state S; (ii) restrict the scope of the approximation to intermediate time periods (F values should not be too large) for which correlations between frequencies are characterized by the single, unconditioned value $A = 1/F - 1$.

To derive the conditional expectations of unfixed frequencies, we make use of the constancy of expected allelic frequencies: at any time and under the condition that some alleles have been lost, the expected frequency of another allele is proportional to its initial frequency. For example, in the three-allele case, consider the situations in which the third allele has been lost. This property implies that, subject to the condition that allele 3 has been lost, the expected frequencies of alleles 1 and 2 are equal to $p_{0,1}/(p_{0,1} + p_{0,2})$ and $p_{0,2}/(p_{0,1} + p_{0,2})$. Denoting as C this condition ("allele 3 has been lost"), we have

$$E(p_1 \mid C) = \frac{p_{0,1}}{p_{0,1} + p_{0,2}},$$

and, recalling that these probabilities depend on $p_0$ and on F,

$$\Pr(C) = \Pr(110) + \Pr(100) + \Pr(010),$$

since the condition C is made up of states (110), (100), and (010). Then, to derive the conditional mean $q_{110,1} = E(p_1 \mid 110)$, we write

$$E(p_1 \mid C) = E(p_1 \mid 110)\Pr(110 \mid C) + E(p_1 \mid 100)\Pr(100 \mid C) + E(p_1 \mid 010)\Pr(010 \mid C) = q_{110,1}\Pr(110 \mid C) + \Pr(100 \mid C) + 0,$$

where $E(p_1 \mid 100) = 1$ and $E(p_1 \mid 010) = 0$ because allele 1 is then fixed or lost, respectively. Multiplying both sides by $\Pr(C)$ gives

$$q_{110,1}\Pr(110) = \frac{p_{0,1}}{p_{0,1} + p_{0,2}}\Pr(C) - \Pr(100).$$
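The state probabilities and the conditional means for the three-allele case can be computed together, as in the following minimal sketch. It assumes the FIX approximation and the state decomposition given in this appendix; the function names (`fix`, `state_probabilities`, `conditional_means_110`) and the numerical inputs are illustrative choices, not part of the original method description.

```python
import math


def fix(x, F):
    """Approximate probability that an allele (or pooled allelic class)
    with initial frequency x is fixed when the inbreeding coefficient is F."""
    A = 1.0 / F - 1.0
    return x * math.exp(-2.0 * (1.0 - x) * A)


def state_probabilities(p0, F):
    """Probabilities of the seven S states for three alleles."""
    p1, p2, p3 = p0
    pr = {"100": fix(p1, F), "010": fix(p2, F), "001": fix(p3, F)}
    # Losing one allele = fixing the pooled class of the other two,
    # minus the cases where a single allele is already fixed.
    pr["110"] = fix(p1 + p2, F) - pr["100"] - pr["010"]
    pr["101"] = fix(p1 + p3, F) - pr["100"] - pr["001"]
    pr["011"] = fix(p2 + p3, F) - pr["010"] - pr["001"]
    # State 111 is the complement to 1 of the six other states.
    pr["111"] = 1.0 - sum(pr.values())
    return pr


def conditional_means_110(p0, F):
    """Conditional means q_{110,1} and q_{110,2} of the two remaining alleles."""
    p1, p2, _ = p0
    pr = state_probabilities(p0, F)
    pr_c = pr["110"] + pr["100"] + pr["010"]  # Pr(C): allele 3 has been lost
    q1 = (p1 / (p1 + p2) * pr_c - pr["100"]) / pr["110"]
    q2 = (p2 / (p1 + p2) * pr_c - pr["010"]) / pr["110"]
    return q1, q2


if __name__ == "__main__":
    p0 = (0.5, 0.3, 0.2)  # illustrative initial frequencies
    print(state_probabilities(p0, F=0.1))
    print(conditional_means_110(p0, F=0.1))
```

By construction the two conditional means sum to one, which gives a quick consistency check on any implementation.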

Numerically, these expectations can be calculated at the same time as probabilities of states S are derived. Then the Dirichlet distribution used to approximate the transient distribution conditional on S is defined by the parameters $\alpha_{S,i} = A\,q_{S,i}$, where $q_{S,i}$ are the previous conditional expectations of allele frequencies, and a single unconditional A parameter is used.
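As an illustration of this parameterization, the sketch below builds the Dirichlet parameters from a vector of conditional means and draws transient frequencies by normalizing independent gamma variates. The conditional means used here are hypothetical values chosen for the example, and the helper names are assumptions of this sketch.

```python
import random


def dirichlet_alpha(q, F):
    """Dirichlet parameters alpha_{S,i} = A * q_{S,i}, with A = 1/F - 1."""
    A = 1.0 / F - 1.0
    return [A * qi for qi in q]


def sample_dirichlet(alpha, rng):
    """Draw one Dirichlet vector by normalizing independent gamma variates."""
    g = [rng.gammavariate(a, 1.0) for a in alpha]
    total = sum(g)
    return [x / total for x in g]


rng = random.Random(0)
q_110 = [0.62, 0.38]  # hypothetical conditional means for state 110
alpha = dirichlet_alpha(q_110, F=0.1)
draws = [sample_dirichlet(alpha, rng) for _ in range(20000)]
mean_p1 = sum(d[0] for d in draws) / len(draws)
print(alpha, mean_p1)
```

The parameters sum to A, so the simulated mean of each frequency matches its conditional expectation while A alone controls the dispersion.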

APPENDIX B

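The joint posterior distribution of parameters introduced in the model is equal to

$$f(p_0, p_t, A_t \mid m_0, m_t) \propto \prod_{\ell=1}^{L} \left[ \prod_{i=1}^{k_\ell} p_{\ell,i,t}^{\,m_{\ell,i,t} + A_t p_{0,i} - 1} \cdot \frac{\Gamma(A_t)}{\prod_{i=1}^{k_\ell} \Gamma(A_t p_{0,i})} \cdot \prod_{i=1}^{k_\ell} p_{0,i}^{\,m_{0,i} + \alpha_0 - 1} \right] \cdot A_t^{\,a-1} \exp(-bA_t),$$

where $k_\ell$ is the number of alleles at locus $\ell$ (and $A_t = 1/F_t - 1$). Then the conditional posterior distributions are

$$p_{\ell,t} \mid \ldots \sim \mathcal{D}_{k_\ell}(m_{\ell,t} + A_t p_0), \quad \ell = 1, \ldots, L \tag{B1}$$

$$f(p_0 \mid \ldots) \propto \prod_{\ell=1}^{L} \prod_{i=1}^{k_\ell} \frac{\left(p_{\ell,i,t}^{A_t}\right)^{p_{0,i}}\, p_{0,i}^{\,m_{0,i} + \alpha_0 - 1}}{\Gamma(A_t p_{0,i})} \tag{B2}$$

$$f(A_t \mid \ldots) \propto \prod_{\ell=1}^{L} \left[ \frac{\Gamma(A_t)}{\prod_{i=1}^{k_\ell} \Gamma(A_t p_{0,i})} \prod_{i=1}^{k_\ell} \left(p_{\ell,i,t}\right)^{A_t p_{0,i}} \right] \cdot A_t^{\,a-1} \exp(-bA_t) \tag{B3}$$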

(the notation $\mid \ldots$ means "given the rest," i.e., given the other parameters and the data). We have chosen a random-walk step for $p_0$ and an independent step for $A_t$ in the Metropolis-Hastings algorithm, with a Gibbs step for the p's. More precisely, the following algorithm was implemented; the superscript (r) denotes the current iteration:

1. Choose initial values for the parameters (r = 0). In practice, sample estimates of allele frequencies and Reynolds's estimate of the inbreeding coefficient were used.
2. Draw $p_{\ell,t}^{(r+1)}$ from a Dirichlet distribution $\mathcal{D}_{k_\ell}(m_{\ell,t} + A_t^{(r)} p_0^{(r)})$ for $\ell = 1, \ldots, L$.
3. Draw $y = (y_1, \ldots, y_{k_\ell})$ from a uniform distribution in $[0, b_u]$ and compute $\alpha_0 = \min(1, f(y)/f(p_0^{(r)}))$, where f is the function of $p_0$ given in the right-hand side of (B2); finally, accept $p_0^{(r+1)} = y$ with a probability equal to $\alpha_0$, or set $p_0^{(r+1)} = p_0^{(r)}$ with a probability $1 - \alpha_0$.
4. Draw y from a gamma distribution $\mathcal{G}(a_g, b_g)$ and compute $\alpha_t = \min\left(1, \frac{f(y)\, g(A_t^{(r)})}{f(A_t^{(r)})\, g(y)}\right)$, where f is the function of $A_t$ given in the right-hand side of (B3) and g is the gamma density; finally, accept $A_t^{(r+1)} = y$ with a probability equal to $\alpha_t$, or set $A_t^{(r+1)} = A_t^{(r)}$ with a probability $1 - \alpha_t$.
5. Go to step 2.

In practice, the parameters of candidate generating densities $b_u$, $a_g$, and $b_g$ were empirically determined to provide samples (or chains) that converge to the distributions of interest. Convergence of chain $A_t^{(r)}$ is optimal when the percentage of values accepted by the Metropolis-Hastings algorithm is close to 25% (Robert 1996).
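The sampler described above can be sketched as follows. This is an illustrative reading under several assumptions that are not taken from the text: each locus carries its own founder frequency vector; the random-walk candidate for $p_0$ moves a uniform amount in $[-b_u, b_u]$ from one randomly chosen allele to another (a symmetric move that keeps the frequencies summing to one); $\mathcal{G}(a_g, b_g)$ is parameterized by shape and rate; the chain for $A_t$ starts at $a_g/b_g$ rather than at Reynolds's estimate; and all default tuning values and the function name `run_chain` are arbitrary.

```python
import math
import random


def rdirichlet(alpha, rng):
    """Dirichlet draw via normalized gammas; the floor guards against log(0)."""
    g = [rng.gammavariate(a, 1.0) for a in alpha]
    s = sum(g)
    return [max(x / s, 1e-300) for x in g]


def log_f_p0(p0, pt, m0, A, alpha0):
    """Log of the right-hand side of (B2), restricted to one locus."""
    return sum(A * q * math.log(p) + (m + alpha0 - 1.0) * math.log(q)
               - math.lgamma(A * q) for q, p, m in zip(p0, pt, m0))


def log_f_A(A, p0s, pts, a, b):
    """Log of the right-hand side of (B3)."""
    total = (a - 1.0) * math.log(A) - b * A
    for p0, pt in zip(p0s, pts):
        total += math.lgamma(A)
        total += sum(A * q * math.log(p) - math.lgamma(A * q)
                     for q, p in zip(p0, pt))
    return total


def accept(log_ratio, rng):
    """Metropolis-Hastings acceptance with probability min(1, exp(log_ratio))."""
    return rng.random() < math.exp(min(0.0, log_ratio))


def run_chain(m0s, mts, n_iter, rng, alpha0=1.0, a=1.0, b=0.01,
              bu=0.05, ag=2.0, bg=0.1):
    """Return the chain of F_t = 1/(1 + A_t) and the acceptance rate for A_t."""
    def log_g(x):  # log gamma(shape ag, rate bg) density, up to a constant
        return (ag - 1.0) * math.log(x) - bg * x

    # Step 1: initial values (assumed: smoothed sample frequencies, A = ag/bg).
    p0s = [[(m + 0.5) / (sum(m0) + 0.5 * len(m0)) for m in m0] for m0 in m0s]
    A = ag / bg
    chain_F, n_accepted = [], 0
    for _ in range(n_iter):
        # Step 2: Gibbs draw of current frequencies, as in (B1).
        pts = [rdirichlet([m + A * q for m, q in zip(mt, p0)], rng)
               for mt, p0 in zip(mts, p0s)]
        # Step 3: random-walk Metropolis update of founder frequencies.
        for l, p0 in enumerate(p0s):
            i, j = rng.sample(range(len(p0)), 2)
            d = rng.uniform(-bu, bu)
            y = list(p0)
            y[i] += d
            y[j] -= d
            if min(y[i], y[j]) <= 0.0:
                continue  # candidate outside the simplex: keep current value
            log_r = (log_f_p0(y, pts[l], m0s[l], A, alpha0)
                     - log_f_p0(p0, pts[l], m0s[l], A, alpha0))
            if accept(log_r, rng):
                p0s[l] = y
        # Step 4: independence Metropolis-Hastings update of A_t.
        y = rng.gammavariate(ag, 1.0 / bg)  # second argument is the scale
        log_r = (log_f_A(y, p0s, pts, a, b) + log_g(A)
                 - log_f_A(A, p0s, pts, a, b) - log_g(y))
        if accept(log_r, rng):
            A = y
            n_accepted += 1
        chain_F.append(1.0 / (1.0 + A))
    return chain_F, n_accepted / n_iter


if __name__ == "__main__":
    m0s = [[30, 20, 10], [25, 35]]  # illustrative founder allele counts
    mts = [[40, 15, 5], [20, 40]]   # illustrative counts at generation t
    chain, rate = run_chain(m0s, mts, 2000, random.Random(1))
    print(sum(chain[500:]) / len(chain[500:]), rate)
```

The returned acceptance rate can be compared with the target of about 25% when tuning `ag` and `bg`; in real use the chain would also need a burn-in period and convergence checks.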
