Genetica 99: 47–58, 1997. © 1997 Kluwer Academic Publishers. Printed in the Netherlands.
Bootstrap tests for specific hypotheses at single locus inbreeding coefficients
S. Van Dongen1 & T. Backeljau2
1 Department of Biology, University of Antwerp, Universiteitsplein 1, B-2610 Wilrijk; 2 Royal Belgian Institute of Natural Sciences, Vautierstraat 29, B-1040 Brussels, Belgium
Received 8 September 1996 Accepted 7 January 1997
Key words: bootstrap, F-statistics, Hardy-Weinberg law, power, resampling
Abstract
Deviations of genotype distribution from Hardy-Weinberg expectations within a (sub)population can give valuable insight into the population structure, and can be quantified by means of Fis values. Specific biological and/or genetical hypotheses regarding Fis require particular statistical procedures to be able to perform the test with high power. The bootstrap offers a convenient way to test against a broad range of alternative hypotheses. It enables: a) comparison of an observed Fis with any expected value between −1 and 1, and b) comparison of two or more observed Fis values. However, it fails under numerous situations, and great caution should be taken before applying the bootstrap to estimate confidence intervals of Fis. We discuss under which conditions the bootstrap gives reliable results.

Introduction
Testing for Hardy-Weinberg (HDW) equilibrium is one of the first steps in a population genetic analysis. Deviations of the observed genotypic frequencies from HDW expectations indicate either non-random mating (e.g., assortative mating, inbreeding, and population mixing) or selection. Extensive testing for HDW conformity of observed genotype frequencies for many loci in several subpopulations may be of limited value because of the high frequency of type I errors, i.e., erroneously rejecting the null hypothesis. Testing HDW equilibrium should, therefore, focus on specific biological and/or genetic hypotheses (Lessios, 1992). Yet 'classic' statistical tests like chi-square or Fisher exact tests are inappropriate to test against various alternative hypotheses (Rousset & Raymond, 1995 and references therein). Therefore, more specific tests need to be developed to be able to test specific hypotheses with increased statistical power.
Recently Rousset and Raymond (1995) examined exact procedures that take the nature of the deviation (heterozygote excess or deficiency) from HDW equilibrium into account, such that one-tailed tests with consequently higher power can be performed. Van Dongen and Backeljau (1995)
investigated how the bootstrap can be applied to estimate the distribution of the F-statistic Fis (Nei, 1978), a measure for the deviation from HDW equilibrium. Van Dongen and Backeljau (1995) showed by means of Monte Carlo simulations that by resampling individual genotypes, the distribution of Fis values can be estimated such that: a) an observed Fis can be statistically compared, both one- and two-tailed, with any expected value (one-sample test), and b) two observed Fis values can be tested for equality (two-sample test). Bootstrapping thus allows tests against a much broader range of alternative hypotheses than the exact procedures proposed by Rousset and Raymond (1995), although it may not be reliable in all situations. With increasing Fis, the one-sample test becomes less appropriate for small sample sizes (e.g., 10 and 20 individuals) as the type I error rate is larger than expected (Van Dongen & Backeljau, 1995). The two-sample test was not tested in this respect. A potential cause for this failure of the bootstrap test is non-pivotalness of the statistic used (Fis* − Fis obs.). To avoid high computation times, Van Dongen and Backeljau (1995) discarded the second guideline of Hall and Wilson (1991) for bootstrap testing, which is dividing the test statistic by the standard error of Fis. This implicitly assumes that
Fis* − Fis obs. is a pivotal quantity, which means that it has approximately the same distribution for all values of Fis obs. (e.g., Efron & Tibshirani, 1993). Furthermore, Fis is bounded between −1 and 1, like a correlation coefficient, which may skew the distribution (Sokal & Rohlf, 1981). The use of a z-transformation may be more appropriate in such situations (Hinkley, 1988). In this paper we further investigate the behaviour and performance of the one-sample bootstrap test for single locus Fis values. We examine a more advanced method for confidence interval estimation based on percentiles, the so-called bias-corrected and accelerated percentile (bca) method (Efron, 1987). This method takes both non-pivotalness and bias of the distribution into account (Efron & Tibshirani, 1993). We also extend the study to more realistic situations. Van Dongen and Backeljau (1995) restricted their simulations to a two-allele model with allele frequencies of 0.5, while here we examine a range of allele frequencies and both two- and three-allele situations. We explore specific bootstrap distributions more closely to determine possible causes of the failure of the bootstrap, and compare the power of the bca one-sample test with upper bounds. The two-sample bootstrap test is extended to an n-sample test or bootstrap ANOVA. We investigate the behaviour of the tests by Monte Carlo simulations.
Materials and methods
Estimation of the inbreeding coefficient
The F-statistics Fst, Fit, and Fis introduced by Wright (1951) offer a convenient way to summarize the population structure and measure deviations from HDW proportions at different levels (Weir & Cockerham, 1984). Nei (1977) showed that the gene diversity of the total population can be partitioned into its intra- and intersubpopulational components when gene diversity is defined as the frequency of heterozygotes expected under HDW equilibrium. Nei therefore reformulated the three F-statistics and obtained for Fis, which is the statistic of interest here, the following expression:
$$F_{is} = \frac{H_s - H_i}{H_s}$$

where Hi = observed heterozygosity in subpopulation S, and Hs = expected heterozygosity under HDW conditions.
Hi can be determined directly from the genotype frequencies. For Hs, Nei (1978) proposed an unbiased estimator given by:

$$H_s = \frac{2n\left(1 - \sum_i f_i^2\right)}{2n - 1}$$

where n = sample size and fi = frequency of the i-th allele.
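The two expressions above can be combined into a short function. The following is only a minimal sketch: the function name `fis_nei` and the representation of a genotype as a pair of allele labels are assumptions made for illustration, and returning `None` for a sample with a single allele (where Hs = 0) is likewise a choice of this sketch, not something specified in the text.

```python
from collections import Counter


def fis_nei(genotypes):
    """Estimate Fis = (Hs - Hi) / Hs for one locus.

    genotypes: non-empty list of (allele, allele) pairs, one per individual.
    Returns None when Hs is 0 (only one allele in the sample).
    """
    n = len(genotypes)
    # Hi: observed proportion of heterozygous individuals.
    h_obs = sum(a != b for a, b in genotypes) / n
    # Allele frequencies f_i, counted over the 2n sampled gene copies.
    counts = Counter(allele for genotype in genotypes for allele in genotype)
    sum_f2 = sum((c / (2 * n)) ** 2 for c in counts.values())
    # Hs: Nei's (1978) unbiased estimate of the expected heterozygosity.
    h_exp = 2 * n * (1 - sum_f2) / (2 * n - 1)
    if h_exp == 0:
        return None
    return (h_exp - h_obs) / h_exp
```

For a sample of 10 AA, 20 AB and 10 BB individuals, Hi = 0.5 and Hs = 40/79, so the function returns 0.0125 rather than exactly 0; the small positive value comes from the 2n/(2n − 1) correction in Hs.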
Different notations for Fis will be used throughout the paper referring to different quantities. Fis indicates the population parameter, while Fis obs. represents the sample estimate and Fis* the bootstrap estimate (see below).

The bootstrap
One-sample procedure. Efron (1979) introduced a very general resampling procedure, the "bootstrap", for estimating the distributions of statistics based on independent observations. For the one-sample situation, let X = (X1, X2, ..., Xn) be a random sample of size n of independent, identically distributed (i.i.d.) random variables with common but unknown distribution function F, and x = (x1, x2, ..., xn) its observed realization, and let R(X, F) be a statistic of interest depending on both X and F, for which one wants to estimate the distribution. The bootstrap method proceeds as follows:
1. Construct the empirical distribution function Fn by assigning mass 1/n to each observation xi.
2. Draw a random sample of size n, with replacement, from Fn, the so-called bootstrap sample X*.
3. Approximate the sampling distribution of R(X, F) by the bootstrap distribution of R* = R(X*, Fn).
Efron (1979) suggests a Monte Carlo approximation to estimate the bootstrap distribution of R*. By repeating steps 2 and 3 several times (B times), the bootstrap distribution of R(X*, Fn) is approximated by the distribution of R(X*(1), Fn), R(X*(2), Fn), ..., R(X*(B), Fn). By making B sufficiently large, this approximation can be made arbitrarily accurate. Based on this bootstrap distribution, several methods have been proposed for confidence interval (C.I.) estimation and hypothesis testing (Efron & Tibshirani, 1986, 1993; Efron, 1987; Hall, 1988; Hall & Wilson, 1991). The bootstrap distribution and C.I. of Fis obs. will be estimated by resampling individual genotypes, the independent units of observation. Efron (1979, 1987) introduced and refined the percentile methods for C.I. construction. The basic idea of these methods is that
there exists some, often unknown, transformation that perfectly normalizes the frequency distribution of the bootstrap estimates, which is automatically incorporated in the C.I. estimation. Two refinements, bias and acceleration correction, have been devised to take possible shifts and non-constancy of the variance (non-pivotalness) into account. In the absence of any bias and variance non-constancy, this so-called bias-corrected and accelerated (bca) percentile method reduces to the uncorrected percentile procedure. The bca has, besides lower computation times, several other advantages over the method previously used by Van Dongen and Backeljau (1995): a) it does not require the use of z-transformations to obtain approximate normality of correlation coefficients, as the bca automatically 'chooses' its own best scale, b) it is transformation respecting, c) it is range respecting, and d) it may perform better for small sample sizes. Therefore, the bca method is recommended for general use, especially for non-parametric problems (Efron & Tibshirani, 1993). Computational formulas and more details can be found in Efron (1987) and Efron and Tibshirani (1993, pp. 178–188).

Bootstrap ANOVA. A parametric ANOVA is based on the comparison of the variance of a measure between and within groups or classes. This comparison is expressed by the so-called F-value. Larger variances among groups relative to the within-group variance will increase the F-value. If the variance among groups is sufficiently large relative to the within-group variance, that is, if F is sufficiently large, H0 of equality of means will be rejected. For normally distributed data, the distribution of F under H0 follows an F distribution such that for an observed F a significance level can be estimated. This approach cannot, however, be used directly for comparing Fis obs. values because the estimation of the F-value would require that an Fis obs. value be assigned to each individual.
Because Fis is a feature of the population rather than of individuals, this is impossible. Alternatively, just as the bootstrap offers a convenient tool for variance estimation (Efron & Tibshirani, 1993), the ANOVA F-value can be approximated by a bootstrap estimate. The variance among groups (subpopulations or loci) (VARbetween) can be estimated directly from the observed Fis obs. values, whereas the within-subpopulation variance (VARwithin) can be approximated by the average of the bootstrap variance estimates of the individual Fis obs. values. The ANOVA F-value equals VARbetween/VARwithin. Testing the significance of this
F-value requires the estimation of its distribution under the null hypothesis Fis1 obs. = Fis2 obs. = ... = Fisn obs. (Fisher & Hall, 1990). Fisher and Hall (1990) suggest transforming the original dataset by subtracting the respective population estimates of the statistic of interest from the observations and then resampling this new dataset. Again, this requires that an Fis obs. can be assigned to each individual. Alternatively, we suggest shifting the bootstrap distribution of the Fis obs. to one common value (e.g., 0) and estimating the distribution of the observed ANOVA F-value under H0 (see Manly, 1991, p. 28 for a similar approach in the 2-sample case). In this way, the bootstrap estimate of the variance in Fis among subpopulations represents its variance under the H0 conditions. This results in the following algorithm:
a) resample the original dataset separately per subpopulation with replacement to obtain n (# subpopulations) bootstrap samples,
b) estimate Fis1*, Fis2*, ..., Fisn* from the bootstrap samples,
c) let Fisi* = Fisi* − Fisi obs.,
d) calculate VARbetween* from the Fisi* values,
e) resample the bootstrap sample B2 times and estimate the variance (VARi*) of the Fisi* values from the secondary bootstrap estimates (i.e. Fisi** values),
f) estimate VARwithin* as the average of the VARi* values,
g) let Fi* = VARbetween*/VARwithin*,
h) if F > Fi* then counter = counter + 1,
i) repeat a–h B1 times,
j) p = counter/B1.
Note that this algorithm involves two nested bootstrap procedures because for each bootstrap resample, the variance within Fis* values is estimated by a secondary bootstrap procedure. This leads to high computation times. Therefore, we first examine the coverage behaviour of the bootstrap ANOVA when steps e and f are discarded and the distribution of VARbetween is estimated under H0 with the above algorithm, setting VARwithin = 1. We then examine the possible improvement of the test by estimating VARwithin by a secondary bootstrap resampling (cfr.
step e) in a few simulations. As this algorithm is not based on percentiles, we apply a z-transformation to obtain approximate normality and a better performance of the bootstrap (Hinkley, 1988).

Monte Carlo simulations
Two errors can be made while performing a statistical test: a) falsely rejecting the null hypothesis (type I
error), or b) falsely accepting the null hypothesis (type II error), with the respective error probabilities α and β. The power of a test, given by 1 − β, is the probability of rejecting the null hypothesis when the alternative is true, i.e., of making a correct conclusion. Ideally, one would like to keep both α and β close to zero. In practice this is in most cases not possible, and usually one keeps α < 0.05, while β, and consequently the power of the test, depends on the sample size, the minimal difference one wants to detect, and the nature of the test procedure (Siegel & Castellan, 1988). The error probabilities of a statistical test can be investigated by Monte Carlo simulations (Bickel & Krieger, 1989). In each simulation step, a dataset is generated from an underlying distribution, for example the null distribution, and the statistical test is performed. This is repeated many (M) times and one counts the number (N) of tests where P < 0.05. If the data were generated from a distribution that reflects the null hypothesis, one would ideally expect that 1 − N/M (i.e., the coverage probability [CP]) is close to 0.95 (the nominal level = 1 − α [NL]). Analogously to the investigation of the CP, the power of a test can be investigated. Datasets are generated M times from an underlying distribution different from H0 and the tests are performed. The proportion of rejections of the null hypothesis (N/M) gives an estimate of the power of the test to detect the given difference at a prespecified sample size and NL. In this simulation, the collection of data from an underlying population is mimicked. By varying the population distribution parameters the performance of the bootstrap can be evaluated under various conditions. The HDW law states that within one generation genotype frequencies will follow a multinomial distribution with the product of the respective allele frequencies as distribution parameters.
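The coverage estimate 1 − N/M described above can be sketched for the simplest case, a 2-allele locus sampled under HDW proportions (true Fis = 0), with the interval obtained by resampling individual genotypes. This is a sketch under stated assumptions, not the authors' program: the function names are hypothetical, the bca formulas follow the standard description in Efron and Tibshirani (1993) with a jackknife acceleration, samples of at least two individuals are assumed, and the handling of degenerate cases (samples or resamples with a single allele are skipped, the bias proportion is kept away from 0 and 1, and the acceleration is dropped when the formula breaks down) is a choice made here because the text does not specify it.

```python
import random
from statistics import NormalDist


def fis(genotypes):
    """Fis = (Hs - Hi) / Hs with Nei's (1978) Hs; None if Hs is 0."""
    n = len(genotypes)
    h_obs = sum(a != b for a, b in genotypes) / n
    counts = {}
    for genotype in genotypes:
        for allele in genotype:
            counts[allele] = counts.get(allele, 0) + 1
    sum_f2 = sum((c / (2 * n)) ** 2 for c in counts.values())
    h_exp = 2 * n * (1 - sum_f2) / (2 * n - 1)
    return None if h_exp == 0 else (h_exp - h_obs) / h_exp


def bca_interval(sample, n_boot=1000, alpha=0.05, rng=random):
    """bca percentile interval for Fis (len(sample) >= 2); None if undefined."""
    n = len(sample)
    estimate = fis(sample)
    if estimate is None:
        return None
    # Resample individual genotypes with replacement, n_boot times.
    boot = [fis(rng.choices(sample, k=n)) for _ in range(n_boot)]
    boot = sorted(v for v in boot if v is not None)
    if len(boot) < 2:
        return None
    std_normal = NormalDist()
    # Bias correction z0 from the share of resamples below the estimate.
    share = sum(v < estimate for v in boot) / len(boot)
    share = min(max(share, 1 / (len(boot) + 1)), len(boot) / (len(boot) + 1))
    z0 = std_normal.inv_cdf(share)
    # Acceleration from leave-one-out (jackknife) estimates.
    jack = [fis(sample[:i] + sample[i + 1:]) for i in range(n)]
    jack = [v for v in jack if v is not None]
    accel = 0.0
    if len(jack) > 1:
        mean = sum(jack) / len(jack)
        denom = 6 * sum((mean - v) ** 2 for v in jack) ** 1.5
        if denom > 0:
            accel = sum((mean - v) ** 3 for v in jack) / denom
    z_lo = std_normal.inv_cdf(alpha / 2)
    z_hi = std_normal.inv_cdf(1 - alpha / 2)
    if min(1 - accel * (z0 + z_lo), 1 - accel * (z0 + z_hi)) <= 0:
        accel = 0.0  # formula breaks down: keep the bias correction only

    def endpoint(z):
        level = std_normal.cdf(z0 + (z0 + z) / (1 - accel * (z0 + z)))
        return boot[min(int(level * len(boot)), len(boot) - 1)]

    return endpoint(z_lo), endpoint(z_hi)


def coverage_under_hdw(q, n, n_sim=500, n_boot=1000, seed=0):
    """Share of simulated samples whose interval contains the true Fis = 0."""
    rng = random.Random(seed)
    genotypes = [("A", "A"), ("A", "B"), ("B", "B")]
    weights = [q * q, 2 * q * (1 - q), (1 - q) ** 2]  # HDW proportions
    covered = usable = 0
    for _ in range(n_sim):
        sample = rng.choices(genotypes, weights=weights, k=n)
        interval = bca_interval(sample, n_boot=n_boot, rng=rng)
        if interval is None:
            continue  # assumption: undefined intervals are not counted
        usable += 1
        covered += interval[0] <= 0.0 <= interval[1]
    return covered / usable if usable else None
```

A call such as `coverage_under_hdw(0.9, 20)` mimics one cell of the 2-allele design. The result depends on the seed and on the degenerate-case choices above, so it should not be expected to reproduce the published coverage probabilities exactly.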
Deviations from HDW equilibrium can be quantified by means of an F-statistic Fis (Nei, 1978). To test under which sample size, allele frequencies, and Fis the bca percentile method and the bootstrap ANOVA give reliable distribution estimates, we generated genotype frequencies from multinomial distributions with the distribution parameters (sampling probabilities) set at 2 fi fj (1 − Fis) for the heterozygotes and fi² + Fis fi (1 − fi) for the homozygotes (Wright, 1969). Allele frequencies and Fis values of underlying distributions, sample sizes in the different simulations, and further specific details will be given in the Results section.
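The sampling probabilities given above translate directly into code. The sketch below is illustrative only: the function names are hypothetical, alleles are identified by their index in the frequency list, and no check is made that a negative Fis keeps all homozygote probabilities non-negative.

```python
import random
from itertools import combinations_with_replacement


def genotype_probs(freqs, fis):
    """Map each unordered genotype (i, j), i <= j, to its sampling probability."""
    probs = {}
    for i, j in combinations_with_replacement(range(len(freqs)), 2):
        if i == j:
            # Homozygotes: fi^2 + Fis * fi * (1 - fi)
            probs[(i, j)] = freqs[i] ** 2 + fis * freqs[i] * (1 - freqs[i])
        else:
            # Heterozygotes: 2 * fi * fj * (1 - Fis)
            probs[(i, j)] = 2 * freqs[i] * freqs[j] * (1 - fis)
    return probs


def sample_genotypes(freqs, fis, n, rng=random):
    """Draw n genotypes, i.e. one multinomial sample of size n."""
    probs = genotype_probs(freqs, fis)
    genotypes = list(probs)
    weights = [probs[g] for g in genotypes]
    return rng.choices(genotypes, weights=weights, k=n)
```

For two alleles at frequency 0.5 and Fis = 0.8, the heterozygote probability is 2 × 0.25 × 0.2 = 0.1 and each homozygote has probability 0.25 + 0.8 × 0.25 = 0.45, so the three probabilities sum to 1.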
Results
Coverage probabilities of the one-sample test
We examined the CP of the bca method to estimate C.I. for Fis obs. with Monte Carlo simulations. For each simulation, we repeatedly (500 times) generated a sample under the conditions of the null hypothesis and rejected H0 if the Fis of the underlying distribution was not contained in the 95% bca C.I. The parameters of the underlying distributions were as follows: Fis: 0, 0.5, and 0.8; and allele frequencies q = 0.5, 0.7, 0.9, and 0.95 for the 2-allelic and q1 = 0.5, q2 = 0.45, 0.4, and 0.3; q1 = 0.7, q2 = 0.25 and 0.2; q1 = 0.9, q2 = 0.05; and q1 = 0.36, q2 = 0.33 for the 3-allelic case. For the triallelic situation allele frequencies were chosen in agreement with those selected by Rousset and Raymond (1995), as these authors provide power estimates under some of those conditions that will be compared with the power of the bootstrap one-sample test (see below). Samples of sizes 10, 20, 50, and 100 were obtained from the underlying distributions. Simulations were performed for 1000 and 2500 bootstrap resamples. Results for the 2-allele situation are summarized in Table 1. For sample size 10, CPs were never close to the expected 95% NL. With higher sample sizes, the CPs differed very little from the NL for q = 0.5. Deviations of CP around 0.95 were lower for simulations with 2500 resamples as compared to 1000 bootstrap resamples. However, CPs decreased with increasing allele frequency of the common allele (q). This effect was less conspicuous for larger sample sizes and increasing Fis. For the simulations with 3 alleles (Table 2), CPs generally differed less from the predicted NL compared to the 2 allele simulations, except when sample size equalled 10. CPs became lower than the NL with increasing frequency of the common allele. This effect was again less conspicuous as Fis increased. When the frequency of the common allele was relatively low and rare alleles were present, CPs were close to the NL.
This result suggests that the low CPs were due to an increase of the allele frequency of the most common allele and not due to the presence of rare alleles.

One-sample bootstrap distribution
To investigate why type I error rates were inflated (CP < 0.95) when the allele frequency of the most common allele increased, and why this effect was more conspicuous with increasing Fis, we examined the shape
Table 1. Coverage probabilities of the bca method in the 2 allele case (columns: Fis of the underlying distribution and number of bootstrap resamples)

               Fis = 0         Fis = 0.5       Fis = 0.8
size   q       1000   2500     1000   2500     1000   2500
10     0.5     0.85   0.89     0.93   0.92     0.98   0.98
10     0.7     0.68   0.66     0.78   0.80     0.90   0.90
10     0.9     0.87   0.89     0.40   0.40     0.55   0.56
10     0.95    0.98   0.98     0.21   0.21     0.30   0.30
20     0.5     0.92   0.95     0.94   0.95     0.98   0.98
20     0.7     0.73   0.68     0.95   0.95     0.99   0.99
20     0.9     0.18   0.18     0.70   0.71     0.82   0.78
20     0.95    0.04   0.07     0.44   0.38     0.54   0.55
50     0.5     0.94   0.95     0.96   0.96     0.95   0.95
50     0.7     0.93   0.95     0.96   0.95     0.94   0.95
50     0.9     0.40   0.38     0.93   0.94     0.99   0.98
50     0.95    0.13   0.14     0.75   0.71     0.88   0.88
100    0.5     0.94   0.95     0.94   0.94     0.95   0.95
100    0.7     0.93   0.95     0.94   0.96     0.94   0.95
100    0.9     0.62   0.58     0.96   0.96     0.95   0.98
100    0.95    0.27   0.23     0.94   0.92     0.96   0.98
Table 2. Coverage probabilities for the bca method in the 3 allele case (columns: Fis of the underlying distribution and number of bootstrap resamples)

                      Fis = 0         Fis = 0.5       Fis = 0.8
size   q1     q2      1000   2500     1000   2500     1000   2500
10     0.5    0.10    0.90   0.90     0.93   0.96     0.99   0.99
10     0.5    0.20    0.89   0.92     0.95   0.95     0.99   0.99
10     0.5    0.05    0.89   0.91     0.94   0.93     0.98   0.99
10     0.7    0.25    0.71   0.73     0.85   0.85     0.92   0.95
10     0.7    0.20    0.77   0.77     0.91   0.86     0.93   0.95
10     0.9    0.05    0.94   0.95     0.39   0.40     0.56   0.52
10     0.36   0.33    0.94   0.95     0.95   0.96     0.99   0.99
20     0.5    0.10    0.95   0.95     0.96   0.97     0.98   0.98
20     0.5    0.20    0.94   0.95     0.95   0.97     0.99   0.99
20     0.5    0.05    0.95   0.95     0.95   0.94     0.98   0.99
20     0.7    0.25    0.81   0.80     0.96   0.96     0.99   0.99
20     0.7    0.20    0.80   0.85     0.97   0.95     0.99   0.99
20     0.9    0.05    0.16   0.18     0.67   0.68     0.79   0.78
20     0.36   0.33    0.94   0.96     0.95   0.95     0.97   0.98
50     0.5    0.10    0.94   0.95     0.95   0.95     0.97   0.95
50     0.5    0.20    0.95   0.95     0.95   0.95     0.94   0.94
50     0.5    0.05    0.95   0.95     0.94   0.95     0.96   0.95
50     0.7    0.25    0.94   0.92     0.95   0.95     0.96   0.96
50     0.7    0.20    0.95   0.93     0.97   0.96     0.96   0.97
50     0.9    0.05    0.39   0.37     0.92   0.93     0.98   0.99
50     0.36   0.33    0.95   0.95     0.94   0.94     0.94   0.94
100    0.5    0.10    0.94   0.94     0.94   0.96     0.94   0.95
100    0.5    0.20    0.93   0.95     0.95   0.94     0.95   0.95
100    0.5    0.05    0.96   0.95     0.94   0.94     0.95   0.95
100    0.7    0.25    0.95   0.96     0.95   0.96     0.95   0.95
100    0.7    0.20    0.95   0.96     0.94   0.96     0.94   0.95
100    0.9    0.05    0.63   0.59     0.96   0.96     0.98   0.97
100    0.36   0.33    0.94   0.94     0.95   0.95     0.95   0.94
Figure 1. Bootstrap distributions estimated with 10,000 resamples from 8 different datasets generated under the following conditions: sample size = 100, Fis = 0 and 0.8, and q = 0.5, 0.7, 0.9, and 0.95 for a 2 allele model. Coverage probabilities of the bca percentile method as estimated by Monte Carlo simulations (Table 1) are also given.
of the bootstrap distributions of single datasets generated under different underlying allele frequencies and fixation indices. Figure 1 shows the bootstrap distributions estimated from 10,000 bootstrap resamples for 8 such datasets with sample size 100. The datasets were generated from underlying distributions with Fis = 0 and 0.8, and allele frequencies q = 0.5, 0.7, 0.9, and 0.95 for the 2 allele situation. For Fis = 0 and increasing q, the bootstrap distribution changed from a nearly normal distribution for q = 0.5 and 0.7 to a distribution with multiple peaks for q = 0.9 and 0.95. For the latter 2 allele frequencies, the CPs were lower than the NL while for the first two distributions CP equalled 0.95. For Fis = 0.8 and increasing q, the bootstrap distribution changed from a nearly normal to a skewed distribution, without, however, a reduction of the CP. Figure 2 represents the detailed bootstrap distribution of Figure 1 for Fis = 0 and q = 0.95. Bars with different fill patterns represent resamples with different numbers of homozygotes for the rare allele (q2 = 0.05). The presence or absence of one extra homozygote for the rare allele caused large shifts in the distribution that resulted in different peaks. These individuals break up the otherwise continuous distribution into more discrete parts and seem to act as outliers. The same analysis for sample size 20 gave comparable results (Figure 3). Bootstrap distributions exhibited multiple peaks with increasing q resulting in erroneous tests at q = 0.7, as judged from the CPs. Again the different peaks could be attributed to the presence of different numbers of homozygotes for the rare allele (data not shown). For q = 0.95, a second problem became obvious. The distribution showed only one peak (Figure 3), which was due to a lack of homozygotes for the rare allele.
This resulted in Fis* < 0 in most resamples such that H0 of Fis = 0 is rejected, while the presence of one homozygote for the rare allele in the dataset would have provoked much more variation in the bootstrap resamples and thus contains very important information. Furthermore, for large Fis, the probability of sampling heterozygotes is low. The absence of heterozygotes in a sample results in absence of variability in Fis*. The increased probability of lack of heterozygotes in the resamples may explain the high incidence of resamples with Fis* = 1 for Fis = 0.8 (Figure 3).

Power estimation of the bca percentile method
Rousset and Raymond (1995) recently reported the power of several exact tests to test either heterozygote
deficiency or excess, for various allele frequencies, sample sizes, and Fis values. They also report the upper bound of the power under these conditions, applying the Neyman-Pearson lemma. We estimated the power of the bootstrap to test the one-tailed hypotheses in Rousset and Raymond's (1995) table 1 by means of Monte Carlo simulations of size 1000, each with 2500 bootstrap resamples. Table 3 summarizes these results and shows that the power of the bca bootstrap test was very close to the upper bound, which means that it has maximal power. However, the conditions for which Rousset and Raymond (1995) give power estimates are all situations in which the bootstrap works quite well, because the CP is close to 0.95 (Tables 1 and 2). Otherwise, one expects a power higher than the upper bound and an inflated type I error rate. This seemed to be the case for some 2-allele situations.

Bootstrap ANOVA
We performed Monte Carlo simulations of size 500 to estimate the CP of the bootstrap ANOVA. Simulations were performed for sample sizes 10, 20, 50, and 100; allele frequencies q1 = 0.5, 0.7, and 0.9; Fis = 0, 0.5 and 0.8; and the comparison of 2, 5, and 10 loci for equality with B1 = 2500. Table 4 summarizes the results. CPs were generally close to the expected 0.95 NL for large sample sizes, Fis = 0 and q = 0.5. For increased Fis and q, the bootstrap ANOVA became conservative. This effect was more conspicuous as the number of loci that have to be compared increased. We did not repeat all the above simulations including the estimation of VARwithin by the bootstrap because of the high computation times. However, those simulations that were repeated did not improve the CP; on the contrary, the tests were even more conservative (data not shown).
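The simplified bootstrap ANOVA evaluated here (steps e and f of the algorithm dropped, VARwithin set to 1) can be sketched as follows. Several points are assumptions of this sketch rather than details taken from the text: the function names are hypothetical, the z-transformation is omitted, resamples in which Fis is undefined for any group are skipped, and the p-value is computed in the conventional way, as the proportion of null VARbetween values at least as large as the observed one, which is an assumption about the intended direction of the comparison in step h.

```python
import random
from statistics import variance


def fis(genotypes):
    """Fis = (Hs - Hi) / Hs with Nei's (1978) Hs; None if Hs is 0."""
    n = len(genotypes)
    h_obs = sum(a != b for a, b in genotypes) / n
    counts = {}
    for genotype in genotypes:
        for allele in genotype:
            counts[allele] = counts.get(allele, 0) + 1
    sum_f2 = sum((c / (2 * n)) ** 2 for c in counts.values())
    h_exp = 2 * n * (1 - sum_f2) / (2 * n - 1)
    return None if h_exp == 0 else (h_exp - h_obs) / h_exp


def bootstrap_anova(groups, n_boot=2500, rng=random):
    """Approximate p-value for H0: equal Fis in all groups (VARwithin = 1).

    groups: list of at least two genotype lists (subpopulations or loci).
    Returns None if Fis is undefined for a group or in every resample.
    """
    observed = [fis(group) for group in groups]
    if any(value is None for value in observed):
        return None
    # With VARwithin = 1 the F-value reduces to the variance among groups.
    f_observed = variance(observed)
    exceed = usable = 0
    for _ in range(n_boot):
        # Step a/b: resample each group separately and re-estimate Fis.
        boot = [fis(rng.choices(group, k=len(group))) for group in groups]
        if any(value is None for value in boot):
            continue  # assumption: skip resamples with undefined Fis
        # Step c: shift each bootstrap estimate by its observed value (H0).
        shifted = [b - o for b, o in zip(boot, observed)]
        usable += 1
        # Step d: variance among the shifted estimates, compared with F.
        exceed += variance(shifted) >= f_observed
    return exceed / usable if usable else None
```

Two limiting cases show the behaviour: identical groups give an observed variance of 0 and therefore p = 1, whereas a group without heterozygotes compared with a group of heterozygotes only gives p = 0, because neither group shows any bootstrap variation in Fis.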
Discussion
The bootstrap and its limitations
The bootstrap is a general resampling method to estimate the distribution of statistics based on independent, identically distributed observations. As the method is applicable to almost any situation, it is a very appealing technique to estimate the distribution of statistics. The bootstrap may fail, although there has been insufficient basic research to be able to predict when this may occur (Noreen, 1989; Manly, 1991). Van Dongen
and Backeljau (1995) showed that bootstrapping individual genotypes is a useful technique to estimate the distribution of the estimators of the F-statistic Fis. This application opens a wide range of possible hypotheses that can be tested, such that more specific biological and/or genetical problems can be treated with higher accuracy and power. The application of the bootstrap to estimate the distribution of Fis is, however, not reliable in all situations. The reliability of the one-sample bca appeared to depend critically on the expected frequency of homozygotes of the rare allele(s):
a) The absence of these homozygotes in the sample results in virtually no variation in the bootstrap distribution and an incorrect test. Thus, these homozygotes contain 'important' information for the bootstrap test as they provoke most of the variation in the estimated distribution.
b) Low frequencies of the homozygotes for rare allele(s) in the sample result in a bootstrap distribution with multiple peaks, again causing wrong tests. The different peaks in the distribution correspond to resamples with different numbers of homozygotes for the rare allele. These homozygotes disrupt the bootstrap distribution and appear to act as outliers, although they do contain important information.
The adverse effects of the rare homozygotes were more buffered when more alleles and thus more homozygote combinations were present, such that the coverage behaviour of the bca method in multiple allele situations improved. The allele frequency of the most common allele seems to be the best predictor of the behaviour of the bootstrap. If this frequency is relatively low, CPs are close to the expected NL.

Figure 2. Detailed representation of the bootstrap distribution in Figure 1 with sample size = 100, Fis = 0, and q = 0.95. Bars with different fill pattern represent resamples with different numbers of homozygotes for the rare allele.

Figure 3. Bootstrap distributions estimated with 10,000 resamples from 8 different datasets generated under the following conditions: sample size = 20, Fis = 0 and 0.8, and q = 0.5, 0.7, 0.9, and 0.95 for a 2 allele model. Coverage probabilities of the bca percentile method as estimated by Monte Carlo simulations (Table 1) are also given.

Table 3. Power estimations of the one sample bca percentile method

            q1     q2     q3     Fis     size   upper bound   power bootstrap
2 alleles   0.25   0.75   –      0.167   100    0.436         0.493
            0.45   0.55   –      0.125   50     0.156         0.193
            0.45   0.55   –      0.25    50     0.451         0.539
            0.25   0.75   –      0.5     20     0.470         0.500
3 alleles   0.5    0.3    0.2    0.1     100    0.392         0.402
            0.36   0.33   0.33   0.1     100    0.422         0.400
            0.7    0.2    0.1    0.1     100    0.368         0.371
            0.5    0.3    0.2    0.125   50     0.339         0.356
            0.36   0.33   0.3    0.125   50     0.344         0.355
            0.7    0.2    0.1    0.125   50     0.300         0.335
            0.5    0.3    0.3    0.25    20     0.422         0.400
            0.36   0.33   0.3    0.25    20     0.423         0.472
            0.7    0.2    0.1    0.25    20     0.341         0.332
            0.5    0.3    0.2    0.25    50     0.754         0.761
            0.36   0.33   0.3    0.25    50     0.774         0.794
            0.7    0.2    0.1    0.25    50     0.684         0.680
            0.5    0.3    0.2    0.5     20     0.887         0.889
            0.7    0.2    0.1    0.5     20     0.764         0.775

Table 4. Coverage probabilities of the bootstrap ANOVA (columns: number of loci compared and sample size)

              2 loci                     5 loci                     10 loci
Fis    q      10    20    50    100     10    20    50    100     10    20    50    100
0      0.5    0.95  0.96  0.95  0.95    0.94  0.96  0.95  0.95    0.96  0.96  0.97  0.95
0      0.7    0.97  0.96  0.95  0.96    0.99  0.98  0.95  0.95    0.99  0.99  0.96  0.97
0      0.9    0.97  0.98  0.97  0.96    1.00  0.99  0.98  0.96    1.00  0.99  0.98  0.97
0.5    0.5    0.99  0.96  0.96  0.98    1.00  1.00  0.99  0.95    1.00  1.00  0.99  0.96
0.5    0.7    0.99  0.99  0.96  0.98    1.00  1.00  0.99  0.97    1.00  1.00  0.99  0.97
0.5    0.9    1.00  0.98  0.97  1.00    1.00  1.00  1.00  0.99    1.00  1.00  1.00  0.99
0.8    0.5    1.00  1.00  1.00  0.99    1.00  1.00  1.00  0.99    1.00  1.00  1.00  1.00
0.8    0.7    1.00  1.00  1.00  0.99    1.00  1.00  1.00  1.00    1.00  1.00  1.00  1.00
0.8    0.9    1.00  1.00  1.00  0.99    1.00  1.00  1.00  1.00    1.00  1.00  1.00  1.00
The application of the bootstrap to estimate the distribution of Fis
Resampling individual genotypes has been applied to estimate the significance of Fis values (Van Dongen et al., 1996). Although the bootstrap allows testing of a broad range of hypotheses, our simulations show that its use is limited to loci with relatively high genetic variation and to samples of sizes larger than 20. As the performance of the bootstrap critically depends on the often unknown parameters of the population distribution, the examination of the shape of the bootstrap distribution for the presence of multiple peaks and discontinuities is an important tool to evaluate the correctness of the test. With the use of genetic markers with high variability the bootstrap can be expected to be reliable. When the bootstrap is expected to fail, exact tests should be preferred despite the obvious drawback of less generality. In fact, generality is the only reason to prefer the bootstrap over exact methods. Furthermore, exact computations can also be used when Fis values differ among alleles (Rousset & Raymond, 1995), whereas the bootstrap used here is inappropriate for such an application as it estimates the distribution of inbreeding coefficients averaged across alleles. A heterogeneous combination of positive and negative Fis values may result in an average inbreeding coefficient close to zero such that the deviations from HDW equilibrium remain undetected by the bootstrap but not by the exact methods proposed by Rousset and Raymond (1995). The bootstrap can be easily extended to the estimation of the distribution of single allele Fis values which will, when many alleles are present, be prone to high type I error rates if significance levels are not adjusted (see also Van Dongen & Backeljau, 1995). Bootstrapping thus cannot replace exact tests but offers in some situations the ability to test against a broader range of alternative hypotheses.
An important class of hypotheses which so far cannot be tested with exact computations are the multi-sample comparisons. Resampling individual genotypes, for each sample separately, offers a way to estimate the distribution of an 'ANOVA' F-value without assuming normality and/or homoscedasticity (Fisher & Hall, 1990; Efron & Tibshirani, 1993). With increasing allele frequency of the common allele and Fis of the underlying distribution, the type I error probability of the bootstrap ANOVA decreased below the preset α. This may indicate that the test becomes conservative. In order to increase the bootstrap reliability and power, multilocus (or sample) Fis values may be estimated. Resampling individuals with replacement may be used to estimate the distribution of the statistic. The extent to which the reliability of bootstrap tests based on that distribution is influenced by allele frequencies and underlying Fis values remains to be tested.
Acknowledgements
SVD is research assistant at the NFWO (Belgium). This research was supported by FJBR grants 2.0004.91 and 2.0128.94.
References
Bacilieri, R., T. Labbe & A. Kremer, 1994. Intraspecific genetic structure in a mixed population of Quercus petraea (Matt.) Liebl. and Q. robur L. Heredity 73: 130–141.
Bickel, P.J. & A.M. Krieger, 1989. Confidence bands for a distribution function using the bootstrap. J. Am. Stat. Ass. 84: 95–100.
Efron, B., 1979. Bootstrap methods: another look at the jackknife. Ann. Stat. 7: 1–26.
Efron, B., 1987. Better bootstrap confidence intervals. J. Am. Stat. Ass. 82: 171–185.
Efron, B. & R. Tibshirani, 1986. Bootstrap methods for standard errors, confidence intervals, and other measures of statistical accuracy. Stat. Sci. 1: 54–77.
Efron, B. & R. Tibshirani, 1993. An introduction to the bootstrap. London: Chapman and Hall.
Fisher, N.I. & P. Hall, 1990. On bootstrap hypothesis testing. Austr. J. Statist. 32: 177–190.
Hall, P., 1988. Theoretical comparison of bootstrap confidence intervals. Ann. Stat. 16: 927–953.
Hall, P. & S.R. Wilson, 1991. Two guidelines for bootstrap hypothesis testing. Biometrics 47: 757–762.
Hinkley, D.V., 1988. Bootstrap methods. J. R. Statist. Soc. B 50: 321–337.
Lessios, H.A., 1992. Testing electrophoretic data for agreement with Hardy-Weinberg expectations. Mar. Biol. 112: 517–523.
Manly, B.F.J., 1991. Randomization and Monte Carlo methods in biology. London: Chapman and Hall.
Nei, M., 1977. F-statistics and analysis of gene diversity in subdivided populations. Ann. Hum. Genet. 41: 225–233.
Nei, M., 1978. Estimation of average heterozygosity and genetic distance from a small number of individuals. Genetics 89: 583–590.
Noreen, E.W., 1989. Computer-intensive methods for testing hypotheses. New York: John Wiley.
Rousset, F. & M. Raymond, 1995. Testing heterozygote excess and deficiency. Genetics 140: 1413–1419.
Siegel, S. & N.J. Castellan, 1988. Nonparametric statistics for the behavioural sciences. New York: McGraw Hill.
Sokal, R.R. & F.J. Rohlf, 1981. Biometry, 2nd edn. San Francisco: Freeman and Co.
Van Dongen, S., 1995. How should we bootstrap allozyme data. Heredity 74: 445–447.
Van Dongen, S. & T. Backeljau, 1995. One- and two-sample tests for single locus inbreeding coefficients using the bootstrap. Heredity 74: 129–135.
Van Dongen, S., T. Backeljau, E. Matthijsen & A.A. Dhondt, 1996. High gene flow levels and relative strong natural selection in the deme formation process of the winter moth (Operophtera brumata L.) on its primary host (Quercus robur L.). Submitted manuscript.
Weir, B.S., 1990. Genetic data analysis. Massachusetts: Sinauer Ass.
Weir, B.S. & C.C. Cockerham, 1984. Estimating F-statistics for the analysis of population structure. Evolution 38: 1358–1370.
Wright, S., 1951. The genetical structure of populations. Ann. Eugen. 15: 323–354.
Wright, S., 1969. Evolution and genetics of populations, Vol. 2. The theory of gene frequencies. Chicago: University of Chicago Press.