doi:10.1111/j.1420-9101.2010.02093.x
SHORT COMMUNICATION
Comparing three different methods to detect selective loci using dominant markers
A. PÉREZ-FIGUEROA, M. J. GARCÍA-PEREIRA, M. SAURA, E. ROLÁN-ALVAREZ & A. CABALLERO
Departamento de Bioquímica, Genética e Inmunología, Facultad de Biología, Universidad de Vigo, Vigo, Spain
Keywords:
amplified fragment length polymorphisms; candidate loci; FST; genome scan; neutral model; outliers.
Abstract
We carried out a simulation study to compare the efficiency of three alternative programs (DFDIST, DETSELD and BAYESCAN) to detect loci under directional selection from genome-wide scans using dominant markers. We also evaluated the efficiency of applying a multiple-testing correction to those methods that use a classical probability approach. Under a wide range of scenarios, we conclude that BAYESCAN appears to be more efficient than the other methods, usually detecting a high percentage of true selective loci as well as less than 1% of outliers (false positives) under a fully neutral model. In addition, the percentage of outliers detected by this software is always correlated with the true percentage of selective loci in the genome. Our results show, nevertheless, that false positives are common even with a combination of methods and multitest correction, suggesting that conclusions obtained from this approach should be taken with extreme caution.
Correspondence: Andrés Pérez-Figueroa, Departamento de Bioquímica, Genética e Inmunología, Facultad de Biología, Universidad de Vigo, 36200 Vigo, Spain. Tel./fax: +34 986 813828; e-mail: [email protected]
Introduction
One of the key topics in evolutionary biology is to unravel the molecular basis of adaptive changes in order to characterize those parts of the genome subject to natural selection. This objective has obvious applications in different fields, such as biological conservation, animal, plant or microorganism production, or even medical genetics, and it has also recently become a main research focus in molecular ecology and population genomics. Two main strategies have been used to distinguish natural selection from stochastic forces at the genome level (Nielsen, 2005; Holderegger et al., 2008; Volis, 2008): (i) to compare data from different species to detect old signatures of selection and (ii) to use information within the same species to detect recent selective changes. Regarding the second strategy, one of the most common approaches to deal with recent signatures of selection was first introduced by Lewontin & Krakauer (1973) and has since been used under slightly different frameworks (Beaumont & Nichols, 1996; Vitalis et al., 2001; Beaumont & Balding, 2004). The methodology consists of identifying loci (molecular markers) that present population differentiation (FST) coefficients that are ‘distinct’ from those under neutral expectations (such loci are called outlier loci). This strategy has been widely used to detect recent episodes of selection in nonmodel species, where the absence of detailed genomic information does not allow other alternatives.
Amplified fragment length polymorphisms (AFLP) are among the most popular genetic markers used to detect selective loci in genome scans of nonmodel species (Mueller & Wolfenbarger, 1999; Meudt & Clarke, 2007). The advantages of this marker are the combination of comparatively low costs, high reproducibility and information content, and the possibility of analysing many loci scattered over the whole genome (Bensch & Akesson, 2005).
Three software programs are predominantly used to detect outlier loci owing to selection on AFLP and similar dominant markers. The DFDIST software is a modification, for dominant markers, of the software developed by Beaumont & Nichols (1996). Briefly, this program implements a Bayesian method developed by Zhivotovsky (1999) to estimate allelic frequencies from the proportion of recessive phenotypes in the sample. It also estimates the Weir & Cockerham (1984) FST between the subgroups defined
© 2010 THE AUTHORS. J. EVOL. BIOL. JOURNAL COMPILATION © 2010 EUROPEAN SOCIETY FOR EVOLUTIONARY BIOLOGY
in the sample. Coalescent simulations are then performed to generate a null sampling distribution of FST based on neutral expectations. The simulated FST data are used to identify those loci that may not fit the neutral drift simulations given their unusually low or high FST values. DFDIST is the method most frequently applied when using AFLP data (Wilding et al., 2001; Scotti-Saintagne et al., 2004; Acheré et al., 2005; Bonin et al., 2006, 2007; Jump et al., 2006; Mealor & Hild, 2006; Murray & Hare, 2006; Savolainen et al., 2006; Joost et al., 2007; Miller et al., 2007; Papa et al., 2007; Egan et al., 2008; Nosil et al., 2008; Smith et al., 2008; Chen & Yang, 2009; Gagnaire et al., 2009; Galindo et al., 2009; Manel et al., 2009; Meyer et al., 2009; Paris et al., 2010).
A second alternative is DETSELD, an unpublished version of DETSEL (Vitalis et al., 2003) accounting for dominant data (R. Vitalis, personal communication). The basis of this dominant version is the same as that of the codominant one, but allelic frequencies are estimated differently (by the method of Zhivotovsky, 1999). The program relies on a model in which a common ancestral population splits into two populations, which afterwards diverge only by random drift after a possible bottleneck event (Vitalis et al., 2001, 2003). A generic parameter of population divergence, Fi, is then defined for each population i (= 1 or 2) as a function of the divergence time τ and the population size Ni, that is, $F_i \approx 1 - \exp(-\tau / N_i)$. Single-locus estimates of these parameters can be calculated using the nuisance parameters of the model: mutation rate (μ), ancestral population size before the bottleneck (Ne), ancestral population size during the bottleneck (N0) and the number of generations before the bottleneck (τ0).
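As a small numerical illustration of the divergence parameter defined above, the following Python sketch evaluates Fi = 1 − exp(−τ/Ni) for given values. The function name and the example values (100 generations, population sizes of 500 and 50) are hypothetical choices made here for illustration; they are not taken from DETSELD or from the simulations in this study.

```python
import math


def divergence_parameter(tau: float, n_i: float) -> float:
    """Drift-based divergence F_i = 1 - exp(-tau / N_i).

    tau is the divergence time in generations and n_i the size of
    population i (both hypothetical inputs in this sketch).
    """
    if n_i <= 0:
        raise ValueError("population size must be positive")
    if tau < 0:
        raise ValueError("divergence time cannot be negative")
    return 1.0 - math.exp(-tau / n_i)


# Same divergence time, two hypothetical population sizes.
f_large = divergence_parameter(100, 500)  # 1 - exp(-0.2)
f_small = divergence_parameter(100, 50)   # 1 - exp(-2.0)
print(round(f_large, 4), round(f_small, 4))
```

For a fixed divergence time, the smaller population accumulates more drift and therefore a larger Fi, which is why the two populations can carry different divergence parameters in this model.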
A joint distribution of F1 and F2 under neutral expectations is generated using coalescent simulations, and every locus falling outside the resulting confidence envelope (an outlier) can be seen as potentially under selection. DETSELD has sometimes been used for dominant markers (Bonin et al., 2006; Manel et al., 2009; Meyer et al., 2009; Freeland et al., 2010), although the codominant version DETSEL has been more widely used (Vasemägi et al., 2005; Bonhomme et al., 2007; Bryja et al., 2007; Kane & Rieseberg, 2007; Oetjen & Reusch, 2007; O’Malley et al., 2007; Raeymaekers et al., 2007; Tiranti & Negri, 2007; Tsumura et al., 2007; Ridgway et al., 2008; Bitocchi et al., 2009; Blel et al., 2010; Oetjen et al., 2010; Santalla et al., 2010; among others).
A third alternative is BAYESCAN (Foll & Gaggiotti, 2008), which implements a Bayesian method to estimate directly the posterior probability that each locus is subject to selection. This method is an extension of that proposed by Beaumont & Balding (2004), and it is based on a logistic regression model in which each logit value of genetic differentiation FST (i, j) for locus i in population j is decomposed as a linear combination of the coefficients of the logistic regression, αi and βj, corresponding, respectively, to a locus effect and to a population effect. The posterior probability of locus i being under selection
is estimated by defining two alternative models, one that includes αi and another that excludes it. The respective posterior probabilities of these two models are estimated using a reversible jump Markov chain Monte Carlo (RJMCMC) approach. The posterior probability that a locus is subject to selection is then estimated from the output of the RJMCMC by counting the number of times that αi is included in the model. This Bayesian approach takes all loci into account in the analyses through the prior distribution, resolving the problem of multiple testing of a large number of genomic locations. BAYESCAN has been used for codominant (Gaggiotti et al., 2009; Knapen et al., 2009; Medugorac et al., 2009; Nielsen et al., 2009) as well as dominant (Manel et al., 2009; Paris et al., 2010; Parisod & Joost, 2010) DNA markers.
The available software has been shown to solve most of the previous criticisms (Robertson, 1975; Nei & Chakravarti, 1977; Flint et al., 1999) of the original methodology proposed by Lewontin & Krakauer (1973), as suggested by some previous simulations (Beaumont & Nichols, 1996; Vitalis et al., 2001; Foll & Gaggiotti, 2008). To our knowledge, however, there is only one study analysing the efficiency of one of these programs (DFDIST) for a wide range of situations (Caballero et al., 2008). These authors found that the detection of selective loci can be a difficult and risky task. For example, under certain simulated conditions, the highest percentage of outliers was observed under the null (fully neutral) model. Only under extremely favourable conditions (strong selection coefficients, comparatively low levels of neutral gene frequency differentiation and low critical P-values) was the program able to detect outliers in a number proportional to the true percentage of selective loci existing in the genome.
However, even for these favourable circumstances, the outliers detected were often false positives, suggesting that the method should be used with caution. A comparison in efficiency between this method and the other widely used ones (DETSELD and BAYESCAN) has not been made so far. In addition, only a few studies have incorporated a multitest correction for detecting outliers of selection (Eveno et al., 2008; Galindo et al., 2009; Manel et al., 2009; Michalski et al., 2010), and the efficiency of this strategy has not been studied under a wide range of scenarios by simulation. Here, we use simulation data to compare the efficiency of the three alternative programs to detect selective loci from genome-wide scans using dominant markers. We also evaluate the efficiency of correcting for multiple testing in those methods that use a classical probability approach.
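To make the multitest correction concrete, the sketch below implements the step-up false discovery rate procedure of Benjamini & Hochberg (1995), the correction applied in this study to the P-values from DFDIST and DETSELD. It is a minimal stand-alone Python illustration, not the SGoF software used by the authors, and the ten P-values are invented for the example.

```python
from typing import List, Sequence


def benjamini_hochberg(p_values: Sequence[float], q: float = 0.05) -> List[bool]:
    """Flag the tests rejected by the Benjamini-Hochberg step-up procedure.

    Sort the m P-values, find the largest rank k with p_(k) <= k * q / m,
    and reject the k smallest P-values.
    """
    m = len(p_values)
    if m == 0:
        return []
    order = sorted(range(m), key=lambda i: p_values[i])
    k_max = 0
    for rank, idx in enumerate(order, start=1):
        if p_values[idx] <= rank * q / m:
            k_max = rank
    rejected = [False] * m
    for rank, idx in enumerate(order, start=1):
        if rank <= k_max:
            rejected[idx] = True
    return rejected


# Invented P-values for ten loci.
p = [0.001, 0.008, 0.039, 0.041, 0.042, 0.06, 0.074, 0.205, 0.212, 0.216]
uncorrected = [value < 0.05 for value in p]
corrected = benjamini_hochberg(p, q=0.05)
print(sum(uncorrected), sum(corrected))
```

In this invented data set, five loci pass the uncorrected 5% threshold but only two survive the FDR correction, which is the kind of reduction in candidate outliers that the comparison with and without correction is designed to measure.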
Materials and methods
Simulated data and scenarios investigated
The frequencies of neutral and selected dominant markers in a subdivided population were obtained by an analytical procedure following Caballero et al. (2008).
Details of the procedure are also given in Appendix S1. Briefly, a classical island model at equilibrium between migration and drift was used to obtain the frequency of neutral loci in a subdivided population (two subpopulations with N = 500 individuals each were assumed). Allelic frequencies sampled from the theoretical beta distribution were assigned to each subpopulation for 1000 loci. Allele frequencies for selective loci were generated using a transition matrix approach. We assumed that selection acts in only one subpopulation, with fitnesses 1 + s, 1 + s/2 and 1 for the genotypes AA, Aa and aa, respectively. Selection coefficients of loci (s) were sampled from an exponential distribution whose mean was the average selection coefficient of the scenario. The transition matrix considers the effects of selection, drift and migration between subpopulations. Genotypic values of individuals were obtained assuming Hardy–Weinberg proportions, and sample sizes of 40 individuals were assumed for each subpopulation. A range of scenarios was run with different values of neutral gene frequency differentiation (FST for neutral loci equal to 0.025, 0.1 and 0.3), mean selection coefficients (mean s = 0.005, 0.05 and 0.5) and proportions of true selective loci in the genome (0%, 1%, 3%, 5% and 10%). Each scenario was replicated 10 times and analysed with the three programs, averaging results over replicates.
Parameters assumed in the software
In the analysis with the DFDIST software (see http://www.rubic.rdg.ac.uk/~mab/stuff/), the significance level was set at 95%. From the collection of P-values obtained, we optionally applied a multitest correction based on the false discovery rate (FDR) described by Benjamini & Hochberg (1995) and implemented in the SGoF software (Carvajal-Rodriguez et al., 2009). DFDIST was run for every simulated marker data set. The parameter conditions used in the DFDIST analyses were as follows:
1. The critical frequency for the most common allele was 0.99 (loci in which the most frequent allele had a frequency ≥ 0.99 were excluded).
2. The scale for the Zhivotovsky (1999) parameters for estimating allele frequencies was 0.25 (the accuracy of the estimation of frequencies with this parameter value was checked).
3. The number of resamplings used to obtain the confidence intervals for outliers was 10 000 (several runs were performed with 100 000 resamplings, and results did not change).
4. The smoothing proportion used was 0.04.
5. The estimate of average FST to be used in the DFDIST simulations was obtained in two different ways: (i) the average estimated FST calculated by the Ddatacal program (one of the programs in the DFDIST package) and (ii) a trimmed mean FST (also provided by the Ddatacal program) obtained by excluding the 30% highest and the 30% lowest FST values (this trimmed mean FST is supposed to be an estimate of the average ‘neutral’ FST uninfluenced by outlier loci; Bonin et al., 2006).
The second program, DETSELD, was also run for all simulated marker data sets. The significance level was set at 95%, and a multitest correction based on FDR was also applied. Two different models, defined by the nuisance parameters, were run in DETSELD. The first model was based on a range of parameters used empirically by Bonin et al. (2006). Because the number of false positives obtained with this set of parameters was very high (about 20% of outliers with a fully neutral model for a critical P-value of 5%), we obtained a second model, optimized to reduce the number of outliers detected, by an exhaustive search over more than 1000 sets of parameters. As a result of this optimization, we obtained a set of nuisance parameters (Table S1) yielding a minimum of 11% (FST = 0.025) or 5% (FST = 0.1) of outliers when all loci were neutral for a critical P-value of 5%. We simulated 10^7 points for each set of parameters to ensure a correct generation of P-values.
For the third program, BAYESCAN, the estimation of model parameters was automatically tuned on the basis of short pilot runs (10 pilot runs, length 5000), using the default chain parameters given in the program: the sample size was set to 5000 and the thinning interval to 20. The loci were ranked according to their estimated posterior probability. This probability cannot be interpreted directly or compared to P-values (Marden, 2000; Foll & Gaggiotti, 2008). Instead, all loci showing log(Bayes factor) > 2 (P[αi ≠ 0] > 0.99) were retained as outliers, a threshold that provides decisive support for acceptance of the model (Foll & Gaggiotti, 2008).
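The relationship between the posterior probability of a locus effect and the Bayes factor threshold mentioned above can be sketched in a few lines of Python. This is an illustration only and not BAYESCAN code: it assumes that the logarithm is taken in base 10 and that the two models have equal prior odds, and the chain of 5000 retained samples with 4960 inclusions is invented.

```python
import math
from typing import Iterable


def inclusion_probability(included: Iterable[bool]) -> float:
    """Fraction of retained MCMC samples whose model includes the locus effect."""
    flags = list(included)
    if not flags:
        raise ValueError("no MCMC samples supplied")
    return sum(1 for flag in flags if flag) / len(flags)


def log10_bayes_factor(posterior_prob: float, prior_odds: float = 1.0) -> float:
    """log10 of (posterior odds / prior odds) for the model with selection.

    prior_odds = 1 is an assumption of this sketch.
    """
    if posterior_prob >= 1.0:
        return math.inf
    if posterior_prob <= 0.0:
        return -math.inf
    posterior_odds = posterior_prob / (1.0 - posterior_prob)
    return math.log10(posterior_odds / prior_odds)


# Invented chain: the locus effect is in the model in 4960 of 5000 samples.
samples = [True] * 4960 + [False] * 40
prob = inclusion_probability(samples)   # 0.992
log_bf = log10_bayes_factor(prob)       # log10(124), about 2.09
is_outlier = log_bf > 2
print(prob, round(log_bf, 3), is_outlier)
```

Under the equal-prior-odds assumption made here, a posterior probability of exactly 0.99 gives log10(99), about 1.996, so the two thresholds quoted together in the text are close but not strictly identical.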
Comparison between the performances of the methods
Any useful method to detect selective loci should meet at least the following four criteria:
1. The method should detect the lowest percentage (ideally none) of significant outliers under the null model (i.e. when all loci are neutral).
2. There should be a positive relationship between the percentage of outliers detected and the true percentage of selective loci simulated across scenarios.
3. The outliers detected should be typically true selective loci.
4. The method should detect a substantial proportion of true selective loci.
Thus, following Caballero et al. (2008), for each method and scenario, we obtained the average percentage of outliers detected, the average percentage of those outliers that corresponded to selective loci and the average percentage
of truly selective loci detected as outliers. Every locus detected as a candidate for selection by any of the methods was recorded, and the number of such loci rightly or wrongly assigned as being under selection by one method or by a combination of two or three methods was obtained. The correlation between the percentage of loci detected as outliers and the true percentage of selective loci was obtained using Kendall’s tau (τ) nonparametric coefficient (Sokal & Rohlf, 1995), calculated with SPSS for Windows version 17 (SPSS Inc., Chicago, IL, USA).
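The three summary percentages and the rank correlation described in this subsection can be sketched as follows. This is an illustrative Python version written for this text, not the authors' code: the locus identifiers and percentages are invented, and the tau-b variant of Kendall's coefficient is an assumption, since the text does not state which variant SPSS reported.

```python
import math
from typing import Dict, Sequence, Set


def performance(outliers: Set[int], selective: Set[int], n_loci: int) -> Dict[str, float]:
    """Percentages used to compare methods for one simulated data set."""
    def pct(num: int, den: int) -> float:
        return 100.0 * num / den if den else float("nan")

    true_hits = outliers & selective
    return {
        "pct_outliers": pct(len(outliers), n_loci),
        "pct_outliers_selective": pct(len(true_hits), len(outliers)),
        "pct_selective_detected": pct(len(true_hits), len(selective)),
    }


def kendall_tau_b(x: Sequence[float], y: Sequence[float]) -> float:
    """Kendall's tau-b computed from all pairs (quadratic time)."""
    n = len(x)
    concordant = discordant = tied_x = tied_y = 0
    for i in range(n):
        for j in range(i + 1, n):
            dx, dy = x[i] - x[j], y[i] - y[j]
            tied_x += dx == 0
            tied_y += dy == 0
            if dx * dy > 0:
                concordant += 1
            elif dx * dy < 0:
                discordant += 1
    n_pairs = n * (n - 1) // 2
    denom = math.sqrt((n_pairs - tied_x) * (n_pairs - tied_y))
    return (concordant - discordant) / denom if denom else float("nan")


# Invented example: 1000 loci, loci 0-9 truly selective, six outliers called.
stats = performance({0, 1, 2, 3, 500, 501}, set(range(10)), 1000)
# Invented percentages of outliers detected for each true percentage.
tau = kendall_tau_b([0, 1, 3, 5, 10], [0.4, 0.9, 2.1, 2.0, 6.5])
print(stats, tau)
```

The second and third percentages answer different questions: the former asks how trustworthy a reported outlier is, the latter how many of the true selective loci are recovered.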
Results
We assessed whether the three different approaches to detect selective loci behave appropriately under distinct
population genetic differentiation scenarios for neutral loci (FST = 0.025 and 0.1; Figs 1 and 2, respectively). For a scenario assuming a neutral gene frequency differentiation of FST = 0.3, almost no outliers were detected by any of the methods in any scenario. The lowest FST value represents the most favourable framework for detecting selective loci under directional selection. In addition, under each scenario, we compared DFDIST and DETSELD without (Figs 1a–f and 2a–f) and with a multitest (FDR) correction (Figs 1g–l and 2g–l). Note that BAYESCAN already incorporates a multitest adjustment, so the black bars in panels a–f are the same as those in panels g–l. We present results using only the optimized set of parameters for DETSELD (Table S1) and the trimmed correction for DFDIST, as they produced the best results. In addition,
Fig. 1 Percentage of outliers (a, d, g and j), percentage of outliers that correspond to truly selective loci (b, e, h and k) and percentage of selective loci that are detected as outliers (c, f, i and l) by three different software programs (BAYESCAN, DFDIST and DETSELD) with a mean neutral gene frequency differentiation FST = 0.025 and two different average selection coefficients (s). The first two rows are the results with no multitest correction, whereas the last two rows represent the results with false discovery rate correction for DFDIST and DETSELD. Lines over the bars represent one standard error of the mean.
Fig. 2 Percentage of outliers (a, d, g and j), percentage of outliers that correspond to truly selective loci (b, e, h and k) and percentage of selective loci that are detected as outliers (c, f, i and l) by three different software programs (BAYESCAN, DFDIST and DETSELD) with a mean neutral gene frequency differentiation FST = 0.1 and two different average selection coefficients (s). The first two rows are the results with no multitest correction, whereas the last two rows represent the results with false discovery rate correction for DFDIST and DETSELD. Lines over the bars represent one standard error of the mean.
we show results with moderate or high average selection coefficients (s = 0.05 or 0.5). In general, the performance of the methods with the lowest s was substantially worse than with higher s, although the comparative results among methods held. The left column of Fig. 1 (Fig. 1a, d, g, j) represents the percentage of outliers detected by each method under comparatively low mean levels of neutral genetic differentiation (FST = 0.025). Both DFDIST and DETSELD greatly overestimated the percentage of outliers when they were not corrected by a multitest method (Fig. 1a, d). Furthermore, the percentage of outliers detected was the highest (11–14%) under the neutral model (0% selective loci). Obviously this invalidates
these approaches, as all detected outliers could in fact be false positives. However, this problem was partially corrected when a multitest correction was applied (Fig. 1g, j), with DFDIST showing a higher capability to detect outliers than DETSELD. In fact, the relationship between the percentage of outliers detected and the true percentage of selective loci simulated was positive and significant for DFDIST (Fig. 1g, j pooled; Kendall’s τ8 = 0.89, P = 0.001) but nonsignificant for DETSELD (Fig. 1g, j; Kendall’s τ = 0.07, P = 0.780). Interestingly, BAYESCAN detected a proportion of outliers similar to that of DFDIST corrected for multiple testing, also showing a positive relationship between that proportion and the true percentage of selective loci (Fig. 1g, j; Kendall’s
τ8 = 0.91, P < 0.001). The percentage of selective loci found in the outliers (Fig. 1b, e, h, k) was moderate when a low percentage (1%) of selective loci was simulated but typically high when at least 3% of the loci were selective (Fig. 1h, k). The percentage of truly selective loci detected (Fig. 1c, f, i, l) was comparatively high with the highest average selection coefficient (s = 0.5), but substantially smaller with the moderate average selection coefficient (s = 0.05).
The same trend is shown in Fig. 2 for a less favourable scenario (FST = 0.1). Briefly, DFDIST and DETSELD were still inefficient without any multitest correction, detecting 5–7% of outliers irrespective of the true percentage of selective loci, whereas BAYESCAN detected less than 0.5% under the null model and an increasing proportion of outliers with increasing proportions of true selective loci. DFDIST and, particularly, DETSELD showed a rather small percentage of selective loci found in outliers without multitest correction. Again, DETSELD detected a very low percentage of outliers after correcting for multiple testing, whereas DFDIST and BAYESCAN behaved better. Neither DETSELD (Fig. 2g, j; Kendall’s τ8 = 0.28, P = 0.320) nor DFDIST (Fig. 2g, j; Kendall’s τ8 = −0.27, P = 0.310) showed a significant relationship between the percentage of outliers detected and the true percentage of selective loci, suggesting a low efficiency as estimators of the percentage of selective loci in the genome. BAYESCAN, however, performed much better, showing a positive relationship between the percentage of outliers detected and the true percentage of selective loci (Fig. 2g, j; Kendall’s τ8 = 0.94, P < 0.001), and a comparatively high percentage of selective loci in outliers (Fig. 2h, k). The percentage of selective loci detected was low (Fig.
2i, l), particularly so for DFDIST and DETSELD for a high proportion of true selective loci (> 5%).
Figure 3 represents the number of loci detected as outliers by the different methods (DETSELD and DFDIST using multitest correction) with a neutral gene frequency differentiation of FST = 0.025 and for a case with no selective loci (s = 0) or 1% of selective loci in the genome with average effect s = 0.05 or 0.5. In the neutral case, the three methods detected an average of 1.1 false outlier loci in common, but a further 10.9 were detected by both DETSELD and DFDIST, 6.7 by DETSELD and 4.7 by DFDIST. The three methods detected almost the same true selective loci, on average 3.9 and 10.6 for moderate (s = 0.05) and strong (s = 0.5) average selection coefficients, respectively. A few additional true selective loci were detected by one or two methods. However, most loci detected exclusively by DFDIST were false positives.
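The coincidence counts summarized in Fig. 3 amount to partitioning the loci by the exact combination of programs that flagged them and then splitting each group into true and false outliers. A minimal Python sketch of that bookkeeping is given below; the locus identifiers and the outlier calls are invented for illustration and do not reproduce the simulated data.

```python
from typing import Dict, FrozenSet, Set, Tuple


def venn_regions(calls: Dict[str, Set[int]]) -> Dict[FrozenSet[str], Set[int]]:
    """Group loci by the exact set of methods that flagged them as outliers."""
    regions: Dict[FrozenSet[str], Set[int]] = {}
    for locus in set().union(*calls.values()):
        key = frozenset(name for name, loci in calls.items() if locus in loci)
        regions.setdefault(key, set()).add(locus)
    return regions


def split_true_false(
    regions: Dict[FrozenSet[str], Set[int]], selective: Set[int]
) -> Dict[FrozenSet[str], Tuple[int, int]]:
    """Count (true outliers, false outliers) within each region."""
    return {
        key: (len(loci & selective), len(loci - selective))
        for key, loci in regions.items()
    }


# Invented outlier calls; loci 1-3 are the truly selective ones.
calls = {
    "BAYESCAN": {1, 2, 3},
    "DFDIST": {1, 2, 4, 5},
    "DETSELD": {1, 2, 5, 6},
}
regions = venn_regions(calls)
counts = split_true_false(regions, selective={1, 2, 3})
for methods, (n_true, n_false) in sorted(counts.items(), key=lambda kv: sorted(kv[0])):
    print(sorted(methods), n_true, n_false)
```

In this invented example, locus 5 is a false positive shared by DFDIST and DETSELD, which mirrors the point that agreement between two methods does not by itself rule out a false outlier.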
Discussion
For all the scenarios analysed, we can conclude that BAYESCAN appears to be more efficient than the other methods. BAYESCAN usually detected a higher percentage of outliers than the other methods after multitest
Fig. 3 Graphic representation of the mean coincidence of true (bold) and false (italic) outliers detected by the three programs (BAYESCAN, DFDIST and DETSELD; the last two using multitest false discovery rate correction) under different conditions (neutral case, 1% of selective loci with s = 0.05 and 0.5). The neutral gene frequency differentiation is FST = 0.025.
correction. This percentage of outliers was always correlated with the true percentage of selective loci in the genome. BAYESCAN detected less than 1% of outliers under the neutral model and showed the highest percentage (or one as high as the other alternatives) of selective loci in outliers and of selective loci detected. BAYESCAN showed another useful property, namely the automatic optimization of the parameter conditions used in simulations. DFDIST and DETSELD with multitest correction were also shown to be efficient, at least for the most favourable scenarios, but they seemed to fail in a number of situations, particularly the latter. One possible reason behind the poor performance of DETSELD may be that it assumes a divergence model between subpopulations with no migration, rather than an island model (assumed by DFDIST, BAYESCAN and the simulations).
The consequences of these results are straightforward. Most genome-scan studies in nonmodel species for dominant markers have used DFDIST or DETSELD (see references in the Introduction section). Moreover, most of them have applied the methods without any multitest correction (exceptions are Eveno et al., 2008; Galindo et al., 2009; Manel et al., 2009; Michalski et al., 2010). Therefore, our results show that an unknown number of outliers detected with DFDIST or DETSELD without multitest correction could be false positives, suggesting the need for a re-evaluation of results to confirm their conclusions. In some studies, however, the outliers detected were confirmed by a complementary argument, such as repeated detection under replication or pseudo-replication (Wilding et al., 2001) or a comparison of results between a priori adaptive cases and controls (Miller et al., 2007; Nosil et al., 2008; Galindo et al., 2009; Manel et al., 2009).
A number of studies have used a combination of DFDIST and DETSELD (Bonin et al., 2006; Meyer et al., 2009), of DFDIST and BAYESCAN for dominant markers (Paris et al., 2010) or of FDIST2 and DETSEL for codominant markers (Vasemägi et al., 2005; Oetjen & Reusch, 2007; Tsumura et al., 2007; Oetjen et al., 2010; Santalla et al., 2010) to minimize the detection of false positives. However, our results show that even this strategy could be inefficient for fully neutral or low selection scenarios, as the DFDIST and DETSELD methods, for example, could detect the same false outlier loci (see Fig. 3).
We have focused on the three most widely used methods for genome scans to identify candidate dominant loci for selection, although other methods and software have also been proposed. For example, the program SAM (spatial analysis method; Joost et al., 2008) performs a spatial analysis (Joost et al., 2007) with the help of geographic and environmental information. However, this approach only permits identifying molecular markers associated with environmental variables and requires a combination with one of the above programs to differentiate the type of selection. Another example is the
program WINKLES (Wilding et al., 2001), which is based on the same principle as DFDIST, but in which the null distribution of genetic differentiation is conditional on allele frequency instead of on heterozygosity. BAYESFST (Beaumont & Balding, 2004) allows for Bayesian estimation of FST, and it is the basis from which BAYESCAN was extended (Foll & Gaggiotti, 2008). More recently, ARLEQUIN 3.5.1 has implemented a new hierarchical test of selection (Excoffier et al., 2009), as an extension of FDIST, that highlights the need for a good understanding of the population genetic structure of the studied organism to accurately identify loci with unusual levels of differentiation.
In our analysis, we have assumed that the selective loci are subject to directional selection in one of the subpopulations. This implies that the most favourable situation for detecting outliers is that of a low neutral FST, as selective loci would tend to show exceedingly high FST values. In fact, the results show that for FST = 0.1, the efficiency of the methods is much lower than for FST = 0.025 (cf. Figs 1 and 2), and for FST = 0.3, the methods do not work at all. Under balancing selection, in contrast, a high neutral FST would be more favourable for detecting selective loci.
All these methods for genome scans assume independence among loci (Foll & Gaggiotti, 2008), which is an unrealistic assumption when using a large number of markers. The effect of linkage disequilibrium on these methods is not clear. An exhaustive simulation study, including different degrees of linkage disequilibrium, would be needed to address this issue. There are also further factors that would presumably influence the ability of the methods to detect loci under selection. Some demographic scenarios, for example, would lead to population differentiation being erroneously attributed to selection.
In general, it can be assumed that the detection of selection would be more difficult in the presence of such factors. The simple scenario we assumed, in which no demographic complexities are involved, can therefore be considered a favourable situation for the performance of the methods. More complex scenarios would require further detailed simulations.
Our analysis referred to dominant markers, such as AFLPs. Under the conditions simulated, we expect that our results can also be extrapolated to codominant markers. The main difference between the dominant and codominant versions of the software is that the dominant versions must include a prior estimation of allelic frequencies. Because, in our study, we ran simple scenarios under Hardy–Weinberg equilibrium, the estimated allelic frequencies should be unbiased. In fact, this was checked previously for the DFDIST software (Caballero et al., 2008). It is possible that deviations from Hardy–Weinberg equilibrium may also reduce the efficiency of the methods using dominant markers.
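The dependence of dominant-marker allele frequency estimates on Hardy–Weinberg proportions can be illustrated with a short Python sketch. It uses the simple square-root estimator of the recessive (band-absent) allele frequency, which is a deliberately simpler stand-in for the Bayesian estimator of Zhivotovsky (1999) used by the programs; the sample counts and the inbreeding coefficient are hypothetical.

```python
import math


def band_absent_frequency(q: float, f_is: float = 0.0) -> float:
    """Expected frequency of the band-absent (recessive homozygote) phenotype.

    q is the recessive allele frequency and f_is the inbreeding
    coefficient; f_is = 0 gives the Hardy-Weinberg value q**2.
    """
    return q * q + f_is * q * (1.0 - q)


def sqrt_estimate(n_absent: int, n_sampled: int) -> float:
    """Square-root estimate of q from band-absent counts, assuming HWE."""
    if n_sampled <= 0 or not 0 <= n_absent <= n_sampled:
        raise ValueError("counts must satisfy 0 <= n_absent <= n_sampled, n_sampled > 0")
    return math.sqrt(n_absent / n_sampled)


# Hypothetical sample of 40 individuals, 10 of them lacking the band.
q_hat = sqrt_estimate(10, 40)  # sqrt(0.25) = 0.5

# Same true q = 0.5 but with a hypothetical inbreeding coefficient of 0.2:
# more recessive homozygotes are expected, so assuming HWE overestimates q.
x_inbred = band_absent_frequency(0.5, f_is=0.2)  # 0.25 + 0.05 = 0.30
q_naive = math.sqrt(x_inbred)                    # about 0.548
print(q_hat, round(x_inbred, 3), round(q_naive, 3))
```

The second calculation shows how an excess of homozygotes inflates the estimated frequency of the recessive allele when Hardy–Weinberg proportions are wrongly assumed, which is one way in which such deviations could affect the downstream FST estimates.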
In conclusion, from our results, we offer the following recommendations for designing a study to detect loci under directional selection using any of the software analysed:
1. Do not start the experiment if the populations compared show a mean gene frequency differentiation (FST) larger than about 0.2.
2. Use BAYESCAN preferentially, or alternatively DFDIST or DETSELD with a multitest correction.
3. Use experimental controls: compare detection of outliers in situations expecting positive and negative results. Alternatively, provide independent experimental replication.
4. Be rather cautious when the percentage of outliers observed after multitest correction falls below 1%.
Acknowledgments
The authors thank N. Santamaría for her technical assistance and two anonymous referees for helpful comments. This work was supported by grants from Ministerio de Ciencia e Innovación y Fondos Feder (CGL2008-00135/BOS; CGL2009-13278-C02) and Xunta de Galicia (IN825B 2009/6-0). R. Vitalis provided source code and helpful input for the DETSELD software. A. P.-F. was supported by an Ángeles Alvariño fellowship from Xunta de Galicia (Spain). M.J. G-P was supported by a María Barbeito fellowship from Xunta de Galicia (Spain).
References
Acheré, V., Favre, J.M., Besnard, G. & Jeandroz, S. 2005. Genomic organization of molecular differentiation in Norway spruce (Picea abies). Mol. Ecol. 14: 3191–3201.
Beaumont, M.A. & Balding, D.J. 2004. Identifying adaptive genetic divergence among populations from genome scans. Mol. Ecol. 13: 969–980.
Beaumont, M.A. & Nichols, R.A. 1996. Evaluating loci for use in the genetic analysis of population structure. Proc. Biol. Sci. 263: 1619–1626.
Benjamini, Y. & Hochberg, Y. 1995. Controlling the false discovery rate: a practical and powerful approach to multiple testing. J. R. Stat. Soc. Ser. B 57: 289–300.
Bensch, S. & Akesson, M. 2005. Ten years of AFLP in ecology and evolution: why so few animals? Mol. Ecol. 14: 2899–2914.
Bitocchi, E., Nanni, L., Rossi, M., Bellucci, E., Giardini, E., Buonamici, A., Vendramin, G.G. & Papa, R. 2009. Introgression from modern hybrid varieties into landrace populations of maize (Zea mays ssp. mays L.) in central Italy. Mol. Ecol. 18: 603–621.
Blel, H., Panfili, J., Guinand, B., Berrebi, P., Said, K. & Durand, J.-D. 2010. Selection footprint at the first intron of the Prl gene in natural populations of the flathead mullet (Mugil cephalus, L. 1758). J. Exp. Mar. Biol. Ecol. 387: 60–67.
Bonhomme, M., Blancher, A., Jalil, M.F. & Crouau-Roy, B. 2007. Factors shaping genetic variation in the MHC of natural non-human primate populations. Tissue Antigens 70: 398–411.
Bonin, A., Taberlet, P., Miaud, C. & Pompanon, F. 2006. Explorative genome scan to detect candidate loci for adaptation along a gradient of altitude in the common frog (Rana temporaria). Mol. Biol. Evol. 23: 773–783.
Bonin, A., Nicole, F., Pompanon, F., Miaud, C. & Taberlet, P. 2007. Population adaptive index: a new method to help measure intraspecific genetic diversity and prioritize populations for conservation. Conserv. Biol. 21: 697–708.
Bryja, J., Charbonnel, N., Berthier, K., Galan, M. & Cosson, J.-F. 2007. Density-related changes in selection pattern for major histocompatibility complex genes in fluctuating populations of voles. Mol. Ecol. 16: 5084–5097.
Caballero, A., Quesada, H. & Rolán-Álvarez, E. 2008. Impact of amplified fragment length polymorphism size homoplasy on the estimation of population genetic diversity and the detection of selective loci. Genetics 179: 539–554.
Carvajal-Rodriguez, A., de Uña-Alvarez, J. & Rolan-Alvarez, E. 2009. A new multitest correction (SGoF) that increases its statistical power when increasing the number of tests. BMC Bioinformatics 10: 209. See http://webs.uvigo.es/acraaj/SGoF.htm
Chen, L. & Yang, G. 2009. A genomic scanning using AFLP to detect candidate loci under selection in the finless porpoise (Neophocaena phocaenoides). Genes Genet. Syst. 84: 307–313.
Egan, S.P., Nosil, P. & Funk, D. 2008. Selection and genomic differentiation during ecological speciation: isolating the contributions of host association via comparative genome scan of Neochlamisus bebbianae leaf beetles. Evolution 62: 1162–1181.
Eveno, E., Collada, C., Guevara, M.A., Leger, V., Soto, A., Diaz, L., Leger, P., Gonzalez-Martinez, S.C., Cervera, M.T., Plomion, C. & Garnier-Gere, P.H. 2008. Contrasting patterns of selection at Pinus pinaster Ait. drought stress candidate genes as revealed by genetic differentiation analyses. Mol. Biol. Evol. 25: 417–437.
Excoffier, L., Hofer, T. & Foll, M. 2009. Detecting loci under selection in a hierarchically structured population. Heredity 103: 285–298.
Flint, J., Bond, J., Rees, D.C., Boyce, A.J., Roberts-Thomson, J.M., Excoffier, L., Clegg, M.A., Beaumont, M.A., Nichols, R.A. & Harding, R.M. 1999. Minisatellite mutational processes reduce Fst estimates. Hum. Genet. 105: 567–576.
Foll, M. & Gaggiotti, O. 2008. A genome-scan method to identify selected loci appropriate for both dominant and codominant markers: a Bayesian perspective. Genetics 180: 977–995.
Freeland, J.R., Biss, P., Conrad, K.F. & Silvertown, J. 2010. Selection pressures have caused genome-wide population differentiation of Anthoxanthum odoratum despite the potential for high gene flow. J. Evol. Biol. 23: 776.
Gaggiotti, O., Bekkevold, D., Jorgensen, H.B.H., Foll, M., Carvalho, G.R., Andre, C. & Ruzzante, D.E. 2009. Disentangling the effects of evolutionary, demographic, and environmental factors influencing genetic structure of natural populations: Atlantic herring as a case study. Evolution 63: 2939–2951.
Gagnaire, P.A., Albert, V., Jónsson, B. & Bernatchez, L. 2009. Natural selection influences AFLP intraspecific genetic variability and introgression patterns in Atlantic eels. Mol. Ecol. 18: 1678–1691.
Galindo, J., Morán, P. & Rolán-Álvarez, E. 2009. Comparing geographical genetic differentiation between candidate and noncandidate loci for adaptation strengthens support for parallel ecological divergence in the marine snail Littorina saxatilis. Mol. Ecol. 18: 919–930.
Holderegger, R., Herrmann, D., Poncet, B., Gugerli, F., Thuiller, W., Taberlet, P., Gielly, L., Rioux, D., Brodbeck, S., Aubert, S. & Manel, S. 2008. Land ahead: using genome scans to identify molecular markers of adaptive relevance. Plant Ecol. Divers. 1: 273–283.
Joost, S., Bonin, A., Bruford, M.W., Després, S.L., Conord, C., Erhardt, G. & Taberlet, P. 2007. A spatial analysis method (SAM) to detect candidate loci for selection: towards a landscape genomics approach to adaptation. Mol. Ecol. 16: 3955–3969.
Joost, S., Kalbermatten, M. & Bonin, A. 2008. Spatial analysis method (SAM): a software tool combining molecular and environmental data to identify candidate loci for selection. Mol. Ecol. Resour. 8: 957–960.
Jump, A.S., Hunt, J.M., Martínez-Izquierdo, J.A. & Peñuelas, J. 2006. Natural selection and climate change: temperature-linked spatial and temporal trends in gene frequency in Fagus sylvatica. Mol. Ecol. 15: 3469–3480.
Kane, N.C. & Rieseberg, L. 2007. Selective sweeps reveal candidate genes for adaptation to drought and salt tolerance in common sunflower, Helianthus annuus. Genetics 175: 1823–1834.
Knapen, D., De Wolf, H., Knaepkens, G., Bervoets, L., Eens, M., Blust, R. & Verheyen, E. 2009. Historical metal pollution in natural gudgeon populations: inferences from allozyme, microsatellite and condition factor analysis. Aquat. Toxicol. 95: 17–26.
Lewontin, R. & Krakauer, J. 1973. Distribution of gene frequency as a test of theory of selective neutrality of polymorphisms. Genetics 74: 175–195.
Manel, S., Conord, C. & Després, L. 2009. Genome scan to assess the respective role of host-plant and environmental constraints on the adaptation of a widespread insect. BMC Evol. Biol. 9: 288–297.
Marden, J.I. 2000. Hypothesis testing: from p values to Bayes factors. J. Am. Stat. Assoc. 95: 1316–1320.
Mealor, B.A. & Hild, A. 2006. Potential selection in native grass populations by exotic invasion. Mol. Ecol. 15: 2291–2300.
Medugorac, I., Medugorac, A., Russ, I., Veit-Kensch, C.E., Taberlet, P., Luntz, B., Mix, H.M. & Förster, M. 2009. Genetic diversity of European cattle breeds highlights the conservation value of traditional unselected breeds with high effective population size. Mol. Ecol. 18: 3394–3410.
Meudt, H.M. & Clarke, A.C. 2007. Almost forgotten or latest practice? AFLP applications, analyses and advances. Trends Plant Sci. 12: 106–117.
Meyer, C.-L., Vitalis, R., Saumitou-Laprade, P. & Castric, C. 2009. Genomic pattern of adaptive divergence in Arabidopsis halleri, a model species for tolerance to heavy metal. Mol. Ecol. 18: 2050–2062.
Michalski, S., Durka, W., Jentsch, A., Kreyling, J., Pompe, S., Schweiger, O., Willner, E. & Beierkuhnlein, C. 2010. Evidence for genetic differentiation and divergent selection in an autotetraploid forage grass (Arrhenatherum elatius). Theor. Appl. Genet. 120: 1151–1162.
Miller, N.J., Ciosi, M., Sappington, T.W., Ratcliffe, S.T., Spencer, J.L. & Guillemaud, T. 2007. Genome scan of Diabrotica virgifera virgifera for genetic variation associated with crop rotation tolerance. J. Appl. Entomol. 131: 378–385.
Mueller, U.G. & Wolfenbarger, L. 1999. AFLP genotyping and fingerprinting. Trends Ecol. Evol. 14: 389–394.
Murray, M.C. & Hare, M.P. 2006. A genomic scan for divergent selection in a secondary contact zone between Atlantic and Gulf of Mexico oysters, Crassostrea virginica. Mol. Ecol. 15: 4229–4242.
Nei, M. & Chakravarti, A. 1977. Drift variances of Fst and Gst statistics obtained from a finite number of isolated populations. Theor. Popul. Biol. 11: 307–325.
Nielsen, R. 2005. Molecular signatures of natural selection. Annu. Rev. Genet. 39: 197–218.
Nielsen, E.E., Hemmer-Hansen, J., Poulsen, N.A., Loeschcke, V., Moen, T., Johansen, T., Mittelholzer, C., Taranger, G.-L., Ogden, R. & Carvalho, G.R. 2009. Genomic signatures of local directional selection in a high gene flow marine organism: the Atlantic cod (Gadus morhua). BMC Evol. Biol. 9: 276–286.
Nosil, P., Egan, S.P. & Funk, D. 2008. Heterogeneous genomic differentiation between walking-stick ecotypes: “isolation by adaptation” and multiple roles for divergent selection. Evolution 62: 316–336.
O’Malley, K.G., Camara, M.D. & Banks, M. 2007. Candidate loci reveal genetic differentiation between temporally divergent migratory runs of Chinook salmon (Oncorhynchus tshawytscha). Mol. Ecol. 16: 4930–4941.
Oetjen, K. & Reusch, T.B.H. 2007. Genome scans detect consistent divergent selection among subtidal vs. intertidal populations of the marine angiosperm Zostera marina. Mol. Ecol. 16: 5156–5157.
Oetjen, K., Ferber, S., Dankert, I. & Reusch, T.B.H. 2010. New evidence for habitat-specific selection in Wadden Sea Zostera marina populations revealed by genome scanning using SNP and microsatellite markers. Mar. Biol. 157: 81–89.
Papa, R., Bellucci, E., Rossi, M., Leonardi, S., Rau, D., Gepts, P., Nanni, L. & Attene, G. 2007. Tagging the signatures of domestication in common bean (Phaseolus vulgaris) by means of pooled DNA samples. Ann. Bot. London 100: 1039–1051.
Paris, M., Boyer, S., Bonin, A., Collado, A., David, J. & Despres, L. 2010. Genome scan in the mosquito Aedes rusticus: population structure and detection of positive selection after insecticide treatment. Mol. Ecol. 19: 325–337.
Parisod, C. & Joost, S. 2010. Divergent selection in trailing- versus leading-edge populations of Biscutella laevigata. Ann. Bot. London 105: 655–660.
Raeymaekers, J.A.M., Houdt, J.K.J.V., Larmuseau, M.H.D., Geldof, S. & Volckaert, F.A.M. 2007. Divergent selection as revealed by PST and QTL-based FST in three-spined stickleback (Gasterosteus aculeatus) populations along a coastal-inland gradient. Mol. Ecol. 16: 891–905.
Ridgway, T., Riginos, C., Davis, J. & Hoegh-Guldberg, O. 2008. Genetic connectivity patterns of Pocillopora verrucosa in southern African Marine Protected Areas. Mar. Ecol. Prog. Ser. 354: 161–168.
Robertson, A. 1975. Gene frequency distributions as a test of selective neutrality. Genetics 81: 775–785.
Santalla, M., Ron, A.M. & La Fuente, M. 2010. Integration of genome and phenotypic scanning gives evidence of genetic structure in Mesoamerican common bean (Phaseolus vulgaris L.) landraces from the southwest of Europe. Theor. Appl. Genet. 120: 1635–1651.
Savolainen, V., Anstett, M.-C., Lexer, C., Hutton, I., Clarkson, J.J., Norup, M.V., Powell, M.P., Springate, D., Salamin, N. & Baker, W. 2006. Sympatric speciation in palms on an oceanic island. Nat. Lett. 411: 210–213.
Scotti-Saintagne, C., Mariette, S., Porth, I., Goicoechea, P.G., Barreneche, T., Bodénès, C., Burg, K. & Kremer, A. 2004. Genome scanning for interspecific differentiation between two closely related oak species [Quercus robur L. and Q. petraea (Matt.) Liebl.]. Genetics 168: 1615–1626.
Smith, T.B., Milá, B., Grether, G.F., Slabbekoorn, H., Sepil, I., Buermann, W., Saatchi, S. & Pollinger, J.P. 2008. Evolutionary consequences of human disturbance in a rainforest bird species from Central Africa. Mol. Ecol. 17: 58–71.
Sokal, R.R. & Rohlf, F.J. 1995. Biometry. Freeman & Co., New York.
Tiranti, B. & Negri, V. 2007. Selective microenvironmental effects play a role in shaping genetic diversity and structure in a Phaseolus vulgaris L. landrace: implications for on-farm conservation. Mol. Ecol. 16: 4942–4955.
Tsumura, Y., Kado, T., Takahashi, T., Tani, N., Ujino-Ihara, T. & Iwata, H. 2007. Genome scan to detect genetic structure and adaptive genes of natural populations of Cryptomeria japonica. Genetics 176: 2393–2403.
Vasemägi, A., Nilsson, J. & Primmer, C.R. 2005. Expressed sequence tag-linked microsatellites as a source of gene-associated polymorphisms for detecting signatures of divergent selection in Atlantic salmon (Salmo salar L.). Mol. Biol. Evol. 22: 1067–1076.
Vitalis, R., Dawson, K. & Boursot, P. 2001. Interpretation of variation across marker loci as evidence of selection. Genetics 158: 1811–1823.
Vitalis, R., Dawson, K., Boursot, P. & Belkhir, K. 2003. DetSel 1.0: a computer program to detect markers responding to selection. J. Hered. 94: 429–431.
Volis, S. 2008. Detection of signatures of positive selection in naturally occurring genetic variation. In: Population Genetics Research Progress (V.T. Koven, ed), pp. 279–310. Nova Science Publishers, Hauppauge, NY, USA.
Weir, B.S. & Cockerham, C.C. 1984. Estimating F-statistics for the analysis of population structure. Evolution 38: 1358–1370.
Wilding, C.S., Butlin, R.K. & Grahame, J. 2001. Differential gene exchange between parapatric morphs of Littorina saxatilis detected using AFLP markers. J. Evol. Biol. 14: 611–619.
Zhivotovsky, L.A. 1999. Estimating population structure in diploids with multilocus dominant DNA markers. Mol. Ecol. 8: 907–913.
Supporting information
Additional Supporting Information may be found in the online version of this article:
Appendix S1 Simulated allele frequencies for neutral and selective loci.
Table S1 Nuisance parameters used for DETSELD.
As a service to our authors and readers, this journal provides supporting information supplied by the authors. Such materials are peer-reviewed and may be re-organized for online delivery, but are not copy-edited or typeset. Technical support issues arising from supporting information (other than missing files) should be addressed to the authors.
Received 16 June 2010; revised 23 July 2010; accepted 29 July 2010