Tandem duplications and the limits of natural selection in ... - CiteSeerX

0 downloads 0 Views 3MB Size Report
May 2, 2014 - Tandem duplications and the limits of natural selection in Drosophila .... Convergent evolution is regarded as the ultimate signal of natural selection: if the ..... predator-prey coevolution, and sexual conflict [48, 49, 50, 51].
arXiv:1405.0518v1 [q-bio.PE] 2 May 2014

Tandem duplications and the limits of natural selection in Drosophila yakuba and Drosophila simulans Rebekah L. Rogers1 , Julie M. Cridland1,2 , Ling Shao1 , Tina T. Hu3 , Peter Andolfatto3 , and Kevin R. Thornton1

Research Article Biological Sciences, Poulation Biology

1) Ecology and Evolutionary Biology, University of California, Irvine 2) Ecology and Evolutionary Biology, University of California, Davis 3) Ecology and Evolutionary Biology and the Lewis Sigler Institute for Integrative Genomics, Princeton University Running head: Tandem duplications in non-model Drosophila Key words: Gene duplications, Drosophila yakuba, Drosophila simulans, evolutionary novelty, population genomics, parallel adaptation, convergent evolution, mutation limited evolution, rapid evolution, Red Queen dynamics Corresponding author: Rebekah L. Rogers, Dept. of Ecology and Evolutionary Biology, 5323 McGaugh Hall, University of California, Irvine, CA 92697 Phone: 949-824-0614 Fax: 949-824-2181 Email: [email protected]

Significance Tandem duplications are an essential source of genetic novelty that is useful in adaptation and the development of novel traits, and their prevalence in populations will influence the arc of evolutionary trajectories. We survey standing variation for tandem duplications in D. yakuba and D. simulans, and find that they are associated with rapidly evolving phenotypes and Red Queen dynamics with parallel selection in both species. However, we show that low mutation rates and detrimental impacts leads to only a limited span of standing variation in populations, leading to low levels of convergence at the gene level. Thus evolutionary trajectories dependent on duplications are unlikely to be reproducible across taxa even in the face of strong selective pressures.

2

Abstract Tandem duplications are an essential source of genetic novelty, and their prevalence in natural populations is expected to influence the trajectory of adaptive walks.

Here,

we describe evolutionary impacts of recently-derived, segregating tandem duplications in Drosophila yakuba and Drosophila simulans. We observe an excess of duplicated genes involved in defense against pathogens, insecticide resistance, chorion development, cuticular peptides, and lipases or endopeptidases associate d with the accessory glands, suggesting that duplications function in Red Queen dynamics and rapid evolution. We observe evidence of widespread selection on the D. simulans X, suggesting adaptation through duplication is common on the X. Though we find many high frequency variants, duplicates display an excess of low frequency variants consistent with largely detrimental impacts, limiting the variation that can effectively facilitate adaptation. Although we observe hundreds of gene duplications, we show that segregating variation is insufficient to provide duplicate copies of the entire genome, and the number of duplications in the population spans 13.4% of major chromosome arms in D. yakuba and 9.7% in D. simulans. Whole gene duplication rates are low at 1.1 × 10−9 in D. yakuba and 6.1 × 10−9 in D. simulans, suggesting long wait times for new mutations. Hence, if adaptive processes are dependent on individual duplications, evolution will be severely limited by mutation. Hence, parallel recruitment of the same duplicated gene in different species will be rare and standing variation will define evolutionary outcomes, in spite of convergence across rapidly evolving phenotypes.

3

4

Introduction Tandem duplications are an essential source of genetic novelty that is useful for the development of novel traits [1, 2] and their prevalence in populations is therefore expected to influence the arc of evolutionary trajectories. The observed landscape of tandem duplications in Drosophila spans only a few percent of the genome [3, 4, 5, 6], and it is unclear to what extent duplications, whether from newly arising or from standing variation, can provide a sufficient source of adaptive genetic variation. Tandem duplications produce a variety of novel gene structures including chimeric genes, recruited non-coding sequence, dual promoter genes, and whole gene duplications [3, 7, 8]. Whole gene duplications are often cited as an essential source of evolutionary innovation [2], and chimeric genes are more likely still to be involved in selective sweeps, produce regulatory changes, and effect changes in cellular targeting [9]. However, whole gene duplications form at low rates with even lower chances of formation for alternative variants [10, 3, 7]. They are therefore even more likely to be limited by mutation. If population-level mutation rates are sufficiently large, new mutations will accumulate quickly and adaptation is expected to proceed rapidly [11]. However, if population-level 5

mutation rates are low, then there will be long wait times until the next new mutation and evolutionary trajectories are likely to stall at suboptimal solutions during the mutational lag [11]. Drosophila have large population sizes in comparison to other multicellular eukaryotes with Ne ≈ 106 [12, 13, 14] and absolute numbers of individuals large enough to provide large numbers of SNPs at many sites every generation [15]. However, the prevalence of other types of mutations beyond SNPs has not been systematically surveyed. If the supply of tandem duplications is limited by mutation, we expect to see suboptimal outcomes in adaptive walks, limited ability to adapt to changing environments, and low rates of evolution through parallel genetic mechanisms in different species. The Drosophila offer an excellent model system for population genomics, allowing for a whole genome survey of the genetic landscape of standing variation across species in natural populations and determination of genetic convergence across taxa. There are multiple sequenced reference genomes for Drosophila, and genomes are small and compact, allowing for whole genome population surveys using next generation sequencing. Here, we focus on D. yakuba and D. simulans, which are separated by 12 MY of divergence [16], allowing for surveys of distantly related groups which are not expected to share polymorphic variation due to ancestry. Thus, we can measure the limits of standing variation and the incidence of parallel duplication across species, which should be broadly applicable to multicellular eukaryotic evolution.

Convergent evolution is regarded as the ultimate signal of natural selection: if the same solution is favored for a given environment then selection should result in the similar phenotypes [17]. There are many known cases of convergent phenotypic evolution, but 6

the understanding of convergence at the genetic level is limited to a small number of case studies across diverse clades [18]. These case studies have revealed convergent evolution through different genetic solutions in vertebrates [19, 20, 21], and arthropods [22, 23, 24, 25]. Parallel evolution through similar genetic solutions, however, appears to be more common at mutational hotspots where high mutation rates at targeted sites produce mutations at a steady rate [26, 27, 28]. Beyond these results from natural populations, convergence has often been observed in experimental evolution and is considered a signal of selection favoring alleles [26, 28, 29, 30, 31]. However, most studies of laboratory evolution take advantage of microbes or viruses with large population sizes roughly 109 −1010 such that every mutation is likely to be sampled every generation [26, 28] or from small populations that share a common pool of standing variation [29, 32] and may therefore be qualitatively different outcomes in comparison to natural evolution in multicellular eukaryotes. Indeed, known examples of genetic convergence in natural populations often occur through a common ancestral genetic pool [33] or through introgression [34].

The instance of convergent evolution across unrelated taxa that do not share ancestry is essential to understanding the ways mutation limits evolution, the role of standing variation in evolutionary trajectories, and the genetic architecture of adaptation. Here we offer a survey of standing variation for tandem duplications in Drosophila yakuba and Drosophila simulans and the role that this standing variation plays in adaptive evolution in natural populations. We identify an overrepresentation of tandem duplications involved in rapidly evolving phenotypes, and an overabundance of high frequency variants on the D. simulans 7

X chromosome, pointing to a role for adaptation through gene duplication. We further show that the span of tandem duplications in populations is limited to a small fraction of the genome. This implies that evolution by duplication will be limited by mutation and that parallel recruitment of gene duplicates across species is likely to be exceedingly rare even for rapidly evolving phenotypes with similar selective pressures across species.

Results We previously identified hundreds to thousands of segregating duplications in natural populations of D. yakuba and D. simulans, including large numbers of gene duplications (Table S1) [3].

We assess the numbers and types of gene duplications, differences in

duplication rates across species and explore the limits of the number of duplications present in each species to determine the extent to which these variants can serve as a source of genetic novelty. Recently derived, segregating tandem duplications were defined using paired end reads and coverage changes in D. yakuba and D. simulans using samples of 20 isofemale lines derived from natural populations of each species [3]. Using divergently oriented paired-end reads, we have identified 1415 segregating tandem duplications in D. yakuba, in comparison to 975 segregating tandem duplications in D. simulans. The tandem duplications identified across these sample strains cover 2.574% of assayable the genome of the X and 4 major autosomes in D. yakuba and 1.837% of the assayable genome of the X and 4 major autosomes in D. simulans. 8

Widespread selection on the D. simulans X chromosome

If tandem duplications are common targets for adaptation and selective sweeps, we should observe a shift in the site frequency spectrum (SFS) toward high frequency variants relative to neutral markers [35]. We compare the SFS for duplications with the SFS for SNPs from 8-30 bp of short first introns used as a putatively neutral proxy to determine whether duplicates are subject to selection (Figure S1). The SFS for duplications is significantly different from that of intronic SNPs on the D. simulans autosomes (W = 268, P = 2.981 × 10−6 ) and D. yakuba autosomes (W = 212, P = 3.507 × 10−6 ). In D. yakuba the SFS for duplicates on the X is significantly different from that of SNPs (W = 211, P = 4.781 × 10−4 ) using a Wilcoxon sign rank test. Duplicates show an excess of singleton variants on the autosomes in both species (Figure S1), suggesting deleterious impacts on average. We find a significant difference between the SFS of duplicates on the X chromosome and the autosomes in D. yakuba (W = 172, P = 0.0128) but not in D. simulans(W = 183.5, P = 0.1848) (Figure 1, Table S2-S3). The SFS for duplicates on the D. simulans X chromosome, however, shows a stark contrast, with fewer singletons in comparison to neutral SNPs and a slight peak at high frequency duplicates (Figure 1). We observe an excess of duplications at a sample frequency of 20 out of 20 strains on the X chromosome of D. simulans in comparison to neutral SNPs (P < 10−6 ), but observe no duplicates at a sample frequency of 20 out of 20 in D. yakuba on the X or autosomes. Furthermore comparisons of the SFS a show an excess of highest frequency variants ≥16 out of 17 on the X (χ2 = 21.8334, df = 1, P = 2.974 × 10−6 ). The excess of high frequency duplicates on the D. simulans X chromosome is indicative of 9

selection favoring large numbers of tandem duplicates. These results imply that adaptation through duplication is common on the D. simulans X. While demography and neutral evolutionary forces can result in shifts of site frequency spectra, these forces should affect sequences across individual chromosome arms uniformly with similar effects on SNPs, and are therefore unlikely to explain the observed differences between duplicates and synonymous SNPs. Therefore, it seems likely that the overabundance of high frequency variants on the D. simulans X is driven by natural selection. Thus, we would expect many of these high frequency variants to be strong candidates for ongoing selective sweeps. We furthermore observe large numbers of singleton variants among tandem duplicates in comparison with intronic SNPs in D. yakuba and D. simulans autosomes. Many copy number variants are subject to purifying selection in D. melanogaster [4, 36], and we observe large numbers of singleton variants in excess of neutral expectations, indicating negative selection preventing many variants from rising to higher frequency. Hence, while some variants are likely to offer a means of adaptive changes, many are likely to ultimately be lost from the pool of standing variation. Thus, we suggest that tandem duplications are likely to be non-neutral and represent mutations of large effect in comparison with intronic SNPs.

Rapid Evolution Biases in the rates at which duplications form in different genomic regions or a greater propensity for selection to favor duplications in specific functional classes can result in a bias 10

in gene ontology categories among duplicated genes. We use DAVID gene ontology analysis software to identify overrepresented functions among duplicate genes in D. yakuba and D. simulans [3]. While some categories are specific to each species, immune response and toxin metabolism are overrepresented in both species (Figure 2a,Table S4), functions that have also been identified as being overrepresented in a parallel study of D. melanogaster [37]. We also observe an overrepresentation of chitin cuticle formation and chemosensation in both D. yakuba and D. simulans (Table S4). Furthermore, lipases and endopeptidases commonly found in the accessory glands are present among high frequency variants in both species and there are multiple independent duplications in chorion and egg development genes (Table S4), suggesting a role for duplications in sexual conflict. The overabundance of toxin metabolism genes and immune response peptides in both species as well as the overrepresentation of chemoreceptors, chitin cuticle genes, endopeptidases, and oogenesis factors suggests that duplications may be key players in rapidly evolving systems. Moreover, the strong agreement in overrepresented functional categories points to strong selective pressures acting in parallel in these independently evolving species.

Limits of standing variation in natural populations We observe hundreds of segregating tandem duplicates in D. yakuba and D. simulans, spanning 2.574% of assayable the genome of the X and 4 major autosomes in D. yakuba and 1.837% of the assayable genome of the X and 4 major autosomes in D. simulans. If evolutionary trajectories depend on duplications to effect beneficial phenotypic changes, 11

then the number of segregating variants available in the population may not contain desired variants as standing variation. We estimate the number of variants present in the entire population based on the observed sample variation in order to determine the extent to which selection will be limited by mutation. We estimate that the population contains at most 7,230 segregating tandem duplications in D. yakuba and 4,720 in D. simulans (Table S5), corresponding to 13.4% of major chromosome arms in D. yakuba and 9.7% of major chromosome arms in D. simulans. Thus, the standing variation for tandem duplications will be insufficient to offer tandem duplications for every potential gene across the entire genome. Thus, for the majority of the genome, if a tandem duplication is required for adaptation, evolutionary trajectories will be more likely to rely on new mutations.

We calculate the per generation mutation rate µ per gene for whole gene duplications, considering duplicates that capture 90% or more of gene sequences, in agreement with previous methods [10]. We estimate a whole gene duplication rate of 1.1 × 10−9 per gene per generation for D. yakuba and 6.1 × 10−9 per gene per generation for D. simulans (Figure 3, Table S6), slightly higher than estimates derived from surveys of duplicates in the D. melanogaster reference genome of 3.68 × 10−10 per gene per generation [10, 7]. The rate of recruited non-coding sequence is 6.3 × 10−10 in D. yakuba and 6.3 × 10−10 in D. simulans and the rate of chimeric gene formation is lower still with 3.5 × 10−10 in D. yakuba and 2.5 × 10−10 in D. simulans (Figure 3, Table S6). These estimates of mutation rates for whole gene duplications and complex gene structures point to long wait times for new mutations.

Moreover, a Bayesian binomial estimate of the 95% lower CI suggests that for all 12

mutations not observed across the 20 sample strains, their frequency in the population is ≤ 0.1329, with a 50% lower CI of ≤ 0.0325. Very rare mutations may have difficulty escaping the forces of drift in the population, especially if recessive [38], and therefore ultimately the number of duplications that are at frequencies high enough to establish deterministic selective sweeps will be considerably fewer that than the number that exist in the population. The number of tandem duplications that have the potential to sweep to fixation may be substantially less than indicated by the number of segregating sites. Thus, the pool of standing variation in tandem duplications will provide only a limited substrate of novel genetic sequences and evolution will be limited by mutation.

Some 56 genes are partially or wholly duplicated both in D. yakuba and in D. simulans, suggesting that there is little concurrence in the standing variation of the two species but that there are more genes shared across the two species than expected based on uniform chance (P = 2.812 × 10−8 , binomial test) pointing to mutational or selective pressures on similar genes (SI Text). Furthermore, a comparison to duplicate genes in D. melanogaster [37] shows only 5 genes that exist among the segregating variation of tandem duplications in all three species. We find that 13.4% of the genome is present but unsampled in D. yakuba and 9.7% in D. simulans, indicating that the likelihood of shared unsampled variation is low. Such unsampled alleles will be at low frequency and are unlikely to be able to establish selective sweeps. Hence, the portion of variation available for selective sweeps that is shared across species will be low, resulting in a rarity of evolution through parallel recruitment of tandem duplicates. 13

Discussion We have described the prevalence of tandem duplications in natural populations of D. yakuba and D. simulans, their frequencies in the population, and the genes that they affect. We find that duplications show a bias towards genes associated with rapid evolutionary processes and that they commonly affect the X chromosome in D. simulans in comparison to the autosomes. In spite of their strong role in adaptation, we find low rates of parallel recruitment of tandem duplications across species due to low formation rates and mutation limited evolution.

Widespread positive selection on the X chromosome in D. simulans We observe an excess of high frequency duplications on the D. simulans X chromosomes in comparison to neutral intronic SNPs. Background selection [39] and hitchhiking [40] are not expected to act differently on duplications in comparison to SNPs. Yet, we observe significant differences between the SFS of duplicates and putatively neutral SNPs, pointing to a role for adaptation through tandem duplication. Hence, the overabundance of high-frequency duplications on the X is likely to be driven by selection and these represent strong candidate loci for ongoing selective sweeps. Additionally, we observe fewer singleton alleles on the X in comparison to the autosomes in both D. yakuba and D. simulans, consistent with greater efficiency of selection on the X to purge detrimental alleles from the population [41]. Based on the newly assembled D. simulans reference, X vs. autosome divergence indicates faster evolution on the X chromosome at non-synonymous sites, long introns, and UTRs [42]. This pattern is distinct from observations at synonymous sites as well as general patterns 14

of differential evolution on the autosomes [42], further evidence of more frequent selective sweeps on the X chromosome.. The X chromosome is thought to evolve rapidly due to sexual conflict, intragenomic conflict, and sexual selection [43] and thus multiple selective forces may facilitate the spread of duplicates on the X. The X chromosome in D. simulans houses an excess of duplicates in comparison to all autosomes, as well as a strong association with repetitive sequence and tandem duplications on the X [3]. Therefore, the X chromosome appears to be subject to particularly rapid evolution in duplicate content in D. simulans. We do not observe similar patterns in D. yakuba, suggesting that the X may be evolving under different selective pressures in the two species.

Mutation limited evolution While both D. simulans and D. yakuba house a rich diversity of duplicated sequences, only a few percent of the genome will be covered by tandem duplications. With lower mutation rates for duplications [44, 10, 7], there may be long wait times to achieve any single new mutation, and standing variation will likely define evolutionary outcomes. As such, any evolutionary path that is dependent upon duplications of any specific genomic sequence will be severely limited by the small likelihood that the necessary mutation is among the standing variation in copy number. The Drosophila represent an organism with large effective population sizes [13, 45] and hence are expected to host large numbers of duplications as standing variation in comparison to other multicellular eukaryotes. However, we have shown that the number of tandem duplications segregating in the population is substantially smaller than the number 15

of mutations needed to guarantee a duplicate of any desired genomic region. Organisms with large population sizes are expected to offer such great diversity of SNPs that many amino acid changes should already be present, thus offering selection a full palate of genetic diversity upon which it can act [15]. However, when population level mutation rates are small, standing variation is unlikely to offer a sufficient substrate for selective sweeps and systems will be stuck waiting for new mutations [11]. We observe population level mutation rates θ per gene on the order of 0.0045 in D. yakuba and 0.0083 in D. simulans (Figure 3) resulting in low probabilities that standing variation offers the major source of adaptation [11]. Thus, we conclude that evolution through duplication is mutation limited even in Drosophila which have large Ne , and that these limits are expected to be even more severe for many other multicellular eukaryotes, especially vertebrates.

The majority of tandem duplications identified in D. yakuba and D. simulans appear to be at extremely low frequency, with an excess of singleton variants in comparison to neutral intronic SNPs, suggesting that large numbers of duplications are detrimental, consistent with previous work in other species [4]. It has previously been argued that the accumulation of duplications is the product of small Ne and inability of selection to purge nearly neutral alleles from the population [46, 47]. However, we show that duplicates are less likely to be neutral in comparison to putatively neutral first intronic SNPs, suggesting that they are often mutations of large effect. We have shown that both positive and negative selection will play a strong role in the fixation or loss of duplications and that simplified nearly neutral theories are unlikely to explain the patterns observed across species, but rather selection is 16

expected to play an appreciable role in the evolution of genome content.

Convergence Convergent evolution is often interpreted of a signal of adaptation in experimental evolution and in natural populations [17]. Here, we show that for tandem duplications, parallel recruitment of genes for duplication and diversification independent from shared ancestry will be very rare in spite of convergence in functional categories represented. Thus, the reliance on genetic convergence to establish natural selection in natural populations will underreport selected alleles and result in significant underestimation of the number and types of alleles that are selected. Though convergence is common in experimental evolution of both prokaryotic systems and multicellular eukaryotes with shared ancestry, these results suggest that such convergence is unlikely to reflect the frequency of convergent evolution in natural populations of independently evolving species of multicellular eukaryotes that have little shared standing variation. We observe an excess of variants that are involved similar in rapid evolutionary processes both in D. yakuba and in D. simulans (Figure 2a). However, few genes appear to be duplicated in both species and only a handful have been identified in D. simulans, D. yakuba, and D. melanogaster (Figure 2b). Moreover, none of the high frequency variants in the in D. yakuba and D. simulans capture orthologous sequences. Hence, in spite of parallel selective pressures on rapidly evolving phenotypes, there is little convergence at the genetic level with respect to duplication. Given the limited genomic span of standing variation in the population (Table S7), and low rates of new mutation (Figure 3, Table S6), 17

as well as the low frequency of a large fraction of variants, parallel fixation will be extremely rare even among genera with large effective population sizes facing similar selective pressures. Thus, even when a given duplication is needed for adaptation, we expect that the limits of mutation will lead to low levels of convergence and scarcity of shared genetic solutions.

Duplicate genes and rapidly evolving phenotypes Both D. simulans and D. yakuba have an overabundance of genes involved in immune function, chemosensory processing or response, and drug and toxin metabolism (Table S4). Furthermore the instance of independent duplications confirm a bias toward chemosensory receptors, chorion development and oogenesis, as well as immune response [3].

These

phenotypes are strongly associated with rapid evolution due to host-parasite interactions, predator-prey coevolution, and sexual conflict [48, 49, 50, 51]. Previous work has observed similar bias toward rapid amino acid substitutions in olfactory genes, and chitin cuticle in D. melanogaster and D. simulans [52], and selection for toxin resistance is common in D. melanogaster [53, 50] suggesting that associated phenotypes may be under widespread selection in multiple species. Host pathogen systems as well as arms races in pesticide and toxin resistance, operate under Red Queen dynamics in which conflicts between organisms result in repeated selective sweeps [54]. Organisms that lack the genetic means to adapt to rapidly changing systems will be at a distinct disadvantage in the face of selective events. In cases where single nucleotide polymorphisms offer a means to overcome selective challenges, there is likely to 18

be sufficient variation [15]. However, if rapidly evolving systems rely heavily on complex mutations, profiles of standing variation will place strong limits on outcomes in response to selection. Additionally, the overrepresentation of duplications in cytochromes and drug or toxin metabolism genes confirms rapid evolution in copy number seen in comparison of reference genomes [55] as well as recent studies of insecticide resistance and viral resistance in natural populations [53, 56, 57]. Large amounts of divergence driven by selection among non-synonymous sites and UTRs in D. simulans [58] and high rates of adaptive substitutions [14, 52] point to widespread selective pressures acting in D. simulans, and it is likely that these same pressures influence the current diversity and frequency of copy number variants.

Shifting selective pressures such as those found in rapidly evolving systems or gross ecological change require a pool of genetic variation to facilitate adaptation. We observe standing variation and mutational profiles that will limit evolutionary trajectories and would expect these limits to be even more severe for rapidly evolving phenotypes. Repeated sweeps are expected to purge genetic and phenotypic diversity, and recovering such diversity after sweeps can take thousands of generations [59]. Thus, during rapid evolution selection will potentially purge diversity that is needed for subsequent steps in the adaptive walk. Hence, although duplications are key players in rapid evolution, their limited rates of formation combined with low frequencies due to commonly detrimental impacts will hinder evolutionary outcomes precisely when they are urgently needed. Moreover, large numbers of duplicates are low-frequency, suggesting that detrimental impacts further limit standing variation. Thus, we conclude that the available substrate of tandem duplications and profiles of standing 19

variation will define evolutionary outcomes in Drosophila and other multicellular eukaryotes.

Materials and Methods Tandem duplications Tandem duplications were identified using paired-end Illumina sequencing of genomic DNA for 20 strains of D. yakuba and 20 strains of D. simulans as well as the reference genome of each species as described in Rogers et al. 2014. The dataset describes derived, segregating tandem duplications that span 25 kb or less. These sequences exclude ancestral duplications as well as putative duplications in the resequenced reference genomes which are identified in the reference genomes. The resulting list of variants describes segregating variation for newly formed tandem duplicates across the full genome in these two species of non-model Drosophila.

Additional methods Further description of methods including description of intronic SNPs, analysis of population structure, residual heterozygosity, and gene ontology analysis, and correction for ascertainment bias are available in SI Text.

Acknowledgements The authors would like to thank Nigel F. Delaney, Elizabeth G. King, Anthony D. Long, and Alexis S. Harrison, and Trevor Bedford for helpful discussions.RLR is supported by NIH 20

Ruth Kirschtein National Research Service Award F32-GM099377. Research funds were provided by NIH grant R01-GM085183 to KRT and R01-GM083228 to PA. All sequencing was performed at the UC Irvine High Throughput Genomics facility, which is supported b -y the National Cancer Institute of the National Institutes of Health under Award Number P30CA062203. The content is solely the responsibility of the authors and does not necessarily represent the official views of the National Institutes of Health. The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript. RLR, JMC, KRT, and PA performed analyses. LS generated Illumina sequencing libraries. TTH provided gene annotations for D. simulans. RLR, JMC, PA and KRT designed experiments and analyses.

21

22

Bibliography [1] Conant, G. C & Wolfe, K. H. (2008) Nat. Rev. Genet. 9, 938–950. [2] Ohno, S et al. (1970) Evolution by gene duplication. (London: George Alien & Unwin Ltd. Berlin, Heidelberg and New York: Springer-Verlag.). [3] Rogers, R. L, Cridland, J. M, Shao, L, Hu, T. T, Andolfatto, P, & Thornton, K. R. (2014) arXiv preprint arXiv:1401.7371. [4] Emerson, J. J, Cardoso-Moreira, M, Borevitz, J. O, & Long, M. (2008) Science 320, 1629–1631. [5] Cardoso-Moreira, M, Emerson, J. J, Clark, A. G, & Long, M. (2011) PLoS Genet. 7, e1002340. [6] Dopman, E. B & Hartl, D. L. (2007) Proc. Natl. Acad. Sci. U.S.A. 104, 19920–19925. [7] Zhou, Q, Zhang, G, Zhang, Y, Xu, S, Zhao, R, Zhan, Z, Li, X, Ding, Y, Yang, S, & Wang, W. (2008) Genome Research 18, 1446–1455. [8] Katju, V & Lynch, M. (2006) Molecular Biology and Evolution 23, 1056–1067. [9] Rogers, R. L & Hartl, D. L. (2012) Mol. Biol. Evol. 29, 517–529. [10] Rogers, R. L, Bedford, T, & Hartl, D. L. (2009) Genetics 181, 313–322. [11] Hermisson, J & Pennings, P. S. (2005) Genetics 169, 2335–2352. [12] Kreitman, M. (1983) Nature 304, 412–417. [13] Bachtrog, D, Thornton, K, Clark, A, & Andolfatto, P. (2006) Evolution 60, 292–302. [14] Andolfatto, P, Wong, K. M, & Bachtrog, D. (2011) Genome Biol Evol 3, 114–128. [15] Karasov, T, Messer, P. W, & Petrov, D. A. (2010) PLoS Genet. 6, e1000924. [16] Tamura, K, Subramanian, S, & Kumar, S. (2004) Mol. Biol. Evol. 21, 36–44. [17] Gould, S. J & Lewontin, R. C. (1979) Proc. R. Soc. Lond., B, Biol. Sci. 205, 581–598. 23

[18] Stern, D. L. (2013) Nat. Rev. Genet. 14, 751–764. [19] Chen, L, DeVries, A. L, & Cheng, C. H. (1997) Proc. Natl. Acad. Sci. U.S.A. 94, 3817–3822. [20] Shapiro, M. D, Summers, B. R, Balabhadra, S, Aldenhoven, J. T, Miller, A. L, Cunningham, C. B, Bell, M. A, & Kingsley, D. M. (2009) Curr. Biol. 19, 1140–1145. [21] Brodie, E. D. (2010) Curr. Biol. 20, R152–154. [22] Khadjeh, S, Turetzek, N, Pechmann, M, Schwager, E. E, Wimmer, E. A, Damen, W. G, & Prpic, N. M. (2012) Proc. Natl. Acad. Sci. U.S.A. 109, 4921–4926. [23] Wittkopp, P. J, Williams, B. L, Selegue, J. E, & Carroll, S. B. (2003) Proc. Natl. Acad. Sci. U.S.A. 100, 1808–1813. [24] Tanaka, K, Barmina, O, & Kopp, A. 4764–4769.

(2009) Proc. Natl. Acad. Sci. U.S.A. 106,

[25] Zhen, Y, Aardema, M. L, Medina, E. M, Schumer, M, & Andolfatto, P. (2012) Science 337, 1634–1637. [26] Riehle, M. M, Bennett, A. F, & Long, A. D. (2001) Proc. Natl. Acad. Sci. U.S.A. 98, 525–530. [27] Chan, Y. F, Marks, M. E, Jones, F. C, Villarreal, G, Shapiro, M. D, Brady, S. D, Southwick, A. M, Absher, D. M, Grimwood, J, Schmutz, J, Myers, R. M, Petrov, D, Jonsson, B, Schluter, D, Bell, M. A, & Kingsley, D. M. (2010) Science 327, 302–305. [28] Moxon, E, Lenski, R, & Rainey, P. (1998) Perspectives in Biology and Medicine 42, 154–155. [29] Burke, M. K, Dunham, J. P, Shahrestani, P, Thornton, K. R, Rose, M. R, & Long, A. D. (2010) Nature 467, 587–590. [30] Woods, R, Schneider, D, Winkworth, C. L, Riley, M. A, & Lenski, R. E. (2006) Proceedings of the National Academy of Sciences 103, 9107–9112. [31] Tenaillon, O, Rodriguez-Verdugo, A, Gaut, R. L, McDonald, P, Bennett, A. F, Long, A. D, & Gaut, B. S. (2012) Science 335, 457–461. [32] Orozco-terWengel, P, Kapun, M, Nolte, V, Kofler, R, Flatt, T, & Schloetterer, C. (2012) Molecular ecology 21, 4931–4941. [33] Colosimo, P. F, Hosemann, K. E, Balabhadra, S, Villarreal, G, Dickson, M, Grimwood, J, Schmutz, J, Myers, R. M, Schluter, D, & Kingsley, D. M. (2005) Science 307, 1928–1933. 24

[34] Martin, A, Papa, R, Nadeau, N. J, Hill, R. I, Counterman, B. A, Halder, G, Jiggins, C. D, Kronforst, M. R, Long, A. D, McMillan, W. O, & Reed, R. D. (2012) Proc. Natl. Acad. Sci. U.S.A. 109, 12632–12637. [35] Hartl, D. L & Clark, A. G. (2007) Principles of population genetics. (Sinauer Associates, Sunderland, Mass.), p. 652. [36] Cridland, J. M & Thornton, K. R. (2010) Genome Biol Evol 2, 83–101. [37] Zichner, T, Garfield, D. A, Rausch, T, Stutz, A. M, Cannavo, E, Braun, M, Furlong, E. E, & Korbel, J. O. (2013) Genome Res. 23, 568–579. [38] Haldane, J. B. S. (1927) Proceedings of the Cambridge Philosophical Society 23, 838–844. [39] Charlesworth, B, Morgan, M. T, & Charlesworth, D. (1993) Genetics 134, 1289–1303. [40] Smith, J. M & Haigh, J. (2007) Genet. Res. 89, 391–403. [41] Charlesworth, B, Coyne, J, & Barton, N. (1987) American Naturalist pp. 113–146. [42] Hu, T. T, Eisen, M. B, Thornton, K. R, & Andolfatto, P. (2012) Genome Res. [43] Presgraves, D. C. (2008) Trends Genet. 24, 336–343. [44] Lynch, M & Conery, J. S. (2003) Journal of Structural and Functional Genomics 3, 35–44. [45] Eyre-Walker, A, Keightley, P. D, Smith, N. G, & Gaffney, D. (2002) Mol. Biol. Evol. 19, 2142–2149. [46] Lynch, M & Conery, J. S. (2000) Science 290, 1151–1155. [47] Lynch, M. (2007) The Origins of Genome Architecture. Sunderland, Mass.), p. 494.

(Sinauer Associates,

[48] Lazarro, B & Clark, A. (2012) Rapidly Evolving Genes and Genetic Systems, eds. R.S. Singh, J. X & Kulathinal, R. (Oxford University Press, Oxford). [49] Beckerman, A. P, de Roij, J, Dennis, S. R, & Little, T. J. (2013) Ecol Evol 3, 5119–5126. [50] Ffrench-Constant, R. H, Daborn, P. J, & Le Goff, G. (2004) Trends Genet. 20, 163–170. [51] Panhuis, T. M, Clark, N. L, & Swanson, W. J. (2006) Philos. Trans. R. Soc. Lond., B, Biol. Sci. 361, 261–268. [52] Begun, D. J, Holloway, A. K, Stevens, K, Hillier, L. W, Poh, Y. P, Hahn, M. W, Nista, P. M, Jones, C. D, Kern, A. D, Dewey, C. N, Pachter, L, Myers, E, & Langley, C. H. (2007) PLoS Biol. 5, e310. 25

[53] Schmidt, J. M, Good, R. T, Appleton, B, Sherrard, J, Raymant, G. C, Bogwitz, M. R, Martin, J, Daborn, P. J, Goddard, M. E, Batterham, P, & Robin, C. (2010) PLoS Genet. 6, e1000998. [54] VAN, V et al. (1973) Evolutionary theory 1, 1–30. [55] Drosophila Twelve Genomes Consortium. (2007) Nature 450, 203–218. [56] Bass, C & Field, L. M. (2011) Pest Manag. Sci. 67, 886–890. [57] Magwire, M. M, Bayer, F, Webster, C. L, Cao, C, & Jiggins, F. M. (2011) PLoS Genet. 7, e1002337. [58] Haddrill, P. R, Bachtrog, D, & Andolfatto, P. (2008) Mol. Biol. Evol. 25, 1825–1834. [59] Kaplan, N. L, Hudson, R. R, & Langley, C. H. (1989) Genetics 123, 887–899.

26

Figure 1: SFS for tandem duplications in D. yakuba and D. simulans, corrected for ascertainment bias. A. Site frequency spectra on the autosomes (black) and on the X (grey) in D. yakuba. B. SFS on the autosomes (black) and on the X (grey) in D. simulans. C. SFS for X-linked intronic SNPs (black) and duplicates (grey). The excess of high frequency variants on the X in D. simulans suggests widespread selection for tandem duplicates on the D. simulans X.

27

(b) (a)

Figure 2: A) Gene ontology classes overrepresented by species for single genes or multiply duplicated genes. B) Number of genes duplicated by species. Most variants are species specific, with small numbers of parallel duplication of orthologs across species.

28

Figure 3: Genomewide population mutation rates for all duplications (θ), population sizes (Ne ), and per gene mutation rates (µ) for gene structures produced by duplication by species. Low mutation rates and mutation limited evolution leads to low levels of parallel recruitment of tandem duplications.

29

30

Supporting Information Identifying duplicated coding sequence Tandem duplications were previously identified using a combination of paired end read mapping and coverage changes in 20 isofemale lines of D. yakuba and 20 isofemale lines in D. simulans generated via 9-12 generations of sibling mating from wild-caught flies. We sequenced 10 isofemale lines of D. yakuba from Nairobi, Kenya, and 10 isofemale lines from Nguti, Cameroon as well as 10 isofemale lines of D. simulans from Nairobi, Kenya and 10 isofemale lines from Madagascar. Duplications were identified through divergently oriented reads and coverage changes in comparison to reference genomes. We identify 1415 tandem duplications in D. simulans and 975 tandem duplications segregating in D. yakuba that span 845 different gene sequences in D. yakuba and 478 different gene sequences in D. simulans [1]. Gene duplications were defined as any divergent read calls whose maximum span across all lines overlaps with the annotated CDS coordinates. D. yakuba CDS annotations were based on flybase release D. yakuba r.1.3. Gene annotations for the recent reassembly of the D. simulans reference were produced by aligning all D. melanogaster CDS sequences to the D. simulans reference in a tblastx. Percent coverage of the CDS was defined based on the portion of the corresponding genomic sequence from start to stop that was covered by the 1

maximum span of divergent read calls across all strains. Using the representation of gene sequences in D. yakuba of

845 16082

we use a binomial test to calculate the likelihood of 56 shared

variants among the 478 genes duplicated in D. simulans.

Intronic SNPs In order to produce a neutral proxy for sequence change in each species, we identified SNPs for short introns 100 bp or less, focusing on sites 8-30 which are generally subject to little constraints [2, 3, 4]. Reads containing indels were re-aligned using GATK [5]. SNPs were identified across strains using samtools v1.18 mpileup [6] disabling probabilistic realignment (-B) and outputting genotype likelihoods in BCF format (-g). The resulting BCF used to create a VCFusing bcftools, calling bases using Bayesian inference (-c) calling genotypes per sample (-g) with a scaled mutation rate of 1% (-t .01) under a haploid model (ploidy=1). SNPs were required to have minimum Illumina coverage depth of 20 reads, maximum coverage of 250 reads, MQ≥ 20, and GQ≥ 30 and invar GQ≥ 40. We excluded SNPs identified in the reference, which are indicative of either assembly errors or residual heterozygosity. We performed hierarchical cluster analysis in R using all SNPs by chromosome to evaluate population structure. The ancestral state for each SNP was established through comparison with the nearest sequenced reference genome as an outgroup, D. erecta for D. yakuba sequences and D. melanogaster for D. simulans sequences. Orthologs between each species and its outgroup were identified using reciprocal best hit criteria in a BLASTn at an E-value cutoff of 10−5 . 2

Full gene sequences for each ortholog were then aligned using clustalw, keeping only genes which aligned with 85% or greater nucleotide identity. Divergence between the two species, Divx,y , was defined based on alignments of intronic sites from bases 8-30 between each species and the outgroup reference genome, excluding gapped sequences, for aligned orthologs with 85% or nucleotide identity. The ancestral state was defined based on the corresponding sequence in the outgroup genome. We excluded sites where the outgroup reference was in disagreement with both the D. yakuba reference and D. yakuba SNPs, as well as triallelic SNPs, sites with reference sequence of ‘N’, or SNPs identified in the VCF for the reference, suggesting inaccuracies in reference assembly or residual heterozygosity in the reference. These resulted in a total of 7158 intronic SNPs in D. yakuba and 5504 intronic SNPs in D. simulans. The resulting unfolded SFS was then corrected for the probability of independent mutations in both reference genomes leading to incorrect inference of the ancestral state. Given net divergence Dnet = Divx,y − πx , the probability of identical independent mutations occurring in the outgroup reference genome is reflected by either the probability of an independent transition (ts) at the site of a transition mutation, or by 1/2 the probability of a transversion (tv) at the site of a transversion polymorphism. Thus,

k = [(

Empirically, in Drosophila κ =

κ 2 1 2 2 ) + ( ) ]Dnet 2+κ 2 2+κ

ts tv

(1)

= 2. Thus, k = 38 Dnet .

The unfolded SFS for intronic sites was corrected for the likelihood of independent mutations in the reference, k. The probability of independent mutations occurring in both 3

genomes is equal to the probability of either two independent transitions or two independent transversions occurring in both genomes. We calculated πx as the average heterozygosity per intronic site. Given a likelihood of independent identical mutations of k = 38 Dnet (See Text S1).

Si,obs = E[Si ] − E[Si ](k) + E[Sn−i ](k)

(2)

Sn−i,obs = E[Sn−i ] − E[Sn−i ](k) + E[S − i](k)

(3)

Substituting Equation 3 into Equation 2, we obtain

E[Si ] =

Si,obs (1 − k) − Sn−i,obs (k) 1 − 2k

(4)

Correcting Duplicates for Ascertainment Bias Tandem duplications, unlike SNPs, cannot be identified using paired-end reads in individual strains except through comparison to the reference genome. Moreover, variants that are segregating at high frequency in populations are substantially more likely to be present in the reference, and therefore are substantially less likely to be identified in sample strains [7]. We corrected site frequency spectra according to the model developed by Emerson et al. [7].

yi n xi = Pn−2n−i n i=1 yi n−i

(5)

Here, xi is the true proportion of alleles at frequency i in the population, and yi is 4

the observed proportion of alleles at frequency i in a sample of n strains (here 21). The correction for ascertainment bias lowers estimates of the proportion found a low frequencies and increases estimates of the proportion at high frequency. For estimates of population site frequency spectra, we removed all variants with divergently oriented reads in the reference strain, as these would not be identified in an accurately annotated reference.

Residual heterozygosity Some isofemale lines contained regions of residual heterozygosity in spite of over 10 generations of inbreeding in the lab. To detect regions of residual heterozygosity, we called SNPs as above under a diploid model. Segments with residual heterozygosity were detected using an HMM (HMM;http://cran.r-project.org/web/packages/HMM/). Prior probabilities on states were set as:   π = 0.5. 0.5 Transition probabilities were set to:   1 − 10−10 10−10 T = 10−10 1 − 10−10

and emission probabilities set to:   θ  E= 1−θ 1− Where  = 0.001 and θ = 0.01. The most likely path was calculated using the Viterbi algorithm, and heterozygous segments 10kb or larger were retained. Heterozygous blocks within 100kb of one another in a sample strain were clustered together as a single segment 5

to define the span of residual heterozygosity within inbred lines.

Differences in Site Frequency Spectra If different classes of duplications have different selective impacts, we should observe clear differences in site frequency spectra, with more positively selected duplications showing fewer singleton alleles and more high frequency variants. Site frequency spectra are not normally distributed, nor can they be normalized through standard transformations, and thus require non-parametric tests. We used a two-sided Wilcoxon rank sum test to determine whether site frequency spectra were significantly different. For each comparison, we excluded tandem duplications that are present in the reference genomes as well as putative ancestral duplications, as these are likely to display biases with respect to size, propensity to capture coding sequences, and association with repetitive content. We compared site frequency spectra of the following groups within each species: duplications on the X and on the autosomes and all pairwise combinations of SNPs and duplicates on the X and autosomes. We also performed Kolmogorov-Smirnov test for comparison. In D. simulans, we used a χ2 test to determine whether high frequency alleles are overrepresented among duplications on the X relative to intronic SNPs, comparing the proportion of variants as at a sample frequency of ≥

16 . 17

Tandem duplicates that lie in regions with residually heterozygous segments extending 1kb upstream or downstream were excluded from the SFS, resulting in unequal sample sizes for different variants. Samples with fewer than 15 strains remaining were excluded from the 6

SFS. The SFS for intronic SNPs and for duplicates was then scaled to a sample of size 17 in D. simulans and 15 in D. yakuba according to Nielsen et al. [8].

Segregating Inversions In order to check for population substructure, we aligned all SNPs in Intronic sequences from 8-30 bp which are supposed to be a neutral proxy [2, 3, 4] and performed hierarchical clustering in R using hclust. These SNPs were intended solely to differentiate strains and were not polarized with respect to the ancestral state or otherwise filtered. We observe little evidence for population structure in D. simulans (Figure S2). However, we identify structure on chromosome 2 in D. yakuba (Figure S3), consistent with known polymorphic inversions prohibiting recombination on chromosome 2 [9]. Strains do not strictly cluster with respect to geography but rather are reticulated amongst other groups. Moreover, among duplicates we do not observe an excess of moderate frequency alleles as one would expect under population substructure given our sampling scheme (Figure S1). Thus, these strains constitute a single admixed population. Some strains retained residual heterozygosity even after 9 generations of inbreeding, with greater residual heterozygosity in D. yakuba than in D. simulans, consistent with inversions segregating in D. yakuba. These regions of residual heterozygosity can result in incorrect estimates of SFS by artificially increasing chances of observing variation. Site frequency spectra were calculated across all strais by correcting sample frequencies for ascertainment bias, excluding regions of residual heterozygosity and then projecting frequencies onto a 7

sample size of 15 in D. yakuba and 17 in D. simulans according to [8]. As a neutral comparison we calculated SFS for intronic SNPs (as above) and projected the SFS down to a sample size of 15 in D. yakuba and 17 in D. simulans (Figure S1).

Frequency of unsampled alleles We also estimated the frequency distribution for alleles that are absent among our 20 strains, according to a Bayesian binomial model. Assuming that sampling follows a binomial model that is dependent upon the allele frequency p, the probability that a variant is present at frequency p given that it is not observed is as follows: P (p|absent) =

R1 0

p(absent|p)∗p(p) p(absent|p)∗p(p)dp.

Assuming a uniform distribution on p: P (p|absent) =

R1 0

P (p|absent) =

p(absent|p) p(absent|p)dp.

(1−p)20 20 0 (1−p) dp.

R1

P (p|absent) = 21(1 − p)20 95% lower CI defined as: Rx 0

21(1 − p)20 dp = 0.95

1 − (1 − x)21 = 0.95 And therefore the 95% one-sided lower CI is x ≤ 0.132946 whereas the 50% one-sided lower CI will be x ≤ 0.0325. Placing a uniform prior on p will bias estimates toward higher frequency variants, thereby placing a conservative upper bound on allele frequencies. 8

Likelihood of shared variation through ancestry The likelihood of shared variation through shared ancestry can be obtained through a coalescent approach. The probability that an allele does not coalesce in the time period from the present back to the speciation event that separated D. yakuba and D. simulans is (1 −

1 t ). 2Ne

−t

This can be approximated using e 2Ne . The X-linked π estimate for Dyak

is 1.27% [10] in D. yakuba and 2.19% in D. simulans [11]. Using the mutation rate of 5.8 × 10−9 [12], we find Ne = (1.27/100)/(3 ∗ 5.8e − 9) = 730, 000 in D. yakuba and Ne = (2.19/100)/(3 ∗ 5.8e − 9) = 1, 200, 000 in D. simulans. Previous estimates have indicated higher Ne [13, 14] and we therefore use 1.0 × 106 as an approximation of Ne that is conservative with respect to the analyses presented here. Using t=12MY [15] and 12 generations per year, and Ne = 1.0 × 106 , we obtain a probability of shared ancestry for an allele of 7.6 × 10−32 . We have polarized all mutations against the putative ancestral state using outgroup reference genomes, focusing solely on derived mutations [1]. Furthermore, the expectation of shared variation for any two alleles through shared ancestry for D. yakuba and D. simulans is expected to be low. Even large samples are not expected to harbor shared variation over such timescales [16]. Thus, we expect shared variants described here to result from independent mutations, not from long standing neutral polymorphism.

Estimated number of segregating tandem duplications We compared the estimated total number of duplications expected in a population to estimates of diversity based on our sample of 20 strains, correcting S for a 3.9% false positive 9

rate (Table S5). Under a standard coalescent model [17, 18, 19]: E[Spopulation ] =

Ssample asample

∗ apopulation

Where a in a sample of size n (in this case n=20): asample =

Pn−1

1 i=0 i

apopulation =

P2Ne

1 i=0 i

When 2Ne is large: P2Ne

1 i=0 i

≈ θ(ln(2Ne ) + 0.57722)

Hence: E[Spopulation ] =

Ssample a20

∗ (ln(2Ne ) + 0.57722)

We can use similar methods to estimate the variance in the number of segregating sites in the population. V ar[Spopulation ] = θ

P2Ne

1 i=0 i

+ θ2

P2Ne

1 i=0 i2

When 2Ne is large: P2Ne

1 i=0 i2



π2 . 6 2

V ar[Spopulation ] = θ(ln(2Ne ) + 0.57722) + θ2 π6

Gene Ontology Overrepresented functional categories were identified using DAVID gene ontology software with an EASE threshold of 1.0. as previously described [1]. We observe several functional categories indicative of rapid evolution that are shared between the two species (Table S4).

As a comparison, we selected a random subset of 845 10

genes for D. yakubaand

478 genes from D. simulans, and performed ontology analysis for a comparison. In D. yakuba, male courtship, GTPase enzymes, and alt splicing had significant group EASE thresholds whereas D. simulansshowed marginal values on esterases, tracheal development, neurodevelopment, lipid metabolism, hormone receptors, cell communication and growth and starvation. Nothing has an EASE of 1.5 in either species similar to the values observed in Table S4. There is no agreement in functional categories for the two species in the random subsets. Thus, we would suggest that the convergence across species and the associations with rapid evolution are not the product of sampling errors.

Proportion of the genome represented by segregating duplicates To determine the number of duplications necessary to span the full range of the genome, we simulated chromosomes with a length determined by the number of base pairs with non-zero coverage in our reference strain. We then simulated random draws from the distribution of duplication lengths for each chromosome, placing duplication start sites at random and recorded the number of duplications necessary to cover 10%, 25%, 50%, and 90% of sequence length for each chromosome in each trial. Simulations were repeated for 1000 trials for each chromosome. These simulations do not account for mutational biases that might result in clustering of duplications in particular regions while other regions remain static, nor do they require that new duplications reach an appreciable frequency so that they are immune to stochastic loss through genetic drift. They do not require that duplications capture sufficient sequence to 11

have functional impacts or require that breakpoints not disrupt known functional elements. Furthermore, simulating individual chromosomes separately decreases the likelihood of resampling particular sites thereby lowering the estimated number of duplications needed to cover the entire genome. Hence, these estimates put a highly conservative lower bound on the minimum number of mutations necessary to capture the full genomic sequence. To estimate the expected proportion of the genome spanned by all duplicates in the population, we resampled 6700 duplicates from the observed size distribution of D. yakuba with replacement and 4000 duplicates from the observed size distribution of D. simulans, placing duplications at random positions across the chromosome.

We performed 100

replicates of sampling and report the mean across all replicates for each species. In D. simulans we observe one case with 19 independent whole gene duplications of a single ORF [1], suggesting up to 1000-fold variation in mutation rates over the genome average. Estimates of population level variation and genome wide mutation rates ignore mutation rate variation where some regions may be highly prone to duplications whereas others remain static, which would reduce likelihood of unobserved tandem duplications outside of mutational hotspots. Hence, these estimates represent a lower bound on the number of duplications necessary to span the entire genome.

Mutation Rates We estimate µ and θ per gene for D. yakuba and D. simulans for gene duplications that capture at least 90% of gene sequence, criteria in agreement with previous estimates [20], 12

where θ = S/a20 , given 16082 gene sequences in D. yakuba and 10786 coding sequences in D. simulans. In D. yakuba Ne = 1.0 × 106 and in D. simulans Ne = 1.0 × 106 .

13

14

Bibliography [1] Rogers, R. L, Cridland, J. M, Shao, L, Hu, T. T, Andolfatto, P, & Thornton, K. R. (2014) arXiv preprint arXiv:1401.7371. [2] Halligan, D. L & Keightley, P. D. (2006) Genome Res. 16, 875–884. [3] Parsch, J, Novozhilov, S, Saminadin-Peter, S. S, Wong, K. M, & Andolfatto, P. (2010) Mol. Biol. Evol. 27, 1226–1234. [4] Clemente, F & Vogl, C. (2012) J. Evol. Biol. 25, 1975–1990. [5] McKenna, A, Hanna, M, Banks, E, Sivachenko, A, Cibulskis, K, Kernytsky, A, Garimella, K, Altshuler, D, Gabriel, S, Daly, M, & DePristo, M. A. (2010) Genome Res. 20, 1297–1303. [6] Li, H, Handsaker, B, Wysoker, A, Fennell, T, Ruan, J, Homer, N, Marth, G, Abecasis, G, & Durbin, R. (2009) Bioinformatics 25, 2078–2079. [7] Emerson, J. J, Cardoso-Moreira, M, Borevitz, J. O, & Long, M. (2008) Science 320, 1629–1631. [8] Nielsen, R, Bustamante, C, Clark, A. G, Glanowski, S, Sackton, T. B, Hubisz, M. J, Fledel-Alon, A, Tanenbaum, D. M, Civello, D, White, T. J, J Sninsky, J, Adams, M. D, & Cargill, M. (2005) PLoS Biol. 3, e170. [9] Lemeunier, F & Ashburner, M. A. (1976) Proc. R. Soc. Lond., B, Biol. Sci. 193, 275–294. [10] Bachtrog, D, Thornton, K, Clark, A, & Andolfatto, P. (2006) Evolution 60, 292–302. [11] Andolfatto, P, Wong, K. M, & Bachtrog, D. (2011) Genome Biol Evol 3, 114–128. [12] Haag-Liautard, C, Dorris, M, Maside, X, Macaskill, S, Halligan, D. L, Houle, D, Charlesworth, B, & Keightley, P. D. (2007) Nature 445, 82–85. [13] Sawyer, S. A & Hartl, D. L. (1992) Genetics 132, 1161–1176. [14] Eyre-Walker, A, Keightley, P. D, Smith, N. G, & Gaffney, D. (2002) Mol. Biol. Evol. 19, 2142–2149. 15

[15] Tamura, K, Subramanian, S, & Kumar, S. (2004) Mol. Biol. Evol. 21, 36–44. [16] Rosenberg, N. A. (2003) Evolution 57, 1465–1477. [17] Wakeley, J. (2009) Coalescent Theory: An Introduction. Publishers), p. 97.

(Roberts & Company

[18] Ewens, W. J. (1974) Theor Popul Biol 6, 143–148. [19] Watterson, G. A. (1975) Theor Popul Biol 7, 256–276. [20] Rogers, R. L, Bedford, T, & Hartl, D. L. (2009) Genetics 181, 313–322.

16

Table S1: Number of duplicated regions detected in D. yakuba and D. simulans

Whole gene Partial gene Intergenic

D. yakuba D. simulans 248 296 745 462 745 577

17

Table S2: Wilcoxon Rank Sum Tests of Site Frequency Spectra Species D. yakuba

D. simulans

Type Autosomal SNPs X-linked SNPs Autosomal Duplicates Autosomal SNPs X-linked SNPs Autosomal Duplicates

Type Autosomal Duplicates X-linked Duplicates X-linked Duplicates Autosomal Duplicates X-linked Duplicates X-linked Duplicates

W 212 211 172 268 113 183.5

* P < 0.05, ** P < 0.01 SNPs are derived from 8-30 bp of short first introns ≤ 200bp.

18

Adjusted P -value 3.507 × 10−6 ** 4.781 × 10−4 ** 0.0128* 2.981 × 10−6 ** 0.2897 0.1848

Table S3: Kolmogorov-Smirnov Tests of Site Frequency Spectra Species D. yakuba

D. simulans

Type Autosomal SNPs X-linked SNPs Autosomal Duplicates Autosomal SNPs X-linked SNPs Autosomal Duplicates

Type Autosomal Duplicates X-linked Duplicates X-linked Duplicates Autosomal Duplicates X-linked Duplicates X-linked Duplicates

D 0.9333 0.800 0.5333 0.8824 0.2941 0.3529

* P < 0.05, ** P < 0.01 SNPs are derived from 8-30 bp of short first introns ≤ 200bp.

19

Adjusted P -value 3.868 × 10−7 ** 5.235 × 10−5 ** 0.02625* 4.808 × 10−7 ** 0.4654 0.2402

Table S4: Gene ontology categories overrepresented in both D. yakuba and D. simulans Functional Category Chitins or cuticle Immune response Drug and toxin metabolism Chemosensation Multiple Independent Duplications Chorion and oogenesis Sensory processing Immune response High Frequency Duplicates Endopeptidases

D. yakuba EASE D. simulans EASE 2.00 0.97 1.44 1.59 1.37 2.32 1.12 1.37 1.79 1.23 1.11

1.84 1.41 3.35

-

-

Table S5: Estimated Number of Segregating Duplications on X and major Autosomes Genome wide θ for tandem duplications E[S] σS E[S] + 2σS Genomic Coverage

20

D. yakuba D. simulans 383 264 5700 3800 497 344 6800 4500 13.4% 9.7%

Table S6: Mutation rates for gene duplications in D. yakuba and D. simulans. Ne θwholegene per gene θrecruit per gene θchim per gene µwholegene µrecruit µchim

D. yakuba D. simulans 1.0 × 106 1.0 × 106 0.0046 0.0083 0.0025 0.0025 0.0014 0.00099 1.1 × 10−9 6.1 × 10−9 6.3 × 10−10 6.3 × 10−10 3.5 × 10−10 2.5 × 10−10

21

Table S7: Number of duplications necessary to cover segments of the genome Percent Covered 5% 10% 25% 50% 90%

Lower Bound (95% CI) Upper Bound (95%CI) 2,358 2,660 4,912 5,360 13,668 14,410 33,191 34,427 119,799 113,767

22

Supplementary Figures A

B SFS for X−linked mutations in D. yakuba

0.0

0.0

0.2

0.2

0.4

0.4

0.6

0.6

0.8

0.8

SFS for Autosomal mutations in D. yakuba

SFS for Autosomal mutations in D. simulans

D

SFS for X−linked mutations in D. simulans

0.0

0.0

0.2

0.2

0.4

0.4

0.6

0.6

0.8

0.8

C

Figure S1: Site frequency spectra for SNPs (black) and tandem duplications (grey) on the A) X and B) autosomes in D. yakuba and on the C) X and D) autosomes in D. simulans. Tandem duplications in D. yakuba and on the D. simulans autosomes show an excess of low frequency variants, consistent with detrimental phenotypic effects. The D. simulans X shows an excess of high frequency variants, consistent with widespread selection favoring duplicates on the X.

23

60

Chrom X

24

reference NS39

D

F All

NS33 NS50 NS113 MD63 MD73 MD105 MD106 MD233

MD221 MD15 MD251 MD06 NS40 NS05 NS137 NS67 NS78 NS79 MD199

60

Chrom 3L

0

30

NS137 NS67 NS79 NS113 NS33 MD105 MD06 MD63

reference NS39

reference NS39

30

40

MD251 MD221 MD233

MD06 MD105 MD63 MD106 MD15 MD199 MD73 NS50 NS67 NS79 NS137 NS40 NS113 NS78 NS33 NS05

0

MD105 MD106 MD06 MD251 MD15 MD63 MD221 MD233 NS40 NS78 NS05 NS79 NS137 NS50 NS67 MD73 NS113 NS33 MD199

0

60

B

NS05 NS113 MD106 MD63 MD105

140

reference NS39

30 60

MD15 MD106 MD251 MD221 MD233 NS50 NS05 NS40 NS78 MD73 MD199

0

Chrom 2L

MD15 MD06 MD251 MD221 MD233 NS50 NS40 NS79 NS67 NS137 NS78 MD73 NS33 MD199

0

MD06 NS40 NS113 NS79

E

reference NS39

reference NS39

20 40

C

MD63 MD105 MD15 MD251 MD221 MD233 NS05 MD73 MD106 NS50 NS78 NS33 NS137 NS67 MD199

0

A Chrom 2R

Chrom 3R

Figure S2: Hierarchical clustering of intronic SNP data for D. simulans shows little population structure.

60

60

80

Chrom 2R

D

Chrom 3R

F

25

50 NY48 NY65 NY66−2 NY73 CY02B5 CY04B CY20A NY81 CY13A NY56 NY42 NY141 CY08A NY62 reference NY85 CY21B3 CY28 CY17C CY01A CY22B

40

60

66

CY04B CY02B5 CY13A CY22B CY20A CY28 CY21B3 CY08A NY42 NY62 NY85 NY73 NY48 CY17C NY81 CY01A NY65 NY56 NY66−2 reference NY141

62

B

NY66−2 NY65 NY48 CY20A CY01A CY08A NY62 CY17C CY04B NY56 CY28 CY02B5 NY85 NY73 CY22B NY42 NY81 CY13A CY21B3 reference NY141

46.5 48.0

CY04B CY01A CY20A reference CY21B3 CY22B NY56 NY65 NY42 NY73 NY48 NY141 CY02B5 NY66−2 NY62 CY17C NY85 CY08A NY81 CY28 CY13A

40

Chrom 3L

CY04B CY20A CY01A reference CY21B3 CY02B5 NY66−2 CY13A NY81 CY08A CY28 NY62 CY17C NY85 CY22B NY48 NY65 NY73 NY42 NY56 NY141

135

E

120

70

C

NY42 CY08A NY66−2 NY48 CY21B3 CY04B NY141 NY62 NY56 NY85 CY22B CY28 CY13A CY17C NY73 CY02B5 NY81 NY65 CY20A reference CY01A

50

A Chrom 2L

Chrom X

All

Figure S3: Hierarchical clustering of intronic SNP data for D. yakuba shows population structure on chromosome 2R (A) and 2L (B), consistent with known inversions segregating on chromosome 2. Samples do not cluster strictly with respect to geography (C-D), indicating widespread gene flow between geographic locations and a single admixed population.