A null model for microbial diversification

Timothy J. Straub (a,1) and Olga Zhaxybayeva (a,b,2)

(a) Department of Biological Sciences, Dartmouth College, Hanover, NH 03755; and (b) Department of Computer Science, Dartmouth College, Hanover, NH 03755

Edited by Eugene V. Koonin, National Institutes of Health, Bethesda, MD, and approved May 10, 2017 (received for review December 12, 2016)

Whether prokaryotes (Bacteria and Archaea) are naturally organized into phenotypically and genetically cohesive units comparable to animal or plant species remains contested, frustrating attempts to estimate how many such units there might be, or to identify the ecological roles they play. Analyses of gene sequences in various closely related prokaryotic groups reveal that sequence diversity is typically organized into distinct clusters, and processes such as periodic selection and extensive recombination are understood to be drivers of cluster formation (“speciation”). However, observed patterns are rarely compared with those obtainable with simple null models of diversification under stochastic lineage birth and death and random genetic drift. Via a combination of simulations and analyses of core and phylogenetic marker genes, we show that patterns of diversity for the genera Escherichia, Neisseria, and Borrelia are generally indistinguishable from patterns arising under a null model. We suggest that caution should thus be taken in interpreting observed clustering as a result of selective evolutionary forces. Unknown forces do, however, appear to play a role in Helicobacter pylori, and some individual genes in all groups fail to conform to the null model. Taken together, we recommend the presented birth−death model as a null hypothesis in prokaryotic speciation studies. It is only when the real data are statistically different from the expectations under the null model that some speciation process should be invoked.

prokaryotic diversity | neutral evolution | bacterial species | pathogen typing | genetic drift

Microbiologists have long debated whether a coherent species concept might apply to Bacteria (and the other prokaryotic domain, Archaea). Without such a concept, ad hoc conventions, for instance grouping taxa into “operational taxonomic units” (OTUs) delimited by regions of 16S ribosomal RNA (rRNA) genes showing at least 97% sequence identity, have been used for identification and quantification of prokaryotic diversity (1). Sometimes, however, 16S rRNA or other marker genes in environmental samples exhibit “microdiversity” at this or more stringent levels, with an overabundance of closely related sequences that could be interpreted as resulting from speciation-like ecological and genetic processes (2–4). For instance, in an early pivotal study of marine vibrios, Acinas et al. (5) observed “a large predominance of closely related taxa in this community” and concluded that “such microdiverse clusters arise by selective sweeps and persist because competitive mechanisms are too weak to purge diversity from within them.”

Two models have been most widely invoked to explain the evolutionary forces behind the formation of such clusters. Under the ecotype (periodic selection) model alluded to above, the progeny of the most-fit genotype in a clonally evolving bacterial population takes over the population, resulting in a selective sweep that purges diversity at all loci (6). In this case, microdiversity would be the consequence of neutral mutations occurring between sweeps (6, 7). Under the competing recombination model, the frequent exchange of material between genes within ecologically differentiated bacterial populations effects gene sequence similarity clustering (8, 9). Perhaps a combination of these forces is in place, or different diversification modes act on different bacterial groups, even if they coexist in the same environment (10–12). Under the selective forces of recombination and selective sweeps, nascent ecologically differentiated groups are proposed to undergo up to five stages that eventually result in their speciation (11). Whatever the genetic and ecological mechanisms, clustering patterns are suggested to be an “accurate” (13), “quick,” and “easy” (14) way to delineate bacterial species and are widely interpreted as indicative of the operation of evolutionary forces (e.g., refs. 12, 15, and 16).

However, a simple turnover of microbial populations due to their random diversification and extinction accompanied by accumulation of neutral changes in genomic DNA sequence will also inevitably produce clusters in the genealogies of the encoded genes (17). Intuitively, such clustering becomes apparent when one examines patterns on any tree-like diagram resulting from the processes of birth and death, but it can also be shown mathematically (e.g., ref. 18). That purely stochastic processes produce clusters raises the possibility that microdiverse clusters observed in gene trees might not actually reflect the operation of genetic or ecological forces of speciation.

In this study, we developed a statistical framework that simulates microbial diversity under a null model of birth–death and compares the created patterns to those observed in data collected from actual bacterial populations, such as 16S rRNA genes, housekeeping genes frequently used in multilocus sequence analyses (MLSA), and protein-coding gene families. We applied this framework to analyses of hundreds of genomes within four bacterial groups: Escherichia spp., Borrelia spp., Neisseria spp., and Helicobacter pylori. The first three of these

E5414–E5423 | PNAS | Published online June 19, 2017

Significance

When evolutionary histories of closely related microorganisms are reconstructed, the lineages often cluster into visibly recognizable groups. However, we do not know if these clusters represent fundamental units of bacterial diversity, such as “species,” nor do we know the nature of evolutionary and ecological forces that are responsible for cluster formation. Addressing these questions is crucial, both for describing biodiversity and for rapid and unambiguous identification of microorganisms, including pathogens. Multiple competing scenarios of ecological diversification have been previously proposed. Here we show that simple cell death and division over time could also explain the observed clustering. We argue that testing for the signatures of such “neutral” patterns should be considered a null hypothesis in any microbial classification analysis.

Author contributions: T.J.S. and O.Z. designed research; T.J.S. performed research; T.J.S. and O.Z. analyzed data; and T.J.S. and O.Z. wrote the paper.

The authors declare no conflict of interest.

This article is a PNAS Direct Submission.

Data deposition: Gene families are deposited to FigShare, available at https://dx.doi.org/10.6084/m9.figshare.4731946. Note that this is a derived dataset. The actual sequence data were taken from PATRIC database records, descriptions of which are supplied in Dataset S1.

1 Present address: Genomic Center for Infectious Diseases, Broad Institute, Cambridge, MA 02142.

2 To whom correspondence should be addressed. Email: [email protected].

This article contains supporting information online at www.pnas.org/lookup/suppl/doi:10.1073/pnas.1619993114/-/DCSupplemental.

www.pnas.org/cgi/doi/10.1073/pnas.1619993114

Results and Discussion A Null Model of Microbial Diversification. As a null model, we propose a diversification scenario in which a bacterial group does not experience evolutionary and ecological forces that either purge or promote diversity. Under the proposed model, the genealogy of diversifying bacterial populations (hereafter referred to as lineages) is influenced only by a random process of extinction (death) or divergence into two new lineages (birth). The net birth–death rate is constant, and recombination is absent. Additionally, the DNA sequences of the genomes of these lineages are assumed to accumulate substitutions at a constant rate. In population genetics terminology, the genes within the genomes of the simulated lineages experience only genetic drift. To simulate evolution of such gene families, we implemented a two-step procedure (see Methods for detailed description). First, the evolutionary history (genealogy) of a gene family was simulated under the stochastic lineage birth–death model until all “extant” lineages could be traced back to one common ancestor. This resulted in a phylogenetic tree relating the extant lineages that contain the gene. Second, the DNA sequences of the gene in these extant lineages were established by evolving the DNA sequence of the common ancestor along the obtained genealogy under one of the simplest nucleotide substitution models [the F81 substitution model (19)]. The F81 model does not distinguish synonymous and nonsynonymous substitutions, which, given the likely strong constraints on the amino acid sequences of the protein-coding gene families to maintain protein function, may appear overly simplistic. However, at the examined level of divergence, the majority of the selected gene families are expected to be strongly conserved, and therefore a substantial fraction of their nucleotide substitutions are projected to be synonymous. 
For gene families that do deviate from this expectation, we expect the null model to be rejected (discussed in Framework for Comparison of Divergence Patterns in Real and Simulated Gene Families). Because, in the real bacterial groups, even conserved gene families exhibit a range of divergence (from hyperconserved to faster evolving), each simulated phylogenetic tree was scaled to a randomly selected value drawn from the distribution of the divergences observed for the real gene families of the analyzed bacterial group. Similarly, the composition and length of the DNA sequence of the common ancestor in each gene family simulation was informed by the respective parameters observed in the real data (see Methods for details). To simulate evolution of multiple gene families, the procedure was repeated 1,000 times.
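The two-step procedure described above can be illustrated with a minimal sketch. It is not the authors' implementation (their Methods are not reproduced here), and several choices are assumptions made only for illustration: the genealogy uses a Moran-style scheme in which one lineage splits and one dies per event, the function names `simulate_genealogy`, `evolve_tips`, and `f81_evolve` are hypothetical, and the tree is scaled so that the founder-to-tip path equals a chosen `height`. The F81 step uses the standard transition probabilities, with the rate scaled so that branch lengths are in expected substitutions per site.

```python
import math
import random

BASES = "ACGT"

def f81_evolve(seq, branch_length, pi, rng):
    """Evolve seq along one branch under F81 (pi = frequencies of A, C, G, T)."""
    beta = 1.0 / (1.0 - sum(p * p for p in pi))  # scales time to substitutions/site
    p_no_event = math.exp(-beta * branch_length)
    # With probability p_no_event the site is untouched; otherwise it is
    # redrawn from the equilibrium frequencies (possibly to the same base).
    return "".join(
        base if rng.random() < p_no_event else rng.choices(BASES, weights=pi)[0]
        for base in seq
    )

def simulate_genealogy(n_lineages, rng):
    """Assumed Moran-style turnover: per event, one lineage splits, one dies.

    Runs until all extant lineages descend from a single founder.
    """
    parent = {i: None for i in range(n_lineages)}
    birth_time = {i: 0.0 for i in range(n_lineages)}
    founder = {i: i for i in range(n_lineages)}
    extant = list(range(n_lineages))
    next_id, now = n_lineages, 0.0
    while len({founder[x] for x in extant}) > 1:
        now += 1.0
        b, d = rng.sample(range(n_lineages), 2)  # slot b splits, slot d dies
        mother = extant[b]
        for slot in (b, d):  # two daughters take the mother's and the dead slot
            parent[next_id] = mother
            birth_time[next_id] = now
            founder[next_id] = founder[mother]
            extant[slot] = next_id
            next_id += 1
    return parent, birth_time, extant, now

def evolve_tips(parent, birth_time, extant, now, root_seq, pi, height, rng):
    """Evolve root_seq down the genealogy; founder-to-tip path length = height."""
    rate = height / now
    end_seq = {}  # node -> sequence at the end of that node's branch
    for tip in extant:
        chain = [tip]
        while parent[chain[-1]] is not None:
            chain.append(parent[chain[-1]])
        for i in range(len(chain) - 1, -1, -1):  # founder first, tip last
            node = chain[i]
            if node in end_seq:
                continue
            start = root_seq if parent[node] is None else end_seq[parent[node]]
            stop = birth_time[chain[i - 1]] if i > 0 else now
            end_seq[node] = f81_evolve(start, rate * (stop - birth_time[node]), pi, rng)
    return [end_seq[tip] for tip in extant]

rng = random.Random(1)
pi = [0.25, 0.25, 0.25, 0.25]  # illustrative base composition
parent, birth_time, extant, now = simulate_genealogy(12, rng)
root = "".join(rng.choices(BASES, weights=pi, k=300))
tips = evolve_tips(parent, birth_time, extant, now, root, pi, 0.05, rng)
```

In the study itself, the scaling value, base composition, and sequence length were drawn from the real data of each bacterial group; the fixed values above are placeholders.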

Table 1. Summary of the gene families and their diversification patterns in the four analyzed bacterial groups

                                                    No. of gene families
Bacterial group                    No. of genomes   Total (%*)    D= (%†)         D+ (%†)        D− (%†)
Escherichia                        620              2,837 (54)    2,534 (89.3)    12 (0.4)       291 (10.3)
Borrelia                           107              304 (22)      303 (99.7)      0 (0)          1 (0.3)
H. pylori                          233              1,138 (68)    13 (1.1)        1,124 (98.8)   1 (0.1)
Neisseria                          232              1,244 (47)    835 (67.1)      14 (1.1)       395 (31.8)
Neisseria excluding N. gonorrhoeae 214              1,259 (46)    1,041 (82.7)    27 (2.1)       191 (15.2)

*Average percent of the gene families present in each genome of the group.
†Percent of the total number of gene families.


Selection of Bacterial Groups and Compilation of Gene Families. To test the null hypothesis, we examined four bacterial groups (Escherichia spp., Neisseria spp., Borrelia spp., and Helicobacter pylori) that had at least 100 genomes in public access (to ensure statistical power) and represented a range of lifestyles (to capture potentially different selective pressures that could have affected their diversification) (Table 1 and Dataset S1). Escherichia is a γ-proteobacterial genus comprising six described species and a few yet unclassified strains (20–24). The analyzed 620 Escherichia genomes span representatives from environmental isolates to commensals and disease-causing strains of Escherichia coli, Escherichia albertii, Escherichia fergusonii, Escherichia hermannii, and the unclassified Escherichia spp. Members of the β-proteobacterial genus Neisseria colonize the mucosal and dental surfaces of many animals, and two of its 21 currently named species (25), Neisseria meningitidis and Neisseria gonorrhoeae, cause disease in humans (26). Neisseria cluster into “species groups” that do not match perfectly to named species (27), and the relationships between the species groups are unclear (2, 27, 28), perhaps as a result of Neisseria’s natural propensity for DNA uptake and recombination (29). The analyzed 232 Neisseria genomes represent 15 named species, although the majority are from N. meningitidis and N. gonorrhoeae. Named species of the spirochete Borrelia (30, 31) form two distinct groups associated with Lyme disease and relapsing fever, respectively (32), and, within the “Lyme disease group,” named species typically form distinct clusters (33). The 107 Borrelia genomes in our dataset are dominated by the Lyme disease-causing B. burgdorferi and B. garinii but overall represent 16 named species and include a few relapsing fever isolates. Finally, the human-associated ε-proteobacterium Helicobacter pylori (34, 35) is known for its unexpectedly high genomic diversity (36) and extensive recombination (37, 38), likely due to its natural competence. Thus, in contrast to the other three bacterial groups, we limited our analyses to 233 genomes within H. pylori and did not expand the analyses to the rest of the genus Helicobacter.

To assess genetic diversity patterns within each group, we selected markers frequently used for this purpose: the 16S rRNA gene; several genes known as MLSA loci (39, 40); and a subset of protein-coding genes hereafter referred to, for brevity, as “gene families.” For the last, many comparative bacterial genomic studies aiming at characterization of diversity and structure of a microbial community use single-copy universally or nearly universally present genes, so-called “core” genes (41). The relaxation of universality is needed to account for the substantial gene content variation even among closely related bacteria (42, 43), as well as for the variable quality and completeness of the genomes. We focused our analyses on a similar, “relaxed core single copy” subset of protein-coding gene families within each of the four groups (Table 1; see Methods for details). The relaxation of the requirement of a gene to be present in all members of the group by allowing it to be absent in up to 10% of genomes ensured that a sufficiently large number of gene families were available for analyses. Additionally, keeping only


analyzed groups have been routinely subdivided into clusters designated as species and recognized as such in Bergey’s Manual, the authoritative classification compendium (15). However, for these three groups, the majority of ubiquitous gene families formed clusters indistinguishable from those simulated under the null model in 1 to 5% DNA sequence divergence interval, and therefore no speciation processes need be invoked to explain the observed diversity.

“single-copy” genes eliminated gene families expanded via gene duplication and/or horizontal gene transfer and therefore likely to have their evolutionary histories deviate from the assumptions of the null birth−death model.

Framework for Comparison of Divergence Patterns in Real and Simulated Gene Families. Divergence patterns presented as phylogenetic trees are difficult to compare, especially in large-scale analyses. Instead, for each gene family, we calculated the number of clusters at a specific level of divergence using furthest-neighbor hierarchical clustering (44) and summarized the data as a cluster−divergence (CD) curve (Fig. 1) (inspired by ref. 5). If a special process, like recombination or periodic selection, influences diversification of real gene families, the shapes of their CD curves are expected to differ from those of the simulated gene families. If no difference is detected, then it is inappropriate to conclude that such processes delimit the observed discrete groups. Because the focus of the study is on examination of bacterial microdiversity and biological inferences drawn therefrom, we have limited our analyses to evaluation of the divergence patterns that correspond to 0.01 to 0.05 nucleotide substitutions per gene family. The range was chosen because the clusters of prokaryotic sequences thought to mark “species” exhibit, on average, less than 5% and 3% of nucleotide substitutions per genome (45) and 16S rRNA gene (46), respectively.

To assess whether the differences in CD curve shapes between real and simulated data are significant, we developed a statistical test reminiscent of the widely used Kolmogorov−Smirnov (KS) test (47). The deviations in shapes of two CD curves are captured in a distance metric Dmax, which measures the maximum separation between the CD curves in a specified interval of divergence (see Methods for details). When Dmax between a real gene family and the median of the simulated gene families is compared with the distribution of Dmax among simulated gene families, the former distance metric can be classified as either falling within 95% of the values for the simulated gene families (i.e., indistinguishable from the diversification patterns under the null model; referred to as category D=) or significantly differing from the null model by being in the upper (D+) or lower (D−) 2.5 percentile. Comparison of the KS test with our test shows that ours is less conservative (Fig. S1), and thus is less likely to generate false matches to the null model.
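The comparison framework can be sketched in a few lines. This is an illustration under stated assumptions, not the published test: the exact definition of Dmax is given in the paper's Methods, which are not reproduced here, so the sketch assumes Dmax is the signed difference of largest magnitude between two CD curves on a shared divergence grid, and it uses nearest-rank percentiles. The function names (`merge_heights`, `cd_curve`, `dmax`, `classify`) are hypothetical.

```python
import math
import statistics

def merge_heights(dist):
    """Furthest-neighbor (complete-linkage) merge heights for a distance matrix."""
    clusters = [[i] for i in range(len(dist))]
    heights = []
    while len(clusters) > 1:
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                # Cluster-to-cluster distance is the largest pairwise distance.
                d = max(dist[i][j] for i in clusters[a] for j in clusters[b])
                if best is None or d < best[0]:
                    best = (d, a, b)
        d, a, b = best
        heights.append(d)
        clusters[a] = clusters[a] + clusters[b]
        del clusters[b]
    return heights

def cd_curve(dist, grid):
    """Number of clusters at each divergence level in grid."""
    heights = merge_heights(dist)
    return [len(dist) - sum(h <= level for h in heights) for level in grid]

def dmax(curve, reference):
    """Assumed form of Dmax: the signed difference with the largest magnitude."""
    return max((c - r for c, r in zip(curve, reference)), key=abs)

def classify(real_curve, sim_curves):
    """Place a real CD curve in D=, D+, or D- relative to simulated curves."""
    median_curve = [statistics.median(col) for col in zip(*sim_curves)]
    null = sorted(dmax(c, median_curve) for c in sim_curves)
    lo = null[max(0, math.ceil(0.025 * len(null)) - 1)]
    hi = null[min(len(null) - 1, math.ceil(0.975 * len(null)) - 1)]
    d = dmax(real_curve, median_curve)
    if d > hi:
        return "D+"
    if d < lo:
        return "D-"
    return "D="

# Toy gene family: two tight pairs of sequences separated by a deep split.
dist = [
    [0.00, 0.01, 0.30, 0.30],
    [0.01, 0.00, 0.30, 0.30],
    [0.30, 0.30, 0.00, 0.02],
    [0.30, 0.30, 0.02, 0.00],
]
grid = [0.01, 0.02, 0.03, 0.04, 0.05]  # the 0.01 to 0.05 divergence interval
curve = cd_curve(dist, grid)
```

With real data, `sim_curves` would hold the CD curves of the simulated gene families for the same bacterial group, evaluated on the same grid as the real gene family.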


Diversification Patterns Observed in Gene Families of Bacterial Groups Are Often Indistinguishable from Those Generated Under the Null Model. Similarity of 16S rRNA gene sequences is one

Fig. 1. An illustration of how diversification patterns of a gene family are represented in a CD plot. For each pair of taxa in a gene family, the nucleotide sequences are converted to pairwise distances. The distances are then clustered at a specific level of divergence using the furthest-neighbor algorithm (44). The number of clusters (y axis) is plotted against divergence (x axis; measured in number of substitutions per site). If all taxa in the gene family evolve at the same rate, there is a one-to-one correspondence between the points on the CD curve and the nodes of the genealogy of the examined gene family (shown above the plot), as exemplified by several dashed lines. In this specific case, the CD curve is identical to a Lineage-Through-Time (LTT) plot (106), a widely used approach (51, 107) that transforms the patterns of divergence on phylogenetic trees and assesses them quantitatively via the γ statistic (50, 108). However, although the equal rate of substitutions is assumed in our simulations under the null model, it is often violated in the real bacterial gene families. Therefore, the CD plots of real data are not LTT plots, and we have developed a statistical test distinct from the γ statistic (see Methods for details).

of the widely used approaches implemented to circumscribe OTUs, which are often used in lieu of species to describe microbial diversity within a group of microorganisms or in a microbial community (16). However, we found that, for all four analyzed groups, the clustering observed in the real 16S rRNA genes is indistinguishable from the simulations under the null model (Fig. 2A and Fig. S2). Ubiquitous housekeeping genes are also commonly used to identify clustering on species or “clonal complex” levels (e.g., for pathogen typing) via MLSA (48), and most of these loci are among the identified gene families. The majority of the MLSA loci-corresponding gene families of Escherichia, Borrelia, and Neisseria were classified as D= (Fig. 2B, Fig. S3, and Table S1), and therefore their diversification patterns are also indistinguishable from those simulated under the null model. Similar to the clustering patterns of their 16S rRNA and MLSA gene families, 89.3% of Escherichia and 99.7% of Borrelia relaxed core protein-coding gene families are classified as D= (Table 1, Fig. S4, and Fig. 3). This result is surprising, given that our gene family identification likely disproportionally selected very conserved gene families expected to be under purifying selection. Taken together, our observations imply that, for Escherichia and Borrelia, the evolutionary histories of the majority of gene families, including 16S rRNA and the MLSA loci, are statistically indistinguishable from the genes simulated to evolve under genetic drift. If recombination or periodic selection impacts these groups (49), they are not sufficiently strong or constrained by phylogeny to produce sequence clusters that are different from the simple model of random turnover of lineages over time. However, the results were somewhat different for H. pylori and Neisseria. In stark contrast to Escherichia and Borrelia, 98.8% of H. pylori gene families, including all MLSA loci, were not D= (Fig. 4, Fig. 
S3, Table 1 and Table S1). Moreover, all but one of the H. pylori non-D= gene families were D+ (Fig. 4D). Therefore, it is likely that the H. pylori lineage experiences evolutionary forces that violate at least some of the null model assumptions of constant diversity, constant rate of lineage birth/death, and absence of recombination. Such patterns of diversification could result from longer terminal branches

Fig. 2. The divergence patterns of (A) the Escherichia 16S rRNA gene and (B) most of MLSA loci are indistinguishable from those simulated under the null model of diversification. (A) CD curves of 100 simulations are shown in black, and real data are shown in purple. Based on our statistical test, the 16S rRNA gene CD curve is classified as D= in the 0.01 to 0.05 divergence interval. (B) CD curves of 1,000 gene family simulations are shown in black, and real data are shown in purple (D=) and blue (D−). Based on our statistical test, all but one of the MLSA loci CD curves are classified as D= in the 0.01 to 0.05 divergence interval (Table S1). The unit of divergence on the x axis of A and B is number of substitutions per site.

than Escherichia (55) and Borrelia (32), suggesting that, even in the presence of substantial amounts of recombination, the null model cannot be easily rejected.

Effects of Sampling on the Inferred Divergence Patterns. Incomplete and biased sampling (for example, of only pathogenic or drug-resistant strains within a species) can influence clustering independently of any speciation-like processes (22, 56) and, as a result, can affect measurements of clade diversifications (57). Even though hundreds to thousands of gene families were analyzed for hundreds of genomes within each group, some homologs within a gene family have identical nucleotide sequences, and, as a result, a gene family may contain fewer than 100 nonidentical (“unique”) nucleotide sequences. Therefore, our limited datasets likely miss some of the microdiversity that exists in real, fully sampled bacterial populations. If undersampling influences the apparent diversification patterns, the calculations of the values of Dmax for a given null distribution could be affected, which, subsequently, will result in false assignment of gene families into D+, D−, and D= categories. However, when simulated gene families were randomly subsampled at 25%, 50%, and 75% levels, the patterns of diversification of the subsampled datasets were not significantly different from each other and from the 100% sampled set (one-way ANOVA; F(3,396) = 0.115; P = 0.951; Fig. S5). Thus, the observed diversification patterns and the summary statistic Dmax appear to be robust to incomplete random sampling of a microbial community, consistent with findings of ref. 50. However, isolates with available genomes may not represent random sampling of both cultured strains and overall microbial community diversity. These nonrandom sampling biases are difficult to pinpoint and simulate.
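The subsampling check can be sketched as follows. This is a hedged illustration, not the authors' code: the helper names `subsample` and `one_way_anova` are hypothetical, and the grouping is an inference from the reported degrees of freedom, which would be consistent with four sampling levels (25%, 50%, 75%, and 100%) contributing 100 values each (4 − 1 = 3 and 400 − 4 = 396). Only the F statistic and its degrees of freedom are computed; obtaining the P value would require the F distribution from a statistics library.

```python
import random

def subsample(dist, fraction, rng):
    """Keep a random fraction of taxa from a square pairwise distance matrix."""
    n = len(dist)
    keep = sorted(rng.sample(range(n), max(2, round(fraction * n))))
    return [[dist[i][j] for j in keep] for i in keep]

def one_way_anova(groups):
    """Return (F, df_between, df_within) for a list of groups of values.

    Assumes at least some within-group variation (otherwise F is undefined).
    """
    k = len(groups)
    n_total = sum(len(g) for g in groups)
    grand_mean = sum(sum(g) for g in groups) / n_total
    means = [sum(g) / len(g) for g in groups]
    ss_between = sum(len(g) * (m - grand_mean) ** 2 for g, m in zip(groups, means))
    ss_within = sum((x - m) ** 2 for g, m in zip(groups, means) for x in g)
    df_between, df_within = k - 1, n_total - k
    f_stat = (ss_between / df_between) / (ss_within / df_within)
    return f_stat, df_between, df_within

rng = random.Random(7)

# Toy distance matrix for 8 taxa, subsampled to 50% of taxa.
toy = [[abs(i - j) / 100 for j in range(8)] for i in range(8)]
half = subsample(toy, 0.5, rng)

# Placeholder Dmax values for four sampling levels, 100 values per level.
levels = [[rng.gauss(0.0, 1.0) for _ in range(100)] for _ in range(4)]
f_stat, df_b, df_w = one_way_anova(levels)
```

In an actual analysis, each group would hold the Dmax values obtained from CD curves of gene families subsampled at one level, rather than the random placeholders used here.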
To investigate the impact of such potential biases empirically, we compared diversification patterns observed in the detected gene families to those in the much more extensive collection of the MLSA database records (refs. 39 and 40 and Table S1). Although these records are likely to have skewed sampling due to their bias for pathogenic strains, when corrected for the difference in the absolute number of nonidentical sequences between the two datasets, the overall


than expected under the pure birth−death process. Longer terminal branches can be, for example, a result of the variable net rate of birth and death, or due to niche expansion (50) or ecological diversification (51). Consistent with the latter possibility, the phylogenetic trees of H. pylori strains and human mitochondrial DNA from various geographic regions are congruent (35). High levels of recombination are also shown to result in phylogenetic trees that have longer terminal branches and shorter internal branches (9, 52). Given that H. pylori is one of the most mutagenic and recombinogenic bacteria known (53), with the recombination rate being at least an order of magnitude higher than the substitution rate (54), the observed long terminal branches may, in addition, reflect the combination of elevated substitution and recombination rates. Members of the genus Neisseria are also known to be highly recombinogenic, albeit to a lesser degree than H. pylori (55). Although the majority of Neisseria gene families were classified as D= (Fig. S4E and Table 1), the fraction (67.1%) was much smaller than for Borrelia and Escherichia. Also, unlike in H. pylori comparisons, most of the non-D= gene families were D− (Fig. S4F and Table 1). These observations suggest that processes shaping diversification of Neisseria are different from those of H. pylori, Escherichia, and Borrelia. Although all known Neisseria strains have a natural propensity for DNA uptake, there are barriers to successful between-strain gene exchange (29), which is probably why there are recognizable (albeit fuzzy) clusters on phylogenetic networks of marker genes (25). In particular, a genetically uniform group of N. gonorrhoeae forms a clearly isolated cluster (25), which may be explained by N. gonorrhoeae’s restriction to the urogenital tract, an ecological niche distinct from the oral and nasal cavities colonized by most known Neisseria spp. (25). Suspecting that such distinct patterns of N.
gonorrhoeae diversification and “geographic” isolation may have impacted shapes of CD curves of its gene families, we reanalyzed the Neisseria group without N. gonorrhoeae genomes. In this reduced dataset, 82.7% of Neisseria gene families became statistically indistinguishable from the simulated gene families (Table 1 and Fig. S4H), making the overall results comparable to those of Escherichia and Borrelia. However, Neisseria are more recombinogenic

Fig. 3. The divergence patterns of most Escherichia gene families are indistinguishable from those simulated under the null model of diversification. (A) CD curves of 50 gene families randomly sampled from the 1,000 gene families simulated under the null model. The simulations were tailored to this dataset as described in Methods. (B−D) CD curves of 2,837 Escherichia gene families divided into three categories: statistically indistinguishable from simulations [(B) D=, n = 2,534] or significantly different [(C) D−, n = 291; (D) D+, n = 12]. Highlighted in red is the most divergent gene family, fliC, the H antigen in E. coli. The curves in B and C were randomly subsampled to 50 for representation purposes.

shapes of CD curves of the MLSA loci were qualitatively comparable to those of the real gene families (Fig. S3 D–G). Future analyses of genomic sets of a substantially larger size will allow testing of the null hypothesis at precisions higher than used in this study.

Dmax as a Metric for Identifying Gene Families Under Selection and Those Suitable as Phylogenetic Markers of Microdiversity. Are there any common properties of non-D= gene families that may hint at the evolutionary and ecological forces that shape their evolutionary history? Compared with D= gene families, D− and D+ gene families had significantly smaller and larger tree heights (i.e., the sum of all branch lengths of a phylogenetic tree), respectively (Fig. S6A). This relation was not due to elevated substitution rates in just a few taxa within a gene family: In many cases, the overall substitution rates for gene families in D+, D=, and D− categories differed by an order of magnitude, as reflected in the significantly higher and lower number of nonidentical sequences in D+ and D− gene families, respectively (Fig. S6B), and as exemplified by trees of selected gene families (Fig. S7). Taken together, these observations suggest that, in general, D+ gene families represent faster-evolving genes, whereas D− gene families are very conserved. Consistent with this hypothesis, 291 D− gene families of Escherichia are significantly enriched in genes encoding translation machinery and significantly depleted for several categories of metabolic genes (Fig. S8), a functional bias expected for conserved genes (58). Additionally, the only D− gene families in Borrelia and H. pylori (acpP and rps19, respectively; Fig. S4C and Fig. 4C) exhibit strong purifying selection in 95% and 98% of their codons (Table 2). (Among Neisseria’s D− gene families, no significant depletion or enrichment of functional categories was detected after correction for multiple testing, possibly reflecting the lack of power to detect statistical significance from the smaller dataset.)

Among the identified D+ gene families, fliC of Escherichia exhibits a notably elevated rate of nonsynonymous substitutions, with 18% and 43% of the gene’s sites inferred to be under positive and neutral selection, respectively (Table 2). The fliC gene encodes flagellin, the flagellar filament protein. fliC is known to be a highly polymorphic gene, the evolutionary history of which is shaped by both fixation of mutations and horizontal gene transfer, not only in Escherichia/Salmonella (59) but also in other bacterial pathogens (60, 61). Given that the FliC protein serves as an antigen (known as an H antigen of Gram-negative bacteria) and an important factor in animal innate immune response (62), generation of fliC allelic diversity via mutation and recombination is one of the strategies for antigenic variation necessary for the arms race against host immune responses (59, 63). Similarly, among H. pylori’s numerous D+ gene families, mod-1 and omp3 exhibit notably elevated substitution rates (Table 2). In many pathogens, mod genes are hypothesized to function as regulators that control gene expression of surface antigens (64, 65). Interestingly, H. pylori exhibits a very high level of variability in its methyltransferase-encoding genes even in

[Figure residue, apparently Fig. 4; caption not recovered. Panels A–D plot # of clusters (y axis) against divergence (x axis). Legend entries: Simulations; D=; Ribosomal protein S19 (D−); D+; mod-1 (D+); omp3 (D+).]

isolates from a restricted geographic region (66), highlighting the importance of diversifying selection on these loci in this pathogen. H. pylori also carries an unusually large and diverse family of outer membrane proteins (including omp3), and this gene family diversity is hypothesized to be adaptive (67) and to influence the pathogenic success of this bacterium (68). Not all D+ gene families exhibit signatures of positive selection, as exemplified by the pilS and mutL gene families in Neisseria (Table 2). The pilS gene is part of the two-component signal transduction system PilR/PilS that regulates transcription of the pilin gene in Pseudomonas (69), a component of type IV pili present in Neisseria (70). Although it is not clear why pilS in Neisseria has the observed elevated substitution rates, notably, in N. gonorrhoeae, it no longer controls piliation but is still expressed, suggesting its recruitment for some other function (71). Such neofunctionalization, if occurring in Neisseria, may underlie the observed relaxation of selection

Table 2. Inference of selection in certain gene families with high Dmax

Bacterial group | Gene family (its functional annotation) | Category | Tree height, substitutions/site | No. of unique (total) sequences | Purifying | Neutral | Positive
--- | --- | --- | --- | --- | --- | --- | ---
Escherichia | fliC (flagellin) | D+ | 40.39 | 138 (610) | 0.40 (0.04) | 0.42 (1) | 0.18 (1.94)
Borrelia | acpP (acyl carrier protein) | D− | 8.9 | 28 (95) | 0.95 (0.02) | 0.05 (1) | 0 (-)
H. pylori | rpS19 (ribosomal protein S19) | D− | 0.73 | 70 (232) | 0.99 (0.04) | 0 (1) | 0.01 (4.89)
H. pylori | mod-1 (type III restriction-modification system methyltransferase) | D+ | 29.56 | 185 (233) | 0.53 (0.08) | 0.43 (1) | 0.04 (3.71)
H. pylori | omp3 (outer membrane protein) | D+ | 17.66 | 188 (241) | 0.61 (0.11) | 0.28 (1) | 0.11 (3.16)
Neisseria | mutL (MMR) | D+ | 11.89 | 93 (229) | 0.86 (0.03) | 0.13 | 0.01 (2.74)
Neisseria | pilS (transcription regulator) | D+ | 27.53 | 88 (233) | 0.91 (0.11) | 0.09 (1) | 0 (-)

The Purifying, Neutral, and Positive columns give the proportion of sites in the three categories of the M2a model, with the estimated dN/dS in parentheses.

PNAS | Published online June 19, 2017 | E5419

EVOLUTION

Fig. 4. The divergence patterns of most H. pylori gene families are distinct from those simulated under the null model of diversification. (A) CD plot of the 50 gene families randomly sampled from the 1,000 gene families simulated under the null model. The simulations were tailored to this dataset as described in Methods. (B−D) CD plots of 1,138 H. pylori gene families divided into three categories: statistically indistinguishable from simulations [(B) D=, n = 13] or significantly different [(C) D−, n = 1; (D) D+, n = 1,124]. The only D− gene family (rps19) encodes ribosomal protein S19. Highlighted in red and green are two of the most divergent D+ gene families: mod-1, encoding a type III restriction modification system methyltransferase, and omp3, encoding an outer membrane protein. For representation purposes, the remaining D+ gene families shown in D are a subset of 50 randomly sampled families.

on the pilS nucleotide sequence. The mutL gene, which encodes a component of the mismatch repair (MMR) system, is known to be frequently horizontally transferred (72, 73) and shows evidence for extensive intragenic recombination (72). In agreement with our analyses (Table 2), most of the nucleotide variation observed in mutL homologs does not cause a change in amino acid (72), and horizontal acquisition of the mutL gene is hypothesized to be a mechanism for functional restoration of the MMR system in hyperrecombinogenic mutants (72). Combined with the observation that most of the highly recombinogenic H. pylori gene families are in the D+ category, these results suggest that D+ gene families may be enriched in genes that have undergone strong diversifying selection of the corresponding amino acid sequence and/or experienced extensive horizontal gene transfer.

Regardless of the underlying evolutionary forces, D+ gene families promise to be useful phylogenetic markers for delineating microdiversity within a group of closely related bacteria. Some of the identified D+ gene families have already been either shown to be efficacious markers for typing of pathogenic isolates [e.g., porB (74) and lip (75) for Neisseria, and the above-discussed fliC (76, 77) for E. coli] or identified as potential markers of this kind [e.g., E. coli housekeeping genes ugd and galF that often flank frequently horizontally exchanged genes involved in the O antigen biosynthesis (78, 79)]. Therefore, the presented framework could help identify loci promising for describing bacterial genotypic variation that is driven by ecological forces.

Concluding Remarks: Implications for the Search for the Biological Process Underlying Bacterial Speciation. Our analysis shows that many of the diversification patterns observed in 16S rRNA genes, MLSA markers, and core gene families are indistinguishable from a random process of lineage birth−death and, therefore, could be a result of neutral lineage turnover. Hence, for these markers, the observed genotypic clustering alone cannot be used as evidence for the existence of selection-driven ecological processes behind its formation, as is often done (6, 14, 42, 43, 80), even if the patterns are compatible with such models. Taken together with the recent finding that patterns of diversification arising from distinct processes of periodic selection and recombination also mimic each other in appearance (10), our analysis highlights yet another challenge to identifying a single biological process that defines bacterial species (81). Moreover, distinct evolutionary patterns of divergence in most gene families of the H. pylori group and many gene families of the Neisseria group hint that clustering patterns observed in bacterial groups are not formed by one universal process, and the answer to the question “How do bacteria diverge?” is likely to be group-specific and not generalizable, as has been hypothesized (49). We recommend neutral birth−death models such as that used here as null hypotheses in future prokaryotic speciation studies.

Methods

Sources for the Bacterial Datasets. Nucleotide and amino acid sequences of the protein-coding genes from the genomes of Escherichia spp., Neisseria spp., Borrelia spp., and Helicobacter pylori (Dataset S1) were obtained from the Pathosystems Resource Integration Center (PATRIC) database (82) between June 2014 and April 2015. Nucleotide sequences of 16S rRNA and MLSA genes were extracted from the same PATRIC records. Additional nucleotide sequences from MLSA loci for E. coli (40), B. burgdorferi (33), Neisseria (39), and H. pylori (39) (Table S1) were obtained in April 2015 from the MLST databases hosted at https://pubmlst.org/ and mlst.warwick.ac.uk/mlst/dbs/Ecoli.
Identification of Relaxed Single-Copy Core Gene Families. Amino acid sequences of protein-coding genes in the Escherichia spp., Neisseria spp., Borrelia spp., and Helicobacter pylori groups (analyzed separately) were used as an input in all-versus-all sequence similarity searches, which were conducted using Afree v2.0beta (83) with a word size of five and a minimum Sørensen-Dice (SD) similarity index (84, 85) of 10. The obtained SD index-based similarity relationships among the genes were used to determine the gene families using the Markov Cluster Algorithm v14-137 (86, 87) with an inflation parameter of 1.2. The acquired gene families were subsequently filtered to identify relaxed core gene families by requiring the genes in a gene family to be present in most (but not all) genomes of the analyzed group. Because the four analyzed bacterial groups have different numbers of genomes (Table 1), this requirement translated into a constraint for the gene family homologs to be present in 90% (Borrelia), 95% (Neisseria and H. pylori), and 99% (Escherichia) of genomes, with less stringency imposed on bacterial groups with a smaller number of genomes. The relaxed core gene families were further filtered for single-copy relaxed core gene families, defined as those with fewer than 1.1 gene copies per genome on average across all genomes in a group. Finally, to avoid inclusion of homologs that only shared a protein domain, sequences whose length was more than 2 standard deviations from the mean amino acid sequence length of the gene family were removed. For brevity, we refer to the resulting single-copy relaxed core gene families simply as gene families throughout the manuscript. All further analyses were performed on the nucleotide sequences of the gene families, which are available via FigShare (https://dx.doi.org/10.6084/m9.figshare.4731946).

Modeling Genealogies of Gene Families That Evolve Under the Null Model. To simulate the evolutionary history of a gene family that is not affected by ecological and evolutionary forces (our null model), we implemented a simple birth−death process known as the Hey/Moran model (88, 89). Under this model, at each generation, a constant-size population of n gene family-containing lineages experiences only an extinction of one randomly chosen lineage (death) and a duplication of a different, also randomly chosen, lineage (birth).
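As an illustration only (this is not the authors' implementation, which is distributed separately), the birth−death step just described can be sketched in plain Python. The function name `simulate_moran` and the bookkeeping arrays `founder` and `split` are hypothetical names chosen for this sketch. It assumes that the process starts from n unrelated lineages and stops once every extant lineage descends from a single starting lineage, and it reports pairwise distances as raw generation counts rather than building a neighbor-joining tree.

```python
import random


def simulate_moran(n, rng=None):
    """Run the birth-death process until all n lineages share one founder.

    Returns (dist, generations), where dist[i][j] is the number of
    generations separating lineages i and j from their last common ancestor.
    """
    rng = rng or random.Random()
    founder = list(range(n))              # starting lineage each slot descends from
    split = [[0] * n for _ in range(n)]   # generation at which i and j diverged
    generation = 0
    while len(set(founder)) > 1:          # not yet coalesced
        generation += 1
        # one lineage dies; a different lineage duplicates into its slot
        dead, parent = rng.sample(range(n), 2)
        founder[dead] = founder[parent]
        for k in range(n):
            if k != dead:
                # the new copy inherits its parent's divergence times
                split[dead][k] = split[k][dead] = split[parent][k]
        # the copy and its parent have only just diverged
        split[dead][parent] = split[parent][dead] = generation
    dist = [[0 if i == j else generation - split[i][j] for j in range(n)]
            for i in range(n)]
    return dist, generation
```

Because all lineages are sampled at the same generation, the resulting distance matrix is ultrametric; any distance-based tree-building method can take it as input.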
The birth−death process is repeated until all n lineages in the population, when tracked back in time, coalesce to a single common ancestor (18). The resulting genealogical relationships among n extant lineages can be summarized in a phylogenetic tree. To do so, we first define a distance between two extant lineages as a value proportional to the number of generations to their last common ancestor, then calculate pairwise distances for all extant lineages, and, finally, transform the resulting distance matrix into a phylogenetic tree using the neighbor-joining method (90), as implemented in Biopython v1.64 (91). The Python source code of the birth−death model is available at https://github.com/ecg-lab/microdiversity/. The simulations were independently repeated 1,000 times to obtain evolutionary histories of 1,000 gene families, and the procedure was repeated separately for each bacterial group. Given that the real gene families varied in size, the size of each simulated gene family (i.e., the number of lineages n) was drawn from the distribution of the observed real gene family sizes in the respective bacterial group.

Simulation of Nucleotide Sequences of Neutrally Evolving Gene Families. Among the gene families of real bacterial groups, we expect to observe a gamut of gene lengths, variable nucleotide composition, and levels of sequence conservation (which can be measured by tree height) that range from “ultraconserved” to “quickly evolving” gene families. This variation was captured as distributions of the parameters observed within each analyzed bacterial group. The distributions were used to mimic variation among the simulated gene families. Specifically, the genealogical history of each simulated gene family was scaled with a tree height randomly drawn from the observed range of tree heights. The length of the gene in a simulated gene family was drawn from the distribution of observed gene lengths. Lastly, the nucleotide sequence of the last common ancestor (i.e., the sequence at the root node of the simulated gene family’s genealogy) was generated randomly, but with nucleotide composition drawn from the distribution of observed nucleotide frequencies in real gene families. Using the ancestral sequence and the scaled simulated genealogical history as an input tree, nucleotide sequences of a simulated gene family were evolved in Seq-Gen v1.3.3 (92) under the F81 substitution model (19). This model takes into account empirical nucleotide frequencies but, otherwise, does not make any additional assumptions about how nucleotide changes occur over time, and therefore represents neutral accumulation of the substitutions over time. No insertions or deletions were simulated.

Adjustments for the Analyses of 16S rRNA Gene. Due to the presence of multiple, often nonidentical copies of 16S rRNA genes (93, 94) in many of the genomes in the four analyzed groups, and the much higher nucleotide conservation of the 16S rRNA gene, a slightly different methodology was used for both the assembly of the 16S rRNA gene family from the analyzed genomes and the simulation of 16S rRNA-like sequences. Specifically, 16S rRNA-like gene genealogies were simulated 100 times, using the number of unique rRNA sequences in the

Straub and Zhaxybayeva

Alignments of Gene Families. Nucleotide sequences of each real gene family, as well as of the MLSA loci, were aligned using the MUSCLE v3.8.31 program (95) with a maximum number of iterations set to 16. Sequences with fewer than 50 nucleotides aligned to all other sequences were removed. Alignments of 16S rRNA genes were calculated using the SINA program (96), which incorporates structural information of the rRNA. Alignment was unnecessary for simulated gene families, as simulations did not incorporate insertions or deletions of nucleotides.

Quantification of Divergence Patterns. For each gene family, pairwise nucleotide distances were calculated from the gene family alignment using the Mothur v1.32.1 program (44). These distances were not corrected for multiple substitutions, because we assumed that genomes within analyzed groups have recently diverged and therefore are not expected to accumulate many multiple substitutions, and, additionally, we analyzed only distances between 0.01 and 0.05 (i.e., divergence of 1 to 5%). The calculated distances were used to group taxa into clusters at various distance cutoffs using the furthest-neighbor algorithm with a precision parameter of 1,000, as implemented in Mothur v1.32.1 (44). The resulting patterns of divergence were captured in a CD curve, in which the number of clusters in a gene family (y axis) is a function of genetic divergence (x axis) (Fig. 1). The shape of the CD curve is expected to reflect the evolutionary processes (e.g., selective sweeps, recombination) that influence the history of the gene family.

Assignment of Statistical Significance to Differences in Divergence Patterns. To determine if the shape of a given gene family’s CD curve is significantly different from that of the simulated neutrally evolving gene families (the null distribution of CD curves), we developed a statistical method inspired by an application of the KS test (47) to the comparison of evolutionary histories (97).
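To make the CD curve from Quantification of Divergence Patterns concrete, here is a minimal pure-Python sketch that counts furthest-neighbor (complete-linkage) clusters at a series of divergence cutoffs. It is a stand-in for illustration and does not reproduce Mothur's implementation; in particular, the rounding controlled by Mothur's precision parameter is not modeled. The function names `complete_linkage_heights` and `cd_curve` are hypothetical.

```python
def complete_linkage_heights(dist):
    """Return the merge heights of furthest-neighbor clustering."""
    n = len(dist)
    active = set(range(n))
    # distance between every pair of current clusters, keyed by (low id, high id)
    d = {(i, j): dist[i][j] for i in range(n) for j in range(i + 1, n)}
    heights = []
    while len(active) > 1:
        # merge the two clusters whose furthest members are closest
        (a, b), height = min(d.items(), key=lambda item: item[1])
        heights.append(height)
        for c in active:
            if c in (a, b):
                continue
            key_a = (min(a, c), max(a, c))
            key_b = (min(b, c), max(b, c))
            # complete linkage: the merged cluster is as far as its furthest part
            d[key_a] = max(d[key_a], d.pop(key_b))
        del d[(a, b)]
        active.remove(b)
    return heights


def cd_curve(dist, cutoffs):
    """Number of clusters at each divergence cutoff (one CD curve)."""
    heights = complete_linkage_heights(dist)
    return [len(dist) - sum(h <= c for h in heights) for c in cutoffs]
```

For example, four sequences forming two close pairs (distances 0.01 and 0.02) that are 0.04 to 0.05 apart from each other yield 4, 3, 2, 2, 2, and 1 clusters at cutoffs 0.00 through 0.05.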
Although our test resembles the KS test, we introduced improvements that allowed us to assess differences in specific parts of two curves, to take into account how variable the null distribution is, and to discriminate between two classes of significantly different CD curves (that we designate as D+ and D−; see below). Specifically, in our method, the distance D(x) between a CD curve Y of a real gene family and a distribution of CD curves of the simulated gene families N = {N1, N2, ..., N1000} (the null distribution) at a particular divergence x is defined as

\[
D(x) =
\begin{cases}
Y(x) - \operatorname{median}(N(x)), & \text{if } \operatorname{MAD}(N(x)) \le 1; \\[4pt]
\dfrac{Y(x) - \operatorname{median}(N(x))}{\operatorname{MAD}(N(x))}, & \text{if } \operatorname{MAD}(N(x)) > 1;
\end{cases}
\]

and

\[
D_{\max} = \max_{x} \lvert D(x) \rvert ,
\]

where x varies across a specified divergence range (in our case, between 0.01 and 0.05, at 0.01 intervals; the clusters at divergence of 0.00 were excluded to reduce the impact of biased sampling). Using the same procedure, each simulated gene family was compared with all simulated gene families to obtain the set of Dmax values of the null distribution. Based on the Dmax metric, real gene families were classified into three categories, D+, D−, and D=, which correspond to sets of gene families with Dmax values greater than the top 2.5%, less than the bottom 2.5%, or within the middle 95%, respectively, of the Dmax values of the null distribution. Patterns of diversification of a given gene family were designated as significantly different from the null distribution in a given divergence range if the family was classified into the D+ or D− category; this corresponds to a P value < 0.05, because the expectation of seeing the specific Dmax value due to chance is less than 5%.

Assessment of Incomplete Sampling on CD Curve Shape. To investigate the effects of random subsampling from a larger population, four different population sizes (100, 133, 200, and 400 lineages) were simulated 100 times each. From each population size, 100 random genes were subsampled without replacement, which corresponds to 100% (100 out of 100), 75% (100 out of 133), 50% (100 out of 200), and 25% (100 out of 400) sampling rates. Distance Dmax was calculated for each gene family in both the full simulation and each subsample. The distributions of Dmax were compared by one-way ANOVA (98).

Assignment and Analysis of Gene Families’ Functional Annotations. A randomly selected amino acid sequence from each gene family was used as a query in a protein BLAST (99) search (database size 10^7, e value < 10^−5, low-complexity filter on) against the Clusters of Orthologous Groups (COG) database (2014 update; accessed at ftp://ftp.ncbi.nih.gov/pub/COG/COG2014/data) (100). Each gene family was assigned a functional annotation of the top-scoring BLASTP match. If a gene family had no homologs in the COG database, it was excluded from further analysis. Underrepresentation and overrepresentation of specific functional categories among the gene families classified as either D+ or D− was evaluated using Fisher’s exact test (101).

Inference of Selection Regimes in Gene Families with the Elevated Dmax. From each of the four analyzed bacterial groups, gene families with the largest values of Dmax in the [0.01; 0.05] divergence interval in both D+ and D− categories were examined for evidence of selection (Table 2). For these gene families, the codon nucleotide alignments were created using translatorX v1.1 (102) from the amino acid sequence alignment generated in MUSCLE v3.8.31 (95). Maximum likelihood phylogenetic trees were reconstructed in the RAxML program v7.3.6 (103) under the GTR+Γ substitution model [general time-reversible (GTR) substitution model with among-site rate variation accommodated using Γ distribution] and with 100 rapid bootstraps. Using these trees and corresponding codon alignments as an input, the fraction of sites under selection, as well as estimates of dN and dS, were inferred under the M2a model (104), as implemented in the PAML program v4.8 (105).
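The D(x) and Dmax statistics and the D+/D=/D− classification from Assignment of Statistical Significance can be sketched as follows. This is a hedged illustration, not the authors' code. A CD curve is assumed to be a list of cluster counts at the divergence cutoffs 0.01 to 0.05; MAD is the median absolute deviation of the null cluster counts at one cutoff. The function names are hypothetical, and one assumption is made explicit: Dmax is taken to be the D(x) of largest magnitude with its sign retained, because the text separates families above the null range (D+) from those below it (D−), whereas the printed formula takes an absolute value.

```python
from statistics import median, quantiles


def null_summary(null_curves):
    """Median and MAD of the null cluster counts at each divergence cutoff."""
    summary = []
    for counts in zip(*null_curves):
        med = median(counts)
        mad = median([abs(c - med) for c in counts])
        summary.append((med, mad))
    return summary


def d_profile(curve, summary):
    """D(x) at each cutoff; divided by MAD only where MAD > 1."""
    return [(y - med) / mad if mad > 1 else y - med
            for y, (med, mad) in zip(curve, summary)]


def d_max(curve, null_curves):
    """D(x) of largest magnitude, with its sign kept (assumption, see text)."""
    return max(d_profile(curve, null_summary(null_curves)), key=abs)


def classify(curve, null_curves):
    """Label a CD curve as D+, D-, or D= relative to the null curves."""
    summary = null_summary(null_curves)
    # each simulated family is compared with all simulated families
    null_values = [max(d_profile(c, summary), key=abs) for c in null_curves]
    cuts = quantiles(null_values, n=40, method="inclusive")
    low, high = cuts[0], cuts[-1]   # 2.5th and 97.5th percentiles
    value = max(d_profile(curve, summary), key=abs)
    if value > high:
        return "D+"
    if value < low:
        return "D-"
    return "D="
```

Precomputing the per-cutoff median and MAD once keeps the classification of many real gene families against the same 1,000 simulated curves inexpensive.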

In the definition of D(x), Y(x) is the number of clusters in the given gene family at divergence x, median(N(x)) is the median number of clusters of the null distribution N at divergence x, and MAD(N(x)) = median(|Ni(x) − median(N(x))|) is the median absolute deviation of the null distribution at divergence x. As a robust measure of variability among a set of measurements, MAD is used here to take into account the spread of the null distribution N at a specific divergence x. When the simulations produced a variable distribution [i.e., MAD(N(x)) > 1], D(x) was normalized by MAD(N(x)) to avoid an inflation of D(x). When simulations were less variable [i.e., MAD(N(x)) ≤ 1], the normalization was not used. Each real gene family was then associated with a single value, Dmax, defined as the maximum of |D(x)| over the examined divergence range.

ACKNOWLEDGMENTS. We thank Dr. W. Ford Doolittle for numerous stimulating discussions of the project and for comments on the manuscript. We also thank Drs. Deborah Hogan, Phillip Honenberger, Mark McPeek, and Camilla Nesbø for critical reading of the manuscript drafts, and two anonymous reviewers for the constructive and insightful feedback. This work was supported by the Simons Foundation Investigator in Mathematical Modeling of Living Systems award (to O.Z.) and Dartmouth start-up funds (to O.Z.).

1. Locey KJ, Lennon JT (2016) Scaling laws predict global microbial diversity. Proc Natl Acad Sci USA 113:5970–5975. 2. Hanage WP, Fraser C, Spratt BG (2006) Sequences, sequence clusters and bacterial species. Philos Trans R Soc Lond B Biol Sci 361:1917–1927. 3. Hunt DE, et al. (2008) Resource partitioning and sympatric differentiation among closely related bacterioplankton. Science 320:1081–1085. 4. Caro-Quintero A, Konstantinidis KT (2012) Bacterial species may exist, metagenomics reveal. Environ Microbiol 14:347–355. 5. Acinas SG, et al. (2004) Fine-scale phylogenetic architecture of a complex bacterial community. Nature 430:551–554. 6. Cohan FM, Perry EB (2007) A systematics for discovering the fundamental units of bacterial diversity. Curr Biol 17:R373–R386. 7. Cohan FM (2016) Bacterial speciation: Genetic sweeps in bacterial species. Curr Biol 26:R112–R115. 8. Hanage WP, Spratt BG, Turner KM, Fraser C (2006) Modelling bacterial speciation. Philos Trans R Soc Lond B Biol Sci 361:2039–2044.


real bacterial group as a value of n. For the corresponding nucleotide sequence simulations, these genealogies were scaled with respect to the maximum divergence observed for the real 16S rRNA gene family, and the length of the ancestral nucleotide sequence was set to 1,500 bp. The nucleotide sequences were evolved in Seq-Gen v1.3.3 (92) under the F81 substitution model (19). Because presence of multiple nonidentical copies of a 16S rRNA gene per genome may affect the inferred clustering patterns, we also constructed an additional 16S rRNA gene family that included only one, randomly selected, full-length, 16S rRNA gene copy per genome, and we investigated the differences in observed clustering in the full and subsampled 16S rRNA gene families (Fig. S2A). As expected, such data subsampling underestimated the amount of observed microdiversity, but the overall shapes of the two CD curves remained similar (Fig. S2A). Given that our simulations assume one gene copy per lineage, we used subsampled 16S rRNA gene family in the comparisons of real and simulated data.
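For readers unfamiliar with the F81 model used in these simulations, the sketch below evolves a sequence along a single branch under F81. It is not Seq-Gen and makes no claim about Seq-Gen's internals. It relies on the standard closed form of F81, in which a site is left untouched with probability exp(−βt) and is otherwise redrawn from the equilibrium base frequencies, with β = 1/(1 − Σπ²) so that the branch length t is measured in expected substitutions per site. The names `f81_beta` and `f81_evolve` are hypothetical.

```python
import math
import random

BASES = ["A", "C", "G", "T"]


def f81_beta(freqs):
    """Rate scaling that makes branch lengths expected substitutions per site."""
    return 1.0 / (1.0 - sum(freqs[b] ** 2 for b in BASES))


def f81_evolve(seq, branch_length, freqs, rng=None):
    """Evolve seq along one branch under F81 with base frequencies freqs."""
    rng = rng or random.Random()
    p_no_event = math.exp(-f81_beta(freqs) * branch_length)
    weights = [freqs[b] for b in BASES]
    evolved = []
    for base in seq:
        if rng.random() < p_no_event:
            evolved.append(base)          # site untouched on this branch
        else:
            # redraw from the equilibrium frequencies (may return the same base)
            evolved.append(rng.choices(BASES, weights=weights)[0])
    return "".join(evolved)
```

Applying `f81_evolve` recursively from the root sequence down each branch of a scaled genealogy would give one simulated gene family; no insertions or deletions arise, so the sequences stay aligned.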

9. Feil EJ (2004) Small change: Keeping pace with microevolution. Nat Rev Microbiol 2: 483–495. 10. Shapiro BJ, et al. (2012) Population genomics of early events in the ecological differentiation of bacteria. Science 336:48–51. 11. Shapiro BJ, Polz MF (2014) Ordering microbial diversity into ecologically and genetically cohesive units. Trends Microbiol 22:235–247. 12. Bendall ML, et al. (2016) Genome-wide selective sweeps and gene-specific sweeps in natural bacterial populations. ISME J 10:1589–1601. 13. Mende DR, Sunagawa S, Zeller G, Bork P (2013) Accurate and universal delineation of prokaryotic species. Nat Methods 10:881–884. 14. Thompson CC, et al. (2015) Microbial taxonomy in the post-genomic era: Rebuilding from scratch? Arch Microbiol 197:359–370. 15. Whitman W, ed (2015) Bergey’s Manual of Systematics of Archaea and Bacteria (Bergey’s Manual Trust, Athens, GA). 16. Huse SM, et al. (2014) VAMPS: A website for visualization and analysis of microbial population structures. BMC Bioinformatics 15:41. 17. Wakeley J (2009) Coalescent Theory: An Introduction (Roberts and Company, Greenwood Village, CO). 18. Kingman JFC (1982) The coalescent. Stochastic Process Appl 13:235–248. 19. Felsenstein J (1981) Evolutionary trees from DNA sequences: A maximum likelihood approach. J Mol Evol 17:368–376. 20. Welch RA (2006) The genus Escherichia. Proteobacteria: Gamma Subclass, The Prokaryotes, eds Dworkin M, Falkow S, Rosenberg E, Schleifer K-H, Stackebrandt E (Springer, New York), Vol 6, pp 60−71. 21. Luo C, et al. (2011) Genome sequencing of environmental Escherichia coli expands understanding of the ecology and speciation of the model bacterial species. Proc Natl Acad Sci USA 108:7200–7205. 22. Walk ST (2015) The “cryptic” Escherichia. EcoSal Plus 6:10.1128/ecosalplus.ESP0002-2015. 23. Octavia S, Lan R (2014) The family Enterobacteriaceae. The Prokaryotes: Gammaproteobacteria, eds Rosenberg E, DeLong EF, Lory S, Stackebrandt E, Thompson F (Springer, Berlin), pp 225−286. 
24. Liu S, et al. (2015) Escherichia marmotae sp. nov., isolated from faeces of Marmota himalayana. Int J Syst Evol Microbiol 65:2130–2134. 25. Bennett JS, Bratcher HB, Brehony C, Harrison OB, Maiden MCJ (2014) The genus Neisseria. The Prokaryotes: Alphaproteobacteria and Betaproteobacteria, eds Rosenberg E, DeLong EF, Lory S, Stackebrandt E, Thompson F (Springer, Berlin), pp 881−900. 26. Maiden MCJ, Harrison OB (2016) Population and functional genomics of Neisseria revealed with gene-by-gene approaches. J Clin Microbiol 54:1949–1955. 27. Bennett JS, et al. (2012) A genomic approach to bacterial taxonomy: An examination and proposed reclassification of species within the genus Neisseria. Microbiology 158:1570–1580. 28. Bratcher HB, Bennett JS, Maiden MCJ (2012) Evolutionary and genomic insights into meningococcal biology. Future Microbiol 7:873–885. 29. Rotman E, Seifert HS (2014) The genetics of Neisseria species. Annu Rev Genet 48:405–431. 30. Caimano M (2006) The genus Borrelia. Proteobacteria: Delta, Epsilon subclass, The Prokaryotes, eds Dworkin M, Falkow S, Rosenberg E, Schleifer K-H, Stackebrandt E (Springer, New York), Vol 7, pp 235−293. 31. Karami A, Sarshar M, Ranjbar R, Zanjani RS (2014) The phylum Spirochaetaceae. The Prokaryotes: Other Major Lineages of Bacteria and The Archaea, eds Rosenberg E, DeLong EF, Lory S, Stackebrandt E, Thompson F (Springer, Berlin), pp 915−929. 32. Wang G, Schwartz I (2015) Borrelia. Bergey’s Manual of Systematics of Archaea and Bacteria (John Wiley, New York). 33. Margos G, et al. (2009) A new Borrelia species defined by multilocus sequence analysis of housekeeping genes. Appl Environ Microbiol 75:5410–5416. 34. Mitchell HM, Rocha GA, Kaakoush NO, O’Rourke JL, Queiroz DMM (2014) The family Helicobacteraceae. The Prokaryotes: Deltaproteobacteria and Epsilonproteobacteria, eds Rosenberg E, DeLong EF, Lory S, Stackebrandt E, Thompson F (Springer, Berlin), pp 337−392. 35. Moodley Y, et al. (2012) Age of the association between Helicobacter pylori and man. PLoS Pathog 8:e1002693. 36. Alm RA, et al. (1999) Genomic-sequence comparison of two unrelated isolates of the human gastric pathogen Helicobacter pylori. Nature 397:176–180. 37. Morelli G, et al. (2010) Microevolution of Helicobacter pylori during prolonged infection of single hosts and within families. PLoS Genet 6:e1001036. 38. Dorer MS, Sessler TH, Salama NR (2011) Recombination and DNA repair in Helicobacter pylori. Annu Rev Microbiol 65:329–348. 39. Jolley KA, Maiden MC (2010) BIGSdb: Scalable analysis of bacterial genome variation at the population level. BMC Bioinformatics 11:595. 40. Wirth T, et al. (2006) Sex and virulence in Escherichia coli: An evolutionary perspective. Mol Microbiol 60:1136–1151. 41. Medini D, et al. (2008) Microbiology in the post-genomic era. Nat Rev Microbiol 6:419–430. 42. Cordero OX, Polz MF (2014) Explaining microbial genomic diversity in light of evolutionary ecology. Nat Rev Microbiol 12:263–273. 43. Kashtan N, et al. (2014) Single-cell genomics reveals hundreds of coexisting subpopulations in wild Prochlorococcus. Science 344:416–420. 44. Schloss PD, et al. (2009) Introducing mothur: Open-source, platform-independent, community-supported software for describing and comparing microbial communities. Appl Environ Microbiol 75:7537–7541. 45. Goris J, et al. (2007) DNA-DNA hybridization values and their relationship to whole-genome sequence similarities. Int J Syst Evol Microbiol 57:81–91.

E5422 | www.pnas.org/cgi/doi/10.1073/pnas.1619993114

46. Stackebrandt E, Ebers J (2006) Taxonomic parameters revisited: Tarnished gold standards. Microbiol Today 33:152–155. 47. Massey FJ (1951) The Kolmogorov-Smirnov test for goodness of fit. J Am Stat Assoc 46:68–78. 48. Maiden MCJ, et al. (2013) MLST revisited: The gene-by-gene approach to bacterial genomics. Nat Rev Microbiol 11:728–736. 49. Achtman M, Wagner M (2008) Microbial diversity and the genetic nature of microbial species. Nat Rev Microbiol 6:431–440. 50. Morlon H, Potts MD, Plotkin JB (2010) Inferring the dynamics of diversification: A coalescent approach. PLoS Biol 8:e1000493. 51. McPeek MA (2008) The ecological dynamics of clade diversification and community assembly. Am Nat 172:E270–E284. 52. Ferretti L, Disanto F, Wiehe T (2013) The effect of single recombination events on coalescent tree height and shape. PLoS One 8:e60123. 53. Suerbaum S, Josenhans C (2007) Helicobacter pylori evolution and phenotypic diversification in a changing host. Nat Rev Microbiol 5:441–452. 54. Kennemann L, et al. (2011) Helicobacter pylori genome evolution during human infection. Proc Natl Acad Sci USA 108:5033–5038. 55. Vos M, Didelot X (2009) A comparison of homologous recombination rates in bacteria and archaea. ISME J 3:199–208. 56. Huse SM, Welch DM, Morrison HG, Sogin ML (2010) Ironing out the wrinkles in the rare biosphere through improved OTU clustering. Environ Microbiol 12:1889–1898. 57. Revell LJ, Harmon LJ, Glor RE (2005) Underparameterized model of sequence evolution leads to bias in the estimation of diversification rates from molecular phylogenies. Syst Biol 54:973–983. 58. Jordan IK, Rogozin IB, Wolf YI, Koonin EV (2002) Essential genes are more evolutionarily conserved than are nonessential genes in bacteria. Genome Res 12:962–968. 59. Wang L, Rothemund D, Curd H, Reeves PR (2003) Species-wide variation in the Escherichia coli flagellin (H-antigen) gene. J Bacteriol 185:2936–2943. 60. Wicker E, et al. (2012) Contrasting recombination patterns and demographic histories of the plant pathogen Ralstonia solanacearum inferred from MLSA. ISME J 6:961–974. 61. Monteil CL, et al. (2013) Nonagricultural reservoirs contribute to emergence and evolution of Pseudomonas syringae crop pathogens. New Phytol 199:800–811. 62. Smith KD, et al. (2003) Toll-like receptor 5 recognizes a conserved site on flagellin required for protofilament formation and bacterial motility. Nat Immunol 4:1247–1253. 63. Vinatzer BA, Monteil CL, Clarke CR (2014) Harnessing population genomics to understand how bacterial pathogens emerge, adapt to crop hosts, and disseminate. Annu Rev Phytopathol 52:19–43. 64. Rao DN, Dryden DTF, Bheemanaik S (2014) Type III restriction-modification enzymes: A historical perspective. Nucleic Acids Res 42:45–55. 65. de Vries N, et al. (2002) Transcriptional phase variation of a type III restriction-modification system in Helicobacter pylori. J Bacteriol 184:6615–6623. 66. Kojima KK, et al. (2016) Population evolution of Helicobacter pylori through diversification in DNA methylation and interstrain sequence homogenization. Mol Biol Evol 33:2848–2859. 67. Tomb J-F, et al. (1997) The complete genome sequence of the gastric pathogen Helicobacter pylori. Nature 388:539–547. 68. Oleastro M, Ménard A (2013) The role of Helicobacter pylori outer membrane proteins in adherence and pathogenesis. Biology (Basel) 2:1110–1134. 69. Hobbs M, Collie ESR, Free PD, Livingston SP, Mattick JS (1993) PilS and PilR, a two-component transcriptional regulatory system controlling expression of type 4 fimbriae in Pseudomonas aeruginosa. Mol Microbiol 7:669–682. 70. Eriksson J, et al. (2015) Characterization of motility and piliation in pathogenic Neisseria. BMC Microbiol 15:92. 71. Carrick CS, Fyfe JAM, Davies JK (2000) The genome of Neisseria gonorrhoeae retains the remnants of a two-component regulatory system that once controlled piliation. FEMS Microbiol Lett 186:197–201. 72. Denamur E, et al. (2000) Evolutionary implications of the frequent horizontal transfer of mismatch repair genes. Cell 103:711–721. 73. Lin Z, Nei M, Ma H (2007) The origins and early evolution of DNA mismatch repair genes: Multiple horizontal gene transfers and co-evolution. Nucleic Acids Res 35:7591–7603. 74. Heymans R, Golparian D, Bruisten SM, Schouls LM, Unemo M (2012) Evaluation of Neisseria gonorrhoeae multiple-locus variable-number tandem-repeat analysis, N. gonorrhoeae Multiantigen sequence typing, and full-length porB gene sequence analysis for molecular epidemiological typing. J Clin Microbiol 50:180–183. 75. Trees DL, Schultz AJ, Knapp JS (2000) Use of the neisserial lipoprotein (Lip) for subtyping Neisseria gonorrhoeae. J Clin Microbiol 38:2914–2916. 76. Beutin L, Delannoy S, Fach P (2015) Genetic diversity of the fliC genes encoding the flagellar antigen H19 of Escherichia coli and application to the specific identification of enterohemorrhagic E. coli O121:H19. Appl Environ Microbiol 81:4224–4230. 77. Beutin L, Strauch E (2007) Identification of sequence diversity in the Escherichia coli fliC genes encoding flagellar types H8 and H40 and its use in typing of Shiga toxin-producing E. coli O8, O22, O111, O174, and O179 strains. J Clin Microbiol 45:333–339. 78. Iguchi A, et al. (2015) A complete view of the genetic diversity of the Escherichia coli O-antigen biosynthesis gene cluster. DNA Res 22:101–107. 79. Cheng K, et al. (2016) Phenotypic H-antigen typing by mass spectrometry combined with genetic typing of H antigens, O antigens, and toxins by whole-genome sequencing enhances identification of Escherichia coli isolates. J Clin Microbiol 54:2162–2168. 80. Barraclough TG, Balbi KJ, Ellis RJ (2012) Evolving concepts of bacterial species. Evol Biol 39:148–157.


81. Doolittle WF (2012) Population genomics: How bacterial species form and why they don’t exist. Curr Biol 22:R451–R453.
82. Wattam AR, et al. (2014) PATRIC, the bacterial bioinformatics database and analysis resource. Nucleic Acids Res 42:D581–D591.
83. Mahmood K, Webb GI, Song J, Whisstock JC, Konagurthu AS (2012) Efficient large-scale protein sequence comparison and gene matching to identify orthologs and co-orthologs. Nucleic Acids Res 40:e44.
84. Dice LR (1945) Measures of the amount of ecologic association between species. Ecology 26:297–302.
85. Sørensen TJ (1948) A method of establishing groups of equal amplitude in plant sociology based on similarity of species content and its application to analyses of the vegetation on Danish commons. Biol Skr 5:1–34.
86. Van Dongen S (2000) Graph clustering by flow simulation. PhD dissertation (Univ Utrecht, Utrecht, The Netherlands).
87. Enright AJ, Van Dongen S, Ouzounis CA (2002) An efficient algorithm for large-scale detection of protein families. Nucleic Acids Res 30:1575–1584.
88. Hey J (1992) Using phylogenetic trees to study speciation and extinction. Evolution 46:627–640.
89. Moran PAP (1958) A general theory of the distribution of gene frequencies. I. Overlapping generations. Proc R Soc Lond B Biol Sci 149:102–112.
90. Saitou N, Nei M (1987) The neighbor-joining method: A new method for reconstructing phylogenetic trees. Mol Biol Evol 4:406–425.
91. Cock PJA, et al. (2009) Biopython: Freely available Python tools for computational molecular biology and bioinformatics. Bioinformatics 25:1422–1423.
92. Rambaut A, Grassly NC (1997) Seq-Gen: An application for the Monte Carlo simulation of DNA sequence evolution along phylogenetic trees. Comput Appl Biosci 13:235–238.
93. Acinas SG, Marcelino LA, Klepac-Ceraj V, Polz MF (2004) Divergence and redundancy of 16S rRNA sequences in genomes with multiple rrn operons. J Bacteriol 186:2629–2635.
94. Pei AY, et al. (2010) Diversity of 16S rRNA genes within individual prokaryotic genomes. Appl Environ Microbiol 76:3886–3897.
95. Edgar RC (2004) MUSCLE: Multiple sequence alignment with high accuracy and high throughput. Nucleic Acids Res 32:1792–1797.
96. Pruesse E, Peplies J, Glöckner FO (2012) SINA: Accurate high-throughput multiple sequence alignment of ribosomal RNA genes. Bioinformatics 28:1823–1829.
97. Wollenberg K, Arnold J, Avise JC (1996) Recognizing the forest for the trees: Testing temporal patterns of cladogenesis using a null model of stochastic diversification. Mol Biol Evol 13:833–849.
98. Bewick V, Cheek L, Ball J (2004) Statistics review 9: One-way analysis of variance. Crit Care 8:130–136.
99. Altschul SF, et al. (1997) Gapped BLAST and PSI-BLAST: A new generation of protein database search programs. Nucleic Acids Res 25:3389–3402.
100. Galperin MY, Makarova KS, Wolf YI, Koonin EV (2015) Expanded microbial genome coverage and improved protein family annotation in the COG database. Nucleic Acids Res 43:D261–D269.
101. Fisher RA (1922) On the interpretation of χ² from contingency tables, and the calculation of P. J R Stat Soc 85:87–94.
102. Abascal F, Zardoya R, Telford MJ (2010) TranslatorX: Multiple alignment of nucleotide sequences guided by amino acid translations. Nucleic Acids Res 38:W7–W13.
103. Stamatakis A (2014) RAxML version 8: A tool for phylogenetic analysis and post-analysis of large phylogenies. Bioinformatics 30:1312–1313.
104. Yang Z, Wong WSW, Nielsen R (2005) Bayes empirical Bayes inference of amino acid sites under positive selection. Mol Biol Evol 22:1107–1118.
105. Yang Z (2007) PAML 4: Phylogenetic analysis by maximum likelihood. Mol Biol Evol 24:1586–1591.
106. Harvey PH, May RM, Nee S (1994) Phylogenies without fossils. Evolution 48:523–529.
107. Martin AP, Costello EK, Meyer AF, Nemergut DR, Schmidt SK (2004) The rate and pattern of cladogenesis in microbes. Evolution 58:946–955.
108. Pybus OG, Harvey PH (2000) Testing macro-evolutionary models using incomplete molecular phylogenies. Proc Biol Sci 267:2267–2272.
109. Byers DM, Gong H (2007) Acyl carrier protein: Structure-function relationships in a conserved multifunctional protein family. Biochem Cell Biol 85:649–662.
110. Barnwal RP, Van Voorhis WC, Varani G (2011) NMR structure of an acyl-carrier protein from Borrelia burgdorferi. Acta Crystallogr Sect F Struct Biol Cryst Commun 67:1137–1140.


PNAS | Published online June 19, 2017 | E5423
