JOURNAL OF VIROLOGY, Mar. 2009, p. 2697–2707 0022-538X/09/$08.00+0 doi:10.1128/JVI.02152-08 Copyright © 2009, American Society for Microbiology. All Rights Reserved.

Vol. 83, No. 6

Widely Conserved Recombination Patterns among Single-Stranded DNA Viruses▿†

P. Lefeuvre,1 J.-M. Lett,1 A. Varsani,2,3 and D. P. Martin4*

CIRAD, UMR 53 PVBMT CIRAD-Université de la Réunion, Pôle de Protection des Plantes, Ligne Paradis, 97410 Saint Pierre, La Réunion, France1; Electron Microscope Unit, University of Cape Town, Private Bag, Rondebosch 7701, South Africa2; School of Biological Science, University of Canterbury, Private Bag 4800, Christchurch, New Zealand3; and Institute of Infectious Diseases and Molecular Medicine, University of Cape Town, Observatory 7925, South Africa4

Received 13 October 2008/Accepted 23 December 2008

The combinatorial nature of genetic recombination can potentially provide organisms with immediate access to many more positions in sequence space than can be reached by mutation alone. Recombination features particularly prominently in the evolution of a diverse range of viruses. Despite rapid progress having been made in the characterization of discrete recombination events for many species, little is currently known about either gross patterns of recombination across related virus families or the underlying processes that determine genome-wide recombination breakpoint distributions observable in nature. It has been hypothesized that the networks of coevolved molecular interactions that define the epistatic architectures of virus genomes might be damaged by recombination and therefore that selection strongly influences observable recombination patterns. For recombinants to thrive in nature, it is probably important that the portions of their genomes that they have inherited from different parents work well together. Here we describe a comparative analysis of recombination breakpoint distributions within the genomes of diverse single-stranded DNA (ssDNA) virus families. We show that whereas nonrandom breakpoint distributions in ssDNA virus genomes are partially attributable to mechanistic aspects of the recombination process, there is also a significant tendency for recombination breakpoints to fall either outside or on the peripheries of genes. In particular, we found significantly fewer recombination breakpoints within structural protein genes than within other gene types. Collectively, these results imply that natural selection acting against viruses expressing recombinant proteins is a major determinant of nonrandom recombination breakpoint distributions observable in most ssDNA virus families.

Genetic recombination is a ubiquitous biological process that is both central to DNA repair pathways (10, 57) and an important evolutionary mechanism. By generating novel combinations of preexisting nucleotide polymorphisms, recombination can potentially accelerate evolution by increasing the population-wide genetic diversity upon which adaptive selection relies. Recombination can paradoxically also prevent the progressive accumulation of harmful mutations within individual genomes (18, 35, 53). Whereas its ability to defend high-fitness genomes from mutational decay possibly underlies the evolutionary value of sexuality in higher organisms, in many microbial species where pseudosexual genetic exchange is permissible among even highly divergent genomes, recombination can enable access to evolutionary innovations that would otherwise be inaccessible by mutation alone. Such interspecies recombination is fairly common in many virus families (8, 17, 27, 44, 82). It is becoming clear, however, that as with mutation events, most recombination events between distantly related genomes are maladaptive (5, 13, 38, 50, 63, 80). As genetic distances between parental genomes increase, so too does the probability of fitness defects in their recombinant offspring (16, 51). The viability of recombinants is apparently largely dependent on how severely recombination disrupts coevolved intragenome interaction networks (16, 32, 51). These networks include interacting nucleotide sequences that form secondary structures, sequence-specific protein-DNA interactions, interprotein interactions, and amino acid-amino acid interactions within protein three-dimensional folds. One virus family where such interaction networks appear to have a large impact on patterns of natural interspecies recombination is the single-stranded DNA (ssDNA) geminiviruses. As with other ssDNA viruses, recombination is very common among the species of this family (62, 84). Partially conserved recombination hot and cold spots have been detected in different genera (39, 81) and are apparently caused by both differential mechanistic predispositions of genome regions to recombination and natural selection disfavoring the survival of recombinants with disrupted intragenome interaction networks (38, 51). Genome organization and rolling circle replication (RCR), the mechanism by which geminiviruses and many other ssDNA viruses replicate (9, 67, 79; see reference 24 for a review), seem to have a large influence on basal recombination rates in different parts of geminivirus genomes (20, 33, 39, 61, 81). To initiate RCR, virion-strand ssDNA molecules are converted by host-mediated pathways into double-stranded “replicative-form” (RF) DNAs (34, 67).

* Corresponding author. Mailing address: Institute of Infectious Diseases and Molecular Medicine, University of Cape Town, Observatory 7925, South Africa. Phone: 27-21-406 6366. Fax: 27-21-698 1528. E-mail: [email protected].
† Supplemental material for this article may be found at http://jvi.asm.org/.
▿ Published ahead of print on 30 December 2008.

Downloaded from jvi.asm.org by Edward Rybicki on March 1, 2009


MATERIALS AND METHODS

Sequence data sets. All publicly available full-length circovirus, microvirus, and parvovirus genome sequences, full-length nanovirus genome component sequences, anellovirus sequences that were >50% of full genome size, and geminivirus DNA-B and DNA-1 sequences were obtained from public sequence databases by using TaxBrowser (http://www.ncbi.nlm.nih.gov/) between October and December 2007. An alignment of geminivirus DNA-Beta sequences has been described previously (6). With the exception of the anelloviruses, parvoviruses, and microviruses, sequences were linearized at the site that is nicked during virion-strand replication. In the case of the anelloviruses, sequences were linearized at the first nucleotide of either the conserved AGGGCGGTGCCG sequence (Torque teno viruses [TTV]) or the T/AGGGCGGGAGC sequence (Torque teno mini viruses [TTMV]). With the parvoviruses, the VP-NS intergenic region (5′→3′) was excluded from analyses (because it was largely unalignable) and sequences were linearized at the first codon of the nonstructural protein gene. Microvirus sequences were linearized at position 1469 relative to the sequence of isolate M14428. Sequence alignments were constructed using poa (37) and edited both by eye and using the ClustalW-based (77) alignment tool implemented in Mega4 (75). Highly divergent sequences (i.e., those sharing <60% genome-wide sequence identity to any other sequences in a data set) were discarded. Finally, to ensure that sequences could be aligned properly, data sets were split into groups of sequences all sharing >60% genome-wide sequence identity. The Begomovirus DNA-A/DNA-A-like and Mastrevirus genome sequence alignments analyzed here were described previously (38, 81). Four “population-level” data sets that were used to detect evidence of recombination rate differences within the complementary- and virion-strand geminivirus and circovirus genes were assembled precisely as outlined previously (61). Details of all analyzed data sets are given in Table S1 in the supplemental material, and sequence alignments are available upon request from the authors and/or within RDP3 project files provided as supplemental material.

Characterization of individual recombination events. Detection of potential recombinant sequences, identification of likely parental sequences, and localization of recombination breakpoints were carried out with the RDP (48), GENECONV (62), BOOTSCAN (49), MAXCHI (54), CHIMAERA (64), SISCAN (21), LARD (29), and 3SEQ (4) methods implemented in RDP3 (52) (see the RDP project files submitted as supplemental material for full details of program settings). Default settings were used throughout, and only potential recombination events detected by three or more of the above methods coupled with phylogenetic evidence of recombination were considered significant.
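The consensus criterion described here (support from three or more methods at a Bonferroni-corrected P value cutoff of 0.05) can be illustrated with a short Python sketch. This is not RDP3 code: the function name, the dictionary of per-method P values, and the `n_tests` argument are hypothetical, and the additional requirement of phylogenetic evidence is not modeled.

```python
from typing import Mapping


def is_supported(p_values: Mapping[str, float], n_tests: int,
                 alpha: float = 0.05, min_methods: int = 3) -> bool:
    """Return True if enough methods support a candidate recombination event.

    p_values maps a method name to its uncorrected P value for the event.
    n_tests is the number of comparisons used for the Bonferroni correction
    (an illustrative assumption; RDP3 applies its correction internally).
    """
    supporting = [
        method for method, p in p_values.items()
        # Bonferroni correction: multiply by the number of tests, cap at 1.
        if min(1.0, p * n_tests) < alpha
    ]
    return len(supporting) >= min_methods


# Hypothetical example: three of four methods survive correction.
example = {"RDP": 1e-6, "GENECONV": 1e-5, "MAXCHI": 1e-4, "SISCAN": 0.2}
print(is_supported(example, n_tests=100))
```

Requiring agreement between several methods trades some power for a lower false-positive rate, which is the balance the authors describe calibrating against simulated data sets.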
Our choice of using the consensus of three or more methods was determined empirically based on false-positive rates encountered during analyses of the simulated data sets of Posada and Crandall (64). Simultaneously analyzing these data sets with seven methods (all of those mentioned above except for LARD), using a consensus of three or more methods with a Bonferroni-corrected P value cutoff of 0.05, resulted in false-positive rates below one falsely inferred recombination event per 100 data sets analyzed while at the same time ensuring a good degree of analysis power. To achieve maximum analysis power, we minimized the severity of Bonferroni correction during exploratory recombination analyses by either removing from analyzed alignments or masking within them (a setting in RDP3) all but one sequence within groups of sequences sharing >98% genome-wide sequence identity. The exact program settings can be accessed within the RDP project files provided as supplemental material.

Analysis of genome-wide recombination patterns. Recombination breakpoint density plots and recombinant region count matrices were constructed using RDP3 as described previously (27, 38). The matrices represent the numbers of times that recombinational movements of sequence tracts between genomes separate pairs of nucleotide sites. This representation of detectable recombination events highlights the differential “exchangeability” of sequence tracts between genomes. Whereas highly exchangeable genome regions (i.e., those represented by warm colors in the matrices due to their frequent movement into foreign genetic backgrounds) are expected to be most modular, the less exchangeable regions (i.e., those represented by cool colors due to their infrequent movement into foreign genetic backgrounds) are expected to be the least modular.

Recombination hot and cold spot tests.
Recombination breakpoint hot and cold spots were identified from breakpoint distribution plots by use of previously described permutation-based linear “local” and “global” tests (27). The statistical significance of potential recombination region hot and cold spots in recombinant region count matrices was tested using a two-dimensional version of the linear local recombination hot and cold spot permutation test of Heath et al. (27). Briefly, this involved the same procedure as the linear test except that rather than plotting breakpoints in permuted and real data sets on a linear genome map, the genomic regions bounded by breakpoints were plotted on recombination region count matrices as described previously (45). The score of each cell in a particular real recombinant region count matrix was ranked relative to corresponding cells recorded in 1,000 permuted matrices to identify cells in the real matrix that had either higher or lower values than 95% or 99% of corresponding cells in the permuted matrices. It should be stressed that the permutation P values were not corrected for multiple testing and that one would, for example, expect a false-positive rate of 5% of the cells in the matrix at a P value cutoff of 0.05. Nevertheless, the test does provide a reasonable quantitative assessment of the least and most transmissible portions of genomes that takes into account


Initiated by a virus-encoded replication-associated protein (Rep) at a well-defined virion-strand replication origin (v-ori), new virion strands are synthesized on the complementary strand of RF DNAs (28, 73, 74) by host DNA polymerases. Virion-strand replication is concomitant with the displacement of old virion strands, which, once complete, yields covalently closed ssDNA molecules which are either encapsidated or converted into additional RF DNAs. Genome-wide basal recombination rates in ssDNA viruses are probably strongly influenced by the specific characteristics of host DNA polymerases that enable RCR. Interruption of RCR has been implicated directly in geminivirus recombination (40) and is most likely responsible for increased basal recombination rates both within genes transcribed in the opposite direction from that of virion-strand replication (40, 71) and at the v-ori (1, 9, 20, 69, 74). Whereas most ssDNA virus families replicate via either a rolling circle mechanism (the Nanoviridae, Microviridae, and Geminiviridae) (3, 23, 24, 31, 59, 67, 74) or a related rolling hairpin mechanism (the Parvoviridae) (25, 76), among the Circoviridae only the Circovirus genus is known to use RCR (45). Although the Gyrovirus genus (the other member of the Circoviridae) and the anelloviruses (a currently unclassified ssDNA virus group) might also use RCR, it is currently unknown whether they do or not (78). Additionally, some members of the Begomovirus genus of the Geminiviridae either have a second genome component, called DNA-B, or are associated with satellite ssDNA molecules called DNA-1 and DNA-Beta, all of which also replicate by RCR (1, 47, 68). Recombination is known to occur in the parvoviruses (19, 43, 70), microviruses (66), anelloviruses (40, 46), circoviruses (11, 26, 60), nanoviruses (30), geminivirus DNA-B components, and geminivirus satellite molecules (2, 62).
Given that most, if not all, of these ssDNA replicons are evolutionarily related to and share many biological features with the geminiviruses (22, 31, 36), it is of interest to determine whether conserved recombination patterns observed in the geminiviruses (61, 81) are evident in these other groups. To date, no comparative analyses have ever been performed with different ssDNA virus families to identify, for example, possible influences of genome organization on recombination breakpoint distributions found in these viruses. Here we compare recombination frequencies and recombination breakpoint distributions in most currently described ssDNA viruses and satellite molecules and identify a number of sequence exchange patterns that are broadly conserved across this entire group.


RESULTS AND DISCUSSION

Increased recombination frequencies in complementary-sense genes are common but not absolutely conserved among ssDNA replicons. The geminiviruses and circoviruses are the only known ssDNA viruses that both utilize RCR and have complementary-sense genes that are transcribed in the opposite direction and on the same strand as virion-strand synthesis. For geminiviruses, it has been speculated that this genome organization may result in clashes between transcription and replication complexes and result in increased basal recombination rates within the complementary-sense genes (61). Such a pattern has in fact been observed for the geminiviruses maize streak virus, East African cassava mosaic virus (Fig. 1a and b) (61), and East African cassava mosaic Kenya virus. Here we show that this pattern is also apparent within the complementary-sense genes of the circoviruses porcine circovirus 1 (PCV-1) and porcine circovirus 2 (PCV-2) (Fig. 1c and d). Also in common with the geminiviruses is that recombination frequencies in these circovirus genomes decrease sharply on the virion-sense gene side of the v-ori. The only other known classes of rolling circle replicons where transcription is known to occur in the opposite direction from that of virion-strand replication are the begomovirus DNA-B genome component and begomovirus-associated DNA-Beta satellite molecules. Whereas we were unable to assemble a suitable DNA-Beta data set for our recombination frequency analyses, two small East African cassava mosaic virus DNA-B data sets (containing only 12 and 11 sequences) displayed no obvious differences in recombination frequencies between their complementary- and virion-sense genes (Fig. 1e and f). Whereas no genome-wide fluctuations in recombination frequencies were detectable in one of these data sets (Fig. 1e), the other (Fig. 1f) displayed increased recombination frequencies within the intergenic region around the v-ori, a pattern that is superficially similar to that seen for the DNA-A components of these viruses. In contrast to the case for DNA-A components, however, there was a notable decrease in detectable frequencies at the v-ori, with recombination frequency peaks instead occurring at the boundaries of the so-called “common region,” a stretch of sequence containing the v-ori that is highly conserved between the DNA-A and DNA-B components of bipartite sequences. Although it is possible that these small DNA-B data sets are not representative of DNA-B sequences in general, this evidence suggests that complementary-strand transcription running counter to virion-strand RCR is perhaps not absolutely associated with increased recombination rates in complementary-sense genes.

Interspecies and interstrain recombination is common in ssDNA viruses. We used a battery of eight recombination detection methods and a series of manual recombination signal evaluation tools implemented in the program RDP3 (52) to identify and characterize 663 unique recombination events detectable within 27 different ssDNA virus full genome/genome component data sets (see Table 1 for a summary of these events and Table S1 in the supplemental material for details of each individual event, as well as the interactive RDP3 project files provided as supplemental material). Surprisingly, most sequences in most of the data sets were apparently recombinant. For example, 38 events were detectable in 37 microvirus sequences and 38 events were detectable in 39 begomovirus-associated DNA-1 satellite sequences.
Whereas in some of the data sets sequence diversity was relatively low and the detected recombination events were all between members of the same species (the PCV, goose circovirus, and gyrovirus data sets), in other data sets most, and in some cases all, detectable recombination events were between viruses sharing <90% genome-wide nucleotide sequence identity (Table 1). Although for some groups, such as the gyroviruses, goose circoviruses, and erythroviruses, only a few recombination events were detected, this was probably a reflection of both the low diversity and small sizes of these data sets (see Table S1 in the supplemental material; also see reference 64 for a discussion on the inherent difficulty of identifying and characterizing individual recombination events in data sets with low diversity).

A v-ori recombination hot spot is partially conserved across diverse rolling circle replicons. For data sets in which more than 10 recombination events were detectable, we tested for the presence of both recombination breakpoint and recombinant region hot and cold spots. Whereas breakpoint distribution maps were generated and tested as described previously (27), the tracts of sequence exchanged during recombination events were also mapped onto recombinant region count matrices. These matrices describe the relative frequencies with which different parts of the analyzed ssDNA replicons are separated during recombination. Recombinant region hot and


the influence that sequence diversity has on the detectability of recombination events (64).

Comparison of recombination breakpoint densities between different genome regions. We used another modification of the local permutation test of Heath et al. (27) to specifically test for clustering of recombination breakpoints in different genome regions. In this test, rather than partitioning the alignment with a moving window of set length, the alignment was partitioned in various other ways. For example, to test for significant clustering of breakpoints in the intergenic regions, alignments were partitioned into coding and noncoding regions and tested to determine whether more/fewer breakpoints were detectable in the intergenic regions than could be accounted for by chance. Other similar tests included (i) discounting breakpoints falling outside coding regions and determining whether individual genes contained significantly more/fewer detectable breakpoints than the remainder of the coding regions and (ii) again discounting breakpoints falling outside coding regions and determining whether the middle 50% of all genes collectively contained significantly more/fewer detectable breakpoints than those collectively observed in the beginning 25% and ending 25% of the genes.

Inference of population-scaled mutation and recombination frequencies. Variations in site-to-site composite likelihood estimates of population-scaled recombination frequencies were assessed with the INTERVAL (56) component of ldhat (55). The program settings for these analyses were a precomputed likelihood lookup table for a population-scaled mutation frequency of 0.001, a minimum minor allele frequency cutoff of 0.05 (for data sets containing 20 or more sequences) or 0.01 (for data sets with between 11 and 19 sequences), a block penalty of 10, a starting recombination frequency of 5, and 10^7 Markov chain Monte Carlo updates, with sampling every 2,000 updates and the first 500 samples discarded (56).
The average genome-wide recombination frequency estimate obtained after a first run with these parameter settings was then used for a second run, using the same parameters but with the starting recombination frequency replaced by that estimated from the first run. We avoided analysis inaccuracies at the edges of alignments by simulating circular genome sequences as described previously (61). Briefly, this involved constructing tandemly repeated alignments of full genome sequences and then excluding the point recombination frequency estimates for the repeated ends of the alignments.
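The tandem-repeat device for circular genomes can be sketched as follows. This is a minimal Python illustration under stated assumptions: the text does not say how many copies were concatenated, so the use of three copies (keeping only the estimates for the middle copy) is an assumption, and both function names are hypothetical.

```python
from typing import List, Sequence


def tandem_repeat(alignment: Sequence[str], copies: int = 3) -> List[str]:
    """Concatenate each aligned sequence with itself `copies` times.

    Repeating the alignment gives every site of the circular genome
    flanking sequence on both sides in the middle copy.
    """
    return [seq * copies for seq in alignment]


def central_estimates(estimates: Sequence[float], genome_length: int,
                      copies: int = 3) -> List[float]:
    """Keep per-site estimates for the middle copy and drop the repeated ends."""
    if len(estimates) != genome_length * copies:
        raise ValueError("expected one estimate per site of the repeated alignment")
    start = genome_length * (copies // 2)
    return list(estimates[start:start + genome_length])


# Toy alignment of two 4-nt "genomes".
repeated = tandem_repeat(["ACGT", "AC-T"])
print(repeated)
# Stand-in per-site values for the 12 repeated sites.
print(central_estimates(list(range(12)), genome_length=4))
```

Discarding the outer copies removes the sites for which the estimator sees data on only one side, which is the edge effect the authors set out to avoid.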


FIG. 1. Variable recombination rates across the genomes of maize streak virus (MSV) (61) (a), DNA-A genome components of East African cassava mosaic virus (EACMV) (61) (b), porcine circovirus 2 (PCV-2) (c), porcine circovirus 1 (PCV-1) (d), and DNA-B genome components of East African cassava mosaic virus (EACMV-DNA B) (e and f). The black plots represent average estimates of point recombination rates determined by the reversible-jump Markov chain Monte Carlo (RJMCMC) approach implemented in the INTERVAL component of ldhat (56). Gray regions represent the 95% credibility intervals of point recombination rate estimates from the RJMCMC chain. Genome cartoons above the plots indicate the starting and ending alignment positions of various genes (cp, CP gene; rep, replication-associated protein gene). Vertical lines beneath the genome cartoons indicate the locations of polymorphic sites used for the analysis.


cold spots were identified as genome regions that were more or less frequently exchanged, respectively, during recombination than can be accounted for by chance (with the null hypothesis that tracts of sequence are randomly exchanged). Importantly, the permutation tests used for both the recombination breakpoint distribution plots and the recombinant region count matrices account for recombination being inherently easier to detect in more diverse genome regions than it is in less diverse regions. The most conserved feature of detectable recombination breakpoint distributions was a statistically significant breakpoint cluster at the v-ori’s (black arrows in Fig. 2) of circovirus (beak-and-feather disease virus), microvirus, and geminivirus (mastrevirus and begomovirus DNA A, DNA B, and DNA-Beta) genomes/genome components that are known to use RCR. While it has been apparent for some time now that the v-ori is a mechanistically predisposed recombination hot spot in geminiviruses (73, 74) and circoviruses (9), our results indicate that the same is probably true for most other rolling circle replicons. There are two probable reasons for v-ori sequences being recombination hot spots. Firstly, they are the natural points at which recombinational repairs of double-stranded genome breakages are resolved by the joining and nicking activities of Rep proteins (74), and secondly, they are the sites at which unit-length genomes are replicationally released from high-molecular-weight genomic concatemers that are produced by recombination-dependent replication (33, 65). Significant evidence of v-ori recombination hot spots was, however, not found in the nanovirus and geminivirus DNA-1 satellite molecule data sets, indicating that mechanisms underlying v-ori hot spots may not be conserved across all rolling circle replicons.
In this regard, it is currently unknown whether recombination-dependent replication, a process directly implicated in the v-ori recombination hot spot found in geminiviruses (33), occurs in any other ssDNA viruses. It should also be pointed out, however, that our analysis may have simply lacked enough power to detect v-ori hot spots in some of the analyzed data sets. For example, within each of the six nanovirus data sets, the hot spot test lacked any appreciable power due to the numbers of detectable recombination breakpoints being very low (<15). Besides the v-ori hot spot, we found no other obviously conserved recombination breakpoint patterns between ssDNA replicons from different families. We realized, however, that given the small numbers of recombination breakpoints detected in many of the data sets, our analysis lacked sufficient power to find any but the most obvious recombination hot and cold spots. Despite this, we were encouraged by the fact that careful visual inspection of recombinant region count matrices (Fig.


TABLE 1. Summary of recombination signals detectable in ssDNA virus full-genome data sets Data set Host

Family

Data set Name

Animals

Unassigned

Anellovirus

Circoviridae

Parvoviridae

Bacteria Plants

Microviridae Geminiviridae

No. of eventsa

Probability of ⬍90 (⬍80)

7 8 9 10

67 21 16 17

99 (87) 100 (100) 0 (0) 45 (18)

Circovirus/goose Gyrovirus Dependovirus/parvovirus Dependovirus Erythrovirus Parvovirus Microvirus Mastrevirus

Goose Gyro Merged Dependo Erythro Parvo Micro Mastre

11 12 13 ⫹ 15 13 14 15 16 17

1 1 35* 17 2 13 38 37

0 (0) 0 (0) 86 (29) 78 (33) 100 (0) 100 (33) 45 (27) 66 (11)

Begomovirus DNA-B

Begomo Merged Old World EACMV Asia 1 Asia 2 New world Merged beta1 beta2 beta3 dna1 Merged babu nano CP babu1 babu2 babu3 babu4 babu5 nano

18 19323 19 20 21 22 23 24326 24 25 26 27 32 ⫹ 33

245 67* 17 10 9 6 41 52* 25 20 11 38 9*

100 (84) 79 (63) 100 (11) 14 (14) 71 (57) 33 (33) 100 (100) 92 (71) 100 (90) 80 (50) 100 (67) 100 (75) 100 (80)

4 3 1 0 2 7

100 (100)

DNA-1 Babuvirus/nanovirus Babuvirus

Nanovirus

28 29 30 31 32 33

Hot spot

Cold spot

IR-ORF2-ORF4 IR-ORF4

CP

IR-beginning of Rep

CP

IR

Spike-RepA LIR-SIR-repB-end of CP IR-Rep IR

G-CP CP-repA CP-V2-C2-C3 BV1-BC1

IR

ORF

Rep CP-IR

Beginning of Rep IR

100 (0) 100 (100)

a *, partial genome data set. Multiple testing correction was more severe in the merged data sets than in the separated ones. As a consequence of this, the number of recombination events in the merged data sets is not simply the sum of events detected in the separated data sets.

3, upper triangles) revealed what may have been subtle conserved recombination patterns that were missed by our breakpoint hot/cold spot test. In these matrices, whereas dark/light blue triangles and rectangles correspond with genomic regions that tend to be inherited from the same parental source during recombination events, orange/red triangles and rectangles denote recombinant regions that tend to become separated during recombination events. The lower triangles in Fig. 3 indicate whether individual pairs of sites represented in the upper triangles are separated more (red) or less (blue) often during recombination events than can be accounted for by chance. Note, however, that the statistical test used in these “probability matrices” is not multiple comparison corrected, which means, for example, that for a P value threshold of ⬍0.01 one would expect a 1% false-positive rate per pair of sites analyzed. Nevertheless, these matrices indicated three potential recombination patterns shared by many of the analyzed viruses, as follows: (i) intergenic regions tended to be moved between genomes more frequently than individual genes were (note red/orange/light green diagonals in the upper matrices and red diagonals in the lower matrices originating on intergenic regions in the anellovirus, geminivirus, DNABeta, and nanovirus data sets); (ii) certain genes, particu-

larly those encoding coat proteins (CP), tended to be moved by recombination as either complete or mostly complete (⬎50% of the middle regions) units (note light/dark blue triangles in the upper matrices and blue patches in the lower matrices associated in particular with CP genes in the anellovirus, microvirus, circovirus, parvovirus, and geminivirus data sets); and (iii) CP genes tended to contain fewer recombination breakpoints than other genes did (note blue patches in the recombination region count matrices in Fig. 3 and the distribution of breakpoints indicated in Fig. 2). Selection apparently disfavors recombinants with breakpoints in coding regions. To directly test for conserved features of recombination breakpoint distributions that might underlie the patterns we observed in the recombination region count matrices, we modified our recombination breakpoint hot and cold spot test. The original test effectively determined whether the numbers of breakpoints observed in particular small stretches of sequence (in the case of Fig. 2, a moving 200-nucleotide [nt] window) were greater or less than could be accounted for by chance. Given the small numbers of breakpoints detectable in many of the data sets, this test lacked power primarily because the average number of breakpoints per window was low (and often zero), regardless of whether windows were over recombina-

Downloaded from jvi.asm.org by Edward Rybicki on March 1, 2009

Circovirus/PCV Circovirus/BF

TTV TTMV PCV BF

DNA-Beta

Nanoviridae

No.

2702

LEFEUVRE ET AL.

J. VIROL.

Downloaded from jvi.asm.org by Edward Rybicki on March 1, 2009 FIG. 2. Distributions of recombination breakpoints detected within different ssDNA virus data sets. All detectable breakpoint positions are indicated by small vertical lines at the top of the graphs. A 200-nt window was moved along each of the represented alignments 1 nt at a time, and the number of breakpoints detected within the window region was counted and plotted (solid lines). The upper and lower broken lines indicate

VOL. 83, 2009

CONSERVED RECOMBINATION PATTERNS AMONG ssDNA VIRUSES

genes. Also, when we merged the dependovirus and parvovirus data sets (both are members of the Parvoviridae), the increased clustering of breakpoints at the edges of genes became significant (Table 2). Collectively, the relatively low abundance of breakpoints within coding regions and the tendency for breakpoints to occur toward the edges of genes are consistent with the hypothesis that breakpoints are less tolerable when they occur within genes (5, 38, 83) because there is a relatively high probability that recombinant proteins will not fold properly (14; see reference 7 for a review). CP genes contain fewer detectable breakpoints than other genes. While recombination region count matrices (Fig. 3) indicated that the entire CP genes of various ssDNA virus groups tended to be inherited from the same parental source, recombination breakpoint distribution analyses indicated that for data sets containing a CP gene, 8 of 13 (61%) had statistically significant recombination cold spots within these genes. We therefore specifically tested whether CP genes tended to have significantly fewer recombination breakpoints than other genes. Importantly, the test we used discounted breakpoints that fell within intergenic regions and was therefore an unbiased comparison between the genes themselves. As expected, relative to the other genes, we found significantly fewer (P ⬍ 0.05) recombination breakpoints within CP genes for 6 of 10 analyzed data sets (3 of the original 14 data sets did not contain a CP gene, and 1 contained only a CP gene). Among the four data sets where CP genes did not have significantly fewer breakpoints than other genes, the circovirus (PCV) and parvovirus data sets had lower densities of breakpoints within their CP genes than in their Rep genes (the only other gene on these replicons), and in the case of the parvovirus data set, this lower density was marginally significant (P ⫽ 0.0932). 
The other two exceptional data sets, the anellovirus (TTMV) and dependovirus data sets, were both unusual in that related data sets (TTV and parvovirus data sets, respectively) displayed either a significant or marginally significant tendency toward having lower breakpoint numbers within their CP genes. Decreased recombination breakpoint densities have been noted within the structural protein genes of picornaviruses (27, 42, 72), adenoviruses (41), human immunodeficiency viruses (both capsid and envelope proteins) (17), and hepatitis B viruses (85). This implies either that viral CP genes generally experience low basal recombination rates or that they are generally less tolerant of recombination than most other genes. For geminiviruses, there is good evidence from both experimental and computational analyses that CP genes both experience lower basal recombination rates (33, 61) and have a low degree of recombination tolerance (38, 51). This evidence suggests that whereas recombination

99% and 95% confidence thresholds, respectively, for globally significant breakpoint clusters. Light and dark gray areas indicate local 99% and 95% breakpoint clustering thresholds, respectively, taking into account local regional differences in sequence diversity that influence the ability of different recombination detection methods to identify recombination breakpoints. Red areas indicate recombination hot spots, while blue areas represent recombination cold spots. Genes (horizontal arrows) are represented on the top of the graph (cp, CP gene; rep, replication-associated protein gene). The v-ori’s of various rolling circle replicons are indicated with vertical black arrows.
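The moving-window count described in this legend can be sketched in a few lines of Python. This is a minimal illustration only, not the software used for the analysis: the function name `sliding_window_counts` is hypothetical, breakpoint positions are assumed to be 0-based alignment columns, and the alignment is treated as linear even though many of these replicons are circular. The confidence thresholds shown in the figure are not computed here.

```python
from itertools import accumulate


def sliding_window_counts(breakpoints, alignment_length, window=200):
    """Count breakpoints inside a window moved 1 nt at a time.

    breakpoints: iterable of 0-based alignment positions (assumed format).
    Returns one count per window start, for windows that fit entirely
    within a linear alignment (circularity is ignored in this sketch).
    """
    # Number of breakpoints recorded at each alignment column.
    per_site = [0] * alignment_length
    for pos in breakpoints:
        per_site[pos] += 1

    # Prefix sums let each window count be read off in constant time.
    prefix = [0] + list(accumulate(per_site))
    return [
        prefix[start + window] - prefix[start]
        for start in range(alignment_length - window + 1)
    ]


# Toy example with made-up breakpoint positions.
counts = sliding_window_counts([10, 150, 205, 600], alignment_length=1000)
print(len(counts), max(counts))
```

Plotting `counts` against the window start position would give a curve comparable in spirit to the solid lines in the figure.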


tion cold spots or not. To remedy this problem in our new test, rather than partitioning sequence alignments using a moving window, we simply partitioned them into two or three large regions and used the same permutation test to determine whether individual partitions contained more or fewer recombination breakpoints than could be accounted for by chance. Specifically, we compared breakpoint numbers for (i) coding versus noncoding regions, (ii) the middle 50% of genes versus the beginning and end 25% of genes, and (iii) CP genes versus other genes. To further increase the power of our modified test, we merged some of our original 27 data sets and discarded five others in which fewer than eight recombination breakpoints were detectable (Table 2). It was apparent from our earlier recombination breakpoint distribution analyses that whereas hot spots tended to occur within intergenic regions (70% of 23 detected hot spots), cold spots tended to occur within coding regions (94% of 18 detected cold spots). Applying our modified method, we confirmed that in 8 of the 14 analyzed data sets intergenic regions had a significantly higher density of detectable recombination breakpoints (P < 0.05) than coding regions did (Table 2). The exceptions were the begomovirus DNA-A and DNA-1 satellite, nanovirus, circovirus (PCV), microvirus, and dependovirus data sets. Whereas breakpoint densities were clearly higher in the noncoding regions than in the coding regions of the begomovirus DNA-A (35 versus 24 breakpoints per 100 nt), DNA-1 (8.7 versus 5.2 breakpoints per 100 nt), microvirus (5.05 versus 2.7 breakpoints per 100 nt), and nanovirus (3.1 versus 2.6 breakpoints per 100 nt) data sets, the circovirus (PCV) and dependovirus sequences had only extremely small intergenic regions included in the analysis, and therefore neither had any detectable breakpoints outside genes.
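The region-based comparison described above can be illustrated with a simplified permutation test. The sketch below is an assumption-laden stand-in, not the test used in this study: it places the observed number of breakpoints uniformly at random along the alignment, whereas the actual test also had to account for factors such as local sequence diversity. The names `region_permutation_test` and `density_per_100nt`, the half-open interval format, and the toy coordinates are all hypothetical.

```python
import random


def density_per_100nt(n_breakpoints, region_length):
    """Breakpoint density expressed per 100 nt of a region."""
    return 100.0 * n_breakpoints / region_length


def region_permutation_test(breakpoints, region, alignment_length,
                            n_perm=10000, seed=0):
    """Test whether a region holds more or fewer breakpoints than chance.

    breakpoints: 0-based alignment positions (assumed format).
    region: list of half-open (start, end) intervals, e.g. all noncoding
            stretches or the middle 50% of every gene.
    Null model (an assumption of this sketch): breakpoints fall
    uniformly at random along the alignment.
    Returns (observed count, P for enrichment, P for depletion).
    """
    rng = random.Random(seed)

    # Mark which alignment columns belong to the region of interest.
    in_region = [False] * alignment_length
    for start, end in region:
        for i in range(start, end):
            in_region[i] = True

    observed = sum(in_region[p] for p in breakpoints)
    n = len(breakpoints)

    as_many_or_more = 0
    as_few_or_fewer = 0
    for _ in range(n_perm):
        simulated = sum(
            in_region[rng.randrange(alignment_length)] for _ in range(n)
        )
        if simulated >= observed:
            as_many_or_more += 1
        if simulated <= observed:
            as_few_or_fewer += 1

    # Add-one correction so that an estimated P value is never exactly zero.
    p_enriched = (as_many_or_more + 1) / (n_perm + 1)
    p_depleted = (as_few_or_fewer + 1) / (n_perm + 1)
    return observed, p_enriched, p_depleted


# Toy example: a 1,000-nt alignment with a 100-nt noncoding region.
noncoding = [(0, 100)]
toy_breakpoints = [5, 12, 20, 33, 47, 58, 71, 90, 400, 850]
print(region_permutation_test(toy_breakpoints, noncoding, 1000))
```

The same function could be applied to comparisons (ii) and (iii) by passing the relevant gene intervals as `region` and restricting the alignment to coding positions, which mirrors the idea of discounting intergenic breakpoints.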
Despite these two exceptions, it is clear that among the ssDNA viruses analyzed, detectable recombination breakpoints tend to fall outside coding regions. While this tendency might be due to recombination breakpoints within genes being less tolerable than those that fall between genes, it is difficult to discount the possibility that the tendency might instead be caused by the occurrence within intergenic regions of mechanistically predisposed recombination hot spots such as v-ori’s. However, even when intergenic regions were discounted, we detected a tendency for recombination breakpoints to occur within the beginning and ending 25% of genes rather than in the middle 50%. This tendency was significant (P < 0.05) for all but the geminivirus DNA-1, anellovirus (TTV), circovirus (PCV), parvovirus, and dependovirus data sets. Among these exceptions, the anellovirus (TTV), parvovirus, and dependovirus data sets displayed higher densities of recombination breakpoints at the edges of genes than in the middle 50% of


TABLE 2. Imbalances in recombination breakpoint locations between different genome regions

P value^a

Data set Host

Family

Genus No.

Coding vs noncoding regions

TTV TTMV PCV Bird circovirus Merged

7 8 9 10 ⫹ 11 13 ⫹ 15