1
Genomes of marine cyanopodoviruses reveal multiple origins of diversity
2 3
Labrie S.J.1, K. Frois-Moniz1, M.S. Osburne1, L. Kelly1, S.E. Roggensack1, M.B. Sullivan1,3, G.
4
Gearin2, Q. Zeng2, M. Fitzgerald2, M.R. Henn2 and S.W. Chisholm1
5 6
1
Department of Civil and Environmental Engineering, Massachusetts Institute of Technology,
7
Cambridge, MA, USA.
8
2
Broad Institute, Cambridge, MA, USA.
9
3
Current Address: Ecology and Evolutionary Biology Department, University of Arizona,
10
Tucson, AZ, USA
11 12
Corresponding author: Sallie W. Chisholm
13
MIT 48-419
14
15 Vassar Street
15
Cambridge, MA 02139
16
Phone: (617) 253-1771
17
Fax: (617) 324-0336
18
Email:
[email protected]
19 20
Running title: Cyanopodovirus genomics
21
1
22
Summary
23 24
The marine cyanobacteria Prochlorococcus and Synechococcus are highly abundant in the global
25
oceans, as are the cyanophage with which they co-evolve. While genomic analyses have been
26
relatively extensive for cyanomyoviruses, only 3 cyanopodoviruses isolated on marine
27
cyanobacteria have been sequenced. Here we present 9 new cyanopodovirus genomes, and
28
analyze them in the context of the broader group. The genomes range from 42.2 to 47.7 kbp, with
29
G+C contents consistent with those of their hosts. They share 12 core genes, and the pan-genome
30
is not close to being fully sampled. The genomes contain 3 variable island regions, with the most
31
hypervariable genes concentrated at one end of the genome. Concatenated core-gene phylogeny
32
clusters all but one of the phage into three distinct groups (MPP-A and two discrete clades within
33
MPP-B). The outlier, P-RSP2, has the smallest genome and lacks RNA polymerase, a hallmark of
34
the Autographivirinae subfamily. The phage in groups MPP-B contain photosynthesis and carbon
35
metabolism associated genes, while group MPP-A and the outlier P-RSP2 do not, suggesting
36
different constraints on their lytic cycles. Four of the phage encode integrases and three have a
37
host integration signature. Metagenomic analyses reveal that cyanopodoviruses may be more
38
abundant in the oceans than previously thought.
39 40
Introduction
41
Viruses are abundant in the oceans, often outnumbering bacterioplankton by an order of
42
magnitude (Bergh et al., 1989; Fuhrman, 1999; Wommack and Colwell, 2000; Weinbauer and
43
Rassoulzadegan, 2004). Among marine bacteria, the cyanobacteria Prochlorococcus and
44
Synechococcus are the numerically dominant oxygenic phototrophs (Waterbury et al., 1986;
45
Partensky et al., 1999; Scanlan and West, 2002), and contribute significantly to global primary
46
productivity and global biogeochemical cycles (Liu et al., 1997; Liu et al., 1998). They coexist
47
with their specific viruses – cyanophage – which are believed to play a key role in maintaining
2
48
diversity by “killing the winner” (Waterbury and Valois, 1993; Suttle and Chan, 1994; Thingstad,
49
2000). Moreover, cyanophage impact the evolution of their hosts by mediating horizontal gene
50
transfer (Lindell et al., 2004; Zeidner et al., 2005; Sullivan et al., 2006; Yerrapragada et al.,
51
2009).
52 53
All cyanophage isolated thus far are Caudovirales – tailed, dsDNA viruses belonging to three
54
families: Myoviridae, Podoviridae and Siphoviridae. Most of the cyanomyoviruses are similar to
55
the archetypal coliphage T4, and have genome sizes ranging from 161 – 252 kb, (Sullivan et al.,
56
2010). Cyanopodoviruses, with genome sizes ranging from 42 kb to 47 kb, are similar in gene
57
content and genome organization to coliphage T7 (Chen and Lu, 2002; Sullivan et al., 2005; Pope
58
et al., 2007). There are fewer examples of cyanosiphoviruses (Sullivan et al., 2009; Huang et al.,
59
2011), which have genome sizes ranging from 30 kb to 108 kb and do not share common features
60
with other bacteriophage (Huang et al., 2011). To date, 18 cyanomyovirus genomes (Sullivan et
61
al., 2005; Weigele et al., 2007; Millard et al., 2009; Sullivan et al., 2010; Sabehi et al., 2012), 5
62
cyanosiphovirus genomes (Sullivan et al., 2009; Huang et al., 2011), and 5 cyanopodovirus
63
genomes have been published (Chen and Lu, 2002; Sullivan et al., 2005; Pope et al., 2007; Liu et
64
al., 2007; Liu et al., 2008).
65 66
A hallmark characteristic of the cyanomyoviruses and cyanopodoviruses is that they carry
67
homologs to host genes (which we now refer to as phage/host shared genes (Kelly et al,
68
submitted)) whose products are thought to increase phage fitness under certain conditions. A
69
subclass of these genes, referred to as auxiliary metabolic genes (“AMG” (Breitbart et al., 2007)),
70
encode proteins involved in host metabolic pathways such as the light reactions of photosynthesis
71
(PsbA, PsbD, Hli, PsaA, B, C, D, E, K, J/F (Mann, 2003; Lindell et al., 2004; Lindell et al., 2005;
72
Sullivan et al., 2006; Sharon et al., 2009; Béjà et al., 2012)), the pentose phosphate pathway (PPP
73
(Sullivan et al., 2005; Thompson et al., 2011; Zeng and Chisholm, 2012)), phosphate acquisition
3
74
(Millard et al., 2004; Sullivan et al., 2005; Sullivan et al., 2010; Thompson et al., 2011; Zeng and
75
Chisholm, 2012), nitrogen metabolism (Sullivan et al., 2010) and DNA synthesis (Sullivan et al.,
76
2005), among others. It is thought that the phage carry these homologs to alleviate bottlenecks in
77
these key pathways after host transcription of host homologs has stopped (Thompson et al.,
78
2011).
79 80
Several observations reveal very tight co-evolution of host and cyanophage genomes with regard
81
to these phage/host shared genes. It has been demonstrated, for example, that phage AMGs are
82
expressed simultaneously during infection (Lindell et al., 2007) regardless of their position in the
83
genome, which is striking given the strict genome-order transcription normally associated with
84
such (T7-like) phage. In the case of phage/host shared P-acquisition genes, it has been
85
demonstrated that these genes are carried more frequently by phage in regions of the oceans
86
where cells are P-stressed (Kelly et al., submitted), and expression of the phage version of a high-
87
affinity PO4-transport protein is actually regulated by the host PhoRB two component regulatory
88
system such that the phage gene is only upregulated when the phage is infecting a P-stressed host
89
cell (Zeng and Chisholm, 2012).
90 91
Phage also carry genes that in the host encode high-light inducible proteins (Hlis – also called
92
small CAB-like proteins (Funk and Vermaas, 1999)) thought to protect the photosynthetic
93
complex, or possibly to be involved in a more general stress response in the host (He et al., 2001).
94
Photosynthesis-associated proteins (Hlis, PsbA and PsbD) found in cyanophage are related to
95
their respective orthologous proteins found in cyanobacterial genomes, indicating that they are of
96
cyanobacterial origin (Lindell et al., 2004; Sullivan et al., 2006). Interestingly, there are two types
97
of hli genes found in cyanobacterial genomes, referred to as single- and multi-copy hlis (Bhaya et
98
al., 2002). The single-copy hlis are part of the Prochlorococcus core genome while multi-copy
99
hlis contribute to the flexible genome and are found in highly variable genomic islands (Coleman
4
100
et al., 2006). Cyanophage hlis are homologous to the multi-copy hlis, suggesting that cyanophage
101
play a role in horizontal transfer of multi-copy hlis (Lindell et al., 2004).
102 103
Of the 5 cyanopodoviruses for which complete genomes were available prior to this study, two
104
are of marine origin: P-SSP7 and Syn5 from the Sargasso Sea (Sullivan et al., 2005; Pope et al.,
105
2007), and one is of estuarine environment in Georgia (P60 -Chen and Lu, 2002). The two other
106
isolates, Pf-WMP3 (Liu et al., 2008) and Pf-WMP4 (Liu et al., 2007), are derived from
107
freshwater environment and were isolated on the filamentous cyanobacterium Leptolyngbya. The
108
genome of P-SSP7 is organized in three classes, similar to coliphage T7 (Sullivan et al., 2005),
109
the first involved in takeover of host enzymatic machinery, followed by DNA replication and
110
transcription, and finally viral assembly and morphogenesis (Lindell et al., 2007). Interestingly,
111
whereas P60, isolated from a coastal river (Chen and Lu, 2002), has a similar genetic architecture
112
to the freshwater cyanophage, its genes have greater homology to marine cyanopodoviruses (See
113
Note added in proofs).
114 115
To expand our understanding of the diversity and evolution of cyanopodoviruses infecting marine
116
cyanobacteria, and to provide more reference genomes for metagenomic analyses, we sequenced
117
9 additional cyanopodovirus genomes (Table 1) isolated from diverse environments (Red Sea,
118
Sargasso Sea, Gulf Stream, and Subtropical Pacific Gyre) on host strains belonging to four
119
different ecotypes of Prochlorococcus (HL I, II and LL I, II), and analyzed them in the context of
120
the entire collection.
121
5
122
Results and discussion
123 124
Cyanophage isolation and host range
125
The cyanopodoviruses reported here were isolated over a period spanning more that a decade
126
(1995-2006; Table 1). Diverse strains of Prochlorococcus, including representatives from both
127
high-light and low-light clades, were used as hosts to isolate and maintain phage stocks (Table 1).
128
In contrast to cyanomyoviruses, which can typically infect multiple bacterial strains (Sullivan et
129
al., 2003), these cyanopodoviruses have narrow host ranges, infecting only one or two strains
130
under laboratory conditions (Table 2).
131 132
General features of cyanopodovirus genomes
133
The general features of the cyanopodovirus genomes are shown in Table 1, and include 9
134
genomes reported for the first time, along with 5 existing genomes that were used for comparative
135
analyses. The genomes of cyanopodovirus P-SSP7 and Syn5 are known to be linear, with direct
136
terminal repeats (Pope et al., 2007; Sabehi and Lindell, 2012), and we assume that the new
137
genomes are linear as well. The marine cyanopodovirus genomes range from 42.2 kbp to 47.7
138
kbp, and code for 48 to 68 putative open reading frames (ORFs). The majority of the putative
139
genes are encoded on the same strand, but phage P-RSP2 and P60 that contain an inverted region
140
of 1.5 kb and multiple genome rearrangements, respectively (ORF15-17P-RSP2 – Fig. 3) (See Note
141
added in proofs). Phage isolated on Prochlorococcus have a G+C content of 34% to 40.5%, while
142
those isolated on Synechococcus range from 53% to 55% (Table 1) reflecting the different G+C
143
content of the two hosts and the selective pressure for the phage to adapt their codon usage to that
144
of their hosts (Krakauer and Jansen, 2002; Limor-Waisberg et al., 2011). The ability of
145
cyanomyoviruses to cross-infect both Prochlorococcus and Synechococcus, despite their different
146
G+C content, is thought to be facilitated by the tRNAs encoded by this group of phage (Enav et
147
al., 2012). Only two tRNAs were identified in the cyanopodoviruses, however – one partial tRNA
6
148
in P-SSP7 (Sullivan et al., 2005) and one glycine tRNA in P-RSP5. The latter does not
149
correspond to a rare codon in its host genome or to a highly used codon in the P-RSP5 genome
150
(data not shown), suggesting that the G+C content difference between the genomes of
151
cyanopodoviruses that infect Synechococcus and Prochlorococcus is probably a significant
152
barrier to cross-infectivity (Enav et al., 2012).
153 154
DNA Polymerase Phylogeny and the Core and Pan Genomes
155 156
As a foundation for the analyses that follow, we wanted to identify the core genes shared by a
157
defined set of cyanopodoviruses, as well as their flexible gene set. Previous work on Podoviridae
158
DNA polymerase diversity suggests that this gene could be an acceptable phylogenetic tracer for
159
Podoviridae because it is conserved among different groups of phage and shows signs of vertical
160
inheritance (Chen et al. 2009; Labonté et al. 2009). Thus we used the phylogeny of this gene to
161
define sets of phage for the core and pan-genome analysis, and to guide our analysis of
162
relatedness among the phage. We first cast a broad net, including 71 DNA polymerase genes
163
from phage of different genera and families according to current International Committee on
164
Taxonomy of Viruses (ICTV) classification (Fig. 1). All cyanopodoviruses fell into the same
165
clade – designated the P60-like genus (Lavigne et al., 2008) – with the exception of two
166
freshwater cyanopodoviruses (indicated by three blue dots in Fig 1, as DNA polymerase is
167
encoded by two genes in one of the phage). The P60-like clade can be divided into three
168
subclades, supported by bootstrap values greater than 95% which exclude an outlier – P-RSP2 .
169
The first clade corresponds to the clade MPP-A (marine picocyanopodovirus A) established by
170
Chen and colleagues (2009), while the other two fall within clade MPP-B and form two discrete
171
clades (B1 and B2) (see the core genome phylogeny analysis section below – Figs 1 & 3).
172
7
173
Using an analysis similar to that described in Tettelin et al (2005) and used in our analysis of
174
cyanomyoviruses (Sullivan et al., 2010), we first defined a set of core genes using only the 10
175
cyanopodoviruses isolated on Prochlorococcus (P-RSP2, P-HP1, P-SSP11, P-SSP10, P-GSP1, P-
176
SSP2, P-SSP3, P-SSP7, P-RSP5 and P-SSP9 – Table 1). This core is composed of 19 genes (Fig.
177
2A); adding Synechococcus-specific phage Syn5 to the analysis reduces this number to 17 (Fig.
178
2B), and if Synechococcus phage P60 is added, the shared gene set drops to 12 (Table 3 – Fig.
179
2C). The significant impact of adding P60 is perhaps not surprising given its estuarine habitat.
180
P60’s genome also includes several frameshifts (see below) and incomplete proteins (Table 3)
181
(See Note added in proofs). Finally, adding the two freshwater cyanopodoviruses to the analysis
182
causes a precipitous drop to 3 core genes: primase/helicase, DNA polymerase, and terminase
183
(Fig. 2D) – consistent with the divergence of these phage seen in the DNA polymerase tree (Fig.
184
1).
185 186
Of the 17 core genes shared by the 10 Prochlorococcus cyanopodoviruses and Syn5, 9 are
187
involved in DNA metabolism and assembly of virions, 6 encode phage structural proteins (portal
188
protein, MCP, tail tube proteins A and B, internal core protein, tail fiber), one encodes the
189
terminase, and one codes for an hypothetical protein of unknown function (Table 3; Fig. 3, blue
190
shading). The pan-genome of this set of cyanopodoviruses is composed of 241 clustered
191
orthologous groups (COGs), and the cumulative curve of unique genes is nowhere near
192
saturation, suggesting that vast diversity remains (Fig. 2). Each new genome contributed an
193
average of 15 unique genes to the pan-genome, representing 22.0% to 31.6% of the genes in each
194
genome. In a similar analysis of 16 cyanomyoviruses, each genome adds approximately 90 new
195
genes, or 27.5% to 42.8% of their gene content (Sullivan et al., 2010). In both, the percentage is
196
significantly higher than that observed for host strains, where each new sequenced genome added
197
approximately 7.3% to 11.8% of their gene content to the pan-genome (Kettler et al., 2007).
198
8
199
Genome organization
200
With the exception of P60 (See Note added in proofs) and the two freshwater cyanophage (Pf-
201
WMP3 and Pf-WMP4) gene order in these genomes is roughly consistent with their relatedness in
202
the DNA polymerase tree and core genome analysis (Fig. 3). As in P-SSP7 (Sullivan et al., 2005),
203
order is highly conserved, and strikingly similar to the distantly related prototype enterophage T7
204
(Dunn et al., 1983), supporting the hypothesis that T7-like enterophage and cyanopodoviruses
205
evolved from a common ancestor, diverging at the protein sequence level (Sullivan et al., 2005;
206
Lavigne et al., 2008) while keeping a similar genome organization. The exception is P60, which
207
has multiple inversions (Fig. 3), rendering its genome architecture more similar to the freshwater
208
cyanopodoviruses Pf-WMP3 and Pf-WMP4 (Liu et al., 2007; Liu et al., 2008), while its protein
209
sequences are more similar to those of marine cyanophage (Liu et al., 2007). That is, P60 evolved
210
with the other marine cyanopodoviruses in terms of protein sequences, but underwent multiple
211
genomic rearrangements altering the T7-like genome architecture (See Note added in proofs). We
212
note again, that P60 was isolated from an estuarine environment – quite distinct from the open
213
ocean habitat of the other marine phages.
214 215
Similar to T7 (Molineux, 2006), P-SSP7 genes are grouped into three ordered classes of genes
216
that are sequentially expressed over the course of infection – marked in red, green, and blue along
217
the P-SSP7 genome in Fig. 3 (Lindell et al., 2007). Class I genes encode primarily small proteins,
218
including MarR and gp0.7, thought to be involved in redirecting transcription from the host to
219
the phage (Lindell et al., 2007). This region is highly variable and does not include core genes
220
(see below). Class II includes genes from the RNA polymerase gene up to, but not including the
221
major capsid protein (MCP) gene and is involved in transcription, DNA metabolism and
222
replication, and code for phage scaffolding proteins and structural components. Class III consists
223
of genes involved in phage assembly and DNA maturation (Molineux, 2006) and spans the rest of
224
the genome (Lindell et al., 2007).
9
225 226
Since P60 was the first cyanopodovirus sequenced (Chen and Lu, 2002) we are upholding naming
227
conventions for phage and referring to this as the “P60-like genus” (Lavigne et al., 2008), even
228
though P60 is not a ‘typical’ phage in this group with respect to gene content and organization
229
(See Note added in proofs).
230 231
Phylogeny and classification based on core genomes
232
To further examine the phylogenetic groupings established above, the amino acid sequences of
233
the core genes shared by the marine cyanopodovirus genomes (Fig. 2C) were concatenated and
234
aligned, and a maximum likelihood analysis was applied (Fig. 3, tree on the left). Three distinct
235
subgroups (MPP-A, MPP-B1 and B2) emerged with a topology consistent with the DNA
236
polymerase tree above (compare Fig. 1 and Fig. 3), with P-RSP2 as an outlier, but still belonging
237
to the group. The two divergent freshwater cyanopodoviruses (Fig. 1) were excluded from this
238
core phylogeny analysis since they are missing most of the core genes (Fig. 2D).
239 240
Based on the sequence analysis of the concatenated core genomes (Fig. 3), and its congruence
241
with the DNA polymerase tree (Fig. 1), the 12 marine cyanopodoviruses in Fig. 3 belong to the
242
same genus – the P60-like genus of the subfamily of the Autographivirinae. Even though P-RSP2
243
is divergent from the other members of the group, it clearly falls within this clade. Because P-
244
RSP2 lacks an RNA polymerase gene, however, it would normally be excluded from the
245
Autographivirinae subfamily – which currently includes even very distantly related Podoviridae
246
(eg. T7 and phiKMV – Fig. 1, middle ring) – based on this single criterion. Although the presence
247
of RNA polymerase has been considered a hallmark gene for assignment of a phage to the
248
Autographivirinae, we argue that P-RSP2 should be included based on its similarities to other
249
phage in the P60-like genus (Figs. 1 & 3).
250
10
251
P-RSP2 – the outlier
252
P-RSP2 shares the same genome organization as the other cyanopodoviruses (with the exception
253
of an inverted region in the class III genes), and has the same set of core genes, but it is highly
254
divergent (Figs. 1 & 3). In fact, only one of its core genes (DNA polymerase – Fig. 1) shares
255
more than 60% amino acid identity with the other phage. That it is the only phage in the group
256
that was isolated on Prochlorococcus strain MIT9302 raises the question of whether there is
257
something unique about this phage/host relationship. As discussed above, P-RSP2 is also the only
258
phage in this group that lacks an RNA polymerase gene, essential for inclusion in the
259
Autographivirinae (Lavigne et al., 2008), which in the canonical podovirus coliphage T7 is
260
required for efficient transcription of class II and class III phage genes (Summers and Szybalski,
261
1968; Studier and Maizel, 1969; Studier, 1972).
262 263
Since P- RSP2 does not encode its own RNA polymerase, it likely has evolved mechanisms to
264
use host transcriptional machinery to transcribe class II-III genes, such as additional host-like
265
promoters or modulation of host RNA polymerase with transcriptional regulators such as sigma
266
factors (Sullivan et al., 2009; Pavlova et al., 2012). In T4, for example, middle and late gene
267
expression is coordinated by two transcriptional activators (Brody et al., 1995), but a search for
268
similar activators in P-RSP2 yielded nothing. The G+C content of cyanopodoviruses prohibits the
269
use of computational approaches like those of Vogel et al. (2003) to search for host-like
270
promoters, thus the mechanism by which P-RSP2 transcribes Class II and III genes remains a
271
mystery.
272 273
Comparative genomics
274
The Class I gene set (Fig. 3 – red under the P-SSP7 genome), is composed of very short genes
275
that are highly variable. The set is most conserved in the MPP-B1 group relative to MPP-B2 and
276
MPP-A, and consists of a genetic module of 10-13 genes that code for putative proteins mostly of
11
277
unknown function (Fig. 3). Genes of interest include an integrase (in 4 genomes), and a protein
278
similar to T7 gp0.7 (a transcriptional regulator involved in the takeover of the cellular metabolism
279
by the phage (Molineux, 2006), found in 3 genomes). Three of the 4 genomes that have the
280
integrase gene have a downstream integration signature sequence, suggestive of the potential for
281
lysogeny (discussed in more detail below).
282 283
Class II genes (Fig. 3 – green under the P-SSP7 genome) were among the most conserved (Table
284
3) across all three MPP groups. In addition to core genes, Class II also includes genes encoding
285
RNA polymerase (11/12 genomes), high light inducible proteins (Hli – 9/12 genomes),
286
photosystem II D1 protein (PsbA – 8/12 genomes) and transaldolase (TalC – 8/12 genomes).
287
These genes have orthologs in bacterial genomes (phage/host shared genes), and while
288
photosynthesis-associated genes are thought to have been derived from the host, the origin of talC
289
is not clear (Ignacio-Espinoza and Sullivan, 2012) (see discussion below). The genes hli, psbA
290
and talC, only found in MPP-B1 and MPP-B2, are common in cyanophage (Lindell et al., 2004;
291
Sullivan et al., 2005; Lindell et al., 2005; Sullivan et al., 2006; Chenard and Suttle, 2008; Sullivan
292
et al., 2010; Thompson et al., 2011; Sabehi et al.,2012) and are thought to increase phage fitness
293
during infection (Bragg and Chisholm, 2008; Thompson et al., 2011).
294 295
Class III genes (Fig. 3 – blue under the P-SSP7 genome) mainly consist of genes coding for
296
structural components of mature virions. This class contains a highly variable region that encodes
297
host specificity determinants, including genes in the region downstream of the tail tube protein B
298
(gp31P-SSP7) and through the tail fiber protein (gp36P-SSP7).
299 300
P-SSP2 and P-SSP3: two co-isolated phage reveal a hypervariable genomic region
301
Phage P-SSP2 and P-SSP3 were isolated on the same day, at the same station, from proximate
302
depths (120m and 100m respectively), using Prochlorococcus MIT9312 as the host. Their
12
303
genomes share 95% overall nucleotide sequence identity, and most proteins are 100% identical
304
(Fig. 4). They differ in only 7 genes (Table 5), each being either significantly divergent, or absent
305
in one or the other. The Class I module in the two genomes includes 2 pairs of divergent genes:
306
gp14P-SSP2/gp55P-SSP3 and gp18P-SSP2/gp52P-SSP3, whose gene products share 76% and 66% identity,
307
respectively. Immediately adjacent to the latter pair, P-SSP2 encodes an additional orphan gene
308
(gp17P-SSP2) (Fig. 4) that does not share similarity with proteins in public databases. A second
309
divergent region is located at the C-terminus of the tail fiber(gp16P-SSP3 and gp57P-SSP2) (Fig. 4;
310
Table 5) involved in host recognition,. The P-SSP3 tail fiber gene (gp16P-SSP3) is smaller than that
311
of (gp57P-SSP2). Downstream of gp16P-SSP3 are two small genes - gp15P-SSP3 and gp14P-SSP3 – that are
312
absent in the P-SSP2 genome. The former is an orphan while the latter shares 29% amino acid
313
identity with genes gp40P-SSP7 (Figs. 1 & 3) – and 20% amino acid identity with gp28P-RSM4 in a
314
cyanomyovirus isolated on Prochlorococcus MIT9303 (Sullivan et al., 2010). Genes gp40P-SSP7
315
and gp14P-SSP3 are located in the same genomic region (Fig. 3).
316 317
The N-terminal regions of all marine cyanopodoviruses tail fiber proteins are more conserved
318
than the C-terminal regions (data not shown). The hypervariable C-terminal regions likely help
319
phage adapt to host receptor diversity, and could either result from random
320
mutation/recombination events or through an active mechanism. The latter has been reported in
321
podoviruses that infect the pathogen Bordetella (Uhl and Miller, 1996), which encode a template-
322
dependent, reverse transcriptase-mediated diversity generating mechanism (Liu et al., 2002; Liu
323
et al., 2004; Doulatov et al., 2004), but we could find no evidence of this in our genomes. The
324
counterpart of this phage hypervariable region in their hosts was studied by Avrani et al. (2011).
325
They found that phage resistance in Prochlorococcus was acquired by accumulating mutations in
326
hypervariable genomic islands coding for cell surface receptors, among others. Together, these
327
recent findings beautifully illustrate the ongoing evolutionary arms race between phage and their
328
hosts.
13
329 330
Phage/host shared genes, myo/podo shared genes, and genomic islands
331
One of the most interesting features of some cyanophage is the set of genes they carry that appear
332
to be of bacterial origin (Mann, 2003; Lindell et al., 2004; Millard et al., 2004; Sullivan et al.,
333
2005; Lindell et al., 2005; Sullivan et al., 2006; Sullivan et al., 2010; Thompson et al., 2011;
334
Zeng and Chisholm, 2012) – ‘phage/host shared genes’ (Kelly et al., submitted) – 3 of the most
335
well studied examples being psbA, talC, and hlis. There are 66 genes in these cyanopodovirus
336
genomes with orthologs in Prochlorococcus and Synechococcus (Proportal
337
http://proportal.mit.edu/ - (Kelly et al., 2012)). They group into 12 COGs and are localized in
338
three regions of the phage genomes (Fig. 5A - diamonds). The first includes genes involved in
339
nucleotide metabolism that are found in all branches of the tree of life, and as such we don’t
340
consider it an island. The second contains the psbA and hli genes, and the third includes talC,
341
which is involved in host carbon metabolism, a nuclease-encoding gene, and a gene of unknown
342
function – all genes likely acquired by horizontal gene transfer. These regions, which have some
343
similarity to the genomic islands found in cyanomyoviruses (Millard et al., 2009), are referred to
344
as Island II and III (Fig. 5A).
345 346
Island II (Fig. 3, pink shading), surrounded by core genes, is composed of up to 6 genes,
347
including psbA and hli and additional genes of unknown function (Table 4). Island II genes are
348
not present in the Syn5, and P-RSP2 genomes, and P-SSP9 has only the hli gene (Figs. 3 and
349
5A). The psbA and hli genes in this island have orthologs in cyanomyoviruses and hosts (Mann,
350
2003; Lindell et al., 2004; Lindell et al., 2005; Sullivan et al., 2006), so we wondered whether the
351
rest of the genes in this island did as well (Table 4). gp222_COG and gp30_COG, clusters of
352
genes coding for hypothetical proteins, have orthologs in cyanomyoviruses but not in
353
picocyanobacteria, while gp32_COG has orthologs only in host genomes (Table 4). While the
354
synteny of Island II is not present in the hosts or cyanomyoviruses (data not shown), orthologous
14
355
genes in cyanomyovirus were often located within 15-20 genes of each other suggesting that
356
Island II was likely acquired in small pieces via multiple gene gain events, or as a larger insert
357
that underwent a series of deletions and reorganizations.
358 359
Analysis of the phylogeny of the psbA and talC genes in this expanded set of phage genomes (Fig
360
S1 and S2) generally confirms the conclusions of other reports (Lindell et al., 2004; Millard et al.,
361
2004; Sullivan et al., 2006; Ignacio-Espinoza and Sullivan, 2012) that phage psbA was not
362
recently acquired from picocyanobacteria (Fig. S1) and was likely acquired multiple times
363
(Ignacio-Espinoza et al. 2012). But while the cyanomyovirus psbA genes are closely related to
364
their specific hosts (Fig. S1), cyanopodovirus psbA genes form a clade distinct from those from
365
both cyanomyoviruses and hosts (Fig. S1). Further, cyanopodovirus psbA genes appear more
366
diverse than those of cyanomyoviruses, as indicated by the long branch lengths. As for talC, we
367
confirm that the origin of phage talC is less clear, as it differs significantly from
368
picocyanobacterial versions of this gene (Ignacio-Espinoza and Sullivan, 2012). In fact, phage
369
talC genes are more related to organisms from different phyla (Gammaproteobacteria, Firmicute
370
and Actinobacteria – Fig. S2). In contrast to psbA, cyanophage talC genes are highly conserved,
371
form a monophyletic clade, and likely were only acquired once and then diverged (Ignacio-
372
Espinoza and Sullivan, 2012).
373 374
It is intriguing that if a genome has any of the three genes, psbA, hli or talC, it has them all - with
375
the exception of P-SSP9 which has only one hli gene (Table 3). While Island II contains psbA
376
and hli, and is in the middle of the genome, talC is at the extreme downstream end, making it
377
unlikely that this set of genes could be simultaneously acquired or lost. Yet they are linked in the
378
observed gene gain/loss pattern (Fig. 5A – green and turquoise diamonds in Island II, and red
379
diamonds in Island III) and their co-expression, despite their separation in the genome, led
380
Lindell et al. (2007) to argue that their physical separation might reflect “evolution in progress”
15
381
i.e. an initial step toward the co-localization of these co-transcribed genes (Molineux, 2006;
382
Lindell et al., 2007). The fact that talC lies at the end of all of the cyanopodoviruses now in our
383
collection, however, argues against this, and suggests that there is something significant about
384
this positioning that still eludes us.
385 386
We found 59 proteins (grouped into 16 COGs) shared only by cyanopodo and cyanomyoviruses –
387
i.e. not present in hosts – and all are of unknown function (Table 6). The majority are in Islands II
388
and III (Fig. 5A; Table 6) – also the location of all of the phage/host shared genes.
389 390
The mechanisms underlying the genetic variability in islands in cyanopodoviruses are not clear.
391
In small lambda-like siphoviruses, rapid evolution is facilitated by structural simplicity, a small
392
set of core genes, and the exchange of compatible genetic modules (Botstein, 1980; Hendrix et
393
al., 1999; Comeau et al., 2007). T4-like myoviruses, on the other hand, have a significantly
394
larger, and syntenic, set of core genes, that are for the most part vertically inherited (Filée et al.,
395
2006; Comeau et al., 2007; Ignacio-Espinoza and Sullivan, 2012). This core is involved in
396
replication and assembly of the viruses, often requiring complex protein-protein interactions
397
(Leiman et al., 2003), which reduces the probability of acquiring functional orthologs. Thus in
398
T4-like phage, horizontal gene transfer events are concentrated in hypervariable islands (Comeau
399
et al., 2007; Millard et al., 2009), while the optimal core genome is kept intact (Comeau et al.,
400
2007). Cyanopodoviruses appear to use a strategy similar to T4-like phages, accessing the genetic
401
diversity thought to be involved in adaptation to their host’s metabolism and ecological niche
402
through genomic islands (Filée et al., 2006; Comeau et al., 2007), while conserving an optimal
403
core genome.
404 405
The flexible genome positioning reveals more islands
16
406
We explored whether the frequency of occurrence of a gene in this set of phage (Fig. 2) would be
407
reflected in the position of that gene in a genome, hoping that this might ultimately yield insights
408
into gene gain and loss mechanisms. We divided the flexible COGs into 3 groups for this
409
analysis: i) hyperflexible genes (found in 1-3 genomes – Fig. 5B, red diamonds), ii) flexible genes
410
(found in 4-6 genomes – Fig. 5B, green diamonds), and iii) conserved flexible genes (found in 7-
411
10 genomes – Fig. 5B, blue diamonds). The hyperflexible genes are concentrated in the left
412
extremity of the genomes, which we name Island I, while the flexible genes are more
413
concentrated in Island II and the right arm of the genome (Island III). Finally, the core and the
414
conserved flexible genes appear more distributed along the middle, and slightly in the right arm
415
of the genomes.
416 417
Assuming that these cyanopodoviruses reproduce similarly to T7 (Wolfson et al., 1972;
418
Molineux, 2006), in which the genome replicates as linear concatemers that are cleaved before
419
encapsidation, the propensity of hypervariable genes to be located in Island I could suggest that
420
gene gain/loss events occur primarily at the extremities of the linear genomes. An alternative
421
explanation is lysogeny, in which the temperate phage integrates into the host genome as a linear
422
fragment, and the excision of the phage genome from host chromosome may be imprecise. Two
423
published cyanopodovirus genomes (P-SSP7 (Sullivan et al., 2005) and Syn5 (Pope et al., 2007))
424
and three reported here (P-SSP2, P-SSP3 and P-SSP9) encode a phage-like integrase gene.
425
Furthermore, a 40-50 bp sequence with a perfect match to a cyanobacterial host sequence is found
426
downstream – suggesting a possible host integration site (Sullivan et al., 2005).
427 428
Despite indirect evidence for lysogeny in picocyanobacteria (McDaniel et al., 2002; Ortmann et
429
al., 2002), none of the complete marine cyanobacterial genomes examined contains an intact
430
prophage. This is perhaps not surprising as it is thought that lysogeny is favored when the
431
environment is not optimal for growth of host cells, the opposite of optimally growing laboratory
17
432
cultures (Waterbury and Valois, 1993). Recently, however, a partial prophage sequence, highly
433
similar to P-SSP7, was found in a genome fragment from a wild Prochlorococcus single-cell
434
(Malmstrom et al., 2012).
435 436
Biogeography of cyanopodoviruses
437
To analyze the distribution of the cyanopodoviruses in the oceans and place it in the context of
438
their hosts and other cyanophage, we recruited reads from marine metagenomic datasets using all
439
the cyanophage genomes available (see methods) (Fig. 6-7). We first examined the relative
440
number of metagenomic reads recruited by cyanosipho-, podo-, and myovirus genomes in the
441
viral metagenome samples from the HOT212 sample (N. Pacific) and “Marine Virome”. Using
442
only the 3 previously published cyanopodovirus genomes to recruit, cyanopodoviruses represent
443
22% of all recruited reads in the HOT212 sample (Fig. 6). This jumps to 50% if all 12 genomes
444
are used for recruitment, and a similar proportion emerges from the analysis of the MarineVirome
445
database (Fig. 6).
446 447
Analysis of the relative abundance of the three viral groups in the bacterial-fraction metagenomes
448
from the North Pacific (HOT), Bermuda (BATS), Mediterranean (MedDCM), and the Global
449
Ocean Survey (GOS) (Fig. 6) revealed the dominance of cyanomyoviruses in all samples,
450
consistent with the observations of others for GOS and MedDCM databases (Williamson et al.,
451
2008; Huang et al., 2011). The significant overabundance of cyanomyoviruses in these samples
452
relative to those from the viral fraction (“Marine Virome and HOT212”) samples is likely due to
453
the larger size of cyanomyoviruses, which would cause them to be preferentially retained by
454
filters, either attached to cells or freely floating.
455 456
We analyzed the geographic distribution of cyanopodo- and cyanomyoviruses in the Global
457
Ocean Survey (GOS) and found that cyanopodoviruses are widespread but appear to be more
18
458
abundant in the Caribbean Sea, the Gulf of Mexico, the Eastern Tropical Pacific Ocean and the
459
Indian Ocean (Fig. 7B). Interestingly, abundance of Prochlorococcus recruited reads also
460
qualitatively corresponds to areas of relatively high cyanopodovirus counts (Fig. 7C). Thus
461
although quantitative assessments are not possible, the additional reference genomes for
462
cyanopodoviruses help document their widespread distribution, and point to some hotspots of
463
abundance.
464 465
Conclusions and future directions
466 467
The growing number of cyanophage genomes is helping us better understand their relatedness
468
and evolution, and their interactions with their host cells. Here we used four approaches to
469
explore the similarities and differences among cyanopodoviruses: DNA polymerase phylogeny,
470
concatenated core genome phylogeny, the presence or absence of RNA polymerase, and genome
471
architecture. All but the extremely divergent freshwater cyanopodoviruses would fall into the
472
“P60-like genus” by these criteria, except for P-RSP2, which is an outlier in the concatenated
473
core genome tree, and lacks the hallmark RNA polymerase gene for this group. It is also the only
474
phage isolated on Prochlorococcus MIT9302. Bcause its core genome architecture is similar to
475
the others over much of the genome, and its position in the DNA polymerase tree assigns it to the
476
“P60-like genus” group, we include it here.
477 478
Cyanopodoviruses have two hypervariable island regions in which genes shared with their hosts,
479
and/or with cyanomyoviruses, are concentrated. The positions of hyperflexible genes – i.e. those
480
found in only 1 to 3 genomes – are highly concentrated in a third island at one extremity of the
481
genome. These islands point to interesting regions for unveiling gene acquisition and loss
482
mechanisms. Another hypervariable region, at a finer evolutionary scale, encompasses the C-
483
terminal part of the tail fiber gene in the two very closely related phage, P-SSP2 and P-SSP3.
19
484
This region may indicate an underlying diversity-generating mechanism, helping phage to adapt
485
to the vast diversity of host receptors found in marine environments.
486 487
Our analysis contributes to the growing appreciation of the complexity of phage diversity in the
488
oceans, and the degree to which it is under-sampled.
489 490
Materials and Methods
491 492
Bacteriophage isolation, characterization, DNA extraction
493
Phage were isolated as previously described (Waterbury and Valois, 1993; Sullivan et al., 2003).
494
All phage used in this study were isolated by triple (or greater) plaque purification, followed by
495
two rounds of dilution to extinction. The phage stocks were filtered through 0.2µm and stored at
496
4˚C in the dark. For each phage, we used the earliest sample in our collection that still retained
497
infectivity, to minimize the number of infectious cycles the phage went through – and therefore,
498
the accumulation of mutations in the genome. Nonetheless, all of these phage went through
499
multiple transfers on serially transferred host cultures before the final stock was collected for
500
sequencing. The DNA was extracted as previously described (Henn et al., 2010).
501 502
Genome sequencing, assembly and annotation
503
The genomes were sequenced by 454 pyrosequencing, and assembled and annotated at the Broad
504
Institute as previously described (Henn et al., 2010). The protein sequences were clustered into
505
orthologous groups using OrthoMCL program (van Dongen and Abreu-Goodger, 2012) (see
506
below) with the available cyanophage genomes on Proportal (http://proportal.mit.edu/ ). The
507
protein functional annotations were updated based on the information available on ProPortal.
508 509
Comparative genomics
20
510
For Figure 3 and Figure 4, all marine cyanopodovirus proteins were compared using the program
511
BLASTP (NCBI). The genomes in Figure 3 were extracted from the GenBank file using the
512
software BioEdit (http://www.mbio.ncsu.edu/bioedit/bioedit.html) and imported in Adobe
513
Illustrator. The comparison of P-SSP2 and P-SSP3 was done using BLASTP and the genome
514
maps were generated in R using the package GenoplotR (Guy et al., 2010).
515 516
Core genome analysis
517
The method used for clustering cyanopodovirus proteins into homologous groups was similar to
518
that described previously(Kettler et al., 2007; Sullivan et al., 2010). All marine cyanopodovirus
519
proteins were paired using a reciprocal best BLASTP hit analysis where the sequence alignment
520
covered at least 75% of the protein length of the longest protein and where the percentage of
521
identity was at least 35%. The clusters were then built by transiently grouping these pairs. To
522
increase the sensitivity of the method, HMM profiles (Sonnhammer et al., 1998) were built for
523
each cluster from an alignment of proteins made with Muscle (version 3.7 (Edgar, 2004; Edgar,
524
2004). The protein database was then searched de novo using the HMM models to group proteins
525
with significant homology (E-value ≤ 1e-5). HMMBUILD and HMMSEARCH from HMMER
526
were used to build and search for motifs in the sequence database, respectively.
527 528
Phylogeny of the core genome and of the DNA polymerase
529
All marine cyanopodoviruses were included for this analysis while the freshwater
530
cyanopodoviruses were excluded because they lack most of the core genes. For each phage, the
531
core protein sequences were concatenated in the same order, from the single strand binding
532
protein to the terminase. The concatenated protein sequences were then aligned with MUSCLE
533
(Edgar, 2004; Edgar, 2004) using the default parameters. The alignment was converted to phylip
534
format using the BioPython package (Cock et al., 2009). Phylogenetic analysis of the
535
concatenated proteins was performed using PhyML 3.0 (Guindon et al., 2010). The trees were
21
536
built from the command line with the following options: -d aa -b -4 -m JTT -v e -c 4 -a e -o tlr.
537
Both trees are unrooted. The approach NNIs was used to search the tree topology. The initial tree
538
was based on the BioNJ algorithm using the substitution model JTT (Jones et al., 1992). A
539
discrete gamma model was estimated by the software with 4 categories and a gamma shape of
540
1.384 with a proportion of invariant a.a. of 0.042. The maximum likelihood was estimated using
541
the Shimodaira–Hasegawa–like procedure (Shimodaira, 2002). Finally, the trees were visualized
542
with the online tool iTOL (Letunic and Bork, 2007; Letunic and Bork, 2011). The sequences of
543
the DNA polymerase were retrieved from ACLAME database (ACLAME MGEs. Version 0.4 -
544
family_vir_proph_26 (Leplae et al., 2009)) and were aligned as described above; the tree was
545
built using the same approach as the core genome phylogeny analysis.
546 547
Phage/host shared genes and hypervariable genetic islands in cyanopodoviruses
548
Clustering cyanopodovirus/host and cyanopodovirus/cyanomyovirus shared genes was performed
549
using the OrthoMCL program (van Dongen and Abreu-Goodger, 2012). The clustering was done
550
with a conservative value of 35% for the percent identity and an E-value of 1E-05. To avoid
551
clustering proteins solely on the basis of conserved domains, we pre-filtered our BLASTP results
552
to accept the orthologous pairs only if the sequence alignment covered at least 75% of the length
553
of the longer of the two sequences. The cyanophage and picocyanobacterial genomes used in the
554
clustering analysis are listed in supplemental Table 1. Figure 5 was generated using the python
555
matplotlib module (Hunter, 2007).
556 557
P-RSP2 promoter analysis and transcriptional factor searches
558
The P-RSP2 genome was screened for promoters as previously described (Vogel et al., 2003;
559
Lindell et al., 2007). Briefly, a position-specific weight matrix was built from the -10 box of
560
Prochlorococcus MED4 (Vogel et al., 2003) with the Motif module from the BioPython package
561
(Cock et al., 2009). The phage genomes were searched for this motif. The threshold was set at 7.2
22
562
based on the distribution of scores for the established motif for the -10 promoter box sequences.
563
P-RSP2 coding sequences were analyzed to detect transcription factors using InterProScan
564
(Zdobnov and Apweiler, 2001), Pfam (Punta et al., 2012), and CDD (Marchler-Bauer and Bryant,
565
2004). We were specifically looking for conserved protein domains related to transcription
566
factors or DNA binding domain. Except for the phage proteins known to be involved in DNA
567
metabolism (DNA polymerase, endo/exonuclease, DNA primase, single strand binding protein),
568
no DNA binding motifs could be detected nor conserved domains related to transcription factors.
569 570
Metagenomics
571
Six metagenomic datasets were used in this study: four from the bacterial fraction, (The Global
572
Ocean Survey dataset (GOS (Rusch et al., 2007)), the deep chlorophyll max Mediterranean
573
dataset (Ghai et al., 2010), the Pacific Ocean datasets (Station Hawaii Ocean Time-Series –
574
HOT179 and HOT186 (Frias-Lopez et al., 2008; Coleman and Chisholm, 2010)) and two viral
575
fraction datasets (the MarineVirome (Angly et al., 2006) and the Pacific Ocean dataset (HOT212
576
(this study – NCBI accession: SRA059090)). All datasets, except HOT212, were obtained from
577
the CAMERA website (http://camera.calit2.net/index.shtm). Only the sites with more than 10,000
578
reads were used from the GOS database. The methods used were similar to those described by
579
Malmstrom et al (2012) , and the reference genomes used for recruitment are listed in
580
supplemental Table 2. Briefly, metagenomic reads were matched to reference genomes using
581
BLASTN (Table S1), and those with a bit score of at least 40 were compared against the NCBI nt
582
database to assess if there were other best hits. The number of recruited reads at a GOS site was
583
normalized against the number of reads in the GOS database from that site. Finally, to compare
584
the relative abundance of cyanopodo- and cyanomyoviruses, the normalized read counts for each
585
GOS site were normalized to the average genome size of each phage family – 188780 bp and
586
46320 bp for the cyanomyo- and cyanopodoviruses respectively. The bar graphs were generated
587
in R using ggplot2 package (Wickham, 2009) and the map was generated in R using ggplot2
23
588
(Wickham, 2009), maps (http://CRAN.R-project.org/package=maps), gpclib (http://CRAN.R-
589
project.org/package=gpclib), and maptools (http://CRAN.R-project.org/package=maptools)
590
packages. The shapefile used to create the Galapagos Islands inset was downloaded from ©
591
OpenStreetMap contributors (http://downloads.cloudmade.com).
592
Note added to proof
593
After this manuscript was accepted, we learned that a new version of P60 genome has been
594
generated (Feng Chen, pers. comm.), which contains significant changes from the published
595
version (Chen and Lu, 2002). We re-examined our data in the context of this revised P60 genome
596
and found that some of our statements need to be modified, but the main conclusions of the paper
597
remain the same.
598 599
First, the revised P60 genome organization now makes it more similar to the other
600
cyanopodoviruses, and all the genes are coded on the same DNA strand. Further, this genome
601
makes P60 fall squarely in the P60-like genus as defined by Lavigne et al. (2008). The revised
602
sequence also affects our core gene analysis such that marine cyanopodoviruses and P60 now
603
share 15 core genes instead of 12.
604 605
Acknowledgments
606
We are grateful to Jessie W. Thompson and Qinglu Zeng for comments and edits on the
607
manuscript, and Katherine Huang for her advice and analyses in the early stages of the genome
608
sequencing. This work was supported by grants from the Gordon and Betty Moore Foundation
609
(SWC and MRH), the US National Science Foundation (NSF) Biological Oceanography Section,
610
the NSF Center for Microbial Oceanography Research and Education (C-MORE) (grant numbers
611
OCE-0425602 and EF 0424599), and the US Department of Energy-GTL. SJL was supported by
612
a postdoctoral fellowship from the “Fonds Québecois de la recherche sur la nature et les
613
technologies”.
24
614 615 616 617 618 619 620 621 622 623 624 625 626 627 628 629 630 631 632 633 634 635 636 637 638 639 640 641 642 643 644 645 646 647 648 649 650 651 652 653 654 655 656 657 658 659 660 661 662 663
References Angly, F.E., Felts, B., Breitbart, M., Salamon, P., Edwards, R.A., Carlson, C., et al. (2006) The marine viromes of four oceanic regions. PLoS Biol. 4: e368. Avrani, S., Wurtzel, O., Sharon, I., Sorek, R., and Lindell, D. (2011) Genomic island variability facilitates Prochlorococcus-virus coexistence. Nature. 474: 604-608. Bergh, O., Børsheim, K.Y., Bratbak, G., and Heldal, M. (1989) High abundance of viruses found in aquatic environments. Nature. 340: 467-468. Béjà, O., Fridman, S., and Glaser, F. (2012) Viral clones from the GOS expedition with an unusual photosystem-I gene cassette organization. ISME. J. 6: 1617-1620. Bhaya, D., Dufresne, A., Vaulot, D., and Grossman, A. (2002) Analysis of the hli gene family in marine and freshwater cyanobacteria. FEMS Microbiol. Lett. 215: 209-219. Botstein, D. (1980) A theory of modular evolution for bacteriophages. Ann. N. Y. Acad. Sci. 354: 484-490. Bragg, J.G., and Chisholm, S.W. (2008) Modeling the fitness consequences of a cyanophageencoded photosynthesis gene. PLoS One. 3: e3550. Breitbart, M., Thompson, L.R., Suttle, C.A., and Sullivan, M.B. (2007) Exploring the Vast Diversity of Marine Viruses. Oceanography. 20: 135-139. Brody, N., Kassavetis, A., Ouhammouch, M., Sanders, M., Tinker, L., and Geiduschek, P. (1995) Old phage, new insights: Two recently recognized mechanisms of transcriptional regulation in bacteriophage T4 development. FEMS Microbiol. Lett. 128: 1-8. Chen, F., and Lu, J. (2002) Genomic Sequence and Evolution of Marine Cyanophage P60: a New Insight on Lytic and Lysogenic Phages. Appl. Environ. Microbiol. 68: 2589-2594. Chenard, C., and Suttle, C.A. (2008) Phylogenetic Diversity of Cyanophage Photosynthetic Genes (psbA) in Marine and Fresh Waters. Appl. Environ. Microbiol. 74: 5317-5324. Cock, P.J., Antao, T., Chang, J.T., Chapman, B.A., Cox, C.J., Dalke, A., et al. (2009) Biopython: freely available Python tools for computational molecular biology and bioinformatics. Bioinformatics. 25: 1422-1423. Coleman, M.L., and Chisholm, S.W. (2010) Ecosystem-specific selection pressures revealed through comparative population genomics. Proc. Natl. Acad. Sci. U. S. A. 107: 1863418639. Coleman, M.L., Sullivan, M.B., Martiny, A.C., Steglich, C., Barry, K., DeLong, E.F., and Chisholm, S.W. (2006) Genomic islands and the ecology and evolution of Prochlorococcus. Science. 311: 1768-1770. Comeau, A.M., Bertrand, C., Letarov, A., Tétart, F., and Krisch, H.M. (2007) Modular architecture of the T4 phage superfamily: a conserved core genome and a plastic periphery. Virology. 362: 384-396. Doulatov, S., Hodes, A., Dai, L., Mandhana, N., Liu, M., Deora, R., et al. (2004) Tropism switching in Bordetella bacteriophage defines a family of diversity-generating retroelements. Nature. 431: 476-481. Dunn, J.J., Studier, F.W., and Gottesman, M. (1983) Complete nucleotide sequence of bacteriophage T7 DNA and the locations of T7 genetic elements. J. Mol. Biol. 166: 477-535. Edgar, R.C. (2004) MUSCLE: a multiple sequence alignment method with reduced time and space complexity. BMC. Bioinformatics. 5: 113. Edgar, R.C. (2004) MUSCLE: multiple sequence alignment with high accuracy and high throughput. Nucleic Acids Res. 32: 1792-1797. Enav, H., Béjà, O., and Mandel-Gutfreund, Y. (2012) Cyanophage tRNAs may have a role in cross-infectivity of oceanic Prochlorococcus and Synechococcus hosts. ISME J. 6: 619-628. Filée, J., Bapteste, E., Susko, E., and Krisch, H.M. (2006) A selective barrier to horizontal gene transfer in the T4-type bacteriophages that has preserved a core genome with the viral replication and structural genes. Mol. Biol. Evol. 23: 1688-1696.
25
664 665 666 667 668 669 670 671 672 673 674 675 676 677 678 679 680 681 682 683 684 685 686 687 688 689 690 691 692 693 694 695 696 697 698 699 700 701 702 703 704 705 706 707 708 709 710 711 712 713
Frias-Lopez, J., Shi, Y., Tyson, G.W., Coleman, M.L., Schuster, S.C., Chisholm, S.W., and DeLong, E.F. (2008) Microbial community gene expression in ocean surface waters. Proc. Natl. Acad. Sci. U. S. A. 105: 3805-3810. Fuhrman, J.A. (1999) Marine viruses and their biogeochemical and ecological effects. Nature. 399: 541-548. Funk, C., and Vermaas, W. (1999) A cyanobacterial gene family coding for single-helix proteins resembling part of the light-harvesting proteins from higher plants. Biochemistry. 38: 93979404. Ghai, R., Martin-Cuadrado, A.B., Molto, A.G., Heredia, I.G., Cabrera, R., Martin, J., et al. (2010) Metagenome of the Mediterranean deep chlorophyll maximum studied by direct and fosmid library 454 pyrosequencing. ISME J. 4: 1154-1166. Guindon, S., Dufayard, J.F., Lefort, V., Anisimova, M., Hordijk, W., and Gascuel, O. (2010) New algorithms and methods to estimate maximum-likelihood phylogenies: assessing the performance of PhyML 3.0. Syst. Biol. 59: 307-321. Guy, L., Kultima, J.R., and Andersson, S.G. (2010) genoPlotR: comparative gene and genome visualization in R. Bioinformatics. 26: 2334-2335. He, Q., Dolganov, N., Bjorkman, O., and Grossman, A.R. (2001) The high light-inducible polypeptides in Synechocystis PCC6803. Expression and function in high light. J. Biol. Chem. 276: 306-314. Hendrix, R.W., Smith, M.C., Burns, R.N., Ford, M.E., and Hatfull, G.F. (1999) Evolutionary relationships among diverse bacteriophages and prophages: all the world's a phage. Proc. Natl. Acad. Sci. U. S. A. 96: 2192-2197. Henn, M.R., Sullivan, M.B., Stange-Thomann, N., Osburne, M.S., Berlin, A.M., Kelly, L., et al. (2010) Analysis of High-Throughput Sequencing and Annotation Strategies for Phage Genomes. PLoS One. 5: e9083. Huang, S., Wang, K., Jiao, N., and Chen, F. (2011) Genome sequences of siphoviruses infecting marine Synechococcus unveil a diverse cyanophage group and extensive phage-host genetic exchanges. Environ. Microbiol. 14: 540-558. Hunter, J.D. (2007) Matplotlib: A 2D graphics environment. Comput. Sci. Eng. 9: 90-95. Ignacio-Espinoza, J.C., and Sullivan, M.B. (2012) Phylogenomics of T4 cyanophages: lateral gene transfer in the 'core' and origins of host genes. Environ. Microbiol. 14: 2113-2126. Jones, D.T., Taylor, W.R., and Thornton, J.M. (1992) The rapid generation of mutation data matrices from protein sequences. Comput. Appl. Biosci. 8: 275-282. Kelly L, Ding H, Huang KH, Osburne M, Chisholm SW. (submitted) Features of cyanomyophage gene content and evolution in wild populations and cultured genomes. ISME J. Kelly, L., Huang, K.H., Ding, H., and Chisholm, S.W. (2012) ProPortal: a resource for integrated systems biology of Prochlorococcus and its phage. Nucleic Acids Res. 40: D632-D640. Kettler, G.C., Martiny, A.C., Huang, K., Zucker, J., Coleman, M.L., Rodrigue, S., et al. (2007) Patterns and implications of gene gain and loss in the evolution of Prochlorococcus. PLoS Genet. 3: e231. Krakauer, D.C., and Jansen, V.A. (2002) Red queen dynamics of protein translation. J. Theor. Biol. 218: 97-109. Labonté, J.M., Reid, K.E., and Suttle, C.A. (2009) Phylogenetic analysis indicates evolutionary diversity and environmental segregation of marine podovirus DNA polymerase gene sequences. Appl. Environ. Microbiol. 75: 3634-3640. Lavigne, R., Seto, D., Mahadevan, P., Ackermann, H.W., and Kropinski, A.M. (2008) Unifying classical and molecular taxonomic classification: analysis of the Podoviridae using BLASTP-based tools. Res. Microbiol. 159: 406-414. Leiman, P.G., Kanamaru, S., Mesyanzhinov, V.V., Arisaka, F., and Rossmann, M.G. (2003) Structure and morphogenesis of bacteriophage T4. Cell. Mol. Life Sci. 60: 2356-2370.
26
714 715 716 717 718 719 720 721 722 723 724 725 726 727 728 729 730 731 732 733 734 735 736 737 738 739 740 741 742 743 744 745 746 747 748 749 750 751 752 753 754 755 756 757 758 759 760 761 762 763
Leplae, R., Lima-Mendez, G., and Toussaint, A. (2009) ACLAME: A CLAssification of Mobile genetic Elements, update 2010. Nucleic Acids Res. 38: D57-D61. Letunic, I., and Bork, P. (2007) Interactive Tree Of Life (iTOL): an online tool for phylogenetic tree display and annotation. Bioinformatics. 23: 127-128. Letunic, I., and Bork, P. (2011) Interactive Tree Of Life v2: online annotation and display of phylogenetic trees made easy. Nucleic Acids Res. 39: W475-W478. Limor-Waisberg, K., Carmi, A., Scherz, A., Pilpel, Y., and Furman, I. (2011) Specialization versus adaptation: two strategies employed by cyanophages to enhance their translation efficiencies. Nucleic Acids Res. 39: 6016-6028. Lindell, D., Jaffe, J.D., Coleman, M.L., Futschik, M.E., Axmann, I.M., Rector, T., et al. (2007) Genome-wide expression dynamics of a marine virus and host reveal features of coevolution. Nature. 449: 83-86. Lindell, D., Jaffe, J.D., Johnson, Z.I., Church, G.M., and Chisholm, S.W. (2005) Photosynthesis genes in marine viruses yield proteins during host infection. Nature. 438: 86-89. Lindell, D., Sullivan, M.B., Johnson, Z.I., Tolonen, A.C., Rohwer, F., and Chisholm, S.W. (2004) Transfer of photosynthesis genes to and from Prochlorococcus viruses. Proc. Natl. Acad. Sci. U. S. A. 101: 11013-11018. Liu, H., Campbell, L., Landry, M.R., Nolla, H.A., Brown, S.L., and Constantinou, J. (1998) Prochlorococcus and Synechococcus growth rates and contributions to production in the Arabian Sea during the 1995 Southwest and Northeast Monsoons. Deep-Sea Res. Part II. 45: 2327-2352. Liu, H., Nolla, H.A., and Campbell, L. (1997) Prochlorococcus growth rate and contribution to primary production in the equatorial and subtropical North Pacific Ocean. Aquat. Microb. Ecol. 12: 39-47. Liu, M., Deora, R., Doulatov, S.R., Gingery, M., Eiserling, F.A., Preston, A., et al. (2002) Reverse transcriptase-mediated tropism switching in Bordetella bacteriophage. Science. 295: 2091-2094. Liu, M., Gingery, M., Doulatov, S.R., Liu, Y., Hodes, A., Baker, S., et al. (2004) Genomic and genetic analysis of Bordetella bacteriophages encoding reverse transcriptase-mediated tropism-switching cassettes. J. Bacteriol. 186: 1503-1517. Liu, X., Kong, S., Shi, M., Fu, L., Gao, Y., and An, C. (2008) Genomic analysis of freshwater cyanophage Pf-WMP3 Infecting cyanobacterium Phormidium foveolarum: the conserved elements for a phage. Microb. Ecol. 56: 671-680. Liu, X., Shi, M., Kong, S., Gao, Y., and An, C. (2007) Cyanophage Pf-WMP4, a T7-like phage infecting the freshwater cyanobacterium Phormidium foveolarum: complete genome sequence and DNA translocation. Virology. 366: 28-39. Malmstrom, R.R., Rodrigue, S., Huang, K.H., Kelly, L., Kern, S.E., Thompson, A., et al. (2012) Ecology of uncultured Prochlorococcus clades revealed through single-cell genomics and biogeographic analysis. ISME J. 10.1038/ismej.2012.89. Mann, N.H. (2003) Phages of the marine cyanobacterial picophytoplankton. FEMS Microbiol. Rev. 27: 17-34. Marchler-Bauer, A., and Bryant, S.H. (2004) CD-Search: protein domain annotations on the fly. Nucleic Acids Res. 32: W327-W331. McDaniel, L., Houchin, L.A., Williamson, S.J., and Paul, J.H. (2002) Lysogeny in marine Synechococcus. Nature. 415: 496. Millard, A., Clokie, M.R., Shub, D.A., and Mann, N.H. (2004) Genetic organization of the psbAD region in phages infecting marine Synechococcus strains. Proc. Natl. Acad. Sci. U. S. A. 101: 11007-11012. Millard, A.D., Zwirglmaier, K., Downey, M.J., Mann, N.H., and Scanlan, D.J. (2009) Comparative genomics of marine cyanomyoviruses reveals the widespread occurrence of
27
764 765 766 767 768 769 770 771 772 773 774 775 776 777 778 779 780 781 782 783 784 785 786 787 788 789 790 791 792 793 794 795 796 797 798 799 800 801 802 803 804 805 806 807 808 809 810 811 812 813 814
Synechococcus host genes localized to a hyperplastic region: implications for mechanisms of cyanophage evolution. Environ. Microbiol. 11: 2370-2387. Molineux, I. (2006) The T7 group, In The bacteriophages. Calendar, R. (eds). New York: Oxford, UK: Oxford University Press, pp. 277-301. Ortmann, A.C., Lawrence, J.E., and Suttle, C.A. (2002) Lysogeny and Lytic Viral Production during a Bloom of the Cyanobacterium Synechococcus spp. Microb. Ecol. 43: 225-231. Partensky, F., Hess, W.R., and Vaulot, D. (1999) Prochlorococcus, a marine photosynthetic prokaryote of global significance. Microbiol. Mol. Biol. Rev. 63: 106-127. Pavlova, O., Lavysh, D., Klimuk, E., Djordjevic, M., Ravcheev, D.A., Gelfand, M.S., et al. (2012) Temporal Regulation of Gene Expression of the Escherichia coli Bacteriophage phiEco32. J. Mol. Biol. 416: 389-399. Pope, W.H., Weigele, P.R., Chang, J., Pedulla, M.L., Ford, M.E., Houtz, J.M., et al. (2007) Genome sequence, structural proteins, and capsid organization of the cyanophage Syn5: a "horned" bacteriophage of marine Synechococcus. J. Mol. Biol. 368: 966-981. Punta, M., Coggill, P.C., Eberhardt, R.Y., Mistry, J., Tate, J., Boursnell, C., et al. (2012) The Pfam protein families database. Nucleic Acids Res. 40: D290-D301. Rusch, D.B., Halpern, A.L., Sutton, G., Heidelberg, K.B., Williamson, S., Yooseph, S., et al. (2007) The Sorcerer II Global Ocean Sampling expedition: Northwest Atlantic through Eastern Tropical Pacific. PLoS Biology. 5: 398-431. Sabehi, , G., and Lindell D (2012) The P-SSP7 Cyanophage Has a Linear Genome with Direct Terminal Repeats. PLoS One. 7: e36710. Sabehi, G., Shaulov, L., Silver, D.H., Yanai, I., Harel, A., and Lindell, D. (2012) A novel lineage of myoviruses infecting cyanobacteria is widespread in the oceans. Proc. Natl. Acad. Sci. U. S. A. 109: 2037-2042.Scanlan, D.J., and West, N.J. (2002) Molecular ecology of the marine cyanobacterial genera Prochlorococcus and Synechococcus. FEMS Microbiol. Ecol. 40: 112. Sharon, I., Alperovitch, A., Rohwer, F., Haynes, M., Glaser, F., Atamna-Ismaeel, N., et al. (2009) Photosystem I gene cassettes are present in marine virus genomes. Nature. 461: 258-262. Shimodaira, H. (2002) An approximately unbiased test of phylogenetic tree selection. Syst. Biol. 51: 492-508. Sonnhammer, E.L., Eddy, S.R., Birney, E., Bateman, A., and Durbin, R. (1998) Pfam: multiple sequence alignments and HMM-profiles of protein domains. Nucleic Acids Research. 26: 320-322. Studier, F.W. (1972) Bacteriophage T7. Science. 176: 367-376. Studier, F.W., and Maizel, J.V. (1969) T7-directed protein synthesis. Virology. 39: 575-586. Sullivan, M.B., Coleman, M.L., Weigele, P., Rohwer, F., and Chisholm, S.W. (2005) Three Prochlorococcus Cyanophage Genomes: Signature Features and Ecological Interpretations. PLoS Biol. 3: e144. Sullivan, M.B., Huang, K.H., Ignacio-Espinoza, J.C., Berlin, A.M., Kelly, L., Weigele, P.R., et al. (2010) Genomic analysis of oceanic cyanobacterial myoviruses compared with T4-like myoviruses from diverse hosts and environments. Environ. Microbiol. 12: 3035-3056. Sullivan, M.B., Krastins, B., Hughes, J.L., Kelly, L., Chase, M., Sarracino, D., and Chisholm, S.W. (2009) The genome and structural proteome of an ocean siphovirus: a new window into the cyanobacterial 'mobilome'. Environ. Microbiol. 11: 2935-2951. Sullivan, M.B., Lindell, D., Lee, J.A., Thompson, L.R., Bielawski, J.P., and Chisholm, S.W. (2006) Prevalence and evolution of core photosystem II genes in marine cyanobacterial viruses and their hosts. PLoS Biology. 4: 1344-1357. Sullivan, M.B., Waterbury, J.B., and Chisholm, S.W. (2003) Cyanophages infecting the oceanic cyanobacterium Prochlorococcus. Nature. 424: 1047-1051. Summers, W.C., and Szybalski, W. (1968) Totally asymmetric transcription of coliphage T7 in vivo: correlation with poly G binding sites. Virology. 34: 9-16.
28
815 816 817 818 819 820 821 822 823 824 825 826 827 828 829 830 831 832 833 834 835 836 837 838 839 840 841 842 843 844 845 846 847 848 849 850 851 852 853 854 855 856 857 858 859 860 861 862 863
Suttle, C.A., and Chan, A.M. (1994) Dynamics and Distribution of Cyanophages and Their Effect on Marine Synechococcus spp. Appl. Environ. Microbiol. 60: 3167-3174. Tettelin, H., Masignani, V., Cieslewicz, M.J., Donati, C., Medini, D., Ward, N.L., et al. (2005) Genome analysis of multiple pathogenic isolates of Streptococcus agalactiae: implications for the microbial "pan-genome". Proc. Natl. Acad. Sci. U. S. A. 102: 13950-13955. Thingstad, T.F. (2000) Elements of a theory for the mechanisms controlling abundance, diversity, and biogeochemical role of lytic bacterial viruses in aquatic systems. Limnol. Oceanogr. 45: 1320-1328. Thompson, L.R., Zeng, Q., Kelly, L., Huang, K.H., Singer, A.U., Stubbe, J., and Chisholm, S.W. (2011) Phage auxiliary metabolic genes and the redirection of cyanobacterial host carbon metabolism. Proc. Natl. Acad. Sci. U. S. A. 108: E757-E764. Uhl, M.A., and Miller, J.F. (1996) Integration of multiple domains in a two-component sensor protein: the Bordetella pertussis BvgAS phosphorelay. EMBO Journal. 15: 1028-1036. van Dongen, S., and Abreu-Goodger, C. (2012) Using MCL to extract clusters from networks. Methods. Mol. Biol. 804: 281-295. Vogel, J., Axmann, I.M., Herzel, H., and Hess, W.R. (2003) Experimental and computational analysis of transcriptional start sites in the cyanobacterium Prochlorococcus MED4. Nucleic Acids Res. 31: 2890-2899. Waterbury, J.B., and Valois, F.W. (1993) Resistance to cooccurring phages enables marine Synechococcus communities to coexist with cyanophages abundant in seawater. Appl. Environ. Microbiol. 59: 3393-3399. Waterbury, J.B., Watson, S.W., Valois, F.W., and Franks, D.G. (1986) Biological and ecological characterization of the marine unicellular cyanobacterium Synechococcus. Can. Bull. Fish. Aquat.. Sci. 214: 71-120. Weigele, P.R., Pope, W.H., Pedulla, M.L., Houtz, J.M., Smith, A.L., Conway, J.F., et al. (2007) Genomic and structural analysis of Syn9, a cyanophage infecting marine Prochlorococcus and Synechococcus. Environ. Microbiol. 9: 1675-1695. Weinbauer, M.G., and Rassoulzadegan, F. (2004) Are viruses driving microbial diversification and diversity? Environ. Microbiol. 6: 1-11. Wickham, H. (2009) Ggplot2 : elegant graphics for data analysis. New York: Springer. Williamson, S.J., Rusch, D.B., Yooseph, S., Halpern, A.L., Heidelberg, K.B., Glass, J.I., et al. (2008) The Sorcerer II Global Ocean Sampling Expedition: Metagenomic Characterization of Viruses within Aquatic Microbial Samples. PLoS ONE. 3: e1456. Wolfson, J., Dressler, D., and Magazin, M. (1972) Bacteriophage T7 DNA replication: a linear replicating intermediate (gradient centrifugation-electron microscopy-E. coli-DNA partial denaturation). Proc. Natl. Acad. Sci. U. S. A. 69: 499-504. Wommack, K.E., and Colwell, R.R. (2000) Virioplankton: viruses in aquatic ecosystems. Microbiol. Mol. Biol. Rev. 64: 69-114. Yerrapragada, S., Siefert, J.L., and Fox, G.E. (2009) Horizontal gene transfer in cyanobacterial signature genes. Methods Mol. Biol. 532: 339-366. Zdobnov, E.M., and Apweiler, R. (2001) InterProScan--an integration platform for the signaturerecognition methods in InterPro. Bioinformatics. 17: 847-848. Zeidner, G., Bielawski, J.P., Shmoish, M., Scanlan, D.J., Sabehi, G., and Béjà, O. (2005) Potential photosynthesis gene recombination between Prochlorococcus and Synechococcus via viral intermediates. Environ. Microbiol. 7: 1505-1513. Zeng, Q., and Chisholm, S.W. (2012) Marine Viruses Exploit Their Host's Two-Component Regulatory System in Response to Resource Limitation. Curr. Biol. 22: 124-128.
29
864 865 866
Tables Table 1. General features of the cyanopodoviruses from this study, and of those whose genomes have been previously published. MPP
1
Original host
# ORFs
Host Phage %GC %GC content content Site of origin
1-Sep-99
HQ634152
This study
100
64°16’W 64°16’W
5-Jun-96
HQ337022
This study
25m
22° 45'N 158°00'W
8-Mar-06
GU071104
This study
Aug-95
HQ332140
This study
1-Sep-99
NC_006882
(Sullivan et al., 2005)
31-Aug-95
HQ332137
This study
64°16’W 34°55’E
31-Aug-95
GU071107
This study
13-Sep-00
GU071102
This study
100
31°48’N
64°16’W
31-Aug-95
HQ316584
This study
HL(I)
47039
54
30.8
39.2
P-SSP10
Prochlorococcus NATL2A
LL(I)
47325
52
35
39.2
BATS
P-HP1
Prochlorococcus NATL2A
LL(I)
47536
65
35
39.9
HOTS
4
MPP-B2 P-GSP1
Prochlorococcus MED4
HL(I)
44945
53
30.8
39.6
80
38°21’N
66°49’W
P-SSP7
Prochlorococcus MED4
HL(I)
44970
54
30.8
38.8
BATS
5
100
31°48’N
64º16’W
P-SSP3
Prochlorococcus MIT9312 HL(II)
46198
56
31.2
37.9
BATS
100
31°48’N
64°16’W
P-SSP2
Prochlorococcus MIT9312 HL(II)
45890
59
31.2
37.9
BATS
120
P-RSP5
Prochlorococcus NATL1A
47741
68
35.1
38.7
Red Sea
130
31°48’N 29°28’N
BATS
1
40.5
Gulf Stream
100
Long.
Date water sampled Accession # Reference
Lat. 31°48’N 31°48’N
Prochlorococcus MIT9515
LL(I)
BATS
Depth
MPP-B1 P-SSP11
MPP-A
867 868 869 870 871 872
Phage
Host Genome Clade2 size (kb)
P-SSP9
Prochlorococcus SS-120
LL(II)
46997
53
36.4
SYN5
Synechococcus WH8109
Syn.
46214
61
60.1
55
Sargasso Sea
Surface
36°58’N
73°42’W
30-Nov-86 NC_009531
(Pope et al., 2007)
P60
Synechococcus WH7803
Syn.
47872
80
60.2
53.2
Satilla River6
Surface
-
-
12-Jul-88
AF338467
(Chen and Lu, 2002)
-
P-RSP2
Prochlorococcus MIT9302 HL(II)
42257
48
-
34
Red Sea
Surface
29°28’N
34°53’E
14-Jul-96
HQ332139
This study
-
Pf-WMP3 Leptolyngbya foveolarum
FC3
43249
41
-
46.5
Lake Weiming
nd
-
-
22-Jul-03
EF537008.1
(Liu et al., 2008)
-
Pf-WMP4 Leptolyngbya foveolarum
FC
40938
55
-
51.8
Lake Weiming
nd
-
-
22-Jul-03
DQ875742.1
(Liu et al., 2007)
Classification of phage genomes based on the concatenated core genes phylogeny. “–“ indicates a phage that is not classified in one of the three groups (Fig. 1). 2 Clade names for Prochlorococcus as defined in Rocap et al., (2002) 3 FC = Freshwater cyanophage 4 HOTS = Hawaii Ocean Time Series Station 5 BATS = Bermuda Atlantic Time Series Station 6 Satilla River: estuary - salinity = 30‰ – See note added in proofs.
Table 2. Host range of some of the cyanopodoviruses reported here. + indicates successful infection; - indicates no infection. Clade designations for Prochlorococcus refer to light adaptation properties of host cells as defined in Rocap and colleagues (2002). [Correction added on 29 January 2013 after first online publication: P-SSP3 and P-SSP10 were removed from Table 2 as irregularities were detected in the lysates after publication. This does not affect the genomic data or any of the conclusions of the paper.] P-RSP5
P-RSP2
P-SSP11
-
+
-
-
-
-
+
+
-
-
-
-
HL(I)
-
-
-
-
-
+
LL(I)
-
+ -
+ +
-
-
-
-
-
-
-
P-GSP1
P-HP1
Phage P-SSP7
873 874 875 876 877 878
Prochlorococcus MIT9302 Prochlorococcus MIT9312 Prochlorococcus MIT9215 Prochlorococcus GP2
HL(II)
-
Prochlorococcus MIT9202 Prochlorococcus AS9601 Prochlorococcus MIT9301
HL(II) HL(II)
-
Prochlorococcus MED4
HL(I)
Prochlorococcus MIT9515 Prochlorococcus NATL2A
Host strains tested
Host clade HL(II) HL(II) HL(II) HL(II)
Prochlorococcus NATL1A
LL(I)
-
Prochlorococcus MIT9313
LL(IV)
-
+
31
Table 3. Relatively conserved genes in cyanopodoviruses. Core genes of marine cyanopodoviruses are shown in bold. Classes of genes are as defined for P-SSP7 by Lindell et al. (2007), depicting the order of the timing of their transcription (see Fig. 3). Class II-b genes, which include talC, are transcribed with Class II genes, even though they are positioned at the end of the genome (Lindell et al., 2007) Freshwater Cyano T7-like phage
Class III
Class II-b
P-SSP9
gp28
gp54
gp29
gp15
gp6
Pf-WMP4
Syn5
gp51
Pf-WMP3
P-SSP10
gp11
P60 *
P-SSP11
gp42
P-RSP5
gp29
P-HP1
gp13
P-GSP1
RNA polymerase
P-SSP3
Class II
Putative Function P-SSP2
Gene Class
P-RSP2
Marine cyanopodoviruses
P-SSP7
879 880 881 882
-
gp6
-
-
SSB
gp14
gp30
gp41
gp10
gp50
gp26
gp53
gp28
gp21
gp1
gp47
-
-
-
Endonuclease
gp15
gp31 gp40
gp9
gp49 gp25
gp52
gp26
gp22
gp52
gp46
gp16-17
-
gp17
Primase/Helicase
gp16
gp32 gp39
gp8
gp48 gp24
gp51
gp25
gp24
gp50
gp45
gp18
gp9
gp12
DNA polymerase
gp17
gp34 gp38
gp7
gp46 gp23
gp50
gp24
gp27
gp49
gp44
gp20
gp12-14
gp19
Exonuclease
gp19
gp35 gp37
gp6
gp44 gp22
gp49
gp23
gp29
gp47
gp42
gp21
-
-
Rnr
gp20
gp38
gp35
gp4
gp41
gp19
gp46
gp20
gp33
gp44
gp40
-
-
-
gp34
gp21
gp39
gp34
gp3
gp40
gp18
gp45
gp19
gp34
gp43
gp39
-
-
-
-
gp22
gp40 gp33
gp52
gp39 gp17
gp44
gp18
gp35
gp42
gp37
gp28-43
-
-
Portal
gp24
gp42 gp31
gp50
gp37 gp13
gp42
gp16
gp37
gp40
gp35
gp41
-
-
Scaffolding protein
gp25
gp43 gp30
gp49
gp36 gp11
gp41
gp15
gp38
gp38
gp33
gp38-39
-
-
Hli
gp26
gp44
gp29
gp48
gp35
gp9
gp40.5
gp14
-
gp38.5
-
-
-
-
PsbA
gp27
gp46
gp27
gp47
gp34
gp8
gp40
gp13
-
-
-
-
-
-
MCP
gp29
gp48 gp25
gp46
gp29
gp5
gp36
gp8
gp39
gp37
gp32
gp37
gp32
-
Tail tube A
gp30
gp50 gp23
gp45
gp28
gp2
gp35
gp7
gp40
gp36
gp31
gp35-36
-
-
Tail tube B
gp31
gp51 gp22
gp44
gp27
gp1
gp33-34
gp6
gp41
gp35
gp29
gp33-34
-
-
gp32
-
gp32
gp53
gp20
gp43
gp26
gp68
gp5
gp42
gp34
-
-
-
-
Internal core protein
gp35
gp56
gp17
gp39
gp19
gp65 gp27-26
gp2
gp45
gp31
gp26
-
-
-
Tail fiber
gp36
gp57
gp16
gp38-35
gp16
gp64
gp25
gp1
gp46
gp30-28
gp25
-
-
-
-
gp43
gp2
gp10
gp32
gp9
gp60
gp20
gp49
-
gp23
gp16
-
-
-
-
gp45
gp4
gp8
gp30
gp07
gp59
gp19
gp48
-
gp21
gp18
-
-
-
gp49
gp47
gp6
gp7
gp26
gp5
gp56
gp17
gp46
gp49
gp18
gp11
gp70
-
-
Terminase
gp51
gp10
gp3
gp21
gp1
gp51
gp13
gp48
gp60
gp14
gp9
gp54-55
gp36
gp40
TalC
gp54
gp12
gp1
gp19
gp2
gp50
gp14
gp46
-
-
-
-
-
-
P-SSP7
Orthologs present in cyanomyoviruses
Orthologs present in hosts
Table 4. Genes found in Island II (Fig 3, 5) – an island found in all but 3 of the cyanopodoviruses in Fig. 3 – showing whether they have orthologs in host genomes (Prochlorococcus and Synechococcus), and/or those of cyanomyoviruses.
P-SSP11
884 885 886
gp40
gp27
+
+
gp40.5 gp26
+
+
887 888 889 890 891 892 893 894 895 896 897 898 899
901 902 903 904
P-SSP3
P-SSP2
P-SSP10
P-RSP5
P-HP1
P-GSP1
Phage
Cluster name1
Putative function
PsbA_COG
PsbA
gp47 gp34
gp9
gp13 gp46 gp27
Hli_COG
Hli
gp48 gp35
gp8
gp14 gp44 gp29
gp33
gp7
2
gp222_COG
gp222
gp12 gp45 gp28
gp39
+
-
gp30_COG
hypothetical protein
gp30 gp33
gp9
gp37
+
-
gp32_COG
hypothetical protein
gp32
gp11
gp38
-
+
gp47_COG
hypothetical protein
-
-
orphan
hypothetical protein
gp10
-
-
orphan
hypothetical protein
gp6
-
-
orphan
hypothetical protein
-
-
orphan
hypothetical protein
-
-
orphan
hypothetical protein
-
-
1 2
gp47 gp26
gp10 gp28 gp31
Cluster names refer to the putative function or a phage gene representing the cluster
gp222: conserved hypothetical protein
Table 5. The only genome differences between the most closely related cyanopodoviruses, P-SSP2 and P-SSP3, which were isolated from the same site, on the same host. The remainder of the proteins share ≥95% identity (see also Fig. 4). 900 Orthologous proteins P-SSP2
P-SSP3
% id (aa)
Putative function
gp14
gp55
76.4
Hypothetical protein
gp17
absent
-
Hypothetical protein
gp18
gp52
66.3
Hypothetical protein
gp32
gp39
†
Primase/helicase
gp57
gp16
77.7
Tail fiber
absent
gp15
-
Hypothetical protein
absent
gp14
-
Hypothetical protein
† Frameshifts -- High similarity between the nucleotide sequences
33
905 906 907 908
Table 6. Cyanopodovirus genes shared with (A) picocyanobacterial hosts, Synechococcus and Prochlorococcus (“phage/host share genes”), (B) cyanomyoviruses (“podo/myo shared genes”), or (C) both (phage/host and podo/myo shared genes”). Single-strand binding protein (SSB, bolded) is the only core gene in this set.
A
B
C
909 910 911 912 913 914 915 916 917 918 919 920 921 922 923 924 925
Syn5
P-SSP9
P-SSP11
P-SSP3
P-SSP2
P-SSP10
P-RSP5
P-RSP2
P-HP1
Putative function DNA primase RNA polymerase gp 0.72 SSB3 Class II Unknown Class III Unknown Unknown Thymidylate synthase
P-GSP1
Class1 Class I
P-SSP7
Phage
gp13 gp11 gp51 gp28 gp29 gp29 gp42 gp54 gp11 gp26 gp44 gp14 gp10 gp50 gp47 gp26 gp28 gp30 gp41 gp53 gp32 gp11 gp38 gp41 gp65 gp48 gp40 -
gp10 gp6 gp15 gp4 gp1 gp21 gp61
Class I
Endonuclease Unknown Unknown Class II gp2224 Unknown Class III Unknown Unknown Unknown Endonuclease Unknown Unknown Unknown
gp46 gp33 gp7 gp12 gp45 gp28 gp30 gp33 gp9 gp43 gp32 gp9 gp60 gp49 gp2 gp20 gp30 gp8 gp29 gp43 gp43 gp49 -
gp48 gp25 gp23 gp61
Class II
gp26 gp27 gp49 gp54
Hli PsbA Class III HNH endonuclease Class IIb TalC
gp48 gp35 gp8 gp47 gp34 gp9 gp25 gp3 gp10 gp19 gp2 gp50
gp39 gp37 -
gp14 gp44 gp29 gp40.5 gp38.5 gp13 gp46 gp27 gp40 gp44 gp8 gp5 gp15 gp43 gp12 gp1 gp14 -
-
1
Class of genes as defined for P-SSP7 by Lindell et al. (2007), according to the timing of their transcription gp 0.7: transcriptional regulator 3 Core gene, SSB: Single Strand Binding protein. 4 gp222: conserved hypothetical protein. 2
34
926 927
Supplemental table 1. Cyanophage and picocyanobacterial genomes used for the protein clustering analysis. Bacterial or viral strain
Genome size (kb)
NCBI accession number
Cyanophage S-TIM5 Syn33 syn9 MED4-213 P-HM1 P-HM2 P-RSM1 P-RSM3
152.3 174.3 177.3 181.0 181.0 183.8 177.2 178.8
JQ245707.1 GU071108.1 DQ149023.2
P-RSM4 P-SSM2 P-SSM3 P-SSM4 P-SSM5 P-SSM7 S-PM2 P-RSM6 S-RSM4 S-SM1 S-SM2 S-SSM4 S-SSM5 S-SSM7 S-ShM2 Syn1 Syn10 Syn19 Syn2 Syn30 P60 P-SSP7 P-RSP5
176.4 252.4 179.1 178.2 252.0 182.2 196.3 192.5 194.5 174.1 190.8 182.8 176.2 232.9 179.6 191.2 177.1 175.2 175.6 178.8 47.9 45.0 47.7
GU071099.1 GU071092.1
P-SSP2 P-SSP9 P-SSP10 P-GSP1 P-HP1 P-SSP11 Syn5 P-RSP2
GU071101.1 GU075905.1
CAMERA accession number
BROADPHAGEGENOMES_SMPL_SYN33_G1163 CAM_SMPL_001226 BROADPHAGEGENOMES_SMPL_P-HM1_G1154 BROADPHAGEGENOMES_SMPL_P-HM2_G1155 CAM_SMPL_001227 CAM_SMPL_001229
AF338467.1 AY939843.2 GU071102.1
BROADPHAGEGENOMES_SMPL_P-RSM4_G1161 CAM_SMPL_000950 CAM_SMPL_000897 CAM_SMPL_000949 BROADPHAGEGENOMES_SMPL_P-SSM7_G1169 CAM_SMPL_001230 BROADPHAGEGENOMES_SMPL_S-SM1_G1061 BROADPHAGEGENOMES_SMPL_S-SM2_G1159 CAM_SMPL_000897 BROADPHAGEGENOMES_SMPL_S-SSM5_G1166 BROADPHAGEGENOMES_SMPL_S-SSM7_G1167 BROADPHAGEGENOMES_SMPL_S-SHM2_G1164 BROADPHAGEGENOMES_SMPL_SYN1_G1160 CAM_SMPL_001202 BROADPHAGEGENOMES_SMPL_SYN19_G1165 CAM_SMPL_001201 CAM_SMPL_001200 BROADPHAGEGENOMES_SMPL_NATL1A-7_G1172
45.9 47.0 47.3 44.9
GU071107.1 GU071104.1
CAM_SMPL_000899
47.5 47.0 46.2 42.3
GU071104.1
AY940168.2 GU071103.1 AJ630128.1 FM207411.1 GU071094.1 GU071095.1 GU071097.1 GU071098.1 GU071096.1 GU071105.1 GU071106.1
CAM_SMPL_000948
EF372997.1
BROADPHAGEGENOMES_SMPL_NATL2A-133_G1171 CAM_SMPL_000947 CAM_SMPL_000945
35
P-SSP3 MED4-184 MED4-117 P-SS2
47.1 38.3 38.8 107.5
Cyanobacteria Prochlo. MED4 Prochlo. MIT9313 Prochlo. MIT9303 Prochlo. NATL1A Prochlo. NATL2A Prochlo. AS9601 Prochlo. MIT9515 Prochlo. MIT9215 Prochlo. MIT9211 Prochlo. MIT9312 Prochlo. SS120 Prochlo. MIT9301 Synecho. CC9311 Synecho. CC9605 Synecho. CC9902 Synecho. WH8102 Synecho. WH7803 Synecho. RCC307 Synecho. WH7805 Prochlo. MIT9202 Synecho. BL107 Synecho. RS9917 Synecho. RS9916 Synecho. WH5701
1657 2410 2682 1864 1842 1669 1704 1738 1688 1709 1751 1641 2606 2510 2234 2434 2366 2224 2620 1691 2283 2579 2664 3043
GQ334450.1
BX548174 BX548175 CP000554 CP000553 CP000095 CP000551 CP000552 CP000825 CP000878 CP000111 AE017126 CP000576 CP000435 CP000110 CP000097 BX548020 CT971583 CT978603 AAOK00000000 AATZ00000000 AANP00000000 AAUA00000000
CAM_SMPL_000946 CAM_SMPL_001191 CAM_SMPL_001190 -
MF_SMPL_P9202 MF_SMPL_WH5701
928 929 930 931 932 933 934 935 936 937 938 939 940 941 942
36
943 944
Supplemental table 2. Cyanophage reference genomes used for the metagenomic read recruitment. Phage family Myo.
Genome size (kb)
NCBI accession number
152.3
JQ245707.1
-
Syn33
Myo.
174.3
GU071108.1
BROADPHAGEGENOMES_SMPL_SYN33_G1163
syn9
Myo.
177.3
DQ149023.2
-
MED4-213
Myo.
181.0
P-HM1
Myo.
181.0
GU071101.1
BROADPHAGEGENOMES_SMPL_P-HM1_G1154
P-HM2
Myo.
183.8
GU075905.1
BROADPHAGEGENOMES_SMPL_P-HM2_G1155
P-RSM1
Myo.
177.2
P-RSM3
Myo.
178.8
P-RSM4
Myo.
176.4
GU071099.1
BROADPHAGEGENOMES_SMPL_P-RSM4_G1161
P-SSM2
Myo.
252.4
GU071092.1
-
P-SSM3
Myo.
179.1
P-SSM4
Myo.
178.2
P-SSM5
Myo.
252.0
P-SSM7
Myo.
182.2
GU071103.1
BROADPHAGEGENOMES_SMPL_P-SSM7_G1169
S-PM2
Myo.
196.3
AJ630128.1
-
P-RSM6
Myo.
192.5
S-RSM4
Myo.
194.5
FM207411.1
-
S-SM1
Myo.
174.1
GU071094.1
BROADPHAGEGENOMES_SMPL_S-SM1_G1061
S-SM2
Myo.
190.8
GU071095.1
BROADPHAGEGENOMES_SMPL_S-SM2_G1159
S-SSM4
Myo.
182.8
S-SSM5
Myo.
176.2
GU071097.1
BROADPHAGEGENOMES_SMPL_S-SSM5_G1166
S-SSM7
Myo.
232.9
GU071098.1
BROADPHAGEGENOMES_SMPL_S-SSM7_G1167
S-ShM2
Myo.
179.6
GU071096.1
BROADPHAGEGENOMES_SMPL_S-SHM2_G1164
Syn1
Myo.
191.2
GU071105.1
BROADPHAGEGENOMES_SMPL_SYN1_G1160
Syn10
Myo.
177.1
Syn19
Myo.
175.2
Syn2
Myo.
175.6
Syn30
Myo.
178.8
P60
Podo.
47.9
AF338467.1
-
P-SSP7
Podo.
45.0
AY939843.2
-
P-RSP5
Podo.
47.7
GU071102.1
BROADPHAGEGENOMES_SMPL_NATL1A-7_G1172
P-SSP2
Podo.
45.9
GU071107.1
-
P-SSP9
Podo.
47.0
GU071104.1
CAM_SMPL_000899
P-SSP10
Podo.
47.3
-
P-GSP1
Podo.
44.9
CAM_SMPL_000948
P-HP1
Podo.
47.5
P-SSP11
Podo.
47.0
Syn5
Podo.
46.2
Phage S-TIM5
CAMERA sample accession number
CAM_SMPL_001226
CAM_SMPL_001227 CAM_SMPL_001229
CAM_SMPL_000950 AY940168.2
CAM_SMPL_000897 CAM_SMPL_000949
CAM_SMPL_001230
CAM_SMPL_000897
CAM_SMPL_001202 GU071106.1
BROADPHAGEGENOMES_SMPL_SYN19_G1165 CAM_SMPL_001201 CAM_SMPL_001200
GU071104.1
BROADPHAGEGENOMES_SMPL_NATL2A-133_G1171 CAM_SMPL_000947
EF372997.1
-
37
P-RSP2
Podo.
42.3
CAM_SMPL_000945
P-SSP3
Podo.
47.1
CAM_SMPL_000946
MED4-184
Sipho.
38.3
CAM_SMPL_001191
MED4-117
Sipho.
38.8
CAM_SMPL_001190
P-SS2
Sipho.
107.5
GQ334450.1
-
S-CBS1
Sipho.
53.7
HM480106.1
-
S-CBS2
Sipho.
73.5
GU936714.1
-
S-CBS3
Sipho.
28.0
GU936715.1
-
S-CBS4
Sipho.
62.9
HQ698895.1
-
945 946 947 948 949
38
950 951 952
953 954 955 956 957 958 959 960 961 962 963 964 965 966 967 968 969 970 971 972 973
FIGURES
Figure 1. Maximum likelihood, circular phylogenetic tree of phage DNA polymerase sequences retrieved from ACLAME database (ACLAME MGEs. Version 0.4 - family_vir_ 14 (Leplae et al., 2009)). The bar represents 1 amino acid substitution per site and branches with a bootstrap value greater than 80% are indicated by a black dot. Green dots indicate marine cyanopodoviruses while the three blue dots mark DNA polymerase genes from the two freshwater cyanopodoviruses, one of which encodes DNA polymerase with two genes. The outer, middle and inner rings respectively indicate the phage families, subfamilies and genus when available in NCBI taxonomy database (http://www.ncbi.nlm.nih.gov/taxonomy).
39
A
B
Prochlorococcus cyanopodoviruses
Prochlorococcus cyanopodoviruses + Syn5 200
250
100
50
0
4
2
6
8
125
120 100 80 60
10 2
11 4
3
4
5
6
k genomes
4 8
0
10
100 50 0
974 975 976 977 978 979 980 981 982 983
2
4
6
8
10
12
Number of genomes
6
8
210
350
150
100
300 250 200 150 100
50 14 17 0
2
50 7 2 2 2 4 3 3 9 12 4 6 8 10 12
k genomes
50
13 0
10
400
200
100
19 8 4
2
6
2 4 6
7 8
k genomes
1
10
17
10
Prochlorococcus cyanopodoviruses D + Syn5 + P60 + Freshwater cyanopodoviruses
Number of genes
Number of genes in k genomes
Number of genes
150
4
2
154 150
Number of genomes
250 300
200
100
8
Prochlorococcus cyanopodoviruses + Syn5 + P60
250
150
50
19
19
20
Number of genomes
C
200
40
0
10
Number of genes in k genomes
150
140
Number of genes in k genomes
Number of genes
200
250
Number of genes
Number of genes in k genomes
160
0
2
4
6
8 10 12 14
Number of genomes
300 287 250 200 150 100 50 1417 0
2
7 2 2 2 4 3 2 8 9 1 3 6 8 10 12 14
4
k genomes
Figure 2. Core and pan-genome analysis using different sets of phage genomes in the analysis, as indicated by the headers in A-D. Left panel in each pair: Number of total genes in the core- (circles) and pan- (triangles) genomes as a function of the number of genomes included in the analysis. The core genome is the set of genes shared by all the genomes included in the analyzed subset, while the pan-genome is the total number of unique genes found in the same subset. All possible combinations of genomes were analyzed; the line is drawn through the average. Right panel in each pair: The frequency distribution of genes among the genomes, showing that genes found in only one (k=1) of the genomes are the most common (See note added in proofs for panel C)
40
0.3 P60 47872 bp
65 67 66 68 70 71 69 72 73 74 75 76 77 78 79 80
55 56 57 58 59 60 62 61 63 64
*
61
60
52
47 (holin) 49
51
55 56 57 58 59
54
53
48 50
46 (Tail fiber)
45 (ICP) (ICP)
44
42 43
41 (Tail tube B)
40 (Tail tube A)
39 (MCP)
38 (Scaff.)
(HNH)
50 (TalC) 49 48
52 51
53
62 6160 59 (gp165) 58 57 56 5455
63 (Tail fiber)
64 (Tail fiber)
65 (ICP)
66
68 (gp42) 67
1 (Tail tube B)
12 11 (Scaff.) 10 9 (PsbA) 8 (Hli) 7 (gp222) 6 tRNA 5 (MCP) 3 4 2 (Tail tube A)
18 (gp34) 16 17 1415 13 (Portal)
22 (Exonuc.) 21 20 (gp32) 19 (Rnr)
23 (DNA pol)
21 20 (gp42) 19
34 (ICP)
52 53 54 (TalC)
51
37 38 39 40 41 42 43 44 45 46 47 4849 (HNH) 50
36 (Tail fiber)
35 (ICP)
22 (Tail tube B)
32 33
2 1 (TalC)
3
1211 10 9 8 6 7 (gp49) 5 (HNH) 4
15 14 13
16 (Tail fiber)
17 (ICP)
18 (ICP)
24 23 (Tail tube A)
30 (Scaff.) 29 (Hli) 28 27 (PsbA) 26 25 (MCP)
34 (gp34) 3332 31 (Portal)
37 (Exonuc.) 36 35 (Rnr)
38 (DNA pol.)
39 (Prim./Hel.)
41 (SSB) 40 (Endonuc.)
42 (RNA pol.)
31 (Tail tube B)
30 (Tail tube A)
25 (Scaff.) 26 ((Hli) 27 (PsbA) 28 29 (MCP)
21 (gp34) 22 23 24 (Portal)
20 (Rnr)
18 19 (Exonuc.)
17 (DNA pol.)
14 (SSB) 15 (Endonuc.) 16 (Prim./Hel.)
13 (RNA pol.)
11 12 (TalC)
10
59 1 2 3 4 56 8 7 9
58
57
56 (ICP)
55 (ICP)
52 53 (gp42) 54
51 (Tail tube B)
49 50 (Tail tube A)
48 (MCP)
43 (Scaff.) 44 (Hli) 45 46 (PsbA) 47
39 (gp34) 4041 42 (Portal)
35 (Exonuc.) 3637 38 (Rnr)
33 34 (DNA pol)
30 (SSB) 31 (Endonuc.) 32 (Prim./Hel.)
29 (RNA pol.)
28
20 19 (TalC)
21
33 31 32 30 29 28 27 26 25 (HNH) 24 23 22
34
35 (T4 like PTF)
36
37
38 (Tail fiber)
39 (ICP)
43 (gp42) 42 41 40
44 (Tail tube B)
45 (Tail tube A)
46 (MCP)
49 (Scaff.) 48 (Hli) 47 (PsbA)
50 (Portal)
3 (gp34) 2 1 53 51 52 (TMP)
6 (Exonuc.) 5 4 (Rnr)
7 (DNA pol)
8 (Prim/Hel)
10 (SSB) 9 (Endonuc.)
8 (MCP)
40
41
42
48 (gp165) 4647 (gp48) 44 45 (gp50) 43 (TalC)
51 50 49
52 (PNAP)
1 (Tail fiber)
2 (IPC)
3
5 (gp42) 4
6 (Tail tube B)
7 (Tail tube A)
15 (Scaff.) 14 (Hli) 13 (PsbA) 12 (gp222) 11 10 9
16 (Portal)
19 (gp34) 18 17 (TMP)
20 (Rnr)
23 (Exonuc.) 2221 (gp32)
24 (DNA pol.)
25 (Prim./Hel.)
28 (SSB) 27 26 (Endonuc.)
29 (RNA pol.)
39 38 37 36 35 (gp9) 3433 32 31 30
12
13
24 23 22 21 20 19 (gp165) 18 (gp48) 17 15 16(gp50) 14 (TalC)
25 (Tail fiber)
26 (ICP)
27 (ICP)
18
(gp50) (Endonuc.)
66
1
13 12 11 10 9 8 7 (gp165) 6 (gp48)5 4 3 2
14 (TFP)
15
16 (Tail fiber)
17
19 (ICP)
22 21 20
23
30 29 28
26 25 24
27 (Tail tube B)
28 (Tail tube A)
33 (gp222) 3231 30 29 (MCP)
36 (Scaff.) 35 (Hli) 34 (PsbA)
40 (gp34) 39 38 (TMP) 37 (Portal)
45 44 (Exonuc.) 43 42 (gp32) 41 (Rnr)
46 (DNA pol)
47
48 (Prim./Hel.)
50 (SSB) 48 (Endonuc.)
51 (RNA pol.)
64 63 (gp12) 62 60 61 59 (gp9) 58 57 56 55 54 52 53
32 (gp42) 31
33 (Tail tube B)
35 (Tail tube A) 34 (Tail tube B)
41 (Scaff.) 40.5 (Hli) 40 (PsbA) 39 ((gp222) 38 37 36 (MCP)
42 (Portal)
45 (gp34) 44 43 (TMP)
49 (Exonuc.) 48 47 (gp32) 46 (Rnr)
50 (DNA pol)
53 (SSB) 52 (Endonuc.) 51 (Prim./Hel.)
54 (RNA pol.)
11 10 9 8 (gp12) 7 6 (gp9) 5 4 3 2 1
65
(TalC)
13
28 (Tail fber) 27 26 25 24 23 22 21 20 19 18 17 (gp50) 16 15 14
29
30 (Tail fiber)
31 (ICP)
32 (IVP)
34 (PKS) 33 (IVP)
35 (Tail tube B)
36 (Tail tube A)
39 38 (Scaff.) 38.5 (Hli) 37 (MCP)
43 (gp34) 41 42 (gp35) 40 (Portal)
48 (Endonuc.) 47 (Exonuc.) 46 45 44 (Rnr)
49 (DNA pol.)
Class Class III III
54
53
42 43 44 45 46 47 48 49 50 51 52
41 (Portal)
38 39 (Scaff.) 40
35 (Tail tube A) 36 (Tail tube A) 37 (MCP)
34 (Tail tube B)
27 28 29 30 31 32 33
37 (Portal)
34 (gp34) 35 36
33 (Rnr)
28 29 (Exonuc.) 30 31 32
27 (DNA pol)
25 26
27 26 (SSB) 25 (Endonuc.) 24 (Prim./Hel.)
Class C ass IIII Class
26
21 (Exonuc.) 22 23 24 25
20 (DNA pol)
19
18 (Prim./Hel.)
13 14 (Endonuc.) 16 15 17
24 (Prim./Hel.)
2 1 (SSB) 53 52 (Endonuc.) 51 50 (Prim./Hel.)
28 (RNA pol.)
11 (RNA pol.)
12
18 17 16 15 13 14
P-SSP11 47039 bp
12
17 (Int) 19 20 18 21 (SSB) 22 23 (Endonuc.)
5 4 (MarR) 3 (Int)
Class Class II
16
15 (RNA pol.)
6 (RNA pol.)
13 14 15 16 1718 20 19 21 2223 2526 24 27 (Int)
P-HP1 47536 bp
10 11
Syn5 46214 bp 55 54 5352 5150 49 48 4647 4544 (gp0.7) 43 (Int)
3 2 1
7 6 5 4
8
18 (gp165) 17 16 1514 13 12 (gp48) 11 10 (Endonuc.) 9
2120 19
22
24 23
25 (Tail fiber)
26 (ICP)
27
28
29 (Tail tube B)
31 (Tail tube A) 30
32 (MCP)
34 33 (Scaff.)
35 (Portal)
38 37 (TMP) 36
39 (gp 34)
41 (gp32) 40 (Rnr)
43 (Endonuc.) 42 (Exonuc.)
44 (DNA pol)
45 (Prim/Hel)
48 47 (SSB) 46 (Endonuc.)
P-RSP2 42257 bp
8 9
P-SSP9 46997 bp
7
MPP-A
6 (RNA pol.)
P-RSP5 47741 bp
7 10 13
P-SSP7 44970 bp 56
P-SSP3 46198 bp
1 2 3 4 5 6 7 8 9 10 11 (gp0.7) 12 (Int)
P-SSP2 45890 bp
47 46 4445 43 42 41 40 383937 36 35 34 323331 2930
P-GSP1 44945 bp
10 (DNA Prim.) 9 7 8
MPP-B2
11 12
P-SSP10 47325 bp
2 4 5 6 8 9 11 12 14
MPP-B1
1 3
984 985 986 987 988 989 990 991 992 993 994 995 996 997 998 999 1000 1001 1002 1003 1004 1005 1006 1007 1008 1009 1010 1011 1012 1013 1014 1015 1016 1017 1018 1019 1020 1021 1022 1023 1024 1025 1026 1027 1028 1029 1030 1031 1032 1033 1034 1 2 3 4 5
Island II
(endonuc.)
*
Figure 3. Alignment of the genomes of 12 cyanopodoviruses. Orthologous proteins represented in color other than white share 60% amino acid identity or more, while those shown in white do not. The core proteins shared by all cyanopodoviruses are linked by blue shading and genomic Island II (see Fig. 5) is highlighted by pink shading. Cyanopodovirus/host shared proteins and cyanopodovirus/cyanomyovirus shared proteins are designated by small diamonds and triangles, respectively (see also Fig. 5 & Table 6), and each different cluster is represented by a different color except for singletons that are represented in white. The phylogenetic tree on the left was generated from an alignment of the concatenated core protein sequences using a maximum likelihood method. Branches with a bootstrap value greater than 80% are indicated by a black dot. The phage genomes were classified into three groups based on the concatenated core gene phylogeny of the 12 cyanopodoviruses (Boxes – MPP-A, MPP-B1 and MPP-B2 (MPP: Marine picocyanopodovirus) ); PRSP2 is an outlier based on this analysis. The bar represents 0.3 amino acid substitutions per site. (P60 genomes – See note added in proofs)
41
1036 1037
Figure 4. Alignment of the genomes of phage P-SSP2 and P-SSP3. Each gene product was aligned with its homolog and the percent identity was calculated using the length of the longest protein as the denominator. The colors indicate the percent identity between proteins (from 80% to 100%) while the red shading and thin red lines indicate the zone of homology between the DNA sequences where bit score is higher than 40. Proteins in white share less than 80% identity and are reported in Table 5.
1035
42
1038 1039 1040 1041 1042 1043 1044 1045 1046 1047 1048 1049 1050 1051 1052 1053 1054
Figure 5. A) Position of cyanopodovirus/host shared genes (diamonds) and cyanopodovirus/cyanomyovirus shared genes (triangles) in cyanopodovirus genomes (symbols are positioned in the middle of the genes). The position of the genes is relative to the position (marked as 0) of the ribonucleotide reductase genes (rnr). When a diamond and a triangle co-localize, the cyanopodovirus gene is shared by both host and cyanomyovirus genomes. Orthologs determined using OrthoMCL are represented in the same color. Singletons are shown in white. B) Position of flexible genes (Fig. 2B) in the genomes, according to their frequency distribution (see Fig. 2) Red diamonds indicate genes shared by 1-3 genomes; green diamonds shared by 4-6 genomes; and blue diamonds shared by 7-10 genomes. The histogram on top indicates the relative counts of genes in the various categories present in overlapping sliding windows of 500 bp. The grey shading indicates apparent genome islands. Island I is identified primarily by the set of the most hypervariable genes, occurring in only a few genomes (red diamonds, panel B), while the other two islands are evident in both panels A and B. The orange shading marks the region of the genome involved in DNA replication and transcription, which is not considered a genomic island as these genes are shared by all branches of the tree of life. C) Relative counts of core genes present in overlapping sliding windows of 500 bp.
43
Viral metagenomic datasets
Bacterial metagenomic datasets
100% 90%
Cyanosiphovirus
80%
Cyanomyovirus
70%
Cyanopodovirus
60% 50% 40% 30% 20% 10%
1055 1056 1057 1058 1059 1060 1061 1062 1063 1064
BATS216
HOT186
HOT179
MedDCM
GOS
Marine Virome
HOT212
HOT212 (3 cyanopodovirus genomes)
0%
12 cyanopodovirus genomes
Figure 6. Proportion of reads recruited from different metagenomic datasets by different families of cyanophage. The number of recruited reads was normalized to the average size of the genome of each phage family. “Bacterial metagenomes” refers to viral sequences found in samples that were designed to collect the bacterial fraction; viruses are by-catch. “Viral metagenomes” refers to samples that were collected specifically to capture the viral fraction. For the HOT212 sample, we compare the recruitment proportions obtainedusing the cyanopodovirus genomes extant before this study (3 phage: P-SSP7, Syn5 and P60), and those obtainedusing all marine cyanopodoviruses.
44
A
3.5e-07 3.0e-07
Cyanomyovirus Normalized recruited read counts
2.5e-07 2.0e-07 1.5e-07 1.0e-07
C D
0.0e+00 1.0e-07 5.0e-08 0.0e+00
Prochlorococcus Normalized recruited read counts
GS051 G G GS047 G GS037 G GS035 G GS031 GS030 G GS036 G GS032 G GS029 G GS033 G GS027 G GS028 G GS026 G GS034 G GS025 G GS023 G GS022 G GS021 G GS020 G GS019 G GS018 G GS017 G GS016 G GS015 G GS014 G GS013 G GS012 G GS011 G GS010 G GS009 G GS008 G GS007 G GS006 G GS005 G GS004 G GS003 G GS002 G GS001a GS S GS001b GS S GS001c GS S GS000a GS S GS000b GS S GS000d GS S GS000c GS S GS123 G GS122a GS S GS122b GS S GS121 G GS120 G GS119 G GS148 G GS149 G GS117a GS S GS117b GS S GS116 G GS115 G GS114 G GS113 G GS112a GS S GS112b GS S GS111 G GS110a GS S GS110b GS S GS109 G GS108a GS S GS108b GS S
B
Cyanopodovirus Normalized recruited read counts
5.0e-08
50
0
GS117a GS S1 S117a
GS047 GS05 S051 GS051
1.5
GS035
GS116 GS115 5
GS114
GS113 GS112a GS111 GS110a G GS109 9 GS108a G 10
GS026
1.0
0.5
GS119
GS030 GS036 GS029
0.0
GS031 -0.5
-1.0
-91.5
1065 1066 1067 1068 1069 1070 1071 1072
−50
GS120 G GS G GS121
GS034
GS122a G G GS123
GS032
-91.0
−100
GS027 GS028 GS033
-90.5
-90.0
-89.5
0
100
Figure 7. Normalized recruited read counts corresponding to (A) cyanomyoviruses and (B) cyanopodoviruses in the GOS database. Each bar represents a sampling site. The number of reads was normalized to the average size of the genome of each phage family and to the total number of sequencing reads at each of the GOS sites. (C) The relative abundance of Prochlorococcus is shown as a series of dots for which the size is proportional to the counts of normalized recruited reads. (D) Map illustrating the position of the GOS sites.
45
WH7805 12138
15
31
02 1
P9313 04931
WH7805 07341
SynRCC307 1441
Sy nec hoc occu s
1 44 0 18 S9 7 3 2 90 00 2 9 22 S9 03 90 2 1 1 S9 97 90 91 2 1 97 BL1 61 07 0 816 BL1 9 07 0 935 1 BL1 07 0 9336 BL10 7 081 79 S8102 21261 S8102 105 91
07
2
07
30
CC 3
S8102 23831
S9605 03131
nR
CC
CC 3
nR
RS9917 10626 SynW H7803 0784 WH7 805 0 0300 BL1 07 1 426 S99 0 02 1 014 S96 1 05 1 121 S8 10 1 2 1 S9 64 31 21 1 1 W 92 H5 61 70 66 1 1 60 26 RS 53 9 C PR 917 1 G 29 0 75 00 95
0.1
S9605 05371
Sy
Sy
RS9916 2789 9
5 20 2 9 11 n1 00 Sy TG CP 0 66 S 6 30 21 6í gp V9 81 BS 001 UG CP 60 65 660 220 Syn1 3 213 Syn3 66K0 CPLG 00104
cu oc oc ch ne Sy
WH7805 00325 315 WH7805 00 5 7 0802 RS991 498 17 03 RS99 4917 916 3 RS9 61 395 916 9 6 S R 950 3 6 20 991 11 RS 1 0 0 19 57 07 WH 1 1 0 26 57 55 H 1 W 1 61 70 36 1 0 H5 1 W 23 31 1 19 S9 1 39 1 3 24 9 S 11 3 S9
s
1073 1074 1075 1076 1077 1078 1079
CPXG 00013
SynW H78 03 0 366 SynW H780 3 208 4 SynWH 7803 0 790 WH7805 1212 8
nR
Syn. cyanomyovirus
s
25
0
1 46 21 25 21 P9 0 0 51 2 1 32 S 1 S 1 1 TL 72 NA 15 1 TL 031 NA 03 1 L T 761 NA 12 TL2 NA 131 2 15 L T NA 2831 L2 0 T A N 18681 P9303 091 P9303 04 P9313 19281
5 P9
yovir u
Sy
Synechococcus
3660 PRTG 0 0160 3660 P930 1 02 451 P92 02 2 481 P92 02 1 281 A9 1 601 12 A9 521 60 1 0 P9 24 21 41 5 1 P9 30 31 11 2 1 P9 23 3 30 12 41 0 23 P9 (' 41 51 5 02 55 1
Pr o. Cy an o Pro. cyanopodovirus
CY IG PR OG 000 09 CY 0 00 PG 13 00 PR 03 UG 4 00 CY 04 OG 0 00 04 3
us cc co ro hlo oc Pr
Pro. cyanomyovirus
35606
Prochlorococcus
3+0 3+0
Legend
&30* 8 0010 PRAG 0 366 6 010 G 0 CPQ 185 00 XG 01 1 CY 00 PG CP 7 60 02 66 00 46 SG 0 0 7 PR 0 02 LG 7 7 CY SP 004 0 G Q PR
nom
s iru ov d po
PS
Cya
s
u vir yo om an Cy
n. Sy
Syn.
yovirus Cyanom Pro.
s iru ov y om yan . C n y S
Figure S1. Maximum likelihood, circular phylogenetic tree of PsbA from cyanophage and marine picocyanobacteria ( (Kelly et al., 2012) - http://proportal.mit.edu/ ). The bar represents 0.1 amino acid substitutions per site and branches with a bootstrap value greater than 80% are indicated by a black dot. The ring indicates the origin of PsbA sequences.
46
P PS SS SM M2 SS 5 (P (PS KLO XV M R SM WR SR 2 T UULG S Sy (S G 2 7K HUP XV' 666 M4 n1 ( SM 00 244 id 1 ( S Fla OLQX atus RWRJ 60 0 SR yn 2 2 97) ) voba So D 6 S 1 5 cter P$V liba PD 66 M4 200 8) Nitr UL ium WU 0 0 osop john $7& cter u WLPD 46) ) &RO um OLQVH soni & sita ilu tu OODD HURI s mariti ae UW s Elli DFLH QV$ mus S 101 (1 n607 2FH DQLFD 6 7&& CM 46 XOLV 1 (1 2997 Deino Chlor VS 6152 62 cocc obium us ra +7&& 7809 ) phae diodu obac rans ) teroid es DSM R1 (16 $QDHUR 266 ( 157948 6DOLQL P\[RED 2266 7) EDFWHU FWHUGH 95 UX 49 KDORJHQ EHU'6 1) DQV& $QDSODV 3í&0 PDSKDJR F\WRSKLOX P+= Ehrlichia ru minantium s tr. Gardel (5 8416808) 6SKLQJRS\[LVDOD VNHQVLV5% 1HRULFNHWWVLDVHQQHWVXVWU0L\D\DPD URS
nd
3LF
Ca
RWX
PE
GLX
VWUL
&OR
3HFWREDFWHULXPDWURVHSWLFXP6&5,
an Cy
om yo v
0 5) 5 259 QVLV 0 217 UL- OLH \OR SVD L50 ps ( Q US X ce WH WHU MX 0 rin 97 60 DF DF UMH UL5 ya p D2 ' RE E FWH OD OLF S\OR ED FWHU mbla ME HQV H is LJ OR + DP S\ ED Tre ns 1) OH[ & DP S\OR tus nde UVD 772 978) & DP ida bla DFWH 4952 264 2 & and kea ORE ta ( (8 C ine RKD bra ridis 5) Re URP a gla igrovi UHXV & 6076 Ellin &K ndid on n VDX $7& tus Ca traod RFFX UHYLV sita V$ Te SK\ORF LOOXVE acter u XVWULV%L 6WD REDF s Solib DVSDO 2) /DFW didatu RPRQ 2256 Can RSVHXGnetii (817 6 5KRG lla bur V6 :+ Coxie FKRFRFFX + FXV: 3 6\QH FKRFRF FXV0,7 6\QH 3 RURFRF V0,7 3URFKO ) URFRFFX H702 3URFKOR JDOHWWLQJD 0 7KHUPRWR DJDODFWLDH1( 6WUHSWRFRFFXV 7) m (Q88UA Lactobacillus plantaru Enterococcus faecalis (81722290) Enterococcus faecium DO (68196462) Leptospirillum rubarum Candidatus P elagibacter ub ique HTCC10 &DQGLGDWX 02 (91762288) :LJJOHV V&DUVRQHOODUXGGLL ZRUWKLD JORVVLQ )UDQFL LGLDHQG VHOODWXOD 0DJQ RV\PELRQW UHQVLV RI*ORVV VXEVS +DKH DSRUWKHRU LQDEUHY KRODUF LSDOSLV WLFD)7 Reine OODFKHMX \]DHí 1) ) 4. Yers kea bla HQVLV.&7 í *, & $OWH inia pes ndensis Bau URPRQ tis Pes MED29 7 &D man DGDOHV toides 0D QGLGD nia cica EDFWH F (145 ULXP 2121 delli $OF ULQRE WXV% 7 20) nico ORFK 5 DQLY DFW la s :í Re KRGR RUD HUDT PDQ tr. Hc F ine VS [ER XDH QLDI (Hom ; ulvim kea LULOOX UNX ROHL9 ORULG alod DQX A DQWK a bla PU PH 7 isca V 7 gro R rina nd XEUX QVLV coa 3D HWUD bac PRQ pe ens P$ 6. gula UD K\ te DV lag is M 7& ta) ( P P riu FD i H 9467 ED & HF HQ m P T 6584 LX DW tu SH CC 297 P K m 25 ) V WU WH HUP efa LV 06 S ( WUD c XUH RSK iens YF 1147 OLD LOD str DPS 065 . C HV 88 WULV ) 5 VWU 8 (1 5 89 05 21 )
cus hococ
} Synec coccus } Prochloro
6 0 $ 3& WU - VV P QX Ií QWX DUL ID P 5Z 0 ( 1 3 LQ V 8: LD FFX S 01 DQ FR UV CY DQV & SKLOD 1) & ER KP UR FWH p C J RH 0459 LV KOR ED s HOH XP $7 (279 /H URF KUR ece GLWLV SRQLF DQQLL LDDP ae) 3 V\F oth DE MD XP P\G taci í 3 an RUK PD ED KOD pis ) P Cy DHQ WRVR FWHU WRF HX H( zongia & KLV WRED V3UR FRLG HQLD ai 6F LQH DWX PGLV HUHLV . Bp (B í $F QGLG HOLX EDFW la str ) $5 5288 co &D RVW KUR $+ 50 di W\ S LV (154 í 'LF UPLQH a aphi KRPDW 149 )Z 9H hner LDWUDF C 29 UVS Buc am\G REDFWH us ATC Chl URP\[ cus gnav QG $QDH inococ $. WU/DQJHOD )V Rum VKLORQLL LQXP 7796) 9LEULR LGLXPERWXO (15450 CC 17982 &ORVWU GLXPGLIILFLOH icus AT &ORVWUL yces odontolyt 8052 ckii NCIMB Actinom m beijerin ) Clostridiu /%í %DFLOOXVVS155 ULDH6G 6KLJHOODG\VHQWH $FWLQREDFLOOXVVXFFLQRJHQHV=
s iru ov od nop Cya
1080 1081 1082 1083 1084 1085
6\Q Syn9 (BSV Syn1 0 (CP 9 gp199 Syn2 (C UG 00196)) PTG 0 010 Syn19 ( Syn19 1 0) 90) 6\Q6\ Q) PSSM7 (PSS M7 207) SSSM4 (CYXG 00162) SSSM5 (SSSM5 195) PRSM4 (PRSM4 209) SSM1 (SSM1 205) 6\Q&35* M4 171) PSSM4 (PSS &34* 3560 5$* 3 3660 3* 061) &3 00 3560 CPLG 2 027) 2 ( M SSSM 2 (SSh 0025) 0 ) SShM (CPXG 00014 ) 6 19 UG PRSM 6 (PR G 000 02) O 00 PSSP 5 (CY PG 0 ) Y 0 SP PS P1 (C 2* 005 PH 35 YIG 0 2) C * 001 4) 63 5 ( 36 RSP 356 LG 0 7 05 9) P P 01 Y 63 (C SS 0 ) 36 SP2 7 (P QG 0 27 ) 8 PS SSP PR 0* M1 2 22 P P1 ( &3 H M2 S (P H PG M1 (P PH M2 (' PH 0
Cyanomyovirus
Acidobacteria Epsilonproteobacteria Cyanobacteria_Chroococcales_Cyanothece Nitrospirae Euryarchaeota Chlamydiae Amoebozoa Bacteroidetes Thaumarchaeota Firmicutes Metazoa Gammaproteobacteria Betaproteobacteria Fungi Chlorobi Alveolata Cyanobacteria_Nostocales_Nostocaceae Thermotogae Euglenozoa Deinococcus Alphaproteobacteria Deltaproteobacteria Cyanobacteria_Gloeobacteria_Gloeobacterales Actinobacteria Planctomycetes
%LI LG RE DF m ari WH ULX ne a P cti OR T QJ $F roph nob XP LGR ery ac 1 WKH m teri Ru a w um N && UP bro X ba VF hip PH oca cte ple S rd HOO r x $ON X i ORO\ TW C20 ioid yla DOLO n WLF LP 08 C1 es s QLF oph X V N ilu ROD /27 (8 p J HK s D ocard % (2 881 S6 85 45 1 SM UOLF io 7 KLL 9 0 4 id 0/ 941 es s 258 5) 1 +( (1 í 08 p JS ) 76 61 41 4 47 )
Dechloromonas aromatica RCB (71907689)
Prochloroccus Cyanopodovirus Cyanomyovirus Synechococcus
$URPDWROHXPDURPDWLFXP(E1 $VSHUJLOOXVRU\]DH5,% Xanthobact er autotroph Candida icus Py2 (15 tus Kuene 4244785) Rhodop nia stuttg artiensis *OXFRQ seudomonas pa lustris ( *UDQ REDFWHUR[\G 816198 DQV 55) Nitro XOLEDFWHUE + Cand sococcus HWKHVGHQV ocea LV&* Glo idatus '1,+ ni AT Cya eobact Kuene CC 19 ni a st 707 ( Nos nothec er viola 7688 ceus uttgartie &D toc sp e sp 2118 (817 ns ) 1LW QGLGD . PC CCY 0901 is &D UDWLUX WXV. C 712 0110 1) RULE 0 (2 6X PLQ SWR 0141 $UF OIXUR LEDF UVS DFWH U WH 75 YHUV S RE YX U 6% DWLO 9) : ulfu DF PV PH LV( &D ROLQ rim WHUE S1 GLDWO í OOLQ DQ & P H on XW] % & DP S\ OODV as d OHUL & WLFXV DP S OR X í 7 5 e S\ \ORE EDF FFLQ nitri 0 %í OR DF WHU RJ fica ED WH F HQ HV ns D FWH UIH RQ UK WX FLVX SM RP VV V 12 X 5 LQ 1 ( E LV VS 78 $ 49 7& IHWX 72 V & 05 % ) $$ í í
7KLREDFLOOXVGHQLWULILFDQV$7&& HXP &KURPREDFWHULXPYLRODF AM eningitidis F Neisseria m 6$ GRVXV9& DFWHUQR KLOD6/ 'LFKHORE DKDORS '600814) RGRVSLU UXEHU +DORUK EDFWHU 196 (8241 8052 6DOLQL CC 25 ii NCIMB is AT ck iform ijerin í a mult ium be 1 ospir 862) 0 trid Nitros Clos 3+( & (90962 ODULV 8 FHOOX L$7& CC11 ) LQWUD U FDVH 99 RQLD LOOXV ivarius +