JVI Accepted Manuscript Posted Online 11 May 2016 J. Virol. doi:10.1128/JVI.00651-16 Copyright © 2016 Staheli et al. This is an open-access article distributed under the terms of the Creative Commons Attribution 4.0 International license.
Complete unique genome sequence, expression profile and salivary gland tissue tropism of the herpesvirus-7 homolog in pigtailed macaques
Jeannette P. Staheli(a), Michael R. Dyen(a), Gail H. Deutsch(b), Ryan S. Basom(c), Matthew P. Fitzgibbon(d), Patrick Lewis(a), Serge Barcy(a,e)#
Center for Global Infectious Disease Research, Seattle Children’s Research Institute, Seattle, Washington, USA(a); Pathology, Seattle Children’s Hospital, Seattle, Washington, USA(b); Genomics, Fred Hutchinson Cancer Research Center, Seattle, Washington, USA(c); Computational Biology, Fred Hutchinson Cancer Research Center, Seattle, Washington, USA(d); Department of Pediatrics, University of Washington, Seattle, Washington, USA(e)
Running title: MneHV7 genome, transcriptome and tropism
#Corresponding author: Serge Barcy, [email protected]
Abstract word count: 250, Text word count: 9,290
Abstract
Human herpesviruses HHV-6A, 6B and 7 are classified as roseoloviruses, and are highly prevalent in the human population. Roseolovirus reactivation in an immunocompromised host can cause severe pathologies. While the pathogenic potential of HHV-7 is unclear, it can reactivate HHV-6 from latency and thus contributes to severe pathological conditions associated with HHV-6. Because of the ubiquitous nature of roseoloviruses, their roles in such interactions and resulting pathological consequences have been difficult to study. Furthermore, the lack of a relevant animal model for HHV-7 infection has hindered a better understanding of its contribution to roseolovirus-associated diseases.
Using Next-Gen sequence analysis, we have characterized the unique genome of an uncultured novel pig-tailed macaque roseolovirus. Detailed genomic analysis revealed the presence of gene homologs to all 84 known HHV-7 ORFs. Phylogenetic analysis confirmed that the virus is a macaque homolog of HHV-7, which we have provisionally named MneHV7.
Using high-throughput RNA sequencing, we observed that the salivary gland tissue samples from nine different macaques had distinct MneHV7 gene expression patterns and that the overall number of viral transcripts correlated with viral loads in parotid tissue and saliva. Immunohistochemistry staining confirmed that, like HHV-7, MneHV7 exhibits natural tropism for salivary gland ductal cells. We also observed staining for MneHV7 in peripheral nerve ganglia present in salivary gland tissues, suggesting that HHV-7 may also have a tropism for the peripheral nervous system.
Our data demonstrate that MneHV7-infected macaques represent a relevant animal model that may help clarify the causality between roseolovirus reactivation and diseases.
Importance
Human herpesviruses HHV-6A, 6B and 7 are classified as roseoloviruses. We have recently discovered that pigtailed macaques are naturally infected with viral homologs of HHV-6 and HHV-7, which we provisionally named MneHV6 and MneHV7, respectively. In this study we confirm that MneHV7 is genetically and biologically similar to its human counterpart HHV-7. We determine the complete unique MneHV7 genome sequence and provide a comprehensive annotation of all genes. We also characterize viral transcription profiles in salivary glands from naturally infected macaques. We show that broad transcriptional activity across most of the viral genome is associated with high viral loads in infected parotids and that late viral protein expression is detected in salivary duct cells and peripheral nerve ganglia. Our study provides new insights into the natural behavior of an extremely prevalent virus and establishes a basis for subsequent investigations of the mechanisms that cause HHV-7 reactivation and associated disease.
Introduction
Human herpesvirus 7 (HHV-7) was first identified in 1990 as a lymphotropic virus belonging to the Roseolovirus genus within the Betaherpesvirinae subfamily (1). HHV-7 is one of the most prevalent viruses in the human population (2). Serological data suggest that primary infection occurs early in childhood and results in a lifelong infection. While primary HHV-7 infection can result in benign illness such as exanthem subitum (3), viral reactivation in immunosuppressed patients has been associated with severe pathological conditions (4, 5). Complications associated with virus reactivation span a wide range of diseases, including neurological pathologies (6).
When viral persistence following primary infection is established, the presence of detectable infectious virus particles in saliva is a common occurrence (7, 8). Both labial and submandibular salivary glands have been shown to be anatomical sites for HHV-7 persistence, replication and late viral protein expression (9). Within the gland tissue, virus seems to be predominantly localized in the ductal cuboidal and columnar cells. Thus, salivary glands are a likely source for infectious virus shed into saliva.
Complete genome sequences are available for all three human roseoloviruses, HHV-6A, HHV-6B, and HHV-7. For HHV-7, the genomes of three different strains, JI, RK and UCL-1, have been sequenced using viral DNA isolated from peripheral blood mononuclear cells (PBMC) (JI and RK) (1, 10) or saliva (UCL-1) (11, 12). Similar to other roseolovirus genomes, the genome structure of HHV-7 is composed of a central unique 133 kb long segment flanked by 10 kb long direct terminal repeat regions on each side. Viral genes in the unique segment are arranged in blocks of genes that are conserved among herpesviruses, conserved among betaherpesviruses, or unique to roseoloviruses.
The HHV-7 origin of lytic genome replication (oriLyt) is located upstream of the major DNA binding protein gene U41, similar to its location in other Betaherpesvirinae subfamily genomes. The oriLyt sequence contains two binding sites, OBP-1 and OBP-2 (13), which are both recognized by an origin binding protein (OBP) encoded by the viral ORF U73 (14). Interaction between OBP and the binding sites is required to initiate viral DNA replication. The HHV-7 unique region also contains two repetitive sequence motifs, R1 and R2 (15). The R1 segment is located between ORFs U86 and U89 and, in the RK strain, is composed of two 84 bp repeats plus two partial 67 bp repeats, with nucleotide sequences conserved between HHV-7 strains. The second segment, R2, is located downstream of R1 between ORFs U91 and U95 and is composed of 105 bp motifs repeated 16 times. The coding potential of the R1 and R2 repeat regions for either short proteins or non-coding RNAs is uncertain, and both regions have been shown to vary in size and sequence between different strains of HHV-7. The HHV-7 genome is predicted to encode 82 different proteins in the unique region, with open reading frames (ORFs) annotated U2 to U100. In addition, two protein-coding ORFs (DR1 and DR6) have been identified in each of the terminal repeat regions. All but one of the ORFs in the HHV-7 genome have counterparts among the other roseolovirus genomes. The exception is U55B, which is found in different HHV-7 strains but not in either HHV-6A or HHV-6B. Conversely, the HHV-7 genome lacks ORFs U22 and U94, which are present in both HHV-6 species. Some additional small expressed ORFs are present in the HHV-6B DR region but not in HHV-7.
Little is known about HHV-7 transcriptional activity in infected cells. One study reported that HHV-7 temporal gene expression is regulated by a cascade mechanism similar to HHV-6 transcription. Interestingly, transcripts from four viral genes that had been determined to be immediate early genes in that study could not be detected by RT-qPCR in peripheral blood mononuclear cells (PBMC) from healthy patients with detectable levels of HHV-7 DNA (16). This observation suggests that PBMC infection with HHV-7 may result in latency with very limited transcriptional activity.
Recently, we and others have reported the existence of HHV-7 homologs naturally infecting different non-human primate species including macaques (17) and great apes (18). All simian homologs of HHV-7 were discovered using consensus-degenerate hybrid oligonucleotide primers (CODEHOP) (19). This PCR-based approach led to the discovery of genomic segments comprising the glycoprotein B (gB) and the DNA polymerase genes. Since our study used DNA from saliva samples collected from pig-tailed macaques (Macaca nemestrina, Mne), we proposed to provisionally name the HHV-7 macaque homolog MneHV7. Phylogenetic analysis based on the available sequence segments confirmed the close evolutionary relationship between the newly identified simian roseoloviruses and HHV-7. We further characterized prevalence and tissue tropism of MneHV7 natural infection in pig-tailed macaques by using a newly developed qPCR assay specific for MneHV7. We detected MneHV7 at high levels in all saliva samples screened and observed the highest viral loads in salivary gland tissues. Finally, we showed a strong correlation between viral loads in saliva and salivary glands. Altogether, these results agree with previous observations made for HHV-7 in humans and indicate that salivary glands are likely a major site of MneHV7 replication and persistence.
Here we report the complete sequence of the unique segment of the MneHV7 genome as well as a portion of the 10 kb direct repeat (DR) region at the genomic termini. A comparison of the MneHV7 genome with the HHV-7 genome revealed a collinearity of gene structure and a strong gene sequence similarity, confirming that MneHV7 is an HHV-7 homolog naturally infecting pig-tailed macaques. We present a first annotation of the MneHV7 genome, based on similarity with the HHV-7 RK strain sequence. Genomic analysis of sequence elements crucial for roseolovirus biology suggests that MneHV7 relies on similar mechanisms to HHV-7 to control its life cycle. Using high-throughput RNA sequencing (RNA-seq), we detected high-level expression of a broad range of MneHV7 genes in salivary gland tissue from naturally infected macaques. In addition, using immunohistochemistry (IHC), we observed late viral protein expression in salivary gland tissues, which correlated with high levels of MneHV7 gene expression and viral loads in saliva. Our study revealed that MneHV7 exhibits numerous molecular and biological similarities with HHV-7, demonstrating that natural infection with MneHV7 is a useful animal model to study HHV-7 infection and associated pathologies.
MATERIALS AND METHODS
Pig-tailed macaque specimens
Saliva and salivary gland tissue samples from 12 pigtailed macaques were obtained from stored aliquots collected in previous studies [29]. All 12 animals had been part of a prior simian immunodeficiency virus (SIV) vaccine study. During saliva collection, the animals were kept under deep sedation with ketamine HCl at a dose of 10-15 mg/kg intramuscularly to alleviate any pain and discomfort. The animals were monitored by an Animal Technician or Veterinary Technologist while under sedation. Tissues collected at necropsy were either snap frozen or fixed in 10% neutral buffered formalin and embedded in paraffin. All animals in this study were housed and cared for according to the Guide for the Care and Use of Laboratory Animals at the Washington National Primate Research Center (WaNPRC), an Association for Assessment and Accreditation of Laboratory Animal Care International accredited institution. The experimental procedures were approved by the Institutional Animal Care and Use Committee (2370-20) at the University of Washington and conducted in compliance with the Public Health Services Policy on Humane Care and Use of Laboratory Animals (http://grants.nih.gov/grants/olaw/references/PHSPolicyLabAnimals.pdf).
DNA isolation from saliva and tissues and library preparation
DNA isolation from macaque saliva and tissue samples was performed as previously described [16] using proteinase K treatment at 56°C and a QIAamp DNA Mini Kit according to the manufacturer’s protocol. DNA template for Next-Generation sequencing was isolated from cell-free filtered saliva following a similar procedure but with an additional purification and concentration step using a DNA Clean and Concentrator spin column (Zymo Research). The sequencing library was prepared from sheared DNA using the TruSeq DNA Sample Prep Kit v2 (Illumina, Inc.). The library size distribution was validated using an Agilent 2200 TapeStation (Agilent Technologies).
MneHV7-specific TaqMan qPCR assay
The PCR primers and TaqMan probe for quantitative real-time PCR assays (qPCR) specifically targeting the MneHV7 DNA pol gene were described before [16]. Primer and probe sequences were as follows: MneHV7-pol probe (5’-[6FAM] ACTGGTGCAACACATAGCTTATTACCGT [BHQ1]-3’), MneHV7-pol-F (5’-GTGCAAAGACCCTACGTTAATTATG-3’), MneHV7-pol-R (5’-CTTGTTACCGAAGCAGCAATAG-3’). The thermocycling conditions were as follows: incubation at 94°C for 2.5 min, followed by 45 cycles of 94°C for 30 s, 60°C for 30 s and 72°C for 30 s, then cooling for 1 min at 37°C, all at a maximum ramp speed of 4.4°C/s. Standardized MneHV7 templates consisted of purified pJET plasmid vector containing the cloned MneHV7 PCR amplicon. Viral load was determined by comparing the cycle threshold (Ct) values obtained from the MneHV7 qPCR assay with the Ct values obtained from a single-copy cellular gene, oncostatin M (OSM), as previously described [16].
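As a rough illustration of the Ct comparison described above, the following sketch estimates viral genome copies per cell from paired MneHV7 and OSM Ct values. It is not the published calculation, which relied on plasmid standards: the function name, the assumption of equal and perfect amplification efficiency for both assays, the assumption of two OSM copies per diploid cell, and the example Ct values are all hypothetical.

```python
def viral_copies_per_cell(ct_virus, ct_osm, efficiency=2.0, osm_copies_per_cell=2):
    """Estimate viral genome copies per cell from paired Ct values.

    Assumes both assays amplify with the same per-cycle efficiency
    (2.0 means perfect doubling) and that the cellular reference gene
    is present at two copies per diploid cell. Both are illustrative
    assumptions, not values taken from the study.
    """
    # Each cycle of Ct difference corresponds to an `efficiency`-fold
    # difference in the amount of starting template.
    ratio_virus_to_osm = efficiency ** (ct_osm - ct_virus)
    return ratio_virus_to_osm * osm_copies_per_cell


if __name__ == "__main__":
    # Hypothetical Ct values for demonstration only.
    print(viral_copies_per_cell(ct_virus=24.0, ct_osm=26.0))
```

A lower viral Ct relative to the OSM Ct yields a higher estimate; in practice a standard curve from the plasmid templates would replace the fixed efficiency assumed here.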
Next-Gen sequencing and de-novo assembly
Library clustering was performed on a single flow cell lane using an Illumina cBot, and sequencing was subsequently performed on an Illumina HiSeq 2500 using a paired-end, 50 nt read length (PE50) strategy. Image analysis and base calling were performed using Illumina's Real Time Analysis (RTA) v1.17 software, followed by generation of FASTQ files using Illumina's CASAVA v1.8.2 software (http://www.illumina.com/software.ilmn). Because the sample was from saliva, we expected a predominance of short reads from host DNA and sought to suppress these before attempting de novo assembly. Since a Macaca nemestrina reference genome was not available, preliminary alignment to the Rhesus macaque genome (Ensembl MMUL 1.0) was used to filter out reads derived from the host. Approximately 215 million raw read pairs, all passing default quality filtering in the Illumina CASAVA 1.8.2 pipeline, were aligned to the Rhesus reference genome using BWA 0.6.1 [19]. This process classified nearly 83% of the input reads as aligning to, and therefore originating from, the host. Over 36 million read pairs that did not align to the Rhesus genome were assembled with Velvet 1.2.08 [20]. The resulting contigs were searched for loose matches to our previously determined MneHV7 DNA polymerase sequence [16]. A match was found, and properties of the matching contig were used to further optimize de novo assembly parameters and extend coverage. A single large contig of ~122 kb, a significant portion of the expected ~150 kb genome, was identified. BWA alignment of all raw Illumina reads against this candidate contig was then performed, resulting in a collection of 34,571 read pairs which were used to further refine the assembly. Subsequently, two additional short contigs of 6.8 kb and 2.4 kb were found, resulting in complete coverage of the expected length of the unique genome region (U) of MneHV7 and about one quarter (2.4 kb of an estimated 10 kb) of the end-terminal direct repeat regions.
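The host-depletion step described above can be sketched as a filter over the aligner's SAM output that keeps only read pairs in which neither mate aligned to the host reference. The manuscript does not state how the unaligned pairs were extracted, so this flag-based approach, the function name and the toy records are assumptions for illustration; only the two SAM flag bits (0x4 and 0x8) come from the SAM format itself.

```python
READ_UNMAPPED = 0x4  # SAM flag bit: this read did not align
MATE_UNMAPPED = 0x8  # SAM flag bit: the read's mate did not align


def unaligned_pairs(sam_lines):
    """Yield the name of each read pair in which neither mate aligned.

    `sam_lines` is any iterable of SAM-format text lines. Header lines
    (starting with '@') are skipped. Each pair name is yielded once.
    """
    seen = set()
    for line in sam_lines:
        if line.startswith("@"):
            continue
        fields = line.rstrip("\n").split("\t")
        name, flag = fields[0], int(fields[1])
        both_unmapped = (flag & READ_UNMAPPED) and (flag & MATE_UNMAPPED)
        if both_unmapped and name not in seen:
            seen.add(name)
            yield name


if __name__ == "__main__":
    # Toy records: pairA aligned to the host, pairB did not.
    toy = [
        "@HD\tVN:1.0",
        "pairA\t99\tchr1\t100\t60\t50M\t=\t300\t250\t*\t*",
        "pairA\t147\tchr1\t300\t60\t50M\t=\t100\t-250\t*\t*",
        "pairB\t77\t*\t0\t0\t*\t*\t0\t0\t*\t*",
        "pairB\t141\t*\t0\t0\t*\t*\t0\t0\t*\t*",
    ]
    print(list(unaligned_pairs(toy)))
```

Requiring both mates to be unaligned is the conservative choice for removing host material; a looser filter that keeps pairs with one unaligned mate would retain more reads at the cost of more host contamination.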
Long range sequencing
The R2 repeat region was resolved by long-range PCR and Sanger sequencing using the Takara LA PCR kit (ver. 2.1) following the manufacturer’s protocol (Clontech). PCR reactions were performed using the following primer combinations: R2_1F_IDT (5’-CTCTCTTCAGTTGATGCCATAGT-3’) and R2_1R_IDT (5’-TTGCGAATCCCTGTCATAGTT-3’); R2_3F_IDT (5’-CAGTTCCATAGACAGAGCTAACA-3’) and R2_3R_IDT (5’-GGTGTGGGTCGAATAACATTTG-3’); R2_4F_IDT (5’-CTGGATGGTTGTCCAACTACA-3’) and R2_4R_IDT (5’-AGGCTGATGAATACTGCTGAC-3’). The PCR buffer consisted of the LA PCR buffer II (10x, Mg) supplemented with 5% DMSO. Amplification conditions were as follows: one cycle of 94°C for 1 min, followed by 35 cycles of 98°C for 10 s, 58°C for 1 min and 68°C for 5 min, and one final cycle at 72°C for 10 min.
Gene annotations
The annotation of the MneHV7 genome was performed using the Genome Annotation Transfer Utility (GATU) provided at the Viral Bioinformatics Resource Center [25] and was fine-tuned using the Geneious software package v.6.1.6 (Biomatters). Open reading frames were verified by observing the presence and locations of start and stop codons as well as poly-adenylation signal sequences in comparative alignments of the candidate MneHV7 sequence with the HHV-7 RK strain sequence (NC_001716) using various aligners (Geneious aligner, ClustalW) that are part of the Geneious software package. Predicted intron-exon boundaries of spliced genes in the HHV-7 RK sequence were detected in the MneHV7 genome sequence by locating conserved splice acceptor and donor sites in sequence alignments.
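The start and stop codon check mentioned above can be illustrated with a minimal sketch. It is not the GATU or Geneious procedure: the function name is hypothetical, and the check only covers an unspliced candidate ORF given on its coding strand, using the standard stop codons.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}


def is_simple_orf(seq):
    """Return True if `seq` looks like a complete unspliced ORF.

    The sequence must be given 5'->3' on the coding strand, start with
    ATG, end with a stop codon, have a length divisible by three, and
    contain no in-frame stop codon before the final one.
    """
    seq = seq.upper()
    if len(seq) < 6 or len(seq) % 3 != 0:
        return False
    codons = [seq[i:i + 3] for i in range(0, len(seq), 3)]
    return (
        codons[0] == "ATG"
        and codons[-1] in STOP_CODONS
        and not any(codon in STOP_CODONS for codon in codons[1:-1])
    )


if __name__ == "__main__":
    # Toy sequences for demonstration only.
    print(is_simple_orf("ATGAAATAA"))
    print(is_simple_orf("ATGTAAAAATAA"))
```

Spliced genes, reverse-strand ORFs and poly-adenylation signals would need additional handling, which is where the alignment against the annotated HHV-7 RK reference does the real work.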
Comparative sequence and phylogenetic analysis
Sequence analysis of specific regions of the MneHV7 genome was performed by alignment with known beta-herpesvirus sequences using Muscle (for amino acid sequences) or the Geneious alignment algorithm (for nucleotide sequences) that are part of the Geneious software package (Biomatters). Phylogenetic analysis was performed using the Maximum Likelihood method (PhyML) with the JC69 substitution model [27] from Muscle or Geneious alignments of herpesvirus amino acid sequences. The number of substitution rate categories was set at 4, and the best of both NNI and SPR topology searches was combined into the final tree calculation. The percent likelihood of branching within the phylogenetic tree was calculated by using the approximate likelihood ratio test returning Chi2-based parametric branch supports [67]. The viral DNA polymerase sequences used to construct the phylogenetic tree are deposited in GenBank under these accession numbers: (HHV-5/hCMV) YP_081513, (MndHVbeta) AAG39066, (HHV-6A strain U1102) CAA58372, (HHV-6B strain Z29) AAD49652, (Ptro Roseolo1) AAW55996, (HHV-7 strain RK) AAC40752, (HHV-7 strain JI) AAC54700, (PtroHV7) AIN81106, (PpanHV7) AIN81107, (GgorHV7) AIM81109 and (MneHV7) KU351741.
Immunohistochemistry
Immunohistochemical staining was performed on paraffin-embedded sections after an endogenous peroxidase block with 3% H2O2 in methanol, antigen retrieval with citrate buffer, pH 6.0, a 5% normal serum block, and an avidin-biotin block (Vector Laboratories). The KR4 primary antibody (1:1000, #12179, NIH AIDS Reagent Program) was incubated for 2 hours at room temperature, followed by biotinylated anti-mouse antibody (1:500) and a streptavidin-biotin-peroxidase complex (both Vector Laboratories). Peroxidase activity was visualized using diaminobenzidine (Vector Laboratories). All histological stains were repeated on 3 separate sections. Negative controls consisted of sections incubated without primary antibody and with a matched isotype control. VirAarray HHV-7 control cell slides (VAH7-05, Bioworld Consulting Laboratories) were used as positive control.
RNA isolation from salivary gland tissues
Frozen salivary gland tissue specimens were minced with a sterile razor blade before addition of Trizol [68]. Following a 30 min incubation on ice, total RNA was isolated using a Direct-zol RNA mini prep column (Zymo Research) following the manufacturer’s protocol. RNA samples were further purified on “Clean and Concentrate” columns (Zymo Research) according to the manufacturer’s instructions. Total RNA integrity was checked using an Agilent 2200 TapeStation (Agilent Technologies), and RNA was quantified using a Trinean DropSense96 spectrophotometer (Caliper Life Sciences).
Viral transcriptome
RNA-seq libraries were prepared from total RNA using the TruSeq RNA Sample Prep Kit (Illumina, Inc.) and a Sciclone NGSx Workstation (PerkinElmer). Library size distributions were validated using an Agilent 2200 TapeStation (Agilent Technologies). Additional library QC, blending of pooled indexed libraries, and cluster optimization were performed using Life Technology’s Invitrogen Qubit® 2.0 Fluorometer (Life Technologies-Invitrogen). RNA-seq libraries were pooled (duplex) and clustered onto a flow cell lane. Sequencing was performed using an Illumina HiSeq 2500 in rapid mode employing a paired-end, 50 base pair read length (PE50) sequencing strategy. Image analysis and base calling were performed using Illumina's Real Time Analysis v1.18 software, followed by 'demultiplexing' of indexed reads and generation of FASTQ files using Illumina's bcl2fastq Conversion Software v1.8.4 (http://support.illumina.com/downloads/bcl2fastq_conversion_software_184.html). Reads of low quality were filtered out prior to alignment to the MneHV7 custom reference genome using TopHat v2.0.12. Counts were generated from TopHat alignments for each viral gene using the Python package HTSeq v0.6.1 (http://www-huber.embl.de/users/anders/HTSeq/doc/overview.html). Total viral read numbers were normalized by calculating viral reads per million unique mapped genomic reads (VPMM) [32]. Viral transcriptome expression profiles between salivary gland specimens were analyzed using hierarchical cluster analysis (CIMminer). The Manhattan distance matrix was computed based on the respective VPMM value calculated for each viral gene detected in each specimen. These values were used as input for hierarchical clustering using the complete linkage-clustering algorithm. Data sets representing the viral gene expression profiles were visualized by heat maps using color-coded Clustered Image Maps software (CIMminer) [34].
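The normalization and clustering steps described above can be sketched in a few lines of standard-library Python. This is not the CIMminer implementation: the function names and the toy profiles are hypothetical, and the naive clustering loop is only meant to show how VPMM values, Manhattan distances and complete linkage fit together.

```python
def vpmm(viral_reads, mapped_genomic_reads):
    """Viral reads per million unique mapped genomic reads."""
    return viral_reads * 1_000_000 / mapped_genomic_reads


def manhattan(a, b):
    """Sum of absolute differences between two equal-length profiles."""
    return sum(abs(x - y) for x, y in zip(a, b))


def complete_linkage(profiles):
    """Naive agglomerative clustering with complete linkage.

    `profiles` maps a specimen name to its list of per-gene VPMM values.
    Returns the merge history as (cluster_a, cluster_b, distance) tuples,
    where each cluster is a tuple of specimen names.
    """
    clusters = [(name,) for name in sorted(profiles)]
    merges = []
    while len(clusters) > 1:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                # Complete linkage: the distance between two clusters is
                # the largest pairwise distance between their members.
                d = max(manhattan(profiles[p], profiles[q])
                        for p in clusters[i] for q in clusters[j])
                if best is None or d < best[0]:
                    best = (d, i, j)
        d, i, j = best
        merges.append((clusters[i], clusters[j], d))
        merged = clusters[i] + clusters[j]
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)]
        clusters.append(merged)
    return merges


if __name__ == "__main__":
    # Toy per-gene VPMM profiles for three hypothetical specimens.
    toy = {"A": [0.0, 1.0], "B": [0.0, 2.0], "C": [10.0, 10.0]}
    for step in complete_linkage(toy):
        print(step)
```

Complete linkage tends to produce compact clusters because a specimen joins a group only when it is close to every member, which suits a comparison of whole expression profiles.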
Statistics
Statistical analyses were performed using GraphPad Prism version 6.01 for Windows (GraphPad Software). To compare the overall read numbers from viral transcripts with viral loads from whole saliva samples or salivary gland tissues, respectively, two-tailed Spearman nonparametric correlation coefficients (r2) and corresponding P-values were calculated using 95% confidence intervals.
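For readers unfamiliar with the statistic, a Spearman coefficient is the Pearson correlation computed on ranks. The sketch below shows that calculation in standard-library Python; it is not the GraphPad Prism implementation, the function names are hypothetical, and it does not compute P-values or confidence intervals.

```python
def average_ranks(values):
    """Rank values from 1 to n, giving tied values their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # mean of the 1-based positions i+1..j+1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def spearman(x, y):
    """Spearman rank correlation between two equal-length sequences."""
    rx, ry = average_ranks(x), average_ranks(y)
    n = len(x)
    mean_x, mean_y = sum(rx) / n, sum(ry) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(rx, ry))
    var_x = sum((a - mean_x) ** 2 for a in rx)
    var_y = sum((b - mean_y) ** 2 for b in ry)
    return cov / (var_x * var_y) ** 0.5


if __name__ == "__main__":
    # Hypothetical viral read counts and viral loads, for illustration.
    reads = [12, 150, 900, 4000]
    loads = [1e3, 5e4, 2e5, 9e6]
    print(spearman(reads, loads))
```

Because only ranks enter the calculation, the coefficient is insensitive to the several orders of magnitude that separate viral loads between specimens.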
Data deposition
The GenBank submission module version 1.3.2 within the Geneious software package version 6.1.6 was used to create a standard GenBank file for the MneHV7 genome and to submit the annotated sequence to GenBank (GenBank accession number: KU351741). The nucleotide and protein sequences analyzed in this report are all derived from this submitted annotated genome sequence. The complete RNA-seq data sets are available from the Gene Expression Omnibus (GEO) repository (accession number GSE79944).
Results
Selection of DNA sample for high-throughput sequencing
To confirm that the MneHV7 DNA sequence segments that we previously detected in pig-tailed macaque saliva by CODEHOP PCR indeed represented a novel macaque roseolovirus, we determined the genome sequence of MneHV7 by next-generation sequencing. We selected saliva as our source for viral DNA since we had previously determined that MneHV7 is highly prevalent in macaque saliva at consistently high levels (17). Thus, samples from our study for which an adequate volume of saliva was available were screened by MneHV7-specific qPCR to identify the samples with the highest viral loads. In order to enhance recovery of viral DNA, whole saliva samples were depleted of cellular material by centrifugation and filtration. Another source of contaminating genetic material when extracting DNA from saliva is bacterial DNA. Thus, DNA samples with strong PCR signals for bacterial 16S rRNA sequence were eliminated.
Using these criteria, we selected and pooled three saliva samples collected at different times from the same naturally infected pig-tailed macaque (TO5183). The final concentrated DNA sample contained 3.5 x 10^6 copies of the MneHV7 genome per 100 ng of purified DNA. This sample was subsequently used to prepare a paired-end library that was sequenced on one lane of an Illumina HiSeq 2500.
Determination of the MneHV7 genome sequence
Following high-throughput sequencing, a total of 215 million quality-filtered reads were mapped to the rhesus macaque genome in order to remove input reads originating from the macaque cellular genome. The remaining 36 million unmapped reads were then subjected to de-novo assembly using Velvet (20). The resulting contigs were searched for matches to the previously determined MneHV7 CODEHOP DNA polymerase amplicon sequence using blastn (21). A match was found, and properties of the matching contig were used to further optimize the de-novo assembly parameters to extend MneHV7 contig coverage. The divergence between the pig-tailed and rhesus macaque genomic sequences, in addition to the presence of thousands of reads from mostly uncharacterized microorganism genomes, made de-novo assembly of MneHV7 reads challenging. Ultimately, this process generated a large contig of ~122,000 bp covering a significant portion of the expected MneHV7 genome length (Fig 1A). Another smaller contig of ~6,800 bp with similarity to HHV-7 U95 and U100 genes was also identified and added to form a scaffold of the entire unique MneHV7 genome sequence with a mean read depth of 25. The gap between the two contigs, which consisted mostly of the R2 repeat region, was subsequently resolved by long-range PCR and Sanger sequencing. Sixteen additional very short unresolved sequences of 3-25 bp in length were mostly caused by the presence of single polynucleotide stretches and were resolved by mapping available overlapping reads to adjacent sequences using blastn in addition to PCR amplification and subsequent Sanger sequencing. This process resulted in a sequence representing the entire unique genome region of MneHV7. Our final 130,851 bp MneHV7 draft genome included the entire unique genome sequence of MneHV7 as well as short sections of both ends of the end-terminal direct repeats (right end of DR-L (262 bp) and left end of DR-R (87 bp)) where the final contig extended beyond the unique genome region (Fig 1A and B). Importantly, these partial end-terminal sequences had readily identified cleavage-packaging motifs (a Pac-1 motif on the left end and a Pac-2 motif on the right end of the unique genome region, respectively) flanked by several copies of the telomere-like TAACCC sequence motif that make up the telomeric repeat regions T1 and T2 in human roseolovirus genomes (Fig 1C) (22).
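Locating tandem copies of the TAACCC motif mentioned above is straightforward to sketch with a regular expression. This is an illustrative helper rather than the method used in the study: the function name, the `min_copies` parameter and the toy sequence are hypothetical, and the search only finds perfect repeats on the strand supplied.

```python
import re


def taaccc_runs(sequence, min_copies=2):
    """Return (start, copies) for each tandem run of TAACCC.

    `start` is the 0-based position of the run in `sequence`, and
    `copies` is the number of consecutive perfect TAACCC units.
    Runs shorter than `min_copies` units are ignored.
    """
    pattern = re.compile(r"(?:TAACCC){%d,}" % min_copies)
    return [
        (match.start(), len(match.group()) // 6)
        for match in pattern.finditer(sequence.upper())
    ]


if __name__ == "__main__":
    # Toy sequence for demonstration only.
    print(taaccc_runs("GGTAACCCTAACCCTAACCCAT"))
```

Imperfect repeats and the reverse-complement motif (GGGTTA) would need a separate pass, which matters when scanning both genome termini.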
All roseoloviruses have a typical genome structure composed of a unique genome region (U) flanked by identical end-terminal direct repeat (DR) regions of about 10 kb in length (DR-left – U – DR-right) oriented in tandem (15, 23, 24). HHV-7 DR regions contain open reading frames coding for the DR1 and DR6 genes. Further analysis of the contigs obtained following de-novo assembly led to the identification of an additional short contig of 2.4 kb, with similarity to HHV-7 end-terminal repeat elements (Fig 1A and B). This contig contained the sequence for the entire DR1 gene and the first exon as well as most of the intron of the DR6 gene for MneHV7 (Fig 1B). The complete DR6 sequence and the number of telomeric repeats present in the T1 and T2 regions at each end of the MneHV7 genome remain to be determined.
For the final MneHV7 genome assembly, all available sequences belonging to the DR end-terminal region were duplicated at each end of the MneHV7 genome based on alignments with the HHV-7 genome. The resulting MneHV7 genome sequence is available in GenBank under the accession number KU351741.
MneHV7 genome annotation
Annotation of the MneHV7 genome sequence was performed using the Genome Annotation Transfer Utility (GATU) (25) with the HHV-7 RK strain sequence as reference genome. Open reading frames with similarity to all 84 known HHV-7 ORFs were identified spanning the entire genomic region between the U2 and U100 genes (Fig 2). The MneHV7 ORFs identified, including their direction, size and percentage of amino acid identity to corresponding HHV-7 ORFs, are listed in Table 1. Similar to HHV-7, the MneHV7 sequence contained a homolog of the HHV-7-specific U55B gene, whereas it lacked sequences similar to the HHV-6 unique genes U22, U83 and U94. These observations confirmed the previously proposed status of MneHV7 as a new HHV-7 viral homolog.
The positions of ATG start codons, stop codons, and similarity with the HHV-7 (RK strain) annotation were used to predict open reading frames. The presence and locations of TATA box consensus sequences as well as consensus poly-adenylation signal sequences were identified. MneHV7 proteins with the highest percentage of similarity to HHV-7 (>90%) were U27 (DNA polymerase processivity factor), U35 (DNA packaging), U41 (single stranded DNA binding protein), U56 (triplex capsid protein), U57 (major capsid protein), U72 (envelope glycoprotein M), and U77 (helicase/primase complex protein). Twenty-seven additional proteins were 80-90% similar, 18 proteins were 70-80% similar, and 38 proteins shared