Complete unique genome sequence, expression profile and salivary ...

5 downloads 12204 Views 7MB Size Report
May 11, 2016 - model for HHV-7 infection has hindered a better understanding of its ..... alignments for each viral gene using the Python package HTSeq v0.6.1 ...
JVI Accepted Manuscript Posted Online 11 May 2016 J. Virol. doi:10.1128/JVI.00651-16 Copyright © 2016 Staheli et al. This is an open-access article distributed under the terms of the Creative Commons Attribution 4.0 International license.

1

Complete unique genome sequence, expression profile and salivary gland tissue

2

tropism of the herpesvirus-7 homolog in pigtailed macaques

3 4

Jeannette P. Stahelia, Michael R. Dyena, Gail H. Deutschb, Ryan S. Basomc, Matthew P.

5

Fitzgibbond, Patrick Lewisa, Serge Barcya,e#

6

Center for Global Infectious Disease Research, Seattle Children’s Research Institute,

7

Seattle, Washington, USAa, Pathology, Seattle Children’s Hospital , Seattle ,

8

Washington, USAb, Genomics, Fred Hutchinson Cancer Research Center, Seattle,

9

Washington, USAc, Computational biology, Fred Hutchinson Cancer Research Center,

10

Seattle, Washington, USAd, Department of Pediatrics, University of Washington,

11

Seattle , Washington, USAe

12 13

Running title: MneHV7 genome, transcriptome and tropism

14 15

#

Corresponding author Serge Barcy, [email protected]

16 17

Abstract word count 250, Text word count: 9,290

18 19

1

20

Abstract

21

Human herpesviruses HHV-6A, 6B and 7 are classified as roseoloviruses, and are

22

highly prevalent in the human population. Roseolovirus reactivation in an

23

immunocompromised host can cause severe pathologies. While the pathogenic

24

potential of HHV-7 is unclear, it can reactivate HHV-6 from latency and thus contributes

25

to severe pathological conditions associated with HHV-6. Because of the ubiquitous

26

nature of roseoloviruses, their roles in such interactions and resulting pathological

27

consequences have been difficult to study. Furthermore, the lack of a relevant animal

28

model for HHV-7 infection has hindered a better understanding of its contribution to

29

roseolovirus associated diseases.

30

Using Next-Gen sequence analysis, we have characterized the unique genome of an

31

uncultured novel pig-tailed macaque roseolovirus. Detailed genomic analysis revealed

32

the presence of gene homologs to all 84 known HHV-7 ORFs. Phylogenetic analysis

33

confirmed that the virus is a macaque homolog of HHV-7, which we have provisionally

34

named MneHV7.

35

Using high-throughput RNA sequencing, we observed that the salivary gland tissue

36

samples from nine different macaques had distinct MneHV7 gene expression patterns

37

and that the overall number of viral transcripts correlated with viral loads in parotide

38

tissue and saliva. Immunohistochemistry staining confirmed that like HHV-7, MneHV7

39

exhibits natural tropism for salivary gland ductal cells. We also observed staining for

40

MneHV7 in peripheral nerve ganglia present in salivary gland tissues suggesting that

41

HHV-7 may also have a tropism for the peripheral nervous system.

2

42

Our data demonstrate that MneHV7-infected macaques represent a relevant animal

43

model that may help clarify the causality between roseolovirus reactivation and

44

diseases.

45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 3

60

Importance

61

Human herpesviruses HHV-6A, 6B and 7 are classified as Roseoloviruses. We have

62

recently discovered that pigtailed macaques are naturally infected with viral homologs of

63

HHV-6 and HHV-7 which we provisionally named MneHV6 and MneHV7, respectively.

64

In this study we confirm that MneHV7 is genetically and biologically similar to its human

65

counterpart HHV-7. We determine the complete unique MneHV7 genome sequence

66

and provide a comprehensive annotation of all genes. We also characterize viral

67

transcription profiles in salivary glands from naturally infected macaques. We show that

68

broad transcriptional activity across most of the viral genome is associated with high

69

viral loads in infected parotids and that late viral protein expression is detected in

70

salivary duct cells and peripheral nerve ganglia. Our study provides new insights into

71

the natural behavior of an extremely prevalent virus and establishes a basis for

72

subsequent investigations of the mechanisms that cause HHV-7 reactivation and

73

associated disease.

74 75 76 77 78 79 80 4

81

Introduction

82

Human herpesvirus 7 (HHV-7) was first identified in 1990 as a lymphotropic virus

83

belonging to the Roseolovirus genus within the Betaherpesvirinae subfamily (1). HHV-7

84

is one of the most prevalent viruses in the human population (2). Serological data

85

suggest that primary infection occurs early in childhood and results in a lifelong

86

infection. While primary HHV-7 infection can result in benign illness like exanthem

87

subitum (3), viral reactivation in immuno-suppressed patients has been associated with

88

severe pathological conditions (4, 5). Complications associated with virus reactivation

89

include a wide range of diseases including neurological pathologies (6).

90

When viral persistence following primary infection is established, the presence of

91

detectable infectious virus particles in saliva is a common occurrence (7, 8). Both, labial

92

and submandibular salivary glands have been shown to be anatomical sites for HHV-7

93

persistence, replication and late viral protein expression (9). Within the gland tissue,

94

virus seems to be predominantly localized in the ductal cuboidal and columnar cells.

95

Thus, salivary glands are a likely source for infectious virus shed into saliva.

96

Complete genomes sequences are available for all three human roseoloviruses,

97

HHV-6A, HHV-6B, and HHV-7. For HHV-7, the genomes of three different strains, JI,

98

RK and UCL-1, have been sequenced using viral DNA isolated from peripheral blood

99

mononuclear cells (PBMC) (JI and RK ) (1, 10) or saliva (UCL-1) (11, 12). Similar to

100

other roseolovirus genomes, the genome structure of HHV-7 is composed of a central

101

unique 133 kb long segment flanked by 10 kb long direct terminal repeat regions on

102

each side. Viral genes in the unique segment are arranged in blocks of genes

5

103

conserved among herpesviruses and betaherpesviruses or are unique to

104

roseoloviruses.

105

The HHV-7 origin of lytic genome replication (oriLyt) is located upstream of the

106

major DNA binding protein gene U41, similar to its location in other betaherpesvirinae

107

subfamily genomes. The oriLyt sequence contains two binding sites, OBP-1 and OBP-2

108

(13), which are both recognized by an origin binding protein (OBP) encoded by the viral

109

ORF U73 (14). Interaction between OBP and the binding sites is required to initiate viral

110

DNA replication. The HHV-7 unique region also contains two repetitive sequence motifs,

111

R1 and R2 (15). The R1 segment is located between ORFs U86 and U89 and, in the

112

RK strain, is composed of two 84 bp repeats plus two partial 67 bp repeats, with

113

nucleotide sequences conserved between HHV-7 strains. The second segment, R2, is

114

located downstream of R1 between ORFs U91 and U95 and is composed of 105 bp

115

motifs repeated 16 times. The coding potential of the R1 and R2 repeat regions for

116

either short proteins or non-coding RNAs is uncertain, and both regions have been

117

shown to vary in size and sequence between different strains of HHV-7. The HHV-7

118

genome is predicted to encode 82 different proteins in the unique region with open

119

reading frames (ORF) annotated U2 to U100. In addition, two protein-coding ORFs

120

(DR1 and DR6) have been identified in each of the terminal repeat regions. All but one

121

of the ORFs in the HHV-7 genome have counterparts among the other roseolovirus

122

genomes. The exception is U55B, which is only found in different HHV-7 strains and not

123

in either HHV-6A or HHV-6B. Conversely, the HHV-7 genome lacks ORFs U22 and

124

U94, which are present in both HHV-6 species. Some additional small expressed ORFs

125

are present in the HHV-6B DR region but not in HHV-7. 6

126

Little is known about HHV-7 transcriptional activity in infected cells. One study

127

reported that HHV-7 temporal gene expression is regulated by a cascade mechanism

128

similar to HHV-6 transcription. Interestingly, transcripts from four viral genes that had

129

been determined to be immediate early genes in that study, could not be detected by

130

RT-qPCR in peripheral blood mononuclear cells (PBMC) from healthy patients with

131

detectable levels of HHV-7 DNA (16). This observation suggests that PBMC infection

132

with HHV-7 may result in latency with very limited transcriptional activity.

133

Recently, we and others have reported the existence of HHV-7 homologs

134

naturally infecting different non-human primate species including macaques (17) and

135

great apes (18). All simian homologs of HHV-7 were discovered using consensus-

136

degenerate hybrid oligonucleotide primers (CODEHOP) (19). This PCR-based

137

approach led to the discovery of genomic segments comprised of the glycoprotein B

138

(gB) and the DNA polymerase genes. Since our study used DNA from saliva samples

139

collected from pig-tailed macaques (Macaca nemestrina, Mne), we proposed to

140

provisionally name the HHV-7 macaque homolog MneHV7. Phylogenetic analysis

141

based on the available sequence segments confirmed the close evolutionary

142

relationship between the newly identified simian roseoloviruses and HHV-7. We further

143

characterized prevalence and tissue tropism of MneHV7 natural infection in pig-tailed

144

macaques by using a newly developed qPCR assay specific for MneHV7. We detected

145

MneHV7 at high levels in all saliva samples screened and observed the highest viral

146

loads in salivary gland tissues. Finally, we showed a strong correlation between viral

147

loads in saliva and salivary glands. Altogether these results are in agreement with

7

148

previous observations made for HHV-7 in humans that salivary glands are likely a major

149

site of MneHV7 replication and persistence.

150

Here we report the complete sequence of the unique segment of the MneHV7

151

genome as well as a portion of the 10KB direct repeat (DR) region at the genomic

152

termini. A comparison of the MneHV7 genome with the HHV7 genome revealed a

153

collinearity of gene structure and a strong gene sequence similarity, confirming that

154

MneHV7 is an HHV-7 homolog naturally infecting pig-tailed macaques. We present a

155

first annotation of the MneHV7 genome, based on similarity with the HHV-7 RK strain

156

sequence. Genomic analysis of sequence elements crucial for roseolovirus biology

157

suggests that MneHV7 relies on similar mechanisms to HHV-7 to control its life cycle.

158

Using high-throughput RNA sequencing (RNA seq), we detected high level expression

159

of a broad range of MneHV7 genes in salivary gland tissue from naturally infected

160

macaques. In addition, using immunohistochemistry (IHC), we observed late viral

161

protein expression in salivary gland tissues, which correlated with high levels of

162

MneHV7 gene expression and viral loads in saliva. Our study revealed that MneHV7

163

exhibited numerous molecular and biological similarities with HHV-7 demonstrating that

164

natural infection with MneHV7 is a useful animal model to study HHV-7 infection and

165

associated pathologies.

166 167

MATERIALS AND METHODS

168

Pig-tailed macaque specimens

8

169

Saliva and salivary gland tissue samples from 12 pigtailed macaques were obtained

170

from stored aliquots collected in previous studies [29]. All 12 animals had been part of a

171

prior simian immunodeficiency virus (SIV) vaccine study. During saliva collection, the

172

animals were kept under deep sedation with ketamine HCl at a dose of 10-15 mg/kg

173

intramuscularly to alleviate any pain and discomfort. The animals were monitored by an

174

Animal Technician or Veterinary Technologist while under sedation. Tissues collected at

175

necropsy were either snap frozen or fixed in 10% neutral buffered formalin and

176

embedded in paraffin. All animals in this study were housed and cared for according to

177

the Guide for the Care and Use of Laboratory Animals at the Washington National

178

Primate Research Center (WaNPRC), an Association for Assessment and Accreditation

179

of Laboratory Animal Care International accredited institution. The experimental

180

procedures were approved by the Institutional Animal Care and Use Committee (2370-

181

20) at the University of Washington and conducted in compliance with the Public Health

182

Services Policy on Humane Care and Use of Laboratory Animals

183

(http://grants.nih.gov/grants/olaw/references/ PHSPolicyLabAnimals.pdf).

184

DNA isolation from saliva and tissues and library preparation

185

DNA isolation from macaque saliva and tissue samples was performed as previously

186

described [16] using proteinase K treatment at 56Co and a QIAmp DNA Mini Kit

187

according to the manufacturer’s protocol. DNA template for Next-Generation

188

sequencing was isolated from cell-free filtered saliva following a similar procedure but

189

with an additional purification and concentration step using a DNA clean and

190

concentrator spin column (Zymo Research). The sequencing library was prepared from

191

sheared DNA using the TruSeq DNA Sample Prep Kit v2 (Illumina, Inc.). The library 9

192

size distribution was validated using an Agilent 2200 TapeStation (Agilent

193

Technologies).

194

MneHV7-specific TaqMan qPCR assay

195

The PCR primers and TaqMan probe for quantitative real-time PCR assays (qPCR)

196

specifically targeting the MneHV7 DNA pol gene were described before [16]. Primer and

197

probe sequences were as follows: MneHV7-pol probe (5’-[6FAM]

198

ACTGGTGCAACACATAGCT TATTACCGT [BHQ1] -3’), MneHV7-pol –F (5’-

199

GTGCAAAGACCCTACGTTAATTATG-3’), MneHV7-pol-R (5’-

200

CTTGTTACCGAAGCAGCAATAG-3’). The thermocycling conditions were as follows:

201

incubation at 94°C for 2.5 min, followed by 45 cycles of 94°C for 30s, 60°C for 30s and

202

72°C for 30s, then cooled for 1 min at 37°C, all at maximum ramp speed of 4.4°C/s.

203

Standardized MneHV7 templates consisted of purified plasmid pJET vector containing

204

the cloned MneHV7 PCR amplicon. Viral load was determined by comparing the cycle

205

threshold (Ct) values obtained from the MneHV7 qPCR assay with the Ct values

206

obtained from a single copy cellular gene, oncostatin M (OSM) as previously described

207

[16].

208

Next-Gen sequencing and de-novo assembly

209

Library clustering was performed on a single flow cell lane using an Illumina cBot and

210

sequencing subsequently occurred on an Illumina HiSeq 2500 using a paired-end, 50 nt

211

read length (PE50) strategy. Image analysis and base calling was performed using

212

Illumina's Real Time Analysis (RTA) v1.17 software, followed by generation of FASTQ

213

files, using Illumina's CASAVA v1.8.2 software (http://www.illumina.com/software.ilmn). 10

214

Because the sample was from saliva, we expected a predominance of short reads from

215

host DNA and sought to suppress these before attempting de novo assembly. Since a

216

Macaca nemestrina reference genome was not available, preliminary alignment to the

217

Rhesus macaque genome (Ensembl MMUL 1.0) was used to filter out reads derived

218

from the host. Approximately 215 million raw read pairs, all passing default quality

219

filtering in the Illumina CASAVA 1.8.2 pipeline, were aligned to the Rhesus reference

220

genome using BWA 0.6.1 [19]. This process classified nearly 83% of the input reads as

221

aligning to and therefore originating from the host. Over 36 million read pairs that did not

222

align to the Rhesus genome were assembled with Velvet 1.2.08 [20]. The resulting

223

contigs were searched for loose matches to our previously determined MneHV7 DNA

224

polymerase sequence [16]. A match was found, and properties of the matching contig

225

were used to further optimize de novo assembly parameters and extend coverage. A

226

single large contig of ~122KB, a significant portion of the expected ~150KB genome,

227

was identified. BWA alignment of all raw Illumina reads against this candidate contig

228

was then performed, resulting in a collection of 34,571 read pairs which were used to

229

further refine the assembly. Consecutively, two additional short contigs of 6.8KB and

230

2.4KB were found, resulting in complete coverage of the expected length of the unique

231

genome region (U) of MneHV7 and about one quarter (2.4KB of an estimated 10KB) of

232

the end-terminal direct repeat regions.

233

Long range sequencing

234

The R2 repeat region was resolved by long-range PCR and Sanger sequencing using

235

the Takara LA PCR kit (ver. 2.1) following the manufacturer’s protocol (Clontech). PCR

236

reactions were performed using the following primer combinations: R2_1F_IDT (5’11

237

CTCTCTTCAGTT GATGCCATAGT-3’) and R2_1R_IDT (5’-

238

TTGCGAATCCCTGTCATAGTT-3’); R2_3F_IDT (5’-

239

CAGTTCCATAGACAGAGCTAACA-3’) R2_3R_IDT (5’-

240

GGTGTGGGTCGAATAACATTTG-3’); R2_4F_IDT (5’-

241

CTGGATGGTTGTCCAACTACA-3’) and R2_4R_IDT (5’-

242

AGGCTGATGAATACTGCTGAC-3’). The PCR buffer consisted of the LA PCR buffer II

243

(10x, Mg) supplemented with 5% DMSO. Amplification conditions were as follows: one

244

cycle of 94°C for 1 min., followed by 35 cycles of 98°C for 10 sec, 58°C for 1 min and

245

68°C for 5 min and one final cycle at 72°C for 10 min.

246

Gene annotations

247

The annotation of the MneHV7 genome was performed using the Genome

248

Annotation Transfer Utility (GATU) provided at the Viral Bioinformatics Resource Center

249

[25] and was fine-tuned using Geneious software package v.6.1.6 (Biomatters). Open

250

reading frames were verified by observing the presence and locations of start and stop

251

codons as well as poly-adenylation signal sequences in comparative alignments of the

252

candidate MneHV7 sequence with the HHV7 RK strain sequence (NC_001716) using

253

various aligners (Geneious aligner, ClustalW) that are part of the Geneious software

254

package. Predicted intron-exon boundaries of spliced genes in the HHV7 RK sequence

255

were detected in the MneHV7 genome sequence by locating conserved splice acceptor

256

and donor sites in sequence alignment.

257

Comparative sequence and phylogenetic analysis

12

258

Sequence analysis of specific regions of the MneHV7 genome was performed by

259

alignment with known beta-herpesvirus sequences using Muscle (for amino acid

260

sequences) or the Geneious alignment algorithm (for nucleotide sequences) that are

261

part of the Geneious software package (Biomatters). Phylogenetic analysis was

262

performed using the Maximum Likelihood method (PhyML) with the JC69 substitution

263

model [27] from Muscle or Geneious alignments of herpesvirus amino acid sequences.

264

The number of substitution rate categories was set at 4, and the best of both NNI and

265

SPR topology searches was combined into the final tree calculation. The percent

266

likelihood of branching within the phylogenetic tree was calculated by using the

267

approximate likelihood ratio test returning Chi2-based parametric branch supports [67].

268

The viral DNA polymerase sequences used to construct the phylogenetic tree are

269

deposited in GenBank under these accession numbers: (HHV-5/hCMV) YP_081513,

270

(MndHVbeta) AAG39066, (HHV-6A strain U1102) CAA58372, (HHV-6B strain Z29)

271

AAD49652, (Ptro Roseolo1) AAW55996, (HHV-7 strain RK) AAC40752, (HHV-7 strain

272

JI) AAC54700, (PtroHV7) AIN81106, (PpanHV7) AIN81107,(GgorHV7) AIM81109 and

273

(MneHV7) KU351741.

274

Immunohistochemistry

275

Immunohistochemical staining was performed on paraffin-embedded sections after an

276

endogenous peroxidase block with 3% H2O2 in methanol, antigen retrieval with citrate

277

buffer, pH 6.0, a 5% normal serum block, and an avidin-biotin block (Vector

278

Laboratories). The KR4 primary antibody (1:1000, #12179, NIH AIDS Reagent

279

Program) was incubated for 2 hours at room temperature, followed by biotinylated anti-

13

280

mouse antibody (1:500) and a streptavidin-biotin-peroxidase complex (both Vector

281

Laboratories). Peroxidase activity was visualized using diaminobenzidine (Vector

282

Laboratories). All histological stains were repeated on 3 separate sections. Negative

283

controls consisted of sections incubated without primary antibody and matched isotype

284

control. VirAarray HHV-7 control cell slides (VAH7-05, Bioworld Consulting

285

Laboratories) were used as positive control.

286

RNA isolation from salivary gland tissues

287

Frozen salivary gland tissue specimens were minced with a sterile razor blade before

288

addition of Trizol [68]. Following a 30 min incubation on ice, total RNA was isolated

289

using a Direct-zol RNA mini prep column (Zymor research) following the manufacturer’s

290

protocol. RNA samples were further purified on “Clean and Concentrate” columns

291

(Zymo research) according to the manufacturer’s instructions. Total RNA integrity was

292

checked using an Agilent 2200 TapeStation (Agilent Technologies) and quantified using

293

a Trinean DropSense96 spectrophotometer (Caliper Life Sciences).

294

Viral transcriptome

295

RNA-seq libraries were prepared from total RNA using the TruSeq RNA Sample Prep

296

Kit (Illumina, Inc.) and a Sciclone NGSx Workstation (PerkinElmer). Library size

297

distributions were validated using an Agilent 2200 TapeStation (Agilent Technologies).

298

Additional library QC, blending of pooled indexed libraries, and cluster optimization was

299

performed using Life Technology’s Invitrogen Qubit® 2.0 Fluorometer (Life

300

Technologies-Invitrogen). RNA-seq libraries were pooled (duplex) and clustered onto a

301

flow cell lane. Sequencing was performed using an Illumina HiSeq 2500 in rapid mode 14

302

employing a paired-end, 50 base pair read length (PE50) sequencing strategy. Image

303

analysis and base calling was performed using Illumina's Real Time Analysis v1.18

304

software, followed by 'demultiplexing' of indexed reads and generation of FASTQ files,

305

using Illumina's bcl2fastq Conversion Software v1.8.4

306

(http://support.illumina.com/downloads /bcl2fastq_conversion_software_184.html).

307

Reads of low quality were filtered out prior to alignment to the MneHV7 custom

308

reference genome using TopHat v2.0.12. Counts were generated from TopHat

309

alignments for each viral gene using the Python package HTSeq v0.6.1 ( http://www-

310

huber.embl.de/users/anders/ HTSeq/doc/overview.html). Total viral read numbers were

311

normalized by calculating viral reads per million unique mapped genomic reads (VPMM)

312

[32]. Viral transcriptome expression profiles between salivary gland specimens were

313

analyzed using hierarchical cluster analysis (CIMminer). The Manhattan distance matrix

314

was computed based on the respective VPMM value calculated for each viral gene

315

detected in each specimen. These values were used as input for hierarchical clustering

316

using the complete linkage-clustering algorithm. Data sets representing the viral gene

317

expression profiles were visualized by heat maps using color-coded Clustered Image

318

Maps software (CIMminer) [34].

319

Statistics

320

Statistics was performed using Graphpad Prism version 6.01 for Windows (GraphPad

321

Software). To compare the overall read numbers from viral transcripts with viral loads

322

from whole saliva samples or salivary gland tissues respectively, two-tailed Spearman

15

323

nonparametric correlation coefficients (r2) and corresponding P-values were calculated

324

using 95% confidence intervals.

325

Data deposition

326

The GenBank submission module version 1.3.2 within the Geneious software package

327

version 6.1.6 was used to create a standard GenBank file for the MneHV7 genome and

328

to submit the annotated sequence to GenBank (GenBank accession number:

329

KU351741). The nucleotide and protein sequences analyzed in this report are all

330

derived from this submitted annotated genome sequence. The complete RNAseq data

331

sets are available from the Gene Expression Omnibus (GEO) repository (accession

332

number GSE79944).

333 334

Results

335

Selection of DNA sample for high-throughput sequencing

336

To confirm that the MneHV7 DNA sequence segments that we previously

337

detected in pig-tailed macaque saliva by CODEHOP PCR represented indeed a novel

338

macaque roseolovirus, we determined the genome sequence of MneHV7 by next-

339

generation sequencing. We selected saliva as our source for viral DNA since we had

340

previously determined that MneHV7 is highly prevalent in macaque saliva at

341

consistently high levels (17). Thus, samples from our study for which an adequate

342

volume of saliva was available were screened by MneHV7-specific qPCR to identify the

343

samples with the highest viral loads. In order to enhance recovery of viral DNA, whole

344

saliva samples were depleted of cellular material by centrifugation and filtration. Another 16

345

source of genetic material contamination when extracting DNA from saliva is bacterial

346

DNA. Thus, DNA samples with strong PCR signals for bacterial 16S rRNA sequence

347

were eliminated.

348

Using these criteria, we selected and pooled three saliva samples collected at

349

different times from the same naturally infected pig-tailed macaque (TO5183). The final

350

concentrated DNA sample contained 3.5x106 copies of the MneHV7 genome per 100

351

ng of purified DNA. This sample was subsequently used to prepare a paired-end library

352

that was sequenced on one lane of an Illumina HighSeq 2500.

353 354 355

Determination of the MneHV7 genome sequence Following high-throughput sequencing, a total of 215 million quality-filtered reads

356

were mapped to the rhesus macaque genome in order to remove input reads originating

357

from the macaque cellular genome. The remaining 36 million unmapped reads were

358

then subjected to de-novo assembly using Velvet (20). The resulting contigs were

359

searched for matches to the previously determined MneHV7 CODEHOP DNA

360

polymerase amplicon sequence using blastn (21). A match was found and properties of

361

the matching contig were used to further optimize the de-novo assembly parameters to

362

extend MneHV7 contig coverage. The divergence between the pig-tailed and rhesus

363

macaque genomic sequences in addition to the presence of thousands of reads from

364

mostly uncharacterized microorganism genomes, made de-novo assembly of MneHV7

365

reads challenging. Ultimately, this process generated a large contig of ~122,000 bp

366

covering a significant portion of the expected MneHV7 genome length (Fig 1A). Another

367

smaller contig of ~6,800 bp with similarity to HHV-7 U95 and U100 genes was also

17

368

identified and added to form a scaffold of the entire unique MneHV7 genome sequence

369

with a mean read depth of 25 nucleotides. The gap between the two contigs which

370

consisted mostly of the R2 repeat region was subsequently resolved by long-range PCR

371

and Sanger sequencing. Sixteen additional very short unresolved sequences of 3-25 bp

372

in length were mostly caused by the presence of single polynucleotide stretches and

373

were resolved by mapping available overlapping reads to adjacent sequences using

374

blastn in addition to PCR amplification and subsequent Sanger sequencing. This

375

process resulted in a sequence representing the entire unique genome region of

376

MneHV7. Our final 130,851 bp MneHV7 draft genome included the entire unique

377

genome sequence of MneHV7 as well as short sections of both ends of the end-

378

terminal direct repeats (right end of DR-L (262 bp) and left end of DR-R (87 bp)) where

379

the final contig extended beyond the unique genome region (Fig 1A and B). Importantly,

380

these partial end-terminal sequences had readily identified cleavage-packaging motifs

381

(a Pac-1 motif on the left end, a Pac-2 motif on the right end of the unique genome

382

region respectively) flanked by several copies of the telomere-like TAACCC sequence

383

motif that make up the telomeric repeat regions T1 and T2 in human roseolovirus

384

genomes (Fig 1C) (22).

385

All roseoloviruses have a typical genome structure composed of a unique

386

genome region (U) flanked by identical end-terminal direct repeat (DR) regions of about

387

10 kb in length (DR-left – U – DR-right) oriented in tandem (15, 23, 24). HHV-7 DR

388

regions contain open reading frames coding for the DR1 and DR6 genes. Further

389

analysis of the contigs obtained following de-novo assembly led to the identification of

390

an additional short contig of 2.4 kb, with similarity to HHV-7 end-terminal repeat

18

391

elements (Fig 1A and B). This contig contained the sequence for the entire DR1 gene

392

and the first exon as well as most of the intron of the DR6 gene for MneHV7 (Fig 1B).

393

The complete DR6 sequence and the number of telomeric repeats present in the T1

394

and T2 regions at each end of the MneHV7 genome remain to be determined.

395

For the final MneHV7 genome assembly, all available sequences belonging to

396

the DR end terminal region, were duplicated at each end of the MneHV7 genome based

397

on alignments with the HHV-7 genome. The resulting MneHV7 genome sequence is

398

available in GenBank under the accession number KU351741.

399 400 401

MneHV7 genome annotation Annotation of the MneHV7 genome sequence was performed using the Genome

402

Annotation Transfer Utility (GATU) (25) with the HHV-7 RK strain sequence as

403

reference genome. Open reading frames with similarity to all 84 known HHV-7 ORFs

404

were identified spanning the entire genomic region between the U2 and U100 genes

405

(Fig 2). The MneHV7 ORFs identified, including their direction, size and percentage of

406

amino acid identity to corresponding HHV-7 ORFs, are listed in Table1. Similar to HHV-

407

7, the MneHV7 sequence contained a homolog of the HHV-7-specific U55B gene,

408

whereas it lacked sequences similar to the HHV-6 unique genes, U22, U83 and U94.

409

These observations confirmed the previous status of MneHV7 as a new HHV-7 viral

410

homolog.

411

The positions of ATG start codons, stop codons, and similarity with the HHV-7

412

(RK strain) annotation were used to predict open reading frames. The presence and

413

locations of TATA box consensus sequences as well as consensus poly-adenylation

19

414

signal sequences were identified. MneHV7 proteins with the highest percentage of

415

similarity to HHV-7 (>90%) were U27 (DNA polymerase processivity factor), U35 (DNA

416

packaging), U41 (single stranded DNA binding protein), U56 (triplex capsid protein),

417

U57 (major capsid protein), U72 (envelope glycoprotein M), and U77 (helicase/primase

418

complex protein). Twenty-seven additional proteins were 80-90% similar, 18 proteins

419

were 70-80% similar, and 38 proteins shared