
Inferring Viral Quasispecies Spectra from Shotgun and Amplicon Next-Generation Sequencing Reads

Irina Astrovskaya (1), Nicholas Mancuso (2), Bassam Tork (2), Serghei Mangul (2), Alex Artyomenko (2), Pavel Skums (3), Lilia Ganova-Raeva (3), Ion Măndoiu (4), Alex Zelikovsky (2)

(1) Center for Bioinformatics and Computational Biology, University of Maryland, College Park, MD
(2) Department of Computer Science, Georgia State University, Atlanta, GA
(3) Centers for Disease Control and Prevention, Atlanta, GA
(4) Department of Computer Science & Engineering, University of Connecticut, Storrs, CT

Abstract

Many clinically relevant viruses, including hepatitis C virus (HCV) and human immunodeficiency virus (HIV), exhibit high genomic diversity within infected hosts, which may explain the failure of vaccines and resistance to existing antiviral therapies. Characterizing the viral population infecting a host requires reconstructing all co-existing (related, but non-identical) viral variants, referred to as quasispecies, and inferring their relative abundances. Next-generation sequencing is a promising approach for characterizing viral diversity due to its ability to generate a large number of reads at a low cost. However, standard assembly software was originally designed for single genome assembly and cannot be used to assemble multiple closely related quasispecies sequences and estimate their abundances. In this chapter, we focus on the problem of reconstructing viral quasispecies populations from next-generation sequencing reads produced by the two most commonly used strategies: shotgun sequencing and sequencing of partially overlapping PCR amplicons. We discuss computational challenges associated with each strategy and review existing approaches to quasispecies reconstruction, with a focus on two state-of-the-art software tools: Viral Spectrum Assembler (ViSpA), designed for shotgun reads, and Viral Assembler (VirA), which handles amplicon reads. Both tools have been tested on simulated and real read data from HCV and HIV (ViSpA) and from HBV (VirA) quasispecies, and shown to compare favorably with other existing methods.

Introduction

Viral quasispecies

Many medically important viruses, including influenza viruses, hepatitis C virus (HCV), and human immunodeficiency virus (HIV), exhibit high genomic diversity within and between their hosts. In RNA viruses, the absence of proofreading mechanisms results in the inability to detect and repair mistakes during replication (Duarte et al., 1994). As a result, the mutation rate may be as high as one mutation per thousand bases copied per replication cycle (Drake et al., 1999).
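To give a feel for what such a mutation rate implies, the short sketch below computes the expected number of mutations per genome copy and the chance that a copy is error-free. It assumes independent errors at each base, takes the upper-bound rate quoted above, and uses a genome length of 9,600 bases as an illustrative value (roughly the size of an HCV genome); the function names are hypothetical.

```python
def expected_mutations(genome_length: int, rate_per_base: float) -> float:
    """Expected number of mutations introduced when copying one genome."""
    return genome_length * rate_per_base


def prob_error_free_copy(genome_length: int, rate_per_base: float) -> float:
    """Probability that a copy carries no mutation at all.

    Assumes every base mutates independently with the same probability.
    """
    return (1.0 - rate_per_base) ** genome_length


# Assumed inputs: the upper-bound rate from the text and an
# illustrative genome length (not taken from the chapter).
RATE_PER_BASE = 1e-3
GENOME_LENGTH = 9_600

if __name__ == "__main__":
    print("expected mutations per copy:",
          expected_mutations(GENOME_LENGTH, RATE_PER_BASE))
    print("probability of an error-free copy:",
          prob_error_free_copy(GENOME_LENGTH, RATE_PER_BASE))
```

Under these assumptions nearly every copied genome differs from its template, which is consistent with the mutant clouds described next.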

Besides replication errors that result in substitutions, small insertions, and small deletions, many viruses also undergo recombination and genome segment reassortment. Much of this sequence variation is well tolerated and passed down to descendants, producing in each infected host a family of co-existing variants of the original viral genome referred to as mutant clouds, or viral quasispecies, a concept that originally described a mutation-selection balance (Domingo et al., 1985; Steinhauer et al., 1987; Eigen et al., 1989; Martell et al., 1992; Domingo and Holland, 1997). Short replication cycles, the small size of viral genomes, large population sizes, and the strong selective pressure exerted by the immune response of an infected patient additionally contribute to the evolution of the quasispecies, resulting, for instance, in 10^10 to 10^12 new variants per day in the case of HIV-1, HBV, or HCV infections (Domingo et al., 1998; Neumann et al., 1998).

These quasispecies variants differ in biological properties such as their ability to cause disease (virulence), their ability to escape host immune responses, their resistance to antiviral therapies, and the type of host cells and tissues infected by the virus (tissue tropism). As a consequence, the best adapted variants survive different environmental changes and selective pressures, resulting in different viral populations in different infected hosts. The diversity of viral sequences and the frequent generation of new variants in an infected patient can cause vaccine failures and virus resistance to existing therapies. Drug-resistant viral quasispecies, even when present as a minor fraction of the population, result in rapid viral adaptation in the case of HIV-1 and HCV and in failure of antiretroviral therapy (Metzner et al., 2009; Li et al., 2011; Skums et al., 2011-2012). As a result, very few RNA viruses are effectively controlled by vaccination or antiviral therapies (Holland et al., 1992; Domingo et al., 1998).
Furthermore, since live viruses mutate rapidly, they can become virulent for a host at any replication cycle, posing another challenge for vaccine development. Since population dynamics cannot be explained just by the evolution of the most frequent viral variant, there is great interest in reconstructing the genomic diversity of all viral quasispecies in a host. Knowing the sequences of virulent variants and their abundances in infected patients can improve our understanding of viral evolutionary dynamics and of the mechanisms of viral persistence and drug resistance, and, as a result, may help to design effective drugs (Beerenwinkel et al., 2005; Rhee et al., 2007) and vaccines (Gaschen et al., 2002; Douek et al., 2006; Luciani et al., 2012) targeting particular viral variants in vivo.

Next-generation sequencing technologies

The evolution of next-generation sequencing (NGS) technologies has significantly changed the experimental analysis of viral communities (Barzon et al., 2011), since their massive throughput and decreasing cost allow deep sampling of an entire viral population in a single experiment. Although existing NGS technologies have similar workflows, they differ in their underlying biochemistries and sequencing protocols as well as in their throughput and average sequence length. Applied Biosystems' SOLiD technology uses ligation to sequence genomes, whereas the sequencing technology developed by Illumina incorporates reversible dye-terminators (Mardis 2008; Mardis 2009). Both technologies produce high accuracy reads, enabling detection of local low-frequency variants; however, their shorter read length makes them less effective for reconstructing full-length quasispecies sequences (Zagordi et al., 2012a). Ion Torrent's NGS technology uses integrated circuits that detect the pH change caused by the release of ions during template-directed DNA polymerase synthesis. Currently, Ion Torrent reads have an average length of around 200-400 bases (Barzon et al., 2011), and future systems are expected to generate even longer reads (Rothberg et al., 2011).

Since average read length strongly influences global viral quasispecies sequence reconstruction (Zagordi et al., 2012a), to date the 454/Roche pyrosequencing technology has been the most commonly used NGS technology for viral quasispecies analysis (Margulies, Egholm et al., 2005). In brief, the 454 pyrosequencing system shears the source genetic material into fragments and amplifies them on beads using emulsion PCR. Millions of amplified template fragments are then sequenced by synthesizing their complementary strands. The nucleotide reagents are repeatedly flowed over the fragments, one nucleotide (A, C, T, or G) at a time. Light is emitted at a fragment location when the flowed nucleotide complements the first unpaired base of the fragment (Fakhrai-Rad et al., 2002; Margulies, Egholm et al., 2005). Multiple identical nucleotides may be incorporated in a single cycle, in which case the light intensity corresponds to the number of incorporated bases. In practice, however, this number (referred to as the homopolymer length) cannot be accurately estimated for long homopolymers, resulting in a relatively high percentage of insertion and deletion sequencing errors, which respectively represent 65%-75% and 20%-30% of all sequencing errors (Brockman, Alvarez et al., 2008; Quinlan et al., 2008).

Shotgun versus amplicon sequencing reads

Contiguous NGS reads can be generated in two essentially different ways: as shotgun reads or as amplicon reads. In the first case, the fragments (the shotgun reads) are randomly produced from a given DNA segment in such a way that their starting positions are uniformly distributed across the genome.
In contrast, the amplicon reads are generated by using PCR from the overlapping “windows” of the viral genome so the beginnings and the endings of reads are essentially fixed (see Figure 1). Currently, GS FLX Titanium XL+ can generate up to 1 million shotgun and 700,000 amplicon reads with average read length around 700 and 400 bases, respectively.
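The difference between the two read types can be illustrated with a small, error-free simulation. This is only a sketch under simplifying assumptions: fixed read and amplicon lengths, no sequencing errors, and equal sampling of all amplicon windows. The function names and parameters are hypothetical and do not come from any sequencing software.

```python
import random
from typing import List, Tuple

Read = Tuple[int, str]  # (start position on the genome, read sequence)


def shotgun_reads(genome: str, n_reads: int, read_len: int,
                  rng: random.Random) -> List[Read]:
    """Sample reads whose start positions are uniform along the genome."""
    if read_len > len(genome):
        raise ValueError("read_len must not exceed the genome length")
    last_start = len(genome) - read_len
    reads = []
    for _ in range(n_reads):
        start = rng.randint(0, last_start)  # both endpoints are inclusive
        reads.append((start, genome[start:start + read_len]))
    return reads


def amplicon_windows(genome_len: int, amp_len: int,
                     overlap: int) -> List[Tuple[int, int]]:
    """Fixed, partially overlapping windows that cover the whole genome."""
    if not 0 <= overlap < amp_len <= genome_len:
        raise ValueError("need 0 <= overlap < amp_len <= genome_len")
    step = amp_len - overlap
    starts = list(range(0, genome_len - amp_len + 1, step))
    if starts[-1] + amp_len < genome_len:
        starts.append(genome_len - amp_len)  # final window ends at genome end
    return [(s, s + amp_len) for s in starts]


def amplicon_reads(genome: str, n_reads: int, amp_len: int, overlap: int,
                   rng: random.Random) -> List[Read]:
    """Sample reads whose start and end coincide with a fixed window."""
    windows = amplicon_windows(len(genome), amp_len, overlap)
    reads = []
    for _ in range(n_reads):
        start, end = rng.choice(windows)
        reads.append((start, genome[start:end]))
    return reads


if __name__ == "__main__":
    rng = random.Random(0)
    genome = "".join(rng.choice("ACGT") for _ in range(1000))
    print(sorted({s for s, _ in shotgun_reads(genome, 20, 100, rng)}))
    print(sorted({s for s, _ in amplicon_reads(genome, 20, 300, 50, rng)}))
```

In this toy setting the shotgun start positions are scattered across the genome, whereas the amplicon reads can only start at the few fixed window boundaries.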

Figure 1 Shotgun versus amplicon reads.
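The homopolymer problem of flow-based pyrosequencing described in the previous section can also be made concrete with a toy model. The sketch below is an illustration, not the base caller of any instrument: it assumes a cyclic flow order TACG, treats the input as the strand being synthesized, and calls bases by rounding each flow signal to the nearest integer.

```python
from typing import List

FLOW_ORDER = "TACG"  # assumed cyclic order in which nucleotides are flowed


def flowgram(sequence: str, n_flows: int) -> List[float]:
    """Ideal signal per flow: the number of bases incorporated in that flow."""
    signals = []
    pos = 0
    for flow in range(n_flows):
        base = FLOW_ORDER[flow % len(FLOW_ORDER)]
        run = 0
        # A whole homopolymer run is incorporated within a single flow.
        while pos < len(sequence) and sequence[pos] == base:
            run += 1
            pos += 1
        signals.append(float(run))
    return signals


def base_call(signals: List[float]) -> str:
    """Call bases by rounding each signal to a homopolymer length."""
    called = []
    for flow, signal in enumerate(signals):
        base = FLOW_ORDER[flow % len(FLOW_ORDER)]
        called.append(base * max(0, round(signal)))
    return "".join(called)


if __name__ == "__main__":
    ideal = flowgram("TTTTTAC", 4)
    print(ideal, base_call(ideal))
    # A modest overestimate of the first signal adds a base to the run.
    noisy = [5.6, 1.1, 0.9, 0.1]
    print(noisy, base_call(noisy))
```

Because the signal for a run of five identical bases has to be told apart from the signal for six, a small amount of noise is enough to turn a correct call into an insertion or a deletion, while single bases remain robust.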

Quasispecies Spectrum Reconstruction

The software provided by the instrument manufacturers was originally designed to assemble all reads into a single (consensus) genome sequence, rather than into multiple similar but non-identical sequences. Thus, new software must be developed to solve the following problem.

Quasispecies Spectrum Reconstruction (QSR) problem: Given a collection of shotgun or amplicon NGS reads generated from a viral sample, reconstruct the quasispecies spectrum, i.e., the set of sequences and the relative frequency of each sequence in the sample population.

A major challenge in solving the QSR problem is that the quasispecies sequences are only slightly different from each other. The number of differences between quasispecies and their distribution along the genome vary significantly across known viruses due to variation in their mutation rates and genomic architectures. The limited read length and relatively high error rate of current high-throughput sequencing data also add to the complexity of the QSR problem.

Related work

The QSR problem is related to several well-studied problems: de novo genome assembly (Myers, 2005; Sundquist et al., 2007; Chaisson et al., 2008), haplotype assembly (Lippert et al., 2002; Bansal et al., 2008), population phasing (Brinza et al., 2006) and metagenomics (Venter et al., 2004). De novo assembly methods are designed to build a single genome sequence and are not well suited for reconstructing a large number of closely related quasispecies sequences. Haplotype assembly does seek to infer two closely related haplotype sequences, but existing methods do not easily extend to the reconstruction of a large (and a priori unknown) number of sequences. Computational methods developed for population phasing deal with a large number of haplotypes, but rely on the availability of genotype data that conflates information about pairs of haplotypes.
Metagenomic samples do consist of sequencing reads generated from the genomes of a large number of species. However, differences between the genomes of these species are considerably larger than those between viral quasispecies. Furthermore, existing tools for metagenomic data analysis focus on species identification, since reconstruction of complete genomic sequences would require much higher sequencing depth than that typically provided by current metagenomic datasets. In contrast, achieving high sequencing depth for viral samples is very inexpensive, owing to the short length of viral genomes.

Mapping-based approaches to the QSR problem are naturally preferred to de novo assembly, since reference genomes are available (or easy to obtain) for the viruses of interest and viral genomes do not contain repeats. Thus, it is not surprising that such approaches were adopted in two pioneering works on the QSR problem (Eriksson et al., 2008; Westbrooks et al., 2008). Both works independently introduced the concept of a read graph, in which nodes represent (possibly pre-processed) reads and two nodes are connected by a directed edge if the read sequences agree on their overlap.

Eriksson et al. (2008) proposed a multi-step approach with a focus on local haplotype reconstruction. The method consists of sequencing error correction via probabilistic clustering, haplotype reconstruction via chain decomposition, and haplotype frequency estimation via expectation-maximization (EM). This method was implemented in the software tool ShoRAH (Zagordi et al., 2010a; Zagordi et al., 2011) and successfully applied to HIV data (Zagordi et al., 2010b).

In contrast to the previous approach, Westbrooks et al. (2008) globally reconstructed viral haplotypes via transitive reduction, overlap probability estimation and network flows, with application to simulated error-free HCV data. Astrovskaya et al. (2011) further extended this approach by allowing imperfect overlaps (i.e., overlaps with a certain number of disagreements) during read graph construction and by inferring quasispecies via max-bandwidth paths through the graph with adjusted estimates of the overlap probabilities. This pipeline (referred to as ViSpA) was successfully applied to both simulated and real HCV and HIV data. Experimental results show that ShoRAH tends to overcorrect reads and that ViSpA outperforms ShoRAH in assembling quasispecies sequences.

The idea of imperfect read overlaps is also exploited in the QColors method (Huang et al., 2011). In contrast to ViSpA, QColors uses an additional conflict graph to represent disagreements within read overlaps. The haplotypes are reconstructed by finding a partition of the reads into the minimal number of non-conflicting subsets. Although the QColors method gives guidance on how to use short and non-contiguous NGS reads, it may be too sensitive to sequencing errors.

Hapler (O’Neil and Emrich, 2012) reformulates the problem of finding a minimum number of paths needed to explain the observed reads as a weighted bipartite graph matching problem. A randomization step with sampling from the space of possible haplotypes reduces the probability of reconstructing chimeric haplotypes. Local haplotype reconstruction was successful on a low read coverage sample from COI genes of the butterfly Melitaea cinxia.

In general, all the methods mentioned above pre-process the reads to correct sequencing errors before constructing a read graph.
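The read-graph idea shared by these methods can be sketched in a few lines. The code below is a minimal illustration, not the implementation of ViSpA or of any other tool: it assumes reads are already aligned to a reference without indels, connects two reads when they agree on their overlap up to a chosen number of mismatches, and uses the overlap length as a stand-in edge weight when searching for a max-bandwidth path (a path whose smallest edge weight is as large as possible). All names, the weighting, and the threshold are assumptions made for the example.

```python
from typing import Dict, List, Tuple

Read = Tuple[int, str]  # (start position on the reference, read sequence)


def overlap_mismatches(u: Read, v: Read) -> Tuple[int, int]:
    """Return (overlap length, mismatches); u must start no later than v."""
    (su, seq_u), (sv, seq_v) = u, v
    length = min(su + len(seq_u), sv + len(seq_v)) - sv
    if length <= 0:
        return 0, 0
    part_u = seq_u[sv - su: sv - su + length]
    part_v = seq_v[:length]
    return length, sum(a != b for a, b in zip(part_u, part_v))


def build_read_graph(reads: List[Read],
                     max_mismatches: int) -> Dict[int, Dict[int, int]]:
    """Edge i -> j if read j starts after read i, extends past its end, and
    the two reads disagree on at most max_mismatches overlap positions."""
    graph: Dict[int, Dict[int, int]] = {i: {} for i in range(len(reads))}
    for i, u in enumerate(reads):
        for j, v in enumerate(reads):
            if u[0] < v[0] and u[0] + len(u[1]) < v[0] + len(v[1]):
                length, mismatches = overlap_mismatches(u, v)
                if length > 0 and mismatches <= max_mismatches:
                    graph[i][j] = length  # assumed weight: overlap length
    return graph


def max_bandwidth_path(reads: List[Read], graph: Dict[int, Dict[int, int]],
                       source: int, sink: int) -> List[int]:
    """Path from source to sink that maximizes the smallest edge weight."""
    # Edges always point to reads with a larger start, so sorting by start
    # position yields a topological order of the graph.
    order = sorted(graph, key=lambda i: reads[i][0])
    width = {i: float("-inf") for i in graph}
    prev: Dict[int, int] = {}
    width[source] = float("inf")
    for i in order:
        if width[i] == float("-inf"):
            continue  # not reachable from the source
        for j, weight in graph[i].items():
            candidate = min(width[i], weight)
            if candidate > width[j]:
                width[j] = candidate
                prev[j] = i
    if width[sink] == float("-inf"):
        return []
    path = [sink]
    while path[-1] != source:
        path.append(prev[path[-1]])
    return path[::-1]


if __name__ == "__main__":
    reads = [(0, "AAAACCCC"), (4, "CCCCGGGG"), (4, "CCCCGGGA"),
             (8, "GGGGTTTT"), (8, "GGGATTTT")]
    graph = build_read_graph(reads, max_mismatches=0)
    print(graph)
    print(max_bandwidth_path(reads, graph, source=0, sink=3))
```

With perfect overlaps required, the two variants in this toy example stay on separate paths; allowing one mismatch adds cross edges between them, which shows why tolerating imperfect overlaps needs a careful choice of edge weights.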
Because error correction happens up front in these methods, any undetected or unfixed errors and any mis-corrections are subsequently treated as true values and cannot be altered later by most of the tools; the exception is ViSpA (Astrovskaya et al., 2011). ViSpA deals with sequencing errors at several steps and may fix an incorrect base at the end if the position is covered by a sufficient number of reads.

Alternatively, instead of correcting sequencing errors, one can model the stochastic process of NGS read generation and estimate which set of genome variants is most likely to have produced the observed reads, assuming those reads were indeed generated under the proposed model (Jojic et al., 2008). The model uses several hidden variables and parameters, such as the viral genomes and the relative concentrations of the viral variants, the starting offset and depth of coverage for different positions in the genome, error transformation parameters, and the uncertainty probability for a particular allele value at a particular position in the genome. The main drawback of the method is that the number of inferred quasispecies has to be set in advance. PredictHaplo (Prabhakaran et al., 2010) avoids this problem by using a truncated approximation of an infinite mixture model (Ewens, 1972; Ferguson, 1973; Rasmussen, 2000) to automatically choose the number of reconstructed haplotypes. Under this model, a haplotype is represented as a mixture of probability tables over the four nucleotides and the alignment gap. The experimental results on HIV quasispecies show that PredictHaplo is able to infer viral quasispecies if their relative proportion in a population is greater than 0.5%. V-Phaser (Macalalad et al., 2012) also focuses on detecting rare (