Eduardo Castro-Nallar1, Keith A Crandall1 & Marcos Pérez-Losada*
1 Department of Biology, 401 Widtsoe Building, Brigham Young University, Provo, UT 84602-5181, USA
*Author for correspondence: CIBIO, Centro de Investigação em Biodiversidade e Recursos Genéticos, Universidade do Porto, Campus Agrário de Vairão, 4485-661 Vairão, Portugal; [email protected]


reverse transcriptase [16] and its high virus turnover [17].

GD of HIV

HIV-1 and -2 GD has been classified into discrete groups and subtypes that largely correspond to geographic regions (Figure 1) [18]. HIV-1 includes four groups, which represent different introductions into human populations: group M (Main, Major), N (New, Nonmajor), O (Outlier) and, recently, P [19]. HIV-1 group M is the most commonly detected variant and is further divided into subtypes A–D, F–H, J and K. In addition, circulating recombinant forms (CRFs) carrying genetic information from two or more subtypes have been detected in infected populations (49 to date) [20]. HIV-2 includes two groups (A and B), and HIV-2 CRFs have been reported only once [21]. Genetic variation among HIV subtypes is tremendous, with within-subtype divergences reaching up to 17% and between-subtype divergences of 17–35%. For comparison, human and chimpanzee divergence can reach up to 3.9% if substitutions and insertions/deletions are considered [22]. Early on, researchers noted differences in transmissibility between HIV-1 and -2 [23]. Some transmission differences are explained by structural and evolutionary differences in env genes [24]. HIV-2 exhibits lower rates of transmission, almost no vertical transmission and long incubation periods [25]. Moreover, HIV-2-infected patients have reduced immune activation and low viremia, and rarely develop disease [26]. Within HIV-1, some researchers have proposed biological correlates to different subtypes; for example, HIV-1 subtype


Recent advances in DNA sequencing as well as new approaches to analyzing these data allow researchers to study the impact of epidemiological factors on the evolutionary dynamics of HIV at global (worldwide), regional (single epidemics), local (transmission chains) and individual (intrahost) scales by examining genetic variation across its genome and over geographic space and time. The purpose of this review is to present an overview of HIV genetic diversity (GD) and its estimation as well as the kind of insights these new methodologies can provide and how they can improve disease control and treatment.


The high genetic diversity of HIV is one of its most significant features, as it has consequences for global distribution, vaccine design, therapy success, disease progression, transmissibility and viral load testing. Studying HIV diversity helps us to understand its origins, migration patterns, current distribution and transmission events. New sequencing technologies based on the parallel acquisition of data are now used to characterize within-host and population processes in depth. Similar advances have been made in statistical methods designed to model the past history of lineages (the phylodynamic framework), which ultimately give better insights into the evolutionary history of HIV. We can, for example, estimate population size changes, lineage dispersion over geographic areas and epidemiological parameters solely from sequence data. In this article, we review some of the evolutionary approaches used to study transmission patterns and processes in HIV and the insights gained from such studies.

Review

Future Virology

Genetic diversity and molecular epidemiology of HIV transmission


GD is probably one of the most important concepts in biology. In its simplest definition, GD refers to any and every kind of genetic variation at the individual, population, interpopulation or species level. GD has a large impact on conservation biology [1] and the study of human origins [2], as well as molecular epidemiology [3], domestication [4], fitness [5] and disease [6]. In HIV-1, higher levels of GD have been associated with clinical outcomes such as immune escape of selected variants [7], emergence of drug resistance mutations and the consequent therapy failure [8], and even disease progression [9,10]. GD has also been used to study the geographic and temporal spread of HIV-1, shedding light on global and regional population dynamics. HIV-1 GD stems from at least three different sources: multiple introductions of HIV-1 into the human population [11–13], the low fidelity and high recombinogenic power [14,15] of its

10.2217/FVL.12.4 © 2012 Future Medicine Ltd

Future Virol. (2012) 7(3), 1–14

Keywords: drug resistance, genetic diversity, HIV, phylodynamics, transmission, vaccines


ISSN 1746-0794


[Figure 1 image not reproduced: phylogenetic tree with tips labeled by HIV-1 subtype and sub-subtype (A1, A2, B, C, D, F1, F2, G, H, J, K) and by circulating recombinant form (CRF).]

Figure 1. HIV-1 recombinants and subtypes. Phylogenetic tree representation of HIV-1 recombinants and discrete subtypes. A-D, F-H, J and K denote HIV-1 subtypes. No subtypes E or I are shown since they were found to be recombinant forms of other subtypes. CRF: Circulating recombinant form; cpx: Complex recombinant pattern.

A-infected women are less likely to develop AIDS than women infected with other subtypes [27]. Differences among subtypes have also been reported in chemokine co-receptor usage (tissue tropism) [28] and transmission [29]. Other researchers have suggested that the ‘subtypes’ are simply artifacts of poor or selective sampling of the overall HIV-1 GD [30].

Measures of GD

Given the interaction between HIV genetic and epidemiological dynamics, accurately estimating GD becomes particularly important. GD can be directly estimated from nucleotide sequence data using a variety of approaches.

Traditional measures of GD

The most intuitive way of measuring GD would be to simply count the pairwise differences between sequences in a sequence alignment or the number of polymorphic sites [31]. This approach, however, has limitations: it does not account for multiple hits at the same position, different probabilities of change in coding sequences or transition/transversion bias. In general, measures of GD based on sequence data can be classified into two basic categories: summary statistics and coalescent estimators. Some commonly used summary statistics are nucleotide diversity [32], haplotype diversity [32], allelic diversity, gene diversity and theta (Θ) [33]. More recently, Θ, expressed as substitution rate-scaled effective population size (Nef), has been estimated under a coalescent framework that explicitly takes evolutionary history into account [34–37], which differentiates this model from approaches based on summary statistics. The coalescent model has been further generalized


Phylodynamics (sensu [54]), the description of infectious disease behavior that arises from the blending of evolutionary and epidemiological processes, has become an active subject in virology and epidemiology, especially following recent statistical developments (see Table 1) [55–69]. These new methods often use the coalescent under a maximum likelihood or Bayesian inference framework and have been applied in HIV to study questions related to its global, regional, local and within-host dynamics.

Estimating phylodynamics

Accurate phylodynamic estimation relies on an adequate sampling strategy [70]. Sparse sampling can lead to inappropriate inferences, such as the inferred direct transmission of HIV-1 subtype C from East Africa to South America [71]. Since phylodynamic inferences rely on ‘time trees’ or dated phylogenies, it is also necessary to calibrate the molecular clock model in use. Because HIV has no fossil record, time-stamped sequence data are used instead to calibrate the clock. Incorporating independent prior knowledge about substitution rates will also help any virus dating effort, generally increasing statistical power. Specialized databases, for example for influenza [72] and HIV [201], may help in this regard, as they can provide clinically relevant information along with the genetic data (e.g., time of collection).
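How time-stamped sequences can calibrate a molecular clock may be illustrated with a root-to-tip regression. Under a strict clock, the genetic distance from the root to each tip grows linearly with sampling date, so the slope estimates the substitution rate and the x-intercept estimates the date of the root. This is a deliberately simplified sketch, not the Bayesian procedure implemented in the software listed in Table 1; it assumes a tree that is already rooted, and the function name, sampling years and distances are made-up values for illustration only.

```python
def root_to_tip_regression(dates, distances):
    """Least-squares fit of root-to-tip distance against sampling date.

    Returns (rate, root_date): the slope in substitutions/site/year and
    the date at which the fitted line crosses zero distance.
    """
    n = len(dates)
    mean_t = sum(dates) / n
    mean_d = sum(distances) / n
    sxx = sum((t - mean_t) ** 2 for t in dates)
    sxy = sum((t - mean_t) * (d - mean_d) for t, d in zip(dates, distances))
    rate = sxy / sxx
    if rate <= 0:
        # No positive temporal signal: the data cannot calibrate a clock.
        raise ValueError("non-positive slope; no temporal signal")
    intercept = mean_d - rate * mean_t
    root_date = -intercept / rate
    return rate, root_date


# Hypothetical sampling years and root-to-tip distances (substitutions/site).
years = [1985.0, 1990.0, 1995.0, 2000.0, 2005.0]
dists = [0.050, 0.060, 0.070, 0.080, 0.090]

rate, root_date = root_to_tip_regression(years, dists)
print(f"rate = {rate:.4f} substitutions/site/year, root date = {root_date:.1f}")
```

With real data the points scatter around the line, and the amount of scatter gives a rough indication of how clock-like the sample is before a full dated-phylogeny analysis is attempted.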


Newly developed approaches aim to estimate diversity parameters by taking advantage of the massive amounts of data that next-generation sequencing (NGS) technologies can deliver [41]. New methodologies have focused on characterizing intrahost diversity by capturing low-frequency variants [42]. Recent implementations use Bayesian inference to correct errors [43] and to infer haplotypes and their frequencies (as low as 0.1% [44]) [45]. The determination of full HIV genomic sequences and related measures of diversity is also now feasible, which opens unexplored possibilities to comprehensively address how HIV mutates under selective pressures [46]. To date, most applications of NGS technologies to ‘ultra-deep sequencing’ of viral populations have focused on drug resistance characterization, fluctuations in GD through disease progression, and certain events in HIV biology, such as tropism switch, transmission bottlenecks, immune escape [47,48], epistasis [49] and superinfections [50]. Drug resistance characterization has been performed primarily on target genes such as RT and integrase (both within pol; Figure 2). In turn, epidemiological and subtyping studies focus primarily on the envelope glycoproteins encoded within the env reading frame [51,52]. In the future, we expect to see more applications based on full genome

Geographic & temporal spread of diversity


Novel approaches

data as feasibility increases and the cost of sequencing decreases [42,53].
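Why sequencing depth matters for capturing low-frequency variants, such as the 0.1% haplotype frequency mentioned above, can be illustrated with a simple binomial calculation. The sketch below assumes that reads sample the viral population independently and that there is no sequencing error, both of which are simplifications; the function name, the coverage values and the minimum read count are illustrative assumptions rather than settings from any cited method.

```python
from math import comb


def detection_probability(freq, coverage, min_reads=1):
    """Probability that at least `min_reads` of `coverage` reads carry a
    variant present at population frequency `freq` (binomial sampling)."""
    p_fewer = sum(
        comb(coverage, i) * freq**i * (1.0 - freq) ** (coverage - i)
        for i in range(min_reads)
    )
    return 1.0 - p_fewer


# Chance of seeing a 0.1% variant at least 5 times at different depths.
for depth in (100, 1000, 10000):
    p = detection_probability(0.001, depth, min_reads=5)
    print(f"coverage {depth:>6}: P(detect) = {p:.4f}")
```

Requiring several supporting reads rather than one is a crude stand-in for error correction: in practice, sequencing errors can mimic rare variants, which is the problem the Bayesian error-correction approaches cited above are designed to address.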


to account for varying population sizes, different time scales, structure, recombination and selection, which has probably made it the most used method to estimate GD. Detailed descriptions of algorithms for Θ estimation can be found in other reviews [38–40].
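The summary statistics named under 'Traditional measures of GD' can be illustrated with a short sketch. Assuming a small, hypothetical alignment of equal-length sequences without gaps or ambiguity codes, the code below computes pairwise differences, the number of polymorphic (segregating) sites, nucleotide diversity (mean pairwise differences per site) and Watterson's estimator of theta per site. The function names and toy sequences are illustrative only, and no correction for multiple hits or transition/transversion bias is applied.

```python
from itertools import combinations


def pairwise_differences(a, b):
    """Count positions at which two aligned sequences differ."""
    return sum(x != y for x, y in zip(a, b))


def segregating_sites(alignment):
    """Number of alignment columns with more than one distinct state."""
    return sum(len(set(column)) > 1 for column in zip(*alignment))


def nucleotide_diversity(alignment):
    """Mean number of pairwise differences per site."""
    length = len(alignment[0])
    pairs = list(combinations(alignment, 2))
    total = sum(pairwise_differences(a, b) for a, b in pairs)
    return total / (len(pairs) * length)


def watterson_theta(alignment):
    """Watterson's estimator per site: S / (a_n * L),
    where a_n = 1 + 1/2 + ... + 1/(n - 1)."""
    n, length = len(alignment), len(alignment[0])
    a_n = sum(1.0 / i for i in range(1, n))
    return segregating_sites(alignment) / (a_n * length)


# Hypothetical toy alignment: four sequences, ten sites.
toy = ["ACGTACGTAC", "ACGTACGTAT", "ACGAACGTAC", "ACGTACGTAC"]

print("segregating sites:", segregating_sites(toy))
print("nucleotide diversity:", nucleotide_diversity(toy))
print("Watterson's theta per site:", watterson_theta(toy))
```

Both estimators target the same quantity under a neutral, constant-size model, so large discrepancies between them in real data are one informal signal that those assumptions do not hold.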

[Figure 2 image not reproduced: linear map of the HIV-1 genome on a 0–9719 nucleotide scale (HXB2 numbering), showing the 5´ LTR, gag (p17, p24, p2, p7, p1, p6), pol* (prot, p51 RT, p15 RNase, p31 int), vif, vpr, tat, rev, vpu, env** (gp120, gp41), nef and the 3´ LTR.]

Figure 2. Genome organization. Schematic representation of HIV-1 genome organization. The three coding reading frames are depicted along with their open reading frames (rectangles). Genome position is numbered according to HXB2 reference strain. The small number in the upper left corner of each rectangle indicates the gene start, while the number in the lower right indicates the last position of the stop codon. Trans-spliced rev and tat forms are represented by black connecting lines between third and second, and second and first open reading frames, respectively. *ORFs used for drug resistance testing. **ORFs used for subtyping and epidemiological studies.


Table 1. Summary of software used for phylodynamic inferences.

Inference | Implementation
Migration, spatial dispersion | Migrate-n [145], BEAST [146], IMa2, Lamarc [147]
Substitution rates | BEAST
Recombination rates and recombination breakpoints | Lamarc, LDhat [148], RDP3 [149]
Changes in population sizes | Migrate-n, BEAST
Divergence time estimation | BEAST, Multidivtime
Leaf ages | BEAST
Reproductive number | BEAST
Growth rate | BEAST, Lamarc, Migrate-n
Population divergence times | IMa2
Haplotype reconstruction from NGS data and genetic diversity estimation | ShoRAH
Detection of selection | HyPhy [150], ADAPTSITE [151], TREESAAP [152]
Assigning samples to populations, inferring the number of populations | Structurama, Structure, StructHDP
Ancestral state reconstruction | BEAST, MESQUITE [153]

NGS: Next-generation sequencing.


In essence, most phylodynamic methods analyze distributions of trees, usually obtained by sampling from the posterior distribution of a model given the data. They rest on the observation that the shape of a tree reflects the dynamic processes acting on the data, such as population size changes (constant size, growth and shrinkage), selective processes (intrahost immune selection) or spatial structure (Figure 3). For example, populations of constant size are predicted to give symmetrical phylogenetic patterns, with most of the diversity occurring on relatively short branches, as novel variants are for the most part neutral or deleterious and therefore selected against [73]. On the other hand, a population experiencing exponential growth will have longer branches leading to extant (sampled) sequences relative to deeper branches in the phylogeny [73]. In short, the coalescent models used in phylodynamic analyses describe a probability distribution on ancestral genealogies, given a population history. Therefore, if we can estimate the underlying phylogenetic structure of alleles from a collection of sequence data, by extension we can infer their population history and the evolutionary processes impacting that history (Figure 3).

Global spread

The best hypothesis we have regarding the origin and dispersion of HIV indicates that HIV-1 and -2 originated in Africa during the first half of the last century, and that they were the product of several cross-species transmissions between humans and non-human primates [11–13] (see [51] for further reading). Approximately 33 million people were living with HIV worldwide as of 2009. In the same year, 2 million infected people died and the epidemic grew at a rate of 7400 new infections per day, more than 97% of which occurred in low- and middle-income countries [74]. The global distribution of groups and subtypes has remained rather constant within the last 10 years [75]. Although both HIV lineages spread exponentially at the beginning of the epidemic, HIV-2 occurs mostly in western Africa. In turn, HIV-1 is distributed worldwide, with group M accounting for most infections, while groups O, N and P appear to be concentrated in central Africa. The times to the most recent common ancestor (TMRCAs) of HIV lineages have been dated using different molecular-based methodologies and gene regions, with the following inferences: 1905–1942 for HIV-2 group A and 1914–1945 for HIV-2 group B [76,77]. Similarly, HIV-1 TMRCA estimates were as follows: 1894–1931 for group M, 1932–1966 for group N and 1914–1925 for group O [13,76,78].
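TMRCA estimates such as those above are read from dated genealogies interpreted through the coalescent. As a minimal illustration of how population size determines the depth of a genealogy, the sketch below assumes the simplest case only: Kingman's coalescent for a sample from a constant-size haploid population, with time measured in generations. It is not the model used in the cited dating studies, and the function names, sample size and population size are hypothetical.

```python
import random
from math import comb


def expected_tmrca(n_samples, pop_size):
    """Expected TMRCA in generations: the sum of expected waiting times
    while k = n, ..., 2 lineages remain, each equal to N / C(k, 2)."""
    return sum(pop_size / comb(k, 2) for k in range(2, n_samples + 1))


def simulate_tmrca(n_samples, pop_size, rng):
    """Draw one TMRCA by summing exponential coalescent waiting times."""
    t = 0.0
    for k in range(n_samples, 1, -1):
        # With k lineages, pairs coalesce at total rate C(k, 2) / N.
        t += rng.expovariate(comb(k, 2) / pop_size)
    return t


rng = random.Random(1)  # fixed seed so the run is repeatable
sims = [simulate_tmrca(10, 1000.0, rng) for _ in range(2000)]
mean_sim = sum(sims) / len(sims)

print("expected TMRCA:", expected_tmrca(10, 1000.0))
print("simulated mean TMRCA:", mean_sim)
```

The closed form simplifies to 2N(1 - 1/n), so under this model the expected depth of the tree scales directly with population size. Growth or decline changes the waiting times and hence the tree shapes idealized in Figure 3.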

[Figure 3 image not reproduced: idealized trees in three panels. Population size dynamics: constant size, exponential growth and exponential shrinkage. Selection dynamics: continual immune selection. Spatial dynamics: strong and weak spatial structure, with tips labeled A, B and C.]

Figure 3. Phylodynamic patterns. Population, selection and spatial dynamic patterns and their respective idealized trees.

Regional spread

Founder-effect events are thought to play a major role in the spread of HIV out of Africa, although other factors, such as viral selective advantages, sociocultural factors and human genetic background, cannot be ruled out completely. The Democratic Republic of Congo (DRC) is one of the places in which HIV-1 diversity is the greatest, and probably the site where cross-species transmission occurred [79–81]. Two archival samples, DRC60 and ZR59 [82], and the existence in DRC of almost all group M subtypes [75,83] support this statement (Figure 4). Studies worldwide have attempted to infer the regional spatial and temporal spread of HIV-1 in particular. In the USA, HIV-1 subtype B seems to have emerged from a single migration in 1969 out of Haiti, the place with the highest subtype B diversity. In turn, Haitian HIV-1 subtype B emerged in 1966 from the DRC [84–86]. South Africa has recently shown an increase in HIV infections. Almost 6 million people are infected, the majority with HIV-1 subtype C, which now accounts for 50% of all infected individuals worldwide [74,75]. Subtype C was first reported in 1990 and its TMRCA was dated to 1958 [87,88]. The spread of subtype C seems to have occurred eastward from South Africa to India [89,90], and probably also to China, while some founder events have also been identified from east Africa to South America and to Israel (Figure 4) [91,92]. The introduction of HIV-1 subtype C into South America goes back to the 1980s, most likely through Brazil [91,92]. Nonetheless, in a more comprehensive study, South American subtype C appears to be more closely related to UK subtype C, with these two groups related to east African isolates [71], which stresses the importance of including global isolates in studies of HIV phylogeography. Besides being present in the western world, HIV-1 subtype B is also present in Asia, where its introduction seems to have occurred through Thailand in 1985 and where it was termed subtype B´ [93]. From there, HIV-1 subtype B´ expanded into Asia, coexisting with the pandemic subtype B and others, and fueling the development of CRFs across the continent [93–96]. It is worth noting that CRFs represent 20% of all HIV-1 infections, with half of these infections involving CRF02_AG and CRF01_AE [75]. Additionally, there are some indications that certain subtypes are preferentially associated with behavioral factors such as intravenous drug use and with drug-trafficking regions, in particular within former USSR countries and southeast Asia (subtype A and CRF01_AE, respectively; Figure 4) [97]. HIV studies implementing phylodynamic methods have been used to address a variety of questions, including epidemic origins [84–90,93–96], correlations between epidemiological data and changes in population size or GD, viral


15.0–28.0% 5.0–