Journal of General Virology (1994), 75, 1053-1061. Printed in Great Britain
Identification of genotypes of hepatitis C virus by sequence comparisons in the core, E1 and NS-5 regions
P. Simmonds,1 D. B. Smith,1 F. McOmish,2 P. L. Yap,2 J. Kolberg,3 M. S. Urdea3 and E. C. Holmes4
1 Department of Medical Microbiology, University of Edinburgh, Teviot Place, Edinburgh EH8 9AG, 2 Edinburgh and South East Scotland Blood Transfusion Service, Royal Infirmary of Edinburgh, Lauriston Place, Edinburgh EH3 9HB, U.K., 3 Chiron Corporation, 4560 Horton Street, Emeryville, California 94608, U.S.A. and 4 Department of Zoology, University of Oxford, South Parks Road, Oxford OX1 3RE, U.K.
Isolates of hepatitis C virus (HCV) show considerable nucleotide sequence variability throughout the genome. Comparisons of complete genome sequences have been used as the basis of classification of HCV into a number of genotypes that show 67 to 77% sequence similarity. In order to investigate whether sequence relationships between genotypes are equivalent in different regions of the genome, we have carried out formal sequence analysis of variants in the 5' non-coding region (5'NCR) and in the genes encoding the core protein, an envelope protein (E1) and a non-structural protein (NS-5). In the E1 region, variants grouped into a series of six major genotypes and a series of subtypes that could be matched to the phylogenetic groupings previously observed for the NS-5 region. Furthermore, core and E1 sequences showed three non-overlapping ranges of sequence similarity corresponding to those between different genotypes, subtypes and isolates previously described in NS-5. Each major genotype could also be reliably identified by sequence comparisons in the well conserved 5'NCR, although many subtypes, such as 1a/1b, 2a/2c and some of those of type 4, could not be reliably distinguished from each other in this region. These data indicate that subgenomic regions such as E1 and NS-5 contain sufficient phylogenetic information for the identification of each of the 11 or 12 known types and subtypes of HCV. No evidence was found for variants of HCV that had sequences of one genotype in the 5'NCR but of a different one in the E1 or NS-5 region. This suggests that recombination between different HCV types is rare or non-existent and does not currently pose a problem in the use of subgenomic regions in classification.
Introduction
Since the cloning and characterization of hepatitis C virus (HCV) as the main causative agent of post-transfusion non-A, non-B hepatitis (Choo et al., 1989), considerable sequence diversity has been found between variants infecting different individuals. For example, the complete genome sequence of the prototype virus (HCV-1; Choo et al., 1991) isolated in the U.S.A. shows only 69 to 85% sequence similarity with variants obtained from infected individuals in Japan (Kato et al., 1990; Takamizawa et al., 1991; Okamoto et al., 1991, 1992; Chen et al., 1992). Variants can be grouped into a number of different virus types by sequence comparisons of subgenomic regions of the virus such as the regions encoding the non-structural protein NS-5 (Enomoto et al., 1990; Mori et al., 1992; Chan et al., 1992; Cha et al., 1992; Simmonds et al., 1993a; Chayama et al., 1993), the core protein (Chan et al., 1992; Cha et al., 1992; Simmonds et al., 1993b), the envelope protein E1 (Cha et al., 1992; Stuyver et al., 1993b; Bukh et al., 1993), NS3 (Tsukiyama-Kohara et al., 1991; Chan et al., 1992) and the 5' non-coding region (5'NCR; Chan et al., 1992; Cha et al., 1992; Bukh et al., 1992; Stuyver et al., 1993a). Several different classification schemes have been proposed (Enomoto et al., 1990; Houghton et al., 1991; Okamoto et al., 1992; Chan et al., 1992; Mori et al., 1992; Cha et al., 1992; Simmonds et al., 1993a), and discussions have recently taken place over the most useful way to identify and assign genotypes of HCV. In particular, it is not currently clear whether comparison of subgenomic regions is sufficient, as recently proposed on the basis of sequence comparisons of part of NS-5 (Simmonds et al., 1993a), or whether it is necessary to obtain complete genomic sequences before definitive classification may be carried out (Okamoto et al., 1992; Bukh et al., 1993).
The nucleotide sequence data reported in this paper have been submitted to GenBank and assigned accession numbers L29459 to L29467. 0001-2244 © 1994 SGM
We have previously obtained evidence for six major genotypes of HCV in a worldwide collection of 76 samples from infected individuals by sequence comparisons in a 222 base fragment of the NS-5 gene (Simmonds et al., 1993a). Many of these HCV types comprise a number of more closely related subtypes, leading to a total of 11 genetically distinct virus populations. To address the issue of whether sequence comparisons in this region accurately model interrelationships between variants over the complete genome, we compared the phylogenetic tree produced from this fragment of NS-5 with that of the complete genomic sequence of types 1a, 1b, 2a and 2b. The evolutionary relationships between the complete sequences closely matched those obtained from analysis of NS-5, confirming that, for at least these genotypes, this region is representative (Holmes et al., 1994). We have also carried out similar sequence comparisons for several different coding regions of a limited number of viruses of types 1a, 1b, 2a, 2b, 2c, 3a and 4a, and again found equivalent phylogenetic relationships in each region studied (Chan et al., 1992; Cha et al., 1992).
Recently, a large amount of comparative sequence data for the E1 region of HCV variants was published, and on the basis of amino acid similarity between the 51 sequences, the existence of possibly as many as 12 genotypes was inferred (Bukh et al., 1993). We describe here a comparison of the phylogenetic relationships between nucleotide sequences in the E1 region with those which exist between the previously published NS-5 sequences. This provides a more stringent test of the validity of our proposed classification of HCV, as it allows a direct comparison of the evolutionary relationships between most of the genotypes in two widely spaced regions of the virus genome. Although the two sets of data include sequences from different infected individuals, we have been able to link them in two different ways. Firstly, the corresponding fragments of published complete genomic sequences were included in our analyses of E1 and NS-5, allowing several phylogenetic groupings in the two regions to be matched. Secondly, as the sequences in the 5'NCR of most of the variants published for E1 have been previously described (Bukh et al., 1992), we were able to match them to the corresponding 5'NCR sequences of the variants that we have previously classified in NS-5. In these ways, we have been able to establish equivalences between most of the genotypes for the two regions.
Methods
Nucleotide sequences
(i) NS-5. Sequences correspond to those obtained in our previous analysis of this region and include 35 sequences published elsewhere. Individual nucleotide sequences are numbered 1 to 76 according to Table 1 of Simmonds et al. (1993a).
(ii) E1. Sequence data between positions 574 and 1149 (Choo et al., 1991) were derived from two sources. The first source was fragments of complete genomic sequences from several previous studies [HCV-1, Choo et al., 1991 (1); HCV-H, Inchauspe et al., 1991 (6); HCV-J, Kato et al., 1990 (17); HCV-BK, Takamizawa et al., 1991 (18); T, Chen et al., 1992 (19); HPCGENOM, Bi et al., GenBank accession number L02836 (26); HPCJTA and HPCJTB, Tanaka et al., 1992 (27 and 28); HCVJKIG, Honda et al., GenBank accession number X61596 (31); HC-J6, Okamoto et al., 1991 (36); HC-J8, Okamoto et al., 1992 (49)]. The numbers in parentheses refer to those allocated to each sequence by Simmonds et al. (1993a). The second source was previously published E1 sequences from 51 infected individuals (Bukh et al., 1993). The latter sequences have been assigned numbers 1e to 51e as described in Table 1. Partial sequences from E1 (nucleotide positions 987 to 1022) were also obtained from four of the samples that were previously sequenced in the NS-5 region (numbers 52, 54, 56 and 57; Cha et al., 1992).
(iii) Core. Sequence comparisons were made between positions 15 and 384 in the core gene for the 13 sequences listed in a previous analysis (Chan et al., 1992) and the corresponding fragments of a further seven complete genomic sequences (T, HPCGENOM, HPCJTA and HPCJTB, HCVJKIG and HC-J8).
(iv) 5'NCR. Sequences in the 5'NCR between positions -245 and -72 were obtained for 35 samples for which sequence data in the NS-5 region are available (Simmonds et al., 1993a), and for 35 samples where the E1 sequence was determined (Bukh et al., 1992). These sequences were compared with the corresponding fragments of the 11 previously published complete genomic sequences listed above. Restriction fragment length polymorphisms (RFLPs) between different 5'NCR sequences were calculated for previously described combinations of restriction enzymes as described in the legend to Fig. 1.
Table 1. Identification of samples sequenced in E1
No.    Isolate    No.    Isolate
1e     DK7        27e    US10
2e     US11       28e    T9
3e     DR4        29e    T2
4e     DR1        30e    DK11
5e     DK9        31e    SW3
6e     SW1        32e    DK8
7e     S14        33e    T8
8e     S18        34e    S83
9e     IND8       35e    DK12
10e    IND5       36e    HK10
11e    SW2        37e    S2
12e    HK3        38e    S54
13e    HK8        39e    S52
14e    S45        40e    Z4
15e    D3         41e    Z1
16e    T3         42e    Z6
17e    HK5        43e    Z7
18e    HK4        44e    DK13
19e    US6        45e    SA5
20e    P10        46e    SA7
21e    SA10       47e    SA4
22e    T10        48e    SA1
23e    DK1        49e    SA6
24e    S9         50e    SA13
25e    D1         51e    HK2
26e    T4
Patterns of bands were coded as previously described (McOmish et al., 1993b), with the addition of a new pattern (E) for the HinfI-MvaI digestion of sequence number 45 (no cuts with either enzyme; one band of 251 base pairs).
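Predicting an RFLP pattern from a sequence amounts to locating the recognition sites of each enzyme and reading off the resulting fragment lengths. The following is a minimal sketch of that step only, not the procedure used in this study. The recognition sites and cut offsets are the commonly listed ones for these enzymes and should be treated as assumptions, the function names are hypothetical, and the input is an invented toy sequence rather than an HCV 5'NCR sequence.

```python
import re

# Recognition sites as regular expressions over DNA (T, not U), with the
# offset of the cut inside the site. These values are assumptions taken
# from commonly cited enzyme listings, not from the paper.
ENZYMES = {
    "HaeIII": ("GGCC", 2),
    "RsaI": ("GTAC", 2),
    "HinfI": ("GA[ACGT]TC", 1),
    "ScrFI": ("CC[ACGT]GG", 2),
    "MvaI": ("CC[AT]GG", 2),
}


def cut_positions(seq, enzymes):
    """Return sorted cut positions (0-based, between bases) for the enzymes."""
    seq = seq.upper().replace("U", "T")
    cuts = set()
    for name in enzymes:
        pattern, offset = ENZYMES[name]
        # A lookahead allows overlapping sites to be found as well.
        for match in re.finditer("(?=" + pattern + ")", seq):
            cuts.add(match.start() + offset)
    return sorted(cuts)


def fragment_lengths(seq, enzymes):
    """Lengths of the fragments produced by a complete digest, 5' to 3'."""
    bounds = [0] + cut_positions(seq, enzymes) + [len(seq)]
    return [b - a for a, b in zip(bounds, bounds[1:])]


# Invented 20-base example with one HaeIII site and one RsaI site.
toy = "AAAAGGCCAAAAGTACAAAA"
print(fragment_lengths(toy, ["HaeIII", "RsaI"]))
```

A sequence with no site for either enzyme yields a single fragment equal to its full length, which corresponds to the "no cuts" case described for pattern E.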
Nucleotide sequence comparisons. Distances between pairs of sequences were estimated using the DNADIST program of the PHYLIP package (version 3.5c) kindly provided by Dr J. Felsenstein (Felsenstein, 1993) using a model that allows for different rates of transition and transversion and different frequencies of the four bases. Phylogenetic trees were constructed using the neighbour-joining algorithm on the previous set of pairwise distances (Saitou & Nei, 1987) using the PHYLIP program, NEIGHBOR. Statistical analysis of the frequencies of sequence distances within four regions of the HCV genome (core, E1, NS-4 and NS-5) was carried out with in-house software.
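To illustrate what a distance that treats transitions and transversions separately involves, the sketch below computes the Kimura two-parameter distance between two aligned sequences. This is a simpler stand-in and not the DNADIST model used here, which additionally allows for unequal base frequencies. The function name is hypothetical.

```python
import math

PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}


def k2p_distance(seq1, seq2):
    """Kimura two-parameter distance between two aligned sequences.

    Sites with a gap or ambiguous base in either sequence are skipped.
    """
    s1 = seq1.upper().replace("U", "T")
    s2 = seq2.upper().replace("U", "T")
    if len(s1) != len(s2):
        raise ValueError("sequences must be aligned to the same length")
    sites = transitions = transversions = 0
    for a, b in zip(s1, s2):
        if a not in "ACGT" or b not in "ACGT":
            continue
        sites += 1
        if a == b:
            continue
        # Purine<->purine or pyrimidine<->pyrimidine changes are transitions.
        if {a, b} <= PURINES or {a, b} <= PYRIMIDINES:
            transitions += 1
        else:
            transversions += 1
    if sites == 0:
        raise ValueError("no comparable sites")
    if transitions == 0 and transversions == 0:
        return 0.0
    p = transitions / sites    # proportion of transitional differences
    q = transversions / sites  # proportion of transversional differences
    # d = -1/2 ln(1 - 2P - Q) - 1/4 ln(1 - 2Q); math.log raises ValueError
    # when the sequences are too divergent for the correction to be defined.
    return -0.5 * math.log(1 - 2 * p - q) - 0.25 * math.log(1 - 2 * q)
```

A matrix of such pairwise distances is the kind of input a neighbour-joining program takes.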
Results
Phylogenetic analysis of E1
Published nucleotide sequences from E1 and the corresponding regions of 11 complete genomic sequences were compared using neighbour-joining and the resulting phylogenetic tree was plotted in unrooted form (Fig. 2a). A total of 12 clearly defined phylogenetic clusters were obtained, in agreement with a previous analysis of amino acid sequence similarity (Bukh et al., 1993). Overall, the topology and branching order of the E1 phylogenetic tree is remarkably similar to that of NS-5 (Fig. 2b). In particular, the phylogenetic tree of E1 variants splits into six major branches, some of which are further divided into more closely related sequences. Overall, the phylogenetic tree shows the same hierarchical division into types and subtypes that was previously observed for NS-5.
Comparison of sequences in the 5'NCR
Although the 5'NCR is the most conserved region in the HCV genome, it is possible to identify many of the different genotypes from sequence comparisons in this region. Indeed, 5'NCR sequences amplified from plasma of infected individuals have been used in several different typing assays based upon RFLP analysis (Nakao et al., 1991; McOmish et al., 1993a, b) or hybridization with type-specific primers (Stuyver et al., 1993a). To match genotypes identified in the NS-5 region with those in E1, the corresponding sequences in the 5'NCR from each were compared (Fig. 1). The sequences may be readily assigned into at least seven groups of identical or nearly identical sequences (A to G), which show a series of conserved differences from members of other groups. Groups differed from each other by between seven (between groups A and F) and 25 substitutions (groups C and D). In many cases within a group there were perfect matches between the 5'NCR sequences of variants previously sequenced in NS-5 (Simmonds et al., 1993a) with those in E1 (Bukh et al., 1993), although several of the 5'NCR groups contained sequences that differed by one to three nucleotide substitutions from the group consensus. Sequences from Zaire (variants 40e to 43e) were heterogeneous and in some comparisons showed as many as four differences from each other and from DK13 (44e) and EG-7, EG-13 and EG-19 (variants 69 to 71). For the purposes of the analysis described below, Z6, Z7 and DK13 (42e to 44e) were grouped together with EG-7 and EG-13 (69 and 70), as they differ from each other at only one nucleotide position, and with EG-19 (71), which differs at two (5'NCR group E). The other Zairean sequences (40e and 41e) differ from the others by four or five substitutions and were therefore left unassigned.
Comparison of genotype identification in E1 and NS-5
The groupings for E1 were matched with those for NS-5 in two ways. First, the complete genomic sequences that have been previously designated type 1a (variants 1 and 6), type 1b (variants 17, 18, 19, 26, 27, 28 and 31), type 2a (variant 36) and type 2b (variant 49) were identified in the two phylogenetic trees. These sequences were found in four separate groups in both trees. In both cases the sequences in the groups containing the type 1a and 1b variants were all more closely related to each other than they were to types 2a and 2b, showing that a hierarchy of types and subtypes also occurs in the E1 region. The second way to match E1 and NS-5 groupings was to compare the distribution of 5'NCR sequences between the E1 and NS-5 trees (Fig. 2). Variants with group A 5'NCR sequences include those previously assigned as types 1a and 1b in the NS-5 region (numbers 1 to 31; Fig. 2b) and types 1a and 1b in the E1 region (numbers 1e to 25e; Fig. 2a). In this case, two phylogenetic groupings that are distinct in the coding region of the virus genome may contain viruses with similar or identical sequences in the 5'NCR. For example, sample 3e (type 1a) has a 5'NCR sequence identical to those of the type 1b variants 15, 10e, 11e, 14e, 15e, 17e, 19e, 20e, 22e and 24e (Fig. 1). Group B 5'NCR sequences include the five previously designated type 2a variants (35 to 39) and the three type 2c variants (numbers 50 to 53; Fig. 1; Simmonds et al., 1993a). Four variants that were sequenced in E1 also contained similar sequences in the 5'NCR (26e to 28e and 34e). The complete genome sequence of HC-J6 (number 36) clusters with variants 26e to 28e in the E1 region, identifying these variants as type 2a. Similarly, group C 5'NCR matches variants 40 to 48 (type 2b in NS-5) with 30e to 33e in E1 (Fig. 2), and this is confirmed by the clustering of the complete genome sequence of
[Fig. 1: alignment of variable sites in the 5'NCR. Columns: 5'NCR group (A to G); sequence (complete genomes, NS-5 variants, E1 variants); 5'NCR nucleotide position (-); RFLP pattern.]
Fig. 1. Comparison of sequences in the 5'NCR of HCV variants previously sequenced in the NS-5 region (column 3; Simmonds et al., 1993a), previously sequenced in E1 (column 4; Bukh et al., 1993) and variants for which the complete genomic sequence has been determined (column 2; Simmonds et al., 1993a). Sequence groups A to G (column 1) are individually boxed. Only variable sites are shown and the numbers refer to nucleotide positions -245 to -72, numbered as in Choo et al. (1991). * Indicates inserted bases present
HC-J8 (type 2b; number 49) in these two phylogenetic groups. With the information presented so far, it is not clear whether variants assigned to the third subtype of genotype 2 (type 2c; numbers 50 to 53) that all contain group B sequences in the 5'NCR are the same genotype as the third subtype of type 2 found upon phylogenetic analysis of the E1 region (34e). To investigate this, we have compared the sequence of a fragment of the E1 sequence of one of the type 2c variants (52) between nucleotide positions 987 and 1022 (Cha et al., 1992) with the equivalent region of the published type 2 sequences in E1 (Fig. 3). The nucleotide sequence of 34e differed from type 2c variant 52 at only two nucleotide positions, compared with at least eight differences when compared to other subtypes of type 2 and 15 to 17 differences upon comparison with type 1, indicating that 34e and 52 correspond to the same subtype. Variants assigned as type 3a upon analysis of NS-5 all contain sequences in the 5'NCR corresponding to group D, a group that also contains E1 variants 35e to 39e (Fig. 1). To confirm that these two groups are equivalent, we have compared fragments of the E1 sequence of the type 3a variants 54, 56 and 57 with corresponding regions of the E1 sequences 35e to 39e (Fig. 3). Sequences within these two groups differed from each other by only one to five nucleotide substitutions compared with more than 15 differences with other genotypes. This is similar to the difference observed within subtypes of other genotypes, providing evidence that they belong to the same subtype of type 3. Phylogenetic analysis of NS-5 sequences reveals a second subtype of type 3 (3b) whose 5'NCR sequence is unknown. This subtype is probably not represented in the E1 dataset because the type 3 E1 sequences are monophyletic (Fig. 2a) and because it has only thus far been found in Thailand; this geographical region was not included in the E1 dataset.
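The fragment comparisons above come down to counting mismatches between aligned sequences of equal length and asking which reference lies closest. A minimal sketch of that count follows; the function names are hypothetical and the fragments are invented toy sequences, not the sequences shown in Fig. 3.

```python
def count_differences(seq1, seq2):
    """Number of positions at which two aligned sequences differ."""
    if len(seq1) != len(seq2):
        raise ValueError("sequences must be aligned to the same length")
    return sum(1 for a, b in zip(seq1.upper(), seq2.upper()) if a != b)


def closest(query, references):
    """Return (name, differences) for the reference closest to the query."""
    counts = {name: count_differences(query, seq)
              for name, seq in references.items()}
    best = min(counts, key=counts.get)
    return best, counts[best]


# Hypothetical 12-nucleotide fragments, for illustration only.
references = {
    "ref_A": "GACGGCGUUGGU",
    "ref_B": "GACAGCAUUGCU",
}
query = "GACAGCAUUGGU"
print(closest(query, references))
```

With so short a fragment, the count is only informative when the gap between the nearest and the next-nearest group is large, as it is in the comparisons reported here.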
The 5'NCR sequence of Z6, Z7 and DK13 (42e to 44e) was similar to the sequences of variants 69 to 71 (group E), allowing these two sets to be collectively identified as type 4. In this case, two separate subtypes of type 4 in E1 are identical in sequence in the 5'NCR (Z6, Z7 and DK13). Because of this, it is not possible to decide whether one of the two subtypes in E1 corresponds to the single subtype found previously in NS-5 (69, 70 and 71). The two other Zairean variants, Z4 and Z1 (numbered 40e and 41e), appear as two further subtypes of type 4 upon phylogenetic analysis of the E1 region, and each showed distinct sequences in the 5'NCR.
The final assignments can be made between the type 5a sequences in the NS-5 region (numbers 72 to 75) and sequences of 45e to 50e in E1 (group F in the 5'NCR), and between sequence 76 and sequence 51e (both in group G in the 5'NCR). In both cases the origins of these two genotypes were the same, type 5a from South African patients and type 6a from Hong Kong. In summary, we have been able to match all six of the major genotypes and five of the nine subtypes in the two datasets. The E1 dataset does not contain sequences corresponding to types 1c or 3b and at least three subtypes of type 4 are not represented amongst the NS-5 sequences. Although in a few cases it was not possible definitively to match individual subtypes in the two regions, none of the sequence data in any of the three regions conflict with any other.
Distribution of type and subtype sequence distances
Pairwise comparison between each E1 nucleotide sequence (64 sequences, 2016 comparisons) produces three discrete distributions of evolutionary distance (Fig. 4), similar to that found for NS-5 (Simmonds et al., 1993a). Distances in the range 0.43 to 0.85 correspond to those between the major genotypes. This range of distances differs from those found between the subtypes (0.26 to 0.45), which in turn is separated from the range of variation found upon comparison of individual variants within a subtype (0.009 to 0.13). As with NS-5, we observed no overlap between these ranges, apart from a slightly closer than expected distance of 0.43 between the type 4 DK13 sequence and the type 1a sequence DR9 that lies just within the range of sequence distances between subtypes.
Provisional assignment of new types and subtypes
Given the restriction in the range of sequence diversity that exists between type and subtype in NS-5, we have previously suggested that the percentage sequence similarity with existing NS-5 sequences might be used provisionally to assign new genotypes and subtypes (Simmonds et al., 1993a).
We chose NS-5 because this region is variable and could be readily amplified from plasma of infected individuals, and because there was already a considerable quantity of published sequence data in this region. The above analysis suggests that percentage similarity with the E1 sequences available from GenBank (Bukh et al., 1993) can also be used for
in some 5'NCR sequences between nucleotides -145 and -144 (two bases) and between -139 and -138 (one base). The right-hand three columns indicate the predicted RFLP patterns with three combinations of restriction enzymes, HaeIII/RsaI (5), HinfI/ScrFI (6) and HinfI/MvaI (7), according to the previously described code (McOmish et al., 1993b).
[Fig. 2: unrooted phylogenetic trees, panels (a) and (b). Symbol key: 5'NCR groups A to G; not determined; not classified.]
Fig. 2. (a) Phylogenetic analysis of 64 sequences in the E1 region. Numbered sequences correspond to the E1 sequences of previously published complete genomes: 1, HCV-1; 6, HCV-H; 17, HCV-J; 18, HCV-BK; 19, T; 26, HPCGENOM; 27, HPCJTA; 28, HPCJTB; 31, HCVJKIG; 36, HC-J6; 49, HC-J8; see Methods for references. Genotypes that have been positively identified in E1 have been labelled (see text). (b) Phylogenetic analysis of 76 sequences in the NS-5 region, with complete genome sequences numbered as in (a). Genotypes are labelled as previously proposed (Simmonds et al., 1993a). Small solid dots denote sequences for which 5'NCR sequence was not determined.
[Fig. 3: alignment of partial E1 nucleotide sequences, positions 987 to 1022. Columns: type; subtype; sequence (columns 1, 2 and 3); isolate.]
Fig. 3. Comparison of partial E1 nucleotide sequences (positions 987 to 1022; Bukh et al., 1993) with those of variants previously sequenced in NS-5 (column 2; Simmonds et al., 1993a) and assigned as type 2c and 3a. Differences from the prototype HCV-1 sequence are shown. Representative sequences from types 1a, 1b, 2a and 2b are included for comparative purposes; column 1, sequence derived from complete genome sequences; column 3, sequence derived from previous E1 data.
[Fig. 4: histogram. x-axis: evolutionary distance (0.0 to 0.8); y-axis: number of comparisons.]
Fig. 4. Distribution of evolutionary differences on pairwise comparison of the 64 E1 sequences (2016 comparisons). Number of calculated evolutionary distance measurements (in increments of 0.02) from 0.00 to 0.86 shown on y-axis.
Table 2. Sequence similarity between type, subtype and isolate in core, E1 and NS-5 regions
                    Core                       E1                   NS-5
Start*              15                         574                  7975
End*                384                        1149                 8196
Length              370                        576                  222
No. seqs†           20                         64                   76
Isolate   Min‡      93.8 (JTA/B/GENANTI)       88.0 (T2/T9)         87.8 (I10/T983)
          Max‡      97.3 (OKA4core/J491)       99.1 (S54/S52)       99.5 (J483/J491)
Subtype   Min       87.8 (HC-J6/HC-J8)         67.5 (S83/DK8)       74.8 (K1-3/2TY4)
          Max       92.8 (HCV-J/HCV-H)         78.6 (DK13/Z6)       86.0 (2TY4/GM2)
Type      Min       81.2 (HC-J8/HPC-JTA/B)     53.5 (SA6/T8)        56.2 (EG-7/K2a)
          Max       87.8 (E-b1/OKA1core)       69.4 (DK13/DR9)      72.1 (T10/K1)
* Sequences numbered as in Choo et al. (1991).
† Number of sequences compared.
‡ Minimum and maximum percentage sequence similarities. Individual sequence comparisons at the extremes of the range are indicated.
the identification of most of the known genotypes and the assignment of new virus types. We have also carried out equivalent analyses of other regions of the genome such as the core region for which there are published sequence data for several genotypes (Chan et al., 1992; Simmonds et al., 1993b). The upper and lower limits of the ranges of sequence divergence that exist between type, subtype and isolate in each genomic region are compared in Table 2. The core region is the most conserved, with intertypic sequence similarities between major genotypes ranging from 81 to 88% and between subtypes from 88 to 93%. In the E1 region, the range of sequence similarity between types is similar to that of NS-5 (54 to 69% compared with 56 to 72%), whereas subtypes are less closely related to each other in E1 (range 68 to 79% compared with 75 to 86% for NS-5). Using these percentage similarities it would be possible to assign the correct genotype for all 96 of the sequences in the core and NS-5 regions, and all but one of the 64 sequences in the E1 region (DK13; see above). However, the identification of DK13 as a subtype of type 4 is apparent from the phylogenetic tree for E1 (Fig. 2), emphasizing the value of this type of analysis to confirm genotype identifications.
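The provisional assignment described here can be pictured as a range lookup: given the percentage similarity between two sequences in one region, report which of the observed ranges contains it. The sketch below uses the minimum and maximum values from Table 2. The function name and the choice to return every matching level are assumptions made for illustration.

```python
# Minimum and maximum percentage similarities taken from Table 2.
RANGES = {
    "core": {"type": (81.2, 87.8), "subtype": (87.8, 92.8), "isolate": (93.8, 97.3)},
    "E1": {"type": (53.5, 69.4), "subtype": (67.5, 78.6), "isolate": (88.0, 99.1)},
    "NS-5": {"type": (56.2, 72.1), "subtype": (74.8, 86.0), "isolate": (87.8, 99.5)},
}


def relationship(region, similarity):
    """Return the levels whose observed range contains the similarity.

    An empty list means the value falls in a gap between ranges; two
    entries mean the ranges overlap at that value.
    """
    return [level for level, (low, high) in RANGES[region].items()
            if low <= similarity <= high]


print(relationship("NS-5", 80.0))  # inside the NS-5 subtype range
print(relationship("E1", 69.4))    # the DK13/DR9 comparison
```

Returning a list rather than a single label keeps the E1 overlap visible: a value such as 69.4% falls in both the type and subtype ranges, which is the situation where the phylogenetic tree is needed to settle the assignment.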
Discussion
The phylogenetic trees and the range of distances found upon pairwise comparisons of nucleotide sequences were equivalent for the E1 and NS-5 regions. Both regions showed a hierarchical division of sequences into six major groupings (designated HCV types in the NS-5 region), with further division into more closely related subtypes (Enomoto et al., 1990; Chan et al., 1992; Simmonds et al., 1993a). Furthermore, all variants that were assigned a particular genotype in one region of the virus genome always contained sequences corresponding
to the same genotype elsewhere in the genome. These include the 13 complete genome sequences, and the 70 variants described in this study for which either E1 and 5'NCR or NS-5 and 5'NCR sequences were available. Together, these data suggest that relatively short subgenomic regions of the virus contain sufficient sequence information to allow reliable identification of HCV genotypes, irrespective of whether the sequence encodes structural or non-structural proteins, and irrespective of the degree of sequence variability within the coding region.
A potential objection to the use of subgenomic regions for virus identification is that recombination between different virus genotypes infecting the same cell might produce hybrid viruses comprising sequences corresponding to more than one genotype. We have found no evidence for recombination between genotypes in this study of 13 complete genome sequences and 70 HCV variants that were simultaneously sequenced in the E1 or NS-5 region and 5'NCR. This indicates that recombination is rare in vivo or that hybrids produced between different genotypes are not viable. In either case, recombination does not currently pose a problem for virus classification. However, until more HCV variants have been characterized, we have previously recommended that new genotypes should be assigned only after sequence comparisons are made with existing variants in at least two separate coding regions of the genome (Simmonds et al., 1993a).
The sequence comparisons described above between the E1 and NS-5 regions were based upon previous observations of an association between sequences in the 5'NCR and genotype. Although all six major HCV types are clearly distinct, this region is so highly conserved that, in many cases, variants corresponding to different subtypes contain indistinguishable sequences (Fig. 1). This does not invalidate the proposed type/subtype
nomenclature of HCV, but serves to emphasize the difference in sequence diversity between type and subtype. Although 5'NCR sequences of types 2a and 2b are consistently distinct from each other, the absence of corresponding differences between other subtypes (for example types 1a and 1b; types 2a and 2c; type 4 Z6 or Z7 and DK13) appears to preclude any typing assay based on 5'NCR sequences for identification of all HCV types and subtypes. RFLP analysis of this region can only differentiate the six major genotypes and type 2b from 2a and 2c (Fig. 1; McOmish et al., 1993b). Similar constraints apply to assays based upon hybridization of genotype-specific probes to amplified 5'NCR sequences (Stuyver et al., 1993a).

Simple methods for culture of HCV in vitro have not been reported, and animal models of infection are expensive and impractical for the majority of laboratories. There have been no attempts to classify variants of HCV by conventional serotyping reagents, a method which has been used for other genera in the Flaviviridae family, such as dengue virus, and within the Picornaviridae (poliovirus and coxsackie virus). However, because virus RNA may be readily amplified from the plasma of infected individuals by PCR and then sequenced, it has proved relatively easy to identify different genotypes of HCV on the basis of nucleotide sequence comparisons, without any information on the biological significance of such classification schemes. It is not currently known whether the major genotypes or subtypes would correspond to the serotypes of other RNA virus groups. Little is known about whether infection with different genotypes is associated with differences in the course and severity of liver or extrahepatic disease, although several studies have documented a poorer response to treatment with α-interferon of patients infected with type 1b virus (Kanai et al., 1992; Yoshioka et al., 1992; Takada et al., 1992).
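As an illustration only, the cross-region concordance check discussed above (the same genotype in the 5'NCR and in E1 or NS-5, with subtypes sometimes indistinguishable in the 5'NCR) can be sketched in Python. This is a simplified stand-in, not the method used in this study: the analysis here relied on phylogenetic trees and pairwise distances, whereas the sketch assigns a query to the nearest reference by uncorrected p-distance. All function names are hypothetical, and the sequences are invented toy strings, not HCV data.

```python
from typing import Dict, List


def p_distance(a: str, b: str) -> float:
    """Proportion of differing sites between two aligned sequences (gaps skipped)."""
    if len(a) != len(b):
        raise ValueError("sequences must be aligned to the same length")
    sites = [(x, y) for x, y in zip(a.upper(), b.upper()) if x != "-" and y != "-"]
    if not sites:
        raise ValueError("no comparable sites")
    return sum(x != y for x, y in sites) / len(sites)


def closest_genotypes(query: str, references: Dict[str, str]) -> List[str]:
    """Return every reference label at the minimum p-distance (ties are kept)."""
    distances = {label: p_distance(query, seq) for label, seq in references.items()}
    best = min(distances.values())
    return sorted(label for label, d in distances.items() if d == best)


def major_type(label: str) -> str:
    """Reduce a type/subtype label to its major type, e.g. '1b' -> '1'."""
    return "".join(ch for ch in label if ch.isdigit())


def is_concordant(labels_by_region: Dict[str, List[str]]) -> bool:
    """True if all regions point to one and the same major type."""
    types = {major_type(l) for labels in labels_by_region.values() for l in labels}
    return len(types) == 1


# Toy reference panels (assumed, not real sequences). In the conserved
# region the two subtypes of type 1 are deliberately identical.
refs_ncr = {"1a": "ACGTACGTAC", "1b": "ACGTACGTAC", "2a": "ACGTTCGTAA"}
refs_e1 = {"1a": "GGATCCAAGT", "1b": "GGTTCCTAGT", "2a": "CCATGGAACT"}

ncr_hits = closest_genotypes("ACGTACGTAC", refs_ncr)  # tie between subtypes
e1_hits = closest_genotypes("GGTTCCTAGA", refs_e1)    # single subtype

print(ncr_hits, e1_hits, is_concordant({"5'NCR": ncr_hits, "E1": e1_hits}))
```

A variant whose regions pointed to different major types would return False from `is_concordant`, which is the pattern a recombinant between types would be expected to show. Keeping ties in `closest_genotypes` mirrors the situation in which a conserved region identifies the major type but not the subtype.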
To avoid confusion in the assignment of HCV genotypes in the future, it will be necessary for different laboratories to use a standardized nomenclature and methodology for assigning new genotypes. The system described in this and a previous paper (Simmonds et al., 1993a) has found wide acceptance, and is strengthened by the finding that the same type/subtype relationships between variants in NS-5 upon which the proposals were based are also found in E1. Furthermore, with the exception of the 5'NCR and certain hypervariable regions of the E2 gene, we believe that any region of the genome can be used as the basis for virus genotypic identification, provided sequences from each of the currently recognized genotypes are available. Sequences in the core region are particularly suitable for the purpose as this region is relatively easy to amplify from
the plasma of infected individuals and sequences of all six genotypes and several subtypes are published or about to become available. One of the major tasks of any committee set up to co-ordinate the assignment of new genotypes is the collection and dissemination of representative sequences from the currently recognized genotypes in as many regions of the genome as possible.

The authors are grateful to Dr E. Follett and B. C. Dow of the Scottish National Blood Transfusion Service for providing samples for sequence analysis, and to N. Linskail for technical assistance. This work was funded in part by the MRC (grant number G9020615CA) and by the Scottish National Blood Transfusion Service.
References

BUKH, J., PURCELL, R. H. & MILLER, R. H. (1992). Sequence analysis of the 5' noncoding region of hepatitis C virus. Proceedings of the National Academy of Sciences, U.S.A. 89, 4942-4946.
BUKH, J., PURCELL, R. H. & MILLER, R. H. (1993). At least 12 genotypes of hepatitis C virus predicted by sequence analysis of the putative E1 gene of isolates collected worldwide. Proceedings of the National Academy of Sciences, U.S.A. 90, 8234-8238.
CHA, T.-A., BEALL, E., IRVINE, B., KOLBERG, J., CHIEN, D., KUO, G. & URDEA, M. S. (1992). At least five related, but distinct, hepatitis C viral genotypes exist. Proceedings of the National Academy of Sciences, U.S.A. 89, 7144-7148.
CHAN, S.-W., McOMISH, F., HOLMES, E. C., DOW, B., PEUTHERER, J. F., FOLLETT, E., YAP, P. L. & SIMMONDS, P. (1992). Analysis of a new hepatitis C virus type and its phylogenetic relationship to existing variants. Journal of General Virology 73, 1131-1141.
CHAYAMA, K., TSUBOTA, A., ARASE, Y., SAITOH, S., KOIDA, I., IKEDA, K., MATSUMOTO, T., KOBAYASHI, M., IWASAKI, S., KOYAMA, S., MORINAGA, T. & KUMADA, H. (1993). Genotypic subtyping of hepatitis C virus. Journal of Gastroenterology and Hepatology 8, 150-156.
CHEN, P. J., LIN, M. H., TAI, K. F., LIU, P. C., LIN, C. J. & CHEN, D. S. (1992). The Taiwanese hepatitis C virus genome: sequence determination and mapping the 5' termini of viral genomic and antigenomic RNA. Virology 188, 102-113.
CHOO, Q. L., KUO, G., WEINER, A. J., OVERBY, L. R., BRADLEY, D. W. & HOUGHTON, M. (1989). Isolation of a cDNA derived from a bloodborne non-A, non-B hepatitis genome. Science 244, 359-362.
CHOO, Q. L., RICHMAN, K. H., HAN, J. H., BERGER, K., LEE, C., DONG, C., GALLEGOS, C., COIT, D., MEDINA-SELBY, R., BARR, P. J., WEINER, A. J., BRADLEY, D. W., KUO, G. & HOUGHTON, M. (1991). Genetic organization and diversity of the hepatitis C virus. Proceedings of the National Academy of Sciences, U.S.A. 88, 2451-2455.
ENOMOTO, N., TAKADA, A., NAKAO, T. & DATE, T. (1990). There are two major types of hepatitis C virus in Japan. Biochemical and Biophysical Research Communications 170, 1021-1025.
FELSENSTEIN, J. (1993). PHYLIP Inference Package Version 3.5. Seattle: Department of Genetics, University of Washington.
HOLMES, E. C., SIMMONDS, P., CHA, T.-A., CHAN, S.-W., McOMISH, F., IRVINE, B., BEALL, E., YAP, P. L., KOLBERG, J. & URDEA, M. S. (1994). Derivation of a rational nomenclature for hepatitis C virus by phylogenetic analysis of the NS-5 region. In Viral Hepatitis and Liver Disease. Edited by K. Nishioka, H. Suzuki, S. Mishiro & T. Oda. Tokyo: Springer-Verlag (in press).
HOUGHTON, M., WEINER, A., HAN, J., KUO, G. & CHOO, Q. L. (1991). Molecular biology of the hepatitis C viruses: implications for diagnosis, development and control of viral disease. Hepatology 14, 381-388.
INCHAUSPE, G., ZEBEDEE, S., LEE, D. H., SUGITANI, M., NASOFF, M. & PRINCE, A. M. (1991). Genomic structure of the human prototype strain H of hepatitis C virus: comparison with American and Japanese isolates. Proceedings of the National Academy of Sciences, U.S.A. 88, 10292-10296.
KANAI, K., KAKO, M. & OKAMOTO, H. (1992). HCV genotypes in chronic hepatitis C and response to interferon. Lancet 339, 1543.
KATO, N., HIJIKATA, M., OOTSUYAMA, Y., NAKAGAWA, M., OHKOSHI, S., SUGIMURA, T. & SHIMOTOHNO, K. (1990). Molecular cloning of the human hepatitis C virus genome from Japanese patients with non-A, non-B hepatitis. Proceedings of the National Academy of Sciences, U.S.A. 87, 9524-9528.
McOMISH, F., CHAN, S.-W., DOW, B. C., GILLON, J., FRAME, W. D., CRAWFORD, R. J., YAP, P. L., FOLLETT, E. A. C. & SIMMONDS, P. (1993a). Detection of three types of hepatitis C virus in blood donors: investigation of type-specific differences in serological reactivity and rate of alanine aminotransferase abnormalities. Transfusion 33, 7-13.
McOMISH, F., YAP, P. L., DOW, B. C., FOLLETT, E. A. C., SEED, C., KELLER, A. J., COBAIN, T. J., KRUSIUS, T., KOLHO, E., NAUKKARINEN, R., LIN, C., LAI, C., LEONG, S., MEDGYESI, G., HEJJAS, M., KIYOKAWA, H., FUKUDA, K., SAEED, A. A., AL RASHEED, A., LIN, M. & SIMMONDS, P. (1993b). Geographical distribution of different hepatitis C virus genotypes in blood donors: an international collaborative survey. Journal of Clinical Microbiology (in press).
MORI, S., KATO, N., YAGYU, A., TANAKA, T., IKEDA, Y., PETCHCLAI, B., CHIEWSILP, P., KURIMURA, T. & SHIMOTOHNO, K. (1992). A new type of hepatitis C virus in patients in Thailand. Biochemical and Biophysical Research Communications 183, 334-342.
NAKAO, T., ENOMOTO, N., TAKADA, N., TAKADA, A. & DATE, T. (1991). Typing of hepatitis C virus genomes by restriction fragment length polymorphism. Journal of General Virology 72, 2105-2112.
OKAMOTO, H., OKADA, S., SUGIYAMA, Y., KURAI, K., IIZUKA, H., MACHIDA, A., MIYAKAWA, Y. & MAYUMI, M. (1991). Nucleotide sequence of the genomic RNA of hepatitis C virus isolated from a human carrier: comparison with reported isolates for conserved and divergent regions. Journal of General Virology 72, 2697-2704.
OKAMOTO, H., KURAI, K., OKADA, S., YAMAMOTO, K., IIZUKA, H., TANAKA, T., FUKUDA, S., TSUDA, F. & MISHIRO, S. (1992). Full-length sequence of a hepatitis C virus genome having poor homology to reported isolates: comparative study of four distinct genotypes. Virology 188, 331-341.
SAITOU, N. & NEI, M. (1987). The neighbor-joining method: a new method for reconstructing phylogenetic trees. Molecular Biology and Evolution 4, 406-425.
SIMMONDS, P., HOLMES, E. C., CHA, T.-A., CHAN, S.-W., McOMISH, F., IRVINE, B., BEALL, E., YAP, P. L., KOLBERG, J. & URDEA, M. S. (1993a). Classification of hepatitis C virus into six major genotypes and a series of subtypes by phylogenetic analysis of the NS-5 region. Journal of General Virology 74, 2391-2399.
SIMMONDS, P., McOMISH, F., YAP, P. L., CHAN, S.-W., LIN, C. K., DUSHEIKO, G., SAEED, A. A. & HOLMES, E. C. (1993b). Sequence variability in the 5' non-coding region of hepatitis C virus: identification of a new virus type and restrictions on sequence diversity. Journal of General Virology 74, 661-668.
STUYVER, L., ROSSAU, R., WYSEUR, A., DUHAMEL, M., VANDERBORGHT, B., VAN HEUVERSWYN, H. & MAERTENS, G. (1993a). Typing of hepatitis C virus isolates and characterization of new subtypes using a line probe assay. Journal of General Virology 74, 1093-1102.
STUYVER, L., VAN ARNHEM, W., WYSEUR, A., DELEYS, R. & MAERTENS, G. (1993b). Analysis of the putative E1 envelope and NS4a epitope regions of HCV type 3. Biochemical and Biophysical Research Communications 192, 635-641.
TAKADA, N., TAKASE, S., ENOMOTO, N., TAKADA, A. & DATE, T. (1992). Clinical backgrounds of the patients having different types of hepatitis C virus genomes. Journal of Hepatology 14, 35-40.
TAKAMIZAWA, A., MORI, C., FUKE, I., MANABE, S., MURAKAMI, S., FUJITA, J., ONISHI, E., ANDOH, T., YOSHIDA, I. & OKAYAMA, H. (1991). Structure and organization of the hepatitis C virus genome isolated from human carriers. Journal of Virology 65, 1105-1113.
TANAKA, T., KATO, N., NAKAGAWA, M., OOTSUYAMA, Y., CHO, M. J., NAKAZAWA, T., HIJIKATA, M., ISHIMURA, Y. & SHIMOTOHNO, K. (1992). Molecular cloning of hepatitis C virus genome from a single Japanese carrier: sequence variation within the same individual and among infected individuals. Virus Research 23, 39-53.
TSUKIYAMA-KOHARA, K., KOHARA, M., YAMAGUCHI, K., MAKI, N., TOYOSHIMA, A., MIKI, K., TANAKA, S., HATTORI, N. & NOMOTO, A. (1991). A second group of hepatitis C viruses. Virus Genes 5, 243-254.
YOSHIOKA, K., KAKUMU, S., WAKITA, T., ISHIKAWA, T., ITOH, Y., TAKAYANAGI, M., HIGASHI, Y., SHIBATA, M. & MORISHIMA, W. (1992). Detection of hepatitis C virus by polymerase chain reaction and response to interferon-alpha therapy: relationship to genotypes of hepatitis C virus. Hepatology 16, 293-299.
(Received 15 November 1993; Accepted 21 December 1993)