articles
DNA sequence of both chromosomes of the cholera pathogen Vibrio cholerae John F. Heidelberg*, Jonathan A. Eisen*, William C. Nelson*, Rebecca A. Clayton, Michelle L. Gwinn*, Robert J. Dodson*, Daniel H. Haft*, Erin K. Hickey*, Jeremy D. Peterson*, Lowell Umayam*, Steven R. Gill*, Karen E. Nelson*, Timothy D. Read*, Herve Tettelin*, Delwood Richardson*, Maria D. Ermolaeva*, Jessica Vamathevan*, Steven Bass*, Haiying Qin*, Ioana Dragoi*, Patrick Sellers*, Lisa McDonald*, Teresa Utterback*, Robert D. Fleishmann*, William C. Nierman*, Owen White *, Steven L. Salzberg*, Hamilton O. Smith*², Rita R. Colwell³, John J. Mekalanos§, J. Craig Venter*² & Claire M. Fraser* * The Institute for Genomic Research, 9712 Medical Center Drive, Rockville, Maryland 20850, USA ³ Center of Marine Biotechnology, University of Maryland Biotechnology Institute, 701 East Pratt Street, Baltimore, Maryland 21202, USA, and Department of Cell and Molecular Biology, University of Maryland, College Park, Maryland 20742, USA § Harvard Medical School, Department of Microbiology and Molecular Genetics, 200 Longwood Avenue, Boston, Massachusetts 02115, USA
............................................................................................................................................................................................................................................................................
Here we determine the complete genomic sequence of the Gram negative, g-Proteobacterium Vibrio cholerae El Tor N16961 to be 4,033,460 base pairs (bp). The genome consists of two circular chromosomes of 2,961,146 bp and 1,072,314 bp that together encode 3,885 open reading frames. The vast majority of recognizable genes for essential cell functions (such as DNA replication, transcription, translation and cell-wall biosynthesis) and pathogenicity (for example, toxins, surface antigens and adhesins) are located on the large chromosome. In contrast, the small chromosome contains a larger fraction (59%) of hypothetical genes compared with the large chromosome (42%), and also contains many more genes that appear to have origins other than the g-Proteobacteria. The small chromosome also carries a gene capture system (the integron island) and host `addiction' genes that are typically found on plasmids; thus, the small chromosome may have originally been a megaplasmid that was captured by an ancestral Vibrio species. The V. cholerae genomic sequence provides a starting point for understanding how a free-living, environmental organism emerged to become a signi®cant human bacterial pathogen. Vibrio cholerae is the aetiological agent of cholera, a severe diarrhoeal disease that occurs most frequently in epidemic form1. Cholera has been epidemic in southern Asia for at least 1,000 years, but also spread worldwide to cause seven pandemics since 1817 (ref. 1). When untreated, cholera is a disease of extraordinarily rapid onset and potentially high lethality. Although clinical management of cholera has advanced over the past 40 years, cholera remains a serious threat in developing countries where sanitation is poor, health care limited, and drinking water unsafe. Vibrio cholerae as a species includes both pathogenic and nonpathogenic strains that vary in their virulence gene content2. This bacterium contains a wide variety of strains and biotypes, receiving and transferring genes for toxins3, colonization factors4,5, antibiotic resistance6, capsular polysaccharides that provide resistance to chlorine7 and new surface antigens, such as the 0139 lipopolysaccharide and O antigen capsule8,9. The lateral or horizontal transfer of these virulence genes by phage3, pathogenicity islands10,11 and other accessory genetic elements12 provides insights into how bacterial pathogens emerge and evolve to become new strains. Vibrio species represent a signi®cant portion of the culturable heterotrophic bacteria of oceans, coastal waters and estuaries13,14. Environmental studies show that these bacteria strongly in¯uence nutrient cycling in the marine environment. Various species of this genus are also devastating pathogens for ®n®sh, shell®sh and mammals. There is still much to be learned about the aquatic ecology and natural history of V. cholerae including its autochthonous (native) existence in endemic locales during cholera-free, interepidemic periods, which environmental factors, such as climate13,15, aided its re-emergence in Latin America, and which environmental factors are associated with its habitat in cholera endemic regions. For example, V. cholerae, during interepidemic periods, is an inhabitant ² Present address: Celera Genomics, 45 West Gude Drive, Rockville, Maryland 20850, USA
NATURE | VOL 406 | 3 AUGUST 2000 | www.nature.com
of brackish and estuarine waters, and in these environments is associated with zooplankton and other aquatic ¯ora and fauna16. The organism also enters a ``viable but nonculturable''17 state under certain conditions. Roles for these environmental interactions and this dormant physiological state in the emergence and persistence of pathogenic V. cholerae have been proposed14,17. Here we report the determination and analysis of the Vibrio cholerae genome sequence. This analysis represents an important step toward the complete molecular description of how this freeliving environmental organism emerged to become a human pathogen by horizontal gene transfer.
Genome analysis
The genome of V. cholerae was sequenced by the whole genome random sequencing method18. The genome consists of two circular chromosomes19,20 of 2,961,146 (chromosome 1) and 1,072,314 Table 1 General features of the Vibrio cholerae genome
Size (bp) Total number of sequences G+C percentage Total number of ORFs ORF size (bp) Percentage coding Number of rRNA operons (16S-23S-5S) Number of tRNA Number similar to known proteins Number similar to proteins of unknown function* Number of conserved hypothetical proteins² Number of hypothetical proteins³ Number of Rho-independent terminators
Chromosome 1
Chromosome 2
2,961,151 36,797 47.7 2,770 952 88.6 8 94 1,614 (58%) 163 (6%) 478 (17%) 515 (19%) 599
1,072,914 14,367 46.9 1,115 918 86.3 0 4 465 (42%) 66 (6%) 165 (15%) 419 (38%) 193
............................................................................................................................................................................. bp, base pairs. ORFs, open reading frames. * Proteins of unknown function, signi®cant sequence similarity (homology) to a named protein for which there is currently no known function. ² Conserved hypothetical protein, sequence similarity to a translation of another ORF, but there is currently no experimental evidence a protein is expressed. ³ Hypothetical protein, no signi®cant sequence similarity to another protein.
© 2000 Macmillan Magazines Ltd
477
articles (chromosome 2) base pairs, with an average G+C content of 46.9% and 47.7%, respectively. There are a total of 3,885 predicted open reading frames (ORFs) and 792 predicted Rho-independent terminators; with 2,770 and 1,115 ORFs and 599 and 193 Rho-independent terminators on the individual chromosomes (Table 1, Figs 1 and 2). Most genes required for growth and viability are located on chromosome 1, although some genes found only on chromosome 2 are also thought to be essential for normal cell function (for example, dsdA, thrS and the genes encoding ribosomal proteins L20 and L35). Additionally, many intermediaries of metabolic pathways are encoded only on chromosome 2 (Fig. 3). The replicative origin in chromosome 1 was identi®ed by similarity to the Vibrio harveyi and Escherichia coli origins, co-localization of genes (dnaA, dnaN, recF and gyrA) often found near the origin in prokaryotic genomes, and GC nucleotide skew (G-C/ G+C) analysis21. Based on these, we designated base-pair 1 in an intergenic region that is located in the putative origin of replication. Only the GC skew analysis was useful in identifying a putative origin on chromosome 2. This genomic sequence of V. cholerae con®rmed the presence of a large integron island (a gene capture system) located on chromosome 2 (125.3 kbp)22,23. The V. cholerae integron island contains all copies of the V. cholerae repeat (VCR) sequence and 216 ORFs (Fig. 1). However, most of these ORFs have no homology to other sequences. Among the recognizable integron island genes are three that encode gene products that may be involved in drug resistance (chloramphenicol acetyltransferase, fosfomycin resistance protein and glutathione transferase), several DNA metabolism enzymes (MutT, transposase, and an integrase), potential virulence genes (haemagglutinin and lipoproteins) and three genes which encode gene products similar to the `host addiction' proteins (higA, higB and doc), which are used by plasmids to select for their maintenance by host cells.
V. cholerae biology, notably its ability to inhabit diverse environments. These environments, in turn, may have selected the duplication and divergence of genes useful for specialized functions. Additionally, whereas El Tor strain N16961 carries only a single copy of the cholera toxin prophage, other V. cholerae strains carry several copies of this element24,25, and strains of the classical biotype have a second copy of the prophage that is localized on chromosome 2 (ref. 20). Thus, virulence genes are presumed to be subject to selective pressure, affecting copy number and chromosomal location. Several ORFs with apparently identical functions exist on both chromosomes which were probably acquired by lateral gene transfer. For example, glyA (encoding serine hydroxymethyl transferase) is found once on each chromosome but the phylogenetic analysis suggest the glyA copy on chromosome 1 branches with the a-Proteobacteria, whereas the copy on chromosome 2 branches with the g-Proteobacteria (see Supplementary Information). The chromosome 2 glyA is ¯anked by genes encoding transposases, suggesting that this gene was acquired through a transposition event.
Figure 1 Linear representation of the V. cholerae chromosomes. The location of Q the predicted coding regions, colour-coded by biological role, RNA genes, tRNAs, other RNAs, Rho-independent terminators and Vibrio cholerae repeats (VCRs) are indicated. Arrows represent the direction of transcription for each predicted coding region. Numbers next to the tRNAs represent the number of tRNAs at a locus. Numbers next to GES represent the number of membrane-spanning domains predicted by the Goldman, Engleman and Steitz scale calculated by TopPred for that protein. Gene names are available at the TIGR web site (www.tigr.org) and as Supplementary Information.
1,000,000
100,000
900,000
Comparative genomics
The two-chromosome structure of V. cholerae allows for comparisons, both between the two chromosomes of this organism and between either of the V. cholerae chromosomes and the chromosomes of other microbial species. There is pronounced asymmetry in the distribution of genes known to be essential for growth and virulence between the two chromosomes. Signi®cantly more genes encoding DNA replication and repair, transcription, translation, cell-wall biosynthesis and a variety of central catabolic and biosynthetic pathways are encoded by chromosome 1. Similarly, most genes known to be essential in bacterial pathogenicity (that is, those encoding the toxin co-regulated pilus, cholera toxin, lipopolysaccharide and the extracellular protein secretion machinery) are also located on chromosome 1. In contrast, chromosome 2 contains a larger fraction (59%) of hypothetical genes and genes of unknown function, compared with chromosome 1 (42%) (Fig. 4). This partitioning of hypothetical proteins on chromosome 2 is highly localized in the integron island (Fig. 2). Chromosome 2 also carries the 3-hydroxy-3-methylglutaryl CoA reductase, a gene apparently acquired from an archaea (Y. Boucher and W. F. Doolittle, personal communication). The majority of the V. cholerae genes were very similar to E. coli genes (1,454 ORFs), but 499 (12.8%) of the V. cholerae ORFs showed highest similarity to other V. cholerae genes, suggesting recent duplications (Figs 5 and 6). Most of the duplicated ORFs encode products involved in regulatory functions (59), chemotaxis (50), transport and binding (42), transposition (18), pathogenicity (13) or unknown functions encoded by conserved hypothetical (62) and hypothetical proteins (113). There are 105 duplications with at least one of each ORF on each chromosome indicating there have been recent crossovers between chromosomes. The extensive duplication of genes involved in scavenging behaviour (chemotaxis and solute transport) suggests the importance of these gene products in 478
1 200,000
800,000
300,000
700,000
400,000
600,000 2,900,000 2,800,000 2,700,000 2,600,000 2,500,000
500,000 1
100,000 200,000 300,000 400,000 500,000
2,400,000 600,000 2,300,000 700,000 2,200,000 800,000 2,100,000 900,000 2,000,000
1,000,000
1,900,000
1,100,000
1,800,000
1,200,000 1,700,000 1,300,000 1,600,000 1,500,000 1,400,000
Figure 2 Circular representation of the V. cholerae genome. The two chromosomes, large and small, are depicted. From the outside inward: the ®rst and second circles show predicted protein-coding regions on the plus and minus strand, by role, according to the colour code in Fig. 1 (unknown and hypothetical proteins are in black). The third circle shows recently duplicated genes on the same chromosome (black) and on different chromosomes (green). The fourth circle shows transposon-related (black), phage-related (blue), VCRs (pink) and pathogenesis genes (red). The ®fth circle shows regions with signi®cant x2 values for trinucleotide composition in a 2,000-bp window. The sixth circle shows percentage G+C in relation to mean G+C for the chromosome.The seventh and eighth circles are tRNAs and rRNAs, respectively.
© 2000 Macmillan Magazines Ltd
NATURE | VOL 406 | 3 AUGUST 2000 | www.nature.com
articles Origin and function of the small chromosome of V. cholerae
was originally a megaplasmid (Fig. 4). The megaplasmid presumably acquired genes from diverse bacterial species before its capture by the ancestral Vibrio. The relocation of several essential genes from chromosome 1 to the megaplasmid completed the stable capture of this smaller replicon. Apparently this capture of the megaplasmid occurred long enough ago that the trinucleotide composition and percentage G+C content between the two chromosomes has become similar (except for laterally moving elements such as the integron island, bacteriophage genomes, transposons, and so on). The two chromosome structure is found in other Vibrio species19 suggesting that the gene content of the megaplasmid continues to provide Vibrio with an evolutionary advantage, perhaps within the aquatic ecosystem where Vibrio species are frequently the dominant microorganisms14,16. It is unclear why chromosome 2 has not been integrated into chromosome 1. Perhaps chromosome 2 plays an important specialized function that provides the evolutionary selective pressure to
Several lines of evidence suggest that chromosome 2 was originally a megaplasmid captured by an ancestral Vibrio species. The phylogenetic analysis of the ParA homologues located near the putative origin of replication of each chromosome shows chromosome 1 ParA tending to group with other chromosomal ParAs, and the ParA from chromosome 2 tending to group with plasmid, phage and megaplasmid ParAs (see Supplementary Information). In general, genes on chromosome 2, with an apparently identical functioning copy on chromosome 1, appear less similar to orthologues present in other g-Proteobacteria species (see Supplementary Information). Also, chromosome 1 contains all the ribosomal RNA operons and at least one copy of all the transfer RNAs (four tRNAs are found on chromosome 2, but there are duplicates on chromosome 1). In addition, chromosome 2 carries the integron region, an element often found on plasmids26. Finally, the bias in the functional gene content is more easily explained, if chromosome 2 uracil NupC NMN, pnuC uraA Family xanthine/ (2,1) uracil family
ELECTRON TRANSPORT CHAIN
H+
H+
cellobiose* fructose mannitol sucrose? glucose
ATP H+ synthase
H+
sbp
sulphate
CHITIN
sugar-Pi
LACTO SE
FRUCTO SE
Histidine Degradation Pathway
sulphate, cysZ
GLYCOLYSIS
phosphate, nptA ?
GALACTOSE GLYCEROL **
HISTIDINE
fatty acids, fadL(2,1)
MANNITOL MANNO SE
urocanate
MALATE
arginine
PENTOSE PATHWAY
Na +/alanine (3)
chorismate
D-LACTATE
ribose
ttyrosine Na +/proline,putP
PP formate, H 2+CO 2
mal E/F/G/ K
galactoside
cadaverine
ASP ARAGINE
>40 flagellar and
mglA/B/C
oxalate/formate ?
METHIONINE L-ASPA RTATE
Na +/citrate, citS Na +/dicarboxylate(1,1) formate ? (1,1) benzoate, benE C4- dicarboxylate
fumarate
Fatty Acid Biosynthes is and Degradation
propionate O-succinyl-
SERINE
cis-aconitate
CO 2
bcaa, brnQ potassium, trkA/H
ARGININE
bypass isocitrate
succinate
NH 4+ ? GLUTA MINE
succinyl-CoA
Mg2+, mgt,(2,1)
PUTRESCINE
2-oxoglutarate
L- cysteine
dctP/Q/M, dcuA/B/C
tyrosine, tyrP
MCPs (23,20) ORNITHINE
Glyoxylate
homoserine glycine
serine,sda C (2) tryptophan,mtr
glyoxylate fumarate
AzlC family
V/W/Y/Z
TCA Cycle
L-iso-leucine THREONINE
BCCT family
CheA/B/D?/R/
oxaloacetic citrate acid malate
arginine/ ornithine putrescine/ ornithine, potE
motor genes
ethanol L- leucine
ASP ARTATE
diaminopimelic acid
FLAGELLUM
cadaverine/ lysine?
propionate
L-lysine maltose
Na +/glutamate,gltS
melanin
D-alanine L-alanine leucine valine
2-keto-isovalerate
Acetyl-CoA
acetyl-P
(1,1) proton/peptide family
L- tryptophan
SERINE
Pyruvate
L-LACTATE ACETA TE
proton/glutamate, gltP
L- phenylalanine serine
formiminoL-glutamate
rbsA/B/C/D
artI/M/P/Q
PRP P
PHOSPH ATE
PEP
L-glutamate
oppA/B/C/D/F
amino acids (3,1)
RIBOSE
OXIDATIV E
L-TRYPTOP HAN
L-lactate ?
oligopeptides
GLUCONATE NON -
L-ALANINE
imidazolepropanoate
Pi
ptsH
PATHWAY
GLUCONEOG ENESIS
sulphate(2,2)
peptides (3,1)
ptsI
DOUDO ROFF
SUCROSE
Pho4 family protein
potA/B/C/D
ENTNER
GLUCOSE
spermidine/ putrescine
Pi
IIA
NAG trehalose?
TREHALOSE N-ACETYLGLUCOSAMINE
modA/B/C
gluconate ?
IIBC
GLYCOGEN
STARCH
pstA/B/C/S
molybdenum
ompS
+ Pi
e-
phosphate (1,1)
maltoporin
ADP
ATP
cy sA/P/T /W
PTS system Pi
sugar family
iron (II), feoA/B Na +/H +(4,2)
PROLINE
L-GLUTAMATE
potassium, kup potassium ke fB (2?) ExbB(1,1)
porin ?
lipolysaccharide/ O-antigen, rfbH/I
NadC family iron (III) G3P MsbA? cations AcrB/D/F (2,2)
glyc G3P colicins thiamine? B12? btuB/C/D toxins drugs heme glpF glpT tolA/B/Q/R ugpA/B/C/E (4,2) (14,3) ccmA/B/C/D
Figure 3 Overview of metabolism and transport in V. cholerae. Pathways for energy production and the metabolism of organic compounds, acids and aldehydes are shown. Transporters are grouped by substrate speci®city: cations (green), anions (red), carbohydrates (yellow), nucleosides, purines and pyrimidines (purple), amino acids/ peptides/amines (dark blue) and other (light blue). Question marks associated with transporters indicate a putative gene, uncertainty in substrate speci®city, or direction of transport. Permeases are represented as ovals; ABC transporters are shown as composite ®gures of ovals, diamonds and circles; porins are represented as three ovals; the largeconductance mechanosensitive channel is shown as a gated cylinder; other cylinders represent outer membrane transporters or receptors; and all other transporters are drawn as rectangles. Export or import of solutes is designated by the direction of the arrow through the transporter. If a precise substrate could not be determined for a transporter, no gene name was assigned and a more general common name re¯ecting the type of substrate being transported was used. Gene location on the two chromosomes, for both NATURE | VOL 406 | 3 AUGUST 2000 | www.nature.com
E1-E2
family (2,1)
vibriobactin receptor viuA
hemin
zinc
vibriobactin
hutB/C/D
znuA/B/C
fepB/C/D/G
TonB (1,1)
mscL
ExbD(1,1)
+
MCP
heme
K ? hutA
irgA
TonB system receptor(1,2)
chemotactic signals
family (3)
transporters and metabolic steps, is indicated by arrow colour: all genes located on the large chromosome (black); all genes located on the small chromosome (blue); all genes needed for the complete pathway on one chromosome, but a duplicate copy of one or more genes on the other chromosome (purple); required genes on both chromosomes (red); complete pathway on both chromosomes (green). (Complete pathways, except for glycerol, are found on the large chromosome.) Gene numbers on the two chromosomes are in parentheses and follow the colour scheme for gene location. Substrates underlined and capitalized can be used as energy sources. PRPP, phosphoribosyl-pyrophosphate; PEP, phosphoenolpyruvate; PTS, phosphoenolpyruvate-dependant phosphotransferase system; ATP, adenosine triphosphate; ADP, adenosine diphosphate; MCP, methylaccepting chaemotaxis protein; NAG, N-acetylglucosamine; G3P, glycerol-3-phosphate; glyc, glycerol; NMN, nicotinamide mononucleotide. Asterisk, because V. cholerae does not use cellobiose, we expect this PTS system to be involved in chitobiose transport.
© 2000 Macmillan Magazines Ltd
479
articles suppress integration events when they do occur. For example, if under some environmental condition there is a difference in copy number between the chromosomes, then chromosome 2 may have accumulated genes that are better expressed at higher or lower copy number than genes on chromosome 1. A second possibility is that, in response to environmental cues, one chromosome may partition to daughter cells in the absence of the other chromosome (aberrant segregation). Such single-chromosome-containing cells would be replication-defective but still maintain metabolic activity (`drone' cells), and, therefore, be a potential source of ``viable, but nonculturable (VBNC)'' cells observed to occur in V. cholerae17. Such `drone' cells may also play a role in V. cholerae bio®lms7,27,28 by, for example, producing extracellular chitinase, protease and other degradative enzymes that enhance survival of cells in a bio®lm, retaining two chromosomes without directly competing with these cells for nutrients.
Transport and energy metabolism
Vibrio cholerae has a diverse natural habitat that includes association with zooplankton in a sessile stage, a planktonic state in the water column, and the capacity to act as a pathogen within the human gastrointestinal tract. It is, therefore, no surprise that this organism maintains a large repertoire of transport proteins with broad substrate speci®city and the corresponding catabolic pathways to enable it to respond ef®ciently to these different and constantly changing ecosystems (Fig. 4). Many of the sugar transporter systems and their corresponding catabolic pathway enzymes are localized on a single chromosome (that is, ribose and lactate transport and degradation enzymes are contained on chromosome 2, whereas the trehalose systems reside on chromosome 1). However, many of the other energy metabolism pathways are split (that is, chitin, glycolysis, and so on) between the chromosomes (Fig. 3). In aquatic environments, chitin often represents a source of both carbon and nitrogen. This energy source is important for V. cholerae as it is associated with zooplankton, which have a chitinous exoskeleton13,15,29. Vibrio cholerae degrades chitin by a pathway that is very similar to that of Vibrio furnissii30. Sequence analysis suggests a phosphenolpyruvate phosphotransferase system (PTS)
Interchromosomal regulation
Several of the regulatory pathways, both for regulation in response to environmental and pathogenic signals, are divided between the two chromosomes. These included pathways for starvation survival, `quorum sensing' and expression of the entertoxigenic haemolysin, HlyA. During periods of nutrient starvation, V. cholerae, and other Gram-negative bacteria, enter the stationary phase and, later, the viable but nonculturable (VBNC) state14,17. The alternative sigma factor j38 (rpoS) is required for survival of V. cholerae in the environment but not for pathogenicity31, and therefore probably plays an important role in the initiation of the VBNC state. There is one copy of rpoS, located on chromosome 1, near the oriC. The RpoS regulates expression of several other proteins, including catalase, cyclopropane-fatty-acyl-phospholipid synthase and HA/protease, which are found on both chromosomes31. Genes involved in `quorum sensing', or cell-density-dependent regulation, also exist on both chromosomes of V. cholerae. In bioluminescent Vibrio species (notably Vibrio ®scheri and V. harveyi), quorum sensing is used to control light production. Although this strain of V. cholerae lacks the genes for bioluminescence, it does have the genes required for the autoinducer-2 (AI-2) quorumsensing mechanism32 (luxOPQSU) but this pathway is split between the chromosomes with luxOSU on chromosome 1 and luxPQ on chromosome 2. Similarly, another transcriptional regulatory gene,
9
in Relative proportion of most similar oc A o qu open reading frames M M ccu ife x yc y s ob co B rad aeo ac pla ac io lic te sm illu du us Borium a g s s ran e ub s r C Tre reli tub nit tili hl p a e al s am on b rc iu C y e ur u m hl d m g lo am ia a do si y p p rf s H ae dia neuallid eri m Es tra m um op c c on hi he ho iae N ei Vlus rich mat ss ib in ia is Ri er ri flu c ck ia o e ol c n i e H tt me ho za C eli sia nin ler e am co p g a py ba row itid e M et Th Sy lob cte aze is ha n r e no rm ec act py kii ba M A o ho er lo ct e rch Ae tog cy jej ri er th ae ro a st un iu an o py m is i m o g ru a sp th co lob m riti . m e c Py rm cu us pe a ro oa s ja fulg rni co ut n id x Sa Ca Py cc otr nas us cc eno roc us op ch ha rh oc ho hic ii ro ab cu rik um m d s os yc itis a h es e by ii ce leg ssi re an vis s ia e
0.35
10
0.3
8
0.25
7
0.2
6
0.15
5
0.1
4 3
0.05
2
0
1
Fa tty
Am in ac nuc o id le a an osi P cid d de ur bi p s pr ho , aine osy C osth B sp nd s, p nth en e io h n y e tra tic sy olip uc rim sis l i g nth id leo idi * nt ro e m t n er up sis e ide es m s o ta s , ed , a f bo Tr an i n c l sp E ary d cofa ism or ne m ar cto t b rg eta rie rs in y m bo rs* , di ng et lism a DN and bol * pr ism A ot m ei e T tab ns Pr ran oli ot sc sm ei n ript * sy io Re n n gu Pr the * la ote sis to in * ry fu fat C Ce nc e* el ll lu en tion la s v O r pr elo th o p er ce e* ca ss H teg es* yp ot orie he s tic al *1
0
De
Open reading frames (%)
for cellobiose transport, but as V. cholerae does not use cellobiose, it is more likely that this PTS is involved in transport of the structurally similar compound, chitobiose, analogous to the situation proposed for Bourrelia burgdorferi18. The three anions that are transported by ABC transport systems in V. cholerae are molybdenum, phosphate and sulphate. Molybdenum transport genes (modA/B/C) are all located on chromosome 2, and most of the sulphate transport genes are on the large molecule. However, copies of the genes for phosphate transporters are found in both chromosomes. The genes in these two phosphate transport operons are different from each other and do not represent a recent duplication; instead, this suggests that one may be an acquired operon.
Figure 4 Percentage of total Vibrio cholerae open reading frames (ORFs) in biological roles compared with other g-Proteobacteria. These were V. cholerae, chromosome 1 (blue); V. cholerae, chromosome 2 (red); Escherichia coli (yellow); Haemophilus in¯uenzae (pale blue). Signi®cant partitioning (P , 0.01) of biological roles between V. cholerae chromosomes is indicated with an asterisk, as determined with a x2 analysis. 1, Hypothetical contains both conserved hypothetical proteins and hypothetical proteins, and is at 1/10 scale compared with other roles. 480
Figure 5 Comparison of the V. cholerae ORFs with those of other completely sequenced genomes. The sequence of all proteins from each completed genome were retrieved from NCBI, TIGR and the Caenorhabditis elegans (wormpep16) databases. All V. cholerae ORFs (large chromosome, blue; small chromosome, red) were searched against all other genomes with FASTA3. The number of V. cholerae ORFs with greatest similarity (E # 10-5) are shown in proportion to the total number of ORFs in that genome. There were no ORFs that were most similar to a Mycoplasma pneumoniae ORF.
© 2000 Macmillan Magazines Ltd
NATURE | VOL 406 | 3 AUGUST 2000 | www.nature.com
articles
**
**
V.cholerae VC0512 V.cholerae VCA1034 V.cholerae VCA0974 V.cholerae VCA0068 ** V.cholerae VC0825 * V.cholerae VC0282 V.cholerae VCA0906 V.cholerae VCA0979 V.cholerae VCA1056 V.cholerae VC1643 V.cholerae VC2161 V.cholerae VCA0923 ** V.cholerae VC0514 ** V.cholerae VC1868 V.cholerae VCA0773 V.cholerae VC1313 V.cholerae VC1859 V.cholerae VC1413 V.cholerae VCA0268 V.cholerae VCA0658 ** V.cholerae VC1405 V.cholerae VC1298 * V.cholerae VC1248 V.cholerae VCA0864 V.cholerae VCA0176 V.cholerae VCA0220 ** V.cholerae VC1289 V.cholerae A1069 ** V.cholerae VC2439 V.cholerae 1967 V.cholerae A0031 V.cholerae VC1898 V.cholerae VCA0663 V.cholerae VCA0988 V.cholerae VC0216 V.cholerae VC0449 * V.cholerae VCA0008 V.cholerae VC1406 V.cholerae VC1535 V.cholerae VC0840 B.subtilis gi2633766 Synechocystis sp. gi1001299 Synechocystis sp. gi1001300 * Synechocystis sp. gi1652276 * Synechocystis sp. gi1652103 * H.pylori gi2313716 ** H.pylori 99 gi4155097 ** C.jejuni Cj1190c C.jejuni Cj1110c A.fulgidus gi2649560 ** A.fulgidus gi2649548 B.subtilis gi2634254 B.subtilis gi2632630 B.subtilis gi2635607 B.subtilis gi2635608 B.subtilis gi2635609 ** ** ** B.subtilis gi2635610 B.subtilis gi2635882 E.coli gi1788195 E.coli gi2367378 ** E.coli gi1788194 * E.coli gi1787690 V.cholerae VCA1092 V.cholerae VC0098 E.coli gi1789453 H.pylori gi2313186 99 gi4154603 ** H.pylori C.jejuni Cj0144 C.jejuni Cj1564 ** C.jejuni Cj0262c ** C.jejuni Cj1506c H.pylori gi2313163 * ** H.pylori 99 gi4154575 H.pylori gi2313179 ** ** H.pylori 99 gi4154599 C.jejuni Cj0019c C.jejuni Cj0951c C.jejuni Cj0246c B.subtilis gi2633374 T.maritima TM0014 V.cholerae VC1403 V.cholerae VCA1088 T.pallidum gi3322777 T.pallidum gi3322939 T.pallidum gi3322938 ** ** B.burgdorferi gi2688522 T.pallidum gi3322296 B.burgdorferi gi2688521 * ** T.maritima TM0429 T.maritima TM0918 ** T.maritima TM0023 T.maritima TM1428 * T.maritima TM1143 T.maritima TM1146 P.abyssi PAB1308 P.horikoshii gi3256846 ** P.abyssi PAB1336 ** ** P.horikoshii gi3256896 ** P.abyssi PAB2066 P.horikoshii gi3258290 ** * P.abyssi PAB1026 P.horikoshii gi3256884 ** D.radiodurans DRA00354 D.radiodurans DRA0353 ** ** D.radiodurans DRA0352 V.cholerae VC1394 P.abyssi PAB1189 P.horikoshii gi3258414 ** B.burgdorferigi 2688621 M.tuberculosis gi1666149 V.cholerae VC0622
Figure 6 Phylogenetic tree of methyl-accepting chemotactic proteins (MCP) homologues in completed genomes. Homologues of MCP were identi®ed by FASTA3 searches of all available complete genomes. Amino-acid sequences of the proteins were aligned using CLUSTALW, and a neighbour-joining phylogenetic tree was generated from the alignment using the PAUP* program (using a PAM-based distance calculation). Hypervariable regions of the alignment and positions with gaps in many of the sequences were excluded from the analysis. Nodes with signi®cant bootstrap values are indicated: two asterisks, .70%; asterisk, 40±70%. NATURE | VOL 406 | 3 AUGUST 2000 | www.nature.com
hlyU (ref. 33), is located on chromosome 1, while the gene it regulates, hlyA, is located on chromosome 2.
DNA repair
Vibrio cholerae has genes encoding several DNA-repair and DNAdamage-response pathways, including nucleotide-excision repair, mismatch-excision repair, base-excision repair, AP endonuclease, alkylation transfer, photoreactivation, DNA ligation, and all the major components of recombination and recombinational repair, including initiation, recombination and resolution34. In addition, homologues of many of the genes involved in the SOS response in E. coli are found. The presence of three photolyase homologues, more than have been found in other bacterial species, probably allows for the ability to photoreactivate the two major forms of ultraviolet-induced DNA damage (cyclobutane pyrimidine dimers and 6-4 photoproducts), and may also allow use of a range of wavelengths of light used for the energy required for photoreactivation. It is also of interest that many of the repair genes are on chromosome 2 (alkA, ada1, ada2, phr3, mutK, sbcCD, dcm, mutT3), indicating that this chromosome is probably required for full DNA repair capability.
Pathogenicity
Toxins. The genome sequence of V. cholerae El Tor N16961 revealed a single copy of the cholera toxin (CT) genes, ctxAB, located on chromosome 1 within the integrated genome of CTXf, a temperate ®lamentous phage3. The receptor for entry of CTXf into the cell is the toxin-coregulated pilus (TCP)3, and the TCP gene cluster (see below) also resides on chromosome 1. Like the structural genes for CT and TCP, the regulatory gene, toxR, which controls their expression in vivo35, is also located on chromosome 1. On the other side of CTXf prophage is a region encoding an RTX toxin (rtxA), and its activator (rtxC) and transporters (rtxBD)36. A third transporter gene has been identi®ed that is a paralogue of rtxB, and is transcribed in the same direction as rtxBD. Downstream of this gene are two genes encoding a sensor histidine kinase and response regulator. Trinucleotide composition analysis suggests that the RTX region was horizontally acquired along with the sensor histidine kinase/response regulator, suggesting these regulators effect expression of the closely linked RTX transcriptional units. Also present are genes encoding numerous potential toxins, including several haemolysins, proteases and lipases. These include hap, the haemagluttinin protease, a secreted metalloprotease that seems to attack proteins involved in maintaining the integrity of epithelial cell tight junctions37, and hlyA, encoding a secreted haemolysin that displays enterotoxic activity38. In contrast to CTX, RTX and all known intestinal colonization factors, the hap and hlyA genes virulence factors reside on chromosome 2. Vibrio cholerae has been reported to produce shiga-like toxins39; however, the sequence did not reveal genes encoding speci®c homologues of the A or B subunits of shiga toxin. Also not detected were genes encoding homologues of E. coli heat stable toxin (ST), which have been detected in other pathogenic strains of V. cholerae40. Colonization factors. The critical intestinal colonization factor of V. cholerae is the TCP, a type IV pilus5,41. The genome sequence con®rmed that the genes involved in TCP assembly (tcpABCDEFGHIJNQRST) reside on chromosome 1 (ref. 20) as part of a proposed `pathogenicity island' (also referred to as VPI) composed of recently acquired DNA that encodes not only TCP, but also other genes associated with the ToxR regulatory cascade, such as acfABCD, toxT, aldA and tagAB10,11. Trinucleotide composition analysis suggest that this 45.3-kb segment begins at a 20-bp site upstream of aldA, and encompasses a helicase-related protein and a transcriptional activator which both share homology with bacteriophage proteins. At the other end of the segment of atypical trinucleotide composition is a phage family integrase and the
© 2000 Macmillan Magazines Ltd
481
articles other copy of the 20-bp site, which is presumably the target for integration of the island onto the chromosome10,11. It has been proposed that the TCP/ACF island corresponds to the genome of a ®lamentous phage that uses TCP pilin as a coat protein4. However, other than the three genes encoding phage-related proteins (that is, the helicase, transcriptional activator and integrase) we could ®nd no other genes on the island that encoded products with signi®cant homology to the conserved gene products of other ®lamentous phages or the structural proteins of non®lamentous phages. The maltose-sensitive haemagglutinin (MSHA) is unique to the El Tor biotype of V. cholerae. Initially characterized as a haemagglutinin, it was later found to be a type IV pilus42,43. The MSHA biogenesis (MshHIJKLMNEGF) and structural (MshBACD) proteins are all clustered on chromosome 1. There are no apparent integrases or transposases that might de®ne this region as a pathogenicity island or suggest an origin for it other than V. cholerae. In support of this conclusion, trinucleotide composition analysis shows that this region has similar composition to the rest of the chromosome, suggesting that if these genes were acquired it was very early in the Vibrio phylogenetic history. Recently, several investigators have reported that MSHA is not required for intestinal colonization, nor does it seem to appreciably affect the ef®ciency of colonization44±46, but instead plays a role in bio®lm formation27,28. Accordingly, this pilus may be important for the environmental ®tness of Vibrio species rather than for pathogenic potential. The pilA region of V. cholerae genome apparently encodes a third type IV pilus, although it has not been visualized47. This gene cluster includes a gene encoding a prepilin peptidase (PilD) that is required for the ef®cient processing of protein complexes with type IV prepilin-like signal sequences including TCP, MSHA and EPS47,48. The EPS system of V. cholerae encodes a type II secretion system involved in extracellular export of CT and other proteins. The EPS system is encoded by chromosome 1 but, like the MSHA genes, trinucleotide analysis suggests that the EPS genes of V. cholerae have not been recently acquired. In contrast, trinucleotide composition analysis suggests that the pilA gene cluster was acquired by horizontal transfer. Thus, analysis of the V. cholerae genome sequence provides some evidence that older gene clusters, like MSHA and EPS, have become dependent on newly acquired genes such as PilD.
Conclusions
The Vibrio cholerae genome sequence provides a new starting point for the study of this organism's environmental and pathobiological characteristics. It will be interesting to determine the gene expression patterns that are unique to its survival and replication during human infection35 as well as in the environment13,14,16. Additionally, the genomic sequence of V. cholerae should facilitate the study of this model multi-chromosomal prokaryotic organism. Comparative genomics between several species in the genus Vibrio will provide a better understanding of the origin of the new small chromosome and the role that it plays in Vibrio biology. The genome sequence may also provide important clues to understanding the metabolic and regulatory networks that link genes on the two chromosomes. Finally, V. cholerae clearly represents a promising genetic system for studying how several horizontally acquired loci located on separate chromosomes can still ef®ciently interact at the regulatory, cell biology and biochemical levels. M
Methods Whole-genome random sequencing procedure. Vibrio cholerae N16961 was grown from a single isolated colony. Cloning, sequencing and assembly were as described for genomes sequenced by TIGR18. One small-insert plasmid library (2±3 kb) was generated by random mechanical shearing of genomic DNA. One large insert library was ligated into l-DASHII/EcoRI vector (Stratagene). In the initial sequence phase, approximately sevenfold sequence coverage was achieved with 49,633 sequences from plasmid clones. Sequences from both ends of 383 l-clones served as a genome scaffold, verifying the orientation, order and integrity of the contigs. The plasmid and l sequences were jointly assembled using TIGR Assembler. Sequence gaps were closed
482
by editing the end sequences and/or primer walking on plasmid clones. Physical gaps were closed by direct sequencing of genomic DNA, or combinatorial polymerase chain reaction (PCR) followed by sequencing the PCR product. The ®nal genome sequence is based on 51,164 sequences. ORF prediction and gene family identi®cation. An initial set of ORFs, likely to encode proteins, was identi®ed with GLIMMER49, and those shorter than 30 codons were eliminated. ORFs that overlapped were visually inspected, and in some cases removed. ORFs were searched against a non-redundant protein database18. Frameshifts and point mutations were detected and corrected where appropriate. Remaining frameshifts and point mutations are considered to be authentic and were annotated as `authentic frameshift' or `authentic point mutation'. ORFs were also analysed with two sets of hidden Markov models (HMMs) constructed for a number of conserved protein families (1,313 from Pfam v3.1 (ref. 50) and 476 from the TIGRFAM) by use of the HMMER package. TopPred was used to identify membrane-spanning domains in proteins. Paralogous gene families were constructed by searching the ORFs against themselves using BLASTX, identifying matches with E # 10-5 over 60% of the query search length, and subsequently clustering these matches into multigene families. Multiple alignments for these protein families were generated with the CLUSTALW program and the alignments scrutinized. Distribution of all 64 trinucleotides (3-mers) for each chromosome was determined, and the 3-mer distribution in 2,000-bp windows that overlapped by half their length (1,000 bp) across the genome was computed. For each window, we computed the x2 statistic on the difference between its 3-mer content and that of the whole chromosome. A large value of this statistic indicates that the 3-mer composition in this window is different from the rest of the chromosome. Probability values for this analysis are based on the assumption that the DNA composition is relatively uniform throughout the genome. Because this assumption may be incorrect, we prefer to interpret high x2 values merely as indicators of regions on the chromosome that appear unusual and demand further scrutiny. Homologues of the genes of interest were identi®ed using the BLASTP and FASTA3 search programs. All homologues were then aligned to each other using the CLUSTALW program with default settings. Phylogenetic trees were generated from the alignments using the neighbour-joining algorithm as implemented by the PAUP* program (with a PAM matrix based distance calculation). Regions of the alignment that were hypervariable or were of low con®dence were excluded from the phylogenetic analysis. All alignments are available upon request. Received 3 April; accepted 18 May 2000. 1. Wachsmuth, K., Olsvik, é., Evins, G. M. & Popovic, T. in Vibrio Cholerae And Cholera: Molecular To Global Perspective (eds Wachsmuth, I. K., Blake, P. A. & Olsvik, é.) 357±370 (ASM Press, Washington DC, 1994). 2. Faruque, S. M., Albert, M. J. & Mekalanos, J. J. Epidemiology, genetics, and ecology of toxigenic Vibrio cholerae. Microbiol. Mol. Biol. Rev. 62, 1301±1314 (1998). 3. Waldor, M. K. & Mekalanos, J. J. Lysogenic conversion by a ®lamentous phage encoding cholera toxin. Science 272, 1910±1914 (1996). 4. Karaolis, D. K., Somara, S., Maneval, D. R. Jr, Johnson, J. A. & Kaper, J. B. A bacteriophage encoding a pathogenicity island, a type-IV pilus and a phage receptor in cholera bacteria. Nature 399, 375±379 (1999). 5. Brown, R. C. & Taylor, R. K. Organization of tcp, acf, and toxT genes within a ToxT-dependent operon. Mol. Microbiol. 16, 425±439 (1995). 6. Hochhut, B. & Waldor, M. K. Site-speci®c integration of the conjugal Vibrio cholerae SXTelement into prfC. Mol. Microbiol. 32, 99±110 (1999). 7. Yildiz, F. H. & Schoolnik, G. K. Vibrio cholerae O1 El Tor: identi®cation of a gene cluster required for the rugose colony type, exopolysaccharide production, chlorine resistance, and bio®lm formation. Proc. Natl Acad. Sci. USA 96, 4028±4033 (1999). 8. Bik, E. M., Bunschoten, A. E., Gouw, R. D. & Mooi, F. R. Genesis of the novel epidemic Vibrio cholerae O139 strain: evidence for horizontal transfer of genes involved in polysaccharide synthesis. EMBO J 14, 209±216 (1995). 9. Waldor, M. K., Colwell, R. & Mekalanos, J. J. The Vibrio cholerae O139 serogroup antigen includes an O-antigen capsule and lipopolysaccharide virulence determinants. Proc. Natl Acad. Sci. USA 91, 11388±11392 (1994). 10. Kovach, M. E., Shaffer, M. D. & Peterson, K. M. A putative integrase gene de®nes the distal end of a large cluster of ToxR-regulated colonization genes in Vibrio cholerae. Microbiology 142, 2165±2174 (1996). 11. Karaolis, D. K. et al. A Vibrio cholerae pathogenicity island associated with epidemic and pandemic strains. Proc. Natl Acad. Sci. USA 95, 3134±3139 (1998). 12. Mekalanos, J. J., Rubin, E. J. & Waldor, M. K. Cholera: molecular basis for emergence and pathogenesis. FEMS Immunol. Med. Microbiol. 18, 241±248 (1997). 13. Colwell, R. R. Global climate and infectious disease: the cholera paradigm. Science 274, 2025±2031 (1996). 14. Colwell, R. R. & Spira, W. M. in Cholera (eds Barua, D. & Greenough, W. B. III) 107±127 (Plenum Medical, New York, 1992). 15. Lobitz, B. et al. Climate and infectious disease: use of remote sensing for detection of Vibrio cholerae by indirect measurement. Proc. Natl Acad. Sci. USA 97, 1438±1443 (2000). 16. Colwell, R. R. & Huq, A. Environmental reservoir of Vibrio cholerae. The causative agent of cholera. Ann. NY Acad. Sci. 740, 44±54 (1994). 17. Roszak, D. B. & Colwell, R. R. Survival strategies of bacteria in the natural environment. Microbiol. Rev. 51, 365±379 (1987). 18. Fraser, C. M. et al. Genomic sequence of a Lyme disease spirochaete, Borrelia burgdorferi. Nature 390, 580±586 (1997). 19. Yamaichi, Y., Iida, T., Park, K. S., Yamamoto, K. & Honda, T. Physical and genetic map of the genome of Vibrio parahaemolyticus: presence of two chromosomes in Vibrio species. Mol. Microbiol. 31, 1513± 1521 (1999). 20. Trucksis, M., Michalski, J., Deng, Y. K. & Kaper, J. B. The Vibrio cholerae genome contains two unique circular chromosomes. Proc. Natl Acad. Sci. USA 95, 14464±14469 (1998).
© 2000 Macmillan Magazines Ltd
NATURE | VOL 406 | 3 AUGUST 2000 | www.nature.com
articles 21. Lobry, J. R. Asymmetric substitution patterns in the two DNA strands of bacteria. Mol. Biol. Evol. 13, 660±665 (1996). 22. Rowe-Magnus, D. A., Guerout, A. M. & Mazel, D. Super-integrons. Res. Microbiol. 150, 641±651 (1999). 23. Hall, R. M., Brookes, D. E. & Stokes, H. W. Site-speci®c insertion of genes into integrons: role of the 59-base element and determination of the recombination cross-over point. Mol. Microbiol. 5, 1941± 1959 (1991). 24. Davis, B. M., Kimsey, H. H., Chang, W. & Waldor, M. K. The Vibrio cholerae O139 Calcutta bacteriophage CTXphi is infectious and encodes a novel repressor. J. Bacteriol. 181, 6779±6787 (1999). 25. Mekalanos, J. J. Duplication and ampli®cation of toxin genes in Vibrio cholerae. Cell 35, 253±263 (1983). 26. Mazel, D., Dychinco, B., Webb, V. A. & Davies, J. A distinctive class of integron in the Vibrio cholerae genome. Science 280, 605±608 (1998). 27. Watnick, P. I. & Kolter, R. Steps in the development of a Vibrio cholerae El Tor bio®lm. Mol. Microbiol. 34, 586±595 (1999). 28. Watnick, P. I., Fullner, K. J. & Kolter, R. A role for the mannose-sensitive hemagglutinin in bio®lm formation by Vibrio cholerae El Tor. J. Bacteriol. 181, 3606±3609 (1999). 29. Huq, A. et al. Ecological relationships between Vibrio cholerae and planktonic crustacean copepods. Appl. Environ. Microbiol. 45, 275±283 (1983). 30. Bassler, B. L., Yu, C., Lee, Y. C. & Roseman, S. Chitin utilization by marine bacteria. Degradation and catabolism of chitin oligosaccharides by Vibrio furnissii. J. Biol. Chem. 266, 24276±24286 (1991). 31. Yildiz, F. H. & Schoolnik, G. K. Role of rpoS in stress survival and virulence of Vibrio cholerae. J. Bacteriol. 180, 773±784 (1998). 32. Bassler, B. L., Greenberg, E. P. & Stevens, A. M. Cross-species induction of luminescence in the quorum-sensing bacterium Vibrio harveyi. J. Bacteriol. 179, 4043±4045 (1997). 33. Williams, S. G., Attridge, S. R. & Manning, P. A. The transcriptional activator HlyU of Vibrio cholerae: nucleotide sequence and role in virulence gene expression. Mol. Microbiol. 9, 751±760 (1993). 34. Eisen, J. A. & Hanawalt, P. C. A phylogenomic study of DNA repair genes, proteins, and processes. Mutat. Res. 435, 171±213 (1999). 35. Lee, S. H., Hava, D. L., Waldor, M. K. & Camilli, A. Regulation and temporal expression patterns of Vibrio cholerae virulence genes during infection. Cell 99, 625±634 (1999). 36. Lin, W. et al. Identi®cation of a Vibrio cholerae RTX toxin gene cluster that is tightly linked to the cholera toxin prophage. Proc. Natl Acad. Sci. USA 96, 1071±1076 (1999). 37. Wu, Z., Milton, D., Nybom, P., Sjo, A. & Magnusson, K. E. Vibrio cholerae hemagglutinin/protease (HA/protease) causes morphological changes in cultured epithelial cells and perturbs their paracellular barrier function. Microb. Pathog. 21, 111±123 (1996). 38. Alm, R. A., Stroeher, U. H. & Manning, P. A. Extracellular proteins of Vibrio cholerae: nucleotide sequence of the structural gene (hlyA) for the haemolysin of the haemolytic El Tor strain 017 and characterization of the hlyA mutation in the non- haemolytic classical strain 569B. Mol. Microbiol. 2, 481±488 (1988). 39. O'Brien, A. D., Chen, M. E., Holmes, R. K., Kaper, J. & Levine, M. M. Environmental and human
NATURE | VOL 406 | 3 AUGUST 2000 | www.nature.com
40.
41. 42. 43.
44.
45.
46.
47. 48. 49. 50.
isolates of Vibrio cholerae and Vibrio parahaemolyticus produce a Shigella dysenteriae 1 (Shiga)-like cytotoxin. Lancet 1, 77±78 (1984). Ogawa, A., Kato, J., Watanabe, H., Nair, B. G. & Takeda, T. Cloning and nucleotide sequence of a heatstable enterotoxin gene from Vibrio cholerae non-O1 isolated from a patient with traveler's diarrhea. Infect. Immun. 58, 3325±3329 (1990). Manning, P. A. The tcp gene cluster of Vibrio cholerae. Gene 192, 63±70 (1997). Jonson, G., Holmgren, J. & Svennerholm, A. M. Identi®cation of a mannose-binding pilus on Vibrio cholerae El Tor. Microb. Pathog. 11, 433±441 (1991). Jonson, G., Lebens, M. & Holmgren, J. Cloning and sequencing of Vibrio cholerae mannose-sensitive haemagglutinin pilin gene: localization of mshA within a cluster of type 4 pilin genes. Mol. Microbiol. 13, 109±118 (1994). Attridge, S. R., Manning, P. A., Holmgren, J. & Jonson, G. Relative signi®cance of mannose-sensitive hemagglutinin and toxin-coregulated pili in colonization of infant mice by Vibrio cholerae El Tor. Infect. Immun. 64, 3369±3373 (1996). Tacket, C. O. et al. Investigation of the roles of toxin-coregulated pili and mannose- sensitive hemagglutinin pili in the pathogenesis of Vibrio cholerae O139 infection. Infect. Immun. 66, 692±695 (1998). Thelin, K. H. & Taylor, R. K. Toxin-coregulated pilus, but not mannose-sensitive hemagglutinin, is required for colonization by Vibrio cholerae O1 El Tor biotype and O139 strains. Infect. Immun. 64, 2853±2856 (1996). Fullner, K. J. & Mekalanos, J. J. Genetic characterization of a new type IV-A pilus gene cluster found in both classical and El Tor biotypes of Vibrio cholerae. Infect. Immun. 67, 1393±1404 (1999). Marsh, J. W. & Taylor, R. K. Identi®cation of the Vibrio cholerae type 4 prepilin peptidase required for cholera toxin secretion and pilus formation. Mol. Microbiol. 29, 1481±1492 (1998). Salzberg, S. L., Delcher, A. L., Kasif, S. & White, O. Microbial gene identi®cation using interpolated Markov models. Nucleic Acids Res. 26, 544±548 (1998). Bateman, A. et al. Pfam 3.1: 1313 multiple alignments and pro®le HMMs match the majority of proteins. Nucleic Acids Res. 27, 260±262 (1999).
Supplementary information is available on Nature's World-Wide Web site (http:// www.nature.com) or as paper copy from the London editorial of®ce of Nature.
Acknowledgements This work was supported by the National Institutes of Health, National Institute of Allergy and Infectious Disease. We thank M. Heaney, V. Sapiro, B. Lee, M. Holmes and B. Vincent for database and software support. Correspondence and requests for materials should be addressed to C.M.F. (e-mail:
[email protected]). The annotated genome sequence and the gene family alignments are available at hhttp://www.tigr.org/tbd/mdbi. The sequences have been deposited in GenBank with accession number AE003852 (chromosome 1) and AE003853 (chromosome 2).
© 2000 Macmillan Magazines Ltd
483
484
© 2000 Macmillan Magazines Ltd
91
1049
5
281
448
2
757
451
954
955
459
572
574
12
8
964
1063
863
772
12
575
5
965
679
468
6
576
470
773
680
300
106
14
2742
969
1067
Amino acid biosynthesis
774
472
681
16
474
1068
971
865
776
578
972
777
1069
10
9
581
778
202
2175
974
2539
481
1071
310
13
208
20
5s rRNA
16s rRNA
687
1073
781
872
976
23s rRNA
9
587
977
588
8
873
10
3
978
22
6
12
590
979
495
981
876
785
691
591
328
212
Repeat region
Three tRNAs
1075
980
875
24
117
9
Membrane protein
1077
982
496
1078
983
786
593
12
119
498
14
877
692
213
25
4
2278
26
1700
5
5
984
1080
883
790
1081
506
695
5
985
507 10
884
791
121
338
986
219
885
795
510 12
340
31
2672
886
987
2192
798
609
1085
700
887
799
603
344
988
888
800
511
1086
889
801
512
13
347
128
605
513
989
890
349
222
35
6
2462
2379
1088
891
990
6
2290
1089
892
5
704
607
223
130
2381
1090
991
893
894
354
803
608
38
131
516
37
6
1091
992
705
610
11
1535
40
993
804
612
895
518
6
994
706
614
134
805
5
9
519
996
707
615
230
137
2765
2571
997
897
807
708
2208
231
139
998
898
521
365
808
617
10
44
2766
2472
2572
2684
2092
809
1098
620
5
5
900
1000
709
619
232
140
1001
1099
1002
1100
5
5
235
49
1004
903
372
237
142
6
144
238
374 10
1102
13
1008
813
716
624
524
904
714
51
2691
2480
2391
905
526
375
2775
711
529
11
1104
1010
9
814
717
10
376
906
527
1105
1011
625
379
53
2694
1012
5
815
8
380
2595
6
243
1109
13
908
817
12
1015
909
2213
720
1111
2599
2699
1018
911
820
912
723
10
530
388
10
629
59
12
2485
1019
1112
1458
7
1020
914
9
5
531
1113
2490
2216
1023
917
824
1293
5
64
1115
534
920
637
828
1025
921
535
397
156
638
254
922
1026
728
399
157
65
2702
2021
1027
923
12
1835
1740
2121
639
257
540
405
924
832
2608
6
542
258
1029
833
732
406
10
68
2026
2224
2407
7
2705
160
11
23Sf
2324
2223
2123
2024
1926
1566
10
2126
2609
925
834
6
734
644
543
6
735
409
10
2226
926
835
645
646
736
927 928
836
2710
413
264
72
5
2332
2412
647
1
838
930
5
649
417
267
75
167
1034
931
840
738
650
546
7
2616
2499
6
824
76
651
421
1036
935
741
550
5
170
2618
7
1753
171
11
1037
846
10
2620
552
847
1755
2136
939 6
1040
848
655
428
271 272
173
6
274
432
1041
747
656
11
940
554
431
175
2621
2504
2237
2137
1311
82
941
556
433
2717
2505
2418
276
943
6
436
557
1043
748
657
176
83
2719
11
2622
2718
1855
749
9
11
8
751
1045
849
85
558
439
178
2625
2422
2342
945
2626
2507
2423
752
1046
661
559
441
179
469
946
1047
662
560
442
8
2
561
280
1048
947
2
181
90
2726
2628
5
9
2510
2425
2344
2144
663
754
443
180
9
2725
12
1588
1949
2508
2627
279
88
1760
1859
654
1131
1667
2245
2424
2343
2143
753
12
1587
1492
1406
1666
837
745
566
471
934
385
1315
1033
565
194
289
470
100
1220
2041
836
6
933
653
744
6
2724
6
1947
2723
278
659
440
9
86
1759
1857
99
192
564
1314
1665
1585
2244
2142
2039
2722
277
2624
2242
11
1856
13
1129
1405
1491
1758 6
931
10
384
563
468
6
288
191
1031
1219
835
743
652
562
467
1945
2141
2721
658
944
1
84
2720
2623
1490
1664 1
98
287
1029
5
1128
1217
1313
2421
2038
1944
2506
2420
2341
2240
2037
1942
1663
1403
1216
1028
834
742
651
466
381
13
1583 1584
833
1127
2140
1312
4
286
560
930
741
832
97
190
465
379
650
8
1488
1757
95
285
1025 1026
9
6
1126
2419
11
1
1854
2238
12
10
1582
1024
831
739
559
464
649
1215
1660 12
94
377 378
1402
2340
275
2036
1941
1853
1756
2417
80
2716
929
1310
1486
1581
1852
2339
9
174
1309
1214
1401
13
5
1023
830
737
648
558
463
284
188
1123 1124 1125
829
1659
2035
1940
1851
2236
745
1039
938
9
654
79
172
1400
1308
1213
928
736
647
376
93
557
462
283
187
1122
1021
828
556
461
374
1580
1939
1658
2503
2416
2715
270
425
1038
937
11
744
78
2714
2619
2338
2235
2135
2234
2033
1485
1399
11
1754
1938
460
282
927
92
186
1121
1850
1579
1849
1937
827
734
926
1307
10
1656
423 424
936
653
269
2233
2502
2415
1936
2713
77
2501
2232
2133
8
1120
1212
1483
373
91
459
826
645
1018
925
1397
1848
1655
554
457
644
1306
1577
1482
2032
8
1
732
456
11
185
372
90
281
1118 1119
1935
2337
652
843
740
420
169
549
419
268
168
825
924
1847
1575
2712
10
11
6
2414
2617
2500
2335
2132
6
1211
1117
1846
455
731
1017
184
371
89
553
643
1481
1751
1394
1304
10
1016
923
552
730
2031
2231
2131
13
280
370
13
1934
1845
1750
1573
1210
1115
9
6
88
454
922
369
823
728
551
2334
1933
9
1653
1303
1209
2711
12
2498
737
648
416
74
2130
2413
2230
2333
1033
415
266
165
545
929
837
414
265
73
2614
2497
1749
1478
1392
1015
1114
921
727
9
183
86 87
453
642
366
278
11
1844
9
2615
2129
2030
1931
2229
641
1302
1748 8
365
452
1113
6
85
181 182
550
726
1207
1571
1476
1843
2227
2128
2613
164
1031
6
9
544
263
71
7
16Sf
410
2708
2610
261 262
70
2
2409
2330 RNasePRNA
2127
8
1746
1652
1391
1301 14
11
2
822
1014
920
640
1112
1206
1013
919
725
549
277
84
180
362
450
83
1570
5
6
1929
1475
1745
1568
1928
2028
11
2225
1927
408
1111
1205
1390
1300
8
639
4
276
1012
918
1838 1839 1840
1651
1567
1473
2329
161
821
638
361
179
82
1010 1011
1204
1388
6
724
917
547
1110
2027
2706
275
449
178
1009
8
81
1743
1837
1203
1298
1741
1836
640 641
67
2607
13
256
1028
830
730
10
2406
2323
7
2023
1925
916
637
360
274
1008
820
5
359
1386
1650
1470
1006
723
636
1565
1384
447
545
80
1108
177
356 357
271
79
1107
1202
1297
1106
159
12
1
539
255
66
402
538
22
2703
2606
2494
1
914
1005
11
78
355
634
2222
1834
1563
2022
2120
401
729
829
537
400
2493
819
722
544
76
1469
270
446
1201
1649
2405
2220
2119
10
2604
2404
2322
2118
2492
536
2603
2701
827
12
253
396
155
9
2403
2217
2019
1923
1833
1562
1467
1382
9
175
354
1105
12
75
1296
1004
1200 5
1738
1832
1647
1561
818
913
633
543
353
269
74
721
174
1104
1465
1295
1199
1922
1736
1103
1003
817 4
1002
542
445
73
632
268
720
350
173
6
541
444
267
2116
2491
9
2602
2402
2018
1921
1831
1735
1645
1560
1463
1379
719
631
1001
1102
9
911
815
2115
825
727
635
532
395
250
154
2601
1114
634
152
726
915
823
725
63
249
391
151
2600
2489
2401
2113
2016
2017
1732
1644
1377
1101
349
540
5
72
172
6
266
443
1198
718
1000
814
630
539
5
442
265
348
1828
1920
2320
2700
2111
2215
2400
1827
1460
1376
71
171
441
70
910
999
1291
9
1558
2015
632 633
248
822
724
631
5
6
2488
10
2110
2014
1730
538
717
813
1197
1099
1643
1290
1196
909
715
439
440
69
264
12
346
170
537
629
1557
1826
150
913
630
247
61
5
438
1375
1098
2487
2399
2319
2214
10
1918
10
1457
2013
1641
1555
1374
812
68
263
714
908
1
1195
437
345
262
1289
1194
5
11
998
907
2398
2109
1825
1554
387
246
910
1017
385
148
58
2698
2598
2397
2316
1110
818
719
628
245
57
529
244
2697
1727
1916
5
1097
906
67
535
628
168
811
627 23Se
1824
2012
2108
2484
2212
1915
10
1726
1639
1454
260
344
1553
905
810
1288
13
1552
1373
1190
1096
1638
2396
2596
1
626
434
66
167
534
343
259
904
6
2107
2011
1822
1551
56
382
1108
1013
907
718
627
528
242
147
2695 2696
55
903
997
2395
2312
2211
533
809
9
65
433
342 12
1095
2106
1725
2009
2483
2593
2311
241
2394
2482
240
1821
64
166
258
624 16Se
1287
1550
5
1372
6
1636
1912
2103
2310
146
2693
5
1094
1188
63
532
902
6
257
165
807
531
432
8
256
1723 10
2008
2210
2590
239
17
52
2481
2309
8
2100 2101
6
1911
1820
1722
1549
1286
1371
62 14
623
995
1093
5
10
901
806
530
431
341
254
1635
1186
1092
6
1548
1910
2692
2308
2099
61
164
1285
1370
1721
1819
2007
1909
2307
2774
5
2584
1005 1006
812
712
623
1101
711
371
50
2773
1633
804
994
1634
1284
1185
1547
1369
2097
2479
2690
2305
2096
2006
803
430
340
528
252
622
19
900
1091
710
993
1720
1546
1907
1817
2390
523
2772
2689
2581
2476
2303
2209
2095
2004
1632
1282
1719
1544
1906
1718
10
1451
11
9
1184
801
709
621
60
339
526
7
2
802
429
13
251
524 525
338
162
899
1089
992
1183
800
708
523
1367
369 370
902
811
710
621
522
141
48
5
368
2770
233
2579
2475
2302
2688
2577
901
47
367
6
2301
2389
1815
2003
1905
1717
1631
9
23Sa
428
898
1543
1088
1280
2094
10
2474
2576
2300
2685
11
2002
1904
799
897
1366
620
522
337
250
161
1
426
706
1182
991
1814
1630
1542
1716
2093
5
10
6
896
705
6
5Sb
425
1
521
336
249
1365
2767 2768
366
999
618
2388
45
2001
1811
1715
1541
798
619
424
1087
1181
990
7
519
1279
1364
12
2298 2299
1903
1540
5
1629
1086
1363
2000
12
2387
2297
2470
363
616
1095
806
520
229
9
43
896
361
136
42
2764
2683
2569
2469
2386
2205 2206 2207
228
9
360
1093 1094
358
227
41
2385
2295
2089
1180
16Sa
23Sb
704
618
13
894
247
335
796
423
518
988
617
5
246
1277 1278
1808
1998
1902
1628
1362
1714
1807
795
1
58
334
12
57
703
893
1085
987
616
422
333
245
56 16Sb
1276
1539 11
8
1179
1997
1627
1901
2568
135
2467
8
2203
11
2681
2762
1361
1806 12
2088
1995
1805
2567
2466
2384
2566
8
1450
8
794
892
702
517
332
421
1084
1178
55
244
986
701
615
1
793
1275
13
54
159
1536 1537 1538
2294
2202
1092
225
133
6
8
2565
2465
2383
2293
2087
1994
1083
890
331
243
1177
1360
1274
1713
1900
1803
1712
5
1359
1273
1625
356
517
132
355
39
2761
12
1081
985
9
700
52
158
516
614
330
51
792
420
242
889
1174
1448
7
2201
2464
2382
6
5
1899
2292
2564
2678
2760
2563
2463
2198
2086
1993
1800
1711
1534
1358
1272
1173
1079
2291
1533
1447
1624
791
698
613
157
50
515
984
49
241
888
983
329
48
1271
1172
981
1270
2380
8
790
697
1077
1898
1799
2759
351
514
802
703
8
606
350
36
2677
5
1532
1269
156
514
419
886
612
240
47
1354
1076
1171
2197
2562
129
8
2461
696
788
980
2085
2289
2378
2676
2758
702
2675
2561
2460
2377
2196
2084
1992
1623
1446
418
884
1075
1710
1798
1991
1896
2287
2195
34
1622
1529
1268
1170
8
787
695
5
513
46
154
417
239
979
416
611
1353
1797
2083
8
221
346
604
127
2757
345
11
2674
2560
2459
2286
2194
2082
1990
1621
45
153
1074
881
786
1169
977
1445
1267
1527
11
1894
2458
33
8
220
124
2756
2673
2559
2193
2081
1989
1893
1794
1709
1526 5
1266
1168
880
5
152
238
415
9
44
694
610
512
328
237
151
12
1073
976
1350
1444
2285
693
43
784
236
975
1525
1987
2755
602
1167
1892
2456
32
11
876
2376
2558
341
123
2671
2557 11
1084
601
699
2670 16Sh
2454
12
1986
7
11
150
692
974
1265
1524
1791
2283
2080
2191
1985
1890
1523
1442
12
973
783
1708
1166
1349
1264
1071
1706
8
875
608
414
327
235
42
509 510
326
149
691
234
41
508
325
148
972
690
607
1789
2555
3
8
1889
2282
793
30
1083
792
600
337
29
697
1082
2554
2668
5
2374
2281
2190
1788
324
413
506
1441
1522
1263
1165
7
40
233
781
1070
873
606
1620
1439
1984
508
218
599
28
2553
2453
2280
8
1888
1786
1704
1438
1348
1521
7
1069
780
689
605
505
412
323
147
39
232
322
411
1164
971
870
779
1261
12
2666 23Sh
334
217
688
12
503
410
146
37
231
321
6
409
1163
2188
2552
2279
596
11
882
789
27
120
1347
2077
8
144
SRPRNA
13
604
408
4
778
9
1983
1887
1784
1703
1437
1346
1260
1162
36
230
970
687
1067 7
505
694
504
880 881
788
12
331
595
214
1079
693
501
330
2664 5Sh
2550
2452
2187
2373
2549
12
2075
1701
1520
1259
5
869
969
407
502
320
229
142
777
9
406
1980 1981
2662 16Sg
2451
16
2074
2548
2661
1783
1886
1979
2277
1782
1435
6
1345
1066
5
1161
685
35
140
501
319
228
139
34
776
8
968
1619
1519
1434
1344
1698
1618
11
1697
592
118
2660
13
2276
2073
2450
2371
2185
Rho-ind. terminator
784
325
690
6
493
23
9
2657
2547
2448
23Sg
2184
2274
2370
2183
1433
1258
1160
1064
1517 1518
1978
1781
1885
1977
5
1779
1884
6
1343
1617
1432
1695
1615
1431
1516
1257
1159
775
684
603
500
405
318
227
33
137
683
499
317
225
1063
866
498
965
5
136
32
404
774
1062
682
602
315
224
135
496
403
1158
1061
964
864
773
495
223 5Sc 1
134
1342
2072
1976
1778
1614
1515
1430
211
689
491
322
116
1157
1060
2656 5Sg 1
1074
783
589
688
782
210
318
21
494
402
7
31
222
963
863
1883
1777
2273
2447
2751
490
2369
2272
2182
2070
681
1341
1693
1429
1256
1882
1975
2181
7
1776
2545
2655
2446
115
317
2750
485 486 487
Other RNA
1072
975
871
686
2544
2654
316
114
2653
2368
2271
2180
2069
1974
1880
12
601
772
221 23Sc
1156
1612
1340
1513
1611
1692
10
1775
1610
6
30
132
493
1059
962
1155
1058
1428
1154
400
861 862
771
29
220
680
131
11
492
399
1255
1427
1339
2270
1973
2445
207
586
113
5
313
11
585
483
16
780
584
311
205
112
19
2749
2542
2651
2444
2541
2649
2443
2365 2366
2269
6
1879
1774
2179
1690
1972
5
2068
2178
2267 2268
2067
1773
1689
1878
1512
1609
7
6
1426
1253
1153
1056
961
770
6
600
491
2
860
28
217
678
859
16Sc
130
960
858
1152
959
857
769
677
597
490
1511
1425
1608
6
1338
1252
8
216
398
27
1054
7
9
596
1151
856
958
1971
1772
685
583
870
1607
1509
1424
1250
2748
309
204
111
1053
312
129
676
595
397
215
1150
2176
2647
18
779
582
479
2747
12
2442
11
2364
26
128
768
1688
2266
2066
1970
12
2265
684
308
2538
110
867
580
476
2064
2441
17
2746
307
201
683
475
8
2537
2646
2174
2363
109
2440
2262
2536
200
108
1 Kb
303
2744
2645
2362
2172
2063
1968
1877
9
1508
1423
1337
675
396
957
594
1249
1606
1687
309
489
855
1149
1507
1050
767
674
6
395
213
127
214
23 24
126
593
1875
1771
1686
10
1336
1506
1422
1685
1874
1967
1148
956
1248
854
6
22
673
7
308
592
488
212
125
1049
21
1335
1505
1605
1684
1504
1421
1246
766
853
672
591
394
307
211
1048
1147
954
682
1683
1604
1503
2535
15
107
199
864
301
2743
Conserved hypothetical
1062
963
862
769
678
465
299
198
2741
2644
2534
2171
2439
No database match
962
768
676 677
197
13
2740
2643
2360
2261
12
2062
1966
Protein fate/protein synthesis
1060
767
294
196
463 464
195
104
2532
2642
5
2260
2168
2061
11
1770
1682
5
1603
1419
1334
1873
1245
6
487
590
306
210
20
1420
953
852
764
671
589
12
122
1047
1333
11
486
Other categories
1059
961
860
766
675
573
460
293
194
103
2739
2531
2359
2438
2358
2259
2166 2167
1681
1872
1965
1502
12
1244
1602
7
1418
2060
1769
1964
2058 2059
2258
1601
1242
9
1145
951
305
209
393
851
763
668
208
1332
849
587
1046
950
848
1680
1871
13
667
304
485
392
19
Transport and binding proteins
1058
765
674
193
291
9
12
9
2641
2529
6
1501
1962 1963
2257
2164
2057
2437
2356
2256
7
1679
1869
1768
1330
1417
1600
1870
1329
1599
5
1144
1239 1240
1044
9
762 tmRNA
586
1045
761
949
847
585
483
207
120
Regulatory functions
1057
859
764
960
458
571
673
290
11
102
2528
7
2
2056
1960
1678
1500
5
1238
948
760
665
583
391
11
18
Transcription
457
288
192
101
2738
2638
6
7
2255
2162
2055
1959
2436
2524 2525 2527
2637
1767
9
1416
1598
1868
2355
14
1328
7
1237
1043
303
482
206
119
16
Nucleotides
958
856
763
671
191
9 10
2254
15
2053
1675
2353
2434 2435
2736
1499
1597
1956
5
1327
1236
1141
1042
846
759
5
204
118
Fatty acid and phospholipid metabolism
1056
855
568
455
287
100
2636
2523
2433
2352
9
2161
2052
2253
1955
1867
1766
1498
1674
10
1415
1325
1235
1041
481
582
947
664
480
14
203
15
117
DNA metabolism
854
957
669
762
12
567
454
190
99
8
2734
2522
10
2432
2159 2160
2051
9
1866
11
1596
1497
1414
1140
1040
581
390
302
758
9
19
116
14
Energy metabolism
1055
853
1
14
189
98
7
2733
2520
2350
2252
2635
2157
2431
2349
286
2519
285
956
6
566
284
186
97
2634
2156
2251
2049
1953
1765
1673
1595
1323
1139
944
663
478
202
115
580
943
757
579
845
662
942
756
11
12
477
13
114
201
300
113
578
389
1234
1496
1138
1039
941
661
576
577
299
1
12
Central intermediary metabolism
1054
2348
2250
1865
1413
1594
1321
1495
2048
1137
1232
1038
1136
112
200
476
755
5Sd
111
Cellular processes
1053
5
453
667
4
2047
1952
12
1672
2517 2518
2732
760
565
852
283
96
3
1231
1593
1411
2153
2430
1135
940
5
575
660
298
844
753
939
574
475
5
10
Biosynthesis of cofactors, prosthetic groups, and carriers
1052
851
8
666
758
564
450
183
759
2731
95
2347
2249
2631 2632 2633
2514
2346
2248
843
9
1037
9
1764
1671
2046
2152
2045
1494
12
1229
1592
1864
1410
1951
1863
1763
1670
1591
1320
199
9
Cell envelope
1051
9
449
282
94
952
850
1050
951
756
13
665
182
92
563
447
5
949
663
181
1
2428
2729 2730
2630
2727
2513
2427
2510 2511
2426
2345
1036
8
23Sd
938
752
659
573
842
474
296 297
198
108
5 6 7
1134
841
937
1228
1319
1133
2247
1862
1762
1409
1226
936
2
5
570 571
295
4
749 750 751
658
473
840
568
2628 2629
2425
6
2344
2246
2146
2043
1950
196
16Sd
106
1590
1318
1132
1035
748
293
9
1861
1761
1669
1492
1408
1224
935
839
2145
2245
2042
1860
1760
567
195
7
747
1317
1589
3
105
656 657
1034
1407
12
934
838
746
655
386
472
566
385
194
291
1 2
103
articles
NATURE | VOL 406 | 3 AUGUST 2000 | www.nature.com