DNA sequence of both chromosomes of the cholera pathogen Vibrio ...

2 downloads 0 Views 1MB Size Report
The V. cholerae genomic sequence provides a starting point for understanding how a ... Vibrio cholerae is the aetiological agent of cholera, a severe diar-.
articles

DNA sequence of both chromosomes of the cholera pathogen Vibrio cholerae John F. Heidelberg*, Jonathan A. Eisen*, William C. Nelson*, Rebecca A. Clayton, Michelle L. Gwinn*, Robert J. Dodson*, Daniel H. Haft*, Erin K. Hickey*, Jeremy D. Peterson*, Lowell Umayam*, Steven R. Gill*, Karen E. Nelson*, Timothy D. Read*, Herve Tettelin*, Delwood Richardson*, Maria D. Ermolaeva*, Jessica Vamathevan*, Steven Bass*, Haiying Qin*, Ioana Dragoi*, Patrick Sellers*, Lisa McDonald*, Teresa Utterback*, Robert D. Fleishmann*, William C. Nierman*, Owen White *, Steven L. Salzberg*, Hamilton O. Smith*², Rita R. Colwell³, John J. Mekalanos§, J. Craig Venter*² & Claire M. Fraser* * The Institute for Genomic Research, 9712 Medical Center Drive, Rockville, Maryland 20850, USA ³ Center of Marine Biotechnology, University of Maryland Biotechnology Institute, 701 East Pratt Street, Baltimore, Maryland 21202, USA, and Department of Cell and Molecular Biology, University of Maryland, College Park, Maryland 20742, USA § Harvard Medical School, Department of Microbiology and Molecular Genetics, 200 Longwood Avenue, Boston, Massachusetts 02115, USA

............................................................................................................................................................................................................................................................................

Here we determine the complete genomic sequence of the Gram negative, g-Proteobacterium Vibrio cholerae El Tor N16961 to be 4,033,460 base pairs (bp). The genome consists of two circular chromosomes of 2,961,146 bp and 1,072,314 bp that together encode 3,885 open reading frames. The vast majority of recognizable genes for essential cell functions (such as DNA replication, transcription, translation and cell-wall biosynthesis) and pathogenicity (for example, toxins, surface antigens and adhesins) are located on the large chromosome. In contrast, the small chromosome contains a larger fraction (59%) of hypothetical genes compared with the large chromosome (42%), and also contains many more genes that appear to have origins other than the g-Proteobacteria. The small chromosome also carries a gene capture system (the integron island) and host `addiction' genes that are typically found on plasmids; thus, the small chromosome may have originally been a megaplasmid that was captured by an ancestral Vibrio species. The V. cholerae genomic sequence provides a starting point for understanding how a free-living, environmental organism emerged to become a signi®cant human bacterial pathogen. Vibrio cholerae is the aetiological agent of cholera, a severe diarrhoeal disease that occurs most frequently in epidemic form1. Cholera has been epidemic in southern Asia for at least 1,000 years, but also spread worldwide to cause seven pandemics since 1817 (ref. 1). When untreated, cholera is a disease of extraordinarily rapid onset and potentially high lethality. Although clinical management of cholera has advanced over the past 40 years, cholera remains a serious threat in developing countries where sanitation is poor, health care limited, and drinking water unsafe. Vibrio cholerae as a species includes both pathogenic and nonpathogenic strains that vary in their virulence gene content2. This bacterium contains a wide variety of strains and biotypes, receiving and transferring genes for toxins3, colonization factors4,5, antibiotic resistance6, capsular polysaccharides that provide resistance to chlorine7 and new surface antigens, such as the 0139 lipopolysaccharide and O antigen capsule8,9. The lateral or horizontal transfer of these virulence genes by phage3, pathogenicity islands10,11 and other accessory genetic elements12 provides insights into how bacterial pathogens emerge and evolve to become new strains. Vibrio species represent a signi®cant portion of the culturable heterotrophic bacteria of oceans, coastal waters and estuaries13,14. Environmental studies show that these bacteria strongly in¯uence nutrient cycling in the marine environment. Various species of this genus are also devastating pathogens for ®n®sh, shell®sh and mammals. There is still much to be learned about the aquatic ecology and natural history of V. cholerae including its autochthonous (native) existence in endemic locales during cholera-free, interepidemic periods, which environmental factors, such as climate13,15, aided its re-emergence in Latin America, and which environmental factors are associated with its habitat in cholera endemic regions. For example, V. cholerae, during interepidemic periods, is an inhabitant ² Present address: Celera Genomics, 45 West Gude Drive, Rockville, Maryland 20850, USA

NATURE | VOL 406 | 3 AUGUST 2000 | www.nature.com

of brackish and estuarine waters, and in these environments is associated with zooplankton and other aquatic ¯ora and fauna16. The organism also enters a ``viable but nonculturable''17 state under certain conditions. Roles for these environmental interactions and this dormant physiological state in the emergence and persistence of pathogenic V. cholerae have been proposed14,17. Here we report the determination and analysis of the Vibrio cholerae genome sequence. This analysis represents an important step toward the complete molecular description of how this freeliving environmental organism emerged to become a human pathogen by horizontal gene transfer.

Genome analysis

The genome of V. cholerae was sequenced by the whole genome random sequencing method18. The genome consists of two circular chromosomes19,20 of 2,961,146 (chromosome 1) and 1,072,314 Table 1 General features of the Vibrio cholerae genome

Size (bp) Total number of sequences G+C percentage Total number of ORFs ORF size (bp) Percentage coding Number of rRNA operons (16S-23S-5S) Number of tRNA Number similar to known proteins Number similar to proteins of unknown function* Number of conserved hypothetical proteins² Number of hypothetical proteins³ Number of Rho-independent terminators

Chromosome 1

Chromosome 2

2,961,151 36,797 47.7 2,770 952 88.6 8 94 1,614 (58%) 163 (6%) 478 (17%) 515 (19%) 599

1,072,914 14,367 46.9 1,115 918 86.3 0 4 465 (42%) 66 (6%) 165 (15%) 419 (38%) 193

............................................................................................................................................................................. bp, base pairs. ORFs, open reading frames. * Proteins of unknown function, signi®cant sequence similarity (homology) to a named protein for which there is currently no known function. ² Conserved hypothetical protein, sequence similarity to a translation of another ORF, but there is currently no experimental evidence a protein is expressed. ³ Hypothetical protein, no signi®cant sequence similarity to another protein.

© 2000 Macmillan Magazines Ltd

477

articles (chromosome 2) base pairs, with an average G+C content of 46.9% and 47.7%, respectively. There are a total of 3,885 predicted open reading frames (ORFs) and 792 predicted Rho-independent terminators; with 2,770 and 1,115 ORFs and 599 and 193 Rho-independent terminators on the individual chromosomes (Table 1, Figs 1 and 2). Most genes required for growth and viability are located on chromosome 1, although some genes found only on chromosome 2 are also thought to be essential for normal cell function (for example, dsdA, thrS and the genes encoding ribosomal proteins L20 and L35). Additionally, many intermediaries of metabolic pathways are encoded only on chromosome 2 (Fig. 3). The replicative origin in chromosome 1 was identi®ed by similarity to the Vibrio harveyi and Escherichia coli origins, co-localization of genes (dnaA, dnaN, recF and gyrA) often found near the origin in prokaryotic genomes, and GC nucleotide skew (G-C/ G+C) analysis21. Based on these, we designated base-pair 1 in an intergenic region that is located in the putative origin of replication. Only the GC skew analysis was useful in identifying a putative origin on chromosome 2. This genomic sequence of V. cholerae con®rmed the presence of a large integron island (a gene capture system) located on chromosome 2 (125.3 kbp)22,23. The V. cholerae integron island contains all copies of the V. cholerae repeat (VCR) sequence and 216 ORFs (Fig. 1). However, most of these ORFs have no homology to other sequences. Among the recognizable integron island genes are three that encode gene products that may be involved in drug resistance (chloramphenicol acetyltransferase, fosfomycin resistance protein and glutathione transferase), several DNA metabolism enzymes (MutT, transposase, and an integrase), potential virulence genes (haemagglutinin and lipoproteins) and three genes which encode gene products similar to the `host addiction' proteins (higA, higB and doc), which are used by plasmids to select for their maintenance by host cells.

V. cholerae biology, notably its ability to inhabit diverse environments. These environments, in turn, may have selected the duplication and divergence of genes useful for specialized functions. Additionally, whereas El Tor strain N16961 carries only a single copy of the cholera toxin prophage, other V. cholerae strains carry several copies of this element24,25, and strains of the classical biotype have a second copy of the prophage that is localized on chromosome 2 (ref. 20). Thus, virulence genes are presumed to be subject to selective pressure, affecting copy number and chromosomal location. Several ORFs with apparently identical functions exist on both chromosomes which were probably acquired by lateral gene transfer. For example, glyA (encoding serine hydroxymethyl transferase) is found once on each chromosome but the phylogenetic analysis suggest the glyA copy on chromosome 1 branches with the a-Proteobacteria, whereas the copy on chromosome 2 branches with the g-Proteobacteria (see Supplementary Information). The chromosome 2 glyA is ¯anked by genes encoding transposases, suggesting that this gene was acquired through a transposition event.

Figure 1 Linear representation of the V. cholerae chromosomes. The location of Q the predicted coding regions, colour-coded by biological role, RNA genes, tRNAs, other RNAs, Rho-independent terminators and Vibrio cholerae repeats (VCRs) are indicated. Arrows represent the direction of transcription for each predicted coding region. Numbers next to the tRNAs represent the number of tRNAs at a locus. Numbers next to GES represent the number of membrane-spanning domains predicted by the Goldman, Engleman and Steitz scale calculated by TopPred for that protein. Gene names are available at the TIGR web site (www.tigr.org) and as Supplementary Information.

1,000,000

100,000

900,000

Comparative genomics

The two-chromosome structure of V. cholerae allows for comparisons, both between the two chromosomes of this organism and between either of the V. cholerae chromosomes and the chromosomes of other microbial species. There is pronounced asymmetry in the distribution of genes known to be essential for growth and virulence between the two chromosomes. Signi®cantly more genes encoding DNA replication and repair, transcription, translation, cell-wall biosynthesis and a variety of central catabolic and biosynthetic pathways are encoded by chromosome 1. Similarly, most genes known to be essential in bacterial pathogenicity (that is, those encoding the toxin co-regulated pilus, cholera toxin, lipopolysaccharide and the extracellular protein secretion machinery) are also located on chromosome 1. In contrast, chromosome 2 contains a larger fraction (59%) of hypothetical genes and genes of unknown function, compared with chromosome 1 (42%) (Fig. 4). This partitioning of hypothetical proteins on chromosome 2 is highly localized in the integron island (Fig. 2). Chromosome 2 also carries the 3-hydroxy-3-methylglutaryl CoA reductase, a gene apparently acquired from an archaea (Y. Boucher and W. F. Doolittle, personal communication). The majority of the V. cholerae genes were very similar to E. coli genes (1,454 ORFs), but 499 (12.8%) of the V. cholerae ORFs showed highest similarity to other V. cholerae genes, suggesting recent duplications (Figs 5 and 6). Most of the duplicated ORFs encode products involved in regulatory functions (59), chemotaxis (50), transport and binding (42), transposition (18), pathogenicity (13) or unknown functions encoded by conserved hypothetical (62) and hypothetical proteins (113). There are 105 duplications with at least one of each ORF on each chromosome indicating there have been recent crossovers between chromosomes. The extensive duplication of genes involved in scavenging behaviour (chemotaxis and solute transport) suggests the importance of these gene products in 478

1 200,000

800,000

300,000

700,000

400,000

600,000 2,900,000 2,800,000 2,700,000 2,600,000 2,500,000

500,000 1

100,000 200,000 300,000 400,000 500,000

2,400,000 600,000 2,300,000 700,000 2,200,000 800,000 2,100,000 900,000 2,000,000

1,000,000

1,900,000

1,100,000

1,800,000

1,200,000 1,700,000 1,300,000 1,600,000 1,500,000 1,400,000

Figure 2 Circular representation of the V. cholerae genome. The two chromosomes, large and small, are depicted. From the outside inward: the ®rst and second circles show predicted protein-coding regions on the plus and minus strand, by role, according to the colour code in Fig. 1 (unknown and hypothetical proteins are in black). The third circle shows recently duplicated genes on the same chromosome (black) and on different chromosomes (green). The fourth circle shows transposon-related (black), phage-related (blue), VCRs (pink) and pathogenesis genes (red). The ®fth circle shows regions with signi®cant x2 values for trinucleotide composition in a 2,000-bp window. The sixth circle shows percentage G+C in relation to mean G+C for the chromosome.The seventh and eighth circles are tRNAs and rRNAs, respectively.

© 2000 Macmillan Magazines Ltd

NATURE | VOL 406 | 3 AUGUST 2000 | www.nature.com

articles Origin and function of the small chromosome of V. cholerae

was originally a megaplasmid (Fig. 4). The megaplasmid presumably acquired genes from diverse bacterial species before its capture by the ancestral Vibrio. The relocation of several essential genes from chromosome 1 to the megaplasmid completed the stable capture of this smaller replicon. Apparently this capture of the megaplasmid occurred long enough ago that the trinucleotide composition and percentage G+C content between the two chromosomes has become similar (except for laterally moving elements such as the integron island, bacteriophage genomes, transposons, and so on). The two chromosome structure is found in other Vibrio species19 suggesting that the gene content of the megaplasmid continues to provide Vibrio with an evolutionary advantage, perhaps within the aquatic ecosystem where Vibrio species are frequently the dominant microorganisms14,16. It is unclear why chromosome 2 has not been integrated into chromosome 1. Perhaps chromosome 2 plays an important specialized function that provides the evolutionary selective pressure to

Several lines of evidence suggest that chromosome 2 was originally a megaplasmid captured by an ancestral Vibrio species. The phylogenetic analysis of the ParA homologues located near the putative origin of replication of each chromosome shows chromosome 1 ParA tending to group with other chromosomal ParAs, and the ParA from chromosome 2 tending to group with plasmid, phage and megaplasmid ParAs (see Supplementary Information). In general, genes on chromosome 2, with an apparently identical functioning copy on chromosome 1, appear less similar to orthologues present in other g-Proteobacteria species (see Supplementary Information). Also, chromosome 1 contains all the ribosomal RNA operons and at least one copy of all the transfer RNAs (four tRNAs are found on chromosome 2, but there are duplicates on chromosome 1). In addition, chromosome 2 carries the integron region, an element often found on plasmids26. Finally, the bias in the functional gene content is more easily explained, if chromosome 2 uracil NupC NMN, pnuC uraA Family xanthine/ (2,1) uracil family

ELECTRON TRANSPORT CHAIN

H+

H+

cellobiose* fructose mannitol sucrose? glucose

ATP H+ synthase

H+

sbp

sulphate

CHITIN

sugar-Pi

LACTO SE

FRUCTO SE

Histidine Degradation Pathway

sulphate, cysZ

GLYCOLYSIS

phosphate, nptA ?

GALACTOSE GLYCEROL **

HISTIDINE

fatty acids, fadL(2,1)

MANNITOL MANNO SE

urocanate

MALATE

arginine

PENTOSE PATHWAY

Na +/alanine (3)

chorismate

D-LACTATE

ribose

ttyrosine Na +/proline,putP

PP formate, H 2+CO 2

mal E/F/G/ K

galactoside

cadaverine

ASP ARAGINE

>40 flagellar and

mglA/B/C

oxalate/formate ?

METHIONINE L-ASPA RTATE

Na +/citrate, citS Na +/dicarboxylate(1,1) formate ? (1,1) benzoate, benE C4- dicarboxylate

fumarate

Fatty Acid Biosynthes is and Degradation

propionate O-succinyl-

SERINE

cis-aconitate

CO 2

bcaa, brnQ potassium, trkA/H

ARGININE

bypass isocitrate

succinate

NH 4+ ? GLUTA MINE

succinyl-CoA

Mg2+, mgt,(2,1)

PUTRESCINE

2-oxoglutarate

L- cysteine

dctP/Q/M, dcuA/B/C

tyrosine, tyrP

MCPs (23,20) ORNITHINE

Glyoxylate

homoserine glycine

serine,sda C (2) tryptophan,mtr

glyoxylate fumarate

AzlC family

V/W/Y/Z

TCA Cycle

L-iso-leucine THREONINE

BCCT family

CheA/B/D?/R/

oxaloacetic citrate acid malate

arginine/ ornithine putrescine/ ornithine, potE

motor genes

ethanol L- leucine

ASP ARTATE

diaminopimelic acid

FLAGELLUM

cadaverine/ lysine?

propionate

L-lysine maltose

Na +/glutamate,gltS

melanin

D-alanine L-alanine leucine valine

2-keto-isovalerate

Acetyl-CoA

acetyl-P

(1,1) proton/peptide family

L- tryptophan

SERINE

Pyruvate

L-LACTATE ACETA TE

proton/glutamate, gltP

L- phenylalanine serine

formiminoL-glutamate

rbsA/B/C/D

artI/M/P/Q

PRP P

PHOSPH ATE

PEP

L-glutamate

oppA/B/C/D/F

amino acids (3,1)

RIBOSE

OXIDATIV E

L-TRYPTOP HAN

L-lactate ?

oligopeptides

GLUCONATE NON -

L-ALANINE

imidazolepropanoate

Pi

ptsH

PATHWAY

GLUCONEOG ENESIS

sulphate(2,2)

peptides (3,1)

ptsI

DOUDO ROFF

SUCROSE

Pho4 family protein

potA/B/C/D

ENTNER

GLUCOSE

spermidine/ putrescine

Pi

IIA

NAG trehalose?

TREHALOSE N-ACETYLGLUCOSAMINE

modA/B/C

gluconate ?

IIBC

GLYCOGEN

STARCH

pstA/B/C/S

molybdenum

ompS

+ Pi

e-

phosphate (1,1)

maltoporin

ADP

ATP

cy sA/P/T /W

PTS system Pi

sugar family

iron (II), feoA/B Na +/H +(4,2)

PROLINE

L-GLUTAMATE

potassium, kup potassium ke fB (2?) ExbB(1,1)

porin ?

lipolysaccharide/ O-antigen, rfbH/I

NadC family iron (III) G3P MsbA? cations AcrB/D/F (2,2)

glyc G3P colicins thiamine? B12? btuB/C/D toxins drugs heme glpF glpT tolA/B/Q/R ugpA/B/C/E (4,2) (14,3) ccmA/B/C/D

Figure 3 Overview of metabolism and transport in V. cholerae. Pathways for energy production and the metabolism of organic compounds, acids and aldehydes are shown. Transporters are grouped by substrate speci®city: cations (green), anions (red), carbohydrates (yellow), nucleosides, purines and pyrimidines (purple), amino acids/ peptides/amines (dark blue) and other (light blue). Question marks associated with transporters indicate a putative gene, uncertainty in substrate speci®city, or direction of transport. Permeases are represented as ovals; ABC transporters are shown as composite ®gures of ovals, diamonds and circles; porins are represented as three ovals; the largeconductance mechanosensitive channel is shown as a gated cylinder; other cylinders represent outer membrane transporters or receptors; and all other transporters are drawn as rectangles. Export or import of solutes is designated by the direction of the arrow through the transporter. If a precise substrate could not be determined for a transporter, no gene name was assigned and a more general common name re¯ecting the type of substrate being transported was used. Gene location on the two chromosomes, for both NATURE | VOL 406 | 3 AUGUST 2000 | www.nature.com

E1-E2

family (2,1)

vibriobactin receptor viuA

hemin

zinc

vibriobactin

hutB/C/D

znuA/B/C

fepB/C/D/G

TonB (1,1)

mscL

ExbD(1,1)

+

MCP

heme

K ? hutA

irgA

TonB system receptor(1,2)

chemotactic signals

family (3)

transporters and metabolic steps, is indicated by arrow colour: all genes located on the large chromosome (black); all genes located on the small chromosome (blue); all genes needed for the complete pathway on one chromosome, but a duplicate copy of one or more genes on the other chromosome (purple); required genes on both chromosomes (red); complete pathway on both chromosomes (green). (Complete pathways, except for glycerol, are found on the large chromosome.) Gene numbers on the two chromosomes are in parentheses and follow the colour scheme for gene location. Substrates underlined and capitalized can be used as energy sources. PRPP, phosphoribosyl-pyrophosphate; PEP, phosphoenolpyruvate; PTS, phosphoenolpyruvate-dependant phosphotransferase system; ATP, adenosine triphosphate; ADP, adenosine diphosphate; MCP, methylaccepting chaemotaxis protein; NAG, N-acetylglucosamine; G3P, glycerol-3-phosphate; glyc, glycerol; NMN, nicotinamide mononucleotide. Asterisk, because V. cholerae does not use cellobiose, we expect this PTS system to be involved in chitobiose transport.

© 2000 Macmillan Magazines Ltd

479

articles suppress integration events when they do occur. For example, if under some environmental condition there is a difference in copy number between the chromosomes, then chromosome 2 may have accumulated genes that are better expressed at higher or lower copy number than genes on chromosome 1. A second possibility is that, in response to environmental cues, one chromosome may partition to daughter cells in the absence of the other chromosome (aberrant segregation). Such single-chromosome-containing cells would be replication-defective but still maintain metabolic activity (`drone' cells), and, therefore, be a potential source of ``viable, but nonculturable (VBNC)'' cells observed to occur in V. cholerae17. Such `drone' cells may also play a role in V. cholerae bio®lms7,27,28 by, for example, producing extracellular chitinase, protease and other degradative enzymes that enhance survival of cells in a bio®lm, retaining two chromosomes without directly competing with these cells for nutrients.

Transport and energy metabolism

Vibrio cholerae has a diverse natural habitat that includes association with zooplankton in a sessile stage, a planktonic state in the water column, and the capacity to act as a pathogen within the human gastrointestinal tract. It is, therefore, no surprise that this organism maintains a large repertoire of transport proteins with broad substrate speci®city and the corresponding catabolic pathways to enable it to respond ef®ciently to these different and constantly changing ecosystems (Fig. 4). Many of the sugar transporter systems and their corresponding catabolic pathway enzymes are localized on a single chromosome (that is, ribose and lactate transport and degradation enzymes are contained on chromosome 2, whereas the trehalose systems reside on chromosome 1). However, many of the other energy metabolism pathways are split (that is, chitin, glycolysis, and so on) between the chromosomes (Fig. 3). In aquatic environments, chitin often represents a source of both carbon and nitrogen. This energy source is important for V. cholerae as it is associated with zooplankton, which have a chitinous exoskeleton13,15,29. Vibrio cholerae degrades chitin by a pathway that is very similar to that of Vibrio furnissii30. Sequence analysis suggests a phosphenolpyruvate phosphotransferase system (PTS)

Interchromosomal regulation

Several of the regulatory pathways, both for regulation in response to environmental and pathogenic signals, are divided between the two chromosomes. These included pathways for starvation survival, `quorum sensing' and expression of the entertoxigenic haemolysin, HlyA. During periods of nutrient starvation, V. cholerae, and other Gram-negative bacteria, enter the stationary phase and, later, the viable but nonculturable (VBNC) state14,17. The alternative sigma factor j38 (rpoS) is required for survival of V. cholerae in the environment but not for pathogenicity31, and therefore probably plays an important role in the initiation of the VBNC state. There is one copy of rpoS, located on chromosome 1, near the oriC. The RpoS regulates expression of several other proteins, including catalase, cyclopropane-fatty-acyl-phospholipid synthase and HA/protease, which are found on both chromosomes31. Genes involved in `quorum sensing', or cell-density-dependent regulation, also exist on both chromosomes of V. cholerae. In bioluminescent Vibrio species (notably Vibrio ®scheri and V. harveyi), quorum sensing is used to control light production. Although this strain of V. cholerae lacks the genes for bioluminescence, it does have the genes required for the autoinducer-2 (AI-2) quorumsensing mechanism32 (luxOPQSU) but this pathway is split between the chromosomes with luxOSU on chromosome 1 and luxPQ on chromosome 2. Similarly, another transcriptional regulatory gene,

9

in Relative proportion of most similar oc A o qu open reading frames M M ccu ife x yc y s ob co B rad aeo ac pla ac io lic te sm illu du us Borium a g s s ran e ub s r C Tre reli tub nit tili hl p a e al s am on b rc iu C y e ur u m hl d m g lo am ia a do si y p p rf s H ae dia neuallid eri m Es tra m um op c c on hi he ho iae N ei Vlus rich mat ss ib in ia is Ri er ri flu c ck ia o e ol c n i e H tt me ho za C eli sia nin ler e am co p g a py ba row itid e M et Th Sy lob cte aze is ha n r e no rm ec act py kii ba M A o ho er lo ct e rch Ae tog cy jej ri er th ae ro a st un iu an o py m is i m o g ru a sp th co lob m riti . m e c Py rm cu us pe a ro oa s ja fulg rni co ut n id x Sa Ca Py cc otr nas us cc eno roc us op ch ha rh oc ho hic ii ro ab cu rik um m d s os yc itis a h es e by ii ce leg ssi re an vis s ia e

0.35

10

0.3

8

0.25

7

0.2

6

0.15

5

0.1

4 3

0.05

2

0

1

Fa tty

Am in ac nuc o id le a an osi P cid d de ur bi p s pr ho , aine osy C osth B sp nd s, p nth en e io h n y e tra tic sy olip uc rim sis l i g nth id leo idi * nt ro e m t n er up sis e ide es m s o ta s , ed , a f bo Tr an i n c l sp E ary d cofa ism or ne m ar cto t b rg eta rie rs in y m bo rs* , di ng et lism a DN and bol * pr ism A ot m ei e T tab ns Pr ran oli ot sc sm ei n ript * sy io Re n n gu Pr the * la ote sis to in * ry fu fat C Ce nc e* el ll lu en tion la s v O r pr elo th o p er ce e* ca ss H teg es* yp ot orie he s tic al *1

0

De

Open reading frames (%)

for cellobiose transport, but as V. cholerae does not use cellobiose, it is more likely that this PTS is involved in transport of the structurally similar compound, chitobiose, analogous to the situation proposed for Bourrelia burgdorferi18. The three anions that are transported by ABC transport systems in V. cholerae are molybdenum, phosphate and sulphate. Molybdenum transport genes (modA/B/C) are all located on chromosome 2, and most of the sulphate transport genes are on the large molecule. However, copies of the genes for phosphate transporters are found in both chromosomes. The genes in these two phosphate transport operons are different from each other and do not represent a recent duplication; instead, this suggests that one may be an acquired operon.

Figure 4 Percentage of total Vibrio cholerae open reading frames (ORFs) in biological roles compared with other g-Proteobacteria. These were V. cholerae, chromosome 1 (blue); V. cholerae, chromosome 2 (red); Escherichia coli (yellow); Haemophilus in¯uenzae (pale blue). Signi®cant partitioning (P , 0.01) of biological roles between V. cholerae chromosomes is indicated with an asterisk, as determined with a x2 analysis. 1, Hypothetical contains both conserved hypothetical proteins and hypothetical proteins, and is at 1/10 scale compared with other roles. 480

Figure 5 Comparison of the V. cholerae ORFs with those of other completely sequenced genomes. The sequence of all proteins from each completed genome were retrieved from NCBI, TIGR and the Caenorhabditis elegans (wormpep16) databases. All V. cholerae ORFs (large chromosome, blue; small chromosome, red) were searched against all other genomes with FASTA3. The number of V. cholerae ORFs with greatest similarity (E # 10-5) are shown in proportion to the total number of ORFs in that genome. There were no ORFs that were most similar to a Mycoplasma pneumoniae ORF.

© 2000 Macmillan Magazines Ltd

NATURE | VOL 406 | 3 AUGUST 2000 | www.nature.com

articles

**

**

V.cholerae VC0512 V.cholerae VCA1034 V.cholerae VCA0974 V.cholerae VCA0068 ** V.cholerae VC0825 * V.cholerae VC0282 V.cholerae VCA0906 V.cholerae VCA0979 V.cholerae VCA1056 V.cholerae VC1643 V.cholerae VC2161 V.cholerae VCA0923 ** V.cholerae VC0514 ** V.cholerae VC1868 V.cholerae VCA0773 V.cholerae VC1313 V.cholerae VC1859 V.cholerae VC1413 V.cholerae VCA0268 V.cholerae VCA0658 ** V.cholerae VC1405 V.cholerae VC1298 * V.cholerae VC1248 V.cholerae VCA0864 V.cholerae VCA0176 V.cholerae VCA0220 ** V.cholerae VC1289 V.cholerae A1069 ** V.cholerae VC2439 V.cholerae 1967 V.cholerae A0031 V.cholerae VC1898 V.cholerae VCA0663 V.cholerae VCA0988 V.cholerae VC0216 V.cholerae VC0449 * V.cholerae VCA0008 V.cholerae VC1406 V.cholerae VC1535 V.cholerae VC0840 B.subtilis gi2633766 Synechocystis sp. gi1001299 Synechocystis sp. gi1001300 * Synechocystis sp. gi1652276 * Synechocystis sp. gi1652103 * H.pylori gi2313716 ** H.pylori 99 gi4155097 ** C.jejuni Cj1190c C.jejuni Cj1110c A.fulgidus gi2649560 ** A.fulgidus gi2649548 B.subtilis gi2634254 B.subtilis gi2632630 B.subtilis gi2635607 B.subtilis gi2635608 B.subtilis gi2635609 ** ** ** B.subtilis gi2635610 B.subtilis gi2635882 E.coli gi1788195 E.coli gi2367378 ** E.coli gi1788194 * E.coli gi1787690 V.cholerae VCA1092 V.cholerae VC0098 E.coli gi1789453 H.pylori gi2313186 99 gi4154603 ** H.pylori C.jejuni Cj0144 C.jejuni Cj1564 ** C.jejuni Cj0262c ** C.jejuni Cj1506c H.pylori gi2313163 * ** H.pylori 99 gi4154575 H.pylori gi2313179 ** ** H.pylori 99 gi4154599 C.jejuni Cj0019c C.jejuni Cj0951c C.jejuni Cj0246c B.subtilis gi2633374 T.maritima TM0014 V.cholerae VC1403 V.cholerae VCA1088 T.pallidum gi3322777 T.pallidum gi3322939 T.pallidum gi3322938 ** ** B.burgdorferi gi2688522 T.pallidum gi3322296 B.burgdorferi gi2688521 * ** T.maritima TM0429 T.maritima TM0918 ** T.maritima TM0023 T.maritima TM1428 * T.maritima TM1143 T.maritima TM1146 P.abyssi PAB1308 P.horikoshii gi3256846 ** P.abyssi PAB1336 ** ** P.horikoshii gi3256896 ** P.abyssi PAB2066 P.horikoshii gi3258290 ** * P.abyssi PAB1026 P.horikoshii gi3256884 ** D.radiodurans DRA00354 D.radiodurans DRA0353 ** ** D.radiodurans DRA0352 V.cholerae VC1394 P.abyssi PAB1189 P.horikoshii gi3258414 ** B.burgdorferigi 2688621 M.tuberculosis gi1666149 V.cholerae VC0622

Figure 6 Phylogenetic tree of methyl-accepting chemotactic proteins (MCP) homologues in completed genomes. Homologues of MCP were identi®ed by FASTA3 searches of all available complete genomes. Amino-acid sequences of the proteins were aligned using CLUSTALW, and a neighbour-joining phylogenetic tree was generated from the alignment using the PAUP* program (using a PAM-based distance calculation). Hypervariable regions of the alignment and positions with gaps in many of the sequences were excluded from the analysis. Nodes with signi®cant bootstrap values are indicated: two asterisks, .70%; asterisk, 40±70%. NATURE | VOL 406 | 3 AUGUST 2000 | www.nature.com

hlyU (ref. 33), is located on chromosome 1, while the gene it regulates, hlyA, is located on chromosome 2.

DNA repair

Vibrio cholerae has genes encoding several DNA-repair and DNAdamage-response pathways, including nucleotide-excision repair, mismatch-excision repair, base-excision repair, AP endonuclease, alkylation transfer, photoreactivation, DNA ligation, and all the major components of recombination and recombinational repair, including initiation, recombination and resolution34. In addition, homologues of many of the genes involved in the SOS response in E. coli are found. The presence of three photolyase homologues, more than have been found in other bacterial species, probably allows for the ability to photoreactivate the two major forms of ultraviolet-induced DNA damage (cyclobutane pyrimidine dimers and 6-4 photoproducts), and may also allow use of a range of wavelengths of light used for the energy required for photoreactivation. It is also of interest that many of the repair genes are on chromosome 2 (alkA, ada1, ada2, phr3, mutK, sbcCD, dcm, mutT3), indicating that this chromosome is probably required for full DNA repair capability.

Pathogenicity

Toxins. The genome sequence of V. cholerae El Tor N16961 revealed a single copy of the cholera toxin (CT) genes, ctxAB, located on chromosome 1 within the integrated genome of CTXf, a temperate ®lamentous phage3. The receptor for entry of CTXf into the cell is the toxin-coregulated pilus (TCP)3, and the TCP gene cluster (see below) also resides on chromosome 1. Like the structural genes for CT and TCP, the regulatory gene, toxR, which controls their expression in vivo35, is also located on chromosome 1. On the other side of CTXf prophage is a region encoding an RTX toxin (rtxA), and its activator (rtxC) and transporters (rtxBD)36. A third transporter gene has been identi®ed that is a paralogue of rtxB, and is transcribed in the same direction as rtxBD. Downstream of this gene are two genes encoding a sensor histidine kinase and response regulator. Trinucleotide composition analysis suggests that the RTX region was horizontally acquired along with the sensor histidine kinase/response regulator, suggesting these regulators effect expression of the closely linked RTX transcriptional units. Also present are genes encoding numerous potential toxins, including several haemolysins, proteases and lipases. These include hap, the haemagluttinin protease, a secreted metalloprotease that seems to attack proteins involved in maintaining the integrity of epithelial cell tight junctions37, and hlyA, encoding a secreted haemolysin that displays enterotoxic activity38. In contrast to CTX, RTX and all known intestinal colonization factors, the hap and hlyA genes virulence factors reside on chromosome 2. Vibrio cholerae has been reported to produce shiga-like toxins39; however, the sequence did not reveal genes encoding speci®c homologues of the A or B subunits of shiga toxin. Also not detected were genes encoding homologues of E. coli heat stable toxin (ST), which have been detected in other pathogenic strains of V. cholerae40. Colonization factors. The critical intestinal colonization factor of V. cholerae is the TCP, a type IV pilus5,41. The genome sequence con®rmed that the genes involved in TCP assembly (tcpABCDEFGHIJNQRST) reside on chromosome 1 (ref. 20) as part of a proposed `pathogenicity island' (also referred to as VPI) composed of recently acquired DNA that encodes not only TCP, but also other genes associated with the ToxR regulatory cascade, such as acfABCD, toxT, aldA and tagAB10,11. Trinucleotide composition analysis suggest that this 45.3-kb segment begins at a 20-bp site upstream of aldA, and encompasses a helicase-related protein and a transcriptional activator which both share homology with bacteriophage proteins. At the other end of the segment of atypical trinucleotide composition is a phage family integrase and the

© 2000 Macmillan Magazines Ltd

481

articles other copy of the 20-bp site, which is presumably the target for integration of the island onto the chromosome10,11. It has been proposed that the TCP/ACF island corresponds to the genome of a ®lamentous phage that uses TCP pilin as a coat protein4. However, other than the three genes encoding phage-related proteins (that is, the helicase, transcriptional activator and integrase) we could ®nd no other genes on the island that encoded products with signi®cant homology to the conserved gene products of other ®lamentous phages or the structural proteins of non®lamentous phages. The maltose-sensitive haemagglutinin (MSHA) is unique to the El Tor biotype of V. cholerae. Initially characterized as a haemagglutinin, it was later found to be a type IV pilus42,43. The MSHA biogenesis (MshHIJKLMNEGF) and structural (MshBACD) proteins are all clustered on chromosome 1. There are no apparent integrases or transposases that might de®ne this region as a pathogenicity island or suggest an origin for it other than V. cholerae. In support of this conclusion, trinucleotide composition analysis shows that this region has similar composition to the rest of the chromosome, suggesting that if these genes were acquired it was very early in the Vibrio phylogenetic history. Recently, several investigators have reported that MSHA is not required for intestinal colonization, nor does it seem to appreciably affect the ef®ciency of colonization44±46, but instead plays a role in bio®lm formation27,28. Accordingly, this pilus may be important for the environmental ®tness of Vibrio species rather than for pathogenic potential. The pilA region of V. cholerae genome apparently encodes a third type IV pilus, although it has not been visualized47. This gene cluster includes a gene encoding a prepilin peptidase (PilD) that is required for the ef®cient processing of protein complexes with type IV prepilin-like signal sequences including TCP, MSHA and EPS47,48. The EPS system of V. cholerae encodes a type II secretion system involved in extracellular export of CT and other proteins. The EPS system is encoded by chromosome 1 but, like the MSHA genes, trinucleotide analysis suggests that the EPS genes of V. cholerae have not been recently acquired. In contrast, trinucleotide composition analysis suggests that the pilA gene cluster was acquired by horizontal transfer. Thus, analysis of the V. cholerae genome sequence provides some evidence that older gene clusters, like MSHA and EPS, have become dependent on newly acquired genes such as PilD.

Conclusions

The Vibrio cholerae genome sequence provides a new starting point for the study of this organism's environmental and pathobiological characteristics. It will be interesting to determine the gene expression patterns that are unique to its survival and replication during human infection35 as well as in the environment13,14,16. Additionally, the genomic sequence of V. cholerae should facilitate the study of this model multi-chromosomal prokaryotic organism. Comparative genomics between several species in the genus Vibrio will provide a better understanding of the origin of the new small chromosome and the role that it plays in Vibrio biology. The genome sequence may also provide important clues to understanding the metabolic and regulatory networks that link genes on the two chromosomes. Finally, V. cholerae clearly represents a promising genetic system for studying how several horizontally acquired loci located on separate chromosomes can still ef®ciently interact at the regulatory, cell biology and biochemical levels. M

Methods Whole-genome random sequencing procedure. Vibrio cholerae N16961 was grown from a single isolated colony. Cloning, sequencing and assembly were as described for genomes sequenced by TIGR18. One small-insert plasmid library (2±3 kb) was generated by random mechanical shearing of genomic DNA. One large insert library was ligated into l-DASHII/EcoRI vector (Stratagene). In the initial sequence phase, approximately sevenfold sequence coverage was achieved with 49,633 sequences from plasmid clones. Sequences from both ends of 383 l-clones served as a genome scaffold, verifying the orientation, order and integrity of the contigs. The plasmid and l sequences were jointly assembled using TIGR Assembler. Sequence gaps were closed

482

by editing the end sequences and/or primer walking on plasmid clones. Physical gaps were closed by direct sequencing of genomic DNA, or combinatorial polymerase chain reaction (PCR) followed by sequencing the PCR product. The ®nal genome sequence is based on 51,164 sequences. ORF prediction and gene family identi®cation. An initial set of ORFs, likely to encode proteins, was identi®ed with GLIMMER49, and those shorter than 30 codons were eliminated. ORFs that overlapped were visually inspected, and in some cases removed. ORFs were searched against a non-redundant protein database18. Frameshifts and point mutations were detected and corrected where appropriate. Remaining frameshifts and point mutations are considered to be authentic and were annotated as `authentic frameshift' or `authentic point mutation'. ORFs were also analysed with two sets of hidden Markov models (HMMs) constructed for a number of conserved protein families (1,313 from Pfam v3.1 (ref. 50) and 476 from the TIGRFAM) by use of the HMMER package. TopPred was used to identify membrane-spanning domains in proteins. Paralogous gene families were constructed by searching the ORFs against themselves using BLASTX, identifying matches with E # 10-5 over 60% of the query search length, and subsequently clustering these matches into multigene families. Multiple alignments for these protein families were generated with the CLUSTALW program and the alignments scrutinized. Distribution of all 64 trinucleotides (3-mers) for each chromosome was determined, and the 3-mer distribution in 2,000-bp windows that overlapped by half their length (1,000 bp) across the genome was computed. For each window, we computed the x2 statistic on the difference between its 3-mer content and that of the whole chromosome. A large value of this statistic indicates that the 3-mer composition in this window is different from the rest of the chromosome. Probability values for this analysis are based on the assumption that the DNA composition is relatively uniform throughout the genome. Because this assumption may be incorrect, we prefer to interpret high x2 values merely as indicators of regions on the chromosome that appear unusual and demand further scrutiny. Homologues of the genes of interest were identi®ed using the BLASTP and FASTA3 search programs. All homologues were then aligned to each other using the CLUSTALW program with default settings. Phylogenetic trees were generated from the alignments using the neighbour-joining algorithm as implemented by the PAUP* program (with a PAM matrix based distance calculation). Regions of the alignment that were hypervariable or were of low con®dence were excluded from the phylogenetic analysis. All alignments are available upon request. Received 3 April; accepted 18 May 2000. 1. Wachsmuth, K., Olsvik, é., Evins, G. M. & Popovic, T. in Vibrio Cholerae And Cholera: Molecular To Global Perspective (eds Wachsmuth, I. K., Blake, P. A. & Olsvik, é.) 357±370 (ASM Press, Washington DC, 1994). 2. Faruque, S. M., Albert, M. J. & Mekalanos, J. J. Epidemiology, genetics, and ecology of toxigenic Vibrio cholerae. Microbiol. Mol. Biol. Rev. 62, 1301±1314 (1998). 3. Waldor, M. K. & Mekalanos, J. J. Lysogenic conversion by a ®lamentous phage encoding cholera toxin. Science 272, 1910±1914 (1996). 4. Karaolis, D. K., Somara, S., Maneval, D. R. Jr, Johnson, J. A. & Kaper, J. B. A bacteriophage encoding a pathogenicity island, a type-IV pilus and a phage receptor in cholera bacteria. Nature 399, 375±379 (1999). 5. Brown, R. C. & Taylor, R. K. Organization of tcp, acf, and toxT genes within a ToxT-dependent operon. Mol. Microbiol. 16, 425±439 (1995). 6. Hochhut, B. & Waldor, M. K. Site-speci®c integration of the conjugal Vibrio cholerae SXTelement into prfC. Mol. Microbiol. 32, 99±110 (1999). 7. Yildiz, F. H. & Schoolnik, G. K. Vibrio cholerae O1 El Tor: identi®cation of a gene cluster required for the rugose colony type, exopolysaccharide production, chlorine resistance, and bio®lm formation. Proc. Natl Acad. Sci. USA 96, 4028±4033 (1999). 8. Bik, E. M., Bunschoten, A. E., Gouw, R. D. & Mooi, F. R. Genesis of the novel epidemic Vibrio cholerae O139 strain: evidence for horizontal transfer of genes involved in polysaccharide synthesis. EMBO J 14, 209±216 (1995). 9. Waldor, M. K., Colwell, R. & Mekalanos, J. J. The Vibrio cholerae O139 serogroup antigen includes an O-antigen capsule and lipopolysaccharide virulence determinants. Proc. Natl Acad. Sci. USA 91, 11388±11392 (1994). 10. Kovach, M. E., Shaffer, M. D. & Peterson, K. M. A putative integrase gene de®nes the distal end of a large cluster of ToxR-regulated colonization genes in Vibrio cholerae. Microbiology 142, 2165±2174 (1996). 11. Karaolis, D. K. et al. A Vibrio cholerae pathogenicity island associated with epidemic and pandemic strains. Proc. Natl Acad. Sci. USA 95, 3134±3139 (1998). 12. Mekalanos, J. J., Rubin, E. J. & Waldor, M. K. Cholera: molecular basis for emergence and pathogenesis. FEMS Immunol. Med. Microbiol. 18, 241±248 (1997). 13. Colwell, R. R. Global climate and infectious disease: the cholera paradigm. Science 274, 2025±2031 (1996). 14. Colwell, R. R. & Spira, W. M. in Cholera (eds Barua, D. & Greenough, W. B. III) 107±127 (Plenum Medical, New York, 1992). 15. Lobitz, B. et al. Climate and infectious disease: use of remote sensing for detection of Vibrio cholerae by indirect measurement. Proc. Natl Acad. Sci. USA 97, 1438±1443 (2000). 16. Colwell, R. R. & Huq, A. Environmental reservoir of Vibrio cholerae. The causative agent of cholera. Ann. NY Acad. Sci. 740, 44±54 (1994). 17. Roszak, D. B. & Colwell, R. R. Survival strategies of bacteria in the natural environment. Microbiol. Rev. 51, 365±379 (1987). 18. Fraser, C. M. et al. Genomic sequence of a Lyme disease spirochaete, Borrelia burgdorferi. Nature 390, 580±586 (1997). 19. Yamaichi, Y., Iida, T., Park, K. S., Yamamoto, K. & Honda, T. Physical and genetic map of the genome of Vibrio parahaemolyticus: presence of two chromosomes in Vibrio species. Mol. Microbiol. 31, 1513± 1521 (1999). 20. Trucksis, M., Michalski, J., Deng, Y. K. & Kaper, J. B. The Vibrio cholerae genome contains two unique circular chromosomes. Proc. Natl Acad. Sci. USA 95, 14464±14469 (1998).

© 2000 Macmillan Magazines Ltd

NATURE | VOL 406 | 3 AUGUST 2000 | www.nature.com

articles 21. Lobry, J. R. Asymmetric substitution patterns in the two DNA strands of bacteria. Mol. Biol. Evol. 13, 660±665 (1996). 22. Rowe-Magnus, D. A., Guerout, A. M. & Mazel, D. Super-integrons. Res. Microbiol. 150, 641±651 (1999). 23. Hall, R. M., Brookes, D. E. & Stokes, H. W. Site-speci®c insertion of genes into integrons: role of the 59-base element and determination of the recombination cross-over point. Mol. Microbiol. 5, 1941± 1959 (1991). 24. Davis, B. M., Kimsey, H. H., Chang, W. & Waldor, M. K. The Vibrio cholerae O139 Calcutta bacteriophage CTXphi is infectious and encodes a novel repressor. J. Bacteriol. 181, 6779±6787 (1999). 25. Mekalanos, J. J. Duplication and ampli®cation of toxin genes in Vibrio cholerae. Cell 35, 253±263 (1983). 26. Mazel, D., Dychinco, B., Webb, V. A. & Davies, J. A distinctive class of integron in the Vibrio cholerae genome. Science 280, 605±608 (1998). 27. Watnick, P. I. & Kolter, R. Steps in the development of a Vibrio cholerae El Tor bio®lm. Mol. Microbiol. 34, 586±595 (1999). 28. Watnick, P. I., Fullner, K. J. & Kolter, R. A role for the mannose-sensitive hemagglutinin in bio®lm formation by Vibrio cholerae El Tor. J. Bacteriol. 181, 3606±3609 (1999). 29. Huq, A. et al. Ecological relationships between Vibrio cholerae and planktonic crustacean copepods. Appl. Environ. Microbiol. 45, 275±283 (1983). 30. Bassler, B. L., Yu, C., Lee, Y. C. & Roseman, S. Chitin utilization by marine bacteria. Degradation and catabolism of chitin oligosaccharides by Vibrio furnissii. J. Biol. Chem. 266, 24276±24286 (1991). 31. Yildiz, F. H. & Schoolnik, G. K. Role of rpoS in stress survival and virulence of Vibrio cholerae. J. Bacteriol. 180, 773±784 (1998). 32. Bassler, B. L., Greenberg, E. P. & Stevens, A. M. Cross-species induction of luminescence in the quorum-sensing bacterium Vibrio harveyi. J. Bacteriol. 179, 4043±4045 (1997). 33. Williams, S. G., Attridge, S. R. & Manning, P. A. The transcriptional activator HlyU of Vibrio cholerae: nucleotide sequence and role in virulence gene expression. Mol. Microbiol. 9, 751±760 (1993). 34. Eisen, J. A. & Hanawalt, P. C. A phylogenomic study of DNA repair genes, proteins, and processes. Mutat. Res. 435, 171±213 (1999). 35. Lee, S. H., Hava, D. L., Waldor, M. K. & Camilli, A. Regulation and temporal expression patterns of Vibrio cholerae virulence genes during infection. Cell 99, 625±634 (1999). 36. Lin, W. et al. Identi®cation of a Vibrio cholerae RTX toxin gene cluster that is tightly linked to the cholera toxin prophage. Proc. Natl Acad. Sci. USA 96, 1071±1076 (1999). 37. Wu, Z., Milton, D., Nybom, P., Sjo, A. & Magnusson, K. E. Vibrio cholerae hemagglutinin/protease (HA/protease) causes morphological changes in cultured epithelial cells and perturbs their paracellular barrier function. Microb. Pathog. 21, 111±123 (1996). 38. Alm, R. A., Stroeher, U. H. & Manning, P. A. Extracellular proteins of Vibrio cholerae: nucleotide sequence of the structural gene (hlyA) for the haemolysin of the haemolytic El Tor strain 017 and characterization of the hlyA mutation in the non- haemolytic classical strain 569B. Mol. Microbiol. 2, 481±488 (1988). 39. O'Brien, A. D., Chen, M. E., Holmes, R. K., Kaper, J. & Levine, M. M. Environmental and human

NATURE | VOL 406 | 3 AUGUST 2000 | www.nature.com

40.

41. 42. 43.

44.

45.

46.

47. 48. 49. 50.

isolates of Vibrio cholerae and Vibrio parahaemolyticus produce a Shigella dysenteriae 1 (Shiga)-like cytotoxin. Lancet 1, 77±78 (1984). Ogawa, A., Kato, J., Watanabe, H., Nair, B. G. & Takeda, T. Cloning and nucleotide sequence of a heatstable enterotoxin gene from Vibrio cholerae non-O1 isolated from a patient with traveler's diarrhea. Infect. Immun. 58, 3325±3329 (1990). Manning, P. A. The tcp gene cluster of Vibrio cholerae. Gene 192, 63±70 (1997). Jonson, G., Holmgren, J. & Svennerholm, A. M. Identi®cation of a mannose-binding pilus on Vibrio cholerae El Tor. Microb. Pathog. 11, 433±441 (1991). Jonson, G., Lebens, M. & Holmgren, J. Cloning and sequencing of Vibrio cholerae mannose-sensitive haemagglutinin pilin gene: localization of mshA within a cluster of type 4 pilin genes. Mol. Microbiol. 13, 109±118 (1994). Attridge, S. R., Manning, P. A., Holmgren, J. & Jonson, G. Relative signi®cance of mannose-sensitive hemagglutinin and toxin-coregulated pili in colonization of infant mice by Vibrio cholerae El Tor. Infect. Immun. 64, 3369±3373 (1996). Tacket, C. O. et al. Investigation of the roles of toxin-coregulated pili and mannose- sensitive hemagglutinin pili in the pathogenesis of Vibrio cholerae O139 infection. Infect. Immun. 66, 692±695 (1998). Thelin, K. H. & Taylor, R. K. Toxin-coregulated pilus, but not mannose-sensitive hemagglutinin, is required for colonization by Vibrio cholerae O1 El Tor biotype and O139 strains. Infect. Immun. 64, 2853±2856 (1996). Fullner, K. J. & Mekalanos, J. J. Genetic characterization of a new type IV-A pilus gene cluster found in both classical and El Tor biotypes of Vibrio cholerae. Infect. Immun. 67, 1393±1404 (1999). Marsh, J. W. & Taylor, R. K. Identi®cation of the Vibrio cholerae type 4 prepilin peptidase required for cholera toxin secretion and pilus formation. Mol. Microbiol. 29, 1481±1492 (1998). Salzberg, S. L., Delcher, A. L., Kasif, S. & White, O. Microbial gene identi®cation using interpolated Markov models. Nucleic Acids Res. 26, 544±548 (1998). Bateman, A. et al. Pfam 3.1: 1313 multiple alignments and pro®le HMMs match the majority of proteins. Nucleic Acids Res. 27, 260±262 (1999).

Supplementary information is available on Nature's World-Wide Web site (http:// www.nature.com) or as paper copy from the London editorial of®ce of Nature.

Acknowledgements This work was supported by the National Institutes of Health, National Institute of Allergy and Infectious Disease. We thank M. Heaney, V. Sapiro, B. Lee, M. Holmes and B. Vincent for database and software support. Correspondence and requests for materials should be addressed to C.M.F. (e-mail: [email protected]). The annotated genome sequence and the gene family alignments are available at hhttp://www.tigr.org/tbd/mdbi. The sequences have been deposited in GenBank with accession number AE003852 (chromosome 1) and AE003853 (chromosome 2).

© 2000 Macmillan Magazines Ltd

483

484

© 2000 Macmillan Magazines Ltd

91

1049

5

281

448

2

757

451

954

955

459

572

574

12

8

964

1063

863

772

12

575

5

965

679

468

6

576

470

773

680

300

106

14

2742

969

1067

Amino acid biosynthesis

774

472

681

16

474

1068

971

865

776

578

972

777

1069

10

9

581

778

202

2175

974

2539

481

1071

310

13

208

20

5s rRNA

16s rRNA

687

1073

781

872

976

23s rRNA

9

587

977

588

8

873

10

3

978

22

6

12

590

979

495

981

876

785

691

591

328

212

Repeat region

Three tRNAs

1075

980

875

24

117

9

Membrane protein

1077

982

496

1078

983

786

593

12

119

498

14

877

692

213

25

4

2278

26

1700

5

5

984

1080

883

790

1081

506

695

5

985

507 10

884

791

121

338

986

219

885

795

510 12

340

31

2672

886

987

2192

798

609

1085

700

887

799

603

344

988

888

800

511

1086

889

801

512

13

347

128

605

513

989

890

349

222

35

6

2462

2379

1088

891

990

6

2290

1089

892

5

704

607

223

130

2381

1090

991

893

894

354

803

608

38

131

516

37

6

1091

992

705

610

11

1535

40

993

804

612

895

518

6

994

706

614

134

805

5

9

519

996

707

615

230

137

2765

2571

997

897

807

708

2208

231

139

998

898

521

365

808

617

10

44

2766

2472

2572

2684

2092

809

1098

620

5

5

900

1000

709

619

232

140

1001

1099

1002

1100

5

5

235

49

1004

903

372

237

142

6

144

238

374 10

1102

13

1008

813

716

624

524

904

714

51

2691

2480

2391

905

526

375

2775

711

529

11

1104

1010

9

814

717

10

376

906

527

1105

1011

625

379

53

2694

1012

5

815

8

380

2595

6

243

1109

13

908

817

12

1015

909

2213

720

1111

2599

2699

1018

911

820

912

723

10

530

388

10

629

59

12

2485

1019

1112

1458

7

1020

914

9

5

531

1113

2490

2216

1023

917

824

1293

5

64

1115

534

920

637

828

1025

921

535

397

156

638

254

922

1026

728

399

157

65

2702

2021

1027

923

12

1835

1740

2121

639

257

540

405

924

832

2608

6

542

258

1029

833

732

406

10

68

2026

2224

2407

7

2705

160

11

23Sf

2324

2223

2123

2024

1926

1566

10

2126

2609

925

834

6

734

644

543

6

735

409

10

2226

926

835

645

646

736

927 928

836

2710

413

264

72

5

2332

2412

647

1

838

930

5

649

417

267

75

167

1034

931

840

738

650

546

7

2616

2499

6

824

76

651

421

1036

935

741

550

5

170

2618

7

1753

171

11

1037

846

10

2620

552

847

1755

2136

939 6

1040

848

655

428

271 272

173

6

274

432

1041

747

656

11

940

554

431

175

2621

2504

2237

2137

1311

82

941

556

433

2717

2505

2418

276

943

6

436

557

1043

748

657

176

83

2719

11

2622

2718

1855

749

9

11

8

751

1045

849

85

558

439

178

2625

2422

2342

945

2626

2507

2423

752

1046

661

559

441

179

469

946

1047

662

560

442

8

2

561

280

1048

947

2

181

90

2726

2628

5

9

2510

2425

2344

2144

663

754

443

180

9

2725

12

1588

1949

2508

2627

279

88

1760

1859

654

1131

1667

2245

2424

2343

2143

753

12

1587

1492

1406

1666

837

745

566

471

934

385

1315

1033

565

194

289

470

100

1220

2041

836

6

933

653

744

6

2724

6

1947

2723

278

659

440

9

86

1759

1857

99

192

564

1314

1665

1585

2244

2142

2039

2722

277

2624

2242

11

1856

13

1129

1405

1491

1758 6

931

10

384

563

468

6

288

191

1031

1219

835

743

652

562

467

1945

2141

2721

658

944

1

84

2720

2623

1490

1664 1

98

287

1029

5

1128

1217

1313

2421

2038

1944

2506

2420

2341

2240

2037

1942

1663

1403

1216

1028

834

742

651

466

381

13

1583 1584

833

1127

2140

1312

4

286

560

930

741

832

97

190

465

379

650

8

1488

1757

95

285

1025 1026

9

6

1126

2419

11

1

1854

2238

12

10

1582

1024

831

739

559

464

649

1215

1660 12

94

377 378

1402

2340

275

2036

1941

1853

1756

2417

80

2716

929

1310

1486

1581

1852

2339

9

174

1309

1214

1401

13

5

1023

830

737

648

558

463

284

188

1123 1124 1125

829

1659

2035

1940

1851

2236

745

1039

938

9

654

79

172

1400

1308

1213

928

736

647

376

93

557

462

283

187

1122

1021

828

556

461

374

1580

1939

1658

2503

2416

2715

270

425

1038

937

11

744

78

2714

2619

2338

2235

2135

2234

2033

1485

1399

11

1754

1938

460

282

927

92

186

1121

1850

1579

1849

1937

827

734

926

1307

10

1656

423 424

936

653

269

2233

2502

2415

1936

2713

77

2501

2232

2133

8

1120

1212

1483

373

91

459

826

645

1018

925

1397

1848

1655

554

457

644

1306

1577

1482

2032

8

1

732

456

11

185

372

90

281

1118 1119

1935

2337

652

843

740

420

169

549

419

268

168

825

924

1847

1575

2712

10

11

6

2414

2617

2500

2335

2132

6

1211

1117

1846

455

731

1017

184

371

89

553

643

1481

1751

1394

1304

10

1016

923

552

730

2031

2231

2131

13

280

370

13

1934

1845

1750

1573

1210

1115

9

6

88

454

922

369

823

728

551

2334

1933

9

1653

1303

1209

2711

12

2498

737

648

416

74

2130

2413

2230

2333

1033

415

266

165

545

929

837

414

265

73

2614

2497

1749

1478

1392

1015

1114

921

727

9

183

86 87

453

642

366

278

11

1844

9

2615

2129

2030

1931

2229

641

1302

1748 8

365

452

1113

6

85

181 182

550

726

1207

1571

1476

1843

2227

2128

2613

164

1031

6

9

544

263

71

7

16Sf

410

2708

2610

261 262

70

2

2409

2330 RNasePRNA

2127

8

1746

1652

1391

1301 14

11

2

822

1014

920

640

1112

1206

1013

919

725

549

277

84

180

362

450

83

1570

5

6

1929

1475

1745

1568

1928

2028

11

2225

1927

408

1111

1205

1390

1300

8

639

4

276

1012

918

1838 1839 1840

1651

1567

1473

2329

161

821

638

361

179

82

1010 1011

1204

1388

6

724

917

547

1110

2027

2706

275

449

178

1009

8

81

1743

1837

1203

1298

1741

1836

640 641

67

2607

13

256

1028

830

730

10

2406

2323

7

2023

1925

916

637

360

274

1008

820

5

359

1386

1650

1470

1006

723

636

1565

1384

447

545

80

1108

177

356 357

271

79

1107

1202

1297

1106

159

12

1

539

255

66

402

538

22

2703

2606

2494

1

914

1005

11

78

355

634

2222

1834

1563

2022

2120

401

729

829

537

400

2493

819

722

544

76

1469

270

446

1201

1649

2405

2220

2119

10

2604

2404

2322

2118

2492

536

2603

2701

827

12

253

396

155

9

2403

2217

2019

1923

1833

1562

1467

1382

9

175

354

1105

12

75

1296

1004

1200 5

1738

1832

1647

1561

818

913

633

543

353

269

74

721

174

1104

1465

1295

1199

1922

1736

1103

1003

817 4

1002

542

445

73

632

268

720

350

173

6

541

444

267

2116

2491

9

2602

2402

2018

1921

1831

1735

1645

1560

1463

1379

719

631

1001

1102

9

911

815

2115

825

727

635

532

395

250

154

2601

1114

634

152

726

915

823

725

63

249

391

151

2600

2489

2401

2113

2016

2017

1732

1644

1377

1101

349

540

5

72

172

6

266

443

1198

718

1000

814

630

539

5

442

265

348

1828

1920

2320

2700

2111

2215

2400

1827

1460

1376

71

171

441

70

910

999

1291

9

1558

2015

632 633

248

822

724

631

5

6

2488

10

2110

2014

1730

538

717

813

1197

1099

1643

1290

1196

909

715

439

440

69

264

12

346

170

537

629

1557

1826

150

913

630

247

61

5

438

1375

1098

2487

2399

2319

2214

10

1918

10

1457

2013

1641

1555

1374

812

68

263

714

908

1

1195

437

345

262

1289

1194

5

11

998

907

2398

2109

1825

1554

387

246

910

1017

385

148

58

2698

2598

2397

2316

1110

818

719

628

245

57

529

244

2697

1727

1916

5

1097

906

67

535

628

168

811

627 23Se

1824

2012

2108

2484

2212

1915

10

1726

1639

1454

260

344

1553

905

810

1288

13

1552

1373

1190

1096

1638

2396

2596

1

626

434

66

167

534

343

259

904

6

2107

2011

1822

1551

56

382

1108

1013

907

718

627

528

242

147

2695 2696

55

903

997

2395

2312

2211

533

809

9

65

433

342 12

1095

2106

1725

2009

2483

2593

2311

241

2394

2482

240

1821

64

166

258

624 16Se

1287

1550

5

1372

6

1636

1912

2103

2310

146

2693

5

1094

1188

63

532

902

6

257

165

807

531

432

8

256

1723 10

2008

2210

2590

239

17

52

2481

2309

8

2100 2101

6

1911

1820

1722

1549

1286

1371

62 14

623

995

1093

5

10

901

806

530

431

341

254

1635

1186

1092

6

1548

1910

2692

2308

2099

61

164

1285

1370

1721

1819

2007

1909

2307

2774

5

2584

1005 1006

812

712

623

1101

711

371

50

2773

1633

804

994

1634

1284

1185

1547

1369

2097

2479

2690

2305

2096

2006

803

430

340

528

252

622

19

900

1091

710

993

1720

1546

1907

1817

2390

523

2772

2689

2581

2476

2303

2209

2095

2004

1632

1282

1719

1544

1906

1718

10

1451

11

9

1184

801

709

621

60

339

526

7

2

802

429

13

251

524 525

338

162

899

1089

992

1183

800

708

523

1367

369 370

902

811

710

621

522

141

48

5

368

2770

233

2579

2475

2302

2688

2577

901

47

367

6

2301

2389

1815

2003

1905

1717

1631

9

23Sa

428

898

1543

1088

1280

2094

10

2474

2576

2300

2685

11

2002

1904

799

897

1366

620

522

337

250

161

1

426

706

1182

991

1814

1630

1542

1716

2093

5

10

6

896

705

6

5Sb

425

1

521

336

249

1365

2767 2768

366

999

618

2388

45

2001

1811

1715

1541

798

619

424

1087

1181

990

7

519

1279

1364

12

2298 2299

1903

1540

5

1629

1086

1363

2000

12

2387

2297

2470

363

616

1095

806

520

229

9

43

896

361

136

42

2764

2683

2569

2469

2386

2205 2206 2207

228

9

360

1093 1094

358

227

41

2385

2295

2089

1180

16Sa

23Sb

704

618

13

894

247

335

796

423

518

988

617

5

246

1277 1278

1808

1998

1902

1628

1362

1714

1807

795

1

58

334

12

57

703

893

1085

987

616

422

333

245

56 16Sb

1276

1539 11

8

1179

1997

1627

1901

2568

135

2467

8

2203

11

2681

2762

1361

1806 12

2088

1995

1805

2567

2466

2384

2566

8

1450

8

794

892

702

517

332

421

1084

1178

55

244

986

701

615

1

793

1275

13

54

159

1536 1537 1538

2294

2202

1092

225

133

6

8

2565

2465

2383

2293

2087

1994

1083

890

331

243

1177

1360

1274

1713

1900

1803

1712

5

1359

1273

1625

356

517

132

355

39

2761

12

1081

985

9

700

52

158

516

614

330

51

792

420

242

889

1174

1448

7

2201

2464

2382

6

5

1899

2292

2564

2678

2760

2563

2463

2198

2086

1993

1800

1711

1534

1358

1272

1173

1079

2291

1533

1447

1624

791

698

613

157

50

515

984

49

241

888

983

329

48

1271

1172

981

1270

2380

8

790

697

1077

1898

1799

2759

351

514

802

703

8

606

350

36

2677

5

1532

1269

156

514

419

886

612

240

47

1354

1076

1171

2197

2562

129

8

2461

696

788

980

2085

2289

2378

2676

2758

702

2675

2561

2460

2377

2196

2084

1992

1623

1446

418

884

1075

1710

1798

1991

1896

2287

2195

34

1622

1529

1268

1170

8

787

695

5

513

46

154

417

239

979

416

611

1353

1797

2083

8

221

346

604

127

2757

345

11

2674

2560

2459

2286

2194

2082

1990

1621

45

153

1074

881

786

1169

977

1445

1267

1527

11

1894

2458

33

8

220

124

2756

2673

2559

2193

2081

1989

1893

1794

1709

1526 5

1266

1168

880

5

152

238

415

9

44

694

610

512

328

237

151

12

1073

976

1350

1444

2285

693

43

784

236

975

1525

1987

2755

602

1167

1892

2456

32

11

876

2376

2558

341

123

2671

2557 11

1084

601

699

2670 16Sh

2454

12

1986

7

11

150

692

974

1265

1524

1791

2283

2080

2191

1985

1890

1523

1442

12

973

783

1708

1166

1349

1264

1071

1706

8

875

608

414

327

235

42

509 510

326

149

691

234

41

508

325

148

972

690

607

1789

2555

3

8

1889

2282

793

30

1083

792

600

337

29

697

1082

2554

2668

5

2374

2281

2190

1788

324

413

506

1441

1522

1263

1165

7

40

233

781

1070

873

606

1620

1439

1984

508

218

599

28

2553

2453

2280

8

1888

1786

1704

1438

1348

1521

7

1069

780

689

605

505

412

323

147

39

232

322

411

1164

971

870

779

1261

12

2666 23Sh

334

217

688

12

503

410

146

37

231

321

6

409

1163

2188

2552

2279

596

11

882

789

27

120

1347

2077

8

144

SRPRNA

13

604

408

4

778

9

1983

1887

1784

1703

1437

1346

1260

1162

36

230

970

687

1067 7

505

694

504

880 881

788

12

331

595

214

1079

693

501

330

2664 5Sh

2550

2452

2187

2373

2549

12

2075

1701

1520

1259

5

869

969

407

502

320

229

142

777

9

406

1980 1981

2662 16Sg

2451

16

2074

2548

2661

1783

1886

1979

2277

1782

1435

6

1345

1066

5

1161

685

35

140

501

319

228

139

34

776

8

968

1619

1519

1434

1344

1698

1618

11

1697

592

118

2660

13

2276

2073

2450

2371

2185

Rho-ind. terminator

784

325

690

6

493

23

9

2657

2547

2448

23Sg

2184

2274

2370

2183

1433

1258

1160

1064

1517 1518

1978

1781

1885

1977

5

1779

1884

6

1343

1617

1432

1695

1615

1431

1516

1257

1159

775

684

603

500

405

318

227

33

137

683

499

317

225

1063

866

498

965

5

136

32

404

774

1062

682

602

315

224

135

496

403

1158

1061

964

864

773

495

223 5Sc 1

134

1342

2072

1976

1778

1614

1515

1430

211

689

491

322

116

1157

1060

2656 5Sg 1

1074

783

589

688

782

210

318

21

494

402

7

31

222

963

863

1883

1777

2273

2447

2751

490

2369

2272

2182

2070

681

1341

1693

1429

1256

1882

1975

2181

7

1776

2545

2655

2446

115

317

2750

485 486 487

Other RNA

1072

975

871

686

2544

2654

316

114

2653

2368

2271

2180

2069

1974

1880

12

601

772

221 23Sc

1156

1612

1340

1513

1611

1692

10

1775

1610

6

30

132

493

1059

962

1155

1058

1428

1154

400

861 862

771

29

220

680

131

11

492

399

1255

1427

1339

2270

1973

2445

207

586

113

5

313

11

585

483

16

780

584

311

205

112

19

2749

2542

2651

2444

2541

2649

2443

2365 2366

2269

6

1879

1774

2179

1690

1972

5

2068

2178

2267 2268

2067

1773

1689

1878

1512

1609

7

6

1426

1253

1153

1056

961

770

6

600

491

2

860

28

217

678

859

16Sc

130

960

858

1152

959

857

769

677

597

490

1511

1425

1608

6

1338

1252

8

216

398

27

1054

7

9

596

1151

856

958

1971

1772

685

583

870

1607

1509

1424

1250

2748

309

204

111

1053

312

129

676

595

397

215

1150

2176

2647

18

779

582

479

2747

12

2442

11

2364

26

128

768

1688

2266

2066

1970

12

2265

684

308

2538

110

867

580

476

2064

2441

17

2746

307

201

683

475

8

2537

2646

2174

2363

109

2440

2262

2536

200

108

1 Kb

303

2744

2645

2362

2172

2063

1968

1877

9

1508

1423

1337

675

396

957

594

1249

1606

1687

309

489

855

1149

1507

1050

767

674

6

395

213

127

214

23 24

126

593

1875

1771

1686

10

1336

1506

1422

1685

1874

1967

1148

956

1248

854

6

22

673

7

308

592

488

212

125

1049

21

1335

1505

1605

1684

1504

1421

1246

766

853

672

591

394

307

211

1048

1147

954

682

1683

1604

1503

2535

15

107

199

864

301

2743

Conserved hypothetical

1062

963

862

769

678

465

299

198

2741

2644

2534

2171

2439

No database match

962

768

676 677

197

13

2740

2643

2360

2261

12

2062

1966

Protein fate/protein synthesis

1060

767

294

196

463 464

195

104

2532

2642

5

2260

2168

2061

11

1770

1682

5

1603

1419

1334

1873

1245

6

487

590

306

210

20

1420

953

852

764

671

589

12

122

1047

1333

11

486

Other categories

1059

961

860

766

675

573

460

293

194

103

2739

2531

2359

2438

2358

2259

2166 2167

1681

1872

1965

1502

12

1244

1602

7

1418

2060

1769

1964

2058 2059

2258

1601

1242

9

1145

951

305

209

393

851

763

668

208

1332

849

587

1046

950

848

1680

1871

13

667

304

485

392

19

Transport and binding proteins

1058

765

674

193

291

9

12

9

2641

2529

6

1501

1962 1963

2257

2164

2057

2437

2356

2256

7

1679

1869

1768

1330

1417

1600

1870

1329

1599

5

1144

1239 1240

1044

9

762 tmRNA

586

1045

761

949

847

585

483

207

120

Regulatory functions

1057

859

764

960

458

571

673

290

11

102

2528

7

2

2056

1960

1678

1500

5

1238

948

760

665

583

391

11

18

Transcription

457

288

192

101

2738

2638

6

7

2255

2162

2055

1959

2436

2524 2525 2527

2637

1767

9

1416

1598

1868

2355

14

1328

7

1237

1043

303

482

206

119

16

Nucleotides

958

856

763

671

191

9 10

2254

15

2053

1675

2353

2434 2435

2736

1499

1597

1956

5

1327

1236

1141

1042

846

759

5

204

118

Fatty acid and phospholipid metabolism

1056

855

568

455

287

100

2636

2523

2433

2352

9

2161

2052

2253

1955

1867

1766

1498

1674

10

1415

1325

1235

1041

481

582

947

664

480

14

203

15

117

DNA metabolism

854

957

669

762

12

567

454

190

99

8

2734

2522

10

2432

2159 2160

2051

9

1866

11

1596

1497

1414

1140

1040

581

390

302

758

9

19

116

14

Energy metabolism

1055

853

1

14

189

98

7

2733

2520

2350

2252

2635

2157

2431

2349

286

2519

285

956

6

566

284

186

97

2634

2156

2251

2049

1953

1765

1673

1595

1323

1139

944

663

478

202

115

580

943

757

579

845

662

942

756

11

12

477

13

114

201

300

113

578

389

1234

1496

1138

1039

941

661

576

577

299

1

12

Central intermediary metabolism

1054

2348

2250

1865

1413

1594

1321

1495

2048

1137

1232

1038

1136

112

200

476

755

5Sd

111

Cellular processes

1053

5

453

667

4

2047

1952

12

1672

2517 2518

2732

760

565

852

283

96

3

1231

1593

1411

2153

2430

1135

940

5

575

660

298

844

753

939

574

475

5

10

Biosynthesis of cofactors, prosthetic groups, and carriers

1052

851

8

666

758

564

450

183

759

2731

95

2347

2249

2631 2632 2633

2514

2346

2248

843

9

1037

9

1764

1671

2046

2152

2045

1494

12

1229

1592

1864

1410

1951

1863

1763

1670

1591

1320

199

9

Cell envelope

1051

9

449

282

94

952

850

1050

951

756

13

665

182

92

563

447

5

949

663

181

1

2428

2729 2730

2630

2727

2513

2427

2510 2511

2426

2345

1036

8

23Sd

938

752

659

573

842

474

296 297

198

108

5 6 7

1134

841

937

1228

1319

1133

2247

1862

1762

1409

1226

936

2

5

570 571

295

4

749 750 751

658

473

840

568

2628 2629

2425

6

2344

2246

2146

2043

1950

196

16Sd

106

1590

1318

1132

1035

748

293

9

1861

1761

1669

1492

1408

1224

935

839

2145

2245

2042

1860

1760

567

195

7

747

1317

1589

3

105

656 657

1034

1407

12

934

838

746

655

386

472

566

385

194

291

1 2

103

articles

NATURE | VOL 406 | 3 AUGUST 2000 | www.nature.com