Nucleotide Sequence, Function, Activation, and Evolution of the Cryptic asc Operon of Escherichia coli K12

Barry G. Hall and Lei Xu

Biology Department, University of Rochester

The cryptic asc (previously called “SAC”) operon of Escherichia coli K12 has been completely sequenced. It encodes a repressor (ascG); a PTS enzyme II^asc for the transport of arbutin, salicin, and cellobiose (ascF); and a phospho-β-glucosidase that hydrolyzes the sugars which are phosphorylated during transport (ascB). ascG and ascFB are transcribed from divergent promoters. The cryptic operon is activated by the insertion of IS186 into the ascG (repressor) gene. The ascFB genes are paralogous to the cryptic bglFB genes, and ascG is paralogous to galR. The duplications that gave rise to these paralogous genes are estimated to have occurred ~320 Mya, a time that predates the divergence of E. coli and Salmonella typhimurium.

Key words: cryptic genes, repressor, sugar transport, phospho-β-glucosidase, arbutin, salicin.

Address for correspondence and reprints: Barry G. Hall, Biology Department, Hutchison Hall, River Campus, University of Rochester, Rochester, New York 14627.

Mol. Biol. Evol. 9(4):688-706. 1992. © 1992 by The University of Chicago. All rights reserved. 0737-4038/92/0904-0010$02.00

Introduction

Both laboratory strains and natural isolates of Escherichia coli possess several sets of cryptic (silent) genes for the utilization of β-glucoside sugars (Hall and Betts 1987; Parker and Hall 1988). Cryptic genes are silent in wild-type organisms and must be activated by mutations (which may include insertion of IS or other mobile genetic elements) before they can be expressed (Hall et al. 1983). Because cryptic genes are not expected to make any positive contribution to the fitness of the organism, it is expected that they would eventually be lost because of the accumulation of inactivating mutations (Hall et al. 1983; Li 1984). Cryptic genes would thus be expected to be rare in natural populations. This, however, is not the case: >90% of natural isolates of E. coli carry cryptic genes for the utilization of β-glucoside sugars (Hall and Betts 1987). The persistence of cryptic genes in the face of mutational pressure is an interesting puzzle for population biologists, and our current model is that they are retained by alternating selection for loss and for regain of function in different environments.

Of the cryptic operons in E. coli, the two that are best studied are the bgl and cel operons. The bgl operon includes three genes: bglG, which encodes a positive regulatory protein that prevents termination of transcription in the presence of the β-glucoside sugars arbutin or salicin; bglF, which encodes the phosphoenolpyruvate-dependent phosphotransferase system (PTS) enzyme II^bgl, which (a) simultaneously transports and phosphorylates arbutin and salicin and (b) interacts with the bglG product to prevent antitermination in the absence of substrates; and bglB, which encodes a phospho-β-glucosidase B that hydrolyzes the phosphorylated substrates arbutin and salicin (Prasad and Schaefler 1974; Mahadevan et al. 1987; Schnetz et al. 1987; Schnetz and Rak 1988). The bgl operon can be activated either by (a) insertions of either IS1 or IS5 into a region 78-125 bp upstream of the transcription start site or (b) base substitutions in the CAP-cAMP binding site (Reynolds et al. 1981, 1984). The mechanism(s) by which insertions activate the operon are not yet understood. Because transcription initiates at the same site, independent of the position of the insertion element, it is clear that insertion elements do not provide promoters necessary for transcription (Reynolds et al. 1984). It is known that insertions can increase the activity of promoters, although the mechanism by which they do so is not known (Schnetz and Rak 1988). The bgl operon is located at 83.8 minutes on the genetic map (Bachmann 1990) and at kb 3979.8 on the physical map (Kohara et al. 1987; Rudd et al. 1990).

The cel operon consists of five genes: celA, celB, and celC, which encode, respectively, the PTS enzyme IV^cel, enzyme II^cel, and enzyme III^cel, for transport and phosphorylation of cellobiose, arbutin, and salicin (Parker and Hall 1990a, 1990b; Reizer et al. 1990) (in effect, the three functional domains that are fused as enzyme II^bgl in the bgl operon are present as separate entities in the cel operon); celD, which encodes the cel repressor protein; and celF, which encodes a phospho-β-glucosidase that hydrolyzes phosphorylated cellobiose, arbutin, and salicin (Kricker and Hall 1984, 1987; Parker and Hall 1990a, 1990b). The cel operon can be activated by insertion of IS1, IS2, or IS5 into a region 72-180 bp upstream from the transcription start site, again by an unknown mechanism that does not involve provision of a promoter sequence (Parker and Hall 1990b). It can also be activated by base substitutions in celD that alter the repressor so that it can recognize cellobiose, arbutin, and salicin as inducers (Parker and Hall 1990b). The cel operon is located at 37.8 minutes on the genetic map (Bachmann 1990) and at kb 1835 on the physical map (Kohara et al. 1987).
In addition to these cryptic operons, E. coli also possesses a constitutively expressed gene, bglA, for a phospho-β-glucosidase that is specific for phosphorylated arbutin (Prasad et al. 1973). That enzyme does not permit arbutin utilization, because, unless one of the cryptic operons is activated, arbutin cannot be transported and phosphorylated. The bglA gene is located at kb 3056.1 on the physical map (Kohara et al. 1987; Junne et al. 1990), corresponding to ~62.3 minutes on the genetic map (Rudd et al. 1990). As a parallel to the expressed bglA gene, E. coli K12 also possesses a cryptic arbT gene, which, when activated, permits the transport and phosphorylation of arbutin but not of other β-glucoside sugars (Kricker and Hall 1987). Neither the physical nor the genetic location of arbT is known.

Three years ago we reported the existence of a fourth cryptic-gene system for β-glucoside utilization in E. coli (Parker and Hall 1988), a system which we designated “SAC.” That system has now been renamed “asc,” to avoid confusion with the sac genes for sucrose utilization in Bacillus subtilis (Hall et al. 1991). The asc operon was discovered by selecting for cellobiose utilization in a Δbgl Δcel strain of E. coli K12, strain LP100 (Parker and Hall 1988). The first-step spontaneous mutant, strain LP101, utilized arbutin well but utilized salicin poorly and failed to utilize cellobiose. A second round of selection, applied to strain LP101, yielded the spontaneous mutant strain LP102, which grew well on both arbutin and salicin but remained unable to grow on cellobiose. A third round of selection, applied to strain LP102, yielded strain LP103, which grew well on both arbutin and salicin and which utilized cellobiose on plates but failed to utilize cellobiose in liquid medium. Strain LP103 exhibited a weak cellobiose-positive phenotype (pink colonies) on MacConkey cellobiose medium. The asc operon was cloned from strain LP103, as a 9-kb HindIII-EcoRI fragment, into the high-copy-number vector pBlu+ (plasmid pUF718). When transformed into strain LP100, that plasmid conferred a strong arbutin-, salicin-, and cellobiose-positive phenotype (red colonies on MacConkey plates) (Parker and Hall 1988; in that paper it was erroneously reported that a subclone of pUF718, plasmid pUF721, also confers a strong arbutin-, salicin-, and cellobiose-positive phenotype; in fact, for reasons discussed below, pUF721 only confers a strong arbutin-positive phenotype). The asc operon is located at kb 2845 on the physical map (Kohara et al. 1987), corresponding to 58.3 minutes on the genetic map (Hall et al. 1991).

A detailed study of the asc operon was undertaken in the hope that it might shed more light on both the evolutionary basis of retention of cryptic genes in natural populations and the mechanisms by which they might be alternately silenced and activated. Here we report the complete nucleotide sequence of the asc operon, discuss the roles of each of the gene products, and report the mechanism by which the cryptic operon is activated.

Material and Methods

Culture Media and Conditions

Minimal medium consisted of 423 mg sodium citrate/liter, 100 mg MgSO4·7H2O/liter, 1 g (NH4)2SO4/liter, 540 µg FeCl3/liter, 1 mg thiamine/liter, 3 g KH2PO4/liter, 7 g K2HPO4/liter, and 1 g carbon source/liter. When required, amino acids were added to a concentration of 100 mg/liter. The rich, complete medium was LB (Luria Broth) (Miller 1972). To select for retention of plasmids by plasmid-bearing strains we added ampicillin, to a concentration of 100 mg/liter. MacConkey indicator plates (Difco) were used to distinguish arbutin- or salicin-fermenting strains from those that did not ferment the sugars.

E. coli Strains and Plasmids

Strains and plasmids are listed in table 1.

DNA Sequencing

DNA was sequenced by the dideoxy method (Sanger et al. 1980) using deoxyadenosine 5'-([α-35S]thio)triphosphate and a modified T7 DNA polymerase sequencing kit (Pharmacia) according to the manufacturer’s instructions, except that the primers were hybridized to template DNA by boiling the template-primer mixture, quick-freezing it in dry ice for 5 min, and then allowing it to warm slowly to room temperature. Oligonucleotides were purchased from Oligo, Etc. of Ridgefield, Conn. Double-stranded DNA templates for sequencing were prepared by a variety of methods. In some cases plasmid DNA was sequenced using synthetic oligonucleotide primers corresponding either to vector sequences that flanked the insert or, when portions of the insert sequence had already been determined, to sequences within the insert. In other cases, portions of the insert were amplified by the polymerase chain reaction (PCR) as described elsewhere (Hall et al. 1989), and the amplified DNA was sequenced directly. The bulk of the sequence was obtained by sequencing plasmid pUF740, pUF741, or pUF742 (fig. 1) by transposon-facilitated DNA sequencing (Strathmann et al. 1991). Transposon γδ (Tn1000) was introduced at random sites into the plasmids, and the positions of the insertions were determined by PCR mapping. The sequences

Table 1
Escherichia coli Strains and Plasmids

Strain      Genotype      Source or Reference
LP100       F- rpsL, trp, his, argG, metA or B, ara, leu, lacZΔ4680, lacY, Δ(bgl-pho)201, celΔ100      Parker and Hall 1988
LP100R      recA56 srl300::Tn10 derivative of LP100      Parker and Hall 1988
LP101       ascG1::IS186 mutant of LP100      Parker and Hall 1988
LP102       asc mutant of LP101, improved salicin utilization      Parker and Hall 1988
LP103       asc mutant of LP102, cellobiose utilization      Parker and Hall 1988
JC10241     thr300, srl300::Tn10, relA1, ilv318, spoT1, thi1, rpsE300, lambda-, Hfr (PO45)      Csonka and Clark 1980
DPWC        supE2, ΔrecA, srl::Tn10 (tet sensitive), F      Strathmann et al. 1991
JGM         F- Tn5 seq1      Strathmann et al. 1991
MK91243     F- rpsL, trp, his, argG, metA or B, ara, leu, lacZΔ4680, lacY, Δ(bgl-pho)201, celR2-, arbT1+      Kricker and Hall 1987

Plasmid     Description      Source or Reference
pBlu+       High-copy-number cloning vector      Stratagene
pUF718      Complete asc operon of strain LP103 and flanking regions cloned into pBlu+      Parker and Hall 1988
pUF721      ascGF and 5' portion of ascB cloned into pBlu+      Parker and Hall 1988
pMOB        1.8-kb minivector for γδ-based sequencing      Strathmann et al. 1991
pUF740      ascGF and 5' portion of ascB cloned into pMOB      Present study
pUF741      ascF and 5' portion of ascB cloned into pMOB      Present study
pUF742      3' portion of ascF and 5' portion of ascB cloned into pMOB      Present study
pUF745      3' portion of ascB cloned into pMOB      Present study

flanking the insertions were then sequenced using oligonucleotide primers corresponding to the unique portions of the two ends of γδ. With the exception of bases 1-39, the entire sequence shown in figure 2 was determined on both strands.

Computer Analyses of DNA and Protein Sequences

Sequences were analyzed with the Sequence Analysis Software Package, version 7.0, by Genetics Computer Group (GCG package). Alignments were conducted according to the Needleman and Wunsch method (1970) using the GAP program of the GCG package. Significance of alignments was determined using the Randomizations option of the GAP program. Promoter sequences were identified by M. C. O’Neill using a neural network program (O’Neill 1991). The GenBank data bases of nucleic acid and protein sequences were searched with the FASTA program (Pearson and Lipman 1988). Divergence at silent sites was estimated using a program provided by W.-H. Li (Li et al. 1985).

Results

Figure 1 shows a diagram of the asc operon structure in strain LP103 and indicates the extents of the cloned inserts in the plasmids used in this study. Figure 2 shows the complete nucleotide sequence of the operon. Plasmid pUF721 (fig. 1) allows strain LP100R (Δbgl Δcel recA) to utilize arbutin. Sequencing of the left end of the pUF721 insert revealed that the insert began at bp

FIG. 1.-Structure of the asc operon in strains LP101-LP103 (diagram labels: IS186, ascG, ascF, ascB, and the direction of transcription). The heavy lines below the structure diagram indicate the extents of the inserts in various plasmids used in this study (pUF718, pUF740, pUF741, pUF742, and pUF745). Dashed lines indicate unsequenced flanking regions.

666 (the PstI site) of insertion element IS186. A primer complementary to bp 103-120 of IS186 was used to sequence the region to the left of the IS186 insertion, using pUF718 DNA as a template. The other strand was sequenced using a primer complementary to bp 1-18 of the asc operon sequence. Insertion of IS186 caused an 11-bp duplication of the insertion site, and the site of the IS186 insertion is arbitrarily positioned at the center of that duplication, at bp 190 (fig. 2). The asc operon includes three large open reading frames (fig. 2), which have been named in keeping with the names of the functionally similar genes in the bgl operon.

The ascG Gene

ascG is encoded on the lower (complementary) strand, includes bp 126-1130, is preceded by a ribosome-binding site (bp 1137-1142), and is disrupted by IS186 in strain LP103 (see discussion on the mechanism of activation, below). ascG encodes a 334-amino-acid peptide with a molecular weight of 36,787. The ascG gene product is identified as a repressor protein, on several grounds. First, a FASTA search of the SwissProt protein data base reveals that the top 10 in the list of similar proteins are all repressor proteins. The alignment of the asc repressor with the most similar of those repressors, the galactose repressor (galR gene product), is shown in figure 3. There is 31.6% sequence identity, and 54.8% sequence similarity, between the ascG and galR repressors. The quality score for the alignment shown is 211.5, while the score for alignment of the ascG repressor with a randomly shuffled galR repressor sequence (100 randomizations) is 105.6 ± 4.0 (mean ± standard deviation). Second, the ascG repressor exhibits a DNA-binding helix-turn-helix motif (Kelly and Yanofsky 1985) (fig. 2) that includes identities with 14 of the 17 consensus residues in that region for the ascG repressor and 12 homologous repressor proteins. Third, disruption of ascG by IS186 results in semiconstitutive expression of the asc operon in strain LP101 (Parker and Hall 1988) (see discussion of activation of the operon, below).
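The randomization test used here (compare the score of the real alignment with the scores obtained after shuffling one of the two sequences) can be sketched as follows. This is a minimal illustration and not the GCG GAP implementation: the function names and the scoring scheme (match = 1, mismatch = -1, linear gap penalty = -2) are assumptions chosen for brevity, whereas GAP scores with a similarity matrix and separate gap and gap-length penalties. The last helper only expresses a reported quality score in units of the standard deviation of the shuffled scores.

```python
import random
import statistics


def global_alignment_score(a, b, match=1.0, mismatch=-1.0, gap=-2.0):
    """Needleman-Wunsch global alignment score with a linear gap penalty.

    The scoring values are illustrative assumptions, not GAP's parameters.
    """
    prev = [j * gap for j in range(len(b) + 1)]
    for i in range(1, len(a) + 1):
        curr = [i * gap]
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            curr.append(max(diag, prev[j] + gap, curr[j - 1] + gap))
        prev = curr
    return prev[-1]


def shuffled_score_stats(a, b, n_shuffles=100, seed=0):
    """Mean and sample standard deviation of scores of `a` against shuffled `b`."""
    rng = random.Random(seed)
    scores = []
    for _ in range(n_shuffles):
        letters = list(b)
        rng.shuffle(letters)  # keeps composition and length, destroys order
        scores.append(global_alignment_score(a, "".join(letters)))
    return statistics.mean(scores), statistics.stdev(scores)


def z_score(observed, shuffled_mean, shuffled_sd):
    """Standard deviations by which `observed` exceeds the shuffled mean."""
    return (observed - shuffled_mean) / shuffled_sd


# Quality scores reported in the text for the ascG-versus-galR alignment.
asc_gal_z = z_score(211.5, 105.6, 4.0)
print(f"ascG vs. galR: {asc_gal_z:.1f} standard deviations above the shuffled mean")
```

With the reported values, the ascG-versus-galR alignment lies roughly 26 standard deviations above the mean of the shuffled alignments, which is the sense in which the similarity is judged significant.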


The ascF Gene

ascF includes bp 1390-2847, is preceded by a ribosome-binding site (bp 1377-1382), and encodes a 485-amino-acid peptide with a molecular weight of 51,030. The gene product is identified as a PTS enzyme II for β-glucoside sugars, enzyme II^asc. That identification is based on both experimental and pattern-similarity criteria. First, strain LP100R carrying pUF721, pUF740, or pUF741 is arbutin positive, but salicin negative and cellobiose negative. All three plasmids carry the entire ascF gene, but only a 5' fragment of the ascB (phospho-β-glucosidase; see discussion below)

FIG. 3.-Alignment of the ascG-encoded asc repressor with the galR-encoded galactose-operon repressor. The helix segments of the helix-turn-helix region are underlined. Perfect matches are indicated by a vertical line. Mismatches are assigned a score based on the PAM-250 similarity matrix (Schwartz and Dayhoff 1979). A pair of dots (:) indicates a score ≥0.5, while a single dot (.) indicates a score of 0-0.5.

gene. If enzyme II^asc transports and phosphorylates arbutin, then the phosphorylated arbutin can be hydrolyzed by the constitutively synthesized phospho-β-glucosidase A protein that is specific for arbutin. Those plasmids render the cell sensitive to growth inhibition by salicin or cellobiose in glucose minimal medium, which is consistent with the transport and accumulation of those phosphorylated sugars. Plasmid pUF742, which carries only the 3' portion of ascF and the 5' portion of ascB, neither allows growth on arbutin nor renders the cell sensitive to salicin or cellobiose inhibition. Second, the four most closely related proteins identified in a FASTA search of the SwissProt data base were PTS enzymes II. The most closely related of these was the PTS enzyme II^bgl that is encoded by bglF of Escherichia coli. Figure 4 shows the alignment of the ascF and bglF gene products. There is 30.9% sequence identity, and 55.4% sequence similarity, between the two proteins. The quality score for the alignment shown is 295, which is 23.8 standard deviations above the score for the alignment of the ascF product versus the randomly shuffled bglF product (quality score = 180.8 ± 4.8 for 100 random shuffles). Enzyme II^asc, containing 485 amino acids, is only 78% as long as enzyme II^bgl, which contains 625 amino acids. The sugar-specific proteins of the PTS transport system include three functional domains that may be present either on a single polypeptide or on two separate polypeptides, in which case the two proteins are designated as an enzyme II-enzyme III pair and sum to an average of ~635 ± 13 amino acids (Saier et al. 1988; Saier and Reizer 1990). A phosphate group

FIG. 4.-Alignment of the ascF-encoded PTS enzyme II^asc with the bglF-encoded PTS enzyme II^bgl. The phosphorylation site 2 is underlined, and the histidine that is expected to be phosphorylated is indicated with an asterisk. Alignment symbols are as in fig. 3.

is transferred from phosphoenolpyruvate through two system-wide proteins (PTS enzyme I and HPr) to a histidine in phosphorylation site I, which is located in the domain that itself is located either in the C-terminal portion of full-length enzymes II or in enzyme III of enzyme II-enzyme III pairs (Saier et al. 1988). From there the phosphate is transferred to a phosphorylation site II in the enzyme II and is finally transferred to the sugar itself (Saier et al. 1988). The asc operon does not include an enzyme III (we have sequenced 287 bases beyond those shown in figure 2 and have found no open reading frames that could encode a portion of an enzyme III), and the asc phosphotransferase system thus lacks a site for the first phosphorylation step. Enzyme II^asc, however, does include phosphorylation site II (fig. 4). The most likely explanation for the ability of enzyme II^asc to function without an enzyme III^asc is that it uses an enzyme III that is part of another sugar-transport system. This is the case for the plasmid-encoded enzyme II^scr of a clinical isolate of Salmonella typhimurium, which functions in conjunction with enzyme III^glc (Lengler et al. 1982). In that regard, after enzyme II^bgl, the closest relative of enzyme II^asc is the 455-amino-acid protein, enzyme II^scr of Salmonella, which shares 27.2% sequence identity with enzyme II^asc.

The ascB Gene

ascB includes bp 2856-4280, is preceded by a ribosome-binding site (bp 2847-2852), and encodes a 474-amino-acid protein with a molecular weight of 53,883. The gene product is identified as a phospho-β-glucosidase that can hydrolyze salicin and cellobiose and that can probably hydrolyze arbutin. That identification is based on both experimental and pattern-similarity criteria. First, strain LP100R carrying the entire asc operon on plasmid pUF718 (fig. 1) is strongly arbutin positive, salicin positive, and cellobiose positive, whereas LP100R carrying plasmid pUF721, pUF740, or pUF741, which carry only the 5' end of ascB (fig. 1), is only positive on arbutin (which is hydrolyzed by the constitutively expressed phospho-β-glucosidase A). Second, the eight most closely related proteins in the SwissProt protein data base are all glycosidases, of which the most similar is the bglB-encoded phospho-β-glucosidase B. Alignment of the ascB- and the bglB-encoded phospho-β-glucosidases (fig. 5) shows that the proteins share 53.9% sequence identity and 70.5% sequence similarity. The quality score for that alignment is 438.6, 81 standard deviations above that of the randomly shuffled alignment (137.2 ± 3.7 for 100 shuffles). We cannot firmly conclude that the ascB-encoded phospho-β-glucosidase hydrolyzes phosphorylated arbutin, because we do not have available a strain that fails to synthesize the bglA-encoded phospho-β-glucosidase A. We believe that it is likely to be active toward arbutin because (a) all phospho-β-glucosidases from the Enterobacteriaceae that have been described so far act on arbutin and (b) plasmid pUF718, which has an intact ascB gene, produces much more intensely red (positive) 1-d-old colonies on MacConkey arbutin plates than does plasmid pUF721, in which the ascB gene is disrupted.
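The relation between the reported base-pair coordinates of the three open reading frames and the lengths of their products can be checked with simple arithmetic. The sketch below assumes that each reported coordinate range is inclusive and ends with the stop codon; that assumption is not stated in the text, but it reproduces the 334- and 485-residue lengths given for ascG and ascF, and it gives 474 residues for the ascB range. The function name and the dictionary are illustrative only.

```python
def protein_length(start_bp, end_bp, includes_stop_codon=True):
    """Number of amino acids encoded by an inclusive coordinate range.

    Assumption: the range covers whole codons and, by default, its final
    codon is the stop codon, which does not contribute a residue.
    """
    n_bases = end_bp - start_bp + 1  # coordinates are inclusive
    if n_bases % 3 != 0:
        raise ValueError("coordinate range is not a whole number of codons")
    n_codons = n_bases // 3
    return n_codons - 1 if includes_stop_codon else n_codons


# Coordinate ranges reported in the text (ascG lies on the complementary
# strand, which does not affect its length).
REPORTED_RANGES = {
    "ascG": (126, 1130),
    "ascF": (1390, 2847),
    "ascB": (2856, 4280),
}

for gene, (start, end) in REPORTED_RANGES.items():
    print(gene, protein_length(start, end), "amino acids")
```

A check of this kind is a quick way to catch transcription errors in either the coordinates or the stated protein lengths.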

DNA Sequence Homologies and Estimated Time at Which Gene Duplication Occurred

Given the homologies of the ascF and ascB proteins to the bglF and bglB proteins, respectively, and the homology between the ascG- and galR-encoded repressors, it is not surprising that a FASTA search of the GenBank data base found that the two DNA sequences most closely related to the asc operon were the bgl operon and the galR sequences. The DNA sequences of the above coding regions were compared after introduction of gaps corresponding to the gaps in the aligned protein sequences; the gap and gap-length penalties were set so high that no additional gaps were introduced (table 2). The quality scores for those DNA alignments were all >24 standard deviations above the scores for randomized alignments, showing that the DNA sequences are clearly paralogous. These comparisons can shed some light on the time when the duplications of the ancestral genes occurred. Sharp (1991) has recently compared 67 homologous (orthologous) genes in E. coli and S. typhimurium. For those genes the average identity at the DNA level is 84.4%, but the range is 72.5%-99%. The observation that none of the paralogous gene pairs share >58% DNA sequence identity suggests that the

FIG. 5.-Alignment of the ascB-encoded phospho-β-glucosidase with the bglB-encoded phospho-β-glucosidase. Alignment symbols are as in fig. 3.

duplications preceded the divergence of E. coli from S. typhimurium. That conclusion, however, is complicated by the fact that these genes are paralogous; that is, their products carry out different functions, and the genes could have diverged rapidly as a result of having been subjected to diversifying selection. In the E. coli-versus-S. typhimurium comparison the most diverse of the homologous (orthologous) proteins shared 76.8% sequence identity (Sharp 1991), while the most similar of the paralogous proteins share only 53.9% identity. A more meaningful conclusion can be drawn from the divergence at synonymous sites, those sites in which substitutions do not result in amino acid substitutions. The aligned paralogous DNA sequences were analyzed by the method of Li et al. (1985). The number of synonymous substitutions per synonymous site (KS) and the number of amino acid replacement substitutions per replacement site (KA), corrected for multiple substitutions, are shown in table 2. For the ascG-versus-galR comparison, the KS value is meaningless: the synonymous sites

Table 2
Similarities and Divergences between Paralogues

Gene Pair        Amino Acid Identity (%)   DNA Identity (%)   CAI(a)            KA     KS
ascG vs. galR    31.6                      44.3               0.275 vs. 0.328   0.76   100.0(b)
ascF vs. bglF    30.9                      46.0               0.330 vs. 0.273   0.75   2.69
ascB vs. bglB    53.9                      58.0               0.382 vs. 0.277   0.41   2.54

(a) Codon adaptation index, first vs. second member of pair.
(b) Silent sites are saturated for substitutions.
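The KS values in table 2 can be converted into a rough divergence time by scaling against the orthologous comparison quoted in the text (KS = 1.151 for E. coli versus S. typhimurium genes of similar codon bias, and a species split 120-160 Mya). The sketch below assumes that synonymous substitutions accumulate linearly with time and uses the midpoint of the 120-160 Mya range as the point estimate; both are assumptions made for illustration, since the text does not spell out the arithmetic, and the names are hypothetical.

```python
from statistics import mean


def divergence_time_mya(ks_paralogues, ks_calibration, calibration_time_mya):
    """Scale a calibration time by the ratio of synonymous divergences.

    Assumes a constant synonymous substitution rate (a simple molecular clock).
    """
    return calibration_time_mya * ks_paralogues / ks_calibration


KS_ASCF_BGLF = 2.69                # table 2
KS_ASCB_BGLB = 2.54                # table 2
KS_ORTHOLOGUES = 1.151             # E. coli vs. S. typhimurium, from the text
SPLIT_RANGE_MYA = (120.0, 160.0)   # Ochman and Wilson 1987, from the text

mean_ks = mean([KS_ASCF_BGLF, KS_ASCB_BGLB])
low, high = (
    divergence_time_mya(mean_ks, KS_ORTHOLOGUES, t) for t in SPLIT_RANGE_MYA
)
midpoint = divergence_time_mya(
    mean_ks, KS_ORTHOLOGUES, sum(SPLIT_RANGE_MYA) / 2
)

print(f"mean KS = {mean_ks:.3f}")
print(f"divergence: {low:.0f}-{high:.0f} Mya (midpoint estimate {midpoint:.0f} Mya)")
```

Under these assumptions the midpoint estimate comes out close to the ~320 Mya figure, with the 120-160 Mya calibration range translating into an interval of roughly 270-365 Mya.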

have been saturated with repeated substitutions. For the remaining two paralogous comparisons, KS values are higher than the highest KS observed in the comparison of orthologous genes in E. coli and S. typhimurium (trpA; KS = 1.77). Even though they do not result in amino acid replacements, synonymous substitutions are not necessarily neutral, because there exists a codon-usage bias that varies with the level of expression of the gene and that is inversely correlated with the rate of synonymous substitutions (Sharp and Li 1987b, 1987a). The codon adaptation index (CAI) is a measure of codon bias (Sharp and Li 1987a). The CAI for the trpA gene is 0.34 in E. coli and 0.32 in S. typhimurium (Sharp 1991), values that are not very different from the CAIs of the paralogous genes in table 2. On the basis of synonymous substitution rates, it therefore appears that ascFB diverged from bglFB before E. coli and S. typhimurium diverged and that ascG and galR diverged earlier still. For 28 E. coli and S. typhimurium genes with CAIs between 0.275 and 0.382, the value for KS is 1.151 ± 0.069 (mean ± standard error). Escherichia coli and S. typhimurium are estimated to have diverged 120-160 Mya (Ochman and Wilson 1987). The mean KS for ascFB versus bglFB is 2.62, suggesting that they diverged ~320 Mya.

Promoters

Promoter sequences were identified by using a neural network program (O’Neill 1991). Only two highly probable promoters, each with a 17-bp spacing between the -35 and -10 regions, were identified within the region between the ascG and ascFB genes. In keeping with convention (O’Neill 1991), the regions between -50 and +8 are underlined in figure 2.

Activation of the asc Operon

Because the active asc operon in strain LP103 contains IS186 inserted into the 3' end of the ascG (repressor) gene, it is reasonable to ask whether the insertion event was responsible for the initial activation of asc in strain LP101.
To determine whether strains LP100 (wild-type asc operon), LP101 (activated mutant of LP100), or LP102 (descendant of LP101, parent of LP103) contained an insertion into ascG, we amplified bp 29-400 of the asc operon by using suitable primers, and the sizes of the amplified fragments were determined by agarose gel electrophoresis. Strain LP100 yielded a fragment of ~370 bp, indicating that it did not contain an insertion within the region, while strains LP101, LP102, and LP103 yielded a 1.7-kb fragment, indicating that each contained IS186 within the 3′ region of ascG. It is reasonable to conclude that insertional inactivation of the asc operon repressor was responsible for activation of the operon.
To determine whether the asc operon could be activated by other mutations, we plated strain LP100 onto MacConkey arbutin and MacConkey salicin media, and the plates were incubated at 30°C in humidified chambers. After 7 d, red papillae began to appear on the colonies on MacConkey arbutin plates, and 10 Arb+ mutants were isolated for analysis. In contrast to the first-step asc mutant, strain LP101, none of the new mutants were even slightly positive on MacConkey salicin medium. Each mutant was transduced with bacteriophage P1 grown on strain JC10241, which carries Tn10::srl. All 1,935 of the tetracycline-resistant transductants were srl (sorbitol negative), but all remained arbutin positive. There is 50% cotransduction between srl and asc (Hall et al. 1991); thus the failure to find cotransduction between srl and the arbutin-utilization phenotype indicated that the new Arb+ mutants were not asc mutants. When the arbT (arbutin-utilizing) strain MK91243 was transduced with the same P1 lysate from strain JC10241, all 286 tetracycline-resistant transductants were srl, and all were arbutin positive; thus it seems likely that the new Arb+ mutants were arbT. After 5 wk incubation, papillae began to appear on the colonies on MacConkey salicin plates, and eight of those were isolated as pink or red colonies when they were restreaked onto MacConkey salicin plates. Of the eight, only four strains were genetically stable. The remainder segregated Arb- Sal- colonies at such high rates that 15%-55% of the cells in apparently Sal+ colonies on MacConkey salicin medium proved to be Arb- Sal- when the colonies were suspended in buffer and replated onto the same medium. During growth in liquid media, segregation was so rapid that <1% of the cells remained Arb+ Sal+.
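The strength of the cotransduction argument above can be made concrete. The sketch below asks how likely it would be to score 1,935 transductants without a single loss of the Arb+ phenotype if the new mutations really were in asc. It assumes that each transductant is an independent trial and that cotransduction of the donor asc region with srl would replace the activating mutation; both are simplifying assumptions rather than statements from the paper, and the function names are hypothetical.

```python
import math

COTRANSDUCTION = 0.50    # srl-asc cotransduction frequency (Hall et al. 1991)
N_TRANSDUCTANTS = 1935   # tetracycline-resistant transductants scored


def log10_prob_no_cotransductants(freq, n):
    """log10 of the chance that none of n independent transductants lose Arb+."""
    return n * math.log10(1.0 - freq)


def transductants_for_confidence(freq, alpha=0.05):
    """Smallest n for which seeing zero cotransductants has probability below alpha."""
    return math.ceil(math.log(alpha) / math.log(1.0 - freq))


log10_p = log10_prob_no_cotransductants(COTRANSDUCTION, N_TRANSDUCTANTS)
n_min = transductants_for_confidence(COTRANSDUCTION)
print(f"log10 P = {log10_p:.1f}; {n_min} transductants would already give P < 0.05")
```

The probability is computed on a log scale because 0.5 raised to the 1,935th power underflows ordinary floating point. Under these assumptions a handful of transductants per mutant would have been enough to exclude linkage to srl.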
PCR amplification of the 29-400-bp region of the asc operon by using DNA isolated from cultures of those unstable strains gave mixed results: two strains showed no evidence of IS186 within that region, suggesting either that the original activation was by some other mechanism or that precise excision had restored the integrity of the ascG gene; and the other two strains contained an insertion into that region. Sequencing the amplified DNA showed that, in those strains, IS186 had inserted into the identical site and in the same orientation as shown in figures 1 and 2. In those cases, loss of the Arb+ Sal+ phenotype involved some lesion other than loss of the IS186 insertion. All four of the stable Arb+ Sal+ strains proved to have IS186 inserted into the same site and in the same orientation as did strain LP103. Even in “stable” strains, however, the genotype is not completely stable. A strain LP103 colony consisting of 2.5 × 10^8 cells contained ~2.5% Arb- Sal- cells, which proved to have lost IS186 from within the 29-400-bp region of the asc operon. The finding that excision of IS186 can accompany loss of the asc Arb+ Sal+ phenotype supports the notion that the two unstable strains in which IS186 could not be detected within the 29-400-bp region of the asc operon had originally contained the insertion but had lost it during growth of the culture.

Discussion

Clearly, ascFB is paralogous (homologous via a gene duplication) to bglFB, and ascG is paralogous to galR. The duplications that led to these divergent genes are ancient, on the order of 320 Myr old, and predate the divergence of Escherichia coli from Salmonella. (However, because of the tremendous multihit corrections required, the exact time of divergence cannot be taken very seriously.) The asc and bgl operons appear to have been assembled from different “modules,” because, although their phospho-β-glucosidase and transport genes are homologous, their regulatory genes are unrelated.
A similar phenomenon was observed for the cryptic cel operon of E. coli: its glycosidase and regulatory genes (celDF) are homologous to the melibiose genes melR and melA, while its enzyme III^cel transport gene is homologous to a gene from Staphylococcus aureus (Parker and Hall 1990a). Given the ancient divergence of ascFB from bglFB, it is reasonable to ask how these genes could have persisted in a silent state for so long without being eliminated by random inactivating mutations. The fact that both operons can be activated by simple insertions of mobile elements indicates that inactivating mutations have not accumulated. One possibility is that both genes were recently silenced and that their “crypticity” is something of a laboratory artifact. This appears unlikely. None of the 72 natural isolates in the ECOR collection (Ochman and Selander 1984) can utilize either arbutin or salicin, but 66 of those strains produced spontaneous mutants capable of utilizing one or both of those sugars (Hall and Betts 1987), indicating that they carried β-glucoside utilization genes in a cryptic state. It appears equally unlikely that both genes could have (a) remained silent (and thus not subject to purifying selection) for over 100 Myr (~10^10 generations) and (b) escaped inactivation. Mutations that inactivate the gal operon occur at the rate of 2.6 × 10^-5 per cell division (B. G. Hall, unpublished data); thus, if inactivating mutations were truly neutral over that entire period, both operons would have been eliminated long ago. This is by no means the only case of a gene that appears to have been cryptic for 100 Myr. One of the major criteria used to distinguish Salmonella from E. coli is the latter’s inability to utilize citrate as a carbon source, an inability that arises from the lack of a functional citrate transport system in E. coli. Nevertheless, E. coli carries a cryptic citrate-transport system that can be activated by spontaneous mutation (Hall 1982). Several years ago we (Hall et al.
1983) proposed that cryptic genes could be maintained by selection acting in opposite directions in alternating environments. We suggested that functional alleles would be advantageous in one, rarely encountered environment, while both cryptic and nonfunctional alleles (those that had been permanently inactivated) would be advantageous in another, commonly encountered environment. That model was supported by a theoretical study showing that cryptic and functional alleles can be retained without having any advantage over nonfunctional alleles, if selection for the active allele occurs occasionally for a substantially long time (Li 1984). The cryptic alleles would be retained, for instance, if selection favored the functional (active) allele for 200 generations and favored the silent allele for 250,000 generations. On the other hand, if selection for the active allele occurred only for 100 of the 250,000 generations, then both the functional and cryptic alleles would be eliminated after only a few cycles between the environments. Cryptic genes, then, are genes that have the peculiar property that it is disadvantageous to express them in normal environments. Experimental support for that model requires demonstrating (1) that silent alleles of some gene are actually favored in some environment and are disfavored in another environment, (2) that cryptic genes do accumulate mutations when they are silent and that some alleles are occasionally eliminated as the result of the mutations that accumulate, and (3) that cryptic genes can be activated by easily reversible means that will permit cycling between the active and cryptic states.
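The argument made earlier in this Discussion, that truly neutral inactivating mutations would have eliminated a silent operon long ago, can be checked with back-of-envelope arithmetic. The sketch below uses only the two figures quoted in the text (2.6 × 10^-5 inactivating mutations per cell division and about 10^10 generations) and assumes a constant rate, no reversion, and no selection; it is an illustration, not the authors' calculation.

```python
import math

MU = 2.6e-5          # inactivating mutations per cell division (from the text)
GENERATIONS = 1e10   # roughly 100 Myr of silence, as estimated in the text


def log10_fraction_intact(mu, generations):
    """log10 of (1 - mu)**generations, computed without underflow."""
    return generations * math.log1p(-mu) / math.log(10)


def half_life_generations(mu):
    """Generations until half of the lineages carry an inactivating mutation."""
    return math.log(0.5) / math.log1p(-mu)


half_life = half_life_generations(MU)
log10_intact = log10_fraction_intact(MU, GENERATIONS)
print(f"half-life ~{half_life:,.0f} generations; "
      f"log10(fraction intact) ~{log10_intact:,.0f}")
```

Under these assumptions the expected half-life of an intact silent operon is a few tens of thousands of generations, which is negligible against 10^10, so persistence over that span requires some form of selection such as the alternating-environment model described above.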
The first point was demonstrated by showing that, in a mixed-resource environment where both cellobiose and glycerol were present, the cryptic cel (cellobiose-utilization) operon was strongly favored over active cel alleles but that, in a single-resource environment where only cellobiose was present, the active alleles were strongly favored (Hall et al. 1986). The second point was supported both by the finding that recently activated alleles are quite temperature sensitive (Hall and Faunce 1987) and by a survey that showed that the bgl operon was deleted in 29% of the strains in two collections of natural isolates of E. coli (Hall 1988). The work presented here supports the third point. The asc operon can be activated by insertion of IS186 into ascG, but that activation is readily reversible by precise excision of IS186. The finding that asc can be activated by insertion sequences is consistent with the findings that both the bgl and cel operons can be activated by insertion sequences (Reynolds et al. 1981, 1984; Parker and Hall 1990b).
The simplest model to explain the normally cryptic state of the asc operon is that the ascG-encoded repressor is insensitive to β-glucosides as inducers and that the operon is only expressed after inactivation of the repressor. It is not particularly surprising to find that the asc operon can be activated by insertion of IS186 into the repressor gene, ascG. It is surprising, however, to discover that all seven of the insertions were into exactly the same site and were in the same orientation. If simple repressor inactivation is all that is required to activate the operon, why does it take 4-5 wk for asc mutants to appear as papillae? Why is the repressor not inactivated by other mutations, such as missense mutations, frameshifts, deletions, and insertions of other mobile elements? Since all of the insertions appear to be identical, why are some mutants, such as LP101-103, quite stable while others are highly unstable? It is possible, although we have no evidence for it, that the stable mutants have accumulated additional mutations in IS186 that reduce its excision frequency. Simple repressor inactivation is also inconsistent with the observation that expression of the asc operon is regulated: in strain LP103, arbutin induces asc expression about fourfold above the basal level, and salicin induces it about twofold above the basal level (Parker and Hall 1988).
We have not yet identified the mutations that distinguish strain LP102 from LP101 and LP103 from LP102. Additional experimental work is required (1) to distinguish insertion of IS186 from other mutations that might inactivate the putative ascG-encoded repressor (and that thus might activate the operon) and (2) to examine the roles of other mutations, both in ascFB and in IS186, in determining the stability of the active state of the asc operon.
These results support the notion that cryptic genes are not simply functionless genes that are at an early stage of becoming pseudogenes, but instead that they arise repeatedly in response to environmental changes. It is tempting to think of cryptic genes as some sort of genomic hedge against future challenges, but that sort of anticipatory evolution appears extremely unlikely. The silencing and activation of cryptic genes can better be viewed as a form of genetic (as opposed to physiological) regulation of gene expression for those genes whose products have the peculiar physiological property that they normally are dangerous but occasionally are very useful. Although they have been studied most extensively by model systems involving β-glucoside utilization, cryptic genes as a class are probably related only by the peculiar physiological properties of their products, rather than by ancestral homology. The cel operon, for instance, appears to be completely unrelated to either bgl or asc.
The repeated cycles of silencing and activation of some genes lead to a picture in which these genes could have a significant effect on the population dynamics of E. coli. During the long periods when the genes are silent, they are expected to accumulate mutations, because the genes are not subject to purifying selection. When the environment changes so that active alleles are favored, those alleles that have accumulated enough mutations to make them inactive would be purged from the population quickly.
The result of that purging would be a population bottleneck. In effect, purifying selection, instead of being applied gradually over the long period when the gene was silent, is applied all at once when the environment changes; that is, selection is episodic, rather than fairly constant. Over long periods of time that include many cycles of silencing and reactivation, the intensity of selection on a cryptic gene is thus expected to be about the same as it would be if the gene had been expressed continuously. This view is consistent with the recent finding that the DNA sequence diversity of 12 naturally occurring alleles of the cryptic celC gene (encoding PTS enzyme III^cellobiose) is about the same as that for alleles of the functional gutB gene (encoding PTS enzyme III^glucitol) that are from the same 12 strains (Hall and Sharp 1992). The consistent finding that cryptic-gene activation involves insertion sequences suggests that these mobile elements may not be entirely selfish DNA after all but that, instead, they may play an important (if only occasional) role in controlling expression of normally cryptic genes.

Sequence Availability

This sequence has been submitted to GenBank and has been assigned accession number M73326.

Acknowledgments

We thank Yansu Ju and Shariff Bayoumy for their contributions to some of the experiments in this study. We are grateful to Milton Saier, Ken Rudd, Jonathan Reizer, Bernard Henrissat, Frederic Barras, and Michael O’Neill for their interest in this project, their encouragement and advice, and their gracious help with computer analyses of this sequence. We particularly thank two anonymous reviewers for their comments. This study was supported by USPHS grant GM-37110.

LITERATURE CITED

BACHMANN, B. J. 1990. Linkage map of Escherichia coli K-12, edition 8. Microbiol. Rev. 54:130-197.
CSONKA, L. N., and A. J. CLARK. 1980. Construction of an Hfr strain useful for transferring recA mutations between Escherichia coli strains. J. Bacteriol. 143:529-530.
HALL, B. G. 1982. A chromosomal mutation for citrate utilization by Escherichia coli K12. J. Bacteriol. 152:269-273.
HALL, B. G. 1988. Widespread distribution of deletions of the bgl operon in natural isolates of Escherichia coli. Mol. Biol. Evol. 5:456-467.
HALL, B. G., and P. W. BETTS. 1987. Cryptic genes for cellobiose utilization in natural isolates of Escherichia coli. Genetics 115:431-439.
HALL, B. G., P. W. BETTS, and M. KRICKER. 1986. Maintenance of the cellobiose utilization genes of Escherichia coli in a cryptic state. Mol. Biol. Evol. 3:389-402.
HALL, B. G., P. W. BETTS, and J. C. WOOTTON. 1989. DNA sequence analysis of artificially evolved ebg enzyme and ebg repressor genes. Genetics 123:635-648.
HALL, B. G., and W. FAUNCE III. 1987. Functional genes for cellobiose utilization in natural isolates of Escherichia coli. J. Bacteriol. 169:2713-2717.
HALL, B. G., and P. M. SHARP. 1992. Molecular population genetics of Escherichia coli: DNA sequence diversity at the celC, crr, and gutB loci of natural isolates. Mol. Biol. Evol. 9:000000.
HALL, B. G., L. XU, and H. OCHMAN. 1991. Physical map location of the asc (previously SAC) operon of Escherichia coli K12. J. Bacteriol. 173:5250.
HALL, B. G., S. YOKOYAMA, and D. H. CALHOUN. 1983. Role of cryptic genes in microbial evolution. Mol. Biol. Evol. 1:109-124.
JUNNE, T., K. SCHNETZ, and B. RAK. 1990. Location of the bglA gene on the physical map of Escherichia coli. J. Bacteriol. 172:6615-6616.
KELLY, R. L., and C. YANOFSKY. 1985. Mutational studies with the trp repressor of Escherichia coli support the helix-turn-helix model of repressor recognition of operator DNA. Proc. Natl. Acad. Sci. USA 82:483-487.
KOHARA, Y., K. AKIYAMA, and K. ISONO. 1987. The physical map of the whole E. coli chromosome: application of a new strategy for rapid analysis and sorting of a large genomic library. Cell 50:495-508.
KRICKER, M., and B. G. HALL. 1984. Directed evolution of cellobiose utilization in Escherichia coli K12. Mol. Biol. Evol. 1:171-182.
KRICKER, M., and B. G. HALL. 1987. Biochemical genetics of the cryptic gene system for cellobiose utilization in Escherichia coli K12. Genetics 115:419-429.
LENGLER, J. W., R. J. MAYER, and K. SCHMID. 1982. Phosphoenolpyruvate-dependent phosphotransferase system enzyme III and plasmid encoded sucrose transport in Escherichia coli K-12. J. Bacteriol. 151:468-471.
LI, W.-H. 1984. Retention of cryptic genes in microbial populations. Mol. Biol. Evol. 1:213-218.
LI, W.-H., C.-I. WU, and C.-C. LUO. 1985. A new method for estimating synonymous and nonsynonymous rates of nucleotide substitution considering the relative likelihood of nucleotide and codon changes. Mol. Biol. Evol. 2:150-174.
MAHADEVEN, S., A. E. REYNOLDS, and A. WRIGHT. 1987. Positive and negative regulation of the bgl operon of Escherichia coli. J. Bacteriol. 169:2570-2578.
MILLER, J. H. 1972. Experiments in molecular genetics. Cold Spring Harbor Laboratory, Cold Spring Harbor, N.Y.
NEEDLEMAN, S. B., and C. D. WUNSCH. 1970. A general method applicable to the search for similarities in the amino acid sequence of two proteins. J. Mol. Biol. 48:443-453.
OCHMAN, H., and R. H. SELANDER. 1984. Standard reference strains of Escherichia coli from natural populations. J. Bacteriol. 157:690-693.
OCHMAN, H., and A. WILSON. 1987. Evolutionary history of enteric bacteria. Pp. 1625-1648 in F. C. NEIDHARDT, ed. Escherichia coli and Salmonella typhimurium. American Society for Microbiology, Washington, D.C.
O’NEILL, M. C. 1991. Training back-propagation neural networks to define and detect DNA-binding sites. Nucleic Acids Res. 19:313-318.
PARKER, L. L., and B. G. HALL. 1988. A fourth E. coli gene system with the potential to evolve β-glucoside utilization. Genetics 119:485-490.
PARKER, L. L., and B. G. HALL. 1990a. Characterization and nucleotide sequence of the cryptic cel operon of E. coli K12. Genetics 124:455-471.
PARKER, L. L., and B. G. HALL. 1990b. Mechanisms of activation of the cryptic cel operon of E. coli K12. Genetics 124:473-482.
PEARSON, W. R., and D. J. LIPMAN. 1988. Improved tools for biological sequence comparison. Proc. Natl. Acad. Sci. USA 85:2444-2448.
PRASAD, I., and S. SCHAEFLER. 1974. Regulation of the β-glucoside system in Escherichia coli K12. J. Bacteriol. 120:638-650.
PRASAD, I., B. YOUNG, and S. SCHAEFLER. 1973. Genetic determination of the constitutive biosynthesis of phospho-β-glucosidase A in Escherichia coli K12. J. Bacteriol. 114:909-915.
REIZER, J., A. REIZER, and M. H. SAIER Jr. 1990. The cellobiose permease of Escherichia coli consists of three proteins and is homologous to the lactose permease of Staphylococcus aureus. Res. Microbiol. 141:1061-1067.
REYNOLDS, A. E., J. FELTON, and A. WRIGHT. 1981. Insertion of DNA activates the cryptic bgl operon of E. coli. Nature 293:625-629.
REYNOLDS, A. E., S. MAHADEVAN, S. F. J. LEGRICE, and A. WRIGHT. 1984. Enhancement of bacterial gene expression by insertion elements or by mutation in a CAP-cAMP binding site. J. Mol. Biol. 191:85-95.
RUDD, K. E., W. MILLER, J. OSTELL, and D. A. BENSON. 1990. Alignment of Escherichia coli K12 DNA sequences to a genomic restriction map. Nucleic Acids Res. 18:313-321.
SAIER, M. H., and J. REIZER. 1990. Domain shuffling during evolution of the proteins of the bacterial phosphotransferase system. Res. Microbiol. 141:1033-1038.
SAIER, M. H., M. YAMADA, B. ERNI, K. SUDA, J. LENGLER, R. EBNER, P. ARGOS, B. RAK, K. SCHNETZ, C. A. LEE, G. C. STEWART, J. F. BREIDT, E. B. WAYGOOD, K. G. PERI, and R. F. DOOLITTLE. 1988. Sugar permeases of the bacterial phosphoenolpyruvate-dependent phosphotransferase system: sequence comparisons. FASEB J. 2:199-208.
SANGER, F., A. R. COULSON, B. G. BARRELL, A. J. H. SMITH, and B. A. ROE. 1980. Cloning in single stranded bacteriophage as an aid to rapid DNA sequencing. J. Mol. Biol. 143:161-178.
SCHNETZ, K., and B. RAK. 1988. Regulation of the bgl operon of Escherichia coli by transcriptional anti-termination. EMBO J. 7:3271-3277.
SCHNETZ, K., C. TOLOCZYKI, and B. RAK. 1987. β-Glucoside (bgl) operon of Escherichia coli K12: nucleotide sequence, genetic organization, and possible evolutionary relationship to regulatory components of two Bacillus subtilis genes. J. Bacteriol. 169:2579-2590.
SCHWARTZ, R. M., and M. O. DAYHOFF. 1979. Matrices for detecting distant relationships. Pp. 353-358 in M. O. DAYHOFF, ed. Atlas of protein sequences and structure. Vol. 5, Suppl. 3. National Biomedical Research Foundation, Washington, D.C.
SHARP, P. M. 1991. Determinants of DNA sequence divergence between Escherichia coli and Salmonella typhimurium: codon usage, map position, and concerted evolution. J. Mol. Evol. 33:23-33.
SHARP, P. M., and W.-H. LI. 1987a. The codon adaptation index-a measure of directional synonymous codon usage bias, and its potential applications. Nucleic Acids Res. 15:1281-1295.
SHARP, P. M., and W.-H. LI. 1987b. Molecular evolution of ubiquitin genes. Trends Ecol. Evol. 2:328-332.
SHARP, P. M., and W.-H. LI. 1987c. The rate of synonymous substitution in enterobacterial genes is inversely related to codon usage bias. Mol. Biol. Evol. 4:222-230.
STRATHMANN, M., B. A. HAMILTON, C. A. MAYEDA, M. I. SIMON, E. M. MEYEROWITZ, and M. PALAZZOLO. 1991. Transposon-facilitated DNA sequencing. Proc. Natl. Acad. Sci. USA 88:1247-1250.

MARTIN KREITMAN, reviewing editor
Received July 22, 1991; revision received December 18, 1991
Accepted December 27, 1991