Escherichia coli promoters. II. A spacing class-dependent promoter ...

Vol. 264, No. 10, Issue of April 5, pp. 5531-5534, 1989
Printed in U.S.A.

THE JOURNAL OF BIOLOGICAL CHEMISTRY
© 1989 by The American Society for Biochemistry and Molecular Biology, Inc.

Escherichia coli Promoters
II. A SPACING CLASS-DEPENDENT PROMOTER SEARCH PROTOCOL*

(Received for publication, April 22, 1988)

Michael C. O’Neill‡ and Francis Chiafari
From the Department of Biological Sciences, University of Maryland Baltimore County, Baltimore, Maryland 21228

A computer search protocol for finding Escherichia coli bacterial and phage promoters is presented. This protocol relies heavily on the description of promoter sites developed in the preceding paper (O’Neill, M. C. (1989) J. Biol. Chem. 264, 5522-5530), with particular emphasis on the existence of a distinct consensus sequence for each of the three major spacing classes. The input sequence is tested independently for promoters with 16, 17, or 18 bases separating the -35 and -10 regions. Within a given spacing group, a series of six tests is employed to define possible promoters. These tests were developed empirically to identify members of the known promoter database with high efficiency while producing a minimal level of false positives. The degree to which this aim is met is discussed in the context of searches of random sequence, of pBR322, and of λ.

Historically there have been a number of attempts to develop a promoter search algorithm which would allow for the efficient detection of Escherichia coli promoter sites within a large DNA sequence (Scherer et al., 1978; Galas et al., 1985; Mulligan et al., 1984; Mulligan and McClure, 1986). Many of these attempts were based largely or entirely on a specific degree of agreement with the -35 and -10 consensus sequences. However, the average promoter shows only four out of six consensus bases in each location. If this is used as the match criterion, with either 16, 17, or 18 bases between the two regions, the expectation with random sequence would be one or more promoters every 200 bases (the mode would be at 118 bases), producing a high level of false positives in any search of even moderate scope. On the other hand, if a perfect consensus match is required, no real promoters in the current list would be correctly identified. Indeed, this criterion is doomed by the limits of its inherent information content either to miss a significant proportion of real promoters or to produce a very large number of false positives while finding the majority of real promoters. Clearly a more extensive consensus sequence would be helpful in producing a more effective search procedure.

In the preceding paper (O’Neill, 1989), we have reported that the extended consensus sequence varies with the spacing class of the promoter; that is, the pattern of base conservation outside the -35 and -10 regions is different for promoters with a different number of bases between the two contact regions. We have developed a search procedure which takes this fact into account, searching the sequence with distinct criteria for promoters of 16-, 17-, and 18-base spacings. The sequences captured in this first test are then subjected to a series of positive and negative filters in subsequent passes. The additional positive tests are based 1) on the degree of adherence to the -35 and -10 consensus in a given spacing class and 2) on the degree of internal repeat structure in a given spacing class. The negative tests are based on the number of “worst” bases which occur and the occurrence of a “forbidden” internal repeat indexing. Overall, the search employs six tests for each of the three major spacing groups. This protocol is tested on the fully characterized chromosomal and phage promoters of the Hawley and McClure (1983) listing, on random sequence, on pBR322, and on λ. Real promoters are identified with an efficiency between 50 and 80%, with little or no generation of false positives even in targets as large as λ.

* The costs of publication of this article were defrayed in part by the payment of page charges. This article must therefore be hereby marked “advertisement” in accordance with 18 U.S.C. Section 1734 solely to indicate this fact.
‡ To whom requests for reprints and programs should be addressed.

MATERIALS AND METHODS

A macro program was developed to apply the serial tests described in the text to any designated target sequence. The macro calls a Pascal match program of our design which can search a sequence of any length that will fit in the computer’s memory for a match of any size and stringency defined by the user; those regions of the target sequence satisfying the criteria of the first test are passed by the macro program to the second test, and so on. The final product is a listing of those regions of the target sequence which passed all test criteria. This macro and the programs it calls are available on disk to interested investigators. The random sequence generator was a Basic program of our design which employs the Basic random number generator to produce random sequence of a user-specified GC content and length.

RESULTS
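The serial testing described under Materials and Methods can be illustrated with a short sketch. This is not the authors' macro or Pascal program, which are not reproduced in the text; the function names, the use of IUPAC ambiguity codes for matching, and the two toy test patterns are assumptions made purely for illustration. The toy patterns are not the Table I sequences.

```python
# Hypothetical sketch of mismatch-tolerant matching followed by serial filtering.
# IUPAC nucleotide codes mapped to the bases each code allows.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "AG", "Y": "CT", "S": "CG", "W": "AT", "K": "GT", "M": "AC",
    "B": "CGT", "D": "AGT", "H": "ACT", "V": "ACG", "N": "ACGT",
}

def count_misses(window, pattern):
    """Count positions where the window base is not allowed by the pattern code."""
    return sum(base not in IUPAC[code] for base, code in zip(window, pattern))

def serial_search(sequence, tests):
    """Pass window start positions (0-based) through each test in turn.

    tests: list of (pattern, limit, is_negative); all patterns are assumed
    to have the same length. A positive test keeps windows with at most
    `limit` misses; a negative test keeps windows with more than `limit` misses.
    """
    if not tests:
        return []
    width = len(tests[0][0])
    survivors = list(range(len(sequence) - width + 1))
    for pattern, limit, is_negative in tests:
        kept = []
        for start in survivors:
            misses = count_misses(sequence[start:start + width], pattern)
            passed = misses > limit if is_negative else misses <= limit
            if passed:
                kept.append(start)
        survivors = kept
    return survivors

# Toy tests for a 17-base spacing (illustrative only).
TOY_TESTS = [
    ("TTGACA" + "N" * 17 + "TATAAT", 2, False),  # positive: at most 2 misses
    ("N" * 6 + "GGGG" + "N" * 19, 1, True),      # negative: must exceed 1 miss
]

target = "AC" + "TTGACA" + "C" * 17 + "TATAAT" + "GT"
print(serial_search(target, TOY_TESTS))  # [2]
```

Positions coded N match any base and therefore never count as misses, so the permitted number of misses applies only to the specified positions of each test sequence.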

Table I presents the test sequences and match criteria used in testing for promoters of each of the three major spacing classes. It can be seen that the degree of consensus and the specific sequence conserved vary substantially by spacing class (for a fuller discussion see the preceding paper, which provides the basis for the tests employed here). The input sequence is tested for promoters of all three of the major spacing groups. The first two tests check for agreement at all positions which are conserved at ≥38% and ≥67%, respectively, in the consensus sequence for a particular spacing group. The third test looks at the most conserved bases in the -35 and -10 regions. The fourth test looks at those positions in each spacing class which show the strongest conservation of the 10-base repeat sequence. The fifth test subtracts sequences which exceed a cutoff on the number of worst bases, defined as those base choices found to occur at all% in bona fide promoters of a given spacing class. The last test subtracts those sequences found to have at least a 50% match with the 10-base repeat sequence, TATTWTRAYA, beginning at position 36 for the 16 spacing class, 37 for the 17 spacing class, and 38 for the 18 spacing class; it was determined empirically that such matches are rare in real promoters (possibly because

TABLE I
Sequences of the promoter test series
A series of six test sequences is shown for each of the three major bacterial spacing classes (16, 17, 18) and for the 17-base phage class (17P). To pass the first four tests, a sequence must match the test sequence with no more than the number of misses indicated in the right column. To pass the final two tests, a sequence must exceed the number of misses indicated since these are negative tests.

Spacing class

16 17 18 17P 16 17 18 17P 16 17 18 17P 16 17 18 17P 16 17 18 17P 16 17 18 17P

No. of

Test sequences

misses

TNNCAAATTAACGMTTGACACNGNNKNNGGANTCCGTATAATGCGCCCCCNTNGNCNN TWNNNAWTNATTTSTTGACATATNWANCNANNTWNNNTATAATNNNNCCCCNANNNNN

25 19 27 15 5 2 2 4 1 1 1 1

NAAGAAAAAATTWCTTGACNTTTTTTNTTAATTTATGKTATGNTTmANCANTTNATT

TANAANTNNNNNNNTTGACATTNTNNNNANTNTNTGTTATAATTNMCNNNNNNNNNNNNANNNTTGNTANNATNCNCNNCNNNNNNNNNNNNNNTTGNNNNNNNNNNNNNNNNNNNNTAN

N N N N N A N N N N M J NNNNNNNNNNNNNNTTGACNNT” “NNNNNNNNNNNNNNNNNNNNTANN N

N

N

N

T

G

N

T

N

P

T T ” NNNNNNNNNNNNNNNTGANNNNNNNNNNNNNNNNNNNNAN

NNNNNNNNNNNNNNWTRAYANNNNNNNNNNNNWNNNYATA

NNNNWNRNYANNTNWTRAYANNNNNNNNNNNNTWNRNYNN NNNNNNNNNNTNTNWTRAYNTNTTWTRNYANNNTWTRNYATNNTWTN~ TANNh’TNAYNNNTNWTRAYANNNNNNNNNNNTTWTRNYAN NNCNCCGCCCGACTMNMGTCNNTNNCNNAYCGCNNTABSS GGNNNGSCGSNNCTMRAGGCSNANNNNNNNNNGNNNNAGN NNTATKGNGGCCGNAGAGGNGNGNNCNGNCGNNGNNAAGBATNRSAAGNNTGSNTNAG NTCGGTRNGTACCNCCMGTCGSGGCCNGCGCCNGCGCCANAC~CCYGNCCK~GSCCCCGCTC JATTWTRAYN N N N N N N N N N N N N N N N N N N N R A Y -

NNNNNNMJNNNNNNNNNNNNTATTWTRAYNNNNNNNNNNNNNNNNNNNNTTWTRAY-

they would make the spacing class of the promoter ambiguous). In each case, sequence elements which satisfy the criteria of the first four tests are passed to the fifth and sixth tests, each of which subtracts any sequences meeting its negative criteria.

The fully characterized bacterial promoters of Hawley and McClure (1983; Fig. 1) were tested under this protocol. (Only those promoters which did not appear to fall into any of the major spacing groups were omitted from consideration.) Table II provides a breakdown, test by test, of the result. Overall, 77% of the promoters are correctly identified. This, in itself, does not prove the effectiveness of the search since it is in large degree tautological. A search which succeeds in this sense may nonetheless produce so many false positives as to be useless (Mulligan et al., 1984). In order to assess the potential level of false positives, we used the search protocol on two types of “target”: 1) random sequence produced by a random sequence generator; and 2) biological DNA for which at least a partial promoter map is available.

Table III shows the result of testing five different 4000-base random sequences for each of seven distinct average base compositions, ranging from 40 to 80% A-T. As Mulligan and McClure (1986) pointed out, a promoter search protocol is likely to be highly sensitive to the average A-T ratio in the input sequence due to the A-T richness of promoters. The production, at random, of sequence satisfying the test criteria is about 1/1000 bases for sequence which is 50% A-T but jumps to 15/1000 at 80% A-T, an A-T level found in parts of the B region of λ, for instance (Collins and Coulson, 1984).

Table IV presents the results of a search of pBR322 (Sutcliffe, 1978; Peden, 1983), in both directions. Five of the six known promoters are found; missing is the cyclic AMP binding protein-dependent P4 promoter (Queen and Rosenberg, 1981).
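The random-sequence control can be approximated as follows. This is a minimal sketch, not the authors' Basic program: the function names are hypothetical, and it assumes that A and T are equally likely within the A-T fraction (likewise G and C), which the text does not specify. The helper at the end simply converts a mean number of hits per 4000 bases, as reported in Table III, into a rate per 1000 bases.

```python
import random

def random_sequence(length, at_fraction, seed=None):
    """Random DNA whose expected A+T fraction is at_fraction.

    Assumption: A and T share the A-T fraction equally, as do G and C.
    """
    rng = random.Random(seed)
    gc_fraction = 1.0 - at_fraction
    weights = [at_fraction / 2, at_fraction / 2, gc_fraction / 2, gc_fraction / 2]
    return "".join(rng.choices("ATGC", weights=weights, k=length))

def hits_per_1000(mean_hits, length=4000):
    """Convert a mean hit count per `length` bases into hits per 1000 bases."""
    return 1000.0 * mean_hits / length

seq = random_sequence(4000, 0.80, seed=1)
observed_at = (seq.count("A") + seq.count("T")) / len(seq)
print(round(observed_at, 2))

# Means taken from Table III for 50% and 80% A-T.
print(hits_per_1000(4.0))              # 1.0
print(round(hits_per_1000(60.6), 2))   # 15.15
```

The two converted rates correspond to the figures of about 1/1000 and 15/1000 bases quoted in the text.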
In a clockwise search of pBR322, only the tet and RNA1 promoters are found with no additional elements. In the counterclockwise direction, in addition to the bla, primer,

5 13 11 10 35 21 33 41 5

5 5 5

TABLE II
A test of known promoters under the search protocol
The test sequences of Table I were run against the bacterial promoters of the 16-, 17-, and 18-base classes and the phage promoters of the 17-base class from Hawley and McClure (1983). A breakdown is provided on the number of promoters passing each test. Those promoters which the protocol fails to detect in each class are indicated.

16-base pair spacing class (input, 18 promoters)
Search level                1st  2nd  3rd  4th  5th  6th
No. promoters found          16   17   16   17    0    0
Net no. promoters found      16   16   16   15   15   15
Not recognized: malEFG, aroH, rpoB

17-base pair spacing class (input, 21 promoters)
Search level                1st  2nd  3rd  4th  5th  6th
No. promoters found          19   19   20   21    0    1
Net no. promoters found      19   17   17   17   17   16
Not recognized: lacI, trpP2, thr, fol, uvrB P3

17-base pair spacing class, phage (input, 18 promoters)
Search level                1st  2nd  3rd  4th  5th  6th
No. promoters found          17   15   17   16    0    1
Net no. promoters found      17   14   14   14   14   13
Not recognized: T7 D, λPm, λPI, 434 PR, fd X

18-base pair spacing class (input, 13 promoters)
Search level                1st  2nd  3rd  4th  5th  6th
No. promoters found          12   12   12   13    0    1
Net no. promoters found      12   11   11   11   11   10
Not recognized: tmR, p-ori-R, alaS
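As a check on the arithmetic, the detection efficiencies quoted in the text can be recomputed from the input and final net counts of Table II. The dictionary below is simply a transcription of those counts; the variable names are illustrative.

```python
# (input promoters, net promoters remaining after the sixth test), from Table II.
TABLE_II = {
    "16-bp bacterial": (18, 15),
    "17-bp bacterial": (21, 16),
    "18-bp bacterial": (13, 10),
    "17-bp phage": (18, 13),
}

def totals(rows):
    """Sum the input and found counts over a list of (input, found) pairs."""
    return sum(n for n, _ in rows), sum(f for _, f in rows)

bacterial = [v for k, v in TABLE_II.items() if "bacterial" in k]
phage = [v for k, v in TABLE_II.items() if "phage" in k]

bact_in, bact_found = totals(bacterial)
phage_in, phage_found = totals(phage)
all_in, all_found = totals(list(TABLE_II.values()))

print(bact_found, bact_in, round(100 * bact_found / bact_in, 1))      # 41 52 78.8
print(phage_found, phage_in, round(100 * phage_found / phage_in, 1))  # 13 18 72.2
print(all_found, all_in, round(100 * all_found / all_in, 1))          # 54 70 77.1
```

These totals agree with the overall figure of 77% given in the Results and with the counts of 41 of 52 bacterial and 13 of 18 phage promoters given in the Discussion.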

and P1 promoters, sites at 3308, 3805, 3868, 4129, 4343, and 4355 are identified under this protocol as possible promoters. In their study of transcription initiation on pBR322, Stuber and Bujard (1981) found evidence of several additional transcription start points over and above the six major promoters. In particular, in addition to multiple bla promoters, their data

TABLE III
A test of random DNA sequences under the search protocol
The search protocol was run against computer-generated 4000-base random sequences. For each of seven different average base ratios, ranging from 40 to 80% A-T, five distinct sequences were tested. The mean shown is the total number of “promoters” found for all three spacing groups per 4000 bases. For each spacing class, the five entries in a row are the numbers of promoters found in sequences 1 through 5.

A-T (%)   No. 16-bp(a) promoters   No. 17-bp promoters   No. 18-bp promoters   Mean
40        0 0 0 0 0                0 3 1 0 0             0 0 0 0 0              0.8
45        0 0 0 0 0                1 5 1 0 0             0 0 0 0 0              1.4
50        0 0 0 0 0                5 5 4 3 3             0 0 0 0 0              4.0
55        0 0 1 0 1                7 5 2 3 3             0 0 0 0 0              4.4
60        0 1 0 1 0                10 10 12 6 8          0 0 0 2 0             10.0
70        0 1 0 2 0                29 43 27 31 39        0 1 0 1 1             35.0
80        1 1 0 0 1                55 61 60 61 46        3 1 2 5 6             60.6
(a) bp, base pair.

TABLE V
A test of λ sequence under the search protocol
The search protocol was run against λ sequence in both directions. In this case the phage test sequences were used for the 17-base class. The bacterial tests for the 16- and 18-base classes were used, by default, because there is insufficient information to determine whether phage-specific tests are justified for these classes. The underlined coordinates indicate the locations of known promoters: 37974 (PR), 44538 (PR'), 35631 (PL), 37989 (PRM), 38725 (PO).

Rightward search
Spacing class    No. sequences   λ coordinates
16-bp(a)         4               22027, 32642, 35448, 36188
17-bp            8               13034, 23714, 23918, 24043, 28299, 37974, 44538, 46999
18-bp            3               23002, 23216, 4MT

Leftward search
Spacing class    No. sequences   λ coordinates
16-bp            2               24697, 26514
17-bp            7               23115, 23604, 25121, 35631, 37989, 38725, 47211
18-bp            3               24084, 24w33827
(a) bp, base pair.
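The rightward and leftward searches of Table V amount to running the same tests on the sequence as given and on its reverse complement. The sketch below shows one way this could be arranged; the function names, the toy hit finder, and the convention for reporting reverse-strand hits as top-strand positions are assumptions, not a description of the authors' program.

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")

def reverse_complement(seq):
    """Return the reverse complement of an uppercase DNA string."""
    return seq.translate(COMPLEMENT)[::-1]

def search_both_directions(seq, find_hits):
    """Run find_hits on the sequence and on its reverse complement.

    find_hits: any function mapping a sequence to 0-based window starts.
    Reverse-strand hits are converted to the 0-based top-strand index of the
    window's first base as read on the reverse strand (an assumed convention).
    """
    forward = find_hits(seq)
    n = len(seq)
    reverse = sorted(n - 1 - start for start in find_hits(reverse_complement(seq)))
    return forward, reverse

def toy_hits(seq, motif="TTGACA"):
    """Toy hit finder: exact occurrences of a single hexamer (illustrative only)."""
    return [i for i in range(len(seq) - len(motif) + 1) if seq.startswith(motif, i)]

example = "AATTGACAGGGGTGTCAATT"
print(search_both_directions(example, toy_hits))  # ([2], [17])
```

In the example, the hexamer occurs once on the given strand (starting at index 2) and once on the opposite strand, where its first base pairs with index 17 of the given strand.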

suggest the existence of several low level promoters in the region between 3300 and 3800 directed in the counterclockwise direction.

Table V shows the result of applying the search to the sequence of λ (Sanger et al., 1982) in each direction. λ is known to have promoters at 37974 (PR) and 44538 (PR') rightward (the coordinate shown is 14 bases upstream of the -35 sequence in each case) and at 29114 (PI), 35631 (PL), 37989 (PRM), 38392 (PRE), and 38725 (PO) leftward, with two little characterized promoters, PBL and PLit, initiating transcripts at 23231 and 36322 (Daniels et al., 1983); there is substantial additional transcription initiation in the B region (Rosenvold et al., 1980; Botchan, 1976; Jones et al., 1977). The left 20,000 bases of λ are not known to contain a promoter in either direction. The program produces a single positive in this region (rightward, 13034) in searching 40,000 bases. The B region produces 13 hits when both directions are summed. The studies of Botchan (1976) and Jones et al. (1977), taken together, suggest the existence of five or more promoters in this region. The hits at 22027 and 28299, in particular, fall quite close to two of the transcription coordinates which Botchan reported. The last 20,000 positions in the right half of λ yield 12 hits, summing both directions, of which five are known promoters. The PI, PLit, PRE, and PB promoters are missed in this search.

TABLE IV
A test of pBR322 sequence under the search protocol
The search protocol was run against pBR322 sequence in both directions. The underlined coordinates indicate the location of known promoters: -5 (tet), 2928 (RNA1), 86 (P1), 3138 (primer), 4239 (bla). In each case the coordinate listed corresponds to a position 14 bases upstream of the -35 sequence.

Clockwise search
Spacing class    No. sequences   pBR322 coordinates
16-bp(a)         0
17-bp            1               -5
18-bp            1               2928

Counterclockwise search
Spacing class    No. sequences   pBR322 coordinates
16-bp            0
17-bp            9               86, 3138, 3308, 3805, 3868, 4129, 4239, 4343, 4355
18-bp            0
(a) bp, base pair.

DISCUSSION

The basic structural features of E. coli promoters as determined in the analysis of the preceding paper are employed here to provide a series of specific tests for locating promoters in DNA sequences under investigation. The success of any such search is dependent on both the accuracy and the completeness of the description serving as the promoter prototype. In the past these efforts may have been inadvertently hindered by the practice of lumping together all promoters recognized by a single RNA polymerase molecule (albeit in different forms served by different activator proteins) in the attempt to determine a consensus sequence. This work suggests distinct consensus sequences for different spacing classes. It also employs distinct consensus sequences for phage promoters and for chromosomal promoters of the same spacing class. This latter distinction was made ad hoc as a result of an information content analysis of the type developed by Schneider et al. (1986), which indicated that the consensus sequences for the two groups differed significantly at one-fourth of all positions (data not shown). The reason for this divergence is not understood at this time, but it nonetheless argues against arbitrarily pooling phage, plasmid, transposon, synthetic, and chromosomal promoters even within a given spacing class.
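For readers unfamiliar with the information content analysis of Schneider et al. (1986) mentioned in the Discussion, the per-position quantity can be sketched as 2 bits minus the Shannon entropy of the base frequencies in each aligned column. The sketch below is a simplification: it omits the small-sample correction that Schneider et al. apply, the function name is hypothetical, and the four aligned hexamers are invented for illustration. The comparison of phage and chromosomal promoters itself is not shown in the paper and is not reproduced here.

```python
import math
from collections import Counter

def information_content(aligned, alphabet="ACGT"):
    """Bits of information at each aligned position: log2(4) minus column entropy.

    Simplified: no small-sample correction is applied.
    """
    bits = []
    for pos in range(len(aligned[0])):
        counts = Counter(seq[pos] for seq in aligned)
        total = sum(counts[b] for b in alphabet)
        entropy = -sum(
            (counts[b] / total) * math.log2(counts[b] / total)
            for b in alphabet if counts[b]
        )
        bits.append(math.log2(len(alphabet)) - entropy)
    return bits

# Invented aligned -35 hexamers, for illustration only.
toy_alignment = ["TTGACA", "TTGACA", "TTTACA", "TTGACT"]
profile = information_content(toy_alignment)
print([round(b, 3) for b in profile])  # [2.0, 2.0, 1.189, 2.0, 2.0, 1.189]
```

A fully conserved position carries 2 bits, and a position with no base preference carries 0 bits, so two groups of promoters can be compared position by position on this scale.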


In the test for bacterial promoters, 41 of 52 are correctly identified; for phage promoters, the success rate drops to 13 of 18. In the search of pBR322, only the cyclic AMP binding protein-dependent P4 promoter is missed. In the search of λ, five of the nine known promoters are found. Of those missed, PI, PRE, PLit, and PB, one could argue that the first three should be missed since they are nonfunctional in in vitro transcription assays (Reichardt, 1975; Shimatake and Rosenberg, 1981; Pirrotta et al., 1980). While the random expectation would be about 40 hits in 40,000 bases, the left side of λ yields only a single hit; in about the same length of sequence to the right of att, only seven hits are found in addition to the known promoters. Thus the search gives much more specific results than would be expected from random sequence considerations; the reason, of course, is that λ is not at all a random sequence (Collins and Coulson, 1984).

Inasmuch as this search protocol appears to be reasonably conservative, one might question the effect of a base substitution corresponding to a severe promoter down-mutation. Would such a change be sufficient to prevent recognition of the promoter in the search? Not necessarily. The result is ambiguous because the test is ambiguous. The definition of severe promoter down-mutations is not universal. The P22 ant promoter mutations (Youderian et al., 1982) provide an illustration of the problem. If those single base substitutions which result in an ant- pseudorevertant were used as a general disqualification rule in the search routine, 11 of 18 wild type phage promoters would be eliminated by this single test. For instance, the occurrence of T rather than C in the fifth position of the -35 sequence is a severe promoter down-mutation in the P22 ant promoter and yet serves as the wild type base in the λPRM promoter.
Since the basis of this differential effect is beyond our current understanding, search discrimination at the single base level is not as yet possible. However, only λPRE, of the promoters under consideration here, has two bases which are severe promoter down-mutations in another promoter. Thus one could employ an additional negative filter which would remove all prospective hit sequences containing two or more of these prohibited bases. Those wild type promoters which normally contain a single base from the promoter down list would then be lost from the search on the subsequent acquisition of a promoter down-substitution.

The consensus for each spacing class may shift somewhat as the database expands, but such changes can be easily incorporated in this general protocol. This general approach performs well and should only improve as more information is obtained.

REFERENCES

Botchan, P. (1976) J. Mol. Biol. 105, 161-176
Collins, J. F., and Coulson, A. F. W. (1984) Nucleic Acids Res. 12, 181-192
Daniels, D. L., Schroeder, J. L., Szybalski, W., Sanger, F., and Blattner, F. R. (1983) in Lambda II (Hendrix, R. W., Roberts, J. W., Stahl, F. W., and Weisberg, R. A., eds) pp. 469-517, Cold Spring Harbor Laboratory, Cold Spring Harbor, NY
Galas, D. J., Eggert, M., and Waterman, M. S. (1985) J. Mol. Biol. 186, 117-128
Hawley, D. K., and McClure, W. R. (1983) Nucleic Acids Res. 11, 2237-2255
Jones, B. B., Chan, H., Rothstein, S., Wells, R. D., and Reznikoff, W. S. (1977) Proc. Natl. Acad. Sci. U. S. A. 74, 4915-4918
Mulligan, M. E., and McClure, W. R. (1986) Nucleic Acids Res. 14, 109-126
Mulligan, M. E., Hawley, D. K., Entriken, R., and McClure, W. R. (1984) Nucleic Acids Res. 12, 789-800
O’Neill, M. C. (1989) J. Biol. Chem. 264, 5522-5530
Peden, K. W. C. (1983) Gene (Amst.) 22, 277-280
Pirrotta, V., Ineichen, K., and Walz, A. (1980) Mol. Gen. Genet. 180, 369-376
Queen, C., and Rosenberg, M. (1981) Nucleic Acids Res. 9, 3365-3377
Reichardt, L. F. (1975) J. Mol. Biol. 93, 289-309
Rosenvold, E. C., Calva, E., Burgess, R. R., and Szybalski, W. (1980) Virology 107, 476-487
Sanger, F., Coulson, A. R., Hong, G. F., Hill, D. F., and Petersen, G. B. (1982) J. Mol. Biol. 162, 729-773
Scherer, G. E. F., Walkinshaw, M. D., and Arnott, S. (1978) Nucleic Acids Res. 5, 3759-3773
Schneider, T. D., Stormo, G. D., Gold, L., and Ehrenfeucht, A. (1986) J. Mol. Biol. 188, 415-431
Shimatake, H., and Rosenberg, M. (1981) Nature 292, 128-132
Stuber, D., and Bujard, H. (1981) Proc. Natl. Acad. Sci. U. S. A. 78, 167-171
Sutcliffe, J. G. (1978) Cold Spring Harbor Symp. Quant. Biol. 43, 77-90
Youderian, P., Bouvier, S., and Susskind, M. M. (1982) Cell 30, 843-853