
3696-3703 Nucleic Acids Research, 1995, Vol. 23, No. 18

© 1995 Oxford University Press

Computer assisted identification and classification of streptomycete promoters

William R. Bourn* and Brendan Babb1

Departments of Microbiology and 1Medical Microbiology, University of Cape Town, Rondebosch 7700, Cape Town, South Africa

Received June 2, 1995; Revised and Accepted August 17, 1995

ABSTRACT

Short sequences that were over represented in a database of Streptomyces promoter region sequences were identified. These sequences, and others that were selected on the basis of the characteristics of known promoters, were tested to determine whether they were found predominantly at particular distances from the transcription start site. In several cases obvious clusters were recorded. This has allowed the objective identification of potential promoter core sequences. In some cases these may define novel promoter classes. 150 Streptomyces promoters have been listed and grouped on this basis. A new and extended consensus sequence for the Streptomyces E.coli σ70-like promoters was determined. It showed differences from that of E.coli, both in sequence and in the spacing between the -35 and -10 regions.

INTRODUCTION

The genetic regulation of the streptomycetes is of interest primarily because this genus produces the majority of the world's antibiotics. This antibiotic production is intimately linked with the complex process of development (1,2). Most bacterial genes are regulated at the level of transcription, often in part by the controlled expression of different RNA polymerase σ factors, which recognize different classes of promoter. This enables the tight regulation of multiple promoters of a particular class and is used for the regulation of genes as diverse as those involved in nitrogen fixation, flagella production, heat shock response and sporulation (3-6). Typically, bacteria have a major σ factor, responsible for the recognition of the majority of the promoters, while other minor σ factors have limited and specialized functions.

A list of 139 Streptomyces promoters has been previously compiled (7). A number of the genes listed are transcribed from multiple promoters and/or have overlapping divergent promoters. Several promoters are found within protein coding sequences.
Approximately 21% of the promoters were designated as E.coli-like Streptomyces promoters, on the basis of an undefined similarity to the promoters recognized by the E.coli σ70 (7,8). No other separate classes of promoters were obvious among those remaining.

Different experimental approaches have proved, or implied, the existence of multiple Streptomyces σ factors (9). The first demonstration utilized S.coelicolor proteins and Bacillus subtilis promoters (10). Two σ factors, σ35 and σ49, were isolated and in vitro these directed RNA polymerase recognition of the veg and ctc promoters respectively. The existence of another S.coelicolor σ factor was demonstrated with the four heterogeneous promoters, dagAp1-4 (11). Three separate transcribing activities for each of dagAp2, dagAp3 and dagAp4 were distinguishable. In vitro transcription proved that dagAp2 was recognized by the novel σ28 (12). The gene encoding σ28 (sigE) has been sequenced and appears to encode a σ factor of a class thought to regulate extracytoplasmic functions (13). Transcription activity co-purification studies utilizing the veg, ctc and dagA promoters indicated that dagAp3 was recognized by σ49 while dagAp4 was transcribed using σ35 (9). Studies on the gal operon have implied that at least two other σ factors, in addition to σ35 and σ49, exist in S.coelicolor. Neither of the two gal promoters was recognized by σ35 or σ49 in vitro and the transcribing activities of the two promoters could be separated (14,15).

To identify the principal S.coelicolor σ factor gene, an oligonucleotide probe consisting of sequence conserved only in the major σ factor genes was used in Southern blots (16). Four homologous regions were identified and subsequently sequenced (17,18). Each of the four loci, termed hrdA-D, encodes a homolog of the major σ factor class. Important residues are conserved in the 2.4 and 4.2 regions of the hrd genes, indicating that the σHrd proteins recognize subtly different promoter sequences (9,19). Transcription of Streptomyces hrd homologues appears to be dependent upon the growth medium or developmental stage (19,20).

* To whom correspondence should be addressed
Mutational studies indicated that hrdA, hrdC and hrdD are non-essential for growth, while hrdB is crucial and encodes the equivalent of σ70 of E.coli (19,21). Only hrdB has been shown to produce a viable σ factor, of 66 kDa. This can direct transcription from the veg and dagAp4 promoters (22). It is unclear whether the 35 kDa σ35 is a different σ factor or a breakdown product of σhrdB. There is no evidence that the other hrd genes are translated. However, the transcribing activities for the veg promoter and the S.lividans XP55 promoter have been separated, although these promoters have a high degree of sequence similarity to the E.coli σ70 type promoter consensus sequence and each other (23). It is presumed that the two promoters are recognized by different homologues of the Hrd proteins.

A further σ factor (σwhiG) is essential to the late stages of Streptomyces sporulation (24). It shows strong sequence similarity with the σ factors that direct the transcription of genes involved in chemotaxis in other bacteria. The similarity is readily apparent in the regions that determine recognition of the -10 and -35 regions of the promoter (24,25). It has also been reported that in S.aureofaciens a different σ factor, encoded by sigF, is essential for spore maturation (26).

There have been several studies involving mutation of Streptomyces promoters followed by expression level tests (27,7 and references therein). In general, the mutation of E.coli-like promoters resulted in the expected change in expression, based on similarity to the E.coli promoter consensus. Notable exceptions include gal-p1 and blaF-p. In the case of gal-p1 the -35 region appears to be different to that of E.coli-like promoters although the -10 region is similar, while blaF-p has no similarity to E.coli-like promoters.

The existence of multiple σ factors, and hence promoter classes, is likely to be fundamental to gene regulation in Streptomyces. However, there have been no formal sequence database searches directed at identifying or classifying the consensus sequences of the different promoter types. The promoter region sequence bias analysis described here is a novel approach to this problem. We have catalogued Streptomyces promoter region sequences and searched them for short sequences that are over represented, on the assumption that these may be the core components of promoter sequences. Promoters are typically found at set positions in relation to the transcription start site. Potential promoter core sequences were therefore tested to determine if they were found predominantly at set distances from the transcription start sites.

MATERIALS AND METHODS

Creation of catalogues of oligonucleotide frequencies for Streptomyces promoters

The following analyses were conducted using computer programs written by ourselves. The programs were written in Borland C++ and ran under MS DOS. Using Streptomyces promoter sequences, the frequency of occurrence of each of the 65 536 different possible overlapping octanucleotides was catalogued. An eight base window was moved, one base at a time, down the sequences and at each position the octanucleotide sequence was determined. A tally of each of the different octanucleotides was generated. For each possible octamer the frequency of occurrence (F) was calculated in normalized form, by dividing the actual number of occurrences by the number of occurrences expected when assuming completely random sequence. The number of occurrences and frequencies of monomers to heptamers were determined from the octamer data.

$$F = \frac{(\text{number of occurrences}) \times 4^{\text{oligonucleotide length}}}{\text{total number of nucleotides in database}}$$
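The sliding-window tally and the normalized frequency F described above can be sketched as follows. This is a minimal Python illustration, not the authors' Borland C++ program. The function names and the toy sequences are hypothetical, and deriving shorter oligomer counts from the leading bases of each octamer is an assumption about how "determined from the octamer data" was carried out.

```python
from collections import Counter


def count_octamers(sequences, k=8):
    """Slide a k-base window one base at a time and tally every k-mer."""
    counts = Counter()
    for seq in sequences:
        seq = seq.upper()
        for start in range(len(seq) - k + 1):
            counts[seq[start:start + k]] += 1
    return counts


def shorter_counts(octamers, length):
    """Assumed derivation: count shorter oligomers from the leading bases of each octamer."""
    counts = Counter()
    for octamer, n in octamers.items():
        counts[octamer[:length]] += n
    return counts


def normalized_frequency(occurrences, length, total):
    """F = occurrences * 4**length / total number of nucleotides counted."""
    return occurrences * 4 ** length / total


# Hypothetical toy sequences, not taken from Table 1.
toy_sequences = ["TTGACATTGACA", "GGTTGACAGG"]
octamers = count_octamers(toy_sequences)
total = sum(octamers.values())  # number of window positions counted
trimers = shorter_counts(octamers, 3)
f_ttg = normalized_frequency(trimers["TTG"], 3, total)
```

Here `total` is the number of window positions, which matches the statement that the database sizes exclude the last seven bases of each sequence and equal the number of octamers counted.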

The fraction of the total number of nucleotides in the database which consisted of each of A, G, C and T residues was calculated (%A, %G, %C and %T respectively). The frequency of each oligomer, corrected for the nucleotide ratio (NCF), was then calculated using the number of times each base was found in the oligonucleotide (XA, XG, XC and XT). An NCF value of 1 therefore signifies that the oligomer occurs at the expected frequency, taking into account the nucleotide bias. Higher or lower values indicate that the oligomer is either over or under represented.

$$\mathrm{NCF} = \frac{F}{(\%A)^{X_A}\,(\%G)^{X_G}\,(\%C)^{X_C}\,(\%T)^{X_T} \times 4^{\text{oligonucleotide length}}}$$

Two different oligonucleotide frequency databases, termed Spro#1 and Spro#2, were created, using Streptomyces promoters that have been identified by transcription mapping (Table 1). Each promoter sequence was analyzed individually, without concatenation. Where there exists a number of different overlapping promoters for a given gene, for example amy-p1, amy-p2 and amy-p3 (Table 1, promoters number 7, 140 and 132), these were entered in the database as a single contiguous sequence. Only one strand of the DNA was entered. Consequently, in the case of overlapping divergent promoters one of them was entered as reverse complementary sequence. Only sequences between positions -53 and +1, as defined in Table 1, were used. No known protein coding sequence was entered. The Spro#1 database consisted of all promoters listed in Table 1 which, as far as we could establish at the time of the creation of the database, fulfilled these criteria. The promoters included in the Spro#2 database fulfilled both these and other criteria (see Results and Discussion). Spro#1 and Spro#2 consisted of 3959 and 1567 bases, entered as 106 and 37 different contiguous DNA sequences respectively. These figures do not include the last seven bases of each sequence entered and are equivalent to the total number of octamers counted.
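The nucleotide-ratio correction can be illustrated with a short Python sketch. It assumes that %A, %G, %C and %T are expressed as fractions between 0 and 1, so that an oligomer occurring exactly as often as the base composition predicts gives NCF = 1. The function names and the example composition are hypothetical.

```python
from collections import Counter


def base_fractions(sequences):
    """Fraction of the database made up of each of A, G, C and T."""
    counts = Counter("".join(sequences).upper())
    total = sum(counts[base] for base in "AGCT")
    return {base: counts[base] / total for base in "AGCT"}


def ncf(f_value, oligomer, fractions):
    """NCF = F / (product of base fractions over the oligomer * 4**length)."""
    expected = 4 ** len(oligomer)
    for base in oligomer.upper():
        expected *= fractions[base]
    return f_value / expected


# Hypothetical G/C-rich composition, for illustration only.
gc_rich = {"A": 0.15, "G": 0.35, "C": 0.35, "T": 0.15}
# A dimer "GC" with F = 16 * 0.35 * 0.35 occurs exactly as often as expected.
value = ncf(1.96, "GC", gc_rich)
```

With a uniform composition (all fractions 0.25) the correction factor is 1 and NCF equals F, which is a quick sanity check on the formula.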

Promoter region sequence bias cluster analysis

A particular oligonucleotide was selected for testing if it had a high NCF value and was thus greatly over represented in the database being examined. The minimum NCF value considered in each case was set such that manageable numbers of over represented oligomers (up to 70) of each type were identified. Other sequences were selected for testing using different criteria (see Results and Discussion). In the search for the former, gaps (positions that could represent any base) were inserted in the oligomer sequence. The length of oligomer examined and the gap pattern used are shown in Table 2. The lengths and patterns used were limited. This was because degenerate sequences that are effectively dimers will contain limited information, while sequences which are effectively pentamers or larger generate unmanageable amounts of output data and are prone to NCF value fluctuation due to the limited size of the database.

A number of Streptomyces promoters were aligned at the point of transcription initiation. Only sequences between and including positions -47 and +1 of the promoters were utilized. Where there were multiple initiation points the nearest to the promoter was considered as position +1. Two groups of promoters were used on separate occasions (see Results and Discussion). Group 1 consisted of all the Streptomyces promoters listed in Table 1 except for endoH-p, orfRP-p, tylF-p, blaF-p, octD-p4, octD-p5,6, vph-p1, vph-p2 and kmr-p (promoter numbers 5, 20, 59, 89, 120, 134, 135, 141 and 142); these were not included as their sequences did not extend as far as position -47 or the data were not available at the time of cluster analysis. Group 2 consisted of the

Table 1. Streptomyces promoters used in cluster analysis

Columns: promoter number; promoter name; included in Spro#1 and Spro#2 (a); sequence of promoter included in Spro#1 and Spro#2 and/or tested for oligomer distribution (b); PPCS types found (c); reference.

a Symbols defining whether the sequence was included in the Spro#1 or Spro#2 database; I, included; P, consists partially of protein coding DNA sequence, only non-coding parts were included; C, consists completely of protein coding DNA sequence and was therefore not included; R, included as reverse sequence; N, not included for other reasons.
b Bold sequence symbols, nucleotides that define PPCS sequence; underlined, potential -35, -10 and extended -10 regions; double underlined, transcription start point.
c Symbols defining PPCS classes; A-H, PPCS Class: A [TANNNT], B [CANNAT], C [TTGAC], D [TNT(N3)A(N3)T, TNNGNNA(N3)T, CNGNNA(N3)T and TGNNA(N3)T], E [GNAAC], F [GNTINC], G [TTG(N20/21)T(N4)T, TGA(N19/20)T(N4)T, TGA(N20/21)A(N3)T, GAC(N18/19)T(N4)T, GAC(N19/20)A(N3)T, TNGA(N19)T(N4)T and TNGA(N20/21)A(N3)T], H [GNAAC(N19/20)T]; ( ), encloses spacing in number of bases between the -35 and -10 regions; ?, PPCS is close to, but not within, the main body of the cluster.

promoters listed in Table 1 numbered from 90 to 150 except for octD-p4, octD-p5,6, vph-p1, vph-p2 and kmr-p. The oligonucleotides selected were tested against either the Group 1 or the Group 2 promoters. Each occasion on which they occurred within any promoter region was recorded. Also recorded was the position of the first base of the oligonucleotide in relation to the transcription start site of the promoter in which it was found. The number of times that an oligomer under consideration occurred at a particular distance (measured in numbers of bases) from the transcription start sites of the promoters was plotted as a bar graph. Putative promoter recognition sequences were aligned and the total number of times each base occurred in each position was recorded.
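The cluster test described above, recording the position of the first base of each match relative to the transcription start site, can be sketched in Python as follows. This is an illustration under stated assumptions rather than the original program: every sequence is assumed to span positions -47 to +1 (48 bases, with no position 0), N is treated as "any base", and the function names and the two toy promoters are hypothetical.

```python
import re
from collections import Counter


def pattern_to_regex(pattern):
    """Turn an oligomer such as TANNNT into a regex; N matches any base."""
    body = "".join("[ACGT]" if ch == "N" else ch for ch in pattern.upper())
    # A lookahead is used so that overlapping occurrences are all reported.
    return re.compile("(?=(" + body + "))")


def index_to_position(index, first_position=-47):
    """Map a 0-based index to promoter numbering, which has no position 0."""
    position = first_position + index
    return position if position < 0 else position + 1


def cluster_histogram(promoters, pattern, first_position=-47):
    """Count pattern occurrences by position of their first base."""
    regex = pattern_to_regex(pattern)
    histogram = Counter()
    for seq in promoters:
        for match in regex.finditer(seq.upper()):
            histogram[index_to_position(match.start(), first_position)] += 1
    return histogram


# Hypothetical 48-base promoters aligned so that the last base is +1.
toy_promoters = [
    "G" * 35 + "TAGGCT" + "G" * 7,
    "C" * 34 + "TACCGT" + "C" * 8,
]
histogram = cluster_histogram(toy_promoters, "TANNNT")
```

Plotting `histogram` as a bar graph of count against position gives the kind of plot described in the text; a pattern that clusters will show most of its counts within a few adjacent positions.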

RESULTS AND DISCUSSION

Promoter region sequence bias cluster analysis

Promoter region sequence bias cluster analysis was conducted on a number of Streptomyces promoters. Initially the selection of the

Table 2. Type and number of oligomer used for promoter region cluster analysis

Columns: oligomer type (a); then, for each of the databases Spro#1 and Spro#2: test (b), minimum NCF value (c) and number tested (d).
Oligomer types listed: NNN; NNNN, N-NN, NN-N; NNNNN, N-NNN, NN-NN, NNN-N, N--NN, NN--N, N-N-N; NNNNNN, N-NN-N, N--NNN, NN--NN, NNN--N, N-N-NN, NN-N-N, N---NN, NN---N, N--N-N, N-N--N, N----N.

a Symbols; N, represents in turn each of A, G, C or T specifically; -, represents any base and is recorded as a space.
b Symbols; T, NCF value of oligomers was examined; U, NCF of oligomers was not examined.
c The minimum NCF value required for an oligomer to be tested by cluster analysis.
d The number of oligomers that had the required NCF value and were tested by cluster analysis.

Table 3. Consensus analysis of PPCS Classes A (A) and C (B)

A. PPCS Class A

'Position' (a)   -19  -18  -17  -16  -15  -14  -13  -12  -11  -10   -9   -8   -7
Base G            16   18   19   18   29   16    0    0   24   20   25    0   21
Base C            23   13   22   15   10   17    0    0   18   17    8    0   18
Base A             7    7    3    3    6   11    0   52    6   12    8    0    3
Base T             6   14    8   16    7    8   52    0    4    3   11   52   10

E.coli consensus (b): T and G at '-16' and '-15'; TATAAT at '-13' to '-8'.
Streptomyces consensus (c): TA(G/C)(G/C/A)(G/T)T at '-13' to '-8'.

B. PPCS Class C

'Position' (a)   -41  -40  -39  -38  -37  -36  -35  -34  -33  -32  -31  -30  -29
Base G             3    3    6    7    5    0    0   15    0    0    6    5    6
Base C             2    4    2    5    8    0    0    0    0   15    3   10    5
Base A             8    4    2    0    1    0    0    0   15    0    4    0    1
Base T             2    4    5    3    1   15   15    0    0    0    2    0    3

E.coli consensus (b): TCTTGACAT at '-38' to '-30'.
Streptomyces consensus (c): TTGAC at '-36' to '-32', with A, G, C and C at '-41', '-38', '-37' and '-30' respectively.

a 'Position' number is based on the positioning of the majority of the PPCS in relation to the transcriptional start site.
b Reference 28.
c Bold sequence symbols define the PPCS class.

oligomers, of up to six bases, to be used for the test was made using the Spro#1 database and NCF data. However, the database was too small to support the use of hexamers without any gaps in the sequence. To overcome this, the oligomers used were such that not every position within their length was assigned as either A, G, C or T. The databases used, the types and numbers of oligomers examined, and the minimum NCF value required for them to be considered for testing are listed in Table 2. Cluster analysis was conducted using the Group 1 promoters unless otherwise specified.

Selected results of the cluster analysis are shown in Figure 1. Many of the oligomers gave clear or highly suggestive results. These sequences have been termed potential promoter core sequence(s) (PPCS) and grouped into Classes A-H. Each Class consisted of either a single PPCS or a number of PPCS which were selected for testing by the same logical process and were similar in sequence.

The most striking result was for the sequence TANNNT (PPCS Class A; Fig. 1, A1). Of the 67 times this sequence occurred, 50 times (74.6%) it was found in the region -14 to -10. The clustering of this sequence could be due to the presence of E.coli σ70-type promoters. However, inspection of the results from

cluster analysis using the sequences TANNGT and TANNAT indicates that the preferred -10 region varies from the E.coli consensus (data not shown). A consensus calculation was performed using all the PPCS Class A sequences that fall between positions -14 and -10 (Table 3A). There are 52 such promoters, including orfRP-p and tylF-p, which were not used in the cluster analysis because of their limited length. It is difficult to determine a consensus sequence for an organism, such as Streptomyces, that has a strong G/C bias because this will tend to obscure the base preference due to the function of the DNA sequence under consideration. However, it would be expected that for positions within the consensus that have no function, G and C residues would be equally represented. A and T residues should likewise be equally represented. Wherever this was not the case it has been assumed that this is due to a base preference required for the protein binding activity of the DNA. The resulting consensus, TA(G/C)(G/C/A)(G/T)T, differs considerably from that of E.coli. Most notable is the fact that T is the least common base at position '-11' while A is least common at position '-9', in direct contrast to the situation in E.coli (inverted commas are used to indicate that the 'position' is as defined for Table 3).

The PPCS Class B is less obviously clustered than Class A (Fig. 1, B1). In this case the sequence CANNAT occurs in the -15 to -12 region 10 times out of a total of 16 (62.5%). It is the absence of this sequence in other regions that is highly suggestive, because it is likely that promoter-like regions will be avoided at positions where they have no function. It could be argued that PPCS Class B merely represents a subset of promoters that are transcribed by the same RNA polymerase that recognizes Class A.
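The consensus calculation described for Table 3 amounts to counting how often each base occurs in each column of the aligned sequences. A minimal Python sketch is given below; the function name and the three aligned hexamers are hypothetical and are not taken from Table 1.

```python
def base_counts_by_column(aligned):
    """Count G, C, A and T in every column of equal-length aligned sequences."""
    length = len(aligned[0])
    if any(len(seq) != length for seq in aligned):
        raise ValueError("aligned sequences must all have the same length")
    table = {base: [0] * length for base in "GCAT"}
    for seq in aligned:
        for column, base in enumerate(seq.upper()):
            table[base][column] += 1  # a non-ACGT symbol raises KeyError
    return table


# Hypothetical aligned TANNNT-type sequences, for illustration only.
toy_alignment = ["TAGGCT", "TACCGT", "TAGAGT"]
counts = base_counts_by_column(toy_alignment)
```

Following the reasoning in the text, a column with no functional constraint would be expected to show roughly equal G and C counts and roughly equal A and T counts, so a marked imbalance within either pair is what suggests a base preference.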
However, it is reasonable to expect that were that the case, a sequence that retained features of Class B, but was closer to the consensus sequence of Class A, would be found clustered in the -10 region. CANNGT is such a sequence, yet in this case (and others, data not shown) no such clustering is

Figure 1. Cluster analysis of oligonucleotides. Number of occurrences of the oligonucleotide is plotted against the position relative to the transcription start site. All plots utilize Group 1 promoters, except for F1 which utilizes Group 2 promoters. Dotted lines, limits of the positions at which it is possible to record oligomers occurring. The left limit occurs because data from position -48 and below were not considered. The right limit is dependent upon the length of the oligomer that was tested. Panel titles include: A1, TANNNT; B1, CANNAT; B2, CANNGT; C1, TTGAC; C2, TTG; D1, CNGNNA; D2, TGNNA; G1, TTG(N20)ANNNT; G3, TTG(N22)ANNNT; G4, TTG(N21)TNNNNT; H2, GNAAC(N20)T.

observed (Fig. 1, B2). Thus the sequence CANNAT may represent the core components of a novel -10 region.

The PPCS Classes A and B appear to be associated with a third class, C, in the -35 region (Fig. 1, C1). This sequence, TTGAC, is found over 5-fold more often than predicted (database Spro#1, NCF value 5.43). Of the 20 sequences of this class, 14 (70%) start in the -38 to -33 region. Furthermore, the sequence is exactly that of the E.coli major σ factor -35 recognition site consensus sequence, except that it is one base shorter (Table 3B). This strongly indicates that the cluster represents the -35 recognition site of Streptomyces E.coli-like promoters. It should also be noted that when the PPCS was found outside the cluster, this was due to overlapping multiple promoters (as was often the case with other PPCS classes, data not shown). Thus, the occurrence of the sequence TTGAC at positions -10 and -18 results from the same sequence being found at different positions in the amy-p3 and amy-p2 promoters, as well as in the amy-p1 promoter at position -36 (Table 1, promoter numbers 132, 140 and 7). The same is true of the brpA-p1 and brpA-p2 promoters at positions -33 and -26 (Table 1, promoter numbers 5 and 103).

Several attempts were made to identify more of these sites using limited subsets of the sequence TTGAC, but these were unsuccessful (data not shown). Although clusters were obvious, it was impossible to distinguish the instances in which the test sequence was likely to be functionally significant. This was due to the frequent occurrence of the same sequence in regions other than the -35 region. An example of this is shown for the sequence TTG (Fig. 1, C2).

A consensus sequence for the 15 PPCS Class C sequences found between positions -38 and -33 (including that of endoH-p, which was not included in the cluster analysis) was generated (Table 3B). Although there are too few of these sequences to be confident that small deviations are meaningful, it should be noted that there appear to be highly conserved A, G, C and C residues at positions '-41', '-38', '-37' and '-30', respectively. This consensus is markedly different from that of E.coli in several features. Most importantly, the A residue at position '-31', which is highly conserved in E.coli, does not appear to be exceptionally common in Streptomyces.

Another class of hexamer (CNGNNA) was seen to be clustered between bases -18 and -15 (Fig. 1, D1). The sequence was often associated with, and overlapped, PPCS Classes A and B. This suggests a functional role for the sequence CNGNNA because it has only one base in common with those two other classes. In addition, the consensus sequence for the PPCS Class A (Table 3A) shows that C and G are predominant in positions '-17' and '-15', respectively. This observation prompted a search for other oligomers clustered upstream of the -10 region, selected on the basis of their similarity to the extended PPCS Class A consensus sequence. Several were found and an example is shown for the sequence TGNNA (Fig. 1, D2). It is interesting to note that in this case there is a second cluster in the -35 region which represents a subset of PPCS Class C. Thus the cluster analyses and the consensus sequence data together suggest that, as in E.coli (28), an extended consensus sequence occurs in the -10 region in Streptomyces.

Further cluster analyses were performed using sequences that were based on the PPCS Class A consensus in the '-18' to '-8' region. Again, some degree of clustering was recorded for a number of oligomers, but only four were free from a high background. These have been termed PPCS Class D, and include the sequences TNTNNNANNNT, TNNGNNANNNT, CNGNNANNNT and TGNNANNNT. Both the least and most obvious clusters of this type are shown (Fig. 1, D3 and D4). It could be argued that the clustering of PPCS Class D simply reflects the fact that the sequence ANNNT itself occurs as a cluster (data not shown). Two lines of evidence suggest that this is not the case. Firstly, it would then follow that the clustering of the sequences TGNNA and CNGNNA arises due to a preponderance of A residues in the -12 region.
This would mean that most sequences that contain an A would be clustered in this region, which is not the case (data not shown). Secondly, if the ANNNT motif of the Class D PPCS is retained intact, while the remaining nucleotides are assigned new positions, no peak is observable after cluster analysis. As an example of this, the results for the sequences TGNNANNNT and GTNNANNNT can be compared (Fig. 1, D4 and D5).

Of the 150 promoters tested, 66 (Table 1, promoter numbers 85-150) did not fall into any of the classes that were defined using the Spro#1 database to select over represented sequences. There are several possible reasons for this. First, it is possible that there are other classes of promoter which were not identified because they are rare and so do not appear as an obvious peak in cluster analysis. Such a problem would be exacerbated by the presence of large numbers of promoters of the major class(es) in the original sequence data entered. Alternatively, it is possible that the oligomers that define more common promoter recognition core sequences are not over represented in the database and were not tested because of this. This could occur if the over representation due to the oligomers' appearance in the protein binding regions is counterbalanced by (not unexpected) under representation in other parts of the promoter. It is also possible that other PPCS were not identified because they are G/C rich and because the frequency of occurrence has been corrected for nucleotide bias. Any G/C rich oligomer would have to be numerically more highly represented than an A/T rich sequence in order to attain a similar NCF value. Finally, it is possible that alternative PPCS classes were not identified because only hexamers were extensively tested. Alternatively, many of the promoters that have not been classified on the basis of cluster analysis are probably members of the same promoter class(es) as those that have been tentatively defined above. In these cases, the -10 and -35 regions might vary too much from the consensus to be identified by the method used here.

In order to address these problems a number of approaches were taken. First, the biochemical data concerning certain promoters were considered in regard to the cluster analysis results. There are a limited number of Streptomyces promoters which are known to be expressed via different transcription factors. These include the promoters for the dagA and gal operons, the XP55 promoter and the σwhiG dependent promoters PTH4 and PTH270 (promoter numbers 21, 33, 76, 88, 43, 124, 1, 107 and 109). As only dagA-p2, gal-p2 and the σwhiG dependent promoters do not contain sequences that can be classified as any of the PPCS Classes A to D, it is possible that these are representatives of minor classes. The consensus sequence of the σwhiG dependent promoters is thought to be similar to that of the motility related promoters of other bacteria or the sequence TAAA(N15)GCCGATA(A/T) (29). In the case of the dagA-p2 -35 region recognition site there is evidence that the consensus may be CCGGAACTT (13). Cluster analysis of all possible subsets of these sequences (triplets to octamers, with and without appropriate blank spaces) was carried out. No clusters were observed using any of the σwhiG target-type sequences. This is not surprising as there has been no widespread effort to isolate and map this type of promoter. Furthermore, the majority of the transcription mapping of the promoters used in this study was performed using logarithmically growing mycelia from liquid medium, in which σwhiG is probably not expressed.
These two factors would ensure that σwhiG dependent promoters are poorly represented among the promoters examined here. In the case of the dagA-p2 -35 region-type sequences, only one, GNAAC, gave any indication of clustering, albeit very weakly (Fig. 1, E1). This sequence has been termed PPCS Class E and its significance is dubious. It is interesting to note, however, that in three cases (hrdD-p1, whiB-p1 and dagA-p2, promoter numbers 86, 87 and 88) there is a correctly spaced TC dimer in the -10 region, as the putative recognition sequence for the σ28 promoter class suggests there should be (13). This observation was used to define a further class of PPCS (see below).

To address the possibility that the presence of many of the major class(es) of promoters in the Spro#1 database was hindering the detection of minor classes, the Spro#2 database was created. This consisted only of promoters that did not contain appropriately positioned PPCS of Classes A-E (Table 1). As with the Spro#1 database, highly over represented sequences in the Spro#2 database were identified (Table 2). These were tested, using triplets to hexamers, against the Group 2 promoters, which do not include any promoters containing appropriately positioned PPCS Classes A-E. Only one sequence, GNTINC, gave results that might be interpreted, with caution, as a cluster (Fig. 1, F1). It is interesting to note, however, that when this sequence (Class F) was tested using the Group 1 promoters, the cluster pattern was almost identical (Fig. 1, F2), indicating that the sequence is very rarely found among the promoters that contain appropriately positioned PPCS Classes A-E.

Further tests were conducted to attempt to isolate promoter-like sequences that resembled the previously identified PPCS classes poorly, but which might still be functional. It was argued that while the -10 and -35 regions alone could not be recognized, together there could be enough sequence information for their identification. As the PPCS Class A and B sequences appear to be associated with those of Class C, subsets of these classes, with a variety of spaces between them, were tested. To first establish the limits of the correct spacing, the sequences TTG and ANNNT (which are highly conserved in E.coli) were used, with the spacing between them set at 20-24 bases. Only in the cases of a 21 and 22 base separation were clusters recorded, indicating that this was the correct spacing. It should be noted that this represents a spacing of 18 and 19 bases between the TTGAC and (T/C)ANNNT and is similar to the spacing found for the E.coli major class of promoter. However, the 17 base spacing found in E.coli was not observed (Fig. 1, G1-G3). With the spacing between the -10 and -35 regions set at either of these two distances, all possible combinations of subsets of TTGAC and (T/C)ANNNT were tested. Thirteen sequences, TTG(N20/21)TNNNT, TGA(N19/20)TNNNNT, TGA(N20/21)ANNNT, GAC(N18/19)TNNNT, GAC(N19/20)ANNNT, TNGA(N20/21)ANNNT and TNGA(N19)TNNNNT, showed some clustering (Class G). The cases where this is the most and least obvious are shown (Fig. 1, G4 and G5).

A similar approach was adopted to test for a -10/-35 association within the PPCS Class E sequences. As noted above, inspection revealed that in some cases there appeared to be an appropriately positioned TC dimer in the -10 region of promoters that fall within this class. For this reason cluster analysis was conducted on the sequences GNAAC(N19)T and GNAAC(N20)T. While similar sequences with lesser or greater spacing between the -10 and -35 regions were tested and did not show clustering (data not shown), when the spacing was either 19 or 20 bases, clusters could be observed (Fig. 1, H1 and H2). These two sequences have been termed PPCS Class H. The clustering that is observed strongly indicates that these sequences represent a distinct class of promoters, and gives support to the contention that the indistinct clustering of PPCS Class E sequences is not coincidental.

The promoters that fall into the various classes are shown in Table 1. 59% of the promoters can be described as members of the Classes A, B, C, D and G, and it seems that these represent E.coli-like promoters. Within this group are found the XP55-p, dagA-p4, dagA-p1, dagA-p3 and gal-p1 promoters (promoter numbers 1, 24, 33, 76 and 43). Also in this group are the B.subtilis veg and ctc promoters (Table 1). It is interesting to note that although the promoters are so similar, there are at least three distinct transcribing activities that recognize different members of this group. Perhaps these consist of (at least in part) the different hrd homologues. Other E.coli-like promoters of note include whiB-p2 and hrdB-p (promoter numbers 9 and 82; see below). In the case of whiB-p2 the spacing between the -35 and -10 regions is very unusual (21 bp). Promoters in the Classes E and H might be of the σ28 class.
This is particularly relevant because these include the promoters of a number of fundamentally important genes including hrdB and hrdD (promoter numbers 82 and 86). In the case of hrdB-p, Class B and H sequences overlap, suggesting that hrdB could be transcribed from either an E.coli-like or a σ28 type promoter, or indeed from both. Also of note is the fact that whiB-p1 (promoter number 87) appears to be a σ28 type promoter. The blaF-p promoter (number 89) contains PPCS Class E and H sequences, although these are not within the cluster but three bases removed from it. However, extensive mutation analysis of this promoter has been performed (27), and comparison of these results with those obtained here strongly suggests that blaF-p is a σ28 type promoter. Both point mutation and deletion analysis are compatible with the results reported here. The gal-p2 promoter (number 124) is an interesting case as it contains both the PPCS Class E and H sequences, but in the wrong position, at -45. Inspection of this promoter reveals that it matches the putative consensus sequence for σ28 (13) in 10 of the 11 positions, with the correct spacing between the -10 and -35 regions. It seems unlikely that such a sequence could occur by chance (especially as it is relatively A/T rich and the database is so small) and that it should have no function.

There are no well studied promoters that fall in the PPCS Classes B and F. It is also difficult to determine the significance, if any, of these classes. It is possible that the sequence bias recorded here is a result of recent gene duplication. However, none of the reports from which the sequence data were obtained describe any extensive homology with other promoters. It follows therefore that any gene duplications included in the data here were either limited or not recent.

Although the study of genetic regulation in Streptomyces is in its infancy, it is already clear that there is a multiplicity of promoter types. This will certainly have bearing not only on the general metabolism of the bacteria, but also on their development and antibiotic production. To date there have been no formal studies, using sequence databases, on what constitutes a Streptomyces promoter.
While it has been possible to recognize those promoters that bear some similarity to the E.coli major class of promoters, the criteria for this are not defined and there has been no way to estimate the probability that such assignments are correct. This cluster analysis was aimed, in part, at addressing this and determining where the differences between the major promoter classes of Streptomyces and E.coli lie. The approach is clearly subjective to a degree, particularly with regard to defining what constitutes a cluster in cases where it is not immediately obvious. However, further statistical analysis must ultimately rely on pre-set conditions, the selection of which is also subjective. The results reported here are therefore not intended to be a definitive classification but should rather serve to direct the researcher's attention to potential promoter core sequences. The use of cluster analysis to identify promoters has great potential, particularly as the list of sequenced and mapped Streptomyces promoters is growing rapidly. This will also have bearing on the genetics of other actinomycetes, and indeed other organisms.
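The spacing test reported above, in which TTG and ANNNT were separated by 20-24 bases and each combination was examined for clustering, can be sketched in the same spirit. The code below is an illustrative assumption and not the software used in this study: the function names and the toy sequences are hypothetical, and the transcription start site is again reported as position 0. Each gapped pattern, such as TTG(N21)ANNNT, is built by padding the gap with N, and the start positions of its matches are then tallied.

```python
from collections import Counter


def gapped(upstream, gap, downstream):
    """Build a pattern such as TTG(N21)ANNNT by padding the gap with N."""
    return upstream + "N" * gap + downstream


def matches(pattern, window):
    """Return True if window fits pattern; 'N' in the pattern matches any base."""
    return len(pattern) == len(window) and all(
        p == "N" or p == b for p, b in zip(pattern, window)
    )


def position_counts(pattern, promoters):
    """Tally match start positions relative to the start site (position 0)."""
    counts = Counter()
    k = len(pattern)
    for seq, tss in promoters:
        for i in range(len(seq) - k + 1):
            if matches(pattern, seq[i:i + k]):
                counts[i - tss] += 1
    return counts


def spacing_scan(upstream, downstream, gaps, promoters):
    """Run the positional tally once for every gap length tested."""
    return {
        gap: position_counts(gapped(upstream, gap, downstream), promoters)
        for gap in gaps
    }


# Toy sequences invented for illustration; each is 39 bases long with the
# start site at index 38. They are not taken from Table 1.
toy_promoters = [
    ("CC" + "TTG" + "C" * 21 + "ACCCT" + "C" * 8, 38),
    ("CCC" + "TTG" + "C" * 21 + "ACCCT" + "C" * 7, 38),
    ("C" + "TTG" + "G" * 22 + "AGGGT" + "C" * 8, 38),
]

scan = spacing_scan("TTG", "ANNNT", range(20, 25), toy_promoters)
hits_per_gap = {gap: sum(counts.values()) for gap, counts in scan.items()}
print(hits_per_gap)
```

In this toy data only the 21 and 22 base gaps produce matches, by construction. Subtracting the two bases that extend TTG to TTGAC and the one base that extends ANNNT to (T/C)ANNNT converts a 21 or 22 base gap into the 18 or 19 base spacing quoted in the text.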

ACKNOWLEDGEMENTS We are indebted to J. A. Thomson and R. Kirby for advice and critical contributions to this work.

REFERENCES
1 Hodgson,D.A. (1992) In S. Mohan, C. Dow and J.A. Cole (eds) Prokaryotic Structure and Function: A New Perspective. Cambridge University Press, Cambridge, pp. 407-440.
2 Chater,K.F. (1993) Annu. Rev. Microbiol., 47, 685-713.
3 Hirschman,J., Wong,P.-K., Keener,SJ. and Kustu,S. (1985) Proc. Natl. Acad. Sci. USA, 82, 7525-7529.
4 Helmann,J.D. (1991) Mol. Microbiol., 5, 2875-2882.
5 Grossman,A.D., Erickson,J.W. and Gross,C.A. (1984) Cell, 38, 383-390.
6 Errington,J. (1993) Microbiol. Rev., 57, 1-33.
7 Strohl,W.R. (1992) Nucleic Acids Res., 20, 961-974.
8 Harley,C.B. and Reynolds,R.P. (1987) Nucleic Acids Res., 15, 2343-2361.
9 Buttner,M.J. (1989) Mol. Microbiol., 3, 1653-1659.
10 Westpheling,J., Ranes,M. and Losick,R. (1985) Nature, 313, 22-27.
11 Buttner,M.J., Fearnley,I.M. and Bibb,M.J. (1987) Mol. Gen. Genet., 209, 101-109.
12 Buttner,M.J., Smith,A.M. and Bibb,M.J. (1989) Cell, 52, 599-607.
13 Lonetto,M.A., Brown,K.L., Rudd,K.E. and Buttner,M.J. (1994) Proc. Natl. Acad. Sci. USA, 91, 7573-7577.
14 Westpheling,J. and Brawner,M. (1989) J. Bacteriol., 171, 1355-1361.
15 Fornwald,J.A., Schmidt,F.J., Adams,C.W., Rosenberg,M. and Brawner,M.E. (1987) Proc. Natl. Acad. Sci. USA, 84, 2130-2134.
16 Tanaka,K., Shiina,T. and Takahashi,H. (1988) Science, 242, 1040-1042.
17 Shiina,T., Tanaka,K. and Takahashi,H. Gene, 107, 145-148.
18 Tanaka,K., Shiina,T. and Takahashi,H. (1991) Mol. Gen. Genet., 229, 334-340.
19 Buttner,M.J., Chater,K.F. and Bibb,M.J. (1990) J. Bacteriol., 172, 3367-3378.
20 Kormanec,J. and Farkosovsky,M. (1993) Nucleic Acids Res., 21, 3647-3652.
21 Buttner,M.J. and Lewis,C.G. (1992) J. Bacteriol., 174, 5165-5167.
22 Brown,K.L., Wood,S. and Buttner,M.J. (1992) Mol. Microbiol., 6, 1133-1139.
23 Westpheling,J. and Brawner,M. Reported in reference 9.
24 Chater,K.F., Bruton,C.J., Plaskitt,K.A., Buttner,M.J., Mendez,C. and Helmann,D. (1989) Cell, 59, 133-143.
25 Chater,K.F. (1989) Trends Genet., 5, 372-377.
26 Potuckova,L., Kelemen,G., Bibb,M.J. and Kormanec,J. Reported in reference 13.
27 Forsman,M. and Granstrom,M. (1992) Gene, 121, 87-94.
28 Hawley,D.K. and McClure,W. (1983) Nucleic Acids Res., 11, 2237-2255.
29 Tan,H. and Chater,K.F. (1993) J. Bacteriol., 175, 933-940.
30 Agnell,S., Schwarz,E. and Bibb,M.J. (1992) Mol. Microbiol., 6, 2833-2844.
31 Geistlich,M., Losick,R., Turner,J.R. and Rao,R.N. (1992) Mol. Microbiol., 6, 2019-2029.
32 Guilfoile,P.G. and Hutchinson,C.R. (1992) J. Bacteriol., 174, 3651-3658.
33 Gunter,K., Toupet,C. and Schupp,T. (1992) J. Bacteriol., 175, 3295-3302.
34 Ishikawa,J. and Hotta,K. (1991) Gene, 108, 127-132.
35 Ogawara,H., Kasama,H., Nashimoto,K., Ohtsubo,M., Higashi,K. and Urabe,H. (1993) Gene, 125, 91-96.
36 Paradakar,A.S., Petrich,A.K., Leskiw,B.K., Aidoo,K.A. and Jensen,S.E. (1994) Gene, 144, 31-36.
37 Perez,C., Juarez,K., Garcia-Castells,E., Soberon,G. and Servin-Gonzalez,L. (1993) Gene, 123, 109-114.
38 Soliveri,J., Brown,K., Buttner,M.J. and Chater,K.F. (1992) J. Bacteriol., 174, 6215-6220.
39 Vujaklija,D., Horinouchi,S. and Beppu,T. (1993) J. Bacteriol., 175, 2652-2661.
40 Zakarzewska-Czerwinska,J., Nardmann,J. and Schrempf,H. (1994) Mol. Gen. Genet., 242, 440-447.
41 Moran,C.P., Lang,N. and Losick,R. (1981) Nucleic Acids Res., 22, 5979-5990.