© 1993 Oxford University Press
Nucleic Acids Research, 1993, Vol. 21, No. 16 3875-3884
Comparative DNA sequence features in two long Escherichia coli contigs
Lon R.Cardon, Chris Burge, Gabriel A.Schachtel, B.Edwin Blaisdell and Samuel Karlin*
Department of Mathematics, Stanford University, Stanford, CA 94305, USA
Received February 18, 1993; Revised and Accepted May 20, 1993
ABSTRACT The recent sequencing of two relatively long (approximately 100 kb) contigs of E.coli presents unique opportunities for investigating heterogeneity and genomic organization of the E.coli chromosome. We have evaluated a number of common and contrasting sequence features in the two new contigs with comparisons to all available E.coli sequences (> 1.6 Mb). Our analyses include assessments of: (i) counts and distributions of restriction sites, special oligonucleotides (e.g., Chi sites, Dam and Dcm methylase targets), and other marker arrays; (ii) significant distant and close direct and inverted repeat sequences; (iii) sequence similarities between the long contigs and other E.coli sequences; (iv) characterization and identification of rare and frequent oligonucleotides; (v) compositional biases in short oligonucleotides; and (vi) position-dependent fluctuations in sequence composition. The two contigs reveal a number of distinctive features, including: a cluster of five repeat/dyad elements with very regular spacings resembling a transcription attenuator in one of the contigs; REP elements, ERICs, and other long repeats; distinction of the Chi sequence as the most frequent oligonucleotide; regions of clustering, overdispersion, and regularity of certain restriction sites and short palindromes; and comparative domains of inhomogeneities in the two long contigs. These and other features are discussed in relation to the organization of the E.coli chromosome.
* To whom correspondence should be addressed

INTRODUCTION
Although more than 35% of the E. coli chromosome has been sequenced to date, most of the 417 sequences in the E. coli contig collection (23, 37, 38) are of short length (EcoSeq2 database of aggregate length > 1.6 Mb, median = 2.5 kb, largest = 32 kb). In the last year, two moderately long (approximately 100 kb) E. coli genomic contigs extending approximately 60% to 70% previously sequenced regions have been reported (10, 45) and the following sequence features have been noted. The sequence ECOUW85U (about 91 kb (10); abbreviated henceforth UW85) lies 25 kb clockwise of the origin of replication, between the ribosomal operons rrnC and rrnA (84.5-86.5 minutes on the E. coli chromosome). The other sequence, ECOMORI (about 111 kb (45); abbreviated MORI), covers 0-2.4 minutes on the chromosome. Both contigs are quite dense in coding sequences (77% in MORI; 84% in UW85). MORI contains two IS elements, while UW85 contains eight tRNA genes, two rRNA sequences, and three 'grey holes' (lengths 0.6-0.8 kb) with no recognized ORFs or function. The longer contigs permit extensive analysis of sequence heterogeneity and genomic organization in E. coli and reduce the biases inherent in short contiguous sequences toward fewer marker counts, closer marker spacings, and diminished lengths and copy numbers of relatively close repetitive sequences. The objectives of this paper are twofold: (i) to advance the statistical methodology for assessing and classifying inhomogeneities in long DNA sequences (15); (ii) to exemplify the methods with respect to the two large E. coli contigs at hand and delimit potentially interesting regions in these sequences. Along these lines we ask: How representative of available E. coli sequences are these large contigs with respect to various sequence features, including specific marker arrays, sequence composition, close and distant repeat structures, and spacings of special oligonucleotides including Dam, Dcm, and Chi sites? Is there an isochore phenomenon (compartmentalization) in the E. coli genome akin to that extant in mammalian species? Is the distribution of the restriction sites that generate the Kohara physical map (20) consistent with their distribution in the UW85 and MORI sequences? What is the nature of similarities and differences in sequence features between these contigs, EcoSeq2, temperate coliphage, and other genomic sequences of diverse organisms?
METHODS
1. Compositional biases of short oligonucleotides
We ascertain extreme over- and under-relative abundances of di-, tri-, and tetranucleotides (cf. (6)). For dinucleotides, it is standard to assess bias through the classical odds ratio measure $\rho_{XY} = f_{XY}/(f_X f_Y)$, where $f_{XY}$ is the frequency of the dinucleotide XY and $f_X$, $f_Y$ are mononucleotide frequencies. Values of $\rho_{XY}$ sufficiently greater or less than one indicate significant deviation (over- or under-representation, respectively) from a random association of mononucleotides (3). These evaluations are appropriate for a single sequence, but in comparing sequences from different organisms (or of unknown orientation or from different chromosomes), the formulas must be modified to account for the complementary antiparallel structure of double-stranded DNA. In these situations we use the symmetrized dinucleotide representation formula $\rho^*_{XY} = f^*_{XY}/(f^*_X f^*_Y)$, where $f^*_X = (f_X + f_{X^i})/2$ and $f^*_{XY} = (f_{XY} + f_{(XY)^i})/2$ are strand-symmetric frequencies, and $X^i$ and $(XY)^i$ refer to the inverted complement of X and XY, respectively. The formula for trinucleotides is $\gamma^*_{XYZ} = f^*_{XYZ} f^*_X f^*_Y f^*_Z / (f^*_{XY} f^*_{XNZ} f^*_{YZ})$ (6). The strand-symmetric functionals may be effectively used to define the dinucleotide 'distance' between sequences f and g,

$$\delta^*(f, g) = \sum_{i,j} w_{ij} \left| \log\frac{f^*_{ij}}{f^*_i f^*_j} - \log\frac{g^*_{ij}}{g^*_i g^*_j} \right| \qquad (1)$$

where the sum extends over all dinucleotides and each weight $w_{ij} = 1/16$ (other natural weights may also be employed; for [...]). $\delta^*(f, g)$ is essentially a comparison of the second-order deviations from one, $f^*_{ij}/(f^*_i f^*_j) - 1$ versus $g^*_{ij}/(g^*_i g^*_j) - 1$ (4, 19).
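The strand-symmetric odds ratios and the distance of Eq. (1) can be sketched in a few lines of Python. This is a minimal illustration, not the authors' software: the function names are hypothetical, the weights are fixed at 1/16, the natural logarithm is assumed, and both sequences are assumed to contain every dinucleotide (otherwise the logarithm is undefined).

```python
import math
from collections import Counter
from itertools import product

BASES = "ACGT"
_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def inverted_complement(seq):
    """Reverse complement of an upper-case DNA string."""
    return seq.translate(_COMPLEMENT)[::-1]


def symmetric_frequencies(seq, k):
    """Strand-symmetric k-mer frequencies: words are counted on the given
    strand and on its inverted complement, then normalised together."""
    seq = seq.upper()
    counts = Counter()
    for strand in (seq, inverted_complement(seq)):
        for i in range(len(strand) - k + 1):
            word = strand[i:i + k]
            if set(word) <= set(BASES):  # skip ambiguous bases such as N
                counts[word] += 1
    total = sum(counts.values())
    return {word: n / total for word, n in counts.items()}


def symmetric_odds_ratios(seq):
    """Symmetrized dinucleotide odds ratios f*_XY / (f*_X f*_Y)."""
    f1 = symmetric_frequencies(seq, 1)
    f2 = symmetric_frequencies(seq, 2)
    return {x + y: f2.get(x + y, 0.0) / (f1[x] * f1[y])
            for x, y in product(BASES, repeat=2)}


def dinucleotide_distance(seq_f, seq_g):
    """Average absolute difference of log odds ratios (weights 1/16)."""
    rf, rg = symmetric_odds_ratios(seq_f), symmetric_odds_ratios(seq_g)
    return sum(abs(math.log(rf[d]) - math.log(rg[d])) for d in rf) / 16
```

Counting words on both strands makes the result independent of which strand was sequenced, so a sequence and its inverted complement are at distance zero.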
2. Counts and spacings of marker arrays
For many marker points such as specific restriction sites, Dam (GATC) and Dcm (CCWGG) methylase targets, Chi sequences, and others, it is of interest to investigate anomalies in their distributions. For this objective we employ r-scan statistics in order to assess significant clustering, overdispersion, or excessive evenness in marker spacings. For large numbers, n, of markers distributed randomly on a unit interval, we use asymptotic r-scan probability formulas applicable to $m^*_r$ ($M^*_r$), the smallest (largest) of the cumulative lengths of r contiguous marker spacings. The formulas are

$$\mathrm{Prob}\{m^*_r > x/n^{1+1/r}\} \approx \exp(-x^r/r!)$$

and

$$\mathrm{Prob}\{M^*_r < n^{-1}[\ln n + (r-1)\ln(\ln n) + x]\} \approx \exp[-e^{-x}/(r-1)!],$$

cf. (16). For a given value of r, these distributions can be used to test whether $m^*_r$ is too small (a cluster of markers) or too large (indicating evenness in the sense of no small spacings), and similarly whether $M^*_r$ is too small (indicating evenness in the sense of no large separations) or too large (indicating overdispersion). Varying the parameter r (e.g., r = 1, 3, 5, 7, 10) provides sensitivity for detection of inhomogeneities on different scales.
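The two asymptotic formulas translate directly into code. The sketch below is illustrative only: the function names are hypothetical, marker positions are rescaled to the unit interval, and the two end gaps are counted as spacings (an assumption; the exact convention is given in (16)).

```python
import math


def r_scan_extremes(positions, length, r):
    """Smallest and largest sums of r consecutive spacings, with marker
    positions rescaled to the unit interval (end gaps are included)."""
    points = [0.0] + sorted(p / length for p in positions) + [1.0]
    gaps = [b - a for a, b in zip(points, points[1:])]
    scans = [sum(gaps[i:i + r]) for i in range(len(gaps) - r + 1)]
    return min(scans), max(scans)


def prob_min_exceeds(m, n, r):
    """Asymptotic Prob{m*_r > m} = exp(-x^r / r!), where m = x / n^(1 + 1/r)."""
    x = m * n ** (1 + 1 / r)
    return math.exp(-(x ** r) / math.factorial(r))


def prob_max_below(big_m, n, r):
    """Asymptotic Prob{M*_r < M} = exp(-e^(-x) / (r-1)!),
    where M = [ln n + (r-1) ln(ln n) + x] / n."""
    x = n * big_m - math.log(n) - (r - 1) * math.log(math.log(n))
    return math.exp(-math.exp(-x) / math.factorial(r - 1))
```

Under these formulas, a cluster would be flagged when `1 - prob_min_exceeds(m_observed, n, r)` is small, and an overdispersion when `1 - prob_max_below(M_observed, n, r)` is small; excessive evenness corresponds to the complementary tails.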
3. Position-dependent fluctuations in sequence composition
For the UW85 and MORI contigs, sliding window plots were constructed by dividing the sequence into successive segments of length w (e.g., w = 500 or 1000 bp) with displacement v (v = 250 or 500 bp, respectively) and cumulating the count of a marker or base type or some other DNA sequence feature in each segment. The aggregate count per window is then plotted in the clockwise direction. Statistical criteria for determination of significant peaks are presented in (15). We employed these procedures for the counts of strong - weak [S-W: (G + C) - (A + T)] bases and purine - pyrimidine [R-Y: (A + G) - (C + T)] alphabets, and for counts of close repeats and close dyads (see Figure 2).
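The sliding window counts described in this part can be sketched as follows. The function name is hypothetical; the defaults w = 500 and v = 250 are the values quoted in the text, and positions are 0-based.

```python
def sliding_window_counts(seq, w=500, v=250):
    """Return (start, S-W, R-Y) for windows of length w stepped by v,
    where S-W = (G + C) - (A + T) and R-Y = (A + G) - (C + T)."""
    seq = seq.upper()
    rows = []
    for start in range(0, len(seq) - w + 1, v):
        window = seq[start:start + w]
        g, c = window.count("G"), window.count("C")
        a, t = window.count("A"), window.count("T")
        rows.append((start, (g + c) - (a + t), (a + g) - (c + t)))
    return rows
```

For example, `sliding_window_counts("GGGGAAAA", w=4, v=2)` returns `[(0, 4, 4), (2, 0, 4), (4, -4, 4)]`. Collecting the second or third field of every row into a histogram gives the segmental distribution of the same counts.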
4. Segmental quantile distributions
Sequence composition may also be evaluated globally by means of segmental quantile distributions, which complement the position-dependent sliding window plots. They are constructed analogously to sliding window plots, by dividing the sequence into successive segments of length w with displacement v and using the aggregate count in each segment as the sampled variable, and are summarized as histograms (16); see Figure 1. We examined the counts and spacings of all close repeats (CR) and close dyads (CD) with a minimal potential stem length (s > 8 bp) and maximal loop length (l ≤ 50 or < 150 bp). The location of each CR and CD was assigned to the 5' end of the repeat pair.

5. Direct and inverted repeats
We employ the algorithm of Leung et al. (26) and statistical significance criteria therein (17) to identify significantly long common words (p ≤ .01) allowing for a few intermittent short errors. We use this algorithm to locate repeats within and between UW85 and MORI.
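The close repeat (CR) and close dyad (CD) definitions of part 4 can be illustrated with a simple exact-match search. This is a sketch under stated assumptions, not the procedure used in the paper: only perfect copies of a fixed word length s are found, the loop is measured from the end of the first copy to the start of the second, and the function name is hypothetical. It is also not the algorithm of Leung et al. (26), which tolerates mismatches.

```python
_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def inverted_complement(seq):
    """Reverse complement of an upper-case DNA string."""
    return seq.translate(_COMPLEMENT)[::-1]


def close_repeats_and_dyads(seq, s=8, max_loop=50):
    """Return (repeats, dyads) as lists of 0-based (first, second) start
    positions of exact s-mers whose second copy (direct or inverted)
    begins at most max_loop bases after the first copy ends."""
    seq = seq.upper()
    starts = {}
    for i in range(len(seq) - s + 1):
        starts.setdefault(seq[i:i + s], []).append(i)
    repeats, dyads = [], []
    for i in range(len(seq) - s + 1):
        word = seq[i:i + s]
        nearest, farthest = i + s, i + s + max_loop
        repeats += [(i, j) for j in starts[word] if nearest <= j <= farthest]
        dyads += [(i, j) for j in starts.get(inverted_complement(word), [])
                  if nearest <= j <= farthest]
    return repeats, dyads
```

Each pair is reported at the position of its first (5') copy, matching the convention stated in the text; the resulting positions can be fed to the sliding window and r-scan procedures.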
6. Rare and frequent oligonucleotides
Determination of exceptional oligonucleotides (e.g., rare or frequent oligonucleotides) within a sequence provides another perspective on genomic heterogeneity. Consider a sequence of length N with alphabet size a (a = 4 for DNA). A frequent oligonucleotide (word) is defined as an oligonucleotide of length s, where s satisfies $s - 1 < \ln N/\ln a < s$, with at least r copies for r satisfying $(r - 1)/r < \ln N/\ln a^s < r/(r + 1)$. The rationale for this definition is elaborated in (17). Rare oligonucleotides are defined to be of size $\sigma$, where $\sigma$ satisfies $m \ln m < N \le a m \ln(a m)$, $m = a^\sigma$, having copy number at most Q, where Q satisfies $m[\ln m + Q \ln(\ln m)] < N < m[\ln m + (Q + 1)\ln(\ln m)]$. In the UW85 sequence, where N = 91,408, the relevant threshold parameters are $\sigma$ = 6 and Q = 6 for rare words and s = 9 and r = 11 for frequent words. The MORI sequence, with N = 111,402 bp, has corresponding parameters $\sigma$ = 6, Q = 8 for rare words and s = 9, r = 14 for frequent words.
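As a check on these definitions, the thresholds can be computed from N alone. The sketch below uses hypothetical function names and assumes the generic case in which none of the inequalities holds with equality; it reproduces the parameter values quoted above for both contigs.

```python
import math


def frequent_word_thresholds(n, a=4):
    """Word length s and minimum copy number r for frequent words."""
    ratio = math.log(n) / math.log(a)
    s = math.ceil(ratio)              # s - 1 < ln N / ln a < s
    q = ratio / s                     # ln N / ln(a^s)
    r = math.ceil(q / (1 - q))        # (r - 1)/r < q < r/(r + 1)
    return s, r


def rare_word_thresholds(n, a=4):
    """Word size sigma and maximum copy number Q for rare words."""
    sigma = 1
    # smallest sigma with N <= a*m*ln(a*m), where m = a**sigma
    while a ** (sigma + 1) * math.log(a ** (sigma + 1)) < n:
        sigma += 1
    m = a ** sigma
    value = (n / m - math.log(m)) / math.log(math.log(m))
    return sigma, math.ceil(value) - 1   # Q < value < Q + 1


print(frequent_word_thresholds(91408), rare_word_thresholds(91408))    # (9, 11) (6, 6)
print(frequent_word_thresholds(111402), rare_word_thresholds(111402))  # (9, 14) (6, 8)
```

For UW85, ln N / ln 4 is about 8.24, so s = 9 and ln N / ln 4^9 is about 0.916, which lies between 10/11 and 11/12, giving r = 11.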
RESULTS
Over- and under-representation of short oligonucleotides
Table 1 lists di-, tri-, and tetranucleotides which are significantly over- or under-represented in the UW85 and MORI sequences and in the EcoSeq2 database. The dinucleotide TA is significantly under-represented in all sequences, as has been documented in most organisms examined, e.g., (6, 21, 32). On the other hand, GC and to a lesser extent AA/TT dinucleotides are over-represented in E. coli and in most bacterial genomes. Over- and under-representations of dinucleotides with gaps are not apparent (data not shown), in contrast to eukaryotic sequences where period-three homodinucleotides GNnG, CNnC, for n = 2, 5, 8, ... tend to be over-represented (12). The trinucleotide of least relative abundance (most under-represented) in the E. coli contigs is the triplet CTA/TAG, a rather persistent outcome in pro- and eukaryotic sequences (6). The most over-represented and most frequent trinucleotides are CCA/TGG and CTG/CAG in E. coli, also in accord with many pro- and eukaryotic sequences (see Discussion). The tetranucleotide CTAG is drastically under-represented in the two E. coli contigs, as widely observed in bacterial sequences (1, 6, 28, 29), as well as in many eukaryotic sequences (e.g., N. crassa, X. laevis, chicken, rabbit). There are two other tetranucleotides with extreme relative abundances in UW85 and
Table 1. Over- and under-representation of oligonucleotides in E. coli contigs and comparison sequences

Organism(a)      MORI         UW85U        EcoSeq2      S. typh.     λ            Mu           P1

ρ* representation values (frequencies)
TA               0.72 (4.07)  0.74 (4.16)  0.74 (4.34)  0.81 (4.71)  0.71 (4.48)  0.70 (4.85)  0.80 (6.98)
GC               1.28 (8.83)  1.29 (8.85)  1.26 (8.36)  1.28 (5.98)  1.20 (7.45)  1.26 (7.06)  1.32 (5.48)
AA/TT            1.22 (6.83)  1.21 (6.81)  1.21 (7.11)  1.22 (7.08)  1.15 (7.26)  1.28 (8.90)  1.22 (10.68)

γ* representation values (frequencies)
CTA/TAG          0.67 (0.55)  0.68 (0.57)  0.68 (0.57)  0.73 (0.64)  0.68 (0.52)  0.76 (0.56)  0.90 (1.07)
ACA/TGT          0.77 (1.15)  0.77 (1.14)  0.80 (1.23)  0.81 (1.14)  0.83 (1.42)  0.90 (1.49)  0.91 (1.60)
CCA/TGG          1.29 (1.96)  1.33 (1.97)  1.29 (1.84)  1.31 (1.70)  1.25 (1.66)  1.21 (1.53)  1.14 (1.49)
CTG/CAG          1.21 (2.30)  1.22 (2.28)  1.21 (2.28)  1.25 (2.15)  1.10 (2.37)  1.22 (2.22)  1.08 (1.42)

τ* representation values (frequencies)
CTAG             0.26 (0.02)  0.18 (0.01)  0.26 (0.02)  0.36 (0.03)  0.52 (0.03)  0.97 (0.08)  0.66 (0.12)
CAAG/CTTG        0.68 (0.19)  0.71 (0.20)  0.73 (0.21)  0.73 (0.20)  0.92 (0.23)  0.87 (0.27)  1.03 (0.43)
CTAC/GTAG        1.24 (0.21)  1.30 (0.23)  1.19 (0.21)  1.15 (0.21)  0.93 (0.13)  0.93 (0.11)  0.89 (0.18)

(a) Aggregate lengths (total G + C content) are as follows: MORI: 111,402 bp (53%), UW85U: 91,408 bp (52%), EcoSeq2: 1,683,651 bp (52%), S. typhimurium: 263,046 bp (52%), phage λ: 48,502 bp (50%), phage Mu: 17,569 bp (48%), phage P1: 26,672 bp (41%). Oligonucleotide frequencies are shown in parentheses.

Table 2. Dinucleotide representational distances between E. coli contigs, the EcoSeq2 database, S. typhimurium, bacteriophage, and yeast sequences

      MORI   UW     Eco    S.ty   λ      Mu     P1     T4     T7     Y1     Y2     Y3
MORI  -      0.007  0.013  0.032  0.046  0.038  0.060  0.108  0.182  0.129  0.128  0.125
UW           -      0.014  0.032  0.046  0.041  0.056  0.105  0.179  0.127  0.125  0.123
Eco                 -      0.033  0.047  0.043  0.058  0.100  0.169  0.118  0.111  0.117
S.ty                       -      0.067  0.064  0.062  0.096  0.180  0.143  0.126  0.140
λ                                 -      0.056  0.060  0.071  0.156  0.093  0.091  0.089
Mu                                       -      0.044  0.098  0.191  0.122  0.118  0.118
P1                                              -      0.067  0.169  0.100  0.084  0.098
T4                                                     -      0.118  0.054  0.044  0.050
T7                                                            -      0.106  0.106  0.111
Y1                                                                   -      0.032  0.010
Y2                                                                          -      0.029
Y3                                                                                 -

Sequence abbreviations: MORI, ECOMORI; UW, ECOUW85U; Eco, EcoSeq2; S.ty, S. typhimurium; T4, phage T4 (50% of genome, G + C = 35.66%); T7, phage T7 (complete genome, G + C = 48.40%); Y1, yeast CIII section 1; Y2, yeast CIII section 2; Y3, yeast CIII section 3. See Table 1 for details on S. typhimurium and phage λ, P1, and Mu.
MORI, GTAG/CTAC and CAAG/CTTG, each differing by a single transversion from CTAG. Enigmatically, GTAG/CTAC is significantly over-represented whereas CAAG/CTTG is significantly under-represented.

Distances between sequences
Dinucleotide representational distances (Eq. (1), Methods, part 1) between the two E. coli contigs and the EcoSeq2 database, various phage, and S. typhimurium sequences are given in Table 2. The very small MORI-UW85 distance of 0.007 is approximately one-half the distance between either contig and the full EcoSeq2 dataset. Distances from the temperate phages λ, Mu, and P1 are much larger, and those from the lytic phages T4 and T7 are larger still. It is curious that the long contigs are so similar to each other, while other E. coli sequences are more distant by a factor of at least two.

Counts and distributions of special oligonucleotides
There are 24 and 26 occurrences, respectively, of the Chi sequence (GCTGGTGG or its dyad) in UW85 and MORI. In each of the UW85 and MORI contigs, 22 are of the canonical Chi form in the direction of replication, but merely 2 and 4, respectively, are in the dyad form, i.e., in the opposite direction. In each contig, 23 of the Chi sequences lie in coding segments. Of the Chi sites in coding regions, 22/23 read in the direction of transcription in UW85, and 19/23 are similarly oriented in MORI. The relationships between Chi site orientation and replication and transcription direction are discussed in (7).
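Counting a motif separately in its canonical and dyad (inverted complement) orientation, as done here for Chi, can be sketched as below. The function name is hypothetical, and the sketch counts on the given strand only; relating the two orientations to the direction of replication or transcription requires annotation that is not modeled here.

```python
_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def inverted_complement(seq):
    """Reverse complement of an upper-case DNA string."""
    return seq.translate(_COMPLEMENT)[::-1]


def count_oriented_sites(seq, motif="GCTGGTGG"):
    """Return (canonical, dyad) counts of motif on the given strand.
    Overlapping occurrences are counted; for a palindromic motif the
    two counts coincide."""
    seq, motif = seq.upper(), motif.upper()
    dyad = inverted_complement(motif)   # CCACCAGC for Chi
    k = len(motif)
    words = [seq[i:i + k] for i in range(len(seq) - k + 1)]
    return words.count(motif), words.count(dyad)
```

For instance, `count_oriented_sites("AAGCTGGTGGTTCCACCAGCAAGCTGGTGG")` returns `(2, 1)`.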
Evaluation of the distribution of Chi sites using r-scans revealed no anomalous spacings in either contig. Similar assessments (Table 3) were undertaken for Dam (GATC) and Dcm (CCWGG) methylase sites. The counts and spacings in these cases also appear compatible with a random distribution.
Distributions of certain 6-cutter sites
Analysis of the Kohara restriction site distributions by r-scan statistics highlighted several regions in the two long contigs (Table 3). There is an overdispersion of KpnI sites in UW85, with all 4 sites occurring in the first 20 kb and none in the last 70 kb of the sequence. MORI shows a similar gap for HindIII sites, starting at position 20 kb and extending 62 kb with no occurrences. Distributions of BamHI and PvuII are significantly even in both contigs (minimum spacing too large). Regularity of BamHI sites has been noted previously in the Kohara map (9, 18).

Mononucleotide composition
Figure 1 shows histograms of segmental quantile distributions of S-W and R-Y counts generated from sliding windows (see Methods, part 4). Corresponding sliding window plots for S-W are presented in Figure 2. In both UW85 and MORI the R-Y plots are essentially unimodal with a mean near 0, while the S-W histograms are negatively skewed. The S-W and R-Y distributions appear quite homogeneous throughout the E. coli chromosome, as evidenced in the histograms for the entire EcoSeq2 database shown in the lower panels of Figure 1.
[Figure 1: six histogram panels, "Segmental S-W" and "Segmental R-Y" for ECOMORI, ECOUW85U, and EcoSeq2; horizontal axes "S-W Count (.5 kb window)" and "R-Y Count (.5 kb window)", ranging from -200 to 200.]

Figure 1. Histograms of S-W [(C + G) - (A + T)] and R-Y [(A + G) - (C + T)] counts in sliding 500 bp windows with 250 bp displacement. Left panels show S-W distributions for ECOMORI, ECOUW85U, and the EcoSeq2 database; right panels show corresponding R-Y distributions.
Local and global repeats and dyads
Apart from standard repetitive sequences including tRNA genes, rRNA units and IS families, sequence analysis of E. coli in recent years has uncovered a number of statistically significant repeat sequences (length 20-50 bp) (5, 14, 24, 40). Prominent are REP (repetitive extragenic palindromic) sequences, ERIC (enterobacterial repetitive intergenic consensus) sequences, and 7 significant structural and/or functional groupings identified by Blaisdell et al. (5). We have searched the large contigs for these established sequences to further elucidate the nature of these and other repeats in the two contigs.

REP elements. A REP unit (sequence) consists of the consensus sequence A = GCCG/TGATGCGG/ACGC/T separated by at most four bp from B = G/ACGC/TCTTATCC/AGGC, designated AB. REP units may also have the form B'A', where B' and A' are the inverted complements of B and A, respectively (11, 13). A REP element consists of at least two REP units, with successive copies separated by 5 to 50 bp and in inverted orientation. We confirmed 5 REP elements in the UW85 contig noted by Daniels et al. (10). One of these, located between genes ubiB and fadA, has the elaborate structure ABCB'A'DABCB'A'DABCB'A', in which C and D are conserved spacers of length 22 bp and 14 bp, respectively (Table 5). The spacer conservation suggests that this structure may have resulted from a recent duplication event. In the MORI contig we located 7 REPs, of which 5 lie in previously sequenced regions (5), and the remaining two are located in newly sequenced non-coding regions at positions 38809 bp (pattern ABB'A') and 71814 bp (defective pattern BB'A'); see Table 5.
Rho-independent terminators. Rho-independent transcription terminators have been described as consisting of a predominantly C + G close dyad and a 3' proximal T-rich segment (36). A subset of these are preceded by an A-rich segment in dyad relationship with the T part. This subclass can function as a bidirectional terminator between two close convergent genes (34). A number of members of this subclass have been found in EcoSeq2 (5). Examination of all close 8-oligonucleotide dyads (loop < 10 bp) in the MORI and UW85 contigs revealed several potential bidirectional rho-independent terminators (Table 4). All of these are in noncoding regions and closely follow a gene (< 80 bp), with one exception (MORI at position 77212; Table 4).

Attenuator-like repeat/dyad cluster. Sliding window plots and r-scan statistics revealed a striking cluster of direct and inverted repeats located at position 41,840 bp in the MORI contig (Figures 3, 4). This cluster contains a highly significant triple repeat of the oligonucleotide R = TTTCAATATTGGTGA at positions 41840, 41897, and 41956, with displacements of approximately equal length, 57 and 59 bp. Inverted repeats of the initial part of R (R' = CAATATTGAAA) are found nested at positions 41864 and 41921, yielding the complete structure R X9 R' X22 R X9 R' X24 R, with non-conserved intervening sequences, X. All 5 segments share the 8-palindrome CAATATTG at regular displacements of 21, 36, 21, and 38 bp. This region lies between the 5' ends of 2 divergently transcribed putative genes, betT (247 bp away) and fixA (5 bp away). The possibilities for alternative secondary structures in this region, the regularity of spacings, and the proximity to transcribed regions are reminiscent of a transcriptional attenuator (see Discussion).
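The arithmetic of the R X9 R' X22 R X9 R' X24 R layout can be checked directly from the repeat sequences and spacer lengths. The sketch below is only a consistency check with hypothetical helper names; it uses no sequence data beyond R, R', the palindrome, the spacer lengths, and the first start position given in the text.

```python
R = "TTTCAATATTGGTGA"
R_PRIME = "CAATATTGAAA"
PALINDROME = "CAATATTG"
_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def inverted_complement(seq):
    """Reverse complement of an upper-case DNA string."""
    return seq.translate(_COMPLEMENT)[::-1]


def unit_starts(first_start, units, spacers):
    """Start position of each unit, given the first start and spacer lengths."""
    starts = [first_start]
    for unit, gap in zip(units, spacers):
        starts.append(starts[-1] + len(unit) + gap)
    return starts


units = [R, R_PRIME, R, R_PRIME, R]
starts = unit_starts(41840, units, [9, 22, 9, 24])
palindromes = [s + u.index(PALINDROME) for s, u in zip(starts, units)]
gaps = [b - a for a, b in zip(palindromes, palindromes[1:])]

print(starts)  # [41840, 41864, 41897, 41921, 41956]
print(gaps)    # [21, 36, 21, 38]
print(inverted_complement(R[:11]) == R_PRIME)            # True
print(inverted_complement(PALINDROME) == PALINDROME)     # True
```

The computed starts reproduce the three R positions (41840, 41897, 41956) and the first R' position (41864), and place the second R' copy at 41921; the palindrome displacements come out as 21, 36, 21, and 38 bp.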
Significantly long dyads. There is a highly significant distant dyad of stem length 38 bp (with 4 mismatches) separated by 1293 bp in the MORI contig (Table 5). This dyad spans positions 15025 to 16393 (1369 bp in total) between the dnaJ and ant genes (cf. footnote c to Table 5). The sequence between the inverted repeats is IS186 (2, 45), which usually is flanked by a 22-26 bp pair of inverted repeats. The 38 bp stem length is longer than usual.
Table 3. Locations of anomalous spacings in different markers, restriction sites, and 4 bp palindromes

ECOMORI
Marker              Count  Cluster(a)      Overdispersion(a)  Comment
CR(d)               428    41,840 (5,10)                      attenuator-like structure, see Figure 4
                           58,700 (3,5)                       5 copies of 15-mer
CD(d)               285    41,838 (3,5)                       attenuator-like structure, see Figure 4
Dam (GATC)          495                                       no anomalous spacings
Dcm (CCWGG)         292                                       no anomalous spacings
Chi(e)              26                                        no anomalous spacings
CTAG                20                                        no anomalous spacings
Restriction sites
BamHI (GGATCC)      16                                        excessive evenness(c)
PvuII (CAGCTG)      45                                        excessive evenness(c)
EcoRV (GATATC)      56
HindIII (AAGCTT)    13                     20,459 (1b)        62 kb with no sites
EcoRI (GAATTC)      12                                        no anomalous spacings
KpnI (GGTACC)       10                                        no anomalous spacings
BglI (GCCN5GGC)     61                                        no anomalous spacings

ECOUW85U
Marker              Count  Cluster(a)      Overdispersion(a)  Comment
CR(d)               345    80,386 (3,5)                       REP element
CD(d)               222    56,220 (3,5)                       cluster between f300 and f138
                                           80,617 (1,3)       gap follows REP
Dam (GATC)          389                                       no anomalous spacings
Dcm (CCWGG)         253                                       no anomalous spacings
Chi(e)              24                                        no anomalous spacings
CTAG                12     88,856 (1b)                        in rRNA rrsA
Restriction sites
BamHI (GGATCC)      9                                         excessive evenness(c)
PvuII (CAGCTG)      35                                        excessive evenness(c)
EcoRV (GATATC)      35
HindIII (AAGCTT)    13
EcoRI (GAATTC)      15
KpnI (GGTACC)       4
BglI (GCCN5GGC)     51                                        no anomalous spacings

Entries whose row cannot be determined from the extracted layout (they belong to the blank cells above: ECOMORI EcoRV; ECOUW85U EcoRV, HindIII, EcoRI, KpnI): one "excessive evenness(c)"; two "no anomalous spacings"; clusters at 9,779 (5), in ilvA, and at 29,632 (1b), in orf o416 (possibly rffA or rff).

(a) Significant r-scan spacings are shown in parentheses following the location of each anomalous region.
(b) Exact formulas used for r-scan statistics. All other spacings outcomes obtained from asymptotic formulas (see Methods, part 2).
(c) Excessive evenness is indicated by r-scan minima which are significantly large.
(d) Close repeats and close dyads are represented by CR and CD, respectively.
(e) Chi counts include canonical sites (GCTGGTGG) and dyads.
[Figure 2: sliding window plots; vertical axis "Freq", horizontal axis "Position (Kb)".]

Figure 2. Sliding window plots of S-W (top) and R-Y (bottom) frequencies in ECOMORI (left) and ECOUW85U (right) contigs. Counts in the sliding windows are cumulated in 500 bp segments with 250 bp displacement.
Table 4. Locations of possible rho-independent terminators in ECOMORI and ECOUW85U contigs

Loc.     Upstream gene(b)  Downstream gene(b)  Sequence(a)
ECOMORI
20427    ORF (1273)        rpsT' (23)          TAA AAAAACCCGCTTGCGCGGGCTTTTTCA
26907    ORF (24)          ORF (10)            AAAATGCCGGT CTTGTT ACCGGCA'Tl TTTAT
65456    polB' (14)        araD' (23)          AAAAAACCAGGCT TGAGTAT AGCCTGGTTTCGTTTGATT
77212    sfuA' (2030)      leuD' (1340)        AAAATCACCCGCCAGCAG ATTATAC CTGCTGGTTTTTTTT
83230    leuA' (43)        leuO (583)          TAAAAAAACCCGCGC AATG GCGCGGGT--TTTGTTT
87520    ilvH (19)         shl (126)           AATTTACAGCCCAAC ATGTCAC GTTGGGCTTTTTTT
110653   secA (13)         mutT (6)            AAGTAAAAGGCGCAG GATT CTGCGCCTTTTTTATAGGTTT
ECOUW85U
6545     ilvE (-4)         ilvD (52)           AATAAATACAAAAAATGGGACGG CACGCA CCGTCCCATTT
20817    rho (16)          rfe (188)           AAAAACGCCACGT GTTT ACGTGGCGTTTTGCTTTTATA
33865    rffM (60)         o461 (188)          ATTACACAAAGCATT CAAATTTTT AATGCTTTATTTGCCATTTCT
35922    proM (-1)         atsB (109)          AATTTTGAACCCCGC TTCG GCGGGGTTTTTTGTTTTCT
58920    pldA (79)         recQ (19)           CAGGCGCT GAAAAT AGCGCCTG'll l-ATTT
68395    metE (1)          udp (10059)         TACCACCCCGGTCTTTT CTCATT AAAATCCAAACCGGGTGGTAA
70274    udp (17)          o475 (92)           CTGAAGGCCGA CGCGT TCGGCCTTTTGTATTTTT

(a) Possible stems are underlined in the original; here the two stems and the intervening loop are separated by spaces.
(b) Distances of terminators from upstream and downstream genes are shown in parentheses. Genes read from right to left (on the nonsequenced strand) are indicated by a prime ('). Uncharacterized orfs are shown as ORF.

MORI contains another significant close dyad (17 bp stem, 6 bp loop) at position 58,940 between putative gene sec63 and an unknown ORF, and a new member of the 'Group II' repeats of Blaisdell et al. (5), located in sfuC, a putative gene for iron transport.
Close repeats and dyads. Several clusters of close repeats (CR) and close dyads (CD) were indicated by r-scan and sliding window analysis (Table 3, Figure 3). In addition to the attenuator-like repeat/dyad cluster described above, the MORI sequence contains a significant CR cluster at 58,700 bp. In this region there are 5 1/2 tandem imperfect copies of the 15-word GTGCTACCCCGGACG contained in an interval (length 10013 bp) between pdxA and polB involving a putative gene sec63 and an ORF assigned to a component of polB by Yura et al. (45). UW85 contains a CD cluster at position 56 kb and a CR cluster at position 80 kb. The CD cluster contains 5 A + T rich dyad pairs arranged in nested, intertwined, and sequential configurations (Table 5). The CR cluster pertains to the long REP element with conserved spacers described above. There is a 4.4 kb overdispersion of close dyads in UW85 at position 80661, lying 279 bp beyond the conserved REP element discussed previously.
[Figure 3: sliding window plots; vertical axis "Count", horizontal axis "Position (Kb)".]

Figure 3. Sliding window plots of counts of close repeats and dyads in ECOMORI (left) and ECOUW85U (right) contigs. Repeats and dyads are of length > 8 bp with < 150 bp between stems or copies. Counts in the sliding windows are cumulated in 500 bp segments with 250 bp displacement. Asterisks indicate statistically significant clusters based on the method of r-scans (r = 1, 3, 5, 10). The prominent CR peak at position 30 kb of ECOUW85U does not meet the criterion for a cluster. This region carries a significant 4-fold repeat of the 8-oligonucleotide TGGCGCTG in ORF o416 (possibly rffA).
41834  tctgac
       TTTCAATATTGGTGA            R
       tccataaaa
       CAATATTGAAA                R'
       atttctuugctacgccgtgt
       TTTCAATATTGGTGA            R
       ggaacttaa
       CAATATTGAAA                R'
       gttggattatctgcgtgtgacat
       TTTCAATATTGGTGA            R
       taaag  41976

Figure 4. Attenuator-like repeat sequence in 41 kb region of the ECOMORI contig. Repeat segments are denoted R with dyads R'. Filled arrows below the sequence depict different dyad relationships; open arrows above the sequence illustrate nested repeat combinations.
Inter-sequence matches. We found a significant intercontig match between MORI (position 86467) and UW85 (position 4813) of length 29 bp with 3 mismatches. The match codes for DVGQHQM(F/W)AA (one-letter amino acid code; F in MORI, W in UW85) and lies in ilvI between leuO and ilvH in MORI and in ORF o221 between ilvG and ilvM in UW85.

Rare and frequent oligonucleotides
Two oligonucleotides meet the criteria for frequent words (Methods, part 6) in UW85: TGCTGGCGG (12 copies) and TGCTGGTGG (11 copies). This finding is intriguing because GCTGGTGG is the active Chi motif. By our criterion there are no frequent oligonucleotides in MORI (s = 9 bp and r = 14). We further examined the EcoSeq2 database for frequent oligonucleotides (meeting the criteria of length s = 11 bp with occurrences ≥ r = 16), identifying 129 oligonucleotides as frequent. All of the frequent words with ≥ 27 copies were related to the REP motif. Many of the frequent words occurring 16 to 25 times contained the pentamer GCTGG or its inverted complement CCAGC, relating to the consensus Chi sequence. A random sampling of human DNA sequences of aggregate length 1.41 Mb (therefore s = 11, r = 13) revealed 1414 frequent oligonucleotides, predominantly composed of short iterations such as X11, (XY)5, (TGC)3TG, etc. (X = A, C, G, T; XY = GT, AC, AG). The contrasts between frequent words in human vs. E. coli emphasize the iterative character of these oligonucleotides in human sequences versus less structured and more dispersed repeats in E. coli, putatively implying functional roles. In the UW85 and MORI contigs all of the completely absent rare oligonucleotides contain stop codons, generally in both orientations and mostly embedded in the tetranucleotide CTAG.
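A frequent-word search of the kind reported here amounts to tabulating all words of length s and keeping those with at least r copies. The following is a minimal sketch with a hypothetical function name; it counts overlapping occurrences on the given strand only, which is an assumption, since the text does not state how overlaps or the complementary strand were handled.

```python
from collections import Counter


def frequent_words(seq, s, r):
    """Return {word: copies} for all words of length s with at least r copies."""
    seq = seq.upper()
    counts = Counter(seq[i:i + s] for i in range(len(seq) - s + 1))
    return {word: n for word, n in counts.items() if n >= r}
```

For example, `frequent_words("ACGACGACGTT", 3, 3)` returns `{"ACG": 3}`. For the UW85 contig the call would use s = 9 and r = 11, the thresholds given in Methods, part 6.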
Table 5. Significantly long direct and inverted repeats in ECOMORI and ECOUW85U contigs

Position     Structure(a)                 Flanking genes(b)                  Repeat sequence(f)

ECOMORI
15025(c)     ββ'                          dnaJ 76 (1369) 823 ant             β  = GGCGGGGAtcacTCCCATAAGCGCTAACTTAAGGGTTG
                                                                             β' = CAACCCTTAAGTTAGCGCTTATGGGAtcacTCCCCGCC
38,809       ABB'A'                       batF 18 (61) 1157 Acyl-CoA         REP(d) element
41,840(e)    R X9 R' X22 R X9 R' X24 R    betT 247 (131) 5 fixA              R = TTTCAATATTGGTGA; R' = AACAATATTGAAA
58,713(g)    5 1/2 tandem copies of C     pdxA 5636 (90) 2419 unknown orf    C = GTGCTACCCCGGACG, tandem approximate copies
58,940(g)    MM'                          pdxA 5863 (40) 2242 unknown orf    M = GCTAAGGTTGAAGGGGC
71,814       BB'A'                        dedA 145 (66) 735 sfuC             REP(d) element
72158        NP                           within sfuC                        Group II element, ref. [2]
                                                                             N = ACAACGCGCTAACGCCACTCGCTGTCGCTGACCGCCGGAAAGCTCG
                                                                             P = TTCAGCAGGGTACTTTTACCCGCGCCGCTTGGCCCGAGGA

ECOUW85U
29672        QQQQ                         within o416                        Q = TGGCGCTG
56,220       5 dyad pairs of E, F, G,     f138 -115 (243) -69 f300           E = AAATTTAT, F = AAAATAAAT, G = AAATATAT,
             H, I                                                            H = AATATGCC, I = TTTATTTG
80,382       ABCB'A'DABCB'A'DABCB'A'      ubiB 96 (279) 17 fadA              REP(d) element with conserved spacers C, D
                                                                             C = CTACAAATAGTCCGAACC
                                                                             D = AATTGGTGCATGAT

Match between UW85 and MORI sequences
86,467 (MORI)    within ilvI    GATGTcGGGCAGCACCAGATGTttGCTGC
4,813 (UW85)     within o221    GATGTgGGGCAGCACCAGATGTggGCTGC
peptide sequence                DVGQHQM(F/W)AA

(a) Dyads of repeated sequences are indicated by a prime ('). Imperfect dyads are explicitly listed.
(b) Repeat structure lies between the two genes shown, separated from each by the number of bp indicated. Total length of repeat (head to tail) including intervening regions is given in parentheses between the distances to flanking genes. Arrows show direction of transcription.
(c) IS element IS186 lies in the 1293 bp interval between the dyad stems. The Rudd et al. [41] database denotes the neighboring genes as dnaJ and gef.
(d) The consensus components of a REP unit are A = GCCG/TGATGCGG/ACGC/T with dyad A' = A/GCGT/CCGCATCA/CGGC, and B = G/ACGC/TCTTATCC/AGGC with dyad B' = GCCT/GGATAAGA/GCGT/C.
(e) See Figure 4 for illustration.
(f) Dyad consensus sequences are presented only if they are imperfectly matched to first stem. Mismatches are shown in lower case.
(g) Region contains putative gene sec63.
DISCUSSION Compilation of the cleaned (nonredundant) E. coli DNA sequence collection including two large (about 100 kb) contigs has rendered the E.coli chromosome quite approachable for studies of sequence heterogeneity and genomic organization. A view of the E. coli chromosome as a highly dynamic structure with many sequence rearrangements and repeated segments resulting from processes of DNA repair, recombination, inversion, transposition, deletion, and amplification is emerging (22, 30). Since short sequences will tend to have shorter spacings between consecutive copies of markers and will therefore tend to yield biased marker distributions, the longer length of the two recent contigs makes it possible to evaluate more accurately some of the characteristic structures of the E. coli chromosome, including counts and distributions of Chi sequences, 6-cutter restriction sites, Dam and Dcm methylase sites, close and distant repeats, and others. Relevant statistical procedures include: measures for assessing compositional anomalies in short oligonucleotides; evaluation of sequence similarities between the contigs and between all available E. coli sequences; analysis of counts and spacings (testing for significant clusters, overdispersions, and evenness) in marker arrays such as restriction sites, special oligonucleotides, and close repeats and dyads; assessment of fluctuations in
sequence composition by position-dependent sliding windows and by segmental quantile distributions; and characterization of significantly rare and frequent oligonucleotides. Application of these statistics to the MORI and UW85 contigs and to the
EcoSeq2 (37, 38) contig collection reveals several interesting sequence features on which we comment and offer some interpretations.
Repetitive structures
(i) Our assessment of close direct and inverted repeats based on segmental quantile distributions, sliding window analysis (Figure 3), and r-scan statistics (Table 3) highlights several regions in the MORI and UW85 contigs. The most outstanding is the region around 41 kb in the MORI contig, which includes a triple 15 bp repeat of R = TTTCAATATTGGTGA centered at the palindrome CAATATTG with two nested 11 bp dyads, R' = CAATATTGAAA, between successive copies (Figure 4). This sequence arrangement, RX9R'X22RX9R'X24R, having nonconserved intervening spacers (X), has the potential to form versatile secondary structures amenable to cooperative binding interactions with transcription/replication factors or for other structural purposes. The regular spacings of the intervening nonconserved spacers argue for the importance of the dyad structures. The combinations of dyad pairings in this sequence, proximity to a putative operon (fixA, see Table 5), and short polyU runs make it a plausible candidate for a transcriptional attenuator in the general form of the enterobacterial attenuators trp, leu, thr, ilv, and others (35, 43, 44). However, the proximal genes are not homologous to a biosynthetic operon and there are no apparent characteristic promoter sequences nearby (e.g., appropriate codons, leader sequences, etc.), as typically found in such attenuators. Attenuator-like controls have been identified for pathways other than biosynthesis, such as BglA (33) and the pyrBI operon (42), of which this may be another example. It is also conceivable that the RX9R'X22RX9R'X24R structure may be an elaborate novel transcription terminator for an adjacent protein-encoding gene. Another possibility is that the structure is involved in conjugation control, relating to its close proximity to the HfrH point of integration (31). The sequences R and R' are A + T rich, allowing variegated cruciform extrusions exercising some unknown control (e.g., possible recognition site, pause sites in replication or transcription, etc.).
(ii) One statistically significant identity emerged between the MORI and UW85 contigs (Table 5). The match is 29 bp long and is part of genes that encode different acetohydroxy acid synthase (AHAS) isozymes involved in isoleucine and valine biosynthesis (25). The matching segment lies in ilvI between leuO and ilvH in MORI (encoding AHAS III) and in ORF o221 between ilvG and ilvM in UW85 (encoding AHAS II). The E.coli K-12 strain carries a frameshift mutation in ilvG that results in premature termination; o221 would be part of ilvG in the absence of this mutation (19, 25).
(iii) One of the REP elements (AB and B'A' unions) in UW85, located at position 80 kb, contains a complex structure of the form ABCB'A'DABCB'A'DABCB'A' with conserved spacers of length 22 bp (C) and 14 bp (D) (see legend of Table 5). The spacer conservation suggests a relatively recent evolution of this REP element.
(iv) A significantly long 38 bp non-REP inverted repeat was identified in MORI, flanking insertion sequence IS186 (1293 bp between stems). The long 38 bp dyad presumably includes the traditional 23 bp terminal inverted repeats of IS186 (2, 39).
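The dyad relationships quoted in item (i) for the MORI 41 kb cluster can be checked directly from the sequences given in the text. The following is a minimal sketch, not the procedure used in the analysis: the helper names `reverse_complement` and `is_palindrome` are illustrative assumptions, and only the strings R, R' and the palindrome core are taken from the text.

```python
# Sketch: verify the repeat/dyad relationships stated for the MORI cluster.
# Helper names are illustrative; only the three sequences come from the text.

COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq: str) -> str:
    """Return the reverse complement of an uppercase DNA string."""
    return seq.translate(COMPLEMENT)[::-1]


def is_palindrome(seq: str) -> bool:
    """A DNA palindrome (perfect dyad) equals its own reverse complement."""
    return seq == reverse_complement(seq)


R = "TTTCAATATTGGTGA"    # 15 bp direct repeat quoted in the text
R_PRIME = "CAATATTGAAA"  # 11 bp dyad quoted in the text
CORE = "CAATATTG"        # palindrome contained in R

# The stem that can pair with R' is the reverse complement of R'.
stem = reverse_complement(R_PRIME)
stem_offset = R.find(stem)  # -1 would mean R' does not pair with any part of R

print("stem paired with R':", stem)
print("offset of that stem within R:", stem_offset)
print("core is a palindrome:", is_palindrome(CORE))
```

Under these definitions R' pairs with the first 11 bp of R, which is consistent with the description of R' as a dyad of R and with the palindromic core shared by both.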
Homogeneity comparisons
(i) Dinucleotide representational distance measures reveal an exceptional similarity between the two contigs, more pronounced than between other E.coli sequences (Table 2, Methods part 1). In comparisons with sequences from temperate phages λ, Mu, and P1, and from lytic phages T7 and T4, the E. coli contigs are remarkably similar. On this basis, we propose that E.coli has no isochores but is broadly homogeneous. We also compared the sequence of yeast chromosome III (YCIII, 315 kb) to the E.coli contigs (15). To this end, YCIII was partitioned into 3 segments, each of approximately 100 kb. While the YCIII fragments are considerably distant from E. coli sequences, they are quite similar to one another, particularly the first and last 100 kb sections, which are about as similar as are MORI and UW85 to each other and to EcoSeq2. These two sections of YCIII share a 10 kb stretch of characteristic subtelomeric sequences, as well as the HMRa and HMLα silent loci, which involve many similar sequences associated with mating type switching.
(ii) Evaluation of the restriction site spacings reveals one persistent invariant of both contigs with the Kohara map: BamHI sites (GGATCC) are more evenly spaced than would be expected if they were distributed randomly throughout the genome (9, 18).
It seems, a priori, paradoxical that while BamHI sites are significantly evenly distributed in the contigs, spacings of the Dam methylase site (GATC) central to BamHI are consistent with expectations of a random distribution (except for a single cluster in the OriC region). Is it conceivable that Dam methylase activity has preferences for G:C nucleotides bordering GATC and/or is hampered by clusters of Dam sites? Along these lines, investigation of the distribution of CGATCG also revealed significantly even spacings in both UW85 and MORI, whereas the six-base palindromes (A/T)GATC(T/A) show no anomalous distribution. We might speculate that Dam sites within longer strong base-pairing palindromes such as BamHI are more efficiently methylated than weak palindromes containing Dam target sites. The distribution of embedded strong palindromic Dam sites might also contribute to methylation efficiency by avoiding close groupings, such as those in OriC, that apparently reduce methylation rates (8).
(iii) Determination of significantly frequent oligonucleotides identified two extensions/modifications of the Chi sequence GCTGGTGG as the most frequent in UW85, these words also occurring abundantly in the EcoSeq2 database (modifications of REP are the most frequent oligonucleotides in EcoSeq2; see (5)). No oligonucleotides exceeded our frequent word threshold in the MORI contig, perhaps indicating fewer recombination events in the MORI sequence compared to the UW85 sequence; i.e., more recombination events near the origin of replication in UW85 (27). Chi sequences are sites for RecBCD recombination inducements, functioning to suppress exonuclease (ExoV) degradation (41). The Chi frequency has been estimated as once every 5-5.5 kb, or approximately 940 sites in the entire E. coli genome (22). Occurrences in UW85 and MORI are on the high side of this expectation (24 sites in UW85, 26 in MORI).
The relatively high number of Chi sites interacting with the RecBCD complex presumably aids in DNA repair and overall E.coli viability.
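The counts and spacings of a marker such as Chi or BamHI can be tabulated with a simple both-strand scan. The sketch below is only an illustration under stated assumptions: the function names are hypothetical, the toy sequence is made up, and the spacing summary is plain arithmetic rather than the r-scan statistics used in the analysis. The only figures taken from the paper are the 24 Chi sites and the approximate 91 kb length of UW85.

```python
# Illustrative sketch; function names are assumptions, not the authors' code.

COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq: str) -> str:
    """Return the reverse complement of an uppercase DNA string."""
    return seq.translate(COMPLEMENT)[::-1]


def find_sites(seq: str, word: str) -> list[int]:
    """0-based start positions of `word` on either strand, in sorted order.

    A site on the opposite strand appears in `seq` as the reverse complement
    of `word`; for a palindromic word both forms coincide.
    """
    targets = {word, reverse_complement(word)}
    k = len(word)
    return [i for i in range(len(seq) - k + 1) if seq[i:i + k] in targets]


def spacings(positions: list[int]) -> list[int]:
    """Distances between consecutive site positions."""
    return [b - a for a, b in zip(positions, positions[1:])]


def mean_spacing_bp(length_bp: int, n_sites: int) -> float:
    """Average number of bp per site in a sequence of the given length."""
    return length_bp / n_sites


CHI = "GCTGGTGG"
toy = "AAGCTGGTGGTTCCACCAGCAA"  # made-up sequence with one Chi on each strand
chi_positions = find_sites(toy, CHI)
print(chi_positions, spacings(chi_positions))

# Figures quoted in the paper: 24 Chi sites in UW85, which is about 91 kb long.
uw85_spacing = mean_spacing_bp(91_000, 24)
print(f"UW85: one Chi site per {uw85_spacing / 1000:.1f} kb on average")
```

With the quoted figures, UW85 carries roughly one Chi site per 3.8 kb, denser than the 5-5.5 kb estimate, which is the sense in which the counts are "on the high side".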
Compositional biases of short oligonucleotides
The dinucleotide TA is underrepresented in the two long contigs and in the entire EcoSeq2 database (Table 1). This underrepresentation is widespread in most sequences examined to date (6, 21, 32). In contrast, GC dinucleotides are significantly overrepresented in the E.coli sequences and the trend persists in longer iterates as well. For example, HhaI (GCGC) and BssHII (GCGCGC) are among the most frequent 4- and 6-bp palindromes in E. coli. Furthermore, (GC)n iterates with n > 4 comprise one half of all poly-XY (X, Y = A, C, G, T) runs in the MORI and UW85 contigs. The striking extreme representations of TA (low relative abundance) and GC (high relative abundance) in E. coli and in most bacterial genomes (e.g., B.subtilis and P.aeruginosa) are also manifest in the temperate phages Mu, λ, P1, P4, and P22, whereas most dsDNA lytic phages (e.g., T3, T4, T7, φ29) show normal representations of these dinucleotides. We would suggest that CTA/TAG trinucleotides are avoided, as CCA/TGG and CTG/CAG trinucleotides, differing by a single transition substitution, exhibit high relative abundances (15). CTA/TAG avoidance does not appear to be due to the TAG stop codon because TTA/TAA is not especially low.
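Under- and overrepresentation of a dinucleotide is commonly expressed as an odds ratio of its observed frequency to the product of its component base frequencies. The sketch below assumes that form, rho_XY = f_XY / (f_X f_Y), and assumes a distance defined as the mean absolute difference of the 16 ratios; the measures defined in the Methods of this paper may differ (for example, by combining counts from both strands, which is not done here). Function names and the toy sequences are hypothetical.

```python
# Sketch under stated assumptions: plain (single-strand) odds ratios and a
# mean-absolute-difference distance. Not necessarily the paper's exact measure.
from collections import Counter
from itertools import product

BASES = "ACGT"


def relative_abundance(seq: str) -> dict[str, float]:
    """rho_XY = f_XY / (f_X * f_Y) for all 16 dinucleotides.

    Assumes `seq` is uppercase and contains each of A, C, G, T at least once.
    """
    n = len(seq)
    mono = Counter(seq)
    di = Counter(seq[i:i + 2] for i in range(n - 1))
    rho = {}
    for x, y in product(BASES, repeat=2):
        f_x = mono[x] / n
        f_y = mono[y] / n
        f_xy = di[x + y] / (n - 1)
        rho[x + y] = f_xy / (f_x * f_y)
    return rho


def abundance_distance(seq_a: str, seq_b: str) -> float:
    """Mean absolute difference between the 16 relative abundances."""
    rho_a = relative_abundance(seq_a)
    rho_b = relative_abundance(seq_b)
    return sum(abs(rho_a[k] - rho_b[k]) for k in rho_a) / 16


toy_a = "ACGT" * 50      # made-up sequences, for illustration only
toy_b = "AACCGGTT" * 25
rho = relative_abundance(toy_a)
print("rho(TA) =", round(rho["TA"], 3), " rho(AA) =", rho["AA"])
print("distance(toy_a, toy_b) =", round(abundance_distance(toy_a, toy_b), 3))
```

A ratio well below 1 (as reported for TA) indicates underrepresentation relative to base composition, a ratio well above 1 indicates overrepresentation, and a small distance between two sequences indicates similar dinucleotide biases.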
CONCLUSIONS The E.coli genome, as with many prokaryotes, has long been considered streamlined and economical, consisting mainly of operon transcription units with a small proportion of repeats. The discovery of transposable IS families revealed sources and means of DNA rearrangements, mutations, and amplifications. Over the last decade many distributed intergenic families of short repeats have also been identified, the most prominent of which include REP elements, ERICs, and seven recently described groups (5). Duplications of genes or portions of genes appear rare in E. coli (5). In both of the large contigs, there are remnants of ilv isozymes putatively arising from duplication events (22). The contigs also reveal several new close repeats, the most striking of which is the elaborate repeat/dyad cluster located at position 41,838 in the MORI contig discussed extensively above. Because of the potential for numerous dyad structures and the proximity to operons, this may be an important regulatory region in E. coli and presents an attractive target for experimental manipulations. Another issue is the common impression that the E. coli genome is highly homogeneous. In this context, our analysis suggests that there is nothing akin to an isochore phenomenon in E. coli. The two large contigs are remarkably close under the dinucleotide relative abundance distance measure. Homogeneity has also been indicated at the genomic level, as the spacings of most of the restriction sites in the Kohara map are consistent with a random distribution (9, 18). Paradoxically, the counts of the 8 restriction sites are highly variable, and the differences cannot be explained on the basis of mono-, di- or trinucleotide frequencies (18). The distribution of these sites in the UW85 and MORI contigs also shows some anomalies (e.g., BamHI is regularly spaced, whereas KpnI and HindIII exhibit regions of overdispersion), as do counts and spacings of several 4-bp palindromes (data not shown). Another inhomogeneity concerns the distribution of CTAG tetranucleotides.
These are very rare in all bacteria examined to date, but seem to cluster in rRNA genes (6). Even the REP elements point to E. coli variability, as the analysis suggests that the form, numbers, and distributions of these are in a state of dynamic flux (for example, one REP element has fully conserved intervening portions, while others appear to be defective). Opportunities for further analysis and understanding of genomic organization and heterogeneity in E. coli will soon arise when the complete chromosome sequence is available.
ACKNOWLEDGEMENTS We gratefully acknowledge Drs V.Brendel, A.M.Campbell, K.E.Rudd, F.W.Stahl and C.Yanofsky for helpful comments on the manuscript. This work was supported in part by NIH grants HG00085-01 (L.R.C.), HG00335-06 and GM10452-29 (S.K.), NSF grant DMS91-06974 (S.K.), and the San Antonio Area Foundation SAAF-500 (G.A.S.).
REFERENCES
1. Bhagwat,A.S. and McClelland,M. (1992) Nucleic Acids Res. 20, 1663-1668.
2. Birkenbihl,R.P. and Vielmetter,W. (1991) Mol. Gen. Genet. 226, 318-320.
3. Bishop,Y.M.M., Fienberg,S.E. and Holland,P.W. (1975) Discrete Multivariate Analysis: Theory and Practice. MIT Press, Cambridge, MA.
4. Blaisdell,B.E., Burge,C., Schachtel,G. and Karlin,S. (1993) submitted.
5. Blaisdell,B.E., Rudd,K., Matin,A. and Karlin,S. (1993) J. Mol. Biol. 229, 833-848.
6. Burge,C., Campbell,A.M. and Karlin,S. (1992) Proc. Natl. Acad. Sci. USA 89, 1358-1362.
7. Burland,V., Plunkett,G., Daniels,D.L. and Blattner,F.R. (1993) Genomics in press.
8. Campbell,J.L. and Kleckner,N.E. (1990) Cell 62, 967-979.
9. Churchill,G.A., Daniels,D.L. and Waterman,M.S. (1990) Nucleic Acids Res. 18, 589-597.
10. Daniels,D.L., Plunkett,G., Burland,V. and Blattner,F.R. (1992) Science 257, 771-778.
11. Dimri,G.P., Rudd,K.E., Morgan,M.K., Bayat,H. and Ames,G.F. (1992) J. Bacteriol. 174, 4583-4593.
12. Fickett,J.W. (1982) Nucleic Acids Res. 10, 5303-5318.
13. Gilson,E., Saurin,W., Perrin,D., Bachellier,S. and Hofnung,M. (1991) Nucleic Acids Res. 19, 1375-1383.
14. Higgins,C.F., Ames,G.F.L., Barnes,W.M., Clement,J.M. and Hofnung,M. (1982) Nature 298, 760-762.
15. Karlin,S., Blaisdell,B.E., Sapolsky,R.J., Cardon,L. and Burge,C. (1993) Nucleic Acids Res. 21, 703-711.
16. Karlin,S. and Brendel,V. (1992) Science 257, 39-49.
17. Karlin,S. and Leung,M.Y. (1991) Ann. Appl. Prob. 4, 513-538.
18. Karlin,S. and Macken,C. (1991) Nucleic Acids Res. 19, 4241-4246.
19. Karlin,S., Schachtel,G.A., Burge,C. and Blaisdell,B. (1993) submitted.
20. Kohara,Y., Akiyama,K. and Isono,K. (1987) Cell 50, 495-508.
21. Kozhukhin,C.G. and Pevzner,P.A. (1991) CABIOS 7, 39-49.
22. Krawiec,S. and Riley,M. (1990) Microbiol. Rev. 54, 502-539.
23. Kröger,M., Wahl,R., Schachtel,G. and Rice,P. (1992) Nucleic Acids Res. 20, 2119-2144.
24. Kunisawa,T. and Nakamura,M. (1991) Protein Seq. Data Anal. 4, 43-47.
25. Lawther,R.P., Calhoun,D.H., Adams,C.W., Hauser,C.A., Gray,J. and Hatfield,C.W. (1981) Proc. Natl. Acad. Sci. USA 78, 922-925.
26. Leung,M.Y., Blaisdell,B.E., Burge,C. and Karlin,S. (1991) J. Mol. Biol. 221, 1367-1378.
27. Masters,M. (1989) Curr. Opin. Cell Biol. 1, 241-249.
28. McClelland,M., Jones,R., Patel,Y. and Nelson,M. (1987) Nucleic Acids Res. 15, 5985-6005.
29. Merkl,R., Kröger,M., Rice,R. and Fritz,H.-J. (1992) Nucleic Acids Res. 20, 1657-1662.
30. Milkman,R. and Stoltzfus,A. (1988) Genetics 120, 359-366.
31. Miller,J.H. (1992) A Short Course in Bacterial Genetics: Handbook. Cold Spring Harbor Laboratory Press, NY.
32. Nussinov,R. (1981) J. Biol. Chem. 256, 8458-8462.
33. O'Day,K., Lopilato,J. and Wright,A. (1991) J. Bacteriol. 173, 1571.
34. Platt,T. (1986) Annu. Rev. Biochem. 55, 339-372.
35. Richardson,J.P. (1993) Crit. Rev. Biochem. Mol. Biol. 28, 1-30.
36. Rosenberg,M.D. and Court,D.L. (1979) Annu. Rev. Genet. 13, 319-353.
37. Rudd,K.E. (1992) Alignment of E.coli DNA Sequences to a Revised Integrated Genomic Restriction Map. Volume 2.3-2.43. Cold Spring Harbor Laboratory Press, NY.
38. Rudd,K.E., Miller,W., Werner,C., Ostell,J., Tolstoshev,C. and Satterfield,S.G. (1991) Nucleic Acids Res. 19, 636-647.
39. Sengstag,C., Iida,S., Hiestand-Nauer,R. and Arber,W. (1986) Gene 49, 153-156.
40. Sharples,G.J. and Lloyd,R.G. (1990) Nucleic Acids Res. 18, 6502-6508.
41. Stahl,F.W., Thomason,L.C., Siddiqi,I. and Stahl,M.M. (1990) Genetics 126, 519-533.
42. Turnbough,C.L., Hicks,K.L. and Donahue,J.P. (1993) Proc. Natl. Acad. Sci. USA 80, 368-372.
43. Yanofsky,C. (1987) Trends Genet. 3, 356-360.
44. Yanofsky,C. (1988) J. Biol. Chem. 263, 609-612.
45. Yura,T., Mori,H., Nagai,H., Nagata,T., Ishihama,A., Fujita,N., Isono,K., Mizobuchi,K. and Nakata,A. (1992) Nucleic Acids Res. 20, 3305-3308.