J Mol Evol (2000) 51:353–362 DOI: 10.1007/s002390010097

© Springer-Verlag New York Inc. 2000

Are Noncoding Sequences of Rickettsia prowazekii Remnants of “Neutralized” Genes?

Dirk Holste,1 Olaf Weiss,2 Ivo Grosse,3 Hanspeter Herzel2

1 Department of Theoretical Biophysics, Humboldt University Berlin, Invalidenstr. 42, D-10115, Berlin, Germany
2 Institute for Theoretical Biology, Humboldt University Berlin, Invalidenstr. 43, D-10115, Berlin, Germany
3 Institute for Molecular Biology and Biochemistry, Free University Berlin, Arnimallee 22, D-14195, Berlin, Germany

Received: 27 December 1999 / Accepted: 5 July 2000

Abstract. It has been hypothesized that a large fraction of the 24% noncoding DNA in R. prowazekii consists of degraded genes. This hypothesis has been based on the relatively high G+C content of noncoding DNA. However, a comparison with other genomes that also have a low overall G+C content shows that this argument would apply to other bacteria as well. To test this hypothesis, we study the coding potential in sets of genes, pseudogenes, and intergenic regions. We find that the correlation function and the χ²-measure are clearly indicative of the coding function of genes and pseudogenes. However, both coding potentials give almost no indication of a preexisting reading frame in the remaining 23% of noncoding DNA. We simulate the degradation of genes due to single-nucleotide substitutions and insertions/deletions and quantify the number of mutations required to remove indications of the reading frame. We discuss a reduced selection pressure as another possible origin of this comparatively large fraction of noncoding sequences.

Key words: Intergenic DNA; Periodicities; Reading frame; G+C content; Coding potential; Spontaneous mutation; Rickettsia prowazekii

Correspondence to: D. Holste; email: [email protected]

Introduction

The complete genome of the bacterial pathogen Rickettsia prowazekii has been sequenced (Andersson et al. 1998). Since Rickettsia multiply only in eukaryotic host cells, their genomes provide insight into adaptations of free-living organisms to the intracellular lifestyle. Phylogenetic analyses indicate that Rickettsia are the closest extant relatives of the ancestor of mitochondria (Andersson et al. 1998). Hence, the relatively small genome of ∼1.1 × 10⁶ base pairs (bp) might be the result of reductive evolution comparable to the endosymbiotic scenario leading to modern mitochondria (Margulis 1970).

The role of rickettsiae in evolutionary modeling receives attention because this intracellular parasite may show adaptations to several hosts. R. prowazekii is directly transmitted by the arthropod vector, the louse Pediculus humanus. Although the infected human carrier can live on and appears to be the primary reservoir of the disease in interepidemic periods, the vector inevitably dies of the rickettsial infection (Ormsbee 1985). Adaptations to the human host could only have happened on a relatively short time scale. However, the detection of R. prowazekii in flying squirrels demonstrates that reservoirs other than humans also exist (Hackstadt 1996). The explanation of the evolutionary origin of the large fraction of noncoding DNA in R. prowazekii could shed some light on adaptation strategies.

The fraction of noncoding DNA in previously sequenced bacterial genomes ranges from 5% in Thermotoga maritima (Nelson et al. 1999) to 12% in Haemophilus influenzae (Fleischmann et al. 1995) and is 9% on average. The large fraction of 24% noncoding DNA found in R. prowazekii is unusual and significantly larger than the amount of noncoding DNA found, e.g., in the obligate intracellular parasite C. trachomatis (Stephens et al. 1998), whose noncoding DNA content of 10% seems to be typical for bacterial genomes (Zomorodipour and Andersson 1999).

It has been hypothesized that the high fraction of noncoding DNA consists of remnants of genes that have been degraded by mutation and that have not yet been removed from the genome (Andersson et al. 1998). This hypothesis is consistent with the observation of 12 documented pseudogenes, which comprise ∼1% of the genome. The hypothesis that much of the remaining 23% noncoding DNA could also originate from inactivated genes has been based on the fact that the G+C content (24%) of noncoding DNA is lower than in coding DNA (30%) but higher than in the third position of the reading frame (18%) in coding DNA. It is argued that noncoding sequences should gradually approach the low G+C content of the third codon position during their evolution (Andersson et al. 1998).

It is known that many factors influence the DNA composition (Bernardi 1989; Lobry 1996; Li 1997). Table 1 shows the G+C content derived from bacteria with a low overall G+C content, none of which contains annotated pseudogenes (Fleischmann et al. 1995; Fraser et al. 1995, 1997). The G+C content of noncoding sequences exceeds the G+C content of the third codon position for all four species, indicating that R. prowazekii may constitute no particular case among bacteria with a low overall G+C content.

A reliable approach to identify genes in novel DNA consists of finding a known gene from another organism to which it shows a significant sequence similarity. This approach has been successfully used to detect novel coding sequences in complete archaeal genomes (Raghavan and Ouzounis 1999). Andersson et al. (1998) identify genes by searching the R. prowazekii genome for open reading frames (ORFs) longer than 50 codons based on nucleotide frequencies that are characteristic for R. prowazekii (Andersson and Sharp 1996). This left 23% noncoding DNA containing no ORFs longer than 50 codons. Putative genes were subsequently analyzed by means of the computer program BLASTX (Altschul et al. 1990), which compares a DNA sequence translated in all reading frames against a protein sequence database.

In many cases, pseudogenes or genes have no significant similarity to known genes (cf. Blattner et al. 1997 or Nelson et al. 1999). In the absence of such similarity, protein coding potentials that distinguish coding from noncoding DNA can be applied. Here, we conduct such analyses to investigate whether they can detect the presence of degraded remnants of coding sequences in R. prowazekii. To probe the presence of a reading frame, we evaluate the coding potential of both coding and noncoding sequences using established statistical techniques (Trifonov and Sussman 1980; Shepherd 1981; Fickett and Tung 1992). Under the assumption that noncoding sequences contain a large fraction of degraded genes and that statistical properties of inactivated genes are not completely eliminated by mutations, traces of a coding potential should be detectable in noncoding DNA.

Clearly, any coding potential is decreased when the query sequence is mutated. We simulate single-nucleotide substitutions and insertions/deletions to infer the number of mutations required such that the studied coding potentials fail to detect any deviations from noncoding sequences. Hence, we can quantify the number of substitutions and insertions/deletions necessary to remove the coding potential.

Table 1. A G+C content analysis of four bacteria with low overall G+C content

                                Coding regions (%), codon position
Species           Overall (%)   1        2        3        Noncoding regions (%)
R. prowazekii     29.1          41.1     31.7     18.4     23.7
M. jannaschii     31.9          41.6     29.7     24.8     27.6
M. genitalium     31.6          41.6     30.6     23.0     32.2
B. burgdorferi    29.2          37.8     28.6     20.7     25.4

We calculate from the total nucleotide count of coding and noncoding sequences the G+C content in three categories (from left to right): the overall content, the content in the three codon positions of the reading frame, and the content of noncoding DNA. The G+C content of noncoding DNA exceeds in all cases the G+C content in the third codon position.
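The codon-position breakdown reported in Table 1 can be illustrated with a minimal Python sketch. It assumes that each coding sequence is supplied in frame (index 0 is codon position 1) and that characters other than A, C, G, and T are ignored; the function names are hypothetical and this is not the authors' program.

```python
def gc_fraction(seq):
    """G+C fraction of a sequence, counting only A, C, G, T."""
    seq = seq.upper()
    total = sum(seq.count(base) for base in "ACGT")
    gc = seq.count("G") + seq.count("C")
    return gc / total if total else 0.0


def gc_by_codon_position(coding_seqs):
    """G+C fraction at codon positions 1, 2, 3, pooled over all sequences.

    Assumption: every sequence starts at codon position 1 (in frame).
    """
    counts = [[0, 0] for _ in range(3)]  # [G+C count, total count] per position
    for seq in coding_seqs:
        for i, base in enumerate(seq.upper()):
            if base in "ACGT":
                pos = i % 3
                counts[pos][1] += 1
                if base in "GC":
                    counts[pos][0] += 1
    return [gc / total if total else 0.0 for gc, total in counts]
```

Applied to all annotated genes of a genome, the three returned values would correspond to the columns 1, 2, and 3 of Table 1, and `gc_fraction` applied to the pooled noncoding sequences to the last column.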

Materials and Methods

DNA Sequences. We analyze the nucleotide sequences of the following complete genomes from GenBank release 111 (Benson et al. 1999): R. prowazekii (accession number AJ235269; 1,111,523 bp), Chlamydia trachomatis (AE001273; 1,042,519 bp), Methanococcus jannaschii (L77117; 1,664,970 bp), Mycoplasma genitalium (L43967; 580,073 bp), Borrelia burgdorferi (AE000783; 910,724 bp), and Escherichia coli (U00096; 4,639,221 bp). We partition each genome into complete sets of genes, pseudogenes, and noncoding DNA according to sequence annotations.

Coding Potentials. One prominent feature of coding DNA is the presence of a reading frame. The genetic code maps one trinucleotide (codon) onto one amino acid, such that the frequency of each of the four nucleotides adenine (A), cytosine (C), guanine (G), and thymine (T) varies among the three codon positions in the reading frame. The codon usage depends, e.g., on the degeneracy of the genetic code (Staden and McLachlan 1982), the rate of gene expression (Ikemura 1981), and the G+C content (Li 1997). We study the behavior of two coding potentials to detect the presence of a reading frame in coding and noncoding DNA: (i) correlations between weakly binding nucleotides as a function of their pair distance, and (ii) the χ²-measure of positional nucleotide frequencies. We choose these methods because they do not require prior training on external DNA (Fickett and Tung 1992; Guigó and Fickett 1995).

Fig. 1. Average correlation functions Cww(k) of all coding (a) and noncoding (b) sequences of E. coli. We compute Cww(k) from every sequence of length N ≥ 150 bp. We subtract the bias (Weiss and Herzel 1998) and show the average over all functions Cww(k) weighted proportional to N. Cww(k) shows period-3 oscillations for coding sequences. For noncoding sequences, Cww(k) shows no strong periodicities, indicating the absence of a reading frame.

Frame-dependent nucleotide frequencies induce sequence periodicities that can be quantified by correlation functions (Trifonov and Sussman 1980; Shepherd 1981), or equivalently, by Fourier spectra (Michel 1986). Correlation functions measure the excess of nucleotide pairs at a distance k. It has been shown (Herzel et al. 1998) that correlation functions exhibit particularly strong periodicities when calculated between weakly binding nucleotides W (= A or T). We calculate the W–W correlations by counting the number Nww(k) of W–W pairs at distance k. There are altogether N − k pairs in a sequence of length N. We compute the frequency Pww(k) of W–W pairs at a distance k by

$$P_{WW}(k) = \frac{N_{WW}(k)}{N - k}. \tag{1}$$

If the nucleotides at a distance k are statistically independent, we have Pww(k) = pw × pw, where pw denotes the frequency of the single nucleotide W. The difference

$$C_{WW}(k) \equiv P_{WW}(k) - p_W \times p_W \tag{2}$$

measures correlations at a distance k. A positive (negative) value of Cww(k) states that there are more (fewer) W–W pairs at distance k than expected by chance. We obtain the average correlation function for coding (noncoding) sequences by first computing the correlation function for each individual coding (noncoding) sequence of full length N ≥ 150 bp. Then, we subtract the statistical bias of the correlation functions according to Weiss and Herzel (1998) and average over all coding (noncoding) sequences by using as weights the length N of each sequence. We will show below that period-3 oscillations of correlation functions can be used to probe the presence of a reading frame.

We use χ² (chi square) to measure the statistical significance of the presence of a reading frame. Denote the joint frequency of finding nucleotide b at codon position l ∈ {1, 2, 3} in the reading frame by $P_{b,l}$ (here b = 1 refers to A, b = 2 to C, b = 3 to G, and b = 4 to T). The presence of a reading frame can then be detected by a 4 × 3 frame dependence matrix with these 12 elements $P_{b,l}$. If the nucleotide frequencies are independent of the codon positions l, we have $P_{b,l} = p_b \times q_l$, where $p_b = \sum_l P_{b,l}$ is the overall frequency of nucleotide b and $q_l = \sum_b P_{b,l}$ is the frequency with which the position l in the reading frame occurs. We define the χ²-measure as

$$\chi^2 \equiv L \cdot \sum_{b=1}^{4} \sum_{l=1}^{3} \frac{\left(P_{b,l} - p_b \cdot q_l\right)^2}{p_b \cdot q_l}. \tag{3}$$

Equation 3 accumulates all positional nucleotide dependencies. A large value of χ² indicates that $P_{b,l}$ differs from the value expected from the overall composition. Since χ² is invariant under shifts of a reading frame, it requires no knowledge of the correct reading frame. Note that the χ²-measure is similar to the position asymmetry measure, which belongs to the most effective frame-independent measures in distinguishing coding from noncoding DNA (Fickett and Tung 1992). In fact, applying Equation 3 to the benchmark test of Fickett and Tung (1992), we find that χ² distinguishes coding from noncoding DNA as accurately as the position asymmetry (data not shown). The merit of the χ²-measure stems (i) from its well-known χ²-probability density, and (ii) from its a priori independence of the G+C content (the χ²-probability density is independent of DNA composition). Many coding potentials require training on species-dependent external data. The application of Equations 2 and 3 requires no training, and it provides simple but powerful means to detect the presence of a reading frame for any species under consideration.

Computer Programs. We develop computer programs in C and PERL that perform the analyses of the nucleotide compositions, mono- and dinucleotide frequencies, discriminative accuracy, and single-nucleotide substitutions and insertions/deletions as well as the partition of genome sequences from GenBank. They are available from the authors on request.
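As an illustration of Equation 3, the following Python sketch computes χ² for a single fragment. It is not the authors' C/PERL code; the function name `chi2_frame` is hypothetical, and the sketch assumes that the codon position is taken as the index modulo 3 from the fragment start, which is admissible because χ² is invariant under frame shifts.

```python
from collections import Counter

BASES = "ACGT"


def chi2_frame(fragment):
    """Chi-square measure of Eq. 3 for one fragment of length L.

    Codon position l is taken as (index mod 3); the true frame is not needed
    because permuting the three positions leaves the sum unchanged.
    """
    seq = [b for b in fragment.upper() if b in BASES]
    n = len(seq)  # plays the role of L in Eq. 3
    if n == 0:
        return 0.0
    joint = Counter((b, i % 3) for i, b in enumerate(seq))
    # Marginals: p_b (nucleotide frequency) and q_l (position frequency).
    p = {b: sum(joint[(b, l)] for l in range(3)) / n for b in BASES}
    q = [sum(joint[(b, l)] for b in BASES) / n for l in range(3)]
    total = 0.0
    for b in BASES:
        for l in range(3):
            expected = p[b] * q[l]
            if expected > 0:  # skip cells of nucleotides absent from the fragment
                total += (joint[(b, l)] / n - expected) ** 2 / expected
    return n * total
```

For a perfectly periodic fragment such as `"ACG" * 36` the measure is large, whereas a fragment in which every nucleotide is equally frequent at every position, such as `"ACGT" * 27`, gives zero.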

Results

Sequence Periodicities

We compute Cww(k) of the complete E. coli genome to evaluate how accurately it distinguishes coding from noncoding DNA. The E. coli genome is approximately four times larger than the R. prowazekii genome and contains roughly twice the number of noncoding nucleotides. Figure 1 shows Cww(k) of all coding (1a) and noncoding sequences (1b) from the complete E. coli genome. We find that Cww(k) clearly exhibits period-3 oscillations for coding sequences, and we find no such oscillations for noncoding sequences.

Figure 2 shows Cww(k) for all coding sequences (2a), noncoding sequences (2b), and annotated pseudogenes (2c) from R. prowazekii. For genes and pseudogenes, we find strong oscillations of period 3. There are only 12 documented pseudogenes, and hence the corresponding average correlation function exhibits larger statistical fluctuations than Cww(k) of coding and noncoding sequences. In contrast to genes and pseudogenes, we find almost no triplet periodicity for noncoding sequences. Figure 2b shows only a small signal at short distances. When we calculate correlation functions of pairs different from W–W or the mutual information function (Ebeling et al. 1987), we obtain quantitatively similar results.

Fig. 2. Cww(k) of all coding sequences (a), noncoding sequences (b), and pseudogenes (c) of R. prowazekii. We compute Cww(k) as described in Fig. 1. Cww(k) shows period-3 oscillations for genes and for pseudogenes. As in Fig. 1, Cww(k) shows no strong periodicities for noncoding sequences.

There exists a 30-kb sequence region at 888–916 kb that contains many noncoding sequences (42%) and pseudogenes (11%). In this region, noncoding DNA has a slightly but significantly higher G+C content than in other regions of the genome. This indicates that this region may correspond to neutralized genes (Andersson et al. 1998). We calculate Cww(k) of noncoding sequences and pseudogenes in this region. Since Cww(k) exhibits larger fluctuations as compared to Figs. 2a and 2b due to the limited number of nucleotides, we apply the Fourier transform to detect periodic nucleotide variations in terms of frequencies. Figure 3 shows the Fourier transform of Cww(k) for noncoding sequences and pseudogenes of the above region. We find a pronounced signal at the frequency f = 1/3 in pseudogenes, which corresponds to period-3 oscillations, but no such signal for noncoding sequences. Hence, correlation functions as well as their Fourier transforms do not provide clear evidence that a large fraction of noncoding sequences of R. prowazekii are neutralized genes.
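The W–W correlation function of Equations 1 and 2 and the strength of its period-3 component can be sketched in Python as follows. This is an illustration under simplifying assumptions: the bias correction of Weiss and Herzel (1998) and the length-weighted averaging over sequences are omitted, `kmax` is assumed to be smaller than the sequence length, and the function names and the normalization of the Fourier power are hypothetical choices rather than the authors' definitions.

```python
import cmath


def cww(seq, kmax):
    """C_WW(k) = P_WW(k) - p_W * p_W for k = 1..kmax (Eqs. 1 and 2).

    No bias correction is applied here; assumes kmax < len(seq).
    """
    w = [1 if base in "AT" else 0 for base in seq.upper()]
    n = len(w)
    p_w = sum(w) / n
    values = []
    for k in range(1, kmax + 1):
        n_ww = sum(w[i] * w[i + k] for i in range(n - k))  # W-W pairs at distance k
        values.append(n_ww / (n - k) - p_w * p_w)
    return values


def fourier_power(c, f):
    """Power of the discrete Fourier sum of c(k), k = 1..len(c), at frequency f."""
    s = sum(ck * cmath.exp(-2j * cmath.pi * f * k) for k, ck in enumerate(c, start=1))
    return abs(s) ** 2 / len(c)
```

A sequence with a reading-frame-like structure, for example `"ATG" * 50`, yields positive values of C_WW at k = 3, 6, 9, … and a Fourier power at f = 1/3 that clearly exceeds the power at other frequencies.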

Fig. 3. Fourier transform of Cww(k) for all pseudogenes and noncoding sequences of the 30-kb region (Cww(k) is displayed in the inset). We first compute Cww(k) as described in Fig. 1. Then, we compute the Fourier transform of Cww(k). We find a strong signal at f = 1/3 for pseudogenes but no such signal for noncoding sequences.

Distinguishing Coding from Noncoding DNA

We study the distribution of χ² of sequences cut into fragments of length L. We extract all protein-coding sequences, pseudogenes, and noncoding sequences longer than L bp and cut these sequences into nonoverlapping fragments of length L, starting at the 5′-end. For each sequence fragment, we calculate χ². The presence of a reading frame leads to positive χ² values. We expect the χ² values of noncoding DNA to be small due to the absence of any reading frame, whereas we expect that coding DNA will show greater χ² values due to the presence of a reading frame.

Figure 4 shows the histograms of χ² for E. coli for L = 108 bp. We find that both coding and noncoding DNA have unimodal χ² histograms with distinct maxima. The histograms overlap due to the finite fragment length. Let the true positives (negatives), TP (TN), denote the fraction of coding (noncoding) sequences correctly predicted as coding (noncoding). It is customary (Fickett and Tung 1992) to define the accuracy of a coding potential by (TP + TN)/2, where the threshold above which a sequence is predicted as coding is set such that TP = TN. The accuracy with which χ² distinguishes coding from noncoding E. coli DNA is 73.8% for 108 bp. Table 2a also shows the accuracy for sequence fragments of lengths L equal to 54 and 162 bp. Table 2b shows the mean values and the standard deviations of log χ² (natural base).

Fig. 4. Histogram of χ² for 108-bp-long fragments of coding and noncoding sequences from E. coli. For noncoding DNA, χ² is centered at smaller values than for coding DNA. The overlap of the histograms determines the accuracy by which we can distinguish coding from noncoding DNA (cf. Table 2). For noncoding DNA, the histogram of χ² can be approximated by the probability density P(χ²) = (χ/2)⁴ × exp(−χ²/2).

Figure 5 shows the χ² histograms of coding and noncoding sequence fragments from R. prowazekii. The histograms are similar to the histograms of χ² of coding and noncoding sequences for E. coli, and hence the accuracy of 73.3% (Table 2a) is comparable to the accuracy computed for E. coli. Table 2b shows that the mean values and standard deviations of the χ² histograms of coding and noncoding sequences are indeed similar for E. coli and R. prowazekii. It also shows that the mean values of log χ² for pseudogenes are close to the mean values for genes.
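The accuracy criterion (TP + TN)/2 with the threshold chosen such that TP = TN can be illustrated by the following Python sketch. It assumes that the χ² values of coding and noncoding fragments are given as two lists; because real data are discrete, the sketch picks the candidate threshold at which TP and TN are closest rather than exactly equal. The function name is hypothetical.

```python
def balanced_accuracy(coding_scores, noncoding_scores):
    """Return ((TP + TN) / 2, threshold) at the threshold where TP is closest to TN.

    A fragment is predicted as coding if its score exceeds the threshold.
    """
    candidates = sorted(set(coding_scores) | set(noncoding_scores))
    best = None
    for threshold in candidates:
        tp = sum(s > threshold for s in coding_scores) / len(coding_scores)
        tn = sum(s <= threshold for s in noncoding_scores) / len(noncoding_scores)
        gap = abs(tp - tn)
        if best is None or gap < best[0]:
            best = (gap, (tp + tn) / 2, threshold)
    return best[1], best[2]
```

With fully separated score lists the function returns an accuracy of 1.0, and the accuracy falls toward 0.5 as the two histograms overlap more strongly.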

We find that the χ²-measure can clearly distinguish coding and noncoding DNA from both E. coli and R. prowazekii. Hence, it does not provide clear evidence that a large fraction of noncoding sequences of R. prowazekii are neutralized genes.

Fig. 5. Histogram of χ² for 108-bp-long fragments of coding and noncoding sequences from R. prowazekii and the probability density P(χ²). The histograms are comparable to those in Fig. 4.

Table 2. A comparison of the χ² performance for E. coli and R. prowazekii

(a) Accuracy of χ²-measure

                  Length L (bp)
Species           54        108       162
R. prowazekii     65.9%     73.3%     84.1%
E. coli           64.1%     73.8%     80.6%

(b) Mean (standard deviation) of log χ²

                                 Length L (bp)
Species           Sequences      54             108            162
R. prowazekii     coding         2.18 (0.55)    2.55 (0.53)    2.84 (0.50)
E. coli           coding         2.09 (0.57)    2.40 (0.54)    2.64 (0.52)
R. prowazekii     pseudogenes    2.02 (0.59)    2.29 (0.59)    2.48 (0.66)
R. prowazekii     noncoding      1.73 (0.61)    1.73 (0.62)    1.75 (0.63)
E. coli           noncoding      1.68 (0.63)    1.68 (0.66)    1.69 (0.65)
                  random         1.61 (0.63)    1.61 (0.63)    1.61 (0.63)

Part a shows the percentage of correctly predicted coding and noncoding regions for the χ²-measure for sequence fragment lengths L equal to 54, 108, and 162 bp. Part b shows the mean values μ and the standard deviations σ (shown in parentheses) of log χ² (natural base) for genes, pseudogenes, noncoding, and random sequence fragments of length L. For random sequences, μ = log 2 + 3/2 − γ and σ = (π²/6 − 5/4)^(1/2) ≈ 0.63 can be calculated from the χ² probability density, where γ ≈ 0.58 denotes Euler’s constant. The mean values of noncoding DNA are different from the mean values of genes and pseudogenes but close to the theoretical values for random sequences.

For random sequences, the columns of the frame dependence matrix $P_{b,l}$ are statistically indistinguishable. The χ² histogram can then be approximated by a χ² probability density with six degrees of freedom (Kullback 1959). Figures 4 and 5 show that the histograms for noncoding sequences in E. coli and R. prowazekii are close to this theoretical probability density. Quantitatively, we compare the theoretical mean and standard deviation of log χ² with the values obtained from noncoding DNA of E. coli and R. prowazekii. For random sequences both mean (μ) and standard deviation (σ) can be analytically derived using standard references (Abramowitz and Stegun 1965). Table 2b shows in the last row the theoretical values of μ and σ. Figures 4 and 5 show small shifts of the histograms of noncoding DNA as compared to the probability density of random sequences, which might reflect unannotated pseudogenes, triplet repeats, or periodicities in RNA-encoding sequences (Trifonov and Bettecken 1997).

We perform a Kolmogorov-Smirnov (K-S) test on the cumulative χ² values in order to test the statistical significance of these deviations. The K-S test shows that the observed noncoding χ² histograms of E. coli and R. prowazekii are significantly different (p < 0.0001) from the probability density. That is, for both E. coli and R. prowazekii we observe deviations of noncoding DNA from a purely random sequence, so the data are not incompatible with the hypothesis of Andersson et al. (1998).
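The theoretical values for random sequences and the K-S comparison can be sketched with the Python standard library. The sketch relies on two standard facts about a χ² distribution with six degrees of freedom: the mean and variance of its logarithm are log 2 + ψ(3) and ψ′(3), which give the expressions quoted in Table 2, and its cumulative distribution has the closed form used below. The function names are hypothetical, and only the K-S statistic is computed, not the p-value reported in the text.

```python
import math

EULER_GAMMA = 0.5772156649015329  # Euler's constant


def log_chi2_moments_6dof():
    """Mean and standard deviation of log(chi^2) for six degrees of freedom."""
    mean = math.log(2) + 1.5 - EULER_GAMMA   # log 2 + 3/2 - gamma
    sd = math.sqrt(math.pi ** 2 / 6 - 1.25)  # (pi^2/6 - 5/4)^(1/2)
    return mean, sd


def chi2_cdf_6dof(x):
    """Cumulative distribution of chi^2 with six degrees of freedom."""
    if x <= 0:
        return 0.0
    h = x / 2
    return 1.0 - math.exp(-h) * (1 + h + h * h / 2)


def ks_statistic(values, cdf):
    """Largest distance between the empirical distribution and a model CDF."""
    xs = sorted(values)
    n = len(xs)
    d = 0.0
    for i, x in enumerate(xs):
        f = cdf(x)
        d = max(d, (i + 1) / n - f, f - i / n)
    return d
```

Calling `ks_statistic(chi2_values, chi2_cdf_6dof)` on the χ² values of noncoding fragments gives the statistic whose significance a K-S test would then assess.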


Fig. 6. Average χ² values for 108-bp-long fragments of coding sequences of E. coli and R. prowazekii generated by the substitution model (the inset shows the accuracy as a function of t). The χ²-average reaches the value for noncoding sequences after t_min ≈ 70 steps, and the inset shows that the accuracy exhibits a minimum at t_min. Hence, the coding potential vanishes after 105 single-nucleotide substitutions. As further substitutions occur, the χ²-average falls below the value for noncoding sequences and eventually saturates. Consequently, the accuracy rises.

Modeling Single-Nucleotide Mutations

To gain some insight into how many mutations are required to reduce the coding potential of possibly degraded coding sequences to the small values that we observe in noncoding DNA of R. prowazekii, we introduce two simple models that simulate degradation of originally coding sequences into random sequences. In this way we infer how (i) substitutions and (ii) insertions/deletions affect the coding potential χ². We “mutate” coding sequences according to the following procedure:

1. Extract all protein-coding sequences longer than L bp of a chosen species, and cut these sequences into nonoverlapping fragments of length L, starting at the 5′-end.
2. For each sequence of length L, randomly and uniformly choose two positions ℓi, ℓj ∈ [1, L]. Interchange the nucleotide ni at ℓi with the nucleotide nj at ℓj (substitution operator) or reposition ni in between ℓj and ℓj+1 (insertions/deletions operator). For both substitutions and insertions/deletions the overall composition is precisely maintained throughout each mutation step. At each step there occur either two substitutions (at positions ℓi and ℓj) or a deletion (at ℓi) and an insertion (at ℓj), so each step accounts for two single-nucleotide mutations.
3. For each mutation step t, calculate χ² for each sequence, and plot the average of χ² over all sequences.

We investigate the decay of χ² for E. coli and R. prowazekii for L = 108 bp. Figures 6 and 7 show for substitutions and insertions/deletions, respectively, average values of χ² generated by the above models for t = 1, . . . , 1000 mutation steps. Each figure also shows in the inset the accuracy for distinguishing coding sequences from noncoding sequences as a function of t. Taking into account that a quarter of the substitutions exchange identical nucleotides, we find (cf. Fig. 6) that for 108 bp the accumulation of approximately 105 substitutions is required to delete the coding potential. Furthermore, Fig. 7 shows that insertions/deletions distort the coding potential more rapidly. For 108 bp, we find that approximately 14 insertions or deletions are sufficient to delete the coding potential. One obvious reason for these different time scales is that substitutions do not distort the reading frame, whereas insertions/deletions do.

We now study Cww(k) for 108-bp-long coding sequence fragments of R. prowazekii generated by (i) the substitution and (ii) the insertions/deletions model. We calculate the average correlation function Cww(k) of the above mutated coding sequences (i) after 45 and 105 substitutions, and (ii) after 4 and 12 insertions/deletions. We evaluate whether Cww(k) detects sequence periodicities in mutated coding sequences. Figure 8 shows Cww(k) of all mutated coding sequences after substitutions and of the unmutated sequences as control. We find that Cww(k) clearly exhibits period-3 oscillations for the unmutated state. For the mutated states, the oscillation amplitude remains constant over the distance k but decreases steadily as the number of mutations increases. Figure 9 shows Cww(k) of all coding sequences for the unmutated and mutated states after insertions/deletions. In contrast to Fig. 8, we find a decay of Cww(k) leaving almost no trace of any periodicity for mutated coding sequences after 12 insertions/deletions.
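The two mutation operators of the procedure above can be sketched in Python as follows. This is an illustrative reading of step 2, not the authors' program: the function names, the seeded random generator, and the choice to draw the reinsertion point after the nucleotide has been removed are assumptions made for this sketch.

```python
import random


def substitution_step(seq, rng):
    """Swap the nucleotides at two uniformly chosen positions (two substitutions)."""
    i = rng.randrange(len(seq))
    j = rng.randrange(len(seq))
    seq[i], seq[j] = seq[j], seq[i]


def indel_step(seq, rng):
    """Delete the nucleotide at one random position and reinsert it at another."""
    i = rng.randrange(len(seq))
    base = seq.pop(i)                # deletion at position i
    j = rng.randrange(len(seq) + 1)  # insertion point in the shortened fragment
    seq.insert(j, base)


def mutate(fragment, steps, step_fn, seed=0):
    """Apply `steps` mutation steps to a fragment; length and composition are kept."""
    rng = random.Random(seed)
    seq = list(fragment)
    for _ in range(steps):
        step_fn(seq, rng)
    return "".join(seq)
```

The step counts in Figs. 6 and 7 appear consistent with the mutation counts quoted in the text: 70 substitution steps × 2 substitutions per step × 3/4 (excluding exchanges of identical nucleotides) gives 105 substitutions, and 7 insertion/deletion steps × 2 gives 14 insertions or deletions.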


Fig. 7. Average χ² values for 108-bp-long fragments of coding sequences of E. coli and R. prowazekii generated by the insertions/deletions model (the inset shows the accuracy as a function of t). The χ²-average reaches the value for noncoding sequences after t_min ≈ 7 steps, and hence the coding potential vanishes after 14 insertions/deletions.

Fig. 8. Cww(k) for 108-bp-long fragments of all coding sequences of R. prowazekii generated by the substitution model. Part a shows Cww(k) for the unmutated state, part b after 45 mutations, and part c after 105 mutations. We compute Cww(k) as described in Fig. 1. Cww(k) shows period-3 oscillations for genes in their unmutated state, while after 45 mutations the amplitude is reduced. After 105 mutations, the model generates correlations qualitatively similar to Cww(k) of noncoding sequences of R. prowazekii (cf. Fig. 2b).

The different functional form of Cww(k) of sequences generated by the two mutation models can be interpreted as follows. Substitutions do not shift the reading frame, so they reduce the amplitude of Cww(k) without introducing a decay with the distance k. Insertions/deletions affect the length distribution of sequence segments with the same reading frame (Herzel and Grosse 1997) and induce a decay of the envelope of Cww(k).

Fig. 9. Cww(k) for 108-bp-long fragments of all coding sequences of R. prowazekii generated by the insertions/deletions model. Part a shows Cww(k) for the unmutated state, part b after 4 mutations, and part c after 12 mutations. We compute Cww(k) as described in Fig. 1. In contrast to period-3 oscillations for genes in their unmutated state, after four mutations the model gives rise to a decay of Cww(k). After 12 mutations the similarity to Cww(k) of noncoding sequences becomes apparent (cf. Fig. 2b).

Discussion

The problem we are addressing in this study is the search for possible explanations for the high fraction of noncoding DNA in R. prowazekii. Andersson et al. (1998) hypothesized that a large fraction of noncoding sequences in R. prowazekii may consist of remnants of neutralized genes, based on the fact that the G+C content of noncoding sequences is higher than in the third codon position of the reading frame. We find that the G+C content of noncoding sequences in three other bacteria also exceeds the G+C content of the third codon position. Hence, we search for additional statistical traces of former coding DNA in the large fraction of noncoding DNA by studying correlation functions Cww(k) and the χ²-measure. Neither Cww(k) nor χ² relies on the presence of stop codons or on an explicit knowledge of the correct reading frame. Cww(k) and χ² detect the presence of a reading frame, and neither requires prior training. The analyses of the E. coli and the R. prowazekii genomes show that Cww(k) and χ² can distinguish coding from noncoding DNA.

The R. prowazekii genome contains 12 documented pseudogenes. The metK gene (encoding S-adenosylmethionine synthetase) belongs to this class of pseudogenes. Zomorodipour and Andersson (1999) show that in-frame stop codons and frameshift mutations occur in metK and that deletions occur more frequently than insertions, leading eventually to the elimination of metK from the R. prowazekii genome. We conduct an analysis for the combined set of metK and the 11 remaining pseudogenes in R. prowazekii, and we find that the resulting χ² histogram is close but not identical to the χ² histogram for coding DNA (cf. Table 2b). That is, the average correlation function Cww(k) (cf. Fig. 2c) and the χ²-measure are able to detect the still remaining coding potential of pseudogenes.

When we analyze the 23% noncoding DNA in R. prowazekii, we find no clear indications of a coding potential. Cww(k) and the χ² histograms show no features that distinguish the behavior of Cww(k) or χ² for E. coli from that for R. prowazekii. The slight shift, by which the χ² histograms of noncoding sequences deviate from the theoretical curve of random sequences, is present in the χ² histograms of both E. coli and R. prowazekii. Hence, neither Cww(k) nor χ² gives clear evidence that a large fraction of noncoding sequences of R. prowazekii are neutralized genes. In the remaining part of this section, we argue for another possible reason for this high fraction of noncoding DNA.

Why So Much Noncoding DNA?

We found no clear evidence that noncoding sequences contain many remnants of degraded genes. One possible reason could be that mutations have removed almost all traces of a coding potential. We use two models for single-nucleotide substitutions and insertions/deletions to quantify the expected number of mutations required to decrease the coding potential of coding DNA to the small values observed for noncoding DNA in R. prowazekii. The weak sequence periodicities observed in Fig. 2b can be reproduced for coding sequences of length 108 bp with simple models using about 60 substitutions together with 8 insertions/deletions.

If mutations have indeed removed the coding potential, the destruction of a reading frame must have been faster than the time necessary for the elimination of pseudogenes. As the mutation rate in bacteria is relatively low (Li 1997), the almost complete removal of the coding potential would occur on a long time scale. The mutation rate for E. coli has been estimated to be 5.4 × 10⁻¹⁰ per bp per replication (Drake et al. 1998). The R. prowazekii genome hosts genes for the repair of ultraviolet-induced DNA damage (Andersson et al. 1998), so its mutation rate might be comparable. The mutation rate together with a replication time of about 10 h (Winkler 1995; Policastro et al. 1997) implies a time scale of several million years to reach the mutational equilibrium. An adaptation to the human host should have happened on a much shorter time scale. Hence, we consider the following additional possibility to explain the large fraction of noncoding DNA.

The large fraction of noncoding DNA might reflect a reduced selection pressure acting on the intracellular parasite R. prowazekii. For quickly replicating bacteria, such as E. coli, the selection pressure would soon remove nonfunctional DNA. However, a fast growth of intracellular parasites could be harmful to their hosts. Since human body lice will feed only on the living host, the success of transmission is related to the host's longevity (Winkler 1995). In addition, R. prowazekii is under no pressure to compete with other bacterial species colonizing the same niche. Rather, its evolutionary success depends primarily on successful transmission. Hence, it might be evolutionarily advantageous to grow slowly. And, indeed, a slow growth rate for R. prowazekii has been reported (Winkler 1995). A slow replication rate leads to a reduced selection pressure on the R. prowazekii genome, and noncoding DNA can accumulate. A comprehensive comparison of replication rates and fractions of noncoding DNA would be desirable in future studies.

Acknowledgments. We thank R. Borriss, P. Hammerstein, W. Hess, A. Kowald, and A. Telschow for illuminating discussions; the referees for their helpful comments; and the Deutsche Forschungsgemeinschaft (DFG) and the Graduate Programme “Dynamics and Evolution” (DFGGK 268) for financial support for this study.

References

Abramowitz M, Stegun IA (1965) Handbook of mathematical functions. New York: Dover
Altschul SF, Gish W, Miller W, Meyers EW, Lipman DJ (1990) Basic local alignment search tool. J Mol Biol 215:403–410
Andersson SGE, Sharp PM (1996) Codon usage and base composition in Rickettsia prowazekii. J Mol Evol 42:525–536
Andersson SGE, Zomorodipour A, Andersson JO, Sicheritz-Pontén T, Alsmark UCM, Podowski RM, Näslund AK, Eriksson A-S, Winkler HH, Kurland CG (1998) The genome sequence of Rickettsia prowazekii and the origin of mitochondria. Nature 396:133–140
Benson DA, Boguski MS, Lipman DJ, Ostell J, Ouelette BFF, Rapp BA, Wheeler DL (1999) GenBank. Nucleic Acids Res 27:12–17
Bernardi G (1989) The isochore organization of the human genome. Ann Rev Genet 23:637–661
Blattner FR, Plunkett G III, Bloch FA, Perna NT, Burland V, Riley M, Collado-Vides J, Glasner JD, Rode CK, Mayhew GF, et al. (1997) The complete genome sequence of Escherichia coli K-12. Science 277:1453–1462
Drake JW, Charlesworth B, Charlesworth D, Crow JF (1998) Rates of spontaneous mutation. Genetics 148:1667–1686
Ebeling W, Feistel R, Herzel H (1987) Dynamics and complexity of biomolecules. Physica Scripta 35:761–768
Fickett JW, Tung C-S (1992) Assessment of protein coding measures. Nucleic Acids Res 20:6441–6450
Fleischmann RD, Adams MD, White O, Clayton RA, Kirkness EF, Kerlavage AR, Bult CJ, Tomb J, Dougherty BA, Merrick JM, et al. (1995) Whole-genome random sequencing and assembly of Haemophilus influenzae Rd. Science 269:496–512
Fraser CM, Gocayne JD, White O, Adams MD, Clayton RA, Fleischmann RD, Bult CJ, Kerlavage AR, Sutton G, Kelly JM, et al. (1995) The minimal gene complement of Mycoplasma genitalium. Science 270:397–403
Fraser CM, Casjens S, Hung WM, Sutton GG, Clayton R, Lathigra R, White O, Ketchum KA, Dodson R, Hickey EK, et al. (1997) Genomic sequence of a Lyme disease spirochaete, Borrelia burgdorferi. Nature 390:580–586
Guigó R, Fickett JW (1995) Distinctive sequence features in protein coding, genic noncoding, and intergenic human DNA. J Mol Biol 253:51–60
Hackstadt T (1996) The biology of Rickettsiae. Infect Agents Dis 5:127–143
Herzel H, Grosse I (1997) Correlations in DNA sequences: the role of protein coding segments. Phys Rev E 55:800–810
Herzel H, Weiss O, Trifonov EN (1998) Sequence periodicity in complete genomes of archaea suggests positive supercoiling. J Biol Struct Dyn 16:341–345
Ikemura T (1981) Correlation between the abundance of Escherichia coli transfer RNAs and the occurrence of the respective codons in its protein genes. J Mol Biol 146:1–21
Kullback S (1959) Information theory and statistics. New York: Dover Publishing
Li W-H (1997) Molecular evolution. Sunderland, MA: Sinauer Associates
Lobry JR (1996) Asymmetric substitution patterns in the two DNA strands of bacteria. Mol Biol Evol 13:660–665
Margulis L (1970) Origins of eukaryotic cells. New Haven, CT: Yale University Press
Michel CJ (1986) New statistical approach to discriminate between protein coding and non-coding regions in DNA sequences and its evaluation. J Theor Biol 120:223–236
Nelson KE, Clayton RA, Gill SR, Gwinn ML, Dodson RJ, Haft DH, Hickey EK, Peterson JD, Nelson WC, Ketchum KA, et al. (1999) Evidence for lateral gene transfer between archaea and bacteria from genome sequence of Thermotoga maritima. Nature 399:323–329
Ormsbee RA (1985) Rickettsiae as organisms. Acta Virol 29:432–447
Policastro PF, Munderloh UG, Fischer ER, Hackstadt T (1997) Rickettsia rickettsii growth and temperature-inducible protein expression in embryonic tick cell lines. J Med Microbiol 46:839–845
Raghavan S, Ouzounis CA (1999) Novel coding regions in four complete archaeal genomes. Nucleic Acids Res 27:4405–4408
Shepherd JCW (1981) Method to determine the reading frame of a protein from the purine/pyrimidine genome sequence and its possible evolutionary justification. Proc Natl Acad Sci USA 78:1596–1600
Staden R, McLachlan AD (1982) Codon preference and its use in identifying protein coding regions in long DNA sequences. Nucleic Acids Res 10:141–156
Stephens RS, Kalman S, Lammel C, Fan J, Marathe R, Aravind L, Mitchell W, Olinger L, Tatusov R, Zhao Q (1998) Genome sequence of an obligate intracellular pathogen of humans: Chlamydia trachomatis. Science 282:754–759
Trifonov EN, Bettecken T (1997) Sequence fossils, triplet expansion, and reconstruction of earliest codons. Gene 205:1–6
Trifonov EN, Sussman JL (1980) The pitch of chromatin DNA is reflected in its nucleotide sequence. Proc Natl Acad Sci USA 77:3816–3820
Weiss O, Herzel H (1998) Correlations in protein sequences and property codes. J Theor Biol 190:341–353
Winkler HH (1995) Rickettsia prowazekii, ribosomes, and slow growth. Trends Microbiol 3:196–198
Zomorodipour A, Andersson SGE (1999) Obligate intracellular parasites: Rickettsia prowazekii and Chlamydia trachomatis. FEBS Lett 452:11–15