JOURNAL OF COMPUTATIONAL BIOLOGY Volume 13, Number 8, 2006 © Mary Ann Liebert, Inc. Pp. 1477–1488
A Compression-Based Approach for Coding Sequences Identification. I. Application to Prokaryotic Genomes GIULIA MENCONI1 and ROBERTO MARANGONI2
ABSTRACT
Most gene prediction algorithms for prokaryotes are based on Hidden Markov Models or similar machine-learning approaches, which require the optimization of a large number of parameters. This paper presents a novel method for classifying coding and non-coding regions in prokaryotic genomes, based on a suitably defined compression index of a DNA sequence. The main features of the method are its non-parametric logic and the construction of a dictionary of words extracted from the sequences. These dictionaries can be very useful for further analyses of the genomic sequences themselves. The proposed approach has been applied to several complete prokaryotic genomes, obtaining very good scores of correctly recognized coding and non-coding regions. Several false-positive and false-negative cases have been investigated in detail, revealing that the approach can fail in the presence of highly structured coding regions (e.g., genes coding for modular proteins) or quasi-random non-coding regions (e.g., regions hosting non-functional fragments of copies of functional genes; regions hosting promoters or other protein-binding sequences). We perform only an overall comparison with other gene-finder software, since at this stage we are not interested in building another gene-finder system, but only in exploring the potential of the suggested approach.
Key words: data compression, coding sequences, non-parametric identification, prokaryotes.
1. INTRODUCTION
One of the major problems of the post-genomic era is “reading” genomic sequences in order to extract all the biological information contained in them. The main task, of course, is to identify the genes or, in other words, to distinguish which sequences host information coding for a protein and which do not. Several works in the literature have addressed this problem from different points of view (Borodovsky et al., 2003; Besemer et al., 2001; Salzberg et al., 1998; Delcher et al., 1999; Yada et al., 2001; Zhang and Wang, 2000).
1 Dipartimento di Matematica Applicata, Università di Pisa, Italia. 2 Dipartimento di Informatica, Università di Pisa, Italia.
MENCONI AND MARANGONI
These papers also address further questions, e.g., how to locate the start codon exactly, how to identify possible multiple splicing procedures (one gene, several products), and how to infer other functional implications related to the localized gene. Apart from their specific differences, the general hypothesis underlying these approaches is the observation that non-coding regions often contain repeats, and therefore the information content of non-coding regions is low compared with that of coding regions, which tend to resemble random sequences (i.e., coding regions tend to maximize the Shannon entropy per symbol). A direct consequence is that non-coding regions can be easily compressed, whereas coding regions, in general, cannot. Following this approach, it is necessary to give a precise definition of the compressibility of a given string and to compute a compressibility index on a genomic sequence, trying to relate the index value to the content of the sequence itself. Instead of following this track, most of the proposed algorithms directly use the statistical properties of the sequences, mainly through machine-learning approaches based on Hidden Markov Models or similar methods. The results reported by these methods are very good: the fraction of correctly identified genes is in most cases greater than 90% of the total known genes. Nevertheless, the large number of numerical parameters these methods require, which turns out to be crucial for the adaptability of the method to different genomes, constitutes a practical limit of applicability for large genomes, as pointed out in Zhang and Wang (2000). In this case, a non-parametric approach can be really useful, and this is the main goal of the present work.
In this paper, we are not interested in developing a properly defined gene-finder method, but we would like to point out how this approach can be more flexible and extensible to complex genomes. In fact, this approach can be applied also to eukaryotic genomes with only a few modifications (this will be the subject of a future work); here we present only the applications to prokaryotic genomes in order to make a better comparison with similar methods.
2. APPROACH
The philosophy of the present work is to explore genomic sequences and examine the presence of regularities by using a compression algorithm in order to classify coding and non-coding regions. We used an algorithm called CASToRe (Argenti et al., 2002), which is an original development of the Lempel-Ziv compression scheme (Ziv and Lempel, 1978) and is a non-parametric dictionary-based compressor (not adaptive). This algorithm is very good at finding regularities in sequences and can be used to cluster genomes. In the present work, we show how far a compression-based approach can lead in exploring and characterizing genomic sequences. Since the proposed approach aims at the identification of regularities, the tests we develop are suited to identifying non-coding sequences, which means that coding sequences are deduced by complement from the set of all sequences.
3. METHODS
3.1. Methods for fragment selection/1: fragment compression
Let $\mathcal{A}^*$ be the space of finite-length sequences whose symbols belong to a finite alphabet $\mathcal{A}$. In the case of genomes, the alphabet is $\mathcal{A} = \{A, C, G, T\}$. If $\sigma \in \mathcal{A}^*$, then its length is denoted by $|\sigma|$, and $\sigma^n$ denotes the subsequence of $\sigma$ made of its first $n$ symbols.

Definition 1 (Compression Algorithm). A lossless data compression algorithm is any injective function $Z : \mathcal{A}^* \to \{0, 1\}^*$.

Therefore, a compression algorithm is a reversible coding such that the original string $s$ may be recovered from the encoded string $Z(s)$. Since the coded string contains all the information that is necessary to reconstruct and describe the structural features of the original string, we can consider the length of the coded string as an approximate measure of the quantity of information contained in the original string.

Definition 2 (Computable Information Content). The information content of a finite string $s \in \mathcal{A}^*$ with respect to a compression algorithm $Z$ is defined as
$$I_Z(s) = |Z(s)|. \tag{1}$$

The Information Content of a string $s$ is the length of the coded string $Z(s)$. Usually, this length is measured in bits, but throughout this work we use $\log_4$ instead of $\log_2$ to normalize the numerical results. Moreover, we define another quantity, the complexity of a finite sequence, which provides an estimate of the rate of information content contained in it.

Definition 3 (Computable Complexity of a finite string). The complexity of $s$ with respect to $Z$ is the compression ratio
$$C_Z(s) = \frac{I_Z(s)}{|s|}. \tag{2}$$
Remark. The experimental results presented later in this paper have been obtained by applying a particular compression algorithm, called CASToRe. This algorithm is a modification of the Lempel-Ziv compression scheme LZ78 (Ziv and Lempel, 1978); it was introduced in Argenti et al. (2002), and its algorithmic properties were studied in Bonanno and Menconi (2002).

Definition 4. We say that any exon, intron, or intergenic region is a functional fragment of the genome sequence, following the annotation in the GenBank archives (http://www.ncbi.nlm.nih.gov/Genbank/).

We shall consider the complexity FC of each fragment. The lower it is, the more compressible and regular the fragment is. We have performed a test that selects any fragment whose complexity is lower than 0.8 out of 1. We denote this test by HC, meaning that any selected fragment is highly compressible.
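The HC test can be illustrated with a minimal Python sketch. It does not reproduce CASToRe: a plain LZ78 parsing stands in for it, and the cost of log4(k) + 1 base-4 digits per phrase (a pointer to one of k dictionary entries plus one literal symbol) is an assumed coding model, so the numerical values will differ from those reported in this paper. All function names are hypothetical.

```python
import math
import random

def lz78_parse(seq):
    """LZ78 parsing: each phrase is the longest known phrase plus one new symbol."""
    known, phrases, current = {""}, [], ""
    for symbol in seq:
        current += symbol
        if current not in known:
            known.add(current)
            phrases.append(current)
            current = ""
    if current:                          # leftover tail repeats a known phrase
        phrases.append(current)
    return phrases

def information_content(seq):
    """Approximate I_Z(s) in base-4 units under the assumed per-phrase cost."""
    return sum(math.log(k, 4) + 1.0
               for k, _ in enumerate(lz78_parse(seq), start=1))

def complexity(seq):
    """C_Z(s) = I_Z(s) / |s|, as in Equation (2)."""
    return information_content(seq) / len(seq)

def is_highly_compressible(seq, threshold=0.8):
    """HC test: select the fragment when its complexity is below the threshold."""
    return complexity(seq) < threshold

rng = random.Random(0)
regular = "A" * 2000
irregular = "".join(rng.choice("ACGT") for _ in range(2000))
print(complexity(regular), is_highly_compressible(regular))
print(complexity(irregular), is_highly_compressible(irregular))
```

With this stand-in compressor, a homopolymer run is selected by the HC test, while a pseudo-random sequence of the same length has a markedly higher complexity.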
3.2. Methods for fragment selection/2: sublinearity index
We shall study the growth of the Information Content within a fragment. We have considered how the Information Content varies along each fragment: it is a function of the number of encoded symbols that may be either linear or sublinear. The former means that the sequence is highly irregular, while the latter suggests that there are regular structures within the fragment. This method was introduced in Menconi (2005).

In the following, $\sigma$ denotes any fragment within a genome. The sublinearity index may be defined by means of any adaptive compression algorithm $Z$, although the experimental results refer to the algorithm CASToRe. Such an adaptive compression algorithm $Z$ parses the input sequence into phrases that are used as codewords for the forthcoming uncompressed sequence. At the end of compression, the input sequence is parsed into words belonging to the final dictionary (Bell et al., 1989). Depending on the adaptive compression scheme, the words of the dictionary may occur more than once in the parsing. The coding scheme of the algorithm CASToRe creates a dictionary of words that are all different from each other; therefore, there is a one-to-one correspondence between parsing and dictionary.

Let $N = |\sigma|$ be the length of the input sequence $\sigma$. Let $P(\sigma, Z)$ be the parsing of $\sigma$ with respect to the algorithm $Z$: $P(\sigma, Z) = \{\varphi_1, \varphi_2, \ldots, \varphi_t\}$. Therefore, the input string $\sigma$ is the ordered juxtaposition of the phrases $\varphi_j$. We use the symbol $n_k$ to indicate the total number of symbols encoded up to step $k$ of the encoding procedure: $n_k = \sum_{j=1}^{k} |\varphi_j|$. Since $|\varphi_k| = n_k - n_{k-1}$, we say that $n_k$ is the parsing index corresponding to the phrase $\varphi_k$. The Information Content after $k$ steps is then the quantity $I(n_k) = \sum_{j=1}^{k} I(\varphi_j)$. Obviously, $n_t = \sum_{j=1}^{t} |\varphi_j| = N$ and $I(\sigma) = I(N) = \sum_{j=1}^{t} I(\varphi_j)$.

Since the encoding procedure might not be precise in the early steps as well as in the final steps, we fix two bounds that restrict the admissible values of the parsing index $n_j$. Let $T_{\inf} = 0.2\,|\sigma|$ be the lower bound and $T_{\sup} = 0.9\,|\sigma|$ be the upper bound. The bounds are chosen so that there exist two parsing indexes $n_{\inf}$ and $n_{\sup}$ such that $T_{\inf} \le n_{\inf} < n_{\sup} \le T_{\sup}$.

Definition 5 (Sublinearity index of a finite symbol sequence). Let $q_{\min}$, $q_{\max}$ and $q_Z(\sigma)$ be defined as follows:
$$q_{\min} = \min_{n_k \in D} \frac{I(n_k)}{n_k}, \qquad q_{\max} = \max_{n_k \in D} \frac{I(n_k)}{n_k}, \qquad q_Z(\sigma) = \frac{q_{\min}}{q_{\max}}.$$
The sublinearity index $G_Z(\sigma)$ of the input sequence $\sigma$ with respect to the parsing defined via the algorithm $Z$ is the quantity
$$G_Z(\sigma) = \frac{\log(q_Z(\sigma))}{\log\left(\dfrac{n_{\sup}}{n_{\inf}}\right)} + 1. \tag{3}$$
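As a consistency check on Equation (3), consider an idealized power-law growth of the Information Content. The short derivation below assumes that the set $D$ collects the parsing indexes between $n_{\inf}$ and $n_{\sup}$; this reading is an assumption, since $D$ is not spelled out in the definition.

```latex
% Assumption: I(n_k) = C n_k^{\gamma} with 0 < \gamma \le 1, and D = \{ n_k : n_{\inf} \le n_k \le n_{\sup} \}.
\begin{align*}
\frac{I(n_k)}{n_k} &= C\, n_k^{\gamma - 1}
  && \text{(non-increasing in } n_k \text{, since } \gamma - 1 \le 0\text{)} \\
q_{\min} &= C\, n_{\sup}^{\gamma - 1}, \qquad q_{\max} = C\, n_{\inf}^{\gamma - 1} \\
q_Z(\sigma) &= \left( \frac{n_{\sup}}{n_{\inf}} \right)^{\gamma - 1} \\
G_Z(\sigma) &= \frac{(\gamma - 1)\, \log\!\left( n_{\sup} / n_{\inf} \right)}{\log\!\left( n_{\sup} / n_{\inf} \right)} + 1 = \gamma
\end{align*}
```

Under this assumption, a perfectly linear growth ($\gamma = 1$) gives $G_Z = 1$, and any exponent below 1 is returned directly as the value of the index.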
The definition of the index $G_Z$ deserves some comments. Its main characteristic is that it provides a criterion for identifying regions where the Information grows sublinearly. Two main points hold. First, a sublinear growth of the Information Content indicates the presence of some regularity in the input sequence, and this is much more evident when the index $G_Z$ is significantly smaller than 1. For instance, if $I(n_k) = C n_k^{\gamma}$ with $0 < \gamma \le 1$ (power-law case), then $G_Z = \gamma$. Second, small values of the index $G_Z$ may correspond to different sublinear information growths, including growths other than power-law-like, which consequently might signal different underlying dynamics generating the symbol sequences. Preliminary examples are studied in Menconi (2005).

Performing a sublinearity test, we have extracted any sublinear fragment (SUB), that is, any fragment whose sublinearity index $G_Z$ is smaller than 0.9. The threshold has been fixed according to the empirical principle that a growth of the kind $n^{\gamma}$, where $\gamma$ lies in [0.9, 1], is, on a general basis, equivalent to a linear growth, owing to the finiteness of the sequences under analysis. Finally, we have also selected what we call striking fragments (STRK), that is, any fragment that is both highly compressible and sublinear.
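The three tests can be sketched together in Python. As before, this is only an illustration under stated assumptions: plain LZ78 replaces CASToRe, each phrase is charged log4(k) + 1 base-4 digits, the set D is taken to be the parsing indexes between the 20% and 90% bounds, and n_inf and n_sup are taken as the first and last of those indexes. Function names are hypothetical.

```python
import math
import random

def lz78_phrases(seq):
    """LZ78 parsing: each phrase is the longest known phrase plus one new symbol."""
    known, phrases, current = {""}, [], ""
    for symbol in seq:
        current += symbol
        if current not in known:
            known.add(current)
            phrases.append(current)
            current = ""
    if current:                      # leftover tail repeats a known phrase
        phrases.append(current)
    return phrases

def information_profile(seq):
    """Pairs (n_k, I(n_k)) under the assumed cost of log4(k) + 1 per phrase."""
    n, info, profile = 0, 0.0, []
    for k, phrase in enumerate(lz78_phrases(seq), start=1):
        n += len(phrase)
        info += math.log(k, 4) + 1.0
        profile.append((n, info))
    return profile

def sublinearity_index(seq, lower=0.2, upper=0.9):
    """G_Z as in Equation (3), with D assumed to be the indexes within the bounds."""
    size = len(seq)
    window = [(n, i) for n, i in information_profile(seq)
              if lower * size <= n <= upper * size]
    if len(window) < 2:
        return None                  # too few parsing indexes to define the index
    n_inf, n_sup = window[0][0], window[-1][0]
    rates = [i / n for n, i in window]
    q = min(rates) / max(rates)
    return math.log(q) / math.log(n_sup / n_inf) + 1.0

def classify(seq, hc_threshold=0.8, sub_threshold=0.9):
    """Outcome of the HC, SUB and STRK tests for one fragment."""
    complexity = information_profile(seq)[-1][1] / len(seq)
    g = sublinearity_index(seq)
    hc = complexity < hc_threshold
    sub = g is not None and g < sub_threshold
    return {"HC": hc, "SUB": sub, "STRK": hc and sub}

rng = random.Random(0)
regular = "ACGT" * 500
irregular = "".join(rng.choice("ACGT") for _ in range(2000))
print(sublinearity_index(regular), classify(regular))
print(sublinearity_index(irregular), classify(irregular))
```

Because q is a ratio of a minimum to a maximum, the index computed this way never exceeds 1; the more the information rate decays along the fragment, the further the index drops below 1.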
4. RESULTS AND DISCUSSION
We have applied the SUB, HC, and STRK tests described above to some complete bacterial genomes, listed in Table 1. We briefly recall the terminology used in the tables. A selected and non-coding fragment is a true positive. A selected and coding fragment is a false positive. A non-selected and non-coding fragment is a false negative. A non-selected and coding fragment is a true negative. We have performed a significance analysis of the tests. Positivity to the test for a region means that it is selected as non-coding. Truth means that the region is non-coding. In particular, we have considered the following indexes.
• Sensitivity Sn is the ratio # True positive/# true: how many of the non-coding regions are actually selected.
• Specificity Sp is the ratio # True negative/# false: how many of the coding regions are actually not selected.
Notice that Sn and Sp do not depend on whether the property of being non-coding is dominant in the analysed collection of regions. Therefore, these two indexes may be used as an unbiased measure to compare test results among different genomes.
Table 1. List of Analyzed Genomes and Database Directions

Genome                                 Reference
M. jannaschii DSM 2661                 Bult et al., 1996, version NC_000909.1 GI:15668172
A. fulgidus DSM 4304                   Klenk et al., 1997, version NC_000917.1 GI:11497621
M. thermoautotrophicum str. deltaH     Smith et al., 1997, version NC_000916.1 GI:15678031
P. abyssi GE5                          Lecompte et al., 2001, version NC_000868.1 GI:14518450
A. aeolicus VF5                        Deckert et al., 1998, version NC_000918.1 GI:15282445
E. coli CFT073                         Welch et al., 2002, version NC_004431.1 GI:26245917
B. subtilis subsp. subtilis str. 168   Kunst et al., 1997, version NC_000964.2 GI:50812173
H. influenzae Rd KW20                  Fleischmann et al., 1995, version NC_000907.1 GI:16271976
M. genitalium G-37                     Peterson et al., 1995, version NC_000908.1 GI:12044850
R. prowazekii str. Madrid E            Andersson et al., 1998, version NC_000963.1 GI:15603881
T. maritima MSB8                       Nelson et al., 1999, version NC_000853.1 GI:15642775
• Positive Predictive Value PPV is the ratio # True positive/# positive: how likely it is that a region is non-coding when selected.
• Negative Predictive Value NPV is the ratio # True negative/# negative: how likely it is that a region is coding when not selected by the test.
In order to establish a comparison with other gene-finder software, the index to compare is the Specificity (Sp), since it gives the fraction of coding regions (as known from annotated genomes) correctly identified as coding.

Figures 1 and 2 show some examples of the analysis of the Information in fragments Coding_1963284 and Intergenic_160542, respectively, both from the B. subtilis genome. For both figures, picture (a) shows how the Information Content grows within the fragment and picture (b) shows the length of the words in the dictionary after compression via the algorithm CASToRe, in chronological order. For a detailed description of the rules for building the dictionary, see Argenti et al. (2002). Notice, however, that the dictionary is not derived from an a priori database, either to compress the input sequence or to perform a functional identification deduced from the presence/absence of known words, as in Shibuya and Rigoutsos (2002).

In the coding fragment, the Information grows linearly with a slope corresponding to its complexity C = 0.912, in agreement with its sublinearity index, whose value is $G_Z = 0.993$. The dictionary is made of short words, hence no recurrent motif can be detected. Conversely, in the non-coding fragment the Information grows sublinearly, as shown by its sublinearity index, whose value is $G_Z = 0.781$. In this case, the fragment is also highly compressible, since its complexity is C = 0.733; that is, this is an example of a striking fragment. As for the dictionary, the words become longer and longer, and from the presence of sharp jumps in their length it may be inferred that a weakly repeated structure occurs at least three times in the sequence.
Figure 3 shows that the value of the sublinearity index depends on the fragment length and suggests that $G_Z(L) = O(1 - a \exp(-bL))$ with $a, b \in (0, 1)$. Figure 4 shows the relationship between the Fragment Complexity FC and the sublinearity index $G_Z$ for fragments in the B. subtilis genome. All the studied genomes turn out to share the property that a sublinear growth may correspond to a wide spread of FC values. Indeed, all the fragments that are under the threshold on $G_Z$ show sublinear growth, but their complexity varies from around 0.5 to around 1.2 out of 2. Therefore, the sublinearity index may be more helpful than complexity in identifying distinctive features of regular fragments. This is also confirmed by Tables 2 and 3, which refer to B. subtilis and E. coli, respectively, and are taken as examples of the results of the three tests. The general trend is that the SUB test is the most successful and effective. Furthermore, we notice that the case where a fragment is highly compressible but not sublinear (that is, both positive for HC and negative for STRK) is extremely rare (in the tables, there are only two cases over two genomes). Table 4 shows the results of the SUB test for all bacterial genomes.
FIG. 1. B. subtilis fragment Coding_1963284. (a) Information growth. (b) Length of the words in the dictionary after compression.
4.1. Case studies under SUB test in E. coli and B. subtilis
We have considered how the sensitivity and specificity of the SUB test change as fragment length grows. Table 5 shows a strong dependence of test significance on fragment length. For short fragments (up to 200 nt and up to 500 nt), in both genomes a large number of coding regions are selected (false positives), even though the non-coding regions are selected almost correctly. On the other hand, for fragments longer than 1000 nt, the results are twofold: for E. coli at least the non-coding regions are almost surely predicted, while in B. subtilis, though the number of non-coding regions under this constraint is small, they are extremely irregular and are therefore definitely excluded by the test. Finally, the test is efficient and reliable for fragments of intermediate length, since when the regions are up to 1000 nt long, the significance of the results is high.
FIG. 2. B. subtilis fragment Intergenic_160542. (a) Information growth. (b) Length of the words in the dictionary after compression.
In order to better characterize the features of the fragments that were wrongly selected under the SUB test, we first selected some deeply regular coding fragments from the B. subtilis and E. coli genomes. Second, we extracted some irregular non-coding fragments. By means of homology searches in databanks, we studied the structure of these sequences and tried to infer their biological role.

4.1.1. False positives: coding regular fragments. Coding sequences falsely marked as non-coding code for very particular kinds of proteins: modular proteins or proteins consisting of only a few species of amino acids. These proteins are, of course, not common, and they deviate significantly from the “average” protein, but they are actively expressed and some of them are highly conserved (even in higher organisms). There is no doubt, therefore, that these sequences are “coding,” but their very particular structure, which consists of repeats of a short amino acid motif, can produce false results in our test.
FIG. 3. The relationship between fragment length L and sublinearity index GZ in B. subtilis (points) for fragments of length up to 2000 nt. The upper dashed line is the function 1 − 0.1 exp(−0.005x), whereas the lower dotted line is the function 0.9 − 0.5 exp(−0.005x).
4.1.2. False negatives: non-coding irregular fragments. The fragments erroneously classified as coding by the SUB test mainly belong to two classes:
1. Pseudogenes or non-functional genes, as well as fragments of a gene, which probably derive from a gene duplication event in which the copy has failed to produce a complete and functional gene.
2. Non-coding sequences that are strictly related to transcription regulation: enhancers, promoters, or promoter fragments that have lost their respective genes.
FIG. 4. Comparison between FC and GZ in B. subtilis fragments. The dashed line is the threshold at GZ = 0.9. The fragments whose sublinearity index is under the threshold have been selected under the SUB test.
Table 2. Test Results for B. subtilis

Test   No. selected   True positive   False positive   False negative   True negative   Sn       Sp       PPV      NPV
SUB    3348           2967            381              847              3412            77.79%   89.96%   88.62%   80.11%
HC     123            118             5                3696             3788            3.09%    99.87%   95.93%   50.61%
STRK   121            116             5                3698             3788            3.04%    99.86%   95.87%   50.60%

No. of fragments, 7607; no. non-coding, 3814; no. coding, 3793.
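The four significance indexes follow directly from the outcome counts. The short Python sketch below recomputes them for the SUB test on B. subtilis from the counts listed in Table 2; the function name is hypothetical.

```python
def significance_indexes(tp, fp, fn, tn):
    """Sn, Sp, PPV and NPV from the four outcome counts of a test."""
    return {
        "Sn": tp / (tp + fn),    # true positives over all non-coding fragments
        "Sp": tn / (tn + fp),    # true negatives over all coding fragments
        "PPV": tp / (tp + fp),   # true positives over all selected fragments
        "NPV": tn / (tn + fn),   # true negatives over all non-selected fragments
    }

# SUB test counts for B. subtilis, as listed in Table 2.
sub = significance_indexes(tp=2967, fp=381, fn=847, tn=3412)
for name, value in sub.items():
    print(f"{name}: {100 * value:.2f}%")
```

The denominators make the difference between the indexes explicit: Sn and Sp are normalized by the annotated classes (3814 non-coding and 3793 coding fragments), whereas PPV and NPV are normalized by the outcome of the test (3348 selected and 4259 non-selected fragments).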
Table 3. Test Results for E. coli

Test   No. selected   True positive   False positive   False negative   True negative   Sn       Sp       PPV      NPV
SUB    2622           2404            218              1046             3230            69.68%   93.68%   91.69%   75.54%
HC     66             65              1                3385             3447            1.88%    99.97%   98.48%   50.45%
STRK   66             65              1                3385             3447            1.88%    99.97%   98.48%   50.45%

No. of fragments, 6898; no. non-coding, 3450; no. coding, 3448.
Table 4. Comparison of the SUB Test’s Significance for Several Bacterial Genomes

Genome                                  Sn       Sp
Methanococcus jannaschii                76.99%   88.03%
Archaeoglobus fulgidus                  71.01%   90.17%
Methanobacterium thermoautotrophicum    75.93%   92.30%
Pyrococcus abyssi                       68.54%   95.83%
Aquifex aeolicus                        69.89%   96.68%
Escherichia coli                        69.68%   93.68%
Bacillus subtilis                       77.79%   89.96%
Haemophilus influenzae                  77.03%   92.95%
Mycoplasma genitalium                   53.66%   93.95%
Rickettsia prowazekii                   62.92%   91.39%
Thermotoga maritima                     67.90%   92.01%
Table 5. Significance Analysis for E. coli and B. subtilis under the SUB Test with Different Constraints on Fragment Length

Genome        Constraint                         Sn       Sp
E. coli       Fragments up to 200 nt long        83.27%   12.86%
E. coli       Fragments up to 500 nt long        72.93%   70.35%
E. coli       Fragments up to 1000 nt long       70.05%   87.97%
E. coli       Fragments longer than 1000 nt      20.00%   99.88%
B. subtilis   Fragments up to 200 nt long        89.50%   17.12%
B. subtilis   Fragments up to 500 nt long        80.57%   64.89%
B. subtilis   Fragments up to 1000 nt long       78.14%   83.69%
B. subtilis   Fragments longer than 1000 nt      4.76%    99.86%
In the first case, our test indicates a coding region because it is not able to perform a functional evaluation and therefore cannot distinguish between an actual, active gene and “an evolutive ghost.” In the second case, the sequences are non-coding, because they do not express a gene, but their structure is quasi-random, as observed in coding regions. Therefore, our test can detect them as coding. Usually, these sequences, when functionally active, are physically located close to the controlled gene, and therefore the classification error is very low.
4.2. Comparison with other gene-finder methods
Even though we are not interested in building another gene-finder method, we think it is interesting to give an overall comparison between this approach and other approaches already described in the literature. We refer to the global performance of the GLIMMER (http://www.tigr.org/software/glimmer/), GeneMark (http://opal.biology.gatech.edu/GeneMark/), and ZCURVE (http://tubic.tju.edu.cn/Zcurve_B/) algorithms as reported on their Web pages. All three report a prediction accuracy ranging from around 96% to around 99% of annotated genes correctly predicted in most of the prokaryotic genomes we have analysed. Table 6 reports the global performance of our method. We observe that, even though our approach is far from being optimized, the global performance we obtain is comparable with that of top-scoring methods.
Table 6. Concluding Results on Coding Sequences Identification for the SUB Test

Genome           Annotated coding fragments   Predicted coding fragments   Correctly predicted coding fragments   Annotated coding fragments not found
M. jannaschii    1429                         1587                         88.03%                                 11.97%
A. fulgidus      1496                         1783                         90.17%                                 9.83%
M. thermoaut.    1507                         1754                         92.30%                                 7.70%
P. abyssi        1151                         1466                         95.83%                                 4.17%
A. aeolicus      873                          1112                         96.68%                                 3.32%
E. coli          3448                         4276                         93.68%                                 6.32%
B. subtilis      3793                         4259                         89.96%                                 10.04%
H. influenzae    1460                         1693                         92.95%                                 7.05%
M. genitalium    463                          650                          93.95%                                 6.05%
R. prowazekii    720                          925                          91.39%                                 8.61%
T. maritima      1077                         1337                         92.01%                                 7.99%

The predicted coding fragments in the third column are the fragments not selected by the SUB test. The fourth column shows the percentage of annotated coding fragments that were not selected, i.e., that were correctly predicted as coding; these figures coincide with the Sp values in Table 4. The fifth column shows the percentage of annotated coding fragments that have not been predicted.
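The columns of Table 6 can be cross-checked against the outcome counts of Tables 2 and 3. The Python sketch below does this for B. subtilis and E. coli; the function name is hypothetical, and the only inputs are the true negative and false negative counts of the SUB test together with the number of annotated coding fragments.

```python
def coding_prediction_summary(tn, fn, annotated_coding):
    """Rebuild one row of Table 6 from the SUB outcome counts."""
    predicted_coding = tn + fn                           # every non-selected fragment
    correct_pct = round(100 * tn / annotated_coding, 2)  # annotated coding, not selected
    missed_pct = round(100 - correct_pct, 2)             # annotated coding, selected
    return predicted_coding, correct_pct, missed_pct

print(coding_prediction_summary(tn=3412, fn=847, annotated_coding=3793))   # B. subtilis
print(coding_prediction_summary(tn=3230, fn=1046, annotated_coding=3448))  # E. coli
```

The predicted coding count includes the false negatives, which is why it exceeds the number of annotated coding fragments in every row of the table.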
Moreover, we recall that most of these top-scoring approaches are not extensible to eukaryotic genomes and can give worse results when applied to situations other than those covered by their training sets.
5. CONCLUSION
We have presented the core application of an approach that offers a good possibility of exploring and clustering genomes. The performance of the proposed method for the classification of coding regions is very good, and most of the cases in which it fails can be explained in terms of a functional role played by the examined sequence. Moreover, our approach is extensible and does not require specific training on the studied sequences. A direct consequence of the weak dependence of our approach on the statistical properties of the analyzed sequences is the relatively low importance of the specific codon usage adopted. These remarks strongly suggest a further improvement of the present work, which can be obtained by adding methods for functional evaluation (e.g., a search for promoter regions upstream of a predicted coding sequence) to the coding/non-coding classification.

Another very interesting aspect of the proposed approach is represented by the properties of the dictionaries automatically generated by the compression algorithm. The words stored in them could, in fact, be useful to characterize the biological properties of the examined sequence. Preliminary observations show that dictionaries generated on coding regions contain the typical words of the genetic code. The biological role of the words contained in the dictionaries built on non-coding sequences is still under investigation. The present paper does not explore this point, because prokaryotic genomes are poor in non-coding regions; it will be discussed in detail in another paper, where the presented approach will be applied to eukaryotic organisms.
REFERENCES
Andersson, S.G., Zomorodipour, A., Andersson, J.O., et al. 1998. The genome sequence of Rickettsia prowazekii and the origin of mitochondria. Nature 396, 133–140.
Argenti, F., Benci, V., Cerrai, P., et al. 2002. Information and dynamical systems: a concrete measurement on sporadic dynamics. Chaos Solitons Fractals 13, 461–469.
Bell, T., Witten, I.H., and Cleary, J.G. 1989. Modeling for text compression. ACM Comput. Surv. 21, 557–591.
Besemer, J., Lomsadze, A., and Borodovsky, M. 2001. GeneMarkS: a self-training method for prediction of gene starts in microbial genomes. Implications for finding sequence motifs in regulatory regions. Nucleic Acids Res. 29, 2607–2618.
Bonanno, C., and Menconi, G. 2002. Computational information for the logistic map at the chaos threshold. Disc. Cont. Dyn. Syst. B 2, 415–431.
Borodovsky, M., and McIninch, J. 1993. GenMark: parallel gene recognition for both DNA strands. Comput. Chem. 17, 123–133.
Bult, C.J., White, O., Olsen, G.J., et al. 1996. Complete genome sequence of the methanogenic archaeon, Methanococcus jannaschii. Science 273, 1058–1073.
Deckert, G., Warren, P.V., Gaasterland, T., et al. 1998. The complete genome of the hyperthermophilic bacterium Aquifex aeolicus. Nature 392, 353–358.
Delcher, A.L., Harmon, D., Kasif, S., et al. 1999. Improved microbial gene identification with GLIMMER. Nucleic Acids Res. 27, 4636–4641.
Fleischmann, R.D., Adams, M.D., White, O., et al. 1995. Whole-genome random sequencing and assembly of Haemophilus influenzae Rd. Science 269, 496–512.
Klenk, H.P., et al. 1997. The complete genome sequence of the hyperthermophilic, sulphate-reducing archaeon Archaeoglobus fulgidus. Nature 390, 364–370.
Kunst, F., et al. 1997. The complete genome sequence of the gram-positive bacterium Bacillus subtilis. Nature 390, 249–256.
Lecompte, O., et al. 2001. Genome evolution at the genus level: comparison of three complete genomes of hyperthermophilic archaea. Genome Res. 11, 981–993.
Menconi, G. 2005. Sublinear growth of Information in DNA sequences. Bull. Math. Biol. 67, 737–759.
Nelson, K.E., et al. 1999. Evidence for lateral gene transfer between Archaea and bacteria from genome sequence of Thermotoga maritima. Nature 399, 323–329.
Peterson, S.N., et al. 1995. Characterization of repetitive DNA in the Mycoplasma genitalium genome: possible role in the generation of antigenic variation. Proc. Natl. Acad. Sci. U.S.A. 92, 11829–11833.
Salzberg, S.L., Delcher, A.L., Kasif, S., et al. 1998. Microbial gene identification using interpolated Markov models. Nucleic Acids Res. 26, 544–548.
Shibuya, T., and Rigoutsos, I. 2002. Dictionary-driven prokaryotic gene finding. Nucleic Acids Res. 30, 2710–2725.
Smith, D.R., et al. 1997. Complete genome sequence of Methanobacterium thermoautotrophicum deltaH: functional analysis and comparative genomics. J. Bacteriol. 179, 7135–7155.
Welch, R.A., et al. 2002. Extensive mosaic structure revealed by the complete genome sequence of uropathogenic Escherichia coli. Proc. Natl. Acad. Sci. U.S.A. 99, 17020–17024.
Yada, T., Totoki, Y., Takagi, T., et al. 2001. A novel bacterial gene-finding system with improved accuracy in locating start codons. DNA Res. 8, 97–106.
Zhang, C.-T., and Wang, J. 2000. Recognition of protein coding genes in the yeast genome at better than 95% accuracy based on the Z curve. Nucleic Acids Res. 28, 2804–2814.
Ziv, J., and Lempel, A. 1978. Compression of individual sequences via variable-rate coding. IEEE Trans. Inform. Theor. 24, 530–536.
Address reprint requests to: Dr. Roberto Marangoni Department of Informatics University of Pisa L.go Bruno Pontecorvo, 3 56127 Pisa, Italy E-mail:
[email protected]