Chapter 17

FUNCTIONAL GENOMICS AND ENZYME EVOLUTION
Homologous and Analogous Enzymes Encoded in Microbial Genomes

M.Y. Galperin and E.V. Koonin
National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, Bethesda, Maryland 20894, USA

Abstract:

The first complete genome sequences of cellular life forms have become available just in the last three years. The number of sequenced microbial genomes has been growing exponentially ever since, and at this time we already have complete sequences of twelve bacterial, four archaeal, and one eukaryotic genome. More than fifty different genomes are currently in the works, so it may be reasonable to expect up to a hundred complete genome sequences of different organisms to become available by the year 2000. Most importantly, we can expect to obtain good coverage of the diversity of life, as complete genomes of many phylogenetically divergent bacteria, several crenarchaea and euryarchaea, and at least some eukaryotes should be sequenced in the next few years. This fast progress presents a challenge: to cope with the wealth of information contained in complete genomes and to use it for improving our understanding of various biological phenomena. The success in deciphering complete genomes will largely depend upon our ability to predict the functions of the proteins encoded in each genome and to verify at least some of these predictions. This makes protein sequence analysis followed by experimental testing of the emerging hypotheses, the area of research often referred to as "functional genomics", one of the most important directions of current genome analysis. Here we review the recent progress in the analysis of complete genomes, with special emphasis on using this new information for a better understanding of enzyme evolution.

1. PREDICTING FUNCTIONS FOR THE NEW PROTEINS

A complete functional description of the product of a newly sequenced gene includes identification of the coding region, translation of the DNA into a protein sequence, sequence similarity searches against various databases, identification of potential motifs and structural features of the protein product, assignment of the probable function and, finally, determination of whether this assignment can be considered reliable. The first step in this process, identification of the protein-coding regions in a piece of genomic DNA, is not as simple as one might expect. Even in the absence of introns, identification of open reading frames (ORFs) is often complicated by sequencing errors that result in frameshifts. Often there is no certainty that an ORF, particularly a short one, actually codes for a protein [1]. The problem becomes particularly complex when genes contain introns. These difficulties make the functional assignments for the first sequenced genomes particularly important, as they lay the foundation for the annotation of all the genomes to be sequenced in the future, including the human genome.
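The first step, locating candidate coding regions, can be illustrated with a minimal sketch. It scans only the three forward reading frames for ATG-to-stop stretches. The function name `find_orfs`, the `min_codons` threshold, and the toy sequences are illustrative assumptions; real gene finders also handle the reverse strand, alternative start codons, and introns.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}

def find_orfs(dna, min_codons=3):
    """Return (start, end, frame) for ATG...stop ORFs on the forward strand.

    Coordinates are 0-based and end-exclusive; the stop codon is included.
    min_codons counts the coding codons (ATG included, stop excluded).
    """
    dna = dna.upper()
    orfs = []
    for frame in range(3):
        start = None  # first in-frame ATG seen since the last stop codon
        for i in range(frame, len(dna) - 2, 3):
            codon = dna[i:i + 3]
            if start is None and codon == "ATG":
                start = i
            elif start is not None and codon in STOP_CODONS:
                if (i - start) // 3 >= min_codons:
                    orfs.append((start, i + 3, frame))
                start = None
    return sorted(orfs)

# Toy sequence with one short ORF (ATG AAA TTT GGG TAA) in frame 2.
intact = "CCATGAAATTTGGGTAACC"
# The same sequence with one extra T, mimicking a sequencing error.
shifted = "CCATGAAATTTTGGGTAACC"
print(find_orfs(intact))
print(find_orfs(shifted))
```

In this toy case a single inserted base removes the in-frame stop codon, so the ORF is no longer reported, which is the kind of frameshift artifact the text warns about.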

1.1 Standard methods for functional prediction

The easiest way to identify the function of an unknown protein is to find a protein with a closely similar sequence and an experimentally determined activity. Several algorithms for sequence similarity searches against molecular biology databases have been developed and implemented in robust programs over the years [2, 3]. The most widely used of them are Smith-Waterman [4], available for searching, e.g., at http://croma.ebi.ac.uk/Bic, FASTA [5], available at http://www2.ebi.ac.uk/fasta3, and BLAST [6], available at http://www.ncbi.nlm.nih.gov/BLAST. All these programs offer a reasonable combination of speed and sensitivity, but can also produce false positive hits that need to be analyzed and sorted out by a highly trained biologist (see section 1.2). To increase the search sensitivity even further, Altschul and co-workers [7] recently introduced an extension of the BLAST method, referred to as PSI-BLAST (Position-Specific Iterated BLAST). This program uses the results of a BLAST search to construct a position-specific scoring matrix (profile), which in turn is used as the query for the next iteration. It is available on the NCBI BLAST server (http://www.ncbi.nlm.nih.gov/BLAST).

In many cases, sequence similarity searches find no close neighbors, but reveal a conserved string (pattern) that could be useful for function prediction. When such a pattern includes a set of conserved amino acid residues that are important for protein function and are located within a certain distance from each other, it is often referred to as a sequence motif. Protein sequence motifs are listed in several databases, e.g., PROSITE (http://www.expasy.ch/sprot/prosite.html) [8], BLOCKS (http://www.blocks.fhcrc.org) [9], PRINTS (http://www.biochem.ucl.ac.uk/bsm/dbbrowser/PRINTS/PRINTS.html) [10], PRODOM (http://protein.toulouse.inra.fr) [11], and PFAM (http://www.sanger.ac.uk/Software/Pfam, or http://pfam.wustl.edu) [12], all of which allow similarity searches against the database. The motifs identified in such searches (e.g., ATP-binding or metal-binding) often allow one to predict the probable function(s) of an unknown protein even in the absence of strong database hits.

When sequence similarity or motif searches do not yield any results, certain ideas about the possible functions of an unknown protein can be inferred by predicting its structural features. The presence of signal peptides or mitochondrial targeting sequences can indicate the cellular location of the predicted protein. The presence of multiple hydrophobic segments often indicates a transmembrane topology and can help to attribute the protein to a specific transporter family. Non-globular segments often indicate a role in protein-protein interactions [13].
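As an illustration of motif-based prediction, the sketch below translates a simple PROSITE-style pattern into a regular expression and scans a sequence for it. The pattern used is the widely cited P-loop (ATP/GTP-binding) consensus `[AG]-x(4)-G-K-[ST]`; the helper names and the test sequence are invented for the example, and only the basic pattern syntax (residue sets, `x`, exclusions, repeat counts) is handled.

```python
import re

def prosite_to_regex(pattern):
    """Translate a simple PROSITE-style pattern into a Python regex string."""
    parts = []
    for element in pattern.rstrip(".").split("-"):
        # Separate the residue specification from an optional repeat, e.g. x(4) or x(2,3).
        m = re.fullmatch(r"(.+?)(?:\((\d+)(?:,(\d+))?\))?", element)
        body, low, high = m.group(1), m.group(2), m.group(3)
        if body == "x":
            body = "."                       # any residue
        elif body.startswith("{"):
            body = "[^" + body[1:-1] + "]"   # any residue except those listed
        if low:
            body += "{" + low + ("," + high if high else "") + "}"
        parts.append(body)
    return "".join(parts)

def find_motifs(sequence, pattern):
    """Return (0-based start, matched string) for non-overlapping motif matches."""
    regex = re.compile(prosite_to_regex(pattern))
    return [(m.start(), m.group()) for m in regex.finditer(sequence)]

P_LOOP = "[AG]-x(4)-G-K-[ST]"
toy_protein = "MSTAGPSGSGKSTLLNA"  # made-up sequence containing a P-loop-like stretch
print(prosite_to_regex(P_LOOP))
print(find_motifs(toy_protein, P_LOOP))
```

A match of this kind suggests nucleotide binding but says nothing about the substrate or the reaction, which is why motif hits are usually treated as hints rather than as functional assignments.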

1.2 Sources of systematic errors in functional annotation of genomes

The first exercises in functional annotation of poorly studied microbial genomes resulted in substantially different function assignments for the same proteins [14-16]. A critical analysis of these assignments shows that the most common reasons for questionable or outright wrong functional predictions are unreliable database entries, divergence of sequences and functions in the course of evolution, multi-domain organization of proteins, and low sequence complexity [17].

1.2.1 Unreliable database entries

The reality of genomics is that only relatively few of the protein sequences deposited in the databases have a proven biological activity, verified by expressing the respective protein in a heterologous host. Most of the time, the functions of the proteins encoded in the sequenced genomes are deduced on the basis of their similarity to previously studied proteins. Hence there is a great potential for the propagation of errors. Wrong annotations may easily arise, for example, when an open reading frame (ORF) that complements a mutation in a known gene coding for a particular enzyme is assumed to code for an enzyme with the same specificity. In reality, such an effect could of course be due to suppression of the mutation, provision of a missing enzyme cofactor, or a plethora of other mechanisms. Thus, the ORF that complemented the hemG mutation in Escherichia coli was assumed to code for protoporphyrinogen oxidase, even though it represented a small protein of the flavodoxin type, which usually comprises only one of several subunits of the dehydrogenase complex [18, 19].

In some cases, an ORF is assigned a function based on the functions of other genes of the same operon. Thus, the product of a Streptomyces curacoi gene located downstream of the gene coding for phosphonopyruvate synthase was automatically assumed to catalyze the next step in the pathway, decarboxylation of phosphonopyruvate [20]. This protein sequence, BCPC_STRCA, has been annotated in the protein databases as phosphonopyruvate decarboxylase, even though further research showed that the real phosphonopyruvate decarboxylase is a thiamin pyrophosphate-dependent enzyme with a different molecular weight [21]. Furthermore, once the sequence of the actual phosphonopyruvate decarboxylase was determined, it was named "form 2" of the enzyme. The function of BCPC_STRCA remains unknown, but sequence analysis suggests that it could catalyze hydrolysis or isomerization of phosphonopyruvate [22]. A list of enzymes with, at best, questionable annotations in curated databases (e.g., ARGE_LEPBI, DYR_MYCTU, TYSY_MYCTU) has been published [23], but it probably represents only a fraction of the erroneous database entries.

1.2.2 Multi-domain organization of proteins

Many proteins, particularly eukaryotic ones, are composed of several domains that may have different functions. If two different proteins share the same domain, database searches often show one such protein as the best hit for the other one [17]. Overlooking the multi-domain structure of the protein in question or of its best database hit often causes confusion, particularly when certain domains in these proteins are highly conserved in evolution, which results in highly significant P-values reported by database searching programs. This problem can be circumvented by the simple means of comparing the length of the match for each database hit with the length of the query sequence; a significant difference may indicate differences in domain architecture, and the domain organizations of the query and the hit can subsequently be explored in detail. In practice, however, disregard for the multi-domain organization of proteins is the cause of numerous cases of confusion in genome annotation [17].
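The length comparison described above is easy to automate. The following sketch computes the fraction of the query and of the database sequence covered by each alignment and flags hits that fall below a threshold. The record layout, the 0.8 cutoff, and the example hits are assumptions made for illustration, not values taken from the text.

```python
def alignment_coverage(query_len, subject_len, q_start, q_end, s_start, s_end):
    """Fractions of the query and of the subject covered by one alignment.

    Coordinates are 1-based and inclusive, as reported by most search programs.
    """
    q_cov = (q_end - q_start + 1) / query_len
    s_cov = (s_end - s_start + 1) / subject_len
    return q_cov, s_cov

def flag_partial_hits(query_len, hits, min_coverage=0.8):
    """Return ids of hits whose alignment covers too little of either sequence."""
    flagged = []
    for hit in hits:
        q_cov, s_cov = alignment_coverage(
            query_len, hit["subject_len"],
            hit["q_start"], hit["q_end"], hit["s_start"], hit["s_end"],
        )
        if min(q_cov, s_cov) < min_coverage:
            flagged.append(hit["id"])
    return flagged

# Hypothetical hits for a 300-residue query.
hits = [
    {"id": "hit_A", "subject_len": 310, "q_start": 5, "q_end": 295, "s_start": 8, "s_end": 300},
    # hit_B aligns well to the query, but only to one region of a 900-residue protein.
    {"id": "hit_B", "subject_len": 900, "q_start": 10, "q_end": 290, "s_start": 600, "s_end": 880},
]
print(flag_partial_hits(300, hits))
```

A flagged hit is not necessarily wrong; it only signals that the annotation of the longer protein may describe a domain that the query does not contain.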

1.3 A phylogenetic approach to functional prediction

An approach to protein function prediction based directly on evolutionary relationships is offered by the recently constructed Clusters of Orthologous Groups (COG) database (available at http://www.ncbi.nlm.nih.gov/COG), which contains homologous proteins from each of the completely sequenced genomes [24, 25]. Each COG consists of orthologous sets of proteins from at least three phylogenetic lineages, which are assumed to have evolved from an individual ancestral protein. Since orthologous proteins are most likely to perform the same function in each organism, identification of an unknown protein as a member of a COG immediately suggests its probable function(s). While the fraction of proteins containing ancient conserved regions appears to be ~70% in bacteria and archaea [16], the requirement that a protein should be found in at least three different phylogenetic lineages limits the COG database to the most conserved proteins and ensures its high reliability. The database currently includes ~50% of the proteins encoded in each of the bacterial and archaeal genomes (Fig. 1). For yeast, the figure is much lower due to the lack of other completely sequenced eukaryotic genomes. Sequence similarity searching against the COG database using the Cognitor tool can be performed at http://www.ncbi.nlm.nih.gov/COG/cognitor.html. In contrast to other systems of functional annotation, which usually require high levels of sequence similarity to assign a function to a new protein [26], the COG system can accommodate the different evolution rates observed for different genes and identify the closest homologs in each of the sequenced genomes for each protein, even when these sequences are substantially divergent and their similarity levels are fairly low. This approach results in significantly increased search sensitivity (which emphasizes, rather than eliminates, the need for manual validation).
As in other curated protein databases, all the annotations of the COGs are being manually checked, which largely eliminates unreliable database entries, one of the most common causes of questionable functional predictions for new proteins (see above). The annotation of a COG as a group does not necessarily coincide with the annotations of its component proteins, as precise functional predictions may be possible for some members of the COG but not for others. For example, the COG that includes yeast histone deacetylase and related prokaryotic enzymes is named simply "deacetylases", which reflects the fact that the actual substrates of the members of this COG in bacteria and archaea are unknown, but their enzymatic activity is beyond reasonable doubt. This approach, if applied consistently, can cope with another common cause of wrong annotations mentioned above, namely the divergence of sequences and functions in the course of evolution.

With regard to the problem of multi-domain proteins, the COG database splits such proteins into separate domains, provided that these domains comprise individual proteins in some other organisms. When compared to the COG database, multi-domain proteins show similarity to members of two or more different COGs, which immediately indicates their complex organization and provides for the identification of the boundaries and potential functions of each of the domains.

With all the features mentioned above, COG-based functional annotation appears to be substantially more sensitive and reliable than functional annotation based on similarity searches against the whole nonredundant database. In a pilot analysis of protein sequences from the nematode Caenorhabditis elegans, the use of the COG database produced 24% more functional predictions than a straightforward database search (A.R. Mushegian, personal communication). As new complete genomes are constantly being added to the COG database, it is likely to become a highly effective tool for protein function prediction.
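The requirement that a COG span at least three lineages can be illustrated with a toy search for triples of proteins from three genomes that are each other's best hits. This is a simplified sketch and not the COG construction procedure itself: the reciprocal-best-hit criterion, the function names, and the data are all assumptions, and genomes stand in for phylogenetic lineages.

```python
from itertools import combinations

def is_mutual_best_hit(a, b, genome_of, best_hit):
    """True if a and b are each other's best hit in the partner's genome."""
    return (best_hit.get((a, genome_of[b])) == b
            and best_hit.get((b, genome_of[a])) == a)

def candidate_ortholog_triples(genome_of, best_hit):
    """Find triples of proteins from three different genomes linked by mutual best hits."""
    triples = []
    for trio in combinations(sorted(genome_of), 3):
        if len({genome_of[p] for p in trio}) < 3:
            continue  # all three proteins must come from distinct genomes
        if all(is_mutual_best_hit(a, b, genome_of, best_hit)
               for a, b in combinations(trio, 2)):
            triples.append(trio)
    return triples

# Toy data: protein -> genome, and (protein, target genome) -> best hit in that genome.
genome_of = {"eA": "E", "eB": "E", "hA": "H", "mA": "M"}
best_hit = {
    ("eA", "H"): "hA", ("hA", "E"): "eA",
    ("eA", "M"): "mA", ("mA", "E"): "eA",
    ("hA", "M"): "mA", ("mA", "H"): "hA",
    # eB behaves like a paralog: its best hits are not reciprocated.
    ("eB", "H"): "hA", ("eB", "M"): "mA",
}
print(candidate_ortholog_triples(genome_of, best_hit))
```

Requiring agreement among three genomes tolerates low pairwise similarity, because membership depends on the ranking of hits rather than on an absolute score cutoff.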

2. PARASITES VERSUS PROTOTROPHS: GENE LOSS AND ACQUISITION OF NEW FUNCTIONS

Comparative analysis of the available complete genomes shows that metabolic diversity generally correlates with genome size. Parasitic bacteria import a variety of metabolites, which allows them to shed the genes encoding enzymes for a number of metabolic pathways [27-30]. However, all cells have to rely completely on their own gene products for performing such essential functions as genome expression, replication, and repair, membrane biogenesis, and others. These tasks alone require at least 200-250 genes ([31], see Fig. 1). Classification of proteins into orthologous groups provides a convenient way to characterize the protein families present or absent in any given organism. A systematic survey of missing gene families allows one to identify the metabolic pathways that are likely to be operative in a particular organism, namely those for which all of the required enzymes are encoded in its genome. When some of the required enzymes cannot be found in a genome, the respective pathways either are not operative or employ other, unrelated proteins to catalyze the missing steps (see [32]).


Using the COG approach for the analysis of carbohydrate metabolism in the gastric pathogen Helicobacter pylori revealed the absence of several key enzymes, such as phosphofructokinase (pfk), pyruvate kinase (pykA), and 6-phosphogluconate dehydrogenase (gnd) (see [25]). This means that glycolysis and the pentose phosphate shunt are not functional in H. pylori, leaving the Entner-Doudoroff pathway as the only avenue of sugar catabolism. On the other hand, the phosphoenolpyruvate synthase (ppsA) and fructose bisphosphatase (fbp) genes are both present in H. pylori, suggesting that it can perform gluconeogenesis, producing the sugars required for nucleic acid and peptidoglycan biosynthesis. Such an organization of metabolism makes perfect sense for this bacterium, given its challenge of maintaining a near-neutral intracellular pH in the highly acidic gastric environment. Sugar fermentation, which results in intracellular production of acid, would place an additional burden on the pH maintenance mechanism, while gluconeogenesis converts organic acids into sugars and thus removes H+ from the cytoplasm. For the purposes of energy production, H. pylori apparently depends on fermentation of amino acids and oligopeptides that are produced by gastric proteolysis and transported into the cells by ABC-type transporters. Amino acid fermentation results in alkalinization of the cytoplasm and could relieve part of the burden of pH maintenance in H. pylori.

Examination of the phylogenetic distribution of the enzymes participating in certain metabolic pathways can lead to some interesting evolutionary observations. Thus, all the genes coding for the enzymes of isoleucine/valine biosynthesis (Fig. 2) are missing in Mycoplasma genitalium and Mycoplasma pneumoniae.
As many other gram-positive bacteria do possess the complete set of genes for isoleucine/valine biosynthesis [33, 34], it appears that the complete set of ilv genes has been precisely excised in the course of mycoplasmal evolution. These genes are also missing in Chlamydia trachomatis and Borrelia burgdorferi (Fig. 2). By contrast, two other human pathogens, Haemophilus influenzae and Mycobacterium tuberculosis, have retained the complete set of ilv genes. These observations show that different pathogens vary in the degree of their dependence upon the host for the supply of the necessary nutrients. The enzymes of isoleucine/valine biosynthesis in archaea are clearly orthologous to the bacterial enzymes (Fig. 2). Remarkably, the complete set of ilv genes, with the exception of threonine dehydratase, is missing in Pyrococcus horikoshii. This shows that free-living, heterotrophic organisms may be just as prone to gene loss as outright parasites.

Figure 1. Functional roles of proteins in the COG database. Genomes shown: Haemophilus influenzae, Escherichia coli, Mycoplasma genitalium, Helicobacter pylori, Saccharomyces cerevisiae, and Methanococcus jannaschii. For each genome, the largest sector indicates the proteins that are not included in the current set of COGs; the remaining sectors show, in clockwise direction, the numbers of proteins that are responsible for (i) information storage and processing, including transcription, translation, DNA replication, recombination and repair, and ribosome biogenesis; (ii) cellular processes, such as membrane and cell wall biogenesis, protein folding and secretion; (iii) cellular metabolism, including carbohydrate, amino acid, lipid and nucleotide metabolism and energy production and conversion; and (iv) poorly characterized conserved proteins.

Comparison of the metabolic enzymes encoded in different completely sequenced genomes reveals another interesting trend, namely multiple cases of non-orthologous gene displacement [32], whereby the same cellular function in different organisms is performed by distinct, clearly non-orthologous proteins. For example, yeast histidinol phosphatase YFR025c, an enzyme of the histidine biosynthesis pathway (Fig. 3), is unrelated to the bacterial enzyme catalyzing the same reaction. These enzymes belong to different families of phosphatases, which does not preclude them from acting on the same substrate, histidinol phosphate. Remarkably, a protein closely related to bacterial histidinol phosphatases is present in H. pylori, which lacks all the other enzymes of histidine biosynthesis (Fig. 3). This protein, however, retains the phosphatase sequence motif, indicating that in H. pylori cells it probably still functions as a phosphatase that hydrolyzes some other phosphate ester. Such enzyme recruitment presents yet another substantial problem in the functional annotation of complete genomes. Comparison of phylogenetic patterns, such as the one shown in Fig. 3, offers a possibility to identify such cases.
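Phylogenetic patterns of the kind used in Figs. 2-4 are fixed-order strings in which a letter marks a species that encodes the protein and a dash marks its absence, so they can be compared mechanically. The sketch below is a minimal illustration; the 16-letter species order and the example pattern are copied from pattern strings in Fig. 4, while the function names and the displacement heuristic are assumptions.

```python
SPECIES_ORDER = "ehubgprlicqmtaky"  # one letter per genome, as in the Fig. 4 patterns

def make_pattern(present, order=SPECIES_ORDER):
    """Build a pattern string from the set of species codes that encode the protein."""
    return "".join(sp if sp in present else "-" for sp in order)

def absent_species(pattern, order=SPECIES_ORDER):
    """List the species codes whose position in the pattern is a dash."""
    if len(pattern) != len(order):
        raise ValueError("pattern and species order differ in length")
    return [sp for sp, ch in zip(order, pattern) if ch == "-"]

def displacement_candidates(pattern_a, pattern_b):
    """Species that lack family A but encode family B for the same reaction."""
    if len(pattern_a) != len(pattern_b):
        raise ValueError("patterns differ in length")
    return [b for a, b in zip(pattern_a, pattern_b) if a == "-" and b != "-"]

# A pattern string as it appears in Fig. 4: only the two mycoplasmas are missing.
print(absent_species("ehub--rlicqmtaky"))

# Hypothetical pair of unrelated families catalyzing one reaction.
family_a = make_pattern(set("ehb"))
family_b = make_pattern(set("y"))
print(displacement_candidates(family_a, family_b))
```

Complementary patterns of this kind are only a starting point: a species that lacks one family and has the other may have a displaced enzyme, or the second protein may have been recruited for a different substrate, as in the H. pylori phosphatase example.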

[Figure 2: pathway diagram leading from L-threonine and pyruvate to L-isoleucine, L-valine, and L-leucine, with a phylogenetic pattern shown for each enzyme. Enzymes shown: ilvA (COG0498); ilvB (COG0028) and ilvN (COG0440); ilvC (COG0059); ilvD (COG0129); ilvE (COG0115); avtA (COG0436); tyrB; leuA (COG0119); leuC (COG0065) and leuD (COG0066); leuB (COG0473).]

Figure 2. Conservation of genes and enzymes of the isoleucine/valine biosynthesis pathway. The enzymes are listed under E. coli gene names, whenever available. The COG numbers are from the COG database (http://www.ncbi.nlm.nih.gov/COG). The designations of species in the phylogenetic patterns are as follows: e, Escherichia coli; h, Haemophilus influenzae; u, Helicobacter pylori; b, Bacillus subtilis; g, Mycoplasma genitalium; p, Mycoplasma pneumoniae; r, Mycobacterium tuberculosis; i, Chlamydia trachomatis; l, Borrelia burgdorferi; q, Aquifex aeolicus; c, Synechocystis sp.; m, Methanococcus jannaschii; t, Methanobacterium thermoautotrophicum; f, Archaeoglobus fulgidus; y, Saccharomyces cerevisiae.

[Figure 3: pathway diagram with a phylogenetic pattern shown for each enzyme. Ribose-5-P -> 5-phosphoribosyl 1-pyrophosphate (prsA, COG0462) -> N'-(5-phosphoribosyl)-ATP (hisG, COG0040) -> N-(5-phosphoribosyl)-AMP (hisI_2, COG0140) -> phosphoribosyl-formimino-5-aminoimidazole-4-carboxamide ribonucleotide, 5'-ProFAR (hisI_1, COG0139) -> phosphoribulosyl-formimino-5-aminoimidazole-4-carboxamide ribonucleotide, PRFAR (hisA, COG0106) -> imidazoleglycerol phosphate (hisH, COG0118; hisF, COG0107) -> imidazoleacetol phosphate (hisB_2, COG0131) -> L-histidinol phosphate (hisC, COG0079) -> L-histidinol (hisB_1, COG0241; YFR025c) -> L-histidine (hisD, COG0141).]

Figure 3. Conservation of genes and enzymes of the histidine biosynthesis pathway. All abbreviations are as in Fig. 2.

Comparative analysis of the pyrimidine biosynthetic pathways in different organisms (Fig. 4) shows many trends similar to those described above for the pathways of amino acid biosynthesis. Again, the complete set of genes coding for the enzymes of this pathway is missing in both mycoplasmas, in contrast to many other gram-positive bacteria [35, 36]. It appears that all these genes, with the sole exception of pyrH, coding for uridylate kinase, have been lost in the mycoplasmal lineage.

Pyrimidine biosynthesis also shows several cases of non-orthologous gene displacement. Indeed, the typical bacterial, pyrH-type uridylate kinase is not encoded in the yeast genome; instead, UMP phosphorylation in yeast is catalyzed by one of the three paralogous members of the adenylate/cytidylate kinase family (Fig. 4). Similarly, in addition to the pyrC-encoded dihydroorotase, E. coli, H. pylori, and Synechocystis sp. also have genes for a second type of this enzyme, which is found in gram-positive bacteria [35, 37] and is homologous to yeast allantoinase and human dihydropyrimidinase. This second form of dihydroorotase is the only one encoded in archaeal genomes, while neither form is present in Haemophilus influenzae (Fig. 4).

Analysis of the enzymes of pyrimidine biosynthesis encoded in the H. influenzae genome points to yet another interesting evolutionary trend: parasites lose the genes for the upstream steps of a metabolic pathway, while still retaining the genes for the downstream steps. Although the parasite depends on the host for the supply of essential nutrients, this strategy allows it to preserve at least some metabolic plasticity. Indeed, the H. influenzae genome is missing the genes for the first three steps of pyrimidine biosynthesis (carA, carB, pyrB, pyrC), which end with dihydroorotate (Fig. 4). By contrast, the genes for all subsequent steps of pyrimidine biosynthesis, from dihydroorotate to CTP, are present in the H. influenzae genome. Thus, while H. influenzae evidently is not capable of de novo pyrimidine biosynthesis, it has preserved sufficient metabolic plasticity to accommodate whatever pyrimidine it can get from its host. Borrelia burgdorferi and Chlamydia trachomatis show a deeper reduction, with only the genes for the last three steps of the pathway, namely the conversion of UMP into CTP, retained. These examples illustrate how the adaptation of a microorganism to its particular habitat shapes its genome by determining which genes can be dispensed with and which must be retained.
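The pattern of upstream loss with downstream retention can be checked programmatically once a pathway is written as an ordered list of steps. The sketch below is a minimal illustration under stated assumptions: the step list follows the gene names given in Fig. 4 (the adk entry is omitted for simplicity), each step lists alternative gene sets so that either type of dihydroorotase satisfies it, and the two gene sets are written to match the losses described in the text rather than taken from genome annotations.

```python
# Ordered steps of pyrimidine biosynthesis; each step lists alternative gene sets.
PYRIMIDINE_STEPS = [
    ("carbamoyl phosphate", [{"carA", "carB"}]),
    ("carbamoylaspartate", [{"pyrB"}]),
    ("dihydroorotate", [{"pyrC"}, {"ygeZ"}]),  # two unrelated dihydroorotase types
    ("orotate", [{"pyrD"}]),
    ("orotidine-5P", [{"pyrE"}]),
    ("UMP", [{"pyrF"}]),
    ("UDP", [{"pyrH"}]),
    ("UTP", [{"ndk"}]),
    ("CTP", [{"pyrG"}]),
]

def step_encoded(alternatives, genes):
    """A step is encoded if all genes of at least one alternative set are present."""
    return any(alt <= genes for alt in alternatives)

def retained_downstream(steps, genes):
    """Products of the longest run of consecutive final steps that are all encoded."""
    kept = []
    for product, alternatives in reversed(steps):
        if not step_encoded(alternatives, genes):
            break
        kept.append(product)
    return kept[::-1]

# Gene sets written to match the description in the text (illustrative only).
h_influenzae_like = {"pyrD", "pyrE", "pyrF", "pyrH", "ndk", "pyrG"}
deeper_reduction = {"pyrH", "ndk", "pyrG"}
print(retained_downstream(PYRIMIDINE_STEPS, h_influenzae_like))
print(retained_downstream(PYRIMIDINE_STEPS, deeper_reduction))
```

The first retained product marks the point at which the organism must start from an imported intermediate: dihydroorotate in the first case and UMP in the second.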

[Figure 4: pathway diagram with a phylogenetic pattern shown for each enzyme. HCO3- + NH3 -> carbamoylphosphate (carA, COG0505; carB, COG0458) -> carbamoylaspartate (pyrB, COG0540) -> dihydroorotate (pyrC, COG0418; ygeZ, COG0044) -> orotate (pyrD, COG0167) -> orotidine-5P (pyrE, COG0461) -> UMP (pyrF, COG0284) -> UDP (pyrH, COG0528; adk, COG0563) -> UTP (ndk, COG0105) -> CTP (pyrG, COG0504).]

Figure 4. Conservation of genes and enzymes of the pyrimidine biosynthesis pathway. All abbreviations are as in Fig. 2.

3. NON-ORTHOLOGOUS GENE DISPLACEMENT: A KEY TO PATHWAY EVOLUTION

As noted above, non-orthologous gene displacement can significantly complicate functional annotation and metabolic reconstruction of newly sequenced genomes [32]. A few examples of unrelated enzymes that catalyze the same reaction have been known for a long time [38]. The comparison of complete genomes showed that such cases are much more common than previously believed. A comprehensive analysis of the protein sequences of enzymes contained in the GenBank database revealed 105 enzymatic reactions that can be catalyzed by at least two non-orthologous enzymes [23]. In several cases, the three-dimensional structures of both variants had been determined and their structural folds are clearly different; on other occasions, the same conclusion could be reached by fold prediction [23]. Besides the well-known example of the Cu/Zn-dependent and Mn/Fe-dependent superoxide dismutases [39], this was found to be the case for, e.g., dihydrofolate reductases (Protein Data Bank accession codes 1DHF and 1VIE), β-lactamases (2BLT and 1ZNB), and others. These examples indicate that independent origin of enzymes with the same specificity is fairly common. A systematic analysis of non-orthologous gene displacements showed that these cases are non-uniformly distributed on the metabolic map [23]. Thus, examination of their phylogenetic distribution could provide some clues to the problems of biochemical evolution, including the evolution of such central metabolic pathways as glycolysis and the TCA cycle.
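A survey of this kind amounts to grouping enzymes by the reaction they catalyze and asking whether more than one unrelated family is represented. The sketch below shows the bookkeeping only. The PDB codes are those quoted in the text, but the fold labels (`fold_1`, `fold_2`), the single-entry control reaction, and the function name are placeholders invented for the example.

```python
from collections import defaultdict

def displaced_reactions(records):
    """Return reactions represented by two or more distinct structural families.

    records: iterable of (reaction, structure_id, family_label) tuples.
    """
    families = defaultdict(set)
    for reaction, _structure_id, family in records:
        families[reaction].add(family)
    return {reaction: sorted(labels)
            for reaction, labels in families.items() if len(labels) >= 2}

# Placeholder family labels; only the pairing of PDB codes follows the text.
records = [
    ("dihydrofolate reductase", "1DHF", "fold_1"),
    ("dihydrofolate reductase", "1VIE", "fold_2"),
    ("beta-lactamase", "2BLT", "fold_1"),
    ("beta-lactamase", "1ZNB", "fold_2"),
    ("hypothetical reaction X", "none", "fold_1"),  # single family, not reported
]
print(displaced_reactions(records))
```

Family labels would in practice come from sequence clustering or fold assignment, and the reliability of the result depends entirely on how conservatively unrelatedness is called.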

3.1 Non-orthologous gene displacement in glycolytic enzymes

For quite some time, glycolysis has been generally believed to be the pathway of carbohydrate metabolism that emerged first in evolution. However, a recent comparison of saccharolytic pathways in archaea and hyperthermophilic bacteria [40] led to the conclusion that only the lower part of glycolysis, from glyceraldehyde 3-phosphate to pyruvate (the so-called "trunk pathway"), can be considered ubiquitous (discussed in detail in [41]). In accord with this, 4 of the 5 enzymes in this section of the Embden-Meyerhof-Parnas pathway are clearly orthologous in all species for which the data are available [42] (see, however, [41] for a discussion of the peculiar case of glyceraldehyde 3-phosphate dehydrogenase). On the other hand, the enzymes of the upper part of glycolysis and the enzymes involved in pyruvate catabolism are substantially different in various organisms.


Glucose catabolism starts with its phosphorylation, which is catalyzed by glucokinase. Three distinct versions of this enzyme have been identified. While the enzyme from Streptomyces coelicolor (GLK_STRCO) shows moderate sequence similarity to the typical bacterial form (e.g., HXKG_ECOLI), these two forms share only the ATP-binding motif with yeast and human glucokinases [43]. On the other hand, orthologs of GLK_STRCO in Zymomonas mobilis, Streptococcus mutans and several other bacteria possess fructokinase activity. An unrelated variant of fructokinase is present in E. coli, Synechocystis sp., and in plants. Sequences of ATP-dependent 6-phosphofructokinases from various bacteria and eukaryotes all show significant sequence conservation. However, an entirely different enzyme, encoded by the pfkB gene and related to the ribokinases, has been shown to account for up to 10% of the total 6-phosphofructokinase activity in E. coli [44]. Plant 6-phosphofructokinases use pyrophosphate instead of ATP, and the enzyme from the archaeon Pyrococcus furiosus was reported to use ADP as the phosphoryl donor [45].

The next glycolytic enzyme, fructose-1,6-bisphosphate aldolase, is represented by two substantially different isozymes, referred to as class I and class II aldolases. Rutter [38] noted that class I enzymes are present in animal and plant cells, class II enzymes are present in yeasts and bacteria, while Euglena and Chlamydomonas have both forms of the enzyme. These conclusions, though repeatedly questioned by reports of class I enzymes in various bacteria and archaea, are now finding support in sequence data. Genes coding for class I aldolases have so far been found in only two bacteria, Staphylococcus carnosus and Synechocystis sp. In spite of several reports of a class I aldolase in E. coli, there appears to be no such gene in its genome [46]. On the other hand, the E. coli genome contains four paralogous genes for class II aldolases (two each for fructose bisphosphate aldolase and tagatose bisphosphate aldolase), which could be the reason for the confusion. Another possibility is the presence in E. coli of a completely different, new type of aldolase that, like the class I enzyme, is EDTA-resistant. Aldolase genes are absent in the four archaeal genomes, and no archaeal aldolase sequence is available at this point, though both class I and class II enzymes have been reported in halobacteria. In any case, the overall conclusion that class I and class II aldolases evolved independently of each other and were differentially lost in different phylogenetic lineages [47] seems more consistent with the available sequence data than the idea of their common origin [48].

The only enzyme in the lower part of the glycolytic pathway that has non-homologous isoforms is phosphoglycerate mutase. Phylogenetic data [49] and structural fold predictions [22] suggest that the two forms of phosphoglycerate mutase have evolved independently and have been selectively lost in various lineages.

The diversity of the enzymes in the pentose phosphate pathway seems to argue for its relatively recent evolution. This pathway starts from the NADP-dependent oxidation of glucose into gluconolactone, which is catalyzed by glucose dehydrogenase. Two variants of this enzyme have been described that belong to the short-chain dehydrogenase family and to the Zn-containing dehydrogenase family, respectively. Two additional variants of glucose dehydrogenase use FAD and pyrroloquinoline quinone, respectively, as the electron acceptors. Gluconate kinase, the enzyme that catalyzes the next step of the pathway, is also found in two distinct versions, one of which is similar to glycerol kinase [23].

3.2 Alternative ways of pyruvate oxidation

The traditional schemes of anaerobic glycolysis end with the reduction of pyruvate into lactate. However, this reaction is by no means ubiquitous: L-lactate dehydrogenase is missing in the genome of the yeast S. cerevisiae (which is evidently the main reason for its industrial use). Besides, the eukaryotic-type L-lactate dehydrogenase, which is also found in B. subtilis and Synechocystis sp., is apparently missing in Gram-negative bacteria. Instead, E. coli, H. influenzae and Alcaligenes eutrophus possess a different type of this enzyme, which is similar to archaeal malate dehydrogenase.

Another way of pyruvate reduction, which results in the production of D-lactate, is apparently limited to the bacterial domain. Soluble D-lactate dehydrogenases from E. coli, H. influenzae, Synechocystis sp. and several lactobacilli are all similar to each other. Besides, E. coli and H. influenzae both have membrane-bound D-lactate dehydrogenases, which feed electrons directly into the respiratory chain. A different respiratory D-lactate dehydrogenase, a nuclear-encoded mitochondrial protein, is found in yeasts.

The thermophilic archaea M. jannaschii and Pyrococcus furiosus and the thermophilic bacterium Thermotoga maritima all lack the pyruvate dehydrogenase complex and instead use pyruvate-ferredoxin oxidoreductase [50]. It therefore appears likely that the pyruvate dehydrogenase complex evolved after the separation of the ancestors of bacteria and archaea. Accordingly, there are two substantially different forms of the pyruvate dehydrogenase E1 component. One form, which is found in B. subtilis, Synechocystis sp. and eukaryotes from yeast to man, is similar to the α-ketoglutarate dehydrogenase E1 component. The other form, which is present in E. coli, H. influenzae, and Mycobacterium tuberculosis, is much more similar to transketolase. One may speculate that while transketolase was present in the universal ancestor of all modern life forms, the α-ketoglutarate dehydrogenase and pyruvate dehydrogenase complexes were not. In the eukaryotic lineage, the pyruvate dehydrogenase E1 component could have emerged from the E1 component of mitochondrial α-ketoglutarate dehydrogenase, while in the E. coli lineage it evolved from some variant of transketolase.
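The distribution of enzyme variants described in this section can be written down as a small presence table. The Python sketch below is only an illustration and is not part of the original analysis: the dictionary layout and the helper names `variants_in` and `mutually_exclusive` are hypothetical, and the entries simply restate the organisms named in the text above (the label "eukaryotes" stands for "eukaryotes from yeast to man").

```python
# Illustrative table of enzyme variants and the organisms named in the text.
# The layout and all identifiers are assumptions made for this sketch.
VARIANTS = {
    "L-lactate dehydrogenase": {
        "eukaryotic-type": {"B. subtilis", "Synechocystis sp."},
        "malate dehydrogenase-like": {"E. coli", "H. influenzae", "A. eutrophus"},
    },
    "pyruvate dehydrogenase E1": {
        "alpha-ketoglutarate dehydrogenase-like": {
            "B. subtilis", "Synechocystis sp.", "eukaryotes",
        },
        "transketolase-like": {"E. coli", "H. influenzae", "M. tuberculosis"},
    },
}


def variants_in(enzyme, organism):
    """Return the sorted variant names of `enzyme` recorded for `organism`."""
    return sorted(v for v, orgs in VARIANTS[enzyme].items() if organism in orgs)


def mutually_exclusive(enzyme):
    """Return True if no organism in the table carries more than one variant."""
    seen = set()
    for orgs in VARIANTS[enzyme].values():
        if seen & orgs:
            return False
        seen |= orgs
    return True


if __name__ == "__main__":
    for enzyme in VARIANTS:
        print(enzyme, "- variants mutually exclusive:", mutually_exclusive(enzyme))
    print(variants_in("pyruvate dehydrogenase E1", "E. coli"))
```

In this table each organism carries only one of the two unrelated forms of a given enzyme. A non-overlapping distribution of this kind is what one would expect for analogous enzymes, although the table alone cannot establish independent origin.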

3.3 Evolution of the TCA cycle

It has been speculated that the TCA cycle first evolved as a noncyclical reductive biosynthetic pathway leading from oxaloacetate to succinate [51]. Indeed, enzymes for these reactions, as opposed to citrate synthase and α-ketoglutarate dehydrogenase, have been found in thermophilic archaea [52], and the respective genes are present in all sequenced archaeal genomes [16]. Our analysis shows that malate dehydrogenases, fumarases, succinate dehydrogenases and succinyl-CoA synthetases from diverse sources show approximately the same level of sequence similarity to each other. This uniformity can be considered an indication of an early emergence of these enzymes.

The evolutionary history of the other Krebs cycle enzymes appears to be more complex. The two enzymes of the second arm of the cycle, namely aconitase and isocitrate dehydrogenase, are both present in methanogenic archaea and are basically similar in representatives of all three domains of life. However, there are reasons to suspect that these enzymes emerged somewhat later than those listed above. First, these two enzymes are highly similar to two enzymes of the leucine biosynthesis pathway, isopropylmalate isomerase and isopropylmalate dehydrogenase, which suggests that duplicated copies of leucine biosynthetic enzymes could have been recruited for the Krebs cycle at some relatively late point in enzyme evolution. Second, in addition to the ubiquitous NADP+-dependent isocitrate dehydrogenase, which consists of two identical subunits, is found in all three domains of life, and is similar to isopropylmalate dehydrogenase, there is another enzyme with the same activity but a significantly different sequence. This second isozyme of isocitrate dehydrogenase functions as a monomer and is only weakly inhibited by oxaloacetate, 2-oxoglutarate, and citrate [53]. It has been discovered so far in Vibrio parahaemolyticus and Corynebacterium glutamicum; the psychrophilic bacterium Vibrio sp. strain ABE-1 has been shown to contain both types of isocitrate dehydrogenase.

The two remaining enzymes of the Krebs cycle, citrate synthase and α-ketoglutarate dehydrogenase, have not been found in any of the archaea. They are present in all bacteria and eukaryotic mitochondria and must have evolved well after the separation of the three major domains of life. This is in agreement with the ideas of Weitzman [54], who suggested independent evolution of the two branches of the Krebs cycle, followed by their linking through citrate synthase and α-ketoglutarate dehydrogenase.
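The domain distribution summarized in this section can be restated as a small lookup table. The Python sketch below is a hypothetical illustration rather than the procedure used in the original analysis: the names `TCA_DISTRIBUTION` and `group_by_archaeal_presence` are invented for this example, and the presence of the first four enzymes in bacteria and eukaryotes is an assumption inferred from the phrase "from diverse sources".

```python
# Domain distribution of Krebs cycle enzymes as summarized in the text.
# All identifiers here are illustrative assumptions, not an established API.
ALL_DOMAINS = frozenset({"archaea", "bacteria", "eukaryotes"})
NO_ARCHAEA = frozenset({"bacteria", "eukaryotes"})

TCA_DISTRIBUTION = {
    "malate dehydrogenase": ALL_DOMAINS,
    "fumarase": ALL_DOMAINS,
    "succinate dehydrogenase": ALL_DOMAINS,
    "succinyl-CoA synthetase": ALL_DOMAINS,
    "aconitase": ALL_DOMAINS,
    "isocitrate dehydrogenase": ALL_DOMAINS,
    "citrate synthase": NO_ARCHAEA,
    "alpha-ketoglutarate dehydrogenase": NO_ARCHAEA,
}


def group_by_archaeal_presence(table):
    """Split enzymes into those reported in archaea and those not reported there."""
    groups = {"found in archaea": [], "absent from archaea": []}
    for enzyme, domains in table.items():
        key = "found in archaea" if "archaea" in domains else "absent from archaea"
        groups[key].append(enzyme)
    return groups


if __name__ == "__main__":
    for label, enzymes in group_by_archaeal_presence(TCA_DISTRIBUTION).items():
        print(f"{label}: {', '.join(enzymes)}")
```

Presence or absence alone separates only citrate synthase and α-ketoglutarate dehydrogenase from the rest. The later recruitment proposed for aconitase and isocitrate dehydrogenase rests on their sequence similarity to the leucine biosynthesis enzymes, which a table of this kind does not capture.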

4. CONCLUDING REMARKS

The availability of complete genome sequences has already changed our understanding of evolutionary processes. Further progress in methods and procedures for sensitive database searches, combined with cross-genome sequence comparisons and phylogenetic analysis, should significantly improve our ability to understand the functions of individual proteins encoded in these genomes and to use these data for a better understanding of the mechanisms of cell function. The classification of protein families and superfamilies implemented in the COG database provides a useful tool for such cross-genome sequence analysis and metabolic reconstruction. We expect that the growth in the number of completely sequenced genomes will result in improved precision of our functional assignments and will ultimately bring us to a true evolutionary system of protein classification.

References

1. Fickett, J.W. (1996) Finding genes by computer: the state of the art, Trends Genet. 12, 316-320.
2. Pearson, W.R. (1996) Effective protein sequence comparison, Methods Enzymol. 266, 227-258.
3. Altschul, S.F. (1997) Sequence comparison and alignment, in M.J. Bishop and C.J. Rawlings (eds.), DNA and protein sequence analysis: a practical approach, IRL Press, Oxford, pp. 137-167.
4. Smith, T.F., and Waterman, M.S. (1981) Identification of common molecular subsequences, J. Mol. Biol. 147, 195-197.
5. Pearson, W.R., and Lipman, D.J. (1988) Improved tools for biological sequence comparison, Proc. Natl. Acad. Sci. U S A 85, 2444-2448.
6. Altschul, S.F., Gish, W., Miller, W., Myers, E.W., and Lipman, D.J. (1990) Basic local alignment search tool, J. Mol. Biol. 215, 403-410.
7. Altschul, S.F., Madden, T.L., Schaffer, A.A., Zhang, J., Zheng, Z., Miller, W., and Lipman, D.J. (1997) Gapped BLAST and PSI-BLAST - A new generation of protein database search programs, Nucleic Acids Res. 25, 3389-3402.
8. Bairoch, A., Bucher, P., and Hofmann, K. (1997) The PROSITE database, its status in 1997, Nucleic Acids Res. 25, 217-221.
9. Henikoff, S., Pietrokovski, S., and Henikoff, J.G. (1998) Superior performance in protein homology detection with the Blocks Database servers, Nucleic Acids Res. 26, 311-315.
10. Attwood, T.K., Beck, M.E., Flower, D.R., Scordis, P., and Selley, J.N. (1998) The PRINTS protein fingerprint database in its fifth year, Nucleic Acids Res. 26, 306-311.
11. Corpet, F., Gouzy, J., and Kahn, D. (1998) The ProDom database of protein domain families, Nucleic Acids Res. 26, 325-328.
12. Sonnhammer, E.L.L., Eddy, S.R., Birney, E., Bateman, A., and Durbin, R. (1998) Pfam: multiple sequence alignments and HMM-profiles of protein domains, Nucleic Acids Res. 26, 322-325.
13. Wootton, J.C. (1994) Non-globular domains in protein sequences: automated segmentation using complexity measures, Comput. Chem. 18, 269-285.
14. Ouzounis, C., Casari, G., Valencia, A., and Sander, C. (1996) Novelties from the complete genome of Mycoplasma genitalium, Mol. Microbiol. 20, 898-900.
15. Kyrpides, N.C., Olsen, G.J., Klenk, H.-P., White, O., and Woese, C.R. (1996) Methanococcus jannaschii genome: revisited, Microb. Compar. Genomics 1, 329-338.
16. Koonin, E.V., Mushegian, A.R., Galperin, M.Y., and Walker, D.R. (1997) Comparison of archaeal and bacterial genomes: computer analysis of protein sequences predicts novel functions and suggests a chimeric origin for the archaea, Mol. Microbiol. 25, 619-637.
17. Galperin, M.Y., and Koonin, E.V. (1998) Sources of systematic error in functional annotation of genomes: domain rearrangement, non-orthologous gene displacement, and operon disruption, In Silico Biol. 1, 0007.
18. Camadro, J.M., and Labbe, P. (1996) Cloning and characterization of the yeast HEM14 gene coding for protoporphyrinogen oxidase, the molecular target of diphenyl ether-type herbicides, J. Biol. Chem. 271, 9120-9128.
19. Hughes, N.J., Clayton, C.L., Chalk, P.A., and Kelly, D.J. (1998) Helicobacter pylori porCDAB and oorDABC genes encode distinct pyruvate:flavodoxin and 2-oxoglutarate:acceptor oxidoreductases which mediate electron transport to NADP, J. Bacteriol. 180, 1119-1128.
20. Lee, S.H., Hidaka, T., Nakashita, H., and Seto, H. (1995) The carboxyphosphonoenolpyruvate synthase-encoding gene from the bialaphos-producing organism Streptomyces hygroscopicus, Gene 153, 143-144.
21. Nakashita, H., Watanabe, K., Hara, O., Hidaka, T., and Seto, H. (1997) Studies on the biosynthesis of bialaphos. Biochemical mechanism of C-P bond formation: discovery of phosphonopyruvate decarboxylase which catalyzes the formation of phosphonoacetaldehyde from phosphonopyruvate, J. Antibiot. 50, 212-219.
22. Galperin, M.Y., Bairoch, A., and Koonin, E.V. (1998) A superfamily of metalloenzymes unifies phosphopentomutase and cofactor-independent phosphoglycerate mutase with alkaline phosphatases and sulfatases, Protein Sci. 8, 1829-1835.
23. Galperin, M.Y., Walker, D.R., and Koonin, E.V. (1998) Analogous enzymes: independent inventions in enzyme evolution, Genome Res. 8, 779-790.
24. Tatusov, R.L., Koonin, E.V., and Lipman, D.J. (1997) A genomic perspective on protein families, Science 278, 631-637.
25. Koonin, E.V., Tatusov, R.L., and Galperin, M.Y. (1998) Beyond the complete genomes: from sequences to structure and function, Curr. Opin. Struct. Biol. 8, 355-363.
26. Overbeek, R., Larsen, N., Smith, W., Maltsev, N., and Selkov, E. (1997) Representation of function: the next step, Gene 191, GC1-GC9.
27. Fleischmann, R.D., Adams, M.D., White, O., et al. (1995) Whole-genome random sequencing and assembly of Haemophilus influenzae Rd, Science 269, 496-512.
28. Fraser, C.M., Gocayne, J.D., White, O., et al. (1995) The minimal gene complement of Mycoplasma genitalium, Science 270, 397-403.
29. Tatusov, R.L., Mushegian, A.R., Bork, P., Brown, N.P., Hayes, W.S., Borodovsky, M., Rudd, K.E., and Koonin, E.V. (1996) Metabolism and evolution of Haemophilus influenzae deduced from a whole-genome comparison with Escherichia coli, Curr. Biol. 6, 279-291.
30. Tomb, J.-F., White, O., Kerlavage, A.R., et al. (1997) The complete genome sequence of the gastric pathogen Helicobacter pylori, Nature 388, 539-547.
31. Mushegian, A.R., and Koonin, E.V. (1996) A minimal gene set for cellular life derived by comparison of complete bacterial genomes, Proc. Natl. Acad. Sci. U S A 93, 10268-10273.
32. Koonin, E.V., Mushegian, A.R., and Bork, P. (1996) Non-orthologous gene displacement, Trends Genet. 12, 334-336.
33. Godon, J.J., Chopin, M.C., and Ehrlich, S.D. (1992) Branched-chain amino acid biosynthesis genes in Lactococcus lactis subsp. lactis, J. Bacteriol. 174, 6580-6589.
34. De Rossi, E., Leva, R., Gusberti, L., Manachini, P.L., and Riccardi, G. (1995) Cloning, sequencing and expression of the ilvBNC gene cluster from Streptomyces avermitilis, Gene 166, 127-132.
35. Quinn, C.L., Stephenson, B.T., and Switzer, R.L. (1991) Functional organization and nucleotide sequence of the Bacillus subtilis pyrimidine biosynthetic operon, J. Biol. Chem. 266, 9113-9127.
36. Li, X., Weinstock, G.M., and Murray, B.E. (1995) Generation of auxotrophic mutants of Enterococcus faecalis, J. Bacteriol. 177, 6866-6873.
37. Schenk-Groninger, R., Becker, J., and Brendel, M. (1995) Cloning, sequencing, and characterizing the Lactobacillus leichmannii pyrC gene encoding dihydroorotase, Biochimie 77, 265-272.
38. Rutter, W.J. (1964) Evolution of aldolase, Fed. Proc. 23, 1248-1257.
39. Stallings, W.C., Powers, T.B., Pattridge, K.A., Fee, J.A., and Ludwig, M.L. (1983) Iron superoxide dismutase from Escherichia coli at 3.1-Å resolution: a structure unlike that of copper/zinc protein at both monomer and dimer levels, Proc. Natl. Acad. Sci. U S A 80, 3884-3888.
40. Romano, A.H., and Conway, T. (1996) Evolution of carbohydrate metabolic pathways, Res. Microbiol. 147, 448-455.
41. Fothergill-Gilmore, L.A., and Michels, P.A. (1993) Evolution of glycolysis, Prog. Biophys. Mol. Biol. 59, 105-235.
42. Galperin, M.Y., and Brenner, S.E. (1998) Using metabolic pathway databases for functional annotation, Trends Genet. 14, 332-333.
43. Bork, P., Sander, C., and Valencia, A. (1993) Convergent evolution of similar enzymatic function on different protein folds: the hexokinase, ribokinase, and galactokinase families of sugar kinases, Protein Sci. 2, 31-40.
44. Daldal, F., and Fraenkel, D.G. (1981) Tn10 insertions in the pfkB region of Escherichia coli, J. Bacteriol. 147, 935-943.
45. Kengen, S.W., Tuininga, J.E., de Bok, F.A., Stams, A.J., and de Vos, W.M. (1995) Purification and characterization of a novel ADP-dependent glucokinase from the hyperthermophilic archaeon Pyrococcus furiosus, J. Biol. Chem. 270, 30453-30457.
46. Blattner, F.R., Plunkett, G., 3rd, Bloch, C.A., et al. (1997) The complete genome sequence of Escherichia coli K-12, Science 277, 1453-1474.
47. Marsh, J.J., and Lebherz, H.G. (1992) Fructose-bisphosphate aldolases: an evolutionary history, Trends Biochem. Sci. 17, 110-113.
48. Cooper, S.J., Leonard, G.A., McSweeney, S.M., Thompson, A.W., Naismith, J.H., Qamar, S., Plater, A., Berry, A., and Hunter, W.N. (1996) The crystal structure of a class II fructose-1,6-bisphosphate aldolase shows a novel binuclear metal-binding active site embedded in a familiar fold, Structure 4, 1303-1315.
49. Carreras, J., Mezquita, J., Bosch, J., Bartrons, R., and Pons, G. (1982) Phylogeny and ontogeny of the phosphoglycerate mutases. IV. Distribution of glycerate-2,3-P2 dependent and independent phosphoglycerate mutases in algae, fungi, plants and animals, Comp. Biochem. Physiol. [B] 71, 591-597.
50. Smith, E.T., Blamey, J.M., and Adams, M.W. (1994) Pyruvate ferredoxin oxidoreductases of the hyperthermophilic archaeon, Pyrococcus furiosus, and the hyperthermophilic bacterium, Thermotoga maritima, have different catalytic mechanisms, Biochemistry 33, 1008-1116.
51. Gest, H. (1987) Evolutionary roots of the citric acid cycle in prokaryotes, Biochem. Soc. Symp. 54, 3-16.
52. Selig, M., Xavier, K.B., Santos, H., and Schonheit, P. (1997) Comparative analysis of Embden-Meyerhof and Entner-Doudoroff glycolytic pathways in hyperthermophilic archaea and the bacterium Thermotoga, Arch. Microbiol. 167, 217-232.
53. Suzuki, M., Sahara, T., Tsuruha, J., Takada, Y., and Fukunaga, N. (1995) Differential expression in Escherichia coli of the Vibrio sp. strain ABE-1 icdI and icdII genes encoding structurally different isocitrate dehydrogenase isozymes, J. Bacteriol. 177, 2138-2142.
54. Weitzman, P.D. (1981) Unity and diversity in some bacterial citric acid-cycle enzymes, Adv. Microb. Physiol. 22, 185-244.