In Silico Biology 1 (1998) 55–67 IOS Press
Sources of Systematic Error in Functional Annotation of Genomes: Domain Rearrangement, Non-Orthologous Gene Displacement and Operon Disruption

Michael Y. Galperin1 and Eugene V. Koonin2

National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, Bethesda, Maryland 20894, USA, e-mail: [email protected], 2 [email protected]

ABSTRACT: Functional annotation of proteins encoded in newly sequenced genomes can be expected to meet two conflicting objectives: (i) provide as much information as possible, and (ii) avoid erroneous functional assignments and over-predictions. The continuing exponential growth of the number of sequenced genomes makes the quality of sequence annotation a critical factor in the efforts to utilize this new information. When dubious functional assignments are used as a basis for subsequent predictions, they tend to proliferate, leading to “database explosion”. It is therefore important to identify the common factors that hamper functional annotation. As a first step towards that goal, we have compared the annotations of the Mycoplasma genitalium and Methanococcus jannaschii genomes produced in several independent studies. The most common causes of questionable predictions appear to be: (i) non-critical use of annotations from existing database entries; (ii) taking into account only the annotation of the best database hit; (iii) insufficient masking of low complexity regions (e.g. non-globular domains) in protein sequences, resulting in spurious database hits obscuring relevant ones; (iv) ignoring multi-domain organization of the query proteins and/or the database hits; (v) non-critical functional inferences on the basis of the functions of neighboring genes in an operon; (vi) non-orthologous gene displacement, i.e. involvement of structurally unrelated proteins in the same function. These observations suggest that case by case validation of functional annotation by expert biologists remains crucial for productive genome analysis.

KEYWORDS: Genome comparison, prediction of gene functions, systematic error, database annotation, automatic genome annotation, manual genome annotation, low complexity, non-globular domains, multi-domain proteins, non-orthologous gene displacement
Electronic publication can be found in In Silico Biol. 1, 0007, 4 March 1998.
1386-6338/98/$8.00 © 1998 – IOS Press and Bioinformation Systems e.V. All rights reserved

INTRODUCTION

The growing list of completely sequenced genomes presents a unique opportunity to extract a wealth of information on the biochemistry, genetics, and evolutionary history of these organisms. The success of such analysis will largely depend upon our ability to predict the functions of the proteins encoded in each of these genomes. The history of functional annotations for the first completed genomes shows that functions of 40–60% of the proteins encoded in each genome could be predicted based solely on high-level sequence similarity with known proteins [11,16,17]. On the other hand, detailed computer analysis using advanced methods
M.Y. Galperin and E.V. Koonin / Systematic Error in Functional Annotation of Genomes

Table 1 Selected protein sequence analysis utilities

| Type of analysis | Program used | Source (WWW access, availability by FTP) | Reference |
|---|---|---|---|
| Sequence similarity searches | | | |
| Database search and sequence comparison | BLASTP | www.ncbi.nlm.nih.gov/BLAST/, ftp://ncbi.nlm.nih.gov/blast/ | [2] |
| | WU-BLASTP | www2.ebi.ac.uk/blast2/, ftp://blast.wustl.edu/ | [1] |
| | BLASTPGP | www.ncbi.nlm.nih.gov/BLAST/, ftp://ncbi.nlm.nih.gov/blast/ | [3] |
| | PSI-BLAST | www.ncbi.nlm.nih.gov/BLAST/, ftp://ncbi.nlm.nih.gov/blast/ | [3] |
| Motif detection | MoST | ftp://ncbi.nlm.nih.gov/pub/most/ | [47] |
| | HMMer | http://genome.wustl.edu/eddy/HMMER/main.html | [15] |
| | ScanProsite | www.expasy.ch/sprot/scnpsite.html | [5] |
| Multiple alignment | CLUSTALW | ftp://ftp.ebi.ac.uk/pub/software/ | [50] |
| | MACAW | ftp://ncbi.nlm.nih.gov/pub/macaw/ | [43] |
| Identification of conserved gene strings | GENESTRING | ftp://ncbi.nlm.nih.gov/pub/koonin/Complete_Genomes/utility/ | [48] |
| Handling of large data sets | | | |
| Batch mode | SEALS | www.ncbi.nlm.nih.gov/Walker/SEALS/ | [52] |
| Taxonomy sorting of BLAST results | BLATAX | ftp://ncbi.nlm.nih.gov/pub/bla/blatax/ | [29] |
| Prediction of protein structure elements | | | |
| Signal peptides | SIGNALP | www.cbs.dtu.dk/services/SignalP/ | [36] |
| Transmembrane helices | PHDtopology | www.embl-heidelberg.de/predictprotein/ | [39] |
| Coiled-coil regions | COILS2 | http://ulrec3.unil.ch/software/COILS_form.html | [32] |
| Non-globular domains | SEG | ftp://ncbi.nlm.nih.gov/pub/seg/ | [55] |
for sequence comparison (gapped BLAST, PSI-BLAST), amino acid motif detection (MoST, HMMer), prediction of secondary structure elements, and delineation of families of paralogs (e.g., the ones listed in Table 1) results in an increase of the fraction of proteins with predicted functions to 75–85% for the bacterial genomes and ~70% for the first published archaeal genome, that of M. jannaschii [30,48]. This analysis is a prerequisite for more detailed characterization of each particular organism based on the genome sequence, e.g. prediction of the existing and missing metabolic pathways and the transcription regulatory network, but is time- and labor-consuming.

Accordingly, and in anticipation of the expected exponential growth of the number of sequenced genomes, considerable effort has been devoted to the automation of genome analysis at different levels. These projects range in scope from a set of simple programs for data handling intended to facilitate genome annotation by a biologist (e.g., SEALS [52]) to a complete automated system that attempts annotation without any human intervention (GeneQuiz; see [42]). While automation of the tedious process of sequence annotation is certainly an attractive possibility, the quality of automatically generated predictions remains suspect, given the many pitfalls that complicate even manual annotation by expert biologists [9,10]. Thus, while the accuracy of the functional assignments by GeneQuiz has been estimated at 95% or better [38], few of the new functional predictions for M. genitalium proteins, even those specifically discussed by the authors, could be fully corroborated by manual analysis (see [30] for discussion). Several of these questionable predictions, listed in Table 2, suggest that the error rate of GeneQuiz is indeed far above the claimed 1% [38].
Table 2 A comparison of the automatic and manual annotations of the Mycoplasma genitalium genome

| Protein name | Best hit to a characterized protein, E-value | GeneQuiz annotation [38] | Manual annotation [30] | Type of error in GeneQuiz annotation (a) | Possible reasons, comments |
|---|---|---|---|---|---|
| MG123 | None | Arginine deiminase | Periplasmic protein | W | Non-globular domains were not masked (see Table 6) |
| MG139 | None | Amps (fragment) | Zn-dependent hydrolase | W | Not clear |
| MG140 | SMB2_HUMAN, 4.6e-06 | DNA-binding protein Sµbp2 | Superfamily I helicase | U | Only the best hit was considered; motif analysis was not performed |
| MG225 | LYSP_ECOLI, 0.019 | Histidine permease | Amino acid permease | O | Only the best hit was considered (see text) |
| MG237 | None | Ile-tRNA synthetase domain | Unknown | W | Not clear |
| MG294 | NARK_BACSU, 0.0016 | NarK, nitrate extrusion protein | Permease | O | Only the best hit was considered |
| MG377 | None | Zn protease | Unknown | O? | Zn protease motif is present, but similarity to other peptidases is low |
| MG449 | SYFB_ECOLI, 4.8e-13 | Phe-tRNA synthetase N-terminal | Putative RNA-binding protein | W | Multi-domain organization of Phe-tRNA synthetase ignored (see Fig. 3) |
| MG464 | SP3J_BACSU, 0.014 | Stage III sporulation protein J | Highly conserved membrane protein | W | Only the best hit was considered; similar proteins in many bacteria |
| MG468 | DPO1_BACCA, 5.6e-39 | DNA polymerase I | 5'-3' exonuclease | W | Multi-domain organization of DNA polymerase I was ignored (see Fig. 2) |

(a) W - wrong prediction; U - underprediction; O - overprediction
Likewise, several corrections to the original TIGR annotation [17], offered by Ouzounis et al. [38], turned out to be unjustified (Table 3). A similar contrast was found between the GeneQuiz predictions for the M. jannaschii genome [4] and those obtained by mostly manual annotation (see [30,44]). Several of these discrepancies are listed in Table 4. When dubious functional assignments are used as a basis for further predictions, they tend to proliferate, which has been referred to as “database explosion” [8]. It is therefore important to delineate possible factors that lead to questionable functional assignments. To this end, we compared the sets of functional annotations for the genomes of M. genitalium and M. jannaschii, produced by different groups [4,11,17,30,31,38,44] and examined the likely reasons for apparently erroneous predictions. We believe that the problems that affect prediction of gene functions most seriously are the same for automatic and manual analysis (as will be demonstrated by some of the examples discussed below), but, of course, manual analysis has more immediate flexibility to handle them. The approach we take is not benchmarking of different systems and methods for genome annotation, but examination of some typical cases that highlight inherent difficulties in this process.
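The proliferation of dubious assignments can be made concrete with a toy calculation. The sketch below is an illustration only and is not part of the original analysis: it assumes that each transfer of an annotation from one database entry to the next is correct with a fixed probability and that errors at successive transfers are independent, which real annotation chains need not satisfy. The function name `chained_accuracy` and the value 0.95 are hypothetical choices for the example.

```python
def chained_accuracy(per_step_accuracy: float, steps: int) -> float:
    """Probability that an annotation is still correct after `steps`
    successive transfers, assuming independent errors at each step
    (a simplifying assumption made for this illustration)."""
    if not 0.0 <= per_step_accuracy <= 1.0:
        raise ValueError("per_step_accuracy must lie between 0 and 1")
    if steps < 0:
        raise ValueError("steps must be non-negative")
    return per_step_accuracy ** steps


if __name__ == "__main__":
    # 0.95 is an illustrative per-step accuracy, not a measured value.
    for n in (1, 2, 3, 5, 10):
        print(n, round(chained_accuracy(0.95, n), 3))
```

Under these assumptions the expected accuracy decays geometrically with the length of the annotation chain, which is one way to see why an error in a frequently hit database entry matters more than an isolated mistake.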
Table 3 Evaluation of the “corrections” provided to the original TIGR annotations by GeneQuiz analysis

| Protein name | Best hit to a characterized protein, E-value | TIGR annotation [17] | Status | GeneQuiz correction [38] | Status | Latest annotation [30] |
|---|---|---|---|---|---|---|
| MG061 | UHPT_SALTY, 0.23 | Hexose phosphate transport protein UhpT | A | False positive | F | Hexose phosphate transport protein |
| MG090 | RS6_HAEIN, 0.00024 | Ribosomal protein S6 | A | False positive | F | Ribosomal protein S6 |
| MG120 | RBSC_HAEIN, 0.27 | Ribose permease RbsC | A,O? | False positive | F | Permease |
| MG406 | gi1209759, 1.6e-45 | Transport permease P69 | W | False positive | U | H+-ATPase I chain |
| MG006 | KTHY_BACSU, 1.1e-28 | Thymidylate kinase | A | Putative kinase | F | Thymidylate kinase |
| MG041 | PTHP_STAAU, 7.4e-9 | Phosphohistidinoprotein PtsH | A | ptsH gene, HPr | A | Phosphocarrier protein HPr |
| MG099 | HYIN_AGRRA, 1.2e-20 | Hydrolase Aux2 | A | Indoleacetamide hydrolase | O | Amidase |
| MG137 | GLF1_KLEPN, 9.5e-52 | dTDP-4-dehydrorhamnose reductase RfbD | A | Amine oxidase | O | Dehydrogenase; LPS biosynthesis (see text) |
| MG278 | SPOT_ECOLI, 1.8e-56 | Stringent response-like protein | A | Pyrophosphohydrolase | F | ppGpp synthetase/pyrophosphatase |
| MG310 | PIP_BACCO, 1.7e-6 | Proline iminopeptidase | A | Triacylglycerol lipase | O | Hydrolase |
| MG409 | PHOU_ECOLI, 0.00021 | Peripheral membrane protein PhoU | A | Pho negative regulator | A | Phosphate transport regulator PhoU |

A - adequate prediction; O - overprediction; U - underprediction; W - wrong annotation; F - failure, correction of an adequate TIGR annotation.
Table 4 A comparison of the automatic and manual annotations of the Methanococcus jannaschii genome

| Protein name | Best hit to a characterized protein, E-value | GeneQuiz annotation [4] | Manual annotation [30] | Type of errors in GeneQuiz annotation | Possible reasons, comments |
|---|---|---|---|---|---|
| MJ0134 | MDMC_STRMY, 1e-18 | Protein beta-aspartate methyltransferase | SAM-dependent methyltransferase | O | Only the best hit was considered |
| MJ0226 | HAM1_YEAST, 7.2e-21 | HAM1, controls hydroxylaminopurine mutagenesis | Unknown ACR | O | Actual function is not known; similar proteins in many bacteria |
| MJ0252 | PYR5_DICDI, 1.0e-8 | UMP synthetase | Orotidine 5'-phosphate decarboxylase | W | Only the best hit was considered; multi-domain structure of UMP synthase was ignored (see Fig. 1) |
| MJ0392 | IMDH_PYRFU, 2.1e-9 | IMP dehydrogenase homolog | Zn-dependent protease | W, O | Multi-domain structure of the IMPDH was ignored; it has only a 100 aa overlap with MJ0392; similar proteins in many bacteria (see text) |
| MJ0590 | SUCD_THEFL, 2.7e-9 | Succinyl-CoA ligase (GDP-forming), α chain | Succinyl-CoA ligase, α and β chains | W | The differences in lengths of the query and the best hit were ignored; GDP- and ADP-forming enzymes are similar, MJ0590 can be either one |
| MJ0682 | DPOL_THELI, 0.0043 | DNA polymerase B replication factor C | Unknown ACR, intein-containing | W | Intein region was not masked; only the best hit was considered |
| MJ0797 | FTSX_ECOLI, 0.0021 | Cell division protein FtsX homolog | Permease | W | Membrane-spanning regions were not masked; only the best hit was considered |
| MJ1079 | None | Spore germination protein B2 | Integral membrane protein | W | Not clear; similar proteins in many bacteria |
| MJ1129 | MRP_SYNY3, 4.4e-6 | MRP protein homolog | Unknown ACR | W | The difference in length of the query and the best hit was ignored; MJ1129 is shorter than the MRP proteins, and does not contain the conserved ATP-binding site |
| MJ1207 | PAIA_BACSU, 3.3e-9 | Protease synthase and sporulation negative regulatory protein | Acetyltransferase | W | Only the best hit was considered; motif search for HTH domain was not performed |
| MJ1310 | NULC_SYNY3, 0.012 | Na+/H+ antiporter system ORF3 | NADH-ubiquinone oxidoreductase chain 2 | O | Membrane-spanning regions were not masked; actual function of the best hit is unknown |
| MJ1336 | CC31_YEAST, 0.018 | ADP-heptose synthase | Unknown ACR | W | The difference in length of the query and the best hit was ignored; MJ1336 is shorter and lacks the conserved ATP-binding site; actual function of the best hit is unknown |
| MJ1375 | CAPF_STAAU, 2.0e-6 | Putative O-antigen transporter | Permease | O | Only the best hit was considered; actual function of the best hit is not known |
| MJ1452 | HMT1_YEAST, 0.0037 | rRNA adenine N-6-methyltransferase | SAM-dependent methyltransferase | O | Only the best hit was considered |
| MJ1533 | GSPE_ERWCA, 9.6e-5 | Mannose-sensitive hemagglutinin E | Glutamyl-tRNA transferase + KH domain + ATPase | W | Multi-domain structure of MJ1533 was ignored |
| MJ1618 | CURC_STRCN, 3.7e-8 | Polyketide synthase CurC | Mannose-6-phosphate isomerase | W | Only the best hit was considered; actual function of CurC is unknown; the name refers to the whole pathway, not just this enzyme (see text) |

W - wrong prediction; O - overprediction.
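Several comments in Table 4 point to ignored differences in length between the query and its best hit. The following is a minimal sketch of how such a check might look, not a description of any of the annotation systems discussed here. The `Hit` record, the function `coverage_flags`, the 0.7 coverage threshold, and the coordinates in the usage example are all hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Hit:
    """Alignment coordinates of one database hit (1-based, inclusive)."""
    name: str
    query_len: int
    hit_len: int
    q_start: int
    q_end: int
    h_start: int
    h_end: int


def coverage_flags(hit: Hit, min_cov: float = 0.7) -> List[str]:
    """Return warnings when the alignment covers only part of the query
    or only part of the database sequence (hypothetical threshold)."""
    q_cov = (hit.q_end - hit.q_start + 1) / hit.query_len
    h_cov = (hit.h_end - hit.h_start + 1) / hit.hit_len
    flags = []
    if q_cov < min_cov:
        # Part of the query is unexplained by this hit.
        flags.append("query_partially_covered")
    if h_cov < min_cov:
        # The hit may carry extra domains whose function must not be transferred.
        flags.append("hit_partially_covered")
    return flags


if __name__ == "__main__":
    # Invented coordinates: a short query matching the C-terminal part of a longer hit.
    example = Hit("hypothetical_hit", query_len=200, hit_len=480,
                  q_start=1, q_end=190, h_start=280, h_end=470)
    print(coverage_flags(example))
```

A flag of this kind does not say which annotation is right; it only marks the cases where transferring the full description of the hit to the query deserves a second look.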
COMMON PROBLEMS IN PROTEIN FUNCTION PREDICTION

Incorrect Annotation in Protein Databases

The simplest reason for unjustified function predictions is an incorrect annotation of the database entry that happens to be the closest homolog of the protein in question. Indeed, an open reading frame (ORF) is often assumed to code for an enzyme when it complements a known mutation, resulting in an increase in the enzyme activity. Such an effect, of course, can be due to suppression of the mutation, provision of a missing cofactor, and a plethora of other mechanisms. Thus, the Pseudomonas aeruginosa ORF (KHSE_PSEAE) that complemented the thrB mutation in Escherichia coli was assumed to code for homoserine kinase, even though it lacked any detectable sequence similarity with the same enzyme from other sources (e.g., KHSE_ECOLI), its sequence did not contain any known ATP-binding motif, and inactivation of the chromosomal copy of the gene did not confer threonine auxotrophy [12]. Several similar cases, in which a protein has been included in curated databases like SwissProt and/or PIR although the enzyme activity assigned to it has never been supported by either direct experiments or significant sequence conservation, are listed in Table 5. Most likely, the functions of most of these proteins have been misidentified.

Table 5 Questionable enzyme identifications in SwissProt and PIR databases

| Enzyme activity (EC No.) | Protein in question: SwissProt entry | Protein in question: PIR entry | Closest characterized homolog | Enzyme with demonstrated activity | Comment | Reference |
|---|---|---|---|---|---|---|
| Isopropylmalate dehydrogenase (EC 1.1.1.85) | LEU3_SCHOC | S55845 | BUD3_YEAST | LEU3_ECOLI | Changed to YLEU_SCHOC in SwissProt rel. 35, a warning added | [22] |
| Protoporphyrinogen oxidase (EC 1.3.3.4) | HEMG_ECOLI | JC2513 | FLAV_CLOAB, FLAV_DESGI | HEMG_BACSU | A flavodoxin component of PPO, not shown to have enzymatic activity | [37,41] |
| Dihydrofolate reductase (EC 1.5.1.3) | DYR_MYCTU | S21834 | - | DYRA_ECOLI, DYR2_ECOLI | | J. Dale, personal communication |
| Thymidylate synthase (EC 2.1.1.45) | TYSY_MYCTU | - | - | TYSY_ECOLI | Misannotated in SwissProt as a member of the thymidylate synthase family. Belongs to an uncharacterized protein family unrelated to thymidylate synthases. | J. Dale, personal communication |
| Uroporphyrin-III C-methyltransferase (EC 2.1.1.107) | HEMX_ECOLI | S02185 | - | CYSG_ECOLI | | [40] |
| Lipopolysaccharide 1,2-N-acetylglucosaminetransferase (EC 2.4.1.56) | RFAK_ECOLI | C42981 | AF004712 | RFAK_SALTY | Misannotated based on the position of the gene in the rfa operon (see text) | [25] |
| Queuine tRNA-ribosyltransferase (EC 2.4.2.29) | TGT_RABIT, TGT_HUMAN, TGT_CAEEL | S68430 | UBPF_YEAST, UBPD_MOUSE | TGT_ECOLI | Probable ubiquitin C-terminal hydrolases; similarity mentioned in SwissProt, not in PIR | [13] |
| Homoserine kinase (EC 2.7.1.39) | KHSE_PSEAE | S27981 | - | KHSE_ECOLI | No known ATP-binding motifs (see text) | [12] |
| Acetylornithine deacetylase (EC 3.5.1.16) | ARGE_LEPBI | A31840 | RPOC_ECOLI | ARGE_ECOLI | Discussed by authors, warning given in SwissProt and PIR | [56] |
| Lactoylglutathione methylglyoxal lyase (EC 4.4.1.5) | LGUL_SOYBN | S47177 | GTXA_TOBAC | LGUL_HUMAN | Probable glutathione S-transferase | - |
| Chorismate mutase (EC 5.4.99.5) | PHEB_BACSU | D32804 | - | CHMU_BACSU | Warning given in PIR, not in SwissProt | [51] |
| Folylpolyglutamate synthase (EC 6.3.2.17) | VG29_BPT4 | - | - | FOLC_ECOLI | | [23] |
Table 6 Removing spurious database hits of a non-globular protein by modifying SEG parameters (a)

E-values are given for the hit with an ortholog (Y123_MYCPN), with arginine deiminase (ARCA_MYCAR) (c), and with a typical non-globular protein (myosin).

| SEG parameters (b) | No. of masked residues | Ortholog, BLASTP | Ortholog, WUBLAST | Ortholog, BLASTPGP | Arginine deiminase, BLASTP | Arginine deiminase, WUBLAST | Arginine deiminase, BLASTPGP | Myosin, BLASTP | Myosin, WUBLAST | Myosin, BLASTPGP |
|---|---|---|---|---|---|---|---|---|---|---|
| No filtering | 0 | 1.7e-131 | 2.3e-99 | 1e-100 | 0.068 | 0.099 | 0.046 | 0.032 | 0.0073 | 0.016 |
| 12 2.2 2.5 | 28 | 2.2e-126 | 1.5e-93 | 1e-101 | 0.058 | 0.085 | 0.046 | 0.60 | 0.0089 | 0.016 |
| 12 2.3 2.6 | 105 | 6.6e-105 | 4.6e-76 | 4e-80 | 0.94 | 0.997 | - | 0.73 | 0.025 | 0.016 |
| 12 2.4 2.7 | 166 | 1.2e-90 | 3.1e-63 | 4e-67 | 0.85 | 0.97 | | - | - | - |
| 12 2.5 2.8 | 193 | 6.5e-72 | 3.5e-54 | 6e-51 | 0.76 | 0.91 | | | | |
| 12 2.6 2.9 | 253 | 6.2e-55 | 1.1e-41 | 1e-26 | - | - | | | | |

(a) Low-complexity regions in MG123 were masked using the SEG program with the listed parameters. Each of the resulting proteins was compared to the non-redundant database (NCBI) using BLASTP v. 1.4.9, WUBLASTP v. 2.0a, and BLASTPGP v. 2.0.3.
(b) Trigger window length, trigger complexity, and extension complexity, see [55]. The default parameter set is 12 2.2 2.5.
(c) ARCA_MYCAR was the database hit chosen for the MG123 annotation by Ouzounis et al. [38].
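The effect of the three parameters in Table 6 can be illustrated with a simplified masking routine. The sketch below is not the SEG program and omits much of what SEG does; it assumes, for illustration, that complexity is measured as the Shannon entropy (in bits) of the residue composition of a sliding window, that a window at or below the trigger complexity starts a masked segment, and that the segment is extended over neighboring windows at or below the extension complexity. The function names and the test sequence are hypothetical.

```python
import math
from collections import Counter


def window_entropy(window: str) -> float:
    """Shannon entropy (bits) of the residue composition of a window."""
    n = len(window)
    counts = Counter(window)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())


def mask_low_complexity(seq: str, window: int = 12, trigger: float = 2.2,
                        extension: float = 2.5, mask_char: str = "X") -> str:
    """Mask low-entropy regions with a trigger/extension scheme
    (a simplified stand-in for SEG-style filtering, not SEG itself)."""
    seq = seq.upper()
    n = len(seq)
    if n < window:
        return seq
    ent = [window_entropy(seq[i:i + window]) for i in range(n - window + 1)]
    masked = [False] * n
    i = 0
    while i < len(ent):
        if ent[i] <= trigger:
            lo = hi = i
            # Extend over contiguous windows that pass the looser threshold.
            while lo > 0 and ent[lo - 1] <= extension:
                lo -= 1
            while hi + 1 < len(ent) and ent[hi + 1] <= extension:
                hi += 1
            for j in range(lo, hi + window):
                masked[j] = True
            i = hi + 1
        else:
            i += 1
    return "".join(mask_char if m else c for c, m in zip(seq, masked))


if __name__ == "__main__":
    # Invented sequence with a homopolymeric run between two diverse flanks.
    demo = "MKTAYIAKQRQISFVKSHFSRQ" + "E" * 20 + "LEERLGLIEVQAPILSRVGDGT"
    print(mask_low_complexity(demo))                    # default-like setting
    print(mask_low_complexity(demo, 12, 2.6, 2.9))      # stricter setting
```

In this simplified scheme, raising the trigger and extension thresholds can only enlarge the masked region, which mirrors the growing number of masked residues down the rows of Table 6.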
Another group includes cases where the annotation of a protein, while technically correct, does not contain biological information that can be used for assigning functions to its homologs. Thus, MJ1618, annotated by GeneQuiz as polyketide synthase CurC, is indeed homologous to CURC_STRCN, the product of the third ORF in an operon coding for the biosynthesis of an antibiotic, curamycin, in Streptomyces curacoi [7]. However, such an annotation is flawed, as M. jannaschii evidently does not produce this antibiotic. On the other hand, a detailed analysis of MJ1618 shows that it has statistically significant sequence similarity to several phosphomannose isomerases, such as ALGA_PSEAE, and, most probably, has phosphohexomutase activity (L. Aravind et al., manuscript in preparation).

Low Sequence Complexity

Low complexity regions, which are abundant in protein sequences, particularly eukaryotic ones, and typically correspond to non-globular domains, tend to produce spurious hits in database searches [54]. While these regions are routinely masked using the SEG program prior to similarity searches with the BLAST family programs, the default settings of SEG are not suited for masking most of the non-globular protein domains [54,55]. More stringent filtering, specifically adjusted for delineation of non-globular domains, is frequently needed for the detection of subtle but functionally relevant signals in the globular domains [54]. The erroneous identification of an arginine deiminase homolog in M. genitalium, which became the basis for far-reaching conclusions on the existence of amino acid metabolism in this bacterium [38], is a typical example of the misleading consequences of inadequate filtering of low complexity regions (Table 6). For a number of proteins, however, even masking with strict SEG parameters may be insufficient to detect all the non-globular domains that tend to produce spurious hits in database searches. Additional masking of coiled-coil domains or transmembrane helices may be required. Programs are now available for sequential sequence masking with a variety of methods [52].

Multi-Domain Organization of Proteins

Many proteins, perhaps the majority in the case of eukaryotes, are composed of several domains that may have different, sometimes unknown, functions [14,34]. A simple illustration of the effect that multi-domain organization of a protein may have on sequence-based protein function prediction is shown in Fig. 1. M. jannaschii protein MJ0252, annotated by GeneQuiz as UMP synthetase [4], aligns only with the C-terminal part of UMP synthetase (PYR5_DICDI), which is responsible for the orotidine 5'-phosphate decarboxylase activity. Even though PYR5_DICDI is the best hit, MJ0252 cannot be predicted to possess UMP synthetase activity, as it does not contain the orotate phosphoribosyltransferase domain.

Ignoring the domain structure of the best database hit may easily result in an obviously wrong functional annotation even in the case of striking sequence conservation. The M. genitalium gene MG262 product has been repeatedly annotated as DNA polymerase I [17,18,38]. While MG262 is clearly homologous to the N-terminal part of DPO1_ECOLI, it lacks the polymerase portion (Klenow fragment). The N-terminal domain of DNA polymerase I is responsible for its 5'-3' exonuclease activity, and this is the obvious functional prediction for MG262.

A more complex and typical case is illustrated in Fig. 3. The M. genitalium protein MG449 has been annotated as Phe-tRNA synthetase [38]. However, while the MG449-like domain is present in bacterial Phe-tRNA synthetases (e.g., SYFB_ECOLI), it is absent from archaeal, eukaryotic and chloroplast enzymes (Fig. 3). Thus, despite the highly statistically significant database hits of this protein with several Phe-tRNA synthetases, the proper annotation for MG449 should have stated that its function was unknown. Recent studies on related proteins in yeast and humans [24,45] indicate that MG449 is most likely an RNA-binding domain found in a variety of multi-domain and stand-alone proteins.

An even more striking series of incorrect annotations due to the multi-domain structure of target proteins involves the cystathionine β-synthase (CBS) domain, described recently by Bateman [6]. This domain of unknown function is found in many proteins, including E. coli IMP dehydrogenase (IMDH_ECOLI).
Fig. 1. Alignment of MJ0252 with orotidine 5'-phosphate decarboxylases and UMP synthetases. The sequences are: 1, MJ0252; 2–4, orotidine 5'-phosphate decarboxylases; 2, DCOP_BACSU; 3, DCOP_ECOLI; 4, DCOP_YEAST; 5–6, UMP synthetases; 5, PYR5_DICDI; 6, PYR5_HUMAN. The alignment was generated by the MACAW program [43]; shading indicates the mean similarity scores between the aligned segments.

Fig. 2. Alignment of MG262 with DNA polymerases I. The sequences are: 1, MG262; 2, Klenow fragment of the E. coli DNA polymerase I; 3–5, DNA polymerases I; 3, DPO1_ECOLI; 4, DPO1_BACCA; 5, DPO1_THEAQ. Other details as in Fig. 1.

Fig. 3. Alignment of MG449 with phenylalanyl-tRNA synthetases. The sequences are: 1, MG449; 2, MP179, a homolog from M. pneumoniae; 3, YtpR, a homolog from Bacillus subtilis; 4–5, bacterial Phe-tRNA synthetases; 4, SYFB_ECOLI; 5, SYFB_BACSU; 6–7, archaeal Phe-tRNA synthetases; 6, MJ1108, a Methanococcus jannaschii protein; 7, SS56KBFR, a Sulfolobus solfataricus protein; 8, SYFB_PORPU, a chloroplast protein; 9–10, eukaryotic Phe-tRNA synthetases; 9, SYFA_YEAST; 10, F22B5.9, a Caenorhabditis elegans protein. Archaeal, chloroplast, and eukaryotic Phe-tRNA synthetases lack the putative RNA-binding domain homologous to MG449. Other details as in Fig. 1.
As a result, each protein containing this domain shows statistically significant similarity with IMP dehydrogenase, causing widespread confusion among genome annotators. In the revision of the M. jannaschii genome, for example, Kyrpides et al. [31] annotated 12 CBS domain-containing proteins as similar to IMP dehydrogenase. Remarkably, in 5 cases these misleading annotations were offered as revisions of the original, more appropriate annotations of these proteins as “hypothetical” [11]. GeneQuiz identified another protein from M. jannaschii (MJ0392, see Table 4) containing the CBS domain and duly annotated it as an IMP dehydrogenase. Even after the illuminating report of Bateman [6] had been published, and CBS domains were clearly marked in SwissProt entries, ten proteins of Methanobacterium thermoautotrophicum containing this domain were annotated as IMP dehydrogenase-related ones [46]. In the recently published genome of Archaeoglobus fulgidus, some such proteins are annotated simply as conserved hypothetical ones, while others (AF0847, AF1259) are still annotated as putative IMP dehydrogenases [26].

The simplest (but not most reliable) way to circumvent the problem of multi-domain organization of proteins is to compare the length of a match for each database hit with the length of the query sequence, which could indicate possible conflicts. This method is implemented in the annotation engine of the WIT database [57]. The new WWW interfaces for gapped BLAST and PSI-BLAST [3] on the NCBI server present schematic graphical alignments, showing the location of the hit as compared to the query sequence. Another option is to compare the query protein with the Clusters of Orthologous Groups (COG) database [49], where multi-domain proteins are divided into separate domains whenever their single-domain orthologs are found in any of the completely sequenced genomes. Significant hits with proteins from more than one COG would indicate a likely multi-domain organization of the query.

Non-Orthologous Gene Displacement

In different organisms, the same function can be performed by unrelated or distantly related proteins [28,29]. It appears that in many cases, these enzymes have evolved by shifting the substrate specificity of a related but distinct enzyme. Fig. 4 shows an alignment of gluconate kinases from E. coli and B. subtilis. It is clear that GNTK_BACSU is unrelated to GNTK_ECOLI and is a paralog of GLPK_BACSU. Were these activities not known from biochemical data [19], GNTK_BACSU would be confidently annotated as glycerol kinase. Similar cases were discovered in all enzyme classes and appear to be more common than previously thought (M.Y.G., D.R.Walker and E.V.K., Genome Res., in press).

In many protein families, enzymes and binding proteins with different specificities may be as similar to each other as those with the same specificity. Ignoring this easily results in overpredictions, as, for example, in the case of MG225 (Table 2). This membrane protein has significant sequence similarity to lysine, histidine, and arginine permeases; it can be confidently predicted that it mediates amino acid transport; the available data, however, are insufficient to predict the exact specificity.

An even more important case is MG137. Originally annotated by the TIGR team as dTDP-4-dehydrorhamnose reductase RfbD, it was re-annotated by the GeneQuiz team as amine oxidase. We considered the TIGR annotation to be adequate, and were unable to find any justification for its correction by Ouzounis et al. [38]. As the only readily identifiable functional motif in this protein was the glycine-rich loop typical of dinucleotide-binding enzymes, it was tentatively annotated as a dehydrogenase participating in lipopolysaccharide biosynthesis. Subsequent experimental studies on this operon finally identified the E. coli ortholog of MG137 as UDP-galactose mutase [35], which is now reflected in the SwissProt description of MG137 (GLF_MYCGE); this is indeed a FAD-utilizing enzyme, so the conserved motif is a part of the FAD-binding site as predicted. This example shows the inherent limitations of functional predictions, manual or automatic, based solely on protein sequence motif conservation; in a number of cases, such predictions may correctly identify certain important structural and functional properties of a protein but miss the specific activity.

Operon Disruption

Comparison of genome organization of bacteria and archaea indicated that only very few operons are conserved across large phylogenetic distances [27,33]. As a result of operon disruption, genes that belong to the same operon in one species are likely to be scattered in other species, which may complicate their identification. This is often observed even in closely related species [53]; hence, functional prediction based on gene position in operons sometimes leads to errors. Thus, RFAK_ECOLI was predicted to code for lipopolysaccharide 1,2-N-acetylglucosamine transferase solely on the basis of its position in the E. coli rfa operon, even though the products of these genes in E. coli and S. typhimurium showed very little similarity [25]. A recent study of a homologous protein in Haemophilus ducreyi identified it as D-glycero-D-manno-heptosyl transferase [21].
Fig. 4. Non-orthologous gene displacement: two types of gluconate kinase in bacteria. The sequences are: 1, GNTK_BACSU; 2–4, gluconate kinases; 2, GNTK_ECOLI; 3, GNTV_ECOLI; 4, GNTK_SCHPO; 5–6, xylose kinases; 5, XYLB_ECOLI; 6, XYLB_STAXY; 7–8, glycerol kinases; 7, GLPK_BACSU; 8, GLPK_MYCGE.
CONCLUSIONS

It appears that errors in genome annotation most frequently occur when:
• The annotation of the best database hit is non-critically reproduced, even though the actual function of the respective protein is not known or cannot be realistically expected to be encoded in the genome being characterized.
• Only the best database hit is taken into account for function prediction.
• Regions of low complexity, typically corresponding to non-globular domains, are not properly masked.
• Multi-domain organization of the query protein sequence or the database hit(s) is ignored.
• The function of an ORF is non-critically inferred from the functions of the neighboring gene products.

These pitfalls plague both manual and automatic prediction but are particularly difficult to avoid in automated systems for genome annotation. The solution should lie both in the continuing involvement of expert biologists in genome annotation and in the incorporation of more sophisticated logic into automated methods.
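One of the pitfalls listed above, relying on the single best hit, lends itself to a simple illustration. The sketch below is a hypothetical example of the kind of additional logic an automated system might apply, not a description of any existing tool: it collects the annotations of all hits below an assumed E-value cutoff and reports whether they agree. The function `summarize_hits`, the cutoff of 1e-3, and the hit descriptions and E-values in the usage example are invented for illustration.

```python
from typing import Dict, List, Tuple


def summarize_hits(hits: List[Tuple[str, float]],
                   evalue_cutoff: float = 1e-3) -> Dict[str, object]:
    """Compare the annotations of all significant hits instead of
    copying the description of the best one (hypothetical cutoff)."""
    significant = sorted((h for h in hits if h[1] <= evalue_cutoff),
                         key=lambda h: h[1])
    if not significant:
        return {"status": "no significant hit", "annotations": []}
    distinct: List[str] = []
    for annotation, _ in significant:
        if annotation not in distinct:
            distinct.append(annotation)
    if len(distinct) == 1:
        status = "consistent"
    else:
        # Disagreement among significant hits: the specificity of the best
        # hit should not be transferred without manual inspection.
        status = "conflicting: needs manual review"
    return {"status": status, "annotations": distinct}


if __name__ == "__main__":
    # Invented hits resembling a permease family with mixed specificities.
    hits = [("lysine permease", 1e-5),
            ("histidine permease", 3e-5),
            ("arginine permease", 1e-4),
            ("unrelated protein", 0.5)]
    print(summarize_hits(hits))
```

In a case like this, a cautious annotation would retain only what the significant hits share (here, amino acid transport) and leave the exact specificity open.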
References
[1] Altschul, S.F. and Gish, W. Local Alignment Statistics. Methods Enzymol. 266, (1996) 460–480.
[2] Altschul, S.F., Gish, W., Miller, W., Myers, E.W. and Lipman, D.J. Basic Local Alignment Search Tool. J. Mol. Biol. 215, (1990) 403–410.
[3] Altschul, S.F., Madden, T.L., Schaffer, A.A., Zhang, J., Zhang, Z., Miller, W. and Lipman, D.J. Gapped BLAST and PSI-BLAST: A New Generation of Protein Database Search Programs. Nucleic Acids Res. 25, (1997) 3389–3402.
[4] Andrade, M., Casari, G., de Daruvar, A., Sander, C., Schneider, R., Tamames, J., Valencia, A. and Ouzounis, C. Sequence Analysis of the Methanococcus jannaschii Genome and the Prediction of Protein Function. Comput. Appl. Biosci. 13, (1997) 481–483.
[5] Appel, R.D., Bairoch, A. and Hochstrasser, D.F. A New Generation of Information Retrieval Tools for Biologists: The Example of the ExPASy WWW Server. Trends Biochem. Sci. 19, (1994) 258–260.
[6] Bateman, A. The Structure of a Domain Common to Archaebacteria and the Homocystinuria Disease Protein. Trends Biochem. Sci. 22, (1997) 12–13.
[7] Bergh, S. and Uhlen, M. Analysis of a Polyketide Synthesis-Encoding Gene Cluster of Streptomyces curacoi. Gene 117, (1992) 131–136.
[8] Bhatia, U., Robison, K. and Gilbert, W. Dealing with Database Explosion: A Cautionary Note. Science 276, (1997) 1724–1725.
[9] Bork, P. and Bairoch, A. Go Hunting in Sequence Databases but Watch out for the Traps. Trends Genet. 12, (1996) 425–427.
[10] Bork, P. and Gibson, T.J. Applying Motif and Profile Searches. Methods Enzymol. 266, (1996) 162–184.
[11] Bult, C.J., White, O., Olsen, G.J., Zhou, L., Fleischmann, R.D., Sutton, G.G., Blake, J.A., FitzGerald, L.M., Clayton, R.A., Gocayne, J.D., et al. Complete Genome Sequence of the Methanogenic Archaeon, Methanococcus jannaschii. Science 273, (1996) 1058–1073.
[12] Clepet, C., Borne, F., Krishnapillai, V., Baird, C., Patte, J.C. and Cami, B. Isolation, Organization and Expression of the Pseudomonas aeruginosa Threonine Genes. Mol. Microbiol. 6, (1992) 3109–3119.
[13] Deshpande, K.L., Seubert, P.H., Tillman, D.M., Farkas, W.R. and Katze, J.R. Cloning and Characterization of cDNA Encoding the Rabbit tRNA-Guanine Transglycosylase 60-kilodalton Subunit. Arch. Biochem. Biophys. 326, (1996) 1–7.
[14] Doolittle, R.F. The Multiplicity of Domains in Proteins. Annu. Rev. Biochem. 64, (1995) 287–314.
[15] Eddy, S.R., Mitchison, G. and Durbin, R. Maximum Discrimination Hidden Markov Models of Sequence Consensus. J. Comput. Biol. 2, (1995) 9–23.
[16] Fleischmann, R.D., Adams, M.D., White, O., Clayton, R.A., Kirkness, E.F., Kerlavage, A.R., Bult, C.J., Tomb, J.F., Dougherty, B.A., Merrick, J.M. et al. Whole-Genome Random Sequencing and Assembly of Haemophilus influenzae Rd. Science 269, (1995) 496–512.
[17] Fraser, C.M., Gocayne, J.D., White, O., Adams, M.D., Clayton, R.A., Fleischmann, R.D., Bult, C.J., Kerlavage, A.R., Sutton, G., Kelley, J.M. et al. The Minimal Gene Complement of Mycoplasma genitalium. Science 270, (1995) 397–403.
[18] Frishman, D. and Mewes, H.W. PEDANTic Genome Analysis. Trends Genet. 13, (1997) 415–416.
[19] Fujita, Y., Fujita, T., Miwa, Y., Nihashi, J. and Aratani, Y. Organization and Transcription of the Gluconate Operon, gnt, of Bacillus subtilis. J. Biol. Chem. 261, (1986) 13744–13753.
[20] Galperin, M.Y. and Koonin, E.V. Sequence Analysis of an Exceptionally Conserved Operon Suggests Enzymes for a New Link between Histidine and Purine Biosynthesis. Mol. Microbiol. 24, (1997) 443–445.
[21] Gibson, B.W., Campagnari, A.A., Melaugh, W., Phillips, N.J., Apicella, M.A., Grass, S., Wang, J., Palmer, K.L. and Munson, R.S., Jr. Characterization of a Transposon Tn916-Generated Mutant of Haemophilus ducreyi 35000 Defective in Lipooligosaccharide Biosynthesis. J. Bacteriol. 179, (1997) 5062–5071.
[22] Iserentant, D. and Verachtert, H. Cloning and Sequencing of the LEU2 Homologue Gene of Schwanniomyces occidentalis. Yeast 11, (1995) 467–473.
[23] Ishimoto, L.K., Ishimoto, K.S., Cascino, A., Cipollaro, M. and Eiserling, F.A. The Structure of Three Bacteriophage T4 Genes Required for Tail-Tube Assembly. Virology 164, (1988) 81–90.
[24] Kleeman, T.A., Wei, D., Simpson, K.L. and First, E.A. Human Tyrosyl-tRNA Synthetase Shares Amino Acid Sequence Homology with a Putative Cytokine. J. Biol. Chem. 272, (1997) 14420–14425.
[25] Klena, J.D., Pradel, E. and Schnaitman, C.A. Comparison of Lipopolysaccharide Biosynthesis Genes rfaK, rfaL, rfaY, and rfaZ of Escherichia coli K-12 and Salmonella typhimurium. J. Bacteriol. 174, (1992) 4746–4752.
[26] Klenk, H.P., Clayton, R.A., Tomb, J.F., White, O., Nelson, K.E., Ketchum, K.A., Dodson, R.J., Gwinn, M., Hickey, E.K., Peterson, J.D. et al. The Complete Genome Sequence of the Hyperthermophilic, Sulphate-Reducing Archaeon Archaeoglobus fulgidus. Nature 390, (1997) 364–370.
[27] Koonin, E.V. and Galperin, M.Y. Prokaryotic Genomes: The Emerging Paradigm of Genome-Based Microbiology. Curr. Opin. Genet. Dev. 7, (1997) 757–763.
[28] Koonin, E.V. and Mushegian, A.R. Complete Genome Sequences of Cellular Life Forms: Glimpses of Theoretical Evolutionary Genomics. Curr. Opin. Genet. Dev. 6, (1996) 757–762.
[29] Koonin, E.V., Mushegian, A.R. and Bork, P. Non-Orthologous Gene Displacement. Trends Genet. 12, (1996) 334–336.
[30] Koonin, E.V., Mushegian, A.R., Galperin, M.Y. and Walker, D.R. Comparison of Archaeal and Bacterial Genomes: Computer Analysis of Protein Sequences Predicts Novel Functions and Suggests a Chimeric Origin for the Archaea. Mol. Microbiol. 25, (1997) 619–637.
[31] Kyrpides, N.C., Olsen, G.J., Klenk, H.-P., White, O. and Woese, C.R. Methanococcus jannaschii Genome: Revisited. Microb. Compar. Genomics 1, (1996) 329–338.
[32] Lupas, A. Prediction and Analysis of Coiled-Coil Structures. Methods Enzymol. 266, (1996) 513–525.
[33] Mushegian, A.R. and Koonin, E.V. Gene Order is not Conserved in Bacterial Evolution. Trends Genet. 12, (1996) 289–290.
[34] Mushegian, A.R., Bassett, D.E., Jr., Boguski, M.S., Bork, P. and Koonin, E.V. Positionally Cloned Human Disease Genes: Patterns of Evolutionary Conservation and Functional Motifs. Proc. Natl. Acad. Sci. USA 94, (1997) 5831–5836.
[35] Nassau, P.M., Martin, S.L., Brown, R.E., Weston, A., Monsey, D., McNeil, M.R. and Duncan, K. Galactofuranose Biosynthesis in Escherichia coli K-12: Identification and Cloning of UDP-Galactopyranose Mutase. J. Bacteriol. 178, (1996) 1047–1052.
[36] Nielsen, H., Engelbrecht, J., Brunak, S. and von Heijne, G. Identification of Prokaryotic and Eukaryotic Signal Peptides and Prediction of their Cleavage Sites. Protein Eng. 10, (1997) 1–6.
[37] Nishimura, K., Nakayashiki, T. and Inokuchi, H. Cloning and Identification of the hemG Gene Encoding Protoporphyrinogen Oxidase (PPO) of Escherichia coli K-12. DNA Res. 2, (1995) 1–8.
[38] Ouzounis, C., Casari, G., Valencia, A. and Sander, C. Novelties from the Complete Genome of Mycoplasma genitalium. Mol. Microbiol. 20, (1996) 898–900.
[39] Rost, B., Casadio, R., Fariselli, P. and Sander, C. Transmembrane Helices Predicted at 95% Accuracy. Protein Sci. 4, (1995) 521–533.
[40] Sasarman, A., Echelard, Y., Letowski, J., Tardif, D. and Drolet, M. Nucleotide Sequence of the hemX Gene, the Third Member of the Uro Operon of Escherichia coli K12. Nucleic Acids Res. 16, (1988) 11835–11835.
[41] Sasarman, A., Letowski, J., Czaika, G., Ramirez, V., Nead, M.A., Jacobs, J.M. and Morais, R. Nucleotide Sequence of the hemG Gene Involved in the Protoporphyrinogen Oxidase Activity of Escherichia coli K12. Can. J. Microbiol. 39, (1993) 1155–1161.
[42] Scharf, M., Schneider, R., Casari, G., Bork, P., Valencia, A., Ouzounis, C. and Sander, C. GeneQuiz: A Workbench for Sequence Analysis. Intelligent Systems for Molecular Biology 2, (1994) 348–353.
[43] Schuler, G.D., Altschul, S.F. and Lipman, D.J. A Workbench for Multiple Alignment Construction and Analysis. Proteins 9, (1991) 180–190.
[44] Selkov, E., Maltsev, N., Olsen, G.J., Overbeek, R. and Whitman, W.B. A Reconstruction of the Metabolism of Methanococcus jannaschii from Sequence Data. Gene 197, (1997) GC11–GC26.
[45] Simos, G., Segref, A., Fasiolo, F., Hellmuth, K., Shevchenko, A., Mann, M. and Hurt, E.C. The Yeast Protein Arc1p Binds to tRNA and Functions as a Cofactor for the Methionyl- and Glutamyl-tRNA Synthetases. EMBO J. 15, (1996) 5437–5448.
[46] Smith, D.R., Doucette-Stamm, L.A., Deloughery, C., Lee, H., Dubois, J., Aldredge, T., Bashirzadeh, R., Blakely, D., Cook, R., Gilbert, K., et al. Complete Genome Sequence of Methanobacterium thermoautotrophicum DeltaH: Functional Analysis and Comparative Genomics. J. Bacteriol. 179, (1997) 7135–7155.
[47] Tatusov, R.L., Altschul, S.F. and Koonin, E.V. Detection of Conserved Segments in Proteins: Iterative Scanning of Sequence Databases with Alignment Blocks. Proc. Natl. Acad. Sci. USA 91, (1994) 12091–12095.
[48] Tatusov, R.L., Mushegian, A.R., Bork, P., Brown, N.P., Hayes, W.S., Borodovsky, M., Rudd, K.E. and Koonin, E.V. Metabolism and Evolution of Haemophilus influenzae Deduced from a Whole-Genome Comparison with Escherichia coli. Curr. Biol. 6, (1996) 279–291.
[49] Tatusov, R.L., Koonin, E.V. and Lipman, D.J. A Genomic Perspective on Protein Families. Science 278, (1997) 631–637.
[50] Thompson, J.D., Higgins, D.G. and Gibson, T.J. CLUSTAL W: Improving the Sensitivity of Progressive Multiple Sequence Alignment through Sequence Weighting, Position-Specific Gap Penalties and Weight Matrix Choice. Nucleic Acids Res. 22, (1994) 4673–4680.
[51] Trach, K. and Hoch, J.A. The Bacillus subtilis spo0B Stage 0 Sporulation Operon Encodes an Essential GTP-Binding Protein. J. Bacteriol. 171, (1989) 1362–1371.
[52] Walker, D.R. and Koonin, E.V. SEALS: A System for Easy Analysis of Lots of Sequences. Intelligent Systems for Molecular Biology 5, (1997) 333–339.
[53] Watanabe, H., Mori, H., Itoh, T. and Gojobori, T. Genome Plasticity as a Paradigm of Eubacteria Evolution. J. Mol. Evol. 44, (1997) 57–64.
[54] Wootton, J.C. Non-globular Domains in Protein Sequences: Automated Segmentation Using Complexity Measures. Comput. Chem. 18, (1994) 269–285.
[55] Wootton, J.C. and Federhen, S. Analysis of Compositionally Biased Regions in Sequence Databases. Methods Enzymol. 266, (1996) 554–571.
[56] Zuerner, R.L. and Charon, N.W. Nucleotide Sequence Analysis of a Gene Cloned from Leptospira biflexa Serovar Patoc which Complements an argE Defect in Escherichia coli. J. Bacteriol. 170, (1988) 4548–4554.
[57] Overbeek, R., Larsen, N., Smith, W., Maltsev, N. and Selkov, E. Representation of Function: The Next Step. Gene 191, (1997) GC1–GC9.