Chapter 16
Bacterial Genome Annotation
Nicholas Beckloff, Shawn Starkenburg, Tracey Freitas, and Patrick Chain
Abstract
Annotation of prokaryotic sequences can be separated into structural and functional annotation. Structural annotation is dependent on algorithmic interrogation of experimental evidence to discover the physical characteristics of a gene. This is done in an effort to construct accurate gene models, so understanding function or evolution of genes among organisms is not impeded. Functional annotation is dependent on sequence similarity to other known genes or proteins in an effort to assess the function of the gene. Combining structural and functional annotation across genomes in a comparative manner promotes higher levels of accurate annotation as well as an advanced understanding of genome evolution. As the availability of bacterial sequences increases and annotation methods improve, the value of comparative annotation will increase.
Key words: Gene prediction, Genome sequencing, Genome annotation, Gene function
Ali Navid (ed.), Microbial Systems Biology: Methods and Protocols, Methods in Molecular Biology, vol. 881, DOI 10.1007/978-1-61779-827-6_16, © Springer Science+Business Media, LLC 2012
1. Introduction
Genome sequencing technology has played a central role in the advancement of the life sciences and has become as common and as straightforward as PCR. As the cost and difficulty of sequencing have plummeted over the years, few disciplines have benefited more than microbiology: over 1,000 bacterial and archaeal genomes, and even more viral genomes, are currently found in the GenBank database (1). When the cost of sequencing was high, only biomedical pathogens and common type cultures were sequenced, prioritized through cost–benefit analyses. As the cost of sequencing genomes continues to decrease, many new lineages are being explored, as are the differences between multiple strains of the same species, allowing for various types of comparative and pan-genomic analyses. In order to utilize the wealth of genomic information a completed genome affords us, the sequence (2) must
first be annotated. Annotation is a multilevel process that strives to define both the structural and functional properties of a given sequence (3). Structural annotation identifies the location(s) of genes, promoters, pseudogenes, RNA genes (rRNAs, tRNAs, and other small RNAs), and untranslated regions. Functional annotation defines the role of the aforementioned genetic structures encoded in the DNA sequence (3). Over time, additional genomic information such as the origin of replication, mobile elements (including prophages), pathogenicity islands, etc., have been detailed within bacterial or archaeal annotation files, and alternative exons, polyadenylation sites, centromere, and telomere regions have been described in eukaryotic annotations. The formats of annotation files have evolved to handle such additional descriptions. Annotation methods themselves have also evolved to keep up with the deluge of data that have been produced by next-generation sequencing (NGS). Initially, trained annotators curated and completed the highest-quality genome annotations manually. Through this process, annotators critically assessed all the experimental information available to aid their curation efforts. Although labor-intensive and slow, such manually annotated data sets are often considered the gold-standard dataset through which other users can base their annotations. As more data have become available, sophisticated and well-trained algorithms have been developed as part of computational pipelines designed for automated gene prediction (Table 1). These pipelines were first developed and employed at genome centers because of the generation of
Table 1
Gene prediction software

Program      Website
Glimmer      http://www.cbcb.umd.edu/software/glimmer/
GeneMarkS    http://exon.biology.gatech.edu/
Prodigal     http://compbio.ornl.gov/prodigal/
CRITICA      http://www.ttaxus.com/software.html
ORPHEUS      http://www.geneprediction.org/software.html
MetaGene     http://metagene.cb.k.u-tokyo.ac.jp/
tRNAscan     http://lowelab.ucsc.edu/tRNAscan-SE/
Rfam         http://www.sanger.ac.ulk/Software/Rfam
TFAM         http://tfam.lcb.uu.se/
MAGPIE       http://magpie.ucalgary.ca/
numerous genomes to manually curate. The main limitation of these algorithms is that accurate functional annotation relies on similarity to other annotations – thus errors are propagated throughout the database. Unfortunately, without actively pursuing the function of each new gene discovered, this remains the most effective method of function prediction. Although the sequencing and annotation processes continue to speed up, the discovery of novel functional protein families or classification of paralogs has not kept up the pace. Hence, the increase of annotation errors as a result of transference of misannotations, compounded by inadequacies in any given annotation pipeline, as well as the differences between annotation pipelines, has become a growing concern. The best way to address and deal with these issues, short of renewed efforts to try and develop the perfect annotation system, is to have a comprehensive understanding of the methods and the potential sources of error in the annotation process, particularly as various NGS technologies yield an even larger number of completed genomes. Such advances in sequencing are also yielding a number of drafted genomes at various stages of completion and have allowed forays into the field of shotgun metagenomics that has led to efforts into annotation of sequence reads (4). In this chapter, only the process of bacterial genome annotation (drafted or finished) is discussed, and a review of annotation resources and tools available to a microbiologist for such purposes is presented.
2. Materials
The materials needed for genome annotation depend solely on the number of sequences that require annotation. Sequencing a large number of genomes would warrant investment in a large computer server or cloud computing resources to keep long-term costs down. If only a few genomes require annotation, however, a desktop computer and an internet connection are sufficient. In an effort to fully explain the process of annotation, we offer basic protocols for common software tools and explain the benefits and caveats of their use. For demonstration purposes we have selected chromosome 1 of the Burkholderia pseudomallei K96243 bacterial genome sequence for annotation. This sequence is available from the NCBI Genome database (http://www.ncbi.nlm.nih.gov/genome) by querying the accession number BX571965. The computational programs described throughout this chapter are also freely available online for either use or download, and are summarized in the tables of this chapter.
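For readers who prefer to script the download, the demonstration sequence can also be requested through the NCBI E-utilities service. The following is a minimal sketch and not part of the original protocol: the helper name `build_efetch_url` is hypothetical, the choice of `rettype` is an assumption, and the download itself (left commented out) needs a network connection.

```python
from urllib.parse import urlencode

# Base address of the NCBI E-utilities "efetch" service.
EFETCH_BASE = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi"


def build_efetch_url(accession, rettype="gbwithparts"):
    """Return an efetch URL for a nucleotide record (hypothetical helper).

    rettype "gbwithparts" asks for the full GenBank flat file;
    rettype "fasta" would ask for the sequence only.
    """
    params = {
        "db": "nuccore",      # nucleotide database
        "id": accession,      # accession number, e.g. BX571965
        "rettype": rettype,
        "retmode": "text",
    }
    return EFETCH_BASE + "?" + urlencode(params)


url = build_efetch_url("BX571965")
print(url)

# To save the record locally (requires network access):
# from urllib.request import urlretrieve
# urlretrieve(url, "BX571965.gb")
```

The saved file can then be uploaded to the web tools described in the Methods section.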
3. Methods
3.1. Genome Annotation Overview
The initial step in obtaining a genome for annotation begins with the isolation of target genomic DNA and sequencing the genomic sample using one of a growing number of sequencing platforms. A detailed review of the different NGS platforms is beyond the scope of this review, but can be found elsewhere (5). As the cost ratio of sequence data generation to finishing genomes (the complex task of closing gaps and fixing errors) continues to widen, an increasingly large proportion of genome projects are considered done at a “draft” level of completion. Depending on the NGS platform, the depth of coverage sequenced and any manual or automated finishing steps or additional targeted sequencing reactions performed, the resulting quality of the draft may differ drastically. For example, current targets for Illumina-sequenced genomes are above 100-fold coverage, while for 454-sequenced genomes 30-fold coverage may be sufficient. Despite sufficient coverage, however, some data may suffer from more SNPs, more uncaptured gaps, or more frame shifts, depending on the platform used. A method of classifying genome projects based on quality alone, agnostic of sequencing technology, has been proposed and agreed upon by a large number of genome centers and other institutes/universities (4). At this stage in the sequencing process, these completed or drafted genomes are generally submitted to gene-finding programs for structural annotation, followed by functional annotation. Annotation is a multistep process, requiring the integration of a large number of tools for the identification of specific genomic features and the interoperability of tools for gene finding and assignment of function. The ability to visualize gene features and their functional annotations can also be considered essential for annotation to provide genomic context for associated evidence of function. The process of genome annotation is conducted in the steps outlined in Fig. 1. 
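The fold-coverage targets mentioned above follow from simple arithmetic: average coverage is the total number of sequenced bases divided by the genome length. The sketch below works this through; the genome size and read lengths are round numbers assumed purely for illustration, and the function names are hypothetical.

```python
import math


def fold_coverage(n_reads, read_length, genome_size):
    """Average depth of coverage: total sequenced bases / genome length."""
    return n_reads * read_length / genome_size


def reads_needed(target_coverage, read_length, genome_size):
    """Smallest number of reads that reaches the target average coverage."""
    return math.ceil(target_coverage * genome_size / read_length)


GENOME_SIZE = 4_000_000  # assumed genome size in bp (illustrative only)

# 100-fold coverage with short reads (assumed 100 bp)
short_reads = reads_needed(100, 100, GENOME_SIZE)
# 30-fold coverage with longer reads (assumed 400 bp)
long_reads = reads_needed(30, 400, GENOME_SIZE)

print(short_reads, long_reads)
```

Average coverage says nothing about how evenly reads are distributed, which is one reason drafts with nominally sufficient coverage can still contain gaps.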
Although past annotation pipelines and efforts were largely restricted to large genome centers, a number of publicly available tools or resources have surfaced and have also allowed smaller groups to annotate genomes independently. These new tools encompass three critical annotation components: (1) an integrated set of tools and information-rich databases for gene calling and function assignment; (2) a data management system and repository for storing and tracking annotations; and (3) an interactive graphical user interface for presentation of results, evidence, and context. As sequencing costs continue to fall and access to sequencing continues to broaden, an increasing number of investigators will rely on such pipelines to annotate their favorite genomes.
Fig. 1. Outline of a genome annotation project flowchart.
3.2. Structural Annotation
3.2.1. Gene-Finding Methods
During the annotation process, coding sequences (CDS), or protein-coding genes, are typically the first targets for gene prediction. Prediction of genes in prokaryotic organisms is a well-understood process with recognition rates above 90%, but it is not error free (6). The difficult task for gene prediction software is to distinguish true genes from noncoding open reading frames (ORFs). Unlike their eukaryotic counterparts, prokaryotic genes lack introns, so the entire coding sequence lies in a single exon. Genes are often organized in operons, where more than one gene is found under the control of a single promoter, and a gene sometimes overlaps its neighbor by several base pairs, causing problems for gene prediction software. Gene prediction methods can be divided into two main categories, extrinsic and intrinsic methods. Extrinsic gene prediction tools are evidence-based methods that rely on a priori knowledge stored in databases. They compare the genomic sequences to those already annotated, and assign protein function based on homology. This is ideal when a closely related genome or member of the same species is available, but often misleading when annotating a new species or attempting to identify genes that have not yet been
deposited in a database. Simple BLAST comparisons against databases as well as tools such as CRITICA employ this method (7). Intrinsic, or ab initio, gene prediction relies on properties of the DNA sequence (e.g., di-, tri-, and tetranucleotide composition) to discriminate between coding and noncoding regions using statistical and computational methods. Some of these properties include differences in codon, base, and amino acid composition (8). Based on these principles, machine-learning techniques are used to build probability models against which new ORFs are compared for gene discovery. Hidden Markov Models (HMMs) used for gene prediction have proved to be very fast and computationally efficient for identification of novel sequences; however, they have several inherent restrictions. The training set used must be very exact, and any sequences with atypical nucleotide composition will often be missed. Complicated gene structure and alternative biological signals such as alternative splicing, frameshift errors, or nested genes also pose problems (9). Additionally, the choice of start and stop codons can alter the accuracy of the prediction as they can encode multiple amino acids (10, 11). Examples of intrinsic annotation software include Glimmer, originally developed by TIGR, and GeneMark (12). Although gene-finding efforts have themselves been automated, all such tools are far from perfect. Most annotation pipelines use a combination of multiple extrinsic and intrinsic gene prediction tools in tandem for complete genomes. First, extrinsic tools query databases to identify genes based on homology searches of the sequence of interest. Hits to these genes are considered real and are added to the training set for the intrinsic software. The ab initio intrinsic software then uses the profile of the training set to predict and discover more genes that were missed in the previous step.
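Before any scoring takes place, a gene finder needs a pool of candidate ORFs to choose from. The following is a minimal sketch of that enumeration step on both strands. It is not how Glimmer or GeneMark are implemented; the start-codon set, the minimum-length cut-off, and all function names are assumptions made for illustration.

```python
STARTS = {"ATG", "GTG", "TTG"}   # assumed set of bacterial start codons
STOPS = {"TAA", "TAG", "TGA"}    # standard stop codons
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the reverse complement of an upper-case DNA string."""
    return seq.translate(COMPLEMENT)[::-1]


def orfs_in_strand(seq, min_codons):
    """Yield (start, end) spans, 0-based and half-open, on one strand.

    For every stop codon, the longest candidate is reported: it begins at
    the first start codon seen in that frame since the previous stop.
    The length in codons includes the stop codon.
    """
    for frame in range(3):
        start = None
        for i in range(frame, len(seq) - 2, 3):
            codon = seq[i:i + 3]
            if start is None and codon in STARTS:
                start = i
            elif codon in STOPS:
                if start is not None and (i + 3 - start) // 3 >= min_codons:
                    yield start, i + 3
                start = None


def find_orfs(seq, min_codons=100):
    """Return sorted (start, end, strand) tuples in forward coordinates."""
    seq = seq.upper()
    n = len(seq)
    found = [(s, e, "+") for s, e in orfs_in_strand(seq, min_codons)]
    for s, e in orfs_in_strand(reverse_complement(seq), min_codons):
        found.append((n - e, n - s, "-"))  # map back to the forward strand
    return sorted(found)


# Toy sequence: one 5-codon ORF (ATG AAA AAA AAA TAA) with 2 bp flanks.
toy = "CC" + "ATG" + "AAA" * 3 + "TAA" + "GG"
print(find_orfs(toy, min_codons=5))
```

A real gene finder would go on to score each candidate, for example with a codon-usage or Markov model trained on trusted genes, and would resolve overlaps between candidates; both steps are omitted here.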
In addition to protein-coding genes, RNA genes must be identified: ribosomal RNA operons are generally found via similarity to known rRNA sequences, while transfer RNAs can be reliably identified using tRNAscan-SE (13). These programs work in a fashion similar to gene-calling algorithms, but use sequence and RNA structure information to build predictive models for rRNA and tRNA structures. Supplementing annotations with this information supports the original gene predictions by finding rRNAs and tRNAs between genes. Similar to rRNAs, a growing number of small RNAs are also found via similarity searches; however, these appear to be less stringently conserved, thus small RNAs from distantly related genomes may not be found. For the example presented below, a protocol is outlined for the gene prediction tool GeneMark (http://exon.biology.gatech.edu/), one of the more popular and accurate gene-finding tools available (14). There are a number of other widely used and accurate gene finders, including the newly released Prodigal tool (15); GeneMark is used here for illustrative purposes only. The ease of use and flexibility of the GeneMark software are highlighted by the fact that it has different modules that may or may not be run online. The GeneMarkS algorithm accepts submissions for a self-trained approach (16)
Fig. 2. View of the GeneMark main page for adding a project for annotation. Options for submission include inputting job title, sequence information, and data-reporting options for the online GeneMark software. Data-formatting options can be selected at the bottom of the page.
while GeneMark.hmm-P works with one of the 265 predefined training sets. For bacterial and archaeal gene prediction, a parallel combination of GeneMark-P and GeneMark.hmm-P is recommended using the precomputed models. Figure 2 shows the online interface of GeneMark, and the following is a protocol to annotate a genome using this tool.
1. Download the GenBank file for chromosome 1 of Burkholderia pseudomallei K96243 (BX571965) from the NCBI database (http://www.ncbi.nlm.nih.gov/genome).
2. Be sure to give your sequence a title, input the sequence file, and select the appropriate output options (arrows shown in Fig. 2). To upload the Burkholderia sequence, click on the box marked “Choose File” and navigate to the Burkholderia GenBank file on your computer.
3. Additionally, select any of the output options including graphics, translation of genes into potential proteins, or the sequences of predicted genes.
Fig. 3. The top image (a) is a view of the GeneMark.hmm 2.4 gene prediction results using the Burkholderia pseudomallei K96243 chromosome 1 GenBank file. Predicted genes are listed in order based on sequence position along with strand, length, and class type. The bottom image (b) is a view of the output of atypical genes from the GeneMark.hmm 2.4 algorithm. Atypical gene models are derived from sets of protein-coding genes’ DNA sequences.
4. Once the GeneMark.hmm program has completed, several types of output will be presented depending on the options selected. The first output will be the predictions made by GeneMark.hmm 2.4 in text format (Fig. 3). This report consists of five columns containing all of the predicted genes, the
Fig. 4. View of the InterPro submission screen showing the predicted protein sequence BPSL0195. All of the applications for analysis have been selected to run for a more comprehensive functional prediction.
strand they are encoded on, the start and stop positions, gene length, and class (1 or 2 indicating Typical and Atypical, respectively). Genes of the Typical class exhibit codon usage patterns that are within the range of the majority of genes in the genomes of Burkholderia, whereas genes of the Atypical class do not follow the same patterns and may include a significant number of laterally transferred genes.
5. The second set of data consists of the GeneMark predictions of ORFs that accompany the GeneMark.hmm algorithms (Fig. 4). These algorithms are complementary to one another, and any difference in the results is often the product of sequence errors and deviation in genomic organization. The results are also presented in a 5-column format with start and stop codons, DNA strand, coding frame number, and
probabilities for gene and start-site predictions. The additional data that accompany each ORF are provided so that the user may make informed decisions regarding the annotation of the correct gene.
6. Potential sequence frameshifts detected with base pair deviations in size are listed at the bottom of the screen. Additionally, depending on the options selected on the first page, protein sequence translations and FASTA-formatted nucleotide sequences may also be displayed.
As we just showed, individual programs such as GeneMark can be used for gene prediction and annotation of individual genomes. The majority of these programs can be used online, or installed on a local desktop or server, for use with bacterial sequences. While there are different approaches to gene prediction, all of them are based on comparison with current sequence information. As more genomes are sequenced, and more information becomes available, the accuracy of gene-finding algorithms should improve drastically.
3.2.2. Functional Annotation
Gene prediction is typically followed by functional annotation. This is akin to asking “what” after answering the question of “where in the genome” (17). Functional annotation of the identified genes involves annotating or assigning a prediction of biological function based on similarity to known or other predicted functions or functional domains in databases. This is done in the context of complex biological networks to identify potential gene interactions, predict metabolic and regulatory pathways, and relate these to proteomic or transcriptomic data (18). Many genes that encode similar proteins also share varying amounts of sequence homology. Once protein-coding genes have been found, a large number of similarity searches are generally performed in order to transfer “annotations” based on sequence similarity (e.g., product description, protein family membership, etc.). This is often difficult because of the intrinsic nature of evolutionary processes (17). As members of protein families duplicate over time, the resulting copies often diverge in sequence and in function, forming paralogs. Similarities in structure do not necessarily translate to the same function. For example, the genome of the bacterium Nitrobacter winogradskyi contains a gene for NXR, a nitrite oxidoreductase enzyme, affording it the ability to oxidize nitrite to nitrate under aerobic conditions (19). This gene is structurally similar, but not identical to nitrate reductases found in different species that reduce nitrate to nitrite under anaerobic conditions (19). Conversely, there are many instances of highly divergent sequences that possess the same functionality. Further complicating the matter is the fact that genes can gain or lose functional domains causing them to fuse incorrectly with other genes, creating chimeric proteins that share multiple ancestors.
A typical annotation pipeline will begin by searching gene products for similarities by employing either BLASTP or PSI-BLAST against protein sequence databases (20). The BLAST suite of tools has played a major role in providing sensitive similarity searches from which homology is inferred. Hits are evaluated using several metrics, including sequence similarity, E-value, coverage, and identity cut-offs in relation to the sequences found in the databases queried. One of the most valuable database resources available is the SWISS-PROT protein collection, which contains extremely well-annotated protein records (21). These annotations are typically cross-referenced with structural and sequence databases and contain bibliographic information, biological role information, and protein family assignments. A second common approach is to search against databases of experimentally verified or highly conserved protein domains. There are multiple resources that house information regarding such protein signatures that describe protein families, functional domains, or conserved sites within related groups. One of the most commonly used resources is PFAM, a Protein FAMily database composed of a collection of HMM profiles for protein families (22). InterPro is another database that cross-references equivalent entries found in many of the other resources including domains, families, and functional sites (23). This tool allows annotators to run protein sequences against all of the resources, collect the matching families and domains, and translate these into unique InterPro entries. For higher-level classification, the annotated genes and proteins are placed into their respective biological pathways to ascertain their functional roles in the organism. The lack of a standard classification scheme providing a vocabulary for integrating this information has hindered progress in relating genome annotations to organism functional capabilities.
This has been partly addressed through the creation of the Gene Ontology (GO) vocabulary, which provides a controlled vocabulary for describing genes in a hierarchical format, and has been adopted by nearly every major model organism database (24). The GO system consists of three parts: molecular function, biological process, and cellular component. The beauty of this system lies in the fact that the terms are structured hierarchically, allowing a term to appear in more than one place in the hierarchy. For example, a general term such as enzyme will lead to a more specific term such as alcohol dehydrogenase, which can be associated with several other descriptions. This level of flexibility allows basic functionality to be assigned to a hypothetical gene/protein, or more specific descriptions to one that is already known, and supports accurate annotation of orthologs. The GO site (http://www.geneontology.org/) can be referred to for more information on the structure of GO terms.
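The hierarchical structure described above can be pictured as a directed graph in which a term may have more than one parent. The sketch below uses a toy graph: the term names and parent links are invented for illustration and are not actual GO identifiers or relationships, and the function name `ancestors` is hypothetical.

```python
# Toy hierarchy: each term maps to its direct parents.
# Names and links are illustrative only, not real GO content.
PARENTS = {
    "alcohol dehydrogenase": ["oxidoreductase", "zinc-binding protein"],
    "oxidoreductase": ["enzyme"],
    "zinc-binding protein": ["metal-binding protein"],
    "enzyme": ["molecular function"],
    "metal-binding protein": ["molecular function"],
    "molecular function": [],
}


def ancestors(term, parents=PARENTS):
    """Return every term reachable by following parent links upward."""
    seen = set()
    stack = list(parents[term])
    while stack:
        current = stack.pop()
        if current not in seen:
            seen.add(current)
            stack.extend(parents[current])
    return seen


print(sorted(ancestors("alcohol dehydrogenase")))
```

Under this model, annotating a gene with a specific term implicitly assigns all of that term's ancestors, which is how a poorly characterized protein can still receive a general description.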
Several pathway databases use GO terms as a framework for describing metabolic and component interaction resources. One of the most notable and widely used is the Kyoto Encyclopedia of Genes and Genomes (KEGG) database, a knowledgebase of genomic and molecular information (25), composed of four main databases: (1) genes for genomic information, (2) pathways for pathway and biological processes, (3) ligand for chemical processes, and (4) brite for ortholog/paralog information. Another example is EcoCyc, a knowledgebase containing literature-based curation of the genome annotation of E. coli K-12 MG1655, including transcriptional regulation, transporters, and metabolic pathways (26). A number of other similar databases and resources are listed in Table 3.
Protein annotation is typically done using a bottom-up approach, beginning with individual proteins and then integrating them into pathways (27). All predicted genes are searched against a multitude of resources, and annotations are assigned based on a set of scoring thresholds. Here we demonstrate the annotation of an amino acid sequence based on its domains using InterPro (http://www.ebi.ac.uk/interpro/). InterPro integrates protein signatures from 13 different databases into a single unified resource, allowing the user to take advantage of different areas of specialization and providing classification on multiple levels.
1. Access the protein encoded by the predicted gene in Burkholderia pseudomallei K96243, BPSL0195 (accession # YP_106823). Begin by inputting the protein sequence on the InterPro homepage. Select which types of applications should be included in the analysis and submit (Fig. 4). More information on the individual applications can be found on the InterProScan homepage (http://www.ebi.ac.uk/interpro/databases.html).
2. InterProScan retrieves matches in several databases using the predicted protein BPSL0195 as query.
The legend at the bottom of the output page is color-coded, coordinating each line of output to its respective source (Fig. 5). The first line shows that BPSL0195 belongs to the major facilitator superfamily MFS_1 based on signatures from the Pfam database (22). Pfam is a signature database of domains, repeats, motifs, and protein families based on HMMs. Pfam models are manually curated for high levels of accuracy and quality.
3. The second line on the results page shows the profile from the Structural Classification of Proteins (SCOP) database (28), where the query also matched the MFS profile (Fig. 5). The SCOP database also uses HMMs at both the family and superfamily levels. Each domain search is performed independently and the results are then combined.
Fig. 5. Results screen showing the domains that matched the hypothetical protein BPSL0195. Not all of the applications selected reported results from the BPSL0195 sequence. The top arrow is pointing to the closest matched superfamily, MFS_1, from the Interpro protein family database. The second arrow points at the results from the SCOP database showing the closest matching domain profile also reporting the MFS substrate transporter protein. Unintegrated results from the CATH-Gene3D, Panther, Signal-P, and the TMHMM resources are highlighted by the bottom arrow. Each shows a different match to the BPSL0195 protein sequence based on the HMM profiles generated. Improved annotation methods will hopefully be able to integrate this information in future reannotation projects.
4. The last lines on the InterPro results page show results from four separate resources: CATH-Gene3D, Panther, SignalP, and TMHMM (Fig. 5). All of the tools utilized here are based on HMM profiles derived from signature databases, such as CATH-Gene3D and SCOP, which utilize signatures at both the family and homologous superfamily levels (29). The Panther database is a collection of manually curated protein sequences and families. It models the divergence of specific functions within protein families, which are then associated with biological pathways (30). Here, Panther matched the query to solute carrier protein family 33, specifically an acetyl-CoA transporter-related protein. The next two results are from the Center for Biological Sequence Analysis (http://www.cbs.dtu.dk/index.shtml). The pink section is the result of the signal peptide prediction from the SignalP neural network software (31). The dark green line shows
Fig. 6. View of the main page of the IMG-ER database where users can view statistics of usage to the left. The options for various functional tasks are located on tabs at the top of the screen.
predicted transmembrane regions using the TMHMM server, which analyzes the physical constraints of both soluble and membrane-based sequences with a reported accuracy of more than 90% (32).
High-throughput genome projects have resulted in a rapid accumulation of predicted protein sequences for a large number of organisms. After gene finding, the next logical step is assigning biological significance to the protein sequences. This is also carried out using either extrinsic or intrinsic approaches, both of which revolve around sequence comparisons. Functional characterization of unknown protein sequences is typically inferred based on sequence similarity with BLAST searches against various databases. This can be very inaccurate because of annotation errors, so intrinsic methods often utilize protein family profiles based on HMM profiles constructed from experimentally verified sequences. As sequence information continues to accumulate, more accurate rules based on protein hierarchies can be developed to aid in annotation efforts.
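The similarity-based transfer of function summarized above is usually gated by score thresholds. The following is a minimal sketch that filters hits in the 12-column tabular format written by BLAST+ (`-outfmt 6`) by E-value, percent identity, and query coverage. The function name `filter_hits`, the threshold values, and the sample hits are illustrative assumptions; they are not recommendations from this chapter and do not come from a real search.

```python
import csv
import io

# Column order of the BLAST+ tabular report (-outfmt 6).
FIELDS = ["qseqid", "sseqid", "pident", "length", "mismatch", "gapopen",
          "qstart", "qend", "sstart", "send", "evalue", "bitscore"]


def filter_hits(handle, query_lengths, max_evalue=1e-5,
                min_identity=30.0, min_coverage=0.7):
    """Keep hits that pass all three (assumed) thresholds.

    query_lengths maps each query ID to its length, which the tabular
    report does not contain and which is needed to compute coverage.
    """
    kept = []
    reader = csv.DictReader(handle, fieldnames=FIELDS, delimiter="\t")
    for row in reader:
        aligned = int(row["qend"]) - int(row["qstart"]) + 1
        coverage = aligned / query_lengths[row["qseqid"]]
        if (float(row["evalue"]) <= max_evalue
                and float(row["pident"]) >= min_identity
                and coverage >= min_coverage):
            kept.append((row["qseqid"], row["sseqid"], float(row["bitscore"])))
    return kept


# Invented example hits for a 210-residue query called "gene1".
sample = io.StringIO(
    "gene1\tprotA\t85.0\t200\t30\t0\t1\t200\t1\t200\t1e-90\t350\n"   # passes
    "gene1\tprotB\t25.0\t180\t135\t0\t10\t189\t5\t184\t1e-6\t60\n"   # low identity
    "gene1\tprotC\t90.0\t50\t5\t0\t1\t50\t1\t50\t1e-20\t100\n"       # low coverage
)
hits = filter_hits(sample, {"gene1": 210})
print(hits)
```

Passing such a filter only makes a hit a candidate source of annotation. As discussed above, the description attached to the database entry may itself be wrong, so stricter thresholds reduce but do not remove the risk of propagating errors.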
3.3. Annotation Using Online Computational Platforms
A popular method for annotating genomes is to use one of the many online computational pipelines. With the rise of affordable computational resources, many organizations have been able to develop nearly fully automated systems online for public use. The first systems, developed over 10 years ago, MAGPIE (33) and GeneQuiz (34), were automated and had reasonable annotation capabilities, using multiple tools to provide biological function assignments to genes. Newer tools continue to be developed and provide the utmost convenience for groups that lack the computational resources or the expertise needed to install, implement, and maintain the necessary software. Depending on their design, such automated systems may offer many services; despite their convenience, however, web servers have several caveats that users should be aware of. Depending on the genomic resources available, some of these tools offer only limited annotation capabilities based on the finished genomes of the species available. Additionally, the options provided within these programs (e.g., scoring thresholds) are often limited in order to make the annotation process fully automated and may not be optimal for specific annotation projects. Given that these services are free, they can also be heavily used, and the time from submission to receipt of results can vary from hours to weeks. Another caveat concerns the analysis of results, as some of these web-based pipelines do not include a module for visual inspection of gene calls and their functional annotations. This has led to the development of annotation browsers and editing tools such as Artemis, developed by the Sanger Institute (35), which is a useful tool for reviewing and editing annotation files produced by other programs (see the section on “Annotation Browsers”). Other editing tools suitable for use are listed in Table 2.
There exist a few user-friendly web-based platforms that require only formatted short reads, contigs, or chromosomes as input and provide useful analytical tools, such as helping users search a number of resources (e.g., GenBank), executing downstream analysis (e.g., phylogenetic analysis), and visualizing and storing the results (Table 2). Some of these platforms include functional annotation; examples are BASys (36), RAST (37), Integrated Microbial Genomes-Expert Review (38, 39), the JCVI Annotation Service, and Ergatis (40). Gene calling and annotation standards, such as those proposed for genome sequences, would certainly provide a more solid foundation for future annotation efforts. In the meantime, it may not always be obvious which platform to adopt, since metrics for annotation quality are not well defined. IMG (http://merced.jgi-psf.org/cgi-bin/er/main.cgi), which was developed and is maintained by the Department of Energy (DOE) Joint Genome Institute (JGI), acts as a resource for microbial genomes, provides tools for comparative analysis and annotation, and includes all publicly available genomes. It also
486
N. Beckloff et al.
Table 2 Web-based annotation pipelines Program
Website
Integrated Microbial Resource Expert Review (IMG-ER)
http://merced.jgi-psf.org/cgi-bin/er/ main.cgi
Ergatis
http://isga.cgb.indiana.edu/Home
MAGPIE
http://magpie.ucalgary.ca/
GenDB
http://www.cebitec.uni-bielefeld.de/ groups/brf/software/gendb_info/
NCBI Prokaryotic Genomes Automatic Annotation Pipeline (PGAAP)
http://www.ncbi.nlm.nih.gov/ genomes/frameshifts/frameshifts.cgi
BASys
http://basys.ca/basys/cgi/submit.pl
RAST
http://rast.nmpdr.org/
IGS
http://ae.igs.umaryland.edu/cgi/index. cgi
JCVI
http://www.jcvi.org/cms/research/ projects/annotation-service
Ergatis
http://ergatis.sourceforge.net
DIYA
https://sourceforge.net/projects/diyg/
CGP
http://nbase.biology.gatech.edu/
GenePRIMP
http://geneprimp.jgi-psf.org/login
xBASE
http://xbase.ac.uk.annotation
adheres to the MIGS (Minimal Information for a Genome Sequence) standards (4), which require some level of metadata to be added to a genome entry, and provides a user interface to view data at various levels of organization. A protocol for submitting a genome sequence to IMG-ER for annotation is outlined below, for illustrative purposes. One can also use the other systems listed above using a similar process, though with different user interfaces. 1. In order to submit and annotate data, the IMG-ER website requires the creation of a username and password. This account is free and requires the user to fill out a form and submit it to the IMG-ER staff http://img.jgi.doe.gov/request. 2. All files submitted to IMG-ER must be in Genbank or FASTA format. Genbank files contain the nucleotide sequence of the genome(s) as well as the coordinates, translation, and annotation of predicted genes. An example of the Genbank format can be found here: http://www.ncbi.nlm.nih.gov/Sitemap/samplere cord.htm. The first line of Genbank formatted data should
16 Bacterial Genome Annotation
487
Fig. 7. Expert review submissions page where sequences are submitted for annotation. Note the image in the foreground is the screen associated with the second step in the sequence for submitting new sequences.
begin with the word Locus. FASTA files begin with a single line description followed by the nucleotide sequence. The description line is distinguished by the presence of a “>” symbol in the first column. More information regarding the FASTA format can be found here: http://www.ncbi.nlm.nih.gov/BLAST/fasta.shtml. 3. After logging in, the user will be directed to the IMG Home page where the “Submit a Genome” link can be found at the bottom left-hand corner in Fig. 7 (http//img.jgi.doe.gov/ submit). The user will be directed to the Expert Review Submission Home page. 4. At the Expert Review Submission Home page click on the “IMG ER Submissions” tab at the top of the screen and then “Submit Dataset to IMG ER” (box in foreground) Fig. 8. 5. In order to be submitted each dataset must be associated with a project in the IMG-ER database. When submitting new data, no previous record should be present and the user is directed to the IMG-Gold site to specify the parameters of a new project. The IMG-Gold link can be found at the bottom of the page (Fig. 9, http://img.jgi.doe.gov/cgi-bin/img_er_submit/gold.cgi).
Fig. 8. Before or after submission, projects can be searched for in the database at the IMG ER submissions search page. Several search filters are available to aid in finding your submitted genomes.
Fig. 9. The sequence of 3 steps necessary to submit a new genome project to the IMG ER Gold database. Select “Home” from the main IMG-Gold page, then select “Project List” to list all current projects, and select “New” to begin the submitting process (the image in the foreground is the last step in the sequence).
Fig. 10. Details of the annotation project can be submitted or modified in the IMG ER Gold database. Projects can be searched using keywords or by date submitted. After submission, it is recommended that users search for their projects to ensure they are listed correctly in the database. Project submission details may sometimes take a day or two to show up in the system.
6. From the IMG-Gold home page, select “Genome Projects” at the top of the page (Fig. 10). On the Genome Projects page, select “Project List,” which lists all of the projects currently in the IMG-Gold database. The “New” button underneath the page numbers is the link to begin a new record.
7. After selecting the “New” record link, the user will be directed to the project information screen (Fig. 11). Basic organism information is mandatory for adding the new project to the IMG-Gold database. Additional information can be added using the other tabs on the same screen. It is recommended that as much information as possible be given, especially if annotation is desired, to allow better interpretation of other genomes and genetic similarities. Information that is not yet available can be added later. Once all of the pertinent information is entered, click on the “Add Project” button at the bottom of the screen.
8. Once the details of the project have been entered into the IMG-Gold database, the user will be redirected to the list of projects. It will take some time for the IMG-Gold staff to assign your project a Gold ID, but it is advised that you search for your project to ensure inclusion. This can be accomplished in one of two ways: either by using the “Filter Projects” button and checking the “Only display my projects” box, or by clicking on the “Add Date” link on the right-hand side of the projects list page (Fig. 12). You can modify your project information at any time from the same page using the “Update” button.
9. After submitting your information, return to the IMG-ER submissions page and search for your project using one of several filters, including project name, GOLD ID, genus, or species, then click on the “Search Projects” link at the bottom of the page (Fig. 13).
10. Select your project from the Project Search List that is presented, and you will be taken to the “New Genome Dataset Submission form” (Fig. 14). General information about the dataset is provided here, including location in the database, annotation, and assembly and sequencing information. If you are submitting a genome to be annotated, select the “Submit Sequence File” tab and upload your file in FASTA format. Be sure to also assign a “Locus Tag Prefix” so your reads can be recognized. Datasets submitted in GenBank format will be checked for format consistency. An email will be sent to the address associated with the user account confirming its accuracy.
11. Once your dataset is submitted, you will receive a confirmation screen providing you with your submission ID number and status (Fig. 15). The status of a submission can be checked at any time from this screen by clicking on the “Check Status” button or the “Submission ID”.

Fig. 11. After submitting your project and receiving an ID it is recommended that you search the IMG ER database to ensure inclusion. There are several search filters available but searching by Project Name will typically suffice.

Fig. 12. Metagenomic information is needed when submitting a new genome for annotation. Submitting as much information as possible is beneficial for microbial annotation projects.

Fig. 13. A view of the confirmation screen showing successful submission of a genome in the IMG-ER database. Individual results will vary by species of organism submitted as Burkholderia was used for this example.

Fig. 14. A view of annotation in the Artemis viewer showing a portion of the BX571965.gb sequence. Genomic segments are highlighted by boxes and their respective locations on the chromosome on the top of the screen while features of the file are shown in the bottom. The Artemis navigator screen allows users to select various elements of sequence files for annotation. In this case an individual gene, BPSL2883, has been selected for annotation.

Fig. 15. View of annotation in the Artemis viewer where genes found in GenBank files can be edited. When annotating files, users can create new Gene Features or edit existing ones in the main Artemis window.

3.4. Locally Installable Pipelines
As an alternative to web services, which are sometimes perceived as black boxes, there exist several semi-standalone packages or workflow systems available for local installation (Table 4). These pipelines vary in the complexity of their installation and integration. The user is referred to both the package web sites and the publications for further information on these topics. Standalone pipelines allow full control of the annotation process and unrestricted access to data while maximizing security. The major drawback of this approach, besides the investment in computational hardware, is the knowledge and expertise necessary to integrate the various components and auxiliary tools within the system. However, for laboratories that conduct bioinformatics research, it may be more beneficial to install a highly flexible pipeline that can integrate gene calling, annotation, analysis, and tools for integrating specialized databases. A number of packages are available through open-source licenses and are composed of multiple gene prediction and protein annotation programs that can be linked together within a computational pipeline, typically consisting of Perl scripts that format the data as they move between programs. The computational pipeline available from the Jordan Lab at Georgia Tech is self-contained, composed of locally installable components and databases, and will need to be manually updated periodically (41). Several programs, such as MAGPIE (33), GenDB (42), and Ergatis (40), access databases online, which will always be up to date. Several standalone programs are listed in Table 4.

As technology improves and the amount of sequence generated increases, the need for self-contained computational pipelines increases. Computational pipelines available online offer ease and convenience, allowing users to submit bacterial genomes for annotation without the investment in computational hardware or expertise. However, users often have very little control over the type of analysis used and may be forced to wait a long time for the data. Self-contained pipelines offer a stark contrast to online options, as they are fully customizable and offer security options for annotation efforts.

Table 3
Genome and functional annotation resources

| Program | Website |
|---|---|
| GenBank | http://www.ncbi.nlm.nih.gov/genbank/ |
| EMBL | http://www.ebi.ac.uk/embl/ |
| DDBJ | http://www.ddbj.nig.ac.jp/ |
| COGs | http://www.ncbi.nlm.nih.gov/COG/ |
| Pfam | http://pfam.janelia.org/ |
| TIGRfam | http://blast.jcvi.org/web-hmm/ |
| KEGG | http://www.genome.jp/kegg/ |
| Interpro | http://www.ebi.ac.uk/interpro/ |
| BLASTX | http://blast.ncbi.nlm.nih.gov/Blast.cgi?CMD=Web&PAGE_TYPE=BlastHome |
| GO | http://www.geneontology.org/GO.downloads.annotations.shtml/ |
| Prosite | http://www.expasy.ch/prosite/ |
| STRING | http://string.embl.de/ |
| PIR | http://pir.georgetown.edu/ |
| Uniprot | http://www.uniprot.org/ |
| EcoCyc | http://ecocyc.org/ |
| SignalP | http://www.cbs.dtu.dk/services/SignalP/ |
| GOLD | http://www.genomesonline.org |
| Gene Ontology Annotation (GOA) | http://www.geneontology.org/ |
| EcoGene | http://ecogene.org/ |

Table 4
Stand-alone annotation pipelines

| Program | Website |
|---|---|
| DIYA | http://sourceforge.net/projects/diyg/ |
| NCBI Prokaryotic Genomes Automatic Annotation Pipeline (PGAAP) | http://www.ncbi.nlm.nih.gov/genomes/frameshifts/frameshifts.cgi |
| Ergatis | http://ergatis.sourceforge.net/ |
| Computational Genomics Pipeline | http://jordan.biology.gatech.edu/jordan/software/cg-pipeline |

3.5. Visualizing and Editing Annotations
After structural and functional annotation of a bacterial genome by a computational pipeline, it is common to view the sequence information to check the accuracy of the annotation. This is often done using a sequence viewer that can show the individual genes based on their sequence. While many visualization tools are currently available (Table 5), the Artemis viewer and annotation tool is a commonly used Java-based software tool that allows visualization of entire genome sequence features and displays all six frames of translation (35). Artemis runs efficiently on multiple platforms (Windows, Mac OS X, Linux, and UNIX) and accepts several common sequence file formats, including GenBank, EMBL, FASTA, and GFF3. Given an EMBL accession number, Artemis can read entries directly from EMBL-EBI when using Unix and Linux systems (ftp://ftp.sanger.ac.uk/pub4/resources/software/artemis/artemis.pdf). To accommodate community annotation efforts, Artemis has been extended to connect to a relational database (35). This “database mode” allows multiple annotators to access and modify a common database, and thus to work on the same sequence, simultaneously. The companion software, the Artemis Comparison Tool (ACT, discussed in the next section), enables comparisons between two or more genome sequences to identify and analyze regions of synteny within the context of the entire annotated sequences (35, 43).

Table 5
Genome annotation browsers

| Program | Website |
|---|---|
| Artemis | http://www.sanger.ac.uk/resources/software/artemis/ |
| GBrowse | http://gmod.org/wiki/Gbrowse |
| WebGBrowse | http://webgbrowse.cgb.indiana.edu/cgi-bin/webgbrowse/uploadData |
| Apollo | http://apollo.berkeleybop.org/current/index.html |
| NCBI Genome Workbench | http://www.ncbi.nlm.nih.gov/projects/gbench/ |
| IMG | http://imgweb.jgi-psf.org/archaeal_qa/doc/findGenomes.html |
| UCSC Microbial Genome Viewer | http://microbes.ucsc.edu/ |
| CGView | http://wishart.biology.ualberta.ca/cgview/ |
| Manatee | http://manatee.sourceforge.net/ |

The use of Artemis for visualizing and editing gene annotations from a GenBank genome file (BX571965.gb; Burkholderia pseudomallei strain K96243) is described below.
1. Download and install Artemis following the installation instructions from the Sanger Institute website (http://www.sanger.ac.uk/resources/software/artemis/).
2. Open Artemis and access the File Manager from the menu bar. Set the File Manager to view “All Files”.
Fig. 16. Artemis Feature edit window allows the user to change feature information about selected genes, in this example the gene start site. Other features of gene files can be edited as well, including descriptions and codon elements.
3. Double-click the BX571965.gb file to load the GenBank file into Artemis; the Artemis viewer will be displayed automatically (Fig. 15).
4. Open the Navigator window from the menu bar.
5. Select “Goto Feature With Gene Name,” enter the gene name “BPSL2883” in the text window, and click the “Goto” button. Click on “BPSL2883” to highlight the gene, then navigate to the “Edit” tab and select the “Selected features in editor” option.
6. Change the gene start site from 3451155 to 3450522 in the “Location:” text box and click “Apply” (Fig. 16).
7. Edit the locus tag text to read “BPSL2883 candidate” and click “Apply”, then “OK” to close the window.
8. Select the modified gene (BPSL2883 candidate), navigate to the “Edit” tab, and create the new gene using the “Gene Feature” option (Fig. 17).
9. If desired, the old gene (BPSL2883) can be deleted by highlighting the old reading frame and pressing the DELETE key. (A warning window will pop up to verify that you want to delete the old gene.)
10. Using the menu bar, save the new entry by selecting “File, Save All Entries” and exit the program.
11. Reopen the modified GenBank file to verify the changes (repeat steps 2–6).
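When a gene start is moved by hand, as in step 6, one quick sanity check is that the new coordinate stays in the same reading frame as the old one. The following is a minimal sketch of that arithmetic only; the function names are hypothetical, it is not part of Artemis, and it assumes that only the start coordinate changes while the other end of the feature stays fixed.

```python
def preserves_reading_frame(old_start: int, new_start: int) -> bool:
    """Return True if the start moved by a whole number of codons."""
    return (old_start - new_start) % 3 == 0


def codon_shift(old_start: int, new_start: int) -> int:
    """Number of whole codons between the old and new start coordinates."""
    return abs(old_start - new_start) // 3


# Coordinates taken from step 6 of the protocol above.
old_start, new_start = 3451155, 3450522

print(preserves_reading_frame(old_start, new_start))  # True
print(codon_shift(old_start, new_start))              # 211
```

Whether the move lengthens or shortens the gene depends on the strand of the feature, which this sketch does not model; the translation shown in Artemis should still be inspected for internal stop codons.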
Fig. 17. View of the main page of the WebACT tool, which contains numerous publicly available prokaryotic genome sequences for comparative genomic analysis and online visualization. Results can also be downloaded and viewed using the Artemis viewer.
Although there are many tools available, Artemis offers a combination of features that allows the user to both visualize and annotate sequence information at the same time. As next-generation sequencing technologies increase the size of datasets, there remains a need for genome browsers that can handle the associated file sizes. As annotations are added to new sequences, the amount of information that must be displayed at one time will increase further.

3.5.1. Comparative Genomics
Comparative genomics involves the analysis of structural and functional annotations between two or more genomes, taking into account their absolute and relative topologies, with the goal of extracting biological relevance from the similarities and differences measured and translating these differences into phenotypic changes. The tool(s) of choice depend on the size of the dataset (i.e., the number of input genomes) and the complexity of the analysis (i.e., the genomic features to be compared). The process of comparative analysis begins with the creation of a common gene-based vocabulary, a genetic thesaurus, between the genomes of interest. An all-to-all BLASTP comparison typically serves to create the gene-to-gene associations that determine equivalent genes (orthologs) across the input genomes. An orthologous gene pair is typically defined by the best BLASTP hit between the genes of a genome pair, and such pairs are considered gene equivalents across genomes. Comparison with the Clusters of Orthologous Groups (COG) database also facilitates ortholog identification. These similarity-based methods of comparison rely on actual gene content rather than on text descriptors (i.e., functional annotations). Text descriptors are often inconsistent and unreliable because of the nature of functional annotation, the reliance on constantly changing databases and tools, and the many platforms used to perform annotations. In this regard, the ability to distinguish orthologs from paralogs and pseudogenes is crucial. Once orthologs have been identified, a wide array of structural, functional, and topological similarities and differences can be computed between them. Only a few tools are available, and they vary considerably in their flexibility and in their installation and configuration cost. Two comprehensive packages are the Integrated Microbial Genomes (IMG) system from the JGI (http://img.jgi.doe.gov) and the VISTA tools (http://genome.lbl.gov/vista/index.shtml). Both packages allow sophisticated queries on the dataset and on precomputed alignments.
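The best-hit idea described above can be sketched in a few lines. The example below is an illustration under stated assumptions, not the method of any particular pipeline: it assumes BLASTP results in the standard 12-column tabular format (query ID in column 1, subject ID in column 2, bit score in column 12), it ranks hits by bit score, and it applies the common reciprocal-best-hit refinement, in which a pair is kept only if each gene is the other's best hit. The gene identifiers and scores are invented for illustration.

```python
def best_hits(tabular_lines):
    """Map each query to its highest-scoring subject (by bit score)."""
    best = {}
    for line in tabular_lines:
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and comment lines
        fields = line.rstrip("\n").split("\t")
        query, subject, bitscore = fields[0], fields[1], float(fields[11])
        if query not in best or bitscore > best[query][1]:
            best[query] = (subject, bitscore)
    return {query: subject for query, (subject, _) in best.items()}


def reciprocal_best_hits(a_vs_b, b_vs_a):
    """Return sorted (gene_a, gene_b) pairs that are each other's best hit."""
    forward = best_hits(a_vs_b)
    reverse = best_hits(b_vs_a)
    return sorted((a, b) for a, b in forward.items() if reverse.get(b) == a)


# Hypothetical toy results: genome A searched against genome B and vice versa.
A_VS_B = [
    "geneA1\tgeneB1\t95.0\t300\t15\t0\t1\t300\t1\t300\t1e-150\t540",
    "geneA1\tgeneB2\t40.0\t280\t168\t2\t5\t284\t3\t282\t1e-30\t120",
    "geneA2\tgeneB2\t88.0\t250\t30\t0\t1\t250\t1\t250\t1e-120\t450",
    "geneA3\tgeneB1\t60.0\t290\t116\t1\t1\t290\t1\t290\t1e-60\t200",
]
B_VS_A = [
    "geneB1\tgeneA1\t95.0\t300\t15\t0\t1\t300\t1\t300\t1e-150\t540",
    "geneB1\tgeneA3\t60.0\t290\t116\t1\t1\t290\t1\t290\t1e-60\t200",
    "geneB2\tgeneA1\t40.0\t280\t168\t2\t3\t282\t5\t284\t1e-30\t120",
    "geneB2\tgeneA2\t88.0\t250\t30\t0\t1\t250\t1\t250\t1e-120\t450",
]

for pair in reciprocal_best_hits(A_VS_B, B_VS_A):
    print(pair)
```

In this toy data geneA3 has geneB1 as its best hit, but geneB1 prefers geneA1, so geneA3 is left out of the ortholog list; this is the kind of paralog that a one-directional best-hit rule would misassign.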
Large syntenic blocks can be identified using COGs or Pfam domains, and the conservation of gene positions and gene order within and between genomes not only provides biological information but can also reduce the size of the dataset required for subsequent downstream analysis. Large-scale changes such as kilobase-size genomic inversions, rearrangements, insertions, deletions, and translocations are readily detectable with a whole-genome alignment tool such as BLAST or NUCmer (44). The locations of genes and the identities of mobile genetic elements might shed light on the cause of these large genetic transformations and help delineate the endpoints of regions of horizontal gene transfer and their integration sites throughout the input genomes. A list of commonly used comparative genomics tools is available in Table 6.

Table 6
Comparative genomics resources

| Program | Website |
|---|---|
| Microbes Online | http://www.microbesonline.org |
| GenVar | http://www.patricbrc.org/portal/portal/patric/Home |
| MOSAIC | http://genome.jouy.inra.fr/mosaic/ |
| Mauve | http://asap.ahabs.wisc.edu/mauve/ |
| The SEED | http://www.theseed.org/wiki/Home_of_the_SEED |
| PyPhy | http://www.cbs.dtu.dk/staff/thomas/pyphy/ |
| MaGe | http://www.genoscope.cns.fr/agc/microscope/home/index.php |
| CGAT | http://mbgd.genome.ad.jp/CGAT/ |
| MBGD | http://mbgd.genome.ad.jp/ |
| NUCmer | http://mummer.sourceforge.net |
| Integrated Microbial Genomes Expert Review (IMG-ER) | http://merced.jgi-psf.org/cgi-bin/er/main.cgi |

ACT is a user-friendly comparative genomics tool that works with output from various alignment programs, such as NUCmer, or from the Artemis genome annotation browser (43). It can accept multiple genomes and can compare them not only with one another but also with precomputed genomes. All precomputed genome comparisons are generated using the BLAST algorithm with a word size of nine and soft DUST masking (2). ACT also allows the user to compare not only entire genomes (e.g., entire chromosomes) but also specific genes or lengths of flanking sequence. All genomes can be downloaded locally for use with other programs or viewed on the web. A web version, WebACT, is available for public use online and has an easy-to-use interactive menu (2). It will be used for the following comparative genomics protocol:
1. Navigate to the WebACT homepage (http://www.webact.org/WebACT/home) (Fig. 18).
2. If you know which organism your sequence belongs to, or which organism you want to compare your sequence against, select the “Pre-computed” tab at the top left corner of the screen. If you want to specify a particular gene or paste a sequence, click on the “Generate” tab for a new genome comparison (Fig. 19).
3. On the “Generate” tab, select how many sequences are to be compared; the maximum number allowed at one time is five. Multiple sequence formats, individual genes, or accession numbers may be pasted in the box marked “Sequence 1.” Additional sequences can be added in the box below. Be sure to include your email address to be notified when the comparison is completed.
4. After pasting the appropriate sequences into the box, select “Submit” at the bottom of the screen. Results will be posted on the screen or emailed upon completion.

Fig. 18. View of the WebACT Generate page where users can do alignments with one of the precomputed prokaryotic genomes available. WebACT can utilize genomes or sequences in various formats including raw text, EMBL, or FASTA formats. Alignments can be done with up to five different organisms.
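Before pasting or uploading sequences to a web service, it can help to confirm which format each one is in. The sketch below is a hypothetical pre-submission check, not part of WebACT or IMG-ER. It relies on the first-line conventions described in this chapter (a “>” in the first column for FASTA, the word LOCUS for GenBank); the EMBL test (a first line starting with “ID”) and the function names are assumptions added for illustration. The limit of five sequences is the one stated in step 3 above.

```python
MAX_SEQUENCES = 5  # maximum per comparison, as stated in the protocol above


def sniff_sequence_format(text: str) -> str:
    """Guess the format of a sequence entry from its first non-blank line."""
    for line in text.splitlines():
        if not line.strip():
            continue  # ignore leading blank lines
        if line.startswith(">"):
            return "fasta"
        if line.startswith("LOCUS"):
            return "genbank"
        if line.startswith("ID "):
            return "embl"  # assumption: EMBL flat files open with an ID line
        return "raw_or_unknown"
    return "empty"


def summarize_submission(sequences):
    """Return the guessed format of each entry, enforcing the batch limit."""
    if len(sequences) > MAX_SEQUENCES:
        raise ValueError(f"at most {MAX_SEQUENCES} sequences per comparison")
    return [sniff_sequence_format(seq) for seq in sequences]


# Invented minimal inputs for illustration only.
examples = [
    ">contig_1 hypothetical example\nATGGCGTAA\n",
    "LOCUS       EXAMPLE01   9 bp    DNA\n",
    "ATGGCGTAA\n",
]
print(summarize_submission(examples))
```

A check like this only inspects the first line; it does not validate the rest of the record, so a file that passes can still be rejected by the server.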
The need for genome annotation grows in step with the increasing output of next-generation sequencers. Many tools are available to annotate bacterial genomes, from single genes to comparative analysis of multiple genomes. Unfortunately, both the variety and the accuracy of these methods remain limited by the number of annotated genomes available. Accuracy can also be affected by the method of annotation used and the genomic information available for prediction. There remains a great need for a structured language to act as a framework for annotating genomes. Such a system of nomenclature, combined with improved annotation methods, will dramatically reduce the time from raw sequence to fully annotated genome.
4. Notes
1. When dealing with locally installable pipelines, one thing to be aware of in terms of installation and maintenance is the computational resources required by each individual program.
2. A recent study compared the gene finding and functional annotation capabilities of three web-based pipelines: IMG-ER (38), RAST (37), and the JCVI Annotation Service (45). While all three systems were comparable in their abilities to identify genes, several issues were outstanding: (1) each predicted genes not found by the other pipelines, (2) the results were not identical, and (3) none of them provided an all-inclusive package for genome annotation and analysis (46).
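The kind of comparison described in Note 2 can be approximated with simple set operations once the gene calls from each pipeline are reduced to comparable keys. The sketch below is an illustration only and is not the method used in the cited study. It assumes each gene call is keyed by strand and stop coordinate, a common convention because different gene finders often disagree on start sites; the coordinates are invented, and the function name is hypothetical.

```python
def compare_gene_calls(calls):
    """Return (shared, unique): calls made by every pipeline, and calls made by only one."""
    shared = set.intersection(*calls.values())
    unique = {}
    for name, genes in calls.items():
        # Union of the calls made by all other pipelines.
        others = set().union(*(g for n, g in calls.items() if n != name))
        unique[name] = genes - others
    return shared, unique


# Hypothetical gene calls keyed by (strand, stop coordinate).
calls = {
    "IMG-ER": {("+", 1200), ("-", 3400), ("+", 5600)},
    "RAST": {("+", 1200), ("-", 3400), ("+", 7800)},
    "JCVI": {("+", 1200), ("+", 5600), ("-", 9100)},
}

shared, unique = compare_gene_calls(calls)
print("called by all:", sorted(shared))
for name in calls:
    print(name, "only:", sorted(unique[name]))
```

Calls that are unique to one pipeline are natural candidates for manual inspection in a browser such as Artemis.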
Acknowledgments
We thank all members of the B6 Genome Science group for their contributions to the establishment of standardized methods, development of software and processes, and genome projects described in this chapter. This study was supported in part by the US Department of Energy Joint Genome Institute through the Office of Science of the US Department of Energy under Contract No. DE-AC02-05CH11231 and grants from NIH (Y1-DE-6006-02), the US Department of Homeland Security under contract number HSHQDC08X00790, the US Defense Threat Reduction Agency under contract numbers B104153I and B084531I, and LANL Laboratory-Directed Research and Development under grant number (20110051DR).
References
1. Wu D, Hugenholtz P, Mavromatis K, Pukall R, Dalin E, Ivanova NN, Kunin V, Goodwin L, Wu M, Tindall BJ, Hooper SD, Pati A, Lykidis A, Spring S, Anderson IJ, D’haeseleer P, Zemla A, Singer M, Lapidus A, Nolan M, Copeland A, Han C, Chen F, Cheng JF, Lucas S, Kerfeld C, Lang E, Gronow S, Chain P, Bruce D, Rubin EM, Kyrpides NC, Klenk HP, Eisen JA (2009) A phylogeny-driven genomic encyclopaedia of Bacteria and Archaea. Nature 462(7276):1056–1060
2. Abbott JC (2005) WebACT: an online companion for the Artemis Comparison Tool. Bioinformatics 21:3665–3666
3. Ouyang S, Thibaud-Nissen F, Childs KL, Zhu W, Buell CR (2009) Plant genome annotation methods. Methods Mol Biol 513:263–282
4. Chain PS, Grafham DV, Fulton RS, Fitzgerald MG, Hostetler J, Muzny D, Ali J, Birren B, Bruce DC, Buhay C, Cole JR, Ding Y, Dugan S, Field D, Garrity GM, Gibbs R, Graves T, Han CS, Harrison SH, Highlander S, Hugenholtz P, Khouri HM, Kodira CD, Kolker E, Kyrpides NC, Lang D, Lapidus A, Malfatti SA, Markowitz V, Metha T, Nelson KE, Parkhill J, Pitluck S, Qin X, Read TD, Schmutz J, Sozhamannan S, Sterk P, Strausberg RL, Sutton G, Thomson NR, Tiedje JM, Weinstock G, Wollam A, Genomic Standards Consortium Human Microbiome Project Jumpstart Consortium, Detter JC (2009) Genome project standards in a new era of sequencing. Science 326:236–237
5. Voelkerding K, Dames S, Durtschi J (2009) Next-generation sequencing from basic research to diagnostics (Reviews). Clin Chem 658:641–658
6. McHardy AC (2004) Development of joint application strategies for two microbial gene finders. Bioinformatics 20:1622–1631
7. Badger J, Olsen G (1996) CRITICA: coding region identification tool invoking comparative analysis. Mol Biol Evol 16:512–524
8. Staden R (1984) Graphic methods to determine the function of nucleic acid sequences. Nucleic Acids Res 12:521–538
9. Overbeek R, Bartels D, Vonstein V, Meyer F (2007) Annotation of bacterial and archaeal genomes: improving accuracy and consistency. Chem Rev 107:3431–3447
10. Yada T, Totoki Y, Takagi T, Nakai K (2001) A novel bacterial gene-finding system with improved accuracy in locating start codons. DNA Res 8:97–106
11. Zhu HQ (2004) Accuracy improvement for identifying translation initiation sites in microbial genomes. Bioinformatics 20:3308–3317
12. Salzberg SL, Delcher AL, Kasif S et al (1998) Microbial gene identification using interpolated Markov Models. Nucleic Acids Res 26:544–548
13. Lowe TM, Eddy SR (1997) tRNAscan-SE: a program for improved detection of transfer RNA genes in genomic sequence. Nucleic Acids Res 25:955–964
14. Besemer J, Borodovsky M (2005) GeneMark: web software for gene finding in prokaryotes, eukaryotes and viruses. Nucleic Acids Res 33:W451–W454
15. Hyatt D, Chen GL, Locascio PF, Land ML, Larimer FW, Hauser LJ (2010) Prodigal: prokaryotic gene recognition and translation initiation site identification. BMC Bioinformatics 11:119–130
16. Besemer J, Lomsadze A, Borodovsky M (2001) GeneMarkS: a self-training method for prediction of gene starts in microbial genomes. Implications for finding sequence motifs in regulatory regions. Nucleic Acids Res 29:2607–2618
17. Stein L (2001) Genome annotation: from sequence to biology. Nat Rev Genet 2:493–503
18. Médigue C, Moszer I (2007) Annotation, comparison and databases for hundreds of bacterial genomes. Res Microbiol 158:724–736
19. Starkenburg SR, Chain PSG, Sayavedra-Soto LA, Hauser L, Land ML, Larimer FW, Malfatti SA, Klotz MG, Bottomley PJ, Arp DJ, Hickey WJ (2006) Genome sequence of the chemolithoautotrophic nitrite-oxidizing bacterium Nitrobacter winogradskyi Nb-255. Appl Environ Microbiol 72:2050–2063
20. Altschul S, Koonin E (1998) Iterated profile searches with PSI-BLAST: a tool for discovery in protein databases. Trends Biochem Sci 23:444–447
21. Schneider M, Tognolli M, Bairoch A (2004) The Swiss-Prot protein knowledgebase and ExPASy: providing the plant community with high quality proteomic data and tools. Plant Physiol Biochem 42:1013–1021
22. Sonnhammer EL, Eddy SR, Durbin R (1997) Pfam: a comprehensive database of protein domain families based on seed alignments. Proteins 28:405–420
23. Zdobnov EM, Apweiler R (2001) InterProScan: an integration platform for the signature-recognition methods in InterPro. Bioinformatics 17:847–848
24. Ashburner M, Ball CA, Blake JA, Botstein D, Butler H, Cherry JM, Davis AP, Dolinski K, Dwight SS, Eppig JT, Harris MA, Hill DP, Issel-Tarver L, Kasarskis A, Lewis S, Matese JC, Richardson JE, Ringwald M, Rubin GM, Sherlock G (2000) Gene ontology: tool for the unification of biology. The Gene Ontology Consortium. Nat Genet 1:25–29
25. Kanehisa M, Goto S (2000) KEGG: Kyoto encyclopedia of genes and genomes. Nucleic Acids Res 28:27–30
26. Karp PD (2005) Expansion of the BioCyc collection of pathway/genome databases to 160 genomes. Nucleic Acids Res 33:6083–6089
27. McGarvey PB, Zhang J, Natale DA, Wu CH, Huang H (2011) Protein-centric data integration for functional analysis of comparative proteomics data. Methods Mol Biol 694:323–339
28. Murzin AG, Brenner SE, Hubbard T, Chothia C (1995) SCOP: a structural classification of proteins database for the investigation of sequences and structures. J Mol Biol 247:536–540
29. Pearl F, Todd A, Sillitoe I, Dibley M, Redfern O, Lewis T, Bennett C, Marsden R, Grant A, Lee D, Akpor A, Maibaum M, Harrison A, Dallman T, Reeves G, Diboun I, Addou S, Lise S, Johnston C, Sillero A, Thornton J (2005) The CATH Domain Structure Database and related resources Gene3D and DHS provide comprehensive domain family information for genome analysis. Nucleic Acids Res 33:247–251
30. Thomas PD (2003) PANTHER: a browsable database of gene products organized by biological function, using curated protein family and subfamily classification. Nucleic Acids Res 31:334–341
31. von Heijne G (1986) A new method for predicting signal sequence cleavage sites. Nucleic Acids Res 14:4683–4690
32. Krogh A, Larsson B, von Heijne G, Sonnhammer EL (2001) Predicting transmembrane protein topology with a hidden Markov model: application to complete genomes. J Mol Biol 305:567–580
33. Gaasterland T, Sensen CW (1996) Fully automated genome analysis that reflects user needs and preferences. A detailed introduction to the MAGPIE system architecture. Biochimie 78:302–310
34. Scharf M, Schneider R, Casari G, Bork P, Valencia A, Ouzounis C, Sander C (1994) GeneQuiz: a workbench for sequence analysis. Proc Int Conf Intell Syst Mol Biol 2:348–353
35. Carver T, Berriman M, Tivey A, Patel C, Bohme U, Barrell BG, Parkhill J, Rajandream MA (2008) Artemis and ACT: viewing, annotating and comparing sequences stored in a relational database. Bioinformatics 24:2672–2676
36. Van Domselaar GH, Stothard P, Shrivastava S, Cruz JA, Guo A, Dong X, Lu P, Szafron D, Greiner R, Wishart DS (2005) BASys: a web server for automated bacterial genome annotation. Nucleic Acids Res 33:W455–W459
37. Aziz RK, Bartels D, Best AA, DeJongh M, Disz T, Edwards RA, Formsma K, Gerdes S, Glass EM, Kubal M, Meyer F, Olsen GJ, Olson R, Osterman AL, Overbeek RA, McNeil LK, Paarmann D, Paczian T, Parrello B, Pusch GD, Reich C, Stevens R, Vassieva O, Vonstein V, Wilke A, Zagnitko O (2008) The RAST server: rapid annotations using subsystems technology. BMC Genomics 9:75
38. Markowitz VM, Chen IMA, Palaniappan K, Chu K, Szeto E, Grechkin Y, Ratner A, Anderson I, Lykidis A, Mavromatis K, Ivanova NN, Kyrpides NC (2009) The integrated microbial genomes system: an expanding comparative analysis resource. Nucleic Acids Res 38:D382–D390
39. Markowitz VM, Mavromatis K, Ivanova NN, Chen IMA, Chu K, Kyrpides NC (2009) IMG ER: a system for microbial genome annotation expert review and curation. Bioinformatics 25:2271–2278
40. Orvis J, Crabtree J, Galens K, Gussman A, Inman JM, Lee E, Nampally S, Riley D, Sundaram JP, Felix V, Whitty B, Mahurkar A, Wortman J, White O, Angiuoli SV (2010) Ergatis: a web interface and scalable software system for bioinformatics workflows. Bioinformatics 26:1488–1492
41. Kislyuk AO, Katz LS, Agrawal S, Hagen MS, Conley AB, Jayaraman P, Nelakuditi V, Humphrey JC, Sammons SA, Govil D, Mair RD, Tatti KM, Tondella ML, Harcourt BH, Mayer LW, Jordan IK (2010) A computational genomics pipeline for prokaryotic sequencing projects. Bioinformatics 26:1819–1826
42. Meyer F, Goesmann A, McHardy AC, Bartels D, Bekel T, Clausen J, Kalinowski J, Linke B, Rupp O, Giegerich R (2003) GenDB: an open source genome annotation system for prokaryote genomes. Nucleic Acids Res 31:2187–2195
43. Carver TJ (2005) ACT: the Artemis comparison tool. Bioinformatics 21:3422–3423
44. Kurtz S, Phillippy A, Delcher AL, Smoot M, Shumway M, Antonescu C, Salzberg SL (2004) Versatile and open software for comparing large genomes. Genome Biol 5:R12
45. Goll J, Rusch DB, Tanenbaum DM, Thiagarajan M, Li K, Methé BA, Yooseph S (2010) METAREP: JCVI metagenomics reports, an open source tool for high-performance comparative metagenomics. Bioinformatics 26:2631–2632
46. Bakke P, Carney N, Deloache W, Gearing M, Ingvorsen K, Lotz M, Mcnair J, Penumetcha P, Simpson S, Voss L, Win M, Heyer LJ, Malcolm A (2009) Evaluation of three automated genome annotations for Halorhabdus utahensis. PLoS One 4:e6291