Genome Informatics 12: 204–211 (2001)
Bioinformatics Issues for Automating the Annotation of Genomic Sequences∗
Kim Carter1, Akira Oka2, Gen Tamiya2, Matthew I. Bellgard1
1 Centre for Bioinformatics and Biological Computing, Murdoch University, Murdoch WA 6150, Australia
2 Division of Molecular Life Science, Tokai University School of Medicine, Bohseidai, Isehara, Kanagawa, 259-11, Japan
∗ Corresponding author
Abstract
The rapid explosion in the amount of biological data being generated worldwide is surpassing efforts to manage analysis of the data. As part of an ongoing project to automate and manage bioinformatics analysis, the authors have designed and implemented a simple automated annotation system, which is described in this paper. The system is applied to existing GenBank/DDBJ/EMBL entries and its output is compared with the existing annotations, to illustrate not only that these annotations contain potential errors but also that they are generally not up-to-date, as a result of new versions of analysis tools and updates of genomic repositories. We highlight the important bioinformatics issues of storage and management of information, which must be addressed to ensure that data and results are kept up-to-date as new information becomes available. Surprisingly, from just four database entries, a significant number of new features were found. We describe the results and identify important issues that need to be addressed in order to automate the re-analysis/re-annotation of genomic sequences within a reasonable timeframe.
Keywords: genomic sequences, annotation, management of information, bioinformatics
1 Introduction
The amount of biological data being created, stored and analysed has grown exponentially over the last 10 years. The nucleotide sequence databases, EMBL [16], DDBJ [18] and GenBank [6], have nearly doubled in size each year since 1994. Concomitant with this massive increase in data, there has been an increase in data errors. A number of data integrity problems have also been reported for these databases, such as confusion among gene synonyms and multiple representations of the same data [8, 9], contamination with mitochondrial DNA sequences [17], vector contamination [14] and sequencing errors [7, 13].
When a sequence alignment tool such as BLAST [2] is used to identify features within a particular genomic sequence, the search results often contain multiple “hits”, depending on how frequently the sequence is represented in the database. For example, if one BLASTs a sequence containing a commonly occurring mRNA for a gene, together with a small section of a less common neighbouring gene, against the public non-redundant protein database (nr), chances are that all the hits in the result will refer to the mRNA sequence (in the form of a number of GenBank entries). Matches with the neighbouring gene get “lost” in the flood of hits for the large, significant matches. As the aim is to characterize and annotate a sequence as comprehensively as possible, it is important to retrieve as many different matches as possible. One solution is to allow an “infinite” number of hits in the BLAST report, but
this is not a viable solution. Another simple solution would be to find the first significant match in the BLAST result, mask out this region in the input sequence and re-BLAST the newly masked sequence against the database. By repeating the BLAST process, it would be possible to record and mask out significant matches in an attempt to identify as many different hits as possible within the input sequence.
With this concept in mind, a series of steps can be identified which form a “pipeline” process to characterize and annotate genomic sequences. The results can be compared with existing annotations in sequence databases, in an attempt to identify new features not already characterized, as well as potential sequence and annotation errors [3]. While this is a simple tool, the major thrust of this research is the important issue of managing the results of this search, and of subsequent searches using the same sequence with revised tools and up-to-date databases. This is a very challenging problem.
Clark and Whittam [11] identify that sequence databases contain errors because “nearly every time a listed gene is sequenced a second time, errors are listed”. With this in mind, re-analysis and verification of sequences and annotation at a later date becomes very important [3, 4, 5, 10]. Given the vast quantities of genomic sequences available, manual analysis and re-analysis of sequences becomes an impractical and error-prone process. There arises the need for an automated sequence characterization and annotation process, whereby the results can be re-analysed and compared at a later date. This is the subject of the present research. It is important to note that when a sequence is characterized using this pipeline process, the accuracy is only as good as the quality of the sequence and annotation information in the database(s) being searched against.
For the purposes of this research, the information at each stage of the “pipeline” is stored in order to be assembled back together at the end of the process. The stored results can then be used for verification and comparison at a later date. Extensible Markup Language (XML) is an appropriate storage mechanism, as it is very flexible and facilitates easy conversion and comparison of documents.
In this paper, we demonstrate our proposed system by identifying the features of four human genomic sequences and comparing them against those listed in their corresponding GenBank entries. We arbitrarily chose four genomic DNA sequences from GenBank (Accession numbers Z80998, AL117330, Z97184 and AL118506), which appeared to be fully annotated at the time they were submitted and are longer than 20000 base pairs. In the AL117330 comparison, we found 3 new nr matches of an unknown protein for MCG:2637. In the AL118506 comparison, we found 4 new nr matches, corresponding to two complete hypothetical proteins and two large fragments of two different hypothetical proteins. We also found a significant number of new repeat regions and ESTs in each of the four comparisons. In addition to discussing these results, we also highlight the potential problems and issues that need to be addressed in order to automate this strategy.
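To make the idea of storing each stage's results for later comparison more concrete, the following is a minimal sketch in Python (not the authors' Java/JAXP code). The element and attribute names (`stage`, `tool`, `database`, `parameter`, `feature`) and the helper functions are hypothetical and are not the schema used by MAS. The accession, BLAST version, nr date, E-value and HKE2 coordinates are taken from this paper; the date of the "later" run is purely illustrative.

```python
import xml.etree.ElementTree as ET


def build_stage_record(accession, stage, tool, tool_version,
                       database, database_date, parameters, features):
    """Build an XML element for one pipeline stage, including its provenance."""
    record = ET.Element("stage", {"accession": accession, "name": stage})
    ET.SubElement(record, "tool", {"name": tool, "version": tool_version})
    ET.SubElement(record, "database", {"name": database, "date": database_date})
    params = ET.SubElement(record, "parameters")
    for name, value in sorted(parameters.items()):
        ET.SubElement(params, "parameter", {"name": name, "value": str(value)})
    feats = ET.SubElement(record, "features")
    for label, start, end in features:
        ET.SubElement(feats, "feature",
                      {"label": label, "start": str(start), "end": str(end)})
    return record


def provenance(record):
    """Return the tool, database and parameter settings stored in a record."""
    return (
        dict(record.find("tool").attrib),
        dict(record.find("database").attrib),
        {p.get("name"): p.get("value") for p in record.find("parameters")},
    )


def needs_reanalysis(stored, current):
    """True when the tool, database or parameters differ between two records."""
    return provenance(stored) != provenance(current)


# Values taken from the paper: Z97184, BLAST 2.1.3, nr of May 2, 2001.
stored = build_stage_record(
    "Z97184", "nr-multiblast", "blastx", "2.1.3", "nr", "2001-05-02",
    {"evalue": 0.00001}, [("HKE2", 30586, 30907)])

# Round trip through text, as a stored results file would be.
reloaded = ET.fromstring(ET.tostring(stored, encoding="unicode"))

# A later run against a hypothetical newer nr release (date is illustrative).
later = build_stage_record(
    "Z97184", "nr-multiblast", "blastx", "2.1.3", "nr", "2002-01-01",
    {"evalue": 0.00001}, [])
```

Keeping the tool version, database date and parameters next to the features is one possible way to decide when a stored result is stale and the sequence should be re-analysed.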
2 Materials and Methods
The nr database was current as of May 2, 2001 and the EST database was current as of November 20, 2000. RepeatMasker [15] was current as of 4/4/2000, using Repbase 31/3/2000.
The test sequences for this paper were human DNA sequences from clone SC22BCB-36G12 on chromosome 22 (GenBank accession Z80998, updated 12/12/1999), clone RP11-444G7 on chromosome 20 (GenBank accession AL117330, updated 2/12/2000), cosmid F0811 on chromosome 6 (GenBank accession Z97184, updated 23/11/1999) and clone RP4-591C20 on chromosome 20 (GenBank accession AL118506, updated 10/10/2001). Z80998 is listed as containing exons 3 to 5 of the SLC5A1 gene. AL117330 is listed as containing the CHK2 gene and a number of ESTs. The GenBank entry for Z97184 lists it as containing the Daxx, BING1, Tapasin, RGL2, HKE2, BING4 and BING5 proteins, along with ESTs. The AL118506 entry is listed as containing a novel protein similar to MG26, the TPD52L2 gene, a novel DnaJ domain protein, a novel phosphoribulokinase gene, the KIAA1196 gene and a TOM gene. GSSs, STSs and CpG islands were not examined in this research.
For this research, we initially ran the Z80998 (20650bp), AL117330 (31874bp), Z97184 (40127bp) and AL118506 (139505bp) sequences through repeatmasker,
to identify repetitive elements. The repeatmasked sequences (.masked file) were then used as input to a multiblast (making use of BLAST 2.1.3) against the nr database using blastx. The nr-masked files were then multiblasted against the EST database using blastn. The repeatmasker summary files (.out file) and the XML result files from the nr and EST multiblast tests were input into the mbassemble program to produce the features and positions summary (see Figure 2). The features were then compared to the features listed in the existing GenBank entries. The software we developed was written in Java 2, using the Java API for XML Processing (JAXP) 1.0 [19].
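The assembly step can be pictured with a short Python sketch. It is only an illustration of merging the repeat, nr and EST features into one ordered features list and a per-position listing. It is not the mbassemble program itself, and the `Feature` type and function names are assumptions. The coordinates are ones reported later in this paper for Z97184; the label "EST match" is a placeholder, as the paper does not name that EST.

```python
from typing import NamedTuple


class Feature(NamedTuple):
    start: int    # 1-based, inclusive
    end: int      # 1-based, inclusive
    source: str   # "repeat", "nr" or "EST"
    label: str


def assemble(*feature_lists):
    """Merge features from every stage into one list ordered by position."""
    merged = [f for features in feature_lists for f in features]
    return sorted(merged, key=lambda f: (f.start, f.end))


def positions(features, first, last):
    """Map each position in [first, last] to the labels of the features covering it."""
    table = {pos: [] for pos in range(first, last + 1)}
    for f in features:
        for pos in range(max(f.start, first), min(f.end, last) + 1):
            table[pos].append(f.label)
    return table


# Coordinates reported later in the paper for Z97184.
repeats = [Feature(13770, 14073, "repeat", "AluSx")]
nr_hits = [Feature(15723, 16121, "nr", "Tapasinas")]
est_hits = [Feature(15691, 15722, "EST", "EST match")]

features = assemble(repeats, nr_hits, est_hits)
table = positions(features, 15720, 15725)
```

In this window the EST match covers positions up to 15722 and the nr match starts at 15723, which is the kind of adjacency the positions file is meant to expose.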
3 Results
As a result of our investigation into automating sequence analysis, we have designed a simple “pipeline” process for annotating genomic sequences. The annotation and characterization pipeline has been named MAS, the Multiple-BLAST Annotation System, where the basic steps are as follows:
1. Take the genomic query sequence and run it through repeatmasker. The repeatmasked query sequence (sequencefile.masked) will be used in step 2. The repeatmasker results file (sequencefile.out) will be used in step 5.
2. BLAST the repeatmasked sequence against the nr database to identify which regions of the query sequence can be easily characterized.
3. Record, then mask out the best match (with N’s) and redo step 2 (BLAST against nr) with the newly masked sequence. Repeat this process until there are no significant matches (compared to a specific E-value).
4. BLAST the nr-masked sequence against the EST database. As in steps 2 and 3, the best match is recorded and masked out. This is repeated until there are no more significant matches.
5. Recombine the genomic sequence BLAST results and the information (sequencefile.out) from step 1.
6. Identify positions and features in the genomic sequence and report the results.
7. Prepare for further down-stream analysis.
Figure 1 illustrates how multiple BLAST (multiblast) alignments are performed, as described in steps 2, 3 and 4. We have developed two prototype Java applications that form the basis of the proposed MAS pipeline. The first, multiblast, performs a repeat BLAST analysis against a specified BLAST database. The best hit for each BLAST is recorded, then masked out, and the new input sequence is BLASTed again. This process is repeated until no further significant matches (E-value < 0.00001) are found; this threshold of 0.00001 was chosen as an initial setting. The multiblast program is used in steps 2, 3 and 4 to BLAST the genomic sequence against the nr and EST databases.
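The record, mask and re-BLAST loop of steps 2 to 4 can be sketched as follows. This is a minimal Python illustration under stated assumptions, not the authors' Java multiblast program. The `search` argument stands in for a call to BLAST that returns the best hit, and `toy_search` with its tiny motif table is a made-up stand-in so the loop can be run without a BLAST installation. The threshold of 0.00001 is the one given above; the motifs and their E-values are illustrative only.

```python
from typing import Callable, NamedTuple, Optional


class Hit(NamedTuple):
    start: int      # 1-based, inclusive
    end: int        # 1-based, inclusive
    label: str
    evalue: float


def mask(sequence, start, end):
    """Replace positions start..end (1-based, inclusive) with N's."""
    return sequence[:start - 1] + "N" * (end - start + 1) + sequence[end:]


def multiblast(sequence: str,
               search: Callable[[str], Optional[Hit]],
               max_evalue: float = 0.00001,
               max_rounds: int = 1000):
    """Record the best hit, mask it out and search again until nothing significant is left."""
    hits = []
    for _ in range(max_rounds):
        best = search(sequence)
        if best is None or best.evalue >= max_evalue:
            break
        hits.append(best)
        sequence = mask(sequence, best.start, best.end)
    return hits, sequence


# Toy stand-in for BLAST: exact motif lookup with made-up E-values.
TOY_DB = {
    "GATTACA": ("common mRNA", 1e-30),
    "CCGG": ("neighbouring gene", 1e-8),
    "TT": ("weak match", 2e-5),
}


def toy_search(sequence):
    found = []
    for motif, (label, evalue) in TOY_DB.items():
        i = sequence.find(motif)
        if i >= 0:
            found.append(Hit(i + 1, i + len(motif), label, evalue))
    return min(found, key=lambda h: h.evalue) if found else None


hits, masked = multiblast("TTGATTACATTCCGGTT", toy_search)
```

Here the strong match is recorded and masked first, after which the weaker neighbouring match becomes the best hit and is recorded in turn; the remaining match does not pass the E-value threshold, so the loop stops.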
The multiblast program generates its results in XML, as XML’s interconnection capabilities make it ideal as a language for data interchange [1]. For example, XML outputs can be easily converted to other formats, such as ASN.1, to allow the results to be viewed by NCBI’s Sequin viewer [20].
The second application, mbassemble, combines the results from the multiblast program with the repeatmasker output file (sequencefile.out) to produce two files describing the features in the genomic query sequence. One file describes all the features contained in the sequence, while the other lists each of the nucleotide positions in the query sequence along with any feature associated with that position (for example, see Figure 2). Figure 2 shows an excerpt from the Z97184 features and positions files generated by the characterization and annotation process we have discussed. The MIR repeat (feature 138 in Figure 2 a))
Figure 1: Illustration of performing multiple BLAST (multiblast) alignments.
Figure 2: Z97184 features file and positions file respectively, illustrating corresponding matches.
ends at position 36538 (as shown in Figure 2 b)), while the BING5 nr match (feature 139) begins at position 36540. The features file combines 147 features from repeatmasker and from multiblast against the nr and EST databases.
The differences between the proposed MAS annotations and the existing GenBank entry annotations are summarized in Table 1. As shown in items 1, 2 and 3, we found a large number of new repeat regions and EST matches, and several significant nr matches, in addition to the existing annotations. Items 4 and 5 show the number of instances where the feature (repeat or nr match) we found was different from the existing feature listed in the GenBank entry at approximately the same location. Position mismatches are shown in items 6, 7 and 8, where the same feature (repeat, nr or EST match) has been found, but in a slightly different position.
For the Z80998 comparison, 2 of the CDS listed for the SLC5A1 gene were found as ESTs, while the other (a very short exon) was not found. We found 8 new repeats that were not listed in the Z80998 entry, on top of the 35 listed repeats, which is approximately 23% more than the existing annotation. Six new ESTs were also located.
In the AL117330 analysis, we found 87 new repeats, in addition to the 15 previously reported repeats, representing an almost 600% increase. In terms of nr matches, we found 3 regions, matching the same protein, that were not listed in the AL117330 entry.
For AL118506, we found 189 new repeat regions. However, a number of repeat regions previously reported in the GenBank entry could not be located. We found four significant nr matches not described in the AL118506 entry, along with 64 new EST sequences.
Two CDS listed in the Z97184 entry were not matched by any features found in our analysis. We found 23 new repeats, in addition to the reported 46 repeats for Z97184, which is 50% more than the original.
Of the 22 position mismatches found using repeatmasker, 15 were within 10 base pairs and 20 were within 20 base pairs of the existing GenBank entry. The Z97184 entry lists many ESTs but reports exact positions for only 4 of them. A further 32 new EST matches were found.
Table 1: Summary of re-annotation of 4 different existing GenBank entries, including new genes and repeats that were identified.
ITEM  Difference between assembled features using MAS vs existing GenBank annotation        Z80998  AL117330  Z97184  AL118506
1     New repeats found                                                                          8        87      32       189
2     New nr matches (significant)                                                               0         3       0         4
3     New ESTs found                                                                             6         8      32        64
4     Potential incorrect labels for repeat regions in existing (at approx. same position)       5         4       8         4
5     Potential incorrect labels for nr matches in existing (at approx. same position)           0         0      14         5
6     Found same repeat with different start/end positions                                       8         3      22        11
7     Found same nr match with different start/end positions                                     0         2      24        26
8     Found same EST match with different start/end positions                                    0         0       1         0
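The categories in Table 1 suggest a simple rule for comparing each feature found by the pipeline with the existing annotation. The Python sketch below is one possible reading of that rule and is not the comparison procedure actually used for the table. In particular, treating "approximately the same position" as any overlap is an assumption. The Z97184 and AL118506 coordinates are those quoted in the Discussion; the feature used for the "new" case is hypothetical.

```python
from typing import NamedTuple


class Feature(NamedTuple):
    start: int
    end: int
    label: str


def overlaps(a, b):
    """True when two 1-based inclusive intervals share at least one position."""
    return a.start <= b.end and b.start <= a.end


def classify(found, existing):
    """Classify one pipeline feature against the features already annotated."""
    near = [e for e in existing if overlaps(found, e)]
    if not near:
        return "new"                   # items 1 to 3
    if found in near:
        return "same"
    if any(e.label == found.label for e in near):
        return "position mismatch"     # items 6 to 8
    return "label mismatch"            # items 4 and 5


# Existing GenBank annotations, as quoted in the Discussion.
z97184_repeats = [Feature(36673, 36753, "MIR2"), Feature(13764, 14073, "AluSq")]
al118506_cds = [Feature(105019, 105165, "KIAA1196"),
                Feature(105268, 105339, "KIAA1196")]

l2_result = classify(Feature(36673, 36753, "L2"), z97184_repeats)
kiaa_result = classify(Feature(105017, 105364, "KIAA1196"), al118506_cds)
# Hypothetical feature with illustrative coordinates, not from the paper.
new_result = classify(Feature(200, 300, "example repeat"), z97184_repeats)
```

A stricter variant could require the start and end positions to agree within a fixed number of base pairs instead of relying on overlap alone.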
4 Discussion
The number of new coding regions and new repeats found is strong evidence of how annotations can change as new information becomes available, such as newer versions of search tools and/or updates to databases. This has serious implications for how up-to-date the information contained within public repositories is, and hence can seriously affect the results of any researcher conducting similarity searches.
One of the potentially incorrect repeat labels in the Z97184 entry was an L2 repeat (type LINE/L2) found at position 36673 to 36753, whereas the GenBank entry lists the exact
same positions as containing a MIR2 repeat (DNA/MIR2 type). Another repeat, which we identified as AluSx (Alu type) from 13770 to 14073, is denoted in Z97184 as an AluSq from 13764 to 14073. In the Z80998 entry, there is an AluSg repeat listed from 4594 to 4892, and an AluSq repeat from 18647 to 18949. In our analysis, these exact positions were identified as AluJb and AluSx repeats respectively. Whether these differences in repeat annotations are significant is a subject for further analysis.
Across the four re-annotations we conducted, we found seven highly significant coding regions that were not previously identified. These examples highlight the importance of keeping track not only of results, but also of the database versions, software versions and parameters used, in order to better compare results at a later stage.
We have identified several issues relating to the sensitivity of BLAST in our comparison against Z97184. The first issue relates to finding nr matches (e.g. exons) that are relatively close together. We found several cases in our results where BLAST returned a single nr match for a region, but the corresponding GenBank entry lists two matches. Our Z97184 results returned an HKE2 nr match from position 30586 to 30907. In comparison, the GenBank entry lists this as HKE2 matches from 30630 to 30754 and 30836 to 30906. In our AL118506 analysis, we found a single nr match (KIAA1196) from position 105017 to 105364. The GenBank entry lists this as 2 CDS of KIAA1196, from 105019 to 105165 and 105268 to 105339. It appears that the matches against the two exons were significant enough for BLAST to report the region as one significant match, which includes the gap. It is possible to adjust gap parameters to minimise this problem.
The second issue relating to BLAST sensitivity is that the first BLAST match (which MAS currently uses) is not always the “best” match.
When BLASTing commonly occurring sequences, containing mRNA for example, BLAST will often return a number of different hits referring to the same region but having different names, such as unknown or hypothetical protein. Sometimes these matches are listed in the BLAST results before the well annotated descriptions. In our analysis, several of the CDS listed in Z97184 relating to the Tapasin protein were not found. Instead we found nr matches of Tapasinas at (approximately) the same positions. Further examination revealed that Tapasinas is an alternatively spliced form of Tapasin, and that the Tapasin match was located deeper in the BLAST results. In the AL118506 analysis we found a Riken cDNA gene from position 23659 to 23766. The corresponding GenBank entry lists a CDS of a tumor protein D52-like 2 from 23667 to 23768. Further analysis revealed that both these sequences produced high-scoring matches, as they are nearly 100% identical over this region. We have reported these as part of item 5 (Table 1), as the automated analysis examined only the first match and did not examine any further. As these examples demonstrate, the first BLAST hit is not always the “best” match for this type of analysis. One possible solution would be to look at the first, say, five matches and return all or the “best” of these. The issue of choosing or identifying the “best” match illustrates that there is a point where one cannot realistically automate this kind of analysis completely, without down-stream analysis, which may require human interaction.
Another issue regarding BLAST’s sensitivity relates to the position mismatches of nr entries, identified in item 7 of Table 1. Several listings of CDS within the Z97184 entry were not found as single matches in our results but were found as a combination of an nr and EST match. For example, Z97184 lists a Tapasin exon from 15694 to 16092.
Our test returned an EST from 15691 to 15722 and an nr (Tapasinas) match from 15723 to 16121. Further examination revealed that the Z97184 entry was complete. We discovered that BLAST’s low complexity filtering option, which is enabled by default, was the reason why some of the nr matches in our results reported different positions. When the same region was tested with the low complexity filter disabled, BLAST returned the nr match listed in Z97184, but with extra sequence on either end (past where the sequence match ends). Similarly, the AL118506 GenBank entry lists a TOM CDS from 123995 to 124163. With the filter enabled (BLAST default), we found this as an nr match, from 123975 to 124121, and an EST match, from 124122 to 124163. When tested with the filter disabled, this entire region was picked up as a single nr match. It appears that there are two choices available for locating sequence
matches with the low complexity option when using BLAST: the choice to have matches possibly cut down with the filter on (where EST matches may pick up the remainder), or the choice to have the complete sequence match with the filter off, but picking up extra sequence that may potentially contain other important features that will then be missed.
As stated in our results, two of the very short CDS reported in the Z97184 GenBank entry were not found in our analysis. These two regions did not contain any repeats, nr or EST matches, according to our results. The first CDS is listed from 19309 to 19320. BLASTing this region with the low complexity filter on, and then off, still returned no significant matches. We ran BLAST again including an additional 10, then 20 bases upstream and downstream in the query sequence, but BLAST was not able to identify this short exon. The second CDS is located at 26035 to 26103. A BLAST of this region does return a match (66 of 68bp match 100%), with a significance (E-value) of 0.00002. As the largest E-value we accept in our multiblast process is 0.00001, this exon was missed. Similarly, in our Z80998 re-analysis, one short CDS, from 8347 to 8406, was missed, as BLAST does not return a high significance figure for this small region. Other exon finding tools, such as LAP [21] and SIM4 [12], could be integrated to aid in the detection of short exons.
Future enhancements to the automated annotation process may include the integration of additional feature identification tools. While features like the CpG islands listed in the Z97184 GenBank entry cannot be found using the current annotation process, this type of feature identification could be added.
Other searches, such as a BLAST against a promoter database, and gene prediction tools could be combined to enhance the results of the analysis, which we expect will assist in addressing some of the problems outlined above, such as finding start/end positions and small coding regions.
We have developed a simple, intuitive and, as the results demonstrate, very effective tool for re-annotation of genomic sequences. The system is designed to be easily extendable to incorporate other bioinformatics tools and databases. The test results discussed illustrate not only the importance of storage, analysis and re-analysis of results, but also the importance of tracking database versions, software versions and parameters. When conducting research using public data repositories, it is essential to have the most up-to-date and accurate representation of the sequences, or to allow researchers to obtain this information themselves.
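One of the suggestions made in the discussion above, examining the first five matches instead of only the first, could be sketched as follows in Python. This is a hypothetical heuristic, not part of MAS: the list of "vague" terms, the function names and the example report are all assumptions, and the rule would not by itself resolve cases such as Tapasinas versus Tapasin, where both descriptions are informative.

```python
VAGUE_TERMS = ("unknown", "hypothetical")


def is_vague(description):
    """True when a hit description carries no specific gene or protein name."""
    text = description.lower()
    return any(term in text for term in VAGUE_TERMS)


def pick_hit(descriptions, top_n=5):
    """Pick a description from the first top_n hits, preferring informative names."""
    candidates = descriptions[:top_n]
    for description in candidates:
        if not is_vague(description):
            return description
    # Fall back to the first hit when every candidate is vague.
    return candidates[0] if candidates else None


# Illustrative BLAST report order, not taken from the paper's results.
report = ["hypothetical protein", "unknown protein", "Tapasin", "hypothetical protein"]

first_only = pick_hit(report, top_n=1)   # behaviour when only the first hit is used
top_five = pick_hit(report)              # looks at up to five hits
```

Returning all of the first five hits for later inspection is the other option mentioned in the text, and may be preferable where a human will review the result.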
5 Acknowledgments
This research was supported by the Commonwealth of Australia, through the Department of Industry, Science and Resources Technology Diffusion Program.
References
[1] Achard, F., Vaysseix, G., and Barillot, E., XML, bioinformatics and data integration, Bioinformatics, 17(2):115–125, 2001.
[2] Altschul, S.F., Gish, W., Miller, W., Myers, E.W., and Lipman, D.J., Basic local alignment search tool, Journal of Molecular Biology, 215:403–410, 1990.
[3] Bellgard, M.I. and Gojobori, T., Identification of a ribonuclease H gene in both Mycoplasma genitalium and Mycoplasma pneumoniae by a new method for exhaustive identification of ORFs in the complete genome sequences, Federation of European Biochemical Societies, FEBS Letters, 445:6–8, 1999.
[4] Bellgard, M.I., Schibeci, D., Gojobori, T., et al., An information system for whole genome data, Proceedings of Western Australian Workshop on Information Systems Research, 151–155, 1999.
[5] Bellgard, M.I., Hunter, A., and Wiebrands, C., ORBIT: an integrated environment for user customized bioinformatics, Bioinformatics, 15(10):847–851, 1999.
[6] Benson, D.A., Karsch-Mizrachi, I., Lipman, D.J., Ostell, J., Rapp, B.A., and Wheeler, D.L., GenBank, Nucleic Acids Research, 28(1):15–18, 2000.
[7] Bhatia, U., Robison, K., and Gilbert, W., Dealing with database explosion: a cautionary note, Science, 276:1724–1725, 1997.
[8] Boguski, M.S. and Schuler, G.D., ESTablishing a human transcript map, Nature Genetics, 10:369–371, 1995.
[9] Bork, P. and Bairoch, A., Go hunting in sequence databases but watch out for traps, Trends Genet., 12:425–427, 1996.
[10] Carter, K., Schibeci, D., and Bellgard, M., WWW issues for conducting sophisticated bioinformatics analysis, Proceedings of Seventh Australian World Wide Web Conference, 81–90, 2001.
[11] Clark, A.G. and Whittam, T.S., Sequencing errors and molecular evolutionary analysis, Molecular Biology and Evolution, 9(4):744–752, 1992.
[12] Florea, L., Hartzell, G., Zhang, Z., Rubin, G.M., and Miller, W., A computer program for aligning a cDNA sequence with a genomic DNA sequence, Genome Res, 8:967–974, 1998.
[13] Krawetz, S.A., Sequence errors described in GenBank: a means to determine the accuracy of DNA science interpretation, Nucleic Acids Research, 17:3951–3957, 1989.
[14] Seluja, G.A., Farmer, A., McLeod, M., Harger, C., and Schad, P.A., Establishing a method of vector contamination identification in database sequences, Bioinformatics, 15(2):106–110, 1999.
[15] Smit, A.F.A. and Green, P., RepeatMasker, http://ftp.genome.washington.edu/RM/RepeatMasker.html, 2001.
[16] Stoesser, G. et al., The EMBL nucleotide sequence database, Nucleic Acids Research, 29(1):17–21, 2001.
[17] Wenger, R.H. and Gassman, M., Mitochondria contaminate databases, Trends Genet., 11:167–168, 1995.
[18] http://www.ddbj.nig.ac.jp
[19] http://java.sun.com/xml/jaxp/index.html
[20] http://www.ncbi.nlm.nih.gov/Sequin/
[21] http://genome.cs.mtu.edu/align/align.html