A High-throughput System to Resolve Inconsistent Reading Frame Predictions for Expressed Sequence Tags
Yue Chen1,3, John Carlis1, Elizabeth Shoop2, and John Riedl1
Abstract Many Expressed Sequence Tag (EST) sequencing projects produce large numbers of ESTs that contain sequencing errors and assembly artifacts. To annotate an EST at the protein level, biologists use gene–finding programs to determine its reading frame, find the boundaries of protein coding regions, and translate them into proteins. However, the inherent inconsistency of EST data, coupled with inconsistent prediction results from different gene–finding programs, makes gene prediction for ESTs a difficult task. Therefore, high–throughput EST sequencing projects require automated systems that clean ESTs and make reliable coding region predictions. We have developed a system for EST projects that automates reading frame assignment by: 1) filtering out low-quality ESTs, 2) executing multiple gene–finding programs, 3) determining the optimal combination strategy for each project, and 4) using that optimal combination to integrate the results from the program executions into a reading frame assignment for each EST. We verified the system on a simulated set of ESTs. The system predicts reading frames more accurately than any single gene–finding program. This system processed approximately 50,000 ESTs from several plant genomes in a few days, whereas only a (non-existent) legion of skilled sequence annotators could have done likewise in that timeframe.
Content Area Key Words Scientific discovery, applications, intelligent databases, automated reasoning, information extraction, data mining, knowledge discovery, knowledge representation
Acknowledgements This work was supported under NSF grant BIR 940-2380. We thank Ed H. Chi, Sheila St. Cyr, Sopheak Sim, Juan Munoz, and Ernest Retzel for their discussions and assistance with our implementations.
1 Computer Science and Engineering Department, University of Minnesota, Room 4-192 EE/CSci, 200 Union St. SE, Minneapolis, MN 55455.
2 Computational Biology Centers, Academic Health Center, University of Minnesota, Box 43 Mayo, 420 Delaware St SE, Minneapolis, MN 55455.
3 To whom correspondence should be addressed.
1. Introduction
At the Plant Molecular Informatics Center at the University of Minnesota, we receive and process tens of thousands of Expressed Sequence Tags (ESTs) per month from several different plant genomes. EST cloning and sequencing, aided by homology searches, have been essential to the categorization of known expressed genes and the discovery of new genes (Staden 1994, Gaasterland and Sensen, 1996, Schultz et al, 2000). A thorough annotation (e.g. reading frame, coding regions, 5’/3’ untranslated regions) of expressed sequences is necessary because without it these sequences remain essentially functionally anonymous. Additionally, protein translations of ESTs make comparisons between genomes easier, because similarity search programs detect distant relationships more accurately with protein sequences (Henikoff, 1994).
Currently, the annotation of EST sequences does not keep pace with the rate at which they are generated in EST sequencing projects, including those for human. There are three reasons for this: 1) EST sequence annotation is an involved and onerous task that includes computationally intensive sequence homology searches, which often return no results; 2) EST data suffer from inconsistency problems such as a high error rate, cloning vector contamination, repetitive elements, ribosomal RNAs, or low-complexity regions; 3) the various gene identification programs perform inconsistently, being sensitive to EST sequencing errors and tending to disagree among themselves.
We automated EST reading frame resolution by implementing and using a high-throughput computer system. In our automated system, we address the problem of how to combine the results of multiple gene–finding programs to increase the consistency of the predictions. We do not address EST data inconsistency in detail here, because we use the standard EST cleaning process to eliminate poor-quality sequences before they are fed into the system.
The bottleneck in producing proper protein translations for ESTs is determining the single correct reading frame out of the 6 possible translation frames. Currently, two widely–used computational methods exist to predict reading frames for DNA sequences (Fickett, 1996): 1) use the reading frame detected from strong similarities found by a database similarity search program such as BLASTX (Altschul, et al., 1990, 1997), and 2) use a coding region identification program (also called a gene–finding or gene identification program), such as GRAIL (Uberbacher and Mural, 1991) or the more recent Genie (Reese, 2000). Gish and States (1993) showed that the similarity search method, using BLASTX, was reliable for predicting the reading frame, even in the face of sequencing errors. We have previously developed an automated similarity analysis system (Shoop, et al, 1995, Shoop, 1996) that computes BLASTX results for each EST searched against all known proteins, and places the results in a database. We (Shoop, 1995; Newman, 1994) and others (Adams, 1992, Hofte, 1993) have found, however, that only 40% or less of the anonymous EST sequences have significant similarities to known proteins in the databases. Thus, to obtain the reading frame for the remaining 60% or so of these anonymous sequences, we must turn to gene identification methods.
Some gene identification programs, such as TestCode (Fickett, 1982), GRAIL 1 (Uberbacher and Mural, 1991), and GeneMark (Borodovsky and McIninch, 1993), use a combination of coding measures to identify coding regions. The coding measures are meant to recognize regularities in coding regions that arise from bias in codon usage (see Fickett and Tung, 1992 for a review). Fickett (1996) points out that coding region identification programs are particularly suited to “fragmentary sequences,” such as ESTs or short runs of single–pass DNA sequence. Because the coding measures in these gene–finding programs are based on observed frequencies, they require ‘training’ with known coding and non–coding DNA sequences from particular organisms. This led Burset and Guigo (1996) to observe that the accuracy of these programs is highly dependent on the dataset used to train them. Fickett (1996) also recommends examining the “subset of the taxonomic universe” on which a program was trained before choosing it.
More advanced gene–finding programs such as GRAIL 2 (Xu, et al., 1994), GeneParser (Snyder and Stormo, 1995), GenLang (Dong and Searls, 1994), and Genie (Reese, 2000) have added more information, such as splice–site prediction, to determine the exon/intron structure. These integrated gene identification programs assume the input is a genomic sequence that contains at least one entire gene. However, when the input is a partial sequence, the predicted exons may still be correct, even though the overall predicted gene structure is not. Even for Genie (Reese, 2000), the accuracy of gene predictions is inconsistent, with the average sensitivity for unknown sequences below 80%.
• Definition: We will call an execution of one of these gene identification programs, with a particular training set from an organism and a certain set of input parameters, a protocol.
2. System and Methods
Because these various programs are not highly accurate (especially if the sequences contain errors), annotators often combine results from several protocols. Burset and Guigo (1996) note that when several programs predict the same exon, it is more likely to be true. Fickett (1996) also recommends the following: “I strongly suggest analyzing each sequence with several programs and carefully comparing the results. If the tools are to be used often, it can be worthwhile to analyze a number of test sequences, where the answer is already known, to get a feeling for algorithm capabilities.”
In practice, combining results has been a largely manual and unreliable process that does not scale well to increasing volumes of sequences or to new gene prediction programs. Human annotators must execute several different programs separately and then tediously examine each result for each sequence, sometimes with the help of interactive annotator tools (see http://www.tigr.org/tdb/at/atgenome/annotation/annotation.html for an example of how this is done at The Institute for Genomic Research (TIGR)). Any new choice of protocols means manually running the programs again on all of the sequences, and unless the annotator tools are flexibly designed, new protocol results cannot be easily added. Users are also faced with the task of choosing which protocol results to believe, with little knowledge of how well the protocols perform overall on their sequences. Should one protocol be trusted more than another? If we accept a higher overall prediction rate from one, how many false positive results will also occur?
To answer these questions and to keep up with the steady influx of ESTs that we receive, we developed a software system, called AutoRF, that is more automated and flexible than existing systems. The contributions of this work are the following:
• We have constructed a system that automates 1) EST data cleaning and filtering, 2) execution of a variety of protocols on large sets of sequences, 3) evaluation of protocols to determine which combination will be most accurate, and 4) integration of the results from several protocols into one prediction per sequence.
• Our system is extensible: new protocols, new evaluation methods, and new result integration methods can be added.
• We present a method for estimating the effectiveness of combinations of protocols and include it in our system.
• We devised two result integration methods for combining the results of several protocols and include them in our system.
• We evaluated our system using a simulated set of ESTs whose reading frames are known and have applied this system to determine reading frames for approximately 50,000 ESTs from Arabidopsis thaliana, rice, and Loblolly pine.
The rest of this paper is organized as follows: first we describe the system and methods used, provide implementation details, and present results from verifying the method’s consistency. Then we verify our results on our own EST sequence analysis projects. Finally, we share our experience for resolving inconsistencies for scientific data and results.
[Figure 1 components: sampling of sequences; the Sequence Processor, which runs n protocols and produces results-1 through results-k; the Planner, whose Protocol Evaluator turns n sample results into n indicators and whose Best Combination Chooser produces a k-protocol plan; and the Result Integrator, which produces the combined results.]
Figure 1. Outline of the AutoRF system.
Figure 1 depicts the framework of the AutoRF software system that we have built and are using for automated reading frame determination. The three major components of the system are: 1) the Sequence Processor, which executes various protocols for each EST; 2) the Planner, which uses a sample set of ESTs to compute the estimated accuracy of each individual protocol using an indicator function, and then estimates the best combination of protocols; and 3) the Result Integrator, which uses the best combination from the Planner to predict the reading frame for every EST. These three components correspond to three distinct tasks that we identified as necessary: 1) data processing personnel need to run any specified set of protocols automatically on large sets of ESTs as they are generated, 2) data analysts need to determine which combination of protocols should be used to determine a reading frame for all ESTs from a project, and 3) data processing personnel then need to generate the reading frame for each EST using the analyst’s chosen protocol combination.
Protocol parameters and results from the Sequence Processor are used in the Planner, and results from the Planner and Sequence Processor are used in the Result Integrator. To facilitate information sharing between these components, each of them stores all of its parameters and results in a database management system. The rest of this section contains more detail on the database and each of the three programs.
[Figure 2 entities, part 1: Sequencing_project (ProjectName, ProjectInfo); Blast_result (BlastResultID, StrandFlag, FrameNumber, BlastHitStrength, FrameShiftFlag, HitCategory); Eval_sequenceset.]
2.1. Database of Protocols and Results
Figure 2 depicts the data model of the underlying database. We used Logical Data Structure (LDS) notation (Carlis, 2001) and converted the model to a relational schema. Note that we modeled multiple Blast_result and Exon_result instances for each Clone_sequence from a given Sequencing_project. Each Blast_result records that a BLASTX search detected a hit to a known protein sequence in a particular reading frame. Each Exon_result stores the frame, start, and end of the predicted coding region, and the score and rating of the prediction. Also note that each Exon_result is associated with a Genefinding_protocol. Each Clone_sequence has at least one ComputedORF_result, which came from a ComputedORF_algorithm that used a Protocol_or_method. A Protocol_or_method can consist of a single gene–finding protocol or of multiple gene–finding protocols (each member is stored in BestCombMeth_member). Information about the accuracy of a Protocol_or_method is stored in ProtMeth_evaluation. The evaluation test data (a subset of the Clone_sequences for a project) is stored in Eval_sequenceset.
The use of this underlying database is advantageous for several reasons. First, the results from running several gene–finding programs on tens to hundreds of thousands of ESTs become much too cumbersome to handle manually using flat text files. Second, since the programs in our system are pipelined, it is easier to manage those results inside the DBMS. Third, when the data is stored in a DBMS, complex analysis can be performed more easily using SQL, rather than by writing special–purpose programs for each different evaluation. Lastly, by using the generic data model shown in Figure 2, we can accommodate results from many different gene–finding programs, and use them all in combination during evaluation and integration.
[Figure 2 entities, part 2: Clone_sequence (CloneName, CloneSeqLength, CloneDirectory); Exon_result (ExonResultID, StrandFlag, FrameNumber, ExonStart, ExonEnd, ExonExtendedStart, ExonExtendedEnd, ExonResultScore, ExonResultRating); ProtMeth_evaluation (PMEvalID, PMNumSeq_Total, PMNumSeq_Pos, PMNumSeq_subTP, PMNumSeq_subPos, Prate, subTPrate, TPI); ComputedORF_algorithm (AlgID, AlgDescription); ComputedORF_result (StrandFlag, FrameNumber, ExonExtendedStart, ExonExtendedEnd, ExonScore); Genefinding_protocol (ProtocolDescription), which is a kind of Protocol_or_method (UniversalID, ProtocolMethodType); BestCombMeth_member.]
Figure 2. Data model of the system database
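To illustrate the kind of SQL analysis that the database makes convenient, the following is a minimal sketch using Python's built-in sqlite3 module. It is not the AutoRF schema: the two tables are heavily simplified, the ProtocolID column and the sample rows are hypothetical, and the original system used a different DBMS with PL/SQL. The query counts, for each protocol, how many ESTs have both a gene–finding prediction and a BLASTX frame, and how many of those agree.

```python
import sqlite3

# Simplified, hypothetical subset of the Figure 2 model (assumed columns).
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE Blast_result (CloneName TEXT, StrandFlag TEXT, FrameNumber INTEGER);
CREATE TABLE Exon_result  (CloneName TEXT, ProtocolID TEXT,
                           StrandFlag TEXT, FrameNumber INTEGER);
""")
conn.executemany(
    "INSERT INTO Blast_result VALUES (?, ?, ?)",
    [("est1", "+", 1), ("est2", "-", 2), ("est3", "+", 3)],
)
conn.executemany(
    "INSERT INTO Exon_result VALUES (?, ?, ?, ?)",
    [("est1", "2A", "+", 1), ("est2", "2A", "+", 2), ("est3", "2A", "+", 3),
     ("est1", "1E", "+", 1), ("est4", "1E", "-", 1)],
)

# Per protocol: ESTs with both kinds of result, and ESTs whose frames agree.
query = """
SELECT e.ProtocolID,
       COUNT(DISTINCT e.CloneName) AS with_blast,
       COUNT(DISTINCT CASE WHEN e.StrandFlag = b.StrandFlag
                            AND e.FrameNumber = b.FrameNumber
                           THEN e.CloneName END) AS matching
FROM Exon_result e
JOIN Blast_result b ON b.CloneName = e.CloneName
GROUP BY e.ProtocolID
ORDER BY e.ProtocolID
"""
rows = conn.execute(query).fetchall()
for protocol, with_blast, matching in rows:
    print(protocol, with_blast, matching)
```

A single join replaces what would otherwise be a special–purpose comparison program for each new protocol.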
2.2. The Sequence Processor
In the Sequence Processor, the ESTs in a large batch (typically 500 or so) are automatically sent to a gene–finding service (the remote GRAIL server at Oak Ridge National Laboratory (ORNL), for example) and the results are automatically placed in the database. This program contains several modules, each of which accesses an independent gene–finding service. Each module is client software that independently contacts a gene–finding service provider, located on our local intranet or across the Internet, to obtain the necessary service. This component was designed to be extensible: submitting sequences to another gene–finding program only requires rewriting the import/export interface.
2.3. The Planner
As Figure 1 shows, intermediate gene–finding results from heterogeneous gene–finding protocols feed into the Planner, which, for a selected subset of the ESTs, determines the accuracy of each protocol and of combinations of protocols. The Planner consists of two flexible sub-components: a protocol evaluator and a best combination chooser, each of which can be changed to suit a project’s needs.
2.3.1. Protocol evaluator
The protocol evaluator requires no intervention by the analyst. It generates an indicator of how accurate each protocol was on the subset of ESTs, and stores it in the database. The flexible part of this program is the choice of indicator.
A Performance Indicator. To determine which single protocol will work best for a given set of ESTs, users can choose an indicator of any kind to measure how well each protocol performs. Here, we only provide a reference design and implementation of an indicator, which optimizes sensitivity (a high number of true positive predictions for actual coding regions). Our system allows the user to develop other indicators if needed (for example, high specificity indicators). However, the discussion of general performance indicators is beyond the scope of this paper.
In our design of the performance indicator, we prefer a high prediction rate (a high number of ESTs with predicted frames) and high sensitivity (a high number of ESTs with correctly predicted frames among all ESTs that truly contain coding regions). Therefore, we choose a measure that combines prediction rate with estimated sensitivity, the True Positive Index (TPI). Figure 3 depicts how we compute this value. Starting at the top of Figure 3, we use a representative set from our larger group of ESTs from an organism (typically 1000 – 2000 ESTs). The number of these ESTs is designated N. We run a protocol on the N ESTs; P is the number of ESTs for which the protocol makes a prediction, and the prediction rate of the protocol is P% = P/N. From P we select the subset P´ of ESTs that also have BLASTX reading frame predictions. TP´ is the subset of P´ whose gene–finding reading frame prediction matches the BLASTX reading frame. The fraction of the P´ ESTs whose two predicted reading frames match is then TP% = TP´/P´. This number gives us a rough estimate of the sensitivity of the protocol. We compute the True Positive Index as TPI = TP% * P%. This number is designated TPIp for single protocols.
The TPI provides an indication of whether we will be likely to get a higher prediction rate in combination with its sensitivity. We use TPI as an estimator because when a protocol has relatively high sensitivity in relation to other protocols and has a higher prediction rate, we will obtain more overall true positive predictions. The Planner is general, though, and other performance indicators could be added to achieve different results. We would expect the overall performance of the protocols to vary because of the difference between the organism of the input set and the organism of the sequences for which a protocol was trained. Thus, the ordering of TPIp values will vary depending on the organism of the input ESTs.
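The TPIp computation described above can be sketched in a few lines of Python. This is an illustrative sketch rather than the AutoRF implementation (which was written in Perl and PL/SQL); the function name, the dictionary-based inputs, and the sample data are assumptions made for the example. Frames are numbered 1 to 6.

```python
def true_positive_index(n_total, gene_frames, blast_frames):
    """Return (P%, TP%, TPI) for one protocol.

    n_total      -- N, the number of ESTs in the representative set
    gene_frames  -- dict: EST id -> set of frames the protocol predicted
                    (only ESTs that received a prediction appear here)
    blast_frames -- dict: EST id -> reading frame from a strong BLASTX hit
    """
    p_rate = len(gene_frames) / n_total                      # P% = P / N
    with_blast = [e for e in gene_frames if e in blast_frames]   # P'
    matches = sum(1 for e in with_blast
                  if blast_frames[e] in gene_frames[e])      # TP'
    tp_rate = matches / len(with_blast) if with_blast else 0.0   # TP% = TP'/P'
    return p_rate, tp_rate, tp_rate * p_rate                 # TPI = TP% * P%


# Hypothetical data: 10 ESTs, 8 with predictions, 4 of those with BLASTX frames.
gene_frames = {"e1": {1}, "e2": {2}, "e3": {4}, "e4": {6},
               "e5": {1}, "e6": {3}, "e7": {2}, "e8": {5}}
blast_frames = {"e1": 1, "e2": 2, "e3": 4, "e4": 5, "e9": 3}
p_rate, tp_rate, tpi = true_positive_index(10, gene_frames, blast_frames)
print(f"P% = {p_rate:.2f}, TP% = {tp_rate:.2f}, TPI = {tpi:.2f}")
```

Note that an EST with a BLASTX frame but no gene–finding prediction ("e9" here) lowers P% only through N; it does not enter P´ or TP´.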
[Figure 3 flow: a representative set of sequences (N, 100%) → run gene–finding protocols → predictions (P, giving P%) or no predictions → choose the subset of predictions that also have a BLASTX result each: predictions with a BLASTX frame (P´) or predictions without a BLASTX frame → compare gene–finding results with BLASTX results → predictions with a matching BLASTX frame (TP´, giving TP%) or predictions without any matching BLASTX frame.]
Figure 3. How TPIp is computed.
2.3.2. Best Combination Chooser
Since we execute several protocols on each EST, we can also estimate how these protocols might perform in combination. The data analysts use the best combination chooser to examine information about the likely accuracy of combinations of protocols, which guides their choice of a combination to use on all of the ESTs in a project. This combination is then used in the Result Integrator program. The flexible portion of the best combination chooser is the choice of a method for indicating which combination of protocols is most consistent. An accurate method of using information from more than one protocol would be to consistently assign a stronger likelihood to a frame if it is found by more than one protocol, and to use the combination of protocols that most often has the strongest likelihood of predicting the correct frame. We call this the best combination method (BCM). We have developed two best combination methods, which differ in how they determine the combination of protocols that will most often predict the correct frame.
Optimal BCM. In this method, we define a variant of the True Positive Index, namely TPIc, by expanding the definition of TP´ in TP% (Figure 3): an EST is counted in TP´ if any of the reading frames predicted by the set of protocols matches the BLASTX reading frame. The optimal BCM method generates TPIc values for the complete set of protocol combinations obtained by choosing K gene–finding protocols from the N available protocols. The best combination for K (BCM–K) is the one with the highest TPIc. This method is computationally expensive when the number of combinations is large, because the analysts have to execute all of the protocols (using the Sequence Processor) and derive a TPIc value for every one of the N choose K combinations.
Greedy BCM. In this method, we derive the combination by choosing the K protocols that have the highest individual TPIp values. This eliminates the need to compute TPIc for all of the N choose K combinations required by the optimal BCM. In practice, we have found that the greedy BCM performs nearly as well as the optimal BCM when K is small (data not shown).
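The two choosers can be contrasted with a short Python sketch. It is illustrative only: the function names and toy data are hypothetical, and it assumes that the prediction rate of a combination counts an EST as predicted when any member protocol predicts a frame for it, which the text does not state explicitly.

```python
from itertools import combinations


def tpi_c(protocols, n_total, predictions, blast_frames):
    """TPIc for a combination; for a single protocol this equals TPIp.

    predictions -- dict: protocol -> dict: EST id -> set of predicted frames
    """
    predicted = set()
    for p in protocols:
        predicted |= set(predictions[p])          # assumed: union of predictions
    with_blast = [e for e in predicted if e in blast_frames]
    matches = sum(
        1 for e in with_blast
        if any(blast_frames[e] in predictions[p].get(e, ()) for p in protocols)
    )
    tp_rate = matches / len(with_blast) if with_blast else 0.0
    return tp_rate * len(predicted) / n_total


def optimal_bcm(k, n_total, predictions, blast_frames):
    """Score every K-subset of protocols and keep the best one."""
    return max(combinations(sorted(predictions), k),
               key=lambda combo: tpi_c(combo, n_total, predictions, blast_frames))


def greedy_bcm(k, n_total, predictions, blast_frames):
    """Take the K protocols with the highest individual TPIp."""
    ranked = sorted(sorted(predictions), reverse=True,
                    key=lambda p: tpi_c((p,), n_total, predictions, blast_frames))
    return tuple(ranked[:k])


# Toy data: protocols A and B are individually strong but redundant.
blast = {"e1": 1, "e2": 2, "e3": 3, "e4": 4}
preds = {
    "A": {"e1": {1}, "e2": {2}, "e5": {1}},
    "B": {"e1": {1}, "e2": {2}, "e6": {2}},
    "C": {"e3": {3}, "e4": {4}},
}
best = optimal_bcm(2, 6, preds, blast)
quick = greedy_bcm(2, 6, preds, blast)
print(best, tpi_c(best, 6, preds, blast))
print(quick, tpi_c(quick, 6, preds, blast))
```

In this contrived case the greedy choice pairs two redundant protocols and scores lower than the exhaustive choice, which shows where the two methods can diverge; the source reports that on real data the difference is small when K is small.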
2.4. The Result Integrator
Once the best combination of protocols is chosen using a test subset of the ESTs, we need to use that combination to determine a reading frame for all of a project’s ESTs, each with K results. The method for determining the reading frame from BCM–K (whether optimal or greedy) on all ESTs is built into the Result Integrator. The method predicts the reading frame as follows:
Allocate a matrix, FSCORE(sequenceid, x), where x is 1, 2, 3, 4, 5, and 6, which correspond to forward reading frames 1, 2, and 3 and reverse reading frames 1, 2, and 3, respectively.
Initialize each cell in FSCORE to 0.
For each EST s,
    set MAXSCORE(s) to 0
    For each protocol pi from the set of K protocols,
        if pi had predicted reading frames for s, then
            for each frame f that pi predicted, do
                set FSCORE(s, f) to FSCORE(s, f) + w*TPIpi   ←
                if FSCORE(s, f) > MAXSCORE(s), then
                    set RF(s) to f
                    set CRLENGTH(s) to the length of predicted coding region
                    set MAXSCORE(s) to FSCORE(s, f)
                else if (FSCORE(s, f) equals MAXSCORE(s)) and (length of predicted coding region of f of s > CRLENGTH(s)) then
                    set RF(s) to f
                    set CRLENGTH(s) to the length of the coding region of f of s
                end if
            end for each frame
        end if
    end for each protocol
    if MAXSCORE(s) equals 0, then set RF(s) to “Not Found”
end for each EST
Note the use of a weighting factor, w, when computing the FSCORE value for a frame (designated by arrow above). This factor is converted from the score or ranking categories returned from a gene-finding program execution.
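For readers who prefer executable code, the integration procedure can be rendered in Python roughly as follows. This is a sketch, not the AutoRF source: the input layout (a list of tuples per EST) and the default weight function are assumptions, and the conversion of a program's score or rating into the weight w is left as a caller-supplied function because the text does not specify it. The TPIp values in the example are taken from Table 1a.

```python
def integrate(predictions, tpi, weight=lambda score: 1.0):
    """Assign one reading frame (1-6) per EST, or "Not Found".

    predictions -- dict: EST id -> list of (protocol, frame, coding_length, score)
    tpi         -- dict: protocol -> TPIp of that protocol
    weight      -- maps a program's score/rating to the factor w (assumed form)
    """
    reading_frame = {}
    for est, results in predictions.items():
        fscore = {frame: 0.0 for frame in range(1, 7)}
        best_frame, best_length, max_score = None, 0, 0.0
        for protocol, frame, length, score in results:
            fscore[frame] += weight(score) * tpi[protocol]
            if fscore[frame] > max_score:
                best_frame, best_length, max_score = frame, length, fscore[frame]
            elif fscore[frame] == max_score and length > best_length:
                # Tie on score: prefer the longer predicted coding region.
                best_frame, best_length = frame, length
        reading_frame[est] = best_frame if max_score > 0 else "Not Found"
    return reading_frame


tpi_pine = {"1E": 0.72, "1H": 0.59, "2A": 0.51}   # TPIp values from Table 1a
example = {
    # 1E alone votes for frame 1; 1H and 2A together outvote it with frame 2.
    "est1": [("1E", 1, 300, None), ("1H", 2, 200, None), ("2A", 2, 250, None)],
    # No protocol produced a prediction for this EST.
    "est2": [],
}
frames = integrate(example, tpi_pine)
print(frames)
```

Because each vote is scaled by TPIp, two weaker protocols that agree can override a single stronger one, which is the behavior the FSCORE accumulation is designed to produce.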
3. Implementation
The AutoRF programs are written in Perl and PL/SQL. For the Sequence Processor, we have implemented modules for GRAIL 1, 1a, and 2. These modules allow analysts to specify more than ten different protocols, because of the various combinations of GRAIL program and trained organism that are available. We chose to implement GRAIL modules first because of GRAIL's widespread use and availability. We plan to implement a module for GeneMark (Reese et al, 2000), which will be accessed locally on our intranet, in the near future.
4. Verification of the System
We have verified the accuracy of the AutoRF system and tested it on several sets of ESTs from different plant species. When determining which combination of protocols works best for each organism, data analysts on our team used results from 5 different available GRAIL protocols:
• GRAIL 1 trained on E. coli (designated 1E),
• GRAIL 1 trained on human (1H),
• GRAIL 1a trained on human (1aH),
• GRAIL 2 trained on Arabidopsis (2A),
• GRAIL 2 trained on human (2H).
We did not use other gene–finding programs for our evaluation, because we found that the different versions of GRAIL were sufficiently different to provide evaluation evidence (version 1 trained on ESTs vs. version 2 trained on genomic sequences).
We used a test data set of 1271 simulated ESTs derived from original full–length Arabidopsis thaliana cDNAs. This set of simulated ESTs was obtained by trimming the original full–length cDNAs from the 3' end down to 500 bp, then randomly replacing on average 4% of the bases in each sequence with Ns. To assess how the protocols will perform on ESTs, we checked the TPIp values for the simulated Arabidopsis ESTs and deduced that the protocol most likely to give sensitive results is GRAIL 2 trained on Arabidopsis (2A), followed closely by GRAIL 1 trained on E. coli (1E) (data not shown here).
We want to use these simulated ESTs to check how consistent the system will be when using combinations of these protocols. We adopted a measuring method similar to the one outlined for genomic DNA gene prediction methods (Burset and Guigo, 1996). This method measures accuracy by observing the predicted coding value and the true coding value for each nucleotide along a sequence. In our case of evaluating EST reading frame resolutions, we use as the “reality” of a coding region the reading frame of the cDNA from which a simulated EST was derived.
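The construction of a simulated EST can be sketched as follows. The source does not give the exact procedure, so this sketch makes two assumptions: trimming from the 3' end keeps the first (5'-most) 500 bases, and each base is independently replaced by N with probability 0.04, which yields 4% Ns on average. The function name and parameters are hypothetical.

```python
import random


def simulate_est(cdna, length=500, n_rate=0.04, rng=None):
    """Trim a full-length cDNA to its first `length` bases and mask bases with N.

    Assumptions (not stated in the source): the 5' end is kept, and each base
    is masked independently with probability `n_rate`.
    """
    rng = rng or random.Random()
    bases = list(cdna[:length])
    for i in range(len(bases)):
        if rng.random() < n_rate:
            bases[i] = "N"
    return "".join(bases)


# Hypothetical 1200 bp cDNA built from random bases, with a fixed seed.
seeded = random.Random(0)
cdna = "".join(seeded.choice("ACGT") for _ in range(1200))
est = simulate_est(cdna, rng=random.Random(1))
print(len(est), est.count("N"))
```

Passing an explicit random.Random instance keeps a simulated test set reproducible between evaluation runs.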
We can distinguish coding in a frame from no coding in the “reality” case because those cDNAs with long untranslated regions (UTRs) do not have coding regions within their first 500 bp. Any method of predicting a frame should not predict one for those simulated ESTs that were derived from cDNAs with long UTRs. Conversely, any method of predicting a frame should predict a coding region for those simulated ESTs that were derived from cDNAs with short UTRs. Therefore, our true positives (TP) are when the prediction guesses coding and the reality is coding, false positives (FP) are when the prediction guesses coding and the reality is no coding, false negatives (FN) are when the prediction guesses no coding and the reality is coding, and true negatives (TN) are when the prediction guesses no coding and there is no coding. We use sensitivity (Sn) and specificity (Sp) as computed from the following formulas:
Sn = TP/(TP+FN)
Sp = TP/(TP+FP)
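A small Python sketch shows how these two measures follow from the four outcome counts. The function name and the boolean dictionaries are assumptions made for the example. Note that Sp as defined here, following Burset and Guigo (1996), is the fraction of coding predictions that are correct, a quantity that other fields call precision.

```python
def sensitivity_specificity(predicted, actual):
    """Compute (Sn, Sp) from per-EST coding calls.

    predicted -- dict: EST id -> True if the method predicted a coding frame
    actual    -- dict: EST id -> True if the source cDNA is coding in that region
    """
    tp = sum(1 for e in actual if predicted[e] and actual[e])
    fp = sum(1 for e in actual if predicted[e] and not actual[e])
    fn = sum(1 for e in actual if not predicted[e] and actual[e])
    sn = tp / (tp + fn) if tp + fn else 0.0   # Sn = TP / (TP + FN)
    sp = tp / (tp + fp) if tp + fp else 0.0   # Sp = TP / (TP + FP)
    return sn, sp


# Hypothetical calls for five simulated ESTs: TP = 2, FN = 1, FP = 1, TN = 1.
actual = {"e1": True, "e2": True, "e3": True, "e4": False, "e5": False}
predicted = {"e1": True, "e2": True, "e3": False, "e4": True, "e5": False}
sn, sp = sensitivity_specificity(predicted, actual)
print(f"Sn = {sn:.3f}, Sp = {sp:.3f}")
```

True negatives do not appear in either formula, so neither measure rewards a method for correctly declining to predict a frame.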
[Figure 4: sensitivity (Sn) and prediction rate (P%) plotted for the single protocols 1E and 2A (BCM–1) and for BCM–2, BCM–3, BCM–4, and BCM–5.]
Figure 4. Sensitivity and prediction rate of various combinations of protocols for the simulated Arabidopsis ESTs.
In Figure 4, we compare sensitivity and prediction rate for the two best single protocols and for the optimal variants of BCM–2, BCM–3, BCM–4, and BCM–5. This figure suggests that the single protocols have lower sensitivity and prediction rate than the combination protocols. It is interesting to observe that the increase in sensitivity and prediction rate starts to level off once 4 protocols are combined (therefore we should use BCM–4 instead of the more computationally intensive BCM–5).

[Figure 5: specificity and sensitivity plotted for BCM–3, BCM–4, and BCM–5 and for their weighted versions.]
Figure 5. Specificity and sensitivity difference with weighting.
In Figure 5, we show the tradeoff between sensitivity and specificity for the weighted versions of BCM–3, BCM–4, and BCM–5 versus the non–weighted versions. This shows the expected result that for each weighted version, the specificity increased, but the sensitivity decreased relative to the non–weighted version. These data for the simulated ESTs suggest that we will be able to improve sensitivity and prediction rate by using a BCM–4 combination on our EST projects.

In Tables 1a and 1b, we show the type of results that can be expected for EST projects by using the system to compute TPIp and TPIc for two representative EST datasets: 1) 1842 ESTs from Arabidopsis thaliana, 2) 1159 ESTs from Pinus taeda (loblolly pine).

Protocol   P´     TP´    P%     TP%    TPIp
2H         307    203    0.51   0.66   0.34
1aH        392    301    0.65   0.77   0.50
2A         446    262    0.87   0.59   0.51
1H         444    341    0.77   0.77   0.59
1E         460    406    0.82   0.88   0.72
Table 1a: TPIp values for 1159 Loblolly Pine ESTs.

Protocol   TPIp
2H         0.29
1aH        0.45
1H         0.49
2A         0.62
1E         0.67
Table 1b: TPIp values for 1842 Arabidopsis ESTs.

Table 1a shows P´, TP´, P%, TP%, and the computed TPIp values for each protocol for the loblolly pine ESTs, ordered by increasing TPIp. Note that the number of pine ESTs that had both a GRAIL result and a strong BLASTX result (P´) varies by protocol, from 307 to 460 of the 1159 pine ESTs on which we executed GRAIL and BLASTX. The order of the protocols for this pine EST data set shows that we should not always choose a protocol based on what it has been trained on: we might have chosen 2A because Arabidopsis is the closest organism to pine in the taxonomy tree, but the TPIp for 1E was higher. Table 1b shows the ordered TPIp values for the actual Arabidopsis ESTs. Note that the order of the protocols is different between pine and Arabidopsis. This shows that the ordering of the protocols will vary depending on the input EST sequences.

Combination (algorithm name)   Component Protocols   P%     TP%    TPIc
BCM–1                          1E                    0.82   0.88   0.72
BCM–2                          1E/1H                 0.93   0.90   0.84
BCM–3                          1E/1H/2A              0.95   0.92   0.87
BCM–4                          1E/1H/2A/1aH          0.95   0.93   0.88
BCM–5                          1E/1H/2A/1aH/2H       0.95   0.93   0.88
Table 2. TPIc results for the pine ESTs

In Table 2, we show the component protocols for combinations of 1, 2, 3, 4, and 5 protocols chosen by the optimal BCM method for the pine ESTs, and the resulting TPIc values. BCM–K, where K is greater than 1, has consistently higher TPI values; therefore we expect that the BCM method will more consistently predict EST reading frames. Note that a combination with fewer protocols may be just as consistent: using BCM–3 or BCM–4 may be as good as using BCM–5, based on their nearly identical TPIc values. Both the actual Arabidopsis ESTs and the simulated ESTs also had TPI values that were very close to those in Table 2.
5. Conclusion
References
We determined reading frames for ESTs with relatively high error rates, and from organisms for which each gene–finding program has not always been trained. Although these two factors detrimentally affect the ability of the programs to predict reading frames, we showed that we achieved more consistent reading frame predictions by using a combination of results from several executions of various combinations of gene–finding programs and training sets. Our system is designed to enable analysts to quickly determine the performance of individual protocols of their choice, and to automatically derive the optimal combination of multiple protocols that will increase the both the accuracy and consistency of reading frame predictions, and use this combination to generate reading frames for all sequences. Our use of the database also makes managing data, analysis methods, and analysis results easily scalable—organizing and analyzing large numbers of sequence data is as easy as handling a few sequences.
Data processing personnel on our team have used the AutoRF system to compute reading frame predictions for over 50,000 ESTs from 4 different plant organisms that we are using in comparative genomics studies. Our system has thus enabled us to automate a high-throughput task and obtain consistently satisfactory results.
The AutoRF system is extensible. It can accommodate virtually all EST sequencing projects, as long as the EST data are cleaned, either by a pre-processing filtering step or by consensus sequence generation, to enhance data quality. The system also provides a framework for further investigation: we can add other gene–finding methods, use genomic sequence as input, test hypotheses about how to determine combinations, and determine how much additional value each new gene–finding method adds. We believe it can also be extended for genomic sequences. The AutoRF system met these goals: 1) to automate reading frame determination; and 2) to increase the sensitivity and the consistency of EST reading frame predictions by using a combination of multiple gene–finding results.
There are still unsolved problems related to inconsistent EST data. One problem is the insertion or deletion sequencing errors that are often found in EST data. In this study, we assumed that the upstream EST preprocessing either 1) used frame shift detection programs to correct such errors, or 2) used the consensus sequence of all similarly aligned ESTs. If these errors are not corrected during preprocessing and they occur at the far 5’-end of the EST, many underlying gene-finding protocols will give erroneous results. The other problem is the erroneously assembled EST (“chimeric EST”). This sequence assembly error cannot be easily detected during EST pre-processing and can confuse many underlying gene-finding protocols.
We think that our experience with inconsistent biological data and analysis results applies to other scientific endeavors.
Advances in data capture technology change the ways scientists work. This has happened, for example, with MRI, PET, and confocal imaging. Ways of work that were fine with a trickle of data become a heavy burden with a flood. To avoid drowning, manual, judgmental tasks must be automated, but doing so challenges the technical and communication skills of an interdisciplinary team.