BIOINFORMATICS APPLICATIONS NOTE
Vol. 19 no. 5 2003, pages 675–676 DOI: 10.1093/bioinformatics/btg056
ASAP: automated sequence annotation pipeline for web-based updating of sequence information with a local dynamic database
Andrew Kossenkov 1,2, Frank J. Manion 2, Eugene Korotkov 3, Thomas D. Moloshok 2 and Michael F. Ochs 2,∗
1 Cybernetics Department, Moscow Engineering Physics Institute, Moscow, Russian Federation, 2 Department of Information Science and Technology, Fox Chase Cancer Center, Philadelphia, PA 19111, USA and 3 Center of Bioengineering, Russian Academy of Sciences, Moscow 117312, Russian Federation
Received on July 27, 2002; revised on October 22, 2002; accepted on November 12, 2002
∗ To whom correspondence should be addressed.
© Oxford University Press 2003; all rights reserved. Bioinformatics 19(5)
Many biological studies using high-throughput methods produce a list of sequences of interest. Many of these sequences have no known associated gene or function, which blocks further interpretation of the experimental results. This happens routinely when the sequences have been identified through microarray gene expression analysis, which is rapidly becoming a ubiquitous biological tool (e.g. Clarke et al., 2001). While annotation of genomic data has seen considerable progress (Makalowska et al., 2001), the roles of ESTs identified in an experiment are still typically determined by hand, a daunting task when there are thousands of ESTs of interest. Such identification would ideally be handled through a unified interface system, such as the Distributed Annotation System (DAS; Dowell et al., 2001), but most sites are not yet serving data in this way. To retrieve information easily from non-DAS sites, we have developed an automated, flexible system for making routine queries to web sites. The system is specifically designed to allow easy updating of information on query and result formats, and it includes a subsystem that alerts the administrator when a web site has changed its format.
The system design is shown in Figure 1. The system uses a MySQL database (DB) to store, for each site, specific information on how to format a query and on the format of the returned results. The CONFIG generator uses the information in the DB to generate query and result workfiles. The QUERY module and RESULTS module use these workfiles to generate site-specific query scripts and results parsers, respectively. The query script is a standard Perl script that contacts the appropriate server. The results from the query are parsed according to the detailed parsing information contained in the workfile. Exception handling catches cases where a result cannot be parsed and notifies the MANAGER module, allowing the administrator to review the error and modify the DB if necessary (e.g. because a web site has modified its query or result formats). Data can be filtered during parsing, allowing the output of one query to be used as the input for the next query.
The system is designed for easy extensibility: adding a new site for querying requires only additions to the database. This makes the system more general and more easily extended than BioPerl (Stajich et al., 2002), which has some modules for parsing certain web sites, and also permits more information to be retrieved than can be obtained from the Unigene site alone (Schuler, 1997). In fact, the system is not limited to the retrieval of sequence information, but could be used for more general querying. The system has been used to query the TIGR mouse
Downloaded from bioinformatics.oxfordjournals.org by guest on July 13, 2011
ABSTRACT
Summary: The automated sequence annotation pipeline (ASAP) is designed to ease routine investigation of new functional annotations on unknown sequences, such as expressed sequence tags (ESTs), through querying of web-accessible resources and maintenance of a local database. The system allows easy use of the output from one search as the input for a new search, as well as the filtering of results. The database is used to store formats and parameters and information for parsing data from web sites. The database permits easy updating of format information should a site modify the format of a query or of a returned web page.
Availability: Source code is available under the GNU public license at http://bioinformatics.fccc.edu/
Contact: av [email protected], fj [email protected], [email protected], td [email protected], m [email protected].
[Figure 1 diagram: Automated Sequence Annotation Pipeline. Components shown: USER, User WEB Interface, MANAGER, DB (parameters for sites), CONFIG generator, QUERY module, RESULTS module, FINALIZATION module, MAIL and the external web resources.]
gene index databases (Quackenbush et al., 2001) to retrieve Theoretical Contigs (TC) for over 1800 ESTs of interest. These TC hits were then piped to a local implementation of BeoBLAST (Grant et al., 2002) running BLASTN and BLASTX, and significant results were recorded. Since the BeoBLAST databases are routinely updated, routine runs of the ASAP in this configuration provide updated information on sequences of interest. Out of the 1800 ESTs with no annotation one year ago, ASAP identified new information for more than 48% on a recent run. In addition, eight other databases have already been scripted and used to retrieve information (see our website for details). The system is open source, allowing users to add and share entries for sites of interest. In the future, we expect to add machine intelligence methods to perform more elaborate chained searches, and we are presently beginning to cache information on recent hits in the database to avoid unnecessary repeated querying over the web.
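The chained search described above (TC lookup, then BLAST, with filtering between steps) and the planned caching of recent hits can be sketched roughly as follows. This is an illustrative Python sketch only, not the authors' Perl implementation: the names `HitCache` and `run_plan`, the plan layout and the expiry rule are all assumptions introduced for the example, and each `fetch` function stands in for a real web or BLAST query.

```python
import time


class HitCache:
    """Hypothetical cache of recent hits, so a repeated query can skip the web."""

    def __init__(self, max_age_seconds, clock=time.time):
        self.max_age = max_age_seconds
        self.clock = clock          # injectable clock, handy for testing expiry
        self.entries = {}           # (site, query) -> (timestamp, result)

    def get_or_fetch(self, site, query, fetch):
        key = (site, query)
        now = self.clock()
        cached = self.entries.get(key)
        if cached is not None and now - cached[0] <= self.max_age:
            return cached[1]        # recent enough: reuse the stored result
        result = fetch(query)       # otherwise run the (stand-in) query
        self.entries[key] = (now, result)
        return result


def run_plan(plan, accessions, cache):
    """Run each plan step on the filtered output of the previous step.

    `plan` is a list of (site, fetch, keep) tuples: `fetch` maps one input
    value to a list of hits and `keep` decides which hits are passed on.
    """
    summary = {}
    for accession in accessions:
        values = [accession]
        for site, fetch, keep in plan:
            next_values = []
            for value in values:
                hits = cache.get_or_fetch(site, value, fetch)
                next_values.extend(hit for hit in hits if keep(hit))
            values = next_values
        summary[accession] = values
    return summary
```

A plan for the configuration above would have two steps: one mapping an EST accession to its TC identifiers, and one mapping a TC to BLAST hits with a filter on significance. Keying the cache on the site and the query value means a routine rerun only contacts a site for inputs it has not seen recently.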
ACKNOWLEDGEMENTS
We would like to thank Dr Kenneth Zaret for providing the initial impetus for this project. We thank the National Institutes of Health, National Cancer Institute (Comprehensive Cancer Center Core Grant CA06927 to R. Young) and the Pew Foundation for support.
REFERENCES
Clarke,P.A., te Poele,R., Wooster,R. and Workman,P. (2001) Gene expression microarray analysis in cancer biology, pharmacology, and drug development: progress and potential. Biochem. Pharmacol., 62, 1311–1336.
Dowell,R.D., Jokerst,R.M., Day,A., Eddy,S.R. and Stein,L. (2001) The distributed annotation system. BMC Bioinformatics, 2, 7.
Grant,J.D., Dunbrack,R.L., Manion,F.J. and Ochs,M.F. (2002) BeoBLAST: distributed BLAST and PSI-BLAST on a Beowulf cluster. Bioinformatics, 18, 765–766.
Makalowska,I., Ryan,J.F. and Baxevanis,A.D. (2001) GeneMachine: gene prediction and sequence annotation. Bioinformatics, 17, 843–844.
Quackenbush,J., Cho,J., Lee,D., Liang,F., Holt,I., Karamycheva,S., Parvizi,B., Pertea,G., Sultana,R. and White,J. (2001) The TIGR Gene Indices: analysis of gene transcript sequences in highly sampled eukaryotic species. Nucleic Acids Res., 29, 159–164.
Schuler,G.D. (1997) Pieces of the puzzle: expressed sequence tags and the catalog of human genes. J. Mol. Med., 75, 694–698.
Stajich,J.E., Block,D., Boulez,K., Brenner,S.E., Chervitz,S.A., Dagdigian,C., Fuellen,G., Gilbert,J.G., Korf,I., Lapp,H. et al. (2002) The bioperl toolkit: Perl modules for the life sciences. Genome Res., 12, 1611–1618.
Fig. 1. The automated sequence annotation pipeline. The system comprises a set of modules that handle the generation of queries, parsing of results, formatting of final reports, and managerial functions. The user interacts with the system either through a web interface or through a configuration file. The inputs include a gene list (typically accession numbers) and a reference to a predesigned plan for visiting multiple web sites. Plans can be easily created by an administrator as needed. The MANAGER uses the DB to generate the plan and then passes information to the CONFIG generator. This queries the DB to identify the necessary information to generate a query and passes this information to the QUERY module. The QUERY module generates a Perl script and queries the external web resource. The MANAGER also passes the site information to the CONFIG generator that queries the DB to get information on the form of the results, and details for parsing the pages are sent to the RESULTS module. The RESULTS module uses this information to parse the returned html page and sends the extracted data to the MANAGER. The MANAGER can use this information to generate a new query if specified in the plan, or can pass the extracted data and information on the form of the summary to the FINALIZATION module. This module collates and filters information, and returns a summary to the MANAGER, which emails this to the user.
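The flow in Figure 1 (site formats held in the DB, a query built from them, the returned page parsed, and the MANAGER alerted when parsing fails) might look roughly like the sketch below. It is written in Python for illustration and is not the ASAP code, which generates Perl scripts. The table `site_formats`, its columns, the site name `example_index`, the URL and the regular expression are all hypothetical, and an in-memory SQLite database stands in for MySQL so that the example is self-contained.

```python
import re
import sqlite3
from urllib.parse import quote


class ParseError(Exception):
    """Raised when a returned page no longer matches the stored format."""


def make_db():
    # In-memory SQLite stands in for the MySQL DB; schema and row are made up.
    db = sqlite3.connect(":memory:")
    db.execute(
        "CREATE TABLE site_formats ("
        "site TEXT PRIMARY KEY, url_template TEXT, result_regex TEXT)"
    )
    db.execute(
        "INSERT INTO site_formats VALUES (?, ?, ?)",
        ("example_index",
         "http://example.org/search?acc={acc}",
         r"<b>TC:</b>\s*(TC\d+)"),
    )
    return db


def config_for(db, site):
    """CONFIG generator stand-in: read the query and result formats for a site."""
    row = db.execute(
        "SELECT url_template, result_regex FROM site_formats WHERE site = ?",
        (site,),
    ).fetchone()
    if row is None:
        raise KeyError(site)
    return {"url_template": row[0], "result_regex": row[1]}


def build_query(config, accession):
    """QUERY module stand-in: fill the stored template with the accession."""
    return config["url_template"].format(acc=quote(accession))


def parse_result(config, html):
    """RESULTS module stand-in: extract data using the stored pattern."""
    match = re.search(config["result_regex"], html)
    if match is None:
        raise ParseError("page did not match stored result format")
    return match.group(1)


def run_step(db, site, accession, fetch, alerts):
    """MANAGER stand-in: run one query and record an alert if parsing fails."""
    config = config_for(db, site)
    url = build_query(config, accession)
    try:
        return parse_result(config, fetch(url))
    except ParseError as err:
        alerts.append(f"{site}: {err} ({url})")   # would be sent to the admin
        return None
```

Here `fetch` is any function that takes a URL and returns the page text, so the sketch can be tried with a canned page instead of a live site. Under this design, supporting a new site or adapting to a changed page layout means editing a row in `site_formats` rather than editing code, which is the extensibility property the text describes.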