SIRW: a web server for the Simple Indexing and ... - BioMedSearch

8 downloads 0 Views 513KB Size Report
Gasteiger,E., Martin,M.J., Michoud,K., O'Donovan,C., Phan,I. et al. (2003) The SWISS-PROT protein knowledgebase and its supplement. TrEMBL in 2003.
#

2003 Oxford University Press

Nucleic Acids Research, 2003, Vol. 31, No. 13 3771–3774 DOI: 10.1093/nar/gkg546

SIRW: a web server for the Simple Indexing and Retrieval System that combines sequence motif searches with keyword searches Chenna Ramu* European Molecular Biology Laboratory, Postfach 10.2209, 69012 Heidelberg, Germany Received February 14, 2003; Revised and Accepted March 26, 2003

ABSTRACT SIRW (http://sirw.embl.de/) is a World Wide Web interface to the Simple Indexing and Retrieval System (SIR) that is capable of parsing and indexing various flat file databases. In addition it provides a framework for doing sequence analysis (e.g. motif pattern searches) for selected biological sequences through keyword search. SIRW is an ideal tool for the bioinformatics community for searching as well as analyzing biological sequences of interest.

INTRODUCTION More and more specialized biological databases are being created newly or secondarily derived from existing large databases with additional information added manually or automatically. This is partly because of the hundreds of databases containing thousands of entries serving only a general need. Although most of the sequence databases are widely available, the usage is often general and limited to keyword searching and entry retrieval, through systems like Sequence Retrieval System (1) and Entrez (2). Studying the subset of the database that represents a particular taxonomy, organs or tissue is time consuming and involves creation of such datasets before applying analysis tools on the dataset (e.g. constructing evolutionary history, inferring protein function). Most often sequence retrieval tools and sequence analysis tools are separated. Although sequence analysis systems exist such as GCG (3) that provide integration of sequence analysis with keyword search, in practice the keyword search is often restricted to one or few data fields. Since databases such as SWISS-PROT (4) are well annotated, retrieving entries with specific keywords through various fields is enough to pull out all the interested protein entries: for example ‘zinc finger’ containing proteins for a specific species or taxonomic range. In addition it would be interesting to analyze for additional sequence signals for those selected sets of protein sequences. SIR (5) is now capable of indexing any of the database fields of interest, including the biological sequence itself, a feature lacking in most retrieval tools, indexing the biological

sequence such as protein sequences found to be very useful. For example, searching for N- or C-terminal sequence becomes trivial. In addition more complex regular expression queries in search of protein motifs become possible. Apart from providing such flexible keyword retrieval from various database entry fields, SIRW extends its capability by a flexible framework to analyze them. We show this here with GCGs (3) FINDPATTERN equivalent pattern matching module. THE SIR SYSTEM SIR is entirely written in Python (http://www.python.org/), the object-oriented scripting language. The SIR system consists of a database definition file where you define the various files that make up the database, the various fields of the database entry, the field parsing instruction and the token parsing instruction for each field of interest for indexing. The index tokens can be of string type or number type. The number type is useful for searching for the range, for example retrieving proteins of certain residue length or molecular weight. Database entry and fields Flat file databases are provided with different structures. A flexible parsing system for flat files has been developed (6) that can deal with different formats along with the modules to manipulate sequence objects. The database and entry iterator The generic database module can be used to sequentially iterate over the database and database entries, providing entries for the field and token parsers to build indices, as well as for indexed retrieval of one or more entries. The design of the dual-purpose database module reduces the complexity of handling the database during the indexing phase and the retrieval phase by abstracting the parsing and indexing instructions. We also found sequential iteration is useful in building custom flat file databases for further use, for example creating a local database to use with BLAST (7) with desired field tokens such as species, organism or description.

*Tel: þ49 6221387530; Fax: þ49 6221387517; Email: [email protected]

Nucleic Acids Research, Vol. 31, No. 13 # Oxford University Press 2003; all rights reserved

3772

Nucleic Acids Research, 2003, Vol. 31, No. 13

Entry, field and token parsing Entry delimiters are used for the entry parsing, for example SWISS-PROT (4) and EMBL (8) entries are always delimited with ‘//’. An entry consists of different data fields (e.g. ‘Definition’, ‘Keywords’, ‘Organism’). The field parsing is done with a regular expression that describes a particular field. For example the definition field of a SWISS-PROT entry is ‘des’: ðr0 ðð^ DE ½^ /n þ /nÞþÞ0 ; r0 ð?Phskiptokeni^ DE Þjð?Phno printi½^ :ÞðgfþÞ0 Þ: The parsing instruction consists of two regular expressions enclosed in a tuple, the first instruction extracts the description field, whereas the second instruction is used to extract the tokens from the description field. Commands can be used within the regular expression description for the tokens that should be skipped (command skiptoken) and the tokens to be printed while parsing (command print), which is useful for debugging, for example, while designing a token parser. The parsing instructions are handled by a generic scanner module which takes ‘action’ based on the commands. Indexing the database fields The purpose of indexing is to make the retrieval faster. The indexing system we use here makes use of the bplustree module (http://starship.python.net/crew/watters/bplustree). Indexing the primary key is already explained in SIR (5). Indexing various other fields of interest is now possible with SIRW. The parsing instructions are taken from the database definition file (Fig. 1). The primary index key should be unique. For some databases this is different from the identifier name, for example for the RefSeq (9) database, the accession number is unique but the LOCUS is not. Within the token parser the command ‘ikey’ is used by the indexing system to identify the primary token. Separate index files are created for each data field. Query and retrieval

Figure 1. Shows the parsing instructions and database definition for PDBFINDER database. For a different database of similar structure, this module can be imported instead of rewriting a new one.

with other information (Fig. 2). The starting page shows all the databases that SIRW supports with checkboxes to be selected for querying. All the pages are dynamically generated. The HTML text for each page is kept as first class objects. This helps to separate html texts from the source code. SIRW is currently serving the SWISS-PROT, SWISSNEW, SPTrEMBL, PROSITE, ENZYME, PDBFINDER and RefSeq databases.

The syntax of a typical query would be findð‘swissprot’; ‘tax’; ‘eukaryota’Þ &

Query page

findð‘swissprot’; ‘ft key’; ‘zn fing’Þ

The query page displays the database fields defined for the selected database, along with the input box for filling keywords to search, and a checkbox for selecting the fields of interest to display with the query result. The user can query any combination of fields at one time. The query system refines the query based upon the different fields filled in combined with the selected Boolean operator. The fields that are not indexed are shown without an input box but with a checkbox to be checked to display the field along with the query result. For example the sequence field need not be indexed, yet still displayed for the retrieval. The sequence can be displayed with different formats [GCG (3), SWISS-PROT (4), EMBL (8), GenBank (11), PIR (12), FASTA (13)]. The database fields are highlighted with hypertext links. Clicking the fields takes the user to the index browsing page. The query page is built dynamically. Some of the database fields, like FT field, have a limited number of unique keywords. Whenever the numbers of keywords are limited to a certain threshold then the SIRW

Each query results in a set of entries. A set of entries can be further refined through the Boolean operators ‘and’, ‘not’, ‘or’ (&, ^, j). Here there are two queries, one searches the SWISSPROT database taxonomic field for eukaryota and the second query searches for zinc finger in the feature table entry and the result is combined with & operator that filters the result. THE WEB INTERFACE SIRW The web interface is designed with flexibility and simplicity for both the user and the administrator. We used cgimodel (10), a Common Gateway Interface (CGI) framework, for designing web interfaces. The top page lists all the available databases under SIRW with hypertext link to each database. The hypertext links will take the user to the database information page where a table is generated with all the fields listed along

Nucleic Acids Research, 2003, Vol. 31, No. 13

Figure 2. Example of SWISS-PROT database information page. This page shows the various fields, number of unique keywords for the indexed fields and other information.

system can be told to build a selection box or checkbox for that particular field, thus increasing the chances of finding a keyword in the subsequent search. The index browsing will be useful to decide which fields should be filled for the selection.

3773

Figure 3. Example of query result. The keyword interleukin is searched in the description field and the result set is searched for the occurrence of the motif WSxWS that is believed to contribute to ligand-recognition and to protein architecture of the G-CF receptor.

pattern in the selected entries. We demonstrate here searching for motifs in protein sequences that is done after querying the database of interest.

Query result page

Searching for motifs

The query result page shows the number of hits found in the header and the query syntax that was constructed based on the keyword selection. Subsequently the primary identifiers are shown with hypertext links. Clicking the hypertext links results in retrieving the full entry. Apart from the primary identifiers, the fields that are selected in the query page are also shown. If the queried field is also selected for display then the keywords are highlighted wherever they occur in that field (Fig. 3).

Conserved motifs in amino acid sequences may reveal regions that play important roles in the function of a protein. There are specialized databases such as PROSITE (14) that record biologically significant sites, patterns and profiles, and our recently developed ELM database (15) which specializes in short functional motifs in eukaryotic proteins. SWISS-PROT (4) entries have extensive feature table annotation such as zinc finger, DNA binding, calcium binding and signal peptide. Typically one may wish to find out additional signals for, say in DNA binding proteins in different organisms. An interesting example is the sequence ‘AHVDCPGHA’ in Haemophilus influenzae, which is part of the elongation factor family of proteins found to occur upstream of the casein kinase II phosphorylation signature motif (16). The pattern-matching module is plugged into the SIRW system and in the database definition file the method is specified. This specification enables the SIRW system to create an additional field called ‘pattern’ in the query page. Thus a pattern field becomes a virtual field, a data-field that is not there with the original database entry fields. A subsequence/motif can be entered along with other keywords. The system retrieves the entries and their protein sequence and a pattern search is performed on the selected protein sequences. The result is shown in the query result page. The matching pattern is shown with defined

Index browsing Index browsing is an important feature of SIRW. Very often the user wants to check the number of proteins that have a particular keyword (e.g. transmembrane). When the sequence is also indexed, it becomes an ideal tool for examining signature peptides or to calculate the peptide probabilities by taking the number of peptide occurrences in the whole database as expected frequency. SEQUENCE ANALYSIS WITH SIRW Though data collection is made easy through systems such as SRS (1) and Entrez (http://www.ncbi.nlm.nih.gov/Entrez/), there is a need for further analysis such as searching for a motif

3774

Nucleic Acids Research, 2003, Vol. 31, No. 13

flanking region, and with position numbers where they start and end. A compact view of the patterns can be used to align those regions to create consensus patterns. This would allow creation of an alignment that can be used to create a consensus pattern. The pattern matching is an example, we could potentially plug-in a number of other modules like di-peptide calculation or protein property search like transmembrane segments, antigenic regions. The motif patterns are expressed as POSIX style regular expressions.

LIMITATIONS OF SIRW Since the SIRW package is written in a scripting language, it does not scale up very well for very large databases such as EMBL and GenBank, though the subsections of them (human, vertebrate) can be used with SIRW.

CONCLUSION SIRW addresses and fulfils a need in the bioinformatics community namely doing analysis on the selected set of entries based on keywords. As far as we know there is no other system that exists with this capability. SIRW is very useful for motif hunters and database curators. SIRW is ideal for the databases as big as SWISS-PROT (4) or TrEMBL (17). The indexing of protein sequences with fixed, overlapping subsequence is possible and it is currently done for the RefSeq (9) protein database. This would enable the study and analysis of proteome databases for the distribution of signature peptides. For this case index browsing would reveal the abundance of a particular peptide.

ACKNOWLEDGEMENTS I would like to thank Toby Gibson for his support and Francesca Diella for her comments and careful reading of the manuscript. I also thank the unknown referees for their critical comments.

REFERENCES 1. Etzold,T., Ulyanov,A. and Argos,P. (1996) SRS: information retrieval system for molecular biology data banks. Methods Enzymol., 266, 114–128. 2. Wheeler,D.L., Church,D.M., Federhen,S., Lash,A.E., Madden,T.L., Pontius,J.U. and Schuler,G.D. (2003) Database resources of the National Center for Biotechnology. Nucleic Acids Res., 31, 28–33. 3. Womble,D.D. (2000) GCG: The Wisconsin Package of sequence analysis programs. Methods Mol. Biol., 132, 3–22. 4. Bairoch,A. and Apweiler,R. (1997). The SWISS-PROT protein sequence database: its relevance to human molecular medical research. J. Mol. Med., 75, 312–316. 5. Ramu,C. (2001) SIR: a simple indexing and retrieval system for biological flat file databases. Bioinformatics, 17, 756–758. 6. Ramu,C., Gemuend,C. and Gibson,T.J. (2001) Object-oriented parsing of biological databases with Python. Bioinformatics, 16, 628–638. 7. Altschul,S.F., Gish,W., Miller,W., Myers,E.W. and Lipman,D.J. (1990) Basic local alignment search tool. J. Mol. Biol., 215, 403–410. 8. Stoesser,G., Baker,W., Van Den Broek,A., Garcia-Pastor,M., Kanz,C., Kulikova,T., Leinonen,R., Lin,Q., Lombard,V., Lopez,R. et al. (2003) The EMBL Nucleotide Sequence Database: major new developments. Nucleic Acids Res., 31, 17–22. 9. Pruitt,K.D., Maglott,D.R. (2001) RefSeq and LocusLink: NCBI genecentered resources. Nucleic Acids Res., 29, 137–140. 10. Ramu,C. and Gemuend,C. (2000) cgimodel: CGI programming made easy with python. Linux J., 142–149. 11. Benson,D.A., Karsch-Mizrachi,I., Lipman,D.J., Ostell,J. and Wheeler,D.L. (2003) GenBank. Nucleic Acids Res., 31, 23–27. 12. Wu,C.H., Huang,H., Arminski,L., Castro-Alvear,J., Chen,Y., Hu,Z.Z., Ledley,R.S., Lewis,K.C., Mewes,H.W., Orcutt,B.C. et al. (2002) The Protein Information Resource: an integrated public resource of functional annotation of proteins. Nucleic Acids Res., 30, 35–37. 13. Pearson,W.R. (1990) Rapid and sensitive sequence comparison with FASTP and FASTA Methods Enzymol., 183, 63–98. 14. Falquet,L., Pagni,M., Bucher,P., Hulo,N., Sigrist,C.J., Hofmann,K. and Bairoch,A. (2002) The PROSITE database, its status in 2002. Nucleic Acids Res., 30, 235–238. 15. Puntervoll,P., Linding,R., Gemu¨ nd,C., Chabanis-Davidson,S., Mattingsdal,M., Cameron,S., Martin,D.M.A., Ausiello, G., Brannetti,B., Costantini,A. et al. (2003) ELM server: a new resource for investigating short functional sites in modular eukaryotic proteins. Nucleic Acid Res., 31, 3625–3630. 16. Karaoglu,H. and Humphery-Smith,I. (2000) Signature peptides. From analytical chemistry to functional genomics. Methods Mol. Biol., 146, 63–94. 17. Boeckmann,B., Bairoch,A., Apweiler,R., Blatter,M.C., Estreicher,A., Gasteiger,E., Martin,M.J., Michoud,K., O’Donovan,C., Phan,I. et al. (2003) The SWISS-PROT protein knowledgebase and its supplement TrEMBL in 2003. Nucleic Acids Res., 1, 365–370.