BIOINFORMATICS
Vol. 16 no. 7 2000 Pages 628–638
Object-oriented parsing of biological databases with Python

Chenna Ramu∗, Christine Gemünd and Toby J. Gibson

European Molecular Biology Laboratory, Meyerhofstrasse 1, Postfach 10.2209, Heidelberg, Germany

Received on December 21, 1999; revised on February 23, 2000; accepted on March 8, 2000

∗To whom correspondence should be addressed.
Abstract

Motivation: While database activities in the biological area are increasing rapidly, rather little has been done to parse them in a simple and object-oriented way.

Results: We present here an elegant, simple yet powerful way of parsing biological flat-file databases. We have taken EMBL, SWISS-PROT and GENBANK as examples. EMBL and SWISS-PROT do not differ much in format structure, whereas GENBANK has a very different format structure from EMBL and SWISS-PROT. Extracting the desired fields of an entry (for example a sub-sequence with an associated feature) for later analysis is a constant need in the biological sequence-analysis community: this is illustrated with tools to make new splice-site databases. The interface to the parser is abstract in the sense that access to all the databases is independent of their different formats, since the parsing instructions are hidden.

Availability: The modules are available at http://shag.embl-heidelberg.de:8000/Biopy/

Contact: [email protected]

Supplementary information: http://shag.embl-heidelberg.de:8000/Biopy/
Introduction

The flow of knowledge and information in biology has become enormous. This forces us to create databases to keep a record of biological data for later reference or analysis. For example, release 60 of the EMBL nucleotide sequence databank stores 3 543 553 093 bases within 4 719 266 sequence entries (Stoesser et al., 1999). Since the information can be very extensive even for relatively specialized topics in biology, the need to create more specialized databanks becomes important and is already a major activity today (Burks, 1999). Commercial relational databases are useful for storing, manipulating, querying and retrieving data. Databanks distributed as flat files are, however, more popular for everyday use. Figure 1 shows a sample EMBL flat-file database entry and its different fields.
Flat files are easy to distribute and access, and cheap to maintain. Most analysis software can access them directly, for example the BLAST (Altschul et al., 1997) and FASTA (Pearson, 1990) homology search tools or the GCG sequence analysis package (Devereux et al., 1984). Since no standard format specification was available early on, databank creators devised their own formats; it would have been better to have one, or a small set, of standard formats. Because of this divergence in formats, special analysis/retrieval software packages such as GCG have chosen to convert the data to their own format before using it. This brings yet more redundancy and a need for more storage capacity. Currently the only system that lets authors enjoy their freedom in creating and handling new formats is the Sequence Retrieval System (SRS; Etzold et al., 1996). SRS has a specially designed parser language called Icarus to cope with divergent text-file formats; it can easily deal with any kind of flat-file database format. Although Icarus provides a solution, it does have drawbacks: it does not have procedures; it does not have strong data types; nor is it a general-purpose language. Ideally, one would write a parser in a general-purpose language rather than building a language around a parser. The same language can then be used for general scripting and writing web servers, as well as database parsing. The most widely used scripting language, Perl (Wall and Schwartz, 1991), has powerful features for handling regular expressions, but the regular expressions themselves are not available as objects. This means that while you can create regular expressions, you cannot compile and keep them as a list or dictionary for later use. Several Perl modules have been developed for parsing databases. Swissknife (Hermjakob et al., 1999) deals with the SWISS-PROT (Bairoch and Apweiler, 1997) database format, while SPEM/PrEMBL (Pocock et al., 1998) deals with the EMBL (Stoesser et al., 1999) style database format. Swissknife was developed around the concept of fast 'lazy parsing' and accounts for the slight format differences between the SWISS-PROT and EMBL databases. Both of these modules are limited to the database formats they deal with.
Fig. 1. Example of an EMBL database entry.
Developing systems this way might mean that one has to write a complete module for each database that has a different structure. Thus, there is a need for a generalized database-handling and parsing system that works regardless of the structure of the database, as this would provide uniform access across the different databases. Among the general-purpose scripting languages we evaluated, we chose Python (Lutz, 1996; Watters et al., 1996) to create object-oriented parsers for biological databanks because of its simple and neat syntax, strong data types, truly object-oriented capability, the availability of regular expressions as a module, etc. Python classes support inheritance, so one can easily add functionality to an existing class. Finally, although not as widely used as Perl, Python already has a very strong and growing user base in the scientific community.
The modules

There are two basic classes provided in the database module. The database class manages sequential reading of the database and returns one entry at a time, while the BioParser class parses an entry according to the parsing instructions specified. Apart from these two basic classes, the seqFormat module is used to create sequence objects and provides methods to print them in a variety of formats (FASTA, EMBL, SWISS-PROT, STADEN, etc.). The database module reads one entry at a time from the database and gives it to the parser, where the entry is parsed into fields. Figure 2 shows the overall working flow of the modules. Figures 3 and 4 show simple usage of the modules to parse and reformat databases.

Fig. 2. Flow-chart to show the overall function of the modules.

Fig. 3. Example Python script using the database module with the parser. It converts all the EMBL entries to FASTA format.

Fig. 4. Example Python script to parse all the entries in the GENBANK database.

The database module

The database module is designed to retrieve entries sequentially from the databases. The class 'database' in the database module takes care of retrieving the full entry. This includes opening the next file when the end-of-file is reached in the previous one; for example, the EMBL database is split into files for its different divisions (hum, vrt, etc.). The database module usage is very simple:

d = database('embl')
d.NextEntry()          # returns the entry

Successive calls to d.NextEntry() return the available entries one by one until end-of-file. The input to the database module is a definition file describing the files to be used, their location and the block structure of an entry. The embl.d definition file looks like:

#######################################
#
# init file for human sequences in EMBL
#
dbName     = 'EMBL'
location   = '/data/embl_dna/'
files      = ['hum1','hum2','hum3','hum4','hum5']
fileType   = 'dat'
entryBlock = ['^ID ','^//']

The database class also has two additional methods wrapping fseek and ftell, which are useful for writing an indexing and retrieval system. As an experiment we have written such a system; it is not described here, but will be made available to the user community on request.
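The toolkit's own database class is not reproduced in the paper; purely as an illustration of the idea, a minimal sequential reader driven by the same kind of definition values might look as follows. The class name SimpleDB and all of its internals are assumptions made for this sketch, not the published implementation.

import re, string

class SimpleDB:
    """Minimal sequential reader: one entry per call, spanning several files."""

    def __init__(self, location, files, fileType, entryBlock):
        self.files = []
        for f in files:
            self.files.append(location + f + '.' + fileType)
        self.begin = re.compile(entryBlock[0])   # e.g. '^ID '
        self.end   = re.compile(entryBlock[1])   # e.g. '^//'
        self.fh    = None

    def _NextFile(self):
        if self.files:
            self.fh = open(self.files[0])
            del self.files[0]
            return 1
        return 0

    def NextEntry(self):
        entry = []
        while 1:
            if self.fh is None and not self._NextFile():
                return None                      # no more files left
            line = self.fh.readline()
            if not line:                         # end-of-file: switch to the next file
                self.fh.close()
                self.fh = None
                continue
            if not entry and not self.begin.match(line):
                continue                         # skip anything before the first ID line
            entry.append(line)
            if self.end.match(line):             # the '//' line terminates the entry
                return string.join(entry, '')

# Usage, mirroring the embl.d definition file above (paths are illustrative):
#   d = SimpleDB('/data/embl_dna/', ['hum1','hum2'], 'dat', ['^ID ', '^//'])
#   entry = d.NextEntry()        # first entry, or None when all files are exhausted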
The BioParser class

The BioParser class resides in the same database module. The parser is driven by a Python dictionary in which the keys are the field names and the values are the regular expressions for those fields. In this way we can parse either selected fields alone ('lazy parsing') or all of the fields. We have kept the parsing instructions as simple as possible. The BioParser class has three methods. The '__init__' constructor method is invoked when the instance is created, and the second method, 'Parse', is then called to parse an entry. When '__init__' is called, all the regular expressions specified in parserDict are compiled by default; one can also specify individual fields to be parsed, for lazy parsing. The third method, 'Reparse', can optionally be used to get the desired tokens back from a field (for example all the words in the description field), which can be used to build an index for that field. After parsing, the different fields are available as attributes of the class instance. The names of the attributes are kept the same as the entry's field names. To create an instance of the class BioParser:
e = BioParser(parserDict)

and parsing is done with:

e.Parse(entry)

where entry is a Python string. The whole entry is parsed according to the regular expressions specified in the parserDict dictionary, whose keys are the field names. Note that the regular expressions cope with the repetition of a field or group of fields (for example the reference fields RN, RC, RP, RX, RA, RT, RL, where RC and RX are optional and RA, RT, etc. can be repeated). After the parsing is done, all the fields are attributes of the instance. We can access them as:
e.id    # the ID line
e.de    # the DE description lines

or

e.ft    # the Feature Table lines
e.sq    # the SQ line plus the complete sequence
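The BioParser implementation itself is not listed in the paper; the following minimal sketch illustrates the behaviour just described. The class name MiniParser, the use of the re.M flag and the attribute handling are illustrative assumptions rather than the toolkit's actual code.

import re

class MiniParser:
    """Sketch of a BioParser-like class: compile the patterns once, parse many entries."""

    def __init__(self, parserDict, fields=None):
        # compile only the requested fields ('lazy parsing'), or all of them by default
        if fields is None:
            fields = parserDict.keys()
        self.compiled = {}
        for f in fields:
            self.compiled[f] = re.compile(parserDict[f], re.M)

    def Parse(self, entry):
        # every field becomes an attribute of the instance; absent fields are set to None
        for f, rgx in self.compiled.items():
            m = rgx.search(entry)
            if m:
                setattr(self, f, m.group(0))
            else:
                setattr(self, f, None)
        return self

# e = MiniParser(parserDict)    # parserDict as in Appendix A
# e.Parse(entry)
# print e.de                    # DE lines of the entry, or None if the field is absent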
Any field that has no contents is assigned the value None (which evaluates as false in Python). In this way we can parse and access the databank entries in a simple and object-oriented way. As shown, the fields are simply string objects. This is also useful for genome annotation: new text can be added to a field simply by using the '+' operator (Python concatenates strings with '+', provided the data type is string on both sides of the operator). To make parsing faster, the compilation of the regular expressions is done in the BioParser '__init__' constructor method. This means that the compilation is done only once, at the time of instance creation. Once the major fields are parsed, we can re-parse them to get any desired tokens using the re or string modules. Python's re (regular expression) module has many methods for purposes such as splitting and replacing. In straightforward cases the simpler string module often suffices, e.g. if splitting can be done on a delimiter character rather than with a regular expression. The replacement argument of the regular-expression sub method, as in p.sub(replacement, origstring), can be a function, which gives more powerful control over the tokens found. For example, the cross-references to other databanks in the DR lines can be replaced with hypertext references 'in place' by diverting the replacement to a function. The function can decide, based upon some given criteria, which of the tokens should be replaced. This is desirable, for example, when we want to publish the entry on the World Wide Web (WWW).
HrefLink = \
    {'SWISSPROT': "<A HREF=...[%s-ID:%s]>%s</A>",    # '...' stands for the SRS server address,
     'SPTREMBL' : "<A HREF=...[%s-ID:%s]>%s</A>"}    # which is omitted here

link = r'(^DR )(?P<dbase>[^;]+); (?P<id>[^;]+)'
p    = re.compile(link, re.M)
e.dr = p.sub(Addhref, e.dr)
Here we have shown an example of replacing just the SWISSPROT and SPTREMBL database names in the entry's DR lines with SRS hypertext links, by creating a dictionary (HrefLink) and checking the database name against its keys in the Addhref function. Note that the replacement argument passed to the p.sub method is the Addhref function below, which returns the actual replacement string.
def Addhref(match):
    dbase = match.group('dbase')
    id    = match.group('id')
    try:
        defi = HrefLink[dbase]
    except KeyError:
        tmp = match.group(0)
    else:
        tmp = match.group(1) + dbase + ';' + defi % (dbase, id, id)
    return tmp

Modifying entry fields would be a necessary feature for genome annotation. This could be done through a formalized method, or simply by adding text to the parsed field, as for the comment field:
e.cc = e.cc + "CC   additional comments\n"
Right now we do not provide annotation modules in the toolkit, since annotating entries was not our primary concern, but this could be changed in a future release.
Example applications

Collecting a disulphide-bonded sub-sequence

The e.sq sequence string can be given to the Seq class from the seqFormat module to make a sequence object. The seqFormat module knows several flat-file formats for printing to file.
import string
from database  import *
from seqFormat import *
from swissprot import *           # swissprot parser, parserDict

d = database('swissprot')
p = BioParser(parserDict)         # parser from swissprot module
a = d.NextEntry()                 # get the first entry
e = p.Parse(a)                    # Parse the entry
s = Seq(e.sq)                     # Make the sequence object
ftList = string.split(e.ft, '\n')
flankRegion = 10
for j in ftList:
    tmp = string.split(j[3:])
    if not tmp: continue
    if string.strip(tmp[0]) == 'DISULFIDE':
        ifrom, ito = int(tmp[1]), int(tmp[2])
        ifrom = ifrom - flankRegion
        ito   = ito   + flankRegion
        if ifrom < 0: ifrom = 0
        print s.Subseq[ifrom:ito]

Fig. 5. Example Python script extracting the sub-sequence (with flanking sequence) for each disulphide feature in a SWISS-PROT entry.

import string
from database  import *
from embl      import *           # embl parser, parserDict
from seqFormat import *

d = database('embl')
p = BioParser(parserDict)         # initialize parser
a = d.NextEntry()                 # get the first entry
while a:
    e = p.Parse(a)                # Parse the entry
    s = Seq(e.sq)                 # Make the sequence object
    ftList = string.split(e.ft, '\n')
    flankRegion = 10
    for j in ftList:
        tmp = string.split(j[3:])
        if not tmp: continue
        if string.strip(tmp[0]) == 'intron':
            loc = string.split(tmp[1], '..')    # simple start..end locations assumed
            ifrom, ito = int(loc[0]), int(loc[1])
            ifrom = ifrom - flankRegion
            ito   = ito   + flankRegion
            if ifrom < 0: ifrom = 0
            print s.Subseq[ifrom:ito]
    a = d.NextEntry()

Fig. 6. Example Python script creating an intron-sequence database: every intron (with flanking sequence) is printed from each EMBL entry.
Python's slicing (a way of extracting sections of Python sequence types such as strings, lists and tuples) makes it easy to deal with just a part of the sequence. For example, the SWISS-PROT FT lines can be re-parsed to get the range of a DISULFIDE feature, and with slicing the sub-sequence can be printed as seq[from:to]. The seqFormat module takes care of removing everything other than alphabetic characters; printing is done through the Subseq method. Figure 5 extracts the sub-sequence (together with flanking sequence, if desired) corresponding to each disulphide feature in a SWISS-PROT entry.
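For readers unfamiliar with slicing, the flank handling used in the figures can be written directly on a plain Python string; the sequence and coordinates below are made up for illustration.

seq   = 'MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVQ'   # an already cleaned sequence string
flank = 3
ifrom, ito = 10, 14                            # feature range taken from a parsed FT line

start = ifrom - flank
if start < 0:
    start = 0                                  # clamp at the start of the sequence
print seq[start:ito + flank]                   # sub-sequence plus flanking residues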
Creating an intron database

The origin of intron sequences is not well understood. Theories range from the intron-insertional theory, which views introns as continually inserting into eukaryotic genes, to the introns-early theory, which assumes that introns functioned as gene-assembly mechanisms and are very ancient. Intron sequences are therefore interesting to sequence analysts. The first step in analysing intron sequences is to make an intron-sequence database. Figure 6 simply prints out all the intron sequences, with flanking regions of the desired length, as it parses each entry. Note that this operation uses exactly the same code as the disulphide query, except that 'intron' is now being searched for. In this way we can easily build an object framework to generate sub-databases.
import string
from database  import *
from embl      import *           # embl parser
from seqFormat import *

d = database('embl')
p = BioParser(parserDict)         # initialize parser
a = d.NextEntry()
while a:
    e = p.Parse(a)                # Parse the entry
    s = Seq(e.sq)                 # Make the sequence object
    ftList = string.split(e.ft, '\n')
    flankRegion = 10
    seqLen = len(e.sq)
    for j in ftList:
        tmp = string.split(j[3:])
        if not tmp: continue
        if string.strip(tmp[0]) == 'intron':
            loc = string.split(tmp[1], '..')    # simple start..end locations assumed
            donor, acceptor = int(loc[0]), int(loc[1])
            donFrom = donor    - flankRegion
            donTo   = donor    + flankRegion
            accFrom = acceptor - flankRegion
            accTo   = acceptor + flankRegion
            if donFrom < 0: donFrom = 0
            if accFrom < 0: accFrom = 0
            if donTo > seqLen: donTo = seqLen
            if accTo > seqLen: accTo = seqLen
            print s.Subseq[donFrom:donTo]       # Donor
            print s.Subseq[accFrom:accTo]       # Acceptor
    a = d.NextEntry()

Fig. 7. Example Python script extracting splice-donor and splice-acceptor regions (with flanking sequence) from EMBL entries.
Creating a database of splice sites

The exon–intron (donor) and intron–exon (acceptor) boundaries are of interest for splice-site prediction. While considerable research has gone into studying these sites, splice prediction is still unreliable and much more work is needed. In Figure 7 we show a way to extract splice sites from the database, again with flanking regions of a desired length.

Future developments

Currently the parsing instructions are held in Python dictionaries (see Appendices). However, the order of the fields cannot be maintained with dictionaries; this could be achieved by using lists instead. Lists would have a number of advantages, for example finding errors in entries by checking for correct field order (a sketch of this idea follows the list below). Besides adopting lists, other future developments include:

• Return methods instead of the matched string, as an optional feature.

• Make the methods more intelligent about what to return; for example, in EMBL the join statement in the feature table could be used to return a composite sequence.

• Unparse the fields after parsing to reassemble an entry. This would be useful to reconstruct an entry after modification. (This development will need to use Python lists.)

• Incorporate the bioperl/biopython (http://www.bioperl.org, http://www.biopython.org) efforts to standardize the sequence object.
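As a sketch of the list-based idea (a possible future direction, not part of the current toolkit), the parsing instructions could be held as ordered (field, pattern) pairs; the names parserList and CheckOrder are invented for this example, and only three EMBL fields are shown.

import re

# parsing instructions as an ordered list of (field, pattern) pairs
parserList = [
    ('id', r'((^ID [^\n]+\n)+)'),
    ('ac', r'((^AC [^\n]+\n)+)'),
    ('de', r'((^DE [^\n]+\n)+)'),
]

def CheckOrder(entry, parserList):
    """Return true if the fields present in the entry occur in the expected order."""
    positions = []
    expected  = []
    for field, pattern in parserList:
        m = re.search(pattern, entry, re.M)
        if m:
            positions.append((m.start(), field))
            expected.append(field)
    positions.sort()
    observed = []
    for pos, field in positions:
        observed.append(field)
    return observed == expected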
Conclusion

We have shown how to access biological databases in a uniform manner, regardless of their different flat-file structures. We have also shown how to create sub-databases of particular sequence features (e.g. introns, splice sites) that would be useful to biological sequence analysts. Databank providers often describe the structure of their database, but it would be more useful if parsers were part of the database distribution, to minimize and standardize the effort of the user community.
Appendix A: (EMBL database parser dictionary)

#################################################################
#
# (C) Chenna Ramu
# EMBL, Heidelberg, Germany.
#
# embl.py  EMBL database parser
#
parserDict = {
    'id'  : r'((^ID [^\n]+\n)+)' ,
    'ac'  : r'((^AC [^\n]+\n)+)' ,
    'dt'  : r'((^DT [^\n]+\n)+)' ,
    'de'  : r'((^DE [^\n]+\n)+)' ,
    'gn'  : r'((^GN [^\n]+\n)+)' ,
    'os'  : r'((^OS [^\n]+\n)+)' ,
    'oc'  : r'((^OC [^\n]+\n)+)' ,
    'ref' : r'(('
            r'(^RN [^\n]+\n)'
            r'((^RC [^\n]+\n)?)'
            r'((^RP [^\n]+\n)+)'
            r'((^RX [^\n]+\n)?)'
            r'((^RA [^\n]+\n)+)'
            r'((^RT [^\n]+\n)*)'
            r'((^RL [^\n]+\n)+)'
            r')+)',
    'cc'  : r'((^CC [^\n]+\n)+)' ,
    'dr'  : r'((^DR [^\n]+\n)+)' ,
    'kw'  : r'((^KW [^\n]+\n)+)' ,
    'ft'  : r'((^FT [^\n]+\n)+)' ,
    'sq'  : r'(^SQ [^\n]+\n)' \
            r'((^ [^\n]+\n)+)'
}

if __name__ == '__main__':
    import re
    for k,v in parserDict.items():
        r = re.compile(v)
        print k,v
Appendix B: (GENBANK database parser dictionary)

#################################################################
#
# (C) Chenna Ramu
# EMBL, Heidelberg, Germany.
#
# genbank.py  Genbank Parser
#
parserDict = {
    'locus'      : r'(^LOCUS [^\n]+\n)',
    'definition' : r'(^DEFINITION [^\n]+\n(^ [^\n]+\n)*)',
    'accession'  : r'(^ACCESSION [^\n]+\n(^ [^\n]+\n)*)',
    'nid'        : r'(^NID [^\n]+\n)',
    'version'    : r'(^VERSION [^\n]+\n)',
    'keywords'   : r'(^KEYWORDS [^\n]+\n(^ [^\n]+\n)*)',
    'segment'    : r'(^SEGMENT [^\n]+\n)',
    'source'     : r'(^SOURCE [^\n]+\n(^ [^\n]+\n)*)',
    'reference'  : r'(^REFERENCE [^\n]+\n(^ [^\n]+\n)*)',
    'comment'    : r'(^COMMENT [^\n]+\n(^ [^\n]+\n)*)',
    'features'   : r'(^FEATURES [^\n]+\n(^ [^\n]+\n)*)',
    'basecount'  : r'(^BASE COUNT [^\n]+\n)',
    'origin'     : r'(^ORIGIN [^\/\/]+\/\/)'
}

if __name__ == '__main__':
    import re
    for k,v in parserDict.items():
        r = re.compile(v)
        print k,v
Appendix C: (ENZYME database parser dictionary)

#################################################################
#
# (C) Chenna Ramu
# EMBL, Heidelberg, Germany.
#
# enzyme.py  Enzyme database parser
#
parserDict = {
    'id' : r'^ID [^\n]+\n'    ,
    'de' : r'(^DE [^\n]+\n)+' ,
    'an' : r'(^AN [^\n]+\n)+' ,
    'ca' : r'(^CA [^\n]+\n)+' ,
    'cf' : r'(^CF [^\n]+\n)+' ,
    'cc' : r'(^CC [^\n]+\n)+' ,
    'di' : r'(^DI [^\n]+\n)+' ,
    'pr' : r'(^PR [^\n]+\n)+' ,
    'dr' : r'(^DR [^\n]+\n)+'
}

if __name__ == '__main__':
    import re
    for k,v in parserDict.items():
        r = re.compile(v)
        print k,v
Appendix D: (PDBFINDER database parser dictionary)

#################################################################
#
# (C) Chenna Ramu
# EMBL, Heidelberg, Germany.
#
# pdbfinder.py  Pdbfinder database parser
#
parserDict = {
    'id'           : r'^ID [^\n]+\n' ,
    'header'       : r'^Header [^\n]+\n',
    'Compound'     : r'^Compound [^\n]+\n',
    'Source'       : r'^Source [^\n]+\n',
    'Author'       : r'(^Author [^\n]+\n)+' ,
    'Exp-Method'   : r'((^Exp-Method [^\n]+\n)(^ [^\n]+\n)+)+',
    'Ref-Prog'     : r'^Ref-Prog [^\n]+\n',
    'HSSP-N-Align' : r'^HSSP-N-Align [^\n]+\n',
    'T-Frac-Helix' : r'(^T-Frac-Helix [^\n]+\n)+',
    'T-Frac-Beta'  : r'(^T-Frac-Beta [^\n]+\n)+',
    'T-Nres-Prot'  : r'(^T-Nres-Prot [^\n]+\n)+',
    'T-Water-Mols' : r'(^T-Water-Mols [^\n]+\n)+',
    'T-Nres-Nucl'  : r'(^T-Nres-Nucl [^\n]+\n)+',
    'HET-Groups'   : r'((^HET-Groups [^\n]+\n)(^ [^\n]+\n)+)+',
    'Chains'       : r'((^Chain [^\n]+\n)(^ [^\n]+\n)+)+'
}

if __name__ == '__main__':
    import re
    for k,v in parserDict.items():
        r = re.compile(v)
        print k,v
References

Altschul,S.F., Madden,T.L., Schaffer,A.A., Zhang,J., Zhang,Z., Miller,W. and Lipman,D.J. (1997) Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. Nucleic Acids Res., 25, 3389–3402.

Bairoch,A. and Apweiler,R. (1997) The SWISS-PROT protein sequence database: its relevance to human molecular medical research. J. Mol. Med., 75, 312–316.

Burks,C. (1999) Molecular biology database list. Nucleic Acids Res., 27, 1–9.

Devereux,J., Haeberli,P. and Smithies,O. (1984) A comprehensive set of sequence analysis programs for the VAX. Nucleic Acids Res., 12, 387–395.

Etzold,T., Ulyanov,A. and Argos,P. (1996) SRS: information retrieval system for molecular biology data banks. Meth. Enzymol., 266, 114–128.

Hermjakob,H., Fleischmann,W. and Apweiler,R. (1999) Swissknife - 'lazy parsing' of SWISS-PROT entries. Bioinformatics, 15, 771–772.

Lutz,M. (1996) Programming Python. O'Reilly and Associates, Inc.

Pearson,W.R. (1990) Rapid and sensitive sequence comparison with FASTP and FASTA. Meth. Enzymol., 183, 63–98.

Pocock,M.R., Hubbard,T. and Birney,E. (1998) SPEM: a parser for EMBL style flat file database entries. Bioinformatics, 14, 823–824.

Stoesser,G., Tuli,M.A., Lopez,R. and Sterk,P. (1999) The EMBL nucleotide sequence database. Nucleic Acids Res., 27, 18–24.

Wall,L. and Schwartz,R.L. (1991) Programming Perl. O'Reilly and Associates, Inc.

Watters,A., Rossum,G.v. and Ahlstrom,J.C. (1996) Internet Programming with Python. M&T Books.