Int. J. Bioinformatics Research and Applications, Vol. 11, No. 3, 2015
BioInt: an integrative biological object-oriented application framework and interpreter
Sanket Desai and Prasad Burra*
School of Biotechnology, IGNOU – International Institute of Information Technology, Center of Excellence for Advanced Research and Education, P-14, Rajiv Gandhi Infotech Park, Phase-1, Hinjewadi, Pune, India
Email: [email protected]
Email: [email protected]
*Corresponding author
Abstract: BioInt, a biological programming application framework and interpreter, is an attempt to equip researchers with seamless integration, efficient extraction and effortless analysis of data from various biological databases and algorithms. Based on the type of biological data, algorithms and related functionalities, a biology-specific framework with nine modules was developed. The modules are a compilation of numerous reusable BioADTs. This software ecosystem, containing more than 450 biological objects underneath the interpreter, makes it flexible, integrative and comprehensive. Similar to Python, BioInt eliminates the compilation and linking steps, cutting the time significantly. The researcher can write scripts using the available BioADTs (following C++ syntax) and execute them interactively or use them as a command line application. It has features that enable automation, extension of the framework with new/external BioADTs/libraries and deployment of complex workflows.
Keywords: biological programming; application framework; C++; interpreter; object-oriented design; abstract data type; workflow system; biological scripting; software engineering; software ecosystem; data mining; curation; annotation; biological algorithms.
Reference to this paper should be made as follows: Desai, S. and Burra, P. (2015) ‘BioInt: an integrative biological object-oriented application framework and interpreter’, Int. J. Bioinformatics Research and Applications, Vol. 11, No. 3, pp.247–256.
Biographical notes: Sanket Desai holds a Master’s degree in Bioinformatics (SMU-SCMIT, Pune) and a Bachelor’s degree in Zoology (Goa University, PES College, Goa). His research interests include genomics, structural bioinformatics, algorithm development and systems biology (network analysis). He has over two years of experience in core biological programming, participating in all activities of the software application development life cycle. He also has one year of industry experience as a clinical research data analyst. His primary focus is computational biology application development.
Prasad Burra has a doctoral degree in Structural Bioinformatics, Protein Crystallography from the Indian Institute of Science, Bangalore, India. He is an Agriculture Engineer by training and holds a Master’s degree in Biotechnology from the University of Hyderabad. He has approximately 15 years of research experience, 4 years of industry experience and 7 years of teaching experience. He is currently coordinating Masters programmes and short-term programmes in addition to research. He has publications in international journals including PNAS, PLOS-One, JMB, JIB, ACTA-D and Proteins, among others. His current focus is applied bio-medical research and higher education. His research interests include DNA & linguistic grammar, genomics technologies, biomolecules & tensegrity, bio-programming and nano-biotechnology.
Copyright © 2015 Inderscience Enterprises Ltd.
1 Introduction
Biological data are naturally sub-classified into genomics (genome sequence data), transcriptomics (mRNA, EST and expression data), proteomics (protein sequence data), structural genomics (3D structure data), interactomics (biological molecule interaction data), metabolomics (metabolite, substrate and small molecule data), fluxomics (rates of metabolic reactions in the cell) and phenomics (organism phenotype data). New biological software solutions should be designed to work seamlessly with these various types of biological data. Biological software solutions have been co-evolving with discoveries of biological complexity. The state of biological research is transitioning from static data-based modelling towards knowledge-based modelling of bio-dynamics. Biological complexity involves various interlinked processes and products of evolution. Further layers of complexity can be attributed to emerging parameters as well as intermediate variations in the rates of change of the known parameters. Next Generation Sequencing (NGS) technologies will add sequence data at an unprecedented rate, challenging the existing computer infrastructure as well as software applications (Richter and Sexton, 2009). The current size of biological data is approximated to be in petabytes (10^6 GB). With NGS in the offing, the total size is expected to surpass zettabytes of data (10^12 GB) in the next five years. The solution that would address this and other requirements (noted in the following paragraphs) seems to be to design and develop an efficient biology-specific data and knowledge sourcing platform that is comprehensive, flexible and inclusive.
Based on the number of tasks and the way the tasks are served, software applications are called by different names. (a) Applications that handle a single task are generally referred to as tools, for example CLUSTALX (Larkin et al., 2007).
(b) Applications that provide a common interface and enable the user to perform multiple tasks are referred to as integrative analytic platforms or embedded systems, for example EMBOSS (Rice et al., 2000). (c) Another category is referred to as scripting languages or Application Programming Interfaces (APIs). BioJava (Prlic et al., 2012), BioPerl (Stajich et al., 2002) and BioPython (Cock et al., 2009) are a few of the acclaimed open source projects. These applications, in addition to providing many predefined functions and embedded third-party tools, encourage the researcher to customise/extend the API capabilities to suit his/her specific requirements.
In the domain of computer science, Object-Oriented Application Frameworks (OOAFs) and domain-specific programming environments are considered the cornerstones of modern software engineering (Fayad et al., 1999). An OOAF is a domain-specific, ‘semi-complete’ application that can be specialised to produce custom applications. The primary benefits of an OOAF stem from the modularity, reusability, extensibility and inversion of control it provides to developers. ROOT (http://root.cern.ch), DENZO (http://www.hkl-xray.com/), MATLAB (http://www.mathworks.in) and R (http://www.r-project.org/) are a few of the highly successful programming environments in their respective domains.
BioInt is an attempt to design and develop a comprehensive biology-specific OOAF and a biological programming interpreter. Using BioInt, the bio-programmer will be able to mine, manage and analyse data through bio-scripts that closely follow a controlled biological vocabulary (available as BioADTs and methods), in addition to various other features described in the next sections.
2 Materials and methods
The current BioInt architecture is two-tiered: the core, referred to as the BioBhasha framework (a biology-specific OOAF), and the BioBhasha Interpreter (BioInt).
2.1 BioBhasha framework
Biology is inherently object-oriented. The concepts of OOP, especially modularity, reusability and extensibility, are followed at almost all levels of living systems, i.e. from molecules such as amino acids, nucleotides, DNA, RNA, proteins and their interactions, through cells, tissues and organs, to vast ecosystems. Each of these biological entities holds certain data/information and a unique associated behaviour. This natural pairing of data with associated behaviour is considered a necessary condition for defining an object in OOP. This principle, in essence, enabled us to define various biological entities as biological objects, referred to as Biological Abstract Data Types (BioADTs), and differentiates our design from other applications in the segment. The BioADTs are designed to be used as independent objects or to be ‘is-like’/‘contained in’ other pertinent ADTs. This resulted in a large collection of BioADTs implemented as classes. The collection of all the classes forms a comprehensive biological OOAF (http://www.biobhasha.org/docs/biointdocs/html/classes.html), referred to as BioBhasha henceforth. BioBhasha is developed in C/C++ with ~150,000 lines of code. Functionality- and utility-based abstraction led us to group all BioADTs into nine major modules, as described in Table 1.
Table 1 Core modules of BioBhasha
Module | Description
Inputs | Input Module contains various BioADTs used for data sourcing from various biological data formats such as Fasta, Genbank, PDB, Swissprot, Pubmed, EMBL, DDBJ, SCOP, EST, ClustalW, MSF, PSIMI-TAB, SIF and others. Example BioADTs: BioFasta, BioGenBank, BioEmbl, BioPdb, BioHkl, BioEst, BioScop, BioBlast, BioClustal, BioSif and so on.
Sequence | Sequence Module contains various sequence/string-based BioADTs which use DNA/protein sequences. OOP principles such as abstraction, inheritance and overloading have been used extensively to create relevant BioADTs. Example BioADTs: BioSequence, BioDnaSequence, BioProteinSequence and so on.
Structure | Structure Module contains various structure-specific BioADTs which use DNA/RNA/protein 3D structure information. Example BioADTs: BioPoint, BioAtom, BioResidue, BioChain, BioProtein, BioWater and so on.
Algorithms | Algorithms Module contains various algorithm implementations commonly used in biological analysis. The effort has been to make ‘generic object oriented biological algorithms’ (GOOBA) which are reusable and give more flexibility in writing customised workflows. The constructors are overloaded to accept various formats. The current version includes the following algorithms: dotplot, global/local sequence alignment, multiple sequence alignment, UPGMA (phylogenetic tree), protein structure alignment, secondary structure prediction, disorder prediction, secondary structure alignment, gene/ORF prediction, restriction map analysis, overlap and repeat identification. Example BioADTs: BioProteinMultipleSequenceAlignment, BioProteinDisorderPrediction, BioRestrictionMap, BioProteinStructureAlignment, BioUpgma and so on.
Library | Library Module is a unique set of BioADTs aimed at giving the researcher the ability to write novel algorithms from first principles. Example BioADTs: BioAminoAcidLibrary, BioNucleicAcidLibrary, BioSpaceGroupLibrary, BioRestrictionEnzymeLibrary, BioElementLibrary, BioAtomLibrary, BioCodonLibrary and so on.
System | System Module contains BioADTs that enable the researcher to do system-level programming, especially automation tasks such as searching and scanning local directories and files, connecting over the available network, downloading the relevant files and other such tasks. The most commonly used BioADT is BioDatabase, which has methods such as getFiles(), getSubDirectoriesWithFullPath(), removeDirectory() and so on.
Utilities | Utility Module contains many general utility classes and functions. Example BioADTs: BioMatrix, BioStatistics and so on.
Output | Output Module contains BioADTs which are used for presentation, such as formatting of data and analysis results. Three output stream formats are currently supported: text, postscript and HTML. Example BioADTs: BioOutputStream, BioHtml and BioPostScript.
This design enables systematic simulation of complex biological systems and their behaviour by reusing available BioADTs and/or defining new BioADTs. An example UML diagram (Figure 1) shows the inheritance pattern in the Sequence module for different sequence data types.
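To make the idea of a reusable, object-oriented algorithm with overloaded constructors (as described for the Algorithms module) more concrete, the following is a minimal sketch of a global alignment scorer (Needleman-Wunsch with a linear gap penalty). It is not the BioBhasha implementation: the names SimpleSequence and GlobalAlignmentScore, the constructor signatures and the default scoring parameters are all assumptions made for illustration, and only the C++ standard library is used.

```cpp
#include <algorithm>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical minimal sequence record (not a BioBhasha class).
struct SimpleSequence {
    std::string id;
    std::string residues;
};

// Hypothetical algorithm object: global alignment score
// (Needleman-Wunsch with a linear gap penalty).
class GlobalAlignmentScore {
public:
    // Overloaded constructors: raw strings or sequence records.
    GlobalAlignmentScore(const std::string& a, const std::string& b,
                         int match = 1, int mismatch = -1, int gap = -2)
        : a_(a), b_(b), match_(match), mismatch_(mismatch), gap_(gap) {}

    GlobalAlignmentScore(const SimpleSequence& a, const SimpleSequence& b,
                         int match = 1, int mismatch = -1, int gap = -2)
        : GlobalAlignmentScore(a.residues, b.residues, match, mismatch, gap) {}

    // Fill the dynamic-programming table row by row and return the
    // optimal score for aligning the two full sequences.
    int score() const {
        const std::size_t n = a_.size();
        const std::size_t m = b_.size();
        std::vector<int> prev(m + 1), curr(m + 1);
        for (std::size_t j = 0; j <= m; ++j) prev[j] = static_cast<int>(j) * gap_;
        for (std::size_t i = 1; i <= n; ++i) {
            curr[0] = static_cast<int>(i) * gap_;
            for (std::size_t j = 1; j <= m; ++j) {
                const int diag = prev[j - 1] + (a_[i - 1] == b_[j - 1] ? match_ : mismatch_);
                const int up = prev[j] + gap_;
                const int left = curr[j - 1] + gap_;
                curr[j] = std::max({diag, up, left});
            }
            std::swap(prev, curr);
        }
        return prev[m];
    }

private:
    std::string a_, b_;
    int match_, mismatch_, gap_;
};
```

Because both constructors end up in the same object, a workflow can pass either plain strings or parsed records and call score() in the same way.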
2.2 BioInt
An interpreter is an application that executes the programming instructions passed to it. It eliminates the time-consuming and often cumbersome steps of compiling and linking. CINT, a generic C/C++ interpreter, provides a feature to embed domain-specific frameworks and create an interpreter specific to the domain (http://root.cern.ch/drupal/content/cint). Using this feature, the BioBhasha framework was embedded into CINT, resulting in a biology-specific interpreter, BioInt.
Figure 1 Inheritance UML diagram of the BioSequence BioADT. BioSequence is the parent sequence-specific BioADT that has all the getter, setter, show and find methods common to both DNA and protein sequence data, such as getMutatedSequence(), getNumberOfOccurrences(), findPattern() and others. BioDnaSequence and BioProteinSequence BioADTs are derived from BioSequence. Since the Fasta format can hold both types of data, BioFasta is derived from BioSequence, unlike BioGenBank and other formats. The BioGenBank, BioEmbl and BioDdbj formats are derived from BioDnaSequence, and the BioSwissprot ADT is derived from BioProteinSequence.
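The inheritance pattern described for Figure 1 can be sketched in a few lines of C++. This is an illustrative stand-in, not the BioBhasha source: the class names (Sequence, DnaSequence, ProteinSequence, FastaRecord, GenBankRecord) are hypothetical, and although the method names findPattern(), getNumberOfOccurrences() and getComplementarySequence() appear in the paper, their signatures and exact behaviour here are assumptions.

```cpp
#include <cstddef>
#include <string>
#include <utility>

// Hypothetical parent ADT: data plus behaviour common to DNA and protein.
class Sequence {
public:
    explicit Sequence(std::string residues) : residues_(std::move(residues)) {}
    virtual ~Sequence() = default;

    const std::string& getSequence() const { return residues_; }

    // Index of the first occurrence of pattern, or -1 if it is absent.
    long findPattern(const std::string& pattern) const {
        const std::size_t pos = residues_.find(pattern);
        return pos == std::string::npos ? -1 : static_cast<long>(pos);
    }

    // Number of (possibly overlapping) occurrences of pattern.
    std::size_t getNumberOfOccurrences(const std::string& pattern) const {
        if (pattern.empty()) return 0;
        std::size_t count = 0;
        for (std::size_t pos = residues_.find(pattern); pos != std::string::npos;
             pos = residues_.find(pattern, pos + 1))
            ++count;
        return count;
    }

private:
    std::string residues_;
};

// DNA-specific behaviour lives in the derived class.
class DnaSequence : public Sequence {
public:
    using Sequence::Sequence;

    // Base-wise complement (not reversed); unknown symbols become 'N'.
    std::string getComplementarySequence() const {
        std::string out;
        out.reserve(getSequence().size());
        for (char base : getSequence()) {
            switch (base) {
                case 'A': out += 'T'; break;
                case 'T': out += 'A'; break;
                case 'G': out += 'C'; break;
                case 'C': out += 'G'; break;
                default:  out += 'N'; break;
            }
        }
        return out;
    }
};

class ProteinSequence : public Sequence {
public:
    using Sequence::Sequence;
};

// A FASTA record may hold DNA or protein, so it derives from the parent.
class FastaRecord : public Sequence {
public:
    FastaRecord(std::string header, std::string residues)
        : Sequence(std::move(residues)), header_(std::move(header)) {}
    const std::string& getHeader() const { return header_; }

private:
    std::string header_;
};

// A GenBank record holds nucleotide data, so it derives from DnaSequence.
class GenBankRecord : public DnaSequence {
public:
    GenBankRecord(std::string accession, std::string residues)
        : DnaSequence(std::move(residues)), accession_(std::move(accession)) {}
    const std::string& getAccession() const { return accession_; }

private:
    std::string accession_;
};
```

With this layout a GenBankRecord inherits both the common search methods and the DNA-only complement method, while a FastaRecord exposes only the common ones.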
3 Results and discussion
3.1 Biological abstraction
Since C++ supports procedural and OO paradigms, multiple inheritance, strict data typing and generic programming, we could create a data model that closely follows the naturally observed biological entity relationships. Comparisons between BioJava, BioPython, BioPerl and others were discussed in earlier publications (Mangalam, 2002; Fourment and Gillings, 2008). We highlight a few features (described in Table 2) to emphasise (a) deeper layers of abstraction, (b) biologically more relevant abstraction and (c) alternative methods of abstraction when compared to existing solutions in this segment.
Table 2 Design features of BioBhasha
Design feature | Remarks
Abstraction of a protein structure domain | BioPdb is the derived class which multiply inherits all the classes which are independent and represent records/fields in a PDB file (HEADER, COMPND, CRYST1, SOURCE, REMARK and others). A biologically meaningful derived class, BioSecondaryStructure, is implemented which is composed of BioPdbHelix, BioPdbTurn and BioPdbSheet. BioSecondaryStructure is one of the parent classes of the BioPdb class, giving access to secondary structure elements via methods such as getHelixCoordinates(..), getStrandSequence(…), getTurn(..) etc. The coordinate information is parsed and populates BioProtein, which is composed of BioProteinChains and BioWater.
Independent usage of BioADTs | BioSecondaryStructure or BioPdbSeqres can be used independently if the user intends to work only on the secondary structure or the sequence information of the PDB file, respectively.
Accessing various biological data formats | All biological formats are provided as independent classes such as BioGenbank, BioSwissProt, BioEmbl, BioSoft, BioPsimiTab and BioSif. Separate BioADTs are provided for multiple-entry formats as well, such as BioMultipleFasta, BioMultipleGenBank, BioMultipleEmbl, BioClustal, BioMsf etc. Interaction/network data formats (PSI-MI tab, SIF) are abstracted using the BioADTs BioPPINetowork and BioGraph.
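The first two design features, a file-level class that multiply inherits independent record classes, each of which remains usable on its own, can be illustrated with a small sketch. The classes below (HeaderRecord, SecondaryStructureRecord, PdbEntry) and their methods are hypothetical simplifications chosen for this example and do not reproduce the BioPdb class hierarchy.

```cpp
#include <cstddef>
#include <string>
#include <utility>
#include <vector>

// Hypothetical record class, usable independently of any file-level class.
class HeaderRecord {
public:
    void setClassification(const std::string& c) { classification_ = c; }
    const std::string& getClassification() const { return classification_; }

private:
    std::string classification_;
};

// One helix: chain identifier and inclusive residue range.
struct Helix {
    char chain;
    int firstResidue;
    int lastResidue;
};

// Hypothetical secondary-structure record, also usable on its own.
class SecondaryStructureRecord {
public:
    void addHelix(const Helix& h) { helices_.push_back(h); }
    std::size_t getHelixCount() const { return helices_.size(); }

    // Total number of residues covered by the stored helices.
    int getHelixResidueCount() const {
        int total = 0;
        for (const Helix& h : helices_) total += h.lastResidue - h.firstResidue + 1;
        return total;
    }

private:
    std::vector<Helix> helices_;
};

// The file-level class multiply inherits the record classes, so a single
// object exposes the methods of every record type.
class PdbEntry : public HeaderRecord, public SecondaryStructureRecord {
public:
    explicit PdbEntry(std::string id) : id_(std::move(id)) {}
    const std::string& getId() const { return id_; }

private:
    std::string id_;
};
```

A user interested only in secondary structure could instantiate SecondaryStructureRecord directly, while PdbEntry offers the combined interface.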
Considering the types of entities and their interactions, basic BioADTs (such as BioAtom) were designed, and complex BioADTs (such as BioResidue, BioWater and others) were derived, simulating the biosynthesis of macromolecules in living organisms. This also aligns the natural thought process (intuition) of a researcher with the instructions to be typed to get the results. For example (pseudo-code):
BioProtein().BioProteinChain().BioResidue().BioAtom().getBfactor()
BioMultipleGenBank().BioGenBank().getComplementarySequence()
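The ‘contained in’ chain from protein to chain to residue to atom in the pseudo-code above can be sketched with plain C++ aggregates. The struct names, fields and the getMeanBfactor() helper are assumptions made for illustration; only the getBfactor() method name is taken from the pseudo-code, and its signature is assumed.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical building block: one atom with coordinates and a B-factor.
struct Atom {
    std::string name;
    double x, y, z;
    double bFactor;
    double getBfactor() const { return bFactor; }
};

// A residue contains atoms.
struct Residue {
    std::string name;
    int number;
    std::vector<Atom> atoms;
};

// A chain contains residues.
struct Chain {
    char id;
    std::vector<Residue> residues;
};

// A protein contains chains; behaviour is attached at the level it concerns.
struct Protein {
    std::vector<Chain> chains;

    // Mean B-factor over every atom in every chain (0.0 if there are none).
    double getMeanBfactor() const {
        double sum = 0.0;
        std::size_t count = 0;
        for (const Chain& chain : chains)
            for (const Residue& residue : chain.residues)
                for (const Atom& atom : residue.atoms) {
                    sum += atom.getBfactor();
                    ++count;
                }
        return count == 0 ? 0.0 : sum / static_cast<double>(count);
    }
};
```

Walking the containers, for example protein.chains[0].residues[0].atoms[1].getBfactor(), mirrors the order in which a researcher would describe the structure.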
3.2 BioBhasha
The current version of BioBhasha has: (a) more than 450 basic biological objects, (b) around 14,000 methods/functions, (c) 14 algorithm implementations, (d) 17 database input formats, (e) three output formats, (f) five biological libraries, (g) several sequence-, structure- and network (interaction)-specific objects and (h) statistics and matrix functions and other miscellaneous classes and functions.
3.3 BioInt
Since C++/Java/Javascript/PHP syntax is the most accepted and adopted syntax and closely resembles biological abstraction (Sequeira et al., 1997), it was a strategic decision to extend the same syntax. With BioInt, it is now possible to use C/C++ for bio-scripting as an alternative to Perl- and Python-based scripting. The researcher need not digress from his/her core competence and invest resources in learning about source code compilation, linking and other details specific to software build system configurations. However, BioInt does expect the user/researcher to write the instructions and ‘run’ the code using the interpreter to see the results instantaneously.
3.4 Mode of operation
BioInt can be used in two modes. The first mode is an interactive session, which is useful as a learning tool for the novice. It is even more useful as a platform for instantaneous validation, testing, development and manipulation, with control at each step. These features are common requirements in complex exploratory research and development problems. A snapshot of BioInt in interactive mode is shown in Figure 2. The BioInt prompt (BioInt>), which appears on launching the application from the terminal, is highlighted. The instructions should be written between braces ({}). Program instructions can be written at the prompt, accessing various preloaded BioADTs and utility functions. A very useful feature is the capability to dynamically load, execute and unload third-party or external source libraries interactively using the ‘L’, ‘X’ and ‘U’ commands (as shown in Figure 3). However, archived libraries (.a), shared libraries (.so) and dynamically linked libraries (.dll) cannot be loaded/unloaded. Another limitation is that classes, structures and functions cannot be defined in an interactive session; they have to be defined externally and loaded for use as described above.
Figure 2 Screenshot of an interactive session of BioInt when launched via console. The use of a simple predefined BioADT is shown
Figure 3 An interactive session of BioInt showcasing dynamic loading (L), execution (X) and unloading (U) of scripts (e.g. community contributed). It is important to note that the filename (excluding extension) and the function name inside the script should be identical for successful execution (X demo.C). A few features such as accessing shell commands and extending the framework with user-defined ADTs and methods (such as class Points) are also shown
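Since classes and functions have to be defined in an external file and loaded, a script of the kind described for Figure 3 might look like the sketch below. The contents of the actual demo.C shown in the figure are not reproduced in the text, so everything here is assumed: the layout of the Points class, its methods and the return value of demo() are hypothetical, and the code is plain standard C++ that has not been run under BioInt. The only convention taken from the paper is that the entry function carries the same name as the file (demo.C, demo()).

```cpp
// demo.C: hypothetical script; the entry function shares the file's base name.
#include <cmath>
#include <cstddef>
#include <vector>

// User-defined ADT extending the framework (assumed shape).
class Points {
public:
    void add(double x, double y, double z) {
        Point p = {x, y, z};
        points_.push_back(p);
    }

    std::size_t size() const { return points_.size(); }

    // Euclidean distance between the i-th and j-th stored points.
    double distance(std::size_t i, std::size_t j) const {
        const Point& a = points_.at(i);
        const Point& b = points_.at(j);
        const double dx = a.x - b.x;
        const double dy = a.y - b.y;
        const double dz = a.z - b.z;
        return std::sqrt(dx * dx + dy * dy + dz * dz);
    }

private:
    struct Point {
        double x, y, z;
    };
    std::vector<Point> points_;
};

// Entry function: same name as the file (demo.C -> demo).
double demo() {
    Points pts;
    pts.add(0.0, 0.0, 0.0);
    pts.add(3.0, 4.0, 0.0);
    return pts.distance(0, 1);
}
```

Following the commands named in the caption, such a file would be loaded with ‘L’, executed with ‘X’ and unloaded with ‘U’.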
The second mode is command line mode, which is used especially for automation and batch processes. Another useful feature, which provides seamless extensibility, is the reuse of old bio-scripts such as community-contributed scripts. Old scripts can be called/included within new scripts, a feature similar to that available in the R programming environment (Figure 4).
Figure 4 BioInt being used as a command line application (highlighted with black background). In command line mode, the bio-scripts should contain one ‘main’ function of C/C++ (seen in the demo2.C file). The figure also shows how one can include multiple files and third-party libraries
The scripts also highlight a few useful features such as (a) calling shell commands and console-based applications from within a BioInt session using ‘!’, (b) extending the framework through new user-defined classes and (c) searching the local file system using the BioDatabase BioADT (refer to the supplementary document, e.g. bio-scripts and respective outputs).
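The kind of local file-system search attributed to BioDatabase can be approximated with the C++17 standard library, as in the sketch below. This is not the BioDatabase implementation: the free functions listFiles() and listSubDirectoriesWithFullPath() are hypothetical names loosely modelled on the getFiles() and getSubDirectoriesWithFullPath() methods mentioned in Table 1, and their parameters and return types are assumptions. A C++17 compiler is required for std::filesystem.

```cpp
#include <algorithm>
#include <filesystem>
#include <string>
#include <vector>

namespace fs = std::filesystem;

// Names of regular files directly inside dir, optionally restricted to one
// extension (for example ".pdb"); the result is sorted alphabetically.
std::vector<std::string> listFiles(const fs::path& dir,
                                   const std::string& extension = "") {
    std::vector<std::string> names;
    for (const fs::directory_entry& entry : fs::directory_iterator(dir)) {
        if (!entry.is_regular_file()) continue;
        if (!extension.empty() && entry.path().extension().string() != extension)
            continue;
        names.push_back(entry.path().filename().string());
    }
    std::sort(names.begin(), names.end());
    return names;
}

// Absolute paths of the immediate sub-directories of dir, sorted.
std::vector<std::string> listSubDirectoriesWithFullPath(const fs::path& dir) {
    std::vector<std::string> paths;
    for (const fs::directory_entry& entry : fs::directory_iterator(dir)) {
        if (entry.is_directory())
            paths.push_back(fs::absolute(entry.path()).string());
    }
    std::sort(paths.begin(), paths.end());
    return paths;
}
```

A batch script could call listFiles() with a given extension on each path returned by listSubDirectoriesWithFullPath() to collect the input files of a directory tree one level at a time.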
4 Conclusions
The current version of BioInt, version 1.02 (see online at http://www.biobhasha.org), attempts to provide a comprehensive platform for biological computing, catering to the common and advanced needs of biologists and bioinformatics professionals. It is designed to accommodate future, unforeseen and dynamic computational queries, which are researcher-specific. BioInt saves significant time by eliminating compilation, linking and loading, and the writing of glue code. The eventual focus is to develop BioInt into a biologist-friendly bio-systems modelling environment that is standards compliant and supports every biological data type and the relevant algorithms.
Acknowledgements
The authors wish to acknowledge Masaharu Goto for the design and development of CINT (C/C++ interpreter) and for making it freely available.
References
Cock, P.J., Antao, T., Chang, J.T., Chapman, B.A., Cox, C.J., Dalke, A., Friedberg, I., Hamelryck, T., Kauff, F., Wilczynski, B. and de Hoon, M.J. (2009) ‘Biopython: freely available Python tools for computational molecular biology and bioinformatics’, Bioinformatics, Vol. 25, No. 11, pp.1422–1423.
Fayad, M., Schmidt, D.C. and Johnson, R.E. (1999) Building Application Frameworks: Object-Oriented Foundations of Framework Design, John Wiley & Sons, Inc., New York, NY, USA.
Fourment, M. and Gillings, M. (2008) ‘A comparison of common programming languages used in bioinformatics’, BMC Bioinformatics, Vol. 9, No. 82.
Larkin, M.A., Blackshields, G., Brown, N.P., Chenna, R., McGettigan, P.A., McWilliam, H., Valentin, F., Wallace, I.M., Wilm, A., Lopez, R., Thompson, J.D., Gibson, T.J. and Higgins, D.G. (2007) ‘ClustalW and ClustalX version 2’, Bioinformatics, Vol. 23, pp.2947–2948.
Mangalam, H. (2002) ‘The Bio-*toolkits – a brief overview’, Briefings in Bioinformatics, Vol. 3, No. 3, pp.296–302.
Prlic, A., Yates, A., Bliven, S.E., Rose, P.W., Jacobsen, J., Troshin, P.V., Chapman, M., Gao, J., Koh, C.H., Foisy, S., Holland, R., Rimsa, G., Heuer, M.L., Brandstätter-Müller, H., Bourne, P.E. and Willis, S. (2012) ‘BioJava: an open-source framework for bioinformatics in 2012’, Bioinformatics, Vol. 28, No. 20, pp.2693–2695.
Rice, P., Longden, I. and Bleasby, A. (2000) ‘EMBOSS: the European molecular biology open software suite’, Trends in Genetics, Vol. 16, No. 6, pp.276–277.
Richter, B. and Sexton, D. (2009) ‘Managing and analyzing next-generation sequence data’, PLoS Computational Biology, Vol. 5, No. 6.
Sequeira, R., Olson, R.L. and McKinion, J.M. (1997) ‘Implementing generic, object-oriented models in biology’, Ecological Modelling, Vol. 94, No. 1, pp.17–31.
Stajich, J., Block, D., Boulez, K., Brenner, S.E., Chervitz, S.A., Dagdigian, C., Fuellen, G., Gilbert, J.G., Korf, I., Lapp, H., Lehväslaiho, H., Matsalla, C., Mungall, C.J., Osborne, B.I., Pocock, M.R., Schattner, P., Senger, M., Stein, L.D., Stupka, E., Wilkinson, M.D. and Birney, E. (2002) ‘The Bioperl toolkit: Perl modules for the life sciences’, Genome Research, Vol. 12, No. 10, pp.1611–1618.