BioJava : Tool for BioInformatics
Mamta C. Padole Lecturer, Department of Computer Science & Engineering, Faculty of Technology & Engineering, The M.S. University of Baroda, Vadodara.
Abstract This paper gives a brief overview of BioInformatics and of the challenges that biological data poses to the Information Scientist. One of the tools used to address several of these challenges is BioJava, an Open Source project that uses Java technology to provide a framework for developing applications that solve BioInformatics problems. The paper briefly explains what BioJava is and how the Java programming language helps in developing BioInformatics applications. The integration of the Java Development Kit with BioJava is also discussed. An example that reads data from a FASTA file into a BioJava program is included, which also reflects the different types of tokens available in BioJava. The compilation and execution of a program that uses BioJava are explained, and the ways in which BioJava can be extended are also mentioned.
Keywords : BioInformatics, BioJava, OBF, FASTA
Introduction
In the last few decades there has been great advancement in the field of molecular biology, and the technology required for laboratory research in this field has advanced with it. This growth has allowed rapid progress in sequencing large portions of the genomes of several species. To date, several bacterial genomes, as well as those of some simple eukaryotes, have been sequenced in full. The Human Genome Project, designed to sequence all 24 of the human chromosomes, is also progressing. Popular sequence databases, such as GenBank and EMBL, have been growing at exponential rates. This large amount of information has necessitated the careful storage, organization and indexing of sequence information. Information Science has been applied to biology to produce the field called BioInformatics.

BioInformatics
BioInformatics is an interdisciplinary research area that applies computer and information science to solve biological problems. It uses computers and information science to make sense of the huge amount of data that is accumulating from high-throughput biological and chemical experiments, such as sequencing of whole genomes, DNA microarray chips, two-hybrid experiments, and tandem mass spectrometry. BioInformatics data consists of two major kinds : gene data and protein structures. The operations that need to be performed on biological data are representing the gene data in computational form, sequencing the DNA structures, organising the gene structure to form genomes, analysing the genome data, predicting the protein structures, describing the protein dynamics, and modelling the proteins to unfold their structure.
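As a minimal sketch of what "representing the gene data in computational form" can mean, the following plain Java class holds a DNA sequence as a string and derives two simple properties from it. The class name DnaSketch and its methods are hypothetical and are not part of BioJava; the sketch assumes a sequence made up only of the four bases A, C, G and T.

```java
import java.util.Locale;

/** Hypothetical helper, not a BioJava class: DNA held as a plain String. */
final class DnaSketch {

    /** Returns the reverse complement of a DNA string (A<->T, C<->G). */
    static String reverseComplement(String dna) {
        String s = dna.toUpperCase(Locale.ROOT);
        StringBuilder out = new StringBuilder(s.length());
        // Walk the strand backwards and complement each base.
        for (int i = s.length() - 1; i >= 0; i--) {
            out.append(complement(s.charAt(i)));
        }
        return out.toString();
    }

    private static char complement(char base) {
        switch (base) {
            case 'A': return 'T';
            case 'T': return 'A';
            case 'C': return 'G';
            case 'G': return 'C';
            default:
                throw new IllegalArgumentException("Not a DNA base: " + base);
        }
    }

    /** Fraction of bases that are G or C; 0.0 for an empty sequence. */
    static double gcContent(String dna) {
        if (dna.isEmpty()) {
            return 0.0;
        }
        String s = dna.toUpperCase(Locale.ROOT);
        int gc = 0;
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            if (c == 'G' || c == 'C') {
                gc++;
            }
        }
        return (double) gc / s.length();
    }

    public static void main(String[] args) {
        String gene = "ATGCGT";
        System.out.println(reverseComplement(gene)); // prints ACGCAT
        System.out.println(gcContent(gene));         // prints 0.5
    }
}
```

A toolkit such as BioJava replaces the raw string with typed alphabets and symbols, so that an invalid base is rejected when the sequence is built rather than when it is used.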
Challenges in BioInformatics
The field of Molecular Biology challenges the Information Scientist in several areas of Information Science, such as :
Databases : Molecular Biology research has produced, and continues to produce, abundant data. BioInformatics is challenged to store this genetic data in very large databases. The emphasis is on data design, object-oriented databases, data warehousing, data mining and knowledge management techniques.
Networks : The large amount of data stored in several databases needs to be shared among research institutes across the world. It also needs to be manipulated and archived at different locations. The emphasis is on the internet, intranets, wireless systems and other network technologies, including grid computing, to store, share and archive data across different types of networks.
Search Engines : The exponentially increasing data must yield various kinds of information : gene sequences, references to genetic data, and references to the experimental methods used in determining specific sequences. This needs special search engine technologies, because the data is stored on different types of machines, on a variety of networks, in different data stores and in a large variety of formats.
Data Mining : The ever-increasing store of genetic sequences and protein data becomes challenging when it has to be explored without actually visiting a research lab. Data mining techniques need to be applied to this data, using technologies such as the Perl language to search through data strings, different taxonomies or profiling sequences.
Pattern Matching : Expert systems and AI techniques such as neural networks need to be used to match and compare genome sequences or protein sequences. Pattern recognition and pattern matching are difficult because of the uncertainty in genome sequence patterns. The challenges also extend to graphical data, where images of different genomes need to be recognized or compared.
Data Visualization : Genetic data does not consist only of simple linear sequences. It also includes 3D images of the structures of DNA and RNA. Data visualization uses visual and spatial techniques to explore the information in this graphical data and to render graphical structures from linear data.
Modeling & Simulation : Simulation and modeling techniques are used to represent graphical models, drug-protein interactions and so on, using event-driven, time-driven and hybrid simulation techniques.

Approaches to BioInformatics : There are three different approaches to BioInformatics :
Tool building : Creating new programs and methods for analyzing and organizing data.
Tool using : Using existing programs and data to answer biologically interesting questions.
Tool maintenance : Setting up databases, translating biologists' questions into ones that programs can answer, and keeping the tools working and the databases up to date.
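As a small illustration of the tool-building approach applied to the pattern-matching challenge described earlier, the sketch below searches a sequence for a motif while tolerating a limited number of mismatched bases. It is a deliberately naive scan written in plain Java; the class MotifSearchSketch is hypothetical, and real tools typically use indexed or alignment-based methods instead.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical sketch: approximate motif search by counting substitutions. */
final class MotifSearchSketch {

    /**
     * Returns every start position at which the motif matches the text
     * with at most maxMismatches substituted characters.
     */
    static List<Integer> findApproximate(String text, String motif, int maxMismatches) {
        List<Integer> hits = new ArrayList<>();
        if (motif.isEmpty()) {
            return hits;
        }
        for (int start = 0; start + motif.length() <= text.length(); start++) {
            int mismatches = 0;
            // Stop comparing this window as soon as the mismatch budget is exceeded.
            for (int j = 0; j < motif.length() && mismatches <= maxMismatches; j++) {
                if (text.charAt(start + j) != motif.charAt(j)) {
                    mismatches++;
                }
            }
            if (mismatches <= maxMismatches) {
                hits.add(start);
            }
        }
        return hits;
    }

    public static void main(String[] args) {
        String sequence = "ACGTTACGAACGT";
        System.out.println(findApproximate(sequence, "ACGT", 0)); // prints [0, 9]
        System.out.println(findApproximate(sequence, "ACGT", 1)); // prints [0, 5, 9]
    }
}
```

Allowing one mismatch finds the additional site at position 5 (ACGA), which is the kind of inexact match that makes pattern matching on genome data harder than ordinary string search.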
An Information Scientist can contribute to the advancement of Molecular Biology in several ways, by adopting any of the approaches mentioned above. One tool that is contributing to the development of Molecular Biology is BioJava.

BioJava
Much recent application development is based on the Java programming language. Java is a scalable, object-oriented, platform-independent programming language that is useful for web-based application development. But that is just the beginning : Java has penetrated several avenues of application programming, and BioInformatics is one of them. The Java toolkit built for developing applications in this field is known as BioJava.

History : Genomic datasets deal with gigabytes of data. The Perl programming language may be used for working on genomic data, but it is poorly suited to handling such a huge amount of data. C++ is another option, but it is less portable and robust. So Java is the next resort. The BioJava project was initiated in the mid-90s as a result of the computational needs of Matthew Pocock and Thomas Down, two Ph.D. students at the Sanger Institute in Cambridge, England. The first set of classes was officially released in 2000. BioJava is maintained as an open source project under the Open Bioinformatics Foundation (OBF) and is distributed under the LGPL (GNU Lesser General Public License). The LGPL allows developers to modify the code and fix bugs, and to use the library as a foundation upon which to build both free software and commercial packages. BioJava can be used with any IDE, such as IntelliJ IDEA, Emacs, Eclipse or NetBeans. BioJava is a general BioInformatics toolkit. It provides a framework, i.e. a development environment, for everything in BioInformatics, from simple scripts to complete applications.
BioJava is an open-source project dedicated to providing genomic researchers with a Java technology-based developer's toolkit. BioJava offers BioInformatics developers over 1200 classes and interfaces for manipulating genomic sequences, file parsing, CORBA interoperability, and more. biojava 1.3 pre3 is the version of the BioJava toolkit currently used for developing applications in BioInformatics. It comprises several packages, which can be used to perform various operations on genetic data. biojava 1.4 pre1 is the BioJava version under development. BioJava is designed to be used as a library with Java 2 Standard Edition 1.2 or later. Currently, there are classes which can be used to create objects for :
Sequences and features
    IO
    Processing, storing, manipulating
    Visualising
Dynamic programming
    Single-sequence and pair-wise HMMs
    Viterbi-path, Forward and Backward algorithms
    Training models
    Sampling sequences from models
External file formats and programs
    GFF
    Blast
    Meme
Sequence Databases
    BioCorba interoperability
    ACeDB client
    DAS client
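To give an idea of what the Viterbi-path entry in the list above refers to, the following is a minimal sketch of the Viterbi algorithm written in plain Java. It does not use BioJava's dynamic programming classes; the class ViterbiSketch, the two-state model (state 0 is AT-rich, state 1 is GC-rich) and all probabilities are assumptions chosen purely for illustration.

```java
/** Hypothetical sketch of Viterbi decoding for a small discrete HMM. */
final class ViterbiSketch {

    /**
     * Returns the most probable state path for the observations.
     * start[s], trans[from][to] and emit[s][symbol] are probabilities.
     */
    static int[] viterbi(int[] obs, double[] start, double[][] trans, double[][] emit) {
        int n = obs.length;
        int k = start.length;
        if (n == 0) {
            return new int[0];
        }
        double[][] score = new double[n][k]; // best log-probability of a path ending in state s at time t
        int[][] back = new int[n][k];        // predecessor state on that best path
        for (int s = 0; s < k; s++) {
            score[0][s] = Math.log(start[s]) + Math.log(emit[s][obs[0]]);
        }
        for (int t = 1; t < n; t++) {
            for (int s = 0; s < k; s++) {
                int bestPrev = 0;
                double best = Double.NEGATIVE_INFINITY;
                for (int p = 0; p < k; p++) {
                    double candidate = score[t - 1][p] + Math.log(trans[p][s]);
                    if (candidate > best) {
                        best = candidate;
                        bestPrev = p;
                    }
                }
                score[t][s] = best + Math.log(emit[s][obs[t]]);
                back[t][s] = bestPrev;
            }
        }
        // Pick the best final state, then follow the back-pointers to the start.
        int last = 0;
        for (int s = 1; s < k; s++) {
            if (score[n - 1][s] > score[n - 1][last]) {
                last = s;
            }
        }
        int[] path = new int[n];
        path[n - 1] = last;
        for (int t = n - 1; t > 0; t--) {
            path[t - 1] = back[t][path[t]];
        }
        return path;
    }

    /** Maps the bases A, C, G, T to the symbol indices 0..3. */
    static int[] encode(String dna) {
        int[] out = new int[dna.length()];
        for (int i = 0; i < dna.length(); i++) {
            int index = "ACGT".indexOf(dna.charAt(i));
            if (index < 0) {
                throw new IllegalArgumentException("Not a DNA base: " + dna.charAt(i));
            }
            out[i] = index;
        }
        return out;
    }

    /** Decodes a DNA string with the illustrative two-state model. */
    static String decode(String dna) {
        double[] start = {0.5, 0.5};
        double[][] trans = {{0.9, 0.1}, {0.1, 0.9}};
        double[][] emit = {
            {0.4, 0.1, 0.1, 0.4}, // state 0: AT-rich (order A, C, G, T)
            {0.1, 0.4, 0.4, 0.1}  // state 1: GC-rich
        };
        StringBuilder labels = new StringBuilder();
        for (int state : viterbi(encode(dna), start, trans, emit)) {
            labels.append(state);
        }
        return labels.toString();
    }

    public static void main(String[] args) {
        System.out.println(decode("AATTGCGC")); // prints 00001111
    }
}
```

Working with logarithms avoids numerical underflow on long sequences, which matters for genome-scale input. The Forward and Backward algorithms named in the same list share this table-filling structure but sum over predecessor states instead of taking the maximum.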
Packages available in BioJava : The following partial list of packages gives an insight into the facilities provided by BioJava. These packages contain classes for creating objects that perform operations on the features mentioned above.
org.biojava.bio.seq : Classes and interfaces for defining biological sequences and informatics objects.
org.biojava.bio.symbol : Representation of the Symbols that make up a sequence, and locations within them.
org.biojava.bio.search : Interfaces and classes for representing sequence similarity search results.
org.biojava.bio.seq.db : Collections of biological sequence data.
org.biojava.bio.seq.io : Classes and interfaces for processing and producing flat-file representations of sequences.
org.biojava.bio.taxa : Taxonomy object for representing species information.
org.biojava.bio.gui : Graphical interfaces for biojava objects.
org.biojava.bio.program.xff : Event-driven parsing system for the Extensible Feature Format (XFF).
org.biojava.bio.seq.db.biofetch : Client for the OBDA BioFetch protocol.
org.biojava.bio.seq.db.biosql : General purpose Sequence storage in a relational database.
org.biojava.bio.seq.db.emblcd : Readers for the EMBL CD-ROM format binary index files used by EMBOSS and Staden packages.
org.biojava.bio.program : Java wrappers for interacting with external bioinformatics tools.
org.biojava.bio.program.blast2html : Code for generating HTML reports from blast output.
org.biojava.bio.program.hmmer : Tools for working with profile Hidden Markov Models from the HMMer package.
org.biojava.bio.chromatogram : Interfaces and classes for chromatogram data, as produced by DNA sequencing equipment.
org.biojava.bio.chromatogram.graphic : Tools for displaying chromatograms.
org.biojava.bio.seq.io.filterxml : Tools for reading and writing an XML representation of BioJava's FeatureFilter language.
org.biojava.utils.io : I/O utility classes.
org.biojava.utils.math : Mathematical utility classes.
org.biojava.utils.net : Network programming utility classes.
org.biojava.utils.xml : Utility classes for handling and generating XML documents.
Example Coded using BioJava : Reading Gene Sequences from a Fasta File
FASTA (FAST-All) is a tool for computing sequence alignments; it uses a word-based technique for sequence searching. The FASTA file format, a flat-file text representation of sequences, takes its name from this tool. One of the most frequent I/O tasks is reading a flat-file representation of a sequence into memory. SeqIOTools provides some basic static methods to read files into BioJava.

    import java.io.*;
    import java.util.*;
    import org.biojava.bio.*;
    import org.biojava.bio.seq.db.*;
    import org.biojava.bio.seq.io.*;
    import org.biojava.bio.symbol.*;

    public class ReadFasta {

      /**
       * The program takes two arguments: the first is the file name of the
       * Fasta file, the second is the name of the Alphabet.
       * Acceptable names are DNA, RNA or PROTEIN.
       */
      public static void main(String[] args) {
        try {
          // set up file input
          String filename = args[0];
          BufferedInputStream is =
              new BufferedInputStream(new FileInputStream(filename));

          // get the appropriate Alphabet
          Alphabet alpha = AlphabetManager.alphabetForName(args[1]);

          // get a SequenceDB of all sequences in the file
          SequenceDB db = SeqIOTools.readFasta(is, alpha);
        }
        catch (BioException ex) {
          // not in fasta format or wrong alphabet
          ex.printStackTrace();
        }
        catch (NoSuchElementException ex) {
          // no fasta sequences in the file
          ex.printStackTrace();
        }
        catch (FileNotFoundException ex) {
          // problem reading file
          ex.printStackTrace();
        }
      }
    }
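To make the flat-file format itself concrete, the sketch below parses FASTA text using only the standard Java library. It is not BioJava's parser and does not use SeqIOTools; the class FastaFormatSketch is hypothetical, and it assumes the common convention that each record starts with a header line beginning with '>' followed by one or more lines of sequence data.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical sketch of the FASTA flat-file layout; not a BioJava class. */
final class FastaFormatSketch {

    /** Parses FASTA text into an ordered map of header (without '>') to sequence. */
    static Map<String, String> parse(Reader in) throws IOException {
        Map<String, String> records = new LinkedHashMap<>();
        BufferedReader reader = new BufferedReader(in);
        String header = null;
        StringBuilder sequence = new StringBuilder();
        String line;
        while ((line = reader.readLine()) != null) {
            line = line.trim();
            if (line.isEmpty()) {
                continue; // blank lines carry no data
            }
            if (line.startsWith(">")) {
                // A new header closes the previous record, if there was one.
                if (header != null) {
                    records.put(header, sequence.toString());
                }
                header = line.substring(1).trim();
                sequence.setLength(0);
            } else if (header != null) {
                sequence.append(line); // sequence data may span several lines
            } else {
                throw new IOException("Sequence data found before the first '>' header line");
            }
        }
        if (header != null) {
            records.put(header, sequence.toString());
        }
        return records;
    }

    /** Convenience wrapper for FASTA text that is already in memory. */
    static Map<String, String> parseText(String text) {
        try {
            return parse(new StringReader(text));
        } catch (IOException ex) {
            throw new IllegalArgumentException(ex.getMessage(), ex);
        }
    }

    public static void main(String[] args) {
        String fasta = ">seq1 demo\nACGT\nTTGA\n>seq2\nGGCC\n";
        System.out.println(parseText(fasta)); // prints {seq1 demo=ACGTTTGA, seq2=GGCC}
    }
}
```

Unlike this sketch, the BioJava reader also checks every character against the chosen alphabet, which is why ReadFasta needs the alphabet name as its second argument.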
How to use BioJava : BioJava will compile and run on any computer with a Java virtual machine complying with the Java 2 Standard Edition 1.2 (or later) specification. BioJava binaries are distributed in .jar (Java ARchive) format, which can be obtained from the download area of the BioJava site (http://www.biojava.org). In addition to the BioJava .jar file, the following files are needed :
The Xerces-J XML parser (xerces.jar)
A Java bytecode generation library (bytecode.jar)
The Apache regular expressions library (jakarta-regexp.jar), for current CVS versions and 1.3 releases
None of these .jar files needs to be unpacked for normal use. Simply place them in the directory where you keep your programs. To use BioJava, add the BioJava and XML jar files to your CLASSPATH environment variable. CLASSPATH settings for various environments can be made as follows :

UNIX (Bourne shell)
    export CLASSPATH=/home/thomas/biojava.jar:/home/thomas/xerces.jar:/home/thomas/bytecode.jar:.
UNIX (C shell)
    setenv CLASSPATH /home/thomas/biojava.jar:/home/thomas/xerces.jar:/home/thomas/bytecode.jar:.
Windows from command line
    set CLASSPATH=C:\biojava.jar;C:\xerces.jar;C:\bytecode.jar;.
Windows from autoexec.bat
    set CLASSPATH=C:\biojava.jar;C:\xerces.jar;C:\bytecode.jar;.

In Windows NT, Windows 2000 and Windows XP, the classpath can be set through the Environment Variables settings in the System option of the Control Panel. The example program that reads a Fasta file can be compiled and run with the javac and java commands, in the same way as any simple Java program : compile with javac ReadFasta.java, then run with java ReadFasta followed by the Fasta file name and the alphabet name (DNA, RNA or PROTEIN).
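When a BioJava program fails with a class-not-found error, a common cause is a jar missing from the classpath. The sketch below is one way to check this from Java itself. The class ClasspathCheckSketch is hypothetical, and the jar names are taken from the CLASSPATH settings shown above; actual file names may carry version numbers.

```java
import java.io.File;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.regex.Pattern;

/** Hypothetical diagnostic: reports required jars absent from a classpath string. */
final class ClasspathCheckSketch {

    /** Returns the required jar file names that do not appear in the classpath. */
    static List<String> missingJars(String classpath, String separator, List<String> required) {
        List<String> present = new ArrayList<>();
        for (String entry : classpath.split(Pattern.quote(separator))) {
            // Keep only the file name, whichever directory separator the entry uses.
            int cut = Math.max(entry.lastIndexOf('/'), entry.lastIndexOf('\\'));
            present.add(entry.substring(cut + 1));
        }
        List<String> missing = new ArrayList<>();
        for (String jar : required) {
            if (!present.contains(jar)) {
                missing.add(jar);
            }
        }
        return missing;
    }

    public static void main(String[] args) {
        // Jar names assumed from the settings above; adjust to the files actually downloaded.
        List<String> required = Arrays.asList("biojava.jar", "xerces.jar", "bytecode.jar");
        String classpath = System.getProperty("java.class.path");
        System.out.println("Missing jars: " + missingJars(classpath, File.pathSeparator, required));
    }
}
```

File.pathSeparator is ':' on UNIX and ';' on Windows, which matches the difference between the settings listed above.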
Conclusion : BioJava is a full-fledged tool that can be used to develop applications in different areas of BioInformatics. Since it uses Java as the development language, the applications developed are scalable and platform independent. Since BioJava is an open source technology, anyone can contribute to its development by writing more packages and adding them to the source. Packages can be added to provide facilities related to gene expression data, to apply AI algorithms for generating gene sequences, to capture data from a variety of file formats, and so on. The systematic approach would be to follow certain development rules, such as :
Design by Interface but provide working implementations so that you can always extend or replace behaviour and implementations.
Provide extensive API documentation as well as a clear overview of how it all fits together.
Give simple examples that show how to use the APIs.
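The first of these rules can be sketched in a few lines of plain Java. All names below (SequenceScorer, GcScorer, LengthScorer, DesignByInterfaceSketch) are hypothetical and do not come from BioJava; the sketch only illustrates how code written against an interface lets one working implementation be replaced by another.

```java
import java.util.Arrays;
import java.util.List;

/** Hypothetical illustration of "design by interface, provide working implementations". */
final class DesignByInterfaceSketch {

    /** The contract that calling code depends on. */
    interface SequenceScorer {
        double score(String sequence);
    }

    /** A working implementation shipped with the interface: fraction of G and C bases. */
    static final class GcScorer implements SequenceScorer {
        @Override
        public double score(String sequence) {
            if (sequence.isEmpty()) {
                return 0.0;
            }
            int gc = 0;
            for (char c : sequence.toCharArray()) {
                if (c == 'G' || c == 'C') {
                    gc++;
                }
            }
            return (double) gc / sequence.length();
        }
    }

    /** A replacement implementation; the calling code does not change. */
    static final class LengthScorer implements SequenceScorer {
        @Override
        public double score(String sequence) {
            return sequence.length();
        }
    }

    /** Depends only on the interface, so any scorer can be plugged in. */
    static String best(List<String> sequences, SequenceScorer scorer) {
        String best = null;
        double bestScore = Double.NEGATIVE_INFINITY;
        for (String candidate : sequences) {
            double value = scorer.score(candidate);
            if (value > bestScore) {
                bestScore = value;
                best = candidate;
            }
        }
        return best;
    }

    public static void main(String[] args) {
        List<String> sequences = Arrays.asList("ATAT", "GGCC", "ATGCAA");
        System.out.println(best(sequences, new GcScorer()));     // prints GGCC
        System.out.println(best(sequences, new LengthScorer())); // prints ATGCAA
    }
}
```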
Areas open to new ventures in BioJava are :
Expression data
Gene networks
Many more programs and file formats
Much more remains to be done in the field of BioInformatics. There are several opportunities for Information Scientists, as reflected in the following BioInformatics facts :
There are currently over 1400 biotech companies in the U.S., with total revenues of $28.5 billion.
1000 genomes have been studied at some level of detail (including mammals, plants, insects, viruses, and bacteria).
The region of the Human Genome coding for proteins comprises less than 2% of the total. The remainder performs currently unknown functions.
GenBank, a public database of DNA, RNA, and protein sequences, is doubling in size every six months.
Larger genomic research facilities generate upwards of several hundred Gbytes of data per day.
The pharmaceutical industry is the most profitable sector of the Fortune 500.
Bibliography :
BioInformatics Computing - Bryan Bergeron
Developing BioInformatics Computer Skills - Cynthia Gibas & Per Jambeck
BioJava -- Java Technology Powers Toolkit for Deciphering Genomic Codes - Steven Meloan
BioJava In Anger - Schreiber
http://www.biojava.org
http://java.sun.com/developer/technicalArticles/javaopensource/biojava