
Software Tools and Resources for Bioinformatics Research
Anil Rai, Jyotika Bhati and S.B. Lal
Indian Agricultural Statistics Research Institute, Library Avenue, New Delhi

Biology is in the middle of a major paradigm shift driven by computing technology. Under the impact of information technology, the biological sciences have rapidly become much more computational and analytical. Rapid progress in genetics and related fields, combined with the tools provided by modern biotechnology, has generated massive volumes of genetic and protein sequence data over the last two decades. Compiling, storing and analysing these data to extract information and knowledge is a challenging task, because the usual analytical procedures are not directly applicable to such data sets.

Bioinformatics has been defined as a means for analysing, comparing, graphically displaying, modeling, storing, systemising, searching, and ultimately distributing biological information, which includes sequences, structures, function, and phylogeny. Thus, bioinformatics may be defined as a discipline that generates computational tools, databases, and methods to support genomic and post-genomic research. It comprises the study of DNA structure and function, gene and protein expression, protein production, structure and function, genetic regulatory systems, and clinical applications. Bioinformatics applications need knowledge from computer science, mathematics, statistics, medicine, chemistry and biology.

Biology employs a digital language for representing its information, using an alphabet of four basic letters (A, C, G, T). All the chromosomes in an organism's cell are represented and identified using sequences of these letters. The demanding challenge is to determine how this digital language of the chromosomes is converted into the three-dimensional and sometimes four-dimensional languages of living and breathing organisms.
Performing all of the above-mentioned tasks manually is nearly impossible because of the massive volumes of biological data and the precision the tasks require, so the use of computers has become mandatory. Thus, the subject of bioinformatics deals with designing and deploying efficient software tools and computational algorithms for accomplishing these tasks in a fast and precise manner. Bridging the gap between the real world of biology and the precise, logical nature of computers requires an interdisciplinary perspective. The tools of computer science, statistics, and mathematics are critical for studying biology from the perspective of bioinformatics. Recent advances in biotechnology, including improved DNA sequencing methods, new approaches to identifying protein structure, and revolutionary methods for monitoring the expression of many genes in parallel, have posed a number of challenges to computational scientists. Designing tools and techniques to deal with different sources of incomplete and noisy data has become another crucial goal for the bioinformatics community. In addition, there is a need to implement computational solutions based on theoretical frameworks, so that scientists can perform complex inferences about the phenomena under study.

Genomics in the recent past has triggered the development of high-throughput instrumentation for DNA sequencing, DNA arrays, genotyping, proteomics, etc. These instruments have catalyzed a new type of science for biology, termed discovery science. Discovery science defines all of the elements in a biological system: for example, the sequence of the genome, or the identification and quantitation of all of the mRNAs or proteins in a particular cell type (that is, the genome, the transcriptome, and the proteome). Discovery science creates databases of information, in contrast to the more classical hypothesis-driven science that formulates hypotheses and attempts to test them. The high-throughput tools both provide the means for discovery science and can assay how global information sets, for example transcriptomes or proteomes, change as systems are perturbed.

The genomes of the model organisms (yeast, worm, fly, etc.) have demonstrated the fundamental conservation of the basic informational pathways among all living organisms. Hence, systems can be perturbed in model organisms to gain insight into their functioning, and these data will provide fundamental insights into biological systems. From the genome, the information pathways and networks can be extracted to begin understanding the logic of life. Further, different genomes can be compared to identify similarities and differences in the strategies for the logic of life, and these comparisons provide fundamental insights into development, physiology and evolution. The first eukaryotic genome to be fully sequenced and annotated was that of Saccharomyces cerevisiae. This opened the path to developing biological and computational tools for genomic and post-genomic research. In the era of automated DNA sequencing and revolutionary advances in DNA sequence analysis, the attention of many researchers is now shifting away from the study of single genes or small gene clusters towards whole-genome analyses.
Knowing the complete sequence of a genome is only the first step in understanding how the myriad pieces of information contained within the genes are transcribed and ultimately translated into functional proteins. In the post-genomic era, functional genomic and proteomic studies help to obtain an image of the dynamic cell. A biological system consists mainly of two types of information:

- The information of genes or proteins, which are the molecular machines of life.
- The information of the regulatory networks that coordinate and specify the expression patterns of the genes and proteins.

All biological information is hierarchical in nature. DNA is transcribed into mRNA, which in turn is translated into protein. Proteins take part in protein interactions, which create informational pathways. These pathways form informational networks, which in turn make up cells. Cells form networks of cells, and an individual is a collection of cells. A host of individuals forms a population, and a variety of populations forms an ecology. This hierarchy poses a primary challenge for researchers and scientists: to create tools and mechanisms that capture these different levels of biological information and integrate them to gain insight into how they function.

In April 2003, the sequencing of the human genome was declared complete. In other words, scientists succeeded in reading the chain of more than 3 billion base pairs that constitute the DNA molecule of humans; this process is called sequencing. That daunting task required new analytical methods created by bioinformatics. The broad challenge was to identify all the genes and associate them with specific functions (the field of genomics), predict the structure of the proteins for which they code (the field of proteomics), and compare the roles of certain genes with those of other species in the living world (using biochips, for example).

Bioinformatics can be defined in many ways. Bioinformatics is the analysis of biological information using computers and statistical techniques; it is the science of developing and utilizing computer databases and algorithms to accelerate and enhance biological research. Bioinformatics can also be seen as more of a tool than a discipline: a set of tools for the analysis of biological data. The National Center for Biotechnology Information (NCBI 2001) defines bioinformatics as: "Bioinformatics is the field of science in which biology, computer science, and information technology merge into a single discipline. There are three important sub-disciplines within bioinformatics (i) the development of new algorithms and statistics with which to assess relationships among members of large data sets (ii) the analysis and interpretation of various types of data including nucleotide and amino acid sequences, protein domains, and protein structures (iii) and the development and implementation of tools that enable efficient access and management of different types of information."

Bioinformatics is the application of computer technology to the management of biological information. Specifically, it is the science of developing computer databases and algorithms to facilitate and expedite biological research. It is also used largely for the identification of new molecular targets for drug discovery.

1. Biological databases

Biological databases are libraries of life sciences information, collected from scientific experiments, published literature, high-throughput experiment technology, and computational analyses.
They contain information from research areas including genomics, proteomics, metabolomics, microarray gene expression, and phylogenetics. Information contained in biological databases includes gene function, structure, localization (both cellular and chromosomal), clinical effects of mutations, and similarities of biological sequences and structures. Relational database concepts from computer science and information retrieval concepts from digital libraries are important for understanding biological databases. Biological database design, development, and long-term management form one of the core areas of the discipline of bioinformatics. Data contents include gene sequences, textual descriptions, attributes and ontology classifications, citations, and tabular data. These are often described as semi-structured data, and can be represented as tables, key-delimited records, and XML structures. Cross-references among databases are common, using database accession numbers.

Biological databases are an important tool in assisting scientists to understand and explain a host of biological phenomena, from the structure of biomolecules and their interactions, to the whole metabolism of organisms, to the evolution of species. This knowledge helps in the fight against diseases, assists in the development of medications, and helps in discovering basic relationships amongst species in the history of life.

Biological knowledge is distributed amongst many different general and specialized databases. This sometimes makes it difficult to ensure the consistency of information. Biological databases cross-reference other databases with accession numbers as one way of linking their related knowledge together. Biological databases in bioinformatics fall into two categories: nucleotide databases and protein databases. The nucleotide databases are developed by the National Center for Biotechnology Information (NCBI) (http://www.ncbi.nlm.nih.gov), the European Molecular Biology Laboratory (EMBL) (http://www.ebi.ac.uk) and the DNA Data Bank of Japan (DDBJ) (http://www.ddbj.nig.ac.jp). The protein databases are the Protein Information Resource (PIR) (http://pir.georgetown.edu/), SWISS-PROT (http://www.expasy.ch/sprot/) and the Protein Data Bank (PDB) (http://www.pdb.org). There are many secondary protein databases, such as Pfam, PRINTS, BLOCKS, PROSITE and InterPro.

2. Bioinformatics Tools

Bioinformatics tools are software programs for saving, retrieving and analysing biological data and extracting information from them. Factors that must be taken into consideration while designing these tools are:

- The end user (the biologist) may not be a frequent user of computer technology, so the tools should be very user friendly.
- These software tools must be made available over the internet, given the global distribution of the scientific research community.

There are both standard and customized products to meet the requirements of a particular project. There is data-mining software that retrieves data from genomic sequence databases, and there are visualization tools to analyze and retrieve information from proteomic databases. The bioinformatics tools may be categorized into:

- Sequence Analysis
- Homology and Similarity Tools
- Protein Function Analysis
- Structural Analysis

2.1 Sequence Analysis

This set of tools allows you to carry out further, more detailed analysis on your query sequence, including evolutionary analysis and the identification of mutations, hydropathy regions, CpG islands and compositional biases. The identification of these and other biological properties provides clues that aid the search to elucidate the specific function of the sequence.

Align is used to compare two sequences. To produce an alignment that covers the whole length of both sequences, it uses the Needleman-Wunsch algorithm. To find the best region of similarity between two sequences, it uses the Smith-Waterman algorithm.

CENSOR is a software tool which screens query sequences against a reference collection of repeats and "censors" (masks) homologous portions with masking symbols, as well as generating a report classifying all found repeats.

ClustalW2 is a general purpose multiple sequence alignment program for DNA or proteins. It produces biologically meaningful multiple sequence alignments of divergent sequences. It calculates the best match for the selected sequences, and lines them up so that the identities, similarities and differences can be seen. Evolutionary relationships can also be seen by viewing cladograms or phylograms.

CpGPlot/CpGReport/Isochore: Detection of regions of genomic sequences that are rich in the CpG pattern is important because such regions are resistant to methylation and tend to be associated with genes which are frequently switched on. Regions rich in the CpG pattern are known as CpG islands. The program cpgplot plots CpG-rich areas, and cpgreport reports all CpG-rich regions. The nuclear genomes of vertebrates are mosaics of isochores, very long stretches of DNA that are homogeneous in base composition and are compositionally correlated with the coding sequences that they embed. Isochores can be partitioned into a small number of families that cover a range of GC levels. The program isochore plots GC content over a sequence.

DNA Block Aligner (DBA) aligns two sequences under the assumption that the sequences share a number of colinear blocks of conservation separated by potentially large and varied lengths of DNA in the two sequences. The aim is to handle syntenic regions of non-coding DNA, for example the upstream regions of a gene from mouse and human, or a conserved intron of a gene in human and chicken. The conserved blocks may be regions that are important for regulation of the gene, and may have one or two gaps. The final model is a probabilistic finite state machine (or pair-HMM) which aligns the two sequences. Each block can choose one of 4 different parameter sets, roughly corresponding to conservation at 65, 75, 85 or 95 percent identity. Linear gaps (gaps where the gap open penalty is the same as the extension penalty) are modeled in the blocks at a fixed probability of 0.05, and each block is expected to account for around 1% of the DNA sequence. The DBA form works by submitting the two query sequences (the file upload feature can be used if needed). An ASCII output of the alignment is returned to the user.

The Wise2 form compares a protein sequence to a genomic DNA sequence, allowing for introns and frameshifting errors. The model parameters which have been chosen are human gene parameters, local start/end in the protein, and the 6:23 algorithm.

MAFFT (Multiple Alignment using Fast Fourier Transform) is a high speed multiple sequence alignment program.
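
The CpG island idea described above can be illustrated with a short Python sketch. It computes, for a window of sequence, the GC fraction and the observed/expected CpG ratio, (number of CpG × window length) / (number of C × number of G), and flags windows where both exceed a threshold. The function names, the window size and the thresholds (GC fraction above 0.5, ratio above 0.6) are assumptions chosen for illustration; they are not taken from cpgplot or cpgreport, whose exact criteria should be checked in their documentation.

```python
def cpg_stats(window):
    """Return (GC fraction, observed/expected CpG ratio) for one window."""
    window = window.upper()
    n = len(window)
    c = window.count("C")
    g = window.count("G")
    cpg = window.count("CG")  # occurrences of the dinucleotide C followed by G
    gc_fraction = (c + g) / n if n else 0.0
    # Expected CpG count under independence is (c * g) / n.
    obs_exp = (cpg * n) / (c * g) if c and g else 0.0
    return gc_fraction, obs_exp


def cpg_rich_windows(seq, size=100, step=1, min_gc=0.5, min_obs_exp=0.6):
    """Slide a window along seq and collect the windows that look CpG-rich.

    The default thresholds are illustrative assumptions, not tool defaults.
    """
    hits = []
    for start in range(0, len(seq) - size + 1, step):
        gc, ratio = cpg_stats(seq[start:start + size])
        if gc > min_gc and ratio > min_obs_exp:
            hits.append((start, gc, ratio))
    return hits


if __name__ == "__main__":
    # Toy sequence: a short CG-rich core flanked by AT-rich DNA.
    for start, gc, ratio in cpg_rich_windows("ATAT" + "CGCG" + "ATAT", size=4):
        print(start, gc, ratio)
```

A real analysis would use much longer windows and would merge adjacent hits into islands of a minimum length; the sketch only shows the per-window statistic.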

Mauve is a system for efficiently constructing multiple genome alignments in the presence of large-scale evolutionary events such as rearrangement and inversion. Multiple genome alignment provides a basis for research into comparative genomics and the study of evolutionary dynamics. Aligning whole genomes is a fundamentally different problem from aligning short sequences. Mauve has been developed with the idea that a multiple genome aligner should require only modest computational resources. It employs algorithmic techniques that scale well with the amount of sequence being aligned. For example, a pair of Y. pestis genomes can be aligned in under a minute, while a group of 9 divergent Enterobacterial genomes can be aligned in a few hours.

JAligner is an open source Java implementation of the Smith-Waterman algorithm with Gotoh's improvement for biological local pairwise sequence alignment using the affine gap penalty model. JAligner works with user-defined scoring matrices. It is easy to use JAligner through a friendly Graphical User Interface (GUI), a simple command line syntax or a reusable Application Programming Interface (API).

HMMER is used for searching sequence databases for homologs of protein sequences, and for making protein sequence alignments. It implements methods using probabilistic models called "profile hidden Markov models" (profile HMMs). Compared to BLAST, FASTA, and other sequence alignment and database search tools based on older scoring methodology, HMMER aims to be significantly more accurate and better able to detect remote homologs because of the strength of its underlying mathematical models. In the past, this strength came at significant computational expense, but in the new HMMER3 project, HMMER is now essentially as fast as BLAST.

BioEdit is a biological sequence alignment editor written for Windows 95/98/NT/2000/XP.
An intuitive multiple document interface with convenient features makes alignment and manipulation of sequences relatively easy on your desktop computer. Several sequence manipulation and analysis options and links to external analysis programs provide a working environment which allows you to view and manipulate sequences with simple point-and-click operations.

SeWeR is an acronym that stands for SEquence analysis using WEb Resources. It offers a single door to all the common web-based services for sequence analysis. And it sews: it sews all these services together. Put more formally, SeWeR is an integrated portal to common web-based services in bioinformatics. SeWeR is cross-browser DHTML, written entirely in JavaScript 1.2. Hence it will run only in Netscape 4.0 or higher and Internet Explorer 4.0 or higher.

VecScreen is a system for quickly identifying segments of a nucleic acid sequence that may be of vector origin. NCBI developed VecScreen to combat the problem of vector contamination in public sequence databases. The VecScreen Web page is designed to help researchers identify and remove any segments of vector origin before sequence analysis or submission.

ORF Finder (Open Reading Frame Finder) is a graphical analysis tool which finds all open reading frames of a selectable minimum size in a user's sequence or in a sequence already in the database. This tool identifies all open reading frames using the standard or alternative genetic codes. The deduced amino acid sequence can be saved in various formats and searched against the sequence database using the WWW BLAST server. The ORF Finder should be helpful in preparing complete and accurate sequence submissions. It is also packaged with the Sequin sequence submission software.

Genome Workbench can display sequence data in many ways, including graphical sequence views, various alignment views, phylogenetic tree views, and tabular views of data. It can also align your private data to data in public databases, display your data in the context of public data, and retrieve BLAST results. Genome Workbench is built on the NCBI C++ ToolKit and uses cross-platform APIs for graphics. It runs on your local machine, and is available for Windows 2000/XP, Linux, MacOS X, and various flavors of Unix.

Pepinfo/Pepwindow/Pepstats: The program pepinfo detects and displays various useful metrics about a protein sequence. Pepwindow reads in a protein sequence and displays a graph of the classic Kyte & Doolittle hydropathy plot of that protein. Pepstats outputs a report of simple protein sequence information.

SAPS evaluates a wide variety of protein sequence properties by statistical criteria. Properties considered include compositional biases, clusters and runs of charge and other amino acid types, different kinds and extents of repetitive structures, locally periodic motifs, and anomalous spacings between identical residue types.
The statistics are computed for any single (or appropriately concatenated) protein sequence input.

T-Coffee is a multiple sequence alignment program, meant to align a set of sequences previously gathered using other programs such as BLAST, FASTA, etc. The main characteristic of T-Coffee is that it allows the results obtained with several alignment methods to be combined. For instance, given one alignment from ClustalW2, another from Align, and a structural alignment of some of the sequences, T-Coffee will combine all that information and produce a new multiple sequence alignment having the best agreement with all these methods. By default, T-Coffee compares sequences two by two, producing a global alignment and a series of local alignments (using align). The program then combines all these alignments into a multiple alignment.

Transeq translates nucleic acid sequences to the corresponding peptide sequence. It can translate in any one of the three forward or three reverse sense frames, in all three forward or all three reverse frames, or in all six frames.
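
Six-frame translation of the kind Transeq performs can be sketched in a few lines of Python. The sketch below is not Transeq itself: the function names are hypothetical, only the standard genetic code is supported, and the reverse frames are simply offsets into the reverse complement, which may not match Transeq's own frame numbering.

```python
from itertools import product

# Standard genetic code, written in the conventional T, C, A, G codon order.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(codon): aa
               for codon, aa in zip(product(BASES, repeat=3), AMINO)}

COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the reverse complement of an upper-case DNA sequence."""
    return seq.translate(COMPLEMENT)[::-1]


def translate(seq):
    """Translate complete codons from the start of seq; '*' marks a stop.

    Codons containing anything other than A, C, G, T become 'X'.
    """
    seq = seq.upper().replace("U", "T")
    return "".join(CODON_TABLE.get(seq[i:i + 3], "X")
                   for i in range(0, len(seq) - 2, 3))


def six_frames(seq):
    """Translate seq in three forward and three reverse frames."""
    seq = seq.upper()
    rev = reverse_complement(seq)
    frames = {}
    for offset in range(3):
        frames[f"+{offset + 1}"] = translate(seq[offset:])
        frames[f"-{offset + 1}"] = translate(rev[offset:])
    return frames


if __name__ == "__main__":
    for name, peptide in sorted(six_frames("ATGGCCTAA").items()):
        print(name, peptide)
```

Building the codon table from the 64-letter string keeps the sketch short; an open reading frame search of the kind ORF Finder performs could be layered on top by scanning each frame for the stretches between stop symbols.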

2.2 Homology and Similarity Tools

The term homology implies a common evolutionary relationship between two traits, whether they are DNA sequences or bristle patterns on a fly's nose. Homologous sequences are sequences that are related by divergence from a common ancestor. Thus the degree of similarity between two sequences can be measured, while their homology is a matter of being either true or false. This set of tools can be used to identify similarities between novel query sequences of unknown structure and function and database sequences whose structure and function have been elucidated. The following software tools are broadly used for homology and similarity searches.

BLAST (Basic Local Alignment Search Tool) is a set of search programs used to perform fast similarity searches regardless of whether the query is protein or DNA. A nucleotide query can be compared against the nucleotide sequences in a database, and a protein database can be searched to find matches to a protein query sequence. NCBI has also introduced a queuing system for BLAST (QBLAST) that allows users to retrieve results at their convenience and format their results multiple times with different formatting options. Depending on the type of sequences to compare, there are different programs:

- blastp compares an amino acid query sequence against a protein sequence database.
- blastn compares a nucleotide query sequence against a nucleotide sequence database.
- blastx compares a nucleotide query sequence translated in all reading frames against a protein sequence database.
- tblastn compares a protein query sequence against a nucleotide sequence database dynamically translated in all reading frames.
- tblastx compares the six-frame translations of a nucleotide query sequence against the six-frame translations of a nucleotide sequence database.
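
The choice among the five programs listed above depends only on the type of the query, the type of the database, and whether both sides should be translated. The following Python sketch encodes that decision as a lookup table. The helper `pick_blast_program` is hypothetical and is not part of BLAST; it merely restates the list in executable form.

```python
def pick_blast_program(query_type, db_type, translate_both=False):
    """Return the BLAST program name for a query/database combination.

    query_type and db_type are "protein" or "nucleotide". Set
    translate_both=True to compare six-frame translations of a nucleotide
    query against six-frame translations of a nucleotide database.
    """
    table = {
        ("protein", "protein"): "blastp",
        ("nucleotide", "nucleotide"): "blastn",
        ("nucleotide", "protein"): "blastx",
        ("protein", "nucleotide"): "tblastn",
    }
    if translate_both:
        if (query_type, db_type) != ("nucleotide", "nucleotide"):
            raise ValueError("tblastx needs a nucleotide query and database")
        return "tblastx"
    return table[(query_type, db_type)]


if __name__ == "__main__":
    print(pick_blast_program("nucleotide", "protein"))
    print(pick_blast_program("nucleotide", "nucleotide", translate_both=True))
```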

FASTA stands for FAST-All. It is a sequence alignment program created by Pearson and Lipman in 1988. The program is one of the many heuristic algorithms proposed to speed up sequence comparison. The basic idea is to add a fast prescreen step to locate the highly matching segments between two sequences, and then extend these matching segments to local alignments using more rigorous algorithms such as Smith-Waterman. FASTA can be very specific when identifying long regions of low similarity, especially for highly diverged sequences. You can also conduct sequence similarity searching against nucleotide databases or complete proteome/genome databases using the FASTA programs.

ENA Sequence Search allows you to search against all nucleotide sequences in the European Nucleotide Archive (ENA).
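
The prescreen idea behind FASTA can be sketched as follows: index every short word (k-tuple) of the target sequence, look up each word of the query, and count the hits on each diagonal (query position minus target position). A diagonal with many hits marks a highly matching segment worth extending with a rigorous algorithm. This Python sketch is a deliberately simplified illustration under that assumption; the function name and the word length are made up for the example, and the real FASTA program adds scoring, joining of segments and banded alignment on top of this step.

```python
from collections import Counter, defaultdict


def ktuple_diagonals(query, target, k=3):
    """Count shared k-tuples per diagonal (query index minus target index)."""
    # Index the start positions of every k-tuple in the target.
    index = defaultdict(list)
    for j in range(len(target) - k + 1):
        index[target[j:j + k]].append(j)

    # Each query k-tuple that also occurs in the target is a hit on diagonal i - j.
    diagonals = Counter()
    for i in range(len(query) - k + 1):
        for j in index.get(query[i:i + k], ()):
            diagonals[i - j] += 1
    return diagonals


if __name__ == "__main__":
    hits = ktuple_diagonals("ACGTACGT", "TTACGTACGT", k=3)
    best_diagonal, count = hits.most_common(1)[0]
    print(best_diagonal, count)
```

In the usage example the query matches the target shifted by two positions, so the diagonal with offset -2 collects the most hits.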

SSEARCH/GGSEARCH/GLSEARCH: These programs, part of the FASTA package, provide sequence similarity searching against protein databases. SSEARCH does a rigorous Smith-Waterman search for similarity between a query sequence and a database. GGSEARCH compares a protein or DNA sequence to a sequence database, producing global-global alignments (Needleman-Wunsch). GLSEARCH compares a protein or DNA sequence to a sequence database, producing global-local alignments (global in the query, local in the database sequence).
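
The difference between the global (Needleman-Wunsch) and local (Smith-Waterman) alignments mentioned above comes down to two changes in the same dynamic programming recurrence: a local alignment never lets a cell drop below zero, and it reports the best cell anywhere in the matrix rather than the final corner. The Python sketch below computes only the score, with a simple linear gap penalty. The function name and the scoring values (match 2, mismatch -1, gap -2) are assumptions for illustration; the programs above use substitution matrices and affine gap penalties.

```python
def alignment_score(a, b, local=False, match=2, mismatch=-1, gap=-2):
    """Score the best global (default) or local alignment of a and b."""
    # Row 0: aligning a prefix of b against nothing costs gaps (global only).
    prev = [0 if local else j * gap for j in range(len(b) + 1)]
    best = 0
    for i in range(1, len(a) + 1):
        curr = [0 if local else i * gap]
        for j in range(1, len(b) + 1):
            pair = match if a[i - 1] == b[j - 1] else mismatch
            cell = max(prev[j - 1] + pair,   # align a[i-1] with b[j-1]
                       prev[j] + gap,        # gap in b
                       curr[j - 1] + gap)    # gap in a
            if local:
                cell = max(cell, 0)          # a local alignment may restart
                best = max(best, cell)
            curr.append(cell)
        prev = curr
    # Global: score of aligning both full sequences. Local: best cell seen.
    return best if local else prev[-1]


if __name__ == "__main__":
    print(alignment_score("TTACGTT", "GGACGGG"))              # global
    print(alignment_score("TTACGTT", "GGACGGG", local=True))  # local
```

For the two toy sequences the local score is higher than the global one, because the local alignment keeps only the shared ACG core and ignores the mismatching flanks.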

2.3 Protein Function Analysis

Function analysis is the identification and mapping of all functional elements (both coding and non-coding) in a genome. This group of programs allows you to compare your protein sequence to the secondary (or derived) protein databases that contain information on motifs, signatures and protein domains. Highly significant hits against these different pattern databases allow you to approximate the biochemical function of your query protein. Some of these databases are described below.

CluSTr: There are two ways to search the CluSTr database: the advanced search allows protein accession queries, and the simple search allows cluster identifier searching of the database. For advanced search, use UniProt Knowledgebase accessions (e.g. Q9Y9L0), UniProt Knowledgebase IDs (e.g. TDXH_AERPE), IPI accessions (e.g. IPI00745335), UniProt isoform ids (e.g. P45983-1) or a protein name (or a fragment of one). For simple search, pick out individual clusters by specifying all three of: a run name, a numeric id (e.g. 16457) and a z score (e.g. 232.7). Otherwise specify a cluster identifier (e.g. HUMAN:16457:232.7, i.e. a colon-separated list of the above three values).

FingerPRINTScan: Search against FingerPRINTScan with a protein query sequence to identify the closest matching PRINTS sequence motif fingerprints in a protein sequence.

Inquisitor will examine your protein sequence and identify whether or not it corresponds to a sequence in Integr8 (complete proteomes only) and the UniProt Knowledgebase. If the sequence is not identified, the Inquisitor will return details of the closest matches to your sequence, and will also return an analysis of the exact sequence submitted. The Inquisitor uses FASTA to find inexact matches, and InterProScan to analyse sequences. A status report will keep you informed of the analysis process.

The InterProScan form allows you to query your sequence against InterPro. InterPro is an integrated database and diagnostic tool, which uses different methodologies and a varying degree of biological information on well-characterised proteins to derive protein signatures. InterProScan combines a number of databases (referred to as member databases) such as ProDom, PROSITE patterns, PROSITE and HAMAP profiles, PRINTS, PANTHER, PIRSF, Pfam, SMART, TIGRFAMs, Gene3D and SUPERFAMILY.
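
The colon-separated CluSTr cluster identifier described earlier in this section (run name, numeric id, z score) is easy to split programmatically. The Python sketch below assumes exactly that three-field layout; the `ClusterId` type and `parse_cluster_id` function are hypothetical names introduced for illustration and are not part of any CluSTr client.

```python
from typing import NamedTuple


class ClusterId(NamedTuple):
    run_name: str
    numeric_id: int
    z_score: float


def parse_cluster_id(text):
    """Split an identifier such as 'HUMAN:16457:232.7' into its three parts.

    Raises ValueError if the text does not have exactly three fields or if
    the numeric fields cannot be converted.
    """
    run_name, numeric_id, z_score = text.strip().split(":")
    return ClusterId(run_name, int(numeric_id), float(z_score))


if __name__ == "__main__":
    print(parse_cluster_id("HUMAN:16457:232.7"))
```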

The Phobius server predicts transmembrane topology and signal peptides from the amino acid sequence of a protein.

PPSearch: Search your query sequence for protein motifs by rapidly comparing it against all patterns stored in the PROSITE pattern database, to help determine what function an uncharacterised protein may perform. This tool requires a protein sequence as input, but DNA/RNA may be translated into a protein sequence using transeq and then queried.

Pratt: An important problem in sequence analysis is to find patterns matching sets or subsets of sequences. This tool allows the user to search for patterns conserved in sets of unaligned protein sequences. The user can specify what kind of patterns should be searched for, and how many sequences should match a pattern for it to be reported.

RADAR stands for Rapid Automatic Detection and Alignment of Repeats in protein sequences. Many large proteins have evolved by internal duplication, and many internal sequence repeats correspond to functional and structural units. RADAR uses an automatic algorithm for segmenting your query sequence into repeats. It identifies short composition-biased repeats as well as gapped approximate repeats and complex repeat architectures involving many different types of repeats in your query sequence.

GLIMMER (Gene Locator and Interpolated Markov Modeler) is a system for finding genes in microbial DNA, especially the genomes of bacteria, archaea, and viruses. It uses interpolated Markov models to identify coding regions.

The Proteax software suite makes it easy to handle modified proteins and protein derivatives. Proteax can be used with Microsoft Excel or Oracle databases to register and analyse protein structures. The chemically-aware Proteax engine enables researchers to work with post-translationally and chemically modified protein sequences. Protein data can therefore be transferred directly to bioinformatics tools and chemistry tools, and to and from mass spectrometry instruments. The Proteax Cartridge directly supports Oracle databases without imposing restrictions on the way you model data. Using Proteax inside PostgreSQL databases is an option too. Offline and local data can be processed within a familiar spreadsheet environment. The Proteax for spreadsheet add-ins work in both Microsoft Excel and OpenOffice.org Calc and give the full functionality of the Proteax toolkit right inside your spreadsheet of choice. Bulk processing of flat file datasets is a natural fit for Proteax Desktop, which can be scripted from all common programming languages.

2.4 Structural Analysis

This set of tools allows structures to be compared with the databases of known structures. The function of a protein is more directly a consequence of its structure than of its sequence, with structural homologs tending to share functions. The determination of a protein's 2D/3D structure is crucial in the study of its function.

RasMol is a computer program written for molecular graphics visualization, intended and used primarily for the depiction and exploration of biological macromolecule structures, such as those found in the Protein Data Bank. It was originally developed by Roger Sayle in the early 1990s. Historically, it was an important tool for molecular biologists, since the extremely optimized program could run on (then) modestly powerful personal computers. Before RasMol, visualization software ran on graphics workstations that, due to their expense, were less accessible to scholars. RasMol has become an important educational tool as well as continuing to be an important tool for research in structural biology. RasMol includes a language for selecting certain protein chains, changing colors, etc.

QuteMol is an open source, interactive, molecular visualization system. QuteMol utilizes the capabilities of modern Graphics Processing Units through OpenGL shaders to offer an array of innovative visual effects. QuteMol visualization techniques are aimed at improving clarity and making the 3D shape and structure of large molecules or complex proteins easier to understand.

PyMOL is an open-source, user-sponsored, molecular visualization system created by Warren Lyford DeLano and commercialized by DeLano Scientific LLC, a private software company dedicated to creating useful tools that become universally accessible to scientific and educational communities. It is well suited to producing high quality 3D images of small molecules and biological macromolecules such as proteins. According to its author, almost a quarter of all published images of 3D protein structures in the scientific literature were made using PyMOL.
Its single code base supports UNIX, Macintosh, and Windows, using OpenGL, Python, and a small set of open-source external dependencies.

Ascalaph Designer is a general purpose molecular modeling package for molecular design and simulations. It provides a graphical environment for common quantum and classical molecular modeling programs such as Firefly, CP2K and MDynaMix. The molecular mechanics calculations cover model building, energy optimizations and molecular dynamics. Firefly/PC GAMESS covers a wide range of quantum chemistry methods.

GROMACS (GROningen MAchine for Chemical Simulations) is a molecular dynamics simulation package originally developed at the University of Groningen and now maintained and extended at several places, including the University of Uppsala, the University of Stockholm and the Max Planck Institute for Polymer Research. GROMACS is open source software released under the GPL. The program is written for Unix-like operating systems, but it can also run on Windows machines if the Cygwin Unix layer is used. It can be run in parallel on multiple CPU cores or on a network of machines using the MPI library. GROMACS contains a script to convert molecular coordinates from a PDB file into the formats it uses internally. Once a configuration file for the simulation of several molecules (possibly including solvent) has been created, the actual simulation run (which can be time consuming) produces a trajectory file describing the movements of the atoms over time. This trajectory file can then be analyzed or visualized with a number of supplied tools. The highly optimized code makes GROMACS one of the fastest programs for molecular simulations to date. In addition, support for different force fields makes GROMACS very flexible. For example, the AMBER and CHARMM force fields can be used with GROMACS.

MDynaMix (an acronym for Molecular Dynamics of Mixtures) is a general purpose molecular dynamics software package for simulating mixtures of molecules that interact through AMBER/CHARMM-like force fields in periodic boundary conditions. MDynaMix is developed at Stockholm University, Sweden. Algorithms for the NVE, NVT, NPT and anisotropic NPT ensembles are employed, as well as Ewald summation for the treatment of electrostatic interactions. The code is written in Fortran 77 (with MPI for parallel execution) and C++ and is released under the GNU GPL. The package works on Unix/Linux workstations and clusters of workstations, as well as on Windows in sequential mode.

TINKER is a complete and general software package for molecular mechanics and molecular dynamics simulation, with some special features for biopolymers. The heart of the TINKER package is a modular set of callable routines that allow the manipulation of coordinates and the evaluation of potential energy and derivatives in a straightforward fashion. TINKER works on Windows, Mac, and Unix/Linux, and its source code is available. The code is written in FORTRAN 77 with common extensions and some C. It is maintained by Jay Ponder at the Washington University School of Medicine.
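As a rough illustration of what molecular dynamics packages such as GROMACS, MDynaMix and TINKER do at their core, the following minimal Python sketch integrates a constant-energy (NVE) trajectory under periodic boundary conditions. It is not code from any of these packages. The velocity Verlet integrator (one common choice), the Lennard-Jones pair potential in reduced units, unit masses, the absence of a cutoff, and the function names `minimum_image`, `forces_and_potential` and `velocity_verlet` are all assumptions made for illustration. Production codes add full force fields, Ewald summation, thermostats and barostats on top of this.

```python
def minimum_image(d, box):
    """Map a coordinate difference onto the nearest periodic image."""
    return d - box * round(d / box)


def forces_and_potential(pos, box):
    """Lennard-Jones forces and potential energy (reduced units, no cutoff)."""
    n = len(pos)
    forces = [[0.0, 0.0, 0.0] for _ in range(n)]
    potential = 0.0
    for i in range(n - 1):
        for j in range(i + 1, n):
            d = [minimum_image(pos[i][k] - pos[j][k], box) for k in range(3)]
            r2 = sum(c * c for c in d)
            inv_r6 = 1.0 / r2 ** 3
            potential += 4.0 * (inv_r6 * inv_r6 - inv_r6)
            # Force on i: 24 * (2 r^-12 - r^-6) / r^2 times the separation vector.
            scale = 24.0 * (2.0 * inv_r6 * inv_r6 - inv_r6) / r2
            for k in range(3):
                forces[i][k] += scale * d[k]
                forces[j][k] -= scale * d[k]
    return forces, potential


def kinetic_energy(vel):
    """Kinetic energy for unit masses."""
    return 0.5 * sum(c * c for v in vel for c in v)


def velocity_verlet(pos, vel, box, dt, steps):
    """Advance pos and vel in place; return the final total energy."""
    forces, potential = forces_and_potential(pos, box)
    for _ in range(steps):
        for i in range(len(pos)):
            for k in range(3):
                vel[i][k] += 0.5 * dt * forces[i][k]           # first half kick
                pos[i][k] = (pos[i][k] + dt * vel[i][k]) % box  # drift, wrap into box
        forces, potential = forces_and_potential(pos, box)
        for i in range(len(pos)):
            for k in range(3):
                vel[i][k] += 0.5 * dt * forces[i][k]           # second half kick
    return potential + kinetic_energy(vel)


if __name__ == "__main__":
    box = 10.0
    positions = [[4.25, 5.0, 5.0], [5.75, 5.0, 5.0]]
    velocities = [[0.0, 0.0, 0.0], [0.0, 0.0, 0.0]]
    _, e_start = forces_and_potential(positions, box)
    e_end = velocity_verlet(positions, velocities, box, dt=0.0005, steps=2000)
    print("energy drift:", e_end - e_start)
```

For a sufficiently small time step the total energy should stay nearly constant, which is a quick sanity check for any NVE integrator.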
NAMD (Not (just) Another Molecular Dynamics program) is a free molecular dynamics simulation package written using the Charm++ parallel programming model. It is noted for its parallel efficiency and is often used to simulate large systems (millions of atoms). It has been developed jointly by the Theoretical and Computational Biophysics Group (TCB) and the Parallel Programming Laboratory (PPL) at the University of Illinois at Urbana-Champaign. It was introduced in 1995 by Nelson et al. as a parallel molecular dynamics code enabling interactive simulation by linking to the visualization code VMD. NAMD has since matured, adding many features and scaling to thousands of processors.

Jmol is an open-source Java viewer for chemical structures in 3D. Jmol returns a 3D representation of a molecule that may be used as a teaching tool or for research in chemistry and biochemistry. It is free and open source software written in Java, so it runs on Windows, Mac OS X, Linux and Unix systems. There is a standalone application and a development tool kit that can be integrated into other Java applications. The most notable feature is an applet that can be integrated into web pages to display molecules in a variety of ways. For example, molecules can be displayed as "ball and stick" models, "space filling" models, "ribbon" models, etc. Jmol supports a wide range of molecular file formats, including Protein Data Bank (pdb), Crystallographic Information File (cif), MDL Molfile (mol), and Chemical Markup Language (CML).

Swiss-PdbViewer (also known as DeepView) is an application that provides a user-friendly interface for analyzing several proteins at the same time. The proteins can be superimposed in order to deduce structural alignments and compare their active sites or any other relevant parts. Amino acid mutations, H-bonds, angles and distances between atoms are easy to obtain thanks to the intuitive graphic and menu interface. Swiss-PdbViewer is tightly linked to SWISS-MODEL, an automated homology modeling server developed within the Swiss Institute of Bioinformatics (SIB) at the Structural Bioinformatics Group, the Biozentrum in Basel. Working with these two programs greatly reduces the amount of work necessary to generate models, as it is possible to thread a protein primary sequence onto a 3D template and get immediate feedback on how well the threaded protein will be accepted by the reference structure before submitting a request to build missing loops and refine sidechain packing. Swiss-PdbViewer can also read electron density maps and provides various tools to build into the density. In addition, various modeling tools are integrated, and command files for popular energy minimization packages can be generated. Finally, POV-Ray scenes can be generated from the current view in order to make ray-traced quality images.

DaliLite is a program for pairwise structure comparison. It compares a query structure (the first structure) with a reference structure (the second structure).

MaxSprout is a fast database algorithm for generating protein backbone and side chain coordinates from a Cα trace. The backbone is assembled from fragments taken from known structures.
Side chain conformations are optimised in rotamer space using a rough potential energy function to avoid clashes.

PDBeFold (also known as SSM) is an interactive service for comparing protein structures in 3D. This service provides:

- pairwise comparison and 3D alignment of protein structures
- multiple comparison and 3D alignment of protein structures
- examination of a protein structure for similarity with the whole PDB archive or SCOP archive
- best Cα-alignment of compared structures
- download and visualisation of best-superposed structures using RasMol (Unix/Linux platforms), RasTop (MS Windows machines) and Jmol (platform-independent server-side Java viewer)
- linking the results to other services: PDBeMotif, SCOP, GeneCensus, FSSP, CATH, PDBSum, UniProt.
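The Cα-based comparison in the list above can be illustrated with a small sketch. Assuming two structures have already been superposed (for example by a structure comparison service) and contain the same number of matched Cα atoms, the following Python code reads Cα coordinates from PDB-format ATOM records and reports their root-mean-square deviation (RMSD), a standard measure of how closely two superposed structures agree. The function names `ca_coordinates` and `rmsd` are hypothetical. The code does not perform the superposition or the residue matching itself, and it makes no attempt to handle alternate locations or multiple models.

```python
import math


def ca_coordinates(pdb_lines):
    """Return (x, y, z) tuples for C-alpha atoms found in PDB ATOM records."""
    coords = []
    for line in pdb_lines:
        # Fixed-column PDB format: record name in columns 1-6, atom name in
        # columns 13-16, and x, y, z in columns 31-38, 39-46 and 47-54.
        if line.startswith("ATOM") and line[12:16].strip() == "CA":
            coords.append(
                (float(line[30:38]), float(line[38:46]), float(line[46:54]))
            )
    return coords


def rmsd(coords_a, coords_b):
    """RMSD between two equally long, already superposed coordinate lists."""
    if not coords_a or len(coords_a) != len(coords_b):
        raise ValueError("coordinate lists must be non-empty and equally long")
    squared = sum(
        (a[k] - b[k]) ** 2
        for a, b in zip(coords_a, coords_b)
        for k in range(3)
    )
    return math.sqrt(squared / len(coords_a))
```

A typical use would be to call `ca_coordinates` on the lines of each superposed PDB file and pass the two resulting lists to `rmsd`. Smaller values indicate closer structural agreement.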

PDBeMotif is an extremely fast and powerful search tool that facilitates exploration of the Protein Data Bank (PDB) by combining protein sequence, chemical structure and 3D data in a single search. Currently, it is the only tool that offers this kind of integration at this speed. PDBeMotif can be used to examine the characteristics of the binding sites of single proteins or of classes of proteins, such as kinases, and the conserved structural features of their immediate environments, either within the same species or across different species. For example, it can highlight a conserved activation loop common to protein kinases, which is important in regulating activity and is marked by conserved DFG and APE motifs at the start and end of the loop, respectively. Analyses done with PDBeMotif can also inform predictions of how modifications to small molecules that bind to the active and/or regulatory sites of proteins will affect their efficacy. It can be ported to all major operating system platforms, such as MS Windows, Linux, Apple Mac and Solaris, because it is written in Java and uses Oracle and the open-source PostgreSQL database server. PDBeMotif can be used online or downloaded and installed locally, where public and private PDB files (including libraries of theoretically derived 3D structures) can be loaded and analyzed. It can also load protein site annotations, families and domains from Distributed Annotation System (DAS) servers.

Tempura is a server that lets a user specify the amino acids to be used for searching with the "reverse template" approach. In addition to selecting important residues, the user can specify whether to search against a non-redundant representative sample of the PDB, an uploaded list of PDB codes (useful when examining specific families of proteins) or a single PDB code (for pairwise searches).
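Returning to the kinase activation-loop example given for PDBeMotif above, the idea of a loop bounded by DFG and APE motifs can be sketched at the sequence level with a regular expression. This is only an illustration and not how PDBeMotif works internally, since that tool also combines chemical structure and 3D data. The allowed gap of 10 to 40 residues between the two motifs and the function name `find_activation_loops` are assumptions chosen for the example.

```python
import re

# Assumed pattern: "DFG", then 10-40 residues, then "APE".
# The gap bounds are illustrative guesses, not values taken from PDBeMotif.
ACTIVATION_LOOP = re.compile(r"DFG[A-Z]{10,40}?APE")


def find_activation_loops(sequence):
    """Return (start, end, segment) for each DFG...APE stretch.

    Positions are 0-based and the end index is exclusive.
    """
    return [
        (match.start(), match.end(), match.group(0))
        for match in ACTIVATION_LOOP.finditer(sequence.upper())
    ]
```

Calling `find_activation_loops` on a one-letter amino acid sequence returns the candidate segments, which could then be inspected in a structure viewer.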
The Tempura server allows two types of search: by uploading a 3D coordinate file or by entering the corresponding 4-character PDB code. Once a valid selection has been made or a structure uploaded, a list of residues is presented for selection. Click on as many residues as required (use shift+click or ctrl+click for multiple residue selections) and then click "submit" to go to the database selection screen, where the number of templates generated will be displayed.

It can be seen that bioinformatics is the application of computational sciences in biotechnology for the storage, warehousing and analysis of data such as DNA sequences. It is the comprehensive application of mathematics (e.g., probability and statistics), science (e.g., biochemistry), and a core set of problem-solving methods (e.g., computer algorithms) to the understanding of living systems. Bioinformatics is being used in various fields of science. A few important areas are given below:

- Molecular medicine
- Personalised medicine
- Preventative medicine
- Gene therapy
- Drug development
- Microbial genome applications
- Waste cleanup
- Climatic change studies
- Alternative energy sources
- Biotechnology
- Antibiotic resistance
- Forensic analysis of microbes
- Bio-weapon creation
- Evolutionary studies
- Crop improvement
- Insect resistance
- Improvement of nutritional quality
- Development of drought-resistant varieties
- Veterinary science

With the confluence of statistics, biology and computer science, computer applications in molecular biology are drawing greater attention among life science researchers and scientists these days. As it becomes imperative for biologists to seek the help of information technology professionals to meet the ever growing computational requirements of a host of exciting and demanding biological problems, the synergy between modern biology and computational science is set to blossom in the days to come. Thus, the research scope for mathematical techniques and algorithms, coupled with software programming languages and software development and deployment tools, is set to get a real boost. In addition, information technologies such as databases, middleware, graphical user interface (GUI) design, distributed object computing, storage area networks (SAN), data compression, network and communication, and remote management are all set to play a very critical role in taking forward the goals for which the field of bioinformatics came into existence.

REFERENCES
1. http://www.ncbi.nlm.nih.gov.in
2. http://www.ebi.ac.uk
3. http://www.biochemfusion.com/
4. http://www.rasmol.org/
5. http://qutemol.sourceforge.net/
6. http://pymol.org/
7. http://biomolecular-modeling.com/Ascalaph/Ascalaph_Designer.html
8. http://www.gromacs.org/
9. http://www.fos.su.se/~sasha/mdynamix/
10. http://dasher.wustl.edu/tinker/
11. http://www.nvidia.com/object/namd_on_tesla.html
12. http://www.jmol.org
13. http://spdbv.vital-it.ch/
14. http://gel.ahabs.wisc.edu/mauve/
15. http://jaligner.sourceforge.net/
16. http://hmmer.janelia.org/
17. http://www.mbio.ncsu.edu/BioEdit/bioedit.html
18. http://www.bioinformatics.org/sewer/
