Chapter 3

Tools and Techniques

Abstract Bioinformatics tools are used routinely in bioinformation discovery, and a large number of them are available in the public domain for drug discovery research. This chapter describes some of the commonly used tools, many of which are cited throughout this book. Each entry gives a brief description of the tool and the uniform resource locator (URL) at which it is available. The list is neither complete nor comprehensive, so readers are encouraged to refer to other sources of information on bioinformatics tools and techniques. The tools described in this chapter are listed below in alphabetical order.

Keywords Tools • Techniques • Server • Prediction • Alignment • Software • Algorithm • Availability source

3.1 Align

ALIGN is used to compare two sequences. In this implementation, the full-length sequences are compared using the Needleman and Wunsch pairwise alignment technique (Needleman and Wunsch 1970), and the best region of similarity between the two sequences is obtained using the Smith and Waterman pairwise alignment technique (Smith and Waterman 1981). Both methods use a dynamic programming algorithm to identify the optimal alignment. An example is illustrated in Fig. 3.1. Pairwise alignment of sequences is a routine task in gene discovery.

Availability: http://www.ebi.ac.uk/Tools/emboss/align/index.html
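The two dynamic programming recurrences can be sketched in a few lines of Python. This is only an illustration and not the ALIGN implementation: the function names and the scoring scheme (match +1, mismatch -1, linear gap penalty -1) are assumptions chosen for brevity, whereas production tools typically use substitution matrices and affine gap penalties. Only the optimal score is computed; recovering the alignment itself would additionally require a traceback.

```python
def global_score(a, b, match=1, mismatch=-1, gap=-1):
    """Needleman-Wunsch style score: both sequences aligned end to end.

    The scoring values are illustrative assumptions, not ALIGN defaults.
    """
    # Row 0: aligning an empty prefix of `a` against b[:j] costs j gaps.
    prev = [j * gap for j in range(len(b) + 1)]
    for i in range(1, len(a) + 1):
        curr = [i * gap]  # column 0: a[:i] against an empty prefix of b
        for j in range(1, len(b) + 1):
            sub = match if a[i - 1] == b[j - 1] else mismatch
            curr.append(max(prev[j - 1] + sub,   # align a[i-1] with b[j-1]
                            prev[j] + gap,       # gap in b
                            curr[j - 1] + gap))  # gap in a
        prev = curr
    return prev[-1]


def local_score(a, b, match=1, mismatch=-1, gap=-1):
    """Smith-Waterman style score: best-scoring pair of subsequences."""
    best = 0
    prev = [0] * (len(b) + 1)
    for i in range(1, len(a) + 1):
        curr = [0]
        for j in range(1, len(b) + 1):
            sub = match if a[i - 1] == b[j - 1] else mismatch
            # The extra 0 lets an alignment restart anywhere.
            cell = max(0, prev[j - 1] + sub, prev[j] + gap, curr[j - 1] + gap)
            curr.append(cell)
            best = max(best, cell)
        prev = curr
    return best
```

With these assumed scores, global_score("XXACGTXX", "YYACGTYY") is 0 because the four flanking mismatches cancel the four matches, whereas local_score on the same pair is 4 because only the shared ACGT region is scored.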

P. Kangueane, Bioinformation Discovery: Data to Knowledge in Biology, DOI 10.1007/978-1-4419-0519-2_3, © Springer Science+Business Media, LLC 2009

Fig. 3.1 An example for global and local alignment is illustrated using ALIGN

3.2 BIMAS

BIMAS is a human leukocyte antigen (HLA)-peptide binding prediction server. HLA molecules bind short peptides to generate a T-cell-mediated immune response. These peptides act as epitopes in vaccine design and development. The analysis is based on coefficient tables deduced from the published literature on peptide binding to HLA molecules (Parker et al. 1994) (Fig. 3.2).

Availability: http://www.bimas.cit.nih.gov/molbio/hla_bind/

Fig. 3.2 An example for BIMAS HLA-peptide binding prediction is shown

3.3 BLAST

The Basic Local Alignment Search Tool (BLAST) finds regions of local similarity between sequences (Altschul et al. 1990). The program compares nucleotide or protein sequences to sequence databases and calculates the statistical significance of matches using an extreme value distribution. BLAST can be used to infer functional and evolutionary relationships between sequences as well as to help identify members of gene families. An example is shown in Fig. 3.3.

Availability: http://blast.ncbi.nlm.nih.gov/Blast.cgi
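The extreme value statistics can be made concrete with a short sketch. It assumes the standard Karlin-Altschul form, in which the expected number of chance matches with raw score at least S is E = K·m·n·exp(-λS) for a query of length m and a database of length n, and the corresponding probability of at least one such match is P = 1 - exp(-E). The function names and the numerical values of λ and K used below are illustrative assumptions and are not taken from this chapter; in practice they depend on the scoring system.

```python
import math


def bit_score(raw_score, lam, k):
    """Normalize a raw score so that E = m * n * 2**(-bit_score)."""
    return (lam * raw_score - math.log(k)) / math.log(2)


def e_value(raw_score, query_len, db_len, lam, k):
    """Expected number of chance matches scoring at least raw_score."""
    return k * query_len * db_len * math.exp(-lam * raw_score)


def p_value(e):
    """Probability of at least one chance match, given the E value."""
    # -expm1(-e) equals 1 - exp(-e) but stays accurate when e is tiny.
    return -math.expm1(-e)


if __name__ == "__main__":
    LAM, K = 0.318, 0.134  # assumed, illustrative parameters
    for s in (30, 60, 90):
        e = e_value(s, query_len=250, db_len=50_000_000, lam=LAM, k=K)
        print(s, round(bit_score(s, LAM, K), 1), e, p_value(e))
```

The E value falls exponentially as the score increases, which is why a modest gain in alignment score can change a match from insignificant to highly significant.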


Fig. 3.3 An example for BLAST analysis is shown

Fig. 3.4 An example for multiple sequence alignment is shown

3.4 ClustalW

ClustalW is a general-purpose multiple sequence alignment (Fig. 3.4) program for DNA or proteins (Thompson et al. 1994). It produces biologically meaningful multiple sequence alignments of divergent sequences. It calculates the best match for the selected sequences and lines them up so that the identities, similarities and differences can be seen. Evolutionary relationships can be seen by viewing cladograms or phylograms.

Availability: http://www.ebi.ac.uk/Tools/clustalw2/index.html

3.5 DeCypher

The DeCypher® enterprise biocomputing solutions unite bioinformatics applications with high-throughput accelerator hardware to achieve speed and accuracy. Institutions use DeCypher for high-throughput bioinformatics solutions. DeCypher contains Tera-BLAST (TimeLogic BLAST), Smith-Waterman (Smith and Waterman 1981) and several other algorithms, including hidden Markov models, for faster annotation and gene mapping.

Availability: http://www.timelogic.com/decypher_intro.html

3.6 DeepView

DeepView is an application with a user-friendly interface that allows several proteins to be analyzed at the same time (Schwede et al. 2003). The proteins can be superimposed in order to deduce structural alignments. Amino acid mutations, H-bonds, angles and distances between atoms are easily obtained using the intuitive graphics and menu interface.

Availability: http://spdbv.vital-it.ch/

3.7 FASTA

The FASTA programs find regions of local or global (new) similarity between protein or DNA sequences, either by searching protein or DNA databases, or by identifying local duplications within a sequence (Pearson 1990). Like BLAST, FASTA can be used to infer functional and evolutionary relationships between sequences as well as to help identify members of gene families.

Availability: http://www.fasta.bioch.virginia.edu/fasta_www2/fasta_list2.shtml

3.8 Insight II

INSIGHT II is a part of Discovery Studio® along with several key programs, including the CHARMM force field (Roterman et al. 1989), MODELER (Sali and Blundell 1993), Profiles-3D (Lüthy et al. 1992) and Ludi (Böhm 1992).

Availability: http://accelrys.com/products/insight/


Fig. 3.5 An example for GENSCAN output is shown

3.9 Genscan

The GENSCAN server provides access to the program Genscan for predicting the locations and exon–intron structures of genes in genomic sequences from a variety of organisms (Burge and Karlin 1997). This server can accept sequences up to one million base pairs (1 Mbp) in length. A GENSCAN output is illustrated in Fig. 3.5.

Availability: http://genes.mit.edu/GENSCAN.html

3.10 GROMOS

GROMOS is a package for molecular dynamics simulation of macromolecules (Scott and van Gunsteren 1995). The starting point for a simulation is the set of macromolecular coordinates available from the Protein Data Bank (PDB). Newton's equations of motion are integrated using the leap-frog algorithm, and periodic boundary conditions are applied.

Availability: http://www.igc.ethz.ch/GROMOS/index
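The leap-frog scheme and the periodic boundary condition can be illustrated with a minimal one-dimensional sketch. It is not GROMOS code: the function name, its arguments and the single-particle setting are assumptions made for illustration. Velocities are stored at half time steps and positions at full time steps, so the two "leap" over each other, and a position that leaves the box is wrapped back in at the opposite side.

```python
def leapfrog(x, v_half, force, mass, dt, n_steps, box=None):
    """Advance one particle in 1D with the leap-frog scheme.

    x       position at time t
    v_half  velocity at time t - dt/2
    force   callable returning the force at a given position
    box     optional box length for a periodic boundary (assumed 1D box)

    Update rule per step:
        v(t + dt/2) = v(t - dt/2) + F(x(t)) / m * dt
        x(t + dt)   = x(t) + v(t + dt/2) * dt
    """
    for _ in range(n_steps):
        v_half += force(x) / mass * dt
        x += v_half * dt
        if box is not None:
            x %= box  # wrap the position back into [0, box)
    return x, v_half


if __name__ == "__main__":
    # Harmonic oscillator with unit mass and spring constant, x(0) = 1,
    # v(0) = 0. The starting half-step velocity v(-dt/2) is 0.5 * dt.
    dt = 0.01
    x, v = leapfrog(1.0, 0.5 * dt, lambda pos: -pos, 1.0, dt, 628)
    print(x)  # close to 1 after roughly one full period
```

In a real simulation the same update is applied to every atom in three dimensions, with forces derived from the force field and interactions evaluated between nearest periodic images.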

3.11 HBPLUS

The number of intermolecular hydrogen bonds (Fig. 3.6) between subunits is calculated using HBPLUS, in which hydrogen bonds are defined according to standard geometric criteria (McDonald and Thornton 1994). A hydrogen bond is a polar interaction between two electronegative atoms, one acting as the donor and the other as the acceptor.

Availability: http://www.biochem.ucl.ac.uk/bsm/hbplus/home.html
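A geometric hydrogen bond test of this kind can be sketched as follows. This is not HBPLUS code, and the chapter does not state the criteria: the function names are hypothetical, and the thresholds (donor–acceptor distance at most 3.9 Å, hydrogen–acceptor distance at most 2.5 Å, donor–hydrogen–acceptor angle at least 90°) are assumptions based on commonly quoted defaults that should be checked against the HBPLUS documentation. Additional angle checks at the acceptor are omitted here.

```python
import math


def angle_deg(a, b, c):
    """Angle in degrees at vertex b formed by the points a-b-c."""
    ba = [a[i] - b[i] for i in range(3)]
    bc = [c[i] - b[i] for i in range(3)]
    dot = sum(p * q for p, q in zip(ba, bc))
    cos_angle = dot / (math.dist(a, b) * math.dist(c, b))
    # Clamp to [-1, 1] to guard against rounding error before acos.
    return math.degrees(math.acos(max(-1.0, min(1.0, cos_angle))))


def is_hydrogen_bond(donor, hydrogen, acceptor,
                     max_da=3.9, max_ha=2.5, min_dha=90.0):
    """Apply distance and angle cut-offs (assumed values, in Å and degrees)."""
    return (math.dist(donor, acceptor) <= max_da
            and math.dist(hydrogen, acceptor) <= max_ha
            and angle_deg(donor, hydrogen, acceptor) >= min_dha)
```

Counting intermolecular hydrogen bonds then amounts to applying such a test to every donor–acceptor pair in which the two atoms belong to different subunits.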

Fig. 3.6 A schematic representation of a hydrogen bond is illustrated

3.12 Lalign/Plalign

LALIGN/PLALIGN finds internal duplications by calculating nonintersecting local alignments of protein or DNA sequences. LALIGN shows the alignments and similarity scores, while PLALIGN presents a “dot-plot”-like graph.

Availability: http://fasta.bioch.virginia.edu/fasta_www2/fasta_www.cgi?rm=lalign

3.13 Ligplot

LIGPLOT is a program for automatically plotting protein–ligand interactions (Wallace et al. 1995). The interactions shown are those mediated by hydrogen bonds and by hydrophobic contacts. Hydrogen bonds are indicated by dashed lines between the atoms involved, while hydrophobic contacts are represented by an arc with spokes radiating toward the ligand atoms they contact. The contacted atoms are shown with spokes radiating back. An example is illustrated in Fig. 3.7.

Availability: http://www.biochem.ucl.ac.uk/bsm/ligplot/ligplot.html

Fig. 3.7 An example for inhibitor–enzyme interaction is shown using LIGPLOT

3.14 LOOK

LOOK (Molecular Application Group, Palo Alto, CA) is molecular visualization and modeling software that includes the program CARA, which incorporates self-consistent ensemble optimization (SCEO). Models are developed using an automated procedure that predicts the coordinates of side chains upon a fixed backbone template framework (Lee 1994). The use of LOOK to model HLA-peptide binding interaction is shown in Fig. 3.8 as an example.

Availability: http://www.bioinfo.mbi.ucla.edu/

3.15 Modeller

Modeller is a comparative protein modeling program used for modeling protein structures and macromolecular assemblies and for functional annotation using structures (Sali and Blundell 1993). It uses computation grounded in the laws of physics and evolution to study the structure and function of proteins.

Availability: http://www.salilab.org/modeller/

Fig. 3.8 Binding of HLA A*0201 with mHag peptides HA-1H and HA-1R is modeled and shown using the LOOK interface

3.16 NACCESS

NACCESS is a tool for calculating solvent accessible surface area (ASA) in Å² units, implemented by Simon Hubbard (The University of Manchester, UK). It uses a probe radius of 1.4 Å and an algorithm described by Lee and Richards (1971). It is useful for estimating the interface area between protein subunits as the change in ASA (ΔASA) on going from the monomeric state to the dimer complex state.

Availability: http://www.bioinf.manchester.ac.uk/naccess/
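The ΔASA bookkeeping can be written down as a short sketch. The function names are hypothetical, the ASA values are assumed to come from three separate NACCESS runs (each isolated subunit and the complex), and reporting the interface area per subunit as half of ΔASA is a common convention that is assumed here rather than stated in this chapter.

```python
def delta_asa(asa_subunit_a, asa_subunit_b, asa_complex):
    """Surface area (Å^2) buried when two isolated subunits form a complex."""
    return asa_subunit_a + asa_subunit_b - asa_complex


def interface_area_per_subunit(asa_subunit_a, asa_subunit_b, asa_complex):
    """Half of the total buried area (assumed convention)."""
    return delta_asa(asa_subunit_a, asa_subunit_b, asa_complex) / 2.0
```

For example, with made-up values of 5000 Å² and 6000 Å² for the isolated subunits and 9400 Å² for the complex, ΔASA is 1600 Å² and the interface area per subunit is 800 Å².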

3.17 Phylip

PHYLIP is a free package of programs for inferring phylogenies (Felsenstein 1989). It is distributed as source code, documentation files and a number of different types of executables.

Availability: http://www.evolution.genetics.washington.edu/phylip.html


3.18 ProtParam

ProtParam is a tool that computes various physical and chemical parameters for a given protein stored in Swiss-Prot or TrEMBL, or for a user-entered sequence (Gasteiger et al. 2005).

Availability: http://www.expasy.ch/tools/protparam.html

3.19 Protorp

PROTORP is a tool to analyze the properties of interfaces in the 3D structures of protein–protein associations (Argos 1988).

Availability: http://www.bioinformatics.sussex.ac.uk/protorp/

3.20 Psap

PSAP is a web-based suite for protein structure analysis.

Availability: http://www.iris.physics.iisc.ernet.in/cgi-bin/psap/inputform.pl

3.21 Ppsearch

PPSearch searches a query protein sequence for protein motifs by rapidly comparing it against all patterns stored in the PROSITE pattern database. It is of immense use in determining the function of an uncharacterized protein sequence. This tool requires a protein sequence as input.

Availability: http://www.ebi.ac.uk/Tools/ppsearch/index.html
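Pattern scanning of this kind can be illustrated by translating a PROSITE-style pattern into a regular expression. This sketch does not reproduce how PPSearch works internally; the function names are hypothetical, and only the basic syntax is handled (x for any residue, [..] for allowed residues, {..} for excluded residues, (n) or (n,m) for repeats, and < or > as sequence-end anchors outside brackets). The N-glycosylation site pattern N-{P}-[ST]-{P} is used as the example.

```python
import re


def prosite_to_regex(pattern):
    """Translate a basic PROSITE pattern into Python regex syntax."""
    regex = pattern.rstrip(".").replace("-", "")
    regex = regex.replace("{", "[^").replace("}", "]")  # excluded residues
    regex = regex.replace("(", "{").replace(")", "}")   # repeat counts
    regex = regex.replace("x", ".")                     # any residue
    return regex.replace("<", "^").replace(">", "$")    # end anchors


def find_motifs(sequence, pattern):
    """Return (1-based start, matched text) for non-overlapping matches."""
    regex = re.compile(prosite_to_regex(pattern))
    return [(m.start() + 1, m.group()) for m in regex.finditer(sequence)]
```

Scanning the made-up sequence MKNGSAANPTL with N-{P}-[ST]-{P} reports a single site, NGSA at position 3; the second asparagine is rejected because it is followed by proline. Because overlapping matches are not reported by this sketch, a dedicated scanner may list more sites.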

3.22 Pymol

PYMOL is a user-sponsored molecular visualization system on an open source foundation.

Availability: http://www.pymol.org/

3.23 Rasmol

RASMOL is freeware for molecular visualization (Sayle and Milner-White 1995). It is easy enough for students, yet powerful enough for researchers.

Availability: http://www.umass.edu/microbio/rasmol/

3.24 Rosetta Design

ROSETTA Design identifies low energy sequences for specified protein backbones, and has been used previously to stabilize proteins and create new protein structures (Kuhlman et al. 2003).

Availability: http://www.rosettadesign.med.unc.edu/

3.25 Surfnet

The program SURFNET is used to estimate the gap volume (in Å³) between subunits in a complex (Laskowski 1995). Subsequently, the gap index (Å), defined as the ratio of the gap volume (Å³) to the interface area (Å²) per subunit complex, is calculated.

Availability: http://www.biochem.ucl.ac.uk/~roman/surfnet/surfnet.html
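The gap index is a simple ratio, shown here as a one-function sketch. The function name is hypothetical, and the gap volume and interface area are assumed to have been obtained beforehand (for example from SURFNET and from an ASA calculation, respectively).

```python
def gap_index(gap_volume, interface_area):
    """Gap index in Å: gap volume (Å^3) divided by interface area (Å^2)."""
    if interface_area <= 0:
        raise ValueError("interface area must be positive")
    return gap_volume / interface_area
```

With made-up values of 4000 Å³ for the gap volume and 1600 Å² for the interface area, the gap index is 2.5 Å; for a given interface area, a smaller value indicates less empty space between the subunits.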

3.26 Sybyl

SYBYL is a commercial software package (Tripos Associates Inc.) for molecular modeling, docking and simulations. Molecular mechanics calculations are carried out using the TRIPOS force field (Clark et al. 1989). The energy function used in the force field is defined as the sum of six contributions, namely bond stretching, angle bending, torsion, van der Waals, electrostatic and planarity (for aromatic conjugated systems).

Availability: http://www.tripos.com/

3.27 T-Epitope Designer

T-EPITOPE DESIGNER is a server that facilitates HLA-peptide binding prediction (Kangueane and Sakharkar 2005). The prediction is based on a model that defines peptide binding pockets using information gleaned from X-ray crystal structures of HLA-peptide complexes, followed by the estimation of peptide binding to the binding pockets.

Availability: http://www.bioinformation.net/ted/

3.28 Exercises

1. Illustrate an H bond using a neat labeled diagram.
2. Show the use of HBPLUS and calculate intermolecular H-bonds using a PDB entry.
3. How many peptides of length nine can be constructed from standard amino acid residues?
4. How many atoms are used to represent an amino acid residue backbone neglecting hydrogen atoms?
5. What is the dependence on distance (r) for charge–charge interactions?
6. Name the scientists who developed BLAST and FASTA.
7. What are the uses of ClustalW and ClustalX?
8. What was developed by Needleman and Wunsch and in which year?
9. How many tripeptides can be constructed from natural amino acids?
10. Name any FOUR (4) unique computer operating systems.
11. Name any FOUR (4) programs for sequence alignment.
12. What are the values assigned to match and mismatch in global and local sequence alignments?
13. What happens to the structures of two protein sequences having more than 20–30% identical residues aligned? What about their structural folds?
14. Give any FOUR (4) different flavors of BLAST.
15. Expand E value in BLAST analysis. What is the grey zone in terms of E-values in interpreting BLAST alignment? What happens to p value when E value is small?
16. What are the uses of MODELLER, GROMOS96, DOCK and PHYLIP?
17. State the phenomenon where the majority of HLA alleles can be covered such that different members bind similar peptides, yet exhibit distinct repertoires.
18. What is prevalent in HLA genes across the global population?
19. Name the computational technique employed in Needleman and Wunsch.
20. Who developed local alignment?
21. State any FOUR (4) force fields (FF) in molecular mechanics calculations.
22. Name any FOUR (4) protein structure prediction methods.
23. Name any FOUR (4) structure visualization software packages.
24. What ranges of % sequence homology are required for high, low and no reliability in homology modeling?
25. How is sequence homology related to sequence similarity?

References

Altschul SF, Gish W, Miller W, et al. Basic local alignment search tool. J Mol Biol. 1990;215:403–410.
Argos P. An investigation of protein subunit and domain interfaces. Protein Eng. 1988;2:101–113.
Böhm HJ. The computer program LUDI: a new method for the de novo design of enzyme inhibitors. J Comput Aided Mol Des. 1992;6:61–78.
Burge C, Karlin S. Prediction of complete gene structures in human genomic DNA. J Mol Biol. 1997;268:78–94.
Clark M, Cramer RD, Van Opdenbosch N. Validation of the general purpose tripos 5.2 force field. J Comput Chem. 1989;10:982–1012.
Felsenstein J. PHYLIP – Phylogeny Inference Package (Version 3.2). Cladistics. 1989;5:164–166.
Gasteiger E, Hoogland C, Gattiker A, et al. Protein Identification and Analysis Tools on the ExPASy Server. In: Walker JM, editor. The Proteomics Protocols Handbook. USA: Humana Press; 2005.
Kangueane P, Sakharkar MK. T-Epitope Designer: A HLA-peptide binding prediction server. Bioinformation. 2005;1:21–24.
Kuhlman B, Dantas G, Ireton GC, et al. Design of a novel globular protein fold with atomic-level accuracy. Science. 2003;302:1364–1368.
Laskowski RA. SURFNET: a program for visualizing molecular surfaces, cavities, and intermolecular interactions. J Mol Graph. 1995;13:323–330.
Lee C. Predicting protein mutant energetics by self-consistent ensemble optimization. J Mol Biol. 1994;236:918–939.
Lee B, Richards FM. The interpretation of protein structures: estimation of static accessibility. J Mol Biol. 1971;55:379–400.
Lüthy R, Bowie JU, Eisenberg D. Assessment of protein models with three-dimensional profiles. Nature. 1992;356:83–85.
McDonald IK, Thornton JM. Satisfying Hydrogen Bonding Potential in Proteins. J Mol Biol. 1994;238:777–793.
Needleman SB, Wunsch CD. A general method applicable to the search for similarities in the amino acid sequence of two proteins. J Mol Biol. 1970;48:443–453.
Parker KC, Bednarek MA, et al. Scheme for ranking potential HLA-A2 binding peptides based on independent binding of individual peptide side-chains. J Immunol. 1994;152:163.
Pearson WR. Rapid and sensitive sequence comparison with FASTP and FASTA. Methods Enzymol. 1990;183:63–98.
Roterman IK, Lambert MH, Gibson KD, et al. A comparison of the CHARMM, AMBER and ECEPP potentials for peptides. II. Phi-psi maps for N-acetyl alanine N’-methyl amide: comparisons, contrasts and simple experimental tests. J Biomol Struct Dyn. 1989;7:421–453.
Sali A, Blundell TL. Comparative protein modelling by satisfaction of spatial restraints. J Mol Biol. 1993;234:779–815.
Sayle RA, Milner-White EJ. RASMOL: biomolecular graphics for all. Trends Biochem Sci. 1995;20:374.
Schwede T, Kopp J, Guex N, Peitsch MC. SWISS-MODEL: An automated protein homology-modeling server. Nucleic Acids Res. 2003;31:3381–3385.
Scott WRP, van Gunsteren WF. The GROMOS software package for biomolecular simulations. In: Clementi E, Corongiu G, editors. Methods and techniques in computational chemistry: METECC-95. Cagliari, Italy: STEF; 1995.
Smith TF, Waterman MS. Identification of common molecular subsequences. J Mol Biol. 1981;147:195–197.
Thompson JD, Higgins DG, Gibson TJ. CLUSTAL W: improving the sensitivity of progressive multiple sequence alignment through sequence weighting, position-specific gap penalties and weight matrix choice. Nucleic Acids Res. 1994;22:4673–4680.
Wallace AC, Laskowski RA, Thornton JM. LIGPLOT: a program to generate schematic diagrams of protein-ligand interactions. Protein Eng. 1995;8:127–134.